Article

Joint Object Detection and Multi-Object Tracking Based on Hypergraph Matching

1 Department of Electronic Engineering, Tsinghua University, Beijing 100084, China
2 Beijing National Research Center for Information Science and Technology, Beijing 100084, China
3 School of Vehicle and Mobility, Tsinghua University, Beijing 100084, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(23), 11098; https://doi.org/10.3390/app142311098
Submission received: 22 August 2024 / Revised: 8 October 2024 / Accepted: 6 November 2024 / Published: 28 November 2024

Abstract

Online multi-object tracking in complex scenarios is hampered by the independence of the feature extraction, object detection, and data association modules, which leads to error accumulation and makes it difficult to maintain visual consistency for occluded objects. To address these challenges, we propose an end-to-end multi-object tracking method based on hypergraph matching (JDTHM). Initially, a feature extraction and object detection module is introduced to achieve preliminary localization and description of the objects. Subsequently, a deep feature aggregation module is designed to extract temporal information from historical tracklets, amalgamating features from object detection and feature extraction to enhance the consistency between the current frame features and the tracklet features, thus preventing identity swaps and tracklet breaks caused by object detection loss or distortion. Finally, a data association module based on hypergraph matching is constructed and integrated with object detection and feature extraction into a unified network, transforming the data association problem into a hypergraph matching problem between the tracklet hypergraph and the detection hypergraph and thereby achieving end-to-end model optimization. Experimental results on three multi-object tracking datasets demonstrate favorable qualitative and quantitative performance, validating the method's effectiveness in enhancing the robustness and accuracy of multi-object tracking.

1. Introduction

Multiple Object Tracking (MOT) is an important task in computer vision, aiming to stably track multiple objects of interest in a video sequence and generate their motion tracklets. It has been widely applied in autonomous driving, action recognition, mobile robotics, and intelligent surveillance. Depending on whether the generation of motion tracklets uses information from subsequent frames, MOT tasks can be divided into two modes: offline tracking and online tracking. Offline tracking uses all the detection results in a video segment to jointly infer the motion tracklets of objects in each frame, which can be seen as a global optimization problem, whereas online tracking generates new tracklets sequentially from the current detection results and the historical tracklets. Because it can process input video sequences in real time, online MOT is widely used in real-world scenarios.
Early online MOT algorithms focused on complex association optimization methods, which had limited accuracy. With the rapid development of deep learning, online MOT algorithms based on deep learning have gradually become mainstream. At present, deep learning-based MOT methods are primarily divided into Detection-based-Embedding (DBE) [1,2,3,4,5,6,7,8] and Joint Detection and Embedding (JDE) [9,10,11,12,13,14].
DBE separates tracking into two independent tasks: object detection and data association. First, the detector outputs the detection results, which are then sent to the data association module to be matched with the tracklets of the previous frame, thereby extending the tracklets with new boxes. With its straightforward structure and excellent tracking accuracy, DBE once became the mainstream paradigm for MOT. However, because the object detection module and the data association module are trained separately to achieve their own best performance, errors cannot be propagated back through the entire MOT system; each module is optimized individually rather than for the MOT task as a whole. Moreover, when a frame contains a large number of objects, the lack of feature sharing between the two modules means a Re-ID model must be run for each bounding box (Bbox) in the frame, so real-time inference speed cannot be achieved.
JDE integrates the appearance embedding model used for data association into the detection module, so it can output detection results and corresponding appearance embeddings simultaneously. JDE uses a shared-weight, deep learning network to complete the tasks of object detection and data association, achieving good real-time performance.
Figure 1a depicts the input data for an MOT task: a sequence of object tracking images in which each frame contains multiple targets represented by rectangular boxes of different colors. Figure 1b is a schematic representation of the target tracklets abstracted from Figure 1a, where the circular dots of different colors correspond to the target rectangular boxes in Figure 1a, triangles represent similar targets, the gray square indicates that two targets intersect at this moment and cause visual occlusion, the red curve represents the correct tracklet of the target, and the blue curve represents an incorrect tracklet. As the figure shows, when similar targets are nearby or severe occlusions occur, errors accumulate easily and visual consistency is difficult to maintain. Attending to the higher-order interactions between multiple targets, as indicated by the purple part of the figure, can effectively mitigate the impact of these situations. However, as shown in Figure 1, because the object detection task and the Re-ID task require different features, the two tasks compete within the JDE paradigm: for object detection, the model aims to increase the similarity of appearance features of objects belonging to the same category, i.e., to narrow the intra-class distance, whereas for Re-ID, the model aims to maximize the differences in appearance features between different objects, i.e., to expand the inter-class distance. In addition, when the tracked objects are occluded by adjacent or cluttered objects, especially under severe occlusion, the reliability of the detected features decreases, making it difficult to maintain the visual consistency of the occluded objects. Moreover, because most association modules in the above two paradigms use the Hungarian matching algorithm for data association, the errors generated in the data association module cannot be propagated back to the object detection module and the feature embedding module, preventing end-to-end optimization.
In response to the above issues, we propose an MOT method (JDTHM) based on hypergraph matching association, which enhances object detection and feature extraction through the tracklet information formed by the data association module in an end-to-end tracking model to form positive feedback. Specifically, the end-to-end tracking model uses tracklet information to enhance the visual features of object detection, thereby reducing false positives and false negatives. At the same time, the enhanced visual features assist the feature extraction module in generating discriminative features, thereby reducing the impact of occlusion. Based on more robust detection results and aggregated features, the data association module can generate higher quality tracklets. The main contributions are as follows:
  • We introduce a feature extraction and object detection module to achieve preliminary localization and description of the detected objects.
  • We design a deep feature aggregation module that extracts temporal information from historical tracklets, aggregates features from object detection and feature extraction, enhances the consistency between the current frame features and the tracklet features, and avoids identity swaps and tracklet breaks caused by object detection loss or distortion.
  • We construct a data association module based on the hypergraph neural network, integrating with object detection and feature extraction into a unified network. This integration transforms the data association problem into a hypergraph matching problem between the tracklet hypergraph and the detection hypergraph, thereby achieving end-to-end model optimization.
  • The hypergraph framework demonstrates highly competitive performance on three MOT datasets, confirming its effectiveness in enhancing the accuracy of multi-object tracking.
The remainder of this paper is organized as follows: Section 2 reviews the related work; Section 3 introduces the relevant theories of hypergraph matching; Section 4 provides a detailed introduction to the proposed hypergraph architecture; Section 5 presents the experimental results and discussion; Section 6 concludes the paper.

2. Related Works

2.1. Detection-Based-Embedding

Online MOT algorithms based on deep learning are divided into two frameworks: detection-based tracking and joint detection tracking, with the difference being the presence or absence of a tracking module integrated with the object detection network. MOT algorithms based on deep object detectors [15,16,17] incorporate high-performance deep learning object detectors in the detection part of the MOT process, such as Region-based Convolutional Neural Networks (R-CNNs) [18,19], Single Shot Multibox Detector (SSD) [20], and You Only Look Once (YOLO) [21,22], to enhance the performance of online MOT algorithms. The typical Simple Online and Real-time Tracking (SORT) algorithm [15] is based on the traditional Hungarian association algorithm and uses the Faster R-CNN object detection network to replace the original Aggregate Channel Feature (ACF), achieving a significant improvement in tracking accuracy and speed. Subsequent research [6] has also shown a high positive correlation between detection accuracy and online MOT performance.
MOT algorithms based on deep object detectors generally take good object detection results as the pre-input and make improvements from the perspectives of appearance feature extraction, object feature prediction, and data association.
Compared to some online trackers [15,23,24] that only use simple motion modeling, appearance features allow objects that are occluded for a long time to be tracked more robustly. For example, DeepSORT [25] uses a ResNet network [26] pre-trained on a large-scale dataset to extract the appearance features of the object and then incorporates the similarity of appearance features into the association cost, reducing object identity switching during tracking. To better model temporal information, some algorithms, independent of the detector, encode the object's geometric features (position, length, width, etc.) and appearance feature vectors over historical frames and predict the change of the feature vectors through a temporal model for data association. For example, the Exponential Moving Average (EMA) in [10] aggregates the tracklet features to prevent occlusion and outlier feature points from affecting the match. Dai et al. [27] concatenate the average appearance features with spatiotemporal features and then use Graph Convolutional Networks (GCNs) [28] to aggregate them into long-term appearance features. In addition, for matching-based data association, a cost matrix is constructed from the similarity between tracklet features and detection features, and the allocation matrix is then solved, with the allocation matrix supervising the learning of motion and appearance features. For instance, DeepMOT [29] proposes a deep Hungarian network, using a Bidirectional Recurrent Neural Network (BRNN) [30] to convey global information in the cost matrix; its loss function is designed from two differentiable evaluation indicators, making the allocation matrix output by the network more conducive to improving those indicators. Shan et al. [31] further integrate the object's positions in previous frames with appearance features to obtain better global spatiotemporal information, alleviating the problem of object feature variation over time.
Each module of the detection-based tracking algorithm can be optimized separately, and with the improvement of the performance of each module the algorithm can achieve a higher tracking accuracy [32,33,34]. However, the coupling between modules is not high, and the independent training method of multiple modules may cause the model parameters to converge to the optimum of each module, rather than the global optimum of MOT. Moreover, some models improve the tracking effect by stacking multiple modules and using more deep learning techniques, which makes the DBE framework increasingly complex, and the tracking real-time performance does not meet online requirements, making it difficult to deploy in practice.

2.2. Joint Detection and Embedding

In recent years, Joint Detection and Embedding (JDE) frameworks have reduced the complexity of DBE frameworks while also improving the accuracy of MOT [35,36,37]. Their strategy is to integrate some functional modules on the basis of DBE, reducing the algorithmic complexity brought by phased processing, increasing the coupling between functional modules, and alleviating the problem of local optimization caused by manually connecting each module.
In this framework, a head network is added so that a single integrated network infers the whole-image information, extracts the pixel-level features of the object on the feature map as the object's appearance feature vector, and outputs the object's detection box at the same time. JDE uses the YOLOv3 framework [38], adding a Re-ID branch parallel to the detection branch and extracting the feature vector at the center point of the positive anchor box on the branch's output feature map as the object's appearance feature vector. However, as noted in FairMOT [11], the predicted anchor center may not fall within the object area, so the anchor box mechanism is not suitable for training appearance features; FairMOT instead builds the tracking algorithm on CenterNet [39], using Deep Layer Aggregation (DLA) [40] in the backbone and extracting feature vectors at the object center point to learn appearance features. One of the main challenges of jointly learning detection and tracking comes from the conflict between these two tasks. The detection task aims to separate object categories, such as pedestrians and vehicles, from the background and therefore seeks to increase the similarity of appearance features within the same category, that is, to narrow the intra-class gap. The Re-ID embedding, by contrast, aims to distinguish different objects rather than categories and therefore seeks to maximize the appearance feature differences between different objects, that is, to expand the inter-class gap. Therefore, some methods decouple the feature maps of the different tasks. For instance, Yang et al. [41] use a separate Re-ID head to learn appearance features in addition to the original detection branch. CSTrackv2 [12] learns the correlation and difference information between detection and re-identification through a channel mutual attention mechanism and then uses a multi-scale attention module to extract appearance features. SimpleTrack [42] is based on the FairMOT model and uses a feature fusion branch separate from detection to extract appearance features. These module designs increase the modeling capacity of the head network over the feature maps, enrich the differences between the respective task heads and the original feature maps, and improve the detection and recognition performance of the algorithm. Similarly, appearance features extracted from a single frame cannot capture temporal information, so some researchers have adopted multi-frame feature extraction. For example, GSDT [43] builds on the FairMOT algorithm, using a graph convolutional model to model the local correlation between tracklet features and the current feature map, so that the extracted appearance features contain historical information and adapt better to object occlusion and deformation. However, since most association modules use the Hungarian matching algorithm for data association, the errors generated in the data association module cannot be propagated back to the object detection module and the feature embedding module, and end-to-end optimization cannot be achieved.
The feature prediction and detection phases of JDE are carried out simultaneously, by adding inter-frame information to the detection network or by multi-frame detection. In order to learn the appearance and motion features of the tracklets across multiple frames, model the temporal characteristics, and predict the tracklet direction, some researchers use a multi-frame network or a head aggregation network in the joint detection and tracking form to learn spatiotemporal feature maps for tracking. For example, CTrack [44] follows the CenterNet anchor-free detection framework, concatenating a pair of sequential frames and the previous frame's heatmap at the input end and predicting the object center position, size, offset, and inter-frame object displacement at the output end, which enables two-dimensional and three-dimensional MOT at the same time. TransCenter [45] also embeds the Transformer into MOT, processing dense pixel-level, multi-scale detection and tracking queries in two query learning networks based on a deformable Transformer encoder and decoder to obtain detections and predicted displacements. JDE's data association and object detection are carried out simultaneously, using a single neural network to process the input image and directly infer the tracklet of the object. The object location can be refined through the correlation between the tracklet features and the current feature map. Some researchers have integrated single-object tracking algorithms into multi-object algorithms [46], using the powerful feature extraction and localization capabilities of single-object algorithms to achieve higher-precision end-to-end tracking. SOTMOT [47] is built on a variant of DLA-34, solving the ridge regression coefficients of the previous frame for each object and using the learned coefficients to predict the labels of the next frame, so discriminative features can be learned in the local area. The disadvantage is that as the number of objects in the scene increases, assigning a single-object tracker to each object causes efficiency problems. Recently, with the progress of the Transformer [48], some researchers have also used the query vector mechanism of the Transformer model for MOT, learning tracklet features with query vectors for frame-by-frame localization and object identification and implicitly completing data association, thereby simplifying the complex, multi-step architecture. Meinhardt et al. [49] were the first to introduce the Transformer into MOT. They iterate the DETR (end-to-end object detection with transformers) detector [50] along the time direction. The first frame uses randomly initialized empty queries to detect objects in the image and output the corresponding object queries. In subsequent frames, the object queries from the previous frame and randomly initialized empty queries are input together into the decoder, where the object queries are responsible for detecting the position of the tracklet in the current frame and passing on the identity, implicitly completing data association, while the empty queries are responsible for detecting newly appearing objects.
The joint detection and tracking algorithms integrate different modules with the detection network for joint optimization, expecting the modules of the algorithm to work more collaboratively and simplifying the complex multi-step form, thereby improving the speed of MOT, which has received widespread attention in recent years.

3. Preliminaries of Hypergraph Matching

We introduce the preliminary concept of the association graph, on which our method is based. For clarity, we first summarize important notations and definitions in Table 1.
Graph Matching (GM) is a fundamental, NP-complete problem. It aims to match two graphs $G_1 = (V_1, \varepsilon_1)$ and $G_2 = (V_2, \varepsilon_2)$ by establishing node correspondences between them based on node-to-node and edge-to-edge affinities. It can be formulated as a Quadratic Assignment Programming (QAP) problem [51]: assuming that a perfect match corresponds to the highest affinity score, the objective $J(X)$ is maximized subject to certain constraints.
$$J(X) = \mathrm{vec}(X)^{T} K \, \mathrm{vec}(X), \quad \text{s.t.} \quad X \in [0,1]^{n_1 \times n_2}, \ X \mathbf{1}_{n_2} = \mathbf{1}_{n_1}, \ X^{T} \mathbf{1}_{n_1} \le \mathbf{1}_{n_2},$$
where $X$ is a doubly stochastic matrix whose rows each sum to 1 and whose column sums are at most 1, and $\mathbf{1}_{n_1}$ and $\mathbf{1}_{n_2}$ are all-ones column vectors of length $n_1$ and $n_2$, where $n_1$ and $n_2$ denote the numbers of vertices in $G_1$ and $G_2$.
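To make the objective concrete, below is a minimal sketch of evaluating $J(X)$ for a candidate assignment, assuming $K$ and $X$ are given as NumPy arrays (the function and variable names are illustrative and not part of the paper):

```python
import numpy as np

def qap_objective(K: np.ndarray, X: np.ndarray) -> float:
    """Evaluate J(X) = vec(X)^T K vec(X) for a candidate assignment X.

    K has shape (n1 * n2, n1 * n2); X has shape (n1, n2).
    Column-major vectorization is used so that entry (i, a) of X maps to
    position i + a * n1 of vec(X), matching the K_{ia,jb} index convention.
    """
    x = X.flatten(order="F")      # vec(X), column-major
    return float(x @ K @ x)

# Toy example: two graphs with two nodes each and identity affinities.
K = np.eye(4)
X = np.array([[1.0, 0.0],
              [0.0, 1.0]])        # candidate assignment
print(qap_objective(K, X))        # -> 2.0
```

In practice, a graph matching solver searches over the feasible assignment matrices rather than scoring a single candidate as above.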
Initially, to measure the similarity between nodes in the two graphs, an affinity matrix $K \in \mathbb{R}^{n_1 n_2 \times n_1 n_2}$ is constructed. The diagonal elements, $K_{ia,ia} = s_v(V_i^1, V_a^2)$, represent the first-order node similarity between node $V_i^1$ and node $V_a^2$. The off-diagonal elements, $K_{ia,jb} = s_e(\varepsilon_{ij}^1, \varepsilon_{ab}^2)$, indicate the second-order edge similarity between edges $\varepsilon_{ij}^1$ and $\varepsilon_{ab}^2$.
Subsequently, as shown in Figure 2, the association graph $G^A = (V^A, \varepsilon^A)$ between the two graphs is constructed, transforming the graph matching problem into a node classification problem on the association graph, which simplifies the problem's structure and allows the application of graph embedding and machine learning techniques. The nodes of the association graph, $V_{ia}^A$, represent potential matching relationships between the nodes $V_i^1$ and $V_a^2$, where the superscript denotes the graph index and the subscript denotes the vertex index. Specifically, if $V_i^1$ is a node in $G_1$ and $V_a^2$ is a node in $G_2$, then vertex $V_{ia}^A$ in the association graph represents the match between $V_i^1$ and $V_a^2$. The edges of the association graph, $\varepsilon_{ia,jb}^A$, represent the consistency between matching pairs: if two matching pairs, $(V_i^1, V_a^2)$ and $(V_j^1, V_b^2)$, can coexist in an optimal match, then there is an edge between $V_{ia}^A$ and $V_{jb}^A$. The weighted adjacency matrix $W$ of the association graph comes from the off-diagonal elements of $K$. To better exploit the first-order similarities, the diagonal elements $K_{ia,ia}$ are further assigned as node attributes $V^A$.
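As a minimal sketch of this construction (assumed names, dense matrices for clarity), the association graph's weighted adjacency and node attributes can be read off directly from the affinity matrix:

```python
import numpy as np

def build_association_graph(K: np.ndarray):
    """Derive the association graph from the affinity matrix K.

    The off-diagonal entries of K become the weighted adjacency matrix W
    of the association graph, while the diagonal entries (first-order node
    affinities) become the initial node attributes.
    """
    W = K.copy()
    np.fill_diagonal(W, 0.0)           # edge weights between candidate matches
    node_attr = np.diag(K).copy()      # one attribute per candidate match (i, a)
    return W, node_attr
```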
In order to perform node classification on the association graph, GCNs are first used to embed the nodes of the association graph, capturing the topological structure information between nodes and generating feature representations for each node. We denote by $v^{(k)} \in \mathbb{R}^{n_1 n_2 \times l_k}$ the $l_k$-dimensional vertex embeddings at layer $k$. The initial embeddings at $k = 0$ are scalar, i.e., $l_0 = 1$, taken from the diagonal of $K$.
The Sinkhorn network outputs a doubly stochastic matrix, transforming the node embeddings generated by the GCNs into representations that satisfy the matching constraints. By introducing matching constraints into the embedding process, the matching-aware embedding not only utilizes the topological structure information of the nodes but also learns node representations that meet the matching conditions more directly, thereby enhancing the performance of the graph matching task.
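A minimal sketch of the Sinkhorn step, which alternates row and column normalization of a positive score matrix; this is a generic illustration rather than the exact network used in the paper:

```python
import torch

def sinkhorn(scores: torch.Tensor, n_iters: int = 20, eps: float = 1e-8) -> torch.Tensor:
    """Push a score matrix toward a doubly stochastic matrix by alternating
    row and column normalization (Sinkhorn iterations)."""
    S = torch.exp(scores)                                 # ensure positivity
    for _ in range(n_iters):
        S = S / (S.sum(dim=1, keepdim=True) + eps)        # rows sum to 1
        S = S / (S.sum(dim=0, keepdim=True) + eps)        # columns sum to 1
    return S
```

For rectangular score matrices ($n_1 \neq n_2$), the column constraint becomes an inequality; a common workaround is to pad dummy rows or columns before normalization.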
The hypergraph is a generalization of the graph. A hypergraph $G = (V, \varepsilon)$ comprises a collection of vertices and a set of hyperedges, where $V$ denotes the vertex set, the number of vertices in $V$ is $N$, and each vertex carries certain semantic information. $\varepsilon$ denotes the set of hyperedges; unlike an edge in a simple graph, which connects exactly two vertices, each hyperedge can connect any number of vertices, and the number of hyperedges in $\varepsilon$ is $M$. The above QAP model can also be generalized to the higher-order case. One line of work adopts a tensor-marginalization-based model for $p$-order ($p \ge 3$) hypergraph matching, resulting in a higher-order assignment problem [51].

4. Methodology

We follow the online Multiple Object Tracking (MOT) methodology and have designed an end-to-end MOT framework (JDTHM), as shown in Figure 3. This framework mainly includes a feature extraction and object detection module, a feature aggregation module that integrates historical tracklet information, and a data association module.
The feature extraction and object detection module preliminarily achieves the localization and description of detection results for objects. The deep feature aggregation module extracts temporal information from historical tracklet data, amalgamating features from object detection with those from feature extraction. This enhances the consistency between the current frame features and the tracklet features, preventing identity swaps and tracklet breaks caused by the loss or distortion of object detection. The data association module, based on hypergraph matching, receives information from object detection and feature extraction. It uses the tracklet information generated by the data association module to enhance the object-detection and feature extraction modules, thereby reducing false positives and false negatives. The aggregated features further assist the feature extraction module in generating discriminative features, thus minimizing variations caused by occlusions. With more robust detection results and features, the data association module can produce higher-quality tracklets.

4.1. Feature Extraction and Object Detection

To encompass a wider range of low-level and high-level features, and to adjust the receptive field for tracking objects of different shapes and scales, we use Deep Layer Aggregation (DLA) [40] as the backbone to extract feature maps from the input image $X^t$, obtaining the basic feature map $F^t$.
We employ the two-stage detector Faster R-CNN [52] as our detection component, merging the public detection results of the current frame with the tracking results from the previous frames to form the candidate proposals $P^t$. The corresponding detection results, $B^t = \{B_1^t, B_2^t, \ldots, B_n^t\}$, are generated by the detector $D(\cdot)$, where $n$ represents the number of detected objects and $B_i^t$ represents the corresponding candidate bboxes.
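For illustration only, a minimal sketch of this detection step using torchvision's off-the-shelf Faster R-CNN; note that it uses a ResNet-50-FPN backbone rather than the DLA backbone described above, the score threshold is an assumed value, and this is not the authors' implementation:

```python
import torch
import torchvision

# Off-the-shelf Faster R-CNN (ResNet-50-FPN backbone, COCO weights) for illustration.
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
detector.eval()

@torch.no_grad()
def detect(frame: torch.Tensor, score_thresh: float = 0.5):
    """Run the detector on one frame (C, H, W, float values in [0, 1]) and
    return candidate bounding boxes with their confidence scores."""
    outputs = detector([frame])[0]          # dict with 'boxes', 'labels', 'scores'
    keep = outputs["scores"] > score_thresh
    return outputs["boxes"][keep], outputs["scores"][keep]
```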

4.2. Deep Feature Aggregation

Due to complex scenarios such as partial occlusion and motion blur in videos, it is difficult to detect occluded objects using only cues from the current frame. Based on the consistency of object motion, historical information contains rich spatiotemporal context that can assist in detecting occluded objects. By leveraging this temporally complementary information, the description of occluded objects is enhanced, thereby reducing identity switches. Therefore, we first construct a tracklet embedding module, which maintains a tracklet buffer to save the image feature information and motion information contained in all historical tracklets. The buffer is divided into different sub-areas according to the number of historical tracklets. The features in each area are represented as $M_j^{t-1} = \{\hat{F}_j^{t-\eta}, \hat{F}_j^{t-\eta+1}, \ldots, \hat{F}_j^{t-1}\}$, where $\hat{F}_j^{t-1}$ and $\eta$ are the single-step embedding and cache length of the $j$-th object, respectively. The tracklet feature is updated as $T_j^t = \gamma(M_j^{t-1}, \hat{F}_j^t)$, where $\gamma$ denotes the feature aggregation operation.
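A minimal sketch of such a per-tracklet buffer, assuming a cache length $\eta$ and an exponential-moving-average style stand-in for the aggregation $\gamma$ (the actual aggregation operator may differ):

```python
from collections import deque
import torch

class TrackletBuffer:
    """Per-tracklet buffer M_j holding the last eta single-step embeddings."""

    def __init__(self, eta: int = 30, momentum: float = 0.9):
        self.buffer = deque(maxlen=eta)   # {F_j^{t-eta}, ..., F_j^{t-1}}
        self.momentum = momentum
        self.feature = None               # aggregated tracklet feature T_j^t

    def update(self, embedding: torch.Tensor) -> torch.Tensor:
        """Aggregate the new single-step embedding into the tracklet feature
        (EMA-style stand-in for the aggregation operation gamma)."""
        self.buffer.append(embedding)
        if self.feature is None:
            self.feature = embedding.clone()
        else:
            self.feature = self.momentum * self.feature + (1.0 - self.momentum) * embedding
        return self.feature
```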
Subsequently, an attention-based historical tracklet feature enhancement module is adopted. When the objects of the current frame need to be associated, a memory feature fusion network integrates all historical inputs of each tracklet, enhancing the features of the regions that are consistent between the historical tracklets and the current frame, to generate high-quality tracklets for subsequent modules.
Setting $\xi(x_i, x_j) = \exp\big(\phi(x_i)^T \varphi(x_j)\big)$, the attention between the visual features of the current frame and the historical tracklet features is defined as follows:
$$A(F_{i,t}, T^{t-1}) = \sum_{k=1}^{m} \alpha_{*,k} \, \frac{\sum_{j} \xi(F_{i,t}, T_{j,k}^{t-1}) \, \theta(T_{j,k}^{t-1})}{\sum_{j} \xi(F_{i,t}, T_{j,k}^{t-1})},$$
where $F_{i,t}$ and $T_{j,k}^{t-1}$ represent individual features from $F^t$ and $T_k^{t-1}$, respectively, and $T_k^{t-1} \in T^{t-1}$ is the feature map of the $k$-th tracklet. $\phi(\cdot)$, $\varphi(\cdot)$, and $\theta(\cdot)$ denote convolution layers. $\alpha_{*,k}$ is an indicator function, taking the value 1 if $\mathrm{IoU}(P_*^t, B_k^{t-1}) > \alpha_{\mathrm{IoU}}$ and 0 otherwise. IoU stands for “Intersection over Union”, which measures the degree of overlap between two regions and is often used to compare the similarity between a predicted bbox or segmentation area and the ground truth. $\mathrm{IoU}(P_*^t, B_k^{t-1})$ represents the geometric similarity between the candidate proposal $P_*^t$ and the final bbox $B_k^{t-1}$ in the tracklet $T_k^{t-1}$, where $\alpha_{\mathrm{IoU}}$ is the geometric threshold. Furthermore, the features enhanced by the historical tracklets are computed as follows:
$$\tilde{F}^t = F^t \oplus A(F^t, T^{t-1}),$$
where $\oplus$ represents the element-wise addition operation for matrices. The output is the fusion of the features generated from the detection results of the current frame with the enhanced features of the historical tracklets:
$$\hat{F}_i^t = \mu(\tilde{F}^t, B_i^t),$$
where $\mu$ is the convolution layer on top of the RoIAlign layer. The features corresponding to $B^t$ can then be collected as $\hat{F}^t = \{\hat{F}_1^t, \hat{F}_2^t, \ldots, \hat{F}_n^t\}$.
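To make the gating and pooling steps concrete, here is a minimal sketch of the IoU indicator $\alpha_{*,k}$ and of an RoIAlign-based pooling step using torchvision's roi_align; the box format, output size, and threshold are assumptions for illustration, and the paper's $\mu$ additionally applies a convolution layer on top of the pooled features:

```python
import torch
from torchvision.ops import roi_align

def iou(box_a, box_b) -> float:
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def alpha_gate(proposal, tracklet_box, alpha_iou: float = 0.5) -> float:
    """Indicator alpha_{*,k}: 1 if the proposal overlaps the tracklet's final
    box by more than the geometric threshold, else 0."""
    return 1.0 if iou(proposal, tracklet_box) > alpha_iou else 0.0

def pool_box_features(enhanced_map: torch.Tensor, boxes: torch.Tensor) -> torch.Tensor:
    """Pool a fixed-size feature for each detection box B_i^t from the
    tracklet-enhanced feature map (shape (1, C, H, W)); boxes are given in
    feature-map coordinates as an (n, 4) tensor of (x1, y1, x2, y2)."""
    return roi_align(enhanced_map, [boxes], output_size=(7, 7))
```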

4.3. Data Association

We transform data association into a hypergraph matching problem between the detection results $B^t = \{B_1^t, B_2^t, \ldots, B_n^t\}$ and the historical tracklets $T^{t-1} = \{T_1^{t-1}, T_2^{t-1}, \ldots, T_m^{t-1}\}$, where $n$ and $m$ represent the number of detection results and historical tracklets, respectively. We first construct a detection hypergraph and a tracklet hypergraph following NHGM [51] to describe the relationships between different objects in the detection results and the historical tracklets. The hypergraph matching process is shown in Figure 4.
The detection hypergraph is defined as $G_D = (V_D, \varepsilon_D)$, where $V_D = \{(B_i^t, \hat{F}_i^t)\}, i \in [1, n]$, is the vertex set, with $\hat{F}_i^t$ representing the features of the bbox $B_i^t$, and $\varepsilon_D$ is the set of hyperedges. Similarly, the tracklet hypergraph is defined as $G_T = (V_T, \varepsilon_T)$, where $V_T = \{(B_j^{t-1}, T_j^{t-1})\}, j \in [1, m]$, is the vertex set, with $T_j^{t-1}$ representing the feature of the final bbox $B_j^{t-1}$ of the $j$-th historical tracklet, and $\varepsilon_T$ is the set of hyperedges.
The connectivity matrices of the detection hypergraph are denoted as $\bar{P}_D$ and $\bar{Q}_D$, and those of the tracklet hypergraph as $\bar{P}_T$ and $\bar{Q}_T$; the corresponding adjacency matrices are $\bar{A}_D = \bar{P}_D \bar{Q}_D^T$ and $\bar{A}_T = \bar{P}_T \bar{Q}_T^T$, respectively. The representation of each hyperedge is constructed by concatenating the vertex features at both ends of the edge:
$$\bar{E}_D = \big[\bar{P}_D^T \hat{F}^t \,\|\, \bar{Q}_D^T \hat{F}^t\big],$$
$$\bar{E}_T = \big[\bar{P}_T^T T^{t-1} \,\|\, \bar{Q}_T^T T^{t-1}\big].$$
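A minimal sketch of assembling such hyperedge representations from the connectivity matrices and vertex features (assumed shapes; concatenation along the feature dimension):

```python
import torch

def hyperedge_features(P: torch.Tensor, Q: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
    """Build edge representations E = [P^T V || Q^T V].

    P, Q: (num_vertices, num_edges) connectivity matrices selecting the
          vertices at the two ends of each edge.
    V:    (num_vertices, feat_dim) vertex features (detection or tracklet).
    Returns a (num_edges, 2 * feat_dim) matrix of concatenated features.
    """
    return torch.cat([P.t() @ V, Q.t() @ V], dim=1)
```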
We enhance the vertex features through cross-graph information transfer between the detection hypergraph and the tracklet hypergraph:
$$\hat{F}_i^{t, l+1} = \psi_D\Big(\hat{F}_i^{t,l} + \sum_{j=1}^{m} \big(\cos(\hat{F}_i^{t,l}, T_j^{t-1,l}) + \mathrm{IoU}(B_i^{t,l}, B_j^{t-1,l})\big)\Big),$$
$$T_j^{t-1, l+1} = \psi_T\Big(T_j^{t-1,l} + \sum_{i=1}^{n} \big(\cos(\hat{F}_i^{t,l}, T_j^{t-1,l}) + \mathrm{IoU}(B_i^{t,l}, B_j^{t-1,l})\big)\Big).$$
The enhanced vertex features are denoted as $\hat{F}_i^{t,L}$ and $T_j^{t-1,L}$, and the corresponding hyperedge features as $\bar{E}_{D, i_1 i_2}^{L}$ and $\bar{E}_{T, j_1 j_2}^{L}$. For hypergraph matching, the above QAP model is generalized to the higher-order case, adopting a tensor-marginalization-based model for $p$-order ($p \ge 3$) hypergraph matching, resulting in a higher-order assignment problem [51]:
$$J(x) = H \otimes_1 x \otimes_2 x \cdots \otimes_p x, \quad \text{s.t.} \quad X \mathbf{1}_{n_2} = \mathbf{1}_{n_1}, \ X^T \mathbf{1}_{n_1} \le \mathbf{1}_{n_2},$$
where $x = \mathrm{vec}(X)$ is the column-vectorized form and $H$ is the $p$-order affinity tensor whose elements record the affinity between two hyperedges, operated on by the tensor product $\otimes_q$, where $\otimes_q$ can be regarded as tensor marginalization along dimension $q$.
To measure the similarity between vertices in the two hypergraphs, an affinity tensor $H$ is constructed. The first-order vertex similarity, the second-order edge similarity, and the third-order angle similarity components of the affinity tensor $H$ can be represented as follows:
$$H^{(1)}_{i,j} = \cos\big(\hat{F}_i^{t,L}, T_j^{t-1,L}\big),$$
$$H^{(2)}_{i_1 i_2, j_1 j_2} = \cos\big(\bar{E}_{D, i_1 i_2}^{L}, \bar{E}_{T, j_1 j_2}^{L}\big),$$
$$H^{(3)}_{\omega_1, \omega_2, \omega_3} = \exp\Big\{-\Big(\sum_{q=1}^{3} \big|\cos \beta_D^{\omega_q} - \cos \beta_T^{\omega_q}\big|\Big) \Big/ \sigma_3\Big\},$$
where $\beta_D^{\omega_q}$ and $\beta_T^{\omega_q}$ represent the corresponding angles in the detection hypergraph and the tracklet hypergraph, respectively.
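A minimal sketch of the first- and second-order affinity terms as cosine similarities between the enhanced vertex and hyperedge features (tensor shapes are assumptions for illustration):

```python
import torch
import torch.nn.functional as F

def first_order_affinity(det_feats: torch.Tensor, trk_feats: torch.Tensor) -> torch.Tensor:
    """H^(1): pairwise cosine similarity between enhanced detection vertex
    features (n, d) and tracklet vertex features (m, d)."""
    return F.normalize(det_feats, dim=1) @ F.normalize(trk_feats, dim=1).t()

def second_order_affinity(det_edges: torch.Tensor, trk_edges: torch.Tensor) -> torch.Tensor:
    """H^(2): pairwise cosine similarity between detection hyperedge features
    (e_d, 2d) and tracklet hyperedge features (e_t, 2d)."""
    return F.normalize(det_edges, dim=1) @ F.normalize(trk_edges, dim=1).t()
```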
Finally, we follow the computational process for hypergraph matching vertex updates described in NHGM [51], utilizing hypergraph matching with high-order contextual information to match the detection hypergraph with the tracklet hypergraph. Hypergraph convolutional networks (HGCNs) are used to embed the vertices of the association graph, capturing the topological structure information between nodes and generating feature representations for each node. A Sinkhorn network outputs a doubly stochastic matrix, transforming the vertex embeddings generated by the HGCN into representations that meet the matching constraints. After each HGCN layer, the output of the Sinkhorn network (i.e., the predicted matching matrix) is concatenated with the vertex embeddings of the current layer. By introducing matching constraints into the embedding process, the matching-aware embeddings not only utilize the topological structure information of the nodes but also learn node representations that satisfy the matching conditions more directly, thereby improving the performance of the graph matching task. If the corresponding similarity in $H$ is higher than the affinity threshold $\delta$, the two nodes between the detection hypergraph and the tracklet hypergraph are matched, and the optimal matching is thus obtained. For detections that do not match any tracklet, if the detection confidence score is greater than a certain threshold, we use it to initialize a new tracklet. Unmatched tracklets suggest that the corresponding objects may have left the scene or were not detected.
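As an illustration of the final assignment and track-management logic described above, below is a minimal greedy sketch; the paper's matching is produced by the Sinkhorn/HGCN pipeline, and the thresholds here are assumed values:

```python
import torch

def assign(match_matrix: torch.Tensor, det_scores: torch.Tensor,
           delta: float = 0.5, new_track_thresh: float = 0.6):
    """Greedy per-detection assignment from an (n_det, n_trk) matching matrix."""
    matches, new_tracks = [], []
    unmatched_trks = set(range(match_matrix.shape[1]))
    for i in range(match_matrix.shape[0]):
        j = int(torch.argmax(match_matrix[i]))
        if match_matrix[i, j] > delta and j in unmatched_trks:
            matches.append((i, j))           # detection i continues tracklet j
            unmatched_trks.discard(j)
        elif det_scores[i] > new_track_thresh:
            new_tracks.append(i)             # confident unmatched detection starts a new tracklet
    return matches, new_tracks, sorted(unmatched_trks)
```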
During the training process, our proposed algorithm can be optimized in an end-to-end manner using the following loss function:
$$L = L_{Det} + L_{Emb} + L_{HGM},$$
where $L_{Det}$, $L_{Emb}$, and $L_{HGM}$ represent the detection loss, feature loss, and hypergraph matching loss, respectively.

5. Experiments

5.1. Datasets and Evaluation Metrics

We evaluate our method on the MOTChallenge benchmarks [53,54,55], namely MOT16, MOT17, and MOT20. The MOT16 and MOT17 datasets both include 14 video sequences, and the MOT20 dataset includes 8 video sequences with high object density and extremely complex scenes. Following the CLEAR MOT metrics [56], HOTA [57], and the IDF1 score [58], we report several standard quantities for quantitative evaluation, e.g., Multiple Object Tracking Accuracy (MOTA ↑), Higher Order Tracking Accuracy (HOTA ↑), IDF1 (↑), False Positives (FP ↓), False Negatives (FN ↓), and Identity Switches (IDS ↓). MOTA focuses on the performance of the detection branch, HOTA comprehensively assesses the performance of both detection and data association, and IDF1 evaluates the ability to maintain identity, paying more attention to association performance.
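For reference, MOTA aggregates the three error counts over all frames relative to the total number of ground-truth objects; a minimal sketch with illustrative numbers:

```python
def mota(fn: int, fp: int, ids: int, num_gt: int) -> float:
    """MOTA = 1 - (FN + FP + IDS) / GT, where GT is the total number of
    ground-truth object instances over all frames."""
    return 1.0 - (fn + fp + ids) / num_gt

print(mota(fn=100, fp=50, ids=10, num_gt=1000))  # -> 0.84
```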

5.2. Benchmark Evaluation

To demonstrate the advantages of our hypergraph-based architecture in the MOT task, we have selected many pioneering and influential methods for comparison. The compared methods are divided into online tracking and offline tracking. We have summarized the MOTChallenge results of different methods in Table 2, Table 3 and Table 4. For the aforementioned methods, if the performance on the corresponding dataset is reported on the official leaderboard, we adopt it directly. If not, we compare it based on the latest data provided in the selected papers.
The proposed method achieves good performance on most evaluation metrics. Since the MOTA metric emphasizes the detection side more, while HOTA also reflects association quality, we believe that our strong HOTA performance indicates that both our detector and our data association module perform well. In addition, our strong IDF1 performance indicates that, owing to the spatiotemporal relationship modeling through hypergraph interactions, our method produces temporally consistent tracklets.
Among the existing methods, the one most related to ours is GMTracker [3], which uses a similar graph matching method for identity association. The main differences and novelties lie in our use of knowledge extracted from historical tracklets to enhance the object detection and feature embedding modules, forming a positive feedback loop in which historical tracklet features are fused for enhancement and correction. On the MOT16 dataset, our proposed method improves MOTA, HOTA, and IDF1 by 6.5%, 5.1%, and 2.5%, respectively. On the MOT17 dataset, compared to GMTracker [3], our proposed method improves MOTA, HOTA, and IDF1 by 14.9%, 8.5%, and 5.8%, respectively. On the MOT20 dataset, we also compare with the JDE method GSDT [43], and our proposed method improves MOTA, HOTA, and IDF1 by 2.2%, 3.6%, and 3.4%, respectively.
At the same time, compared to the second-order features used in other graph-neural-network-based MOT methods, our method can not only extract higher-order features but also pay more attention to the visual consistency of occluded tracking objects, thereby helping the tracker associate better under severe occlusion. Another point worth mentioning is that our IDS is better than that of most mainstream methods; the smaller the IDS, the fewer identity switches occur in the tracklets and the more reliable the tracking results.
Figure 5 displays the qualitative results of our hypergraph architecture on the MOT17 dataset.
Figure 6 displays the qualitative results of our hypergraph architecture on the MOT20 dataset.

5.3. Ablation Study

To verify the impact of the proposed modules on tracking performance, we conducted ablation studies on the modules within the framework; the results are shown in Table 5. From the perspective of the MOTA metric, the TFE module significantly improved the tracker's accuracy, with an increase of 5.2% in MOTA, while from the perspective of the IDF1 metric, the hypergraph matching module notably enhanced the association precision, with an increase of 3.5% in IDF1. This shows that ordinary graph matching overlooks the third-order angle similarity, a similarity that can model group activities; exploiting it generates more reliable tracklets and demonstrates the potential of the hypergraph architecture in MOT tasks. However, because the computational cost of high-order structures explodes to $O((N_D N_T)^p)$, where $p$ is the order, hypergraph neural networks with structures higher than the third order were not explored in this paper. Moreover, since our proposed method is an end-to-end framework, the performance was greatly enhanced when the two modules were combined, with an increase of 6.8% in MOTA, 3.3% in HOTA, and 5.8% in IDF1.
Furthermore, to verify the impact of high-order structural information in hypergraph matching on tracking performance, we conducted ablation experiments on different orders of structural information, with the results shown in Table 6. In this paper, we only discuss the cases where $p \le 3$. As the order of the structural information used in hypergraph matching increases, both MOTA and IDF1 show an increasing trend. This trend demonstrates that higher-order structural relationship features are more effective for learning spatiotemporal consistency.

6. Conclusions

We designed an MOT framework based on hypergraph association. Through the feature extraction and detection module, we achieved initial localization and description of objects. By integrating the features from object detection with those from feature extraction, we enhanced the consistency between the current frame features and the tracklet features, avoiding issues such as identity swapping and tracklet fragmentation due to the loss or distortion of object detection. We modeled the data association process with a hypergraph neural network, effectively capturing the relational information between objects. Hypergraph matching was used to associate both occluded and non-occluded objects, maintaining the visual consistency of the tracked objects. The experimental results showed that the method achieves good qualitative and quantitative results on three MOT datasets, thereby demonstrating its effectiveness in improving the robustness and accuracy of MOT tasks.

Author Contributions

Methodology, Z.C.; Validation, Z.C.; Writing—original draft, Z.C.; Writing—review and editing, Y.D. (Yuqi Dai); Project administration, X.T.; Funding acquisition, Y.D. (Yiping Duan). All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Natural Science Foundation of China (Nos. NSFC 61925105, 62227801, 62322109, 62171257, and U22B2001), the State Key Laboratory of Space Network and Communications, the New Cornerstone Science Foundation through the XPLORER PRIZE, and the Tsinghua University (Department of Electronic Engineering)-Nantong Research Institute for Advanced Communication Technologies Joint Research Center for Space, Air, Ground, and Sea Cooperative Communication Network Technology.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Bae, S.H.; Yoon, K.J. Confidence-Based Data Association and Discriminative Deep Appearance Learning for Robust Online Multi-Object Tracking. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 595–610. [Google Scholar] [CrossRef] [PubMed]
  2. Bergmann, P.; Meinhardt, T.; Leal-Taixe, L. Tracking without bells and whistles. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 941–951. [Google Scholar]
  3. He, J.; Huang, Z.; Wang, N.; Zhang, Z. Learnable Graph Matching: Incorporating Graph Partitioning with Deep Feature Learning for Multiple Object Tracking. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 5295–5305. [Google Scholar]
  4. Ristani, E.; Tomasi, C.J. Features for Multi-target, Multi-camera Tracking and Re-identification. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6036–6046. [Google Scholar]
  5. Saleh, F.; Aliakbarian, S.; Rezatofighi, H.; Salzmann, M.; Gould, S. Probabilistic tracklet scoring and inpainting for multiple object tracking. In Proceedings of the 2021 IEEE Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 14329–14339. [Google Scholar]
  6. Brasó, G.; Leal-Taixé, L. Learning a Neural Solver for Multiple Object Tracking. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 6246–6256. [Google Scholar]
  7. Tang, S.; Andriluka, M.; Andres, B.; Schiele, B. Multiple People Tracking by Lifted Multicut and Person Re-identification. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 3701–3710. [Google Scholar]
  8. Leal-Taixé, L.; Ferrer, C.C.; Schindler, K. Learning by tracking: Siamese CNN for robust target association. In Proceedings of the 2016 IEEE Conference on Computer Vision & Pattern Recognition Workshops (CVPRW), Las Vegas, NV, USA, 26 June–1 July 2016; pp. 418–425. [Google Scholar]
  9. Lu, Z.; Rathod, V.; Votel, R.; Huang, J. Retinatrack: Online single stage joint detection and tracking. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 14656–14666. [Google Scholar]
  10. Wang, Z.; Zheng, L.; Liu, Y.; Li, Y.; Wang, S. Towards real-time multi-object tracking. In Proceedings of the ECCV 2020, Glasgow, UK, 23–28 August 2020; Springer: Cham, Switzerland, 2020; pp. 107–122. [Google Scholar]
  11. Zhang, Y.; Wang, C.; Wang, X.; Zeng, W.; Liu, W. Fairmot: On the fairness of detection and re-identification in multiple object tracking. Int. J. Comput. Vis. 2021, 129, 3069–3087. [Google Scholar] [CrossRef]
  12. Liang, C.; Zhang, Z.; Zhou, X.; Li, B.; Lu, Y.; Hu, W. One more check: Making “fake background” be tracked again. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 22 February–1 March 2021. [Google Scholar]
  13. Voigtlaender, P.; Krause, M.; Osep, A.; Luiten, J.; Sekar, B.B.G.; Geiger, A.; Leibe, B. Mots: Multi-object tracking and segmentation. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 7934–7943. [Google Scholar]
  14. Guo, S.; Wang, J.; Wang, X.; Tao, D. Online multiple object tracking with cross-task synergy. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 8132–8141. [Google Scholar]
  15. Bewley, A.; Ge, Z.; Ott, L.; Ramos, F.; Upcroft, B. Simple Online and Realtime Tracking. In Proceedings of the 2016 IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA, 25–28 September 2016; pp. 3464–3468. [Google Scholar]
  16. Zhang, L.; Gray, H.; Ye, X.; Collins, L.; Allinson, N. Automatic individual pig detection and tracking in pig farms. Sensors 2019, 19, 1188. [Google Scholar] [CrossRef]
  17. Lu, Y.; Lu, C.; Tang, C.K. Online Video Object Detection Using Association LSTM. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2363–2371. [Google Scholar]
  18. Girshick, R.J.C.S. Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  19. Dai, J.; Li, Y.; He, K.; Sun, J. R-FCN: Object Detection via Region-based Fully Convolutional Networks. In Proceedings of the 30th International Conference on Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; Red Hook: New York, NY, USA; pp. 379–387. [Google Scholar]
  20. Wei, L.; Dragomir, A.; Dumitru, E.; Christian, S.; Scott, R.; Cheng-Yang, F.; Berg, A.C.J.S. SSD: Single Shot MultiBox Detector. In Proceedings of the ECCV 2016, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Cham, Switzerland, 2016; pp. 21–37. [Google Scholar]
  21. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 14–17 March 2016; pp. 779–788. [Google Scholar]
  22. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6517–6525. [Google Scholar]
  23. Zhang, Y.; Sun, P.; Jiang, Y.; Yu, D.; Weng, F.; Yuan, Z.; Luo, P.; Liu, W.; Wang, X. Bytetrack: Multi-object tracking by associating every detection box. In Proceedings of the ECCV 2022, Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022; pp. 1–21. [Google Scholar]
  24. Cao, J.; Weng, X.; Khirodkar, R.; Pang, J.; Kitani, K. Observation-Centric SORT: Rethinking SORT for Robust Multi-Object Tracking. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2022; pp. 9686–9696. [Google Scholar]
  25. Wojke, N.; Bewley, A.; Paulus, D. Simple Online and Realtime Tracking with a Deep Association Metric. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; pp. 3645–3649. [Google Scholar]
  26. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  27. Dai, P.; Weng, R.; Choi, W.; Zhang, C.; He, Z.; Ding, W. Learning a Proposal Classifier for Multiple Object Tracking. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 2443–2452. [Google Scholar]
  28. Kipf, T.N.; Welling, M. Semi-Supervised Classification with Graph Convolutional Networks. In Proceedings of the International Conference on Learning Representations, Toulon, France, 24–26 April 2017; pp. 1–14. [Google Scholar]
  29. Xu, Y.; Ban, Y.; Alameda-Pineda, X.; Horaud, R. DeepMOT: How to train your deep multi-object tracker. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 6787–6796. [Google Scholar]
  30. Schuster, M.; Paliwal, K.K. Bidirectional recurrent neural networks. IEEE Trans. Signal Process. 1997, 45, 2673–2681. [Google Scholar] [CrossRef]
  31. Shan, C.; Wei, C.; Deng, B.; Huang, J.; Hua, X.S.; Cheng, X.; Liang, K. Tracklets Predicting Based Adaptive Graph Tracking. arXiv 2020, arXiv:2010.09015. [Google Scholar] [CrossRef]
  32. Wu, Y.; Liu, Q.; Sun, H.; Xue, D. HRTracker: Multi-Object Tracking in Satellite Video Enhanced by High-Resolution Feature Fusion and an Adaptive Data Association. Remote Sens. 2024, 16, 3347. [Google Scholar] [CrossRef]
  33. Li, J.; Piao, Y. Multi-Object Tracking Based on Re-Identification Enhancement and Associated Correction. Appl. Sci. 2023, 13, 9528. [Google Scholar] [CrossRef]
  34. Kim, J.S.; Chang, D.S.; Choi, Y.S. Enhancement of Multi-Target Tracking Performance via Image Restoration and Face Embedding in Dynamic Environments. Appl. Sci. 2021, 11, 649. [Google Scholar] [CrossRef]
  35. Zhao, H.; Shen, Y.; Wang, Z.; Zhang, Q. MFACNet: A Multi-Frame Feature Aggregating and Inter-Feature Correlation Framework for Multi-Object Tracking in Satellite Videos. Appl. Sci. 2024, 16, 1604. [Google Scholar] [CrossRef]
  36. Chen, T.; Pennisi, A.; Li, Z.; Zhang, Y.; Sahli, H. A Hierarchical Association Framework for Multi-Object Tracking in Airborne Videos. Remote Sens. 2018, 10, 1347. [Google Scholar] [CrossRef]
  37. Wen, J.; Gucma, M.; Li, M.; Mou, J. Multi-Object Detection for Inland Ship Situation Awareness Based on Few-Shot Learning. Remote Sens. 2023, 13, 10282. [Google Scholar] [CrossRef]
  38. Redmon, J.; Farhadi, A.J. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar] [CrossRef]
  39. Zhou, X.; Wang, D.; Krhenbühl, P. Objects as Points. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 474–490. [Google Scholar]
  40. Yu, F.; Wang, D.; Shelhamer, E.; Darrell, T. Deep Layer Aggregation. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; IEEE Computer Society: Washington, WA, USA, 2018; pp. 2403–2412. [Google Scholar]
  41. Yang, J.; Ge, H.; Yang, J.; Tong, Y.; Su, S. Online multi-object tracking using multi-function integration and tracking simulation training. Appl. Intell. 2022, 52, 1268–1288. [Google Scholar] [CrossRef]
  42. Li, J.; Ding, Y.; Wei, H. SimpleTrack: Rethinking and Improving the JDE Approach for Multi-Object Tracking. Sensors 2022, 22, 5863. [Google Scholar] [CrossRef]
  43. Wang, Y.; Weng, X.; Kitani, K. Joint Detection and Multi-Object Tracking with Graph Neural Networks. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2020; pp. 13708–13715. [Google Scholar]
  44. Krhenbühl, P.; Koltun, V.; Zhou, X. Tracking Objects as Points. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020. [Google Scholar]
  45. Xu, Y.; Ban, Y.; Delorme, G.; Gan, C.; Rus, D.; Alameda-Pineda, X. TransCenter: Transformers with Dense Queries for Multiple-Object Tracking. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 7820–7835. [Google Scholar] [CrossRef]
  46. Zhu, J.; Yang, H.; Liu, N.; Kim, M.; Zhang, W.; Yang, M.H. Online Multi-Object Tracking with Dual Matching Attention Networks. In Proceedings of the ECCV 2018, Munich, Germany, 8–14 September 2018; Springer: Cham, Switzerland, 2018; pp. 379–396. [Google Scholar]
  47. Zheng, L.; Tang, M.; Chen, Y.; Zhu, G.; Wang, J.; Lu, H. Improving Multiple Object Tracking with Single Object Tracking. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 2453–2462. [Google Scholar]
  48. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. In Proceedings of the Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 6000–6010. [Google Scholar]
  49. Meinhardt, T.; Kirillov, A.; Leal-Taixe, L.; Feichtenhofer, C. TrackFormer: Multi-Object Tracking with Transformers. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2021; pp. 8834–8844. [Google Scholar]
  50. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In Proceedings of the ECCV 2020, Glasgow, UK, 23–28 August 2020; Springer: Cham, Switzerland, 2020; pp. 213–229. [Google Scholar]
  51. Wang, R.; Yan, J.; Yang, X. Neural Graph Matching Network: Learning Lawler’s Quadratic Assignment Problem with Extension to Hypergraph and Multiple-Graph Matching. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 5261–5279. [Google Scholar] [CrossRef]
  52. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  53. Dendorfer, P.; Rezatofighi, H.; Milan, A.; Shi, J.; Cremers, D.; Reid, I.; Roth, S.; Schindler, K.; Leal-Taixe, L. CVPR19 Tracking and Detection Challenge: How crowded can it get? arXiv 2019, arXiv:1906.04567. [Google Scholar] [CrossRef]
  54. Dendorfer, P.; Rezatofighi, H.; Milan, A.; Shi, J.; Cremers, D.; Reid, I.; Roth, S.; Schindler, K.; Leal-Taixé, L. MOT20: A benchmark for multi object tracking in crowded scenes. arXiv 2020, arXiv:2003.09003. [Google Scholar] [CrossRef]
  55. Milan, A.; Leal-Taixe, L.; Reid, I.; Roth, S.; Schindler, K. MOT16: A Benchmark for Multi-Object Tracking. arXiv 2016, arXiv:1603.00831. [Google Scholar] [CrossRef]
  56. Bernardin, K.; Stiefelhagen, R. Evaluating multiple object tracking performance: The CLEAR MOT metrics. EURASIP J. Image Video Process. 2008, 2008, 246309. [Google Scholar]
  57. Luiten, J.; Ošep, A.; Dendorfer, P.; Torr, P.; Geiger, A.; Leal-Taixé, L.; Leibe, B. HOTA: A Higher Order Metric for Evaluating Multi-object Tracking. Int. J. Comput. Vis. 2021, 129, 548–578. [Google Scholar] [CrossRef] [PubMed]
58. Ristani, E.; Solera, F.; Zou, R.S.; Cucchiara, R.; Tomasi, C. Performance measures and a data set for multi-target, multi-camera tracking. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; Springer: Cham, Switzerland, 2016. [Google Scholar] [CrossRef]
59. Hornakova, A.; Henschel, R.; Rosenhahn, B.; Swoboda, P. Lifted Disjoint Paths with Application in Multiple Object Tracking. In Proceedings of the International Conference on Machine Learning (ICML), Virtual, 13–18 July 2020; pp. 4364–4375. [Google Scholar]
  60. Stadler, D.; Beyerer, J. Improving Multiple Pedestrian Tracking by Track Management and Occlusion Handling. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 10953–10962. [Google Scholar]
  61. You, S.; Yao, H.; Bao, B.K.; Xu, C. UTM: A Unified Multiple Object Tracking Model with Identity-Aware Feature Enhancement. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 21876–21886. [Google Scholar]
62. Liang, C.; Zhang, Z.; Lu, Y.; Zhou, X.; Li, B.; Ye, X.; Zou, J. Rethinking the competition between detection and ReID in Multi-Object Tracking. IEEE Trans. Image Process. 2022, 31, 3182–3196. [Google Scholar] [CrossRef] [PubMed]
  63. Tokmakov, P.; Li, J.; Burgard, W.; Gaidon, A. Learning to Track with Object Permanence. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 10840–10849. [Google Scholar]
  64. Wang, Q.; Zheng, Y.; Pan, P.; Xu, Y. Multiple Object Tracking with Correlation Learning. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 3875–3885. [Google Scholar]
  65. Yu, E.; Li, Z.; Han, S.; Wang, H. RelationTrack: Relation-aware Multiple Object Tracking with Decoupled Representation. IEEE Trans. Multimed. 2022, 25, 2686–2697. [Google Scholar] [CrossRef]
66. Papakis, I.; Sarkar, A.; Karpatne, A. GCNNMatch: Graph Convolutional Neural Networks for Multi-Object Tracking via Sinkhorn Normalization. arXiv 2020, arXiv:2010.00067. [Google Scholar] [CrossRef]
Figure 1. (a) Input images for the MOT task. (b) Schematic representation of the object tracklets. Existing methods often fail when multiple objects with similar appearance or motion patterns appear in close proximity, or when occlusion occurs.
Figure 2. Overview of the graph/hypergraph matching pipeline. (a) Graph Matching. (b) Association Graph. (c) Association Hypergraph. The node-to-node matching problem in (a) can therefore be formulated as a node classification task on the association graph, whose edge weights can be induced by the affinity matrix. Similarly, the vertex-to-vertex matching problem can be formulated as a vertex classification task on the association hypergraph, whose edge weights can be induced by the affinity tensor.
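For readers who want a concrete reference point for panels (a) and (b) of Figure 2, the sketch below builds a second-order association affinity matrix in Lawler's quadratic assignment form (as in [51]) and reads soft assignments off its principal eigenvector, in the spirit of classical spectral matching. It is an illustrative, hand-crafted construction with placeholder names (`build_association_affinity`, `spectral_matching`), not the learned affinities or the trained solver used in JDTHM.

```python
import numpy as np

def build_association_affinity(unary, pair1, pair2, alpha=0.5):
    """Second-order (Lawler-style) affinity matrix for graph matching.

    unary: (n1, n2) vertex-to-vertex affinities between the two graphs.
    pair1: (n1, n1) symmetric pairwise relations within graph 1 (e.g., distances).
    pair2: (n2, n2) symmetric pairwise relations within graph 2.
    Returns K of shape (n1*n2, n1*n2): each index encodes one candidate
    assignment (i, a); off-diagonal entries compare edge (i, j) of graph 1
    with edge (a, b) of graph 2.
    """
    n1, n2 = unary.shape
    K = np.zeros((n1 * n2, n1 * n2))
    for i in range(n1):
        for a in range(n2):
            p = i * n2 + a
            K[p, p] = unary[i, a]                 # vertex affinity on the diagonal
            for j in range(n1):
                for b in range(n2):
                    if i != j and a != b:
                        q = j * n2 + b
                        # edge-to-edge consistency: similar pairwise geometry scores high
                        K[p, q] = alpha * np.exp(-abs(pair1[i, j] - pair2[a, b]))
    return K

def spectral_matching(K, n1, n2):
    """Principal eigenvector of the (symmetric) affinity matrix as soft assignment scores."""
    _, vecs = np.linalg.eigh(K)
    x = np.abs(vecs[:, -1])                       # eigenvector of the largest eigenvalue
    return x.reshape(n1, n2)
```

The soft score matrix returned by `spectral_matching` can then be discretised into a one-to-one assignment, for example with the Hungarian algorithm.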
Figure 3. Framework overview of the proposed method (JDTHM). JDTHM is composed of a feature extraction and object detection module, a feature aggregation module, and a data association module. Circles of different colors represent candidate bounding boxes and historical tracklets.
Figure 4. Hypergraph matching process diagram. The computation of vertex-to-vertex matching relationships between two hypergraphs is translated into a vertex classification task on the association hypergraph, whose edge weights can be induced by the affinity tensor.
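Analogously to Figure 4, a minimal non-learned way to score the vertices of the association hypergraph is tensor power iteration over a third-order affinity tensor, followed by a Hungarian step that discretises the soft scores into one-to-one tracklet-detection matches. The function names below are placeholders, and the dense (K, K, K) tensor is used only for clarity (in practice such tensors are kept sparse); this is a sketch of the classical procedure, not the paper's end-to-end association module.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def hypergraph_matching_scores(affinity, n_iter=30):
    """Tensor power iteration over a third-order affinity tensor.

    affinity: (K, K, K) non-negative array, K = n_tracklets * n_detections;
    entry (p, q, r) scores the joint compatibility of three candidate
    tracklet-detection assignments, i.e., it induces the hyperedge weights
    of the association hypergraph.
    """
    K = affinity.shape[0]
    x = np.full(K, 1.0 / K)                    # uniform initial vertex scores
    for _ in range(n_iter):
        # x_p <- sum_{q, r} affinity[p, q, r] * x_q * x_r
        x = np.einsum('pqr,q,r->p', affinity, x, x)
        x /= np.linalg.norm(x) + 1e-12         # re-normalise to keep scores bounded
    return x

def discretize(scores, n_tracklets, n_detections):
    """Turn soft vertex scores into a one-to-one assignment (Hungarian step)."""
    cost = -scores.reshape(n_tracklets, n_detections)
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows, cols))
```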
Figure 5. Visualization of our results on four sequences of the MOT17 benchmark. Each row shows sampled frames of one video sequence in chronological order. Bounding boxes and identities are marked in the images; boxes of different colors represent different identities. Best viewed in color.
Figure 6. Visualization of our results on three sequences of the MOT20 benchmark. Bounding boxes and identities are marked in the images; boxes of different colors represent different identities. Best viewed in color.
Table 1. Notations and definitions.
Notation | Definition
$\mathcal{G}$ | Graph/hypergraph
$\mathcal{V}$ | The set of vertices of $\mathcal{G}$
$\mathcal{E}$ | The set of hyperedges of $\mathcal{G}$
$N$ | The number of vertices of $\mathcal{G}$
$M$ | The number of hyperedges of $\mathcal{G}$
$A$ | The adjacency matrix $A = \bar{P}\bar{Q}^{T}$, where $\bar{P}$, $\bar{Q}$ are the connectivity matrices
$D_v$ | The diagonal matrix of vertex degrees
$D_e$ | The diagonal matrix of hyperedge degrees
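As a small illustration of the last rows of Table 1, the degree matrices follow directly from a vertex-hyperedge incidence matrix. The construction below is the standard one and is only meant to ground the notation; the adjacency $A = \bar{P}\bar{Q}^{T}$ additionally needs the connectivity matrices $\bar{P}$ and $\bar{Q}$, whose construction is given in the method section and is not reproduced here.

```python
import numpy as np

def degree_matrices(H):
    """Vertex and hyperedge degree matrices of Table 1 from an incidence matrix.

    H: (N, M) binary incidence matrix with H[v, e] = 1 iff vertex v lies on
    hyperedge e. D_v counts hyperedges per vertex; D_e counts vertices per hyperedge.
    """
    Dv = np.diag(H.sum(axis=1))
    De = np.diag(H.sum(axis=0))
    return Dv, De

# Toy hypergraph: 4 vertices, 2 hyperedges {0, 1, 2} and {2, 3}.
H = np.array([[1, 0],
              [1, 0],
              [1, 1],
              [0, 1]])
Dv, De = degree_matrices(H)   # vertex degrees (1, 1, 2, 1); hyperedge degrees (3, 2)
```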
Table 2. Comparison with different MOT methods on MOT16 benchmark with the provided public detections, showing the quantitative results.
Method | MOTA | HOTA | IDF1 | FP | FN | IDS
GMTracker [3] | 61.1 | 51.2 | 66.6 | 3891 | 66,550 | 503
MPNTrack [6] | 58.6 | 48.9 | 61.7 | 4949 | 70,252 | 354
DeepMOT [29] | 54.8 | 42.2 | 53.4 | 2955 | 78,765 | 645
LPC [27] | 58.8 | 51.7 | 67.6 | 6167 | 68,432 | 435
Lif_T [59] | 57.5 | 49.6 | 64.1 | 4249 | 72,868 | 335
TMOH [60] | 63.2 | 50.7 | 63.5 | 3122 | 63,376 | 635
UTM [61] | 63.8 | 53.1 | 67.1 | 8328 | 57,269 | 428
JDTHM | 67.6 | 56.3 | 69.1 | 6137 | 52,686 | 377
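For reference, the metrics reported in Tables 2–4 follow their standard definitions: MOTA aggregates the CLEAR MOT error counts relative to the total number of ground-truth objects [56], and IDF1 is the F1 score of identity-consistent matches [58]:

```latex
\mathrm{MOTA} = 1 - \frac{\sum_t \left(\mathrm{FN}_t + \mathrm{FP}_t + \mathrm{IDS}_t\right)}{\sum_t \mathrm{GT}_t},
\qquad
\mathrm{IDF1} = \frac{2\,\mathrm{IDTP}}{2\,\mathrm{IDTP} + \mathrm{IDFP} + \mathrm{IDFN}}
```

HOTA [57] additionally balances detection and association accuracy and is computed as defined in the original paper.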
Table 3. Comparison with different methods on MOT17 benchmark with the provided public detections, showing the quantitative results.
Method | MOTA | HOTA | IDF1 | FP | FN | IDS
GMTracker [3] | 59.0 | 51.1 | 65.9 | 20,395 | 209,553 | 1105
MPNTrack [6] | 58.8 | 49.0 | 61.7 | 17,413 | 213,594 | 1185
FairMOT [11] | 73.7 | 59.3 | 72.3 | 27,507 | 117,477 | 3303
LPC [27] | 59.0 | 51.7 | 66.8 | 23,102 | 206,947 | 1122
CTTrack [44] | 61.5 | 48.2 | 59.6 | 14,076 | 200,672 | 2583
TransCenter [45] | 71.9 | 54.1 | 62.3 | 17,378 | 137,008 | 4046
Trackformer [49] | 74.1 | 57.3 | 68.0 | 34,602 | 108,777 | 2829
Lif_T [59] | 60.5 | 51.3 | 65.6 | 14,966 | 206,619 | 1189
CSTrack [62] | 74.9 | 59.3 | 72.3 | 23,847 | 114,303 | 3567
PermaTrack [63] | 73.1 | 54.2 | 67.2 | 24,577 | 123,508 | 3571
JDTHM | 73.9 | 59.6 | 71.7 | 25,639 | 117,756 | 3733
Table 4. Comparison with modern methods on MOT20 benchmark with the provided public detections, showing the quantitative results.
Method | MOTA | HOTA | IDF1 | FP | FN | IDS
FairMOT [11] | 61.8 | 54.6 | 67.3 | 103,440 | 88,901 | 5243
CSTrack [62] | 66.6 | 54.0 | 68.6 | 25,404 | 144,358 | 3196
GSDT [43] | 67.1 | 53.1 | 67.5 | 31,913 | 135,409 | 3133
MPNTrack [6] | 57.6 | 46.8 | 59.1 | 16,953 | 201,384 | 1210
Trackformer [49] | 68.6 | 54.7 | 65.7 | 20,348 | 140,373 | 1532
CorrTracker [64] | 65.2 | 57.2 | 69.1 | 79,429 | 95,855 | 5183
RelationTrack [65] | 67.2 | 56.5 | 70.5 | 61,134 | 104,597 | 4243
GNNMatch [66] | 54.5 | 40.2 | 49.0 | 9522 | 223,611 | 2038
JDTHM | 69.3 | 56.7 | 70.9 | 29,716 | 127,135 | 2288
Table 5. Ablation study of the model and a report of the MOTA/HOTA/IDF1 scores for each configuration on the datasets.
Baseline | Feature Aggregation | Hypergraph | MOTA | HOTA | IDF1
✓ | – | – | 67.1 | 56.3 | 65.9
✓ | ✓ | – | 72.3 | 57.1 | 66.2
✓ | – | ✓ | 71.6 | 57.9 | 69.4
✓ | ✓ | ✓ | 73.9 | 59.6 | 71.7
Table 6. Effect of the order of high-order structural information in hypergraph matching on tracking performance and a report of the MOTA/HOTA/IDF1 scores for each configuration on the datasets.
Order | MOTA | HOTA | IDF1
1 | 64.4 | 53.6 | 65.9
2 | 66.9 | 54.6 | 68.6
3 | 69.3 | 56.7 | 70.9