Occlusion-Robust Multi-Target Tracking and Segmentation Framework with Mask Enhancement
Abstract
1. Introduction
- We propose a mask-conditioned feature fusion network that integrates a mask-guided attention mechanism with a spatial–temporal feature aggregation sub-network. Guided by instance masks, the network leverages temporal information from short trajectories to generate occlusion-robust spatial–temporal aggregation features, improving tracking robustness in crowded scenes.
- We present an occlusion-aware mask propagation network. By perceiving the occlusion states of targets, the network prevents online tracking templates from being contaminated by noisy inputs, and it completes the association of missing trajectory segments through the constructed mask propagation network.
- We propose a mask-enhanced multi-object tracking and segmentation (MOTS) framework. The framework combines the mask-based components above with a mask-integrated multi-hypothesis tracking algorithm, improving adaptability in occluded scenarios and enhancing the robustness of the MOTS pipeline.
2. Related Works
2.1. Trajectory Association Algorithms
2.2. Instance Segmentation-Based Multi-Object Tracking
2.3. Multi-Object Tracking and Segmentation Frameworks
3. Methodology
3.1. Mask-Conditioned Feature Fusion Network
3.1.1. Mask-Attentive Dual-Stream Encoder
3.1.2. Adaptive Feature Fusion Sub-Module
3.1.3. Spatial–Temporal Aggregation Sub-Network
3.2. Occlusion-Aware Mask Propagation Network
3.2.1. Occlusion State Perception with Mask IoU
- If there exists a target $i$ such that $\mathrm{IoU}(m_A^t, m_i^t) > \theta$, where $m_A^t$ denotes the mask of target $A$ in frame $t$ and $\theta$ is the mask IoU threshold, target $A$ overlaps with another target. To determine whether $A$ is the occluder or the occluded, the mask-area ratio $r_A^t = |m_A^t| / |m_A^{t-1}|$ is compared against a threshold $\tau$:
  - If $r_A^t < \tau$, the mask area of $A$ is decreasing rapidly, which means that $A$ is the occluded target ($s_A^t = 1$);
  - If $r_A^t \ge \tau$, the mask area shows no decreasing trend, which means that $A$ is the occluder ($s_A^t = 2$).
- If there is no target $i$ satisfying $\mathrm{IoU}(m_A^t, m_i^t) > \theta$, the occlusion state is likewise determined by $r_A^t$:
  - If $r_A^t \ge \tau$, $A$ is unoccluded ($s_A^t = 0$);
  - If $r_A^t < \tau$, $A$ is occluded by a non-target object ($s_A^t = 3$).
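As a concrete illustration, the following Python sketch implements this decision rule for boolean masks. The constant names (`IOU_THRESH` for $\theta$, `AREA_RATIO_THRESH` for $\tau$) and their values are illustrative assumptions, not the paper's released code; Section 4.4 ablates the mask IoU threshold.

```python
import numpy as np

# Illustrative thresholds (assumed values, not from the paper):
IOU_THRESH = 0.5          # theta: mask-IoU overlap threshold (ablated in Sec. 4.4)
AREA_RATIO_THRESH = 0.8   # tau: area ratio below which the mask is "rapidly decreasing"

def mask_iou(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    """IoU between two boolean masks of the same shape."""
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return float(inter) / float(union) if union > 0 else 0.0

def occlusion_state(mask_t: np.ndarray, mask_prev: np.ndarray,
                    other_masks: list[np.ndarray]) -> int:
    """Return the occlusion state s_A in {0, 1, 2, 3} for one target."""
    area_ratio = float(mask_t.sum()) / max(float(mask_prev.sum()), 1.0)
    overlaps_target = any(mask_iou(mask_t, m) > IOU_THRESH for m in other_masks)
    if overlaps_target:
        # Overlap with another tracked target: a shrinking mask means occluded.
        return 1 if area_ratio < AREA_RATIO_THRESH else 2
    # No target overlap: a shrinking mask means occlusion by a non-target object.
    return 0 if area_ratio >= AREA_RATIO_THRESH else 3
```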
3.2.2. Robust Mask Propagation Network
3.3. Mask-Enhanced Multi-Object Tracking and Segmentation Framework
3.3.1. Mask-Integrated Multi-Hypothesis Tracking
3.3.2. Mask-Enhanced Multi-Hypothesis Tracking Framework
1. Short tracklet generation. Match background tracklets with foreground tracklets to generate short tracklet pairs. The tracklets to be matched come from detections produced by object detection and instance segmentation. These tracklet pairs correspond one-to-one with targets and are used for subsequent feature extraction and track association.
2. Short tracklet representation. Using the mask-conditioned feature fusion network, extract the spatial–temporal features of the target tracklet pairs to generate representations that remain robust in occluded scenarios; these features drive the subsequent data association.
3. Track association with the multi-hypothesis tracking algorithm. For target $A$ in frame $t$, if $s_A^t = 0$ or $s_A^t = 2$, update the tracker's internal template using the MHT algorithm, compute the similarity between adjacent nodes through mask association evaluation, and perform track association.
4. Occlusion state perception. Use the occlusion perception model (Section 3.2.1) to detect target occlusion states. If $s_A^t = 0$ or $s_A^t = 2$, perform data association as described in step (3). If $s_A^t = 1$ or $s_A^t = 3$, suspend template updates in the multi-hypothesis tracker and proceed to step (5) for occluded track completion.
5. Mask propagation. Generate target masks in occluded regions with the mask propagation network from Section 3.2 to complete missing track segments. When the target disappears or its state returns to $s_A^t \in \{0, 2\}$, return to step (3) for tracking with the MHT algorithm and complete the association of all track segments. (A minimal control-flow sketch of steps (3)–(5) follows this list.)
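The per-frame decision in steps (3)–(5) reduces to a dispatch on the occlusion state. The sketch below shows that dispatch only; the `Track` record, the function name `step_track`, and the returned action labels are hypothetical stand-ins for the paper's components. The 20-frame disappearance window matches Section 4.2.

```python
from dataclasses import dataclass

@dataclass
class Track:
    track_id: int
    frozen: bool = False     # True while MHT template updates are suspended
    missing_frames: int = 0  # consecutive frames filled by mask propagation

def step_track(track: Track, state: int, disappearance_window: int = 20) -> str:
    """Per-frame action for one track given its occlusion state s in {0,1,2,3}."""
    if state in (0, 2):                  # unoccluded, or itself the occluder
        track.frozen = False
        track.missing_frames = 0
        return "associate"               # steps (3)/(4): MHT association + template update
    track.frozen = True                  # step (4): suspend template updates
    track.missing_frames += 1
    if track.missing_frames > disappearance_window:
        return "terminate"               # target considered disappeared (Sec. 4.2)
    return "propagate"                   # step (5): fill the mask by propagation

# A track occluded (s=1) for three frames, then visible again (s=0):
t = Track(track_id=7)
print([step_track(t, s) for s in (1, 1, 1, 0)])
# -> ['propagate', 'propagate', 'propagate', 'associate']
```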
4. Experiments
4.1. MOTS Dataset
4.2. Implementation Details
- Short trajectory pair length: 3 frames.
- Average-confidence filtering threshold for short trajectory pairs: 0.7.
- MHT pruning backtracking depth: 5 layers.
- Target disappearance window length: 20 frames.
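For reference, these hyper-parameters can be grouped into a single configuration object. The sketch below is a hypothetical grouping with illustrative field names; the values are the ones listed above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MOTSConfig:
    tracklet_pair_length: int = 3      # frames per short trajectory pair
    min_pair_confidence: float = 0.7   # average-confidence filter for pairs
    prune_depth: int = 5               # MHT pruning backtracking layers
    disappearance_window: int = 20     # frames before a lost target is dropped

cfg = MOTSConfig()
assert cfg.prune_depth == 5
```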
4.3. Performance Metrics
4.4. Ablation Experiment
4.5. Experiment Results
Tracker | MOTSA ↑ | IDF1 ↑ | sMOTSA ↑ | MOTSP ↑ | MT ↑ | TP ↑ | FP ↓ | FN ↓ |
---|---|---|---|---|---|---|---|---|
TrackRCNN [1] | 55.2 | 42.4 | 40.6 | 76.1 | 127 | 19,628 | 1261 | 12,641 |
MPNTrackSeg [47] | 73.7 | 68.8 | 58.6 | 80.6 | 207 | 25,036 | 1059 | 7233 |
GMPHD_MAF [10] | 83.3 | 66.4 | 69.4 | 84.2 | 249 | 28,284 | 935 | 3985 |
EMNT [46] | 83.7 | 77.0 | 70.0 | 84.1 | 234 | 27,943 | 666 | 4326 |
ReMOTS [45] | 84.4 | 75.0 | 70.4 | 84.0 | 248 | 28,270 | 819 | 3999 |
Our Method | 84.4 | 75.1 | 70.2 | 84.3 | 254 | 28,504 | 1043 | 3765 |
5. Discussion
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Voigtlaender, P.; Krause, M.; Osep, A.; Luiten, J.; Sekar, B.B.G.; Geiger, A.; Leibe, B. Mots: Multi-object tracking and segmentation. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7934–7943. [Google Scholar]
- Wu, Y.; Lim, J.; Yang, M.H. Online object tracking: A benchmark. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 2411–2418. [Google Scholar]
- Chaabane, M.; Zhang, P.; Beveridge, J.R.; O’Hara, S. DEFT: Detection Embeddings for Tracking. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 10–25 June 2021. [Google Scholar]
- Liu, T.; Wang, G.; Yang, Q. Real-time part-based visual tracking via adaptive correlation filters. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 4902–4912. [Google Scholar]
- Kim, C.; Li, F.; Ciptadi, A.; Rehg, J.M. Multiple hypothesis tracking revisited. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 4696–4704. [Google Scholar]
- Zhang, Y.; Sheng, H.; Wu, Y.; Wang, S.; Ke, W.; Xiong, Z. Multiplex Labeling Graph for Near-Online Tracking in Crowded Scenes. IEEE Internet Things J. 2020, 7, 7892–7902. [Google Scholar] [CrossRef]
- Wu, J.; Cao, J.; Song, L.; Wang, Y.; Yang, M.; Yuan, J. Track to detect and segment: An online multi-object tracker. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 10–25 June 2021; pp. 12347–12356. [Google Scholar]
- Chu, P.; Ling, H. FAMNet: Joint Learning of Feature, Affinity and Multi-Dimensional Assignment for Online Multiple Object Tracking. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 6171–6180. [Google Scholar]
- Vermaak, J.; Doucet, A.; Pérez, P. Maintaining multi-modality through mixture tracking. In Proceedings of the IEEE International Conference on Computer Vision, Nice, France, 13–16 October 2003; Volume 2, pp. 1110–1116. [Google Scholar]
- Song, Y.; Yoon, Y.; Yoon, K.; Jeon, M.; Park, D.; Paik, J. Online Multi-Object Tracking and Segmentation with GMPHD Filter and Mask-Based Affinity Fusion. arXiv 2020, arXiv:2009.00100. [Google Scholar]
- Chen, L.C.; Hermans, A.; Papandreou, G.; Schroff, F.; Wang, P.; Adam, H. MaskLab: Instance Segmentation by Refining Object Detection with Semantic and Direction Features. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4013–4022. [Google Scholar]
- Hafiz, A.M.; Bhat, G.M. A Survey on Instance Segmentation. Int. J. Multimed. Inf. Retr. 2020, 9, 171–189. [Google Scholar] [CrossRef]
- He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
- Cox, I.J.; Hingorani, S.L. An efficient implementation of Reid's multiple hypothesis tracking algorithm and its evaluation for the purpose of visual tracking. IEEE Trans. Pattern Anal. Mach. Intell. 1996, 18, 138–150. [Google Scholar] [CrossRef]
- Kim, C.; Li, F.; Rehg, J.M. Multi-Object Tracking with Neural Gating Using Bilinear LSTM. In Proceedings of the 15th European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 208–224. [Google Scholar]
- Zhang, Y.; Sheng, H.; Wu, Y.; Wang, S.; Lyu, W.; Ke, W.; Xiong, Z. Long-Term Tracking with Deep Tracklet Association. IEEE Trans. Image Process. 2020, 29, 6694–6706. [Google Scholar] [CrossRef]
- Chu, Q.; Ouyang, W.; Li, H.; Wang, X.; Liu, B.; Yu, N. Online Multi-Object Tracking Using CNN-Based Single Object Tracker with Spatial-Temporal Attention Mechanism. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1–10. [Google Scholar]
- Saleh, F.; Aliakbarian, S.; Rezatofighi, H.; Salzmann, M.; Gould, S.; Petersson, L.; Garg, S. Probabilistic Tracklet Scoring and Inpainting for Multiple Object Tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual Conference, 19–25 June 2021; pp. 14324–14334. [Google Scholar]
- Zhang, Y.; Wang, C.; Wang, X.; Zeng, W.; Liu, W. FairMOT: On the Fairness of Detection and Re-Identification in Multiple Object Tracking. International Journal of Computer Vision 2021, 129, 3069–3087. [Google Scholar] [CrossRef]
- Duan, K.; Bai, S.; Xie, L.; Qi, H.; Huang, Q.; Tian, Q. CenterNet: Keypoint Triplets for Object Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6568–6577. [Google Scholar]
- Milan, A.; Leal-Taixé, L.; Reid, I.; Roth, S.; Schindler, K. MOT16: A Benchmark for Multi-Object Tracking. arXiv 2016, arXiv:1603.00831. [Google Scholar]
- Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2014; Volume 8693, pp. 740–755. [Google Scholar]
- Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The Cityscapes Dataset for Semantic Urban Scene Understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 3213–3223. [Google Scholar]
- Neuhold, G.; Ollmann, T.; Bulo, S.R.; Kontschieder, P. The Mapillary Vistas Dataset for Semantic Understanding of Street Scenes. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 5000–5009. [Google Scholar]
- Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. Adv. Neural Inf. Process. Syst. 2015, 28, 159–183. [Google Scholar] [CrossRef] [PubMed]
- Long, J.; Shelhamer, E.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
- Khoreva, A.; Benenson, R.; Ilg, E.; Brox, T.; Schiele, B. Lucid Data Dreaming for Video Object Segmentation. Int. J. Comput. Vis. 2019, 127, 1175–1197. [Google Scholar] [CrossRef]
- Li, X.; Loy, C.C. Video Object Segmentation with Joint Re-identification and Attention-Aware Mask Propagation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar]
- Perazzi, F.; Khoreva, A.; Benenson, R.; Schiele, B.; Sorkine-Hornung, A. Learning video object segmentation from static images. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3491–3500. [Google Scholar]
- Wang, W.; Shen, J.; Xie, J.; Cheng, M.-M.; Ling, H. Super-Trajectory for Video Segmentation. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1680–1688. [Google Scholar]
- Bertasius, G.; Torresani, L. Classifying, Segmenting, and Tracking Object Instances in Video with Mask Propagation. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9736–9745. [Google Scholar]
- Sun, C.; Wang, D.; Lu, H. Occlusion-Aware Fragment-Based Tracking with Spatial-Temporal Consistency. IEEE Trans. Image Process. 2016, 25, 3814–3825. [Google Scholar] [CrossRef] [PubMed]
- Dong, X.; Shen, J.; Yu, D.; Wang, W.; Liu, J.; Huang, H. Occlusion-Aware Real-Time Object Tracking. IEEE Trans. Multimed. 2017, 19, 763–771. [Google Scholar] [CrossRef]
- Stadler, D.; Beyerer, J. Improving Multiple Pedestrian Tracking by Track Management and Occlusion Handling. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Virtual Conference, 19–25 June 2021; pp. 10953–10962. [Google Scholar]
- Yang, L.; Fan, Y.; Xu, N. Video Instance Segmentation. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5187–5196. [Google Scholar]
- Song, C.; Wang, Y.; Huang, Y.; Ouyang, W.; Wang, L. Mask-Guided Contrastive Attention Model for Person Re-Identification. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 1179–1188. [Google Scholar]
- Qi, M.; Wang, S.; Huang, G.; Lu, H.; Zhang, L. Mask-guided dual attention-aware network for visible-infrared person re-identification. Multimed. Tools Appl. 2021, 80, 17645–17666. [Google Scholar] [CrossRef]
- Cai, H.; Wang, Z.; Cheng, J. Multi-scale body-part mask guided attention for person re-identification. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 16–20 June 2019; pp. 1555–1564. [Google Scholar]
- Tian, M.; Yi, S.; Li, H.; Shen, X.; Jin, X.; Wang, X. Eliminating Background-Bias for Robust Person Re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 5794–5803. [Google Scholar]
- Dendorfer, P.; Osep, A.; Milan, A.; Schindler, K.; Cremers, D.; Reid, I.; Roth, S.; Leal-Taixé, L. MOTChallenge: A Benchmark for Single-Camera Multiple Target Tracking. Int. J. Comput. Vis. 2021, 129, 845–881. [Google Scholar] [CrossRef]
- Geiger, A.; Lenz, P.; Urtasun, R. Are We Ready for Autonomous Driving? The KITTI Vision Benchmark Suite. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 3354–3361. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
- Perazzi, F.; Pont-Tuset, J.; McWilliams, B.; Van Gool, L.; Gross, M.; Sorkine-Hornung, A. A Benchmark Dataset and Evaluation Methodology for Video Object Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
- Yang, F.; Chang, X.; Dang, C.; Ge, Z.; Zheng, N. ReMOTS: Self-Supervised Refining Multi-Object Tracking and Segmentation. arXiv 2020, arXiv:2007.03200. [Google Scholar]
- Wang, S.; Sheng, H.; Yang, D.; Zhang, Y.; Wu, Y.; Wang, S. Extendable Multiple Nodes Recurrent Tracking Framework With RTU++. IEEE Trans. Image Process. 2022, 31, 5257–5271. [Google Scholar] [CrossRef] [PubMed]
- Brasó, G.; Cetintas, O.; Leal-Taixé, L. Multi-Object Tracking and Segmentation Via Neural Message Passing. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022. [Google Scholar]
Method | Technical Features | Advantages |
---|---|---|
Online methods [2,3,4,5,6,7,8] | Sequential frame processing | Real-time performance |
Filter-based [9,10] | Motion/appearance modeling | Fast computation |
MHT [5,14] | Multi-Hypothesis Trees | Comprehensive association |
Kim et al. [15] | CNN features with MHT | Improved appearance modeling |
Zhang et al. [16] | LSTM-enhanced MHT | Long-term dependency capture |
Chu et al. [17] | Spatial–temporal attention | Occlusion handling |
Sadeghian et al. [18] | LSTM networks | Temporal dependency modeling |
Peng et al. [8] | Deep residual networks | Frame-to-frame similarity |
Yi et al. [19] | CenterNet | Background bias reduction |
Mask IoU Threshold ($\theta$) | MOTSA ↑ | sMOTSA ↑ | (FP + FN) ↓ |
---|---|---|---|
 | 63.5 | 45.5 | 2509 |
 | 64.9 | 46.8 | 2410 |
 | 65.0 | 47.9 | 2392 |
 | 64.8 | 46.3 | 2452 |
Method | MOTSA ↑ | MOTSP ↑ | MT ↑ |
---|---|---|---|
Baseline | 74.9 | 86.7 | 42 |
+LI | 74.7 | 86.5 | 110 |
+LI+OA | 75.3 | 86.6 | 105 |
+OA+MP | 75.4 | 87.0 | 110 |