Hierarchical Multi-Prototype Appearance Memory: A Plug-and-Play Module for Identity-Stable Online Multi-Object Tracking

Zhang, Wenning; Liu, Mintao; Cao, Yangjie; Cai, Jihao; Wang, Chao; Xia, Huili; Xu, Kunming

doi:10.3390/electronics15112357

by Wenning Zhang ¹,², Mintao Liu ¹, Yangjie Cao ¹, Jihao Cai ¹, Chao Wang ¹, Huili Xia ³,* and Kunming Xu

¹ School of Cyber Science and Engineering, Zhengzhou University, Zhengzhou 450002, China
² Software College, Zhongyuan University of Technology, Zhengzhou 450007, China
³ Henan Engineering Research Center for Intelligent Data Processing and Security, Zhengzhou 450000, China
* Author to whom correspondence should be addressed.

Electronics 2026, 15(11), 2357; https://doi.org/10.3390/electronics15112357

Submission received: 14 April 2026 / Revised: 13 May 2026 / Accepted: 15 May 2026 / Published: 29 May 2026

(This article belongs to the Special Issue Advances in Image Processing and Computer Vision)

Abstract

Online multi-object tracking (MOT) aims to maintain consistent target identities across video frames, yet it remains vulnerable to identity switches under occlusion and appearance variation. Many existing trackers rely on single-prototype exponential moving average (EMA) memory, which is efficient but prone to contamination, over-smoothing, and staleness. To address this issue, we propose Hierarchical Multi-Prototype Appearance Memory (HMP), a plug-and-play module for online MOT. HMP separates stable long-term identity anchors from short-term transitional evidence through a multi-prototype long-term memory and a short first-in-first-out (FIFO) queue. A unified joint reliability score governs memory writing and maintenance, and a frozen two-stage association strategy first performs stable primary matching and then allows conservative short-term recovery only on residual cases. Experiments on MOT17 and MOT20 show that HMP improves identity continuity while preserving competitive overall tracking quality. Controlled ablations further support the effectiveness of the proposed memory representation, reliability control, and staged evidence usage under fixed upstream modules.

Keywords: online multi-object tracking; appearance memory; multi-prototype long-term memory; identity continuity; two-stage association strategy

1. Introduction

Online multi-object tracking (MOT) is ultimately judged by whether the same target can retain a consistent identity over time, even under occlusion, crowd interaction, missed detections, and abrupt appearance change. Most modern online trackers adopt the tracking-by-detection paradigm, where a detector produces candidate boxes, a re-identification (ReID) module extracts appearance embeddings, a motion model predicts track states, and an association solver matches detections to tracks [1,2,3,4,5,6]. Within this pipeline, appearance cues serve as the final safeguard of identity continuity. Once the appearance state of a track becomes unreliable, subsequent associations can deteriorate rapidly even when the detector and motion model remain unchanged.

Existing work often attributes identity switches to insufficient single-frame appearance discrimination and therefore seeks stronger ReID embeddings through improved training pipelines, stronger backbones, jointly optimized detection-and-embedding models, or condition-aware appearance modeling [3,7,8,9,10]. These directions are effective, yet they typically demand additional data, computation, or retraining, and their gains may be tied to specific model configurations. In many deployment settings, however, the detector, ReID model, and motion model are already fixed due to system constraints, engineering cost, or fairness requirements. Under such conditions, improving upstream embeddings is not always the most practical path. Instead, the design of the appearance-memory mechanism itself becomes an important yet comparatively underexplored source of performance improvement.

We therefore ask a different question: when the upstream modules are fixed, can identity stability still be improved by redesigning the appearance-memory mechanism alone? Our observation is that many identity switches arise not because single-frame features are fundamentally inseparable, but because the memory used for association becomes contaminated, over-smoothed, or stale during long-term online updating [9,10,11,12,13]. This problem is especially severe under progressive occlusion, viewpoint change, and crowded interaction, where a single prototype is repeatedly pulled by heterogeneous observations. Consequently, the track representation gradually drifts away from a clean identity anchor, and errors accumulate across subsequent frames.

Figure 1 provides a schematic illustration of this challenge. The target appearance evolves across five consecutive visibility states, from full visibility to partial occlusion, severe occlusion, partial reappearance, and back to full visibility. This visual sequence highlights that observations available to the tracker are not equally reliable over time: the middle stages are more likely to contain incomplete or contaminated appearance cues, whereas the early and late stages provide cleaner identity evidence. Under a single-prototype exponential moving average (EMA) scheme, updating continuously across all stages can drag the prototype toward corrupted intermediate observations and induce representation drift. Conversely, rejecting updates too rigidly may prevent the model from adapting to normal appearance change after reappearance. Hence, the challenge is not merely whether to update, but how to distinguish stable long-term identity evidence from short-term observations that are only locally valid or partially contaminated.

Motivated by this observation, we propose a Hierarchical Multi-Prototype Appearance Memory (HMP) module. The core idea is not simply to store more features, but to reorganize the lifecycle of track-level identity evidence: stable long-term evidence is preserved as identity anchors, short-term evidence is isolated as transitional support, and both are written and used under explicit reliability control.

It should be emphasized that HMP does not claim that multi-prototype storage, gated updating, or staged association is individually new. Instead, its novelty lies in coupling these otherwise separate design choices into a unified track-level appearance-evidence lifecycle. Unlike generic feature-bank accumulation or generic cascaded matching, HMP jointly specifies how appearance evidence is represented, which observations are allowed to enter memory, and which evidence source is permitted to participate at each association stage. In this lifecycle, stable long-term prototypes and short-term transitional evidence are assigned different roles; the reliability score governs memory admission and maintenance; and the frozen two-stage association strategy determines when each type of evidence is allowed to affect matching. Our resulting claim is deliberately narrow and testable: even when the detector, ReID extractor, and motion model are left unchanged, identity continuity can still be improved by redesigning how track-level appearance evidence is represented, written, and reused.

The main contributions of this work are as follows.

Hierarchical Memory Architecture: We propose a hierarchical appearance-memory design for online MOT that decomposes track appearance memory into a multi-prototype long-term memory and a short-term first-in-first-out (FIFO) queue, thereby explicitly separating stable identity anchors from short-lived transitional evidence.
Unified Reliability Control Mechanism: We formulate a unified reliability control mechanism in which a single joint reliability score drives reliability-controlled memory updating, so that appearance consistency, motion consistency, and detection quality jointly regulate long-memory writing, short-queue admission, and prototype maintenance under one shared criterion.
Frozen Two-Stage Association Strategy: We develop a frozen two-stage association strategy in which stable long-memory evidence is used for primary matching, whereas short-term evidence is reserved for conservative residual recovery, preventing fragile transitional cues from overturning high-confidence matches already established in Stage 1.
Controlled Evaluation Protocol: We design a two-layer evaluation protocol in which official MOTChallenge results provide external positioning of the complete tracker, while controlled ablations under BoT-SORT-ReID and Deep OC-SORT isolate the contribution of the proposed memory redesign under fixed upstream modules.

This paper deliberately adopts a plug-and-play setting. The detector, ReID extractor, motion model, and assignment solver are treated as fixed components, and the proposed method intervenes only at the track-memory level. The contribution is therefore not a stronger feature extractor or a new end-to-end pipeline, but a more coherent way to store, filter, and reuse the appearance evidence that is already available. This scope is practically meaningful because, in many deployed MOT systems, retraining or replacing the full upstream stack is costly, constrained, or simply infeasible.

2. Related Work

2.1. Tracking-by-Detection and Identity Consistency

Most modern MOT methods follow the tracking-by-detection paradigm, in which the detector first generates candidate boxes, the motion model provides prediction and gating, and the association module combines motion and appearance cues to establish trajectory continuity. Classical systems such as SORT and DeepSORT established this formulation, and many later methods have strengthened different components within the same overall pipeline, including association robustness, motion modeling, and appearance usage [1,2,10,14,15,16,17]. Despite the maturity of this general architecture, identity switches remain a major obstacle to long-term identity consistency in crowded scenes with occlusion, interaction, and abrupt appearance variation [10,14,15,16,18]. These observations indicate that, beyond improving the detector or the motion model, online MOT still critically depends on how each track preserves a stable and discriminative identity representation over time. From this perspective, our work studies identity consistency at the track-memory level and asks whether a stronger appearance-memory design alone can improve long-term identity preservation under a fixed tracking pipeline.

2.2. Upstream Representation Enhancement and Track-Level Memory Design

A direct way to reduce identity switches is to improve the quality of upstream appearance embeddings. Representative directions include jointly optimized detection-and-embedding models such as FairMOT, stronger similarity-learning pipelines such as QDTrack, dedicated re-identification frameworks such as FastReID and DiPerceiveNet, and condition-aware appearance modeling [7,8,9,19,20,21]. These approaches are effective, but they usually rely on additional data, retraining, or stronger upstream representation learning, and can therefore be viewed as improvements to the detector–embedding stack itself.

A different line of work keeps the detector, ReID extractor, and motion model fixed, and instead studies how appearance evidence is maintained and reused at the track level. This direction is particularly meaningful in deployment-oriented settings, where upstream modules may already be constrained by engineering cost, computational budget, or fairness requirements in controlled comparisons. From this perspective, track-level memory design becomes a practically attractive alternative: rather than replacing the representation backbone, the goal is to extract more reliable identity continuity from the appearance evidence that is already available. Following this plug-and-play perspective, our work does not compete by retraining stronger upstream features; instead, it focuses on improving identity continuity by redesigning the storage, filtering, and reuse of track-level appearance evidence under fixed upstream modules.

2.3. Appearance Memory Representation and Update Control

In many tracking-by-detection systems, the appearance state of a track is maintained as a compact prototype or aggregated embedding and updated over time through exponential moving average (EMA) or related rules [2,10,13]. This design is simple and efficient, but it implicitly assumes that target appearance evolves smoothly enough to be summarized by a single dominant representation. In practice, however, online MOT must handle recurrent and sometimes multimodal appearance variation caused by viewpoint change, pose change, illumination fluctuation, interaction, and progressive occlusion. Once heterogeneous observations are repeatedly absorbed into one prototype, the resulting track representation can become over-smoothed, contaminated, or stale.

To preserve richer temporal evidence, memory-based studies introduce longer-range historical cues or explicit memory structures beyond a single prototype [11,12,13]. Such designs are useful for extending appearance context, but they often do not explicitly distinguish the roles of stable long-term identity anchors and short-term transitional observations. In practice, these two forms of evidence serve different purposes: the former should remain sufficiently clean and stable to support robust primary matching, whereas the latter may still be useful for short-range adaptation or recovery even when it is less reliable. Without an explicit role separation, long-term stability and short-term adaptability are difficult to optimize simultaneously.

Different from generic memory banks that mainly enlarge the amount of retained historical evidence, HMP treats memory as role-specific track-level evidence. Long-memory prototypes, short-queue entries, reliability-controlled writing, and frozen stage-wise usage are jointly specified, so the novelty boundary lies in how historical appearance evidence is admitted, organized, and scheduled rather than in feature accumulation alone.

A closely related issue concerns how appearance memory should be updated. To reduce contamination from occlusion, ambiguous interactions, and unreliable detections, prior methods often rely on confidence filtering, gated updating, or selective maintenance strategies [9,10,13]. These efforts reflect a shared intuition that low-reliability observations should not be written into appearance memory indiscriminately. However, update control is still often handled through local heuristics or module-specific rules, rather than through a unified perspective on memory maintenance. As a result, memory representation and update reliability are frequently studied as related but separate problems, even though they jointly determine how track-level identity evidence evolves over time. In contrast, our work treats these two aspects jointly: it couples hierarchical appearance representation with unified reliability-controlled maintenance, so that memory structure and memory writing are optimized as one coherent identity-modeling mechanism.

2.4. Staged Association and Residual Recovery

Staged association is another widely adopted design pattern in online MOT. A common strategy is to establish high-confidence matches first and then process the remaining unmatched tracks and detections more conservatively in later passes, as exemplified by ByteTrack and subsequent tracking-by-detection variants [14,15,16]. Such cascaded or residual matching schemes are effective because the residual unmatched set is typically more ambiguous and therefore requires stricter recovery criteria.

Nevertheless, most staged-association designs operate mainly at the scheduler or association level. They define multiple matching passes, but they do not always explicitly regulate which type of appearance evidence should be used at each stage. In particular, stable long-term identity cues and short-term transitional evidence are often treated as generic inputs to association, rather than as evidence sources with different trust levels and different appropriate roles. This leaves open an important design question: whether stage-wise matching can be made more reliable by coupling different stages to differently managed forms of appearance memory, instead of relying on a single undifferentiated appearance representation throughout the entire association process. Our work addresses this question by explicitly binding stage-specific evidence usage to the memory hierarchy: stable long-memory prototypes are reserved for primary matching, whereas short-term transitional evidence is confined to conservative residual recovery through a frozen two-stage association policy.

2.5. Transformer-Based Tracking

Recent Transformer-based trackers formulate multi-object tracking as an end-to-end temporal association problem. TrackFormer [22] propagates object queries across frames to maintain identities, while MOTR [23] extends the query-based formulation for online tracking. These methods differ from HMP in their target setting: they redesign the tracking architecture and learn temporal association within an end-to-end framework, whereas HMP targets conventional tracking-by-detection pipelines and modifies only the track-level appearance-memory path under fixed upstream modules.

Memory-augmented Transformer trackers are especially relevant to this discussion. MeMOT and MeMOTR introduce memory into Transformer-based MOT to preserve historical information and improve temporal association [11,12]. Their memories are usually represented as historical tokens, object queries, or attention-based temporal states learned inside a redesigned tracking architecture. HMP differs in three aspects. First, it does not introduce learnable temporal attention or an end-to-end query tracker; it is inserted into existing tracking-by-detection pipelines. Second, its memory units are explicit appearance prototypes and FIFO short-queue observations rather than learned query memories. Third, HMP couples memory admission with frozen stage-specific evidence usage, assigning stable long-term anchors and short-term transitional evidence different levels of association authority. Thus, HMP is complementary to memory-augmented Transformer trackers, but its novelty boundary is the deterministic track-level appearance-memory lifecycle under fixed upstream modules.

2.6. Temporal Representation Stability Beyond Pedestrian MOT

Temporal representation stability has also been studied in broader temporal visual understanding tasks. For example, remote sensing spatiotemporal vision-language models consider multi-temporal interpretation tasks such as change captioning, change question answering, and change grounding [24]. Although these tasks differ from online pedestrian MOT, they share a related concern: useful temporal evidence should be preserved while unstable observations should not dominate the representation. This motivates our discussion of stable long-term evidence and short-term transitional evidence from a broader temporal-representation perspective.

Overall, prior work has explored upstream representation enhancement, appearance-memory modeling, selective updating, staged association, Transformer-based temporal association, memory-augmented Transformer mechanisms, and broader temporal representation stability from different perspectives [11,12,22,23,24]. HMP is positioned differently from these directions: it does not redesign the detector, ReID extractor, motion model, or the overall tracking architecture, and it does not learn query-level temporal memory. Instead, it focuses on how explicit track-level appearance evidence is represented, filtered, and reused within a fixed online tracking-by-detection pipeline. This positioning supports the central motivation of HMP: stable evidence and transitional evidence should be separated and used with different levels of trust during association.

Table 1 summarizes representative MOT research directions and clarifies that HMP is not intended as a replacement for stronger upstream representations or end-to-end tracking architectures. Instead, the present work focuses on a narrower plug-and-play question under fixed upstream modules: whether identity continuity can be improved by redesigning how track-level appearance evidence is represented, updated, and reused over time. This perspective motivates a unified appearance-memory view in which hierarchical representation, reliability-controlled maintenance, and staged usage are treated as coupled parts of the same identity-modeling problem rather than as loosely connected heuristics.

3. Materials and Methods

3.1. Overall Framework and Problem Definition

We study track-level appearance-memory design for online multi-object tracking under fixed upstream modules. The proposed HMP module is organized around three coupled components: Hierarchical Memory Architecture, Unified Reliability Control Mechanism, and Frozen Two-Stage Association Strategy. Together, these components define how appearance evidence is represented, maintained, and used during online matching.

As illustrated in Figure 2, HMP separates appearance evidence into two functional branches. The long-memory branch stores stable identity anchors and supports Stage 1 primary matching, whereas the short-queue branch retains recent observations for conservative recovery only on the residual set after Stage 1. This design makes explicit the division of labor between stable identity anchoring and limited short-term adaptation. The reliability-controlled writing and maintenance mechanism will be introduced in Section 3.3.

Let the detection set at frame $t$ be

$$D_t = \{d_j^t\}_{j=1}^{N_t}, \tag{1}$$

where $N_t$ denotes the number of detections in the current frame. After feature extraction by the ReID network, each detection corresponds to a normalized appearance feature

$$f_j^t \in \mathbb{R}^d. \tag{2}$$

Meanwhile, let the current active track set be

$$T_{t-1} = \{\tau_i\}_{i=1}^{K_{t-1}}, \tag{3}$$

where $K_{t-1}$ is the number of tracks to be associated. Under this formulation, HMP intervenes at the appearance-memory and stage-wise evidence-usage levels while leaving the detector, ReID extractor, motion prediction/gating pipeline, and one-to-one assignment primitive unchanged. The Hierarchical Memory Architecture specifies how appearance evidence is represented through the long-memory/short-queue structure, the Unified Reliability Control Mechanism specifies how evidence is admitted and maintained through a shared reliability signal, and the Frozen Two-Stage Association Strategy specifies how the two memory layers are scheduled during matching. Together, these components define a closed-loop lifecycle for track-level identity evidence.

3.2. Hierarchical Memory Architecture

To uniformly measure the appearance discrepancy between detection features and track memory, cosine distance is adopted as the basic distance metric [21]:

$$d(a, b) = 1 - a^{\top} b. \tag{4}$$

Since all appearance features are $\ell_2$-normalized, Equation (4) directly measures cosine distance; a smaller value means higher appearance similarity. We use this metric consistently for long-memory matching, short-queue recovery, and reliability computation.
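As a brief worked note (standard linear algebra, not a statement taken from the experiments), the unit-norm assumption also fixes the range of Equation (4) and ties it to the Euclidean distance between the same two features:

```latex
\|a - b\|_2^2 = \|a\|_2^2 + \|b\|_2^2 - 2\,a^{\top} b = 2\,(1 - a^{\top} b) = 2\,d(a, b),
\qquad 0 \le d(a, b) \le 2 .
```

Ranking candidates by $d(a, b)$ is therefore equivalent to ranking them by Euclidean distance whenever both features are $\ell_2$-normalized.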

3.2.1. Long Memory: Multi-Prototype Long-Term Representation

Conventional single-prototype updating implicitly assumes that one averaged appearance vector is sufficient to describe a track. In real scenes, however, the same target often exhibits multiple distinct yet relatively stable appearance modes under different viewpoints, poses, and occlusion conditions. Forcing these modes into a single prototype tends to blur the representation and weaken the discriminability of appearance memory.

To address this issue, for each track $\tau_i$, we maintain a compact set of at most $M$ long-term appearance prototypes, denoted as

$$P_i^L = \{p_{i,m}^L\}_{m=1}^{M_i}, \quad 1 \le M_i \le M. \tag{5}$$

Here, $M_i$ denotes the current number of active long-term prototypes of track $\tau_i$. Given the current detection feature $f_j^t$, the long-term appearance distance between track $\tau_i$ and detection $d_j^t$ is defined as

$$d_{ij}^L = \min_{p \in P_i^L} d(p, f_j^t). \tag{6}$$

The minimum operation in Equation (6) lets the closest stable prototype determine the long-memory similarity between the track and the detection. Compared with single-prototype averaging, the long memory preserves several stable appearance modes separately and serves as the primary identity anchor in Stage 1.
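The effect of the minimum operation can be illustrated with a minimal sketch. The helper names (`l2_normalize`, `cosine_distance`, `long_memory_distance`) and the two-dimensional toy features are assumptions introduced here for illustration only; they are not part of the released method.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit Euclidean norm (assumes v is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_distance(a, b):
    """Eq. (4): d(a, b) = 1 - a^T b, valid for unit-norm inputs."""
    return 1.0 - sum(x * y for x, y in zip(a, b))

def long_memory_distance(prototypes, feature):
    """Eq. (6): distance from the detection feature to the closest prototype."""
    return min(cosine_distance(p, feature) for p in prototypes)

# Toy track with two stable appearance modes (illustrative values only).
prototypes = [l2_normalize([1.0, 0.0]), l2_normalize([0.0, 1.0])]
feature = l2_normalize([0.1, 0.9])  # current detection, close to mode 2

d_long = long_memory_distance(prototypes, feature)

# For comparison: one averaged prototype lies between both modes.
mean_prototype = l2_normalize([0.5, 0.5])
d_single = cosine_distance(mean_prototype, feature)
```

In this toy case `d_long` is smaller than `d_single`, because the detection is compared against the mode it actually resembles rather than against a blend of both modes.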

3.2.2. Short Queue: Short-Term Transitional Representation

Although the long memory provides stable identity anchors, its update threshold is relatively high and its response to abrupt appearance changes is limited. For example, after short-term occlusion, rapid turning, or local visibility fluctuations, the current detection may temporarily deviate from long-term prototypes while still remaining highly consistent with observations from the most recent frames. If only the long memory is used, such targets are likely to be missed during the primary matching stage.

To capture such transient yet useful evidence, we further maintain, for each track, a short queue of maximum length $S$:

$$Q_i^S = \{q_{i,s}\}_{s=1}^{S_i}, \quad 0 \le S_i \le S. \tag{7}$$

Here, $S_i$ denotes the current number of valid entries stored in the queue of track $\tau_i$. A first-in-first-out (FIFO) strategy is used to cache recent reliable appearance observations. Accordingly, the short-term appearance distance between track $\tau_i$ and detection $d_j^t$ is defined as

$$d_{ij}^S = \min_{q \in Q_i^S} d(q, f_j^t). \tag{8}$$

When $Q_i^S$ is empty, Stage 2 recovery is disabled for that track. The short queue preserves recent state traces within a local temporal window and acts only as supplementary evidence in the residual association stage.
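A bounded FIFO queue of this kind can be sketched with a standard double-ended queue. The class name `ShortQueue`, the choice of returning `None` for an empty queue, and the toy capacity are assumptions made for this illustration.

```python
from collections import deque

def cosine_distance(a, b):
    """Eq. (4) for unit-norm vectors."""
    return 1.0 - sum(x * y for x, y in zip(a, b))

class ShortQueue:
    """Per-track FIFO cache of recent appearance features (Eq. (7))."""

    def __init__(self, max_len):
        # deque(maxlen=S) drops the oldest entry once S entries are stored.
        self.entries = deque(maxlen=max_len)

    def push(self, feature):
        self.entries.append(feature)

    def distance(self, feature):
        """Eq. (8). Returns None for an empty queue, which the caller
        can treat as 'Stage 2 recovery disabled for this track'."""
        if not self.entries:
            return None
        return min(cosine_distance(q, feature) for q in self.entries)

# Usage with a hypothetical capacity S = 2.
queue = ShortQueue(max_len=2)
empty_result = queue.distance([1.0, 0.0])   # None: nothing cached yet
for f in ([1.0, 0.0], [0.0, 1.0], [0.6, 0.8]):
    queue.push(f)                            # the first feature is evicted
d_short = queue.distance([1.0, 0.0])
```

Because the oldest entry is evicted automatically, the queue stays bounded at $S$ entries without any explicit maintenance rule.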

The two-level design is a deliberate minimal decomposition of the stability–adaptability trade-off. The long memory captures slowly evolving and high-reliability identity modes, whereas the short queue preserves recent transitional states for residual recovery. This separation avoids forcing one memory state to simultaneously preserve long-term purity and absorb short-term appearance changes, keeping the memory mechanism interpretable and bounded.
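The division of labor between the two memory layers can be sketched as a two-pass matcher over precomputed distance matrices. This is a simplified illustration under stated assumptions, not the full association procedure: `greedy_match` is a stand-in for the pipeline's unchanged one-to-one assignment solver, the gates `gate_long` and `gate_short` are hypothetical thresholds, and motion gating is omitted.

```python
def greedy_match(cost, rows, cols, gate):
    """Stand-in one-to-one matcher: accept admissible pairs by increasing cost.
    Entries equal to None (e.g., an empty short queue) are never matched."""
    candidates = sorted(
        (cost[i][j], i, j)
        for i in rows
        for j in cols
        if cost[i][j] is not None and cost[i][j] <= gate
    )
    used_rows, used_cols, matches = set(), set(), []
    for _, i, j in candidates:
        if i not in used_rows and j not in used_cols:
            matches.append((i, j))
            used_rows.add(i)
            used_cols.add(j)
    return matches

def two_stage_associate(d_long, d_short, gate_long, gate_short):
    """Stage 1 uses long-memory distances; Stage 2 sees only the residual set."""
    n_tracks = len(d_long)
    n_dets = len(d_long[0]) if n_tracks else 0
    stage1 = greedy_match(d_long, range(n_tracks), range(n_dets), gate_long)
    matched_tracks = {i for i, _ in stage1}
    matched_dets = {j for _, j in stage1}
    # Stage 1 matches are frozen: Stage 2 never revisits them.
    rest_tracks = [i for i in range(n_tracks) if i not in matched_tracks]
    rest_dets = [j for j in range(n_dets) if j not in matched_dets]
    stage2 = greedy_match(d_short, rest_tracks, rest_dets, gate_short)
    return stage1, stage2

# Toy example: two tracks, two detections (illustrative values only).
d_long = [[0.1, 0.9], [0.8, 0.7]]
d_short = [[0.0, 0.0], [0.9, 0.2]]
stage1, stage2 = two_stage_associate(d_long, d_short, gate_long=0.3, gate_short=0.25)
```

In the toy example, track 0 is matched in Stage 1 and is excluded from Stage 2 even though its short-queue distances are small, so short-term evidence only affects the residual track 1.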

3.3. Unified Reliability Control Mechanism

The Unified Reliability Control Mechanism first computes a joint reliability score and then uses it to regulate long-memory writing, short-queue admission, and prototype maintenance under one shared criterion.

3.3.1. Joint Reliability Score

The core issue in online appearance memory is not simply whether to update, but whether the current observation is reliable enough to be written into memory. Blindly updating under low-quality matches will continually reinforce incorrect representations caused by noise, occlusion, or mismatches and will further damage subsequent association results. Therefore, we construct a joint reliability score from three sources of evidence: appearance consistency, motion consistency, and detection quality. This score serves as the unified control signal for long-memory writing, short-queue admission, and prototype maintenance; by filtering which short-term observations are allowed to enter the queue, it also indirectly constrains the evidence available to Stage 2 recovery.

First, based on the long-term appearance distance, the appearance reliability is defined as

$$r_{ij}^{\mathrm{app}} = \sigma\left(\frac{\tau_a - d_{ij}^L}{\kappa_a}\right), \tag{9}$$

where $\tau_a$ is the appearance threshold, $\kappa_a$ is the scaling parameter, and $\sigma(\cdot)$ denotes the sigmoid function. A smaller $d_{ij}^L$ therefore gives higher appearance reliability.

Second, appearance similarity alone is insufficient to guarantee match reliability, especially in crowded scenes where different identities may look similar. Therefore, we further impose motion consistency constraints. Let $z_j$ denote the measurement vector of detection $d_j^t$, let $\hat{z}_i$ denote the Kalman-predicted measurement of track $\tau_i$, and let $\Sigma_i$ denote the corresponding innovation covariance. We define the motion inconsistency as the squared Mahalanobis distance

$$g_{ij} = (z_j - \hat{z}_i)^{\top} \Sigma_i^{-1} (z_j - \hat{z}_i), \tag{10}$$

and then convert it into the motion reliability

$$r_{ij}^{\mathrm{mot}} = \sigma\left(\frac{\tau_m - g_{ij}}{\kappa_m}\right), \tag{11}$$

where $\tau_m$ and $\kappa_m$ denote the motion-inconsistency threshold and the scaling parameter, respectively [1,25]. Lower motion inconsistency leads to higher motion reliability.

In addition, detection quality also directly affects the reliability of memory updating. For a low-confidence detection, the bounding box is more likely to be inaccurate, and the extracted appearance feature is more easily contaminated by background noise. We use the detector confidence $s_j^t$ of detection $d_j^t$ directly as the detection reliability:

$$r_j^{\det} = s_j^t \in [0, 1]. \tag{12}$$

On this basis, the three types of evidence are fused into a unified joint reliability score:

$$r_{ij} = \lambda_A r_{ij}^{\mathrm{app}} + \lambda_M r_{ij}^{\mathrm{mot}} + \lambda_Q r_j^{\det}, \tag{13}$$

where $\lambda_A$, $\lambda_M$, and $\lambda_Q$ are fixed nonnegative fusion weights satisfying $\lambda_A + \lambda_M + \lambda_Q = 1$. Since all three components lie in $[0, 1]$, $r_{ij}$ also lies in $[0, 1]$. This scalar score provides a shared admission criterion for the subsequent memory-writing decisions. The score is computed for candidate track–detection pairs, but memory writing is applied only to associations retained in the final matching result.
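The fusion in Equations (9), (11), (12), and (13) can be sketched as a single function of three scalars. The function name `joint_reliability` and every numeric value in the usage example (thresholds, scales, and the weights 0.5/0.3/0.2) are illustrative assumptions, not the settings used in the experiments.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def joint_reliability(d_long, g, det_conf, *, tau_a, kappa_a, tau_m, kappa_m, weights):
    """Fuse appearance, motion, and detection reliability into one score.

    d_long   : long-memory appearance distance, Eq. (6)
    g        : motion inconsistency, Eq. (10)
    det_conf : detector confidence in [0, 1], Eq. (12)
    weights  : (lambda_A, lambda_M, lambda_Q), nonnegative and summing to 1
    """
    lam_a, lam_m, lam_q = weights
    if min(weights) < 0 or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("fusion weights must be nonnegative and sum to 1")
    r_app = sigmoid((tau_a - d_long) / kappa_a)   # Eq. (9)
    r_mot = sigmoid((tau_m - g) / kappa_m)        # Eq. (11)
    r_det = det_conf                              # Eq. (12)
    return lam_a * r_app + lam_m * r_mot + lam_q * r_det  # Eq. (13)

# Hypothetical parameters, chosen only to make the example concrete.
params = dict(tau_a=0.3, kappa_a=0.05, tau_m=9.0, kappa_m=2.0, weights=(0.5, 0.3, 0.2))
r_borderline = joint_reliability(0.3, 9.0, 0.9, **params)   # both sigmoids at 0.5
r_clean = joint_reliability(0.05, 1.0, 0.9, **params)       # close match, small innovation
```

A pair sitting exactly on both thresholds receives 0.5 from each sigmoid term, so its fused score is dominated by how the weights distribute that midpoint and the detector confidence; a clean match pushes both sigmoid terms toward 1.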

In this work, we use fixed fusion weights rather than learned or scene-adaptive weights for two reasons. First, the target setting of HMP is plug-and-play deployment under fixed upstream modules, where introducing another learned weighting network would weaken the controlled-attribution claim and require additional training data. Second, the three reliability terms have different failure modes and complementary roles: the appearance term measures consistency with the long-term identity state, the motion term suppresses spatially implausible matches, and the detection-quality term reduces the chance of writing features extracted from unreliable boxes. Therefore, the default setting assigns the largest weight to appearance consistency, a secondary weight to motion consistency, and a smaller but nonzero weight to detection quality. The robustness of this design choice is examined later through weight-sensitivity experiments.

The motion-reliability term is used as a consistency cue rather than as a standalone decision rule. Although it is computed from the Mahalanobis distance of the Kalman-filter prediction, the final reliability score also includes appearance consistency and detection quality. Therefore, an inaccurate motion prediction does not directly determine memory writing by itself; instead, it changes one component of the joint reliability score. This design keeps HMP compatible with mainstream online MOT pipelines, while the limitations of Kalman-based motion reliability under highly nonlinear or abrupt motion are further discussed in Section 5.3.

3.3.2. Reliability-Controlled Memory Updating

Building on the unified joint reliability score defined above, we next specify how this shared control signal governs memory updating in HMP. Concretely, the Unified Reliability Control Mechanism is realized through two coupled steps: reliability estimation and reliability-controlled memory updating. As shown in Figure 3, the memory-update strategy is not a uniform write-back process but a reliability-controlled admission process. When a matched observation satisfies the stricter threshold $\eta_L$, it is eligible for long-memory updating; when it satisfies the looser threshold $\eta_S$, it is admitted to the short queue. Thus, highly reliable evidence may update the long memory and also be cached in the short queue, whereas moderately reliable evidence bypasses long-memory writing but may still be retained as short-term transitional support. Prototype maintenance is further performed under deterministic rules to avoid uncontrolled accumulation. Accordingly, the long memory emphasizes purity and stability, whereas the short queue preserves recent adaptability under more relaxed admission conditions.

Long-Memory Update

For a successfully matched track–detection pair, updating the long memory is allowed only when the joint reliability reaches the long-memory writing threshold:

r_{i j} \geq η_{L} .

(14)

Here, η_{L} enforces high-confidence long-memory writing.

After Equation (14) is satisfied, we do not update all long-term prototypes simultaneously. Instead, we update only the prototype closest to the current detection:

m^{*} = \underset{m = 1, \dots, M_{i}}{\arg\min} \; d (p_{i, m}^{L}, f_{j}^{t}) .

(15)

The role of Equation (15) is to preserve modal specialization across multiple prototypes and to avoid mixing different appearance modes into the same representation.

Furthermore, we adapt the update step size according to the joint reliability:

λ_{i j} = η_{\max} r_{i j},

(16)

where 0 < η_{\max} \leq 1 denotes the maximum update strength. More reliable matches therefore use a larger but still bounded update step. Finally, the m^{*}-th long-term prototype is updated as

p_{i, m^{*}}^{L} \leftarrow \frac{(1 - λ_{i j}) p_{i, m^{*}}^{L} + λ_{i j} f_{j}^{t}}{{∥(1 - λ_{i j}) p_{i, m^{*}}^{L} + λ_{i j} f_{j}^{t}∥}_{2}} .

(17)

This update incorporates reliable new observations while keeping the prototype ℓ_{2}-normalized.
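Equations (14)-(17) can be read as a single update routine. The sketch below is illustrative only: it assumes that d is the cosine distance between ℓ_{2}-normalized vectors, and the helper names (`l2_normalize`, `cosine_distance`, `update_long_memory`) are hypothetical. The default thresholds follow Section 4.2.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_distance(a, b):
    """Distance between two unit vectors (an assumed choice for d)."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def update_long_memory(prototypes, feature, r, eta_long=0.75, eta_max=0.10):
    """Apply Equations (14)-(17) in place; return the updated index or None."""
    if r < eta_long:                                   # Equation (14): admission
        return None
    distances = [cosine_distance(p, feature) for p in prototypes]
    m_star = distances.index(min(distances))           # Equation (15): nearest
    lam = eta_max * r                                  # Equation (16): step size
    blended = [(1.0 - lam) * p + lam * f
               for p, f in zip(prototypes[m_star], feature)]
    prototypes[m_star] = l2_normalize(blended)         # Equation (17)
    return m_star


bank = [[1.0, 0.0], [0.0, 1.0]]
feat = l2_normalize([0.9, 0.1])
updated = update_long_memory(bank, feat, r=0.8)    # admitted, moves prototype 0
rejected = update_long_memory(bank, feat, r=0.5)   # below eta_long, no write
```

Only the nearest prototype moves, and by at most η_{\max} of the way toward the new feature, so the second prototype keeps its own appearance mode.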

Short-Queue Update

Unlike the long memory, the purpose of the short queue is to preserve recent transitional appearance states. It therefore allows a more relaxed enqueuing threshold. For a matched pair, if its joint reliability satisfies

r_{i j} \geq η_{S},

(18)

the current feature is written into the short queue; if the queue length exceeds S, the oldest entry is removed. The looser threshold η_{S} preserves locally useful transitional evidence without allowing the queue to replace the long-memory identity anchor.
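The admission rule in Equation (18) and the FIFO behavior can be sketched with a bounded deque. The class name `ShortQueue` and the toy reliability sequence are hypothetical; `collections.deque` with `maxlen` discards the oldest entry automatically once the capacity is reached.

```python
from collections import deque


class ShortQueue:
    """Bounded FIFO buffer of recent features admitted under Equation (18)."""

    def __init__(self, capacity=6, eta_short=0.25):
        self.eta_short = eta_short
        self.items = deque(maxlen=capacity)  # oldest entry dropped when full

    def push(self, feature, r):
        """Enqueue the feature if r >= eta_short; return whether it was admitted."""
        if r < self.eta_short:
            return False
        self.items.append(feature)
        return True


queue = ShortQueue(capacity=3)
reliabilities = [0.9, 0.1, 0.4, 0.3, 0.6]
accepted = [queue.push([float(k)], r) for k, r in enumerate(reliabilities)]
```

Here the second observation is rejected, and the first admitted feature is later displaced, leaving the three most recent admitted features in the queue.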

Multi-Prototype Management

To prevent the long-term prototype set from degenerating into a redundant collection of near-duplicate vectors, we further introduce three deterministic management operations: spawning, merging, and eviction. First, after a successful long-memory update, the matched observation is treated as a new-mode candidate only when r_{i j} \geq η_{L} and d_{i j}^{L} > τ_{spawn}. Here, d_{i j}^{L} denotes the pre-update long-memory distance in Equation (6), so the spawning decision is based on whether the current observation represents a genuinely new appearance mode before the selected prototype is updated. If this condition holds and M_{i} < M, a new prototype initialized from f_{j}^{t} is appended to the bank. If the same condition holds, but the memory is already full, we first evict the least valuable prototype according to the score defined below and then insert the new prototype. Second, if two prototypes within the same track become sufficiently close, they are merged to suppress fragmentation. In practice, we merge prototype pair (m_{a}, m_{b}) when d (p_{i, m_{a}}^{L}, p_{i, m_{b}}^{L}) < τ_{merge}, and the merged prototype is obtained by support-weighted averaging using the accumulated-support statistics maintained for eviction, followed by ℓ_{2} normalization. Third, when the bank must be compacted during spawning or periodic maintenance, we remove the least valuable prototype according to a score that jointly reflects historical support, recent access frequency, and inactivity:

V_{m} = α_{\sup} {\bar{S}}_{m} + α_{freq} {\bar{F}}_{m} - α_{idle} {\bar{I}}_{m},

(19)

where {\bar{S}}_{m}, {\bar{F}}_{m}, and {\bar{I}}_{m} are normalized online statistics in [0, 1] that denote accumulated support, recent matching frequency, and the elapsed time since the last reliable access, respectively; correspondingly, α_{\sup}, α_{freq}, and α_{idle} are fixed balancing factors that reward historical support, reward recent usefulness, and penalize long inactivity. In implementation, these quantities are maintained as lightweight prototype statistics so that the three terms remain on a comparable scale. The prototype with the smallest V_{m} is evicted. These operations keep the long memory compact, interpretable, and resistant to both uncontrolled growth and excessive averaging.
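As a rough illustration of Equation (19) and of support-weighted merging, consider the sketch below. The function names, the use of raw support counts as merge weights, and the toy statistics are assumptions made for illustration; the balancing factors are the defaults from Section 4.2.

```python
import math


def eviction_score(support, freq, idle, a_sup=1.00, a_freq=0.50, a_idle=0.30):
    """Equation (19); all three statistics are assumed normalized to [0, 1]."""
    return a_sup * support + a_freq * freq - a_idle * idle


def pick_eviction(stats):
    """Return the index of the prototype with the smallest score V_m."""
    scores = [eviction_score(*s) for s in stats]
    return scores.index(min(scores))


def merge_prototypes(p_a, p_b, support_a, support_b):
    """Support-weighted average of two prototypes, then l2 normalization."""
    total = support_a + support_b
    mixed = [(support_a * x + support_b * y) / total for x, y in zip(p_a, p_b)]
    norm = math.sqrt(sum(v * v for v in mixed))
    return [v / norm for v in mixed]


# (support, frequency, idle) per prototype; values are made up for the example.
stats = [(0.9, 0.8, 0.1), (0.2, 0.1, 0.9), (0.5, 0.6, 0.3)]
victim = pick_eviction(stats)
merged = merge_prototypes([1.0, 0.0], [0.0, 1.0], support_a=3.0, support_b=1.0)
```

The weakly supported, long-idle second prototype receives the lowest score and is selected for eviction, while the merged vector stays closer to the better-supported prototype.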

It should be noted that prototype spawning and prototype merging introduce opposite risks. If spawning is too permissive, the long-memory bank may contain redundant or weakly supported prototypes, leading to memory fragmentation and dispersed updates. If merging is too aggressive, distinct but valid appearance modes may be averaged together, partially reintroducing the over-smoothing problem of single-prototype memory. HMP therefore uses deterministic safeguards rather than unconstrained prototype growth or unconditional merging. A new prototype is spawned only when the matched observation is both reliable and sufficiently different from existing long-memory prototypes, merging is triggered only for highly similar prototypes within the same track, and eviction removes the least valuable prototype according to support, recent access frequency, and inactivity. These rules do not claim to eliminate the fragmentation–averaging trade-off completely; instead, they keep the prototype bank bounded and make the trade-off explicit and reproducible.
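The two opposing safeguards can be expressed as simple predicates. This is a hedged sketch: the function names and the returned action labels are hypothetical, and the thresholds are the defaults listed in Section 4.2.

```python
def spawn_action(r, d_long_pre, bank_size, capacity=3,
                 eta_long=0.75, tau_spawn=0.20):
    """Decide what to do with a potential new appearance mode.

    d_long_pre is the long-memory distance measured before the update.
    """
    if r < eta_long or d_long_pre <= tau_spawn:
        return "none"                      # unreliable or not a new mode
    if bank_size < capacity:
        return "append"                    # room left in the bank
    return "evict_then_insert"             # bank full: compact first


def should_merge(d_between_prototypes, tau_merge=0.08):
    """Merge only prototypes of the same track that are nearly identical."""
    return d_between_prototypes < tau_merge
```

Because the default τ_{merge} = 0.08 is well below τ_{spawn} = 0.20, a prototype spawned at distance greater than τ_{spawn} is not immediately eligible for merging under these settings.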

3.4. Frozen Two-Stage Association Strategy

At the association stage, we do not directly mix the long and short memories. Instead, we adopt the Frozen Two-Stage Association Strategy. The rationale is that long-term evidence is more stable and trustworthy and should therefore be used first to establish high-confidence correspondences, while short-term evidence should be reserved only for the unresolved residual cases.

3.4.1. Stage 1: Primary Matching Based on the Long Memory

In Stage 1, only the long memory is allowed to participate in appearance-cost construction. For candidate track–detection pairs that pass motion gating, d_{i j}^{L} is used as the appearance cost and is combined with the motion cost in the original framework to form the association matrix. The Hungarian algorithm is then used to solve the optimal one-to-one matching [26], producing the Stage 1 matching result M_{1}. Because this stage is based on stable long-term identity anchors, all matches established in Stage 1 are immediately frozen and will not be modified in later stages. This freezing mechanism effectively prevents weak short-term evidence from overturning strong primary-matching evidence.
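To make the freezing step concrete, the toy example below solves a small Stage 1 assignment and derives the residual track set. It is a sketch under stated assumptions: an exhaustive search stands in for the Hungarian solver (practical only for tiny inputs), the combined costs are made-up numbers, and the preference for more matched pairs before lower total cost is an assumed convention.

```python
def best_matching(n_tracks, n_dets, cost):
    """Exhaustive one-to-one matching over gated edges (tiny inputs only).

    `cost` maps (track, det) -> combined cost for pairs that passed motion
    gating; missing keys are forbidden pairs.
    """
    best = {"pairs": [], "total": 0.0}

    def search(track, used, pairs, total):
        if track == n_tracks:
            better_size = len(pairs) > len(best["pairs"])
            same_size = len(pairs) == len(best["pairs"])
            if better_size or (same_size and total < best["total"]):
                best["pairs"], best["total"] = list(pairs), total
            return
        search(track + 1, used, pairs, total)      # leave this track unmatched
        for det in range(n_dets):
            if det not in used and (track, det) in cost:
                search(track + 1, used | {det}, pairs + [(track, det)],
                       total + cost[(track, det)])

    search(0, frozenset(), [], 0.0)
    return best["pairs"]


# Hypothetical Stage 1 costs (long-memory appearance cost plus motion cost).
stage1_cost = {(0, 0): 0.10, (0, 1): 0.40, (1, 0): 0.15,
               (1, 1): 0.50, (2, 1): 0.30}
m1 = best_matching(n_tracks=3, n_dets=2, cost=stage1_cost)
frozen_tracks = {t for t, _ in m1}
residual_tracks = [t for t in range(3) if t not in frozen_tracks]
```

Once `m1` is computed it is treated as read-only; only `residual_tracks` (and the unmatched detections) are passed on to the next stage.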

3.4.2. Stage 2: Residual Recovery Based on the Short Queue

After Stage 1, the system obtains a residual unmatched track set and a residual unmatched detection set. Stage 2 operates only on this residual motion-gated subproblem, i.e., on the residual candidate edges inherited from the original motion-gating step, and only the short queue is allowed to participate. Since the queue stores only observations that have already passed the reliability-controlled admission rule in Equation (18), Stage 2 is indirectly constrained by the same reliability mechanism even though it uses dedicated residual-recovery criteria. Its goal is not to rematch the entire frame more aggressively, but to recover a small number of cases in which recent transitional evidence is genuinely more informative than long-term memory.

Because the short queue is more locally adaptive but also more prone to erroneous recovery, we impose two additional constraints in Stage 2.

First, to ensure that recovery is well justified, the short-queue distance must show a clear advantage over the long-memory distance:

d_{i j}^{S} + δ_{adv} < d_{i j}^{L},

(20)

where δ_{adv} is the advantage margin. Equation (20) admits a residual edge to recovery only when its short-queue distance is smaller than its long-memory distance by more than δ_{adv}, so a marginal short-term advantage does not trigger Stage 2.

Second, to suppress ambiguous recovery, let d_{i}^{(1)} and d_{i}^{(2)} denote the best and second-best short-queue distances of track τ_{i} over the residual detection set, respectively, and define the discrimination gap as

Δ_{i} = d_{i}^{(2)} - d_{i}^{(1)} .

(21)

If fewer than two residual detection candidates are available for track τ_{i}, the ambiguity test in Equation (22) is bypassed for that track. Otherwise, only when

Δ_{i} \geq τ_{gap}

(22)

is the current best candidate regarded as having a sufficiently clear discriminative advantage. Equation (22) rejects ambiguous recovery cases in which the best and second-best candidates are too close, thereby reducing the risk of incorrect association among locally similar targets.

Before Stage 2 matching, we first remove residual candidate edges that fail Equation (20). For tracks with at least two residual candidates, we retain only the best recovery candidate when the ambiguity test in Equation (22) is satisfied. For tracks with fewer than two residual candidates, the surviving candidate is retained directly once it passes motion gating and Equation (20). We then run matching again on the remaining residual graph to obtain the recovery result M_{2}. The final association result is

M = M_{1} \cup M_{2} .

(23)

Equation (23) shows that the final output is composed of the high-confidence primary matches from Stage 1 and the conservative recovery matches from Stage 2. Since Stage 2 acts only on the residual set and does not overturn the frozen M_{1}, the proposed method formally supplements Stage 1 rather than replacing it; that is, it fills in missed matches without destroying high-confidence associations.
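The Stage 2 filtering in Equations (20)-(22) can be sketched as a pruning pass over residual edges. The function name and the toy distances are hypothetical, and one interpretation is assumed: the discrimination gap is computed over the edges that survive Equation (20). The one-to-one solver that is rerun on the kept edges is omitted because the example has no detection conflicts.

```python
def stage2_candidates(d_short, d_long, delta_adv=0.05, tau_gap=0.08):
    """Prune residual, motion-gated edges; keep at most one edge per track.

    d_short and d_long map (track, det) -> short-queue / long-memory distance.
    """
    survivors = {}
    for (track, det), dist_s in d_short.items():
        if dist_s + delta_adv < d_long[(track, det)]:        # Equation (20)
            survivors.setdefault(track, []).append((dist_s, det))

    kept = {}
    for track, cands in survivors.items():
        cands.sort()
        if len(cands) >= 2:
            gap = cands[1][0] - cands[0][0]                  # Equation (21)
            if gap < tau_gap:                                # fails Equation (22)
                continue
        best_dist, best_det = cands[0]
        kept[(track, best_det)] = best_dist
    return kept


d_short = {(0, 0): 0.20, (0, 1): 0.22, (1, 0): 0.30, (1, 1): 0.15, (2, 2): 0.28}
d_long = {(0, 0): 0.40, (0, 1): 0.45, (1, 0): 0.32, (1, 1): 0.35, (2, 2): 0.30}
kept = stage2_candidates(d_short, d_long)

m1 = {(3, 3)}        # frozen Stage 1 result (hypothetical)
m2 = set(kept)       # no conflicts here, so the rerun matching is trivial
final = m1 | m2      # Equation (23)
```

Track 0 is rejected as ambiguous (two candidates only 0.02 apart), track 2 lacks a sufficient advantage over its long-memory distance, and only track 1 is recovered.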

3.5. Track Initialization and Queue Maintenance

When a new track is initialized or confirmed by the host tracker, the first long-memory prototype is initialized using the ℓ_{2}-normalized appearance feature of the associated detection. The initial long-memory bank therefore contains one valid prototype, and additional prototypes are spawned only when later reliable observations satisfy the spawning condition described in Section 3.3.2. The short queue is used only as a recent transitional buffer. It is initialized as empty and is subsequently filled only by matched observations whose joint reliability satisfies the short-queue admission threshold η_{S}.

For unmatched tracks, the long-memory prototypes are retained as part of the track state as long as the host tracker keeps the track alive. In contrast, the short queue is treated more conservatively. If a track remains unmatched beyond the short reactivation window or is reactivated after a long absence, its short queue is cleared before Stage 2 recovery is enabled again. This prevents stale local observations from participating in residual matching. When a track is terminated and later a new track is created, both the long memory and the short queue are reinitialized according to the new track state. This design preserves the stable identity role of long memory while preventing short-term evidence from being reused outside its valid temporal window.
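The asymmetric treatment of the two memories for unmatched tracks can be sketched as follows. The class `TrackMemory`, its field names, and the window length of 5 frames are all hypothetical; the source does not state the length of the short reactivation window.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class TrackMemory:
    """Per-track appearance state: persistent long bank, perishable short queue."""
    long_bank: list
    short_queue: deque = field(default_factory=lambda: deque(maxlen=6))
    frames_unmatched: int = 0

    def mark_unmatched(self, reactivation_window=5):
        """Keep the long memory; clear the short queue once it becomes stale."""
        self.frames_unmatched += 1
        if self.frames_unmatched > reactivation_window:
            self.short_queue.clear()

    def mark_matched(self):
        """Reset the miss counter when the track is associated again."""
        self.frames_unmatched = 0


memory = TrackMemory(long_bank=[[1.0, 0.0]])
memory.short_queue.append([0.9, 0.1])
for _ in range(5):
    memory.mark_unmatched()
still_cached = len(memory.short_queue)   # within the assumed window
memory.mark_unmatched()                  # sixth miss exceeds the window
```

After the sixth consecutive miss the short queue is empty, while the long-memory prototype is still available for Stage 1 matching if the track is reactivated.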

3.6. Complexity Analysis

HMP adds little computation to existing online trackers. The module mainly introduces additional feature-to-memory distance evaluations and lightweight queue/prototype maintenance, both of which are linear in the number of gated candidate associations. No retraining, auxiliary network, or expensive global optimization is introduced. Let E denote the number of candidate track–detection edges retained after motion gating, let M denote the maximum number of long-term prototypes per track, and let S denote the maximum short-queue length. The dominant additional appearance-comparison cost can then be summarized as

C_{HMP} = O (E (M + S)) .

(24)

As shown in Equation (24), the additional appearance-distance overhead of HMP grows linearly with the number of gated candidate edges. Since M and S are both small constants in practice, the extra computational burden introduced by the module remains limited. This property is important for online deployment, because it means that the gain in identity stability is obtained through better memory organization and evidence control rather than through heavy computation.
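As a worked instance of Equation (24), take the default capacities M = 3 and S = 6 from Section 4.2 and a hypothetical frame with E = 500 gated edges (an illustrative number, not a measurement from the paper). A single-prototype memory needs one appearance distance per edge, whereas HMP needs at most M + S:

```latex
\begin{aligned}
C_{\text{single}} &= E = 500, \\
C_{\text{HMP}} &\le E\,(M + S) = 500 \times (3 + 6) = 4500 .
\end{aligned}
```

This bound is likely loose in practice, since Stage 2 operates only on residual edges and a track may hold fewer than M prototypes.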

In this section, we presented HMP, a hierarchical appearance-memory framework for online MOT. At the contribution level, the method is organized around three coupled ideas: Hierarchical Memory Architecture, Unified Reliability Control Mechanism, and Frozen Two-Stage Association Strategy. At the implementation level, the first idea is realized through a multi-prototype long memory together with a short queue, the second through a joint reliability score and the reliability-controlled update policy built on it, and the third through the Frozen Two-Stage Association Strategy, which constrains how stable and transitional evidence participate in association. Together, they transform appearance memory from a passive storage unit into an actively regulated identity-modeling mechanism.

4. Results

4.1. Datasets, Metrics, and Controlled Attribution Protocol

We evaluate the proposed method on two standard pedestrian multi-object tracking benchmarks from MOTChallenge: MOT17 and MOT20 [27]. MOT17 covers pedestrian tracking sequences from multiple cameras and scenes, with substantial occlusion, interaction, and appearance ambiguity. MOT20 focuses on extremely crowded scenes and therefore places much stronger pressure on identity preservation under dense overlap, mutual occlusion, and local confusion. Evaluating both benchmarks allows us to examine not only the general effectiveness of HMP, but also its robustness as crowd density and association difficulty increase.

The main benchmark results are reported on the official MOTChallenge test sets. The ablation experiments are conducted on the MOT17 training set, where the first half is used for upstream detector/ReID training or calibration when needed, and the second half is reserved for validation. Taken together, these two evidence layers form the controlled attribution protocol of this paper: the official server results establish the external competitiveness of the complete tracker, whereas the controlled validation experiments isolate which parts of the proposed memory redesign are responsible for the observed gains. This protocol avoids repeated submissions to the public server and provides a controlled environment for analyzing individual design choices. Unless otherwise stated, all ablation results are obtained under identical detector, ReID, motion-model, and solver settings so that the observed differences can be attributed to the appearance-memory design itself.

The evaluation metrics are higher-order tracking accuracy (HOTA), multiple object tracking accuracy (MOTA), identity F1 score (IDF1), and identity switches (IDSW) [28,29,30]. HOTA jointly reflects detection and association quality; MOTA aggregates missed detections, false positives, and identity switches; IDF1 measures identity consistency; and IDSW counts the number of identity switches. Unless otherwise specified, in all result tables, ↑ and ↓ indicate that higher and lower values are better, respectively, and bold values denote the best or tied-best result in each column. Because HMP primarily targets identity modeling rather than detector redesign, we place particular emphasis on IDF1 and IDSW while still reporting HOTA and MOTA for overall completeness.
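For reference, the standard definitions of MOTA and IDF1 can be written as two short functions. The counts in the example are invented solely to exercise the formulas and do not correspond to any result in this paper.

```python
def mota(fn, fp, idsw, num_gt):
    """MOTA = 1 - (FN + FP + IDSW) / GT, with GT the number of ground-truth boxes."""
    return 1.0 - (fn + fp + idsw) / num_gt


def idf1(idtp, idfp, idfn):
    """IDF1 = 2 * IDTP / (2 * IDTP + IDFP + IDFN)."""
    return 2.0 * idtp / (2.0 * idtp + idfp + idfn)


# Made-up counts for illustration only.
example_mota = mota(fn=1500, fp=500, idsw=100, num_gt=10000)
example_idf1 = idf1(idtp=8000, idfp=1500, idfn=2000)
```

Because IDSW enters MOTA only as one term next to the usually much larger FN and FP counts, a reduction in identity switches can leave MOTA nearly unchanged while IDF1 moves noticeably.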

4.2. Implementation Details and Fair Comparison Protocol

For the detector and ReID configuration, we distinguish clearly between the official benchmark submissions and the controlled ablations. For the benchmark results in Table 2 and Table 3, the upstream detector/ReID stack follows the public BoT-SORT-ReID-style configuration used for MOTChallenge submission: a YOLOX-X detector implemented with YOLOX version 0.3.0 and initialized from Common Objects in Context (COCO) pretraining [31], together with an SBS-S50 ReID branch implemented with FastReID version 1.4.0 [7]. The MOT17 detector is trained using the public pedestrian-tracking training mixture that includes MOT17, CityPersons, ETHZ, CrowdHuman, and WiderPerson [27,32,33,34]; the MOT20 detector follows the corresponding public MOT20 pedestrian-training setting. We describe these details to make the official submission setting transparent, but we do not treat the official tables as strict module-level attribution evidence. Accordingly, the official HMP submission should be understood as the complete HMP tracker built on top of this shared detector/ReID stack, rather than as the untouched BoT-SORT-ReID baseline with a single module toggled on. By contrast, in the controlled ablations under BoT-SORT-ReID and Deep OC-SORT, we insert HMP into the original reference frameworks while keeping the detector, ReID extractor, motion prediction/gating pipeline, one-to-one assignment primitive, and track life-cycle rules fixed. This design provides a tighter test of whether the observed gain comes from the proposed memory mechanism itself.

The method was implemented in Python 3.8. All controlled experiments were conducted on a workstation equipped with an NVIDIA GeForce RTX 3090 graphics processing unit (GPU) (NVIDIA Corporation, Santa Clara, CA, USA) and an Intel Core i7-10700 central processing unit (CPU) (Intel Corporation, Santa Clara, CA, USA).

For motion modeling, we follow the noise-scale-adaptive (NSA) Kalman filter and camera motion compensation (CMC) used in the public BoT-SORT-style tracking pipeline, together with Hungarian matching and the standard track initialization, confirmation, and termination procedures used in mainstream frameworks [1,25,26]. HMP does not replace the detector, ReID extractor, motion prediction/gating pipeline, or the underlying one-to-one assignment primitive; it changes how appearance evidence is represented, admitted into memory, and scheduled across the two association stages.

Unless explicitly varied, all controlled experiments use the same default HMP configuration. The memory capacity is set to M = 3 long-term prototypes and S = 6 short-queue entries per track. The reliability-related parameters are η_{L} = 0.75, η_{S} = 0.25, η_{\max} = 0.10, τ_{a} = 0.30, κ_{a} = 0.05, τ_{m} = 9.49, κ_{m} = 2.00, and (λ_{A}, λ_{M}, λ_{Q}) = (0.50, 0.30, 0.20). The Stage 2 control parameters are δ_{adv} = 0.05 and τ_{gap} = 0.08. The prototype-maintenance parameters are τ_{spawn} = 0.20, τ_{merge} = 0.08, α_{\sup} = 1.00, α_{freq} = 0.50, and α_{idle} = 0.30.

These values are not re-tuned for different host trackers in the controlled experiments or for the additional DanceTrack-val check, and the same HMP memory parameters are kept between the MOT17 and MOT20 official submissions. Only the benchmark-specific detector-training setting follows the corresponding public MOTChallenge practice. The default setting is therefore intended to represent a stable operating point rather than a globally optimal configuration for every tracker or dataset. In particular, η_{L} and η_{S} control the trade-off between long-memory purity and short-term adaptability, η_{\max} bounds the speed of prototype movement, and δ_{adv} together with τ_{gap} controls how conservative Stage 2 residual recovery remains. The sensitivity analyses in Section 4.4.6 further examine whether the observed identity gains depend on a narrowly tuned parameter setting. Appendix A reports the same values together with the functional role of each key parameter, and Appendix B provides a compact pseudocode summary of the HMP inference and memory-update flow.
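For reproduction purposes, the default configuration can be collected in one immutable object. The class name `HMPConfig`, the field names, and the sanity checks in `validate` are hypothetical conveniences; only the numeric values are taken from the text above. The check τ_{merge} < τ_{spawn} is an added assumption, motivated by the fact that a freshly spawned prototype should not be immediately mergeable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HMPConfig:
    """Default HMP hyperparameters as listed in Section 4.2."""
    max_prototypes: int = 3                      # M
    queue_length: int = 6                        # S
    eta_long: float = 0.75                       # long-memory write threshold
    eta_short: float = 0.25                      # short-queue admission threshold
    eta_max: float = 0.10                        # maximum update strength
    tau_a: float = 0.30
    kappa_a: float = 0.05
    tau_m: float = 9.49
    kappa_m: float = 2.00
    fusion_weights: tuple = (0.50, 0.30, 0.20)   # (lambda_A, lambda_M, lambda_Q)
    delta_adv: float = 0.05                      # Stage 2 advantage margin
    tau_gap: float = 0.08                        # Stage 2 ambiguity gap
    tau_spawn: float = 0.20
    tau_merge: float = 0.08
    alpha_sup: float = 1.00
    alpha_freq: float = 0.50
    alpha_idle: float = 0.30

    def validate(self):
        """Raise ValueError if basic consistency assumptions are violated."""
        if not self.eta_short <= self.eta_long:
            raise ValueError("short-queue threshold should not exceed eta_long")
        if not 0.0 < self.eta_max <= 1.0:
            raise ValueError("eta_max must lie in (0, 1]")
        if abs(sum(self.fusion_weights) - 1.0) > 1e-9:
            raise ValueError("fusion weights are assumed to sum to 1")
        if not self.tau_merge < self.tau_spawn:
            raise ValueError("tau_merge is assumed to be below tau_spawn")
        return True


config = HMPConfig()
is_valid = config.validate()
```

Keeping the dataclass frozen mirrors the protocol stated above, in which the same values are reused across host trackers and benchmarks without re-tuning.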

To make the evaluation protocol explicit, we separate the experiments into two groups. Table 2 and Table 3 report the official test-set results and mainly serve as external positioning for the complete HMP tracker under the MOTChallenge protocol. Because the compared methods do not share identical end-to-end pipelines, these results should not be interpreted as strict module-level attribution evidence for the memory module alone. By contrast, Table 4, Table 5, Table 6, Table 7, Table 8, Table 9, Table 10 and Table 11 provide controlled or diagnostic evidence under matched settings, where HMP is inserted into the reference framework while the upstream detector/ReID configuration, motion prediction/gating, assignment primitive, and track life-cycle rules are kept fixed within each comparison. Throughout this paper, these controlled experiments therefore constitute the primary evidence chain for the claim that redesigning track memory improves identity continuity under fixed upstream modules. Since HMP targets track-level identity modeling rather than detector quality, the following analysis emphasizes IDF1 and IDSW as the most direct indicators of identity continuity, while still reporting HOTA and MOTA to verify that the identity gains are not obtained by sacrificing overall tracking quality.

4.3. Benchmark Results on Official MOTChallenge Test Sets

All results in this subsection are taken directly from the official MOTChallenge evaluation server. The compared methods include FairMOT [8], ByteTrack [14], OC-SORT [15], StrongSORT++ [10], Deep OC-SORT [16], BoT-SORT-ReID, and BoostTrack++ [17]. Our official submission follows a BoT-SORT-ReID-style upstream detector/ReID configuration but uses the complete HMP tracker described in Section 3 rather than the untouched original BoT-SORT-ReID pipeline. Accordingly, these tables are used mainly for external positioning of the complete system under the official benchmark protocol. They answer whether the full HMP tracker is externally competitive, especially on identity-oriented metrics. The stricter question of why the gain appears, and whether it can be attributed specifically to the proposed memory redesign, is addressed later by the controlled ablations under fixed upstream settings.

4.3.1. MOT17 Test

As shown in Table 2, the complete HMP tracker reaches 66.6 HOTA, 81.0 MOTA, 82.6 IDF1, and 882 IDSW on the MOT17 test set. These numbers place the complete system among the competitive online trackers in this comparison, with particularly strong identity-oriented performance. Because the official benchmark compares complete systems that may differ in detectors, ReID extractors, training data, and engineering pipelines, the numerical gaps in Table 2 should be interpreted as external positioning rather than as evidence that the gains are caused solely by the HMP module.

4.3.2. MOT20 Test

As shown in Table 3, the complete HMP tracker obtains 65.5 HOTA, 77.5 MOTA, 80.8 IDF1, and 752 IDSW on the MOT20 test set. The most notable pattern is the low number of identity switches under extremely crowded conditions. Because the compared trackers use different end-to-end pipelines, Table 3 is used only to position the complete HMP-based tracker under the official benchmark protocol, not to quantify the standalone contribution of HMP relative to trackers built on different upstream stacks.

Overall, the official test-set results show that the complete HMP tracker is externally competitive and has a favorable identity-oriented profile. They provide benchmark-level context for the complete submitted system; module-level attribution is deliberately based on the controlled ablations below, which separate more cleanly the contributions of memory structure, writing policy, and stage-specific evidence usage under fixed upstream settings.

4.4. Ablation and Sensitivity Analysis

The ablation study is organized to validate the contributions of HMP and to characterize their practical operating range. Concretely, eight questions are examined: (1) whether reliability-controlled writing improves the baseline memory state, (2) whether multi-prototype long memory provides additional stable identity modeling, (3) whether short-queue residual recovery contributes beyond the long memory, (4) whether the frozen Stage 1 policy and Stage 2 risk controls are necessary, (5) whether the observed benefits persist across different tracking frameworks, (6) whether the pure module gain remains visible under fixed upstream pipelines, (7) whether the method shows cross-dataset transferability beyond MOT17/MOT20, and (8) how sensitive the method is to memory capacity, reliability weights, and admission thresholds, as well as what runtime and memory-footprint overhead HMP introduces. The purpose of this subsection is therefore not merely to show that HMP improves benchmark numbers, but to identify which part of the proposed memory lifecycle is responsible for that improvement. The controlled results thus form a direct claim-to-evidence chain: Table 4 and Table 5 provide stepwise component attribution for reliability-controlled writing, multi-prototype long memory, and short-queue recovery; Table 6 isolates the frozen association policy and residual-recovery constraints; Table 7 summarizes the pure memory-module gain under fixed host pipelines; Table 8 provides additional cross-dataset validation on DanceTrack-val; Figure 4 and Figure 5 together with Table 9 and Table 10 examine parameter sensitivity; and Table 11, together with the subsequent feature-state memory analysis, quantifies runtime and memory-footprint overhead under matched GPU-workstation settings. The qualitative analysis in Section 4.5 further links the quantitative IDF1/IDSW changes to occlusion and reappearance stages.

4.4.1. Stepwise Component Ablation

To make the contribution of each component explicit, we report a stepwise component ablation under the same BoT-SORT-ReID pipeline. The Baseline uses the original single-prototype EMA-style appearance memory and does not use the proposed reliability-controlled writing or short queue. B1 adds the proposed reliability-controlled writing while keeping the single-prototype representation unchanged. A1 then replaces the single prototype with the multi-prototype long-term memory under the same reliability-controlled writing rule. A2 further introduces the short queue and risk-controlled Stage 2 residual recovery, resulting in the complete HMP configuration. Therefore, the transition from Baseline to Baseline + B1 measures the effect of reliability-controlled writing, the transition from Baseline + B1 to Baseline + B1 + A1 measures the effect of multi-prototype long-memory representation, and the transition from Baseline + B1 + A1 to Baseline + B1 + A1 + A2 measures the incremental effect of controlled short-term residual recovery.

The progression in Table 4 provides a clearer attribution chain for the proposed design. Adding B1 to the Baseline increases IDF1 from 81.8 to 82.1 and reduces IDSW from 135 to 129, while MOTA remains unchanged. Since the memory capacity and association schedule are unchanged in this comparison, the improvement can be attributed to reliability-controlled memory writing rather than to larger storage or an additional matching stage.

Adding A1 on top of B1 further improves IDF1 to 82.5 and reduces IDSW to 125. This indicates that, once unreliable observations are suppressed by the reliability-controlled writing rule, replacing the single prototype with a compact multi-prototype long memory further alleviates over-smoothing and representation drift. In other words, A1 mainly improves the coverage and stability of long-term identity anchors.

Finally, adding A2 yields the full HMP configuration, with MOTA remaining stable at 78.5, IDF1 further increasing to 82.7, and IDSW decreasing to 121. This shows that the short queue is valuable not as another long-term memory bank, but as a tightly constrained source of transitional evidence for residual recovery. Relative to the Baseline, the complete HMP configuration improves MOTA by 0.1, improves IDF1 by 0.9, and reduces IDSW by 14, or 10.4%, while keeping HOTA competitive. Overall, these results support the intended division of labor in HMP: B1 controls memory contamination, A1 strengthens stable long-term identity modeling, and A2 selectively recovers difficult residual cases without destabilizing the primary association stage.
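The stepwise IDSW reductions quoted above add up to the total reduction, and the relative figure follows directly from the Baseline count:

```latex
\begin{aligned}
\text{Baseline} \to \text{B1}: &\quad 135 - 129 = 6, \\
\text{B1} \to \text{B1 + A1}: &\quad 129 - 125 = 4, \\
\text{B1 + A1} \to \text{B1 + A1 + A2}: &\quad 125 - 121 = 4, \\
\text{Total}: &\quad 6 + 4 + 4 = 14, \qquad \frac{14}{135} \approx 0.104 = 10.4\% .
\end{aligned}
```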

4.4.2. Cross-Framework Ablation

We further conduct experiments under the Deep OC-SORT framework to test whether the effect of HMP depends on a particular tracker implementation. Unless otherwise specified, the definitions of B1, A1, and A2 are kept the same as those under BoT-SORT-ReID; only the host tracking framework is changed. To maintain a consistent attribution protocol, we again report a stepwise component ablation rather than separating structural and writing effects into disconnected tables.

As shown in Table 5, the same component-level trend is observed under Deep OC-SORT. Adding B1 to the Baseline increases IDF1 from 82.5 to 82.8 and reduces IDSW from 187 to 176, while MOTA and HOTA remain stable. Since the memory representation and association schedule are unchanged in this comparison, this improvement again supports the value of reliability-controlled memory writing for suppressing contaminated updates.

Adding A1 on top of B1 further improves IDF1 to 83.0 and HOTA to 70.8, while reducing IDSW to 175. This indicates that the multi-prototype long memory remains beneficial under a different host tracker by providing richer and more stable long-term identity anchors. Finally, adding A2 yields the complete HMP configuration, reaching 80.0 MOTA, 83.1 IDF1, 70.9 HOTA, and 173 IDSW. The additional reduction in IDSW shows that short-queue-based residual recovery is still useful when it is constrained by the same frozen two-stage policy.

Compared with the Baseline, the complete HMP configuration improves MOTA by 0.2, IDF1 by 0.6, and HOTA by 0.7, while reducing IDSW by 14 under Deep OC-SORT. Together with the BoT-SORT-ReID results, these cross-framework experiments show that the benefit of HMP is not tied to a particular tracker implementation. Instead, the improvement comes from a transferable memory-organization principle: B1 controls memory contamination, A1 improves stable long-term identity modeling, and A2 provides conservative short-term recovery for difficult residual cases.

4.4.3. Stage-Policy Ablation

The stepwise component ablation above verifies the progressive contribution of reliability-controlled writing, multi-prototype long memory, and short-queue-based residual recovery. However, it does not fully answer whether the conservative stage policy itself is necessary once these memory components are already enabled. Therefore, we conduct an additional stage-policy ablation under BoT-SORT-ReID on the MOT17 validation split. All variants in this comparison use the same memory components as the full HMP configuration, namely reliability-controlled writing, multi-prototype long memory, and the short queue; only the policy that controls how Stage 2 evidence is allowed to affect matching is changed.

Specifically, the “Without Stage 1 freezing” variant removes the rule that Stage 1 matches are finalized before residual recovery, allowing short-term evidence to compete with or modify associations that would otherwise have been fixed by long-memory primary matching. The “Without advantage margin” variant removes Equation (20), so Stage 2 recovery no longer requires the short-queue distance to be clearly better than the corresponding long-memory distance. The “Without ambiguity gap” variant removes Equation (22), making Stage 2 less selective when multiple residual candidates have similar short-queue distances. This ablation therefore evaluates whether HMP benefits simply from adding a second matching pass, or from constraining that pass with a conservative trust hierarchy.
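The two Stage 2 risk controls ablated here can be illustrated with a minimal sketch. Equations (20) and (22) are defined earlier in the paper and are not reproduced in this subsection, so the exact forms below are assumptions: the advantage margin is modeled as a fixed additive margin between the long-memory and short-queue distances, and the ambiguity gap as a fixed additive gap between the best and second-best short-queue distances. The function name `stage2_accept` and the default values of `margin` and `gap` are hypothetical.

```python
from typing import Optional, Sequence


def stage2_accept(short_dists: Sequence[float],
                  long_dists: Sequence[float],
                  margin: float = 0.05,
                  gap: float = 0.05) -> Optional[int]:
    """Pick a residual candidate for one unresolved track, or return None.

    short_dists[i]: distance from candidate i to the track's short queue.
    long_dists[i]:  distance from candidate i to the track's long memory.
    The additive forms of both tests are illustrative assumptions.
    """
    if not short_dists:
        return None
    # Rank candidates by short-queue distance (smaller is better).
    order = sorted(range(len(short_dists)), key=lambda i: short_dists[i])
    best = order[0]
    # Advantage margin: short-term evidence must clearly beat long memory.
    if long_dists[best] - short_dists[best] < margin:
        return None
    # Ambiguity gap: best and second-best candidates must be well separated.
    if len(order) > 1 and short_dists[order[1]] - short_dists[best] < gap:
        return None
    return best
```

In this sketch, Stage 1 freezing corresponds to calling the function only for tracks and detections that Stage 1 left unmatched. Dropping either test yields a more permissive policy in the spirit of the corresponding ablation variant.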

As shown in Table 6, removing the frozen Stage 1 policy increases IDSW from 121 to 134 and reduces IDF1 from 82.7 to 82.2. This confirms that the frozen design is not merely an implementation detail: it prevents short-term transitional evidence from overturning high-confidence matches already established by stable long-memory anchors. In other words, Stage 2 is useful as a residual-recovery mechanism, but it becomes risky when it is allowed to interfere with the primary association result. Removing the advantage-margin condition also increases IDSW from 121 to 131 while reducing MOTA from 78.5 to 78.4 relative to the full HMP setting. This indicates that a more permissive recovery policy does not bring a better overall trade-off, and instead weakens identity preservation by accepting riskier residual matches. Similarly, removing the ambiguity-gap test increases IDSW to 128, showing that Stage 2 should reject residual cases in which the best and second-best short-term candidates are not sufficiently separated. Overall, these results support the intended trust hierarchy of HMP: stable long-memory evidence should dominate primary matching, while short-term evidence should only supplement unresolved residual cases when it provides a clear and discriminative advantage.

4.4.4. Pure Module-Gain Summary Under Fixed Pipelines

To further separate module-level attribution from whole-system benchmark positioning, we summarize the pure memory-module gain under the two controlled host pipelines. In this comparison, the baseline and the full HMP variant share the same detector, ReID extractor, motion prediction/gating, assignment primitive, and track life-cycle rules; the difference is restricted to the track-level memory representation, writing policy, and stage-specific evidence usage introduced by HMP. Therefore, Table 7 should be read as the most direct experimental evidence for the effect of the proposed memory redesign.

As shown in Table 7, the full HMP configuration improves MOTA, IDF1, and HOTA under both host trackers while reducing IDSW. The absolute magnitude of the gain differs between the two frameworks because their baseline association behavior and identity-error profiles are different. However, the direction of improvement is consistent: replacing the original memory mechanism with HMP produces a clearer identity-continuity gain than a detection-oriented gain. This pattern directly supports the intended claim of the paper: the proposed module improves the track-memory lifecycle under fixed upstream conditions, rather than relying on a different detector, a stronger ReID extractor, or a separate end-to-end pipeline.

4.4.5. Cross-Dataset Generalization on DanceTrack

To further evaluate cross-dataset generalization beyond MOT17 and MOT20, we additionally evaluate HMP on the DanceTrack validation set [35] under the Deep OC-SORT framework. DanceTrack is complementary to MOT17/MOT20 because it contains targets with relatively uniform appearance and diverse motion patterns, making identity association less dependent on strong appearance discrimination and more sensitive to motion consistency and temporal evidence usage. Therefore, this experiment provides a useful stress test for HMP under weak appearance discrimination and complex motion. It is not intended as a new official leaderboard comparison, but as a controlled cross-dataset validation. The host tracker, evaluation protocol, and upstream configuration are kept consistent between the baseline and HMP variant, and no dataset-specific retuning of HMP parameters is performed.

As shown in Table 8, adding HMP to Deep OC-SORT improves MOTA from 88.5 to 88.9, HOTA from 58.51 to 58.70, and IDF1 from 59.03 to 59.43, while reducing IDSW from 1587 to 1543. The reduction of 44 identity switches corresponds to a 2.77% decrease relative to the baseline. Although the absolute improvement is moderate, the trend is consistent with the results on MOT17 and MOT20: HMP mainly improves identity-oriented behavior while keeping overall tracking quality stable.

This result provides additional cross-dataset evidence that the proposed memory mechanism is not restricted to a single benchmark. At the same time, we interpret this experiment cautiously. It validates HMP under one additional dataset and one host tracker, but it does not fully cover all possible domain shifts such as driving scenarios, low-resolution surveillance, or cross-modal tracking. Broader validation on more datasets remains an important direction for future work.

4.4.6. Parameter Sensitivity

We next examine how the number of long-term prototypes M and the short-queue length S affect identity-oriented performance on MOT17 under the BoT-SORT-ReID framework. Rather than treating these curves as single-metric tuning results, we interpret them through IDF1 and IDSW jointly, because the purpose of HMP is to improve identity continuity rather than to optimize one scalar score in isolation. The goal of this subsection is therefore to determine whether the gain of HMP comes from a compact, well-structured memory design or merely from increasing memory capacity. Each operating point requires rerunning the controlled tracker under a different memory configuration, so the curves are reported as validation trends for identifying a stable working region rather than as claims of statistical significance between neighboring points.

As shown in Figure 4, the effect of M is most informative when IDF1 and IDSW are read together. Increasing M from 1 to 3 raises IDF1 from 82.1 to 82.5 while simultaneously reducing IDSW from 129 to 125, indicating that a small set of long-term prototypes already provides sufficient diversity to capture recurring appearance modes of the same target. This is the most favorable operating region because the improvement is supported by both identity-quality indicators: association quality improves while switch errors decrease. When M is increased beyond this range, however, the long-memory bank becomes overly fragmented. Updates are dispersed across too many modes, each prototype receives less stable reinforcement, and the long-term identity anchors become less reliable. Accordingly, IDF1 declines and IDSW rises again. The figure therefore supports a more specific conclusion than “more memory helps”: the value of multi-prototype long memory lies in maintaining a compact and reliable set of identity anchors that improves identity matching while suppressing harmful switches.

As shown in Figure 5, the role of S is likewise clearer when both IDF1 and IDSW are considered together. Starting from very small values, increasing S improves IDF1 and reduces IDSW, which suggests that a short queue is useful for covering brief occlusions, rapid pose changes, and other local appearance transitions that should not be written directly into long-term memory. The best trade-off appears at $S = 6$, where IDF1 reaches one of its highest observed levels and IDSW attains the minimum observed value. Although $S = 7$ maintains a similar IDF1, its higher IDSW suggests that an overly long queue may introduce stale or noisy transitional evidence. As S grows further, this effect strengthens and weakens the discriminative value of the queue for Stage 2 residual recovery: the identity gain saturates and switch errors begin to rise again. This trend is consistent with the intended role of the short queue: it should operate as a compact buffer for recent transitions, rather than gradually turning into another unconstrained long-term memory.

Taken together, Figure 4 and Figure 5 indicate that the main gain of HMP does not depend on large memory capacity. A moderate configuration already captures most of the identity benefit, which is why we use $M = 3$ and $S = 6$ as the default settings. This operating point offers a favorable balance among IDF1, IDSW, and practical cost.

We next evaluate whether HMP depends strongly on a narrowly tuned reliability-weight setting. The default setting assigns the largest weight to appearance consistency, a secondary weight to motion consistency, and a smaller weight to detection quality. We compare it with balanced, appearance-heavy, motion-heavy, and detection-heavy variants while keeping all other parameters unchanged.

Table 9 shows that the default setting provides the best IDF1/IDSW trade-off among the tested configurations, but the improvement trend does not collapse under moderate weight perturbations. Appearance-heavy weighting remains close to the default setting, which is reasonable because HMP mainly targets track-level appearance memory. Balanced and motion-heavy variants still improve identity continuity compared with the baseline, but their IDSW values are higher than the default. The detection-heavy setting attains comparable MOTA but weakens IDF1, HOTA, and IDSW, indicating that detector confidence alone is insufficient for reliable memory writing. Overall, the results suggest that HMP benefits from appearance-dominant reliability fusion but is not overly dependent on a single fragile weight setting.

We also examine the sensitivity of HMP to the two main memory-admission thresholds, $\eta_{L}$ and $\eta_{S}$. These thresholds control the purity-adaptability trade-off: lower thresholds admit more observations but increase contamination risk, whereas higher thresholds preserve memory purity but may suppress useful adaptation.
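To make the role of the reliability weights and admission thresholds concrete, the following sketch assumes that the joint reliability score is a normalized weighted sum of three cues in [0, 1], and that η_L and η_S gate writes to the long memory and the short queue independently. Both choices are illustrative assumptions rather than the paper's exact definitions, and all numeric values (the weights, `eta_long`, `eta_short`) are hypothetical placeholders, not the reported defaults.

```python
from typing import Tuple


def joint_reliability(appearance: float, motion: float, detection: float,
                      weights: Tuple[float, float, float] = (0.5, 0.3, 0.2)) -> float:
    """Fuse three reliability cues in [0, 1] into one score in [0, 1].

    The weighted-sum form and the weight values are assumptions; the ordering
    (appearance > motion > detection) follows the default described in the text.
    """
    w_app, w_mot, w_det = weights
    total = w_app + w_mot + w_det
    return (w_app * appearance + w_mot * motion + w_det * detection) / total


def admission_flags(score: float,
                    eta_long: float = 0.7,
                    eta_short: float = 0.5) -> Tuple[bool, bool]:
    """Return (write_long, write_short) for one observation.

    Lower thresholds admit more observations (higher contamination risk);
    higher thresholds keep memory purer but suppress adaptation.
    """
    return score >= eta_long, score >= eta_short
```

Under these assumptions, a detection-heavy weighting such as `(0.2, 0.3, 0.5)` lets a confident but visually inconsistent detection pass the gate more easily, which matches the qualitative concern raised about relying on detector confidence alone.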

As shown in Table 10, HMP remains effective within a moderate range of admission thresholds. Loose writing attains comparable MOTA, but it increases IDSW because some less reliable observations are allowed to affect long-term identity anchors or short-term recovery. Strict writing reduces the risk of contamination but weakens adaptability, leading to lower IDF1 and slightly worse overall metrics. The default configuration achieves the strongest identity-oriented trade-off, but neighboring settings remain close, indicating that the method is not dependent on a narrowly tuned threshold pair.

Taken together, the parameter analyses support three observations. First, HMP prefers an appearance-dominant reliability fusion because the proposed module operates at the appearance-memory level. Second, the method remains stable under moderate perturbations of reliability weights and admission thresholds, suggesting that the observed gains are not caused by a fragile tuning point. Third, increasing memory capacity alone is insufficient; compact memory with role separation and reliability-controlled writing is more important than simply storing more features. Although the sensitivity sweeps are conducted on the MOT17 validation split to avoid repeated official-server submissions, the same default HMP parameter set is used for the MOT17/MOT20 benchmark submissions and for the additional DanceTrack-val check. The competitive identity-oriented behavior across these settings therefore provides practical evidence that the default setting is not narrowly dataset-specific.

4.4.7. Runtime and Memory-Footprint Overhead

To evaluate the practical deployment cost of HMP, we report the observed inference speed under matched detector, ReID extractor, input, and hardware settings, and we further analyze the deterministic memory footprint introduced by the HMP state itself. For each framework, the baseline and +HMP variants are timed under the same code path. The reported frames per second (FPS) reflects end-to-end tracking throughput rather than isolated HMP module timing.

All runtime measurements in this subsection were obtained on a workstation equipped with an NVIDIA GeForce RTX 3090 graphics processing unit (GPU) (NVIDIA Corporation, Santa Clara, CA, USA) and an Intel Core i7-10700 central processing unit (CPU) (Intel Corporation, Santa Clara, CA, USA). The reported values are intended for controlled within-framework comparison under our implementation and hardware setting, rather than for direct cross-paper speed comparison, because FPS values are sensitive to the detector/ReID configuration, input resolution, precomputed inputs, hardware platform, and timing protocol.

As shown in Table 11, HMP introduces a moderate but predictable runtime overhead under the matched GPU-workstation setting. Under BoT-SORT-ReID, the end-to-end speed decreases from 7.1 FPS to 6.4 FPS after inserting HMP, corresponding to a relative slowdown of 9.9%. Under Deep OC-SORT, the speed decreases from 22.2 FPS to 19.3 FPS, corresponding to a relative slowdown of 13.1%. These FPS values include detection, ReID extraction, motion prediction, association, and memory maintenance, rather than isolated timing of the HMP module alone.
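The relative slowdowns quoted above follow directly from the reported FPS values. The short sketch below reproduces that arithmetic and also converts the FPS pair into an approximate added latency per frame. The per-frame latency is a derived quantity that the paper does not report, and because it is computed from rounded FPS values it should be read as indicative only. The function names are hypothetical.

```python
def relative_slowdown_percent(fps_base: float, fps_new: float) -> float:
    """Relative throughput loss, in percent, when FPS drops from fps_base to fps_new."""
    return 100.0 * (fps_base - fps_new) / fps_base


def added_latency_ms(fps_base: float, fps_new: float) -> float:
    """Approximate extra end-to-end time per frame, in milliseconds."""
    return 1000.0 / fps_new - 1000.0 / fps_base


if __name__ == "__main__":
    # FPS pairs taken from the text: (baseline, +HMP).
    for name, base, new in [("BoT-SORT-ReID", 7.1, 6.4), ("Deep OC-SORT", 22.2, 19.3)]:
        print(name,
              round(relative_slowdown_percent(base, new), 1), "%",
              round(added_latency_ms(base, new), 1), "ms/frame")
```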

In addition to runtime, we analyze the incremental memory footprint introduced by HMP. The module does not add an extra neural network, learnable temporal-attention block, or large external feature bank. Its persistent state mainly consists of long-memory prototypes and short-queue entries maintained for active tracks. Let K denote the number of active tracks and d denote the ReID feature dimension. In our implementation, each stored feature element uses single-precision storage, i.e., 4 bytes per dimension. The additional persistent feature-state memory is approximately

$$B_{\mathrm{HMP}} \approx 4K(M + S)d \ \text{bytes}, \tag{25}$$

excluding a small number of scalar prototype statistics for support, recent access frequency, and inactivity. With the default setting $M = 3$ and $S = 6$, this becomes $36Kd$ bytes. For example, when $d = 2048$, the additional persistent memory is about 72 KiB per active track, or about 7.0 MiB for 100 active tracks. Therefore, the per-track HMP-specific persistent state is bounded by the fixed memory capacities M and S, and the total feature-state memory grows linearly with the number of active tracks and the feature dimension.
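The footprint estimate in Equation (25) can be checked with a few lines of code. The sketch below evaluates the formula as stated, with 4 bytes per stored element; the function and parameter names are hypothetical, and the small scalar prototype statistics are ignored, as in the equation.

```python
def hmp_state_bytes(num_tracks: int, feat_dim: int,
                    num_prototypes: int = 3, queue_len: int = 6,
                    bytes_per_elem: int = 4) -> int:
    """Approximate persistent HMP feature-state size: 4 * K * (M + S) * d bytes."""
    return bytes_per_elem * num_tracks * (num_prototypes + queue_len) * feat_dim


if __name__ == "__main__":
    per_track = hmp_state_bytes(num_tracks=1, feat_dim=2048)
    hundred = hmp_state_bytes(num_tracks=100, feat_dim=2048)
    print(per_track / 1024, "KiB per track")        # 72.0
    print(hundred / 2**20, "MiB for 100 tracks")    # 7.03125
```

Because the expression is linear in K and d, halving the feature dimension or the number of active tracks halves the estimate.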

The lower absolute FPS of BoT-SORT-ReID mainly reflects its heavier detector–ReID stack, not an exceptional HMP penalty. In contrast, Deep OC-SORT has a lighter end-to-end execution profile under the same hardware setting and therefore achieves higher absolute FPS. In both frameworks, the relative slowdown stays below 14% (9.9% and 13.1%, respectively), suggesting that the additional cost introduced by HMP mainly comes from feature-to-memory distance computation and lightweight prototype/queue maintenance rather than from a major change to the upstream tracking pipeline.

Peak GPU-memory consumption of the full tracker is strongly affected by the detector/ReID backbone, CUDA memory caching, input resolution, batching strategy, and implementation backend. Therefore, a single peak-memory number from one workstation would not provide a general deployment claim. We instead report the deterministic incremental HMP state memory in Equation (25), while leaving broader peak GPU-memory profiling on different accelerators to future hardware-aware evaluation.

For practical real-time edge deployment, the heavy detector–ReID configuration used in the workstation benchmark should not be directly transferred to low-power devices. An edge-oriented implementation should pair HMP with lightweight detectors and lightweight ReID extractors, such as compact detection backbones, reduced input resolution, model quantization, pruning, or accelerated inference backends. In such a setting, the end-to-end FPS is expected to depend primarily on the upstream detector–ReID stack, while the additional cost of HMP remains bounded by the number of gated candidate associations and the small memory sizes M and S.

Overall, under the evaluated GPU-workstation setting, HMP dominates neither the end-to-end runtime nor the persistent memory footprint of the tracker: the measured slowdown is 9.9% and 13.1% for the two host trackers, and the HMP-specific persistent state is bounded by Equation (25). Its additional cost is consistent with the linear-complexity analysis in Section 3.6. Nevertheless, these measurements should not be interpreted as a complete deployment benchmark for all hardware platforms. A systematic evaluation with lightweight detectors, lightweight ReID models, embedded accelerators, and peak GPU-memory profiling remains necessary before making stronger claims about real-time edge deployment.

4.5. Qualitative Analysis

As shown in Figure 6, the qualitative example makes the stage-wise behavior of HMP more explicit. In the bottom row, HMP preserves the target identity through the reappearance region and the subsequent interleaved interaction, as indicated by the green boxes. By contrast, the top-row baseline exhibits tracking failures and identity switches around the same region, as highlighted by the red boxes. The labeled Stage 2 rescue points mark residual cases in which short-term evidence is activated only after Stage 1 leaves them unresolved and only when the recovery decision is supported by the clear-advantage and ambiguity-suppression criteria. This conservative design therefore improves identity continuity precisely in the difficult local regions emphasized by the figure, consistent with controlled IDSW reductions in Table 4, Table 5 and Table 6 and official identity trends in Table 2 and Table 3.

To connect the visibility transition in Figure 1 with the qualitative interaction example in Figure 6, Table 12 summarizes how the main failure risk and the corresponding HMP mechanism change across different visibility stages. This table is not intended as an additional benchmark result; rather, it provides an interpretation layer that connects the visual example, the staged association design, and the quantitative identity metrics.

Overall, the visual comparison and the stage-wise interpretation in Table 12 reinforce the quantitative findings: the main value of HMP lies not in more aggressive association, but in more reliable association. In relation to the visibility transition illustrated in Figure 1, the most informative region is the partial-reappearance stage, where stale long-term evidence alone may be insufficient but indiscriminate short-term recovery is risky. The figure and the stage-wise interpretation also clarify the intended boundary of the method: HMP is most effective when the dominant failure source is memory contamination or transitional ambiguity, rather than missing detections, extremely degraded upstream features, or motion patterns that invalidate the underlying prediction model.

5. Discussion

5.1. Main Findings and Attribution

This work addresses a deliberately narrow but practically important question: can identity continuity still be improved by redesigning the appearance-memory lifecycle when the surrounding tracking pipeline is kept fixed? The overall evidence presented in this paper supports a positive answer.

First, the official MOTChallenge results show that the complete HMP tracker is externally competitive, with its clearest strength appearing on identity-oriented metrics. These official results should still be interpreted as whole-system external positioning rather than as a numerical decomposition of the HMP module alone. On MOT17 test, HMP remains competitive in HOTA and MOTA while achieving strong IDF1 and a clear reduction in ID switches. On MOT20 test, where crowd density and local ambiguity are more severe, the most notable advantage of HMP appears in IDSW reduction. The module-level attribution is therefore deliberately based on the controlled ablations, where the upstream detector, ReID extractor, motion model, assignment primitive, and track life-cycle rules are kept fixed.

Second, the controlled ablations provide the main attribution evidence for the paper’s central claim. Under both BoT-SORT-ReID and Deep OC-SORT, the detector, ReID extractor, motion model, association solver, and track life-cycle rules are held fixed, so the observed gains can be attributed more directly to the proposed memory redesign. The stepwise component ablations make this attribution explicit: B1 first reduces memory contamination through reliability-controlled writing, A1 further improves stable long-term identity modeling through multi-prototype representation, and A2 adds conservative short-term recovery for residual cases. The stage-policy ablation then shows that the frozen Stage 1 design and Stage 2 risk controls are necessary to prevent short-term evidence from becoming overly permissive. The pure module-gain summary in Table 7 further confirms that, after replacing only the memory mechanism, MOTA, IDF1, and HOTA increase while IDSW decreases under both host trackers. Taken together, these results indicate that the improvement does not come from unrelated changes elsewhere in the pipeline, but from reorganizing how appearance evidence is represented, admitted, and reused over time.

Third, the experimental pattern helps position the contribution more precisely. HMP should not be understood as a simple accumulation of extra components such as “more prototypes”, “a short queue”, or “an additional matching stage”. Rather, the results support a coordinated interpretation: the Hierarchical Memory Architecture provides stable identity anchors, the Unified Reliability Control Mechanism regulates what is allowed to enter the memory system, and the Frozen Two-Stage Association Strategy constrains when short-term evidence is permitted to influence matching. The gain therefore comes from a coherent appearance-memory lifecycle rather than from any single module viewed in isolation. In other words, the observed improvement is better understood as an effect of role separation and evidence governance than as an effect of raw memory expansion.

5.2. Mechanistic Understanding and Practical Implications

The main reason HMP works is that it separates two types of appearance evidence that play fundamentally different roles in online MOT. Stable long-term evidence should remain clean, compact, and trustworthy enough to support primary matching across time, whereas short-term transitional evidence is locally useful but should not directly overwrite identity anchors. Conventional single-prototype EMA memory mixes these two roles into one representation, which makes it vulnerable to contamination, over-smoothing, and staleness. HMP instead decouples them explicitly: the multi-prototype long memory models recurring stable appearance modes, while the short FIFO queue preserves recent transitional states for limited residual recovery only.
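The role separation described above can be sketched as a small per-track data structure. This is a simplified illustration, not the implementation used in the paper: the spawning rule, the EMA momentum, the cosine distance, and all default values (`spawn_dist`, `momentum`) are assumptions, and the merging and eviction rules mentioned elsewhere in the paper are omitted. Feature vectors are assumed to be non-zero.

```python
from collections import deque
from math import sqrt
from typing import Deque, List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine distance between two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


class TrackMemory:
    """Hypothetical per-track memory: up to M long-term prototypes plus a FIFO queue of length S."""

    def __init__(self, max_prototypes: int = 3, queue_len: int = 6,
                 spawn_dist: float = 0.3, momentum: float = 0.9) -> None:
        self.max_prototypes = max_prototypes
        self.spawn_dist = spawn_dist
        self.momentum = momentum
        self.prototypes: List[List[float]] = []           # stable identity anchors
        self.queue: Deque[List[float]] = deque(maxlen=queue_len)  # recent transitions

    def write_long(self, feat: Sequence[float]) -> None:
        """Update the nearest prototype, or spawn a new one if the feature is far and room remains."""
        feat = list(feat)
        if not self.prototypes:
            self.prototypes.append(feat)
            return
        dists = [cosine_distance(p, feat) for p in self.prototypes]
        k = min(range(len(dists)), key=dists.__getitem__)
        if dists[k] > self.spawn_dist and len(self.prototypes) < self.max_prototypes:
            self.prototypes.append(feat)
        else:
            m = self.momentum
            self.prototypes[k] = [m * p + (1.0 - m) * f
                                  for p, f in zip(self.prototypes[k], feat)]

    def write_short(self, feat: Sequence[float]) -> None:
        """Append to the FIFO queue; the oldest entry is dropped automatically when full."""
        self.queue.append(list(feat))

    def long_distance(self, feat: Sequence[float]) -> float:
        """Distance to the closest long-term prototype (inf if the bank is empty)."""
        return min((cosine_distance(p, feat) for p in self.prototypes), default=float("inf"))

    def short_distance(self, feat: Sequence[float]) -> float:
        """Distance to the closest short-queue entry (inf if the queue is empty)."""
        return min((cosine_distance(q, feat) for q in self.queue), default=float("inf"))
```

Keeping the two stores separate means a transitional observation written only to the queue never alters a long-term prototype, which is the property a single EMA vector cannot provide.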

The controlled results support this interpretation. The stepwise component ablations show that reliability-controlled writing first reduces identity errors by suppressing unreliable updates under the original single-prototype memory form. Adding multi-prototype long memory on top of this writing rule then improves robustness by reducing representation drift and improving the coverage of recurring appearance modes. Finally, adding the short queue brings additional gains only when recent transitional evidence is isolated from long-term memory and activated under restricted Stage 2 conditions. This leads to a more specific conclusion than simply stating that larger memory is beneficial: what matters is not memory size alone, but the coordinated role separation between stable evidence, admissible updates, and conservative transitional recovery.

The reliability mechanism is equally important in this evidence chain. A larger or richer memory is not automatically safer; without appropriate admission control, it can still absorb corrupted observations and accumulate ambiguity over time. The writing-strategy ablations show that even under the baseline memory form, replacing indiscriminate EMA-style writing with reliability-aware updating already yields better identity behavior. This suggests that memory quality control is not a minor implementation detail, but a central requirement for preserving trustworthy identity evidence during long-term online tracking.

The practical implication is also clear. HMP is especially suitable for deployment-oriented settings in which the surrounding tracking pipeline is already fixed and retraining or replacing upstream modules is undesirable. In such settings, the method offers a plug-and-play path to improving identity continuity through appearance-memory lifecycle redesign under fixed upstream modules. At the same time, the observed metric pattern indicates that HMP is intentionally risk-controlled rather than aggressively recall-oriented. Its preference is to avoid harmful identity switches when the evidence is ambiguous, especially in crowded scenes, instead of pursuing more aggressive residual recovery at any cost. The occlusion-stage interpretation in Table 12 makes this trade-off more explicit: HMP is expected to help most when reliable long-term anchors and cautious short-term recovery can complement each other, and it is expected to help least when the upstream evidence itself is absent or severely degraded. This operating preference is consistent with the strong IDSW behavior observed on MOT20 and should be viewed as an intentional design choice rather than as an accidental side effect.

The metric pattern should also be interpreted according to the design goal of HMP. The proposed module does not modify the detector, the ReID extractor, or the global tracking architecture, and therefore it is not expected to produce large gains in detection-dominated metrics such as MOTA or broad overall metrics such as HOTA. Instead, HMP mainly reduces identity errors by improving how track-level appearance evidence is stored, filtered, and reused. Consequently, the most pronounced improvements appear in IDF1 and IDSW, while MOTA and HOTA usually remain stable or improve more moderately. For applications where identity continuity is critical, such as long-term pedestrian tracking and post-event trajectory analysis, this trade-off is desirable. For applications where MOTA or HOTA is the sole priority, however, HMP should be viewed as a complementary memory module rather than a substitute for stronger detection, ReID, or end-to-end tracking architectures.

5.3. Limitations and Future Work

The scope of the present study should also be stated clearly. HMP addresses a track-memory problem rather than an upstream representation problem. It is therefore not intended to replace stronger detector training, improved ReID learning, end-to-end Transformer tracking, or more advanced global optimization. When tracking failure is dominated by long-term missed detections, severely degraded appearance embeddings, or persistent ambiguity that cannot be resolved from the available features, memory redesign alone can offer only limited improvement. This boundary is consistent with the plug-and-play evaluation setting adopted in this paper, where upstream modules are intentionally kept fixed in order to isolate the contribution of track-memory redesign.

A second limitation is that the current Stage 2 policy is deliberately conservative. This conservatism is beneficial for suppressing harmful identity switches, but it may also reject a subset of difficult yet recoverable matches. In other words, the present design prioritizes identity safety over aggressive recovery in ambiguous residual cases. This trade-off is consistent with the goal of reducing IDSW, but it may be less suitable for applications where maximizing recall or MOTA is more important than preserving identity continuity. Future work may therefore investigate adaptive Stage 2 control, so that the recovery policy can respond more flexibly to scene density, track age, or confidence structure while preserving the core principle that fragile short-term evidence should not overturn already reliable primary matches.

A third limitation concerns prototype maintenance. Prototype spawning improves the coverage of distinct appearance modes, but excessive spawning may fragment the long-memory bank and disperse updates across redundant prototypes. Conversely, merging nearby prototypes suppresses redundancy, but overly aggressive merging may average distinct appearance modes and partially reintroduce the over-smoothing behavior that HMP is designed to avoid. The deterministic spawning, merging, and eviction rules used in this paper are intended to keep this trade-off bounded and reproducible, rather than to provide an optimal memory-management policy. Future work may therefore explore adaptive or learned prototype management strategies that can adjust spawning and merging behavior according to scene density, track age, and prototype reliability.

A fourth limitation comes from the motion-reliability term in the joint reliability score. In the current implementation, motion consistency is measured using the Mahalanobis distance derived from Kalman-filter prediction. This design is compatible with mainstream online MOT pipelines, but it may become less reliable under highly nonlinear, abrupt, or erratic motion patterns that cannot be accurately predicted by the motion model. In such cases, the motion term may underestimate valid matches or over-penalize unusual motion, which can make memory writing overly conservative or reduce the chance of Stage 2 residual recovery. Since motion reliability is only one component of the joint reliability score, appearance consistency and detection quality can partly compensate for this issue, but they cannot eliminate it completely. Future work may replace the fixed Kalman-based motion term with a more adaptive motion model, a learned motion-reliability estimator, or a scene-aware reliability mechanism.
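For reference, the quantity underlying the motion term can be sketched as follows. The example computes the squared Mahalanobis distance between a measurement and its Kalman-predicted position using the innovation covariance, reduced to two dimensions so that it stays dependency-free. Real trackers typically use a higher-dimensional box state; how the paper maps this distance to a reliability value is not reproduced here, and the function name is hypothetical. The constant 5.991 is the standard 95% chi-square quantile for 2 degrees of freedom.

```python
from typing import Sequence

CHI2_95_2DOF = 5.991  # 95% chi-square quantile, 2 degrees of freedom


def mahalanobis_sq_2d(z: Sequence[float], z_pred: Sequence[float],
                      cov: Sequence[Sequence[float]]) -> float:
    """Squared Mahalanobis distance v^T C^{-1} v with v = z - z_pred and a 2x2 covariance C."""
    dx = z[0] - z_pred[0]
    dy = z[1] - z_pred[1]
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det <= 0.0:
        raise ValueError("covariance must be positive definite")
    # Closed-form 2x2 inverse: C^{-1} = (1 / det) * [[d, -b], [-c, a]]
    return (dx * (d * dx - b * dy) + dy * (-c * dx + a * dy)) / det


if __name__ == "__main__":
    # A measurement 3 px right and 4 px below the prediction, with std 3 px and 4 px.
    d2 = mahalanobis_sq_2d((3.0, 4.0), (0.0, 0.0), ((9.0, 0.0), (0.0, 16.0)))
    print(d2, d2 < CHI2_95_2DOF)  # 2.0 True
```

Under abrupt or erratic motion the prediction `z_pred` is poor, so valid matches can yield large distances and be penalized, which is the failure mode described in this limitation.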

Finally, the present experiments still have limited coverage in terms of datasets and hardware. In addition to MOT17 and MOT20, the DanceTrack-val experiment provides a useful cross-dataset check under Deep OC-SORT and shows that the identity-stability trend is not restricted to a single benchmark family. However, this additional validation should still be interpreted cautiously: it does not fully cover other domain shifts such as BDD100K driving scenes [36], low-resolution surveillance, cross-modal tracking, or remote sensing videos. Similarly, the runtime and memory-footprint study is limited to one GPU-workstation setting and reports the deterministic HMP state memory rather than exhaustive peak GPU-memory consumption across hardware platforms. Future work can therefore explore broader validation on additional object categories and benchmarks, stronger domain-shift evaluation, lightweight detector/ReID configurations, and hardware-aware optimization for resource-constrained deployment. Such hardware-aware evaluation should include peak GPU-memory profiling under different detector/ReID backbones, input resolutions, and embedded accelerators.

6. Conclusions

This paper proposed HMP, a plug-and-play appearance-memory module for online multi-object tracking. Rather than modifying the detector, ReID extractor, or motion model, HMP targets a different source of identity failure: how track-level appearance evidence is stored, filtered, and reused during long-term online operation.

The method is built around three coordinated ideas: Hierarchical Memory Architecture, Unified Reliability Control Mechanism, and Frozen Two-Stage Association Strategy. Concretely, HMP separates stable identity anchors from short-term transitional evidence through a multi-prototype long-term memory and a short FIFO queue, regulates memory admission and maintenance through a shared reliability criterion, and restricts short-term evidence to conservative residual recovery after stable primary matching has already been established.

Experiments on MOT17 and MOT20, together with controlled ablations under BoT-SORT-ReID and Deep OC-SORT, show that the complete HMP tracker achieves competitive overall tracking quality with strong identity-oriented performance. Official benchmark results provide external positioning of the complete tracker, while the controlled experiments under fixed upstream settings, the pure module-gain summary, and the qualitative occlusion-stage analysis support attributing the gain to the track-memory redesign itself. The additional DanceTrack-val experiment under Deep OC-SORT provides a modest but consistent cross-dataset check, improving IDF1 and reducing IDSW under the same host tracker. The most consistent improvements appear on identity-oriented metrics, especially IDF1 and IDSW, which matches the design goal of the method.

The main conclusion of this work is therefore not merely that additional memory is helpful. Rather, online MOT benefits when stable evidence and transitional evidence are assigned different roles, admitted under reliability control, and used at different association stages. This highlights track-memory lifecycle design as a practical and effective optimization dimension for identity-stable online MOT. Practically, this makes HMP particularly suitable for deployment-oriented scenarios in which upstream modules are difficult to retrain but identity stability remains critical.

Author Contributions

Conceptualization, Y.C., H.X. and W.Z.; methodology, M.L., W.Z. and Y.C.; software, M.L.; validation, M.L., J.C. and C.W.; formal analysis, M.L. and K.X.; investigation, M.L., J.C. and C.W.; data curation, J.C. and C.W.; visualization, M.L.; writing—original draft preparation, M.L.; writing—review and editing, W.Z., Y.C., H.X. and K.X.; supervision, W.Z., Y.C. and H.X.; project administration, Y.C. and H.X.; funding acquisition, Y.C. and H.X. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Henan Province Science and Technology Research Project, grant number 262102210161.

Data Availability Statement

The MOT17 and MOT20 datasets analyzed in this study are publicly available from the MOTChallenge benchmark website (https://motchallenge.net/, accessed on 13 May 2026). The DanceTrack dataset used for the additional validation experiment is publicly available from the DanceTrack project website (https://dancetrack.github.io/, accessed on 13 May 2026). Additional implementation details and configuration files are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

MOT	multi-object tracking
HMP	Hierarchical Multi-Prototype Appearance Memory
ReID	re-identification
EMA	exponential moving average
FIFO	first-in-first-out
HOTA	higher-order tracking accuracy
MOTA	multiple object tracking accuracy
IDF1	identity F1 score
IDSW	identity switches
FPS	frames per second
GPU	graphics processing unit
CPU	central processing unit
NSA	noise-scale-adaptive
CMC	camera motion compensation

Appendix A. Default Hyperparameter Settings

Table A1 summarizes the default hyperparameter settings of HMP. The memory-size parameters follow the sensitivity analysis in Section 4.4.6, whereas the remaining thresholds and weights are adopted as stable defaults with reference to publicly available tracker configurations and common empirical practice; all of them are then kept fixed in the remaining experiments unless otherwise specified. We do not claim that these defaults are globally optimal. Rather, the design goal is to provide stable operating values that preserve the same improvement trend across the tested host trackers without per-framework retuning. In particular, the motion threshold $τ_{m}$ follows the conventional Kalman-filter gating scale used in mainstream MOT implementations. To make the defaults easier to audit, Table A1 reports not only the meaning and value of each parameter, but also the expected qualitative effect of increasing it. Functionally, the defaults fall into three groups: (1) M and S determine the capacity of stable long-term anchors and short-term transitional evidence; (2) $η_{L}$, $η_{S}$, $η_{\max}$, $τ_{a}$, $κ_{a}$, $τ_{m}$, $κ_{m}$, and $λ_{A}$–$λ_{Q}$ determine how reliability is computed and how strictly evidence is admitted into memory; and (3) $δ_{adv}$, $τ_{gap}$, $τ_{spawn}$, $τ_{merge}$, and $α_{\sup}$–$α_{idle}$ determine how conservative Stage 2 remains and how the long-memory bank is maintained over time.
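As a quick numerical check on the gating scale mentioned above, the default $τ_{m}$ = 9.49 coincides with the 95% quantile of a chi-square distribution with four degrees of freedom, the value commonly used to gate squared Mahalanobis distances over a four-dimensional Kalman measurement. The sketch below assumes that interpretation (the degrees of freedom and the confidence level are our assumption and are not stated in this appendix) and recovers the value from the closed-form CDF by bisection. The function names are illustrative only.

```python
import math


def chi2_cdf_4dof(x: float) -> float:
    """Closed-form CDF of a chi-square variable with 4 degrees of freedom."""
    if x <= 0.0:
        return 0.0
    return 1.0 - math.exp(-x / 2.0) * (1.0 + x / 2.0)


def chi2_quantile_4dof(p: float, tol: float = 1e-10) -> float:
    """Invert the CDF by bisection; valid because the CDF is increasing."""
    lo, hi = 0.0, 100.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if chi2_cdf_4dof(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Assumed gate: 95% confidence over a 4-D measurement space.
tau_m = chi2_quantile_4dof(0.95)
print(round(tau_m, 2))  # 9.49
```

Under this reading, increasing $τ_{m}$ corresponds to accepting a larger confidence region around the predicted state, which matches the "relaxes motion acceptance" entry in Table A1.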

Table A1. Default hyperparameter settings of HMP. The memory-size parameters follow the sensitivity analysis in Section 4.4.6, and the remaining values are kept as fixed defaults throughout the experiments unless otherwise specified.

Parameter	Meaning	Default	Expected Effect When Increased
Memory capacity and structure
M	number of long-term prototypes	3	improves mode coverage at first, but overly large M scatters updates and weakens prototype stability
S	short-queue length	6	enlarges short-term recovery context at first, but overly large S introduces stale or noisy transitional evidence
Reliability thresholds and fusion weights
$η_{L}$	long-memory writing threshold	0.75	makes long-memory updates more selective and conservative
$η_{S}$	short-queue admission threshold	0.25	makes short-queue admission more selective and reduces noisy transitional entries
$η_{\max}$	maximum long-memory update step	0.10	allows prototypes to adapt faster, but may increase drift risk if set too large
$τ_{a}$	appearance reliability threshold	0.30	relaxes appearance acceptance by allowing larger appearance distances to receive higher reliability
$κ_{a}$	appearance reliability scale	0.05	smooths the appearance-reliability transition and reduces abrupt thresholding
$τ_{m}$	motion reliability threshold	9.49	relaxes motion acceptance and makes motion reliability less strict
$κ_{m}$	motion reliability scale	2.00	smooths the motion-reliability transition and reduces abrupt gating changes
$λ_{A}$	appearance weight in joint reliability	0.50	makes joint reliability rely more strongly on appearance consistency
$λ_{M}$	motion weight in joint reliability	0.30	makes joint reliability rely more strongly on motion consistency
$λ_{Q}$	detection-quality weight in joint reliability	0.20	makes joint reliability more sensitive to detector confidence and box quality
Prototype maintenance and Stage 2 control
$δ_{adv}$	Stage 2 advantage margin	0.05	requires a clearer short-term advantage before residual recovery is accepted
$τ_{gap}$	ambiguity-suppression threshold	0.08	suppresses more ambiguous Stage 2 matches and makes recovery more conservative
$τ_{spawn}$	prototype-spawning distance threshold	0.20	makes new long-term prototypes harder to spawn, favoring compact memory
$τ_{merge}$	prototype-merging distance threshold	0.08	merges more nearby prototypes and increases compactness of the memory bank
$α_{\sup}$	support weight in eviction score	1.00	favors prototypes with stronger support when eviction is decided
$α_{freq}$	frequency weight in eviction score	0.50	favors prototypes that are used more often over time
$α_{idle}$	inactivity penalty in eviction score	0.30	penalizes stale prototypes more strongly and removes idle ones earlier

This table is intended as a practical reading aid rather than as an invitation to heavy per-tracker retuning. In our use case, M and S control compact memory capacity, $η_{L}$, $η_{S}$, $η_{\max}$, $δ_{adv}$, and $τ_{gap}$ define the conservativeness of writing and Stage 2 recovery, and the remaining thresholds and weights mainly determine how reliability is computed and how the long-memory bank is maintained. The intended message is therefore robustness of the design rather than aggressive threshold chasing.
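For readers who want to carry the defaults of Table A1 into their own configuration files, a minimal sketch is given below. The class name `HMPDefaults`, the field names, and the `validate` checks are hypothetical and are not taken from the authors' implementation; only the numerical values come from Table A1. The `joint_reliability` helper assumes that the three fusion weights combine per-cue reliabilities in [0, 1] as a weighted sum, which is our reading of "fusion weights" rather than a formula quoted from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HMPDefaults:
    # Memory capacity and structure
    M: int = 3
    S: int = 6
    # Reliability thresholds and fusion weights
    eta_L: float = 0.75
    eta_S: float = 0.25
    eta_max: float = 0.10
    tau_a: float = 0.30
    kappa_a: float = 0.05
    tau_m: float = 9.49
    kappa_m: float = 2.00
    lambda_A: float = 0.50
    lambda_M: float = 0.30
    lambda_Q: float = 0.20
    # Prototype maintenance and Stage 2 control
    delta_adv: float = 0.05
    tau_gap: float = 0.08
    tau_spawn: float = 0.20
    tau_merge: float = 0.08
    alpha_sup: float = 1.00
    alpha_freq: float = 0.50
    alpha_idle: float = 0.30

    def validate(self) -> None:
        """Sanity checks that we assume are desirable (not from the paper)."""
        if abs(self.lambda_A + self.lambda_M + self.lambda_Q - 1.0) > 1e-9:
            raise ValueError("fusion weights should sum to 1")
        if self.eta_S > self.eta_L:
            raise ValueError("short-queue admission should not be stricter than long-memory writing")
        if self.tau_merge > self.tau_spawn:
            raise ValueError("merge threshold should not exceed spawn threshold")
        if self.M < 1 or self.S < 0:
            raise ValueError("M must be >= 1 and S must be >= 0")


def joint_reliability(cfg: HMPDefaults, r_app: float, r_mot: float, r_det: float) -> float:
    """Assumed weighted-sum fusion of appearance, motion, and detection reliability."""
    return cfg.lambda_A * r_app + cfg.lambda_M * r_mot + cfg.lambda_Q * r_det


cfg = HMPDefaults()
cfg.validate()
```

Freezing the dataclass mirrors the stated intent that the defaults are kept fixed across host trackers rather than retuned per framework.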

Appendix B. Pseudocode of HMP

For completeness, Algorithm A1 summarizes the main inference and memory-update flow of HMP.

In addition to Stage 1 matching, residual-set construction, Stage 2 recovery, reliability-controlled writing, and prototype maintenance, the pseudocode explicitly specifies how the first long-memory prototype is initialized when a track is created or confirmed, and how the short queue is cleared after long absence or track reactivation.

Track confirmation, lost-state handling, and termination still follow the host tracker, whereas HMP defines how the associated appearance-memory states are initialized, updated, and reset.
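The memory lifecycle described here (first-prototype initialization, short-queue clearing, and reliability-controlled writing) can be sketched as a small per-track container. This is an illustrative sketch under stated assumptions, not the authors' code: the class and method names are hypothetical, cosine distance on unit-norm features is assumed for prototype selection, and the long-memory update is assumed to be a convex blend with step $η_{\max} r$ followed by $ℓ_{2}$ normalization. Prototype spawning, merging, and eviction are deliberately omitted.

```python
import math
from collections import deque
from typing import Deque, List, Sequence

Vector = List[float]


def l2_normalize(v: Sequence[float]) -> Vector:
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0.0 else list(v)


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    # Inputs are assumed to be unit-norm already.
    return 1.0 - sum(x * y for x, y in zip(a, b))


class TrackMemory:
    """Hypothetical per-track HMP state: long-term prototypes plus a FIFO queue."""

    def __init__(self, first_feature: Sequence[float], S: int = 6) -> None:
        # The first long-memory prototype is the normalized creation feature.
        self.prototypes: List[Vector] = [l2_normalize(first_feature)]
        # deque(maxlen=S) drops the oldest entry automatically when full.
        self.short_queue: Deque[Vector] = deque(maxlen=S)

    def reset_short_queue(self) -> None:
        """Clear short-term evidence after a long absence or reactivation."""
        self.short_queue.clear()

    def stage2_enabled(self) -> bool:
        """Stage 2 recovery needs at least one admitted short-term observation."""
        return len(self.short_queue) > 0

    def write(self, feature: Sequence[float], r: float,
              eta_L: float = 0.75, eta_S: float = 0.25, eta_max: float = 0.10) -> None:
        f = l2_normalize(feature)
        if r >= eta_L:
            # Update only the closest prototype, with a reliability-scaled step.
            k = min(range(len(self.prototypes)),
                    key=lambda i: cosine_distance(self.prototypes[i], f))
            lam = eta_max * r
            blended = [(1.0 - lam) * p + lam * x for p, x in zip(self.prototypes[k], f)]
            self.prototypes[k] = l2_normalize(blended)
        if r >= eta_S:
            self.short_queue.append(f)
```

With the default thresholds, an observation with reliability between 0.25 and 0.75 enters only the short queue, whereas one at or above 0.75 both updates a prototype and enters the queue.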

Algorithm A1 Compact inference and memory-update flow of HMP

Require: active tracks $T$, detections $D$, long memory $P^{L}$, short queues $Q^{S}$, gated edges $E$

Ensure: updated track states and HMP memory for the next frame

for each newly initialized or newly confirmed track $τ_{i}$ without existing HMP memory do
    initialize $P_{i}^{L} \leftarrow \{norm(f_{j}^{t})\}$ using the associated detection feature
    initialize $Q_{i}^{S} \leftarrow \emptyset$ and reset prototype statistics
end for
for each track $τ_{i}$ reactivated after a long absence or unmatched beyond the short reactivation window do
    clear $Q_{i}^{S}$ and disable Stage 2 recovery until new reliable observations are admitted
end for
for each $(i, j) \in E$ do
    compute long-memory distance $d_{L}(i, j)$ and joint reliability score $r_{ij}$
end for
perform Stage 1 association using only long memory, yielding $M_{1}$
freeze $M_{1}$ and construct residual sets $T^{r}$ and $D^{r}$
construct residual gated edges $E^{r} \subseteq E$ on $T^{r} \times D^{r}$
for each $(i, j) \in E^{r}$ with valid short-term evidence do
    reuse $r_{ij}$ and compute short-queue distance $d_{S}(i, j)$
end for
initialize retained residual edge set $\tilde{E}^{r} \leftarrow \emptyset$
for each track $τ_{i} \in T^{r}$ do
    let $C_{i} = \{ j \mid (i, j) \in E^{r},\ Q_{i}^{S} \neq \emptyset,\ d_{S}(i, j) + δ_{adv} < d_{L}(i, j) \}$
    if $|C_{i}| = 1$ then
        add the sole surviving edge to $\tilde{E}^{r}$
    else if $|C_{i}| \geq 2$ then
        compute $d_{i}^{(1)}$, $d_{i}^{(2)}$, and $Δ_{i} = d_{i}^{(2)} - d_{i}^{(1)}$
        if $Δ_{i} \geq τ_{gap}$ then
            add the best candidate edge to $\tilde{E}^{r}$
        end if
    end if
end for
perform Stage 2 association only on $\tilde{E}^{r}$, yielding $M_{2}$
form final matches $M = M_{1} \cup M_{2}$
for each $(i, j) \in M$ do
    if $r_{ij} \geq η_{L}$ then
        select the closest long-term prototype and compute $λ_{ij} = η_{\max} r_{ij}$
        update the selected prototype with $λ_{ij}$ and apply $ℓ_{2}$ normalization
        maintain the long-memory bank by spawning, merging, and eviction using the pre-update $d_{L}(i, j)$
    end if
    if $r_{ij} \geq η_{S}$ then
        push $norm(f_{j}^{t})$ into $Q_{i}^{S}$ and remove the oldest entry if its length exceeds S
    end if
end for
return updated track states and memory
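The residual-edge filtering that precedes Stage 2 association in Algorithm A1 can be illustrated with a short sketch. It is written under stated assumptions rather than as the reference implementation: the function name and the dictionary-based edge representation are hypothetical, and $d_{i}^{(1)}$ and $d_{i}^{(2)}$ are assumed to be the smallest and second-smallest short-queue distances among the surviving candidates of a track. Distances are passed in precomputed, and `d_short` is assumed to contain entries only for tracks whose short queue is non-empty.

```python
from typing import Dict, List, Tuple

Edge = Tuple[int, int]  # (track index i, detection index j)


def retain_stage2_edges(d_long: Dict[Edge, float],
                        d_short: Dict[Edge, float],
                        delta_adv: float = 0.05,
                        tau_gap: float = 0.08) -> List[Edge]:
    """Keep only residual edges with a clear, unambiguous short-term advantage."""
    candidates: Dict[int, List[Tuple[float, int]]] = {}
    for (i, j), ds in d_short.items():
        dl = d_long.get((i, j))
        if dl is None:
            continue  # edge is not in the residual gated set
        # Advantage margin: short-term evidence must beat long memory by delta_adv.
        if ds + delta_adv < dl:
            candidates.setdefault(i, []).append((ds, j))

    retained: List[Edge] = []
    for i, cands in candidates.items():
        cands.sort()  # ascending short-queue distance
        if len(cands) == 1:
            retained.append((i, cands[0][1]))
        else:
            # Ambiguity gap between the best and second-best candidates.
            gap = cands[1][0] - cands[0][0]
            if gap >= tau_gap:
                retained.append((i, cands[0][1]))
    return sorted(retained)


d_long = {(0, 0): 0.40, (0, 1): 0.42, (1, 2): 0.30,
          (2, 3): 0.50, (2, 4): 0.50, (3, 5): 0.40}
d_short = {(0, 0): 0.10, (0, 1): 0.15, (1, 2): 0.28,
           (2, 3): 0.10, (2, 4): 0.30, (3, 5): 0.20}
print(retain_stage2_edges(d_long, d_short))  # [(2, 3), (3, 5)]
```

In this toy input, track 0 is rejected because its two candidates are separated by less than $τ_{gap}$, and track 1 is rejected because its short-term advantage does not exceed $δ_{adv}$. The retained edges would still need to pass through the host tracker's one-to-one assignment, since two tracks may retain the same detection.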

References

1. Bewley, A.; Ge, Z.; Ott, L.; Ramos, F.; Upcroft, B. Simple Online and Realtime Tracking. In Proceedings of the 2016 IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA, 25–28 September 2016; pp. 3464–3468.
2. Wojke, N.; Bewley, A.; Paulus, D. Simple Online and Realtime Tracking with a Deep Association Metric. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; pp. 3645–3649.
3. Wang, Z.; Zheng, L.; Liu, Y.; Li, Y.; Wang, S. Towards Real-Time Multi-Object Tracking. In Proceedings of the Computer Vision—ECCV 2020, Glasgow, UK, 23–28 August 2020; pp. 107–122.
4. Li, S.; Ren, H.; Xie, X.; Cao, Y. A Review of Multi-Object Tracking in Recent Times. IET Comput. Vis. 2025, 19, e70010.
5. Qin, Z.; Wang, L.; Zhou, S.; Fu, P.; Hua, G.; Tang, W. Towards Generalizable Multi-Object Tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 18995–19004.
6. Shim, K.; Ko, K.; Yang, Y.; Kim, C. Focusing on Tracks for Online Multi-Object Tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 10–17 June 2025; pp. 11687–11696.
7. He, L.; Liao, X.; Liu, W.; Liu, X.; Cheng, P.; Mei, T. FastReID: A PyTorch Toolbox for General Instance Re-Identification. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 9664–9667.
8. Zhang, Y.; Wang, C.; Wang, X.; Zeng, W.; Liu, W. FairMOT: On the Fairness of Detection and Re-Identification in Multiple Object Tracking. Int. J. Comput. Vis. 2021, 129, 3069–3087.
9. Wang, Y.; Li, R.; Zhang, D.; Li, M.; Cao, J.; Zheng, Z. CATrack: Condition-Aware Multi-Object Tracking with Temporally Enhanced Appearance Features. Knowl.-Based Syst. 2025, 308, 112760.
10. Du, Y.; Zhao, Z.; Song, Y.; Zhao, Y.; Su, F.; Gong, T.; Meng, H. StrongSORT: Make DeepSORT Great Again. IEEE Trans. Multimed. 2023, 25, 8725–8737.
11. Cai, J.; Xu, M.; Li, W.; Xiong, Y.; Xia, W.; Tu, Z.; Soatto, S. MeMOT: Multi-Object Tracking with Memory. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 8080–8090.
12. Gao, R.; Wang, L. MeMOTR: Long-Term Memory-Augmented Transformer for Multi-Object Tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 9867–9876.
13. Chen, L.; Ai, H.; Chen, R.; Zhuang, Z. Aggregate Tracklet Appearance Features for Multi-Object Tracking. IEEE Signal Process. Lett. 2019, 26, 1613–1617.
14. Zhang, Y.; Sun, P.; Jiang, Y.; Yu, D.; Weng, F.; Yuan, Z.; Luo, P.; Liu, W.; Wang, X. ByteTrack: Multi-Object Tracking by Associating Every Detection Box. In Proceedings of the Computer Vision—ECCV 2022, Tel Aviv, Israel, 23–27 October 2022; pp. 1–21.
15. Cao, J.; Pang, J.; Weng, X.; Khirodkar, R.; Kitani, K. Observation-Centric SORT: Rethinking SORT for Robust Multi-Object Tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 9686–9696.
16. Maggiolino, G.; Ahmad, A.; Cao, J.; Kitani, K. Deep OC-SORT: Multi-Pedestrian Tracking by Adaptive Re-Identification. In Proceedings of the 2023 IEEE International Conference on Image Processing (ICIP), Kuala Lumpur, Malaysia, 8–11 October 2023; pp. 3025–3029.
17. Stanojević, V.; Todorović, B. BoostTrack++: Using Tracklet Information to Detect More Objects in Multiple Object Tracking. Filomat 2025, 39, 5685–5702.
18. Sun, Z.; Wei, G.; Fu, W.; Ye, M.; Jiang, K.; Liang, C.; Zhu, T.; He, T.; Mukherjee, M. Multiple Pedestrian Tracking Under Occlusion: A Survey and Outlook. IEEE Trans. Circuits Syst. Video Technol. 2025, 35, 1009–1027.
19. Pang, J.; Qiu, L.; Li, X.; Chen, H.; Li, Q.; Darrell, T.; Yu, F. Quasi-Dense Similarity Learning for Multiple Object Tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual, 20–25 June 2021; pp. 164–173.
20. Cai, J.; He, Z.; Liu, Z.; Cao, Y. DiPerceiveNet: A Bidirectional Cross-Scale Perception Network for Vehicle Re-Identification. Pattern Recogn. 2026, 178, 113476.
21. Wojke, N.; Bewley, A. Deep Cosine Metric Learning for Person Re-Identification. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA, 12–15 March 2018; pp. 748–756.
22. Meinhardt, T.; Kirillov, A.; Leal-Taixé, L.; Feichtenhofer, C. TrackFormer: Multi-Object Tracking with Transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 8834–8844.
23. Zeng, F.; Dong, B.; Zhang, Y.; Wang, T.; Zhang, X.; Wei, Y. MOTR: End-to-End Multiple-Object Tracking with Transformer. In Proceedings of the Computer Vision—ECCV 2022, Tel Aviv, Israel, 23–27 October 2022; pp. 659–675.
24. Liu, C.; Zhang, J.; Chen, K.; Wang, M.; Zou, Z.; Shi, Z. Remote Sensing Spatiotemporal Vision–Language Models: A Comprehensive Survey. IEEE Geosci. Remote Sens. Mag. 2025, 14, 383–423.
25. Kalman, R.E. A New Approach to Linear Filtering and Prediction Problems. J. Basic Eng. 1960, 82, 35–45.
26. Kuhn, H.W. The Hungarian Method for the Assignment Problem. Nav. Res. Logist. Q. 1955, 2, 83–97.
27. Dendorfer, P.; Osep, A.; Milan, A.; Schindler, K.; Cremers, D.; Reid, I.; Roth, S.; Leal-Taixé, L. MOTChallenge: A Benchmark for Single-Camera Multiple Target Tracking. Int. J. Comput. Vis. 2021, 129, 845–881.
28. Luiten, J.; Osep, A.; Dendorfer, P.; Torr, P.H.S.; Geiger, A.; Leal-Taixé, L.; Leibe, B. HOTA: A Higher Order Metric for Evaluating Multi-Object Tracking. Int. J. Comput. Vis. 2021, 129, 548–578.
29. Bernardin, K.; Stiefelhagen, R. Evaluating Multiple Object Tracking Performance: The CLEAR MOT Metrics. EURASIP J. Image Video Process. 2008, 2008, 246309.
30. Ristani, E.; Solera, F.; Zou, R.S.; Cucchiara, R.; Tomasi, C. Performance Measures and a Data Set for Multi-Target, Multi-Camera Tracking. In Proceedings of the Computer Vision—ECCV 2016 Workshops, Amsterdam, The Netherlands, 11–14 October 2016; pp. 17–35.
31. Lin, T.Y.; Maire, M.; Belongie, S.; Bourdev, L.; Girshick, R.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Proceedings of the Computer Vision–ECCV 2014, Zurich, Switzerland, 6–12 September 2014; pp. 740–755.
32. Zhang, S.; Benenson, R.; Schiele, B. CityPersons: A Diverse Dataset for Pedestrian Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 4457–4465.
33. Ess, A.; Leibe, B.; Schindler, K.; Van Gool, L. A Mobile Vision System for Robust Multi-Person Tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Anchorage, AK, USA, 23–28 June 2008; pp. 1–8.
34. Zhang, S.; Xie, Y.; Wan, J.; Xia, H.; Li, S.Z.; Guo, G. WiderPerson: A Diverse Dataset for Dense Pedestrian Detection in the Wild. IEEE Trans. Multimed. 2020, 22, 380–393.
35. Sun, P.; Cao, J.; Jiang, Y.; Yuan, Z.; Bai, S.; Kitani, K.; Luo, P. DanceTrack: Multi-Object Tracking in Uniform Appearance and Diverse Motion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 20961–20970.
36. Yu, F.; Chen, H.; Wang, X.; Xian, W.; Chen, Y.; Liu, F.; Madhavan, V.; Darrell, T. BDD100K: A Diverse Driving Dataset for Heterogeneous Multitask Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 2636–2645.

Figure 1. Schematic illustration of the visibility transition of a target under progressive occlusion. The figure shows five consecutive appearance states (fully visible, partial occlusion, severe occlusion, partial reappearance, and fully visible again) to illustrate how the reliability of appearance observations changes over time and why indiscriminate single-prototype updating can either absorb contaminated evidence or become stale after reappearance.

Figure 2. Overall framework of Hierarchical Multi-Prototype Appearance Memory (HMP) in a plug-and-play tracking pipeline. The detector, re-identification (ReID) extractor, motion prediction/gating pipeline, and one-to-one assignment primitive remain unchanged, while the proposed module redesigns the appearance-memory path and stage-wise evidence usage: stable identity cues are organized in a multi-prototype long-term memory for frozen primary matching in Stage 1, and recent reliable observations are retained in a short first-in-first-out (FIFO) queue for risk-controlled recovery on the residual set in Stage 2.

Figure 3. Reliability-controlled memory-update mechanism of HMP. A matched observation is evaluated by a unified joint reliability score and then admitted into different memory components according to threshold conditions: highly reliable evidence is eligible for long-memory updating and can also be cached in the short queue, whereas moderately reliable evidence bypasses long-memory writing but may still be retained in the short queue. Prototype maintenance is performed in a deterministic manner to preserve memory purity and role separation.

Figure 4. Sensitivity to the number of long-term prototypes M. The short queue is disabled ($S = 0$) to isolate the effect of the multi-prototype long-term memory. IDF1 (higher is better) and IDSW (lower is better) are reported jointly to reflect identity continuity.

Figure 5. Sensitivity to the short-queue length S. The number of long-term prototypes is fixed at $M = 3$, and only S is varied. IDF1 and IDSW are reported jointly to assess how short-term transitional evidence affects identity continuity.

Figure 6. Qualitative comparison between the baseline tracker and HMP under occlusion and interleaved interactions. Top row (Baseline): the reference tracker. Bottom row (HMP): the proposed method. Green boxes denote correctly maintained target identities, whereas red boxes denote tracking failures or identity switches. Near reappearance, the baseline is more prone to identity switches, whereas HMP better preserves identity continuity in these difficult local regions.

Table 1. Method-level positioning of Hierarchical Multi-Prototype Appearance Memory (HMP) relative to representative multi-object tracking (MOT) directions.

Method Family	Main Optimization Focus	Difference from HMP
Upstream representation enhancement	Improves the detector or re-identification (ReID) embeddings through stronger training, better backbones, or joint detection-and-embedding learning.	HMP keeps the detector, ReID extractor, motion model, and assignment primitive fixed, and improves identity continuity only by redesigning track-level appearance memory.
Generic feature bank or memory extension	Stores more historical embeddings or longer appearance histories to enlarge temporal context.	HMP does not treat all historical features as homogeneous memory. It separates stable long-term identity anchors from short-term transitional evidence and assigns them different association roles.
Heuristic gated updating	Uses confidence thresholds or local rules to suppress unreliable memory updates.	HMP uses a shared joint reliability signal to govern long-memory writing, short-queue admission, and prototype maintenance, thereby linking representation quality and update control.
Generic staged association	Performs multiple matching passes to recover unmatched detections or tracks.	HMP binds each stage to a specific evidence type: Stage 1 uses stable long-memory evidence and is frozen, whereas Stage 2 uses short-term evidence only for conservative residual recovery.
Transformer-based or memory-augmented tracking	Learns temporal association through object queries, historical tokens, memory attention, or end-to-end detection-and-tracking architectures.	HMP does not introduce learnable query memory or redesign the overall architecture. It uses explicit track-level appearance prototypes and a first-in-first-out (FIFO) short queue under fixed upstream modules, with reliability-controlled writing and frozen stage-specific evidence usage.

Table 2. Comparison with representative online multi-object tracking methods on the MOT17 test set. HOTA denotes higher-order tracking accuracy, MOTA denotes multiple object tracking accuracy, IDF1 denotes identity F1 score, and IDSW denotes identity switches.

Tracker	HOTA↑	MOTA↑	IDF1↑	IDSW↓
FairMOT	59.3	73.7	72.3	3303
ByteTrack	63.1	80.3	77.3	2196
OC-SORT	63.2	78.0	77.5	1950
StrongSORT++	64.4	79.6	79.5	1194
Deep OC-SORT	64.9	79.4	80.6	1023
BoT-SORT-ReID	65.0	80.5	80.2	1212
BoostTrack++	66.6	80.7	82.2	1062
HMP (Ours)	66.6	81.0	82.6	882

Table 3. Comparison with representative online multi-object tracking methods on the MOT20 test set.

Tracker	HOTA↑	MOTA↑	IDF1↑	IDSW↓
FairMOT	54.6	61.8	67.3	5243
ByteTrack	61.3	77.8	75.2	1223
OC-SORT	62.1	75.5	75.9	913
StrongSORT++	62.6	73.8	77.0	770
BoT-SORT-ReID	63.3	77.8	77.5	1313
Deep OC-SORT	63.9	75.6	79.2	779
BoostTrack++	66.4	77.7	82.0	762
HMP (Ours)	65.5	77.5	80.8	752

Table 4. Stepwise component ablation on the MOT17 validation split under BoT-SORT-ReID. B1 denotes reliability-controlled memory writing, A1 denotes multi-prototype long-term memory, and A2 denotes risk-controlled short-queue-based residual recovery.

Setting	MOTA↑	IDF1↑	HOTA↑	IDSW↓
Baseline	78.4	81.8	69.2	135
Baseline + B1	78.4	82.1	69.4	129
Baseline + B1 + A1	78.6	82.5	69.5	125
Baseline + B1 + A1 + A2 (HMP)	78.5	82.7	69.4	121

Table 5. Stepwise component ablation on the MOT17 validation split under Deep OC-SORT. B1 denotes reliability-controlled memory writing, A1 denotes multi-prototype long-term memory, and A2 denotes risk-controlled short-queue-based residual recovery.

Setting	MOTA↑	IDF1↑	HOTA↑	IDSW↓
Baseline	79.8	82.5	70.2	187
Baseline + B1	79.9	82.8	70.2	176
Baseline + B1 + A1	79.8	83.0	70.8	175
Baseline + B1 + A1 + A2 (HMP)	80.0	83.1	70.9	173

Table 6. Stage-policy ablation on the MOT17 validation split under BoT-SORT-ReID. All variants use the same reliability-controlled writing, multi-prototype long memory, and short queue as HMP; only the Stage 2 usage policy is changed.

Setting	MOTA↑	IDF1↑	HOTA↑	IDSW↓
HMP full	78.5	82.7	69.4	121
Without Stage 1 freezing	78.3	82.2	69.2	134
Without advantage margin	78.4	82.4	69.3	131
Without ambiguity gap	78.4	82.5	69.3	128

Table 7. Summary of pure module gain under fixed upstream pipelines on the MOT17 validation split. Each row reports the performance change obtained by adding HMP to the corresponding host tracker while keeping the detector, ReID extractor, motion model, and assignment primitive unchanged.

Host Tracker with HMP	ΔMOTA	ΔIDF1	ΔHOTA	ΔIDSW
BoT-SORT-ReID + HMP	$+0.1$	$+0.9$	$+0.2$	$-14$ ($-10.4\%$)
Deep OC-SORT + HMP	$+0.2$	$+0.6$	$+0.7$	$-14$ ($-7.5\%$)
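The entries of Table 7 can be reproduced from the Baseline and full-HMP rows of Tables 4 and 5. The short check below is a sketch of that arithmetic; the variable and function names are illustrative, and the relative IDSW change is assumed to be computed with respect to the baseline count.

```python
# (MOTA, IDF1, HOTA, IDSW) copied from the first and last rows of Tables 4 and 5.
baseline = {
    "BoT-SORT-ReID": (78.4, 81.8, 69.2, 135),
    "Deep OC-SORT": (79.8, 82.5, 70.2, 187),
}
with_hmp = {
    "BoT-SORT-ReID": (78.5, 82.7, 69.4, 121),
    "Deep OC-SORT": (80.0, 83.1, 70.9, 173),
}


def module_gain(base, full):
    """Return (dMOTA, dIDF1, dHOTA, dIDSW, relative dIDSW in percent)."""
    d_mota, d_idf1, d_hota = (round(f - b, 1) for b, f in zip(base[:3], full[:3]))
    d_idsw = full[3] - base[3]
    rel_idsw = round(100.0 * d_idsw / base[3], 1)
    return d_mota, d_idf1, d_hota, d_idsw, rel_idsw


for host in baseline:
    print(host, module_gain(baseline[host], with_hmp[host]))
```

Both host trackers lose 14 identity switches in absolute terms; the relative reduction differs only because the baseline counts differ.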

Table 8. Cross-dataset generalization experiment on DanceTrack-val under Deep OC-SORT. The comparison is conducted under the same host-tracker setting, and HMP is added as a track-level appearance-memory module.

Method	MOTA↑	HOTA↑	IDF1↑	IDSW↓
Deep OC-SORT	88.5	58.51	59.03	1587
Deep OC-SORT + HMP	88.9	58.70	59.43	1543

Table 9. Sensitivity analysis of reliability-fusion weights on the MOT17 validation split under BoT-SORT-ReID. All other HMP parameters are fixed.

Setting	$λ_{A}$	$λ_{M}$	$λ_{Q}$	MOTA↑	IDF1↑	HOTA↑	IDSW↓
Balanced	0.33	0.33	0.34	78.3	82.5	69.3	124
Default	0.50	0.30	0.20	78.5	82.7	69.4	121
Appearance-heavy	0.60	0.25	0.15	78.4	82.6	69.4	123
Motion-heavy	0.40	0.40	0.20	78.3	82.4	69.3	125
Detection-heavy	0.40	0.25	0.35	78.5	82.4	69.2	127

Table 10. Sensitivity analysis of memory-admission thresholds on the MOT17 validation split under BoT-SORT-ReID. All other HMP parameters are fixed.

Setting	$η_{L}$	$η_{S}$	MOTA↑	IDF1↑	HOTA↑	IDSW↓
Loose writing	0.65	0.20	78.5	82.4	69.2	126
Moderately loose	0.70	0.20	78.4	82.5	69.3	124
Default	0.75	0.25	78.5	82.7	69.4	121
Moderately strict	0.80	0.25	78.3	82.5	69.3	123
Strict writing	0.85	0.30	78.2	82.2	69.2	125

Table 11. Runtime overhead of HMP under matched graphics processing unit (GPU) workstation execution settings. Frames per second (FPS) denotes end-to-end tracking throughput measured in our implementation. Higher FPS is better, while lower relative slowdown is better. The deterministic HMP memory footprint is analyzed below as incremental feature-state storage.

Framework	Baseline FPS↑	+HMP FPS↑	Relative Slowdown↓
BoT-SORT-ReID	7.1	6.4	9.9%
Deep OC-SORT	22.2	19.3	13.1%
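The relative slowdown column of Table 11 is consistent with the drop in throughput expressed as a fraction of the baseline FPS. The sketch below assumes that definition (it is inferred from the reported numbers, not quoted from the paper) and also derives the implied extra time per frame; the function names are illustrative.

```python
def relative_slowdown(baseline_fps: float, hmp_fps: float) -> float:
    """Throughput drop as a percentage of the baseline FPS."""
    return 100.0 * (baseline_fps - hmp_fps) / baseline_fps


def added_latency_ms(baseline_fps: float, hmp_fps: float) -> float:
    """Extra end-to-end time per frame, in milliseconds."""
    return 1000.0 / hmp_fps - 1000.0 / baseline_fps


# FPS values copied from Table 11.
for name, base_fps, hmp_fps in [("BoT-SORT-ReID", 7.1, 6.4), ("Deep OC-SORT", 22.2, 19.3)]:
    print(name,
          round(relative_slowdown(base_fps, hmp_fps), 1),
          round(added_latency_ms(base_fps, hmp_fps), 1))
```

Because the two hosts run at very different base speeds, a larger relative slowdown on Deep OC-SORT still corresponds to a smaller absolute per-frame cost than on BoT-SORT-ReID.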

Table 12. Occlusion-stage interpretation of HMP behavior. The table links the visibility stages in Figure 1 to the dominant identity risk, the corresponding HMP mechanism, and the remaining limitation.

Visibility Stage	Dominant Identity Risk	HMP Response	Remaining Limitation
Fully visible	Normal appearance variation	Long-memory prototypes maintain stable identity anchors.	Limited gain is expected because the baseline is already reliable.
Partial occlusion	Contaminated features enter memory	Reliability-controlled writing suppresses unreliable memory updates.	If the detection box is severely polluted, appearance evidence may still be weak.
Severe occlusion	Missing or highly degraded observations	Long memory is preserved instead of being overwritten by poor evidence.	HMP cannot recover targets that are persistently missed by the detector.
Partial reappearance	Long memory may be temporarily stale	Short queue supports conservative residual recovery after Stage 1.	Recovery is rejected when the short-term advantage is ambiguous.
Interleaved interaction	Similar neighboring identities cause identity switches	Frozen Stage 1 and ambiguity-gap filtering prevent risky short-term rematching.	Very similar identities may remain ambiguous without stronger upstream features.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
