1. Introduction
With the rapid proliferation of unmanned vehicles and autonomous robotic systems, robust visual perception has become a cornerstone of safe navigation and intelligent human–robot interaction in complex environments. Most current research focuses on front-view perception, whereas rear-view human tracking and re-identification in robotic visual sensing remain underexplored. Yet rear-view monitoring is indispensable in operational scenarios such as autonomous personal assistants, security patrol robots, and human–robot collaborative following [1, 2, 3, 4]. In these applications, the ability to robustly track and re-identify a trailing human target, even when the target temporarily falls behind, changes pace, or moves through crowded spaces, is fundamental to maintaining seamless interaction and preventing safety hazards. Consequently, developing reliable rear-view tracking frameworks for human targets in unstructured environments is both a pressing research frontier and a basic requirement for the next generation of intelligent mobile robots.
Despite the impressive performance of modern tracking systems under controlled conditions, open-world scenarios introduce challenges that severely degrade system performance. First, adverse environmental conditions, particularly rainy nights, produce intense specular reflections, uneven illumination, and excessive sensor noise, which collectively corrupt appearance cues and reduce feature discriminability. These visual artifacts undermine the stability of common appearance models. Second, in dynamic scenes populated with pedestrians and mobile obstacles, targets frequently undergo severe or even complete occlusions that disrupt continuity and confuse conventional trackers. Such adverse conditions expose fundamental deficits in existing perception frameworks, leading to tracking failures that jeopardize real-time operation [5, 6, 7, 8]. Taken together, these factors highlight a critical gap between idealized benchmark performance and the demands of real-world deployment.
Recent advances in deep learning–based object detection and tracking have achieved remarkable precision on standard benchmarks. However, several fundamental limitations prevent their direct application to rear-view tracking on unmanned vehicles, particularly under adverse conditions. First, while widely used frameworks such as DeepSORT excel in clear environments, they become unstable in densely populated or occlusion-heavy scenes [9, 10]. Second, contemporary models are highly susceptible to feature contamination and trajectory drift. For instance, Sun et al. [11] addressed low-light challenges with feature-enhanced Siamese networks that integrate dynamic templates for robust feature updates. However, these methods still implicitly rely on relatively stable illumination and frequently fail when visibility degrades severely (e.g., specular reflections on rainy nights). Third, addressing temporal discontinuity often comes at a high computational cost. Wan et al. [12] proposed using 3D spatiotemporal graphs to reconnect fragmented target paths by integrating temporal motion priors. While effective for sustained tracking, such systems are computationally intensive and poorly suited to onboard processing on edge-constrained platforms such as drones or small autonomous vehicles. Consequently, when a target briefly disappears and reappears, conventional lightweight mechanisms typically resort to blind global search. This results in expensive feature matching across the entire image (often exceeding 90 ms per frame [13]), which compromises real-time responsiveness and makes such approaches inadequate for mission-critical autonomous navigation.
Recent studies on robust target tracking have explored multimodal fusion and adaptive representation learning from different sensing perspectives. For example, transformer-based RGBT tracking frameworks [14] have demonstrated the effectiveness of combining cross-modal and spatiotemporal information for robust target association under challenging conditions. TIR-oriented studies [15] have further emphasized the importance of fine-grained feature modeling and template reconstruction for suppressing appearance degradation and improving temporal robustness. In addition, recent review literature on RGBT tracking [16] has highlighted the broader trend toward stronger fusion strategies, adaptive template mechanisms, and context-aware robustness in complex environments. These developments provide useful theoretical context for the present study.
While these state-of-the-art methodologies collectively establish a strong foundation for robust multimodal or template-aware tracking, the focus of our work differs from RGBT or pure TIR tracking. Rather than relying on visible–thermal cross-modal sensing, this paper addresses rear-view robotic target tracking in a visible-light sensing setting. In this constrained context, robustness must be achieved purely through algorithmic design to overcome severe illumination transitions, specular reflection, occlusion, and ego-motion constraints. The proposed method formulates a unified cue-level multimodal tracking framework by integrating complementary information from appearance similarity, geometric consistency, and spatiotemporal prediction, together with an adaptive reference-model update mechanism. In this sense, our framework can be viewed as a lightweight and deployment-oriented alternative to computationally intensive spatiotemporal fusion and template-management strategies, designed for real-time robotic perception under edge-computing constraints.
To mitigate the aforementioned limitations, this paper proposes a lightweight and robust rear-view target tracking and re-identification framework tailored for unmanned vehicles in complex dynamic environments. Specifically, we integrate a Kalman-driven spatiotemporal prediction mechanism that actively maintains the target’s kinematic continuity during severe occlusions. Rather than relying on computationally prohibitive global matching when the target reappears, our system triggers an active local search strategy. Within this dynamically constrained region, we construct a robust multi-factor descriptor by fusing quantized color histograms and geometric kinematics, ensuring high discriminability even under adverse weather conditions. Furthermore, an adaptive model update strategy with a dynamic learning rate is introduced to selectively assimilate valid features and actively purge background noise, thereby preventing trajectory drift caused by specular reflections on rainy nights.
To avoid ambiguity in the terminology used in the title, we clarify the meanings of the two core concepts adopted in this work. First, “spatiotemporal prediction” in this paper does not merely refer to conventional Kalman-based state extrapolation or passive trajectory smoothing. Instead, it denotes a coupled mechanism that integrates motion continuity modeling, ego-motion-aware observation uncertainty adjustment, occlusion-triggered suspension of measurement updates, and covariance-constrained local target re-identification. Second, “multimodal fusion” in the present study does not specifically indicate multi-sensor fusion such as RGB-T tracking, but rather the joint use of heterogeneous yet complementary cues, including appearance similarity, geometric consistency, and motion-prediction information, within a unified rear-view human tracking framework. These terms are used to emphasize that the proposed method enhances robustness under illumination variation, specular reflection, occlusion, and robotic ego-motion by combining temporal prediction with multiple complementary tracking cues. The detailed formulations of these components are presented in Section 2.
The main contributions of this paper are summarized as follows:
A lightweight spatiotemporal prediction framework for severe occlusion: We propose an active tracking mechanism integrating a Kalman prior. Unlike traditional passive trackers, it maintains temporal continuity and dynamically constrains the re-identification search space during complete target loss. This reduces the spatial search from the full image to a covariance-constrained local region, avoiding the latency of blind global search.
A multimodal feature fusion and adaptive update strategy: We design a robust target descriptor combining appearance features (quantized HSV histograms) and spatiotemporal geometric constraints. Paired with a piecewise dynamic learning rate (ηk) that actively freezes template updates during severe feature degradation, this strategy fundamentally prevents feature contamination and trajectory drift in exceptionally harsh environments (e.g., severe specular reflections on rainy nights).
Extensive real-world validation in adverse weather: We deploy and evaluate the proposed system on a custom Mecanum-wheeled unmanned robot. Experimental results demonstrate that the framework achieves a peak precision of 94.2% and a tracking success rate of 93.4%. Notably, in extreme rainy night scenarios, the system reduces the average tracking error by 35% (maintaining a Center Location Error below 11 pixels) and achieves a rapid re-identification latency of 72.83 ms during occlusion recovery, proving its high robustness and real-time engineering feasibility.
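The appearance part of the descriptor named in the second contribution (quantized HSV histograms) can be illustrated with a minimal sketch. The bin counts, the helper names `hsv_histogram` and `bhattacharyya`, and the use of the Bhattacharyya coefficient as the similarity measure are assumptions made for illustration only; the formulation actually used by the framework is the one given in Section 2.

```python
import colorsys
import math


def hsv_histogram(pixels, h_bins=8, s_bins=4, v_bins=4):
    """Quantized, L1-normalized HSV histogram of (r, g, b) pixels in 0..255.

    The bin counts are illustrative assumptions, not values from the paper.
    """
    hist = [0.0] * (h_bins * s_bins * v_bins)
    count = 0
    for r, g, b in pixels:
        h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        # Clamp so that a channel value of exactly 1.0 lands in the last bin.
        hi = min(int(h * h_bins), h_bins - 1)
        si = min(int(s * s_bins), s_bins - 1)
        vi = min(int(v * v_bins), v_bins - 1)
        hist[(hi * s_bins + si) * v_bins + vi] += 1.0
        count += 1
    if count == 0:
        return hist
    return [x / count for x in hist]


def bhattacharyya(p, q):
    """Bhattacharyya coefficient of two normalized histograms (1.0 = identical)."""
    return sum(math.sqrt(a * b) for a, b in zip(p, q))


if __name__ == "__main__":
    template = hsv_histogram([(200, 30, 30)] * 50 + [(30, 30, 200)] * 50)
    candidate = hsv_histogram([(200, 30, 30)] * 60 + [(30, 30, 200)] * 40)
    print(round(bhattacharyya(template, candidate), 3))
```

Coarse quantization keeps the descriptor small and cheap to compare, which suits an embedded platform; the trade-off is that nearby colors falling on opposite sides of a bin boundary are treated as different.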
3. Results
In this section, we analyze the experimental results obtained from the proposed target model update mechanism. The experiments were conducted under varying environmental conditions to evaluate the effectiveness, adaptability, and stability of the model during dynamic changes.
3.1. Dataset and Experimental Protocol
To rigorously evaluate the proposed tracking framework under complex real-world robotic sensing conditions, we constructed a custom real-world evaluation dataset using the proposed robotic platform. Because the proposed framework relies on explicit dynamic feature updating and Kalman-based motion modeling rather than end-to-end retraining on the collected sequences, this dataset is used strictly for evaluation rather than being divided into conventional training, validation, and testing subsets. All compared methods were evaluated on the same sequence set under consistent evaluation conditions to ensure a fair comparison.
As summarized in Table 2, the dataset comprises 51 challenging video sequences collected under three representative scenario categories, each designed to assess different aspects of tracking performance. Specifically, the dataset includes 26 nighttime post-rain reflection sequences, 9 outdoor dynamic-lighting sequences, and 16 indoor corridor sequences involving complete occlusion and reappearance events. Each sequence lasts approximately 20–30 s. The nighttime post-rain subset was recorded after rainfall had stopped, when wet road surfaces and strong streetlight reflections created severe specular interference. The outdoor dynamic-lighting subset captures abrupt brightness transitions as the target moves between direct sunlight and building-shadow regions. The indoor corridor subset contains complete occlusion–reappearance events caused by corner-induced target disappearance, which are used to evaluate target re-identification performance and recovery latency.
Furthermore, to ensure the reliability of the quantitative evaluation, the ground truth for all sequences was generated through manual frame-by-frame annotation using LabelImg (version 1.8.6), followed by consistency checking. Since the objective of this study is robotic target-following of a designated subject rather than generic multi-person tracking or recognition, the collected sequences mainly consist of repeated recordings of the same target subject under different environmental conditions. The detailed annotation protocol and the ground-truth generation process for the different evaluation metrics are summarized in Table 3.
3.2. Hardware and Software Experimental Environment
This research employs a Raspberry Pi 4B single-board computer (Raspberry Pi Ltd., Cambridge, UK) as the core embedded computing platform. The hardware system comprises an omnidirectional Mecanum-wheeled chassis, an aluminum alloy frame, and a binocular stereo camera. The software environment is built upon Raspbian OS, with Python 3.9.18 (Python Software Foundation, Wilmington, DE, USA) utilized for the end-to-end development of the system.
At the perception layer, the YOLOv8n lightweight detection model is deployed for inference based on the PyTorch 1.12.0 framework. To address the constraints of embedded hardware, the ARM Neon instruction set is leveraged to optimize the floating-point computational performance of the Broadcom BCM2711 (Cortex-A72, 1.5 GHz) CPU, thereby ensuring the fulfillment of real-time processing requirements.
Experimental Parameter Settings:
During the experiments, the binocular stereo camera is mounted at the top-front of the robot, with the acquisition resolution configured at 1280 × 720. The closed-loop control frequency of the system is stabilized at 30 Hz. The control layer implements a PID algorithm integrated with the kinematic model of the Mecanum wheels to achieve coordinated regulation of the robot’s linear velocity and yaw angle. All reported runtime-related results in this study, including the re-identification latency discussed later, were measured on the deployed Raspberry Pi 4B robotic platform rather than on a separate high-performance GPU workstation (see Figure 4).
3.3. Evaluation Metrics
Frame-level precision, recall, and F1-score are calculated based on the successful target localization results over a sequence of length $N$. Center-based localization error and temporal stability metrics are employed to reflect the rear-view robotic tracking performance under dynamic noise.
Let the center of the predicted bounding box at frame $k$ be denoted by $(x_k^{p}, y_k^{p})$, and the center of the ground-truth bounding box be $(x_k^{g}, y_k^{g})$. The frame-wise center localization error $e_k$ is computed as the Euclidean distance:
$$e_k = \sqrt{(x_k^{p} - x_k^{g})^2 + (y_k^{p} - y_k^{g})^2}$$
The overall Center Location Error (CLE) over the sequence is calculated as:
$$\mathrm{CLE} = \frac{1}{N} \sum_{k=1}^{N} e_k$$
The Tracking Success Rate (TSR) is defined as the percentage of frames in which the target is successfully localized. Localization at frame $k$ is considered successful if $e_k$ strictly falls below a distance threshold $\tau$. TSR is formulated as:
$$\mathrm{TSR} = \frac{1}{N} \sum_{k=1}^{N} \mathbb{1}(e_k < \tau) \times 100\%$$
where $\mathbb{1}(\cdot)$ is the indicator function. Given the acquisition resolution in our experiments, the distance threshold is uniformly set to $\tau = 20$ pixels across all compared methods.
The stability metric is defined as the standard deviation of the localization error over the successfully tracked frames. Let $\Omega$ denote the set of successfully tracked frames, with $N_s$ being the total number of frames in $\Omega$. The stability is computed as:
$$\mathrm{Stability} = \sqrt{\frac{1}{N_s} \sum_{k \in \Omega} (e_k - \bar{e})^2}$$
where $\bar{e}$ is the mean CLE of the frames in $\Omega$. A lower stability value indicates smaller temporal fluctuation of localization error and therefore smoother tracking performance.
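As a worked illustration of the three center-based metrics defined above, the following sketch computes CLE, TSR, and stability from per-frame box centers. The function name `tracking_metrics` is hypothetical, and the use of the population standard deviation (dividing by the number of successful frames rather than that number minus one) is an assumption; the 20-pixel threshold is the value stated in the text.

```python
import math


def tracking_metrics(pred_centers, gt_centers, tau=20.0):
    """Return (CLE in px, TSR in %, stability in px) for one sequence.

    pred_centers / gt_centers: equal-length lists of (x, y) box centers.
    tau: success threshold in pixels (20 px in the experiments).
    """
    if len(pred_centers) != len(gt_centers) or not pred_centers:
        raise ValueError("inputs must be non-empty and of equal length")

    # Frame-wise Euclidean center error.
    errors = [
        math.hypot(px - gx, py - gy)
        for (px, py), (gx, gy) in zip(pred_centers, gt_centers)
    ]
    n = len(errors)
    cle = sum(errors) / n

    # A frame counts as a success only if its error is strictly below tau.
    ok = [e for e in errors if e < tau]
    tsr = 100.0 * len(ok) / n

    # Stability: standard deviation of the error over successful frames
    # (population form, an assumption of this sketch).
    if ok:
        mean_ok = sum(ok) / len(ok)
        stability = math.sqrt(sum((e - mean_ok) ** 2 for e in ok) / len(ok))
    else:
        stability = float("nan")
    return cle, tsr, stability


if __name__ == "__main__":
    pred = [(0.0, 0.0), (3.0, 4.0), (0.0, 30.0)]
    gt = [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0)]
    print(tracking_metrics(pred, gt))
```

In the small example the errors are 0, 5, and 30 px, so two of three frames are successes and the stability is computed over the first two frames only.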
For the indoor occlusion subset, the re-identification latency is strictly measured as the time elapsed from the exact frame the occluded target re-enters the field of view until the tracker successfully recovers the target ID. The reported re-identification latency was measured on the deployed Raspberry Pi 4B robotic platform and averaged across multiple occlusion-reappearance cycles.
3.4. Effectiveness of the Model Update Mechanism
To verify the performance of the proposed target model update mechanism, a comparative experiment was conducted involving the Kalman Filter, YOLOv8-based tracker, YOLOv11-based tracker, and the Proposed Method.
To ensure a fair comparison, all baseline models were evaluated under strictly identical hardware conditions and evaluation protocols. The YOLOv8- and YOLOv11-based baselines were both implemented in tracking-by-detection mode using the same standard BoT-SORT association framework. During the inference phase, the original 1280 × 720 video frames were resized to a detector input of 640 × 640 using standard letterbox preprocessing to preserve the aspect ratio. The resulting predicted bounding boxes were subsequently mapped back to the original image coordinates for quantitative evaluation. The traditional Kalman-filter baseline was implemented as a constant-velocity filtering method without appearance-based re-identification. No scenario-specific re-tuning was applied during the evaluation process, ensuring that all compared methods were tested on the same annotated sequences under identical conditions. The detailed implementation configurations and fairness protocols for all baseline methods are explicitly summarized in Table 4.
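The letterbox step and the mapping of predicted boxes back to the original image can be made concrete with a short sketch. It shows generic letterbox geometry with centered padding; the function names are hypothetical, and the exact rounding used by any particular detector implementation may differ slightly.

```python
def letterbox_params(src_w, src_h, dst=640):
    """Scale factor and per-side padding for an aspect-preserving resize
    of a src_w x src_h frame into a dst x dst square with centered padding."""
    scale = min(dst / src_w, dst / src_h)
    new_w, new_h = round(src_w * scale), round(src_h * scale)
    pad_x = (dst - new_w) / 2.0
    pad_y = (dst - new_h) / 2.0
    return scale, pad_x, pad_y


def unletterbox_box(box, scale, pad_x, pad_y):
    """Map an (x1, y1, x2, y2) box from letterboxed coordinates back to
    the coordinates of the original frame."""
    x1, y1, x2, y2 = box
    return (
        (x1 - pad_x) / scale,
        (y1 - pad_y) / scale,
        (x2 - pad_x) / scale,
        (y2 - pad_y) / scale,
    )


if __name__ == "__main__":
    scale, pad_x, pad_y = letterbox_params(1280, 720)
    print(scale, pad_x, pad_y)
    print(unletterbox_box((100, 240, 200, 440), scale, pad_x, pad_y))
```

For a 1280 × 720 frame the scale is 0.5, so the image occupies 640 × 360 pixels and 140 rows of padding are added above and below. A center error measured in detector coordinates would therefore be half the error in original pixels, which is why the boxes must be mapped back before applying the 20-pixel threshold.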
The evaluation metrics include Precision (%), Recall (%), F1-score (%), Stability, and Tracking Success Rate (%). The corresponding results, summarized in Table 5 and Figure 5, demonstrate the advantages of the proposed method:
Detection Accuracy: The proposed method achieves a peak precision of 94.2%, outperforming the YOLOv8-based (92.1%) and YOLOv11-based (90.8%) trackers. This indicates superior feature extraction and target localization capabilities in complex backgrounds.
Tracking Reliability: Our approach reaches a tracking success rate of 93.4%, which is 17.3 percentage points higher than the 76.1% achieved by the traditional Kalman Filter. This improvement highlights the effectiveness of the model update mechanism in preventing target loss.
System Stability: The proposed method achieves the lowest stability value (4.7 px), whereas the Kalman Filter shows the largest temporal fluctuation (12.4 px). This indicates consistent performance during dynamic target changes and complex maneuvers of the Mecanum-wheeled chassis, and it supports the effectiveness of the ego-motion-aware dynamic observation noise model introduced in Section 2.3, which suppresses measurement jitter during omnidirectional movements.
It should be noted that the observed difference between YOLOv8 and YOLOv11 in this study is scenario- and configuration-dependent rather than a universal ranking between the two detector generations. Since both baselines were evaluated using the same BoT-SORT tracker, the same parameter configuration, the same detector input size, and the same evaluation protocol, the performance gap mainly reflects the stability of the detector outputs under the present rear-view robotic tracking setting. In our experiments, YOLOv11 exhibited more noticeable bounding-box jitter and larger frame-to-frame scale variation than YOLOv8, especially under illumination transitions and specular reflections. These fluctuations propagated into the downstream association stage and led to more frequent matching uncertainties, resulting in slightly inferior F1-score, stability, and tracking success rate compared with YOLOv8.
3.5. Ablation Analysis of Key Components
To further verify the contribution of the key modules in the proposed framework, an ablation analysis was conducted by disabling one component at a time while keeping the remaining settings unchanged. The evaluated components include the geometric cue, the local-search constraint, the occlusion-triggered update-suspension mechanism, and the adaptive reference-model update strategy. All variants were evaluated on the same 51-sequence evaluation set under the same detector setting, initialization strategy, and evaluation protocol, without variant-specific re-tuning.
As summarized in Table 6, the full model achieves the best overall balance among all tested variants. Removing any key component results in measurable performance degradation, confirming the coordinated contribution of the proposed multi-module design.
Specifically, removing the geometric cue slightly reduces the re-identification latency (from 72.83 ms to 71.45 ms), likely due to the simplified matching process. However, this marginal speedup is accompanied by noticeable declines in Precision and TSR, indicating that geometric consistency plays an important role in discriminative target association under cluttered conditions.
When the local-search constraint is disabled, the re-identification latency increases markedly from 72.83 ms to 98.64 ms. This result highlights the importance of the covariance-constrained search region in reducing candidate ambiguity and maintaining efficient recovery.
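The effect of the covariance-constrained search region can be sketched as follows. The gate is modeled here as an axis-aligned rectangle spanning a multiple of the predicted position standard deviation; the gate multiplier `k`, the minimum half-size `min_half`, and the function names are illustrative assumptions rather than the paper's parameters.

```python
import math


def search_region(center, var_x, var_y, img_w=1280, img_h=720,
                  k=3.0, min_half=32.0):
    """Rectangular search gate around the predicted center.

    var_x / var_y: position variances (px^2) taken from the predicted
    state covariance. k and min_half are illustrative assumptions.
    """
    cx, cy = center
    half_w = max(k * math.sqrt(var_x), min_half)
    half_h = max(k * math.sqrt(var_y), min_half)
    # Clip the gate to the image bounds.
    x1 = max(0.0, cx - half_w)
    y1 = max(0.0, cy - half_h)
    x2 = min(float(img_w), cx + half_w)
    y2 = min(float(img_h), cy + half_h)
    return (x1, y1, x2, y2)


def candidates_in_region(candidates, region):
    """Keep only candidate centers (x, y) that fall inside the gate."""
    x1, y1, x2, y2 = region
    return [c for c in candidates if x1 <= c[0] <= x2 and y1 <= c[1] <= y2]


if __name__ == "__main__":
    region = search_region((640.0, 360.0), var_x=400.0, var_y=100.0)
    area = (region[2] - region[0]) * (region[3] - region[1])
    print(region, round(area / (1280 * 720), 4))
    print(candidates_in_region([(600, 350), (100, 100), (699, 391)], region))
```

With these example variances the gate covers under 1% of the 1280 × 720 frame, so appearance matching only has to be run on candidates near the predicted position. A longer occlusion inflates the covariance and therefore widens the gate automatically.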
Among all ablated variants, disabling the occlusion-triggered update-suspension mechanism leads to the most severe overall degradation, yielding the lowest Precision (88.6%), the lowest TSR (85.2%), and the worst Stability (8.4 px). This suggests that suppressing unreliable updates during target disappearance is important for preventing template contamination and cumulative trajectory drift.
Finally, removing the adaptive reference-model update strategy also degrades performance across all three tracking-quality metrics, showing that adaptive template evolution remains beneficial for maintaining robust target representation under environmental variation.
3.6. Dynamic Feature Evolution Analysis
Figure 6 demonstrates the application of the proposed target model update mechanism in response to dynamic environmental changes. The nine smaller images in the middle section show models generated from the template library under different environmental conditions. These images transition gradually, illustrating how the target model evolves under the influence of different data sources. The large image at the bottom represents the weighted average result of all templates after the model update, showing the smooth transition between templates and the model’s ability to adapt to environmental changes.
In the experiment, the detection of drastic environmental changes and the rapid update mechanism of the target model played central roles. By dynamically adjusting the model over time, the proposed algorithm can quickly update the target model when a significant environmental change occurs, enabling it to respond to new environmental features. Over time, the model gradually incorporates new templates through weighted averaging, ultimately achieving effective adaptation to environmental shifts.
Moreover, during the transition of each model, the process is visually represented by arrows and color changes, indicating the gradual elimination of old models and the introduction of new ones. In this process, old models are shown in gray and represented by arrows flying out, signifying their failure in the current environment; the new models, in contrast, exhibit strong color transitions, indicating their rapid adaptation to the environmental change.
This experiment demonstrates that the proposed method not only effectively detects and responds to environmental changes but also maintains high stability and accuracy throughout the evolution of the target model. It highlights the adaptability and practicality of the model update mechanism in dynamic environments (see Figure 6).
3.7. Verification of the Model Update Mechanism
To validate the robustness and stability of the adaptive model update mechanism proposed in this paper under complex environmental changes, comparative experiments were conducted in scenarios involving dynamic lighting changes and complex background conditions.
The experiments selected the classical Kalman filter, the YOLOv8 tracking model, and the proposed Prediction + Adaptive Update Mechanism for performance comparison. The main focus of the evaluation was the tracking error and dynamic response characteristics of the learning rate under conditions where the target is not occluded but external environmental changes occur.
The camera viewpoint in Figure 7 illustrates the real-time data acquisition process in the experimental scene, including the target tracking performance under varying lighting conditions. The experimental scene simulated different environmental lighting changes (such as transitions from bright to dim, and strong light reflections) to test the system’s adaptability under dynamic lighting changes, and to evaluate the performance of different methods.
To ensure a rigorous and fair evaluation of continuous tracking performance, the state-of-the-art deep learning models (YOLOv8 and YOLOv11) evaluated in this study were strictly deployed in tracking-by-detection mode using the BoT-SORT association framework [30]. Consequently, they serve as representative end-to-end tracking baselines rather than mere frame-by-frame object detectors, providing a reliable benchmark for evaluating trajectory stability.
Analysis of Tracking Errors Under Lighting Variation
The evaluation is progressively conducted in two phases: initially verifying the baseline visual adaptability under regular shadow transitions, followed by a deep quantitative analysis of the tracking error and learning rate mechanisms under extreme disturbances.
As illustrated in Figure 8, we first evaluated the system in a common outdoor scenario where the target transitions from bright direct sunlight into a dense shadow. During this process, the overall illumination on the target drops abruptly, significantly altering the visual appearance of the clothing features. However, the proposed adaptive update mechanism smoothly integrates the newly darkened features into the tracking template library. As demonstrated by the green bounding boxes, the system maintains a tight and precise lock on the target throughout the entire illumination transition, exhibiting no noticeable scale distortion or trajectory drift. This baseline qualitative test indicates that the algorithm can handle routine lighting variations in daily operations, establishing a foundation for the subsequent quantitative mechanism analysis and extreme stress tests.
3.8. Robustness and Adaptive Mechanism Analysis in Extreme Rainy Scenarios
Building upon the aforementioned mechanism analysis, this section evaluates the performance boundaries and ultimate robustness of the system by introducing a highly challenging rainy night scenario under streetlights. This environment presents severe local overexposure and specular ground reflections. Through a micro-level qualitative comparison across multiple frames, we comprehensively showcase the superior anti-interference capability of the proposed method when confronted with high-frequency false feature inductions.
To further investigate the system’s dynamic response, Figure 9 presents a qualitative comparison of tracking sequences in a highly challenging rainy night scenario. This environment features low overall contrast coupled with severe specular reflections from wet surfaces under dynamic streetlights.
As shown in Frame 2 (Peak Specular Reflection), the standard YOLOv8-based tracker suffers from severe detection drift and template contamination. The specular reflections on the wet ground create false target features, causing the bounding box to improperly expand and shift towards the light source. Consequently, even as visibility relatively improves in Frame 4 (Feature Recovery Phase), the accumulated errors cause continued trajectory deviation, with the Center Location Error (CLE) peaking at 42.3 pixels. The Kalman Filter, while smoothing the trajectory, exhibits significant temporal lag (delay) across the sequence (e.g., Frames 4–6), as it relies primarily on linear motion momentum rather than real-time feature adaptation.
In stark contrast, the Proposed Method maintains a precise and stable lock throughout the entire sequence. By leveraging the adaptive model update mechanism, the system actively regulates the template learning rate. Specifically, as analyzed in Section 2.4, the Specular Reflection Penalty identifies the high-Value and low-Saturation properties of the wet ground reflections. This triggers the piecewise function to freeze the template during severe degradation in Frame 2. This mechanism shields the tracking window from template contamination, ensuring robust recovery in subsequent frames (e.g., Frame 4) and keeping the CLE consistently below 11 pixels. This visual evidence corroborates the quantitative stability metrics presented earlier, demonstrating the framework’s capability to manage complex environmental interference.
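The freeze-on-degradation behavior described above can be sketched in a few lines. This is not the formulation of Section 2.4: the thresholds `v_min`, `s_max`, `spec_max`, and `sim_min`, the base rate `eta_base`, the linear scaling of the rate, and all function names are assumptions chosen only to illustrate a piecewise learning rate driven by a high-Value, low-Saturation penalty.

```python
import colorsys


def specular_fraction(pixels, v_min=0.85, s_max=0.20):
    """Fraction of (r, g, b) pixels (0..255) that are bright and desaturated,
    i.e. high HSV Value and low HSV Saturation. Thresholds are assumptions."""
    total = 0
    hits = 0
    for r, g, b in pixels:
        _, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        total += 1
        if v >= v_min and s <= s_max:
            hits += 1
    return hits / total if total else 0.0


def learning_rate(similarity, spec_frac, eta_base=0.05,
                  sim_min=0.5, spec_max=0.3):
    """Piecewise learning rate: zero (template frozen) under heavy specular
    contamination or poor appearance match, otherwise scaled down linearly
    with the specular fraction. All constants are illustrative."""
    if spec_frac > spec_max or similarity < sim_min:
        return 0.0
    return eta_base * (1.0 - spec_frac)


def update_template(template, observation, eta):
    """Blend the observed histogram into the template with rate eta."""
    return [(1.0 - eta) * t + eta * o for t, o in zip(template, observation)]


if __name__ == "__main__":
    glare = [(255, 255, 255)] * 40 + [(200, 0, 0)] * 60
    frac = specular_fraction(glare)
    eta = learning_rate(similarity=0.9, spec_frac=frac)
    print(frac, eta, update_template([1.0, 0.0], [0.0, 1.0], eta))
```

In the example, 40% of the patch is bright and desaturated, which exceeds the assumed 30% limit, so the rate is zero and the template is returned unchanged.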
Quantitative Error Analysis. To further quantify the visual performance observed in Figure 9 and Figure 10, this section plots the continuous tracking error over a 100-frame sequence. Note that the “Lighting Change Period” (gray shaded area) in the graph corresponds to the severe disturbance and feature recovery phases (Frames 2–4) shown in the qualitative sequence.
From the results, it can be observed that the classic Kalman filter experiences a significant increase in error during this disturbance phase, accompanied by noticeable drift. The YOLO trackers (YOLOv8 and YOLOv11) exhibit amplified error fluctuations, reflecting instability when faced with sudden environmental changes. In contrast, the proposed method maintains a highly stable error curve throughout the entire process, achieving the lowest error even during the severe disturbance phase, with an approximate reduction in average error of 35%.
These results demonstrate that the proposed adaptive update mechanism effectively mitigates the impact of external disturbances on recognition outcomes. It enables the system to maintain high-precision target tracking without the need for re-detection. Ultimately, the system achieves a “smooth update–no drift” characteristic during feature updates, significantly enhancing the overall robustness of the tracking framework.
Underlying Adaptive Mechanism. The mechanism underlying this stability is revealed in Figure 11. The graph tracks the dynamic learning rate, ηk, over the same sequence. From the figure, it can be observed that during the disturbance phase (gray shaded area), the classic Kalman filter keeps its learning rate constant, leading to insufficient updating capability. YOLOv8 and YOLOv11, while showing some tracking state fluctuations, exhibit a slow adjustment process that cannot quickly adapt to abrupt visual changes.
In contrast, the proposed method (red solid line) exhibits a dual-layered adaptive mechanism. At a macro level, upon entering the “Lighting Change Period” (the globally illuminated zone under the streetlight), the baseline learning rate increases to rapidly adapt to the target’s newly emerging bright appearance features.
However, at a micro level, the red curve is not smooth; it is characterized by sharp, instantaneous downward dips during this disturbance phase. These sudden drops are the direct manifestation of the Specular Reflection Penalty intervening. Whenever severe local degradation occurs (e.g., the intense wet ground reflections in Frame 2 of Figure 9), the mechanism aggressively suppresses the learning rate ηk to prevent the template from incorporating contaminated puddle features. Once the environment stabilizes (after Frame 70), the learning rate decays smoothly, preventing overfitting to temporary noise. This decoupled adaptation indicates that the algorithm can accept valid macroscopic illumination changes while rejecting microscopic environmental noise.
3.9. Trajectory Prediction and Target Re-Identification Analysis Under Occlusion
To validate the performance of the proposed prediction constraints and multi-factor fusion mechanism under complex scenarios involving target loss, this section conducts a comprehensive experimental analysis focusing on qualitative occlusion handling, trajectory prediction accuracy, and re-identification efficiency.
Qualitative Occlusion Handling Analysis. To visually demonstrate the system’s behavior under visibility loss, Figure 12 captures a representative sequence of the mobile robot handling complete occlusion in an indoor corridor. As shown in Figure 12b, when the target is fully obscured by the corridor wall, standard detection modules would typically fail. However, the proposed framework switches to an active prediction mode, maintaining an estimated trajectory (represented by the orange dashed box) [31]. This “blind-tracking” ensures that the system remains centered on the target’s hypothesized position. Upon target reappearance in Figure 12c, the system achieves rapid re-identification, with the trajectory ultimately resuming stable tracking in Figure 12d with a low CLE of 13.1 px. This qualitative evidence provides a strong foundation for the subsequent quantitative analysis.
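For reference, the Center Location Error (CLE) quoted above is commonly computed as the Euclidean distance, in pixels, between the centers of the predicted and ground-truth boxes. The short sketch below follows that common definition; the `(x, y, w, h)` box layout and the function names are assumptions for illustration, and the exact evaluation protocol is not specified in this section.

```python
import math


def box_center(box):
    """Return the center (cx, cy) of a box given as (x, y, w, h), top-left origin."""
    x, y, w, h = box
    return (x + w / 2.0, y + h / 2.0)


def center_location_error(pred_box, gt_box):
    """Euclidean distance in pixels between predicted and ground-truth box centers."""
    (px, py), (gx, gy) = box_center(pred_box), box_center(gt_box)
    return math.hypot(px - gx, py - gy)


if __name__ == "__main__":
    predicted = (0, 0, 10, 10)
    ground_truth = (3, 4, 10, 10)
    print(center_location_error(predicted, ground_truth))  # 5.0
```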
Trajectory Prediction Accuracy during Target Loss. Building upon the qualitative evidence in Figure 12, Figure 13 presents the quantitative trajectory comparison between the proposed prediction mechanism and the ground truth during a target loss and reappearance event. The actual motion trajectory is denoted by the blue curve, where dashed segments represent the ground truth path during periods of occlusion. The estimated motion trajectory, derived from the prediction constraints, is depicted by the red dashed line. During the target’s invisibility phase (gray shaded area), the predicted trajectory preserves the motion trend via the Kalman state transition. Although a kinematic deviation naturally accumulates over a prolonged invisibility phase, the prediction bounds the region in which the target can reappear, which sets up the immediate position correction seen upon reappearance. This confirms that the model preserves trajectory continuity and follows motion trends even without real-time observations. Upon target reappearance (yellow marked point), the predicted path realigns with the actual position, indicating that the prediction model localizes the reappearance region accurately. Furthermore, because the target’s position is predicted throughout the loss phase, the system significantly reduces re-identification latency upon reappearance.
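The coasting behavior during the invisibility phase can be sketched with the prediction step of a Kalman filter. The sketch below assumes a constant-velocity model applied independently to each image axis and a diagonal process noise; these modeling choices, the noise values, and the function names are illustrative assumptions, not the configuration used in the experiments. It shows why the position variance (and hence the search region) grows while no measurement is available.

```python
def predict_axis(pos, vel, cov, dt=1.0, q_pos=1.0, q_vel=0.1):
    """One Kalman prediction step for a single axis with state (pos, vel).

    cov is (var_pos, cov_pos_vel, var_vel). Implements x' = F x and
    P' = F P F^T + Q with F = [[1, dt], [0, 1]] and Q = diag(q_pos, q_vel).
    """
    var_p, cov_pv, var_v = cov
    new_pos = pos + vel * dt                      # position follows the motion trend
    new_var_p = var_p + 2.0 * dt * cov_pv + dt * dt * var_v + q_pos
    new_cov_pv = cov_pv + dt * var_v
    new_var_v = var_v + q_vel
    return new_pos, vel, (new_var_p, new_cov_pv, new_var_v)


def coast(pos, vel, cov, n_frames, **kwargs):
    """Run prediction only (no measurement update) for n_frames of occlusion."""
    for _ in range(n_frames):
        pos, vel, cov = predict_axis(pos, vel, cov, **kwargs)
    return pos, vel, cov


if __name__ == "__main__":
    # Target at x = 100 px moving 2 px/frame, occluded for 5 frames.
    pos, vel, cov = coast(100.0, 2.0, (4.0, 0.0, 1.0), n_frames=5)
    print(pos)     # 110.0
    print(cov[0])  # position variance has grown during the occlusion
```

The growing position variance is what a gating region such as a Mahalanobis ellipsoid would be built from when the target reappears.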
Target Re-identification Efficiency. High-precision spatial prediction directly translates to significantly reduced recovery latency.
Figure 14 quantitatively compares the average re-identification latency during the target re-identification phase across various tracking frameworks, including the classic Kalman filter, improved adaptive filtering, deep learning detectors (YOLOv8, YOLOv11), and our proposed mechanism. The results indicate that traditional Kalman filters exhibit the highest delay (approx. 175 ms) due to their inherent inability to handle nonlinear motion and appearance variations during occlusion. While adaptive filtering reduces lag to 138 ms by dynamically adjusting noise parameters, it remains insufficient for high-speed robotic tasks. Deep learning models (YOLOv8, YOLOv11) leverage end-to-end feature extraction for accuracy, but are constrained by complex network inference times, resulting in response latencies between 90 and 105 ms.
In contrast, the proposed Prediction + Multi-Factor Mechanism integrates spatiotemporal constraints with multimodal appearance features. It dynamically weights and filters candidate boxes within the predicted region, achieving high-precision position alignment immediately upon target reappearance (as demonstrated in the spatial reasoning of Figure 12 and Figure 13). By strictly confining the multi-factor feature matching to the dynamic Mahalanobis ellipsoid, the search space shrinks to the gated region, enabling the system to complete re-identification in 72.83 ms. This represents an efficiency improvement of approximately 20.95% compared to the state-of-the-art YOLOv11-based tracker [32]. This reduction in latency highlights the system’s real-time performance and its robustness in maintaining target continuity under challenging conditions.
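A minimal sketch of gated multi-factor matching is given below to make the idea concrete. It is an illustration under stated assumptions rather than the authors' implementation: the candidate dictionary layout, the fusion weights `w_app` and `w_pos`, and the linear fusion rule are hypothetical. The gate value 9.21 is the 99% quantile of the chi-square distribution with two degrees of freedom, a common choice for 2-D gating.

```python
def mahalanobis_sq(point, mean, cov):
    """Squared Mahalanobis distance of a 2-D point from mean under a 2x2 covariance."""
    (sxx, sxy), (syx, syy) = cov
    det = sxx * syy - sxy * syx
    if det <= 0.0:
        raise ValueError("covariance must be positive definite")
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    # d^T S^{-1} d using the closed-form inverse of a 2x2 matrix.
    return (dx * (syy * dx - sxy * dy) + dy * (-syx * dx + sxx * dy)) / det


def reidentify(candidates, predicted_center, cov, gate=9.21, w_app=0.7, w_pos=0.3):
    """Pick the best candidate inside the gate; return None if none qualifies.

    Each candidate is a dict with 'center' (x, y) and 'appearance_sim' in [0, 1]
    (hypothetical fields). Candidates outside the ellipsoid are never scored.
    """
    best, best_score = None, -1.0
    for cand in candidates:
        d2 = mahalanobis_sq(cand["center"], predicted_center, cov)
        if d2 > gate:
            continue  # outside the Mahalanobis ellipsoid: skip appearance matching
        score = w_app * cand["appearance_sim"] + w_pos * (1.0 - d2 / gate)
        if score > best_score:
            best, best_score = cand, score
    return best


if __name__ == "__main__":
    cov = ((100.0, 0.0), (0.0, 100.0))
    candidates = [
        {"id": "A", "center": (105.0, 100.0), "appearance_sim": 0.60},
        {"id": "B", "center": (300.0, 100.0), "appearance_sim": 0.95},  # far-away distractor
        {"id": "C", "center": (110.0, 102.0), "appearance_sim": 0.80},
    ]
    print(reidentify(candidates, (100.0, 100.0), cov)["id"])  # C
```

In this toy example the distractor B has the highest appearance similarity but lies far outside the gate, so it is rejected before any appearance comparison is made.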
3.10. Overall Performance Comparison
Before summarizing the overall performance, we further include a representative qualitative case under cluttered outdoor conditions involving a similar-looking pedestrian distractor. Since the proposed framework performs cue-level multimodal fusion on visible-light robotic input, this qualitative comparison follows the same onboard visible-light input setting and baseline trackers as the quantitative evaluation.
Figure 15 presents a rear-view tracking sequence in which a visually similar pedestrian enters the field of view and temporarily coexists with the designated target, thereby introducing potential identity ambiguity in a cluttered outdoor scene.
As shown in Figure 15, all compared trackers can initialize the target in the initial tracking frame. When the similar-looking pedestrian enters and coexists with the target in Frames 2 and 3, the YOLOv8- and YOLOv11-based trackers show different degrees of localization drift and residual offset, indicating increased association ambiguity under distractor interference. In contrast, the proposed method maintains a more consistent bounding box around the original target throughout the sequence.
To provide an overall quantitative comparison, this subsection summarizes the performance of all evaluated tracking methods based on Table 5. The Kalman-filter baseline exhibits the lowest tracking success rate (76.1%) and the largest Stability value (12.4 px), indicating its vulnerability to dynamic occlusions. While the YOLOv8- and YOLOv11-based trackers achieve relatively high precision (92.1% and 90.8%), they are susceptible to prolonged environmental disturbances (e.g., specular reflections), leading to larger temporal fluctuations in localization compared to the proposed method.
In contrast, the proposed Prediction + Multi-Factor mechanism achieves the best overall tracking performance. By utilizing the adaptive learning-rate mechanism and constraining local searches within the Mahalanobis ellipsoid, the proposed method maintains more stable tracking under the tested disturbances. Consequently, as reported in Table 5, the proposed method attains the highest Precision (94.2%) and Tracking Success Rate (93.4%), along with the lowest Stability value (4.7 px). These quantitative results show that the proposed framework provides the most stable and reliable tracking performance among the compared methods under complex rear-view conditions.
4. Conclusions
In this paper, we proposed a lightweight and highly robust rear-view human tracking and re-identification framework tailored for robotic visual sensing in unmanned vehicles operating in complex, open-world environments. To overcome the critical vulnerabilities of conventional deep learning models when faced with adverse weather (e.g., rainy nights) and severe occlusions, our approach seamlessly integrated a Kalman-driven spatiotemporal prediction mechanism with a multimodal feature fusion strategy. By strictly confining the re-identification search space within a dynamic Mahalanobis ellipsoid and utilizing a robust descriptor (combining quantized HSV histograms and geometric kinematics), the system successfully decoupled target appearance from environmental noise. Furthermore, the introduction of a rigorous adaptive update strategy—driven by a Specular Reflection Penalty and a piecewise learning rate—effectively prevented trajectory drift by proactively freezing invalid feature assimilations.
Extensive real-world experiments conducted on a custom Mecanum-wheeled robot comprehensively validated the superiority of the proposed framework. The system achieved a peak precision of 94.2% and a tracking success rate of 93.4% without requiring large-scale end-to-end retraining on massive annotated datasets. Notably, under extreme rainy night conditions, our method reduced the average tracking error by 35%—maintaining a Center Location Error (CLE) below 11 pixels. During complete occlusion phases, the active local search mechanism enabled a rapid re-identification latency of 72.83 ms, bridging the critical gap between robustness and real-time edge computing constraints.
Ultimately, this research delivers a highly reliable, computationally efficient solution for intelligent human–robot interaction and autonomous tracking. Future work will focus on extending this framework to multi-target rear-view tracking scenarios and exploring its deployment on aerial platforms (UAVs) for cross-view collaborative perception.