Progressive Alignment of Multi-Modal Trajectories Under Modality Imbalance: A Case Study in Metro Stations
Abstract
1. Introduction
- High similarity among passenger trajectories: In metro stations, passenger movement paths are relatively fixed, and many trajectories overlap because of groups traveling together or dense passenger flows. Combined with UWB drift, this creates a high risk of confusion in cross-modal alignment and limits the success rate of any single matching attempt.
- Severe modality imbalance between vision and UWB: In real-world transportation scenarios, all passengers generate vision trajectories, but only a small subset carries UWB tags. This imbalance yields an overwhelming majority of negative samples in training, causing the model to overemphasize mismatches while neglecting scarce but critical true associations.
- Prior knowledge-guided progressive trajectory alignment: The alignment probability from the previous time step is used as prior information to guide and constrain the current attention module. This design leverages the temporal continuity of trajectories, progressively increasing confidence in alignments while mitigating the uncertainty of individual matches.
- Contrastive learning for modality imbalance mitigation: Contrastive learning with the InfoNCE loss is employed to address the severe modality imbalance. By maximizing the similarity of true UWB–vision pairs while contrasting them against a large set of negatives, the model is encouraged to learn from scarce yet critical associations, thereby promoting stable matching on the UWB modality.
2. Related Work
3. Methodology
3.1. Scene Description
3.2. Trajectory Coordinate Unification
3.2.1. UWB Data Processing
3.2.2. Vision Data Processing
3.2.3. Coordinate Transformation
3.3. Progressive Alignment of Trajectories
- Trajectory feature representation and encoding (Figure 4A): Position and displacement information are fused and normalized, then passed through a TCN-based encoder to obtain the local temporal features of each trajectory (see Section 3.3.1).
- Self-attention for individual trajectories (Figure 4B): Long-range temporal dependencies within each trajectory are modeled to enhance the representation quality of features (see Section 3.3.2).
- Cross-modal attention (Figure 4C): This module exploits the interaction between vision and UWB trajectories and incorporates a prior knowledge-guided attention mechanism to highlight critical associations, thus enhancing the trajectory feature representation (see Section 3.3.3).
- Model output and post-processing (Figure 4D): Contrastive learning and Hungarian matching generate the final alignment results, and the alignment probabilities are converted into priors for the next time step (see Section 3.3.4).
- Loss function: The InfoNCE loss is employed to enhance the separability between positive and negative pairs, thereby guiding the model to capture the essential cross-modal alignments (see Section 3.3.5).
3.3.1. Trajectory Feature Representation and Encoding
- For the position coordinates, min–max normalization is applied. Since UWB localization errors have a limited impact on these larger-scale coordinates, this approach effectively preserves the geometric consistency of each trajectory. Moreover, normalization is performed within each network input rather than across the entire dataset, thereby enhancing sample diversity and improving model generalization.
- For the displacement features, tanh-based normalization is employed to mitigate the effects of transient drift or hand-induced motion perturbations. Because these displacement features are small in scale but prone to extreme fluctuations, tanh normalization smoothly bounds abnormal values while maintaining sensitivity to normal motion dynamics, thereby preserving richer feature information.
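The hybrid normalization described above can be illustrated with a minimal sketch. Several details are assumptions made for illustration and are not taken from the paper: the function names, the use of 2D `(x, y)` points, sharing the min–max extrema across all trajectories of one network input, computing displacements on raw coordinates, and the `scale` parameter of the tanh step.

```python
import math

def min_max_normalize(trajectories):
    """Scale positions to [0, 1] using the extrema of this one network input.

    Assumption: all trajectories of the input share the same extrema, so
    their relative geometry is kept.
    """
    xs = [x for traj in trajectories for x, _ in traj]
    ys = [y for traj in trajectories for _, y in traj]
    x_min, y_min = min(xs), min(ys)
    x_span = (max(xs) - x_min) or 1.0  # avoid division by zero
    y_span = (max(ys) - y_min) or 1.0
    return [[((x - x_min) / x_span, (y - y_min) / y_span) for x, y in traj]
            for traj in trajectories]

def displacements(traj):
    """Frame-to-frame displacement; the first step is zero to keep the length."""
    return [(0.0, 0.0)] + [(x1 - x0, y1 - y0)
                           for (x0, y0), (x1, y1) in zip(traj, traj[1:])]

def tanh_normalize(deltas, scale=1.0):
    """Smoothly bound displacements to (-1, 1); `scale` is hypothetical."""
    return [(math.tanh(dx / scale), math.tanh(dy / scale)) for dx, dy in deltas]

def build_features(trajectories, scale=1.0):
    """Fuse normalized position and displacement into (px, py, dx, dy) rows."""
    positions = min_max_normalize(trajectories)
    features = []
    for raw, pos in zip(trajectories, positions):
        disp = tanh_normalize(displacements(raw), scale)
        features.append([(px, py, dx, dy)
                         for (px, py), (dx, dy) in zip(pos, disp)])
    return features

# One network input with two short trajectories; the second has a drift spike.
sample = [[(0.0, 0.0), (2.0, 4.0)], [(1.0, 2.0), (101.0, 2.0)]]
features = build_features(sample)
```

In this sketch the 100-unit jump in the second trajectory is squashed to a displacement close to 1, whereas min–max scaling of the same displacement would compress all ordinary steps toward zero.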
3.3.2. Self-Attention for Individual Trajectories
3.3.3. Cross-Modal Attention
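As an illustration of how a prior alignment probability could guide cross-modal attention, the sketch below adds the logarithm of the prior to the scaled dot-product logits before the softmax. This additive log-prior bias is an assumption chosen for illustration; the function names, the `eps` constant, and the treatment of each trajectory as a single embedding vector are likewise hypothetical.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def prior_guided_attention(queries, keys, values, prior, eps=1e-6):
    """Cross-attention from UWB queries to vision keys, biased by a prior.

    queries: one embedding per UWB trajectory
    keys, values: one embedding per vision trajectory
    prior[i][j]: prior probability that UWB i matches vision j
    Assumption: the prior enters as an additive log-bias on the logits.
    """
    d = len(queries[0])
    outputs, weights = [], []
    for q, prior_row in zip(queries, prior):
        logits = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d) + math.log(p + eps)
            for k, p in zip(keys, prior_row)
        ]
        w = softmax(logits)
        weights.append(w)
        outputs.append([sum(wi * v[j] for wi, v in zip(w, values))
                        for j in range(len(values[0]))])
    return outputs, weights

# Two vision candidates with identical keys: only the prior separates them.
out, attn = prior_guided_attention(
    queries=[[1.0, 0.0]],
    keys=[[1.0, 0.0], [1.0, 0.0]],
    values=[[1.0], [0.0]],
    prior=[[0.9, 0.1]],
)
```

With a uniform prior the bias is the same for every candidate and the layer reduces to standard scaled dot-product attention, which matches the ablation setting in which the prior constraint is removed.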
3.3.4. Model Output and Post-Processing
| Algorithm 1. Construction of the prior probability matrix |
| Require: For initialization, the average trajectory distances across time steps; for non-initialization, the trajectory pair similarities output by the model. |
| Ensure: The prior probability matrix. |
| If in the initialization phase: for each UWB-vision trajectory pair, compute the initial similarity scores. Set the UWB-to-vision probability. Set the vision-to-UWB probability. |
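The overall flow of Algorithm 1 can be sketched as follows. The concrete formulas are assumptions made for illustration: initial similarity is taken as the negative average distance divided by a hypothetical `temperature`, the UWB-to-vision probability as a softmax over each row, and the vision-to-UWB probability as a softmax over each column. How the two directions are merged into a single prior matrix is left open here.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def scores_from_distances(distances, temperature=1.0):
    """Initialization phase: smaller average distance means higher similarity.

    Assumption: similarity = -distance / temperature.
    """
    return [[-d / temperature for d in row] for row in distances]

def directional_priors(scores):
    """Turn a UWB x vision similarity matrix into two probability matrices.

    Returns (uwb_to_vision, vision_to_uwb); rows of the first and columns
    of the second each sum to 1.
    """
    uwb_to_vision = [softmax(list(row)) for row in scores]
    normalized_cols = [softmax(list(col)) for col in zip(*scores)]
    vision_to_uwb = [list(row) for row in zip(*normalized_cols)]
    return uwb_to_vision, vision_to_uwb

# Two UWB tags, three vision tracks; entries are average distances.
avg_distance = [[0.2, 3.0, 4.0],
                [2.5, 0.3, 3.5]]
u2v, v2u = directional_priors(scores_from_distances(avg_distance))
```

For later time steps, the same `directional_priors` call would receive the similarities output by the model instead of the distance-based scores.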
3.3.5. Loss Function
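The InfoNCE objective can be illustrated with a short sketch. It assumes that each UWB trajectory acts as the anchor, that its true vision trajectory is the single positive, and that all other candidate vision trajectories act as negatives; the function name and the `temperature` value are hypothetical.

```python
import math

def info_nce(similarities, positive_index, temperature=0.1):
    """Mean InfoNCE loss over UWB anchors.

    similarities[i][j]: similarity of UWB anchor i to vision candidate j
    positive_index[i]: column of the true vision match for anchor i
    """
    total = 0.0
    for row, pos in zip(similarities, positive_index):
        logits = [s / temperature for s in row]
        m = max(logits)
        # log-sum-exp over the positive and all negatives, computed stably
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[pos]
    return total / len(similarities)

# One UWB anchor against three vision candidates.
loss_when_correct = info_nce([[0.9, 0.1, 0.0]], [0])
loss_when_confused = info_nce([[0.1, 0.9, 0.0]], [0])
```

Because the loss is normalized per anchor, every tagged passenger contributes one term regardless of how many untagged vision trajectories are present. In contrast, a pairwise BCE loss sums over all pairs, most of which are negatives.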
4. Experiments
4.1. Implementation Details
4.2. Model Performance
- Distance-based greedy matching: For each UWB trajectory, this method selects the vision trajectory with the smallest average trajectory error as its corresponding match.
- Hungarian algorithm: This method constructs an average trajectory error matrix from all UWB trajectories and candidate vision trajectories involved in the matching at the current time step, then finds the assignment with the minimum total average trajectory error, yielding a globally optimal matching.
- Transformer-based method: Currently, there are no existing deep learning–based trajectory alignment methods directly applicable to the UWB–vision matching scenario. To provide a deep learning–based baseline, we implemented a simplified Transformer-based contrastive learning framework derived from our proposed architecture. Specifically, the prior alignment probability and the InfoNCE loss were removed, and a standard Binary Cross-Entropy (BCE) loss was adopted instead.
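The difference between the two distance-based baselines can be seen on a small cost matrix. The sketch below is illustrative only: the function names and the error values are made up, and the globally optimal assignment is found by exhaustive search rather than by the Hungarian algorithm itself. The two give the same optimum, but exhaustive search is only feasible for a handful of trajectories. In practice a solver such as `scipy.optimize.linear_sum_assignment` would be used instead.

```python
from itertools import permutations

def greedy_match(cost):
    """For each UWB trajectory, pick the vision trajectory with the smallest error.

    Choices are independent, so two tags may select the same vision track.
    """
    return [min(range(len(row)), key=row.__getitem__) for row in cost]

def optimal_match(cost):
    """One-to-one assignment with minimum total error (exhaustive search).

    Assumes at least as many vision trajectories as UWB trajectories.
    """
    n_uwb, n_vision = len(cost), len(cost[0])
    best = min(
        permutations(range(n_vision), n_uwb),
        key=lambda perm: sum(cost[i][j] for i, j in enumerate(perm)),
    )
    return list(best)

# Rows: UWB tags, columns: vision tracks; entries: average trajectory error.
errors = [[1.0, 1.2, 5.0],
          [0.9, 3.0, 5.0]]
greedy = greedy_match(errors)    # both tags pick vision track 0
optimal = optimal_match(errors)  # tag 0 -> track 1, tag 1 -> track 0
```

Here the greedy baseline assigns both tags to the same vision track, while the one-to-one assignment resolves the conflict at a total error of 2.1.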
4.3. Ablation Study
- Displacement features: If this component is excluded, the network input uses only the absolute position coordinates as trajectory features.
- Hybrid normalization strategy: When this strategy is removed, the position coordinates and the displacement features are both normalized with min–max normalization.
- Self-Attention mechanism: If this mechanism is removed, the extraction of temporal features for each trajectory will rely solely on the Temporal Convolutional Network (TCN).
- Prior alignment probability: When this prior knowledge is excluded, the first cross-attention layer adopts the standard Transformer attention, identical to the subsequent layers, thus removing the prior constraint.
- InfoNCE loss: If this component is omitted, the model will adopt the standard Binary Cross-Entropy (BCE) Loss as its loss function.
4.4. Computational Efficiency and Deployment Feasibility
5. Conclusions and Future Work
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest










© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Zhang, K.; Zhen, Y.; Ghaffar, M.A.; Pan, N.; Peng, L. Progressive Alignment of Multi-Modal Trajectories Under Modality Imbalance: A Case Study in Metro Stations. Electronics 2025, 14, 4265. https://doi.org/10.3390/electronics14214265

