Vision-Based Trajectory Reconstruction in Human Activities: Methodology and Application
Abstract
1. Introduction
1.1. Problem Statement
1.2. Experimental Data Collection to Study Crowd Dynamics
1.3. Contribution of the Present Work
2. Materials and Methods
2.1. Experimental Program
2.1.1. Fixed-Size Group Events
2.1.2. Large-Scale Stream Events
2.2. Materials
2.2.1. Optical Sensors
2.2.2. RTK GNSS Receiver
2.2.3. Coloured Hats
2.3. General Computer Vision and Photogrammetry Techniques
2.3.1. Open-Set Object Detector
2.3.2. Structure-from-Motion
- Features such as distinctive pixels (keypoints), patches and image descriptors are extracted from each image and formulated mathematically, e.g., using the Scale-Invariant Feature Transform (SIFT) algorithm [44].
- Features identified across multiple viewpoints are matched using, for example, Approximate Nearest Neighbors (ANN) [45] combined with the iterative Random Sample Consensus (RANSAC) method [46], which makes the matching robust by eliminating outlier correspondences. Transient features, such as pedestrians or other moving objects in the scene, are thereby automatically excluded from the matched images and subsequent image processing [42].
- The image correspondences impose constraints on the position and orientation of the moving camera. Following a thorough assessment of matched features, the predominant transformation between two viewpoints is determined using principles of epipolar geometry. This results in the construction of the fundamental matrix, or in the case of a calibrated camera, the essential matrix, which relates the poses of both cameras [47].
- Camera poses and the 3D positions of matched features are estimated through aerial triangulation. Since a photogrammetric reconstruction is inherently determined only up to scale, it must be appropriately scaled and, if necessary, geo-referenced. This can be achieved by incorporating scaling elements, GCPs or geospatial data obtained from a UAV equipped with a Real-Time Kinematic (RTK) GNSS receiver.
- Initial estimates are refined through iterative bundle block adjustment, and the 3D reconstruction of the scene can be enhanced using techniques such as dense stereo matching.
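The two-view geometry at the heart of these steps can be illustrated with a small numerical sketch (Python here for illustration; the study itself relies on MATLAB and Agisoft Metashape). For a calibrated camera pair with relative pose (R, t), the essential matrix E = [t]×R must satisfy the epipolar constraint x2ᵀE x1 = 0 for every correctly matched feature pair — the very condition RANSAC exploits to reject outlier correspondences. The pose and point below are hypothetical:

```python
import numpy as np

def skew(t):
    # Skew-symmetric matrix so that skew(t) @ v == np.cross(t, v)
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def essential_matrix(R, t):
    # E = [t]_x R relates normalized coordinates of two calibrated views
    return skew(t) @ R

# Hypothetical relative pose: 5-degree yaw plus a small translation
theta = np.deg2rad(5.0)
R = np.array([[np.cos(theta), 0.0, np.sin(theta)],
              [0.0, 1.0, 0.0],
              [-np.sin(theta), 0.0, np.cos(theta)]])
t = np.array([1.0, 0.0, 0.2])
E = essential_matrix(R, t)

# A 3D point seen in both views (camera 1 at the origin)
X = np.array([0.5, -0.3, 4.0])
x1 = X / X[2]                  # normalized image coordinates, view 1
X2 = R @ X + t
x2 = X2 / X2[2]                # normalized image coordinates, view 2

residual = x2 @ E @ x1         # epipolar constraint x2' E x1 = 0
print(abs(residual) < 1e-9)    # → True for an inlier correspondence
```

An outlier match (a feature on a moving runner, say) would violate this constraint, which is how RANSAC separates the static scene from transient content.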
2.3.3. Perspective-n-Point
2.3.4. Homography
3. Vision-Based Trajectory Reconstruction
3.1. Overhead Person Detection
3.1.1. Colour-Based Image Segmentation
3.1.2. Object Detection Using an Open-Set Object Detector
- To reduce the computational cost associated with detecting and processing numerous bounding boxes, Grounding DINO was applied exclusively within a pre-defined Region of Interest (RoI). The rectangular RoI was positioned along one of the vertical edges of the frame, where runners first appeared, as shown in Figure 6a.
- Since the subsequent tracking step relies on these bounding boxes as input, a filtering step was applied prior to initiating tracking. This multi-step procedure uses Non-Maximum Suppression (NMS) and a bounding box similarity score (≥0.65), based on the total and matched image features of the image patches enclosed by two bounding boxes, to remove redundant boxes. The strongest bounding boxes are selected using the MATLAB built-in function selectStrongestBbox, which uses the confidence scores provided per bounding box by Grounding DINO. Moreover, shadows were observed to frequently cause false positives; these were mitigated by applying aspect ratio constraints and comparing the colour space against a set of template shadow images. In this way, redundant detections are excluded, overlapping bounding boxes of partially detected runners are merged, bounding boxes containing multiple runners are separated, and false positives caused by shadows are removed. While this filtering process applies generic techniques, certain steps may require tuning under different acquisition conditions.
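The confidence-based suppression step can be sketched as a greedy NMS loop, a simplified Python analogue of MATLAB's selectStrongestBbox (box coordinates, scores, and the overlap threshold below are illustrative, not the values used in the study):

```python
import numpy as np

def iou(a, b):
    # Intersection-over-union of two boxes given as [x, y, width, height]
    ix = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def select_strongest(boxes, scores, overlap=0.5):
    # Greedy NMS: repeatedly keep the highest-scoring box and discard
    # remaining boxes that overlap it by more than the threshold
    order = list(np.argsort(scores)[::-1])
    keep = []
    while order:
        i = order.pop(0)
        keep.append(int(i))
        order = [j for j in order if iou(boxes[i], boxes[j]) < overlap]
    return keep

# Illustrative detections: two boxes on the same runner plus one distinct box
boxes = [[0, 0, 10, 10], [1, 1, 10, 10], [50, 50, 10, 10]]
scores = np.array([0.9, 0.8, 0.7])
print(select_strongest(boxes, scores))   # → [0, 2]
```

The study's additional steps — feature-based similarity scoring, merging boxes of partially detected runners, and the shadow heuristics — would layer on top of this basic suppression.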
3.2. Overhead Person Tracking
3.2.1. Blob Association
3.2.2. Feature-Based Tracking
3.3. Perspective Projection with a Planar Homography
3.3.1. Pinhole Camera Model
3.3.2. Camera Parameter Estimation from a Moving Camera Perspective
3.3.3. Camera Parameter Estimation from a Stationary Camera Perspective
3.3.4. Homography-Based Projection
- For the group events, the triangulated point cloud of the athletics track, obtained via the SfM algorithm, is used. However, only a limited area around a runner is considered when modelling the homography-plane. The area, defined in the image plane, corresponds to a square of 2 m × 2 m. Figure 12b conceptualizes the approach. In Section 4.1.2, the influence of using a (small) homography-plane with a fixed offset above the ground is investigated.
- For the stream events, involving a near-stationary camera hovering above a footbridge, the geometry of the bridge deck is used to reconstruct the homography-planes. When the height and gender of the runner are unknown, it is suggested to use a fixed offset of 1.7 m between the ground and the homography-plane [60].
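A minimal numerical sketch of such a homography-based projection (Python for illustration; camera height, intrinsics, and point coordinates are hypothetical): for a world plane z = h viewed by a camera with intrinsics K and pose (R, t), points on the plane map to pixels via the 3×3 homography H = K [r1 r2 h·r3 + t], and inverting H back-projects a tracked pixel onto the plane at the suggested 1.7 m offset.

```python
import numpy as np

def plane_homography(K, R, t, h):
    # Homography mapping plane coordinates (X, Y, 1) on the world plane
    # z = h to homogeneous pixel coordinates: x ~ K [r1  r2  h*r3 + t]
    return K @ np.column_stack((R[:, 0], R[:, 1], h * R[:, 2] + t))

def image_to_plane(H, uv):
    # Back-project a pixel onto the homography-plane by inverting H
    p = np.linalg.inv(H) @ np.array([uv[0], uv[1], 1.0])
    return p[:2] / p[2]

# Hypothetical setup: nadir-looking camera 30 m above the ground,
# homography-plane at the suggested 1.7 m offset above the ground
K = np.array([[1000.0, 0.0, 960.0],
              [0.0, 1000.0, 540.0],
              [0.0, 0.0, 1.0]])
R = np.diag([1.0, -1.0, -1.0])      # camera z-axis points straight down
t = np.array([0.0, 0.0, 30.0])      # t = -R @ C, camera centre at z = 30 m
H = plane_homography(K, R, t, 1.7)

# Round trip: a point on the 1.7 m plane projects to a pixel and back
xy = np.array([2.0, 3.0])
uv_h = H @ np.array([xy[0], xy[1], 1.0])
uv = uv_h[:2] / uv_h[2]
print(np.allclose(image_to_plane(H, uv), xy))   # → True
```

The sketch also makes the systematic error discussed in Section 4.1.2 tangible: back-projecting with an H built for the wrong plane height shifts the recovered ground coordinates, which is why the offset between ground and homography-plane matters.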
3.4. Trajectory Refinement
3.4.1. Measurement Noise
3.4.2. Process Noise
4. Results and Discussion
4.1. Evaluation of the Systematic Measurement Errors
4.1.1. Imperfect Camera Calibration
4.1.2. Height Variation of the Homography-Plane
4.1.3. Object Detection Error
4.1.4. Combined Systematic Error
4.2. Running Trajectories Characteristics
4.2.1. Group Events
4.2.2. Stream Events
4.3. Suggestions for Future Research
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| FoV | Field of View |
| GCP | Ground Control Point |
| PnP | Perspective-n-Point |
| RoI | Region of Interest |
| SfM | Structure-from-Motion |
References
- Association Française de Génie Civil, Sétra/AFGC. Sétra: Evaluation du Comportement Vibratoire des Passerelles Piétonnes sous l’action des Piétons (Assessment of Vibrational Behaviour of Footbridges Under Pedestrian Loading); Sétra: Paris, France, 2006. [Google Scholar]
- Pańtak, M. Dynamic characteristics of medium span truss, cable-stayed and suspension steel footbridges under human-induced excitation. In Proceedings of the 4th International Conference Footbridge, Wrocław, Poland, 6–8 July 2011; Volume 2011, pp. 1209–1214. [Google Scholar]
- Živanović, S.; Pavić, A.; Reynolds, P. Vibration serviceability of footbridges under human-induced excitation: A literature review. J. Sound Vib. 2005, 279, 1–74. [Google Scholar] [CrossRef]
- Heinemeyer, C.; Butz, C.; Keil, A.; Schlaich, M.; Goldack, A.; Lukić, M.; Chabrolin, B.; Lemaire, A.; Martin, P.; Cunha, A.; et al. Design of Lightweight Footbridges for Human Induced Vibrations—Background Document in Support to the Implementation, Harmonization and Further Development of the Eurocodes; JRC-ECCS 2009; Lehrstuhl für Stahl-und Leichtmetallbau und Institut für Stahlbau: Aachen, Germany, 2009. [Google Scholar]
- Živanović, S.; Pavić, A.; Reynolds, P. Human-structure dynamic interaction in footbridges. Bridge Eng. 2005, 158, 165–177. [Google Scholar] [CrossRef]
- Van Nimmen, K.; Lombaert, G.; De Roeck, G.; Van den Broeck, P. The impact of vertical human-structure interaction on the response of footbridges to pedestrian excitation. J. Sound Vib. 2017, 402, 104–121. [Google Scholar] [CrossRef]
- Bocian, M.; Brownjohn, J.; Racić, V.; Hester, D.; Quattrone, A.; Gilbert, L.; Beasley, R. Time-dependent spectral analysis of interactions within groups of walking pedestrians and vertical structural motion using wavelets. Mech. Syst. Signal Process. 2018, 105, 502–523. [Google Scholar] [CrossRef]
- Ricciardelli, F.; Pizzimenti, A.D. Lateral walking-induced forces on footbridges. J. Bridge Eng. 2007, 12, 677–688. [Google Scholar] [CrossRef]
- Venuti, F.; Tubino, F. Human-induced loading and dynamic response of footbridges in the vertical direction due to restricted pedestrian traffic. Struct. Infrastruct. Eng. 2021, 17, 1431–1445. [Google Scholar] [CrossRef]
- Živanović, S. Probability-Based Estimation of Vibration for Pedestrian Structures Due to Walking. Ph.D. Thesis, University of Sheffield, Sheffield, UK, 2006. [Google Scholar]
- Grundmann, H.; Kreuzinger, H.; Schneider, M. Schwingungsuntersuchungen für Fussgängerbrücken (Vibration Tests of Pedestrian Bridges). Bauingenieur 1993, 68, 215–225. [Google Scholar]
- Butz, C.; Feldmann, M.; Heinemeyer, C.; Sedlacek, G. SYNPEX: Advanced Load Models for Synchronous Pedestrian Excitation and Optimised Design Guidelines for Steel Footbridges; Technical Report, Research Fund for Coal and Steel; Publications Office of the EU: Luxembourg, 2008. [Google Scholar]
- Ebrahimpour, A.; Hamam, A.; Sack, R.; Patten, W. Measuring and modeling dynamic loads imposed by moving crowds. J. Struct. Eng. 1996, 122, 1468–1474. [Google Scholar] [CrossRef]
- Bocian, M.; Brownjohn, J.; Racić, V.; Hester, D.; Quattrone, A.; Monnickendam, R. A framework for experimental determination of localised vertical pedestrian forces on full-scale structures using wireless attitude and heading reference systems. J. Sound Vib. 2016, 376, 217–243. [Google Scholar] [CrossRef]
- Obeidat, H.; Shuaieb, W.; Obeidat, O.; Abd-Alhameed, R. A review of indoor localization techniques and wireless technologies. Wirel. Pers. Commun. 2021, 119, 289–327. [Google Scholar] [CrossRef]
- Hoogendoorn, S.P.; Daamen, W.; Bovy, P.H. Extracting microscopic pedestrian characteristics from video data. In Proceedings of the Transportation Research Board Annual Meeting, Washington, DC, USA, 12–16 January 2003; National Academy Press: Washington, DC, USA, 2003; pp. 1–15. [Google Scholar]
- Feng, D.; Feng, M.Q. Computer Vision for Structural Dynamics and Health Monitoring; John Wiley & Sons: Hoboken, NJ, USA, 2020. [Google Scholar]
- Zona, A. Vision-based vibration monitoring of structures and infrastructures: An overview of recent applications. Infrastructures 2020, 6, 4. [Google Scholar] [CrossRef]
- Fujino, Y.; Pacheco, B.M.; Nakamura, S.I.; Warnitchai, P. Synchronization of human walking observed during lateral vibration of a congested pedestrian bridge. Earthq. Eng. Struct. Dyn. 1993, 22, 741–758. [Google Scholar] [CrossRef]
- Yoshida, J.; Fujino, Y.; Sugiyama, T. Image Processing for Capturing Motions of Crowd and Its Application to Pedestrian-Induced Lateral Vibration of a Footbridge. Shock Vib. 2007, 14, 251–260. [Google Scholar] [CrossRef]
- Setareh, M. Study of Verrazano-Narrows Bridge Movements During a New York City Marathon. J. Bridge Eng. 2011, 16, 127–138. [Google Scholar] [CrossRef]
- Pimentel, R.L.; Araújo, M.C., Jr.; Braga Fernandes Brito, H.M.; Vital de Brito, J.L. Synchronization among pedestrians in footbridges due to crowd density. J. Bridge Eng. 2013, 18, 400–408. [Google Scholar] [CrossRef]
- Araújo, M.C., Jr.; Brito, H.M.B.F.; Pimentel, R.L. Experimental evaluation of synchronisation in footbridges due to crowd density. Struct. Eng. Int. 2009, 19, 298–303. [Google Scholar] [CrossRef]
- Liu, S.; Zeng, Z.; Ren, T.; Li, F.; Zhang, H.; Yang, J.; Jiang, Q.; Li, C.; Yang, J.; Su, H.; et al. Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection. arXiv 2023, arXiv:2303.05499. [Google Scholar] [CrossRef]
- Boltes, M.; Kilic, D.; Schrödter, T.; Arens, T.; Dreßen, L.; Adrian, J.; Boomers, A.K.; Kandler, A.; Küpper, M.; Graf, A.; et al. PeTrack (v1.0). 2025. Available online: https://zenodo.org/records/15119517 (accessed on 7 June 2024).
- Boltes, M.; Seyfried, A. Collecting pedestrian trajectories. Neurocomputing 2013, 100, 127–133. [Google Scholar] [CrossRef]
- Cao, Z.; Hidalgo, G.; Simon, T.; Wei, S.E.; Sheikh, Y. OpenPose: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 172–186. [Google Scholar] [CrossRef]
- Mathis, A.; Mamidanna, P.; Cury, K.M.; Abe, T.; Murthy, V.N.; Mathis, M.W.; Bethge, M. DeepLabCut: Markerless pose estimation of user-defined body parts with deep learning. Nat. Neurosci. 2018, 21, 1281–1289. [Google Scholar] [CrossRef]
- Insafutdinov, E.; Pishchulin, L.; Andres, B.; Andriluka, M.; Schiele, B. Deepercut: A deeper, stronger, and faster multi-person pose estimation model. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part VI 14. Springer: Berlin/Heidelberg, Germany, 2016; pp. 34–50. [Google Scholar]
- Fang, H.S.; Li, J.; Tang, H.; Xu, C.; Zhu, H.; Xiu, Y.; Li, Y.L.; Lu, C. Alphapose: Whole-body regional multi-person pose estimation and tracking in real-time. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 7157–7173. [Google Scholar] [CrossRef]
- Van Hauwermeiren, J.; Van Nimmen, K.; Van den Broeck, P.; Vergauwen, M. Vision-based methodology for characterizing the flow of a high-density crowd on footbridges: Strategy and application. Infrastructures 2020, 5, 51. [Google Scholar] [CrossRef]
- DJI. DJI Mini 3 Pro: User Manual, Version 1.6; DJI: Shenzhen, China, 2023.
- DJI. DJI Mavic 3 Enterprise: User Manual, Version 1.0; DJI: Shenzhen, China, 2022.
- Emlid Tech Korlátolt. Reach RX User Documentation; Firmware Version 1.5; Emlid Flow app Version 10.6 or Newer; Emlid Tech Korlátolt: Budapest, Hungary, 2024. [Google Scholar]
- Benabbas, Y.; Ihaddadene, N.; Yahiaoui, T.; Urruty, T.; Djeraba, C. Spatio-temporal optical flow analysis for people counting. In Proceedings of the 2010 7th IEEE International Conference on Advanced Video and Signal Based Surveillance, Boston, MA, USA, 29 August–1 September 2010; IEEE: Piscataway, NJ, USA, 2010; pp. 212–217. [Google Scholar]
- Cao, J.; Sun, L.; Odoom, M.G.; Luan, F.; Song, X. Counting people by using a single camera without calibration. In Proceedings of the 2016 Chinese Control and Decision Conference (CCDC), Yinchuan, China, 28–30 May 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 2048–2051. [Google Scholar]
- Carletti, V.; Del Pizzo, L.; Percannella, G.; Vento, M. An efficient and effective method for people detection from top-view depth cameras. In Proceedings of the 2017 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Lecce, Italy, 29 August–1 September 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 1–6. [Google Scholar]
- Bendale, A.; Boult, T.E. Towards Open Set Deep Networks. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 1563–1572. [Google Scholar]
- Agarwal, S.; Snavely, N.; Simon, I.; Seitz, S.M.; Szeliski, R. Building Rome in a Day. In Proceedings of the 2009 IEEE 12th International Conference on Computer Vision, Kyoto, Japan, 27 September–4 October 2009; pp. 72–79. [Google Scholar] [CrossRef]
- Wu, C. Towards Linear-time Incremental Structure from Motion. In Proceedings of the 2013 International Conference on 3D Vision (3DV), Seattle, WA, USA, 29 June–1 July 2013; IEEE: Piscataway, NJ, USA, 2013; pp. 127–134. [Google Scholar]
- Schonberger, J.L.; Frahm, J.M. Structure-from-Motion Revisited. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 4104–4113. [Google Scholar]
- Westoby, M.J.; Brasington, J.; Glasser, N.F.; Hambrey, M.J.; Reynolds, J.M. ‘Structure-from-Motion’ photogrammetry: A low-cost, effective tool for geoscience applications. Geomorphology 2012, 179, 300–314. [Google Scholar] [CrossRef]
- Eltner, A.; Kaiser, A.; Castillo, C.; Rock, G.; Neugirg, F.; Abellán, A. Image-based surface reconstruction in geomorphometry–merits, limits and developments. Earth Surf. Dyn. 2016, 4, 359–389. [Google Scholar] [CrossRef]
- Lowe, D.G. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 2004, 60, 91–110. [Google Scholar] [CrossRef]
- Arya, S.; Mount, D.M.; Netanyahu, N.S.; Silverman, R.; Wu, A.Y. An optimal algorithm for approximate nearest neighbor searching fixed dimensions. J. ACM (JACM) 1998, 45, 891–923. [Google Scholar] [CrossRef]
- Fischler, M.A.; Bolles, R.C. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM 1981, 24, 381–395. [Google Scholar] [CrossRef]
- Hartley, R.; Zisserman, A. Epipolar Geometry and the Fundamental Matrix. In Multiple View Geometry in Computer Vision; Cambridge University Press: Cambridge, UK, 2004; pp. 239–261. [Google Scholar]
- Lu, X.X. A review of solutions for perspective-n-point problem in camera pose estimation. J. Phys. Conf. Ser. 2018, 1087, 052009. [Google Scholar] [CrossRef]
- Lepetit, V.; Moreno-Noguer, F.; Fua, P. EPnP: An accurate O(n) solution to the PnP problem. Int. J. Comput. Vis. 2009, 81, 155–166. [Google Scholar] [CrossRef]
- Rauch, H.E.; Tung, F.; Striebel, C.T. Maximum likelihood estimates of linear dynamic systems. AIAA J. 1965, 3, 1445–1450. [Google Scholar] [CrossRef]
- MATLAB, R2023a; The MathWorks Inc.: Natick, MA, USA, 2023.
- Schulter, S.; Vernaza, P.; Choi, W.; Chandraker, M. Deep network flow for multi-object tracking. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6951–6960. [Google Scholar]
- Liu, X.; Caesar, H. Offline tracking with object permanence. In Proceedings of the 2024 IEEE Intelligent Vehicles Symposium (IV), Jeju, Republic of Korea, 2–5 June 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1272–1279. [Google Scholar]
- Lucas, B.D.; Kanade, T. An iterative image registration technique with an application to stereo vision. In Proceedings of the IJCAI’81: 7th International Joint Conference on Artificial Intelligence, Vancouver, BC, Canada, 24–28 August 1981; Volume 2, pp. 674–679. [Google Scholar]
- Hartley, R.; Zisserman, A. Multiple View Geometry in Computer Vision; Cambridge University Press: Cambridge, UK, 2003. [Google Scholar]
- Conrady, A. Decentered Lens-Systems. Mon. Not. R. Astron. Soc. 1919, 79, 384–390. [Google Scholar] [CrossRef]
- Brown, D. Decentering Distortion of Lenses. Photogramm. Eng. 1966, 32, 444–462. [Google Scholar]
- Ullman, S. The interpretation of structure from motion. Proc. R. Soc. Lond. Ser. Biol. Sci. 1979, 203, 405–426. [Google Scholar]
- METASHAPE. Agisoft Metashape; Agisoft LLC: St. Petersburg, Russia, 2024.
- Van der Heyden, J.; Nguyen, D.; Renard, F.; Scohy, A.; Demarest, S.; Drieskens, S.; Gisle, L. Belgisch Gezondheidsonderzoek 2018; Technical Report D/2019/14.440/90; Rapportnummer: 2019/14.440/89; Sciensano: Brussel, Belgium, 2019. [Google Scholar]
- Dempster, A.P.; Laird, N.M.; Rubin, D.B. Maximum Likelihood from Incomplete Data via the EM Algorithm. J. R. Stat. Soc. Ser. (Methodol.) 1977, 39, 1–22. [Google Scholar] [CrossRef]
- Feng, Y.; Wang, J. GPS RTK performance characteristics and analysis. Positioning 2008, 1, 13. [Google Scholar] [CrossRef]
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Lottefier, J.; Van den Broeck, P.; Van Nimmen, K. Vision-Based Trajectory Reconstruction in Human Activities: Methodology and Application. Sensors 2025, 25, 7577. https://doi.org/10.3390/s25247577