Spatially Time-Based Robust Tracking and Re-Identification of Kindergarten Students: A Hybrid Deep Learning Framework Combining YOLOv8n and Vision Transformer (ViT)
Abstract
1. Introduction
- A robust re-identification framework is introduced, incorporating a novel integration of YOLOv8 for real-time detection and ViT for feature extraction. This hybrid architecture effectively addresses the ‘intra-class variation’ challenge, thereby enabling precise re-identification of children in identical uniforms, a task at which conventional CNN-based methodologies frequently struggle.
- Furthermore, we present innovative behavioral and social analytics. Beyond mere tracking, we propose a ‘Social Interaction Heatmap’ to quantify children’s interactions and ‘Screen Time Analysis’ to assess the degree of personal engagement. This methodology provides substantial contributions to the domain of child sociology, thereby aiding in the early detection of social isolation and instances of bullying.
- Automated Physical Activity Monitoring: The system employs an automated algorithm to track each child’s movement patterns and compute the total distance walked. This provides a quantitative measure for assessing children’s physical activity levels and overall health.
- A high-performance, domain-specific dataset containing 31,521 images of kindergarten children has been developed. Comprehensive evaluations indicate that our approach attains an overall accuracy of 93.75%, alongside a near-perfect IDF1 score of 99.7%, illustrating its efficacy in preserving identity throughout the tracking process.
2. Related Works
3. Materials and Methods
- Data Collection and Preprocessing
- Hybrid Model Architecture Design
- Ground Truth Generation
- Performance Evaluation and Analysis
3.1. Data Collection and Preprocessing
3.1.1. Video Input
- Live-view resolution: 1280 × 720 pixels, 1920 × 1080 pixels
- Recording resolution: 1920 × 1080 pixels
- Frame rate: 29.50 frames per second
- Data rate: 1352 kbps
- Total bitrate: 1480 kbps
3.1.2. Frame Extraction and Normalization
3.1.3. Dataset Description
3.2. Hybrid Model Architecture
3.2.1. Person Detection
- Takes a complete frame as input.
- Divides the frame into a grid and creates a feature map.
- Checks whether each cell contains ‘person’ (Person Class ID: 0).
- Outputs four bounding-box coordinates and a confidence score for each detection, as shown in Figure 1.
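In practice, the per-cell person check reduces to filtering the detector's raw output by class ID and confidence. A minimal sketch of this filtering step; the tuple layout and the 0.5 threshold are illustrative assumptions, not the authors' exact code:

```python
# Filter raw detections down to confident 'person' boxes.
# Assumed detection layout: (x1, y1, x2, y2, confidence, class_id).
PERSON_CLASS_ID = 0  # COCO class 0 is 'person'

def filter_person_detections(detections, conf_threshold=0.5):
    """Keep only 'person' detections above the confidence threshold."""
    persons = []
    for x1, y1, x2, y2, conf, cls in detections:
        if cls == PERSON_CLASS_ID and conf >= conf_threshold:
            persons.append(((x1, y1, x2, y2), conf))
    return persons

raw = [
    (10, 20, 50, 120, 0.91, 0),   # confident person -> kept
    (60, 30, 90, 110, 0.42, 0),   # low-confidence person -> dropped
    (5, 5, 40, 40, 0.88, 56),     # non-person class -> dropped
]
print(filter_person_detections(raw))  # [((10, 20, 50, 120), 0.91)]
```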
3.2.2. Feature Extraction Model ViT
3.3. Re-ID and Matching Mechanism
3.3.1. Cosine Similarity
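For two appearance embeddings a and b, cosine similarity takes its standard definition:

```latex
\mathrm{sim}(\mathbf{a}, \mathbf{b}) = \cos\theta
= \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}
= \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2}\,\sqrt{\sum_{i=1}^{n} b_i^2}}
```

Values close to 1 indicate near-identical appearance embeddings, while values near 0 indicate unrelated ones.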
3.3.2. Matching Algorithm
- The embedding of the detected person in each frame is extracted.
- The cosine similarity between this embedding and each known ID stored in memory is computed.
- If the highest similarity score exceeds a set threshold (e.g., 0.6 or 0.7), that ID is assigned to the detection.
- If the score falls below the threshold, the detection is treated as “unknown” or a “new person” (in the current code logic, it is counted as a mismatch).
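The matching steps above can be sketched as follows; the gallery structure and the 0.6 threshold are illustrative, and the authors' actual implementation may differ:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def match_id(embedding, gallery, threshold=0.6):
    """Assign the best-matching known ID, or None ('unknown') below threshold.

    gallery: dict mapping a known ID to its stored reference embedding.
    """
    best_id, best_score = None, -1.0
    for pid, ref in gallery.items():
        score = cosine_similarity(embedding, ref)
        if score > best_score:
            best_id, best_score = pid, score
    return (best_id, best_score) if best_score >= threshold else (None, best_score)

gallery = {1: [1.0, 0.0, 0.0], 2: [0.0, 1.0, 0.0]}
print(match_id([0.9, 0.1, 0.0], gallery))   # high similarity -> assigned ID 1
print(match_id([0.5, 0.5, 0.7], gallery))   # below threshold -> unknown (None)
```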
3.3.3. Distance Estimation
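Walking distance can be estimated by summing the Euclidean distances between consecutive bounding-box centroids and converting pixels to metres. A minimal sketch; the pixels-per-metre calibration factor here is a hypothetical placeholder, not the authors' value:

```python
import math

# Hypothetical calibration: pixels per metre at the camera's viewing plane.
PIXELS_PER_METRE = 50.0

def total_distance_metres(centroids, pixels_per_metre=PIXELS_PER_METRE):
    """Sum Euclidean distances between consecutive (x, y) box centroids."""
    total_px = 0.0
    for (x0, y0), (x1, y1) in zip(centroids, centroids[1:]):
        total_px += math.hypot(x1 - x0, y1 - y0)
    return total_px / pixels_per_metre

path = [(0, 0), (30, 40), (30, 140)]  # 50 px + 100 px = 150 px travelled
print(total_distance_metres(path))    # 3.0 metres at 50 px/m
```

In a deployed system, the calibration factor would come from a known reference length in the scene (e.g., a measured floor marking).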
3.4. Ground Truth Generation
- The system plays video input and detects humans using YOLO.
- The user is given the opportunity to input an ID for each detection.
- The user manually confirms that “this person is ID-1” and “that person is ID-2”.
- This manual labeling information is stored in a JSON file (e.g., frame_ids.json) with timestamps and coordinates.
- We have considered this JSON file to be the “gold standard” or 100% accurate data against which our automated model will be compared.
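The gold-standard comparison described above can be sketched as follows; the JSON schema and the accuracy function are illustrative assumptions about how frame_ids.json might be organized, not the authors' exact format:

```python
import json

# Illustrative layout for frame_ids.json: per-frame timestamps with
# manually confirmed IDs and bounding-box coordinates.
ground_truth = {
    "frames": [
        {"timestamp": 0.033,
         "labels": [{"id": 1, "bbox": [10, 20, 50, 120]},
                    {"id": 2, "bbox": [70, 25, 110, 130]}]},
    ]
}

def id_accuracy(gt_labels, predicted_ids):
    """Fraction of ground-truth detections whose predicted ID matches."""
    correct = sum(1 for lab in gt_labels
                  if predicted_ids.get(tuple(lab["bbox"])) == lab["id"])
    return correct / len(gt_labels)

text = json.dumps(ground_truth)               # what would be saved to disk
labels = json.loads(text)["frames"][0]["labels"]
preds = {(10, 20, 50, 120): 1, (70, 25, 110, 130): 3}  # model got ID 2 wrong
print(id_accuracy(labels, preds))             # 0.5
```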
3.5. Performance Evaluation Metrics
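The tracking metrics reported later (MOTA and IDF1) follow their standard definitions:

```latex
\mathrm{MOTA} = 1 - \frac{\sum_t \left(\mathrm{FN}_t + \mathrm{FP}_t + \mathrm{IDSW}_t\right)}{\sum_t \mathrm{GT}_t},
\qquad
\mathrm{IDF1} = \frac{2\,\mathrm{IDTP}}{2\,\mathrm{IDTP} + \mathrm{IDFP} + \mathrm{IDFN}}
```

Here FN, FP, and IDSW are the per-frame false negatives, false positives, and identity switches, GT is the number of ground-truth objects, and IDTP, IDFP, and IDFN are the identity-level true positives, false positives, and false negatives.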
3.6. Hardware and Software Environment
- Programming Language: Python 3.12
- Deep Learning Framework: PyTorch = 2.5.0 + cu124 (for ViT), Ultralytics = 8.4.18 (for YOLOv8n)
- Computer Vision Library: OpenCV = 4.13.0.92
- Data Analysis: NumPy = 2.1.3, Pandas = 2.3.2, SciPy = 1.16.1
- Visualization: Matplotlib = 3.10.5, Seaborn = 0.13.x
- GPU: NVIDIA RTX A4000
- GPU Memory: 16 GB dedicated + 15.8 GB shared
- Processor: Intel(R) Core(TM) i9-10900X CPU @ 3.70 GHz, 10 Core(s), 20 Logical Processors
- Processing Time: 30 ms
4. Results and Discussion
- Quantitative Analysis: a mathematical evaluation of the model’s accuracy, precision, recall, and confusion matrix.
- Qualitative and Behavioral Analysis: an interpretation of the visual data on students’ trajectories, walking distances, and screen time.
4.1. Quantitative Performance Evaluation of the Model
4.1.1. Cross-Dataset Performance Evaluation: Custom Dataset vs. MOT20 Dataset
4.1.2. In-Depth Analysis of Tracking vs. Re-Identification Across Datasets
- Performance Analysis: As shown in the graph, the model’s MOTA score is 91.2% on the MOT20 dataset, a slight improvement over the 86.7% achieved on the custom dataset. This indicates that the model readily avoids false positives in a typical crowded environment (e.g., dense pedestrian traffic). The most significant result, however, appears in the IDF1 score: the model reached 99.7% on the custom dataset, surpassing the 97.7% achieved on MOT20. IDF1 essentially quantifies how long an identity persists. In a kindergarten setting, children are especially prone to identity switches because of their identical uniforms and rapid movement. The model handled this specific challenge well on our custom dataset, achieving near-perfect ID retention (99.7%).
- Re-Identification Performance Analysis: An inverse trend is observed in Re-ID performance. The model’s Rank-1 (93.5%) and mAP (95.0%) scores on the MOT20 dataset are significantly higher than the Rank-1 (83.1%) and mAP (85.7%) scores on the custom dataset. The MOT20 dataset exhibits high inter-class variation in clothing, color, and body type, so the Vision Transformer (ViT) can extract and separate distinguishing features relatively easily; this is the main reason for the high Rank-1 and mAP scores. In the kindergarten dataset, by contrast, all children wear the same color uniform and cap, creating high visual ambiguity between them. Due to this extreme similarity, the Re-ID metrics drop slightly on the custom dataset (83.1% Rank-1), which is normal and realistic in this type of research. This analysis clearly demonstrates that a high Re-ID metric (e.g., the 95.0% mAP on MOT20) does not automatically guarantee perfect tracking in a complex environment such as a kindergarten. Rather, adapting the model to the custom dataset is essential for consistently tracking children wearing identical uniforms, as evidenced by the improved IDF1 score of 99.7% that we achieved.
4.2. Qualitative and Behavioral Analysis
4.2.1. Qualitative Evaluation of Spatial Trajectory Mapping
- Comprehensive Tracking in Custom Model
- Visual Evidence of Domain Gap in MOT20
- Behavioral Insights for Early Childhood Development
4.2.2. Comparative Analysis of Total Walking Distance and Tracking Stability
- Dynamic ID Assignment and Distribution Gap: The IDs shown in the graph (1 to 12) are dynamically assigned in two different model runs. Therefore, ‘ID 1’ in the custom dataset might not represent the same child as ‘ID 1’ in the MOT20 dataset. On the MOT20 dataset, the model repeatedly becomes confused due to domain gaps and loses IDs (ID switches). As a result, a child’s entire walking path becomes fragmented, and the model divides it into multiple different IDs. This is why the MOT20 distance data (red bars) is fragmented and cannot serve as an accurate measure of actual physical activity.
- Trajectory Consistency of Custom Model: On the other hand, our proposed custom model (blue bars) maintains excellent tracking stability. Due to the high IDF1 score of the model (99.7%), it is able to map the uninterrupted trajectory of each child from start to finish without losing any IDs. For example, ID 5 in the custom model recorded a maximum walking distance of 14.99 m and ID 6 12.11 m, which is the result of a single and complete tracking session.
- Consequences of Missed Detections: In agreement with the earlier screen time assessment, the graph further illustrates that the MOT20 model failed to produce any distance data, marked Not Detected (N/D), for IDs 10, 11, and 12. This outcome stemmed from the model’s complete inability to detect and track these children within the provided frames. Conversely, the custom model successfully furnished distance data for all twelve children (e.g., 6.0 m for ID 10, 3.91 m for ID 11).
4.2.3. Comparative Analysis of Screen Time and Detection Reliability
- Figure 9 clearly shows that the model trained on our custom dataset (blue bar) calculated screen times very close to the actual values (green bar). For example, for ID 1 the model predicted 6.41 s; for ID 2 (7.25 s), ID 3 (10.0 s), and ID 12 (3.53 s), the custom model’s predictions match the ground truth exactly. This demonstrates that the proposed YOLOv8 + ViT framework can track children without losing IDs, even under occlusion.
- Domain Gap and Temporal Fragmentation in MOT20: On the other hand, the use of the MOT20 dataset (red bar) shows the effect of a severe ‘domain gap’. The MOT20 model entirely failed to identify IDs 10, 11, and 12 as humans in the video, as indicated by their designation as ‘Not Detected’ in the graph. This observation reinforces the assertion that the model was only able to detect 9 of the 12 IDs.
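Screen time per child follows directly from per-frame ID presence divided by the frame rate. A minimal sketch of this accumulation; the data layout is an assumption for illustration, not the authors' implementation:

```python
from collections import Counter

def screen_time_seconds(frame_ids, fps=29.5):
    """Per-ID screen time from the list of IDs detected in each frame.

    frame_ids: one list of visible IDs per video frame.
    fps: recording frame rate (29.5 fps matches the source footage).
    """
    counts = Counter(pid for frame in frame_ids for pid in set(frame))
    return {pid: n / fps for pid, n in counts.items()}

# Toy clip: ID 1 visible in 3 of 4 frames, ID 2 in 2 of 4.
frames = [[1, 2], [1], [1, 2], []]
times = screen_time_seconds(frames, fps=2.0)  # 2 fps for easy arithmetic
print(times)  # {1: 1.5, 2: 1.0}
```

An ID that the detector never finds (as with IDs 10 to 12 on MOT20) simply never appears in the counts, which is why those children end up marked ‘Not Detected’.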
4.2.4. Comparative Analysis of the Proposed Method with Previous Work
- Comprehensive and Integrated Architecture: A look at the ‘Detection’, ‘Tracking’, ‘Re-ID’, and ‘Trajectory Analysis’ columns of Table 2 reveals that most of the previous work is partial or fragmented. For example, the very popular tracking models such as ByteTrack [46], OC-SORT [47], and Hybrid-SORT [48] have robust detection and tracking but no specific ‘Re-ID’ module. On the other hand, TransRe-ID [55] is good at re-identification but lacks detection or tracking capabilities. Our proposed model is the only framework that successfully integrates all four modules into a single pipeline, which is essential for real-world surveillance.
- Unrivaled Identity Preservation (IDF-1): The biggest challenge in multi-object tracking is preventing ID switches during tracking. According to Table 2, among previous SOTA models, FairMOT [45] peaks at an IDF-1 of 72.8% and ByteTrack [46] at 77.3%; even the Hybrid-SORT [48] model could not exceed an IDF-1 of 78.7%. In contrast, our proposed model achieved an IDF-1 score of 97.7% on the MOT20 benchmark and 99.7% on the custom dataset. This clearly demonstrates that our approach outperforms previous SORT-based and Transformer-based trackers by a large margin in preserving IDs through global feature extraction with ViT.
- Cross-Dataset Robustness: Most previous work has only been evaluated on a specific domain dataset (such as only MOT17 or only DanceTrack). We evaluated our model on both a general crowd dataset (MOT20) and a highly visually ambiguous custom kindergarten dataset. Our model achieved 91.2% MOTA and 93.5% Rank-1 Accuracy on the MOT20 dataset, which proves that it is equally effective not only in kindergarten but also in any complex and crowded environment.
5. Conclusions and Future Works
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Tillmann, S.; Tobin, D.; Avison, W.; Gilliland, J. Mental health benefits of interactions with nature in children and teenagers: A systematic review. J. Epidemiol. Community Health 2018, 72, 958–966. [Google Scholar] [CrossRef]
- McCurdy, L.E.; Winterbottom, K.E.; Mehta, S.S.; Roberts, J.R. Using nature and outdoor activity to improve children’s health. Curr. Probl. Pediatr. Adolesc. Health Care 2010, 40, 102–117. [Google Scholar] [CrossRef] [PubMed]
- McCormick, R. Does access to green space impact the mental well-being of children: A systematic review. J. Pediatr. Health Care 2017, 37, 3–7. [Google Scholar] [CrossRef]
- Poulain, T.; Sobek, C.; Ludwig, J.; Igel, U.; Grande, G.; Ott, V.; Kiess, W.; Körner, A.; Vogel, M. Associations of Green Spaces and Streets in the Living Environment with Outdoor Activity, Media Use, Overweight/Obesity and Emotional Wellbeing in Children and Adolescents. Int. J. Environ. Res. Public Health 2020, 17, 6321. [Google Scholar] [CrossRef]
- Olshansky, S.J.; Passaro, D.J.; Hershow, R.C.; Layden, J.; Carnes, B.A.; Brody, J.; Hayflick, L.; Butler, R.N.; Allison, D.B.; Ludwig, D.S. A potential decline in life expectancy in the United States in the 21st century. N. Engl. J. Med. 2005, 352, 1138–1145. [Google Scholar] [CrossRef] [PubMed]
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef]
- Malagoli, E.; Di Persio, L. 2D Object Detection: A Survey. Mathematics 2025, 13, 893. [Google Scholar] [CrossRef]
- Ye, M.; Shen, J.; Lin, G.; Xiang, T.; Shao, L.; Hoi, S.C.H. Deep Learning for Person Re-Identification: A Survey and Outlook. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 2872–2893. [Google Scholar] [CrossRef] [PubMed]
- Zheng, L.; Yang, Y.; Hauptmann, A.G. Person reidentification: Past, present and future. arXiv 2016, arXiv:1610.02984. [Google Scholar] [CrossRef]
- He, L.; Liao, X.; Liu, W.; Liu, X.; Cheng, P.; Mei, T. FastReID: A Pytorch Toolbox for General Instance Re-identification. arXiv 2020. [Google Scholar] [CrossRef]
- Trujillo-Lopez, L.A.; Raymundo-Guevara, R.A.; Morales-Arevalo, J.C. User-Centered Design of a Computer Vision System for Monitoring PPE Compliance in Manufacturing. Computers 2025, 14, 312. [Google Scholar] [CrossRef]
- Liu, Q.; Jiang, X.; Jiang, R. Classroom Behavior Recognition Using Computer Vision: A Systematic Review. Sensors 2025, 25, 373. [Google Scholar] [CrossRef]
- Yi, K.; Li, J.; Zhang, Y. Multi-Object Tracking with Confidence-Based Trajectory Prediction Scheme. Sensors 2025, 25, 7221. [Google Scholar] [CrossRef]
- Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar] [CrossRef]
- Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and Efficient Object Detection. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10778–10787. [Google Scholar] [CrossRef]
- Jiao, L.C.; Zhang, F.; Liu, F.; Yang, S.Y.; Li, L.L.; Feng, Z.X.; Qu, R. A Survey of Deep Learning-Based Object Detection. IEEE Access 2019, 7, 128837–128868. [Google Scholar] [CrossRef]
- Cheng, G.; Chao, P.; Yang, J.; Ding, H. SGST-YOLOv8: An Improved Lightweight YOLOv8 for Real-Time Target Detection for Campus Surveillance. Appl. Sci. 2024, 14, 5341. [Google Scholar] [CrossRef]
- Zheng, L.; Zhang, H.; Sun, S.; Chandraker, M.; Yang, Y.; Tian, Q. Person Re-identification in the Wild. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 3346–3355. [Google Scholar]
- Lin, Y.-W.; Lin, Y.-B.; Hung, H.-N. CalibrationTalk: A Farming Sensor Failure Detection and Calibration Technique. IEEE Internet Things J. 2021, 8, 6893–6903. [Google Scholar] [CrossRef]
- Miao, J.; Wu, Y.; Liu, P.; Ding, Y.; Yang, Y. Pose-Guided Feature Alignment for Occluded Person Re-Identification. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27–29 October 2019; pp. 542–551. [Google Scholar] [CrossRef]
- Zhu, K.; Guo, H.; Liu, Z.; Tang, M.; Wang, J. Identity-Guided Human Semantic Parsing for Person Re-identification. In Computer Vision–ECCV 2020; Lecture Notes in Computer Science; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M., Eds.; Springer: Cham, Switzerland, 2020; Volume 12348. [Google Scholar] [CrossRef]
- He, L.; Liang, J.; Li, H.; Sun, Z. Deep Spatial Feature Reconstruction for Partial Person Re-identification: Alignment-free Approach. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 7073–7082. [Google Scholar]
- Chan, S.; Wang, J.; Cui, J.; Hu, J.; Li, Z.; Mao, J. Multi-granularity feature intersection learning for visible-infrared person re-identification. Complex Intell. Syst. 2025, 11, 291. [Google Scholar] [CrossRef]
- Luo, H.; Gu, Y.; Liao, X.; Lai, S.; Jiang, W. Bag of Tricks and a Strong Baseline for Deep Person Re-Identification. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Long Beach, CA, USA, 16–17 June 2019; pp. 1487–1495. [Google Scholar]
- Wei, L.; Zhang, S.; Gao, W.; Tian, Q. Person Transfer GAN to Bridge Domain Gap for Person Re-identification. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 79–88. [Google Scholar]
- Zhou, J.; Zhao, S.; Li, S.; Cheng, B.; Chen, J. Research on Person Re-Identification through Local and Global Attention Mechanisms and Combination Poolings. Sensors 2024, 24, 5638. [Google Scholar] [CrossRef]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
- Bin Amir, S.; Horio, K. YOLOv8s-NE: Enhancing Object Detection of Small Objects in Nursery Environments Based on Improved YOLOv8. Electronics 2024, 13, 3293. [Google Scholar] [CrossRef]
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 9992–10002. [Google Scholar]
- Habock, U.; Garoffolo, A.; Benedetto, D.D. Darlin: Recursive Proofs using Marlin. arXiv 2021. [Google Scholar] [CrossRef]
- Han, C.; Koo, Y.; Kim, J.; Choi, K.; Hong, S. Wafer Type Ion Energy Monitoring Sensor for Plasma Diagnosis. Sensors 2023, 23, 2410. [Google Scholar] [CrossRef]
- Wu, H.; Xiao, B.; Codella, N.; Liu, M.; Dai, X.; Yuan, L.; Zhang, L. CvT: Introducing Convolutions to Vision Transformers. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 22–31. [Google Scholar]
- Dai, S.; Zhou, H.; Han, T.; Yang, M.; Zhang, Y. Blockchain-Based Smart Kitchen Platform for the Catering Industry. In Proceedings of the 2022 5th International Conference on Hot Information-Centric Networking (HotICN), Guangzhou, China, 24–26 November 2022; pp. 55–60. [Google Scholar] [CrossRef]
- Han, K.; Wang, Y.; Chen, H.; Chen, X.; Guo, J.; Liu, Z.; Tang, Y.; Xiao, A.; Xu, C.; Xu, Y.; et al. A Survey on Vision Transformer. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 87–110. [Google Scholar] [CrossRef]
- Liu, C.; Zhao, Z.; Zhou, Y.; Ma, L.; Sui, X.; Huang, Y.; Yang, X.; Ma, X. Object Detection and Information Perception by Fusing YOLO-SCG and Point Cloud Clustering. Sensors 2024, 24, 5357. [Google Scholar] [CrossRef]
- Fouad, M.A.; Hamza, H.M.; Hosny, K.M. Robust vision transformer-based framework for person re-identification through occlusion-aware training. Multimed. Tools Appl. 2026, 85, 190. [Google Scholar] [CrossRef]
- Sun, S.; Akhtar, N.; Song, H.; Mian, A.S.; Shah, M. Deep Affinity Network for Multiple Object Tracking. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 104–119. [Google Scholar] [CrossRef] [PubMed]
- Mohamed, A.; Qian, K.; Elhoseiny, M.; Claudel, C. Social-STGCNN: A Social Spatio-Temporal Graph Convolutional Neural Network for Human Trajectory Prediction. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 14412–14420. [Google Scholar] [CrossRef]
- Alahi, A.; Goel, K.; Ramanathan, V.; Robicquet, A.; Fei-Fei, L.; Savarese, S. Social LSTM: Human Trajectory Prediction in Crowded Spaces. In Proceedings of the 29th IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016; pp. 961–971. [Google Scholar] [CrossRef]
- Abdelgawwad, A.; Mallofre, A.C.; Patzold, M. A Trajectory-Driven 3D Channel Model for Human Activity Recognition. IEEE Access 2021, 9, 103393–103406. [Google Scholar] [CrossRef]
- Zhang, Y.; Wang, C.; Wang, X.; Zeng, W.; Liu, W. FairMOT: On the Fairness of Detection and Re-identification in Multiple Object Tracking. Int. J. Comput. Vis. 2021, 129, 3069–3087. [Google Scholar] [CrossRef]
- Zhang, Y.; Sun, P.; Jiang, Y.; Yu, D.; Weng, F.; Yuan, Z.; Luo, P.; Liu, W.; Wang, X. ByteTrack: Multi-object Tracking by Associating Every Detection Box. In Computer Vision–ECCV 2022; Lecture Notes in Computer Science; Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T., Eds.; Springer: Cham, Switzerland, 2022; Volume 13682. [Google Scholar] [CrossRef]
- Cao, J.; Pang, J.; Weng, X.; Khirodkar, R.; Kitani, K. Observation-Centric SORT: Rethinking SORT for Robust Multi-Object Tracking. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 9686–9696. [Google Scholar] [CrossRef]
- Zeng, F.; Dong, B.; Zhang, Y.; Wang, T.; Zhang, X.; Wei, Y. MOTR: End-to-End Multiple-Object Tracking with Transformer. In Computer Vision–ECCV 2022; Lecture Notes in Computer Science; Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T., Eds.; Springer: Cham, Switzerland, 2022; Volume 13687. [Google Scholar] [CrossRef]
- Yang, M.; Han, G.; Yan, B.; Zhang, W.; Qi, J.; Lu, H.; Wang, D. Hybrid-SORT: Weak cues matter for online multi-object tracking. Proc. AAAI Conf. Artif. Intell. 2024, 38, 6504–6512. [Google Scholar] [CrossRef]
- Meng, W.; Duan, S.; Ma, S.; Hu, B. Motion-Perception Multi-Object Tracking (MPMOT): Enhancing Multi-Object Tracking Performance via Motion-Aware Data Association and Trajectory Connection. J. Imaging 2025, 11, 144. [Google Scholar] [CrossRef]
- Islam, R.; Horio, K. Identification of Human Activity by Utilizing YOLOv5s Approaches. In Proceedings of the 2024 IEEE 3rd International Conference on Computing and Machine Intelligence (ICMI), Mount Pleasant, MI, USA, 13–14 April 2024; pp. 1–7. [Google Scholar] [CrossRef]
- Bisla, T.; Shukla, R.; Dhawan, M.; Islam, R.; Horio, K. Jumping behavior Analysis after Identification of Daycare Children’s Activities utilizing YOLOv8 Algorithm. In Proceedings of the 2024 IEEE International Conference on Electronics, Computing and Communication Technologies (CONECCT), Bangalore, India, 12–14 July 2024; pp. 1–6. [Google Scholar] [CrossRef]
- Hermens, F. Automatic object detection for behavioural research using YOLOv8. Behav. Res. Methods 2024, 56, 7307–7330. [Google Scholar] [CrossRef] [PubMed]
- Wu, Q.; Nie, X. Improved YOLOv10: A Real-Time Object Detection Approach in Complex Environments. Sensors 2025, 25, 6893. [Google Scholar] [CrossRef] [PubMed]
- He, S.; Luo, H.; Wang, P.; Wang, F.; Li, H.; Jiang, W. TransRe-ID: Transformer-based Object Re-ID. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 14993–15002. [Google Scholar] [CrossRef]
- Bikku, T.; Thota, S.; Ayoade, A.A. An Innovative Framework for Intelligent Computer Vision Empowered by Deep Learning. Uniciencia 2025, 39, 222–238. [Google Scholar] [CrossRef]
- Ghiță, A.Ș.; Florea, A.M. Real-Time People Re-Identification and Tracking for Autonomous Platforms Using a Trajectory Prediction-Based Approach. Sensors 2022, 22, 5856. [Google Scholar] [CrossRef]
- Zhang, X.; Wu, R.; Qu, Z. A Cosine-Similarity-Based Deconvolution Method for Analyzing Data-Independent Acquisition Mass Spectrometry Data. Appl. Sci. 2023, 13, 5969. [Google Scholar] [CrossRef]
- Available online: https://en.wikipedia.org/wiki/Euclidean_distance (accessed on 5 January 2026).
Table 1. Dataset composition and train/test/validation splits.

| Dataset | Class | Train (73%) | Test (9%) | Val (18%) | All |
|---|---|---|---|---|---|
| Custom | person | 23,010 | 2837 | 5674 | 31,521 |
| MOT20 | person | 20,791 | 2613 | 5226 | 28,630 |
Table 2. Comparative analysis of the proposed method with previous work.

| Works | Method | Dataset | Detection | Tracking | Re-ID | Trajectory Analysis | MOTA | IDF-1 | RANK-1 |
|---|---|---|---|---|---|---|---|---|---|
| Sun et al. [37] | Deep Affinity Network | MOT17 | × | ✓ | ✓ | ✓ | 52.42% | 49.49% | |
| | | MOT15 | × | ✓ | ✓ | ✓ | 38.30% | 45.60% | |
| Mohamed et al. [38] | Social-STGCNN | ETH | × | ✓ | × | ✓ | | | |
| | | UCY | × | ✓ | × | ✓ | | | |
| Alahi et al. [39] | Social-LSTM | ETH | × | × | × | ✓ | | | |
| | | UCY | × | × | × | ✓ | | | |
| Abdelgawwad et al. [40] | Trajectory-Driven 3D Model | | × | × | × | ✓ | | | |
| Zhang et al. [41] | FairMOT | MOT15 | ✓ | ✓ | ✓ | ✓ | 60.5% | 64.7% | |
| | | MOT16 | ✓ | ✓ | ✓ | ✓ | 74.9% | 72.8% | |
| | | MOT17 | ✓ | ✓ | ✓ | ✓ | 73.7% | 72.3% | |
| | | MOT20 | ✓ | ✓ | ✓ | ✓ | 61.8% | 67.3% | |
| Zhang et al. [42] | ByteTrack | MOT17 | ✓ | ✓ | × | ✓ | 80.3% | 77.3% | |
| | | MOT20 | ✓ | ✓ | × | ✓ | 77.8% | 75.2% | |
| Cao et al. [43] | OC-SORT | MOT17 | ✓ | ✓ | × | ✓ | 78.0% | 77.5% | |
| | | MOT20 | ✓ | ✓ | × | ✓ | 75.5% | 75.9% | |
| Zeng et al. [44] | MOTR | DanceTrack | ✓ | ✓ | × | ✓ | 79.7% | 51.5% | |
| | | MOT17 | ✓ | ✓ | × | ✓ | 73.4% | 68.6% | |
| Yang et al. [45] | Hybrid-SORT | DanceTrack | ✓ | ✓ | × | ✓ | 91.8% | 67.4% | |
| | | MOT17 | ✓ | ✓ | × | ✓ | 79.9% | 78.7% | |
| | | MOT20 | ✓ | ✓ | × | ✓ | 76.7% | 78.4% | |
| Meng et al. [46] | MPMOT | MOT16 | ✓ | ✓ | × | ✓ | 72.2% | 72.8% | |
| | | MOT17 | ✓ | ✓ | × | ✓ | 71.4% | 72.6% | |
| | | MOT20 | ✓ | ✓ | × | ✓ | | | |
| Hermens [47] | YOLOv8 | | ✓ | × | × | × | | | |
| He et al. [48] | TransRe-ID | MSMT17 | × | × | ✓ | × | | | 83.3% |
| | | VeRi-776 | × | × | ✓ | × | | | 96.9% |
| Islam et al. [49] | YOLOv5s | Custom | ✓ | ✓ | × | × | | | |
| Bisla et al. [50] | YOLOv8s | Custom | ✓ | ✓ | × | ✓ | | | |
| Wu et al. [51] | YOLOv10n | self-constructed Chu | ✓ | × | × | × | 69.5% | | |
| Thulasi et al. [52] | CNN-FPN | MS COCO | × | × | | | 57.2% | | |
| Alexandra et al. [53] | Social-GAN | MOT17 | ✓ | × | × | ✓ | 61.04% | 52.24% | |
| Our Proposed Work | Hybrid Methods based on YOLOv8n & ViT | MOT20 | ✓ | ✓ | ✓ | ✓ | 91.2% | 97.7% | 93.5% |
| | | Custom | ✓ | ✓ | ✓ | ✓ | 86.7% | 99.7% | 83.1% |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Islam, M.R.; Kataoka, Y.; Teramoto, K.; Horio, K. Spatially Time-Based Robust Tracking and Re-Identification of Kindergarten Students: A Hybrid Deep Learning Framework Combining YOLOv8n and Vision Transformer (ViT). J. Imaging 2026, 12, 150. https://doi.org/10.3390/jimaging12040150

