Author Contributions
Conceptualization, Y.W. and H.K.; methodology, Y.W. and H.K.; writing—original draft preparation, Y.W. and H.K.; data curation, Y.W. and H.K.; validation, Y.W. and M.F.; visualization, Y.W., M.F. and L.M.D.; investigation, L.M.D. and M.F.; software, L.M.D. and M.F.; writing—review and editing, M.F. and L.M.D.; supervision, H.M.; project administration, H.M. and K.-W.L.; funding acquisition, H.M. and K.-W.L. All authors have read and agreed to the published version of the manuscript.
Funding
This work was supported by the Korea Institute of Planning and Evaluation for Technology in Food, Agriculture, Forestry and Fisheries (IPET) through the Technology Commercialization Support Program, funded by the Ministry of Agriculture, Food and Rural Affairs (MAFRA) (RS-2025-02218444); by the “Regional Innovation System & Education (RISE)” program through the Seoul RISE Center, funded by the Ministry of Education (MOE) and the Seoul Metropolitan Government (2025-RISE-01-019-04); and by the Culture, Sports and Tourism R&D Program through the Korea Creative Content Agency grant funded by the Ministry of Culture, Sports and Tourism in 2025 (Project Name: Training Global Talent for Copyright Protection and Management of On-Device AI Models, Project Number: RS-2025-02221620, Contribution Rate: 100%).
Figure 1.
Overall framework of the proposed OATSAM-Track. SAM-generated instance masks enable explicit visibility estimation, which drives occlusion-aware memory update and re-identification recovery.
Figure 2.
Occlusion-aware tracking mechanism of OATSAM-Track. The hybrid occlusion classifier integrates MobileSAM-based visibility cues and appearance information to estimate occlusion states [14]. The predicted occlusion state is further used by a state controller to selectively regulate appearance memory updates and to trigger strict multi-gate re-identification recovery.
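The selective memory update described in this caption can be illustrated with a minimal Python sketch. The rule used here (store features for states 0 and 1, freeze the memory for states 2 and 3) and the names `AppearanceMemory` and `update` are assumptions made for illustration; the caption does not specify the actual policy of the state controller.

```python
from dataclasses import dataclass, field
from typing import List

# Occlusion state IDs as listed in Table 1.
NO_OCC, PARTIAL_OCC, SEVERE_OCC, FULL_OCC = 0, 1, 2, 3


@dataclass
class AppearanceMemory:
    """Per-track feature bank with an occlusion-gated update (hypothetical)."""
    budget: int = 100  # same value as nn_budget in Table 7
    features: List[List[float]] = field(default_factory=list)

    def update(self, feature: List[float], occlusion_state: int) -> bool:
        """Store the feature only when the target is sufficiently visible.

        Assumed rule: accept states 0 and 1, skip states 2 and 3.
        Returns True when the feature was stored.
        """
        if occlusion_state >= SEVERE_OCC:
            return False
        self.features.append(feature)
        # Keep only the most recent `budget` features.
        if len(self.features) > self.budget:
            self.features = self.features[-self.budget:]
        return True
```

Freezing the memory under heavy occlusion is one plausible way to keep occluder appearance (for example, leaves) out of the stored track features.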
Figure 3.
Occlusion-aware ReID recovery process with multi-gate validation and confidence accumulation.
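A minimal Python sketch of multi-gate validation with confidence accumulation follows. The three gates (appearance, overlap, visibility), their thresholds, and the gain/decay scheme are hypothetical choices for illustration only. The appearance and overlap defaults reuse `max_cos_dist = 0.4` and `1 - max_iou_dist = 0.2` from Table 7, which is itself an assumption about how the recovery stage is parameterized.

```python
from dataclasses import dataclass


@dataclass
class RecoveryCandidate:
    """Running evidence that a detection is a lost track reappearing."""
    confidence: float = 0.0


def passes_gates(cos_dist: float, iou: float, visibility: float,
                 max_cos_dist: float = 0.4, min_iou: float = 0.2,
                 min_visibility: float = 0.5) -> bool:
    """Return True only if every gate passes (hypothetical gates)."""
    return (cos_dist <= max_cos_dist
            and iou >= min_iou
            and visibility >= min_visibility)


def accumulate(cand: RecoveryCandidate, gates_ok: bool,
               gain: float = 0.4, decay: float = 0.5,
               accept_at: float = 1.0) -> bool:
    """Update the candidate's confidence and report whether to recover the ID.

    A passing frame adds `gain`; a failing frame multiplies by `decay`.
    The ID is restored once the confidence reaches `accept_at`.
    """
    if gates_ok:
        cand.confidence += gain
    else:
        cand.confidence *= decay
    return cand.confidence >= accept_at
```

With these placeholder values, a candidate needs three passing frames before its identity is restored, so a single spurious match cannot trigger recovery on its own.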
Figure 4.
Typical samples from GrapeOcclusionMOTS dataset. The original image (top) and the corresponding ground truth annotations (bottom). Different colors in the annotations indicate occlusion levels: green represents no occlusion, orange represents partial occlusion, and red represents severe occlusion.
Figure 5.
Illustration of the pipeline for constructing the GrapeOcclusionMOTS dataset, including mask generation, occlusion annotation, and detection file preparation for MOT evaluation. ✓ indicates items that are used, × indicates items that are not used, and + indicates the merging of two outputs to form the extended dataset.
Figure 6.
Representative detection results of the proposed OATSAM-Track on the GrapeOcclusionMOTS dataset. (a) Input images. (b) Detection results. (c1–c4) Examples under different occlusion levels: no, partial, severe, and full occlusion.
Figure 7.
Tracking continuity visualization across consecutive frames. The same track IDs are consistently maintained as targets move and reappear from partial or severe occlusion, demonstrating robust temporal association and ReID recovery in OATSAM-Track. Segmentation masks are generated using MobileSAM [14]. Representative examples are shown; similar tracking behavior was observed across multiple sequences and occlusion conditions.
Figure 8.
Failure case of OATSAM-Track under extreme foliage crowding at early tracking stages. Consecutive frames (Frames 6 and 8) illustrate a representative failure scenario where dense leaf overlap and degraded SAM mask quality lead to unreliable visibility estimation. As a result, multiple targets undergo rapid identity switches within a short temporal span, causing severe identity fragmentation and unsuccessful re-identification recovery. This failure typically occurs during early track initialization under extreme occlusion conditions.
Table 1.
Visibility ratio thresholds and corresponding occlusion definitions.
| Visibility Ratio | Occlusion State (ID) | Occlusion Level | Description |
|---|---|---|---|
| | 0 | No occlusion | Fully visible target |
| | 1 | Partial occlusion | Partially occluded target |
| | 2 | Severe occlusion | Heavily occluded target |
| | 3 | Full occlusion | Nearly or fully invisible target |
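The mapping from a visibility ratio to the four occlusion states can be sketched in Python as below. The numeric thresholds of Table 1 are not available in this text, so the function takes them as a required argument; the cut-offs in the usage note are placeholders, and treating each boundary as inclusive is also an assumption.

```python
from typing import Tuple

OCCLUSION_LEVELS = {
    0: "No occlusion",
    1: "Partial occlusion",
    2: "Severe occlusion",
    3: "Full occlusion",
}


def occlusion_state(visibility: float,
                    thresholds: Tuple[float, float, float]) -> int:
    """Map a visibility ratio in [0, 1] to an occlusion state ID (0 to 3).

    `thresholds` holds three strictly decreasing lower bounds for states
    0, 1 and 2; anything below the last bound is state 3.
    """
    if not 0.0 <= visibility <= 1.0:
        raise ValueError("visibility must lie in [0, 1]")
    t_no, t_partial, t_severe = thresholds
    if not t_no > t_partial > t_severe:
        raise ValueError("thresholds must be strictly decreasing")
    if visibility >= t_no:
        return 0
    if visibility >= t_partial:
        return 1
    if visibility >= t_severe:
        return 2
    return 3
```

For example, with the placeholder cut-offs `(0.9, 0.5, 0.1)`, a visibility ratio of 0.7 maps to state 1 (partial occlusion).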
Table 2.
Statistical description of the GrapeOcclusionMOTS dataset, including the number of images, annotated instances, and their distribution across four occlusion levels (no, partial, severe, full).
| Subset | Total Images | Total Instances | No (0) | Partial (1) | Severe (2) | Full (3) |
|---|---|---|---|---|---|---|
| Train | 927 | 4406 | 2403 | 1075 | 926 | 2 |
| Test | 400 | 5869 | 3960 | 1273 | 636 | 0 |
| Total | 1327 | 10,275 | 6363 | 2348 | 1562 | 2 |
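The counts in Table 2 can be cross-checked with a short Python sketch that recomputes the totals and the share of each occlusion level. The counts are copied from the table; the helper names `totals` and `level_shares` are illustrative.

```python
# Per-subset counts copied from Table 2:
# (images, instances, (no, partial, severe, full)).
TABLE_2 = {
    "Train": (927, 4406, (2403, 1075, 926, 2)),
    "Test": (400, 5869, (3960, 1273, 636, 0)),
}
LEVELS = ("no", "partial", "severe", "full")


def totals(table):
    """Sum images, instances and per-level counts over all subsets."""
    images = sum(row[0] for row in table.values())
    instances = sum(row[1] for row in table.values())
    per_level = tuple(sum(col) for col in zip(*(row[2] for row in table.values())))
    return images, instances, per_level


def level_shares(per_level):
    """Return the fraction of instances at each occlusion level."""
    total = sum(per_level)
    return {name: count / total for name, count in zip(LEVELS, per_level)}


if __name__ == "__main__":
    images, instances, per_level = totals(TABLE_2)
    print(images, instances, per_level)
    print(level_shares(per_level))
```

The recomputed totals agree with the "Total" row, and the per-level counts in each row sum to that row's instance count. Full occlusion accounts for only two instances, both in the training subset.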
Table 3.
Performance comparison of YOLO11 detectors on the grape occlusion dataset.
| Model | Precision | Recall | mAP50 | mAP50-95 |
|---|---|---|---|---|
| YOLO11n | 0.852 | 0.763 | 0.824 | 0.463 |
| YOLO11m | 0.860 | 0.801 | 0.865 | 0.543 |
| YOLO11s (Selected) | 0.874 | 0.818 | 0.868 | 0.544 |
Table 4.
Hardware Configuration (Actual Experiment Machine).
| Component | Specification | Remark |
|---|---|---|
| GPU | NVIDIA GeForce GTX TITAN X | Actual experiment GPU |
| CPU | Intel Core i7-5820K CPU @ 3.30 GHz (6 cores, 6 threads) | Actual experiment CPU |
| RAM | 62 GB DDR4 | Actual experiment RAM |
| Storage | 238.5 GB + 2 × 2.7 TB (NVMe/HDD) | Actual experiment storage |
Table 5.
Software Environment (Actual Experiment Machine).
| Software | Version/Details | Remark |
|---|---|---|
| Operating System | Ubuntu 16.04.7 LTS | Actual OS |
| Python | 3.10.18 | Actual Python version |
| PyTorch | 1.12.1 + cu113 (CUDA 11.3) | Actual PyTorch version |
| OpenCV | 4.12.0 | Actual OpenCV version |
| Key Libraries | ultralytics (v8.3.201), boxmot (v15.0.2), mobile_sam (v1.0) | Installed Python libraries used in experiments |
Table 6.
Model Weights (Actual Experiment).
| Model | Weights/Training Details |
|---|---|
| YOLO11s [22] | Trained from scratch (100 epochs) |
| MobileSAM [14] | Pretrained (mobile_sam.pt) |
| ResNet18 [23] Occlusion Classifier | Trained on extended dataset (20 epochs) |
| OSNet [26] (StrongSORT ReID) | Pretrained on MSMT17 [27] (osnet_x0_25_msmt17.pt) |
Table 7.
Final StrongSORT configuration used in the OATSAM-Track framework.
| Parameter | Value | Description |
|---|---|---|
| max_age | 30 | Maximum number of frames a track is kept without detections |
| n_init | 1 | Immediate activation suitable for stable detector outputs |
| max_cos_dist | 0.4 | Appearance-matching threshold to prevent ID switches |
| max_iou_dist | 0.8 | Geometric matching threshold for slow-moving fruit clusters |
| min_conf | 0.05 | Allows low-confidence detections to avoid premature track deletion |
| nn_budget | 100 | Maximum number of stored ReID features per track |
| per_class | False | Class-agnostic tracking suitable for single-class fruit tracking |
| half | False | FP32 mode for numerically stable Kalman updates |
| reid_weights | OSNet-x0.25 | Pretrained ReID model (MSMT17) used by StrongSORT |
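For reproducibility, the settings in Table 7 can be captured in a small Python configuration object, sketched below. The `StrongSortConfig` class and its validation rules are illustrative and are not part of boxmot. The weights filename is taken from Table 6, and how these fields are passed to the tracker depends on the installed boxmot version.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class StrongSortConfig:
    """Values from Table 7 (the class itself is a hypothetical container)."""
    max_age: int = 30            # frames a track survives without detections
    n_init: int = 1              # hits required before a track is confirmed
    max_cos_dist: float = 0.4    # appearance-matching threshold
    max_iou_dist: float = 0.8    # geometric matching threshold
    min_conf: float = 0.05       # minimum detection confidence
    nn_budget: int = 100         # stored ReID features per track
    per_class: bool = False      # class-agnostic tracking
    half: bool = False           # FP32 inference
    reid_weights: str = "osnet_x0_25_msmt17.pt"

    def __post_init__(self) -> None:
        # Basic sanity checks (assumed ranges, not taken from the paper).
        for name in ("max_cos_dist", "max_iou_dist", "min_conf"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must lie in [0, 1], got {value}")
        for name in ("max_age", "n_init", "nn_budget"):
            if getattr(self, name) < 1:
                raise ValueError(f"{name} must be a positive integer")


if __name__ == "__main__":
    print(asdict(StrongSortConfig()))
```

Keeping the configuration immutable (`frozen=True`) makes it harder to change a threshold by accident between ablation runs.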
Table 8.
Runtime comparison of different segmentation backends used for visibility estimation. All values report end-to-end per-frame costs measured within the tracking pipeline, including mask generation and necessary post-processing. The down arrow (↓) indicates that lower values are better.
| Segmentation Backend | GPU | Batch Size | Time per Frame (s) ↓ |
|---|---|---|---|
| SAM (ViT-H) [10] | GTX TITAN X | 1 | >1.0 (impractical) |
| MobileSAM [14] | GTX TITAN X | 1 | ∼0.17 |
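The per-frame times in Table 8 can be converted into throughput with the short Python sketch below. The helper names are illustrative, and the total-time estimate assumes a constant cost per frame over the 400 test images reported in Table 2.

```python
def frames_per_second(seconds_per_frame: float) -> float:
    """Convert a per-frame cost in seconds into frames per second."""
    if seconds_per_frame <= 0:
        raise ValueError("seconds_per_frame must be positive")
    return 1.0 / seconds_per_frame


def total_seconds(seconds_per_frame: float, num_frames: int) -> float:
    """Estimate total processing time, assuming a constant per-frame cost."""
    return seconds_per_frame * num_frames


if __name__ == "__main__":
    mobile_sam_cost = 0.17  # approximate value from Table 8
    print(round(frames_per_second(mobile_sam_cost), 2))
    print(round(total_seconds(mobile_sam_cost, 400), 1))
```

At roughly 0.17 s per frame, MobileSAM corresponds to about 5.9 frames per second, whereas a cost above 1.0 s per frame for SAM (ViT-H) means less than one frame per second on the same GPU.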
Table 9.
Overall tracking performance on the full test set. Bold values indicate the best performance for each metric.
| Method | MOTA | MOTP | IDF1 | Precision | Recall | IDSW | Frag |
|---|---|---|---|---|---|---|---|
| DeepSORT [3] | 0.126 | 0.144 | 0.301 | 0.602 | 0.418 | 94 | 26 |
| ByteTrack [4] | 0.286 | 0.044 | 0.302 | 0.992 | 0.321 | 183 | 165 |
| OC-SORT [30] | 0.362 | 0.041 | 0.387 | 0.971 | 0.398 | 142 | 138 |
| BoT-SORT [31] | 0.418 | 0.039 | 0.441 | 0.964 | 0.452 | 118 | 121 |
| OATSAM-Track | 0.712 | 0.031 | 0.642 | 0.918 | 0.756 | 61 | 74 |
Table 10.
Occlusion-specific F1 score comparison using ground-truth occlusion labels. These scores specifically measure identity preservation under occlusion, rather than overall detection completeness.
| Method | No_F1 | Partial_F1 | Severe_F1 |
|---|---|---|---|
| DeepSORT [3] | 0.483 | 0.472 | 0.456 |
| ByteTrack [4] | 0.486 | 0.484 | 0.445 |
| OC-SORT [30] | 0.521 | 0.498 | 0.463 |
| BoT-SORT [31] | 0.548 | 0.517 | 0.482 |
| OATSAM-Track | 0.953 | 0.954 | 0.909 |
Table 11.
Ablation study of occlusion-aware components corresponding to the proposed modules in Section 3.2, Section 3.3, Section 3.4 and Section 3.5. ↑ and ↓ indicate that higher and lower values are better, respectively. ✓ and × denote whether the corresponding module is enabled or disabled.
| Method | SAM | Occlusion State (ID) | Memory | ReID | IDF1 ↑ | IDSW ↓ | Severe-IDF1 ↑ |
|---|---|---|---|---|---|---|---|
| Baseline (StrongSORT) | × | × | × | × | 71.2 | 164 | 61.8 |
| + SAM visibility | ✓ | × | × | × | 72.6 | 152 | 64.3 |
| + Occlusion state (rule) | ✓ | ✓ | × | × | 74.1 | 138 | 67.5 |
| + Adaptive memory | ✓ | ✓ | ✓ | × | 75.4 | 121 | 69.8 |
| + ReID recovery (Full) | ✓ | ✓ | ✓ | ✓ | 76.8 | 109 | 72.6 |