Figure 1.
Typical forest fire images from the FireMOT dataset. We describe the images in terms of the camera viewpoint, time of day, fire and smoke intensity, scene, etc. (a) Top-down shot, snow, daylight, minor fire, slight smoke. (b) Top-down shot, daylight, open flame, no visible smoke. (c) Top-down shot, dusk or dawn, minor fire, slight smoke. (d) Oblique shot, dusk or dawn, minor fire, slight smoke. (e) Oblique shot, snow, daylight, minor fire, slight smoke. (f) Oblique shot, daylight, major fire, dense smoke. (g) Eye-level shot, daylight, minor fire, medium smoke. (h) Oblique shot, dusk or dawn, minor fire, dense smoke. (i) Oblique shot, daylight, dense smoke, no active fire. (j) Oblique shot, night, major fire, no visible smoke. (k) Eye-level shot, night, moderate fire, medium smoke. (l) Oblique shot, daylight, major fire, medium smoke.
Figure 2.
The architecture of the AO-OCSORT framework. The AO-OCSORT algorithm associates existing trajectories with current detection bounding boxes. The proposed temporal–physical similarity metric integrates both temporal information and physical characteristics of fire targets to enhance data association. This information is processed by the HAS module, which uses scene classification and low-score filtering to reduce interference. Ultimately, the VTG module produces continuous and stable tracking trajectories.
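To make the dataflow in Figure 2 concrete, the following skeleton sketches one association step. It is a hypothetical outline, not the authors' implementation: the three stages are passed in as callables, and all names (`similarity`, `has_associate`, `vtg_validate`) are placeholders.

```python
# Hypothetical per-frame sketch of the AO-OCSORT flow in Figure 2.
# The stage implementations are injected; none of these names are the authors' API.
def ao_ocsort_step(tracks, dets, similarity, has_associate, vtg_validate):
    """similarity: TC + PFM cost matrix; has_associate: hierarchical matching;
    vtg_validate: virtual-trajectory gate for unmatched tracks."""
    cost = similarity(tracks, dets)                    # temporal-physical metric
    matches, un_tracks, un_dets = has_associate(tracks, dets, cost)
    for ti, di in matches:
        tracks[ti].update(dets[di])                    # normal track update
    for ti in un_tracks:
        vtg_validate(tracks[ti])                       # maybe extend with a virtual box
    return [dets[di] for di in un_dets]                # seeds for new tracks
```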
Figure 3.
Structure of the temporal correlation module. The initial encoding of temporal information involves extracting temporal features from trajectory sequences. Motion decoding begins by updating trajectory states with these encoded features and the target’s last known position, utilizing the OS-SSM approach. A gating mechanism and FFN then calculate the similarity score between trajectories and current detections, using the updated temporal information and position.
Figure 4.
Structure of the temporal information encoding network. The encoding network, its bi-Mamba block, and the inner Mamba block are unfolded sequentially from left to right.
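A minimal sketch of a bidirectional Mamba block like the one unfolded in Figure 4, assuming the public `mamba_ssm` package (whose fused kernels require a CUDA device); the layer width and the sum-then-normalize fusion are illustrative assumptions, not the paper's exact design.

```python
# A sketch of a bi-Mamba block: one Mamba scans the trajectory sequence
# forward, a second scans it reversed, and the outputs are fused.
import torch
import torch.nn as nn
from mamba_ssm import Mamba  # real package; fused kernels need CUDA

class BiMamba(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.fwd = Mamba(d_model=dim)   # scans left-to-right
        self.bwd = Mamba(d_model=dim)   # scans the flipped sequence
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):               # x: (batch, seq_len, dim)
        y_fwd = self.fwd(x)
        y_bwd = self.bwd(torch.flip(x, dims=[1]))
        # Re-flip the backward pass, add a residual, and normalize.
        return self.norm(y_fwd + torch.flip(y_bwd, dims=[1]) + x)
```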
Figure 5.
Structure of the motion decoding network. The final positions of the input trajectory and the current detection box are positionally encoded and transformed into feature embeddings through two linear layers. The trajectory feature embedding is processed by the OS-SSM to forecast its temporal dynamics, yielding trajectory prediction features. A gating mechanism, governed by a GRU unit, adaptively adjusts the fusion ratio between the trajectory embedding and the prediction features. The fusion operation is marked with an * in the diagram. The fused features are passed through an FFN to estimate the temporal similarity score.
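The gated fusion described in Figure 5 can be sketched in a few lines of PyTorch. The embedding width and the use of `nn.GRUCell` as the gate controller are assumptions for illustration; the caption does not give the paper's layer sizes.

```python
# A minimal sketch of the GRU-gated fusion plus FFN scoring head in Figure 5.
import torch
import torch.nn as nn

class GatedFusionHead(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.gru = nn.GRUCell(dim, dim)    # gate controller (assumed form)
        self.gate = nn.Linear(dim, dim)
        self.ffn = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                 nn.Linear(dim, 1))

    def forward(self, traj_emb, pred_emb):
        # The GRU state, driven by the trajectory embedding against the
        # OS-SSM prediction, sets an adaptive per-channel fusion ratio.
        h = self.gru(traj_emb, pred_emb)
        g = torch.sigmoid(self.gate(h))
        fused = g * traj_emb + (1 - g) * pred_emb   # the * fusion in the diagram
        return self.ffn(fused)                      # temporal similarity score
```

For example, `GatedFusionHead()(torch.randn(4, 128), torch.randn(4, 128))` returns one similarity score per trajectory-detection pair in the batch.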
Figure 6.
Structure of physical feature modeling. ROI extraction is used to obtain image patches from objects within detection bounding boxes. The image is then converted to the HSV color space and processed with flame filtering and Gaussian denoising techniques. Contours are extracted using the Sobel operator, and corner detection is performed on these contours to compute sparse optical flow, which models the physical characteristics.
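The pipeline in Figure 6 maps directly onto standard OpenCV calls. The sketch below follows the caption's steps (HSV conversion, flame filtering, Gaussian denoising, Sobel contours, corner detection, sparse optical flow); the HSV flame range and all thresholds are assumed values, not the paper's.

```python
# A minimal sketch of the physical-feature pipeline in Figure 6.
import cv2
import numpy as np

def flame_flow(prev_patch, curr_patch):
    """Sparse optical flow of flame pixels between two ROI patches (BGR)."""
    # HSV conversion and colour filtering keep only flame-like pixels.
    hsv = cv2.cvtColor(prev_patch, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, (0, 80, 150), (50, 255, 255))  # assumed flame range
    mask = cv2.GaussianBlur(mask, (5, 5), 0)               # Gaussian denoising

    # Contour map from the Sobel operator on the filtered mask.
    gx = cv2.Sobel(mask, cv2.CV_32F, 1, 0, ksize=3)
    gy = cv2.Sobel(mask, cv2.CV_32F, 0, 1, ksize=3)
    edges = cv2.convertScaleAbs(cv2.magnitude(gx, gy))

    # Corners on the contour map seed the sparse Lucas-Kanade flow.
    pts = cv2.goodFeaturesToTrack(edges, maxCorners=50,
                                  qualityLevel=0.01, minDistance=5)
    if pts is None:
        return np.empty((0, 2), np.float32)

    prev_gray = cv2.cvtColor(prev_patch, cv2.COLOR_BGR2GRAY)
    curr_gray = cv2.cvtColor(curr_patch, cv2.COLOR_BGR2GRAY)
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, pts, None)
    ok = status.flatten() == 1
    return (nxt - pts).reshape(-1, 2)[ok]                  # per-corner motion
```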
Figure 7.
Structure of the hierarchical association strategy. The HAS categorizes data associations into high-score and low-score groups. Scene classification segregates simple trajectories and high-score detections for prioritized matching, using only the IoU cost matrix to improve processing speed. For complex trajectories and low-score detections, a multi-level matching strategy is implemented to leverage additional spatiotemporal cues.
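The fast path of the HAS, matching high-score detections on an IoU cost matrix alone, can be sketched with NumPy and SciPy's Hungarian solver. The 0.6 score split and 0.3 IoU gate below are illustrative assumptions, not the paper's settings.

```python
# A minimal sketch of the two-tier matching in Figure 7 (first tier only).
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou_matrix(tracks, dets):
    """Pairwise IoU between track and detection boxes in (x1, y1, x2, y2)."""
    t = np.asarray(tracks, np.float32)[:, None, :]     # (T, 1, 4)
    d = np.asarray(dets, np.float32)[None, :, :]       # (1, D, 4)
    lt = np.maximum(t[..., :2], d[..., :2])
    rb = np.minimum(t[..., 2:], d[..., 2:])
    inter = np.prod(np.clip(rb - lt, 0, None), axis=-1)
    union = (np.prod(t[..., 2:] - t[..., :2], axis=-1)
             + np.prod(d[..., 2:] - d[..., :2], axis=-1) - inter)
    return inter / (union + 1e-9)

def first_tier_match(tracks, dets, scores, score_thr=0.6, iou_gate=0.3):
    """High-score detections are matched on IoU alone (the fast path);
    leftovers would go to the multi-level second tier with extra cues."""
    high = [i for i, s in enumerate(scores) if s >= score_thr]
    if len(tracks) == 0 or not high:
        return []
    iou = iou_matrix(tracks, [dets[i] for i in high])
    rows, cols = linear_sum_assignment(-iou)           # maximise total IoU
    return [(r, high[c]) for r, c in zip(rows, cols) if iou[r, c] > iou_gate]
```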
Figure 8.
Structure of virtual trajectory generation. The VTG module validates the target position predicted by KF using the temporal and physical similarity provided by TC and PFM.
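A minimal sketch of the VTG gate in Figure 8: the Kalman-filter prediction is accepted as a virtual box only when the fused TC/PFM similarity supports it. The fusion weight and acceptance threshold are assumed values for illustration.

```python
# Hypothetical VTG validation gate; `track` is any list-like trajectory.
def validate_virtual_box(kf_box, track, temporal_sim, physical_sim,
                         weight=0.5, accept_thr=0.6):
    """Extend the trajectory with the KF prediction only if the fused
    temporal/physical similarity clears the acceptance threshold."""
    score = weight * temporal_sim + (1 - weight) * physical_sim
    if score >= accept_thr:
        track.append(kf_box)      # accept the prediction as a virtual box
        return True
    return False                  # leave the track unmatched this frame
```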
Figure 9.
Selection of scene classification threshold.
Figure 10.
Selection of dynamic filtering threshold.
Figure 11.
Visualization of tracking results.
Table 1.
Basic information about the FireMOT dataset.
| Datasets | MOT17 | DanceTrack | FireMOT |
|---|---|---|---|
| Number of videos | 14 | 100 | 60 |
| Average number of trajectories | 96 | 9 | 7 |
| Total number of trajectories | 1342 | 990 | 413 |
| Total number of frames | 11,235 | 105,855 | 60,959 |
| Frame rate (FPS) | 30 | 20 | 30 |
Table 2.
Training and testing environments.
| Name | Environment |
|---|---|
| CPU | AMD Ryzen 3990X @ 2.9 GHz |
| GPU | NVIDIA RTX 3090 @ 24 GB |
| RAM | 128 GB |
| PyCharm version | 2020.2.5 |
| Python version | 3.8 |
| PyTorch version | 2.0.1 |
| CUDA version | 11.7 |
| cuDNN version | 8.9.7 |
Table 3.
Selection of trajectory length. The optimal value is indicated in boldface.
| Trajectory Length | HOTA | MOTA | IDF1 | AssA |
|---|---|---|---|---|
| 3 | 72.9 | 72.7 | 81.6 | 77.1 |
| 5 | 73.8 | 72.7 | 82.3 | 79.4 |
| 7 | 73.4 | 72.6 | 81.6 | 78.2 |
| 10 | 71.7 | 71.7 | 79.6 | 74.6 |
| 20 | 70.0 | 69.3 | 78.4 | 71.1 |
Table 4.
Selection of encoder depth. The optimal value is indicated in boldface.
| Encoder Depth | HOTA | MOTA | IDF1 | AssA |
|---|---|---|---|---|
| 3 | 72.9 | 72.5 | 81.6 | 77.4 |
| 6 | 73.8 | 72.7 | 82.3 | 79.4 |
| 9 | 73.9 | 72.8 | 82.3 | 79.4 |
| 12 | 72.1 | 72.0 | 80.9 | 75.5 |
Table 5.
Selection of duplicate filtering threshold. The optimal value is indicated in boldface.
| Threshold | MOTA | FP | FN | IDSW |
|---|---|---|---|---|
| 0.4 | 73.8 | 5013 | 11,013 | 409 |
| 0.5 | 74.0 | 5237 | 10,599 | 371 |
| 0.6 | 74.3 | 5384 | 10,176 | 364 |
| 0.7 | 73.9 | 5498 | 10,117 | 380 |
| 0.8 | 73.7 | 5641 | 10,208 | 395 |
Table 6.
Results on the FireMOT dataset. The optimal value is indicated in boldface.
| Model | HOTA | MOTA | IDF1 | AssA | DetA |
|---|---|---|---|---|---|
| SORT [16] | 66.2 | 65.5 | 71.8 | 64.7 | 68.1 |
| DeepSORT [9] | 63.8 | 67.2 | 69.5 | 61.4 | 66.3 |
| FairMOT [40] | 58.9 | 62.4 | 63.2 | 54.1 | 64.0 |
| ByteTrack [10] | 68.6 | 68.1 | 74.6 | 69.2 | 68.4 |
| BoT-SORT [17] | 69.1 | 69.8 | 75.1 | 69.8 | 68.2 |
| OC-SORT [11] | 69.4 | 68.3 | 74.7 | 70.3 | 68.6 |
| MOTR [41] | 68.8 | 66.7 | 72.4 | 71.8 | 65.9 |
| DiffMOT [5] | 72.5 | 72.5 | 80.2 | 76.0 | 69.5 |
| TC-OCSORT | 73.8 | 72.7 | 82.3 | 79.4 | 68.9 |
| AO-OCSORT | 74.8 | 74.3 | 84.0 | 81.6 | 68.9 |
Table 7.
Results on the VisDrone dataset. The optimal value is indicated in boldface.
| Model | MOTA | IDF1 | FP | FN | IDSW |
|---|---|---|---|---|---|
| SORT [16] | 14.0 | 38.0 | 80,845 | 112,954 | 3629 |
| DeepSORT [9] | 25.3 | 40.2 | 65,890 | 115,436 | 2567 |
| ByteTrack [10] | 36.1 | 49.7 | 18,706 | 98,056 | 1247 |
| MOTR [41] | 22.8 | 41.4 | 28,407 | 147,937 | 959 |
| FairMOT [40] | 30.8 | 41.9 | 21,503 | 138,789 | 3007 |
| OC-SORT [11] | 39.6 | 50.4 | 14,631 | 123,513 | 986 |
| UAVMOT [42] | 36.1 | 51.0 | 27,983 | 115,925 | 7396 |
| FPUAV [43] | 34.3 | 45.0 | 30,741 | 113,409 | 2138 |
| TC-OCSORT | 40.9 | 52.1 | 12,874 | 120,273 | 943 |
| AO-OCSORT | 41.4 | 53.2 | 12,037 | 105,367 | 917 |
Table 8.
Results on the MOT17 dataset. The optimal value is indicated in boldface.
| Model | HOTA | MOTA | IDF1 | AssA | DetA |
|---|---|---|---|---|---|
| FairMOT [40] | 59.3 | 73.7 | 72.3 | 50.1 | 70.4 |
| ByteTrack [10] | 63.1 | 80.3 | 77.3 | 54.5 | 73.2 |
| OC-SORT [11] | 63.2 | 80.4 | 77.5 | 54.8 | 73.1 |
| CenterTrack [44] | 52.2 | 67.8 | 64.7 | 44.3 | 61.5 |
| TrackFormer [45] | 61.9 | 74.1 | 68.2 | 53.1 | 72.3 |
| MeMOTR [41] | 62.5 | 78.3 | 76.8 | 54.2 | 72.1 |
| TC-OCSORT | 63.4 | 80.7 | 78.6 | 55.3 | 73.1 |
| AO-OCSORT | 63.7 | 80.9 | 79.2 | 55.7 | 73.2 |
Table 9.
Results on the DanceTrack dataset. The optimal value is indicated in boldface.
| Model | HOTA | MOTA | IDF1 | AssA | DetA |
|---|---|---|---|---|---|
| FairMOT [40] | 39.7 | 82.2 | 40.8 | 23.8 | 66.7 |
| ByteTrack [10] | 47.7 | 89.6 | 53.9 | 32.1 | 71.0 |
| BoT-SORT [17] | 54.7 | 91.3 | 56.0 | 37.8 | 79.6 |
| OC-SORT [11] | 55.1 | 92.0 | 54.6 | 38.3 | 80.3 |
| CenterTrack [44] | 41.8 | 86.8 | 35.7 | 22.6 | 78.1 |
| MOTR [41] | 54.2 | 79.7 | 51.5 | 40.2 | 73.5 |
| DiffMOT [5] | 56.1 | 92.2 | 55.7 | 38.5 | 81.9 |
| TC-OCSORT | 57.3 | 92.1 | 57.5 | 40.3 | 81.5 |
| AO-OCSORT | 57.5 | 92.2 | 57.6 | 40.7 | 81.6 |
Table 10.
Impact of AO-OCSORT modules on model performance. The optimal value is indicated in boldface.
| Model | HOTA | MOTA | IDF1 | AssA |
|---|---|---|---|---|
| OC-SORT | 69.4 | 68.3 | 74.7 | 70.3 |
| TC-OCSORT | 73.8 | 72.7 | 82.3 | 79.4 |
| TC-OCSORT + PFM | 73.9 | 73.2 | 83.4 | 79.5 |
| TC-OCSORT + PFM + HAS | 74.5 | 73.9 | 83.7 | 80.8 |
| AO-OCSORT | 74.8 | 74.3 | 84.0 | 81.6 |
Table 11.
Impact of detector scale on model performance. The optimal value is indicated in boldface.
| Detector | HOTA | FPS |
|---|---|---|
| YOLOX-s | 65.6 | 32.3 |
| YOLOX-m | 69.3 | 30.1 |
| YOLOX-l | 73.7 | 28.1 |
| YOLOX-x | 73.9 | 20.9 |