Article

Enhancing Search and Rescue Missions with UAV Thermal Video Tracking

Department of Electronics, Information, and Bioengineering, Politecnico di Milano, Via Ponzio 34/5, 20133 Milan, Italy
*
Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(17), 3032; https://doi.org/10.3390/rs17173032
Submission received: 27 June 2025 / Revised: 17 August 2025 / Accepted: 29 August 2025 / Published: 1 September 2025
(This article belongs to the Section Earth Observation for Emergency Management)

Abstract

Wilderness Search and Rescue (WSAR) missions are time-critical emergency response operations that require locating a lost person within a short timeframe. Large forested terrains must be explored in challenging environments and adverse conditions. Unmanned Aerial Vehicles (UAVs) equipped with thermal cameras enable the efficient exploration of vast areas. However, manual analysis of the large amount of collected data is difficult, time-consuming, and prone to errors, increasing the risk of missing a person. This work proposes an object detection and tracking pipeline that automatically analyzes UAV thermal videos in real time to identify lost people in forest environments. The tracking module combines information from multiple viewpoints to suppress false alarms and focus responders’ efforts. In this moving-camera scenario, tracking performance is enhanced by introducing a motion compensation module based on known camera poses. Experimental results on the collected thermal video dataset demonstrate the effectiveness of the proposed tracking-based approach, which achieves a Precision of 90.3% and a Recall of 73.4%. On a dataset of UAV thermal images, the introduced camera alignment technique increases the Recall by 6.1%, with negligible computational overhead, reaching 35.2 FPS. The proposed approach, optimized for real-time video processing, has direct application in real-world WSAR missions to improve operational efficiency.

1. Introduction

Search and Rescue operations are among the most time-critical emergency response scenarios requiring fast and efficient localization of people reported missing. Traditional human-based search is challenging due to factors such as terrain inaccessibility, adverse weather conditions, and the vastness of the search area. The challenges increase for Wilderness Search and Rescue (WSAR) missions, which require the systematic exploration of large forest lands in a very short time.
The adoption of Unmanned Aerial Vehicles (UAVs) by rescue teams has become the operational standard in recent years. Camera-equipped drones address the challenges of WSAR missions, enabling the exploration of large areas even in adverse conditions while minimizing the risk for responders. Such advantages can significantly shorten the search times, increase the probability of success, and reduce the operational costs. Modern UAV platforms are equipped with thermal cameras, which are more effective than optical sensors for WSAR applications. Thermal imaging enables the detection of human body heat signatures even when the target is partially occluded by dense vegetation, as is typical of forest scenes, or during nighttime operations.
However, despite the benefits provided by UAVs in terms of terrain coverage, responders spend considerable effort manually inspecting the captured images, looking for signals of human presence that are difficult to discern. The manual review of UAV thermal images poses several challenges because lost people are often hidden by the vegetation, which results in heat signatures that do not resemble a typical human shape. Furthermore, the temperature of branches or tree crowns heated by the sun may be in the body temperature range, making the detection even more difficult for the human eye. Such factors, plus the mental fatigue of reviewing long WSAR videos, make the manual inspection time-consuming and error-prone, increasing the risk of missing a lost person [1]. Therefore, the efficiency bottleneck of WSAR operations has become the review of the large amount of data collected during a mission [2].
In the scientific literature, several works address the use of UAVs for WSAR operations, specifically the problem of automatic mission data analysis to minimize the risk of human errors [2]. The effectiveness of UAV applications in WSAR is examined in [1,3,4]. Such studies demonstrate an increase in operational efficiency and an improvement in target detection accuracy. Moreover, they also report that without software assistance, rescuers develop mental fatigue, which negatively impacts the mission outcome [1]. Some applications exploit RGB cameras to identify people concealed within the surrounding environment [5,6], while other studies use thermal sensors to increase the detection robustness in the presence of vegetation occlusion [7].
Before the advent of Deep Learning (DL) models, the research efforts adapted traditional Computer Vision (CV) techniques to recognize people from drone imagery. In [8], people are located in thermal images using an object detector based on Haar-like features [9]. Other works analyze the pixel spectral information of UAV images to identify color anomalies that may indicate the presence of a person [10,11]. The Reed-Xiaoli algorithm [12] is used to model the background color distribution and unusually colored pixels are highlighted to indicate the location of targets. Such methods can be effectively applied to WSAR imagery when objects of interest (e.g., clothes or backpacks) are characterized by colors significantly different from the background. However, they are not robust to occlusions and, without prior knowledge of the objects of interest, may generate many false alarms. Finally, some works enhance person detection accuracy by combining RGB and thermal images [13,14]. This approach leverages the strengths of both data sources but requires precise camera registration between the two imaging sensors.
Recent works employ modern DL-based object detector models to accurately localize people on RGB [5,15] or thermal [7,16] images. The work [15] aims at detecting injured people in post-avalanche scenarios using UAVs. A Convolutional Neural Network (CNN) processes optical aerial images to extract discriminative features that a Support Vector Machine classifies. In [5], an object detection model is trained to identify injured people in non-urban areas (e.g., roads, high and low grass, rocks). The study introduces a methodology for simulating various weather conditions to improve the robustness of the method in a broader range of scenarios.
The most critical challenge in WSAR scenarios is the presence of vegetation that masks the ground and reduces the visibility of lost people. The authors of [7,16] aim at suppressing occlusions and emphasizing objects on the ground by combining UAV thermal images from different perspectives using a technique called Airborne Optical Sectioning (AOS). Thermal images are aligned based on the known camera poses and pixels referring to the same ground point are averaged. The original version relied on CV-based pose estimation, which was too computationally expensive to enable real-time execution. In a later work [17], the authors demonstrated that aligning images using less accurate GPS poses, derived from on-board telemetry data, yields comparable performance. The images resulting from AOS integration are processed with an object detection model to localize lost people.
In [18], a tracking-based method is introduced to detect people from UAV thermal videos in mountain scenarios. The addressed situation is quite specific because the videos portray moving persons, while the UAV position remains almost constant. An object detection model processes each frame to identify potential people locations, which are tracked throughout the video. For each tracked object, the velocity is estimated, and low-speed objects are discarded since they are potential false detections.
Person localization approaches based on object detection algorithms tend to generate a high number of false alarms, especially in challenging scenarios with complex backgrounds. Such a large number of intermittent and sparse detections is distracting and time-consuming for rescue operators, who need to validate each location highlighted by the detection model. Many false detections can be effectively discarded by considering their temporal consistency across a sequence of frames. When processing individual frames, detectors are unable to differentiate between spurious false alarms and genuine objects, and the confidence score is not a reliable indicator, as incorrect detections may be assigned high confidence values. To address the problem of isolated false alarms, this study introduces an object tracking module that combines temporal information from consecutive frames to minimize the false alarm rate. The tracking algorithm matches predictions across individual frames to maintain object tracks throughout the video. Sparse and isolated false alarms are suppressed by identifying detections not consistently predicted by the underlying object detector. Persistently tracked objects are displayed to responders for manual confirmation of person identification.
The use of an object tracker for post-processing candidate detections is also useful for overcoming the problem of masking by vegetation. The degree of ground occlusion depends on the viewpoint. An object detector might fail to recognize a person in a highly occluded frame and thus might produce intermittent detections. The object tracker can follow an occluded object until a detection in a subsequent frame confirms its presence. The proposed approach processes UAV thermal recordings online and generates a list of subsequences with a high probability of human presence. Its output can be exploited to guide search operations in real time, thus improving mission efficiency. Our contributions can be summarized as follows:
  • We introduce an object detection and tracking pipeline for analyzing UAV thermal videos and identifying people lost in the wilderness. The object detector is trained to recognize targets in individual thermal frames characterized by the dense vegetation typical of forest environments. The tracking module processes sequences of frames, suppresses sparse false alarms, and maintains tracks of detected people while they are temporarily masked by vegetation.
  • We enhance object tracking performance by introducing a Camera Motion Compensation module that utilizes UAV telemetry data. This addition achieves better camera alignment compared to CV-based motion estimation algorithms, which struggle with low-texture thermal images. The Camera Motion Compensation introduces a minimal overhead with negligible impact on computational efficiency.
  • The proposed architecture is evaluated on a dataset of UAV thermal videos specifically constructed to represent realistic WSAR scenarios. The dataset includes 213 sequences, each lasting 5 to 30 seconds, for a total of more than 39 minutes of video footage. The performance is assessed at various framerates to identify a trade-off between detection accuracy and computational requirements. The dataset variant obtained by extracting thermal video frames at 15 Frames Per Second (FPS) contains 35,714 images. The resulting configuration supports real-time execution and can be used to guide search operations towards locations with a high probability of human presence.
The proposed pipeline improves operations in WSAR scenarios and provides first responders with an effective tool for real-time guidance of search missions. Experimental results on the collected UAV thermal video dataset demonstrate the effectiveness of the tracking-based person detection technique in minimizing false alarms by achieving a Precision of 90.3% and a Recall of 73.4%. Camera motion compensation improves tracking performance in UAV videos. The motion correction module, based on drone camera poses, addresses the challenges of CV-based motion estimation algorithms that fail on low-texture thermal images. Experimental results on a publicly available dataset of WSAR thermal images with UAV pose data demonstrate that the proposed camera alignment technique increases the Recall by 6.1%, achieving a value of 82.2%, with a negligible impact on computation time, which reaches 35.2 FPS (−0.4 FPS).
The rest of this paper is organized as follows: Section 2 describes the tracking approach, the proposed Camera Motion Compensation module, and the datasets used for the experiments; Section 3 presents the results and their impact on WSAR missions; Section 4 discusses the results from the perspective of real-world applications; and Section 5 draws the conclusions.

2. Materials and Methods

This work focuses on analyzing and comparing person detection methods employed during WSAR missions to guide real-time operations. To achieve this goal, the research work has been divided into two main phases. Section 2.1 presents a person detection and tracking approach for identifying lost people in WSAR environments from drone thermal videos. Section 2.2 proposes a motion compensation module that leverages known camera poses to improve tracking performance with minimal computational overhead.

2.1. Target Detection and Tracking Pipeline

Automatic search systems should be accurately calibrated to localize lost people in a wide range of challenging WSAR scenarios. When a target is missed, it becomes impossible to detect a lost person in a timely manner, an event that could have fatal consequences. Therefore, the cost of false negatives (i.e., missing a target) is very high. Conversely, a detection model that produces a very high number of false alarms may hamper the mission efficiency because operators must review each candidate detection. This process requires a significant amount of time and effort. If the number of false detections is large, most of this effort is wasted without any benefit to the mission outcome. In this work, the problem of false positives is addressed by post-processing the proposals of an object detector with a multi-object tracker (MOT), as shown in Figure 1.
An object tracker is a CV algorithm designed to detect, track, and maintain the identities of multiple objects across video sequences. An object detector initially identifies objects in each frame and assigns a confidence score. The tracker determines which detections in the current frame correspond to already tracked objects (tracks) from previous frames. The matching problem is solved based on estimated object motion [19,20], appearance [21,22], or a combination of both [23]. Appearance features are relevant for the re-identification of objects occluded for short periods. For each tracked object, the position in the next frame is predicted based on a motion model. At each frame, the tracker decides whether to initialize tracks for newly detected objects or to terminate tracks for objects that have remained undetected for several frames.
Modern object trackers can follow a large number of objects in the scene even when they are obstructed by occlusions or appear in crowded scenarios [24]. When the targets are few, as in WSAR scenes, tracking algorithms help filter spurious false detections that are not matched in consecutive video frames. The MOT algorithm processes proposals predicted by the object detector, suppresses sparse false positives, and returns a list of persistently tracked objects. These should correspond to actual people with high probability, but may also include other objects mistaken for people (e.g., rocks or bushes). Responders manually inspect only the filtered list of locations to confirm person identification and discard erroneous detections.
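The detect-then-track flow described above can be summarized with a short sketch. This is a minimal illustration under stated assumptions, not the actual implementation: the detector and tracker objects stand in for the YOLO model and the BoostTrack++ wrapper used in this work, and the persistence threshold is a hypothetical parameter.

```python
# Minimal sketch of the detect-then-track filtering described in Section 2.1.
# `detector` and `tracker` are placeholders for the YOLO model and the MOT
# algorithm; their interfaces and the persistence threshold are assumptions.

from collections import defaultdict

MIN_TRACK_LENGTH = 5  # frames a track must persist before it is reported (illustrative)

def process_video(frames, detector, tracker):
    """Return the IDs of tracks that persist long enough to be shown to responders."""
    track_lengths = defaultdict(int)
    reported = set()
    for frame in frames:
        detections = detector(frame)          # list of (bbox, confidence) proposals
        tracks = tracker.update(detections)   # list of (track_id, bbox) for this frame
        for track_id, _bbox in tracks:
            track_lengths[track_id] += 1
            # Sparse false alarms never accumulate enough matched frames,
            # so only persistently tracked objects reach the operators.
            if track_lengths[track_id] >= MIN_TRACK_LENGTH:
                reported.add(track_id)
    return reported
```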

2.1.1. Target Detection

The target detection model must operate in real time to enable online video analysis. Therefore, this work considers only single-stage object detectors due to their superior computational efficiency. The recent versions of the YOLO architecture have demonstrated strong detection performance while being significantly faster than traditional region-proposal networks [25]. Three state-of-the-art detectors with different designs are compared: YOLOv8 [26], YOLOv12 [27], and YOLOX [28]. YOLOv8 employs a decoupled head with an anchor-free prediction approach to reduce computational complexity by removing predefined anchor boxes. YOLOv12 introduces an area attention framework that applies local attention to small image patches. This approach reduces the computational complexity and achieves state-of-the-art latency–accuracy performance for real-time applications. YOLOX features a decoupled head and an anchor-free design that improves the YOLOv3 [29] architecture, built on the DarkNet backbone. The above-mentioned architectural differences influence detection performance, speed, and generalization across tasks.
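As an illustration of the single-frame detection step, the sketch below runs a pretrained single-stage detector on one thermal frame using the Ultralytics YOLOv8 interface. The weight and image file names are placeholders; only the 20% confidence threshold reflects the configuration described in Section 2.4.1, and the detectors used in this work are trained on thermal data rather than these generic weights.

```python
# Hedged sketch: single-frame inference with a YOLOv8 model via the Ultralytics API.
# File names are placeholders; thresholds and weights in the actual pipeline may differ.
import cv2
from ultralytics import YOLO

model = YOLO("yolov8s.pt")                  # pretrained weights (placeholder)
frame = cv2.imread("thermal_frame.png")     # one 640x512 thermal frame

results = model.predict(frame, conf=0.20, verbose=False)  # discard scores below 20%
for box in results[0].boxes:
    x1, y1, x2, y2 = box.xyxy[0].tolist()   # bounding box in pixel coordinates
    score = float(box.conf[0])              # detection confidence
    print(f"candidate person at ({x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f}), score {score:.2f}")
```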

2.1.2. Target Tracking

The multi-object tracker used in this work is BoostTrack++ (BT) [23], which, at the time of writing, is among the leading architectures on the MOT17 [30] and MOT20 [31] benchmarks. BT employs a tracking-by-detection approach that identifies objects in each frame of a video and matches their IDs across frames. BT has a modular design comprising a Camera Motion Compensation (CMC) module to account for camera movements, an appearance similarity module to track objects across occlusions, and two confidence-boosting mechanisms to recover low-confidence predictions.
The Content-Based Camera Motion Compensation (CB-CMC) module is a key component that estimates the camera movement between frames to compensate for tracking drift. The motion parameters are estimated based on the Enhanced Correlation Coefficient (ECC) maximization method [32]. ECC aligns two images by iteratively optimizing a parametric warp that maximizes the correlation coefficient between image intensities. The similarity measure is based on the zero-mean normalization of pixel intensity, which is invariant to brightness and contrast variations.
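For reference, ECC-based alignment of two consecutive grayscale frames can be performed with OpenCV as sketched below. The Euclidean motion model and the termination criteria are illustrative choices, not the exact settings used inside BT.

```python
# Sketch of ECC-based frame alignment, the mechanism behind the content-based CMC.
# Motion model and iteration settings are illustrative only.

import cv2
import numpy as np

def estimate_ecc_warp(prev_gray, curr_gray):
    """Estimate a Euclidean warp aligning the previous frame to the current one."""
    warp = np.eye(2, 3, dtype=np.float32)
    criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 50, 1e-4)
    try:
        _, warp = cv2.findTransformECC(prev_gray, curr_gray, warp,
                                       cv2.MOTION_EUCLIDEAN, criteria, None, 5)
    except cv2.error:
        # ECC may fail to converge on low-texture thermal frames (see Section 2.2);
        # in that case the identity warp is returned.
        pass
    return warp
```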
A common challenge in object tracking occurs when partially occluded objects identified with low confidence values are filtered out, breaking the tracking continuity. BT addresses this problem by introducing the Detecting Likely Objects (DLO) module, which recovers valid predictions by boosting their confidence scores prior to filtering. This approach recovers predictions that are used for the subsequent matching step. Some objects may remain undetected throughout a video sequence because they are partially occluded or positioned at the scene boundary. Meanwhile, false positives typically cluster around already-tracked objects. As a result, low-confidence detections that appear far from already tracked objects may represent new potential targets not yet discovered. BT introduces the Detecting Unlikely Objects (DUO) module to boost the confidence score of new objects characterized by locations statistically independent from the currently known tracks.

2.2. Pose-Based Camera Motion Compensation (PB-CMC)

The proposed approach for person detection in WSAR scenarios is based on UAV thermal imagery. Thermal cameras capture temperature variations of objects in the scene by measuring their emitted infrared radiation. Heat is conducted and radiated by objects, creating thermal gradients that result in images with blurred edges and lower texture details. Consequently, the CB-CMC method of BT, based on ECC maximization, is not effective for aligning thermal images that lack consistent patterns and have large regions of homogeneous intensity.
To enhance the robustness of object tracking on UAV thermal images, we aim to improve Camera Motion Compensation by exploiting the known camera pose of each frame. This differs from the original CB-CMC module of BT, which estimates camera motion based on visual content. Accurate camera poses can be obtained from Structure-from-Motion (SfM) techniques [33] or, in real-time applications, can be estimated from drone telemetry data. Commercial GPS modules equipped with Real-Time Kinematic (RTK) correction reliably achieve centimeter-level accuracy [34], delivering precise positioning information.
The mathematical relationship between 3D world points and their corresponding 2D projections onto the image plane is described by intrinsic and extrinsic camera parameters. The intrinsic matrix $K \in \mathbb{R}^{3 \times 3}$ encodes the internal characteristics of the camera and is defined as
$$K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix},$$
where $f_x$ and $f_y$ are the focal lengths along the $x$ and $y$ axes (in pixels), and $(c_x, c_y)$ is the principal point, which represents the center of projection in the image plane. This point corresponds to the intersection of the optical axis with the image plane.
The extrinsic parameters determine the camera pose in the world reference system. They are defined by the rotation matrix $R \in \mathbb{R}^{3 \times 3}$ and the translation vector $t \in \mathbb{R}^{3}$, which together form the extrinsic matrix $[R \,|\, t] \in \mathbb{R}^{3 \times 4}$. The extrinsic matrix transforms points from world coordinates to the camera coordinate system.
The complete camera projection matrix $P \in \mathbb{R}^{3 \times 4}$, which transforms 3D world coordinates into 2D coordinates on the image plane, is then computed as
$$P = K \, [R \,|\, t].$$
The proposed PB-CMC module exploits the camera pose of two consecutive frames to estimate the inter-frame motion in three steps: Backprojection of Principal Point, Projection onto the Current Frame, and Motion Vector Estimation.

2.2.1. Backprojection of Principal Point

For frame $F_{t-1}$ at time $t-1$, the principal point $(c_x, c_y)$ is backprojected into 3D world coordinates. Given the intrinsic matrix $K$ and the known extrinsic parameters $[R_{t-1} \,|\, t_{t-1}]$ at time $t-1$, a ray $\mathbf{r}_{t-1}$ is defined from the camera center through the principal point:
$$\mathbf{r}_{t-1} = R_{t-1}^{-1} K^{-1} \begin{bmatrix} c_x \\ c_y \\ 1 \end{bmatrix}.$$
The backprojection from a single image yields a ray in world coordinates. The ray direction is determined by the camera orientation (i.e., roll, pitch, yaw camera angles) embedded in the rotation matrix $R_{t-1}$. Therefore, the 3D point $X_{world}$ is calculated by defining a reference distance $d$ along the ray:
$$X_{world} = C_{t-1} + d \cdot \frac{\mathbf{r}_{t-1}}{\lVert \mathbf{r}_{t-1} \rVert},$$
where $C_{t-1} = -R_{t-1}^{-1} t_{t-1}$ represents the camera center in world coordinates, and $d$ is set based on the approximate drone altitude. For simplicity, the ground is approximated as a planar surface and the drone is assumed to maintain a constant flight altitude. In practical applications, the distance $d$ can be estimated from the drone altitude above ground level or using an onboard range finder sensor. In complex terrain scenarios, where the assumption of planar ground may be violated, the estimation of distance $d$ can be improved by intersecting the projected ray with a Digital Elevation Model (DEM) that models the local ground elevation. DEMs are commonly used in WSAR missions nowadays to support accurate drone localization during automatic flights.

2.2.2. Projection onto the Current Frame

Given the camera pose $[R_t \,|\, t_t]$ at time $t$, the 3D point $X_{world}$ is projected onto the current frame $F_t$:
$$\begin{bmatrix} x_t \\ y_t \\ \lambda \end{bmatrix} = K \, [R_t \,|\, t_t] \begin{bmatrix} X_{world} \\ 1 \end{bmatrix},$$
where $(x_t, y_t, \lambda)$ are the homogeneous coordinates of $X_{world}$ projected on the current frame $F_t$. Also in this step, the camera orientation is defined by the rotation matrix $R_t$. The homogeneous coordinates are converted to image coordinates $(u_t, v_t)$ by dividing by the third component $\lambda$:
$$\begin{bmatrix} u_t \\ v_t \end{bmatrix} = \begin{bmatrix} x_t / \lambda \\ y_t / \lambda \end{bmatrix}.$$

2.2.3. Motion Vector Estimation

The 2D motion vector $\mathbf{m}$ is calculated as the displacement between the principal point $(c_x, c_y)$ in the previous frame $F_{t-1}$ and its projected position $(u_t, v_t)$ in the current frame $F_t$:
$$\mathbf{m} = \begin{bmatrix} u_t - c_x \\ v_t - c_y \end{bmatrix}.$$
This vector represents the camera motion between the two consecutive frames, which is approximated by pure translational motion. BT uses the estimated motion vector $\mathbf{m}$ to align the position of tracks in the current frame $F_t$, compensating for camera movement.
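The three steps above reduce to a few lines of linear algebra. The following NumPy sketch assumes world-to-camera poses (i.e., $X_{cam} = R X_{world} + t$) and a known approximate camera-to-ground distance d; it illustrates the formulas above rather than reproducing the exact implementation.

```python
# Minimal NumPy sketch of the PB-CMC steps: backprojection of the principal point,
# reprojection onto the current frame, and motion vector estimation.
# Poses are assumed to be world-to-camera; `d` approximates the ground distance.

import numpy as np

def pose_based_motion_vector(K, R_prev, t_prev, R_curr, t_curr, d):
    cx, cy = K[0, 2], K[1, 2]                         # principal point

    # 1) Backproject the principal point of the previous frame into a world ray.
    ray = np.linalg.inv(R_prev) @ np.linalg.inv(K) @ np.array([cx, cy, 1.0])
    C_prev = -np.linalg.inv(R_prev) @ t_prev          # camera center in world coordinates
    X_world = C_prev + d * ray / np.linalg.norm(ray)  # ground point at distance d along the ray

    # 2) Project the 3D point onto the current frame.
    p = K @ (R_curr @ X_world + t_curr)               # homogeneous image coordinates
    u, v = p[0] / p[2], p[1] / p[2]

    # 3) The displacement of the principal point approximates the camera motion.
    return np.array([u - cx, v - cy])
```

The returned vector can then be added to the predicted track positions before the matching step, playing the role of the ECC-based warp in the original CB-CMC module.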

2.3. Datasets

For the assessment of the video object tracking method and the evaluation of the PB-CMC module, we utilize two datasets with complementary characteristics, both capturing WSAR scenarios as illustrated in Figure 2a,b. We collected a dataset of UAV thermal videos (presented in Section 2.3.1) to assess the components of the proposed tracking-based pipeline. Through video frame downsampling, we evaluate the correlation between framerate and detection accuracy to derive valuable insights for real-time applications. To assess the contribution of the PB-CMC module, we employ a publicly available dataset of drone thermal images (described in Section 2.3.2) which includes the camera pose for each image.

2.3.1. Drone Thermal Video (DTV) Dataset

The dataset used for training and evaluating the person identification models was collected with the support of a local SAR team. The collaboration with rescue operators ensured the realism of the simulated WSAR scenarios. Drones were used to collect videos in environments typically encountered during WSAR missions. During the simulations, people were instructed to change pose between videos (e.g., lying down, sitting, standing, walking) to increase the variability of the captured scenes. The drone payload combined both a radiometric thermal sensor and an optical camera. However, the proposed approach exploits only thermal imagery, which enables the detection of human heat traces even in the presence of dense vegetation. In WSAR scenarios, this aspect is crucial to increase the likelihood of finding targets, even when they appear completely occluded in optical images. Figure 3 shows a comparison between an optical image and the corresponding thermal image of the DTV Dataset.
The data collection campaigns cover common rescue scenarios in forest and mountainous environments with various degrees of vegetation density. The dataset also includes open field videos in which a person can be easily confused with surrounding bushes or rocks characterized by similar heat signatures. During data collection flights, the drone altitude was maintained between 40 m and 60 m. All thermal videos have been recorded with a resolution of 640 × 512 px and a framerate of 30 FPS. Figure 2a shows examples of frames extracted from the collected UAV thermal videos.
Positioning several targets within the same scene would have increased the number of training annotations, enabling the object detector to learn from multiple instances in the same image. However, typical WSAR scenes contain only a single person, which limits the number of training labels available in each video frame. The limit of one target per frame was therefore maintained during video collection to accurately represent realistic WSAR scenarios. The collected videos were manually annotated at the original framerate (30 FPS) by drawing a bounding box around each visible person.
To simulate a reduction in the number of perspectives that the detection model would process during an actual WSAR mission, the original videos are downsampled to multiple framerates. The number of ground viewpoints increases at higher framerates, and so do the chances of detecting a partially occluded target. However, in real-time applications, the amount of data that can be processed online is limited by the available computational resources. Therefore, to ensure reliable performance in realistic scenarios, this study investigates the impact of decreasing the video framerate on detection performance. Table 1 summarizes the characteristics of the five versions of the DTV Dataset generated by downsampling the original video sequences.
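For illustration, the lower-framerate dataset variants can be reproduced by keeping every N-th frame of the original 30 FPS recordings, as sketched below; the file path and the exact extraction tooling are assumptions.

```python
# Sketch of the frame downsampling used to build the lower-framerate dataset
# variants (e.g., 15 FPS from the original 30 FPS). The video path is a placeholder.

import cv2

def extract_frames(video_path, target_fps, source_fps=30):
    step = max(1, round(source_fps / target_fps))   # keep every `step`-th frame
    cap = cv2.VideoCapture(video_path)
    kept, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            kept.append(frame)
        index += 1
    cap.release()
    return kept
```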

2.3.2. AOS Dataset

As discussed in Section 1, few works in the literature address the problem of person detection from thermal images in WSAR scenarios. In [7], the authors propose Airborne Optical Sectioning (AOS), an approach to suppress vegetation occlusion and emphasize targets on the ground. Occlusion suppression is achieved by registering and averaging several thermal images captured with a drone flight that covers a square search area. Accurate camera poses are estimated using an SfM technique. The dataset used in the experiments is publicly available [35]. In a subsequent work [17], the authors combine a sequence of images collected along a linear flight path, instead of a 2D square area, and exploit GPS poses measured by an onboard sensor for image registration.
Our pipeline addresses the same problem formulated in [17], where the sequence of images captured along a linear flight resembles the frames of a video. However, in our experiments, we use the thermal images provided in the original AOS Dataset [35], because the later dataset [36], used in [17], does not contain all the original thermal images.
The AOS Dataset images have been collected during 12 flights in dense forest scenarios. To obtain linear flight paths, as indicated in [17], we resampled the images from each flight that covers a square area. Figure 2b shows examples of UAV thermal images from the AOS Dataset. The image capture occurs every 1 m, a much greater interval than the distance between frames in the DTV Dataset. The camera poses obtained with an SfM technique are provided for all images. To simulate GPS positioning inaccuracy, we perturbed the camera poses with synthetic noise: white Gaussian noise with a standard deviation of 10 cm was added to each camera position to introduce a shift from the original pose. Commercial GPS modules equipped with RTK correction achieve an accuracy of a few centimeters [34,37]. Therefore, the simulated positioning error exceeds the expected range of typical real-world applications. Each image in a linear flight path is treated as a video frame and processed by the video object tracker, which considers all sequences as separate videos. The dataset split is preserved from [17], where the training set comprises 5 flights, the validation set 2 flights, and the test set 4 flights.
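The pose perturbation can be reproduced with a few lines of NumPy, as sketched below; the random seed and array layout are illustrative, while the 10 cm standard deviation matches the default setting described above.

```python
# Sketch of the camera-position perturbation used to simulate GPS inaccuracy:
# zero-mean white Gaussian noise with a 10 cm standard deviation per axis.

import numpy as np

rng = np.random.default_rng(seed=0)  # seed chosen only for reproducibility of the sketch

def perturb_positions(camera_centers_m, sigma_m=0.10):
    """Add white Gaussian noise (in meters) to an (N, 3) array of camera positions."""
    noise = rng.normal(loc=0.0, scale=sigma_m, size=camera_centers_m.shape)
    return camera_centers_m + noise
```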

2.4. Evaluation Procedure

2.4.1. Training

The YOLO object detectors explored in this work are pre-trained on the COCO [38] dataset, a large-scale collection of natural images often used to train DL networks. The training procedure on the thermal datasets used in this study is divided into two phases: Transfer Learning and Fine Tuning. In the Transfer Learning phase, the model leverages the knowledge learned during pre-training and adapts the network to the task of person detection in thermal imagery. During this phase, only the new detection head is trained by exploiting the features extracted from the frozen pre-trained backbone. The model is optimized in the Fine Tuning phase by adapting the high-level representation extracted from the model backbone to the specific features of the person detection task. The network backbone is unfrozen, and the learned high-level features are refined alongside the newly initialized detection head.
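The two training phases can be summarized schematically as follows. This is a generic PyTorch-style sketch of the freeze/unfreeze logic, not the actual YOLO training code; the model.backbone attribute, learning rates, and epoch counts are assumptions.

```python
# Schematic two-phase training: Transfer Learning (frozen backbone) followed by
# Fine Tuning (full network). Hyperparameters and the `backbone` attribute are assumed.

import torch

def train_phase(model, loader, params, lr, epochs, loss_fn):
    optimizer = torch.optim.AdamW(params, lr=lr)
    model.train()
    for _ in range(epochs):
        for images, targets in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(images), targets)
            loss.backward()
            optimizer.step()

def train_two_phases(model, loader, loss_fn):
    # Phase 1 (Transfer Learning): freeze the pretrained backbone, train the new head.
    for p in model.backbone.parameters():
        p.requires_grad = False
    head_params = [p for p in model.parameters() if p.requires_grad]
    train_phase(model, loader, head_params, lr=1e-3, epochs=20, loss_fn=loss_fn)

    # Phase 2 (Fine Tuning): unfreeze the backbone and refine all weights at a lower rate.
    for p in model.backbone.parameters():
        p.requires_grad = True
    train_phase(model, loader, model.parameters(), lr=1e-4, epochs=50, loss_fn=loss_fn)
```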
The hyperparameters of all object detectors and the BT tracker are optimized on the validation set. Detections with confidence scores below 20% are discarded. Predictions are considered true positives if their Intersection over Union (IoU) with a bounding box in the ground truth exceeds 20%. The optimal model configuration is selected based on the highest F1 score, which represents the harmonic mean of Precision and Recall. This metric offers a comprehensive evaluation by accounting for both false positives and false negatives, thus balancing detection accuracy with the ability to identify all relevant objects. The optimal hyperparameter configuration is evaluated on the test set with the experimental results reported in Section 3. The computational efficiency of the detection models, measured in FPS, is assessed on the consumer-level NVIDIA GeForce RTX 2080 Ti GPU.
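The matching criterion and the selection metric can be made concrete with the short helper functions below; these are standard definitions, shown only to make the 20% IoU threshold and the F1 computation explicit.

```python
# Helpers for the evaluation criterion: a prediction is a true positive when its
# IoU with a ground-truth box exceeds 0.2; F1 is the harmonic mean of Precision and Recall.

def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2) in pixel coordinates."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def f1_score(tp, fp, fn):
    precision = tp / (tp + fp + 1e-9)
    recall = tp / (tp + fn + 1e-9)
    return 2 * precision * recall / (precision + recall + 1e-9)
```

A prediction with IoU above 0.2 against some ground-truth box counts as a true positive; unmatched predictions and unmatched ground-truth boxes contribute to the false positive and false negative counts, respectively.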

2.4.2. Video Object Tracking Evaluation

The DTV Dataset presented in Section 2.3.1 contains a heterogeneous set of scenarios with various terrains and vegetation occlusion intensities. To overcome the high contextual shift between scenes and provide a robust estimate of model performance, k-fold cross-validation, a popular technique used to assess the performance and generalization capability of DL models, has been adopted. The dataset is divided into k folds, and the model is trained iteratively on k 1 folds with the remaining fold used as the test set. The validation set is defined by randomly selecting 10 % of the training set images.
All videos captured during a data collection campaign, thus in the same zone, have similar textures and visual content. To avoid data leakage during model evaluation, which occurs when a model is evaluated on test images similar to the training images, the videos from the same campaign are assigned to the same fold.
At each k-fold iteration, the object detection model is evaluated on the test fold by computing standard object detection metrics. The overall model performance is obtained by averaging the metrics from all the folds. The average is weighted based on the number of images in each fold to account for size imbalance between folds.
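One possible way to implement the campaign-grouped split and the size-weighted averaging is sketched below using scikit-learn's GroupKFold; the evaluate_fold callback is a placeholder for the training and evaluation of a single fold.

```python
# Sketch of campaign-grouped k-fold cross-validation with size-weighted averaging.
# `evaluate_fold(train_idx, test_idx)` is assumed to train a model and return a metric.

import numpy as np
from sklearn.model_selection import GroupKFold

def grouped_cv_metric(image_paths, campaign_ids, evaluate_fold, k=5):
    gkf = GroupKFold(n_splits=k)
    metrics, fold_sizes = [], []
    indices = np.arange(len(image_paths))
    # Images from the same collection campaign share a group ID, so they always
    # land in the same fold and never leak between training and test sets.
    for train_idx, test_idx in gkf.split(indices, groups=campaign_ids):
        metrics.append(evaluate_fold(train_idx, test_idx))
        fold_sizes.append(len(test_idx))
    # Weight each fold by its number of test images to account for size imbalance.
    return np.average(metrics, weights=fold_sizes)
```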

3. Results

In this section, the tracking-based pipeline is evaluated from different perspectives. Section 3.1 reports the performance analysis comparing different object detectors, examining the influence of framerate, and assessing the contribution of the internal BT modules (CB-CMC, DLO, DUO). These experiments use the DTV Dataset, which offers high-frequency frames that enable framerate downsampling but does not provide camera poses. Section 3.2 focuses on the contribution of the PB-CMC module and evaluates the best model configuration on the AOS Dataset, which has a fixed, large inter-frame spatial distance but provides camera poses.

3.1. Evaluation of Object Detection and Tracking Results

Table 2 reports detection results achieved by all explored object detectors evaluated on the DTV Dataset. All the explored YOLO-based architectures implement a scale-up approach, offering model variants in different sizes. In the case of person detection from UAV thermal images, larger models are not beneficial and generalize similarly to smaller and more efficient versions. This finding contrasts with the results on the COCO benchmark dataset [38], where larger models improve performance compared to smaller variants. YOLOX outperforms the other YOLO-based models, achieving a Recall of 73.4% and a Precision above 90%, with an F1 of 80.2%. High Precision values indicate that most false positives are eliminated by the tracking step, drastically reducing the false alarm rate. YOLOv8, despite reaching a Recall similar to YOLOX (−1.2%), is limited by its lower Precision (−19.4% compared to YOLOX), indicating a higher number of false positives. YOLOv12 is the weakest model, with the worst Recall (−10% with respect to YOLOX) and low Precision (−15.7% with respect to YOLOX), resulting in the worst F1 score (67.9%). The weak performance may be caused by the Area Attention module introduced in YOLOv12. The inherently global and coarse-grained nature of the attention mechanism does not properly handle small objects, such as people in UAV imagery, which occupy a small region of the image. Moreover, the presence of irrelevant objects in the scene (e.g., tree branches, bushes) with textures and patterns resembling people introduces confusion. This leads the attention mechanism to focus on incorrect areas of the image, reducing target detection capability.
As shown in Table 2, the three YOLO versions achieve similarly good computational performance, indicated by high framerates on a consumer-level GPU. The most accurate model, YOLOX, can process frames at 29.3 FPS, enabling real-time person detection even at high video framerates.
Figure 4 and Figure 5 compare the behavior of a pure object detector (YOLOX) and of the proposed pipeline based on object tracking (YOLOX + BT) on two short video sequences. In Figure 4, BT correctly suppresses two isolated false positives, reducing the false alarm rate, while keeping track of the detected person.
Figure 5 illustrates a sequence where the person is consistently tracked by BT, despite intermittent predictions generated by the underlying detector.

Framerate Analysis and Contribution of BT Modules

Ablation studies have been conducted on the pipeline using YOLOX as the object detector to evaluate the influence of the BT tracker at different framerates as well as the contribution of its internal modules. To isolate the behavior of the tracker, the ablation experiment also evaluates a version of the pipeline with the CB-CMC module disabled.
Figure 6 illustrates the detection performance of BT with/without CB-CMC at various framerates and compares it with SORT [19], a popular baseline multi-object tracking method. SORT has no Camera Motion Compensation and uses a Kalman filter to estimate the position of tracked objects between frames based on a linear motion model.
BT without CB-CMC exhibits similar behavior to SORT at all framerates. Recall has a downward trend for lower framerates, which correspond to higher inter-frame distances. At high framerates and small inter-frame spatial distances, the linear motion assumption is a good approximation of the objects’ movement. This is shown in Figure 6, where, at framerates higher than 10 FPS, the effect of the CB-CMC module on Recall is negligible. A negative trend in Recall emerges as the spatial interval between frames increases (i.e., the framerate decreases), which leads to greater object motion between frames and makes the linearity hypothesis unsuitable. In such cases, the CB-CMC module effectively compensates for non-linear motions between consecutive frames, thus preserving comparable Recall even at reduced framerates down to 3 FPS. However, at 1 FPS, the excessive inter-frame distance causes the correlation coefficient maximization method of CB-CMC to struggle in the low-texture conditions typical of thermal images. Moreover, inaccuracies in frame alignment result in small object bounding boxes not being matched due to insufficient or absent overlap. The Precision, both with and without CB-CMC, remains consistent across all framerates, demonstrating BT’s ability to suppress false detections independently of the accuracy of motion compensation.
Table 3 shows the comparison between the most complete BT configuration (last row), other configurations obtained by removing the DLO and DUO components, and the baseline object detector without tracking.
Based on the F1 score, the best configuration is BT without the DLO and DUO modules (second row), which is the setting used in previous experiments (Table 2 and Figure 6). Introducing the tracking phase (second row) slightly reduces the Recall (−0.9%) compared to the baseline object detector (first row) because isolated correct detections may also be suppressed. This occurs in strongly occluded scenes where a person can be identified only from a small subset of frames, preventing consistent tracking. However, the main benefit is the suppression of isolated false positives, as illustrated by the increase in Precision (+2.9%). In real-world WSAR missions, this approach focuses responders’ attention on consistently tracked objects, eliminating distractions caused by sparse false alarms. The performance improvement provided by BT comes at a significant computational cost: the additional time required for the tracking phase reduces the processing framerate from 107.6 FPS to 29.3 FPS.
The introduction of the DLO module (third row) improves the Recall by +6.3% compared to the best BT setting without DLO and DUO modules (second row). DLO recovers predictions with a low confidence score. Results show that the DLO module is effective in improving tracking continuity in cases where, due to occlusions, YOLOX detects a target intermittently. However, Precision is significantly impacted (−15.1%), compared to the best BT setting (second row), because many false targets are tracked for longer periods. The DUO module recovers new predictions distant from active tracks. The introduction of this module (last row) further negatively impacts the Precision (−7.8%), compared to the BT setting with DLO enabled (third row), because many false detections predicted with a low confidence score become actively tracked (e.g., tree branches, rocks, bushes). In the original work [23], DLO and DUO modules recover low-confidence score detections to improve tracking performance in crowded scenes. In WSAR scenarios, these modules are ineffective because many isolated false detections should be suppressed rather than being promoted by the tracker.

3.2. Contribution of the Pose-Based Camera Motion Compensation

The best pipeline configuration identified through experiments on the DTV Dataset is the YOLOX detector combined with the BT tracker without the DLO and DUO modules. In this section, we consider such a configuration and replace the original CB-CMC module of BT with the proposed motion estimation based on camera poses (PB-CMC), described in Section 2.2. The YOLOX model is trained on the AOS Dataset, which offers camera poses. Table 4 reports the performance of the proposed PB-CMC version of BT, the original CB-CMC version, and the version of BT without camera compensation (No CMC). The effectiveness of the proposed camera alignment method is demonstrated by the increase in Recall (+6.8%), with comparable Precision, which leads to the highest F1 score (85.5%). The ECC-based CB-CMC performance is comparable to the results obtained without camera alignment, with a slight decrease in Recall (−0.7%). This suggests that, in some cases, the CB-CMC module incorrectly estimates camera movement, preventing a new object prediction from being associated with its corresponding track. As a result, valid but misaligned predictions are considered false positives and discarded.
As shown in Table 4, the proposed PB-CMC approach preserves the framerate of the case without motion correction (−0.4 FPS), enabling real-time analysis of thermal videos. In contrast, the CB-CMC approach has a significant impact on the speed of the tracking phase (−21.1 FPS), drastically reducing the framerate at which videos can be processed.
Figure 7 visualizes the image alignment results of PB-CMC and CB-CMC. Aligned images are shown with a yellow tint. Red objects are associated with the previous frame, while green objects are from the current frame. In Figure 7a, two images with strong texture and defined patterns are successfully aligned by both techniques. In Figure 7b, CB-CMC fails to align low-texture images because the underlying ECC maximization algorithm is not effective on large areas with homogeneous pixel intensities.

3.2.1. Assessment of Robustness to Pose Error

In a real-world scenario, the GPS precision may degrade due to signal interference or temporary loss of satellite visibility. The robustness of the proposed PB-CMC method has been assessed by simulating GPS errors and analyzing the performance trend. As stated in Section 2.3.2, by default, each camera pose has been perturbed by adding white Gaussian noise with a 10 cm standard deviation to simulate the typical accuracy of GPS positioning. Table 5 presents the results of the ablation study where the perturbation factor has been increased using standard deviations of 20 cm, 30 cm, and 50 cm. Performance remained stable for standard deviations up to 30 cm (third row), with only a slight decrease in Recall, despite camera position errors exceeding 1 m from the original pose. Increasing the pose error to an average displacement of 64 cm (fourth row) decreases the Recall by 8%. This experiment simulated extreme conditions with GPS pose errors up to 182 cm, which exceed the expected range of real-world scenarios [37]. Under these extreme conditions, performance degrades and becomes comparable to the results achieved with the default CB-CMC method (second row of Table 4).
The main advantage of PB-CMC is that the pose error range can be estimated since the expected GPS inaccuracy is a known parameter [34,37]. This does not apply to CB-CMC, as the extent of alignment inaccuracy cannot be predetermined and, in certain cases, can be exceedingly severe.

4. Discussion

In this work, a video object tracking approach for person detection in real-world WSAR environments has been presented. The detection performance of the proposed pipeline has been reported as a function of thermal video framerates. Experimental results demonstrate that an object tracker equipped with a CMC module achieves very good Precision and Recall values even at low framerates. The highest Recall value of 73.4% on the DTV Dataset is computed by evaluating detection models against all ground truth annotations. However, multiple bounding boxes depict different perspectives of the same person, with degrees of occlusion that vary due to drone movement. In a practical application, consistent detection from all viewpoints is not required to successfully identify a person; for instance, the ground may be strongly occluded by vegetation from some perspectives but visible from subsequent viewpoints. Therefore, Recall is a conservative metric for measuring person detection ability, since high values are achieved only by a model capable of identifying a person from most views. At the same time, this strategy enables model evaluation on a wider range of scenarios, including several degrees of occlusion.
Experimental results confirm the feasibility of performing real-time person detection on consumer-grade hardware. Online execution is fundamental in WSAR missions to guide operations and focus search efforts towards the most promising locations for finding a lost person. Traditional content-based camera alignment methods are computationally intensive and prone to errors in the challenging visual conditions of thermal images, which feature homogeneous regions and blurred edges. Table 4 shows that, when compared to the case without camera alignment, the image correlation maximization method of BT (CB-CMC) does not enhance performance. The proposed pose-based approach (PB-CMC) improves the performance by relying only on the estimated camera motion derived from onboard sensors, which is computed quickly and efficiently, and is not susceptible to challenging image content.
The tracking-based method has been evaluated on the AOS Dataset, which provides the camera pose for each thermal image. Comparison with the results of the AOS technique [17] on the same dataset is not straightforward. AOS combines all thermal images from an entire flight sequence to produce a single integrated image, which the person detection model processes. The method proposed in this work processes each frame individually and then combines detection proposals using the tracking module. Because the set of images processed by the two methods is substantially different, a fair comparison based on standard object detection metrics is not possible. The only valid benchmark is the number of people detected in the captured scenes. Both approaches detect all 26 people, proving equally effective on the test flights of the AOS Dataset.
However, in mountainous environments, GPS accuracy deteriorates due to obstructions, loss of satellite visibility, or signal reflection caused by proximity to cliffs or canyons [37,39]. In these conditions, AOS is vulnerable because, with inaccurate GPS poses, the integration process fails to align thermal images. The result is a deteriorated integrated image where tree occlusions are not effectively suppressed, thus harming person detection accuracy. In a subsequent work [16], the authors continuously apply their method as new images are captured. Such an approach increases the likelihood of finding a person because several integrated images are analyzed, rather than a single one. However, the issue of degraded GPS accuracy remains unsolved, as the imprecise reconstruction would impact all integrated images.
Experimental results in Section 3.2.1 indicate that the performance of the proposed PB-CMC approach remained consistent for GPS errors deviating up to 1 m from the actual position. This error range is much higher than the typical inaccuracy of RTK positioning in critical conditions [37]. The robustness of the proposed method derives from processing each captured viewpoint individually. Since the ground is variably occluded by vegetation, a person should be visible at least from a subset of frames, thus increasing the likelihood of detection. Even in the case of strongly unreliable GPS poses, detection capability remains unaffected because the positioning information is used only in the later tracking stage to align predictions on individual frames. The tracking phase combines detection results from multiple viewpoints to filter isolated false alarms and focus responders’ attention on consistently tracked objects.
As a final note, we underline that the performance improvement introduced by the tracker, presented in Section 3.1, constitutes a lower bound due to the relatively low proportion of empty sequences in the DTV Dataset. In real-world videos, sequences with no targets but with confounding objects represent the majority of captured frames. Therefore, the suppression of isolated and spurious false detections has a greater impact. This was verified by evaluating a three-minute video sequence of a WSAR scenario that is not included in the DTV Dataset used for model training. The object detector identifies the lost person in the scene while the tracker reduces false alarms by 5.9%.

5. Conclusions

This work presents a tracking-based method for identifying lost people in real-world WSAR missions from UAV thermal imagery. The proposed approach individually processes each video frame and generates a set of proposals for potential person locations. The downstream tracking module post-processes the predictions to filter sparse false alarms while following actual person detections. On the DTV Dataset, specifically collected to replicate realistic WSAR conditions, the optimal configuration achieves a Recall of 73.4% with a Precision of 90.3%.
According to the experimental results, correcting for camera motion significantly improves object tracking on UAV videos, especially at lower framerates. However, the accuracy of content-based pose estimation degrades on thermal images characterized by large homogeneous patches, blurred edges, and weak textures. To overcome the issue of thermal image alignment, this work proposes the PB-CMC module, which uses known camera poses to estimate camera motion between consecutive frames. In comparison with tracking without camera motion correction, experimental results on the AOS Dataset show a 6.1% increase in Recall with negligible computational overhead (−0.4 FPS). Future research will investigate fusion strategies to integrate the camera motion vectors estimated by both PB-CMC and CB-CMC to enhance tracking robustness in real-world applications. The generalization ability of the proposed method will be further evaluated by supplementing the current version of the DTV Dataset with a variety of new WSAR scenarios.
The proposed approach, based on a lightweight object detection model and a streamlined tracking pipeline, is optimized for processing UAV thermal videos in real time. The proposed DL model is intended to be deployed on a ground control station, consisting of a commercial laptop computer that responders can transport to the mission site. The real-time UAV thermal video is streamed directly to the ground station, which processes the incoming video sequence and displays the detection results to the rescuers. Future work will focus on developing the ground control station software along with the deployment of the proposed person detection pipeline. This method has direct application in real-world WSAR operations, with the potential to effectively assist rescue operators in locating lost people efficiently. The proposed pipeline would automatically analyze thermal videos from a UAV scanning the search area, showing rescuers only the locations with a high probability of finding a person.

Author Contributions

Conceptualization, L.M. and P.F.; methodology, L.M. and R.M.; software, R.M.; validation, L.M. and R.M.; formal analysis, P.F.; investigation, L.M. and R.M.; resources, L.M. and R.M.; data curation, L.M. and R.M.; writing—original draft preparation, L.M.; writing—review and editing, L.M., R.M., and P.F.; visualization, R.M.; supervision, P.F.; project administration, P.F.; funding acquisition, P.F. All authors have read and agreed to the published version of the manuscript.

Funding

This research was partially funded by the Italian Ministry of University and Research (MUR), under the National Recovery and Resilience Plan (NRRP), program “I.3.4 Borse PNRR transizioni digitali e ambientali” (PNRR grants for digital and environmental transition), and by the European Union (EU) under the NextGenerationEU project.

Data Availability Statement

The DTV Dataset will be published in the future and is available upon request. The AOS Dataset can be downloaded from Zenodo at this link: https://doi.org/10.5281/zenodo.4024677 (accessed on 30 August 2025).

Acknowledgments

This work was supported by AREU Lombardia and CNSAS Lombardia, who provided UAV videos of realistic Search and Rescue simulations used in the dataset. We acknowledge ISCRA for awarding this project access to the LEONARDO supercomputer, owned by the EuroHPC Joint Undertaking, hosted by CINECA (Italy).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AOS: Airborne Optical Sectioning
BT: BoostTrack++
CB-CMC: Content-Based Camera Motion Compensation
CNN: Convolutional Neural Network
CMC: Camera Motion Compensation
CV: Computer Vision
DEM: Digital Elevation Model
DL: Deep Learning
DLO: Detecting Likely Objects
DTV: Drone Thermal Video
DUO: Detecting Unlikely Objects
ECC: Enhanced Correlation Coefficient
FPS: Frames Per Second
IoU: Intersection over Union
MOT: Multi-Object Tracking
PB-CMC: Pose-Based Camera Motion Compensation
RTK: Real-Time Kinematic
SfM: Structure-from-Motion
UAV: Unmanned Aerial Vehicle
WSAR: Wilderness Search and Rescue

Figure 1. The architecture of the proposed tracking-based person detection pipeline. An object detector identifies object proposals on each frame of a UAV thermal video (green boxes). The position of tracked objects in the current frame is estimated by predicting their motion between previous and current frames (red boxes). Proposals on the current frame are associated with tracked objects from the previous frame. Proposals not associated with tracked objects (false positives) are suppressed (e.g., box #2). This process is repeated for each new frame acquisition. Objects consistently matched are tracked (blue boxes) and displayed to rescue operators.
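To make the caption concrete, the listing below sketches a minimal, SORT-style version of this tracking-by-detection loop in Python. It is an illustration written for this description rather than the BoostTrack++ implementation used in the paper; the `Track` class, the constant-velocity motion model, and the `min_hits`/`max_age` thresholds are simplified placeholders.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def box_iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (union + 1e-9)


class Track:
    """Minimal track state: last box, constant-velocity motion, hit/miss counters."""

    def __init__(self, box):
        self.box = np.asarray(box, dtype=float)
        self.velocity = np.zeros(4)
        self.hits, self.misses = 1, 0

    def predict(self):
        self.box = self.box + self.velocity  # red boxes: motion-predicted position

    def update(self, box):
        box = np.asarray(box, dtype=float)
        self.velocity = box - self.box
        self.box, self.hits, self.misses = box, self.hits + 1, 0


def track_video(frames, detector, iou_thr=0.2, min_hits=3, max_age=5):
    """detector(frame) must return a list of (x1, y1, x2, y2) proposals."""
    tracks, confirmed_per_frame = [], []
    for frame in frames:
        proposals = detector(frame)  # green boxes: per-frame object proposals
        for t in tracks:
            t.predict()
        # Associate proposals with predicted track boxes (Hungarian matching on -IoU).
        cost = np.array([[-box_iou(t.box, p) for p in proposals] for t in tracks])
        matches = []
        if cost.size:
            rows, cols = linear_sum_assignment(cost)
            matches = [(r, c) for r, c in zip(rows, cols) if -cost[r, c] >= iou_thr]
        matched_tracks = {r for r, _ in matches}
        matched_proposals = {c for _, c in matches}
        for r, c in matches:
            tracks[r].update(proposals[c])
        for i, t in enumerate(tracks):
            if i not in matched_tracks:
                t.misses += 1  # tolerate short detection gaps before dropping the track
        tracks = [t for t in tracks if t.misses <= max_age]
        # Unmatched proposals start tentative tracks; isolated false alarms never
        # accumulate min_hits matches and are therefore suppressed.
        for j, p in enumerate(proposals):
            if j not in matched_proposals:
                tracks.append(Track(p))
        # Blue boxes: only consistently matched objects are shown to operators.
        confirmed_per_frame.append([t.box.copy() for t in tracks if t.hits >= min_hits])
    return confirmed_per_frame
```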
Figure 2. Examples of thermal images from the two datasets. (a) Frames from the DTV Dataset described in Section 2.3.1. (b) Images from the AOS Dataset presented in Section 2.3.2. Targets are annotated with green bounding boxes.
Figure 3. Comparison of RGB and thermal images capturing the same scene. The person is visible in the thermal image while being completely masked in the RGB image. A zoomed view is provided in the bottom right corner.
Figure 4. Comparison of YOLOX predictions (top row) and BT-filtered detections (bottom row) on a video sequence. Light blue boxes are YOLOX detections, while yellow boxes are predictions after BT post-processing. The confidence score is reported for each prediction. BT suppresses the sparse false alarms in the upper part of the image.
Figure 5. Comparison of YOLOX predictions (top row) and BT-filtered detections (bottom row) on a video sequence. Light blue boxes are YOLOX detections while green boxes represent ground truth annotations where YOLOX missed the person. Yellow boxes are predictions after BT post-processing. The confidence score is reported for each prediction. BT tracks the person even with intermittent predictions by the underlying detector.
Figure 6. Comparison of BT with CB-CMC, BT without CB-CMC, and the SORT baseline in relation to inference framerate. Without the CB-CMC module, BT behaves similarly to SORT, highlighting the utility of motion compensation in UAV images.
Figure 7. Comparison of PB-CMC (ours) and CB-CMC (original) image alignment. (a) Both methods correctly align the two consecutive images. (b) CB-CMC cannot accurately align two thermal images with low texture and homogeneous pixel intensities.
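For context, content-based alignment of this kind can be implemented with OpenCV's ECC maximization on consecutive grayscale frames. The snippet below is a generic sketch with placeholder parameters, not the exact CB-CMC configuration evaluated here; it also reflects the failure mode of Figure 7b, where low-texture thermal frames can prevent convergence.

```python
import cv2
import numpy as np


def align_ecc(prev_gray, curr_gray):
    """Estimate a Euclidean warp mapping curr_gray onto prev_gray via ECC maximization."""
    warp = np.eye(2, 3, dtype=np.float32)  # start from the identity transform
    criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 50, 1e-5)
    try:
        _, warp = cv2.findTransformECC(prev_gray, curr_gray, warp,
                                       cv2.MOTION_EUCLIDEAN, criteria, None, 5)
    except cv2.error:
        # Low-texture, homogeneous thermal frames (as in Figure 7b) may prevent
        # convergence; falling back to the identity keeps the tracker running.
        pass
    return warp
```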
Table 1. Statistics of the DTV Dataset downsampled at various framerates. In all versions, about two-thirds of the frames contain a person.

Framerate | Frames  | Annotations | Empty Frames
1 FPS     | 2481    | 1647        | 834
3 FPS     | 7225    | 4880        | 2345
5 FPS     | 11,981  | 8103        | 3878
10 FPS    | 23,856  | 16,175      | 7681
15 FPS    | 35,714  | 24,228      | 11,486
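The lower-framerate versions can be obtained, for example, by keeping every k-th frame of the source recording. The snippet below is a generic OpenCV sketch of such subsampling, not the authors' preprocessing code; the source framerate is read from the file and the stride is rounded to the nearest integer.

```python
import cv2


def subsample(video_path, target_fps):
    """Keep roughly one frame every native_fps / target_fps frames."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS)
    stride = max(1, round(native_fps / target_fps))
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % stride == 0:
            frames.append(frame)
        idx += 1
    cap.release()
    return frames
```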
Table 2. Person detection results on the DTV Dataset at 10 FPS. Results obtained with IoU threshold of 20% and confidence threshold of 20%. YOLOX surpasses other models on detection performance metrics. The best result is highlighted in bold for each metric.

Model        | Precision | Recall | F1    | FPS
YOLOv8 + BT  | 70.9%     | 72.2%  | 71.1% | 31.5
YOLOv12 + BT | 74.6%     | 63.4%  | 67.9% | 28.3
YOLOX + BT   | 90.3%     | 73.4%  | 80.2% | 29.3
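As a reference for how such Precision, Recall, and F1 values can be computed, the sketch below discards detections with confidence below 0.2 and counts a prediction as a true positive when it overlaps an unmatched ground-truth box with IoU of at least 0.2. The greedy, score-ordered matching is an assumption for illustration and may differ from the exact evaluation protocol used here; box_iou() is the IoU helper defined in the sketch after Figure 1.

```python
def evaluate(preds_per_frame, gts_per_frame, iou_thr=0.2, conf_thr=0.2):
    """preds: dicts with 'box' and 'score'; gts: lists of ground-truth boxes."""
    tp = fp = fn = 0
    for preds, gts in zip(preds_per_frame, gts_per_frame):
        preds = [p for p in preds if p["score"] >= conf_thr]
        unmatched = list(range(len(gts)))
        for p in sorted(preds, key=lambda d: -d["score"]):
            # Match each prediction to the best remaining ground-truth box.
            best = max(unmatched, key=lambda i: box_iou(p["box"], gts[i]), default=None)
            if best is not None and box_iou(p["box"], gts[best]) >= iou_thr:
                tp += 1
                unmatched.remove(best)
            else:
                fp += 1
        fn += len(unmatched)
    precision = tp / (tp + fp + 1e-9)
    recall = tp / (tp + fn + 1e-9)
    f1 = 2 * precision * recall / (precision + recall + 1e-9)
    return precision, recall, f1
```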
Table 3. Ablation study on the DTV Dataset at 10 FPS. The baseline YOLOX is compared to the results obtained with different configurations of the BT tracker. Results are obtained with IoU threshold of 20% and confidence threshold of 20%. The best result is highlighted in bold for each metric.

Model | BT | DLO | DUO | Precision | Recall | F1    | FPS
YOLOX | -  | -   | -   | 87.4%     | 74.3%  | 79.7% | 107.6
YOLOX | ✓  | -   | -   | 90.3%     | 73.4%  | 80.2% | 29.3
YOLOX | ✓  | ✓   | -   | 75.2%     | 79.7%  | 77.2% | 28.7
YOLOX | ✓  | ✓   | ✓   | 67.4%     | 80.1%  | 72.9% | 28.3
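The precision/recall shift in the last rows is consistent with detection-boosting behaviour: low-confidence detections that agree with a predicted track, or that overlap no existing track, have their scores raised above the confidence threshold, recovering misses at the cost of additional false alarms. The sketch below conveys only this general idea; it is not BoostTrack++'s actual DLO/DUO formulation, and the boost value and thresholds are illustrative placeholders.

```python
def boost_scores(detections, predicted_track_boxes, iou_fn,
                 boost=0.3, support_thr=0.3, isolation_thr=0.1):
    """Raise the confidence of detections that are either supported by a
    predicted track or far from every track, so they survive the threshold.
    iou_fn can be any IoU helper, e.g., box_iou from the earlier sketch."""
    for det in detections:
        overlaps = [iou_fn(det["box"], tb) for tb in predicted_track_boxes]
        supported = bool(overlaps) and max(overlaps) >= support_thr
        isolated = not overlaps or max(overlaps) < isolation_thr
        if supported or isolated:
            det["score"] = min(1.0, det["score"] + boost)
    return detections
```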
Table 4. Comparison of the proposed CMC method (PB-CMC), the compensation module of BT (CB-CMC), and tracking without motion compensation (No CMC) on the AOS Dataset. Results obtained with IoU threshold of 20% and confidence threshold of 20%. The best result is highlighted in bold for each metric.

CMC Method        | Precision | Recall | F1    | FPS
PB-CMC (ours)     | 89.0%     | 82.2%  | 85.5% | 35.2
CB-CMC (original) | 89.1%     | 75.4%  | 81.7% | 14.5
No CMC            | 72.5%     | 76.1%  | 82.3% | 35.6
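The speed advantage of PB-CMC stems from replacing iterative image registration with a warp computed directly from known camera poses. As a hedged illustration, assuming a locally planar ground and the coordinate conventions stated in the docstring (which are our assumptions, not taken from the paper), the plane-induced homography between two frames can be written as follows.

```python
import numpy as np


def ground_plane_homography(K, R1, t1, R2, t2, n1, d1):
    """Homography mapping pixels of frame 1 to frame 2, induced by the plane
    n1^T X = d1 expressed in frame-1 camera coordinates (d1 > 0), under the
    world-to-camera convention x_cam = R_i @ X_world + t_i."""
    R = R2 @ R1.T              # relative rotation, frame 1 -> frame 2
    t = t2 - R @ t1            # relative translation, frame 1 -> frame 2
    H = K @ (R + np.outer(t, n1) / d1) @ np.linalg.inv(K)
    return H / H[2, 2]         # normalize so that H[2, 2] = 1


# A track box predicted in the previous frame can be warped into the current one
# by applying H to its corner points before the IoU-based association step.
```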
Table 5. Performance degradation trend as the camera pose error increases. Each camera pose is perturbed with white Gaussian noise of increasing standard deviation (σ). For each simulation, the average and maximum position errors from the original camera poses are reported.

Noise σ | Precision | Recall | F1    | Avg Error | Max Error
10 cm   | 89.0%     | 82.2%  | 85.5% | 12.8 cm   | 36.4 cm
20 cm   | 89.0%     | 81.4%  | 85.0% | 25.6 cm   | 72.9 cm
30 cm   | 89.0%     | 80.3%  | 84.5% | 38.4 cm   | 109.3 cm
50 cm   | 89.4%     | 74.2%  | 81.1% | 64.0 cm   | 182.2 cm
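The reported average and maximum errors scale linearly with σ, as expected, since the mean displacement of zero-mean Gaussian noise is proportional to its standard deviation. A perturbation of this kind can be reproduced with a few lines of NumPy; the helper below is a sketch written for this description, not the authors' simulation code, and it works for either two or three position axes.

```python
import numpy as np


def perturb_positions(positions, sigma, seed=0):
    """positions: (N, D) array of camera coordinates in metres (D = 2 or 3)."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, sigma, size=positions.shape)
    errors = np.linalg.norm(noise, axis=1)   # displacement of each perturbed pose
    return positions + noise, errors.mean(), errors.max()
```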
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
