3. Materials and Methods
This section presents the methodology for an aerial–ground cooperative system that detects and confirms hazardous objects in complex environments while minimizing human risk. The system uses two robotic platforms: a drone (UAV) for fast wide-area scanning and a ground robot (UGV) for close-range verification. The UAV runs a one-stage detector, YOLOv9, and the UGV runs a two-stage detector, Faster R-CNN; the pairing is designed to combine high throughput with precise confirmation.
3.1. System Architecture
UAV node. RGB camera, GPS (when available), IMU, optional barometer, onboard computer for real-time inference, wireless link to the UGV, and Wi-Fi or 5G when available.
UGV node. RGB camera, optional LiDAR and thermal sensor, GPS (when available), IMU, onboard computer for heavier inference and mapping, and a wireless link to the UAV and to the operator.
Operator console. Receives confirmed events and renders a live map with the latest detections.
Figure 3 summarizes the data flow. The UAV identifies hazards and sends their geo-referenced cues, the UGV navigates to verify them, and confirmed locations are forwarded to responders.
3.2. Perception Modules
UAV detector. YOLOv9 runs at real-time frame rates on the UAV video stream. We use an input resolution appropriate for the computation budget, a confidence threshold, and non-maximum suppression (NMS) with an IoU threshold; the specific values are selected on the validation set. For temporal stability, the system keeps a short history of detections and applies per-track smoothing.
UGV detector. Faster R-CNN processes close-range images with a confidence threshold and NMS IoU threshold tuned on the validation set. The UGV uses the same label set as the UAV to simplify handover. When a LiDAR or depth camera is present, depth is used to refine the bounding box to metric size before confirmation.
3.3. Geolocation and Handover
For each UAV detection, the system builds a cue message:
m = {“id”, “class”, “score”, “bbox”xyxy, t, “lat”, “lon”, “alt”, ψ, “cov”},
where t is the timestamp, ψ the UAV heading, and cov an uncertainty ellipse for the footprint on the ground. The uncertainty radius is estimated from camera intrinsics, altitude, and NMS variance, then expanded by a safety margin. The UAV transmits m to the UGV over the wireless link.
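As an illustration, the cue construction can be sketched as follows; the field names, the horizontal field of view, and the safety margin are assumptions for the example, not values fixed by the system:

```python
import math
import time
import uuid

def build_cue(cls, score, bbox_xyxy, lat, lon, alt, heading_deg,
              hfov_deg=78.0, margin=1.5):
    """Assemble a UAV cue message (sketch; field names illustrative).

    The ground-footprint uncertainty radius is approximated from the
    horizontal field of view and the flight altitude, then inflated by
    a safety margin, as described in the text.
    """
    # Half-width of the ground footprint seen by the camera at this altitude.
    half_footprint = alt * math.tan(math.radians(hfov_deg) / 2.0)
    # Fraction of the image width that the box occupies (normalized xyxy).
    box_frac = max(bbox_xyxy[2] - bbox_xyxy[0], 1e-3)
    radius = margin * half_footprint * box_frac
    return {
        "id": str(uuid.uuid4()),
        "class": cls,
        "score": score,
        "bbox": list(bbox_xyxy),
        "t": time.time(),
        "lat": lat, "lon": lon, "alt": alt,
        "heading": heading_deg,
        "cov": {"radius_m": radius},
    }
```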
Situational awareness. On reception, the UGV converts the cue to a local target region in ENU, east-north-up, or UTM, and plans a path to the region center. While approaching, it searches within the uncertainty ellipse, re-detects the object with Faster R-CNN, and aligns the new box with the UAV cue. When GPS is unavailable, the cue is expressed in a local/map frame (e.g., ENU/UTM anchored by SLAM) rather than latitude/longitude, and the UGV navigates to the cue region in that local frame.
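A minimal sketch of the geodetic-to-local conversion, using an equirectangular approximation that is adequate over cue-region distances of a few kilometres (the function name and reference-point handling are illustrative; a full geodetic library would replace this in deployment):

```python
import math

EARTH_R = 6_378_137.0  # WGS-84 equatorial radius, metres

def geodetic_to_enu(lat, lon, ref_lat, ref_lon):
    """Approximate east/north offsets (metres) of (lat, lon) relative to
    a local ENU reference point, using a small-area flat-Earth model."""
    d_lat = math.radians(lat - ref_lat)
    d_lon = math.radians(lon - ref_lon)
    east = EARTH_R * d_lon * math.cos(math.radians(ref_lat))
    north = EARTH_R * d_lat
    return east, north
```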
Confirmation rule. A hazard is confirmed if the class labels agree, the confidence exceeds a set threshold, and the spatial alignment (overlap between the UGV detection and the projected cue) exceeds a set bound for k consecutive frames. Otherwise, the cue is rejected and logged.
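The confirmation rule can be sketched as follows; the threshold values and the per-frame tuple layout are illustrative assumptions:

```python
def iou_xyxy(a, b):
    """IoU between two boxes in (x1, y1, x2, y2) order."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def confirm(cue_class, frames, conf_thr=0.8, iou_thr=0.5, k=3):
    """Confirm when class, confidence, and overlap criteria hold for
    k consecutive frames (threshold values here are illustrative).

    `frames` is a sequence of (det_class, score, det_box, cue_box).
    """
    streak = 0
    for det_class, score, det_box, cue_box in frames:
        ok = (det_class == cue_class and score >= conf_thr
              and iou_xyxy(det_box, cue_box) >= iou_thr)
        streak = streak + 1 if ok else 0
        if streak >= k:
            return True
    return False
```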
3.4. Navigation and Mapping on the UGV
The UGV performs navigation using a hybrid localization strategy. When GPS is available, GPS and IMU provide global guidance; when GPS is denied or unreliable, the UGV relies on onboard state estimation, using visual or visual-inertial SLAM (e.g., ORB-SLAM3 [40]) to maintain local consistency and to support accurate waypoint tracking toward the cue region. Global planning uses A* on a 2D cost map, and local obstacle avoidance uses a dynamic window approach or a model predictive controller. The planner respects geofences and maintains standoff distances when approaching suspected hazards.
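A compact sketch of A* global planning on a 2D cost map, assuming a 4-connected grid with per-cell traversal costs and obstacles encoded as None (the grid encoding is an assumption for the example):

```python
import heapq
import itertools

def astar(grid, start, goal):
    """A* over a 2D cost map: grid[r][c] >= 1 is the traversal cost of a
    cell, None marks an obstacle. 4-connected moves, Manhattan heuristic.
    Returns the path as a list of (row, col), or None if unreachable."""
    rows, cols = len(grid), len(grid[0])
    h = lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1])
    tie = itertools.count()            # tie-breaker keeps heap comparable
    open_set = [(h(start), next(tie), start, None)]
    came = {}                          # node -> parent (also: closed set)
    best = {start: 0}                  # cheapest known cost-to-come
    while open_set:
        _, _, node, parent = heapq.heappop(open_set)
        if node in came:
            continue
        came[node] = parent
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = came[node]
            return path[::-1]
        r, c = node
        g = best[node]
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] is not None:
                ng = g + grid[nr][nc]
                if (nr, nc) not in came and ng < best.get((nr, nc), float("inf")):
                    best[(nr, nc)] = ng
                    heapq.heappush(open_set, (ng + h((nr, nc)), next(tie),
                                              (nr, nc), node))
    return None
```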
3.5. Communication, Timing, and Fault Handling
Messages are time-stamped in the same clock domain using ROS time or NTP. If the link is unstable, the UAV buffers cues and retries transmission with exponential backoff, and the UGV acknowledges receipt and discards duplicates by id. Cues have a time to live to avoid chasing stale targets. If the UGV cannot reach the region, the system reports a “verify failed” status with the reason, for example, blocked path or cue expired.
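The buffering-and-retry logic can be sketched as follows; the TTL, retry count, and backoff base are illustrative defaults, and `send` stands in for the actual wireless transport:

```python
import time

class CueBuffer:
    """UAV-side cue queue with a time-to-live, duplicate discarding by id,
    and bounded retries (sketch; parameter values are illustrative)."""

    def __init__(self, ttl_s=60.0, base_delay=0.5, max_retries=5):
        self.ttl_s = ttl_s
        self.base_delay = base_delay      # backoff base, seconds
        self.max_retries = max_retries
        self.pending = {}                 # id -> (cue, created_at, attempts)
        self.acked = set()                # ids already acknowledged

    def add(self, cue):
        # Duplicates (already acknowledged or already queued) are ignored.
        if cue["id"] not in self.acked:
            self.pending.setdefault(cue["id"], (cue, time.time(), 0))

    def flush(self, send, now=None):
        """Attempt transmission of pending cues; drop stale or exhausted
        ones. The caller sleeps base_delay * 2**attempts between calls
        to realize exponential backoff."""
        now = time.time() if now is None else now
        for cid in list(self.pending):
            cue, created, attempts = self.pending[cid]
            if now - created > self.ttl_s or attempts >= self.max_retries:
                del self.pending[cid]     # stale cue: stop chasing it
                continue
            if send(cue):
                self.acked.add(cid)
                del self.pending[cid]
            else:
                self.pending[cid] = (cue, created, attempts + 1)
```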
3.6. Outputs to Responders
When a hazard is confirmed, the UGV publishes a record:
{“class”,“score”,“lat”,“lon”,“alt”,“time”,“evidence”},
where evidence includes the confirming image crop and, when available, a short video snippet. The console aggregates records on a live map and forwards them to emergency services for targeted intervention.
3.7. Algorithmic Summary
UAV scans the area, runs YOLOv9, and creates cue messages for each detection.
UGV receives a cue, converts it to a local target region, and plans a path.
UGV approaches the region, runs Faster R-CNN, and performs confirmation.
If confirmed, the UGV reports the validated GPS location to responders; if rejected, the cue is cleared and the UAV continues scanning.
This architecture exploits complementary strengths: the UAV provides fast coverage and low-latency cues, and the UGV provides accurate confirmation at close range. Later sections report per-stage performance and end-to-end benefits.
3.8. Data Pre-Processing
To reflect real operating conditions, we assembled a dataset from publicly available images on social media and news sites, covering a wide range of scenes, viewpoints, and illumination. Only content that could be legally reused was retained, sensitive metadata were removed, and faces or identifying marks were blurred when appropriate. This curation step reduces legal and ethical risk and improves the realism of the training data.
Because the dataset is collected from public media, a domain gap relative to onboard UAV/UGV sensors is expected (e.g., different optics and compression, motion blur, altitude-dependent scale, and adverse weather/lighting). We improve robustness through augmentations that emulate common aerial and ground capture artifacts (blur, noise, and illumination shifts). In addition, the proposed system uses a detect-then-confirm safeguard: UAV detections are treated as cues, and only events subsequently confirmed by the UGV are reported, which reduces false alarms under distribution shift. For deployment, additional validation and/or fine-tuning on platform-specific UAV/UGV data remains an important next step.
Classes and scope. The dataset contains eight hazard classes relevant to rescue missions: Chemical Spill, Crashed Vehicle, Destroyed Infrastructure, Drone, Fire, Injured Person, Landmine, and Pistol, as illustrated in
Figure 4. Sources were selected to capture varied scales, partial occlusions, clutter, and adverse weather or lighting, which are typical for UAV and UGV operations.
Annotation protocol. All images were manually annotated in LabelImg with tight bounding boxes and consistent class labels, as shown in
Figure 5. To improve label quality, annotators followed simple rules, for example, include the entire object, avoid background padding, and use one box per visible instance. A small random sample was cross-checked by a second annotator and corrected when Intersection over Union agreement fell below a set threshold. Annotations were saved in YOLO format for YOLOv9 training, then converted to COCO JSON for Faster R-CNN, with a single class dictionary to keep label IDs aligned across models.
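The YOLO-to-COCO box conversion mentioned above reduces to a simple coordinate change; this sketch assumes normalized YOLO boxes and pixel-space COCO output:

```python
def yolo_to_coco_bbox(yolo_box, img_w, img_h):
    """Convert one YOLO box (class, cx, cy, w, h, all normalized to [0, 1])
    to the COCO convention [x_min, y_min, width, height] in pixels."""
    _, cx, cy, w, h = yolo_box
    bw, bh = w * img_w, h * img_h
    return [cx * img_w - bw / 2.0, cy * img_h - bh / 2.0, bw, bh]
```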
Data splits and leakage control. The corpus was split into training and validation subsets with stratification by class to limit class-imbalance effects. Near-duplicates were identified using perceptual hashing and kept within the same split to prevent train-to-val leakage. When multiple images came from the same incident, the entire incident was assigned to a single split to avoid over-optimistic validation.
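The incident-level split can be sketched as follows, assuming each image is already mapped to an incident identifier (the near-duplicate grouping by perceptual hash is abstracted into that mapping):

```python
import random

def incident_split(images, val_frac=0.2, seed=0):
    """Assign whole incidents to a single split to avoid train-to-val
    leakage (sketch). `images` maps image id -> incident id."""
    incidents = {}
    for img_id, inc_id in images.items():
        incidents.setdefault(inc_id, []).append(img_id)
    inc_ids = sorted(incidents)
    random.Random(seed).shuffle(inc_ids)
    n_val = max(1, round(val_frac * len(inc_ids)))
    train = [i for inc in inc_ids[n_val:] for i in incidents[inc]]
    val = [i for inc in inc_ids[:n_val] for i in incidents[inc]]
    return train, val
```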
Normalization and resolution. Images were resized to the detector input size, and pixel values were normalized consistently across models. For the UAV detector, the input size balances small-object recall with onboard computation limits, while the UGV detector uses the same normalization but can process higher resolutions thanks to greater computation availability.
Augmentation. To improve robustness, we applied a light but effective augmentation set, horizontal flip with class-safe rules, random crop and scale within bounds that preserve object visibility, color jitter and brightness or contrast changes to mimic illumination shifts, motion-blur and Gaussian noise for aerial footage, and small rotations for camera tilt. Augmentations were applied with probabilities tuned to avoid unrealistic samples.
Class imbalance handling. Per-class counts were monitored during batching. When imbalance increased validation variance, we used balanced sampling for mini-batches and verified that per-class precision–recall curves remained stable.
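Balanced sampling can be realized with per-sample weights inversely proportional to class frequency, which a weighted mini-batch sampler then consumes; a minimal sketch:

```python
from collections import Counter

def balanced_weights(labels):
    """Per-sample weights inversely proportional to class frequency, so
    that each class contributes equal total weight to the sampler."""
    counts = Counter(labels)
    return [1.0 / counts[c] for c in labels]
```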
Outcome. This pipeline yields a dataset that supports both rapid aerial scanning with YOLOv9 and precise ground-level verification with Faster R-CNN, with consistent labels and compatible formats across the two detectors.
3.9. Model Training
To evaluate the aerial–ground collaborative pipeline, we trained two complementary detectors on the same annotated corpus so that label semantics and error modes are comparable across platforms. The UAV runs a one-stage model, YOLOv9, to enable rapid wide-area cueing, and the UGV runs a two-stage model, Faster R-CNN, to perform close-range verification with higher localization accuracy.
Although the UAV-side detector (YOLOv9) and the UGV-side verifier (Faster R-CNN) are trained using the same curated hazard dataset, they serve different operational roles and are therefore tuned and interpreted differently. YOLOv9 is used for rapid aerial cueing, where the system prioritizes low latency and high recall to ensure that potential hazards are not missed during coverage. In contrast, Faster R-CNN is used for ground-level verification, where the system prioritizes precision and tighter localization to confirm or reject UAV cues. Importantly, hazards may present different visual signatures from the air versus from the ground due to scale, perspective, occlusions, and background context. The proposed architecture mitigates this viewpoint shift by aligning each model with its corresponding sensing context (UAV for aerial cueing and UGV for ground verification) and by reporting performance in a way that reflects these distinct objectives rather than implying identical cross-view generalization.
3.9.1. Training Setup, Data and Hardware
The dataset was first labeled in YOLO format, then converted losslessly to COCO JSON for Faster R-CNN to keep a single class dictionary. We used a single validation split for model selection and hyperparameter tuning, and we report final metrics on the held-out test set. Training was performed for 25 epochs with a batch size of 16 on an NVIDIA Tesla T4. Mixed-precision training was enabled to reduce memory use and increase throughput. Random seeds were fixed and library versions were recorded to facilitate reproducibility.
3.9.2. UAV Model, YOLOv9
Images were resized to a reduced input size to satisfy the onboard latency and power envelope of the aerial platform, and to keep end-to-end perception within real-time limits. To mitigate the small-input penalty on small objects, we used light multi-scale sampling around the base size and standard augmentations: horizontal flip when class safe, random crop and scale that preserve object visibility, color jitter to simulate illumination changes, and light motion blur for aerial footage. The model was optimized with a modern optimizer and a warmup plus decaying schedule, and the confidence threshold and NMS IoU were selected on the validation set. During evaluation, we compute mAP@0.5 and mAP@[.5:.95], as well as the F1-score at the operating threshold chosen from the validation precision–recall curve. The trained checkpoint is exported in half precision for onboard inference.
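Selecting the operating threshold from validation detections can be sketched as follows; the input format (per-detection score and match flag) and the simplification that every ground-truth object is matched by some detection in the list are assumptions for the example:

```python
def pick_operating_threshold(scored):
    """Choose the confidence threshold maximizing F1 on validation
    detections (sketch). `scored` is a list of (score, is_true_positive);
    a candidate threshold t keeps detections with score >= t."""
    total_pos = sum(1 for _, tp in scored if tp)   # simplification: all GT matched
    best_t, best_f1 = 0.0, -1.0
    for t in sorted({s for s, _ in scored}):
        kept = [(s, tp) for s, tp in scored if s >= t]
        tp = sum(1 for _, x in kept if x)
        p = tp / len(kept) if kept else 0.0
        r = tp / total_pos if total_pos else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1
```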
3.9.3. UGV Model, Faster R-CNN
Faster R-CNN was trained on the same images using the COCO format. Input images were resized while preserving aspect ratio to the detector's default short side, and the backbone was initialized from ImageNet- or COCO-pretrained weights for faster convergence. We applied lighter geometric augmentation than on the UAV model to preserve object geometry at close range. Training used a momentum-based optimizer with a step or cosine schedule, and the confidence threshold and NMS IoU were tuned on the validation set. Although inference is slower than the one-stage model, the two-stage pipeline improves close-range localization and false-alarm rejection, which is well-suited to the verifier role of the UGV.
3.9.4. Evaluation Protocol and Operating Point
Both detectors are evaluated with the same label set and metrics, and we report class-wise and overall results. For deployment, the operating point for each model is chosen on the validation PR curve to balance precision and recall for the role it plays: higher recall for the UAV to avoid missed cues, and higher precision for the UGV to ensure reliable confirmation. Thresholds and NMS parameters used in deployment are listed with the results in
Section 4.
Notes. The choice of input size for YOLOv9 is driven by onboard constraints and real-time requirements. If additional computation becomes available, a simple ablation with larger inputs, for example, 320 or 640, can be added to quantify the speed–accuracy trade-off; the rest of the training recipe remains unchanged.
3.10. Performance Evaluation
We assess the system with standard detection metrics that reflect accuracy and speed in complex environments: precision (P), recall (R), F1-score, mean average precision (mAP), and inference speed (FPS). Unless stated otherwise, metrics are computed on the held-out set with the same label space for UAV and UGV models.
Precision. Precision measures how accurately the system reports true hazards while minimizing false alarms:
P = TP / (TP + FP),
where TP are true positives and FP are false positives. High precision yields reliable alerts and reduces unnecessary interventions.
Recall. Recall measures how many actual hazards are detected:
R = TP / (TP + FN),
where FN are false negatives. High recall is crucial in hazard detection to avoid missed threats.
F1-score. F1 balances precision and recall in a single value:
F1 = 2 · P · R / (P + R).
It is useful in UAV–UGV missions where both false alarms and missed detections matter.
Average Precision and mAP. Detections are matched to ground truth using Intersection over Union (IoU) between a predicted box B and a ground-truth box G:
IoU(B, G) = |B ∩ G| / |B ∪ G|.
For each class k, the Average Precision AP_k is the area under its precision–recall curve at a chosen IoU threshold. We report the mean over the K classes:
mAP = (1/K) Σ_k AP_k.
In this study, we emphasize mAP@0.5 because the UAV output is used for hazard cueing and bounded-region dispatch to the UGV in a detect-then-confirm workflow, where coarse localization is sufficient to trigger inspection; stricter IoU metrics primarily quantify localization tightness and are therefore discussed as an extension focused on close-range verification quality. mAP@0.5 follows the VOC convention, and mAP@[.5:.95] follows the COCO protocol and emphasizes precise localization.
Inference Speed (FPS). We measure runtime on N images and compute the mean per-image inference time t̄:
t̄ = (1/N) Σ_i t_i,  FPS = 1 / t̄.
Warm-up frames are excluded, and we report whether timing is model-only or end-to-end (pre- and post-processing included).
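The metrics above reduce to a few lines of arithmetic; a minimal sketch with boxes in (x1, y1, x2, y2) order:

```python
def detection_metrics(tp, fp, fn):
    """Precision, recall, and F1 from the TP/FP/FN counts defined above."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def iou(box_b, box_g):
    """IoU between a predicted box B and a ground-truth box G (xyxy)."""
    ix1, iy1 = max(box_b[0], box_g[0]), max(box_b[1], box_g[1])
    ix2, iy2 = min(box_b[2], box_g[2]), min(box_b[3], box_g[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
             + (box_g[2] - box_g[0]) * (box_g[3] - box_g[1]) - inter)
    return inter / union if union else 0.0
```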
4. Results
4.1. Detection Performance
We first report class-wise performance, then give an overall view of the two detectors in the collaborative setting.
4.1.1. Class-Wise Performance
We evaluate eight hazard categories, Chemical Spill, Crashed Vehicle, Destroyed Infrastructure, Drone, Fire, Injured Person, Landmine, and Pistol, using the same label set and split for both detectors. The precision–recall curves in
Figure 6 for YOLOv9 lie close to the upper envelope for most classes, which matches the very high per-class precision observed during validation, about 0.995. This behaviour is consistent with the UAV role, fast aerial cueing with reliable alerts. The exception is Destroyed Infrastructure, where precision drops to 0.787. The corresponding curve sits lower and bends earlier, indicating that precision degrades as recall is pushed up. Visual heterogeneity, large extents with ambiguous boundaries, and background patterns that resemble ruins likely contribute to this effect. Despite that difficulty, the global result remains strong, with mAP@0.5 = 0.969, which is close to the mean of the classwise scores and confirms that YOLOv9 generalizes well across the remaining categories.
The lower performance on Destroyed Infrastructure is mainly explained by visual ambiguity and strong intra-class heterogeneity rather than a single architectural limitation. Unlike compact object instances, this category often corresponds to extended and amorphous regions (rubble fields, partially collapsed structures, cracked walls, and debris piles) with weak or subjective boundaries. As a result, bounding-box matching becomes sensitive to small localization offsets and to annotation tightness, and background textures (concrete, bricks, dust, and vegetation) can resemble the target class. The most common failure modes observed are (i) background confusion leading to false positives on texture-rich clutter, (ii) boundary ambiguity causing oversized or partial boxes over extended ruins, (iii) missed detections when damage is subtle or fragmented, and (iv) scale effects for very large structures, especially in aerial views. These observations are consistent with the role separation: the proposal-and-refinement mechanism of Faster R-CNN tends to reduce part of the background confusion and improve box refinement in cluttered scenes, whereas YOLOv9 favors throughput and robust cueing.
Faster R-CNN shows a similar picture in
Figure 7, yet with a noticeably flatter set of curves at moderate to high recall. Precision remains near 0.995 for most classes and, importantly, the model recovers a significant portion of the lost accuracy on Destroyed Infrastructure, reaching 0.869. The proposal-then-refine mechanism helps in cluttered and texture-dominated scenes, where tighter localization and a second classification pass reduce background confusions. The overall mAP@0.5 rises to 0.979 on the same split, a modest but consistent gain that aligns with the qualitative shape of the curves, especially in the high-recall region.
Taken together, the figures substantiate the intended division of labor. From the air, YOLOv9 maintains very high precision over a broad recall range, which is suitable for rapid, wide-area scanning where missed cues are more costly than a small number of false alarms. On the ground, Faster R-CNN delivers tighter boxes and better discrimination in visually complex scenes, which stabilizes confirmation and lowers the residual false-positive rate. The persistent gap in Destroyed Infrastructure points to avenues that are straightforward to test without changing the overall design, for example, multi-scale sampling and tiling for very large structures on the UAV stream or adding a depth cue on the UGV to disambiguate boundaries. Reporting mAP@[.5:.95] alongside mAP@0.5 in
Section 3.9 will make the localization advantage of the two-stage model more explicit, since stricter IoU thresholds typically accentuate the gains visible in
Figure 7.
4.1.2. Global Performance
At the global level, the detectors show the expected complementarity. For the aerial role, YOLOv9 reaches mAP@0.5 = 0.969 on the validation set, which confirms that a single-pass model can sustain accurate, wide-area cueing. The F1–confidence curve in
Figure 8 exhibits a broad plateau and peaks at F1 = 0.95 for a confidence of 0.866. The breadth of this plateau indicates that the model is relatively insensitive to small threshold changes, an asset for real-time operation where illumination, altitude, and motion blur vary. The curve shapes by class also explain the outlier behavior seen earlier, and classes with strong F1 slopes maintain high scores across most thresholds, whereas Destroyed Infrastructure drops earlier as recall is pushed up.
The training dynamics in
Figure 9 support these results. Box, classification, and distribution focal losses decrease smoothly over 25 epochs, while precision, recall, and mAP@0.5 improve monotonically on both training and validation. The close tracking between the two splits suggests that the augmentation recipe and regularization are appropriate for the data volume; there is no sign of divergence or late-epoch overfitting. In practice, this translates into stable confidence scores at inference time and predictable behavior when the operating threshold is adjusted.
For the ground role, Faster R-CNN attains a higher mAP@0.5 = 0.979. Its F1–confidence curve in
Figure 10 peaks at F1 = 0.95 around 0.843, with a slightly steeper ascent and a tighter maximum than YOLOv9. This is consistent with proposal-based refinement producing better calibration near the high-precision regime, a desirable property for confirmation at close range, where false alarms are costlier than a small loss of recall. The higher global mAP aligns with the per-class gains reported for visually complex scenes, notably Destroyed Infrastructure, where tighter localization and a second classification pass reduce background confusion.
Taken together, the figures justify the operating roles. On the UAV, we can run YOLOv9 near the peak yet lean a little toward recall by selecting a threshold within the plateau around 0.866, so that potential hazards are rarely missed while precision remains high. On the UGV, we can use the Faster R-CNN operating point near 0.843 to prioritize precision and deliver reliable confirmations. Reporting mAP@[.5:.95] alongside mAP@0.5 in the next subsection will make the localization advantage of the two-stage model more explicit, since stricter IoU thresholds typically amplify the gap visible between the curves. If desired, a short calibration check, for example, reliability diagrams on the validation set, can verify that predicted confidences match empirical precision, which further simplifies threshold selection for both detectors.
These global results therefore corroborate the pipeline design, fast aerial scanning with YOLOv9 to generate geo-referenced cues, followed by precise ground-level verification with Faster R-CNN. The combination balances coverage and accuracy and supports timely interventions in complex, hazard-rich environments.
4.2. Inference Speed
Speed determines whether detections can be acted on in time-critical missions. On the Tesla T4 at the UAV input resolution, YOLOv9 processes 41.7 FPS, which corresponds to an average per-frame latency of ≈24 ms. This throughput comfortably supports 25–30 FPS UAV video streams with headroom for I/O, NMS, and telemetry, so aerial cueing remains responsive even when illumination or motion blur varies. Under the same conditions, Faster R-CNN reaches 1.72 FPS, a per-frame latency of ≈581 ms. The gap reflects the proposal-then-refine architecture and its heavier backbone, yet this rate is appropriate for close-range verification on a UGV, where platform motion is slower and brief dwell times are acceptable to prioritize precision.
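The FPS-to-latency conversion used here is a direct reciprocal; a one-line check:

```python
def latency_ms(fps):
    """Mean per-frame latency (milliseconds) implied by a throughput figure."""
    return 1000.0 / fps
```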
Viewed at the system level, the two speeds are complementary rather than conflicting. The UAV can continue scanning while the UGV verifies flagged regions, so overall time to confirmation is dominated by UGV transit and local sensing, not by aerial inference. In practice, operating YOLOv9 near its F1 plateau keeps recall high without sacrificing precision, while operating Faster R-CNN near its F1 peak yields reliable confirmations with few false alarms. Reporting whether timings include pre- and post-processing and providing the 5th-percentile FPS across long sequences will make the real-time budget explicit, but the values in
Figure 11 already show that the chosen pairing is well matched to fast scanning in the air and precise verification on the ground.
4.3. Complementarity of YOLOv9 and Faster R-CNN
The two detectors play different roles that align with their measured behavior. YOLOv9 delivers fast and stable aerial cueing, with mAP@0.5 = 0.969 and 41.7 FPS at the UAV input resolution, and an F1 peak of 0.95 around a confidence of 0.866. The F1–confidence curve shows a broad plateau, which means the operating threshold can favor recall without a sharp precision penalty. This is well-suited to wide-area scanning from the UAV, where the cost of a missed cue is high and latency must stay low.
Faster R-CNN provides stronger close-range confirmation on the UGV, mAP@0.5 = 0.979 and F1 = 0.95 near a confidence of 0.843, even though throughput is lower, 1.72 FPS. The proposal-then-refine stage improves localization and reduces background confusion in visually complex scenes, which explains the gain on Destroyed Infrastructure. In practice, the UGV exploits this precision at slower platform speeds, with short dwells acceptable to validate or reject a cue.
Together, the models form a coherent fast-scan and precise-verify loop. The UAV publishes geo-referenced cues with confidence and an uncertainty radius; the UGV navigates to the region, re-detects at close range, and confirms when class, confidence, and overlap meet the chosen criteria. At the system level, this pairing increases the fraction of cues that become confirmed events, reduces residual false positives before human handoff, and shortens the time to a reliable decision. Selecting UAV and UGV thresholds within their respective F1 optima preserves this balance, high recall in the air and high precision on the ground, which is exactly the trade-off required in search and rescue and other high-risk operations.
5. Discussion
The results show that the proposed UAV–UGV pipeline achieves the balance it was designed for: fast aerial cueing with YOLOv9 and precise ground confirmation with Faster R-CNN. At the operating points selected from the F1–confidence curves, both models reach an F1 of 0.95, yet they do so with very different runtime profiles. On a Tesla T4 at the UAV input resolution, YOLOv9 sustains 41.7 frames per second, about 24 milliseconds per frame, while Faster R-CNN runs at 1.72 frames per second, about 581 milliseconds per frame. This separation in throughput is not a drawback for the mission, since the UAV continues scanning while the UGV travels and verifies, and the end-to-end time to a reliable decision is dominated by ground navigation and local sensing rather than by aerial inference.
The precision–recall and F1–confidence curves support the role assignment. YOLOv9 shows a broad F1 plateau near its optimum, which means the UAV threshold can be set to favor recall without a sharp loss of precision, an advantage when illumination, altitude, or motion blur vary. Faster R-CNN exhibits a sharper peak near high precision, which fits the verifier role on the UGV where false alarms are costlier than a small loss in recall. The modest but consistent gain in mAP@0.5 from 0.969 to 0.979 is aligned with this behavior and is especially visible on visually complex scenes such as Destroyed Infrastructure, where proposal generation and refinement help disambiguate texture-rich backgrounds and large, ambiguous extents.
From an operational perspective, the pairing reduces operator load and shortens time to action. The UAV publishes geo-referenced cues with confidence and an uncertainty radius, which bounds the UGV search and limits wandering in cluttered areas. In practice, most cues convert quickly into either confirmed events or clean rejections, and the rate of residual false positives passed to human teams is lower than with a single detector. This is consistent with the system objective, to send fewer but higher quality alerts while maintaining wide-area coverage.
The analysis also reveals where additional gains are likely. The remaining weakness on Destroyed Infrastructure suggests that multi-scale sampling and optional tiling on the aerial stream would improve recall on very large structures, while depth cues on the UGV, either from stereo or LiDAR, would stabilize boundaries and reduce duplicate boxes after non-maximum suppression. Reporting mAP@[.5:.95] alongside mAP@0.5 would make improvements in localization quality explicit, since stricter IoU thresholds typically accentuate the advantage of the two-stage model at close range. A short calibration check with reliability diagrams would verify that predicted confidences track empirical precision, which simplifies threshold selection for both roles.
Stricter localization metrics such as mAP@[.5:.95] and AP75 would further differentiate one-stage and two-stage detectors by penalizing small box offsets. We therefore expect Faster R-CNN to show a more pronounced advantage under higher IoU thresholds due to proposal-driven refinement and tighter localization, while YOLOv9, optimized for throughput and robust cueing, would be more affected by stricter overlap requirements. A dedicated localization-focused evaluation (e.g., reporting mAP@[.5:.95] and AP75) will be addressed in follow-up work to quantify this effect explicitly on platform-representative data.
The Destroyed Infrastructure class highlights a practical limitation of box-based detection for region-like hazards. Performance is reduced primarily because the class spans highly variable appearances and lacks consistent object boundaries, which increases both false alarms (texture-driven confusions) and false negatives (subtle or partial damage). The improved scores of Faster R-CNN on this class are consistent with two-stage refinement producing tighter localization and stronger suppression of background clutter. Without changing the overall pipeline, this class is the most likely to benefit from targeted handling of scale and ambiguity (e.g., multi-scale inference/tiling for large extents and stronger hard-negative sampling), and, when available on the UGV, complementary cues such as depth can further stabilize boundary interpretation during verification.
There are limitations that inform future work. The dataset is built from public imagery, which brings realism, yet it may encode incident-specific biases and long-tail cases that remain under-represented. Although training curves indicate smooth convergence without overfitting, robustness to adverse conditions should be assessed more broadly, for example, night scenes, haze, rain, and heavy occlusion. Runtime numbers are measured on a Tesla T4 at the input sizes used here, which is appropriate for onboard constraints, yet different hardware or larger inputs will shift the speed–accuracy trade-off; a compact ablation over input sizes and backbones would document those effects. On the UGV side, the slower detector could be accelerated with pruning, quantization, and runtime compilation, or with lightweight two-stage variants that retain proposal refinement while reducing computation. Finally, multi-sensor fusion, for example, thermal, gas, or depth, and adaptive thresholding that accounts for context, like time of day or weather, are natural extensions that would improve robustness in field deployments.
Overall, the study demonstrates that combining a fast one-stage detector for aerial scanning with a precise two-stage detector for ground verification improves both speed and reliability in hazard-rich environments. The figures and metrics justify the design choice, high recall in the air with YOLOv9 and high precision on the ground with Faster R-CNN, and the discussion above indicates clear, practical paths to push performance further without changing the overall architecture.
7. Future Work
Future work could examine system-level behavior beyond detector scores, including the fraction of aerial cues confirmed by the ground robot, the fraction rejected as false alarms, and the time from first cue to confirmation, so that end-to-end performance can be compared across sites. It would also be valuable to report mAP@[.5:.95] and AP75, together with confidence-calibration analyses, to emphasize localization quality under stricter IoU thresholds and to align predicted scores with empirical precision.
Another promising direction is to accelerate ground-level verification. Structured pruning, quantization, layer fusion, device-specific compilation, and lighter two-stage backbones could be assessed to raise throughput on the UGV while preserving precision, with accuracy–latency Pareto curves reported on target hardware. Robustness may be improved through multi-sensor fusion, for example, LiDAR or stereo depth for boundary stabilization and thermal or gas sensing for low-light and chemical events, with feature- or decision-level fusion that remains stable when communications degrade.
Generalization could be strengthened by tiling and multi-scale sampling for very large structures, by targeted augmentation for clutter and occlusion, and by continual or federated learning that adapts models to new sites without centralizing sensitive data. Finally, controlled field trials in representative environments, including night, haze, rain, GPS-denied areas, and limited-bandwidth links, would help quantify operational constraints, operator workload, and the pipeline’s reliability under realistic deployment conditions.