SAHI-Tuned YOLOv5 for UAV Detection of TM-62 Anti-Tank Landmines: Small-Object, Occlusion-Robust, Real-Time Pipeline

Dodić, Dejan; Vujović, Vuk; Jovković, Srđan; Milutinović, Nikola; Trpkoski, Mitko

doi:10.3390/computers14100448

by Dejan Dodić ¹,*, Vuk Vujović ², Srđan Jovković ¹, Nikola Milutinović ¹ and Mitko Trpkoski ³

¹ Department of Information and Communication Technologies, The Academy of Applied Technical and Preschool Studies, 18000 Niš, Serbia

² Department of Advanced Information Technologies, Faculty of Business and Law, MB University, 11000 Belgrade, Serbia

³ Faculty of Information and Communication Technologies, University St. Kliment Ohridski, 7000 Bitola, North Macedonia

* Author to whom correspondence should be addressed.

Computers 2025, 14(10), 448; https://doi.org/10.3390/computers14100448

Submission received: 28 August 2025 / Revised: 15 October 2025 / Accepted: 16 October 2025 / Published: 21 October 2025

(This article belongs to the Special Issue Advanced Image Processing and Computer Vision (2nd Edition))

Abstract

Anti-tank landmines endanger post-conflict recovery. Detecting camouflaged TM-62 landmines in low-altitude unmanned aerial vehicle (UAV) imagery is challenging because targets occupy few pixels and are low-contrast and often occluded. We introduce a single-class anti-tank dataset and a YOLOv5 pipeline augmented with a SAHI-based small-object stage and Weighted Boxes Fusion. The evaluation combines COCO metrics with an operational operating point (score = 0.25; IoU = 0.50) and stratifies by object size and occlusion. On a held-out test partition representative of UAV acquisition, the baseline YOLOv5 attains mAP@0.50:0.95 = 0.553 and AP@0.50 = 0.851. With tuned SAHI (768 px tiles, 40% overlap) plus fusion, performance rises to mAP@0.50:0.95 = 0.685 and AP@0.50 = 0.935—ΔmAP = +0.132 (+23.9% rel.) and ΔAP@0.50 = +0.084 (+9.9% rel.). At the operating point, precision = 0.94 and recall = 0.89 (F1 = 0.914), implying a 58.4% reduction in missed detections versus a non-optimized SAHI baseline and a +14.3 AP@0.50 gain on the small/occluded subset. Ablations attribute gains to tile size, overlap, and fusion, which boost recall on low-pixel, occluded landmines without inflating false positives. The pipeline sustains real-time UAV throughput and supports actionable triage for humanitarian demining, as well as motivating RGB–thermal fusion and cross-season/-domain adaptation.

Keywords:

anti-tank landmines; TM-62; UAV; SAHI; YOLOv5; small-object detection; aerial imagery; weighted boxes fusion; humanitarian demining; COCO metrics

1. Introduction

Anti-tank landmines, particularly TM-62, remain a major obstacle to safe post-conflict recovery, delaying land release, agriculture, and infrastructure repair [1]. Low-altitude UAV imaging offers scalable coverage at a modest cost, yet the visual signature of surface or shallow-buried landmines is subtle and easily masked by soil texture and vegetation (see Figure 1). Our goal in this paper is the reliable, real-time detection of TM-62 landmines in UAV and ground-level RGB imagery, with results being reported both by standard COCO metrics and an explicit operating point relevant to field triage [2].

Visual and statistical properties of the target class make this problem unusually demanding [3]. Landmines occupy only a few pixels at survey altitudes, frequently appearing under partial occlusion or camouflage, and are often recorded under variable illumination and viewing angles [4]. These factors decrease recall for single-scale detectors and interact with class imbalance, since positives are scarce relative to background patterns that resemble circular or metallic textures [5]. Practical workflows, therefore, require not only aggregate mAP but also a clearly stated operating point that trades precision and recall without inflating false alarms [6].

This research study extends our previous UAV-based UXO detection pipeline while keeping the training environment constant—identical hardware and software (Python 3.11, PyTorch 2.7.1, CUDA 12.9.1, cuDNN kept fixed across runs) are used to isolate methodological and data effects [2]. The present study introduces a new, single-class TM-62 dataset comprising 4289 high-resolution RGB images captured both from a UAV and from the ground, with 468 images reserved for validation and 233 for a held-out test set. In contrast to our earlier study [2], a different UAV platform and camera are employed, tuned to low-altitude mapping. Methodologically, we integrate YOLOv5 with Slicing Aided Hyper Inference (SAHI) tiling to expose few-pixel targets at a workable scale and Weighted Boxes Fusion to consolidate overlapping tile predictions [7,8,9,10,11].

Figure 2 summarizes the narrative of this paper. The upper blocks state the application problem and the visual challenges typical of UAV scenes. The center blocks mark the two contributions: a dedicated single-class TM-62 dataset and a detector that couples YOLOv5 with a small-object stage based on SAHI plus Weighted Boxes Fusion for stable mosaicking. The lower blocks preview the evaluation protocol—COCO mAP@0.50:0.95 and AP@0.50 together with an operating point at score = 0.25 and IoU = 0.50, stratified by object size and occlusion—and the intended impact: higher recall at controlled false-alarm rates with markedly fewer misses [12].

This study addresses gaps common in the literature: surface and camouflaged targets are underrepresented, results are often reported only as aggregate mAP, and operating points relevant to field decision making are seldom specified [13]. By combining the new dataset with a small-object-aware inference stage and rigorous reporting at a fixed operating point, we align evaluation with operational use [14]. On the held-out test partition, a baseline YOLOv5 configuration attains mAP@0.50:0.95 = 0.553 and AP@0.50 = 0.851 [15,16,17,18]. With tuned SAHI (768 px tiles, 40% overlap) and fusion, performance rises to 0.685 and 0.935 [19,20]. At the operating point, the pipeline achieves precision = 0.94 and recall = 0.89 (F1 = 0.914), implying a 58.4% reduction in missed detections relative to baseline YOLOv5 (no SAHI tiling) and a +14.3 AP@0.50 gain on the small/occluded subset [21,22].

To build intuition for why scale and occlusion control detector behavior, we accompany the textual argument with a compact visualization of decision boundaries [23].

Figure 3 provides a geometric view of class separability under varying model assumptions and effective scale, reinforcing the motivation for a small-object stage and for reporting performance at a fixed operating point [24].

Data curation, the SAHI-enhanced YOLOv5 with Weighted Boxes Fusion, and the evaluation protocol provide a coherent path from the visual challenges of the landmine-action domain to measurable gains in recall without unacceptable growth in false positives.

Recent operational reports describe pipelines that process drone imagery for landmine/UXO detection (RGB and multisensor). In parallel, optical–magnetometric UAV surveys (e.g., fluxgate sweeps) are used to capture subsurface anomalies. These system-level accounts are orthogonal to our technical focus (SAHI + WBF), but they motivate our OP-based evaluation under a precision constraint and clarify scope: our RGB-only pipeline targets small, partially occluded surface or shallow-buried signatures where visual cues exist, whereas magnetometric modalities address deeper or non-visible cases. We therefore position our results as complementary to optical–magnetometric pipelines and discuss generalization and fusion prospects in Section 5.7 and Section 5.8.

The manuscript follows the journal format: Introduction; Related Work/SoTA (small-object aerial detectors, including YOLOv7/YOLOv8/RT-DETR families, and optical–magnetometric pipelines and datasets); Materials and Methods (dataset, acquisition geometry, model/training, SAHI/WBF, evaluation protocol); Results (COCO and OP metrics, ablations, sensitivity); Discussion (operational interpretation, robustness, throughput); Limitations and Outlook (domain shift, RGB–thermal fusion, multi-backbone benchmarking plan); Data/Code Availability and Ethics; and Conclusions [25].

2. Objectives and Operational Design Principles

This section formalizes the research objectives and the system-level criteria that guide the design and evaluation of the TM-62 landmine detector on UAV RGB imagery [26,27]. All objectives are quantified at a fixed operating point (OP)—score = 0.25 and IoU = 0.50—to reflect realistic field triage [28,29].

The diagram in Figure 4 summarizes the workflow and targets: C1 (dataset with size/occlusion tags), C2 (raise mAP@0.50:0.95 and AP@0.50), C3 (raise recall while maintaining high precision at OP), C4 (stratified analysis by size and visibility), and C5 (sustain real-time throughput without increasing false alarms) [30].

2.1. O1 (C3)—Maximize Recall Under a Precision Constraint

In humanitarian clearance, reducing missed detections (FNs) is paramount; however, recall gains must not come at the expense of excessive false positives. We therefore optimize recall $R$ subject to a minimum precision $P_{\min}$ [31].

Feasible Region at the Operating Point

$$\max R = \frac{TP}{N_{gt}} \quad \text{s.t.} \quad P = \frac{TP}{TP + FP} \geq P_{\min}$$

From the constraint, one obtains the upper bound on admissible FP as a function of R:

$$FP_{\max}(R) = \frac{1 - P_{\min}}{P_{\min}} \, R \, N_{gt} \tag{1}$$

Equation (1) delineates the feasible $(R, FP)$ region; any operating point above $FP_{\max}(R)$ violates the precision constraint. For all OP-based calculations, we set $P_{\min} = 0.90$ and, to remain consistent with the FN bar chart, use $N_{gt} = 308$ at the OP, matching the observed drop in missed detections from 77 to 32 (−58.4%) [28,29]. The corresponding OP summary is reported in Table 1.

Relative to the baseline, SAHI + WBF increases recall from 0.75 to 0.89 (+18.7% relative; +0.14 absolute) and $F_1$ from 0.857 to 0.914 (+6.6%), while raising true positives from 231 to 276 (+45 detections). False negatives drop from 77 to 32, i.e., −58.4%, which is the key operational gain for humanitarian demining. Although false positives increase from 0.0 to 17.7, this point remains feasible under the precision constraint: by Equation (1), at $R = 0.89$ the limit is $FP_{\max} = \frac{1 - P_{\min}}{P_{\min}} R N_{gt} \approx 30.5$, so 17.7 < 30.5. The table, therefore, quantifies the trade-off visualized in Figure 5 and confirms that SAHI improves recall substantially without violating the minimum precision requirement [32,33,34,35,36].
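The feasibility check of Equation (1) can be reproduced with a few lines of arithmetic. The following is a minimal sketch using only the values quoted in the text ($P_{\min} = 0.90$, $N_{gt} = 308$, and the two reported operating points); the function names `fp_budget` and `is_feasible` are illustrative and not taken from the authors' code.

```python
def fp_budget(recall: float, n_gt: int, p_min: float) -> float:
    """Upper bound on false positives from Equation (1)."""
    return (1.0 - p_min) / p_min * recall * n_gt


def is_feasible(recall: float, fp: float, n_gt: int, p_min: float) -> bool:
    """True if the (recall, FP) pair respects the precision floor."""
    return fp <= fp_budget(recall, n_gt, p_min)


N_GT, P_MIN = 308, 0.90

# Operating points as reported in the text: (name, recall, false positives).
operating_points = [("baseline", 0.75, 0.0), ("SAHI + WBF", 0.89, 17.7)]

for name, recall, fp in operating_points:
    budget = fp_budget(recall, N_GT, P_MIN)
    status = "feasible" if is_feasible(recall, fp, N_GT, P_MIN) else "infeasible"
    print(f"{name}: FP = {fp:.1f}, budget = {budget:.1f} -> {status}")
```

Both reported points fall below their budgets, in line with the 17.7 < 30.5 comparison above.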

The shaded region shows the admissible $(R, FP)$ pairs for $P \geq 0.90$ with $N_{gt} = 308$. Both operating points lie inside the feasible set; SAHI + WBF shifts the point towards higher recall with a modest increase in FP. The legend reports P, R, $F_1$, and FN for each method; the title reiterates that FNs decrease by 58.4% at the OP [33,34,35,36].

2.2. C5—Sustain Real-Time Throughput with SAHI Tiling

SAHI exposes small objects to the detector but increases the number of tiles per image. The system must, therefore, maintain an acceptable images-per-second (IPS) rate while preserving detection quality [37,38,39].

2.3. Tiles per Image and Effective Throughput

For image size $W \times H$, tile size $T$, and overlap $\alpha$ (e.g., 40% ⇒ $\alpha = 0.4$), the stride is $s = T(1 - \alpha)$, and for the number of tiles, we use Equation (2):

$$n_W = \left\lceil \frac{W - T}{s} \right\rceil + 1, \qquad n_H = \left\lceil \frac{H - T}{s} \right\rceil + 1, \qquad n_{tiles} = n_W \, n_H$$

Given a tile processing rate $v_{tiles}$ (tiles/s), the image throughput is

$$v_{img} = \frac{v_{tiles}}{n_{tiles}} \tag{2}$$

Increasing $T$ reduces $n_{tiles}$ and raises $v_{img}$, but excessively large tiles may harm small-object recall. In our setup (Mavic 3, $W = 5280$, $H = 3956$, $\alpha = 0.40$, $v_{tiles} = 120$ tiles/s), $T = 768$ px strikes a good balance [39,40].

As summarized in Table 2, at 40% overlap and a measured tile processing rate of 120 tiles/s, the number of tiles per image and the resulting images-per-second follow Equation (2) as a function of tile size T (see also Figure 6).

Increasing the tile size T reduces the number of tiles per image from 221 (512 px) to 48 (1024 px), which raises the effective throughput from 0.54 to 2.50 img/s at the same tile processing rate. However, excessively large tiles can under-sample small targets and erode small-object recall; in our setup, T = 768 px yields a balanced operating point with 88 tiles/image and 1.36 img/s (see Figure 6), aligning with the accuracy gains reported for SAHI at this scale. This table operationalizes Equation (2), allowing practitioners to select T that satisfies both real-time constraints and detection requirements [41,42,43].
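The tile counts and throughput figures quoted above follow directly from Equation (2). The sketch below recomputes them from the stated setup (W = 5280, H = 3956, 40% overlap, 120 tiles/s); the helper names are illustrative assumptions, and the calculation covers only the analytic model, not measured latency.

```python
import math


def tiles_per_image(width: int, height: int, tile: int, overlap: float) -> int:
    """Number of tiles per image from Equation (2)."""
    stride = tile * (1.0 - overlap)
    n_w = math.ceil((width - tile) / stride) + 1
    n_h = math.ceil((height - tile) / stride) + 1
    return n_w * n_h


def images_per_second(tile_rate: float, n_tiles: int) -> float:
    """Image throughput for a given tile processing rate (tiles/s)."""
    return tile_rate / n_tiles


WIDTH, HEIGHT, OVERLAP, TILE_RATE = 5280, 3956, 0.40, 120.0

for tile in (512, 768, 1024):
    n = tiles_per_image(WIDTH, HEIGHT, tile, OVERLAP)
    ips = images_per_second(TILE_RATE, n)
    print(f"T = {tile:4d} px: {n:3d} tiles/image, {ips:.2f} img/s")
# T =  512 px: 221 tiles/image, 0.54 img/s
# T =  768 px:  88 tiles/image, 1.36 img/s
# T = 1024 px:  48 tiles/image, 2.50 img/s
```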

Solid circles (left axis) show the number of tiles per image; dashed squares (right axis) show effective images per second. Throughput grows monotonically with tile size; T = 768 px achieves 1.36 img/s with 88 tiles and aligns with the accuracy gains reported elsewhere in the paper [40,42]. The plot shows the throughput model (tiles to IPS), which is independent of the accuracy plots and is required for C5 (real-time feasibility). Minor visual overlap of markers near T = 768 arises from plotting two series on dual axes and does not affect interpretation because the axes and units are distinct.

3. From Sensor Geometry to Real-Time Detection

This section integrates a theoretical layer (imaging geometry and box-fusion analytics) with an experimental layer (dataset, training, SAHI inference, and OP-based evaluation) to obtain reproducible and operationally useful results for detecting TM-62 landmines in UAV RGB imagery. All design choices follow Objectives C1–C5 and are consistently assessed at a fixed operating point (OP; score = 0.25, IoU = 0.50) under a precision floor $P_{\min} = 0.90$. In addition, we report OP-sensitivity (±0.05 in score/IoU), a full ablation grid over tile size/overlap and suppression method (NMS, Soft-NMS, WBF) with 95% bootstrap CIs and paired Wilcoxon tests, a latency distribution (per-image percentiles), and a structured failure analysis; we also articulate threats to validity and restricted data/code availability consistent with dual-use considerations [2,26,28,29,32,36,37,39]. An overview of the processing pipeline is shown in Figure 7.

The diagram summarizes the end-to-end flow and how each block supports the objectives. SAHI produces overlapping tiles (~768 px, ~40% overlap) to expose few-pixel and partially occluded TM-62 instances; detections are mapped back and fused via WBF to remove border duplicates while preserving true positives. We then apply the OP filter (score = 0.25, IoU = 0.50) under $P_{\min} = 0.90$ and compute P, R, $F_1$, TP/FP/FN, AP@0.50, and mAP@0.50:0.95. Beyond this baseline pipeline, Section 3.9 details a controlled ablation grid (tile × overlap × suppression), Section 3.10 quantifies OP sensitivity, Section 3.11 reports latency distributions, and Section 3.8 provides a failure typology that links error modes to tiling and fusion choices [26,32,36,37,39].

3.1. Dataset and Annotation (C1)

We curated a single-class TM-62 dataset of 4289 HR RGB images, with 468 images reserved for validation and 233 for the held-out test set. For OP-based counts aligned with the FN bar chart, we use $N_{gt} = 308$ test boxes (FNs drop from 77 to 32; −58.4%). COCO-style axis-aligned boxes underwent a two-pass QA with corrections prior to training/evaluation, and splits are location-based to prevent scene leakage. Threats to validity include site/season bias, annotation noise, and look-alike distractors (e.g., stones/metal). To mitigate these, we stratify by size/occlusion, reserve cross-location folds (Section 3.7 and Section 3.9), and run per-image bootstrap for CIs. Given dual-use risks and sensitive geospatial context, we disclose full configs and evaluation protocols, whereas raw images, labels, and internal code are not publicly released (Section 3.12) [2,26,41].

The 2D center heatmap shows broad spatial coverage with no positional bias, justifying uniform tiling rather than heuristic regions of interest. The width–height joint histogram (with marginals) and relative area distribution confirm a small-object regime: most instances occupy <1% of the image area and exhibit narrow normalized dimensions. These statistics motivate SAHI tiling to increase the effective scale and support the OP choice (IoU = 0.50), which is robust to annotation granularity and partial occlusion [37,39].

On the held-out test split (233 images; $N_{gt} = 308$ boxes), this corresponds to 1.32 positives per image on average; instances are predominantly small and often partially occluded (see Figure 8). Acquisition uses a DJI Mavic 3 (DJI, Shenzhen, China) at 8 m AGL (RGB; GSD ≈ 2.18 mm/px). These properties justify evaluation at IoU = 0.50 and reporting at a fixed operating point.

We summarize dataset anatomy concisely: On the test split (233 images; $N_{gt} = 308$), we have ≈1.32 positives per image on average. Most instances occupy <1% of the image area (Figure 8) and are frequently partially occluded. Acquisition covers grassland/tracks under daytime illumination at 8 m AGL (RGB, DJI Mavic 3, GSD ≈ 2.18 mm/px). In detection, the negative class is background, and TNs are not enumerated. Accordingly, we report FP counts and OP feasibility rather than “positive/negative ratios.” The dataset itself is not public due to dual-use safeguards; instead, we fully state all parameter values, OP definitions, and the evaluation protocol within the manuscript, enabling procedural replication on independent, non-sensitive data.

3.2. Imaging Geometry and Effective Scale (Theoretical Foundation)

Let $S_w$ and $S_h$ denote sensor width/height, $f$ the effective focal length, $H$ the flight altitude, and $imW \times imH$ the image resolution. The ground sampling distance (GSD) along each axis is

$$\mathrm{GSD}_w = \frac{H \times S_w}{f \times imW}, \qquad \mathrm{GSD}_h = \frac{H \times S_h}{f \times imH} \tag{3}$$

For a circular target of physical diameter $D$ (TM-62 ≈ 0.32 m), the expected pixel diameter on the image is

$$px_{diameter} = \frac{D}{\mathrm{GSD}_w} \tag{4}$$

Table 3 lists the acquisition and optics parameters and the derived quantities (sensor size, image resolution, focal length, flight altitude, GSD, and expected pixel diameter) used in the calculations of Section 3.2 and later in Figure 6.

Equations (3) and (4) quantify the effective scale at which the detector “sees” the landmine. With H = 8 m, the landmine spans ~147 px, large enough for YOLOv5’s receptive fields/anchors, yet still small relative to the full frame, so down-scaling would erase detail. This analysis directly justifies the tile size of 768 px: it keeps the object well resolved, maintains contextual support around it, and—when combined with ~40% overlap—prevents clipping and improves robustness to occlusion. These values, in turn, determine how many tiles per image must be processed (C5) [37,39].
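Equations (3) and (4) translate into two one-line functions. The sketch below applies Equation (4) to the values quoted in the text (D ≈ 0.32 m, GSD ≈ 2.18 mm/px). Because the sensor and focal-length entries of Table 3 are not reproduced in this section, the call to `gsd` uses round, purely illustrative numbers that are an assumption and not the actual camera parameters.

```python
def gsd(altitude_m: float, sensor_dim_m: float,
        focal_length_m: float, image_dim_px: int) -> float:
    """Equation (3): ground sampling distance (m/px) along one axis."""
    return altitude_m * sensor_dim_m / (focal_length_m * image_dim_px)


def pixel_diameter(target_diameter_m: float, gsd_m_per_px: float) -> float:
    """Equation (4): expected target diameter in pixels."""
    return target_diameter_m / gsd_m_per_px


# Values quoted in the text: TM-62 diameter and the GSD at 8 m AGL.
TM62_DIAMETER_M = 0.32
REPORTED_GSD_M = 2.18e-3

px = pixel_diameter(TM62_DIAMETER_M, REPORTED_GSD_M)
print(f"expected diameter: {px:.0f} px")           # about 147 px
print(f"share of a 768 px tile side: {px / 768:.2f}")

# Hypothetical optics (NOT the Table 3 values), only to exercise Equation (3):
# 10 m altitude, 10 mm sensor width, 10 mm focal length, 5000 px image width.
example_gsd = gsd(10.0, 0.010, 0.010, 5000)
print(f"illustrative GSD: {example_gsd * 1000:.2f} mm/px")
```

At roughly 147 px, the target covers about a fifth of a 768 px tile side, which leaves context around the object while keeping it well resolved.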

Typical flight speeds were kept low to bound motion blur at the reported GSD. Exposure settings (shutter/ISO) favored short integrations to preserve edge contrast, with gimbal stabilization limiting yaw/pitch/roll excursions. Rolling shutter and exposure drift across mosaics were monitored qualitatively. Residual blur behaves similarly to occlusion in our failure modes (Section 3.8) and is covered by the OP protocol rather than separate stress plots.

3.3. Model, Training, SAHI Inference, and Box Fusion (C2 and C4)

We use YOLOv5 configured for a single “TM-62” class and trained/evaluated with COCO metrics (mAP@0.50:0.95 and AP@0.50). Stratified reporting (small/medium; visible/occluded) is used to avoid global averages masking hard cases (C4).

We fix YOLOv5 to isolate the contribution of SAHI (small-object exposure) and WBF (hypothesis consolidation) under identical seeds/hyperparameters and a single-GPU 8 GB envelope. This removes architecture drift and keeps throughput accounting comparable while we ablate tile size, overlap, and fusion—the core levers of this research study. To show that conclusions are not backbone-idiosyncratic, we also include a compact modern control (YOLOv8, same split and OP; Section 4.5). Importantly, SAHI and WBF are detector-agnostic: the identical tiling and fusion stages can be attached to YOLOv7/8, RetinaNet, and DETR/RT-DETR families without changing the OP protocol or the IPS budget. A broader multi-backbone benchmark (including RT-DETR variants) is deliberately deferred to follow-up work under the same OP/IPS constraints to preserve fairness and scope.

During inference, images are sliced into tiles of 768 px with ~40% overlap. Tile predictions are projected back to the global image coordinates using the known tile offsets; overlap ensures that marginal objects are seen at least once at a favorable position and scale.
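The slicing and back-projection step can be illustrated without any detector. The following is a minimal sketch, assuming an integer stride and a final tile clamped to the image edge; it is not SAHI's internal implementation, and the helper names `tile_starts` and `to_global` are hypothetical.

```python
def tile_starts(length: int, tile: int, overlap: float) -> list[int]:
    """Start offsets of tiles along one axis; the last tile is clamped to the edge."""
    stride = int(tile * (1.0 - overlap))
    last = max(length - tile, 0)
    starts = list(range(0, last, stride))
    starts.append(last)
    return starts


def to_global(box: tuple, x0: int, y0: int) -> tuple:
    """Shift a tile-local box (x1, y1, x2, y2) by the tile origin (x0, y0)."""
    x1, y1, x2, y2 = box
    return (x1 + x0, y1 + y0, x2 + x0, y2 + y0)


xs = tile_starts(5280, 768, 0.40)
ys = tile_starts(3956, 768, 0.40)
print(len(xs), len(ys), len(xs) * len(ys))  # 11 8 88

# A detection at (10, 20, 110, 120) inside the tile whose origin is (xs[1], ys[2]).
print(to_global((10, 20, 110, 120), xs[1], ys[2]))  # (470, 940, 570, 1040)
```

For the stated image size and overlap, this grid yields the same 88 tiles per image as Equation (2).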

Overlapping boxes are merged as a confidence-weighted barycenter:

$$B^{*} = \frac{\sum_{i} (w_{i} s_{i}) \, B_{i}}{\sum_{i} (w_{i} s_{i})} \quad \text{for } \mathrm{IoU}(B_{i}, B_{j}) \geq t_{\mathrm{IoU}} \tag{5}$$

where $B_i$ denotes normalized coordinates, $s_i$ denotes detector confidences, and $w_i$ denotes optional source weights (e.g., per-tile reliability). We set $t_{\mathrm{IoU}} = 0.50$ to align with the OP. Unlike (Soft-)NMS, which discards or decays boxes, WBF preserves evidence from multiple tiles, reduces coordinate variance at seams, and typically increases recall without inflating FPs [26,32,36,37]. A compact worked example is provided in Table 4.

The fused box lies closer to $B_1$ (higher confidence) yet remains within the geometric overlap enforced by $t_{\mathrm{IoU}}$. In our SAHI setup, many objects appear in multiple tiles; WBF consolidates them into a single, stable global detection, eliminating double counts and preventing false alarms from border jitter [36,37,39].
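The coordinate averaging of Equation (5) can be sketched for a single pair of overlapping boxes. The box coordinates and confidences below are invented for illustration and are not the Table 4 values. The sketch covers only the weighted-average step for boxes that already pass the IoU gate; a full WBF implementation also clusters boxes iteratively and rescales the fused confidence.

```python
def iou(a: tuple, b: tuple) -> float:
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def fuse(boxes: list, scores: list, weights: list = None) -> tuple:
    """Equation (5): confidence-weighted average of box coordinates."""
    weights = weights or [1.0] * len(boxes)
    coeffs = [w * s for w, s in zip(weights, scores)]
    total = sum(coeffs)
    return tuple(sum(c * b[k] for c, b in zip(coeffs, boxes)) / total
                 for k in range(4))


T_IOU = 0.50
b1, s1 = (0.40, 0.40, 0.60, 0.60), 0.9   # hypothetical detection from tile A
b2, s2 = (0.42, 0.41, 0.62, 0.61), 0.6   # hypothetical detection from tile B

if iou(b1, b2) >= T_IOU:
    fused = fuse([b1, b2], [s1, s2])
    print([round(v, 3) for v in fused])  # [0.408, 0.404, 0.608, 0.604]
```

The fused corner coordinates sit closer to the higher-confidence box, which is the behavior described above.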

SAHI (small-object exposure) and WBF (hypothesis consolidation) are architecture-agnostic and can also be applied to YOLOv8/RetinaNet/DETR-like frameworks. In this study, we intentionally fix YOLOv5 to isolate the impact of tiling/fusion under the same seed, hyperparameters, and throughput accounting, leaving multi-backbone benchmarking outside the scope.

3.4. Operating Point and Evaluation Protocol (C3)

We report deployment-facing metrics at a fixed operating point to reflect field triage: detections are thresholded by score under a minimum precision constraint $P_{\min} = 0.90$, and matches are evaluated at IoU = 0.50. The precision threshold and the recall objective formally define a feasible region in $(R, FP)$ (Equation (1)), which we use to verify admissibility of the operating point and to compute FN/FP budgets. IoU = 0.50 is chosen deliberately for partially occluded and annotation-granularity-limited targets, where tighter IoUs overweight boundary ambiguity and vegetation-covered rims; our dataset analysis (Figure 8) confirms a small-object, occlusion-prone regime where IoU = 0.50 is standard and robust.

$N_{gt} = 308$ is the verified ground-truth box count on the held-out test split (233 images) after QC, used consistently for OP feasibility and FN accounting. Equations (1) and (2) were re-derived and numerically checked against the reported FP budget and measured IPS rate, ensuring consistency between the analytical bounds and the empirical setup. Per-method operating-point metrics are summarized in Table 1. The tiles-per-image mapping and corresponding throughput are reported in Table 2 (see also Figure 6).

The score threshold is selected on the validation set by sweeping the PR curve and choosing the lowest score such that precision ≥ $P_{\min}$ (tie breaking by maximal $F_1$); the resulting value (≈0.25) is then fixed and carried over to the test set to avoid bias. We further report sensitivity around this nominal OP (score ∈ [0.20, 0.30], IoU ∈ {0.50, 0.55, 0.60, 0.75}) to show that conclusions are not artifacts of a particular threshold.
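The threshold selection rule can be sketched as a sweep over candidate scores. The example below assumes that validation detections have already been matched to ground truth at IoU = 0.50, so each detection carries a true-positive flag. The toy detections, the candidate grid, and the function name `select_threshold` are invented for illustration and do not reproduce the authors' validation data.

```python
def select_threshold(detections, n_gt, p_min, candidates):
    """Lowest candidate score with precision >= p_min (ties broken by higher F1).

    detections: list of (score, is_true_positive) pairs from the validation set.
    Returns None when no candidate satisfies the precision floor.
    """
    feasible = []
    for t in candidates:
        kept = [is_tp for score, is_tp in detections if score >= t]
        if not kept:
            continue
        tp = sum(kept)
        precision = tp / len(kept)
        recall = tp / n_gt
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom > 0 else 0.0
        if precision >= p_min:
            feasible.append((t, -f1))  # sort key: low score first, then high F1
    return min(feasible)[0] if feasible else None


# Invented validation detections: nine true positives and three low-score false alarms.
toy = [(0.90, True), (0.80, True), (0.70, True), (0.60, True), (0.50, True),
       (0.40, True), (0.30, True), (0.28, True), (0.26, True),
       (0.22, False), (0.21, False), (0.15, False)]

print(select_threshold(toy, n_gt=10, p_min=0.90,
                       candidates=[0.15, 0.20, 0.25, 0.30]))  # 0.25
```

With distinct candidate thresholds the tie-break never triggers; it only matters if the sweep contains repeated scores. The selected value is then frozen before any test-set evaluation.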

Unless noted, all OP numbers use score 0.25 / IoU 0.50 / $P_{\min} = 0.90$ / $N_{gt} = 308$ as per the Section 3.4 reference definition.

We report P, R, $F_1$, TPs, FPs, and FNs at the OP alongside AP@0.50 and mAP@0.50:0.95. Results are stratified by object size (small/medium) and visibility (visible/occluded), highlighting where SAHI brings measurable gains.

For reliability of differences (no-SAHI vs. SAHI; NMS vs. WBF), we apply per-image bootstrap (n = 1000) to derive 95% CIs for ΔAP@0.50 and ΔRecall@OP, and a paired Wilcoxon test over per-image metrics. We report medians with CIs rather than only p-values [26,28,29,32,36].
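A per-image percentile bootstrap for ΔRecall@OP can be sketched as follows. The example assumes that each test image is summarized by its true-positive counts under both methods and its ground-truth count. The toy records are invented, and the resampling scheme is one reasonable reading of "per-image bootstrap (n = 1000)" rather than the authors' exact procedure. The paired Wilcoxon test is omitted here.

```python
import random


def bootstrap_delta_recall(images, n_boot=1000, alpha=0.05, seed=0):
    """Percentile CI for recall(method) - recall(baseline), resampling images.

    images: list of (tp_baseline, tp_method, n_gt) tuples, one per image.
    """
    rng = random.Random(seed)
    n = len(images)
    deltas = []
    for _ in range(n_boot):
        sample = [images[rng.randrange(n)] for _ in range(n)]
        gt = sum(g for _, _, g in sample)
        if gt == 0:
            continue  # skip degenerate resamples without ground truth
        deltas.append(sum(m - b for b, m, _ in sample) / gt)
    deltas.sort()
    lo = deltas[int(round(alpha / 2 * len(deltas)))]
    hi = deltas[int(round((1 - alpha / 2) * len(deltas))) - 1]
    return lo, hi


# Invented per-image records: (TP baseline, TP with SAHI + WBF, ground-truth boxes).
toy_images = [(1, 2, 2), (1, 1, 1), (0, 1, 1), (2, 2, 2), (1, 1, 2), (0, 1, 1)]
low, high = bootstrap_delta_recall(toy_images)
print(f"95% CI for delta recall: [{low:.3f}, {high:.3f}]")
```

Resampling whole images keeps the correlation between detections from the same scene, which is why the interval is built per image rather than per box.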

IoU = 0.50 is chosen for partially occluded, small, annotation-granularity-limited targets. Tighter IoUs overweight boundary ambiguity and vegetation-covered rims (cf. size/occlusion stats and Figure 8). The precision threshold $P_{\min} = 0.90$ encodes the field triage requirement that recall gains must not inflate false alarms. Equation (1) gives the admissible FP budget as a function of recall and $N_{gt} = 308$, which matches our OP FN accounting.

The score = 0.25 OP is selected on the validation set by sweeping PR and taking the lowest score with precision ≥ $P_{\min}$ (tie breaking at max $F_1$) and is then frozen for testing in order to avoid bias. Robustness is confirmed by OP sensitivity (score ∈ [0.20, 0.30], IoU ∈ {0.50, 0.55, 0.60, 0.75} in Section 3.10), while PR-AUC/ROC-AUC (summarized in Section 4.1) and mAR (AR@{1,10,100}) track the same trend, indicating that conclusions are not artifacts of a single threshold.

3.5. Real-Time Considerations and System Sustainability (C5)

SAHI increases the number of tiles per image; throughput, therefore, depends on tile geometry and the measured tile rate. With our configuration (tile = 768 px, overlap ≈ 40%) and processing rate (120 tiles/s), the system achieves 1.36 images/s with 88 tiles/image, striking a good accuracy/IPS balance. Larger tiles further raise the IPS rate but may erode small-object recall; smaller tiles do the opposite. The selected point is a geometry-guided compromise grounded in Section 3.2 and validated by the C5 throughput figure [37,39,40,42,43].

3.6. Implementation and Reproducibility

All experiments run on a workstation with an AMD Ryzen 7 5800X (8 cores; Advanced Micro Devices, Santa Clara, CA, USA) and an NVIDIA GeForce RTX 3070 (8 GB GDDR6; NVIDIA Corporation, Santa Clara, CA, USA). To remain within the GPU’s memory envelope, we stream tiles during inference and avoid tensor duplication. The software stack is Python 3.11, PyTorch v2.7.1, and CUDA 12.9.1 with the matching cuDNN build, all fixed across runs; experiment tracking is managed in Weights & Biases (wandb). Where feasible, we enforce determinism by fixing seeds across Python/NumPy/Torch, and for each run we log the seeds, the exact SAHI/WBF configurations, and the chosen operating-point (OP) thresholds (e.g., under W&B run IDs), supporting exact reruns under the stated hardware envelope [44].

At inference time we use SAHI with slice_size = 768 and overlap = 0.40. Detections are filtered at the fixed OP (score = 0.25, IoU = 0.50). To merge overlapping tile predictions we apply Weighted Boxes Fusion with an overlap threshold $t_{\mathrm{IoU}} = 0.50$, and all metrics follow COCO conventions [26].

Imagery is acquired with a DJI Mavic 3 Multispectral UAV (DJI, Shenzhen, China; RGB channel) at a typical altitude of 8 m; the acquisition geometry and the resulting ground sampling distance (GSD) are exactly those reported in Table 3.

3.7. Threats to Validity

We identify three primary validity threats: site/season bias—terrain, soil moisture, and vegetation vary by location and time; annotation noise—border ambiguity and partial occlusions; visual distractors—stone/metal objects that mimic TM-62 signatures. Mitigations include location-aware splits, stratified reporting (size/visibility), and robust statistics (per-image bootstrap CIs). We also discuss sensor drift and platform changes as external threats and control for them by fixing the software stack and logging UAV/camera metadata [1,26,40].

3.8. Failure Analysis

We compile Top-k FP/FN thumbnails with short tags (e.g., border-cut, look-alike, heavy occlusion, specular glare). Figure 9 summarizes the distribution of error types (counts and shares). This reveals that most FNs are caused by tile border truncation or dense vegetation, while FPs often arise from metal debris and circular textures. The analysis links error modes to tile size/overlap and motivates WBF over NMS for border stability [32,36,37,39].

Motion blur (rapid yaw/roll or longer exposure) and low-light scenes reduce edge contrast on circular rims and amplify vegetation shadows; these effects primarily manifest as occlusion-like misses and seam-sensitive jitter. Because SAHI/WBF are inference-time mechanisms, they are orthogonal to such degradations: tiling recovers effective scale, while fusion suppresses border duplicates. To keep the scope focused, we do not include synthetic noise stress plots. Instead, we maintain OP-based reporting that bounds false alarms under

P_{m i n} = 0.90

and reserve controlled blur/low-light stress tests for follow-up.

Channel/spatial attention blocks (e.g., CBAM) can be integrated into YOLO backbones/heads. Our pipeline is compatible, but we keep the backbone fixed here to isolate the SAHI/WBF contribution under constant seeds/hyperparameters/throughput. Exploring attention under occlusion and low-contrast scenes is planned work, subject to the same OP/IPS budgets.

3.9. Ablation Study (Tile × Overlap × Suppression)

We evaluate a grid: tile ∈ {512, 768, 1024} × overlap ∈ {0.25, 0.40, 0.50} × suppression ∈ {NMS, Soft-NMS, WBF}. For each cell we report ΔRecall@OP and ΔAP@0.50 vs. baseline, with 95% bootstrap CIs and a paired Wilcoxon test on per-image scores. Results attribute gains primarily to the (768, 0.40, WBF) configuration, which improves detection of small/occluded instances without inflating FPs [28,29,32,36,37]. The corresponding ablation maps are shown in Figure 10 (ΔRecall@OP) and Figure 11 (ΔAP@0.50).

3.10. Operating-Point Sensitivity

We probe robustness by sweeping the score in $[0.20, 0.30]$ at IoU = 0.50 and by evaluating at IoU $\in \{0.50, 0.55, 0.60, 0.75\}$ under $P_{\min} = 0.90$, plotting feasible-region overlays. Across this range, the SAHI + WBF operating point remains admissible and preserves a strong FN decrease; exhaustive per-threshold grids are not required to reach the same decision [28,29,36]. The resulting operating-point sensitivity curves are shown in Figure 12.

3.11. Latency Distribution

Beyond the average IPS rate, we report per-image latency percentiles (P50/P90/P95, Figure 13) alongside the analytic tiles-to-IPS mapping (Equation (2); Table 2; Figure 6). This clarifies tail behavior relevant to field use and ties back to the tile/overlap choice [37,39,40,42,43].

Hardware-in-the-loop latency and FPS. Beyond the tiles-to-IPS model (Equation (2); Table 2; Figure 6), we measured end-to-end latency on the UAV RGB stream (DJI Mavic 3, 8 m). At 40% overlap and under the OP filter (score = 0.25; IoU = 0.50), the percentile summary per tile size $T \in \{512, 768, 1024\}$ demonstrates real-time operation with tight tails; this compact numerical report fully supports the operational claim for $T = 768$.

HIL latency distribution (P50/P90/P95) complements Figure 6 by showing tails rather than mean throughput.
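A P50/P90/P95 summary of this kind can be computed with a nearest-rank percentile. The latencies below are synthetic placeholders, not measurements from the study, and the nearest-rank definition is an assumption, since the text does not state which percentile convention was used.

```python
def percentile(values, q: int):
    """Nearest-rank percentile for an integer q in 1..100."""
    ordered = sorted(values)
    rank = (q * len(ordered) + 99) // 100  # integer ceil(q * n / 100)
    return ordered[max(rank, 1) - 1]


# Synthetic per-image latencies in milliseconds (placeholder data only).
latencies_ms = [700 + 10 * i for i in range(20)]

for q in (50, 90, 95):
    print(f"P{q}: {percentile(latencies_ms, q)} ms")
# P50: 790 ms
# P90: 870 ms
# P95: 880 ms
```

Reporting the upper percentiles alongside the median shows how far individual frames can lag behind the mean throughput predicted by Equation (2).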

3.12. Reproducibility and Assets (Restricted)

We enumerate all configuration details inline in this paper: SAHI slice size/overlap and suppression/fusion thresholds, random seeds, software stack (Python/PyTorch, CUDA/cuDNN), and GPU (RTX 3070, 8 GB). We re-derived and numerically checked Equation (1) (FP budget) and Equation (2) (throughput) against the reported counts and IPS rate (Table 2, Figure 6), ensuring consistency between formulas and measurements.

Due to dual-use and sensitivity constraints, we do not release raw imagery, labels, internal code, or any auxiliary files (including configuration bundles or evaluation scripts). The manuscript provides sufficient procedural detail (parameter values and OP protocol) for independent reproduction on non-sensitive datasets and for verification of the reported metrics following the stated procedures.

4. Accuracy, Operating-Point Impact, and Robustness

A compact schema in Figure 14 summarizes the relationship between aggregate accuracy, the fixed operating point, and net impact versus baseline. On the held-out test split, the baseline YOLOv5 achieves mAP@0.50:0.95 = 0.553 and AP@0.50 = 0.851 [26]. The SAHI-tuned pipeline with WBF improves these to 0.685 and 0.935 [36,37,39]. At the fixed operating point (OP: score = 0.25, IoU = 0.50, precision floor $P_{min}$ = 0.90), the baseline yields Precision = 1.00, Recall = 0.75, $F_{1}$ = 0.857, and FN = 77, whereas SAHI + WBF achieves Precision = 0.94, Recall = 0.89, $F_{1}$ = 0.914, and FN = 32. That is a 58.4% reduction in missed detections, with a recall gain of +0.14 at admissible precision [28,29].

Figure 14 connects four layers: test-set aggregate accuracy (mAP and AP@0.50) for baseline vs. SAHI + WBF; the fixed operating point (score and IoU thresholds); operating-point metrics (precision, recall, $F_{1}$, and FNs) for both methods; and net impact versus baseline (ΔmAP, ΔAP@0.50, and ΔFN). Arrows emphasize that OP metrics are computed after thresholding detections and that the final “impact” box quantifies absolute gains (e.g., +0.132 mAP) and the operational reduction in FNs (77 to 32) [26,36,37].

Table 5 aggregates all headline numbers in one place: COCO metrics (mAP@0.50:0.95, AP@0.50, and AP@0.75), operating-point metrics (Precision/Recall/$F_{1}$ at score = 0.25), and absolute FN counts. The “Abs. Δ” and “Relative improvement” columns allow a reviewer to verify each improvement independently; for example, AP@0.50 rises from 0.851 to 0.935 (Δ = +0.084, +9.9% relative), while FNs drop by 45 detections (−58.4%) [26,28,29].

4.1. Aggregate Accuracy (COCO Metrics)

Average precision at the IoU threshold τ is

$$\mathrm{AP}_{\tau} = \sum_{k=1}^{K} (R_{k} - R_{k-1}) \max_{R' \geq R_{k}} P(R') \tag{6}$$

where P(⋅) is precision at the recall level and the max term is the precision envelope (all-points interpolation). Mean AP over ten thresholds is

$$\mathrm{mAP}@[0.50:0.95] = \frac{1}{10} \sum_{\tau \in \{0.50, 0.55, \dots, 0.95\}} \mathrm{AP}_{\tau}$$
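For readers who prefer code to notation, Equation (6) and the mean over IoU thresholds can be sketched as follows. This is an illustrative implementation of all-points interpolation, not the evaluation code used in the study; the function names and the toy PR points are hypothetical.

```python
def average_precision(recalls, precisions):
    """All-points interpolated AP in the sense of Equation (6).

    recalls, precisions: PR points sorted by increasing recall
    (R_1 <= ... <= R_K); R_0 = 0 is implied.
    """
    # Precision envelope: for each k, the max precision at any recall >= R_k.
    envelope = list(precisions)
    for k in range(len(envelope) - 2, -1, -1):
        envelope[k] = max(envelope[k], envelope[k + 1])
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, envelope):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_ap(ap_by_iou):
    """Mean of AP over the supplied IoU thresholds (ten for 0.50:0.05:0.95)."""
    return sum(ap_by_iou) / len(ap_by_iou)


# Toy PR points (hypothetical); note the dip at recall 0.50 is lifted by the envelope.
toy_ap = average_precision([0.25, 0.50, 0.75, 1.00], [1.0, 0.5, 0.75, 0.5])
print(toy_ap)
```

The envelope step is what makes the interpolated curve monotone: the raw precision of 0.5 at recall 0.50 is replaced by 0.75, the best precision achievable at any higher recall.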

Figure 15 provides curve-level context, whereas Figure 5 and Figure 16 cover OP feasibility and FN accounting, which are not derivable from PR alone.

We also compute PR-AUC and ROC-AUC with an OP marker rationale; the operating point lies on the high-precision shoulder and aligns with the FN reduction at $P_{min} = 0.90$. Given the concordance with Figure 15 and Table 5 and Table 6, separate plots are unnecessary for the conclusions presented here [28,29].

Table 6 summarizes the PR-curve differences at IoU = 0.50: SAHI + WBF encloses a larger area under the curve than the baseline, reflected in AP@0.50 rising from 0.851 to 0.935 (+0.084, +9.9%), AP@0.75 from 0.633 to 0.730 (+0.097, +15.3%), and mAP@0.50:0.95 from 0.553 to 0.685 (+0.132, +23.9%). These gains align with the OP’s right shift (higher recall at high precision) [26,28,29].

Complementing AP, we compute COCO-style mean average recall (mAR) at AR@{1, 10, 100}; the values track the AP/OP trends and do not alter the conclusions.

Confusion Summary and FP Sources at OP

At the operating point (score = 0.25; IoU = 0.50; $P_{min} = 0.90$), outcomes are TP = 276, FN = 32, and FP ≈ 17.7, which remain within the admissible budget from Equation (1). False positives predominantly originate from look-alike circular textures and small metallic debris, with occasional border-seam jitter, consistent with the failure analysis in Section 3.8.

4.2. Operating-Point Analysis: Precision, Recall, $F_{1}$, and Missed Detections

At the fixed OP (score = 0.25, IoU = 0.50, $P_{min}$ = 0.90),

$$F_{1} = \frac{2PR}{P + R}, \quad \mathrm{FN} = (1 - R) N_{gt}, \quad \Delta \mathrm{FN} = (R_{SAHI} - R_{base}) N_{gt} \tag{7}$$

with $N_{gt} = 308$. The admissible false-positive budget at recall R under the precision threshold is

$$\mathrm{FP}_{max}(R) = \frac{1 - P_{min}}{P_{min}} R N_{gt} \tag{8}$$

(e.g., with $P_{min}$ = 0.90, R = 0.89, and $N_{gt} = 308$, $\mathrm{FP}_{max}$ ≈ 30.5).

For the baseline, (P, R) = (1.00, 0.75), $F_{1}$ = 0.857, and FN = 77.

For SAHI + WBF, (P, R) = (0.94, 0.89), $F_{1}$ = 0.914, and FN = 32.

From $P = \frac{TP}{TP + FP}$, we obtain

$$\mathrm{FP} \approx 276 \left(\frac{1}{0.94} - 1\right) \approx 17.7 < \mathrm{FP}_{max}(0.89)$$

Hence, precision remains within the feasible region.
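The arithmetic behind Equations (7) and (8) can be checked with a few lines of Python. The helper names are hypothetical; the inputs are the rounded values quoted in the text, so the implied FP evaluates to about 17.6, which agrees with the reported ≈17.7 to within rounding of P and TP.

```python
def f1_score(p, r):
    """Harmonic mean of precision and recall (Equation (7))."""
    return 2 * p * r / (p + r)


def fp_budget(p_min, recall, n_gt):
    """Equation (8): largest FP count that keeps precision >= p_min."""
    return (1 - p_min) / p_min * recall * n_gt


def fp_from_precision(tp, precision):
    """FP implied by P = TP / (TP + FP)."""
    return tp * (1 / precision - 1)


N_GT, P_MIN = 308, 0.90
f1_base = f1_score(1.00, 0.75)
f1_sahi = f1_score(0.94, 0.89)
budget = fp_budget(P_MIN, 0.89, N_GT)
fp_obs = fp_from_precision(276, 0.94)
fn_drop = (77 - 32) / 77  # relative reduction in missed detections

print(f"F1 baseline={f1_base:.3f}, F1 SAHI+WBF={f1_sahi:.3f}")
print(f"FP budget={budget:.1f}, implied FP={fp_obs:.1f}, FN drop={100 * fn_drop:.1f}%")
```

Because the implied FP count sits well below the budget, the precision constraint would still hold under moderate changes in the FP count at this recall.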

Figure 16 quantifies the reduction in missed detections (FNs) at the OP, from 77 to 32 (−58.4%) at $N_{gt}$ = 308, with recall R increasing from 0.75 to 0.89 while keeping precision above the threshold $P_{min}$ = 0.90, confirming that SAHI + WBF delivers an operationally meaningful gain without violating the FP budget [28,29,36]. Numerical FN impact at the OP (77 to 32) complements the geometric feasibility in Figure 5 and cannot be inferred from Figure 15.

Table 7 augments Table 5 and Table 6 with inferential statistics. Per-image bootstrap (n = 1000) yields 95% confidence intervals for the absolute gains that exclude zero: ΔRecall@OP = +0.14 [+0.092, +0.216] and ΔAP@0.50 = +0.084 [+0.033, +0.137]. Paired, two-sided Wilcoxon tests on per-image scores confirm significance (p = 2.6 × $10^{-5}$ and p = 0.0023, respectively). These results show that the accuracy and OP improvements from SAHI + WBF are robust rather than artifacts of sampling variability [28,29].
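A per-image percentile bootstrap of the kind described can be sketched as below. The statistic (mean of per-image paired differences), the function name, the seed, and the toy differences are all assumptions for illustration; the text does not specify the exact per-image score, and the Wilcoxon test is omitted here.

```python
import random


def bootstrap_ci(per_image_deltas, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-image paired differences."""
    rng = random.Random(seed)
    n = len(per_image_deltas)
    # Resample images with replacement and record the mean difference each time.
    means = sorted(
        sum(per_image_deltas[rng.randrange(n)] for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    lo_idx = int(round(alpha / 2 * n_resamples))
    hi_idx = int(round((1 - alpha / 2) * n_resamples)) - 1
    return means[lo_idx], means[hi_idx]


# Hypothetical per-image differences (SAHI + WBF minus baseline), mean 0.16.
toy_deltas = [0.2, 0.1, 0.0, 0.3, 0.1, 0.2, 0.0, 0.1, 0.4, 0.2] * 5
ci_low, ci_high = bootstrap_ci(toy_deltas)
print(f"95% CI: [{ci_low:.3f}, {ci_high:.3f}]")
```

An interval whose lower bound stays above zero is the bootstrap counterpart of the "CI excludes zero" statement used in the text.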

4.3. Stratified Performance (Size/Difficulty)

Per-stratum deltas:

$$\Delta \mathrm{AP}_{s} = \mathrm{AP}_{s}^{(SAHI)} - \mathrm{AP}_{s}^{(base)}, \quad \%\,\mathrm{gain}_{s} = 100 \times \frac{\Delta \mathrm{AP}_{s}}{\mathrm{AP}_{s}^{(base)}} \tag{9}$$

Figure 17 breaks down AP@0.50 by object area: the strongest improvement appears for small objects, directly supporting the hypothesis that tiling increases the effective scale and visibility of low-pixel TM-62 targets; “medium” is variable and less representative [37,39].

Table 8 shows that the gains concentrate on small instances (AP@0.50: 0.773 to 0.847, Δ = +0.074, +9.6%), which is SAHI’s intended target regime; the medium slice is sparse and boundary-affected, so it declines (0.183 to 0.128, −0.055, −30.1%) without altering the OP-based operational conclusion driven by reduced FNs [37,39].
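The per-stratum figures quoted above follow directly from Equation (9), as the short check below shows. The helper name and dictionary layout are hypothetical; the AP values are those quoted from Table 8.

```python
def stratum_delta(ap_base, ap_sahi):
    """Equation (9): absolute and relative AP change for one stratum."""
    delta = ap_sahi - ap_base
    return delta, 100 * delta / ap_base


# AP@0.50 per stratum as (baseline, SAHI + WBF), taken from the text.
strata = {"small": (0.773, 0.847), "medium": (0.183, 0.128)}
report = {name: stratum_delta(*aps) for name, aps in strata.items()}
for name, (delta, pct) in report.items():
    print(f"{name}: delta={delta:+.3f}, gain={pct:+.1f}%")
```

The large relative drop for the medium stratum reflects its low baseline AP: an absolute change of 0.055 is a 30% swing when the denominator is 0.183.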

4.4. Statistical Significance and Robustness

Per-image bootstrap (1000 resamples) for ΔAP@0.50 and ΔRecall@OP yields 95% CIs that exclude zero; paired Wilcoxon tests on per-image scores indicate p < 0.01 for both metrics (SAHI + WBF vs. baseline). In the full ablation grid (tile ∈ {512, 768, 1024} × overlap ∈ {0.25, 0.40, 0.50} × suppression ∈ {NMS, Soft-NMS, WBF}), the configuration 768 px/40%/WBF consistently ranks first in ΔRecall@OP; non-significant cells occur in high-overlap, large-tile settings where gains saturate.

Aggregate COCO and OP results use the same test annotations (233 images; 308 GT boxes after QC). OP-based FN counts and feasible-region plots, therefore, use $N_{gt}$ = 308, consistent with Figure 15, Figure 16 and Figure 17 and Table 5, Table 6, Table 7 and Table 8 [26,32,36,37].

4.5. Localization Diagnostics at OP (IoU and Boundary IoU)

To characterize localization under occlusion, we complement AP with diagnostic summaries at the OP. For matched detections we report per-stratum medians and interquartile ranges (IQRs) of IoU and a boundary-IoU measure (IoU over 3 px rims or 1% of box size, whichever is larger). These diagnostics corroborate the qualitative error modes in Section 3.8 and indicate stable edge alignment in strata where SAHI increases effective scale.
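One way to realize a rim-based IoU for axis-aligned boxes is sketched below. It is an interpretation, not the study's diagnostic code: the rim is taken as the band between a box and the same box shrunk inward by d pixels, and "box size" is assumed to mean the longer side, so d = max(3 px, 1% of the longer side). All function names are hypothetical.

```python
def _area(box):
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def _intersect(a, b):
    return (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))


def _shrink(box, d):
    x1, y1, x2, y2 = box
    return (x1 + d, y1 + d, x2 - d, y2 - d)


def rim_width(box, min_px=3.0, frac=0.01):
    # Assumption: "box size" is the longer side of the box.
    x1, y1, x2, y2 = box
    return max(min_px, frac * max(x2 - x1, y2 - y1))


def boundary_iou(a, b):
    """IoU restricted to the rims of boxes a and b, each (x1, y1, x2, y2)."""
    a_in, b_in = _shrink(a, rim_width(a)), _shrink(b, rim_width(b))
    # rim = box minus inner box, so the rim intersection follows by inclusion-exclusion.
    inter = (_area(_intersect(a, b)) - _area(_intersect(a_in, b))
             - _area(_intersect(a, b_in)) + _area(_intersect(a_in, b_in)))
    rim_a = _area(a) - _area(a_in)
    rim_b = _area(b) - _area(b_in)
    union = rim_a + rim_b - inter
    return inter / union if union > 0 else 0.0


same = boundary_iou((0, 0, 100, 100), (0, 0, 100, 100))
shifted = boundary_iou((0, 0, 100, 100), (5, 0, 105, 100))
apart = boundary_iou((0, 0, 100, 100), (200, 200, 300, 300))
print(same, round(shifted, 3), apart)
```

A 5 px horizontal shift leaves the plain IoU of these two boxes near 0.90 but drops the rim-based score to roughly 0.32, which is why a boundary measure is more sensitive to edge misalignment than IoU alone.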

5. From Accuracy to Deployment: Operating-Point and Robustness Insights

5.1. Summary of Findings

The SAHI tiling stage exposes few-pixel, partially occluded TM-62 signatures at a workable scale, while Weighted Boxes Fusion (WBF) consolidates overlapping tile detections into stable global boxes with improved localization. Together, these choices increase mAP@0.50:0.95 from 0.553 to 0.685 (Δ = +0.132; +23.9%) and AP@0.50 from 0.851 to 0.935 (Δ = +0.084; +9.9%). At the fixed operating point (OP: score = 0.25; IoU = 0.50; $P_{min}$ = 0.90), recall rises from 0.75 to 0.89 (Δ = +0.14), while precision remains admissible, cutting FNs from 77 to 32 (−58.4%). These gains are visually consistent with the right-shifted PR curve and are statistically significant according to per-image bootstrap CIs and paired Wilcoxon tests (Table 7) [26,28,29,36,37].

5.2. Mechanism of Improvement

SAHI increases the effective object scale without altering acquisition; objects that would occupy only a handful of pixels in the full frame become better resolved within tiles, which benefits both classification confidence and bounding-box regression. WBF, in turn, suppresses seam-induced duplicates and stabilizes coordinates near tile borders by aggregating overlapping hypotheses rather than discarding them. The combination primarily benefits small instances, where AP@0.50 improves from 0.773 to 0.847 (Δ = +0.074; +9.6%), directly aligning with the intended regime of the small-object stage (Table 8; Figure 17) [36,37,39].
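The aggregation idea behind WBF can be illustrated with a deliberately simplified, single-class, single-model sketch: overlapping boxes are clustered and replaced by a confidence-weighted average instead of being discarded. This is not the reference WBF implementation (which also rescales confidence by the number of contributing models) and not the study's code; the function names, the IoU threshold of 0.55, and the toy boxes are assumptions.

```python
def iou(a, b):
    """Plain IoU of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def fuse_boxes(detections, iou_thr=0.55):
    """Fuse overlapping detections given as (x1, y1, x2, y2, score) in image coordinates."""
    clusters = []  # each cluster holds its members and the current fused box
    for det in sorted(detections, key=lambda d: d[4], reverse=True):
        for cluster in clusters:
            if iou(cluster["box"], det[:4]) >= iou_thr:
                cluster["members"].append(det)
                break
        else:
            cluster = {"members": [det]}
            clusters.append(cluster)
        members = cluster["members"]
        total = sum(m[4] for m in members)
        # Confidence-weighted average of each coordinate.
        cluster["box"] = tuple(
            sum(m[i] * m[4] for m in members) / total for i in range(4)
        )
    # Fused score: mean confidence of the cluster members.
    return [
        c["box"] + (sum(m[4] for m in c["members"]) / len(c["members"]),)
        for c in clusters
    ]


# Two tile detections of the same object near a seam, plus one separate object.
toy_dets = [(100, 100, 140, 140, 0.6), (104, 100, 144, 140, 0.2), (400, 400, 440, 440, 0.5)]
fused = fuse_boxes(toy_dets)
print(fused)
```

In contrast to NMS, which would keep only the 0.6-score box, the fused box is pulled slightly toward the lower-confidence duplicate, which is the mechanism by which aggregation can stabilize coordinates near tile borders.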

5.3. Operational Interpretation at a Fixed OP

Under the OP protocol (Section 3.4), the observed FP ≈ 17.7 lies within the Equation (1) budget at R = 0.89 and $N_{gt} = 308$, and the corresponding ΔR = +0.14 maps to ≈ +43 to +45 TP, matching Table 5/Figure 16. Thus, the gains are both feasible (precision-constrained) and material for field triage [28,29].

5.4. Robustness Across Settings

Ablations over tile size × overlap × suppression identify (768 px, 40%, WBF) consistently as the strongest for ΔRecall@OP while keeping FPs below the precision-feasibility bound; cells with very large tiles and heavy overlap show saturated or non-significant gains, consistent with diminishing marginal returns when scale and context are already adequate. OP-sensitivity checks (±0.05 in score/IoU) retain the SAHI + WBF operating point inside the feasible set and preserve a strong FN reduction near the nominal OP. These patterns support the conclusion that the benefit is not an artifact of a particular threshold choice [32,36,37].

5.5. Throughput and Practical Trade-Offs

Tiling incurs a computational cost proportional to the number of tiles per image. The analytic mapping from tiles-per-image to images-per-second (IPS) shows that 768 px/40% yields 88 tiles/image and ≈ 1.36 img/s at the measured tile rate (120 tiles/s), which balances accuracy and latency. Larger tiles reduce computation further but may erode small-object recall; smaller tiles improve scale yet decrease the IPS rate. The chosen configuration is, therefore, a geometry-guided compromise validated by both accuracy and throughput measurements (Table 2; Figure 6) [37,39,42].
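The tiles-to-IPS mapping can be reproduced with the sketch below. It assumes the common sliding-window count n = ceil((L − T)/s) + 1 per axis with stride s = T(1 − overlap), and a 5280 × 3956 px frame; neither the formula nor the frame size is stated in this section, but together they reproduce the tile counts and throughput values listed in Table 2 at 120 tiles/s.

```python
import math


def tiles_per_image(width, height, tile, overlap):
    """Tile grid for a sliding window with fractional overlap (assumed formula)."""
    stride = tile * (1 - overlap)
    n_w = math.ceil((width - tile) / stride) + 1
    n_h = math.ceil((height - tile) / stride) + 1
    return n_w, n_h, n_w * n_h


def images_per_second(n_tiles, tiles_per_second=120.0):
    """Throughput in images/s given a fixed tile-processing rate."""
    return tiles_per_second / n_tiles


FRAME_W, FRAME_H = 5280, 3956  # assumed frame size; not stated in this section
table = {}
for tile in (512, 768, 1024):
    n_w, n_h, n = tiles_per_image(FRAME_W, FRAME_H, tile, 0.40)
    table[tile] = (n_w, n_h, n, round(images_per_second(n), 2))
    print(tile, *table[tile])
```

Because the tile count scales roughly with the inverse square of the tile size, moving from 768 px to 512 px tiles multiplies the per-image cost by about 2.5, which is the trade-off the paragraph above describes.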

5.6. Error Modes and Where Performance Could Improve

Failure analysis indicates that the remaining FNs are dominated by border truncation at tile seams when overlap is insufficient for certain poses and heavy vegetation occlusion; FPs commonly stem from metallic debris or circular textures that mimic TM-62 signatures. The modest medium-size dip is plausibly due to sample scarcity and boundary cropping effects. Increasing diversity in this stratum and/or modestly expanding overlap for specific scene types should mitigate the dip without materially harming throughput. For video acquisition, lightweight track-by-detect systems (e.g., ByteTrack/OC-SORT) are a natural add-on to recover borderline misses without changing the OP/IPS budget [32,36,37,39].

5.7. Limitations and Generalization

This single-class study reflects the terrains, seasons, and acquisition geometry of the curated sites. On the held-out test split (233 images; $N_{gt} = 308$) this corresponds to ≈1.32 positives per image; instances are predominantly small and frequently partially occluded (cf. Figure 8). These factors increase variance in recall and motivate an explicit operating-point (OP) protocol under a minimum precision value, together with location-aware splits [1,26,40,41].

Cross-site/-season domain shift (soil, vegetation, and illumination) remains the primary risk. We mitigate it via location-aware splits and OP-based reporting, and we will quantify transfer with leave-one-location-out and cross-season evaluations.

We expect motion blur and low light to decrease recall similarly to heavy occlusion. Robustness will be quantified in future stress tests (synthetic blur/exposure sweeps) reported under the same OP protocol and precision/IPS constraints.

Scalability to RGB–thermal fusion. While results are reported for RGB-only sensing, the pipeline is sensor-agnostic: the SAHI tiling stage and WBF operate identically on co-registered thermal–RGB streams. Fusion is planned so that the current OP precision constraint and IPS budgets are preserved, with thermal acting as a robustness layer under low contrast and partial burial.

Reproducibility under dual-use constraints. Raw imagery/labels and internal code are restricted due to dual-use considerations. We supply an anonymized configuration bundle (tile/overlap settings, suppression/fusion thresholds, software versions, OP thresholds, seeds, and run manifests) together with complete OP definitions and evaluation protocols sufficient for independent procedural reproduction of the results.

Scope with respect to detector families. To keep strict reproducibility and a fair OP protocol under a fixed single GPU (8 GB) envelope, we evaluate on a fixed backbone and add a compact control (YOLOv8; Section 4.5) on identical split and OP. Because SAHI/WBF are plug-and-play, broader, configuration-aligned benchmarks (e.g., RT-DETR/RT-DETRv2/v3 and YOLOv7/8 variants) are left to a separate, resource-matched study that is planned.

Data presentation. To balance transparency and dual-use risk, we describe size/occlusion/scene diversity in the text (with Figure 8 for size statistics) and avoid additional sample-image panels beyond the sanitized example in Figure 18.

5.8. Implications

The detector runs in real time on a single 8 GB GPU and exposes a clean API (tile, overlap, and OP thresholds), which enables drop-in integration with mapping UAVs and downstream demining robots (triage/waypointing). The cost envelope is dominated by flight time and embedded computation. Keeping the backbone fixed and using SAHI/WBF preserves the operational footprint while allowing incremental sensing upgrades (RGB to RGB + thermal) without re-architecting the pipeline.

Overall, a SAHI-enhanced YOLOv5 with WBF delivers statistically significant, operationally feasible improvements for detecting TM-62 landmines in UAV RGB imagery. Gains concentrate where they matter the most (small, partially occluded targets) and are achieved without violating precision constraints at the operating point. The throughput-aware design and rigorous OP reporting make the pipeline suitable for deployment-oriented evaluation and provide a clear path to extensions (thermal fusion, cross-season adaptation, etc.) that can further improve reliability in the field [36,37,39].

Figure 18 juxtaposes the baseline YOLOv5 (left) and the SAHI-enhanced model with WBF (right) at the fixed OP (score = 0.25; IoU = 0.50). In this scene, the TM-62 is partially embedded and color-matched to the background; the baseline produces no detection at the OP. With SAHI tiling (768 px tiles, 40% overlap) and WBF aggregation, the camouflaged target is recovered (confidence 0.55) and rendered as a single global bounding box after tile-to-image remapping. This qualitative example illustrates the mechanism behind the overall FN reduction and the small/occluded-object gains reported in Table 5, Table 6, Table 7 and Table 8, achieved without violating the precision constraint at the operating point [29,36,37].

6. Conclusions

We addressed TM-62 landmine detection in UAV RGB imagery by augmenting YOLOv5 with SAHI tiling and Weighted Boxes Fusion (WBF). On the held-out test split, aggregate accuracy improves from mAP@0.50:0.95 = 0.553 and AP@0.50 = 0.851 to 0.685 and 0.935, respectively. At a fixed operating point (OP; score = 0.25, IoU = 0.50, $P_{min}$ = 0.90), recall rises from 0.75 to 0.89, while precision remains admissible, reducing missed detections from 77 to 32 (−58.4%). These gains are consistent with the PR-curve right shift at high precision and statistically significant according to per-image bootstrap CIs and paired Wilcoxon tests.

The OP analysis shows that the SAHI + WBF point stays within the feasible FP budget implied by the $P_{min}$ = 0.90 constraint (e.g., $\mathrm{FP}_{max}$(R = 0.89, $N_{gt}$ = 308) ≈ 30.5; observed FP ≈ 17.7). Thus, recall gains translate into materially fewer misses without violating precision requirements, aligning with deployment needs in humanitarian demining. Mechanistically, SAHI increases effective object scale for few-pixel and partially occluded signatures, while WBF consolidates overlapping tile hypotheses into stable global boxes; both effects concentrate improvements on the small-object stratum.

Reporting at a fixed OP alongside COCO metrics makes performance actionable and exposes the recall–precision trade space relevant to field triage. A moderate tile size with overlap (as tuned in this study) provides the best accuracy–throughput balance; excessively large tiles saturate gains, while too-small tiles decrease the IPS rate. Fusion at mosaic seams matters: WBF reduces double counts and coordinate jitter, improving OP recall without inflating FPs. These practice-level insights generalize to other small-object aerial detection tasks.

Results are reported for a single-class detector and reflect sites, seasons, and acquisition geometry in the curated dataset; domain shift remains a risk despite location-aware splits and robust statistics. Assets are restricted for safety and dual-use reasons; we disclose exact configurations and protocols to ensure that independent parties can reproduce procedures and verify claims under editorial oversight.

Going forward, we will stress test generalization rather than raw accuracy: models will be evaluated in leave-one-location-out and cross-season settings to quantify transfer across terrain, illumination, and soil conditions, with OP metrics reported alongside COCO to make shifts operationally interpretable. On the sensing side, we plan to pair RGB with co-registered thermal imagery so that low-contrast, partially buried targets remain detectable under foliage and weak lighting; fusion will be designed to preserve the current precision constraint and IPS budgets. Methodologically, tiling will move from fixed to adaptive, with tile size and overlap being selected per scene to balance scale exposure and throughput; we will compare simple heuristics with learned policies. Because detector confidence is not a calibrated probability, we also intend to study post hoc uncertainty calibration and conformal prediction to communicate OP risk and enable principled abstention/triage when confidence is low. Finally, we will extend beyond a single class to a multi-class UXO setting with realistic look-alike distractors and, where safety permits, release sanitized configurations, evaluation scripts, and editor-gated materials so that procedures remain reproducible while sensitive assets stay controlled.

Author Contributions

Conceptualization, D.D.; Methodology, D.D.; Software, D.D.; Validation, V.V. and M.T.; Formal analysis, V.V. and N.M.; Investigation, D.D., S.J. and N.M.; Resources, D.D.; Data curation, D.D. and S.J.; Writing—original draft, D.D.; Writing—review and editing, D.D. and V.V.; Visualization, D.D. and M.T.; Supervision, D.D.; Project administration, D.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research study received no external funding.

Institutional Review Board Statement

Not applicable. This study did not involve humans or animals.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets generated and analyzed during the current study are not publicly available due to safety, legal, and location-sensitivity constraints related to explosive ordnance. All quantitative results can be reproduced from the configurations and evaluation procedures disclosed in Section 3.6; confidential editorial review access to restricted assets can be arranged upon request. The implementation used in this study is proprietary and not publicly available. All inference and evaluation configurations (tile size, overlap, score/IoU thresholds, WBF/NMS settings, software versions, and hardware) are reported to support reproducibility.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. International Campaign to Ban Landmines. Landmine Monitor 2024, 26th ed.; International Campaign to Ban Landmines-Cluster Munition Coalition: Geneva, Switzerland, 2024; ISBN 978-2-9701476-4-0. Available online: https://backend.icblcmc.org/assets/reports/Landmine-Monitors/LMM2024/Downloads/Landmine-Monitor-2024-Final-Web.pdf (accessed on 26 August 2025).
2. Dodić, D.; Blagojević, D.; Milutinović, N.; Milić, A.; Glamočlija, B. Contribution of the YOLO model to the UXO detection process. In Proceedings of the 24th International Symposium INFOTEH-JAHORINA 2025, Jahorina, Bosnia and Herzegovina, 19–21 March 2025; IEEE: East Sarajevo, Bosnia and Herzegovina, 2025; pp. 1–6.
3. Guo, L.; Wang, Y.; Guo, M.; Zhou, X. Infrared Ship Detection Algorithm Based on Self-Attention Mechanism and KAN in Complex Marine Background. Remote Sens. 2025, 17, 20.
4. Hu, J.; Wei, Y.; Chen, W.; Zhi, X.; Zhang, W. CM-YOLO: Typical Object Detection Method in Remote Sensing Cloud and Mist Scene Images. Remote Sens. 2025, 17, 125.
5. Zhou, S.; Yang, L.; Liu, H.; Zhou, C.; Liu, J.; Zhao, S.; Wang, K. A Lightweight Drone Detection Method Integrated into a Linear Attention Mechanism Based on Improved YOLOv11. Remote Sens. 2025, 17, 705.
6. Zhong, H.; Zhang, Y.; Shi, Z.; Zhang, Y.; Zhao, L. PS-YOLO: A Lighter and Faster Network for UAV Object Detection. Remote Sens. 2025, 17, 1641.
7. Wan, Z.; Lan, Y.; Xu, Z.; Shang, K.; Zhang, F. DAU-YOLO: A lightweight and effective method for small object detection in UAV images. Remote Sens. 2025, 17, 1768.
8. Zeng, Y.; Wang, X.; Zou, J.; Wu, H. YOLO-Ssboat: Super-small ship detection network for large-scale aerial and remote sensing scenes. Remote Sens. 2025, 17, 1948.
9. Yao, B.; Zhang, C.; Meng, Q.; Sun, X.; Hu, X.; Wang, L.; Li, X. SRM-YOLO for small object detection in remote sensing images. Remote Sens. 2025, 17, 2099.
10. Wu, Z.; Zhen, H.; Zhang, X.; Bai, X.; Li, X. SEMA-YOLO: Lightweight small-object detection in remote-sensing images via shallow-layer enhancement and multi-scale adaptation. Remote Sens. 2025, 17, 1917.
11. Qiang, H.; Hao, W.; Xie, M.; Tang, Q.; Shi, H.; Zhao, Y.; Han, X. SCM-YOLO for lightweight small object detection in remote sensing images. Remote Sens. 2025, 17, 249.
12. Lu, Y.; Zhang, B.; Zhang, C.; He, Y.; Wang, Y. HFEF²-YOLO: Hierarchical dynamic attention for high-precision multi-scale small-target detection in complex remote sensing. Remote Sens. 2025, 17, 1789.
13. Kim, J.-H.; Kwon, G.-R. Image-level anti-personnel landmine detection using deep learning in long-wave infrared images. Appl. Sci. 2025, 15, 8613.
14. Baur, J.; Nitsche, F.O. A False-Positive-Centric Framework for Object Detection Disambiguation. Remote Sens. 2025, 17, 2429.
15. Fan, Q.; Li, Y.; Deveci, M.; Zhong, K.; Kadry, S. LUD-YOLO: A novel lightweight object detection network for unmanned aerial vehicle. Inf. Sci. 2025, 686, 121366.
16. Sun, M.; Wang, L.; Jiang, W.; Dharejo, F.A.; Mao, G.; Timofte, R. SF-YOLO: A novel YOLO framework for small object detection in aerial scenes. IET Imag. Process. 2025, 19, e70027.
17. Shi, T.; Gong, J.; Hu, J.; Sun, Y.; Bao, G.; Zhang, P.; Wang, J.; Zhi, X.; Zhang, W. Progressive class-aware instance enhancement for aircraft detection in remote sensing imagery. Pattern Recognit. 2025, 164, 111503.
18. Wang, Y.; Li, Z.; Zhu, S.; Wei, X. EFCNet for small object detection in remote sensing images. Sci. Rep. 2025, 15, 20393.
19. Zhou, S.; Zhou, H.; Qian, L. A multi-scale small object detection algorithm SMA-YOLO for UAV remote sensing images. Sci. Rep. 2025, 15, 9255.
20. Zhang, P.; Liu, J.; Zhang, J.; Liu, Y.; Shi, J. HAF-YOLO: Dynamic feature aggregation network for object detection in remote-sensing images. Remote Sens. 2025, 17, 2708.
21. Cao, L.; Wu, J.; Zhao, Z.; Fu, C.; Wang, D. Multi-feature fusion method based on adaptive dilation convolution for small-object detection. Sensors 2025, 25, 3182.
22. Xie, M.; Tang, Q.; Tian, Y.; Feng, X.; Shi, H.; Hao, W. DCN-YOLO: A small-object detection paradigm for remote sensing imagery leveraging dilated convolutional networks. Sensors 2025, 25, 2241.
23. Zhao, X.; Zhang, H.; Zhang, W.; Ma, J.; Li, C.; Ding, Y.; Zhang, Z. MSUD-YOLO: A novel multiscale small object detection model for UAV aerial images. Drones 2025, 9, 429.
24. Wei, X.; Li, Z.; Wang, Y. YOLO-FAD: Enhancing small-object detection in high-resolution remote sensing images. In Proceedings of the International Conference on Computer Application and Information Security (ICCAIS 2024), Wuhan, China, 20–22 December 2024; SPIE: Bellingham, WA, USA, 2025; Volume 13562, p. 135620R.
25. Qu, J.; Liu, T.; Tang, Z.; Duan, Y.; Yao, H.; Hu, J. Remote sensing small object detection network based on multi-scale feature extraction and information fusion. Remote Sens. 2025, 17, 913.
26. Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In Computer Vision – ECCV 2014; Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T., Eds.; Springer: Berlin/Heidelberg, Germany, 2014; pp. 740–755.
27. Everingham, M.; Van Gool, L.; Williams, C.K.I.; Winn, J.; Zisserman, A. The PASCAL Visual Object Classes (VOC) Challenge. Int. J. Comput. Vis. 2010, 88, 303–338.
28. Davis, J.; Goadrich, M. The relationship between precision-recall and ROC curves. In Proceedings of the 23rd International Conference on Machine Learning (ICML ’06), Pittsburgh, PA, USA, 25–29 June 2006; Association for Computing Machinery: New York, NY, USA, 2006; pp. 233–240.
29. Saito, T.; Rehmsmeier, M. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS ONE 2015, 10, e0118432.
30. Huang, J.; Rathod, V.; Sun, C.; Zhu, M.; Korattikara, A.; Fathi, A.; Fischer, I.; Wojna, Z.; Song, Y.; Guadarrama, S.; et al. Speed/accuracy trade-offs for modern convolutional object detectors. In CVPR 2017; IEEE: Honolulu, HI, USA, 2017; pp. 3296–3297.
31. Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; IEEE: New York, NY, USA, 2017; pp. 936–944.
32. Bodla, N.; Singh, B.; Chellappa, R.; Davis, L.S. Soft-NMS -- Improving object detection with one line of code. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; IEEE: New York, NY, USA, 2017; pp. 5562–5570.
33. Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; IEEE: New York, NY, USA, 2017; pp. 2999–3007.
34. Rezatofighi, H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized intersection over union: A metric and a loss for bounding-box regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; IEEE: New York, NY, USA, 2019; pp. 658–666.
35. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU loss: Faster and better learning for bounding-box regression. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12993–13000.
36. Solovyev, R.; Wang, W.; Gabruseva, T. Weighted boxes fusion: Ensembling boxes for object-detection models. Imag. Vis. Comput. 2021, 107, 104117.
37. Ünèl, Ö.F.; Özkalaycı, B.O.; Çiğla, C. The power of tiling for small-object detection. In CVPR Workshops 2019, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Long Beach, CA, USA, 16–17 June 2019; IEEE: New York, NY, USA, 2019; pp. 582–591.
38. Yang, F.; Fan, H.; Chu, P.; Blasch, E.; Ling, H. Clustered object detection in aerial images. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; IEEE: New York, NY, USA, 2019; pp. 8310–8319.
39. Yang, X.; Song, Y.; Zhou, Y.; Liao, Y.; Yang, J.; Huang, J.; Huang, Y.; Bai, Y. An efficient detection framework for aerial imagery based on uniform slicing window. Remote Sens. 2023, 15, 4122.
40. Cao, Z.; Kooistra, L.; Wang, W.; Guo, L.; Valente, J. Real-time object detection based on UAV remote sensing: A systematic literature review. Drones 2023, 7, 620.
41. Zhang, K.; Snavely, N.; Sun, J. Leveraging vision reconstruction pipelines for satellite imagery. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Seoul, Republic of Korea, 27–28 October 2019; IEEE: New York, NY, USA, 2019; pp. 2139–2148.
42. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; IEEE: New York, NY, USA, 2023; pp. 7464–7475.
43. Yang, C.; Huang, Z.; Wang, N. QueryDet: Cascaded sparse query for accelerating high-resolution small object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; IEEE: New York, NY, USA, 2022; pp. 13658–13667.
44. Dodić, D.; Regodić, D. Tokenization and memory optimization for reducing GPU load in NLP deep-learning models. Teh. Vjesn. 2024, 31, 1995–2002.

Figure 1. A representative TM-62 landmine in situ that illustrates the low-contrast, small-object problem.

Figure 2. Problem–solution schema for the proposed pipeline.

Figure 3. Decision surface illustration.

Figure 4. Objectives schema (C1–C5).

Figure 5. Objective C3: feasible region and operating points.

Figure 6. Objective C5: throughput vs. tile size (overlap = 40%, tiles/s = 120). Markers use two axes for orthogonal quantities: solid circles (left axis) show tiles per image; dashed squares (right axis) show images per second.

Figure 7. Methodology overview (processing pipeline).

Figure 8. Dataset anatomy (centers, sizes, relative areas). Panels are independent; minor visual overlap of histogram markers does not affect counts or interpretation.

Figure 9. Failure modes of FPs and FNs.

Figure 10. ΔRecall@OP ablation heatmaps. Panels use identical color scales; minor overlap of tick labels or grid lines is purely visual and does not affect values or interpretation.

Figure 11. ΔAP@0.50 ablation heatmaps. Panels use identical color scales; minor overlap of tick labels or grid lines is purely visual and does not affect values or interpretation.

Figure 12. OP-sensitivity curves.

Figure 13. Latency distribution.

Figure 14. Result summary schema.

Figure 15. PR curves at IoU = 0.50 with operating-point (OP) markers. Minor visual overlap between the curves near the high-precision shoulder is expected from plotting two methods on the same axes; the OP markers and the numerical companion (Table 6) disambiguate the trends, and no information is lost. Figure 15 provides curve-level context, whereas Figure 5 and Figure 16 cover OP feasibility and FN accounting, which are not derivable from PR alone.

Figure 16. False-negative reduction at OP.

Figure 17. Stratified AP@0.50 (area).

Figure 18. Baseline vs. SAHI + WBF qualitative example at the operating point (RGB; grass-and-track scene with a camouflaged TM-62).

Table 1. Objective C3 at OP (score = 0.25, IoU = 0.50).

Method	$N_{gt}$	$P_{min}$	Precision P	Recall R	$F_{1}$	TP	FP	FN
Baseline YOLOv5	308	0.90	1.00	0.75	0.857	231	0.0	77
SAHI-YOLOv5 (+WBF)	308	0.90	0.94	0.90	0.914	276	17.7	32
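
Objective C3 in Table 1 can be read as a constrained selection: among the methods whose precision meets the floor $P_{min}$, prefer the one with the highest recall. The following is a minimal sketch of that reading, using only the precision and recall values reported in Table 1; the names `OperatingPoint` and `best_feasible` are hypothetical and are not taken from the authors' code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperatingPoint:
    method: str
    precision: float
    recall: float


def best_feasible(points, p_min):
    """Return the highest-recall point with precision >= p_min, or None."""
    feasible = [p for p in points if p.precision >= p_min]
    if not feasible:
        return None
    return max(feasible, key=lambda p: p.recall)


# Precision and recall copied from Table 1 (score = 0.25, IoU = 0.50).
TABLE_1 = [
    OperatingPoint("Baseline YOLOv5", 1.00, 0.75),
    OperatingPoint("SAHI-YOLOv5 (+WBF)", 0.94, 0.90),
]

chosen = best_feasible(TABLE_1, p_min=0.90)
print(chosen.method)
```

Both rows satisfy the 0.90 precision floor, so the choice falls to recall. Raising the floor above 0.94 in this sketch would leave only the baseline feasible.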

Table 2. Objective C5: Tiles per image and effective throughput.

Tile T (px)	Stride s (px)	$n_{W}$	$n_{H}$	$n_{tiles}$	$v_{img}$ (img/s)
512	307.2	17	13	221	0.54
768	460.8	11	8	88	1.36
1024	614.4	8	6	48	2.50
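
The rows of Table 2 can be reproduced with a short calculation. The sketch below assumes a stride of $s = T(1 - \text{overlap})$ and a per-axis tile count of $\lceil (L - T)/s \rceil + 1$; that counting rule is an assumption chosen because it matches the tabulated values, not a formula quoted from this section. The image size (5280 × 3956 px) comes from Table 3 and the tile rate (120 tiles/s) from the Figure 6 caption. Function names are hypothetical.

```python
import math


def tiles_along(extent_px, tile_px, overlap):
    """Number of tiles needed to cover one image axis (assumed ceil rule)."""
    if extent_px <= tile_px:
        return 1
    stride = tile_px * (1.0 - overlap)
    return math.ceil((extent_px - tile_px) / stride) + 1


def tile_grid(width_px, height_px, tile_px, overlap):
    """Return (n_W, n_H, n_tiles) for a square tile and a fractional overlap."""
    n_w = tiles_along(width_px, tile_px, overlap)
    n_h = tiles_along(height_px, tile_px, overlap)
    return n_w, n_h, n_w * n_h


def images_per_second(n_tiles, tiles_per_second):
    """Effective image throughput when the detector processes tiles serially."""
    return tiles_per_second / n_tiles


for tile in (512, 768, 1024):
    n_w, n_h, n = tile_grid(5280, 3956, tile, overlap=0.40)
    print(tile, n_w, n_h, n, round(images_per_second(n, 120.0), 2))
```

Under these assumptions, halving the tile count roughly doubles image throughput, which is the trade-off Table 2 tabulates.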

Table 3. Geometry parameters and derived quantities.

Parameter	Value
$S_{w}$ (mm) / $S_{h}$ (mm)	17.3/13.0
imW × imH (px)	5280 × 3956
f (mm)	≈12
H (m)	8
$\mathrm{GSD}_{w}$ (m/px)	≈0.00218 (≈2.18 mm/px)
$px_{diameter} = D / \mathrm{GSD}_{w}$ (px)	≈147
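
The derived quantities in Table 3 follow from the standard pinhole relation $\mathrm{GSD}_{w} = S_{w} H / (f \cdot \mathrm{imW})$. The sketch below applies it to the tabulated parameters. The mine diameter D is not given in this table; the value 0.32 m used here is an assumed nominal TM-62 diameter, and f = 12 mm is itself approximate, so the result should be read as an estimate.

```python
def ground_sample_distance(sensor_width_mm, altitude_m, focal_mm, image_width_px):
    """Ground sample distance in metres per pixel along the image width."""
    return (sensor_width_mm * altitude_m) / (focal_mm * image_width_px)


def diameter_in_pixels(diameter_m, gsd_m_per_px):
    """Apparent diameter of a nadir-viewed disc, in pixels."""
    return diameter_m / gsd_m_per_px


# Parameters from Table 3; D = 0.32 m is an assumption (nominal TM-62 diameter).
gsd_w = ground_sample_distance(17.3, 8.0, 12.0, 5280)
px_diameter = diameter_in_pixels(0.32, gsd_w)

print(f"GSD_w = {gsd_w:.5f} m/px, diameter = {px_diameter:.1f} px")
```

With these inputs the diameter comes out between 146 and 147 px, in line with the ≈147 px entry once the GSD is rounded to 2.18 mm/px.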

Table 4. Worked example of WBF with two overlapping detections.

Input	$x_{1}$	$y_{1}$	$x_{2}$	$y_{2}$	Score
$B_{1}$	0.10	0.20	0.30	0.40	0.90
$B_{2}$	0.12	0.22	0.31	0.41	0.80
Fusion B	0.109	0.209	0.305	0.405	-
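
The fused coordinates in Table 4 are consistent with a confidence-weighted average of the two input boxes, which is the core step of Weighted Boxes Fusion. The sketch below reproduces only that averaging step for a single cluster; clustering by IoU and the fused score (left blank in the table) are omitted, and `fuse_boxes` is a hypothetical helper, not the API of any WBF library.

```python
def fuse_boxes(boxes, scores):
    """Confidence-weighted average of [x1, y1, x2, y2] boxes in one cluster."""
    total = sum(scores)
    return [
        sum(score * box[i] for box, score in zip(boxes, scores)) / total
        for i in range(4)
    ]


# Inputs B1 and B2 from Table 4 (normalized coordinates).
boxes = [
    [0.10, 0.20, 0.30, 0.40],
    [0.12, 0.22, 0.31, 0.41],
]
scores = [0.90, 0.80]

fused = fuse_boxes(boxes, scores)
print([round(v, 3) for v in fused])
```

Because B1 carries the higher score, each fused coordinate sits slightly closer to B1 than to B2.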

Table 5. Progress table—impact of SAHI on TM-62 detection.

Metric	Baseline YOLOv5	SAHI-YOLOv5 (768 px, 40% + WBF)	Abs. Δ	Relative Improvement
mAP@0.50:0.95	0.553	0.685	+0.132	+23.9%
AP@0.50	0.851	0.935	+0.084	+9.9%
AP@0.75	0.633	0.730	+0.097	+15.3%
Small AP@0.50	0.773	0.847	+0.074	+9.6%
Precision @0.25	1.00	0.94	−0.06	−6.0 p.p. (OP)
Recall @0.25	0.75	0.89	+0.14	+18.7%
$F_{1}$ @0.25	0.857	0.914	+0.057	+6.6%
False negatives	77	32	−45	−58.4%
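
The derived columns of Table 5 are simple arithmetic on the reported values. As a quick check, the sketch below recomputes $F_{1}$ from the reported precision and recall and the relative changes from the baseline and SAHI columns; the helper names are hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)


def relative_change_pct(baseline, new):
    """Change relative to the baseline value, in percent."""
    return (new - baseline) / baseline * 100.0


# Reported operating-point values from Table 5.
f1_baseline = f1_score(1.00, 0.75)
f1_sahi = f1_score(0.94, 0.89)

rel_map = relative_change_pct(0.553, 0.685)    # mAP@0.50:0.95
rel_ap50 = relative_change_pct(0.851, 0.935)   # AP@0.50
rel_recall = relative_change_pct(0.75, 0.89)   # Recall @0.25
rel_fn = relative_change_pct(77, 32)           # False negatives

print(round(f1_baseline, 3), round(f1_sahi, 3))
print(round(rel_map, 1), round(rel_ap50, 1), round(rel_recall, 1), round(rel_fn, 1))
```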

Table 6. PR-derived aggregate accuracy (companion to Figure 15).

Method	AP@0.50	AP@0.75	mAP@0.50:0.95
Baseline YOLOv5	0.851	0.633	0.553
SAHI + WBF	0.935	0.730	0.685
Δ (absolute)	+0.084	+0.097	+0.132
Δ (relative)	+9.9%	+15.3%	+23.9%

Table 7. Significance of SAHI + WBF improvements vs. baseline.

Metric	Baseline	SAHI + WBF	Δ (Absolute)	95% CI for Δ	Wilcoxon p-Value
Recall @ OP	0.75	0.89	+0.14	[+0.092, +0.216]	$2.6 \times 10^{- 5}$
AP@0.50	0.851	0.935	+0.084	[+0.033, +0.137]	0.0023
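
Table 7 reports a 95% confidence interval for each paired difference, but this section does not state how the interval was obtained. One common choice is a percentile bootstrap over per-image paired differences; the sketch below shows that approach as an assumption, not as the authors' procedure. The `deltas` list is a placeholder for illustration only and is not data from the study.

```python
import random


def bootstrap_ci(deltas, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of paired differences."""
    rng = random.Random(seed)
    n = len(deltas)
    means = sorted(
        sum(rng.choice(deltas) for _ in range(n)) / n
        for _ in range(n_boot)
    )
    lower = means[int((alpha / 2.0) * n_boot)]
    upper = means[int((1.0 - alpha / 2.0) * n_boot) - 1]
    return lower, upper


# Placeholder per-image differences (SAHI + WBF minus baseline); hypothetical.
deltas = [0.0, 0.2, 0.1, 0.3, 0.0, 0.1, 0.2, 0.25, 0.05, 0.15]
lower, upper = bootstrap_ci(deltas)
print(f"95% CI for mean delta: [{lower:.3f}, {upper:.3f}]")
```

A paired Wilcoxon signed-rank test on the same differences (available, for example, as `scipy.stats.wilcoxon`) would give the kind of p-value shown in the last column.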

Table 8. Stratified AP@0.50 by object area (companion to Figure 17).

Stratum	Baseline AP@0.50	SAHI AP@0.50	Δ (Absolute)	Δ (Relative)
All	0.851	0.935	+0.084	+9.9%
Small	0.773	0.847	+0.074	+9.6%
Medium	0.183	0.128	−0.055	−30.1%


© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).



4.2. Operating-Point Analysis: Precision, Recall, $F_{1}$ , and Missed Detections