Article

Deep Learning for Wildlife Monitoring: Near-Infrared Bat Detection Using YOLO Frameworks

by José-Joel González-Barbosa 1, Israel Cruz Rangel 1, Alfonso Ramírez-Pedraza 1,2, Raymundo Ramírez-Pedraza 3, Isabel Bárcenas-Reyes 4, Erick-Alejandro González-Barbosa 5 and Miguel Razo-Razo 6,*

1 Instituto Politécnico Nacional, CICATA-Unidad Querétaro, Querétaro 76090, Mexico
2 Secretaría de Ciencia, Humanidades, Tecnología e Innovación SECIHTI, Ciudad de México 03940, Mexico
3 Facultad de Contaduría y Administración, Universidad Autónoma de Querétaro, Querétaro 76017, Mexico
4 Facultad de Ciencias Naturales, Universidad Autónoma de Querétaro, Querétaro 76230, Mexico
5 Tecnológico Nacional de México/ITS de Irapuato, Guanajuato 36821, Mexico
6 The University of Texas at Dallas, Richardson, TX 75080, USA
* Author to whom correspondence should be addressed.
Signals 2025, 6(3), 46; https://doi.org/10.3390/signals6030046
Submission received: 25 July 2025 / Revised: 21 August 2025 / Accepted: 26 August 2025 / Published: 4 September 2025

Abstract

Bats are ecologically vital mammals, serving as pollinators, seed dispersers, and bioindicators of ecosystem health. Many species inhabit natural caves, which offer optimal conditions for survival but present challenges for direct ecological monitoring due to their dark, complex, and inaccessible environments. Traditional monitoring methods, such as mist-netting, are invasive and limited in scope, highlighting the need for non-intrusive alternatives. In this work, we present a portable multisensor platform designed to operate in underground habitats. The system captures multimodal data, including near-infrared (NIR) imagery, ultrasonic audio, 3D structural data, and RGB video. Focusing on NIR imagery, we evaluate the effectiveness of the YOLO object detection framework for automated bat detection and counting. Experiments were conducted using a dataset of NIR images collected in natural shelters. Three YOLO variants (v10, v11, and v12) were trained and tested on this dataset. The models achieved high detection accuracy, with YOLO v12m reaching a mean average precision (mAP) of 0.981. These results demonstrate that combining NIR imaging with deep learning enables accurate and non-invasive monitoring of bats in challenging environments. The proposed approach offers a scalable tool for ecological research and conservation, supporting population assessment and behavioral studies without disturbing bat colonies.

1. Introduction

Ecologically, bats play pivotal roles in trophic chains and pollination cycles [1], serving as reservoirs of microbiological diversity. These mammals are often the sole pollinators for numerous plant species [2], underscoring their crucial role in maintaining ecosystem balance. Bats frequently utilize artificial shelters—including tunnels, bridges, abandoned mines, and tree hollows [3,4]—yet ecological studies show that such structures often fail to meet survival requirements for their colonies, negatively impacting reproductive cycles [5]. In contrast, natural caves with their cracked, rocky, deep architecture, offer chambers or cavities that provide optimal temperature and humidity conditions. These features make caves favorable habitats for the survival of different species of bats, especially during breeding seasons [6,7]. Bats are considered useful bioindicators of ecosystem health due to their sensitivity to environmental disturbances [8,9]. The COVID-19 pandemic has further complicated efforts to study the conservation status and population trends of bats, particularly within their natural habitats, which also provide invaluable research environments for understanding bat ecology, behavior, and population dynamics [10].
Traditional methods for studying bats typically involve the use of mist-nets, which are assembled using two poles, usually PVC tubes or long, thin tree branches measuring at least 4.5 m. These mist-nets are set up near pens or at the entrances of their natural shelters, such as caves [11,12]. This technique has been instrumental in identifying bat species and their geographical distribution, as well as in disease prevention and diagnosis activities carried out in countries with national campaigns for the epidemiological surveillance of public and animal health [12]. Despite these efforts, there remains a lack of robust data on chiropteran population size and trends within natural ecosystems. For bats, population size typically refers to the number of individuals roosting or perching in a particular cave [11]. These estimates have been mainly constructed from sampled data limited to the scope of chiropteran monitoring projects, resulting in an undersampling of reality [13] due to high time demands, insufficiently trained personnel for data analysis, limited researcher presence, disturbance to bat colonies, or poor accessibility to natural roosts [13]. These limitations directly affect the planning and implementation of diagnostic, prevention, and control strategies for zoonotic diseases impacting both human and animal health, particularly bat populations in Mexico. The current situation highlights the urgent need for innovative monitoring technologies that can operate effectively in constrained environments while minimizing disruption to fragile ecosystems [14].
Some semi-automatic algorithms for counting bats rely on infrared thermal imaging, which has long been used to detect and count bats in groups [15]. For example, in Texas, thermal imaging is employed to monitor bats as they exit their caves at night [16]. In Brazil, an advanced infrared thermal imaging system was developed to count large colonies of insectivorous Brazilian free-tailed bats (Tadarida brasiliensis), using statistical algorithms trained to accumulate bat counts in fractions of a second. To improve detection accuracy, researchers constructed a background tower between the camera and the bats’ flight path [17]. The use of trained Artificial Intelligence (AI) algorithms has emerged as a non-invasive and effective tool for bat monitoring, offering speed, affordability, and improved accuracy. For example, acoustic libraries were used to train algorithms capable of predicting bat species in forests of Uruguay, allowing the identification of 662 sound pulses of 10 different bat species [18]. Another study in Bulgaria used deep learning algorithms to detect bat echolocation sounds, enabling real-time estimation of bat abundance and presence in acoustically complex environments with a low number of false positives [19].
In [20], the authors developed and deployed custom camera traps equipped with mirrorless digital cameras and external white flashes, triggered by infrared light barriers, to monitor bats at the entrances of underground hibernation sites such as caves or mines. This setup ensures high-quality, minimally invasive imaging, allowing for accurate species-level identification without disturbing bat behavior. The traps are strategically positioned 2–3 m from entrances at angles of 45° to optimize image capture. To automate analysis, the authors introduced BatNet, a deep learning-based tool for detecting, segmenting, and classifying bats in images. BatNet was trained to identify 13 European bat species and can be adapted to other species or regions. The tool supports ecological studies by providing data on bat counts, species diversity, and activity patterns. Its high image consistency and manual depth-of-field focusing improve classification accuracy and monitoring reliability. The authors, in [21], employed high-sensitivity stereo cameras (HAS-U2M digital high-speed cameras manufactured by DITECT in Japan) to monitor bats emerging from a cave. Two cameras were positioned at the entrance, recording at 60 fps and 2592 × 2048 pixels, with infrared illumination to improve nighttime visibility. The system captured continuous footage for approximately 30 min, limited by the storage capacity of the built-in solid-state drive (SSD) in the connected personal computer. This setup enabled 3D reconstruction of bat flight paths and was deployed outdoors at the cave entrance to study the eastern bent-wing bat (Miniopterus fuliginosus). The focus was on counting individuals and estimating population size based on behavioral classification (e.g., entering vs. exiting). Species identification was not addressed. The study applied computer vision techniques for automatic tracking and behavioral analysis of collective flight patterns.
In [22,23], the authors introduced a novel bat point count method that integrates thermal imaging, ultrasound recording, and near-infrared (NIR) photography to detect and identify flying bats in the field. This multisensor setup was deployed outdoors at various sites in Sumatra, Indonesia, particularly near streams and roads within oil palm plantations, to maximize species detection. Thermal scopes were used to detect bats based on their heat signatures. Ultrasound recorders captured echolocation calls for acoustic identification. NIR cameras recorded bat morphology, enabling visual classification. The integration of these three sensor types allowed for effective species identification and population counting without disturbing the animals. To support this process, the authors developed a custom identification key based on both acoustic and morphological features. The work presented in [24] employs a range of video-based technologies—including low-resolution GoPro cameras, high-resolution cameras, infrared (IR), and thermal cameras—to detect and monitor bats outdoors near roost entrances. Videos were recorded from stationary viewpoints, capturing bats as they emerged against backgrounds such as the sky or cave edges. Recent advances in near- and far-infrared imaging have significantly enhanced monitoring capabilities. The primary objective of the study was to count rather than identify species. The study focused on developing methods to accurately estimate the number of bats in video frames, particularly in scenes with overlapping individuals, using deep learning techniques. Specifically, convolutional neural networks (CNNs) were trained to classify the number of animals (bats, birds, or fish) in regions of interest within video frames, using count-range categories (e.g., 0 to >10). The authors also applied computer vision methods such as background subtraction (via median pixel values) to detect moving objects. Additionally, synthetic image generation was used to augment training datasets and address class imbalance in high-count categories.
The authors, in [25], developed a low-cost, automated system to detect, count, and analyze bats using standard RGB GoPro cameras (models 4, 6, and 7), deployed outdoors in Kasanka National Park, Zambia. Ten cameras were positioned around a colony of straw-colored fruit bats (Eidolon helvum), recording their evening emergence in low-light conditions. Instead of relying on expensive thermal cameras, the study leveraged computer vision and deep learning to process visible light video. The primary goal was to count bats and extract behavioral data, rather than to identify the species. The proposed technique included four key processes. The first process, called Deep Learning and Semantic Segmentation, uses a UNet-based convolutional neural network (CNN) that segments bats from the background in each frame, even under challenging lighting conditions. The second process is Background Subtraction and Data Augmentation. The third, a Tracking Algorithm, allowed bats to be tracked across frames using the Hungarian algorithm to build individual movement trajectories. The fourth, Behavioral Analysis, estimated wingbeat frequency, flight altitude, and group polarity from video data. Finally, in a validation and correction step, human-labeled data were used to validate and refine the results, especially for wingspan and altitude estimation.
In [26], the authors present a novel approach for wild bat detection and species identification using an enhanced object detection model, WB-YOLO, based on YOLOv7. Two types of photographic methods were employed to build the dataset: aerial high-resolution photography and handheld macro-imaging using a ViVO X90 smartphone. Images were captured across various natural habitats in Anhui Province, including mountainous areas, caves, and building crevices, under both natural and artificial lighting conditions. The image acquisition setup enabled diverse perspectives, improving the detection of bats in complex outdoor environments. The focus of the research was species identification rather than individual counting, addressing the challenges of detecting bats in occluded and cluttered scenes. WB-YOLO integrates several advanced deep learning techniques to improve detection performance and reliability. In [27], the authors used Doppler weather radar data from NOAA’s NEXRAD system to detect and monitor Mexican free-tailed bats (Tadarida brasiliensis) across large landscapes. The study introduced a novel algorithm called Bat-Aggregated Time Series (BATS), which employs a feed-forward artificial neural network to identify bat presence in radar imagery. The NEXRAD radar tower (KDAX), located approximately 7 km southeast of Davis, California, served as the remote sensing instrument. Operating outdoors and at a distance, the system enabled landscape-scale monitoring of bat foraging activity, rather than localized roost surveillance. Key features of the approach include: (a) Machine Learning Detection: The BATS algorithm processes radar data to classify bat presence (binary) and estimate relative foraging intensity across space and time. (b) Remote and Scalable Monitoring: The system tracks bat activity patterns without requiring direct field observation or individual counts. (c) Focus on Foraging Distributions: The goal was to map bat foraging behavior and relative density, not species identification or population estimation.
In this paper, we present a portable prototype designed for the acquisition of multimodal data in challenging underground environments. The system is capable of capturing near-infrared (NIR) images, ultrasonic audio, 3D structural data, and RGB imagery. Leveraging these capabilities, we focus on NIR imaging and propose an approach for bat detection and counting. To evaluate detection performance, we compared three versions of the YOLO object detection framework, each with different model configurations. We identified the best-performing model for automated bat monitoring using the portable near-infrared imaging system developed by our team.

2. Materials and Methods

Figure 1 illustrates the methodology. The development process began with the design and construction of the first prototype of a multisensor platform [28]. This initial version was tested in the field to assess its functionality and identify areas for improvement. Based on these tests, a second version of the platform was developed with several key enhancements. We integrated a touchscreen interface to improve user interaction, repositioned the infrared light projectors above the cameras and fixed them to ensure consistent illumination, added a high-frequency sound recorder, and refined the overall design to improve portability for field deployment (see Figure 2 and Figure 3). Following these hardware improvements, a new series of data acquisition tests was conducted. For this study, we focused specifically on the data collected from the near-infrared cameras. Multiple versions of the YOLO (You Only Look Once) object detection algorithm were implemented to detect bats in the captured imagery. We then evaluated and compared the performance of each YOLO version in terms of detection efficiency.

2.1. Multisensor Platform Description

We developed a portable multisensor platform for bat monitoring and cave exploration, integrating near-infrared (NIR) cameras, infrared projectors, RGB cameras, a microphone capable of capturing high-frequency audio up to 128 kHz, and an active infrared (IR) stereo camera, the Intel RealSense D435i (see Figure 2). The system is battery-powered and supports up to six hours of continuous operation. It is optimized for lightweight transport and deployment in confined environments, such as caves in Guanajuato and San Luis Potosí, Mexico.
The platform is designed to record the following:
  • Stereo NIR images;
  • Depth information produced by Intel RealSense D435i;
  • High-frequency audio signals.
Figure 3 shows the internal connections and components of the imaging unit. The  multisensor platform consists of the following modules:
1.
Main Imaging Unit
  • Equipped with two Arducam IR cameras (60 FPS) mounted 8 cm apart using a custom 3D-printed frame for stereo image acquisition.
  • Includes an array of NIR LED lights (wavelength: 890 nm).
  • Features an 8x11-inch touchscreen interface.
  • Captures high-frequency audio signals.
  • Raspberry Pi 4 Model B+ (8GB RAM), which serves as the onboard computer. The Raspberry Pi facilitates the implementation of lightweight, low-cost, and low-power multisensor platforms [29].
  • The Active Infrared (IR) stereo camera, Intel RealSense D435i, is a compact and lightweight device. It projects IR patterns onto the scene and uses a pair of global shutter sensors to compute depth information via stereo triangulation, enabling robust performance even in low-texture or low-light conditions.
2.
Power and Regulation Unit
  • Powered by two lithium-polymer (LiPo) batteries (9.6 V and 5 V).
    5 V for the Raspberry Pi.
    9.6 V for IR LED illuminators.
  • One battery powers the Raspberry Pi + Arducam cameras, touchscreen, and Intel RealSense D435i, while the second battery exclusively powers the IR LED array.
This modular design allows for safe handling and flexible setup in confined environments.
Although the system supports multimodal data capture, this study focuses exclusively on monocular NIR imaging to evaluate various versions of the YOLO algorithm for bat detection. Both near-infrared (NIR) stereo images are acquired following the procedure described in Algorithm 1. These images are subsequently used to train and evaluate the detection algorithm based on the YOLO framework. Notably, the left and right cameras do not share identical optical characteristics or viewpoints. This variation in optics and perspective allows the same scene to be captured with slight differences or disturbances, thereby introducing greater diversity to the dataset. Such diversity is a critical factor in improving the robustness and generalization capabilities of the training process.
The dataset was randomly divided into 70% for training and 30% for validation. An independent test set was not created, as the scope of this work focused on model training and validation for performance comparison under consistent conditions.
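For reference, the 70/30 split can be reproduced with a short script such as the following sketch; the folder layout, file extension, and random seed are illustrative assumptions rather than the exact script used in this work:

```python
import random
import shutil
from pathlib import Path

random.seed(42)  # fixed seed so the split is reproducible

# Assumed layout: each NIR image has a YOLO-format .txt label file with the same stem.
src = Path("dataset/labeled")
dst = Path("dataset")
images = sorted(src.glob("*.png"))
random.shuffle(images)

n_train = int(0.7 * len(images))
splits = {"train": images[:n_train], "val": images[n_train:]}

for split, files in splits.items():
    (dst / split / "images").mkdir(parents=True, exist_ok=True)
    (dst / split / "labels").mkdir(parents=True, exist_ok=True)
    for img in files:
        shutil.copy(img, dst / split / "images" / img.name)
        label = img.with_suffix(".txt")
        if label.exists():  # skip background-only frames without annotations
            shutil.copy(label, dst / split / "labels" / label.name)
```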
Algorithm 1 Stereo image acquisition for bat counting.
1:  ALGORITHM Double capture camera
2:  Saving_Directory ← path to image storage
3:  fps ← frames per second (user-specified capture rate)
4:  timestamp ← current date
5:  right_folder ← Saving_Directory + "/right_" + timestamp        ▹ Create folder for saving right frames
6:  left_folder ← Saving_Directory + "/left_" + timestamp          ▹ Create folder for saving left frames
7:  Camera initialization
8:  for each camera do
9:      capture_object ← video_capture_methods
10:     cap1, cap2 ← capture_object
11:     Threading_NewThread(capturing frames thread)                ▹ Start new thread to capture frames
12: end for
13: if start_recording = TRUE then
14:     is_recording ← TRUE
15:     last_save_time ← current time
16:     Threading_NewThread(saving frames thread)                   ▹ Start new thread to save frames
17:     while is_recording = TRUE do
18:         current_time ← current time
19:         if current_time − last_save_time ≥ 1/fps then
20:             cap1, cap2 ← captures from both cameras
21:             frame_count ← frame_count + 1
22:             SAVE right_folder ← cap1 + frame_count
23:             SAVE left_folder ← cap2 + frame_count
24:             last_save_time ← current time
25:         end if
26:     end while                                                   ▹ Exits when the user pushes the Stop Recording button
27: end if
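For illustration, a minimal Python sketch of Algorithm 1 using OpenCV and one capture thread per camera is given below; the camera indices, directory names, and saving rate are assumptions that would need to be adapted to the actual Arducam setup on the Raspberry Pi:

```python
import time
import threading
from datetime import datetime
from pathlib import Path

import cv2

SAVE_DIR = Path("captures")        # assumed image storage path
FPS = 60                           # user-specified capture rate
SAVE_INTERVAL = 1.0 / FPS          # seconds between saved frame pairs

timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
right_dir = SAVE_DIR / f"right_{timestamp}"
left_dir = SAVE_DIR / f"left_{timestamp}"
right_dir.mkdir(parents=True, exist_ok=True)
left_dir.mkdir(parents=True, exist_ok=True)

# Camera indices 0 and 1 are assumptions; they depend on how the cameras enumerate.
cap_right, cap_left = cv2.VideoCapture(0), cv2.VideoCapture(1)
frames = {"right": None, "left": None}
is_recording = True

def grab(name, cap):
    """Continuously grab the latest frame from one camera (one thread per camera)."""
    while is_recording:
        ok, frame = cap.read()
        if ok:
            frames[name] = frame

for name, cap in (("right", cap_right), ("left", cap_left)):
    threading.Thread(target=grab, args=(name, cap), daemon=True).start()

frame_count = 0
last_save = time.time()
try:
    while is_recording:  # in the GUI this loop ends when Stop Recording is pushed
        now = time.time()
        if now - last_save >= SAVE_INTERVAL and all(f is not None for f in frames.values()):
            frame_count += 1
            cv2.imwrite(str(right_dir / f"{frame_count:06d}.png"), frames["right"])
            cv2.imwrite(str(left_dir / f"{frame_count:06d}.png"), frames["left"])
            last_save = now
except KeyboardInterrupt:
    pass
finally:
    is_recording = False
    cap_right.release()
    cap_left.release()
```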

2.2. Semi-Automatic Labeling of Images

Stereo vision has emerged as a powerful technique for extracting three-dimensional (3D) information from two-dimensional (2D) images, enabling precise depth estimation and spatial analysis across a wide range of applications. For each stereo image, the background is estimated using a pixel appearance model. A critical step in this process is the initialization of the background model [30]. To achieve this, we adopted a strategy based on computing the appearance model as the median of a selected set of images [31]. Let $I(\mathbf{x}, t)$ represent the temporal image acquired at time $t$, where $\mathbf{x}$ denotes a pixel position in the image. In practice, the observed image is $J(\mathbf{x}, t)$, a noisy version of $I(\mathbf{x}, t)$, and can be modeled as
$$J(\mathbf{x}, t) = I(\mathbf{x}, t) + \delta(\mathbf{x}, t),$$
where $\delta(\mathbf{x}, t)$ is assumed to be a zero-mean Gaussian random variable representing noise. We assume that changes in illumination occur gradually over time, primarily due to variations in natural daylight. This assumption excludes scenarios involving abrupt or drastic changes in lighting conditions. In this work, we assume the background is free of foreground objects. Therefore, the observed intensity variations can be modeled using a single Gaussian distribution. Let the Gaussian process be defined as follows:
$$g(\mathbf{x}; \mu_k, \Sigma_k) = \frac{1}{2\pi\,|\Sigma_k|^{\frac{1}{2}}} \exp\!\left(-\frac{1}{2}\,(I(\mathbf{x}) - \mu_k)^{T}\,\Sigma_k^{-1}\,(I(\mathbf{x}) - \mu_k)\right),$$
where $\mu_k$ and $\Sigma_k$ are, respectively, the mean and the covariance matrix. In the case of near-infrared images, both $\mu_k$ and $\Sigma_k$ reduce to scalars. When a new observation $I(\mathbf{x}, t)$ becomes available, it is compared against the parameters of the Gaussian model. If
$$\|I(\mathbf{x}) - \mu_k\|_{2} \leq \alpha\,\Sigma_k,$$
where $\|\cdot\|_{2}$ denotes the norm operator and $\alpha$ is a constant that may vary depending on the spatial position $\mathbf{x}$, then the observation is assumed to be the result of a small perturbation from the true value, which would otherwise closely match the model prediction. The Gaussian parameters are updated over time using the online Expectation–Maximization (EM) algorithm [32]. Specifically,
$$\mu_k \leftarrow \rho\,\mu_k + (1 - \rho)\,I(\mathbf{x}, t), \qquad \Sigma_k^{2} \leftarrow \rho\,\Sigma_k^{2} + (1 - \rho)\,(I(\mathbf{x}, t) - \mu_k)(I(\mathbf{x}, t) - \mu_k)^{T},$$
where $\rho \in [0, 1]$ is the learning rate. Bat detection corresponds to identifying the foreground. To extract the foreground, we calculate the difference between the current image and the background model. If the absolute difference is greater than a threshold $v$, the corresponding pixels are classified as foreground. If the number of neighboring pixels detected as variations exceeds $\epsilon$, the images before and after the detection are manually reviewed to confirm and label the presence of bats.
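The labeling pipeline above can be sketched as follows: median initialization of the appearance model, online update with learning rate ρ, and foreground thresholding. The threshold values and the use of connected components to count neighboring foreground pixels are illustrative assumptions, not the exact implementation:

```python
import cv2
import numpy as np

def init_background(frames):
    """Initialize the appearance model as the per-pixel median of a set of NIR frames."""
    stack = np.stack([f.astype(np.float32) for f in frames], axis=0)
    return np.median(stack, axis=0)

def update_background(bg, frame, rho=0.8):
    """Online update of the background mean (scalar per pixel for NIR images)."""
    return rho * bg + (1.0 - rho) * frame.astype(np.float32)

def bat_candidate(frame, bg, v=30.0, eps=25):
    """Flag a frame when enough connected foreground pixels exceed the threshold v."""
    diff = np.abs(frame.astype(np.float32) - bg)
    mask = (diff > v).astype(np.uint8)
    # Connected components approximate the "neighboring pixels" criterion.
    _, _, stats, _ = cv2.connectedComponentsWithStats(mask, connectivity=8)
    sizes = stats[1:, cv2.CC_STAT_AREA]  # skip the background component
    return bool(np.any(sizes > eps)), mask

# Flagged frames (and their neighbors) are then reviewed and labeled manually.
```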

2.3. Algorithms for Detecting Bats

YOLO (You Only Look Once) models rely on a unified detection framework that predicts both bounding boxes and class probabilities through a single regression pass. The authors have extensive experience implementing the YOLO framework across diverse applications, which guided the design and optimization of the present study [33,34]. A key advantage of YOLO is its speed, enabling real-time detection using standard hardware. YOLO operates on global image features and performs object detection through the following main steps:
1.
Grid-based prediction. The input image I is divided into an M × M grid. Each grid cell is responsible for detecting objects whose center falls within it.
2.
Bounding box regression. The model uses a regression approach to estimate bounding boxes, which are rectangles enclosing detected objects. The output vector for each prediction is defined as
Y = [p, x, y, h, w, c]
where the following are used:
  • p is the confidence score (range: 0–1) indicating the presence of an object in the cell;
  • x, y are the coordinates of the center of the bounding box relative to the grid cell;
  • h, w denote the height and width of the bounding box;
  • c is the class label among n predefined categories.
3.
Non-Maximum Suppression (NMS). Since multiple overlapping boxes may be predicted for the same object, NMS is applied to retain only the most confident detection, eliminating redundant predictions.
While YOLO is widely recognized for its efficiency and simplicity, it also presents certain limitations. Among the most notable challenges are difficulties in detecting small objects and a tendency to generate false positives in complex backgrounds. Earlier versions, in particular, were prone to localization errors. To address overlapping detections, YOLO employs Non-Maximum Suppression (NMS), which retains only the predictions with the highest confidence scores. Despite these drawbacks, recent iterations have progressively improved the framework. In this work, we compare three of the latest YOLO models—YOLOv10 [35], YOLOv11 [36], and YOLOv12 [37]—which are described in the following section.
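For reference, a minimal NumPy sketch of IoU-based NMS is shown below; it illustrates the suppression step conceptually and is not the implementation used inside the YOLO frameworks:

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an (N, 4) array of boxes, all as [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes, scores, iou_thr=0.5):
    """Keep the highest-scoring boxes and drop overlapping predictions above iou_thr."""
    boxes = np.asarray(boxes, dtype=float)
    order = np.argsort(np.asarray(scores, dtype=float))[::-1]
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        if order.size == 1:
            break
        rest = order[1:]
        order = rest[iou(boxes[best], boxes[rest]) < iou_thr]
    return keep
```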
The architectures illustrated in Figure 4 (YOLOv10) [35], Figure 5 (YOLOv11) [36], and Figure 6 (YOLOv12) [37] represent the progressive evolution of real-time object detection models, while preserving the core YOLO philosophy of balancing speed and accuracy. YOLOv10 emphasizes computational efficiency through a lightweight architecture, making it well-suited for deployment on resource-constrained devices. YOLOv11 introduces spatial attention mechanisms and dynamic convolutions, enhancing detection accuracy in complex scenarios. YOLOv12 further advances the framework by integrating transformer-based modules, substantially improving performance in multi-class and overlapping-object environments. Together, these versions illustrate a clear trajectory toward more robust and adaptable models, achieved without sacrificing inference speed.
Table 1 summarizes the hyperparameters used during training for YOLOv10b, YOLOv11n, and YOLOv12s. Most parameters were kept constant across models, including learning rate, optimizer, momentum, and image size, ensuring comparability. Only the batch size and number of training epochs were explicitly defined for each model. In the case of YOLOv12s, a smaller batch size was used due to GPU memory limitations.
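For illustration, a minimal training sketch using the Ultralytics Python API is given below; the checkpoint name, dataset YAML, and hyperparameter values are placeholders, and the actual settings are those reported in Table 1:

```python
from ultralytics import YOLO

# Hypothetical dataset config pointing to the 70/30 NIR split with a single "bat" class.
DATA_YAML = "bats_nir.yaml"

model = YOLO("yolo11n.pt")          # checkpoint name follows the Ultralytics convention
model.train(
    data=DATA_YAML,
    epochs=100,                     # placeholder; per-model values are listed in Table 1
    batch=16,                       # reduced for the larger variants due to GPU memory limits
    imgsz=640,
    optimizer="SGD",
    lr0=0.01,
    momentum=0.937,
)
metrics = model.val()               # precision, recall, and mAP on the validation split
```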

2.4. Evaluation Metrics

Precision represents the proportion of true positives relative to the total number of positive predictions made by the model. In other words, it indicates the percentage of detected objects that are actually correct. This metric is defined by Equation (6).
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
Recall, in contrast to precision, measures the proportion of true positives relative to the total number of actual objects present in the image. It reflects how many of the existing objects were correctly detected by the model. This metric is defined in Equation (7).
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
Mean average precision (mAP) is a widely used metric for evaluating the performance of object detection models. By incorporating both precision and recall, it offers a comprehensive measure of the model’s effectiveness in accurately identifying objects.
Here, $TP$ (true positives) refers to correctly detected bats; $FP$ (false positives) represents incorrect detections, i.e., instances where the model identifies a “bat” that is not present; and $FN$ (false negatives) corresponds to missed objects, i.e., actual bats present in the image that the model fails to detect.
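As a simple illustration, precision, recall, and the derived F1-score can be computed directly from the counted TP, FP, and FN; the example counts below are hypothetical:

```python
def detection_metrics(tp, fp, fn):
    """Precision, recall, and F1-score from true/false positives and false negatives."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Hypothetical counts: 970 correctly detected bats, 20 false alarms, 30 missed bats.
p, r, f1 = detection_metrics(tp=970, fp=20, fn=30)
print(f"precision={p:.3f}  recall={r:.3f}  F1={f1:.3f}")
```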

3. Results

The video sequences were shot at 60 frames per second (FPS) in caves located in Xilitla, San Luis Potosí, Mexico, an area with a high incidence of rabies. A total of six caves were visited. In the first cave, the cameras captured a colony of Diphylla ecaudata. Adjacent to this main cave, a smaller cave revealed a group of Desmodus rotundus, including one particularly clear specimen that might serve as a morphological reference in the NIR imagery. Two of the other four caves were not inhabited by bats. However, one of the caves was inhabited by at least four unidentified bat specimens. The final cave contained a large colony of Myotis occultus. This cave produced excellent sequences of bats crossing in front of the device, but it also demonstrated some of the system’s limitations. The illumination provided by our IR LED array was insufficient to reach the deepest part of the cave’s ceiling, where some of the animals were roosting, delivering imagery that was good for neural network training but insufficient to perform counting over the whole colony.

3.1. NIR Camera Sensitivity

In this section, we evaluate the camera’s sensitivity in capturing bat imagery. Using the acquisition system described in Section 2.1, we recorded images at varying distances from the subject and computed the corresponding gray-level intensities. Conducting this experiment with a live bat posed practical challenges in controlling the conditions; therefore, we used a stuffed bat as a surrogate, placing it at different distances to assess the gray-level response. The results are presented in Figure 7.
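A minimal sketch of this measurement is shown below; the image file names and the region of interest containing the stuffed bat are placeholders:

```python
import cv2
import numpy as np

# Hypothetical file names: one NIR capture per camera-to-subject distance (in meters).
captures = {1.0: "nir_1m.png", 2.0: "nir_2m.png", 4.0: "nir_4m.png"}
roi = (slice(200, 280), slice(300, 380))  # placeholder region containing the stuffed bat

for distance, path in sorted(captures.items()):
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    if img is None:
        continue  # skip missing files
    mean_gray = float(np.mean(img[roi]))  # average gray-level intensity inside the ROI
    print(f"{distance:.1f} m -> mean gray level {mean_gray:.1f}")
```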

3.2. Semi-Automatic Labeling of Images

Figure 8 presents a sample image acquired by the stereo system deployed inside a cave. In contrast, Figure 9 illustrates the bat detection obtained through background subtraction (as described in Section 2.2), using the parameters α = 30 and ρ = 0.8 from Equations (3) and (4), respectively. A sequence of n frames before and after each detection was analyzed to manually label the bats. Figure 10 shows another detection sequence. It is common for bats to go undetected in several frames; therefore, adjacent frames were also reviewed and manually labeled. Conversely, false positives occasionally occurred, and those frames were manually discarded. These semi-automatically labeled images were used to train various versions of the YOLO model.

3.3. Detection Using YOLO Frameworks

Table 2 presents a performance comparison among YOLOv10, YOLOv11, and YOLOv12 across various model configurations (b: base, l: large, m: medium, n: nano, s: small, x: extra-large). Among the YOLOv10 variants, the YOLOv10-m configuration achieves the highest mAP@50 score (0.970) and a recall of 0.966, indicating an excellent balance between detection accuracy and precision on the validation set. Meanwhile, YOLOv12-m and YOLOv11-l also demonstrate outstanding performance, with both precision and recall metrics exceeding 0.96. Overall, the “m” and “l” variants consistently deliver superior results across all three versions, with YOLOv12-m and YOLOv11-l achieving mAP@50 scores of 0.981 and 0.979, respectively. This performance can be attributed to their intermediate complexity and enhanced generalization capabilities. Typically, mAP@0.75 yields lower values than mAP@0.5 due to the stricter IoU threshold, while mAP@[0.5:0.95] is also lower than mAP@0.5 but closer to mAP@0.75, as it averages across multiple higher IoU thresholds. On standard COCO benchmarks, strong detectors achieve mAP@0.5 in the range of 0.6–0.8, mAP@0.75 between 0.4 and 0.6, and mAP@[0.5:0.95] around 0.35–0.55.
When focusing on overall accuracy and robustness under stricter IoU thresholds, YOLOv12-n and YOLOv11-n stand out as the most competitive models. However, when prioritizing global performance and stability, YOLOv12-m proves to be the most consistent, achieving the best balance among precision, recall, and mAP@50.
Figure 11 shows the precision–recall curves for the YOLOv10-m (blue), YOLOv11-l (yellow), and YOLOv12-m (green) models. All three models demonstrate outstanding performance, with areas under the curve (AUC-PR) exceeding 0.97. Specifically, YOLOv10-m achieves a maximum value of 0.970, while YOLOv11-l follows closely with 0.979, and YOLOv12-m slightly outperforms both with an AUC-PR of 0.981. These results highlight the models’ high detection capability and low false positive rates, validating their effectiveness for bat detection.
Figure 12 presents boxplots showing the distributions of the variables Weight and Height for bats in class 0, obtained through exploratory data analysis (EDA) on the training set. Both variables exhibit moderate dispersion within the interquartile range but also display a considerable number of outliers above the third quartile. For Weight, outliers reach up to 0.5, while for Height, extreme values extend to approximately 0.38. This pattern suggests the possible presence of subgroups within the class or significant variability in the body dimensions of individual bats. In both cases, the mean and median are closely aligned, indicating a relatively symmetric central distribution, albeit with a positive skew caused by the outliers. These findings are important for informing normalization strategies or anomaly detection prior to model training.
Figure 13 presents the histograms of Weight and Height for class bat. Both distributions show a clear positive skewness, with a concentration of values at the lower ranges and extended right tails. This pattern indicates that most bats display reduced weight and height, while individuals with higher measurements appear less frequently. The observed heterogeneity provides valuable insight into the intrinsic variability of the sample and represents a relevant factor to be considered in subsequent analyses and statistical modeling.
Figure 14 presents the detection results obtained using the YOLOv10m, YOLOv11l, and YOLOv12m models. The first row shows the manual annotations used as ground truth, while the second, third, and fourth rows display the predictions generated by the YOLOv10m, YOLOv11l, and YOLOv12m models, respectively.
The global Friedman test confirmed significant differences among YOLO variants for both precision (χ²(15) = 106.54, p < 0.001) and F1-score (χ²(15) = 104.31, p < 0.001), while no significant effects were observed for AP@0.75 or recall (p > 0.58). Post-hoc pairwise comparisons (Table 3) revealed that YOLOv11-m significantly outperformed YOLOv10-x and YOLOv10-n, with mean differences of 0.228 and 0.194, respectively, both within narrow confidence intervals and associated with moderate-to-large effect sizes (r ≈ 0.5). These two comparisons were highlighted because they were the only ones with Holm-adjusted p-values (p_adj) below the significance threshold (0.05), indicating genuine performance differences, while the remaining contrasts did not reach statistical significance. This indicates that although absolute detection rates (mAP) were comparable across models, YOLOv11-m provides more reliable predictions by maintaining higher precision and balanced performance across samples. The full set of pairwise comparisons is provided in Appendix A, reporting mean differences (Δ), 95% confidence intervals (95% CI), Holm-adjusted p-values (p_adj), and effect sizes (Effect).
To assess the statistical significance of differences among YOLO models, we conducted pairwise comparisons using the Wilcoxon signed-rank test. Each model pair was evaluated across four metrics: precision, recall, F1-score, and mAP@0.75. This procedure resulted in a total of 480 pairwise comparisons, covering all possible model combinations under each metric. Adjusted p-values were calculated using Holm’s method to control for multiple testing. The complete set of results is provided in Appendix A, including mean differences (Δ), 95% confidence intervals (95% CI), adjusted p-values (p_adj), and effect sizes.
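A minimal sketch of this testing procedure, assuming per-image metric values are available for each model on the same evaluation images, is given below; it uses SciPy's Wilcoxon signed-rank test and Holm correction from statsmodels:

```python
from itertools import combinations

import numpy as np
from scipy.stats import wilcoxon
from statsmodels.stats.multitest import multipletests

def pairwise_wilcoxon(per_image_metric):
    """per_image_metric: dict mapping model name -> array of per-image metric values."""
    pairs, pvals, deltas = [], [], []
    for a, b in combinations(sorted(per_image_metric), 2):
        xa = np.asarray(per_image_metric[a], dtype=float)
        xb = np.asarray(per_image_metric[b], dtype=float)
        _, p = wilcoxon(xa, xb)          # paired, non-parametric comparison
        pairs.append(f"{a} vs {b}")
        pvals.append(p)
        deltas.append(float(np.mean(xa - xb)))
    # Holm adjustment controls the family-wise error rate over all comparisons.
    _, p_adj, _, _ = multipletests(pvals, method="holm")
    return list(zip(pairs, deltas, p_adj))
```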
We conducted a qualitative error analysis for three representative models (YOLOv10-n, YOLOv11-m, and YOLOv12-m). Table 4 reports the total number of false positives (FP) and false negatives (FN) relative to the number of evaluated images. Figure 15 shows typical failure modes: FP (red) frequently occur over cave-rock edges and high-contrast textures that resemble bat contours; in some cases, overlapping red boxes indicate duplicate detections on the same background structure. Conversely, FN (blue) are mainly associated with partially occluded or low-contrast individuals. This qualitative assessment complements the quantitative metrics and clarifies where the detectors tend to fail under challenging cave scenes.

4. Discussion

YOLO architectures have demonstrated outstanding performance in the automatic detection of bats, particularly in scenarios involving small-sized individuals, rapid motion, and variable lighting conditions. As shown in Figure 11, the models YOLOv12m, YOLOv11l, and YOLOv10m achieve precision and recall rates exceeding 97%, with YOLOv12m standing out as the most accurate, reaching a mean average precision (mAP) of 0.981.
This robust detection capability in natural images enables effective application in ecological monitoring, automated species counting, and real-time animal behavior studies, even in complex and uncontrolled environments. The findings reinforce YOLO’s potential as a powerful tool for the automated study of wildlife, provided it is supported by rigorous data preprocessing and proper dataset curation. Table 5 summarizes the performance of the proposed method in comparison with existing state-of-the-art approaches.
In [38], the authors employed acoustic-based methods for bat identification. They developed a software program, Waveman (version 4), and constructed a reference library comprising more than 880 audio files from 36 Asian bat species. The software integrated a novel neural network, BatNet, along with a re-checking strategy (ReChk) designed to maximize accuracy. Their approach to library preparation and the use of ReChk significantly improved sensitivity while reducing the false positive rate, particularly when tested on 15 species with more detailed and contextually diverse records. BatNet was successfully applied to identify Hipposideros larvatus and Rhinolophus siamensis across three different environments. In contrast, our study focuses on the use of NIR imagery for bat identification through YOLO-based frameworks, which yielded very promising results. Moreover, our multisensor platform has been designed to incorporate acoustic methods in future work, with a specific emphasis on the identification of Desmodus rotundus, a species of special interest.
In [20], the authors developed camera traps equipped with mirrorless cameras, infrared light barriers, and external flashes to monitor bats at underground hibernation sites. This non-invasive setup provided high-quality imaging for accurate species-level identification. They also introduced BatNet, a deep learning tool designed to detect, segment, and classify 13 European bat species, with the potential adaptability to other regions. In their setup, the controlled camera positioning and standardized entrance sizes of the hibernation sites allowed for uniform bat image capture. By contrast, our work involves camera placement that results in substantial variability in bat image size, as illustrated in Figure 8, Figure 9 and Figure 10.
In [21], high-sensitivity stereo cameras with infrared illumination were deployed at a cave entrance to monitor Miniopterus fuliginosus. The system recorded at 60 fps and a resolution of 2592×2048, enabling 3D reconstruction of flight paths and population estimates. However, recordings were limited to 30 min due to storage constraints. In contrast, our multiplatform system captures NIR stereo images at 60 fps over several hours, using cost-effective Arducam cameras. Unlike Fujioka’s outdoor configuration, our system is installed indoors within the roost. We adopted this stereo setup because training the YOLO framework benefits from incorporating two distinct views. Looking ahead, we plan to extend our platform with NIR stereo cameras for 3D trajectory reconstruction and to estimate bat dimensions—an essential step given the substantial morphological differences between Desmodus rotundus and frugivorous and insectivorous bat species.
In [26], the authors introduce WB-YOLO, an enhanced YOLOv7-based model for wild bat detection and species identification. Their dataset was constructed using two photographic methods: aerial high-resolution imaging and handheld macro photography with a ViVO X90 smartphone. Data were collected across diverse habitats in Anhui Province—including mountains, caves, and building crevices—under both natural and artificial lighting conditions. In contrast, our work evaluates YOLOv10, YOLOv11, and YOLOv12 using NIR images acquired indoors within the roost environment.
Unlike most studies that identify or count bats outdoors near their roost [20,21,22,23,24,25,26], our work, like [39], enables bat detection and counting within the roost using infrared illumination and a compact multisensor platform. However, [27] evaluated both indoor and outdoor roost environments.
The drawback of using thermal cameras, as in [22,23,24], is their prohibitively high cost compared with NIR Arducam systems like the one employed in this work. On the other hand, using RGB cameras, as in [20,24,25,26,39], requires image acquisition under appropriate lighting conditions. This either restricts recording to specific times of the day or, in indoor applications, necessitates illuminating the cave—an invasive practice that disturbs the bats.
While YOLOv12-m emerged as the most consistent model when considering overall performance and stability—achieving the best balance across precision, recall, and mAP@50—our statistical significance testing provided additional insights into specific metric-level differences. The Friedman test revealed significant effects for both precision and F1-score, and subsequent pairwise comparisons demonstrated that YOLOv11-m significantly outperformed YOLOv10-x and YOLOv10-n in precision, with medium-to-large effect sizes. These results suggest that, although YOLOv12-m can be considered the most reliable variant from a global standpoint, YOLOv11-m offers a clear advantage in terms of precision when compared with earlier YOLO versions. Taken together, the findings indicate that YOLOv12-m should be prioritized when stability across metrics is essential, whereas YOLOv11-m may be preferable for tasks where maximizing precision is critical.

5. Conclusions and Future Work

This research introduces a non-invasive, scalable, and zoonotically safe method for detecting and monitoring bat populations. The proposed approach enables accurate detection of flying bats in natural habitats, enhancing our ability to study their behavior, distribution, and population dynamics. We evaluate a YOLO-based bat detector for NIR imagery, using inexpensive cameras and low-cost IR LED illumination for data collection. Remarkably, despite the limited quality and quantity of the images, the model achieved excellent detection performance. While higher-end imaging devices and additional illuminators would likely improve feature identification, the current setup demonstrates that effective bat monitoring can be achieved with minimal resources, making this an excellent tool for field researchers and scientists.
During field visits to various caves, we observed considerable variation in the volume of bat roosts. In some cases, the developed platform provided adequate coverage; however, in others, the field of view of the multisensor platform was insufficient. To address this limitation, we plan to design multiple multisensor platforms that operate independently of the stereo system. Synchronization during data acquisition will be essential to generate panoramic NIR images capable of covering the entire roost. This approach presents several challenges. On one hand, the stereo setup of each multisensor platform is fixed and can be calibrated in the laboratory [40]. On the other hand, it is necessary to determine in situ the geometric transformation between the NIR images acquired by each platform to construct the panoramic view. To achieve this, we will explore two strategies: (i) leveraging environmental features captured by at least two platforms to estimate the transformation, or (ii) developing a portable calibration pattern for in situ use. Each platform will function as a slave device, while a master unit will be developed to synchronize acquisitions.
To avoid continuous image capture, future iterations will integrate ultrasonic sensors to detect bat activity and trigger image acquisition. This ultrasonic trigger must operate in real time, enabling synchronized recording of NIR imagery and acoustic data. Additionally, the stereo system can contribute to species identification by providing estimates of bat morphometric features and reconstructing flight trajectories [41]. This multimodal integration, combining acoustic signatures, morphometrics, and trajectory data, will enhance species identification and, more importantly, enable the detection and quantification of Desmodus rotundus. The importance of Desmodus rotundus in Mexico is critical, as it is a primary transmitter of rabies to cattle and, in some cases, to humans. We believe this strategy will allow us to expand the field of view, improve selectivity in stored data, and reduce power consumption, thereby extending the recording capacity of the system.
The potential applications of this work are broad and impactful:
  • Ecological Monitoring and Biodiversity Assessment: Our approach provides an unbiased tool for surveying flying bats, including species that evade traditional methods such as mist-netting or do not emit detectable echolocation calls. It enhances assessments of species richness, community composition, and temporal activity patterns across diverse landscapes.
  • Conservation: By enabling safe, continuous, and non-invasive observation of bat colonies and behaviors, the method supports long-term monitoring and conservation planning. It facilitates population size estimation, behavioral studies, and habitat use analysis, key to addressing threats such as habitat loss, climate change, and emerging diseases.
  • Zoonosis Risk Assessment: The elimination of direct bat handling reduces the risk of zoonotic disease transmission, making this method particularly suitable for surveillance programs targeting potential disease reservoirs while ensuring the safety of both wildlife and researchers.
  • Behavioral Ecology and Fundamental Research: Combining visual and acoustic data allows the study of complex behaviors such as flight maneuvers, social interactions, and echolocation in natural settings. It opens new avenues for investigating species-specific traits, resolving taxonomic ambiguities, and analyzing collective movement patterns.
  • Agricultural Pest Management: The ability to map bat foraging activity across landscapes can help identify areas of high ecological service potential, such as pest control in crop fields. This has implications for reducing pesticide use and promoting sustainable agricultural practices.
  • Engineering and Technological Innovation: Insights into bat navigation and group flight behavior can inform the development of bio-inspired algorithms for autonomous vehicles, swarm robotics, and artificial intelligence systems focused on coordinated group behavior.
  • Broader Biological Applications: The methodology can be adapted to monitor other flying animal aggregations, including migratory birds and insect swarms. It offers a flexible platform for ecological studies, biodiversity monitoring, and conservation efforts beyond bat populations.

Author Contributions

I.B.-R., investigation, methodology, writing (original draft); I.C.R., investigation, methodology, software, writing (original draft); A.R.-P., investigation, methodology, software, writing (original draft); R.R.-P., formal analysis, investigation; J.-J.G.-B., conceptualization, formal analysis, supervision, writing (original draft); E.-A.G.-B., investigation, writing (original draft); M.R.-R., investigation, writing (original draft). All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are available from the corresponding author upon reasonable request. Due to the fact that many of the images were captured on private property, they cannot be made publicly available.

Acknowledgments

We would like to thank the Comité Estatal para el Fomento y la Protección Pecuaria in Guanajuato and San Luis Potosí, Mexico, for their assistance with roost access and supervision. We also acknowledge the support of SIP-IPN and the Ministry of Science, Humanities, Technology, and Innovation (SECIHTI) through the National System of Researchers (SNII).

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Appendix A.1. Full Set of Pairwise Comparisons

Table A1. Part 1. Pairwise comparisons for all YOLO variants.
Comparison | Δ (mean) | 95% CI | p_adj | Effect
mAP@75
YOLOv10-l vs YOLOv10-n−0.0192[−0.1155, 0.0712]1.0000r = −0.050
YOLOv10-l vs YOLOv10-s−0.0524[−0.1473, 0.0424]1.0000r = −0.227
YOLOv10-l vs YOLOv10-x0.0084[−0.0756, 0.0917]1.0000r = 0.143
YOLOv10-l vs YOLOv11-l0.0000[−0.0825, 0.0818]1.0000r = −0.030
YOLOv10-l vs YOLOv11-m−0.0190[−0.1139, 0.0738]1.0000r = −0.048
YOLOv10-l vs YOLOv11-n−0.0061[−0.0970, 0.0840]1.0000r = −0.053
YOLOv10-l vs YOLOv11-s0.0091[−0.0887, 0.1046]1.0000r = 0.048
YOLOv10-l vs YOLOv11-x0.0611[−0.0351, 0.1550]1.0000r = 0.190
YOLOv10-l vs YOLOv12-l−0.0083[−0.0946, 0.0773]1.0000r = −0.059
YOLOv10-l vs YOLOv12-m0.0053[−0.0825, 0.0902]1.0000r = 0.000
YOLOv10-l vs YOLOv12-n0.0218[−0.0666, 0.1104]1.0000r = 0.050
YOLOv10-l vs YOLOv12-s0.0220[−0.0818, 0.1258]1.0000r = 0.130
YOLOv10-l vs YOLOv12-x−0.0187[−0.1163, 0.0774]1.0000r = −0.048
YOLOv10-m vs YOLOv10-n−0.0563[−0.1347, 0.0262]1.0000r = −0.226
YOLOv10-m vs YOLOv10-s−0.0895[−0.1681, −0.0113]1.0000r = −0.400
YOLOv10-m vs YOLOv10-x−0.0287[−0.1211, 0.0623]1.0000r = 0.026
YOLOv10-m vs YOLOv11-l−0.0371[−0.1296, 0.0539]1.0000r = −0.176
YOLOv10-m vs YOLOv11-m−0.0561[−0.1481, 0.0350]1.0000r = −0.200
YOLOv10-m vs YOLOv11-n−0.0432[−0.1250, 0.0387]1.0000r = −0.185
YOLOv10-m vs YOLOv11-s−0.0280[−0.1236, 0.0675]1.0000r = −0.053
YOLOv10-m vs YOLOv11-x0.0240[−0.0564, 0.1061]1.0000r = 0.097
YOLOv10-m vs YOLOv12-l−0.0454[−0.1242, 0.0312]1.0000r = −0.185
YOLOv10-m vs YOLOv12-m−0.0318[−0.1227, 0.0636]1.0000r = −0.125
YOLOv10-m vs YOLOv12-n−0.0153[−0.1072, 0.0759]1.0000r = −0.100
YOLOv10-m vs YOLOv12-s−0.0151[−0.1076, 0.0781]1.0000r = 0.000
YOLOv10-m vs YOLOv12-x−0.0559[−0.1454, 0.0375]1.0000r = −0.143
YOLOv10-n vs YOLOv10-s−0.0332[−0.1152, 0.0489]1.0000r = −0.212
YOLOv10-n vs YOLOv10-x0.0276[−0.0621, 0.1188]1.0000r = 0.135
YOLOv10-n vs YOLOv11-l0.0192[−0.0718, 0.1136]1.0000r = −0.029
YOLOv10-n vs YOLOv11-m0.0002[−0.0886, 0.0868]1.0000r = −0.059
YOLOv10-n vs YOLOv11-n0.0131[−0.0778, 0.1046]1.0000r = −0.029
YOLOv10-n vs YOLOv11-s0.0283[−0.0652, 0.1192]1.0000r = 0.056
YOLOv10-n vs YOLOv11-x0.0803[−0.0098, 0.1712]1.0000r = 0.200
YOLOv10-n vs YOLOv12-l0.0109[−0.0701, 0.0919]1.0000r = 0.000
YOLOv10-n vs YOLOv12-m0.0245[−0.0749, 0.1273]1.0000r = 0.050
YOLOv10-n vs YOLOv12-n0.0410[−0.0455, 0.1305]1.0000r = 0.000
YOLOv10-n vs YOLOv12-s0.0412[−0.0497, 0.1311]1.0000r = 0.118
YOLOv10-n vs YOLOv12-x0.0004[−0.0792, 0.0796]1.0000r = −0.037
YOLOv10-s vs YOLOv10-x0.0608[−0.0316, 0.1515]1.0000r = 0.317
YOLOv10-s vs YOLOv11-l0.0524[−0.0338, 0.1431]1.0000r = 0.294
YOLOv10-s vs YOLOv11-m0.0334[−0.0540, 0.1191]1.0000r = 0.118
YOLOv10-s vs YOLOv11-n0.0463[−0.0459, 0.1387]1.0000r = 0.135
YOLOv10-s vs YOLOv11-s0.0616[−0.0241, 0.1486]1.0000r = 0.200
YOLOv10-s vs YOLOv11-x0.1135[0.0202, 0.2070]1.0000r = 0.381
YOLOv10-s vs YOLOv12-l0.0441[−0.0301, 0.1214]1.0000r = 0.241
YOLOv10-s vs YOLOv12-m0.0578[−0.0341, 0.1509]1.0000r = 0.243
YOLOv10-s vs YOLOv12-n0.0742[−0.0199, 0.1672]1.0000r = 0.143
YOLOv10-s vs YOLOv12-s0.0744[−0.0075, 0.1584]1.0000r = 0.375
YOLOv10-s vs YOLOv12-x0.0337[−0.0500, 0.1182]1.0000r = 0.187
YOLOv10-x vs YOLOv11-l−0.0084[−0.1038, 0.0865]1.0000r = −0.128
YOLOv10-x vs YOLOv11-m−0.0274[−0.1202, 0.0662]1.0000r = −0.135
YOLOv10-x vs YOLOv11-n−0.0145[−0.1084, 0.0786]1.0000r = −0.135
YOLOv10-x vs YOLOv11-s0.0007[−0.0910, 0.0938]1.0000r = −0.027
YOLOv10-x vs YOLOv11-x0.0527[−0.0436, 0.1492]1.0000r = 0.073
YOLOv10-x vs YOLOv12-l−0.0167[−0.1152, 0.0803]1.0000r = −0.128
Table A2. Part 2. Pairwise comparisons for all YOLO variants.
Comparison | Δ (mean) | 95% CI | p_adj | Effect
YOLOv10-x vs YOLOv12-m−0.0031[−0.0962, 0.0916]1.0000r = −0.111
YOLOv10-x vs YOLOv12-n0.0134[−0.0818, 0.1079]1.0000r = −0.100
YOLOv10-x vs YOLOv12-s0.0136[−0.0842, 0.1121]1.0000r = 0.000
YOLOv10-x vs YOLOv12-x−0.0272[−0.1196, 0.0645]1.0000r = −0.111
YOLOv11-l vs YOLOv11-m−0.0190[−0.0975, 0.0578]1.0000r = −0.111
YOLOv11-l vs YOLOv11-n−0.0061[−0.1031, 0.0893]1.0000r = −0.030
YOLOv11-l vs YOLOv11-s0.0091[−0.0812, 0.0985]1.0000r = 0.059
YOLOv11-l vs YOLOv11-x0.0611[−0.0238, 0.1452]1.0000r = 0.226
YOLOv11-l vs YOLOv12-l−0.0083[−0.0780, 0.0598]1.0000r = 0.000
YOLOv11-l vs YOLOv12-m0.0053[−0.0757, 0.0833]1.0000r = 0.091
YOLOv11-l vs YOLOv12-n0.0218[−0.0621, 0.1049]1.0000r = 0.125
YOLOv11-l vs YOLOv12-s0.0220[−0.0736, 0.1152]1.0000r = 0.059
YOLOv11-l vs YOLOv12-x−0.0188[−0.1080, 0.0716]1.0000r = −0.103
YOLOv11-m vs YOLOv11-n0.0129[−0.0729, 0.1017]1.0000r = 0.000
YOLOv11-m vs YOLOv11-s0.0281[−0.0523, 0.1108]1.0000r = 0.154
YOLOv11-m vs YOLOv11-x0.0801[−0.0068, 0.1695]1.0000r = 0.257
YOLOv11-m vs YOLOv12-l0.0107[−0.0622, 0.0827]1.0000r = 0.130
YOLOv11-m vs YOLOv12-m0.0243[−0.0730, 0.1238]1.0000r = 0.105
YOLOv11-m vs YOLOv12-n0.0408[−0.0417, 0.1228]1.0000r = 0.097
YOLOv11-m vs YOLOv12-s0.0410[−0.0489, 0.1306]1.0000r = 0.200
YOLOv11-m vs YOLOv12-x0.0002[−0.0825, 0.0827]1.0000r = 0.000
YOLOv11-n vs YOLOv11-s0.0152[−0.0704, 0.1008]1.0000r = 0.071
YOLOv11-n vs YOLOv11-x0.0672[−0.0216, 0.1575]1.0000r = 0.294
YOLOv11-n vs YOLOv12-l−0.0022[−0.0871, 0.0841]1.0000r = 0.071
YOLOv11-n vs YOLOv12-m0.0114[−0.0794, 0.1023]1.0000r = 0.034
YOLOv11-n vs YOLOv12-n0.0279[−0.0526, 0.1100]1.0000r = 0.067
YOLOv11-n vs YOLOv12-s0.0281[−0.0719, 0.1257]1.0000r = 0.111
YOLOv11-n vs YOLOv12-x−0.0127[−0.1014, 0.0750]1.0000r = 0.032
YOLOv11-s vs YOLOv11-x0.0519[−0.0385, 0.1417]1.0000r = 0.176
YOLOv11-s vs YOLOv12-l−0.0174[−0.0901, 0.0522]1.0000r = −0.043
YOLOv11-s vs YOLOv12-m−0.0038[−0.0926, 0.0855]1.0000r = −0.091
YOLOv11-s vs YOLOv12-n0.0126[−0.0704, 0.0967]1.0000r = −0.032
YOLOv11-s vs YOLOv12-s0.0129[−0.0668, 0.0924]1.0000r = 0.083
YOLOv11-s vs YOLOv12-x−0.0279[−0.1047, 0.0494]1.0000r = −0.154
YOLOv11-x vs YOLOv12-l−0.0694[−0.1462, 0.0047]1.0000r = −0.333
YOLOv11-x vs YOLOv12-m−0.0557[−0.1478, 0.0351]1.0000r = −0.143
YOLOv11-x vs YOLOv12-n−0.0393[−0.1310, 0.0507]1.0000r = −0.189
YOLOv11-x vs YOLOv12-s−0.0391[−0.1300, 0.0486]1.0000r = −0.086
YOLOv11-x vs YOLOv12-x−0.0798[−0.1576, −0.0043]1.0000r = −0.286
YOLOv12-l vs YOLOv12-m0.0137[−0.0682, 0.0955]1.0000r = 0.077
YOLOv12-l vs YOLOv12-n0.0301[−0.0516, 0.1119]1.0000r = 0.032
YOLOv12-l vs YOLOv12-s0.0303[−0.0560, 0.1158]1.0000r = 0.103
YOLOv12-l vs YOLOv12-x−0.0104[−0.0898, 0.0671]1.0000r = −0.077
YOLOv12-m vs YOLOv12-n0.0164[−0.0803, 0.1122]1.0000r = 0.026
YOLOv12-m vs YOLOv12-s0.0167[−0.0788, 0.1122]1.0000r = 0.059
YOLOv12-m vs YOLOv12-x−0.0241[−0.1106, 0.0623]1.0000r = −0.143
YOLOv12-n vs YOLOv12-s0.0002[−0.0956, 0.0951]1.0000r = 0.027
YOLOv12-n vs YOLOv12-x−0.0405[−0.1254, 0.0440]1.0000r = 0.000
YOLOv12-s vs YOLOv12-x−0.0407[−0.1190, 0.0386]1.0000r = −0.154
YOLOv10-b vs YOLOv11-n−0.0326[−0.1242, 0.0605]1.0000r = −0.135
YOLOv10-b vs YOLOv11-m−0.0455[−0.1243, 0.0335]1.0000r = −0.187
YOLOv10-b vs YOLOv11-l−0.0265[−0.1106, 0.0569]1.0000r = −0.176
YOLOv10-b vs YOLOv10-x−0.0181[−0.1121, 0.0758]1.0000r = 0.026
YOLOv10-b vs YOLOv10-s−0.0790[−0.1653, 0.0104]1.0000r = −0.366
YOLOv10-b vs YOLOv10-n−0.0457[−0.1356, 0.0464]1.0000r = −0.158
YOLOv10-b vs YOLOv10-m0.0106[−0.0760, 0.0976]1.0000r = −0.029
YOLOv10-b vs YOLOv10-l−0.0265[−0.1137, 0.0621]1.0000r = −0.081
YOLOv10-l vs YOLOv10-m0.0371[−0.0622, 0.1341]1.0000r = 0.073
YOLOv10-b vs YOLOv12-x−0.0453[−0.1288, 0.0373]1.0000r = −0.176
YOLOv10-b vs YOLOv12-s−0.0045[−0.0962, 0.0878]1.0000r = 0.027
YOLOv10-b vs YOLOv12-n−0.0048[−0.0932, 0.0839]1.0000r = −0.100
YOLOv10-b vs YOLOv12-m−0.0212[−0.1099, 0.0674]1.0000r = −0.086
YOLOv10-b vs YOLOv12-l−0.0348[−0.1122, 0.0439]1.0000r = −0.103
Table A3. Part 3. Pairwise comparisons for all YOLO variants.
Comparison | Δ (mean) | 95% CI | p adj | Effect
YOLOv10-b vs YOLOv11-x0.0345[−0.0574, 0.1264]1.0000r = 0.150
YOLOv10-b vs YOLOv11-s−0.0174[−0.1075, 0.0720]1.0000r = −0.056
Metric: F1-score
YOLOv10-b vs YOLOv10-l0.0383[−0.0285, 0.1049]1.0000r = 0.270
YOLOv10-b vs YOLOv10-m−0.0032[−0.0774, 0.0695]1.0000r = 0.036
YOLOv10-b vs YOLOv10-n0.0279[−0.0327, 0.0872]1.0000r = 0.265
YOLOv10-b vs YOLOv10-s−0.0396[−0.1137, 0.0332]1.0000r = −0.151
YOLOv10-b vs YOLOv10-x0.0674[−0.0060, 0.1396]1.0000r = 0.297
YOLOv10-b vs YOLOv11-l−0.0982[−0.1749, −0.0214]1.0000r = −0.424
YOLOv10-b vs YOLOv11-m−0.1152[−0.1886, −0.0414]0.6416r = −0.429
YOLOv10-b vs YOLOv11-n−0.0735[−0.1495, 0.0023]1.0000r = −0.231
YOLOv10-l vs YOLOv11-x0.0084[−0.0643, 0.0800]1.0000r = 0.127
YOLOv10-l vs YOLOv11-s−0.1383[−0.2233, −0.0533]1.0000r = −0.351
YOLOv10-l vs YOLOv11-n−0.1118[−0.1889, −0.0353]1.0000r = −0.412
YOLOv10-l vs YOLOv11-m−0.1535[−0.2315, −0.0781]0.1555r = −0.472
YOLOv10-l vs YOLOv11-l−0.1365[−0.2036, −0.0694]0.0923r = −0.449
YOLOv10-l vs YOLOv10-x0.0292[−0.0335, 0.0910]1.0000r = 0.086
YOLOv10-l vs YOLOv10-s−0.0778[−0.1515, −0.0074]1.0000r = −0.342
YOLOv10-l vs YOLOv10-n−0.0103[−0.0742, 0.0538]1.0000r = 0.000
YOLOv10-l vs YOLOv10-m−0.0414[−0.1155, 0.0302]1.0000r = −0.143
YOLOv10-b vs YOLOv12-x−0.0767[−0.1513, −0.0032]1.0000r = −0.312
YOLOv10-b vs YOLOv12-s−0.0490[−0.1241, 0.0292]1.0000r = −0.302
YOLOv10-b vs YOLOv12-n0.0126[−0.0548, 0.0802]1.0000r = −0.015
YOLOv10-b vs YOLOv12-m−0.0670[−0.1406, 0.0056]1.0000r = −0.290
YOLOv10-b vs YOLOv12-l−0.0504[−0.1206, 0.0184]1.0000r = −0.207
YOLOv10-b vs YOLOv11-x0.0466[−0.0314, 0.1245]1.0000r = 0.159
YOLOv10-b vs YOLOv11-s−0.1000[−0.1866, −0.0138]1.0000r = −0.361
YOLOv10-l vs YOLOv12-l−0.0886[−0.1596, −0.0167]1.0000r = −0.333
YOLOv10-l vs YOLOv12-m−0.1053[−0.1733, −0.0365]1.0000r = −0.403
YOLOv10-l vs YOLOv12-n−0.0257[−0.0973, 0.0461]1.0000r = −0.127
YOLOv10-l vs YOLOv12-s−0.0872[−0.1705, −0.0058]1.0000r = −0.165
YOLOv10-l vs YOLOv12-x−0.1149[−0.1933, −0.0384]1.0000r = −0.359
YOLOv10-m vs YOLOv10-n0.0311[−0.0241, 0.0890]1.0000r = 0.258
YOLOv10-m vs YOLOv10-s−0.0364[−0.1016, 0.0282]1.0000r = −0.079
YOLOv10-m vs YOLOv10-x0.0706[0.0051, 0.1339]1.0000r = 0.343
YOLOv10-m vs YOLOv11-l−0.0950[−0.1771, −0.0108]1.0000r = −0.460
YOLOv10-m vs YOLOv11-m−0.1120[−0.1899, −0.0334]1.0000r = −0.433
YOLOv10-m vs YOLOv11-n−0.0704[−0.1402, −0.0006]1.0000r = −0.283
YOLOv10-m vs YOLOv11-s−0.0968[−0.1846, −0.0111]1.0000r = −0.267
YOLOv10-m vs YOLOv11-x0.0498[−0.0182, 0.1196]1.0000r = 0.200
YOLOv10-m vs YOLOv12-l−0.0472[−0.1116, 0.0180]1.0000r = −0.200
YOLOv10-m vs YOLOv12-m−0.0638[−0.1408, 0.0134]1.0000r = −0.258
YOLOv10-m vs YOLOv12-n0.0158[−0.0559, 0.0876]1.0000r = −0.079
YOLOv10-s vs YOLOv11-m−0.0756[−0.1509, −0.0005]1.0000r = −0.324
YOLOv10-s vs YOLOv11-n−0.0340[−0.1104, 0.0447]1.0000r = −0.303
YOLOv10-s vs YOLOv11-s−0.0604[−0.1444, 0.0250]1.0000r = −0.324
YOLOv10-s vs YOLOv11-x0.0862[0.0131, 0.1609]1.0000r = 0.195
YOLOv10-s vs YOLOv12-l−0.0108[−0.0785, 0.0588]1.0000r = −0.175
YOLOv10-s vs YOLOv12-m−0.0274[−0.1035, 0.0521]1.0000r = −0.224
YOLOv10-s vs YOLOv12-n0.0522[−0.0215, 0.1275]1.0000r = 0.015
YOLOv10-s vs YOLOv12-s−0.0094[−0.0812, 0.0644]1.0000r = −0.114
YOLOv10-s vs YOLOv12-x−0.0371[−0.1121, 0.0401]1.0000r = −0.233
YOLOv10-x vs YOLOv11-l−0.1656[−0.2397, −0.0902]0.0649r = −0.472
YOLOv10-x vs YOLOv11-m−0.1826[−0.2618, −0.1024]0.0276r = −0.514
YOLOv10-x vs YOLOv11-n−0.1410[−0.2150, −0.0667]0.4517r = −0.463
YOLOv10-x vs YOLOv11-s−0.1675[−0.2457, −0.0854]0.1412r = −0.541
YOLOv10-x vs YOLOv11-x−0.0208[−0.0902, 0.0472]1.0000r = −0.086
YOLOv10-x vs YOLOv12-l−0.1178[−0.1931, −0.0426]1.0000r = −0.397
YOLOv10-x vs YOLOv12-m−0.1344[−0.1975, −0.0718]0.0475r = −0.429
YOLOv10-m vs YOLOv12-s−0.0458[−0.1268, 0.0364]1.0000r = −0.242
Table A4. Part 4. Pairwise comparisons for all YOLO variants.
Comparison | Δ (mean) | 95% CI | p adj | Effect
YOLOv10-m vs YOLOv12-x−0.0735[−0.1477, 0.0021]1.0000r = −0.258
YOLOv10-n vs YOLOv10-s−0.0675[−0.1275, −0.0092]1.0000r = −0.333
YOLOv10-n vs YOLOv10-x0.0395[−0.0221, 0.1006]1.0000r = 0.206
YOLOv10-n vs YOLOv11-l−0.1261[−0.2004, −0.0516]1.0000r = −0.493
YOLOv10-n vs YOLOv11-m−0.1431[−0.2155, −0.0717]0.1640r = −0.493
YOLOv10-n vs YOLOv11-n−0.1015[−0.1691, −0.0325]1.0000r = −0.417
YOLOv10-n vs YOLOv11-s−0.1279[−0.2086, −0.0481]1.0000r = −0.444
YOLOv10-n vs YOLOv11-x0.0187[−0.0445, 0.0829]1.0000r = 0.000
YOLOv10-n vs YOLOv12-l−0.0783[−0.1382, −0.0179]1.0000r = −0.355
YOLOv10-n vs YOLOv12-m−0.0949[−0.1687, −0.0194]1.0000r = −0.333
YOLOv10-n vs YOLOv12-n−0.0153[−0.0749, 0.0447]1.0000r = −0.182
YOLOv10-n vs YOLOv12-s−0.0769[−0.1568, 0.0005]1.0000r = −0.342
YOLOv10-n vs YOLOv12-x−0.1046[−0.1677, −0.0401]0.7348r = −0.507
YOLOv10-s vs YOLOv10-x0.1070[0.0414, 0.1732]1.0000r = 0.361
YOLOv10-s vs YOLOv11-l−0.0586[−0.1382, 0.0248]1.0000r = −0.353
YOLOv11-m vs YOLOv12-l0.0648[0.0000, 0.1300]1.0000r = 0.440
YOLOv11-m vs YOLOv11-x0.1618[0.0871, 0.2388]0.0565r = 0.484
YOLOv11-m vs YOLOv11-s0.0152[−0.0615, 0.0946]1.0000r = 0.040
YOLOv11-m vs YOLOv11-n0.0417[−0.0339, 0.1180]1.0000r = 0.115
YOLOv11-l vs YOLOv12-x0.0215[−0.0641, 0.1078]1.0000r = 0.120
YOLOv11-l vs YOLOv12-s0.0492[−0.0345, 0.1297]1.0000r = 0.214
YOLOv11-l vs YOLOv12-n0.1108[0.0371, 0.1829]1.0000r = 0.439
YOLOv11-l vs YOLOv12-m0.0312[−0.0385, 0.1006]1.0000r = 0.149
YOLOv11-l vs YOLOv12-l0.0478[−0.0145, 0.1090]1.0000r = 0.395
YOLOv11-l vs YOLOv11-x0.1448[0.0670, 0.2228]0.3603r = 0.429
YOLOv11-l vs YOLOv11-s−0.0018[−0.0807, 0.0778]1.0000r = −0.083
YOLOv11-l vs YOLOv11-n0.0247[−0.0586, 0.1070]1.0000r = 0.074
YOLOv11-l vs YOLOv11-m−0.0170[−0.0919, 0.0564]1.0000r = −0.020
YOLOv10-x vs YOLOv12-x−0.1441[−0.2206, −0.0681]0.1637r = −0.507
YOLOv10-x vs YOLOv12-s−0.1164[−0.1950, −0.0367]1.0000r = −0.351
YOLOv10-x vs YOLOv12-n−0.0548[−0.1255, 0.0172]1.0000r = −0.205
YOLOv11-m vs YOLOv12-m0.0482[−0.0330, 0.1304]1.0000r = 0.143
YOLOv11-m vs YOLOv12-n0.1278[0.0524, 0.2023]0.4057r = 0.400
YOLOv11-m vs YOLOv12-s0.0662[−0.0091, 0.1428]1.0000r = 0.296
YOLOv11-m vs YOLOv12-x0.0385[−0.0343, 0.1126]1.0000r = 0.240
YOLOv11-n vs YOLOv11-s−0.0265[−0.1047, 0.0530]1.0000r = −0.222
YOLOv11-n vs YOLOv11-x0.1202[0.0504, 0.1892]0.4186r = 0.483
YOLOv11-n vs YOLOv12-l0.0232[−0.0457, 0.0920]1.0000r = 0.200
YOLOv11-n vs YOLOv12-m0.0065[−0.0704, 0.0835]1.0000r = 0.094
YOLOv11-n vs YOLOv12-n0.0861[0.0109, 0.1622]1.0000r = 0.276
YOLOv11-n vs YOLOv12-s0.0246[−0.0620, 0.1113]1.0000r = 0.129
YOLOv11-n vs YOLOv12-x−0.0032[−0.0784, 0.0718]1.0000r = 0.034
YOLOv11-s vs YOLOv11-x0.1467[0.0708, 0.2244]0.1271r = 0.410
YOLOv11-s vs YOLOv12-l0.0496[−0.0217, 0.1202]1.0000r = 0.400
YOLOv11-s vs YOLOv12-m0.0330[−0.0459, 0.1112]1.0000r = 0.216
YOLOv11-s vs YOLOv12-n0.1126[0.0356, 0.1911]1.0000r = 0.298
YOLOv11-s vs YOLOv12-s0.0511[−0.0271, 0.1285]1.0000r = 0.265
YOLOv12-s vs YOLOv12-x−0.0277[−0.1024, 0.0472]1.0000r = −0.102
YOLOv12-n vs YOLOv12-x−0.0893[−0.1663, −0.0157]1.0000r = −0.238
YOLOv12-n vs YOLOv12-s−0.0616[−0.1420, 0.0169]1.0000r = −0.061
YOLOv12-m vs YOLOv12-x−0.0097[−0.0849, 0.0644]1.0000r = 0.018
YOLOv12-m vs YOLOv12-s0.0180[−0.0603, 0.0965]1.0000r = 0.164
YOLOv12-m vs YOLOv12-n0.0796[0.0070, 0.1534]1.0000r = 0.200
YOLOv12-l vs YOLOv12-x−0.0263[−0.0958, 0.0417]1.0000r = −0.057
YOLOv12-l vs YOLOv12-s0.0014[−0.0696, 0.0718]1.0000r = −0.074
YOLOv12-l vs YOLOv12-n0.0630[−0.0069, 0.1334]1.0000r = 0.220
YOLOv12-l vs YOLOv12-m−0.0166[−0.0879, 0.0565]1.0000r = −0.069
YOLOv11-x vs YOLOv12-x−0.1233[−0.1904, −0.0580]0.1348r = −0.458
YOLOv11-x vs YOLOv12-s−0.0956[−0.1717, −0.0205]1.0000r = −0.311
Table A5. Part 5. Pairwise comparisons for all YOLO variants.
Comparison | Δ (mean) | 95% CI | p adj | Effect
YOLOv11-x vs YOLOv12-n−0.0340[−0.1039, 0.0356]1.0000r = −0.171
YOLOv11-x vs YOLOv12-m−0.1136[−0.1880, −0.0389]0.8720r = −0.377
YOLOv11-x vs YOLOv12-l−0.0970[−0.1649, −0.0305]1.0000r = −0.368
YOLOv11-s vs YOLOv12-x0.0233[−0.0458, 0.0923]1.0000r = 0.083
Metric: precision
YOLOv10-l vs YOLOv10-n−0.0006[−0.0657, 0.0638]1.0000r = 0.000
YOLOv10-l vs YOLOv10-s−0.0705[−0.1417, −0.0010]1.0000r = −0.342
YOLOv10-l vs YOLOv10-x0.0337[−0.0270, 0.0950]1.0000r = 0.086
YOLOv10-l vs YOLOv11-l−0.1689[−0.2357, −0.1018]0.0023r = −0.449
YOLOv10-l vs YOLOv11-m−0.1945[−0.2711, −0.1214]0.0019r = −0.472
YOLOv10-l vs YOLOv11-n−0.1380[−0.2146, −0.0640]0.3564r = −0.412
YOLOv10-l vs YOLOv11-s−0.1754[−0.2594, −0.0932]0.0361r = −0.351
YOLOv10-l vs YOLOv11-x−0.0114[−0.0842, 0.0597]1.0000r = 0.127
YOLOv10-l vs YOLOv12-l−0.1062[−0.1764, −0.0359]1.0000r = −0.333
YOLOv10-l vs YOLOv12-m−0.1344[−0.2042, −0.0660]0.0855r = −0.403
YOLOv10-l vs YOLOv12-n−0.0337[−0.1020, 0.0346]1.0000r = −0.127
YOLOv10-l vs YOLOv12-s−0.1243[−0.2075, −0.0432]1.0000r = −0.165
YOLOv10-l vs YOLOv12-x−0.1457[−0.2242, −0.0686]0.2337r = −0.359
YOLOv10-m vs YOLOv10-n0.0577[0.0026, 0.1151]1.0000r = 0.258
YOLOv10-m vs YOLOv10-s−0.0122[−0.0777, 0.0521]1.0000r = −0.062
YOLOv10-m vs YOLOv10-x0.0920[0.0294, 0.1533]1.0000r = 0.343
YOLOv10-m vs YOLOv11-l−0.1106[−0.1926, −0.0277]1.0000r = −0.460
YOLOv10-m vs YOLOv11-m−0.1362[−0.2154, −0.0556]0.4671r = −0.458
YOLOv10-m vs YOLOv11-n−0.0797[−0.1509, −0.0093]1.0000r = −0.283
YOLOv10-m vs YOLOv11-s−0.1171[−0.2043, −0.0321]1.0000r = −0.267
YOLOv10-m vs YOLOv11-x0.0469[−0.0231, 0.1171]1.0000r = 0.186
YOLOv10-m vs YOLOv12-l−0.0479[−0.1138, 0.0186]1.0000r = −0.200
YOLOv10-m vs YOLOv12-m−0.0761[−0.1530, 0.0024]1.0000r = −0.258
YOLOv10-m vs YOLOv12-n0.0246[−0.0448, 0.0947]1.0000r = −0.079
YOLOv10-m vs YOLOv12-s−0.0660[−0.1493, 0.0182]1.0000r = −0.262
YOLOv10-m vs YOLOv12-x−0.0874[−0.1629, −0.0123]1.0000r = −0.279
YOLOv10-n vs YOLOv10-s−0.0698[−0.1290, −0.0122]1.0000r = −0.333
YOLOv10-n vs YOLOv10-x0.0343[−0.0233, 0.0918]1.0000r = 0.206
YOLOv10-n vs YOLOv11-l−0.1683[−0.2418, −0.0953]0.0183r = −0.493
YOLOv10-n vs YOLOv11-m−0.1938[−0.2689, −0.1212]0.0009r = −0.493
YOLOv10-n vs YOLOv11-n−0.1374[−0.2071, −0.0671]0.0917r = −0.417
YOLOv10-n vs YOLOv11-s−0.1747[−0.2561, −0.0942]0.0232r = −0.444
YOLOv10-s vs YOLOv12-s−0.0539[−0.1281, 0.0233]1.0000r = −0.143
YOLOv10-s vs YOLOv12-n0.0368[−0.0338, 0.1095]1.0000r = 0.029
YOLOv10-s vs YOLOv12-m−0.0639[−0.1386, 0.0141]1.0000r = −0.235
YOLOv10-s vs YOLOv12-l−0.0358[−0.1034, 0.0349]1.0000r = −0.175
YOLOv10-s vs YOLOv11-x0.0591[−0.0154, 0.1356]1.0000r = 0.184
YOLOv10-s vs YOLOv11-s−0.1049[−0.1909, −0.0172]1.0000r = −0.324
YOLOv10-s vs YOLOv11-n−0.0676[−0.1436, 0.0099]1.0000r = −0.313
YOLOv10-s vs YOLOv11-m−0.1240[−0.2024, −0.0460]0.8329r = −0.353
YOLOv10-s vs YOLOv11-l−0.0985[−0.1779, −0.0172]1.0000r = −0.353
YOLOv10-s vs YOLOv10-x0.1041[0.0424, 0.1682]1.0000r = 0.361
YOLOv10-n vs YOLOv12-x−0.1450[−0.2114, −0.0790]0.0268r = −0.507
YOLOv10-n vs YOLOv12-s−0.1237[−0.2062, −0.0442]1.0000r = −0.342
YOLOv10-n vs YOLOv12-n−0.0330[−0.0900, 0.0229]1.0000r = −0.182
YOLOv10-n vs YOLOv12-m−0.1337[−0.2057, −0.0591]0.2575r = −0.333
YOLOv10-n vs YOLOv12-l−0.1056[−0.1669, −0.0447]0.6134r = −0.355
YOLOv10-n vs YOLOv11-x−0.0107[−0.0718, 0.0523]1.0000r = 0.000
YOLOv10-s vs YOLOv12-x−0.0752[−0.1539, 0.0057]1.0000r = −0.260
YOLOv10-x vs YOLOv11-l−0.2026[−0.2758, −0.1287]0.0004r = −0.472
YOLOv10-x vs YOLOv11-m−0.2282[−0.3096, −0.1451]0.0004r = −0.514
YOLOv10-x vs YOLOv11-n−0.1717[−0.2458, −0.0985]0.0145r = −0.463
YOLOv10-x vs YOLOv11-s−0.2090[−0.2885, −0.1285]0.0011r = −0.541
YOLOv10-x vs YOLOv11-x−0.0451[−0.1094, 0.0178]1.0000r = −0.086
YOLOv10-x vs YOLOv12-l−0.1399[−0.2156, −0.0658]0.2277r = −0.397
Table A6. Part 6. Pairwise comparisons for all YOLO variants.
Comparison | Δ (mean) | 95% CI | p adj | Effect
YOLOv10-x vs YOLOv12-m−0.1680[−0.2310, −0.1059]0.0009r = −0.429
YOLOv10-x vs YOLOv12-n−0.0673[−0.1374, 0.0033]1.0000r = −0.205
YOLOv10-x vs YOLOv12-s−0.1580[−0.2356, −0.0802]0.1132r = −0.351
YOLOv10-x vs YOLOv12-x−0.1794[−0.2569, −0.1027]0.0065r = −0.507
YOLOv11-l vs YOLOv11-m−0.0256[−0.1024, 0.0499]1.0000r = −0.020
YOLOv11-l vs YOLOv11-n0.0309[−0.0514, 0.1137]1.0000r = 0.074
YOLOv11-l vs YOLOv11-s−0.0064[−0.0845, 0.0727]1.0000r = −0.083
YOLOv11-l vs YOLOv11-x0.1575[0.0774, 0.2375]0.1262r = 0.452
YOLOv11-l vs YOLOv12-l0.0627[0.0009, 0.1235]1.0000r = 0.395
YOLOv12-l vs YOLOv12-n0.0726[0.0020, 0.1450]1.0000r = 0.233
YOLOv12-l vs YOLOv12-m−0.0281[−0.1020, 0.0477]1.0000r = −0.069
YOLOv11-x vs YOLOv12-x−0.1343[−0.2046, −0.0672]0.0864r = −0.483
YOLOv11-x vs YOLOv12-s−0.1129[−0.1889, −0.0363]1.0000r = −0.311
YOLOv11-x vs YOLOv12-n−0.0223[−0.0941, 0.0499]1.0000r = −0.171
YOLOv11-x vs YOLOv12-m−0.1230[−0.1961, −0.0489]0.3365r = −0.400
YOLOv11-x vs YOLOv12-l−0.0948[−0.1654, −0.0249]1.0000r = −0.393
YOLOv11-s vs YOLOv12-x0.0297[−0.0433, 0.1032]1.0000r = 0.083
YOLOv11-s vs YOLOv12-s0.0510[−0.0307, 0.1317]1.0000r = 0.265
YOLOv11-s vs YOLOv12-n0.1417[0.0647, 0.2194]0.2674r = 0.310
YOLOv11-s vs YOLOv12-m0.0410[−0.0385, 0.1191]1.0000r = 0.216
YOLOv11-s vs YOLOv12-l0.0691[−0.0065, 0.1438]1.0000r = 0.400
YOLOv11-s vs YOLOv11-x0.1640[0.0860, 0.2435]0.0477r = 0.433
YOLOv11-n vs YOLOv12-x−0.0077[−0.0853, 0.0690]1.0000r = 0.034
YOLOv11-n vs YOLOv12-s0.0137[−0.0750, 0.1019]1.0000r = 0.129
YOLOv11-n vs YOLOv12-n0.1044[0.0289, 0.1814]1.0000r = 0.276
YOLOv11-n vs YOLOv12-m0.0037[−0.0751, 0.0835]1.0000r = 0.094
YOLOv11-n vs YOLOv12-l0.0318[−0.0367, 0.1023]1.0000r = 0.200
YOLOv11-n vs YOLOv11-x0.1266[0.0565, 0.1963]0.1529r = 0.483
YOLOv11-n vs YOLOv11-s−0.0373[−0.1159, 0.0442]1.0000r = −0.222
YOLOv11-m vs YOLOv12-x0.0488[−0.0244, 0.1229]1.0000r = 0.280
YOLOv11-m vs YOLOv12-s0.0702[−0.0075, 0.1489]1.0000r = 0.296
YOLOv11-m vs YOLOv12-n0.1608[0.0833, 0.2384]0.0394r = 0.410
YOLOv11-m vs YOLOv12-m0.0601[−0.0194, 0.1408]1.0000r = 0.143
YOLOv11-m vs YOLOv12-l0.0883[0.0205, 0.1565]1.0000r = 0.440
YOLOv11-m vs YOLOv11-x0.1831[0.1073, 0.2620]0.0090r = 0.508
YOLOv11-m vs YOLOv11-s0.0191[−0.0596, 0.0999]1.0000r = 0.040
YOLOv11-m vs YOLOv11-n0.0565[−0.0221, 0.1345]1.0000r = 0.115
YOLOv11-l vs YOLOv12-x0.0232[−0.0635, 0.1102]1.0000r = 0.120
YOLOv11-l vs YOLOv12-s0.0446[−0.0409, 0.1268]1.0000r = 0.214
YOLOv11-l vs YOLOv12-n0.1353[0.0608, 0.2080]0.1882r = 0.448
YOLOv11-l vs YOLOv12-m0.0346[−0.0352, 0.1046]1.0000r = 0.149
YOLOv12-l vs YOLOv12-s−0.0181[−0.0910, 0.0537]1.0000r = −0.074
YOLOv12-l vs YOLOv12-x−0.0395[−0.1115, 0.0315]1.0000r = −0.057
YOLOv12-m vs YOLOv12-n0.1007[0.0309, 0.1712]1.0000r = 0.212
YOLOv12-m vs YOLOv12-s0.0100[−0.0672, 0.0875]1.0000r = 0.164
YOLOv12-m vs YOLOv12-x−0.0113[−0.0863, 0.0643]1.0000r = 0.018
YOLOv12-n vs YOLOv12-s−0.0907[−0.1717, −0.0109]1.0000r = −0.108
YOLOv12-n vs YOLOv12-x−0.1120[−0.1918, −0.0365]1.0000r = −0.250
YOLOv12-s vs YOLOv12-x−0.0214[−0.1005, 0.0566]1.0000r = −0.102
YOLOv10-b vs YOLOv11-s−0.1290[−0.2172, −0.0415]1.0000r = −0.361
YOLOv10-b vs YOLOv11-x0.0350[−0.0439, 0.1133]1.0000r = 0.159
YOLOv10-b vs YOLOv12-l−0.0598[−0.1303, 0.0096]1.0000r = −0.207
YOLOv10-b vs YOLOv12-m−0.0880[−0.1620, −0.0152]1.0000r = −0.290
YOLOv10-b vs YOLOv12-n0.0127[−0.0524, 0.0773]1.0000r = 0.000
YOLOv10-b vs YOLOv12-s−0.0779[−0.1528, 0.0006]1.0000r = −0.302
YOLOv10-b vs YOLOv12-x−0.0993[−0.1734, −0.0264]1.0000r = −0.312
YOLOv10-l vs YOLOv10-m−0.0583[−0.1296, 0.0115]1.0000r = −0.143
YOLOv10-b vs YOLOv10-l0.0464[−0.0192, 0.1126]1.0000r = 0.238
YOLOv10-b vs YOLOv10-m−0.0119[−0.0849, 0.0596]1.0000r = 0.036
Table A7. Part 7. Pairwise comparisons for all YOLO variants.
Comparison | Δ (mean) | 95% CI | p adj | Effect
YOLOv10-b vs YOLOv10-n0.0458[−0.0134, 0.1048]1.0000r = 0.265
YOLOv10-b vs YOLOv10-s−0.0241[−0.0976, 0.0476]1.0000r = −0.151
YOLOv10-b vs YOLOv10-x0.0801[0.0090, 0.1514]1.0000r = 0.297
YOLOv10-b vs YOLOv11-l−0.1225[−0.1997, −0.0425]1.0000r = −0.424
YOLOv10-b vs YOLOv11-m−0.1481[−0.2233, −0.0735]0.0838r = −0.429
YOLOv10-b vs YOLOv11-n−0.0916[−0.1672, −0.0161]1.0000r = −0.262
Metric: recall
YOLOv12-l vs YOLOv12-s0.0409[−0.0455, 0.1227]1.0000r = 0.200
YOLOv12-l vs YOLOv12-x−0.0136[−0.0955, 0.0682]1.0000r = −0.040
YOLOv12-m vs YOLOv12-n−0.0182[−0.1227, 0.0818]1.0000r = −0.056
YOLOv12-m vs YOLOv12-s0.0182[−0.0773, 0.1091]1.0000r = 0.097
YOLOv12-m vs YOLOv12-x−0.0364[−0.1273, 0.0545]1.0000r = −0.111
YOLOv12-n vs YOLOv12-s0.0364[−0.0591, 0.1318]1.0000r = 0.152
YOLOv12-n vs YOLOv12-x−0.0182[−0.1091, 0.0727]1.0000r = −0.071
YOLOv12-s vs YOLOv12-x−0.0545[−0.1364, 0.0273]1.0000r = −0.273
YOLOv10-b vs YOLOv11-n0.0091[−0.0864, 0.1091]1.0000r = 0.030
YOLOv10-b vs YOLOv11-m−0.0227[−0.1045, 0.0636]1.0000r = −0.083
YOLOv10-b vs YOLOv11-l0.0000[−0.0909, 0.0909]1.0000r = −0.037
YOLOv10-b vs YOLOv10-x−0.0091[−0.1045, 0.0864]1.0000r = 0.032
YOLOv10-b vs YOLOv10-s−0.0818[−0.1773, 0.0136]1.0000r = −0.312
YOLOv10-b vs YOLOv10-n−0.0318[−0.1227, 0.0545]1.0000r = −0.111
YOLOv10-b vs YOLOv10-m0.0364[−0.0545, 0.1273]1.0000r = 0.143
YOLOv10-b vs YOLOv10-l−0.0409[−0.1273, 0.0456]1.0000r = −0.154
YOLOv12-l vs YOLOv12-n0.0045[−0.0818, 0.0909]1.0000r = 0.000
YOLOv12-l vs YOLOv12-m0.0227[−0.0636, 0.1091]1.0000r = 0.077
YOLOv11-x vs YOLOv12-x−0.0682[−0.1500, 0.0136]1.0000r = −0.280
YOLOv11-x vs YOLOv12-s−0.0136[−0.1091, 0.0818]1.0000r = −0.032
YOLOv11-x vs YOLOv12-n−0.0500[−0.1409, 0.0455]1.0000r = −0.200
YOLOv11-x vs YOLOv12-m−0.0318[−0.1273, 0.0636]1.0000r = −0.118
YOLOv11-x vs YOLOv12-l−0.0545[−0.1364, 0.0273]1.0000r = −0.273
YOLOv11-s vs YOLOv12-x−0.0364[−0.1182, 0.0455]1.0000r = −0.238
YOLOv10-l vs YOLOv11-x0.0864[−0.0182, 0.1909]1.0000r = 0.243
YOLOv10-l vs YOLOv11-s0.0545[−0.0500, 0.1545]1.0000r = 0.167
YOLOv10-l vs YOLOv11-n0.0500[−0.0455, 0.1455]1.0000r = 0.161
YOLOv10-l vs YOLOv11-m0.0182[−0.0773, 0.1136]1.0000r = 0.062
YOLOv10-l vs YOLOv11-l0.0409[−0.0500, 0.1318]1.0000r = 0.103
YOLOv10-l vs YOLOv10-x0.0318[−0.0591, 0.1227]1.0000r = 0.185
YOLOv10-l vs YOLOv10-s−0.0409[−0.1409, 0.0591]1.0000r = −0.176
YOLOv10-l vs YOLOv10-n0.0091[−0.0864, 0.1000]1.0000r = 0.034
YOLOv10-l vs YOLOv10-m0.0773[−0.0227, 0.1727]1.0000r = 0.250
YOLOv10-b vs YOLOv12-x−0.0227[−0.1182, 0.0727]1.0000r = −0.071
YOLOv10-b vs YOLOv12-s0.0318[−0.0591, 0.1273]1.0000r = 0.133
YOLOv10-b vs YOLOv12-n−0.0045[−0.1000, 0.0909]1.0000r = −0.030
YOLOv10-b vs YOLOv12-m0.0136[−0.0818, 0.1091]1.0000r = 0.034
YOLOv10-b vs YOLOv12-l−0.0091[−0.0909, 0.0727]1.0000r = −0.043
YOLOv10-b vs YOLOv11-x0.0455[−0.0545, 0.1455]1.0000r = 0.143
YOLOv10-b vs YOLOv11-s0.0136[−0.0818, 0.1091]1.0000r = 0.067
YOLOv10-s vs YOLOv11-l0.0818[−0.0136, 0.1818]1.0000r = 0.290
YOLOv10-s vs YOLOv10-x0.0727[−0.0273, 0.1682]1.0000r = 0.294
YOLOv10-n vs YOLOv12-x0.0091[−0.0727, 0.0909]1.0000r = 0.043
YOLOv10-n vs YOLOv12-s0.0636[−0.0273, 0.1545]1.0000r = 0.259
YOLOv10-n vs YOLOv12-n0.0273[−0.0591, 0.1182]1.0000r = 0.071
YOLOv10-n vs YOLOv12-m0.0455[−0.0545, 0.1500]1.0000r = 0.111
YOLOv10-n vs YOLOv12-l0.0227[−0.0591, 0.1045]1.0000r = 0.091
YOLOv10-n vs YOLOv11-x0.0773[−0.0136, 0.1682]1.0000r = 0.267
YOLOv10-n vs YOLOv11-s0.0455[−0.0500, 0.1365]1.0000r = 0.172
YOLOv10-n vs YOLOv11-n0.0409[−0.0500, 0.1318]1.0000r = 0.143
YOLOv10-n vs YOLOv11-m0.0091[−0.0818, 0.0955]1.0000r = 0.037
YOLOv10-n vs YOLOv11-l0.0318[−0.0591, 0.1227]1.0000r = 0.071
YOLOv10-n vs YOLOv10-x0.0227[−0.0682, 0.1136]1.0000r = 0.133
Table A8. Part 8. Pairwise comparisons for all YOLO variants.
Comparison | Δ (mean) | 95% CI | p adj | Effect
YOLOv10-n vs YOLOv10-s−0.0500[−0.1409, 0.0409]1.0000r = −0.231
YOLOv10-m vs YOLOv12-x−0.0591[−0.1501, 0.0364]1.0000r = −0.161
YOLOv10-m vs YOLOv12-s−0.0045[−0.0955, 0.0909]1.0000r = 0.000
YOLOv10-m vs YOLOv12-n−0.0409[−0.1409, 0.0591]1.0000r = −0.143
YOLOv10-m vs YOLOv12-m−0.0227[−0.1182, 0.0727]1.0000r = −0.103
YOLOv10-m vs YOLOv12-l−0.0455[−0.1227, 0.0318]1.0000r = −0.238
YOLOv10-m vs YOLOv11-x0.0091[−0.0773, 0.1000]1.0000r = 0.037
YOLOv10-m vs YOLOv11-s−0.0227[−0.1227, 0.0727]1.0000r = −0.059
YOLOv10-m vs YOLOv11-n−0.0273[−0.1136, 0.0591]1.0000r = −0.120
YOLOv10-m vs YOLOv11-m−0.0591[−0.1500, 0.0318]1.0000r = −0.200
YOLOv10-m vs YOLOv11-l−0.0364[−0.1318, 0.0591]1.0000r = −0.161
YOLOv10-m vs YOLOv10-x−0.0455[−0.1364, 0.0455]1.0000r = −0.097
YOLOv10-m vs YOLOv10-s−0.1182[−0.2000, −0.0364]1.0000r = −0.583
YOLOv10-m vs YOLOv10-n−0.0682[−0.1455, 0.0136]1.0000r = −0.304
YOLOv10-l vs YOLOv12-x0.0182[−0.0864, 0.1227]1.0000r = 0.081
YOLOv10-l vs YOLOv12-s0.0727[−0.0364, 0.1774]1.0000r = 0.200
YOLOv10-l vs YOLOv12-n0.0364[−0.0636, 0.1364]1.0000r = 0.086
YOLOv10-l vs YOLOv12-m0.0545[−0.0409, 0.1455]1.0000r = 0.172
YOLOv10-l vs YOLOv12-l0.0318[−0.0591, 0.1227]1.0000r = 0.103
YOLOv10-s vs YOLOv11-m0.0591[−0.0273, 0.1455]1.0000r = 0.259
YOLOv10-s vs YOLOv11-n0.0909[−0.0045, 0.1864]1.0000r = 0.355
YOLOv10-s vs YOLOv11-s0.0955[0.0045, 0.1864]1.0000r = 0.379
YOLOv10-s vs YOLOv11-x0.1273[0.0227, 0.2273]1.0000r = 0.405
YOLOv10-s vs YOLOv12-l0.0727[−0.0091, 0.1545]1.0000r = 0.391
YOLOv10-s vs YOLOv12-m0.0955[−0.0045, 0.1955]1.0000r = 0.314
YOLOv10-s vs YOLOv12-n0.0773[−0.0273, 0.1818]1.0000r = 0.222
YOLOv10-s vs YOLOv12-s0.1136[0.0273, 0.2000]1.0000r = 0.481
YOLOv10-s vs YOLOv12-x0.0591[−0.0273, 0.1455]1.0000r = 0.259
YOLOv10-x vs YOLOv11-l0.0091[−0.0864, 0.1045]1.0000r = −0.059
YOLOv10-x vs YOLOv11-m−0.0136[−0.1045, 0.0773]1.0000r = −0.103
YOLOv10-x vs YOLOv11-n0.0182[−0.0773, 0.1136]1.0000r = 0.000
YOLOv10-x vs YOLOv11-s0.0227[−0.0727, 0.1182]1.0000r = 0.032
YOLOv10-x vs YOLOv11-x0.0545[−0.0455, 0.1545]1.0000r = 0.143
YOLOv10-x vs YOLOv12-l0.0000[−0.0955, 0.0955]1.0000r = −0.062
YOLOv10-x vs YOLOv12-m0.0227[−0.0727, 0.1182]1.0000r = 0.000
YOLOv11-m vs YOLOv12-l0.0136[−0.0591, 0.0864]1.0000r = 0.053
YOLOv11-m vs YOLOv11-x0.0682[−0.0227, 0.1591]1.0000r = 0.241
YOLOv11-m vs YOLOv11-s0.0364[−0.0455, 0.1182]1.0000r = 0.182
YOLOv11-m vs YOLOv11-n0.0318[−0.0500, 0.1182]1.0000r = 0.120
YOLOv11-l vs YOLOv12-x−0.0227[−0.1136, 0.0682]1.0000r = −0.071
YOLOv11-l vs YOLOv12-s0.0318[−0.0591, 0.1227]1.0000r = 0.172
YOLOv11-l vs YOLOv12-n−0.0045[−0.0864, 0.0773]1.0000r = 0.000
YOLOv11-l vs YOLOv12-m0.0136[−0.0682, 0.0955]1.0000r = 0.091
YOLOv11-l vs YOLOv12-l−0.0091[−0.0818, 0.0636]1.0000r = 0.000
YOLOv11-l vs YOLOv11-x0.0455[−0.0455, 0.1364]1.0000r = 0.200
YOLOv11-l vs YOLOv11-s0.0136[−0.0773, 0.1045]1.0000r = 0.103
YOLOv11-l vs YOLOv11-n0.0091[−0.0864, 0.1045]1.0000r = 0.067
YOLOv11-l vs YOLOv11-m−0.0227[−0.1091, 0.0636]1.0000r = −0.040
YOLOv10-x vs YOLOv12-x−0.0136[−0.1091, 0.0818]1.0000r = −0.062
YOLOv10-x vs YOLOv12-s0.0409[−0.0591, 0.1409]1.0000r = 0.086
YOLOv10-x vs YOLOv12-n0.0045[−0.0955, 0.1045]1.0000r = −0.029
YOLOv11-m vs YOLOv12-m0.0364[−0.0636, 0.1409]1.0000r = 0.081
YOLOv11-m vs YOLOv12-n0.0182[−0.0682, 0.1045]1.0000r = 0.040
YOLOv11-m vs YOLOv12-s0.0545[−0.0364, 0.1455]1.0000r = 0.214
YOLOv11-m vs YOLOv12-x0.0000[−0.0864, 0.0864]1.0000r = −0.040
YOLOv11-n vs YOLOv11-s0.0045[−0.0864, 0.0955]1.0000r = 0.037
YOLOv11-n vs YOLOv11-x0.0364[−0.0545, 0.1273]1.0000r = 0.143
YOLOv11-n vs YOLOv12-l−0.0182[−0.1091, 0.0727]1.0000r = −0.083
YOLOv11-n vs YOLOv12-m0.0045[−0.0864, 0.0955]1.0000r = 0.000
Table A9. Part 9. Pairwise comparisons for all YOLO variants.
Comparison | Δ (mean) | 95% CI | p adj | Effect
YOLOv11-n vs YOLOv12-n−0.0136[−0.1000, 0.0727]1.0000r = −0.077
YOLOv11-n vs YOLOv12-s0.0227[−0.0773, 0.1182]1.0000r = 0.091
YOLOv11-n vs YOLOv12-x−0.0318[−0.1182, 0.0545]1.0000r = −0.111
YOLOv11-s vs YOLOv11-x0.0318[−0.0545, 0.1227]1.0000r = 0.111
YOLOv11-s vs YOLOv12-l−0.0227[−0.0955, 0.0500]1.0000r = −0.158
YOLOv11-s vs YOLOv12-m0.0000[−0.0909, 0.0909]1.0000r = −0.034
YOLOv11-s vs YOLOv12-n−0.0182[−0.1136, 0.0773]1.0000r = −0.103
YOLOv11-s vs YOLOv12-s0.0182[−0.0636, 0.1000]1.0000r = 0.091

References

  1. Darras, K.F.; Balle, M.; Xu, W.; Yan, Y.; Zakka, V.G.; Toledo-Hernández, M.; Sheng, D.; Lin, W.; Zhang, B.; Lan, Z.; et al. Eyes on nature: Embedded vision cameras for terrestrial biodiversity monitoring. Methods Ecol. Evol. 2024, 15, 2262–2275. [Google Scholar] [CrossRef]
  2. Ghanem, S.J.; Voigt, C.C. Chapter 7-Increasing Awareness of Ecosystem Services Provided by Bats. In Advances in the Study of Behavior; Brockmann, H.J., Roper, T.J., Naguib, M., Mitani, J.C., Simmons, L.W., Eds.; Academic Press: Cambridge, MA, USA, 2012; Volume 44, pp. 279–302. [Google Scholar] [CrossRef]
  3. Corrêa Scheffer, K.; Fernandes De Barros, R.; Iamamoto, K.; Mori, E.; Miyuki Asano, K.; M. Achkar, S.; Estevez Garcia, A.I.; de Oliveira Lima, J.Y.; de Oliveira Fahl, W. Diphylla ecaudata and Diaemus youngi, biology and behavior. Acta Zoológica Mexicana (N.S.) 2015, 31, 436–445. [Google Scholar] [CrossRef]
  4. David, O.R.; Consuelo, L.; Eduardo, N.; Livia, L.P. Selección de refugios por tres especies de murciélagos frugívoros (Chiroptera: Phyllostomidae) en la Selva Lacandona, Chiapas, México. Rev. Mex. De Biodivers. 2006, 77, 261–270. [Google Scholar]
  5. Brigham, R.; Fenton, B. The influence of roost closure on the roosting and foraging behaviour of Eptesicus fuscus (Chiroptera: Vespertilionidae). Can. J. Zool. 2011, 64, 1128–1133. [Google Scholar] [CrossRef]
  6. Labadie, M.; Morand, S.; Bourgarel, M.; Niama, F.R.; Nguilili, G.F.; Tobi, N.; Caron, A.; De Nys, H. Habitat sharing and interspecies interactions in caves used by bats in the Republic of Congo. PeerJ 2025, 13, e18145. [Google Scholar] [CrossRef]
  7. Bullen, R.D. A Review of Ghost Bat Ecology, Threats and Survey Requirements; Technical report; Australian Government Department of Agriculture, Water and Environment: Hillarys, Australia, 2002. [Google Scholar]
  8. Russo, D.; Salinas-Ramos, V.B.; Cistrone, L.; Smeraldo, S.; Bosso, L.; Ancillotto, L. Do We Need to Use Bats as Bioindicators? Biology 2021, 10, 693. [Google Scholar] [CrossRef]
  9. Frick, W.F.; Kingston, T.; Flanders, J. A review of the major threats and challenges to global bat conservation. Ann. N. Y. Acad. Sci. 2020, 1469, 5–25. [Google Scholar] [CrossRef]
  10. Platto, S.; Zhou, J.; Wang, Y.; Wang, H.; Carafoli, E. Biodiversity loss and COVID-19 pandemic: The role of bats in the origin and the spreading of the disease. Biochem. Biophys. Res. Commun. 2021, 538, 2–13. [Google Scholar] [CrossRef]
  11. Kunz, T.H.; Betke, M.; Hristov, N.I.; Vonhof, M.J. Methods for assessing colony size, population size, and relative abundance of bats. In Ecological and Behavioral Methods for the Study of Bats; Johns Hopkins University Press: Baltimore, MD, USA, 2009; pp. 133–157. [Google Scholar]
  12. Orugas, A.; Pally, I.; Ramos, A.; Gutiérrez, M. Murciélagos: Análisis de su problemática y alternativas de mitigación. Rev. Estud. AGRO-VET 2022, 6, 56–70. [Google Scholar]
  13. O’Shea, T.J.; Bogan, M.A. Monitoring Trends in Bat Populations of the United States and Territories: Problems and Prospects; U.S. Geological Survey, Biological Resources Discipline, Information and Technology: Reston, VA, USA, 2003. [Google Scholar]
  14. Whiting, J.C.; Doering, B.; Aho, K.; Bybee, B.F. Disturbance of hibernating bats due to researchers entering caves to conduct hibernacula surveys. Sci. Rep. 2024, 14, 13496. [Google Scholar] [CrossRef]
  15. Sabol, B.M.; Hudson, M.K. Technique using thermal infrared-imaging for estimating populations of gray bats. J. Mammal. 1995, 76, 1242–1248. [Google Scholar] [CrossRef]
  16. Hristov, N.I.; Betke, M.; Kunz, T.H. Applications of thermal infrared imaging for research in aeroecology. Integr. Comp. Biol. 2008, 48, 50–59. [Google Scholar] [CrossRef]
  17. Frank, J.; Kunz, T.; Horn, J.; Cleveland, C.; Petronio, S. Advanced infrared detection and image processing for automated bat censusing. Proc. SPIE—Int. Soc. Opt. Eng. 2003, 5074, 261–271. [Google Scholar] [CrossRef]
  18. Botto Nuñez, G.; Lemus, G.; Muñoz Wolf, M.; Rodales, A.; González, E.; Crisci, C. The first artificial intelligence algorithm for identification of bat species in Uruguay. Ecol. Inform. 2018, 46, 97–102. [Google Scholar] [CrossRef]
  19. Mac Aodha, O.; Gibb, R.; Barlow, K.E.; Browning, E.; Firman, M.; Freeman, R.; Harder, B.; Kinsey, L.; Mead, G.R.; Newson, S.E.; et al. Bat detective—Deep learning tools for bat acoustic signal detection. PLoS Comput. Biol. 2018, 14, e1005995. [Google Scholar] [CrossRef] [PubMed]
  20. Krivek, G.; Gillert, A.; Harder, M.; Fritze, M.; Frankowski, K.; Timm, L.; Meyer-Olbersleben, L.; von Lukas, U.F.; Kerth, G.; van Schaik, J. BatNet: A deep learning-based tool for automated bat species identification from camera trap images. Remote Sens. Ecol. Conserv. 2023, 9, 759–774. [Google Scholar] [CrossRef]
  21. Fujioka, E.; Fukushiro, M.; Ushio, K.; Kohyama, K.; Habe, H.; Hiryu, S. Three-Dimensional Trajectory Construction and Observation of Group Behavior of Wild Bats During Cave Emergence. J. Robot. Mechatronics 2021, 33, 556–563. [Google Scholar] [CrossRef]
  22. Darras, K.F.A.; Yusti, E.; Huang, J.C.C.; Zemp, D.C.; Kartono, A.P.; Wanger, T.C. Bat point counts: A novel sampling method shines light on flying bat communities. Ecol. Evol. 2021, 11, 17179–17190. [Google Scholar] [CrossRef]
  23. Darras, K.; Yusti, E.; Knorr, A.; Huang, J.C.C.; Kartono, A.P. Sampling flying bats with thermal and near-infrared imaging and ultrasound recording: Hardware and workflow for bat point counts. F1000Research 2022, 10, 189. [Google Scholar] [CrossRef]
  24. Bentley, I.; Gebran, M.; Vorderer, S.; Ralston, J.; Kloepper, L. Utilizing Neural Networks to Resolve Individual Bats and Improve Automated Counts. In Proceedings of the 2023 IEEE World AI IoT Congress (AIIoT), Seattle, WA, USA, 7–10 June 2023; pp. 0112–0119. [Google Scholar] [CrossRef]
  25. Koger, B.; Hurme, E.; Costelloe, B.R.; O’Mara, M.T.; Wikelski, M.; Kays, R.; Dechmann, D.K. An automated approach for counting groups of flying animals applied to one of the world’s largest bat colonies. Ecosphere 2023, 14, e4590. [Google Scholar] [CrossRef]
  26. Wang, Y.; Ma, C.; Zhao, C.; Xia, H.; Chen, C.; Zhang, Y. WB-YOLO: An efficient wild bat detection method for ecological monitoring in complex environments. Eng. Appl. Artif. Intell. 2025, 157, 111232. [Google Scholar] [CrossRef]
  27. Lee, B.; Sambado, S.; Farrant, D.N.; Boser, A.; Ring, K.; Hyon, D.; Larsen, A.E.; MacDonald, A.J. Novel Bat-Monitoring Dataset Reveals Targeted Foraging With Agricultural and Pest Control Implications. Ecol. Evol. 2025, 15, e70819. [Google Scholar] [CrossRef] [PubMed]
  28. Rangel, I.C.; Arroyo-Romero, J.A.; Bárcenas-Reyes, I.; González-Barbosa, J.J.; Hurtado-Ramos, J.B.; Ornelas-Rodríguez, F.J.; Ramírez-Pedraza, A. Explorando las Profundidades: Reconstrucción de Cuevas y Detección de Murciélagos mediante Imágenes Infrarrojas. Mem. Investig. En Ing. 2025, 1, 110–125. [Google Scholar] [CrossRef]
  29. Amézquita-Gómez, N.; González-Bautista, S.R.; Teran, M.; Salazar, C.; Corredor, J.; Corzo, G.D. Preliminary Approach for UAV-Based Multi-Sensor Platforms for Reconnaissance and Surveillance applications. Ingeniería 2023, 28, e21035. [Google Scholar] [CrossRef]
  30. Gutchess, D.; Trajkovics, M.; Cohen-Solal, E.; Lyons, D.; Jain, A. A background model initialization algorithm for video surveillance. In Proceedings of the Eighth IEEE International Conference on Computer Vision. ICCV 2001, Vancouver, BC, Canada, 7–14 July 2001; Volume 1, pp. 733–740. [Google Scholar] [CrossRef]
  31. Tai, J.C.; Song, K.T. Background segmentation and its application to traffic monitoring using modified histogram. In Proceedings of the IEEE International Conference on Networking, Sensing and Control, New Delhi, India, 21–23 March 2004; Volume 1, pp. 13–18. [Google Scholar] [CrossRef]
  32. Stauffer, C.; Grimson, W. Adaptive background mixture models for real-time tracking. In Proceedings of the 1999 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Cat. No PR00149), Fort Collins, CO, USA, 23–25 June 1999; Volume 2, pp. 246–252. [Google Scholar] [CrossRef]
  33. Ramírez-Pedraza, A.; Salazar-Colores, S.; Terven, J.; Romero-González, J.A.; González-Barbosa, J.J.; Córdova-Esparza, D.M. Nutritional Monitoring of Rhodena Lettuce via Neural Networks and Point Cloud Analysis. AgriEngineering 2024, 6, 3474–3493. [Google Scholar] [CrossRef]
  34. Ramírez-Pedraza, A.; Salazar-Colores, S.; Cardenas-Valle, C.; Terven, J.; González-Barbosa, J.J.; Ornelas-Rodriguez, F.J.; Hurtado-Ramos, J.B.; Ramirez-Pedraza, R.; Córdova-Esparza, D.M.; Romero-González, J.A. Deep Learning in Oral Hygiene: Automated Dental Plaque Detection via YOLO Frameworks and Quantification Using the O’Leary Index. Diagnostics 2025, 15, 231. [Google Scholar] [CrossRef] [PubMed]
  35. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-Time End-to-End Object Detection. Adv. Neural Inf. Process. Syst. 2024, 37, 107984–108011. [Google Scholar]
  36. Khanam, R.; Hussain, M. YOLOv11: An Overview of the Key Architectural Enhancements. arXiv 2024, arXiv:2410.17725. [Google Scholar] [CrossRef]
  37. Tian, Y.; Ye, Q.; Doermann, D. YOLOv12: Attention-Centric Real-Time Object Detectors. arXiv 2025, arXiv:2502.12524. [Google Scholar]
  38. Chen, X.; Zhao, J.; Chen, Y.-H.; Zhou, W.; Hughes, A.C. Automatic standardized processing and identification of tropical bat calls using deep learning approaches. Biol. Conserv. 2020, 241, 108269. [Google Scholar] [CrossRef]
  39. Hernández, B.O.; Sánchez-García, Á.J.; Alfonso, C.A.D.; Ocharán-Hernández, J.O.; Ortiz, E.M.; Ríos-Figueroa, H.V. Desarrollo de una aplicación para el conteo automático de murciélagos en cuevas basado en visión por computadora. Res. Comput. Sci. 2018, 147, 11–22. [Google Scholar] [CrossRef]
  40. Aza Taimal, J.J.; Bacca Cortes, B.; Restrepo Girón, A.D. Software Tool for the Extrinsic Calibration of Infrared and RGBD Cameras Applied to Thermographic Inspection. Ingeniería 2022, 28, e18145. [Google Scholar] [CrossRef]
  41. Rodríguez-Lira, D.C.; Córdova-Esparza, D.M.; Terven, J.; Romero-González, J.A.; Alvarez-Alvarado, J.M.; González-Barbosa, J.J.; Ramírez-Pedraza, A. Recent Developments in Image-Based 3D Reconstruction Using Deep Learning: Methodologies and Applications. Electronics 2025, 14, 3032. [Google Scholar] [CrossRef]
Figure 1. We designed and iteratively improved a portable multisensor platform for field data collection.
Figure 2. The left image shows the multisensor platform, comprising two near-infrared (NIR) cameras, infrared projectors, an Intel RealSense D435i depth camera, and microphones capable of capturing audio with a maximum recordable frequency of 128 kHz. The right image shows the platform's touchscreen interface, which enables direct interaction with the device.
Figure 3. Connections and components of the portable multisensor platform.
Figure 4. YOLOv10 architecture, with progressive improvements in efficiency, precision, and adaptability.
Figure 5. YOLOv11 architecture, with progressive improvements in efficiency, precision, and adaptability.
Figure 6. YOLOv12 architecture, with progressive improvements in efficiency, precision, and adaptability.
Figure 7. Gray-level intensity of the stuffed bat in images acquired with the multisensor platform described in this work.
Figure 8. Random images acquired by the prototype. The top row displays images from the right camera, while the bottom row shows the images from the left camera.
Figure 9. Bat detection through background subtraction and binarization across a sequence of images. The top row displays images from the right camera, while the bottom row shows the images from the left camera.
Figure 10. Bat detection using background subtraction and binarization across a sequence of images. The top row displays images from the right camera, while the bottom row shows the images from the left camera.
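Figures 9 and 10 illustrate bat detection by background subtraction and binarization over NIR frame sequences. The sketch below shows one possible OpenCV implementation of that idea; it is illustrative only, and the frame directory, the MOG2 subtractor, the binarization threshold, and the minimum blob area are assumptions rather than the exact settings used in this study.

```python
# Minimal background-subtraction baseline (assumed parameters, not the authors' exact pipeline).
import glob
import cv2

# Hypothetical folder of NIR frames from one camera.
frames = sorted(glob.glob("nir_frames/*.png"))

# MOG2 models the static cave background; flying bats appear as foreground.
subtractor = cv2.createBackgroundSubtractorMOG2(history=200, varThreshold=16, detectShadows=False)

for path in frames:
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    mask = subtractor.apply(gray)

    # Binarize and clean the foreground mask before extracting blobs.
    _, binary = cv2.threshold(mask, 127, 255, cv2.THRESH_BINARY)
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (3, 3))
    binary = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)

    # Each sufficiently large connected component is counted as a candidate bat.
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    candidates = [cv2.boundingRect(c) for c in contours if cv2.contourArea(c) > 20]
    print(path, len(candidates), "candidate detections")
```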
Figure 11. Precision–recall curves corresponding to the evaluation of the three most effective models.
Figure 12. Exploratory analysis of the variables Width and Height in class 0 bats. The boxplots (a,b) show the median, interquartile ranges, and outliers.
Figure 13. Histograms showing positive skewness and a concentration of low values, suggesting the potential need for statistical transformations prior to modeling. The histogram on the left represents Width, while the one on the right corresponds to Height.
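Figures 12 and 13 summarize an exploratory analysis of the bounding-box Width and Height variables. As a minimal sketch of how such a check can be run on YOLO-format annotations, the code below computes the skewness of the normalized box dimensions; the label directory and the log1p transform are illustrative assumptions, not part of the released pipeline.

```python
# Sketch of the exploratory check behind Figures 12 and 13: distribution of
# normalized bounding-box width/height read from YOLO-format label files.
import glob
import numpy as np
from scipy.stats import skew

widths, heights = [], []
for label_file in glob.glob("labels/*.txt"):           # hypothetical label folder
    with open(label_file) as f:
        for line in f:
            # YOLO format: class x_center y_center width height (all normalized)
            _, _, _, w, h = map(float, line.split()[:5])
            widths.append(w)
            heights.append(h)

print("width  skewness:", skew(widths))
print("height skewness:", skew(heights))

# A positive skew with many small boxes suggests a variance-stabilizing
# transform (e.g., log1p) before any downstream statistical modeling.
log_widths = np.log1p(widths)
```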
Figure 14. Bat detection results. The first row shows manually labeled reference detections. The second, third, and fourth rows correspond to the predictions generated by the YOLOv10m, YOLOv11l, and YOLOv12m models, respectively.
Figure 15. Qualitative error analysis for three YOLO versions (rows: YOLOv10, YOLOv11, YOLOv12; two examples per row). Red boxes mark false positives, often aligned with cave-rock edges and high-gradient textures that mimic bat silhouettes; overlapping red boxes denote duplicate detections on the same background feature. Blue boxes indicate false negatives, typically associated with partial occlusions or low-contrast individuals. These examples are representative and illustrate the dominant failure modes observed across models.
Table 1. Training hyperparameters used in YOLOv10b, YOLOv11n, and YOLOv12s. The table is divided into two blocks: parameters common to all models and those that vary in each configuration.
Parameters | Value
Optimizer | auto (AdamW)
Initial learning rate (lr0) | 0.01
Final learning rate (lrf) | 0.01 (cosine decay)
Momentum | 0.937
Weight decay | 0.0005
Warmup epochs | 3.0
Warmup momentum | 0.8
Warmup bias LR | 0.1
Patience (early stopping) | 50
Image size (imgsz) | 640
Model-specific parameters | Value
YOLOv10b | epochs = 200, batch = 32
YOLOv11n | epochs = 200, batch = 32
YOLOv12s | epochs = 200, batch = 4
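For readers who wish to reproduce the Table 1 configuration, the snippet below sketches a training call with the Ultralytics API using those hyperparameters. The dataset YAML file name and the pretrained-weight file names are placeholders, not artifacts released with this work.

```python
# Sketch of a training run with the Table 1 hyperparameters via the Ultralytics API.
from ultralytics import YOLO

model = YOLO("yolo11n.pt")  # e.g., the YOLOv11-n variant; swap for the YOLOv10b or YOLOv12s weights

model.train(
    data="bats_nir.yaml",   # hypothetical dataset definition (train/val image lists, class names)
    epochs=200,
    batch=32,               # Table 1 uses batch = 4 for YOLOv12s
    imgsz=640,
    optimizer="auto",       # resolves to AdamW in this setup
    lr0=0.01,
    lrf=0.01,
    cos_lr=True,            # cosine learning-rate decay
    momentum=0.937,
    weight_decay=0.0005,
    warmup_epochs=3.0,
    warmup_momentum=0.8,
    warmup_bias_lr=0.1,
    patience=50,            # early stopping
)
```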
Table 2. Performance comparison of YOLOv10, YOLOv11, and YOLOv12 models with different architectures using precision, recall, mAP@50, mAP@0.75, and mAP@[0.5:0.95] metrics.
Detector | Model | Precision | Recall | mAP@50 | mAP@0.75 | mAP@[0.5:0.95]
YOLOv10 | b | 0.917 | 0.914 | 0.959 | 0.390 | 0.474
YOLOv10 | l | 0.910 | 0.853 | 0.954 | 0.424 | 0.453
YOLOv10 | m | 0.889 | 0.966 | 0.970 | 0.355 | 0.459
YOLOv10 | n | 0.911 | 0.879 | 0.946 | 0.349 | 0.447
YOLOv10 | s | 0.879 | 0.941 | 0.939 | 0.408 | 0.463
YOLOv10 | x | 0.895 | 0.914 | 0.951 | 0.348 | 0.446
YOLOv11 | n | 0.927 | 0.945 | 0.958 | 0.392 | 0.479
YOLOv11 | l | 0.941 | 0.964 | 0.979 | 0.359 | 0.462
YOLOv11 | m | 0.937 | 0.951 | 0.970 | 0.389 | 0.471
YOLOv11 | s | 0.935 | 0.964 | 0.947 | 0.390 | 0.476
YOLOv11 | x | 0.901 | 0.912 | 0.953 | 0.337 | 0.471
YOLOv12 | n | 0.940 | 0.940 | 0.957 | 0.388 | 0.486
YOLOv12 | l | 0.940 | 0.983 | 0.965 | 0.358 | 0.475
YOLOv12 | m | 0.956 | 0.983 | 0.981 | 0.378 | 0.471
YOLOv12 | s | 0.929 | 0.974 | 0.971 | 0.375 | 0.452
YOLOv12 | x | 0.940 | 0.957 | 0.935 | 0.410 | 0.469
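The values in Table 2 can be reproduced from a trained checkpoint with the Ultralytics validation routine, as sketched below; the checkpoint and dataset paths are placeholders.

```python
# Sketch: evaluating a trained checkpoint to obtain the metrics reported in Table 2.
from ultralytics import YOLO

model = YOLO("runs/detect/train/weights/best.pt")      # placeholder checkpoint path
metrics = model.val(data="bats_nir.yaml", imgsz=640)   # placeholder dataset YAML

print("precision      :", metrics.box.mp)     # mean precision over classes
print("recall         :", metrics.box.mr)     # mean recall over classes
print("mAP@50         :", metrics.box.map50)
print("mAP@75         :", metrics.box.map75)
print("mAP@[0.5:0.95] :", metrics.box.map)
```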
Table 3. Pairwise comparisons for precision focusing on YOLOv11-m. Reported are mean differences (Δ = A − B in the order shown), 95% CI, adjusted p-values (Holm), and effect size.
Comparison | Δ (mean) | 95% CI | p adj | Effect
YOLOv10-x vs YOLOv11-m | −0.228 | [−0.310, −0.145] | 0.000 | r = −0.514
YOLOv10-n vs YOLOv11-m | −0.194 | [−0.269, −0.121] | 0.001 | r = −0.493
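Table 3 and the appendix Tables A3–A9 report pairwise mean differences with Holm-adjusted p-values and an effect size r. The sketch below shows one way such a comparison could be computed, assuming paired per-run metric values, Wilcoxon signed-rank tests, and a matched-pairs rank-biserial correlation as the effect size; the exact procedure used in the study (for example, how the confidence intervals were obtained) may differ.

```python
# Sketch of pairwise model comparison with Holm-adjusted p-values (assumed test choice).
from itertools import combinations
import numpy as np
from scipy.stats import wilcoxon, rankdata
from statsmodels.stats.multitest import multipletests

def rank_biserial(diff):
    """Matched-pairs rank-biserial correlation computed from paired differences."""
    diff = diff[diff != 0]
    ranks = rankdata(np.abs(diff))
    pos, neg = ranks[diff > 0].sum(), ranks[diff < 0].sum()
    return (pos - neg) / ranks.sum()

def pairwise_comparisons(scores):
    """scores: dict mapping model name -> array of paired metric values (assumed layout)."""
    rows, pvals = [], []
    for a, b in combinations(scores, 2):
        diff = np.asarray(scores[a]) - np.asarray(scores[b])
        _, p = wilcoxon(scores[a], scores[b])          # paired signed-rank test
        rows.append((f"{a} vs {b}", diff.mean(), rank_biserial(diff)))
        pvals.append(p)
    _, p_adj, _, _ = multipletests(pvals, method="holm")
    for (name, delta, r), p in zip(rows, p_adj):
        print(f"{name}: delta={delta:+.4f}, p_adj={p:.4f}, r={r:+.3f}")
```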
Table 4. Summary of false positives (FP) and false negatives (FN) for each YOLO version (no heuristics).
Detector | Images | Images with Errors | FP | FN | Errors (FP+FN) | FP (%) | FN (%)
YOLOv10 | 110 | 101 | 435 | 2 | 437 | 100 | 0
YOLOv11 | 110 | 41 | 108 | 3 | 111 | 97 | 3
YOLOv12 | 110 | 57 | 137 | 1 | 138 | 99 | 1
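The false-positive and false-negative counts in Table 4 depend on how predictions are matched to ground-truth boxes. The sketch below shows a simple greedy IoU-matching tally for one image; the 0.5 IoU threshold is an assumption, not necessarily the criterion applied in this work.

```python
# Sketch of tallying FP/FN per image by greedy IoU matching (assumed matching rule).
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def count_errors(predictions, ground_truth, iou_thr=0.5):
    """Return (false positives, false negatives) for one image."""
    unmatched_gt = list(ground_truth)
    fp = 0
    for pred in predictions:
        best = max(unmatched_gt, key=lambda g: iou(pred, g), default=None)
        if best is not None and iou(pred, best) >= iou_thr:
            unmatched_gt.remove(best)   # matched -> counted as a true positive
        else:
            fp += 1                     # no ground-truth match -> false positive
    fn = len(unmatched_gt)              # ground truths never matched -> false negatives
    return fp, fn
```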
Table 5. Comparative analysis of the proposed method with existing state-of-the-art methods. Position indicates whether the sensor is installed inside or outside the shelter.
Reference | Sensor | Position | Objective | Technique | Accuracy
[38] | Acoustic | N/A | Identification tested with 15 species | BatNet and re-checking strategy | 0.91
[20] | RGB image triggered by infrared light barriers | Entrance of the hibernacula (caves or mines) | Counting the number of bats and identifying 13 European bat species | BatNet | 0.993
[21] | Stereo cameras | Front of the cave entrance | Counting | Three-dimensional flight trajectories | 0.94
[22] | Thermal, ultrasound, NIR camera | Outdoors relative to the roost site | Counting the number of bats and identifying species | Multimodal detection and analysis approach | N/A
[39] | RGB images | Inside | Counting | Background subtraction and Otsu segmentation | N/A
[24] | Infrared, RGB, and thermal videos | Outdoors relative to the roost site | Counting the number of bats | Convolutional neural networks (CNNs) | 0.95–0.99
[25] | RGB images | Outdoors around the bat colony | Counting the number of bats | UNet model | 0.88
[23] | Thermal, ultrasound, NIR camera | Relative to the river, oil palm, or road | Counting and identification | Morphological-acoustic bat identification | N/A
[27] | Doppler weather radar | Outdoor and indoor to the site | Estimating bat foraging distributions and relative bat activity | Bat-Aggregated Time Series (BATS) | N/A
[26] | RGB images: aerial images and macro photography | Outdoors relative to the roost site | Primarily identifying bat species rather than counting the number of bats | YOLOv7 | 0.94
Our work | NIR camera | Inside the roost site | Counting in each frame | YOLOv10, YOLOv11, YOLOv12 | 0.98