Review

Deep Learning Approaches for Automatic Livestock Detection in UAV Imagery: State-of-the-Art and Future Directions

Department of Computer and Information Sciences, Towson University, Towson, MD 21252, USA
* Author to whom correspondence should be addressed.
Future Internet 2025, 17(9), 431; https://doi.org/10.3390/fi17090431
Submission received: 13 August 2025 / Revised: 13 September 2025 / Accepted: 16 September 2025 / Published: 21 September 2025

Abstract

Accurate livestock monitoring is critical for precision agriculture, supporting effective farm management, disease prevention, and sustainable resource allocation. Recent advances in deep learning and remote sensing have created promising opportunities for the development of livestock monitoring systems. This paper presents a comprehensive survey of deep learning approaches for automatic livestock detection in unmanned aerial vehicle (UAV) imagery, highlighting key deep learning techniques, livestock detection challenges, and emerging trends. We analyze the innovations of popular object detection models, including You Only Look Once (YOLO) versions, Region-based Convolutional Neural Networks (R-CNNs), anchor-based networks, and transformer models, and discuss their suitability for scalable and cost-efficient UAV-based livestock detection scenarios. To complement the survey, a case study is conducted on a custom UAV cattle dataset to benchmark representative detection models. The evaluation results demonstrate trade-offs among precision, recall, F1 score, IoU, mAP@50, mAP@50-95, inference speed, and model size. The case study provides a clear basis for selecting and adapting deep learning models for UAV-based livestock monitoring and outlines future directions for lightweight, domain-adapted frameworks in precision farming applications.

1. Introduction

Livestock farming, which involves breeding, managing, and caring for domesticated animals such as cattle, sheep, and goats, is a vital component of global agriculture [1,2]. It provides essential resources such as meat, milk, leather, and wool while supporting rural livelihoods and economies. In addition to its contribution to food security, livestock production plays a critical role in maintaining ecosystem balance and enabling sustainable land use through practices such as rotational grazing and manure recycling. However, the growth of livestock farming has raised several ecological, economic, and ethical concerns. Greenhouse gas emissions from livestock farming account for approximately 14.5% of anthropogenic emissions worldwide [3]. Overgrazing and poor land management practices can lead to soil degradation, deforestation, and loss of biodiversity, affecting ecosystem health. Moreover, the intensification of livestock farming increases the risks of diseases and antimicrobial resistance, with implications for both animal and human health [4].
The growing demand for animal-based products, coupled with increasing concerns about environmental impact and animal welfare, has intensified the need for efficient and sustainable livestock farming management. Accurate monitoring of livestock populations is a cornerstone of precision livestock farming (PLF), enabling farmers to optimize resource allocation, detect diseases early, manage grazing patterns effectively, and enhance animal welfare. Traditional approaches for livestock monitoring, such as radio frequency identification (RFID) tracking [5] and manual counting, are time-consuming, labor-intensive, and prone to errors, especially in large or remote areas. This creates a pressing need for automated, scalable, and cost-effective solutions that can provide real-time information on livestock distribution and behavior. Automatic livestock detection offers the potential to satisfy this need. It refers to the automated identification and localization of animals in visual data, which enables efficient tracking of the presence, distribution, and movement of animals.
With the advancement of unmanned aerial vehicles (UAVs), aerial imagery that has been captured using drones and satellites has been considered as a promising alternative in the research of automatic livestock detection [6]. It offers the flexibility to cover extensive grazing land quickly, capturing visual data that can be processed for real-time analysis [7,8,9]. Equipped with RGB, thermal, or multispectral cameras, UAVs can provide detailed imagery under various environmental conditions, improving monitoring of livestock distribution, behavior, and even health indicators. One of the key advantages of UAV-based monitoring lies in its capability to capture data at multiple scales and resolutions. Low-altitude flights enable fine-grained observations of individual animals, while high-altitude surveys can efficiently map herd distribution across entire landscapes. Furthermore, UAVs are less intrusive than ground-based monitoring, minimizing stress and disturbance to animals, a critical factor in maintaining natural behaviors in extensive grazing systems [10]. The integration of UAV imagery with advanced data analytics, particularly deep learning, holds significant promise for automating livestock detection, counting, and behavior analysis, thus improving the efficiency and sustainability of precision livestock farming.
Object detection has proven to be an effective strategy for aerial imagery-based livestock monitoring, enabling animals to be identified and localized automatically in large-scale environments. Generally speaking, object detection is a computer vision task that involves predicting the locations and categories of objects within an image. With the advancement of deep learning techniques, object detection has been widely deployed in Internet of Things (IoT)-based smart systems, providing strong monitoring and control capabilities [11,12,13,14]. Object detection algorithms are broadly classified into two categories based on the number of times the network processes the input image: two-stage and one-stage detectors [15]. Two-stage detectors, such as R-CNN [16,17] and Faster R-CNN [18,19], first generate region proposals in which objects might be located and then pass these proposals through a second stage for classification and refinement. This approach typically yields high accuracy but is computationally more intensive. In contrast, one-stage detectors, including You Only Look Once (YOLO) [20,21,22] and the Single Shot Multibox Detector (SSD) [23], localize and classify objects in a single pass, enabling faster inference and real-time performance, though sometimes at the cost of reduced accuracy for small or densely packed objects. Further improvements in backbone networks (e.g., ResNet, CSPNet) [22,24] and the adoption of transformer-based architectures [25,26] have improved detection in complex, high-variance scenes. Recent end-to-end anchor-free models, such as Fully Convolutional One-Stage Object Detection (FCOS) [27] and Detection Transformer (DETR) [28], simplify the detection pipeline and better handle small overlapping objects, making them particularly suitable for the demands of aerial livestock monitoring.
Automatic aerial livestock detection can improve data accuracy, reduce operational costs, and support timely decision making, ultimately improving productivity, animal welfare, and environmental sustainability [29,30]. However, detecting and counting livestock in aerial imagery poses several challenges. Livestock often appear as various-scale objects due to the changes in altitude made by drones, making them difficult to distinguish from the background when at high altitude. Variability in animal posture, occlusion by vegetation or structures, and overlapping groups further complicate detection. Environmental factors such as lighting, shadows, terrain texture, and image resolution introduce additional noise [31]. In addition, the limited availability of annotated datasets for aerial livestock imagery hinders the training of robust deep learning models. These factors collectively make the task of accurately and reliably detecting livestock from aerial platforms a non-trivial task.
In this paper, we provide a comprehensive understanding of deep learning approaches for automatic livestock detection in aerial imagery. Specifically, this work aims to achieve the following:
  • Reviewing deep learning object detection techniques, prior applications in UAV-based livestock monitoring, and challenges in aerial livestock detection.
  • Presenting a case study evaluating representative models on a custom UAV cattle dataset, detailing the experimental setup, evaluation experiments, and results analysis.
  • Providing an understanding and recommendations for future research related to livestock detection.
The remainder of the paper is organized as follows: Section 2 reviews the related literature on deep learning-based object detection and aerial livestock applications and challenges. Section 3 introduces a case study that evaluates representative models on a custom UAV cattle dataset. Section 4 provides a discussion and future work on livestock detection. Finally, Section 5 concludes this work.

2. Related Work

In the following, we begin by summarizing key object detection techniques, followed by a review of their applications in livestock monitoring. We then focus specifically on deep learning models used for detecting livestock within aerial imagery collected by UAVs, underlining major challenges such as small object detection and the scarcity of annotated datasets. Finally, we discuss emerging solutions designed to address these limitations and improve detection performance in real-world aerial monitoring scenarios.

2.1. Deep Learning-Based Object Detection Techniques

Object detection techniques involve using algorithms, often neural networks, to recognize and locate objects in visual data, typically images or videos. These techniques aim to detect the objects’ presence and point out their precise location by drawing bounding boxes around objects. The development of object detection has evolved through several stages. Traditional object detection methods depend heavily on hand-crafted features and conventional classifiers, including support vector machines (SVMs). An early example is the sliding-window approach, in which a fixed-size window is moved across the image and features are extracted from each region [32]. The Histogram of Oriented Gradients (HOG) [33] is another feature-based detection method, which extracts features from the orientation of intensity gradients within an image, making it robust to variations in illumination and pose. The Scale-Invariant Feature Transform (SIFT) identifies and describes local image features that are invariant to scale and rotation, making them useful for matching reference positions across images and for object recognition and detection [34]. However, these methods are computationally expensive and often less accurate.
In recent years, deep learning has transformed object detection by significantly improving accuracy, scalability, and real-time performance. Unlike traditional methods that rely on manual features and separate classifiers, deep learning models are capable of automatically learning hierarchical representations from raw data. The breakthrough came with region-based convolutional networks (R-CNNs) [16], which introduced end-to-end feature learning and region proposal mechanisms. Subsequent improvements, including Fast R-CNN [18] and Faster R-CNN [19], optimized training and significantly improved speed. The introduction of one-stage detectors, including YOLO [20] and SSD [23], shifted the focus to real-time applications by eliminating the proposal step and performing direct prediction. More recently, anchor-free models and transformer-based architectures such as FCOS [27], DETR [28], and DINO [35] have pushed the limits of detection precision while simplifying the detection pipeline, particularly in complex and cluttered scenes. In the following, we analyze the core ideas behind typical deep learning techniques.

2.1.1. R-CNN Based

The core idea behind the R-CNN [16] is to break down the tasks of detecting objects into two stages: region proposal and classification. First, the model uses a separate algorithm (such as selective search [36]) to generate a set of potential object regions from the input image. After that, each proposed region is resized and passed through a CNN to extract further features, which are then classified using a separate classifier (e.g., SVM).
  • The R-CNN was the first to successfully apply deep learning to object detection, significantly improving accuracy over traditional methods; however, it was slow due to redundant CNN computations across overlapping regions. Successive R-CNN-based models, including Fast R-CNN [18] and Faster R-CNN [19], streamlined this process.
  • The Fast R-CNN improves on the R-CNN by performing feature extraction on the entire image only once, leveraging a shared convolutional backbone. Instead of processing each region proposal separately, it employs a region-of-interest (RoI) pooling layer to derive fixed-size feature vectors from the shared feature map for every proposal.
  • The Faster R-CNN improves upon previous models by incorporating a Region Proposal Network (RPN) that generates region proposals directly from the feature maps, replacing the slower Selective Search algorithm. The RPN and the detection network share convolutional layers, enabling fully end-to-end training and much faster performance while preserving the high accuracy of two-stage detection.
  • In the subsequent Region-based Fully Convolutional Network (R-FCN) [37], RoI pooling is replaced with position-sensitive score maps to reduce the computation per region while maintaining accuracy.
  • Mask R-CNN [17] further replaces RoI pooling with RoIAlign to improve mask accuracy by preserving spatial alignment.
The R-CNN-based two-stage detectors maintain high accuracy while significantly improving the speed and efficiency of object detection. Nonetheless, R-CNN-based models suffer from some limitations: they rely on multiple components to form multi-stage pipelines, which makes training complex and computationally expensive. Additionally, region proposals are processed individually, resulting in redundant computations and slow inference speeds that are unsuitable for real-time applications.
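To make the two-stage pipeline concrete, the following minimal sketch runs a pre-trained Faster R-CNN from the torchvision library on a single aerial frame; the COCO-pretrained weights, the 0.5 score threshold, and the image file name are illustrative assumptions rather than settings used in any of the surveyed works, and a livestock detector would require fine-tuning on annotated cattle data.

```python
# Minimal sketch: two-stage detection with a pre-trained Faster R-CNN from torchvision.
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = to_tensor(Image.open("aerial_frame.jpg").convert("RGB"))  # hypothetical image file
with torch.no_grad():
    prediction = model([image])[0]  # dict with "boxes", "labels", and "scores"

keep = prediction["scores"] > 0.5   # illustrative confidence threshold
boxes = prediction["boxes"][keep]   # [x1, y1, x2, y2] in pixel coordinates
print(f"{len(boxes)} objects detected")
```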

2.1.2. YOLO

To overcome the challenges of the R-CNN-based models, YOLO was introduced as a single-stage detector that treats object detection as a regression problem [20]. Over time, the YOLO series has evolved significantly, with each version introducing new techniques for predicting bounding boxes and class probabilities directly from the entire image in a single forward pass. This enables real-time detection with a much simpler architecture while steadily improving detection accuracy. The main versions are summarized below, followed by a minimal inference sketch.
  • YOLOv1 [20] introduced the core concept of grid-based detection but struggled with localization accuracy and the detection of small objects.
  • YOLOv2 (YOLO9000) [21] improved accuracy and recall through the use of batch normalization, multi-scale training, as well as anchor boxes.
  • YOLOv3 [22] added residual connections and feature pyramid networks to enhance multi-scale detection, making it more robust for small objects.
  • YOLOv4 [38] and YOLOv5 [39] optimized speed and accuracy trade-offs with advanced backbones (e.g., Cross-Stage Partial Darknet (CSPDarknet)), better data augmentation, and training tricks like mosaic augmentation and complete intersection over union (CIoU) loss.
  • YOLOv6 [40] and YOLOv7 [41] focused on deployment efficiency, especially for edge devices, by introducing lightweight designs and faster inference.
  • YOLOv8 [42] adopts a more modular and flexible design, featuring anchor-free detection, dynamic label assignment, and improved generalization, leading to good performance across various benchmarks.
  • The successive advanced YOLO versions (YOLOv9-11) [43,44,45] reflect a consistent trend of preserving or improving detection accuracy while optimizing speed and model efficiency.
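To illustrate the single-pass workflow in practice, the sketch below assumes the Ultralytics Python package and an illustrative image path; it loads a pre-trained small YOLO checkpoint and reads back class indices, confidence scores, and box coordinates from one forward pass. It is a generic example rather than the exact configuration of any version listed above.

```python
# Minimal sketch: single-pass YOLO inference via the Ultralytics package.
from ultralytics import YOLO

model = YOLO("yolov8s.pt")  # pre-trained small checkpoint (illustrative choice)
results = model.predict("aerial_frame.jpg", conf=0.5, imgsz=640)  # hypothetical image

for r in results:
    for box in r.boxes:
        cls_id = int(box.cls)                   # predicted class index
        score = float(box.conf)                 # confidence score
        x1, y1, x2, y2 = box.xyxy[0].tolist()   # box corners in pixels
        print(cls_id, round(score, 2), [round(v, 1) for v in (x1, y1, x2, y2)])
```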

2.1.3. Anchor Based

Anchor-based object detection was initially introduced in the Faster R-CNN, where fixed-size anchor boxes were used to generate region proposals. A major advancement came with the Single Shot MultiBox Detector (SSD), which extended the anchor concept by applying multiple default boxes of varying scales and aspect ratios across different feature maps. Its single forward pass enables the SSD to perform efficient multi-scale detection, making it considerably faster than two-stage detectors. The SSD established a new direction for real-time object detection and directly influenced the design of later models, such as YOLOv2, which uses k-means clustering to optimize anchor sizes. RetinaNet and YOLOv3 further developed the use of anchors, introducing focal loss and feature pyramids, respectively. The subsequent YOLOv4 and YOLOv5 further optimized anchor settings and training strategies. Although these models improved detection performance, anchor design remained a manual and computationally costly process, particularly for detecting tiny objects, prompting a shift toward anchor-free methods.
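As a simplified illustration of the anchor concept, the sketch below tiles SSD-style default boxes over a single feature map; the grid size, scale, and aspect ratios are illustrative values, and a full SSD repeats this procedure over several feature maps of different resolutions.

```python
# Minimal sketch: generating SSD-style default (anchor) boxes for one feature map.
import itertools

def default_boxes(feature_size, scale, aspect_ratios=(1.0, 2.0, 0.5)):
    """Return center-form boxes (cx, cy, w, h), normalized to [0, 1]."""
    boxes = []
    for i, j in itertools.product(range(feature_size), repeat=2):
        cx = (j + 0.5) / feature_size   # box centers follow the grid cells
        cy = (i + 0.5) / feature_size
        for ar in aspect_ratios:
            boxes.append((cx, cy, scale * ar ** 0.5, scale / ar ** 0.5))
    return boxes

# e.g., a 38 x 38 feature map with small-scale boxes for small, densely packed animals
anchors = default_boxes(feature_size=38, scale=0.1)
print(len(anchors), "default boxes")  # 38 * 38 * 3 = 4332
```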

2.1.4. DETR (DEtection TRansformer)

DETR replaces traditional anchor-based detection mechanisms with transformer-based attention, allowing the model to learn object relationships directly using self-attention [46]. It provides an end-to-end solution in which object detection is treated as a direct set prediction problem. It also eliminates the need for hand-designed components (e.g., anchor boxes and non-maximum suppression (NMS)), resulting in a conceptually simpler yet powerful model. Although it requires more training time and higher computational resources, DETR has a forward-looking architecture that is suitable for integration into digital twin systems, where object states, interactions, and environment simulations are continuously updated.
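As an illustrative sketch of this set-prediction workflow, the following code runs a publicly released DETR checkpoint through the Hugging Face transformers library; the checkpoint name, image path, and confidence threshold are assumptions for demonstration and are unrelated to the models evaluated later in this paper.

```python
# Minimal sketch: end-to-end set prediction with a pre-trained DETR (no anchors, no NMS).
import torch
from PIL import Image
from transformers import DetrImageProcessor, DetrForObjectDetection

processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")

image = Image.open("aerial_frame.jpg").convert("RGB")  # hypothetical image file
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert the fixed set of object queries into thresholded boxes and labels.
target_size = torch.tensor([image.size[::-1]])  # (height, width)
detections = processor.post_process_object_detection(
    outputs, threshold=0.7, target_sizes=target_size
)[0]
for score, label, box in zip(detections["scores"], detections["labels"], detections["boxes"]):
    print(model.config.id2label[int(label)], round(float(score), 2), box.tolist())
```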

2.2. Applications in Aerial Livestock Detection

Deep learning techniques have demonstrated remarkable success in handling visual data and now serve as a core technology in many computer vision applications. Recently, they have become an integral component in domains such as agriculture [47], object tracking [48], video surveillance [49], autonomous driving [50], and healthcare [51], among others. Figure 1 illustrates various applications of livestock detection in real-world settings; these applications are discussed below.

2.2.1. Livestock Identification and Classification

Livestock identification and classification focus on detecting and distinguishing livestock in different environmental settings. These approaches utilize various object detection and classification models for herd management, breed monitoring, and behavior tracking. Deep learning techniques, particularly in aerial imagery-based livestock applications, have shown strong potential to automate these tasks. For example, Gowri et al. [52] introduced a remote sensing-assisted, satellite image-based identification technique, the Animal Sensing and Classification Model (LASCM), which simplifies animal identification from aerial imagery and uses a random forest model for cross-validation. Gaurav et al. [53] proposed a hybrid model that integrates CNN and long short-term memory (LSTM) networks for animal detection using UAV imagery, achieving robust detection performance. Martha et al. [54] compared YOLOv5, YOLOv7, and YOLOv8 for real-time cow detection on a dataset containing 11,828 images and videos from an indoor dairy cattle barn. Their evaluation reports overall accuracies of 94.7% and 93% before and after augmentation, and 92.9% mAP@50 after fine-tuning YOLOv8 with RayTune. Mücher et al. [55] proposed deep learning models for cattle detection, individual identification, and posture recognition using three trials of satellite, aerial, and UAV imagery collected in Poland and The Netherlands. These models achieved classification accuracies of 95% for cattle detection, 91% for individual identification, and 88% for posture recognition. Gunawan et al. [56] compared YOLOv3, YOLOv4, and YOLOv5 for object detection in UAV images, implementing lightweight versions of YOLOv3 and YOLOv5s to extract object features and perform multi-class classification on the VisDrone 2019 dataset. Li et al. [57] investigated a multi-scale model that integrates a CMT module to better capture multi-scale features (covering posture variations and differences in animal appearance) for cattle recognition. Ismail et al. [58] compared deep learning models, including YOLO-NAS, YOLOv5, and DETR, for sheep detection in aerial imagery; their evaluation showed that both YOLO-NAS and YOLOv5 achieved an accuracy of 97%.

2.2.2. Livestock Counting and Density Estimation

This application emphasizes accurately quantifying animal populations while addressing challenges such as occlusions, overlapping instances, and varying image resolutions caused by different altitudes. Lishchynskyi et al. [59] introduced a trajectory-agnostic livestock counting method using UAV imaging to resolve occlusion in crowded scenes. Despite its good performance, the approach has limitations such as counting errors, double counting, and missed targets. Fang et al. [60] introduced CCS-YOLOv8, incorporating CBAM, CARAFE, and a small object detection layer to enhance detection in dense scenes. The study integrates drones, provides proactive alerts, and expands datasets to various grasslands, achieving 84.1% precision and 82.2% recall. Ocholla et al. [61] evaluated nine models for cattle detection using 1039 high-resolution UAV images to improve detection under occlusion and variable conditions. Their comparative experiments indicate that YOLOv8m achieved the best results but with the highest counting error rate. Rivas et al. [62] introduced a dense cattle detection method using pixel-level segmentation and classification on a custom aerial dataset; U-Net was used for segmentation and Inception-v4 for ROI classification, achieving 0.893 accuracy, outperforming YOLOv3 (0.83) and slightly surpassing the Faster R-CNN (0.891). Likewise, Robinson et al. [63] proposed CowNet for recognizing and counting cattle and elk in satellite imagery. The proposed approach handles the uncertainty associated with noisy data and achieves an average precision of 0.56 to 0.61 using the LC-FCN model.

2.2.3. Real-Time Detection, Surveillance, and Health Monitoring

Real-time applications enable animals to be tracked, their health assessed, and their behavior detected, which improves response times and supports on-site and remote livestock monitoring decision-making. To this end, Wang et al. [64] designed an improved YOLOv5 model integrating bi-level routing attention to enhance detection accuracy, achieving 94.2% mAP. Myint et al. [65] investigated a single-side-view lameness detection system using YOLOv8 and the R-CNN from Detectron2, demonstrating efficacy for real-time livestock health management. Hang et al. [66] developed a CNN-based real-time animal detection platform that is adaptable to multiple species. It utilized multirotor UAVs for cattle detection and supported multi-brand drones to improve clustering accuracy and enable GPS-independent tracking via object attributes. Barrios et al. [67] applied the Multi-Object Tracking and Segmentation (MOTS) framework to recognize, track, and count dairy cattle in thermal imagery collected by UAVs. This method employs 3D convolutions and an association head, outperforming LSTM and optical flow approaches in complex and low-motion scenarios. Noe et al. [68] proposed deep learning techniques that advanced precision livestock tracking by monitoring black cattle, employing DETIC, the ByteTrack model, and YOLOv8 for automatic livestock labeling; this work also emphasizes the importance of real-time monitoring for sustainable agriculture, early disease detection, and grazing optimization. Likewise, Kurniadi et al. [69] designed a YOLOv5 algorithm for livestock surveillance based on UAV imagery and videography, focusing on the real-time detection and tracking of livestock. It addressed livestock monitoring challenges such as lighting variation, occlusion, and dataset size, achieving 94.3% precision and 83.1% mAP.

2.2.4. Smart Farming, Ecosystem Monitoring, and Model Generalization

This application covers integrative systems such as IoT-based farm management, environmental monitoring, and approaches designed for cross-domain adaptability. It also includes efforts to build generalizable models that account for geographic and ecological variations. Asdikian et al. [70] evaluated YOLOv8 and YOLOv9 for real-time wildlife detection on a custom dataset augmented with different lighting conditions and color space augmentation; the results show that combining YOLOv9 with color space augmentation is a promising solution. Similarly, Xu et al. [71] reviewed CNNs, region-based CNNs (R-CNNs), and You Only Look Once (YOLO) models for wildlife and livestock detection. The review highlighted livestock challenges, including limited annotated datasets and poor model generalization in diverse environments. Additionally, it emphasized the benefits of deep learning techniques over traditional approaches and suggested future research directions, such as lightweight models for real-time applications and the integration of multi-modal data. Das et al. [48] studied model generalization using YOLOv8 and YOLOv9 on the Cow Localization (COLO) dataset to evaluate indoor cow detection performance, also highlighting the challenge of cow detection from high-altitude imagery. Tian et al. [72] introduced a YOLOv5 CNN that optimizes the candidate frame center point algorithm for better accuracy in detecting and classifying captive pigs; this optimization supports IoT-based farm supervision and achieves mean average precision (mAP) at different thresholds (0.5 and 0.5–0.95). Korkmaz et al. [73] explored Industry 4.0 image processing technologies for wild and farm animal detection to enhance meat safety and sustainability in livestock operations. This method minimizes wildlife threats, improves livestock security, and advances livestock monitoring toward more efficient animal husbandry. Likewise, Jain et al. [74] proposed deep learning-based animal detection and tracking to improve animal safety and reduce the financial losses caused by animal–vehicle incidents.
Automated and accurate detection of livestock, especially from aerial imagery, is one of the most critical capabilities in modern precision agriculture. However, it presents numerous challenges that differ from traditional object detection tasks, influencing not only classification but also object localization, counting, and real-time deployment. Addressing these problems is necessary to develop scalable and robust livestock monitoring systems that allow farmers to monitor animal populations, health, and behavior with minimal human intervention.
The following summarizes the key challenges of livestock detection indicated in some related research works:
  • Small object size: Livestock often occupies only a few pixels in high-altitude imagery, reducing feature richness and detection accuracy.
  • Occlusion and overlapping: Animals may be covered by vegetation, landscape features, or each other, complicating object separation and identification.
  • Variable lighting and weather conditions: Changes in sunlight, shadows, cloud cover, or fog can degrade image quality and model reliability.
  • Posture and orientation variability: Animals appear in different poses and angles, requiring models to generalize across a wide range of visual patterns.
  • Background complexity: Natural topography, like grasslands, forests, or drylands, introduces visual noise that can lead to false positives.
  • Limited annotated datasets: High-quality, labeled aerial livestock datasets are scarce, limiting supervised learning and model generalization.
  • Domain adaptability: This refers to the issue that models trained in one region may not perform well elsewhere without retraining or adaptation to the new domain.
  • Real-time inference constraints: UAV-based systems require efficient models that can run on limited hardware while maintaining high accuracy.
  • Counting and tracking errors: Challenges such as double counting, missed detections, and dynamic group movement impact population estimation and monitoring.

3. Case Study: Evaluation of the Object Detection Models on a Custom UAV Cattle Dataset

To better understand existing object detection approaches for UAV-based livestock detection, we conduct a practical case study to validate the deployability of these methods in real-world scenarios. We evaluate a set of representative object detection models on a custom aerial cattle dataset collected using UAV imagery. The experiments assess detection accuracy, computational efficiency, and real-time feasibility, offering insights into the trade-offs between low-resource and high-capacity models.

3.1. Data Collection and Pre-Processing

To support the development of an automated livestock detection system, a custom aerial image dataset targeting cattle in open-field environments was gathered using a drone. The data collection took place across multiple locations in Gombe State, Nigeria, an area characterized by heterogeneous geography, open grazing land, and varying lighting conditions. These factors are representative of real-world scenarios in livestock monitoring. A DJI drone equipped with an FC7303 camera was used to capture high-resolution images of livestock. To ensure consistency across all collected images, the camera settings were standardized with a focal length of 4 mm, an aperture of f/2.8, an ISO speed of 100, and a quick shutter speed of 1/800 s; each image was captured at a resolution of 4000 × 2250 pixels in the sRGB color space with 24-bit color depth. The drone flew at an altitude of approximately 392 m. As an example, the GPS coordinates of one image were 10°16′34.79″ N latitude and 11°19′5.53″ E longitude, marking the study site. The data collection process was kept consistent, ensuring uniform resolution, lighting, and camera settings throughout. The images were saved in JPEG format for processing, each at least 7 MB in size. We captured a variety of viewpoints and different distributions of livestock to ensure the dataset was as diverse and comprehensive as possible.
The dataset comprises 270 aerial images, each featuring varying numbers of cattle against diverse backgrounds, including roads, fences, trees, and natural environments. Figure 2 displays four representative samples from the dataset. Each image is accompanied by metadata, including GPS coordinates, altitude, and timestamp, enabling future spatio-temporal analysis such as behavior tracking and heatmap generation. To ensure consistency and compliance with the YOLO input format, all images were manually annotated with the LabelImg tool. The tool generates a corresponding annotation file for each image, saved in YOLO-compatible text format, in which each line describes one object instance with the cattle class label followed by the normalized bounding-box coordinates (x-center, y-center, box width, and box height).
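For reference, the short sketch below shows how one such annotation line can be produced from a pixel-space bounding box; the box coordinates and image dimensions are illustrative, and class id 0 denotes the single cattle class.

```python
# Minimal sketch: converting a pixel-space box (x1, y1, x2, y2) into a YOLO label line.
def to_yolo_line(class_id, x1, y1, x2, y2, img_w, img_h):
    x_center = (x1 + x2) / 2.0 / img_w   # normalized box center
    y_center = (y1 + y2) / 2.0 / img_h
    box_w = (x2 - x1) / img_w            # normalized box width and height
    box_h = (y2 - y1) / img_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {box_w:.6f} {box_h:.6f}"

# One cow annotated in a 4000 x 2250 image (single class, id 0); values are illustrative.
print(to_yolo_line(0, 1850, 1020, 1930, 1110, img_w=4000, img_h=2250))
```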
The characteristics of the dataset are as follows: (i) Labeling is single-class, focused on cattle. (ii) The animals appear at different sizes depending on altitude and image resolution. (iii) Environmental conditions are diverse, with varied backgrounds, occlusions, and lighting. (iv) Some animals overlap, reflecting real-world scenarios. (v) Object locations are approximately uniformly distributed across the frame to minimize model bias toward center-located objects.

3.2. Evaluation Metrics

To evaluate the accuracy, computational efficiency, and real-time feasibility among these deep learning-based models for livestock detection tasks in practical agricultural scenarios, we employed a set of object detection metrics, which collectively measure the model’s ability to accurately detect, localize, and count cattle in high-resolution aerial imagery captured by UAVs. Below, we list the metrics that will be used in this work.
Precision measures the proportion of correct positive detections. In the context of livestock detection, it quantifies how many of the identified cattle are indeed present in the image. The formula to calculate the precision is given below:
$$\text{Precision} = \frac{TP}{TP + FP}$$
We used recall to evaluate the model’s capability to detect all actual cattle in an image. A high recall score indicates that the model successfully identified most of the cows present. The formula to evaluate recall is given as:
$$\text{Recall} = \frac{TP}{TP + FN}$$
The F1 score is the harmonic mean of precision and recall. It is advantageous when a balance between false positives and false negatives must be maintained, as is often the case in livestock monitoring applications, where both overcounting and undercounting lead to inaccurate population assessments. The F1 score is calculated by:
$$\text{F1 score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
To evaluate both the classification and localization performance, we used the mean average precision (mAP) metrics at two standard intersection-over-union (IoU) thresholds.
  • mAP@0.5: Measures precision at a relaxed threshold where IoU ≥ 0.5.
  • mAP@0.5:0.95: A more stringent metric averaging mAP over 10 IoU thresholds ranging from 0.5 to 0.95.
The IoU is defined as
$$\text{IoU} = \frac{\text{Area of Overlap}}{\text{Area of Union}}$$
To measure resource requirements and model scalability, we also report the model size and floating-point operations (FLOPs) for each model. The model size indicates the memory and storage footprint, which is relevant for onboard UAV deployment. We use the number of parameters to assess model size, while FLOPs quantify the computational complexity of the model.
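The per-detection metrics above can be computed directly from the counts of true positives, false positives, and false negatives and from corner-format box coordinates, as in the minimal sketch below; mAP additionally averages precision over recall levels and IoU thresholds, so in practice it is obtained from an established evaluator (e.g., the COCO API or the Ultralytics validator) rather than re-implemented. The counts and boxes used here are illustrative only.

```python
# Minimal sketch of precision, recall, F1, and IoU from raw counts and boxes.
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def iou(box_a, box_b):
    """Intersection over union for boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
             + (box_b[2] - box_b[0]) * (box_b[3] - box_b[1]) - inter)
    return inter / union if union else 0.0

print(precision_recall_f1(tp=46, fp=3, fn=4))   # illustrative detection counts
print(iou((10, 10, 60, 60), (30, 30, 90, 90)))  # partially overlapping boxes, IoU ~ 0.17
```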

3.3. Experiments and Evaluation

To validate the effectiveness of deep learning approaches for UAV-based livestock monitoring, we conducted three complementary experiments in the case study. The first experiment evaluates multiple YOLOv12 variants to understand the trade-offs between model size, accuracy, and computational efficiency. The second experiment compares the latest YOLOv12 with other YOLO families. Finally, the third experiment benchmarks the best-performing YOLOv12 variant against representative object detection models, highlighting relative strengths and weaknesses in UAV-based livestock detection.
All experiments are evaluated on our custom aerial cattle dataset described in Section 3.1. The dataset was randomly divided into three parts: 70% for training, 15% for validation, and 15% for testing, each including the corresponding images and labels (annotation files) in the YOLO format, to ensure robust model assessment and prevent overfitting. All models are implemented using PyTorch 2.5.0, with GPU-accelerated training on an NVIDIA A100 GPU. The hyperparameter configuration is shown in Table 1.
The hyperparameters used in this work were selected through a combination of prior recommendations from YOLO-based architectures and empirical fine-tuning on our dataset. We trained all models for 100 epochs with a batch size of 16 to balance convergence stability against available GPU memory. The input image resolution of 640 × 640 was adopted from the standard YOLO models. The learning rate, momentum, and close-mosaic settings were chosen in line with the default configurations of the YOLOv12 framework, which also provided stable training dynamics in our experiments.
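As a minimal sketch of this training setup, the code below assumes the Ultralytics package and a hypothetical cattle.yaml dataset configuration pointing at the 70/15/15 split described above; the checkpoint identifier is illustrative and may differ between releases.

```python
# Minimal sketch: training a small YOLOv12 variant with the hyperparameters from Table 1.
from ultralytics import YOLO

model = YOLO("yolo12s.pt")      # small variant; name as exposed by recent Ultralytics releases
model.train(
    data="cattle.yaml",         # hypothetical dataset config: split paths + one 'cattle' class
    epochs=100,                 # training epochs used in this study
    batch=16,                   # batch size used in this study
    imgsz=640,                  # standard YOLO input resolution
    device=0,                   # single GPU (e.g., the NVIDIA A100 used here)
)
```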

3.3.1. Experiment I—YOLOv12 Variants Evaluation

YOLOv12 is a recently proposed attention-centric object detection framework that integrates lightweight convolutional operations with efficient attention mechanisms, achieving real-time performance while surpassing previous YOLO versions in detection accuracy. Its architecture introduces modules such as refined ELAN blocks and position perceivers, which improve multi-scale feature extraction and robustness in complex visual scenes, making it particularly suitable for UAV-based livestock monitoring.
To accommodate diverse deployment scenarios, YOLOv12 is released in five standard variants: YOLOv12-nano (n), small (s), medium (m), large (l), and extra-large (x). These variants share the same architectural principles but differ in model depth, width scaling, and parameter size, resulting in different trade-offs between accuracy, inference speed, and computational resource requirements. For instance, YOLOv12n is optimized for lightweight deployment on resource-constrained UAVs or embedded devices, while YOLOv12x is designed to maximize detection accuracy, suitable for offline or server-side processing.
In this experiment, we systematically evaluate all five YOLOv12 variants on our custom UAV cattle dataset. The goals are as follows: (1) to quantify the performance trade-offs between speed and accuracy across different model sizes; (2) to identify the optimal variant for UAV-based livestock monitoring, where real-time feasibility and energy efficiency are critical; and (3) to provide deployment recommendations by analyzing which variant best balances accuracy, memory efficiency, and low latency for specific UAV applications.
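A minimal sketch of this variant comparison is given below, assuming trained weights for each size and the same hypothetical cattle.yaml configuration; the metric and latency fields follow the Ultralytics validator's reporting and are placeholders until the actual runs are performed.

```python
# Minimal sketch: validating each YOLOv12 size and recording accuracy and per-image latency.
from ultralytics import YOLO

results = {}
for size in ("n", "s", "m", "l", "x"):
    model = YOLO(f"yolo12{size}.pt")                 # illustrative checkpoint names
    m = model.val(data="cattle.yaml", imgsz=640)     # hypothetical dataset config
    results[size] = {
        "mAP50": round(m.box.map50, 3),
        "mAP50-95": round(m.box.map, 3),
        "inference_ms": round(m.speed["inference"], 1),  # per-image inference time
    }
print(results)
```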

3.3.2. Experiment II—YOLO Family Comparison

In the second experiment, we compare the best-performing YOLOv12 variant (determined from Experiment I) with representative versions from the YOLO family to highlight architectural improvements across generations. Specifically, we include YOLOv5, YOLOv7, and YOLOv8 as baselines. YOLOv5 introduced the widely adopted scaling strategy (n, s, m, l, x) and remains a practical benchmark in UAV and agricultural applications. YOLOv7 represents the peak of anchor-based CNN architectures, achieving strong accuracy through innovations such as the E-ELAN backbone and model re-parameterization. YOLOv8 marked a major transition to anchor-free detection, incorporating decoupled heads and enhanced C2f blocks for improved small-object detection and run-time inference, making it a direct predecessor to YOLOv12. Together, these versions capture the key milestones in the YOLO evolution, providing meaningful historical and technical context for evaluating YOLOv12.
To summarize the differences, Table 2 presents a side-by-side comparison of the YOLO family versions considered in this experiment.

3.3.3. Experiment III—Benchmarking Against Other Models

The third experiment broadens the comparison beyond the YOLO family. We benchmark the selected YOLOv12 variant against representative object detection architectures from different design paradigms to evaluate how YOLOv12 compares to both traditional CNN-based detectors and emerging transformer-based models in UAV livestock detection.
The compared models are as follows:
  • SSD (Single Shot Multibox Detector)—A classical one-stage detector, included as a baseline for real-time object detection.
  • Faster R-CNN—A two-stage detector known for high accuracy, providing a comparison against precision-focused architectures.
  • YOLOv7-tiny—A lightweight detector optimized for resource-constrained environments from YOLOv7, offering fast inference at the cost of reduced accuracy.
  • CenterNet [75]—An anchor-free, keypoint-based detector known for its robustness in small-object detection, making it relevant for aerial livestock imagery.
  • RT-DETR (Real-Time DEtection TRansformer) [76]—A recent transformer-based detector balancing accuracy and efficiency, included to assess the potential of transformer backbones for UAV-based livestock monitoring.

3.4. Results Analysis

We now present a detailed analysis of the results from our case study on the UAV cattle dataset. The findings are reported for the three experiments, focusing on detection accuracy, computational efficiency, and suitability for real-time deployment on UAVs.

3.4.1. Performance of YOLOv12 Variants

Table 3 presents the performance of five YOLOv12 variants on the custom UAV cattle dataset. Across all models, precision, recall, F1 score, and mAP metrics remain consistently high, with only marginal differences between variants. This consistency suggests that all YOLOv12 versions are sufficiently capable of handling livestock detection under our dataset conditions. However, differences emerge in computational efficiency and real-time feasibility. YOLOv12-n delivers the fastest inference (4.5 ms) but at the cost of slightly lower stability in detection accuracy. YOLOv12 l and YOLOv12-x achieve top accuracy but come with higher model complexity and resource requirements.
YOLOv12 s stands out as the most balanced option, offering competitive accuracy with only a modest latency increase (8.1 ms) and moderate model size (9.23 M parameters). This balance between accuracy, speed, and efficiency makes YOLOv12 s the recommended choice for UAV-based livestock detection in subsequent experiments, particularly for entry-level UAV platforms where onboard compute and energy resources are limited. Given its strong overall performance, YOLOv12 s is chosen as the baseline model for further comparison against other architectures in Experiments II and III.

3.4.2. Performance of YOLO Family Comparison

Table 4 compares YOLOv12 s, the variant selected in Experiment I, against representative YOLO family models, including YOLOv5 s, YOLOv7 (YOLOv7-Tiny, the smaller YOLOv7 version, is compared with YOLOv12 s in Experiment III), and YOLOv8 s. Across all models, precision, recall, and F1 scores are consistently high (greater than 0.93), confirming that each detector can reliably identify livestock with minimal false positives or false negatives. YOLOv7 achieves the highest mAP@50 (0.99) and ties with YOLOv5 s for the top mAP@50-95 (0.64). However, this comes with a significantly larger parameter count (37.19 M) and high computational demands (105.1 GFLOPs), which restricts its use on UAV platforms with limited resources.
YOLOv8 s offers a strong balance, achieving the highest mAP@50-95 (0.66) among all models while keeping parameters (11.13 M) and FLOPs (28.4 G) relatively moderate. However, its extremely low reported inference time (0.9 ms) appears unusually fast compared to the others, suggesting potential differences in benchmarking conditions. YOLOv5 s matches YOLOv12 s in overall accuracy and has fewer parameters (7.01 M) and lower FLOPs (15.8 G), although its inference time (102.5 ms) is the slowest in the group. YOLOv12 s maintains competitive detection performance (mAP@50-95 of 0.63) while offering a relatively low computational load (9.23 M parameters, 21.2 GFLOPs) and a fast inference time (8.1 ms), making it well suited for real-time UAV deployment, where speed and efficiency are critical.
Overall, while YOLOv8 s slightly edges ahead in fine-grained accuracy (mAP@50-95), YOLOv12 s provides a more balanced trade-off between detection performance, latency, and resource efficiency. This makes YOLOv12 s the preferred choice for subsequent experiments in UAV-based livestock monitoring.

3.4.3. Performance of Benchmarking Against Other Models

Table 5 compares YOLOv12 s against a broader set of popular object detection architectures, including SSD (MobileNetV2), Faster R-CNN (ResNet50-FPN), CenterNet (ResNet50 V1-FPN), YOLOv7-Tiny, and RT-DETR.
From an accuracy standpoint, YOLOv12 s and RT-DETR are tied for the highest mAP@50 (0.97), with RT-DETR also leading slightly in mAP@50-95 (0.65 vs. 0.63). The Faster R-CNN delivers competitive accuracy (mAP@50 of 0.94, mAP@50-95 of 0.63) but at the cost of large model size (41.53 M parameters) and high FLOPs (216.58 G), which makes it less suitable for UAV deployment. SSD and YOLOv7-Tiny show notably lower mAP@50-95 (0.21 and 0.33, respectively), limiting their effectiveness in fine-grained livestock detection. In terms of computational efficiency, YOLOv12 s demonstrates a strong balance with only 9.23 M parameters, 21.2 GFLOPs, and an 8.1 ms inference time, making it highly suitable for real-time UAV operation. The RT-DETR is faster (3.2 ms) but requires more than triple the FLOPs (103.4 G), potentially impacting battery life and processing efficiency in edge devices. In contrast, CenterNet’s inference time (1051.96 ms) is prohibitively slow for real-time applications, despite respectable accuracy. When considering real-time feasibility, YOLOv12 s stands out as a balanced performer, achieving high detection accuracy, competitive latency, and low computational load. The RT-DETR could be a viable alternative for deployments with higher computational budgets. Overall, the results confirm that YOLOv12 s offers the best trade-off between detection precision, computational cost, and deployment readiness, making it the preferred choice for UAV-based livestock detection in subsequent experiments and real-world applications.
We also plot the detection results on a sample image using the benchmarked models. Figure 3 shows the detection results, highlighting the disparity in performance between these models when applied to aerial imagery for livestock detection. We observe that the detection results on this image are consistent with the results in Table 5. The YOLOv12 s model and RT-DETR produce consistently accurate detections with high confidence scores, successfully identifying all cows across various scenes, even those that are partially unclear, overlapped, or located in low-contrast areas. The objects are well aligned within their corresponding bounding boxes, with confidence scores attached and no missed targets, demonstrating strong reliability under different conditions. In contrast, the other models, including SSD, CenterNet, YOLOv7-Tiny, and Faster R-CNN, produce more inconsistent detections; while many of the cows are correctly detected, each model irregularly misses some, particularly those that are smaller or more distant. This shows that they struggle to detect all objects in the image, especially when the objects are not prominently visible or appear in groups. YOLOv12 s and RT-DETR offer higher detection performance and robustness, handling these challenges well thanks to their advanced architectures and improved feature extraction, which makes them more dependable for real-time livestock monitoring. In applications such as health surveillance or herd management, where precision and reliability are crucial, the higher consistency of the YOLOv12 s model can lead to more reliable data and improved decision-making in the field, and RT-DETR can serve as an alternative to YOLOv12 s in these applications.

4. Discussion and Future Research Directions

The experimental results demonstrate that most of the YOLO detection models and RT-DETR achieve effective detection accuracy on the cattle dataset. Notably, YOLOv12 s performs particularly well in terms of computational efficiency and real-time feasibility, making it a strong candidate for UAV-based livestock monitoring. Compared to other deep learning models, YOLOv12 s offers consistent detection performance while maintaining low inference latency and reduced computational overhead, which are critical for UAV deployments where onboard processing power and battery life are limited.
However, several challenges remain to translate these experimental results into scalable, real-world solutions. For instance, the current dataset is limited in geographic diversity, which may impact the model’s generalization capability when deployed in different terrains or with various livestock breeds. In addition, the noise produced by UAVs can disturb or stress the animals, which potentially affects both the cattle’s behavior and the detection accuracy. Moreover, while YOLOv12 s is lightweight compared to most detectors, further model compression and hardware optimization could extend flight time and allow deployment on even smaller UAV platforms.
Future work should focus on integrating the detection pipeline into a digital twin system [77], which could provide substantial benefits for precision agriculture. A digital twin of the farm environment would allow continuous synchronization between physical herds and their virtual replicas, supporting real-time simulation, data-driven decision-making, dynamic monitoring, adaptive management strategies, and predictive analytics. In practice, this means UAV-based monitoring could automatically inform grazing rotation schedules, detect early signs of illness through behavior analysis, and optimize the distribution of resources such as feed and water. The digital twin could also enhance detection accuracy through multi-modal sensor fusion on the physical side, help optimize the model for edge devices with limited resources, extend its generalization to different livestock species and environments, and incorporate behavior and health analysis to support predictive livestock management and sustainable agricultural practices. Ultimately, interdisciplinary collaboration among agriculture, computer vision, and systems engineering will be essential in transforming these technologies from academic prototypes into large-scale, field-ready solutions.
Furthermore, future research could concentrate on the following:
  • Domain adaptation and transfer learning to improve model generalization across diverse environments.
  • Multi-modal data fusion, incorporating RGB, thermal, and multispectral imagery for enhanced health monitoring and behavior analysis.
  • Onboard AI acceleration using edge TPUs or lightweight transformer backbones for more efficient UAV processing.
  • Autonomous UAV flight planning integrated with the digital twin to dynamically adjust flight routes based on real-time livestock distribution and anomalies.
  • Integration of UAV-based detection with digital twin technology, which can transform livestock management by enabling proactive decision-making, reducing operational costs, and improving animal welfare.

5. Conclusions

This paper has presented a comprehensive literature review together with a case study on automatically detecting livestock in aerial imagery using deep learning techniques. The case study results reveal that YOLOv12 s achieves excellent results for livestock detection on a custom aerial cattle dataset, showing strong accuracy, robustness, and scalability while also achieving high precision and fast inference. At the same time, the comparison with other models highlights that RT-DETR is a competitive alternative, as it achieves the best results on several performance metrics. Furthermore, the case study provides a practical framework for flexible model selection, where the choice can be guided by the criteria that matter most in a given scenario. For example, if computational resources are constrained, YOLOv7-tiny offers an effective lightweight alternative. Overall, this work shows that deep-learning-based UAV monitoring systems provide the adaptability needed for diverse environmental conditions, paving the way for sustainable, scalable, and intelligent livestock management solutions.

Author Contributions

Conceptualization, M.A. and Q.L.; methodology, M.A., W.Y. and Q.L.; software, M.A.; validation, M.A. and Q.L.; formal analysis, M.A. and Q.L.; data preparation, M.A.; writing—original draft preparation, M.A.; writing—review and editing, M.A., J.S., W.Y. and Q.L.; visualization, M.A.; supervision, W.Y. and Q.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Said, M.I. The role of the livestock farming industry in supporting the global agricultural industry. In Agricultural Development in Asia—Potential Use of Nano-Materials and Nano-Technology; IntechOpen: London, UK, 2022. [Google Scholar]
  2. Qian, M.; Qian, C.; Yu, W. Chapter 11—Edge intelligence in smart agriculture CPS. In Edge Intelligence in Cyber-Physical Systems; Yu, W., Ed.; Intelligent Data-Centric Systems; Academic Press: Cambridge, MA, USA, 2025; pp. 265–291. [Google Scholar] [CrossRef]
  3. FAO. World Food and Agriculture; Pocketbook, FAOSTATFAO Statistical; FAO: Rome, Italy, 2015. [Google Scholar]
  4. Aidara-Kane, A.; Angulo, F.J.; Conly, J.M.; Minato, Y.; Silbergeld, E.K.; McEwen, S.A.; Collignon, P.J.; on behalf of the WHO Guideline Development Group. World Health Organization (WHO) guidelines on use of medically important antimicrobials in food-producing animals. Antimicrob. Resist. Infect. Control 2018, 7, 7. [Google Scholar] [CrossRef]
  5. Jebali, C.; Kouki, A. A proposed prototype for cattle monitoring system using RFID. In Proceedings of the 2018 International Flexible Electronics Technology Conference (IFETC), Ottawa, ON, Canada, 7–9 August 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 1–3. [Google Scholar]
  6. Sarwar, F.; Pasang, T.; Griffin, A. Survey of livestock counting and tracking methods. Nusant. Sci. Technol. Proc. 2020, 150–157. [Google Scholar] [CrossRef]
  7. Chamoso, P.; Raveane, W.; Parra, V.; González, A. UAVs applied to the counting and monitoring of animals. In Proceedings of the Ambient Intelligence-Software and Applications: 5th International Symposium on Ambient Intelligence, University of Salamanca, Salamanca, Spain, 4–6 June 2014; Springer: Berlin/Heidelberg, Germany, 2014; pp. 71–80. [Google Scholar]
  8. Liang, H.; Gao, W.; Nguyen, J.H.; Orpilla, M.F.; Yu, W. Internet of Things Data Collection Using Unmanned Aerial Vehicles in Infrastructure Free Environments. IEEE Access 2020, 8, 3932–3944. [Google Scholar] [CrossRef]
  9. Chrétien, L.P.; Théau, J.; Ménard, P. Visible and thermal infrared remote sensing for the detection of white-tailed deer using an unmanned aerial system. Wildl. Soc. Bull. 2016, 40, 181–191. [Google Scholar] [CrossRef]
  10. Ji, W.; Luo, Y.; Liao, Y.; Wu, W.; Wei, X.; Yang, Y.; He, X.Z.; Shen, Y.; Ma, Q.; Yi, S.; et al. UAV assisted livestock distribution monitoring and quantification: A low-cost and high-precision solution. Animals 2023, 13, 3069. [Google Scholar] [CrossRef]
  11. Hatcher, W.G.; Yu, W. A Survey of Deep Learning: Platforms, Applications and Emerging Research Trends. IEEE Access 2018, 6, 24411–24432. [Google Scholar] [CrossRef]
  12. Liang, F.; Yu, W.; Liu, X.; Griffith, D.; Golmie, N. Toward Edge-Based Deep Learning in Industrial Internet of Things. IEEE Internet Things J. 2020, 7, 4329–4341. [Google Scholar] [CrossRef]
  13. Song, J.; Guo, Y.; Yu, W. Chapter 10—Edge intelligence in smart home CPS. In Edge Intelligence in Cyber-Physical Systems; Yu, W., Ed.; Intelligent Data-Centric Systems; Academic Press: Cambridge, MA, USA, 2025; pp. 247–264. [Google Scholar] [CrossRef]
  14. Idama, G.; Guo, Y.; Yu, W. QATFP-YOLO: Optimizing Object Detection on Non-GPU Devices with YOLO Using Quantization-Aware Training and Filter Pruning. In Proceedings of the 2024 33rd International Conference on Computer Communications and Networks (ICCCN), Kailua-Kona, HI, USA, 29–31 July 2024; pp. 1–6. [Google Scholar] [CrossRef]
  15. Zou, Z.; Chen, K.; Shi, Z.; Guo, Y.; Ye, J. Object detection in 20 years: A survey. Proc. IEEE 2023, 111, 257–276. [Google Scholar] [CrossRef]
  16. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Region-based convolutional networks for accurate object detection and segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 38, 142–158. [Google Scholar] [CrossRef]
17. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
18. Girshick, R. Fast R-CNN. arXiv 2015, arXiv:1504.08083. [Google Scholar] [CrossRef]
  19. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [Google Scholar] [CrossRef]
20. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 779–788. [Google Scholar]
  21. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271. [Google Scholar]
22. Redmon, J.; Farhadi, A. YOLOv3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar] [CrossRef]
23. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part I 14. Springer: Berlin/Heidelberg, Germany, 2016; pp. 21–37. [Google Scholar]
  24. Wang, C.Y.; Liao, H.Y.M.; Wu, Y.H.; Chen, P.Y.; Hsieh, J.W.; Yeh, I.H. CSPNet: A new backbone that can enhance learning capability of CNN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 13–19 June 2020; pp. 390–391. [Google Scholar]
  25. Beal, J.; Kim, E.; Tzeng, E.; Park, D.H.; Zhai, A.; Kislyuk, D. Toward transformer-based object detection. arXiv 2020, arXiv:2012.09958. [Google Scholar] [CrossRef]
  26. Lowe, D.G. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 2004, 60, 91–110. [Google Scholar] [CrossRef]
  27. Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: A simple and strong anchor-free object detector. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 1922–1933. [Google Scholar] [CrossRef]
28. Wang, Y.; Zhang, X.; Yang, T.; Sun, J. Anchor DETR: Query design for transformer-based detector. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 22 February–1 March 2022; Volume 36, pp. 2567–2575. [Google Scholar]
  29. Alanezi, M.A.; Shahriar, M.S.; Hasan, M.B.; Ahmed, S.; Sha’aban, Y.A.; Bouchekara, H.R. Livestock management with unmanned aerial vehicles: A review. IEEE Access 2022, 10, 45001–45028. [Google Scholar] [CrossRef]
  30. Kellenberger, B.; Volpi, M.; Tuia, D. Fast animal detection in UAV images using convolutional neural networks. In Proceedings of the 2017 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Fort Worth, TX, USA, 23–28 July 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 866–869. [Google Scholar]
  31. Brown, J.; Qiao, Y.; Clark, C.; Lomax, S.; Rafique, K.; Sukkarieh, S. Automated aerial animal detection when spatial resolution conditions are varied. Comput. Electron. Agric. 2022, 193, 106689. [Google Scholar] [CrossRef]
  32. Lee, J.; Bang, J.; Yang, S.I. Object detection with sliding window in images including multiple similar objects. In Proceedings of the 2017 International Conference on Information and Communication Technology Convergence (ICTC), Jeju Island, Republic of Korea, 18–20 October 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 803–806. [Google Scholar]
  33. Islam, S.; Abdullah-Al-Ka, M. Orientation Robust Object Detection Using Histogram of Oriented Gradients. Ph.D. Thesis, East West University, Dhaka, Bangladesh, 2017. [Google Scholar]
  34. Burger, W.; Burge, M.J. Scale-invariant feature transform (SIFT). In Digital Image Processing: An Algorithmic Introduction; Springer: Berlin/Heidelberg, Germany, 2022; pp. 709–763. [Google Scholar]
35. Zhang, H.; Li, F.; Liu, S.; Zhang, L.; Su, H.; Zhu, J.; Ni, L.M.; Shum, H.Y. DINO: DETR with improved denoising anchor boxes for end-to-end object detection. arXiv 2022, arXiv:2203.03605. [Google Scholar]
  36. Uijlings, J.R.; Van De Sande, K.E.; Gevers, T.; Smeulders, A.W. Selective search for object recognition. Int. J. Comput. Vis. 2013, 104, 154–171. [Google Scholar] [CrossRef]
37. Dai, J.; Li, Y.; He, K.; Sun, J. R-FCN: Object detection via region-based fully convolutional networks. arXiv 2016. [Google Scholar] [CrossRef]
38. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar] [CrossRef]
  39. Jocher, G.; Stoken, A.; Borovec, J.; Changyu, L.; Hogan, A.; Diaconu, L.; Poznanski, J.; Yu, L.; Rai, P.; Ferriday, R.; et al. ultralytics/yolov5: v3.0. Zenodo. 2020. Available online: https://docs.ultralytics.com/models/yolov5/ (accessed on 12 September 2025).
  40. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A single-stage object detection framework for industrial applications. arXiv 2022, arXiv:2209.02976. [Google Scholar] [CrossRef]
  41. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar]
  42. Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics YOLOv8. 2023. Available online: https://docs.ultralytics.com/models/yolov8/#key-features-of-yolov8 (accessed on 12 September 2025).
  43. Wang, C.Y.; Liao, H.Y.M. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information; Springer: Cham, Switzerland, 2024. [Google Scholar]
  44. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J. YOLOv10: Real-Time End-to-End Object Detection. arXiv 2024, arXiv:2405.14458. [Google Scholar]
  45. Jocher, G.; Qiu, J. Ultralytics YOLO11. 2024. Available online: https://docs.ultralytics.com/models/yolo11/ (accessed on 12 September 2025).
46. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs beat YOLOs on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–18 June 2024; pp. 16965–16974. [Google Scholar]
  47. Badgujar, C.M.; Poulose, A.; Gan, H. Agricultural object detection with You Only Look Once (YOLO) Algorithm: A bibliometric and systematic literature review. Comput. Electron. Agric. 2024, 223, 109090. [Google Scholar] [CrossRef]
  48. Das, M.; Ferreira, G.; Chen, C. A model generalization study in localizing indoor Cows with COw LOcalization (COLO) dataset. arXiv 2024, arXiv:2407.20372. [Google Scholar] [CrossRef]
  49. Nascimento, J.C.; Marques, J.S. Performance evaluation of object detection algorithms for video surveillance. IEEE Trans. Multimed. 2006, 8, 761–774. [Google Scholar] [CrossRef]
  50. Simhambhatla, R.; Okiah, K.; Kuchkula, S.; Slater, R. Self-driving cars: Evaluation of deep learning techniques for object detection in different driving conditions. SMU Data Sci. Rev. 2019, 2, 23. [Google Scholar]
  51. Ragab, M.G.; Abdulkadir, S.J.; Muneer, A.; Alqushaibi, A.; Sumiea, E.H.; Qureshi, R.; Al-Selwi, S.M.; Alhussian, H. A comprehensive systematic review of YOLO for medical object detection (2018 to 2023). IEEE Access 2024, 12, 57815–57836. [Google Scholar] [CrossRef]
  52. Gowri, A.; Yovan, I.; Jebaseelan, S.S.; Selvasofia, S.A.; Nandhana, N. Satellite Image Based Animal Identification System Using Deep Learning Assisted Remote Sensing Strategy. In Proceedings of the 2024 International Conference on Advances in Computing, Communication and Applied Informatics (ACCAI), Chennai, India, 9–10 May 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1–6. [Google Scholar]
  53. Gaurav, A.; Gupta, B.B.; Chui, K.T.; Arya, V. Unmanned Aerial Vehicle-Based Animal Detection via Hybrid CNN and LSTM Model. In Proceedings of the ICC 2024—IEEE International Conference on Communications, Denver, CO, USA, 9–13 June 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 2586–2591. [Google Scholar]
  54. Martha, G.W.; Mwangi, R.W.; Aramvith, S.; Rimiru, R. Comparing Deep Learning Object Detection Methods for Real-Time Cow Detection. In Proceedings of the TENCON 2023—2023 IEEE Region 10 Conference (TENCON), Chiang Mai, Thailand, 31 October–3 November 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1187–1192. [Google Scholar]
  55. Mücher, C.; Los, S.; Franke, G.; Kamphuis, C. Detection, identification and posture recognition of cattle with satellites, aerial photography and UAVs using deep learning techniques. Int. J. Remote Sens. 2022, 43, 2377–2392. [Google Scholar] [CrossRef]
  56. Gunawan, T.S.; Ismail, I.M.M.; Kartiwi, M.; Ismail, N. Performance Comparison of Various YOLO Architectures on Object Detection of UAV Images. In Proceedings of the 2022 IEEE 8th International Conference on Smart Instrumentation, Measurement and Applications (ICSIMA), Melaka, Malaysia, 26–28 September 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 257–261. [Google Scholar]
  57. Li, Y.; Gou, X.; Zuo, H.; Zhang, M. A Multi-scale Cattle Individual Identification Method Based on CMT Module and Attention Mechanism. In Proceedings of the 2024 7th International Conference on Computer Information Science and Application Technology (CISAT), Hangzhou, China, 12–14 July 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 336–341. [Google Scholar]
  58. Ismail, M.S.; Samad, R.; Pebrianti, D.; Mustafa, M.; Abdullah, N.R.H. Comparative Analysis of Deep Learning Models for Sheep Detection in Aerial Imagery. In Proceedings of the 2024 9th International Conference on Mechatronics Engineering (ICOM), Kuala Lumpur, Malaysia, 13–14 August 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 234–239. [Google Scholar]
  59. Lishchynskyi, M.; Cheng, I.; Nikolaidis, I. Trajectory Agnostic Livestock Counting Through UAV Imaging. In Proceedings of the 2024 IEEE 21st International Conference on Mobile Ad-Hoc and Smart Systems (MASS), Seoul, Republic of Korea, 23–24 September 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 682–687. [Google Scholar]
  60. Fang, C.; Li, C.; Yang, P.; Kong, S.; Han, Y.; Huang, X.; Niu, J. Enhancing livestock detection: An efficient model based on YOLOv8. Appl. Sci. 2024, 14, 4809. [Google Scholar] [CrossRef]
  61. Ocholla, I.A.; Pellikka, P.; Karanja, F.; Vuorinne, I.; Väisänen, T.; Boitt, M.; Heiskanen, J. Livestock Detection and Counting in Kenyan Rangelands Using Aerial Imagery and Deep Learning Techniques. Remote Sens. 2024, 16, 2929. [Google Scholar] [CrossRef]
  62. Rivas, A.; Chamoso, P.; González-Briones, A.; Corchado, J.M. Detection of cattle using drones and convolutional neural networks. Sensors 2018, 18, 2048. [Google Scholar] [CrossRef]
  63. Robinson, C.; Ortiz, A.; Hughey, L.; Stabach, J.A.; Ferres, J.M.L. Detecting cattle and elk in the wild from space. arXiv 2021, arXiv:2106.15448. [Google Scholar] [CrossRef]
  64. Wang, W.; Xie, M.; Jiang, C.; Zheng, Z.; Bian, H. Cow Detection Model Based on Improved YOLOv5. In Proceedings of the 2024 39th Youth Academic Annual Conference of Chinese Association of Automation (YAC), Dalian, China, 7–9 June 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1697–1701. [Google Scholar]
  65. Myint, B.B.; Onizuka, T.; Tin, P.; Aikawa, M.; Kobayashi, I.; Zin, T.T. Development of a real-time cattle lameness detection system using a single side-view camera. Sci. Rep. 2024, 14, 13734. [Google Scholar] [CrossRef]
  66. Han, L.; Tao, P.; Martin, R.R. Livestock detection in aerial images using a fully convolutional network. Comput. Vis. Media 2019, 5, 221–228. [Google Scholar] [CrossRef]
  67. Barrios, D.B.; Valente, J.; van Langevelde, F. Monitoring mammalian herbivores via convolutional neural networks implemented on thermal UAV imagery. Comput. Electron. Agric. 2024, 218, 108713. [Google Scholar] [CrossRef]
  68. Noe, S.M.; Zin, T.T.; Tin, P.; Kobayashi, I. Precision Livestock Tracking: Advancements in Black Cattle Monitoring for Sustainable Agriculture. J. Signal Process. 2023, 28, 179–182. [Google Scholar]
  69. Kurniadi, F.A.; Setianingsih, C.; Syaputra, R.E. Innovation in Livestock Surveillance: Applying the YOLO Algorithm to UAV Imagery and Videography. In Proceedings of the 2023 IEEE 9th International Conference on Smart Instrumentation, Measurement and Applications (ICSIMA), Kuala Lumpur, Malaysia, 17–18 October 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 246–251. [Google Scholar]
70. Asdikian, J.P.H.; Li, M.; Maier, G. Performance evaluation of YOLOv8 and YOLOv9 on custom dataset with color space augmentation for Real-time Wildlife detection at the Edge. In Proceedings of the 2024 IEEE 10th International Conference on Network Softwarization (NetSoft), Saint Louis, MO, USA, 24–28 June 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 55–60. [Google Scholar]
  71. Xu, Z.; Wang, T.; Skidmore, A.K.; Lamprey, R. A review of deep learning techniques for detecting animals in aerial and satellite images. Int. J. Appl. Earth Obs. Geoinf. 2024, 128, 103732. [Google Scholar] [CrossRef]
  72. Tian, M.; Zhao, F.; Chen, N.; Zhang, D.; Li, J.; Li, H. Application of Convolutional Neural Network in Livestock Target Detection of IoT Images. In Proceedings of the 2024 5th International Seminar on Artificial Intelligence, Networking and Information Technology (AINIT), Nanjing, China, 29–31 March 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1146–1151. [Google Scholar]
  73. Korkmaz, A.; Agdas, M.T.; Kosunalp, S.; Iliev, T.; Stoyanov, I. Detection of Threats to Farm Animals Using Deep Learning Models: A Comparative Study. Appl. Sci. 2024, 14, 6098. [Google Scholar] [CrossRef]
  74. A Deep Learning-Based Model for the Detection and Tracking of Animals for Safety and Management of Financial Loss. J. Electr. Syst. 2024, 20, 14–20. [CrossRef]
  75. Zhou, X.; Wang, D.; Krähenbühl, P. Objects as Points. arXiv 2019, arXiv:1904.07850. [Google Scholar]
76. Lv, W.; Zhao, Y.; Chang, Q.; Huang, K.; Wang, G.; Liu, Y. RT-DETRv2: Improved baseline with bag-of-freebies for real-time detection transformer. arXiv 2024, arXiv:2407.17140. [Google Scholar]
  77. Qian, C.; Guo, Y.; Hussaini, A.; Musa, A.; Sai, A.; Yu, W. A New Layer Structure of Cyber-Physical Systems under the Era of Digital Twin. ACM Trans. Internet Technol. 2024; accepted. [Google Scholar] [CrossRef]
Figure 1. Applications of livestock detection.
Figure 2. Image samples of the custom cattle dataset.
Figure 3. Detection results comparison for YOLOv12-s, SSD, CenterNet, YOLOv7-Tiny, RT-DETR, and Faster R-CNN using the custom cattle dataset.
Table 1. Hyperparameter configuration.

| Epochs | Batch Size | Imgsz | Momentum | lr | Close Mosaic | Workers |
|---|---|---|---|---|---|---|
| 100 | 16 | 640 × 640 | 0.937 | 0.01 | 10 | 8 |
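Table 1 lists the training configuration used in the case study; the exact training script is not reproduced here. The snippet below is a minimal sketch of how these hyperparameters map onto the Ultralytics training API, assuming a release that ships YOLOv12 weights and a hypothetical dataset file `cattle.yaml` describing the custom UAV cattle dataset.

```python
# Minimal training sketch mirroring the Table 1 hyperparameters.
# "cattle.yaml" is a hypothetical placeholder for the custom UAV cattle dataset;
# this is not the authors' published training code.
from ultralytics import YOLO

model = YOLO("yolo12s.pt")  # pretrained YOLOv12-s checkpoint

results = model.train(
    data="cattle.yaml",   # dataset config: train/val paths and class names
    epochs=100,           # Epochs
    batch=16,             # Batch Size
    imgsz=640,            # Imgsz (640 x 640)
    momentum=0.937,       # Momentum
    lr0=0.01,             # initial learning rate (lr)
    close_mosaic=10,      # disable mosaic augmentation for the final 10 epochs
    workers=8,            # data-loader workers
)
```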
Table 2. Compact comparison of YOLO family versions for UAV livestock detection.

| Model | Year | Anchor | Innovations | Strengths | Limitations |
|---|---|---|---|---|---|
| YOLOv5 | 2020 | Anchor-based | Scalable variants (n–x); PyTorch implementation | Widely adopted; lightweight for edge devices | Weak small-object detection; no transformer features |
| YOLOv7 | 2022 | Anchor-based | E-ELAN module; re-parameterization | High accuracy; efficient CNN | Still anchor-based; less suited for UAV small-object detection |
| YOLOv8 | 2023 | Anchor-free | Decoupled head; flexible input sizes | Strong small-object detection; real-time inference | Needs fine-tuning; unstable in dense livestock clusters |
| YOLOv12 | 2025 | Anchor-free + attention-centric | R-ELAN; FlashAttention integration | SOTA accuracy; balances speed and robustness | Large models heavy for UAV edge hardware |
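For a quick, informal comparison of the YOLO generations in Table 2 on a single UAV frame, the sketch below uses the Ultralytics API, assuming a release that bundles the "yolov5su.pt", "yolov8s.pt", and "yolo12s.pt" checkpoints; YOLOv7 is maintained in a separate repository and is therefore omitted. The image path is a hypothetical placeholder, and single-image timings include warm-up effects, so they are illustrative only, not a benchmark.

```python
# Rough single-image comparison of YOLO generations via the Ultralytics API.
# "uav_frame.jpg" is a hypothetical sample image.
from ultralytics import YOLO

for weights in ["yolov5su.pt", "yolov8s.pt", "yolo12s.pt"]:
    model = YOLO(weights)
    result = model.predict("uav_frame.jpg", imgsz=640, conf=0.25, verbose=False)[0]
    print(f"{weights}: {len(result.boxes)} animals detected, "
          f"{result.speed['inference']:.1f} ms inference")
```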
Table 3. Performance of YOLOv12 Variants.

| Model | Precision | Recall | F1 Score | mAP@50 | mAP@50-95 | Parameters (M) | FLOPs (G) | Inference Time (ms) |
|---|---|---|---|---|---|---|---|---|
| YOLOv12-n | 0.96 | 0.93 | 0.94 | 0.97 | 0.63 | 2.57 | 6.3 | 4.5 |
| YOLOv12-s | 0.96 | 0.93 | 0.95 | 0.97 | 0.63 | 9.23 | 21.2 | 8.1 |
| YOLOv12-m | 0.95 | 0.95 | 0.95 | 0.98 | 0.66 | 20.1 | 67.1 | 14.9 |
| YOLOv12-l | 0.95 | 0.94 | 0.95 | 0.98 | 0.65 | 26.34 | 88.5 | 4.5 |
| YOLOv12-x | 0.95 | 0.95 | 0.96 | 0.98 | 0.67 | 59.04 | 198.5 | 6.8 |

Note: Bold values indicate the best result in each column.
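The F1 scores in Tables 3–5 are consistent with the harmonic mean of the reported precision and recall, F1 = 2PR/(P + R); the exact rounding used by the authors is not stated. A quick check against the YOLOv12-n row:

```python
# F1 as the harmonic mean of precision and recall.
def f1_score(precision: float, recall: float) -> float:
    return 2 * precision * recall / (precision + recall)

# YOLOv12-n row in Table 3: P = 0.96, R = 0.93.
print(round(f1_score(0.96, 0.93), 2))  # 0.94, matching the reported F1
```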
Table 4. Performance Comparison of the YOLO Family.

| Model | Precision | Recall | F1 Score | mAP@50 | mAP@50-95 | Parameters (M) | FLOPs (G) | Inference Time (ms) |
|---|---|---|---|---|---|---|---|---|
| YOLOv5-s | 0.96 | 0.94 | 0.95 | 0.98 | 0.64 | 7.01 | 15.8 | 102.5 |
| YOLOv7 | 0.97 | 0.96 | 0.96 | 0.99 | 0.64 | 37.19 | 105.1 | 90.7 |
| YOLOv8-s | 0.95 | 0.96 | 0.96 | 0.98 | 0.66 | 11.13 | 28.4 | 0.9 |
| YOLOv12-s | 0.96 | 0.93 | 0.95 | 0.97 | 0.63 | 9.23 | 21.2 | 8.1 |

Note: Bold values indicate the best result in each column.
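The mAP@50 and mAP@50-95 columns are IoU-thresholded metrics: mAP@50 counts a detection as correct when its box overlaps a ground-truth box with IoU ≥ 0.5, while mAP@50-95 averages AP over IoU thresholds from 0.50 to 0.95 in steps of 0.05. Below is a minimal sketch of the underlying box-IoU computation; the example boxes are purely illustrative.

```python
# IoU of two axis-aligned boxes in (x1, y1, x2, y2) format.
def box_iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# A prediction shifted 10 px from a 100 x 100 ground-truth box still
# exceeds the 0.5 threshold used by mAP@50.
print(round(box_iou((0, 0, 100, 100), (10, 10, 110, 110)), 3))  # ~0.681
```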
Table 5. Performance of YOLOv12-s Against Other Models.

| Model | Precision | Recall | F1 Score | mAP@50 | mAP@50-95 | Parameters (M) | FLOPs (G) | Inference Time (ms) |
|---|---|---|---|---|---|---|---|---|
| SSD (MobileNet V2) | 0.62 | 0.65 | 0.63 | 0.59 | 0.21 | 25.12 | 88.2 | 115.86 |
| Faster R-CNN (ResNet50-FPN) | 0.84 | 0.94 | 0.89 | 0.94 | 0.63 | 41.53 | 216.58 | 28.06 |
| CenterNet (ResNet50 V1 FPN) | 0.93 | 0.62 | 0.74 | 0.93 | 0.57 | 25.25 | – | 1051.96 |
| YOLOv7-Tiny | 0.86 | 0.71 | 0.78 | 0.83 | 0.33 | 6.02 | 13.2 | 120.1 |
| RT-DETR | 0.95 | 0.94 | 0.95 | 0.97 | 0.65 | 31.9 | 103.4 | 3.2 |
| YOLOv12-s | 0.96 | 0.93 | 0.95 | 0.97 | 0.63 | 9.23 | 21.2 | 8.1 |

Note: Bold values indicate the best result in each column.
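The parameter counts and inference times in Tables 3–5 depend on the specific implementations and hardware used in the case study, which are not reproduced in this back matter. The sketch below shows, assuming the torchvision Faster R-CNN implementation, how such figures are typically measured; it is not the authors' benchmarking code, and the timing depends entirely on the machine it runs on.

```python
# Generic sketch for measuring "Parameters (M)" and "Inference Time (ms)".
import time
import torch
from torchvision.models.detection import (
    fasterrcnn_resnet50_fpn, FasterRCNN_ResNet50_FPN_Weights,
)

model = fasterrcnn_resnet50_fpn(weights=FasterRCNN_ResNet50_FPN_Weights.DEFAULT).eval()

params_m = sum(p.numel() for p in model.parameters()) / 1e6
print(f"Parameters: {params_m:.2f} M")   # ~41.8 M for this torchvision model

dummy = [torch.rand(3, 640, 640)]        # one 640 x 640 RGB frame in [0, 1]
with torch.no_grad():
    model(dummy)                         # warm-up pass
    t0 = time.perf_counter()
    model(dummy)
    print(f"Inference: {(time.perf_counter() - t0) * 1000:.1f} ms")
```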
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
