Correction published on 30 September 2024, see Remote Sens. 2024, 16(19), 3656.
Article

Livestock Detection and Counting in Kenyan Rangelands Using Aerial Imagery and Deep Learning Techniques

by Ian A. Ocholla 1,2,*, Petri Pellikka 1,2,3, Faith Karanja 4, Ilja Vuorinne 1,2, Tuomas Väisänen 1,5,6, Mark Boitt 7 and Janne Heiskanen 1,8

1 Department of Geosciences and Geography, University of Helsinki, P.O. Box 64, 00014 Helsinki, Finland
2 Institute for Atmospheric and Earth System Research, University of Helsinki, P.O. Box 4, 00014 Helsinki, Finland
3 Wangari Maathai Institute for Environmental and Peace Studies, University of Nairobi, Nairobi P.O. Box 29053-00625, Kenya
4 Department of Geospatial and Space Technology, University of Nairobi, Nairobi P.O. Box 30197-00100, Kenya
5 Helsinki Institute of Sustainability Science, University of Helsinki, P.O. Box 4, 00014 Helsinki, Finland
6 Helsinki Institute of Urban and Regional Studies, University of Helsinki, P.O. Box 4, 00014 Helsinki, Finland
7 Institute of Geomatics, GIS and Remote Sensing, Dedan Kimathi University of Technology, Private Bag, Dedan Kimathi, Nyeri P.O. Box 10143-10100, Kenya
8 Finnish Meteorological Institute, P.O. Box 503, 00101 Helsinki, Finland
* Author to whom correspondence should be addressed.
Remote Sens. 2024, 16(16), 2929; https://doi.org/10.3390/rs16162929
Submission received: 28 June 2024 / Revised: 4 August 2024 / Accepted: 7 August 2024 / Published: 9 August 2024 / Corrected: 30 September 2024
(This article belongs to the Section AI Remote Sensing)

Abstract:
Accurate livestock counts are essential for effective pastureland management. High spatial resolution remote sensing, coupled with deep learning, has shown promising results in livestock detection. However, challenges persist, particularly when the targets are small and in a heterogeneous environment, such as those in African rangelands. This study evaluated nine state-of-the-art object detection models, four variants each from YOLOv5 and YOLOv8, and Faster R-CNN, for detecting cattle in 10 cm resolution aerial RGB imagery in Kenya. The experiment involved 1039 images with 9641 labels for training from sites with varying land cover characteristics. The trained models were evaluated on 277 images and 2642 labels in the test dataset, and their performance was compared using Precision, Recall, and Average Precision (AP0.5–0.95). The results indicated that reduced spatial resolution, dense shrub cover, and shadows diminish the model’s ability to distinguish cattle from the background. The YOLOv8m architecture achieved the best AP0.5–0.95 accuracy of 39.6% with Precision and Recall of 91.0% and 83.4%, respectively. Despite its superior performance, YOLOv8m had the highest counting error of −8%. By contrast, YOLOv5m with AP0.5–0.95 of 39.3% attained the most accurate cattle count with RMSE of 1.3 and R2 of 0.98 for variable cattle herd densities. These results highlight that a model with high AP0.5–0.95 detection accuracy may struggle with counting cattle accurately. Nevertheless, these findings suggest the potential to upscale aerial-imagery-trained object detection models to satellite imagery for conducting cattle censuses over large areas. In addition, accurate cattle counts will support sustainable pastureland management by ensuring stock numbers do not exceed the forage available for grazing, thereby mitigating overgrazing.

Graphical Abstract

1. Introduction

Accurate and regular livestock censuses are essential for implementing sustainable pasture management strategies and advancing the Sustainable Development Goals (SDGs) related to zero hunger, poverty reduction, and climate action [1]. Census data helps in monitoring livestock population trends, managing zoonotic disease outbreaks [2,3], tracking livestock affected by extreme climate events [4,5], and monitoring shifts in land use patterns [6,7]. They also guide farmers in adapting livestock practices, such as grazing patterns, in response to the effects of climate change [2,8,9].
The African continent hosts a third of the global livestock population [10], contributing around 40% to the gross domestic product of African countries [11]. However, the livestock sector in Sub-Saharan Africa (SSA) remains underused compared to other regions of the world [12] due to a lack of reliable and up-to-date population data, which hinders policy formulation and management decisions [13,14].
Since the 1970s, livestock censuses in Africa have been conducted via aerial counting from manned aircraft [15]. However, these counts are irreproducible and subject to observer bias, leading to an underestimation of over 15% due to animal clustering and dense vegetation [16,17]. While aerial images are more reliable and can be stored for repeated counts, the manual interpretation process is still tedious and prone to bias [18,19].
The advancement of artificial intelligence in the last two decades has propelled the use of deep learning models [20] in livestock surveys through object detection. Object detection is a fundamental task in computer vision aimed at localising and classifying multiple objects in an image [21]. Object detection models are typically based on convolutional neural network (CNN) architectures, the standard deep learning networks for image data [22]. In livestock surveys, object detectors have proven efficient for livestock counting [9,23] and multispecies classification [24,25,26]. These detectors can process large volumes of aerial images [27], enabling real-time livestock detection and counting [28,29,30].
Several studies have explored different CNN architectures and techniques to enhance the accuracy and efficiency of livestock detection. Barbedo et al. [24] assessed the accuracy of 15 CNN architectures for the classification of images of cattle captured by drones across different simulated ground sample distances and varying illumination based on season and time of day. Acceptable performance was attained with the NasNet Large and Xception networks. However, their detection performance was hindered in cases of low contrast caused by severe specular reflection and blurred images. Soares et al. [31] developed a pipeline based on Faster R-CNN Inception Resnet v2 and a graph-based matching model to remove duplicate animals detected in overlapping images. The experiment utilised metadata on the GPS position and drone altitude of each image to detect cattle and filter out duplicates. The graph-based algorithm reduced duplicates by 70% for cattle in motion, with a computation time 15 times lower than that of an earlier study by Shao et al. [32], which used a three-dimensional Structure from Motion (SfM) reconstruction model on a similar dataset to avoid duplicate counts.
For counting cattle in clustered herds in low-contrast environments, Barbedo, Koenigkan, and Santos [33] proposed three modules: first, colour space manipulation to segment the background from the animals; second, mathematical morphological operations to separate groups of animals and remove non-animal objects; and last, combining overlapping images to avoid duplicate counts. The study attained high F1 scores across different cluster sizes and environments. However, the method was only tested on white-coated cattle and may not apply to cattle with different coat colours. Delplanque et al. [26] proposed a point-based model, HerdNet, to detect, classify, and count multiple livestock species (sheep, goats, donkeys, and camels) from images acquired from manned aircraft. The authors highlighted that including a hard negative patch mining technique lowered the average confusion between species and attained acceptable accuracy (F1 score of 73.6%) and a low counting error of −9.4%. The experiment was conducted on homogeneous herds at dense, medium, and sparse densities.
Despite this progress, most studies are limited to using low-flying Unmanned Aerial Vehicles (UAVs) [24,33,34]. However, coverage of UAVs can be restricted by national legislation to the line of sight and specific maximum heights [35]. Furthermore, current studies are limited in capturing the diversity of cattle breeds and sizes [24,32,36] and are confined in homogeneous landscapes with sparsely distributed livestock [3,26].
For conducting livestock censuses in vast and inaccessible areas, such as the SSA rangelands, aerial images from manned aircraft offer a favourable compromise between area coverage and spatial resolution. Manned aircraft can cover larger areas than UAVs and provide higher spatial resolution than satellite images. However, detecting livestock from aerial images acquired from typical flying heights of 700 to 2000 m above ground level can be challenging for CNN-based computer vision techniques, as an animal may occupy only a few pixels, limiting the use of its inherent features (shape, colour, and size) for detection [37]. In addition, the heterogeneous background increases the probability of false positives and negatives in detection due to low contrast.
This study’s main objective was to evaluate the potential of object detection models based on aerial RGB imagery for the automated detection and counting of cattle across heterogeneous rangelands in Kenya. Nine models were trained and tested (four variants each of the one-stage detectors YOLOv5 and YOLOv8, and the two-stage detector Faster R-CNN) using a large, annotated cattle dataset from three sites, each characterised by a unique land cover. Our specific aims were to (i) evaluate how background characteristics and the quality of training data impact model detection performance, (ii) evaluate the performance of the trained detection models across the three unique study sites, and (iii) compare the performance of the models in detecting and counting cattle clusters of various sizes.

2. Materials and Methods

2.1. Study Area

This study was conducted in Taita Taveta County (38°07′36″E to 38°27′23″E and 3°43′16″S to 3°25′33″S), located in southern Kenya (Figure 1). The county is predominantly characterised by semi-arid conditions, covering 89% of its area, particularly in the lowlands between 600 and 1000 m above sea level. The lowlands feature savanna and bushland vegetation that support national parks, wildlife sanctuaries, and cattle ranches. The lowland areas receive an average annual rainfall of 440 mm, with a mean annual air temperature of 23 °C [38]. The main land use types in the study area are conservation, livestock management and agropastoralism, small-holder agriculture, mining, and sisal plantations [39].
Three distinct sites were selected for the study: Lumo Conservancy, Taita Hills Wildlife Sanctuary (THWS), and Choke Conservancy (Figure 1). These sites are home to both livestock (such as cattle, sheep, and goats) and wildlife (e.g., zebras, elephants, and buffalos). The three sites vary in land cover characteristics and size, with the smallest being Lumo at 84.7 km2, followed by THWS at 101.6 km2, and the largest, Choke, covering 128.6 km2. THWS comprises a grassland area with scattered trees, mainly protected for wildlife with restricted livestock movement into the area. Choke is characterised by a dense shrubland, supporting both livestock and wildlife. Lumo has a landscape characterised by overgrazed bushland, grassland, and bare land due to the high yearly livestock population. Due to overgrazing, large herds of cattle from Lumo often move into the western and southern sections of THWS for grazing. Lumo hosts approximately 3500 cattle and 200 goats, while Choke hosts 1400 heads of cattle [39]. The numbers of livestock fluctuate between seasons and years.

2.2. Acquisition of Aerial Imagery

A Cessna 206 aircraft, travelling at a speed of 100 knots (approximately 180 km/h), was used for aerial photography across the three sites. The survey employed a Leica RCD30 60-megapixel camera, equipped with a 50 mm Leica NAG-D lens and a field of view of 53.8°. The camera was oriented vertically downward (nadir) and set to a maximum frame rate of 1 frame per second. The RGB images captured had a resolution of 6372 × 9000 pixels.
The aerial survey was conducted during the dry season of 2022, between 14 February and 20 March, under sunny and clear weather conditions. These favourable conditions allowed for maintaining constant aircraft speed and altitude. The flight lines were spaced 500 m apart, with the north–south orientation chosen to avoid glare from the sun affecting the pilot. The surveys were conducted between 8:30 am and noon local time, when most of the cattle were in the grazing fields and weather conditions were typically most suitable. After noon, rising temperatures often cause cattle to seek shade under trees, making them harder to detect. In the evening, the setting sun creates shadows that further complicate detection. The average flying height was 850 m above ground in Lumo and THWS, and 1400 m in Choke.
A Leica laser altimeter recorded the flying height and the associated geolocations for accurate image registration and georeferencing. All images were georeferenced in the Universal Transverse Mercator (UTM 37S) coordinate system before generating high-quality orthomosaics using Leica Frame Pro Software (version 1.3, Leica Geosystems Inc., Heerbrugg, Canton of St. Gallen, Switzerland). The orthomosaicing process involved stitching individual images into a seamless, continuous, and uniform mosaic with a spatial resolution of 10 cm and a tile size of 4 km2.

2.3. Cattle Detection and Counting

A general workflow of the cattle detection and counting methodology is illustrated in Figure 2. The process began with converting the tiled image mosaics into image patches, followed by data augmentation and annotation before splitting them into training and test datasets. The training data were used to train models under three scenarios, and the trained models were evaluated based on their detection and counting performance. All steps are described in detail in the following sections.

2.3.1. Tile Selection and Image Slicing

The 147 orthomosaicked tiles were systematically scanned from left to right using a 0.5 km grid at a scale of 1:2500 on QGIS software (version 3.26.2, The Open Source Geospatial Foundation Project, Beaverton, OR, USA) by the first author. Fifty-five tiles (37%) containing cattle were retained and later sliced into 200 × 200-pixel image patches, each measuring 400 m2 in area. A total of 1374 patches with livestock, representing 0.23% of the total patches (n = 606,429) from the three sites, were then used to train the models. Splitting the tiles into small image patches (later referred to as images) decreased the computational resources and memory required for object detection, thereby increasing the model’s efficiency.
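As a minimal sketch of this slicing step, the following Python snippet divides one orthomosaic tile into 200 × 200-pixel patches. The use of the rasterio library and the file names are assumptions for illustration, not the exact implementation used in this study.

# Minimal sketch: slice an orthomosaic tile into 200 x 200-pixel patches.
# rasterio and the file names are assumptions for illustration only.
import rasterio
from rasterio.windows import Window

PATCH = 200  # pixels; at 10 cm resolution one patch covers 20 m x 20 m = 400 m2

with rasterio.open("tile_001.tif") as src:                    # hypothetical tile name
    for row in range(0, src.height - PATCH + 1, PATCH):
        for col in range(0, src.width - PATCH + 1, PATCH):
            window = Window(col, row, PATCH, PATCH)
            patch = src.read(window=window)                   # (bands, 200, 200) array
            profile = src.profile.copy()
            profile.update(width=PATCH, height=PATCH,
                           transform=src.window_transform(window))
            out_name = f"tile_001_r{row}_c{col}.tif"
            with rasterio.open(out_name, "w", **profile) as dst:
                dst.write(patch)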

2.3.2. Data Augmentation Strategies

Data augmentation is an important technique in computer vision, aimed at increasing the size and diversity of training images to reduce the risk of model overfitting [41] and to strengthen the model’s ability to generalise to unseen datasets [42,43]. Data augmentation techniques involve geometric and pixel transformations. Geometric transformation techniques such as rotation (90, 180, and 270 degrees), horizontal and vertical flips, and random cropping were implemented. Rotations and flips reduce the model’s sensitivity to image orientation [43], while cropping addresses the challenge of multi-scale objects in the images [44] (Figure 3). For pixel transformations, random brightness, saturation, and contrast strategies were employed to strengthen the model against complex backgrounds and environmental conditions, and to ensure resilience to alterations in sensor settings [44,45].
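As a rough illustration of these strategies, the sketch below composes the geometric and pixel-level transformations described above while keeping the bounding boxes consistent. The Albumentations library and the specific probabilities and crop size are assumptions, not the authors’ exact configuration.

# Illustrative augmentation pipeline (rotations, flips, random crop,
# brightness/contrast/saturation); parameter values are assumptions.
import albumentations as A

transform = A.Compose(
    [
        A.RandomRotate90(p=0.5),                     # 90/180/270-degree rotations
        A.HorizontalFlip(p=0.5),
        A.VerticalFlip(p=0.5),
        A.RandomCrop(height=160, width=160, p=0.3),  # multi-scale cropping
        A.RandomBrightnessContrast(p=0.3),           # pixel-level transforms
        A.HueSaturationValue(p=0.3),
    ],
    bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
)
# augmented = transform(image=image, bboxes=boxes, class_labels=labels)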

2.3.3. Annotation

We used the MakeSense.ai tool [46] to annotate each image patch by manually drawing a bounding box around each visible animal. Only cattle that were completely or partially visible within the patch were considered, and each bounding box was labelled “C”. The MakeSense tool then generated a text file with the position and label of each bounding box in the image.
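For reference, a label file for a single patch might look like the sketch below. The coordinate values are invented for illustration and assume the common YOLO-style normalised format (class, x-centre, y-centre, width, height).

# Illustrative YOLO-format annotation for one 200 x 200-pixel patch; values are invented.
label_lines = [
    "0 0.41 0.37 0.045 0.060",   # class 0 ("C" = cattle), normalised box centre and size
    "0 0.58 0.52 0.050 0.055",   # a second animal in the same patch
]
with open("patch_0001.txt", "w") as f:   # hypothetical file name
    f.write("\n".join(label_lines))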

2.3.4. Training Scenarios

The three study sites, THWS, Lumo and Choke, each have distinct landscape characteristics (Section 2.1), and data from Choke were acquired from a higher altitude. Therefore, we assessed three scenarios to evaluate how the training dataset impacts the model’s ability to generalise to different sets of unseen data (Table 1).
In Scenario 1 (the base scenario), nine models were trained and validated using combined data from all three sites to develop a model applicable across the diverse conditions of the Kenyan rangelands. In Scenario 2 (the single-site training scenario), detection models were trained using data from one study site to assess the impact of training data quality. Data from the remaining two sites were used for inference. In Scenario 3 (the cross-site training scenario), data from two study sites were combined for training to assess the impact of different land covers on cattle detection, while the third site was reserved for inference.
For all scenarios, the datasets were randomly split in a 4:1 ratio into training and validation sets, consistent with prior studies on counting livestock from aerial datasets [3,26]. The split was based on the orthomosaic tiles containing livestock (Section 2.3.1), randomly selected from each site, rather than on individual image patches, to maintain the independence of the test dataset (Table S1). Some images contained many cattle, while others had only one or two; nevertheless, the cattle counts closely approximated the 4:1 ratio applied to the images (Table 1). Additionally, background images (negatives) amounting to 10% of the training data were included to improve model performance and reduce false positives.
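A minimal sketch of such a tile-level split, assuming a list of tile identifiers, could look like the following; the seed and variable names are illustrative only.

# Tile-level 4:1 split; patches inherit the split of their parent tile,
# keeping the held-out set independent. Seed and names are illustrative.
import random

random.seed(0)
tiles_with_cattle = [f"tile_{i:03d}" for i in range(55)]   # the 55 retained tiles
random.shuffle(tiles_with_cattle)

n_train = int(0.8 * len(tiles_with_cattle))                 # 4:1 ratio
train_tiles = tiles_with_cattle[:n_train]
heldout_tiles = tiles_with_cattle[n_train:]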

2.3.5. Object Detection Models

Object detection models are categorised into one-stage and two-stage detectors [47]. A one-stage detector applies a single convolutional neural network pipeline to detect and classify target objects in an image. Two-stage detectors operate in two steps: first, generating possible region proposals of the target object using a region proposal network, and then classifying and refining the proposals in the detection head [48]. One-stage detectors are simple and have high computational speed. By contrast, two-stage detectors achieve higher accuracy but at a higher computational cost, as they require the configuration of many hyperparameters and the selection of an appropriate convolutional backbone network during training [49,50]. For instance, the one-stage detector You Only Look Once (YOLO) [51] trained on the benchmark Pascal VOC 2007 and 2012 datasets [52] achieved a mean Average Precision (mAP) of 63.4% at a computation speed of 45 frames per second (fps), while Faster R-CNN achieved a mAP of 70.0% but at a much lower speed of 0.5 fps [53]. We selected the most stable and proven detectors from both categories based on their high accuracy and speed: two one-stage detectors, YOLOv5 and YOLOv8, from the YOLO family [51], and one two-stage detector, Faster R-CNN [54].
YOLOv5 [55] is the most popular and stable YOLO model, boasting a balance of speed and accuracy. YOLOv5 is commonly used to detect medium and large objects in close-range photographs. In YOLO models, the input image is passed through convolutional networks for feature extraction to generate feature maps. These are then passed through a fully connected convolutional layer for localisation and classification (see Huang et al. [56] for further details). The YOLOv8 architecture replicates the YOLOv5 network with slight modifications to its anchor type. YOLOv8 uses a decoupled head, meaning that classification and object detection are treated as separate tasks, unlike the coupled head in YOLOv5. The decoupled head makes YOLOv8 anchor-free, eliminating the need to tune anchor boxes to target sizes during training and boosting detection accuracy [57,58].
Faster R-CNN is a state-of-the-art two-stage object detector. It is an improved version of Fast R-CNN [59] and R-CNN [60], incorporating Region Proposal Networks (RPNs) to efficiently predict region proposals and reduce computation time. Faster R-CNN comprises two parts: the RPN, composed of deep convolutional networks, and the Fast R-CNN detector for prediction. The raw input images are initially passed through convolutional layers that extract salient features to generate feature maps. These feature maps are then passed to the RPN, which generates region proposals and assigns an objectness score to each region containing an object. The potential proposals are refined and reshaped before being passed to the detection network, where classification and bounding box regression are performed. The detector network and the RPN share convolutional layers, reducing the cost of generating proposals to about 10 ms per image. In addition, the RPNs are designed to be scalable, generating proposals at multiple scales and aspect ratios for objects of various sizes. The reader is referred to Ren et al. [54] for further details on the Faster R-CNN model and its implementation.

2.3.6. Training YOLO and Faster R-CNN Models

Transfer learning [42] was used to initialise the training of all models, using pre-trained weights from the MS COCO2017 dataset [61]. The pre-trained weights were applied only to the first backbone layers, while the rest were fine-tuned on the training dataset. Transfer learning and pre-trained weights allow leveraging existing models to detect shallow features, accelerating training time and improving model accuracy compared to training from scratch. The final head node layer was customised for single object detection: “cattle”.
YOLOv5 and YOLOv8 models have five different variants, ranging from nano to extra-large, with variations in the number of model parameters and processing speed. We selected the small, medium, large, and extra-large YOLO variants for this study. Each YOLO variant was trained for 300 epochs, with early stopping after 50 epochs if no improvement was observed in the accuracy metrics. Stochastic Gradient Descent (SGD) was used for model optimisation. The trained models were then validated to optimise hyperparameters before conducting inference on the test dataset. As YOLOv5 is an anchor-based model, the default genetic evolution algorithm [55] was employed to generate optimal anchor settings for the cattle dataset (Table S2).
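A minimal sketch of this training setup for a YOLOv8 variant, using the Ultralytics Python API, is shown below. The dataset configuration file name is hypothetical, and the arguments reflect the settings described above rather than the authors’ exact script.

# Sketch of YOLOv8m training with COCO-pretrained weights (transfer learning).
from ultralytics import YOLO

model = YOLO("yolov8m.pt")        # MS COCO pre-trained medium variant
model.train(
    data="cattle.yaml",           # hypothetical dataset config with the single class "cattle"
    epochs=300,                   # maximum number of epochs
    patience=50,                  # early stopping after 50 epochs without improvement
    optimizer="SGD",              # Stochastic Gradient Descent
)
metrics = model.val()             # validation used to tune hyperparameters before inference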
In Faster R-CNN, the ResNet50 network [62] was used as the backbone for feature extraction. The model was optimised using the SGD algorithm over 100 epochs and 20,000 iterations, with validation checkpoints every 2500 iterations. The training configuration included an initial learning rate of 0.001, a confidence threshold of 0.25, a batch size of 32 images, a momentum of 0.9, and a linear warm-up of 200 iterations. Hyperparameters such as anchor sizes, aspect ratios, and the number of classes were tuned for optimal performance. A HookBase function was used to select the checkpoint with the best Average Precision as the final model for prediction.
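An illustrative Detectron2 configuration reflecting the settings listed above might look as follows; the registered dataset names are hypothetical, and the choice of the ResNet-50 FPN config file is an assumption.

# Illustrative Detectron2 setup for the Faster R-CNN training described above.
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultTrainer

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"))            # ResNet-50 backbone (assumed variant)
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml")              # MS COCO pre-trained weights
cfg.DATASETS.TRAIN = ("cattle_train",)                          # hypothetical registered datasets
cfg.DATASETS.TEST = ("cattle_val",)
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 1                             # single class: cattle
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.25                    # confidence threshold
cfg.SOLVER.IMS_PER_BATCH = 32
cfg.SOLVER.BASE_LR = 0.001
cfg.SOLVER.MOMENTUM = 0.9
cfg.SOLVER.WARMUP_ITERS = 200
cfg.SOLVER.MAX_ITER = 20000
cfg.TEST.EVAL_PERIOD = 2500                                     # validation checkpoints

trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()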
Training and validation were conducted simultaneously in all three scenarios. During training, the epochs yielding the best weights according to the mean average precision (mAP) were saved for validation. A post-training non-maximum suppression (NMS) approach was applied to retain the bounding boxes with the highest confidence in cases where multiple predicted boxes overlapped during inference. YOLOv5 and YOLOv8 were implemented using PyTorch version 2.1.0, while the Faster R-CNN model was implemented with Detectron2 [63]. All models were trained on an NVIDIA Tesla V100-SXM2 (Volta) GPU with 32 GB of memory.
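The NMS step can be sketched with torchvision as below, keeping only the highest-confidence box among overlapping predictions; the boxes and scores are illustrative values.

# Sketch of post-training NMS, assuming predictions are available as torch tensors.
import torch
from torchvision.ops import nms

boxes = torch.tensor([[10.0, 10.0, 30.0, 28.0],     # [x1, y1, x2, y2], illustrative values
                      [12.0, 11.0, 31.0, 29.0],     # overlapping duplicate of the first box
                      [120.0, 80.0, 140.0, 98.0]])
scores = torch.tensor([0.91, 0.62, 0.85])

keep = nms(boxes, scores, iou_threshold=0.5)         # indices of the retained detections
final_boxes, final_scores = boxes[keep], scores[keep]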

2.3.7. Accuracy Assessment

The metrics used to assess the model’s performance included Precision (P), Recall (R), F1-score, Average Precision (AP0.5), and mean Average Precision (AP0.5–0.95). Precision measures the model’s ability to correctly detect positive targets, while Recall represents the proportion of true positives detected in an image.
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives, respectively. Predictions with an Intersection over Union (IoU) equal to or greater than the threshold were categorised as true positives (TPs), while the rest were regarded as false positives (FPs). We used a lower confidence threshold of 0.25 compared to 0.3 in [35], since our images were collected above 800 m from the surface. Additionally, because the primary focus was on accurate cattle counts, the exact position of the animals was of minimal importance.
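The IoU used to separate true from false positives can be computed as in the short sketch below; the box coordinates are illustrative and the [x1, y1, x2, y2] pixel format is an assumption.

# Sketch of the IoU computation that decides whether a prediction is a true positive.
def iou(box_a, box_b):
    # Boxes as [x1, y1, x2, y2]; returns intersection area over union area.
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

predicted = [10.0, 12.0, 30.0, 28.0]                 # illustrative predicted box
reference = [11.0, 13.0, 31.0, 30.0]                 # illustrative annotated box
is_true_positive = iou(predicted, reference) >= 0.5  # IoU threshold used at test time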
F1-score is the harmonic mean of Precision and Recall, providing a balanced measure of the object detector’s performance and localisation accuracy.
$$F1\text{-}\mathrm{score} = \frac{2 \times R \times P}{R + P}$$
AP0.5 is the average Precision across different Recall values at an IoU threshold of 0.5. In addition, the AP0.5–0.95 metric, as used for the MS COCO2017 dataset, averages the Average Precision (AP) over multiple IoU thresholds ranging from 0.5 to 0.95 in intervals of 0.05.
$$AP_{0.5} = \frac{1}{n_{\mathrm{class}}} \sum_{i=1}^{n_{\mathrm{class}}} AP_i$$
$$AP_{0.5\text{--}0.95} = \frac{1}{n_{\mathrm{class}}} \int_{0}^{1} P(R)\, dR$$
Finally, total Counting Error (CE), coefficient of determination (R2), and Root Mean Square Error (RMSE) were used to evaluate cattle counts. CE identified systematic biases arising from over- and under-estimation errors. R2 assessed the model goodness of fit based on regression, while RMSE provided an overall measure of model performance by combining error magnitude and frequency.
$$CE = \frac{\mathrm{Predicted\ counts} - \mathrm{Reference\ counts}}{\mathrm{Reference\ counts}} \times 100$$
$$R^2 = 1 - \frac{\sum_{i}^{n}(x_i - y_i)^2}{\sum_{i}^{n}(x_i - \bar{x})^2}$$
$$RMSE = \sqrt{\frac{1}{n}\sum_{i}^{n}(x_i - y_i)^2}$$
where n is the number of images, and xi and yi are the predicted and reference counts of the i-th image, respectively.
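The counting metrics follow directly from the per-image counts, as in the sketch below; the counts are invented, and the R2 expression mirrors the formula given above.

# Sketch of the counting metrics (CE, R2, RMSE) from per-image counts; values are illustrative.
import numpy as np

reference = np.array([12, 3, 25, 7, 18])      # manual counts per image
predicted = np.array([11, 3, 22, 8, 17])      # model counts per image

ce = (predicted.sum() - reference.sum()) / reference.sum() * 100
rmse = np.sqrt(np.mean((predicted - reference) ** 2))
r2 = 1 - np.sum((predicted - reference) ** 2) / np.sum((predicted - predicted.mean()) ** 2)

print(f"CE = {ce:.1f}%, RMSE = {rmse:.2f}, R2 = {r2:.3f}")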

3. Results

3.1. Detection Performance on the Combined Dataset

For training Scenario 1, three experiments were conducted to assess the impact of transfer learning and data augmentation strategies on the model training. The first experiment trained the model from scratch without pre-trained weights. The second experiment initialised the training with pre-trained weights, and the last experiment combined transfer learning and data augmentation strategies. The inclusion of data augmentation strategies increased the validation accuracy (AP0.5–0.95) by an average of 12.1%, 12.5%, and 1.8% in YOLOv5, YOLOv8, and Faster R-CNN, respectively (Table 2). By contrast, transfer learning had a minimal increase of AP0.5–0.95 by 0.7% in YOLOv8 and 1.3% in YOLOv5, compared to +9.2% in Faster R-CNN. The models from the third experiment were applied to the test data to make inferences.
The YOLOv8m model attained the highest AP0.5–0.95 of 39.6% when trained with the dataset combining all three sites in Scenario 1 (Table 3). Despite the high AP0.5–0.95, the model had lower Precision and Recall than YOLOv8l (91.4%) and YOLOv5l (86.2%). In contrast to its high AP0.5–0.95, YOLOv8m recorded a larger CE (−8.0%) than YOLOv5m (−2.5%), indicating poorer counting performance. Although the YOLOv8 variants attained better AP0.5–0.95 than the YOLOv5 variants, they displayed poorer CE.
Notably, AP0.5 for Faster R-CNN was 30% lower than those of the YOLOv8 and YOLOv5 models, indicating a low precision of the model. Despite the low detection accuracy, Faster R-CNN had a lower CE than YOLOv8 variants. The positive CE in Faster R-CNN suggests higher instances of false positive detection by the model.
In both YOLOv5 and YOLOv8, the extra-large (x) architectures had lower AP0.5–0.95 than the intermediate architectures, medium (m) and large (l), despite being deeper and having more parameters. The better performance of the intermediate architectures suggests that increasing model depth and parameter count does not necessarily result in better accuracy. All models also showed lower Recall than Precision, likely due to the influence of shadows, dense vegetation, and blurred images in the test dataset. Finally, all models attained absolute count errors of less than 10%; however, the YOLOv8 models recorded a higher average count error of −7.5% compared to −3.8% for YOLOv5.
Figure 4 shows the performance of YOLOv5m and YOLOv8m predictions compared to the manual annotations. True positive detections among the white fur cattle had higher confidence scores than brown and dark-coated cattle. However, there were higher false negative detections (red boxes) within the YOLOv8m predictions compared to YOLOv5m, suggesting a possible limitation of YOLOv8 in distinguishing shadows from black cattle.
The shadows in the images posed a significant challenge, causing false positives (Figure 5a) and reducing the model’s ability to distinguish black fur cattle from shadows, leading to false negatives (Figure 5b). The models could detect new targets missed during annotation, mainly located along image edges with few discriminative characteristics for manual labelling. However, some of the new targets detected by the models were erroneous. For example, there were instances of overlapping predicted bounding boxes despite using NMS, which resulted in double counting of the cattle. In addition, in the image patches with shadows, the model detected both the cattle and their shadows separately, highlighting the complexity of the task and the challenge shadows pose to accurate detection.

3.2. Cattle Counts at Image Patch Level

As cattle density varies considerably in the study area, the models (YOLOv5m, YOLOv8m and Faster R-CNN) were assessed for their ability to estimate cattle counts across variable herd densities (Figure 6). YOLOv5 and YOLOv8 showed good correlations between the manual labels and the predicted counts, as indicated by the RMSE and R2. Image patches with smaller herds, fewer than 20 animals, were close to the one-to-one line, indicating minimal difference between predicted and manual counts. As herd size increased, YOLOv5m outperformed YOLOv8m, producing accurate predictions for large herds and achieving the best R2 of 0.98. By contrast, Faster R-CNN recorded a lower correlation with R2 of 0.87 and RMSE of 2.9. Faster R-CNN performed particularly poorly with larger herds, heavily underestimating their size, and also attained lower accuracy with small herds, including some large overestimates.
Figure 5. Illustration of challenges in cattle detection by YOLOv5: (a) false positives on iron roof shadows and (b) false negatives due to confusion between shadows and dark-coated cattle.

3.3. Impact of Training Data Selection on YOLO Detection Accuracy

As Faster R-CNN showed poor results compared to the YOLO models, we report results only for the YOLO models in training Scenarios 2 and 3. For Scenario 2, we trained the models using data from one site and tested them on the remaining two sites (Table 4). The results indicate that, as the flying height increased in the more heterogeneous landscape of the Choke site, the models registered lower Recall and AP0.5–0.95 scores on the test datasets. The lower Recall, particularly for the Lumo site, suggests a higher rate of false negatives on the test datasets. The low metrics of the Choke-trained models contrast with those of the models trained on the THWS and Lumo datasets, which achieved AP0.5 greater than 80% (Tables S3 and S4). The low Recall and AP0.5–0.95 observed in models trained on the Choke dataset can be attributed to its denser, more heterogeneous vegetation and image blurriness due to the higher flying height, in contrast to Lumo and THWS, which have more similar landscape characteristics and flying heights.

3.4. Landcover Characteristics

For training Scenario 3, we trained the models using data from two sites and tested them on the third site. This approach considerably improved detection accuracies (Table 5), underscoring the benefits of combining training data from multiple sites. Inference on the THWS dataset attained the highest performance metrics, with YOLOv5x recording the best AP0.5–0.95 score of 40.4%, and Precision and Recall values of 91.0% and 83.6%, respectively. The high performance on THWS data is likely due to its highly homogeneous grassland landscape, which increases the contrast with the cattle. Conversely, YOLOv5m had a lower Recall of 76.5% on the highly heterogeneous Choke dataset. Notably, test data from Lumo had the lowest Precision of 86.3% and AP0.5–0.95 score of 36.7% among the three sites (Table S5). The low Precision may be partly due to shadows in Lumo images, which can confuse the model and result in more false negatives of the dark-coated cattle.
In terms of counts, cattle within THWS were the most accurately detected, achieving an R2 of 0.98 (Figure 7). The best performance in the THWS dataset is aligned with the high Precision of 91.0% of YOLOv5x, as shown in Table 5. The high RMSE in Lumo and Choke suggests a high occurrence of false negatives caused by low contrast and shadows in these sites’ images. While Lumo had a higher R2 than Choke, it registered a slightly lower RMSE, reflecting the larger mean herd size of 11.6 in Lumo compared to 7.0 in Choke per 400 m2 image patch.
The models generally detected cattle across large areas with minimal residuals between manual and predicted annotations. More than two-thirds (67.7%) of the tile-wise residuals were within the −25 to 25 range, indicating minimal errors by the models compared to the manual counts (Figure 8). All image tiles with zero residuals (8.8%) were located within THWS, while the largest residual was an underestimate of 124 animals. The poor detection for that tile was due to the model’s inability to differentiate dark-coated cattle from shadows.

4. Discussion

4.1. Performance of Computer Vision Models on Cattle Detection in Diverse Environments

We demonstrated that it is feasible to automate the detection and counting of cattle in heterogeneous landscapes in southern Kenya using computer vision models on aerial RGB imagery. The automated detection models had high accuracy and were close to human detection capabilities. The one-stage detectors (YOLOv5 and YOLOv8) and a two-stage detector (Faster R-CNN) were applied to detect cattle at three distinct sites: grassland, dense shrubland, and a mix of shrubland and bare land. The medium YOLOv8 architecture had the best accuracy (AP0.5–0.95) at 39.6%, outperforming the extra-large YOLOv8 architecture and Faster R-CNN. The high AP0.5–0.95 accuracies of both YOLOv5m and YOLOv8m are contrary to prior studies [24,64], in which deep and complex models recorded the highest accuracies as the increase in parameter depth made them robust to diverse datasets. However, the results align with previous findings [36,44,65] that complex models with extensive downsampling operations within their CNNs experience spatial degradation of target features, particularly for small-sized targets. These operations result in the loss of critical information about the target, making it challenging to achieve high accuracy for small-sized targets.
Models trained on data from dense shrubland (Choke) and lower spatial resolution (higher flying height) recorded the lowest AP0.5–0.95 when tested in an unfamiliar environment. The low accuracy suggests that the models may generalise poorly to different ecological conditions, limiting their practical application across diverse landscapes. Moreover, their low Precision and Recall scores indicated higher instances of cattle misdetection. This finding contrasts with the high Precision observed in datasets from open grassland areas (THWS and Lumo), where the higher spatial resolution of the images enhanced the contrast between the cattle and the background.
The open grassland of THWS presented minimal occlusion, and most cattle grazing there have white coats, which contrast significantly with the surroundings. Similar results on the impact of coat colour on livestock detection accuracy were reported by Xu et al. [25], where white sheep were detected with an accuracy of 97.3%, compared to 94.7% for cattle heads. The cattle’s brown-to-dark coat colour reduced the contrast with the background, making it difficult to distinguish them from rocks in pastures and bushes. These findings align with the criterion established by LaRue et al. [66], which emphasises that high spatial resolution, sharp contrast between the target and the background, and an open environment are fundamental for high detection accuracy rates in aerial images. While Lumo shares similar land cover characteristics with THWS, lower detection accuracies were observed in Lumo due to shadows. Shadows were prevalent in the dataset collected over Lumo, mainly because the images were collected earlier in the day. This highlights the importance of consistency in image acquisition. However, aerial image acquisition highly depends on favourable weather conditions, particularly near hilly areas with limited clear sky windows, as experienced during our data collection.

4.2. Impact of Training Strategy, Augmentation Methods and Remaining Challenges

An increase in training size and data diversity through augmentation strategies improved YOLOv8 mAP0.5–0.95 accuracies by +2.5%, +3.3% and +3.4% for the test datasets collected from THWS, Choke and Lumo, respectively. The inclusion of augmented images with varying brightness, contrast and saturation levels strengthened the model against varying illumination, minimising the probability of false negatives [26]. Random cropping in the augmentation strategies introduced multi-scale variation, enabling the model to detect cattle of varying sizes. Shadows from trees and cattle, poor illumination, and minimal contrast of brown and dark-coated livestock with their background were the main sources of erroneous detection. Despite our efforts to include additional background data and augmentation strategies, these challenges persisted, limiting the model detection capability.
The best model in the study attained AP0.5–0.95 of 39.6%, with Precision and Recall of 91.0% and 83.4%, respectively (Table 3). A Precision greater than 90% indicates that combining data from the different landscapes boosts the model’s ability to generalise to unseen datasets. A Recall greater than 80% suggests that the detection of all cattle in the test data could be further improved by adding more data from varied ecological conditions. The presence of shadows, dense vegetation, and blurred images in the overall test dataset may contribute to the decline in Recall scores. Achieving high AP0.5–0.95 scores for small objects is more challenging than for large ones, as the pre-trained weights of the MS COCO dataset are optimised for large objects. The MS COCO dataset is biased towards medium and large objects in terms of the number of images and annotations, such that even slight deviations in the predicted box for small objects can significantly reduce detection accuracy [21,67]. Furthermore, nine of the ten IoU thresholds used to calculate the default MS COCO AP0.5–0.95 metric are higher than the IoU of 0.5 used during testing. This means that AP0.5–0.95 is better suited for evaluating localisation quality than counting performance. This is evident in the counting error metrics, where the YOLOv8m model, despite its high AP0.5–0.95, had the least accurate counting performance (Table 3).
Compared to a similar study conducted in Chad [21], the YOLOv5m model used in our study outperformed the HerdNet model, which attained a Precision of 77.5% and a total count error of −9.4%. That study used the point-based HerdNet model to detect goats and sheep in manned aerial images collected at altitudes of 300 to 350 feet. By contrast, studies using images acquired from UAVs [9,32,64,68] attained higher Precision rates of between 89% and 97%. The high Precision in these studies can be attributed to their more homogeneous environments, which enhance the sharp contrast between brightly coloured livestock and their backgrounds. Additionally, the images were captured from altitudes of less than 100 m, compared to our flying heights of 850 to 1400 m. The lower flying height of UAVs and the higher spatial resolution enable detection models to leverage unique cattle features more effectively, unlike in our study, where cattle often appeared as blobs.
The YOLOv8 architectures generally registered higher Precision and Recall scores than YOLOv5 and Faster R-CNN. We deduce that the improved, anchor-free YOLOv8 detection head enhances its accuracy over the anchor-based YOLOv5 and Faster R-CNN. Furthermore, the superior overall accuracy of both YOLOv5 and YOLOv8 compared to Detectron2’s Faster R-CNN can be attributed to YOLO’s architecture, which utilises larger feature maps, making it more efficient in pattern and object recognition [69]. Surprisingly, YOLOv8 had poorer counting metrics than YOLOv5 despite being anchor-free. The poorer CE of the YOLOv8 models arises from a higher number of false negatives during detection, which can be attributed to its limited 80 × 80 maximum feature map size [70], preventing it from fully detecting small-sized targets. In addition, we attribute the better counting performance of YOLOv5 to fine-tuning the hyperparameters to fit the target object size prior to training. These findings are consistent with a previous study [14] suggesting that AP0.5–0.95 metrics alone should not be the sole basis for selecting the best model in animal survey studies. The low performance of Faster R-CNN in aerial detection aligns with findings from [21], where Faster R-CNN achieved a Precision of 39.4% with a high number of false positives in dense herds. In addition, Detectron2 resizes input images to 1024 × 1024 pixels, reducing the resolution of small targets and increasing the risk of information loss during the feature extraction and max-pooling phases in the convolutional layers of Faster R-CNN [16,48].

4.3. The Way Forward to Automatic Livestock Counting in African Rangelands

Our results demonstrate the high potential of computer vision models for livestock inventories in African rangelands. However, the presence of shadows and the detection of low-contrast brown and dark-coated cattle remain challenging. Although data augmentation strategies and fine-tuning were implemented to improve the model’s performance on unseen data, the size of the target in detection has a dominant influence compared to the complexity of the model and other pre-processing strategies. Nevertheless, integrating thermal infrared imaging and visible (RGB) images could help reduce the impact of shadows on detection accuracy [30,71]. However, thermal infrared imaging will be limited to the early hours when air temperatures are low in tropical regions. In addition, using near-infrared bands, which are less sensitive to illumination, can enhance the contrast between vegetation and livestock [71,72].
For small object detection, object detector architectures are being actively developed and modified to reduce the limitations posed by targets covering only a few pixels and to broaden their applications. Methods such as programmable gradient information (PGI) in the latest one-stage detector, YOLOv9 [73], can significantly reduce the information loss of small objects during feature extraction from remote sensing imagery. In addition, attention mechanisms [74] in detector architectures hold significant potential, as they allow the network to concentrate on crucial features around small objects. Focusing on small targets in large aerial imagery will ensure optimal feature representation, improving overall detection accuracy.
This study contributes a unique set of livestock aerial datasets and customised models for the heterogeneous landscapes of African rangelands to a growing collection of remotely sensed livestock detection studies [19,24,26]. However, further collection and annotation of aerial spatiotemporal livestock data from various geographical locations are necessary to improve detection accuracy. Expanding the pool of such aerial datasets will improve the representation of small objects in current remote sensing image datasets, such as DOTA [75]. This expansion will facilitate the development of pre-trained models based on aerial images, improving detection accuracy by better aligning the target and source domain feature spaces for transfer learning [42].
The trained models from this study can be upscaled to conduct livestock census using very high-resolution satellite images. Counting livestock from space will enhance adaptive management for large rangeland areas by supporting forage management, tracking grazing pressure, and assessing the rangelands’ carrying capacity to minimise degradation. Additionally, consistent livestock counts are essential for developing accurate models to track the effectiveness of greenhouse gas emission measures from cattle across the African continent, contributing to climate change mitigation efforts.

5. Conclusions

Detecting and counting livestock from remotely sensed images is critical for developing sustainable grazing measures in rangeland areas. This study showcases deep learning object detection models to detect and count cattle on images from manned aircraft across heterogeneous landscapes. Nine object detection models were assessed on three sites with differing land cover characteristics. Increasing the training dataset’s size and diversity improved detection accuracy compared to using transfer learning to initialise the models during training. Additionally, the quality of training data, such as illumination, shadow presence, and cattle coat colour, impacted detection accuracy. YOLOv8 models had higher mAP detection accuracy but recorded lower counting accuracies than YOLOv5 models. Tuning the hyperparameters in the YOLOv5m model during training led to better localisation and minimal misdetection in the inference dataset. However, attaining high detection accuracy for small-size targets like cattle from aerial images remains challenging, especially when using transfer learning weights from the MS COCO dataset due to its bias towards medium and large objects. Developing pre-trained weights from large aerial datasets with sufficient small-size target training data will enhance detection accuracy. In addition, strategies such as data fusion of thermal and near-infrared imaging can help overcome difficulties posed by poor illumination and improve detection accuracy. The small size of cattle relative to the spatial resolution of aerial images offers minimal features for detection. However, different cattle species may be distinguished based on their coat colours. Future work will seek to classify the livestock species based on their coat colours at the three sites and test the model with diverse datasets from rangelands worldwide.

Supplementary Materials

The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/rs16162929/s1; Table S1. Distribution of test images and annotation. Table S2. Fine-tuned YOLOv5 hyperparameters values attained from generic evolution. Table S3. Accuracy metrics for YOLOv5 on scenario 2 test dataset. Table S4. Accuracy metrics for YOLOv8 on scenario 2 test dataset. Table S5. Model performance across YOLOv5 and YOLOv8 in scenario 3.

Author Contributions

Conceptualization, I.A.O., P.P. and J.H.; methodology, I.A.O., J.H. and T.V.; formal analysis, I.A.O.; data curation, I.A.O. and I.V.; writing—original draft preparation, I.A.O.; writing—review and editing, I.A.O., P.P., F.K., I.V., T.V., M.B. and J.H.; visualisation, I.A.O.; supervision, J.H.; funding acquisition, P.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by European Union DG International Partnerships under DeSIRA (Development of Smart Innovation through Research in Agriculture) programme (FOOD/2020/418-132) through the Earth observation and environmental sensing for climate-smart sustainable agropastoral ecosystem transformation in East Africa (ESSA) project. Open access funding provided by University of Helsinki.

Data Availability Statement

The data and the python codes developed for this study are made available at GitHub (https://github.com/Ian-ocholla/Aerial_detection_livestock, accessed on 8 August 2024).

Acknowledgments

The research permits from the National Commission for Science, Technology & Innovation (NACOSTI/P/21/14977, NACOSTI/P/21/14537 & NACOSTI/P/24/34925) in Kenya are acknowledged. We acknowledge CSC-IT Center for Science, Finland, for computational resources.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. FAO. World Livestock: Transforming the Livestock Sector through the Sustainable Development Goals; Food and Agriculture Organization of the United Nations: Rome, Italy, 2018; ISBN 978-92-5-130883-7. [Google Scholar]
  2. Pica-Ciamarra, U.; Baker, D.; Morgan, N.; Zezza, A.; Azzarri, C.; Ly, C.; Nsiima, L.; Nouala, S.; Okello, P.; Sserugga, J. Investing in the Livestock Sector: Why Good Numbers Matter. A Sourcebook for Decision Makers on How to Improve Livestock Data; Report Number 85732-GLB; World Bank: Rome, Italy, 2014. [Google Scholar]
  3. Brown, J.; Qiao, Y.; Clark, C.; Lomax, S.; Rafique, K.; Sukkarieh, S. Automated Aerial Animal Detection When Spatial Resolution Conditions Are Varied. Comput. Electron. Agric. 2022, 193, 106689. [Google Scholar] [CrossRef]
  4. Mare, F.; Bahta, Y.T.; Van Niekerk, W. The Impact of Drought on Commercial Livestock Farmers in South Africa. Dev. Pract. 2018, 28, 884–898. [Google Scholar] [CrossRef]
  5. FAO. Disasters on Agriculture and Food Security Through Investment in Resilience; Food and Agriculture Organization of the United Nations: Rome, Italy, 2023; ISBN 9789251381946. [Google Scholar]
  6. Cheng, M.; McCarl, B.; Fei, C. Climate Change and Livestock Production: A Literature Review. Atmosphere 2022, 13, 140. [Google Scholar] [CrossRef]
  7. Herrero, M.; Grace, D.; Njuki, J.; Johnson, N.; Enahoro, D.; Silvestri, S.; Rufino, M.C. The Roles of Livestock in Developing Countries. Animal 2013, 7, 3–18. [Google Scholar] [CrossRef] [PubMed]
  8. Marcon, A.; Battocchio, D.; Apollonio, M.; Grignolio, S. Assessing Precision and Requirements of Three Methods to Estimate Roe Deer Density. PLoS ONE 2019, 14, e0222349. [Google Scholar] [CrossRef] [PubMed]
  9. Han, L.; Tao, P.; Martin, R.R. Livestock Detection in Aerial Images Using a Fully Convolutional Network. Comput. Vis. Media 2019, 5, 221–228. [Google Scholar] [CrossRef]
  10. Gilbert, M.; Nicolas, G.; Cinardi, G.; Van Boeckel, T.P.; Vanwambeke, S.O.; Wint, G.R.W.; Robinson, T.P. Global Distribution Data for Cattle, Buffaloes, Horses, Sheep, Goats, Pigs, Chickens and Ducks in 2010. Sci. Data 2018, 5, 180227. [Google Scholar] [CrossRef] [PubMed]
  11. Balehegn, M.; Kebreab, E.; Tolera, A.; Hunt, S.; Erickson, P.; Crane, T.A.; Adesogan, A.T. Livestock Sustainability Research in Africa with a Focus on the Environment. Anim. Front. 2021, 11, 47–56. [Google Scholar] [CrossRef]
  12. UNDESA. World Population Prospects 2022. Summary of Results; United Nations: New York, NY, USA, 2022; ISBN 978-92-1-148373-4. [Google Scholar]
  13. Dutilly, C.; Alary, V.; Bonnet, P.; Lesnoff, M.; Fandamu, P.; de Haan, C. Multi-Scale Assessment of the Livestock Sector for Policy Design in Zambia. J. Policy Model. 2020, 42, 401–418. [Google Scholar] [CrossRef]
  14. Ekwem, D.; Enright, J.; Hopcraft, J.G.C.; Buza, J.; Shirima, G.; Shand, M.; Mwajombe, J.K.; Bett, B.; Reeve, R.; Lembo, T. Local and Wide-Scale Livestock Movement Networks Inform Disease Control Strategies in East Africa. Sci. Rep. 2023, 13, 9666. [Google Scholar] [CrossRef]
  15. Norton-Griffiths, M. Counting Animals. Available online: https://www.awf.org/sites/default/files/media/Resources/Books%2520and%2520Papers/AWF_1_counting_animals.pdf (accessed on 21 March 2022).
  16. Jachmann, H. Comparison of Aerial Counts with Ground Counts for Large African Herbivores. J. Appl. Ecol. 2002, 39, 841–852. [Google Scholar] [CrossRef]
  17. Schlossberg, S.; Chase, M.J.; Griffin, C.R. Testing the Accuracy of Aerial Surveys for Large Mammals: An Experiment with African Savanna Elephants (Loxodonta africana). PLoS ONE 2016, 11, e0164904. [Google Scholar] [CrossRef] [PubMed]
  18. Corcoran, E.; Denman, S.; Hanger, J.; Wilson, B.; Hamilton, G. Automated Detection of Koalas Using Low-Level Aerial Surveillance and Machine Learning. Sci. Rep. 2019, 9, 3208. [Google Scholar] [CrossRef] [PubMed]
  19. Moreni, M.; Theau, J.; Foucher, S. Do You Get What You See? Insights of Using MAP to Select Architectures of Pretrained Neural Networks for Automated Aerial Animal Detection. PLoS ONE 2023, 18, e0284449. [Google Scholar] [CrossRef] [PubMed]
  20. LeCun, Y.; Bengio, Y.; Hinton, G. Deep Learning. Nature 2015, 521, 436–444. [Google Scholar] [CrossRef] [PubMed]
  21. Cheng, G.; Yuan, X.; Yao, X.; Yan, K.; Zeng, Q.; Xie, X.; Han, J. Towards Large-Scale Small Object Detection: Survey and Benchmarks. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 13467–13488. [Google Scholar] [CrossRef] [PubMed]
  22. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef]
  23. Xu, B.; Wang, W.; Falzon, G.; Kwan, P.; Guo, L.; Chen, G.; Tait, A.; Schneider, D. Automated Cattle Counting Using Mask R-CNN in Quadcopter Vision System. Comput. Electron. Agric. 2020, 171, 105300. [Google Scholar] [CrossRef]
  24. Barbedo, J.G.A.; Koenigkan, L.V.; Santos, T.T.; Santos, P.M. A Study on the Detection of Cattle in UAV Images Using Deep Learning. Sensors 2019, 19, 5436. [Google Scholar] [CrossRef]
  25. Xu, B.; Wang, W.; Falzon, G.; Kwan, P.; Guo, L.; Sun, Z.; Li, C. Livestock Classification and Counting in Quadcopter Aerial Images Using Mask R-CNN. Int. J. Remote Sens. 2020, 41, 8121–8142. [Google Scholar] [CrossRef]
  26. Delplanque, A.; Foucher, S.; Théau, J.; Bussière, E.; Vermeulen, C.; Lejeune, P. From Crowd to Herd Counting: How to Precisely Detect and Count African Mammals Using Aerial Imagery and Deep Learning? ISPRS J. Photogramm. Remote Sens. 2023, 197, 167–180. [Google Scholar] [CrossRef]
  27. Torney, C.J.; Lloyd-Jones, D.J.; Chevallier, M.; Moyer, D.C.; Maliti, H.T.; Mwita, M.; Kohi, E.M.; Hopcraft, G.C. A Comparison of Deep Learning and Citizen Science Techniques for Counting Wildlife in Aerial Survey Images. Methods Ecol. Evol. 2019, 10, 779–787. [Google Scholar] [CrossRef]
  28. Chamoso, P.; Raveane, W.; Parra, V.; González, A. UAVs Applied to the Counting and Monitoring of Animals. In Advances in Intelligent Systems and Computing; Ramos, C., Novais, P., Nihan, C.E., Rodriquez, J.M.C., Eds.; Springer International Publishing: Berlin/Heidelberg, Germany, 2014; Volume 291, pp. 71–80. ISBN 9783319075952. [Google Scholar]
  29. Rivas, A.; Chamoso, P.; González-Briones, A.; Corchado, J. Detection of Cattle Using Drones and Convolutional Neural Networks. Sensors 2018, 18, 2048. [Google Scholar] [CrossRef] [PubMed]
  30. Lee, S.; Song, Y.; Kil, S.H. Feasibility Analyses of Real-Time Detection of Wildlife Using UAV-Derived Thermal and RGB Images. Remote Sens. 2021, 13, 2169. [Google Scholar] [CrossRef]
  31. Soares, V.H.A.; Ponti, M.A.; Gonçalves, R.A.; Campello, R.J.G.B. Cattle Counting in the Wild with Geolocated Aerial Images in Large Pasture Areas. Comput. Electron. Agric. 2021, 189, 106354. [Google Scholar] [CrossRef]
  32. Shao, W.; Kawakami, R.; Yoshihashi, R.; You, S.; Kawase, H.; Naemura, T. Cattle Detection and Counting in UAV Images Based on Convolutional Neural Networks. Int. J. Remote Sens. 2020, 41, 31–52. [Google Scholar] [CrossRef]
  33. Barbedo, J.G.A.; Koenigkan, L.V.; Santos, P.M. Cattle Detection Using Oblique UAV Images. Drones 2020, 4, 75. [Google Scholar] [CrossRef]
  34. Barbedo, J.G.A.; Koenigkan, L.V.; Santos, P.M.; Ribeiro, A.R.B. Counting Cattle in UAV Images—Dealing with Clustered Animals and Animal/Background Contrast Changes. Sensors 2020, 20, 2126. [Google Scholar] [CrossRef] [PubMed]
  35. Eikelboom, J.A.J.; Wind, J.; van de Ven, E.; Kenana, L.M.; Schroder, B.; de Knegt, H.J.; van Langevelde, F.; Prins, H.H.T. Improving the Precision and Accuracy of Animal Population Estimates with Aerial Image Object Detection. Methods Ecol. Evol. 2019, 10, 1875–1887. [Google Scholar] [CrossRef]
  36. Sarwar, F.; Griffin, A.; Rehman, S.U.; Pasang, T. Detecting Sheep in UAV Images. Comput. Electron. Agric. 2021, 187, 106219. [Google Scholar] [CrossRef]
  37. Zhao, W.; Liu, Y.; Liu, P.; Wu, H.; Dong, Y. Optimal Strategies for Wide-Area Small Object Detection Using Deep Learning: Practices from a Global Flying Aircraft Dataset. Int. J. Appl. Earth Obs. Geoinf. 2024, 127, 103648. [Google Scholar] [CrossRef]
  38. Taita Taveta County Integrated Development Plan III (2023–2027). Available online: https://taitatavetaassembly.go.ke/documents/cidp-iii-2023-27/ (accessed on 19 October 2023).
  39. Amara, E.; Adhikari, H.; Heiskanen, J.; Siljander, M.; Munyao, M.; Omondi, P.; Pellikka, P. Aboveground Biomass Distribution in a Multi-Use Savannah Landscape in Southeastern Kenya: Impact of Land Use and Fences. Land 2020, 9, 381. [Google Scholar] [CrossRef]
  40. Abera, T.A.; Vuorinne, I.; Munyao, M.; Pellikka, P.K.E.; Heiskanen, J. Land Cover Map for Multifunctional Landscapes of Taita Taveta County, Kenya, Based on Sentinel-1 Radar, Sentinel-2 Optical, and Topoclimatic Data. Data 2022, 7, 36. [Google Scholar] [CrossRef]
  41. Maharana, K.; Mondal, S.; Nemade, B. A Review: Data Pre-Processing and Data Augmentation Techniques. Glob. Transit. Proc. 2022, 3, 91–99. [Google Scholar] [CrossRef]
  42. Safonova, A.; Ghazaryan, G.; Stiller, S.; Main-Knorn, M.; Nendel, C.; Ryo, M. Ten Deep Learning Techniques to Address Small Data Problems with Remote Sensing. Int. J. Appl. Earth Obs. Geoinf. 2023, 125, 103569. [Google Scholar] [CrossRef]
  43. Hao, X.; Liu, L.; Yang, R.; Yin, L.; Zhang, L.; Li, X. A Review of Data Augmentation Methods of Remote Sensing Image Target Recognition. Remote Sens. 2023, 15, 827. [Google Scholar] [CrossRef]
  44. Wan, D.; Lu, R.; Wang, S.; Shen, S.; Xu, T.; Lang, X. YOLO-HR: Improved YOLOv5 for Object Detection in High-Resolution Optical Remote Sensing Images. Remote Sens. 2023, 15, 614. [Google Scholar] [CrossRef]
  45. Mumuni, A.; Mumuni, F. Data Augmentation: A Comprehensive Survey of Modern Approaches. Array 2022, 16, 100258. [Google Scholar] [CrossRef]
  46. Skalski, P. Make Sense. Available online: https://github.com/SkalskiP/make-sense/ (accessed on 3 August 2023).
  47. Osco, L.P.; Marcato Junior, J.; Marques Ramos, A.P.; de Castro Jorge, L.A.; Fatholahi, S.N.; de Andrade Silva, J.; Matsubara, E.T.; Pistori, H.; Gonçalves, W.N.; Li, J. A Review on Deep Learning in UAV Remote Sensing. Int. J. Appl. Earth Obs. Geoinf. 2021, 102, 102456. [Google Scholar] [CrossRef]
  48. Xu, Z.; Wang, T.; Skidmore, A.K.; Lamprey, R. A Review of Deep Learning Techniques for Detecting Animals in Aerial and Satellite Images. Int. J. Appl. Earth Obs. Geoinf. 2024, 128, 103732. [Google Scholar] [CrossRef]
  49. Wang, N.; Gao, Y.; Chen, H.; Wang, P.; Tian, Z.; Shen, C.; Zhang, Y. NAS-FCOS: Fast Neural Architecture Search for Object Detection. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 11940–11948. [Google Scholar]
  50. Carranza-García, M.; Torres-Mateo, J.; Lara-Benítez, P.; García-Gutiérrez, J. On the Performance of One-Stage and Two-Stage Object Detectors in Autonomous Vehicles Using Camera Data. Remote Sens. 2021, 13, 89. [Google Scholar] [CrossRef]
  51. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016; pp. 779–788. [Google Scholar]
  52. Everingham, M.; Van Gool, L.; Williams, C.K.I.; Winn, J.; Zisserman, A. The Pascal Visual Object Classes (VOC) Challenge. Int. J. Comput. Vis. 2010, 88, 303–338. [Google Scholar] [CrossRef]
  53. Wang, Y.; Bashir, S.M.A.; Khan, M.; Ullah, Q.; Wang, R.; Song, Y.; Guo, Z.; Niu, Y. Remote Sensing Image Super-Resolution and Object Detection: Benchmark and State of the Art. Expert Syst. Appl. 2022, 197, 116793. [Google Scholar] [CrossRef]
  54. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  55. Jocher, G. YOLOv5 by Ultralytics. Available online: https://github.com/ultralytics/yolov5 (accessed on 19 February 2024).
  56. Huang, Y.; Qian, Y.; Wei, H.; Lu, Y.; Ling, B.; Qin, Y. A Survey of Deep Learning-Based Object Detection Methods in Crop Counting. Comput. Electron. Agric. 2023, 215, 108425. [Google Scholar] [CrossRef]
  57. Jocher, G.; Chaurasia, A.; Qiu, J. YOLO by Ultralytics. Available online: https://github.com/ultralytics/ultralytics (accessed on 21 June 2023).
  58. Terven, J.; Cordova-Esparza, D. A Comprehensive Review of YOLO: From YOLOv1 and Beyond. arXiv 2023, arXiv:2304.00501. [Google Scholar]
  59. Girshick, R. Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  60. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  61. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In European Conference on Computer Vision—ECCV 2014 Part of the Lecture Notes in Computer Science; Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T., Eds.; Springer International Publishing: Cham, Switzerland, 2014; Volume 8693, pp. 740–755. ISBN 978-3-319-10602-1. [Google Scholar]
  62. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. arXiv 2016, arXiv:1512.03385. [Google Scholar]
  63. Wu, Y.; Kirillov, A.; Massa, F.; Lo, W.-Y.; Girshick, R. Detectron2. Available online: https://github.com/facebookresearch/detectron2 (accessed on 21 February 2023).
  64. de Lima Weber, F.; de Moraes Weber, V.A.; de Moraes, P.H.; Matsubara, E.T.; Paiva, D.M.B.; de Nadai Bonin Gomes, M.; de Oliveira, L.O.F.; de Medeiros, S.R.; Cagnin, M.I. Counting Cattle in UAV Images Using Convolutional Neural Network. Remote Sens. Appl. 2023, 29, 100900. [Google Scholar] [CrossRef]
  65. Peng, J.; Wang, D.; Liao, X.; Shao, Q.; Sun, Z.; Yue, H.; Ye, H. Wild Animal Survey Using UAS Imagery and Deep Learning: Modified Faster R-CNN for Kiang Detection in Tibetan Plateau. ISPRS J. Photogramm. Remote Sens. 2020, 169, 364–376. [Google Scholar] [CrossRef]
  66. LaRue, M.A.; Stapleton, S.; Anderson, M. Feasibility of Using High-resolution Satellite Imagery to Assess Vertebrate Wildlife Populations. Conserv. Biol. 2017, 31, 213–220. [Google Scholar] [CrossRef]
  67. Kisantal, M.; Wojna, Z.; Murawski, J.; Naruniec, J.; Cho, K. Augmentation for Small Object Detection. In Proceedings of the 9th International Conference on Advances in Computing and Information Technology (ACITY 2019), Sydney, Australia, 21–22 December 2019; pp. 119–133. [Google Scholar]
  68. de Andrade Porto, J.V.; Rezende FP, C.; Astolfi, G.; de Moraes Weber, V.A.; Pache MC, B.; Pistori, H. Automatic Counting of Cattle with Faster R-CNN on UAV Images. In Proceedings of the Anais do XVII Workshop de Visão Computacional (WVC 2021), Online, 22–23 November 2021; Sociedade Brasileira de Computação—SBC: Brasília, Brazil, 2021; pp. 1–6. [Google Scholar]
  69. Butt, M.; Glas, N.; Monsuur, J.; Stoop, R.; de Keijzer, A. Application of YOLOv8 and Detectron2 for Bullet Hole Detection and Score Calculation from Shooting Cards. AI 2023, 5, 72–90. [Google Scholar] [CrossRef]
  70. Fang, C.; Li, C.; Yang, P.; Kong, S.; Han, Y.; Huang, X.; Niu, J. Enhancing Livestock Detection: An Efficient Model Based on YOLOv8. Appl. Sci. 2024, 14, 4809. [Google Scholar] [CrossRef]
  71. Ocholla, I.A.; Pellikka, P.; Karanja, F.N.; Vuorinne, I.; Odipo, V.; Heiskanen, J. Livestock Detection in African Rangelands: Potential of High-Resolution Remote Sensing Data. Remote Sens. Appl. 2024, 33, 101139. [Google Scholar] [CrossRef]
  72. Chen, F.; Zhou, R.; Van de Voorde, T.; Chen, X.; Bourgeois, J.; Gheyle, W.; Goossens, R.; Yang, J.; Xu, W. Automatic Detection of Burial Mounds (Kurgans) in the Altai Mountains. ISPRS J. Photogramm. Remote Sens. 2021, 177, 217–237. [Google Scholar] [CrossRef]
  73. Wang, C.-Y.; Yeh, I.-H.; Liao, H.-Y.M. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. arXiv 2024, arXiv:2402.13616. [Google Scholar]
  74. Xiuling, Z.; Huijuan, W.; Yu, S.; Gang, C.; Suhua, Z.; Quanbo, Y. Starting from the Structure: A Review of Small Object Detection Based on Deep Learning. Image Vis. Comput. 2024, 146, 105054. [Google Scholar] [CrossRef]
  75. Xia, G.S.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; Zhang, L. DOTA: A Large-Scale Dataset for Object Detection in Aerial Images. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 3974–3983. [Google Scholar]
Figure 1. Location and land cover of the study area with examples of cattle captured from aerial RGB imagery from the three study sites: (a) Lumo Conservancy, (b) Taita Hills Wildlife Sanctuary and (c) Choke Conservancy. Land cover data are from Abera et al. [40] (CC-BY).
Figure 2. Workflow of the cattle detection and counting based on aerial imagery and YOLOv5, YOLOv8 and Faster R-CNN deep learning techniques.
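For readers who want a concrete picture of the inference-and-counting step in this workflow, the sketch below shows one possible implementation with the ultralytics Python package; the weights file name, image folder, and confidence/IoU thresholds are illustrative assumptions rather than the settings used in the study.

```python
# A minimal sketch of per-patch detection and counting, assuming the
# ultralytics package and a fine-tuned weights file "yolov8m_cattle.pt";
# file names, folder, and thresholds are illustrative, not the study settings.
from pathlib import Path

from ultralytics import YOLO

model = YOLO("yolov8m_cattle.pt")  # assumed fine-tuned YOLOv8m checkpoint

patch_counts = {}
for patch in sorted(Path("test_patches").glob("*.png")):
    # One forward pass per 200 x 200 px patch (400 m2 at 10 cm resolution)
    result = model.predict(source=str(patch), conf=0.25, iou=0.45, verbose=False)[0]
    patch_counts[patch.name] = len(result.boxes)  # one box per detected cow

print(f"Total predicted cattle: {sum(patch_counts.values())}")
```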
Figure 3. Data augmentation strategies using geometric and pixel transformations on a single image patch.
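A minimal sketch of geometric and pixel-level augmentations of the kind illustrated in Figure 3 is given below, assuming the albumentations library; the specific transforms and parameters are examples only and do not reproduce the study's augmentation pipeline.

```python
# Example geometric and pixel (radiometric) augmentations for one image patch;
# transform choices and probabilities are illustrative assumptions.
import albumentations as A

augment = A.Compose(
    [
        # Geometric transformations
        A.HorizontalFlip(p=0.5),
        A.VerticalFlip(p=0.5),
        A.Rotate(limit=90, p=0.5),
        # Pixel transformations
        A.RandomBrightnessContrast(brightness_limit=0.2, contrast_limit=0.2, p=0.5),
        A.HueSaturationValue(p=0.3),
        A.GaussianBlur(blur_limit=3, p=0.2),
    ],
    # Keep YOLO-format bounding boxes aligned with the transformed image
    bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
)

# Usage: out = augment(image=image, bboxes=boxes, class_labels=labels)
```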
Figure 4. Comparison of (a) manual annotations and (b) YOLOv5m and (c) YOLOv8m predictions for cattle heads in Lumo Conservancy.
Figure 6. Manually annotated cattle counts compared to predicted counts per image patch (400 m2) for YOLOv5m, YOLOv8m, and Faster R-CNN.
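The agreement between manual and predicted per-patch counts in Figure 6 is summarized with RMSE and R2; a minimal sketch of how these two metrics can be computed from paired count arrays (placeholder values shown) follows.

```python
# Illustrative computation of per-patch count agreement metrics (RMSE, R2);
# the count arrays below are made-up placeholders, not study data.
import numpy as np

annotated = np.array([12, 0, 35, 7, 18], dtype=float)  # manual counts per patch
predicted = np.array([11, 0, 33, 8, 17], dtype=float)  # model counts per patch

rmse = np.sqrt(np.mean((predicted - annotated) ** 2))
ss_res = np.sum((annotated - predicted) ** 2)
ss_tot = np.sum((annotated - annotated.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(f"RMSE = {rmse:.2f}, R2 = {r2:.3f}")
```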
Figure 7. Count metrics showing the variability of the detection models across land cover characteristics in the (a) Choke, (b) Lumo, and (c) THWS datasets.
Figure 8. Comparison of (a) manual and (b) predicted cattle counts for 4 km2 tiles in Lumo and THWS.
Table 1. Sizes of the original and augmented training datasets for the three scenarios based on different data combinations from Lumo, Choke, and Taita Hills Wildlife Sanctuary (THWS).
Dataset | Original Images | Original Annotations | Augmented Images | Augmented Annotations
Scenario 1
All sites | 1316 | 12,283 | 8021 | 75,053
Scenario 2
Lumo | 537 | 6211 | 4268 | 47,210
THWS | 356 | 3106 | 2878 | 25,879
Choke | 423 | 2966 | 3030 | 23,705
Scenario 3
Lumo + Choke | 960 | 9177 | 7300 | 70,927
Lumo + THWS | 893 | 9317 | 7146 | 73,089
Choke + THWS | 779 | 6072 | 5908 | 49,584
Table 2. Validation metrics on the impact of transfer learning and inclusion of data augmentation strategies.
Model | Training without Pre-Trained Weights (P / R / AP0.5 / AP0.5–0.95) | Transfer Learning (P / R / AP0.5 / AP0.5–0.95) | Transfer Learning + Augmentation Strategies (P / R / AP0.5 / AP0.5–0.95)
YOLOv5s | 86.9 / 80.8 / 86.7 / 33.5 | 87.2 / 84.5 / 88.1 / 34.5 | 91.7 / 88.7 / 93.4 / 45.7
YOLOv5m | 87.5 / 80.1 / 86.0 / 33.6 | 87.2 / 82.8 / 87.0 / 35.2 | 91.5 / 88.1 / 93.0 / 45.6
YOLOv5l | 85.8 / 81.8 / 85.3 / 33.3 | 88.5 / 82.2 / 87.7 / 35.1 | 91.6 / 88.0 / 93.1 / 45.0
YOLOv5x | 87.2 / 81.2 / 85.7 / 33.0 | 88.2 / 83.0 / 87.4 / 35.2 | 91.3 / 88.3 / 93.1 / 45.4
YOLOv8s | 86.8 / 78.9 / 85.5 / 33.9 | 85.8 / 80.5 / 86.3 / 34.3 | 90.8 / 88.5 / 93.5 / 46.1
YOLOv8m | 85.1 / 78.6 / 85.5 / 33.8 | 85.2 / 78.3 / 84.6 / 34.3 | 91.2 / 88.3 / 93.6 / 46.4
YOLOv8l | 86.0 / 78.4 / 85.0 / 33.7 | 88.1 / 77.7 / 86.1 / 34.7 | 91.2 / 87.7 / 93.3 / 46.2
YOLOv8x | 85.0 / 77.6 / 84.1 / 33.6 | 87.0 / 79.7 / 86.2 / 34.6 | 91.1 / 88.5 / 93.3 / 46.1
Faster R-CNN | – / – / 49.7 / 14.6 | – / – / 66.1 / 23.8 | – / – / 68.0 / 25.6
P—Precision, R—Recall.
Table 3. Accuracy of the computer vision models (YOLOv5, YOLOv8, and Faster R-CNN) in cattle detection for Scenario 1.
Model | P | R | F1-Score | AP0.5 | AP0.5–0.95 | CE
YOLOv5s | 90.0 | 81.9 | 86.0 | 88.6 | 39.0 | −4.0
YOLOv8s | 90.3 | 83.1 | 87.0 | 88.5 | 39.3 | −6.7
YOLOv5m | 90.4 | 83.2 | 87.0 | 88.8 | 39.3 | −2.5
YOLOv8m | 91.0 | 83.4 | 87.0 | 88.8 | 39.6 | −8.0
YOLOv5l | 89.5 | 86.2 | 86.0 | 88.1 | 38.1 | −6.1
YOLOv8l | 91.4 | 83.6 | 87.0 | 89.1 | 39.3 | −7.9
YOLOv5x | 89.5 | 83.4 | 86.0 | 88.7 | 38.7 | −2.7
YOLOv8x | 90.4 | 82.3 | 86.0 | 88.2 | 38.8 | −7.7
Faster R-CNN | – | – | – | 58.8 | 18.7 | 5.5
P—Precision; R—Recall; CE—Count error.
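For reference, the F1-score in Table 3 is the harmonic mean of Precision and Recall; the count error (CE) is taken here, as an assumption since its definition is not restated in this back matter, to be the relative difference between predicted and annotated counts in percent. A worked check against the YOLOv8m row:

```latex
% Harmonic mean of Precision (P) and Recall (R), checked with the YOLOv8m values:
\mathrm{F1} = \frac{2\,P\,R}{P + R}
            = \frac{2 \times 91.0 \times 83.4}{91.0 + 83.4} \approx 87.0

% Count error as assumed here (relative difference, in percent):
\mathrm{CE} = \frac{N_{\mathrm{predicted}} - N_{\mathrm{annotated}}}{N_{\mathrm{annotated}}} \times 100
```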
Table 4. Accuracy metrics for the best YOLOv5 and YOLOv8 models when trained using data from one site and tested in another site.
Training Site | Test Site | Model | Precision | Recall | AP0.5 | AP0.5–0.95
Lumo | THWS | YOLOv5x | 90.6 | 82.6 | 89.1 | 36.4
Lumo | THWS | YOLOv8l | 90.4 | 82.6 | 87.9 | 39.6
Lumo | Choke | YOLOv5l | 88.4 | 73.2 | 83.1 | 33.0
Lumo | Choke | YOLOv8x | 90.6 | 71.2 | 81.6 | 36.0
THWS | Choke | YOLOv5x | 85.5 | 75.1 | 81.8 | 32.1
THWS | Choke | YOLOv8l | 87.9 | 73.4 | 82.2 | 36.6
THWS | Lumo | YOLOv5x | 86.0 | 77.0 | 85.0 | 33.8
THWS | Lumo | YOLOv8m | 86.5 | 74.4 | 82.8 | 37.1
Choke | Lumo | YOLOv5x | 79.5 | 63.9 | 71.8 | 25.8
Choke | Lumo | YOLOv8l | 75.0 | 65.7 | 73.9 | 29.6
Choke | THWS | YOLOv5x | 85.6 | 75.6 | 81.9 | 30.7
Choke | THWS | YOLOv8l | 85.1 | 75.1 | 82.4 | 34.3
Table 5. Accuracy metrics for the best YOLOv5 and YOLOv8 models trained using data from two sites and tested on the third site.
Test Site | Model | Precision | Recall | F1-Score | AP0.5 | AP0.5–0.95
THWS | YOLOv5x | 91.0 | 83.6 | 88.0 | 89.1 | 40.4
THWS | YOLOv8l | 90.3 | 84.1 | 88.0 | 88.7 | 40.3
Choke | YOLOv5m | 90.1 | 76.5 | 82.0 | 84.7 | 37.3
Choke | YOLOv8x | 90.5 | 76.9 | 82.0 | 84.8 | 37.7
Lumo | YOLOv5l | 86.4 | 77.2 | 83.0 | 84.9 | 36.7
Lumo | YOLOv8m | 86.3 | 79.6 | 85.3 | 84.4 | 37.6
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
