1. Introduction
One of the most exciting and promising areas in artificial intelligence (AI) is computer vision, which gives machines the ability to see, interpret, and understand the world around them. This area is growing rapidly, with applications across a variety of industries, including robotics, self-driving cars, health analytics, livestock tracking and monitoring, and security [1].
However, much work remains to develop an intelligent classifier that can identify similar-looking livestock. Here, we correctly identify goats or sheep using an intelligent classifier based on computer vision. We reviewed recent research in this field to find a suitable algorithm for developing such a classifier. Chen et al. [2] analyzed the process of image segmentation, recognition, and analysis of cattle behaviors through computer vision and modern deep learning models.
Moreover, an assessment was conducted on the progression of notable investigations within this domain, including the development of resilient algorithms for identifying and detecting cattle behaviors across various stages of growth. They then quantified the cattle's behavioral recognition results and built a powerful monitoring system for their growth, health, and wellbeing [3]. Hossain et al. [1] conducted a systematic examination of diverse machine learning algorithms, such as Support Vector Machine (SVM), k-Nearest Neighbor (KNN), and Artificial Neural Network (ANN), within the context of cattle identification. Additionally, they investigated the suitability of Convolutional Neural Network (CNN), Residual Network (ResNet), Inception, You Only Look Once (YOLO), and Faster R-CNN models to enhance livestock-monitoring systems.
On the other hand, management is crucial for monitoring cattle growth to improve production and welfare, which can be particularly difficult with novel breeds. A meta-learning system that uses machine learning methods has proven successful in detecting irregularities in the weight gain of cattle during the fattening process, leading to continuous improvement in its effectiveness. This was illustrated in a study carried out at the "El Rosario" farm in Monteria, Colombia, where an R² value of 90.8% was achieved [4].
The utilization of a machine learning ensemble technique is crucial for incorporating spatial variance in the adaptability of livestock to drought conditions, thereby improving the statistical models devised to anticipate drought-linked yield reductions [5]. This could potentially enhance the predictive precision of deep learning models, enabling the integration of all the aforementioned deep learning models into an ensemble to improve accuracy in cattle identification. The results of this work have the potential to improve livestock comfort by offering a more efficient and accurate method for identifying and managing animals. Through the use of deep learning models such as YOLOv8, which accurately distinguishes between sheep and goat images based on distinct features, the system ensures minimal physical intervention, reducing stress among the animals. Additionally, this automated system can be adapted to account for various environmental factors and animal characteristics, such as different diets for different stages of life (juvenile, adult, gestation, and lactation), as well as breed-specific needs [6].
Separating sheep and goats can offer several advantages, depending on the management goals and specific conditions of the farm. Sheep and goats, while both small ruminants, have different nutritional, behavioral, and health needs. Goats, for example, are more selective browsers and tend to seek out better food, sometimes leading to competition with sheep if food is scarce. This competition could result in stronger animals dominating the resources, impacting the weaker ones. Additionally, the reproductive cycles and nutritional requirements of sheep and goats differ, especially during gestation and lactation, necessitating different management strategies for optimal production. Health-wise, although disease transmission between the two species is uncommon, separating them can help in tailoring disease prevention and treatment strategies based on species-specific susceptibilities [7].
The aim of this work is to automatically identify individual sheep or goats based on their physical characteristics, including muzzle pattern, coat pattern, or ear pattern. This can be achieved by taking pictures of goats or sheep with a camera and then using computer vision algorithms to isolate and analyze their characteristics [8].
Through this project, herd management can be carried out intelligently and without harming the livestock. This research is very useful for farmers who want to track and monitor the health and welfare of their livestock, and it can also be used to count sheep and goats in drone footage. The efficiency and precision of farming, as well as the quality of life of farmers, can be radically improved.
Literature Review
The advancement of the identification system was heavily dependent on the significance of cattle muzzles and fur patterns. It is worth mentioning that commonly utilized methods for feature extraction, including Local Binary Pattern (LBP), Speeded Up Robust Features (SURF), Scale-Invariant Feature Transform (SIFT), and Inception or Convolutional Neural Network (CNN), were recognized [9].
The Convolutional Neural Network (CNN) model was utilized to extract the characteristics of cattle from a rear perspective, and the Long Short-Term Memory (LSTM) model was utilized to capture their temporal details. The objective of this methodology was to develop a sophisticated identification system, as emphasized in the research [10]. Qiao et al. presented a one-shot learning-based approach with pseudo-labeling for cattle video segmentation in smart livestock farming. The method uses an Xception-based Fully Convolutional Network (Xception-FCN) to extract features and a pseudo-labeling module to improve segmentation accuracy by leveraging unlabeled data. The system was evaluated on a challenging feedlot cattle video dataset, achieving an 88.7% mean intersection-over-union score and 80.8% contour accuracy. These results demonstrate that the proposed approach outperforms state-of-the-art methods, providing a reliable solution for livestock video segmentation [11].
Ahmed et al. proposed a deep transfer learning-based animal face identification system using a hybrid approach that automates the identification of livestock animals. The system leverages YOLOv7 for detecting faces and muzzles and applies the SIFT algorithm to extract key features, which are then stored in a database for future matching. The method was tested on a dataset, using FLANN to match extracted features against the database, achieving an impressive 99.7% accuracy in muzzle detection and 100% accuracy in livestock identification. This demonstrates the effectiveness of the system for real-time livestock identification, with practical applications in modern agriculture [12].
Md Sultan Mahmud et al. conducted a detailed examination of the use of deep learning in livestock identification and health monitoring. They identified the Convolutional Neural Network (CNN) as the most versatile and comprehensive model, which integrates with Long Short-Term Memory (LSTM) networks, Mask region-based Convolutional Neural Network (Mask-RCNN), and Faster-RCNN [13].
This research also observed that ResNet is the most widely used pre-trained model for automated systems in smart animal husbandry. Aburasain, R.Y. et al. configured, trained, and applied SSD-500 and YOLOv3 for cattle group identification in images captured by drones, achieving the highest accuracy [14].
Santosh Kumar et al. also used a deep belief convolutional neural network for the identification of individual animals and groups of animals using features such as muzzle point image patterns [15]. Peiyuan Jiang et al. discussed the similarities, differences, and disadvantages of YOLO versions and convolutional neural network (CNN) algorithms for computer vision; the YOLO algorithm for livestock identification is still being improved [16].
Furthermore, with the utilization of YOLO (You Only Look Once) and subsequent advancements in its architecture, there has been a discernible improvement in the precision of object detection, occasionally even outperforming traditional two-stage object detection systems. Arunabha M. Roy et al. highlighted that YOLO achieves an accuracy of 63.4, while Fast-RCNN reaches 70 [17]. However, it is crucial to note that the inference speed of YOLO is approximately 300 times faster. Additionally, the authors conducted a thorough examination of single-stage object detectors, with a specific focus on YOLOs, including their regression formulation, advancements in architecture, and performance metrics [17]. The researchers also summarized the comparative analyses conducted between two-stage and one-stage object detection models, as well as between various versions of YOLO and their respective applications [18].
Chen, W. et al. proposed a real-time YOLO face detector, which maintains the high speed of the original YOLO method and is one of the best face detectors, offering a balance between accuracy and speed [2]. Shubham Shinde et al. showcased the efficacy and efficiency of YOLO as a rapid detection and localization technique on the Liris Human Activity dataset [19]. Xingyu Jiang et al. explained and developed several applications to demonstrate the importance of multimodal image matching; further innovations in multimodal imaging can be found in [20].
Both supervised and self-supervised learning have been used for image identification, image segmentation, and image classification. Generally, supervised learning is used when a labeled dataset is available; if not, self-supervised learning is chosen. Kriti Ohri and Mukesh Kumar explored self-supervised learning, in which the data itself provides strong signals of interest that enable learners to perceive the relationships within the data without external labels [21]. Syed Sahil Abbas Zaidi et al. also used CNNs for object detection and classification and compared their performance using various metrics [22].
Huang, Y. et al. identified astrocytes, which have complex morphological structures containing glial fibrillary acidic protein (GFAP), in the central nervous system (CNS) using YOLOv5, an advanced deep learning model for object detection and classification [23]. Lawal, O.M. successfully used modified YOLOv3 models, YOLODenseNet and YOLOMixNet, for tomato detection during robotic harvesting [24].
Kganyago et al. presented a summary of the latest developments in remote sensing technologies and machine learning models for estimating biochemical and biophysical parameters in precision agriculture. Future deep learning research should develop adaptive models to address the challenges facing agriculture [25].
The results of the literature review of the main references related to our research work are summarized in Table 1.
2. Methods
Figure 1 shows the development steps of an intelligent classifier for distinguishing images of goats and sheep. The performance of an intelligent classifier depends on several factors, such as the quality of the dataset, the complexity of the models, and the amount of training data available.
Here, we used various goat and sheep datasets available on the internet, including Kaggle datasets. In our dataset, we tried to include a variety of images of goats and sheep of all possible colors and breeds. There is a mix of images taken in the morning, at noon, and in the evening, as well as images taken under different lighting conditions. We cropped the goat and sheep faces from the full images, capturing the shape of the face, eyes, nose, ears, horns, and facial hair [26].
It is possible to achieve high accuracy in detecting goat and sheep images. In computer vision, image capture, image processing, feature extraction using a convolutional neural network, and classification using a dense network are used to accurately identify goat and sheep images. Initially, 4759 images of size 64 × 64 × 3 were collected from the World Wide Web.
Using Keras on the Google Colab platform, the architecture of the CNN model incorporates two convolutional layers, each succeeded by a MaxPooling layer. For this experiment, we split the image set into a training set (78%) and a test set (22%).
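A minimal sketch of this split in Keras is shown below; the paper does not give its data-loading code, so the directory layout, seed, and batch size are assumptions:

```python
import tensorflow as tf

# Hypothetical layout: dataset/goat/*.jpg and dataset/sheep/*.jpg.
# validation_split=0.22 reproduces the 78%/22% train/test proportions above.
train_ds = tf.keras.utils.image_dataset_from_directory(
    "dataset", validation_split=0.22, subset="training", seed=42,
    image_size=(200, 200), batch_size=32)  # resized to the model input (Section 3)
test_ds = tf.keras.utils.image_dataset_from_directory(
    "dataset", validation_split=0.22, subset="validation", seed=42,
    image_size=(200, 200), batch_size=32)
```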
The results of this model were not encouraging, so we turned to the well-known pre-trained VGG16 model and adapted it accordingly on the Google Colab platform. Its performance is better than that of the previously developed CNN model. Finally, we used the Roboflow framework to design and deploy the deep learning model of our classifier. We increased the total number of images to 35,204 using a data augmentation tool and divided them into a training dataset (88%), a validation dataset (8%), and a test dataset (4%).
This classifier performs better than the previous two models and can correctly and quickly classify real images of goats and sheep.
In order to evaluate the efficacy of the intelligent classifier, various performance measures, including precision, recall, and F1 score, were utilized. These metrics are defined by the following formulas:

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

where TP and TN represent the numbers of correctly classified positive and correctly rejected negative samples, while FP and FN represent the numbers of negative and positive samples that were misclassified, respectively [27].
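The per-class scores, along with the macro and weighted averages reported in the tables below, follow the layout of a standard classification report; a minimal sketch, assuming scikit-learn is used for evaluation (the paper does not name its evaluation tooling):

```python
# Minimal sketch: per-class precision/recall/F1 plus macro and weighted
# averages, in the format of the result tables below. The labels and
# predictions here are hypothetical.
from sklearn.metrics import classification_report

y_true = ["goat", "goat", "sheep", "sheep", "sheep"]
y_pred = ["goat", "sheep", "sheep", "sheep", "goat"]
print(classification_report(y_true, y_pred))
```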
After a thorough literature review of recent research, we found that the CNN classifier is one of the best classifiers for identifying goat and sheep images, because convolutional networks have the advantages of parameter sharing and sparsity of connections over fully connected networks. The architecture of the CNN classifier is depicted in Figure 2.
There are different CNN models, such as LeNet-5, AlexNet, VGG16, VGG19, and ResNet, that are already available for computer vision. The LeNet-5 model is trained on 32 × 32 × 1 grayscale images to recognize the digits 0 to 9 and has about 60,000 parameters. AlexNet is roughly 1000 times larger than LeNet-5, with 60 million parameters, and is trained on 227 × 227 × 3 color images. This model requires multiple GPUs for computation [28].
The VGG16 architecture has 138 million parameters across 16 layers, and its structure is quite uniform. ResNet is a very deep model that can exceed 100 layers, with skip connections that perform identity mapping and merge with the layer outputs through addition operations [29].
Which CNN model to use depends entirely on the nature of the problem and the size of the dataset [30]. Upon conducting an extensive review of existing literature, a convolutional neural network (CNN) model was formulated, consisting of two convolutional layers, each accompanied by a subsequent MaxPooling layer. This architecture was finalized with a flatten layer followed by two dense layers, the final one using the Sigmoid activation function (see Table 2). Despite efforts to optimize this model using the Keras tuner, the achieved performance fell short of expectations. Subsequently, various pre-trained models, including VGG16, MobileNet, Xception, ResNet50, ResNet101, ResNet152, and EfficientNet, all pre-trained on the ImageNet dataset, were explored. These pre-trained models were trained using a limited dataset of only 250 images sized 200 × 200 × 3 over 25 epochs, and their performance was assessed. The results indicated only a marginal improvement compared to our initially proposed model.
Recognizing the unique challenges posed by our dataset, particularly the absence of goat and sheep classes among the standard 1000 classes of the ImageNet dataset, we proceeded to fine-tune the top layers and hyperparameters of the VGG16, ResNet, MobileNet, Xception, Inception, and EfficientNet pre-trained models. Evaluation of the outcomes from each model revealed that EfficientNet outperformed the others. Detailed results and analysis of all models are elaborated upon in the subsequent section.
The TensorFlow framework with the Keras API on Google Colab was used to perform image processing tasks and build the intelligent classifiers. Additionally, we used Python 3.6 with the OpenCV library for the data augmentation program. All programs and analyses were carried out on an HP laptop (Hewlett-Packard, Greater Noida, India) with an Intel(R) Core(TM) i5-4210M CPU @ 2.60 GHz (2 cores), 16 GB RAM, and Windows 10. All models were trained, validated, and tested on Google Colab with a T4 GPU, except the YOLOv8 classifier, which was built, trained, and deployed on the Roboflow computer vision platform.
3. Results
Following an extensive examination of the available literature, a Convolutional Neural Network (CNN) model was developed, comprising two convolutional layers, each succeeded by a MaxPooling layer. The input image dimensions were configured at 200 × 200 × 3, with a 3 × 3 filter size, utilizing the Rectified Linear Unit (ReLU) as the activation function. The first convolutional layer contains 32 filters, and the MaxPooling layer positioned immediately after it employs a 2 × 2 filter size. The second convolutional layer is furnished with 64 filters, also with a 3 × 3 filter size and ReLU activation, and the MaxPooling layer connected to it likewise features a 2 × 2 filter size. Furthermore, the model structure incorporates a flatten layer succeeded by two dense layers: the first dense layer encompasses 384 neurons with ReLU activation, whereas the second comprises a single neuron with the Sigmoid activation function. A detailed illustration of this model's architecture is provided in Table 2.
The model is configured with a learning rate of 0.0001 and uses the Adam optimizer for training. This architecture effectively extracts features from input data through the convolutional and pooling layers and then performs classification based on these features using the dense layers.
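A minimal Keras sketch consistent with this description follows; the layer sizes, activations, optimizer, and learning rate are as stated above, while the loss function is an assumption for the single-neuron Sigmoid output:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Two Conv2D + MaxPooling2D blocks, a Flatten layer, and two Dense layers,
# mirroring the architecture summarized in Table 2.
model = keras.Sequential([
    layers.Input(shape=(200, 200, 3)),
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(384, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # binary goat-vs-sheep output
])
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-4),
    loss="binary_crossentropy",  # assumed loss for the sigmoid output
    metrics=["accuracy"],
)
```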
Figure 3 shows the behavior of the CNN model during training and validation, with Figure 3a depicting the accuracy metrics and Figure 3b the losses incurred during these steps.
Table 3 presents the performance metrics of a CNN model for classifying “Goat” and “Sheep,” with precision, recall, and F1-score represented as percentages. For the “Goat” category, the model achieved a precision of 67%, a recall of 94%, and an F1-score of 78%, based on 17 samples. For the “Sheep” category, it obtained a precision of 91%, a recall of 56%, and an F1-score of 69%, with 18 samples. The overall accuracy of the model is 74% across 35 total samples. The macro average of the metrics is 79% precision, 75% recall, and 74% F1-score, while the weighted averages are 79% precision, 74% recall, and 73% F1-score, accounting for the different sample sizes. The CNN model shows better precision with sheep but higher recall for goats.
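As a quick arithmetic check, the macro average is the unweighted mean of the per-class scores, whereas the weighted average weights each class by its number of samples; for the precision values above:

$$\text{Macro precision} = \frac{67\% + 91\%}{2} = 79\%, \qquad \text{Weighted precision} = \frac{17 \times 67\% + 18 \times 91\%}{35} \approx 79\%$$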
Table 4 outlines the architecture of the tuned pre-trained VGG16 model, detailing the layers, output shapes, and the number of parameters for each layer. The model starts with an input layer (input_1) that processes images of shape (200, 200, 3) with 0 parameters. This is followed by two convolutional layers (block1_conv1 and block1_conv2), both with 64 filters, leading to an output shape of (200, 200, 64), with 1792 and 36,928 parameters, respectively. A max-pooling layer (block1_pool) reduces the output size to (100, 100, 64) with no parameters.
Next, two convolutional layers (block2_conv1 and block2_conv2) increase the filters to 128, producing an output of (100, 100, 128) and requiring 73,856 and 147,584 parameters, respectively, followed by another max-pooling layer that reduces the output to (50, 50, 128). Three convolutional layers in block3 increase the filters to 256, producing an output of (50, 50, 256), with the layers requiring 295,168, 590,080, and 590,080 parameters, followed by max-pooling that reduces the output to (25, 25, 256).
In block4, three convolutional layers increase the filters to 512, producing an output of (25, 25, 512), with 1,180,160, 2,359,808, and 2,359,808 parameters, respectively, followed by max-pooling reducing the output to (12, 12, 512). The same configuration applies to block5, resulting in an output of (12, 12, 512) and 2,359,808 parameters for each convolutional layer. A final max-pooling reduces the output to (6, 6, 512), and the VGG16 block contains 14,714,688 total parameters. After flattening the output to (None, 18,432), a dense layer with 384 units is added with 7,078,272 parameters, followed by a dropout layer. Finally, a dense layer outputs one unit with 385 parameters.
The model has a total of 14,714,688 parameters (56.13 MB), of which 7,079,424 are trainable (27.01 MB), and 7,635,264 are non-trainable (29.13 MB). The learning rate is set to 0.0001, and the optimizer used is Adam.
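A hedged Keras sketch consistent with Table 4 is given below; the split between trainable and non-trainable parameters reported above suggests most of the convolutional base was frozen, but the exact freezing policy and the dropout rate are assumptions:

```python
from tensorflow import keras
from tensorflow.keras import layers

# VGG16 convolutional base pre-trained on ImageNet, without its original
# classifier head; 200x200x3 inputs yield a (6, 6, 512) final feature map.
base = keras.applications.VGG16(
    include_top=False, weights="imagenet", input_shape=(200, 200, 3))
base.trainable = False  # assumption: convolutional base frozen

model = keras.Sequential([
    base,
    layers.Flatten(),                       # (6, 6, 512) -> 18,432 features
    layers.Dense(384, activation="relu"),   # 18,432*384 + 384 = 7,078,272 params
    layers.Dropout(0.5),                    # dropout rate is an assumption
    layers.Dense(1, activation="sigmoid"),  # 385 params
])
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-4),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)
```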
The results of the tuned VGG16 pre-trained model are presented in Figure 4 and Table 5. Figure 4a shows the accuracy metrics for the VGG16 model during training and validation, whereas Figure 4b shows the corresponding loss metrics.
Table 5 presents the performance of the VGG16 model for classifying "Goat" and "Sheep". For the "Goat" category, the model achieved a precision of 76%, a recall of 94%, and an F1-score of 84%, based on 17 samples. For the "Sheep" category, the precision was 93%, the recall was 72%, and the F1-score was 81%, with 18 samples. The overall accuracy of the model is 83%, based on 35 total samples. The macro averages for precision, recall, and F1-score are 85%, 83%, and 83%, respectively, and the weighted averages for these metrics are also 85%, 83%, and 83%, accounting for the number of samples in each category.
Next, the results of the pre-trained EfficientNet model are depicted in Figure 5 and Table 6. Figure 5a shows the accuracy metrics for the EfficientNet model during training and validation, and Figure 5b shows the corresponding loss metrics. In Table 6, we present the performance of the EfficientNet model for classifying "Goat" and "Sheep". For the "Goat" category, the model achieved a precision of 89%, a recall of 94%, and an F1-score of 91%, based on 17 samples. For the "Sheep" category, the model reached a precision of 94%, a recall of 89%, and an F1-score of 91%, with 18 samples.
The overall accuracy of the model is 91%, based on 35 total samples. The macro average for precision, recall, and F1-score is 92%, 92%, and 91%, respectively. Similarly, the weighted averages for precision, recall, and F1-score are 92%, 91%, and 91%, accounting for the number of samples in each category.
Table 7 provides performance metrics for a ResNet50 model, which is used to classify “Goat” and “Sheep.” For the “Goat” category, the model achieved a precision of 84%, a recall of 94%, and an F1-score of 89%, based on 17 samples. For the “Sheep” category, the precision was 94%, recall was 83%, and the F1-score was 88%, with 18 samples. The overall accuracy of the model is 89%, based on a total of 35 samples. The macro average of precision, recall, and F1-score is 89% across both categories. Similarly, the weighted average for precision, recall, and F1-score is also 89%, accounting for the distribution of samples in each category.
The next model is ResNet101, with training and validation accuracy and losses shown in Figure 7.
Table 8 presents the performance metrics for the ResNet101 model used to classify "Goat" and "Sheep". For the "Goat" category, the model achieved a precision of 81%, a recall of 100%, and an F1-score of 89%, based on 17 samples. For the "Sheep" category, the model reached a precision of 100%, a recall of 78%, and an F1-score of 88%, based on 18 samples. The overall accuracy of the model is 89%, calculated over 35 total samples. The macro averages of precision, recall, and F1-score are 90%, 89%, and 88%, respectively, across both categories. Similarly, the weighted averages of these metrics, taking into account the number of samples per category, are 91% precision, 89% recall, and 88% F1-score. The model shows perfect precision for sheep and perfect recall for goats.
Additionally, the performance of the ResNet152 model presented in Table 9 shows 93% precision for "Sheep" and 94% recall for "Goat" images, indicating few false positives.
Table 10 shows the training and classification time per step of the models used in this research. The bold values show the lowest training and classification time.
Roboflow YOLO
Finally, the computer vision pipeline provided by Roboflow with YOLO (You Only Look Once) was tested. This pipeline consists of collecting images, organizing the image datasets, annotating images, augmenting the images to increase the dataset size, training the model, managing the model, and deploying it. The classifier built with this framework has a validation accuracy of 95.8%. Snapshots of the goat and sheep images are shown in Figure 8 and Figure 9, respectively.
The YOLO algorithm, commonly used for object detection, is a real-time system that identifies objects in images or video frames. Here it is used through Roboflow, an end-to-end platform for building, training, and deploying computer vision models.
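For reference, a comparable YOLOv8 classification model can also be trained locally with the ultralytics Python package; this is a sketch only, since our classifier was built, trained, and deployed on Roboflow, and the weight file, dataset path, and settings below are illustrative:

```python
from ultralytics import YOLO

# YOLOv8 nano classification weights; the dataset path points to a folder
# with one subdirectory per class (goat/, sheep/) -- both are assumptions.
model = YOLO("yolov8n-cls.pt")
model.train(data="goat_sheep_dataset", epochs=25, imgsz=224)
results = model("example_goat.jpg")  # run inference on a single image
```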
The total number of images for this classifier was initially 3699. After augmenting the images through horizontal and vertical flips, clockwise and counterclockwise rotation, upside-down flips, cropping, rotation between certain angles, horizontal and vertical shearing, adjustments to hue, saturation, brightness, and exposure (each within a specific range), blur (up to 2.5 px), noise (up to a percentage of pixel values), and cutout (with 3 boxes of a specified size), the dataset size increased to a total of 35,204 images. Of these, the training set consists of 30,816 images, the validation set contains 2924 images, and the test set includes 1464 images, as shown in Figure 10.
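The Methods section notes that a Python/OpenCV program handled part of the augmentation; a minimal sketch of the flip and rotation transforms listed above, with hypothetical file names:

```python
import cv2

img = cv2.imread("goat_0001.jpg")  # hypothetical input image

# Horizontal, vertical, and upside-down (180-degree) flips.
h_flip = cv2.flip(img, 1)
v_flip = cv2.flip(img, 0)
ud_flip = cv2.flip(img, -1)

# Rotation about the image center; the 15-degree angle is illustrative.
h, w = img.shape[:2]
M = cv2.getRotationMatrix2D((w / 2, h / 2), 15, 1.0)
rotated = cv2.warpAffine(img, M, (w, h))

for name, out in [("hflip", h_flip), ("vflip", v_flip),
                  ("udflip", ud_flip), ("rot15", rotated)]:
    cv2.imwrite(f"goat_0001_{name}.jpg", out)
```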
Figure 11 shows a comparison of the deep learning models discussed in this research work.