Automated Muzzle Detection and Biometric Identification via Few-Shot Deep Transfer Learning of Mixed Breed Cattle

Abstract: Livestock welfare and management could be greatly enhanced by the replacement of branding or ear tagging with less invasive visual biometric identification methods. Biometric identification of cattle from muzzle patterns has previously indicated promising results. Significant barriers exist in the translation of these initial findings into a practical precision livestock monitoring system, which can be deployed at scale for large herds. The objective of this study was to investigate and address key limitations to the autonomous biometric identification of cattle. The contributions of this work are fourfold: (1) provision of a large publicly-available dataset of cattle face images (300 individual cattle) to facilitate further research in this field, (2) development of a two-stage YOLOv3-ResNet50 algorithm that first detects and extracts the cattle muzzle region in images and then applies deep transfer learning for biometric identification, (3) evaluation of model performance across a range of cattle breeds, and (4) utilizing few-shot learning (five images per individual) to greatly reduce both the data collection requirements and duration of model training. Results indicated excellent model performance. Muzzle detection accuracy was 99.13% (1024 × 1024 image resolution) and biometric identification achieved 99.11% testing accuracy. Overall, the two-stage YOLOv3-ResNet50 algorithm proposed has substantial potential to form the foundation of a highly accurate automated cattle biometric identification system, which is applicable in livestock farming systems. The obtained results indicate that utilizing livestock biometric monitoring in an advanced manner for resource management at multiple scales of production is possible for future agriculture decision support systems, including providing useful information to forecast acceptable stocking rates of pastures.


Introduction
In grazing systems, livestock are primary consumers of biomass, whilst in intensive production systems such as feedlots, livestock require vast amounts of grains and roughage. At the same time, to meet increasing global demand for livestock products, there is considerable pressure to raise greater numbers of livestock in shorter periods of time. Consumer preference and legislative requirements also demand adherence to high standards of animal welfare. Keeping track of, and effectively monitoring, the welfare of large numbers of individual animals is an increasingly difficult task, but it provides useful information to inform grain feed demand forecasts and acceptable stocking rates of pastures. Traditional approaches to herd management are labor intensive and invasive, with the potential to cause pain and morbidity to stock. A wide range of approaches, including ear tagging, ear tattooing, hot ironing, freeze branding, ID collars, microchipping or small scales (<66 individuals per study). Furthermore, current algorithms require both a pre-cropped muzzle print image and a prohibitive number of images per individual for model training. The algorithm of [38], for instance, utilizes at least 10 images per individual. Obtaining such large numbers of images per individual is both inconvenient and impractical in most livestock production settings. In this paper, we introduce a novel deep learning model that combines a muzzle detector with a biometric classifier and seeks to address these limitations. This model is evaluated on a herd (300 individuals) of mixed breed cattle using a strict limit of five images per individual for model training and exhibits excellent (99.11%) accuracy.

Data Collection
Biometric images consisting of the cattle muzzle and face were collected at the University of New England's Tullimba Research Feedlot, Kingstown, NSW, Australia. The images were collected during Induction Day in February 2019, when animals were vaccinated and tagged. In total, 300 cattle were inducted, which involved each animal being restrained in a crush restraint with its head placed in a "head scoop" for application of an eye treatment and micro-chipping. Once an animal was contained in the crush, a photographer standing approximately 1-2 m in front photographed it in frontal pose with a camera set at 1 m above the ground. The camera was a Canon D800 (Tokyo, Japan) equipped with an 18-55 mm lens (Canon EF-S 18-55 mm f/4-5.6 IS STM) [47]. Images of the cattle's face were taken while the focus of the camera was on the muzzle. The image resolution was 4000 (width) × 6000 (height) pixels in RGB mode and JPEG format without compression and auto lighting balance. Because of the rapid nature of the induction process and the need to capture multiple images of each individual animal, the camera was set to "burst" mode to capture 10 images in rapid succession (6 shots per second). Data collection was conducted between 8:00 and 16:00 on a sunny day under natural lighting conditions. The captured images were stored on the camera's local SD card.

Dataset
A total of 2900 images were captured from 300 animals consisting of Bos taurus beef cattle of mixed breeds including Angus, Hereford, Charolais and Simmental. A proportion (268 images or 9.24%) of the images did not capture the muzzle or were extremely blurry due to significant movement of the animals' heads (Figure 1). We have released this dataset as supplementary material and, to the best of our knowledge, it is currently the largest and most comprehensive publicly available livestock biometrics dataset.

YOLOv3 Muzzle Detector
A key step in the biometric identification processing chain is the detection and extraction of the cattle muzzle. Head movements and slight differences in alignment of the photo resulted in the muzzle location differing between images. The YOLOv3 object detector [48] was utilized for muzzle detection, with transfer learning implemented to customize the YOLOv3 network weights for this task. In the YOLOv3 framework, the darknet-53 [49] convolutional neural network, which has 53 layers, is utilized to predict both object categories and bounding boxes. Convolutional layers with stride 2 (down-sampling by 2), used in place of feature pooling, ensure that the output of each such layer is smaller than its input, thereby reducing computational complexity and processing times.
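To make the effect of stride-2 down-sampling concrete, the following minimal sketch computes the feature-map size after each of five successive stride-2 convolutions. It assumes 3 × 3 kernels with padding 1 and uses the 608 × 832 network resolution adopted for training in this study. The function names are illustrative and are not taken from the darknet code base.

```python
def conv_output_size(size: int, kernel: int = 3, stride: int = 2, padding: int = 1) -> int:
    """Output length of a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def downsampled_sizes(width: int, height: int, num_stride2_layers: int = 5):
    """Return the (width, height) of the feature map after each stride-2 layer."""
    sizes = []
    for _ in range(num_stride2_layers):
        width, height = conv_output_size(width), conv_output_size(height)
        sizes.append((width, height))
    return sizes


if __name__ == "__main__":
    # Assumed input: the 608 x 832 training resolution reported in the text.
    for i, (w, h) in enumerate(downsampled_sizes(608, 832), start=1):
        print(f"after stride-2 layer {i}: {w} x {h}")
```

Under these assumptions the five halvings reduce each dimension by an overall factor of 32, which is why the network resolution must be divisible by 32.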

In the context of muzzle detection, the YOLOv3 framework both detects the presence of a muzzle and provides the rectangular image coordinates of its location within the image. The framework achieves this by dividing each image into an S × S grid, where S is the number of grid cells along each side, and using each grid cell to predict the object centered in that cell. In each grid cell, B bounding boxes are predicted and a confidence score is calculated for each. The confidence score indicates how likely it is that an object exists in that bounding box. A cell in which no object is detected returns a confidence score of zero. The confidence score is defined by the formula:
$$\text{Confidence} = \Pr(\text{Object}) \times \text{IOU}^{\text{truth}}_{\text{pred}}$$
When an object is present, the confidence score is therefore expected to equal the intersection over union (IOU) between the predicted box and the ground truth box [48]. The bounding boxes, confidence scores and class probabilities are encoded as an S × S × (B × 5 + C) tensor where, as previously mentioned, S represents the size of the grid and B indicates the number of bounding boxes predicted for each grid cell. There is a set of five predicted values for each bounding box, namely x, y, w, h, and the confidence score. The center of the bounding box is represented by the pair of (x, y) coordinates. The w and h parameters represent the width and height of the predicted bounding box, respectively. The parameter C represents the number of conditional class probabilities, which are conditioned on the grid cell containing an object [50].
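The quantities described above can be illustrated with a short sketch. It assumes boxes are given as (x, y, w, h) with (x, y) the box center, and follows the confidence definition of [48] (object probability multiplied by IOU). The function names and the example values of S, B and C are illustrative only and are not the settings used in this study.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_center, y_center, width, height)


def to_corners(box: Box) -> Tuple[float, float, float, float]:
    """Convert a center-format box to (x_min, y_min, x_max, y_max)."""
    x, y, w, h = box
    return x - w / 2, y - h / 2, x + w / 2, y + h / 2


def iou(box_a: Box, box_b: Box) -> float:
    """Intersection over union of two center-format boxes."""
    ax1, ay1, ax2, ay2 = to_corners(box_a)
    bx1, by1, bx2, by2 = to_corners(box_b)
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0


def confidence(p_object: float, predicted: Box, truth: Box) -> float:
    """Confidence = Pr(Object) * IOU(predicted, truth)."""
    return p_object * iou(predicted, truth)


def output_shape(s: int, b: int, c: int) -> Tuple[int, int, int]:
    """Shape of the S x S x (B*5 + C) prediction tensor."""
    return (s, s, b * 5 + c)


if __name__ == "__main__":
    predicted = (1.0, 1.0, 2.0, 2.0)
    truth = (2.0, 1.0, 2.0, 2.0)
    print(iou(predicted, truth))
    # Illustrative grid: 7 x 7 cells, 2 boxes per cell, 1 class ("muzzle").
    print(output_shape(7, 2, 1))
```

With a single object class, as in muzzle detection, C is 1, so each grid cell carries B × 5 + 1 values.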
YOLOv3's detection requires only one pass through the network and, as a result, is comparatively fast. It also achieves an excellent balance between computational processing times and detector accuracy [50]. Reduced versions of the YOLOv3 network architecture also exist, thereby facilitating future incorporation of the muzzle detector on mobile computing platforms, such as smartphones. Figure 2 illustrates the muzzle detection and extraction procedure based on YOLOv3. Transfer learning was utilized via ImageNet pre-trained darknet-53 weights [51]. To train the YOLOv3 model for muzzle detection, 80% (2320) of the images were randomly selected for training and validation and the remaining 20% (580) were used for model testing. The large size of the original images (4000 × 6000 × 3 pixels) proved computationally prohibitive for model training. Therefore, images were re-scaled (608 × 832 × 3 pixels) using the bicubic interpolation of the Pillow library [52]. The visual object tagging tool (VoTT) [53] was used to annotate the muzzle region of each image as required by the YOLOv3 model training process. Model training was performed using a Lambda Quad RTX 6000 computer (hardware specified in Table 1) equipped with the CUDA toolkit (NVIDIA, Santa Clara, CA, USA, 2007) to perform rapid computations on the graphics processing unit (GPU).
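The random 80/20 partition described above can be sketched as follows. This is a minimal illustration rather than the procedure actually used: the function name, the fixed seed and the use of image indices in place of file paths are all assumptions.

```python
import random
from typing import List, Tuple


def split_indices(
    num_images: int, train_fraction: float = 0.8, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Shuffle image indices and split them into train/validation and test subsets."""
    indices = list(range(num_images))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    cut = round(num_images * train_fraction)
    return indices[:cut], indices[cut:]


if __name__ == "__main__":
    train_val, test = split_indices(2900)
    print(len(train_val), len(test))
```

Applied to the 2900 captured images, this yields subsets of 2320 and 580 images, matching the counts reported in the text.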

Muzzle Detector Model Hyper-Parameter Configuration
The YOLOv3 framework with the darknet-53 CNN requires specification of model hyper-parameters prior to training. There are several hyper-parameters including: batch size, learning rate, channel size and network resolution. These hyper-parameters were set to batch size = 64, channel size = 3, learning rate = 0.001 and network resolution = 608 × 832. The full set of YOLOv3 muzzle detection hyper-parameters is provided as a configuration file in the supplementary material.
The batch size refers to the number of image samples passed to the network at each step of model training. The batch size is adjusted according to hardware performance (CPU and GPU) and memory availability. Generally, large batch sizes are preferred because they allow the network to access more samples and features in each iteration. For the YOLOv3 muzzle detector, the batch size was selected by on-line monitoring of hardware performance and network convergence over several trial training runs across a set of batch sizes (4, 8, 16, 32, 64 and 128). The channel size refers to the number of color channels in the image. The muzzle images are RGB (3 channels), so the channel size was set to 3 to utilize all available color information. As stated in Section 2.2.2, the input images were down-sampled to 608 × 832 × 3 pixels because the original resolution (4000 × 6000 × 3 pixels) exhausted the available GPU memory during model training. Progressive down-sampling (to dimensions that are multiples of 32, as required by the darknet-53 network) was implemented while model training was monitored. A network resolution of 608 × 832 × 3 pixels was found to permit model training within available computational resources whilst retaining sufficient resolution for the muzzle detection task.
Muzzle detector model training was then performed using this set of hyper-parameters and model configurations. The number of training iterations was set to a maximum of 10,000. On-line monitoring of average loss and mean average precision (mAP) per iteration indicated that network convergence had been achieved by 6000 iterations, with the lowest training error rate and highest validation set accuracy. This lower number of iterations was preferred to help safeguard against model over-fitting. Therefore, the YOLOv3 darknet-53 weights at 6000 iterations were retrieved from saved checkpoints and utilized as the optimal muzzle detector model.
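As a rough illustration of how the reported hyper-parameters might be collected and checked, the sketch below validates the divisibility-by-32 constraint and renders the values as a Darknet-style `[net]` section. The key names follow common Darknet configuration conventions, and reading 608 as the width and 832 as the height is an assumption. The authoritative settings are those in the configuration file supplied with the supplementary material.

```python
# Values reported in the text; the key names are assumed Darknet-style names.
DETECTOR_HYPERPARAMS = {
    "batch": 64,
    "channels": 3,
    "width": 608,    # assumed orientation: 608 wide
    "height": 832,   # assumed orientation: 832 high
    "learning_rate": 0.001,
    "max_batches": 10000,
}


def validate_resolution(width: int, height: int, multiple: int = 32) -> None:
    """Raise ValueError unless both dimensions are divisible by `multiple`."""
    for name, value in (("width", width), ("height", height)):
        if value % multiple != 0:
            raise ValueError(f"{name}={value} is not a multiple of {multiple}")


def render_net_section(params: dict) -> str:
    """Render the hyper-parameters as the text of a [net] section."""
    validate_resolution(params["width"], params["height"])
    lines = ["[net]"] + [f"{key}={value}" for key, value in params.items()]
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_net_section(DETECTOR_HYPERPARAMS))
```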

Data Pre-Processing
To identify individual cattle, unique features should be selected, depending on the available database, so that identification can be performed efficiently. Based on this requirement, the muzzle was selected. In most methods proposed so far, the muzzle was extracted manually [35,54,55], which is both time-consuming and unsuitable for practical applications. Therefore, an automatic model for detecting and extracting the muzzle, utilizing YOLOv3 darknet-53, was deployed in this study. Extraction was performed automatically in software using the bounding box coordinates output by the muzzle detector. The muzzle detector was found to be highly effective in both the detection and extraction of the cattle muzzle region. After muzzle detection and extraction, each individual animal had at least seven images (e.g., Figure 1c) suitable for further analysis (those discarded by the muzzle detector were either too blurry or partially out-of-frame). A training data set for the biometric identification model was produced by randomly sampling five images per individual, while almost half of the remaining images were used for validation and the others were used for testing. In total, the biometric model evaluation dataset included 300 cattle, with 1500 images for training, 569 images for validation and 563 images for testing. Data augmentation, in the form of 15-degree rotation, was utilized to increase the size of the training set and enhance model robustness and performance.
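The few-shot sampling scheme described above can be sketched as follows. This is an illustration under stated assumptions: the function name, the seed, the input layout (a mapping from animal identifier to image paths) and the rounding rule used for "almost half" are hypothetical choices, not details taken from the study.

```python
import random
from typing import Dict, List, Tuple


def few_shot_split(
    images_by_animal: Dict[str, List[str]], shots: int = 5, seed: int = 0
) -> Dict[str, List[Tuple[str, str]]]:
    """Sample `shots` training images per animal; divide the rest into val/test."""
    rng = random.Random(seed)
    split: Dict[str, List[Tuple[str, str]]] = {"train": [], "val": [], "test": []}
    for animal_id in sorted(images_by_animal):
        images = list(images_by_animal[animal_id])
        if len(images) <= shots:
            raise ValueError(f"{animal_id} needs more than {shots} images")
        rng.shuffle(images)
        rest = images[shots:]
        n_val = (len(rest) + 1) // 2  # assumed rounding rule for "almost half"
        split["train"] += [(animal_id, path) for path in images[:shots]]
        split["val"] += [(animal_id, path) for path in rest[:n_val]]
        split["test"] += [(animal_id, path) for path in rest[n_val:]]
    return split


if __name__ == "__main__":
    # Hypothetical herd: 300 animals with 8 usable muzzle images each.
    herd = {f"cow{i:03d}": [f"cow{i:03d}_{j}.jpg" for j in range(8)] for i in range(300)}
    parts = few_shot_split(herd)
    print({name: len(items) for name, items in parts.items()})
```

Because exactly five images per animal go to training, a herd of 300 always yields 1500 training images, regardless of how many further images each animal contributes to validation and testing.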

Biometric Identification Using ResNet-50 CNN
Overview: The cattle biometric identification process is displayed in Figure 3. This process follows the structure of the ACE-V (analysis, comparison, evaluation and verification) procedure [21]. A muzzle dataset is formed using muzzle images from known individual animals. Automatic feature learning and extraction is applied using a ResNet-50 CNN. The ResNet-50 CNN was pre-initialized with ImageNet weights and then modified for biometric muzzle print recognition by implementing transfer learning through CNN retraining and modification of the final network layer (modified from 1000 object classes to the 300 individual animal identifier classes). Then, using the SoftMax classifier, the parameters were updated based on the loss to learn the individual cattle, so that the classifier assigns each class a specific predicted output. Muzzle print identification of an unknown individual was then performed by first extracting the muzzle pattern (using the YOLOv3 darknet-53 muzzle detector) and then extracting features with the best-performing model (ResNet-50 CNN). The SoftMax classifier was then used to predict the class with the highest probability. In this research, the "unknown" individual muzzle patterns are in fact sourced from the model test data set, and it is therefore possible to report model accuracies.
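The final prediction step, in which the class with the highest SoftMax probability is selected, can be sketched in a few lines. The logits and animal identifiers below are hypothetical; in the pipeline described above they would be the outputs of the final dense layer and the 300 identity labels.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a vector of class scores."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_identity(logits: Sequence[float], animal_ids: Sequence[str]) -> Tuple[str, float]:
    """Return the identity with the highest softmax probability and that probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return animal_ids[best], probs[best]


if __name__ == "__main__":
    # Hypothetical scores for three animals.
    print(predict_identity([0.1, 2.0, -1.0], ["cow_a", "cow_b", "cow_c"]))
```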

Biometric Model Training
The ResNet-50 CNN was used to automatically extract features from each muzzle image [56]. ResNet-50 was selected because (i) it is a top performer on object classification tasks [57][58][59], (ii) its depth still permits mobile computing [60] and (iii) its depth does not require large numbers of samples per class to prevent overfitting [61]. A ResNet-50 pre-trained on ImageNet data (1.28 million training images belonging to 1000 object classes) was used, via Keras with the TensorFlow backend [56], to greatly reduce the sample size (muzzle image) requirements. A strategy of transfer learning with fine-tuning was implemented to modify the ResNet-50 ImageNet CNN into a ResNet-50 muzzle pattern identifier. Transfer learning with fine-tuning involves removing the final pooling and fully connected layers of the original ResNet-50 ImageNet model. Next, an average pooling layer and a flatten layer, followed by a dense layer, are added to the end of the network. Figure 4 shows the network after the three new layers have been added; this network architecture formed the basis of the ResNet-50 muzzle pattern identifier.

In Figure 4, each box represents a layer of the model's architecture, which consists of "Frozen layers", "Unfrozen layers" and "Added layers". The frozen convolutional layers were the initial layers of the network and encode more general image features (lines, squares, circles) that were learned during ImageNet pre-training. The unfrozen layers (47-49) were the trainable parts of the network specific to muzzle pattern features, whilst the added layers consist of the classifier components trained after flattening the output from the unfrozen convolutional layers. The number of outputs in the final fully connected layer was set at 300 classes, corresponding to the number of individual animals in the study.
In total, four different model training strategies were evaluated. These were: (a) training from scratch using the ResNet-50 architecture without pre-initialization, (b) transfer learning using ResNet-50 pre-initialized with ImageNet weights with all layers frozen but a SoftMax classifier added to the final layer to train the model to identify the individual animals, (c) fine-tuning the last convolutional layer, whereby transfer learning was used as previously but the last convolutional layer was unfrozen so that its weights could be modified and (d) fine-tuning the last 3 convolutional layers, similar to transfer learning but with the last 3 convolutional layers unfrozen so that their weights could be modified.
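The four strategies differ only in which layers are allowed to update their weights (and, for strategy (a), in whether pre-trained weights are loaded at all). The framework-agnostic sketch below records that choice as a per-layer trainable flag on a toy stand-in for the network. The layer names, the count of 49 convolutional layers and the strategy keys are assumptions made for illustration; this is not the Keras model used in the study.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Layer:
    name: str
    kind: str  # "conv" = backbone layer, "head" = newly added classifier layer
    trainable: bool = True


# Number of trailing convolutional layers left trainable under each strategy.
# None means every layer is trained (strategy (a), no pre-trained weights loaded).
UNFROZEN_CONV_LAYERS = {
    "scratch": None,         # (a)
    "transfer": 0,           # (b)
    "finetune_last_1": 1,    # (c)
    "finetune_last_3": 3,    # (d)
}


def build_toy_network(num_conv: int = 49) -> List[Layer]:
    """Toy stand-in: numbered convolutional layers followed by the added layers."""
    convs = [Layer(f"conv_{i}", "conv") for i in range(1, num_conv + 1)]
    head = [Layer(n, "head") for n in ("avg_pool", "flatten", "dense_softmax_300")]
    return convs + head


def apply_strategy(layers: List[Layer], strategy: str) -> List[Layer]:
    """Return a copy of `layers` with trainable flags set for the given strategy."""
    unfrozen: Optional[int] = UNFROZEN_CONV_LAYERS[strategy]
    conv_names = [layer.name for layer in layers if layer.kind == "conv"]
    if unfrozen is None:
        trainable = set(conv_names)
    else:
        trainable = set(conv_names[len(conv_names) - unfrozen:])
    return [
        Layer(layer.name, layer.kind, layer.kind == "head" or layer.name in trainable)
        for layer in layers
    ]


def trainable_names(layers: List[Layer]) -> List[str]:
    return [layer.name for layer in layers if layer.trainable]


if __name__ == "__main__":
    network = build_toy_network()
    for strategy in UNFROZEN_CONV_LAYERS:
        print(strategy, len(trainable_names(apply_strategy(network, strategy))))
```

In every strategy the added classifier layers remain trainable, since they have no pre-trained weights to preserve.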
The ResNet-50 muzzle pattern identifier was trained with 1500 muzzle pattern images and their corresponding individual identity labels using fine-tuning transfer learning from a ResNet-50 ImageNet classifier. All model training was performed on a Lambda Quad RTX 6000 (Table 1). The data generator package and Keras library [62] were used in conjunction with the GPU version of TensorFlow (1.13.1) [63]. All muzzle images were resized to 224 × 224 × 3 pixels in order to meet the requirements of the ResNet-50 architecture. Note that the ResNet-50 architecture still permitted the use of RGB-format color images. The ResNet-50 model hyper-parameters were set to a learning rate of 0.0001 (0.01 when training from scratch) and a batch size of 10, resulting in 150 iterations per epoch. These hyper-parameters were initialized to well-established values used by practitioners and then carefully tuned by monitoring model performance and convergence. After hyper-parameter tuning, the full ResNet-50 transfer learning process was implemented. The ADAM optimizer [64] was utilized in conjunction with the cross-entropy loss function. Model training was continued for 100 epochs (totaling 15,000 iterations) and performance was evaluated on the held-back validation set of 569 muzzle images. The ResNet-50 muzzle pattern identifier was then assessed for accuracy on the test data set consisting of 563 muzzle images from known individual animals.
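The iteration counts quoted above follow directly from the training set size, batch size and number of epochs, and the cross-entropy loss has a simple closed form for a single sample. The sketch below works through both. The function names are illustrative, and the uniform-probability example is a hypothetical baseline rather than a result from the study.

```python
import math
from typing import Sequence


def iterations_per_epoch(num_images: int, batch_size: int) -> int:
    """Number of batches needed to pass over the training set once."""
    return math.ceil(num_images / batch_size)


def cross_entropy(probabilities: Sequence[float], true_index: int) -> float:
    """Cross-entropy loss for one sample given predicted class probabilities."""
    return -math.log(probabilities[true_index])


if __name__ == "__main__":
    per_epoch = iterations_per_epoch(1500, 10)
    print(per_epoch, per_epoch * 100)
    # Hypothetical baseline: an untrained model guessing uniformly over 300 animals.
    uniform = [1.0 / 300] * 300
    print(round(cross_entropy(uniform, 0), 2))
```

A uniform guess over 300 classes gives a loss of ln(300), about 5.70, which is a useful reference point when reading the loss curves.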

Muzzle Detection
The network weights resulting from training were used to evaluate the trained model on the test dataset. To obtain the optimal input resolution for the network in the test phase, we tried different resolutions according to the specification of YOLO. In [49], the authors advised using the highest network resolution possible, as this increases the precision of detection and is useful in detecting small objects. Based on the results of the trials, a resolution of 1024 × 1024 provided the best precision and was adopted in the test phase. Table 2 compares the accuracy, together with the true and false positive and negative rates, across the network resolutions tested. According to the results, the trained model was able to detect the muzzle region in the test set with an accuracy of 99.13%, providing very strong support for its reliability in the detection task. However, as the model is based on YOLO, it shares one of YOLO's shortcomings, namely sensitivity to the background. The proposed model requires very little computation time, but the accuracy of detection is highly dependent on the training data. In other words, if the cattle images have not been captured at almost the same angle and distance, the model could fail to detect the muzzle region. In smart camera monitoring situations, this detector property is an advantage, as it ensures that only those frames suitable for muzzle biometric recognition are captured for further processing.
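The accuracy and rate metrics of the kind compared in Table 2 can be computed from the four confusion-matrix counts. The sketch below uses the standard definitions; the counts in the usage example are hypothetical and are not the values reported in Table 2.

```python
from typing import Dict


def detection_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Accuracy and positive/negative rates from confusion-matrix counts."""
    total = tp + fp + tn + fn
    positives = tp + fn  # images that truly contain a detectable muzzle
    negatives = fp + tn  # images that do not
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "true_positive_rate": tp / positives if positives else 0.0,
        "false_negative_rate": fn / positives if positives else 0.0,
        "false_positive_rate": fp / negatives if negatives else 0.0,
        "true_negative_rate": tn / negatives if negatives else 0.0,
    }


if __name__ == "__main__":
    # Hypothetical counts for illustration only.
    for name, value in detection_metrics(tp=90, fp=2, tn=6, fn=2).items():
        print(f"{name}: {value:.4f}")
```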

Biometric Recognition
The speed of execution of deep learning models depends on the hardware used. In this study, all models have been implemented and tested on a Lambda Quad RTX 6000. Details of hardware information in this experiment are provided in Table 1.
The batch size was set to 10, resulting in 150 iterations per epoch. The proposed model was trained for 100 epochs (totaling 15,000 iterations). An initial learning rate of 0.0001 was used, which is smaller than the learning rate normally used for training a model from scratch (0.01). With this very small learning rate, the newly added layers were able to learn patterns from the previously learned convolutional layers. If a higher learning rate were used instead, the risk of losing previously learned knowledge would increase.
We experimented with different configurations to verify the performance of the model in different scenarios. The results of the training procedure under the different settings are summarized in Table 3. In the first stage, the model was trained from scratch without using pre-trained weights. Over-fitting occurred, as expected. In the second stage, ImageNet pre-trained weights were used. All layers of the CNN were frozen, while a SoftMax classifier was added to train the model on the classes in the dataset, which enabled the model to overcome over-fitting and achieve reasonable accuracy. A comparison of the number of trainable parameters and the average validation accuracy of the different settings in Table 3 reveals that the proposed model reached the highest average validation accuracy when the last three convolutional layers were trained.
Although transfer learning was clearly helpful for training the model, Figure 5 shows that the model required more epochs to achieve an acceptable accuracy. As a result, in the following stage, the last convolutional layer of the proposed model was unfrozen to make it trainable, which enabled the model to learn to differentiate the classes faster than before. After unfreezing different numbers of layers, the model was found to achieve the best performance when the last three convolutional layers were unfrozen. Figures 5 and 6 compare the validation accuracies and the losses incurred by the proposed model under the four settings described earlier.
Based on the results shown in Figures 5 and 6, the best performance was achieved when the last three convolutional layers were trained along with the added layers. Furthermore, as Figure 7a,b shows, the training and validation accuracies rapidly increased while the training and validation losses rapidly decreased. In addition, the model achieved its highest training and validation accuracies after just 10 epochs, highlighting the reliability of the model. The network converged relatively quickly, and training beyond 10 epochs would be unlikely to improve the accuracy of the model significantly. Accordingly, to minimize computation, the training process can be terminated after around 10 epochs.
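One generic way to act on this observation is a patience-based early-stopping rule, which ends training once the validation accuracy has stopped improving. The sketch below is an illustration of that general technique and is not a procedure reported in this study. The patience, the minimum improvement threshold and the accuracy curve are all hypothetical.

```python
from typing import Sequence


def early_stopping_epoch(
    val_accuracies: Sequence[float], patience: int = 3, min_delta: float = 0.001
) -> int:
    """Return the 1-based epoch at which training would stop.

    Training stops once `patience` consecutive epochs pass without the
    validation accuracy improving on the best value by more than `min_delta`.
    If that never happens, the final epoch is returned.
    """
    best = float("-inf")
    best_epoch = 0
    for epoch, accuracy in enumerate(val_accuracies, start=1):
        if accuracy > best + min_delta:
            best, best_epoch = accuracy, epoch
        elif epoch - best_epoch >= patience:
            return epoch
    return len(val_accuracies)


if __name__ == "__main__":
    # Hypothetical validation curve that plateaus at epoch 6.
    curve = [0.50, 0.80, 0.90, 0.95, 0.98, 0.99, 0.99, 0.99, 0.99, 0.99]
    print(early_stopping_epoch(curve))
```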

Testing the Model
Although most deep learning models employ the SoftMax activation function for classification tasks [65], three well-known classifiers (K-Nearest Neighbors (KNN) [66], Support Vector Machine (SVM) [65], and Multilayer Perceptron (MLP) [67]) were also evaluated alongside the SoftMax classifier to investigate the precision of the proposed model. To apply these classifiers, the last fully connected layer of the model was removed, and the features and labels of all images were extracted and stored. The best hyper-parameters for each classifier were then found using grid search. For testing, a test set containing 563 images belonging to 300 individual cattle was used; the test set was obtained through random sampling. Although all the classifiers achieved acceptable accuracy, SoftMax correctly identified the classes of 558 of the 563 images and, as expected, achieved the highest accuracy (99.11%). The obtained accuracies and inference times per image are listed in Table 4.
Table 5 presents the results of several recent studies in the same field. Most of these models need large numbers of training images for each individual animal. In practical settings, acquiring a large number of images is not always possible. Moreover, collecting a large number of images is both time-consuming and costly, and training a model on a large dataset requires more computation.
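To illustrate how a conventional classifier can be applied to features extracted from the truncated network, the sketch below implements a small K-Nearest Neighbors vote over feature vectors, together with the accuracy calculation used for the reported test result. The two-dimensional feature vectors, the labels, and the function names are hypothetical stand-ins; real ResNet-50 features have far more dimensions, and the grid search over hyper-parameters is not shown.

```python
import math
from collections import Counter


def knn_predict(train_features, train_labels, query, k=1):
    """Predict a label by majority vote among the k nearest training vectors."""
    distances = sorted(
        (math.dist(vec, query), label)
        for vec, label in zip(train_features, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


def accuracy(predictions, targets):
    """Fraction of predictions that match the ground-truth labels."""
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(targets)


# Toy feature vectors standing in for extracted muzzle features (assumed data).
train_features = [(0.0, 0.1), (0.1, 0.0), (1.0, 0.9), (0.9, 1.0)]
train_labels = ["cow_A", "cow_A", "cow_B", "cow_B"]
test_features = [(0.05, 0.05), (0.95, 0.95)]
test_labels = ["cow_A", "cow_B"]

predictions = [
    knn_predict(train_features, train_labels, q, k=3) for q in test_features
]
print(accuracy(predictions, test_labels))

# The reported SoftMax result: 558 correct out of 563 test images.
print(round(100 * 558 / 563, 2))
```

The last line reproduces the arithmetic behind the reported figure: 558 correct predictions out of 563 images corresponds to 99.11% accuracy.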

Discussion
A fully automated detection system should be capable not only of detecting and classifying objects but also of segmenting them accurately from the background. The YOLO-ResNet model pipeline described in this paper offers this capability, which is a significant practical advancement over previous cattle identification using muzzle patterns. The YOLO-ResNet model proposed in this paper utilizes a YOLO muzzle detector to first detect the region-of-interest, followed by a fine-tuned ResNet-50 model to extract muzzle features and provide individual identification. To our knowledge, all previously reported muzzle print cattle identification methods have required the extraction of the muzzle region of interest by manual procedures [35,54,55,70]. Incorporation of the muzzle detector and biometric recognition algorithms facilitates the further development of this technology within precision livestock settings.
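The two-stage structure described above (detect the muzzle region, crop it, then identify the animal from the crop) can be sketched as follows. This is an illustrative outline only: `toy_detector` and `toy_classifier` are hypothetical stand-ins for the trained YOLO detector and the fine-tuned ResNet-50 model, and the image is represented as a plain nested list rather than a real image array.

```python
from typing import Callable, List, Optional, Tuple

Image = List[List[int]]          # grayscale image as rows of pixel values
Box = Tuple[int, int, int, int]  # (left, top, right, bottom), right/bottom exclusive


def crop(image: Image, box: Box) -> Image:
    """Extract the region-of-interest given by `box` from `image`."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


def identify(
    image: Image,
    detect_muzzle: Callable[[Image], Optional[Box]],
    classify_muzzle: Callable[[Image], str],
) -> Optional[str]:
    """Stage 1: locate the muzzle. Stage 2: classify the cropped region."""
    box = detect_muzzle(image)
    if box is None:
        return None  # no muzzle found, so no identity can be returned
    return classify_muzzle(crop(image, box))


def toy_detector(image: Image) -> Optional[Box]:
    """Stand-in detector: bounding box of all non-zero pixels."""
    coords = [(x, y) for y, row in enumerate(image)
              for x, value in enumerate(row) if value > 0]
    if not coords:
        return None
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return (min(xs), min(ys), max(xs) + 1, max(ys) + 1)


def toy_classifier(muzzle: Image) -> str:
    """Stand-in classifier: thresholds the summed pixel intensity."""
    return "cow_A" if sum(map(sum, muzzle)) > 30 else "cow_B"


image = [[0, 0, 0, 0],
         [0, 9, 9, 0],
         [0, 9, 9, 0],
         [0, 0, 0, 0]]
print(identify(image, toy_detector, toy_classifier))
```

Passing the detector and classifier as arguments keeps the two stages independent, so either model could be replaced without changing the surrounding workflow.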
Few-shot learning was also utilized within the ResNet-50 biometric recognition model. There was a strict limit of five training images per individual animal. This differs from the other cattle muzzle biometric studies listed in Table 5. The proposed YOLO-ResNet-50 model has the lowest number of training images per individual of any relevant study in the literature. The proposed model has higher accuracy than competing methods, but the results are not directly comparable due to the lower number of training images per individual and the overall greater number of individual animals than in most studies. From a machine learning perspective, increasing the number of training images per individual animal might further improve the performance of the YOLO-ResNet50 model. However, limiting the number of training images per individual has significant practical benefits; in many operational precision livestock management settings, it might not be feasible to obtain and label an extensive training data set. Furthermore, greater numbers of training images per individual increase the computational complexity of model updates. Deployment of cattle muzzle biometric algorithms into commercial settings will require such model updating to produce a model suitable for identification of the particular animals in a herd.
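The five-image limit per animal can be enforced when the training set is assembled. The sketch below shows one possible way of doing so, assuming the data are available as (image identifier, animal identifier) pairs. The function name `few_shot_split`, the random selection, and the toy identifiers are assumptions made for illustration and do not describe the exact sampling procedure used in this study.

```python
import random
from collections import defaultdict


def few_shot_split(samples, shots=5, seed=0):
    """Split (image_id, animal_id) pairs into a few-shot training set and the rest.

    At most `shots` images per animal are placed in the training set;
    all remaining images of that animal are held out.
    """
    by_animal = defaultdict(list)
    for image_id, animal_id in samples:
        by_animal[animal_id].append(image_id)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, held_out = [], []
    for animal_id in sorted(by_animal):
        images = by_animal[animal_id][:]
        rng.shuffle(images)
        train += [(img, animal_id) for img in images[:shots]]
        held_out += [(img, animal_id) for img in images[shots:]]
    return train, held_out


# Toy data: three animals with 8, 6, and 5 images, respectively.
samples = (
    [(f"a_{i}.jpg", "cow_A") for i in range(8)]
    + [(f"b_{i}.jpg", "cow_B") for i in range(6)]
    + [(f"c_{i}.jpg", "cow_C") for i in range(5)]
)
train, held_out = few_shot_split(samples, shots=5)
print(len(train), len(held_out))
```

For the toy data, each animal contributes exactly five training images and the remaining four images are held out.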
Envisaged applications of the YOLO-ResNet50 cattle biometric recognition model include livestock monitoring via 'smart cameras' coupled with edge computing network infrastructure through to smartphone-based biometric recognition apps. Both the YOLOv3 and ResNet50 network architectures are well-established and tested throughout a range of visual recognition applications, including livestock monitoring. The deeper ResNet-50 network architecture is well suited for processors utilized in many edge and fog computing applications, but it might not be the most suitable model for smartphone applications. Future research should investigate the benefits of smaller-sized models, which would have additional benefits for deployment within smartphone applications.
Biometric recognition of livestock has advanced far over the past decade, particularly those models using convolutional neural networks. Despite the considerable advances in the computer science and engineering aspects, there remain several unknowns regarding the utilization of livestock biometric monitoring within practical settings. This study is one of the largest cattle muzzle print biometric recognition studies to date; however, even 300 individuals is a small-scale herd in many settings. Extensive research needs to be conducted to determine whether this approach is suitable for large national or international scale systems incorporating millions of animals or, alternatively, whether the models are more suitable for within-herd applications on a farm (e.g., automated feed intake monitoring). The relevance of the YOLO-ResNet-50 (and similar) models for the biometric recognition of sheep, pigs, goats, horses, dairy cattle and other livestock is also of practical interest. Although many studies have recently been conducted on animal biometric characteristics, the possibility of successfully using the proposed models in practical environments is still unknown. Developing a better understanding of convolutional neural network-based livestock biometric recognition, in particular its benefits and limitations, will lead to greater confidence in these systems and facilitate industry adoption.

Conclusions
In this paper, we have proposed the YOLO-ResNet-50 muzzle biometric identification system as a novel deep learning modelling approach for the identification of individual cattle. The YOLO-ResNet-50 model addressed a major limitation of previous cattle identification systems by automating both the muzzle detection and individual identification steps within a single workflow. The implementation of the YOLO-ResNet50 model finds and detects the muzzle region automatically by using images taken of the frontal view of cattle. By extracting the muzzle region-of-interest, a cattle muzzle database was created, which is another main contribution of this study. Next, unknown images of cattle were compared and matched to return their ID utilizing a fine-tuned ResNet-50 model. Experimental results demonstrate that by using transfer learning with fine-tuning (rather than developing a new network architecture), it was possible to develop a leading biometric recognition model with 99.11% accuracy. Furthermore, by utilizing transfer learning, the amount of time and effort required for data collection and training can be reduced.
The proposed YOLO-ResNet50 model can classify an individual animal using the muzzle pattern with just five training images per individual. This outperforms similar approaches that require larger sets of images for training. In addition, distinct from other studies, the YOLO-ResNet50 model evaluations were performed on mixed breeds of cattle, which indicates that the biometric recognition model is not confined to one particular cattle breed. The model system architecture and workflow provide a useful template for similar livestock monitoring applications, such as the automated detection and identification of livestock from surveillance video or drone footage. The ability to achieve high classification accuracy indicates that further development of an automated livestock identification system is feasible and fast becoming a reality.