Individual Beef Cattle Identification Using Muzzle Images and Deep Learning Techniques

Simple Summary

The ability to identify individual animals has gained great interest in beef feedlots because it enables animal tracking and all applications of precision management for individuals. This study assessed the feasibility and performance of a total of 59 deep learning models in identifying individual cattle with muzzle images. The best identification accuracy was 98.7%, and the fastest processing speed was 28.3 ms/image. A dataset containing 268 US feedlot cattle and 4923 muzzle images was published along with this article. This study demonstrates the great potential of using deep learning techniques to identify individual cattle using muzzle images and to support precision beef cattle management.

Abstract

Individual feedlot beef cattle identification represents a critical component of cattle traceability in the food supply chain. It also provides insights into tracking disease trajectories, ascertaining ownership, and managing cattle production and distribution. Animal biometric solutions, e.g., identifying cattle muzzle patterns (unique features comparable to human fingerprints), may offer noninvasive and unique methods for cattle identification and tracking, but they need validation as machine learning modeling advances. The objectives of this research were to (1) collect and publish a high-quality dataset of beef cattle muzzle images, and (2) evaluate and benchmark the performance of recognizing individual beef cattle with a variety of deep learning models. A total of 4923 muzzle images of 268 US feedlot finishing cattle (>12 images per animal on average) were taken with a mirrorless digital camera and processed to form the dataset. A total of 59 deep learning image classification models were comparatively evaluated for identifying individual cattle. The best accuracy for identifying the 268 cattle was 98.7%, and the fastest processing speed was 28.3 ms/image. A weighted cross-entropy loss function and data augmentation can increase the identification accuracy for individual cattle with fewer muzzle images available for model development. In conclusion, this study demonstrates the great potential of deep learning applications for individual cattle identification and is favorable for precision livestock management. Scholars are encouraged to utilize the published dataset to develop better models tailored to the beef cattle industry.


Introduction
Beef products are among the most consumed animal proteins, and the beef cattle industry is critical for many rural communities. The beef industry faces numerous challenges, including the need for improved nutrition and production efficiency to feed a growing human population [1,2]. The US is the largest beef producer, the second-largest importer, and the third-largest exporter by volume, and it has the largest total consumption globally [3]. During 2021, the total US cattle inventory (including all cattle and calf operations) was

A reference summary of previous cattle muzzle identification studies was thoroughly conducted to investigate the method development progress and determine the current research gaps (Table 1). Petersen [16] was the first to explore muzzle pattern recognition for dairy cattle. During the early stages [16][17][18], investigators manually observed imprinted muzzle patterns and explored the muzzle uniqueness of healthy and ill cattle. These studies contributed significantly to examining the possibility of muzzle recognition; however, manual observation was laborious and not suitable for large-scale application. Then, conventional digital image processing algorithms (e.g., scale-invariant feature transform and box-counting fractal dimension models) were used to identify individual cattle automatically [15,19]. These methods typically matched features, including color, texture, shape, and edge, among different muzzle images and achieved high identification accuracy (e.g., 98.3% [20] and 100% [21]) with small image sets under controlled conditions. However, method performance may be challenged by inconsistent illumination and background, variable muzzle shapes and sizes, similar appearances of the same animal at different times, missing parts or occlusions on a muzzle, and low resolution [22].
Machine learning classification models (e.g., support vector machine, K-nearest neighbor, and decision tree) were combined with image processing-based feature extractors (e.g., the Weber local descriptor) to further boost the performance of muzzle identification [23][24][25]. Despite promising results, with over 95% accuracy for beef cattle muzzle classification, these approaches require sophisticated hand-crafted features and may be difficult to develop and optimize for researchers from non-computer-science backgrounds.
Deep learning is a data-driven method and has been researched for computer vision applications in animal production [22]. Deep learning models can capture spatial and temporal dependencies of images/videos through shared-weight filters and can be trained end-to-end without strenuous hand-crafted design of feature extractors [26], empowering the models to adaptively discover the underlying class-specific patterns and the most discriminative features automatically. Kumar et al. [27], Bello et al. [28], and Shojaeipour et al. [29] applied deep learning models (e.g., convolutional neural network, deep belief neural network, You Only Look Once, and residual network) to large sets (over 2900 images in total) of dairy and beef cattle muzzle images and achieved accuracies above 98.9%. The US beef cattle industry differs considerably from the dairy sector in terms of both animal genetics and housing environment [3], which may result in different bioinformatic markers between dairy and beef cattle that influence model classification performance. Shojaeipour et al. [29] investigated the muzzle pattern of beef cattle, but the cattle were constrained in a crush restraint (i.e., hydraulic squeeze chute) with their heads placed in head scoops for data collection, which may cause extra distress to the cattle. Moreover, except for the study of Shojaeipour et al. [29], the muzzle image datasets were not publicly available in most studies, limiting the progress of developing models tailored to beef cattle applications.
Model processing speed was reported in only a limited number of publications [15,23,30] and was not reported in the three recent deep learning studies mentioned above (Table 1). Processing speed is a critical metric for estimating the overall identification duration on a farm when classification models are incorporated into computer platforms or robots. During conventional data collection [16], cattle were constrained with ropes or other tools, snot and grass on the nose were wiped clean with tissues, thin ink was smeared on the nose area, paper was rolled upward or downward to obtain the printed muzzle pattern, and the imprinted muzzle was scanned or photographed to digitize the muzzle pattern for further data analysis. Such procedures may acquire clear muzzle patterns but are also complicated and inefficient to apply in modern feedlots. Grayscale images were used in some studies but provide only one-channel information, whereas RGB (red, green, and blue) images contain richer information for processing and were applied more frequently, as indicated in Table 1.
The objectives of this study were to (1) collect high-resolution RGB muzzle images of feedlot cattle without any restraint or contact with the animals to develop a high-quality dataset to train deep learning models for individual cattle identification, and (2) benchmark classification performance and processing speed of muzzle identification optimized with various deep learning techniques.

Note (Table 1): '−' indicates that information was not provided. DIP, digital image processing; ML, machine learning; DL, deep learning. Cattle species include beef cattle and dairy cattle. Image type is categorized as printed (samples are obtained from a direct compress with cattle noses and then scanned or photographed to form electronic images), grayscale with one-channel data captured directly from cameras, and RGB with three-channel (red, green, and blue) data. 'Y' indicates that the animal was restrained during data collection, while 'N' indicates that it was not.

Image Collection and Dataset Curation
This research was conducted at the University of Nebraska-Lincoln (UNL) Eastern Nebraska Research Extension and Education Center (ENREEC) research farm located near Mead, NE, USA. All animals were cared for under approval of the UNL Institutional Animal Care and Use Committee protocol 1785 (approved 4 December 2019), and no direct contact with animals was made throughout the course of data collection.
The RGB images of beef cattle were collected using a mirrorless digital camera (X-T4, FUJIFILM, Tokyo, Japan) and a 70-300 mm F4-5.6 zoom lens (XF70-300 mm F4-5.6 R LM OIS WR, FUJINON, Tokyo, Japan) from 11 March to 31 July 2021. All images were collected from various distances outside the pens, while cattle were free to express their natural behaviors. A total of 4531 raw images of 400 US mixed-breed finishing cattle (Angus, Angus × Hereford, and Continental × British cross) were collected, with the muzzle areas as the focus of the images. The ear tag information of each animal was recorded to verify individual beef cattle. Because all images were taken under natural outdoor feedlot conditions, they presented different angles of view and lighting conditions. Raw images contained anatomical parts (e.g., face, eye, and body) that were unnecessary for classification purposes. To reduce classification interference and highlight muzzle visual features, the cattle face area was rotated so that the muzzle area aligned horizontally, after which the muzzle area was manually cropped. Extremely blurry, incomplete, or feed-covered muzzle images were removed to maintain dataset quality. Small sets of images per animal (≤3) were also discarded to obtain sufficient data for model training, validation, and testing. In the end, a total of 4923 muzzle images (multiple muzzles could be cropped from a single raw image) from 268 beef cattle were selected to form the dataset. Nine sample images from nine cattle are presented in Figure 2. Although colors and textures could be similar among individuals, the patterns of beads, grooves, and ridges were visually different, which was favorable for individual identification. The cropped muzzle images are published in an open-access science community [42].
A frequency distribution of the normalized width/length of the cropped images is depicted in Figure 3. Inconsistent image sizes may lead to biased detection performance, while high-resolution (large-size) images can downgrade the processing efficiency [43]. Therefore, the cropped images were resized to the same dimensions before being supplied to the classification models. The dimensions were determined on the basis of the following criteria: (1) similarity to the dimensions reported in previous studies (Table 1); (2) high frequency of the normalized width/length of the cropped muzzle images in the dataset, as indicated in Figure 3; (3) compliance with the input size requirements of most deep learning image classification models. In the end, dimensions of 300 × 300 pixels were selected in this study to normalize the cropped images.
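Criterion (2) above — choosing a dimension that matches the most frequent normalized width/length of the crops — can be sketched with a simple histogram. The crop sizes, bin width, and `pick_target_dimension` helper below are hypothetical illustrations, not values from the study:

```python
from collections import Counter

def pick_target_dimension(sizes, bin_width=50):
    """Bin the widths/heights of cropped muzzle images and return the
    most frequent bin centre as a candidate square target dimension."""
    binned = Counter()
    for w, h in sizes:
        for d in (w, h):
            binned[(d // bin_width) * bin_width + bin_width // 2] += 1
    # Most frequent bin centre; ties broken toward the smaller dimension
    return min(binned, key=lambda c: (-binned[c], c))

# Hypothetical crop sizes (pixels); the study settled on 300 x 300
sizes = [(310, 290), (305, 295), (420, 400), (298, 312), (300, 301)]
print(pick_target_dimension(sizes))
```

In practice, the chosen dimension would also be checked against criteria (1) and (3) before resizing the whole dataset.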

General Model Evaluation and Development Strategies
Transfer learning was deployed during training, in which models were pre-trained on a large dataset, ImageNet [62], and only the fully connected layers were fine-tuned with the current dataset for custom classification. This strategy improves training efficiency without compromising inference performance. The cattle muzzle dataset was randomly partitioned and reshuffled into three subsets: 65% for training, 15% for validation, and 20% for testing. Image pixel intensities per color channel were normalized to the range of [0,1] for enhanced image recognition performance [63]. Each model was trained with five replications assigned the same random seeds, and the mean accuracy on the testing dataset was computed to evaluate model performance and reduce the random effects resulting from data reshuffling. All models were trained for 50 epochs (within which training typically converged for muzzle data), using a stochastic gradient descent optimizer with a momentum of 0.9. The learning rate was initially set to 0.001 and dynamically decayed by a factor of 0.1 every seven epochs to stabilize model training. Models were trained and validated in a cloud-based service, Google Colab Pro, allocated a Tesla P100-PCIE-16GB GPU, 12.69 GB of RAM, and 166.83 GB of disk space. The workload and Ethernet speed of cloud services can vary, resulting in inconsistent processing speeds among different models. Therefore, a local machine with an Intel® Core™ i7-8700K CPU @ 3.70 GHz processor, 16.0 GB of RAM, and the Windows 10® 64 bit operating system was also used for model testing. Utilizing multiple machines allowed training to be accelerated with the cloud-allocated GPU while standard model performance without a GPU was examined for mobile applications.
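The data partitioning and step-decay learning rate described above can be sketched as follows. This is a minimal standard-library illustration with hypothetical function names; in an actual PyTorch pipeline, the schedule would typically be implemented with `StepLR`:

```python
import random

def split_dataset(items, seed=0, train=0.65, val=0.15):
    """Shuffle with a fixed seed, then split 65/15/20 into train/val/test."""
    rng = random.Random(seed)
    items = list(items)
    rng.shuffle(items)
    n_train = int(len(items) * train)
    n_val = int(len(items) * val)
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

def learning_rate(epoch, base_lr=0.001, gamma=0.1, step=7):
    """Step-decay schedule: multiply the rate by 0.1 every 7 epochs."""
    return base_lr * gamma ** (epoch // step)

train, val, test = split_dataset(range(100), seed=42)
print(len(train), len(val), len(test))  # 65 15 20
```

Fixing the seed per replication mirrors the study's use of identical random seeds across the five training replications.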
The cross-entropy (CE) loss function was used to evaluate model performance during training and validation (Equation (1)):

CE = −∑_{i=1}^{C} w · t_i · log(p_i)    (1)

where p_i ∈ R^268 (R^268 indicates a 268-dimensional vector) is the Softmax output indicating the predicted probability for the i-th of the individual cattle, C is the number of cattle (=268), w (=1) indicates that equal weights were assigned to all cattle, and t_i, the true probability for the i-th cattle, is defined as follows (Equation (2)):

t_i = 1 if the i-th cattle is the true class, and t_i = 0 otherwise    (2)

Accuracy was calculated for each model during each epoch of training using the validation dataset and after training using the testing dataset to determine overall classification performance. It was also calculated for each class to determine individual identification performance. Processing speed was computed as the time reported in Python divided by the total number of processed images; higher values indicate slower processing. We proposed a comprehensive index (CI, Equation (3)) to balance these two opposing evaluation metrics and determine the comprehensive performance of each model. The accuracy and processing speed computed from the testing dataset were first ranked, where high accuracy values and short processing times received top orders. Because accuracy was considered more important than processing speed in this study, the ranked results were weighted 80% for accuracy and 20% for processing speed:

CI = 0.8 × Order_accuracy + 0.2 × Order_speed    (3)

where the variable Order represents integers ranging from 1 to 59, and the subscripts refer to the metric of interest. The proportion can be changed on the basis of the specific metric importance determined by developers. Overall, a lower CI indicates that a model provides better comprehensive performance. Pearson's correlation analysis was conducted to understand the correlations among model total parameters, model size, accuracy, and processing speed. Larger absolute values of the Pearson correlation coefficient (R) indicate a higher correlation between parameters.
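As a concrete illustration of Equation (3), the ranking and weighting can be computed as below. The metrics for the three example models are hypothetical; the study ranked all 59 models:

```python
def comprehensive_index(accuracies, times_ms, w_acc=0.8, w_speed=0.2):
    """Rank models by accuracy (higher is better) and by processing time
    (lower is better), then combine the rank orders 80/20.
    A lower CI indicates better comprehensive performance."""
    n = len(accuracies)
    # order 1 = most accurate model; order 1 = fastest model
    acc_order = {i: r + 1 for r, i in enumerate(
        sorted(range(n), key=lambda i: -accuracies[i]))}
    spd_order = {i: r + 1 for r, i in enumerate(
        sorted(range(n), key=lambda i: times_ms[i]))}
    return [w_acc * acc_order[i] + w_speed * spd_order[i] for i in range(n)]

# Hypothetical metrics for three models
acc = [98.4, 95.0, 1.2]        # testing accuracy (%)
speed = [61.3, 32.3, 678.2]    # processing time (ms/image)
print(comprehensive_index(acc, speed))
```

Here, the first model is most accurate but only second fastest, so its CI (0.8 × 1 + 0.2 × 2 = 1.2) still ranks it best overall.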

Optimization for Class Imbalance
Class imbalance was observed in the cropped muzzle image dataset, with the number of images per animal ranging from four to 70 (Table 3). Because fewer images were available, a minority class (cattle with fewer images) may be prone to misidentification. Two commonly used deep learning strategies, a weighted cross-entropy (WCE) loss function [64] and data augmentation [22], were adopted to mitigate this issue during model training. Both optimization strategies were evaluated with 20 models, which were selected on the basis of optimal accuracy, processing speed, and CI among the 59 models. Accuracy was the primary metric for optimizing class imbalance. The WCE loss function assigned heavier weights to cattle with fewer cropped muzzle images, as defined in Equation (4):
WCE = −∑_{i=1}^{C} w_i · t_i · log(p_i)    (4)

where w_i is the individualized weight assigned to the i-th cattle, calculated as follows (Equation (5)):

w_i = N_max / N_i    (5)

where N_i denotes the number of images for the i-th cattle, and N_max is the maximum image count per head (=70 in this case). The assigned weight in the WCE loss function for each cattle is provided in Table A1 in Appendix A.

Data augmentation is a technique to create synthesized images and enlarge limited datasets for training deep learning models. The augmentation was implemented as a preprocessing step in an end-to-end training or inference process. Four augmentation strategies were adopted on the basis of raw image limitations, namely, horizontal flipping, brightness modification, randomized rotation, and blurring. Horizontal flipping mimicked events in which cattle were photographed in different orientations due to their natural behaviors. Brightness modification mimicked varying outdoor lighting conditions, with the brightness factor set from 0.2 to 0.5 (minimum = 0, maximum = 1). Rotation was randomized from −15° to 15°, simulating natural cattle head movements. Blurring simulated overexposure and motion blur, and a Gaussian function with kernel sizes of 1-5 was used to create blurred muzzle images.
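A minimal sketch of the weighting in Equations (4) and (5), assuming the inverse-frequency form w_i = N_max/N_i. The cattle IDs, image counts, and probabilities below are hypothetical; the study's actual per-animal weights are listed in Table A1:

```python
import math

def wce_weights(image_counts):
    """Inverse-frequency class weights: cattle with fewer images receive
    proportionally heavier weights (N_max / N_i)."""
    n_max = max(image_counts.values())
    return {cow: n_max / n for cow, n in image_counts.items()}

def weighted_cross_entropy(probs, true_class, weights):
    """WCE for one sample: -w_t * log(p_t), since t_i is 1 only for
    the true class (Equation (2)) and 0 elsewhere."""
    return -weights[true_class] * math.log(probs[true_class])

counts = {"2100": 4, "1234": 70}  # images per animal (IDs hypothetical)
print(wce_weights(counts))        # the 4-image animal is weighted 17.5x
```

With this form, the four-image animals that were misidentified most often contribute 17.5 times more to the loss than the 70-image animal, pushing the model to fit minority classes.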

Examples of Validation Performance
Figure 4 provides representative results of the validation accuracy of the 59 deep learning image classification models assessed. Most models converged before the 50th epoch, and various models achieved their best validation accuracy at an epoch between 13 and 49. Each model took 20 to 302 min to finish the training of 50 epochs using the cloud service. MnasNet_0.5, ShuffleNetV2_×0.5, and ShuffleNetV2_×1.0 consistently showed a low validation accuracy (<5%) across all training epochs. Valley points were observed in the validation accuracy curves of RegNetY_16GF and RegNetX_800MF, probably because of data randomness after data reshuffling. In sum, 50 epochs was a reasonable configuration for model training, and convergence of validation accuracy was observed for all model training, including the training of the 20 models selected for optimizing the accuracy of individual identification.

Testing Performance of the Selected Deep Learning Image Classification Models
Table 3 shows the testing accuracies and processing speeds of the 59 deep learning image classification models; descriptions of the models are provided in Table 2. Accuracy ranged from 1.2% to 98.4%, with ShuffleNetV2_×0.5 being the worst and VGG16_BN being the best. Each model took 32.3 to 678.2 ms to process a muzzle image, with ShuffleNetV2_×0.5 being the fastest model and EfficientNet_b7 being the slowest. Twenty models were selected and organized on the basis of the CI ranking, namely, VGG11_BN, AlexNet, VGG16_BN, VGG13, SqueezeNet_1.1, VGG11, VGG13_BN, MobileNetV3_Large, VGG19_BN, VGG16, SqueezeNet_1.0, VGG19, MobileNetV3_Small, ResNeXt101_32×8d, ResNet34, DenseNet169, DPN68, DenseNet161, DenseNet201, and RegNetY_32GF. These 20 models were further used to evaluate the optimization for class imbalance.

The processing speed was computed on Google Colab (with GPU) for all 59 models, and the average ± standard deviation was 60.5 ± 93.4 ms/image, much lower than the 197.8 ± 145.1 ms/image computed on the local computer with CPU only (Table 3), presumably due to the GPU acceleration of the cloud-based service. However, the cloud-computed speeds were inconsistent. For example, the processing speed using the cloud was 103.6 ms/image for DenseNet121 but 20.9 ms/image for DenseNet161, 182.4 ms/image for EfficientNet_b3 but 12.4 ms/image for EfficientNet_b4, 379.0 ms/image for MnasNet_0.5 but 10.0 ms/image for MnasNet_1.0, and 186.2 ms/image for VGG13 but 18.8 ms/image for VGG16. Internet speed inconsistency may have led to these abnormal trends, in which simpler architectures in the same model family processed images more slowly. Therefore, although the processing speed obtained with CPU only was not optimal, it was at least reliable for establishing benchmark performance for mobile computing machines without a GPU.

Table 4 presents a Pearson correlation coefficient (R) matrix to better understand the relationships among model performance parameters. Accuracy had a low, positive correlation with model parameters (total parameters and size), while processing speed was moderately and positively correlated with model parameters. This result matched our original hypothesis on the correlation direction (a complicated model with more parameters should have greater accuracy but a longer processing time), although we expected greater correlation magnitudes. Other factors may also affect model performance, such as connection schemes and network depth. For example, both ShuffleNet [58] and SqueezeNet [59] were lightweight models with 1.2-2.3 million total parameters, a 2.5-6.0 MB model size, and a fast processing speed of 32.3-62.1 ms/image. However, SqueezeNet achieved much better accuracies (95.0-95.9%) than ShuffleNet (1.2-1.3%). SqueezeNet introduced the Fire module (squeezed convolution filters) to build its CNN architecture and achieved AlexNet-level accuracy with fewer total parameters and smaller model sizes [59]. ShuffleNet used a direct measure of processing speed rather than the indirect measure of FLOPs to efficiently design and optimize its CNN architecture, although the method was not beneficial in improving accuracy (the top-1 error rate was up to 39.7% [58]).

Table 4. Pearson correlation coefficient (R) matrix among model total parameters, model size, accuracy, and processing speed.

                  Accuracy   Processing speed
Total parameters  0.389      0.517
Model size        0.391      0.521

Interestingly, a few earlier models, such as AlexNet [44] and VGG [60], outperformed some newer models (e.g., EfficientNet [47], MnasNet [52], and RegNet [55]). One plausible explanation is that the connection scheme greatly impacted model performance on this muzzle image dataset. AlexNet and VGG operate in a feed-forward manner and improve performance by increasing architecture depth, whereas the newer models increase architecture width, introduce shortcut connections, or scale up architecture width and depth together. Our results indicate that a simple, feed-forward network architecture is sufficient for identifying individual beef cattle using muzzle images.
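The coefficients in Table 4 follow from the standard Pearson formula; a standard-library sketch is shown below (the sample vectors are hypothetical, not the study's measurements):

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences:
    covariance divided by the product of the standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical (total parameters in millions, processing time in ms/image)
params = [1.2, 2.3, 25.6, 61.1]
times = [32.3, 45.0, 120.4, 186.2]
print(round(pearson_r(params, times), 3))
```

An R of 0.517 between total parameters and processing speed, as in Table 4, corresponds to a moderate positive association by common interpretation conventions.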

Optimization Performance for Class Imbalance
The highest accuracy of the 20 selected models increased by 0.1% with the weighted cross-entropy loss function and 0.3% with data augmentation, compared with development without any class imbalance optimization (98.4%, Table 5). The average accuracy increased by 0.6% with weighted cross-entropy but decreased by 0.2% with data augmentation, compared with development without any class imbalance optimization (96.1%, Table 5). Accuracy was not consistently improved for every model by the class imbalance optimization (e.g., AlexNet, MobileNetV3_Small, SqueezeNet_1.0, SqueezeNet_1.1, and VGG13). Therefore, only the models that performed best in both strategies, VGG16_BN (with cross-entropy loss function and data augmentation) and VGG19_BN (with weighted cross-entropy loss function), were selected to evaluate the accuracy of individual cattle identification.

Table 5. Accuracy and processing speed for the 20 selected models before and after optimization for class imbalance. The best-performing models for each metric are highlighted in bold. Note: 'Cross-entropy' indicates models developed with the cross-entropy loss function and without any class imbalance optimization. Descriptions of the models are provided in Table 2.

There was no significant difference in processing speed between the developments with cross-entropy and with weighted cross-entropy. However, the processing speeds of the 20 models with data augmentation were faster than those without any class imbalance optimization. The processing time was the sum of the model loading time and the total image processing time divided by the total number of processed images (Equation (6)).

The best model classification accuracy achieved in this study was 98.7% (Tables 3 and 5), comparable to that of other deep learning studies for cattle muzzle recognition, in which the accuracy was 98.9% [27,28] and 99.1% [29]. Despite discrepancies in cattle breeds, rearing environments, data acquisition conditions, and network architectures, all these studies achieved the desired accuracy (>90%), which again proves the powerful object recognition ability of deep learning and suggests that deep learning techniques are well suited to individual cattle identification.
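The processing-time metric in Equation (6) amounts to a simple wall-clock measurement, sketched below; `model_loader` and `infer` are hypothetical stand-ins for the actual model loading and inference calls:

```python
import time

def processing_speed(model_loader, infer, images):
    """Equation (6): (model loading time + total image processing time)
    divided by the number of processed images, in ms/image."""
    t0 = time.perf_counter()
    model = model_loader()          # model loading time included
    for img in images:
        infer(model, img)           # per-image processing time
    elapsed = time.perf_counter() - t0
    return elapsed * 1000 / len(images)

# Trivial stand-ins so the sketch runs without a real model
speed = processing_speed(lambda: object(), lambda m, x: x, range(1000))
print(f"{speed:.4f} ms/image")
```

Including the loading time in the numerator explains why the per-image figure drops as more images are processed by the same loaded model.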
The processing speed ranged from 28.3 to 678.2 ms/image (Tables 3 and 5) and was also comparable to or faster than some previous studies: 32.0-738.0 ms/image [15], 77.5-1361.9 ms/image [23], and 368.3-1193.3 ms/image [30]. Interestingly, these studies used machine learning or digital image processing algorithms as indicated in Table 1, and these models were supposed to be relatively lightweight compared to the deep learning models, with a faster processing speed, but our study suggested the opposite. In addition to programming language and platform, computing hardware explained the processing speed performance, particularly for configurations that were less advanced than those listed in Section 2.3.

Identification Accuracy of Each Cattle
A probability chart demonstrating the identification accuracy of individual cattle is presented in Figure 5 and summarized in Table 6. In general, 92.5-95.1% of cattle were identified with 100% accuracy, suggesting a great potential of using deep learning techniques to identify individual cattle. The results are also in agreement with Table 5: despite different models, the developments with weighted cross-entropy and data augmentation indeed outperformed that with the cross-entropy loss function only. Worst-case scenarios (0% identification accuracy) were reduced from four, for models without class imbalance optimization, to three with data augmentation. Best-case scenarios (100% identified) increased by 6-7 after class imbalance optimization. Accuracy, excluding best-case and worst-case scenarios, improved by 1.3-1.5% after class imbalance optimization.

The ID numbers of cattle with 0% identification accuracy were 2100, 4549, 5355, and 5925, with only four cropped images per head (Table A1). Although more images per animal may result in higher accuracy for identifying individual cattle [22], multiple factors should be considered for data collection, such as access to animals, resource availability, and labeling workload. An optimal threshold of images per head is favorable for balancing identification accuracy against the difficulties associated with data collection. The commonly used per-head image count is 5-20, as suggested in Table 1. Our results also indicated that an animal with more than four muzzle images available for model development could be identified successfully (with over 90% accuracy) by the deep learning models. These data suggest that five images per animal could be an appropriate threshold.
This project aims to take the very first step toward an individual cattle identification system. Coupled with computer engineering and software development capacity, the optimal model, VGG16_BN, can be installed into a computer vision system to livestream cattle muzzles. In the future, such computer vision systems have the potential to be integrated into commercial beef cattle feedlot facilities via other facilities or technologies (e.g., hydraulic chutes, mobile robot systems, drones, and smartphones) that allow for individual cattle muzzle capture while maintaining the consistency of data collection.

Conclusions
Individual beef cattle were identified with muzzle images and deep learning techniques. A dataset containing 268 US feedlot cattle and 4923 muzzle images was published along with this article, forming the largest dataset for beef cattle to date. A total of 59 deep learning models were comparatively evaluated for identifying muzzle patterns of individual cattle. The best identification accuracy was 98.7%, and the fastest processing speed was 28.3 ms/image. The VGG models performed better in terms of accuracy and processing speed. Weighted cross-entropy loss function and data augmentation could improve the identification accuracy for the cattle with fewer muzzle images. This study demonstrates the great potential of using deep learning techniques to identify individual cattle using muzzle images and to support precision beef cattle management.

Acknowledgments:
The authors appreciate the enormous efforts of the staff and graduate students at University of Nebraska-Lincoln's ENREEC research facilities without whom this project would not have been possible.

Conflicts of Interest:
The authors declare no conflict of interest.