Evaluation of Deep Learning for Automatic Multi-View Face Detection in Cattle

Abstract: Individual identification plays an important part in disease prevention and control, traceability of meat products, and prevention of false agricultural insurance claims. Automatic and accurate detection of the cattle face is a prerequisite for individual identification and facial expression recognition based on image analysis technology. This paper evaluated the potential of a cutting-edge object detection algorithm, RetinaNet, for multi-view cattle face detection in housing farms with fluctuating illumination, overlapping, and occlusion. Seven different pretrained CNN models (ResNet 50, ResNet 101, ResNet 152, VGG 16, VGG 19, Densenet 121, and Densenet 169) were fine-tuned by transfer learning and re-trained on the dataset collected for this paper. Experimental results showed that RetinaNet incorporating ResNet 50 was superior in both accuracy and speed, yielding an average precision of 99.8% and an average processing time of 0.0438 s per image. Compared with typical competing algorithms, the proposed method was preferable for cattle face detection, especially in particularly challenging scenarios. This work demonstrates the potential of artificial intelligence for incorporating computer vision systems into individual identification and other animal welfare improvements.


Introduction
Animal husbandry is undergoing a transition from extensive farming to precision livestock farming and welfare-oriented breeding. Farming facilities and technologies play crucial parts in the economic benefits of large-scale pastures: inadequate management can directly damage the health of livestock and is adverse to food quality and safety and to the development of the livestock industry [1]. There is therefore an urgent need for cost-effective technologies to address challenges in animal agricultural systems such as labor shortages and difficulties in real-time monitoring. Precision farming has aroused growing interest recently due to increasing concern over sustainable livestock production and production efficiency [1][2][3][4][5]. Precision farming takes advantage of modern information technologies as an enabler of more efficient, productive, and profitable farming enterprises. For example, the Internet of Things (IoT) is used for collecting data over the whole lifecycle of livestock, including breeding, slaughtering, meat processing, and marketing; Big Data and Artificial Intelligence (AI) can provide accurate analysis and the real-time physical dynamics of each animal as a scientific basis for farm managers' decision-making and analysis. Among these, recognition of individual livestock

Related Work
Face detection is a particular application of object detection that accurately finds the target face and its location in images. Object detection is currently a very active research field in computer vision and facilitates high-level tasks such as automatic individual identification and intelligent image recognition. Early object detection methods, including the Viola-Jones detector, the HOG detector, and the deformable part-based model, were built on handcrafted features, which made their time complexity high and many of their candidate windows redundant [40]. In addition, the manually designed features of traditional object detection are not sufficiently robust to the wide diversity of image changes encountered in practice; CNNs were therefore introduced into the object detection community. Owing to its relatively superior ability to learn robust, high-level feature representations of an image, CNN-based object detection avoids the complicated feature extraction and reconstruction process of traditional object detection. After R. Girshick et al. took the lead in proposing region-based CNN features for object detection in 2014, object detection algorithms evolved from R-CNN at an unprecedented speed and have made much progress in recent years. Current state-of-the-art CNN-based object detectors can be grouped into two-stage and one-stage detection algorithms.
Two-stage detectors start with the extraction of object proposals through selective search or a Region Proposal Network (RPN); the candidate regions are then classified and regressed to precise coordinates. Regression-based (one-stage) algorithms such as Yolo and SSD first sample densely at various positions with different aspect ratios and then directly predict object categories and bounding boxes using a CNN. Although the end-to-end procedure of regression-based detectors outperforms region-based detectors in processing speed, they achieve lower mean average precision because of the extreme imbalance between object and background examples. T.-Y. Lin et al. therefore designed a novel one-stage detector called RetinaNet in 2017 to address this class imbalance and increase the importance of hard examples [41]. A "focal loss" was used in RetinaNet to redefine the standard cross-entropy loss so that training automatically down-weights easy examples and centers on hard, misclassified examples. Focal loss enables RetinaNet to achieve accuracy comparable to two-stage algorithms while maintaining a relatively high processing speed [41].
Considering operating speed and accuracy in farming practice, RetinaNet was selected in this paper for further study. Unlike human face detection, cattle face detection must account for changes in the cattle's face and body orientation caused by their random roaming. This paper therefore explores the effectiveness of RetinaNet for multi-view cattle face detection. Advancements in deep learning networks present an opportunity to extend the research to empirical comparisons of typical CNN backbones for RetinaNet in the task of detecting multi-view cattle faces. Figure 1 shows the overall workflow proposed for processing RGB images captured by 2-D cameras to detect multi-view cattle faces based on RetinaNet.
The RGB images acquired by 2-D cameras are used as input after image preprocessing, including image partitioning and resizing. A backbone (ResNet, VGG, or Densenet) is selected for feature extraction, and the Feature Pyramid Network (FPN) then strengthens the multi-scale features formed in the preceding convolutional network to obtain more expressive feature maps, forming a rich, multi-scale feature pyramid. Each feature map feeds two Fully Convolutional Network (FCN) sub-networks with the same structure but without shared parameters, one for cattle face classification and one for bounding-box prediction. Ground truth was annotated manually for every cattle face in the training sets; after labeling, network training was performed to form the cattle face detector, followed by the output of multi-view cattle face detections on the testing sets.

RetinaNet-Based Object Detection
The name RetinaNet derives from its dense sampling of the input image. RetinaNet was designed to evaluate the proposed focal loss as a remedy for class imbalance in regression-based algorithms. The framework consists of three parts: (i) a front backbone network for feature extraction, (ii) an FPN for constructing multi-scale feature pyramids, and (iii) two sub-networks for object classification and bounding box regression. Focal loss is an efficient loss function that replaces the sampling heuristics and two-stage cascade previously used to handle class imbalance during training. The details of the backbones and FCN sub-networks, commonly used in R-CNN-like detectors, are expounded in the original papers; this section mainly describes the FPN and the focal loss of the algorithm.

Feature Pyramid Networks
FPN strengthens the backbone's extraction of weak semantic features using a top-down pyramid and lateral connections (see Figure 2). As indicated by the blue blocks, the bottom-up path is the feed-forward computation of the main convolutional network, which produces a feature hierarchy at different scales. For the feature pyramid, one pyramid level is defined per stage, and the output of the last layer of each stage is chosen as its feature map, because the deepest layer of each stage should have the strongest features. Specifically, for the ResNet 101 used in RetinaNet, the outputs of the final residual blocks of conv2_x, conv3_x, conv4_x, and conv5_x are denoted {C2, C3, C4, C5}. Since conv1 would occupy a great deal of memory, it is not included in the pyramid. The top-down flow, marked in green, obtains high-resolution features by upsampling the spatially coarser but semantically stronger feature maps from higher pyramid levels; the bottom-up path is then connected laterally to reinforce these features. Specifically, the coarser feature map is upsampled by a factor of two, and the upsampled map is merged with the corresponding bottom-up map; this cycle is repeated until the finest resolution map is produced. To start the iteration, a 1 × 1 convolutional layer is simply applied to C5 to produce the coarsest map, and a 3 × 3 convolution is appended to each merged map to diminish the aliasing effect of upsampling. The same applies to the other levels, and the final feature map set, corresponding to {C2, C3, C4, C5}, is called {P2, P3, P4, P5} and serves object classification and bounding box regression.

Focal Loss
The box regression sub-net and the classification sub-net in RetinaNet use the standard Smooth L1 loss (Formula (1)) and the focal loss (Formula (3)), respectively, as their loss functions. Focal loss is a dynamically scaled cross-entropy loss: a weighting factor is added to the traditional cross-entropy function, which automatically down-weights the loss contributed by easy examples and centers training on hard, misclassified examples to solve the class imbalance.
Here, x is the error between the estimated value f(x_i) and the ground truth y_i; α_t and γ are two tunable focusing hyperparameters that balance the contributions of easy and hard examples; and p is the estimated probability for the given label class, with p_t equal to p when the ground-truth label is 1 and to 1 − p otherwise.
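The loss formulas referenced here do not survive in this version of the text. For reference, the standard definitions from the original RetinaNet work [41] take the following form (restated from that paper; the correspondence to the paper's Formula (1) and Formula (3) is assumed):

```latex
% Smooth L1 loss used by the box regression sub-net:
\mathrm{smooth}_{L_1}(x) =
  \begin{cases}
    0.5\,x^{2} & \text{if } |x| < 1,\\
    |x| - 0.5  & \text{otherwise.}
  \end{cases}

% Focal loss used by the classification sub-net, with
% p_t = p when the ground-truth label is 1 and p_t = 1 - p otherwise:
\mathrm{FL}(p_t) = -\alpha_t \,(1 - p_t)^{\gamma}\,\log(p_t)
```

With γ = 2, a well-classified example with p_t = 0.9 has its cross-entropy contribution scaled by (1 − 0.9)² = 0.01, which is precisely the down-weighting of easy examples described above.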

Datasets Preparation and Preprocessing
To address the scarcity of datasets for cattle face detection and recognition using deep learning, data were collected from two housing farms located in Jiangxi Province, China, covering 85 healthy cattle (scalpers and Simmental) ranging in age from 6 to 20 months. The experiment was conducted under various scenes, such as different illumination, overlapping, and postures, without human intervention, and data collection took three days. Examples of multi-view cattle faces in different scenes are displayed in Figure 3. This work aims to simulate and facilitate the detection and identification of cattle faces by future mobile devices rather than surveillance cameras, so it is natural to collect images in which the cattle faces occupy large areas. The cattle were filmed with a Sony FDR-AX40 camera in MOV format (3840 × 2160 pixels) at 25 frames per second. The camera, mounted on a tripod, directly faced the standing cow with a field of view of about three face widths by 1.5 face lengths. The original frames extracted from the videos were in JPG format at 3840 × 2160 pixels. After extracting the useful frames of every video in MATLAB, the selected images were cropped in MATLAB and then resized to 224 × 224 pixels. Notably, to ensure meaningful evaluation of detection performance, different situations of each cow's face were selected during image selection, and highly similar faces, especially from consecutive frames, were avoided. The datasets contained a total of 3000 images (including 1000 negative images) that were split into training and testing sets in the proportion 2:1.
LabelImg was the annotation tool used to label the ground truth for cattle faces in the training datasets using its RectBox. For labeling, the region of every cattle face in the image was selected and annotated with a RectBox, and the class label, named cattle face, was then entered in the dialog that pops up on the screen. The details of the data annotation include the object name, box location, and image size, as shown in Figure 4.
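LabelImg saves RectBox annotations in Pascal VOC XML by default, so the object name, box location, and image size described above can be read back programmatically. A minimal sketch follows; the file contents are a hypothetical example, not drawn from the paper's dataset:

```python
import xml.etree.ElementTree as ET

# A hypothetical LabelImg annotation in Pascal VOC XML format.
SAMPLE_XML = """<annotation>
  <filename>cow_0001.jpg</filename>
  <size><width>224</width><height>224</height><depth>3</depth></size>
  <object>
    <name>cattle face</name>
    <bndbox>
      <xmin>32</xmin><ymin>40</ymin><xmax>180</xmax><ymax>200</ymax>
    </bndbox>
  </object>
</annotation>"""

def parse_annotation(xml_text):
    """Return image size and a list of (label, (xmin, ymin, xmax, ymax))."""
    root = ET.fromstring(xml_text)
    size_node = root.find("size")
    width = int(size_node.find("width").text)
    height = int(size_node.find("height").text)
    boxes = []
    for obj in root.findall("object"):
        label = obj.find("name").text
        bb = obj.find("bndbox")
        box = tuple(int(bb.find(tag).text)
                    for tag in ("xmin", "ymin", "xmax", "ymax"))
        boxes.append((label, box))
    return (width, height), boxes

size, boxes = parse_annotation(SAMPLE_XML)
print(size)   # (224, 224)
print(boxes)  # [('cattle face', (32, 40, 180, 200))]
```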

Implementation Details
The experiment was conducted on a desktop computer running 64-bit Windows 10 with an NVIDIA GeForce GTX 1080 graphics card. The proposed framework was written in Python 3.6 using available libraries including numpy 1.16.5 and scikit-learn 0.21.3. Keras 2.3.1 combined with tensorflow-gpu 2.1.0 was installed to provide a deep neural network framework compatible with this Python version.
Transfer learning was adopted because of the limited computing resources and training data. Transfer learning fine-tunes an existing model for the intended task. The backbones used in the proposed framework were initialized from a ResNet model pretrained on the COCO dataset and from VGG and Densenet models pretrained on the ImageNet dataset. The 200,000 training iterations took approximately 17 h in total, and the best-performing epoch for each model was chosen on the testing data after the training loss converged. The Intersection-over-Union (IoU) threshold for matching a confident prediction to a ground-truth bounding box was set at 0.5 for all network models.
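The IoU criterion used above can be made concrete with a small helper; boxes are assumed to be (xmin, ymin, xmax, ymax) tuples, and a detection counts as correct when its IoU with the ground truth reaches the 0.5 threshold (a sketch, not the authors' code):

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two (xmin, ymin, xmax, ymax) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# A prediction shifted slightly from the ground truth still passes IoU >= 0.5.
gt = (0, 0, 100, 100)
pred = (10, 10, 110, 110)
print(round(iou(gt, pred), 3))  # 0.681
```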

Performance Analysis with Different Backbones
As noted in Section 3.1, the original ResNet 50 backbone of RetinaNet can be replaced with ResNet 101, ResNet 152, VGG 16, VGG 19, Densenet 121, or Densenet 169. The experiment compared RetinaNet with ResNet 50 against these various backbone CNNs. The results in Figure 5 compare the Average Precision (AP) and Average Processing Time (Atime) of the different backbones over 1000 images, comprising 500 positive samples with cattle faces and 500 negative samples without. In addition, to assess the performance of the various models on cattle face detection in more detail, we also computed the True Positives (TP), False Positives (FP), and False Negatives (FN) of the seven backbones and then calculated the corresponding precision, recall, and F1 score, as presented in Table 1.
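The metrics in Table 1 follow directly from the TP, FP, and FN counts. The counts below are an inference chosen to reproduce the ResNet 50 row reported later (the paper gives the metrics rather than the raw counts), so they should be read as illustrative:

```python
def detection_metrics(tp, fp, fn):
    """Precision, recall, and F1 score from detection counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Inferred counts: all 500 faces found, one false alarm.
p, r, f1 = detection_metrics(tp=500, fp=1, fn=0)
print(f"precision={p:.4f} recall={r:.4f} F1={f1:.4f}")
# precision=0.9980 recall=1.0000 F1=0.9990
```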

It can be seen from Figure 5 that the average precision of VGG 16 and VGG 19 is slightly higher than that of ResNet 50, achieving the best average precision, while the average processing time of ResNet 50 outperforms all other backbones. Densenet performs poorly on cattle face detection, with a best average precision of 88.35% and a fastest processing time of 0.1370 s. AP and Atime are both significant metrics for how practical the system might be in actual use. Therefore, considering both processing time and accuracy, the detection algorithm with ResNet 50 as the feature extraction model is regarded as having the best performance, with an AP of 99.8% and an Atime of 0.0438 s per image.
As observed in Table 1, the cattle face detection model using ResNet 50 yields a precision of 99.8%, a recall of 100%, and an F1 score of 0.9990, all higher than the other backbones. Moreover, the cattle face detection errors show that the ResNet 50 model achieves the lowest FP and FN rates, with only 1 in 500 cattle faces potentially being misclassified. In contrast, although the deeper networks, including ResNet 101, ResNet 152, and the VGG architectures, perform well on some error counts, they produce more falsely detected cattle faces, especially VGG. Consistent with the results in Figure 5, the lowest precision, recall, and F1 scores are reported for Densenet, owing to its high FP and FN counts and the lowest TP count. Some representative predictions on test images processed by the seven different backbones are visualized in Figure 6.

Comparison with Other State-of-the-Art Object Detection Algorithms
The proposed RetinaNet-based multi-view cattle face detection is also compared against typical existing object detection approaches to show its advantages. Yolov3 and Faster R-CNN are representative object detectors in practice. For instance, Faster R-CNN has been applied to multi-class fruit detection [42][43][44], livestock detection [45], posture detection of pigs [46], and cattle face detection [35]. Yolov3 has also been applied to fruit and fruit disease detection [47][48][49][50], plant disease and pest detection [51][52][53], livestock behavior detection [47,54], and fish detection [55]. Experiments in this paper therefore compare the testing results of these competing methods against the ground truth, and the results are summarized in Table 2.
It is observed from Table 2 that RetinaNet with ResNet 50 shows better detection performance than Yolov3 and Faster R-CNN in both detection accuracy and the computational requirements for future online detection (AP of 99.8% and Atime of 0.0438 s). The results indicate that RetinaNet is the most competent in real-world practice, as the datasets cover different complex scenes with severe face-pose variation and different degrees of occlusion. Yolov3 nearly matched RetinaNet in AP (99.68% vs. 99.8%), and Faster R-CNN nearly matched it in F1 score (0.9970 vs. 0.9990); however, the F1 score is the preferable metric for "true positive detection", whereas average precision is preferable for assessing "boundary extraction" of the cattle face. Therefore, Yolov3 and Faster R-CNN are not sufficiently reliable for complex multi-view cattle face detection.


Evaluation of Multi-View Cattle Face Detection Results
The major misdetections of the abovementioned algorithms concern multi-view cattle faces in complex conditions. To compare the results of multi-view cattle face detection in different scenes, 100 images were selected from the 500 positive samples for the three scenes of partial occlusion, light change, and posture change, and the AP values and F1 scores were calculated separately for the competing detection models, as shown in Table 3. As seen in Table 3, RetinaNet with ResNet 50 outperforms Yolov3 and Faster R-CNN in all three particularly challenging situations. All three detection models give very accurate results, with an AP of 100% and an F1 score of 1.0000, in the situation with light changes, which implies that CNN-based deep learning algorithms are robust to illumination variations. However, under partial occlusion there are inaccurate detection boundaries with Yolov3 and false cattle face detections with Faster R-CNN, while the performance of RetinaNet remains relatively high. Although none of the three detection models presents good results in posture change situations, RetinaNet achieves better detection and boundary accuracy owing to the FPN structure and the focal loss in the model. Faster R-CNN benefits from the RPN commonly used in two-stage detectors, and thus its boundary precision is higher than that of Yolov3. To let readers visually compare the results, this paper shows the predictions of the competing methods under the partial occlusion and posture change situations in Figure 7.


Discussion
This paper evaluated an up-to-date object detector, RetinaNet, to automate the face detection stage of a livestock identification vision system on the farm. The key novelty of the study is the evaluation of the RetinaNet algorithm with various backbones, and its comparison with typical competing detection models, for multi-view cattle face detection in complex and realistic cattle production scenarios. Detection in this paper essentially amounts to bounding-box localization plus classification with a confidence score. Previous studies of cattle face detection suffered from bounding-box deviation [56] and from the difficulty of collecting datasets in complex scenarios [35]. The strength of RetinaNet is its combination of relatively high detection accuracy and fast per-image processing time. This enables the development of further algorithms for tasks such as facial expression assessment from the imagery for welfare monitoring. Cattle face detection is the first step toward real-time individual livestock identification in farming environments, with applications such as the cattle insurance industry, meat product traceability [57], and other animal welfare improvements.
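RetinaNet's accuracy as a one-stage detector is usually attributed to its focal loss, which down-weights easy, well-classified anchors so training concentrates on hard examples such as occluded faces. A minimal sketch of the binary focal loss FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t); the alpha and gamma values are the common RetinaNet defaults, not hyperparameters reported in this study:

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss for a single prediction.

    p: predicted probability of the positive (face) class
    y: ground-truth label, 1 for face, 0 for background
    With gamma = 0 this reduces to alpha-weighted cross-entropy.
    """
    p_t = p if y == 1 else 1.0 - p            # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# An easy, well-classified background anchor (p_t = 0.99) is scaled by
# (1 - 0.99)^2 = 1e-4 relative to cross-entropy, while a hard face
# example (p_t = 0.1) keeps most of its loss, so hard examples dominate.
easy = focal_loss(p=0.01, y=0)
hard = focal_loss(p=0.10, y=1)
print(f"easy background loss={easy:.8f}  hard face loss={hard:.4f}")
```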
Transfer learning is an essential part of machine learning, as pretrained CNN models can be fine-tuned and re-trained to perform new tasks when only limited annotated data is available for training. However, the generalization capability of a deep network can vary across datasets depending on its architecture [43,58,59]. This study therefore quantitatively compared the performance of ResNet, VGG, and Densenet at different depths to select the optimal backbone for this detection task. The results indicate that RetinaNet with ResNet 50 achieves the best performance, with an average precision of 99.8%, an F1 score of 0.9990, and an average processing time of 0.0438 s. Since a stronger backbone can improve detection accuracy, and no single pretrained CNN model is universally preferred for object detection, the backbone should be adjusted and optimized for the circumstances and application. For instance, Yolov3 incorporating DenseNet was considered to perform well for apple detection across growth periods [49], yet ResNet may be better for fruit detection and instance segmentation [43], and plant disease detection achieves better results with the VGG architecture [60].
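The fine-tuning described above follows the usual transfer learning recipe: keep the pretrained backbone fixed (or nearly so) and retrain the task-specific head on the new data. A framework-agnostic sketch of that idea; the tiny linear "backbone" and logistic head are illustrative stand-ins, not the networks used in the paper:

```python
import math

def pretrained_backbone(x):
    """Stand-in for a frozen pretrained feature extractor: its
    parameters are fixed and never updated during fine-tuning."""
    return [x[0] + x[1], x[0] - x[1]]  # two fixed 'features'

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def finetune_head(data, lr=0.5, epochs=200):
    """Train only the head weights on top of the frozen backbone."""
    w = [0.0, 0.0]  # trainable head parameters
    for _ in range(epochs):
        for x, y in data:
            feats = pretrained_backbone(x)  # no gradient flows back here
            p = sigmoid(sum(wi * fi for wi, fi in zip(w, feats)))
            grad = p - y                    # d(log-loss)/dz for logistic head
            w = [wi - lr * grad * fi for wi, fi in zip(w, feats)]
    return w

# Toy task: the label is 1 exactly when the first backbone feature is positive,
# so only the small head needs to learn, as in fine-tuning with limited data.
data = [([1.0, 1.0], 1), ([-1.0, -1.0], 0), ([2.0, 0.0], 1), ([0.0, -2.0], 0)]
w = finetune_head(data)
feats = pretrained_backbone([1.5, 0.5])
print(sigmoid(sum(wi * fi for wi, fi in zip(w, feats))))  # close to 1
```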
To further demonstrate the feasibility of the proposed framework, this study compared its performance with two competitive object detection algorithms on the same datasets. The results illustrate that the AP and Atime of the RetinaNet with ResNet 50 model are significantly better than those of the other two models, reflecting the superiority of the proposed cattle face detection model. Given the multi-view faces produced by unstructured scenes in actual cattle production, such as overlapping, occlusion, and illumination changes, cattle face detection accuracy can be reduced to some extent. The F1 scores and average precision were therefore assessed over unstructured scenes, and it is worth noting that RetinaNet outperformed the other algorithms. Some detection results of cattle faces are shown in Figure 8. For partial occlusion and light variation in particular, the accuracy of cattle face detection using RetinaNet reaches 100%, but the posture-change situation remains particularly challenging for RetinaNet and for computer vision in general. The main reason for this performance gap is likely the variety of behaviors, such as leaning over to graze or drink and lying on one side to rest, which complicate cattle face detection.
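The boundary accuracy discussed in these comparisons is conventionally judged by intersection-over-union (IoU) between a predicted box and its ground truth: a detection counts as correct only when IoU exceeds a chosen threshold (commonly 0.5). A minimal sketch with boxes given as (x1, y1, x2, y2):

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])  # intersection corners
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# A predicted box shifted by half its width overlaps the ground truth
# with IoU = 50 / 150 = 1/3, failing a 0.5 threshold despite the overlap.
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # 0.333...
```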

Conclusions
Deep learning for object detection and image processing is crucial to a livestock identification system that substitutes for wearable devices such as RFID ear tags, thereby reducing harm to the animals. Toward a livestock machine vision system capable of monitoring individuals, this paper focused on cattle face detection, an important component of the envisaged future technology. The state-of-the-art RetinaNet detection model proposed in this study was assessed on various unstructured scenes. The evaluated metrics were strong across a range of scenarios, with an average precision of 99.8% and an average processing time of 0.0438 s. The results indicate that the proposed model is particularly effective for detecting cattle faces under illumination changes, overlapping, and occlusion. Compared with existing algorithms, the proposed model shows better universality and robustness in both accuracy and speed, making it more applicable to real scenes. However, training and testing were conducted under the same conditions in this work, so the robustness of the system to unseen environments remains open, and further experiments are needed.
This work shows the potential of integrating the computer vision system into mobile apps to perform not only livestock detection, counting, and individual identification, but also facial expression recognition for animal welfare. Despite the high success of the proposed method, it is still far from a generic tool usable in actual livestock production scenarios. Future work will focus on a lightweight neural network to improve the running speed of cattle face detection, and on building an autonomous livestock individual identification system based on facial features.
