Goat-Face Recognition in Natural Environments Using the Improved YOLOv4 Algorithm

Abstract: In view of the low accuracy and slow speed of goat-face recognition in real breeding environments, dairy goats were taken as the research objects, and video frames were used as the data sources. An improved YOLOv4 goat-face-recognition model was proposed to improve the detection accuracy; the original backbone network was replaced by a lightweight GhostNet feature-extraction network. The pyramid network of the model was improved into a spatial pyramid structure with a channel-management mechanism. The path-aggregation network of the model was improved into a double-parameter fusion network with a residual structure, in order to improve the model's ability to detect fine-grained features and distinguish differences between similar faces. The transfer-learning pre-training weight-loading method was adopted, and the detection speed, the model weight, and the mean average precision (mAP) were used as the main evaluation indicators of the network model. A total of 2522 images from 30 dairy goats were augmented, and the training set, validation set, and test set were divided according to a ratio of 7:1:2. The test results of the improved YOLOv4 model showed that, in frontal face detection, the mAP reached 96.7% and the average frame rate reached 28 frames/s. Compared with the traditional YOLOv4, the mAP improved by 2.1% and the average frame rate improved by 2 frames/s. The new model can effectively extract the facial features of dairy goats, which improves the detection accuracy and speed. In profile face detection, the average detection accuracy of the improved YOLOv4 goat-face-recognition network reached 78%. Compared with the traditional YOLOv4 model, the mAP increased by 7%, which demonstrates the improved profile-recognition accuracy of the model. In addition, the improved model helps to improve the recognition accuracy for goat faces posed at different angles, and provides a technical basis and reference for establishing a goat-face-recognition model in complex situations.


Introduction
In recent years, machine-vision technology has developed rapidly in the field of target-individual identification [1][2][3][4]. With the further development of precision agriculture, deep-learning methods have been widely used in agricultural pest identification [5][6][7] and biometric identification [8][9][10], and corresponding progress has been made. The precise identification of individual livestock has become a pressing problem [11][12][13][14]. Common livestock-identification methods are mainly divided into contact and non-contact kinds [15,16]. At present, non-contact identification technology mainly includes individual identification methods based on physiological characteristics, such as irises and retinal vessels [17,18]. With these non-contact methods, data collection is complicated because herd animals do not readily cooperate, which results in poor practical applicability.
Face-recognition technology offers advantages in that it is natural and intuitive, does not involve contact, and does not require the cooperation of livestock in a fixed posture; in addition, face-recognition technology has strong anti-interference ability and wide practical-application prospects. Therefore, contactless identification based on visual biometrics has become a new trend in individual-livestock identification [19][20][21]. Chen et al. [22] proposed a lightweight convolutional-neural-network cow-facial-recognition algorithm suitable for edge-computing applications, with an average detection accuracy of 90%. A neural-network model based on a single bovine nose-tip-texture feature by Kumar et al. [23] achieved discrimination between different individuals with 98.99% accuracy. Huang et al. [24] used a multiscale local-differential-direction-number (MLDDN) model for the facial recognition of pigs. Weng et al. [25] proposed a cow-face-recognition model based on a double-branch convolutional neural network (TB-CNN), which has a good detection accuracy. Yang et al. [26] proposed a YOLOv4 detection network incorporating coordinate information to achieve the accurate identification of individual cows, with an average recognition accuracy of 93.4%. Yan et al. [27] proposed an FPA-Tiny-YOLO model combining pyramid attention and Tiny-YOLO to enhance feature-extraction ability and target-detection accuracy and to solve the problem of individual pig adhesion and occlusion. Hu et al. [28] introduced the dual-attentional-feature mechanism into the Mask-RCNN network structure, which can achieve individual pig segmentation in complex environments. He et al. [29] improved detection accuracy by introducing a dense connection-block structure in the YOLOv3 backbone network to achieve the detection of small, occluded targets at long distances. Wang et al. [30] proposed a multi-scale convolutional-neural-network-based individual-pig-identification model with 92% accuracy to perform contactless individual-pig identification in complex and variable environments. Yang et al. [31] used a fully convolutional network structure for image segmentation to perform the fast and accurate identification of lactating sows in a piggery environment, with better detection results.
Currently, scholars at home and abroad rarely pay attention to the identification of individual herding goats. Han [32] proposed an improved VGGNet pain-expression recognition algorithm with a recognition accuracy of up to 96.06%, which solved the problems of high experience requirements and low recognition accuracy in current manual pain-recognition processes for individual goats. Zhang et al. [33] proposed an improved MobileFaceNet goat-face-recognition network with an accuracy of 97.91%. The above study improves the model's feature-extraction effect by introducing a spatial-attention mechanism and a spatial-transformation module, but the recognition accuracy decreases when the differences between goat facial-texture features become smaller and the similarity increases. In addition, two-stage target-detection algorithms require large amounts of computational resources, hardware, and software, which makes them difficult to apply in practical conditions. To achieve high accuracy, low cost, and high efficiency in non-contact goat recognition, the main contributions of this paper are as follows: (1) A YOLOv4 goat-face-recognition network based on GhostNet is proposed to reduce the number of model parameters and the computational effort. (2) Because goat facial-texture features show small differences and high similarity, a channel-management mechanism with a pyramid structure is introduced to improve the detection capability and accuracy of the model for fine-grained features. (3) The original path-aggregation network (PANet) is changed to a two-parameter PANet structure to improve the generalization performance of the model. (4) In order to comprehensively evaluate the improved goat-face-recognition model, a goat-face training set, validation set, and test set were produced, and the model was compared with the traditional YOLOv4. The results show that the improved model helps to improve the recognition accuracy for dairy-goat faces posed at different angles, which provides a technical basis and reference for establishing a goat-face-recognition model in complex situations.

Experimental Data Sources
The goat-face videos used in the experiment were recorded at a standardized indoor goat farm in Li Zhuang Village, Yichuan County, Luoyang City, Henan Province, China. Thirty adult (35-45 kg) Saanen dairy goats were selected as the test subjects and marked in advance, as shown in Figure 1. A Canon camera was used to track a single dairy goat at a frame rate of 30 fps, and each video recording lasted between 15 and 30 min to ensure the effectiveness of the recorded video.

[Figure 1: marker maps of the test dairy goats; panels: Goat 2, Goat 3, Goat 4, Goat 5]

Data Pre-Processing and Labeling
In this study, one image was extracted from every 25 frames of the collected video, and valid images that differed clearly from one another were selected from the retained images as the sample data for the experiment. Eventually, a total of 3428 valid images were screened. A labeling tool was used to annotate the images in the Pascal VOC annotation format and to generate XML annotation files. The whole dataset was divided according to the ratio of 7:1:2. In order to improve the generalization performance of the model, images of different scales were augmented in four ways: random rotation (−15° to 15°), mirror flip, horizontal flip, and brightness change. The annotation file of each image was transformed at the same time, which produced the training set (9598 images), validation set (1371 images), and test set (2742 images). The overall process was divided into four parts: video image processing, image augmentation, dataset division and model training, and model validation, as shown in Figure 2.
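The 7:1:2 division can be sketched as follows. This is a minimal illustration rather than the authors' script: the function name `split_dataset`, the fixed random seed, and the use of integer indices in place of image files are all assumptions. Applied to 13,711 items (the sum of the three reported set sizes), rounding the 70% and 10% shares to the nearest integer gives 9598, 1371, and 2742 items.

```python
import random

def split_dataset(items, ratios=(7, 1, 2), seed=0):
    """Shuffle items and split them into train/val/test by integer ratios."""
    items = list(items)
    random.Random(seed).shuffle(items)  # hypothetical fixed seed for reproducibility
    total = sum(ratios)
    n = len(items)
    # Round the first two shares to the nearest integer; the test set takes the rest.
    n_train = (n * ratios[0] + total // 2) // total
    n_val = (n * ratios[1] + total // 2) // total
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test

# Integer indices stand in for image/annotation file pairs.
train, val, test = split_dataset(range(13711))
print(len(train), len(val), len(test))
```

Shuffling before slicing keeps frames from the same video from clustering in one subset; whether the original split was random or grouped by goat is not stated in the text.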

YOLOv4 Algorithm
The YOLOv4 target-recognition network makes a series of improvements to YOLOv3. It consists of four components: a backbone-feature-extraction network (CSPDarkNet53), a spatial pyramid (SPP), a path-aggregation network (PANet), and a head network (YOLO Head). The specific structure is shown in Figure 3.
Figure 3. Structure diagram of YOLOv4 model. Note: CSPX stands for cross-stage partial structure, Conv stands for convolution, BN stands for batch normalization, CBL stands for the Conv+BN+LeakyReLU-activation-function-synthesis module, CBM stands for the Conv+BN+Mish-activation-function-synthesis module, ResUnit stands for the residual connection module, Concat stands for the feature-concatenation operation, Up stands for the upsampling operation, Maxpool stands for the pooling operation, and *3 and *5 stand for the number of repetitions of the CBL module.
YOLOv4 combines the advantages of the CSPNet and DarkNet53 feature-extraction networks, replacing the DarkNet53 backbone network of the original YOLOv3 with the CSPDarkNet53 backbone network. The CSPDarkNet53 feature-extraction network consists of five residual modules, CSP1 to CSP5; each residual module consists of small residual structures (ResUnit) and CBM modules stacked together, as shown in Figure 3. The SPP structure is located between the backbone network and the neck network. It uses pooling kernels of three sizes, 13 × 13, 9 × 9, and 5 × 5, and then concatenates the feature maps of different scales with the original feature map for output, which enlarges the receptive field of the network and facilitates the subsequent fusion of feature information in the path-aggregation network. YOLOv4 draws three feature maps of different sizes from CSP3 to CSP5, namely 52 × 52, 26 × 26, and 13 × 13, with the aim of detecting objects of different sizes in the image more comprehensively. The three feature maps are fused bottom-up and top-down using PANet, which enhances the utilization of effective features and prevents the loss of low-order features in the feature-extraction process.
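The feature-map sizes quoted above follow from the network's downsampling strides. The sketch below assumes a 416 × 416 input and the usual YOLOv4 strides of 8, 16, and 32 for the three detection scales. It also shows that stride-1 max pooling with a padding of k // 2 preserves the 13 × 13 map size, which is what allows the pooled SPP branches to be concatenated with the original map. The helper names and the channel count of 512 are illustrative assumptions.

```python
def feature_map_size(input_size, stride):
    """Spatial size of a feature map after downsampling by `stride`."""
    return input_size // stride

def pooled_size(size, kernel, stride=1):
    """Output size of max pooling with a padding of kernel // 2."""
    padding = kernel // 2
    return (size + 2 * padding - kernel) // stride + 1

input_size = 416
scales = [feature_map_size(input_size, s) for s in (8, 16, 32)]
print(scales)  # [52, 26, 13]

# SPP: three stride-1 pooling branches plus the unpooled (identity) branch.
spp_sizes = [pooled_size(13, k) for k in (5, 9, 13)]
channels_in = 512                       # illustrative channel count (assumption)
channels_out = channels_in * (len(spp_sizes) + 1)
print(spp_sizes, channels_out)  # [13, 13, 13] 2048
```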

Improved Goat-Face-Recognition-Algorithm Construction
The original CSPDarkNet53 backbone-feature-extraction network was replaced by a lightweight GhostNet feature-extraction network. The replacement reduces the number of parameters and the amount of computation of the model, and addresses the large parameter count and heavy computation of the YOLOv4 backbone network as well as poor goat-face recognition in complex environments. GhostNet is a more efficient feature-generation method proposed to address feature redundancy in feature-extraction networks, as shown in Figure 4.
The YOLOv4 feature-fusion phase consists of two components: SPP and PANet. The spatial-pyramid structure can extract features of different scales at the pixel level and consider multiple receptive fields in parallel, which gives it a strong recognition effect on both large and small targets. However, in the traditional pyramid structure, feature information from maps of different scales is fused only by linear superposition, which tends to ignore detailed features and does not further extract the important features. Consequently, fusing these features directly through the path-aggregation network may lose important location information. In this study, an improved pyramid structure was used, as shown in Figure 5. An SE (squeeze-and-excitation) channel-management mechanism was introduced between the pooling kernels of different sizes in the SPP to improve the screening of important feature layers and increase the utilization of effective feature layers. The channel-management mechanism distributes weights among the feature maps, so that the features that help target recognition are emphasized. This operation focuses the network on features that contain objects and suppresses secondary information, which improves the detection performance of the model.
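The channel reweighting performed by an SE block can be sketched in plain Python. This is a generic squeeze-and-excitation step, not the authors' implementation: how the block is wired between the SPP pooling branches is not specified in the text, and the tensor sizes, reduction ratio, and random weights below are illustrative assumptions standing in for learned parameters.

```python
import math
import random

def se_reweight(feature_maps, w1, w2):
    """Squeeze-and-excitation over C feature maps (each an H x W nested list)."""
    # Squeeze: global average pooling gives one descriptor per channel.
    squeezed = [sum(sum(row) for row in fm) / (len(fm) * len(fm[0]))
                for fm in feature_maps]
    # Excitation: FC -> ReLU -> FC -> sigmoid gives one weight in (0, 1) per channel.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    weights = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
               for row in w2]
    # Scale: multiply every value in a channel by that channel's weight.
    scaled = [[[v * wt for v in row] for row in fm]
              for fm, wt in zip(feature_maps, weights)]
    return scaled, weights

rng = random.Random(0)
C, H, W, R = 4, 3, 3, 2   # channels, height, width, reduction ratio (illustrative)
x = [[[rng.random() for _ in range(W)] for _ in range(H)] for _ in range(C)]
w1 = [[rng.uniform(-1, 1) for _ in range(C)] for _ in range(C // R)]  # C -> C/R
w2 = [[rng.uniform(-1, 1) for _ in range(C // R)] for _ in range(C)]  # C/R -> C
y, weights = se_reweight(x, w1, w2)
print([round(w, 3) for w in weights])
```

Channels with weights near 1 pass through almost unchanged, while channels with weights near 0 are suppressed, which is the weight distribution among feature maps that the text describes.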
The PANet structure in YOLOv4 has a multi-port feature-fusion effect: it performs bottom-up and top-down feature fusion between shallow and deep features and improves the detection of large, medium, and small objects. However, the transfer path from shallow features to deep features is long, and important feature and localization information is easily lost along it, which causes problems such as low data utilization and unsatisfactory detection accuracy. To address these problems, the PANet is replaced by a PANet with a double-parameter residual structure. It introduces trainable parameters Wi to focus on important features, and it reduces the model size, number of parameters, and amount of computation by using depthwise separable convolution instead of part of the normal convolution in the PANet. At the same time, it improves the feature-fusion capability of the network by adding an output port from the backbone network (104 × 104 × 64). This operation preserves the location information of the lower-order feature maps and adds higher-level abstract semantic information to improve the recognition accuracy and feature-extraction capability of the network in complex situations. The improved YOLOv4 goat-face-recognition algorithm is shown in Figure 6. As can be seen from the figure, the network as a whole is divided into four parts: ① represents the GhostNet backbone network structure; ② represents the improved spatial pyramid structure; ③ represents the improved PANet structure; and ④ represents the head network (YOLO Head). The combined convolution block in Figure 6 contains a DWConv and a Ghost module. The DWConv reduces the number of parameters in the goat-face model and improves the network's recognition speed, while the Ghost module reduces the redundancy of the features in the feature-fusion process and improves the utilization of effective features. W1~W4 denote the trainable parameters added in this experiment, which focus the residual network structure on the effective features and enhance the recognition effect of the network model in complex environments.
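The parameter saving from depthwise separable convolution can be checked with a short calculation. The sketch below compares the weight counts of a standard k × k convolution and a depthwise k × k convolution followed by a 1 × 1 pointwise convolution; the 256-channel, 3 × 3 configuration is an illustrative assumption and is not taken from the model in Figure 6.

```python
def conv_params(c_in, c_out, k):
    """Weights of a standard k x k convolution (bias ignored)."""
    return c_in * c_out * k * k

def depthwise_separable_params(c_in, c_out, k):
    """Depthwise k x k convolution followed by a 1 x 1 pointwise convolution."""
    return c_in * k * k + c_in * c_out

standard = conv_params(256, 256, 3)
separable = depthwise_separable_params(256, 256, 3)
print(standard, separable, round(standard / separable, 1))  # 589824 67840 8.7
```

For this configuration the separable form needs roughly one-ninth of the weights, which is the kind of reduction that replacing part of the normal convolutions in the PANet is meant to achieve.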

YOLOv4 Objective Loss Function
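The YOLOv4 objective loss function consists of four parts, namely, the positive-sample coordinate loss, positive-sample confidence loss, negative-sample confidence loss, and positive-sample classification loss. The loss function is calculated as shown in Equation (1).

$$
\begin{aligned}
L ={}& \lambda_{coord} \sum_{i=0}^{S \times S} \sum_{j=0}^{B} I_{ij}^{obj} \, L_{CIOU} \\
&- \sum_{i=0}^{S \times S} \sum_{j=0}^{B} I_{ij}^{obj} \left[ C_{ij} \log \hat{C}_{ij} + (1 - C_{ij}) \log (1 - \hat{C}_{ij}) \right] \\
&- \lambda_{noobj} \sum_{i=0}^{S \times S} \sum_{j=0}^{B} I_{ij}^{noobj} \left[ C_{ij} \log \hat{C}_{ij} + (1 - C_{ij}) \log (1 - \hat{C}_{ij}) \right] \\
&- \sum_{i=0}^{S \times S} \sum_{j=0}^{B} I_{ij}^{obj} \sum_{c \in classes} \left[ p_{ij}^{gt}(c) \log p_{ij}(c) + (1 - p_{ij}^{gt}(c)) \log (1 - p_{ij}(c)) \right]
\end{aligned}
\tag{1}
$$

where $\lambda_{coord}$ and $\lambda_{noobj}$ represent the positive-sample weight coefficient and the negative-sample weight coefficient, respectively; $\sum_{i=0}^{S \times S} \sum_{j=0}^{B}$ represents the traversal of all prediction boxes; $I_{ij}^{obj}$ and $I_{ij}^{noobj}$ represent the presence or absence of an object, i.e., $I_{ij}^{obj}$ is 1 when an object is present and 0 otherwise, and $I_{ij}^{noobj}$ is 1 when no object is present and 0 otherwise; $\hat{C}_{ij}$ and $C_{ij}$ represent the predicted and true values of the sample, respectively; and $p_{ij}(c)$ represents the predicted probability for category $c$, with $p_{ij}^{gt}(c)$ its true value. The complete intersection over union (CIOU) loss is the loss function used between the predicted box and the ground-truth box in this experiment. The CIOU loss function is calculated as follows.

$$
L_{CIOU} = 1 - IOU + \frac{\rho^{2}(b, b^{gt})}{c^{2}} + \alpha v, \qquad
v = \frac{4}{\pi^{2}} \left( \arctan \frac{w^{gt}}{h^{gt}} - \arctan \frac{w}{h} \right)^{2}, \qquad
\alpha = \frac{v}{(1 - IOU) + v}
$$

where $\rho^{2}(b, b^{gt})$ represents the squared Euclidean distance between the center points of the predicted box $b$ and the ground-truth box $b^{gt}$; $c$ represents the diagonal length of the minimum enclosing region of the predicted box and the ground-truth box; $v$ measures the consistency of the aspect ratio between the predicted box and the ground-truth box; and $\alpha$ represents a trade-off parameter.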
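A minimal sketch of the CIOU loss in its standard form, 1 − IOU + ρ²/c² + αv, is given below, where ρ² is the squared distance between box centers, c² is the squared diagonal of the smallest enclosing box, v is the aspect-ratio consistency term, and α is the trade-off weight. The function name, the (x1, y1, x2, y2) box format, and the small `eps` added to avoid division by zero are assumptions made for illustration.

```python
import math

def ciou_loss(box_p, box_g, eps=1e-9):
    """CIOU loss for two boxes given as (x1, y1, x2, y2) with x2 > x1 and y2 > y1."""
    px1, py1, px2, py2 = box_p
    gx1, gy1, gx2, gy2 = box_g
    # Intersection over union.
    iw = max(0.0, min(px2, gx2) - max(px1, gx1))
    ih = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = iw * ih
    union = (px2 - px1) * (py2 - py1) + (gx2 - gx1) * (gy2 - gy1) - inter
    iou = inter / (union + eps)
    # Squared distance between the two box centers.
    rho2 = (((px1 + px2) - (gx1 + gx2)) / 2) ** 2 + (((py1 + py2) - (gy1 + gy2)) / 2) ** 2
    # Squared diagonal of the smallest box enclosing both boxes.
    c2 = (max(px2, gx2) - min(px1, gx1)) ** 2 + (max(py2, gy2) - min(py1, gy1)) ** 2
    # Aspect-ratio consistency term and its trade-off weight.
    v = (4 / math.pi ** 2) * (math.atan((gx2 - gx1) / (gy2 - gy1))
                              - math.atan((px2 - px1) / (py2 - py1))) ** 2
    alpha = v / (1 - iou + v + eps)
    return 1 - iou + rho2 / (c2 + eps) + alpha * v

print(ciou_loss((0, 0, 2, 2), (0, 0, 2, 2)))   # close to 0 for identical boxes
print(ciou_loss((0, 0, 2, 2), (1, 1, 3, 3)))   # close to 61/63 for this overlap
```

In the second call the boxes share the same aspect ratio, so v is 0 and the loss reduces to 1 − 1/7 + 2/18: the IOU term plus the center-distance penalty.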

Model Training and Parameters
The hardware platform was an Intel(R) Xeon(R) Silver 4210R at 3.5 GHz with 32 GB of memory and an NVIDIA GeForce RTX 2080Ti GPU with 16 GB of video memory. The software platform was PyCharm 2020.2 + cuDNN 7.4.1.5 + Python 3.8 + PyTorch 1.2. In this experiment, the transfer-learning training method was used: the improved goat-face-recognition network was first trained on the COCO dataset, and the COCO-pretrained network weights were then used as initialization, which accelerates model convergence and improves the generalization performance of the goat-face-recognition network. In terms of network-parameter settings, this experiment uniformly set the training-image size to 416 × 416, the training batch size to 16, and the number of training epochs to 100. The weights were saved automatically after each completed epoch. For the first 50 epochs, the backbone layers of the goat-face-recognition network were frozen, and the learning rate (lr) was initially set to 0.001. For the last 50 epochs, the backbone network was unfrozen, and the lr was set to 0.0001 to enhance the network's feature extraction. To enhance generalization and recognition accuracy during training, training techniques such as the Mosaic data-enhancement method, the label-smoothing algorithm, and the cosine-annealing algorithm were used to make YOLOv4 more versatile and robust in detection.
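The two-stage schedule described above can be sketched as a small function that returns, for each epoch, whether the backbone is frozen and what the learning rate is. The initial rates (0.001 and 0.0001) and the 50/50 epoch division come from the text; the cosine-annealing floor of 1% of the initial rate and the restart of the cosine curve at the start of the second stage are assumptions, since the text does not state them.

```python
import math

def cosine_annealing(lr_max, lr_min, step, total_steps):
    """Cosine-annealed learning rate at `step` out of `total_steps`."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * step / total_steps))

def training_plan(epoch, freeze_epochs=50, total_epochs=100):
    """Return (backbone_frozen, learning_rate) for a 0-based epoch index."""
    if epoch < freeze_epochs:
        lr_max, start, length = 1e-3, 0, freeze_epochs
    else:
        lr_max, start, length = 1e-4, freeze_epochs, total_epochs - freeze_epochs
    lr_min = lr_max * 0.01   # assumed floor; not stated in the text
    lr = cosine_annealing(lr_max, lr_min, epoch - start, length)
    return epoch < freeze_epochs, lr

for epoch in (0, 49, 50, 99):
    frozen, lr = training_plan(epoch)
    print(epoch, frozen, f"{lr:.6f}")
```

In a PyTorch training loop, the frozen flag would typically be applied by setting `requires_grad = False` on the backbone parameters during the first stage.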

Model-Evaluation Indicators
In the evaluation indicators, TP represents the number of samples that the model predicts as positive and that are actually positive; FP represents the number of samples that the model predicts as positive but that are actually negative; FN represents the number of samples that the model predicts as negative but that are actually positive; and TN represents the number of samples that the model predicts as negative and that are actually negative.
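The way these counts feed into mAP can be sketched as follows, using the standard definitions precision = TP / (TP + FP) and recall = TP / (TP + FN). The average precision (AP) of one class is taken here as the area under its precision-recall curve with all-point interpolation, and mAP is the mean of the per-class APs. The exact interpolation used in the experiments is not stated in the text, so this choice, the function names, and the toy numbers are assumptions.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def average_precision(recalls, precisions):
    """Area under the precision-recall curve (recalls sorted in ascending order)."""
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Take the precision envelope so that it never increases as recall grows.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))

def mean_average_precision(ap_per_class):
    """Mean of the per-class average precisions."""
    return sum(ap_per_class) / len(ap_per_class)

print(precision_recall(tp=90, fp=10, fn=30))       # (0.9, 0.75)
print(average_precision([0.5, 1.0], [1.0, 0.5]))   # 0.75
```

For detection, a prediction counts as a TP when its IOU with a ground-truth box of the same class reaches the chosen threshold.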

Comparison of Frontal Face Results of Different Models
This experiment used a series of improved YOLOv4 goat-face-recognition models and the original YOLOv4 model to detect the frontal faces of the goats, and the results are shown in Table 1. The mAPs in the table were all obtained at IOU = 0.5. In Table 1, ① represents the replacement of the original YOLOv4 backbone network with the lightweight GhostNet structure, ② represents the replacement of the pyramid network with a network structure that adds an attention mechanism, and ③ represents the replacement of the original path-aggregation network with a path-aggregation network with a double-parameter residual structure. As shown in Table 1, after replacing the YOLOv4 backbone network with GhostNet, the mAP decreased by 8.8%, although the frame rate reached 35 frames/s. To verify the effectiveness of the network structure, the network with the replaced backbone was combined with ② and ③, respectively, which yielded significant improvements compared to replacing only the YOLOv4 backbone. The mAPs of YOLOv4+①+②, YOLOv4+①+③, and YOLOv4+①+②+③ were 89.9%, 93.4%, and 96.7%, respectively; after introducing ①②③, the mAP of the goat-face-recognition network was 2.1% higher than that of the YOLOv4 recognition network. In terms of detection speed and model size, the improved YOLOv4 goat-face-recognition network recognized the animals faster than YOLOv4, with a frame rate of up to 28 frames/s and a model weight reduced to one-fourth of the YOLOv4 weight. This study also reports the model parameters, the memory required for model-node inference, and the model computation of the goat-face-recognition model. As shown in Table 1, the improved goat-face-recognition model reduced the model parameters and model computation significantly. However, the memory required for model-node inference changed little, which affects the frame rate to some extent, and inference was slower than with the backbone replacement alone. Nevertheless, the model was still faster than the original YOLOv4. Figure 7 shows the frontal face-recognition results of each model for goats 9, 11, 13, and 22, from which it can be seen that each model can accurately recognize the corresponding goats without omission or misrecognition. However, the improved YOLOv4+①+②+③ goat-face-recognition model demonstrated the best results and had the highest detection accuracy.

[Figure 7: frontal face-recognition results of each model; columns: Number, YOLOv4, YOLOv4+①, YOLOv4+①+②, YOLOv4+①+③, YOLOv4+①+②+③]
The validation-set-loss (val_loss) curves of each model over the 100 training epochs were plotted, as shown in Figure 8. From the figure, it can be seen that the val_loss curves of the different models all converged steadily as training progressed. After the introduction of the ①②③ structures in this study, the val_loss curve of the goat-face-recognition network was the smoothest, which further demonstrates the network's effectiveness and stability.

Recognition Results of Different Models for Side-Facing Dairy Goats
Side-face recognition is unavoidable in the face-recognition process for dairy goats because of physical occlusions, such as fences, and the behavior of the goats themselves. Therefore, the side-face recognition of goats is of great importance for their identity verification. To some extent, it represents the quality of goat-face-recognition networks and their resistance to external influences. In this experiment, 225 side-face photographs of five dairy goats from outside the dataset were selected to test the built goat-face-recognition model, and the results are shown in Table 2.
Table 2. Side-face-recognition results of goats with different side faces.

Model             Goat6   Goat9   Goat13   Goat17   Goat21   mAP (%)
YOLOv4            38      42      24       32       24       71
YOLOv4+①          38      42      24       32       24       58
YOLOv4+①+②        38      42      24       32       24       69
YOLOv4+①+③        38      42      24       32       24       72
YOLOv4+①+②+③      38      42      24       32       24       78

The distribution of the side-face photographs of the five goats is shown in Table 2. For the side-face photographs, the amount of data was relatively small. Since the different models were tested on the same side-face photographs, the test data of each goat are the same for all models in Table 2. From Table 2, it can be observed that GhostNet, as a lightweight network, has a smaller structure than CSPDarkNet53 and is less effective at the side-face recognition of goats. Therefore, the mAP of the goat-face-recognition network after replacing the backbone decreased significantly, by 13%, compared to the YOLOv4. Table 2 also shows that the mAP improved significantly when the recognition network with the replaced backbone was combined with the improved ② and ③ structures, respectively. In side-face recognition, the mAP of the goat-face-recognition network increased to 69% after introducing the ② structure, 72% after introducing the ③ structure, and 78% after adding both the ② and ③ improved structures. The goat-face-recognition network in this study improved the side-face recognition by 7% compared to the YOLOv4, which indicates the effectiveness of the network for side-face recognition.
As the color of goat faces is mainly pure white, some goat faces are highly similar, which increases the difficulty of identifying the side faces of goats and leads to misidentification and omission. Examples of misidentified and missed goat-face detections in the five categories of side-face images mentioned above are shown in Figure 9. Since the side-face images contain limited goat-face features, it is difficult for the recognition network to capture important goat-face features during feature extraction. As can be observed in Figure 9, YOLOv4 was weak at side-face recognition: it misidentified goat 13 as goat 17 and missed goat 21. The AP was only 67% for both goat 13 and goat 17. The YOLOv4+① network structure misidentified goat 21 as goat 20, and the mAP was only 43%. Figure 9 also shows that the improved goat-face-recognition structure of this experiment did not produce this misrecognition, even though the side-face images contained fewer important features, but the omission was not resolved. The improved network structure still failed to detect goat 21, although the AP improved to 75%.

[Figure 9: side-face-recognition results of each model; columns: Number, YOLOv4, YOLOv4+①, YOLOv4+①+②, YOLOv4+①+③, YOLOv4+①+②+③]
By detecting photographs of five side-facing dairy goats, this experiment further demonstrates the effectiveness of the present network at identifying individual goats and the robustness of the model. In this study, the introduction of the attention mechanism in the pyramid structure enhances the fine-grained feature extraction of the goat-face-recognition network and the detection of differences between similar faces. In addition, the path network structure of the original YOLOv4 was improved to a residual path structure with trainable parameters. The trainable parameters enhance the screening and extraction of important goat-face features and further improve the detection accuracy of the model. Although the goat-face-recognition network based on YOLOv4+①+②+③ has high accuracy in frontal face recognition, its side-face recognition still needs further improvement to raise the accuracy with which it identifies individual goats.

Conclusions
(1) The backbone network in YOLOv4 was replaced by the GhostNet lightweight network structure to address the large number of YOLOv4 network parameters, the low accuracy of goat-face recognition, and the slow recognition speed. After replacing the backbone, the goat-face-recognition network has fewer network parameters, and the operation speed and detection efficiency of the model are improved.
(2) The SPP and PANet structures in YOLOv4 were changed to a pyramid structure with a spatial attention mechanism and a double-parameter fusion network with a residual structure, respectively. The improved goat-face-recognition network enhances the detectability of fine-grained features and improves the detection of similar faces. In frontal face recognition, the improved network exceeded the YOLOv4 by 2.1%, and the mAP reached 96.7%. In side-face detection, the improved goat-face-recognition model exceeded the YOLOv4 by 7%. The model's detection speed was up to 28 frames/s, which meets the needs of real-time monitoring. However, the network still needs to be improved in terms of side-face recognition to raise the accuracy with which it identifies individual goats.
(3) This study mainly focuses on the facial-texture features of goats, which differ little between individuals and are difficult to distinguish, and it proposes a low-cost and high-efficiency improved lightweight YOLOv4 face-recognition model. In order to further achieve individual-goat recognition in flock scenarios, future research will be carried out on flock goats on large-scale farms. A goat-face-detection network will be constructed to crop the goat faces, and the cropped data will be transmitted to the improved YOLOv4 model to achieve the recognition of goats in multiple situations.