Happy Cow or Thinking Pig? WUR Wolf—Facial Coding Platform for Measuring Emotions in Farm Animals

Emotions play an indicative and informative role in the investigation of farm animal behaviors. Systems that can measure and respond to emotions provide a natural user interface for the digitalization of animal welfare platforms. The faces of farm animals are one of the richest channels for expressing emotions. WUR Wolf (Wageningen University & Research: Wolf Mascot), a real-time facial recognition platform that can automatically code the emotions of farm animals, is presented in this study. The developed Python-based algorithms detect and track the facial features of cows and pigs; analyze their appearance, ear postures, and eye-white regions; and correlate these with the mental/emotional states of the farm animals. The system was trained on a dataset of facial images of farm animals collected across six farms and has been optimized to operate with an average accuracy of 85%. From these features, the emotional states of the animals are determined in real time. The software detects 13 facial actions and nine inferred emotional states, including whether the animal is aggressive, calm, or neutral. A real-time emotion recognition system based on YOLOv3, the faster YOLOv4, and a Faster R-CNN facial detection platform is presented. Detecting the facial features of farm animals simultaneously in real time enables many new interfaces for automated decision-making tools for livestock farmers. Emotion sensing offers a vast potential for improving animal welfare and animal–human interactions.


Introduction
Digital technologies, in particular precision livestock farming and artificial intelligence, have the potential to drive a transformation in animal welfare [1]. To ensure access to sustainable and high-quality health care and welfare in animal husbandry management, innovative tools are needed. Unlocking the full potential of automated measurement of the mental and emotional states of farm animals through digitalization, such as facial coding systems, would help blur the lines between biological, physical, and digital technologies [1,2].
Animal caretakers, handlers, and farmworkers typically rely on hands-on observations and measurements while investigating methods of monitoring animal welfare. To avoid the increased handling of animals in the process of taking functional or physiological data, and to reduce the subjectivity associated with manual assessments, automated animal behavior and physiology measurement systems can complement the current traditional welfare assessment tools and processes in enhancing the detection of animals in distress or pain in the barn [3]. Automated and continuous monitoring of animal welfare through digital alerting is rapidly becoming a reality [4].
In the human context, facial analysis platforms have long been in use for various applications, such as password systems on smartphones, identification at international border checkpoints, identification of criminals [5], diagnosis of Turner syndrome [6], detection of genetic disorder phenotypes [7], as a potential diagnostic tool for Parkinson disease [8], measuring tourist satisfaction through emotional expressions [9], and quantification of customer interest during shopping [10].

Understanding Animal Emotions
The human comprehension of animal emotions may seem trivial; however, it is a mutually beneficial skill. Whether animals can express complex emotions, such as love and joy, is still debated within behavioral science. Other emotions, such as fear, stress, and pleasure, are more commonly studied [14]. These basic emotions affect how animals feel about their environment and interact with it. They also shape an animal's interactions with its conspecifics [15].
Non-domesticated species of animals are commonly observed in the wild and maintained in captivity to understand and conserve their species. Changes in the natural environment, because of human actions, can be stressful for individuals within a species. Captive non-domesticated animals also experience stress created through artificial environments and artificial mate selection. If even one animal experiences and displays signs of stress or aggression, its companions are likely to understand and attempt to respond to the emotional state [16]. These responses can result in stress, conflict, and the uneven distribution of resources [17]. The understanding of emotional expression in captive animals can help caretakers determine the most beneficial forms of care and companion matching for each individual, resulting in a better quality of life for the animals in question.
Companion animals are another category which can benefit from a deeper understanding of animal emotion. Just like humans, individual animals have different thresholds for coping with pain and discomfort. Since many companion animals must undergo medical procedures for their health and the well-being of their species, it is important to understand their physical responses. Animals cannot tell humans how much pain they are in, so it is up to their caretakers to interpret the pain level an animal is experiencing and treat it appropriately [18]. This task is most accurately completed when the emotions of an animal are clearly and quickly detectable.
The understanding of expressions related to stress and pain from indicative facial features is impactful in animal agriculture. Animals used for food production often produce higher quality products when they do not experience unpleasant emotions [19]. The detection of individual animals experiencing stress also allows for the early identification of medical complications. A study on sows in parturition showed a uniform pattern of facially expressed discomfort during the birthing cycle [20]. In such a case, facial identification of emotional distress could be used to detect abnormally high levels of discomfort and alert human caretakers to the possibility of dystocia.

Facial Recognition Software
Facial recognition software has been used on human subjects for years. It has even contributed to the special effect capabilities in films and is used as a password system for locked personal devices [21]. It is a non-invasive method that tracks specific points on an individual's face using photos and videos. These points need not be placed directly on the subject's face; instead, computer software can be customized and trained to identify the location of each point. Once this software identifies an individual's characteristics, it can be modified to detect changes in facial positioning and associate these changes with emotional states. In addition to this traditional way of tracking specific points on faces, there are a varied number of approaches [22] to identifying people from their faces.
The method of tracking specific points on faces can also be used to identify individuals and emotional states in animal subjects. With a little software reconstruction, scientists have been able to create reliable systems for the assessment of animal emotions through technological means [23,24]. These systems have been adapted to identify multiple species, including cows, cats, sheep, large carnivores, and many species of non-human primates. In studies focusing on identifying individual members of the same species within a group, the accuracy of specialized facial recognition software was found to be between 94% and 98.7%. Some of these studies even demonstrated the ability of software to identify and categorize new individuals within a group and to identify individuals at night [23][24][25]. Other studies focused more on the emotional expressions that could be identified through facial recognition software, and some showed an accuracy of around 80% when compared to the findings of professionals in the field of animal emotion identification [26]. Differences in facial features, such as the area of the eyes and the size, shape, and form of the ears and snout regions of farm animals, have been used as parameters for identifying individual members in previous studies. The focus of our study is to measure the inferred affective states/emotions of animals, not to identify individual animals.

The Grimace Scale
The facial landmark detection software used is based on a series of points in relation to phenotypic features of the species in question, but it uses an older theory to attach the location of those points to emotional states.
The grimace scale is a template created to depict the physical reactions associated with varying levels of discomfort. These scales are created in relation to a specific species and are defined by a numerical scale [27]. In the case of pigs, sheep, and cattle, grimace scales normally focus on tension in the neck, shape of the eye, tension in the brow, nose bunching, and positioning of the ears [20,28]. These visual cues can be combined with vocal cues to further depict the level of discomfort an animal is experiencing. In species like mice, other expressive physical features must be accounted for, such as whisker movement [29]. For less social species, like cats, the changes in facial expression in response to pain are more minute but still identifiable with the use of a grimace scale [18]. These scales have been proven as an accurate way to assess pain with minimal human bias [27]. They are created through the professional observation of species during controlled procedures that are known to trigger pain receptors. Once created, grimace scales can be converted to specific measurements that are detectable through facial recognition software with the assistance of the Viola-Jones algorithm. This algorithm breaks down the facial structure of animal subjects into multiple sections to refine, crop, and identify major facial features [26].
These features make the technological interpretation of animal emotions feasible across a variety of species and in a variety of settings.

Best Way to Manage Animal Emotion Recognition
Studies are most accurate when the full spectrum of discomfort, from acute low-grade pain to severe chronic pain, is identified. Events of low-grade discomfort are significant; however, they may not be identifiable through the production of the stress hormone cortisol [30]. In such situations, discomfort may only be discernable through the facial expressions of an animal, detectable by facial feature measurement software. Because of their quantitative nature, physiological indicators are typically preferred in the investigation of farm animal emotions. However, because physiological profiles do not correlate precisely with affective states, animal emotion researchers should be cognizant of not relying only on the physiological indicators of stress and pain, but should use a combination of indicators when interpreting animal behavior phenotypes.
On large-scale farms, it is important to keep the animals comfortable and relaxed, but it would be impractical and expensive to test the chemical levels of stress present in every animal. The identification of emotional states through facial recognition software provides a more efficient and cost-effective answer. It also provides an opportunity for the identification of very similar individuals in a way that cannot be illegally altered, unlike ear tags, which are sometimes changed for false insurance claims [25].
The use of facial recognition software also reduces the need for human interaction with animal subjects. For non-domesticated animals, the presence of human observers can be a stressful experience and alter their natural behavior. Facial recognition software allows researchers to review high-quality video and photo evidence of the subject's emotional expressions without any disturbance. Researchers can even record the identification and actions of multiple individuals within a group of animals at the same time with the help of software such as LemurFaceID [24].
Room for human error in the form of bias is reduced with the help of facial recognition software. Since humans experience emotions and have the ability to empathize with other emotional beings, human observers run the risk of interpreting animals' emotional expressions improperly. In a study concerning the pain expressions of sows during parturition, it was noted that female observers rated the sows' pain significantly higher than male observers did [20]. Distinguishing between the emotional sensitivity of the interpreter of animal behavior and measurement error is therefore of fundamental importance in this field. With well-calibrated software, these discrepancies can be greatly reduced, and researchers can spend more of their time finding significant patterns and connections within recorded data rather than recording the data.

Dataset Characteristics
Mapping of images to specific emotion classes of cows and pigs based on indicators of facial features is shown in Figure 1. Images (Figure 2) and videos of cows and pigs were collected from multiple locations: 3 farms in Canada, 2 farms in the USA, and 1 farm in India. The goal of collecting varied videos and images and enhancing the dataset was to improve the generalization of the model and the diversity of the training data. Videos of cows and pigs were converted to images based on frames per second and augmentation methods. For testing the model, creating ground truth, and validating with unseen data, images were collected from the internet by querying search engines with animal keywords, and then cropped. These images were also manually annotated for the potential inferred affective states perceived as emotions of the farm animals. Images were included for analysis only when the entire face of the cow or pig was contained in the frame, with the ears, eyes, and snout visible. The dataset consisted of 7800 images and over 150 videos from a total of 235 pigs and 210 dairy cows. In our on-farm scenario, cows and pigs were often crowded together, with occlusions due to mechanical farm structures and other barriers in the environment. Hence, images and videos were cropped before being used for further processing. Videos were converted into images and grouped into folders after labelling. The length of the videos ranged from 2 to 6 min each. All the images were grouped and categorized into multiple subfolders based on three emotions of cows and six emotions of pigs. The farm animals' facial expressions progressed from positive to neutral to negative states and returned to neutral during the data collection process. No elicitation or inducement of affective states in the farm animals was conducted during data collection.
Datasets were split into 70% for training, 10% for validation, and 20% for testing for the evaluation of all three models. At least 100 images were present for each specific emotion in the test-dataset folders. The data obtained from the internet came from live farm animals, not virtual images, and were cross-checked with their sources. One of the core design requirements of an emotion recognition model is the availability of sufficient labeled training data with varying features and conditions. For this reason, the data collected from farms were used predominantly; data from the internet were not used in training the model but only for testing it.
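The 70/10/20 split described above can be sketched as follows; the file names and random seed are illustrative, not those used in the study:

```python
import random

def split_dataset(files, train=0.70, val=0.10, seed=42):
    """Shuffle and split a list of image paths into train/val/test
    subsets (the remainder after train+val becomes the test set)."""
    files = list(files)
    random.Random(seed).shuffle(files)
    n = len(files)
    n_train = int(n * train)
    n_val = int(n * val)
    return (files[:n_train],
            files[n_train:n_train + n_val],
            files[n_train + n_val:])

# Example: one folder of labelled images per emotion class
images = [f"calm/cow_{i:04d}.jpg" for i in range(100)]
train_set, val_set, test_set = split_dataset(images)
print(len(train_set), len(val_set), len(test_set))  # 70 10 20
```

Splitting at the file-list level, as here, keeps each image in exactly one subset, which is what the per-class minimum of 100 test images presupposes.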

Features and Data Processing
Several recent studies have clearly laid the foundation for the measurement of emotional states of farm animals based on their facial features, such as ears, eyes, and orbital tightening (Table 1). The farm animal's emotional state is inferred from the evidence of correlations between the facial features and the physiological responses and inferred internal states. The collected and grouped image dataset was divided into nine classes based on the correlation between facial features, such as the ear posture and eye whites of cows and pigs, and the sensing parameters compiled in Table 1. The eye-white region, the ear-posture direction of cows and pigs, and their relationship to mental states, such as whether the animals are feeling positive, negative, or neutral, have been studied previously. The data were labelled and annotated by trained ethologists based on the established protocols given in [31][32][33][34][35]. The videos and images were initially preprocessed using a three-stage method: (1) detection of faces, (2) alignment of faces, and (3) normalization of input. A regular smartphone (Samsung Galaxy S10) was used for capturing images and videos from different angles and directions while the animals were in the barn or pen. The collected data were labelled based on the time stamp and the RFID tags and markers. Faces were not extracted manually but with the LabelImg annotation tool [36]. Bounding boxes were annotated in the standard format for each model: PASCAL VOC format for Faster R-CNN and YOLO format for both YOLOv3 and YOLOv4.

Table 1. Sensing parameters that were used for each of the nine classes related to recognizing emotions of cows and pigs [2].
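The two annotation formats encode the same box differently: PASCAL VOC stores absolute corner coordinates, while YOLO stores a normalized center point and size. A minimal conversion sketch (the box and frame dimensions are illustrative):

```python
def voc_to_yolo(box, img_w, img_h):
    """Convert a PASCAL VOC box (xmin, ymin, xmax, ymax, absolute pixels)
    to YOLO format (x_center, y_center, width, height, all normalized
    to the image dimensions)."""
    xmin, ymin, xmax, ymax = box
    return ((xmin + xmax) / 2 / img_w,
            (ymin + ymax) / 2 / img_h,
            (xmax - xmin) / img_w,
            (ymax - ymin) / img_h)

# A hypothetical face box on a 640 x 480 frame
print(voc_to_yolo((100, 120, 300, 360), 640, 480))
# -> (0.3125, 0.5, 0.3125, 0.5)
```

Because YOLO coordinates are normalized, the same label file remains valid if the images are later rescaled, which is convenient when frames are cropped and resized during preprocessing.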

Hardware
The training and testing of the three models based on YOLOv3, YOLOv4, and Faster R-CNN were performed on an NVidia GeForce GTX 1080 Ti graphics processing unit (GPU) running CUDA 9.0 (compute unified device architecture) and cuDNN 7.6.1 (CUDA deep neural network library), equipped with 3584 CUDA cores and 11 GB of memory.

YOLOv3
You Only Look Once (YOLO) is one of the fastest object detection systems, with 30 FPS image-processing capability and a 57.9% mAP (mean average precision) score [40]. YOLO is based on a single Convolutional Neural Network (CNN), i.e., one-step detection and classification. The CNN divides an image into blocks and then predicts the bounding boxes and probabilities for each block. The original YOLO was built on a custom Darknet architecture: darknet-19, a 19-layer network supplemented with 11 object detection layers. This architecture, however, struggled with small-object detection. YOLOv3 uses a variant of Darknet: a 53-layer ImageNet-trained network combined with 53 more layers for detection, with 61.5 M parameters. Detection is done at three receptive fields (85 × 85, 181 × 181, 365 × 365), addressing the small-object detection issue. The loss function does not utilize exhaustive candidate regions but generates the bounding-box coordinates and confidence using regression, which gives faster and more accurate detection. It consists of four parts, each given equal weightage: regression loss, confidence loss, classification loss, and loss for the absence of any object. When applied to face detection, multiple pyramid pooling layers capture high-level semantic features, and the loss function is altered: regression loss and confidence loss are given a higher weight. These alterations produce accurate bounding boxes and efficient feature extraction. YOLOv3 provides detection at an excellent speed. However, it suffers from some shortcomings: expressions are affected by the external environment, and orientations/postures are not taken into account.
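The re-weighting of the four loss terms described above can be sketched as a weighted sum; the weight values below are illustrative only, not those used in YOLOv3 or its face-detection variant:

```python
def composite_loss(box_loss, conf_loss, cls_loss, noobj_loss,
                   w_box=2.0, w_conf=2.0, w_cls=1.0, w_noobj=1.0):
    """Weighted sum of the four YOLO loss terms. In the stock YOLOv3
    loss all four terms carry equal weight; for face detection the
    regression (box) and confidence terms are up-weighted, as in the
    hypothetical defaults here."""
    return (w_box * box_loss + w_conf * conf_loss
            + w_cls * cls_loss + w_noobj * noobj_loss)

# Per-batch component losses (hypothetical values)
print(composite_loss(0.5, 0.3, 0.2, 0.1))
```

Up-weighting the box and confidence terms pushes the optimizer toward tighter localization at the cost of slightly less emphasis on class separation, which matches the single-class nature of face detection.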

YOLOv4
YOLOv4 introduces several features that improve the learning of Convolutional Neural Networks (CNNs) [41]. These include Weighted Residual Connections (WRC), Cross-Stage-Partial connections (CSP), Cross mini-Batch Normalization (CmBN), and Self-Adversarial Training (SAT). CSPDarknet53 is used as the architecture; it contains 29 convolutional layers of size 3 × 3, a 725 × 725 receptive field, and 27.6 M parameters. Spatial Pyramid Pooling (SPP) is added on top of this backbone. YOLOv4 improves the average precision score and FPS of v3 by 10-12%. It is faster, more accurate, and can be used on a conventional GPU with 8 to 16 GB of VRAM, which enables widespread adoption. The new features suppress the weaknesses of its predecessor and improve on its already impressive face detection capabilities.

Faster R-CNN
Faster R-CNN is the third iteration of the R-CNN architecture. The region-based CNN (R-CNN), introduced in 2014 in "Rich feature hierarchies for accurate object detection and semantic segmentation", used Selective Search to detect regions of interest in an image and a CNN to classify and adjust them [42]. However, it struggled to produce real-time results. The next step in its evolution was Fast R-CNN, a faster model with shared computation owing to the Region of Interest (RoI) Pooling technique. Finally came Faster R-CNN, the first fully differentiable model. The architecture consists of a pre-trained CNN (ImageNet) up to an intermediate layer, which yields a convolutional feature map. This is used as a feature extractor and provided as input to the Region Proposal Network, which tries to find bounding boxes in the image. RoI Pooling then extracts the features that correspond to the relevant objects into a new tensor. Finally, the R-CNN module classifies the contents of each bounding box and adjusts its coordinates to better fit the detected object. Maximum pooling is used to reduce the dimensions of the extracted features, and a softmax layer and a regression layer were used to classify facial expressions. As a result, Faster R-CNN achieves higher precision and a lower miss rate. However, it is prone to overfitting: the model can stop generalizing at any point and start learning noise.
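The RoI Pooling step can be illustrated with a toy example: the proposed region is divided into a fixed grid and each cell is max-pooled, so regions of any size map to a fixed-size output. This sketch uses plain Python lists and a hypothetical 4 × 4 feature map:

```python
def roi_max_pool(feature_map, roi, out_size=2):
    """Pool a rectangular region of interest (x0, y0, x1, y1) of a 2-D
    feature map into a fixed out_size x out_size grid by taking the max
    of each sub-cell, as in Fast/Faster R-CNN's RoI pooling."""
    x0, y0, x1, y1 = roi
    h, w = y1 - y0, x1 - x0
    out = []
    for gy in range(out_size):
        row = []
        for gx in range(out_size):
            ys = range(y0 + gy * h // out_size, y0 + (gy + 1) * h // out_size)
            xs = range(x0 + gx * w // out_size, x0 + (gx + 1) * w // out_size)
            row.append(max(feature_map[y][x] for y in ys for x in xs))
        out.append(row)
    return out

# Hypothetical 4 x 4 activation map; pool the whole map to 2 x 2
fmap = [[1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12],
        [13, 14, 15, 16]]
print(roi_max_pool(fmap, (0, 0, 4, 4)))  # [[6, 8], [14, 16]]
```

The fixed-size output is what lets a single downstream classifier head handle region proposals of arbitrary shape.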

Model Parameters
YOLOv3 and YOLOv4 were given image inputs in batches of 64. The learning rate, momentum, and step size were set to 0.001, 0.9, and 20,000 steps, respectively. Training took 10+ hours for the former and 8+ hours for the latter. Faster R-CNN accepted input in batches of 32. The learning rate, gamma, momentum, and step size were set to 0.002, 0.1, 0.9, and 15,000, respectively. It is the most time-consuming of the three to train, taking 14+ hours. The confusion matrices of DarkNet-53, CSPDarkNet-53, and VGG-16, trained and tested on the farm-animal image and video dataset using YOLOv3, YOLOv4, and Faster R-CNN, respectively, are shown in Supplementary Material Tables S1-S3.
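The gamma and step-size parameters above correspond to a standard step-decay learning-rate schedule, lr = base_lr · gamma^(step // step_size). A minimal sketch using the Faster R-CNN settings from the text (the decay factor for the YOLO models is not reported, so only the Faster R-CNN values are shown):

```python
def step_lr(base_lr, gamma, step_size, step):
    """Step decay: multiply the learning rate by `gamma`
    once every `step_size` training steps."""
    return base_lr * gamma ** (step // step_size)

# Faster R-CNN settings from the text: lr = 0.002, gamma = 0.1, step size = 15,000
print(step_lr(0.002, 0.1, 15000, 0))       # 0.002 before the first decay
print(step_lr(0.002, 0.1, 15000, 15000))   # drops tenfold, to about 0.0002
```

Lowering the learning rate tenfold after a fixed number of steps lets training take large steps early and fine-tune once the loss has roughly plateaued.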

Computation Resources
YOLOv3 with its Darknet53 architecture takes the most inference time (0.331 s) compared to YOLOv4 (0.27 s) and Faster R-CNN (0.3 s), which use CSPDarknet53 and VGG-16 architectures, respectively. YOLOv4 is the most computationally efficient model, using 3479 MB compared to 4759 MB for YOLOv3 and 5877 MB for Faster R-CNN. YOLOv4 trumps its two competitors when it comes to resources and efficiency, with optimal memory usage and good-enough inference time. Figure 3 illustrates the proposed WUR Wolf model developed using the pre-trained deep CNN. Figure 4 shows images of farm animals detected by the WUR Wolf facial coding platform from the dataset using the Faster R-CNN technique.

Accuracy Score

YOLOv4 is slower in learning than YOLOv3 but achieves a higher accuracy score and a smoother loss function. Its validation accuracy is also very close to its training accuracy, indicating that the model generalizes well on unseen data (Figure 6) and would perform better in real time than v3. Faster R-CNN achieves a higher accuracy score than both YOLO variants (Figure 7) and converges quickly. However, it generalizes poorly, as the difference between validation and training accuracy is very large at multiple points. Faster R-CNN's accuracy (93.11% on the training set and 89.19% on the validation set) outperforms both YOLOv4 (89.96% training, 86.45% validation) and YOLOv3 (85.21% training, 82.33% validation) on these metrics. Its loss curve also converges fastest, followed closely by v4, with v3 the worst performer on this metric.

Mean Average Precision (mAP)

The mAP score compares the actual bounding box to the detected box and returns a score; the higher the score, the more accurate the model's object boundary detection. YOLOv4 has a mAP score of 81.6% at 15 FPS, performing better than both of the other models. YOLOv3 also performs well on this metric, with a mAP score of 77.60% at 11 FPS. Faster R-CNN provides a moderate mAP score of 75.22%; however, its processing speed is very slow at just 5 FPS. Among the three, YOLOv4 provides the best bounding boxes at the highest speed.
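The box overlap underlying mAP is conventionally measured as Intersection over Union (IoU); a detection typically counts as correct when its IoU with the ground-truth box exceeds a threshold such as 0.5 (the threshold used in this study is not stated). A minimal IoU sketch with illustrative boxes:

```python
def iou(a, b):
    """Intersection-over-Union of two boxes given as (xmin, ymin, xmax, ymax)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))  # intersection width
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))  # intersection height
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# A detected box vs. a ground-truth box, overlapping by a quarter each
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25/175, about 0.143
```

mAP is then the precision averaged over recall levels and classes, with each detection scored as a true or false positive by this IoU test.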

F1 Score
The annotated labels for both cows and pigs can be grouped on the basis of mental state: positive, negative, and neutral. Analyzing model performance on these groups is useful for measuring how each model works in different contexts, and the F1 score is a good measure for this analysis. A confusion matrix tabulates the performance of a model on a dataset for which the true values are known. Model results are compared against the pre-set annotations, and the analysis reveals the performance of each model in detecting the emotion portrayed in each picture. The confusion matrices of all three models are given in the Supplementary Reading Section alongside the respective F1 scores. A negative context involves additional effort and reactions, and as a result there are more pixels with useful information for classification; all three models return higher true positives for such cases (Tables S4-S6). The average F1 scores of the models are as follows: 85.44% for YOLOv3, 88.33% for YOLOv4, and 86.66% for Faster R-CNN. YOLOv4 outperforms the other two in predicting the emotional state in each image.
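The F1 score is the harmonic mean of the precision and recall read off a confusion matrix; a minimal sketch with hypothetical counts for a single emotion class:

```python
def f1_score(tp, fp, fn):
    """F1 = harmonic mean of precision and recall, from the
    true-positive, false-positive, and false-negative counts
    of one class in a confusion matrix."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts for one emotion class
print(round(f1_score(80, 10, 20), 4))  # 0.8421
```

The per-class F1 values are then averaged to give the per-model scores reported above, which weights precision and recall equally rather than favoring either.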

Discussions
Non-invasive technology that can assess good and poor welfare of farm animals, including positive and negative emotional states, will soon be possible using the proposed WUR Wolf facial coding platform. The ability to track and analyze how animals feel will be a breakthrough in establishing animal welfare auditing tools.
In this project, the applicability of three deep learning models for determining the emotions of farm animals was evaluated: Faster R-CNN and two variants of YOLO, namely YOLOv3 and YOLOv4. The YOLOv3 and YOLOv4 algorithms were trained with the Darknet framework. YOLOv4 uses the CSPDarknet53 backbone, while YOLOv3 uses Darknet53; because of the differences between these backbones, YOLOv4 is faster and provides more accurate results for real-time applications.
Demonstration and results of emotion detection of cows and pigs using Faster R-CNN (Figure 2) are shown in the attached Supplementary Video S1. Faster R-CNN is suitable for facial expression recognition on mobile terminals with limited hardware resources [43]. If speed (time for data processing) is the deciding factor, then YOLOv4 is a better choice than Faster R-CNN. Owing to the advantages of its network design, the large variations in a dataset composed of facial images and videos with complex, multiscale objects are better analyzed by the two-stage Faster R-CNN method. Hence, for higher accuracy in emotion detection results, Faster R-CNN is recommended over YOLOv4, and in on-farm conditions lacking equipment with strong data processing capability, Faster R-CNN would be a good choice. Technological advances in the field of animal behavior are a huge step in improving humans' understanding of the animals they share this world with, but there is still room to grow.
Performance evaluation of the three models using a complex dataset of animal faces has been explored in this study. None of the detectors has been deployed on animal data before. Animal facial features are different from human faces, as animals have fur and fewer facial muscles than humans. Moreover, no previous study has compared facial coding analysis between YOLOv3, YOLOv4, and Faster R-CNN. Our study also provides critical insights into the specific advantages of one ML model over another detection method for practical on-farm applications. The facial coding platform presented in our study is only a preliminary step, a proof of concept toward a yet-to-be fully developed platform for assessing the emotions and mental states of farm animals. For example, a mixed ear position (one ear directed forwards and one ear backwards) indicates a negative mental state in pigs, as evidenced by [32,33]; this and other lateral ear posture data from pigs were not available in our study. Farm animals are phenotypically very diverse depending on the genotype, and additional studies with varying animal breeds would further strengthen the validation of the developed emotion recognition platform. The proposed tool has the ability to offer new ways of investigating individual variations within the same species based on facial features. Future investigation with additional comprehensive farm animal facial feature data under varying conditions is warranted for a full and thorough validation of facial coding platforms for determining the affective states of farm animals.
No facial recognition software created for animals is 100% accurate yet, and so far such software has been adapted to identify the physical features of only a few common species and non-human primates. Animal species that are not mammals are minimally expressive and have not been tested with facial recognition software for the study of their emotions. One study even raised the consideration that animals may be able to suppress emotional expression, much like people do in situations where it is socially appropriate to express only certain emotions [29]. The results we have presented in this study are only preliminary, an early work and a basis for the development of a more comprehensive platform. Currently, studies are underway to induce and elicit explicit specific emotions in cows and pigs and thereby measure these emotions using the facial coding features. In this study, we present a framework for measuring emotions using the facial features of cows and pigs. Further study is needed to explore the intensity of the emotions and their relationship to the valence and arousal components in the measurement of emotions in farm animals. A key takeaway from this study is the ability of automated systems to measure not just pain and suffering but also positive emotions. Unlike humans, farm animals are not capable of hiding or disguising their true emotions; hence, the developed system is expected to exert an influence on the welfare of farm animals. An analysis of human facial expressions in 6 million video clips from 144 countries around the world, using a deep neural network [44], determined that 16 expressions are displayed in significantly similar ways across varying social contexts. That study further clarified that 70% of human facial expressions used as emotional responses are shared across cultures.
There are many questions related to animal emotional expression that have yet to be answered, but there is a good chance that the advancement and implementation of facial recognition software will lead scientists to those answers in the future.

Conclusions
The detailed analysis of the performance of the three Python-based machine learning models shows the utility of each model in specific farm conditions and how they compare against each other. YOLOv3 learns quickly but gives random predictions and fluctuating losses. Its next iteration, YOLOv4, has improved considerably in many regards. If the aim is to balance higher accuracy with faster response and less training time, YOLOv4 works best. If the speed of training and memory usage are not a concern, the two-stage Faster R-CNN method performs well and has a robust design for predicting different contexts; its output is accurate, and overfitting is avoided. There is no one-size-fits-all model, but with careful consideration, the most efficient and cost-effective methods can be selected and implemented in automating the facial coding platform for determining farm animal emotions. Facial features as an indicator of the emotions of farm animals provide only a one-dimensional aspect of their affective states. With the advent of Artificial Intelligence and sensor technologies, multi-dimensional models of mental and emotional affective states will emerge in the near future, combining measured behavioral patterns and tracked changes in farm animal postures and behavior with large-scale neural recordings.

Supplementary Materials:
The following are available online at https://www.mdpi.com/article/10.3390/ai2030021/s1, Table S1: The confusion matrix of DarkNet-53 trained and tested on the farm animals' images and videos dataset using YoloV3, Table S2: The confusion matrix of CSPDarkNet-53 trained and tested on the farm animals' images and videos dataset using YoloV4, Table S3: The confusion matrix of VGG-16 trained and tested on the farm animals' images and videos dataset using Faster RCNN, Table S4: F1-Scores for emotion detection and recognition by Yolov3 from facial features of cows and pigs' images and videos data set, Table S5: F1-Scores for emotion detection and recognition by Yolov4 from facial features of cows and pigs' images and videos data set, Table S6: F1-Scores for emotion detection and recognition by Faster RCNN from facial features of cows and pigs' images and videos data set, Video S1: Demonstration of the WUR Wolf Facial Coding Platform.