Pig Face Recognition Based on Metric Learning by Combining a Residual Network and Attention Mechanism

Abstract: As machine vision technology has advanced, pig face recognition has gained wide attention as a method for identifying individual pigs. To study non-contact open-set pig face recognition, this study establishes an improved backbone network, ResNAM, for pig face image feature extraction by combining the NAM (normalization-based attention module) attention mechanism with a ResNet model. An open-set pig face recognition framework is then designed by integrating three loss functions and two metrics, so that the task is completed with no overlap of individuals between the training and test sets. The SphereFace loss function, with the cosine distance as the metric, and ResNAM are combined in the framework to obtain the optimal open-set pig face recognition model. To train the model, 37 pigs with a total of 12,993 images were randomly selected from the collected pig face images, and 9 pigs with a total of 3431 images were set aside as a test set. From the images in the test set, 900 positive sample pairs and 900 negative sample pairs were obtained. The experimental results show that the accuracy reached 95.28%, which was 2.61 percentage points higher than that of a human face recognition model. NAM was more effective in improving the performance of the pig face recognition model than the mainstream BAM (bottleneck attention module) and CBAM (convolutional block attention module). The research results can provide technological support for non-contact open-set individual recognition in intelligent farming.


Introduction
Intensive pig farms are replacing small-scale breeding models, such as individual breeding, as livestock farming moves toward scale, informatization, and refinement. Technology that identifies individual pigs is important for the daily fine management of individual pigs on large-scale pig farms. In the traditional pig industry, physical tags and radio-frequency identification (RFID) chips are usually applied to identify pigs. Methods based on physical labels, such as pattern marking and ear cutting, can cause stress in pigs and affect animal welfare [1]. With RFID chips of different frequencies, the identification range is different for each chip, which can cause the problem of false identification. There are a variety of non-contact recognition methods based on computer vision that have recently been applied to a variety of animal recognition tasks. These methods require the assistance of only cameras and computing equipment, without additional personnel or other equipment, and they are able to quickly and accurately identify individual animals.
Prior human knowledge was needed to extract image features in the early development of computer vision technology. Kashiha et al. [2] used Fourier descriptors with rotation and translation invariance to preserve the pattern features of 10 pigs, and the accuracy of pig pattern recognition was 88.7%. Zhao et al. [3] proposed a vision system for Holstein cow body image extraction and identity recognition. Side-view images of 66 cows were collected. The methods of features from an accelerated segment test (FAST),

Related Work
Closed-set recognition can only recognize livestock individuals who have appeared in the training set, whereas open-set recognition can recognize livestock individuals that have never been seen in the model according to an assessment of the two types of recognition by Andrew et al. [14]. Therefore, there are two main types of livestock individual recognition: open-set recognition and closed-set recognition.
Closed-set recognition has been studied more often. Hansen et al. [15] designed a nine-layer CNN and SVM (support vector machine) to realize pig face recognition. Marsot et al. [16] proposed a novel framework composed of computer vision algorithms, machine learning, and deep learning techniques, with an accuracy of 83% on 320 testing images, to provide a relatively low-cost and scalable solution for pig recognition. Salama et al. [8] used Bayesian optimization to find the best CNN for sheep face recognition, with an accuracy of 98%. Wang et al. [17] introduced a pig facial recognition model based on a Keras convolutional neural network, whose best recognition accuracy reached 97.6%. Wang et al. [18] introduced a ternary-loss-function-based pig face recognition approach that lowered intra-class differences while increasing the distance between classes, giving a novel research idea for raising pig face recognition accuracy. These deep-learning-based livestock individual recognition algorithms achieved good recognition results, proving the viability of using deep learning to identify individual pigs and providing an important reference for future research on pig face recognition. Their general drawback is that the training and test sets contain the same individuals, so the models can only recognize livestock individuals that exist in the training set.
This study proposes an open-set pig face recognition method with an improved backbone network and a metric method.
The contributions of this study can be summarized as follows.
1. To improve feature extraction from pig face photos, this research offers a feature extraction backbone network (ResNAM) based on a normalized attention mechanism.
2. To increase the accuracy of open-set pig face recognition, a framework for open-set pig face recognition was created by integrating three loss functions and two measurement techniques. The best open-set pig face recognition method was then obtained by combining the ResNAM network, SphereFace loss function [21], and cosine distance.
3. An open-set pig face recognition model based on the BAM and CBAM attention mechanisms was constructed, and ablation experiments were designed to verify the effectiveness of ResNAM.

Data Acquisition
Pig facial images were collected in August 2018 at Hui Kang Breeding Farm, Tianjin, China. A system for capturing images of pigs' faces was designed: a positioning pen was used, and the camera was fixed on a tripod at a height of 50 cm from the positioning pen (the same height as the pig's face). Videos of pigs under natural breeding conditions were collected with an industrial camera with a resolution of 1920 pixels × 1080 pixels (HD1080). Pig face images were collected between 6:00 a.m. and 8:00 a.m., 10:00 a.m. and 12:00 p.m., and 2:00 p.m. and 6:00 p.m., for a total of 46 pigs.

Data Preprocessing
Pig face images were extracted from each pig face video. The images collected under unconstrained conditions contained large amounts of background noise, and objects such as pig pens and windows appearing in the images could degrade pig face recognition. Therefore, Faster RCNN was used to crop the pig faces out of the images, and the pig face data were then screened by manually selecting the frontal pig face images. The pig face images after screening are shown in Figure 1, and the processing results for the complete pig face dataset are shown in Table 1.
As can be seen in Table 1, to achieve open-set pig face recognition, the pigs were divided at a ratio of 8:2, and the facial images of 37 pigs were randomly selected for model training. The accuracy of the open-set recognition model was tested by using the facial images of nine pigs that never appeared in the training set; the distribution of the number of images in the pig face dataset is shown in Figure 2. Thus, the training set contained a total of 12,993 images, and the test set contained 3431 images. Two images of the same pig in the test set were randomly selected to form a positive sample pair for testing pig face recognition, and 200 positive sample pairs were reserved for each pig; thus, a total of 1800 positive sample pairs were generated in the test set. Two images of different pigs were randomly selected to form a negative sample pair, and 1800 negative sample pairs were randomly retained to ensure that there was the same number of positive and negative sample pairs.
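The pair construction described above can be sketched as follows. This is a minimal illustration rather than the authors' script: the function name `build_test_pairs`, the dictionary layout, and the toy image identifiers are assumptions made for the example, and the source does not state how duplicate pairs were handled (here they are simply avoided).

```python
import random
from itertools import combinations


def build_test_pairs(images_by_pig, pos_per_pig, n_negative, seed=0):
    """Return (positive_pairs, negative_pairs) of image identifiers.

    images_by_pig is a hypothetical mapping: pig ID -> list of image IDs.
    """
    rng = random.Random(seed)
    pigs = sorted(images_by_pig)

    positives = []
    for pig in pigs:
        # Enumerate all unordered same-pig pairs, then keep a random subset.
        candidates = list(combinations(images_by_pig[pig], 2))
        positives.extend(rng.sample(candidates, pos_per_pig))

    negatives, seen = [], set()
    while len(negatives) < n_negative:
        pig_a, pig_b = rng.sample(pigs, 2)  # two different pigs
        pair = (rng.choice(images_by_pig[pig_a]),
                rng.choice(images_by_pig[pig_b]))
        key = frozenset(pair)
        if key not in seen:  # skip pairs that were already drawn
            seen.add(key)
            negatives.append(pair)
    return positives, negatives


# Toy data: 3 pigs with 5 images each (the study used 9 test pigs).
toy = {f"pig{p}": [f"pig{p}_{i}" for i in range(5)] for p in range(3)}
pos_pairs, neg_pairs = build_test_pairs(toy, pos_per_pig=4, n_negative=12)
```

Calling the function with nine pigs, `pos_per_pig=200`, and `n_negative=1800` would reproduce the balanced 1800/1800 split described in the text.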

MobileFaceNet
Recently, lightweight networks such as MobileNetV1 and MobileNetV2 [22] have been used for visual recognition tasks on mobile terminals, but due to the specificities of facial structures, these networks have not obtained satisfactory results for facial recognition tasks. Chen et al. [23] proposed MobileFaceNet, a lightweight network designed specifically for facial recognition. The model used a global depth-wise convolution (GDConv) layer instead of a 7 × 7 global average pooling layer, which assigned weight coefficients to the importance of different positions. In addition, the PReLU activation layer was used instead of ReLU, and a smaller expansion factor than that of MobileNetV2 was selected to make the model lighter. The ArcFace loss function was used to increase the inter-class distance and decrease the intra-class distance during training.

Feature Extraction Backbone Network with a Normalized Attention Mechanism
Pig face images contain multiple levels of semantic information, and extracting both the low-level and high-level semantic information of images is a key difficulty in improving the recognition rate. An attention mechanism can weight the semantic information in an image through autonomous learning and filter the image features that are beneficial for the recognition result. In this study, a new feature extraction backbone network, called the ResNAM network, was built by combining the NAM [24] with a residual module to capture the shallow semantic information of pig face images. The structure of the ResNAM network is shown in Figure 3. The feature extraction process is as follows: a 224 pixel × 224 pixel image of a pig face was fed into the backbone network to extract the facial features. The feature extraction backbone network consisted of a CBR module, four ResNAM modules, and a dropout layer. The CBR module included a […]. The ResNAM module incorporated the residual structure and the NAM attention mechanism.
Following the design idea of CBAM attention, NAM integrates channel attention and spatial attention submodules. Figure 4a shows the channel attention submodule of the NAM. First, a weight sparsity penalty was applied to the input feature map. Second, the variation of each channel was reflected by the scaling factor in batch normalization (BN). Third, the feature channels that the network was interested in were highlighted, and the background information was suppressed. The scaling factor in BN was used to measure the channel variance and calibrate the importance of the channel features, as shown in Equation (1):

$$B_{out} = \mathrm{BN}(B_{in}) = \gamma \, \frac{B_{in} - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} + \beta \qquad (1)$$

where $\mu_B$ and $\sigma_B$ are the mean and standard deviation of the mini-batch $B$, respectively, $\gamma$ and $\beta$ are trainable affine transformation parameters (scale and shift), $\epsilon$ is a small constant for numerical stability, and $B_{in}$ is the feature map output from the previous layer.
The NAM channel attention submodule is shown in Equation (2):

$$M_c = \mathrm{sigmoid}\left(W_\gamma\left(\mathrm{BN}(F_1)\right)\right) \qquad (2)$$

where $F_1$ is the input feature map, $\gamma_i$ is the set of scale factors obtained from the input feature map through the BN layer, the weights $W_\gamma = \gamma_i / \sum_j \gamma_j$ are calculated under the guidance of $\gamma_i$, and $M_c$ is the weight factor obtained after the sigmoid function. Figure 4b shows the NAM spatial attention submodule, which is calculated as shown in Equation (5). The scaling factor of BN is applied to the spatial dimension to measure the importance of each pixel, which is called pixel normalization:

$$M_s = \mathrm{sigmoid}\left(W_\lambda\left(\mathrm{BN}_s(F_2)\right)\right) \qquad (5)$$

where $F_2$ is the input feature map of the spatial submodule, the output is denoted as $M_s$, $\lambda$ is the scaling factor, and the weights are $W_\lambda = \lambda_i / \sum_j \lambda_j$.
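To make the channel attention computation concrete, the following pure-Python sketch applies batch normalization per channel, derives the channel weights from the BN scale factors, and gates the input with a sigmoid. It is an illustration under stated assumptions, not the ResNAM implementation: each channel is a flat list of activations, the statistics are taken over that list only, absolute values of the scale factors are used so that the weights are non-negative, and the gate is multiplied back onto the input feature map. The names `batch_norm` and `nam_channel_attention` are hypothetical.

```python
import math


def batch_norm(channel, gamma, beta, eps=1e-5):
    """Normalize one channel (a flat list), then apply scale and shift."""
    mean = sum(channel) / len(channel)
    var = sum((v - mean) ** 2 for v in channel) / len(channel)
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in channel]


def nam_channel_attention(feature_map, gammas, betas):
    """Gate each channel with sigmoid(W_gamma * BN(F1)) and rescale the input."""
    total = sum(abs(g) for g in gammas)
    # Channel weights from the BN scale factors (assumption: absolute values).
    weights = [abs(g) / total for g in gammas]
    output = []
    for channel, g, b, w in zip(feature_map, gammas, betas, weights):
        normed = batch_norm(channel, g, b)
        gate = [1.0 / (1.0 + math.exp(-w * v)) for v in normed]  # values in (0, 1)
        output.append([m * v for m, v in zip(gate, channel)])
    return weights, output


# Toy feature map: 2 channels with 4 activations each.
features = [[1.0, 2.0, 3.0, 4.0], [0.5, 0.5, 1.5, 1.5]]
channel_weights, attended = nam_channel_attention(
    features, gammas=[3.0, 1.0], betas=[0.0, 0.0]
)
```

In this toy case the first channel has the larger scale factor, so it receives the larger weight (0.75 versus 0.25), which is the sense in which the BN scale factor calibrates channel importance.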

An Open-Set Recognition Method for Pig Faces
Unlike other deep learning methods applied to pig faces for closed-set recognition, this study did not use the traditional softmax approach for classification learning. Instead, the last layer of the feature map was extracted and mapped into feature vectors, and metric methods were used to calculate the distance between the feature vectors to identify individual pigs. The resulting open-set pig face recognition method is shown in Figure 5. It includes a feature extraction backbone network that incorporates a normalized attention mechanism, the ResNAM network, which was designed as described in Sections 3.1 and 3.2. Pig face images that were 224 pixels × 224 pixels in size were divided into a training set and a test set according to the process described in Section 3.2, and the test set consisted of 1800 pairs of images. The ResNAM model was trained by using the training set; the forward inference results were output after the random dropout and fully connected layers to calculate the loss values for pig face classification, and the parameters were adjusted by using stochastic gradient descent (SGD) to save the optimal model. During model testing, the positive and negative pig face image pairs were fed into the trained ResNAM network to obtain paired feature vectors; the Euclidean distance or cosine distance between the two vectors was calculated and compared with the optimal threshold to determine whether the two images belonged to the same pig. The best threshold was calculated in the way described in the literature [14], and the average accuracy on the test set was obtained through ten-fold cross-validation.
To further optimize the PigFaceNet model, the most suitable loss function for pig face recognition was selected from three existing loss functions (ArcFace, CosFace [25], and SphereFace [26]), and the better metric was selected from the Euclidean distance and the cosine distance; the resulting optimal model was saved.
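The verification step at test time can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes that the cosine score is a similarity (a higher value means the same pig) and that the Euclidean score is the squared distance between L2-normalized vectors (a lower value means the same pig, with a range of [0, 4]). The function names and the toy embeddings are hypothetical; the thresholds 0.745 and 0.510 are the values reported later in the paper and are used here only as examples.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_similarity(u, v):
    u, v = l2_normalize(u), l2_normalize(v)
    return sum(a * b for a, b in zip(u, v))


def squared_euclidean(u, v):
    """Squared distance between unit vectors; lies in [0, 4]."""
    u, v = l2_normalize(u), l2_normalize(v)
    return sum((a - b) ** 2 for a, b in zip(u, v))


def is_same_pig(u, v, threshold, metric="cosine"):
    """Compare a pair of feature vectors against a decision threshold."""
    if metric == "cosine":
        return cosine_similarity(u, v) >= threshold
    if metric == "euclidean":
        return squared_euclidean(u, v) <= threshold
    raise ValueError(f"unknown metric: {metric}")


# Toy 3-dimensional embeddings (real feature vectors are much longer).
emb_a = [0.9, 0.1, 0.2]
emb_b = [0.8, 0.2, 0.1]   # close to emb_a
emb_c = [0.1, 0.9, 0.3]   # far from emb_a

match_ab = is_same_pig(emb_a, emb_b, threshold=0.745)
match_ac = is_same_pig(emb_a, emb_c, threshold=0.745)
match_ab_euclid = is_same_pig(emb_a, emb_b, threshold=0.510, metric="euclidean")
```

Because no classifier over known identities is involved, the same comparison works for pigs that never appeared in the training set.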
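The three candidate loss functions differ mainly in how they penalize the logit of the ground-truth class before softmax cross-entropy. The sketch below shows that difference for a single angle between a feature vector and its class center. It is a simplified illustration: the margin values are common defaults from the face recognition literature, not the settings used in this study, the SphereFace term uses the plain cos(mθ) form that is only valid for θ in [0, π/m], and the function name `target_logit` is hypothetical.

```python
import math


def target_logit(theta, loss, margin=None, scale=1.0):
    """Logit of the ground-truth class for an angle theta (radians)."""
    if loss == "sphereface":      # multiplicative angular margin: cos(m * theta)
        m = 4 if margin is None else margin
        return scale * math.cos(m * theta)
    if loss == "cosface":         # additive cosine margin: cos(theta) - m
        m = 0.35 if margin is None else margin
        return scale * (math.cos(theta) - m)
    if loss == "arcface":         # additive angular margin: cos(theta + m)
        m = 0.5 if margin is None else margin
        return scale * math.cos(theta + m)
    if loss == "softmax":         # no margin, for reference
        return scale * math.cos(theta)
    raise ValueError(f"unknown loss: {loss}")


theta = 0.3  # a fairly small angle to the correct class center
logits = {name: target_logit(theta, name)
          for name in ("softmax", "sphereface", "cosface", "arcface")}
```

Each margin lowers the target logit relative to plain softmax, so the network must pull features of the same pig closer to their class center to reach the same loss value.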

Experimental Settings
For training, a 16 GB NVIDIA Tesla P100 graphics processor was used, and a deep learning training platform was built by using the Ubuntu 16.0 operating system, Python 3.8, and PyTorch 1.7.1. The CUDA API version was 10, and the cuDNN version was 8.0.5. Training ran for 300 epochs, each training batch had 256 images, and the initial learning rate was set to 0.1. The learning rate was reduced to one-tenth of the original at the 5th, 60th, and 200th epochs.
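The learning rate schedule can be written as a small function. This sketch assumes that the reduction is applied cumulatively (a factor of 0.1 at each milestone, as in a standard multi-step schedule) and that epochs are counted from zero; the source does not state either detail, and the function name `learning_rate` is hypothetical.

```python
def learning_rate(epoch, base_lr=0.1, milestones=(5, 60, 200), gamma=0.1):
    """Learning rate for a given epoch under a multi-step decay schedule."""
    # Count how many milestones have been reached so far.
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** passed


# Learning rate for each of the 300 training epochs.
schedule = [learning_rate(e) for e in range(300)]
```

Under these assumptions the rate is 0.1 for epochs 0 to 4, 0.01 up to epoch 59, 0.001 up to epoch 199, and 0.0001 afterwards. In PyTorch, `torch.optim.lr_scheduler.MultiStepLR` with the same `milestones` and `gamma` follows the same rule.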

Results of Training
This research used ResNet18 as the baseline model to construct a ResNAM network that incorporated the normalized attention mechanism. The ResNAM network was trained by using the dataset shown in Table 1, and its performance with the cosine distance and the Euclidean distance was evaluated on the test set after each epoch of training. Figure 6 depicts the accuracy and loss value curves obtained with the Euclidean and cosine distances during ResNAM training. After each training epoch, the ResNAM network was used to infer the test set image pairs to obtain feature vectors, and the open-set pig face recognition method from Section 4.3 was used to obtain the best threshold, test the accuracy with the different measurement methods, and save the model weights with the highest accuracy. The recognition accuracy of the model showed an overall rising trend as the number of training epochs increased and then became steady, while the loss value initially dropped and then stabilized. The accuracy and loss value fluctuated substantially before 60 epochs and progressively became stable afterwards because a stepwise learning rate schedule was used: the learning rate was dropped to one-tenth of the original at the 5th and 60th epochs, which accelerated model convergence.
The results of a comparison between the model in this paper and the baseline models ResNet18 and ResNet50 are shown in Table 2. The ResNAM, ResNet18, and ResNet50 models were trained by using the same experimental environment and dataset. The test results indicated that when a deep network was used as the backbone network to extract pig face features, the accuracy of the cosine-distance-based method was slightly higher than that of the Euclidean-distance-based method.
The pig face recognition framework of this work incorporated three loss functions. Figure 7 compares the recognition results of the various models with the different loss functions. With ArcFace as the loss function, ResNet50, whose model size was double that of ResNAM, had an accuracy of 93.89% with the cosine distance and 93.67% with the Euclidean distance. With CosFace and SphereFace as the loss functions, and with either the Euclidean distance or the cosine distance as the measurement method, the accuracy of ResNAM reached 95.28% at best, exceeding those of ResNet18 and ResNet50.
With the same feature extraction network, the model trained with SphereFace as the loss function provided the best accuracy and discriminative power for pig facial features. To summarize, this paper combined the ResNAM network, the SphereFace loss function, and cosine distance measurement to produce the best open-set pig face recognition method, with an accuracy of 95.28%, which was significantly greater than that of the model before improvement. The experimental findings suggest that the ResNAM model developed in this paper substantially improved pig facial image feature extraction. They also demonstrate that the SphereFace loss function and cosine distance measurement could efficiently differentiate pig facial features by narrowing the intra-class gap and widening the inter-class distance.

Ablation Study
In this paper, ResNAM, a ResNet18-based pig facial image feature extraction model, was built, and an open-set pig face recognition method was proposed. To produce the best model for open-set pig face recognition, the framework incorporated ResNAM, the SphereFace loss function, and cosine distance measurement. A control variable method was used in Section 4.2 to create several model comparison experiments that validated the effectiveness of the SphereFace loss function and the cosine distance measurement. In this part, backbone networks were created by merging different attention modules, and the same training and testing procedures as those in Section 4.2 were employed to obtain the best model and pig face recognition results. Table 3 shows how several attention mechanisms, namely BAM [26], CBAM [27], and NAM, were combined with the same backbone network to construct models No. 1 to No. 3. The accuracy of the ResNAM model presented in this research was the greatest, achieving 95.28% with SphereFace as the loss function and the cosine distance as the measurement technique. This accuracy was 0.17% and 1.34% greater than those of No. 2 and No. 3, respectively, with the identical loss function and measurement technique, and the ResNAM model was the smallest. With CosFace as the loss function and the cosine distance as the measurement technique, ResNAM again had the highest accuracy, 93.22%, which was 1.72% and 3.33% greater than those of No. 2 and No. 3, respectively. With ArcFace as the loss function and the cosine distance as the measurement technique, ResNAM had the highest accuracy, 92.94%, which improved on No. 2 and No. 3 by 2.5% and 1.05%, respectively.
To summarize, the accuracy of ResNAM was higher than that of the ResNet18 models that incorporated BAM and CBAM under the same loss function and measurement approach. When the same feature extraction network was used, the model trained with SphereFace as the loss function had the best pig face recognition performance. This demonstrated that, compared with the other loss functions, SphereFace produced features with high angular separability. It could cluster features from similar pig face images while separating features from different pig face images, which makes it well suited for constraining pig face features.
The ResNAM model extracted more pig face image characteristics. As a result, the best model, generated by training ResNAM with the SphereFace loss function, was able to better discriminate pig facial features. The choice between the Euclidean and cosine distances had the least impact on pig face recognition. Table 4 shows the results obtained when a human face recognition model was used to identify pigs from the open-set recognition dataset provided in this work. MobileFaceNet had a maximum accuracy of 92.67%, obtained with the Euclidean distance as the measurement technique. With the cosine distance as the measurement technique, the model in this paper had a maximum accuracy of 95.28%, which was 2.61 percentage points greater than that of MobileFaceNet. Due to the impact of inbreeding, there was minimal variation between individual pigs. However, there was a large intra-class variability for each pig due to the influence of light, angle, and posture, which posed a significant obstacle to pig face recognition. The accuracy obtained by applying a human face recognition model directly to pig faces was therefore not optimal. In contrast, the open-set pig face recognition technique proposed in this research successfully enhanced pig facial features, boosted pig face identification accuracy, and can be applied more effectively to pig farms.

Discussion and Analysis
To eliminate identification mistakes caused by an uneven division of the samples, this research employed ten-fold cross-validation to compute the average accuracy and verify the model's robustness. The test set was divided into ten parts; each time, nine parts were used to find the optimal threshold value, and the accuracy was then tested on the remaining part with that threshold. This was repeated ten times, and the mean of the ten accuracies was taken as the average accuracy of the test set. The cosine distance had a value range of [0,1], while the Euclidean distance had a value range of [0,4] [28]. Table 5 displays the results of the ten-fold cross-validation of the best model in this article. The cross-validation showed that different test subsets had different optimal thresholds, and the different optimal thresholds gave varied test outcomes in each subset, with a maximum recognition rate of 97.778%. The best threshold value for pig face images was 0.745 with the cosine distance and 0.510 with the Euclidean distance. The model's average accuracy was 95.278% with the cosine distance and 95.111% with the Euclidean distance. These findings indicate that the cosine distance was more suitable for determining the distances between pig face images. Because the pig facial features extracted by the best ResNAM model with SphereFace as the loss function already differentiated individual pigs well, the accuracy obtained with the cosine distance was only marginally greater than that obtained with the Euclidean distance. According to the results in Table 5, the test set accuracy was the highest, reaching 95.278%, when the model used the cosine distance as the measurement method with a threshold value of 0.745.
Accordingly, the cosine threshold was set to 0.745, and the accuracy was computed for each pig in the test set, as shown in Figure 8. Figure 8 shows that the accuracy of the pig face recognition model with the cosine distance was 100% for Pig4, whereas Pig39 and Pig40 had poor recognition rates. Recognition errors occurred when Pig39 was paired with some images of Pig34, Pig46, Pig39, and Pig40, and also when Pig40 was paired with some images of Pig34, Pig46, Pig39, and Pig40. Among the test sample pairs of Pig39, the number of negative pairs with identification errors was greater than the number of positive pairs with identification errors, indicating that the inter-class difference for Pig39 was small and that it was easily confused with other pigs.
In the Pig40 test samples, negative sample pairs with identification errors accounted for 8.8% of the total number of negative samples, while positive sample pairs with identification errors accounted for 11% of the total number of positive samples. The error rate for positive samples was thus higher than that for negative samples, which means that large intra-class variances had a significant impact on the accuracy for Pig40. Figure 9 depicts various sample pairs with incorrect identification. Figure 9a shows that the test samples of Pig4, Pig39, and Pig40 had various angle and ear occlusion issues, and Pig39 and Pig40 had much greater variations in face angle and lighting than Pig4. The recognition rate for Pig4 was 100%, whereas those for Pig39 and Pig40 were 89.01% and 90.64%. On the one hand, this demonstrates that the pigs' ears, posture, angle, and illumination conditions, which are typical factors in pig face recognition, caused a large intra-class difference and a small inter-class difference. On the other hand, it demonstrates that the strategy in this research reduced the intra-class difference to some extent, which offers suggestions for boosting pig face recognition accuracy. As the positive and negative examples of recognition errors in Figure 9b show, the face images of Pig39 were strongly similar to those of Pig34 and Pig46, which were difficult even for human eyes to tell apart, so the difference between classes was small. In contrast, there was considerable light and angle interference in the images of Pig40, resulting in a large intra-class gap and a low recognition rate for positive sample pairs. The method in this paper improved the accuracy of pig face recognition to a certain extent, but the problem has not been completely solved. In practical applications, open-set recognition can be used to compare new pig face images with the pig face images in a database one by one to determine whether the unknown pig has ever appeared in the database.
Therefore, pig face images with various viewing angles, lighting conditions, and postures can be added to the database in order to increase the accuracy of pig face identification. Closed-set recognition, on the other hand, assumes that the pig to be identified is a pig in the database, and it cannot be used to identify pigs that are not in the database; this problem cannot be solved by adding more pig face images. To summarize, the approach in this study increased the accuracy of pig face identification, addressed the problem of large gaps within a class, overcame the interference of external factors to some extent, and serves as a reference for future pig face recognition research (Tables 6 and 7). Table 8 shows
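The ten-fold procedure for choosing the threshold and measuring accuracy can be sketched as follows. This is an illustration under assumptions, not the evaluation code used in the study: the scores are treated as similarities (a pair is predicted positive when its score is at least the threshold), the folds are formed by simple interleaving, the candidate thresholds are a fixed grid, and all function names and toy scores are hypothetical.

```python
def accuracy(scores, labels, threshold):
    """Fraction of pairs classified correctly at the given threshold."""
    correct = sum((s >= threshold) == bool(y) for s, y in zip(scores, labels))
    return correct / len(scores)


def best_threshold(scores, labels, candidates):
    """Candidate threshold with the highest accuracy (first one on ties)."""
    return max(candidates, key=lambda t: accuracy(scores, labels, t))


def ten_fold_accuracy(scores, labels, candidates, n_folds=10):
    """Pick the threshold on nine folds, test on the tenth, then average."""
    indices = list(range(len(scores)))
    fold_accuracies, fold_thresholds = [], []
    for k in range(n_folds):
        test_idx = indices[k::n_folds]          # every n_folds-th pair
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]
        t = best_threshold([scores[i] for i in train_idx],
                           [labels[i] for i in train_idx], candidates)
        fold_thresholds.append(t)
        fold_accuracies.append(accuracy([scores[i] for i in test_idx],
                                        [labels[i] for i in test_idx], t))
    return sum(fold_accuracies) / n_folds, fold_thresholds


# Toy scores: 20 positive pairs (label 1) and 20 negative pairs (label 0).
toy_scores = [0.9] * 20 + [0.2] * 20
toy_labels = [1] * 20 + [0] * 20
grid = [i / 100 for i in range(101)]
mean_accuracy, thresholds = ten_fold_accuracy(toy_scores, toy_labels, grid)
```

Because the threshold is always selected on pairs that are not used for scoring, the averaged accuracy is not inflated by tuning the threshold on the same pairs that are being evaluated.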

Conclusions
Firstly, to extract features from pig face images, the ResNAM backbone network was presented, which integrated a normalized attention mechanism and residual network and could fully extract key features from pig face images when they were disturbed by noise, occlusion, and other situations.
Secondly, when compared to the BAM and CBAM attention modules, NAM improved the performance of the pig facial recognition model more effectively and could extract richer high-level semantic data. The accuracy of pig face recognition when using NAM was higher than that when using BAM and CBAM with the identical loss function and measurement method.
Thirdly, an open-set pig face recognition framework was provided in this study, which integrated three loss functions and two measurement methods and accomplished open-set pig face recognition with non-overlapping individuals in the training and test sets. ResNAM's accuracy was 95.28% with SphereFace as the loss function and the cosine distance as the measurement technique, which was 2.61 percentage points greater than that of the human face recognition model.
To summarize, deep learning can be used to perform open-set pig face recognition, which overcomes the limitation of only identifying individual pigs that have appeared in the training set. This paper proposed an open-set pig face recognition framework based on metric learning and improved the backbone network used for feature extraction in the framework, which improved the accuracy of open-set pig face recognition and provided a new idea for future research on pig face recognition algorithms. In future work, we will try to combine open-set recognition with deep unsupervised active learning in order to improve the quality of learning and render it more semantic.