Applied Sciences
  • Article
  • Open Access

27 December 2023

A Lightweight Pig Face Recognition Method Based on Automatic Detection and Knowledge Distillation

1 Division of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, Republic of Korea
2 Core Research Institute of Intelligent Robots, Jeonbuk National University, Jeonju 54896, Republic of Korea
* Author to whom correspondence should be addressed.

Abstract

Identifying individual pigs is crucial for efficient breeding, health management, and disease control in modern farming. Traditional animal face identification methods are labor-intensive and prone to inaccuracies, while existing CNN-based pig face recognition models often struggle with high computational demands, large sizes, and reliance on extensive labeled data, which limit their practical application. This paper addresses these challenges by proposing a novel, decoupled approach to pig face recognition that separates detection from identification. This strategy employs a detection model as a pre-processing step, significantly reducing the need for extensive re-annotation for new datasets. Additionally, the paper introduces a method that integrates offline knowledge distillation with a lightweight pig face recognition model, aiming to build an efficient and embedding-friendly system. To achieve these objectives, the study constructs a small-scale, high-quality pig face detection dataset consisting of 1500 annotated images from a selection of 20 pigs. An independent detection model, trained on this dataset, then autonomously generates a large-scale pig face recognition dataset with 56 pig classes. In the face recognition stage, a robust teacher model guides the student model through a distillation process informed by a knowledge distillation loss, enabling the student model to learn relational features from the teacher. Experimental results confirm the high accuracy of the pig face detection model on the small-scale detection dataset and the ability to generate a large-scale dataset for pig face recognition on unlabeled data. The recognition experiments further verify that the distilled lightweight model outperforms its non-distilled counterparts and approaches the performance of the teacher model. This scalable, cost-effective solution shows significant promise for broader computer vision applications beyond agriculture.

1. Introduction

With the escalating demands for livestock production and quality, agriculture has rapidly advanced into intensive and smart farming practices aimed at enhancing livestock productivity [1]. In today’s world, the advent of new diseases that pose threats to pig health necessitates increasingly vigilant monitoring [2], and AI technology has been adopted for automated diagnostic tests to facilitate the early detection of diseases in livestock [3]. Critical to this process is the ability to automatically identify individual animals, which serves as the foundation for effective tracking and disease management.
Traditional methods for pig identification, such as paint marking and manual observation, have long been used on farms [4]. These methods, however, suffer from several drawbacks. Manual observation is time-consuming and error-prone, making it particularly unsuitable for large herds. Subsequently, RFID ear-tagging emerged as an alternative, aiming to streamline the process [5]. Farmers can implant RFID tags in the ears of livestock and use readers to monitor and manage their behavior, health status, and production data. An HF RFID system demonstrated impressive results in recording pig drinking behaviors, with an accuracy of 93% when cross-referenced with visual observations [6]. Similarly, the application of UHF RFID showcased potential in tracking feeding behaviors, achieving a 96% accuracy [7]. While RFID systems are prevalent, they are costly, especially when deployed on a large-scale farm [8]. In addition, the requirement for physical implantation presents ethical dilemmas concerning animal welfare.
In recent years, researchers have been focusing on utilizing non-invasive deep learning-based automatic animal face recognition technology to track and monitor the health of animals, inspired by advancements in human face recognition technology [9,10,11,12,13]. Although some relevant studies [14] have achieved promising results in automatic animal recognition tasks, such as sheep [15], cattle [16], birds [17], and pigs [18], existing pig face recognition models, primarily based on comprehensive Convolutional Neural Networks (CNNs), face challenges of high computational demands and reliance on extensive labeled data, limiting their practical deployment in resource-constrained environments. The comprehensive CNN architectures, as employed in the aforementioned studies, offer robust feature extraction capabilities and exceptional performance due to the depth and intricacy of their stacked convolutional layers [19]. However, this sophistication comes at a cost. The significant model sizes and dense architectural designs inherently lead to high computational complexity. In practical applications, deploying CNN algorithms in embedded systems requires efficient memory utilization, low-cost operators, and compact weight representations. Hence, recent works have sought to pivot towards lightweight models, acknowledging the significance of reduced complexity in embedded systems. However, a dilemma arises. While these slim models have fewer parameters, making them amenable for embedded deployments, they often compromise on accuracy, underperforming when compared to their heftier counterparts.
Meanwhile, existing pig face recognition studies typically employ one of two methods: manually cropping pig faces from images or using end-to-end two-stage models that automatically detect and then identify pig faces. The former is time-consuming and not scalable for large datasets, while the latter relies heavily on large annotated datasets. Moreover, these end-to-end models require re-annotation when applied to new data, making them less feasible for practical applications.
To bridge these shortcomings in the pig face recognition task, we first propose a two-stage pig face recognition model that decouples detection from recognition. The advantage of this design is that it reduces the need for repetitive annotation: once trained, the pig face detection model can serve as a preprocessing step, autonomously generating pig face data for subsequent recognition tasks without additional human intervention. Secondly, we introduce a cutting-edge method that harnesses offline knowledge distillation to optimize the balance between accuracy and compactness in the lightweight pig face recognition model. Knowledge distillation is a technique in which a ’student’ model learns from a more complex ’teacher’ model [20]. The essence of our approach lies in transferring the nuanced feature representations from the teacher to the student model, which allows the student to replicate the sophisticated understanding of its more advanced counterpart. This process empowers our lightweight model to maintain its reduced size, fitting embedded systems’ constraints, while still delivering performance on par with the more comprehensive teacher model. An overview of the proposed pig face recognition architecture is shown in Figure 1.
Figure 1. Proposed pig face recognition architecture based on automatic detection and a knowledge distillation technique.
In our study, empirical validation was a critical step. We initiated this process by constructing a small-scale pig face detection dataset, which included annotated images from 20 different pigs. To process this dataset, we employed a state-of-the-art detection model, YOLO-v8 [21], specifically chosen for its efficiency in detecting pig faces. This step was vital in generating a preliminary set of high-quality, relevant pig face images. Following the initial detection stage, we incorporated a filtering module that utilized the Multi-Scale Structural Similarity (MS-SSIM) technique [22]. This module played a crucial role in refining our dataset further by removing images with high similarity. Such a process ensured the diversity and quality of our dataset, laying a solid foundation for the effectiveness of the models to follow. A significant advancement in our approach was the application of relational knowledge distillation [23], conducted offline. This technique allowed us to transfer complex features from a high-capacity Vision Transformer (ViT) [24] to a more streamlined model, ShuffleNetV2 [25]. The primary benefit of this methodology was the considerable enhancement in accuracy in our lightweight pig face recognition model, achieved without increasing its computational demands. This balance of accuracy and efficiency is pivotal in applications where resources, particularly computational ones, are limited.
The contributions of our proposed work can be summarized as follows:
-
We propose an innovative method that integrates a deep learning-based offline knowledge distillation technique with a lightweight model for pig face recognition. This approach effectively balances accuracy with computational efficiency, which is particularly beneficial for smart pig farming environments;
-
We propose a novel, decoupled approach to pig face recognition by separating the detection from the identification process, employing a detection model to create a pre-processing step that significantly reduces the need for exhaustive re-annotation when applying the model to new datasets. This advance facilitates a universal, scalable application across varied datasets, saving considerable time and resources while maintaining high accuracy;
-
We introduce a high-quality, publicly accessible 20-pig face dataset, complete with labeled bounding boxes, which serves as a substantial resource for fostering research in pig face detection and recognition within the realm of computer vision.

3. Materials and Methods

This section details a novel decoupled architecture for pig face recognition, uniquely designed to enhance efficiency and accuracy in identifying individual pigs. The workflow is shown in Figure 2. In the workflow of our proposed system, Step 1 collects pig face data from the installed camera. In Step 2, a filtering module refines the dataset by selecting only high-quality pig face images. Step 3 employs our specialized pig face detection model to automatically extract pig faces from the filtered images. Step 4 utilizes the proposed pig face recognition model based on the knowledge distillation technique to predict the individual pig ID. Central to our approach is the integration of an offline knowledge distillation technique, which significantly improves the performance of a lightweight pig face recognition model. The student and teacher models, together with the offline knowledge distillation method, are introduced in this section. Additionally, this section outlines the data-cleaning techniques and describes the characteristics and assembly of the pig face detection dataset used for training and validation, underscoring the robustness and relevance of the applied methods in practical scenarios.
Figure 2. The workflow of the proposed decoupled pig face recognition system.
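As a concrete illustration of the four-step workflow above, the following minimal Python sketch wires the stages together; the helper names (`similarity_filter`, `detector`, `recognizer`) are hypothetical placeholders rather than functions from the authors' codebase.

```python
# Minimal sketch of the four-step decoupled pipeline described above.
# The callables passed in are hypothetical placeholders, not the paper's code.

def run_pipeline(camera_frames, similarity_filter, detector, recognizer):
    """Step 1: frames captured by the farm camera are passed in as `camera_frames`."""
    ids = []
    for frame in camera_frames:
        # Step 2: keep only frames that pass the quality/similarity filter.
        if not similarity_filter(frame):
            continue
        # Step 3: the detection model crops pig-face regions from the frame.
        for face_crop in detector(frame):
            # Step 4: the distilled lightweight model predicts the pig ID.
            ids.append(recognizer(face_crop))
    return ids
```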

3.1. Dataset Collection, Preparation, and Annotation

Our methodology for creating the pig face detection dataset focused on selecting high-quality images from farm-captured footage of 56 pigs. This approach aimed to optimize the dataset for training an effective face detection model while significantly reducing the annotation workload. Prioritizing high-quality images meant choosing those with clear visibility of facial features, optimal lighting, and minimal background distractions. Out of the extensive footage, we concentrated on 20 pigs and manually annotated around 1500 high-quality images, as shown in Figure 3. The annotated dataset was then divided into training, validation, and testing subsets, following a 6:2:2 ratio, to provide a balanced approach for comprehensive model training and effective validation. Then, we were able to streamline the data preparation process and bolster the detection model’s capability to process and extract valuable data from broader datasets. This methodology is particularly advantageous in smart farming scenarios, where efficient and accurate data processing is crucial.
Figure 3. Sample image visualization from the pig face detection dataset.

3.2. Automatic Pig Face Detection

In developing our two-stage pig face recognition model, we implemented a decoupled approach that distinctly separates the tasks of detection and recognition. This design choice significantly reduces the burden of annotating new data. In this stage, we employed the advanced capabilities of YOLO-v8 to automatically detect pig faces within the dataset. YOLO-v8, known for its speed and accuracy, is well-suited for real-time applications and serves as the backbone for our automated pre-processing.
YOLO-v8 comprises a CSPDarknet backbone, a PAFPN neck, and detection head blocks. The feature extraction part uses CSPDarknet, an improved version of Darknet. CSPDarknet adopts a Cross-Stage Partial (CSP) structure that divides the network into two parts, each containing multiple residual blocks; this structure effectively reduces the number of parameters and the computational cost of the model while improving the efficiency of feature extraction. The object detection part adopts the YOLO-v8 detection head, which consists of multiple convolutional and pooling layers for processing and compressing feature maps, before converting them into detection results through further convolutional and fully connected layers. The detected pig faces are then cropped and fed into the recognition model.
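To make the cropping step concrete, a hedged sketch using the Ultralytics YOLOv8 inference API is shown below; the weights file and frame path are illustrative assumptions, and the attribute names follow the Ultralytics 8.x results interface.

```python
# Hedged sketch: cropping detected pig faces with the Ultralytics YOLOv8 API.
# "pig_face_yolov8.pt" and "pen_frame.jpg" are illustrative placeholders.
from ultralytics import YOLO
import cv2

model = YOLO("pig_face_yolov8.pt")                 # detector trained in stage one
results = model.predict("pen_frame.jpg", conf=0.5)

for i, box in enumerate(results[0].boxes.xyxy):    # one (x1, y1, x2, y2) box per face
    x1, y1, x2, y2 = map(int, box.tolist())
    face = results[0].orig_img[y1:y2, x1:x2]       # crop for the recognition stage
    cv2.imwrite(f"pig_face_{i}.jpg", face)
```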
The YOLO-v8 total loss consists of a class loss, an objectness loss, and a location loss, as given in Equation (1):
$L_{Total} = \lambda_1 L_{cls} + \lambda_2 L_{obj} + \lambda_3 L_{loc}$,    (1)
where $L_{cls}$ is the class loss, representing the error in classifying objects into their correct categories; $L_{obj}$ is the objectness loss, which measures the error in detecting whether a given region contains an object; and $L_{loc}$ is the location loss, which quantifies the error in determining the precise location and size of the detected objects.
To balance the contribution of each component to the total loss, the terms are weighted by their respective coefficients $\lambda_1$, $\lambda_2$, and $\lambda_3$. These coefficients allow fine-tuning of the importance of each component in the final combined loss.

3.3. MS-SSIM Filtering

Subsequently, to further refine the dataset, we utilized the Multi-Scale Structural Similarity Index (MS-SSIM) technique, setting a threshold of 0.5, as depicted in Figure 4. This approach helped in removing redundant images with similar features, thus ensuring a diverse and information-rich dataset that enhanced the model’s learning capability. Lastly, this filtration step refined our dataset to 20,689 distinct images, which constituted our final high-quality dataset for the pig face recognition tasks. MS-SSIM is defined in Equation (2):
$l_{\text{ms-ssim}} = 1 - \prod_{m=1}^{M} \left( \frac{2\mu_p \mu_g + c_1}{\mu_p^2 + \mu_g^2 + c_1} \right)^{\beta_m} \left( \frac{2\sigma_{pg} + c_2}{\sigma_p^2 + \sigma_g^2 + c_2} \right)^{\gamma_m}$,    (2)
Figure 4. The MS-SSIM filters similar images in our pig face dataset.
$M$ refers to the number of scales. $\mu_p$ and $\mu_g$ are the mean values of the predicted image and the ground truth, respectively. $\sigma_p$ and $\sigma_g$ represent the standard deviations of the predicted image and the ground truth, and $\sigma_{pg}$ stands for the covariance between the predicted values and the ground truth. $\beta_m$ and $\gamma_m$ signify the relative importance of the two components. $c_1$ and $c_2$ are constants used to prevent division by zero. In our study, we set $c_1$ and $c_2$ to 6.5025 and 58.5225, respectively.
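A minimal sketch of the redundancy filter is given below, assuming the third-party `pytorch_msssim` package; the greedy keep-or-drop strategy and the 0.5 threshold follow the description above, while the tensor layout is an assumption.

```python
# Hedged sketch of the MS-SSIM de-duplication filter (threshold 0.5), assuming
# the third-party pytorch_msssim package is available.
from pytorch_msssim import ms_ssim

def filter_faces(candidate_faces, threshold=0.5):
    """candidate_faces: list of float tensors shaped (1, 3, H, W), scaled to [0, 1].
    Crops are assumed to be resized to a common size (e.g., 224 x 224); the default
    5-scale MS-SSIM needs inputs of at least ~161 px per side."""
    kept = []
    for img in candidate_faces:
        # keep a face only if it is not too similar (MS-SSIM > threshold) to any kept face
        if all(ms_ssim(img, k, data_range=1.0).item() <= threshold for k in kept):
            kept.append(img)
    return kept
```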

3.4. Proposed Lightweight Pig Face Recognition Method Based on Knowledge Distillation

To trade off accuracy and computational cost, an advanced offline knowledge distillation technique [44] was employed to enhance the accuracy of the lightweight pig face recognition model. In offline knowledge distillation, the student model learns by adopting insights from a teacher model that has already been trained; the teacher model obtains more precise results but is computationally expensive. The lightweight student model can perform better by receiving high-order feature information transferred from the teacher model. To implement this method, the more comprehensive teacher model is first trained on our pig face recognition dataset to establish a robust knowledge base; this teacher model then guides and imparts knowledge to the smaller, more agile student model during the distillation process.
Large-Scale Teacher Model Selection. In the proposed method, choosing an advanced high-precision classification model as the teacher network is a critical step in offline knowledge distillation for enhancing the performance of the student network. The Vision Transformer (ViT) was adopted as our core teacher model due to its exemplary performance. ViT applies the self-attention mechanism to sequences of image patches, enabling the model to capture global dependencies across the entire image. The model streamlines pig face recognition using a Vision Transformer (ViT), as shown in Figure 5. It starts with an input image of dimension $\mathbb{R}^{H \times W \times C}$, which is normalized and resized to $\mathbb{R}^{384 \times 384 \times 3}$. The image is then divided into patches, flattened, and linearly embedded to $\mathbb{R}^{N \times (p^2 \cdot C)}$, with a class token and positional embeddings incorporated, resulting in $\mathbb{R}^{(N+1) \times d}$, where $N$ is the patch count (in this case, 576) and $d$ is the embedding size (768). The Transformer encoder processes these tokens to output $\mathbb{R}^{(N+1) \times d}$, capturing inter-patch dependencies. An MLP head then interprets this feature-rich representation, outputting a classification score that reflects the model’s certainty in the pig face identification.
Figure 5. The architecture of ViT-base as teacher model.
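The shape bookkeeping quoted above can be checked with a few lines of arithmetic; the patch size of 16 is an assumption, but it is the value consistent with N = 576 and d = 768 for a 384 × 384 input.

```python
# Worked check of the ViT-base shape bookkeeping described above
# (patch size 16 is assumed, consistent with N = 576 and d = 768).
H = W = 384; C = 3; p = 16; d = 768

num_patches = (H // p) * (W // p)     # 24 * 24 = 576 patches -> sequence length N
patch_dim   = p * p * C               # 16 * 16 * 3 = 768 values per flattened patch

print(num_patches)                    # 576
print(patch_dim)                      # 768, linearly embedded to d = 768
print(num_patches + 1, d)             # (N + 1, d) after prepending the class token
```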
Lightweight Student Model Selection. For the student model, we opted for the acclaimed lightweight architecture ShuffleNet-V2, leveraging its capabilities to distill the high-order pig face features inherited from our teacher model. ShuffleNet-V2, renowned for its balance between performance and computational efficiency, capitalizes on channel shuffle and pointwise group convolution strategies. This design reduces computational costs while ensuring accuracy. As depicted in Figure 6, the input to ShuffleNet-V2 is an image of dimensions 224 × 224 across 3 RGB channels. Initial layers process this input through convolution, batch normalization, and a Rectified Linear Unit (ReLU) activation. Subsequent operations further refine these features, reducing spatial dimensions. The core of ShuffleNet-V2 lies in its distinct units: Unit 1 and Unit 2. These units integrate operations such as channel splitting, depthwise and pointwise convolutions, and the signature channel shuffling for enhanced cross-channel information flow. The architecture concludes with a series of convolutional and pooling layers, culminating in a fully connected layer. This final layer produces an output corresponding to the distinct pig identity classes in our recognition dataset.
Figure 6. The architecture of Shufflenet-V2 as student model.
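A hedged sketch of instantiating the student is shown below, using torchvision's ShuffleNetV2 (x1.0) implementation, which may differ in detail from the authors' network; the final fully connected layer is replaced with one output per pig identity.

```python
# Hedged sketch: adapting torchvision's ShuffleNetV2 (x1.0) as the student model.
import torch.nn as nn
from torchvision.models import shufflenet_v2_x1_0

num_pig_classes = 56                                   # pig identity classes in the paper's dataset
student = shufflenet_v2_x1_0(weights=None)             # trained from scratch, as in the paper
student.fc = nn.Linear(student.fc.in_features, num_pig_classes)   # new classification head
```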
Relational Knowledge Distillation Technique. In this study, we explore the efficacy of using Relational Knowledge Distillation (RKD) to transfer knowledge from a large Vision Transformer (ViT) teacher model to a compact ShuffleNet V2 student model. The main advantage of RKD lies in its ability to capture and transfer the relational aspects of the teacher model’s feature representations, rather than merely mimicking the direct outputs of the teacher model. This is particularly crucial for knowledge transfer between models with significant structural differences.
We adopted a two-stage distillation strategy (offline distillation). Initially, the ViT teacher model was pre-trained on a large-scale pig face recognition dataset to capture rich feature representations. Subsequently, we transferred the ’knowledge’ of these representations to the ShuffleNet V2 student model via the RKD method. Specifically, we focused on minimizing two principal loss functions: the Distance-Wise Loss and the Angle-Wise Loss. The Distance-Wise Loss encourages the student model to learn the distance relations between features in the teacher model, while the Angle-Wise Loss focuses on the distribution of angles between feature vectors.
In our study, we implemented the Relational Knowledge Distillation (RKD) method to transfer complex relational knowledge from a Vision Transformer (ViT) teacher model to a ShuffleNet V2 student model, focusing on pig face recognition. The RKD process commences with both models undergoing forward propagation; the ViT processes input images and generates high-dimensional feature representations from its Transformer encoder block, while the ShuffleNet V2 produces its feature representations from the last pooling layer. The essence of RKD lies in quantifying and minimizing the relational discrepancies between these feature representations, achieved through the RKD loss that encompasses both distance-wise and angle-wise relational comparisons. During backward propagation, this loss guides the student model in adjusting its parameters to align more closely with the teacher’s feature representations. The student model’s ability to mimic the teacher is refined through iterative updates aimed at reducing the RKD loss, alongside minimizing the classification loss for the primary task of pig face recognition. This training methodology enables the student model to not only acquire a nuanced understanding of pig face features similar to the ViT but to do so within a more compact and computationally efficient framework. The result is a distilled model that retains essential relational knowledge critical for effective recognition, making it well-suited for deployment in environments where computational resources are limited.
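For readers who want the two relational terms spelled out, the following minimal PyTorch sketch implements the distance-wise and angle-wise losses following the formulation of Park et al. [23]; it is a sketch under those assumptions, not the authors' exact implementation.

```python
# Minimal sketch of the distance-wise and angle-wise RKD losses (Park et al. [23]).
import torch
import torch.nn.functional as F

def pdist(e, eps=1e-12):
    """Pairwise Euclidean distances between the rows of e: (N, D) -> (N, N)."""
    sq = e.pow(2).sum(dim=1)
    d2 = (sq.unsqueeze(1) + sq.unsqueeze(0) - 2.0 * e @ e.t()).clamp(min=eps)
    mask = 1.0 - torch.eye(e.size(0), device=e.device)     # zero the diagonal
    return d2.sqrt() * mask

def rkd_distance_loss(s_emb, t_emb):
    """Distance-wise RKD: match the normalized pairwise distance structures."""
    with torch.no_grad():
        t_d = pdist(t_emb)
        t_d = t_d / t_d[t_d > 0].mean()
    s_d = pdist(s_emb)
    s_d = s_d / s_d[s_d > 0].mean()
    return F.smooth_l1_loss(s_d, t_d)

def rkd_angle_loss(s_emb, t_emb):
    """Angle-wise RKD: match the angles formed by every triplet of embeddings."""
    with torch.no_grad():
        td = t_emb.unsqueeze(0) - t_emb.unsqueeze(1)        # (N, N, D) difference vectors
        td = F.normalize(td, p=2, dim=2)
        t_angle = torch.bmm(td, td.transpose(1, 2)).view(-1)
    sd = s_emb.unsqueeze(0) - s_emb.unsqueeze(1)
    sd = F.normalize(sd, p=2, dim=2)
    s_angle = torch.bmm(sd, sd.transpose(1, 2)).view(-1)
    return F.smooth_l1_loss(s_angle, t_angle)
```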
In the distillation training process of the lightweight pig face recognition model, the model minimizes a multi-task loss comprising the distillation loss $L_{RKD}$ and the classification loss $L_{Cls}$, so that the lightweight student model can better learn the feature representations of the teacher model, as shown in Equation (3):
$L_{Total} = L_{RKD} + L_{Cls}$,    (3)
The relational knowledge distillation loss consists of two components, the distance-wise loss $L_{RKD\text{-}D}$ and the angle-wise loss $L_{RKD\text{-}A}$, each taking the general form presented in Equation (4):
$L_{RKD} = \sum_{(x_1, \ldots, x_n) \in \mathcal{X}^N} \ell\big(\psi(t_1, \ldots, t_n), \psi(s_1, \ldots, s_n)\big)$,    (4)
The RKD loss calculates a relational potential $\psi$ for each $n$-tuple of data examples and transfers information from the teacher to the student through this potential. Here, $(x_1, x_2, \ldots, x_n)$ is an $n$-tuple drawn from $\mathcal{X}^N$, $t_i$ and $s_i$ denote the teacher's and student's representations of $x_i$, $\psi$ is a relational potential function that measures the relational energy of the given $n$-tuple, and $\ell$ is a loss that penalizes the difference between the teacher and the student. The classification loss is a logarithmic (cross-entropy) loss over the 56 pig classes and is computed from the output scores.
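A minimal sketch of one distillation step combining Equations (3) and (4) is shown below; it reuses the loss functions sketched earlier and assumes, purely for illustration, that the student returns both an embedding and class logits.

```python
# Minimal sketch of one training step minimizing L_Total = L_RKD + L_Cls (Equation (3)).
# rkd_distance_loss / rkd_angle_loss are the functions sketched above.
import torch
import torch.nn.functional as F

def distill_step(images, labels, teacher, student, optimizer):
    teacher.eval()
    with torch.no_grad():
        t_feat = teacher(images)                 # frozen teacher embeddings
    s_feat, s_logits = student(images)           # assumes student returns (features, logits)

    loss_rkd = rkd_distance_loss(s_feat, t_feat) + rkd_angle_loss(s_feat, t_feat)
    loss_cls = F.cross_entropy(s_logits, labels) # logarithmic loss over the pig classes
    loss = loss_rkd + loss_cls                   # Equation (3)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```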

4. Experiments

In this section, we present the implementation details and show both quantitative and qualitative experimental results on the pig face dataset. These evaluations describe the performance of our proposed method in real pig farming scenarios.

4.1. First-Stage: Automatic Pig Face Detection Module

4.1.1. Implementation Details

To establish an effective pre-trained model for pig face detection, we carefully selected a small-scale dataset comprising 1586 high-quality images of 20 pigs drawn from our 56-pig dataset. These images were systematically divided into training, validation, and testing sets with a distribution ratio of 6:2:2. Each image was manually annotated with bounding boxes using the LabelMe tool to provide precise ground-truth data. This meticulous annotation process was crucial for training a robust model on a smaller dataset, which could then be applied to the larger unannotated dataset of 56 pigs, leveraging the predictive power of the pre-trained model to detect pig faces without the need for further manual labeling.
We implemented four state-of-the-art detection models on our small annotated pig face dataset in the first-stage pig face detection module. We conducted end-to-end training by fine-tuning models pre-trained on the MS-COCO dataset. The training used the SGD optimizer with a learning rate of 0.01, a momentum of 0.937, and a weight decay of 0.0005. The original 2160 × 3840 images were resized to 224 × 224 as input. The final pig face detection weights were obtained after training for 300 epochs on a computer equipped with 4 Titan GPUs. We used a batch size of 64 and several advanced data augmentation techniques, such as Mosaic, RandomAffine, Albumentations, MixUp, and RandomFlip, to increase the variability of the training data.
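For illustration, the fine-tuning configuration described above maps onto the Ultralytics training API roughly as follows; the dataset file name and the augmentation strengths are assumptions, while the optimizer settings follow the text.

```python
# Hedged sketch of the detection-stage fine-tuning using the Ultralytics API.
# "pig_faces.yaml" and the augmentation values are illustrative assumptions.
from ultralytics import YOLO

model = YOLO("yolov8s.pt")                 # COCO-pretrained starting point
model.train(
    data="pig_faces.yaml",                 # 6:2:2 train/val/test split of 1586 images
    epochs=300,
    imgsz=224,                             # 2160 x 3840 frames resized to 224 x 224
    batch=64,
    optimizer="SGD",
    lr0=0.01, momentum=0.937, weight_decay=0.0005,
    mosaic=1.0, mixup=0.1, fliplr=0.5,     # augmentations listed in the text (strengths illustrative)
)
```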
In assessing the models’ performance, the standard object detection metrics AP50–95, AP50, and AP75 were utilized. AP50–95, the average precision over a range of IoU thresholds, was computed by averaging the AP at IoU thresholds from 0.5 to 0.95 in increments of 0.05. The metric AP50 represents the model’s average precision at an IoU threshold of 0.50, while AP75 indicates the same at an IoU of 0.75. These specific metrics, AP50 and AP75, offer a nuanced view of the model’s detection abilities at these particular IoU levels, supplementing the broader assessment provided by AP50–95.
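The averaging behind AP50–95 can be written out explicitly; `ap_at_iou` below is a hypothetical helper that evaluates AP at a single IoU threshold.

```python
# Illustration of how AP50-95 averages AP over IoU thresholds 0.50-0.95 in steps of 0.05.
import numpy as np

def ap_50_95(ap_at_iou):
    """ap_at_iou: hypothetical callable returning the AP at one IoU threshold."""
    thresholds = np.linspace(0.50, 0.95, 10)      # 0.50, 0.55, ..., 0.95
    return float(np.mean([ap_at_iou(t) for t in thresholds]))
```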

4.1.2. Implementation Results

Training and prediction on the small-scale annotated pig face detection dataset. In our experimentation, four YOLO-based object detection architectures (YOLO-v5, YOLO-v6, YOLO-v7, and YOLO-v8) were evaluated to determine their suitability for pig face detection tasks. While YOLO-v8 was our primary detection model due to its advanced architecture and high performance, YOLO-v5, YOLO-v6, and YOLO-v7 served as benchmarks to assess improvements. The results, presented in Table 1, show that YOLO-v8 outperforms its predecessors across various metrics. It achieved the highest average precision (AP) at IoU thresholds of 0.50 (AP50), 0.75 (AP75), and 0.95 (AP95), with values of 0.990, 0.973, and 0.869, respectively, alongside the highest recall of 0.895. These metrics suggest that YOLO-v8 has a superior ability to detect pig faces accurately, even in challenging conditions.
Table 1. Comparative performance metrics of different detection models on the pig face detection dataset.
From the loss curve in Figure 7, YOLO-v8 shows a faster and more consistent decline, indicating a more effective learning process and stabilization compared to YOLO-v5, YOLO-v6, and YOLO-v7. After the initial sharp decrease, YOLO-v8 maintains a steady, low loss, suggesting that it has better generalized the training data without overfitting. The mAP curve of Figure 7 further supports the superiority of YOLO-v8. All models show a rapid increase in mAP during the early epochs, with YOLO-v8 achieving a higher mAP more quickly than its predecessors and maintaining this lead throughout the training process. The mAP for YOLO-v8 plateaus at a higher value, signifying that it has attained better overall detection performance across different object detection thresholds. In addition, the prediction visualization of YOLO-v8 on the testing dataset is displayed in Figure 8. The visualizations highlight the model’s ability to focus on relevant features for pig face detection, confirming its effectiveness as the backbone of our detection module.
Figure 7. The visualization of the training loss and average precision (AP50–95) for the four YOLO models.
Figure 8. Prediction of YOLO-v8 on pig face detection testing dataset.
Given the observed high performance, YOLO-v8 was chosen as the primary detection model for our pig face recognition system. The training weights from YOLO-v8 enable the automated detection of faces in unlabeled datasets, which significantly increases the amount of training data available for our recognition tasks. By leveraging YOLO-v8’s superior training weights, we can enrich our dataset without the extensive manual labeling traditionally required, saving valuable time and resources.

4.1.3. Performance of Our Pre-Trained YOLO-v8 Model on Our Unlabeled 56-Pig Dataset

Building upon the success of our pre-trained YOLO-v8 model in detecting high-quality pig faces, we utilized it to automatically process our unannotated footage of 56 pigs and generate the corresponding pig face dataset. Figure 9 showcases a selection of images from this unannotated set, produced by the YOLO-v8 model. These images illustrate the model’s capability to autonomously detect high-quality pig faces, further emphasizing its utility in significantly reducing the manual labor typically required for such large-scale annotation tasks.
Figure 9. Predicted unlabeled images of 56 pigs using YOLO-v8.

4.2. Second-Stage: Relational Knowledge Distillation-Based Pig Face Recognition

4.2.1. Implementation Details

To validate the efficacy of our methodology, we constructed a dedicated pig face dataset tailored for the pig face recognition task. Utilizing the pre-trained YOLO-v8 model from the previous phase, we processed an extensive collection of images from 56 unlabeled pigs. Through the application of the Multi-Scale Structural Similarity (MS-SSIM) technique, we filtered the dataset to retain images with low similarity, thus ensuring a diverse range of facial features for robust recognition. From this refined dataset, we randomly selected 400 distinct images per pig class, resulting in a comprehensive dataset that spanned 50 categories with a total of 20,000 images. This dataset was thoughtfully partitioned into training, validation, and testing sets with a ratio of 5:2.5:2.5. Such a distribution was designed to optimize model performance by providing a balanced and varied set of examples for training while allowing for thorough evaluation during validation and testing.
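A minimal sketch of the per-class 5:2.5:2.5 split is given below; the function operates on the 400 face crops of a single pig identity, and the shuffling and path handling are illustrative.

```python
# Hedged sketch of the per-class 5:2.5:2.5 split (400 images per pig identity).
import random

def split_class(image_paths, seed=0):
    """image_paths: the 400 face crops belonging to one pig identity."""
    rng = random.Random(seed)
    paths = list(image_paths)
    rng.shuffle(paths)
    n_train = int(0.5 * len(paths))        # 200 training images
    n_val   = int(0.25 * len(paths))       # 100 validation images
    return (paths[:n_train],               # train
            paths[n_train:n_train + n_val],  # validation
            paths[n_train + n_val:])       # test (remaining 100 images)
```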
For the implementation of our method, we selected the Vision Transformer (ViT) as the teacher model and ShuffleNet-V2 as the student model, leveraging Relational Knowledge Distillation for efficient knowledge transfer. The ViT was pre-trained on the ImageNet dataset, providing a solid foundation for feature extraction capabilities. In the training phase for both models, we utilized a Stochastic Gradient Descent (SGD) optimizer with an initial learning rate of 0.1, chosen to ensure good convergence and regularization. A tailored learning rate schedule was employed to adapt the learning rate dynamically during training: the learning rate was held steady for the first 20 epochs to promote stable initial learning, after which a scheduler with polynomial decay gradually reduced the learning rate, allowing fine-tuning of the models’ weights as training progressed. The entire training was set to run for 100 epochs.
For the evaluation of model performance, we used a comprehensive suite of metrics that included top-1 Accuracy, Precision, Recall, and F1 Score, alongside computational metrics such as FLOPs (the number of floating-point operations) and the number of parameters. These metrics provided a holistic assessment of the models’ abilities to recognize pig faces accurately, considering both the proportion of correct identifications and the balance between precision and recall. These detailed configurations of the training and evaluation process were meticulously planned to ensure the robustness and generalizability of the pig face recognition models, aligning with the goal of deploying these models in environments with computational constraints.
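A hedged sketch of the recognition-stage evaluation is shown below, using scikit-learn for the classification metrics and a direct parameter count; FLOPs would additionally require a profiler (e.g., the third-party `thop` package), which is omitted here.

```python
# Hedged sketch of the recognition evaluation metrics (scikit-learn assumed available).
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def evaluate(y_true, y_pred, model):
    acc = accuracy_score(y_true, y_pred)                       # top-1 accuracy
    prec, rec, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0)      # macro-averaged over pig classes
    n_params = sum(p.numel() for p in model.parameters())      # parameter count
    return {"accuracy": acc, "precision": prec, "recall": rec,
            "f1": f1, "params": n_params}
```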

4.2.2. Implementation Result

To validate the effectiveness of our proposed method that integrated the knowledge distillation technique with the lightweight student model, we conducted experiments comparing the performance of the student model, both with and without the distillation from the teacher model. The teacher model used was a Vision Transformer (ViT-base), and the student model was ShuffleNet-V2. The results of these experiments are summarized in Table 2.
Table 2. Result of student model experiments.
The ViT-base teacher model (T) achieved high performance on all metrics, with an Accuracy of 0.972, Precision of 0.978, Recall of 0.970, and an F1 Score of 0.974. These metrics set a benchmark for the student model to aspire to through the distillation process. The FLOPs and parameter count of the teacher model are substantially larger than those of the student, reflecting its heavy computational requirements. The ShuffleNet-V2 student model, when trained independently without knowledge distillation, reached an Accuracy of 0.941, Precision of 0.943, Recall of 0.936, and F1 Score of 0.944. These results, while commendable for a lightweight model, suggested room for improvement in mirroring the teacher model’s performance.
Applying Relational Knowledge Distillation (RKD) to the ShuffleNet-V2 student model yielded a marked improvement in performance. The distilled student model (RKD) achieved an Accuracy of 0.968, Precision of 0.972, and Recall of 0.970, with an F1 Score of 0.973. Notably, the distilled student model attained this enhanced performance while maintaining the same level of computational efficiency as the undistilled model, with 0.15 billion FLOPs and 1.26 million parameters, demonstrating the efficacy of the RKD technique in transferring the teacher’s knowledge to the student model effectively. These results showcase the potential of knowledge distillation as a method to enhance the performance of lightweight models, enabling them to perform complex tasks with a high degree of accuracy while operating within the constraints of limited computational resources.
Figure 10 presents training and Accuracy curves that depict a compelling narrative of the efficacy of our distillation approach in pig face recognition. Notably, the student model, ShuffleNet-V2 with Relational Knowledge Distillation (RKD-S), was trained without leveraging pre-trained weights from public datasets, which is often a common practice to accelerate the initial learning phase. This deliberate choice is reflected in the initial lower accuracy rates compared to the non-distilled ShuffleNet-V2, which benefits transiently from such pre-trained weights. The left graph exhibits the training loss for both models. The RKD-S shows a steep decline from a high initial loss, indicating an intense early learning phase, and quickly stabilizes, demonstrating rapid convergence. This behavior underscores the distilled model’s swift adaptation to the intricacies of the task at hand, mimicking the high-level feature interrelationships learned from the teacher model, ViT-base (T). On the right, the Accuracy curves tell a story of transformation and rapid improvement. After the initial epochs, the RKD-S model’s accuracy surges ahead of the non-distilled ShuffleNet-V2, showcasing the distilled model’s rapid assimilation of the teacher’s knowledge. The student model’s performance not only overtakes its non-distilled counterpart, but also exhibits a clear trend toward the teacher model’s high-accuracy benchmark. The accuracy of the ViT model depicted in the graph represents the zenith of our training efforts with the ViT model, serving as the ideal teacher model’s weights for the distillation process.
Figure 11’s confusion matrices reveal the performance differences between the Vision Transformer (ViT), ShuffleNet-V2, and ShuffleNet-V2 with Relational Knowledge Distillation (RKD). ViT shows high classification accuracy, consistent with its advanced design. ShuffleNet-V2 displays some misclassifications, as expected from a lightweight model. However, the RKD version of ShuffleNet-V2 shows marked improvement, with a confusion matrix that approaches the clarity of ViT, suggesting that knowledge distillation effectively enhances accuracy while maintaining computational efficiency.
Figure 10. Training loss and Accuracy curve on teacher, student, and distilled student models. The figure displays the respective progression of training loss and Accuracy for the ViT-base teacher model, the standalone ShuffleNet-V2 student model, and the ShuffleNet-V2 student model after undergoing Relational Knowledge Distillation (RKD).
Figure 11. The confusion matrix on the teacher model, lightweight model, and distilled student model.
These visual analyses not only confirm the superiority of the ViT-base as a teacher model but also underscore the effectiveness of the RKD process in bridging the performance gap between the teacher and student models. The maintained computational efficiency, combined with the notable gains in accuracy, showcases the potential of RKD in optimizing lightweight models for real-world applications where resource efficiency is paramount.

4.3. Ablation Studies

Large-Scale Teacher Model Comparisons. This section illustrates a comparative study of various advanced architectures for the selection of an optimal teacher model in our knowledge distillation process for pig face recognition. The analysis focused on ResNet-152, Vision Transformer (ViT-base), and Res2Net-101, each differing in design principles and performance metrics. Figure 12 conveys the training dynamics of each model, with the ViT-base model exhibiting a steep initial loss reduction, stabilizing at a low value, indicating its capacity for rapid learning. Its corresponding accuracy graph shows a trajectory of swift improvement, where, despite starting from a lower value due to the lack of pre-trained weights, it ascends rapidly, surpassing the other models and nearing the top performance expected of a teacher model. Table 3 confirms the superior capabilities of the ViT-base, with the highest Accuracy and Recall scores, suggesting its potential as a teacher model. Although ViT-base demands more computational resources, as indicated by its FLOPs and parameters, its performance in our experiments justifies its selection as the teacher model in the distillation process. The trade-off between the computational cost and the performance gain is deemed acceptable, particularly because the distilled student model, informed by ViT-base, effectively bridges the gap, delivering high accuracy with a reduced complexity suitable for embedded systems in smart farming applications.
Figure 12. Comparison of teacher models’ loss and Accuracy curves.
Table 3. Comparison of state-of-the-art teacher models for the pig face recognition task.
Lightweight Student Model Selection. In the quest to optimize a lightweight student model for pig face recognition, our exploration led us to evaluate several architectures renowned for their efficiency. The graph accompanying this text delineates the training loss and accuracy over epochs for these contenders, including ShuffleNet-V1, ShuffleNet-V2, and MobileNet-V2. The left side of Figure 13 showcases the three models’ training loss. The training loss and Accuracy graphs depict the learning efficiency of various models. Initially, ShuffleNet-V2 shows a higher loss compared to MobileNet-V2 and ShuffleNet-V1 but rapidly converges to a lower loss, indicating effective learning. In terms of accuracy, ShuffleNet-V2 demonstrates a remarkable improvement over epochs, surpassing the other models and approaching the performance of more complex architectures. This is further corroborated by Table 4, which shows ShuffleNet-V2 achieving the highest F1 Score, denoting a superior balance between Precision and Recall, ultimately making it the preferred student model for knowledge distillation in pig face recognition tasks.
Figure 13. Comparison of student models’ loss and Accuracy curves.
Table 4. Comparison of lightweight state-of-the-art student models for the pig face recognition task.
State-of-the-art Knowledge Distillation Method Comparison for Pig Face Recognition. A critical aspect of our research was to evaluate the effectiveness of different knowledge distillation (KD) techniques applied to the student model in the context of pig face recognition. To this end, we compared several state-of-the-art distillation methods, with the aim of identifying which method best leverages the teacher model’s knowledge to enhance the student model’s performance. The evaluated methods included Relational Knowledge Distillation (RKD), Weighted Soft Label Distillation (WSLD), and Traditional Knowledge Distillation (KD). Table 5 presents the comparative results of these methods. In distillation experiments, we selected ViT-base and ShuffleNet-V2 as teacher and student models, respectively. RKD emerged as the leading technique, achieving the highest Accuracy of 0.968 and an F1 Score of 0.972. These results indicate that RKD effectively distilled the relational inductive biases of the teacher model, leading to superior performance of the student model. WSLD, while slightly lagging behind in Accuracy and F1 Score, still demonstrated a considerable enhancement over the baseline, suggesting that weighted soft labels can contribute to the student model’s learning, albeit not as effectively as RKD in this instance. The traditional KD approach also showed a significant improvement in the student model’s capabilities, with an accuracy of 0.960 and an F1 Score of 0.966. This underscores the value of KD as a reliable method for transferring knowledge, despite being outperformed by the more advanced RKD technique in our experiments. These findings highlight the potential of advanced distillation methods in improving the performance of lightweight models for complex tasks such as pig face recognition. The successful application of these techniques paves the way for deploying highly accurate yet computationally efficient models in practical, real-world scenarios within the domain of smart farming.
Table 5. Comparison of state-of-the-art knowledge distillation methods applied to the pig face recognition task.

5. Conclusions

In this study, we have successfully addressed the challenging task of pig face recognition by developing a comprehensive approach that leverages the strengths of deep learning models while considering the computational constraints of practical applications. Through the strategic selection of a Vision Transformer (ViT) as our teacher model and ShuffleNet-V2 as our student model, we have demonstrated the effectiveness of knowledge distillation in the domain of intelligent livestock management. Our method began with the curation of a high-quality pig face dataset, where we utilized a pre-trained YOLO-v8 model to detect and filter high-quality images from a large corpus of unannotated videos. This step not only significantly reduced the need for manual annotation but also ensured that the data used for training were of high fidelity, which is crucial for the success of any recognition task.
The evaluation of various state-of-the-art knowledge distillation techniques showed that Relational Knowledge Distillation (RKD) was particularly effective in enhancing the performance of the lightweight ShuffleNet-V2 model. By distilling knowledge from the ViT, the student model achieved impressive Accuracy and F1 scores, making it highly suitable for deployment in resource-constrained environments. These results open up new possibilities for the application of AI in agriculture, particularly in the monitoring and management of livestock.
In the practical application of our pig face recognition system, pig face data are first captured by cameras installed on the pig farm. These raw data are then processed through our filtering model, which selects high-quality images. Subsequently, these filtered images are processed by our pig face detection model to automatically generate pig face data. The final step involves our distilled lightweight recognition model predicting the unique ID of each pig. By automating the entire process of pig face recognition, our system significantly reduces the manual labor typically involved in monitoring and managing livestock. This not only leads to a reduction in operational costs but also enhances the accuracy of tracking and health management of each pig. The economic efficiency of our system is evident in its ability to process large data volumes quickly and accurately, making it an invaluable tool for modern smart farming practices. The integration of this technology in a farm setting translates into tangible benefits such as improved livestock management, better health monitoring, and enhanced productivity, all contributing to a more economically efficient farming operation. The ability to accurately recognize individual animals in a non-invasive manner can significantly contribute to the well-being and productivity of the animals, as well as the efficiency of farm operations.
Future work may explore the integration of these models into a real-time monitoring system, the expansion of the dataset to include more diverse environmental conditions, and the adaptation of the approach to other animal recognition tasks. Furthermore, continued advancements in model compression and edge computing could further enhance the deployment of these deep-learning models directly on farm sites, making intelligent livestock management more accessible and impactful. Our research presents a significant step forward in the utilization of deep learning for smart farming solutions, and we anticipate that our findings will facilitate further innovations in the field.

Author Contributions

R.M. and S.C.K. contributed to the framework of pig face recognition system design. Data collection, H.A. and S.C.; experiment implementation and writing, R.M. and S.C.K.; supervision, S.C.K. and H.K.; funding acquisition, S.C.K. and H.K. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Korea Institute of Planning and Evaluation for Technology in Food, Agriculture and Forestry (IPET), the Korea Smart Farm R&D Foundation through Smart Farm Innovation Technology Development Program, funded by the Ministry of Agriculture, Food and Rural Affairs (MAFRA) and Ministry of Science and ICT (MSIT) and Rural Development Administration (RDA) (1545027910, 1545027587), and by the National Research Foundation (NRF), funded by the Ministry of Education (NRF-2019R1A6A1A09031717).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The data presented in this study are available in the article.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Sihalath, T.; Basak, J.K.; Bhujel, A.; Arulmozhi, E.; Moon, B.E.; Kim, H.T. Pig identification using deep convolutional neural network based on different age range. J. Biosyst. Eng. 2021, 46, 182–195. [Google Scholar] [CrossRef]
  2. Salman, M.D. Surveillance and monitoring systems for animal health programs and disease surveys. In Animal Disease Surveillance and Survey Systems: Methods and Applications; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 2003; pp. 3–13. [Google Scholar]
  3. Kaur, S.; Singla, J.; Nkenyereye, L.; Jha, S.; Prashar, D.; Joshi, G.P.; El-Sappagh, S.; Islam, M.S.; Islam, S.R. Medical diagnostic systems using artificial intelligence (ai) algorithms: Principles and perspectives. IEEE Access 2020, 8, 228049–228069. [Google Scholar] [CrossRef]
  4. Voulodimos, A.S.; Patrikakis, C.Z.; Sideridis, A.B.; Ntafis, V.A.; Xylouri, E.M. A complete farm management system based on animal identification using RFID technology. Comput. Electron. Agric. 2010, 70, 380–388. [Google Scholar] [CrossRef]
  5. Rajaraman, V. Radio frequency identification. Resonance 2017, 22, 3–13. [Google Scholar] [CrossRef]
  6. Adrion, F.; Kapun, A.; Eckert, F.; Holland, E.M.; Staiger, M.; Götz, S.; Gallmann, E. Monitoring trough visits of growing-finishing pigs with UHF-RFID. Comput. Electron. Agric. 2018, 144, 105386. [Google Scholar] [CrossRef]
  7. Maselyne, J.; Saeys, W.; Briene, P.; Mertens, K.; Vangeyte, J.; De Ketelaere, B.; Hessel, E.F.; Sonck, B.; Van Nuffel, A. Methods to construct feeding visits from RFID registrations of growing-finishing pigs at the feed trough. Comput. Electron. Agric. 2021, 128, 9–19. [Google Scholar] [CrossRef]
  8. Ahmad, M.; Ghazal, T.M.; Aziz, N. A survey on animal identification techniques past and present. Int. J. Comput. Innov. Sci. 2022, 1, 27–32. [Google Scholar]
  9. Liu, L.; Chen, M.; Chen, X.; Zhu, S.; Tan, P. GB-CosFace: Rethinking softmax-based face recognition from the perspective of open set classification. arXiv 2021, arXiv:2111.11186. [Google Scholar]
  10. Awad, A.I. From classical methods to animal biometrics: A review on cattle identification and tracking. Comput. Electron. Agric. 2016, 123, 423–435. [Google Scholar] [CrossRef]
  11. Laishram, M.; Mandal, S.N.; Haldar, A.; Das, S.; Bera, S.; Samanta, R. Biometric identification of Black Bengal goat: Unique iris pattern matching system vs. deep learning approach. Anim. Biosci. 2023, 36, 980. [Google Scholar] [CrossRef]
  12. Zhao, J.; Li, A.; Jin, X.; Pan, L. Technologies in individual animal identification and meat products traceability. Biotechnol. Biotechnol. Equip. 2020, 34, 48–57. [Google Scholar] [CrossRef]
  13. Meng, Q.; Zhao, S.; Huang, Z.; Zhou, F. Magface: A universal representation for face recognition and quality assessment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 14225–14234. [Google Scholar]
  14. Riekert, M.; Klein, A.; Adrion, F.; Hoffmann, C.; Gallmann, E. Automatically detecting pig position and posture by 2D camera imaging and deep learning. Comput. Electron. Agric. 2020, 174, 105391. [Google Scholar] [CrossRef]
  15. Billah, M.; Wang, X.; Yu, J.; Jiang, Y. Real-time goat face recognition using convolutional neural network. Comput. Electron. Agric. 2022, 173, 105386. [Google Scholar] [CrossRef]
  16. Xu, B.; Wang, W.; Guo, L.; Chen, G.; Li, Y.; Cao, Z.; Wu, S. CattleFaceNet: A cattle face identification approach based on RetinaFace and ArcFace loss. Comput. Electron. Agric. 2020, 193, 106675. [Google Scholar] [CrossRef]
  17. Manna, A.; Upasani, N.; Jadhav, S.; Mane, R.; Chaudhari, R.; Chatre, V. Bird Image Classification using Convolutional Neural Network Transfer Learning Architectures. Int. J. Adv. Comput. Sci. Appl. 2023, 14. [Google Scholar] [CrossRef]
  18. Chen, C.; Zhu, W.; Steibel, J.; Siegford, J.; Wurtz, K.; Han, J.; Norton, T. Recognition of aggressive episodes of pigs based on convolutional neural network and long short-term memory. Comput. Electron. Agric. 2020, 173, 105166. [Google Scholar] [CrossRef]
  19. Li, Z.; Lei, X.; Liu, S. A lightweight deep learning model for cattle face recognition. Comput. Electron. Agric. 2022, 195, 106848. [Google Scholar] [CrossRef]
  20. Duong, C.N.; Quach, K.G.; Jalata, I.; Le, N.; Luu, K. Mobiface: A lightweight deep learning face recognition on mobile devices. In Proceedings of the 2019 IEEE 10th International Conference on Biometrics Theory, Applications and Systems (BTAS), Tampa, FL, USA, 23–26 September 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1–6. [Google Scholar]
  21. Jocher, G.; Chaurasia, A.; Qiu, J. YOLO by Ultralytics. 2023. Available online: https://github.com/ultralytics/ultralytics (accessed on 20 April 2023).
  22. Khalaf, H.A.; Tolba, A.S.; Rashid, M.Z. Event triggered intelligent video recording system using MS-SSIM for smart home security. AIN Shams Eng. J. 2020, 9, 1527–1533. [Google Scholar] [CrossRef]
  23. Park, W.; Kim, D.; Lu, Y.; Cho, M. Relational knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3967–3976. [Google Scholar]
  24. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  25. Ma, N.; Zhang, X.; Zheng, H.T.; Sun, J. Shufflenet v2: Practical guidelines for efficient cnn architecture design. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 116–131. [Google Scholar]
  26. Shao, H.; Pu, J.; Mu, J. Pig-posture recognition based on computer vision: Dataset and exploration. Animals 2021, 11, 1295. [Google Scholar] [CrossRef]
  27. Wada, N.; Shinya, M.; Shiraishi, M. Pig Face Recognition Using Eigenspace Method. ITE Trans. Media Technol. Appl. 2013, 1, 328–332. [Google Scholar]
  28. Hansen, M.F.; Smith, M.L.; Smith, L.N.; Salter, M.G.; Baxter, E.M.; Farish, M.; Grieve, B. Towards on-farm pig face recognition using convolutional neural networks. Comput. Ind. 2018, 98, 145–152. [Google Scholar] [CrossRef]
  29. Marsot, M.; Mei, J.; Shan, X.; Ye, L.; Feng, P.; Yan, X.; Li, C.; Zhao, Y. An adaptive pig face recognition approach using Convolutional Neural Networks. Comput. Electron. Agric. 2020, 173, 105386. [Google Scholar] [CrossRef]
  30. Wang, Z.; Liu, T. Two-stage method based on triplet margin loss for pig face recognition. Comput. Electron. Agric. 2020, 194, 106737. [Google Scholar] [CrossRef]
  31. Tan, M.; Pang, R.; Le, Q.V. Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, DC, USA, 14–19 June 2020; pp. 10781–10790. [Google Scholar]
  32. Gou, J.; Yu, B.; Maybank, S.J.; Tao, D. Knowledge distillation: A survey. Int. J. Comput. Vis. 2021, 129, 1789–1819. [Google Scholar] [CrossRef]
  33. Bucilua, C.; Caruana, R.; Niculescu-Mizil, A. Model compression. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Philadelphia, PA, USA, 20–23 August 2006. [Google Scholar]
  34. Hinton, G.; Vinyals, O.; Dean, J. Distilling the knowledge in a neural network. arXiv 2015, arXiv:1503.02531. [Google Scholar]
  35. Meng, Z.; Li, J.; Zhao, Y.; Gong, Y. Conditional teacher-student learning. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 6445–6449. [Google Scholar]
  36. Kim, S.W.; Kim, H.E. Transferring knowledge to smaller network with class-distance loss. In Proceedings of the ICLRW, Toulon, France, 24–26 April 2017. [Google Scholar]
  37. Müller, R.; Kornblith, S.; Hinton, G.E. When does label smoothing help? Adv. Neural Inf. Process. Syst. 2019, 32. [Google Scholar]
  38. Zhang, D.; Yin, J.; Zhu, X.; Zhang, C. Network representation learning: A survey. IEEE Trans. Big Data 2018, 174, 3–28. [Google Scholar] [CrossRef]
  39. Romero, A.; Ballas, N.; Kahou, S.E.; Chassang, A.; Gatta, C.; Bengio, Y. Fitnets: Hints for thin deep nets. arXiv 2020, arXiv:1412.6550. [Google Scholar]
  40. Passalis, N.; Tefas, A. Learning deep representations with probabilistic knowledge transfer. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 268–284. [Google Scholar]
  41. Chen, D.; Mei, J.P.; Zhang, Y.; Wang, C.; Wang, Z.; Feng, Y.; Chen, C. Cross-layer distillation with semantic calibration. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtually, 2–9 February 2021; pp. 7028–7036. [Google Scholar]
  42. Liu, Y.; Cao, J.; Li, B.; Yuan, C.; Hu, W.; Li, Y.; Duan, Y. Knowledge distillation via instance relationship graph. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7096–7104. [Google Scholar]
  43. Tung, F.; Mori, G. Similarity-preserving knowledge distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1365–1374. [Google Scholar]
  44. Park, W.; Kim, D.; Lu, Y.; Cho, M. Ensemble knowledge distillation for learning improved and efficient networks. arXiv 2019, arXiv:1909.08097. [Google Scholar]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
