Article

A Lightweight Facial Landmark Recognition Model for Individual Sheep Based on SAMS-KLA-YOLO11

College of Software, Shanxi Agricultural University, Jinzhong 030801, China
*
Author to whom correspondence should be addressed.
Agriculture 2026, 16(2), 151; https://doi.org/10.3390/agriculture16020151
Submission received: 14 November 2025 / Revised: 22 December 2025 / Accepted: 6 January 2026 / Published: 7 January 2026
(This article belongs to the Special Issue Computer Vision Analysis Applied to Farm Animals)

Abstract

Accurate and non-contact identification of individual sheep is important for intelligent livestock management, but remains challenging due to subtle inter-individual differences, breed-dependent facial morphology, and complex farm environments. This study proposes a lightweight sheep face detection and keypoint recognition framework based on an improved YOLO11 architecture, termed SAMS-KLA-YOLO11. The model incorporates a Sheep Adaptive Multi-Scale Convolution (SAMSConv) module to enhance feature extraction across breed-dependent facial scales, a Keypoint-Aware Lightweight Attention (KLAttention) mechanism to emphasize biologically discriminative facial landmarks, and the Efficient IoU (EIoU) loss to stabilize bounding box regression. A dataset of 3860 images from 68 individuals belonging to three breeds (Hu, Dorper, and Dorper × Hu crossbreeds) was collected under unconstrained farm conditions and annotated with five facial keypoints. On this dataset, the proposed model achieves higher precision, recall, and mAP than several mainstream YOLO-based baselines, while reducing FLOPs and parameter count compared with the original YOLO11. Additional ablation experiments confirm that each proposed module provides complementary benefits, and OKS-based evaluation shows accurate facial keypoint localization. All results are obtained on a single, site-specific dataset without external validation or on-device deployment benchmarks, so the findings should be viewed as an initial step toward practical sheep face recognition rather than definitive evidence of large-scale deployment readiness.

1. Introduction

As China’s livestock industry advances toward intensification and intelligent automation, individual identification technology has become a fundamental component for improving farming efficiency and enabling precision management. In this context, non-contact biometric recognition based on computer vision has emerged as a prominent research direction in smart farming, owing to its non-invasive nature and seamless compatibility with automated systems [1]. Sheep individual identification plays a critical role in modern livestock monitoring and management, supporting applications such as health assessment, breeding management, and behavior analysis [2]. Traditional identification methods, including ear tags and radio-frequency identification (RFID), remain widely used but suffer from inherent limitations such as detachment risk, animal stress, and high labor dependency. In contrast, vision-based non-contact biometric approaches, particularly facial recognition, have gained increasing attention due to their operational convenience, reliable traceability, and ease of integration into automated workflows [3,4].
With the rapid development of deep learning, vision-based non-contact recognition has demonstrated strong potential in livestock applications such as cattle and pigs [5]. Luan et al. (2024) improved feature extraction structures to enhance robustness under complex farm environments [6]. Huang et al. (2024) introduced a pose-aware YOLOv7-Pose framework for precise cattle facial keypoint localization, providing a basis for subsequent behavior analysis [7]. In pig face recognition, Xie et al. (2022) incorporated channel attention mechanisms to enhance discriminative feature learning in large-scale farming scenarios [8]. Mark et al. (2018) developed a non-invasive imaging system for automatic pig identification at watering stations, highlighting the feasibility of practical on-farm deployment [9]. Mathieu et al. (2020) further combined Haar features with convolutional neural networks to improve pig face detection performance in real environments [10].
Sheep faces typically exhibit dense hair coverage, limited texture variation, and high morphological similarity among individuals, which constrains the effectiveness of holistic facial representations [11]. In real farming environments, variations in illumination, head pose, and partial occlusion further degrade recognition stability [12]. Pang et al. (2023) evaluated conventional deep learning models on the SheepFace-107 dataset and observed notable performance degradation under non-frontal conditions [13]. Moreover, the limited computational resources commonly available in pasture environments impose strict constraints on model complexity, making it difficult for existing high-capacity models to achieve a practical balance between accuracy and efficiency [14].
In recent years, sheep-specific face recognition methods have been actively explored. Li et al. (2022) proposed a hybrid architecture combining MobileNetV2 and Transformer to balance lightweight design and global feature modeling [15]. However, its adaptability to multi-breed scenarios and robustness under occlusion remain limited. Zhang et al. (2022) enhanced MobileFaceNet with efficient channel attention to strengthen feature focus on key facial regions such as eyes and nose [16], yet the model exhibits reduced robustness under large pose variations and breed-dependent morphological differences. Li et al. (2023) introduced a lightweight base module, Eblock, to alleviate the trade-off between accuracy and inference speed [17], but feature degradation still occurs under hair occlusion or pose deviation. Zhang et al. (2023) further reported that reliance on holistic facial features leads to significant performance degradation under severe pose changes or partial occlusions [18]. Although Hao et al. (2024) incorporated facial keypoint associations into an SSD-based framework for Small-tailed Han sheep recognition [19], keypoint information was mainly used for alignment, without being deeply integrated into the feature extraction process, limiting its discriminative contribution.
Despite these advances, sheep face recognition presents unique challenges compared with other livestock species. Sheep faces are characterized by relatively uniform texture, dense hair coverage, and subtle inter-individual differences, which complicate discriminative feature extraction. In real farming environments, variations in illumination, head pose, and partial occlusion further degrade recognition reliability. Moreover, sheep face morphology differs substantially across breeds, particularly in ear shape, size, and spatial distribution, leading to pronounced scale variation among facial key regions. These characteristics make sheep face recognition significantly more challenging than cattle or pig face recognition, especially in mixed-breed farming scenarios.
Despite recent progress in deep learning–based animal face recognition, accurate sheep face keypoint detection in real farm environments remains a challenging task. Unlike controlled laboratory settings, sheep face images exhibit substantial variations in breed-specific morphology, pose, illumination, and occlusion. In particular, noticeable differences in facial structure—such as wide and deformable ears in Hu sheep versus compact ears in Dorper sheep—introduce significant scale variation across key facial regions, which limits the effectiveness of conventional fixed-receptive-field convolution operations. Moreover, existing one-stage detection frameworks typically treat all spatial regions equally during feature extraction, without explicitly emphasizing anatomically or functionally important facial keypoints. This lack of keypoint-aware feature modeling weakens the discriminative representation of critical regions such as ears and eyes, especially under pose changes or partial occlusion. In addition, commonly used bounding box regression losses impose rigid geometric constraints, which may lead to suboptimal localization when handling sheep faces with diverse shapes and proportions across breeds. These challenges highlight the need for a sheep-specific keypoint recognition framework that can adaptively model multi-scale facial features, selectively enhance discriminative keypoint regions, and achieve stable and precise localization across heterogeneous sheep breeds and complex farm conditions.
To address the above challenges, this study proposes a lightweight sheep face keypoint recognition framework based on an improved YOLO11 architecture, termed SAMS-KLA-YOLO11. The proposed model targets multi-breed recognition scenarios involving Hu sheep, Dorper sheep, and Dorper × Hu crossbreeds. Three key improvements are introduced. First, a Sheep Adaptive Multi-Scale Convolution (SAMSConv) module is designed to enhance feature extraction across different facial scales, enabling better adaptation to breed-specific morphological differences. Second, a Keypoint-Aware Lightweight Attention (KLAttention) mechanism is incorporated to integrate the biological significance of annotated facial keypoints into the feature learning process, guiding the network to focus on highly discriminative regions. Third, the Efficient IoU (EIoU) loss is adopted to improve bounding box regression stability and localization accuracy. Experimental results on a multi-breed sheep face dataset demonstrate that the proposed method achieves high recognition accuracy while maintaining a lightweight model structure. This study provides a practical and extensible solution for sheep face recognition in mixed-breed farming environments and offers methodological support for future intelligent livestock management applications.

2. Data Acquisition and Feature Engineering

2.1. Data Source and Collection

The dataset used in this study was collected at the Animal Science and Technology Experimental Station of Shanxi Agricultural University to ensure data standardization and scientific reliability. Following the dataset construction guidelines proposed by Alam Noor et al. [20], image acquisition focused on three representative sheep breeds widely raised in large-scale farming systems: Hu sheep, Dorper sheep, and Dorper × Hu crossbred sheep. These breeds were selected due to their extensive populations in Shanxi Province and surrounding regions, as well as their pronounced differences in facial morphology and appearance characteristics. Hu sheep are characterized by high fecundity and large, drooping ears, Dorper sheep typically exhibit compact ears with a distinctive black head and white body, and Dorper × Hu crossbred sheep display substantial phenotypic variability by inheriting traits from both parental breeds. As illustrated in Figure 1, these breeds differ markedly in ear shape, facial contour, and hair color and texture, posing considerable challenges for robust sheep face recognition and providing suitable conditions for evaluating the generalization capability of the proposed model across multi-breed scenarios [21].
Image acquisition was performed using a digital single-lens reflex (DSLR) camera (Canon EOS 70D, Canon Inc., Tokyo, Japan) and a smartphone camera (iPhone 14, Apple Inc., Cupertino, CA, USA) to introduce diversity in image resolution and color characteristics. Data collection was conducted periodically from November 2024 to March 2025, with an acquisition interval of approximately 1–2 days between sessions. During each acquisition session, around 50 images were captured under natural farm conditions. For each individual sheep, approximately 40–70 images were collected across multiple sessions. All images were acquired under unconstrained conditions without enforcing specific head poses. Sheep were recorded in their natural postures during routine activities such as standing, feeding, and resting, resulting in a diverse distribution of facial orientations rather than predominantly frontal views. This acquisition strategy was designed to reflect realistic visual conditions encountered in practical pasture environments as shown in Figure 2, rather than idealized or laboratory-controlled settings.

To enhance robustness against real-world interference, natural variations commonly observed in farming environments were intentionally retained in the dataset. These include moderate occlusion caused by wool or feed residues, complex backgrounds involving pen structures or neighboring sheep, and diverse illumination conditions ranging from natural daylight to partial backlighting. Such factors contribute to increased variability and challenge for the recognition task, thereby improving the practical relevance of the dataset.

Quality control was conducted through a multi-stage filtering process. Immediately after each acquisition session, an initial manual screening was performed to remove images in which two or fewer facial keypoints were visible, as these samples provide limited value for reliable keypoint learning. Subsequently, images with low resolution (below 500 × 500 pixels), severe occlusion (more than 50% of keypoints obscured), or extreme head pose (facial rotation exceeding 90°) were excluded. After filtering, a total of 3860 valid images were retained, comprising 1289 images of Hu sheep, 1242 images of Dorper sheep, and 1327 images of Dorper × Hu crossbred sheep, ensuring relatively balanced representation across breeds. All retained images were converted to JPEG format and resized to 640 × 640 pixels to match the input requirements of the proposed model. The final dataset reflects realistic multi-breed sheep face recognition scenarios and provides a reliable foundation for evaluating model robustness and generalization under practical farming conditions.

2.2. Labeling Methods and Rules

In this study, a two-step annotation strategy of “detection box first, keypoints second” was adopted, and all images were annotated in detail with the LabelMe tool. First, the sheep face detection box was labeled: a rectangular box was drawn to completely cover the region from the leftmost point of the left ear root to the rightmost point of the right ear root, and from the top of the forehead to the tip of the nose, while strictly excluding background interference such as the body, limbs, and pen facilities. The box coordinates were recorded in the standard YOLO format (x_center, y_center, width, height), with all values normalized to the interval [0, 1] [22]. Based on the detection box annotation, five keypoints, namely the left eye, right eye, nose, left ear root, and right ear root, were marked in a fixed order. The labeling process strictly followed the physiological structure of the sheep face [13], and breed-specific criteria were formulated. The left and right eyes were annotated at the geometric center of the iris; for Hu sheep and Dorper × Hu crossbred sheep, occlusion by the long hair around the eyes was avoided, while for Dorper sheep the boundary between the iris and the dark periocular hair was distinguished, and the angle between the interocular line and the horizontal axis was kept within ±2°. The nose keypoint was marked at the front of the nose tip: for Hu sheep it was placed at the center of the black nasal planum, whereas for Dorper sheep and Dorper × Hu crossbred sheep the boundary between the nasal planum and the nostrils was identified. The left and right ear roots were marked at the bottom of the junction between each ear and the head; for Hu sheep, the drooping ear-fold area was avoided, while for Dorper sheep the junction between the short ear and the head was marked precisely, and the deviation between the line connecting the two ear roots and the horizontal axis was kept within ±3°.
To ensure annotation quality, all labeling was completed by three researchers, and cross-checking was performed to improve consistency [10]. The final dataset consisted of 3860 valid samples, which were randomly divided into a training set (2702 samples), a validation set (772 samples), and a test set (386 samples) at a ratio of 7:2:1, ensuring balanced breed and scene distributions across the subsets. All annotation results were converted into YOLO-format label files for subsequent model training and validation. For the keypoint detection task, a keypoint was considered “visible” and valid only if its predicted confidence score exceeded 0.6; keypoints below this threshold were marked as occluded and assigned a visibility value of 0. The labeling results are shown in Figure 3.
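To make the label structure concrete, the following minimal Python sketch parses one annotation line under the assumption that the files follow the Ultralytics YOLO pose convention (class index, normalized box, then (x, y, visibility) triplets for the five keypoints); the numeric values and the field layout shown here are illustrative assumptions, not an excerpt from the actual dataset.

```python
# Sketch of parsing one assumed YOLO-pose style label line:
# class, box (x_center, y_center, w, h), then (x, y, visibility) per keypoint.
# All values below are hypothetical and for illustration only.
line = "0 0.512 0.430 0.280 0.310 0.45 0.38 2 0.58 0.38 2 0.51 0.52 2 0.40 0.30 2 0.63 0.29 2"

fields = [float(v) for v in line.split()]
class_id = int(fields[0])                      # breed / individual class index
x_c, y_c, w, h = fields[1:5]                   # normalized bounding box
keypoints = [tuple(fields[i:i + 3])            # (x, y, visibility) per keypoint
             for i in range(5, len(fields), 3)]

print(class_id, (x_c, y_c, w, h))
print(keypoints)  # left eye, right eye, nose, left ear root, right ear root
```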

3. SAMS-KLA-YOLO11 Recognition Model

3.1. Original YOLO11 Model Framework

YOLO11 was selected as the baseline architecture in this study because it is an advanced one-stage object detection and keypoint recognition framework that achieves a favorable trade-off between accuracy and computational complexity, making it suitable for resource-constrained farm environments [23]. Unlike general-purpose pose estimation frameworks, keypoint prediction in this work is employed as an auxiliary task to enhance detection robustness and individual discrimination. Given a 640 × 640 input image, the network extracts multi-scale features and performs joint face detection and keypoint localization through a unified architecture, where the detection branch predicts the bounding box parameters (x, y, w, h) and the keypoint branch outputs the 2D coordinates of five facial keypoints. During training, CIoU loss and cross-entropy loss are adopted for bounding box regression and classification, respectively. The overall architecture of the original YOLO11 is illustrated in Figure 4, and detailed structural information can be found in the official implementation.
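For reference, the baseline behavior described above can be reproduced with the Ultralytics Python API, as in the hedged sketch below; the checkpoint name "yolo11n-pose.pt" refers to the generic pretrained pose model distributed with that library, not to the model trained in this study, and the input image path is hypothetical.

```python
# Minimal inference sketch using the Ultralytics API (assumed available).
from ultralytics import YOLO

model = YOLO("yolo11n-pose.pt")          # one-stage detector with a keypoint head
results = model("sheep_face.jpg", imgsz=640)

for r in results:
    print(r.boxes.xywh)                  # face bounding boxes (x, y, w, h)
    print(r.keypoints.xy)                # predicted 2D keypoint coordinates
```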
Despite its effectiveness, the original YOLO11 still exhibits several limitations when applied to sheep face keypoint recognition across multiple breeds. First, the fixed receptive field of conventional convolution layers restricts the model’s ability to adapt to the substantial scale variations among different sheep breeds, such as the wide ears of Hu sheep and the short ears of Dorper sheep. Second, the absence of a keypoint-aware attention mechanism prevents the model from sufficiently emphasizing highly discriminative facial regions. Third, the rigid aspect-ratio constraint in the CIoU loss may introduce regression bias when handling sheep faces with diverse geometric characteristics. These limitations motivate the architectural improvements proposed in this study.

3.2. Model Improvement Strategy

3.2.1. SAMSConv: Sheep Adaptive Multi-Scale Convolution Design

To address the first limitation of the original model, the fixed receptive field of standard convolution, we designed the SAMSConv module. This design is motivated by the multi-scale distribution of keypoints across the three sheep breeds (for example, ear contours and facial outlines as large-scale features, ear bases and eye regions as medium-scale features, and the nose of all breeds as a small-scale feature); the structure is shown in Figure 5.
The SAMSConv module realizes the efficient extraction of keypoint features of multi-breed sheep faces through the triple structure design. Firstly, the input feature map is compressed by 1 × 1 convolution to reduce the number of channels to 1/4 of the original dimension. This process can be expressed by Equation (1).
$$F_{comp} = \mathrm{Conv}_{1\times 1}\!\left(F_{in},\ C_{out} = \frac{C_{in}}{4}\right) \quad (1)$$
here, $F_{comp}$ represents the compressed feature map, $F_{in}$ denotes the input feature map, $C_{in}$ is the number of input channels, and $C_{out}$ is the number of output channels. This operation preserves 92% of the key feature information while significantly reducing the subsequent computational load, with experiments showing a 68% reduction in computation. Building upon channel compression, the module employs 3 × 3, 7 × 7, and 11 × 11 depthwise separable convolutions in parallel to extract multi-scale features. Among these, the 3 × 3 convolution focuses on small-scale features in the 10–30-pixel range, such as the nasal areas of various sheep breeds and the eye regions of Dorper sheep. The 7 × 7 convolution captures medium-scale features in the 30–60-pixel range, typically including the ear bases and eye areas of Hu sheep and their crossbreeds. The 11 × 11 convolution targets large-scale features in the 60–120-pixel range, emphasizing the extraction of macrostructures such as the ear contours of Hu sheep and the facial outlines of Dorper sheep. Depthwise separable convolution performs feature extraction in two steps, a depthwise convolution followed by a pointwise convolution; since each kernel processes its feature channel independently, this approach reduces the number of parameters by 82% compared with traditional multi-scale convolution. Subsequently, the three streams of multi-scale features are concatenated along the channel dimension to form a multi-scale feature map with $3C_{in}/4$ channels. A shared 1 × 1 convolution is then applied to restore the channel count to the original input channel number. This transformation is described by Equation (2).
$$F_{conv} = \mathrm{Conv}_{1\times 1}\!\left(F_{cat},\ C_{out} = C_{in}\right) \quad (2)$$
here, $F_{conv}$ represents the fused output feature map, $F_{cat}$ denotes the concatenated feature map, $C_{in}$ is the number of input channels, and $C_{out}$ is the number of output channels. Finally, the features undergo batch normalization and SiLU activation, further enhancing the nonlinear representation capability, and the module outputs the enhanced multi-scale feature as described in Equation (3).
$$F_{out} = \mathrm{SiLU}\!\left(\mathrm{BN}\!\left(F_{conv}\right)\right) \quad (3)$$
$F_{out}$ denotes the processed output feature map. Experiments demonstrate that after embedding SAMSConv into the convolutional layers of both the backbone and neck networks in YOLO11, the model exhibits significantly enhanced multi-scale perception capability for keypoints across different sheep breeds. Specifically, the detection recall rate for small-scale keypoints (exemplified by the nasal region) increased by 6.3%.
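The following PyTorch sketch illustrates the SAMSConv design described above (1 × 1 compression, parallel 3 × 3/7 × 7/11 × 11 depthwise separable branches, concatenation, and 1 × 1 fusion with BN and SiLU); details such as bias terms and normalization placement are assumptions rather than the published implementation.

```python
import torch
import torch.nn as nn

class SAMSConv(nn.Module):
    """Sketch of the SAMSConv idea from Section 3.2.1 (assumed details)."""

    def __init__(self, c_in: int):
        super().__init__()
        c_mid = c_in // 4                                  # channel compression to C_in/4
        self.compress = nn.Conv2d(c_in, c_mid, kernel_size=1, bias=False)
        self.branches = nn.ModuleList([
            nn.Sequential(                                 # depthwise + pointwise convolution
                nn.Conv2d(c_mid, c_mid, k, padding=k // 2, groups=c_mid, bias=False),
                nn.Conv2d(c_mid, c_mid, kernel_size=1, bias=False),
            )
            for k in (3, 7, 11)                            # small / medium / large receptive fields
        ])
        self.fuse = nn.Conv2d(3 * c_mid, c_in, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(c_in)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = self.compress(x)                               # Eq. (1)
        f_cat = torch.cat([b(f) for b in self.branches], dim=1)
        return self.act(self.bn(self.fuse(f_cat)))         # Eqs. (2) and (3)
```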

3.2.2. KLAttention: Keypoint-Aware Lightweight Attention Module

To strengthen the model’s attention to highly discriminative keypoints (such as ears and eyes) and adapt to the feature differences among the three breeds of sheep faces, the KLAttention (Keypoint-Aware Lightweight Attention) module is designed and embedded into the feature fusion layers (BiFPN) of the YOLO11 neck network. The structure is shown in Figure 6.
The KLAttention module enhances features in highly discriminative regions of sheep faces by integrating global context with keypoint prior knowledge. The module first applies separate horizontal and vertical adaptive average pooling operations to the input feature map $F_{in} \in \mathbb{R}^{C \times H \times W}$ (where $C$, $H$, and $W$ represent the number of channels, height, and width, respectively), extracting global contextual information along each dimension. Horizontal pooling ($\mathrm{AdaptiveAvgPool2d}$ with output size $(H, 1)$) yields $F_{h} \in \mathbb{R}^{C \times H \times 1}$, which captures the horizontal distribution of the left and right ear bases, while vertical pooling (output size $(1, W)$) yields $F_{v} \in \mathbb{R}^{C \times 1 \times W}$, which emphasizes the structural information of the eyes and nose along the vertical axis. The two outputs are then flattened and concatenated into a feature vector $V_{cat} \in \mathbb{R}^{C \times (H+W)}$, fusing horizontal and vertical contextual information.
In the weight map generation stage, a 1 × 1 convolution $\mathrm{Conv}_{1\times 1}(V_{cat},\ C_{out} = C/8)$ is first utilized to reduce the channel dimension to 1/8 of the original, yielding the compressed feature $V_{red} \in \mathbb{R}^{(C/8) \times (H+W)}$. This is followed by GroupNorm normalization (with 4 groups), as shown in Equation (4), and a Sigmoid activation, generating a preliminary attention weight vector $W_{att} \in \mathbb{R}^{(C/8) \times (H+W)}$ with values in the range [0, 1].
$$V_{gn} = \mathrm{GroupNorm}\!\left(V_{red}\right) \quad (4)$$
here, $V_{gn}$ represents the normalized output. This weight vector is then reshaped into a spatial weight map $M_{att} \in \mathbb{R}^{C \times H \times W}$ with the same dimensions as the input feature map, serving as the basis for subsequent feature weighting.
To further enhance perception of key regions, the module incorporates a unified keypoint weight vector $W_{key}$ derived from dataset statistics and biological significance. This vector, $W_{key} = [0.25, 0.25, 0.20, 0.15, 0.15]$, corresponds to the left eye, right eye, nose, left ear base, and right ear base, respectively. The weights were assigned to reflect the relative stability and discriminative power of each keypoint across all three breeds for the task of individual identification. Specifically, the eyes were assigned higher weights due to their relatively stable anatomical structure and lower susceptibility to hair occlusion compared with the ear bases. The nose tip, being a central and often clearly visible point, received a moderate weight. The ear bases, while highly discriminative for breeds such as Hu sheep, can exhibit greater positional variance due to head pose and are sometimes occluded, and were therefore assigned slightly lower weights. The final weight vector undergoes L2 normalization, calculated as shown in Equation (5).
$$W_{norm} = \frac{W_{key}}{\left\lVert W_{key} \right\rVert_{2}} \quad (5)$$
Here, the normalized keypoint weight vector $W_{norm}$ is element-wise multiplied with the spatial weight map $M_{att}$ to produce the final weight map $M_{final}$. Ultimately, the input feature map $F_{in}$ is element-wise multiplied by $M_{final}$, yielding the enhanced feature map $F_{out}$, which achieves feature reinforcement in highly discriminative keypoint regions. This process effectively enhances the model’s focus on discriminative areas such as the ear bases and eyes. The keypoint weights in KLAttention are fixed and manually designed based on anatomical stability and dataset-level statistics. We acknowledge that learnable or adaptive weighting mechanisms may further improve generalization and will be explored in future work.
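A minimal PyTorch sketch of the KLAttention idea is given below. The text above does not fully specify how the reduced vector is expanded back to a $C \times H \times W$ map or how the 5-element prior is broadcast onto it, so this sketch follows a coordinate-attention-style expansion with two 1 × 1 convolutions and folds the prior in as a scalar factor; these choices, and the assumption that $C$ is a multiple of 32, are ours for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KLAttention(nn.Module):
    """Sketch of the KLAttention idea from Section 3.2.2 (assumed details)."""

    def __init__(self, c_in: int, reduction: int = 8):
        super().__init__()
        c_red = c_in // reduction                              # channel reduction to C/8
        self.reduce = nn.Conv2d(c_in, c_red, kernel_size=1, bias=False)
        self.gn = nn.GroupNorm(4, c_red)                       # 4 groups, as in Eq. (4)
        self.expand_h = nn.Conv2d(c_red, c_in, kernel_size=1)
        self.expand_w = nn.Conv2d(c_red, c_in, kernel_size=1)
        # fixed prior over (left eye, right eye, nose, left ear base, right ear base)
        prior = torch.tensor([0.25, 0.25, 0.20, 0.15, 0.15])
        self.register_buffer("kp_prior", prior / prior.norm(p=2))   # Eq. (5)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        f_h = x.mean(dim=3, keepdim=True)                      # (B, C, H, 1): horizontal context
        f_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)  # (B, C, W, 1): vertical context
        v = F.silu(self.gn(self.reduce(torch.cat([f_h, f_w], dim=2))))
        v_h, v_w = torch.split(v, [h, w], dim=2)
        att = torch.sigmoid(self.expand_h(v_h)) * \
              torch.sigmoid(self.expand_w(v_w).permute(0, 1, 3, 2))  # (B, C, H, W) weight map
        return x * att * self.kp_prior.mean()                  # scalar keypoint modulation
```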

3.2.3. EIoU Loss: Bounding Box Regression Optimization

To address the limitations of the CIoU loss function in the original YOLO11 for multi-breed sheep face localization, namely the conflict between the wide faces of Hu sheep and the narrow faces of Dorper sheep under the aspect-ratio constraint, and the vanishing-gradient problem that arises when the IoU of small keypoint targets approaches zero, this study introduces the EIoU (Efficient IoU) loss function [24] to optimize the bounding box regression process. EIoU decomposes the regression error into four independent parts: the intersection-over-union error, the center-point distance error, the width error, and the height error; the calculation formula is shown in Equation (6).
$$L_{EIoU} = 1 - IoU + \frac{\rho^{2}\!\left(b, b^{gt}\right)}{w_{c}^{2} + h_{c}^{2}} + \frac{\rho^{2}\!\left(w, w^{gt}\right)}{w_{c}^{2}} + \frac{\rho^{2}\!\left(h, h^{gt}\right)}{h_{c}^{2}} \quad (6)$$
here, $IoU$ represents the Intersection over Union between the predicted bounding box $B$ and the ground-truth bounding box $B^{gt}$, calculated as shown in Equation (7), which reflects the degree of overlap between them.
$$IoU = \frac{\left| B \cap B^{gt} \right|}{\left| B \cup B^{gt} \right|} \quad (7)$$
$b$ and $b^{gt}$ represent the center coordinates of the predicted and ground-truth bounding boxes, respectively, while $\rho(b, b^{gt})$ denotes the Euclidean distance between these centers, calculated as shown in Equation (8).
$$\rho\!\left(b, b^{gt}\right) = \sqrt{\left(x - x^{gt}\right)^{2} + \left(y - y^{gt}\right)^{2}} \quad (8)$$
here, $w$ and $h$ represent the width and height of the predicted bounding box, while $w^{gt}$ and $h^{gt}$ denote the width and height of the ground-truth bounding box. The terms $\rho(w, w^{gt})$ and $\rho(h, h^{gt})$ correspond to the absolute errors in width and height, respectively. $w_{c}$ and $h_{c}$ represent the width and height of the minimum enclosing rectangle covering both boxes; these dimensions are used to normalize the error terms, mitigating the impact of target size on gradient computation.
This design separately constrains the center point distance and the width-height errors, overcoming the facial shape adaptability issue in CIoU caused by the coupling between the aspect ratio penalty term and the width-height errors. Furthermore, by maintaining effective normalized gradient flow even when IoU approaches zero, it enhances the localization stability of the model for small target keypoints.
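For clarity, the sketch below implements the EIoU penalty of Equation (6) for boxes given in (x_center, y_center, w, h) form; the function name and the small epsilon safeguard are our own conventions rather than part of the published code.

```python
import torch

def eiou_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """Sketch of the EIoU loss in Eq. (6): 1 - IoU plus center-distance,
    width, and height penalties, each normalized by the smallest enclosing box."""
    px, py, pw, ph = pred.unbind(-1)
    tx, ty, tw, th = target.unbind(-1)

    # corner coordinates of predicted and ground-truth boxes
    px1, py1, px2, py2 = px - pw / 2, py - ph / 2, px + pw / 2, py + ph / 2
    tx1, ty1, tx2, ty2 = tx - tw / 2, ty - th / 2, tx + tw / 2, ty + th / 2

    inter = (torch.min(px2, tx2) - torch.max(px1, tx1)).clamp(0) * \
            (torch.min(py2, ty2) - torch.max(py1, ty1)).clamp(0)
    union = pw * ph + tw * th - inter
    iou = inter / (union + eps)

    # smallest enclosing box dimensions (w_c, h_c)
    cw = torch.max(px2, tx2) - torch.min(px1, tx1)
    ch = torch.max(py2, ty2) - torch.min(py1, ty1)

    center_term = ((px - tx) ** 2 + (py - ty) ** 2) / (cw ** 2 + ch ** 2 + eps)
    wh_term = (pw - tw) ** 2 / (cw ** 2 + eps) + (ph - th) ** 2 / (ch ** 2 + eps)
    return 1 - iou + center_term + wh_term
```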
On this basis, the overall model loss function integrates the bounding box regression loss, individual classification loss, and keypoint regression loss, with its structure shown in Equation (9).
$$L_{total} = \alpha \cdot L_{EIoU} + \beta \cdot L_{CE} + \gamma \cdot L_{kgt} \quad (9)$$
here, $L_{CE}$ represents the cross-entropy classification loss used to distinguish the three sheep breeds, defined as shown in Equation (10). The loss weights were empirically determined based on validation performance and task priorities; keypoint regression was weighted more heavily than classification to ensure precise localization of facial landmarks, which is critical for individual discrimination in sheep face analysis.
$$L_{CE} = -\sum_{k=1}^{3} p_{k} \log\!\left(q_{k}\right) \quad (10)$$
$p_{k}$ denotes the one-hot encoded ground-truth breed label, and $q_{k}$ represents the predicted breed probability output by the model. $L_{kgt}$ denotes the keypoint regression loss, which employs an L1 loss function to constrain the coordinate prediction accuracy of the five facial keypoints (left eye, right eye, nose tip, left ear base, and right ear base). The calculation formula is shown in Equation (11).
$$L_{kgt} = \sum_{k=1}^{5} \left( \left| x_{k} - x_{k}^{gt} \right| + \left| y_{k} - y_{k}^{gt} \right| \right) \quad (11)$$
here, $(x_{k}, y_{k})$ represents the predicted coordinates of the $k$-th keypoint, and $(x_{k}^{gt}, y_{k}^{gt})$ denotes its corresponding ground-truth coordinates. The loss weights are set as $\alpha = 0.75$, $\beta = 0.05$, and $\gamma = 0.12$, placing the greatest weight on face bounding box localization, followed by keypoint regression, whose geometric relationships are crucial for breed and individual discrimination, and finally classification.
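The weighted combination of Equation (9) can be expressed as in the short sketch below, which reuses the eiou_loss sketch above; the tensor shapes and the helper name total_loss are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def total_loss(box_pred, box_gt, cls_logits, cls_gt, kpt_pred, kpt_gt,
               alpha: float = 0.75, beta: float = 0.05, gamma: float = 0.12):
    """Sketch of the multi-task loss in Eq. (9); kpt tensors are (N, 5, 2)."""
    l_box = eiou_loss(box_pred, box_gt).mean()            # Eq. (6)
    l_cls = F.cross_entropy(cls_logits, cls_gt)           # Eq. (10), breed classification
    l_kpt = (kpt_pred - kpt_gt).abs().sum(dim=(-1, -2)).mean()   # Eq. (11), L1 keypoint loss
    return alpha * l_box + beta * l_cls + gamma * l_kpt
```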

3.3. Overall Architecture of the Improved Model

The improved YOLO11-based sheep face keypoint recognition model constructs an end-to-end multi-task framework through a systematic structural design, and its complete architecture is shown in Figure 7. The model takes 640 × 640 × 3 sheep face images as input and adopts a data augmentation strategy in the preprocessing stage to enhance generalization ability. In the backbone network, the self-developed SAMSConv module replaces standard convolutions. By performing parallel processing at three feature scales (80 × 80, 40 × 40 and 20 × 20), SAMSConv builds a multi-level feature representation system that jointly captures local keypoint details, regional contour information, and global spatial relationships.
In the feature fusion stage, the KLAttention module is embedded in the critical path of the neck network to realize attention-guided feature reconstruction based on the spatial distribution characteristics of the sheep face keypoints. The module enhances the perception of highly discriminative regions such as the ear roots and eye corners through an adaptive weighting mechanism. After multiple rounds of upsampling and downsampling operations, it outputs multi-scale feature maps with balanced semantic information and spatial details.
The head network adopts a multi-branch collaborative architecture, completing detection, classification, and keypoint localization through three parallel branches. The detection branch performs accurate bounding box regression based on the EIoU loss, the classification branch distinguishes the three sheep breeds via cross-entropy loss, and the keypoint localization branch outputs the spatial coordinates of the five facial keypoints. Through this collaboration mechanism, all tasks are completed in a single forward pass, and the final output combines the detection box, breed information, and keypoint coordinates. This integrated design enables the model to achieve high-precision recognition and localization of multi-breed sheep faces while maintaining efficient inference. Breed classification is integrated into the same network to enable shared feature representation and reduce computational redundancy; this unified design improves efficiency and is better suited to edge deployment than multi-stage or task-separated pipelines.

4. Experimental Results and Analysis

4.1. Experimental Environment and Parameter Setting

The specific configuration of the experimental platform that provides computational support for model training in this study is shown in Table 1. The platform is equipped with a high-performance GPU accelerator, sufficient memory capacity and high-speed storage system, which provides the necessary hardware foundation for large-scale matrix operations and complex feature extraction of deep neural networks. In terms of software environment, the platform adopts mainstream deep learning frameworks and optimized numerical calculation libraries to ensure that the training process runs in a stable and controllable environment, ensuring the comparability and reproducibility of the experimental results.

4.2. Evaluation Metrics

To comprehensively evaluate the proposed sheep face detection and keypoint recognition model, six detection-based metrics and one keypoint-specific metric are adopted. Precision (P) measures the proportion of correctly predicted positive samples among all positive predictions and is defined as in Equation (12).
$$P = \frac{TP}{TP + FP} \quad (12)$$
where $TP$ and $FP$ denote the numbers of true positives and false positives, respectively. Recall (R), which measures the proportion of correctly detected positive samples among all ground-truth positives, is given in Equation (13).
$$R = \frac{TP}{TP + FN} \quad (13)$$
where $FN$ represents the number of missed detections (false negatives), used to evaluate the model’s detection coverage. For localization accuracy, we report the mean Average Precision at an IoU threshold of 0.5 (mAP50) as the core detection metric, which reflects overall bounding box quality under a relatively relaxed overlap requirement. In addition, mAP50–95 is calculated as the average of AP values over IoU thresholds from 0.5 to 0.95 with a step of 0.05, providing a more stringent and robust evaluation of localization stability.
To characterize model efficiency, the computational complexity is measured by the number of floating-point operations (FLOPs, in Giga units), which is closely related to inference latency on embedded devices. The number of parameters (Params, in Millions) reflects model size and memory requirements, and is therefore a key indicator for lightweight design and practical deployment. Together, these six metrics form a comprehensive evaluation system for detection performance and computational efficiency.
In addition to detection-based metrics, we further introduce Object Keypoint Similarity (OKS) to directly assess facial keypoint localization accuracy. Following the COCO keypoint evaluation protocol, OKS measures the similarity between predicted and ground-truth keypoints by weighting the normalized Euclidean distance of each keypoint according to the object scale and a keypoint-specific fall-off parameter. The OKS for an image lies in [0, 1], with higher values indicating more accurate keypoint localization. Based on OKS, we report the mean OKS over all evaluated samples (Mean OKS), as well as OKS@0.50 and OKS@0.75, which denote the proportions of samples whose OKS exceeds 0.50 and 0.75, respectively. These indicators provide a complementary, keypoint-focused perspective to the detection-based mAP metrics and facilitate comparison with general pose estimation literature.
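As a reference, the COCO-style OKS computation described above can be summarized in the short sketch below; the per-keypoint fall-off constants and the example coordinates are illustrative assumptions, not the values used in our evaluation.

```python
import numpy as np

def oks(pred: np.ndarray, gt: np.ndarray, area: float,
        k: np.ndarray, visible: np.ndarray) -> float:
    """Sketch of Object Keypoint Similarity for one sheep face.
    pred, gt: (5, 2) keypoint coordinates; area: face bounding-box area used
    as the object scale; k: per-keypoint fall-off constants (assumed here);
    visible: boolean mask of annotated/visible keypoints."""
    d2 = ((pred - gt) ** 2).sum(axis=1)            # squared keypoint distances
    e = d2 / (2.0 * area * k ** 2 + 1e-9)          # scale- and keypoint-normalized error
    return float(np.exp(-e)[visible].mean())

# hypothetical example with equal fall-off constants for the five keypoints
pred = np.array([[120, 90], [180, 92], [150, 150], [95, 60], [205, 62]], float)
gt = pred + np.random.normal(0, 2, pred.shape)
print(oks(pred, gt, area=200 * 180, k=np.full(5, 0.05), visible=np.ones(5, bool)))
```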
In summary, detection-based metrics (P, R, mAP50, mAP50–95) are used to evaluate sheep face localization and overall detection quality, FLOPs and Params characterize model complexity, and OKS-based indicators quantify the accuracy and robustness of facial keypoint localization under multi-breed and unconstrained farm conditions.

4.3. Comparison of Experimental Results and Analysis

4.3.1. Comparison with the Mainstream YOLO Series Model

In order to verify the overall performance of the improved model, it is compared with the mainstream YOLO series keypoint recognition models such as YOLO3-pose and YOLO12-pose on the test set. The experimental results are shown in Table 2.
According to the experimental results shown in Table 2, the improved model shows significant advantages in multiple performance dimensions. From the perspective of precision indicators, the improved model ranks first in the four core indicators of precision, recall, mAP50 and mAP50–95, which are 4.8%, 7.6%, 3.7% and 4.5% higher than the original YOLO11 respectively. This result fully shows that the triple improvement strategy adopted effectively enhances the recognition ability of the model for the keypoints of multi-breed sheep faces. At the same time, the accuracy of the model in the breed classification task reaches 97.6%, which is 5.3% higher than that of the original YOLO11, reflecting that the improved feature extraction mechanism has significantly improved the ability to distinguish the features of different breed sheep faces.
In terms of model efficiency, the improved model has 5.9 G FLOPs, which is 10.6% lower than the original YOLO11, and 2.27 M parameters, representing a reduction of 14.7%. Compared with the similar lightweight model YOLO9-pose, the improved model achieves a 4.8% increase in mAP50 index with only 0.24 M additional parameters, showing a good balance between accuracy and efficiency.
Compared with the early YOLO series models, the improved model has an increase of 2.7–5.4% in various accuracy indicators, and the calculation amount and parameter amount are greatly reduced by 70.2% and 82.1% respectively. This comparison result not only reflects the advanced nature of the model architecture but also verifies its deployment value in practical scenarios.
From the experimental results in Table 2, it can be seen that the improved model proposed in this paper achieves a good balance between accuracy and efficiency. While maintaining the lightweight characteristics of the model, various recognition accuracy indicators have been steadily improved, especially in the multi-breed sheep face keypoint detection task, which shows strong adaptability. Compared with the existing mainstream models, the proposed model can still maintain high recognition performance in the case of limited computing resources, which creates favorable conditions for its deployment and application in the actual breeding environment. In summary, the improved model structure further expands its application boundary in complex scenes through targeted optimization on the basis of maintaining the efficient characteristics of YOLO series.

4.3.2. Effectiveness of Individual Modules and Their Combinations

To evaluate the contribution of each proposed module, including SAMSConv, KLAttention, and the EIoU loss, a series of ablation experiments were conducted. Eight comparative configurations were designed by progressively introducing each module into the baseline model. The quantitative results are summarized in Table 3. From the ablation results, it can be observed that each module contributes differently to overall model performance. When applied individually, the introduction of the EIoU loss leads to consistent improvements in localization-related metrics, including recall and mAP, without introducing additional computational overhead. This observation suggests that EIoU provides a more suitable bounding box regression constraint for sheep face detection under diverse facial shapes. The KLAttention module yields further performance gains, particularly in precision and mAP, while introducing only a slight increase in computational cost. This improvement may be attributed to its ability to emphasize biologically discriminative facial keypoint regions. In addition, the SAMSConv module achieves a favorable trade-off between accuracy and efficiency, reducing computational complexity while maintaining competitive detection performance, indicating its effectiveness in multi-scale feature extraction for sheep faces with diverse morphological characteristics.
When combining different modules, further performance improvements are observed. The joint use of EIoU and KLAttention results in higher mAP compared with single-module configurations, indicating that enhanced localization and keypoint-aware feature refinement can be effectively integrated. Similarly, combining KLAttention with SAMSConv leads to noticeable gains in both detection accuracy and breed classification performance, suggesting that multi-scale feature extraction can provide complementary support for keypoint-guided attention mechanisms. The combination of EIoU and SAMSConv also exhibits stable improvements, reflecting the compatibility between bounding box optimization and multi-scale feature representation.
When all three modules are integrated, the model achieves the best overall performance among all configurations, while maintaining a relatively low computational cost and parameter count. This result indicates that SAMSConv, KLAttention, and EIoU contribute to model improvement from different aspects, including feature extraction, attention weighting, and localization optimization. The combined configuration therefore provides a balanced solution for sheep face keypoint recognition across multiple breeds. Visual comparisons of detection results under different ablation settings are presented in Figure 8.

4.3.3. Comparison of SAMSConv with Other Lightweight Convolutional Structures

In order to verify the superiority of SAMSConv, it was compared with PConv (partial convolution), DSConv (Depthwise Separable Convolution), DWConv (Depthwise Convolution), GhostConv (Ghost Convolution) [25,26,27,28], and the original YOLO11 convolution layer was replaced respectively in the experiment. Other parameters remain the same, and the results are shown in Table 4.
According to the comparative experimental results shown in Table 4, the YOLO11 + SAMSConv model shows a good balance between accuracy and lightweight design. Its recall (91.4%), mAP50 (96.3%), and mAP50–95 (95.1%) are better than those of mainstream lightweight convolution schemes such as DWConv and GhostConv. In terms of computational efficiency, although its computational cost of 5.7 G FLOPs is slightly higher than that of DSConv (4.9 G), the parameter count is kept at a low 2.18 M, and its multi-scale feature capture ability is stronger: compared with DSConv, recall and mAP50 are increased by 1.8% and 0.7%, respectively.
From the individual performance of each model, although PConv has the highest accuracy of 96.3%, its calculation amount reaches 7.4 G, which is significantly higher than that of the original YOLO11, and its recall rate is only 89.4%, indicating that it has a certain missed detection problem while maintaining high accuracy. DSConv has an advantage in computational complexity, but its recall rate is only 89.6%, which reflects its limitation in detecting small-scale keypoints. Although DWConv has the lowest number of parameters (2.00 M), its mAP50 is 94.0%, which is 2.3% lower than that of SAMSConv, indicating that it is still insufficient in multi-scale feature expression ability.
In summary, SAMSConv shows stable and balanced performance on multiple evaluation dimensions of keypoint recognition and achieves effective capture of multi-scale features of three breeds of sheep faces while maintaining low computational complexity.

4.3.4. Comparison Between KLAttention and Other Attention Mechanisms

To verify the superiority of KLAttention, it is compared with CBAM (Convolutional Block Attention Module), SCSA (Spatial-Channel Split Attention), SHSAttention (Squeeze-and-Hard-Shared Attention), and SMFA (Spatial Multi-Scale Feature Attention) [29,30,31,32]. In the experiment, the different attention modules were embedded in the YOLO11 neck network, with all other parameters kept consistent. The results are shown in Table 5.
According to the comparative experimental results shown in Table 5, the YOLO11 + KLAttention model shows advantages on multiple performance indicators. The model is superior to the other attention mechanisms in precision (94.9%), mAP50 (96.8%), and mAP50–95 (95.4%); its mAP50 is 1.7% higher than CBAM, 2.0% higher than SCSA, and 3.0% higher than SHSAttention. In terms of recall, KLAttention reaches 92.8%, compared with 91.6% for SMFA. Moreover, SMFA’s computational cost (8.0 G FLOPs) and parameter count (3.54 M) are significantly higher than those of KLAttention, by 17.6% and 29.7% respectively, placing it at a clear disadvantage in efficiency.
It is worth noting that KLAttention shows good adaptability in the three-breed sheep face recognition tasks by introducing the keypoint weight vector and performing differential weight allocation according to different breed characteristics. This design improves the breed classification accuracy by 2.3% compared with CBAM, which proves that KLAttention has better compatibility for multi-breed scenarios.
From the perspective of various indicators, KLAttention effectively controls the computational cost while maintaining high recognition accuracy and improves the practicability of the model in multi-breed scenarios through the breed adaptive weight allocation mechanism, which provides a more balanced attention solution for the sheep face keypoint recognition task.

4.3.5. Comparison of EIoU and Other IoU Loss Functions

To verify the superiority of EIoU, it was compared with DIoU, SIoU, WIoU, and MPDIoU [33,34,35,36]. In the experiment, only the CIoU loss in the original YOLO11 was replaced by these alternative loss functions, while all other parameters were kept consistent. The results are shown in Table 6. According to the results in Table 6, the model using the EIoU loss function performs best on all accuracy indicators, with precision, recall, mAP50, and mAP50–95 reaching 94.2%, 92.0%, 95.8%, and 94.9%, respectively. Compared with DIoU, EIoU improves precision and recall by 1.5% and 1.9%, respectively; compared with SIoU, it improves mAP50 and mAP50–95 by 1.2% and 1.9%, respectively. In terms of gradient effectiveness, EIoU shows clear advantages in small-object keypoint detection by separately constraining the center-point distance and the width and height errors. When the IoU approaches zero, the gradient effectiveness of EIoU is 15.3% higher than that of CIoU, which increases the model’s recall by 2.5% and effectively reduces the missed-detection rate of small-scale keypoints. Notably, the computational cost and parameter count remain identical for all loss functions (6.6 G FLOPs, 2.66 M parameters), indicating that EIoU improves bounding box regression accuracy through a more principled error constraint without adding any computational burden.
From the experimental results, the EIoU loss function effectively improves the detection accuracy of the model through its unique error decomposition mechanism while maintaining the computational efficiency, especially in the small target keypoint location task, which shows obvious advantages, providing a more effective bounding box regression solution for the sheep face keypoint recognition task.

4.3.6. OKS-Based Keypoint Evaluation

While the previous subsections mainly focus on detection-oriented metrics, OKS provides a complementary evaluation that directly quantifies facial keypoint localization accuracy. To this end, we computed OKS on test images in which all five facial keypoints were clearly visible and reliably annotated. The quantitative evaluation results are: Mean OKS = 0.9459, OKS@0.50 = 0.9622, and OKS@0.75 = 0.9496. Visual examples of OKS-based evaluation are shown in Figure 9.
The high Mean OKS indicates that, on average, the predicted facial landmarks are very close to the ground-truth positions when normalized by sheep-face scale. The values of OKS@0.50 and OKS@0.75 further show that more than 96% of the evaluated samples achieve an OKS above 0.50 and approximately 95% exceed 0.75, respectively, demonstrating that the majority of keypoint predictions are not only correct but also spatially precise. These results are consistent with the improvements observed in mAP50 and mAP50–95 and confirm that the proposed SAMS-KLA-YOLO11 model achieves accurate keypoint localization in addition to high-quality bounding box detection.
In Figure 9, the predicted keypoints of the eyes, nose, and ear bases are superimposed on the original images, and the corresponding OKS values for each sample are displayed in the titles (e.g., “sheep0039_07 − OKS = 0.998”). Samples with high OKS values exhibit tightly aligned keypoints with the ground truth, whereas occasional low-OKS cases are typically associated with strong occlusions, large head-pose variations, or challenging illumination conditions. These qualitative observations are consistent with the failure case analysis in Section 4.5 and further illustrate the strengths and remaining limitations of the proposed model under realistic farm conditions.

4.4. Analysis of Visual Results

To intuitively demonstrate the recognition performance of the proposed model, three representative scenarios—no interference, partial occlusion, and large head-pose variation—were selected to compare the recognition results of the original YOLO11 and the improved SAMS-KLA-YOLO11. The corresponding visualizations are presented in Figure 10. In the ideal scene without interference, both models are able to approximately locate the five facial keypoints, but the original YOLO11 exhibits noticeable coordinate shifts around the eyes of Dorper sheep. By contrast, the improved model provides more stable and accurate landmark localization, with predicted keypoints closely aligned with the ground-truth positions. Under partial occlusion, such as when the ear-base region of Hu sheep is heavily covered by wool, the original model tends to miss or misplace keypoints. The proposed SAMS-KLA-YOLO11 effectively reinforces the remaining visible structures through the KLAttention module and maintains reliable detection of occluded keypoints. In scenes with pronounced pose changes—for example when the head of Dorper × Hu crossbred sheep is rotated by approximately 30°—the original YOLO11 produces larger localization errors at the nose tip and ear bases, whereas the improved model achieves more precise and robust keypoint predictions.
To further analyze the training behavior and overall performance stability of the proposed model, the loss curves as well as the precision, recall, mAP50 and mAP50–95 curves for both the training and validation sets are plotted in Figure 11. All loss curves show a monotonic decreasing trend and converge to low values, and the trajectories of the training and validation curves remain close to each other, indicating good convergence without obvious overfitting. The precision, recall, and mAP curves rapidly increase and then stabilize at high levels close to 1.0, and the corresponding curves for the training and validation sets almost overlap. These results confirm that the SAMS-KLA-YOLO11 model not only learns discriminative features from the training data but also generalizes well to unseen samples, providing a solid foundation for subsequent deployment in practical farm environments.
In addition, feature heatmaps were generated to interpret how the proposed model utilizes facial keypoints for individual identification. For each test image, the trained detector first predicts the sheep-face bounding box. Feature maps are then extracted from the layer following the KLAttention module, and the channel-wise mean activation is computed. This activation map is upsampled to the input resolution and overlaid as a colored heatmap only within the detected face region, while the remaining parts of the image are kept unchanged. Representative examples are shown in Figure 12. The resulting heatmaps reveal that the model consistently produces strong responses around biologically meaningful facial landmarks, including the eyes, nose tip, and ear bases, whereas background regions inside the bounding box receive much lower activation. This pattern demonstrates that KLAttention-guided feature learning effectively drives the network to concentrate on keypoint-related regions that are critical for discriminating individual sheep. Moreover, in cases with moderate occlusion or pose variation, the heatmaps still focus on the remaining visible keypoints, which is consistent with the OKS-based quantitative results and the failure case analysis in Section 4.5 and further confirms the robustness and interpretability of the proposed approach under realistic farm conditions.

4.5. Failure Case Analysis

Although the proposed model achieves strong overall performance, failure cases are still observed under challenging real-world conditions. Based on qualitative inspection of the visualization results in Figure 13, most errors can be summarized into three typical scenarios. (1) Severe occlusion: When key facial regions such as ear roots are heavily covered by wool or partially blocked by other body parts, the predicted keypoints tend to shift or become unstable. This is mainly because occlusion significantly weakens local texture cues, which are critical for precise keypoint localization. (2) Large pose variation: For sheep faces captured at large yaw angles (greater than approximately 45°), some symmetric keypoints become ambiguous or invisible. In such cases, the model may incorrectly regress keypoints based on incomplete contextual information. (3) Extreme illumination conditions: Strong backlighting or uneven illumination reduces contrast around fine-grained facial structures, particularly affecting small-scale keypoints such as eyes and nose regions.
From a keypoint perspective, ear-related keypoints are more prone to errors compared to eye keypoints, as ears exhibit larger shape deformation and higher appearance variability across individuals and poses. This observation is consistent with the motivation of introducing keypoint-aware weighting in the KLAttention module, which aims to emphasize more discriminative and stable facial regions. In line with the OKS-based evaluation in Section 4.3.6, most low-OKS samples are concentrated in scenarios with severe occlusion or large head pose variations, further confirming that these conditions remain the primary failure modes of the proposed model. These failure cases indicate that while the proposed method is robust under normal farm conditions, extreme occlusion and pose variation remain challenging and deserve further investigation in future work.

5. Discussion

In this study, the application potential of sheep face recognition technology in pasture environments is explored through experimental validation. While the proposed method demonstrates encouraging performance under the tested conditions, further research is required to facilitate the transition from methodological validation to broader practical applicability. Future work can be systematically extended along several complementary directions.
At the data and model level, constructing more comprehensive datasets that better reflect real breeding scenarios is a critical step toward improving model practicality. Expanding the sample distribution across different growth stages, health conditions, and breed types would enhance robustness and generalization. Professionally annotating pathological characteristics in accordance with veterinary diagnostic standards could ensure closer alignment between training data and the dynamic variations encountered in real-world breeding environments [37]. Such fine-grained annotations are expected to improve the model’s ability to handle complex and atypical cases. In addition, building upon the pretrained model developed in this study, domain-adaptive transfer learning techniques [23] could be applied to fine-tune the model using limited samples from specific environments, thereby improving adaptability while reducing data collection costs. Despite the encouraging results, this study has several limitations. The dataset consists of images collected from 68 sheep at a single experimental farm, which may limit population diversity and introduce sampling bias. Although annotations were cross-checked by multiple annotators, quantitative inter-annotator agreement metrics were not calculated. Moreover, the comparative study in this paper is limited to YOLO-based detectors with keypoint heads. General-purpose human pose estimation frameworks such as HRNet, SimpleBaseline, RTMPose, or PIPNet were not included as baselines, primarily due to their higher implementation and computational complexity in joint detection–keypoint pipelines. Although these methods could provide a broader reference for keypoint localization performance, a detailed comparison is beyond the scope of this work and will be explored in future research. Additionally, intra-breed variability and pose distribution were not statistically analyzed. These limitations highlight the need for larger, multi-farm datasets and more comprehensive statistical evaluation in future studies.
To address data isolation and privacy concerns in large-scale livestock farming, distributed training frameworks based on federated learning provide a promising research direction [38]. By allowing multiple farms to participate in collaborative model optimization without sharing raw data, federated learning enables effective use of data diversity across regions and breeding modes while preserving data security (see the sketch at the end of this section). Previous studies have shown that such collaborative strategies can enhance model adaptability in heterogeneous environments [39], suggesting their potential value for future sheep face recognition systems.
From a technical implementation perspective, further optimization of model efficiency remains an important consideration. Although the proposed model is designed with lightweight principles in mind, additional model compression strategies, such as knowledge distillation, network pruning, and quantization, could be explored to further reduce computational overhead. When combined with optimized inference engines, these techniques may facilitate deployment on resource-constrained edge or embedded platforms and support near real-time inference in practical farming scenarios [40]. Moreover, integrating sheep individual identification with higher-level functions such as behavior analysis and health monitoring could contribute to more comprehensive intelligent livestock management systems [41].
Several limitations of this study should also be acknowledged. First, real-world inference benchmarks on embedded or edge devices, including inference speed, memory usage, and power consumption, are not reported; model efficiency is instead evaluated with proxy metrics such as parameter count and FLOPs, which are commonly adopted but cannot fully reflect hardware-dependent performance. Second, all experimental results are based on a single training run due to computational constraints, so statistical variance metrics such as standard deviation are not provided. Although the proposed method demonstrates consistent improvements over baseline models, the reported results may therefore be optimistically biased. Future work will include on-device evaluations and multiple independent training runs to further assess deployment feasibility and statistical robustness under practical farm conditions.
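To make the federated strategy discussed above concrete, the sketch below shows FedAvg-style aggregation [38], in which each farm trains a local copy of the model and only the resulting weights, never raw images, are shared and averaged. The weighting by local sample count and the helper names are illustrative assumptions, not part of the proposed system.

```python
import copy
import torch

def federated_average(client_states, client_sizes):
    """FedAvg: sample-size-weighted average of client model parameters.

    client_states : list of state_dicts returned by each farm after local training
    client_sizes  : number of local training images per farm (aggregation weights)
    """
    total = float(sum(client_sizes))
    merged = copy.deepcopy(client_states[0])
    for name in merged:
        if not torch.is_floating_point(merged[name]):
            continue  # leave integer buffers (e.g., BatchNorm counters) untouched
        merged[name] = sum(
            state[name].float() * (size / total)
            for state, size in zip(client_states, client_sizes)
        )
    return merged

# One communication round (sketch): the server broadcasts the global weights,
# each farm fine-tunes locally, and the server aggregates the returned weights.
# global_model.load_state_dict(federated_average(local_states, local_sizes))
```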

6. Conclusions

This paper presented SAMS-KLA-YOLO11, an improved YOLO11-based framework for joint sheep face detection and facial keypoint recognition aimed at supporting individual identification in precision livestock management. By introducing the SAMSConv module, the network effectively adapts to the considerable scale variations and morphological differences among Hu, Dorper, and Dorper × Hu crossbred sheep. The KLAttention mechanism explicitly guides the model to focus on biologically meaningful facial landmarks, while the EIoU loss improves bounding box regression stability for faces with diverse aspect ratios. These architectural modifications are integrated into a unified multi-task network that simultaneously predicts sheep face bounding boxes and five facial keypoints.
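For completeness, the EIoU loss adopted here follows the formulation of Zhang et al. [24], which combines the IoU term with separate penalties on the center distance and on the width and height differences:

$$
\mathcal{L}_{\mathrm{EIoU}} = 1 - \mathrm{IoU} + \frac{\rho^{2}\left(\mathbf{b},\,\mathbf{b}^{gt}\right)}{c^{2}} + \frac{\rho^{2}\left(w,\,w^{gt}\right)}{C_{w}^{2}} + \frac{\rho^{2}\left(h,\,h^{gt}\right)}{C_{h}^{2}},
$$

where $\rho(\cdot,\cdot)$ denotes the Euclidean distance, $\mathbf{b}$ and $\mathbf{b}^{gt}$ are the centers of the predicted and ground-truth boxes, $w$, $h$ and $w^{gt}$, $h^{gt}$ their widths and heights, $c$ the diagonal length of the smallest box enclosing both, and $C_{w}$, $C_{h}$ its width and height. The explicit width and height terms penalize aspect-ratio mismatches directly, which is why the loss suits sheep faces with diverse aspect ratios.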
Comprehensive experiments on a self-built dataset of 3860 images from 68 individuals collected under unconstrained barn conditions show that SAMS-KLA-YOLO11 consistently outperforms several mainstream YOLO-based pose detectors in terms of precision, recall, mAP50, and mAP50–95, while requiring fewer parameters and less computation than the original YOLO11. Ablation studies demonstrate that SAMSConv, KLAttention, and EIoU each provide measurable improvements and yield the best performance when combined. OKS-based evaluation and visualizations further confirm that the proposed model achieves precise localization of facial landmarks, and heatmap analysis indicates that the network learns to concentrate on keypoint-related regions that are critical for distinguishing individual sheep.
At the same time, several limitations should be acknowledged. The dataset was collected from a single farm and includes a limited number of individuals, so potential biases related to breed composition, management practices and imaging conditions cannot be excluded. All reported results are obtained from single training runs, and real-device inference metrics such as frame rate, memory usage and power consumption are not measured; model efficiency is evaluated using proxy indicators such as FLOPs and parameter count. In addition, the comparative study is restricted to YOLO-based detectors with keypoint heads and does not include general-purpose human pose estimation frameworks such as HRNet or RTMPose, which may provide a broader reference for keypoint localization performance. Future work will focus on constructing larger, multi-site datasets covering more breeds and management systems, and on performing multi-run training with variance analysis to obtain more statistically robust conclusions. Incorporating additional pose estimation baselines, exploring model compression and knowledge distillation for edge deployment, and conducting systematic on-device benchmarking on embedded platforms such as agricultural robots or barn monitoring systems are also promising directions. These efforts are expected to further enhance the practicality and scalability of keypoint-aware sheep face recognition in real-world smart farming applications.

Author Contributions

Conceptualization, Y.B. and X.Z.; methodology, Y.B.; software, X.Z. and X.L.; formal analysis, Y.B. and Z.Z.; data curation, Z.Z. and Y.Y.; writing—original draft preparation, Y.B. and F.L.; writing—review and editing, Y.B.; visualization, F.L.; supervision, Z.Z. and X.L.; project administration, Y.Y.; funding acquisition, W.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

The animal study protocol, which involved only non-invasive image acquisition, was granted an exemption from formal ethics approval by the Animal Ethics Committee of Shanxi Agricultural University. This exemption is in accordance with the committee’s policies as the research did not involve any direct intervention, handling, or distress to the animals.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to restrictions related to a confidential research project. For academic research purposes, a minimal dataset necessary for method validation can be obtained by contacting the corresponding author with a formal request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Meng, H.; Zhang, L.; Yang, F.; Hai, L.; Wei, Y.; Zhu, L.; Zhang, J. Livestock biometrics identification using computer vision approaches: A review. Agriculture 2025, 15, 102. [Google Scholar] [CrossRef]
  2. Monteiro, A.; Santos, S.; Goncalves, P. Precision agriculture for crop and livestock farming—Brief review. Animals 2021, 11, 2345. [Google Scholar] [CrossRef] [PubMed]
  3. Yao, Z.; Tan, H.; Tian, F.; Zhou, Y. Research progress of computer vision technology in smart sheep farm. China Feed 2021, 1, 7–12. [Google Scholar] [CrossRef]
  4. Qin, G.; Liu, Z.; Zhao, C.; Zhang, C.; Sun, J.; Wang, Z.; Li, J. Application of machine vision technology in animal husbandry. Agric. Eng. 2021, 11, 27–33. [Google Scholar]
  5. Alonso, R.S.; Sitton-Candanedo, I.; Garcia, O.; Prieto, J.; Rodriguez-Gonzalez, S. An intelligent Edge-IoT platform for monitoring livestock and crops in a dairy farming scenario. Ad Hoc Netw. 2020, 98, 102047. [Google Scholar] [CrossRef]
6. Luan, H.; Qi, Y.; Liu, L.; Wang, Z.; Li, Y. VanillaFaceNet: A cow face recognition method with high accuracy and fast inference. Trans. Chin. Soc. Agric. Eng. 2024, 40, 120–131. [Google Scholar]
  7. Huang, X.; Hou, X.; Guo, Y.; Zheng, H.; Dou, Z.; Liu, M.; Zhao, J. Cattle face keypoint detection and posture recognition method based on improved YOLO v7-Pose. Trans. Chin. Soc. Agric. Mach. 2024, 55, 84–92+102. [Google Scholar]
  8. Xie, Q.; Wu, M.; Bao, J.; Yin, H.; Liu, H.; Li, X.; Zheng, P.; Liu, W.; Chen, G. Individual pig face recognition combined with attention mechanism. Trans. Chin. Soc. Agric. Eng. 2022, 38, 180–188. [Google Scholar]
  9. Hansen, M.F.; Smith, M.L.; Smith, L.N.; Salter, M.G.; Baxter, E.M.; Farish, M.; Grieve, B. Towards on-farm pig face recognition using convolutional neural networks. Comput. Ind. 2018, 98, 145–152. [Google Scholar] [CrossRef]
  10. Marsot, M.; Mei, J.; Shan, X.; Ye, L.; Feng, P.; Yan, X.; Li, C.; Zhao, Y. An adaptive pig face recognition approach using Convolutional Neural Networks. Comput. Electron. Agric. 2020, 173, 105386. [Google Scholar] [CrossRef]
  11. Xue, J.; Hou, Z.; Xuan, C.; Ma, Y.; Sun, Q.; Zhang, X.; Zhong, L. A sheep identification method based on three-dimensional sheep face reconstruction and feature point matching. Animals 2024, 14, 1923. [Google Scholar] [CrossRef] [PubMed]
12. Wurtz, K.; Camerlink, I.; D'Eath, R.B.; Fernandez, A.P.; Norton, T.; Steibel, J.; Siegford, J. Recording behaviour of indoor-housed farm animals automatically using machine vision technology: A systematic review. PLoS ONE 2019, 14, e0226669. [Google Scholar] [CrossRef] [PubMed]
  13. Pang, Y.; Yu, W.; Xuan, C.; Zhang, Y.; Wu, P. A Large Benchmark Dataset for Individual Sheep Face Recognition. Agriculture 2023, 13, 1718. [Google Scholar] [CrossRef]
  14. Chen, J.; Ran, X. Deep learning with edge computing: A review. Proc. IEEE 2019, 107, 1655–1674. [Google Scholar] [CrossRef]
  15. Li, X.; Du, J.; Yang, J.; Li, S. When mobilenetv2 meets transformer: A balanced sheep face recognition model. Agriculture 2022, 12, 1126. [Google Scholar] [CrossRef]
  16. Zhang, H.; Zhou, L.; Li, Y.; Hao, J.; Sun, Y. Sheep face recognition method based on improved MobileFaceNet. Trans. Chin. Soc. Agric. Mach. 2022, 53, 267–274. [Google Scholar]
  17. Li, X.; Zhang, Y.; Li, S. SheepFaceNet: A speed–accuracy balanced model for sheep face recognition. Animals 2023, 13, 1930. [Google Scholar] [CrossRef]
  18. Zhang, C.; Zhang, H.; Tian, F.; Zhou, Y.; Zhao, S.; Du, X. Research on sheep face recognition algorithm based on improved AlexNet model. Neural Comput. Appl. 2023, 35, 24971–24979. [Google Scholar] [CrossRef]
19. Hao, M.; Sun, Q.; Xuan, C.; Zhang, X.; Zhao, M.; Song, S. Lightweight Small-Tailed Han sheep facial recognition based on improved SSD algorithm. Agriculture 2024, 14, 468. [Google Scholar] [CrossRef]
  20. Noor, A.; Zhao, Y.; Koubaa, A.; Wu, L.; Khan, R.; Abdalla, F.Y. Automated sheep facial expression classification using deep transfer learning. Comput. Electron. Agric. 2020, 175, 105528. [Google Scholar] [CrossRef]
  21. Gao, X.; Xue, J.; Luo, G.; Qu, C.; Sun, W.; Qu, L. Comparative study on performance of Dorper × Hu crossbred F1 lambs and Hu sheep. Chin. J. Anim. Sci. 2024, 60, 169–172+182. [Google Scholar] [CrossRef]
  22. Huang, L.W.; Qian, B.; Guan, F.; Hou, Z. Sheep face recognition model based on wavelet transform and convolutional neural network. Trans. Chin. Soc. Agric. Mach. 2023, 54, 278–287. [Google Scholar]
  23. Khanam, R.; Hussain, M. YOLO11: An overview of the key architectural enhancements. arXiv 2024, arXiv:2410.17725. [Google Scholar]
  24. Zhang, Y.-F.; Ren, W.; Zhang, Z.; Jia, Z.; Wang, L.; Tan, T. Focal and efficient IOU loss for accurate bounding box regression. Neurocomputing 2022, 506, 146–157. [Google Scholar] [CrossRef]
  25. Yang, J.; Liu, S.; Wu, J.; Su, X.; Hai, N.; Huang, X. Pinwheel-shaped convolution and scale-based dynamic loss for infrared small target detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 27 February–2 March 2025; pp. 9202–9210. [Google Scholar]
  26. Nascimento, M.G.d.; Fawcett, R.; Prisacariu, V.A. Dsconv: Efficient convolution operator. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 5148–5157. [Google Scholar]
  27. Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1251–1258. [Google Scholar]
  28. Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C.; Xu, C. Ghostnet: More features from cheap operations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 1580–1589. [Google Scholar]
  29. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  30. Si, Y.; Xu, H.; Zhu, X.; Zhang, W.; Dong, Y.; Chen, Y.; Li, H. SCSA: Exploring the synergistic effects between spatial and channel attention. Neurocomputing 2025, 634, 129866. [Google Scholar] [CrossRef]
  31. Yun, S.; Ro, Y. Shvit: Single-head vision transformer with memory efficient macro design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 5756–5767. [Google Scholar]
  32. Zheng, M.; Sun, L.; Dong, J.; Pan, J. SMFANet: A lightweight self-modulation feature aggregation network for efficient image super-resolution. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; pp. 359–375. [Google Scholar]
  33. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU loss: Faster and better learning for bounding box regression. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 12993–13000. [Google Scholar]
  34. Gevorgyan, Z. SIoU loss: More powerful learning for bounding box regression. arXiv 2022, arXiv:2205.12740. [Google Scholar] [CrossRef]
  35. Tong, Z.; Chen, Y.; Xu, Z.; Yu, R. Wise-IoU: Bounding box regression loss with dynamic focusing mechanism. arXiv 2023, arXiv:2301.10051. [Google Scholar]
  36. Ma, S.; Xu, Y. MPDIoU: A loss for efficient and accurate bounding box regression. arXiv 2023, arXiv:2307.07662. [Google Scholar] [CrossRef]
  37. Ferretti, V.; Papaleo, F. Understanding others: Emotion recognition in humans and other animals. Genes Brain Behav. 2019, 18, e12544. [Google Scholar] [CrossRef]
  38. Yang, Q.; Liu, Y.; Chen, T.; Tong, Y. Federated machine learning: Concept and applications. ACM Trans. Intell. Syst. Technol. (TIST) 2019, 10, 1–19. [Google Scholar] [CrossRef]
  39. Batistatos, M.C.; De Cola, T.; Kourtis, M.A.; Apostolopoulou, V.; Xilouris, G.K.; Sagias, N.C. AGRARIAN: A Hybrid AI-Driven Architecture for Smart Agriculture. Agriculture 2025, 15, 904. [Google Scholar] [CrossRef]
  40. Liu, J.; Zhou, X.; Li, M.; Han, S.; Guo, L.; Chi, L.; Yang, L. Artificial intelligence drives high-quality development of new quality productivity of animal husbandry: Constraints, generation logic and promotion path. Smart Agric. 2025, 7, 165–177. [Google Scholar]
  41. Arshad, J.; Rehman, A.U.; Othman, M.T.B.; Ahmad, M.; Tariq, H.B.; Khalid, M.A.; Moosa, M.A.R.; Shafiq, M.; Hamam, H. Deployment of wireless sensor network and iot platform to implement an intelligent animal monitoring system. Sustainability 2022, 14, 6249. [Google Scholar] [CrossRef]
Figure 1. Differences among Hu sheep, Dorper sheep and Dorper × Hu crossbreed sheep.
Figure 2. Data collection environment at the experimental station.
Figure 3. Annotation of sheep faces from different angles.
Figure 4. Structure diagram of the original YOLO11.
Figure 5. Structure diagram of SAMSConv.
Figure 6. KLAttention structure diagram.
Figure 7. Structure diagram of SAMS-KLA-YOLO11.
Figure 8. Detection results of ablation experiments.
Figure 9. Visual examples of OKS-based facial keypoint evaluation on multi-breed sheep faces.
Figure 10. Comparison of recognition results between the original YOLO11 and the improved model.
Figure 11. Training and validation curves of loss, precision, recall, mAP50 and mAP50–95 for the SAMS-KLA-YOLO11 model.
Figure 12. Heatmap-based visualization within detected sheep face regions.
Figure 13. Representative failure cases of the proposed SAMS-KLA-YOLO11 model under severe occlusion, large head-pose variation, and extreme illumination conditions.
Table 1. Experimental platform parameters.

Configuration | Specifications
Operating system | Windows 11 (Microsoft Corporation, Redmond, WA, USA)
Central processing unit | Intel Core i5-14600KF (Intel Corporation, Santa Clara, CA, USA)
GPU | RTX 4060 (NVIDIA Corporation, Santa Clara, CA, USA)
Application packages | Python 3.9, PyTorch 2.0, CUDA 11.7
Table 2. Comparison of mainstream YOLO models.

Models | P | R | mAP50 | mAP50–95 | FLOPs (G) | Params
YOLO3-pose | 0.950 | 0.902 | 0.956 | 0.948 | 19.8 | 12.66 M
YOLO5-pose | 0.921 | 0.867 | 0.921 | 0.906 | 7.3 | 2.58 M
YOLO6-pose | 0.919 | 0.826 | 0.910 | 0.895 | 11.9 | 4.27 M
YOLO8-pose | 0.929 | 0.883 | 0.936 | 0.920 | 8.4 | 3.08 M
YOLO9-pose | 0.928 | 0.894 | 0.938 | 0.921 | 7.8 | 2.03 M
YOLO10-pose | 0.907 | 0.848 | 0.924 | 0.911 | 6.8 | 2.34 M
YOLO11-pose | 0.938 | 0.895 | 0.951 | 0.938 | 6.6 | 2.66 M
YOLO12-pose | 0.933 | 0.886 | 0.946 | 0.927 | 6.6 | 2.63 M
SAMS + KLA + YOLO11 | 0.983 | 0.963 | 0.986 | 0.980 | 5.9 | 2.27 M
Table 3. Ablation test results.

EIoU | KLAttention | SAMSConv | P | R | mAP50 | mAP50–95 | FLOPs (G) | Params
× | × | × | 0.938 | 0.895 | 0.951 | 0.938 | 6.6 | 2.66 M
√ | × | × | 0.942 | 0.920 | 0.958 | 0.949 | 6.6 | 2.66 M
× | √ | × | 0.949 | 0.928 | 0.968 | 0.954 | 6.8 | 2.75 M
× | × | √ | 0.943 | 0.914 | 0.963 | 0.951 | 5.7 | 2.18 M
√ | √ | × | 0.958 | 0.943 | 0.974 | 0.962 | 6.8 | 2.75 M
× | √ | √ | 0.961 | 0.957 | 0.980 | 0.973 | 5.9 | 2.27 M
√ | × | √ | 0.954 | 0.925 | 0.967 | 0.960 | 5.7 | 2.18 M
√ | √ | √ | 0.983 | 0.963 | 0.986 | 0.980 | 5.9 | 2.27 M
Note: “√” indicates that the component is included, and “×” indicates that the component is not included.
Table 4. Comparison results of convolutional layers.

Models | P | R | mAP50 | mAP50–95 | FLOPs (G) | Params
YOLO11 | 0.938 | 0.895 | 0.951 | 0.938 | 6.6 | 2.66 M
YOLO11 + SAMSConv | 0.943 | 0.914 | 0.963 | 0.951 | 5.7 | 2.18 M
YOLO11 + DSConv | 0.950 | 0.896 | 0.956 | 0.940 | 4.9 | 2.67 M
YOLO11 + PConv | 0.963 | 0.894 | 0.959 | 0.942 | 7.4 | 2.53 M
YOLO11 + DWConv | 0.939 | 0.877 | 0.940 | 0.924 | 5.0 | 2.00 M
YOLO11 + GhostConv | 0.941 | 0.856 | 0.931 | 0.916 | 5.8 | 2.33 M
Table 5. Comparison of attention mechanisms.

Models | P | R | mAP50 | mAP50–95 | FLOPs (G) | Params
YOLO11 | 0.938 | 0.895 | 0.951 | 0.938 | 6.6 | 2.66 M
YOLO11 + KLAttention | 0.949 | 0.928 | 0.968 | 0.954 | 6.8 | 2.75 M
YOLO11 + CBAM | 0.937 | 0.907 | 0.951 | 0.936 | 6.7 | 2.75 M
YOLO11 + SCSA | 0.929 | 0.898 | 0.946 | 0.931 | 6.6 | 2.67 M
YOLO11 + SHSA | 0.936 | 0.892 | 0.941 | 0.924 | 6.8 | 2.75 M
YOLO11 + SMFA | 0.923 | 0.916 | 0.946 | 0.932 | 8.0 | 3.54 M
Table 6. Comparison of loss functions.

Models | P | R | mAP50 | mAP50–95 | FLOPs (G) | Params
YOLO11 | 0.938 | 0.895 | 0.951 | 0.938 | 6.6 | 2.66 M
YOLO11 + EIoU | 0.942 | 0.920 | 0.958 | 0.949 | 6.6 | 2.66 M
YOLO11 + DIoU | 0.927 | 0.901 | 0.951 | 0.936 | 6.6 | 2.66 M
YOLO11 + SIoU | 0.928 | 0.895 | 0.946 | 0.930 | 6.6 | 2.66 M
YOLO11 + WIoU | 0.940 | 0.892 | 0.951 | 0.935 | 6.6 | 2.66 M
YOLO11 + MPDIoU | 0.921 | 0.892 | 0.951 | 0.932 | 6.6 | 2.66 M
