Article
Wild Yak Behavior Recognition Method Based on an Improved Yolov11

1 School of Computer Science, South-Central Minzu University, Wuhan 430074, China
2 Hubei Provincial Engineering Research Center for Intelligent Management of Manufacturing Enterprise, Wuhan 430074, China
3 Hubei Provincial Engineering Research Center of Agricultural Blockchain and Intelligent Management, Wuhan 430074, China
* Author to whom correspondence should be addressed.
Information 2026, 17(2), 214; https://doi.org/10.3390/info17020214
Submission received: 12 January 2026 / Revised: 6 February 2026 / Accepted: 12 February 2026 / Published: 19 February 2026
(This article belongs to the Section Artificial Intelligence)

Abstract

Yak daily behaviors, including feeding, standing, lying down, and walking, are closely related to their health status, making accurate behavior recognition essential for intelligent monitoring and management in yak husbandry. However, real-world grazing environments present significant challenges due to complex backgrounds, occlusions, small or distant targets, and high visual similarity between behavior categories. To address these issues, we propose a problem-driven, multi-scale behavior recognition framework based on an enhanced YOLOv11n architecture specifically designed for outdoor yak monitoring. A dedicated real-world dataset is constructed to capture four fundamental behaviors under diverse natural conditions. Based on this dataset, we develop the DPAP-YOLOv11n model, which incorporates Dynamic Convolution for adaptive feature modulation and Pinwheel-shaped Convolution (PConv) for fine-grained spatial representation. Additionally, a YOLOv7-Aux auxiliary training head is introduced to strengthen intermediate feature learning, and a Focal-PIoU loss function is adopted to improve robustness against hard or ambiguous samples. Experimental results show that DPAP-YOLOv11n outperforms the baseline YOLOv11n, achieving gains of 2.4% in mAP@50 and 2.8% in mAP@50–95. These findings demonstrate the practical potential of the proposed approach for high-precision, real-time yak behavior recognition in complex field environments.

1. Introduction

The yak is a vital livestock species on the Tibetan Plateau and its surrounding regions, providing herders with essential resources such as meat, milk, and pelts, while also contributing to the ecological stability of alpine environments. With the continuous expansion of yak husbandry, grazing intensity and herd size have increased, placing growing pressure on fragile rangeland ecosystems and making the balance between economic development and ecological conservation an urgent challenge. In this context, effective monitoring of yak health and behavior has become increasingly important for achieving sustainable pastoral management.
Daily behaviors such as eating, walking, standing, and resting directly reflect the physiological condition and environmental adaptability of yaks. Abnormal behavioral patterns often serve as early indicators of disease, environmental stress, or management-related problems. However, conventional monitoring approaches, including manual observation and sensor-based tracking [1], are limited by restricted spatial coverage, high labor costs, and insufficient scalability. These limitations make them particularly unsuitable for large, open, and high-altitude grazing areas where yaks are widely distributed and environmental conditions are highly variable. In contrast, rapid advances in computer vision [2] and deep learning [3] have enabled automated, non-contact livestock behavior recognition, providing a promising alternative for large-scale pastoral monitoring.
In recent years, deep convolutional neural networks have been increasingly applied to yak behavior analysis. Sun et al. [4] proposed a non-contact behavior recognition method based on an improved SlowFast framework, introducing a 3D ResNet50 backbone and refined three-dimensional convolution kernels to enhance recognition of grazing, lying, standing, and walking behaviors; however, the recognition performance for standing and walking behaviors remained suboptimal due to their high visual similarity. Li et al. [5] developed an individual identification method for wild yaks using a component-based convolutional network combined with self-supervised learning strategies, such as random erasure and region-visibility prediction, achieving notable improvements under limited labeled data conditions. Wang et al. [6] integrated an enhanced YOLOv5 detector with the ByteTrack algorithm to construct a yak video-tracking framework, incorporating coordinate attention, multi-scale feature fusion, and dilated pooling pyramids, and achieved improved multi-target tracking stability. Wang et al. [7] further proposed a Faster R-CNN-based detection strategy for Tibetan pastoral areas by reformulating grazing-scene recognition as a binary yak/non-yak classification problem. Additionally, Cao et al. [8] developed a YOLOv5-based yak detection method using independently collected field images, demonstrating fast and accurate detection across diverse plateau environments.
Beyond yak-specific studies, extensive research has been conducted on behavior recognition in other livestock species, particularly dairy cattle. Bai et al. [9] proposed a multi-scale dairy cow behavior recognition method based on an improved YOLOv5s framework by integrating Transformer modules and squeeze-and-excitation attention, which enhanced recognition accuracy but remained limited for small-object scenarios. Fu et al. [10] developed an enhanced YOLOv8-based behavior recognition and tracking framework incorporating C2f Faster structures, CARAFE upsampling, BiFormer attention, DyHead dynamic detection heads, and a Focal SIoU loss function. Zhang et al. [11] introduced the WCG-YOLO11n model by redesigning the C3K2 module and incorporating attention-enhanced feature fusion, achieving measurable improvements in multi-object behavior recognition. Bai et al. [12] further expanded multi-scale receptive fields using a Res2 backbone and global-context prediction heads, improving robustness under dense occlusion. Zheng et al. [13] proposed an enhanced YOLOv8n-based model by introducing NWD localization loss and multiple attention mechanisms to improve estrus behavior recognition in complex environments.
Additional studies have explored lightweight model design and efficiency optimization. For example, lightweight pruning strategies based on YOLOv5n and YOLOv5 frameworks integrating ASPP modules have been reported [14,15,16]. In addition, Li et al. [17] proposed a ConvNeXt-based multi-branch CAFNet model for cattle behavior recognition, significantly improving accuracy while reducing computational complexity. Gao et al. [18] introduced the LDCM-YOLOv10n model, enhancing small-target and multi-scale behavior recognition through large-kernel attention and improved upsampling strategies. Chen et al. [19] developed a lightweight EVH-YOLO11 model that combines an EfficientViT backbone with a Dynamic Head, achieving robust performance in complex pastoral environments characterized by occlusion and crowding.
Related research on pigs and sheep further highlights common challenges in livestock behavior recognition. Yan et al. [20] proposed the FPA-Tiny-YOLO method for pig detection under occlusion by integrating feature pyramid attention. Comprehensive reviews by Teng et al. [21] and Chen et al. [22] emphasized that scene complexity, occlusion, small targets, and model lightweighting remain key challenges across intelligent livestock monitoring systems. Zhuang et al. [23] proposed a CNN-based method for recognizing oestrus behavior in large white sows. Wang et al. [24] developed an enhanced YOLOv8s-based sheep behavior recognition method, but noted performance degradation under group interactions and severe occlusion.
Although existing studies have achieved notable progress, several challenges remain particularly pronounced in real-world yak grazing scenarios. First, visually similar behaviors such as standing and walking are still difficult to distinguish reliably. Second, small-scale or distant yaks are often inadequately represented in high-level feature maps, leading to missed detections [25]. Third, group interactions and partial occlusion frequently degrade recognition stability [26]. These limitations indicate that directly applying existing livestock behavior recognition models to yak monitoring is insufficient.
To address these challenges, this study proposes an enhanced yak behavior recognition method based on an improved YOLOv11 framework. By introducing adaptive feature extraction, fine-grained spatial modeling, and optimized loss design, the proposed method aims to improve recognition accuracy and robustness under complex field conditions characterized by small targets, occlusion, and background interference. The main contributions of this work are summarized as follows:
(1) An enhanced YOLOv11-based yak behavior recognition framework is proposed, specifically tailored for complex outdoor grazing environments on the Tibetan Plateau.
(2) A Dynamic Convolution–enhanced C3K2 module is introduced to improve adaptive feature representation and alleviate misclassification between visually similar behaviors such as standing and walking.
(3) A Pinwheel-shaped Convolution (PConv) combined with a small-target detection mechanism is employed to strengthen fine-grained structural modeling and improve detection of small or distant yaks.
(4) A Focal-PIoU loss function is incorporated to enhance sensitivity to hard-to-detect and occluded behavior samples, improving localization accuracy and training stability.
(5) Extensive experiments on a field-collected yak behavior dataset demonstrate that the proposed method outperforms baseline and existing YOLO-based approaches, validating its effectiveness in real-world pastoral monitoring scenarios.

2. Materials and Methods

This section introduces the materials and methods used in this study, including the dataset, data preprocessing techniques, evaluation metrics, experimental parameters, as well as the detailed architecture and components of the proposed DPAP-YOLOv11n model.

2.1. Dataset

2.1.1. Data Collection

The yak image dataset used in this study originates from two main sources. One portion was obtained from the ImageNet dataset, while the other was collected in the field at Kangzong Ranch, Jiwa Village, Xiangqu Township, Biru County, Nagqu City, Tibet Autonomous Region. The data were captured using a vivo X100 Pro smartphone (vivo Mobile Communication Co., Ltd., Dongguan, China) at a resolution of 1920 × 1080 and 30 frames per second, with all videos stored in MP4 format. Between 15 September and 15 October 2024, a total of 49 video clips were recorded across different time periods, involving 60 individual yaks. To strengthen dataset diversity and ensure representativeness, the collected material includes images captured under varying weather conditions (sunny and overcast) and from multiple perspectives (close- and long-range views). These variations help approximate real-world pastoral environments when recognizing behaviors such as standing, lying, eating, and walking. An example of the outdoor yak environment is shown in Figure 1.

2.1.2. Data Preprocessing

During the data acquisition phase, 49 videos containing 60 yaks were collected. For comprehensive behavioral analysis, four representative activities (eating, lying, standing, and walking) were selected for detailed observation. These behaviors represent routine yak activities and serve as key indicators of health status. Table 1 provides detailed descriptions, and Figure 2 presents illustrative examples. The video recordings were decomposed into frames, converted into still images, and saved in JPG format.
To improve dataset quality and reduce redundancy, image deduplication was applied, resulting in 1637 valid images. Following preprocessing, yak behaviors were annotated using LabelImg (version 1.8.6), with annotations stored in TXT files corresponding to each image. These files specify the class, position, and bounding box of each labeled behavior. An illustration of the annotation process is shown in Figure 3.
In accordance with dataset construction requirements, the annotated images and labels were divided into training, validation, and testing subsets at a ratio of 7:2:1. To enhance generalization, horizontal flipping was employed for data augmentation, thereby increasing data diversity and representativeness. The final dataset includes 2291 training images, 654 validation images, and 329 test images. The overall workflow for dataset construction is depicted in Figure 4.
Due to natural behavioral distribution in real grazing environments, the dataset exhibited class imbalance, particularly for the lying behavior, which occurred less frequently than eating, standing, and walking. This imbalance reflects realistic pastoral conditions, where yaks spend most of their time feeding, standing, or moving. To alleviate the impact of class imbalance during training, horizontal flipping-based data augmentation was applied preferentially to underrepresented classes, especially the lying category. This strategy increases sample diversity while preserving the semantic and structural characteristics of the original behavior patterns, without introducing unrealistic visual artifacts.
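The horizontal-flip augmentation described above also requires mirroring the corresponding annotations. A minimal sketch in Python (the tuple layout follows the standard YOLO TXT convention of normalized coordinates; the helper name is ours, not from the study's code):

```python
def hflip_yolo_label(label):
    """Mirror one YOLO-format annotation for a horizontally flipped image.

    A label is (class_id, x_center, y_center, width, height), with all
    coordinates normalized to [0, 1]; flipping only reflects x_center.
    """
    cls, x, y, w, h = label
    return (cls, 1.0 - x, y, w, h)

# A box centered at x = 0.25 moves to x = 0.75 after the flip; width,
# height, and the vertical position are unchanged.
flipped = hflip_yolo_label((2, 0.25, 0.5, 0.1, 0.2))
```

Because only the x-coordinate is reflected, the semantic and structural content of the behavior pattern is preserved, as noted above.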

2.2. The Indicators of Evaluation

To scientifically evaluate the performance of different models on yak behavior recognition tasks, we adopted precision (P), recall (R), average precision (AP), and mean average precision (mAP) as accuracy metrics. Model complexity was evaluated using GFLOPs (calculated with the THOP library, version 0.1.1.post2209072238, under the PyTorch 2.2.0 framework) and model size. Specifically, mAP@50 indicates the detection precision when the IoU threshold is 0.5, and mAP@50–95 denotes the average precision over IoU thresholds ranging from 0.50 to 0.95. P, R, and mAP measure the detection accuracy of the model, where higher values indicate better performance. GFLOPs and model size indicate the model's degree of lightweighting; smaller values suggest a more compact model with lower hardware requirements. The specific calculation formulas are shown in Equations (1)–(4).
$$P = \frac{TP}{TP + FP} \tag{1}$$
$$R = \frac{TP}{TP + FN} \tag{2}$$
$$AP = \int_0^1 P(R)\,dR \tag{3}$$
$$mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i \tag{4}$$
In this study, TP and TN represent the numbers of correctly predicted positive and negative samples, respectively. FP denotes negative samples misclassified as positive, whereas FN refers to positive samples misclassified as negative. $AP_i$ indicates the average precision of category $i$, and $N$ represents the total number of behavioral categories.
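Equations (1)–(4) can be checked with a short numerical sketch in pure Python (the trapezoidal integration of the precision-recall curve is one common discrete approximation of Equation (3); function names are ours):

```python
def precision_recall(tp, fp, fn):
    """Equations (1) and (2): precision and recall from detection counts."""
    return tp / (tp + fp), tp / (tp + fn)

def average_precision(precisions, recalls):
    """Discrete approximation of Equation (3): area under the P-R curve."""
    pts = sorted(zip(recalls, precisions))
    ap = 0.0
    for (r0, p0), (r1, p1) in zip(pts, pts[1:]):
        ap += (r1 - r0) * (p0 + p1) / 2.0  # trapezoidal rule per segment
    return ap

def mean_average_precision(ap_per_class):
    """Equation (4): mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)
```

For example, 90 true positives with 10 false positives and 20 false negatives give P = 0.9 and R ≈ 0.818.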

2.3. Experimental Parameter Configuration

This study built an experimental platform on a Windows-based server using the PyTorch deep learning framework and the YOLOv11 base network. Detailed hardware and software environment settings are listed in Table 2. Unless stated otherwise, all experiments were conducted using default hyperparameter configurations. The complete list of hyperparameters is presented in Table 3.
Aside from differences in model architecture, the parameter configurations were kept consistent across all model versions. Input images were scaled to 640 × 640 pixels, and each training batch comprised 32 images. An early-stopping patience of 100 epochs was applied to reduce overfitting. Training was conducted for up to 300 epochs, after which the converged model weights were saved and evaluated.

2.4. Methods

2.4.1. YOLOv11n Network Model

YOLOv11 [27], proposed by Ultralytics in September 2024, is a lightweight real-time object detection framework composed of three core components: the backbone, neck, and detection head. In this study, the YOLOv11n variant is selected as the baseline model due to its favorable trade-off between detection accuracy and computational efficiency, making it suitable for deployment in large-scale outdoor grazing environments.
The backbone of YOLOv11n is based on an improved CSP-style architecture, where the C3k2 module serves as the main feature extraction unit. Compared with the C3 and C3k modules used in earlier YOLO versions, C3k2 improves feature extraction efficiency while maintaining lower computational complexity. In addition, a Spatial Pyramid Pooling Fast (SPPF) module is introduced at the high-level feature stage to enrich multi-scale contextual information. The neck adopts a Feature Pyramid Network (FPN) with enhanced bottom-up information flow, enabling effective fusion of shallow spatial details and deep semantic features. A C2PSA module with multi-head attention is further incorporated to enhance sensitivity to fine-grained features, which is particularly beneficial for small and partially occluded targets.
The detection head follows a decoupled design, independently performing classification and localization tasks. Depth-wise convolution layers are employed to reduce parameter redundancy and computational cost while maintaining detection accuracy. Based on this architecture, YOLOv11n provides a robust baseline for yak behavior recognition in complex outdoor scenarios. To further address challenges such as severe occlusion and small-scale targets, an enhanced model termed DPAP-YOLOv11n is proposed and described in the following sections.

2.4.2. DPAP-YOLOv11n

To address the practical challenges of wild yak behavior detection in complex outdoor environments, including background similarity, frequent occlusion, and scale variation, we propose an enhanced model named DPAP-YOLOv11n. This model introduces four targeted improvements over the YOLOv11n baseline. First, the original C3k2 module is redesigned into a C3k2-DynamicConv structure by integrating dynamic convolution and an optimized bottleneck design. This enhancement allows adaptive kernel adjustment based on local feature context, effectively improving robustness to partial occlusion, scale variability, and visually complex backgrounds. Second, Pinwheel-shaped Convolution (PConv) is incorporated into the backbone layers to explicitly encode spatial directionality. Unlike conventional attention mechanisms that simply reweight feature importance, PConv enhances the modeling of local structure and improves the network's ability to distinguish visually similar behaviors such as standing and walking.
In addition, the original detection head is replaced with a YOLOv7-style auxiliary head, which introduces intermediate-level supervision and enhances multi-scale feature learning. This design is especially beneficial for small or visually ambiguous behavior targets. Finally, the conventional CIoU loss function is replaced with the Focal-PIoU loss, which combines focal weighting with predicted-IoU awareness. This substitution rebalances training toward difficult-to-localize samples, enhances localization accuracy under occlusion, and improves overall model robustness. Each module was selected to address a specific challenge of field-based yak behavior recognition while maintaining real-time inference efficiency. The complete network architecture is illustrated in Figure 5.

2.4.3. Dynamic Convolution

In the task of wild yak behavior recognition, outdoor environments are typically characterized by dramatic lighting shifts, dense vegetation occlusion, and large-scale variations in targets, which hinder the fixed-kernel structure of the conventional C3K2 module from capturing essential features effectively. To overcome these limitations, this paper integrates DynamicConv [28] into a redesigned C3K2 module within YOLOv11n, improving the model’s adaptability to complex wild scenarios and boosting its feature extraction capability for multi-scale yak behavior.
The DynamicConv module adaptively adjusts the combination of convolution kernels based on the semantic content of the input feature maps, enabling more selective feature extraction. The module mainly consists of two components: an attention-based weight generation module and a multi-branch convolution module. First, the input feature map is processed using Global Average Pooling to extract its global semantic information. Then, the semantic vector is passed through two fully connected (FC) layers, and the fusion weight vector $\alpha = [\alpha_1, \alpha_2, \ldots, \alpha_k]$ is obtained via the Softmax function, where each weight satisfies:
$$\sum_{i=1}^{k} \alpha_i = 1$$
Then, the input feature map is fed into $k$ convolution branches with distinct parameters for feature extraction, producing intermediate outputs $\mathrm{Conv}_1(X), \mathrm{Conv}_2(X), \ldots, \mathrm{Conv}_k(X)$.
The final output feature map $Y$ is obtained by performing a weighted sum of these branch outputs, as defined by the following formula:
$$Y = \sum_{i=1}^{k} \alpha_i \cdot \mathrm{Conv}_i(X)$$
In this equation, $X$ denotes the input feature map, and $k$ represents the total number of convolution kernels. $\mathrm{Conv}_i(X)$ indicates the output generated when kernel $i$ is applied to $X$, and $\alpha_i$ denotes the weight associated with that kernel. The processing procedure of DynamicConv is shown in Figure 6.
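The weighted-branch formulation above can be sketched in PyTorch as follows. This is a minimal illustration, not the exact module used in DPAP-YOLOv11n: the branch count, hidden width of the FC layers, and activation choices are assumptions, and the original DynamicConv aggregates kernel weights before convolving for efficiency, which yields the same output as the branch-sum written here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConv(nn.Module):
    """Sketch of DynamicConv: k parallel kernels fused by input-conditioned
    softmax weights (Global Average Pooling -> two FC layers -> Softmax)."""

    def __init__(self, in_ch, out_ch, k=4, kernel_size=3, reduction=4):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, kernel_size, padding=kernel_size // 2)
            for _ in range(k)
        )
        hidden = max(in_ch // reduction, 4)
        self.fc1 = nn.Linear(in_ch, hidden)
        self.fc2 = nn.Linear(hidden, k)

    def forward(self, x):
        # Attention weights alpha from global semantic context, summing to 1.
        g = x.mean(dim=(2, 3))                                   # GAP: (B, C)
        alpha = F.softmax(self.fc2(F.relu(self.fc1(g))), dim=1)  # (B, k)
        # Weighted sum of branch outputs: Y = sum_i alpha_i * Conv_i(X).
        outs = torch.stack([b(x) for b in self.branches], dim=1)  # (B, k, C', H, W)
        return (alpha[:, :, None, None, None] * outs).sum(dim=1)
```

Because the softmax weights depend on the input's global statistics, different lighting or background conditions select different kernel mixtures, which is the adaptivity the module is introduced for.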

2.4.4. Pinwheel-Shaped Convolution (PConv)

In wild yak behavior recognition, large shooting distances and complex grassland backgrounds often result in small-scale targets with blurred contours and weak structural details. As target size decreases, discriminative information becomes concentrated in local gradients and edge regions, which are difficult to capture using the uniform sampling pattern of standard 3 × 3 convolutions. Consequently, small yak targets are prone to being overwhelmed by background textures in shallow feature maps, leading to degraded behavior recognition performance.
To enhance the representation of small-scale targets, Pinwheel-shaped Convolution (PConv) [29] is integrated into the initial convolutional layer of the YOLOv11 backbone. PConv decomposes a conventional convolution kernel into four directional filters of sizes 1 × 3 and 3 × 1, combined with asymmetric padding along horizontal and vertical directions to form a pinwheel-shaped sampling pattern. This design strengthens directional feature perception and emphasizes central body regions, enabling more effective extraction of fine structural cues such as contours and limb boundaries while suppressing background interference from grass textures and illumination variations. After directional feature concatenation and a 2 × 2 fusion convolution, the output preserves the original spatial resolution, allowing seamless integration into shallow network layers. The structural configuration of PConv is illustrated in Figure 7.
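The directional decomposition described above can be sketched as follows. This is an illustrative PyTorch reimplementation under stated assumptions: the exact asymmetric-padding scheme and per-arm channel split of the published PConv may differ, but the structure (four 1×3/3×1 arms with asymmetric zero padding, channel concatenation, 2×2 fusion, resolution preserved) matches the description.

```python
import torch
import torch.nn as nn

class PinwheelConv(nn.Module):
    """Sketch of Pinwheel-shaped Convolution (PConv): four directional
    1x3 / 3x1 branches with asymmetric zero padding, concatenated and
    fused by a 2x2 conv so spatial resolution is preserved."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        c = out_ch // 4
        # (left, right, top, bottom) asymmetric pads, one per pinwheel arm,
        # chosen so every arm's output keeps the input's H x W.
        pads = [(2, 0, 0, 0), (0, 2, 0, 0), (0, 0, 2, 0), (0, 0, 0, 2)]
        kernels = [(1, 3), (1, 3), (3, 1), (3, 1)]
        self.arms = nn.ModuleList(
            nn.Sequential(nn.ZeroPad2d(p), nn.Conv2d(in_ch, c, k), nn.ReLU())
            for p, k in zip(pads, kernels)
        )
        # 2x2 fusion conv; the (0, 1, 0, 1) pad keeps H x W unchanged.
        self.fuse = nn.Sequential(nn.ZeroPad2d((0, 1, 0, 1)),
                                  nn.Conv2d(4 * c, out_ch, 2))

    def forward(self, x):
        return self.fuse(torch.cat([arm(x) for arm in self.arms], dim=1))
```

Since input and output resolutions match, such a block can replace a shallow convolutional layer without disturbing the rest of the backbone, which is how it is integrated here.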

2.4.5. YOLOv7-Aux Auxiliary Head

To mitigate insufficient feature representation caused by weak textures, small targets, and frequent occlusion in wild yak behavior recognition, a YOLOv7-style auxiliary head (Aux Head) [30] is introduced to enhance mid-level feature learning, as illustrated in Figure 8. The Aux Head adds an additional detection branch on intermediate-scale feature maps produced by the backbone and neck, enabling direct supervision of high-resolution spatial features during training. Its architecture mirrors that of the main detection head, comprising classification and regression branches, while being activated only during training and completely removed during inference, thereby introducing no additional computational overhead.
During training, supervision signals from the primary detection head are jointly assigned to both heads through a lead-guided assigner, as shown in Figure 8d, ensuring consistent learning across multiple branches. To further improve recall for small and partially occluded targets, the Aux Head adopts a Coarse-to-Fine positive-sample allocation strategy (Figure 8e). The loss from the Auxiliary Head is backpropagated together with that of the main head, strengthening gradient propagation to earlier layers and promoting more robust feature representation and stable convergence.
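The joint backpropagation described above amounts to summing a down-weighted auxiliary loss with the lead-head loss. A minimal sketch (the 0.25 auxiliary weight follows YOLOv7's convention and is an assumption for this study):

```python
def joint_detection_loss(lead_loss, aux_loss, aux_weight=0.25):
    """Total training loss: lead-head loss plus down-weighted aux-head loss.

    The auxiliary branch exists only during training; at inference time the
    aux head is removed, so only the lead head contributes to predictions.
    """
    return lead_loss + aux_weight * aux_loss
```

Keeping the auxiliary weight below 1 lets the lead head dominate optimization while the auxiliary branch still injects gradients into intermediate layers.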

2.4.6. Focal-PIoU Loss Function

In the baseline YOLOv11 model, DFL Loss and CIoU Loss are employed for bounding box regression. DFL Loss allows the network to more effectively focus on yak object locations and their surrounding regions, while CIoU Loss is utilized to evaluate the accuracy of predicted bounding boxes. However, CIoU Loss primarily considers geometric factors such as center distance, aspect ratio, and overlap between the predicted and ground truth boxes, but lacks sensitivity to sample-level difficulty during training. This limitation may lead the model to overfit on easy samples while underemphasizing harder instances, such as occluded, blurred, or small-scale targets. As a result, the overall detection performance and robustness may be negatively affected.
This paper adopts Focal-PIoU [31] as an enhanced bounding box regression loss to address the limitations of the CIoU loss function. Focal-PIoU integrates the difficulty-aware modulation principle of Focal Loss into the IoU-based framework, imposing larger gradient penalties on samples with low predicted IoU scores (i.e., poorly aligned predicted and ground-truth boxes), thus amplifying the model's emphasis on hard-to-detect targets. Conversely, for well-predicted easy samples, Focal-PIoU assigns reduced loss values, mitigating local minima and preventing easy samples from dominating the training process. This method substantially enhances the model's precision in identifying small and occluded targets in complex outdoor settings. The formulation of Focal-PIoU is given as follows.
$$\mathcal{L}_{\text{Focal-PIoU}} = (1 - PIoU)^{\gamma}$$
$$PIoU = \frac{|B_p \cap B_t|}{|B_p \cup B_t| + \varepsilon}$$
In these equations, $PIoU$ signifies the probabilistic Intersection over Union, indicating the degree of alignment between each predicted box and the ground-truth box. $\gamma$ is the focusing parameter that regulates the prioritization of difficult samples; when $\gamma > 0$, samples with low $PIoU$ values incur greater penalties. $B_p$ denotes the predicted box, while $B_t$ represents the ground-truth box. $\varepsilon$ is a stability parameter implemented to avert division by zero. Figure 9 presents a schematic depiction of the Focal-PIoU parameters.
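Following the two formulas above, the loss can be sketched for axis-aligned boxes in (x1, y1, x2, y2) form. This is a minimal illustration: γ = 2 is an assumed setting, and the published Focal-PIoU includes refinements (e.g., penalty terms on non-overlapping geometry) not reproduced here.

```python
def box_area(b):
    x1, y1, x2, y2 = b
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)

def piou(bp, bt, eps=1e-7):
    """Intersection over (union + eps), per the PIoU formula above."""
    ix1, iy1 = max(bp[0], bt[0]), max(bp[1], bt[1])
    ix2, iy2 = min(bp[2], bt[2]), min(bp[3], bt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = box_area(bp) + box_area(bt) - inter
    return inter / (union + eps)

def focal_piou_loss(bp, bt, gamma=2.0):
    """Focal weighting: poorly aligned boxes receive larger penalties."""
    return (1.0 - piou(bp, bt)) ** gamma
```

A perfectly aligned prediction yields a loss near zero, while a fully disjoint prediction yields the maximum loss of 1, so gradients concentrate on the hard, misaligned samples.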

3. Results

3.1. Comparison of Model Results

The DPAP-YOLOv11n and YOLOv11 (s, m, l, x) models were trained on the yak behavior dataset. Their performance was evaluated based on model size, precision, recall, and mean average precision (mAP) across the four behavior categories. The objective was to assess the efficacy of the various recognition models. The experimental results are summarized in Table 4.
The proposed DPAP-YOLOv11n model achieved a precision of 0.933, a recall of 0.891, and a mean average precision (mAP) of 0.941. Compared to YOLOv11n and YOLOv11s, precision increased by 0.032 and 0.004, respectively, although it was slightly lower than that of YOLOv11m, YOLOv11l, and YOLOv11x by 0.008, 0.006, and 0.012, respectively. In terms of recall, DPAP-YOLOv11n outperformed YOLOv11n by 0.201, while showing marginal declines of 0.009, 0.017, 0.011, and 0.005 compared to YOLOv11s, YOLOv11m, YOLOv11l, and YOLOv11x, respectively. Regarding mAP, the model surpassed YOLOv11n by 0.024 but was slightly inferior to YOLOv11s, YOLOv11m, YOLOv11l, and YOLOv11x by 0.001, 0.012, 0.001, and 0.004, respectively.
Regarding model scale, the YOLOv11 series (n, s, m, l, x) shows progressive growth in model size: YOLOv11n is the smallest at 5.5 MB, whereas YOLOv11x is the largest at 114.4 MB. The DPAP-YOLOv11n model occupies 8.1 MB, slightly exceeding YOLOv11n, yet reducing model size by 93% compared to YOLOv11x. YOLOv11m yields an mAP of 0.953, outperforming DPAP-YOLOv11n by 0.012, but its model size is roughly fivefold that of DPAP-YOLOv11n. Based on the overall results, DPAP-YOLOv11n demonstrates the best performance-size balance for behavior recognition in complex outdoor scenes.

3.2. Comparative Analysis of Loss Functions

A comparison of commonly used loss functions, including CIoU, EIoU, GIoU, DIoU, SIoU, and PIoU, applied to the converged DPAP-YOLOv11n model is presented in Table 5. The experimental results show that the PIoU loss achieves superior accuracy compared with the other IoU-based loss functions, with the DPAP-YOLOv11n model attaining mAP50 and mAP50-95 scores of 94.1% and 83.4%, respectively.
After integrating the classification Focal Loss into PIoU, the resulting DPAP-YOLOv11n (Focal PIoU) model achieves mAP50 and mAP50-95 scores of 94.1% and 83.9%, respectively, representing improvements of 0.5 and 0.6 percentage points over the DPAP-YOLOv11n (CIoU) variant. These results indicate that, for the yak behavior dataset, incorporating Focal Loss into the bounding box regression process enhances model performance, particularly in mitigating class imbalance and improving the detection of hard-to-classify samples.

3.3. DPAP-YOLOv11n Ablation Study

This study analyzes the AP values of the yak behavior recognition model across all and individual behavior categories to demonstrate the effectiveness of DPAP-YOLOv11n. To assess the impact of the C3K2_DynamicConv module, PConv-based downsampling, the Aux Head, and the Focal-PIoU loss function on the recognition of different behavior categories, ablation experiments were conducted, and the results are shown in Table 6.
According to Table 6, the baseline YOLOv11n model achieves relatively high APs for eating and lying, at 94.3% and 92.7%, respectively, but performs less effectively for standing and walking, with APs of 88.9% and 91.1%. This limitation is mainly attributed to the high visual similarity between standing and walking; since YOLOv11n relies solely on static features and lacks temporal cues, it frequently misclassifies these similar behaviors. To address this issue, the integration of the C3k2-DynamicConv module improved the AP for standing to 91.3% and for walking to 92.9%, while the AP for eating increased to 94.8% and for lying to 93.1%. To further enhance detection performance in complex natural environments such as grasslands and vegetation, the PConv module was incorporated, raising the APs for eating and standing to 95.1% and 92.2%, respectively. The inclusion of the Aux Head further raised the APs for eating, standing, and walking to 95.6%, 92.8%, and 93.4%, respectively. Moreover, the Focal-PIoU loss yielded notable improvements in lying and walking, increasing their APs to 94.1% and 94.3%. Combining C3k2-DynamicConv, PConv, the Aux Head, and the Focal-PIoU loss, the final APs are: eating 95.5%, standing 92.4%, lying 94.1%, and walking 94.3%, representing improvements of 1.2%, 3.5%, 1.4%, and 3.2% over the baseline YOLOv11n in the respective behavior categories. In addition, the enhanced DPAP-YOLOv11n model achieves an mAP50 of 94.1% and an mAP50-95 of 83.9%, outperforming the baseline by 2.4% and 2.8%, respectively, confirming its superior capability for behavior detection under complex field conditions.

3.4. Comparative Analysis of Various YOLO Models

To further validate the advantages of the proposed DPAP-YOLOv11n model in terms of accuracy and lightweight design, we conducted a comprehensive performance comparison against representative models from the YOLO series, including YOLOv3-tiny, YOLOv5n, YOLOv6, YOLOv10n, and YOLOv11n. All models were trained on the same yak behavior dataset under identical experimental settings to ensure fair comparison. To comprehensively evaluate model performance, we adopted precision, recall, and mean average precision (mAP) as accuracy metrics, and used computational complexity (FLOPs) and model size as indicators of inference speed and efficiency. Detailed experimental results are summarized in Table 7.
Compared with the five aforementioned YOLO-series models, the proposed DPAP-YOLOv11n demonstrates the most favorable overall performance across all evaluated metrics. Specifically, DPAP-YOLOv11n achieves the highest mean average precision of 94.1%, surpassing the baseline YOLOv11n by 2.4 percentage points. In comparison to YOLOv3-tiny, YOLOv5n, YOLOv6, and YOLOv10n, the proposed model exhibits mAP improvements of 1.6, 1.2, 0.4, and 0.8 percentage points, respectively, confirming its superior detection accuracy. In terms of efficiency, DPAP-YOLOv11n maintains a balanced trade-off between computational complexity and accuracy. It requires 6.2 GFLOPs and has a model size of 8.1 MB, which represents only a 0.1 GFLOPs decrease and a 2.6 MB increase compared to the baseline YOLOv11n. Despite this slight increase in model size, it remains significantly lighter than YOLOv3-tiny while delivering higher accuracy. These results demonstrate that DPAP-YOLOv11n achieves high detection precision with minimal computational overhead, making it particularly well-suited for deployment on edge or mobile devices.
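The stated accuracy gaps can be reproduced directly from the mAP@0.5 values in Table 7 with a simple percentage-point calculation:

```python
# mAP@0.5 values from Table 7, expressed as fractions.
map50 = {
    "YOLOv3-tiny": 0.925,
    "YOLOv5n": 0.929,
    "YOLOv6": 0.937,
    "YOLOv10n": 0.933,
    "YOLOv11n": 0.917,
}
dpap = 0.941  # DPAP-YOLOv11n

# Improvement of DPAP-YOLOv11n over each model, in percentage points.
deltas = {name: round((dpap - v) * 100, 1) for name, v in map50.items()}
print(deltas)  # {'YOLOv3-tiny': 1.6, 'YOLOv5n': 1.2, 'YOLOv6': 0.4, 'YOLOv10n': 0.8, 'YOLOv11n': 2.4}
```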

3.5. TensorRT Inference Acceleration Testing

Deploying deep learning models trained in high-performance environments directly on resource-limited hardware is often impractical, as detection speed requires further optimization. Hence, an inference acceleration approach for wild yak behavior recognition was explored. By employing the TensorRT framework [32] for GPU-based acceleration, the model achieves faster and more flexible deployment, enhancing its applicability in real-world yak monitoring scenarios.
The experiments were conducted on a Windows 11 system equipped with an NVIDIA GeForce RTX 3060 GPU, CUDA 11.8, and PyTorch 2.7.1. The inference results of the PyTorch-trained model on the RTX 3060 served as the baseline. Thereafter, the TensorRT framework was employed to perform structural optimization and precision quantization on the trained PyTorch weights. Models with varying precision levels were then executed for inference testing on wild yak image datasets, with the corresponding results summarized in Table 8.
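Per-frame timings of the kind reported in Table 8 are typically gathered with a warm-up-then-average harness. The sketch below is illustrative only; `infer` stands in for the PyTorch or TensorRT model call, which is assumed rather than shown:

```python
import time

def benchmark(infer, frame, warmup: int = 10, runs: int = 100):
    """Average per-frame latency (ms) and throughput (FPS) of `infer`.

    `infer` is any callable taking one preprocessed frame; warm-up runs are
    excluded so one-off initialization cost does not skew the average.
    """
    for _ in range(warmup):
        infer(frame)
    start = time.perf_counter()
    for _ in range(runs):
        infer(frame)
    avg_ms = (time.perf_counter() - start) / runs * 1000.0
    return avg_ms, 1000.0 / avg_ms
```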
According to Table 8, the DPAP model under the PyTorch framework achieved an inference time of 14.7 ms and an average per-frame latency of 17.5 ms, corresponding to 57 FPS, which already satisfies the real-time requirements for wild yak behavior recognition. After conversion to a TensorRT engine, the single-precision (FP32) model reduced inference time by 8.2 ms and reached 93 FPS, a 1.6× improvement. The half-precision (FP16) model decreased inference time by a further 0.9 ms and achieved 105 FPS, a 1.8× speed-up over the PyTorch model. Figure 10 illustrates these gains in inference time and detection speed.
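The throughput and speed-up figures follow directly from the average latencies in Table 8 (FPS = 1000 / latency in ms):

```python
# Average per-frame latency (ms) from Table 8.
latency_ms = {"PyTorch": 17.5, "TensorRT FP32": 10.7, "TensorRT FP16": 9.5}

fps = {k: round(1000.0 / v) for k, v in latency_ms.items()}
speedup = {k: round(latency_ms["PyTorch"] / v, 1) for k, v in latency_ms.items()}

print(fps)      # {'PyTorch': 57, 'TensorRT FP32': 93, 'TensorRT FP16': 105}
print(speedup)  # {'PyTorch': 1.0, 'TensorRT FP32': 1.6, 'TensorRT FP16': 1.8}
```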
In conclusion, the TensorRT-optimized, half-precision (FP16) version of the DPAP-YOLOv11n model reaches a detection speed of 105 FPS, indicating that TensorRT is well suited to inference acceleration on GPU hardware. This approach supports deployment across diverse computational environments, such as cloud and edge devices, markedly reducing image inference time and improving overall detection throughput, thereby facilitating the practical deployment of the model in intelligent livestock farming systems.

3.6. DPAP-YOLOv11n Visualization and Analysis

To visually demonstrate the superiority of the proposed DPAP-YOLOv11n model over the baseline YOLOv11n in real-world scenarios, a comparative analysis was conducted on yak behavior recognition in complex outdoor environments. As illustrated in Figure 11, the detection results of both models are presented across various natural field settings, highlighting their respective behavior recognition capabilities under different environmental conditions.
In both simple and complex scenarios, the proposed DPAP-YOLOv11n model accurately recognizes yak behaviors, with overall detection precision notably surpassing that of the baseline YOLOv11n. As shown in Figure 11a,b, DPAP-YOLOv11n maintains high detection accuracy across varying environmental complexities. In contrast, YOLOv11n exhibits clear limitations in real-world yak detection, particularly in scenes containing both near and distant targets and in the presence of occlusion or small targets. As illustrated in Figure 11c–e, YOLOv11n fails to correctly identify yak behaviors under such conditions, leading to behavior misclassification, whereas DPAP-YOLOv11n significantly reduces false detections and improves recognition accuracy in the same challenging scenarios. These results confirm the model’s enhanced robustness and reliability in complex, unstructured outdoor environments.
In summary, the proposed DPAP-YOLOv11n model recognizes diverse and complex yak behaviors reliably under real-world field conditions, particularly in scenarios involving small objects and occlusion, and markedly reduces false detections relative to the baseline YOLOv11n while adapting better to challenging environments. These results provide a reliable and effective solution for practical applications of yak behavior recognition in natural outdoor settings.

4. Discussion

The proposed DPAP-YOLOv11n model demonstrates superior performance in yak behavior recognition under complex natural environments. Owing to the integration of DynamicConv, PConv, an Aux Head detection head, and the Focal-PIoU loss function within the YOLOv11n framework, the model achieves more precise localization, enhanced feature representation, and improved robustness when addressing small-scale or occluded targets. As presented in the experimental results, DPAP-YOLOv11n attains 94.1% mAP50 and 83.9% mAP50-95, exceeding the baseline YOLOv11n by 2.4% and 2.8%, respectively. These findings substantiate the effectiveness of the proposed optimization strategies and highlight their potential for advancing real-time animal behavior recognition in field conditions.
The primary innovation of this study lies in the integration of the DynamicConv structure into the C3k2 module, enabling convolutional kernels to adaptively adjust their weights based on local feature content. This mechanism markedly enhances the model’s feature extraction capability across varying scales and complex field environments, thereby improving its perception of multi-scale yak targets. In particular, under distant or occluded conditions, the model accurately captures critical behavioral cues, significantly reducing both false positives and missed detections. Furthermore, the PConv module incorporated into the backbone establishes multi-directional receptive fields, enabling more effective extraction of fine-grained postural variations among yak behaviors such as feeding, standing, and walking. This module enhances the model’s sensitivity to local structural and edge features, ensuring stable recognition accuracy under diverse illumination and viewpoint conditions. In addition, the Aux Head detection head introduces multi-level supervision during early feature learning, guiding intermediate layers to better capture discriminative behavioral region representations. This design strengthens the detection of small or distant yak individuals and mitigates feature confusion in crowded or complex outdoor scenes, while also accelerating model convergence and improving overall training efficiency. Finally, the Focal-PIoU loss function integrates the difficulty-aware weighting of Focal Loss with the bounding-box optimization of PIoU, focusing on hard samples while minimizing the influence of easy ones. This strategy maintains high detection accuracy under conditions of severe occlusion or strong inter-target similarity. Experimental results further demonstrate that the proposed loss function offers clear advantages in distinguishing visually similar behaviors such as walking and standing.
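The attention-over-kernels mechanism of DynamicConv [28] can be sketched in a few lines: a lightweight attention branch produces a softmax weight πk per candidate kernel, and the K kernels are mixed into a single kernel before convolving. The toy example below uses flattened kernels as plain Python lists and hypothetical attention logits; it illustrates the aggregation rule only, not the implementation used in this paper:

```python
import math

def softmax(logits):
    """Numerically stable softmax: attention weights that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate_kernels(kernels, attn_logits):
    """DynamicConv-style mixing: build one kernel as the attention-weighted
    sum of K candidate kernels (all flattened to 1-D lists)."""
    pi = softmax(attn_logits)
    return [
        sum(w * kernel[i] for w, kernel in zip(pi, kernels))
        for i in range(len(kernels[0]))
    ]

# Two candidate kernels; equal attention logits give their plain average.
mixed = aggregate_kernels([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
print(mixed)  # [0.5, 0.5]
```

Because the logits are predicted from the input feature map, the effective kernel changes per input, which is what lets the convolution adapt to scale and scene content at negligible extra cost.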
From an overall performance perspective, the DPAP-YOLOv11n model retains a lightweight architecture (8.1 MB) and low computational complexity (6.2 GFLOPs), effectively balancing detection accuracy and real-time efficiency. With TensorRT acceleration, the model achieves an inference speed of 105 FPS, approximately 1.8 times faster than its PyTorch baseline, demonstrating strong potential for real-time deployment on edge computing devices and UAV-based platforms.
Overall, the DPAP-YOLOv11n model demonstrates substantial improvements in recognizing multi-scale yak behaviors, including feeding, standing, lying, and walking, under complex natural field conditions. Its dynamic feature modeling and multi-directional perception mechanisms effectively mitigate performance degradation commonly observed in conventional detectors when addressing small targets, behavioral ambiguity, and occlusion challenges. Future research will focus on expanding the spatial and temporal diversity of the dataset and integrating lightweight architectures and incremental learning strategies to further enhance model generalization across diverse terrains and climatic conditions. These efforts aim to provide a more robust and deployable solution for intelligent yak behavior monitoring and precision livestock management.

5. Conclusions

Field-collected yak behavior data are characterized by complex backgrounds, diverse postures, and a large number of small or partially occluded targets, which limit the effectiveness of existing recognition methods. To address these challenges, this study proposes the DPAP-YOLOv11n model for multi-scale behavior recognition in natural field environments. By introducing Dynamic Convolution into the C3k2 module, integrating PConv to strengthen fine-grained structural perception, and incorporating an Aux Head together with the Focal-PIoU loss function, the model significantly improves adaptive feature extraction, small-target detection, and localization accuracy. Experimental results show that DPAP-YOLOv11n increases mean average precision (mAP50) by 2.4 percentage points over YOLOv11n, with gains of 1.3, 3.5, 1.4, and 3.2 percentage points for feeding, standing, lying down, and walking behaviors, respectively. The model also achieves an mAP50-95 of 83.9%, a 2.8 percentage point improvement. These findings confirm the effectiveness of DPAP-YOLOv11n in accurately and reliably recognizing yak behaviors under complex field conditions.

Author Contributions

Conceptualization, J.T. and S.L.; methodology, J.T. and B.D.; software, B.D. and S.T.; validation, J.X. and S.L.; formal analysis, S.T. and S.L.; investigation, J.X.; resources, L.Z.; data curation, B.D. and S.T.; writing—original draft preparation, B.D.; writing—review and editing, L.Z. and J.X.; visualization, B.D.; supervision, J.T. and L.Z.; project administration, L.Z.; funding acquisition, J.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Hubei Province Key Research and Development Special Project of Science and Technology Innovation Plan (2023BAB087); the Wuhan Key Research and Development Projects (2023010402010614); the Open Competition Project for Selecting the Best Candidates of Wuhan East Lake High-tech Development Zone (Grant Number: 2024KJB328); the Fund for Research Platform of South-Central Minzu University (Grant Number: PTZ25003); and the Central Government Guides Local Funds for Science and Technology Development (ZYYD2024QY08).

Data Availability Statement

The data presented in this study are available on request from the corresponding author due to data ownership restrictions.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. He, J.; Yang, W.; Liu, T.; Zhuang, J. Research on Pig Behavior Recognition Based on the Combination of Wearable Sensors. J. Chin. Agric. Mech. 2025, 46, 42–49. [Google Scholar]
  2. Ji, J.; Zhong, J.; Niu, B.; Chen, Q. Application of Image Recognition Models in Wildlife Protection. Chin. J. Nat. 2025, 47, 207–214. [Google Scholar]
  3. Shen, T.; Wang, S.; Li, M.; Qin, L. Research Progress in the Application of Deep Learning in Animal Behavior Analysis. J. Front. Comput. Sci. Technol. 2024, 18, 612–626. [Google Scholar]
  4. Sun, G.; Liu, T.; Zhang, H.; Tan, B.; Li, Y. Basic behavior recognition of yaks based on improved SlowFast network. Ecol. Inform. 2023, 78, 102313. [Google Scholar] [CrossRef]
  5. Li, L.; Zhang, T.; Cuo, D.; Zhao, Q.; Zhou, L.; Jiancuo, S. Automatic identification of individual yaks in in-the-wild images using part-based convolutional networks with self-supervised learning. Expert Syst. Appl. 2023, 216, 119431. [Google Scholar] [CrossRef]
  6. Wang, J.; Zhang, Y.; Zhu, H.; Song, R. Yak Tracking Based on Improved YOLOv5 and ByteTrack. Comput. Syst. Appl. 2023, 32, 48–61. [Google Scholar]
  7. Wang, S.; Li, C.; Song, J.; Jiang, Y. Yak Detection Algorithm Based on Faster R-CNN in Tibetan Pastoral Areas. J. Inn. Mong. Agric. Univ. (Nat. Sci. Ed.) 2021, 42, 77–83. [Google Scholar]
  8. Cao, H.; Gong, S. Object Detection of Plateau Yak Images Based on YOLOv5. Inf. Comput. (Theor. Ed.) 2021, 33, 84–87. [Google Scholar]
  9. Bai, Q.; Gao, R.; Zhao, C.; Li, Q.; Wang, R.; Li, S. Multi-Scale Behavior Recognition Method for Dairy Cows Based on an Improved YOLOv5s Network. Trans. Chin. Soc. Agric. Eng. 2022, 38, 163–172. [Google Scholar]
  10. Fu, C.; Ren, L.; Wang, F. Method for Cattle Behavior Recognition and Tracking Based on Improved YOLOv8. Trans. Chin. Soc. Agric. Mach. 2024, 55, 290–301. [Google Scholar]
  11. Han, G.; Zhang, L.; Bai, Z.; Zhang, X.; Han, R.; Tang, C.; Tang, J. Daily Behaviour Detection of Multi-Target Dairy Cows Based on Improved YOLO11n in Complex Environments. Trans. Chin. Soc. Agric. Eng. 2025, 41, 74. [Google Scholar]
  12. Bai, Q.; Gao, R.; Li, Q.; Wang, R.; Zhang, H. Recognition of the behaviors of dairy cows by an improved YOLO. Intell. Robot. 2024, 4, 1–19. [Google Scholar] [CrossRef]
  13. Wang, Z.; Hua, Z.; Wen, Y.; Zhang, S.; Xu, X.; Song, H. E-YOLO: Recognition of estrus cow based on improved YOLOv8n model. Expert Syst. Appl. 2024, 238, 122212. [Google Scholar] [CrossRef]
  14. Wang, Z.; Xu, X.; Hua, Z.; Shang, Y.; Duan, Y.; Song, H. Lightweight recognition for the oestrus behavior of dairy cows combining YOLO v5n and channel pruning. Trans. Chin. Soc. Agric. Eng. 2022, 38, 130–140. [Google Scholar]
  15. Wang, R.; Gao, Z.; Li, Q.; Zhao, C.; Gao, R.; Zhang, H.; Li, S.; Feng, L. Detection Method of Cow Estrus Behavior in Natural Scenes Based on Improved YOLOv5. Agriculture 2022, 12, 1339. [Google Scholar] [CrossRef]
  16. Liu, Z.; He, D. Recognition Method of Cow Estrus Behavior Based on Convolutional Neural Network. Trans. Chin. Soc. Agric. Mach. 2019, 50, 186–193. [Google Scholar]
  17. Li, E.; Wang, K.; Si, Y.; Yuan, Y.; He, Z. Cow Behavior Recognition Method Based on Improved ConvNeXt. Trans. Chin. Soc. Agric. Mach. 2024, 55, 282–289+404. [Google Scholar]
  18. Gao, S.; Yang, J.; Xu, D. Cattle species and behavior identification based on improved YOLO v10. J. South China Agric. Univ. 2025, 46, 832–842. [Google Scholar]
  19. Chen, M.; Ren, R.; Zhang, Z.; Li, H. Construction of lightweight multi-objective cattle behavior recognition model EVH-YOLO11. Agric. Eng. 2025, 15, 41–48. [Google Scholar]
  20. Yan, H.; Liu, Z.; Cui, Q.; Hu, Z. Multi-target detection based on feature pyramid attention and deep convolution network for pigs. Trans. Chin. Soc. Agric. Eng. 2020, 36, 193–202. [Google Scholar]
  21. Teng, G.; Ji, H.; Zhuang, Y.; Liu, M. Research Progress of Deep Learning in the Process of Pig Feeding. Trans. Chin. Soc. Agric. Eng. 2022, 38, 235–249. [Google Scholar]
  22. Chen, C.; Zhu, W.; Norton, T. Behaviour recognition of pigs and cattle: Journey from computer vision to deep learning. Comput. Electron. Agric. 2021, 187, 106255. [Google Scholar] [CrossRef]
  23. Zhuang, Y.; Yu, J.; Teng, G.; Cao, M. Recognition Method of Large White Sow Oestrus Behavior Based on a Convolutional Neural Network. Trans. Chin. Soc. Agric. Mach. 2020, 51, 364–370. [Google Scholar]
  24. Wang, W.; Wang, F.; Zhang, W.; Liu, H.; Wang, C.; Wang, C.; He, Z.X. Sheep Behavior Recognition Method Based on Improved YOLOv8s. Trans. Chin. Soc. Agric. Mach. 2024, 55, 325–335+344. [Google Scholar]
  25. Tong, X.Z.; Wei, J.Y.; Su, S.J.; Sun, B.; Zuo, Z. Typical Small Target Detection on Water Surfaces Fusing Attention and Multi-Scale Features. Chin. J. Sci. Instrum. 2023, 44, 212–222. [Google Scholar] [CrossRef]
  26. Jia, T.; Peng, L.; Dai, F. Object Detector with Residual Learning and Multi-Scale Feature Enhancement. Comput. Sci. Explor. 2023, 17, 1102–1111. [Google Scholar]
  27. Khanam, R.; Hussain, M. YOLOv11: An overview of the key architectural enhancements. arXiv 2024, arXiv:2410.17725. [Google Scholar] [CrossRef]
  28. Chen, Y.; Dai, X.; Liu, M.; Chen, D.; Yuan, L.; Liu, Z. Dynamic Convolution: Attention over Convolution Kernels. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 11027–11036. [Google Scholar] [CrossRef]
  29. Dai, J.; Li, Y.; He, K. R-FCN: Object Detection via Region-based Fully Convolutional Networks. Adv. Neural Inf. Process. Syst. 2016, 29, 379–387. [Google Scholar]
  30. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 7464–7475. [Google Scholar]
  31. Zhang, Y.F.; Ren, W.; Zhang, Z.; Jia, Z.; Wang, L.; Tan, T. Focal and efficient IOU loss for accurate bounding box regression. Neurocomputing 2022, 506, 146–157. [Google Scholar] [CrossRef]
  32. Wang, T. Enhanced Feline Facial Recognition: Advancing Cat Face Detection with YOLOv8 and TensorRT. In Proceedings of the Fourth International Conference on Computer Vision and Pattern Analysis (ICCPA 2024), Anshan, China, 17–19 May 2024; SPIE: Bellingham, WA, USA, 2024; Volume 13256, pp. 193–202. [Google Scholar]
Figure 1. Yak data collection site.
Figure 2. Examples of yak behavioral data collected in the field. Four categories of daily behaviors are illustrated: (a) feeding; (b) standing; (c) lying; and (d) walking.
Figure 3. Example of the LabelImg-annotated dataset. Bounding boxes in different colors represent different yak behavior categories (cyan for standing, light blue for feeding, magenta for walking, and purple for lying).
Figure 4. Dataset construction flowchart.
Figure 5. DPAP-YOLOv11n network structure.
Figure 6. DynamicConv convolution. The symbol “+” indicates feature aggregation, and “*” denotes weighted multiplication between the attention weight πk and the corresponding convolution kernel. x and y represent input and output features, respectively.
Figure 7. Comparison between the normal convolution and the PConv (pinwheel-shaped convolution) structure.
Figure 8. Structure of the auxiliary head (Aux Head).
Figure 9. Schematic representation of Focal-PIoU parameters.
Figure 10. Acceleration effect of different inference frames.
Figure 11. Yak Behavior Recognition Performance. (a) Recognition performance in simple scenes; (b) Recognition performance in complex scenes; (c) False detection in close-range recognition; (d) False detection in long-range recognition; (e) False detection under occlusion. In this figure, bluish-cyan indicates “stand”, greenish-cyan indicates “walk”, white indicates “lie”, and dark blue indicates “eat”.
Table 1. Description of yak behavior.

Behavior Name | Behavior Description | Label | Sample Count
eat | Neck lowered and head touching the ground | eat | 1158
lie | Legs in contact with the ground | lie | 680
stand | Legs in contact with the ground | stand | 1194
walk | Leg flexion movement (observed during walking) | walk | 1348
Table 2. Configuration environment.

Hardware | Configuration
Operating system | Ubuntu 22.04
CPU | 10 vCPU
GPU | NVIDIA RTX 3090 (1 card, 24 GB video memory)
Memory | 60 GB RAM
Programming language | Python 3.10
Learning framework | PyTorch 2.2.0
Table 3. Training environment.

Hyperparameter | Value
cache | False
imgsz | 640
optimizer | SGD
batch | 32
epochs | 300
amp | True
Table 4. Empirical outcomes of various models.

Model | Precision | Recall | mAP50 | Model Size (MB)
YOLOv11n | 0.901 | 0.869 | 0.917 | 5.5
YOLOv11s | 0.929 | 0.900 | 0.942 | 19.2
YOLOv11m | 0.941 | 0.908 | 0.953 | 40.5
YOLOv11l | 0.939 | 0.902 | 0.942 | 51.2
YOLOv11x | 0.945 | 0.896 | 0.945 | 114.4
DPAP-YOLOv11n | 0.933 | 0.891 | 0.941 | 8.1
Table 5. Comparative analysis of experimental outcomes for various loss functions.

Model | mAP50 (IoU) | mAP50-95 (IoU) | mAP50 (Focal + IoU) | mAP50-95 (Focal + IoU)
DPAP-YOLOv11n + CIoU | 93.6 | 83.3 | 93.7 | 82.8
DPAP-YOLOv11n + EIoU | 93.0 | 82.3 | 93.1 | 82.8
DPAP-YOLOv11n + GIoU | 92.7 | 82.3 | 93.2 | 82.9
DPAP-YOLOv11n + DIoU | 93.4 | 83.0 | 93.9 | 82.7
DPAP-YOLOv11n + SIoU | 92.1 | 81.7 | 93.3 | 82.8
DPAP-YOLOv11n + PIoU | 94.1 | 83.4 | 94.1 | 83.9
Table 6. Ablation experiments on the yak dataset.

DynamicConv | PConv | Aux | Focal_PIoU | mAP@0.5/% | Eat AP/% | Stand AP/% | Lie AP/% | Walk AP/% | mAP@0.5-0.95/%
– | – | – | – | 91.7 | 94.3 | 88.9 | 92.7 | 91.1 | 81.1
✓ | – | – | – | 93.0 | 94.8 | 91.3 | 93.1 | 92.9 | 83.0
– | ✓ | – | – | 93.4 | 94.9 | 91.3 | 93.8 | 93.4 | 82.5
– | – | ✓ | – | 93.0 | 94.9 | 90.5 | 93.4 | 93.0 | 82.0
– | – | – | ✓ | 92.4 | 95.1 | 90.4 | 91.9 | 92.0 | 81.5
✓ | ✓ | – | – | 93.2 | 95.1 | 92.2 | 93.1 | 92.5 | 83.2
✓ | ✓ | ✓ | – | 93.6 | 95.6 | 92.8 | 93.1 | 93.0 | 83.3
✓ | ✓ | ✓ | ✓ | 94.1 | 95.5 | 92.4 | 94.1 | 94.3 | 83.9
Table 7. Test outcomes for various models.

Model | Precision | Recall | mAP@0.5 | mAP@0.5-0.95 | FLOPs (G) | Model Size (MB)
YOLOv3-tiny | 0.924 | 0.868 | 0.925 | 0.820 | 18.9 | 24.4
YOLOv5n | 0.906 | 0.881 | 0.929 | 0.818 | 7.1 | 5.3
YOLOv6 | 0.917 | 0.891 | 0.937 | 0.835 | 11.8 | 5.8
YOLOv10n | 0.913 | 0.876 | 0.933 | 0.825 | 6.5 | 5.8
YOLOv11n | 0.898 | 0.875 | 0.917 | 0.811 | 6.3 | 5.5
DPAP-YOLOv11n | 0.933 | 0.891 | 0.941 | 0.839 | 6.2 | 8.1
Table 8. TensorRT inference speed test results.

Framework | Preprocessing Time (ms) | Inference Time (ms) | Postprocessing Time (ms) | Average Latency (ms) | Detection Speed (FPS)
PyTorch | 1.5 | 14.7 | 1.3 | 17.5 | 57
TensorRT (FP32) | 2.6 | 6.5 | 1.6 | 10.7 | 93
TensorRT (FP16) | 2.4 | 5.6 | 1.5 | 9.5 | 105
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Tie, J.; Dunzhu, B.; Zheng, L.; Xie, J.; Tian, S.; Li, S. Wild Yak Behavior Recognition Method Based on an Improved Yolov11. Information 2026, 17, 214. https://doi.org/10.3390/info17020214


