2.2.1. Model Selection
To address challenges such as complex farm environments, difficulties in capturing key bovine features, significant scale variations during surveillance, high behavioral similarity, and limited computational resources of monitoring cameras, this study adopted the YOLOv11 network as its foundational architecture. YOLOv11, a state-of-the-art object detection algorithm released by Ultralytics in 2024, consists of four core components: Input, Backbone, Neck, and Head. The network provides five model variants (n, s, m, l, and x), which vary in scale and complexity. Given the computational constraints associated with embedded deployment on farm monitoring devices, the YOLOv11n variant was selected for bovine behavior recognition, as it achieves a favorable balance between detection accuracy and inference speed.
2.2.2. Improvement of the Network Model
The overall architecture of the RFR-YOLO (RsiConv FMSDA RepGRFPN-YOLO) bovine behavior detection model proposed in this study, which is based on an improved YOLOv11n framework, is presented in Figure 4.
The model introduces three key enhancements: first, the internal convolution structure of the C3K2 module is replaced with an Inverted Dilated Convolution (RsiConv) during the feature extraction stage. This modification incorporates an inverted residual connection mechanism to enhance multi-scale feature capture while simultaneously reducing the model’s computational complexity; second, the Four-branch Multi-scale Dilated Attention (FMSDA) module is integrated into the Neck network to strengthen the model’s capability for representing multi-scale features; third, a Reparameterized Generalized Residual Feature Pyramid Network (RepGRFPN) is designed to replace the original feature fusion network within the Neck, thereby improving the efficiency of multi-scale feature fusion and enhancing recognition accuracy for highly similar bovine behaviors.
(1) RsiConv feature extraction module
Field investigations and surveillance video analysis show that dairy farms are complex environments characterized by significant scale variations in the monitoring footage and strong behavioral continuity of the cows, which makes capturing key behavioral features and contextual information particularly important. Recognizing bovine behavior from surveillance imagery also generates massive volumes of video data and a heavy computational load on the servers, resulting in high computing costs; a lightweight network design is therefore essential. To address these constraints, this study proposes the Inverted Dilated Convolution module (RsiConv). This module decomposes multi-scale feature extraction into spatial residual learning and semantic residual learning, utilizing inverted residual connections during feature interaction. The inverted residual structure partitions the feature maps into query (Q), key (K), and value (V) components to generate attention matrices: Q and K interact to produce an attention matrix, which is subsequently multiplied by V to yield the attention-weighted feature maps. The attention matrix generation process is mathematically formulated as follows:
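As a minimal sketch, assuming the channel expansion is realized by a 1 × 1 convolution and the attention follows the standard scaled dot-product form (with d the key dimension):

$$X_e = \mathrm{Conv}_{1\times 1}(X), \qquad (Q, K, V) = \mathrm{Split}(X_e), \qquad \mathrm{Attn}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V$$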
where X denotes the input feature map and X_e represents the expanded feature map for subsequent residual processing.
In spatial residual learning, the feature map undergoes 3 × 3 convolution coupled with batch normalization and ReLU activation to generate compact feature maps expressing diverse regional characteristics. Subsequently, semantic residual learning applies depthwise separable convolutions (DwConv) with targeted receptive fields to each regional feature map, enabling morphology-based semantic filtering while avoiding redundant connections. For example, distant or obscured cow behaviors exhibit smaller feature scales; thus, leveraging multi-scale feature maps to accommodate varying receptive field requirements significantly enhances the model’s multi-scale information capture capability. The spatial and semantic residual processes are formulated as follows:
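A minimal sketch of these two stages, assuming the expanded map is split into K regional branches with dilation rates r_k:

$$\{F_k\}_{k=1}^{K} = \mathrm{Split}\big(\mathrm{Conv}_{3\times 3}(X_e)\big), \qquad S_k = \mathrm{DwConv}_{r_k}(F_k), \quad k = 1, \ldots, K$$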
where X_e denotes the expanded feature map, Conv_{3×3}(·) represents the standard convolutional layer with BN and ReLU, F_k signifies the regional feature map of the k-th branch (k corresponds to branches with different dilation rates), DwConv_{r_k}(·) indicates the depthwise separable dilated convolution (dilation rate r_k), and S_k is the semantic residual feature map of the k-th branch. Compared to the original structure, RsiConv doubles the channel count of the minimal-dilation-rate branch to enhance local behavioral details while employing depthwise separable convolutions to reduce the computational load. Finally, multi-branch semantic residuals are integrated with the input through summation, mitigating gradient vanishing and improving training efficiency. The feature fusion is expressed as follows:
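A minimal sketch of this fusion, using the notation defined below:

$$Y = \mathrm{Concat}(S_1, \ldots, S_K), \qquad Z = \mathrm{Conv}_{1\times 1}(Y), \qquad \mathrm{Output} = Z + X$$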
where Y denotes the concatenation of all the semantic residual feature maps, Z represents the fusion result via a 1 × 1 convolution, and the final output is generated by summing the fused result with the input X. The architectural diagram of the Inverted Dilated Convolution module is illustrated in Figure 5.
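For illustration, a minimal PyTorch-style sketch of such a block is given below. All module and parameter names are hypothetical; equal channel splits are used across branches for simplicity, whereas RsiConv itself doubles the channels of the smallest-dilation branch.

```python
import torch
import torch.nn as nn

class RsiConvSketch(nn.Module):
    """Sketch of an inverted dilated convolution block: 1x1 expansion,
    3x3 spatial residual, per-branch dilated depthwise convolutions
    (semantic residual), 1x1 fusion, and an inverted residual skip."""

    def __init__(self, channels, dilations=(1, 2, 3, 4)):
        super().__init__()
        self.expand = nn.Sequential(
            nn.Conv2d(channels, channels, 1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        self.spatial = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        assert channels % len(dilations) == 0
        branch_ch = channels // len(dilations)
        # One depthwise dilated convolution per branch (semantic residual learning).
        self.branches = nn.ModuleList([
            nn.Conv2d(branch_ch, branch_ch, 3, padding=d, dilation=d,
                      groups=branch_ch, bias=False)
            for d in dilations])
        self.fuse = nn.Conv2d(channels, channels, 1, bias=False)

    def forward(self, x):
        xe = self.spatial(self.expand(x))                    # spatial residual learning
        parts = torch.chunk(xe, len(self.branches), dim=1)   # regional feature maps F_k
        sem = [b(p) for b, p in zip(self.branches, parts)]   # semantic residuals S_k
        y = torch.cat(sem, dim=1)                            # Y = Concat(S_1, ..., S_K)
        return self.fuse(y) + x                              # Z + X (inverted residual)
```

A tensor of shape (1, 64, 80, 80) passed through RsiConvSketch(64) keeps its shape, so a block of this kind can replace an internal convolution of the C3K2 module without changing the surrounding channel layout.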
(2) FMSDA mechanism
To enhance the model’s capability at capturing multi-scale information, this study proposes a Four-branch Multi-scale Dilated Attention mechanism (FMSDA). By exploiting the sparsity of self-attention across different scales, the FMSDA generates the corresponding query, key, and value matrices through linear projections of the feature map. This process is mathematically formulated as follows:
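Assuming standard learnable linear projections, this can be sketched as:

$$Q = X W_Q, \qquad K = X W_K, \qquad V = X W_V$$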
where X is the input feature map; Q, K, and V represent the query, key, and value matrices, respectively; W_Q, W_K, and W_V denote the corresponding projection matrices; and d indicates the head dimension.
The FMSDA partitions the feature map into four distinct heads, where the features are processed through head-specific pathways via 1 × 1 convolutions before undergoing Sliding Window Dilated Attention (SWDA) with varying dilation rates in each head. This approach reduces channel redundancy and minimizes interference from irrelevant features, thereby improving the detection efficiency and enabling each head to focus more effectively on the multi-scale features. As a result, the model achieves a more comprehensive capture of bovine behavioral information in images. The window partitioning and local attention computation in SWDA are mathematically formulated as follows:
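A minimal sketch of this computation, assuming the usual sliding-window dilated attention form with coordinate set Ω_(i,j):

$$\Omega_{(i,j)} = \big\{(i + r\,\delta_h,\; j + r\,\delta_w) \;:\; \delta_h, \delta_w \in \{-\lfloor k/2 \rfloor, \ldots, \lfloor k/2 \rfloor\}\big\}$$

$$x_{ij} = \mathrm{Softmax}\!\left(\frac{Q_{ij} K_{\Omega_{(i,j)}}^{\top}}{\sqrt{d}}\right) V_{\Omega_{(i,j)}}$$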
where r denotes the dilation rate of the current head, k represents the window size (e.g., 3 × 3), Ω_(i,j) is the sparse sampling coordinate set centered at position (i, j) with stride r, and K_(i′,j′) and V_(i′,j′) are the key/value vectors indexed by (i′, j′) ∈ Ω_(i,j).
The FMSDA effectively aggregates the multi-scale information within the attended regions while reducing the redundancy in self-attention without complex operations or additional computational costs. The multi-head independent computation and feature fusion process of FMSDA are formulated as follows:
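A minimal sketch of the per-head computation and fusion, using the notation defined below:

$$h_i = \mathrm{SWDA}(Q_i, K_i, V_i, r_i), \quad i = 1, \ldots, 4, \qquad X_{\mathrm{out}} = \mathrm{Linear}\big(\mathrm{Concat}(h_1, h_2, h_3, h_4)\big)$$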
where Q_i, K_i, and V_i denote channel-partitioned sub-slices of the input feature map fed to the i-th head, h_i is the output of that head, and X_out represents the output feature.
The architecture of FMSDA is illustrated in Figure 6, which shows four distinct dilation rates corresponding to different receptive field sizes. Within each head, self-attention is computed at the corresponding dilation rate and receptive field, facilitating multi-scale feature capture across varying spatial resolutions. The resulting features are then concatenated and passed through a linear layer for effective feature aggregation.
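A minimal PyTorch-style sketch of this four-branch scheme is shown below. Names and hyperparameters are illustrative; a single shared 1 × 1 convolution produces Q, K, and V here, whereas the paper describes head-specific 1 × 1 pathways.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def swda(q, k, v, dilation, kernel=3):
    """Sliding Window Dilated Attention for one head (sketch).
    q, k, v: (B, C, H, W); each query attends to the k x k neighbourhood
    sampled around its own position with the given dilation rate."""
    b, c, h, w = q.shape
    pad = dilation * (kernel - 1) // 2
    # Gather the k*k dilated neighbours of every position: (B, C*k*k, H*W).
    k_win = F.unfold(k, kernel, dilation=dilation, padding=pad).view(b, c, kernel * kernel, h * w)
    v_win = F.unfold(v, kernel, dilation=dilation, padding=pad).view(b, c, kernel * kernel, h * w)
    q_flat = q.view(b, c, 1, h * w)
    attn = (q_flat * k_win).sum(dim=1, keepdim=True) / c ** 0.5   # (B, 1, k*k, H*W)
    attn = attn.softmax(dim=2)                                    # over the window positions
    out = (attn * v_win).sum(dim=2)                               # (B, C, H*W)
    return out.view(b, c, h, w)

class FMSDASketch(nn.Module):
    """Four-branch multi-scale dilated attention (illustrative sketch)."""

    def __init__(self, channels, dilations=(1, 2, 3, 4)):
        super().__init__()
        assert channels % len(dilations) == 0
        self.dilations = dilations
        self.qkv = nn.Conv2d(channels, channels * 3, 1)   # Q/K/V projection
        self.proj = nn.Conv2d(channels, channels, 1)      # final linear aggregation

    def forward(self, x):
        q, k, v = self.qkv(x).chunk(3, dim=1)
        qs, ks, vs = (t.chunk(len(self.dilations), dim=1) for t in (q, k, v))
        heads = [swda(qi, ki, vi, d)
                 for qi, ki, vi, d in zip(qs, ks, vs, self.dilations)]
        return self.proj(torch.cat(heads, dim=1))          # Concat + linear layer
```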
(3) RepGRFPN feature fusion network
To enhance the multi-scale feature fusion capability, improve the recognition of highly similar behaviors, and ensure inference efficiency, this study proposes a Reparameterized Generalized Residual Feature Pyramid Network (RepGRFPN). While preserving a lightweight design, the RepGRFPN dynamically assigns distinct channel dimensions to the features at different scales, thereby enabling flexible control over hierarchical feature representation. Compared to previous feature fusion approaches, it eliminates redundant upsampling operations, thereby accelerating the inference speed with a minimal impact on accuracy. The module replaces conventional convolutional feature fusion with Cross-Stage Partial Network (CSPNet) connections and integrates reparameterization techniques with Efficient Layer Aggregation Network (ELAN) linkages, thereby improving the accuracy without increasing the computational costs. The channel allocation per layer and cross-stage feature fusion in RepGRFPN are mathematically formulated as follows:
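A minimal sketch, assuming per-level channel scaling and adjacent-level fusion (P_k^in denotes the input feature of level k, an assumed symbol; the exact fusion topology of RepGRFPN may differ):

$$C_k = \alpha(k)\, C_{\mathrm{base}}, \qquad P_k = g\big(\mathrm{Concat}\big(\mathrm{Upsample}(P_{k+1}),\; P_k^{\mathrm{in}},\; \mathrm{Downsample}(P_{k-1})\big)\big)$$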
where C_base denotes the base channel count; and α(·) represents the scaling function, ensuring deeper layers retain more semantic channels (C_k > C_(k−1)), thereby preserving richer semantic information of deep features while reducing the redundant shallow channels. This mechanism enhances the model’s capacity to learn hierarchical behavioral patterns, improves classification of distinct bovine behaviors, and strengthens its recognition of lying, standing, and lameness activities. P_k signifies the output feature of the k-th layer, g(·) is the fusion function, Upsample employs bilinear interpolation, and Downsample uses stride-2 convolution.
To address gradient vanishing and explosion issues arising from the stacked fusion modules, residual connections are integrated during both the training and inference phases. This design preserves richer semantic information and finer feature details while promoting efficient information propagation to the subsequent layers. Within the RepGRFPN fusion scheme, the number of fusion nodes is fixed to prevent efficiency degradation caused by the elongated serial chains in stacked fusion structures. The fusion module employs repeated structural units composed of multiple 1 × 1 and 3 × 3 convolutions, each followed by batch normalization (BN) and an activation function, with residual connections applied for improved feature stability. The architecture differs between the training and inference modes: during training, both 3 × 3 and 1 × 1 convolutions are activated, whereas only the 1 × 1 convolution is retained during inference to enhance the computational efficiency. The feature fusion mechanism of RepGRFPN is illustrated in Figure 7, where the lower-left section presents the overall fusion workflow, the upper section displays the fusion module structure, and the lower-right section details the internal architecture of the 3 × 3 Rep block.
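To illustrate how such a reparameterizable unit behaves differently at training and inference time, a generic PyTorch-style sketch is given below. It performs a standard RepVGG-style merge of the 3 × 3 and 1 × 1 branches into a single convolution; the specific branch retained by RepGRFPN at inference may differ from this simplification, and all names are hypothetical.

```python
import torch
import torch.nn as nn

class RepBlockSketch(nn.Module):
    """Generic reparameterizable block: parallel 3x3 and 1x1 conv+BN branches
    during training; fuse() folds both into one convolution for inference."""

    def __init__(self, channels):
        super().__init__()
        self.conv3 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn3 = nn.BatchNorm2d(channels)
        self.conv1 = nn.Conv2d(channels, channels, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.act = nn.ReLU(inplace=True)
        self.fused = None          # single convolution used after fuse()

    @staticmethod
    def _fold_bn(conv, bn):
        # Fold BN statistics into the preceding convolution's weight and bias.
        std = (bn.running_var + bn.eps).sqrt()
        w = conv.weight * (bn.weight / std).reshape(-1, 1, 1, 1)
        b = bn.bias - bn.running_mean * bn.weight / std
        return w, b

    @torch.no_grad()
    def fuse(self):
        w3, b3 = self._fold_bn(self.conv3, self.bn3)
        w1, b1 = self._fold_bn(self.conv1, self.bn1)
        w1 = nn.functional.pad(w1, [1, 1, 1, 1])      # pad the 1x1 kernel to 3x3
        self.fused = nn.Conv2d(w3.shape[1], w3.shape[0], 3, padding=1)
        self.fused.weight.data = w3 + w1
        self.fused.bias.data = b3 + b1

    def forward(self, x):
        if self.fused is not None:                                  # inference path
            return self.act(self.fused(x) + x)
        y = self.bn3(self.conv3(x)) + self.bn1(self.conv1(x))       # training path
        return self.act(y + x)                                      # residual connection
```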