2.2.1. Model Selection
To address challenges such as complex farm environments, difficulties in capturing key bovine features, significant scale variations during surveillance, high behavioral similarity, and limited computational resources of monitoring cameras, this study adopted the YOLOv11 network as its foundational architecture. YOLOv11, a state-of-the-art object detection algorithm released by Ultralytics in 2024, consists of four core components: Input, Backbone, Neck, and Head. The network provides five model variants (n, s, m, l, and x), which vary in scale and complexity. Given the computational constraints associated with embedded deployment on farm monitoring devices, the YOLOv11n variant was selected for bovine behavior recognition, as it achieves a favorable balance between detection accuracy and inference speed.
2.2.2. Improvement of the Network Model
The overall architecture of the RFR-YOLO (RsiConv FMSDA RepGRFPN-YOLO) bovine behavior detection model proposed in this study, which is based on an improved YOLOv11n framework, is presented in Figure 4.
The model introduces three key enhancements: first, the internal convolution structure of the C3K2 module is replaced with an Inverted Dilated Convolution (RsiConv) during the feature extraction stage. This modification incorporates an inverted residual connection mechanism to enhance multi-scale feature capture while simultaneously reducing the model’s computational complexity; second, the Four-branch Multi-scale Dilated Attention (FMSDA) module is integrated into the Neck network to strengthen the model’s capability for representing multi-scale features; third, a Reparameterized Generalized Residual Feature Pyramid Network (RepGRFPN) is designed to replace the original feature fusion network within the Neck, thereby improving the efficiency of multi-scale feature fusion and enhancing recognition accuracy for highly similar bovine behaviors.
(1) RsiConv feature extraction module
Field investigations and surveillance video analysis show that dairy farms are complex environments characterized by significant scale variations in the monitoring footage and strong behavioral continuity of the cows, which makes capturing key behavioral features and contextual information particularly important. Recognizing bovine behavior from surveillance imagery also generates massive volumes of video data and a heavy computational load on the servers, resulting in high computing costs; a lightweight network design is therefore essential. To address these constraints, this study proposes the Inverted Dilated Convolution module (RsiConv). This module decomposes multi-scale feature extraction into spatial residual learning and semantic residual learning, utilizing inverted residual connections during feature interaction. The inverted residual structure partitions the feature maps into query (Q), key (K), and value (V) components to generate attention matrices: Q and K interact to produce an attention matrix, which is subsequently multiplied by V to yield the attention-weighted feature maps. The attention matrix generation process is mathematically formulated as follows:
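As a minimal sketch, assuming the channel expansion is realized by a 1 × 1 convolution and the attention follows the standard scaled dot-product form (with d the key dimension):

$$X_e = \mathrm{Conv}_{1\times 1}(X), \qquad (Q, K, V) = \mathrm{Split}(X_e), \qquad \mathrm{Attn}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V$$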
where X denotes the input feature map and X_e represents the expanded feature map for subsequent residual processing.
In spatial residual learning, the feature map undergoes 3 × 3 convolution coupled with batch normalization and ReLU activation to generate compact feature maps expressing diverse regional characteristics. Subsequently, semantic residual learning applies depthwise separable convolutions (DwConv) with targeted receptive fields to each regional feature map, enabling morphology-based semantic filtering while avoiding redundant connections. For example, distant or obscured cow behaviors exhibit smaller feature scales; thus, leveraging multi-scale feature maps to accommodate varying receptive field requirements significantly enhances the model’s multi-scale information capture capability. The spatial and semantic residual processes are formulated as follows:
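A minimal sketch of these two stages, assuming the expanded map is split into K regional branches with dilation rates r_k:

$$\{F_k\}_{k=1}^{K} = \mathrm{Split}\big(\mathrm{Conv}_{3\times 3}(X_e)\big), \qquad S_k = \mathrm{DwConv}_{r_k}(F_k), \quad k = 1, \ldots, K$$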
where X_e denotes the expanded feature map, Conv_{3×3}(·) represents the standard convolutional layer with BN and ReLU, F_k signifies the regional feature map of the k-th branch (k corresponds to branches with different dilation rates), DwConv_{r_k}(·) indicates the depthwise separable dilated convolution (dilation rate r_k), and S_k is the semantic residual feature map of the k-th branch. Compared to the original structure, RsiConv doubles the channel count of the minimal-dilation-rate branch to enhance local behavioral details while employing depthwise separable convolutions to reduce the computational load. Finally, multi-branch semantic residuals are integrated with the input through summation, mitigating gradient vanishing and improving training efficiency. The feature fusion is expressed as follows:
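A minimal sketch of this fusion, using the notation defined below:

$$Y = \mathrm{Concat}(S_1, \ldots, S_K), \qquad Z = \mathrm{Conv}_{1\times 1}(Y), \qquad \mathrm{Output} = Z + X$$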
where Y denotes the concatenation of all the semantic residual feature maps, Z represents the fusion result via a 1 × 1 convolution, and the final output is generated by summing the fused result with the input X. The architectural diagram of the Inverted Dilated Convolution module is illustrated in Figure 5.
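For illustration, a minimal PyTorch-style sketch of such a block is given below. All module and parameter names are hypothetical; equal channel splits are used across branches for simplicity, whereas RsiConv itself doubles the channels of the smallest-dilation branch.

```python
import torch
import torch.nn as nn

class RsiConvSketch(nn.Module):
    """Sketch of an inverted dilated convolution block: 1x1 expansion,
    3x3 spatial residual, per-branch dilated depthwise convolutions
    (semantic residual), 1x1 fusion, and an inverted residual skip."""

    def __init__(self, channels, dilations=(1, 2, 3, 4)):
        super().__init__()
        self.expand = nn.Sequential(
            nn.Conv2d(channels, channels, 1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        self.spatial = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        assert channels % len(dilations) == 0
        branch_ch = channels // len(dilations)
        # One depthwise dilated convolution per branch (semantic residual learning).
        self.branches = nn.ModuleList([
            nn.Conv2d(branch_ch, branch_ch, 3, padding=d, dilation=d,
                      groups=branch_ch, bias=False)
            for d in dilations])
        self.fuse = nn.Conv2d(channels, channels, 1, bias=False)

    def forward(self, x):
        xe = self.spatial(self.expand(x))                    # spatial residual learning
        parts = torch.chunk(xe, len(self.branches), dim=1)   # regional feature maps F_k
        sem = [b(p) for b, p in zip(self.branches, parts)]   # semantic residuals S_k
        y = torch.cat(sem, dim=1)                            # Y = Concat(S_1, ..., S_K)
        return self.fuse(y) + x                              # Z + X (inverted residual)
```

A tensor of shape (1, 64, 80, 80) passed through RsiConvSketch(64) keeps its shape, so a block of this kind can replace an internal convolution of the C3K2 module without changing the surrounding channel layout.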
(2) FMSDA mechanism
To enhance the model’s capability at capturing multi-scale information, this study proposes a Four-branch Multi-scale Dilated Attention mechanism (FMSDA). By exploiting the sparsity of self-attention across different scales, the FMSDA generates the corresponding query, key, and value matrices through linear projections of the feature map. This process is mathematically formulated as follows:
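Assuming standard learnable linear projections, this can be sketched as:

$$Q = X W_Q, \qquad K = X W_K, \qquad V = X W_V$$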
where X is the input feature map; Q, K, and V represent the query, key, and value matrices, respectively; W_Q, W_K, and W_V denote the corresponding projection matrices; and d indicates the head dimension.
The FMSDA partitions the feature map into four distinct heads, where the features are processed through head-specific pathways via 1 × 1 convolutions before undergoing Sliding Window Dilated Attention (SWDA) with varying dilation rates in each head. This approach reduces channel redundancy and minimizes interference from irrelevant features, thereby improving the detection efficiency and enabling each head to focus more effectively on the multi-scale features. As a result, the model achieves a more comprehensive capture of bovine behavioral information in images. The window partitioning and local attention computation in SWDA are mathematically formulated as follows:
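A minimal sketch of this computation, assuming the usual sliding-window dilated attention form with coordinate set Ω_(i,j):

$$\Omega_{(i,j)} = \big\{(i + r\,\delta_h,\; j + r\,\delta_w) \;:\; \delta_h, \delta_w \in \{-\lfloor k/2 \rfloor, \ldots, \lfloor k/2 \rfloor\}\big\}$$

$$x_{ij} = \mathrm{Softmax}\!\left(\frac{Q_{ij} K_{\Omega_{(i,j)}}^{\top}}{\sqrt{d}}\right) V_{\Omega_{(i,j)}}$$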
where r denotes the dilation rate of the current head, k represents the window size (e.g., 3 × 3), Ω_(i,j) is the sparse sampling coordinate set centered at position (i, j) with stride r, and K_(i′,j′) and V_(i′,j′) are the key/value vectors indexed by (i′, j′) ∈ Ω_(i,j).
The FMSDA effectively aggregates the multi-scale information within the attended regions while reducing the redundancy in self-attention without complex operations or additional computational costs. The multi-head independent computation and feature fusion process of FMSDA are formulated as follows:
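A minimal sketch of the per-head computation and fusion, using the notation defined below:

$$h_i = \mathrm{SWDA}(Q_i, K_i, V_i, r_i), \quad i = 1, \ldots, 4, \qquad X_{\mathrm{out}} = \mathrm{Linear}\big(\mathrm{Concat}(h_1, h_2, h_3, h_4)\big)$$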
where Q_i, K_i, and V_i denote channel-partitioned sub-slices of the input feature map fed to the i-th head, h_i is the output of that head, and X_out represents the output feature.
The architecture of FMSDA is illustrated in Figure 6, which shows four distinct dilation rates corresponding to different receptive field sizes. Within each head, self-attention is computed at the corresponding dilation rate and receptive field, facilitating multi-scale feature capture across varying spatial resolutions. The resulting features are then concatenated and passed through a linear layer for effective feature aggregation.
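A minimal PyTorch-style sketch of this four-branch scheme is shown below. Names and hyperparameters are illustrative; a single shared 1 × 1 convolution produces Q, K, and V here, whereas the paper describes head-specific 1 × 1 pathways.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def swda(q, k, v, dilation, kernel=3):
    """Sliding Window Dilated Attention for one head (sketch).
    q, k, v: (B, C, H, W); each query attends to the k x k neighbourhood
    sampled around its own position with the given dilation rate."""
    b, c, h, w = q.shape
    pad = dilation * (kernel - 1) // 2
    # Gather the k*k dilated neighbours of every position: (B, C*k*k, H*W).
    k_win = F.unfold(k, kernel, dilation=dilation, padding=pad).view(b, c, kernel * kernel, h * w)
    v_win = F.unfold(v, kernel, dilation=dilation, padding=pad).view(b, c, kernel * kernel, h * w)
    q_flat = q.view(b, c, 1, h * w)
    attn = (q_flat * k_win).sum(dim=1, keepdim=True) / c ** 0.5   # (B, 1, k*k, H*W)
    attn = attn.softmax(dim=2)                                    # over the window positions
    out = (attn * v_win).sum(dim=2)                               # (B, C, H*W)
    return out.view(b, c, h, w)

class FMSDASketch(nn.Module):
    """Four-branch multi-scale dilated attention (illustrative sketch)."""

    def __init__(self, channels, dilations=(1, 2, 3, 4)):
        super().__init__()
        assert channels % len(dilations) == 0
        self.dilations = dilations
        self.qkv = nn.Conv2d(channels, channels * 3, 1)   # Q/K/V projection
        self.proj = nn.Conv2d(channels, channels, 1)      # final linear aggregation

    def forward(self, x):
        q, k, v = self.qkv(x).chunk(3, dim=1)
        qs, ks, vs = (t.chunk(len(self.dilations), dim=1) for t in (q, k, v))
        heads = [swda(qi, ki, vi, d)
                 for qi, ki, vi, d in zip(qs, ks, vs, self.dilations)]
        return self.proj(torch.cat(heads, dim=1))          # Concat + linear layer
```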
(3) RepGRFPN feature fusion network
To enhance the multi-scale feature fusion capability, improve the recognition of highly similar behaviors, and ensure inference efficiency, this study proposes a Reparameterized Generalized Residual Feature Pyramid Network (RepGRFPN). While preserving a lightweight design, the RepGRFPN dynamically assigns distinct channel dimensions to the features at different scales, thereby enabling flexible control over hierarchical feature representation. Compared to previous feature fusion approaches, it eliminates redundant upsampling operations, thereby accelerating the inference speed with a minimal impact on accuracy. The module replaces conventional convolutional feature fusion with Cross-Stage Partial Network (CSPNet) connections and integrates reparameterization techniques with Efficient Layer Aggregation Network (ELAN) linkages, thereby improving the accuracy without increasing the computational costs. The channel allocation per layer and cross-stage feature fusion in RepGRFPN are mathematically formulated as follows:
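A minimal sketch, assuming per-level channel scaling and adjacent-level fusion (P_k^in denotes the input feature of level k, an assumed symbol; the exact fusion topology of RepGRFPN may differ):

$$C_k = \alpha(k)\, C_{\mathrm{base}}, \qquad P_k = g\big(\mathrm{Concat}\big(\mathrm{Upsample}(P_{k+1}),\; P_k^{\mathrm{in}},\; \mathrm{Downsample}(P_{k-1})\big)\big)$$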
where C_base denotes the base channel count; and α(·) represents the scaling function, ensuring deeper layers retain more semantic channels (C_k > C_(k−1)), thereby preserving richer semantic information of deep features while reducing the redundant shallow channels. This mechanism enhances the model’s capacity to learn hierarchical behavioral patterns, improves classification of distinct bovine behaviors, and strengthens its recognition of lying, standing, and lameness activities. P_k signifies the output feature of the k-th layer, g(·) is the fusion function, Upsample employs bilinear interpolation, and Downsample uses stride-2 convolution.
To address gradient vanishing and explosion issues arising from the stacked fusion modules, residual connections are integrated during both the training and inference phases. This design preserves richer semantic information and finer feature details while promoting efficient information propagation to the subsequent layers. Within the RepGRFPN fusion scheme, the number of fusion nodes is fixed to prevent efficiency degradation caused by the elongated serial chains in stacked fusion structures. The fusion module employs repeated structural units composed of multiple 1 × 1 and 3 × 3 convolutions, each followed by batch normalization (BN) and an activation function, with residual connections applied for improved feature stability. The architecture differs between the training and inference modes: during training, both 3 × 3 and 1 × 1 convolutions are activated, whereas only the 1 × 1 convolution is retained during inference to enhance the computational efficiency. The feature fusion mechanism of RepGRFPN is illustrated in Figure 7, where the lower-left section presents the overall fusion workflow, the upper section displays the fusion module structure, and the lower-right section details the internal architecture of the 3 × 3 Rep block.
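To illustrate how such a reparameterizable unit behaves differently at training and inference time, a generic PyTorch-style sketch is given below. It performs a standard RepVGG-style merge of the 3 × 3 and 1 × 1 branches into a single convolution; the specific branch retained by RepGRFPN at inference may differ from this simplification, and all names are hypothetical.

```python
import torch
import torch.nn as nn

class RepBlockSketch(nn.Module):
    """Generic reparameterizable block: parallel 3x3 and 1x1 conv+BN branches
    during training; fuse() folds both into one convolution for inference."""

    def __init__(self, channels):
        super().__init__()
        self.conv3 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn3 = nn.BatchNorm2d(channels)
        self.conv1 = nn.Conv2d(channels, channels, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.act = nn.ReLU(inplace=True)
        self.fused = None          # single convolution used after fuse()

    @staticmethod
    def _fold_bn(conv, bn):
        # Fold BN statistics into the preceding convolution's weight and bias.
        std = (bn.running_var + bn.eps).sqrt()
        w = conv.weight * (bn.weight / std).reshape(-1, 1, 1, 1)
        b = bn.bias - bn.running_mean * bn.weight / std
        return w, b

    @torch.no_grad()
    def fuse(self):
        w3, b3 = self._fold_bn(self.conv3, self.bn3)
        w1, b1 = self._fold_bn(self.conv1, self.bn1)
        w1 = nn.functional.pad(w1, [1, 1, 1, 1])      # pad the 1x1 kernel to 3x3
        self.fused = nn.Conv2d(w3.shape[1], w3.shape[0], 3, padding=1)
        self.fused.weight.data = w3 + w1
        self.fused.bias.data = b3 + b1

    def forward(self, x):
        if self.fused is not None:                                  # inference path
            return self.act(self.fused(x) + x)
        y = self.bn3(self.conv3(x)) + self.bn1(self.conv1(x))       # training path
        return self.act(y + x)                                      # residual connection
```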