1. Introduction
With the growth of the global population and the transformation of dietary structures, the demand for beef has been continuously increasing, which has accelerated the rapid development of Precision Livestock Farming (PLF) technologies [1]. As a core component of PLF systems, livestock behaviour monitoring plays an irreplaceable role in assessing animal health [2,3]. However, owing to the diversity and subtlety of cattle behaviours, as well as the complexity of farm environments, achieving high-precision automated behaviour recognition remains a challenging task [4].
In recent years, researchers have sought to enhance cattle behaviour recognition using deep learning techniques. For example, Shakeel et al. [5] introduced a behaviour recognition and computation framework that employs a deep recurrent learning paradigm to iteratively recognise behavioural patterns and thereby improve prediction accuracy. The ST-GCN network, which exploits multimodal data such as skeletal and posture information, has improved behaviour discrimination to a certain extent [6]. Gao et al. [7] combined mandibular skeletal feature extraction, skeletal heatmap descriptors, and Kalman filtering for regurgitation behaviour recognition, achieving an accuracy of 98.25%; however, extracting skeletal keypoints in complex environments remains challenging, which limits model performance. Tong et al. [8] improved detection accuracy in complex scenes by integrating dynamic convolution and the C2f-iRMB structure, achieving an average precision of 91.7% in cattle detection. Brassel et al. [9] developed an accelerometer-based oestrus detection system that analyses behavioural data through multiple metrics, attaining a detection sensitivity of 93.3%. Nevertheless, contact-based devices such as ear tags and pressure sensors can compromise animal welfare during behavioural monitoring, so developing contact-free, video sequence-based behaviour recognition methods has become particularly important. On the one hand, video sequences provide rich spatio-temporal information, which facilitates the capture of complex behavioural patterns; on the other hand, video-based recognition combined with deep learning not only promises improved accuracy but also enables large-scale, real-time cattle behaviour monitoring while safeguarding animal welfare.
Chen et al. [10] systematically reviewed recent computer vision approaches for cattle behaviour recognition, analysing the evolution from traditional computer vision to deep learning methods for image segmentation, detection, and behaviour recognition. Fuentes et al. [11] proposed a multi-view action recognition method for monitoring individual cattle behaviours, in which a hierarchical recognition and tracking mechanism addresses occlusion and individual identification. Bello et al. [12] employed the Mask R-CNN framework to recognise behaviours such as feeding and drinking, achieving an average recognition accuracy of over 90% and demonstrating the effectiveness of deep learning for real-time behaviour recognition of group-ranched cattle.
Methods based on CNNs or 3D-CNNs can extract key spatial and motion features; however, they remain insufficiently sensitive to local key regions and environmental interactions [13,14]. Attention-based approaches have shown improved capability in feature selection [15,16]. Li et al. [17] integrated dynamic serpentine convolution with the BiFormer attention mechanism, achieving an accuracy of 96.5% and significantly enhancing feature extraction. Chen et al. [18] introduced a machine learning approach combining local slope and frequency-domain features, reaching an accuracy of 0.966 in recognising rumination and feeding behaviours. Shang et al. [19] proposed a lightweight network that fuses features through an improved combination of two distinct attention mechanisms, achieving notable gains in both accuracy and generalisation; nevertheless, it still fails to sufficiently capture the dynamic interactions between local key regions of the cattle body and the surrounding environment.
Although some of the aforementioned studies have attempted to incorporate attention mechanisms or skeletal information, they still lack the explicit ability to model the dynamic and fine-grained interactions between local cattle body parts (e.g., head, muzzle) and environmental objects, which is critical for recognising behaviours such as drinking and grooming. For example, in cattle running behaviour recognition, focusing solely on morphological changes in the body while ignoring dynamic changes in environmental factors during movement may result in short-term running behaviours being misclassified as walking. Similarly, in identifying drinking behaviour within a video clip, cattle may simultaneously perform actions unrelated to drinking. This could lead to the extracted features of the clip containing not only valid interaction cues between the cattle and the water trough but also irrelevant information, thereby undermining the accuracy of drinking behaviour recognition.
In contrast to previous studies, our method explicitly models the ‘cattle–environment’ relationship through a novel reasoning algorithm. The core idea is to treat cattle behaviour as a direct manifestation of the spatio-temporal evolution of interactions between key body regions and environmental elements. To this end, we design a three-stage network that jointly models cattle–environment interactions. Specifically, the spatio-temporal perception network extracts spatial features of key regions and their variations; the spatio-temporal relation integration network incorporates metric learning with relation reasoning to automatically uncover associations between cattle features and environmental factors; and the spatio-temporal enhancement network further optimises spatio-temporal relation representations, enabling accurate recognition of complex behaviours. This approach significantly improves the model’s discriminative ability and generalisability across diverse scenarios.
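To make the three-stage design concrete, the sketch below shows one plausible skeleton in PyTorch. All layer choices, dimensions, and names (e.g., `CattleBehaviourNet`, `feat_dim`) are illustrative assumptions for exposition, not the authors' exact architecture:

```python
import torch
import torch.nn as nn

class CattleBehaviourNet(nn.Module):
    # Hypothetical skeleton of the three-stage design; layer choices are
    # illustrative assumptions, not the authors' implementation.
    def __init__(self, feat_dim=256, num_classes=7):
        super().__init__()
        # Stage 1: spatio-temporal perception -- per-frame spatial features.
        self.perception = nn.Sequential(
            nn.Conv2d(3, feat_dim, kernel_size=7, stride=4, padding=3),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # Stage 2: relation integration -- reasoning over frame/context pairs.
        self.relation = nn.Sequential(
            nn.Linear(2 * feat_dim, feat_dim),
            nn.ReLU(),
        )
        # Stage 3: spatio-temporal enhancement -- temporal refinement + classifier.
        self.enhance = nn.LSTM(feat_dim, feat_dim, batch_first=True)
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, clip):                                  # clip: (B, T, 3, H, W)
        b, t = clip.shape[:2]
        f = self.perception(clip.flatten(0, 1)).flatten(1)    # (B*T, D)
        f = f.view(b, t, -1)
        # Relate each frame to the clip context (a stand-in for the
        # cattle-environment relation reasoning described in the text).
        ctx = f.mean(dim=1, keepdim=True).expand_as(f)
        r = self.relation(torch.cat([f, ctx], dim=-1))        # (B, T, D)
        h, _ = self.enhance(r)
        return self.head(h[:, -1])                            # (B, num_classes)
```

A call such as `CattleBehaviourNet()(torch.randn(2, 8, 3, 112, 112))` returns one logit vector per clip, mirroring the clip-level behaviour labels used in this study.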
The main contributions of this study are as follows:
We propose a novel spatio-temporal feature extraction algorithm that explicitly models the relationships between key regions of cattle behaviours and other regions in the video, effectively reducing redundant information and mitigating occlusion-related interference.
We develop a spatio-temporal awareness network to accurately capture features of key regions and their motion dynamics. In addition, a spatio-temporal relation fusion network is designed to integrate metric learning with relational reasoning, enabling adaptive exploration and quantification of interactions between cattle body parts and environmental factors. Furthermore, the proposed spatio-temporal enhancement network significantly improves the discriminability of features that are spatially similar but temporally distinct.
We validate the effectiveness of each component of the proposed algorithm through comprehensive comparative and ablation experiments. Moreover, we demonstrate the robustness and generalisation capability of our method by fine-tuning the trained model on a separate dataset, confirming its applicability to cross-scenario cattle behaviour recognition tasks.
3. Results
To validate the effectiveness of the proposed key-region relation reasoning-based spatio-temporal cattle behaviour recognition network, this study conducted comparative experiments with several classical action recognition models, including TSN [23], TSM [24], I3D [25], ACTION-Net [26], and SlowFast [27]. TSN divides the video into fixed-length segments, extracts a single-frame feature from each segment, and fuses all segment features for the final classification. TSM captures temporal information by shifting input features along the time dimension. I3D (Inflated 3D ConvNet) employs a weight-inflation strategy, extending pre-trained 2D convolutional weights to 3D convolutions for more effective handling of spatio-temporal information. ACTION-Net equips a 2D backbone with additional excitation modules to emphasise motion-sensitive features, while SlowFast pairs a low-frame-rate pathway that captures semantics with a high-frame-rate pathway that captures fine motion.
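For intuition, the temporal shift at the heart of TSM can be written in a few lines. The following is a minimal sketch of the published shift rule (the tensor layout and `fold_div` default follow the TSM paper; the function name is ours):

```python
import torch

def temporal_shift(x, n_segment, fold_div=8):
    """Shift a fraction of channels along time; x is (N*T, C, H, W),
    laid out as N clips of T consecutive frames (T = n_segment)."""
    nt, c, h, w = x.size()
    x = x.view(nt // n_segment, n_segment, c, h, w)
    fold = c // fold_div
    out = torch.zeros_like(x)
    out[:, :-1, :fold] = x[:, 1:, :fold]                   # shift one channel slice backward in time
    out[:, 1:, fold:2 * fold] = x[:, :-1, fold:2 * fold]   # shift another slice forward in time
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]              # leave the remaining channels in place
    return out.view(nt, c, h, w)
```

Because the shift itself has no parameters, TSM adds temporal modelling to a 2D backbone at essentially zero extra cost.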
3.1. Comparison Experiment
As shown in Table 1, the proposed model achieves an overall mAP of 87.19%, representing an improvement of 6.78 percentage points over I3D, 11.12 points over TSM, 6.96 points over TSN, and 4.92 points over SlowFast. Notably, our method also surpasses the state-of-the-art ACTION-Net, which is specifically designed for complex action recognition, by 3.57 percentage points in mAP, thereby demonstrating the effectiveness of the proposed relation reasoning paradigm in agricultural scenarios. Across different behaviour categories, our model consistently outperforms the compared approaches in most cases. The relatively lower performance observed for the Grooming and Running categories may be attributed to the limited number of samples available for these behaviours.
To evaluate the generalisability and robustness of the proposed model, we further fine-tuned it on the CBVD-5 dataset [28], which contains five common cattle behaviours. The experimental results are presented in Table 2.
As shown in the table, the proposed model achieves higher recognition accuracy than the original methods for most behaviour categories. For static or high-frequency behaviours such as Standing, Lying down, and Foraging, the recognition accuracy improved from 98.61%, 96.71%, and 96.95% to 99.85%, 99.73%, and 98.70%, respectively, indicating that the model has stronger expressive power in capturing posture-related spatio-temporal features and can more accurately distinguish behaviours with small motion amplitudes but similar semantics. Notably, for low-frequency behaviours highly influenced by environmental factors, such as Drinking water, the recognition accuracy increased markedly from 34.82% to 57.18%, demonstrating that the proposed model is more adaptive in modelling interactions between animals and their environment.
However, for the Rumination category, the recognition accuracy decreased from 41.00% to 35.50%. We speculate that this drop arises because rumination relies on subtle, continuous head movements, for which the current model’s micro-action temporal modelling capability is still limited. This issue is analysed further in Section 4.
Overall, the proposed model outperforms existing methods in recognising the primary static and high-frequency behaviours, while also pointing to room for improvement in modelling fine-grained actions. In summary, this study introduces a key-region relation reasoning-based spatio-temporal cattle behaviour recognition network that models the spatio-temporal relationships of cattle behaviours in videos. Compared with conventional action recognition models, our approach demonstrates superior behavioural representation ability and exhibits strong transferability and robustness across datasets. Notably, while maintaining balanced performance across behaviours, the model achieves its largest relative gains over the baselines on the Grooming and Running behaviours. The experimental results indicate that explicitly modelling relational interactions among extracted features yields higher accuracy than traditional spatio-temporal feature extraction alone. A Grad-CAM visualisation of drinking behaviour is shown in Figure 7, where the model focuses attention on the intersection between the cattle and the water trough.
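For readers who wish to reproduce this kind of visualisation, the sketch below shows the standard Grad-CAM recipe using PyTorch hooks; `model`, `target_layer`, `clip`, and `class_idx` are placeholders, and this is not necessarily the exact pipeline used to produce Figure 7:

```python
import torch
import torch.nn.functional as F

def grad_cam(model, target_layer, clip, class_idx):
    """Standard Grad-CAM: gradient-weighted channel average of a conv layer."""
    feats, grads = [], []
    h1 = target_layer.register_forward_hook(lambda m, i, o: feats.append(o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))
    model(clip)[:, class_idx].sum().backward()
    h1.remove(); h2.remove()
    weights = grads[0].mean(dim=(-2, -1), keepdim=True)          # per-channel importance
    cam = F.relu((weights * feats[0]).sum(dim=1))                # (N, H', W') heatmap
    return cam / (cam.amax(dim=(-2, -1), keepdim=True) + 1e-6)   # normalise for overlay
```

Upsampling the heatmap to the frame resolution and overlaying it on the input frame yields attention maps like those in Figure 7.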
3.2. Ablation Experiment
To quantitatively evaluate the contribution of each module to the overall model performance, this section conducts ablation studies in which modules are removed individually or in combination: the Frame-Level Spatial Attention Module (FL-SAM) in the spatio-temporal perception network; the Key Motion Feature Extraction Module (KMFEM) in the inter-frame differential residual network; the Metric-based Feature (MF) in the low-frame-rate branch of the spatio-temporal relation integration network; the Spatial Relation Module (SRM); the Temporal Relation Module (TRM) in the high-frame-rate branch; and the Channel-Level Spatial Attention Module (CL-SAM) and LSTM in the spatio-temporal enhancement network. Each module was progressively removed from the complete model, and the resulting performance changes were recorded.
Table 3 presents the results of these experiments. All ablation studies were conducted under controlled settings: except for the module being tested, all other training hyperparameters (including learning rate, optimizer, batch size, number of epochs, and random seed) were kept constant to ensure fair comparison.
The experimental results indicate the following:
Removing FL-SAM led to an overall mAP drop of 0.46%, with more pronounced accuracy decreases in localised behaviours such as Grooming and Drinking, which rely heavily on cattle’s local body regions. This demonstrates that FL-SAM suppresses background redundancy and enhances the representation of key body parts. Unlike conventional global attention, FL-SAM applies grouped convolutions before attention computation, which is particularly effective in livestock video scenarios with occlusions and non-target regions. This design reduces computation while increasing the independence of local features, thereby improving the discriminability of short-sequence key features.
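As a rough illustration of this design, the following sketch computes a spatial attention map from grouped-convolution features; the layer sizes and grouping factor are assumptions, not the exact FL-SAM configuration:

```python
import torch
import torch.nn as nn

class FrameLevelSpatialAttention(nn.Module):
    """Illustrative FL-SAM-style block: a grouped convolution produces
    group-independent local descriptors, from which a single-channel
    spatial attention map is derived and applied to the input."""
    def __init__(self, channels, groups=8):   # channels must be divisible by groups
        super().__init__()
        self.local = nn.Conv2d(channels, channels, 3, padding=1, groups=groups)
        self.to_map = nn.Conv2d(channels, 1, 1)

    def forward(self, x):                                    # x: (N, C, H, W) per-frame features
        attn = torch.sigmoid(self.to_map(self.local(x)))     # (N, 1, H, W)
        return x * attn                                      # emphasise key regions, damp background
```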
Removing KMFEM caused the most significant drop in Running accuracy (6.5%), confirming its role in extracting rapid motion details from inter-frame residuals. Traditional differential networks tend to introduce background noise, whereas KMFEM isolates changes in critical regions during sudden cattle movements, a pattern of livestock motion that methods designed for human action recognition handle poorly.
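A minimal reading of this idea is sketched below: inter-frame residuals are gated so that only strong, localised changes survive. The gating choice is our assumption for exposition:

```python
import torch
import torch.nn as nn

class KeyMotionFeatureExtraction(nn.Module):
    """Illustrative KMFEM-style block: frame-to-frame residuals, gated to
    suppress background noise and keep motion in critical regions."""
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Conv2d(channels, channels, 1)

    def forward(self, x):                                 # x: (B, T, C, H, W) frame features
        diff = x[:, 1:] - x[:, :-1]                       # inter-frame residuals
        b, t = diff.shape[:2]                             # t = T - 1
        flat = diff.flatten(0, 1)
        motion = flat * torch.sigmoid(self.gate(flat))    # keep strong, localised changes
        return motion.view(b, t, *x.shape[2:])            # (B, T-1, C, H, W) motion features
```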
Removing MF decreased mAP by 0.56%, mainly affecting behaviours with similar postures, such as Drinking and Standing. MF introduces contrastive constraints in the low-frame-rate branch, enabling the model to distinguish fine-grained action categories in feature space. Compared with purely classification-based supervision, contrastive metric learning strengthens the discriminative boundaries between different action features, which is crucial for capturing subtle differences in cattle behaviours.
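The contrastive constraint can be summarised by a loss of the following form; this is a generic metric-learning sketch under the assumption of Euclidean distances and a fixed margin, not necessarily the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def metric_feature_loss(feats, labels, margin=0.5):
    """Pull same-behaviour clip embeddings together, push different ones
    beyond a margin. Assumes the batch contains at least two clips of some
    class and at least two distinct classes."""
    feats = F.normalize(feats, dim=1)                     # (N, D) clip embeddings
    dist = torch.cdist(feats, feats)                      # pairwise Euclidean distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    eye = torch.eye(len(labels), dtype=torch.bool, device=feats.device)
    pos = dist[same & ~eye]                               # same-class pairs
    neg = dist[~same]                                     # cross-class pairs
    return pos.pow(2).mean() + F.relu(margin - neg).pow(2).mean()
```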
The absence of SRM led to the most substantial performance drop, reducing mAP to 84.3%, with Grooming and Drinking declining by 7.9% and 2.6%, respectively. This highlights the importance of SRM in capturing cattle–environment interaction patterns. Unlike general human action recognition, livestock behaviours depend heavily on subject–environment relationships (e.g., head–trough, torso–ground); by explicitly modelling these spatial relations, SRM significantly enhances recognition robustness in complex environments.
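In the spirit of relation networks, such pairwise spatial reasoning can be sketched as follows; the region partitioning and MLP sizes are illustrative assumptions rather than the authors' exact SRM:

```python
import torch
import torch.nn as nn

class SpatialRelationModule(nn.Module):
    """Illustrative SRM-style block: every ordered pair of region features
    (e.g., head, trough, torso, ground) is scored by a shared MLP and the
    pairwise relations are summed into a single relation vector."""
    def __init__(self, dim):
        super().__init__()
        self.g = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                               nn.Linear(dim, dim))

    def forward(self, regions):                        # regions: (B, K, D)
        b, k, d = regions.shape
        a = regions.unsqueeze(2).expand(b, k, k, d)    # region i
        c = regions.unsqueeze(1).expand(b, k, k, d)    # region j
        pairs = torch.cat([a, c], dim=-1)              # all ordered pairs (i, j)
        return self.g(pairs).sum(dim=(1, 2))           # (B, D) aggregated relations
```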
Removing TRM decreased mAP by 1.6%, with Running accuracy dropping sharply by 6.2%. We speculate that running exhibits strong temporal dynamics requiring long-range dependency modelling to distinguish it from fast walking. The complementary relationship between SRM and TRM is also evident: SRM models spatial interactions, while TRM focuses on temporal dependencies, together enabling comprehensive spatio-temporal relation modelling.
Removing the channel-level attention enhancement module resulted in a 0.16% decrease in mAP, with slight drops across all seven behaviours. This indicates that CL-SAM suppresses noisy channels (e.g., non-target regions) and strengthens salient features related to the cattle and interacting objects, ensuring the discriminability of the representations.
Removing LSTM led to a 0.22% decrease in mAP, particularly affecting Walking and Running, where accuracy dropped by 0.8% and 0.6%, respectively. This demonstrates that spatial variations in short sequences alone cannot fully capture temporal evolution patterns. LSTM in the long-sequence branch captures cross-frame dynamic dependencies, enabling the network to distinguish behaviours with similar motion magnitude but different temporal patterns.
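Taken together, the two enhancement components can be approximated by the sketch below: an SE-style channel gate standing in for CL-SAM, followed by an LSTM over the long-sequence branch. Both are simplified assumptions about the actual design:

```python
import torch
import torch.nn as nn

class SpatioTemporalEnhancement(nn.Module):
    """Illustrative enhancement stage: a channel gate reweights feature
    channels (CL-SAM stand-in), then an LSTM models cross-frame dynamics."""
    def __init__(self, dim, reduction=4):
        super().__init__()
        self.channel_gate = nn.Sequential(
            nn.Linear(dim, dim // reduction), nn.ReLU(),
            nn.Linear(dim // reduction, dim), nn.Sigmoid())
        self.lstm = nn.LSTM(dim, dim, batch_first=True)

    def forward(self, f):                          # f: (B, T, D) clip features
        gate = self.channel_gate(f.mean(dim=1))    # (B, D) gate from temporal mean
        h, _ = self.lstm(f * gate.unsqueeze(1))    # suppress noisy channels, then model time
        return h[:, -1]                            # (B, D) enhanced clip representation
```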
Overall, removing SRM and TRM caused the most significant performance degradation, confirming the necessity of explicit relational modelling for livestock behaviour recognition. Meanwhile, KMFEM, MF, and FL-SAM provide fine-grained feature optimisation across different dimensions, ensuring robustness across slow and fast behaviours, local and global features, and subject–environment interactions. Finally, the dual-branch spatio-temporal enhancement (CL-SAM + LSTM) further improves cross-time-scale modelling capability. Compared with conventional human action recognition networks, the proposed method not only achieves significant performance gains (+6.78% mAP over I3D) but also effectively addresses challenges unique to livestock scenarios, including severe occlusion, subtle motion differences, and strong subject–environment dependencies, demonstrating the targeted advantages of our approach.
4. Discussion
This study addresses the challenges in livestock scenarios, including complex individual–environment interactions, subtle behavioural differences, and diverse temporal dependencies, by proposing a spatio-temporal cattle behaviour recognition network based on key region relational reasoning. Unlike previous methods that rely solely on global features or single-subject modelling, the proposed approach explicitly integrates interactions between local cattle body features and environmental features, and jointly models them through a spatio-temporal relational reasoning module, thereby capturing the semantic dependencies inherent in the behaviour generation process more comprehensively. Experimental results demonstrate that this method effectively enhances the model’s understanding of complex interactive behaviours, particularly outperforming traditional approaches in behaviour categories that rely on environmental cues. From the perspective of precision livestock management, enhancing identification accuracy is crucial. Traditional manual observation methods are time-consuming, labour-intensive, and highly subjective, making it difficult to achieve 24/7 monitoring. This method achieves over 91% identification accuracy, providing technical assurance for automated and highly reliable cattle behaviour monitoring systems. This enables ranch managers to make decisions based on objective data rather than relying on experience, marking a key step in elevating the intelligent management level of modern livestock farming.
In terms of experimental results, the model achieved state-of-the-art overall recognition performance on two cattle behaviour datasets, with a mean average precision (mAP) of 87.19% on the CBVD-7 dataset, representing improvements of 4.92 and 3.57 percentage points over SlowFast and ACTION-Net, respectively. This improvement can be primarily attributed to the advantage of the Spatial Relation Module (SRM) in modelling local interactions. For instance, in behaviours that are strongly associated with the environment, such as drinking and grooming, the model explicitly attends to the spatio-temporal consistency among the water trough, the cattle’s head, and body movements, leading to a significant increase in recognition accuracy. These results indicate that environmental context constitutes a critical basis for behaviour recognition in livestock videos, and that relational reasoning mechanisms can substantially enhance the model’s sensitivity to complex interactive patterns. This precise identification of critical behaviours translates directly into actionable pasture management applications. An abnormal reduction in water intake is an early, sensitive indicator of heat stress or digestive disorders (such as ketosis); by automatically recording each animal’s drinking frequency and duration, the model provides data support for early health alerts, enabling farmers to intervene before clinical symptoms emerge. Behavioural frequency and duration are likewise core metrics for assessing welfare status: by quantifying behaviours such as grooming, the model helps managers evaluate housing environments and social interactions within herds, promptly identifying welfare declines caused by disease or discomfort.
The Temporal Relation Module (TRM) also plays a crucial role in distinguishing behaviours that look alike but differ in temporal dynamics. For example, “Walking” and “Running” appear nearly identical in static frames but differ markedly over the temporal scale. By incorporating KMFEM and TRM, the model captures both short-term and long-term temporal dependencies, reducing confusion between similar behaviours. Distinguishing behaviours that appear similar but differ significantly in energy expenditure has practical significance for precise management. For instance, the frequency and intensity of “running” behaviour serve as a core criterion for automated oestrus detection: accurately differentiating “running” from general “walking” allows more precise quantification of cows’ activity levels, enabling timely identification of oestrus cycles to improve conception rates and reproductive efficiency. In addition, changes in overall movement patterns provide crucial references for evaluating cattle health and body condition.
Compared with previous studies, this work demonstrates significant differences in data utilisation and feature modelling strategies. Networks based on skeletal keypoints, such as ST-GCN [6], can explicitly model the topological relationships among body parts; however, in livestock videos, the reliability of keypoint detection is often compromised by occlusion, body size variation, and coat colour interference. In contrast, the proposed model requires no skeletal annotations and relies solely on RGB videos to establish an equivalent spatial relational structure, making it more suitable for behaviour recognition in real-world farm environments. Compared with dual-branch attention models, this study achieves synchronous inference of spatial and temporal features within a unified framework, thereby avoiding semantic inconsistencies between branch features. Unlike the SlowFast network, which relies on dual-rate sampling, the proposed method learns discriminative cattle body spatial features through spatial constraint metrics, resulting in greater stability and maintaining high recognition accuracy in dynamic scenes.
Notably, compared with the SocialCattle system [1], the proposed study achieves recognition of multiple behaviour categories, including drinking, standing, running, and interaction, solely through video analysis. This approach eliminates the costs associated with sensor deployment and validates the feasibility of visual relational modelling for behaviour monitoring. Previous research on ST-GCN-based cattle behaviour recognition [6] has highlighted that occlusion in multi-cattle scenarios can significantly reduce recognition accuracy. In contrast, the spatial attention mechanism in the present study effectively mitigates such effects by focusing on semantically salient regions. These comparative results indicate that the proposed method not only surpasses existing networks in accuracy but also offers advantages in robustness for practical deployment.
However, the model’s performance in recognising rumination behaviour on the CBVD-5 dataset declined. This may be because rumination relies on small-amplitude, highly rhythmic head movements, and the video frame rate and spatial resolution limit the model’s ability to capture such micro-movements. In contrast, for behaviours involving interactions with large environmental objects, such as drinking, the relational reasoning mechanism performs more effectively. While recognition of micro-movements such as rumination still has room for improvement, automated monitoring of these behaviours remains a critical need in precision livestock management: rumination duration is among the most important indicators for diagnosing rumen health (e.g., acidosis) and evaluating feed digestibility. By increasing video capture frame rates and refining the model architecture, this challenge should be addressable, replacing time-consuming and labour-intensive manual observation with continuous, around-the-clock monitoring of core herd health metrics.
Furthermore, taking grooming behaviour as an example (see Figure 8), the model accurately focuses on the abdominal region when the cattle face the camera; however, when the cattle are oriented away from the camera, the attention shifts to the limbs. This observation indicates that occlusion and viewpoint variation continue to affect the model’s ability to temporally model key body regions. To address this limitation, future work could leverage multi-view learning and region-level attention mechanisms to enhance the model’s spatio-temporal representation in small-scale regions.
Although the proposed method has a slightly higher parameter count compared with some lightweight models, it achieves substantial performance improvements. Future research will further explore model compression techniques, such as pruning, quantisation, and knowledge distillation, to reduce deployment costs. In addition, integrating transfer learning and meta-learning approaches could enhance the generalisation performance for small-sample categories, such as rumination. Moreover, we plan to extend the framework to individual-level relational modelling. On one hand, region-level attention mechanisms could be introduced on ROI-extracted subject features to model self-interactions by partitioning body parts such as the head, limbs, and abdomen, thereby supporting group behaviour recognition and social interaction analysis to provide more comprehensive data for precision livestock management. On the other hand, adversarial data augmentation based on failure samples and multi-view statistics could be employed to mitigate biases caused by occlusion and viewpoint variations. Furthermore, we intend to conduct head ROI enlargement experiments and manually verify annotation consistency to further validate the model’s robustness.
From a practical application perspective, the proposed model can be deployed in smart farm monitoring systems to automatically recognise daily cattle behaviours and detect anomalies. For instance, by identifying changes in water intake or excessive activity, potential stress or health issues can be detected at an early stage. The model’s performance across multiple datasets, as well as its robustness under occlusion, facilitates the development of efficient real-time behaviour monitoring and health warning systems, thereby reducing the burden of manual observation and improving animal welfare in group-rearing environments.