1. Introduction
With the growth of the global population and the transformation of dietary structures, the demand for beef has been continuously increasing, which has accelerated the rapid development of Precision Livestock Farming (PLF) technologies [1]. As a core component of PLF systems, livestock behaviour monitoring plays an irreplaceable role in assessing animal health [2,3]. However, owing to the diversity and subtlety of cattle behaviours, as well as the complexity of farm environments, achieving high-precision automated behaviour recognition remains a challenging task [4].
In recent years, researchers have sought to enhance cattle behaviour recognition using deep learning techniques. For example, Shakeel et al. [5] introduced a behaviour recognition and computation framework that employs a deep recurrent learning paradigm to iteratively recognise behavioural patterns and thereby improve prediction accuracy. The ST-GCN network, which exploits multimodal data such as skeletal and posture information, has improved behaviour discrimination to a certain extent [6]. Gao et al. [7] combined mandibular skeletal feature extraction, skeletal heatmap descriptors, and Kalman filtering for regurgitation behaviour recognition, achieving an accuracy of 98.25%; however, extracting skeletal keypoints in complex environments remains challenging, which limits model performance. Tong et al. [8] improved detection accuracy in complex scenes by integrating dynamic convolution and the C2f-iRMB structure, achieving an average precision of 91.7% in cattle detection. Brassel et al. [9] developed an accelerometer-based oestrus detection system that analyses behavioural data through multiple metrics, attaining a detection sensitivity of 93.3%. Nevertheless, contact-based devices such as ear tags and pressure sensors can compromise animal welfare during behavioural monitoring, so developing contact-free, video sequence-based behaviour recognition methods has become particularly important. On the one hand, video sequences provide rich spatio-temporal information, which facilitates the capture of complex behavioural patterns; on the other hand, video-based recognition combined with deep learning not only promises improved accuracy but also enables large-scale, real-time cattle behaviour monitoring while safeguarding animal welfare.
Chen et al. [10] systematically reviewed recent computer vision approaches for cattle behaviour recognition, analysing the evolution from traditional computer vision to deep learning methods for image segmentation, detection, and behaviour recognition. Fuentes et al. [11] proposed a multi-view action recognition method for monitoring individual cattle behaviours, in which a hierarchical recognition and tracking mechanism addresses occlusion and individual identification. Bello et al. [12] employed the Mask R-CNN framework to recognise behaviours such as feeding and drinking, achieving an average recognition accuracy of over 90% and demonstrating the effectiveness of deep learning for real-time behaviour recognition of group-ranched cattle.
Methods based on CNNs or 3D-CNNs can extract key spatial and motion features; however, they remain insufficiently sensitive to local key regions and environmental interactions [13,14]. Attention-based approaches have shown improved capability in feature selection [15,16]. Li et al. [17] integrated dynamic serpentine convolution with the BiFormer attention mechanism, achieving an accuracy of 96.5% and significantly enhancing feature extraction. Chen et al. [18] introduced a machine learning approach combining local slope and frequency-domain features, reaching an accuracy of 0.966 in recognising rumination and feeding behaviours. Shang et al. [19] proposed a lightweight network that fuses features through an improved combination of two distinct attention mechanisms, achieving notable gains in both accuracy and generalisation; nevertheless, it still fails to sufficiently capture the dynamic interactions between local key regions of the cattle body and the surrounding environment.
Although some of the aforementioned studies have attempted to incorporate attention mechanisms or skeletal information, they still lack the explicit ability to model the dynamic and fine-grained interactions between local cattle body parts (e.g., head, muzzle) and environmental objects, which is critical for recognising behaviours such as drinking and grooming. For example, in cattle running behaviour recognition, focusing solely on morphological changes in the body while ignoring dynamic changes in environmental factors during movement may result in short-term running behaviours being misclassified as walking. Similarly, in identifying drinking behaviour within a video clip, cattle may simultaneously perform actions unrelated to drinking. This could lead to the extracted features of the clip containing not only valid interaction cues between the cattle and the water trough but also irrelevant information, thereby undermining the accuracy of drinking behaviour recognition.
In contrast to previous studies, our method explicitly models the ‘cattle–environment’ relationship through a novel reasoning algorithm. The core idea is to treat cattle behaviour as a direct manifestation of the spatio-temporal evolution of interactions between key body regions and environmental elements. To this end, we design a three-stage network that jointly models cattle–environment interactions. Specifically, the spatio-temporal perception network extracts spatial features of key regions and their variations; the spatio-temporal relation integration network incorporates metric learning with relation reasoning to automatically uncover associations between cattle features and environmental factors; and the spatio-temporal enhancement network further optimises spatio-temporal relation representations, enabling accurate recognition of complex behaviours. This approach significantly improves the model’s discriminative ability and generalisability across diverse scenarios.
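To make the three-stage design concrete, the sketch below shows one plausible skeleton in PyTorch. All layer choices, dimensions, and names (e.g., `CattleBehaviourNet`, `feat_dim`) are illustrative assumptions for exposition, not the authors' exact architecture:

```python
import torch
import torch.nn as nn

class CattleBehaviourNet(nn.Module):
    # Hypothetical skeleton of the three-stage design; layer choices are
    # illustrative assumptions, not the authors' implementation.
    def __init__(self, feat_dim=256, num_classes=7):
        super().__init__()
        # Stage 1: spatio-temporal perception -- per-frame spatial features.
        self.perception = nn.Sequential(
            nn.Conv2d(3, feat_dim, kernel_size=7, stride=4, padding=3),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # Stage 2: relation integration -- reasoning over frame/context pairs.
        self.relation = nn.Sequential(
            nn.Linear(2 * feat_dim, feat_dim),
            nn.ReLU(),
        )
        # Stage 3: spatio-temporal enhancement -- temporal refinement + classifier.
        self.enhance = nn.LSTM(feat_dim, feat_dim, batch_first=True)
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, clip):                                  # clip: (B, T, 3, H, W)
        b, t = clip.shape[:2]
        f = self.perception(clip.flatten(0, 1)).flatten(1)    # (B*T, D)
        f = f.view(b, t, -1)
        # Relate each frame to the clip context (a stand-in for the
        # cattle-environment relation reasoning described in the text).
        ctx = f.mean(dim=1, keepdim=True).expand_as(f)
        r = self.relation(torch.cat([f, ctx], dim=-1))        # (B, T, D)
        h, _ = self.enhance(r)
        return self.head(h[:, -1])                            # (B, num_classes)
```

A call such as `CattleBehaviourNet()(torch.randn(2, 8, 3, 112, 112))` returns one logit vector per clip, mirroring the clip-level behaviour labels used in this study.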
The main contributions of this study are as follows:
We propose a novel spatio-temporal feature extraction algorithm that explicitly models the relationships between key regions of cattle behaviours and other regions in the video, effectively reducing redundant information and mitigating occlusion-related interference.
We develop a spatio-temporal awareness network to accurately capture features of key regions and their motion dynamics. In addition, a spatio-temporal relation fusion network is designed to integrate metric learning with relational reasoning, enabling adaptive exploration and quantification of interactions between cattle body parts and environmental factors. Furthermore, the proposed spatio-temporal enhancement network significantly improves the discriminability of features that are spatially similar but temporally distinct.
We validate the effectiveness of each component of the proposed algorithm through comprehensive comparative and ablation experiments. Moreover, we demonstrate the robustness and generalisation capability of our method by fine-tuning the trained model on a separate dataset, confirming its applicability to cross-scenario cattle behaviour recognition tasks.
3. Results
To validate the effectiveness of the proposed key-region relation reasoning-based spatio-temporal cattle behaviour recognition network, this study conducted comparative experiments with several classical action recognition models, including TSN [23], TSM [24], I3D [25], ACTION-Net [26], and SlowFast [27]. TSN divides the video into fixed-length segments, extracts a single-frame feature from each segment, and fuses all segment features for the final classification. TSM captures temporal information by shifting input features along the time dimension. I3D (Inflated 3D ConvNet) employs a weight-inflation strategy, extending pre-trained 2D convolutional weights to 3D convolutions for more effective handling of spatio-temporal information. ACTION-Net equips a 2D backbone with additional excitation modules to emphasise motion-sensitive features, while SlowFast pairs a low-frame-rate pathway that captures semantics with a high-frame-rate pathway that captures fine motion.
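For intuition, the temporal shift at the heart of TSM can be written in a few lines. The following is a minimal sketch of the published shift rule (the tensor layout and `fold_div` default follow the TSM paper; the function name is ours):

```python
import torch

def temporal_shift(x, n_segment, fold_div=8):
    """Shift a fraction of channels along time; x is (N*T, C, H, W),
    laid out as N clips of T consecutive frames (T = n_segment)."""
    nt, c, h, w = x.size()
    x = x.view(nt // n_segment, n_segment, c, h, w)
    fold = c // fold_div
    out = torch.zeros_like(x)
    out[:, :-1, :fold] = x[:, 1:, :fold]                   # shift one channel slice backward in time
    out[:, 1:, fold:2 * fold] = x[:, :-1, fold:2 * fold]   # shift another slice forward in time
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]              # leave the remaining channels in place
    return out.view(nt, c, h, w)
```

Because the shift itself has no parameters, TSM adds temporal modelling to a 2D backbone at essentially zero extra cost.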
3.1. Comparison Experiment
As shown in Table 1, the proposed model achieves an overall mAP of 87.19%, representing an improvement of 6.78 percentage points over I3D, 11.12 points over TSM, 6.96 points over TSN, and 4.92 points over SlowFast. Notably, our method also surpasses the state-of-the-art ACTION-Net, which is specifically designed for complex action recognition, by 3.57 percentage points in mAP, thereby demonstrating the effectiveness of the proposed relation reasoning paradigm in agricultural scenarios. Across different behaviour categories, our model consistently outperforms the compared approaches in most cases. The relatively lower performance observed for the Grooming and Running categories may be attributed to the limited number of samples available for these behaviours.
To evaluate the generalisability and robustness of the proposed model, we further fine-tuned it on the CBVD-5 dataset [28], which contains five common cattle behaviours. The experimental results are presented in Table 2.
As shown in the table, the proposed model achieves higher recognition accuracy than the original methods for most behaviour categories. For static or high-frequency behaviours such as Standing, Lying down, and Foraging, the recognition accuracy improved from 98.61%, 96.71%, and 96.95% to 99.85%, 99.73%, and 98.70%, respectively, indicating that the model has stronger expressive power in capturing posture-related spatio-temporal features and can more accurately distinguish behaviours with small motion amplitudes but similar semantics. Notably, for low-frequency behaviours highly influenced by environmental factors, such as Drinking water, the recognition accuracy increased markedly from 34.82% to 57.18%, demonstrating that the proposed model is more adaptive in modelling interactions between animals and their environment.
However, for the Rumination category, the recognition accuracy decreased from 41.00% to 35.50%. We speculate that this drop arises because rumination relies on subtle, continuous head movements, for which the current model’s micro-action temporal modelling capability is still limited. This issue is analysed further in Section 4.
Overall, the proposed model outperforms existing methods in recognising the primary static and high-frequency behaviours, while also pointing to room for improvement in modelling fine-grained actions. In summary, this study introduces a key-region relation reasoning-based spatio-temporal cattle behaviour recognition network that models the spatio-temporal relationships of cattle behaviours in videos. Compared with conventional action recognition models, our approach demonstrates superior behavioural representation ability and exhibits strong transferability and robustness across datasets. Notably, while maintaining balanced performance across behaviours, the model achieves its largest relative gains over the baselines on the Grooming and Running behaviours. The experimental results indicate that explicitly modelling relational interactions among extracted features yields higher accuracy than traditional spatio-temporal feature extraction alone. A Grad-CAM visualisation of drinking behaviour is shown in Figure 7, where the model focuses attention on the intersection between the cattle and the water trough.
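For readers who wish to reproduce this kind of visualisation, the sketch below shows the standard Grad-CAM recipe using PyTorch hooks; `model`, `target_layer`, `clip`, and `class_idx` are placeholders, and this is not necessarily the exact pipeline used to produce Figure 7:

```python
import torch
import torch.nn.functional as F

def grad_cam(model, target_layer, clip, class_idx):
    """Standard Grad-CAM: gradient-weighted channel average of a conv layer."""
    feats, grads = [], []
    h1 = target_layer.register_forward_hook(lambda m, i, o: feats.append(o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))
    model(clip)[:, class_idx].sum().backward()
    h1.remove(); h2.remove()
    weights = grads[0].mean(dim=(-2, -1), keepdim=True)          # per-channel importance
    cam = F.relu((weights * feats[0]).sum(dim=1))                # (N, H', W') heatmap
    return cam / (cam.amax(dim=(-2, -1), keepdim=True) + 1e-6)   # normalise for overlay
```

Upsampling the heatmap to the frame resolution and overlaying it on the input frame yields attention maps like those in Figure 7.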
3.2. Ablation Experiment
To quantitatively evaluate the contribution of each module to the overall model performance, this section conducts ablation studies in which modules are removed individually or in combination: the Frame-Level Spatial Attention Module (FL-SAM) in the spatio-temporal perception network; the Key Motion Feature Extraction Module (KMFEM) in the inter-frame differential residual network; the Metric-based Feature (MF) in the low-frame-rate branch of the spatio-temporal relation integration network; the Spatial Relation Module (SRM); the Temporal Relation Module (TRM) in the high-frame-rate branch; and the Channel-Level Spatial Attention Module (CL-SAM) and LSTM in the spatio-temporal enhancement network. Each module was progressively removed from the complete model, and the resulting performance changes were recorded.
Table 3 presents the results of these experiments. All ablation studies were conducted under controlled settings: except for the module being tested, all other training hyperparameters (including learning rate, optimizer, batch size, number of epochs, and random seed) were kept constant to ensure fair comparison.
The experimental results indicate the following:
Removing FL-SAM led to an overall mAP drop of 0.46%, with more pronounced accuracy decreases in localised behaviours such as Grooming and Drinking, which rely heavily on cattle’s local body regions. This demonstrates that FL-SAM suppresses background redundancy and enhances the representation of key body parts. Unlike conventional global attention, FL-SAM applies grouped convolutions before attention computation, which is particularly effective in livestock video scenarios with occlusions and non-target regions. This design reduces computation while increasing the independence of local features, thereby improving the discriminability of short-sequence key features.
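As a rough illustration of this design, the following sketch computes a spatial attention map from grouped-convolution features; the layer sizes and grouping factor are assumptions, not the exact FL-SAM configuration:

```python
import torch
import torch.nn as nn

class FrameLevelSpatialAttention(nn.Module):
    """Illustrative FL-SAM-style block: a grouped convolution produces
    group-independent local descriptors, from which a single-channel
    spatial attention map is derived and applied to the input."""
    def __init__(self, channels, groups=8):   # channels must be divisible by groups
        super().__init__()
        self.local = nn.Conv2d(channels, channels, 3, padding=1, groups=groups)
        self.to_map = nn.Conv2d(channels, 1, 1)

    def forward(self, x):                                    # x: (N, C, H, W) per-frame features
        attn = torch.sigmoid(self.to_map(self.local(x)))     # (N, 1, H, W)
        return x * attn                                      # emphasise key regions, damp background
```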
Removing KMFEM caused the most significant drop in Running accuracy (6.5%), confirming its role in extracting rapid motion details from inter-frame residuals. Traditional differential networks tend to introduce background noise, whereas KMFEM isolates changes in critical regions during sudden cattle movements, a pattern of livestock motion that methods designed for human action recognition handle poorly.
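A minimal reading of this idea is sketched below: inter-frame residuals are gated so that only strong, localised changes survive. The gating choice is our assumption for exposition:

```python
import torch
import torch.nn as nn

class KeyMotionFeatureExtraction(nn.Module):
    """Illustrative KMFEM-style block: frame-to-frame residuals, gated to
    suppress background noise and keep motion in critical regions."""
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Conv2d(channels, channels, 1)

    def forward(self, x):                                 # x: (B, T, C, H, W) frame features
        diff = x[:, 1:] - x[:, :-1]                       # inter-frame residuals
        b, t = diff.shape[:2]                             # t = T - 1
        flat = diff.flatten(0, 1)
        motion = flat * torch.sigmoid(self.gate(flat))    # keep strong, localised changes
        return motion.view(b, t, *x.shape[2:])            # (B, T-1, C, H, W) motion features
```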
Removing MF decreased mAP by 0.56%, mainly affecting behaviours with similar postures, such as Drinking and Standing. MF introduces contrastive constraints in the low-frame-rate branch, enabling the model to distinguish fine-grained action categories in feature space. Compared with purely classification-based supervision, contrastive metric learning strengthens the discriminative boundaries between different action features, which is crucial for capturing subtle differences in cattle behaviours.
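The contrastive constraint can be summarised by a loss of the following form; this is a generic metric-learning sketch under the assumption of Euclidean distances and a fixed margin, not necessarily the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def metric_feature_loss(feats, labels, margin=0.5):
    """Pull same-behaviour clip embeddings together, push different ones
    beyond a margin. Assumes the batch contains at least two clips of some
    class and at least two distinct classes."""
    feats = F.normalize(feats, dim=1)                     # (N, D) clip embeddings
    dist = torch.cdist(feats, feats)                      # pairwise Euclidean distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    eye = torch.eye(len(labels), dtype=torch.bool, device=feats.device)
    pos = dist[same & ~eye]                               # same-class pairs
    neg = dist[~same]                                     # cross-class pairs
    return pos.pow(2).mean() + F.relu(margin - neg).pow(2).mean()
```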
The absence of SRM led to the most substantial performance drop, reducing mAP to 84.3%, with Grooming and Drinking declining by 7.9% and 2.6%, respectively. This highlights the importance of SRM in capturing cattle–environment interaction patterns. Unlike general human action recognition, livestock behaviours depend heavily on subject–environment relationships (e.g., head–trough, torso–ground); by explicitly modelling these spatial relations, SRM significantly enhances recognition robustness in complex environments.
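In the spirit of relation networks, such pairwise spatial reasoning can be sketched as follows; the region partitioning and MLP sizes are illustrative assumptions rather than the authors' exact SRM:

```python
import torch
import torch.nn as nn

class SpatialRelationModule(nn.Module):
    """Illustrative SRM-style block: every ordered pair of region features
    (e.g., head, trough, torso, ground) is scored by a shared MLP and the
    pairwise relations are summed into a single relation vector."""
    def __init__(self, dim):
        super().__init__()
        self.g = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                               nn.Linear(dim, dim))

    def forward(self, regions):                        # regions: (B, K, D)
        b, k, d = regions.shape
        a = regions.unsqueeze(2).expand(b, k, k, d)    # region i
        c = regions.unsqueeze(1).expand(b, k, k, d)    # region j
        pairs = torch.cat([a, c], dim=-1)              # all ordered pairs (i, j)
        return self.g(pairs).sum(dim=(1, 2))           # (B, D) aggregated relations
```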
Removing TRM decreased mAP by 1.6%, with Running accuracy dropping sharply by 6.2%. We speculate that running exhibits strong temporal dynamics requiring long-range dependency modelling to distinguish it from fast walking. The complementary relationship between SRM and TRM is also evident: SRM models spatial interactions, while TRM focuses on temporal dependencies, together enabling comprehensive spatio-temporal relation modelling.
Removing the channel-level attention enhancement module resulted in a 0.16% decrease in mAP, with slight drops across all seven behaviours. This indicates that CL-SAM suppresses noisy channels (e.g., non-target regions) and strengthens salient features related to the cattle and interacting objects, ensuring the discriminability of the representations.
Removing LSTM led to a 0.22% decrease in mAP, particularly affecting Walking and Running, where accuracy dropped by 0.8% and 0.6%, respectively. This demonstrates that spatial variations in short sequences alone cannot fully capture temporal evolution patterns. LSTM in the long-sequence branch captures cross-frame dynamic dependencies, enabling the network to distinguish behaviours with similar motion magnitude but different temporal patterns.
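Taken together, the two enhancement components can be approximated by the sketch below: an SE-style channel gate standing in for CL-SAM, followed by an LSTM over the long-sequence branch. Both are simplified assumptions about the actual design:

```python
import torch
import torch.nn as nn

class SpatioTemporalEnhancement(nn.Module):
    """Illustrative enhancement stage: a channel gate reweights feature
    channels (CL-SAM stand-in), then an LSTM models cross-frame dynamics."""
    def __init__(self, dim, reduction=4):
        super().__init__()
        self.channel_gate = nn.Sequential(
            nn.Linear(dim, dim // reduction), nn.ReLU(),
            nn.Linear(dim // reduction, dim), nn.Sigmoid())
        self.lstm = nn.LSTM(dim, dim, batch_first=True)

    def forward(self, f):                          # f: (B, T, D) clip features
        gate = self.channel_gate(f.mean(dim=1))    # (B, D) gate from temporal mean
        h, _ = self.lstm(f * gate.unsqueeze(1))    # suppress noisy channels, then model time
        return h[:, -1]                            # (B, D) enhanced clip representation
```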
Overall, removing SRM and TRM caused the most significant performance degradation, confirming the necessity of explicit relational modelling for livestock behaviour recognition. Meanwhile, KMFEM, MF, and FL-SAM provide fine-grained feature optimisation across different dimensions, ensuring robustness across slow and fast behaviours, local and global features, and subject–environment interactions. Finally, the dual-branch spatio-temporal enhancement (CL-SAM + LSTM) further improves cross-time-scale modelling capability. Compared with conventional human action recognition networks, the proposed method not only achieves significant performance gains (+6.78% mAP over I3D) but also effectively addresses challenges unique to livestock scenarios, including severe occlusion, subtle motion differences, and strong subject–environment dependencies, demonstrating the targeted advantages of our approach.
4. Discussion
This study addresses the challenges in livestock scenarios, including complex individual–environment interactions, subtle behavioural differences, and diverse temporal dependencies, by proposing a spatio-temporal cattle behaviour recognition network based on key region relational reasoning. Unlike previous methods that rely solely on global features or single-subject modelling, the proposed approach explicitly integrates interactions between local cattle body features and environmental features, and jointly models them through a spatio-temporal relational reasoning module, thereby capturing the semantic dependencies inherent in the behaviour generation process more comprehensively. Experimental results demonstrate that this method effectively enhances the model’s understanding of complex interactive behaviours, particularly outperforming traditional approaches in behaviour categories that rely on environmental cues. From the perspective of precision livestock management, enhancing identification accuracy is crucial. Traditional manual observation methods are time-consuming, labour-intensive, and highly subjective, making it difficult to achieve 24/7 monitoring. This method achieves over 91% identification accuracy, providing technical assurance for automated and highly reliable cattle behaviour monitoring systems. This enables ranch managers to make decisions based on objective data rather than relying on experience, marking a key step in elevating the intelligent management level of modern livestock farming.
In terms of experimental results, the model achieved state-of-the-art overall recognition performance on two cattle behaviour datasets, with a mean average precision (mAP) of 87.19% on the CBVD-7 dataset, representing improvements of 4.92 and 3.57 percentage points over SlowFast and ACTION-Net, respectively. This improvement can be primarily attributed to the advantage of the Spatial Relation Module (SRM) in modelling local interactions. For instance, in behaviours that are strongly associated with the environment, such as drinking and grooming, the model explicitly attends to the spatio-temporal consistency among the water trough, the cattle’s head, and body movements, leading to a significant increase in recognition accuracy. These results indicate that environmental context constitutes a critical basis for behaviour recognition in livestock videos, and that relational reasoning mechanisms can substantially enhance the model’s sensitivity to complex interactive patterns. This precise identification of critical behaviours translates directly into actionable pasture management applications. An abnormal reduction in water intake is an early, sensitive indicator of heat stress or digestive disorders (such as ketosis); by automatically recording each animal’s drinking frequency and duration, the model provides data support for early health alerts, enabling farmers to intervene before clinical symptoms emerge. Behavioural frequency and duration are likewise core metrics for assessing welfare status: by quantifying behaviours such as grooming, the model helps managers evaluate housing environments and social interactions within herds, promptly identifying welfare declines caused by disease or discomfort.
The Temporal Relation Module (TRM) also plays a crucial role in distinguishing behaviours that look alike but differ in temporal dynamics. For example, “Walking” and “Running” appear nearly identical in static frames but differ markedly over the temporal scale. By incorporating KMFEM and TRM, the model captures both short-term and long-term temporal dependencies, reducing confusion between similar behaviours. Distinguishing behaviours that appear similar but differ significantly in energy expenditure has practical significance for precise management. For instance, the frequency and intensity of “running” behaviour serve as a core criterion for automated oestrus detection: accurately differentiating “running” from general “walking” allows more precise quantification of cows’ activity levels, enabling timely identification of oestrus cycles to improve conception rates and reproductive efficiency. In addition, changes in overall movement patterns provide crucial references for evaluating cattle health and body condition.
Compared with previous studies, this work demonstrates significant differences in data utilisation and feature modelling strategies. Networks based on skeletal keypoints, such as ST-GCN [6], can explicitly model the topological relationships among body parts; however, in livestock videos, the reliability of keypoint detection is often compromised by occlusion, body size variation, and coat colour interference. In contrast, the proposed model requires no skeletal annotations and relies solely on RGB videos to establish an equivalent spatial relational structure, making it more suitable for behaviour recognition in real-world farm environments. Compared with dual-branch attention models, this study achieves synchronous inference of spatial and temporal features within a unified framework, thereby avoiding semantic inconsistencies between branch features. Unlike the SlowFast network, which relies on dual-rate sampling, the proposed method learns discriminative cattle body spatial features through spatial constraint metrics, resulting in greater stability and maintaining high recognition accuracy in dynamic scenes.
Notably, compared with the SocialCattle system [1], the proposed study achieves recognition of multiple behaviour categories, including drinking, standing, running, and interaction, solely through video analysis. This approach eliminates the costs associated with sensor deployment and validates the feasibility of visual relational modelling for behaviour monitoring. Previous research on ST-GCN-based cattle behaviour recognition [6] has highlighted that occlusion in multi-cattle scenarios can significantly reduce recognition accuracy. In contrast, the spatial attention mechanism in the present study effectively mitigates such effects by focusing on semantically salient regions. These comparative results indicate that the proposed method not only surpasses existing networks in accuracy but also offers advantages in robustness for practical deployment.
However, the model’s performance in recognising rumination behaviour on the CBVD-5 dataset declined. This may be because rumination relies on small-amplitude, highly rhythmic head movements, and the video frame rate and spatial resolution limit the model’s ability to capture such micro-movements. In contrast, for behaviours involving interactions with large environmental objects, such as drinking, the relational reasoning mechanism performs more effectively. While recognition of micro-movements such as rumination still has room for improvement, automated monitoring of these behaviours remains a critical need in precision livestock management: rumination duration is among the most important indicators for diagnosing rumen health (e.g., acidosis) and evaluating feed digestibility. By increasing video capture frame rates and refining the model architecture, this challenge should be addressable, replacing time-consuming and labour-intensive manual observation with continuous, around-the-clock monitoring of core herd health metrics.
Furthermore, taking grooming behaviour as an example (see Figure 8), the model accurately focuses on the abdominal region when the cattle face the camera; however, when the cattle are oriented away from the camera, the attention shifts to the limbs. This observation indicates that occlusion and viewpoint variation continue to affect the model’s ability to temporally model key body regions. To address this limitation, future work could leverage multi-view learning and region-level attention mechanisms to enhance the model’s spatio-temporal representation in small-scale regions.
Although the proposed method has a slightly higher parameter count compared with some lightweight models, it achieves substantial performance improvements. Future research will further explore model compression techniques, such as pruning, quantisation, and knowledge distillation, to reduce deployment costs. In addition, integrating transfer learning and meta-learning approaches could enhance the generalisation performance for small-sample categories, such as rumination. Moreover, we plan to extend the framework to individual-level relational modelling. On one hand, region-level attention mechanisms could be introduced on ROI-extracted subject features to model self-interactions by partitioning body parts such as the head, limbs, and abdomen, thereby supporting group behaviour recognition and social interaction analysis to provide more comprehensive data for precision livestock management. On the other hand, adversarial data augmentation based on failure samples and multi-view statistics could be employed to mitigate biases caused by occlusion and viewpoint variations. Furthermore, we intend to conduct head ROI enlargement experiments and manually verify annotation consistency to further validate the model’s robustness.
From a practical application perspective, the proposed model can be deployed in smart farm monitoring systems to automatically recognise daily cattle behaviours and detect anomalies. For instance, by identifying changes in water intake or excessive activity, potential stress or health issues can be detected at an early stage. The model’s performance across multiple datasets, as well as its robustness under occlusion, facilitates the development of efficient real-time behaviour monitoring and health warning systems, thereby reducing the burden of manual observation and improving animal welfare in group-rearing environments.