1. Introduction
The rising global demand for food has led to growing public concern about animal welfare. Studies have shown significant correlations between common behavioral modalities in group-housed pigs and their health status [
1]. Among these behaviors, eating, lying on the belly, lying on the side, and standing are key indicators of pigs’ physical health. Specifically, eating behavior directly reflects nutritional intake levels, and abnormal feeding frequency can effectively signal potential illness [
2]. Postural behaviors like lying on the belly, lying on the side, and standing help assess the animals’ comfort levels and can also serve as indicative signs of external injuries [
3]. Furthermore, the timing and frequency of posture changes are particularly important indicators during the gestation period in sows, providing valuable insights into their health status [
4]. Continuous monitoring of these behaviors facilitates the early detection of health problems and enables timely adjustments to feeding and management strategies, thereby enhancing animal welfare while ensuring economic efficiency in pig production.
As research into human–animal–robot [
5,
6] interactions advances, the use of robotic agents in standardized production has become a key strategy for the sustainable development of animal husbandry. In recent years, functional robots have been increasingly deployed in group-housed animal-farming systems to perform specific tasks, such as poultry house inspection [
7] and automated cattle feeding [
8]. However, conventional robotic perception systems, which rely on sensors with limited information bandwidth (e.g., infrared or radar), struggle to detect health-related behavioral patterns in livestock. Although wearable or implantable sensors can be used to monitor pig behavior, such methods may adversely affect animal health and welfare. Against this backdrop, accurate behavior-detection technologies have emerged as a breakthrough tool for enhancing the efficiency and sustainability of farm management [
9].
Automated detection technology based on machine vision enables effective assessment of physiological status and growth performance [
10,
11,
12]. These technologies have achieved field-tested success in several agricultural areas, including posture monitoring in dairy cattle [
13] and behavior tracking in poultry [
14]. In recent years, research interest in pig health status detection and posture recognition has grown significantly. Several studies have adopted multidimensional machine learning approaches to identify disease manifestations and behavioral patterns in pigs simultaneously [
15]. For instance, Lee [
16] employed depth sensors to capture the movement characteristics of pigs and applied support vector machines (SVM) to predict aggressive behavior based on posture analysis in farm environments. Other researchers have developed methods using motion history images to extract kinematic features [
17]. Additionally, prior research has demonstrated that utilizing acceleration features and hierarchical clustering [
18], in combination with the kinetic energy differences between adjacent video frames [
19], can significantly enhance the accuracy of aggressive behavior recognition in group-housed pigs. These studies offer viable solutions for early warning and continuous monitoring of pigs’ health status. However, these methods exhibit compromised robustness when dealing with real-world disturbances, including illumination variance, partial occlusion, and heterogeneous farm configurations, resulting in performance degradation.
Deep learning approaches have demonstrated superior performance in addressing the aforementioned challenges, and this technical paradigm has achieved a series of breakthroughs in pig behavior detection. Wei et al. [
20] proposed an EMA-YOLOv8 behavior-detection framework integrated with ByteTrack, which quantitatively couples individual pig motion-trajectory parameters (detection accuracy 96.4%) with aggressive behaviors such as head-neck biting and body collisions. Yang et al. [
11] conducted a comprehensive review of methodological advances in pig body segmentation, individual detection, and behavior recognition using computer vision technologies for intelligent livestock monitoring. Nasirahmadi et al. [
21] proposed a deep learning framework integrating R-FCN and ResNet101 for automated detection of standing and lying postures in pigs through machine vision analysis. Ji et al. [
22] introduced an enhanced YOLOX architecture for multi-pose detection in group-housed pigs, enabling accurate detection of standing, lying, and sitting postures. An enhanced Faster R-CNN framework was proposed by Riekert et al. [
23] for real-time localization of swine spatial coordinates and behavioral posture discrimination between recumbent and ambulatory states. Kai et al. [
24] proposed an automatic temporal detection method that utilizes RGB data and optical flow to detect aggressive behaviors among group-housed pigs. In addition, Chen et al. [
25] systematically constructed a multimodal algorithm-evaluation framework for pig posture detection, establishing cross-scenario deployment standards through quantitative analysis of synergistic gain boundaries between multi-source data (RGB/depth camera fusion) and network architectures (two-stage/single-stage). Zhong et al. [
26] proposed the YOLO-DLHS-P model by introducing a Dilated Reparameterization Block (C2f-DRB) into the YOLOv8n baseline network, achieving a 52.49% parameter reduction while improving localization accuracy by 1.16%. Moreover, deep learning-based behavior-detection methods have achieved outstanding performance in group-housed animals such as sheep [
27] and cattle [
28].
Notably, recent studies have shown that advanced methods can achieve high-accuracy pig posture recognition together with a degree of adaptability to the environmental disturbances encountered under real-world farming conditions. Nevertheless, in complex herd environments these factors remain key challenges that significantly constrain detection accuracy. Kim et al. [
29] developed angular refinement modules integrated with YOLOv3/v4 architectures to enable robust identification of feeding behavior in group-housed weaned pigs under intensive farming conditions. Zhang et al. [
30] developed a collaborative tracking framework combining CNN detectors with correlation filter trackers, achieving 87.6% tracking accuracy under lighting variations and partial occlusion scenarios. Mao et al. [
31] proposed the DM-GD-YOLO model, which integrates deformable convolution and attention mechanisms, and dynamically models non-rigid deformation features (e.g., body collisions/posture variations) through C2f-DM modules. Combined with cross-scale feature aggregation-dispersion mechanisms, this 6 MB parameter model achieves synchronous detection of seven behaviors (including three anomalies) in high-density group housing (30 pigs/pen) with 95.3% mAP, providing a lightweight real-time monitoring solution for intensive farming. However, fundamental challenges remain in complex herd environments. Camera positions are often insufficient to capture fine details of distant or small targets. Additionally, during multi-scale processing, the upsampling and downsampling of feature maps can introduce semantic gaps, which adversely affect detection accuracy. In group pig farming, particularly under non-orthogonal viewing angles, the head and tail of an individual pig may appear spatially separated. Traditional convolutional layers struggle to model such long-range spatial dependencies, making it difficult to associate relevant features and often leading to segmentation errors.
Subsequent research has made significant progress on the challenge of modeling long-range dependencies, and recent advances in state space models (SSMs) provide new insights. Mamba-based SSM architectures achieve long-range dependency modeling with linear time complexity [
32,
33], significantly improving computational efficiency. Fazzari et al. [
34] proposed the Mamba-MSQNet model, achieving 74.6 mAP in animal behavior detection with 2.3M parameters (a 90% computational reduction) through Transformer-to-Mamba block replacement for selective state space modeling. Inspired by this, we integrate Mamba into our object-detection framework, enabling efficient feature modeling and resource optimization in complex breeding environments.
To address the aforementioned challenges in intensive animal farming and enhance the detection accuracy of four key behavioral states—eating, lying on the belly, lying on the side, and standing—this study proposes a novel Mamba-YOLO hybrid architecture.
The main contributions of this paper include:
An efficient multi-behavior-recognition algorithm for pigs, named Mamba-driven Adaptive Cross-layer Attention Network (MACA-Net), capable of recognizing various pig behaviors in group-housing environments.
An Adaptive Multi-Path Attention mechanism (AMPA), which effectively constructs attention maps across spatial and channel dimensions, significantly enhancing the model’s focus. This mechanism can serve as a valuable reference for other detection tasks.
The Mamba Global–Local Extractor (MGLE) module is proposed to enhance the model’s global modeling capabilities while mitigating Mamba’s tendency to overlook fine-grained details in visual tasks. This module may also serve as a reference for extending Mamba to other vision-related applications.
The Cross-layer Feature Pyramid Transformer (CFPT) is integrated into the proposed framework, effectively mitigating the semantic gap in multi-scale feature transmission. This enhancement significantly improves the aggregation capability of multi-scale features for pig behavior recognition.
2. Materials and Methods
2.1. Animals, Housing and Management
All dataset collection for this study was conducted between May and June 2022 at a large-scale commercial pig farm in Wenshan City, Yunnan Province, which is a real-world production facility engaging in standard commercial pig-farming practices. Each pen in the facility measured 3.5 m in length and 2.2 m in width, equipped with plastic slatted floors. The standard pen configuration included one feeding trough and two drinking nozzles, with each unit housing 20 pigs of the hybrid variety (Large White × Landrace × Duroc). The quantity of feed was adjusted according to the age specifications for pigs, with twice-daily feedings administered at 8:00 a.m. and 3:00 p.m. All observed pigs were managed under unaltered commercial husbandry conditions, ensuring that the behavioral data reflect authentic production environments.
The video data was captured using RGB cameras (Zhongwo CWT003, Shenzhen, China), mounted at the lower-left corner of each pen with a 50-degree downward angle. This setup follows the original installation position and angle of surveillance equipment in commercial pig farms. Compared to conventional orthogonal monitoring angles, this configuration enables full spatial coverage without relying on the installation height of the equipment, making it more consistent with the objectives of this study. To enhance data diversity and model robustness, we performed recordings under both high-intensity and low-intensity illumination conditions. The video streams captured during this process were then transmitted simultaneously via a wireless network to cloud servers for subsequent storage. The footage was stored in the Transport Stream format on hard drives, resulting in approximately 162 h of footage at a resolution of 2304 × 1296.
2.2. Image Acquisition and Data Preprocessing
The construction of the dataset followed a six-stage process, beginning with pre-screening of the recorded videos to observe group-housed pig behaviors. To capture behavioral diversity, videos with a relatively balanced range of pig activities were selectively chosen, resulting in a total of 84 video samples. Using FFmpeg, frames were extracted at one-second intervals and saved as JPG files at the original resolution. Excessively distorted data were removed through manual quality screening, yielding a curated dataset of 5159 images with well-defined foreground-background separation. This final collection comprised 1250 low-light samples and 3909 standard-illumination samples.
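For illustration, the frame-extraction step can be reproduced with a short script such as the sketch below; the file names and directory layout are hypothetical, and only the 1 fps sampling and JPG output follow the description above.

```python
# Minimal sketch of 1 fps frame extraction with FFmpeg; paths are illustrative.
import subprocess
from pathlib import Path

def extract_frames(video_path: str, out_dir: str) -> None:
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    # -vf fps=1 keeps one frame per second; -qscale:v 2 writes high-quality JPGs
    # at the native resolution (no scaling filter is applied).
    subprocess.run(
        ["ffmpeg", "-i", video_path, "-vf", "fps=1", "-qscale:v", "2",
         str(Path(out_dir) / "frame_%05d.jpg")],
        check=True,
    )

extract_frames("clip_001.ts", "frames/clip_001")
```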
In the annotation phase, strict inclusion criteria were applied in Labelme: individuals with more than 60% occlusion or originating from non-target pens were excluded, and behaviors were classified into four discrete categories (lying on belly, lying on side, feeding, standing). The identification criteria, with corresponding examples, are formalized in
Table 1. JSON-formatted labels were converted to YOLO format prior to dataset partitioning.
In order to mitigate the occurrence of data leakage from behavioral consistency in consecutive frames, source-based splitting was implemented at the video clip level. Based on empirical observations, the final dataset was partitioned into training, validation, and test sets using a stratified 8:1:1 ratio. Since each image contains multiple instances, the instance count for each behavioral category is summarized in
Table 2.
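As an illustration of the source-based splitting described above, the sketch below groups frames by their originating clip before assigning whole clips to the training, validation, and test sets; the frame-naming convention and the omission of behavior-level stratification are simplifying assumptions.

```python
# Video-level (source-based) split to avoid leakage between near-duplicate frames.
import random
from collections import defaultdict
from pathlib import Path

def split_by_clip(image_dir: str, seed: int = 0):
    groups = defaultdict(list)
    for img in Path(image_dir).glob("*.jpg"):
        clip_id = img.stem.rsplit("_", 1)[0]   # frames from the same clip share a prefix
        groups[clip_id].append(img)

    clips = sorted(groups)
    random.Random(seed).shuffle(clips)
    n = len(clips)
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    train = [f for c in clips[:n_train] for f in groups[c]]
    val = [f for c in clips[n_train:n_train + n_val] for f in groups[c]]
    test = [f for c in clips[n_train + n_val:] for f in groups[c]]
    return train, val, test
```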
Figure 1 illustrates the spatial distribution of bounding boxes and the variation in target scales, both of which are critical factors for effective model training.
To enhance the robustness of the model, this study employs Mosaic Data Augmentation (MDA) [
35] for online data augmentation. Specifically, it incorporates Mosaic augmentation, Mixup augmentation, random perspective transformation, and HSV augmentation, effectively increasing data diversity.
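For reference, this kind of online augmentation is typically controlled by a small set of hyperparameters; the values below are illustrative placeholders rather than the exact settings used in training.

```python
# Illustrative augmentation hyperparameters (placeholder values, not the exact
# configuration used in this study) covering the four strategies listed above.
augmentation = {
    "mosaic": 1.0,         # probability of composing four images into one mosaic
    "mixup": 0.1,          # probability of blending two training images
    "perspective": 0.0005, # magnitude of the random perspective transformation
    "hsv_h": 0.015,        # hue jitter fraction
    "hsv_s": 0.7,          # saturation jitter fraction
    "hsv_v": 0.4,          # value (brightness) jitter fraction
}
```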
2.3. Mamba-Driven Adaptive Cross-Layer Attention Network
In group-housed pig-farming environments, pigs exhibit substantial positional variation, and camera orientations often deviate from perfect perpendicular alignment with the ground plane (top-down view), leading to significant scale variations in the visual representation of pigs across image sequences. This inherent scale variability poses substantial challenges for conventional models in effectively extracting and fusing multi-scale spatial information. While the YOLOv8 model has been shown to be proficient in behavior classification and mitigates certain information loss through its architecture, it exhibits limitations in three critical aspects: (1) modeling long-range dependencies in spatial-temporal contexts while maintaining computational economy, (2) adaptive fusion of multi-scale features under perspective distortion, and (3) efficient detection of small distant targets. To address these challenges, we propose MACA-Net (Mamba-Driven Adaptive Cross-Layer Attention Network), an efficient model that integrates state space models with cross-layer attention mechanisms. The overall architecture of the proposed MACA-Net is depicted in
Figure 2.
The MACA-Net follows the architectural composition of the YOLO series, comprising three integral components: the Backbone, Neck, and Head. The Backbone inherits the primary framework of YOLOv8, but replaces the C2f module with the Mamba Global–Local Extractor (MGLE) module to capture gradient-rich information flows. The MGLE module employs a three-branch architecture that enhances long-range dependency modeling while preserving local details. This module is integrated with an Adaptive Multi-Path Attention (AMPA) mechanism to suppress redundant features. The Backbone processes input images to generate three multi-scale feature maps, which encapsulate hierarchical features from fine to coarse resolutions. These features are then directed into the Neck.
For the Neck, we abandon the original bidirectional flow structure of PAN [
36] and instead adopt a Transformer-based CFPT [
37] module. This design facilitates cross-layer feature interaction across spatial and channel dimensions, thereby strengthening information aggregation capabilities. Finally, the model utilizes general YOLO detection heads for classification and localization tasks. The proposed architecture maintains efficient learning and inference capabilities while becoming more lightweight through structural optimization.
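The overall data flow can be summarized by the following PyTorch-style sketch; the backbone, neck, and head arguments stand in for the modules detailed in the following subsections, and their interfaces are assumptions made for illustration.

```python
# High-level sketch of the MACA-Net data flow (Backbone -> CFPT Neck -> YOLO Head).
import torch.nn as nn

class MACANet(nn.Module):
    def __init__(self, backbone, neck, head):
        super().__init__()
        self.backbone = backbone   # YOLOv8-style backbone with C2f replaced by MGLE (+ AMPA)
        self.neck = neck           # CFPT: cross-layer attention instead of PAN up/down-sampling
        self.head = head           # standard YOLO detection heads (classification + box regression)

    def forward(self, x):
        p3, p4, p5 = self.backbone(x)          # three multi-scale feature maps, fine to coarse
        f3, f4, f5 = self.neck([p3, p4, p5])   # cross-layer spatial/channel aggregation
        return self.head([f3, f4, f5])         # per-scale predictions
```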
2.4. State Space Models
Mamba [
32] is predicated on state space models (SSMs), which were developed from the Kalman filter [
38]. Compared to Transformer-based architectures, Mamba exhibits comparable proficiency in long-sequence modeling while sustaining linear time complexity, a substantial advantage in terms of data-handling efficiency. SSMs map an input sequence $x(t)$ to an output sequence $y(t)$ through a hidden state $h(t) \in \mathbb{R}^{N}$, showcasing significant potential in complex sequence modeling tasks. The core mathematical formulation of SSMs is defined as follows:
$$h'(t) = \mathbf{A}h(t) + \mathbf{B}x(t), \qquad y(t) = \mathbf{C}h(t),$$
where $\mathbf{A} \in \mathbb{R}^{N \times N}$ represents the state transition matrix, $\mathbf{B} \in \mathbb{R}^{N \times 1}$ denotes the input projection matrix, and $\mathbf{C} \in \mathbb{R}^{1 \times N}$ is the output projection matrix. In order to adapt SSMs for deep learning applications and enable efficient computation on modern hardware, a discretization step with input resolution $\Delta$ (step size) is introduced. This converts the continuous-time system into a discrete-time system via the zero-order hold (ZOH) method, parameterized as $(\Delta, \mathbf{A}, \mathbf{B}, \mathbf{C})$. The discretized matrices $\bar{\mathbf{A}}$ and $\bar{\mathbf{B}}$ are defined as:
$$\bar{\mathbf{A}} = \exp(\Delta \mathbf{A}), \qquad \bar{\mathbf{B}} = (\Delta \mathbf{A})^{-1}\left(\exp(\Delta \mathbf{A}) - \mathbf{I}\right)\Delta \mathbf{B},$$
where $\exp(\cdot)$ denotes the matrix exponential and $\mathbf{I}$ is the identity matrix. After discretization, the model leverages two distinct computation modes to accommodate training and inference requirements:
$$h_{t} = \bar{\mathbf{A}}h_{t-1} + \bar{\mathbf{B}}x_{t}, \qquad y_{t} = \mathbf{C}h_{t},$$
$$\bar{\mathbf{K}} = \left(\mathbf{C}\bar{\mathbf{B}},\, \mathbf{C}\bar{\mathbf{A}}\bar{\mathbf{B}},\, \ldots,\, \mathbf{C}\bar{\mathbf{A}}^{L-1}\bar{\mathbf{B}}\right), \qquad y = x \ast \bar{\mathbf{K}},$$
where $\ast$ denotes the convolution operation and $L$ is the sequence length. The recurrent form supports efficient autoregressive inference, while the kernel formulation enables hardware-optimized parallel computation during training.
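To make the discretization and the recurrent computation mode concrete, the following NumPy sketch implements the ZOH discretization and the recurrent scan for a toy one-dimensional input; the matrices and step size are arbitrary illustrative values.

```python
# ZOH discretization followed by the recurrent (inference-style) SSM scan.
import numpy as np
from scipy.linalg import expm

def zoh_discretize(A, B, delta):
    """Zero-order hold: A_bar = exp(dA); B_bar = (dA)^-1 (exp(dA) - I) dB."""
    dA = delta * A
    A_bar = expm(dA)
    B_bar = np.linalg.inv(dA) @ (A_bar - np.eye(A.shape[0])) @ (delta * B)
    return A_bar, B_bar

def ssm_scan(A_bar, B_bar, C, x):
    """Recurrent mode: h_t = A_bar h_{t-1} + B_bar x_t, y_t = C h_t."""
    h = np.zeros((A_bar.shape[0], 1))
    ys = []
    for x_t in x:
        h = A_bar @ h + B_bar * x_t
        ys.append((C @ h).item())
    return np.array(ys)

# Toy example: N = 4 hidden states, sequence length L = 8.
rng = np.random.default_rng(0)
A = -np.diag(rng.uniform(0.5, 1.5, 4))        # stable diagonal state matrix
B = rng.standard_normal((4, 1))
C = rng.standard_normal((1, 4))
A_bar, B_bar = zoh_discretize(A, B, delta=0.1)
print(ssm_scan(A_bar, B_bar, C, rng.standard_normal(8)))
```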
2.5. Mamba Global–Local Extractor Module
In large-scale pig farm monitoring, individual pigs occupy widely varying numbers of pixels owing to perspective distortion and their spatial distribution, while semantically similar features are often dispersed across distant regions in crowded pens. Together, these factors substantially complicate image modeling and underscore the need for an expanded global receptive field and stronger long-range dependency modeling as pivotal research directions.
To address the limitations of global representation in vision tasks, VMamba [
39] introduces the 2D-Selective-Scan (SS2D) module (
Figure 3), which adapts state space models to visual processing. SS2D employs a novel spatial traversal strategy—Cross-Scan—that unfolds input patches into 1D subsequences along four distinct scanning directions. This multi-directional scanning reorganizes image regions to construct enriched feature representations. Each sequence is independently processed by dedicated S6 blocks [
32], and their outputs are aggregated into 2D feature maps. This design enables each pixel to integrate contextual information from all directions through parameterized pathways, effectively establishing a global receptive field while preserving long-range dependencies—addressing the core limitations of conventional convolutional operations.
The Mamba framework has been demonstrated to exhibit remarkable efficiency in the domain of sequence modeling. However, it is confronted with significant challenges in the extraction of discriminative local features when processing visual tasks characterized by complex scale variations. In order to address this critical limitation, we propose the Mamba Global–Local Extractor Module (MGLE-Module), as illustrated in
Figure 3. Specifically, given an input tensor $X \in \mathbb{R}^{C \times H \times W}$, the module first splits the feature maps along the channel dimension into two equal partitions, $X_{1}$ and $X_{2}$. These partitions are processed through two specialized branches: the Global Contextualization Branch (GCB) extracts rich multi-scale contextual information, while the Local Refinement Branch (LRB) preserves spatial details. The processed features are concatenated and enhanced with a residual connection to facilitate gradient flow, yielding the final output $Y$. This operation is formally expressed as:
$$Y = X + \mathrm{Concat}\left(\mathcal{F}_{\mathrm{GCB}}(X_{1}),\, \mathcal{F}_{\mathrm{LRB}}(X_{2})\right),$$
where $\mathcal{F}_{\mathrm{GCB}}$ and $\mathcal{F}_{\mathrm{LRB}}$ denote the global and local processing branches, respectively.
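A schematic PyTorch sketch of this split-process-merge pattern is given below; the two branch bodies are placeholders for the GCB and LRB described in Sections 2.5.1 and 2.5.2.

```python
# Split along channels, process with two branches, concatenate, add residual.
import torch
import torch.nn as nn

class MGLESketch(nn.Module):
    def __init__(self, channels, gcb: nn.Module, lrb: nn.Module):
        super().__init__()
        assert channels % 2 == 0
        self.gcb = gcb   # Global Contextualization Branch (SS2D + AMPA path)
        self.lrb = lrb   # Local Refinement Branch (DGST + ConvGLU path)

    def forward(self, x):
        x1, x2 = torch.chunk(x, 2, dim=1)             # split along the channel dimension
        y = torch.cat([self.gcb(x1), self.lrb(x2)], dim=1)
        return x + y                                   # residual connection aids gradient flow

# Usage with placeholder branches that keep the channel count unchanged.
c = 64
block = MGLESketch(c, gcb=nn.Conv2d(c // 2, c // 2, 3, padding=1),
                      lrb=nn.Conv2d(c // 2, c // 2, 3, padding=1))
out = block(torch.randn(1, c, 32, 32))
```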
2.5.1. Global Contextualization Branch
The Global Contextualization Branch employs a dual-path architecture based on SSMs to enhance feature representation. The input features first undergo layer normalization before being split into two parallel sub-paths. The first sub-path sequentially applies Depthwise Separable Convolution (DW-Conv) [
40] and activation functions to learn deeper feature representations, followed by the SS2D module for global context modeling, and concludes with a linear projection layer. The second sub-path integrates an Adaptive Multi-Path Attention (AMPA) mechanism (detailed in
Section 2.6) with activation functions to refine feature dependencies.
For feature fusion, gated mechanisms such as Gated MLP [
41] have demonstrated efficacy in natural language processing (NLP). However, our observations indicate that applying gating mechanisms to selectively filter output features based on shallow-level representations can compromise the long-range dependencies systematically constructed by SS2D’s directional scanning paradigm. To preserve these dependencies while incorporating attention-guided refinements, we adopt a simple yet effective additive fusion strategy.
2.5.2. Local Refinement Branch
The LRB architecture is inspired by the Transformer’s abstract design, which has shown remarkable potential in visual tasks, as evidenced by the empirical success of MetaFormer [
42]. For computational efficiency, we employ DGST [
43] as the core feature extraction component. DGST innovatively combines the channel shuffle technique from ShuffleNetV2 [
44] with group convolution principles. The module first expands channel dimensions via 1 × 1 convolutions before splitting features into primary and auxiliary branches. The primary branch utilizes GConv [
45] to explicitly model spatial locality within individual channels, capturing fine-grained patterns without cross-channel interference. The channel shuffle operation dynamically recombines grouped features through tensor reshaping and permutation, enabling cross-region interaction while preserving spatial coherence. Simultaneously, the auxiliary branch integrates the remaining three-quarters of channels into the primary branch to preserve global contextual information. DGST achieves computationally efficient localized feature extraction while balancing parameter reduction.
It has been demonstrated that the conventional Multi-Layer Perceptron (MLP) used as the channel mixer in Transformer-style architectures lacks selectivity in processing input information. In contrast, the Gated Linear Unit (GLU) [
41,
46], has been shown to outperform traditional MLP across various natural language processing tasks. Related work [
47] has shown that integrating gating mechanisms into MLP structures and embedding them within RG Blocks following the Transformer’s abstract architecture leads to significant improvements in visual tasks, suggesting a promising potential in combining the Transformer’s abstract architecture with gating mechanisms. Inspired by these findings, we propose the replacement of the conventional ConvFFN [
48] with ConvGLU [
49], a gating-based architecture enhanced with channel attention. When configured with an expansion ratio
R and a convolution kernel size
, the computational complexities of ConvGLU and ConvFFN are, respectively:
Notably, ConvGLU exhibits lower computational complexity than ConvFFN, while employing a gating mechanism to generate input-specific control signals based on nearby fine-grained features. This design allows LRB to effectively preserve local detail features from the output of DGST, while simultaneously maintaining channel attention capabilities. As a result, it achieves a favorable balance between computational efficiency and feature representation quality.
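A hedged sketch of a ConvGLU-style gated channel mixer is shown below; the expansion ratio, kernel size, and activation follow common practice and are assumptions rather than the exact configuration of the referenced implementation.

```python
# ConvGLU-style mixer: one half of the expanded features becomes a gate via a
# depthwise convolution and activation, then multiplies the other half.
import torch
import torch.nn as nn

class ConvGLUSketch(nn.Module):
    def __init__(self, dim, expansion=2, kernel_size=3):
        super().__init__()
        hidden = dim * expansion
        self.fc1 = nn.Conv2d(dim, hidden * 2, kernel_size=1)              # expand and split
        self.dwconv = nn.Conv2d(hidden, hidden, kernel_size,
                                padding=kernel_size // 2, groups=hidden)  # local gate context
        self.act = nn.GELU()
        self.fc2 = nn.Conv2d(hidden, dim, kernel_size=1)                  # project back

    def forward(self, x):
        v, g = self.fc1(x).chunk(2, dim=1)
        return self.fc2(v * self.act(self.dwconv(g)))                     # input-dependent gating

y = ConvGLUSketch(64)(torch.randn(1, 64, 16, 16))
```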
2.6. Adaptive Multi-Path Attention
The proposed Adaptive Multi-Path Attention (AMPA) is illustrated in
Figure 4c. Given an input feature map $X \in \mathbb{R}^{C \times H \times W}$, inspired by the Squeeze-and-Excitation (SE) [50] block (Figure 4a), which pioneered channel attention mechanisms in visual tasks, we employ global average pooling to improve sensitivity to informative channels. The squeeze operation for the $c$-th channel is formulated as:
$$z_{c} = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} x_{c}(i, j),$$
where $z_{c}$ denotes the squeezed output for channel $c$. Subsequently, a $1 \times 1$ convolution is applied to the squeezed descriptor to generate the channel attention map.
To address the lack of spatial attention modeling, we incorporate coordinate information embedding from the Coordinate Attention (Figure 4b) [51] mechanism. Specifically, we use two spatial pooling kernels of size $(H, 1)$ and $(1, W)$ to aggregate features along the vertical and horizontal directions, respectively, capturing long-range spatial interactions. The outputs for the $c$-th channel at height $h$ and width $w$ are computed as:
$$z_{c}^{h}(h) = \frac{1}{W} \sum_{0 \le i < W} x_{c}(h, i), \qquad z_{c}^{w}(w) = \frac{1}{H} \sum_{0 \le j < H} x_{c}(j, w).$$
Relevant research [52] indicates that the constrained receptive field of $1 \times 1$ convolutional kernels in CA hinders both the modeling of local cross-channel interactions and the effective utilization of contextual information. To mitigate this, we replace the $1 \times 1$ convolutions with $3 \times 1$ shared convolutional layers when processing the concatenated directional features:
$$f = \mathrm{Conv}_{3 \times 1}\left(\mathrm{Concat}\left(z^{h}, z^{w}\right)\right),$$
where $f$ denotes the resulting intermediate feature map along the concatenated spatial dimension $(H + W)$. This spatial feature map is subsequently split into $f^{h}$ and $f^{w}$.
To enhance adaptability to varying inputs, we employ a simple gating mechanism with sigmoid activation. Specifically, a convolutional layer is applied to process and activate the intermediate feature map $f$, generating spatial weights:
$$g = \sigma\left(\mathrm{Conv}(f)\right),$$
where $\sigma$ denotes the sigmoid function. The spatial weights $g$ are divided into $g^{h}$ and $g^{w}$, serving as gating signals for $f^{h}$ and $f^{w}$, respectively. Considering that squeezing spatial information into channels via global pooling alone makes the channel attention inflexible and too coarse-grained, we additionally aggregate the deeply stacked spatial weight information into the channel dimension; channel weights are computed for each channel through spatial squeezing of these weights. The final output is obtained through adaptive fusion of the three attention components. This multi-path architecture enables simultaneous improvement of channel-wise and position-wise feature responses while maintaining computational efficiency.
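The following heavily simplified PyTorch sketch illustrates one possible reading of AMPA (channel squeeze, directional pooling with a shared 3 × 1 convolution, sigmoid gating, and multiplicative fusion of the resulting attention maps); the reduction ratio, activations, and exact fusion rule are assumptions, and the code is not the implementation used in this study.

```python
# Interpretive sketch of an AMPA-like multi-path attention block.
import torch
import torch.nn as nn

class AMPASketch(nn.Module):
    def __init__(self, channels, reduction=8):
        super().__init__()
        mid = max(channels // reduction, 8)
        self.channel_fc = nn.Conv2d(channels, channels, 1)                  # channel attention from GAP
        self.shared = nn.Conv2d(channels, mid, (3, 1), padding=(1, 0))      # shared 3x1 directional conv
        self.gate = nn.Conv2d(mid, mid, 1)                                  # gating-signal generator
        self.expand_h = nn.Conv2d(mid, channels, 1)
        self.expand_w = nn.Conv2d(mid, channels, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        n, c, h, w = x.shape
        # (1) channel path: global average pooling followed by a 1x1 convolution
        a_c = self.sigmoid(self.channel_fc(x.mean(dim=(2, 3), keepdim=True)))
        # (2) spatial path: pool along W and H, concatenate, process with the shared conv
        f_h = x.mean(dim=3, keepdim=True)                  # (n, c, h, 1)
        f_w = x.mean(dim=2, keepdim=True).transpose(2, 3)  # (n, c, w, 1)
        f = self.shared(torch.cat([f_h, f_w], dim=2))      # (n, mid, h+w, 1)
        f = f * self.sigmoid(self.gate(f))                 # sigmoid gating of directional features
        f_h, f_w = torch.split(f, [h, w], dim=2)
        a_h = self.sigmoid(self.expand_h(f_h))                        # (n, c, h, 1)
        a_w = self.sigmoid(self.expand_w(f_w)).transpose(2, 3)        # (n, c, 1, w)
        return x * a_c * a_h * a_w                          # fuse channel and directional attention

y = AMPASketch(64)(torch.randn(1, 64, 32, 32))
```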
2.7. CFPT
To effectively represent and process multi-scale features, feature pyramid networks have been widely integrated into various detection frameworks, and have demonstrated superior performance in handling objects of different scales. The conventional FPN [
53] employs top-down unidirectional information flow to propagate semantic features into shallow layers, which inherently suffers from inevitable information loss due to its constrained single-directional pathways that prevent comprehensive multi-scale feature integration. To mitigate this limitation, PAN [
36] introduces a complementary bottom-up path augmentation.
Building upon this structure, YOLOv8’s neck employs Conv, C2f, and Upsample modules within a PAN architecture. Despite its multi-scale fusion capability, two critical limitations persist. First, the prevalent use of nearest-neighbor interpolation in upsample layers leads to blurred feature maps and progressive semantic misalignment during multi-layer propagation. Second, small object features may be lost or suppressed by larger ones during convolution. Together, these limitations hinder detection performance in behavior-recognition tasks that require accurate multi-scale analysis.
In this work, we utilize a Cross-layer Feature Pyramid Transformer (CFPT) [
37] to address multi-scale feature-fusion challenges. The CFPT framework eliminates upsampling operations, thereby avoiding computational overhead and blurred feature artifacts. It leverages two transformer blocks with linear computational complexity: (1) Cross-layer Channel-wise Attention (CCA), enabling inter-scale channel dependency modeling, and (2) Cross-layer Spatial-wise Attention (CSA), capturing spatial correlations across pyramid levels. By adaptively aggregating features through learnable weights, CFPT mitigates semantic misalignment caused by uniform supervision across scales while maximizing cross-layer feature integration.
As shown in
Figure 2, our framework processes multi-scale inputs
through linear projection via $1 \times 1$ convolutions. This design achieves computational efficiency while establishing channel-dimension unification across feature hierarchies. Assume that the feature map of each scale after the linear projection can be represented as $X_{i} \in \mathbb{R}^{C \times H_{i} \times W_{i}}$, $i = 1, \ldots, S$, where $S$ denotes the number of input layers, $H_{i}$ and $W_{i}$ represent the spatial dimensions at the $i$-th scale, and $C$ indicates the unified channel depth across all scales. The CCA and CSA modules establish global feature dependencies across the channel and spatial dimensions through patch-group interactions, while shortcut branches enhance gradient flow throughout the pyramid hierarchy. The output of this process maintains the identical shape as the corresponding input feature maps.
2.7.1. CCA
The CCA first employs spatial unshuffling to redistribute spatial variations into the channel dimension, reformulating the multi-scale features into channel-aligned maps with a unified resolution. Next, the Overlapped Channel-wise Patch Partition (OCP) organizes the channel-aligned features into hierarchical groups in which adjacent feature maps exhibit channel scaling factors of 4; to enhance cross-group interactions, bias parameters are introduced to construct overlapped adjacent groups, with the overlapping parameters configured empirically for each scale. The CCA then captures global dependencies through multi-head attention across the patch groups: queries, keys, and values are obtained via linear projection matrices, and cross-level interactions are modeled with the aid of a cross-level contextual positional encoding for each layer, using $h$ attention heads. Finally, feature reconstruction is performed using Reverse OCP (ROCP) and spatial shuffling, which merge the multi-scale features and restore the original channel-spatial configuration while ensuring spatial consistency.
2.7.2. CSA
Similarly, the CSA module processes the input features in three steps. First, the Overlapped Spatial-wise Patch Partition (OSP) operates on the channel-aligned features to generate spatial patch groups, with per-scale overlapping parameters controlling the sliding-window overlaps. Cross-layer interactions are then modeled through multi-head attention, where the positional encoding and the number of attention heads $h$ share similar meanings as defined in Section 2.7.1. The final output is obtained by applying Reverse OSP (ROSP) to restore the original hierarchical configurations.
3. Results
3.1. Evaluation Metrics
In this study, the evaluation metrics include Precision (P), Recall (R), Average Precision (AP), Mean Average Precision (mAP), number of parameters (Params), and floating point operations (FLOPs), which are computed as shown in Equations (29)–(33):
$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN},$$
$$AP = \int_{0}^{1} P(R)\,dR, \qquad mAP = \frac{1}{N} \sum_{i=1}^{N} AP_{i},$$
$$\mathrm{FLOPs} = K^{2} \times C_{in} \times C_{out} \times H \times W,$$
where true positives (TP) denotes the count of correctly identified positive samples, false positives (FP) represents negative samples erroneously classified as positive, and false negatives (FN) indicates positive samples incorrectly predicted as negative. $AP_{i}$ refers to the AP score of the $i$-th class, and $N$ denotes the total number of classes in the dataset. Here, $K$ denotes the kernel size, $C_{in}$ and $C_{out}$ represent the number of input and output feature channels, respectively, and $H$ and $W$ denote the spatial dimensions of the output feature map.
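For concreteness, the scalar metrics reduce to the following straightforward computations once per-class AP values have been obtained from the precision-recall curves.

```python
# Detection metrics as defined above; per-class AP values are assumed precomputed.
def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn) if tp + fn else 0.0

def mean_average_precision(ap_per_class: list[float]) -> float:
    return sum(ap_per_class) / len(ap_per_class)

# Example: four behavior classes (eating, lying_on_belly, lying_on_side, standing).
print(mean_average_precision([0.90, 0.86, 0.84, 0.80]))
```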
3.2. Experimental Environment and Parameter Setting
To ensure a fair evaluation of algorithm performance, this study employs identical experimental platforms and hyperparameter settings. The experiments were run on an Ubuntu 22.04 system equipped with an Intel® Xeon® Platinum 8362 CPU (2.80 GHz), two NVIDIA GeForce RTX 3090 GPUs (24 GB each), Python 3.10.14, PyTorch 2.2.2, and CUDA 12.1, with a maximum RAM allocation of 90 GB. The hyperparameters used in the experiments are detailed in
Table 3.
3.3. Comparative Experiments of Different Models
A total of eight models were selected for performance evaluation on the same dataset, with the same experimental setup (see
Section 3.2) and hyperparameter settings (see
Table 3), including Faster R-CNN [
54], FCOS [
55], RT-DETR [
56], YOLOv5n [
57], YOLOv8n [
58], YOLOv10n [
59], YOLOv11n [
60], and MACA-Net.
Table 4 presents the performance of these models, with evaluation metrics encompassing precision, recall, mean average precision (mAP50 and mAP50-95), number of parameters, and floating point operations (FLOPs).
The comparative results in
Table 4 show that Faster R-CNN, FCOS, and RT-DETR have significantly higher FLOPs and parameter counts than the other models. Specifically, the FLOPs of Faster R-CNN, FCOS, and RT-DETR are 33.9G, 20.47G, and 103.4G, respectively, while their parameter counts are 25.5 times, 19.8 times, and 19.7 times that of our proposed MACA-Net. Considering the practical conditions of pig farms, where large-scale facilities often require the monitoring of multiple pens, deploying such relatively large models can significantly increase server costs. In addition, their mAP50 performance lags behind that of MACA-Net, especially for Faster R-CNN and FCOS. These characteristics make it difficult for these three models to operate efficiently in resource-constrained environments.
The proposed MACA-Net demonstrates superior performance compared to the YOLO series models (YOLOv5n, YOLOv8n, YOLOv10n, and YOLOv11n) across multiple metrics. Specifically, MACA-Net achieves 1.62M parameters, 5.3G FLOPs, 83.1% precision, 78.8% recall, 85.1% mAP50, and 55.4% mAP50-95, exhibiting outstanding detection performance with high parameter efficiency and a compact model size. A comparison with YOLOv5n and YOLOv8n reveals that MACA-Net is more lightweight yet still delivers superior detection performance. MACA-Net also outperforms YOLOv10n, which has the fewest parameters among the compared models, with 28.6% fewer parameters, 18.4% fewer FLOPs, and a 7.4% improvement in recall. In addition, MACA-Net has a significant computational-cost advantage over the recently released YOLOv11n. Despite a slight decrease of 0.2% in recall, MACA-Net exhibits superior performance on the other metrics, in particular achieving a 6.6% improvement in precision.
To visually compare the performance of different models, images captured during the day and at night were randomly selected for testing, with predictions generated by RT-DETR, YOLOv8n, YOLOv11n, and the proposed model.
Figure 5 shows the prediction results of various models, demonstrating that the improved model achieves more reliable performance in pig behavior recognition.
In daytime scenarios with severe pig occlusion and stacking, YOLOv8n and YOLOv11n exhibited instances of erroneous behavior detection (highlighted by bright yellow boxes in the figure) and duplicate detection of the same pig (marked with red boxes in the figure). Additionally, RT-DETR also showed cases of incorrect behavior identification. Under low-light nighttime conditions, YOLOv8n and YOLOv11n generated false detections in regions with dispersed pig distributions. In contrast, through network optimization, the improved model effectively preserves local detail features and aggregates contextual information. This enhancement enables more efficient and accurate detection, demonstrating greater robustness compared to existing models.
3.4. Comprehensive Comparison with Baseline YOLOv8n
In this section, we provide a detailed explanation of the improvements of MACA-Net compared to YOLOv8n. In
Figure 6, we compare the performance of the two models on four behaviors (“eating,” “lying_on_belly,” “lying_on_side,” “standing”) across four performance metrics (precision, recall, mAP50, mAP50-95), as well as the average performance of the four behaviors on these metrics. Experimental results demonstrate that MACA-Net consistently outperforms YOLOv8n in all behavioral categories. Detailed quantitative comparisons in
Table 5 reveal that the proposed framework achieves average improvements of 8.9% in precision (74.2% to 83.1%), 4.4% in mAP50 (80.7% to 85.1%), and 3.7% in the more stringent mAP50-95 metric (51.7% to 55.4%) across all behaviors. Notably, the most significant improvement occurs in the “standing” category, where precision increases by 16%, mAP50 by 11.9%, and mAP50-95 by 8.5%. Among the individual categories, detection is strongest for “eating,” which attains 90.3% mAP50, and “lying_on_belly,” which achieves 60.3% mAP50-95.
Constructing subsets under different conditions within the validation set allows for targeted evaluation of the model’s performance in specific scenarios. We created three subsets within the validation dataset: the Standard Condition (SC) subset (365 images), the Low-Light (LL) subset (122 images), and the High Occlusion (HO) subset (198 images). The enhanced model was rigorously evaluated across four key performance metrics (Precision, Recall, mAP50, mAP50-95) on three validation subsets. As demonstrated in
Table 6, MACA-Net consistently outperforms the baseline model across all metrics in every subset. Notably, mAP50 improvements of 9.0% (SC subset), 9.7% (LL subset), and 6.6% (HO subset) were achieved, demonstrating the robustness of the proposed model.
In practical intelligent farming scenarios, centralized management and modular specialization are commonly adopted—where edge devices are primarily responsible for data acquisition, while data processing is typically performed on centralized cloud servers. Consequently, deploying models on cloud platforms has become the predominant approach in such systems. To validate the practical value of the proposed model, we conducted efficiency comparison experiments before and after optimization on a cloud server. The server configuration is detailed in
Section 3.2. The evaluated metrics include the number of parameters, FLOPs, model size, and the per-image time required for preprocessing (Pre-t), inference (Inf-t), postprocessing (Post-t), and total processing (Tot-t). The experimental results are summarized in
Table 7.
Specifically, the improved model achieves a 48.4% reduction in parameters, 39.5% reduction in FLOPs, and 41.9% reduction in model size compared to the baseline, significantly reducing computational demands for server-side deployment. Although the SSM-based MGLE adopts an autoregressive inference strategy via recurrence—leading to a predictable increase in inference time per image—it generates higher-quality outputs, thereby alleviating the burden on postprocessing. As a result, the overall processing time per image remains approximately the same as that of the baseline.
To visualize the improvement of the model, the Gradient-weighted Class Activation Mapping (Grad-CAM) method [
61] was applied to the output layers of YOLOv8n and MACA-Net. Grad-CAM determines the importance weights of feature maps by analyzing gradients of target class scores relative to the final convolutional layer’s feature maps. These weights are then combined with the feature maps through weighted summation to generate heatmaps that highlight regions critical for decision-making.
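A minimal Grad-CAM sketch based on forward/backward hooks is shown below; the model, target layer, and the scalar score function to backpropagate are placeholders to be adapted to the detector’s output structure.

```python
# Minimal Grad-CAM: gradient-weighted sum of feature maps from a chosen conv layer.
import torch
import torch.nn.functional as F

def grad_cam(model, target_layer, image, score_fn):
    feats, grads = [], []
    h1 = target_layer.register_forward_hook(lambda m, i, o: feats.append(o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))

    output = model(image)
    score = score_fn(output)          # scalar, e.g. the top class/objectness score
    model.zero_grad()
    score.backward()
    h1.remove(); h2.remove()

    weights = grads[0].mean(dim=(2, 3), keepdim=True)            # GAP over gradients
    cam = F.relu((weights * feats[0]).sum(dim=1, keepdim=True))  # weighted sum of feature maps
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)    # normalize to [0, 1]
```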
Figure 7 displays the class activation maps of YOLOv8n and the improved model, where deeper red hues indicate higher values corresponding to stronger model responses and greater contributions to final predictions.
In the first comparison group (first row of
Figure 7), while YOLOv8n successfully locates pig regions, its attention remains constrained by the inherent limitations of CNN architectures, capturing only limited local information. In contrast, guided by the Mamba mechanism, MACA-Net effectively establishes global dependencies and captures more detailed regions. The second comparison group (second row) further validates these observations. In particular, MACA-Net shows superior focus on heavily occluded areas with dense pig clusters, while YOLOv8n exhibits deficiencies in handling such challenging scenarios. Under low-light nighttime conditions (third row), MACA-Net shows enhanced attention to critical behavioral features in pigs’ limbs and head, while the baseline model predominantly focuses on torso regions. These visual comparisons comprehensively reveal MACA-Net’s advantages in global perception, fine-grained feature extraction, and robustness to complex scenarios.
3.5. Ablation Experiment
3.5.1. Overall Ablation Experiment of MACA-Net
To validate the effectiveness of individual improvements, we conduct ablation experiments using the controlled variable method, with results detailed in
Table 8.
When the C2f module in the backbone was replaced with MGLE, a slight reduction in recall was observed compared to the baseline YOLOv8n. However, precision and mAP50 improved by 2.5% and 0.6%, respectively, and FLOPs decreased by 16%. Subsequently, the abstract architecture of CFPT was introduced in the neck network, though only the CSA module was implemented. This modification enhanced the model’s spatial aggregation capability, further improving detection precision while yielding the lowest parameter count and FLOPs among all configurations. Finally, the CCA module was incorporated to strengthen multi-scale feature interaction across channels. Notably, MACA-Net achieves the best performance across all experimental configurations in precision (83.1%), recall (78.8%), mAP50 (85.1%), and mAP50-95 (55.4%), establishing comprehensive metric superiority. Compared to MGLE+CSA, which has the smallest parameter count, this performance lead is maintained at a cost of only 0.071M additional parameters.
The experimental observations revealed that the combination of MGLE and CCA demonstrated particularly outstanding performance. Specifically, the MGLE+CCA and MGLE+CCA+CSA (MACA-Net) configurations achieved mAP50 scores of 84.4% and 85.1%, respectively (the top two performers in the experiment), while maintaining a low number of parameters and computational complexity. This indicates that the MGLE module, integrated with spatial Selective Scan mechanisms, effectively enhances spatial modeling capabilities. Furthermore, the CCA module further facilitates multi-path cross-layer interactions across channels. Such a unified framework, combining spatial feature extraction and cross-channel mixing mechanisms, demonstrates clear advantages and significant potential for future development.
Critically, our analysis shows that under the more stringent mAP50-95 performance metric, the MGLE+CCA configuration performed relatively modestly (51.4%), showing a 4% deficit compared to MGLE+CCA+CSA. This suggests that feature pyramid structures relying solely on channel-wise fusion face inherent limitations in handling demanding detection tasks. A comparative evaluation of recall and precision metrics further indicates that MGLE+CCA exhibits degraded performance across both measures compared to MGLE+CCA+CSA, particularly with a notable 4.5% precision reduction. These observations lead us to hypothesize that the absence of multi-scale spatial fusion mechanisms may increase false positive samples, consequently significantly degrading the performance of mAP50-95.
3.5.2. Ablation Experiment of the MGLE
To systematically evaluate the individual contributions of MGLE’s components, we conducted ablation experiments, the results of which are summarized in
Table 9. The key distinctions among variants lie in four design dimensions:
GCB: presence/absence of the Global Contextualization Branch;
LRB: presence/absence of the Local Refinement Branch;
FA: fusion approach between main and auxiliary branches in GCB;
AMPA: whether the Adaptive Multi-Path Attention is applied to GCB’s auxiliary branch.
Table 9 demonstrates that the M9 model, which integrates GCB and LRB with additive integration of AMPA-based attention guidance in GCB, achieves the globally highest mAP50 score while maintaining relatively low computational overhead. This highlights the ensemble’s robust performance and computational efficiency on the swine-recognition dataset. Its high recall and precision underscore MGLE’s superior capability in identifying the majority of positive cases and in suppressing false positives, reflecting its balanced effectiveness in comprehensive detection tasks.
When employing a gating-based fusion strategy in GCB, the model variants M1-M4 (with progressively integrated components) achieved mAP50 scores of 81.6%, 82.6%, 84.0%, and 81.7%, respectively. Notably, despite the introduction of an attention mechanism in M4 compared to M3, M4 exhibited a 2.3% decline in mAP50. This motivated replacing gating with additive fusion in M9, which achieved the highest mAP50.
To rigorously evaluate fusion strategies in MGLE, we conducted ablation studies comparing gated fusion (original) and additive fusion variants in four model pairs: (M1, M5), (M2, M6), (M4, M8), and (M3, M7). The additive approach improved mAP50 by 2%, 1.2%, and 4.4% in the first three pairs but caused a minor degradation in M7 compared to M3. Further analysis revealed that models combining GCB and LRB (M3/M7) underperformed relative to LRB-only counterparts (M8), suggesting that gains from local feature embedding may mask the inherent flaws in the fusion strategy. We speculate that gating mechanisms relying on shallow features could disrupt the global dependencies established by the Mamba block integrated with a selective scan mechanism.
Furthermore, compared to M7 without AMPA, M9 with the attention mechanism achieved improvements of 1.1% in precision, 4.2% in recall, and 2.0% in mAP50, while maintaining the model’s lightweight design. This demonstrates that AMPA effectively suppresses redundant information and improves model focus in the complex environment of pig farms.
3.6. Targeted Experiments for AMPA
Inspired by SE [
50] and CA [
51], we propose AMPA to enhance feature representation in pig behavior recognition. To validate the effectiveness of AMPA, experiments are conducted by replacing AMPA with mainstream attention modules at the same position in MACA-Net. As shown in
Table 10, AMPA demonstrates superior performance compared to existing attention mechanisms, achieving the highest recall, mAP50, and mAP50-95 metrics. Specifically, AMPA outperforms CBAM [
62], CA, ELA [
63], EMA [
52], and SE by improvements of 2.1%, 1.7%, 3.2%, 1.9%, and 1.5% in mAP50, respectively. For mAP50-95, the corresponding improvements reach 1.6%, 2.2%, 2.4%, 2.5%, and 1.5%. Furthermore, the six attention modules exhibit negligible differences in computational cost, with the maximum variation in FLOPs being only 0.1G. These results indicate that the proposed multi-path attention mechanism can effectively integrate both spatial and channel information without introducing significant computational overhead. Overall, the experimental findings demonstrate that our design achieves a favorable balance between accuracy and efficiency, offering clear advantages in object-detection tasks related to animal behavior recognition.
To further assess the robustness of the AMPA mechanism, comparative experiments were conducted across three challenge-specific validation subsets (SC, LL, HO) as detailed in
Section 3.4. The quantitative results summarized in
Table 11 demonstrate that Model 2 (with AMPA) consistently outperformed Model 1 (without AMPA) across all evaluation metrics. Notably, on the LL subset, which represents low-light conditions, Model 2 achieved improvements of 6.0% in precision, 4.5% in recall, 4.8% in mAP50, and 3.6% in mAP50-95. These results demonstrate that AMPA effectively enhances the model’s performance under challenging environmental conditions.
4. Discussion
To effectively detect pig behaviors while maintaining cost efficiency, vision-based approaches for 24-hour monitoring of pig activities have become mainstream, given the critical significance of behavioral data for disease prevention and growth management. In traditional image analysis, previous research [
64] recognized drinking behavior by measuring the distance between pigs’ snouts and drinkers. Another study [
65] proposed a feeding behavior-recognition method for sows based on feeder positions and body orientations. Within deep learning technologies, one study [
66] achieved feeding behavior detection through grayscale video frames and modified GoogLeNet architecture, while another [
67] utilized a convolutional neural network (CNN) combined with long short-term memory (LSTM) to recognize feeding patterns in nursery pigs, demonstrating improved robustness over predecessors. With the maturation of deep learning technologies, deep learning-based methods have emerged as more effective solutions for pig behavior recognition.
Distinct from existing works, this study is dedicated to overcoming detection challenges caused by perspective distortions resulting from non-orthogonal camera installations relative to pig pens. These perspective distortions introduce significant pixel-scale variations of pigs across images, demanding enhanced multi-scale feature extraction and fusion capabilities.
To address this challenge, we propose MACA-Net, which achieves state-of-the-art performance in pig behavior-recognition benchmarks, demonstrating superior capability in handling scale variations and illumination challenges compared to existing methods.
While our developed pig behavior detector, MACA-Net, has achieved success, we acknowledge the limitations of this work.
Figure 8 illustrates representative failure cases. Specifically, in
Figure 8a, a standing pig is simultaneously assigned both “lying_on_belly” and “standing” labels, owing to severe occlusion of its limbs by surrounding pigs and the inherent constraints of single-frame behavior recognition. These factors create ambiguous visual cues, leading to conflicting predictions. To address this challenge, future research will focus on developing temporal sequence-sensitive detection models that extract and fuse behavioral features across multiple time points, enabling dynamic behavior tracking to improve classification robustness.
In
Figure 8b, a piglet lying on its belly is erroneously detected as standing. This error results from the limited viewing angle, under which the hind legs are completely occluded and only part of the forelimbs is visible, producing nearly identical visual patterns for “lying_on_belly” and “standing.” To overcome such viewpoint-related ambiguities, our subsequent work will implement multi-view detection systems that capture complementary spatial information and mitigate single-view limitations. These enhancements aim to improve feature discriminability for posture recognition in complex scenarios.
5. Conclusions
This study proposes a more efficient detector named MACA-Net to identify pig behaviors in group-housed environments, addressing limitations of traditional models including high computational costs, insufficient attention to key porcine features, and challenges in multi-scale behavior recognition under perspective distortions. Specifically, we introduce a triple-branch feature extraction module (MGLE) that effectively models global features while preserving local details. The Global Contextualization Branch (GCB), based on the Mamba architecture, alleviates the restricted receptive field issue inherent in CNN models. To compensate for Mamba’s local-information deficiency in visual tasks, we design a dedicated Local Refinement Branch (LRB) for local feature aggregation. By replacing the conventional C2f modules in YOLOv8n with MGLE, we enhance model performance while reducing the parameter count. Furthermore, we develop an Adaptive Multi-Path Attention (AMPA) mechanism integrating spatial and channel attention weights, which effectively suppresses redundant feature representations to meet the demands of complex farm environments. AMPA demonstrates performance improvements of 1.6%, 2.2%, and 1.5% in mAP50-95 compared to the CBAM, CA, and SE modules, respectively. Finally, we utilize CFPT without upsampling for feature fusion, which mitigates contextual semantic loss and enhances multi-scale feature-processing capability while further reducing model parameters.
Through these improvements, the proposed model reduces parameters by 48.4% and FLOPs by 39.5% compared to the baseline YOLOv8n, while achieving a 4.4% improvement in mAP50 to 85.1%. In comprehensive comparisons with state-of-the-art models including Faster R-CNN, RT-DETR, and YOLOv11n, MACA-Net maintains leading performance in both computational efficiency and detection accuracy. The experimental results demonstrate that MACA-Net enables more efficient swine behavior recognition, providing reliable detection data for farm operators to support health assessment and growth monitoring. Furthermore, this work validates the application potential of Mamba architectures in object-detection tasks, offering valuable references for extending its use to other vision downstream applications.