2.1. Data Acquisition and Processing
The data for this study were sourced from a large-scale farm raising floor-raised yellow-feathered broilers in Jinniuhu Street, Liuhe District, Nanjing City, Jiangsu Province, China (118°52′38″ E, 32°26′54″ N). Each chicken coop on the farm measures 5.0 m in width and 12.5 m in length, with an internal area of 62.5 m², housing approximately 1258 yellow-feathered broilers. In typical large-scale farms, farmers often exceed standard stocking densities to improve economic efficiency, resulting in complex coop environments—dim lighting, uneven floors, and obstacles such as feeders, waterers, sand, and feces.
The objective of on-farm data collection was to acquire images of floor-raised chickens exhibiting different behaviors for the subsequent training of a behavior recognition model. To ensure moderate lighting conditions, image collection was conducted between 10 a.m. and 3 p.m. from December 2022 to March 2024 using a Logitech C930c high-definition camera. During collection, the camera's shooting angle was kept at approximately 90 degrees, with manual adjustments to the handheld position and shooting angle to avoid light interference and cluttered backgrounds that could affect image quality. The shooting height and distance were controlled within 90–130 cm and 1–3 m, respectively. A schematic diagram is shown in Figure 1. Real-time adjustments to the shooting angle and distance were made based on the captured footage to minimize occlusion and excessive distance from the chickens, ensuring high-quality behavioral images and a reliable foundation for constructing the behavior recognition model.
Daily poultry behaviors, such as feeding and standing, are frequently observed and studied. By consulting the literature on poultry behavioral characteristics, poultry diseases, and abnormal animal behaviors [31,32], and combining methods for identifying the health status and abnormal behaviors of yellow-feathered chickens, a standardized classification scheme for their behaviors was developed. As shown in Figure 2, the scheme categorizes behaviors into five types: Pecking, Resting, Walking, Dead, and Inactive. Inactivity is associated with sub-health conditions, such as leg disorders and contact dermatitis on the feet and breast [33].
The definitions of the daily behaviors of chickens are shown in Table 1.
In this study, the annotation tool LabelImg [34] was used to annotate these five behaviors in each image with bounding boxes. This tool allows users to draw rectangular boxes on images and label categories via a graphical interface. These operations are converted into coordinate and category information, saved as annotation files in formats such as PASCAL VOC or YOLO, and provide structured data input for training deep learning models. The core process includes image loading, interactive annotation, and data storage, supporting keyboard shortcuts and batch processing. The underlying implementation is based on Python 3.10 and PyQt6 for cross-platform interaction. To ensure the accuracy of the dataset annotations, agricultural experts supervised and adjusted the annotations during the annotation process, and three rounds of checks by agricultural experts and students were then organized to further verify the original annotations. After the annotation process, the bounding-box and category information for each image was exported to YOLO-format annotation files. As shown in the figure, the class indices 0 to 4 correspond to the behaviors Pecking, Resting, Walking, Dead, and Inactive, respectively. Because training requires a large amount of high-quality data, data augmentation was applied to the experimental images, including translation transformations and added noise. After augmentation, a total of 1565 images were obtained, of which 1230 were original images. The annotated dataset, named ChickenData, was divided into a training set, validation set, and test set, containing 1126, 126, and 313 images, respectively.
2.2. Dual-Backbone Heterogeneous YOLOv11
The target detection system in floor-raised chicken farming faces two key technical challenges:
- (1)
Complex background interference and coexisting multi-scale target individuals significantly impact detection performance. Targets exhibit pronounced multi-scale characteristics due to varying distances from the camera, creating spatial scale heterogeneity that makes traditional detection algorithms struggle to maintain detection consistency across spatial domains.
- (2)
Cluttered farming environment noise conflicts with real-time processing requirements at the device end, especially under limited edge computing resources—the detection accuracy of existing models is significantly degraded by environmental interference.
To address these challenges, this study innovatively proposes the DualHet-YOLO, which is a deep optimization based on the YOLOv11 framework. The architecture incorporates a heterogeneous feature fusion mechanism and multi-scale attention modules specifically tailored to the characteristics of floor-raised chicken farming scenarios.
As shown in Figure 3, the DualHet-YOLO behavior detection model for floor-raised chickens employs a four-stage collaborative architecture. The Front-end Deep Feature Extraction Network generates multi-resolution feature maps through four-level residual modules: 80 × 80 × 256 (shallow details), 40 × 40 × 512 (mid-level semantics), and 20 × 20 × 1024 (high-level abstraction). It achieves resolution dimensionality reduction via strided convolutions and enhances feature representation capability through a channel expansion strategy, completing the progressive mapping of raw images to a high-dimensional semantic space. The Back-end Deep Feature Enhancement Network introduces a CBLinear channel reweighting mechanism, using the CBFuse module to perform cross-layer concatenation and upsampling operations on the multi-level features output by the front-end, constructing a composite feature pyramid that fuses spatial details and deep semantics while integrating visual features from multi-scale receptive fields through a parameterized fusion strategy.
The Eff-HetKConv Multi-scale Feature Fusion Neck adopts a bidirectional feature interaction mechanism with top-down semantic propagation and bottom-up detail supplementation strategies, enabling complementary enhancement of spatial-channel dual perception by integrating low-level edge/texture information with high-level target semantic features.
The TriAxis Unified Detection Head fuses three attention mechanisms (scale-aware, spatial-aware, and task-aware): it adapts to target size variations through dynamic channel weighting, captures pose diversity using deformable convolutions, and achieves collaborative optimization of classification and localization through feature decoupling, thereby realizing cross-layer synchronous detection and precise localization of multi-scale, multi-morphology targets in free-range scenarios. This architecture integrates a trinity collaborative optimization framework of feedforward feature extraction, feedback feature enhancement, and multi-dimensional attention modeling. Under extreme scenario factors such as complex lighting changes, target occlusion, and dense group distribution, it achieves precise capture of chicken posture transformations, movement trajectories, and group interaction behaviors through cross-layer feature progressive optimization and dynamic weight adaptation mechanisms, ultimately yielding a significant improvement in detection accuracy. The structural details and innovations of each module are described in detail below.
2.3. Dual-Path Feature Map Extraction Architecture
In the YOLOv11 object detection model, we introduce a dual-path feature map extraction architecture to enhance the model's capability for feature information extraction and utilization. This architecture is inspired by an in-depth understanding of the information bottleneck problem in deep learning, as well as the application of invertible functions and auxiliary supervision mechanisms. Specifically, the dual-path feature map extraction architecture processes input data through two parallel backbone networks, each responsible for extracting feature information at different levels and semantics.
In practical design, the dual-path feature map extraction architecture is implemented through a series of carefully designed modules. The model begins with an initial convolutional layer to perform preliminary feature extraction and downsampling on the input image. Subsequently, the feature map is divided into two paths, entering two backbone networks, respectively. Each backbone network consists of multiple convolutional layers, ELAN-style modules (e.g., Eff-HetKConv), and feature fusion layers (CBFuse). These modules collaborate to ensure that each backbone network can deeply mine features from different perspectives. The following describes the hierarchical construction steps of the four stages of the dual-path feature map extraction architecture.
The first stage centers on the Front-end Deep Feature Extraction Network, constructing a basic feature extractor with four-level (P1–P5) downsampling using Eff-HetKConv modules to optimize computational efficiency via heterogeneous convolution. The main branch employs an improved Eff-HetKConv module, replacing standard convolutions with heterogeneous convolutions (HetConv). By mixing convolution kernels of different sizes (3 × 3 and 1 × 1), this approach enhances feature diversity while preserving more original information in cross-stage connections. The mathematical expression is shown in Equation (1), where F_1 denotes the output features of the first stage and X represents the input features:

F_1 = EffHetKConv(X) (1)

The decomposition of the HetConv operation is shown in Equation (2), where X denotes the input feature map, and f_3×3 and f_1×1 denote the 3 × 3 and 1 × 1 convolution operations applied to their respective channel subsets X_3×3 and X_1×1:

HetConv(X) = f_3×3(X_3×3) + f_1×1(X_1×1) (2)
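To make the heterogeneous-kernel idea concrete, the following PyTorch sketch uses a common approximation of HetConv: a grouped 3 × 3 convolution (each filter sees 1/P of the input channels with 3 × 3 kernels) plus a pointwise 1 × 1 convolution over all channels, summed as in Equation (2). It is a simplified illustration, not the exact Eff-HetKConv implementation, and the layer sizes in the example are assumptions.

```python
import torch
import torch.nn as nn

class HetConv(nn.Module):
    """Heterogeneous convolution: grouped 3x3 kernels over 1/P of the input
    channels per filter, plus 1x1 kernels covering the remaining channels."""
    def __init__(self, in_channels, out_channels, p=4):
        super().__init__()
        assert in_channels % p == 0, "in_channels must be divisible by P"
        # 3x3 path: each output filter convolves in_channels / P channels.
        self.conv3x3 = nn.Conv2d(in_channels, out_channels, kernel_size=3,
                                 padding=1, groups=p, bias=False)
        # 1x1 path: channel fusion / dimensionality reduction over all channels.
        self.conv1x1 = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)

    def forward(self, x):
        # Sum of the two kernel paths, as in Equation (2).
        return self.conv3x3(x) + self.conv1x1(x)

# Quick shape check with an assumed 80x80 feature map.
x = torch.randn(1, 64, 80, 80)
print(HetConv(64, 128, p=4)(x).shape)  # torch.Size([1, 128, 80, 80])
```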
As shown in Figure 4, the second stage deploys a cross-branch routing mechanism within the Back-end Deep Feature Extraction Network, using the CBLinear module to extract multi-scale features from each level of the front-end network and establish information pathways. The CBLinear module constructs a feature routing mechanism for the auxiliary branch, extracting features from the P5 layer of the main branch and linearly projecting them into five feature groups. This multi-scale feature aggregation strategy significantly enhances the information capacity of the auxiliary branch, providing a rich gradient signal source for subsequent fusion. The mathematical formulation of the second stage is expressed by Equation (3), where X_i denotes the input features of the i-th layer, W_i represents the parameters of the 1 × 1 convolution kernel, and N signifies the number of routing layers:

{R_1, R_2, …, R_N} = Split(W_i ∗ X_i) (3)
In the third stage, the CBFuse module is used to achieve spatially adaptive fusion of features from the two networks, and a channel attention weighting strategy is adopted to enhance the integrity of gradient information. The CBFuse module employs a channel attention mechanism to realize the adaptive fusion of features. The specific implementation process is as follows: first, the current feature F_main of the main branch and the features F_aux of the auxiliary branch are concatenated along the channel dimension; then, channel statistics are generated through global average pooling, and two fully connected layers produce the channel weight vectors w_main and w_aux, with which weighted fusion is performed. This process can be formalized as Equation (4):

F_fused = w_main ⊙ F_main + w_aux ⊙ F_aux (4)

where F_cat = [F_main; F_aux] is the spliced (concatenated) feature from which the weights are computed and ⊙ denotes channel-level multiplication.
In the fourth stage, heterogeneous convolution stacking and channel recalibration are performed on the basis of the fused features. Finally, a highly discriminative feature pyramid containing P3–P5 is output, completing the in-depth refinement and optimization of multi-scale object detection features. The network refines the fused features by stacking Eff-HetKConv modules. These modules utilize the gradient regularization signals provided by the auxiliary branch, effectively suppressing feature degradation. Their effect can be verified by the change in information entropy, represented by Equation (5), where H denotes the information entropy and p_c denotes the feature probability distribution of channel c:

H = −Σ_c p_c log p_c (5)
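As a concrete reading of Equation (5), the sketch below estimates the channel-wise information entropy of a feature map by normalizing per-channel activation energy into a probability distribution; the choice of activation energy as the basis for p_c is an assumption for illustration.

```python
import torch

def channel_entropy(feature: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Information entropy of the per-channel energy distribution.
    feature: (C, H, W) tensor."""
    energy = feature.abs().sum(dim=(1, 2))   # per-channel activation mass
    p = energy / (energy.sum() + eps)        # probability distribution p_c
    return -(p * (p + eps).log()).sum()      # H = -sum_c p_c log p_c

fused = torch.randn(512, 40, 40)
print(channel_entropy(fused))
```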
To enable information interaction and fusion within the dual-path feature map extraction architecture and facilitate the construction of the four-stage network, we introduce routing layers (CBLinear) and feature fusion layers (CBFuse) at specific network levels. Routing layers are responsible for linearly combining or routing feature maps from different levels, providing multi-scale feature inputs for subsequent feature fusion. Feature fusion layers, in turn, deeply integrate features from the two paths to ensure the main branch receives complete semantic information, avoiding information loss and unreliable gradient propagation. Based on this, this section presents the following two innovations.
2.3.1. Reversible Auxiliary Branching via Modular Realization of CBLinear
As shown in the code snippet of Table 2, the cross-branch linear transformation (CBLinear) module constructs five levels of feature routing channels, and each CBLinear layer performs the following core operations. Let the output feature of the k-th layer of the main branch be F_k; the projection process of CBLinear can be formalized as Equation (6):

R_k = W_k ∗ F_k (6)

where W_k is the 1 × 1 convolution kernel parameter and ∗ denotes the channel-by-channel convolution operation. In particular, [9, 1, CBLinear, [[64, 128, 256, 512, 1024]]] denotes the extraction of five groups of features from the 9th layer of the main branch (the P5 output), with 64, 128, 256, 512, and 1024 channels, respectively. This multi-scale compression strategy can be explained by Equation (7):

R_k = ⨁_{i=1}^{5} T_i F_k (7)

where ⨁ denotes channel-dimension splicing and T_i is a low-rank projection matrix whose parameters are learned through end-to-end training.
CBLinear assumes the role of a gradient distributor in backpropagation. Let the total loss function be L_total; the gradient reaching F_k through the auxiliary branches can be computed as Equation (8):

∂L_total/∂F_k = Σ_{i=1}^{N} (∂L_total/∂R_i)(∂R_i/∂F_k) (8)

where N is the number of connected auxiliary branches. This multi-path gradient backpropagation mechanism effectively mitigates the gradient vanishing problem.
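To complement the code snippet referenced in Table 2, the following is a minimal PyTorch sketch of a CBLinear-style routing layer consistent with the configuration [9, 1, CBLinear, [[64, 128, 256, 512, 1024]]]: a single 1 × 1 convolution projects the P5 feature and the result is split into the listed channel groups. The class and variable names are illustrative, not the exact project code.

```python
import torch
import torch.nn as nn

class CBLinear(nn.Module):
    """Project one backbone feature into several channel groups that are
    routed to the auxiliary branch (Equations (6) and (7))."""
    def __init__(self, in_channels, out_channel_list):
        super().__init__()
        self.out_channel_list = out_channel_list
        # One 1x1 convolution produces all routed channels at once.
        self.proj = nn.Conv2d(in_channels, sum(out_channel_list), kernel_size=1)

    def forward(self, x):
        # Split the projection into the per-level feature groups R_1 ... R_N.
        return self.proj(x).split(self.out_channel_list, dim=1)

p5 = torch.randn(1, 1024, 20, 20)                      # P5 output of the main branch
routes = CBLinear(1024, [64, 128, 256, 512, 1024])(p5)
print([r.shape[1] for r in routes])                    # [64, 128, 256, 512, 1024]
```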
2.3.2. Dynamic Weighting and Spatial Attention Mechanisms for CBFuse Modules
The dynamic weight generation process of the CBFuse module is shown in Figure 5. Given the main branch feature F_main and N auxiliary features F_aux^(1), …, F_aux^(N), the fusion process is divided into three steps: feature splicing, channel statistics generation, and dynamic weight calculation.

Step 1: Feature splicing, described by Equation (9):

F_cat = Concat(F_main, F_aux^(1), …, F_aux^(N)) (9)

Step 2: Channel statistics generation, described by Equation (10):

z = GAP(F_cat) (10)

Step 3: Dynamic weight calculation, described by Equation (11):

w = σ(W_2 δ(W_1 z)) (11)

where GAP denotes global average pooling, W_1 and W_2 are fully connected layer parameters, δ is the ReLU activation, and σ is the Sigmoid function. The compression ratio r is set to 16 to balance the computational complexity. In the specific implementation, CBFuse uses grouped convolution to improve efficiency, as shown in the code snippet of Table 3.
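To complement the code snippet referenced in Table 3, the sketch below implements the three fusion steps described above (concatenation, global average pooling, and sigmoid-gated channel weights with compression ratio r = 16). It is a schematic reading of Equations (9)–(11) that assumes all branches carry the same channel count; it is not the exact CBFuse code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CBFuse(nn.Module):
    """Channel-attention fusion of the main feature with auxiliary features."""
    def __init__(self, channels, n_branches, r=16):
        super().__init__()
        total = channels * (n_branches + 1)
        self.fc1 = nn.Linear(total, total // r)   # W1, followed by ReLU (delta)
        self.fc2 = nn.Linear(total // r, total)   # W2, followed by Sigmoid (sigma)
        self.channels = channels

    def forward(self, main, aux_list):
        # Step 1: resize auxiliary features and concatenate along channels (Eq. 9).
        aux = [F.interpolate(a, size=main.shape[-2:], mode="nearest") for a in aux_list]
        cat = torch.cat([main] + aux, dim=1)
        # Step 2: channel statistics via global average pooling (Eq. 10).
        z = cat.mean(dim=(2, 3))
        # Step 3: dynamic channel weights and weighted fusion (Eq. 11).
        w = torch.sigmoid(self.fc2(F.relu(self.fc1(z))))
        weighted = cat * w.unsqueeze(-1).unsqueeze(-1)
        # Sum the weighted groups back to the main-branch channel count.
        groups = weighted.split(self.channels, dim=1)
        return torch.stack(groups, dim=0).sum(dim=0)

out = CBFuse(256, n_branches=2)(torch.randn(1, 256, 40, 40),
                                [torch.randn(1, 256, 20, 20),
                                 torch.randn(1, 256, 40, 40)])
print(out.shape)  # torch.Size([1, 256, 40, 40])
```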
To further improve the performance of small target detection, we introduce a spatial attention mechanism in the high-level CBFuse layers, where A_s denotes the spatial attention map defined in Equation (12) and is applied element-wise to the fused feature.
2.4. YOLOv11 Efficient Heterogeneous Kernel Convolution
In the field of target detection, floor-raised chicken behavior recognition demands both real-time performance and accuracy, and the C3K2 structure of the YOLOv11 model has limitations in computational efficiency and feature expression ability. In this paper, we propose the Eff-HetKConv structure, which integrates the advantages of heterogeneous convolutional kernels, reduces computational complexity, and improves the efficiency of feature extraction, providing a highly efficient solution for the recognition of floor-raised chicken behaviors.
2.4.1. Deficiencies in the C3K2 Structure
In the floor-raised chicken behavior recognition scenario, the C3K2 structure in the YOLOv11 model has obvious deficiencies that restrict its practical application. The C3K2 structure has a bottleneck in computational efficiency: despite its optimization of feature extraction, it still consumes considerable computational resources and time when dealing with large-scale data and complex scenarios, making it difficult to meet real-time requirements. In terms of the flexibility and diversity of feature extraction, the C3K2 structure mainly extracts features through two convolutional layers, and this fixed structure struggles to adapt to the diversified feature requirements of different scenarios; it cannot fully capture the complex features of the diverse behaviors of floor-raised chickens, which affects the accuracy of behavior recognition. In addition, the C3K2 structure suffers from parameter redundancy and low parameter utilization efficiency, which increases the storage and computation overhead of the model and limits its wide application on resource-limited devices.
2.4.2. Eff-HetKConv Structure Principle
To address the issues in the C3K2 structure, we propose an improved Eff-HetKConv structure. The core principle of Eff-HetKConv is to construct convolutional layers using heterogeneous convolution kernels. Specifically, in a single convolutional layer, convolution kernels of different sizes, such as 3 × 3 and 1 × 1 kernels, are used simultaneously. By reasonably allocating the usage ratios of these two types of convolution kernels across channels, we can retain the ability of 3 × 3 convolution kernels to capture local spatial features and utilize 1 × 1 convolution kernels to reduce the computational cost while performing feature fusion and dimensionality reduction between channels.
Compared with the C3K2 structure, Eff-HetKConv not only has an advantage in computational efficiency but is also more powerful in feature extraction. The 3 × 3 convolution kernels are responsible for capturing local spatial correlations, while the 1 × 1 convolution kernels can recombine and weight features in the channel dimension, enabling the model to learn more discriminative feature representations. This combination allows Eff-HetKConv to extract richer and more diverse features at different feature levels, which helps improve the model’s ability to recognize target behaviors in complex scenarios.
Figure 6 shows the difference between the standard filter and the Eff-HetKConv filter.
Eff-HetKConv achieves computational optimization by mixing convolutional kernels of different scales. For a convolutional layer with M input channels and N output channels, each output filter uses heterogeneous kernels with part ratio P: 3 × 3 kernels for a fraction 1/P of the input channels and 1 × 1 kernels for the remaining (1 − 1/P). The FLOPs are given by Equation (13):

FLOPs_EffHet = D_o² · N · M · (9/P + (1 − 1/P)) (13)

where D_o is the output feature map size and 9 corresponds to the number of parameters in a 3 × 3 kernel. For comparison, the computational complexity of the traditional C3K2 architecture in the scenario of two stacked 3 × 3 convolutional layers is described by Equation (14):

FLOPs_C3K2 ≈ 2 · D_o² · N · M · 9 (14)

The speed improvement ratio is shown in Equation (15):

FLOPs_C3K2 / FLOPs_EffHet ≈ 18 / (9/P + 1 − 1/P) (15)
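Under the reconstructed forms of Equations (13)–(15), the short sketch below computes the per-layer FLOPs of a heterogeneous layer and the resulting speed-up over two stacked 3 × 3 convolutions; the concrete layer sizes are illustrative assumptions.

```python
def hetconv_flops(m, n, d_out, p):
    """FLOPs of one heterogeneous layer: 3x3 kernels on M/P channels,
    1x1 kernels on the remaining channels (Equation (13))."""
    return d_out ** 2 * n * (9 * m / p + m * (1 - 1 / p))

def c3k2_like_flops(m, n, d_out):
    """Two stacked standard 3x3 convolutions (Equation (14))."""
    return 2 * d_out ** 2 * n * m * 9

m, n, d_out, p = 256, 256, 40, 4
speedup = c3k2_like_flops(m, n, d_out) / hetconv_flops(m, n, d_out, p)
print(f"speed-up ratio ~= {speedup:.1f}x")  # 18 / (9/P + 1 - 1/P) = 6x for P = 4
```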
In the floor chicken behavior recognition task, the Eff-HetKConv structure exhibits multiple advantages that make it a more suitable convolutional structure for this task.
The Eff-HetKConv structure demonstrates significant advantages in the task of floor-raised chicken behavior recognition. By employing heterogeneous convolution kernels and reducing the usage ratio of large kernels, this structure substantially decreases the computational load and inference time, meeting the demand for real-time processing of massive data and providing timely and effective support for breeding management. In terms of feature extraction, Eff-HetKConv integrates the advantages of different-sized convolution kernels, enabling it to capture local spatial features while obtaining more representative feature representations through channel fusion, thereby improving the accuracy of behavior recognition. For example, during chickens’ foraging behavior, 3 × 3 kernels capture local action features, while 1 × 1 kernels integrate channel features to highlight behavior-relevant dimensions. Additionally, Eff-HetKConv reduces parameter redundancy by rationally allocating the use of convolution kernels across channels, making the model more compact and facilitating deployment on resource-constrained devices—thereby lowering hardware costs and deployment complexity. Overall, through optimizing computational efficiency, enhancing feature extraction capability, and improving model compactness, Eff-HetKConv provides an efficient, accurate, and practical solution for floor-raised chicken behavior recognition.
2.4.3. Co-Optimization of Two-Way Feature Map Extraction Architecture with Eff-HetKConv
Deformable Convolution is introduced before the CBLinear module to spatially align the feature maps of the auxiliary branch with those of the main branch. Since the heterogeneous convolution kernels (3 × 3 and 1 × 1) of Eff-HetKConv may lead to inconsistencies in the receptive fields, spatial alignment is a critical step to ensure the accuracy of feature fusion. This problem can be effectively addressed by dynamically adjusting the spatial distribution of the feature map through deformable convolution. Let F_aligned denote the aligned feature map, F_aux the feature map of the auxiliary branch, and DeformConv the deformable convolution operation; their relationship is described by Equation (16):

F_aligned = DeformConv(F_aux) (16)
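A minimal sketch of the alignment step in Equation (16) is shown below, using torchvision's deformable convolution; the offset-prediction layer, channel sizes, and module name are assumptions for illustration.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class AuxAlign(nn.Module):
    """Spatially align auxiliary-branch features before CBLinear routing."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        # Predict 2 offsets (x, y) per kernel sampling point from the feature itself.
        self.offset = nn.Conv2d(channels, 2 * kernel_size * kernel_size,
                                kernel_size, padding=pad)
        self.deform = DeformConv2d(channels, channels, kernel_size, padding=pad)

    def forward(self, f_aux):
        # F_aligned = DeformConv(F_aux), Equation (16).
        return self.deform(f_aux, self.offset(f_aux))

aligned = AuxAlign(512)(torch.randn(1, 512, 40, 40))
print(aligned.shape)  # torch.Size([1, 512, 40, 40])
```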
To balance computational efficiency and feature expression ability, a computation-aware weight attenuation factor is introduced. This factor scales the weight decay according to the relative computational complexity of Eff-HetKConv and C3K2, preventing the optimizer from over-emphasizing the more computation-intensive branch during training. Let λ denote the weight attenuation factor, and FLOPs_EffHet and FLOPs_C3K2 denote the floating-point operations of Eff-HetKConv and C3K2, respectively; the calculation of λ is described by Equation (17).
Since the heterogeneous convolutional kernel of Eff-HetKConv may lead to large differences in the gradients of different branches, a two-branch gradient normalization strategy is designed. Normalizing the gradient magnitude ensures that the contributions of the main and auxiliary branches to the loss function are balanced during the training process, which improves the stability of training and convergence speed.
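The dual-branch gradient normalization described above can be realized in several ways; the sketch below shows one simple option, rescaling the auxiliary-branch gradients to match the global gradient norm of the main branch after each backward pass. The branch grouping and the choice of rescaling target are assumptions, not the exact strategy used in DualHet-YOLO.

```python
import torch

def normalize_branch_gradients(main_params, aux_params, eps=1e-12):
    """Rescale auxiliary-branch gradients so their global L2 norm matches
    the main branch, balancing both contributions to the loss."""
    main_params, aux_params = list(main_params), list(aux_params)

    def grad_norm(params):
        grads = [p.grad for p in params if p.grad is not None]
        return torch.norm(torch.stack([g.norm() for g in grads])) if grads else None

    g_main, g_aux = grad_norm(main_params), grad_norm(aux_params)
    if g_main is None or g_aux is None:
        return
    scale = g_main / (g_aux + eps)
    for p in aux_params:
        if p.grad is not None:
            p.grad.mul_(scale)

# Typical use after loss.backward() and before optimizer.step():
# normalize_branch_gradients(model.main_branch.parameters(),
#                            model.aux_branch.parameters())
```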
With the above adaptation strategy, Eff-HetKConv and the dual-backbone feature extraction architecture achieve a synergistic optimization, which retains the advantages of heterogeneous convolutional kernels in feature extraction and ensures the computational efficiency and training stability of the model. These improvements provide a more efficient and reliable solution for target detection tasks in complex scenes.
2.5. TriAxis Unified Detection Head
The initial detection head of the YOLOv11 model shows remarkable efficiency and effectiveness but has certain limitations in floor-raised chicken behavior recognition. These limitations mainly appear in its inadequate handling of objects at different scales, limited ability to capture spatial relationships, and inflexible adaptation to specific tasks. These issues affect the model’s detection accuracy and robustness in complex situations of floor-raised chicken behavior. To solve these problems, we put forward a new object detection head: the TriAxis Unified Detection Head. It combines three attention mechanisms—scale-aware, spatial-aware, and task-aware—into one detection head. This design helps the model better deal with the complexities of floor-raised chicken behavior recognition. The structure of the TriAxis Unified Detection Head is described below.
The TriAxis Unified Detection Head builds a unified framework by combining three attention mechanisms: scale-aware, spatial-aware, and task-aware. These mechanisms are applied in sequence to the feature tensor to strengthen its representation, thus enhancing the accuracy of target detection.
Given a feature tensor F ∈ R^(L×S×C), where L stands for the number of feature levels, S stands for the spatial dimensionality (height × width), and C stands for the number of channels, the TriAxis Unified Detection Head applies the following three attention functions consecutively. The relationship between the three attention functions is shown in Equation (18):

W(F) = π_C(π_S(π_L(F) · F) · F) · F (18)
- (1)
Scale-aware attention (π_L): focuses on the feature level dimension, dynamically fuses features at different scales, and adjusts the weights of features at each level based on semantic importance.
- (2)
Space-aware attention (π_S): applied to the spatial dimension, learns discriminative representations of different spatial locations, helps the model focus on relevant regions, and better captures geometric transformations and spatial configurations of objects.
- (3)
Task-aware attention (π_C): operating on the channel dimension, directs various feature pathways to prioritize distinct tasks. This allows the model to adjust resource distribution based on input, catering to diverse detection requirements such as categorization, bounding box localization, and key point identification.
2.5.1. The Triple Attention Mechanism of Scale, Space, and Tasks
The equation for the scale-aware attention module is shown as Equation (19):

π_L(F) · F = σ( f( (1 / (S·C)) Σ_{S,C} F ) ) · F (19)

where f(·) is modeled as a linear function using a 1 × 1 convolutional layer and σ(x) = max(0, min(1, (x + 1)/2)) serves as a hard sigmoid function. This module learns the relative importance of different semantic layers to enhance the feature representation of objects at different scales.
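A sketch of the scale-aware term in Equation (19) is given below: global pooling over space and channels, a 1 × 1 convolution as the linear function f, and a hard sigmoid producing one weight per pyramid level. The tensor layout (B, L, S, C) follows the notation above, with the pyramid levels assumed to be resized to a common resolution; layer sizes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaleAwareAttention(nn.Module):
    """pi_L: one dynamic weight per feature level (Equation (19))."""
    def __init__(self):
        super().__init__()
        self.f = nn.Conv2d(1, 1, kernel_size=1)               # linear function f(.)

    def forward(self, feat):                                  # feat: (B, L, S, C)
        pooled = feat.mean(dim=(2, 3))                        # average over space and channels
        w = F.hardsigmoid(self.f(pooled[:, None, :, None]))   # (B, 1, L, 1) level weights
        return feat * w[:, 0, :, :, None]                     # reweight each level

feat = torch.randn(2, 3, 80 * 80, 256)   # 3 levels, flattened spatial dim, 256 channels
print(ScaleAwareAttention()(feat).shape)  # torch.Size([2, 3, 6400, 256])
```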
The space-aware attention module is split into two phases, as in Equation (20):

π_S(F) · F = (1/L) Σ_{l=1}^{L} Σ_{k=1}^{K} w_{l,k} · F(l; p_k + Δp_k; c) · Δm_k (20)

where K represents the number of sparsely sampled locations, p_k + Δp_k denotes the location shifted by the self-learned spatial offset Δp_k, and Δm_k signifies the self-learned importance weight at position p_k. This module concentrates on distinctive areas and flexibly aggregates features from different levels at the same spatial position.
The equation for the task-aware attention module is shown as Equation (21):

π_C(F) · F_c = max( α¹(F) · F_c + β¹(F), α²(F) · F_c + β²(F) ) (21)

where F_c is the feature slice of the c-th channel and θ(·) = [α¹, β¹, α², β²]ᵀ is a hyperfunction that learns to control the activation thresholds. The module dynamically switches the feature channels to adapt to different tasks, improving the model's ability to adapt to different detection demands.
2.5.2. Full-Dimensional Dynamic Triple Focus Module
As shown in Figure 6, the OmniDyna TriFocus Block is the core component of the TriAxis Unified Detection Head. It is a composite structure based on multi-dimensional attentional synergy, whose design revolves around the 3D tensor output from the feature pyramid and dynamically refines it along the hierarchical (level), spatial, and channel dimensions, respectively.
In the scale-aware attention module, the system first compresses the information in the spatial and channel dimensions through global average pooling to generate a feature vector representing the importance of each hierarchical level. Subsequently, a lightweight 1 × 1 convolutional layer is used to learn the correlation weights between hierarchical levels, and a hard Sigmoid function is applied to constrain the weights to the range of (0, 1). This process enables the model to dynamically allocate the contribution of feature levels according to the target size. For example, it enhances the representation ability of shallow features for small targets while suppressing the redundant responses of deep features for large targets.
The spatial-aware attention module achieves dynamic receptive field adjustment through a deformable convolution mechanism. Based on the standard 3 × 3 convolution kernel, this module predicts the offset parameters (Δp) using the middle-layer features, guiding the sampling points of the convolution kernel to adaptively shift towards the key deformation regions of the target (such as vehicle tires and animal limbs). Meanwhile, it assigns different weights to the sampling points by combining them with the mask parameters (Δm). In particular, after calculating the independent offset for features at different levels, the module forms the final spatial attention map through cross-layer weighted aggregation, effectively capturing the spatial context associations of multi-scale targets.
The final task-aware attention module constructs a non-linear mapping relationship in the channel dimension through a fully connected layer. First, it generates two sets of learnable affine transformation parameters (α, β). Then, it performs linear transformation and maximum value fusion operations on the channel features. This design allows the module to dynamically select and enhance key channel features according to different task requirements, such as classification and regression. For example, it enhances the responses of channels with strong semantic discrimination while suppressing the interference of noise channels.
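Following the description above (and Equation (21)), the sketch below uses a small fully connected network to predict two affine parameter pairs (α¹, β¹) and (α², β²) per channel and takes the maximum of the two affine responses. The pooling choice, reduction ratio, and absence of parameter normalization are simplifying assumptions.

```python
import torch
import torch.nn as nn

class TaskAwareAttention(nn.Module):
    """pi_C: dynamic, per-channel activation control (Equation (21))."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        # Hyperfunction theta: pooled features -> (alpha1, beta1, alpha2, beta2) per channel.
        self.theta = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, 4 * channels),
        )

    def forward(self, feat):                              # feat: (B, C, H, W)
        b, c, _, _ = feat.shape
        params = self.theta(feat.mean(dim=(2, 3)))        # (B, 4C)
        a1, b1, a2, b2 = params.view(b, 4, c, 1, 1).unbind(dim=1)
        # Maximum of the two learned affine transforms of each channel slice F_c.
        return torch.max(a1 * feat + b1, a2 * feat + b2)

x = torch.randn(1, 256, 40, 40)
print(TaskAwareAttention(256)(x).shape)  # torch.Size([1, 256, 40, 40])
```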
Figure 7 illustrates the integrated application of the OmniDyna TriFocus Block in a single-stage object detector through a clear modular design. The upper region shows the cascaded arrangement of multiple focus modules (e.g., TL, TS, TC), with arrows annotating the feature flow direction to form a serial processing pipeline for the “scale-aware–spatial-aware–task-aware” attention mechanisms. Within each head block, simplified symbols (e.g., π) represent the attention computation process, while lateral connections between hierarchical levels enable multi-stage feature optimization. The overall architecture presents a progressive enhancement path from basic feature input to multi-task output, demonstrating a systematic integration of attention mechanisms for hierarchical feature refinement.
The right region corresponds to the specific task distribution modules of the focus block, including the Classifier, Center Regressor, Box Regressor, and Keypoint Regressor. The figure uses a vertically aligned layout to visually present the mapping between outputs of different head blocks and task modules: for example, the scale-aware head block labeled “TL” primarily serves the box regression task, while the task-aware head block “TC” preferentially associates with the Classifier. This design reflects the directional adaptation characteristics of different attention mechanisms to detection tasks while retaining the efficiency of feature sharing.
From a data flow perspective, basic features first undergo cascaded processing by the left-side focus modules, sequentially completing multi-scale feature fusion, spatial deformation modeling, and channel semantic filtering. The optimized features are then distributed to the right-side task modules through a branching structure. For instance, the Keypoint Regressor receives feature inputs from both the spatial-aware (TS) and task-aware (TC) head blocks, capturing both geometric offset information of target local regions and enhancing channel responses strongly correlated with keypoint semantics. The Classifier, in contrast, heavily relies on discriminative channel features refined by the task-aware head block.
Through dynamic computation of attention weights, different dynamic modules enable the detector to flexibly balance the needs for multi-scale target detection, dense spatial localization, and complex semantic understanding. This approach maintains the efficiency of single-stage detectors while significantly improving detection accuracy for deformable targets like floor-raised chickens through multi-dimensional attention mechanisms.
2.6. Proportional Scale IoU: Adaptive Scale Perceptual Loss Function for Behavior Recognition of Floor-Raised Chickens
In the task of daily behavior recognition for floor-raised chickens, object detection models must accurately capture multi-scale and multi-morphology target features of flocks in different behavioral states (e.g., walking, pecking, fighting). However, traditional IoU-based loss functions in YOLOv11 (such as CIoU and SIoU) primarily focus on the geometric relationships between predicted and ground-truth boxes, failing to fully exploit the impact of inherent target attributes (e.g., aspect ratio, absolute scale) on the regression process. To address this issue, this section proposes the Proportional Scale IoU (Pro-Scale IoU) loss function, which significantly improves the behavior recognition accuracy of the lightweight model DualHet-YOLO in complex scenarios by introducing a morphological proportional factor and a scale-adaptive factor derived from the target box.
The core concept of Pro-Scale IoU stems from in-depth observations of behavior recognition scenarios. First, the behavioral patterns of floor-raised chickens exhibit significant morphological differences: for example, wing-spreading behavior presents a flat, elongated morphology with an aspect ratio > 1, while standing behavior shows an approximately square contour. Second, the target scales of different behaviors span a wide range (e.g., the area of a full-body detection box can be over six times that of a local action box). Traditional loss functions do not explicitly distinguish these characteristics during regression, leading to systematic biases in the localization of morphology-sensitive behaviors. Pro-Scale IoU solves this problem through a dual-path adaptive mechanism: the Proportional Factor dynamically adjusts coordinate regression weights based on the aspect ratio of ground-truth boxes, enabling the model to focus more on offset correction in the long-edge direction; the Scale Factor introduces non-linear scaling based on the target's absolute size, enhancing regression sensitivity for small targets. After the Pro-Scale IoU module is embedded into YOLOv11's regression head, it generates spatially adaptive loss surfaces by parsing the morphological and scale features of GT boxes in real time, guiding the network to prioritize the optimization of localization errors in key dimensions.
The advantages of Pro-Scale IoU in floor-raised chicken behavior recognition are reflected in four aspects: First, it improves localization accuracy—by introducing proportional factors and geometric constraints, it provides effective gradients even when there is no overlap between the target and predicted boxes, alleviating the gradient vanishing issue of traditional IoU and reducing misdetections and omissions. Second, it enhances multi-scale adaptability—dynamically adjusting loss weights balances the detection performance for targets of different sizes (e.g., chicks vs. adult chickens), avoiding missed small targets and misaligned large targets. Third, it improves generalization ability—geometric feature modeling endows the model with robustness against complex scenarios such as lighting changes and occlusions, ensuring stable recognition in real farming environments. Fourth, it accelerates model convergence—the loss function, which integrates target scale and shape, provides a clearer optimization direction, shortening the training cycle compared to traditional methods and facilitating rapid iterative deployment. Through refined spatial relationship modeling, this loss function balances detection accuracy and efficiency, providing a reliable technical foundation for the automated management of floor-raised chicken behavior analysis.
The core innovation of Pro-Scale IoU lies in the establishment of a dual-domain resolution mechanism for morphological scale and target scale. The morphological scale factor calculation path is shown in Equation (22), where w_gt and h_gt denote the width and height of the ground-truth box, respectively, and γ is the morphological sensitivity coefficient. When the target width-to-height ratio w_gt/h_gt > 1, the factor exceeds 1 and the model increases the coordinate error weight in the width direction (the long side). The scale-adaptive factor calculation path is shown in Equation (23), where A_gt is the target area, A̅ is the average target area of the dataset, and β is the scale gain factor. This factor produces a loss amplification effect for small targets.
The Pro-Scale IoU loss function is reconstructed by embedding the above factors into the SIoU framework. The reconstruction process is shown in Equation (24).
The calculation method for the morphological correction term is described by Equation (25), where w_c and h_c are the width and height of the minimum enclosing box. This design gives a higher penalty weight to coordinate deviations in the long-side direction (e.g., the width direction for wing-spreading behavior).
The calculation method for the scale correction term is shown in Equation (26). When detecting small targets, μ_s significantly increases the contribution of the aspect error.
Within DualHet-YOLO, the Pro-Scale IoU is integrated into the regression head through three layers, illustrated by the sketch that follows:
GT Feature Extraction Layer: real-time parsing of each target's width w_gt, height h_gt, and area A_gt from the annotated data, with parallel computation of the shape ratio coefficients and the scale factor μ_s.
Dynamic Weight Fusion Layer: injecting the shape ratio coefficients into the coordinate regression branch to achieve per-anchor weight allocation via a matrix broadcasting mechanism, while performing a Hadamard product between μ_s and the width–height loss terms to realize scale-aware enhancement.
Differentiable Loss Calculation Layer: employing automatic differentiation technology to seamlessly integrate the composite loss term into the model's backward propagation pipeline, ensuring stable convergence during training.
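Because the exact forms of Equations (22)–(26) are defined in the equations rather than reproduced here, the following is only a schematic sketch of the three layers above, with simple assumed forms for the proportional factors and the scale factor (power functions of the aspect ratio and of the relative area); it is not the Pro-Scale IoU implementation itself.

```python
import torch

def pro_scale_iou_loss(pred, gt, avg_area, gamma=0.5, beta=0.5, eps=1e-7):
    """Schematic Pro-Scale IoU: an IoU loss reweighted by shape-proportional and
    scale-adaptive factors. Boxes are (x1, y1, x2, y2); factor forms are assumed."""
    # --- GT feature extraction layer: width, height, area of the ground truth ---
    w_gt, h_gt = gt[:, 2] - gt[:, 0], gt[:, 3] - gt[:, 1]
    area_gt = w_gt * h_gt
    rho_w = (w_gt / (h_gt + eps)).clamp(min=eps) ** gamma   # long-side emphasis (assumed form)
    rho_h = (h_gt / (w_gt + eps)).clamp(min=eps) ** gamma
    mu_s = (avg_area / (area_gt + eps)) ** beta             # amplifies small targets (assumed form)

    # --- plain IoU term ---
    lt = torch.max(pred[:, :2], gt[:, :2])
    rb = torch.min(pred[:, 2:], gt[:, 2:])
    inter = (rb - lt).clamp(min=0).prod(dim=1)
    area_pred = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    iou = inter / (area_pred + area_gt - inter + eps)

    # --- dynamic weight fusion layer: width/height errors reweighted per anchor ---
    w_err = rho_w * (pred[:, 2] - pred[:, 0] - w_gt).abs() / (w_gt + eps)
    h_err = rho_h * (pred[:, 3] - pred[:, 1] - h_gt).abs() / (h_gt + eps)
    return (1.0 - iou) + mu_s * (w_err + h_err)

pred = torch.tensor([[10., 10., 50., 30.]])
gt = torch.tensor([[12., 11., 52., 29.]])
print(pro_scale_iou_loss(pred, gt, avg_area=1500.0))
```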
In the backpropagation stage, the gradient computation of the Pro-Scale IoU exhibits spatial anisotropy. Taking the width direction as an example, the corresponding gradient is shown in Equation (27).
The equation shows that, when dealing with targets with large aspect ratios, the morphology-sensitive term causes the network to preferentially correct the width error, while the scale-enhancing term produces a larger gradient magnitude and accelerates the model convergence when encountering small-scale targets.