1. Introduction
Broiler production plays a pivotal role in the global supply of animal protein. With continued population growth and accelerated urbanization, broiler chickens have become a primary source of animal protein because of their high productivity and low production cost [1]. The broiler industry not only contributes to food security and market supply but also plays an important role in promoting agricultural economic development and increasing farmers' incomes [2,3]. In recent years, driven by the rapid expansion of farming scale and growing demand for intelligent management, precision poultry farming has become a major development direction [4]. This approach integrates modern sensing and information technologies to enable automated and fine-grained monitoring of flocks, aiming to improve production efficiency while safeguarding animal welfare [5].
In precision farming scenarios, comprehensive collection of individual-level information supports the formulation of scientific breeding strategies, enables early disease warning, and effectively reduces systemic risks during production [4,6,7]. Precise detection of individual broilers within a flock is the essential first step toward acquiring such detailed individual information [8]. Previous studies have shown that the facial region of chickens exhibits relatively stable texture and morphological patterns and therefore serves as a key anatomical site for individual identification and health monitoring [8,9]. Accordingly, chicken face detection not only provides a feasible pathway for identity recognition but also constitutes a core component of intelligent visual perception systems.
However, high-density rearing environments are often characterized by severe occlusion among individuals, frequent pose variation, and complex illumination conditions [10], all of which pose substantial challenges to chicken face detection [11,12]. At the same time, traditional manual observation and video-replay methods are inefficient and costly, and they struggle to meet the real-time, large-scale processing requirements of modern farms [13]. These limitations have accelerated the adoption of deep learning-based object detection techniques for avian monitoring.
In the object detection field, the YOLO series has been widely applied to poultry monitoring because of its favorable trade-off between speed and accuracy [14,15]. Some studies have employed YOLO-based convolutional neural networks to detect broiler heads and to estimate feeding duration by detecting head entry into feeders [16]. Another study built a broiler leg disease detection model using YOLOv8 and demonstrated promising performance for early lesion diagnosis [17]. To address dense occlusion, researchers have proposed single-class dense detection networks such as YOLO-SDD to enhance robustness [11]. Other work has applied YOLOv5 to diseased-bird detection and reported an accuracy of approximately 89.2% [18]. Nevertheless, these YOLO-based detectors typically rely on non-maximum suppression (NMS) for post-processing, which reduces inference speed and introduces multiple hyperparameters, causing instability in the speed-accuracy trade-off [14].
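To make the source of these hyperparameters concrete, the minimal greedy NMS sketch below (illustrative only, not drawn from any detector cited here) exposes the score and IoU thresholds that such post-processing must tune:

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all in (x1, y1, x2, y2) format."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter + 1e-9)

def greedy_nms(boxes, scores, score_thr=0.25, iou_thr=0.5):
    """Greedy NMS; score_thr and iou_thr are the hand-tuned hyperparameters
    referred to in the text (the defaults here are typical, not canonical)."""
    keep = []
    order = np.argsort(-scores)
    order = order[scores[order] >= score_thr]
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        order = rest[iou(boxes[i], boxes[rest]) <= iou_thr]
    return keep
```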
By contrast, DETR (DEtection TRansformer) employs a Transformer architecture and a set prediction mechanism to form an end-to-end detection pipeline, thereby eliminating the dependence on NMS [19]. However, despite removing post-processing, the standard DETR design incurs considerable computational overhead, and its inference speed often fails to meet real-time requirements [19]. To this end, RT-DETR retains the advantages of the DETR paradigm while introducing a convolutional backbone and an efficient hybrid encoder, achieving a more favorable balance between detection accuracy and speed [20].
Although RT-DETR demonstrates advantages in structural simplicity and global modeling capability, its original ResNet backbone has limited representational power for targets with highly similar structures and subtle texture variations, making it difficult to capture fine inter-individual differences [21]. In addition, the built-in Attention-based Intra-scale Feature Interaction (AIFI) module exhibits insufficient synergy between global modeling and local detail recovery in the chicken face detection context, and its normalization procedures lose efficiency under parallel computation. These issues are accompanied by high computational and memory costs, loss of fine detail information, and inconsistent behavior across scales [22,23,24,25]. Meanwhile, the Varifocal Loss (VFL) used in RT-DETR is insensitive to low-quality matched samples, which results in insufficient discrimination between positive and negative samples and thereby limits both the effectiveness of matching optimization and the speed of training convergence [26].
In summary, although RT-DETR offers end-to-end detection benefits, it still faces multiple bottlenecks in backbone design, encoding mechanisms, and loss formulation, which hinder its direct application to chicken face detection, a task involving complex structural patterns and subtle textures. To meet the accuracy and real-time requirements of chicken face detection in complex farm environments, this paper proposes a systematic set of structural optimizations to the RT-DETR framework, focusing on three critical stages of the detection pipeline: feature extraction, information encoding, and loss design. By synergistically improving these components, we construct an end-to-end detection model, CF-DETR, with enhanced structural adaptability and expressive capacity. The main contributions are summarized as follows:
- (i) Dynamic Inception Depthwise Convolution (DIDC). To overcome the limitations of conventional backbones in representing diverse spatial structures, we design a convolutional operator based on multi-path depthwise separable convolutions and a dynamic fusion mechanism to strengthen feature extraction under complex poses and occlusion.
- (ii) Polar Embedded Multi-scale Encoder (PEMD). To address key issues of global modeling, normalization efficiency, multi-scale fusion, and detail compensation, we propose a structurally closed-loop encoder module that improves coordination and stability in representing global context and local texture.
- (iii) Matchability Aware Loss (MAL). In the design of training objectives, we introduce a dynamic label weighting mechanism so that predicted confidence adapts to matching strength, thereby accelerating training convergence and enhancing discriminability between targets and background.
3. Results
3.1. Backbone Comparison Experiment
To verify the effectiveness of the DIDC module, we compared four backbones under the same experimental settings: the original RT-DETR backbone, a backbone based on the C2f structure, a C2f backbone integrated with the PKI module, and a backbone based on DIDC, i.e., C2f_DCMB (denoted as DIDC(C2f) in the table). With respect to lightweight design and embedded deployment requirements, the DIDC module employs three parallel depthwise separable convolution branches (square, horizontal stripe, and vertical stripe) together with a global context-based dynamic weight generation mechanism. This reduces the model parameters from 19.9 M to 13.2 M and the FLOPs from 56.9 G to 43.5 G while maintaining an inference speed of 73.8 FPS, indicating that the reduction in computational cost does not noticeably affect real-time performance and better meets the requirements of subsequent embedded deployment.
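A minimal PyTorch sketch of this mechanism, as we describe it above, is given below; the kernel sizes, the bottleneck reduction ratio, and all names are assumptions for illustration, not the authors' released implementation:

```python
import torch
import torch.nn as nn

class DIDCSketch(nn.Module):
    """Illustrative sketch: three parallel depthwise branches (square,
    horizontal stripe, vertical stripe) fused by weights predicted from
    global context, followed by pointwise mixing. Kernel size k and the
    reduction ratio are assumptions."""
    def __init__(self, channels, k=5, reduction=4):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels),            # square
            nn.Conv2d(channels, channels, (1, k), padding=(0, k // 2), groups=channels),  # horizontal stripe
            nn.Conv2d(channels, channels, (k, 1), padding=(k // 2, 0), groups=channels),  # vertical stripe
        ])
        # Global-context dynamic weights: GAP -> bottleneck MLP -> softmax over branches
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, len(self.branches), 1),
        )
        self.pw = nn.Conv2d(channels, channels, 1)  # pointwise conv completes the separable design

    def forward(self, x):
        w = torch.softmax(self.gate(x), dim=1)                   # (B, 3, 1, 1)
        feats = torch.stack([b(x) for b in self.branches], 1)    # (B, 3, C, H, W)
        fused = (w.unsqueeze(2) * feats).sum(dim=1)              # weighted sum over branches
        return self.pw(fused)
```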
From the perspective of feature extraction capability, the DIDC module leverages multi-branch convolutions covering different orientations and receptive fields and adaptively fuses the outputs of each branch with global semantic signals, effectively enhancing the model's adaptability to chicken face regions of varying scales and complex shapes. As shown in Table 3, after introducing the DIDC module, Precision increased from 93.8% to 96.2%, the F1-Score from 92.9% to 94.3%, mAP50 from 95.4% to 96.3%, and mAP50:95 from 60.2% to 61.4%, surpassing all other comparison models on multiple key metrics and fully validating the advantages of the DIDC module in multi-scale spatial feature extraction and dynamic fusion.
3.2. Classification Loss Comparison Experiment
To meet the dual requirements of high accuracy and fast convergence in chicken face detection, this study compared four classification loss functions, VariFocal Loss (VFL) [30], Slide Loss [31], EMASlide Loss [32], and MAL [26], to determine the optimal choice.
As shown in Table 4, in the comparison of the four classification loss functions, MAL outperformed all others across all key detection metrics. Specifically, MAL achieved 95.2% Precision, 93.1% Recall, and a 94.1% F1-Score, clearly exceeding VFL's 93.8%, 92.1%, and 92.9%; meanwhile, MAL reached 96.3% mAP50 and 61.1% mAP50:95, improvements of 0.9 and 0.9 percentage points over VFL, respectively, and also outperformed Slide Loss (95.6%/60.7%) and EMASlide Loss (96.0%/59.3%). The mAP50 curves for each loss function in Figure 7b show that the MAL curve remains at the highest level and exhibits the smallest decline during later iterations, further demonstrating its advantages in accuracy and robustness.
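As a quick consistency check, the reported F1-Score follows directly from Precision and Recall; for MAL:

\[
F_1 = \frac{2PR}{P+R} = \frac{2 \times 0.952 \times 0.931}{0.952 + 0.931} \approx 0.941,
\]

which matches the 94.1% reported in Table 4.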
From the perspective of training convergence speed, the loss curves in Figure 7a indicate that MAL reaches a convergence plateau at around 31 epochs, whereas VFL and Slide Loss require 37 and 39 epochs, respectively. Although EMASlide Loss also converges at approximately 31 epochs, its final mAP50 and mAP50:95 still lag behind those of MAL. Considering MAL's leading performance in Precision, Recall, F1, and mAP in Table 4, together with the fastest convergence and lowest training oscillation in Figure 7a, MAL not only significantly shortens the training cycle but also improves detection performance. Based on this comparison, MAL was ultimately selected as the classification loss function for the CF-DETR model.
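To make the dynamic label weighting concrete, the sketch below illustrates a matchability-aware classification loss in the spirit described here; the exact MAL formulation (including its focusing exponent) should be taken from [26], so every detail of this sketch is an assumption:

```python
import torch

def matchability_aware_cls_loss(pred_logits, match_quality, gamma=1.5):
    """Illustrative sketch only: a VFL-style binary classification loss whose
    positive targets and weights follow the match quality (e.g., IoU) of each
    query, so predicted confidence adapts to matching strength. gamma and the
    normalization are assumptions, not the published MAL."""
    p = pred_logits.sigmoid()
    q = match_quality.clamp(0, 1)          # 0 for unmatched (background) queries
    pos = q > 0
    # Positives: soft target q, weighted by q**gamma so stronger matches dominate;
    # negatives: focal-style down-weighting by the prediction p**gamma.
    weight = torch.where(pos, q.pow(gamma), p.detach().pow(gamma))
    target = torch.where(pos, q, torch.zeros_like(q))
    bce = torch.nn.functional.binary_cross_entropy_with_logits(
        pred_logits, target, reduction="none")
    return (weight * bce).sum() / pos.sum().clamp(min=1)
```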
3.3. Ablation Experiment
To verify the effectiveness of the systematic structural optimization proposed in this study within the RT-DETR framework—focusing on three key aspects: feature extraction (DIDC), information encoding (PEMD), and loss design (MAL)—we conducted ablation experiments based on RT-DETR under the same experimental settings. The experiments considered not only commonly used detection accuracy metrics (mAP50, mAP50:95, Precision, Recall, F1) but also included model parameters, FLOPs, and inference speed, enabling a comprehensive evaluation from both detection performance and embedded deployment feasibility perspectives.
As shown in Table 5, introducing the DIDC module alone reduces the model parameters from 19.9 M to 13.2 M (a 33.7% decrease) and the FLOPs from 56.9 G to 43.5 G (a 23.5% decrease) while maintaining an inference speed of 73.8 FPS, demonstrating its suitability for embedded deployment. In addition, its multi-scale convolutional branches and global context-based dynamic weight fusion enable finer extraction of local textures and edge information of chicken faces, increasing Precision to 96.2% (+2.4 percentage points) and mAP50:95 to 61.4% (+1.2 percentage points), while Recall fluctuates by no more than 0.5 percentage points (92.6%), validating its efficient feature extraction. Although the PEMD module increases the parameters to 20.0 M and the FLOPs to 57.2 G, its PolarAttention reconstructs global dependencies in the polar coordinate domain with linear complexity, its Mona branch adaptively fuses multi-scale pooling and convolutional branches, and its EDFFN with DynamicTanh compensates high-frequency details and stabilizes feature distributions; as a result, Recall increases to 93.6% (+1.5 percentage points) and mAP50:95 rises to 61.2% (+1.0 percentage points), demonstrating its advantage in global-local collaborative perception for small and occluded chicken-face targets.
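Of these components, DynamicTanh has a compact published form, DyT(x) = γ · tanh(αx) + β, as a normalization-free substitute for LayerNorm; a minimal sketch follows (its exact wiring inside EDFFN here is our assumption):

```python
import torch
import torch.nn as nn

class DynamicTanh(nn.Module):
    """DyT: y = gamma * tanh(alpha * x) + beta. alpha is a learnable scalar,
    gamma/beta are per-channel affine parameters; no batch statistics are
    needed, which is what removes the parallel-efficiency bottleneck of
    normalization layers."""
    def __init__(self, dim, alpha_init=0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(alpha_init))
        self.gamma = nn.Parameter(torch.ones(dim))
        self.beta = nn.Parameter(torch.zeros(dim))

    def forward(self, x):  # x: (..., dim)
        return self.gamma * torch.tanh(self.alpha * x) + self.beta
```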
When the DIDC and PEMD modules are applied together, the model further balances accuracy and speed while maintaining efficient computation, achieving 94.9% Precision and 93.1% Recall, with mAP50 increasing to 96.7% and mAP50:95 to 62.0%; parameters remain at 13.3 M, FLOPs at 43.8 G, and the frame rate improves to 80.6 FPS. On this basis, integrating the Matchability Aware Loss (MAL) into CF-DETR focuses the learning signal on low-quality matched samples, yielding a more balanced improvement. The final CF-DETR achieves 95.5% Precision, 94.6% Recall, and a 95.1% F1-Score, a 1.1 percentage point gain in F1 over the combination without MAL; mAP50 reaches 96.9% and mAP50:95 rises further to 62.8%, while parameters and FLOPs remain at 13.3 M and 43.8 G and the frame rate stabilizes at 81.4 FPS.
The training curves in Figure 8 demonstrate the combined advantages of the DIDC–PEMD–MAL configuration in convergence speed, training stability, and final plateau height. First, the DIDC module enhances the representation of fine-grained, direction-sensitive features, improving candidate-box quality in the early training stages and accelerating the rise of mAP. Second, PEMD, through polar-coordinate attention, distribution stabilization, and high-frequency detail compensation, mitigates performance decline in the middle and later stages, producing a smoother convergence curve and a higher plateau. Finally, MAL strengthens the gradient response to low-quality matched samples through its match-driven loss design, suppressing training oscillation and improving the consistency between localization and confidence. The steeper rise, earlier stabilization, and higher plateau in the figure are therefore not isolated phenomena but a direct reflection of the three modules complementing one another during training, indicating that the DIDC–PEMD–MAL configuration outperforms the other configurations in convergence and generalization.
This combined configuration outperforms any single- or dual-module combination in both detection accuracy and deployment efficiency, fully validating the synergistic optimization of DIDC, PEMD, and MAL in structural design, information encoding, and loss modeling. Overall, the modules complement each other across different task levels, providing robust support for high-performance, low-resource embedded real-time chicken face detection.
3.4. Visualization Analysis
To intuitively evaluate the detection robustness and interpretability of the baseline RT-DETR and the proposed CF-DETR in complex farming environments across different growth stages, we selected representative samples from the fattening, growing, and brooding periods for visual comparison. As shown in Figure 9, during the fattening period the targets are relatively large and highly distinguishable, allowing both models to detect stably with almost no false positives or false negatives. In the growing period, as target size decreases and neighborhood interference increases, the baseline model produces false positives because adjacent head features are similar, whereas CF-DETR maintains stable results with no missed detections and markedly fewer false positives. In the most challenging brooding period, targets are smaller, stocking density is higher, and occlusion is severe, so the baseline model exhibits both missed detections and false positives: missed detections result from weakened features and lost candidate boxes, while false positives mainly arise from background textures and neighboring interference. In contrast, CF-DETR maintains zero missed detections at this stage, with fewer false positives than the baseline RT-DETR, demonstrating its advantages in feature extraction, occlusion handling, and interference suppression.
To further reveal the internal mechanism behind the performance differences in the brooding period, we performed heatmap visualization on brooding-period samples. As shown in Figure 10, CF-DETR exhibits clear and stable activation responses on more targets than the baseline model, which lacks identifiable activation signals on several true targets. The unactivated targets in the heatmaps coincide closely with the detector's missed detections, indicating that the absence of internal responses is an important cause of missed detections. As explainable evidence, the heatmaps therefore support the conclusion that CF-DETR possesses stronger feature response capability and higher recall in small-scale, high-density, heavily occluded brooding-period scenes, providing intuitive visual confirmation of the performance improvement. It should be emphasized that heatmaps are an interpretability tool rather than direct detection outputs; this section is intended to provide an intuitive understanding of model behavior and supplementary evidence.
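For readers reproducing this analysis, heatmaps of this kind can be generated with the pytorch-grad-cam package; the sketch below is a generic usage outline, where the model wrapper and target-layer choice are assumptions (DETR-style detectors typically need a thin wrapper that returns a single differentiable tensor):

```python
from pytorch_grad_cam import GradCAM
from pytorch_grad_cam.utils.image import show_cam_on_image

def detector_heatmap(wrapped_model, target_layer, image_tensor, image_float):
    """Sketch: Grad-CAM overlay for one image. `wrapped_model` is assumed to
    expose a single scalar-differentiable output; `target_layer` is typically
    a late backbone/encoder convolution stage. `image_float` is the HxWx3
    image as float32 in [0, 1]."""
    cam = GradCAM(model=wrapped_model, target_layers=[target_layer])
    grayscale_cam = cam(input_tensor=image_tensor)[0]   # (H, W), values in [0, 1]
    return show_cam_on_image(image_float, grayscale_cam, use_rgb=True)
```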
3.5. Comparison with Other Models
To comprehensively and objectively evaluate the practical performance of the proposed CF-DETR in chicken face detection, we conducted a horizontal comparison under the same dataset and evaluation criteria with representative single-stage detectors (YOLOv10m, YOLOv11m) [33,34], the anchor-based single-stage detector TOOD [35], the two-stage detector Faster R-CNN [36], and Transformer-based detectors (DETR, RT-DETR, RT-DETR-r34). The comparison metrics include detection accuracy (Precision, Recall, mAP@0.5, mAP@0.5:0.95), real-time performance (FPS), and model complexity (parameters, FLOPs), aiming to reveal the strengths and limitations of each method across the three dimensions of accuracy, speed, and resource efficiency. The results are shown in Table 6.
From the data, the models exhibit clear performance differentiation. YOLOv11m slightly leads in Precision (95.8%), but its inference speed is only 31.6 FPS, indicating that although post-processing reduces false positives, the heavy computation and post-processing overhead limit its real-time performance for this task. Similarly, YOLOv10m achieves a near-leading mAP50 (96.5%) but only 37.7 FPS, showing that conventional single-stage pipelines on this dataset remain limited by model complexity and pipeline overhead. Faster R-CNN and TOOD reach moderate accuracy on some metrics but at the cost of substantially higher parameters and FLOPs (e.g., Faster R-CNN: ≈41.3 M parameters, ≈208 G FLOPs), resulting in low frame rates (below 34 FPS) that are unfavorable for embedded, real-time applications. The original DETR performs poorly in this high-density scenario (mAP50 = 88.6%), indicating that a standard Transformer-only architecture struggles with tiny and occluded chicken faces in the absence of targeted local-scale enhancement. RT-DETR provides a good speed-accuracy trade-off (mAP50 = 95.4%, 74.1 FPS) but still lags behind CF-DETR in Recall (92.1% vs. 94.6%) and overall mAP (95.4% vs. 96.9%). Overall, CF-DETR achieves the best balance between accuracy and efficiency, attaining the highest mAP50 (96.9%) and mAP50:95 (62.8%) with the lowest parameter count (13.3 M) and computational cost (43.8 G), and leading all compared models at 81.4 FPS, demonstrating the practical benefit of the synergistic optimization across the backbone (DIDC), encoder (PEMD), and loss design (MAL).
In summary, although some YOLO-series models show advantages in individual metrics (such as Precision or single-point mAP), their disadvantages in real-time performance or computational resource requirements limit their feasibility for practical deployment. Two-stage methods and pure DETR-based approaches are also constrained in this task due to model complexity or lack of adaptation mechanisms. In contrast, CF-DETR, through lightweight structural design and targeted encoding and loss strategies, demonstrates the most balanced and superior performance in chicken face detection under high-density and heavily occluded real-world scenarios.
4. Discussion
This study constructed a chicken-face detection dataset covering the entire growth cycle, from the chick stage to the fattening stage, encompassing significant variations in facial size and texture across stages (see Figure 1). During the early chick stage, the facial region is extremely small, features are difficult to discern, and high stocking density results in severe occlusion. In the growing stage, the facial area increases but neighboring interference also rises. In the fattening stage, the facial region is largest and features are most stable. Such high-density, fine-grained detection scenarios pose significant challenges to detectors [10,11]. To address this, we selected RT-DETR as the starting point: it offers end-to-end detection without NMS while balancing accuracy and speed through a convolutional backbone and an efficient hybrid encoder. However, the native RT-DETR structure exhibits shortcomings in chicken face scenarios. The ResNet backbone has limited capability to express subtle texture differences [37]. The AIFI module provides sufficient global modeling ability but insufficient local detail recovery [25]; its attention, built on normalization and Softmax, introduces parallelism and efficiency bottlenecks [23,24] and lacks juxtaposed structural interaction and multi-receptive-field collaboration [22]. These factors further limit the effectiveness of the matching mechanism and the speed of training convergence [26], and previous work has noted that the DETR framework struggles to capture fine-grained information [38]. Therefore, this study proposes a systematic improvement scheme addressing the three major bottlenecks of RT-DETR: feature extraction, information encoding, and loss design.
First, at the backbone level, we designed the DIDC module, which employs multi-branch depthwise separable convolutions (including stripe convolutions in different orientations) combined with global semantic dynamic fusion, effectively enhancing the perception of chicken faces of varying scales and complex shapes. As shown in Table 3, after introducing DIDC, both Precision and mAP improved markedly (Precision from 93.8% to 96.2%, mAP50 from 95.4% to 96.3%), while the model parameters decreased by over one-third, indicating that the multi-branch convolutions preserve fine-grained texture features while improving efficiency. Second, at the encoder level, we proposed the PEMD module, which incorporates polar-coordinate attention and a high-frequency detail compensation mechanism: polar-coordinate attention captures global context with linear complexity, while the Mona multi-scale fusion branch, together with EDFFN and DynamicTanh, compensates for high-frequency information. In the ablation experiments in Table 5, introducing PEMD alone increased Recall by 1.5 percentage points (from 92.1% to 93.6%) and raised mAP50:95 from 60.2% to 61.2%, indicating that PEMD enhances global-local collaborative perception under occlusion and background interference and thereby improves detection robustness. Furthermore, the proposed Matchability Aware Loss (MAL) makes the training signal more sensitive to hard samples through dynamic label weighting. The comparative experiments in Table 4 show that MAL outperforms traditional losses such as VariFocal Loss in both Precision and mAP (Precision from 93.8% to 95.2%, mAP50 from 95.4% to 96.3%), and as shown in Figure 7, it achieves faster convergence and minimal training oscillation: MAL converges in approximately 31 epochs, whereas VariFocal Loss requires 37, and in the loss and mAP curves of Figure 7 the MAL curve consistently remains at the best level, further validating its contribution to training stability and detection performance.
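The linear-complexity property credited to the polar-coordinate attention rests on the standard kernelized-attention reordering; the generic sketch below (not the exact PolarAttention formulation, which should be taken from the original module) makes the O(N) cost explicit:

```python
import torch

def linear_attention(q, k, v, eps=1e-6):
    """Generic kernelized linear attention (illustrative only): with a
    positive feature map phi, the O(N^2) softmax-attention product is
    reordered so that phi(K)^T V is computed once, giving O(N) cost.
    q, k, v: (B, N, D)."""
    phi = lambda x: torch.nn.functional.elu(x) + 1.0   # positive feature map
    q, k = phi(q), phi(k)
    kv = torch.einsum("bnd,bne->bde", k, v)            # (B, D, D): built once in O(N)
    z = 1.0 / (torch.einsum("bnd,bd->bn", q, k.sum(dim=1)) + eps)  # normalizer
    return torch.einsum("bnd,bde,bn->bne", q, kv, z)   # O(N) over all queries
```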
Based on the comprehensive ablation results (Table 5) and the convergence curves (Figure 7 and Figure 8), the synergistic effects of the three mechanisms are clearly evident. DIDC accelerated the early rise of mAP during training, PEMD mitigated the late-stage performance decline, and MAL smoothed training oscillations, together producing steeper convergence curves and higher performance plateaus. In the visualization analysis of representative samples, CF-DETR exhibited more robust detection than the baseline RT-DETR. For growing-stage samples, RT-DETR produced false positives due to similar neighboring head features, whereas CF-DETR showed almost no missed or false detections. In the most challenging chick-stage samples, RT-DETR exhibited obvious missed and false detections (missed detections from weakened features; false detections from background interference), while CF-DETR maintained zero missed detections and markedly fewer false detections. Grad-CAM heatmaps (Figure 10) further confirmed this: CF-DETR generated clear and stable activation responses for more true chicken face targets, whereas RT-DETR lacked identifiable responses for some targets, and these weak-response regions corresponded closely to missed detection instances. This indicates that the modules in CF-DETR effectively enhance the feature response for small-scale and occluded targets, providing visual evidence for the performance improvement. In the comparison with other mainstream models (Table 6), CF-DETR achieved the highest mAP and inference speed while maintaining the lowest parameter count and computational load, demonstrating the practical benefit of the design optimizations.
Although this study demonstrates significant advantages in chicken face detection across the full growth cycle, two limitations remain. First, the data source is relatively homogeneous, covering only WOD168 white-feather broilers, which limits the model's generalization to other breeds and rearing conditions. Second, in chick-stage scenes, although overall performance is markedly improved over the baseline, CF-DETR still produces some false positives owing to extremely small target sizes and frequent occlusion, as illustrated in Figure 9. Future work will therefore expand cross-breed and cross-scenario validation and incorporate targeted small-object enhancement and temporal information modeling to further improve robustness and generalizability. Moreover, with on-farm deployment in mind, future research will investigate hardware adaptability, energy efficiency, and integration with existing farm management systems to ensure practical applicability in real-world poultry farming. Finally, because the dataset is currently under collaboration with partner enterprises and has not been publicly released, we plan to make it available upon project completion to facilitate subsequent research.
In summary, the experimental analysis demonstrates that the proposed multi-branch convolutional feature extraction module (DIDC), closed-loop encoder (PEMD), and matchability-aware loss (MAL) effectively address the adaptation bottlenecks of RT-DETR in chicken face detection, significantly enhancing performance in high-density, heavily occluded scenarios. Each improvement has been validated through ablation and visualization experiments: DIDC enriches spatial multi-scale feature representation, PEMD compensates high-frequency details and strengthens global-local fusion, and MAL focuses the learning signal on low-quality matched samples, improving recall and the consistency between localization and confidence. The proposed modules are complementary, in line with recent Transformer improvement strategies that emphasize stronger feature representation and more complete multi-scale aggregation to enhance adaptability to occlusion, background interference, and scale variation [37,39]. Therefore, CF-DETR demonstrates clear advantages in accurately and efficiently detecting broiler chicken faces, providing a solid foundation for real-time intelligent poultry monitoring.