Article

Transformer-Based Multi-Task Segmentation Framework for Dead Broiler Identification

1 AI Convergence Research Institute, Wonkwang University, Iksan 54538, Republic of Korea
2 Department of Computer and Software Engineering, Wonkwang University, Iksan 54538, Republic of Korea
* Author to whom correspondence should be addressed.
Appl. Sci. 2026, 16(1), 419; https://doi.org/10.3390/app16010419
Submission received: 3 December 2025 / Revised: 27 December 2025 / Accepted: 29 December 2025 / Published: 30 December 2025
(This article belongs to the Section Agricultural Science and Technology)

Abstract

Efficient monitoring of large-scale poultry farms requires the timely identification of dead broilers, as delays can accelerate disease transmission, leading to significant economic loss. Nevertheless, manual inspection remains the dominant practice, resulting in a labor-intensive, inconsistent, and poorly scalable workflow. Although recent advances in computer vision have introduced automated alternatives, most existing approaches face difficulties in crowded settings where live and dead broilers share similar visual patterns, and occlusions frequently occur. To address these problems, we propose a transformer-based multi-task segmentation framework designed to operate reliably in visually complex farm environments. The model constructs a unified feature representation that supports precise segmentation of dead broilers, while an auxiliary dead broiler counting task contributes additional supervisory features that enhance segmentation performance across diverse scene configurations. Experimental evaluations indicate that the proposed method yields accurate and stable segmentation results under various farm conditions, including densely populated and visually intricate scenes. Moreover, its overall segmentation accuracy consistently surpasses that of existing approaches, demonstrating the effectiveness of integrating transformer-based global modeling with the auxiliary regression objective.

1. Introduction

Precision agriculture has become a foundational paradigm for modern industrial livestock production, offering a systematic approach to improving operational efficiency through data-driven decision-making and automated control systems. As the scale of poultry farming continues to expand, the demand for technologies that can reduce manual workload while providing reliable, high-resolution monitoring has accelerated. This transition has positioned advanced computer vision and sensing technologies at the core of next-generation agricultural automation [1,2,3].
Within the poultry industry, an array of precision tools—ranging from wearable sensor nodes and vision-based modules to distributed environmental monitoring systems—has been incorporated into routine farm management. These tools support the real-time observation of flock behavior, environmental conditions, and health indicators. Among these technologies, computer vision has become particularly influential, enabling applications such as automated poultry-house management [4,5,6], early disease detection [7,8,9], continuous weight assessment [10,11,12], carcass inspection [13,14], and egg-quality analysis [15]. Collectively, these capabilities reduce human error, minimize labor demands, and improve the consistency of welfare-oriented monitoring [3].
Broiler chickens, as a major source of global animal protein, have received increasing research attention [7,16,17]. In such production environments, the rapid identification and removal of dead broilers is essential to maintaining welfare standards, limiting the spread of disease, and preventing productivity loss [18]. However, manual inspection remains the predominant practice, even though it is inherently constrained by fatigue, inconsistent judgments, and the difficulty of visually scanning dense flocks. These limitations often lead to delayed detection, elevating the risk of contamination and cross-infection. Automated detection and segmentation systems offer a more scalable and reliable alternative. By providing continuous visual monitoring, these systems can quickly identify deceased animals and support immediate intervention. Moreover, segmentation outputs deliver fine-grained structural information that proves valuable for downstream analyses, such as mortality-pattern assessment, environmental troubleshooting, and strategic farm-management planning. By decreasing manual labor requirements, reducing response delays, and strengthening decision-making accuracy, automated dead broiler monitoring constitutes an essential component of future intelligent poultry-farm infrastructures [6].
Automated systems to detect and segment dead broilers provide a substantial safeguard for modern poultry operations by maintaining continuous visual surveillance, and when detecting mortality, issuing immediate alerts. Such systems help stabilize the health conditions within the facility by preventing prolonged exposure to biological hazards, and directly contribute to improved productivity by ensuring rapid intervention. Beyond simple detection, segmentation-based approaches generate detailed spatial information about deceased birds, enabling fine-grained mortality analysis and supporting data-driven management decisions. These structured outputs offer actionable insights that farm operators can use to refine operational strategies, optimize environmental control, and mitigate future risks. In consequence, automated monitoring frameworks have become indispensable tools in contemporary broiler production, yielding both economic advantages and significant improvements in animal welfare [3].
Parallel to these developments, the rapid evolution of computer vision has had a transformative impact on agricultural automation. Modern vision systems now offer scalable and high-throughput pipelines that emulate essential aspects of human visual perception, while remaining cost-effective for deployment across large farming infrastructures. In particular, convolutional neural networks (CNNs) have served as a foundational component of this progress. Their representational capacity enables the stable extraction of discriminative patterns from visually complex scenes, and in many recognition tasks they now match or surpass human-level performance. CNN-based models have become indispensable for image classification [19], object detection [20], and semantic segmentation [21,22], consistently delivering reliable results, even under structural clutter, occlusion, and dynamic lighting conditions. More recently, transformer-based architectures have further advanced the field by introducing attention-driven mechanisms that model long-range spatial dependencies that CNNs inherently struggle to capture. Vision Transformers (ViT) [23] and their derivatives generate global feature interactions without requiring convolutional priors, allowing the model to reason over extended spatial structures and subtle contextual cues [24]. These properties are particularly advantageous in agricultural settings, where the spatial configuration of animals, equipment, and background clutter often imposes significant interpretive challenges. The combined effectiveness of CNNs and transformer-based models has accelerated the adoption of computer vision throughout both research and commercial livestock systems, firmly establishing visual intelligence as a core technological pillar in modern precision poultry management.
Motivated by recent advances in vision-based livestock monitoring, we introduce a machine-learning framework designed to segment dead broilers in scenes where live and dead animals coexist. Figure 1 presents sample image pairs from the Dead Broiler Dataset. Each pair contains the original broiler-house image and the corresponding ground-truth mask for the dead broiler. The scenes illustrate the inherent difficulty of this task: live and dead broilers appear together, often in close proximity and with similar visual patterns. These conditions highlight the need for a model that can recognize the contextual relationships within the scene to distinguish dead broilers from live ones in crowded farm environments.
The central idea of the proposed method is to construct a unified transformer-based representation that supports both fine-grained segmentation and an auxiliary regression objective derived from dead broiler counting. Instead of separating detection and segmentation into independent branches, our approach integrates both tasks into a single structural pipeline in which the segmentation network benefits directly from transformer-encoded global dependencies and the supervisory cues provided by the counting objective.
The model first encodes the input image through a CNN backbone, producing intermediate feature maps that retain essential spatial information. These features are then processed by a transformer module, whose self-attention mechanism evaluates the relative importance of spatial tokens. Modeling long-range dependencies allows the transformer to effectively distinguish subtle cues that differentiate dead broilers from surrounding live individuals, especially in dense or visually cluttered environments. The resulting representation is then decoded by a segmentation network to delineate the regions corresponding to dead broilers.
In parallel, the transformer tokens are aggregated and passed through a compact regression head that estimates the number of dead broilers in the scene. This auxiliary regression task does not serve as an independent prediction objective; rather, it provides an additional supervisory feature that sharpens the segmentation behavior. By forcing the model to encode object-level information relevant to mortality count, the network is guided toward representations that emphasize true dead broiler regions, while suppressing responses from live individuals. This interaction between segmentation and count-regression enhances robustness under occlusion, irregular poses, and complex visual configurations.
The contributions of this study can be summarized as:
  • We propose a transformer-based architecture that jointly performs dead broiler segmentation and auxiliary count regression. The regression branch is specifically designed to reinforce segmentation by encouraging consistent object-level reasoning across diverse scenes.
  • By integrating a transformer block into the encoding stage, the model captures long-range spatial relationships that are crucial when live and dead broilers appear in close proximity, or exhibit visually similar textures. This improves the ability to isolate dead individuals, even under crowded conditions.
  • The regression of dead broiler counts provides a structured supervisory feature that redirects the feature representation toward regions relevant to mortality, allowing the segmentation network to avoid misclassifying live animals, and to focus on spatially coherent dead broiler regions.
This paper is structured as follows: Section 2 reviews previous studies related to computer-vision-based broiler monitoring and dead broiler detection. Section 3 describes the materials and proposed method in detail, while Section 4 presents the experimental results and in-depth discussion of the findings. Finally, Section 5 summarizes the conclusions drawn from this study.

2. Related Works

Recent years have seen an expanding use of computer vision and CNN-based techniques to interpret broiler behavior and monitor flock conditions [1,8,25,26,27]. A representative example is the YOLOv4-based automated system introduced by Ref. [1], which was designed to locate and remove dead broilers inside large-scale poultry houses. The system supports both remote-control and fully autonomous operation modes, enabling efficient carcass removal while reducing unnecessary human–animal contact. In another line of research, van der Eijk et al. [25] employed computer vision methods, most notably Mask R-CNN [28], to identify individual broilers and track their engagement with environmental resources, such as feeders, perches, and enrichment materials. Their study demonstrated that modern instance-segmentation frameworks can achieve high accuracy in behavioral monitoring tasks under both controlled and industry-level conditions. Yang et al. [26] made further improvements by adapting YOLOv7 to detect dead hens housed in caged environments. Their model incorporated the convolutional block attention module (CBAM) to enhance feature extraction, and the deployed system combined edge-computing devices with inspection robots for reliable detection on real farms. Complementary to vision-only systems, Bao et al. [27] proposed a multimodal approach that integrated sensor-based movement analysis to identify sick or dead birds. By attaching foot-ring sensors and computing the three-dimensional variance of activity levels, their method leveraged machine learning algorithms to infer physiological abnormalities that traditional manual inspection often overlooks. Okinda et al. [18] attempted to predict disease early by introducing a non-intrusive monitoring pipeline using depth cameras. By analyzing body shape, posture, and movement velocity, their framework classified early signs of health deterioration and provided proactive alerts. Hao et al. [8] developed a detection system tailored for stacked-cage environments. An autonomous mobile platform navigated through narrow aisles capturing side-view images of the cages, and a YOLOv3-based detector processed these images to reliably identify dead broilers, even in the occluded and geometrically constrained layout of stacked housing. Thermal information has also been incorporated into poultry monitoring. Li et al. [29] presented a thermal-imaging-based method to detect sick laying hens. Their technique focused on extracting and analyzing temperature variations across the head, body, and legs using CNNs, enabling the identification of atypical thermal patterns linked to illness. Massari et al. [30] studied how broilers respond to environmental changes by defining computer vision-based behavioral indices that quantify cluster formation and unrest levels within flocks. Through continuous video analysis, their system evaluated the influence of environmental enrichment and heat-stress conditions on group movement dynamics and overall welfare.

3. Materials and Methods

3.1. Dataset Collection and Description

In this study, we employed the publicly available dataset referenced in [31]. This collection comprises 36 images sourced from online repositories, encompassing a total of 86 instances of dead broilers. To ensure the reliability of the ground truth for the segmentation task, binary masks were manually created by three independent annotators, with the final mask established through an averaging process to minimize labeling errors. Concurrently, for the auxiliary regression task, we explicitly labeled the total number of dead broilers present within each image. Figure 1 illustrates the sample pairs, with the original image on the left and the corresponding ground-truth mask on the right. Given the limited scale of the dataset, we adopted a 10-fold cross-validation strategy with image-level splitting to ensure a robust and statistically stable performance evaluation. During inference, the auxiliary count regression branch is removed, and only the segmentation branch is used for evaluation and deployment.
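For illustration, the following minimal Python sketch shows one way to realize the image-level 10-fold split described above using scikit-learn's KFold; variable names such as `image_paths` are placeholders rather than part of the released dataset or code.

```python
# Minimal sketch of the image-level 10-fold cross-validation described above.
# `image_paths` is a placeholder list standing in for the 36 dataset images.
from sklearn.model_selection import KFold

image_paths = [f"img_{i:02d}.jpg" for i in range(36)]

kfold = KFold(n_splits=10, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(kfold.split(image_paths)):
    train_imgs = [image_paths[i] for i in train_idx]
    val_imgs = [image_paths[i] for i in val_idx]
    # Train on train_imgs, evaluate the segmentation branch on val_imgs,
    # then average the metrics over the ten folds.
```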

3.2. Methodology

3.2.1. Feature Encoding

Figure 2 provides an overview of the proposed method. The process begins with a ResNet-50 [32] backbone that converts the input image into a compact and expressive feature map. The backbone receives an RGB input $X \in \mathbb{R}^{3 \times 224 \times 224}$, which is first resized to this fixed resolution to ensure consistent spatial structure across samples. As the tensor passes through the successive convolutional blocks of ResNet-50, the spatial resolution is gradually reduced, while the channel dimension is enriched. Consequently, the output representation carries both fine-grained cues originating from earlier layers and increasingly abstract semantic patterns synthesized by deeper layers. The final backbone feature map takes the form:
$F = B(X), \quad F \in \mathbb{R}^{2048 \times 7 \times 7}$
This compact spatial size plays an important role in enabling the transformer module [33] that follows. By compressing the input image to a grid of only 49 spatial positions, the model keeps the token sequence short. As a result, the computational cost of self-attention, which is quadratic in the sequence length, is dramatically reduced, allowing the transformer to analyze global contextual relations across the entire scene without incurring excessive memory or runtime overhead. After the backbone produces the feature map $F = \mathrm{Reshape}(F) \in \mathbb{R}^{C \times H \times W}$, a simple coordinate-based feature is added to provide each spatial position with a consistent reference to its location within the grid. The resulting tensor is given by:
$F_{pe} = F + P$
where P denotes the coordinate map repeated across channels. To make the representation compatible with transformer processing, the tensor is reshaped from a grid into a sequence. Each spatial position becomes one token containing all C channel responses. Formally,
$Z_0 = \mathrm{Reshape}(F_{pe}) \in \mathbb{R}^{(HW) \times C}$
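A PyTorch-style sketch of this encoding stage is given below for illustration; the layer slicing and the particular construction of the coordinate map $P$ (normalized row/column indices split across channels) are assumptions made for this sketch, not the authors' released implementation.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

# ResNet-50 feature extractor: drop the average-pooling and fully connected
# layers so the output is the 2048 x 7 x 7 feature map F = B(X).
backbone = nn.Sequential(*list(resnet50(pretrained=True).children())[:-2])

x = torch.randn(1, 3, 224, 224)                 # resized RGB input X
feat = backbone(x)                              # F in R^{2048 x 7 x 7}
b, c, h, w = feat.shape

# One simple way to build the coordinate map P: normalized row/column indices
# repeated across the channel dimension (an illustrative assumption).
ys = torch.linspace(0, 1, h).view(1, 1, h, 1).expand(b, c // 2, h, w)
xs = torch.linspace(0, 1, w).view(1, 1, 1, w).expand(b, c // 2, h, w)
pos = torch.cat([ys, xs], dim=1)                # same shape as F

feat_pe = feat + pos                            # F_pe = F + P
tokens = feat_pe.flatten(2).transpose(1, 2)     # Z_0 in R^{(HW) x C}
print(tokens.shape)                             # torch.Size([1, 49, 2048])
```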

3.2.2. Transformer Module

This stage employs a transformer block to model the contextual relationships that emerge across different regions of the input image. The transformer [33] provides a mechanism for comparing distant spatial locations, making it well suited to distinguishing live broilers from dead ones, particularly in crowded or visually complex scenes. Figure 3 provides an overview of the transformer block used to refine the encoded feature map and organize long-range dependencies. The normalized sequence $Z_0$ enters the multi-head attention module. Queries, keys, and values are obtained through linear projections of the input sequence:
$Q = Z_0 W_Q, \quad K = Z_0 W_K, \quad V = Z_0 W_V$
Each attention head measures how strongly one location should respond to another by computing the scaled dot-product attention:
$\mathrm{head}_i = \mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$
Multiple heads operate in parallel, allowing the model to examine different types of relationships simultaneously. Their outputs are concatenated and projected using W O :
$U = \mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_n)\, W_O$
This mechanism enables the transformer to integrate cues that may be widely separated in the original image. In the context of dead broiler detection, this capability is essential. A dead broiler often appears partially occluded, framed by surrounding live broilers, or lying in a position that resembles natural resting behavior. Discriminating such cases requires information that is not confined to a single patch, but extends across the entire scene. A residual connection integrates this output with the original sequence, $Z_{\mathrm{attn}} = Z_0 + U$, and an MLP then produces the refined sequence $Y$, which is reshaped back into spatial form as $F' = \mathrm{Reshape}^{-1}(Y) \in \mathbb{R}^{C \times H \times W}$.
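The block can be sketched in PyTorch as follows; the pre-norm arrangement, the second residual connection, and the MLP width are assumptions made for illustration, and the embedding dimension here simply follows the token width produced by the encoder rather than the per-head size reported in Section 3.2.6.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Layer norm -> multi-head attention -> residual -> MLP, as in Figure 3."""

    def __init__(self, dim=2048, num_heads=4, mlp_ratio=2):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, z0):
        # Q, K, V are linear projections of the normalized sequence (handled
        # internally by nn.MultiheadAttention); U is the multi-head output.
        n = self.norm1(z0)
        u, _ = self.attn(n, n, n)
        z_attn = z0 + u                     # residual: Z_attn = Z_0 + U
        y = z_attn + self.mlp(self.norm2(z_attn))
        return y

tokens = torch.randn(1, 49, 2048)                       # Z_0 from the encoder
y = TransformerBlock()(tokens)                          # refined sequence Y
f_refined = y.transpose(1, 2).reshape(1, 2048, 7, 7)    # back to spatial form
```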

3.2.3. Segmentation Module

The segmentation decoder receives the fused representation F , which is obtained by concatenating the backbone feature map with the transformer-refined feature map. Since ResNet–50 reduces the spatial resolution to 7 × 7 , the decoder must progressively reconstruct the full image resolution, while preserving the semantic structure embedded in the encoded features. To achieve this, we employ a sequence of convolutional layers interleaved with spatial upsampling operations. A skip connection from the ResNet–50 pathway is incorporated to retain low-level spatial information that may not be fully preserved in the transformer branch. This connection stabilizes the decoding process by reintroducing fine-grained details, such as local textures and boundary cues. The decoder first compresses the channel dimension, then applies bilinear interpolation followed by convolution to refine the up-sampled features. This process is repeated until the spatial resolution is restored to 224 × 224 . The final 1 × 224 × 224 output is produced through a 1 × 1 convolution, which maps the decoded feature stack into a single-channel prediction. The resulting mask reflects both the globally contextualized representation learned from the transformer, and the locally grounded information supplied through the skip connection.
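A minimal decoder sketch consistent with this description is shown below; the channel widths and the modeling of the skip connection as channel-wise concatenation are assumptions for illustration, not the exact released architecture.

```python
import torch
import torch.nn as nn

class SegDecoder(nn.Module):
    """Channel compression, then repeated (bilinear upsample + conv) blocks
    from 7 x 7 up to 224 x 224, followed by a 1 x 1 convolution."""

    def __init__(self, in_ch=4096, widths=(512, 256, 128, 64, 32)):
        super().__init__()
        self.reduce = nn.Conv2d(in_ch, widths[0], kernel_size=1)
        blocks, ch = [], widths[0]
        for w in widths[1:] + (widths[-1],):      # five x2 upsampling stages
            blocks += [
                nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
                nn.Conv2d(ch, w, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
            ]
            ch = w
        self.blocks = nn.Sequential(*blocks)
        self.head = nn.Conv2d(ch, 1, kernel_size=1)   # single-channel mask

    def forward(self, backbone_feat, transformer_feat):
        # The skip connection is modeled here as channel-wise concatenation of
        # the backbone and transformer-refined feature maps.
        fused = torch.cat([backbone_feat, transformer_feat], dim=1)
        return torch.sigmoid(self.head(self.blocks(self.reduce(fused))))

mask = SegDecoder()(torch.randn(1, 2048, 7, 7), torch.randn(1, 2048, 7, 7))
print(mask.shape)                                  # torch.Size([1, 1, 224, 224])
```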

3.2.4. Regression Module

The regression module operates on the output tokens Y generated by the transformer encoder, and provides a quantitative estimate of the number of dead broilers. This module operates in parallel with the segmentation decoder, and utilizes the refined global features produced by the transformer encoder. To obtain the final count, the token sequence is first flattened into a single vector. This vector is then processed by a three-layer multilayer perceptron, which performs a series of linear transformations interleaved with nonlinear activations. Because each token summarizes contextual relationships within the image, the aggregated vector carries sufficient information to infer global quantities, such as the total number of dead broilers. The final output of the MLP is a single scalar, interpreted as the predicted count.
This design avoids any need for spatial decoding or reconstruction. The module therefore operates independently of the segmentation pathway. However, both branches rely on the same transformer-derived token representation, which provides a unified source of contextual information. Because the model shares these tokens across tasks, the segmentation pathway is guided toward features that emphasize dead broilers more strongly than the surrounding live birds, thereby improving its ability to focus on the relevant regions, while the regression branch performs global counting.
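A compact sketch of such a regression head is given below; the hidden-layer sizes are assumptions chosen only to illustrate the flatten-then-MLP design described above.

```python
import torch
import torch.nn as nn

class CountRegressor(nn.Module):
    """Flatten the transformer tokens Y and map them to a single scalar count
    through a three-layer MLP."""

    def __init__(self, num_tokens=49, dim=2048, hidden=(512, 128)):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(num_tokens * dim, hidden[0]), nn.ReLU(inplace=True),
            nn.Linear(hidden[0], hidden[1]), nn.ReLU(inplace=True),
            nn.Linear(hidden[1], 1),          # O_reg: predicted dead broiler count
        )

    def forward(self, tokens):                # tokens: (B, 49, 2048)
        return self.mlp(tokens.flatten(1)).squeeze(-1)

count = CountRegressor()(torch.randn(1, 49, 2048))   # one scalar per image
```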

3.2.5. Loss Function

The model is trained using Dice loss, a measure designed to evaluate the overlap between the predicted mask and the ground-truth segmentation. Since the task involves identifying dead broilers within densely populated scenes, the foreground region often occupies a relatively small portion of the image. Unlike Binary Cross Entropy (BCE) loss, which can be biased towards the extensive background, Dice loss is particularly suited to such imbalanced settings, because it emphasizes the agreement between the predicted and true foreground regions, rather than treating all pixels equally. Given the predicted probability map $\hat{Y} \in \mathbb{R}^{1 \times H \times W}$ and the ground-truth mask $Y \in \mathbb{R}^{1 \times H \times W}$, the Dice coefficient is defined as:
$\mathrm{Dice}(Y, \hat{Y}) = \dfrac{2 \sum_i Y_i \hat{Y}_i}{\sum_i Y_i + \sum_i \hat{Y}_i + \epsilon}$
where ϵ is a small constant added for numerical stability. The Dice loss is formulated as:
$L_{\mathrm{Dice}} = 1 - \mathrm{Dice}(Y, \hat{Y})$
This formulation directly penalizes discrepancies in the predicted region and rewards accurate alignment along object boundaries.
For the auxiliary regression branch, we employed the Mean Squared Error (MSE) loss. As shown in Figure 2, the output of the regression head is denoted as $O_{\mathrm{reg}}$, which is a single scalar representing the predicted number of dead broilers. We define the regression loss by minimizing the mean squared difference between this prediction and the ground-truth count $y$ over the batch:
$L_{\mathrm{reg}} = \dfrac{1}{N} \sum_{i=1}^{N} \left( y_i - O_{\mathrm{reg},i} \right)^2$
where $N$ is the batch size, and $y_i$ and $O_{\mathrm{reg},i}$ denote the ground-truth and predicted counts for the $i$-th image, respectively. The final objective function combines both segmentation and regression losses:
$L_{\mathrm{Total}} = L_{\mathrm{Dice}} + \lambda L_{\mathrm{reg}}$
where λ is a weighting factor used to balance the two objectives.
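The combined objective follows directly from the equations above and can be sketched as below; the default weighting `lam=1.0` is an assumption, since the value of λ is left as a tunable parameter.

```python
import torch

def dice_loss(pred, target, eps=1e-6):
    """Dice loss on a (B, 1, H, W) probability map and binary ground-truth mask."""
    inter = (pred * target).sum(dim=(1, 2, 3))
    denom = pred.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3)) + eps
    return (1.0 - 2.0 * inter / denom).mean()       # L_Dice = 1 - Dice

def total_loss(pred_mask, gt_mask, pred_count, gt_count, lam=1.0):
    """L_Total = L_Dice + lambda * L_reg, with MSE on the predicted counts."""
    l_reg = torch.mean((gt_count - pred_count) ** 2)
    return dice_loss(pred_mask, gt_mask) + lam * l_reg
```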

3.2.6. Implementation Detail

All experiments were carried out on a workstation equipped with an Intel Core i9-10900X processor running at 3.70 GHz, 48 GB of RAM, and an NVIDIA GeForce RTX 3090 GPU. The network was implemented and trained using PyTorch 1.8.0. Input images were uniformly resized to 224 × 224 pixels, allowing consistent processing while reducing the computational burden associated with high-resolution data. Additionally, this resolution was chosen to better accommodate challenging smart-farm imaging conditions.
The model was trained with a batch size of 1. The backbone was initialized with ImageNet-pretrained weights provided by PyTorch. Training was conducted using the AdamW [34] optimizer with an initial learning rate of $1 \times 10^{-4}$. The learning rate was reduced by a factor of 0.1 every 200 epochs, and the model was trained for a total of 600 epochs. A fixed training schedule was adopted without early stopping. To improve generalization and reduce overfitting, several data augmentation strategies were applied, including controlled brightness variation, rotations within the range of −15° to +15°, and both horizontal and vertical translations.
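The stated schedule corresponds to the following sketch, in which `model`, `train_loader`, and `total_loss` are placeholders for the proposed network, the augmented data pipeline, and the combined objective from Section 3.2.5; the two-output model interface is likewise an assumption made for illustration.

```python
import torch

# AdamW with initial learning rate 1e-4, decayed by 0.1 every 200 epochs,
# trained for 600 epochs with batch size 1 and no early stopping.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=200, gamma=0.1)

for epoch in range(600):
    for image, gt_mask, gt_count in train_loader:     # batch size 1
        pred_mask, pred_count = model(image)
        loss = total_loss(pred_mask, gt_mask, pred_count, gt_count)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```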
Within the transformer module, the attention mechanism was configured with four heads, with each assigned an embedding dimension of 64. This configuration balances expressiveness and computational efficiency, enabling the model to analyze interactions across spatial regions without excessive memory usage.

4. Results and Discussion

4.1. Evaluation Metrics

The performance of the proposed model was assessed using four quantitative metrics: intersection over union (IoU), precision, recall, and F-measure. These measures provide a direct indication of segmentation quality, while allowing detailed comparison between the predicted mask and the ground-truth annotation.
IoU evaluates how well the predicted region aligns with the true region. This is computed as the size of their intersection divided by the size of their union:
$\mathrm{IoU} = \dfrac{|Pred \cap GT|}{|Pred \cup GT|}$
Precision reflects the proportion of predicted positive pixels that are correct, defined as:
$\mathrm{Precision} = \dfrac{TP}{TP + FP}$
where TP denotes true positives, and FP denotes false positives. Recall measures how many of the actual foreground pixels are correctly identified:
$\mathrm{Recall} = \dfrac{TP}{TP + FN}$
with FN representing false negatives. The F-measure combines precision and recall into a single harmonic score:
$\text{F-measure} = \dfrac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$
An F-measure value close to 1 indicates a strong balance between identifying relevant pixels and avoiding incorrect predictions. For evaluation, the model output was thresholded at 0.5 to produce binary segmentation masks, allowing consistent computation of all metrics.
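For reference, the four metrics can be computed per image from the thresholded mask as in the following NumPy sketch; the small `eps` guard against empty masks is an implementation convenience rather than part of the metric definitions.

```python
import numpy as np

def segmentation_metrics(prob_map, gt_mask, thr=0.5, eps=1e-9):
    """Threshold the probability map at 0.5 and compute IoU, precision,
    recall, and F-measure against the binary ground-truth mask."""
    pred = (prob_map >= thr).astype(np.uint8)
    gt = gt_mask.astype(np.uint8)
    tp = np.logical_and(pred == 1, gt == 1).sum()
    fp = np.logical_and(pred == 1, gt == 0).sum()
    fn = np.logical_and(pred == 0, gt == 1).sum()
    iou = tp / (tp + fp + fn + eps)
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f_measure = 2 * precision * recall / (precision + recall + eps)
    return iou, precision, recall, f_measure
```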

4.2. Performance Comparison

This section compares the segmentation performance of the proposed method with that of several established models, namely U-Net [21], FCN [35], LinkNet [36], DeepLabV3 [22], and Dual-StreamNet [3]. All models were trained and tested employing a 10-fold cross-validation process, and the results reflect the average performance across the folds. Table 1 presents the segmentation performance of the proposed method alongside these five widely used models, using the key metrics of IoU, precision, recall, and F-measure. In terms of IoU, the proposed model records the highest score of 87.39%, indicating that it delineates the target regions more accurately than the competing networks. The precision and recall values further illustrate this trend. The proposed method attains a precision of 94.27%, demonstrating its ability to avoid false detections, while maintaining a recall of 92.84%, which reflects a high rate of correctly identified target regions. As a result, the method achieves the best F-measure score of 93.54%, indicating a well-balanced performance across both precision and recall. These findings suggest that the integration of transformer-based contextual reasoning with the CNN decoder contributes significantly to improving segmentation quality, particularly in dense and visually complex broiler-house environments.
Figure 4 shows box plots that offer a concise comparison of segmentation performance across the six evaluated methods. The proposed method shows the highest central values in IoU, precision, and F-measure, and maintains a narrow spread, indicating both strong average performance, and stable behavior across scenes. Recall follows a slightly different pattern: Dual-StreamNet achieves the highest median in this metric, reflecting its ability to recover a larger share of the target regions. However, the proposed method demonstrates recall values that remain close to the top and exhibit limited variation, suggesting consistent identification of foreground areas. The remaining variability across folds is primarily associated with visually ambiguous postures, where dead broilers closely resemble resting live ones, leading to increased uncertainty in certain validation folds.
U-Net, FCN, and LinkNet appear in the lower ranges across all metrics, accompanied by broader distributions. These patterns reveal reduced robustness in visually challenging broiler-house environments, where dead broilers may closely resemble resting live ones. Their lower precision and F-measure values further indicate frequent misclassification and unstable boundary delineation. In summary, the proposed method provides the most balanced and reliable performance among the compared models, outperforming the baselines in both accuracy and consistency.
Table 2 compares two variants of the proposed model: one without the regression block, and the complete model containing it. The results show that removing the regression block leads to a consistent reduction in all metrics. In particular, the IoU decreases from 87.39 to 86.88, and the F-measure drops from 93.55 to 92.76. The differences are modest but systematic across all measures, indicating that the regression branch contributes positively to the overall representation. Since both variants share the same transformer-derived token sequence, the improvement cannot be attributed to changes in architectural capacity. Rather, the counting task provides an auxiliary supervisory feature that biases the shared token representation toward features associated with dead broilers.
Table 3 compares the segmentation performance obtained from different ResNet backbones. As the backbone depth increases, the overall trend shows a gradual improvement across most metrics. ResNet18 provides the lowest accuracy, with an F-measure of 91.76, while deeper models yield more stable representations. ResNet34 achieves a moderate gain, particularly in precision, while ResNet50 further improves the balance between precision and recall, reaching an F-measure of 93.54. ResNet101 also demonstrates competitive performance, though the improvement over ResNet50 is not substantial. Considering computational efficiency and the marginal gain provided by deeper backbones, ResNet50 represents the most appropriate choice. It offers a favorable balance between accuracy and cost, supplying sufficiently rich token representations, without the overhead associated with larger models.
Table 4 reports the segmentation performance obtained by varying the number of attention heads in the transformer encoder. The overall results show that increasing the number of heads does not lead to monotonic improvement. The configurations with two and four heads achieve similar accuracy levels, but the eight-head model provides a noticeable gain in F-measure, indicating that this setting captures a balanced range of contextual interactions within the feature tokens.

4.3. Visualization Results

Figure 5 presents a sequence of segmentation results produced by the proposed method, organized in increasing structural complexity. The first row contains instances in which a single dead broiler appears clearly separated from its surroundings; here the target region forms a well-defined and internally consistent component, and the method reconstructs its shape with minimal deviation, even if the background exhibits substantial variation.
The remaining rows describe progressively more intricate configurations. Among them, the third row represents the most demanding case: a dense scene in which a dead broiler is embedded within a cluster of live animals. Such a setting introduces numerous locally similar patterns, causing the visual structure of the scene to become highly entangled. Nevertheless, the method identifies the characteristic posture and spatial footprint of the dead broiler and isolates it as a distinct region, without being distracted by the surrounding elements.
The final row illustrates the method’s behavior when multiple dead broilers appear simultaneously. Despite the increased number of overlapping shapes and the presence of mutual occlusions, each instance is recovered as a separate and stable region. The boundaries remain clearly delineated, showing that the method maintains consistent behavior, even when several objects must be distinguished within a shared spatial domain.
Overall, Figure 5 shows that the proposed method captures the underlying structural organization of the scene, isolates the correct connected components under both simple and congested conditions, and maintains reliable segmentation performance, even in situations where many objects overlap or exhibit similar visual patterns.

5. Conclusions

This study introduced a transformer-based multi-task framework to automatically identify dead broilers. The proposed model integrates a segmentation network with a count-regression branch, both operating on a shared structural representation derived from transformer attention. The regression component functions as an auxiliary task that reinforces object-level cues, thereby enhancing the segmentation process, rather than serving as an independent prediction objective. This design enables the system to segment coherent spatial regions corresponding to dead broilers, while benefiting from the additional structural guidance provided by the regression output. By capturing long-range spatial relationships and incorporating auxiliary supervision, the method maintains consistent performance across simple, crowded, and structurally entangled environments. Experimental results demonstrate that compared to existing methods, our approach achieves superior segmentation accuracy, confirming the advantage of combining global structural modeling with auxiliary regression features. Furthermore, the proposed model comprises approximately 36.76 million trainable parameters, resulting in a lightweight architecture that is feasible for deployment on resource-constrained edge devices in real-world poultry farms. Future work will focus on expanding the dataset, evaluating robustness under more challenging conditions, investigating the impact of input resolution on boundary quality and small-target representation, and further refining the structural modeling for enhanced precision.

Author Contributions

Conceptualization, software, validation, formal analysis, investigation, and writing—original draft preparation, G.-S.H. and K.O.; methodology, writing—review and editing, visualization, supervision, and project administration, K.O. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are openly available in Wayne, B. Dead Chickens Dataset. Universe by Roboflow. Available online: https://universe.roboflow.com/bruce-wayne-wja03/dead-chikens (accessed on 11 October 2024).

Acknowledgments

This paper was supported by Wonkwang University in 2023.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Liu, H.-W.; Chen, C.-H.; Tsai, Y.-C.; Hsieh, K.-W.; Lin, H.-T. Identifying Images of Dead Chickens with a Chicken Removal System Integrated with a Deep Learning Algorithm. Sensors 2021, 21, 3579. [Google Scholar] [CrossRef] [PubMed]
  2. Abdalla, A.; Cen, H.; Wan, L.; Mehmood, K.; He, Y. Nutrient Status Diagnosis of Infield Oilseed Rape via Deep Learning-Enabled Dynamic Model. IEEE Trans. Ind. Inform. 2020, 17, 4379–4389. [Google Scholar] [CrossRef]
  3. Ham, G.-S.; Oh, K. Dead Broiler Detection and Segmentation Using Transformer-Based Dual Stream Network. Agriculture 2024, 14, 2082. [Google Scholar] [CrossRef]
  4. Abdanan Mehdizadeh, S.; Neves, D.P.; Tscharke, M.; Nääs, I.A.; Banhazi, T.M. Image Analysis Method to Evaluate Beak and Head Motion of Broiler Chickens During Feeding. Comput. Electron. Agric. 2015, 114, 88–95. [Google Scholar] [CrossRef]
  5. Pereira, D.F.; Miyamoto, B.C.B.; Maia, G.D.N.; Sales, G.T.; Magalhães, M.M.; Gates, R.S. Machine Vision to Identify Broiler Breeder Behavior. Comput. Electron. Agric. 2013, 99, 194–199. [Google Scholar] [CrossRef]
  6. Neethirajan, S. Recent Advances in Wearable Sensors for Animal Health Management. Sens. Bio-Sens. Res. 2017, 12, 15–29. [Google Scholar] [CrossRef]
  7. Zhuang, X.; Zhang, T. Detection of Sick Broilers by Digital Image Processing and Deep Learning. Biosyst. Eng. 2019, 179, 106–116. [Google Scholar] [CrossRef]
  8. Hao, H.; Fang, P.; Duan, E.; Yang, Z.; Wang, L.; Wang, H. A Dead Broiler Inspection System for Large-Scale Breeding Farms Based on Deep Learning. Agriculture 2022, 12, 1176. [Google Scholar] [CrossRef]
  9. Bist, R.B.; Yang, X.; Subedi, S.; Chai, L. Mislaying Behavior Detection in Cage-Free Hens with Deep Learning Technologies. Poult. Sci. 2023, 102, 102729. [Google Scholar] [CrossRef]
  10. Mollah, M.B.R.; Hasan, M.A.; Salam, M.A.; Ali, M.A. Digital Image Analysis to Estimate the Live Weight of Broiler. Comput. Electron. Agric. 2010, 72, 48–52. [Google Scholar] [CrossRef]
  11. Mortensen, A.K.; Lisouski, P.; Ahrendt, P. Weight Prediction of Broiler Chickens Using 3D Computer Vision. Comput. Electron. Agric. 2016, 123, 319–326. [Google Scholar] [CrossRef]
  12. Amraei, S.; Mehdizadeh, S.A.; Nääs, I.D.A. Development of a Transfer Function for Weight Prediction of Live Broiler Chicken Using Machine Vision. Eng. Agric. 2018, 38, 776–782. [Google Scholar] [CrossRef]
  13. Ye, C.-W.; Yu, Z.-W.; Kang, R.; Yousaf, K.; Qi, C.; Chen, K.-J.; Huang, Y.-P. An Experimental Study of Stunned State Detection for Broiler Chickens Using an Improved Convolution Neural Network Algorithm. Comput. Electron. Agric. 2020, 170, 105284. [Google Scholar] [CrossRef]
  14. Mansor, M.A.; Baki, S.R.M.S.; Tahir, N.M.; Rahman, R.A. An Approach of Halal Poultry Meat Comparison Based on Mean-Shift Segmentation. In Proceedings of the IEEE Conference on Systems, Process Control (ICSPC), Malacca, Malaysia, 13–15 December 2013; IEEE: New York, NY, USA; pp. 279–282. [Google Scholar] [CrossRef]
  15. Alon, A.S.; Marasigan, R.I., Jr.; Nicolas-Mindoro, J.G.; Casuat, C.D. An Image Processing Approach of Multiple Eggs’ Quality Inspection. Int. J. Adv. Trends Comput. Sci. Eng. 2019, 8, 2794–2799. [Google Scholar] [CrossRef]
  16. Neethirajan, S.; Tuteja, S.K.; Huang, S.T.; Kelton, D. Recent Advancement in Biosensors Technology for Animal and Livestock Health Management. Biosens. Bioelectron. 2017, 98, 398–407. [Google Scholar] [CrossRef]
  17. Syauqi, M.N.; Zaffrie, M.M.A.; Hasnul, H.I. Broiler Industry in Malaysia. Available online: http://ap.fftc.agnet.org/files/ap_policy/532/532_1.pdf (accessed on 25 January 2018).
  18. Okinda, C.; Lu, M.; Liu, L.; Nyalala, I.; Muneri, C.; Wang, J.; Zhang, H.; Shen, M. A Machine Vision System for Early Detection and Prediction of Sick Birds: A Broiler Chicken Model. Biosyst. Eng. 2019, 188, 229–242. [Google Scholar] [CrossRef]
  19. LeCun, Y.; Bengio, Y.; Hinton, G. Deep Learning. Nature 2015, 521, 436–444. [Google Scholar] [CrossRef] [PubMed]
  20. Zhou, B.; Khosla, A.; Lapedriza, A.; Oliva, A.; Torralba, A. Learning Deep Features for Discriminative Localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; IEEE: New York, NY, USA; pp. 2921–2929. [Google Scholar] [CrossRef]
  21. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Springer: Berlin/Heidelberg, Germany, 2015; Proceedings, Part III; Volume 18, pp. 234–241. [Google Scholar] [CrossRef]
  22. Chen, L.-C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; Springer: Berlin/Heidelberg, Germany; pp. 801–818. [Google Scholar] [CrossRef]
  23. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. arXiv 2020. [Google Scholar] [CrossRef]
  24. Ham, G.S.; Oh, K. Learning Spatial Configuration Feature for Landmark Localization in Hand X-rays. Electronics 2023, 12, 4038. [Google Scholar] [CrossRef]
  25. van der Eijk, J.A.J.; Guzhva, O.; Voss, A.; Möller, M.; Giersberg, M.F.; Jacobs, L.; de Jong, I.C. Seeing Is Caring—Automated Assessment of Resource Use of Broilers with Computer Vision Techniques. Front. Anim. Sci. 2022, 3, 945534. [Google Scholar] [CrossRef]
  26. Yang, J.; Zhang, T.; Fang, C.; Zheng, H.; Ma, C.; Wu, Z. A Detection Method for Dead Caged Hens Based on Improved YOLOv7. Comput. Electron. Agric. 2024, 226, 109388. [Google Scholar] [CrossRef]
  27. Bao, Y.; Lu, H.; Zhao, Q.; Yang, Z.; Xu, W. Detection System of Dead and Sick Chickens in Large Scale Farms Based on Artificial Intelligence. Math. Biosci. Eng. 2021, 18, 6117–6135. [Google Scholar] [CrossRef] [PubMed]
  28. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; IEEE: New York, NY, USA; pp. 2961–2969. [Google Scholar] [CrossRef]
  29. Li, P.; Zhang, T.; Wang, X.; Liu, J.; Huang, Y. Detection of Sick Laying Hens by Infrared Thermal Imaging and Deep Learning. J. Phys. Conf. Ser. 2021, 2025, 012008. [Google Scholar] [CrossRef]
  30. Massari, J.M.; de Moura, D.J.; de Alencar Nääs, I.; Pereira, D.F.; Branco, T. Computer-Vision-Based Indexes for Analyzing Broiler Response to Rearing Environment: A Proof of Concept. Animals 2022, 12, 846. [Google Scholar] [CrossRef]
  31. Wayne, B. Dead Chickens Dataset. Available online: https://universe.roboflow.com/bruce-wayne-wja03/dead-chikens (accessed on 11 October 2024).
  32. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2016, Las Vegas, NV, USA, 27–30 June 2016; IEEE: New York, NY, USA; pp. 770–778. [Google Scholar] [CrossRef]
  33. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS), Long Beach, CA, USA, 4–9 December 2017; pp. 6000–6010. [Google Scholar] [CrossRef]
  34. Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. In Proceedings of the International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 6–9 May 2019. [Google Scholar] [CrossRef]
  35. Long, J.; Shelhamer, E.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; IEEE: New York, NY, USA; pp. 3431–3440. [Google Scholar] [CrossRef]
  36. Chaurasia, A.; Culurciello, E. LinkNet: Exploiting Encoder Representations for Efficient Semantic Segmentation. In Proceedings of the IEEE Visual Communications and Image Processing (VCIP), St. Petersburg, FL, USA, 10–13 December 2017; IEEE: New York, NY, USA; pp. 1–4. [Google Scholar] [CrossRef]
Figure 1. Example image pairs from the Dead Broiler Dataset. Each pair shows the original image (left) and the ground-truth dead broiler mask (right).
Figure 2. Proposed network for the segmentation of dead broiler. The input image is encoded by a pre-trained ResNet-50, enriched with positional features, and reshaped for transformer processing. The transformer refines global contextual representations, after which the features are restored to spatial form, fused through a skip-connection, and decoded by a CNN to produce the final segmentation mask. In addition, the model incorporates an auxiliary regression branch that uses the refined transformer features to estimate the total number of dead broilers.
Figure 3. Overview of the transformer block with multi-head attention. The figure illustrates the process of recalibrating encoded features using a transformer block, which includes layer normalization, multi-head attention, and a multi-layer perceptron (MLP).
Figure 4. Box plots of segmentation performance across the different methods. The figure summarizes the IoU, precision, recall, and F-measure distributions for U-Net, FCN, LinkNet, DeepLabV3, Dual-StreamNet, and the proposed method. The proposed approach shows a higher central value and a tighter spread in most metrics, indicating stronger and more stable segmentation performance than the other models.
Figure 5. Visualization results of the proposed segmentation framework across diverse real-world scenes in poultry houses. Left to right: Original image, ground-truth segmentation mask, predicted segmentation mask, and overlaid visualization, where green and red contours denote the ground-truth and predicted boundaries, respectively.
Table 1. Segmentation performance comparison between the proposed method and several existing approaches. Metrics are reported as mean (std.).

| Method | IoU | Precision | Recall | F-Measure |
|---|---|---|---|---|
| U-Net | 82.99 (0.77) | 87.86 (0.76) | 88.80 (0.75) | 88.32 (1.11) |
| FCN | 82.34 (1.64) | 82.74 (0.61) | 91.49 (0.68) | 86.89 (0.77) |
| LinkNet | 82.00 (1.71) | 82.18 (0.55) | 88.99 (1.52) | 85.17 (0.81) |
| DeepLabV3 | 83.92 (0.56) | 91.28 (0.87) | 89.79 (0.97) | 90.53 (0.98) |
| Dual-StreamNet | 84.79 (1.59) | 89.13 (1.19) | 94.29 (0.57) | 91.64 (1.27) |
| Proposed method | 87.39 (1.15) | 94.27 (0.87) | 92.84 (1.00) | 93.20 (1.47) |
Table 2. Performance comparison between the two variants of the proposed model. The first variant removes the regression block, while the full model includes a counting branch that operates on the transformer token.

| Method | IoU | Precision | Recall | F-Measure |
|---|---|---|---|---|
| Proposed method without regression block | 86.88 (1.36) | 92.11 (1.04) | 93.42 (0.58) | 92.76 (1.19) |
| Proposed method | 87.39 (1.15) | 94.27 (0.87) | 92.84 (1.00) | 93.20 (1.47) |
Table 3. Performance comparison across different ResNet backbones. Each backbone provides a distinct feature scale, and the transformer encoder operates on the corresponding token representation.

| Backbone | IoU | Precision | Recall | F-Measure |
|---|---|---|---|---|
| ResNet18 | 86.82 (0.55) | 89.23 (0.77) | 94.43 (1.06) | 91.76 (1.33) |
| ResNet34 | 87.04 (1.32) | 91.58 (0.79) | 93.58 (1.06) | 92.57 (0.98) |
| ResNet50 | 87.39 (1.15) | 94.27 (0.87) | 92.84 (1.00) | 93.20 (1.47) |
| ResNet101 | 87.58 (1.41) | 92.60 (0.86) | 93.60 (1.53) | 93.07 (1.69) |
Table 4. Performance comparison with respect to the number of attention heads in the transformer encoder.

| Number of Attention Heads | IoU | Precision | Recall | F-Measure |
|---|---|---|---|---|
| 2 | 86.99 (0.51) | 93.32 (0.77) | 92.18 (1.16) | 92.75 (1.17) |
| 4 | 87.12 (1.15) | 93.17 (0.87) | 91.14 (1.00) | 92.23 (1.47) |
| 8 | 87.39 (1.15) | 94.27 (1.25) | 92.84 (1.10) | 93.54 (1.46) |
| 10 | 87.18 (1.41) | 91.90 (1.06) | 93.90 (0.53) | 92.89 (0.93) |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
