Article

Precision Livestock Farming: YOLOv12-Based Automated Detection of Keel Bone Lesions in Laying Hens

1 Istituto Zooprofilattico Sperimentale delle Venezie, 35020 Legnaro, Italy
2 Department of Agronomy, Food, Natural Resources, Animals and Environment (DAFNAE), University of Padova, 35020 Legnaro, Italy
3 Department of Comparative Biomedicine and Food Science (BCA), University of Padova, 35020 Legnaro, Italy
4 Servizio Veterinario SIAOA, Az. ULSS 5 Polesana, 45100 Rovigo, Italy
* Author to whom correspondence should be addressed.
Poultry 2025, 4(4), 43; https://doi.org/10.3390/poultry4040043
Submission received: 11 July 2025 / Revised: 20 August 2025 / Accepted: 16 September 2025 / Published: 24 September 2025

Abstract

Keel bone lesions (KBLs) represent a relevant welfare concern in laying hens, arising from complex interactions among genetics, housing systems, and management practices. This study presents the development of an image analysis system for the automated detection and classification of KBLs in slaughterhouse videos, enabling scalable and retrospective welfare assessment. In addition to lesion classification, the system can track and count individual carcasses, providing estimates of the total number of specimens with and without significant lesions. Videos of brown laying hens from a commercial slaughterhouse in northeastern Italy were recorded on the processing line using a smartphone. Six hundred frames were extracted and annotated by three independent observers using a three-scale scoring system. A dataset was constructed by combining the original frames with crops centered on the keel area. To address class imbalance, samples of class 1 (damaged keel bones) were augmented by a factor of nine, compared to a factor of three for class 0 (no or mild lesion). A YOLO-based model was trained for both detection and classification tasks. The model achieved an F1 score of 0.85 and a mAP@0.5 of 0.892. A BoT-SORT tracker was evaluated against human annotations on a 5 min video, achieving an F1 score of 0.882 for the classification task. Potential improvements include increasing the number and variability of annotated images, refining annotation protocols, and enhancing model performance under varying slaughterhouse lighting and positioning conditions. The model could be applied in routine slaughter inspections to support welfare assessment in large populations of animals.

1. Introduction

In recent years, increasing consumer awareness and demand for ethically sustainable food products have prompted the poultry industry to strengthen its commitment to improving animal welfare standards. Indeed, the evolution of EU egg production systems has been strongly influenced by consumer choices and expectations, as well as legislative developments. This shift strongly reflects a growing societal emphasis on responsible production practices and has led to the implementation of more rigorous welfare protocols across the supply chain [1]. Animal-based measures (ABMs) are direct assessments conducted on the animals themselves that provide an indication of the impact that structural- or management-related factors within the farming system have on animal welfare [2]. These direct parameters, when combined with indirect (environmental or resource-based) measures, are attracting growing interest within the scientific community and are expected to be among the most widely employed indicators in future welfare assessment approaches, which are increasingly integrated and multidisciplinary. In laying hens, current EU legislation does not explicitly require post mortem ABM evaluation as it does for footpad lesions in broilers [3]. However, the application of such measures at the slaughterhouse shows considerable promise for monitoring welfare levels during the production cycle [4]. Among the most relevant ABMs for laying hens are keel bone and footpad lesions [5]. Keel bone lesions (KBLs) represent primary indicators of compromised welfare in laying hens [2]. The KBL incidence is highly variable but generally very high; it can reach values up to 90% at the end of the production cycle [6,7]. KBLs are associated with bone fragility and include keel bone deviations and fractures; the former are more likely associated with the pressure load due to prolonged stationing on perches, while the latter usually occur as the result of collisions between animals and the facilities [6]. The etiology of keel bone fractures is multifactorial, and the underlying causes are not yet fully understood. Historically, deviations and fractures have been associated with the high productivity of laying hens: increased egg production, which demands substantial calcium availability, leads to reduced bone mineralization and predisposes hens to greater bone fragility [7,8]. This issue became increasingly apparent following the abolition of conventional cage systems [7]: some authors reported a higher risk in cage-free systems due to a greater susceptibility to traumas [9,10], while a recent study in Italy found a lower risk compared to enriched cages, presumably due to the reinforcement of the skeletal system because of the higher freedom of movement [11]. The evaluation of KBLs in laying hens can be conducted either on-farm or at the slaughterhouse using various methodologies. Some methods for detecting KBLs include dissection, radiography, ultrasound, or computed tomography [12]. However, the most commonly employed approach remains palpation, since it does not require dedicated equipment or advanced training. The on-farm assessment, however, requires a lot of time and involves the capture and handling of animals, which induces stress. To overcome these challenges, slaughterhouse-based evaluation has emerged as a more practical and feasible alternative [7,11]. Automating this task, through the use of computer vision and deep learning, would facilitate the workflow and enable the analysis of a larger number of animals [13,14]. 
For this reason, the objective of the present study was to develop an image analysis system based on computer vision and deep learning techniques, capable of automatically detecting KBLs and classifying their severity, using video data collected from keel bones of laying hens on the slaughter line.

2. Materials and Methods

The following section details the methodology adopted for data collection, annotation, dataset preparation, model training, tracking implementation, and evaluation metrics in the context of automated sternum lesion classification in laying hens. Generative AI tools were used to assist with study design, Python 3.12.2 coding, interpretation of results, and English-language text revision.

2.1. Data Collection and Annotation Protocol

Videos were recorded during official inspections at the slaughterhouse from August to November 2024 using an iPhone 12 (Apple Inc., Cupertino, CA, USA) mounted on an adjustable-height tripod positioned approximately 1–1.5 m from the slaughter line. Recordings focused on carcasses of laying hens of various brown-feathered genotypes, suspended on the slaughter line after bleeding, evisceration, and defeathering stages. Carcasses were filmed frontally from a fixed angle under consistent lighting conditions, maintaining a stationary camera setup. In the preliminary phase, both video data and visual evaluations of keel lesions were collected using simple sampling sheets, consisting of a column for the specimen number and the score assigned by the operator, completed by three trained operators. Each operator annotated 200 consecutive keel lesions. Subsequent inspections focused solely on video acquisition.
The offline annotation process was carried out independently by the same three annotators (after proper training) using the freemium version of Roboflow software [15].

2.1.1. Preliminary Phase

After video recording, 600 frames were extracted, one for each sternum that had previously been annotated in person at the slaughterhouse. The specimen of interest was always positioned at the center of the frame (almost always with other specimens also present in the same frame). An operator ensured that the specimen was clearly identifiable in the video by pointing at it (Figure 1).
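As an illustration of this extraction step, the sketch below grabs a frame at a given timestamp from a recorded video; it assumes OpenCV (cv2) and uses hypothetical file names and timestamps, since the actual extraction tooling is not reported.

```python
import cv2

def extract_frame(video_path: str, time_s: float, out_path: str) -> bool:
    """Grab the frame closest to time_s (in seconds) and save it as an image."""
    cap = cv2.VideoCapture(video_path)
    cap.set(cv2.CAP_PROP_POS_MSEC, time_s * 1000.0)  # seek to the requested time
    ok, frame = cap.read()
    if ok:
        cv2.imwrite(out_path, frame)
    cap.release()
    return ok

# Hypothetical usage: one frame per sternum scored in person at the slaughterhouse
timestamps_s = [12.4, 18.9, 25.1]  # placeholder timestamps, one per specimen of interest
for i, t in enumerate(timestamps_s):
    extract_frame("inspection_video.mp4", t, f"frame_{i:04d}.jpg")
```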
The extracted frames were then annotated using Roboflow software. In this phase, each operator annotated all 600 frames, assigning to each image a class corresponding to the condition of the sternum indicated by the operator in the video. The annotation criteria were as follows: Score 0 was assigned to sterna that appeared linear, free of callus formation, tuberosity, or any signs of fracture. Score 1 was assigned to sterna that were deviated, with the presence of callus formation, tuberosity, or evident fracture.
First round analyses revealed that the agreement between video-based and in-person scores, as well as between different annotators, was not sufficiently high to ensure reliable supervision for model training. Detailed metrics on inter- and intra-annotator agreement are reported in Appendix A. These results motivated the adoption of a revised scoring system, described in the next section.

2.1.2. Revised Annotation Criteria

At this stage, it was decided to revise the annotation criteria to improve inter-annotator agreement. The new criteria were designed to reduce subjectivity and ambiguity in the classification process, thereby enhancing consistency among annotators.
According to the revised annotation criteria, Score 0 was assigned to sterna that appeared linear or only slightly deviated, with minimal callus formation or tuberosity. In contrast, Score 1 was assigned to sterna exhibiting marked deviation (S- or C-shaped) and/or pronounced callus formation and/or evident fracture.
Applying the revised criteria to the same set of 600 images resulted in a new confusion matrix for inter-annotator agreement (Figure 2). The agreement rates between annotator pairs were 81.2% (G–F), 86.7% (G–R), and 86.8% (F–R), where each letter represents a different annotator. Additionally, misclassifications were slightly more symmetric between classes compared to the first annotation phase.
At this point, other agreement metrics were calculated, such as Fleiss’s Kappa, Krippendorff’s Alpha, and all the F1 scores obtained by using the offline scores assigned by each annotator as ground truth.
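As a minimal sketch of how these agreement metrics can be computed, assuming the statsmodels, krippendorff, and scikit-learn packages and hypothetical score arrays (the exact tooling used in the study is not specified):

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa
import krippendorff
from sklearn.metrics import f1_score

# Hypothetical binary scores (0/1) assigned offline by the three annotators (G, R, F)
scores_G = np.array([0, 1, 0, 0, 1, 1])
scores_R = np.array([0, 1, 0, 1, 1, 1])
scores_F = np.array([0, 1, 0, 0, 1, 0])
ratings = np.column_stack([scores_G, scores_R, scores_F])  # shape: (n_items, n_raters)

# Fleiss's Kappa: convert to per-item category counts first
counts, _ = aggregate_raters(ratings)
print("Fleiss's Kappa:", fleiss_kappa(counts))

# Krippendorff's Alpha for nominal data (raters as rows, items as columns)
print("Krippendorff's Alpha:",
      krippendorff.alpha(reliability_data=ratings.T, level_of_measurement="nominal"))

# Pairwise F1 scores, using each annotator in turn as ground truth
print("F1 (GT=G vs R):", f1_score(scores_G, scores_R))
print("F1 (GT=G vs F):", f1_score(scores_G, scores_F))
print("F1 (GT=R vs F):", f1_score(scores_R, scores_F))
```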

2.2. Dataset Preparation

The final dataset included the same 600 images. Each frame contained about 4–5 specimens, and the annotators annotated each specimen individually, drawing a bounding box and assigning a class label to each one.
Since the goal at this stage was to annotate all specimens present in each frame—not just the central one—we encountered cases where certain keels could not be reliably classified due to specimen rotation, suboptimal illumination, or occlusions. To address these situations, we introduced an additional class, Not Classifiable (NC), to be assigned whenever the condition of the keel bone could not be determined with sufficient confidence. This approach ensured that ambiguous or unclassifiable specimens were systematically accounted for during the annotation process.
The final composition of the dataset showed a strongly unbalanced class distribution: class 0 was the most frequent (1804 specimens), followed by class 1 (549 specimens), while a substantial number of specimens were labeled as Not Classifiable (NC, 427 specimens).

2.2.1. Postprocessing: NC Class Removal

To streamline the dataset, all sterna labeled as Not Classifiable (NC) were removed from the images. This choice was motivated by two factors: (i) the model can leverage multi-frame information to resolve ambiguous cases, making explicit learning of an NC class unnecessary; (ii) in the videos on which the model will be deployed, each specimen is visible in many frames, reducing the chance that a sternum remains unclassifiable throughout.
Practically, in the original frames, we masked all NC bounding boxes and their surroundings with black rectangles—each mask was centered on the NC bounding box, with height three times and width seven times that of the original box (see Figure 3).
Corresponding NC annotations were also deleted from the labels folder.
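A minimal sketch of this masking step is given below, assuming OpenCV-style images as NumPy arrays and YOLO-format labels (class, x_center, y_center, width, height, all normalized); the NC class index and function name are hypothetical.

```python
import numpy as np

NC_CLASS = 2  # hypothetical index of the Not Classifiable class

def mask_nc_boxes(image: np.ndarray, labels: list[list[float]]):
    """Black out each NC box, enlarged 7x in width and 3x in height, and drop its label."""
    h, w = image.shape[:2]
    kept = []
    for cls, xc, yc, bw, bh in labels:  # YOLO format, normalized coordinates
        if int(cls) != NC_CLASS:
            kept.append([cls, xc, yc, bw, bh])
            continue
        # Mask centered on the NC box, 7x wider and 3x taller than the original box
        mw, mh = bw * 7.0, bh * 3.0
        x1 = max(int((xc - mw / 2) * w), 0)
        x2 = min(int((xc + mw / 2) * w), w)
        y1 = max(int((yc - mh / 2) * h), 0)
        y2 = min(int((yc + mh / 2) * h), h)
        image[y1:y2, x1:x2] = 0  # black rectangle
    return image, kept
```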

2.2.2. Data Augmentation Strategy

At this stage, only the training dataset was modified. Specifically, the original frames were further subdivided into individual images, each containing a single specimen, using an approach analogous to that detailed in Section 2.2.1. This allowed for the creation of a set of single-specimen frames, facilitating class-specific augmentation pipelines.
The final training dataset also included the original frames to help the model learn from multi-specimen contexts, reflecting the real-world deployment scenario.
The augmentation pipeline was implemented using the Albumentations library [16], which allows for efficient and flexible image transformations.
The augmentation pipeline included 15 transformations, as summarized in Table 1.
The augmentation strategy was designed to address the class imbalance in the dataset. Therefore, we applied a probabilistic combination of transformations both to the frames depicting singular specimens and the original frames depicting multiple specimens, so that
  • Original frames were augmented by a factor of 3;
  • Class 0 specimens were augmented by a factor of 3;
  • Class 1 specimens were augmented by a factor of 9.
The final class distribution of the training dataset was 2910 occurrences of class 0 and 2637 occurrences of class 1. Although the augmentation strategy was implemented manually with the Albumentations library, we retained the mosaic augmentation (mosaic = 0.5) provided by the YOLO training pipeline.
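The following sketch illustrates how such a class-specific, probabilistic pipeline can be assembled with Albumentations for bounding-box-aware augmentation; the transformation parameters and probabilities shown are illustrative, not those actually used in the study.

```python
import albumentations as A
import cv2

# Probabilistic pipeline built from a subset of the transformations in Table 1
# (parameters and probabilities are illustrative)
transform = A.Compose(
    [
        A.HorizontalFlip(p=0.5),
        A.VerticalFlip(p=0.2),
        A.RandomBrightnessContrast(p=0.5),
        A.ColorJitter(p=0.3),
        A.GaussianBlur(p=0.2),
        A.Rotate(limit=15, p=0.3),
        A.GaussNoise(p=0.2),
    ],
    bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
)

def augment(image_path, bboxes, class_labels, n_copies):
    """Generate n_copies augmented versions of one image (e.g., 3 for class 0, 9 for class 1)."""
    image = cv2.imread(image_path)
    out = []
    for _ in range(n_copies):
        aug = transform(image=image, bboxes=bboxes, class_labels=class_labels)
        out.append((aug["image"], aug["bboxes"], aug["class_labels"]))
    return out
```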

2.3. Model Architecture and Training

The final dataset was used to train a YOLOv12 model [17]. The main hyperparameters used for training are summarized in Table 2.
The training and inference were deployed on an Azure virtual machine equipped with an NVIDIA A100 GPU with 80 GB of memory.
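Training with the settings in Table 2 can be reproduced along the lines of the sketch below, assuming the Ultralytics Python API; the model variant, dataset configuration file, and paths are placeholders.

```python
from ultralytics import YOLO

# Load a pretrained YOLOv12 checkpoint (the variant chosen here is a placeholder)
model = YOLO("yolo12s.pt")

# Train with the settings reported in Table 2
results = model.train(
    data="keel_dataset.yaml",  # hypothetical dataset configuration file
    imgsz=640,                 # input image size
    epochs=250,
    patience=100,              # early stopping
    batch=0.8,                 # auto batch size targeting ~80% of GPU memory
    mosaic=0.5,                # mosaic augmentation probability
    iou=0.7,                   # IoU threshold for NMS
    device=0,                  # single NVIDIA A100 GPU
)
```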

2.4. Tracking Implementation

After model training, we implemented the tracking system using the BoT-SORT tracker [18]. We then visually validated the performance of the tracker on a 5 min video containing a total of 256 specimens and tuned the tracker parameters to maximize ID consistency and tracking stability.
In the tracking validation phase, a human operator was asked to
  • Watch the entire five-minute video (without predictions to avoid any bias) and count the occurrences of class 0 and class 1.
  • Watch the entire five-minute video (with predictions) and verify the agreement between their own predictions and the predictions of the tracker.
The final set of parameters used for the BoT-SORT tracker is summarized in Table 3.
The tracker output (bounding box coordinates, class labels, IDs, and confidence scores) was postprocessed by filtering out all IDs detected in fewer than 20 frames in order to minimize false positives. The remaining results were then used to generate a final output file containing the predictions for each specimen across all frames. The output file was structured as follows:
  • Individual classification section: for each detected ID, the class label was extracted, along with the mean class prediction across all frames in which the ID was detected, the mean confidence of the predictions and its standard deviation, and the total number of frames in which the ID was detected;
  • Final statistics section: the total number of detected specimens was obtained, together with the class distribution of the detected specimens (determined as the mode of the class predictions across all frames) and a modified class distribution excluding “uncertain” predictions. Specifically, predictions for which the mean class prediction (a value in [0, 1]) fell within the range [0.4, 0.6] were considered uncertain. This threshold was selected a posteriori to match the uncertainty rate reported by the human annotator, who, during the validation phase, indicated the proportion of “borderline” sterna. A minimal sketch of this postprocessing is given after this list.
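The tracking run and the per-ID postprocessing described above can be sketched as follows; this assumes the Ultralytics tracking interface with a BoT-SORT configuration file (Table 3) and uses hypothetical file names, and it is an illustrative reconstruction rather than the authors' exact script.

```python
from collections import defaultdict
import numpy as np
from ultralytics import YOLO

MIN_FRAMES = 20          # discard IDs seen in fewer than 20 frames
UNCERTAIN = (0.4, 0.6)   # mean class prediction range treated as uncertain

model = YOLO("best.pt")  # hypothetical path to the trained weights
per_id_classes = defaultdict(list)
per_id_confs = defaultdict(list)

# Stream the video through the tracker (BoT-SORT settings supplied via a YAML file, see Table 3)
for result in model.track("validation_5min.mp4", tracker="botsort.yaml", stream=True):
    if result.boxes.id is None:
        continue
    for track_id, cls, conf in zip(result.boxes.id.int().tolist(),
                                   result.boxes.cls.int().tolist(),
                                   result.boxes.conf.tolist()):
        per_id_classes[track_id].append(cls)
        per_id_confs[track_id].append(conf)

# Per-ID aggregation and final statistics
counts = {0: 0, 1: 0, "uncertain": 0}
for track_id, classes in per_id_classes.items():
    if len(classes) < MIN_FRAMES:
        continue  # likely a spurious or double detection
    mean_class = float(np.mean(classes))               # in [0, 1] for a binary problem
    if UNCERTAIN[0] <= mean_class <= UNCERTAIN[1]:
        counts["uncertain"] += 1
    else:
        counts[int(np.bincount(classes).argmax())] += 1  # mode of the per-frame predictions
print(counts)
```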

2.5. Evaluation Metrics

To evaluate the classification performance of the YOLO model, we used the standard statistics output by Roboflow software, which included
  • Absolute and normalized confusion matrices;
  • F1 score for each class;
  • Precision–recall curves for each class;
  • Mean average precision at IoU threshold 0.5 (mAP@50), which was deemed sufficient since the primary focus was on classification accuracy rather than precise localization [19].
To evaluate the performance of the tracking system, we compared the predictions of the tracker with the ground truth annotations provided by the human operator during the validation phase. The evaluation metrics used for tracking performance included
  • Number of stable IDs: the number of IDs that were consistently tracked across multiple frames, as described in Section 2.4;
  • Confusion matrices obtained comparing the tracker predictions with the human annotations, using the latter as ground truth and filtering out uncertain predictions as defined in Section 2.4;
  • F1 scores for each class and the overall F1 score, computed using the human annotator’s labels as ground truth (GT); a minimal sketch of this comparison is given below.
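As a sketch of this comparison, assuming the per-specimen labels from the tracker and the human operator are available as aligned lists and that scikit-learn is used for the metrics; the averaging choice for the overall F1 score is an assumption.

```python
from sklearn.metrics import confusion_matrix, f1_score

# Hypothetical aligned per-specimen labels (0, 1, or "uncertain")
human = [0, 0, 1, 0, "uncertain", 1, 0]
tracker = [0, 1, 1, 0, 0, 1, "uncertain"]

# Keep only specimens that neither source flagged as uncertain (see Section 2.4)
pairs = [(h, t) for h, t in zip(human, tracker) if h != "uncertain" and t != "uncertain"]
y_true = [h for h, _ in pairs]
y_pred = [t for _, t in pairs]

print(confusion_matrix(y_true, y_pred))                     # rows: human GT, columns: tracker
print("F1 class 1:", f1_score(y_true, y_pred, pos_label=1))
print("F1 class 0:", f1_score(y_true, y_pred, pos_label=0))
print("Overall F1:", f1_score(y_true, y_pred, average="weighted"))
```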

3. Results

3.1. Inter-Annotator Agreement

The annotation criterion described in Section 2.1.2 resulted in the following inter-annotator reliability metrics: Fleiss’s Kappa of 0.63 (substantial agreement [20]), Krippendorff’s Alpha of 0.63 (insufficient reliability, not recommended for robust conclusions [21]), and the F1 scores summarized in Table 4.
Even though the agreement metrics showed limited concordance among annotators, we decided to proceed with model training, keeping in mind this important a priori limitation in the reliability of the annotated data.

3.2. YOLO Model Detection and Classification Performance

As can be observed in Figure 4, the model, trained on the dataset with class 0 and class 1 only, achieved a high accuracy in predicting class 0 (94%) but struggled with class 1 (only 73% of correct predictions), showing a significant rate of false negatives (Figure 4a). This result was predictable given the strong class imbalance in the dataset described in Section 2.2. Moreover, the model misclassified the background as class 0 or class 1 in 40 cases, as shown in Figure 4b. By manually inspecting these cases, we found that the model detected keel bones that the annotators had not labeled because the specimens were at the edges of the images. This was a desirable result, since the model should be able to generalize well even to partially occluded lesions.
As shown in Figure 5a, the model achieved an F1 score of 0.85 for all classes. Once again we can observe a lower F1 curve for class 1 due to the class imbalance in the dataset.
This result is considered acceptable from a practical perspective as, in this study, we deliberately prioritized minimizing false positives over minimizing false negatives. This design choice aims to reduce the risk of incorrectly penalizing farmers for poor animal conditions, even though it may come at the cost of under-detecting some true cases. Therefore, occasional misses of class 1 instances are deemed more tolerable than the risk of over-detection, under the current operational constraints and intended use of the system.
Similar results were obtained for the precision–recall curves shown in Figure 5b, with a mAP@50 of 0.892 for all classes, a mAP@50 of 0.966 for class 0, and, as expected, a lower mAP@50 of 0.817 for class 1.

3.3. Tracking Performance

We evaluated the tracking algorithm by comparing its outputs to the manual annotations provided independently by a human operator on the validation video. An example of the tracker output is given in Figure 6.

3.3.1. BoT-SORT Tracking Performance

By using the script described in Section 2.4, we obtained the class distribution output by the tracker:
  • Class 0: 207 specimens (77.8%);
  • Class 1: 59 specimens (22.2%);
  • Total number of detections: 266.

3.3.2. Comparison with Manual Tracking and Postprocessing Adjustment

The initial tracker output was compared with the manual annotations collected during the validation phase. During this process, the operator identified 8 out of 251 predictions (≈3.2%) as uncertain; these were cases where the classification was ambiguous based on visual inspection. Notably, these uncertain cases often coincided with instances where the tracker frequently switched between class 0 and class 1, resulting in classification flickering.
The comparison between the refined tracker output, in which uncertain predictions were filtered out, and the manual annotations produced the confusion matrices shown in Figure 7. The class distributions and the F1 scores are summarized in Table 5.
We can observe that the tracker “sees” more specimens than the human operator (266 vs. 251), probably due to the presence of “double detections” that the filter described in Section 2.4 fails to remove.

4. Discussion

Nowadays, the adoption of technological innovation is expanding across numerous fields, including animal husbandry. Since the early 2000s, interest in precision livestock farming (PLF) has steadily increased, as evidenced by the growing number of scientific publications in this area [22]. In the poultry sector, the application of artificial intelligence (AI) for animal welfare research holds significant potential as a tool to enhance sustainability in farming practices. Technological advancements applicable to poultry farming include the use of sensors for environmental monitoring, assessment of animal movement and health status (e.g., through vocalization analysis), and the development of novel imaging technologies for detecting gait abnormalities, behavioral issues, and welfare concerns [23]. Automated detection systems have also been implemented in slaughterhouses to support official inspection activities. For example, as early as 2013, a prototype system was developed for the automated assessment of footpad lesions in broilers on the slaughter line [13]. Recently, image analysis technologies have been applied during post mortem inspection of broilers to enable rapid and objective identification of breast myopathies such as wooden breast syndrome [24].
The use of automated systems for lesion detection and scoring at the slaughterhouse can serve as a valuable support tool for veterinary inspection activities, facilitating the identification of indicators of poor welfare linked to the on-farm rearing phase. Furthermore, given the increasing consumer interest in “welfare-friendly” products, the development of automated lesion detection systems at slaughters could assist in the future definition of voluntary animal welfare certification schemes. It is also important to consider that the high degree of uniformity in laying hen size achieved over the years through genetic selection makes these animals particularly suitable for automated evaluation [25].
Unlike other approaches that rely on specialized hardware such as 3D cameras for single-bird assessment [25], our system operates on standard 2D video data and simultaneously processes multiple specimens per frame, thus promoting reproducibility and facilitating cost-effective integration into slaughterhouse routines. In this context, our study demonstrates that it is possible to train a pre-trained deep learning model such as YOLO to closely approximate human performance in detecting and classifying KBLs, even when abnormalities in lighting exposure or carcass rotation occur. The initial quality of the dataset was notably limited by low inter-annotator agreement, despite a deliberately simplified and revised scoring system designed to reduce subjectivity. This, together with a pronounced class imbalance, posed a considerable challenge for supervised training. Nonetheless, the model achieved a performance trade-off that can be considered acceptable given the intended application: in particular, the relatively lower sensitivity for more severe lesions (class 1) is balanced by a low rate of false positives, reducing the risk of incorrectly penalizing farmers. On the other hand, automated detection should be used only as a support to official veterinary inspections, rather than as a replacement, in order to ensure a thorough and exhaustive welfare assessment of each slaughtered batch. Beyond this supporting role, such technology could also guide competent authorities toward targeted on-farm inspections in response to the most critical issues identified at the slaughterhouse. As iceberg indicators [26], keel bone lesions can support a comprehensive assessment of on-farm animal welfare, enabling welfare monitoring and the adoption of suitable corrective measures when needed. In addition, the automated and standardized analysis of thousands of images offers a powerful tool for advancing animal welfare research. The effectiveness of the proposed approach is further supported by the performance of the tracking system and its validation against human annotations on a full-length video, which showed that the tracker not only maintained stable predictions over time but also provided class prevalence estimates comparable to those obtained by expert operators. Importantly, despite starting from a relatively low inter-annotator agreement, the tracker still achieved a “good” F1 score when compared to manual annotations. While it occasionally misclassified individual samples, confusing class 1 with class 0 and vice versa, these errors were balanced across categories. As a result, the net error in class counts remained low (≈1.2%), enabling the detector to provide reliable estimates of class cardinalities without introducing systematic biases that could disproportionately affect stakeholders, such as farmers.
Future developments should focus on increasing the size and variability of the annotated dataset, especially to mitigate class imbalance. Additionally, improvements in the annotation protocol are warranted: one possible direction involves adopting a consensus-based procedure in which annotators examine the same set of images and agree on a common label or alternatively aggregating independent annotations through statistical fusion (e.g., majority vote or probabilistic averaging). Another promising avenue could involve shifting from discrete classification to metric-based assessment via instance segmentation, extracting objective features of each keel bone—such as geometric measurements, color patterns, or shape variability—and feeding these into downstream classifiers to enhance transparency, explainability, and robustness of the system.

5. Conclusions

This study demonstrates the feasibility of using a deep-learning-based system for automated detection and classification of KBLs in laying hens at slaughter. Leveraging standard 2D video and a fine-tuned YOLO model, the system achieved performance comparable to human operators, despite challenges such as low inter-annotator agreement and class imbalance. Its ability to process multiple carcasses simultaneously and its integration with a reliable tracking module highlight its potential for scalable, cost-effective deployment in slaughterhouse environments. While not a substitute for official veterinary inspections, the proposed approach can support welfare monitoring and contribute to the possible development of objective certification schemes. A current limitation lies in the binary classification criterion adopted in this study, which may oversimplify the variability in keel bone damage. Future work should therefore aim to refine this framework, for example, by involving more annotators, revising the annotation criteria to improve annotation quality, and exploring alternative computer vision strategies such as segmentation and metric-based approaches, whereby measurable features of the keel bone (such as shape, size, or color) are extracted and used for more interpretable and robust classification.

Author Contributions

Conceptualization, G.D.M. and A.T.; methodology, T.B.; formal analysis, R.U. and G.N.; data curation, F.M.; writing—original draft preparation, A.A.; writing—review and editing, M.P. and F.B.; supervision, S.S. and G.M.; project administration, V.T.; funding acquisition, G.D.M. All authors have read and agreed to the published version of this manuscript.

Funding

This research was funded by Ministero della Salute (Project RC 9/2022, CUP: B23C22000930001).

Institutional Review Board Statement

Ethical review and approval were waived for this study, as it was limited to image collection during routine slaughtering procedures. All procedures and animal care were performed in compliance with Council Regulation (EC) n. 1099/2009 on the protection of animals at the time of killing under the supervision of official veterinary services.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset, model weights, and related data are not publicly available due to institutional restrictions.

Acknowledgments

The authors wish to thank the ownership and staff of Delta Group Agroalimentare SRL (Porto Viro, Rovigo, Italy) for giving full access to the facilities and for supporting the work of the research team at the slaughterhouse. The contribution to this research was in kind.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Inter-Annotator Agreement Results for the Preliminary Phase

To assess the level of agreement among the three operators, we first examined the confusion matrix showing the offline scores assigned by each annotator, using as ground truth the corresponding online score recorded by that annotator at the slaughterhouse (Figure A1). Each confusion matrix comprised 200 entries.
Next, we analyzed the confusion matrices comparing the offline scores assigned by each annotator with those assigned by the other annotators in order to evaluate inter-annotator agreement (Figure A2). Each confusion matrix in this analysis included 600 entries, corresponding to the total number of frames annotated by all operators.
The results presented in Figure A1 indicate that annotators R and G demonstrated higher concordance between their offline (video-based) scores and their own online (in-person) scores at the slaughterhouse (84.5% agreement for both R and G), compared to annotator F (78.5% agreement). Notably, annotator F tended to overestimate the number of frames assigned a score of 0 during video annotation; in most cases where F assigned a score of 0 offline, the same frame had been assigned a score of 1 during the online evaluation. This analysis quantifies the subjective error attributable to each annotator, providing insight into the reliability of video-based scoring relative to direct visual assessment at the slaughterhouse.
Figure A1. Confusion matrix of the offline scores assigned by each annotator, using as ground truth the online score given by that annotator at the slaughterhouse. The scoring system used here is the initial one (later revised).
Figure A2. Confusion matrix comparing the offline scores assigned by each annotator with those assigned by the other annotators. The scoring system used here is the initial one (later revised).
The results shown in Figure A2 indicate that the agreement rate between different annotator pairs was 77.7% for G–F, 83.7% for G–R, and 84.3% for R–F. It can also be observed that the agreement is not class-symmetric: in particular, annotator F tends to overestimate class 0. Furthermore, the values on the antidiagonal of the confusion matrices differ greatly, highlighting a marked asymmetry in classification errors. This characteristic may compromise the ability of any model trained on these data to correctly estimate the cardinality of each class.
In this context, discrepancies observed in the video-based annotations can largely be attributed to the quality of the extracted frames. To ensure the correct identification of the carcass to be scored, the operator’s finger indicating the specimen needed to be clearly visible in each frame. However, the optimal frame for annotation was not always the one in which the carcass was most clearly indicated or visible. As a result, some extracted frames were suboptimal for accurate video-based scoring, potentially contributing to the observed differences in annotation reliability among operators.

References

  1. Kollenda, E.; Baldock, D.; Hiller, N.; Lorant, A. Transitioning towards Cage-Free Farming in the EU: Assessment of Environmental and Socio-Economic Impacts of Increased Animal Welfare Standards; Institute for European Environmental Policy (IEEP): Brussels, Belgium, 2020. [Google Scholar]
  2. Welfare Quality®. Welfare Quality® Assessment Protocol for Poultry (Broilers, Laying Hens); Welfare Quality® Consortium: Lelystad, The Netherlands, 2009. [Google Scholar]
  3. Directive—2007/43—EN—EUR-Lex. Available online: https://eur-lex.europa.eu/eli/dir/2007/43/oj/eng (accessed on 17 September 2025).
  4. EURCAW-Poultry-SFA. Indicators at Slaughter to Assess Broiler Welfare on Farm; EURCAW-Poultry-SFA: Maisons-Alfort, France, 2023. [Google Scholar]
  5. EFSA Panel on Animal Health and Animal Welfare (AHAW); Nielsen, S.S.; Alvarez, J.; Bicout, D.J.; Calistri, P.; Canali, E.; Drewe, J.A.; Garin-Bastuji, B.; Gonzales Rojas, J.L.; Gortázar Schmidt, C.; et al. Welfare of Laying Hens on Farm. EFSA J. 2023, 21, e07789. [Google Scholar] [CrossRef]
  6. Petrik, M.T.; Guerin, M.T.; Widowski, T.M. On-Farm Comparison of Keel Fracture Prevalence and Other Welfare Indicators in Conventional Cage and Floor-Housed Laying Hens in Ontario, Canada. Poult. Sci. 2015, 94, 579–585. [Google Scholar] [CrossRef] [PubMed]
  7. Rufener, C.; Makagon, M.M. Keel Bone Fractures in Laying Hens: A Systematic Review of Prevalence across Age, Housing Systems, and Strains. J. Anim. Sci. 2020, 98, S36–S51. [Google Scholar] [CrossRef] [PubMed]
  8. Toscano, M.J.; Dunn, I.C.; Christensen, J.P.; Petow, S.; Kittelsen, K.; Ulrich, R. Explanations for Keel Bone Fractures in Laying Hens: Are There Explanations in Addition to Elevated Egg Production? Poult. Sci. 2020, 99, 4183–4194. [Google Scholar] [CrossRef] [PubMed]
  9. Blatchford, R.A.; Fulton, R.M.; Mench, J.A. The Utilization of the Welfare Quality® Assessment for Determining Laying Hen Condition across Three Housing Systems. Poult. Sci. 2016, 95, 154–163. [Google Scholar] [CrossRef] [PubMed]
  10. Wilkins, L.J.; McKinstry, J.L.; Avery, N.C.; Knowles, T.G.; Brown, S.N.; Tarlton, J.; Nicol, C.J. Influence of Housing System and Design on Bone Strength and Keel Bone Fractures in Laying Hens. Vet. Rec. 2011, 169, 414. [Google Scholar] [CrossRef] [PubMed]
  11. Nalesso, G.; Ciarelli, C.; Menegon, F.; Bordignon, F.; Urbani, R.; Di Martino, G.; Polo, P.; Sparesato, S.; Xiccato, G.; Trocino, A. On-Farm Welfare of Laying Hens: Animal-Based Measures at Slaughterhouse and Risk Factors in Italian Farms. Poult. Sci. 2025, 104, 105152. [Google Scholar] [CrossRef] [PubMed]
  12. Çavuşoğlu, E.; Toscano, M.J.; Gebhardt-Henrich, S.G. Reliability of Palpation Using Three-Dimensional Keel Bone Models. J. Appl. Poult. Res. 2025, 34, 100579. [Google Scholar] [CrossRef]
  13. Vanderhasselt, R.F.; Sprenger, M.; Duchateau, L.; Tuyttens, F.A.M. Automated Assessment of Footpad Dermatitis in Broiler Chickens at the Slaughter-Line: Evaluation and Correspondence with Human Expert Scores. Poult. Sci. 2013, 92, 12–18. [Google Scholar] [CrossRef] [PubMed]
  14. Sozzi, M.; Pillan, G.; Ciarelli, C.; Marinello, F.; Pirrone, F.; Bordignon, F.; Bordignon, A.; Xiccato, G.; Trocino, A. Measuring Comfort Behaviours in Laying Hens Using Deep-Learning Tools. Animals 2023, 13, 33. [Google Scholar] [CrossRef]
  15. Dwyer, B.; Nelson, J.; Hansen, T. Roboflow. Available online: https://roboflow.com/ (accessed on 17 September 2025).
  16. Buslaev, A.; Iglovikov, V.I.; Khvedchenya, E.; Parinov, A.; Druzhinin, M.; Kalinin, A.A. Albumentations: Fast and Flexible Image Augmentations. Inf.-Int. Interdiscip. J. 2020, 11, 125. [Google Scholar] [CrossRef]
  17. Tian, Y.; Ye, Q.; Doermann, D. YOLOv12: Attention-Centric Real-Time Object Detectors. arXiv 2025, arXiv:2502.12524. [Google Scholar] [CrossRef]
  18. Aharon, N.; Orfaig, R.; Bobrovsky, B.Z. BoT-SORT: Robust Associations Multi-Pedestrian Tracking. arXiv 2022, arXiv:2206.14651. [Google Scholar] [CrossRef]
  19. Saito, T.; Rehmsmeier, M. The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets. PLoS ONE 2015, 10, e0118432. [Google Scholar] [CrossRef] [PubMed]
  20. Landis, J.R.; Koch, G.G. The Measurement of Observer Agreement for Categorical Data. Biometrics 1977, 33, 159–174. [Google Scholar] [CrossRef] [PubMed]
  21. Marzi, G.; Balzano, M.; Marchiori, D. K-Alpha Calculator–Krippendorff’s Alpha Calculator: A User-Friendly Tool for Computing Krippendorff’s Alpha Inter-Rater Reliability Coefficient. MethodsX 2024, 12, 102545. [Google Scholar] [CrossRef] [PubMed]
  22. Marino, R.; Petrera, F.; Abeni, F. Scientific Productions on Precision Livestock Farming: An Overview of the Evolution and Current State of Research Based on a Bibliometric Analysis. Animals 2023, 13, 2280. [Google Scholar] [CrossRef] [PubMed]
  23. Ben Sassi, N.; Averós, X.; Estevez, I. Technology and Poultry Welfare. Animals 2016, 6, 62. [Google Scholar] [CrossRef] [PubMed]
  24. Caldas-Cueva, J.P.; Mauromoustakos, A.; Sun, X.; Owens, C.M. Detection of Woody Breast Condition in Commercial Broiler Carcasses Using Image Analysis. Poult. Sci. 2021, 100, 100977. [Google Scholar] [CrossRef] [PubMed]
  25. Jung, L.; Nasirahmadi, A.; Schulte-Landwehr, J.; Knierim, U. Automatic Assessment of Keel Bone Damage in Laying Hens at the Slaughter Line. Animals 2021, 11, 163. [Google Scholar] [CrossRef] [PubMed]
  26. Welfare of Laying Hens on Farm—2023—EFSA Journal—Wiley Online Library. Available online: https://efsa.onlinelibrary.wiley.com/doi/full/10.2903/j.efsa.2023.7789 (accessed on 17 September 2025).
Figure 1. Example of a frame extracted from the video with the specimen of interest indicated at the center of the image.
Figure 2. Confusion matrix comparing the offline scores assigned by each annotator after applying the revised annotation criteria.
Figure 3. Frame example after occluding all NC class bounding boxes and surrounding areas with a black mask.
Figure 4. Normalized (a) and absolute (b) confusion matrices for the detection and classification model on the test set.
Figure 5. YOLO model performance on the test set: F1 score (a) and precision–recall curve (b).
Figure 6. Frame extracted from the tracker output video.
Figure 7. Confusion matrices for the tracker output compared to manual annotations: normalized (a) and absolute (b) values. Class 2 corresponds to uncertain predictions.
Table 1. Albumentations functions used for data augmentation.
Transformation | Albumentations function
RandomResizedCrop | A.RandomResizedCrop()
HorizontalFlip | A.HorizontalFlip()
VerticalFlip | A.VerticalFlip()
RandomBrightnessContrast | A.RandomBrightnessContrast()
ColorJitter | A.ColorJitter()
GaussianBlur | A.GaussianBlur()
MotionBlur | A.MotionBlur()
ElasticTransform | A.ElasticTransform()
GridDistortion | A.GridDistortion()
CoarseDropout | A.CoarseDropout()
ShiftScaleRotate | A.ShiftScaleRotate()
Rotate | A.Rotate()
HueSaturationValue | A.HueSaturationValue()
Blur | A.Blur()
GaussNoise | A.GaussNoise()
Table 2. Main YOLOv12 training hyperparameters and configuration settings.
Parameter | Value
Model architecture | YOLOv12
Pretrained weights | True
Input image size | 640
Batch size | 0.8 (auto)
Epochs | 250
Patience (early stopping) | 100
Mosaic augmentation | 0.5
IoU threshold (NMS) | 0.7
Table 3. Main configuration parameters used for the BoT-SORT tracker.
Parameter | Value
Tracker type | botsort
Track high threshold | 0.25
Track low threshold | 0.1
New track threshold | 0.7
Track buffer | 120
Match threshold | 0.8
Fuse score | True
GMC method | none
Proximity threshold (ReID) | 0.5
Appearance threshold (ReID) | 0.5
With ReID | True
ReID model | auto
Table 4. F1 scores between different annotators (R, F, G) using each as ground truth (GT).
Annotator | G (GT) | R (GT) | F (GT)
G | - | 0.809 | 0.816
R | 0.809 | - | 0.822
F | 0.816 | 0.822 | -
Table 5. Distribution of classes and uncertain cases for the human operator and the BoT-SORT tracker. The F1 score for each class is calculated using the human operator’s labels as ground truth. Note that uncertain cases are excluded from the F1 score calculation.
 | Class 0 | Class 1 | Uncertain | Total
Operator (ground truth) | 196 (78.1%) | 47 (18.7%) | 8 (3.2%) | 251
Tracker (including uncertain predictions) | 203 (76.3%) | 53 (19.9%) | 10 (3.8%) | 266
F1 score | 0.955 | 0.809 | - | 0.882 (overall)
