1. Introduction
Respiratory diseases remain one of the most significant constraints in broiler production due to their high prevalence, rapid transmission, and severe economic impact. Many broiler farms worldwide report frequent respiratory outbreaks, often with morbidity approaching 100% and substantial losses from reduced growth, poor feed conversion, increased mortality, and carcass condemnation [1,2,3,4,5,6]. Complex respiratory syndromes involving combinations of viral (e.g., Infectious Bronchitis Virus (IBV), Newcastle Disease Virus (NDV), avian influenza, Avian Metapneumovirus (aMPV), Infectious Laryngotracheitis Virus (ILTV)) and bacterial pathogens (e.g., E. coli, Mycoplasma spp., Ornithobacterium rhinotracheale) further exacerbate disease severity, sometimes pushing mortality above 30% [7,8,9,10]. These outcomes are intensified by high stocking density, poor ventilation, elevated ammonia levels, and inadequate biosecurity, while control efforts are complicated by antigenic variation, frequent co-infections, incomplete vaccine protection, and growing antimicrobial resistance [9,11]. As a result, respiratory disease outbreaks have caused multimillion-dollar losses in broiler industries across several countries [1,3,12,13].
Early detection of respiratory diseases is critical because broilers have a very short production cycle, leaving little opportunity for recovery once growth is impaired. Respiratory outbreaks often begin around 2–3 weeks of age and, if uncontrolled, can raise mortality above 10% within a single cycle, with irreversible effects on performance and profitability [8,14]. Rapid pathogen spread and frequent co-infections accelerate disease progression and worsen lesions, emphasizing the need for timely intervention [3,4,15]. Although molecular diagnostics such as Polymerase Chain Reaction (PCR), multiplex assays, and Loop-Mediated Isothermal Amplification (LAMP) offer high sensitivity and specificity, they depend on laboratory infrastructure, skilled personnel, and sample transport, leading to delays that limit their value for immediate on-farm decision-making [16,17,18]. Field studies in countries such as Bangladesh and Tunisia demonstrate that early molecular screening can detect high viral loads during the initial phase of outbreaks, enabling faster control responses; however, such approaches remain impractical for continuous monitoring at scale [16,19]. Traditional visual observation is similarly limited by subjectivity, labor intensity, and low sensitivity to early or subtle clinical signs [20,21,22].
Acoustic monitoring has emerged as a promising alternative because respiratory pathology directly alters vocalizations, including coughs, sneezes, and rales, providing early and biologically meaningful signals [21,23,24,25]. Unlike vision systems or wearable sensors, audio monitoring is non-contact, inexpensive, and well suited to continuous flock-level surveillance in dense group housing, without requiring individual identification or attachment of devices [25,26,27,28]. Numerous studies report high classification accuracies (>90–98%) for detecting respiratory sounds and diseases such as Newcastle disease and avian influenza using machine learning and deep learning models [23,24,26,29,30,31], and systematic reviews indicate that sound-based systems dominate poultry respiratory Precision Livestock Farming (PLF) applications due to their feasibility and diagnostic value [32]. Artificial intelligence (AI) enables the transformation of raw audio into actionable health indicators through noise reduction, segmentation, feature extraction (e.g., Mel-Frequency Cepstral Coefficients (MFCCs), spectrograms, chroma, temporal descriptors), and supervised classification using models such as Convolutional Neural Networks (CNNs), CNN–Long Short-Term Memory networks (CNN-LSTMs), and transfer learning frameworks [33,34,35,36,37,38]. These outputs can be aggregated into meaningful digital biomarkers, including cough rate–based flock health scores, disease severity classifications, and welfare or stress indicators, many of which can be deployed on low-cost edge devices for real-time monitoring [39,40,41,42].
Given the rapid expansion of audio-based monitoring and artificial intelligence in poultry research, a consolidated and critical synthesis focused specifically on broiler respiratory health and welfare is timely. This review seeks to (i) summarize the biological and acoustic basis linking vocalizations to respiratory disease and welfare states, (ii) review recording environments, sensor setups, and annotation strategies used in broiler sound studies, (iii) compare acoustic features and AI models applied to disease and welfare monitoring, and (iv) identify key limitations and research gaps that hinder large-scale, on-farm deployment.
The manuscript first outlines the physiological basis of sound production and its alteration by respiratory disease and stress. It then reviews data acquisition and annotation practices, followed by a synthesis of acoustic feature representations and AI modeling approaches. Applications in respiratory disease detection, welfare assessment, and growth monitoring are subsequently discussed, together with evaluation metrics and deployment constraints. The review concludes by highlighting critical challenges and future directions for translating audio-based AI systems into reliable commercial tools for broiler production.
2. Review Methodology
This review follows a structured narrative methodology rather than a formal PRISMA meta-analysis, as the objective is conceptual synthesis and methodological comparison. A comprehensive literature search was performed using Web of Science, Scopus, IEEE Xplore, and Google Scholar. Search queries included, but were not limited to: “broiler chicken respiratory sounds”, “poultry cough detection”, “audio-based disease detection”, “precision livestock farming”, “machine learning”, “deep learning”, “MFCC”, “CNN”, “LSTM”, and “audio spectrogram transformers”. Searches covered publications from 2020 to 2025, reflecting the recent advancement of digital audio processing and AI-based monitoring. Reference lists of key review and experimental papers were also screened to identify additional relevant studies. Titles and abstracts were initially screened for relevance, followed by full-text assessment of eligible articles. Rather than quantitatively aggregating performance metrics, the selected studies were qualitatively synthesized to compare feature representations, data characteristics, and deployment contexts. Emphasis was placed on identifying conceptual trends, methodological trade-offs, and recurring limitations, particularly those affecting model generalization, interpretability, and real-world applicability. To ensure methodological rigor and relevance to the scope of this review, explicit inclusion and exclusion criteria were applied during full-text screening. Studies were included if they met at least one of the following criteria:
Applied audio or acoustic data to assess respiratory health, disease, stress, or welfare in broiler chickens or closely related poultry species;
Developed or evaluated machine-learning or deep-learning models for chicken sound analysis;
Provided biologically relevant insights linking sound production to respiratory physiology, pathology, or welfare.
Studies were excluded if they:
Focused solely on non-audio sensing modalities (e.g., vision-only systems);
Addressed poultry production without respiratory, acoustic, or AI relevance;
Were non-peer-reviewed sources lacking methodological transparency.
For visualization purposes, audio recordings were obtained from the Mendeley Data poultry vocalization signal dataset for early disease detection [43]. Six representative samples were selected, including two healthy broiler vocalizations, two environmental noise recordings, and two unhealthy broiler sounds (the first two audio files from each folder were used). From each audio file, the first 3 s were extracted and resampled to 22.05 kHz. Two time–frequency representations were generated: (i) a standard spectrogram using the Short-Time Fourier Transform (STFT) and (ii) a Mel-spectrogram. The STFT was computed using a Hamming window with a window size of 2048 samples and a hop length of 512 samples. The magnitude spectra were converted to decibel (dB) scale using logarithmic amplitude compression. The Mel-spectrogram was computed using 128 Mel filter banks applied to the power spectrum obtained from the STFT, followed by conversion to the dB scale. These representations were used to visualize spectral differences between healthy, noisy, and unhealthy broiler sounds and to illustrate why Mel-based features dominate modern poultry audio analysis.
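The pipeline described above can be sketched with a minimal NumPy-only implementation (a sketch under the stated parameters — Hamming window of 2048 samples, hop length of 512, 128 Mel bands at 22.05 kHz; file loading and resampling are replaced here by a synthetic tone, and in practice a library such as librosa would typically handle these steps):

```python
import numpy as np

def stft_db(x, n_fft=2048, hop=512):
    """Magnitude spectrogram in dB using a Hamming window."""
    win = np.hamming(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i*hop : i*hop + n_fft] * win for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, axis=1))           # (frames, n_fft//2+1)
    return 20 * np.log10(np.maximum(mag, 1e-10)).T      # (freq, time), dB

def mel_filterbank(sr, n_fft=2048, n_mels=128):
    """Triangular Mel filters based on Mel(f) = 2595*log10(1 + f/700)."""
    f_to_mel = lambda f: 2595 * np.log10(1 + f / 700)
    mel_to_f = lambda m: 700 * (10 ** (m / 2595) - 1)
    mel_pts = np.linspace(0, f_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_f(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):
            fb[m - 1, k] = (k - l) / max(c - l, 1)      # rising slope
        for k in range(c, r):
            fb[m - 1, k] = (r - k) / max(r - c, 1)      # falling slope
    return fb

def mel_spectrogram_db(x, sr=22050, n_fft=2048, hop=512, n_mels=128):
    """128-band Mel-spectrogram (dB) from the STFT power spectrum."""
    win = np.hamming(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i*hop : i*hop + n_fft] * win for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    mel = power @ mel_filterbank(sr, n_fft, n_mels).T
    return 10 * np.log10(np.maximum(mel, 1e-10)).T      # (n_mels, time)

# Example on a synthetic 3 s clip at 22.05 kHz (a 1 kHz tone stands in
# for a recorded vocalization).
sr = 22050
t = np.arange(3 * sr) / sr
clip = np.sin(2 * np.pi * 1000 * t)
S = stft_db(clip)                 # standard spectrogram
M = mel_spectrogram_db(clip, sr)  # Mel-spectrogram
```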
3. Respiratory Health and Welfare Challenges in Broiler Chickens
Respiratory health challenges in broiler chickens are predominantly caused by viral pathogens and are frequently compounded by secondary bacterial infections, making diagnosis and control particularly complex. The most prevalent viral agents include IBV, Avian Influenza Virus (AIV; especially H9N2), and NDV, which commonly co-circulate in broiler flocks and contribute to high morbidity and mortality [1,13,15,44,45]. Other important viruses such as ILTV, aMPV, and Avian Pneumovirus (APV) further intensify respiratory disease burdens [3,7,19,46]. Bacterial pathogens, including Mycoplasma gallisepticum, Escherichia coli, Haemophilus paragallinarum, and Chlamydia, often exacerbate viral infections, leading to impaired respiratory function, reduced feed intake, increased mortality, and substantial economic losses [3,7,13]. Given the multifactorial nature of these syndromes, continuous surveillance, vaccination, and integrated control strategies are essential for effective management [3,44,45].
Respiratory diseases in broilers are associated with distinct abnormal sound manifestations that serve as valuable acoustic indicators of health status. Diseased birds typically produce pathological coughs, which differ markedly from normal short, sharp cheeps [47,48]. Sneezes are also widely used as primary indicators in automated respiratory monitoring systems [49]. In addition, snoring- or purring-like sounds—continuous, low-frequency, rumbling vocalizations—are commonly observed during rest and are linked to upper respiratory tract obstruction [25,47,48]. As respiratory disease progresses, birds may also exhibit labored breathing and tracheal rales, clinically described as harsh airway sounds accompanied by sneezing and nasal discharge [3,50,51]. These abnormal vocalizations are consistently exploited as audio biomarkers for detecting infectious bronchitis, Newcastle disease, and broader respiratory syndromes using machine-learning–based systems trained specifically on cough, sneeze, and snore/purr sounds [25,26,31,47,48,52].
Welfare-related stressors further influence vocalization patterns and the prevalence of respiratory sounds throughout the production cycle. Stocking density and social conditions have clear acoustic effects, with isolated birds producing higher-energy alarm calls indicative of stress, while birds in larger groups show reduced vocal energy, reflecting improved welfare [53]. Although heat stress appears to have limited direct effects on primary vocalization types, such as distress calls and peeps, it alters activity levels and behavior in ways that may indirectly affect vocal output [54,55,56]. Similarly, direct evidence linking elevated ammonia (NH3) levels to specific vocal changes remains limited, although NH3-induced respiratory discomfort likely contributes to abnormal sound production. In addition to the environment, the age of broilers itself influences the sounds they produce. Respiratory sounds are generally infrequent during early life stages but increase markedly after three to four weeks of age, coinciding with higher pathogen exposure and physiological stress [31,48,57,58]. Studies consistently report higher detection rates of coughs, sneezes, and snore-like sounds in broilers older than 26 days, particularly during the finishing phase, highlighting the importance of age-aware acoustic monitoring strategies [57,58].
Figure 1 and Table 1 outline the major respiratory and welfare challenges in broiler chickens and their acoustic relevance.
4. Acoustic Characteristics of Broiler Chicken Vocalizations
Broiler chicken vocalizations, including both normal calls and abnormal respiratory sounds, are produced by the syrinx, an avian vocal organ located at the tracheobronchial junction within the interclavicular air sac [76,77]. In chickens, which are non-songbirds, sound generation primarily involves vibration of the lateral tympaniform membranes and associated labia that function analogously to mammalian vocal folds [77,78]. During expiration, increased air-sac pressure drives airflow through the syrinx, causing membrane vibration via a myoelastic–aerodynamic mechanism, with pitch governed by tissue tension and loudness by vibration amplitude [77,79]. Although most sounds are expiratory, inspiratory phonation has also been reported in birds [76,77,80]. Importantly, both normal vocalizations and pathological respiratory sounds such as coughs and sneezes originate from the same syringeal mechanism, with the larynx, trachea, and oral cavity acting as a vocal tract filter that shapes the final acoustic signal [76,78,80]. The principal vocal structures involved in broiler sound production and their functional roles are summarized in Table 2.
Coughs and sneezes differ acoustically from normal broiler vocalizations in both temporal and spectral characteristics. Coughs and sneezes show sharper onset, higher peak amplitude, and broader frequency bandwidths, reflecting sudden expulsions of air caused by respiratory irritation or obstruction [85,86]. Sneezes are typically brief, high-energy, impulsive sounds with distinct broadband spectral signatures, making them detectable even under noisy farm conditions [49]. In contrast, normal calls, including cheeps, squawks, and distress vocalizations, are generally longer in duration and more tonal, exhibiting structured harmonic patterns and regular frequency modulation [25,52]. These consistent acoustic differences enable machine-learning systems to distinguish pathological respiratory sounds from normal vocal behavior with high reliability [25,49,52].
Stress and pain also modulate broiler vocal behavior by altering call rate, intensity, and spectral structure. Distress calls associated with negative affective states are typically repetitive, high in energy, and exhibit increased tonality with reduced spectral entropy, correlating with key welfare indicators such as growth performance and mortality [67]. Acute stressors may transiently suppress vocal output, whereas chronic stressors such as food deprivation increase call frequency and modify spectral features including centroid and bandwidth [68,87]. Age further modulates these responses, with younger birds showing more pronounced acoustic changes than older individuals [88]. Social context plays an important buffering role, as maternal contact and group housing reduce high-intensity distress calling, highlighting the context-dependent nature of vocal responses to stress and pain [89].
Environmental noise generated by ventilation fans, feeders, heaters, and other mechanical equipment presents a major challenge for accurate detection of broiler respiratory sounds. These noise sources often overlap spectrally with target vocalizations, reducing detection sensitivity, particularly for short, low-energy events such as sneezes [49]. To mitigate background noise in commercial farm environments, advanced signal-processing techniques such as spectral subtraction, wavelet denoising, Wiener filtering, and multi-taper spectral analysis have been widely applied to improve signal-to-noise ratio [47,90]. Wavelet-based methods combined with pulse extraction are particularly effective in suppressing both continuous and transient noise, enabling clearer isolation of respiratory sounds [90]. At the model level, the introduction of a custom Burn Layer in CNNs—injecting controlled random noise during training—has increased robustness to input variability and reduced overfitting while maintaining high sensitivity with fewer parameters [33]. Multi-domain feature extraction, integrating time- and frequency-based features, MFCCs, and sparse representations, combined with feature selection and linear–nonlinear fusion strategies, further enhances classification performance and generalization [91,92]. Transfer learning approaches such as improved TrAdaBoost address age-related variability in broiler vocalizations, improving cross-dataset applicability [26], while confidence-interval-based random forest methods enable more reliable recognition of overlapping sounds in complex acoustic environments [93]. Together, these innovations substantially improve the robustness and deployment readiness of broiler sound classification systems in real-world noisy conditions [26,33,90,91,92,93]. Nonetheless, farm soundscapes remain acoustically complex, underscoring the need for continued development of noise-robust algorithms for reliable respiratory health monitoring [49,94].
Although coughs, sneezes, distress calls, and other respiratory-related sounds in broiler chickens exhibit identifiable acoustic patterns, their practical use in AI systems is constrained by substantial overlap with non-pathological sounds such as wing flapping, pecking, and environmental noise. Most studies characterize these sounds under controlled or semi-controlled conditions, which limits their external validity in commercial farms where background noise, bird density, and ventilation systems dominate the acoustic scene. Importantly, the acoustic manifestation of respiratory distress is not static but varies with age, growth rate, and environmental stressors, making fixed-rule sound definitions insufficient. Consequently, respiratory sound characterization should be viewed as a probabilistic signal embedded within complex soundscapes rather than as isolated acoustic events, reinforcing the need for robust feature learning and domain-adaptive AI models rather than reliance on handcrafted acoustic thresholds.
Different poultry sound types provide information on respiratory health, stress, growth, and welfare at varying levels of evidence. Figure 2 summarizes the relationships between major vocalization categories and their documented applications in health and welfare monitoring. Table 3 describes the acoustic characteristics of key sound categories relevant to broiler respiratory health and welfare, including coughs, sneezes, distress calls, normal vocalizations, and silence.
5. Audio Data Acquisition in Broiler Production Systems
Audio monitoring in broiler houses predominantly relies on single omnidirectional microphones, reflecting a balance between practicality, coverage, and minimal disturbance to birds. Most studies report the use of one centrally placed microphone rather than directional devices or microphone arrays. For example, microphones positioned approximately 40 cm above the birds’ backs in the center of commercial houses have been used to record continuous audio at standard sampling rates, capturing vocalizations alongside environmental sounds such as fans and feeders [95,102]. In more task-specific applications, localized microphones have been attached to feeders to record pecking sounds, further supporting the preference for simple single-point installations over complex arrays in commercial settings [103]. Overall, omnidirectional microphones provide sufficient acoustic information for health and behavior monitoring while remaining easy to deploy and maintain [58,95,103].
Common sampling rates for animal and poultry bioacoustic monitoring typically range from 16 kHz to 48 kHz, selected according to the frequency range of target vocalizations and practical constraints. For broiler monitoring, 32 kHz and 44.1 kHz are frequently used, as they capture biologically relevant vocal frequencies while balancing data size and recording duration [95,102,104,105]. Higher rates (44.1–48 kHz) enable more detailed analysis of subtle welfare-related cues but increase storage and processing demands, whereas lower rates may reduce detection accuracy [36,106]. Recording durations vary from minutes to continuous multi-day monitoring, with audio commonly segmented into shorter clips for analysis, typically using window sizes between 1 and 10 s or aggregated segments of 10 min to 1 h [58,96,107]. Time–frequency transformations such as the Fast Fourier Transform and wavelet analysis are routinely applied using standard software platforms (e.g., MATLAB R2018b, The MathWorks, Inc., Natick, MA, USA; Adobe Audition CS6) to prepare signals for feature extraction [58,95,102]. Some studies integrate audio with video monitoring to link vocal patterns with observable behaviors or health status, enhancing interpretation and validation of sound-based indicators [31,95,108]. While higher stocking densities can increase overall sound energy and overlapping vocalizations due to elevated activity and stress, these conditions primarily affect signal complexity rather than the feasibility of data acquisition [109,110].
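The clip-based segmentation described above can be illustrated with a short sketch (the non-overlapping 5 s window is one choice within the reported 1–10 s range; function and variable names here are illustrative):

```python
import numpy as np

def segment_audio(x, sr, win_s=5.0, hop_s=5.0):
    """Split a continuous recording into fixed-length clips.

    Returns an array of shape (n_clips, win_s*sr); any trailing remainder
    shorter than one window is dropped. With hop_s == win_s the windows
    are non-overlapping; a smaller hop would yield overlapping clips.
    """
    win, hop = int(win_s * sr), int(hop_s * sr)
    if len(x) < win:
        return np.empty((0, win))
    n = 1 + (len(x) - win) // hop
    return np.stack([x[i * hop : i * hop + win] for i in range(n)])

# One minute of (synthetic) audio at 44.1 kHz -> twelve 5 s clips
sr = 44100
minute = np.random.default_rng(0).standard_normal(60 * sr)
clips = segment_audio(minute, sr)
```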
Despite advances in recording hardware, annotation remains a major bottleneck in broiler audio research. Overlapping vocalizations—such as simultaneous coughs, purrs, and movement sounds—reduce labeling accuracy and complicate sound event detection. Rare events, including coughs or sneezes during early disease stages, further limit labeled data availability; active learning strategies that prioritize informative samples have been proposed to address this imbalance [111]. Additional challenges include annotator disagreement and background noise, which can bias model evaluation if uncertainty is not explicitly managed [111]. However, emerging onset-based detection methods and exploratory clustering approaches show promise for improving robustness in complex, noisy broiler house environments [112,113]. For example, machine-learning approaches combining random forests with confidence-based classification have achieved high accuracy in identifying overlapping sound degrees, reaching over 97% in some cases [93].
Table 4 provides an overview of representative poultry sound recording studies, highlighting recording environments, microphone setups, annotation strategies, and major methodological limitations.
The quality and structure of recorded audio fundamentally determine the effectiveness of downstream feature extraction. Microphone placement, sampling rate, background noise, and annotation reliability jointly constrain the spectral resolution, temporal precision, and signal-to-noise characteristics of the dataset. These factors directly influence which acoustic features can be meaningfully extracted and how reliably biological information—such as respiratory patterns or stress-related vocalizations—can be represented for machine learning.
6. Audio Feature Extraction and Representation Techniques
Feature extraction for broiler audio classification typically integrates multiple domains to capture comprehensive sound characteristics. Standard pipelines extract around 60-dimensional feature vectors per frame, including time-domain features (e.g., energy, zero-crossing rate), frequency-domain descriptors (e.g., spectral centroid, bandwidth), MFCCs, and sparse representation features [31,91,101]. Preprocessing usually involves noise reduction (e.g., Wiener filtering or wavelet denoising), followed by sub-frame segmentation and endpoint detection [47,90]. MFCCs remain the most widely used features due to their ability to capture the spectral envelope of broiler vocalizations and respiratory sounds, and are commonly combined with energy-based and wavelet descriptors to improve discrimination between normal and abnormal events such as coughs and sneezes [31,48,91,101]. Feature normalization and selection methods, including random forest importance ranking and linear–nonlinear fusion, are then applied to reduce dimensionality and improve classification performance [91,92].
Feature representation serves as the functional interface between raw audio and AI model performance. The choice of acoustic descriptors—whether handcrafted features such as MFCCs or learned representations such as spectrogram embeddings—determines the type of information available to classification algorithms and strongly shapes model complexity, generalization capacity, and computational demands.
With the expansion of deep learning, spectrogram and Mel-spectrogram representations have become dominant, as they preserve both temporal and spectral dynamics while enabling image-based learning. CNNs trained on these representations consistently outperform classical methods, achieving accuracies above 90% for cough, distress, and welfare-related sound classification [33,117,119,120,121]. Mel-spectrograms are particularly effective in noisy commercial environments because they emphasize perceptually relevant frequency bands and suppress irrelevant spectral detail [31,120,122]. Hybrid approaches that combine spectrogram-based inputs with MFCCs or time-domain statistics further improve robustness and generalization [119,123].
Figure 3 presents representative STFT spectrograms and Mel-spectrograms for healthy, environmental noise, and unhealthy broiler sounds. Healthy vocalizations show relatively stable harmonic energy primarily concentrated in the low-to-mid frequency range (approximately 500–3000 Hz), with clear temporal structure corresponding to regular vocal activity. Noise recordings exhibit more diffuse and broadband spectral energy with reduced temporal regularity, reflecting mechanical and environmental background sources. In contrast, unhealthy sounds display irregular, burst-like and vertically oriented spectral patterns, with energy distributed across a wider frequency range, which is characteristic of coughing and respiratory distress events.
Compared with standard spectrograms, Mel-spectrograms provide a more compact and perceptually meaningful representation by emphasizing lower frequency components and smoothing high-frequency detail. This highlights discriminative acoustic patterns while reducing dimensionality, explaining why Mel-spectrograms are widely adopted as input features in modern deep learning-based broiler audio monitoring systems.
Early and late fusion are two main strategies for combining multiple feature sets or modalities in multimodal audio classification. Early fusion concatenates features from different sources into a single feature vector before classification, enabling the model to learn joint representations and exploit cross-modal correlations, which often leads to higher accuracy when sufficient training data is available [124,125,126]. However, this approach increases feature dimensionality and can cause overfitting in small or noisy datasets [124,127]. In contrast, late fusion combines the outputs of separate unimodal classifiers, typically through ensemble voting or meta-classifiers, making it more robust to limited data, modality-specific noise, and missing inputs [127,128]. Comparative studies show that late fusion can outperform early fusion in practical settings, for example, achieving higher accuracy (0.876 vs. 0.828) and F1-score in aggression detection tasks, while early fusion showed higher precision [128]. Similar trends have been reported in bioacoustic applications, where late fusion improved bird sound classification and acoustic scene recognition by around 10% compared to single models, demonstrating better generalization and stability [124,129,130].
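The distinction can be made concrete with a minimal NumPy sketch (the data and per-modality "classifiers" are synthetic stand-ins, not models from any cited study): early fusion concatenates feature vectors before a single classifier, while late fusion averages the probability outputs of separate unimodal classifiers.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy stand-ins for two feature sets extracted from the same clips,
# e.g. MFCC statistics (13-d) and time-domain descriptors (4-d).
n_clips = 200
mfcc_feats = rng.standard_normal((n_clips, 13))
time_feats = rng.standard_normal((n_clips, 4))

# Early fusion: concatenate feature vectors before classification,
# so a single model sees one joint 17-d representation per clip.
early = np.concatenate([mfcc_feats, time_feats], axis=1)

def fake_classifier_probs(feats, seed):
    """Placeholder for any per-modality model's softmax output."""
    r = np.random.default_rng(seed)
    logits = feats @ r.standard_normal((feats.shape[1], 2))
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Late fusion: each feature set gets its own classifier; their class
# probabilities are combined afterwards (here by simple averaging).
p1 = fake_classifier_probs(mfcc_feats, 1)
p2 = fake_classifier_probs(time_feats, 2)
late = (p1 + p2) / 2            # ensemble-averaged probabilities
pred = late.argmax(axis=1)      # fused class decision per clip
```

In practice the averaging step is often replaced by weighted voting or a meta-classifier trained on the unimodal outputs, which is what makes late fusion robust to a missing or noisy modality.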
6.1. Core Acoustic Features and Their Mathematical Definitions
6.1.1. Short-Time Fourier Transform (STFT)
The Short-Time Fourier Transform (STFT) is used to analyze non-stationary animal sounds by representing signals in both time and frequency domains. For a signal x(t), the STFT is defined as:

$$X(\tau, \omega) = \int_{-\infty}^{\infty} x(t)\, w(t - \tau)\, e^{-j\omega t}\, dt$$

where w(t − τ) is a window function centered at time τ, and ω is the angular frequency. STFT enables time-localized spectral analysis and is widely used to generate spectrograms for detecting coughs, sneezes, and distress calls in poultry audio monitoring [131,132,133,134].
6.1.2. Mel-Frequency Cepstral Coefficients (MFCCs)
MFCCs capture perceptually relevant spectral features of animal vocalizations. After computing the Fourier spectrum, energies are passed through Mel-scaled filters defined as:
The MFCCs are then obtained using the discrete cosine transform:
where
is the energy of the
m-th Mel filter,
M is the number of filters, and
k is the coefficient index. MFCCs are extensively used for poultry health, stress, and behavior classification using both classical and deep learning models [
135,
136,
137,
138].
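The DCT step translates directly into a few lines of NumPy (a sketch: the Mel-filter energies below are synthetic, and the number of coefficients and scaling conventions vary across implementations):

```python
import numpy as np

def mfcc_from_mel_energies(E, n_coeffs=13):
    """DCT of log Mel-filter energies:
    c_k = sum_m log(E_m) * cos(pi*k*(m - 1/2)/M),
    matching the definition above (a DCT-II up to scaling)."""
    M = len(E)
    m = np.arange(1, M + 1)
    log_E = np.log(np.maximum(E, 1e-10))  # guard against zero energies
    return np.array([np.sum(log_E * np.cos(np.pi * k * (m - 0.5) / M))
                     for k in range(n_coeffs)])

# Example: 26 Mel-band energies from one analysis frame (synthetic)
E = np.linspace(1.0, 2.0, 26)
c = mfcc_from_mel_energies(E)
```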
6.1.3. Spectral Entropy
Spectral entropy measures the randomness of energy distribution across the frequency spectrum and is defined as:
where
is the normalized power at frequency bin
. Lower entropy indicates structured tonal sounds, while higher entropy reflects noisy or complex signals. In broiler monitoring, spectral entropy correlates with distress and welfare status and is useful for real-time assessment of flock conditions [
67,
139,
140,
141].
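A minimal per-frame implementation (using base-2 logarithms and optional normalization by log2(N), a common but not universal convention) illustrates how entropy separates tonal from noisy signals:

```python
import numpy as np

def spectral_entropy(x, n_fft=1024, normalize=True):
    """Shannon entropy of the normalized power spectrum of one frame.
    When normalized by log2(N), tonal signals approach 0 and
    broadband noise approaches 1."""
    power = np.abs(np.fft.rfft(x[:n_fft])) ** 2
    p = power / power.sum()
    h = -np.sum(p * np.log2(p + 1e-12))  # small offset avoids log(0)
    return h / np.log2(len(p)) if normalize else h

sr = 22050
t = np.arange(1024) / sr
tone = np.sin(2 * np.pi * 2000 * t)                     # structured, tonal
noise = np.random.default_rng(0).standard_normal(1024)  # broadband
```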
6.1.4. Zero-Crossing Rate (ZCR)
Zero-crossing rate quantifies how frequently a signal crosses the zero-amplitude axis. For a discrete signal
x[
n]:
ZCR reflects signal noisiness and frequency content and is used in poultry audio analysis to separate vocalizations from background noise and distinguish behavioral sounds such as feeding and distress [
36,
142,
143,
144,
145].
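The definition reduces to a one-line NumPy computation (the frequencies and duration below are illustrative):

```python
import numpy as np

def zero_crossing_rate(x):
    """Fraction of consecutive sample pairs whose product is negative,
    i.e. (1/(N-1)) * sum 1[x[n]*x[n-1] < 0], per the definition above."""
    return float(np.mean(x[1:] * x[:-1] < 0))

sr = 22050
t = np.arange(sr) / sr                      # one second of audio
low = np.sin(2 * np.pi * 100 * t)           # low-frequency, tonal
high = np.sin(2 * np.pi * 3000 * t)         # higher-frequency content
```

As expected, the higher-frequency signal crosses zero far more often, which is why ZCR helps separate broadband or high-pitched events from low-frequency hum.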
Feature representation plays a central role in determining the accuracy, robustness, and deployability of audio-based broiler monitoring systems. Handcrafted features such as Mel-frequency cepstral coefficients (MFCCs), energy-based descriptors, and wavelet measures offer computational efficiency and partial interpretability, making them attractive for real-time and edge deployment. However, these features often fail to capture subtle temporal dynamics associated with early-stage respiratory distress and complex welfare states. Spectrogram-based representations provide richer time–frequency structure and, when combined with deep learning models, consistently improve detection performance in noisy commercial environments. Emerging self-supervised and transformer-based representation learning approaches show promise in extracting higher-level and potentially farm-invariant acoustic features, but their effectiveness remains constrained by the scarcity of large, diverse, and standardized poultry audio datasets.
Table 5 summarizes the commonly used acoustic features in broiler sound classification.
The selection of feature representations ultimately defines the space of AI architectures that can be effectively employed in broiler audio analysis, as different models vary in their ability to exploit temporal structure, spectral patterns, and cross-feature dependencies. Model robustness and generalization capacity, therefore, depend not only on algorithmic design but also on how acoustic information is encoded at the input level.
7. Machine Learning and Deep Learning Models
Early broiler sound classification studies primarily relied on classical machine learning models such as k-Nearest Neighbors (kNN), Random Forest (RF), Decision Trees (DT), and ensemble approaches including TrAdaBoost. These models typically operate on handcrafted features derived from time, frequency, MFCC, and sparse representations of audio signals and can achieve classification accuracies exceeding 90% with careful feature engineering and tuning [
26,
31,
36,
91,
101]. Several studies show that classical machine learning (ML) methods can achieve performance comparable to deep learning on small or well-structured datasets. For example, on the University of California, Irvine—Human Activity Recognition dataset, Linear Support Vector Classifier (SVC) achieved 96% accuracy, similar to CNN performance, while Random Forest reached 92% [
147]. In small-sample image classification (COREL-1000), SVM slightly outperformed CNN (0.86 vs. 0.83) [
148], and in low-shot Natural Language Processing (NLP) tasks, the performance gap between classical models and deep learning shrank to less than 2% when around 1000 labeled samples per class were available [
149]. In audio-related tasks, classical models using hand-crafted features (e.g., MFCCs with SVM or Random Forest) have also matched or approached deep learning accuracy in constrained datasets [
150]. Importantly, classical ML offers substantially lower computational burden, faster training, and greater interpretability, making it more suitable for resource-limited and explainability-critical applications, whereas deep learning generally requires larger datasets and higher computational resources despite its superior scalability on large, complex data [
151,
152,
153]. RF models are frequently favored for their robustness and interpretability: they aggregate many decorrelated decision trees, which reduces overfitting, especially in high-dimensional or partially missing data [
154,
155,
156]. For example, on large structured datasets (e.g., Kaggle tabular data with over one million samples), Random Forest outperformed single decision trees in accuracy and generalization [
157]. In terms of interpretability, Random Forest provides feature importance measures and supports rule-extraction and surrogate tree methods, allowing predictions to be explained through human-readable rules or decision paths, making it suitable for domains requiring both accuracy and transparency such as healthcare, bioinformatics, and audio classification [
154,
155,
158]. However, these approaches are limited by their reliance on manual feature engineering, sensitivity to noise, and reduced generalization across broiler ages, housing systems, and farms [
26,
47,
70]. They also struggle with overlapping vocalizations and complex soundscapes, restricting scalability for commercial deployment [
36,
93].
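A minimal baseline in the spirit of these classical approaches is sketched below: a k-nearest-neighbour classifier over synthetic two-dimensional "handcrafted features". The cluster locations, class labels, and split are fabricated purely for illustration and do not reproduce any cited experiment:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for handcrafted features (e.g., two MFCC summary statistics):
# class 0 = "normal" sounds, class 1 = "cough-like" events, separated in feature space.
X0 = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(100, 2))
X1 = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(100, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

def knn_predict(X_train, y_train, X_query, k=5):
    # Euclidean distances from each query point to every training point.
    d = np.linalg.norm(X_train[None, :, :] - X_query[:, None, :], axis=2)
    # Majority vote over the k nearest neighbours.
    nn = np.argsort(d, axis=1)[:, :k]
    votes = y_train[nn]
    return (votes.mean(axis=1) > 0.5).astype(int)

# Hold-out split for a quick sanity check.
idx = rng.permutation(200)
train, test = idx[:160], idx[160:]
pred = knn_predict(X[train], y[train], X[test])
acc = (pred == y[test]).mean()
print(round(acc, 2))
```

On real broiler audio, performance of such baselines hinges far more on the quality of the extracted features than on the classifier itself, which is exactly the limitation the surveyed studies report.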
Deep learning models address many of these limitations by learning discriminative representations directly from data. CNN-based systems are often combined with Recurrent Neural Networks (RNNs), including LSTM and GRU (Gated Recurrent Unit) architectures, to capture the temporal structure of respiratory sounds and event boundaries [
159,
160,
161]. Temporal Convolutional Networks (TCNs) have recently emerged as computationally efficient alternatives with competitive performance [
162,
163]. Transformer-based architectures, such as Audio Spectrogram Transformers (AST) and wav2vec-based models, further improve performance by modeling long-range temporal and spectral dependencies through self-attention [
164,
165,
166]. These models achieve state-of-the-art accuracy and high mean average precision (up to 0.97) for broiler stress and welfare sound classification, particularly when incorporating longer input windows and metadata such as bird age [
70,
167].
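The dilated causal convolutions underlying TCNs can be sketched in a few lines of numpy. The kernel, dilation schedule, and impulse input below are illustrative choices, not taken from any cited architecture:

```python
import numpy as np

def causal_dilated_conv(x, w, dilation=1):
    # y[t] = sum_k w[k] * x[t - k*dilation], with zero padding on the left,
    # so each output depends only on present and past samples (causal).
    k = len(w)
    pad = (k - 1) * dilation
    xp = np.concatenate([np.zeros(pad), x])
    return np.array([sum(w[j] * xp[t + pad - j * dilation] for j in range(k))
                     for t in range(len(x))])

# Stacking layers with dilations 1, 2, 4, ... grows the receptive field
# exponentially while keeping each kernel small.
x = np.zeros(32)
x[0] = 1.0                              # unit impulse
w = np.ones(3)                          # toy kernel of size 3
y = x
for d in (1, 2, 4):
    y = causal_dilated_conv(y, w, dilation=d)
# Receptive field = 1 + sum over layers of (k-1)*d = 1 + 2*(1+2+4) = 15 samples.
print(int(np.count_nonzero(y)))
```

This exponential growth in temporal context, obtained without recurrence, is what makes TCNs computationally attractive relative to LSTM/GRU stacks for long audio sequences.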
Across model families, class imbalance—especially the rarity of cough and sneeze events—remains a major challenge. This is addressed using oversampling, e.g., Synthetic Minority Oversampling Technique (SMOTE), Adaptive Synthetic Sampling (ADASYN), cost-sensitive loss functions, ensemble learning, and Generative Adversarial Network (GAN)-based data augmentation [
168,
169,
170,
171,
172]. While SMOTE and its variants are widely used and effective for general imbalanced datasets, they perform poorly when training data is extremely scarce, as they rely on local interpolation that cannot capture true data structure [
171,
173]. In such small-sample scenarios, GAN-based methods like Generative Adversarial Network Synthesis for Oversampling (GANSO) and Markov Random Fields (MRF) based oversampling have been shown to generate more realistic synthetic samples and outperform SMOTE by better preserving underlying data distributions [
174]. However, these methods are computationally more complex, making SMOTE preferable for moderate datasets, while GAN/MRF approaches are more suitable for very limited data conditions [
171,
174,
175,
176]. Transfer learning and domain adaptation further enhance robustness across production stages and environments, positioning deep learning—particularly transformer-based models—as the most promising direction for scalable, real-time broiler audio monitoring systems [
26,
139,
164,
166].
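The local-interpolation step that SMOTE relies on can be sketched as follows. This is a minimal numpy illustration on synthetic feature vectors; the class sizes, dimensionality, and neighbourhood size are arbitrary assumptions:

```python
import numpy as np

def smote(X_min, n_new, k=3, rng=None):
    # SMOTE-style oversampling sketch: for each synthetic sample, pick a random
    # minority point, pick one of its k nearest minority neighbours, and
    # interpolate at a random position on the segment between them.
    rng = rng if rng is not None else np.random.default_rng(0)
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    nn = np.argsort(d, axis=1)[:, 1:k + 1]   # skip self (column 0)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        j = nn[i, rng.integers(k)]
        lam = rng.random()
        out.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(out)

rng = np.random.default_rng(1)
# Imbalanced toy set: 200 "background" samples vs 12 "cough" samples.
X_maj = rng.normal(0.0, 1.0, size=(200, 4))
X_min = rng.normal(3.0, 0.3, size=(12, 4))
X_syn = smote(X_min, n_new=188, rng=rng)
print(len(X_min) + len(X_syn), len(X_maj))
```

Because every synthetic point is a convex combination of two existing minority samples, the method cannot create structure beyond the minority cloud, which is precisely why it degrades when that cloud is tiny or unrepresentative.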
Model architecture critically influences the generalization and practical applicability of broiler audio analysis systems. Classical machine learning models provide strong and interpretable baselines under limited data conditions, but their performance degrades under domain shifts across farms, bird ages, and recording setups. Deep learning models, particularly convolutional and attention-based architectures, better capture hierarchical and temporal patterns in vocalizations, leading to superior accuracy in complex acoustic environments. However, these gains come at the cost of increased computational burden, reduced transparency, and challenges in explainability and reproducibility. Overall, the primary limitation in current broiler audio-AI research is not peak model accuracy but the lack of rigorous cross-domain evaluation, which makes it difficult to distinguish genuine biological signal learning from overfitting to farm-specific acoustic signatures and highlights the need for standardized benchmarks and deployment-oriented validation.
Table 6 summarizes AI models commonly applied to broiler sound classification.
Beyond algorithmic performance, the practical utility of audio-based AI systems is determined by their robustness, temporal resolution, and ability to generalize across heterogeneous farm conditions, which ultimately governs whether models can move beyond laboratory benchmarks toward real-time disease surveillance, early warning, and decision-support applications in commercial broiler production.
8. Applications of Audio-AI in Respiratory Health Monitoring
Audio-based cough detection has emerged as a powerful, non-invasive tool for early disease warning in broiler production. Systems using MFCCs, sparse representations, and time–frequency features achieve classification accuracies of approximately 90–91%, while flock-level health prediction based on cough-rate estimation reaches nearly 99% accuracy under controlled conditions [
31]. Transfer learning improves robustness across broiler age groups, maintaining accuracies above 80% in variable production environments [
26]. High recall and precision (>90%) reported in commercial-scale facilities demonstrate feasibility under noisy, real-world conditions [
177].
Acoustic monitoring enables detection of respiratory disease before visible clinical signs appear, providing a critical window for early intervention. Infectious bronchitis and Newcastle disease have been detected within days post-infection using wavelet entropy and MFCC-based approaches, with accuracies of 80–83% [
30,
52]. Deep learning methods further improve early-stage detection, exceeding 90% accuracy in some studies [
30]. Continuous monitoring of cough frequency and rale-like sounds correlates well with disease progression and distinguishes healthy from infected birds in real time [
23,
24,
31]. Despite these advances, generalization across breeds, farms, and disease types remains a key challenge [
25,
75].
In commercial settings, audio-AI systems are increasingly integrated with complementary sensing modalities. Vision-based platforms detect posture and mobility changes with accuracy of up to 97.8% [
178], while radar-based sensors provide contactless monitoring of respiratory and cardiac activity with reported accuracies around 96% [
179]. These systems are often embedded in IoT frameworks, incorporating environmental parameters such as temperature, humidity, and gas concentrations [
180]. However, false alarms driven by environmental noise, overlapping sounds, and age-related vocal variability remain a major limitation [
48,
181]. Improving specificity and reducing unnecessary alerts are therefore essential for widespread adoption [
21,
182].
Audio-based AI systems for respiratory health detection in broiler chickens demonstrate strong potential for early, non-invasive disease surveillance, often identifying abnormal respiratory events before overt clinical signs appear. Nevertheless, most reported performances are derived from binary or narrowly defined classification tasks conducted within single experimental settings. Such designs limit the systems’ ability to generalize across heterogeneous production environments, pathogen profiles, and management practices. Moreover, the majority of studies focus on detection accuracy while overlooking false-positive burdens, which can undermine farmer trust and practical usability. From a translational standpoint, respiratory sound detection should be integrated with contextual metadata—such as age, temperature, ventilation rate, and stocking density—to move from event detection toward actionable health decision support.
Table 7 summarizes sound-based AI approaches for poultry respiratory disease detection and early warning.
9. Audio-Based Welfare and Stress Monitoring
Stress and welfare-related conditions significantly alter broiler vocalization patterns by affecting call rate, frequency, and acoustic structure. Acute stressors such as food or water deprivation increase vocal activity and modify spectral features, including centroid and bandwidth, while mitigation strategies such as hydrated gels reduce these responses [
68]. Prolonged or intense stress is associated with increased high-energy distress calls characterized by reduced spectral entropy, which correlates with poorer welfare outcomes, including reduced growth and increased mortality [
67]. Vocal responses are strongly age-dependent and influenced by circadian rhythms, complicating interpretation without age-aware models [
55,
56,
88].
Acoustic analysis has proven effective for detecting distress, aggression, and discomfort. Distress calls serve as reliable biomarkers of negative welfare states, with spectral entropy providing a quantitative proxy for stress intensity [
67]. Deep learning models, including CNNs and transformer-based architectures, achieve high accuracy and mean average precision (up to 0.97) in classifying stress-related vocalizations and differentiating between stressors and age groups [
36,
70,
96]. Classical models such as SVMs also show moderate to high performance, although results are often age-dependent [
184]. Emerging large-scale audio models capable of decoding emotional and physiological cues further highlight the potential of sound-based, non-invasive welfare monitoring, although most validation remains limited to controlled or semi-commercial conditions [
98,
118,
139,
185].
Environmental conditions strongly shape welfare-related vocalizations. Suboptimal environments consistently increase distress calling and are associated with poorer growth and welfare indicators [
67]. Reduced stocking density and increased environmental complexity generally lower fear and anxiety responses, while the effects of enrichment, heat stress, and housing design vary with age and stimulus type [
71,
186,
187,
188,
189]. Vocal behavior is also influenced by genotype and breed, suggesting the need for adaptive, population-aware models [
58,
107,
190,
191,
192]. Together, these findings support acoustic monitoring as a sensitive, age-aware tool for broiler welfare assessment under diverse production conditions.
While vocalizations and activity-related sounds offer valuable insights into broiler welfare, their interpretation through AI models remains inherently ambiguous due to the multifactorial nature of stress and discomfort. Similar acoustic patterns may arise from thermal stress, social interactions, or environmental disturbances, complicating one-to-one mappings between sound events and welfare states. Current AI approaches often infer welfare indirectly by associating sound frequency or intensity with predefined stress labels, which risks oversimplification. For welfare-sensitive applications, model transparency and explainability become particularly critical, as automated alerts may influence management interventions. Therefore, audio-based welfare assessment should be framed as a supportive, probabilistic indicator rather than a deterministic diagnostic tool, ideally complemented by multimodal sensing and expert validation.
Table 8 links welfare indicators with characteristic sound patterns and AI-based monitoring approaches.
Figure 4 illustrates the dominant feature–model–task relationships reported in poultry health and welfare monitoring literature.
10. Evaluation Metrics and Real-World Deployment Performance
A diverse set of metrics is used to evaluate broiler sound classification systems, reflecting variation in task objectives and model designs. Commonly reported metrics include classification accuracy, recognition accuracy, F1-score, mean average precision (mAP), signal-to-noise ratio (SNR), and root mean square error (RMSE). Reported accuracies typically range from 88% to over 94%, with optimized classical models such as kNN and Random Forest frequently exceeding 90% accuracy [
31,
91,
101]. Recognition accuracy can reach up to 99% when majority voting is applied to frame-level predictions [
91,
101], while deep learning–based multi-class stress detection reports mAP values as high as 0.97, indicating a strong balance between precision and recall [
70]. Signal-quality metrics such as SNR and RMSE are widely used to assess preprocessing effectiveness, as improved noise suppression directly enhances downstream classification performance [
47,
90]. Additional indicators, including cough rate and confidence intervals for overlapping sound recognition, have been proposed to quantify health-related vocal activity in complex acoustic environments [
31,
93].
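The signal-quality metrics above can be computed directly. The sketch below uses a synthetic tone plus white noise and a deliberately simple moving-average "denoiser" to show how SNR (in dB) and RMSE register preprocessing quality; all signals and parameters are illustrative:

```python
import numpy as np

def snr_db(clean, estimate):
    # SNR in dB: signal power relative to the power of the residual noise.
    noise = estimate - clean
    return 10.0 * np.log10(np.sum(clean ** 2) / np.sum(noise ** 2))

def rmse(clean, estimate):
    return np.sqrt(np.mean((clean - estimate) ** 2))

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 8000, endpoint=False)
clean = np.sin(2 * np.pi * 440 * t)
noisy = clean + rng.normal(0, 0.3, size=clean.shape)
# A trivial 3-tap moving average should raise SNR and lower RMSE.
kernel = np.ones(3) / 3
denoised = np.convolve(noisy, kernel, mode="same")
print(round(snr_db(clean, noisy), 1), round(snr_db(clean, denoised), 1))
```

Note that both metrics require a clean reference signal, which is why they are typically reported from controlled recordings rather than from live farm audio.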
Despite these high reported values, accuracy alone is insufficient for real-world deployment in commercial poultry houses. Farm environments are characterized by high background noise, environmental variability, sensor failures, and data irregularities such as missing values and outliers, all of which can degrade system reliability [
193]. Moreover, accuracy does not reflect performance for rare but welfare-critical events, alert timeliness, robustness across production cycles, or system usability—factors essential for farmer trust and effective decision-making [
193,
194,
195,
196]. Consequently, broader evaluation frameworks incorporating robustness, scalability, and operational relevance are required for practical deployment.
Rare event detection, including abnormal coughing or distress calls, is more appropriately evaluated using recall, F1-score, and event-based mAP rather than frame-level accuracy. Recall is particularly critical in health monitoring, as missed detections can have serious welfare and economic consequences [
48]. F1-scores of approximately 94% have been reported for rare cough detection using wavelet-based features and hidden Markov models under controlled conditions [
48]. Event-based mAP provides a more realistic assessment of system effectiveness in noisy and overlapping acoustic settings [
197]. Noise reduction quality remains a key determinant of rare-event detection performance, directly influencing recall and F1-score [
90].
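Event-based scoring can be illustrated with a simple tolerance-matching scheme: each predicted event time is matched to at most one annotated event within a fixed window, and precision, recall, and F1 are computed over matches. The 0.5 s tolerance and the event times below are hypothetical:

```python
def event_metrics(true_events, pred_events, tolerance=0.5):
    # Greedy one-to-one matching of predictions to annotated events.
    # Unmatched predictions are false positives; unmatched true events are misses.
    matched = set()
    tp = 0
    for p in pred_events:
        for i, t in enumerate(true_events):
            if i not in matched and abs(p - t) <= tolerance:
                matched.add(i)
                tp += 1
                break
    fp = len(pred_events) - tp
    fn = len(true_events) - tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Four annotated cough events; the detector finds three (one slightly late)
# and raises one false alarm.
true_events = [1.0, 4.2, 7.5, 9.9]
pred_events = [1.1, 4.6, 6.0, 9.8]
p, r, f1 = event_metrics(true_events, pred_events)
print(round(p, 2), round(r, 2), round(f1, 2))
```

Unlike frame-level accuracy, this formulation penalizes both missed coughs and spurious alerts directly, which matches how such systems succeed or fail in practice.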
Real-time deployment introduces additional constraints related to hardware capacity, computational efficiency, and latency. Lightweight deep learning architectures optimized for edge deployment have demonstrated real-time inference with latencies below 200 ms on embedded systems, confirming feasibility for on-site monitoring in commercial poultry environments [
195,
198,
199]. Low-latency processing is essential for timely intervention following health or welfare anomalies [
200,
201]. However, environmental noise, connectivity limitations, and hardware–software integration challenges continue to limit large-scale adoption, highlighting the need for careful co-design of models, hardware, and data pipelines.
Table 9 summarizes commonly reported evaluation metrics and their suitability for commercial deployment.
11. Key AI Challenges and Limitations in Broiler Audio Monitoring
A major limitation of current audio-based AI systems for poultry health and welfare monitoring is pervasive dataset bias, which directly undermines robustness and real-world applicability. Most datasets lack standardization in recording protocols, annotation criteria, and evaluation metrics, making cross-study comparison difficult and often misleading [
36,
139]. Data are frequently collected from limited acoustic environments, breeds, housing systems, and management conditions, resulting in models that perform well in controlled or single-farm settings but fail to generalize elsewhere [
139,
206]. Class imbalance further exacerbates this issue, as health- or stress-related sounds such as coughs, sneezes, and distress calls occur far less frequently than background noise, biasing models toward majority classes and inflating headline accuracy metrics [
172,
207]. Additionally, many datasets underrepresent the acoustic complexity of commercial farms, where overlapping sounds from ventilation systems, feeders, and animal activity differ substantially from experimental conditions [
139]. Together, these factors indicate that reported performance often reflects dataset-specific characteristics rather than true predictive capability.
Closely related to dataset bias is the widespread reliance on single-farm or single-flock data, which limits deployment readiness. Models trained on a single production site tend to learn farm-specific acoustic signatures related to microphone placement, housing reverberation, and management routines, leading to substantial performance degradation when evaluated across farms [
206,
208,
209]. In poultry systems, domain shift is further driven by age-dependent vocal behavior, housing design, ventilation regimes, and potential breed-specific physiological differences [
56,
58,
139]. Vocalization frequency and structure change markedly as broilers grow, meaning age imbalance can inadvertently confound health and welfare predictions [
58,
108]. Addressing these challenges requires multi-farm datasets, explicit cross-environment validation, and the application of domain adaptation, data augmentation, and noise-robust training strategies to improve generalization across production contexts [
25,
36,
210].
A fundamental structural limitation in the field is the absence of standardized benchmark datasets and shared evaluation frameworks for poultry audio AI. Unlike computer vision or human speech recognition, poultry audio research remains fragmented, with most studies relying on proprietary or newly collected datasets tailored to specific experiments [
25,
36,
96]. This lack of shared benchmarks hampers reproducibility, prevents fair comparison between algorithms, and slows cumulative methodological progress [
139,
211]. Although some publicly available datasets and open-source tools exist, their scope is limited, and poultry-specific large-scale benchmarks remain scarce [
25,
36,
96]. Inconsistent public release of datasets and code further restricts transparency and independent validation [
33,
36]. Establishing open, multi-environment benchmark datasets with harmonized evaluation protocols is therefore critical to advancing reproducibility, comparability, and real-world trust in poultry audio AI systems [
211,
212,
213].
Beyond data-related issues, interpretability and deployment constraints pose significant barriers to responsible adoption, particularly for welfare-critical decisions. Interpretable models are essential for building trust among farmers, veterinarians, and regulators, as they enable AI outputs to be linked to biologically meaningful vocal features and known behavioral or physiological mechanisms [
36,
214]. In contrast, black-box models—despite high predictive accuracy—introduce risks related to accountability, undetected errors, and inappropriate interventions in variable farm environments [
206,
215,
216,
217,
218]. These concerns are compounded by edge deployment constraints, as commercial poultry houses require low-power, continuously operating systems. Model compression techniques such as pruning, quantization, and knowledge distillation enable edge deployment but often involve trade-offs between accuracy and efficiency [
219,
220,
221,
222,
223,
224,
225,
226]. Future poultry audio AI systems must therefore balance interpretability, robustness, and computational feasibility to ensure ethical, transparent, and scalable deployment within precision livestock farming.
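As a sketch of the post-training quantization mentioned above, the snippet below applies an affine uint8 scheme to a toy weight matrix; the matrix shape, value range, and 8-bit choice are illustrative assumptions rather than a description of any specific deployed system:

```python
import numpy as np

def quantize_uint8(w):
    # Affine post-training quantization: map the float range [min, max]
    # onto 0..255, storing only uint8 values plus a scale and offset.
    lo, hi = float(w.min()), float(w.max())
    scale = (hi - lo) / 255.0 or 1.0      # guard against a constant tensor
    q = np.round((w - lo) / scale).astype(np.uint8)
    return q, scale, lo

def dequantize(q, scale, lo):
    return q.astype(np.float32) * scale + lo

rng = np.random.default_rng(0)
w = rng.normal(0, 0.1, size=(256, 128)).astype(np.float32)  # toy weight matrix
q, scale, lo = quantize_uint8(w)
w_hat = dequantize(q, scale, lo)
# 4x smaller storage (1 byte vs 4 per weight) at a bounded reconstruction error.
max_err = float(np.abs(w - w_hat).max())
print(q.dtype, round(max_err / scale, 2))
```

The worst-case per-weight error is half a quantization step (0.5 × scale), which is the accuracy-versus-footprint trade-off that edge deployments must validate against their target task.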
12. Research Gaps and Future Directions
Building on the systemic limitations discussed in
Section 11, several task-specific research gaps remain in broiler audio-AI. In particular, subtle respiratory sounds such as sneezing, rales, and low-intensity distress calls remain under-explored because they are infrequent, weak in amplitude, and difficult to separate from background noise in commercial farm environments [
48,
49]. Their rarity (e.g., only 0.24% of recorded sounds in one study) leads to limited labeled data and reduced detection sensitivity, even when precision is high [
49]. Consequently, most studies prioritize more salient vocalizations such as coughing, which are easier to detect and more directly linked to respiratory disease [
26,
31,
48]. The difficulty of separating subtle respiratory events from background noise, combined with age- and farm-dependent vocal variability, further increases the need for large, diverse datasets and advanced signal processing techniques, limiting broader investigation of these sounds [
48,
49].
Most existing broiler audio-AI systems also rely on binary rather than multi-class classification due to practical constraints in data availability, labeling effort, and computational complexity. Binary models targeting a single event (e.g., cough vs. non-cough) simplify annotation and training while maintaining higher robustness and accuracy under variable farm conditions [
31,
117]. In contrast, reliable multi-class classification requires extensive, well-annotated datasets covering diverse vocalizations and noise sources, which are costly to produce and difficult to generalize across broiler ages and environments [
26,
31]. Although recent studies demonstrate the feasibility of multi-class stress and sound-type classification, these models often face trade-offs in complexity, generalization, and suitability for low-power edge deployment [
33,
117], with cross-farm transferability remaining weak due to differences in acoustics, management practices, and housing systems [
36,
139,
206].
Furthermore, most studies report only single-run performance metrics without confidence intervals or repeated evaluations. This prevents rigorous assessment of model stability and reproducibility, as performance variance across random initializations, train–test splits, or cross-validation folds is rarely quantified. Consequently, reported accuracies may overestimate real-world performance stability. Limiting factors in poultry audio-based AI monitoring are summarized in
Figure 5.
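A minimal example of the reporting practice advocated here, using hypothetical accuracies from repeated runs and a normal-approximation 95% confidence interval (a small-sample t-interval would be slightly wider):

```python
import math
import statistics as st

# Hypothetical accuracies from five repeated runs with different random seeds
# (illustrative numbers, not taken from any cited study).
runs = [0.914, 0.927, 0.905, 0.931, 0.918]

mean = st.mean(runs)
sd = st.stdev(runs)                      # sample standard deviation (n - 1)
half = 1.96 * sd / math.sqrt(len(runs))  # normal-approximation CI half-width
print(f"accuracy = {mean:.3f} +/- {half:.3f} (95% CI, n={len(runs)})")
```

Reporting the interval alongside the mean makes it immediately visible whether two models' headline accuracies are distinguishable given run-to-run variance.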
Emerging learning paradigms offer promising directions to address current limitations in poultry audio-AI systems. Self-supervised learning enables models to learn robust audio representations from large volumes of unlabeled data, reducing dependence on manual annotation and improving downstream performance under limited labeled conditions [
227,
228]. Multimodal learning further enhances robustness by fusing audio with video and environmental sensor data, mitigating noise, variability, and farm-specific effects while improving generalization [
229]. Such multimodal AI systems capture the multidimensional nature of animal welfare more effectively than unimodal approaches, providing deeper insights into behavior, health status, and environmental stressors [
139,
230,
Preliminary studies report that feature-level fusion strategies demonstrate superior robustness and scalability in real-world poultry farm conditions, while livestock studies report welfare and disease prediction accuracies exceeding 90%, supporting proactive intervention and improved management efficiency [
230,
231,
232]. IoT-based multi-sensor fusion also improves anomaly detection and predictive analytics, with reported gains of 25% in health-metric accuracy and 30% in sensor noise reduction [
232]. Combined with domain adaptation, these approaches show strong potential for improving cross-farm performance [
26,
70]. Future research should adopt repeated experimental protocols and report mean performance with standard deviation or confidence intervals to ensure robust comparison, reproducibility, and deployment reliability. At the deployment level, edge-AI enables real-time, scalable application through low-latency on-site inference using efficient TinyML and computer vision models, supporting practical health, welfare, and environmental monitoring in commercial broiler houses [
202,
233,
234,
235]. Collectively, advances in self-supervised, multimodal, and edge-AI frameworks are expected to drive the next generation of farm-ready audio-AI systems for broiler production (
Figure 6).
13. Conclusions
Audio-AI has already demonstrated strong capabilities in broiler monitoring by accurately detecting and classifying vocalizations related to health and welfare, such as coughs and distress calls, achieving classification accuracies often above 90% using machine learning models like Random Forest, SVM, and CNNs. These systems enable non-invasive, real-time health assessment and early disease detection, providing actionable insights for timely intervention and improved animal welfare. However, large-scale adoption faces limitations including variability in vocalizations across ages and environments, lack of standardized datasets and evaluation protocols, computational constraints for continuous monitoring, and economic barriers such as high initial costs and uncertain returns, especially in resource-limited settings. Future AI systems can improve respiratory health and welfare by integrating multimodal sensor data (audio, video, environmental), employing transfer learning and domain adaptation to enhance generalization, deploying edge-AI for low-latency and scalable monitoring, and emphasizing explainability to build trust among stakeholders. This field is critical for sustainable poultry production because it supports early disease detection, reduces labor and resource use, enhances animal welfare, and lowers environmental impacts through precision management, thereby contributing to ethical, efficient, and resilient food systems. Continued research and development focused on robustness, cost-effectiveness, and standardization will be essential to realize the full potential of audio-AI in commercial broiler houses.