1. Introduction
The integration of artificial intelligence into monitoring is changing the landscape of animal welfare, behavioral studies, and environmental control. Among the many sensing modalities, acoustic sensing has emerged as a powerful, non-invasive way of analyzing the physiological and emotional states of poultry. When vocalizations are properly captured, preprocessed, and analyzed, they provide biological and behavioral information that serves as digital biomarkers of welfare indicators, including stress, disease, environmental discomfort, and social–emotional cues [
1].
This systematic literature review explores the intersection of bioacoustics, machine learning (ML), and animal welfare, with poultry calls as the central data modality. Foundational methods, most notably Mel-Frequency Cepstral Coefficients (MFCCs) and spectrogram analysis, laid the groundwork and are now being supplanted or augmented by deep learning (DL), transfer learning, and self-supervised models such as wav2vec2 and Whisper. Progress toward farm deployment is further accelerated by innovations in TinyML, edge computing, and real-time deployment frameworks. Chickens have more than 30 distinct call types [
2], spanning distress, mating, and predator alarm calls, among others, which makes their vocal repertoire one of the most diverse among domesticated animals. These repertoires can give insight into emotional and physiological states, making vocalization analysis one of the most powerful non-invasive methods for assessing welfare. From the perspective of ethology and communication theory, vocalizations are evolutionarily selected tools for social coordination shaped by environmental pressures and flock dynamics. In that sense, analyzing poultry vocalizations aligns with embodied cognition, whereby vocal behavior is not merely signaling but a reflection of internal state and context. Several publicly available datasets—such as chick stress vocalizations [
3], laying hen audio [
4], and raw waveform recordings [
5]—have enabled reproducible benchmarking and model comparisons. These and many other datasets are extensively discussed and compared in
Section 3,
Section 4 and
Section 5, alongside the models, feature strategies, and evaluation pipelines they support. Through a comprehensive thematic synthesis of peer-reviewed studies, this review identifies methodological trends and key benchmark architectures, as well as critical gaps in current approaches. Growing emphasis is placed on multi-modal and explainable AI, on dynamic rather than static acoustic features, and on standardized datasets and pipelines for reproducibility and generalization. Furthermore, this work adds bibliometric co-occurrence mapping to illustrate the evolving thematic structure of the field, thereby helping to identify future research trajectories and interdisciplinary collaborations. By bridging computational modeling with ethological relevance, this review aims to inform researchers, practitioners, and technologists about the current state, limitations, and untapped potential of AI-driven poultry vocalization analysis. The review follows a systematic search approach [
6] as seen in
Figure 1, covering IEEE Xplore, PubMed, Scopus, Web of Science, SpringerLink, and other databases, and focusing on research published between 2018 and March 2025. The query combined terms related to poultry vocalizations and AI (e.g., “chicken,” “acoustic,” “machine learning,” “CNN,” “Transformer,” “wav2vec”).
In total, approximately 150 papers were examined, of which 124 were deemed relevant for inclusion based on technical rigor and contribution to poultry acoustic sensing. Studies employing ML or signal processing on vocalizations related to welfare, behavior, or disease detection were prioritized, as summarized in
Figure 2. Seminal references on acoustic features and deep learning methods (e.g., MFCCs, attention mechanisms) are retained to establish technical context. The reviewed literature is organized into six main themes: acoustic features, ML/DL models, behavior and stress detection, disease classification, toolkits and pipelines, and on-farm deployment. Notably, over 85% of the references were published between 2020 and 2025, underscoring the rapid growth of this interdisciplinary domain.
2. Acoustic Features and Preprocessing Techniques
Meaningful acoustic feature extraction and sound preprocessing are pivotal in animal vocalization analysis. The reviewed literature indicates that MFCCs, the short-time Fourier transform (STFT), spectral entropy, and Mel-spectrograms have remained core components of both traditional and deep learning pipelines. These methods are summarized in
Table 1, showing how static features like MFCCs contrast with dynamic representations such as cochleagrams and wav2vec2 in vocalization analysis. The most popular acoustic feature is the MFCC, cited in over half of the papers on animal sound classification. MFCCs have been used to characterize vocalizations from broilers, laying hens, chicks, ducks, and other species, as they extract perceptually relevant frequency information. For example, Umarani et al. [
7], Pereira et al. [
8], Jung et al. [
9], and Thomas et al. [
10] rely heavily on the use of MFCCs for feeding classifiers like LSTM, CNNs, or k-NN for animal sound classification. In a more technical analysis, standard and enhanced MFCC experiments were further elaborated on by Prabakaran and Sriuppili [
11], who detailed the audio signal analysis steps of pre-emphasis, windowing, FFT, and DCT and compared multiple MFCC-hybrid configurations. Davis and Mermelstein [
12] compared various speech parameterization methods and concluded that MFCCs outperform others in recognition accuracy for speech signals. This observation favors the continued dominance of the MFCCs in animal sound classification and warrants their use to proceed with poultry vocalization. Contextual cochleagram features proposed by Sattar [
13] beat the MFCCs by over 20% in acoustic recognition performance in the presence of environmental noise on the farms, thus raising concerns about the wide acceptance of MFCCs in smart agriculture settings. Puswal and Liang [
14] explored the correlation between vocal features and anatomical traits in chickens. While morphological differences between the sexes were noted, the study found only a weak correlation between vocal acoustics and physiology, suggesting that behavioral factors and context influence acoustic variability more strongly than morphology does. This favors dynamic rather than static acoustic features for poultry classification models.
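To make the standard MFCC pipeline concrete, the following Python sketch (not drawn from any of the cited studies) extracts MFCCs with librosa, covering pre-emphasis, windowed FFT framing, Mel filtering, the DCT, and delta coefficients; the file name, sampling rate, and frame parameters are illustrative.

```python
import numpy as np
import librosa

# Load a recording and resample; the file path and all parameters are illustrative.
y, sr = librosa.load("hen_call.wav", sr=16000)

# Pre-emphasis boosts high frequencies before framing.
y = librosa.effects.preemphasis(y, coef=0.97)

# Windowed FFT, Mel filterbank, log compression, and DCT are handled internally.
mfcc = librosa.feature.mfcc(
    y=y, sr=sr,
    n_mfcc=13,       # number of cepstral coefficients kept after the DCT
    n_fft=512,       # frame length for the windowed FFT
    hop_length=256,  # frame shift
    n_mels=40,       # Mel filterbank size
)

# Delta and delta-delta coefficients add temporal dynamics to the static MFCCs.
features = np.vstack([mfcc,
                      librosa.feature.delta(mfcc),
                      librosa.feature.delta(mfcc, order=2)])
print(features.shape)  # (39, n_frames)
```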
The input signals for convolutional networks also often employ spectrograms, especially log-Mel spectrograms. The work of Zhong et al. [
15], Henri and Mungloo-Dilmohamud [
16], Romero-Mujalli et al. [
17], Thomas et al. [
18], Mao et al. [
19], Mangalam et al. [
20], Li et al. [
21], and Neethirajan [
22] analyzed spectrograms for use in CNNs or spectrogram-based embedding studies. Carefully chosen STFT parameters, combined with Mel-scaling and z-normalization, yielded high-quality latent-space representations, as demonstrated in particular by Thomas et al. [
18] and Sainburg et al. [
23].
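A minimal sketch of this log-Mel spectrogram preparation, assuming librosa and illustrative STFT parameters; the per-clip z-normalization mirrors preprocessing commonly applied before CNN training, though the cited studies differ in their exact settings.

```python
import numpy as np
import librosa

def log_mel_input(path, sr=16000, n_fft=1024, hop_length=256, n_mels=64):
    """Return a z-normalized log-Mel spectrogram suitable as CNN input."""
    y, sr = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel, ref=np.max)             # log compression
    return (log_mel - log_mel.mean()) / (log_mel.std() + 1e-8)  # per-clip z-normalization

spec = log_mel_input("broiler_clip.wav")   # illustrative file name
print(spec.shape)                          # (n_mels, n_frames)
```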
Spectral entropy is gaining ground as a possible distress indicator. Herborn et al. [
24] showed that reduced spectral entropy of distress calls, alongside increased daily call rates, was associated with long-term welfare and future well-being outcomes in chicks. Along the same lines, Ginovart-Panisello et al. [
25] characterized fasting-induced stress in newly hatched broilers using Butterworth-filtered signals and spectral centroid parameters. Several studies have developed pipelines to improve preprocessing under real, noisy conditions. Tao et al. [
26] combined MFCCs with zero-crossing rate (ZCR) and exponential smoothing to filter signals before extracting features. Time masking, SpecSameClassMix, and Gaussian noise augmentation were employed to enhance spectrogram robustness in the works of Bermant et al. [
27] and Soster et al. [
3]. Comprehensive augmentations such as frequency masking and noise injection were incorporated by Mao et al. [
19]. Thomas et al. [
10] incorporated noise suppression layers into their broader audio cleaning strategy prior to deep model training.
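The sketch below illustrates the kind of spectrogram augmentation discussed above (time masking, frequency masking, and additive Gaussian noise); the mask sizes and noise level are illustrative and not taken from the cited works.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def augment_spectrogram(spec, max_time_mask=20, max_freq_mask=8, noise_std=0.05):
    """SpecAugment-style time/frequency masking plus additive Gaussian noise."""
    spec = spec.copy()
    n_mels, n_frames = spec.shape

    # Time masking: replace a random block of consecutive frames with the mean value.
    t = int(rng.integers(0, max_time_mask))
    t0 = int(rng.integers(0, max(1, n_frames - t)))
    spec[:, t0:t0 + t] = spec.mean()

    # Frequency masking: replace a random block of Mel bands with the mean value.
    f = int(rng.integers(0, max_freq_mask))
    f0 = int(rng.integers(0, max(1, n_mels - f)))
    spec[f0:f0 + f, :] = spec.mean()

    # Gaussian noise mimics variable barn background levels.
    return spec + rng.normal(0.0, noise_std, size=spec.shape)

dummy = rng.normal(size=(64, 200))   # stand-in for a z-normalized log-Mel spectrogram
augmented = augment_spectrogram(dummy)
```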
Besides feature transformation, automated segmentation tools have proven effective, as benchmarked by Terasaka et al. [
28] and Michaud et al. [
4]. These studies compared libraries such as Librosa, BirdNET, and Perch, and found that BirdNET achieved the highest F1-score. Merino Recalde [
29] developed pykanto, a Python library that facilitates semi-automatic segmentation and labeling of large acoustic datasets for use in deep learning models.
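As a simple illustration of automated segmentation, the sketch below uses librosa's energy-based splitter to cut a long recording into candidate call segments; it is only a stand-in for the more capable detectors in BirdNET or pykanto, and the file name and threshold are illustrative.

```python
import librosa

# Energy-based segmentation of a long barn recording into candidate call segments.
y, sr = librosa.load("barn_recording.wav", sr=22050)
intervals = librosa.effects.split(y, top_db=30, frame_length=2048, hop_length=512)

segments = [y[start:end] for start, end in intervals]
print(f"Detected {len(segments)} candidate vocalization segments")
```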
Beyond MFCCs and spectrograms, researchers also seek other acoustic representations. Latent projection techniques were introduced by Sainburg et al. [
23], which sidestep traditional hand-crafted features. The value of embeddings from pretrained models operating on raw audio is illustrated by the work of Swaminathan et al. [
30] and Bermant et al. [
27]. The learned representations are often superior to hand-crafted ones. Some studies also use time-domain parameters such as duration, pitch, zero-crossing rate, and energy. For instance, Du et al. [
31] extracted nine temporal and spectral features based on source-filter theory to detect thermal discomfort in laying hens. Ginovart-Panisello et al. [
32,
33,
34,
35] often included metrics such as spectral centroid, vocalization rate (VocalNum), and variation in spectral bandwidth in examining the environmental impacts and stress in broiler chickens.
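A sketch of how such time- and frequency-domain descriptors can be computed for a single segmented call with librosa; the pitch search range and file name are illustrative assumptions rather than values from the cited studies.

```python
import numpy as np
import librosa

def call_descriptors(y, sr):
    """Time- and frequency-domain descriptors for a single segmented call."""
    f0 = librosa.yin(y, fmin=200, fmax=4000, sr=sr)   # fundamental-frequency track
    return {
        "duration_s": len(y) / sr,
        "pitch_hz": float(np.nanmedian(f0)),
        "zcr": float(librosa.feature.zero_crossing_rate(y).mean()),
        "rms_energy": float(librosa.feature.rms(y=y).mean()),
        "spectral_centroid_hz": float(librosa.feature.spectral_centroid(y=y, sr=sr).mean()),
        "spectral_bandwidth_hz": float(librosa.feature.spectral_bandwidth(y=y, sr=sr).mean()),
    }

y, sr = librosa.load("call_segment.wav", sr=16000)    # illustrative file name
print(call_descriptors(y, sr))
```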
Table 1.
Comparison of static and dynamic acoustic feature sets in animal vocalization studies. Dynamic features such as cochleagram, SincNet, and wav2vec2 exhibit greater robustness in noisy and real-world farm environments, whereas static features like MFCC and Mel-spectrogram perform well in controlled or low-noise settings.
Feature Type | Feature Name | Study/Authors | Model Used | Environment | Reported Accuracy | Notes |
---|---|---|---|---|---|---|
Dynamic | SincNet | Bravo Sanchez et al. [5] | Raw waveform classifier | Minimal preprocessing | >65% (NIPS4Bplus) | Learns directly from waveform, robust to distortions |
Static | MFCC | Umarani et al. [7] | LSTM | General (RAVDESS) | 97.22% | LSTM + MFCC for emotion recognition |
Static | MFCC | Jung et al. [9] | CNN | General | 91.02% (cattle), 75.78% (hens) | Lower for hens—possibly due to background noise |
Static | MFCC variants + FFT/DCT | Prabakaran & Sriuppili [11] | MFCC variants | Controlled | 94.44% | Comparative setup across MFCC variations |
Dynamic | Cochleagram | Sattar [13] | Context-aware classifier | Noisy farm | >20% higher than MFCC | Better adaptability to environmental noise |
Static | Mel-Spectrogram | Henri et al. [16] | MobileNetV2 | Birdsong (natural) | 84.21% | Limited context modeling |
Dynamic | Spectral Entropy | Herborn et al. [24] | Entropy analysis | Chick stress study | Qualitative improvement | Captures emotional states during distress |
Dynamic | Wav2vec2 Embeddings | Swaminathan et al. [30] | Fine-tuned classifier | Real-world bird data | F1 = 89% | SSL embeddings outperform handcrafted features |
Static | MFCC | Bhandekar et al. [36] | SVM | Lab | 95.66% | Strong in low-noise environments |
Taken together, these publications show that acoustic feature design remains an active and pivotal aspect of poultry vocalization analysis. Features may be entirely hand-crafted, learned, or hybrid, and the chosen approach substantially affects model robustness and generalizability under field conditions characterized by noisy, imbalanced, and unlabeled data.
4. Self-Supervised and Transfer Learning Approaches
Because annotated datasets are scarce in animal vocalization research, transfer learning and self-supervised learning (SSL) have become key methodologies for improving model generalization, reducing training cost, and boosting performance under noisy or resource-limited conditions. Applications of transfer learning and SSL models in animal vocalization research are summarized in
Table 4, illustrating how pretrained architectures enhance performance under data-scarce and noisy conditions. Several studies, mostly focused on poultry and wildlife acoustics, make use of models pretrained on human audio or general bioacoustics and subsequently fine-tuned for species-specific tasks.
4.1. Transfer Learning with Pretrained CNNs and Audio Embeddings
Studies have applied transfer learning by pretraining convolutional models on large-scale datasets such as ImageNet or AudioSet before adapting them to novel acoustic signals. Examples include Henri and Mungloo-Dilmohamud [
16], who fine-tuned MobileNetV2, ResNet50, and InceptionV3 for bird song classification, with the best accuracy (84.21%) achieved by MobileNetV2. Thomas et al. [
10] transferred PANN (Pretrained Audio Neural Network) weights to a multi-objective CNN for broiler vocalization and age estimation. Mangalam et al. [
20] compared a custom CNN with fine-tuned VGG16, concluding that the smaller model worked better under field conditions. Li et al. [
21] showed that chick sexing performance varies across architectures (ResNet-50, GRU, CRNN) depending on breed and feature type. McGinn et al. [
45] used unsupervised feature embeddings derived from the BirdNET CNN to classify within-species vocalizations, highlighting the model’s strength without retraining. Ginovart-Panisello et al. [
37] applied pretrained CNNs to hen spectrograms to detect stress responses in vaccinated hens.
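A minimal transfer-learning sketch in the spirit of these studies, assuming PyTorch and torchvision: an ImageNet-pretrained MobileNetV2 backbone is frozen and only a new classification head is trained on spectrogram images. The five-class setup, batch, and hyperparameters are hypothetical.

```python
import torch
import torch.nn as nn
from torchvision import models

# Adapt an ImageNet-pretrained MobileNetV2 to a hypothetical 5-class poultry-call task.
num_classes = 5
model = models.mobilenet_v2(weights=models.MobileNet_V2_Weights.IMAGENET1K_V1)

# Freeze the convolutional backbone and train only a new classification head.
for param in model.features.parameters():
    param.requires_grad = False
model.classifier[1] = nn.Linear(model.last_channel, num_classes)

optimizer = torch.optim.Adam(model.classifier.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One illustrative training step on a dummy batch of spectrograms tiled to 3 channels.
x = torch.randn(8, 3, 224, 224)
y = torch.randint(0, num_classes, (8,))
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
```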
4.2. Transformer Models and Speech Pretraining
Vaswani et al. [
46] introduced the Transformer, an architecture that replaces recurrence with multi-head self-attention to parallelize sequence modeling and capture long-range dependencies. Although developed for language tasks, it later became fundamental to many acoustic modeling frameworks, including wav2vec2 and BERT. Its scalability and efficiency are especially relevant for poultry vocalization studies that require temporal analysis across different contexts, and transformers from natural language processing are quickly finding utility in audio classification tasks. In a foundational review of AI in livestock, Menezes et al. [
47] emphasized the increasing role of transformer-based models and large language models (LLMs) such as BERT and wav2vec2 in agricultural applications. Even though the review mainly covered dairy cattle, it highlights the extent to which such architectures could find application in the study of poultry vocalizations, especially in emotion recognition and welfare prediction. Devlin et al. [
48] introduced BERT, a bidirectional Transformer language model trained with masked language modeling and next-sentence prediction. BERT achieved remarkable results on several language processing benchmarks, providing the impetus for models such as Whisper and fine-tuned wav2vec2, which are now being leveraged for poultry vocalization decoding.
Ghani et al. [
35] examined transfer learning for large-scale birdsong detection using models such as BirdNET and PaSST. PaSST, distilled from BirdNET, achieved the highest in-domain performance (F1 = 0.704). Swaminathan et al. [
30] fine-tuned wav2vec models on bird recordings with a feed-forward classifier, achieving an F1 of 0.89 on xeno-canto data. Abzaliev et al. [
49] used wav2vec2 pretrained on human speech to classify dog barks by breed, sex, and context, outperforming all-frames models. Sarkar and Magimai.-Doss [
50] found speech-pretrained SSL models to perform on par with those trained specifically for bioacoustics, making it feasible to reuse human-centric models. Neethirajan [
51] studied OpenAI’s Whisper model for decoding chicken vocalizations into token sequences, which were then analyzed by sentiment classifiers to infer emotional states. Morita et al. [
52] used Transformer-based models for long-range dependency studies in Bengalese finch songs: eight syllables appeared to be a good context length. Gong et al. [
53] introduced the Audio Spectrogram Transformer (AST)—a convolution-free model that feeds patch-based spectrogram inputs into a Transformer encoder. AST achieved state-of-the-art accuracy across major audio classification benchmarks, emphasizing the potential of attention-based architectures for structured poultry vocalization analysis.
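As a sketch of attention-based spectrogram modeling, the following code loads the publicly released AudioSet-finetuned AST checkpoint via the Hugging Face transformers library and runs a forward pass on a dummy clip; adapting it to poultry calls would require replacing and fine-tuning the classification head.

```python
import torch
from transformers import ASTFeatureExtractor, ASTForAudioClassification

# The checkpoint below is the publicly released AudioSet-finetuned AST model; the
# waveform is a random stand-in for a 10 s poultry recording sampled at 16 kHz.
ckpt = "MIT/ast-finetuned-audioset-10-10-0.4593"
extractor = ASTFeatureExtractor.from_pretrained(ckpt)
model = ASTForAudioClassification.from_pretrained(ckpt)

waveform = torch.randn(16000 * 10).numpy()
inputs = extractor(waveform, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits   # (1, 527) AudioSet classes

# For poultry tasks, this head would be replaced and fine-tuned on labeled farm audio.
print(logits.shape)
```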
4.3. Self-Supervised Representation Learning
SSL models have made significant inroads into bioacoustic modeling by reducing the dependency on labeled datasets: Baevski et al. [
54] presented wav2vec 2.0, which learns latent representations from raw audio via contrastive learning and quantization. It serves as the backbone of several follow-up studies, e.g., [
30,
49]. Wang et al. [
55] applied HuBERT to segment dog vocalizations and performed grammar induction to discover recurring phone-like sequences that may carry meaning in canine sounds. Mørk et al. [
56] tested Data2Vec-denoising, a robust self-supervised pretraining approach that yields up to 18% accuracy improvements over supervised keyword-spotting baselines. Bravo Sanchez et al. [
5] employed SincNet, a neural architecture with parameterized sinc filters, to classify bird vocalizations directly from raw audio waveforms. Attaining more than 65% accuracy on the NIPS4Bplus dataset with minimal preprocessing, this work demonstrates the efficacy of raw-signal models for lower-complexity classification of poultry vocalizations. In personalized adaptive fine-tuning, Brydinskyi et al. [
57] indicated that only 10 min of data from an individual could fine-tune wav2vec2 to reduce word error rates: about 3% for natural voices and as much as 10% for synthetic.
Table 4.
Reported performance of transfer learning, self-supervised learning (SSL), and AutoML strategies in animal and bioacoustic vocalization analysis.
Authors | Model/Strategy | Reported Performance |
---|---|---|
Bravo Sanchez et al. [5] | SincNet | >65% accuracy |
Thomas et al. [10] | PANN + CNN | Balanced Accuracy = 87.9% |
Swaminathan et al. [30] | Fine-tuned wav2vec2 | F1 = 89% |
Ghani et al. [35] | PaSST (Transformer) | F1 = 70.4% |
Abzaliev et al. [49] | Pretrained wav2vec2 | Outperformed all-frames models |
Mørk et al. [56] | Data2Vec SSL | +18% accuracy vs. supervised baseline |
Brydinskyi et al. [57] | Personalized wav2vec2 | WER decreased ~3% for natural, ~10% for synthetic voices |
Tosato et al. [58] | AutoKeras NAS (Xception) | Outperformed ResNet, VGG, etc. |
Wav2vec2 performs better than many traditional models in poultry call detection because of its combination of contextualized audio embeddings and contrastive self-supervised training. Whereas the MFCC pipeline depends on handcrafted features, wav2vec2 learns deep representations from the raw waveform by predicting masked latent representations. In this way, the model can capture subtle temporal patterns and contextual variations in vocalizations that degrade standard features in a noisy farm environment. Its ability to be fine-tuned with limited labeled data also suits low-resource domains such as poultry welfare monitoring. Similarly, SincNet outperforms several CNN-based methods because it learns sinc-based filters constrained to represent meaningful frequency bands. This inductive bias lets the model extract frequency-specific features that are physiologically relevant to bird calls while reducing the parameter search space, thereby enhancing generalization on small datasets. Lastly, it operates directly on the raw waveform, avoiding errors introduced by spectral-domain transformations such as the STFT or Mel-scaling and giving the classifier increased resilience to the acoustic distortions encountered in the real world.
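The core idea behind SincNet's constrained filters can be sketched in a few lines: each filter is a band-pass kernel defined entirely by two cutoff frequencies, which in SincNet are the trainable parameters. The cutoffs, sampling rate, and kernel length below are illustrative.

```python
import numpy as np

def sinc_bandpass_kernel(f_low, f_high, sr=16000, kernel_size=251):
    """Band-pass FIR kernel parameterized only by its two cutoff frequencies."""
    t = (np.arange(kernel_size) - kernel_size // 2) / sr
    # Difference of two low-pass sinc filters yields a band-pass response.
    kernel = 2 * f_high * np.sinc(2 * f_high * t) - 2 * f_low * np.sinc(2 * f_low * t)
    kernel *= np.hamming(kernel_size)                # window to reduce spectral leakage
    return kernel / np.abs(kernel).sum()

# In SincNet the cutoffs are trainable; convolving the raw waveform with a bank of
# such kernels replaces the first convolutional layer. The bands here are illustrative.
bank = np.stack([sinc_bandpass_kernel(lo, hi)
                 for lo, hi in [(300, 1500), (1500, 4000)]])
print(bank.shape)   # (2, 251)
```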
While models such as wav2vec2 and Whisper perform exceedingly well when fine-tuned for poultry vocalizations, it should be noted that their original training was conducted on human-speech corpora. The structure, phoneme inventory, and temporal dynamics of animal sounds differ markedly from those of human speech. Consequently, although such systems offer a generic solution to acoustic feature extraction, the semantic alignment and acoustic priors engineered for human speech are not ideal for decoding emotional or behavioral cues specific to poultry. For instance, the spectral bandwidth and non-verbal call structures of birds do not satisfy the phonetic segmentation assumptions on which human speech models rely heavily. Such mismatches introduce noise into downstream tasks and limit zero-shot generalization across unseen animal domains.
4.4. AutoML and Neural Architecture Search (NAS)
In addition to manual transfer learning, some studies employ automated approaches to model discovery: Tosato et al. [
58] used AutoKeras to identify an optimal Xception architecture for classifying bird vocalizations, outperforming MobileNetV2, ResNet50, and VGG16. Gupta et al. [
41] explored several deep models on the Cornell Bird Challenge dataset, including CNN-LSTM and CNN-LMU, with CNN-LMU achieving the highest accuracy on Red Crossbill calls. The top-performing classifiers are reported in
Table 5 and
Table 6, respectively.
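A minimal sketch of an AutoKeras-based architecture search like the one used by Tosato et al. [58]; the inputs here are randomly generated spectrogram-like arrays, the trial and epoch budgets are illustrative, and the searched architecture will generally differ from the Xception model found in the original study.

```python
import numpy as np
import autokeras as ak

# Random stand-ins for spectrogram "images" shaped (n_mels, n_frames, 1).
x_train = np.random.rand(100, 64, 128, 1).astype("float32")
y_train = np.random.randint(0, 3, size=(100,))

clf = ak.ImageClassifier(max_trials=5, overwrite=True)   # neural architecture search
clf.fit(x_train, y_train, epochs=3)

best_model = clf.export_model()   # the best Keras model found by the search
best_model.summary()
```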
Taken together, these studies validate the power of pretrained and self-supervised models in enabling accurate, efficient, and scalable animal vocalization analysis. Whether through vision-based CNN backbones, language-inspired transformers, or SSL-driven embeddings, cross-domain transfer enables generalizable, low-data animal sound classification, which is especially important in precision-livestock contexts, where annotation is often time-consuming and costly.
10. Conclusions
This systematic review unveils a rapidly transforming landscape where artificial intelligence fundamentally redefines our understanding of animal communication and welfare assessment through poultry vocalizations. Our comprehensive analysis of over 120 studies reveals a decisive paradigm shift from traditional hand-crafted acoustic features toward sophisticated self-supervised learning architectures, with models like wav2vec2 and SincNet demonstrating unprecedented capabilities in decoding the complex emotional and physiological states embedded within avian vocalizations. The convergence of bioacoustics and machine learning has reached a critical inflection point, where theoretical advances in deep learning architectures now demand practical translation into robust, deployable farm-level systems. However, our investigation exposes fundamental challenges that threaten to impede widespread adoption: the persistent opacity of black-box models undermines stakeholder trust, cross-species generalization remains elusive despite sophisticated transfer learning approaches, and the absence of standardized evaluation frameworks creates a fragmented research ecosystem that hinders reproducible science.
The interpretability crisis emerges as perhaps the most pressing concern for real-world deployment. While achieving impressive classification accuracies exceeding 95% in controlled settings, current deep learning models operate as impenetrable decision-making systems, providing little insight into which acoustic signatures drive welfare assessments. This opacity becomes particularly problematic when veterinarians and farm operators must act upon AI-generated alerts, demanding explainable artificial intelligence solutions that balance performance with transparency. Domain adaptation challenges reveal the brittleness of current approaches when deployed across diverse poultry breeds, housing conditions, and environmental contexts. Models trained on broiler vocalizations frequently fail when applied to laying hens, while embedding drift causes performance degradation when acoustic environments shift from laboratory to commercial farm settings. This limitation threatens the scalability of AI-driven welfare monitoring systems across the heterogeneous landscape of global poultry production.
The integration of edge computing and TinyML frameworks presents both unprecedented opportunities and technical constraints for continuous welfare monitoring. While enabling real-time inference directly on farm hardware, these resource-constrained deployments demand architectural innovations that maintain model performance while operating within strict power and computational budgets. Future trajectories must prioritize the development of interpretable, domain-adaptive models that seamlessly integrate multimodal sensor data while maintaining ethical standards for animal welfare assessment. The establishment of standardized benchmarking protocols, cross-species evaluation frameworks, and transparent dataset sharing initiatives will determine whether this promising field evolves into a transformative technology for precision livestock farming or remains confined to academic research.
The stakes extend beyond technological advancement—they encompass our fundamental responsibility to ensure that AI systems designed to safeguard animal welfare operate with the transparency, reliability, and ethical grounding that both animals and their human caretakers deserve.