Search Results (263)

Search Parameters:
Keywords = Speech Emotion Recognition

20 pages, 1536 KiB  
Article
Graph Convolution-Based Decoupling and Consistency-Driven Fusion for Multimodal Emotion Recognition
by Yingmin Deng, Chenyu Li, Yu Gu, He Zhang, Linsong Liu, Haixiang Lin, Shuang Wang and Hanlin Mo
Electronics 2025, 14(15), 3047; https://doi.org/10.3390/electronics14153047 - 30 Jul 2025
Abstract
Multimodal emotion recognition (MER) is essential for understanding human emotions from diverse sources such as speech, text, and video. However, modality heterogeneity and inconsistent expression pose challenges for effective feature fusion. To address this, we propose a novel MER framework combining a Dynamic Weighted Graph Convolutional Network (DW-GCN) for feature disentanglement and a Cross-Attention Consistency-Gated Fusion (CACG-Fusion) module for robust integration. DW-GCN models complex inter-modal relationships, enabling the extraction of both common and private features. The CACG-Fusion module subsequently enhances classification performance through dynamic alignment of cross-modal cues, employing attention-based coordination and consistency-preserving gating mechanisms to optimize feature integration. Experiments on the CMU-MOSI and CMU-MOSEI datasets demonstrate that our method achieves state-of-the-art performance, significantly improving the ACC7, ACC2, and F1 scores.
(This article belongs to the Section Computer Science & Engineering)
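As a loose illustration only: the cross-attention with consistency gating described in the abstract could be prototyped along the lines of the PyTorch sketch below. Every name and dimension here is a hypothetical placeholder, not the authors' CACG-Fusion implementation.

    # Hypothetical sketch of cross-attention fusion with a consistency gate;
    # not the authors' CACG-Fusion code, and all dimensions are placeholders.
    import torch
    import torch.nn as nn

    class CrossAttentionGatedFusion(nn.Module):
        def __init__(self, dim=128, heads=4):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

        def forward(self, speech, text):
            # speech, text: (batch, seq_len, dim) modality features
            attended, _ = self.attn(query=speech, key=text, value=text)
            # Gate weighs how consistent the attended text cues are with the speech stream.
            g = self.gate(torch.cat([speech, attended], dim=-1))
            return g * attended + (1 - g) * speech  # fused representation

    fused = CrossAttentionGatedFusion()(torch.randn(2, 10, 128), torch.randn(2, 10, 128))
    print(fused.shape)  # torch.Size([2, 10, 128])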

21 pages, 497 KiB  
Article
Small Language Models for Speech Emotion Recognition in Text and Audio Modalities
by José L. Gómez-Sirvent, Francisco López de la Rosa, Daniel Sánchez-Reolid, Roberto Sánchez-Reolid and Antonio Fernández-Caballero
Appl. Sci. 2025, 15(14), 7730; https://doi.org/10.3390/app15147730 - 10 Jul 2025
Viewed by 569
Abstract
Speech emotion recognition has become increasingly important in a wide range of applications, driven by the development of large transformer-based natural language processing models. However, the large size of these architectures limits their usability, which has led to a growing interest in smaller models. In this paper, we evaluate nineteen of the most popular small language models for speech emotion recognition in the text and audio modalities on the IEMOCAP dataset. Based on their cross-validation accuracy, the best architectures were selected to create ensemble models and to evaluate the effect of combining audio and text, as well as of incorporating contextual information, on model performance. The experiments showed a significant increase in accuracy with the inclusion of contextual information and the combination of modalities. The proposed ensemble model achieved a highly competitive accuracy of 82.12% on the IEMOCAP dataset, outperforming several recent approaches. These results demonstrate the effectiveness of ensemble methods for improving speech emotion recognition performance and highlight the feasibility of training multiple small language models on consumer-grade computers.
(This article belongs to the Section Computing and Artificial Intelligence)
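As a minimal, hypothetical illustration of late fusion of the kind the abstract evaluates (the paper does not state the exact ensembling rule), per-class probabilities from the audio and text models can simply be averaged:

    # Hypothetical late-fusion ensemble: average class probabilities from
    # separate audio and text emotion classifiers (not the authors' code).
    import numpy as np

    def ensemble_probs(prob_list, weights=None):
        """Average (optionally weighted) probability arrays of shape (n, classes)."""
        probs = np.stack(prob_list)                        # (models, n, classes)
        w = np.ones(len(prob_list)) if weights is None else np.asarray(weights, float)
        w = w / w.sum()
        return np.tensordot(w, probs, axes=1)              # (n, classes)

    audio_p = np.array([[0.6, 0.3, 0.1]])   # e.g. angry / happy / neutral
    text_p = np.array([[0.2, 0.7, 0.1]])
    print(ensemble_probs([audio_p, text_p]).argmax(axis=1))  # fused prediction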

24 pages, 1664 KiB  
Review
A Comprehensive Review of Multimodal Emotion Recognition: Techniques, Challenges, and Future Directions
by You Wu, Qingwei Mi and Tianhan Gao
Biomimetics 2025, 10(7), 418; https://doi.org/10.3390/biomimetics10070418 - 27 Jun 2025
Viewed by 1535
Abstract
This paper presents a comprehensive review of multimodal emotion recognition (MER), a process that integrates multiple data modalities such as speech, visual, and text to identify human emotions. Grounded in biomimetics, the survey frames MER as a bio-inspired sensing paradigm that emulates the way humans seamlessly fuse multisensory cues to communicate affect, thereby transferring principles from living systems to engineered solutions. By leveraging various modalities, MER systems offer a richer and more robust analysis of emotional states compared to unimodal approaches. The review covers the general structure of MER systems, feature extraction techniques, and multimodal information fusion strategies, highlighting key advancements and milestones. Additionally, it addresses the research challenges and open issues in MER, including lightweight models, cross-corpus generalizability, and the incorporation of additional modalities. The paper concludes by discussing future directions aimed at improving the accuracy, explainability, and practicality of MER systems for real-world applications.
(This article belongs to the Special Issue Intelligent Human–Robot Interaction: 4th Edition)

18 pages, 1498 KiB  
Article
Speech Emotion Recognition on MELD and RAVDESS Datasets Using CNN
by Gheed T. Waleed and Shaimaa H. Shaker
Information 2025, 16(7), 518; https://doi.org/10.3390/info16070518 - 21 Jun 2025
Viewed by 991
Abstract
Speech emotion recognition (SER) plays a vital role in enhancing human–computer interaction (HCI) and can be applied in affective computing, virtual support, and healthcare. This research presents a high-performance SER framework based on a lightweight 1D Convolutional Neural Network (1D-CNN) and a multi-feature fusion technique. Rather than employing spectrograms as image-based input, frame-level features (Mel-Frequency Cepstral Coefficients, Mel-spectrograms, and Chroma vectors) are computed across the sequences to preserve temporal information and reduce computational expense. The model attained classification accuracies of 94.0% on MELD (multi-party conversations) and 91.9% on RAVDESS (acted speech). Ablation experiments demonstrate that the integration of complementary features significantly outperforms any single feature used as a baseline. Data augmentation techniques, including Gaussian noise and time shifting, enhance model generalisation. The proposed method shows significant potential for real-time, audio-only emotion recognition on embedded or resource-constrained devices.
(This article belongs to the Special Issue Artificial Intelligence Methods for Human-Computer Interaction)
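The frame-level feature stack described above (MFCCs, Mel-spectrogram, Chroma) can be reproduced in spirit with librosa; the parameter values below are illustrative assumptions, not the authors' configuration.

    # Illustrative frame-level feature extraction with librosa; n_mfcc and
    # n_mels are assumed values, and the synthetic tone stands in for real speech.
    import numpy as np
    import librosa

    def frame_features(y, sr, n_mfcc=40, n_mels=64):
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)            # (n_mfcc, T)
        mel = librosa.power_to_db(
            librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels))    # (n_mels, T)
        chroma = librosa.feature.chroma_stft(y=y, sr=sr)                  # (12, T)
        # Stack per-frame features -> (T, n_mfcc + n_mels + 12) for a 1D-CNN over time.
        return np.concatenate([mfcc, mel, chroma], axis=0).T

    sr = 16000
    y = librosa.tone(220, sr=sr, duration=1.0)   # synthetic audio as a placeholder
    print(frame_features(y, sr).shape)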

25 pages, 1822 KiB  
Article
Emotion Recognition from Speech in a Subject-Independent Approach
by Andrzej Majkowski and Marcin Kołodziej
Appl. Sci. 2025, 15(13), 6958; https://doi.org/10.3390/app15136958 - 20 Jun 2025
Cited by 1 | Viewed by 584
Abstract
The aim of this article is to critically and reliably assess the potential of current emotion recognition technologies for practical applications in human–computer interaction (HCI) systems. The study made use of two databases: one in English (RAVDESS) and another in Polish (EMO-BAJKA), both containing speech recordings expressing various emotions. The effectiveness of recognizing seven and eight different emotions was analyzed. A range of acoustic features, including energy features, mel-cepstral features, zero-crossing rate, fundamental frequency, and spectral features, were utilized to analyze the emotions in speech. Machine learning techniques such as convolutional neural networks (CNNs), long short-term memory (LSTM) networks, and support vector machines with a cubic kernel (cubic SVMs) were employed in the emotion classification task. The research findings indicated that the effective recognition of a broad spectrum of emotions in a subject-independent approach is limited. However, significantly better results were obtained in the classification of paired emotions, suggesting that emotion recognition technologies could be effectively used in specific applications where distinguishing between two particular emotional states is essential. To ensure a reliable and accurate assessment of the emotion recognition system, care was taken to divide the dataset in such a way that the training and testing data contained recordings of completely different individuals. The highest classification accuracies for pairs of emotions were achieved for Angry–Fearful (0.8), Angry–Happy (0.86), Angry–Neutral (1.0), Angry–Sad (1.0), Angry–Surprise (0.89), Disgust–Neutral (0.91), and Disgust–Sad (0.96) in the RAVDESS. In the EMO-BAJKA database, the highest classification accuracies for pairs of emotions were for Joy–Neutral (0.91), Surprise–Neutral (0.80), Surprise–Fear (0.91), and Neutral–Fear (0.91).
(This article belongs to the Special Issue New Advances in Applied Machine Learning)
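The speaker-disjoint evaluation emphasized above can be set up with scikit-learn's group-aware splitting; the data in this sketch is synthetic and only illustrates the principle.

    # Minimal sketch of a subject-independent split: no speaker appears in
    # both the training and test partitions. All data below is synthetic.
    import numpy as np
    from sklearn.model_selection import GroupShuffleSplit

    X = np.random.randn(100, 40)                   # acoustic feature vectors
    y = np.random.randint(0, 7, size=100)          # seven emotion labels
    speakers = np.random.randint(0, 10, size=100)  # speaker ID per recording

    splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
    train_idx, test_idx = next(splitter.split(X, y, groups=speakers))
    assert set(speakers[train_idx]).isdisjoint(speakers[test_idx])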

20 pages, 1481 KiB  
Article
Analysis and Research on Spectrogram-Based Emotional Speech Signal Augmentation Algorithm
by Huawei Tao, Sixian Li, Xuemei Wang, Binkun Liu and Shuailong Zheng
Entropy 2025, 27(6), 640; https://doi.org/10.3390/e27060640 - 15 Jun 2025
Viewed by 368
Abstract
Data augmentation techniques are widely applied in speech emotion recognition to increase the diversity of data and enhance the performance of models. However, existing research has not deeply explored the impact of these data augmentation techniques on emotional data. Inappropriate augmentation algorithms may distort emotional labels, thereby reducing the performance of models. To address this issue, in this paper we systematically evaluate the influence of common data augmentation algorithms on emotion recognition along three dimensions: (1) we design subjective auditory experiments to intuitively demonstrate the impact of augmentation algorithms on the emotional expression of speech; (2) we jointly extract multi-dimensional features from spectrograms based on the Librosa library and analyze the impact of data augmentation algorithms on the spectral features of speech signals through heatmap visualization; and (3) we objectively evaluate the recognition performance of the model by means of indicators such as cross-entropy loss and introduce statistical significance analysis to verify the effectiveness of the augmentation algorithms. The experimental results show that “time stretching” may distort speech features, affect the attribution of emotional labels, and significantly reduce the model’s accuracy. In contrast, “reverberation” (RIR) and “resampling” within a limited range have the least impact on emotional data, enhancing the diversity of samples. Moreover, their combination can increase accuracy by up to 7.1%, providing a basis for optimizing data augmentation strategies.
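For orientation, the three augmentations discussed (time stretching, resampling within a limited range, and reverberation via a room impulse response) can be prototyped with librosa and scipy roughly as follows; the parameter values and the toy impulse response are assumptions, not the paper's setup.

    # Illustrative augmentation operations; parameters are assumed, not the paper's.
    import numpy as np
    import librosa
    from scipy.signal import fftconvolve

    def time_stretch(y, rate=1.1):
        # The paper found that aggressive stretching can distort emotional cues.
        return librosa.effects.time_stretch(y, rate=rate)

    def resample_shift(y, sr, factor=1.05):
        # Resample to sr*factor and back, a mild tempo/pitch perturbation.
        up = librosa.resample(y, orig_sr=sr, target_sr=int(sr * factor))
        return librosa.resample(up, orig_sr=int(sr * factor), target_sr=sr)

    def add_reverb(y, rir):
        # Convolve with a room impulse response (RIR) and renormalize.
        out = fftconvolve(y, rir)[: len(y)]
        return out / (np.max(np.abs(out)) + 1e-8)

    sr = 16000
    y = librosa.tone(220, sr=sr, duration=1.0)
    rir = np.exp(-np.linspace(0, 8, sr // 4))    # toy exponentially decaying "RIR"
    augmented = [time_stretch(y), resample_shift(y, sr), add_reverb(y, rir)]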

22 pages, 305 KiB  
Review
Review of Automatic Estimation of Emotions in Speech
by Douglas O’Shaughnessy
Appl. Sci. 2025, 15(10), 5731; https://doi.org/10.3390/app15105731 - 20 May 2025
Cited by 1 | Viewed by 430
Abstract
Identification of emotions exhibited in utterances is useful for many applications, e.g., assisting with handling telephone calls or psychological diagnoses. This paper reviews methods to identify emotions from speech signals. We examine the information in speech that helps to estimate emotion, from points of view involving both production and perception. As machine approaches to recognize emotion in speech often have much in common with other speech tasks, such as automatic speaker verification and speech recognition, we compare these processes. Many methods of emotion recognition have been drawn from research on pattern recognition in other areas, e.g., image and text recognition, especially in recent machine learning methods. We show that speech is very different from most other signals that can be recognized, and that emotion identification differs from other speech applications. This review is aimed primarily at non-experts (more algorithmic detail is available in the cited literature), but the presentation also includes substantial discussion for experts.
(This article belongs to the Special Issue Speech Recognition and Natural Language Processing)

15 pages, 4273 KiB  
Article
Speech Emotion Recognition: Comparative Analysis of CNN-LSTM and Attention-Enhanced CNN-LSTM Models
by Jamsher Bhanbhro, Asif Aziz Memon, Bharat Lal, Shahnawaz Talpur and Madeha Memon
Signals 2025, 6(2), 22; https://doi.org/10.3390/signals6020022 - 9 May 2025
Cited by 1 | Viewed by 1666
Abstract
Speech Emotion Recognition (SER) technology helps computers understand human emotions in speech, filling a critical niche in advancing human–computer interaction and mental health diagnostics. The primary objective of this study is to enhance SER accuracy and generalization through innovative deep learning models. Despite its importance, accurately identifying emotions from speech remains challenging due to differences in speakers, accents, and background noise. The work proposes two deep learning models to improve SER accuracy: a CNN-LSTM model and an Attention-Enhanced CNN-LSTM model. These models were tested on the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), collected between 2015 and 2018, which comprises 1440 audio files of male and female actors expressing eight emotions. Both models achieved accuracy rates of over 96% in classifying emotions into eight categories. By comparing the CNN-LSTM and Attention-Enhanced CNN-LSTM models, this study offers comparative insights into modeling techniques, contributes to the development of more effective emotion recognition systems, and has practical implications for real-time applications in healthcare and customer service.
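A rough sketch of the CNN-LSTM family compared in the study is shown below; the layer sizes and the assumption of 40-dimensional frame features are illustrative, not the authors' architecture.

    # Hypothetical CNN-LSTM emotion classifier over frame-level features;
    # layer sizes are illustrative, not the paper's configuration.
    import torch
    import torch.nn as nn

    class CNNLSTM(nn.Module):
        def __init__(self, n_feats=40, n_classes=8):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv1d(n_feats, 64, kernel_size=5, padding=2), nn.ReLU(),
                nn.MaxPool1d(2),
            )
            self.lstm = nn.LSTM(64, 128, batch_first=True)
            self.fc = nn.Linear(128, n_classes)

        def forward(self, x):                  # x: (batch, time, n_feats)
            h = self.conv(x.transpose(1, 2))   # -> (batch, 64, time / 2)
            out, _ = self.lstm(h.transpose(1, 2))
            return self.fc(out[:, -1])         # logits over the 8 RAVDESS emotions

    logits = CNNLSTM()(torch.randn(4, 100, 40))
    print(logits.shape)  # torch.Size([4, 8])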

19 pages, 2092 KiB  
Article
Multi-Detection-Based Speech Emotion Recognition Using Autoencoder in Mobility Service Environment
by Jeong Min Oh, Jin Kwan Kim and Joon Young Kim
Electronics 2025, 14(10), 1915; https://doi.org/10.3390/electronics14101915 - 8 May 2025
Viewed by 620
Abstract
In mobility service environments, recognizing the user's condition and driving status is critical to driving safety and experience. While speech emotion recognition is one possible way to predict driver status, current emotion recognition models have a fundamental limitation: they classify only a single emotion class rather than multiple classes. This prevents a comprehensive understanding of the driver's condition and intention while driving. In addition, mobility devices inherently generate noise that can degrade speech emotion recognition performance in mobility services. Considering mobility service environments, we investigate models that detect multiple emotions while mitigating noise issues. In this paper, we propose a speech emotion recognition model based on an autoencoder for multi-emotion detection. First, we analyze Mel-Frequency Cepstral Coefficients (MFCCs) to design the specific features. We then develop a multi-emotion detection scheme based on an autoencoder that detects multiple emotions with substantially more flexibility than existing models. With our proposed scheme, we investigate and analyze mobility noise impacts and mitigation approaches, and we evaluate the resulting performance.
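A minimal sketch of an autoencoder over MFCC features is given below, where reconstruction error could be thresholded per emotion to flag more than one emotion at a time; the architecture and detection rule are assumptions, not the authors' scheme.

    # Hypothetical MFCC autoencoder; thresholding reconstruction error of
    # emotion-specific models for multi-emotion detection is an assumed design.
    import torch
    import torch.nn as nn

    class MFCCAutoencoder(nn.Module):
        def __init__(self, n_mfcc=40, latent=16):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(n_mfcc, 64), nn.ReLU(), nn.Linear(64, latent))
            self.decoder = nn.Sequential(nn.Linear(latent, 64), nn.ReLU(), nn.Linear(64, n_mfcc))

        def forward(self, x):
            return self.decoder(self.encoder(x))

    x = torch.randn(8, 40)                   # one MFCC vector per frame (dummy data)
    recon = MFCCAutoencoder()(x)
    error = ((recon - x) ** 2).mean(dim=1)   # low error -> input fits the trained emotion
    print(error)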

17 pages, 1277 KiB  
Article
Pragmatic Perception of Insult-Related Vocabulary in Spanish as L1 and L2: A Sociolinguistic Approach
by Raúl Fernández Jódar
Languages 2025, 10(4), 84; https://doi.org/10.3390/languages10040084 - 16 Apr 2025
Viewed by 978
Abstract
This study examines the perception of insult-related vocabulary in Spanish among native speakers (L1) and Polish learners of Spanish as a foreign language (L2). Insults are analyzed as versatile speech acts fulfilling pragmatic functions such as impoliteness, affiliation, and emphasis. Adopting a contrastive approach, this research evaluates perceptions of colloquialism and emotional intensity across three groups: learners without prior stays in Spanish-speaking countries, learners with prior stays, and L1 speakers. Data were collected through surveys assessing knowledge, recognition, and perception of selected insults related to intellect and sexuality. The findings reveal that insults associated with sexuality exhibit the highest perceived offensive load across all groups, while those linked to behavior and intellect are rated as less aggressive. Polish learners of Spanish, particularly those without cultural immersion, tend to overestimate the offensiveness of insults compared to L1 speakers. However, learners with prior stays align more closely with L1 perceptions, underscoring the impact of cultural exposure. The results highlight the pivotal role of context and interlanguage in shaping learners’ interpretations of offensive vocabulary. They also establish a foundation for further exploration into the acquisition and pragmatic use of colloquial and emotionally charged language in L2 learning.

28 pages, 530 KiB  
Article
Advancing Spanish Speech Emotion Recognition: A Comprehensive Benchmark of Pre-Trained Models
by Alex Mares, Gerardo Diaz-Arango, Jorge Perez-Jacome-Friscione, Hector Vazquez-Leal, Luis Hernandez-Martinez, Jesus Huerta-Chua, Andres Felipe Jaramillo-Alvarado and Alfonso Dominguez-Chavez
Appl. Sci. 2025, 15(8), 4340; https://doi.org/10.3390/app15084340 - 14 Apr 2025
Cited by 1 | Viewed by 1560
Abstract
Feature extraction for speech emotion recognition (SER) has evolved from handcrafted techniques through deep learning methods to embeddings derived from pre-trained models (PTMs). This study presents the first comparative analysis focused on using PTMs for Spanish SER, evaluating six models—Whisper, Wav2Vec 2.0, WavLM, HuBERT, TRILLsson, and CLAP—across six emotional speech databases: EmoMatchSpanishDB, MESD, MEACorpus, EmoWisconsin, INTER1SP, and EmoFilm. We propose a robust framework combining layer-wise feature extraction with Leave-One-Speaker-Out validation to ensure interpretable model comparisons. Our method significantly outperforms existing state-of-the-art benchmarks, notably achieving F1 scores of 88.32% on EmoMatchSpanishDB, 99.83% on INTER1SP, and 92.53% on MEACorpus. Layer-wise analyses reveal that emotional representations are best extracted at early layers in 24-layer models and at middle layers in larger architectures. Additionally, TRILLsson exhibits remarkable generalization in speaker-independent evaluations, highlighting the necessity of strategic model selection, fine-tuning, and language-specific adaptations to maximize SER performance for Spanish.
(This article belongs to the Section Computing and Artificial Intelligence)
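Layer-wise embedding extraction from a pre-trained speech model of the kind benchmarked here can be done with Hugging Face transformers by requesting hidden states; the checkpoint and the mean pooling below are illustrative choices, not the paper's exact setup.

    # Illustrative layer-wise feature extraction from Wav2Vec 2.0; the checkpoint
    # and pooling are assumptions, not the authors' configuration.
    import torch
    from transformers import AutoFeatureExtractor, AutoModel

    ckpt = "facebook/wav2vec2-base"                    # example checkpoint
    extractor = AutoFeatureExtractor.from_pretrained(ckpt)
    model = AutoModel.from_pretrained(ckpt, output_hidden_states=True).eval()

    waveform = torch.randn(16000)                      # 1 s of dummy 16 kHz audio
    inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        hidden_states = model(**inputs).hidden_states  # tuple: (layers + 1) x (1, T, dim)

    # Mean-pool each layer over time to get one embedding per layer for a downstream classifier.
    layer_embeddings = [h.mean(dim=1).squeeze(0) for h in hidden_states]
    print(len(layer_embeddings), layer_embeddings[0].shape)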

16 pages, 2022 KiB  
Article
Development of an Artificial Intelligence-Based Text Sentiment Analysis System for Evaluating Learning Engagement Levels in STEAM Education
by Chih-Hung Wu and Kang-Lin Peng
Appl. Sci. 2025, 15(8), 4304; https://doi.org/10.3390/app15084304 - 14 Apr 2025
Viewed by 1296
Abstract
This study aims to create an AI system that analyzes text to evaluate student engagement in STEAM education. It explores how sentiment analysis can measure emotional, cognitive, and behavioral involvement in learning. We developed an AI-based text sentiment analysis system to assess learning engagement, integrating speech recognition, natural language processing techniques, keyword analysis, and text sentiment analysis. The system was designed to evaluate the level of learning engagement effectively. A computational thinking curriculum and study sheets were developed for university students, and students’ participation experiences were collected using these study sheets. The study leveraged the strengths of SnowNLP and Jieba, proposing a hybrid model to perform sentiment analysis on students’ learning experiences. We analyzed (1) the effect of sentiment dictionaries on the model’s accuracy, (2) the accuracy of different models, and (3) keywords. The results indicated that different sentiment dictionaries had a significant impact on the model’s accuracy. The hybrid model proposed in this study, utilizing the NTUSU sentiment dictionary, outperformed the other four models in effectively analyzing learners’ emotions. Keyword analysis indicated that teaching materials or courses designed to promote practical, fun, and easy ways of thinking and building logic helped students develop positive emotions and enhanced their learning engagement. The most frequently occurring keywords associated with negative emotions were “problem”, “error”, “not”, and “mistake”. This finding suggests that learners experiencing challenges during the learning process—such as encountering mistakes, errors, or unexpected outcomes—are likely to develop negative emotions, which in turn decrease their engagement in learning.
(This article belongs to the Special Issue Application of Information Systems)
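For readers unfamiliar with the two libraries named in the abstract, a minimal (hypothetical) combination scores sentiment with SnowNLP and extracts keywords with jieba; the threshold and combination rule are placeholders, not the paper's hybrid model.

    # Minimal illustration of SnowNLP sentiment scoring plus jieba keyword
    # extraction; the combination rule is a placeholder, not the paper's model.
    from snownlp import SnowNLP
    import jieba.analyse

    # Sample learner feedback: "The course is fun, but the program errors frustrate me."
    text = "这个课程很有趣，但是程序出现错误让我很沮丧"

    sentiment = SnowNLP(text).sentiments                     # 0 (negative) .. 1 (positive)
    keywords = jieba.analyse.extract_tags(text, topK=5)

    label = "positive" if sentiment >= 0.5 else "negative"   # assumed threshold
    print(label, round(sentiment, 3), keywords)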

22 pages, 3427 KiB  
Article
A Multimodal Artificial Intelligence Model for Depression Severity Detection Based on Audio and Video Signals
by Liyuan Zhang, Shuai Zhang, Xv Zhang and Yafeng Zhao
Electronics 2025, 14(7), 1464; https://doi.org/10.3390/electronics14071464 - 4 Apr 2025
Viewed by 1558
Abstract
In recent years, artificial intelligence (AI) has increasingly utilized speech and video signals for emotion recognition, facial recognition, and depression detection, playing a crucial role in mental health assessment. However, the AI-driven research on detecting depression severity remains limited, and the existing models are often too large for lightweight deployment, restricting their real-time monitoring capabilities, especially in resource-constrained environments. To address these challenges, this study proposes a lightweight and accurate multimodal method for detecting depression severity, aiming to provide effective support for smart healthcare systems. Specifically, we design a multimodal detection network based on speech and video signals, enhancing the recognition of depression severity by optimizing the cross-modal fusion strategy. The model leverages Long Short-Term Memory (LSTM) networks to capture long-term dependencies in speech and visual sequences, effectively extracting dynamic features associated with depression. Considering the behavioral differences of respondents when interacting with human versus robotic interviewers, we train two separate sub-models and fuse their outputs using a Mixture of Experts (MOE) framework capable of modeling uncertainty, thereby suppressing the influence of low-confidence experts. In terms of the loss function, the traditional Mean Squared Error (MSE) is replaced with Negative Log-Likelihood (NLL) to better model prediction uncertainty and enhance robustness. The experimental results show that the improved AI model achieves an accuracy of 83.86% in depression severity recognition. The model's computational cost is only 0.468 GFLOPs (floating-point operations), with a parameter size of just 0.52 MB, demonstrating its compactness and strong performance. These findings underscore the importance of emotion and facial recognition in AI applications for mental health, offering a promising solution for real-time depression monitoring in resource-limited environments.
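The switch from MSE to a Gaussian negative log-likelihood mentioned above can be written directly in PyTorch; the numbers below are made up and only illustrate how a variance (uncertainty) estimate enters the loss.

    # Generic illustration of replacing MSE with a Gaussian NLL so the model
    # predicts both a severity score and its uncertainty (not the authors' code).
    import torch
    import torch.nn as nn

    nll = nn.GaussianNLLLoss()                # expects mean, target, variance
    mse = nn.MSELoss()

    pred_mean = torch.tensor([12.0, 7.5])     # predicted depression-severity scores
    pred_var = torch.tensor([4.0, 0.5])       # predicted variances (uncertainty)
    target = torch.tensor([10.0, 8.0])

    print("MSE:", mse(pred_mean, target).item())
    print("NLL:", nll(pred_mean, target, pred_var).item())
    # A high-variance (low-confidence) prediction is penalized less sharply by the NLL,
    # which is the intuition behind down-weighting low-confidence experts in an MoE.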

17 pages, 852 KiB  
Review
A Review of Multimodal Interaction in Remote Education: Technologies, Applications, and Challenges
by Yangmei Xie, Liuyi Yang, Miao Zhang, Sinan Chen and Jialong Li
Appl. Sci. 2025, 15(7), 3937; https://doi.org/10.3390/app15073937 - 3 Apr 2025
Cited by 1 | Viewed by 1830
Abstract
Multimodal interaction technology has become a key aspect of remote education, enriching student engagement and learning outcomes by using speech, gesture, and visual feedback as complementary sensory channels. This paper reviews the latest advances in multimodal interaction and their use in remote learning environments, offering a multi-layered discussion that addresses different levels of learning and understanding. It surveys the main enabling technologies, such as speech recognition, computer vision, and haptic feedback, that allow learners and learning platforms to exchange information fluidly. In addition, we examine the role of multimodal learning analytics in measuring students' cognitive and emotional states to support personalized feedback and refine instructional strategies. Although multimodal interaction can substantially improve online education, it still faces many issues, such as media synchronization, higher computational demands, physical adaptability, and privacy concerns. These problems demand further research on algorithm optimization, technology access and guidance, and the ethical use of big data. Through an analysis of 25 selected research papers, this systematic review explores key technologies, applications, and challenges in the field and highlights the role of multimodal learning analytics, speech recognition, gesture-based interaction, and haptic feedback in enhancing remote learning.
(This article belongs to the Special Issue Current Status and Perspectives in Human–Computer Interaction)

18 pages, 2018 KiB  
Article
Adapting a Large-Scale Transformer Model to Decode Chicken Vocalizations: A Non-Invasive AI Approach to Poultry Welfare
by Suresh Neethirajan
AI 2025, 6(4), 65; https://doi.org/10.3390/ai6040065 - 25 Mar 2025
Cited by 2 | Viewed by 1314
Abstract
Natural Language Processing (NLP) and advanced acoustic analysis have opened new avenues in animal welfare research by decoding the vocal signals of farm animals. This study explored the feasibility of adapting a large-scale Transformer-based model, OpenAI’s Whisper, originally developed for human speech recognition, to decode chicken vocalizations. Our primary objective was to determine whether Whisper could effectively identify acoustic patterns associated with emotional and physiological states in poultry, thereby enabling real-time, non-invasive welfare assessments. To achieve this, chicken vocal data were recorded under diverse experimental conditions, including healthy versus unhealthy birds, pre-stress versus post-stress scenarios, and quiet versus noisy environments. The audio recordings were processed through Whisper, producing text-like outputs. Although these outputs did not represent literal translations of chicken vocalizations into human language, they exhibited consistent patterns in token sequences and sentiment indicators strongly correlated with recognized poultry stressors and welfare conditions. Sentiment analysis using standard NLP tools (e.g., polarity scoring) identified notable shifts in “negative” and “positive” scores that corresponded closely with documented changes in vocal intensity associated with stress events and altered physiological states. Despite the inherent domain mismatch—given Whisper’s original training on human speech—the findings clearly demonstrate the model’s capability to reliably capture acoustic features significant to poultry welfare. Recognizing the limitations associated with applying English-oriented sentiment tools, this study proposes future multimodal validation frameworks incorporating physiological sensors and behavioral observations to further strengthen biological interpretability. To our knowledge, this work provides the first demonstration that Transformer-based architectures, even without species-specific fine-tuning, can effectively encode meaningful acoustic patterns from animal vocalizations, highlighting their transformative potential for advancing productivity, sustainability, and welfare practices in precision poultry farming.
(This article belongs to the Special Issue Artificial Intelligence in Agriculture)
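The two off-the-shelf steps described (Whisper transcription followed by polarity scoring) look roughly like the sketch below; the model size, file path, and the use of VADER are assumptions, since the paper only refers to standard NLP tools.

    # Rough sketch of the described pipeline: Whisper produces a text-like
    # output, which is then scored for polarity. All choices here are assumptions.
    import whisper
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

    model = whisper.load_model("base")
    result = model.transcribe("chicken_clip.wav")      # placeholder audio path
    tokens = result["text"]

    scores = SentimentIntensityAnalyzer().polarity_scores(tokens)
    print(tokens)
    print(scores)  # e.g. {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...}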