Mathematics · Article · Open Access

7 November 2023

A Neural Network Architecture for Children’s Audio–Visual Emotion Recognition

Child Speech Research Group, Department of Higher Nervous Activity and Psychophysiology, St. Petersburg University, St. Petersburg 199034, Russia
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue Recent Advances in Neural Networks and Applications

Abstract

Detecting and understanding emotions are critical for our daily activities. As emotion recognition (ER) systems develop, we start looking at more difficult cases than just acted adult audio–visual speech. In this work, we investigate the automatic classification of the audio–visual emotional speech of children, which presents several challenges, including the lack of publicly available annotated datasets and the low performance of state-of-the-art audio–visual ER systems. In this paper, we present a new corpus of children’s audio–visual emotional speech that we collected. Then, we propose a neural network solution that improves the utilization of the temporal relationships between the audio and video modalities in the cross-modal fusion for children’s audio–visual emotion recognition. We select a state-of-the-art neural network architecture as a baseline and present several modifications focused on deeper learning of the cross-modal temporal relationships using attention. By conducting experiments with our proposed approach and the selected baseline model, we observe a relative improvement in performance of 2%. Finally, we conclude that focusing more on the cross-modal temporal relationships may be beneficial for building ER systems for child–machine communication and for environments where qualified professionals work with children.

1. Introduction

Emotions play an important role in a person’s life from its very beginning to its end. Understanding emotions is indispensable for people’s daily activities, for organizing adaptive behavior and determining the functional state of the organism, for human–computer interaction (HCI), etc. In order to provide natural and user-adaptable interaction, HCI systems need to recognize a person’s emotions automatically. In the last ten to twenty years, improving speech emotion recognition has been seen as a key factor in improving the performance of HCI systems. While most research has focused on emotion recognition in adult speech [1,2], significantly less research has focused on emotion recognition in children’s speech [3,4]. This is because large corpora of children’s speech, especially audio–visual speech, are still not publicly available, which forces researchers to focus on emotion recognition in adult speech. Nevertheless, children are potentially the largest class of users of most HCI applications, especially in education and entertainment (edutainment) [5]. Therefore, it is important to understand how emotions are expressed by children and whether they can be recognized automatically.
Creating systems for automatic emotion recognition in a person’s speech is not trivial, especially considering the differences in acoustic features across genders [6], age groups [7], languages [6,8], cultures [9], and developmental profiles [10]. For example, in [11], it is reported that the accuracies of speech emotion recognition are “93.3%, 89.4%, and 83.3% for male, female and child utterances respectively”. The lower accuracy of emotion recognition in children’s speech may be due to the fact that children interact with a computer differently than adults, as they are still in the process of learning the linguistic rules of social and conversational interaction. It is highlighted in [12] that the main aim of emotion recognition in conversation (ERC) systems is to correctly identify the emotions in the speakers’ utterances during the conversation. ERC helps to understand the emotions and intentions of users and to develop engaging, interactive, and empathetic HCI systems. The input data for a multimodal ERC system are information from different modalities for each utterance, such as audio–visual speech and facial expressions, and the model leverages these data to generate accurate predictions of emotions for each utterance. In [13], it was found that in the case of audio–visual recognition of emotions in voice, speech (text), and facial expressions, the facial modality provides recognition of 55% of the emotional content, the voice modality provides 38%, and the textual modality provides the remaining 7%. This finding motivates the use of audio–visual speech emotion recognition.
There are few studies on multimodal emotion recognition in children, and even fewer studies have been performed on automatic children’s audio–visual emotion recognition. Due to the small size of the available datasets, the main approach has been to use traditional machine learning (ML) techniques. The authors of [14] mention the following most popular ML-based classifiers: Support Vector Machine, Gaussian Mixture Model, Random Forest, K-Nearest Neighbors, and Artificial Neural Network, with the Support Vector Machine (SVM) classifier being employed in the majority of ML-based affective computing tasks. Recently, there has been a growing focus on automatic methods of emotion recognition in audio–visual speech. This is primarily driven by advancements in machine learning and deep learning [15], the presence of publicly available datasets of emotional audio–visual speech, and the availability of powerful computing resources [16].
Motivated by these developments, in this study, we developed a neural network architecture for children’s audio–visual emotion recognition. We conducted extensive experiments with our architecture on our proprietary dataset of children’s audio–visual speech.
This study offers the following main contributions:
  • An extended description of the dataset of children’s audio–visual emotional speech that we collected, together with a methodology for collecting such datasets, is presented.
  • A neural network solution for audio–visual emotion recognition in children is proposed that improves the utilization of temporal relationships between audio and video modalities in cross-modal fusion implemented through attention.
  • The results of experiments on emotion recognition based on the proposed neural network architecture and the proprietary children’s audio–visual emotional dataset are presented.
The subsequent sections of this paper are organized as follows. We analyze common datasets and algorithms for multimodal children’s emotion recognition in Section 2. In Section 3, we present a description of the dataset we collected specifically for our purposes. We demonstrate the algorithms and the model we propose for solving the problem in Section 4. In Section 5, we describe the experiments with our data and algorithms; and in Section 6, we present the results of the experiments. Lastly, Section 7 summarizes the contributions of this article and formulates the directions for future research on multimodal children’s emotion recognition.

3. Corpus Description

To study children’s audio–visual emotion recognition, an audio–visual emotional corpus was collected. The corpus contains video files with emotional speech and facial expressions of Russian-speaking children.

3.1. Place and Equipment for Audio–Visual Speech Recording

The recording sessions were held in a laboratory environment without soundproofing and with regular noise levels. A Marantz PMD660 digital recorder (Marantz Professional, inMusic, Inc., Sagamihara, Japan) with a Sennheiser e835S external microphone (Sennheiser electronic GmbH & Co. KG, Beijing, China) was used to capture a 48 kHz mono audio signal, and a Sony HDR-CX560 video camera was used to record the child’s face from a distance of one meter at 1080p resolution and 50 frames per second. During testing, the child sat at the table opposite the experimenter. The light level was constant throughout the recording session.

3.2. The Audio–Visual Speech Recording Procedure

Recording of the children’s speech and facial expressions was carried out while testing the children according to the Child’s Emotional Development Method [60], which includes two blocks. Block 1 contains information about the child’s development obtained from parents/legal representatives. Block 2 includes tests and tasks aimed at evaluating the expression of emotions in the child’s behavior, speech, and facial expressions, as well as the child’s ability to perceive the emotional states of others. Each session lasted between 60 and 90 min.
Participants in this study were 30 children aged 5–11 years.
The criteria to include children in this study were:
  • The consent of the parent/legal representative and the child to participate in this study.
  • The selected age range.
  • The absence of clinically pronounced mental health problems, according to the medical conclusion.
  • The absence of verified severe visual and hearing impairments.
The parents were consulted about the aim and the procedure of this study before signing the Informed Consent. Also, the parents were asked to describe in writing the current and the overall emotional development of their child.
The experimental study began with a short conversation in order to introduce the experimenter to the child. The child then completed the following tasks: playing with a standard set of toys; cooperative play; an “acting play” task, in which the child was asked to show (depict) the emotions “joy, sadness, neutral (calm state), anger, fear” and to pronounce speech material while manifesting the corresponding emotional state in the voice; and video tests for emotion recognition based on standard pictures containing certain plots.
All procedures were approved by the Health and Human Research Ethics Committee (HHS, IRB 00003875, St. Petersburg State University), and written informed consent was obtained from the parents of each child participant.

3.3. Audio–Visual Speech Data Annotation

Specifically, for training the neural network based on our approach with a 3D CNN, we prepared an annotated dataset that contains relatively short video segments with audio. First, we performed facial landmark detection across the whole video dataset and automatically selected the segments with continuous streams of video frames containing fully visible faces (per the data collection procedure, most of the frames with fully visible faces belong to the child being recorded). Next, we applied speaker diarization and selected the segments in which continuous streams of video frames with fully visible faces overlap with continuous speech. A group of three experts then reviewed the obtained video segments to either annotate them with the emotion expressed by the child, or to annotate a segment with additional timestamps when the child expresses different emotions at different times within the segment. If the face or speech of a non-target person appeared in a recording, the experts rejected the segment. A segment received a label only if all experts agreed on the expressed emotion; otherwise, the segment was rejected. Once the annotation process was complete, the annotations were used to filter the dataset and further categorize the video segments by expressed emotion where appropriate. Finally, we randomly split the segments into subsegments of 30 frames in length, which were then used to train the neural network.
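The final slicing step can be illustrated with a minimal Python sketch. This is our own illustrative code, not the pipeline used in the experiments; note also that in our procedure the subsegments were extracted randomly, whereas the sketch slices them consecutively for brevity.

```python
# A minimal sketch (not the authors' released code) of the final slicing step:
# annotated segments, given as (start_frame, end_frame, label) tuples, are cut
# into non-overlapping 30-frame subsegments used for training.

from typing import List, Tuple

WINDOW = 30  # subsegment length in frames (50 fps video)

def slice_segments(segments: List[Tuple[int, int, str]],
                   window: int = WINDOW) -> List[Tuple[int, int, str]]:
    """Split each annotated segment into consecutive, non-overlapping
    fixed-length subsegments; remainders shorter than `window` are dropped."""
    subsegments = []
    for start, end, label in segments:
        for s in range(start, end - window + 1, window):
            subsegments.append((s, s + window, label))
    return subsegments

# Example: one 95-frame "joy" segment yields three 30-frame subsegments.
print(slice_segments([(100, 195, "joy")]))
# [(100, 130, 'joy'), (130, 160, 'joy'), (160, 190, 'joy')]
```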

4. A Neural Network Architecture Description

To classify children’s emotions, we propose a neural network based on 3D CNN for video processing and 1D CNN for audio processing. To demonstrate the performance of our solution, we took as the baseline the architecture from [58], as that solution has shown a state-of-the-art performance for the target problem. Note, however, that in [58], the authors propose a modality fusion block while utilizing existing approaches for video and audio processing to demonstrate the performance of their proposed solution for several machine learning problems, including emotion detection. Similarly, in this manuscript we do not discuss in detail the underlying models and refer the reader to the original article [58]. Our goal here is to demonstrate that, by optimizing the attention component of the model to the particularities of the source data, we can improve the performance of the emotion classification for children’s speech.
Per the research on children’s speech, some of which is reviewed in Section 1, the temporal alignment of the video and audio modalities is highly informative for detecting emotions in children’s speech. Furthermore, research seems to indicate that this temporal alignment may depend not only on the psychophysiological characteristics of children in general, but may also differ between typically and atypically developing children and, moreover, between different types of atypical development. This naturally suggests that increasing the focus on, and the granularity of, modeling the inter-modal temporal relationships may improve a model’s performance. To address this problem, we propose a modification of the cross-attention fusion module introduced in [58], followed by a classifier inspired by [59] and based on the application of “Squeeze-and-Excitation”-like attention [55] to the feature maps of the final layer for classification. This preserves more spatial relationships than the traditional approach of flattening the feature maps and attaching a fully connected network.
For a comparison between the baseline architecture and the architecture suggested in this paper, see Figure 1.
Figure 1. An overview of the baseline architecture (a) [58], where MSAF refers to the Multimodal Split Attention Fusion proposed therein, and the suggested architecture (b). The blocks highlighted in green signify the implementations of multimodal fusion on top of the base models for video and audio processing.
Let us underscore several differences between the proposed and the baseline models. First, in this paper, we present a different implementation of the fusion block, in which the fusion is performed within a window and uses the query-key-value approach to calculate attention. Second, in the baseline model, the fusion block is placed at two locations, while in our model, we found that a single block is sufficient. However, it is important to highlight that neither we nor the authors of the baseline model require a specific placement of the fusion block. Both consider the fusion block as a black box or, in a sense, a layer that can be placed at arbitrary positions an arbitrary number of times, depending on various circumstances such as the choice of the baseline models for video and audio processing. Third, in our work, we propose a different approach to classification. Instead of the traditional flattening of feature maps followed by a dense layer, we deploy an attention layer that transforms the feature maps into class maps matching the number of target classes.

4.1. An Algorithm for Multimodal Attention Fusion

Following [58], we do not assume a specific placement of the attention block in the architecture; essentially, we only consider the attention block in the context of a neural network architecture as a black box with feature maps in and feature maps out. Briefly (for a more detailed explanation we direct the reader to [58]), the cross-attention fusion module for the video and audio modalities takes feature maps $F = \{F_v, F_a\}$ as an input, where $F_v$ are the feature maps for the video modality and $F_a$ are the feature maps for the audio modality, and produces modified feature maps $F' = \{F'_v, F'_a\}$, with the goal of enhancing the quality of the representations of each modality by attending to them according to the information learned from the other modality. As a side note, here we do not make an explicit distinction between sets of feature maps and blocks of sets of feature maps, where the notion of blocks arises from the concept of cardinality in the ResNeXt architecture, which refers to an additional dimension of the data passing through a network. Both our approach and the approach in [58] are essentially agnostic to this distinction, in the sense that both simply operate on vectors containing feature maps. To calculate the modified feature maps, each modality must first be mapped to a space with only a temporal dimension, which for our task simply means that the spatial dimensions of the video modality are collapsed into a scalar by global average pooling. After obtaining the channel descriptors, a so-called global context descriptor has to be formed as the source of the information about cross-modal relationships. Here, we propose the following approach: to capture the more immediate relationships between the modalities, we calculate the query, key, and value [61] in a window of length S over the context vectors $F_v^c$ and $F_a^c$ for the video and audio modalities, respectively (see Figure 2).
Figure 2. Schema for calculating the query, key, and value in a window of length S for the context vectors $F_v^c$ and $F_a^c$ for the video and audio modalities, respectively.
Since this approach originally appeared in the context of natural language processing and is often explained in terms of that field, here we want to provide a brief intuition for applying it in more general terms. In the case of a single modality, the goal is to find relationships between different sections of a feature map of that modality. For additional clarity, when we consider a video, i.e., a modality with both spatial and temporal dimensions, we can consider either the self-attention within a single image, where sections are represented as regions of pixels in the image, or the self-attention within the temporal dimension obtained by collapsing the spatial dimensions of a series of images. The “query, key, and value” approach is agnostic to whichever one we choose.
In this article, we always refer to attention in the temporal dimension. To achieve that, each section is mapped to three different vectors: the “query”, functioning as a request; the “value”, functioning as a response; and the “key”, a map between queries and values. Nevertheless, it is important to understand that attributing a function or role to those vectors serves mostly the purposes of human understanding, while from a purely technical standpoint the procedure is implemented simply by tripling a fully connected layer and then adding another layer that joins the outputs together.
Let us call the learnable transformations for the “query”, “key”, and “value” vectors ($\bar{q}$, $\bar{k}$, and $\bar{v}$) $T_Q$, $T_K$, and $T_V$, respectively. Then, for the context vectors $F_v^c$ and $F_a^c$ for the video and audio modalities, and for their windowed segments $F_v^{c,S_i}$ and $F_a^{c,S_i}$, we calculate:

$$\bar{q} = T_Q\left(F_v^{c,S_i}\right), \qquad \bar{k} = T_K\left(F_a^{c,S_i}\right), \qquad \bar{v} = T_V\left(F_a^{c,S_i}\right)$$
While the dimensions of the value vectors are not required to match the dimensions of the query and key vectors, unless there is a specific reason to choose otherwise, the dimensions most commonly do match, for simplicity. We follow this approach, so $\bar{q}, \bar{k}, \bar{v} \in \mathbb{R}^D$. Strictly speaking, the key vectors do not provide a one-to-one mapping between queries and values; instead, they encapsulate the likelihood, or the strength, of the relationship between each query and each value. Also, since we consider each segment of each windowed context vector to be independent, we are only interested in the relative likelihood, which, following the common approach, we implement using $\mathrm{softmax}$.
So, for each query $\bar{q}_l$, we calculate:

$$\mathrm{softmax}\left(\bar{q}_l \cdot \bar{k}_m\right) \quad \text{for each key } \bar{k}_m, \qquad l, m \in \{1, \dots, M\},$$

or, in matrix form:

$$\mathrm{softmax}\left([\bar{q}_1, \dots, \bar{q}_M]\,[\bar{k}_1, \dots, \bar{k}_M]^T\right).$$
This result, in some sense, is a heatmap, showing the strength of the relationships between queries and values.
Now, at this point, we still have to construct a function that takes this heatmap and the values and produces a new set of feature maps. While in principle this function can also be learned, it has been shown that a simple weighted average provides a good balance between the performance and the computational resources required, since it can, again, be calculated as a straightforward matrix multiplication.
Summarizing the algorithm, we can present the equation for joining the outputs (the attention) as:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

where $\sqrt{d_k}$ is a simple scaling factor, with $d_k$ being the dimension of the key vectors.
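For illustration, a minimal PyTorch sketch of this scaled dot-product attention (our own code, not taken from [58] or [61]) is:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q: torch.Tensor,
                                 k: torch.Tensor,
                                 v: torch.Tensor) -> torch.Tensor:
    """q, k, v: tensors of shape (..., M, D). Returns the attended values
    softmax(q k^T / sqrt(D)) v, of shape (..., M, D)."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # (..., M, M) "heatmap"
    weights = F.softmax(scores, dim=-1)             # each row sums to one
    return weights @ v                              # weighted average of the values
```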
As for the learnable transformations of the query, key, and value for multiple modalities, in our case we obtain them via a projection of the windowed segments of the context vectors $v_s$ and $a_s$ for the video and audio modalities, respectively, with learnable parameters $w_q$, $w_k$, and $w_v$:

$$q = w_q v_s, \qquad k = w_k a_s, \qquad v = w_v a_s$$
After obtaining the attention maps (4), we can calculate the new feature maps:

$$F' = \left\{F'_V, F'_A\right\} = \left\{F_V \cdot A_V,\; F_A \cdot A_A\right\},$$

where $A_V$ and $A_A$ denote the attention maps for the video and audio modalities, respectively.
Here, just as we do not distinguish between feature maps and sets of feature maps, we can also view our suggested windowed attention as adding another dimension to a collection of feature maps, which we can simply flatten when necessary, e.g., when passing the maps to a classifier.
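To make the procedure above concrete, the following is a minimal, self-contained PyTorch sketch of a windowed cross-modal attention fusion block in the spirit of this section. It is our own illustrative code, not the implementation used in our experiments: the tensor shapes, the window length S, the use of a sigmoid gate to rescale the feature maps, and the application of a single (video-queries-audio) attention direction to both modalities are simplifying assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WindowedCrossModalFusion(nn.Module):
    """Illustrative sketch: attend audio-derived values to video-derived queries
    within temporal windows of length S, then rescale the audio/video feature maps."""

    def __init__(self, channels: int, window: int = 5):
        super().__init__()
        self.window = window
        # learnable query/key/value projections (a "tripled" fully connected layer)
        self.w_q = nn.Linear(channels, channels)
        self.w_k = nn.Linear(channels, channels)
        self.w_v = nn.Linear(channels, channels)

    def forward(self, f_video: torch.Tensor, f_audio: torch.Tensor):
        # f_video: (B, C, T, H, W) from the 3D CNN, f_audio: (B, C, T) from the 1D CNN
        B, C, T = f_audio.shape
        # 1. collapse spatial dimensions -> temporal context vectors of shape (B, T, C)
        v_ctx = f_video.mean(dim=(-2, -1)).transpose(1, 2)
        a_ctx = f_audio.transpose(1, 2)
        # 2. split the temporal axis into windows of length S (assume T % S == 0)
        S = self.window
        v_win = v_ctx.reshape(B, T // S, S, C)
        a_win = a_ctx.reshape(B, T // S, S, C)
        # 3. query from video, key/value from audio, attention within each window
        q, k, v = self.w_q(v_win), self.w_k(a_win), self.w_v(a_win)
        scores = q @ k.transpose(-2, -1) / C ** 0.5
        attn = F.softmax(scores, dim=-1) @ v            # (B, T//S, S, C)
        attn = attn.reshape(B, T, C)                    # flatten the window dimension
        # 4. use a sigmoid-gated attention output to rescale the original feature maps
        gate = torch.sigmoid(attn).transpose(1, 2)      # (B, C, T)
        f_audio_new = f_audio * gate
        f_video_new = f_video * gate[..., None, None]
        return f_video_new, f_audio_new
```

In a full model, a symmetric block with queries derived from the audio context would produce the second attention map $A_A$ analogously.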

4.2. An Algorithm for Feature-Map-Based Classification

Regarding the classifier, inspired by the concept of class activation maps in [59], we first propose the following intuition: with N feature maps at the final layer, our goal is to obtain C feature maps, each representing one of the categories we are attempting to detect. To realize this transformation, we propose to apply the “Squeeze-and-Excitation”-type attention [55] C times, each time with different learnable parameters, assuming that this procedure allows the model to learn, separately for each target class, the relationships between the low-level feature descriptors represented by the feature maps of the final layer. This way, after applying $\mathrm{softmax}$ to the globally average-pooled class maps, we expect to obtain a probability distribution over the target classes.
Compared to [55], we omit the initial transformation step for the feature maps, as we assume that the feature maps at the final layer already represent low-level features and do not require additional transformations of spatial information. So, for each of the C class maps, we perform global average pooling, followed by the excitation operation (see [55], Section 3.2):
$$s = \sigma\left(W_2\,\delta\left(W_1 z\right)\right),$$
where $\sigma$ is the sigmoid function, $\delta$ is the ReLU, $z$ is the globally average-pooled descriptor, $W_1$ and $W_2$ are learnable parameters that also implement a bottleneck with a dimensionality reduction–expansion hyperparameter, and $s$ is the vector further used to scale the channels of the feature volume, $\hat{F}_i = F_i \cdot s_i$.
The final output of the model is then:
$$R = \mathrm{softmax}\left(\mathrm{GAP}\left(\hat{F}_{1 \dots C}\right)\right).$$
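The following minimal PyTorch sketch illustrates one possible reading of this class-map head. It is our own illustrative code rather than the exact implementation used in the experiments; in particular, the bottleneck reduction ratio and the interpretation of GAP over each re-weighted feature volume are assumptions.

```python
import torch
import torch.nn as nn

class ClassMapHead(nn.Module):
    """Illustrative sketch of the class-map classifier: one Squeeze-and-Excitation
    block per target class re-weights the final feature maps, and the globally
    average-pooled, re-weighted volume yields one logit per class."""

    def __init__(self, n_channels: int, n_classes: int, reduction: int = 4):
        super().__init__()
        # one excitation (bottleneck MLP with sigmoid) per target class
        self.excite = nn.ModuleList([
            nn.Sequential(
                nn.Linear(n_channels, n_channels // reduction),
                nn.ReLU(inplace=True),
                nn.Linear(n_channels // reduction, n_channels),
                nn.Sigmoid(),
            )
            for _ in range(n_classes)
        ])

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, N, ...) final feature maps with trailing spatial/temporal dims
        B, N = feats.shape[:2]
        z = feats.flatten(2).mean(dim=2)                  # squeeze: (B, N)
        logits = []
        for excite in self.excite:                        # one excitation per class
            s = excite(z)                                 # (B, N) channel scales
            scaled = feats * s.view(B, N, *([1] * (feats.dim() - 2)))
            logits.append(scaled.flatten(1).mean(dim=1))  # GAP over the class map
        return torch.softmax(torch.stack(logits, dim=1), dim=1)  # (B, C)
```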

5. Experimental Setup

5.1. Software Implementation

To achieve higher efficiency in conducting the experiments, we created a software environment based on Docker Engine 24.0 (www.docker.com, accessed on 28 October 2023). The aim of this framework was to simplify running the experiments on different machines, conducting ablation studies, and experimenting with image and audio processing models. We employed PyTorch 2.0 (https://pytorch.org, accessed on 28 October 2023) to implement the components of our model, and we followed the SOLID [62] approach to software development to simplify reconfiguration of the model. We then created Docker configuration scripts that dynamically pull the necessary source code and downloadable resources, such as base models, set up the execution environment and external libraries, and run the experiments. We ran the experiments on two machines with NVIDIA GeForce RTX 3090 Ti GPUs.

5.2. Fine-Tuning

Similar to [58], we used the baseline models trained on the Ryerson Audio–Visual Database of Emotional Speech and Song (RAVDESS) [63], and we further fine-tuned the models with samples from our proprietary dataset of children’s emotional speech.
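As an illustration only, fine-tuning a pretrained model in PyTorch might look like the sketch below; the model class, checkpoint file, and data loader are hypothetical placeholders, and the optimizer settings and epoch count are assumptions rather than the values used in our experiments.

```python
import torch
import torch.nn as nn

# Hypothetical names: AudioVisualEmotionNet, ravdess_pretrained.pt, and
# child_speech_loader stand in for the actual model, checkpoint, and data loader.
model = AudioVisualEmotionNet(n_classes=5)  # joy, sadness, neutral, anger, fear
model.load_state_dict(torch.load("ravdess_pretrained.pt"), strict=False)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # assumed settings
criterion = nn.CrossEntropyLoss()

model.train()
for epoch in range(10):                                    # assumed epoch count
    for video, audio, labels in child_speech_loader:       # 30-frame subsegments
        optimizer.zero_grad()
        loss = criterion(model(video, audio), labels)
        loss.backward()
        optimizer.step()
```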

5.3. Performance Measures

For evaluation of the results of the experiments, we selected several common metrics often used for similar tasks. First of all, we collected the multiclass recognition results into confusion matrices and calculated the true positive (TP), true negative (TN), false positive (FP), and false negative (FN) metrics.
Then, we calculated the accuracy, precision, and recall as
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN},$$

$$\mathrm{Precision} = \frac{TP}{TP + FP},$$

$$\mathrm{Recall} = \frac{TP}{TP + FN},$$
respectively.
Additionally, we calculated the F1-scores per class as
$$F1_{\mathrm{class}} = \frac{2 \times \mathrm{Precision}_{\mathrm{class}} \times \mathrm{Recall}_{\mathrm{class}}}{\mathrm{Precision}_{\mathrm{class}} + \mathrm{Recall}_{\mathrm{class}}}.$$
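As a worked example, these per-class scores can be computed directly from a multiclass confusion matrix; the sketch below is our own illustrative code and uses a made-up 3×3 matrix purely for demonstration.

```python
import numpy as np

# Rows: true classes, columns: predicted classes (made-up numbers for illustration).
cm = np.array([[50,  5,  5],
               [ 4, 40,  6],
               [ 6,  4, 30]])

tp = np.diag(cm)                       # true positives per class
fp = cm.sum(axis=0) - tp               # false positives per class
fn = cm.sum(axis=1) - tp               # false negatives per class
tn = cm.sum() - (tp + fp + fn)         # true negatives per class

accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1        = 2 * precision * recall / (precision + recall)

print("per-class accuracy :", np.round(accuracy, 3))
print("per-class precision:", np.round(precision, 3))
print("per-class recall   :", np.round(recall, 3))
print("per-class F1       :", np.round(f1, 3))
```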

6. Experimental Results

From the corpus of child speech, we selected 205 recorded sessions and, after processing them as described in Section 3.3, obtained 721 video segments of variable length, each annotated with an expressed emotion. Due to the relatively small volume of data, we randomly extracted non-intersecting 30-frame segments while ensuring balance between the classes, repeated the process six times, and averaged the results. For each batch, we performed k-fold cross-validation with 80% of the samples used for training and 20% for testing.
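The evaluation loop can be sketched as follows. This is illustrative code only: a 5-fold split is assumed here to match the 80%/20% proportions, and all_samples and train_and_evaluate are hypothetical placeholders for our data and training routine.

```python
import numpy as np
from sklearn.model_selection import KFold

REPETITIONS, FOLDS = 6, 5          # 5 folds -> 80% train / 20% test per fold (assumed)

scores = []
for rep in range(REPETITIONS):
    # `all_samples` is a hypothetical array of class-balanced 30-frame subsegments
    kfold = KFold(n_splits=FOLDS, shuffle=True, random_state=rep)
    for train_idx, test_idx in kfold.split(all_samples):
        # `train_and_evaluate` is a hypothetical routine returning, e.g., an F1-score
        scores.append(train_and_evaluate(all_samples[train_idx], all_samples[test_idx]))

print("averaged score:", np.mean(scores))
```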
In addition, we conducted an ablation study where we tested the fusion block separately from the classifier.
The results of automatic emotion recognition are presented in Figure 3, and Table 2 and Table 3.
Figure 3. Confusion matrices for both the fusion block and the classifier (a) and for the fusion block only (b). The color shade visualizes the cell value with a darker shade corresponding to a higher value.
Table 2. Per-class scores in multiclass classification.
Table 3. Average scores in multiclass classifications.
Compared with the performance of the state-of-the-art (baseline) model at 0.482, our proposed approach demonstrates a relative improvement in performance by approximately 2%.

7. Discussion and Conclusions

We propose the hypothesis that, by focusing more on the temporal relationships between different modalities in multimodal automatic emotion recognition in children, we can achieve improvements in performance. Due to the complexity of the problem, one can find a wide variety of approaches and models in the modern scientific literature. To test our hypothesis, we selected several common and popular approaches that demonstrate state-of-the-art performance on similar tasks and took them as a baseline. Since it is not viable to test fusion and classification modules in isolation, it is important to minimize the differences with the baseline neural network architecture in order to make sure that the difference in performance between the proposed solution and the baseline model emerges from the implementation of the proposed solution. Unfortunately, in machine learning, even when repeating the same experiment with the same model and data, it is impossible to reproduce exactly the same results. However, we did our best to utilize the same models and mostly the same training data, except for our novel corpus of children’s emotional speech.
As for the implementation of our solution, we focused on the parts of the model responsible for multimodal fusion via attention. To help the model focus more on the temporal relationships between different modalities, we proposed to window the context vectors of the modalities, calculate the attention with the query-key-value approach, and perform modality fusion utilizing the obtained attention maps. Additionally, since this approach focuses on the temporal dimension, we also introduced an approach to classification based on the concept of class activation maps that elevates the attention to the spatial dimensions. However, it is important to highlight that our original hypothesis related only to the temporal dimension, even though we eventually observed a cumulative improvement in performance. We did not explicitly test the hypothesis that the proposed approach to classification works as a universal drop-in replacement; we consider it only as an extension of the proposed fusion module.
By evaluating the results of the experiments, we confirmed with a significant degree of certainty that our solution can improve the performance of automatic children’s audio–visual emotion recognition. The relatively modest improvement of approximately 2% is nevertheless promising, since there is significant space for further improvement. Our goal here was to demonstrate specific optimizations of the fusion and classification components of the network without optimizing the overall network architecture, which means that further fine-tuning of the architecture is possible. Our ongoing work on collecting a large dataset of children’s audio–visual speech provides us with data to substantially improve the fine-tuning of the baseline models and to further take advantage of the proposed solution. In addition, since this work only used samples for which all experts agreed on the expressed emotion, a larger dataset with more “difficult” samples, on which the experts disagree, should, by design, benefit our proposed solution even more. In future research, we plan to focus on collecting more data, particularly for children with atypical development, and on testing our solution on more diverse data. We also want to develop practical tools and applications for people working with children with typical and atypical development, in order to stress-test our solution in a real-time environment.

Author Contributions

Conceptualization, A.M., Y.M. and E.L.; methodology, Y.M.; software, A.M.; validation, O.F., E.L. and Y.M.; formal analysis, O.F. and E.L.; investigation, A.M. and Y.M.; resources, O.F., A.N. and E.L.; data curation, O.F., A.N. and E.L.; writing—original draft preparation, Y.M.; writing—review and editing, A.M., Y.M., O.F. and E.L.; visualization, A.M.; supervision, Y.M.; project administration, E.L.; funding acquisition, E.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was financially supported by the Russian Science Foundation, grant number № 22-45-02007, https://rscf.ru/en/project/22-45-02007/ (accessed on 28 October 2023).

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to restrictions on the public dissemination of this data imposed by the informed consent signed by the parents of the minors whose audio-visual data were used in this research.

Conflicts of Interest

The authors declare that they have no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

References

  1. Schuller, B.W. Speech Emotion Recognition: Two Decades in a Nutshell, Benchmarks, and Ongoing Trends. Commun. ACM 2018, 61, 90–99. [Google Scholar] [CrossRef]
  2. Khalil, R.A.; Jones, E.; Babar, M.I.; Jan, T.; Zafar, M.H.; Alhussain, T. Speech Emotion Recognition Using Deep Learning Techniques: A Review. IEEE Access 2019, 7, 117327–117345. [Google Scholar] [CrossRef]
  3. Lyakso, E.; Ruban, N.; Frolova, O.; Gorodnyi, V.; Matveev, Y. Approbation of a method for studying the reflection of emotional state in children’s speech and pilot psychophysiological experimental data. Int. J. Adv. Trends Comput. Sci. Eng. 2020, 9, 649–656. [Google Scholar] [CrossRef]
  4. Onwujekwe, D. Using Deep Leaning-Based Framework for Child Speech Emotion Recognition. Ph.D. Thesis, Virginia Commonwealth University, Richmond, VA, USA, 2021. Available online: https://scholarscompass.vcu.edu/cgi/viewcontent.cgi?article=7859&context=etd (accessed on 20 March 2023).
  5. Guran, A.-M.; Cojocar, G.-S.; Diosan, L.-S. The Next Generation of Edutainment Applications for Young Children—A Proposal. Mathematics 2022, 10, 645. [Google Scholar] [CrossRef]
  6. Costantini, G.; Parada-Cabaleiro, E.; Casali, D.; Cesarini, V. The Emotion Probe: On the Universality of Cross-Linguistic and Cross-Gender Speech Emotion Recognition via Machine Learning. Sensors 2022, 22, 2461. [Google Scholar] [CrossRef] [PubMed]
  7. Palo, H.K.; Mohanty, M.N.; Chandra, M. Speech Emotion Analysis of Different Age Groups Using Clustering Techniques. Int. J. Inf. Retr. Res. 2018, 8, 69–85. [Google Scholar] [CrossRef]
  8. Tamulevičius, G.; Korvel, G.; Yayak, A.B.; Treigys, P.; Bernatavičienė, J.; Kostek, B. A Study of Cross-Linguistic Speech Emotion Recognition Based on 2D Feature Spaces. Electronics 2020, 9, 1725. [Google Scholar] [CrossRef]
  9. Lyakso, E.; Ruban, N.; Frolova, O.; Mekala, M.A. The children’s emotional speech recognition by adults: Cross-cultural study on Russian and Tamil language. PLoS ONE 2023, 18, e0272837. [Google Scholar] [CrossRef]
  10. Matveev, Y.; Matveev, A.; Frolova, O.; Lyakso, E. Automatic Recognition of the Psychoneurological State of Children: Autism Spectrum Disorders, Down Syndrome, Typical Development. Lect. Notes Comput. Sci. 2021, 12997, 417–425. [Google Scholar] [CrossRef]
  11. Duville, M.M.; Alonso-Valerdi, L.M.; Ibarra-Zarate, D.I. Mexican Emotional Speech Database Based on Semantic, Frequency, Familiarity, Concreteness, and Cultural Shaping of Affective Prosody. Data 2021, 6, 130. [Google Scholar] [CrossRef]
  12. Zou, S.H.; Huang, X.; Shen, X.D.; Liu, H. Improving multimodal fusion with Main Modal Transformer for emotion recognition in conversation. Knowl.-Based Syst. 2022, 258, 109978. [Google Scholar] [CrossRef]
  13. Mehrabian, A.; Ferris, S.R. Inference of attitudes from nonverbal communication in two channels. J. Consult. Psychol. 1967, 31, 248–252. [Google Scholar] [CrossRef] [PubMed]
  14. Afzal, S.; Khan, H.A.; Khan, I.U.; Piran, J.; Lee, J.W. A Comprehensive Survey on Affective Computing; Challenges, Trends, Applications, and Future Directions. arXiv 2023, arXiv:2305.07665v1. [Google Scholar] [CrossRef]
  15. Dresvyanskiy, D.; Ryumina, E.; Kaya, H.; Markitantov, M.; Karpov, A.; Minker, W. End-to-End Modeling and Transfer Learning for Audiovisual Emotion Recognition in-the-Wild. Multimodal Technol. Interact. 2022, 6, 11. [Google Scholar] [CrossRef]
  16. Wang, Y.; Song, W.; Tao, W.; Liotta, A.; Yang, D.; Li, X.; Gao, S.; Sun, Y.; Ge, W.; Zhang, W.; et al. A systematic review on affective computing: Emotion models, databases, and recent advances. Inf. Fusion 2022, 83–84, 19–52. [Google Scholar] [CrossRef]
  17. Haamer, R.E.; Rusadze, E.; Lüsi, I.; Ahmed, T.; Escalera, S.; Anbarjafari, G. Review on Emotion Recognition Databases. In Human-Robot Interaction-Theory and Application; InTechOpen: London, UK, 2018. [Google Scholar] [CrossRef]
  18. Wu, C.; Lin, J.; Wei, W. Survey on audiovisual emotion recognition: Databases, features, and data fusion strategies. APSIPA Trans. Signal Inf. Process. 2014, 3, E12. [Google Scholar] [CrossRef]
  19. Avots, E.; Sapiński, T.; Bachmann, M.; Kamińska, D. Audiovisual emotion recognition in wild. Mach. Vis. Appl. 2019, 30, 975–985. [Google Scholar] [CrossRef]
  20. Karani, R.; Desai, S. Review on Multimodal Fusion Techniques for Human Emotion Recognition. Int. J. Adv. Comput. Sci. Appl. 2022, 13, 287–296. [Google Scholar] [CrossRef]
  21. Poria, S.; Cambria, E.; Bajpai, R.; Hussain, A. A review of affective computing: From unimodal analysis to multimodal fusion. Inf. Fusion 2017, 37, 98–125. [Google Scholar] [CrossRef]
  22. Abbaschian, B.J.; Sierra-Sosa, D.; Elmaghraby, A. Deep Learning Techniques for Speech Emotion Recognition, from Databases to Models. Sensors 2021, 21, 1249. [Google Scholar] [CrossRef]
  23. Schoneveld, L.; Othmani, A.; Abdelkawy, H. Leveraging recent advances in deep learning for audio-visual emotion recognition. Pattern Recognit. Lett. 2021, 146, 1–7. [Google Scholar] [CrossRef]
  24. Ram, C.S.; Ponnusamy, R. Recognising and classify Emotion from the speech of Autism Spectrum Disorder children for Tamil language using Support Vector Machine. Int. J. Appl. Eng. Res. 2014, 9, 25587–25602. [Google Scholar]
  25. Chen, N.F.; Tong, R.; Wee, D.; Lee, P.X.; Ma, B.; Li, H. SingaKids-Mandarin: Speech Corpus of Singaporean Children Speaking Mandarin Chinese. In Proceedings of the 17th Annual Conference of the International Speech Communication Association (INTERSPEECH), San Francisco, CA, USA, 8–12 September 2016; pp. 1545–1549. [Google Scholar] [CrossRef]
  26. Matin, R.; Valles, D. A Speech Emotion Recognition Solution-based on Support Vector Machine for Children with Autism Spectrum Disorder to Help Identify Human Emotions. In Proceedings of the Intermountain Engineering, Technology and Computing (IETC), Orem, UT, USA, 2–3 October 2020; pp. 1–6. [Google Scholar] [CrossRef]
  27. Pérez-Espinosa, H.; Martínez-Miranda, J.; Espinosa-Curiel, I.; Rodríguez-Jacobo, J.; Villaseñor-Pineda, L.; Avila-George, H. IESC-Child: An Interactive Emotional Children’s Speech Corpus. Comput. Speech Lang. 2020, 59, 55–74. [Google Scholar] [CrossRef]
  28. Egger, H.L.; Pine, D.S.; Nelson, E.; Leibenluft, E.; Ernst, M.; Towbin, K.E.; Angold, A. The NIMH Child Emotional Faces Picture Set (NIMH-ChEFS): A new set of children’s facial emotion stimuli. Int. J. Methods Psychiatr. Res. 2011, 20, 145–156. [Google Scholar] [CrossRef]
  29. Kaya, H.; Ali Salah, A.; Karpov, A.; Frolova, O.; Grigorev, A.; Lyakso, E. Emotion, age, and gender classification in children’s speech by humans and machines. Comput. Speech Lang. 2017, 46, 268–283. [Google Scholar] [CrossRef]
  30. Matveev, Y.; Matveev, A.; Frolova, O.; Lyakso, E.; Ruban, N. Automatic Speech Emotion Recognition of Younger School Age Children. Mathematics 2022, 10, 2373. [Google Scholar] [CrossRef]
  31. Rathod, M.; Dalvi, C.; Kaur, K.; Patil, S.; Gite, S.; Kamat, P.; Kotecha, K.; Abraham, A.; Gabralla, L.A. Kids’ Emotion Recognition Using Various Deep-Learning Models with Explainable AI. Sensors 2022, 22, 8066. [Google Scholar] [CrossRef]
  32. Sousa, A.; d’Aquin, M.; Zarrouk, M.; Hollowa, J. Person-Independent Multimodal Emotion Detection for Children with High-Functioning Autism. CEUR Workshop Proceedings. 2020. Available online: https://ceur-ws.org/Vol-2760/paper3.pdf (accessed on 28 October 2023).
  33. Ahmed, B.; Ballard, K.J.; Burnham, D.; Sirojan, T.; Mehmood, H.; Estival, D.; Baker, E.; Cox, F.; Arciuli, J.; Benders, T.; et al. AusKidTalk: An Auditory-Visual Corpus of 3- to 12-Year-Old Australian Children’s Speech. In Proceedings of the 22nd Annual Conference of the International Speech Communication Association (INTERSPEECH), Brno, Czech Republic, 30 August–3 September 2021; pp. 3680–3684. [Google Scholar] [CrossRef]
  34. Kossaifi, J.; Tzimiropoulos, G.; Todorovic, S.; Pantic, M. AFEW-VA database for valence and arousal estimation in-the-wild. Image Vis. Comput. 2017, 65, 23–36. [Google Scholar] [CrossRef]
  35. Black, M.; Chang, J.; Narayanan, S. An Empirical Analysis of User Uncertainty in Problem-Solving Child-Machine Interactions. In Proceedings of the 1st Workshop on Child, Computer, and Interaction Chania (WOCCI), Crete, Greece, 23 October 2008; paper 01. Available online: https://www.isca-speech.org/archive/pdfs/wocci_2008/black08_wocci.pdf (accessed on 28 October 2023).
  36. Nojavanasghari, B.; Baltrušaitis, T.; Hughes, C.; Morency, L. EmoReact: A Multimodal Approach and Dataset for Recognizing Emotional Responses in Children. In Proceedings of the 18th ACM International Conference on Multimodal Interaction (ICMI), Tokyo, Japan, 12–16 November 2016; pp. 137–144. [Google Scholar] [CrossRef]
  37. Li, Y.; Tao, J.; Chao, L.; Bao, W.; Liu, Y. CHEAVD: A Chinese natural emotional audio–visual database. J. Ambient. Intell. Humaniz. Comput. 2017, 8, 913–924. [Google Scholar] [CrossRef]
  38. Filntisis, P.; Efthymiou, N.; Potamianos, G.; Maragos, P. An Audiovisual Child Emotion Recognition System for Child-Robot Interaction Applications. In Proceedings of the 29th European Signal Processing Conference (EUSIPCO), Dublin, Ireland, 23–27 August 2021; pp. 791–795. [Google Scholar] [CrossRef]
  39. Chiara, Z.; Calabrese, B.; Cannataro, M. Emotion Mining: From Unimodal to Multimodal Approaches. Lect. Notes Comput. Sci. 2021, 12339, 143–158. [Google Scholar] [CrossRef]
  40. Bengio, Y.; Courville, A.; Vincent, P. Representation Learning: A Review and New Perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 8, 1798–1828. [Google Scholar] [CrossRef] [PubMed]
  41. Burkov, A. The Hundred-Page Machine Learning Book; Andriy Burkov: Quebec City, QC, Canada, 2019; 141p. [Google Scholar]
  42. Egele, R.; Chang, T.; Sun, Y.; Vishwanath, V.; Balaprakash, P. Parallel Multi-Objective Hyperparameter Optimization with Uniform Normalization and Bounded Objectives. arXiv 2023, arXiv:2309.14936. [Google Scholar] [CrossRef]
  43. Glasmachers, T. Limits of End-to-End Learning. In Proceedings of the Asian Conference on Machine Learning (ACML), Seoul, Republic of Korea, 15–17 November 2017; pp. 17–32. Available online: https://proceedings.mlr.press/v77/glasmachers17a/glasmachers17a.pdf (accessed on 28 October 2023).
  44. Chen, S.; Wang, C.; Chen, Z.; Wu, Y.; Liu, S.; Chen, Z.; Li, J.; Kanda, N.; Yoshioka, T.; Xiao, X.; et al. WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing. IEEE J. Sel. Top. Signal Process. 2022, 16, 1505–1518. [Google Scholar] [CrossRef]
  45. Alexeev, A.; Matveev, Y.; Matveev, A.; Pavlenko, D. Residual Learning for FC Kernels of Convolutional Network. Lect. Notes Comput. Sci. 2019, 11728, 361–372. [Google Scholar] [CrossRef]
  46. Fischer, P.; Dosovitskiy, A.; Ilg, E.; Häusser, P.; Hazırbaş, C.; Golkov, V.; van der Smagt, P.; Cremers, D.; Brox, T. FlowNet: Learning Optical Flow with Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile; 2015; pp. 2758–2766. [Google Scholar] [CrossRef]
  47. Patil, P.; Pawar, V.; Pawar, Y.; Pisal, S. Video Content Classification using Deep Learning. arXiv 2021, arXiv:2111.13813. [Google Scholar] [CrossRef]
  48. Hara, K.; Kataoka, H.; Satoh, Y. Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 6546–6555. [Google Scholar] [CrossRef]
  49. Ordóñez, F.J.; Roggen, D. Deep Convolutional and LSTM Recurrent Neural Networks for Multimodal Wearable Activity Recognition. Sensors 2016, 16, 115. [Google Scholar] [CrossRef]
  50. Mnih, V.; Heess, N.; Graves, A.; Kavukcuoglu, K. Recurrent Models of Visual Attention. In Proceedings of the 27th International Conference on Neural Information Processing Systems (NIPS), Montreal, QC, Canada, 8–13 December 2014; Volume 2, pp. 2204–2212. Available online: https://proceedings.neurips.cc/paper_files/paper/2014/file/09c6c3783b4a70054da74f2538ed47c6-Paper.pdf (accessed on 28 October 2023).
  51. Hafiz, A.M.; Parah, S.A.; Bhat, R.U.A. Attention mechanisms and deep learning for machine vision: A survey of the state of the art. arXiv 2021, arXiv:2106.07550. [Google Scholar] [CrossRef]
  52. Bertasius, G.; Wang, H.; Torresani, L. Is Space-Time Attention All You Need for Video Understanding? arXiv 2021, arXiv:2102.05095. [Google Scholar] [CrossRef]
  53. Wei, X.; Zhang, T.; Li, Y.; Zhang, Y.; Wu, F. Multi-Modality Cross Attention Network for Image and Sentence Matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10938–10947. [Google Scholar] [CrossRef]
  54. Woo, S.; Park, J.; Lee, J.-L.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the 15th European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; Part VII; pp. 3–19. [Google Scholar] [CrossRef]
  55. Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar] [CrossRef]
  56. Bello, I.; Zoph, B.; Le, Q.; Vaswani, A.; Shlens, J. Attention Augmented Convolutional Networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3285–3294. [Google Scholar] [CrossRef]
  57. Krishna, D.N.; Patil, A. Multimodal Emotion Recognition Using Cross-Modal Attention and 1D Convolutional Neural Networks. In Proceedings of the 21st Annual Conference of the International Speech Communication Association (INTERSPEECH), Shanghai, China, 25–29 October 2020; pp. 4243–4247. [Google Scholar] [CrossRef]
  58. Lang, S.; Hu, C.; Li, G.; Cao, D. MSAF: Multimodal Split Attention Fusion. arXiv 2021, arXiv:2012.07175. [Google Scholar] [CrossRef]
  59. Zhou, B.; Khosla, A.; Lapedriza, A.; Oliva, A.; Torralba, A. Learning Deep Features for Discriminative Localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 2921–2929. [Google Scholar] [CrossRef]
  60. Lyakso, E.; Frolova, O.; Kleshnev, E.; Ruban, N.; Mekala, A.M.; Arulalan, K.V. Approbation of the Child’s Emotional Development Method (CEDM). In Proceedings of the Companion Publication of the 2022 International Conference on Multimodal Interaction (ICMI), Bengaluru, India, 7–11 November 2022; pp. 201–210. [Google Scholar] [CrossRef]
  61. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2017, arXiv:1706.03762. [Google Scholar] [CrossRef]
  62. Martin, R.C. Agile Software Development: Principles, Patterns, and Practices; Alan Apt Series; Pearson Education: London, UK, 2003. [Google Scholar]
  63. Livingstone, S.; Russo, F. The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English. PLoS ONE 2018, 13, e0196391. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
