1. Introduction
Emotional footprints describe the enduring emotional influence a person leaves on others through their interactions (intrinsically, the idea or thought their emotions create in a given environment), while emotional intensity expresses the depth of an individual’s emotional experiences, designating how strongly they feel a particular emotion. An emotional footprint is defined as the lasting impression that the emotions and actions of an individual or group leave on others, similar to a physical footprint. Emotional intensity refers to the strength or degree of an emotion experienced. Emotional footprints are based on several criteria, such as strength, duration, physiological arousal (heart rate, voice tone), and impact on the conversation and its participants. Emotional intensity likewise depends on several criteria, namely the type of emotion (positive or negative) and its influence on the other conversant. Measurement methods for the emotional footprint include self-reporting, observer ratings, textual analysis and physiological data. Measurement methods for emotional intensity include follow-up surveys, content analysis and network analysis. Actual conversation data are gathered first, after which emotional categories (anger, joy, fear) are determined. Although the terms emotional footprint and emotional intensity are sometimes used interchangeably, the emotional footprint aids in analyzing how somebody’s emotions influence others around them, whereas emotional intensity expresses how strongly somebody perceives a specific emotion.
A method called Emotional Voice Conversion with Emotion Intensity Control (EMOVOX) was proposed in [1], with the intent of explicitly characterizing and controlling emotion intensity. Here, the speaker style was extracted from linguistic content, following which the speaker style was encoded in a continuous space, thereby forming an emotion-embedding prototype. Moreover, to ensure emotional intelligibility, an emotion classification loss and an emotion-embedding similarity loss were also included, controlling fine-grained emotion intensity in the output speech and thereby reducing the emotion classification loss and improving the emotion-embedding similarity loss.
Despite the minimization of emotion classification loss and improving emotion-embedding similarity loss, the training time and the accuracy involved were not analyzed. To focus on these two aspects, an Attention Bidirectional Recurrent Neural Network model is designed by using bidirectional RNN, which leads to improved accuracy with minimal training time by taking into consideration both the past and future of a sequence involved in conversation.
A multi-task deep Cross-Attention Network (MTCANet) that simultaneously performs Key Word Spotting (KWS) and Speaker Verification (SV), while efficiently using information pertaining to both tasks, was proposed in [2]. The method combined a KWS sub-network with an SV sub-network to improve overall performance in demanding circumstances, including short-duration speech and noisy environments.
Intrinsically, the method included three modules: a novel deep cross-attention (DCA) module for combining the KWS and SV tasks, a multi-layer stacked shared encoder (SE) to minimize the influence of noise on the recognition rate and, finally, soft attention (SA) to concentrate on relevant information while circumventing gradient vanishing, with a minimal error rate and improved accuracy. Although accuracy improved and the error rate fell, the precision rate was not analyzed. To focus on this aspect, a Zero-Shot Learning-based classifier is introduced that achieves this objective with the aid of semantic features.
With online learning, it is difficult for teachers to identify the emotional state of students in order to adjust their pace and attention to each learner. A machine learning model called TTNet was employed in [3] for gathering facial data and examining the emotions of each learner in the online classroom, but the accuracy was not enhanced. Humans are able to recognize each other’s emotions via subtle facial expressions, body movements, the way they speak, or unambiguously by the tone of voice. Computationally speaking, numerous methods are being designed for automating the analysis of information from media sources to determine the emotions expressed by users. An ensemble strategy was applied in [4] to classify the emotion observed from speech with the intent of improving overall classification accuracy.
Yet another method focusing on accuracy using a deep neural network model was designed in [5] by considering five different emotions: joy, shame, guilt, sadness and fear. However, certain drawbacks were observed with keyword and lexicon-based methods as they concentrate on semantic relations. To address this issue, a hybrid machine and deep learning method to identify emotions in text was proposed in [6]. These hybrid learning methods resulted in an improvement in overall accuracy. Beyond accuracy, the loss factors involved in analysis were the focus of [7], which employed a convolutional neural network (CNN) with long short-term memory (LSTM).
Research into speech-based emotion recognition in clinical applications has been continuously increasing over the past few years. To this end, a novel algorithm that takes the edge off the feature extraction mechanism and attributes an importance level of selected neurons by using deterministic edge/node embeddings with attention scores was presented in [
8], therefore improving accuracy in an extensive manner.
The challenges and opportunities involved in the automatic detection of expressed emotion from speech were investigated in [9] to focus on inter- and intra-reliability aspects. The challenges and progress involved in deep speaker recognition were presented in [10]. However, in the task of emotion inference, the most prevalent issue is the lack of common-sense knowledge, specifically in the context of dialogue, where conventional research has failed to efficiently extract structural features, thereby reducing the accuracy of emotion inference. To address this problem, a dialogue emotion inference method employing Common Sense Enhancement and a Graph Model was proposed in [11].
A systematic review of speech synthesis employing deep learning was investigated in [
12]. Yet another comparative study of recognizing emotional footprints using machine learning was presented in [
13]. A triangulation method for extracting a novel set of geometric features with the intent of classifying six different emotional expressions, namely, anger, fear, disgust, surprise, happiness and sadness, employing computer-generated markers was proposed in [
14].
An ensemble of a deep convolutional and recurrent neural network was designed in [
15] for emotion recognition in speech to improve the final recognition rate. However, these texts generally contain several hidden emotions that could indirectly contribute significantly to identifying sentiments. The emotion detection problem was addressed in [
16] via a two-stage emotion detection methodology with improved accuracy.
The success of deep learning algorithms motivates us to use a recurrent neural network in the emotional footprint classification process. In this study, we report a new method using the Attention Bidirectional Recurrent Neural Network and a Zero-Shot Learning-based classifier. First, we identify emotional footprints using the Attention Bidirectional Recurrent Neural Network model. Then, we pass all the identified emotional footprints through a Zero-Shot Learning-based classifier for accurate and precise classification of emotions. The experiment was carried out on the Two-Party Conversation with Emotional Footprint and Emotional Intensity dataset. Our experimental work reveals that the proposed method outperforms some previously published results.
1.1. Innovations and Contributions of the Work
To address the above issues, like handling the error rate and measuring the accuracy rate using the two-party conversation dataset recorded at the end of conversation, the contributions of the Attention Bidirectional Recurrent Neural Zero-Shot Semantic Classifier (ABRN-ZSSC) are listed as given below.
To improve the precision, accuracy, error rate and training time, the emotional footprint classification method, ABRN-ZSSC, is designed based on an Attention Bidirectional Recurrent Neural Network employing a Zero-Shot Learning-based classifier.
To achieve a computationally efficient emotional footprint, the Attention Bidirectional RNN takes into consideration both the past and future of a conversation, allowing for a more comprehensive understanding of the corresponding conversation. It includes the Global Conversation State, the Speaker Conversation State and the Speaker Conversation Update State. The Global Conversation State is employed to capture the overall context of all preceding utterances in the dialogue. Next, the Speaker Conversation State is used to keep track of the state of individual speakers using fixed-size vectors throughout the conversation, and the Speaker Conversation Update State is applied to update the individual speaker’s state based on their current utterance. In this way, the training time is reduced.
To enhance the precision and accuracy of classification, the Zero-Shot Learning-based classifier algorithm is employed for identifying the emotional footprint accurately.
To reduce the error rate, the Zero-Shot Learning-based classifier is used for classifying seven different types of emotions via semantic feature mapping. Also, the high correlation results are determined between features and emotional categories via attribute learning.
Finally, a comprehensive experimental assessment is carried out with four performance metrics (precision, accuracy, error rate and training time) to compare the proposed ABRN-ZSSC method with traditional methods.
1.2. Organization of the Work
This study is organized as follows: In Section 2, we lay out our study by introducing the background and related work pertaining to emotional footprint classification. In Section 3, we present the details of our proposed Attention Bidirectional Recurrent Neural Zero-Shot Semantic Classifier (ABRN-ZSSC) method, and we introduce our experiments in Section 4. In Section 5, we report the experimental results, followed by a discussion and conclusion in Section 6.
2. Related Works
Emotional footprint refers to the impression we leave, such as walking through a garden, associating with family members or interacting with the employees of an organization. Hence, an emotional footprint is the consequence of emotional transmission, be it positive or negative. As a leader, the emotions and strategy relating to other people influence the organizational environment. To the same extent that there is heterogeneity in the mechanisms that pollute the earth, there is an abundance of mechanisms a leader can utilize to tune the emotional environment of an organization.
Deep learning-assisted semantic text analysis was proposed in [
17] with the objective of detecting human emotion utilizing big data. A survey of methods concentrating on the recognition of emotions involved during conversations was explored in [
18]. An automatic emotion recognition system employing two-level ensemble classifiers was proposed in [
19] by making accurate differentiation between high valence versus low valence and high arousal versus low arousal.
A new framework employing temporal dynamics by incorporating LSTM was presented in [
20,
21] to improve accuracy. Speaking in a low voice can also result in wrong predictions. To address this gap, emotion analysis based on linguistics using an attention score and softmax function was presented in [
22]. This, in turn, not only ensured accurate prediction but also minimized the errors involved in emotion analysis.
A multi-task network was designed in [
23] for both speaker and command recognition, with minimal execution time. Also, the emotion detection problem as a part of sentiment analysis was presented in [
16] using the zero-shot model with an improved accuracy rate. Affective structural embedding was employed in [
24] using zero-shot learning and emotional categories to design an accurate classification model. Yet another novel mechanism employing extreme learning to reduce complexity involved in unseen labels was presented in [
25]. A systematic review of zero-shot learning and machine learning classifiers was presented in [26] for the abstract screening of emotions.
Nevertheless, fine-grained sentiments might integrate similar emotions with a single primary emotion. Trying to address this issue as a classification task can result in performance improvements; however, it does not generate a better comprehension and representation of language. A fine-grained sentiment analysis mechanism employing neutrosophy was proposed in [
27] via three different membership functions, positive, negative, and neutral, to model an instance into Single-Valued Neutrosophic Sets (SVNSs) with a higher level of precision and accuracy. A review on emotion detection employing deep learning algorithms was presented in [
28]. A novel method for detecting emotion by fine-tuning a distilled zero-shot student model with the objective of classifying emotions in text was presented in [
29].
Our study builds upon the existing body of research in emotional footprint identification within deep learning by specifically concentrating on the application of a zero-shot learning-based classifier model to analyze emotional footprints via semantic features. While previous studies have extensively explored emotional footprint identification using various machine and deep learning techniques, our research distinguishes itself by focusing on the emotional footprint of the members participating in the conversation after the conversation is over and the application of zero-shot learning employing semantic features. The elaborate description of the proposed Attention Bidirectional Recurrent Neural Zero-Shot Semantic Classifier (ABRN-ZSSC) for emotional footprint identification is provided in the following subsections.
3. Materials and Methods
While previous research has studied linguistic content and Key Word Spotting in explaining the variation in emotional footprints, a more comprehensive exploration of individual drivers would benefit the development of effective and equitable mitigation policies. Existing emotion recognition research based on fine-tuning has limitations such as data scarcity and cost, poor generalization to unseen classes and computational expense. To address these issues, zero-shot learning is employed, enabling models to recognize emotions from previously unseen categories by leveraging prior semantic information.
Table 1 compares the advantages of zero-shot learning and fine-tuning.
This section presents our proposed Attention Bidirectional Recurrent Neural Zero-Shot Semantic Classifier (ABRN-ZSSC) method for emotional footprint identification from a fine-tuned Zero-Shot Learning-based classifier. The proposed method, ABRN-ZSSC, is depicted in
Figure 1.
As shown in the above figure, the proposed ABRN-ZSSC method is split into two sections. The data obtained from the Two-Party Conversation, with the Emotional Footprint and Emotional Intensity datasets as input, are first subjected to the Attention Bidirectional Recurrent Neural Network model for emotional footprint identification. In the discussion, we initially give details of the indispensable cleaning and encoding activities performed in data preparation using the Attention Bidirectional Recurrent Neural Network model with the identification of emotional footprint in a conversation. Bidirectional RNN examines the input sequence in both forward and backward directions. This ensures that the model has context from both the past and the future of a conversation and allows for a more comprehensive understanding of the corresponding conversation. An attention layer is added to the Bidirectional RNN. This layer learns to assign different attention weights to different parts of the input sequence, effectively highlighting the most important elements for emotion identification. It is employed to capture long-term dependencies and consider the most relevant words or features (text, audio, or video) for more accurate emotion classification.
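To make the bidirectional idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes a scalar tanh recurrence with hypothetical fixed weights `w_in` and `w_rec` (a trained model would learn vector-valued parameters) and simply pairs, for every position, a state computed from the past with a state computed from the future.

```python
import math

def rnn_pass(inputs, w_in=0.8, w_rec=0.5):
    """Run a scalar tanh RNN left to right and return the hidden state at each step."""
    h, states = 0.0, []
    for x in inputs:
        h = math.tanh(w_in * x + w_rec * h)  # new state mixes current input and previous state
        states.append(h)
    return states

def bidirectional_states(inputs):
    """Pair each position's forward state (past context) with its backward state (future context)."""
    forward = rnn_pass(inputs)
    backward = rnn_pass(inputs[::-1])[::-1]  # run right to left, then restore the original order
    return list(zip(forward, backward))

# Hypothetical toy sequence of utterance scores (illustrative values only).
states = bidirectional_states([1.0, 0.0, -1.0])
```

In this sketch, the forward component at the first position is unaffected by later inputs, whereas the backward component changes when later inputs change, which is the property the bidirectional model relies on.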
Second, with the identified emotional footprint as the basis, a Zero-Shot Semantic Classifier is applied to classify the corresponding emotions accurately and precisely. Then, we highlight the fine-tuning experiments conducted employing the fine-tuned Zero-Shot Learning-based classifier to train and classify emotional footprints in an accurate and precise manner. Zero-Shot Learning is a learning paradigm in which a model identifies classes it has never encountered during training. Emotional Footprint Classification refers to the task of identifying or categorizing emotions from data (text, facial expressions, speech, or other physiological signals). This is followed by the experimental setup and a discussion on the evaluation and testing of the methods.
3.1. Dataset Description
The Two-Party Conversation with Emotional Footprint and Emotional Intensity dataset [30] is used to conduct experiments. It is a single-modal (text-only) dataset for emotion identification, obtained from
https://data.mendeley.com/datasets/fvfjp6n3x9/1 (accessed on 15 November 2023). It consists of 1857 tagged conversations. For each of the two parties in a conversation, the dataset records features such as Emotional Footprint and Emotional Intensity. Emotional Footprint refers to the emotional state or a collection of emotional labels associated with each participant. Emotional Intensity is measured as the strength or severity of the emotions expressed by each party. The emotional footprint and intensity data are noted for each participant near the conclusion of the conversation. The dataset is in CSV file format, with column headers in the first row and the data for one conversation in each subsequent row. The aim of the dataset is to train and evaluate models on how emotions and their intensity evolve and are perceived within a dialogue, providing specific metrics for researchers to assess model accuracy and precision.
Table 2, given below, lists the column header along with the feature name and description.
With the aid of the above dataset, corresponding to the utterance-level emotions of each statement within a conversation between two speakers and values within these fields ranging from 0 to 6, the Emotional Footprint noted for each party (i.e., speaker) near the conclusion of the conversation is analyzed and validated using the proposed method.
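As an illustration of how such a CSV file might be read, the sketch below parses a small inline excerpt with Python's standard `csv` module. The column names (`conv_id`, `footprint_p1`, and so on), the two sample rows and the assignment of the codes 0 to 6 to the seven emotion names are all assumptions made for this example; the actual headers are those listed in Table 2.

```python
import csv
import io

# Assumed coding of the values 0-6, following the order in which the text names the categories.
EMOTIONS = ["no emotion", "anger", "disgust", "fear", "happiness", "sadness", "surprise"]

# Hypothetical excerpt with made-up values; the real file uses the headers from Table 2.
SAMPLE_CSV = """conv_id,footprint_p1,intensity_p1,footprint_p2,intensity_p2
1,4,0.8,0,0.1
2,1,0.6,5,0.7
"""

def load_conversations(handle):
    """Read one conversation per row and decode the per-party footprint codes."""
    rows = []
    for row in csv.DictReader(handle):  # first row supplies the column headers
        rows.append({
            "conv_id": int(row["conv_id"]),
            "footprints": (EMOTIONS[int(row["footprint_p1"])],
                           EMOTIONS[int(row["footprint_p2"])]),
            "intensities": (float(row["intensity_p1"]), float(row["intensity_p2"])),
        })
    return rows

conversations = load_conversations(io.StringIO(SAMPLE_CSV))
```

Passing an open file handle instead of the `io.StringIO` object would read the downloaded file in the same way, provided the column names are adjusted to the real headers.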
3.2. Attention Bidirectional Recurrent Neural Network-Based Emotional Footprint Identification
Emotional footprint identification in conversations has been receiving increasing attention from the research community owing to its applications in several prominent tasks, including opinion mining over chat history and social media threads on Twitter, Facebook, YouTube, and so on. In this section, we present an algorithm based on an Attention Bidirectional Recurrent Neural Network that can serve these requirements by processing the available conversational data. Previous studies developed traditional deep learning or machine learning algorithms for emotional footprint identification. However, these face several drawbacks, such as an inability to capture the temporal context, feature engineering complexity and failure to handle huge datasets. To address these issues, the Recurrent Neural Network (RNN) is selected for emotional footprint analysis because it is well suited to the inherently temporal and sequential nature of emotional data. An RNN processes data in sequences, using an internal memory (hidden state) to retain context from prior inputs, which is crucial for capturing how emotions evolve and change over time. An RNN also learns and extracts relevant features from raw data (text data), giving better performance than traditional machine learning techniques. Here, the RNN is employed to understand the dynamics of emotion.
Figure 2 shows the structure of the Attention Bidirectional Recurrent Neural Network-based emotional footprint identification model.
The Attention Bidirectional Recurrent Neural Network model shown in
Figure 2 revolves around three characteristics. First, each party is modeled with a speaker state that changes whenever that speaker makes an utterance. This lets the model track the swings in a speaker’s emotional footprint throughout the conversation, which are associated with the emotion behind the utterances. Second, the statement (context) of an utterance is modeled with a global state, in which the foregoing utterances and the speaker states are jointly encoded, as required for accurate speaker state representation. Finally, the model infers the emotional footprint representation from the state of the current speaker along with the foregoing speakers’ states as statements. This emotional footprint representation is then used for the final identification of the emotional footprint in a conversation.
Consider a conversation between several speakers (two for the dataset we used). The objective is to identify the emotional footprint labels (no emotion, anger, disgust, fear, happiness, sadness and surprise) of the constituent utterances, where each utterance is uttered by one speaker and a mapping associates each utterance with the index of its corresponding speaker. To update states and representations, GRU cells are employed. Each GRU cell computes a hidden state from the current input sample and the former GRU state, and this hidden state also serves as the current GRU output. The emotional footprint representation of the current utterance is modeled as a function of the emotional footprint representation of the former utterance. Finally, the resultant emotional footprint representation is sent to a fine-tuned zero-shot learned model for emotion classification.
3.2.1. Global Conversation State
Global Conversation State intends to capture the statement of a given utterance by jointly encoding the utterance and the speaker state. Addressing these states eases the inter-speaker and inter-utterance dependence and yields an improved statement representation. The current utterance changes the speaker’s state from its former value to an updated value. We reproduce this change with a global GRU cell of fixed output size, employing the current utterance and the speaker’s former state, as given in Equation (1).
In Equation (1), the output size of the GRU cell is the size of the Global Conversation State vector. The weights denote the strength of association between neurons, and the bias is a constant value that permits a neuron to be active even when all inputs are close to zero. The utterance and the speaker state are combined by concatenation. Finally, during training, the Attention Bidirectional Recurrent Neural Network fine-tunes both weights and biases through forward and backward processes with the objective of optimizing its emotional footprint predictions on the training data.
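The update described in this subsection can be sketched as a single GRU step whose input is the utterance vector concatenated with the speaker's former state. This is a minimal pure-Python illustration under stated assumptions: the standard GRU gate equations with the convention `h_new = (1 - z) * h + z * candidate`, randomly initialized hypothetical weights, and toy vector sizes. It is not the trained model of the paper.

```python
import math
import random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def gru_step(params, x, h):
    """One GRU update; params maps 'z', 'r', 'h' to (input weights, recurrent weights, bias)."""
    def gate(name, h_in, act):
        W, U, b = params[name]
        pre = [a + c + d for a, c, d in zip(matvec(W, x), matvec(U, h_in), b)]
        return [act(p) for p in pre]
    z = gate("z", h, sigmoid)                    # update gate
    r = gate("r", h, sigmoid)                    # reset gate
    rh = [ri * hi for ri, hi in zip(r, h)]
    cand = gate("h", rh, math.tanh)              # candidate state
    return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]

def init_params(in_dim, hid_dim, seed=0):
    """Hypothetical random initialization; a real model would learn these values."""
    rng = random.Random(seed)
    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
    return {k: (mat(hid_dim, in_dim), mat(hid_dim, hid_dim), [0.0] * hid_dim)
            for k in ("z", "r", "h")}

utterance = [0.2, -0.1, 0.4]           # toy utterance features
speaker_state = [0.0, 0.0]             # speaker's former state
global_state = [0.0, 0.0, 0.0, 0.0]    # former global conversation state
params = init_params(len(utterance) + len(speaker_state), len(global_state))
# List concatenation plays the role of the concatenation operator in the text.
new_global = gru_step(params, utterance + speaker_state, global_state)
```

With all weights and biases set to zero, the gates evaluate to 0.5 and the candidate to 0, so the step simply halves the previous state, which is a convenient sanity check for the gate arithmetic.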
3.2.2. Speaker Conversation State
The proposed method keeps track of the state of individual speakers using fixed-size vectors throughout the conversation. These states characterize the speakers during a conversation, as relevant to emotional footprint classification. They are fine-tuned based on the current role of a party in the conversation (i.e., listener or speaker) and on the incoming utterance. The foremost motive of this model is to ensure that the Speaker Conversation State is aware of the speaker of each utterance and handles it accordingly.
3.2.3. Speaker Conversation Update State
Finally, the speaker usually frames a response on the basis of the statement, that is, the foregoing or earlier utterances near the conclusion of the conversation. Therefore, we capture the statement relevant to the current utterance as given in Equations (2) to (6).
Equations (2) to (6) operate on the former Global Conversation State vectors. In Equation (2), attention scores over the former global conversation states, which represent the former utterances, are calculated. In Equation (6), the statement vector is calculated by combining the former Global Conversation State vectors with these attention scores. Finally, the identified emotional footprint near the conclusion of the conversation is represented as given in Equation (7).
Equation (7) encodes the information on the current utterance, along with its statement from the global GRU, into the speaker’s state, which aids the emotional footprint classification described in the next section. The pseudocode representation of Attention Bidirectional Recurrent Neural Network-based emotional footprint identification is given below.
Algorithm 1 describes a step-by-step process of obtaining computationally efficient emotional footprints. The raw data obtained from the Two-Party Conversation along with Emotional Footprint and Emotional Intensity datasets are applied to identify the emotional footprint of the members participating in the conversation near the conclusion of the conversation. The overall process is split into three states, namely, Global Conversation State, Speaker Conversation State and Speaker Conversation Update State. Initially, in the Global Conversation State, the statement of a given utterance is obtained cooperatively by encoding utterance and speaker state, following which, the Speaker Conversation State, in turn, keeps track of the state of individual speakers using definite fixed-size vectors throughout the conversation.
| Algorithm 1 Attention Bidirectional Recurrent Neural Network-based emotional footprint identification |
| Input: Dataset ‘DS’, Samples ‘S’, Features, Speakers, Utterances ‘U’ |
| Output: Computationally efficient emotional footprint |
1: Initialize ‘N = 1857’, ‘n = 48’, ‘M = 2’, ‘m = 6’
2: Begin
3: For each Dataset ‘DS’ with Speakers, Samples ‘S’ and utterances ‘U’
//Global Conversation State
4: Formulate Global Conversation State according to (1)
//Speaker Conversation State
5: Keep track of individual speakers using fixed-size vectors throughout the conversation
//Speaker Conversation Update State
6: Formulate Speaker Conversation Update State according to (2), (3), (4), (5) and (6)
7: Obtain emotional footprint according to (7)
8: Return emotional footprint
9: End for
10: End |
By combining these states, the Attention Bidirectional Recurrent Neural Network model keeps track of the individual speaker states throughout the conversation and utilizes this information for emotion classification, thereby reducing the training time. Applying the attention scores together with the bidirectional pass gives statement representations that contain information on all preceding utterances by each speaker during the conversation. This, in turn, ensures that at each time instance, the speaker state obtains information from the speaker’s previous state and from the Global Conversation State, which holds information on the preceding parties. Finally, the Speaker Conversation Update State is decoded into the emotional footprint representation of the given utterance, which is used for further emotional footprint classification, thereby improving the overall accuracy.
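The role of the attention scores can be illustrated with a short sketch. It assumes plain dot-product scoring between a query vector (standing in for the current utterance) and each former global state, followed by a softmax and a weighted sum. The exact scoring function of Equations (2) to (6) may differ, so the function names and the scoring rule here are illustrative assumptions only.

```python
import math

def softmax(scores):
    """Convert raw scores into weights that are positive and sum to one."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, states):
    """Weight each former state by its dot-product similarity to the query."""
    scores = [sum(q * s for q, s in zip(query, state)) for state in states]
    weights = softmax(scores)
    dim = len(states[0])
    # The statement (context) vector is the attention-weighted sum of the former states.
    context = [sum(w * state[i] for w, state in zip(weights, states)) for i in range(dim)]
    return weights, context

former_states = [[1.0, 0.0], [0.0, 1.0]]  # toy former global states
weights, context = attend([0.0, 0.0], former_states)
```

A query that is equally similar to every former state yields uniform weights, while a query aligned with one state concentrates almost all of the weight on it.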
3.3. Zero-Shot Learning-Based Classifier
Emotional footprint classification is the task of connecting a conversation with a human emotion. State-of-the-art methods are typically learned using hand-crafted affective lexicons. In this work, we want to detect emotional footprints and their evolution for each party near the conclusion of the conversation. This situation raises numerous issues, ranging from small and mostly unlabeled datasets to the need to identify and adapt methods for such situations. To handle these issues, a Zero-Shot Learning-based classifier is applied for emotional footprint classification in a precise manner with minimal error.
Figure 3, given below, shows the structure of Zero-Shot Learning-based classifier model.
The traditional Zero-Shot Learning-based classifier concentrates on transferring knowledge between speakers to impart supplementary information for recognizing unobserved classes. Nevertheless, these mechanisms are laborious and cumbersome to use because emotion appears in a latent layer in conversation, which makes it difficult to derive emotional descriptors, and because distinct affective emotions may also result in different emotional descriptors. In emotional footprint classification, zero-shot failures on low-intensity emotions are addressed by using an attribute-learning phase to measure the relationship between generic semantic features and specific emotional categories. Hence, a Zero-Shot Learning-based classifier, as shown in
Figure 3, is designed that uses emotional footprints to associate semantic features (i.e., conv_length, inform, question, directive and commissive) with emotional states. Using attribute learning, the classifier harnesses these semantic features to learn the emotional categories (i.e., no emotion, anger, disgust, fear, happiness, sadness and surprise).
The attribute-learning phase aims at fitting emotional category values (known emotion labels, i.e., no emotion, anger, disgust, fear, happiness, sadness and surprise) to conversation samples utilizing the corresponding semantic features (i.e., conv_length, inform, question, directive and commissive). This attribute-learning phase models the correlation between the features and the emotional categories. The utterance samples belonging to observed (known) classes and the emotional states utilized in attribute learning are mathematically expressed as given in Equation (8).
Equation (8) includes the feature dimensionality. The corresponding emotional categories of the samples are mathematically represented as given in Equation (9).
Thus, the task of attribute learning is to learn the correlation between the features and each category. The mapping of each category from the features is defined in Equation (10), and the mapping function for the prediction task is represented as given in Equation (11).
The terms of Equation (11) are the semantic feature mapping of the corresponding sample, a similarity measurement between two vectors, a regularization term and a conditional vector set. Finally, with the assumption that similar mappings are shared from semantic features to emotional categories, the predicted emotional categories are obtained as given in Equation (12).
At test time, following Equation (12), the model observes samples from classes that were not observed during training and is required to predict the emotional footprint class to which they belong, both precisely and with minimal error. The pseudocode representation of the Zero-Shot Learning-based classifier is given below.
Algorithm 2 describes a step-by-step process of classifier model employing Zero-Shot Learning for improving precision with minimal error. Here, with the identified emotional footprint at the end of a conversation, with Speaker, Samples, Emotional Categories and utterances obtained as input, attribute learning is modeled using Zero-Shot Learning model. Here, initially, utterance samples belonging to observed classes (known) and emotional states are obtained, following which, subsequent emotional categories evolved. Third, a mapping function is formulated to finally obtain dimensional predicted emotional categories with the resultant classified results in a precise manner with minimal error. In this manner, depending on semantic attribute vectors to classify unseen emotional footprints using attribute learning realizes the requirement of comparing their semantic attribute features to the attributes of observed classes.
| Algorithm 2 Zero Shot Learning-based classifier |
| Input: Dataset ‘’, Samples ‘’, Features ‘’, Speaker ‘’, utterances ‘’, Emotional Categories ‘’ |
| Output: Precise and accurate classifier |
| 1: Initialize ‘’, ‘’, ‘’, ‘’, ‘’, regularization term ‘’ |
| 2: Begin |
| 3: For each Dataset ‘’ with Speaker ‘’, Samples ‘’ and utterances ‘’ |
| //Attribute Learning |
| 4: Formulate ‘’ utterance samples belonging to observed classes (known) emotional states according to (8) |
| 5: Formulate subsequent emotional categories according to (9) |
| 6: Define mapping function according to (10) and (11) |
| 7: Obtain ‘’ dimensional predicted emotional categories according to (12) |
| 8: Return classified results ‘’ |
| 9: End for |
| 10: End |
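The flow of Algorithm 2 can be illustrated with a minimal sketch of attribute-based zero-shot classification. It is not the ABRN-ZSSC implementation: the exact forms of Equations (8) to (12) are not reproduced here, so the sketch assumes a linear mapping from utterance features to semantic attributes fitted by ridge regression (the regularization term `reg`) and cosine similarity as the similarity measurement. All function names, the toy `mixing` matrix and the attribute vectors are hypothetical.

```python
import numpy as np


def fit_attribute_mapping(features, attributes, reg=0.01):
    """Fit a ridge-regression map W (d x k) from features to semantic attributes.

    Assumption: a linear mapping with an L2 regularization term `reg`.
    """
    d = features.shape[1]
    gram = features.T @ features + reg * np.eye(d)
    return np.linalg.solve(gram, features.T @ attributes)


def cosine_similarity(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T


def predict_categories(features, mapping, class_attributes):
    """Project features into attribute space and pick the most similar class."""
    projected = features @ mapping
    return cosine_similarity(projected, class_attributes).argmax(axis=1)


# Hypothetical toy data: three observed categories, two unseen categories.
seen_attr = np.eye(3)                       # attribute vectors of observed classes
unseen_attr = np.array([[1.0, 1.0, 0.0],
                        [0.0, 1.0, 1.0]])   # attribute vectors of unseen classes
mixing = np.array([[1, 0, 0, 1, 0],
                   [0, 1, 0, 0, 1],
                   [0, 0, 1, 1, 1]], dtype=float)  # attributes -> 5-D features

# Training uses observed classes only (one sample per class in this toy case).
train_x = seen_attr @ mixing
W = fit_attribute_mapping(train_x, seen_attr)

# At test time, samples come from classes never seen during training.
test_x = unseen_attr @ mixing
all_attr = np.vstack([seen_attr, unseen_attr])  # candidate classes 0..4
pred = predict_categories(test_x, W, all_attr)
print(pred)
```

In this toy setting the unseen samples are assigned to candidate classes 3 and 4, the two categories absent from training, because their projected attribute vectors lie closest to the corresponding class attributes.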
4. Experimental Setup
In this section, experiments are performed to validate the efficiency of the Attention Bidirectional Recurrent Neural Zero-Shot Semantic Classifier (ABRN-ZSSC) for emotional footprint identification. The experiments are implemented in Python (version 3.11.2) using the Two-Party Conversation with Emotional Footprint and Emotional Intensity dataset, obtained from https://data.mendeley.com/datasets/fvfjp6n3x9/1 (accessed on 15 November 2023). Comparison is made against three existing methods: Emotional Voice Conversion with Emotion Intensity Control (EMOVOX) [1], Multi-task deep Cross-Attention Network (MTCANet) [2] and TTNet [3]. To ensure a fair comparison, the same dataset is used to validate all four methods: ABRN-ZSSC, EMOVOX [1], MTCANet [2] and TTNet [3].
The objective of the proposed ABRN-ZSSC method is to identify the emotional footprint of the participants in a conversation with improved accuracy, reduced time and a minimal error rate. In line with this objective, the existing methods EMOVOX [1], MTCANet [2] and TTNet [3] are taken as baselines. EMOVOX [1] addresses emotional voice conversion with control of emotion intensity. MTCANet [2] is a multi-task cross-attention network with higher accuracy. These three baselines are relevant to the proposed method and are compared against it. The concept of the proposed method is derived by considering the problems of these baseline methods, and the proposed method is designed to avoid their drawbacks.
The experiments are run on a computer with an Intel(R) Core(TM) i5-7200 CPU @ 2.50 GHz and 8.00 GB of RAM. Cross-validation is employed to assess how well the model performs on unseen data. The dataset is divided into two sets, training and testing: 70% of the samples are used for training and the remaining 30% for testing. Performance is quantified using several metrics, namely accuracy, precision, recall and the time taken for emotional footprint identification. Accuracy is the proportion of input samples from the dataset whose emotions the model identifies and classifies correctly. Precision is the proportion of samples assigned to an emotion category that actually belong to that category. Recall, also known as sensitivity, is the proportion of actual positive cases that the model correctly identifies. Time is the amount of time consumed by the algorithm for emotional footprint identification. These metrics are measured using 10-fold cross-validation. Hyperparameters are described in Table 3.
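The evaluation protocol described above can be sketched as follows. This is an illustrative outline rather than the evaluation code used in the experiments: the helper names are hypothetical, the labels are made-up examples, and macro-averaging of precision and recall over emotion categories is an assumption, since the averaging scheme is not stated.

```python
import random
import time


def split_70_30(samples, seed=0):
    """Shuffle the samples and split them into 70% training and 30% testing."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = round(0.7 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) pairs; fold i tests on indices i, i+k, i+2k, ..."""
    for fold in range(k):
        test_idx = list(range(fold, n_samples, k))
        held_out = set(test_idx)
        train_idx = [i for i in range(n_samples) if i not in held_out]
        yield train_idx, test_idx


def classification_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision and recall (averaging is an assumption)."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precisions, recalls = [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / len(labels),
        "recall": sum(recalls) / len(labels),
    }


def timed(fn, *args):
    """Run fn(*args) and return its result with the elapsed wall-clock seconds."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


# Hypothetical labels for six utterances.
y_true = ["joy", "joy", "anger", "fear", "anger", "fear"]
y_pred = ["joy", "anger", "anger", "fear", "anger", "joy"]
metrics, elapsed = timed(classification_metrics, y_true, y_pred)
print(metrics, elapsed)
```

With 10-fold cross-validation, the metrics would be computed once per fold from `k_fold_indices` and then averaged; the 70/30 split gives a single hold-out estimate instead.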
Hyperparameter selection is the process of choosing the best configuration for the parameters that control the learning algorithm’s performance. During learning, these selections balance model performance and computational efficiency while preventing common problems such as overfitting or underfitting. The learning rate and batch size are crucial for the stability and speed of the optimization process. Balancing bias and variance aims to find a configuration that generalizes well to unseen data. During the inference phase, hyperparameters are chosen for consistent performance, speed and reliability in a production environment.