Attention Bidirectional Recurrent Neural Zero-Shot Semantic Classifier for Emotional Footprint Identification

Jagadeesan, Karthikeyan; Kumarappan, Annapurani

doi:10.3390/computation14010008

by Karthikeyan Jagadeesan * and Annapurani Kumarappan

Department of Networking and Communications, School of Computing, SRM Institute of Science and Technology, Kattankulathur, Chennai 603203, India

* Author to whom correspondence should be addressed.

Computation 2026, 14(1), 8; https://doi.org/10.3390/computation14010008

Submission received: 26 November 2025 / Revised: 14 December 2025 / Accepted: 16 December 2025 / Published: 2 January 2026

(This article belongs to the Section Computational Social Science)

Abstract

Exploring emotions in organizational settings, particularly in feedback on organizational welfare programs, is critical for understanding employee experiences and enhancing organizational policies. Recognizing emotions from a conversation (i.e., identifying the emotional footprint it leaves) is a central task for a machine to comprehend the full context of the conversation. While fine-tuning of pre-trained models has consistently provided state-of-the-art results in emotional footprint recognition tasks, the prospect of a zero-shot learned model in this sphere is largely unexplored. The objective here is to identify the emotional footprint of the members participating in a conversation after the conversation is over, with improved accuracy and training time and a minimal error rate. To address these gaps, a method called Attention Bidirectional Recurrent Neural Zero-Shot Semantic Classifier (ABRN-ZSSC) for emotional footprint identification is proposed. The ABRN-ZSSC method is split into two sections. First, the raw data from a Two-Party Conversation with Emotional Footprint and Emotional Intensity dataset are subjected to the Attention Bidirectional Recurrent Neural Network model with the intent of identifying the emotional footprint for each party near the conclusion of the conversation. Second, with the identified emotional footprint in a conversation, the Zero-Shot Learning-based classifier is applied to train and classify emotions both accurately and precisely. We verify the utility of these approaches (i.e., emotional footprint identification and classification) by performing an extensive experimental evaluation on two corpora on four aspects (training time, accuracy, precision, and error rate) for varying numbers of samples. Experimental results demonstrate that the ABRN-ZSSC method outperforms two existing baseline models in emotion inference tasks across the dataset. Compared with the conventional methods, the proposed ABRN-ZSSC method improves precision by 10%, accuracy by 17% and recall by 8%, and reduces training time by 19% and error rate by 18%.

Keywords:

emotional footprint; conversation; attention score; bidirectional recurrent neural network; zero-shot learning

1. Introduction

Emotional footprints describe the enduring emotional influence a person leaves on others via their interactions (intrinsically, the idea or thought their emotions create in a given environment), while emotional intensity expresses the depth of an individual’s emotional experiences, designating how strongly they feel a particular emotion. An emotional footprint is defined as the lasting impression that an individual’s or group’s emotions and actions leave on others, similar to a physical footprint. Emotional intensity refers to the strength or degree of an emotion experienced. Emotional footprints are based on several criteria, such as strength, duration, physiological arousal (heart rate, voice tone), and impact on the conversation and its participants. Emotional intensity likewise depends on several criteria, namely the type of emotion (positive or negative) and its influence on the other conversant. Measurement methods for the emotional footprint include self-reporting, observer ratings, textual analysis and physiological data. Measurement methods for emotional intensity include follow-up surveys, content analysis and network analysis. Actual conversation data are gathered, after which emotional categories (e.g., anger, joy, fear) are determined. Although the terms emotional footprint and emotional intensity are sometimes used interchangeably, the emotional footprint aids in analyzing how somebody’s emotions influence others around them, whereas emotional intensity expresses how strongly somebody perceives a specific emotion.

A method called Emotional Voice Conversion with Emotion Intensity Control (EMOVOX) was proposed in [1], with the intent of explicitly featuring and controlling emotion intensity. Here, the speaker style was extracted from linguistic content and then encoded in a continuous space, forming an emotion-embedding prototype. Moreover, to ensure emotional intelligibility, an emotion classification loss and an emotion-embedding similarity loss were also included, which allowed fine-grained control of emotion intensity in the output speech while reducing the emotion classification loss and improving the emotion-embedding similarity.

Although the emotion classification loss was minimized and the emotion-embedding similarity improved, the training time and the accuracy involved were not analyzed. To focus on these two aspects, an Attention Bidirectional Recurrent Neural Network model is designed using a bidirectional RNN, which leads to improved accuracy with minimal training time by taking into consideration both the past and the future of a sequence involved in a conversation.

A multi-task deep Cross-Attention Network (MTCANet) that simultaneously performs Key Word Spotting (KWS) and Speaker Verification (SV), while efficiently using information pertaining to both tasks, was proposed in [2]. The method combined a KWS sub-network with an SV sub-network to improve overall performance in demanding circumstances, including short-duration speech, noisy environments and so on.

Intrinsically, the method included three modules: a novel deep cross-attention (DCA) module for combining the KWS and SV tasks, a multi-layer stacked shared encoder (SE) to minimize the influence of noise on the recognition rate and, finally, soft attention (SA) to concentrate on relevant information while circumventing gradient vanishing, with minimal error rate and improved accuracy. Although accuracy improved and the error rate was reduced, the precision rate was not analyzed. To focus on this aspect, a Zero-Shot Learning-based classifier is introduced that achieves this objective with the aid of semantic features.

With online learning, it is tricky for teachers to identify the emotional state of students for adjusting their pace and attention to each learner. A machine learning model called TTNet was employed in [3] for gathering the facial data and examining the emotions of each learner in the online classroom, but the accuracy was not enhanced. Humans are prepared to realize each other’s emotions via ingenious facial expressions, body movements, the way they speak, or unambiguously by the tone of voice. Alternately, computationally speaking, numerous methods are being designed for automating the procedure of information analysis from media sources to determine the emotions expressed by users. An ensemble strategy was applied in [4] to classify the emotion observed from a speech with the intent of improving overall classification accuracy.

Yet another method focusing on accuracy, based on a deep neural network model, was designed in [5] by considering five different emotions: joy, shame, guilt, sadness and fear. However, certain drawbacks were observed with keyword- and lexicon-based methods, as they concentrate on semantic relations. To address this issue, a hybrid machine and deep learning method to identify emotions in text was proposed in [6]. These hybrid learning methods resulted in an improvement in overall accuracy. Beyond accuracy, the loss factors involved in the analysis were the focus of [7], which employed a convolutional neural network (CNN) with long short-term memory (LSTM).

Research into speech-based emotion recognition in clinical applications has been continuously increasing over the past few years. To this end, a novel algorithm that takes the edge off the feature extraction mechanism and attributes an importance level of selected neurons by using deterministic edge/node embeddings with attention scores was presented in [8], therefore improving accuracy in an extensive manner.

The challenges and opportunities involved in the automatic detection of expressed emotion from speech were investigated in [9] to focus on inter- and intra-reliability aspects. The challenges and progress involved in deep speaker recognition were presented in [10]. However, in the task of emotion inference, the most prevalent issue is the lack of common-sense knowledge, specifically in the context of dialogue, where conventional research has failed to efficiently extract structural features, reducing the accuracy of emotion inference. To address this problem, a dialogue emotion inference method employing Common Sense Enhancement and a Graph Model was proposed in [11].

A systematic review of speech synthesis employing deep learning was investigated in [12]. Yet another comparative study of recognizing emotional footprints using machine learning was presented in [13]. A triangulation method for extracting a novel set of geometric features with the intent of classifying six different emotional expressions, namely, anger, fear, disgust, surprise, happiness and sadness, employing computer-generated markers was proposed in [14].

An ensemble of a deep convolutional and recurrent neural network was designed in [15] for emotion recognition in speech to improve the final recognition rate. However, these texts generally contain several hidden emotions that could indirectly contribute significantly to identifying sentiments. The emotion detection problem was addressed in [16] via a two-stage emotion detection methodology with improved accuracy.

The success of deep learning algorithms motivates us to use a recurrent neural network in the emotional footprint classification process. In this study, we report a new method using the Attention Bidirectional Recurrent Neural Network and Zero-Shot Learning-based classifier. First, we identify emotional footprints using the Attention Bidirectional Recurrent Neural Network model. Then, we pass all the identified emotional footprints through a Zero-Shot Learning-based classifier for accurate and precise classification of emotions. This experiment was carried out on a Two-Party Conversation with Emotional Footprint and Emotional Intensity datasets. Our experimental work reveals that the proposed method outperforms some previously published results.

1.1. Innovations and Contributions of the Work

To address the above issues, like handling the error rate and measuring the accuracy rate using the two-party conversation dataset recorded at the end of conversation, the contributions of the Attention Bidirectional Recurrent Neural Zero-Shot Semantic Classifier (ABRN-ZSSC) are listed as given below.

To improve the precision, accuracy, error rate and training time, the emotional footprint classification method, ABRN-ZSSC, is designed based on an Attention Bidirectional Recurrent Neural Network employing a Zero-Shot Learning-based classifier.
To achieve computationally efficient emotional footprint identification, the Attention Bidirectional RNN takes into consideration both the past and the future of a conversation, allowing for a more comprehensive understanding of the corresponding conversation. It includes the Global Conversation State, the Speaker Conversation State and the Speaker Conversation Update State. The Global Conversation State is employed to capture the overall context of all preceding utterances in the dialogue. The Speaker Conversation State is used to keep track of the state of individual speakers using fixed-size vectors throughout the conversation, and the Speaker Conversation Update State is applied to update the individual speaker’s state based on their current utterance. In this way, the training time is reduced.
To enhance the precision and accuracy of classification, the Zero-Shot Learning-based classifier algorithm is employed for identifying the emotional footprint accurately.
To reduce the error rate, the Zero-Shot Learning-based classifier is used for classifying seven different types of emotions via semantic feature mapping. In addition, high correlations between features and emotional categories are determined via attribute learning.
Finally, a comprehensive experimental assessment is carried out with four performance metrics (precision, accuracy, error rate and training time) to illustrate the advantage of the proposed ABRN-ZSSC method over traditional methods.

1.2. Organization of the Work

This study is organized as follows: In Section 2, we lay out our study by introducing the background and related work pertaining to emotional footprint classification. In Section 3, we present the details of our proposed Attention Bidirectional Recurrent Neural Zero-Shot Semantic Classifier (ABRN-ZSSC) method, and we introduce our experiments in Section 4. In Section 5, we report the experimental results, followed by a discussion and conclusion in Section 6.

2. Related Works

Emotional footprint refers to the impression we leave, such as walking through a garden, associating with family members or interacting with the employees of an organization. Hence, an emotional footprint is the consequence of emotional transmission, be it positive or negative. As a leader, the emotions and strategy relating to other people influence the organizational environment. To the same extent that there is heterogeneity in the mechanisms that pollute the earth, there is an abundance of mechanisms a leader can utilize to tune the emotional environment of an organization.

Deep learning-assisted semantic text analysis was proposed in [17] with the objective of detecting human emotion utilizing big data. A survey of methods concentrating on the recognition of emotions involved during conversations was explored in [18]. An automatic emotion recognition system employing two-level ensemble classifiers was proposed in [19] by making accurate differentiation between high valence versus low valence and high arousal versus low arousal.

A new framework employing temporal dynamics by incorporating LSTM was presented in [20,21] to improve accuracy. Speaking in a low voice can also result in wrong predictions. To address this gap, emotion analysis based on linguistics using an attention score and softmax function was presented in [22]. This, in turn, not only ensured accurate prediction but also minimized the errors involved in emotion analysis.

A multi-task network was designed in [23] for both speaker and command recognition, with minimal execution time. Also, the emotion detection problem as a part of sentiment analysis was presented in [16] using the zero-shot model with an improved accuracy rate. Affective structural embedding was employed in [24] using zero-shot learning and emotional categories to design an accurate classification model. Yet another novel mechanism employing extreme learning to reduce the complexity involved in unseen labels was presented in [25]. A systematic review of zero-shot learning and machine learning classifiers for the abstract screening of emotions was presented in [26].

Nevertheless, fine-grained sentiments might integrate similar emotions with a single primary emotion. Trying to address this issue as a classification task can result in performance improvements; however, it does not generate a better comprehension and representation of language. A fine-grained sentiment analysis mechanism employing neutrosophy was proposed in [27] via three different membership functions, positive, negative, and neutral, to model an instance into Single-Valued Neutrosophic Sets (SVNSs) with a higher level of precision and accuracy. A review on emotion detection employing deep learning algorithms was presented in [28]. A novel method for detecting emotion by fine-tuning a distilled zero-shot student model with the objective of classifying emotions in text was presented in [29].

Our study builds upon the existing body of research in emotional footprint identification within deep learning by specifically concentrating on the application of a zero-shot learning-based classifier model to analyze emotional footprints via semantic features. While previous studies have extensively explored emotional footprint identification using various machine and deep learning techniques, our research distinguishes itself by focusing on the emotional footprint of the members participating in the conversation after the conversation is over and the application of zero-shot learning employing semantic features. The elaborate description of the proposed Attention Bidirectional Recurrent Neural Zero-Shot Semantic Classifier (ABRN-ZSSC) for emotional footprint identification is provided in the following subsections.

3. Materials and Methods

While previous research has studied linguistic content and Key Word Spotting in explaining the variation in emotional footprints, a more comprehensive exploration of individual drivers would benefit the development of effective and equitable mitigation policies. Existing emotion recognition research based on fine-tuning has several limitations, such as data scarcity and cost, poor generalization to unseen classes and computational expense. To address these issues, zero-shot learning is employed, enabling models to recognize emotions from previously unseen categories by leveraging prior semantic information. Table 1 compares zero-shot learning with fine-tuning and shows the advantages of the former.

This section presents our proposed Attention Bidirectional Recurrent Neural Zero-Shot Semantic Classifier (ABRN-ZSSC) method for emotional footprint identification from a fine-tuned Zero-Shot Learning-based classifier. The proposed method, ABRN-ZSSC, is depicted in Figure 1.

As shown in the above figure, the proposed ABRN-ZSSC method is split into two sections. The data obtained from the Two-Party Conversation, with the Emotional Footprint and Emotional Intensity datasets as input, are first subjected to the Attention Bidirectional Recurrent Neural Network model for emotional footprint identification. In the discussion, we initially give details of the indispensable cleaning and encoding activities performed in data preparation using the Attention Bidirectional Recurrent Neural Network model with the identification of emotional footprint in a conversation. Bidirectional RNN examines the input sequence in both forward and backward directions. This ensures that the model has context from both the past and the future of a conversation and allows for a more comprehensive understanding of the corresponding conversation. An attention layer is added to the Bidirectional RNN. This layer learns to assign different attention weights to different parts of the input sequence, effectively highlighting the most important elements for emotion identification. It is employed to capture long-term dependencies and consider the most relevant words or features (text, audio, or video) for more accurate emotion classification.
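The two-directional reading described above can be illustrated with a minimal sketch. It is not the authors' implementation: the scalar cell `rnn_step` and its weights `w_h` and `w_x` are hypothetical stand-ins chosen only so that the example runs with the Python standard library. It shows how each position ends up paired with one state that summarizes the past and one that summarizes the future.

```python
import math

def rnn_step(h_prev, x, w_h=0.5, w_x=1.0):
    """Toy scalar recurrent cell (hypothetical weights): mixes state and input."""
    return math.tanh(w_h * h_prev + w_x * x)

def bidirectional_states(xs):
    """Return, for each position, the pair (forward state, backward state)."""
    fwd, h = [], 0.0
    for x in xs:                  # left-to-right pass: summarizes the past
        h = rnn_step(h, x)
        fwd.append(h)
    bwd, h = [], 0.0
    for x in reversed(xs):        # right-to-left pass: summarizes the future
        h = rnn_step(h, x)
        bwd.append(h)
    bwd.reverse()                 # realign backward states with positions
    return list(zip(fwd, bwd))

if __name__ == "__main__":
    for position, pair in enumerate(bidirectional_states([1.0, 0.0, -1.0])):
        print(position, pair)
```

In this sketch, changing only the last input alters the backward state at the first position but not its forward state, which is the sense in which a bidirectional model sees context from both ends of the conversation.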

Second, with the identified emotional footprint as the basis, a Zero-Shot Semantic Classifier is applied to classify the corresponding emotions accurately and precisely. Then, we highlight the fine-tuning experiments conducted employing the fine-tuned Zero-Shot Learning-based classifier to train and classify emotional footprints in an accurate and precise manner. Zero-Shot Learning is a learning paradigm in which a model identifies classes it has never encountered during training. Emotional Footprint Classification refers to the task of identifying or categorizing emotions from data (text, facial expressions, speech, or other physiological signals). This is followed by the experimental setup and a discussion on the evaluation and testing of the methods.

3.1. Dataset Description

The Two-Party Conversation with Emotional Footprint and Emotional Intensity dataset [30] is used to conduct experiments. This is a single-modal (text-only) dataset for emotion identification. It was obtained from https://data.mendeley.com/datasets/fvfjp6n3x9/1 (accessed on 15 November 2023). It consists of 1857 tagged conversations. In the dataset, most features, such as Emotional Footprint and Emotional Intensity, are recorded for each of the two parties in the conversation. Emotional Footprint refers to the emotional state or a collection of emotional labels associated with each participant. Emotional Intensity is measured as the strength or severity of the emotions expressed by each party. The emotional footprint and intensity data are noted for each participant near the conclusion of the conversation. The dataset is in CSV file format. The data are structured with column headers in the first row and the data of an individual conversation in each subsequent row. The aim of the dataset is to train and evaluate models on how emotions and their intensity evolve and are perceived within a dialogue, providing specific metrics for researchers to assess model accuracy and precision. Table 2, given below, lists the column header along with the feature name and description.

With the aid of the above dataset, corresponding to the utterance-level emotions of each statement within a conversation between two speakers and values within these fields ranging from 0 to 6, the Emotional Footprint noted for each party (i.e., speaker) near the conclusion of the conversation is analyzed and validated using the proposed method.

3.2. Attention Bidirectional Recurrent Neural Network-Based Emotional Footprint Identification

Emotion footprint identification in conversations has been receiving increasing attention from the research community owing to its applications in several prominent tasks, including opinion mining over chat history and social media threads on Twitter, Facebook, YouTube, and so on. In this section, we present an algorithm based on an Attention Bidirectional Recurrent Neural Network that can serve these requirements by processing the available conversational data. Previous studies developed traditional deep learning or machine learning algorithms for emotion footprint identification. However, these face several drawbacks, such as an inability to capture the temporal context, feature engineering complexity and failure to handle huge datasets. To address these issues, the Recurrent Neural Network (RNN) is selected for emotional footprint analysis, as it is well suited to the inherently temporal and sequential nature of emotional data. An RNN is designed to process data in sequences, using an internal memory (hidden state) to retain context from prior inputs, which is crucial for capturing how emotions evolve and change over time. An RNN learns and extracts relevant features from raw data (text data), giving better performance than traditional machine learning techniques. Here, the RNN is employed to understand the dynamics of emotion. Figure 2 shows the structure of the Attention Bidirectional Recurrent Neural Network-based emotional footprint identification model.

The Attention Bidirectional Recurrent Neural Network model shown in Figure 2 revolves around three characteristics as follows: each party is designed, employing a speaker state that changes as and when that speaker utters an utterance. This makes the model track the speaker’s emotion footprint swings throughout the conversations that are associated with the emotion behind the utterances. Moreover, the factors of an utterance are designed employing a global state, where the foregoing utterances and the speaker states are cooperatively encoded for statement characterization requisites for accurate speaker state representation. Finally, the Attention Bidirectional Recurrent Neural Network model hypothesizes emotion footprint representation from the party state of a speaker along with the foregoing speakers’ states as statements. This emotional footprint representation is then used for the final identification of emotional footprint in a conversation.

Let there be $M$ speakers, $\mathrm{Sp}_{1}, \mathrm{Sp}_{2}, \dots, \mathrm{Sp}_{M}$ ($M = 2$ for the dataset $DS$ we used), in a conversation. The objective is to identify the emotional footprint labels (no emotion, angry, disgust, fear, happiness, sadness and surprise) of the constituent utterances $U_{1}, U_{2}, \dots, U_{m}$, where utterance $U_{i}$ is uttered by speaker $\mathrm{Sp}_{\mathrm{map}(U_{i})}$, while $\mathrm{map}$ is the mapping between an utterance and the index of its corresponding speaker. Here, $\mathrm{Sp}, U_{i} \in DS$, and to update states and representation, GRU cells are employed. Here, each GRU cell evaluates a hidden state designated as $H_{t} = \mathrm{GRU}(H_{t-1}, S_{t})$, where $S_{t}$ represents the current input sample and $H_{t-1}$ represents the foregoing or former GRU state, with $H_{t}$ referring to the current GRU output. The emotional footprint representation of the current utterance is modeled as a function of the emotional footprint representation of the former utterance. Finally, this resultant emotion footprint representation is sent to a fine-tuned zero-shot learned model for emotion classification.
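To make the recurrence $H_{t} = \mathrm{GRU}(H_{t-1}, S_{t})$ concrete, the following is a minimal sketch of one GRU step in plain Python. The paper does not spell out its gate equations, so this uses one common textbook formulation (update gate, reset gate, candidate state), and other conventions swap the roles of the update gate. The helper names `make_gru` and `gru_step`, the random initialization and the vector sizes are assumptions for illustration only.

```python
import math
import random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def make_gru(input_size, hidden_size, seed=0):
    """Randomly initialized parameters for the z, r and candidate (h) paths."""
    rng = random.Random(seed)
    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
    params = {}
    for gate in ("z", "r", "h"):
        params["W_" + gate] = mat(hidden_size, input_size)   # input weights
        params["U_" + gate] = mat(hidden_size, hidden_size)  # recurrent weights
        params["b_" + gate] = [0.0] * hidden_size            # biases
    return params

def gru_step(params, h_prev, s_t):
    """One step H_t = GRU(H_{t-1}, S_t) under the assumed gate convention."""
    def affine(gate, h):
        wx = matvec(params["W_" + gate], s_t)
        uh = matvec(params["U_" + gate], h)
        return [a + b + c for a, b, c in zip(wx, uh, params["b_" + gate])]
    z = [sigmoid(v) for v in affine("z", h_prev)]        # update gate
    r = [sigmoid(v) for v in affine("r", h_prev)]        # reset gate
    gated = [ri * hi for ri, hi in zip(r, h_prev)]
    cand = [math.tanh(v) for v in affine("h", gated)]    # candidate state
    # Convex combination of the old state and the candidate state.
    return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h_prev, cand)]

if __name__ == "__main__":
    params = make_gru(input_size=3, hidden_size=4)
    h = [0.0] * 4
    for s in ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]):
        h = gru_step(params, h, s)
    print(h)
```

Because the new state is a convex combination of the previous state and a tanh-bounded candidate, every component stays strictly inside (-1, 1) when the initial state is zero.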

3.2.1. Global Conversation State

The Global Conversation State intends to capture the statement of a given utterance by cooperatively encoding the utterance and the speaker state. Addressing these states eases the inter-speaker and inter-utterance reliance to impart improved statement representation. The prevailing utterance $U_{t}$ changes the speaker’s state from $P_{\mathrm{map}(U_{t}), t-1}$ to $P_{\mathrm{map}(U_{t}), t}$. We reproduce this change with the GRU cell $\mathrm{GRU}_{G}$ with output size $\mathrm{GCS}_{G}$, employing $U_{t}$ and $P_{\mathrm{map}(U_{t}), t-1}$ as given below.

$$G_{t} = \mathrm{GRU}_{G}\left(G_{t-1}, \left(U_{t} \oplus P_{\mathrm{map}(U_{t}), t-1}\right)\right) + W_{G,S} + W_{G,H} + B_{G,S} + B_{G,H} \tag{1}$$

From Equation (1), $\mathrm{GCS}_{G}$ denotes the size of the Global Conversation State vector. The weights $W_{G,S}, W_{G,H}$ denote the strength of association between neurons, the biases $B_{G,S}, B_{G,H}$ are constant values permitting a neuron to be active when all inputs are close to zero, and $\oplus$ denotes concatenation. Finally, during training, the Attention Bidirectional Recurrent Neural Network fine-tunes both weights and biases through forward and backward processes with the objective of optimizing its emotional footprint predictions on the training data.

3.2.2. Speaker Conversation State

The proposed method keeps track of the state of individual speakers employing definite fixed-size vectors $P_{1}, P_{2}, \dots, P_{M}$ throughout the conversation. These states are characteristic of the state of speakers during a conversation, pertaining to emotional footprint classification. These states are said to be fine-tuned based on the current role of a party in the conversation (i.e., listener or speaker) and the subsequent observed utterance $U_{t}$. The foremost motive of this model is to make certain that the Speaker Conversation State model is aware of the speaker of each utterance and handles it accordingly.

3.2.3. Speaker Conversation Update State

Finally, the speaker usually mounts the feedback on the basis of the statement that is the foregoing or earlier utterances near the conclusion of the conversation. Therefore, we encapsulate statement $\mathrm{Stat}_{t}$ relevant to the utterance $U_{t}$ as given below.

$$\alpha = f_{\theta} \oplus f'_{\theta'}\left(U_{t}^{T} W_{\alpha} [G_{1}, G_{2}, \dots, G_{t-1}]\right) \tag{2}$$

$$f_{\theta} \oplus f'_{\theta'} = \left[(\mathrm{Res}_{0}, \mathrm{Res}'_{0}), (\mathrm{Res}_{1}, \mathrm{Res}'_{1}), \dots, (\mathrm{Res}_{N}, \mathrm{Res}'_{N})\right] \tag{3}$$

$$f_{\theta}(S_{0}, H_{0}) = (\mathrm{Res}_{0}, H_{1}), \quad f_{\theta}(S_{1}, H_{1}) = (\mathrm{Res}_{1}, H_{2}) \tag{4}$$

$$f'_{\theta'}(S_{N}, H'_{N}) = (\mathrm{Res}'_{N}, H'_{N-1}), \quad f'_{\theta'}(\mathrm{Res}_{N-1}, H'_{N-1}) = (\mathrm{Res}'_{N-1}, H'_{N-2}) \tag{5}$$

$$\mathrm{Stat}_{t} = \alpha \, [G_{1}, G_{2}, \dots, G_{t-1}]^{T} \tag{6}$$

From Equations (2) to (6), $G_{1}, G_{2}, \dots, G_{t-1}$ denotes the former $t - 1$ Global Conversation State vector results. Moreover, in Equation (2), attention outcomes $\alpha$ over the former global conversation states representative of the former utterances are calculated. Finally, in Equation (6), the statement vector $\mathrm{Stat}_{t}$ is calculated by combining the former Global Conversation State vector results with $\alpha$. Finally, the identified emotional footprint near the conclusion of the conversation is represented as given below.

$$P_{\mathrm{map}(U_{t}), t} = \mathrm{GRU}_{Upd}\left(P_{\mathrm{map}(U_{t}), t-1}, \left(U_{t} \oplus \mathrm{Stat}_{t}\right)\right) \tag{7}$$

Equation (7) encodes the information on the current utterance along with its statement from the global GRU into the speaker’s state $P_{\mathrm{map}(U_{t})}$ that aids in further emotional footprint classification, shown in the next section. The pseudocode representation of Attention Bidirectional Recurrent Neural Network-based emotional footprint identification is given below.
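The attention step of Equations (2) and (6) can be sketched as follows. This is a simplified illustration rather than the authors' code: it assumes that the attention outcomes $\alpha$ are obtained by a plain softmax over the bilinear scores $U_{t}^{T} W_{\alpha} G_{i}$, which stands in for the bidirectional scoring $f_{\theta} \oplus f'_{\theta'}$ of Equations (3) to (5). The function name `context_vector` and the toy vectors are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a non-empty list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def context_vector(u_t, W_alpha, global_states):
    """Attend over the earlier global states G_1..G_{t-1}.

    score_i = u_t^T W_alpha G_i, alpha = softmax(scores) (assumed form),
    Stat_t = sum_i alpha_i * G_i. Requires at least one earlier state.
    """
    cols = len(W_alpha[0])
    # Project the utterance once: q = u_t^T W_alpha, so score_i = q . G_i.
    q = [sum(u_t[r] * W_alpha[r][c] for r in range(len(u_t))) for c in range(cols)]
    scores = [sum(qc * gc for qc, gc in zip(q, g)) for g in global_states]
    alpha = softmax(scores)
    dim = len(global_states[0])
    stat = [sum(a * g[d] for a, g in zip(alpha, global_states)) for d in range(dim)]
    return alpha, stat

if __name__ == "__main__":
    u_t = [1.0, 0.0]
    W_alpha = [[1.0, 0.0], [0.0, 1.0]]        # identity, for illustration only
    history = [[1.0, 0.0], [0.0, 1.0]]        # toy G_1 and G_2
    alpha, stat = context_vector(u_t, W_alpha, history)
    print(alpha, stat)
```

In the toy call, the first global state is better aligned with the current utterance, so it receives the larger weight and dominates the resulting statement vector.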

Algorithm 1 describes a step-by-step process of obtaining computationally efficient emotional footprints. The raw data obtained from the Two-Party Conversation along with Emotional Footprint and Emotional Intensity datasets are applied to identify the emotional footprint of the members participating in the conversation near the conclusion of the conversation. The overall process is split into three states, namely, Global Conversation State, Speaker Conversation State and Speaker Conversation Update State. Initially, in the Global Conversation State, the statement of a given utterance is obtained cooperatively by encoding utterance and speaker state, following which, the Speaker Conversation State, in turn, keeps track of the state of individual speakers using definite fixed-size vectors throughout the conversation.

Algorithm 1 Attention Bidirectional Recurrent Neural Network-based emotional footprint identification

Input: Dataset $DS$, Samples $S = \{S_{1}, S_{2}, \dots, S_{N}\}$, Features $F = \{F_{1}, F_{2}, \dots, F_{n}\}$, Speakers $\mathrm{Sp}_{1}, \mathrm{Sp}_{2}, \dots, \mathrm{Sp}_{M}$, utterances $U = \{U_{1}, U_{2}, \dots, U_{m}\}$

Output: Computationally efficient emotional footprint $P_{\mathrm{map}(U_{t}), t}$

1: Initialize $N = 1857$, $n = 48$, $M = 2$, $m = 6$
2: Begin
3: For each Dataset $DS$ with Speakers $\mathrm{Sp}_{1}, \mathrm{Sp}_{2}, \dots, \mathrm{Sp}_{M}$, Samples $S$ and utterances $U$
//Global Conversation State
4: Formulate Global Conversation State according to (1)
//Speaker Conversation State
5: Keep track of individual speakers using fixed-size vectors $P_{1}, P_{2}, \dots, P_{M}$ throughout the conversation
//Speaker Conversation Update State
6: Formulate Speaker Conversation Update State according to (2), (3), (4), (5) and (6)
7: Obtain emotional footprint according to (7)
8: Return emotional footprint $P_{\mathrm{map}(U_{t}), t}$
9: End for
10: End

By combining these states, the Attention Bidirectional Recurrent Neural Network model keeps track of the individual speaker states throughout the conversation and utilizes this information for emotion classification, therefore minimizing the training time extensively. Applying the attention scores together with the bidirectional pass gives statement representations that contain information on all preceding utterances by each speaker during the conversation. This, in turn, makes certain that at each time instance, the speaker state obtains information from the speaker’s previous state and from the Global Conversation State, which has information on the preceding parties. Finally, the Speaker Conversation Update State is used to decode the emotion footprint representation of a given utterance, which is then used for emotion footprint classification, therefore improving the overall accuracy involved.
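The control flow of Algorithm 1 can be sketched end to end as below. This is an illustrative skeleton under strong simplifying assumptions, not the authors' trained model: `toy_cell` is a hypothetical stand-in for the GRU cells of Equations (1) and (7), the attention uses a plain dot product with softmax (that is, $W_{\alpha}$ is taken as the identity), and all vectors share one size. Its purpose is only to show the order of the state updates and that only the current speaker's state changes at each step.

```python
import math

def toy_cell(state, inp):
    """Stand-in for a trained GRU cell (hypothetical): folds the concatenated
    input back to the state size, then squashes it together with the state."""
    d = len(state)
    folded = [sum(inp[i] for i in range(j, len(inp), d)) for j in range(d)]
    return [math.tanh(0.5 * s + 0.5 * f) for s, f in zip(state, folded)]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(u_t, history):
    """Statement vector over earlier global states; zeros if there are none."""
    if not history:
        return [0.0] * len(u_t)
    scores = [sum(a * b for a, b in zip(u_t, g)) for g in history]
    alpha = softmax(scores)
    return [sum(a * g[d] for a, g in zip(alpha, history)) for d in range(len(u_t))]

def emotional_footprints(utterances, speakers, num_speakers=2):
    """Return the final per-speaker state vectors P_1..P_M."""
    d = len(utterances[0])
    g = [0.0] * d                                  # global conversation state
    history = []                                   # G_1..G_{t-1}
    p = [[0.0] * d for _ in range(num_speakers)]   # speaker states
    for u_t, who in zip(utterances, speakers):
        stat = attend(u_t, history)                # Eqs. (2)-(6), simplified
        g = toy_cell(g, u_t + p[who])              # Eq. (1): uses P at t-1
        history.append(g)
        p[who] = toy_cell(p[who], u_t + stat)      # Eq. (7): speaker update
    return p

if __name__ == "__main__":
    utterances = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]   # toy utterance vectors
    speakers = [0, 1, 0]                                # index of each speaker
    print(emotional_footprints(utterances, speakers))
```

In the actual method, the final speaker states would be passed to the classifier of the next section. Here they are simply returned so that the update order can be inspected.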

3.3. Zero-Shot Learning-Based Classifier

Emotion footprint classification is the task of associating a conversation with a human emotion. State-of-the-art methods are typically learned using hand-crafted affective lexicons. In this work, we want to detect emotion footprints and their evolution for each party near the conclusion of the conversation. This setting raises several issues, ranging from small and mostly unlabeled datasets to the need to identify and adapt methods for such situations. To handle these issues, a Zero-Shot Learning-based classifier is applied for emotional footprint classification in a precise manner with minimal error. Figure 3 shows the structure of the Zero-Shot Learning-based classifier model.

By comparison, the traditional Zero-Shot Learning-based classifier concentrates on transferring knowledge between speakers to provide supplementary information for recognizing unobserved classes. Nevertheless, these mechanisms are laborious to use because emotion is latent in conversation, which makes it difficult to derive emotional descriptors, and because distinct affective emotions may result in different emotional descriptors. In emotional footprint classification, zero-shot failures on low-intensity emotions are addressed by an attribute-learning phase that measures the relationship between generic semantic features and specific emotional categories. Hence, a Zero-Shot Learning-based classifier, as shown in Figure 3, is designed that uses emotional footprints to associate semantic features (i.e., conv_length, inform, question, directive and commissive) with emotional states. Using attribute learning, the classifier harnesses these semantic features to learn the emotional categories ‘EC’ (i.e., no emotion, anger, disgust, fear, happiness, sadness and surprise).

The attribute-learning phase aims at fitting emotional category values (known emotion labels, i.e., no emotion, anger, disgust, fear, happiness, sadness and surprise) to conversation samples using the corresponding semantic features (i.e., conv_length, inform, question, directive and commissive). This attribute-learning phase models the correlation between the ‘l_{F}’ features and the ‘l_{EC}’ emotional categories. The ‘m^{(S)}’ utterance samples belonging to observed (known) classes and emotional states used in attribute learning are mathematically expressed as given below.

P^{(OC)} = [P_{1}^{(OC)}, P_{2}^{(OC)}, \dots, P_{m^{(OC)}}^{(OC)}]^{T} \in \mathbb{R}^{m^{(OC)} \times l_{F}}

(8)

From Equation (8), ‘l_{F}’ denotes the feature dimensionality. The corresponding emotional categories ‘{EC}^{(OC)} \in \mathbb{R}^{m^{(OC)} \times l_{EC}}’ of ‘P^{(OC)}’ are mathematically represented as given below.

{EC}^{(OC)} = [{EC}_{1}^{(OC)}, {EC}_{2}^{(OC)}, \dots, {EC}_{m^{(OC)}}^{(OC)}]^{T} = [α_{1}^{(OC)}, α_{2}^{(OC)}, \dots, α_{l_{EC}}^{(OC)}]

(9)

Thus, the task of attribute learning is to learn the correlation between ‘P^{(OC)}’ and each category. The mapping of category ‘i’ from ‘P^{(OC)}’ is defined as

f_{i}(P^{(OC)}) = [f_{i}(P_{1}^{(OC)}), f_{i}(P_{2}^{(OC)}), \dots, f_{i}(P_{m^{(OC)}}^{(OC)})]^{T}

(10)

Then, the mapping function for the prediction task is represented as given below.

\hat{f}_{i}(\cdot) = \arg\max \; SM(f_{i}(P^{(OC)}), α_{i}^{(OC)}), \quad \text{such that } φ^{(OC)}(f_{i}) \in Ω^{(OC)}

(11)

From Equation (11), ‘f_{i}’ denotes the semantic feature mapping of the corresponding ‘i-th’ category, and ‘SM’ denotes the similarity measurement between the two vectors ‘f_{i}(P^{(OC)})’ and ‘α_{i}^{(OC)}’, with the regularization term ‘φ^{(OC)}’ subject to a conditional vector set ‘Ω^{(OC)}’. Finally, with the assumption that similar mappings are shared from semantic features to emotional categories, the ‘l_{EC}’-dimensional predicted emotional categories are obtained as given below.

{\hat{P}}^{(UOC)} = [{\hat{f}}_{1}(P^{(UOC)}), {\hat{f}}_{2}(P^{(UOC)}), \dots, {\hat{f}}_{l_{EC}}(P^{(UOC)})]^{T}

(12)

From Equation (12), at test time, the model observes samples from classes that were not observed during training and is required to predict the emotional footprint class to which they belong, both precisely and with minimal error. The pseudocode representation of the Zero-Shot Learning-based classifier is given below.

Algorithm 2 describes the step-by-step process of the classifier model employing Zero-Shot Learning to improve precision with minimal error. Given the identified emotional footprint at the end of a conversation, together with the Speaker, Samples, Emotional Categories and utterances as input, attribute learning is modeled using the Zero-Shot Learning model. First, utterance samples belonging to observed (known) classes and emotional states are obtained. Second, the corresponding emotional categories are formulated. Third, a mapping function is formulated to obtain the dimensional predicted emotional categories, which yield the classified results in a precise manner with minimal error. In this manner, unseen emotional footprints are classified by comparing their semantic attribute features with the attributes of the observed classes.

Algorithm 2 Zero-Shot Learning-based classifier

Input: Dataset ‘DS’, Samples ‘S = \{S_{1}, S_{2}, \dots, S_{N}\}’, Features ‘F = \{F_{1}, F_{2}, \dots, F_{n}\}’, Speaker ‘S_{1}, S_{2}, \dots, S_{M}’, utterances ‘U = \{U_{1}, U_{2}, \dots, U_{m}\}’, Emotional Categories ‘EC = \{{EC}_{1}, {EC}_{2}, \dots, {EC}_{l}\}’

Output: Precise and accurate classifier

1: Initialize ‘N = 1857’, ‘n = 48’, ‘M = 2’, ‘m = 6’, ‘l = 6’, regularization term ‘φ^{(S)} = 0.01’
2: Begin
3: For each Dataset ‘DS’ with Speaker ‘S_{1}, S_{2}, \dots, S_{M}’, Samples ‘S’ and utterances ‘U’
//Attribute Learning
4: Formulate ‘m^{(S)}’ utterance samples belonging to observed (known) classes and emotional states according to (8)
5: Formulate subsequent emotional categories according to (9)
6: Define mapping function according to (10) and (11)
7: Obtain ‘l_{EC}’-dimensional predicted emotional categories according to (12)
8: Return classified results ‘CR’
9: End for
10: End
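The attribute-learning idea behind Equations (8) to (12) can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the mapping is assumed to be linear and fitted by ridge regression with the regularization value 0.01 from Algorithm 2, the similarity measure ‘SM’ is assumed to be cosine similarity, and the two-dimensional features, class signatures and function names are made up for the example.

```python
import math

def ridge_fit(X, Y, lam=0.01):
    """Solve (X^T X + lam * I) W = X^T Y with Gauss-Jordan elimination.

    X: observed-class samples (rows) by semantic features (columns).
    Y: attribute (category signature) vector for each sample.
    Returns W with one row per feature and one column per attribute.
    """
    d, k = len(X[0]), len(Y[0])
    gram = [[sum(x[i] * x[j] for x in X) + (lam if i == j else 0.0)
             for j in range(d)] for i in range(d)]
    rhs = [[sum(x[i] * y[c] for x, y in zip(X, Y)) for c in range(k)]
           for i in range(d)]
    aug = [gram[i] + rhs[i] for i in range(d)]
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(d):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[d:] for row in aug]

def predict_attributes(x, W):
    """Map one feature vector into the attribute space."""
    return [sum(x[i] * W[i][c] for i in range(len(x))) for c in range(len(W[0]))]

def cosine(u, v):
    """Assumed similarity measure between two vectors."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)

def classify(x, W, signatures):
    """Pick the class whose signature is most similar to the predicted attributes."""
    attrs = predict_attributes(x, W)
    return max(signatures, key=lambda name: cosine(attrs, signatures[name]))

# Hypothetical toy data: two observed classes and one class unseen in training.
signatures = {"anger": [1.0, 0.0], "happiness": [0.0, 1.0], "surprise": [1.0, 1.0]}
X_observed = [[1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [0.1, 0.9]]
Y_observed = [signatures["anger"], signatures["anger"],
              signatures["happiness"], signatures["happiness"]]

W = ridge_fit(X_observed, Y_observed, lam=0.01)
unseen_prediction = classify([0.8, 0.9], W, signatures)
```

The mapping is fitted only on the observed classes; the unseen class is recognized purely through its signature, which is the sense in which similar mappings are assumed to be shared between observed and unobserved classes.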

4. Experimental Setup

In this section, experiments are performed to validate the efficiency of the Attention Bidirectional Recurrent Neural Zero-Shot Semantic Classifier (ABRN-ZSSC) for emotional footprint identification using Python (version 3.11.2) on the Two-Party Conversation with Emotional Footprint and Emotional Intensity dataset, obtained from https://data.mendeley.com/datasets/fvfjp6n3x9/1 (accessed on: 15 November 2023). Comparison is made with three existing methods: Emotional Voice Conversion with Emotion Intensity Control (EMOVOX) [1], Multi-task deep Cross-Attention Network (MTCANet) [2] and TTNet [3]. To ensure fair comparison, validation is made using the same dataset for all four methods, ABRN-ZSSC, EMOVOX [1], MTCANet [2] and TTNet [3].

The objective of the proposed ABRN-ZSSC method is to discover the emotional footprint of the members participating in the conversation with improved accuracy, reduced time and minimal error rate. Based on this objective, the existing methods EMOVOX [1], MTCANet [2] and TTNet [3] are taken as baselines. EMOVOX [1] was developed for emotional voice conversion with emotion intensity control. MTCANet [2] is a multi-task cross-attention network reported to achieve higher accuracy. These three baselines are relevant to the task and are compared against the proposed method. The concept of the proposed method is derived by considering the limitations of these baselines, which the proposed method is designed to avoid.

The experiments are run on a computer with an Intel(R) Core(TM) i5-7200 CPU @ 2.50 GHz and 8.00 GB of RAM. To validate the experiments, cross-validation is employed to assess the ability of the model to generalize to unseen data. The dataset is divided into two sets, training and testing: 70% of the samples were used for training, and the remaining 30% were used for testing. In the experiments, performance is quantified using several metrics, namely accuracy, precision, recall and time for emotional footprint identification. Accuracy refers to the proportion of input samples from the dataset whose emotions the model correctly identifies and classifies. Precision measures the correctness of the emotion-specific classifications made by the model. Recall, also known as sensitivity, measures the ability of the model to correctly identify actual positive cases. Time is defined as the amount of time consumed by the algorithm for emotional footprint identification. These metrics are measured using 10-fold cross-validation. Hyperparameters are described in Table 3.
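The two evaluation protocols mentioned above, a 70/30 hold-out split and 10-fold cross-validation, can be sketched with the standard library alone. The text does not specify how the two are combined, whether the split is stratified, or which random seed is used, so the shuffling, the seed and the function names below are assumptions made for illustration.

```python
import random

def holdout_split(n_samples, train_fraction=0.7, seed=0):
    """Shuffle sample indices and split them into training and testing sets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    cut = int(round(train_fraction * n_samples))
    return indices[:cut], indices[cut:]

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]      # k nearly equal folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# N = 1857 samples, as initialized in Algorithms 1 and 2.
train_idx, test_idx = holdout_split(1857, train_fraction=0.7)
cv_folds = list(k_fold_indices(1857, k=10))
```

With 1857 samples, the hold-out split yields 1300 training and 557 testing indices, and each of the ten folds holds out 185 or 186 samples in turn.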

Hyperparameter selection is the process of choosing the best configuration for the parameters that control the learning algorithm’s performance. During learning, the rationale for these selections is rooted in a balance of model performance, computational efficiency, and preventing common drawbacks such as overfitting or underfitting. Learning rate or batch size is crucial for the stability and speed of the optimization process. The balancing bias and variance goal aims to find an optimal balance that generalizes well to unseen data. The rationale for hyperparameters during the inference phase is focused on fixed performance, speed, and reliability in a production environment.

5. Performance Comparison Analysis

In this section, the proposed ABRN-ZSSC method is compared with the existing EMOVOX [1], MTCANet [2] and TTNet [3] methods using various metrics, namely accuracy, precision, recall, time and error rate. The proposed ABRN-ZSSC method aims to outperform these conventional methods on each metric. As shown in Table 4, the implementation of the Zero-Shot Learning-based classifier algorithm led to a notable enhancement in both accuracy and precision. Table 5 further demonstrates a reduction in processing time when the model is employed. This efficiency gain can be attributed to the more accurate estimation of state evaluation results, facilitated by incorporating both the Global Conversation State and the Speaker Conversation State. Table 6 provides evidence of a reduced error rate. Finally, Table 7 illustrates an increased recall rate achieved with the ABRN-ZSSC method.

5.1. Performance Analysis of Precision and Accuracy

Accuracy and precision are two performance measures with respect to error. On one hand, accuracy is how close a given set of measurements (observations or readings) are to their true value, whereas, on the other hand, precision is how close the measurements are to each other. Accuracy is also utilized as a statistical measure of how well a classification test (i.e., emotional footprint identification test) correctly identifies or excludes a condition. In other words, accuracy is defined as the ratio of correct predictions (i.e., including both true positives and true negatives) among the total number of samples being examined. Accuracy is represented as given below.

Acc = \frac{TP + TN}{TP + TN + FP + FN}

(13)

From Equation (13), accuracy or emotional footprint identification accuracy ‘Acc’ is measured using the true-positive value ‘TP’ (i.e., anger emotional footprint identified as anger), true-negative value ‘TN’ (i.e., disgust emotional footprint identified as disgust), false-positive value ‘FP’ (i.e., disgust emotional footprint identified as anger) and false-negative value ‘FN’ (i.e., anger emotional footprint identified as disgust). Precision is measured based on the number of true positives and false positives and is defined as given below.

Pre = \frac{TP}{TP + FP}

(14)

From Equation (14), precision ‘Pre’ is measured using the true-positive instances ‘TP’ and the false-positive instances ‘FP’. To verify the effectiveness of the proposed ABRN-ZSSC method in the emotion footprint classification task, we conducted experimental comparisons against three baseline methods, EMOVOX [1], MTCANet [2] and TTNet [3], on the abovementioned dataset in terms of precision and accuracy, as shown in Table 4.
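As a worked example of Equations (13) and (14), together with the recall measure used later in the evaluation, the following sketch computes the three metrics from confusion counts, treating anger as the positive class. The counts are hypothetical: they assume 180 samples (165 anger, 15 disgust) with 12 errors split into 7 false positives and 5 false negatives, a split that is not given in the text.

```python
def accuracy(tp, tn, fp, fn):
    """Equation (13): share of all samples that are classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

def precision(tp, fp):
    """Equation (14): share of predicted positives that are truly positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    """Share of actual positives that are identified as positive."""
    return tp / (tp + fn) if (tp + fn) else 0.0

# Hypothetical confusion counts with anger as the positive class.
tp, fn = 160, 5      # anger identified as anger / anger identified as disgust
tn, fp = 8, 7        # disgust identified as disgust / disgust identified as anger

acc = accuracy(tp, tn, fp, fn)   # 168 / 180
pre = precision(tp, fp)          # 160 / 167
rec = recall(tp, fn)             # 160 / 165
```

The guards against a zero denominator return 0.0, which is one common convention when a class is never predicted or never present.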

Table 4. Experimental results of precision and accuracy using ABRN-ZSSC, EMOVOX [1], MTCANet [2] and TTNet [3].

Samples	Precision				Accuracy
Samples	ABRN-ZSSC	EMOVOX [1]	MTCANet [2]	TTNet [3]	ABRN-ZSSC	EMOVOX [1]	MTCANet [2]	TTNet [3]
180	0.95	0.9	0.87	0.93	0.93	0.87	0.84	0.9
360	0.92	0.86	0.76	0.9	0.9	0.8	0.63	0.84
540	0.89	0.83	0.73	0.87	0.85	0.75	0.63	0.8
720	0.85	0.78	0.68	0.82	0.87	0.77	0.65	0.83
900	0.83	0.77	0.67	0.8	0.9	0.8	0.68	0.86
1080	0.84	0.78	0.67	0.82	0.85	0.75	0.63	0.81
1260	0.87	0.81	0.71	0.84	0.81	0.71	0.59	0.77
1440	0.9	0.84	0.74	0.87	0.81	0.72	0.6	0.77
1620	0.9	0.84	0.74	0.88	0.85	0.75	0.63	0.8
1800	0.91	0.85	0.75	0.89	0.87	0.77	0.65	0.82

A comparison of the precision rate for emotional footprint classification on the Two-Party Conversation with Emotional Footprint and Emotional Intensity dataset is shown in Figure 4. From the figure, it is inferred that the precision rate of all methods is relatively high on this dataset, with the proposed ABRN-ZSSC method showing an advantage over the EMOVOX [1] and TTNet [3] methods and a greater advantage over the MTCANet [2] method. As Table 4 shows, ABRN-ZSSC maintains higher precision than the methods [1,2,3] at every sample size. This is attributed to the application of the Zero-Shot Learning-based classifier algorithm. With this algorithm, the emotional footprint identified at the end of a conversation is provided as input to the classifier along with the Speaker, Samples and Emotional Categories via attribute learning. Utterance samples belonging to observed (known) classes and emotional states are obtained first; next, the corresponding emotional categories are formulated; and, finally, a mapping function is applied. This improved the overall precision of the proposed ABRN-ZSSC method by 7% compared to [1], 21% compared to [2] and 3% compared to [3].

Figure 5 illustrates the accuracy of the proposed ABRN-ZSSC method for different numbers of samples. As reported in Table 4, the accuracy of the proposed ABRN-ZSSC method was 93% with 180 samples and 87% with 1800 samples. Over the same range, the accuracy of the EMOVOX [1] method went from 87% to 77%, that of [2] from 84% to 65%, and that of [3] from 90% to 82%. The higher accuracy of the proposed ABRN-ZSSC method at every sample size is attributed to the Attention Bidirectional Recurrent Neural Network-based emotional footprint identification algorithm, which sustains the balance between intensification and extensification processes via the Speaker Conversation Update State. This, in turn, improves the overall accuracy of the proposed ABRN-ZSSC method by 12% in comparison with [1], 33% compared to [2] and 5% compared to [3].
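The text does not state how the overall improvement percentages are averaged. One plausible reading, assumed here, is the mean over the ten sample sizes of the per-row relative gain, (proposed - baseline) / baseline. The sketch below applies that assumption to the Table 4 values; the function names are illustrative.

```python
# Table 4 values copied from the text (rows: 180, 360, ..., 1800 samples).
PRECISION = {
    "ABRN-ZSSC":   [0.95, 0.92, 0.89, 0.85, 0.83, 0.84, 0.87, 0.90, 0.90, 0.91],
    "EMOVOX [1]":  [0.90, 0.86, 0.83, 0.78, 0.77, 0.78, 0.81, 0.84, 0.84, 0.85],
    "MTCANet [2]": [0.87, 0.76, 0.73, 0.68, 0.67, 0.67, 0.71, 0.74, 0.74, 0.75],
    "TTNet [3]":   [0.93, 0.90, 0.87, 0.82, 0.80, 0.82, 0.84, 0.87, 0.88, 0.89],
}
ACCURACY = {
    "ABRN-ZSSC":   [0.93, 0.90, 0.85, 0.87, 0.90, 0.85, 0.81, 0.81, 0.85, 0.87],
    "EMOVOX [1]":  [0.87, 0.80, 0.75, 0.77, 0.80, 0.75, 0.71, 0.72, 0.75, 0.77],
    "MTCANet [2]": [0.84, 0.63, 0.63, 0.65, 0.68, 0.63, 0.59, 0.60, 0.63, 0.65],
    "TTNet [3]":   [0.90, 0.84, 0.80, 0.83, 0.86, 0.81, 0.77, 0.77, 0.80, 0.82],
}

def mean_relative_gain(proposed, baseline):
    """Mean per-row relative gain of the proposed method, in percent."""
    gains = [(p - b) / b for p, b in zip(proposed, baseline)]
    return 100.0 * sum(gains) / len(gains)

def gains_over_baselines(table, proposed="ABRN-ZSSC"):
    """Relative gain of the proposed method over every other column."""
    return {name: mean_relative_gain(table[proposed], values)
            for name, values in table.items() if name != proposed}

precision_gains = gains_over_baselines(PRECISION)
accuracy_gains = gains_over_baselines(ACCURACY)
```

Under this averaging assumption, the computed gains round to the percentages reported for precision (7%, 21% and 3%) and accuracy (12%, 33% and 5%).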

5.2. Performance Analysis of Training Time

Training time is measured as the amount of time consumed during the process of emotional footprint identification based on the total number of samples ‘S_{i}’. The training time is represented as given below.

TT = \sum_{i=1}^{N} S_{i} \times Time(P_{map(U_{t}), t})

(15)

From Equation (15), the training time or emotional footprint identification time ‘TT’ is measured based on the total number of tagged conversations or number of parties near the conclusion of the conversation ‘S_{i}’ and the time involved in identifying the emotional footprint near the conclusion of the conversation ‘Time(P_{map(U_{t}), t})’. It is measured in seconds (s). To verify the efficiency of the proposed ABRN-ZSSC method in the emotion footprint classification task, we conducted experimental comparisons against three baseline methods, EMOVOX [1], MTCANet [2] and TTNet [3], on the abovementioned dataset in terms of training time, as shown in Table 5.

Table 5. Experimental results of training time using ABRN-ZSSC, EMOVOX [1], MTCANet [2] and TTNet [3].

Samples	Training Time (s)
Samples	ABRN-ZSSC	EMOVOX [1]	MTCANet [2]	TTNet [3]
180	63	77.4	99	70
360	85	115	135	97
540	105	135	155	122
720	135	165	185	148
900	142	172	192	159
1080	155	185	205	174
1260	185	215	235	197
1440	200	230	250	213
1620	215	245	265	226
1800	245	275	295	258

Figure 6 shows the training time of the proposed ABRN-ZSSC method under a varying number of samples. The training time of the proposed ABRN-ZSSC method increased from 63 s to 245 s as the number of samples increased from 180 to 1800. Similarly, the training time of EMOVOX [1] increased from 77.4 s to 275 s, that of MTCANet [2] from 99 s to 295 s, and that of TTNet [3] from 70 s to 258 s. The lower training time of the proposed ABRN-ZSSC method is mainly due to the appropriate estimation of state evaluation results via the Global Conversation State and Speaker Conversation State, which are combined using the Attention Bidirectional Recurrent Neural Network during emotional footprint identification. The results indicate that the proposed ABRN-ZSSC method reduces the training time by 21% compared to [1], 26% compared to [2] and 9% compared to [3].

5.3. Performance Analysis of Error Rate or Misclassification Error

In this section, the misclassification error or error rate is defined as the ratio of the number of samples whose emotion expressions are incorrectly identified and classified to the total number of samples. Error rate refers to the inaccurate detection of emotional footprints (i.e., anger samples detected as disgust and vice versa) and is mathematically formulated as given below.

ER = \sum_{i=1}^{N} \frac{S_{IAD}}{S_{i}} \times 100

(16)

From Equation (16), the error rate or the false identification of emotional footprint at the end of a conversation ‘ER’ is measured based on the samples involved in the simulation process ‘S_{i}’ and the samples with inaccurate detection of emotional footprint ‘S_{IAD}’. It is measured as a percentage (%). We conducted experimental comparisons between ABRN-ZSSC and three baseline methods, EMOVOX [1], MTCANet [2] and TTNet [3], on the abovementioned dataset in terms of error rate, as shown in Table 6.

Table 6. Experimental results of error rate using ABRN-ZSSC, EMOVOX [1], MTCANet [2] and TTNet [3].

Samples	Error Rate (%)
Samples	ABRN-ZSSC	EMOVOX [1]	MTCANet [2]	TTNet [3]
180	6.66	8.33	11.11	7.7
360	7.85	9.55	12.35	8.5
540	8.35	12	14.85	10.35
720	9.25	13.35	15.55	11.25
900	12.15	14.15	15	12.95
1080	11.55	13.25	14.35	12.25
1260	11	12.55	14	11.85
1440	11	12	13.85	11.45
1620	10	11.55	12.55	11
1800	12	13.55	14.35	12.65

Figure 7 shows the error rate of the emotional footprint classifier for up to 1800 samples. From the figure, two inferences are made. First, the error rate of the four methods was found to be neither directly nor inversely proportional to the number of samples provided as input. Second, the error rate of the proposed ABRN-ZSSC method was found to be lower than that of [1,2,3]. This is evident from the simulation setup with 180 samples, where 165 samples had an anger emotional footprint and 15 samples had a disgust emotional footprint provided as input. The numbers of inaccurate detections using the four methods, ABRN-ZSSC and [1,2,3], were observed to be 12, 15, 20 and 14, with overall error rates of 6.66%, 8.33%, 11.11% and 7.7%, respectively. The reduction in error rate using the ABRN-ZSSC method can be attributed to the application of the Zero-Shot Learning-based classifier algorithm. With this algorithm, in addition to the identified emotional footprints, semantic features and emotion categories are applied during the classification stage. Identifying high correlation between features and emotional categories via attribute learning aided in reducing the misclassification error of the ABRN-ZSSC method by 17% in comparison with [1], 28% in comparison with [2] and 10% in comparison with [3].
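The 180-sample example can be checked directly against Equation (16) for a single sample set. The misdetection counts are the ones quoted in the text; the dictionary layout and function name are only illustrative.

```python
def error_rate(misdetected, total):
    """Equation (16) for one sample set: misdetected samples as a percentage."""
    return 100.0 * misdetected / total

# Misdetection counts quoted for the 180-sample setup.
misdetected_counts = {
    "ABRN-ZSSC": 12,
    "EMOVOX [1]": 15,
    "MTCANet [2]": 20,
    "TTNet [3]": 14,
}

error_rates = {name: error_rate(count, 180)
               for name, count in misdetected_counts.items()}
```

The resulting values are 12/180, 15/180, 20/180 and 14/180 expressed as percentages. Rounded to two decimals these are 6.67, 8.33, 11.11 and 7.78, so the 6.66% and 7.7% in Table 6 appear to be truncated rather than rounded.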

5.4. Performance Analysis of Recall

Recall ‘Rec’ is defined as the ratio of the number of true positives (i.e., anger emotional footprints accurately identified as anger) to the total number of positive samples. It is formulated as follows.

Rec = \frac{TP}{TP + FN}

(17)

To verify the efficiency of the proposed ABRN-ZSSC method in the emotion footprint classification task, we conducted experimental comparisons against three baseline methods, EMOVOX [1], MTCANet [2] and TTNet [3], on the abovementioned dataset in terms of recall, as shown in Table 7.

Table 7. Experimental results of recall using ABRN-ZSSC, EMOVOX [1], MTCANet [2] and TTNet [3].

Samples	Recall
Samples	ABRN-ZSSC	EMOVOX [1]	MTCANet [2]	TTNet [3]
180	0.97	0.95	0.94	0.96
360	0.94	0.88	0.82	0.92
540	0.92	0.86	0.81	0.9
720	0.9	0.84	0.78	0.87
900	0.89	0.82	0.75	0.86
1080	0.87	0.8	0.73	0.84
1260	0.91	0.85	0.72	0.88
1440	0.92	0.86	0.75	0.89
1620	0.93	0.87	0.77	0.9
1800	0.95	0.89	0.79	0.92

Figure 8 shows the recall achieved by the proposed ABRN-ZSSC and the existing EMOVOX [1], MTCANet [2] and TTNet [3]. The horizontal axis denotes the number of samples, while the vertical axis represents recall. Among the four methods, ABRN-ZSSC exhibits higher recall than EMOVOX [1], MTCANet [2] and TTNet [3], which is attributed to the application of the Attention Bidirectional Recurrent Neural Network. To evaluate the overall improvement, the percentages were calculated in relation to the existing methods. With 1800 samples, as reported in Table 7, ABRN-ZSSC achieved a recall of 0.95, while the existing methods [1], [2] and [3] achieved 0.89, 0.79 and 0.92, respectively. Overall, the recall of ABRN-ZSSC is improved by 7%, 14% and 3% compared to [1], [2] and [3], respectively.

5.5. Ablation Study

An ablation study systematically removes individual components of a network in order to gain a better understanding of how each component contributes to the network's behavior. An ablation study is carried out on ABRN-ZSSC to determine how its components contribute to emotional footprint identification and to the overall performance. The results of the ablation experiments for ABRN-ZSSC are reported in Table 8.

Table 8 compares the performance of four configurations, namely the attention mechanism, Bidirectional RNN and ZSL components, and the full ABRN-ZSSC method, using the Two-Party Conversation with Emotional Footprint and Emotional Intensity dataset. The best results were achieved using the full ABRN-ZSSC method, which attained accuracy, precision and recall of 0.87, 0.83 and 0.9, with time and error reduced to 145 ms and 9.23%.

5.6. T-Test

A t-test is a statistical tool used here to validate and quantify differences in data related to emotion identification. It is applied to statistically compare the performance results of the Zero-Shot Learning classifier for emotional footprint identification when using different sets of semantic features (conv_length including inform). Figure 9 shows the t-test outcome for the proposed ABRN-ZSSC method. In this figure, the x-axis denotes the groups of semantic features, and the y-axis indicates the values of the proposed ABRN-ZSSC method. The resulting p-value is 0.84.

6. Conclusions

In this study, a novel deep learning method named ABRN-ZSSC was developed for the identification and classification of emotional footprints for each party near the conclusion of a conversation. First, the training time is reduced by using a Bidirectional RNN that considers both the past and the future of a conversation. Second, the emotional footprint is classified using a Zero-Shot Learning-based classifier algorithm with higher accuracy and lower error. The proposed ABRN-ZSSC method is assessed on various numbers of samples, and its performance is compared to existing methods, namely EMOVOX, MTCANet and TTNet, using different performance metrics. Experimental results indicate that the ABRN-ZSSC method efficiently identifies and classifies emotional footprints near the conclusion of the conversation, achieving more stable and acceptable lasting impressions on the listener. The key findings are as follows: on average, the proposed ABRN-ZSSC method achieved 10% higher precision, 17% higher accuracy and 8% higher recall when compared to EMOVOX [1], MTCANet [2] and TTNet [3]. The ABRN-ZSSC method also reduces the training time by 19% and the error rate by 18% when compared to existing methods. The limitations of the proposed method are as follows: optimal features were not selected using optimization, and several metrics, such as specificity and space complexity, were not considered or estimated. In future work, the proposed method will be extended with novel deep learning and optimization methods for emotional footprint identification. The specificity and space complexity will also be measured to enhance the emotional footprint identification performance.

Author Contributions

All the authors contributed equally to the conceptualization, formal analysis, investigation, methodology, writing, and editing of the original draft. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding authors.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Zhou, K.; Sisman, B.; Rana, R.; Schuller, B.W.; Li, H. Emotion Intensity and its Control for Emotional Voice Conversion. IEEE Trans. Affect. Comput. 2022, 14, 31–48.
2. Liang, X.; Zhang, Z.; Xu, R. Multi-task deep cross-attention networks for far-field speaker verification and keyword spotting. EURASIP J. Audio Speech Music. Process. 2023, 2023, 28.
3. Tran, T.-D. TTNet: A novel machine learning model for facial emotion detection in online learning systems. SoftwareX 2024, 27, 101787.
4. Novais, R.; Cardoso, P.J.S.; Rodrigues, J.M.F. Emotion Classification from Speech by an Ensemble Strategy. In Proceedings of the 10th International Conference on Software Development and Technologies for Enhancing Accessibility and Fighting Info-Exclusion, Lisbon, Portugal, 31 August 2022–2 September 2022; Association for Computing Machinery: New York, NY, USA, 2022.
5. Asghar, M.Z.; Lajis, A.; Alam, M.M.; Rahmat, M.; Nasir, H.M.; Ahmad, H.; Al-Rakhami, M.S.; Al-Amri, A.; Al-bogamy, F.R. A Deep Neural Network Model for the Detection and Classification of Emotions from Textual Content. Complexity 2022, 2022, 8221121.
6. Bharti, S.K.; Varadhaganapathy, S.; Gupta, R.K.; Shukla, P.K.; Bouye, M.; Hingaa, S.K.; Mahmoud, A. Text-Based Emotion Recognition Using Deep Learning Approach. Comput. Intell. Neurosci. 2022, 2022, 2645381.
7. Chowanda, A.; Muliono, Y. Emotions Classification from Speech with Deep Learning. Int. J. Adv. Comput. Sci. Appl. 2022, 13.
8. Kentour, M.; Lu, J. An investigation into the deep learning approach in sentimental analysis using graph based theories. PLoS ONE 2021, 16, e0260761.
9. Mirheidari, B.; Bittar, A.; Cummins, N.; Downs, J.; Fisher, H.; Christensen, H. Automatic detection of expressed emotion from Five-Minute Speech Samples: Challenges and opportunities. PLoS ONE 2024, 19, e0300518.
10. Saju, B.; Tressa, N.; Dhanaraj, R.K.; Tharewal, S.; Mathew, J.C.; Pelusi, D. Effective multi-class lung disease classification using the hybrid feature engineering mechanism. Math. Biosci. Eng. 2023, 20, 20245–20273.
11. Zhang, Y.; Xu, K.; Xie, C.; Gao, Z. Emotion inference in conversations based on commonsense enhancement and graph structures. PLoS ONE 2024, 19, e0315039.
12. Barakat, H.; Turk, O.; Demiroglu, C. Deep learning-based expressive speech synthesis: A systematic review of approaches, challenges, and resources. EURASIP J. Audio Speech Music. Process. 2024, 2024, 11.
13. Mehta, D.; Faridu, M.; Siddiqui, H.; Javaid, A.Y. Recognition of Emotion Intensities Using Machine Learning Algorithms: A Comparative Study. Sensors 2019, 19, 1897.
14. Murugappan, M.; Mutawa, A. Facial geometric feature extraction based emotional expression classification using machine learning algorithms. PLoS ONE 2021, 16, e0247131.
15. Swain, M.; Maji, B.; Kabisatpathy, P.; Routray, A. A DCRNN-based ensemble classifier for speech emotion recognition in Odia language. Complex Intell. Syst. 2022, 8, 4237–4249.
16. Tesfagergish, S.G.; Kapočiūtė-Dzikienė, J.; Damaševičius, R. Zero-Shot Emotion Detection for Semi-Supervised Sentiment Analysis Using Sentence Transformers and Ensemble Learning. Appl. Sci. 2022, 12, 8662.
17. Guo, J. Deep learning approach to text analysis for human emotion detection from big data. J. Intell. Syst. 2021, 31, 113–126.
18. Fu, Y.; Yuan, S.; Zhang, C.; Cao, J. Emotion Recognition in Conversations: A Survey Focusing on Context, Speaker Dependencies, and Fusion Methods. Electronics 2023, 12, 4714.
19. Hussain, M.; Qaz, E.-U.-H.; Ullah, I. Emotion Recognition System Based on Two-Level Ensemble of Deep-Convolutional Neural Network Models. IEEE Access 2023, 11, 16875–16895.
20. Salas-Cáceres, J.; Lorenzo-Navarro, J.; Freire-Obregón, D.; Castrillón-Santana, M. Multimodal emotion recognition based on a fusion of audiovisual information with temporal dynamics. Multimed. Tools Appl. 2024, 84, 27327–27343.
21. Flores, P.M.; Hilbert, M. Temporal communication dynamics in the aftermath of large-scale upheavals: Do digital footprints reveal a stage model. J. Comput. Soc. Sci. 2023, 6, 973–999.
22. Roshan, M.; Rawat, M.; Aryan, K.; Lyakso, E.; Mekala, A.M. Linguistic based emotion analysis using SoftMax over time attention mechanism. PLoS ONE 2024, 19, e0301336.
23. Bini, S.; Percannella, G.; Saggese, A.; Vento, M. A multi-task network for speaker and command recognition in industrial environments. Pattern Recognit. Lett. 2023, 176, 62–68.
24. Zhan, C.; She, D.; Zhao, S.; Cheng, M.-M.; Yang, J. Zero-Shot Emotion Recognition via Affective Structural Embedding. Int. Conf. Comput. Vision. 2019, 1151–1160.
25. Xie, S.; Yu, P.S. Active zero-shot learning: A novel approach to extreme multi-labeled classification. Int. J. Data Sci. Anal. 2019, 3, 151–160.
26. Moreno-Garcia, C.F.; Jayne, C.; Elyan, E.; Aceves-Martins, M. A novel application of machine learning and zero-shot classification methods for automated abstract screening in systematic reviews. Decis. Anal. J. 2023, 6, 100162.
27. Ramasamy, M.D.; Periasamy, K.; Krishnasamy, L.; Dhanaraj, R.K.; Kadry, S.; Nam, Y. Multi-Disease Classification Model Using Strassen’s Half of Threshold (SHoT) Training Algorithm in Healthcare Sector. IEEE Access 2021, 9, 112624–112636.
28. Chutia, T.; Baruah, N. A review on emotion detection by using deep learning techniques. Artif. Intell. Rev. 2024, 57, 203.
29. Canon, M.J.P.; Maceda, L.L.; Palaoag, T.D.; Abisado, M.B. A Fine-Tuned Distilled Zero-Shot Student Model for Emotion Detection in Academic-Related Responses. Int. J. Inf. Educ. Technol. 2024, 14, 1–10.
30. Karthikeyan, J.; Annapurani, K. Two party Conversation with Emotional Footprint and Emotional Intensity. Mendeley Data 2023.

Figure 1. Structure of Attention Bidirectional Recurrent Neural Zero-Shot Semantic Classifier (ABRN-ZSSC) method for emotional footprint identification.

Figure 2. Structure of Attention Bidirectional Recurrent Neural Network-based emotional footprint identification model.

Figure 3. Zero-Shot Learning-based classifiers.

Figure 4. Precision comparison of ABRN-ZSSC with EMOVOX [1], MTCANet [2] and TTNet [3].

Figure 5. Accuracy comparison of ABRN-ZSSC with EMOVOX [1], MTCANet [2] and TTNet [3].

Figure 6. Emotional Footprint training time comparison of ABRN-ZSSC with EMOVOX [1], MTCANet [2] and TTNet [3].

Figure 7. Emotional Footprint error rate comparison of ABRN-ZSSC with EMOVOX [1], MTCANet [2] and TTNet [3].

Figure 8. Recall comparison of ABRN-ZSSC with EMOVOX [1], MTCANet [2] and TTNet [3].

Figure 9. T-test results for ABRN-ZSSC method.

Table 1. Advantages of zero-shot learning and fine-tuning.

Advantages	Zero-Shot Learning	Fine-Tuning
Adaptability	Highly flexible; handles new emotion categories immediately, without retraining.	Requires a resource-intensive retraining cycle whenever new classes emerge.
Cost and Speed	Lower data collection cost and faster deployment for new tasks.	High cost and time for data labeling and training.
Generalization	Designed to generalize to unseen data by using prior knowledge.	Performance degrades on unseen data.
Scalability	Scales more readily to a large number of emotion categories.	Scalability is limited for new data and new conditions.

Table 2. Two-Party Conversation with Emotional Footprint and Emotional Intensity dataset.

S. No	Column headers	Features	Description
1	“0”–“34”	Emotional categories [0–6]	0: no emotion; 1: angry; 2: disgust; 3: fear; 4: happiness; 5: sadness; 6: surprise
2	“35”	Conv_length	Length of conversation
3	“36”–“39”	Inform, question, directive, commissive	Number of utterances within the conversation that fall into their respective categories
4	“40”	Emotion_footprint_first_person	Emotional footprint of the first person in conversation [No emotion, Anger, Disgust, Trust, Happy, Sadness, Surprise, Anticipation]
5	“41”	Emotion_footprint_second_person	Emotional footprint of the second person in conversation [No emotion, Anger, Disgust, Trust, Happy, Sadness, Surprise, Anticipation]
6	“42”	Conv_intensity_footprint_first_person	Emotional footprint and intensity of the first person in conversation
7	“43”	Conv_intensity_footprint_second_person	Emotional footprint and intensity of the second person in conversation
8	“44”	Emotion_intensity_first_person	Intensity of emotion experienced by first person in conversation
9	“45”	Emotion_intensity_second_person	Intensity of emotion experienced by second person in conversation
10	“46”	Conversation	Complete and unchanged conversation between two parties
11	“47”	Speaker 1	Utterances made by first speaker within the conversation
12	“48”	Speaker 2	Utterances made by second speaker within the conversation
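
The column layout in Table 2 can be sketched as a small lookup, which may help when reading the dataset programmatically. This is a minimal illustration only: the group labels emotional_categories and dialogue_act_counts and the helper names feature_group and decode_emotions are hypothetical and do not come from the dataset or the paper; the column ranges and emotion codes are taken from Table 2.

```python
# Emotion codes for columns "0"-"34", as listed in Table 2.
EMOTION_CODES = {
    0: "no emotion",
    1: "angry",
    2: "disgust",
    3: "fear",
    4: "happiness",
    5: "sadness",
    6: "surprise",
}

# (first column, last column, feature name) per Table 2.
# "emotional_categories" and "dialogue_act_counts" are hypothetical labels
# for the multi-column groups; the other names are the listed features.
COLUMN_GROUPS = [
    (0, 34, "emotional_categories"),
    (35, 35, "Conv_length"),
    (36, 39, "dialogue_act_counts"),
    (40, 40, "Emotion_footprint_first_person"),
    (41, 41, "Emotion_footprint_second_person"),
    (42, 42, "Conv_intensity_footprint_first_person"),
    (43, 43, "Conv_intensity_footprint_second_person"),
    (44, 44, "Emotion_intensity_first_person"),
    (45, 45, "Emotion_intensity_second_person"),
    (46, 46, "Conversation"),
    (47, 47, "Speaker 1"),
    (48, 48, "Speaker 2"),
]


def feature_group(column_header: str) -> str:
    """Return the feature name for a column header such as "35"."""
    index = int(column_header)
    for first, last, name in COLUMN_GROUPS:
        if first <= index <= last:
            return name
    raise KeyError(f"unknown column header: {column_header}")


def decode_emotions(codes):
    """Translate a sequence of integer emotion codes into their labels."""
    return [EMOTION_CODES[code] for code in codes]
```

For example, feature_group("38") falls in the dialogue-act range, and decode_emotions([0, 4, 6]) yields the labels for no emotion, happiness, and surprise.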

Table 3. Hyperparameter details.

Hyperparameter	Value
Optimizer	Adam
Recurrent_dropout	0.4
Batch size	40
Epochs	50
Learning rate	0.0001
Activation function	Attention mechanism, Attribute learning
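
One way to carry the numeric settings of Table 3 through an experiment script is a small immutable configuration object. The sketch below is an assumption about how such a configuration might be organized, not the authors' code: the class name TrainingConfig, the helper steps_per_epoch, and the sample count used in the usage note are hypothetical; only the five values come from Table 3.

```python
import math
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TrainingConfig:
    """Numeric hyperparameters reported in Table 3."""
    optimizer: str = "Adam"
    recurrent_dropout: float = 0.4
    batch_size: int = 40
    epochs: int = 50
    learning_rate: float = 0.0001


def steps_per_epoch(n_samples: int, cfg: TrainingConfig) -> int:
    """Number of mini-batches needed to cover n_samples once."""
    return math.ceil(n_samples / cfg.batch_size)


def total_steps(n_samples: int, cfg: TrainingConfig) -> int:
    """Total optimizer updates over all epochs."""
    return steps_per_epoch(n_samples, cfg) * cfg.epochs
```

With a hypothetical 1000 training samples, the batch size of 40 gives 25 steps per epoch, and 50 epochs give 1250 updates in total. asdict(TrainingConfig()) returns the settings as a plain dictionary for logging.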

Table 8. Ablation study for comparison of methods.

Methods	WSNBFSF Dataset
Methods	Precision	Accuracy	Recall	Training Time (ms)	Error Rate (%)
Attention mechanism	0.66	0.72	0.8	175	17.65
Bidirectional RNN	0.71	0.78	0.83	164	15.45
Zero-Shot Learning-based classifier	0.75	0.82	0.85	158	12.3
Proposed ABRN-ZSSC method	0.83	0.87	0.9	145	9.23
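
The differences between the full method and each ablated component in Table 8 can be read off with a few lines of arithmetic. The following sketch simply restates the table values and subtracts them; the dictionary keys and the function name gain_over are hypothetical names chosen for illustration.

```python
# Values copied from Table 8 (WSNBFSF dataset).
ABLATION = {
    "Attention mechanism": {
        "precision": 0.66, "accuracy": 0.72, "recall": 0.80,
        "training_time_ms": 175, "error_rate_pct": 17.65,
    },
    "Bidirectional RNN": {
        "precision": 0.71, "accuracy": 0.78, "recall": 0.83,
        "training_time_ms": 164, "error_rate_pct": 15.45,
    },
    "Zero-Shot Learning-based classifier": {
        "precision": 0.75, "accuracy": 0.82, "recall": 0.85,
        "training_time_ms": 158, "error_rate_pct": 12.3,
    },
    "Proposed ABRN-ZSSC method": {
        "precision": 0.83, "accuracy": 0.87, "recall": 0.90,
        "training_time_ms": 145, "error_rate_pct": 9.23,
    },
}

PROPOSED = "Proposed ABRN-ZSSC method"


def gain_over(component: str, metric: str) -> float:
    """Proposed-method value minus the component's value, rounded to 2 places.

    Positive is better for precision, accuracy, and recall; negative is
    better for training time and error rate.
    """
    return round(ABLATION[PROPOSED][metric] - ABLATION[component][metric], 2)
```

For instance, gain_over("Attention mechanism", "accuracy") is 0.15, and gain_over("Zero-Shot Learning-based classifier", "training_time_ms") is -13, meaning the full method trains 13 ms faster than the zero-shot classifier alone according to the table.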

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
