Article

Joint Syntax-Enhanced and Topic-Driven Graph Networks for Emotion Recognition in Multi-Speaker Conversations

1 School of Computer Science, Nanjing University of Information Science & Technology, Nanjing 210044, China
2 School of Software, Nanjing University of Information Science & Technology, Nanjing 210044, China
3 Department of Computer Science, King Saud University, Riyadh 11362, Saudi Arabia
4 Faculty of Science, Cairo University, Giza 12613, Egypt
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(6), 3548; https://doi.org/10.3390/app13063548
Submission received: 22 February 2023 / Revised: 6 March 2023 / Accepted: 7 March 2023 / Published: 10 March 2023
(This article belongs to the Special Issue Natural Language Processing (NLP) and Applications)

Abstract

Daily conversations contain rich emotional information, and identifying it has become a prominent task in natural language processing. Traditional dialogue sentiment analysis methods study one-to-one dialogues and cannot be applied effectively to multi-speaker dialogues. This paper focuses on the relationships among participants in a multi-speaker conversation and analyzes the influence of each speaker on the emotion of the whole conversation. We summarize the challenges of emotion recognition in multi-speaker dialogue, focusing on the context-topic switching problem caused by the free flow of topics. To address this challenge, this paper proposes a graph network that combines syntactic structure and topic information. A syntax module converts sentences into graphs, using edges to represent dependencies between words, which handles the colloquial nature of daily conversations, and graph convolutional networks extract the implicit meaning of the discourse. In addition, we focus on the impact of topic information on sentiment, so we design a topic module that uses a variational autoencoder (VAE) to extract and classify sentence topics. An attention mechanism combined with the syntactic structure then strengthens the model’s ability to analyze sentences. Topic segmentation is adopted to alleviate the long-term dependencies problem, and a heterogeneous graph is used to model the dialogue, whose nodes combine speaker information and utterance information. To capture the interactions between dialogue participants, different edge types represent different interaction relationships and are assigned different weights. Experimental results on multiple public datasets show that the new model outperforms several alternative methods in sentiment label classification. On the multi-speaker dialogue dataset, classification accuracy increases by more than 4%, which verifies the effectiveness of constructing heterogeneous dialogue graphs.

1. Introduction

In today’s data-driven society, sentiment analysis has attracted increasing attention from researchers because of its important applications in business. Its most extensive application lies in the screening and classification of user comments [1]: businesses analyze comments to understand market dynamics and improve their competitiveness. Moreover, with the continuous development of social media and the growing number of human–computer interactions [2,3], sentiment analysis in dialogue has been proposed as an emerging task.
Traditional emotion recognition work generally takes non-dialogical text as the research object [4,5]. Researchers extract the sentiment statements in the text and analyze the sentiment polarity, emotions, evaluations, and opinion attitudes that humans express towards a target object [6,7]. Emotion recognition in conversations is a task proposed in recent years [8]. Unlike traditional emotion recognition, conversational texts have more than one speaker; the speakers exchange emotions and collide in opinions during the conversation, and their emotional volatility is more frequent and intense. Moreover, dialogue text is the language of daily communication, and speakers can conceal their true intentions through different expressions, so we need not only to extract the original meaning of the text but also to explore the speaker’s intention hidden in it [9,10]. Li et al. [11] described three characteristics of emotion transmission in dialogue: context dependence, persistence, and contagiousness. Lou et al. [12] studied the implicit semantics of dialogue texts. However, their study subjects were all one-to-one dialogue texts. In this study, our goal is to address emotion recognition in conversations among three or more speakers, a setting we refer to as multi-speaker conversations.
Multi-speaker conversations involve more than two speakers rather than a one-on-one speaking mode, so they are more challenging than two-speaker conversations. First, because the emotional interaction of a conversation is a continuous process rather than a short, fixed one, the emotional state of the conversation keeps changing. The emotions in a two-speaker conversation are only influenced by each other’s words, while the emotions in a multi-speaker conversation may be influenced by the words of multiple interlocutors, so determining emotions is more complex. Second, when two people talk, the content of the conversation is more organized and logical, but when there are more participants, a conversation may cover multiple topics, and each speaker has different concerns and addresses different interlocutors. One speaker can respond to multiple speakers at the same time, which makes the interaction more complex. Finally, the number of utterances in a multi-speaker conversation is generally much larger than in a two-speaker conversation, and the resulting long-term dependencies problem is difficult to solve with existing techniques. Moreover, due to the irregularity of daily language, the literal sentiment of an utterance may be inconsistent with the sentiment it actually expresses [12]. We define this inconsistency as the difference between the original meaning and the derived meaning of an utterance, and exploring the derived meaning of an utterance is work we must accomplish.
Inspired by graph-based models proposed for other tasks [13,14], we propose joint syntax-enhanced and topic-driven graph networks for emotion recognition in multi-speaker conversations, namely SETD-ERC. SETD-ERC has three modules. The syntax module analyzes the sentence components of the input utterances to obtain the dependencies between words; this structural information allows better mining of the latent semantic information of the sentences. We also design a topic module and a dialogue interaction module, which together address a series of challenges arising from user interaction. The topic module works with the syntax module to encode syntactic and topic information, and a fusion mechanism then produces a sentence representation carrying semantic, syntactic, and topic information. The topic module also extracts topic categories, which the dialogue interaction module uses to segment conversations; this addresses the diversity of conversation topics, and splitting conversations into sub-fragments also alleviates the long-term dependencies problem. To model the emotional interactions between speakers, we design dialogue graphs that simulate multi-speaker conversations with emotional interactions. We conducted extensive comparison experiments on four dialogue corpora, and the results show that our model performs well on this task. The main contributions of our work are as follows:
  • We jointly learn the grammatical information and topic information of dialogues and propose a new sentiment analysis method for multi-speaker dialogues, a joint syntax-enhanced and topic-driven graph network model (SETD-ERC).
  • We combine topic segmentation techniques to model dialogues as dialogue graphs containing both interaction information and contextual information, which build the information iteration of multi-speaker conversations.
  • We validate our model on four datasets. The experimental results show that the performance of our model is better than baseline models.
The rest of the paper is organized as follows. Section 2 reviews related work. Section 3 provides a detailed description of SETD-ERC. Section 4 describes the experimental setup, and Section 5 presents the results and analysis; finally, Section 6 concludes the paper.

2. Related Work

Emotion recognition is an interdisciplinary research field that draws on linguistics, psychology, cognitive science, machine learning, natural language processing, and other disciplines [4,15]; at its core, it is the classification of texts by affective disposition. This section first introduces the feature learning methods used in sentiment analysis and then briefly reviews research on dialogue sentiment analysis.

2.1. Feature Learning in Emotion Recognition

Early emotion recognition used lexicon-based approaches [16] that weighted different word components at different levels. Information such as part of speech, position, semantics, punctuation, and frequency of occurrence was regarded as learnable features, which were extracted by different methods for emotion classification.
Nowadays, a more popular approach is to use machine learning algorithms to classify the sentiment of sentences [17,18]: the text is converted into matrices; semantic, positional, sentiment, and other features are learned; and a classifier is then trained on them. The features learned by the model largely determine its quality [19]. Pang et al. [20] extracted topic features from short documents; their work took into account the relevance of topics and emotions, alleviating the sparsity of the feature space. Dieng et al. [21] combined topic models with word embeddings, assigning probabilities to words by calculating the similarity between word embeddings and topic embeddings, but directly replacing the CBOW [22] context vectors of a word with topic embeddings was ill-considered.
Since Devlin et al. [23] proposed the BERT language model, its excellent performance has made it the most popular language representation model. Peinelt et al. [24] constructed a new framework based on BERT in which they integrated topic information for similarity detection. Zhu et al. [25] were the first to combine common sense knowledge with topic information for emotion recognition in conversations, with topic information used for fine-tuning; a transformer encoder–decoder structure was developed to fuse topic and common sense information instead of recurrent attention neural networks for extracting fine-grained information. Wang et al. [26] used an attention mechanism to capture semantic features in sentences and amplify key semantic information. Huang et al. [27] focused on syntactic structure: sentences were represented as dependency graphs instead of word sequences, structural features between words were extracted, and sentiment features were taken directly from the syntactic context of the text. Jia et al. [28] learned syntactic relations, sequence structure, and semantic information of text representations to construct text classification models. Sindhu et al. [16] combined token, orientation, grammatical function, field, and intensity components in the embedding stage, and the final classification results performed well.
We sum up the advantages and disadvantages of the above models. We construct a graph-based neural network to learn the grammatical structure features of sentences and a topic model to extract topic information. An attention mechanism is used to fuse the semantic, syntactic, and topic information into feature vectors of sentences.

2.2. Emotion Recognition in Conversations

Emotion recognition in conversations is a task newly proposed in recent years. Traditional non-dialogical texts are organized and logical, with smooth sentiment changes, while the emotion of dialogic utterances is more complex because it is influenced by the context and the interlocutors’ emotions. Al-Shaikh et al. [29] proposed a search algorithm for tracking social network users. Majumder et al. [30] applied speaker-state tracking to dialogue sentiment analysis; they assumed that the current speaker’s state is influenced by that speaker’s previous state and the preceding context, and they updated the speaker’s state information accordingly. On this basis, Ghosal et al. [31] proposed speaker self-dependence and inter-speaker dependence and argued that these two types of relational information can explain changes of emotional information in a conversation: self-dependence simulates the emotional inertia of individual speakers, while inter-speaker dependence simulates the interaction between speakers. Ghosal et al. also replaced the RNN with a GCN to optimize context propagation. The graph convolutional network (GCN) is a neural network technique commonly used in dialogue sentiment analysis; Wan et al. [32] used a GCN to classify nodes according to their features and relationships. Lee et al. [33] extracted dialogue turn information, topic information, speaker information, and overall dialogue information; they designed a dialogue graph to simulate the interaction between these components and combined a bidirectional long short-term memory (BiLSTM) network with a GCN to build a new mechanism for processing contextual information.
Zhang et al. [34] focused on the multi-speaker dialogue emotion recognition task and were the first to consider both context-sensitive and speaker-sensitive dependencies in this task. They used graphs to describe the emotional relationships between speakers, transforming emotion detection into a classification problem over graph nodes. However, they did not focus on the extraction of semantic information and ignored the fact that the topic of a conversation changes easily. Shen et al. [35] argued that existing work was not conducive to applying the pre-trained model XLNet [36]; they adjusted XLNet’s memory update so that it could retain more historical information and designed four attention mechanisms to model the dependencies arising from the dialogue process. However, their study did not adapt the approach specifically to multi-speaker dialogue emotion recognition. Sun et al. [37] considered for the first time the importance of dialogue structure in processing information and shaping speaker characteristics, exploiting a gated convolution to propagate contextual information through the self-speaker dependency of interlocutors.
The above approaches mention factors that influence the effectiveness of emotion recognition methods, such as semantics, syntactic structure, topic information, and inter-speaker interaction, but each considers only one or two of these elements rather than all of them. Our study integrates all of these influencing factors for multi-speaker dialogue emotion recognition and designs a joint syntax-enhanced and topic-driven graph network (SETD-ERC). We extract syntactic and topic information so that the sentence representation vector retains more valid information. Unlike existing graph-based neural networks, we apply topic segmentation techniques to multi-speaker dialogue and exploit a special dialogue graph structure to model speaker identity information and handle discourse-emotion interactions between speakers.

3. Methodology

The overall architecture of the proposed joint syntax-enhanced and topic-driven graph network model (SETD-ERC) is shown in Figure 1, which consists of three main modules. (1) The syntax module learns syntactic information of words from the input dialogue and generates a syntax-enhanced feature vector for each utterance. (2) The topic module adopts a variational autoencoder (VAE) [38] to learn potential topic information in conversations and further classifies the topic information to obtain the topic category. We use an attention mechanism to balance the weights of the above two modules and reconstruct the sentence embedding to encode the syntactic and topic information simultaneously. (3) The dialogue interaction module processes the entire dialogue and uses the topic category labels obtained by the topic module to segment the input dialogue. We split long dialogue passages into dialogue fragments to alleviate the long-term dependencies problem and then construct a dialogue subgraph for each fragment, in which nodes correspond to utterances and edges represent the dependency relationships between utterances. In the following, we describe each module in detail.

3.1. Problem Definition

Given a multi-speaker dialogue as input, define it as $D = \{(u_1, p(u_1)), (u_2, p(u_2)), \ldots, (u_N, p(u_N))\} \in \mathbb{R}^{N \times d}$, where $N$ denotes the number of utterances in the conversation. We use the subscript $i$ to denote the serial number of the utterance, $i \in [1, N]$. $p(u_i)$ corresponds to the speaker of the $i$-th utterance $u_i$, $p(u_i) \in P$, where $P$ denotes the set of all speakers in the conversation. Each utterance $u_i = [w_{i1}, w_{i2}, \ldots, w_{in}] \in \mathbb{R}^d$ contains $n$ words, and $d$ is the embedding vector size of words. The goal of the task in this paper is to identify the emotion label $y_t$ of the utterance $u_t$ at the current time $t$ from the input dialogue context and the corresponding speakers.

3.2. Syntax Module

At each time step $i$, we use the pretrained model $BERT_{base}$ to represent the sentence $u_i$, mapping the input words into a low-dimensional dense vector space so that the text can be processed by the model; such a mapping places semantically similar words closer together in the semantic space. We first tokenize each sentence with $BERT_{base}$, decompose the sentence into tokens, and add the sentence start mark [CLS] and the separation mark [SEP]. The input is finally represented as $BERT_{base}(u_i) = [[CLS], x_1, x_2, \ldots, x_n, [SEP]]$, and the output corresponding to [CLS] is used as the sentence representation.
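As an illustration of this encoding step, the following minimal PyTorch sketch uses the Hugging Face transformers library to obtain the [CLS] representation of an utterance; the checkpoint name and variable names are our own choices, not taken from the paper.

import torch
from transformers import BertModel, BertTokenizer

# Encode one utterance with BERT-base and take the [CLS] hidden state
# as the sentence representation (illustrative sketch).
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

utterance = "Syntactic dependency trees have a wide range of applications"
inputs = tokenizer(utterance, return_tensors="pt")        # adds [CLS] and [SEP]
with torch.no_grad():
    outputs = bert(**inputs)

cls_embedding = outputs.last_hidden_state[:, 0, :]         # [CLS] vector
token_embeddings = outputs.last_hidden_state[:, 1:-1, :]   # word vectors x_1..x_n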
The syntax module models dependencies between words in a sentence. It mines the deep extended meaning of sentences by extracting their grammatical structure features. The natural language processing tool spaCy (https://spacy.io/) is used in this module to generate and process syntactic dependency trees. For each sentence, each word is treated as a node, and when there is a dependency between two nodes, a directed dependency edge between them is constructed. The syntactic dependency tree of the sentence “Syntactic dependency trees have a wide range of applications” is shown in Figure 2, where (·) marks the part of speech of the word, and the directed edge represents the dependency of the word. For example, “Syntactic” is a dependent of “trees”, and their relationship is “amod”, which stands for “adjective modifier”. The meanings of the tokens and relationship types mentioned in Figure 2 are shown in Table 1.
We generate an adjacency matrix $A$ from the dependency tree; the construction procedure is shown in Algorithm 1. If there is an edge from node $i$ to node $j$, the corresponding element of the matrix is set to 1, i.e., $A_{ij} = 1$; otherwise it is 0. In addition, we also consider the influence of each node on itself and set the self-loop entries $A_{ii} = 1$.
Algorithm 1 Syntactic information construction algorithm.
Input: sentence matrix [x_1, x_2, ..., x_n]
Output: adjacency matrix A
1: define zero matrix A;
2: use the spaCy tool to generate a dependency tree Tree;
3: for each edge <i, j> in Tree do:
4:     set the value of A_ij to 1;
5: end for;
6: for each node i (i.e., i == j) do:
7:     set the value of A_ii to 1;
8: end for;
9: return adjacency matrix A;
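As a concrete illustration of Algorithm 1, the following Python sketch builds the adjacency matrix from a spaCy dependency parse, including the self-loop entries; it assumes the en_core_web_sm model is installed and is not the authors’ exact implementation.

import numpy as np
import spacy

nlp = spacy.load("en_core_web_sm")

def build_adjacency(sentence: str) -> np.ndarray:
    # Parse the sentence and connect each word to its syntactic head;
    # self-loops (A_ii = 1) are added for every token.
    doc = nlp(sentence)
    n = len(doc)
    A = np.zeros((n, n), dtype=np.float32)
    for token in doc:
        A[token.head.i, token.i] = 1.0   # dependency edge head -> dependent
        A[token.i, token.i] = 1.0        # self-loop
    return A

A = build_adjacency("Syntactic dependency trees have a wide range of applications")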
To fuse the obtained syntactic structure information into the sentence embedding, we use a GCN to convolve the hidden states of neighboring nodes and obtain syntax-enhanced features. For an input sentence $u_i$ with $n$ words, the generated adjacency matrix has size $n \times n$. We use a two-layer GCN that operates on the current node, convolving the sentence matrix $[x_1, x_2, \ldots, x_n]$ with the adjacency matrix $A$ to realize the iteration of information. The output is a sentence representation with structural information.
$x_i' = \mathrm{ReLU}\left(\sum_{j=1}^{n} A_{ij} W_{A1} x_j\right)$

$x_i^s = \tanh\left(\sum_{j=1}^{n} A_{ij} W_{A2} x_j'\right)$
where $x_i' \in \mathbb{R}^d$ is the output of the first GCN layer and $x_i^s \in \mathbb{R}^d$ is the output of the second GCN layer, which represents the result incorporating syntactic information. Following the comparison of 21 activation functions in NLP by Eger et al. [39], we choose the ReLU and tanh functions for the two GCN layers, respectively. $A \in \mathbb{R}^{n \times n}$ denotes the adjacency matrix, $i$ is the current node, $j$ is a neighbor of node $i$, and $A_{ij}$ represents the dependency between nodes $i$ and $j$. $W_{A1} \in \mathbb{R}^{d \times d}$ and $W_{A2} \in \mathbb{R}^{d \times d}$ are trainable parameters.
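A minimal PyTorch sketch of this two-layer syntactic GCN is given below; the module name and the use of plain linear layers are illustrative assumptions, and only the ReLU/tanh layer structure follows the equations above.

import torch
import torch.nn as nn

class SyntaxGCN(nn.Module):
    # Two graph-convolution layers over the syntactic adjacency matrix A:
    # ReLU after the first layer, tanh after the second.
    def __init__(self, d: int):
        super().__init__()
        self.W_A1 = nn.Linear(d, d, bias=False)
        self.W_A2 = nn.Linear(d, d, bias=False)

    def forward(self, X: torch.Tensor, A: torch.Tensor) -> torch.Tensor:
        # X: (n, d) word embeddings, A: (n, n) adjacency matrix
        h = torch.relu(A @ self.W_A1(X))      # first layer output x'
        x_s = torch.tanh(A @ self.W_A2(h))    # syntax-enhanced output x^s
        return x_s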

3.3. Topic Module

The topic module consists of two parts: an encoder and a decoder. We assume that each utterance is mapped to a latent variable that encodes the topic discussed in the multi-speaker dialogue. The decoder reconstructs the input embedding and outputs a topic-aware sentence representation. To make better use of the topic information, we also classify the latent vector during encoding to obtain topic category labels, which facilitates the operation of subsequent modules. Since the VAE has both an encoder and a decoder, it can serve as a topic feature extractor and a generator at the same time, so we use it to perform neural topic modeling.
For the input sentence $u_i$, the original input of the encoder is $BERT_{base}(u_i) = [[CLS], x_1, x_2, \ldots, x_n, [SEP]]$, abbreviated $X$, and the encoder outputs a latent vector $z$. The vector $z$ describes the distribution of conversation topics and follows a Gaussian distribution, $z \sim \mathcal{N}(0, 1)$. First, a multi-layer perceptron (MLP) is introduced to compute the mean $\mu(x)$ and variance $\sigma^2(x)$ of the input $X$, so that $z \sim \mathcal{N}(\mu(x), \sigma^2(x))$, where $\mathcal{N}(\mu(x), \sigma^2(x))$ is a multidimensional Gaussian distribution and the components of $z$ are independent Gaussian random variables.
$\mu = FC_{\mu}(x), \qquad \sigma^2 = \mathrm{diag}\big(FC_{\sigma}(x)\big)$
where $FC$ denotes a fully connected layer, and diag converts the column vector into a diagonal matrix. We can then compute the latent variable $z = \mu + \sigma \cdot \epsilon$, where $\epsilon$ is a noise variable sampled from $\mathcal{N}(0, 1)$. We add a classification layer after the latent variable to obtain the topic categories $c$. Let $p(c|z)$ denote the classifier on the latent variable $z$, fitted with a softmax layer; accordingly, we obtain the topic category distribution $q(c) = \mathrm{softmax}(z)$.
The decoder reconstructs the sentence embedding from the latent variable $z$ and the topic category $c$. First, we select a category from the topic category distribution $q(c)$, then select a random latent variable $z$ from the distribution $q(z|c)$; $q(x|z)$ then reconstructs the input embedding through the generator, and we set the selection scheme as $q(x|z, c) = q(x|z)$. It should be noted that the generator here does not generate dialogue responses; it extracts latent topics from the current utterance and trains the model to reconstruct the sentence representation. We denote the reconstructed topic-aware sentence representation as $x_i^t$.
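The following PyTorch sketch illustrates the encoder–classifier–decoder structure of the topic module, including the reparameterization trick; the layer shapes and the use of simple linear layers are our assumptions rather than the authors’ exact architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TopicVAE(nn.Module):
    # Encoder produces mu and log-variance for z, a softmax layer yields the
    # topic category distribution q(c), and a decoder reconstructs the
    # topic-aware sentence representation x^t from z.
    def __init__(self, d: int, n_topics: int):
        super().__init__()
        self.fc_mu = nn.Linear(d, d)
        self.fc_logvar = nn.Linear(d, d)
        self.classifier = nn.Linear(d, n_topics)
        self.decoder = nn.Linear(d, d)

    def forward(self, x: torch.Tensor):
        mu = self.fc_mu(x)
        logvar = self.fc_logvar(x)
        eps = torch.randn_like(mu)
        z = mu + torch.exp(0.5 * logvar) * eps       # z = mu + sigma * eps
        q_c = F.softmax(self.classifier(z), dim=-1)  # topic category distribution
        x_t = self.decoder(z)                        # topic-aware representation x^t
        return x_t, q_c, mu, logvar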
Based on the syntax-enhanced representation $x_i^s$ obtained from the syntax module and the topic-aware representation $x_i^t$ from this module, we design a fusion mechanism that merges the information obtained from these two modules.
$g = \mathrm{sigmoid}\big(W_{Fs}\, x_i^s + W_{Ft}\, x_i^t + b_F\big)$

$x_i^f = g \odot x_i^s + (1 - g) \odot x_i^t$
where $W_{Fs} \in \mathbb{R}^{d \times d}$, $W_{Ft} \in \mathbb{R}^{d \times d}$, and $b_F \in \mathbb{R}^d$ are all learnable parameters, and $\odot$ denotes element-wise multiplication. We use the sigmoid function to compute the weights of the syntax-enhanced representation and the topic-aware representation, and the resulting $x_i^f \in \mathbb{R}^d$ is the sentence representation that incorporates both syntactic and topic encoding.
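A short sketch of this gating mechanism in PyTorch is shown below; folding the bias $b_F$ into one of the linear layers is an implementation convenience, not something stated in the text.

import torch
import torch.nn as nn

class SyntaxTopicFusion(nn.Module):
    # Gate g = sigmoid(W_Fs x^s + W_Ft x^t + b_F); output g*x^s + (1-g)*x^t.
    def __init__(self, d: int):
        super().__init__()
        self.W_Fs = nn.Linear(d, d, bias=False)
        self.W_Ft = nn.Linear(d, d, bias=True)   # bias term plays the role of b_F

    def forward(self, x_s: torch.Tensor, x_t: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.W_Fs(x_s) + self.W_Ft(x_t))
        return g * x_s + (1.0 - g) * x_t         # fused representation x^f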

3.4. Dialogue Interaction Module

Research on conversation-level emotion recognition should consider not only textual information but also the emotional interaction between interlocutors. However, for long multi-speaker conversations with many utterances, ordinary RNN and LSTM networks cannot effectively alleviate the vanishing-gradient problem, so we build a separate module to handle the emotional communication of the conversation.
In our work, we construct dialogue graphs for multi-speaker dialogues. The nodes are initialized with the syntax-enhanced and topic-aware sentence representations $x^f$ output by the previous module, and nodes are connected in dialogue order. We have already obtained the conversation topic category $c$ of each node from the topic module, and we segment the dialogue according to $c$. When the category $c$ of the $m$-th node $x_m$ differs from that of the subsequent node $x_{m+1}$ ($m \in [1, n-1]$), it means that $x_m$ is irrelevant to what $x_{m+1}$ is talking about. Based on this, the long-distance conversation $D$ is divided into several short-distance conversation sub-segments, which are then processed separately.
Each conversation sub-segment $D_s = \{x_1, x_2, \ldots, x_n\}$ is constructed as a conversation subgraph (CSG) in this module. The specific method is as follows. Each vertex is initialized with the corresponding sequentially encoded feature vector $x^f$. The edges of the CSG are divided into two types, $e^0$ and $e^1$: $e^0$ means that the two vertices of the edge are spoken by the same speaker, and $e^1$ means otherwise. We construct edges in the following way: nodes are arranged in strict dialogue order, and a previous node can be connected to a future node, but a future node cannot be connected back to a previous node. We examine each node $i$ ($1 \le i < n$) and each subsequent node. If the speaker of a later node $j$ ($i+1 \le j \le n$) differs from that of node $i$, the two nodes are connected by the edge $e_{ij}^1$. When a later node $k$ ($i+1 \le k \le n$) has the same speaker as node $i$, we construct the edge $e_{ik}^0$. Whenever we construct an edge $e^0$ for the current node, we stop the scan and start processing the next node $i+1$. A sample CSG is shown in Figure 3.
It should be noted that, since subgraphs are subsets of a multi-speaker conversation, the dialogue state is continuous across them; therefore, starting from the second subgraph, the first node of each subgraph has an edge to the previous subgraph. The specific steps are shown in Algorithm 2.
Algorithm 2 Conversation subgraph construction algorithm.
Input: the dialogue x_1^f, x_2^f, ..., x_n^f; speaker identity p(·); topic category c(·)
Output: dialogue graph G
1: V = {x_1^f, x_2^f, ..., x_n^f};
2: E = ∅, G = ∅;
3: R = {0, 1};
4: for i ∈ {1, 2, ..., n-1} do:
5:     j = i + 1;
6:     while j ≤ n do:
7:         if c(j) ≠ c(i) then:
8:             append(G, (V, E, R));
9:             break;
10:        else:
11:            if p(j) ≠ p(i) then:
12:                E = E ∪ {(i, j, 1)};
13:            else: E = E ∪ {(i, j, 0)};
14:            end if
15:        end if
16:        j = j + 1;
17:    end while
18: end for;
19: return G;
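A plain-Python sketch of the edge-construction logic in Algorithm 2 is given below; it returns only the edge list for one pass over the dialogue and takes simple lists as inputs, so it is an illustration of the procedure rather than the authors’ implementation.

def build_subgraph_edges(speakers, topics):
    # Walk forward from each utterance i: stop at a topic switch; add an
    # inter-speaker edge (type 1) for each different speaker, and stop the
    # inner scan after the first same-speaker edge (type 0).
    n = len(speakers)
    edges = []                              # tuples (i, j, relation)
    for i in range(n - 1):
        for j in range(i + 1, n):
            if topics[j] != topics[i]:      # topic change: close this subgraph
                break
            if speakers[j] != speakers[i]:
                edges.append((i, j, 1))     # inter-speaker edge e^1
            else:
                edges.append((i, j, 0))     # same-speaker edge e^0
                break
    return edges

edges = build_subgraph_edges(speakers=["A", "B", "A", "C"], topics=[0, 0, 0, 0])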
We update the graph nodes via a graph convolutional network (GCN) to control the propagation of node information, treating the sentence vector $x^f$ as the input hidden state of the first GCN layer. The hidden state $h_i^l$ of a node at layer $l$ is determined by $h_i^{l-1}$ and its connected edges, and the weight of each edge depends on the relationship type of that edge.
$h_i^0 = x_i^f$

$h_i^l = \mathrm{sigmoid}\left(\sum_{j \in \mathcal{N}(i)} \frac{W_{ij}^l\, h_j^{l-1}}{|\mathcal{N}(i)|} + b^l\, h_i^{l-1}\right)$

$v_i = h_i^l$
where $h_i^0 \in \mathbb{R}^d$ denotes the initial input state of the GCN and $x_i^f \in \mathbb{R}^d$ is the sentence feature vector without context information. The GCN first averages the embeddings of the neighbor nodes from the previous layer and then fuses them with the node’s own information; the sigmoid function is used to accomplish this operation. Both $W_{ij}^l$ and $b^l$ are trainable parameters, determining the importance of the neighbor information and of the node’s own features, respectively. $W_{ij}^l \in \{W_0^l, W_1^l\}$ reflects the different relationship types of edges. The output node embedding $v_i$ carries both the node’s feature information and the sentiment information of the interlocutors associated with it.
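The following PyTorch sketch shows one relation-aware GCN layer in this spirit: neighbor states are transformed by an edge-type-specific weight ($W_0$ for same-speaker edges, $W_1$ for inter-speaker edges), averaged, and combined with the node’s previous state through a sigmoid. The exact normalization and the direction of information flow are our assumptions.

import torch
import torch.nn as nn

class RelationalGCNLayer(nn.Module):
    def __init__(self, d: int):
        super().__init__()
        # W_0 for same-speaker edges, W_1 for inter-speaker edges
        self.W_rel = nn.ModuleList([nn.Linear(d, d, bias=False),
                                    nn.Linear(d, d, bias=False)])
        self.b = nn.Parameter(torch.zeros(d))

    def forward(self, h: torch.Tensor, edges) -> torch.Tensor:
        # h: (n, d) node states; edges: list of (i, j, relation), i earlier than j
        n, _ = h.shape
        agg = torch.zeros_like(h)
        degree = torch.zeros(n)
        for i, j, rel in edges:
            agg[j] = agg[j] + self.W_rel[rel](h[i])      # earlier node informs later node
            degree[j] += 1
        degree = degree.clamp(min=1).unsqueeze(-1)
        return torch.sigmoid(agg / degree + self.b * h)  # fuse neighbors with self state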
Finally, the information updated by the GCN is used as the final representation of the sentence, which is passed through a feed-forward neural network to obtain the sentiment of the current utterance.
$d_i = \mathrm{ReLU}(W_d\, v_i + b_d)$

$D_i = \tanh(W_D\, d_i + b_D)$

$\hat{y}_i = \underset{r \in \mathcal{R}}{\arg\max}\; D_i[r]$
where argmax selects the most likely sentiment label as the output $\hat{y}_i$. $W_d \in \mathbb{R}^{d \times d}$, $b_d \in \mathbb{R}^d$, $W_D \in \mathbb{R}^{d \times d}$, and $b_D \in \mathbb{R}^d$ are all learnable parameters. To counter the slow training caused by stacking many GCN layers, as well as possible vanishing or exploding gradients [40], we use the Adam [41] optimizer to optimize the model parameters.
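A small sketch of this classification head is shown below; the number of emotion classes and the hidden size are example values only.

import torch
import torch.nn as nn

class EmotionClassifier(nn.Module):
    # Feed-forward head over the graph node embedding v_i; argmax gives y_hat.
    def __init__(self, d: int, n_classes: int = 7):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(d, d), nn.ReLU(),
                                nn.Linear(d, n_classes))

    def forward(self, v: torch.Tensor) -> torch.Tensor:
        return self.ff(v)                    # unnormalized scores per emotion

classifier = EmotionClassifier(d=256)
logits = classifier(torch.randn(5, 256))     # 5 utterances, hidden size 256
pred_labels = logits.argmax(dim=-1)          # predicted emotion labels y_hat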

3.5. Objective Function

The objective function of this model consists of two parts. In the topic module, we use the VAE to extract topic information. We want the reconstructed vector to be as accurate as possible and to retain more valid information, so we need to minimize $-\log q(x|z)$; the loss is computed using the Kullback–Leibler (KL) (https://www.countbayesie.com/blog/2017/5/9/kullback-leibler-divergence-explained) divergence.
$\mathbb{E}_{x \sim \tilde{p}(x)}\left[ -\log q(x|z) + \sum_{c} p(c|z)\, KL\big(p(z|x)\,\|\,q(z|c)\big) + KL\big(p(c|z)\,\|\,q(c)\big) \right], \quad z \sim p(z|x)$
where $p(z|x)$ is the distribution of $z$ inferred from $x$, $p(c|z)$ is the classifier on $z$ fitted with softmax, and $z \sim p(z|x)$ denotes the reparameterization trick. The second term, $\sum_{c} p(c|z)\, KL(p(z|x)\,\|\,q(z|c))$, aligns the latent variable $z$ with the distribution of a particular category $c$, acting as a clustering of topics, and the third term, $KL(p(c|z)\,\|\,q(c))$, controls the distribution of topic categories $c$ to prevent categories from overlapping.
For the whole model, we also have to use the cross-entropy loss function to train the classified sentiment labels, which is defined by the following equation.
$\mathcal{L}(\Theta) = -\sum_{q=1}^{Q} \sum_{i=1}^{N} \log D_i\big[y_i^t\big]$
where $\Theta$ is the set of trainable parameters of the model, $y_i^t$ is the ground-truth emotion label, and $Q$ is the number of conversations in the training set.
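To make the two objectives concrete, the following sketch combines a standard VAE-style topic loss (reconstruction plus a KL regularizer, a simplification of the category-conditioned objective above) with the cross-entropy emotion loss; the weighting factor alpha and the mean-squared-error reconstruction term are our assumptions.

import torch
import torch.nn.functional as F

def topic_loss(x, x_recon, mu, logvar):
    # Reconstruction term (surrogate for -log q(x|z)) plus KL regularizer.
    recon = F.mse_loss(x_recon, x, reduction="mean")
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

def emotion_loss(logits, labels):
    # Cross-entropy over the predicted emotion distribution D_i.
    return F.cross_entropy(logits, labels)

def total_loss(x, x_recon, mu, logvar, logits, labels, alpha=1.0):
    return emotion_loss(logits, labels) + alpha * topic_loss(x, x_recon, mu, logvar)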

4. Experimental Setup

To verify whether the model is effective, we compare it with other benchmark systems on four datasets. This section introduces the datasets and baselines used by the model as well as other experimental details.

4.1. Datasets

IEMOCAP [42]: A multimodal dataset of two-speaker conversations. There are six emotion labels, including neutral, happiness, sadness, anger, frustration, and excitement. Since there is no validation set for this dataset, we follow the method of Zhong et al. [43], who use the last 20 conversations in the training set for validation.
MELD [44]: A corpus of multi-speaker conversations collected from the TV show Friends. The emotion category contains seven types: neutral, happiness, surprise, sadness, anger, disgust, and fear.
DailyDialog [45]: A multi-turn dialogue dataset written by humans, covering a variety of topics in daily life, with emotional labels including neutral, happiness, surprise, sadness, anger, disgust, and fear. Since it has no speaker information, we consider the utterance turns as the speaker turns by default.
EmoryNLP [46]: Like MELD, it was collected from the script of the TV show Friends but differs in the selection of scenes and emotion labels. The emotion labels of this dataset include neutral, sad, mad, scared, powerful, peaceful, and joyful.
To verify the effectiveness of SETD-ERC, we evaluate our model on the corpora listed above. We only focus on the textual modal information of the corpus in our work. The statistics of the corpus are listed in Table 2.

4.2. Experimental Details

Evaluation metrics: For DailyDialog, since the “neutral” label is the majority in the dataset, we follow the method of Shen et al. [47] and Zhu et al. [25] to ignore the label “neutral” when calculating the results, and use micro-F1 as the evaluation metric for this dataset. For MELD, EmoryNLP and IEMOCAP, we use weighted-average F1 as the evaluation metric.
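For clarity, the two metrics can be computed with scikit-learn as in the sketch below; the label index used for “neutral” is an example value.

from sklearn.metrics import f1_score

def micro_f1_without_neutral(y_true, y_pred, neutral_label=0):
    # micro-F1 on DailyDialog, excluding the "neutral" class
    labels = sorted(set(y_true) - {neutral_label})
    return f1_score(y_true, y_pred, labels=labels, average="micro")

def weighted_f1(y_true, y_pred):
    # weighted-average F1 for MELD, EmoryNLP, and IEMOCAP
    return f1_score(y_true, y_pred, average="weighted")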
Parameter setting: In our experimental setting, we use $BERT_{base}$ as the pre-trained model to extract word embeddings, and the dimension of all hidden vectors is set to 256 for easy calculation. To learn the parameters faster and avoid vanishing or exploding gradients, we adopt the Adam [41] optimizer with a learning rate of $1 \times 10^{-4}$. The dropout rate for all datasets is 0.5; we set the number of epochs to 100 and the batch size to 128. We stop training if the validation loss does not decrease for 20 consecutive epochs. For all experiments, we choose the model that performs best on the development set and then evaluate it on the test set.
For the implementation, we used PyTorch, a full-featured framework for building deep learning models, together with the Python programming language [48]. As hardware, we used a system with an RTX 3090 GPU and an Intel i7-12700 CPU.

4.3. Baselines

We compare our model with multiple state-of-the-art baseline systems below.
HiGRU [49]: A hierarchical GRU structure that jointly trains individual word and utterance-level information and long-range contextual information.
DialogueRNN [30]: A recurrent neural network-based approach that uses three GRUs to model the speaker, the context of the preceding discourse, and the emotion of the preceding discourse.
COSMIC [50]: A common sense-based framework for conversational emotion recognition, using a transformer pre-training model to extract contextual features. It fuses an external knowledge base and learns different common sense elements to study the emotional interactions between interlocutors, and discriminates the emotion categories of conversational discourse.
DAG-ERC [47]: A model that combines the advantages of graph-based neural networks and recurrence-based networks. A directed acyclic graph (DAG) is designed to encode discourse.
ERMC-DisGCN [37]: A discourse-aware graph neural network for multi-party conversational sentiment recognition, designed with a relational convolution to exploit the self-speaker dependence of interlocutors to propagate contextual information.
DialogueGCN [31]: A model that uses directed acyclic graphs to encode the structure of conversations, combining graph neural networks and recurrent neural networks to model distant conversation context and nearby contextual information, and the approach also takes into account speaker relevance.
DialogXL [35]: An integrated XLNet model that modifies the recursive mechanism of XLNet from the segment level to the sentence level and introduces a multi-attention mechanism to compose dialogue-aware self-attention to handle the emotional impact of multi-party multi-turn chat interactions.
DialogueCRN [51]: A new contextual reasoning network that simulates human cognitive emotions by mimicking human cognitive thinking, and designs a multi-turn reasoning module to extract and integrate emotional cues.
CoG-BART [52]: Uses the pre-trained model BART as the backbone model, and supervised contrast learning is introduced to distinguish similar emotions.

5. Experimental Results and Analysis

In this section, we detail the experimental results of SETD-ERC on four datasets, and to verify the necessity of each module, we also conduct ablation experiments. We detail the role of the model through a case.

5.1. Analysis of Comparative Experimental Results

Table 3 presents the final results of our model and all baseline models on the four datasets. By analyzing the final results, we can find that SETD-ERC reaches the state-of-the-art on the three datasets: IEMOCAP, MELD, and EmoryNLP.
In our experiments, the benchmark models HiGRU, DialogueRNN, and COSMIC are recurrence-based, while DAG-ERC, ERMC-DisGCN, and DialogueGCN are graph-based. The experimental results show that the classification results of the recurrence-based models are relatively poor, indicating that recurrence is not effective for encoding the conversational context. HiGRU performs worst because it only focuses on encoding text information and ignores the importance of interaction information in the dialogue. DialogXL, DialogueCRN, and CoG-BART use different pre-training models, and the choice of pre-training model affects the final classification performance. DialogueCRN claims to simulate human cognitive thinking, but it performs poorly on the short-dialogue dataset, where the topic of the dialogue is not continuous enough. Our model consistently outperforms these models, which demonstrates its superiority.
On DailyDialog, the improvement of our model over DAG-ERC is not significant. Comparing the datasets, we found that the reason is that the dialogues in DailyDialog are relatively short, with fewer than 8 turns on average, whereas the advantage of our model lies in handling multi-turn dialogue. When the number of turns is low, the dialogue interaction module in our model cannot fully exert its advantages. On the IEMOCAP dataset, with an average dialogue length of 49, our model’s improvement in classification results far exceeds that of other models, indicating that our model is well suited to emotion classification in multi-party multi-turn dialogue. We effectively mine the original meaning and derived meaning of sentences through syntactic structure and topic information, and we apply topic segmentation to the dialogue text to process contextual information. The graph structure of SETD-ERC is specially designed for multi-party multi-turn dialogue to capture speaker identities, enabling it to achieve better performance than other models.
Overall, the performance of our model meets expectations. On the long-distance dialogue dataset IEMOCAP, we surpass the other models by more than 4%, indicating that the topic segmentation technique solves the long-term dependency problem well: long texts are divided into topic-centered short dialogues, which reduces the difficulty of model computation and increases classification accuracy. We use heterogeneous dialogue graphs to model dialogues, which effectively incorporates dialogue speaker information and benefits the classification results.

5.2. Ablation Study

To demonstrate the effect of each module, ablation experiments are conducted on the four datasets in this section, and the experimental results are shown in Table 4. “-” means the removal of a part, “syntax” indicates syntactic feature extraction, “topic” indicates the extraction of topic information, and “dialogue graph” indicates the dialogue interaction module, including the processing of speaker information and the construction of the dialogue graph. We also conducted experiments using only BERT for sentiment classification to better demonstrate the effectiveness of the model.
The experimental results show that removing any module leads to different degrees of decline, indicating that each part of our framework contributes effectively to the emotion recognition task. We can also find that, compared with removing a single module, removing the syntax module and topic module at the same time leads to a larger drop. This proves that the combination of syntactic structure information and topic feature information improves classification accuracy and that their processing of the sentence representation is complementary. We found that the dialogue interaction module did not perform as well as expected on DailyDialog, because DailyDialog is a two-speaker dialogue dataset and its dialogue content is relatively concise; the speaker interaction distance in this dataset is short, and existing methods such as BERT can already capture the contextual information of the dialogue well, so the improvement from our method is limited. Even so, our model still maintains a good result. In contrast, because the dialogues of IEMOCAP and EmoryNLP are long and complex, the interaction module performs well on both datasets. We observed that there are as many as 302 speakers in IEMOCAP, so modeling speaker information is crucial for this dataset, and our model does this well.

5.3. Case Study

As shown in Table 5, the utterance “Uh-huh.” is very short, with little semantic information and no emotional words, so classifying its emotion directly easily yields “neutral”. However, when this sentence appears in different scenes, the emotion it expresses differs because of the surrounding context and the speaker identities. Case 1 is a conversation in which Joey Tribbiani is talking to himself, so the second sentence continues the emotion of the previous sentence. Utterance 07 “Uh-huh.” of Case 2 is influenced by the previous discourse, and the speaker’s self-influence is very important, so the predicted emotion is correct. In Case 3, the semantics of utterance 07 “That’s great.” contain positive emotional information, and the preceding speaker’s emotion is positive, so it is detected as a positive emotion; in fact, however, its real meaning in the original text is that the commodity is very expensive, i.e., it is ironic. Our model makes no special modifications for irony detection, which is an area to be improved in the future.
Case 4 is an excerpt from a long conversation. In the beginning, Phoebe and Rachel are talking to each other. In utterance 4, Rachel answers Phoebe’s question, but utterance 5 is issued by a new speaker, Monica, who teases Rachel, so its mood is affected by the current topic and is very happy. Utterance 6 still discusses the current topic, but its semantics have a greater impact on its emotion, and we distinguish its emotion correctly, which shows that the attention mechanism integrating the syntax and topic modules works. Utterances 7 to 9 are greetings with no emotional change, but two new speakers join the conversation. Utterance 10 starts a new topic: Monica asks Joey why Richard did not come, and Joey answers Monica’s question. Utterances 10 to 12 are all exchanges between Monica and Joey, but Chandler praises Joey’s boyfriend in utterance 13, so Joey’s interaction partner in utterance 14 changes from Monica to Chandler, and utterance 15 turns back to the dialogue with Monica. The overall dialogue logic is very complex, and the chat topic and interaction partners change many times, yet our model still accurately assigns the correct emotion labels.

6. Conclusions

In this paper, we study the task of emotion recognition in multi-speaker multi-turn conversations and propose joint syntax-enhanced and topic-driven graph networks for emotion recognition. We use the pre-trained model $BERT_{base}$ to obtain word embeddings, learn the original meaning of each sentence, and then integrate the syntactic structure information and topic information of the sentence to enrich the latent derived meaning of the utterance in the conversation. The model further builds a dialogue interaction module that uses topic information to segment the content, addressing long conversations in the multi-speaker setting. We also simulate the emotional interaction between speakers to improve the classification effect. Experimental results on four benchmark datasets demonstrate that our proposed method achieves state-of-the-art results on this task.
Our model is 4% more accurate than the best-performing baseline model CoG-BART on IEMOCAP, which has the longest dialogues, and it also performs well on public datasets with shorter dialogues. Since SETD-ERC is modeled for multi-speaker dialogue text, in future work we will explore its effectiveness on other dialogue-based tasks. Furthermore, irony detection has been neglected in our existing work, and we hope to address this issue in the future. We also want to add a knowledge graph to further refine our method, as it can reduce the limitations in the construction of speaker characters.

Author Contributions

Conceptualization, H.Y. and T.M.; methodology, H.Y. and L.J.; software, H.Y.; validation, H.Y. and N.A.-N.; formal analysis, H.Y. and M.M.A.W.; supervision, L.J., N.A.-N. and M.M.A.W.; investigation, H.Y. and N.A.-N.; resources, H.Y. and N.A.-N.; data curation, H.Y. and M.M.A.W.; writing—original draft preparation, H.Y.; writing—review and editing, H.Y., T.M. and L.J.; project administration, T.M. and L.J. and M.M.A.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key Research and Development Program of China (2021YFE0104400) and the National Natural Science Foundation of China (No. 62102187).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Bengesi, S.; Oladunni, T.; Olusegun, R.; Audu, H. A Machine Learning-Sentiment Analysis on Monkeypox Outbreak: An Extensive Dataset to Show the Polarity of Public Opinion From Twitter Tweets. IEEE Access 2023, 11, 11811–11826. [Google Scholar] [CrossRef]
  2. Qian, Y.; Wang, J.; Li, D.; Zhang, X. Interactive capsule network for implicit sentiment analysis. Appl. Intell. 2023, 53, 3109–3123. [Google Scholar] [CrossRef]
  3. Mao, Y.; Cai, F.; Guo, Y.; Chen, H. Incorporating emotion for response generation in multi-turn dialogues. Appl. Intell. 2022, 52, 7218–7229. [Google Scholar] [CrossRef]
  4. Alswaidan, N.; Menai, M.E.B. A survey of state-of-the-art approaches for emotion recognition in text. Knowl. Inf. Syst. 2020, 62, 2937–2987. [Google Scholar] [CrossRef]
  5. Birjali, M.; Kasri, M.; Hssane, A.B. A comprehensive survey on sentiment analysis: Approaches, challenges and trends. Knowl. Based Syst. 2021, 226, 107134. [Google Scholar] [CrossRef]
  6. Chen, F.; Huang, Y. Knowledge-enhanced neural networks for sentiment analysis of Chinese reviews. Neurocomputing 2019, 368, 51–58. [Google Scholar] [CrossRef]
  7. Xu, G.; Meng, Y.; Qiu, X.; Yu, Z.; Wu, X. Sentiment Analysis of Comment Texts Based on BiLSTM. IEEE Access 2019, 7, 51522–51532. [Google Scholar] [CrossRef]
  8. Tian, L.; Moore, J.D.; Lai, C. Emotion recognition in spontaneous and acted dialogues. In Proceedings of the 2015 International Conference on Affective Computing and Intelligent Interaction, ACII 2015, Xi’an, China, 21–24 September 2015; pp. 698–704. [Google Scholar] [CrossRef] [Green Version]
  9. Cai, Y.; Cai, H.; Wan, X. Multi-Modal Sarcasm Detection in Twitter with Hierarchical Fusion Model. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, 28 July–2 August 2019; Volume 1, pp. 2506–2515. [Google Scholar] [CrossRef]
  10. Qin, L.; Li, Z.; Che, W.; Ni, M.; Liu, T. Co-GAT: A Co-Interactive Graph Attention Network for Joint Dialog Act Recognition and Sentiment Classification. In Proceedings of the Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Thirty-Third Conference on Innovative Applications of Artificial Intelligence, IAAI 2021, The Eleventh Symposium on Educational Advances in Artificial Intelligence, EAAI 2021, Virtual Event, 2–9 February 2021; pp. 13709–13717. [Google Scholar]
  11. Li, D.; Li, Y.; Wang, S. Interactive double states emotion cell model for textual dialogue emotion prediction. Knowl. Based Syst. 2020, 189, 105084. [Google Scholar] [CrossRef]
  12. Lou, C.; Liang, B.; Gui, L.; He, Y.; Dang, Y.; Xu, R. Affective Dependency Graph for Sarcasm Detection. In Proceedings of the SIGIR ’21: The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, 11–15 July 2021; pp. 1844–1849. [Google Scholar] [CrossRef]
  13. Ouyang, S.; Zhang, Z.; Zhao, H. Dialogue Graph Modeling for Conversational Machine Reading. In Proceedings of the Findings of the Association for Computational Linguistics: ACL/IJCNLP 2021, Online Event, 1–6 August 2021; Volume ACL/IJCNLP, Findings of ACL. pp. 3158–3169. [Google Scholar] [CrossRef]
  14. Thost, V.; Chen, J. Directed Acyclic Graph Neural Networks. In Proceedings of the 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, 3–7 May 2021. [Google Scholar]
  15. Lin, W.; Li, C. Review of Studies on Emotion Recognition and Judgment Based on Physiological Signals. Appl. Sci. 2023, 13, 2573. [Google Scholar] [CrossRef]
  16. Sindhu, C.; Vadivu, G. Fine grained sentiment polarity classification using augmented knowledge sequence-attention mechanism. Microprocess. Microsyst. 2021, 81, 103365. [Google Scholar] [CrossRef]
  17. Remus, R. ASVUniOfLeipzig: Sentiment Analysis in Twitter using Data-driven Machine Learning Techniques. In Proceedings of the 7th International Workshop on Semantic Evaluation, SemEval@NAACL-HLT 2013, Atlanta, GA, USA, 14–15 June 2013; Diab, M.T., Baldwin, T., Baroni, M., Eds.; The Association for Computer Linguistics: Stroudsburg, PA, USA, 2013; pp. 450–454. [Google Scholar]
  18. Jo, A.H.; Kwak, K.C. Speech Emotion Recognition Based on Two-Stream Deep Learning Model Using Korean Audio Information. Appl. Sci. 2023, 13, 2167. [Google Scholar] [CrossRef]
  19. Zhang, D.; Zhu, Z.; Kang, S.; Zhang, G.; Liu, P. Syntactic and semantic analysis network for aspect-level sentiment classification. Appl. Intell. 2021, 51, 6136–6147. [Google Scholar] [CrossRef]
  20. Pang, J.; Rao, Y.; Xie, H.; Wang, X.; Wang, F.L.; Wong, T.; Li, Q. Fast Supervised Topic Models for Short Text Emotion Detection. IEEE Trans. Cybern. 2021, 51, 815–828. [Google Scholar] [CrossRef] [PubMed]
  21. Dieng, A.B.; Ruiz, F.J.R.; Blei, D.M. Topic Modeling in Embedding Spaces. Trans. Assoc. Comput. Linguist. 2020, 8, 439–453. [Google Scholar] [CrossRef]
  22. Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.S.; Dean, J. Distributed Representations of Words and Phrases and their Compositionality. In Proceedings of the Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013, Lake Tahoe, NV, USA, 5–8 December 2013; Burges, C.J.C., Bottou, L., Ghahramani, Z., Weinberger, K.Q., Eds.; 2013; pp. 3111–3119. [Google Scholar]
  23. Devlin, J.; Chang, M.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, (Volume 1: Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; Burstein, J., Doran, C., Solorio, T., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 4171–4186. [Google Scholar] [CrossRef]
  24. Peinelt, N.; Nguyen, D.; Liakata, M. tBERT: Topic Models and BERT Joining Forces for Semantic Similarity Detection. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, 5–10 July 2020; Jurafsky, D., Chai, J., Schluter, N., Tetreault, J.R., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 7047–7055.
  25. Zhu, L.; Pergola, G.; Gui, L.; Zhou, D.; He, Y. Topic-Driven and Knowledge-Aware Transformer for Dialogue Emotion Detection. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021 (Volume 1: Long Papers), Virtual Event, 1–6 August 2021; Zong, C., Xia, F., Li, W., Navigli, R., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 1571–1582.
  26. Wang, Y.; Huang, M.; Zhu, X.; Zhao, L. Attention-based LSTM for Aspect-level Sentiment Classification. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, TX, USA, 1–4 November 2016; Su, J., Carreras, X., Duh, K., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2016; pp. 606–615.
  27. Huang, B.; Carley, K.M. Syntax-Aware Aspect Level Sentiment Classification with Graph Attention Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, 3–7 November 2019; Inui, K., Jiang, J., Ng, V., Wan, X., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 5468–5476.
  28. Jia, X.; Wang, L. Attention enhanced capsule network for text classification by encoding syntactic dependency trees with graph convolutional neural network. PeerJ Comput. Sci. 2022, 8, e831.
  29. Al-Shaikh, A.; Mahafzah, B.A.; Alshraideh, M. Hybrid harmony search algorithm for social network contact tracing of COVID-19. Soft Comput. 2023, 27, 3343–3365.
  30. Majumder, N.; Poria, S.; Hazarika, D.; Mihalcea, R.; Gelbukh, A.F.; Cambria, E. DialogueRNN: An Attentive RNN for Emotion Detection in Conversations. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, the Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, and the Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, HI, USA, 27 January–1 February 2019; pp. 6818–6825.
  31. Ghosal, D.; Majumder, N.; Poria, S.; Chhaya, N.; Gelbukh, A.F. DialogueGCN: A Graph Convolutional Neural Network for Emotion Recognition in Conversation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, 3–7 November 2019; pp. 154–164.
  32. Wan, H.; Tang, P.; Tian, B.; Yu, H.; Jin, C.; Zhao, B.; Wang, H. Water Extraction in PolSAR Image Based on Superpixel and Graph Convolutional Network. Appl. Sci. 2023, 13, 2610.
  33. Lee, B.; Choi, Y.S. Graph Based Network with Contextualized Representations of Turns in Dialogue. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event/Punta Cana, Dominican Republic, 7–11 November 2021; pp. 443–455.
  34. Zhang, D.; Wu, L.; Sun, C.; Li, S.; Zhu, Q.; Zhou, G. Modeling both Context- and Speaker-Sensitive Dependence for Emotion Detection in Multi-speaker Conversations. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, Macao, China, 10–16 August 2019; pp. 5415–5421.
  35. Shen, W.; Chen, J.; Quan, X.; Xie, Z. DialogXL: All-in-One XLNet for Multi-Party Conversation Emotion Recognition. In Proceedings of the Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, the Thirty-Third Conference on Innovative Applications of Artificial Intelligence, IAAI 2021, and the Eleventh Symposium on Educational Advances in Artificial Intelligence, EAAI 2021, Virtual Event, 2–9 February 2021; pp. 13789–13797.
  36. Yang, Z.; Dai, Z.; Yang, Y.; Carbonell, J.G.; Salakhutdinov, R.; Le, Q.V. XLNet: Generalized Autoregressive Pretraining for Language Understanding. In Proceedings of the Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, Vancouver, BC, Canada, 8–14 December 2019; pp. 5754–5764.
  37. Sun, Y.; Yu, N.; Fu, G. A Discourse-Aware Graph Neural Network for Emotion Recognition in Multi-Party Conversation. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2021, Virtual Event, 16–20 November 2021; pp. 2949–2958.
  38. Kingma, D.P.; Welling, M. Auto-Encoding Variational Bayes. In Proceedings of the 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, 14–16 April 2014.
  39. Eger, S.; Youssef, P.; Gurevych, I. Is it time to swish? Comparing deep learning activation functions across NLP tasks. arXiv 2019, arXiv:1901.02671.
  40. Abuqaddom, I.; Mahafzah, B.A.; Faris, H. Oriented stochastic loss descent algorithm to train very deep multi-layer neural networks without vanishing gradients. Knowl.-Based Syst. 2021, 230, 107391.
  41. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, 7–9 May 2015.
  42. Busso, C.; Bulut, M.; Lee, C.; Kazemzadeh, A.; Mower, E.; Kim, S.; Chang, J.N.; Lee, S.; Narayanan, S.S. IEMOCAP: Interactive emotional dyadic motion capture database. Lang. Resour. Eval. 2008, 42, 335–359.
  43. Zhong, P.; Wang, D.; Miao, C. Knowledge-Enriched Transformer for Emotion Detection in Textual Conversations. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, 3–7 November 2019; pp. 165–176.
  44. Poria, S.; Hazarika, D.; Majumder, N.; Naik, G.; Cambria, E.; Mihalcea, R. MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, 28 July–2 August 2019; Volume 1, pp. 527–536.
  45. Li, Y.; Su, H.; Shen, X.; Li, W.; Cao, Z.; Niu, S. DailyDialog: A Manually Labelled Multi-turn Dialogue Dataset. In Proceedings of the Eighth International Joint Conference on Natural Language Processing, IJCNLP 2017, Taipei, Taiwan, 27 November–1 December 2017; Volume 1, pp. 986–995.
  46. Zahiri, S.M.; Choi, J.D. Emotion Detection on TV Show Transcripts with Sequence-Based Convolutional Neural Networks. In Proceedings of the Workshops of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; pp. 44–52.
  47. Shen, W.; Wu, S.; Yang, Y.; Quan, X. Directed Acyclic Graph Network for Conversational Emotion Recognition. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021 (Volume 1: Long Papers), Virtual Event, 1–6 August 2021; pp. 1551–1560.
  48. Van Rossum, G.; Drake, F.L. Python Reference Manual; Centrum voor Wiskunde en Informatica: Amsterdam, The Netherlands, 1995.
  49. Jiao, W.; Yang, H.; King, I.; Lyu, M.R. HiGRU: Hierarchical Gated Recurrent Units for Utterance-Level Emotion Recognition. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, 2–7 June 2019; Volume 1, pp. 397–406.
  50. Ghosal, D.; Majumder, N.; Gelbukh, A.F.; Mihalcea, R.; Poria, S. COSMIC: COmmonSense knowledge for eMotion Identification in Conversations. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2020, Online Event, 16–20 November 2020; pp. 2470–2481.
  51. Hu, D.; Wei, L.; Huai, X. DialogueCRN: Contextual Reasoning Networks for Emotion Recognition in Conversations. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021 (Volume 1: Long Papers), Virtual Event, 1–6 August 2021; pp. 7042–7052.
  52. Li, S.; Yan, H.; Qiu, X. Contrast and Generation Make BART a Good Dialogue Emotion Recognizer. In Proceedings of the Thirty-Sixth AAAI Conference on Artificial Intelligence, AAAI 2022, the Thirty-Fourth Conference on Innovative Applications of Artificial Intelligence, IAAI 2022, and the Twelfth Symposium on Educational Advances in Artificial Intelligence, EAAI 2022, Virtual Event, 22 February–1 March 2022; pp. 11002–11010.
Figure 1. Overview of our proposed approach SETD-ERC, which consists of a syntax module, a topic module, and a dialogue interaction module. The syntax and topic modules integrate dependency-parse-based GCNs and topic feature extraction to enrich the utterance representation, while the dialogue interaction module applies topic segmentation and builds dialogue graphs to model emotional interactions.
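To make the data flow of Figure 1 concrete, the following is a minimal PyTorch sketch of how the three modules could be composed in one forward pass. It is an illustrative skeleton only, not the released implementation: the class name SETDERC, the stand-in layers, and the tensor shapes are assumptions, and each stand-in would be replaced by the dependency GCN, the VAE-based topic extractor, and the heterogeneous dialogue graph described in the paper.

```python
import torch
import torch.nn as nn

class SETDERC(nn.Module):
    """Hypothetical skeleton of the SETD-ERC pipeline in Figure 1.
    Each sub-module is reduced to a stand-in layer so the sketch runs."""

    def __init__(self, hidden=768, n_topics=50, n_emotions=7):
        super().__init__()
        self.syntax = nn.Linear(hidden, hidden)        # stand-in for the dependency GCN
        self.topic = nn.Linear(hidden, n_topics)       # stand-in for the VAE topic encoder
        self.fuse = nn.Linear(hidden + n_topics, hidden)
        self.dialogue = nn.GRU(hidden, hidden, batch_first=True)  # stand-in for the dialogue graph
        self.classifier = nn.Linear(hidden, n_emotions)

    def forward(self, utter_emb):                      # (batch, n_utterances, hidden)
        syn = torch.relu(self.syntax(utter_emb))       # syntax-enhanced utterance features
        top = torch.softmax(self.topic(utter_emb), -1) # per-utterance topic distribution
        fused = torch.relu(self.fuse(torch.cat([syn, top], dim=-1)))
        ctx, _ = self.dialogue(fused)                  # speaker/context interaction
        return self.classifier(ctx)                    # per-utterance emotion logits

# toy check: 1 dialogue, 5 utterances, 768-d sentence embeddings
logits = SETDERC()(torch.randn(1, 5, 768))
print(logits.shape)  # torch.Size([1, 5, 7])
```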
Figure 2. An example of a syntactic dependency tree.
Figure 3. A sample CSG. The utterances of the five speakers are shown in different colors; red lines denote edges of type e0, and black lines denote edges of type e1.
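The sketch below gives a rough, illustrative sense of how a conversation graph like the one in Figure 3 could be assembled. The mapping of edge types to relations is an assumption made only for this example: e0 is taken to connect utterances by the same speaker and e1 to connect utterances by different speakers, with both restricted to pairs inside the same topic segment; the actual edge semantics and weights are those defined by the dialogue interaction module.

```python
from itertools import combinations

def build_csg(speakers, segment_ids):
    """Toy construction of a conversation graph.

    speakers[i]    -- speaker of utterance i
    segment_ids[i] -- topic segment of utterance i (from topic segmentation)
    Returns two edge lists, e0 (assumed same-speaker) and e1 (assumed
    cross-speaker), restricted to pairs inside the same topic segment.
    """
    e0, e1 = [], []
    for i, j in combinations(range(len(speakers)), 2):
        if segment_ids[i] != segment_ids[j]:
            continue                       # long-range pairs are cut by segmentation
        if speakers[i] == speakers[j]:
            e0.append((i, j))              # self-dependency edge
        else:
            e1.append((i, j))              # inter-speaker edge
    return e0, e1

# five utterances, three speakers, two topic segments
speakers    = ["A", "B", "A", "C", "B"]
segment_ids = [0, 0, 0, 1, 1]
print(build_csg(speakers, segment_ids))    # ([(0, 2)], [(0, 1), (1, 2), (3, 4)])
```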
Table 1. Explanation of the tokens and relationship types shown in Figure 2.
Token | Part of Speech | Tag | Relationship Type
NOUN | Noun: words that specify a person, place, thing, animal, or idea. | amod | adjective modifier
ADJ | Adjective: words that typically modify nouns. | compound | compound word
VERB | Verb: words that signal events and actions. | nsubj | nominal subject
DET | Determiner: articles and other words that specify a particular noun phrase. | dobj | direct object
ADP | Adposition: the head of a prepositional or postpositional phrase. | det | determiner
– | – | prep | prepositional modifier
– | – | pobj | prepositional object
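The relation labels listed in Table 1 (amod, nsubj, dobj, prep, pobj, etc.) are standard dependency-parse labels. The snippet below is a small sketch of how such a parse can be obtained and turned into the symmetric adjacency matrix that a GCN over the dependency tree would consume; it assumes spaCy with its small English model purely for illustration, since any parser emitting the same label set would serve.

```python
import numpy as np
import spacy

nlp = spacy.load("en_core_web_sm")   # requires: python -m spacy download en_core_web_sm

def dependency_adjacency(sentence):
    """Parse a sentence and return (tokens, relations, adjacency matrix).
    Each dependency arc becomes an undirected edge, and self-loops are
    added, as is common before feeding a GCN."""
    doc = nlp(sentence)
    adj = np.eye(len(doc))                       # self-loops
    for tok in doc:
        if tok.dep_ != "ROOT":
            adj[tok.i, tok.head.i] = 1.0         # child -> head
            adj[tok.head.i, tok.i] = 1.0         # head -> child (symmetric)
    rels = [(tok.text, tok.dep_, tok.head.text) for tok in doc]
    return [tok.text for tok in doc], rels, adj

tokens, rels, adj = dependency_adjacency("The casting guy sent the whole script on real paper.")
print(rels)        # e.g. ('guy', 'nsubj', 'sent'), ('script', 'dobj', 'sent'), ...
print(adj.shape)   # one row/column per token
```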
Table 2. Statistics of the four datasets.
Dataset | Conversations (Train / Val / Test) | Utterances (Train / Val / Test)
IEMOCAP | 120 / – / 31 | 5810 / – / 1623
MELD | 1038 / 114 / 280 | 9989 / 1109 / 2610
DailyDialog | 11,118 / 1000 / 1000 | 87,170 / 8069 / 7740
EmoryNLP | 713 / 99 / 85 | 9934 / 1344 / 1328
(For IEMOCAP, the train and validation splits are not separated in the source table.)
Table 3. Overall performance on the four datasets.
Model | IEMOCAP (Weighted-Avg-F1) | DailyDialog (Micro-F1) | MELD (Weighted-Avg-F1) | EmoryNLP (Weighted-Avg-F1)
HiGRU (2019) | 59.79 | 52.01 | 56.92 | 31.88
DialogueRNN (2019) | 62.75 | 50.65 | 57.03 | 31.27
COSMIC (2020) | 63.05 | 56.16 | 64.28 | 37.10
DAG-ERC (2021) | 68.03 | 59.33 | 63.65 | 39.02
ERMC-DisGCN (2021) | 64.1 | * | 64.22 | 36.38
DialogueGCN (2019) | 64.18 | * | 58.1 | 33.85
DialogXL (2021) | 65.94 | 54.93 | 62.41 | 34.73
DialogueCRN (2021) | 66.2 | * | 58.39 | *
CoG-BART (2022) | 66.18 | 56.29 | 64.81 | 39.04
Ours | 70.92 | 57.07 | 66.13 | 40.97
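Two different scores appear in Table 3: weighted-average F1 for IEMOCAP, MELD, and EmoryNLP, and micro-F1 for DailyDialog, where the dominant neutral class is conventionally excluded. The snippet below shows how both can be computed with scikit-learn; the toy label lists are invented for illustration and do not correspond to any dataset.

```python
from sklearn.metrics import f1_score

# toy gold/predicted emotion labels for a handful of utterances
y_true = ["joy", "neutral", "anger", "sadness", "neutral", "joy"]
y_pred = ["joy", "neutral", "neutral", "sadness", "anger", "joy"]

# weighted-average F1: per-class F1 weighted by class support (IEMOCAP, MELD, EmoryNLP)
print(f1_score(y_true, y_pred, average="weighted"))

# micro-F1 restricted to non-neutral classes, the usual DailyDialog convention
emotion_labels = ["joy", "anger", "sadness"]   # neutral deliberately left out
print(f1_score(y_true, y_pred, labels=emotion_labels, average="micro"))
```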
Table 4. Results of the ablation study on the four datasets.
Model | IEMOCAP (Weighted-Avg-F1) | DailyDialog (Micro-F1) | MELD (Weighted-Avg-F1) | EmoryNLP (Weighted-Avg-F1)
Ours | 70.92 | 57.07 | 66.13 | 40.97
-syntax | 69.02 | 55.14 | 64.98 | 39.75
-topic | 67.10 | 55.89 | 65.16 | 37.52
-syntax, topic | 66.63 | 54.93 | 64.04 | 37.01
-dialogue graph | 61.81 | 55.18 | 63.37 | 35.59
BERT-base | 59.45 | 53.12 | 56.21 | 33.15
Table 5. Results of the four case studies; “ID” is the order of the utterance within its dialogue. Incorrect predictions are marked with an asterisk (*).
Num | ID | Speaker | Text | Emotion | Prediction
1 | 08 | Joey Tribbiani | I know...watch Green Acres the way it was meant to be seen. | Joyful | Joyful
  | 09 | Joey Tribbiani | Uh-huh. | Joyful | Joyful
2 | 03 | Chandler Bing | Get out! | Mad | Mad
  | … | | | |
  | 07 | Chandler Bing | Uh-huh. | Mad | Mad
3 | 03 | Casting Guy | Excuse me, that’s 50 bucks? | Mad | Mad
  | … | | | |
  | 06 | Joey Tribbiani | Ohh, you know what it is? ...they’d send over the whole script on real paper and everything. | Peaceful | Peaceful
  | 07 | Casting Guy | That’s great. | Mad | Peaceful *
4 | 02 | Phoebe Buffay | Hey. | Neutral | Neutral
  | 03 | Rachel Green | Hey Phoebs, whatcha got there? | Neutral | Neutral
  | 04 | Phoebe Buffay | Ok, Love Story, Brian’s Song, and Terms of Endearment. | Neutral | Neutral
  | 05 | Monica Geller | Wow, all you need now is The Killing Fields and some guacamole and you’ve got yourself a part-ay. | Joyful | Joyful
  | 06 | Phoebe Buffay | Yeah, I talked to my grandma about the Old Yeller incident, and she told me that my mom used to not show the ends of sad movies to shield us from the pain and sadness. You know, before she killed herself. | Sad | Sad
  | 07 | Chandler Bing | Hey. | Neutral | Neutral
  | 08 | Joey Tribbiani | Hey. | Neutral | Neutral
  | 09 | Rachel Green | Hey. | Neutral | Neutral
  | 10 | Monica Geller | Hey. Where is he, where’s Richard? Did you ditch him? | Neutral | Neutral
  | 11 | Joey Tribbiani | Yeah right after we stole his lunch money and gave him a wedgie. What’s the matter with you, he’s parking the car. | Mad | Mad
  | 12 | Monica Geller | So’d you guys have fun? | Neutral | Neutral
  | 13 | Chandler Bing | Your boyfriend is so cool. | Joyful | Joyful
  | 14 | Joey Tribbiani | Really? | Neutral | Neutral
  | 15 | Chandler Bing | Yeah, he let us drive his Jaguar. Joey for 12 blocks, me for 15. | Neutral | Neutral
