Article

Customer Sentiment Recognition in Conversation Based on Contextual Semantic and Affective Interaction Information

Zhengwei Huang, Huayuan Liu, Jun Zhu and Jintao Min

1 College of Economics and Management, China Three Gorges University, Yichang 443002, China
2 Economics and Management Department, Guangxi Minzu Normal University, Chongzuo 532200, China
3 College of Computer and Information Technology, China Three Gorges University, Yichang 443002, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(13), 7807; https://doi.org/10.3390/app13137807
Submission received: 5 May 2023 / Revised: 18 June 2023 / Accepted: 30 June 2023 / Published: 2 July 2023

Abstract

In the e-commerce environment, conversations between customers and businesses contain a wealth of useful information about customer sentiment. By mining that information, customer sentiment can be validly identified, which helps to accurately identify customer needs and improve customer satisfaction. Contextual semantic information and inter-speaker affective interaction information are two key factors for identifying customers' sentiments in conversation. Unfortunately, none of the existing approaches consider the two factors simultaneously. In this paper, we propose a conversational sentiment analysis method based on contextual semantic and affective interaction information. The proposed approach uses different bidirectional gated recurrent units (BiGRUs) combined with attention mechanisms to encode the contextual semantic information of different types of conversational texts. For modeling affective interactions, we use directed graph structures to portray the affective interactions between speakers and encode them into affective interaction features using graph convolutional neural networks (GCNs). Finally, the two features are fused to recognize customer sentiment. The experimental results on the JDDC dataset show that our model can more accurately recognize customer sentiment than other baseline models in customer service conversation.

1. Introduction

Recently, with the development of artificial intelligence, conversation systems have gradually been moving toward practical application. Among them, customer service chatbots are widely used in online business environments, such as Amazon's Alexa [1], JD.com's JIMI [2], and Alibaba's AliMe [3]. The content of a customer service chatbot usually comes from a corpus of web-based knowledge [4], which allows it to answer accurately enough, but its emotional intelligence remains a weak point. Against this background, customer sentiment analysis is an important task in customer service conversation and a prerequisite for the dialogue system to generate empathetic replies. In this study, we focus on customer sentiment recognition in conversation, which aims to assign a proper sentiment label to each utterance in a customer service conversation.
Recognizing customer sentiment in conversations depends heavily on the contextual semantic information of the conversation and on the affective interaction information between the speakers [5,6]. Take the conversation in Figure 1 as an example: if only the current utterance is considered, the third utterance expresses a neutral sentiment. However, when the contextual information of the conversation is considered, the third utterance expresses a negative sentiment, which can be inferred from the fourth utterance, indicating that the customer is complaining about the long waiting time for delivery. Therefore, it is necessary to mine contextual semantic information, which can enhance, weaken, or reverse the original sentiment of the current utterance [7]. In addition, the affective tendency of the conversation is also reflected in the affective interaction between the speakers. For instance, customers express strong dissatisfaction (i.e., the sixth utterance) when they perceive the agent's response as perfunctory (i.e., the fifth utterance). The agent provides a more positive response (i.e., the seventh utterance) after sensing a very negative customer sentiment. Thus, modeling the affective interaction information between speakers can improve the accuracy of the sentiment recognition model to a certain extent.
To address the characteristics of customer service conversation, this paper proposes a conversational sentiment analysis model based on contextual semantic and affective interaction information, which fully exploits the value of contextual semantics and of the affective interactions between speakers. Specifically, to extract the contextual semantic information, this paper uses different BiGRU networks to contextually encode the customer utterance text and the agent utterance text. The two encodings are then fused into the global dialogue to encode the global context, achieving a deep and refined mining of different context types. To model the affective interaction information between speakers, we construct a directed graph structure to portray the affective interaction dependencies between interlocutors and use graph convolutional neural networks to extract the affective interaction features of each speaker with respect to both themselves and their interlocutors.
This paper makes three main contributions. First, we propose a model that considers contextual semantics and the affective interaction information between speakers to recognize customer sentiment in conversation, which can effectively use the relevant information of each utterance. Second, we propose a contextual semantic encoder and an affective interaction encoder to capture different speakers' contextual information and inter-speaker affective interaction information, respectively. Third, experimental results demonstrate that the performance of our model is superior to that of the baseline models for sentiment analysis on the JDDC dataset.

2. Related Work

Conversational sentiment analysis differs from sentence-level and chapter-level sentiment analysis in that it needs to consider the contextual background of the conversation and the affective interactions between different speakers [6,8]. Therefore, contextual relevance models and speaker relevance models are two entry points for conversational sentiment analysis. In addition, to improve the validity of the models, some scholars also assisted modeling with other information, such as sentiment congruence and conversation topic information.
Contextual correlation models [9] mainly use the contextual information in conversations to improve sentiment analysis. Poria et al. [9] proposed a cLSTM model that enables utterances to capture contextual information from their surroundings: it first trains a separate CNN model to obtain the vector representations of all utterances in a conversation, then feeds these representations into an LSTM model to obtain context-aware utterance representations, and finally recognizes the utterance sentiment. Zhong et al. [10] proposed a knowledge-enriched transformer (KET) that effectively incorporates contextual information and external knowledge bases to identify sentiment, using self-attention and cross-attention modules to exploit contextual information more efficiently. Tang et al. [11] proposed a target-dependent sentiment classification method, which mainly uses target-dependent long short-term memory to capture contextual information and model the relatedness of a target word with its context words. Basiri et al. [12] utilized two independent bidirectional LSTM and GRU layers to extract both past and future contexts by combining the forward and backward hidden layers, and used an attention mechanism to focus on the important parts.
Speaker-related models [13] focus on the speakers' states and interactions while also considering the conversational context. Hazarika et al. [13] proposed the CMN model, which stores the context information in memories and models the interplay of these memories to capture inter-speaker dependencies; the CMN model selects separate memory units for different speakers. Building on this model, Hazarika et al. [14] proposed the ICON model, which uses a unified memory unit, produces separate textual representations of the historical discourse of different speakers using the SIM module, and then globally integrates them through the DGIM module. Majumder et al. [15] proposed the DialogueRNN model, which uses three GRU networks (a speaker-state GRU, a global-state GRU, and a sentiment-state GRU) to capture speaker information, historical discourse, and sentiment information, respectively. Zhang et al. [16] proposed a graph convolutional neural network-based model, ConGCN, which uses each speaker and their corresponding utterance representations as nodes in a graph and associates them through undirected edges. Ghosal et al. [17] also used graph convolutional neural networks to model conversations, but their DialogueGCN model contains only utterance nodes and indirectly associates them with previous speakers by deciding the type of each edge based on the speaker information.
Current research has shown that the approach of incorporating the speaker’s affective interaction or contextual information is an effective means to improve the accuracy of conversational sentiment recognition. Based on this, some scholars also assisted modeling with other information, such as sentiment consistency and conversation topic information, which can help to improve the accuracy of sentiment recognition. For sentiment consistency modeling, Wang et al. [18] transformed the conversation sentiment recognition task into a sequence-tagging task for sentiment consistency modeling, using a conditional random field model layer to learn the sentiment consistency in conversations. Gao et al. [19] proposed a multi-task learning model, which uses sentiment transfer detection as a secondary task to help achieve conversational sentiment recognition. Zhu et al. [20] proposed a conversation sentiment recognition method from the perspective of dialogue topic information modeling, which achieves conversation sentiment recognition by combining topic-enhanced information with a knowledge base of conversation contextual information.
Our model differs from the above in several ways. Firstly, we model from both contextual and affective interactions between speakers to further explore the information. Secondly, we encode the context information of different speakers and fuse this information into global context information, which can extract the context information more accurately.

3. Methods

The architecture of our model is shown in Figure 2; it consists of a conversation text encoder, a context semantic encoder, an affective interaction encoder, and a sentiment classifier. First, the conversation text encoder uses the ERNIE pre-trained model to obtain a high-dimensional feature representation of each utterance. Then, from the perspective of the different speakers, separate BiGRU networks encode the utterance representations to obtain speaker-specific context representations. Next, we fuse the context representations of the different speakers and input them into a new BiGRU network to obtain the global contextual semantic information. Meanwhile, we construct a directed graph structure to characterize the affective interaction dependence between speakers and use a GCN to encode the interaction dependence of each speaker with both themselves and their interlocutors. Finally, a SoftMax classifier is applied to the fused conversation context and affective interaction features, which yields one of five sentiment classes, i.e., very negative, negative, neutral, positive, and very positive. The details of the proposed model are described in Section 3.1, Section 3.2, Section 3.3 and Section 3.4.

3.1. Conversation Text Encoder

The main function of the conversation text encoder is to obtain the utterance vector representations. In the field of NLP, converting text into a representation that computers can process is a fundamental step for every task. Existing studies have shown that language representation models pre-trained on large-scale corpora improve the ability of models to represent textual features and enhance the performance of downstream NLP tasks [21].
To extract the utterance representations of the conversation, our model employs the pre-trained ERNIE 3.0 [22] model, which has 12 encoder layers, 768 hidden states, and 12 self-attention heads. Specifically, the input text $\{x_1, x_2, \ldots, x_m\}$ is preprocessed by the ERNIE model into word-based vectors with 768 dimensions and is transformed into $\{[cls], x_1, \ldots, x_m, [cls]\}$, where $x_i$ is the word at position $i$ in the utterance, and the first and last $[cls]$ tokens mark the start and end positions of the utterance, respectively.
The ERNIE 3.0 model then computes the utterance representations, as shown in Equation (1):

$$u_i^A, u_i^B = \mathrm{ERNIE}\left(U_i^A, U_i^B\right) \qquad (1)$$

where $A$ and $B$ represent the two speakers, i.e., the customer and the agent, respectively; $U_i$ denotes the unencoded utterance; and $u_i$ denotes the encoded utterance at moment $i$. $u_i^A$ and $u_i^B$ denote the ERNIE-encoded utterances of speaker $A$ and speaker $B$ at moment $i$, respectively.
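To make this step concrete, the following is a minimal sketch of how 768-dimensional utterance vectors of this form could be obtained with a pre-trained ERNIE encoder through the HuggingFace Transformers library. The checkpoint name and the choice of the first-token hidden state as the utterance vector are illustrative assumptions, not details taken from the paper.

```python
# Sketch: obtain 768-dimensional utterance representations with a pre-trained ERNIE model.
# The checkpoint name and the first-token pooling are assumptions for illustration.
import torch
from transformers import AutoTokenizer, AutoModel

MODEL_NAME = "nghuyong/ernie-3.0-base-zh"  # assumed ERNIE 3.0 checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME)

def encode_utterances(utterances):
    """Return a (num_utterances, 768) tensor of utterance vectors u_i."""
    batch = tokenizer(utterances, padding=True, truncation=True,
                      max_length=128, return_tensors="pt")
    with torch.no_grad():
        outputs = encoder(**batch)
    # Take the hidden state of the first token as the utterance representation.
    return outputs.last_hidden_state[:, 0, :]

u_A = encode_utterances(["Why has my order not shipped yet?"])   # customer utterances
u_B = encode_utterances(["Sorry for the wait, let me check."])   # agent utterances
```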

3.2. Context Semantic Encoder

In customer service conversation scenarios, the sentiment of each utterance relates to the contextual information because of the continuity of the dialogue. Therefore, encoding contextual semantic information is critical for conversational sentiment analysis.
To obtain the contextual semantic information of the conversation, we use different BiGRU networks to extract different contextual semantic features and then use an attention mechanism to weight them. Specifically, in the customer service conversation scenario, the conversation text is divided into two types according to the speaker identity, i.e., customer utterances and agent utterances, and we use a different BiGRU network to encode each type. In addition, in order to model the contextual semantic association between speakers, this paper inputs the contextual semantic features of the above two encodings into a new BiGRU network to obtain a global contextual semantic encoding. The calculation process is as follows:
$$\overrightarrow{g_t} = \overrightarrow{\mathrm{GRU}}\left(\overrightarrow{g_{t-1}}, u_t\right) \qquad (2)$$

$$\overleftarrow{g_t} = \overleftarrow{\mathrm{GRU}}\left(\overleftarrow{g_{t+1}}, u_t\right) \qquad (3)$$

$$g_t = \left[\overrightarrow{g_t}, \overleftarrow{g_t}\right], \quad t = 1, 2, \ldots, K \qquad (4)$$

where $u_t$ is the utterance representation produced by the ERNIE pre-trained model; $\overrightarrow{g_t}$ and $\overleftarrow{g_t}$ are the output vectors of the forward and backward GRU networks, respectively; and $g_t$ is the utterance representation that contains contextual semantic information.
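As a concrete illustration, the sketch below implements the speaker-specific BiGRUs and the global BiGRU of Equations (2)-(4) in PyTorch. The hidden size, the boolean speaker mask, and the way each speaker's utterances are gathered and scattered back into dialogue order are assumptions made for the example.

```python
import torch
import torch.nn as nn

class ContextSemanticEncoder(nn.Module):
    """Sketch: speaker-specific BiGRUs followed by a global BiGRU (Equations (2)-(4))."""
    def __init__(self, d_in=768, d_hid=128):
        super().__init__()
        self.gru_customer = nn.GRU(d_in, d_hid, bidirectional=True, batch_first=True)
        self.gru_agent = nn.GRU(d_in, d_hid, bidirectional=True, batch_first=True)
        self.gru_global = nn.GRU(2 * d_hid, d_hid, bidirectional=True, batch_first=True)
        self.d_hid = d_hid

    def forward(self, utt_vecs, is_customer):
        # utt_vecs: (K, d_in) utterance vectors in dialogue order
        # is_customer: (K,) bool tensor, True where the utterance is from the customer
        K = utt_vecs.size(0)
        mixed = torch.zeros(K, 2 * self.d_hid)
        for gru, mask in ((self.gru_customer, is_customer),
                          (self.gru_agent, ~is_customer)):
            if mask.any():
                enc, _ = gru(utt_vecs[mask].unsqueeze(0))  # encode one speaker's stream
                mixed[mask] = enc.squeeze(0)               # scatter back into dialogue order
        g, _ = self.gru_global(mixed.unsqueeze(0))         # global context over all turns
        return g.squeeze(0)                                # (K, 2 * d_hid)

# Example with illustrative shapes only:
# encoder = ContextSemanticEncoder()
# g = encoder(torch.randn(6, 768), torch.tensor([1, 0, 1, 0, 1, 0], dtype=torch.bool))
```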
In addition, considering the long-distance dependence between the semantics of the conversation context in customer service conversations, and the fact that context utterances at different distances contribute differently to the semantics of the current utterance, we use an attention mechanism to weight the contextual semantic information of different utterances. The details are shown in Figure 3.
Specifically, we use the BiGRU network to encode the utterance and obtain the vector representation, and then match it with the vector representation of other utterances to obtain the corresponding attention weights. The calculation process is as follows:
$$\alpha_t = \mathrm{softmax}\left(u_t^{\mathrm{T}} W_\alpha \left[g_1, g_2, \ldots, g_{t-1}\right]\right) \qquad (5)$$

$$\mathrm{softmax}(g) = \left[\frac{e^{g_1}}{\sum_i e^{g_i}}, \frac{e^{g_2}}{\sum_i e^{g_i}}, \ldots\right] \qquad (6)$$

where $g_1, g_2, \ldots, g_{t-1}$ are the vector representations of the utterances encoded by the BiGRU network; $\alpha_t$ is the attention weight matrix; and $W_\alpha$ is a parameter learned during training. The context vector representation $c_t$ related to the utterance $g_t$ is obtained as the weighted sum of the utterances:

$$c_t = \alpha_t \left[g_1, g_2, \ldots, g_{t-1}\right]^{\mathrm{T}} \qquad (7)$$
Finally, the context vector representation related to the utterance $g_t$ is input into a fully connected network to obtain a textual representation $G_t$ that fuses the contextual semantic information of the conversation:

$$G_t = \mathrm{ReLU}\left(W_C \left(g_t \oplus c_t\right) + b_C\right) \qquad (8)$$

where ReLU is an activation function, $\oplus$ denotes the vector concatenation operation, and $W_C$ and $b_C$ are trainable model parameters.
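The attention step of Equations (5)-(8) can be sketched as follows; the feature dimension and the handling of the first utterance (which has no preceding context) are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextAttention(nn.Module):
    """Sketch of Equations (5)-(8): attend over preceding context and fuse with g_t."""
    def __init__(self, d=256):
        super().__init__()
        self.w_alpha = nn.Linear(d, d, bias=False)  # W_alpha in Equation (5)
        self.fc = nn.Linear(2 * d, d)               # W_C, b_C in Equation (8)

    def forward(self, g):
        # g: (K, d) context-aware utterance vectors from the BiGRU stack
        G = []
        for t in range(g.size(0)):
            if t == 0:
                c_t = torch.zeros_like(g[0])                   # no preceding context
            else:
                history = g[:t]                                # g_1 ... g_{t-1}
                scores = self.w_alpha(g[t]) @ history.T        # Equation (5)
                alpha = F.softmax(scores, dim=-1)              # Equation (6)
                c_t = alpha @ history                          # Equation (7)
            G.append(F.relu(self.fc(torch.cat([g[t], c_t]))))  # Equation (8)
        return torch.stack(G)                                  # (K, d)
```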

3.3. Affective Interaction Encoder

The affective interaction encoder is mainly responsible for encoding the affective interaction relationship information between speakers in conversation. In this paper, we use directed graphs to characterize the affective interaction relationship between speakers and use GCN to extract the characteristics of affective interaction between speakers.
Specifically, we construct a directed graph $G = (V, E, R, W)$ to characterize the dialogue interaction, where $V$ represents the set of nodes and $E$ represents the set of edges. Each utterance corresponds to a vertex $v_i \in V$, $i = 1, 2, \ldots, K$, where $K$ is the number of utterances. The initialization vector of the vertex $v_i$ is the utterance representation obtained by encoding the utterance with the ERNIE pre-trained model. $r_{ij} \in E$ represents the edge between the pair of utterances $(v_i, v_j)$, and $r \in R$ represents the type of the edge, which is determined by the dependencies between speakers.
This paper adopts a graph attention mechanism based on text similarity to set the weights of the edges, i.e., for each node, the weights of the incoming edges sum to 1. Considering the $m$ utterances $v_{i-1}, v_{i-2}, \ldots, v_{i-m}$ before each node utterance and the $n$ utterances $v_{i+1}, v_{i+2}, \ldots, v_{i+n}$ after it, the weight between node $v_i$ and node $v_j$ is calculated as shown in Equation (9):

$$\alpha_{ij} = \mathrm{softmax}\left(g_i^{\mathrm{T}} W \left[g_{i-m}, \ldots, g_{i+n}\right]\right), \quad j = i-m, \ldots, i+n \qquad (9)$$

where $\alpha_{ij}$ is the attention weight between a node and its neighboring nodes and $W$ is a parameter learned during training.
In customer service conversations, the speakers are affected by each other's emotions. Therefore, we construct the affective interaction-directed graph by defining different types of edge relations. For customer service conversation, the edge relation types are shown in Table 1.
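The following sketch shows one way the relation-typed, windowed edges of such a directed graph could be constructed from the speaker sequence; the integer encoding of the eight relation types and the default window of four past and four future utterances are assumptions consistent with Table 1 and Section 4.3.2, not the authors' exact implementation.

```python
def build_interaction_edges(speakers, past=4, future=4):
    """Sketch: directed, relation-typed edges for the affective interaction graph.

    speakers: list of speaker labels per utterance, e.g. ["P1", "P2", "P1", ...].
    Returns (src, dst, rel) triples. The relation index encodes the ordered
    speaker pair and whether the source precedes the target (cf. Table 1).
    """
    speaker_set = sorted(set(speakers))
    edges = []
    K = len(speakers)
    for j in range(K):                                  # target utterance v_j
        lo, hi = max(0, j - past), min(K, j + future + 1)
        for i in range(lo, hi):                         # source utterance v_i in the window
            pair = (speaker_set.index(speakers[i]), speaker_set.index(speakers[j]))
            precedes = int(i <= j)                      # temporal direction of the edge
            rel = pair[0] * 4 + pair[1] * 2 + precedes  # assumed encoding of the 8 types
            edges.append((i, j, rel))
    return edges

# Example: edges = build_interaction_edges(["P1", "P2", "P1", "P2"], past=2, future=2)
```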
After constructing the directed graph structure, the GCN adopts a two-step convolution operation to fuse the local neighborhood of each node, encoding the speaker-independent node feature vector $g_i$ into a feature vector representation $h_i$ that incorporates the speakers' affective interaction information. The calculation process is as follows:
$$h_t^{(1)} = \sigma\left(\sum_{r \in R} \sum_{s \in Y_t^{r}} \frac{w_{ts}}{c_{t,r}} W_r^{(1)} g_s + w_{tt} W_0^{(1)} g_t\right), \quad t = 1, 2, \ldots, K \qquad (10)$$

$$h_t^{(2)} = \sigma\left(\sum_{s \in Y_t} W^{(2)} h_s^{(1)} + W_0^{(2)} h_t^{(1)}\right), \quad t = 1, 2, \ldots, K \qquad (11)$$

where $\sigma$ is an activation function; $Y_t^{r}$ denotes the neighbors of node $t$ under the edge relation $r$, and $Y_t$ denotes all of its neighbors; $w_{ts}$ is the edge weight from Equation (9); $c_{t,r}$ is a normalization constant; and $W_0^{(1)}$, $W_r^{(1)}$, $W_0^{(2)}$, and $W^{(2)}$ are transformation parameters. Equations (10) and (11) effectively aggregate the affective interaction information of the local neighborhood of each utterance node in the directed graph, and the self-connected edges ensure that each node's own features are also transformed.
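A minimal PyTorch sketch of this two-step relational graph convolution is given below. The choice of sigmoid for the activation σ, the handling of the normalization constant through the precomputed edge weights, and the tensor shapes are assumptions for illustration.

```python
import torch
import torch.nn as nn

class AffectiveInteractionGCN(nn.Module):
    """Sketch of the two-step graph convolution in Equations (10)-(11)."""
    def __init__(self, d=256, num_relations=8):
        super().__init__()
        self.w_rel = nn.ModuleList(nn.Linear(d, d, bias=False) for _ in range(num_relations))
        self.w_self1 = nn.Linear(d, d, bias=False)  # W_0^(1)
        self.w2 = nn.Linear(d, d, bias=False)       # W^(2)
        self.w_self2 = nn.Linear(d, d, bias=False)  # W_0^(2)

    def forward(self, g, edges, edge_weights):
        # g: (K, d) node features; edges: list of (src, dst, rel);
        # edge_weights: dict mapping (src, dst) to the attention weight from Equation (9).
        h1 = torch.zeros_like(g)
        for src, dst, rel in edges:
            w = edge_weights.get((src, dst), 1.0)
            h1[dst] = h1[dst] + w * self.w_rel[rel](g[src])  # relation-specific aggregation
        h1 = torch.sigmoid(h1 + self.w_self1(g))             # first convolution step
        h2 = torch.zeros_like(h1)
        for src, dst, _ in edges:
            h2[dst] = h2[dst] + self.w2(h1[src])             # second, relation-agnostic step
        return torch.sigmoid(h2 + self.w_self2(h1))          # (K, d) speaker-aware features
```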

3.4. Sentiment Classifier

The sentiment classifier fuses the multiple feature encodings to output the sentiment label $X_i$. Specifically, we first carry out the serial aggregation (concatenation) of the conversation context semantic feature encoding $G_i$, the inter-speaker affective interaction feature encoding $h_i$, and the single-utterance semantic feature encoding $e_i$, as shown in Equation (12):

$$M_i = \left[G_i, h_i, e_i\right] \qquad (12)$$

Then, an attention mechanism is used to obtain the final conversation text feature representation $\tilde{M}_i$, as shown in Equations (13) and (14):

$$\beta_i = \mathrm{softmax}\left(M_i^{\mathrm{T}} W_\beta \left[M_1, M_2, \ldots, M_n\right]\right) \qquad (13)$$

$$\tilde{M}_i = \beta_i \left[M_1, M_2, \ldots, M_n\right]^{\mathrm{T}} \qquad (14)$$
Finally, the fused encoding is input into a fully connected layer, the sentiment classification probability distribution is obtained through the SoftMax function, and the sentiment label $X_i$ with the highest probability is output:

$$l_i = \mathrm{ReLU}\left(W_l \tilde{M}_i + b_l\right) \qquad (15)$$

$$P_i = \mathrm{softmax}\left(W l_i + b\right) \qquad (16)$$

$$X_i = \underset{k}{\arg\max}\left(P_i[k]\right) \qquad (17)$$

where $b$ and $b_l$ are the parameters of the fully connected layers; $P_i$ is the probability distribution of utterance $i$ over the sentiment states; $W_l$ and $W$ are weights learned during training; and $X_i$ is the label corresponding to the maximum probability in $P_i$, i.e., the sentiment state of the $i$-th utterance identified by the model.
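The classification stage of Equations (12)-(17) can be sketched as follows; the feature dimensions and the exact shapes of the learned matrices are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SentimentClassifier(nn.Module):
    """Sketch of Equations (12)-(17): concatenate features, attend over the dialogue, classify."""
    def __init__(self, d_fused=1280, d_hidden=128, num_classes=5):
        super().__init__()
        self.w_beta = nn.Linear(d_fused, d_fused, bias=False)  # W_beta in Equation (13)
        self.fc1 = nn.Linear(d_fused, d_hidden)                # W_l, b_l in Equation (15)
        self.fc2 = nn.Linear(d_hidden, num_classes)            # W, b in Equation (16)

    def forward(self, G, h, e):
        # G, h, e: per-utterance context, affective interaction, and single-utterance features
        M = torch.cat([G, h, e], dim=-1)                # Equation (12), shape (K, d_fused)
        beta = F.softmax(self.w_beta(M) @ M.T, dim=-1)  # Equation (13)
        M_tilde = beta @ M                              # Equation (14)
        l = F.relu(self.fc1(M_tilde))                   # Equation (15)
        P = F.softmax(self.fc2(l), dim=-1)              # Equation (16)
        return P.argmax(dim=-1)                         # Equation (17): predicted labels X_i
```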

4. Experiments

4.1. Experimental Setup

4.1.1. Datasets

Our experiments are conducted on the JDDC dataset, which is obtained from the JD Dialogue Challenge Competition (https://github.com/EndlessLethe/jddc2019-3rd-retrieve-model, (accessed on 25 December 2022)). To the best of our knowledge, the JDDC dataset is a large-scale customer service conversation dataset in the field of e-commerce that is well suited to customer sentiment analysis in customer service conversation scenarios.
We first process the dataset by cleaning, deleting, and removing noise and irrelevant content. Each utterance is then annotated with one of five sentiment labels: very negative, negative, neutral, positive, and very positive. The specifications in the annotation process are as follows.
If an utterance contains two or more negative emotional messages, its emotional tendency is marked as very negative; otherwise, it is marked as negative, as follows:
Customer: It’s been several times, you are too perfunctory. (very negative)
Customer: Why has my order not shipped yet? (negative)
If the utterance is of a simple consultation and does not have any emotional color, it is marked as neutral, as follows:
Customer: Hello, are you there? (neutral)
If an utterance contains two or more positive emotional messages, its emotional tendency is marked as very positive; otherwise, it is marked as positive, as follows:
Customer: The quality of the goods is good, I will give you good reviews. (very positive)
Agent: Hello, thank you for coming, how can I help you? (positive)
We arranged for three annotators to label the sentiment of each utterance. If two of them disagreed on a sentiment label, an expert made the final decision. The annotation results show that the pairwise agreement between annotators is above 89%, and the agreement among all three is above 83%. The Kappa coefficient computed between each pair of annotators has a mean value greater than 0.75. Based on these indicators, the data annotation has good consistency, and its quality meets the experimental standards.
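For reference, pairwise agreement and the Kappa coefficient between two annotators can be computed as in the sketch below; the label arrays are hypothetical stand-ins for the real annotation files.

```python
# Sketch: pairwise agreement and Cohen's kappa between two annotators (hypothetical labels).
import numpy as np
from sklearn.metrics import cohen_kappa_score

annotator_1 = np.array(["neutral", "negative", "neutral", "positive", "very negative"])
annotator_2 = np.array(["neutral", "negative", "neutral", "positive", "negative"])

agreement = (annotator_1 == annotator_2).mean()      # raw pairwise agreement
kappa = cohen_kappa_score(annotator_1, annotator_2)  # chance-corrected agreement
print(f"agreement={agreement:.2%}, kappa={kappa:.3f}")
```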
After labeling the conversation sentiment of the dataset, the dataset is divided into the training set, verification set, and test set according to the ratio of 7:1:2, and the dataset is statistically analyzed. Table 2 shows the distribution of five types of sentiment labels on the dataset. Neutral occurs most frequently of all sentiments, which is in conformity to the characteristics of conversation in the field of e-commerce. In addition, the dataset contains 6305 conversations where the average turns of a conversation are 11.2.

4.1.2. Parameter Settings

The experiments are based on the deep learning framework PyTorch and use Baidu's ERNIE-based model (12 encoder layers, 768 hidden states, 12 self-attention heads, 110 M parameters) as the pre-trained model. The vector dimension of each utterance embedded in the output layer is 768, and the number of GRU units in the context information extraction layer is 128, with two layers in total. The training batch size is eight, the optimization function of the model is Adam, the initial learning rate is $10^{-3}$, and the dropout rate is 0.3. Considering that the sentiment classification in this experiment is a multi-class problem with an unbalanced label distribution, we use the cross-entropy loss function as the training loss function.
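A sketch of a training step under these settings is shown below; the `FullModel` class is only a placeholder standing in for the complete architecture described in Section 3, and the random tensors stand in for a processed batch of eight utterances.

```python
import torch
import torch.nn as nn

# Sketch of the training configuration: Adam, lr 1e-3, batch size 8, dropout 0.3, cross-entropy.
class FullModel(nn.Module):
    def __init__(self, d_in=768, num_classes=5):
        super().__init__()
        self.dropout = nn.Dropout(p=0.3)
        self.fc = nn.Linear(d_in, num_classes)  # placeholder for the encoders + classifier

    def forward(self, x):
        return self.fc(self.dropout(x))

model = FullModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One illustrative optimization step on random tensors standing in for a batch of 8 utterances.
utterances = torch.randn(8, 768)
labels = torch.randint(0, 5, (8,))
optimizer.zero_grad()
loss = criterion(model(utterances), labels)
loss.backward()
optimizer.step()
```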

4.1.3. Evaluation Criteria

Four evaluation criteria, namely precision (Pr), recall (Re), F1-measure (F1), and accuracy (Acc), are used to assess the performance of the models. These criteria are extensively used in text classification and sentiment analysis tasks [23]. They are calculated as follows:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

$$F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

$$\mathrm{Macro\text{-}F1} = \frac{1}{N} \sum_{i=1}^{N} F_{1,i}$$
where T P , T N , F P , and F N are true positive, true negative, false positive, and false negative, respectively [24].
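These criteria can be computed, for example, with scikit-learn as in the sketch below; the label lists are hypothetical.

```python
# Sketch: computing the criteria above with scikit-learn on hypothetical predictions.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [0, 1, 2, 2, 3, 4, 2, 1]  # gold labels (0 = very negative ... 4 = very positive)
y_pred = [0, 1, 2, 3, 3, 4, 2, 2]  # model predictions

acc = accuracy_score(y_true, y_pred)
pr, re, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="macro", zero_division=0)
print(f"Acc={acc:.3f}  Pr={pr:.3f}  Re={re:.3f}  Macro-F1={f1:.3f}")
```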

4.2. Baselines Methods

To evaluate the performance of our model, we implement the following baseline models for sentiment recognition in conversation:
(1) BiGRU [25]: Ignoring the contextual information in the conversation, this model treats each utterance as an independent instance and uses a bidirectional GRU to encode the utterance and classify its sentiment.
(2) BERT [26]: This model is used to construct the utterance representations, which are fed to a two-layer perceptron with a final SoftMax layer for sentiment classification.
(3) ERNIE [22]: This model treats each sentence in the dialogue as an independent instance and uses the ERNIE model to encode the sentence and classify its sentiment.
(4) c-LSTM [9]: This model uses a context-sensitive LSTM model for sentiment classification.
(5) BiGRU-Att [27]: This model uses the BiGRU network to encode and represent the context information for sentiment classification.
(6) CMN [13]: This model adopts two GRU models to extract contextual features from the conversation history of the two speakers and passes the current utterance as input to two different memory networks to obtain the utterance representations of the two speakers for sentiment classification.
(7) DialogueRNN [15]: This model uses BiGRUs to model the speaker states, global states, and sentiment states based on recurrent neural networks (RNNs).
(8) DialogueGCN [17]: Based on a graph neural network, this model constructs the conversation as a graph structure, represents the nodes in the graph as utterances, and uses speaker information to determine the type of each edge, thereby establishing the temporal and speaker dependencies in the multi-party dialogue before carrying out sentiment classification.

4.3. Results and Analysis

We compare the performance of our model with baselines on the JDDC dataset. The experimental results are shown in Table 3.
First, comparing BiGRU, BERT, and ERNIE, which all treat each utterance of the conversation as a single-sentence instance, shows that the pre-trained models have an advantage over the BiGRU model in the semantic encoding of text. The macro-F1 value of the BiGRU model is 2.08 and 2.24 lower than that of BERT and ERNIE, respectively. At the same time, the semantic encoding of the ERNIE pre-trained model is slightly better than that of the BERT pre-trained model, which is consistent with the experimental results reported by Baidu. In addition, the ERNIE model, which does not consider context, has a lower macro-F1 value than the context-aware c-LSTM, BiGRU-Att, and CMN models, which indicates that the conversational context plays a vital role in dialogue sentiment analysis.
Second, comparing c-LSTM, BiGRU-Att, CMN, DialogueRNN, DialogueGCN, and our model shows that the experimental results of DialogueRNN, DialogueGCN, and our model are significantly better than those of the c-LSTM, BiGRU-Att, and CMN models. The reason is that DialogueRNN, DialogueGCN, and our model all learn the affective interaction relationships between speakers. Although all three of these models incorporate the affective interaction dependence between speakers, DialogueGCN and our model use a graph convolutional neural network to fuse utterance neighborhood information into the sentence representation, so their results are better than those of the DialogueRNN model. Finally, the macro-F1 value of our model is 1.16 higher than that of the DialogueGCN model. Compared with DialogueGCN, our model uses the ERNIE pre-trained model for dialogue text embedding and adopts multiple GRUs and attention mechanisms to extract the contextual semantic features more comprehensively, thereby obtaining better sentiment classification performance.

4.3.1. Effectiveness Analysis of Contextual Semantic Information and Affective Interaction Information

We conduct combined comparative experiments on the contextual semantic encoder and the speaker affective interaction encoder to study the influence of the different components of our model on the sentiment recognition results. The specific hybrid models are shown in Table 4, and their experimental results are shown in Table 5. In Table 5, Our model(a) consists of the conversation text encoder only; Our model(b) consists of the conversation text encoder and the affective interaction encoder; Our model(c) consists of the conversation text encoder and the context semantic encoder; and Our model consists of all three encoders, i.e., the conversation text encoder, the context semantic encoder, and the affective interaction encoder.
According to the experimental results in Table 4 and Table 5, the F1 value of the sentiment classification gradually improves as the contextual semantic encoder and the speaker affective interaction encoder are added. Comparing Our model(a) with Our model(c), and Our model(b) with Our model, we find that considering the contextual semantic features greatly improves the recognition performance. Analyzing the recognition performance for the different categories, we find that considering contextual semantic features helps the negative and positive sentiment categories more than the other categories, which shows that the model can infer implicit emotions from context.
At the same time, comparing Our model(a) with Our model(b), and Our model(c) with Our model, we find that Our model(b) increases the macro-F1 value by 3.48% compared with Our model(a), and Our model increases the macro-F1 value by 3.87% compared with Our model(c). This shows that considering the affective interaction dependence between speakers improves sentiment recognition. In addition, comparing the results of Our model(a), Our model(b), and Our model(c), we find that both the contextual semantic features and the inter-speaker affective interaction features improve sentiment recognition, and that the contribution of the conversational contextual semantic information is slightly larger than that of the speakers' affective interaction information. Overall, the analysis of the context semantic encoder and the affective interaction encoder shows that contextual semantic information and affective interaction information can effectively improve the accuracy of customer sentiment recognition.

4.3.2. Affective Interaction-Directed Graph Window Size

To study the influence of the context window size of the affective interaction-directed graph on the experimental results, we observe the change of the macro-F1 value of Our model and Our model(b) while varying the window size. The experimental results are shown in Figure 4.
The experimental results show that when the window size is 0, the macro-F1 values of Our model and Our model(b) are close to those of the corresponding models without the affective interaction encoder, indicating that performance is poor when the affective interaction information between speakers is not used. As the window size increases from 0 to 3, the macro-F1 value rises, indicating that the model begins to exploit the affective interaction information between speakers. As the window size continues to increase, the macro-F1 value shifts from stable to downward, which indicates that an overly large window introduces noise into the affective interactions between speakers. Therefore, we set the window size of the affective interaction-directed graph to 4.

4.3.3. The Impact Analysis of Feature Fusion Mode

After extracting the corresponding features through the context semantic encoder and the affective interaction encoder, it is necessary to fuse the two features and input them into the fully connected layer to achieve sentiment classification. In order to study the impact of different fusion methods on the sentiment classification results, this paper compares the F1 values under different sentiment categories for three fusion methods, i.e., summation, max pooling, and series connection (concatenation). The experimental results are shown in Figure 5.
The experimental results show that the series (concatenation) fusion method achieves the best F1 value in all sentiment categories because it retains more information than the other two fusion methods. In addition, the gap between the three fusion methods is smallest in the neutral sentiment category, while the series fusion mode is significantly better than the other two in the negative and positive sentiment categories. This indicates that the series method better preserves sentiment characteristics.
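The three fusion modes can be illustrated with the short sketch below; the feature dimension of 256 is an assumption. Only concatenation preserves both feature vectors in full, which is consistent with the observation that it retains the most information.

```python
import torch

# Sketch of the three fusion modes applied to the context semantic feature G_i and the
# affective interaction feature h_i (dimensions are illustrative).
G_i = torch.randn(256)  # context semantic feature
h_i = torch.randn(256)  # affective interaction feature

fused_sum = G_i + h_i                                         # element-wise summation
fused_max = torch.max(torch.stack([G_i, h_i]), dim=0).values  # max pooling
fused_cat = torch.cat([G_i, h_i])                             # series connection (concatenation)

print(fused_sum.shape, fused_max.shape, fused_cat.shape)      # 256, 256, 512
```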

5. Conclusions

In this paper, we proposed an efficient model to recognize customer sentiment in customer service conversations. The model consists of a conversation text encoder, a context semantic encoder, an affective interaction encoder, and a sentiment classifier. First, the conversation text encoder uses the ERNIE pre-trained model to generate the utterance representations. Then, the context semantic encoder uses multiple GRU networks and attention mechanisms to encode the different contexts and obtain comprehensive contextual semantic features. Meanwhile, the affective interaction encoder constructs a directed graph to depict the affective interaction relationships between speakers and uses a GCN to encode the affective interaction features. Finally, after fusing the different feature encodings, the sentiment classifier outputs the predicted sentiment label. To verify the performance of our model, we selected eight baseline models for the same task. Experimental results on the JDDC dataset show that our model outperforms the baseline models in recognizing customer sentiment.
Although our model captures different speakers’ contextual information and inter-speaker affective interaction information that can improve the accuracy of sentiment recognition, our study has certain limitations. The proposed model does not take conversation topics or multimodal data into account. Therefore, the following two research directions merit future exploration: (1) consider more factors that affect the accuracy of sentiment recognition in conversations, such as conversational topics, and (2) incorporate multimodal information into our work.

Author Contributions

Conceptualization, Z.H., H.L. and J.Z.; methodology, H.L. and J.Z.; software, J.M.; validation, Z.H., H.L. and J.Z.; formal analysis, H.L. and J.Z.; investigation, H.L. and J.Z.; resources, J.Z.; data curation, H.L. and J.M.; writing—original draft preparation, H.L.; writing—review and editing, Z.H., H.L., J.M. and J.Z.; visualization, J.Z.; supervision, Z.H.; project administration, J.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Cross-border E-commerce Teacher Competence Improvement Project of Guangxi Minzu Normal University “Research on User Behavior of Cross-border E-commerce Platform under the Background of Artificial Intelligence” (KJDSKYZD202203); Guangxi University Young and Middle-aged Teachers Basic Ability Improvement Project “Probability Numbers and its Application in Uncertain Decision Theory” (2022KY0764).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.

Acknowledgments

The authors are thankful to China Three Gorges University for providing resources for this research.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Chung, H.; Iorga, M.; Voas, J.; Lee, S. Alexa, Can I Trust You? Computer 2017, 50, 100–104.
2. Zhu, X. Case II (Part A): JIMI's Growth Path: Artificial Intelligence Has Redefined the Customer Service of JD.Com. In Emerging Champions in the Digital Economy; Springer: Singapore, 2019.
3. Li, F.L.; Qiu, M.; Chen, H.; Wang, X.; Gao, X.; Huang, J.; Ren, J.; Zhao, Z.; Zhao, W.; Wang, L.; et al. AliMe Assist: An Intelligent Assistant for Creating an Innovative E-commerce Experience. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, Singapore, 6–10 November 2017; pp. 2495–2498.
4. Sun, X.; Zhang, C.; Li, L. Dynamic emotion modelling and anomaly detection in conversation based on emotional transition tensor. Inf. Fusion 2019, 46, 11–22.
5. Majid, R.; Santoso, H.A. Conversations Sentiment and Intent Categorization Using Context RNN for Emotion Recognition. In Proceedings of the 2021 7th International Conference on Advanced Computing and Communication Systems (ICACCS), Coimbatore, India, 19–20 March 2021; pp. 46–50.
6. Wei, J.; Feng, S.; Wang, D.; Zhang, Y.; Li, X. Attentional Neural Network for Emotion Detection in Conversations with Speaker Influence Awareness. In Natural Language Processing and Chinese Computing: 8th CCF International Conference, NLPCC 2019, Dunhuang, China, 9–14 October 2019; Springer International Publishing: Berlin/Heidelberg, Germany, 2019.
7. Huang, X.; Ren, M.; Han, Q.; Shi, X.; Nie, J.; Nie, W.; Liu, A.-A. Emotion Detection for Conversations Based on Reinforcement Learning Framework. IEEE Multimed. 2021, 28, 76–85.
8. Hu, D.; Wei, L.; Huai, X. DialogueCRN: Contextual Reasoning Networks for Emotion Recognition in Conversations. arXiv 2021, arXiv:2106.01978.
9. Poria, S.; Cambria, E.; Hazarika, D.; Majumder, N.; Zadeh, A.; Morency, L.-P. Context-Dependent Sentiment Analysis in User-Generated Videos. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL), Vancouver, BC, Canada, 30 July–4 August 2017; pp. 873–883.
10. Zhong, P.; Wang, D.; Miao, C. Knowledge-Enriched Transformer for Emotion Detection in Textual Conversations. In Proceedings of the Conference on Empirical Methods in Natural Language Processing/9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 165–176.
11. Tang, D.; Qin, B.; Feng, X.; Liu, T. Effective LSTMs for Target-Dependent Sentiment Classification. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, Osaka, Japan, 11–16 December 2016; pp. 3298–3307.
12. Basiri, M.E.; Nemati, S.; Abdar, M.; Cambria, E.; Acharrya, U.R. ABCDM: An Attention-based Bidirectional CNN-RNN Deep Model for sentiment analysis. Future Gener. Comput. Syst. Int. J. Escience 2021, 115, 279–294.
13. Hazarika, D.; Poria, S.; Zadeh, A.; Cambria, E.; Morency, L.P.; Zimmermann, R. Conversational Memory Network for Emotion Recognition in Dyadic Dialogue Videos. Proc. Conf. 2018, 2018, 2122–2132.
14. Hazarika, D.; Poria, S.; Mihalcea, R.; Cambria, E.; Zimmermann, R. ICON: Interactive Conversational Memory Network for Multimodal Emotion Detection. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Brussels, Belgium, 31 October–4 November 2018; pp. 2594–2604.
15. Majumder, N.; Poria, S.; Hazarika, D.; Mihalcea, R.; Gelbukh, A.; Cambria, E. DialogueRNN: An Attentive RNN for Emotion Detection in Conversations. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence/31st Innovative Applications of Artificial Intelligence Conference/9th AAAI Symposium on Educational Advances in Artificial Intelligence, Honolulu, HI, USA, 17 July 2019; pp. 6818–6825.
16. Zhang, D.; Wu, L.; Sun, C.; Li, S.; Zhu, Q.; Zhou, G. Modeling both Context- and Speaker-Sensitive Dependence for Emotion Detection in Multi-speaker Conversations. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, Macao, China, 10–16 August 2019; pp. 5415–5421.
17. Ghosal, D.; Majumder, N.; Poria, S.; Chhaya, N.; Gelbukh, A. DialogueGCN: A Graph Convolutional Neural Network for Emotion Recognition in Conversation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing/9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 154–164.
18. Wang, Y.; Zhang, J.; Ma, J.; Wang, S.; Xiao, J. Contextualized Emotion Recognition in Conversation as Sequence Tagging. In Proceedings of the 21st Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL), 1st Virtual Meeting, 1–3 July 2020; pp. 186–195.
19. Gao, Q.; Cao, B.; Guan, X.; Gu, T.; Bao, X.; Wu, J.; Liu, B.; Cao, J. Emotion recognition in conversations with emotion shift detection based on multi-task learning. Knowl.-Based Syst. 2022, 248, 108861.
20. Zhu, L.; Pergola, G.; Gui, L.; Zhou, D.; He, Y. Topic-Driven and Knowledge-Aware Transformer for Dialogue Emotion Detection. In Proceedings of the Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics (ACL)/11th International Joint Conference on Natural Language Processing (IJCNLP)/6th Workshop on Representation Learning for NLP (RepL4NLP), Online, 2 June 2021; pp. 1571–1582.
21. Liu, R.; Ye, X.; Yue, Z. Review of pre-trained models for natural language processing tasks. J. Comput. Appl. 2021, 41, 1236–1246.
22. Sun, Y.; Wang, S.; Feng, S.; Ding, S.; Pang, C.; Shang, J.; Liu, J.; Chen, X.; Zhao, Y.; Lu, Y.; et al. ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Understanding and Generation. arXiv 2021, arXiv:2107.02137.
23. Khiabani, P.J.; Basiri, M.E.; Rastegari, H. An improved evidence-based aggregation method for sentiment analysis. J. Inf. Sci. 2020, 46, 340–360.
24. Liu, B. Sentiment Analysis: Mining Opinions, Sentiments, and Emotions; Cambridge University Press: Singapore, 2020.
25. Shen, J.; Liao, X.; Tao, Z. Sentence-Level Sentiment Analysis via BERT and BiGRU. In Proceedings of the 2nd International Conference on Image and Video Processing, and Artificial Intelligence (IPVAI), Shanghai, China, 23–25 August 2019; p. 11321.
26. Yenduri, G.; Rajakumar, B.R.; Praghash, K.; Binu, D. Heuristic-Assisted BERT for Twitter Sentiment Analysis. Int. J. Comput. Intell. Appl. 2021, 20, 2150016.
27. Zhou, L.; Bian, X. Improved text sentiment classification method based on BiGRU-Attention. J. Phys. Conf. Ser. 2019, 1345, 032097.
Figure 1. Example of customer service conversation.
Figure 2. The illustration of our proposed framework, which consists of a conversation text encoder, a context semantic encoder, an affective interaction encoder, and a sentiment classifier.
Figure 3. Attention mechanism weighting contextual semantics.
Figure 4. The effect of window size on the macro-F1 value.
Figure 5. The impact of different fusion methods on the F1 value.
Table 1. The type of relationship between speakers.

Number   Relationship between Speakers
1        P1 → P1
2        P1 ← P1
3        P2 → P2
4        P2 ← P2
5        P1 → P2
6        P1 ← P2
7        P2 → P1
8        P2 ← P1

where P_i → P_j represents two utterances in sequential order: u_i and u_j are uttered by speakers P_i and P_j, respectively, and u_i precedes the utterance u_j. P_i ← P_j indicates that u_i follows the utterance u_j.
Table 2. Information of the dataset.

Sentiment          Train    Dev.    Test     Proportion
Very negative      1038     163     282      1483 (2.1%)
Negative           3466     613     1019     5098 (7.22%)
Neutral            31,625   3953    8345     43,923 (62.2%)
Positive           12,794   2482    3819     19,095 (27.04%)
Very positive      722      102     193      1017 (1.44%)
Total utterances   49,645   7313    13,658   70,616
Table 3. The performances of different models.

Model          F1 Value                                                              Macro-F1
               Very Negative   Negative   Neutral   Positive   Very Positive
BiGRU          50.5            55.1       78.1      72.3       63.3                  63.86
BERT           52.1            56.6       82.2      74.2       64.6                  65.94
ERNIE          52.3            56.6       82.5      74.3       64.8                  66.1
c-LSTM         55.2            59.7       82.2      76.1       67.2                  68.08
BiGRU-Att      55.7            60.1       82.7      76.5       67.8                  68.56
CMN            56.9            60.9       82.8      76.7       69.1                  69.28
DialogueRNN    58.5            61.9       83.3      77.5       70.4                  70.32
DialogueGCN    59.2            62.5       83.3      77.6       71.1                  70.74
Our model      60.3            64.6       83.2      78.9       72.5                  71.90
Table 4. Hybrid models of different component structures.

Model          Conversation Text Encoder   Context Semantic Encoder   Affective Interaction Encoder
Our model(a)   √                           ×                          ×
Our model(b)   √                           ×                          √
Our model(c)   √                           √                          ×
Our model      √                           √                          √
Table 5. The experimental results of different hybrid models.

Model          F1 Value                                                              Macro-F1
               Very Negative   Negative   Neutral   Positive   Very Positive
Our model(a)   52.3            56.6       82.5      74.3       64.8                  66.1
Our model(b)   56.5            60.1       81.2      75.8       68.4                  68.4
Our model(c)   56.8            61.4       81.9      76.6       69.7                  69.28
Our model      60.3            64.6       83.5      78.9       72.5                  71.96
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
