Text-Based Emotion Recognition in English and Polish for Therapeutic Chatbot

Abstract: In this article, we present the results of our experiments on sentiment and emotion recognition for English and Polish texts, aiming to work in the context of a therapeutic chatbot. We created a dedicated dataset by adding samples of neutral texts to an existing English-language emotion-labeled corpus. Next, using neural machine translation, we developed a Polish version of the English database. The bilingual, parallel corpus created in this way, named CORTEX (CORpus of Translated Emotional teXts), labeled with three sentiment polarity classes and nine emotion classes, was used for classification experiments. We employed various classifiers: Naïve Bayes, Support Vector Machines, fastText, and BERT. The results obtained were satisfactory: we achieved the best scores for the BERT-based models, which yielded accuracy of over 90% for sentiment (3-class) classification and almost 80% for emotion (9-class) classification. We compared the results for both languages and discussed the differences. Both the accuracy and the F1-scores for Polish turned out to be slightly inferior to those for English, with the largest difference visible for BERT.


Introduction
Chatbots and dialogue systems are entering new areas of our lives. Quite recently, they have also been introduced to psychological and psychiatric therapies. Such a system, in order to have a therapeutic conversation with a patient, must be able to correctly recognize the patient's emotional state. In some cases, it is enough just to recognize the sentiment of the patient, i.e., to detect whether the patient's utterance has a positive or a negative emotional tinge, or carries no emotion at all. In other cases, to correctly lead the therapeutic dialogue, more detailed emotion and mood recognition must be performed.
Our long-term target is to create a therapeutic dialogue system, working in Polish, able to hold empathetic conversations with patients. Most of the existing sentiment analysis approaches for Polish are opinion-based. They work either with short texts, e.g., as in message-level Twitter sentiment polarity classification, or with longer texts, e.g., when analyzing reviews of multiple aspects of restaurants or other services or goods. However, emotion-aware dialogue agents demand a different type of dataset, one that is more empathetic and multifaceted rather than just opinion-forming.
The Polish language is under-resourced in regard to annotated empathetic texts. To fill in this gap, in our work, we made an attempt to build a Polish version of the dataset, using neural machine translation and English corpora as the source texts.
In addition, we observed that in their experiments many researchers avoid taking into account the neutral emotional state, even though neutral utterances usually prevail during conversations. Since correctly distinguishing between neutral sentiment/emotion and any other emotional state is quite important in therapy, we decided to combine emotion-labeled texts with neutral ones. Next, we ran several experiments with sentiment polarity and emotion recognition for English and Polish, comparing the results between various classifiers and both languages.
Our article is structured as follows: first, in Section 2, we briefly review the state of the art in the area of dialogue systems applied to mental health, sentiment and emotion recognition, and the existing related resources. Next, in Section 3, we describe how we created the corpora for our study. The experiments themselves are described in Section 4, followed by presentation of the results in Section 5. The article concludes with discussion of the results in Section 6 and a summary in Section 7.

Conversational Agents in Mental Health
Mental disorders affect large numbers of people worldwide. To mitigate the problem of limited availability of human therapists and to develop new methods of treatment, computer-aided therapies have been designed and successfully applied in the context of mental health, including solutions based on artificial intelligence (AI) [1].
Application of AI in mental disorder therapies frequently takes the form of dialogue systems [2,3], which can be divided into two categories. The first comprises chatbots: virtual counselors capable of having text-based conversations with the patient, delivered, e.g., via a mobile application. Research [4][5][6] has shown positive impacts from chatbot-based therapy on patients with depression and anxiety. The second group consists of the so-called embodied conversational agents (ECAs), which extend the text conversations with an animated visualization of the virtual therapist on the screen. Such systems have been applied in the treatment of depression [7] and autism spectrum disorders [8].
An important factor in human-computer interaction in therapeutic settings is the system's ability to recognize the patient's emotions and respond accordingly (affective computing [9]). Several affect-aware conversational systems have been developed [10,11] in the context of mental health, aiming to produce more natural and empathetic conversations; however, to the best of our knowledge, none of these were designed for the Polish language.

Text-Based Sentiment and Emotion Recognition
The long-term goal of our research is to develop a therapeutic chatbot capable of having a conversation in Polish. Such a system should not only be able to respond according to the user's intent and the topics mentioned, but its utterances should also be aligned with the user's emotional state. For this purpose, various text-based sentiment and emotion recognition methods have been developed.
The majority of historical approaches to sentiment analysis employed bag-of-words (BoW) representations and machine learning (ML) algorithms to build classifiers from textual data (e.g., utterances, opinions, reviews) with manually annotated sentiment polarity (e.g., positive, negative, neutral). Most studies focused on designing effective features to obtain better classification performance [12]. Snyder and Barzilay [13] analyzed the sentiment of multiple aspects of restaurant reviews, such as food and atmosphere. Several works have explored sentiment compositionality through careful engineering of features or polarity-shifting rules on syntactic structures [14]. Psycholinguistic features can be built using the large lexicons of word categories (LIWC) [15] that represent psycholinguistic processes (e.g., perceptual processes) and summary categories (e.g., word ratio), as well as part-of-speech categories (e.g., articles, verbs). Mohammad et al. [16] implemented diverse sentiment lexicons and a variety of handcrafted features.
Further progress toward understanding compositionality in tasks such as sentiment detection requires more complex datasets and more powerful ML models, such as deep-learning (DL) models. Socher et al. [17] introduced a new corpus, the Stanford Sentiment Treebank (SST), and a new DL-based approach, the Recursive Neural Tensor Network. SST includes fine-grained sentiment labels for 215,154 phrases in the parse trees of 11,855 sentences and presents new challenges for sentiment compositionality. The proposed method, trained on the new treebank, outperformed all previous methods.
Recent works have shown that shallow neural networks can also perform well for sentiment classification. Reference [18] presents the use of fastText word embeddings [19] as representation of words to perform the task of sentiment analysis. The results showed that the proposed approach yielded better results than many classic baseline models.
Currently, BERT (Bidirectional Encoder Representations from Transformers) is reported to be the state-of-the-art language model and has achieved impressive results in many language understanding tasks [20], including sentiment recognition. In Reference [21], the authors used the pretrained BERT model and fine-tuned it for the fine-grained sentiment-classification task on the Stanford Sentiment Treebank dataset. The proposed model performed better than complicated architectures, such as Paragraph Vectors, or typical recursive and convolutional neural networks. However, BERT-based models also exhibit some limitations, e.g., they have large computational and memory requirements, and their black-box characteristics make their predictions difficult to interpret.
Deep-learning approaches, such as BERT-based models, have recently achieved state-of-the-art results [22] in the SemEval 2018 competition's task related to detecting affect in Tweets [23], with objectives ranging from emotion classification to emotion-intensity prediction. The competition dataset was annotated with 12 different emotions (incl. no emotion) for English, Spanish, and Arabic in a multi-label manner. The most successful approaches during the competition in 2018 used combinations of sentence embeddings with features extracted from affective lexicons.
The interest in emotion analysis as part of cyclical competitions materialized one year later in SemEval 2019, in a task called EmoContext [24]. Its goal was to classify the emotion represented by a short informal dialogue utterance, also taking into account the preceding two turns of the dialogue. There were only four classes of emotion (Happy, Sad, Angry, and Others) and three different classification subtasks. Among the top systems, there were again examples of models leveraging both vector representations of sentences and emotion-related features.
Even though there are many English-language corpora of emotional texts, for example, References [13,17,25], only a few relevant resources exist for Polish. Until recently, the majority of approaches to sentiment analysis in Polish were based on lexicons, such as plWordNet 4.0 Emo [26,27] or the Nencki Affective Word List (NAWL) [28]. In recent years, several sentiment-labeled corpora have also been created. One is PolEmo [29], a corpus of consumer reviews from four domains: medicine, hotels, products, and schools. It contains 8216 reviews comprising 57,466 sentences. The corpus is labeled, at both the review and sentence level, with four polarity classes: positive, neutral, negative, and ambiguous. Another example is the HateSpeech corpus [30], the current version of which contains over 2000 manually annotated posts crawled from the Polish public web. The posts contain various types and degrees of offensive language, expressed toward minorities (e.g., ethnic, racial). However, to the best of our knowledge, no emotion-labeled corpora exist for Polish. In addition, the current corpora contain no dialogue phrases. Our work aims to fill these gaps by creating a new corpus, suitable for the context of a chatbot.

Dialogue Corpora
To develop an affect-aware dialogue system, relevant conversational datasets are required. Most of the existing dialogue corpora are either domain-specific and task-oriented [31,32] or collected without full control over the content, e.g., from social media [33,34], and, therefore, are generally inappropriate in a therapeutic setting. There are, however, examples of emotionally grounded conversational datasets for English, such as EmpatheticDialogues [35] and DailyDialog [36], which are labeled with emotions at the dialogue and utterance level, respectively.
The DailyDialog dataset [36] consists of daily conversations obtained through crawling websites for English learners. In total, it contains 13k dialogues, manually labeled with dialogue acts and emotions at the utterance level. The emotion-labeling scheme distinguishes six basic emotions: anger, disgust, fear, happiness, sadness, and surprise.
The emotion distribution in the dataset is highly imbalanced, with "happiness" being more than 10 times more frequent than the other emotions. There are also a large number of utterances (83% of the entire dataset) marked as representing no emotion.
The authors of Reference [35] propose a new benchmark for empathetic dialogue generation and EmpatheticDialogues (ED) itself, a novel dataset with about 25k personal dialogues. Each dialogue is grounded in a specific situation where the speaker was feeling a given emotion, with the listener responding actively. The resource consists of crowdsourced one-on-one conversations and covers a large set of emotions in a balanced way. This dataset is larger and contains a more extensive set of emotions than many similar emotion-prediction datasets from other text domains. The authors' experiments show that models built on this dataset are considered more empathetic by human evaluators, compared to models merely trained on large-scale Internet-crawled opinion-oriented data.
To the best of our knowledge, no similar dialogue corpora exist for Polish.

Materials and Methods
We decided to create a corpus for emotion recognition from two sources: the EmpatheticDialogues and DailyDialog datasets, previously presented in Section 2.3. Within this research, following the classification experiment described in Reference [35], we frame the problem of emotion recognition as single dialogue turn classification. Therefore, to create the classification corpus, it was sufficient to take just one utterance from each of the dialogues.

Extending the Corpus with Neutral Texts
The utterances in the EmpatheticDialogues corpus were collected by providing the dialogue participants with a prompt sentence, together with a context label, representing one of 32 emotional groundings in which several dialogue turns were then produced, starting from the prompt or a slightly modified version of it. In most dialogues, the emotional grounding is reflected in the first dialogue turn, but this is not always the case. Bearing that in mind, we decided to build our corpus for emotion recognition based on the prompt sentences rather than on the dialogue content itself.
Considering the planned future usage of the developed emotion detector in the context of a therapeutic chatbot, it was necessary to include neutral utterances in the classification corpus. Unfortunately, there were no examples of labeled neutral sentences in the EmpatheticDialogues dataset. Therefore, for this purpose, sentences from the DailyDialog corpus were used. For each of the experiments conducted, neutral ("no emotion") sentences were sampled from DailyDialog data, avoiding duplicated entries, equal in number to the mean count of all the other classes in the experiment. The original corpora were split into training, validation, and test subsets by their authors, and we decided to retain this division in the resulting dataset.
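The sampling procedure described above can be sketched as follows. This is an illustrative reconstruction, not the authors' actual code: the helper name, the toy class counts, and the rounding of the mean are our assumptions; the paper only states that the number of sampled neutral sentences equals the mean count of the other classes and that duplicates are avoided.

```python
import random

def sample_neutral(neutral_pool, class_counts, seed=42):
    """Draw unique neutral sentences from a pool, matching the mean size
    of the emotion classes (hypothetical helper illustrating the procedure)."""
    # Target size: mean count of all non-neutral classes (rounded).
    n = round(sum(class_counts.values()) / len(class_counts))
    # Drop duplicate entries while preserving order, then shuffle reproducibly.
    unique = list(dict.fromkeys(neutral_pool))
    random.Random(seed).shuffle(unique)
    return unique[:n]

# Toy example: three emotion classes and a neutral pool with one duplicate.
counts = {"joy": 900, "sadness": 850, "anger": 800}
pool = [f"neutral sentence {i}" for i in range(2000)] + ["neutral sentence 0"]
sampled = sample_neutral(pool, counts)  # mean of the counts is 850
```

The fixed seed is used here only to make the toy example reproducible; any deduplication strategy that avoids repeated entries would serve the same purpose.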

Creating the Polish Corpus Using Machine Translation
Since we aim to create a sentiment and emotion classifier for a chatbot working in Polish, we faced the problem of lack of relevant dialogue datasets for the Polish language. As a cheap and fast solution, we proposed to take advantage of the available resources for English and obtain the desired Polish sentences via machine translation (MT). Such an approach has already turned out to be successful, e.g., in creating a corpus for virtual assistants [37].
We tried several available MT solutions, translating from English into Polish, and assessed the correctness of the translations. Eventually, we chose the Google Translation API (https://cloud.google.com/translate, accessed on 21 May 2021), a neural MT system, as it gave the highest level of correctness among the tested MT systems.
The whole process of creating our corpus is shown in Figure 1. Eventually we created a parallel bilingual (English and Polish) corpus of emotional texts, designed to serve experiments on sentiment and emotion recognition. We named it CORTEX, or CORpus of Translated Emotional teXts.

Models
We evaluated several approaches to emotion recognition, ranging from simpler to more complex ones. The baseline models were the multinomial Naïve Bayes and linear Support Vector Machine classifiers trained on a BoW representation over token bigrams. Both algorithms are available within the scikit-learn (https://scikit-learn.org, accessed on 21 May 2021) framework. Another model was based on the fastText algorithm, using pretrained word embeddings (the 300-dimensional variant) available in the fasttext library (https://fasttext.cc, accessed on 21 May 2021) for both English and Polish, again with token bigrams. For the other fastText hyperparameters, we used the default values, including training for 5 epochs and a learning rate of 0.1.
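The baseline setup can be sketched as a scikit-learn pipeline. This is a minimal illustration, not the paper's code: the toy texts and labels are invented, and `ngram_range=(1, 2)` is our assumption, since the paper states only that token bigrams were used.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy data standing in for CORTEX utterances (illustrative only).
texts = ["I am so happy today", "This is terrible news",
         "What a wonderful surprise", "I feel awful and sad"]
labels = ["positive", "negative", "positive", "negative"]

# BoW over unigrams and bigrams, feeding the two baseline classifiers.
nb = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
svm = make_pipeline(CountVectorizer(ngram_range=(1, 2)), LinearSVC())

nb.fit(texts, labels)
svm.fit(texts, labels)
prediction = nb.predict(["such happy wonderful news"])[0]
```

In the actual experiments, such pipelines would be fitted on the CORTEX training subset and evaluated on the validation and test subsets.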
The most complex approach was to fine-tune pretrained BERT-base models (the uncased variants) for English and Polish (https://huggingface.co/dkleczek/bert-base-polish-uncased-v1, accessed on 21 May 2021) on the task of sequence classification. We performed fine-tuning using the AdamW optimizer with linear learning rate decay starting from a maximum learning rate of 2 × 10⁻⁵, preceded by a warm-up for 10% of the steps. We trained the models for 4 epochs with an effective batch size of 24, and we selected the best model for each experiment based on the validation metrics obtained after each training epoch. The code for training BERT was developed with the HuggingFace Transformers library [38].
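The learning-rate schedule described above (linear warm-up over the first 10% of steps to a peak of 2 × 10⁻⁵, then linear decay) can be expressed directly; this is a plain-Python sketch of the schedule shape, not the HuggingFace implementation, and the total step count is an arbitrary example value.

```python
def linear_warmup_decay(step, total_steps, peak_lr=2e-5, warmup_frac=0.10):
    """Linear warm-up for the first warmup_frac of steps,
    then linear decay from peak_lr down to zero."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps        # warm-up ramp
    return peak_lr * (total_steps - step) / (total_steps - warmup_steps)

total = 1000  # illustrative number of optimizer steps
lrs = [linear_warmup_decay(s, total) for s in range(total)]
```

The peak learning rate is reached exactly at the end of the warm-up phase (here, step 100 of 1000); in Transformers, the equivalent behavior is provided by its built-in linear schedule with warm-up.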

Classes of Emotions
The corpus created as a combination of EmpatheticDialogues and DailyDialog utterances contained 32 classes of emotions plus the neutral class. This number seemed unnecessarily high considering the context of a therapeutic chatbot.
For the purpose of developing emotion classification models, we introduced new levels of class aggregation. First, we excluded some of the original emotion labels (anxious, surprised, impressed, nostalgic, sentimental, anticipating), which proved to be difficult to assign, as the utterances represented a given emotion both in positive and negative situations. Such ambiguous emotions might introduce noise to the training process. Next, we grouped similar emotion classes (see Table 1), taking into account the original papers that were the inspiration for emotion inventory used in Reference [35]. This led to two experimental setups-sentiment (3 classes) and emotion (9 classes) classification.

Results
We evaluated four different models for sentiment and emotion recognition, for each of the languages, using the developed CORTEX dataset. The numbers of sentences in the individual subsets (train/val/test) are displayed in Table 1. In each experiment, we measured accuracy and the support-weighted F1-score (see Table 2). We conducted statistical analyses using the Wilson score interval at a confidence level of 90%. We derived the confidence intervals for the F1-score from the confidence intervals for precision and recall. We also generated confusion matrices, allowing for a more detailed analysis of the results, including the models' mistakes.

As seen in Table 2, in both experimental setups (sentiment and emotion), the BERT models outperformed the simpler methods, reaching over 90% accuracy for sentiment classification and almost 80% for emotion classification. The F1-score yielded similar values. The results for the test subset were usually slightly inferior to those for the validation subset; however, the difference was small. The results obtained for Polish were a few relative percent worse than those for English, especially in the case of BERT for emotion recognition. Unsurprisingly, for the 3-class scenario, the evaluation metrics reached much higher values than for the 9-class scenario. The baseline models achieved visibly worse results, with SVM being slightly better than Naïve Bayes. These classifiers were based on the BoW representation and were, therefore, unable to properly express semantic similarity between tokens.
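The Wilson score interval used for the statistical analysis can be computed directly from the number of correct predictions and the test set size; the sketch below uses z = 1.645, which corresponds to the 90% confidence level stated above, and the example numbers are invented.

```python
import math

def wilson_interval(successes, n, z=1.645):
    """Wilson score interval for a binomial proportion
    (z = 1.645 corresponds to a 90% confidence level)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# e.g., 90% accuracy measured on a hypothetical test set of 1000 samples
lo, hi = wilson_interval(900, 1000)
```

Unlike the simpler normal-approximation interval, the Wilson interval remains well behaved for proportions close to 0 or 1, which matters here since several per-class scores exceed 95%.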
We observed a slight improvement for the fastText-based classifier applied to token bigrams. In this algorithm, the embeddings were created without the context of the entire input sequence, and this might be why BERT outperformed it by a large margin.
The scores for the Polish corpora were worse than for their English counterparts. This difference was significant for the more complex models. Apart from the degraded quality of the translated sentences, this might also have been caused by the quality of the underlying pretrained embeddings. Nevertheless, as presented in Table 3, both English and Polish models learned to distinguish neutral sentences from emotional ones (the F1-score for the neutral class was around 97%). Positive and negative polarity predictions reached between 88% and 92%, with slightly higher per-class scores for negative polarity. It is worth noting that only the BERT-base (uncased) variants were evaluated, while plenty of other contextual embedding models are available.

Discussion
The envisioned study objectives have been met: we have created and tested a sentiment (3-class) and emotion (9-class) text-based classification engine for a therapeutic dialogue system, working in Polish. To achieve this, we had to create our own emotion-labeled corpus, which we generated using a neural MT system and two source English corpora. For sentiment and emotion recognition, we employed the state-of-the-art deep-learning classifier based on the BERT model, which outperformed the classic models, such as Naïve Bayes or Support Vector Machines.
We analyzed the misclassifications made by the best model (BERT) in more detail by looking at the examples related to the highest values from the confusion matrix (Table 4), especially the cases when a positive emotion label was confused with a negative emotion prediction and vice versa. The most problematic class was other_positive, as it was quite frequently predicted for sentences labeled with negative emotions, such as anger, sadness, and other_negative. The models for both languages did well in distinguishing between neutral and emotional texts; we obtained high F1-scores for the neutral class: 97.6% and 96.8% for English and Polish, respectively. We can explain some of the failed predictions as errors in the translation, others by prompt difficulty, e.g., the emotion was not reflected in the prompt itself (example: I have some friends who are traveling all over Europe taken from a dialogue labeled jealous) or multiple emotions were present in the utterance (example: I recently said goodbye to a good friend for a while. I love her!). Sometimes the label itself was just wrong, e.g., some of the neutral texts from DailyDialog seemed to be missing emotional labels (I'm happy with that price-labeled with no emotion-for which the model predicted happiness).
One of the limitations of our study is the accuracy of the employed MT system. Translation errors are the inevitable cost of our fast method of creating an emotion-labeled corpus for a new language. To assess the level of this inaccuracy, we manually verified a sample of our corpus. We found that about 10% of cases contained minor translation mistakes. Nevertheless, we observed that only about one-fourth of these (i.e., roughly 2-3% of all samples) might have an impact on the emotion category.
Considering how the dataset was obtained (machine translation from English, noisy labels), we consider the experiment results satisfactory. In the future, we plan to manually go through the developed corpus, fix the translation errors and the label mismatch where necessary, and check whether this improves the performance of our emotion-classification models.

Conclusions
In this article, we presented the results of our experiments on sentiment polarity and emotion recognition for English and Polish texts, aiming to work in the context of a therapeutic chatbot. We extended the available language resources by adding samples of neutral texts to an existing English corpus. Next, we created a Polish version of the English database using neural machine translation. We used the corpus created in this way, which we named CORTEX, for experiments on sentiment and emotion classification. To quantify the uncertainty of the results, we calculated the Wilson score interval for each evaluation metric.
The results obtained were satisfactory: the best scores were achieved for the BERT-based classifiers, where accuracy of over 90% was achieved for sentiment (3-class) classification and almost 80% for emotion (9-class) classification. The results for Polish consistently turned out inferior to those for English, which might be caused either by imperfections in the MT process or by the nature of the Polish language itself, as it is characterized by a more complex grammar and morphology. A detailed investigation of this question will be the subject of future work.
Our novel contributions presented in this article are as follows:
• From existing resources, we created a new dataset containing empathetic utterances in English, annotated with nine emotion classes, including neutral texts.
• Using neural machine translation, we created a Polish version of the above database, thus filling a gap in text resources for the Polish language. The two language versions of the database form a new parallel corpus, named CORTEX.
• We ran a series of experiments with sentiment polarity and emotion classification, establishing that the BERT-based classifier is currently the best method. Thus, we set a baseline for potential future researchers.
• We showed the difference in classification efficacy for English and Polish and discussed possible explanations for it.
We made CORTEX, the developed dataset, available to the research community at https://github.com/azygadlo/CORTEX, accessed on 21 May 2021, and encourage researchers to use it for future experiments. We believe that this will help in designing better, more empathetic chatbots and dialogue systems, both for English and Polish. We also strongly encourage the creation of new versions of our database by extending it with further language versions.

Conflicts of Interest:
The authors declare no conflict of interest.