Real-Time Sentiment Analysis for Polish Dialog Systems Using MT as Pivot

Abstract: We live in a time when dialogue systems are becoming a very popular tool. It is estimated that in 2021 more than 80% of communication with customers on the first line of service will be based on chatbots. They are entering not only the retail market but also various other industries, e.g., they are used for medical interviews, information gathering or preliminary assessment and classification of problems. Unfortunately, when they work incorrectly, this leads to dissatisfaction. Such systems usually allow the user to contact a human consultant with a special command, but this is not the point. The dialog system should provide a good, uninterrupted and fluid experience and not reveal that it is an artificial creation. Analysing the sentiment of the entire dialogue in real time can provide a solution to this problem. In our study, we focus on machine learning methods for analysing the sentiment of dialogues in English and in the morphologically complex Polish language, which is also a language with a small amount of training resources. We analyse the methods directly and with a machine translator as an intermediary, thus checking the quality differences between models based on limited resources and those based on much larger English, but machine-translated, texts. We manage to obtain over 89% accuracy using BERT-based models. We make recommendations in this regard, also taking into account the cost of implementing and maintaining such a system.


Introduction
Chatbots are used in many service industries to answer customer questions and help customers navigate the company's website. Thanks to them, customers can stay engaged with the company. Chatbots are expected to remain a lasting trend in meeting these expectations.
Currently, dialog systems are used in many areas of industry and entertainment. They have ceased to be simple gadgets that, with some probability, could interpret natural-language questions through keywords and answer them from an FAQ, and have become sophisticated tools based on artificial intelligence [1,2]. Deep dialogue systems now analyse the grammar, syntax and meaning of natural language, which enables them to interpret human utterances accurately. Their precision is so great that many industries offer chatbots on their first line of technical support [3]. Algorithms based on so-called deep machine learning can often execute commands of various types autonomously, e.g., turning various services on and off, without human participation or verification [4].
According to [5], the growing popularity of on-demand instant messaging has changed consumer preferences in terms of communication. More and more industries are incorporating chatbots into their business processes. Bots are a critical resource for improving customer service and are changing the way companies communicate with current and potential customers. Finance, healthcare, education, travel and the real estate industry derive the greatest profits from chatbots. It is predicted that 80% of companies will integrate some form of chatbot system in 2021, which will allow them to save up to 30% of customer support costs. It turns out that over 50% of customers expect companies to be open 24/7, especially those providing international offers.
As communication technology advances, consumers expect to find information or contact customer support quickly and easily. Failure to respond promptly usually frustrates customers, which can mean losing them. However, contact with a human consultant is not always preferred. It turns out that 69% of consumers prefer chatbots because of their ability to provide quick answers to simple questions, 56% of consumers prefer to send a message to the company for help rather than call the customer service department, 37% of consumers expect quick answers in emergencies and 33% of consumers would like to use chatbots for booking, online ordering and other functions [5], which, with the current state of technology, is no longer wishful thinking but a feasible task [6].
Chatbots are expected to become more human, but they are often unable to process the client's intentions, which leads to misinterpreted requests and responses. They lack conversational intelligence; that is, they often fail to process the nuances implied in dialogue, resulting in inadequate conversation. The goal, however, is for chatbots to provide a personalized experience unique to each customer in order to build positive relationships, increase customer loyalty and earn positive feedback. To make this possible, they are integrated with various other systems, such as automatic speech recognition (ASR) [2], which are meant to improve this communication. Additionally, dialog systems are often integrated with a machine translator to reduce costs, which significantly increases the reach of the bot [7,8].
Unfortunately, chatbots and other machine learning-based systems do not work 100% correctly and are prone to error. The more systems are connected, the greater the risk. This is one of the reasons why neural dialog systems sometimes generate short and nonsensical responses and sometimes loop or fail to interpret queries correctly. Part of our project is a real-time speech analysis module whose task is to assess the user's mood on an ongoing basis through sentiment analysis. It analyses the flow of the entire conversation and updates the results each time new user input is provided. To this end, we implement a dialog system, a machine translator and an ASR module, and the outputs of these modules are subjected to sentiment analysis. We perform sentiment analysis using various methods for Polish and English and we also analyse the impact of machine translation on the quality of sentiment analysis. We assess our methods through both automatic metrics and human evaluation.
Detecting a drop in satisfaction in communication with a dialog agent is a very important problem for improving customer satisfaction. A quick response will, on the one hand, redirect the interlocutor to a human agent and, on the other hand, allow us to collect data on the cases in which the agent fails and to remove these problems.
The article is structured as follows. Section 2 discusses the current state of the art in sentiment analysis. It also describes the experimental environment, focusing on machine translation, ASR, TTS and the dialog system and their connection in a pipeline of tools. In Section 3, experiments are conducted and divided into subsections by the language of the trained models. Section 4 provides manual confirmation of the results obtained in Section 3. Finally, in Section 5 we draw and discuss conclusions.

Experimental Environment and the Current State of Knowledge
The aim of the study was to implement an analytical module whose purposes include analysing the sentiment of text coming from dialogues between human and machine. This module was designed to analyse dialogue in real time and react to increased user dissatisfaction by redirecting the user to a human agent. For this purpose, a test environment consisting of dialog system modules was prepared and the sentiment analysis was carried out on their output.
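The real-time behaviour described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the module used in the study: the class name `DialogueSentimentMonitor`, the exponential smoothing factor `alpha` and the escalation `threshold` are all hypothetical, and each `score` stands for the output of any sentiment model, scaled to [-1, 1].

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DialogueSentimentMonitor:
    """Tracks a running sentiment estimate over the user's turns."""
    alpha: float = 0.5        # weight of the newest turn (assumed value)
    threshold: float = -0.3   # escalate below this level (assumed value)
    running: float = 0.0      # current smoothed sentiment in [-1, 1]
    history: List[float] = field(default_factory=list)

    def update(self, score: float) -> bool:
        """Add the sentiment score of a new user turn.

        Returns True when the conversation should be handed to a human.
        """
        if not -1.0 <= score <= 1.0:
            raise ValueError("score must lie in [-1, 1]")
        if self.history:
            # Exponential moving average: recent turns count more.
            self.running = self.alpha * score + (1 - self.alpha) * self.running
        else:
            # The first turn initialises the estimate.
            self.running = score
        self.history.append(score)
        return self.running < self.threshold


monitor = DialogueSentimentMonitor()
for turn_score in [0.4, -0.2, -0.6, -0.8]:
    escalate = monitor.update(turn_score)
```

Smoothing over the whole conversation, rather than reacting to a single turn, is one way to avoid escalating on an isolated negative utterance; other aggregation schemes are equally plausible.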

Testing Environment
Our dialog system consisted of a dialog agent implemented in English using DeepPavlov [9] and a model trained in the BERT architecture [10]. To train the model, we used transcriptions of real-life problems and transfer learning of the English model within the DeepPavlov toolkit. The Polish language was handled by a machine translator based on convolutional neural networks implemented as part of the ModernMT tool [11], for which, apart from our own small corpus, we used a corpus based on subtitles [12]. The ASR module was built with the KALDI tool [13] and all audio recordings and transcriptions produced under the Clarin project [14]. The scheme of the system operation is presented in Figure 1. Each of the modules was evaluated with a separate metric. The dialog system was verified with the SQuAD metric [15], reaching 81.5% and 93.4% on our own evaluation corpus (containing 50 questions and answers consisting of 1672 sentences). The BLEU metric [16] was used to assess machine translation (MT): 64.21 points were achieved for PL to EN translation and 43.23 points in the reverse direction. The ASR system was assessed using the WER metric [17], obtaining 18.21 for PL and 12.37 for EN. The MT and ASR modules were assessed on the aforementioned 1672 sentences, prepared and translated by humans as part of our work. These results indicate high quality, consistent with the current state of the art in these fields, so the modules should not undermine the reliability of the sentiment analysis performed on their output.
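For readers unfamiliar with WER, the metric is the word-level edit distance between the reference transcription and the ASR hypothesis, divided by the number of reference words. The following is a minimal stdlib-only sketch; the function name `word_error_rate` is hypothetical, tokenisation is plain whitespace splitting (an assumption), and it is not the scoring tool used in the study.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution or match
        prev = curr
    return prev[-1] / len(ref)
```

The function returns a fraction; multiplying by 100 gives the percentage form in which WER is commonly reported.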

Sentiment Analysis Techniques Used
As part of the sentiment analysis itself, we carried out an in-depth review of the current state of knowledge and selected the most popular and most promising approaches. We started with an initial verification of the popular VADER tool [18], the results of which we treated as a reference point for further experiments.
The VADER tool uses a rule-based model with an English-language lexicon prepared by its authors [19]. It detects negations and words that intensify the overall tone, and even pays attention to letter case and the number of exclamation marks. The authors prepared the lexicon with particular emphasis on social media vocabulary, including emoticons and slang. The tool does not follow a machine learning approach, scores sentences consistently and cannot be trained. Only English is supported, so in order to analyse text in Polish, each query had to be machine translated. A poorly translated text could have resulted in an inaccurate analysis. Since the BLEU score of our MT system was over 60 points for PL to EN, in our opinion translation quality was satisfactory for the experiment. According to [20], the impact of MT quality should be marginal. Additionally, not all rules contained in the tool carry over to all languages. Therefore, there are adaptations of the tool, e.g., the German-language GerVADER [21]. VADER was created in 2014 and at that time it was better than other methods [22].
After the experiments with the VADER tool, sentiment analyses based on supervised learning methods were added [23]. More specifically, the LinearSVC [24] and naive Bayes Bernoulli [25] methods available in the scikit-learn library [26] were tested. For their training, the English-language data set Sentiment140 [27] was selected, which contains entries from the social network Twitter along with sentiment labels. Models were trained on the entire dataset using TF-IDF text vectorization. Additionally, for all methods, the returned results were normalized so that the values for positive, neutral and negative sentiment could be mapped to a binary marking: 0 for negative sentiment and 1 for positive sentiment. Our implementation was adjusted to show 2-class results (positive and negative sentiment), 3-class results (positive, neutral and negative sentiment) and percentages.
The Polish training set was created from approximately 43,000 tagged entries, also from Twitter [28]. Entries were downloaded through the API with a simple script based on the tweepy library [29]. On the basis of the prepared Polish data set, 2-class LinearSVC and naive Bayes Bernoulli models were trained. The Polish dataset was also used to train the 3-class naive Bayes Bernoulli model.
After these basic models were prepared, the best-performing state-of-the-art models were analysed. The world's best models, with an effectiveness of over 95%, were created on the basis of transfer learning techniques. Through fine-tuning, such a model can be adapted to up to 20 different tasks, such as sentiment analysis, answering open-ended questions, text classification, translation, etc. To achieve such high effectiveness, data sets of over 20 TB and GPU/TPU units for training and fine-tuning are needed. Data sets and pretrained models are available for download and the algorithms can be adapted to the available equipment: there are ways to reduce models, e.g., by 60% with a slight loss of quality (about 2%) [30], which we did. Model rankings are maintained by the machine learning community [31].
On the basis of these analyses, the pretrained models DistilBERT [32], T5 (text-to-text transfer transformer) [33] and XLNet [34] were added. The transformers library by Hugging Face [35] was used for this. It supports the formats of the popular libraries PyTorch [36] and TensorFlow [37]. A spaCy wrapper [38] has also been created for it, which simplifies the process of model training. The transformers library can export models to the ONNX format, which in turn allows the model to be optimized for the production environment [39].
Similar methods have already been applied to other Slavic and Baltic languages, but in other topic domains. The authors of [40] apply sentiment analysis to financial news in the Lithuanian language. In [41], the authors apply a modified RoBERTa model to sentiment analysis in Czech, which they find to be the most successful.
The ABSA analysis tool (aspect target sentiment analysis) was also used [42]. It works by examining the sentiment towards a selected subject (aspect) in the text. With this approach, a user's opinion reveals what exactly is considered good in the product and what is considered bad. For example, from the opinion "I like my phone, the camera works great, but the battery leaves a lot to be desired" we get the analysis result: aspect: camera, sentiment: positive; aspect: battery, sentiment: negative. A proof of concept was prepared, which uses the DistilBERT model adapted to such an analysis.
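The normalization of model outputs into 2-class, 3-class and percentage form can be illustrated with a short sketch. It assumes that each model's raw output has already been rescaled to a polarity score in [-1, 1]; the function names and the width of the neutral band are hypothetical and are not taken from the implementation used in the experiments.

```python
def to_three_class(score: float, neutral_band: float = 0.05) -> str:
    """Map a polarity score in [-1, 1] to negative / neutral / positive.

    The neutral band of +/-0.05 is an assumed value.
    """
    if score >= neutral_band:
        return "positive"
    if score <= -neutral_band:
        return "negative"
    return "neutral"


def to_two_class(score: float) -> int:
    """Binary marking: 0 for negative sentiment, 1 for positive sentiment.

    Scores of exactly 0 are counted as positive here (an assumption).
    """
    return 1 if score >= 0 else 0


def to_percentage(score: float) -> float:
    """Rescale [-1, 1] linearly to a 0-100% positivity value."""
    return round((score + 1) / 2 * 100, 1)
```

Keeping the three views as separate functions over one common score makes it straightforward to compare rule-based and learned models on the same scale.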

Experiments
One of the first tasks was to compare two text vectorization methods (Table 1) for two different models, naive Bayes Bernoulli and LinearSVC, on the Amazon Video Games set [43]. The TF-IDF method [44] was compared with an implementation called LabelEncoder [45], which marks each word with a number.
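To make the difference between the two vectorization methods concrete, the following stdlib-only sketch contrasts integer label encoding with a textbook TF-IDF weighting. It is an illustration under assumptions, not the scikit-learn code used in the experiments: the function names `label_encode` and `tf_idf` are hypothetical, identifiers are assigned in order of first appearance, and the unsmoothed formula tf * log(N / df) is used.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def label_encode(docs: List[str]) -> Tuple[List[List[int]], Dict[str, int]]:
    """Replace every word with an integer identifier."""
    vocab: Dict[str, int] = {}
    encoded: List[List[int]] = []
    for doc in docs:
        ids = []
        for word in doc.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab)  # next free identifier
            ids.append(vocab[word])
        encoded.append(ids)
    return encoded, vocab


def tf_idf(docs: List[str]) -> List[Dict[str, float]]:
    """Weight each word by term frequency times inverse document frequency."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Number of documents in which each word occurs at least once.
    doc_freq = Counter(word for doc in tokenised for word in set(doc))
    weights = []
    for doc in tokenised:
        counts = Counter(doc)
        weights.append({
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights


docs = ["great game", "bad game"]
encoded, vocab = label_encode(docs)
weights = tf_idf(docs)
```

In this toy corpus the word "game" occurs in every document and therefore receives a TF-IDF weight of zero, whereas label encoding gives it an arbitrary identifier that carries no information about how discriminative the word is.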
The metrics included were precision, recall and F1 score [46]. In the case of LinearSVC, TF-IDF gave slightly better results. In the case of the naive Bayes model, word encoding gave 6% higher precision. Nevertheless, recall (the ratio of correctly identified observations to all observations of the given class) was half as high for this model, which is also reflected in the F1 score, of which recall is a component.
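The interplay between these three metrics can be checked with a few lines of code. The sketch below uses the standard definitions for a single positive class; the function name and the toy labels are hypothetical and do not reproduce the values in Table 1.

```python
from typing import Sequence, Tuple


def precision_recall_f1(y_true: Sequence[int], y_pred: Sequence[int],
                        positive: int = 1) -> Tuple[float, float, float]:
    """Compute precision, recall and F1 for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# A cautious model: its only positive prediction is correct,
# but it misses three of the four positive examples.
precision, recall, f1 = precision_recall_f1([1, 1, 1, 1, 0, 0],
                                            [1, 0, 0, 0, 0, 0])
```

Because F1 is a harmonic mean, a model with perfect precision but low recall, as in this toy case, still ends up with a low F1 score.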

Pretrained English Models
Models that arose after significant developments in the field of transfer learning, i.e., transformer models, are pretrained, which also means that the mechanisms for converting text into vectors are already integrated into the hidden layers of the model. Examples of such models are BERT and GPT [47].
In this respect, the DistilBERT model was trained on the SST-2 benchmark (Stanford Sentiment Treebank v2) [48]. The transformers library was used to train the DistilBERT model and the SST-2 task came from the GLUE Benchmark [49] set. The result was close to that recorded in the model ranking (accuracy of around 0.92): eval_loss = 0.3662; eval_acc = 0.9013; epoch = 3.0. The effectiveness of pretrained English-language models was also compared, as presented in Table 2. A simple script was prepared to measure and compare the model results. The English-language Amazon "Video Games" dataset [50] was selected to evaluate the effectiveness. The dataset consists of opinions rated 1-5. Ratings 1-2 were treated as negative, 3 as neutral and 4-5 as positive. The naive Bayes and LinearSVC models were trained on the entire Sentiment140 dataset using TF-IDF vectorization. DistilBERT was pretrained on the Wikipedia and BookCorpus corpora, the T5 model on the C4 corpus [51] and the XLNet model on BookCorpus, English Wikipedia, Giga5, ClueWeb and CommonCrawl [52]. The VADER model, based on a set of rules and a specially prepared lexicon, was also included in the comparison. The evaluation was based on 20,000 records.
It was noticed that the T5 model sporadically generates an unexpected sequence (e.g., "Sst" instead of a prediction), which, despite the high score, suggests that this network was pretrained incorrectly. The problem was not investigated further, as the Polish-language models were the priority in the study.
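The mapping from star ratings to sentiment labels is simple enough to state in code. The sketch below follows the thresholds given above; the function names are hypothetical, and dropping neutral reviews in the 2-class setting is an assumption rather than something stated in the text.

```python
from typing import Optional


def rating_to_sentiment(rating: int) -> str:
    """Map a 1-5 star rating to a 3-class sentiment label."""
    if rating not in (1, 2, 3, 4, 5):
        raise ValueError("rating must be an integer from 1 to 5")
    if rating <= 2:
        return "negative"
    if rating == 3:
        return "neutral"
    return "positive"


def rating_to_binary(rating: int) -> Optional[int]:
    """Map a rating to 0 (negative) or 1 (positive).

    Returns None for neutral reviews so that the caller can skip them
    when evaluating 2-class models (assumed handling).
    """
    label = rating_to_sentiment(rating)
    if label == "neutral":
        return None
    return 1 if label == "positive" else 0
```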

Comparative Experiments for Polish-Language Models
For the Polish language, the Polish RoBERTa [53] and PolBERT [54] models were trained on various corpora with binary sentiment labels and the effectiveness metrics were recorded. The simpletransformers library was used for this purpose [55]. The results of the experiments are presented in Table 3.
The data sets used were:
- Clarin: Polish entries on the social platform Twitter [56];
- PolEmo 2.0: multidomain product reviews [57];
- AllegroReviews: multidomain product reviews [58].
The models obtained the best results (0.88 and 0.89) after training on the PolEmo 2.0 corpus. It is worth noting the case of training RoBERTa on Allegro Reviews and then on PolEmo 2.0, where the accuracy obtained was 0.58. The domain of the corpora is therefore important: PolEmo mainly consists of opinions about places, whereas Allegro Reviews consists of opinions about products. If the subject matter overlaps, as is partly the case with Clarin and PolEmo, the result should be better, due, among other reasons, to the uniform context of the statements.
The MCC metric was also included [59]. A result close to 1 represents a perfect prediction, 0 is no better than a random prediction and −1 represents a complete mismatch. In two cases, the MCC metric was equal to 0; this is most likely an error on the side of the library used.
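For reference, the MCC for binary labels can be computed directly from the confusion matrix. The sketch below is a stdlib-only illustration with a hypothetical function name, not the library code used in the experiments. It follows the common convention of returning 0 when the denominator vanishes, which happens, for instance, when a model predicts only one class; whether this explains the zero values reported here is not established.

```python
import math
from typing import Sequence


def matthews_corrcoef(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """MCC for labels encoded as 0 (negative) and 1 (positive)."""
    tp = tn = fp = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 0:
            tn += 1
        elif t == 0 and p == 1:
            fp += 1
        else:
            # Remaining case under the 0/1 encoding: true 1, predicted 0.
            fn += 1
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Convention: an empty row or column of the confusion matrix gives 0.
        return 0.0
    return (tp * tn - fp * fn) / denominator
```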

Possible Optimization of Models
Optimizations play a key role, as they can accelerate the model by up to 30% and thus reduce operating costs. Model inference, i.e., obtaining the prediction result, can be optimized using tools such as the OpenVINO framework [60], Microsoft's ONNX [61] or TensorRT [62]. KITO is also available [63], although it is mainly used for image processing models. An extensive article by Intel on system, application and model optimization explains the importance of optimizations at the level of the entire infrastructure [64]. Finally, it was necessary to check the tools and additionally apply model pruning [65].

Results of Polish Models
During the experiments with Polish models, the sentiment of 10,000 sentences from the Clarin set was analysed. The results are shown in Table 4. The results of PolBert and PolishRoberta are similar to those in the Polish model ranking [49], which are presented in Table 5.

Manual Evaluation
The results presented in Section 3 show that the quality of the evaluation depends not only on the general metric, but also on the domain of the texts under analysis, so we decided to conduct a manual analysis as well. Although our models achieved world-class results on popular benchmarks, some of them relied on transfer learning from models belonging to other domains and could therefore turn out to be weaker predictors in our field.
Therefore, 100 real opinions in Polish and 100 in English from our clients were manually prepared. They were evaluated manually by humans and then compared with the results of the methods described. The sentiment of each comment was assessed manually on the following scale: "-" for negative sentiment; "0" for neutral sentiment; "+" for positive sentiment. During the test, points were awarded for agreement with the subjective assessment of the human evaluator. A method received 3 points for perfect agreement, 1 point for partial agreement (e.g., the method considers the comment neutral and the evaluator considers it positive) and 0 points for no agreement. The analysis was performed separately for 2-class models and for 3-class models. Finally, for each method, the percentage of agreement with the human assessment was determined.
The results of the human evaluation are presented in the form of graphs. Figure 2 shows the results of the sentiment analysis methods for Polish comments and Figure 3 shows the results for Polish comments that are not considered neutral. Figure 4 shows the results of the sentiment analysis methods for English comments and Figure 5 shows the results for English comments without those considered neutral.
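The scoring scheme can be expressed as a short sketch. Two points are assumptions rather than details given in the text: partial agreement is taken to mean that exactly one of the two labels is neutral, and the percentage is taken as the points obtained divided by the maximum achievable (3 points per comment). The function names are hypothetical.

```python
from typing import Sequence

FULL_POINTS = 3
PARTIAL_POINTS = 1


def agreement_points(human: str, method: str) -> int:
    """Score one comment; labels are '-', '0' or '+'."""
    if human == method:
        return FULL_POINTS
    if "0" in (human, method):
        # One side neutral, the other polar: partial agreement (assumed).
        return PARTIAL_POINTS
    return 0  # opposite polarities


def agreement_percentage(human_labels: Sequence[str],
                         method_labels: Sequence[str]) -> float:
    """Share of the maximum achievable points, in percent."""
    if not human_labels or len(human_labels) != len(method_labels):
        raise ValueError("label sequences must be non-empty and equally long")
    total = sum(agreement_points(h, m)
                for h, m in zip(human_labels, method_labels))
    return 100.0 * total / (FULL_POINTS * len(human_labels))


score = agreement_percentage(["+", "0", "-", "+"], ["+", "+", "+", "-"])
```

In the toy example the method earns 3 + 1 + 0 + 0 = 4 of 12 possible points, i.e., about 33%.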

Conclusions and Discussion
Because the predictions are to be made in real time, it would seem that, apart from accuracy, the main criterion for choosing a model is the inference time. Keeping it low allows for cost optimization on the part of the enterprise.
BERT-based models are pretrained on large datasets, so their domain adaptability should be high. The problem with scaling, however, is the inference time and the resources used by such a model. The inference time can be reduced by optimization methods, e.g., by exporting models to the ONNX format or by using distillation or pruning.
Based on generic benchmarks, the seemingly optimal solution would be a queuing system for prediction built around the BERT-based PolBERT and DistilBERT models. If the queue were full, these models would be supported by the less demanding LinearSVC, VADER and naive Bayes. With a large discrepancy in inference time between models and a large number of queries, the supporting models would provide most of the results.
However, manual analysis revealed that for the Polish language, the PolishRoberta method turned out to be the most consistent with human assessment. For English, the T5 method was the most consistent. The VADER method, however, was well below expectations. It was with this method in mind that we machine translated queries, to avoid the need to create our own language-specific rules. It turns out, however, that the method not only fares poorly, but that machine translation, and ASR in particular, significantly worsens its results. This is most likely because normalization of the text removes much of the information on which its rules are based.
In conclusion, we reviewed the sentiment analysis techniques that achieve the highest results on generic benchmarks and checked which of them work best in real business use in terms of quality and how their performance scales. This is valuable knowledge from the business and implementation point of view. There is no doubt, however, that the research can be developed further towards the analysis of model domain adaptation techniques and the optimization of their performance. In a company, gains of even a few percent in these areas translate, at scale, into real money.
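A queue-based fallback of this kind could be sketched as follows. This is a simplified, single-threaded illustration under assumptions: the class name `FallbackRouter`, the queue capacity and the keyword-based stand-in for a lightweight model are all hypothetical, and the worker that would drain the queue with a BERT-based model is omitted.

```python
import queue
from typing import Callable, Optional, Tuple


def keyword_polarity(text: str) -> int:
    """Hypothetical stand-in for a lightweight model such as LinearSVC."""
    negative_words = ("bad", "broken", "angry")
    return 0 if any(word in text.lower() for word in negative_words) else 1


class FallbackRouter:
    """Sends texts to the heavy model's queue, or to a light model when full."""

    def __init__(self, heavy_capacity: int,
                 light_model: Callable[[str], int]) -> None:
        # Bounded queue consumed by the (omitted) heavy-model worker.
        self.heavy_queue = queue.Queue(maxsize=heavy_capacity)
        self.light_model = light_model

    def submit(self, text: str) -> Tuple[str, Optional[int]]:
        """Return ('queued', None) or ('light', prediction)."""
        try:
            self.heavy_queue.put_nowait(text)
            return ("queued", None)
        except queue.Full:
            # The heavy model is saturated: answer immediately with the
            # cheaper model instead of making the user wait.
            return ("light", self.light_model(text))


router = FallbackRouter(heavy_capacity=2, light_model=keyword_polarity)
results = [router.submit(t) for t in ["hello", "thanks", "this is bad"]]
```

With a capacity of two, the third request is answered by the lightweight model, which mirrors the behaviour described above: the larger the gap in inference time and the heavier the load, the more results come from the supporting models.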