Sentiment Analysis of Lithuanian Texts Using Traditional and Deep Learning Approaches

Abstract: We describe sentiment analysis experiments performed on a Lithuanian Internet comment dataset using traditional machine learning (Naïve Bayes Multinomial, NBM, and Support Vector Machine, SVM) and deep learning (Long Short-Term Memory, LSTM, and Convolutional Neural Network, CNN) approaches. The traditional machine learning techniques were used with features based on lexical, morphological, and character information. The deep learning approaches were applied on top of two types of word embeddings (Word2Vec continuous bag-of-words with negative sampling and FastText). Both the traditional and the deep learning approaches were applied to the positive/negative/neutral sentiment classification task on the balanced and full versions of the dataset. The best deep learning results (an accuracy of 0.706) were achieved on the full dataset with CNN applied on top of the FastText embeddings, with replaced emoticons and eliminated diacritics. The traditional machine learning approaches demonstrated the best performance (an accuracy of 0.735) on the full dataset with the NBM method, replaced emoticons, restored diacritics, and lemma unigrams as features. Although the traditional machine learning approaches were superior to the deep learning methods, deep learning demonstrated good results when applied on the small datasets.


Introduction
The Internet has changed the way people express their beliefs and sentiments about products or services, events, topics, interactions, etc. This is mainly done via social networks, review websites, web forums, blogs, and Internet comments. These texts are rich in sentiment and are therefore valuable for companies or individuals who want to improve their product marketing strategies and respond accordingly. Methods of automatic sentiment analysis are commonly chosen for this purpose. Sentiment analysis has been used to analyze the sentiments of tweets on health topics [1], evaluate teachers' ability and predict student performance [2], identify the contextual polarity of drug, movie, and restaurant reviews [3][4][5], examine public opinion on products, services, and societal issues [6], classify e-learners and their topics of interest in their social network interactions [7], perform opinion mining in tweets [8,9], analyze the sentiment orientation of microblogs [10], and forecast stock prices from the sentiments of news signals in the financial markets [11].
Sentiment analysis (or classification) methods can be grouped into two main categories: dictionary-based and machine learning. The dictionary-based methods (such as [12][13][14][15]) typically rely on external lexical resources of sentiment words and on syntactic-semantic knowledge. Unfortunately, such resources (e.g., SentiWordNet, a WordNet with positive, negative, and neutral values assigned to each synset; see https://sentiwordnet.isti.cnr.it/) are not available for the lower-resourced languages.
Recently, deep learning methods (which also belong to the machine learning group) have gained interest in many text classification tasks, including sentiment analysis. As claimed in [18], a Deep Neural Network (DNN) architecture that jointly uses character-, word-, and sentence-level representations achieves state-of-the-art performance for binary (positive and negative) classification on the Stanford Sentiment Treebank of movie reviews [19] and the Stanford Twitter sentiment corpus [20]. Katoh and Ninomiya [21] solve the binary sentiment classification task on English and Japanese datasets. The authors show that multi-layer Neural Networks (NNs) work well only on large-scale datasets, whereas NNs without hidden layers are more suitable for smaller datasets. The Convolutional Neural Network (CNN) has been applied to the binary-class Twitter dataset used as the benchmark in the SemEval-2015 competition [22]. The experiments described in [23] were performed on English binary review data obtained from Amazon. The authors used the CNN method with Google News word2vec embeddings and demonstrated its superiority over the baseline methods. A hybrid approach applied to the Hindi language and a dataset of four classes (positive, negative, neutral, and conflict) is based on the CNN architecture augmented with a set of optimized features selected through a multi-objective optimization framework. Subsequently, the augmented optimization vector is used for training the SVM [24]. Yousif et al. [25] used a hybrid neural model that combines CNN and a Recurrent NN (RNN) to capture local n-gram features and long-term dependencies of the text for classifying the sentiments and purposes of scientific citations. Yun et al. [26] combined the vector space and CNN methods. First, they select words based on the spatial distribution of features in the text and map these words into abstract vectors based on the dictionary. Afterwards, CNN is used to extract features from the abstract vectors for sentiment classification.
Stojanovski et al. [27] used CNN and RNN to obtain more diverse representations on top of GloVe (Global Vectors) word embeddings. For texts of variable length, CNN and RNN make it possible to extract fixed-length representation vectors and to use their concatenation. Their system, applied to benchmark Twitter datasets with two or five classes (positive, negative, neutral, very positive, very negative), was ranked second best in the SemEval-2016 task [27]. Lu et al. [28] compared RNN and CNN trained on neural embeddings with SVM trained on unigram, bigram, and unigram + bigram features. Their results showed that the deep learning methods underperformed most of the SVM models on English and Japanese sentiment datasets with either two or three classes. Cortis et al. [29] summarized the sentiment analysis works presented by 32 participants in SemEval-2017 Task 5 on the Financial Microblogs and News data. A traditional machine learning technique (rather than deep learning) was ranked first: it employed linguistic, sentiment lexicon, and domain-specific features together with Google word embeddings to construct models of ensemble regression algorithms [30]. The second best technique was based on deep learning, i.e., an ensemble-based model that combined CNN with Long Short-Term Memory (LSTM) and word embeddings [31]. The SemEval-2017 Task 4 competition described in [32] attracted 48 teams. Sentiment analysis was performed on Twitter data with two or five classes. The best-ranked teams used deep learning: e.g., BB_twtr employed an ensemble of LSTM and CNN methods with multiple convolution operations [33], and DataStories applied an LSTM network with an attention mechanism [34].
Although the results of sentiment analysis are mixed (i.e., there is no consensus on which method is the best), deep learning has recently become the dominant paradigm. For this reason, we also investigate the impact of the deep learning approaches on the sentiment analysis task in this research.
Here, we deal with a dataset of Lithuanian Internet comments containing three classes (positive, negative, and neutral) by testing two deep learning approaches (LSTM and CNN with two types of neural word embeddings). However, deep learning approaches are effective only on large training datasets and when applied on top of a comprehensive list of word embeddings (trained on huge text corpora); therefore, they may not be effective on the Lithuanian datasets for our task. For this reason, and for comparison purposes, we test two traditional machine learning approaches (SVM and NBM with five types of feature representations) and evaluate the impact of pre-processing techniques (replacement of emoticons with sentiment words, removal of stop words, elimination/restoration of diacritics) and dataset sizes (full and balanced). Unfortunately, we cannot compare the machine learning approaches with the dictionary-based ones, because important lexical resources (e.g., SentiWordNet) either do not exist for the Lithuanian language or are too small to achieve reasonable results (previous attempts to tackle the sentiment analysis task for the Lithuanian language with a sentiment lexicon were not very successful [35]). Thus, the main purpose of this research is to test various machine learning methods (their parameters and pre-processing techniques), to compare the results, and to offer the best solution for the sentiment classification of non-normative Lithuanian texts.

Formal Definition of the Task
The sentiment classification task in our research can be interpreted as the supervised text classification task.
Let $d_1, d_2, \ldots, d_n$ be the texts (Internet comments), each attached to exactly one of the classes $c_1, c_2, \ldots, c_m$ (where $m = 3$, because we have positive, negative, and neutral sentiments). Hence, our task is a single-label, multi-class ($m > 2$) classification problem.
Let the function $\gamma$ denote a mapping of texts to their sentiments (i.e., classes), $\gamma : D \rightarrow C$, where $D$ is the set of texts and $C$ is the set of classes. The goal is to choose a method that finds the best approximation of $\gamma$. For this reason, we have compared two groups of methods: traditional machine learning techniques with discrete representations (described in Section 2.3) and deep learning methods with distributional representations (described in Section 2.4).

The Dataset
We used a dataset of Lithuanian Internet comments (see Table 1). The texts in this dataset were labeled with positive, negative, and neutral polarities by two human experts based on their mutual agreement (i.e., the experts assessed the comments independently, and only the texts that obtained the same polarity values were included in the dataset). The dataset is of the general domain: it contains opinions about articles on various topics (politics, sport, health, economics, etc.) published on the Lietuvos Rytas news portal (https://www.lrytas.lt/) and crawled in March 2013. The Internet comments represent the non-normative Lithuanian language.
The Lithuanian language is vocabulary rich (the Dictionary of Lithuanian [36] contains more than 0.5 million headwords, whereas the Oxford English Dictionary [37] contains ~0.35 million), fusional and morphologically complex, and derivationally rich (e.g., verbs can be formed from onomatopoeias, and the language has ~60 prefixes and ~600 suffixes [38], some of which are used to derive diminutives and hypocoristic words). The diminutives and hypocoristic words, which are very common in the spoken language (e.g., the word dukra (daughter) can be found as dukrytė, dukružytė, etc.), can have more than one suffix, and in some rare cases the number of suffixes reaches six. Moreover, the non-normative Lithuanian language has one more problem that English does not face, i.e., word diacritics. In non-normative Lithuanian texts, diacritics are used or omitted at the author's discretion. Words with omitted diacritics either become out-of-vocabulary words or (in most cases) coincide with words of an entirely different meaning, which causes ambiguity (e.g., karštas (hot) and karstas (coffin)), and may even hold the opposite sentiment polarity. Besides undiacritized words, non-normative Lithuanian texts contain other out-of-vocabulary words (including foreign language inserts, words with spelling errors, jargon, abbreviations, etc.), emoticons, etc. The second version of the dataset (presented in Table 2) contains an equal number of texts in each category, randomly selected from the "main pool" (i.e., from the dataset presented in Table 1). A detailed description of this version of the dataset can be found in [35].
Although the non-normative texts are rather noisy (e.g., have missing diacritics), they may contain valuable information (e.g., emoticons) which, if taken into account, can facilitate the task. For this reason, and because of the previously mentioned specifics of the non-normative Lithuanian language, we have explored the impact of the following pre-processing techniques:

• No pre-processing. The texts were lowercased; numbers and punctuation were eliminated.

• With emoticons. As shown in [39], emoticon replacement leads to higher classification accuracy. In our experiments, we used 32 groups of emoticons, where each group was mapped to the appropriate sentiment word (presented in its main vocabulary form). For instance, all the emoticons of the group ":(", ":-(", ":-c", ":-[", etc. have the same meaning liūdnas (sad) and can therefore be replaced with liūdnas in the text. With this pre-processing technique, all of the words were lowercased, the detected emoticons were replaced with the appropriate sentiment words, and numbers and punctuation were eliminated.

• No stop words. Intuitively, such words as acronyms, conjunctions, or pronouns appear too often in all types of texts and are too short to carry important sentiment information. However, Spences and Uchyigit [40] claim that interjections are strong indicators of subjectivity. For this reason, we used a list of 555 stop words that does not contain interjections. During this pre-processing step, all words were lowercased, and stop words, numbers, and punctuation were eliminated.

• Diacritics elimination. The Lithuanian language uses the Latin alphabet supplemented with these diacritized letters: ą, č, ę, ė, į, š, ų, ū, ž. However, diacritics in non-normative Lithuanian texts are sometimes omitted; therefore, the same word can be found written both with and without diacritics. This causes ambiguity problems and increases data sparseness, which, in turn, negatively affects the accuracy. One way to decrease data sparseness is to distort the data by replacing the diacritized letters with their American Standard Code for Information Interchange (ASCII) equivalents. However, such distortion may cause even more ambiguity problems, which in turn may degrade the performance. Hence, diacritics elimination is a questionable pre-processing technique whose effectiveness has to be tested experimentally. With this pre-processing technique, the words were undiacritized and lowercased, and numbers and punctuation were eliminated.

• Diacritics restoration. This is another way to decrease data sparseness and increase the text quality. For this purpose, we used the language modeling method described in [41], which was shown to be the best for Lithuanian diacritization problems. This method used a bigram back-off strategy (with ~58.1 million bigrams and ~2.3 million unigrams) to restore diacritics in our non-normative texts. After diacritization, the words were lowercased, and numbers and punctuation were eliminated.
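
The pre-processing variants listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline used in the experiments: the function names are hypothetical, the emoticon table holds just the single group quoted in the text (the full list of 32 groups is not reproduced here), and only ASCII punctuation is treated as punctuation. Stop word removal and diacritics restoration are left out because they depend on the 555-word list and on the language model of [41].

```python
import re
import string

# Hypothetical, heavily abbreviated emoticon table: only the "sad" group
# quoted in the text is included.
EMOTICON_GROUPS = {
    "liūdnas": [":(", ":-(", ":-c", ":-["],
}

# ASCII equivalents of the Lithuanian diacritized letters.
DIACRITICS = str.maketrans("ąčęėįšųūž", "aceeisuuz")

# Digits and ASCII punctuation, removed in every variant.
NOISE = re.compile(r"[\d" + re.escape(string.punctuation) + r"]")


def replace_emoticons(text):
    """Replace every known emoticon with the sentiment word of its group."""
    pairs = [(emo, word) for word, emos in EMOTICON_GROUPS.items() for emo in emos]
    # Longest emoticons first, so a short one never consumes part of a longer one.
    for emo, word in sorted(pairs, key=lambda p: -len(p[0])):
        text = text.replace(emo, f" {word} ")
    return text


def clean(text, emoticons=False, strip_diacritics=False):
    """Lowercase, optionally map emoticons and drop diacritics, remove noise."""
    if emoticons:
        text = replace_emoticons(text)  # must run before punctuation is removed
    text = text.lower()
    if strip_diacritics:
        text = text.translate(DIACRITICS)
    text = NOISE.sub(" ", text)
    return " ".join(text.split())
```

Calling `clean(comment)` corresponds to the "no pre-processing" variant, while the two flags switch on emoticon replacement and diacritics elimination independently.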

Traditional Machine Learning Techniques
We have used two traditional machine learning approaches:

• Support Vector Machine (SVM) [42] is a discriminative instance-based method, which for a very long time has been the most popular text classification technique. SVM copes efficiently with high-dimensional feature spaces (e.g., without feature selection, SVM has to handle ~26 K different features; see the number of distinct tokens in Table 1) and with sparse feature vectors (only ~9 words per text), and it does not require aggressive feature selection (i.e., it does not lose potentially relevant information, which is important for not degrading the accuracy [43]).

• Naïve Bayes Multinomial (NBM) [44] is a generative profile-based approach, which can also outperform the popular SVM in sentiment analysis tasks [16]. It is a simple and rather fast technique, which performs especially well on a large number of features with equal significance. Besides, this method does not require huge resources for data storage, and it is often selected as a baseline approach.
We have used the SMO (for the SVM) with the polynomial kernel and the NBM implementations in the Weka (Hall et al., 2009) machine learning toolkit, version 3.6 (available at http://www.cs.waikato.ac.nz/ml/weka/). All other parameters were set to their default values. Both the SVM and the NBM methods were applied on a set of features covering different levels of abstraction:

• Lexical. Token unigrams (a common bag-of-words approach) and token unigrams + bigrams were extracted from the texts.

• Morphological. These feature types cover lemma unigrams and lemma unigrams + bigrams. Before feature extraction, the texts had to be lemmatized. For the lemmatization, we used the Lithuanian morphological analyzer-lemmatizer Lemuoklis [45]. Lemuoklis can solve ambiguity problems and transform recognized words into their dictionary form. However, it is effective only on normative texts, which basically means that all undiacritized, abbreviated, or slang words remain untouched.

• Character. This feature type represents document-level character tetra-grams. The value n = 4 was selected because it demonstrated the best performance among the different n values on the Lithuanian language in the topic classification task [46].
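
The lexical and character feature types can be illustrated with a short Python sketch. It is not the Weka configuration used in the experiments; the function names are hypothetical, and whitespace tokenization is an assumption. The morphological feature types would be obtained by applying the same token functions to lemmatized text, which requires Lemuoklis and is therefore not shown.

```python
from collections import Counter


def token_ngrams(tokens, n):
    """Return all contiguous token n-grams, joined with a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n=4):
    """Return all character n-grams of the whole text (spaces included)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def extract_features(text, kind):
    """Count features of one type for a single pre-processed text.

    kind: "lex1" (token unigrams), "lex2" (token unigrams + bigrams),
          or "chr4" (document-level character tetra-grams).
    """
    tokens = text.split()  # assumption: tokens are separated by whitespace
    if kind == "lex1":
        features = token_ngrams(tokens, 1)
    elif kind == "lex2":
        features = token_ngrams(tokens, 1) + token_ngrams(tokens, 2)
    elif kind == "chr4":
        features = char_ngrams(text, 4)
    else:
        raise ValueError(f"unknown feature type: {kind}")
    return Counter(features)
```

For a three-word comment, "lex2" yields three unigrams and two bigrams, which shows how quickly the feature space grows relative to "lex1" on texts of about nine words.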

Deep Learning Methods
We have used two deep learning approaches:

• Long Short-Term Memory (LSTM) [47] is a special type of recurrent artificial neural network. This method analyzes the text word by word and stores the captured semantics in the hidden layers. Its main advantage over plain recurrent neural networks is that LSTM is much less affected by the vanishing gradient problem when learning long-term dependencies.

• Convolutional Neural Network (CNN) [48] (presented in detail for text classification in [49]) is a feed-forward artificial neural network (ANN) that contains one or more convolutional layers and max-pooling. The convolutional layers help the method to go deeper by decreasing the number of parameters, and the max-pooling layer helps to effectively determine discriminative phrases in the text. However, the size of the filters in each convolutional layer remains an issue: windows that are too small may cause a loss of important information, whereas windows that are too large may produce an enormous number of parameters.
For testing LSTM [50] and CNN [49], we have used their implementations in deeplearning4j [51], the open-source distributed deep learning library for the Java Virtual Machine. However, the existing implementations could solve only binary classification tasks; therefore, the authors of this research made the necessary adjustments for multi-class classification. Both methods truncate texts exceeding 256 words (which is more than enough, because the average text length does not exceed 10 tokens); all other parameters were set to their default values.
Both deep learning methods were applied on distributional feature vectors represented as neural word embeddings. In our experiments, we have used two types of embeddings:

• Word2Vec. These embeddings were generated with the same deeplearning4j software, using the continuous bag-of-words (CBOW) architecture with negative sampling and 300 dimensions. Under this architecture, a neural network has to predict a focus word from its context words in the surrounding window (a detailed explanation of the method is in [52,53]). Trained on a corpus of ~234 million running words, it produced 687,947 neural word embeddings for the Lithuanian language (more about it can be found in [54]).

• FastText [55]. It is an extension of Word2Vec offered by the Facebook AI Research Lab. Instead of inputting whole words into the network (as in the Word2Vec case), each word (e.g., network) is split into several character n-grams (netw, etwo, twor, work); separate vectors are trained for all of these n-grams and are then summed to get a single vector for the whole word (network). In our experiments, we used the two million available Lithuanian FastText word embeddings (downloaded from https://fasttext.cc/docs/en/crawl-vectors.html).
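
The subword idea behind FastText can be sketched as follows. This is a simplified illustration that follows the description above (fixed n = 4, no word-boundary markers); the released fastText models additionally use boundary symbols and a range of n-gram lengths, which are left out here. The function names and the toy two-dimensional vectors are hypothetical.

```python
def char_ngrams(word, n=4):
    """Split a word into its overlapping character n-grams."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def word_vector(word, ngram_vectors, dim):
    """Sum the vectors of all known n-grams of the word.

    ngram_vectors: hypothetical mapping from an n-gram to a list of floats
    of length dim. Unknown n-grams are skipped, so an unseen word still
    gets a vector as long as some of its n-grams were seen in training.
    """
    total = [0.0] * dim
    for gram in char_ngrams(word):
        vector = ngram_vectors.get(gram)
        if vector is not None:
            total = [t + v for t, v in zip(total, vector)]
    return total
```

Because the word vector is composed from its parts, a misspelled or undiacritized form shares most of its n-grams, and hence most of its vector, with the correct form.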
The FastText embeddings were trained on various types of texts, because they contain both diacritized (e.g., gražus (beautiful), šaunus (cool), etc.) and undiacritized (grazus, saunus, etc.) versions of the same word. In contrast, the Word2Vec embeddings were trained on normative Lithuanian texts only; therefore, intuitively, they cannot be effective on undiacritized texts (especially after the diacritics elimination pre-processing technique described in Section 2.2 has been applied). Undiacritized non-normative texts are already distorted; therefore, one way to find more matches among the word embeddings is to distort the embeddings in the same way by removing diacritics. Such removal may cause ambiguity problems (because the undiacritized version of some word can also exist in the normative Lithuanian language, but with a different form, meaning, or even sentiment polarity), e.g., mama (mother in nominative) and mamą (mother in accusative), or karstas (coffin) and karštas (hot). The list of embeddings cannot contain duplicate words: if diacritics elimination caused an ambiguity, the embeddings of the less frequent words were discarded from the list (e.g., karštas is more frequent than karstas; therefore, karstas is eliminated from the list of word embeddings, and karštas is replaced with the undiacritized karstas). As a result, the total number of word embeddings dropped from 687,947 to 632,435 undiacritized word embeddings. Although such distortion may degrade the accuracy, we are still curious to see its impact on the results.
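
The construction of the undiacritized embedding list can be sketched as below. This is a minimal illustration of the rule described above (on a collision, keep the embedding of the more frequent word); the function name, the input dictionaries, and the frequency values in the usage example are hypothetical and are not taken from the actual corpus.

```python
# ASCII equivalents of the Lithuanian diacritized letters.
DIACRITICS = str.maketrans("ąčęėįšųūž", "aceeisuuz")


def undiacritize_embeddings(embeddings, frequency):
    """Strip diacritics from the vocabulary and resolve collisions.

    embeddings: word -> vector
    frequency:  word -> corpus count (assumed to be available)
    When two words collapse to the same undiacritized form, only the
    vector of the more frequent original word is kept.
    """
    result = {}
    best_count = {}
    for word, vector in embeddings.items():
        key = word.translate(DIACRITICS)
        count = frequency.get(word, 0)
        if key not in result or count > best_count[key]:
            result[key] = vector
            best_count[key] = count
    return result


# Toy usage with made-up counts: karštas (hot) wins over karstas (coffin).
toy_vectors = {"karštas": [1.0], "karstas": [2.0], "mama": [3.0]}
toy_counts = {"karštas": 50, "karstas": 5, "mama": 100}
distorted = undiacritize_embeddings(toy_vectors, toy_counts)
```

In the toy example, the resulting list has two entries instead of three, and the key karstas now carries the vector that originally belonged to karštas.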

Experimental Set-Up and Results
The experiments were carried out on two datasets with different pre-processing techniques (described in Section 2.2) using either the traditional machine learning (presented in Section 2.3) or the deep learning (described in Section 2.4) approaches. All of the experiments were performed using stratified 10-fold cross-validation and were evaluated using the macro-accuracy and f-score (averaged over classes and folds) performance measures [56].
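
As an illustration of averaging over classes, the sketch below computes a macro-averaged f-score for one fold. It assumes the common definition (per-class harmonic mean of precision and recall, followed by an unweighted mean over classes); the exact definitions used in the experiments are those of [56], and the function name is hypothetical.

```python
def macro_f_score(gold, predicted, classes):
    """Unweighted mean of the per-class f-scores."""
    scores = []
    for c in classes:
        tp = sum(1 for g, p in zip(gold, predicted) if g == c and p == c)
        fp = sum(1 for g, p in zip(gold, predicted) if g != c and p == c)
        fn = sum(1 for g, p in zip(gold, predicted) if g == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)  # class never predicted correctly
    return sum(scores) / len(scores)
```

Because every class contributes equally, a classifier that ignores a minority class is penalized more heavily by this measure than by plain accuracy, which matters on the unbalanced full dataset.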
We have also evaluated the random (see Equation (1)) and majority (Equation (2)) baselines, because the results are considered reasonable only if the accuracy values exceed both baselines.

$$\mathrm{baseline}_{\mathrm{random}} = \sum_{i=1}^{m} P(c_i)^2 \quad (1)$$

$$\mathrm{baseline}_{\mathrm{majority}} = \max_{i} P(c_i) \quad (2)$$

where $P(c_i)$ is the probability of the class $c_i$. Thus, on the full dataset (in Table 1), $\mathrm{baseline}_{\mathrm{random}} = 0.454$ and $\mathrm{baseline}_{\mathrm{majority}} = 0.617$, and on the balanced dataset (in Table 2), $\mathrm{baseline}_{\mathrm{random}} = \mathrm{baseline}_{\mathrm{majority}} = 0.333$.
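
Both baselines follow directly from the class distribution, as the short sketch below shows. The function names are hypothetical, and the class counts in the usage lines are made up for illustration (they are not the counts of Table 1 or Table 2); for any balanced three-class dataset, both baselines equal 1/3.

```python
def random_baseline(class_counts):
    """Sum of squared class probabilities (Equation (1))."""
    total = sum(class_counts)
    return sum((count / total) ** 2 for count in class_counts)


def majority_baseline(class_counts):
    """Probability of the largest class (Equation (2))."""
    return max(class_counts) / sum(class_counts)


# Balanced toy dataset with three classes: both baselines are 1/3.
balanced_random = random_baseline([200, 200, 200])
balanced_majority = majority_baseline([200, 200, 200])

# Unbalanced toy dataset: the majority baseline is the higher of the two.
skewed_random = random_baseline([6, 3, 1])
skewed_majority = majority_baseline([6, 3, 1])
```

On an unbalanced dataset the majority baseline is always at least as high as the random one, so it is the stricter threshold for the full dataset.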
To determine whether the differences between the results are statistically significant, we have performed McNemar's test [57] with a significance level of 0.05 (i.e., a 95% confidence level). This means that the calculated p-value must be below 0.05 for the differences to be considered statistically significant.
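
A minimal sketch of McNemar's test for two classifiers evaluated on the same texts is given below. It assumes the chi-square approximation with a continuity correction (clamped at zero); whether the experiments used this variant or an exact binomial one is not stated, so the sketch should be read as one common formulation rather than the procedure actually applied. The function name is hypothetical.

```python
import math


def mcnemar_p_value(gold, pred_a, pred_b):
    """Two-sided p-value of McNemar's test with continuity correction."""
    # Discordant pairs: exactly one of the two classifiers is correct.
    only_a = sum(1 for g, a, b in zip(gold, pred_a, pred_b) if a == g and b != g)
    only_b = sum(1 for g, a, b in zip(gold, pred_a, pred_b) if a != g and b == g)
    if only_a + only_b == 0:
        return 1.0  # the classifiers never disagree in correctness
    statistic = max(abs(only_a - only_b) - 1, 0) ** 2 / (only_a + only_b)
    # Survival function of the chi-square distribution with 1 degree of
    # freedom: P(X > s) = erfc(sqrt(s / 2)).
    return math.erfc(math.sqrt(statistic / 2))
```

Only the texts on which the two classifiers disagree in correctness influence the statistic, which is why two methods with visibly different accuracies can still differ insignificantly on a small dataset.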
Figures 1-3 summarize the results obtained (1) on the full dataset (presented in Table 1), (2) on the full dataset with eliminated diacritics, and (3) on the full dataset with restored diacritics, respectively. Figures 4-6 present the same pre-processing techniques as Figures 1-3, but applied on the balanced dataset (see Table 2). In Figures 1 and 4, we investigate the accuracy (y axis) of different methods and pre-processing techniques (x axis) on the full and balanced datasets, respectively. It was important to evaluate how accurate the methods are on texts where diacritics are not specifically processed. In Figures 2 and 5, we investigate the impact of diacritics removal: it decreases the number of distinct words in the texts, which might positively affect the results, but it also increases the ambiguity (both in the texts and in the word embeddings), which might have a negative impact on the accuracy. In Figures 3 and 6, we investigate the impact of diacritics restoration. Diacritics restoration should decrease the number of distinct words and reduce the ambiguity; however, the tool we use is not refined, and therefore some unrecognized words remain untouched (and undiacritized) in the text.
In the figures for SVM and NBM, we present the highest accuracy values obtained among all feature types: token unigrams (lex1), token unigrams + bigrams (lex2), lemma unigrams (lem1), lemma unigrams + bigrams (lem2), and character tetra-grams (chr4). For LSTM and CNN, we present the highest accuracy values among these types of embeddings: continuous bag-of-words of 300 dimensions with negative sampling (Word2Vec) and FastText. Besides, on the texts with eliminated diacritics, we test a third type, i.e., the distorted Word2Vec (marked as Word2Vec_d and explained in Section 2.4). The best feature and embedding types determined are presented in Tables 3 and 4 for the full and balanced datasets, respectively. Some of the SVM and NBM values presented in Figures 4 and 5 were obtained during previous research (presented in [35]). The previously obtained values, together with the new values (obtained with lemma unigrams + bigrams during this research), were used for comparison purposes: the best feature types and their values (for the same method) are presented in Figures 4 and 5. Only two of the 24 values for LSTM and CNN presented in Figures 4-6 were reused from previous research (described in [58]). Both the old and the new values were compared to choose the best ones. In most cases, however, the FastText embeddings (used in this research) outperformed Word2Vec (used in the previous research). All other presented values were obtained only during this research.

Discussion
Some useful resources, e.g., SentiWordNet, are not available for the Lithuanian language, which prevents us from solving the sentiment analysis task with effective dictionary-based methods. Besides, lexical resources can usually be used only with normative texts that do not contain foreign language insertions, jargon, abbreviations, or words with missing diacritics. Without effective solutions for restoring diacritics in Lithuanian texts (grazus → gražus (beautiful)), these lexical resources cannot be useful. Moreover, even if the diacritics could be restored correctly, the currently available Lithuanian morphological tools are not able to recognize non-normative words (e.g., fainą (an Anglicism of the word fine, in accusative)) and transform them into lemmas (fainą → fainas). Since the dictionary-based methods cannot be effectively applied on the Lithuanian language, machine learning approaches seem to be the only possible solution. Of course, due to the previously mentioned specifics of the Lithuanian language (in Section 2.2), we cannot expect an accuracy above 80% for the Lithuanian language (as is reported in most sentiment analysis research works for the English language). For instance, sentiment classification on the morphologically complex Croatian language with the SVM method reaches an f-score of ~57% on the general topic and ~72% on the specific domain [59]. The authors in [60] solved the sentiment analysis task for Czech on Facebook posts with Maximum Entropy and SVM, and the best achieved f-score is 69%. The deep learning approaches are still not very popular for morphologically complex languages in the sentiment analysis task. However, the authors in [61] successfully applied a neural network classifier on the Russian language and achieved better results (i.e., 72%) than Logistic Regression, SVM, and Gradient Boosting.
The accuracy strongly depends on the choice of the method. Traditional supervised machine learning approaches seem to be the best solution when the training datasets are rather small and more sophisticated feature types (based not only on the lexical, but also on the character information) can be tested. Even knowing that deep learning might not be the best solution for our task (because large training datasets and a comprehensive list of word embeddings are not available), we still have to find out how far we can go in the current situation (with the available resources) in order to know how much effort the improvement will require in the future.
Our results presented in Figures 1-6 exceed the random and majority baselines, except for the LSTM method on the full dataset. This is surprising, because on the balanced dataset LSTM not only exceeds the baseline but is also superior to CNN. However, LSTM by its nature is well suited to long sequence inputs, which is not that important in our case. We do not need the method to learn what comes next in the sequence; we need it to learn how to classify the text as a whole based on the discriminative n-grams (words/short phrases) that may appear anywhere in the text. Since LSTM yielded the worst performance, we do not analyze it in more detail.
As we can see from Figures 1-5, the accuracy values on the balanced dataset versions are lower than on the corresponding full dataset versions. Although unbalanced datasets may be biased towards the major classes, this is not an issue in our case, because we are dealing with small data: balancing discards potentially important information, which is crucial for training a robust model.
Although the deep learning techniques underperform traditional machine learning on both datasets, the boost from the balanced to the full dataset version is the largest for the CNN method: 0.116, 0.124, and 0.127 on the original dataset, the dataset with eliminated diacritics, and the dataset with restored diacritics, respectively. For comparison, the boost for NBM on the same versions of the dataset is only 0.060, 0.055, and 0.117, respectively; for SVM, it is 0.101, 0.106, and 0.126, respectively. The results are consistent with the well-known observation [21] that deep learning methods require larger training sets to outperform, or to reach the same accuracy levels as, the traditional machine learning approaches. Unfortunately, in our research, even the full dataset is not enough for CNN to outperform SVM or NBM, and these findings coincide with the work presented in [28]. However, using larger datasets is not straightforward due to the lack of annotated language resources.
The best overall accuracy with NBM, SVM, and CNN is 0.735, 0.724, and 0.706, respectively. The best performance with NBM was achieved with replaced emoticons, restored diacritics, and lemma unigrams as the feature type. SVM demonstrated the highest accuracy with replaced emoticons, eliminated diacritics, and lemma unigrams. In the case of CNN, the best results were achieved with replaced emoticons, eliminated diacritics, and the FastText word embeddings. The difference between the best accuracies of NBM and SVM is not statistically significant (p = 0.128), whereas the difference between SVM and CNN is significant (p = 0.014), and the difference between NBM and CNN is highly significant (p < 0.001).
On the balanced dataset, the differences between many values are statistically insignificant (especially on the dataset with restored diacritics in Figure 6), which prevents us from drawing firm conclusions. On the full dataset, the differences among the traditional machine learning methods are also insignificant, but they are significant in the case of CNN (see the p-values obtained with the McNemar test on the full dataset in Table 5). Although only marginally, different pre-processing techniques also affect the accuracy. From the results we can make the following statements. Replacement of emoticons (positively affecting the accuracy) really helps to express the sentiments, whereas stop words seem to carry important information; therefore, either the removal of stop words should not be applied as a pre-processing technique or the list of stop words has to be carefully examined. Despite their small impact, both diacritics handling (elimination or restoration) and the pre-processing techniques positively affect the results (e.g., without pre-processing CNN achieves an accuracy of 0.691 on the full dataset, 0.702 on the dataset with eliminated diacritics, and 0.703 with restored diacritics). Although, in this example, the restoration of diacritics seems to be the better choice when compared to elimination, this is not always the case: e.g., the results for the SVM without pre-processing are lower on the full dataset with restored diacritics (0.721) than on the full dataset with eliminated diacritics (0.722). Although it is hard to select a winner between diacritics elimination and restoration, diacritics pre-processing in general should be taken into account, especially when dealing with non-normative Lithuanian texts. The accuracy of diacritics restoration also depends on the quality of the diacritics restoration tool. The tool is not 100% accurate, and sometimes even small errors can lead to completely different meanings and sentiments. In addition, the elimination of diacritics was also applied to the Word2Vec embeddings, unfortunately without much success. This failure was probably caused by the loss of some important information: sometimes even slight differences, e.g., gera (good in feminine, singular, nominative) and gerą (good in feminine, singular, accusative), may cause words to be mapped to completely different embeddings. We assume this was the main reason why the distorted Word2Vec_d embeddings were outperformed by FastText, which often includes embeddings for both forms (undiacritized and diacritized) of a word.
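The effect of diacritics elimination on word forms such as gera and gerą can be illustrated with a minimal sketch. It assumes that elimination amounts to dropping Unicode combining marks after canonical decomposition; the function name `strip_diacritics` is hypothetical and this is not necessarily the procedure used in the experiments.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Drop combining marks (ogonek, caron, macron, dot above, ...).

    Assumption: diacritics elimination is approximated here by Unicode
    NFD decomposition followed by removal of combining characters.
    """
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", kept)


# The nominative and accusative forms collapse into one surface form,
# so the case distinction is no longer visible to an embedding lookup.
forms = ["gera", "gerą"]
collapsed = {form: strip_diacritics(form) for form in forms}
```

After this step both forms share the key `gera`, which shows how distinct words can end up looking identical once diacritics are removed.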
The FastText embeddings are effective not only on the texts with eliminated diacritics. With the CNN method applied on top of them, they are the most effective word embedding type in general (see Tables 3 and 4) on both dataset versions and with all pre-processing techniques, and there are two explanations for that. First, the FastText embeddings cover more words (two million word embeddings) compared to Word2Vec (containing only ~688 thousand words). Second, the Word2Vec embeddings are trained on normative texts (which do not contain jargon, borrowings, abbreviations, or foreign language insertions that also express sentiments), whereas we assume that the FastText embeddings are trained on non-normative texts that, by their nature, are much closer to the texts that we are dealing with in this research.
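The first explanation (vocabulary size) can be made concrete by measuring how many tokens of a text receive an embedding at all. The sketch below is only an illustration: the function `coverage` and both toy vocabularies are hypothetical and do not reproduce the actual Word2Vec or FastText vocabularies.

```python
def coverage(tokens, vocabulary):
    """Share of tokens that have an entry in the embedding vocabulary."""
    if not tokens:
        return 0.0
    return sum(1 for token in tokens if token in vocabulary) / len(tokens)


# Toy data (assumption): a comment mixing normative and non-normative tokens.
tokens = ["gera", "gerą", "lol", "ok"]
small_vocabulary = {"gera", "gerą"}               # stands in for a smaller, normative vocabulary
large_vocabulary = small_vocabulary | {"lol", "ok"}  # stands in for a larger vocabulary

small_share = coverage(tokens, small_vocabulary)
large_share = coverage(tokens, large_vocabulary)
```

Tokens outside the vocabulary carry no learned information for the classifier, so a lower coverage share is one plausible way in which a smaller vocabulary could hurt accuracy.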
Whereas the choice of word embedding type (i.e., FastText) is clear for CNN, there is still no clear answer as to which feature type is the best for the traditional machine learning techniques. Surprisingly, the character n-grams (usually demonstrating high performance on non-normative texts) and the more sophisticated feature types based on unigrams + bigrams (except for NBM on the balanced original dataset and the dataset with eliminated diacritics) are not the best choice. This is probably because the sentiments that we are dealing with are expressed rather straightforwardly (i.e., without hidden meanings or sarcasm). The best feature types for the traditional machine learning approaches are token unigrams and lemma unigrams; moreover, lemma unigrams are determined to be the best even more often (see Tables 3 and 4). Credit also goes to the morphological analyzer-lemmatizer tool Lemuoklis: although it cannot process non-normative words (leaving them in their original form), it is very accurate when lemmatizing the words that it recognizes.
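The feature types compared above can be sketched as simple count-based extractors. This is a minimal illustration, not the feature extraction used in the experiments: the function names are hypothetical, and the lemma table `TOY_LEMMAS` is a toy stand-in for a real lemmatizer such as Lemuoklis, whose interface is not shown here.

```python
from collections import Counter


def token_ngrams(tokens, n_min=1, n_max=1):
    """Count token n-grams; (1, 1) gives unigrams, (1, 2) unigrams + bigrams."""
    features = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            features[" ".join(tokens[i:i + n])] += 1
    return features


def char_ngrams(text, n=3):
    """Count overlapping character n-grams of a fixed length n."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


# Toy lemma table (assumption); unknown words are left in their original
# form, mirroring the behaviour described for non-normative words.
TOY_LEMMAS = {"gera": "geras", "gerą": "geras"}


def lemma_unigrams(tokens, lemmas=TOY_LEMMAS):
    """Count lemmas instead of surface forms."""
    return Counter(lemmas.get(token, token) for token in tokens)


tokens = ["gera", "gerą", "lol"]
by_token = token_ngrams(tokens)   # three distinct surface forms
by_lemma = lemma_unigrams(tokens)  # the two inflected forms share one lemma
```

Lemma unigrams merge inflected forms into a single feature, which reduces sparsity; this may be one reason they are so often the best feature type for a highly inflected language.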

Conclusions
Our main contribution is the comparative analysis of traditional machine learning and deep learning methods solving the sentiment analysis task for the Lithuanian language. In this work, we compared Support Vector Machine and Naïve Bayes Multinomial as representatives of traditional machine learning with Long Short-Term Memory and Convolutional Neural Network as representatives of deep learning. The traditional machine learning methods were used with discrete representations (lexical, lemma, and character feature types), and the deep learning methods were applied on top of two types of neural word embeddings (continuous bag-of-words with negative sampling and FastText). Moreover, all of the experiments were carried out on the full and the balanced versions of the dataset, testing the following pre-processing techniques: replacement of emoticons with sentiment words, stop words removal, diacritics elimination, and diacritics restoration.
The LSTM method underperformed the baseline (equal to an accuracy of 0.617) on the full dataset with all of the pre-processing techniques. The other deep learning method, Convolutional Neural Network, reached its highest accuracy of 0.706 with FastText embeddings on the full dataset with replaced emoticons and eliminated diacritics, and remained in third position (our results are very similar to the results achieved with the neural network on the Russian language in [61]). The Support Vector Machine took second place with an accuracy of 0.724, which was achieved on the full dataset with replaced emoticons, eliminated diacritics, and lemma unigrams as the feature representation type. The best result was demonstrated by the Naïve Bayes Multinomial method with an accuracy of 0.735, which was achieved on the full dataset with replaced emoticons, restored diacritics, and lemma unigrams. Although Convolutional Neural Network underperformed Support Vector Machine and Naïve Bayes Multinomial, the gap in accuracy is rather small, i.e., less than 0.03 (0.706 versus 0.724 and 0.735).
In future research, we plan to reduce (or even eliminate) the gap between traditional and deep learning approaches. This could be done in two directions: (1) by modifying the parameters of the deep learning approaches; and (2) by collecting and preparing more training data. Moreover, we envision the development of hybrid methods, such as those advocated in [62], which combine neural networks with explicit knowledge.

Figure 1. Accuracy obtained with different pre-processing techniques applied on the full dataset. Black connecting lines in each group of columns connect results among which differences are not statistically significant.

Figure 2. Accuracies with different pre-processing on the full dataset with eliminated diacritics.

Figure 3. Accuracies with different pre-processing on the full dataset with restored diacritics.

Figure 4. Accuracies with different pre-processing on the balanced dataset.

Figure 5. Accuracies with different pre-processing on the balanced dataset with eliminated diacritics.

Figure 6. Accuracies with different pre-processing on the balanced dataset with restored diacritics.

Table 1. Statistical characteristics of the full dataset used in our sentiment analysis tasks.

Table 2. Statistical characteristics of the balanced dataset used in our sentiment analysis tasks. The upper and lower values in columns (4)-(6) represent the number of all tokens and the number of distinct tokens, respectively.

Table 3. Best features and embedding types with different classification methods on the full dataset.

Table 4. Best features and embedding types with different classification methods on the balanced dataset.

Table 5. p-values representing the level of significance between the accuracies of Support Vector Machine (SVM), Naïve Bayes Multinomial (NBM), and Convolutional Neural Network (CNN) presented in Figures 1-3. Underlined p-values denote statistically insignificant differences.
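For readers unfamiliar with the McNemar test behind the p-values in Table 5, the following is a minimal sketch of its exact (binomial) two-sided variant. Which variant was used in the experiments is not stated here, so the exact form, the function name `mcnemar_exact`, and the toy inputs are all assumptions made for illustration.

```python
from math import comb


def mcnemar_exact(correct_a, correct_b):
    """Exact two-sided McNemar p-value for two classifiers on the same test items.

    correct_a, correct_b: equal-length sequences of booleans, True where the
    respective classifier labelled the item correctly. Only discordant items
    (exactly one classifier correct) influence the result.
    """
    only_a = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    only_b = sum(1 for a, b in zip(correct_a, correct_b) if b and not a)
    discordant = only_a + only_b
    if discordant == 0:
        return 1.0  # the classifiers never disagree
    smaller = min(only_a, only_b)
    # Binomial tail probability under the null hypothesis p = 0.5.
    tail = sum(comb(discordant, k) for k in range(smaller + 1)) / 2 ** discordant
    return min(1.0, 2 * tail)


# Toy usage (assumption): classifier A is right on ten items where B is wrong.
p_value = mcnemar_exact([True] * 10, [False] * 10)
```

A small p-value indicates that the two classifiers err on different items more often than chance would explain; the significance threshold is a separate choice that this sketch leaves to the reader.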