1. Introduction
Social media platforms such as Twitter, Facebook or Instagram have become a common place for people to share their thoughts and for media outlets to spread information to the world. Monitoring social media has become an effective means of surveillance as the user base has grown rapidly: by the end of 2021, the number of social media users was over 4.2 billion, and this number is projected to increase to 5.4 billion by 2025 [1]. One means to this end is analyzing people’s opinions, sentiments, and attitudes from text sources, also known as sentiment analysis (SA) [2]. Falling under the broad umbrella of text classification, SA systematically identifies, extracts, quantifies and studies affective states and subjective information, producing useful knowledge from text that can be used in subsequent descriptive or causal analysis [3].
Due to the high popularity of social media, platforms such as Twitter have become a good source for investigating public opinion. By mining such texts, private entities can employ methods such as brand monitoring to capture important brand events in real time and to reflect a brand’s financial value to the firm [4]. Public entities, such as governments, can use sentiment analysis or emotion detection to observe how the population reacts to public issues such as politics or healthcare. For example, Praveen et al. [5] analyzed the attitude of Indian citizens towards COVID-19 vaccines by collecting public social media posts written in English. They concluded that, although positive sentiments were more prevalent than negative ones, the Indian government needs to focus especially on addressing the fear of vaccines, which is a major factor contributing to the negative attitude towards vaccination. Even more knowledge can be extracted by modeling discussions on social networks. Bonifazi et al. [6] proposed a general multilayer network approach and validated it by applying it to a Twitter dataset containing texts expressing opinions on COVID-19 vaccines (pro-vaccination, neutral and anti-vaccination). They discovered that anti-vaxxers tend to have denser and more cohesive ego networks than pro-vaxxers, which leads to a larger number of interactions among anti-vaxxers.
SA represents an established area of natural language processing (NLP), with much research effort focused on discovering the machine learning (ML) methods that produce the best models for a particular problem under study. Although textbooks such as [2] or reviews such as [3,7,8] present in detail the recommended steps to be adopted specifically for SA, or with respect to a given technology applied to NLP in general and SA in particular, considerable research space remains open in the area of SA when certain problem-specific conditions occur, such as those induced by microblogging platforms or by user input from mobile devices.
Microblogging platforms such as Twitter, Instagram and Facebook raise challenging questions, as they feature linguistic phenomena usually not seen in literary texts. Eisenstein [9] describes these phenomena as bad language, which includes emoticons, phrasal abbreviations such as lol, smh or ikr, expressive lengthening of words (e.g., cooolll), shortened words, or simply words written in a non-standard format, including those with typos, drawn from some irregular vocabulary or following an informal grammar. Various reasons outside the scope of our research cause the presence of bad language in microblogging, and it dramatically affects the performance of standard NLP models applied to such content [10]. Another issue regarding social media networks is related to privacy threats. The goal of any social network is to enable users to safely share information, but in most cases, the users are not familiar with privacy preservation issues. The work of Cerruto et al. [11] demonstrates that it is possible to obtain or reconstruct personal user data by single-social or cross-social network analysis. In our work, we aim to infer global sentiment polarity for social media texts, and we want to clarify that our analysis is not aimed at violating or exploiting user privacy.
Specifically targeting Twitter, performing SA on non-English tweets is seen as challenging, mostly because of the difficulty of gathering enough labeled data in the target language [12]. Annotated datasets and related resources can be found for widely spoken languages. For English, for example, we can mention BERTweet [10], a large-scale language model trained on a corpus of 850M tweets, which can be used together with fairseq [13] or transformers [14] for text categorization tasks, including SA. In France, the DEFT challenges run from 2014 to 2018 focused on opinion mining and SA from Twitter posts [15], and a labeled dataset was made available to the participating teams. In Spain, the TASS workshop (http://tass.sepln.org/, accessed on 22 August 2022), held since 2012 at the SEPLN congress, supplies a dataset of annotated tweets [16], including Spanish cross-lingual variations.
However, little can be found for less widely spoken languages, such as Romanian. Romanian belongs to the Romance language group and differs substantially from English in its alphabet, grammar, phonology, etc. In general, the same sentiment is expressed in a more verbose way in Romanian than in English. Ciobotaru and Dinu [17] performed emotion detection over a dataset of about 4000 tweets in Romanian. The texts were manually labeled by the authors, but the dataset was not made publicly available. Istrati and Ciobotaru [18] collected and manually labeled a dataset of Romanian tweets about brands and created an SA model for use in brand monitoring and analysis. Unfortunately, their manually labeled dataset is not publicly available either. To our knowledge, the recently proposed LaRoSeDa dataset [19] is the first and only public dataset dedicated to sentiment analysis in Romanian. The requirements imposed on our project [20] include a three-class sentiment prediction capability (negative, neutral and positive) and the ability to analyze social media-specific texts. These requirements make LaRoSeDa unsuitable for our work because its sentiment is labeled in a binary fashion (negative and positive) and its texts are product reviews collected from various online shopping platforms, not from social media platforms.
A team at technobium.com (accessed on 26 August 2022) created a commercial SA engine for the Romanian language, with a free demo posted at sentimetric.ro (accessed on 26 August 2022), capable of determining polarity for various texts, including microblogging [21]. However, few scientific details are available on the internal construction of the model or on its performance, in terms of both prediction capability and efficiency with regard to the use of computational resources. Other efforts worth mentioning in the area of SA for Romanian are [22,23,24,25,26].
To overcome the lack of linguistic resources in under-represented languages, the works in [27,28,29] suggest that automated translation can be used for model learning with respect to SA, maintaining performance similar to that obtained on the original dataset.
In this research, we restrict our scope to performing SA on Twitter data (as an example of a social network with a strict limit on message size) in an under-represented language (Romanian) that lacks labeled data for model training. Our final goal is to produce a model for inferring the global polarity of a tweet in a multinomial classification fashion (positive, negative or neutral) with acceptable performance, even without possessing a large Romanian training dataset that meets our needs. To this end, we avoid a time-consuming, large-scale data collection and annotation task; instead, we use an English Twitter dataset of reasonable size translated to Romanian using a public web translation engine.
Our research fits under the wider coverage of a media surveillance project [20], aiming at investigating specific habits of the Romanian public interacting with TV and social media. This imposes several restrictions on our model learning task, such as the need to re-train the models on a regular basis because the observed environments are very volatile, the need to process huge loads of messages in a short period of time during audience peaks, and the need to comply with strict data privacy and security standards.
The main contribution of this paper is to demonstrate that, even without a well-prepared training dataset, good SA results can be obtained by carefully implementing and adapting the standard NLP pipeline for SA [2] to the specificity of the input data. In particular, we show how each step of the NLP pipeline was applied and discuss the consequences of each decision made throughout our NLP experiment. Furthermore, we extend to Romanian the previous result of Balahur and Turchi [27], obtained using French, German and Spanish tweets, showing that performance on the translated dataset is comparable with that obtained on the original English source.
The paper is organized as follows: in Section 2, we present related work competing with or influencing our research. Section 3 introduces the data under study and the methodological steps followed to construct the SA model. Section 4 presents the obtained results and Section 5 concludes the paper.
2. Related Work
SA is a type of text classification employed with the specific objective of inferring affective states and subjective information from text. To perform SA, one can adopt either a lexicon-based strategy or the standard ML-based text classification pipeline [30]. As indicated by the literature [2,3,31], both classical machine learning algorithms and more recent deep learning approaches have been applied for inferring SA models on texts. Among classical ML algorithms, popular choices [30] are Bernoulli Naive Bayes (NB) [32], Support Vector Machines (SVM) [33], Random Forest (RF) [34] and Logistic Regression [35]. For deep learning, all important variants, such as the standard Deep Neural Network (DNN) [36], the Convolutional Neural Network (CNN) [37] and the Long Short-Term Memory (LSTM) [38], are reported to perform well on text classification tasks. Classical ML methods and standard DNNs are applied to document-level representations such as the well-known TF-IDF [39] or the more recent Doc2Vec [40]. More recent DL methods, including architectures composed of CNN and LSTM cells, are in general applied to word embeddings such as Word2Vec [41]. The classical TF-IDF leads to high-dimensionality prediction problems; thus, the literature [3] suggests considering dimensionality reduction schemes, such as principal component analysis (PCA) [42], non-negative matrix factorization (NMF) [43] or latent semantic analysis (LSA) [44], to make the computation more efficient. In our work, we experiment with these methods in search of a combination that fits our needs.
Google proposed BERT [45] as a state-of-the-art pre-trained model for many NLP tasks. Multilingual BERT, which is also pre-trained for the Romanian language, is reported to be surprisingly good for cross-lingual model transfer [46]. However, as practice indicates [47], BERT comes with significant time costs for model learning, even on very powerful servers, which conflicts with the need of our project to frequently retrain the models and to accommodate high loads of messages in short time frames. For comparison, we train a BERT-based classifier using its multilingual version for the Romanian language, to see how we compare with the state of the art in both classification performance and training time.
Performing SA on microblogging content is seen as a difficult task [12] because one has to deal with bad language [9]. However, for widely spoken languages such as English, Spanish or French, plenty of linguistic resources exist to enhance SA for microblogging content. For English, we mention BERTweet [10], a large-scale language model trained on a corpus of 850M tweets, which can be used together with fairseq [13] or transformers [14] for inferring polarity. BERTweet scores an accuracy of
on the SemEval2017-Task4A [48] test set, outperforming its competitors RoBERTa and XLM-R. Barbieri et al. [49] report BERTweet as the state of the art on the TweetEval benchmark, with a
average recall. In France, labeled Twitter data were made available to the participants in the DEFT challenges [15], while in Spain, the TASS workshops gathered the efforts to classify annotated Spanish tweets [16]. Pota et al. [50] applied BERT-based models for SA on both English and Italian Twitter datasets and drew conclusions about the importance of individualized preprocessing for exploiting hidden information, a suggestion which we specifically followed in our work.
For Persian, which is under-resourced like Romanian, the authors of [51] note the lack of any public dataset for sentiment analysis. They created a public dataset by collecting 11,500 texts and manually labeling each one as positive, negative or neutral. Around
of the texts were collected from an electronic product website, while the rest were collected from Twitter. They applied various classification algorithms, and a proposed CNN-LSTM network achieved the highest accuracy of ≈85%. In [52], this pretrained hybrid model was used to infer the sentiment polarity of over 800,000 Persian tweets collected over a span of 6 months. The tweets refer to an Iranian COVID-19 vaccine called COVIran Barekat and to foreign vaccines (AstraZeneca, Pfizer, Moderna and Sinopharm). The authors compared the sentiments expressed towards the Iranian vaccine with those expressed towards the foreign vaccines and found a slight preference for the Iranian one. By obtaining a monthly distribution of opinions between April and September of 2021, the authors discovered an increase in negative sentiments towards all vaccines between late August and September. A possible explanation could be related to the reported side effects of some of the vaccines. These results seem very promising but, as stated in Section 1, we do not have the resources needed to manually collect and label a large volume of Romanian texts.
The state of the art in SA for microblogging content is less advanced in under-resourced languages such as Romanian. We mention here the efforts of Ciobotaru and Dinu [17] for emotion detection and of Istrati and Ciobotaru [18] for binary sentiment analysis, who report promising results, as well as the company technobium.com (accessed on 26 August 2022), which offers a free demo at sentimetric.ro (accessed on 26 August 2022). Yet we could hardly rely on those tools and incorporate them in a larger project, as, to our knowledge, a Romanian microblogging resource similar to BERTweet is not available.
As stated in Section 1, LaRoSeDa (Large Romanian Sentiment Dataset) seems to be the only publicly available Romanian dataset labeled for sentiment analysis. The dataset contains 15,000 reviews, of which 7500 are labeled positive and 7500 negative. Due to its nature, all works using this resource report the performance achieved for sentiment classification in a binary fashion. In [53], an F1 score of 54% is achieved, while the work that introduces the LaRoSeDa dataset reports an accuracy of ≈91% as the benchmark [19]. More recently, we acknowledge the Romanian DistilBERT corpus (https://github.com/racai-ai/Romanian-DistilBERT, accessed on 30 August 2022) [54], which could be employed for binary SA over standard text data. Its authors report a state-of-the-art binary classification accuracy of 98% for SA performed on LaRoSeDa. All the previous experiments were performed on standard text in a binary classification fashion, and the problem setting is different from our global polarity inference. Regarding multinomial SA of social media texts in Romanian, we could not find any published work that sets a benchmark against which we can compare.
Searching for other Romanian research relevant to SA, we mention Lupea and Briciu [22], who developed a Romanian Emotions Lexicon that attaches tags to words. Tufis and Barbu Mititelu [26] developed RoWordNet (https://github.com/dumitrescustefan/RoWordNet, accessed on 30 August 2022), a semantic network of words based on the idea introduced by the English WordNet lexicon, with a polarity score attached to each word. Both lexicons could be of use in sentiment analysis if a dictionary-based method (as defined in [3], p. 541) is selected for SA. Other Romanian authors performed sentiment analysis on kinds of input data other than tweets, such as speech [23,24] or poetry [25], thus benefiting from resources developed for other purposes, such as SRoL [55], which was developed to support Romanian speech processing research.
Constructing language resources for under-represented languages by automatic translation seems to work well for various NLP tasks, as acknowledged in [27,28,29,56]. Balahur and Turchi [27,29] show that automatic translation of datasets performed with Google Translate, Bing Translator and Moses works well for French, Spanish and German in relation to the SVM SMO classifier. Balahur and Perea-Ortega [28] performed extensive experiments with English and Spanish Twitter datasets supplied for the SemEval 2013 and TASS 2013 workshops and showed that training data obtained from machine-translated text can work well for learning polarity classification systems. Banea et al. [56] respond positively to the question of whether we can “reliably predict sentence-level subjectivity in languages other than English by leveraging on a manually annotated English dataset” by learning Naive Bayes classifiers on six languages, including Romanian, starting from an original English dataset of news articles translated with the automatic engines available at that time. This further motivates our efforts to use machine translation for obtaining the learning dataset for Romanian.
3. Data Processing Methodology
Following the specific advice provided in [2] with respect to building an SA classifier, in this section we present our approach to carrying out this task. Figure 1 summarizes all the steps performed. At the top of the diagram, the automatic translation of the dataset from English to Romanian is highlighted. The preprocessing step consists of various procedures, which are grouped into two abstract pipelines. The one on the left generates texts fit for the TF-IDF variants, Doc2Vec and Word2Vec techniques, while the one on the right generates texts as expected by the pretrained BERT encoder. The feature extraction step highlights the transformation of the preprocessed text into numeric features which can later be used to train the ML models. The TF-IDF variants and Doc2Vec are grouped together to highlight that the output of all these methods has the same form. More specifically, each of them transforms a preprocessed text into a vector of length N, while Word2Vec transforms a preprocessed text into a matrix in which the number of rows equals the number of words within the text and the number of columns equals the word embedding size, i.e., the number of values used to represent a single word/token. BERT contextual encoding represents texts using multiple vectors.
In model training and tuning, for all the selected ML approaches except the BERT classifier, the evolutionary hyperparameter optimization methodology is used to identify the optimal set of parameters. Due to the long training times of BERT, we opted for a classic fine-tuning process using the recommended parameters. Bernoulli NB, Linear SVM, Logistic Regression, Random Forest and DNN are grouped together to highlight the many-to-many relation of this group with the TF-IDF variants and Doc2Vec features: any feature type from the latter group can be used by any of these models. LSTM and CNN are grouped together to highlight that both use the Word2Vec features, while the BERT classifier uses the specific BERT encodings. The prediction of any trained model denotes the sentiment polarity of the input text.
In the following subsections, we introduce the dataset and provide details on text preprocessing, feature extraction, dimensionality reduction, and classifier selection. Many of the text transformations indicated below were implemented with the help of spaCy [57] (https://spacy.io/, accessed on 1 September 2022). Model construction and performance evaluation are presented in Section 4.
3.1. Dataset
For our research, we selected the Twitter US Airline Sentiment Tweets (https://www.kaggle.com/crowdflower/twitter-airline-sentiment, accessed on 1 September 2022) dataset. The data were collected in 2015 and each tweet was manually labeled by external contributors with its global polarity (positive, negative or neutral). It contains around 15,000 tweets, 63% being negative, 21% neutral and 16% positive. Each tweet is accompanied by the contributor’s confidence in the annotated sentiment, and each negative tweet is accompanied by a reason for the assessment. There are several reasons why we consider this dataset suitable for our research purpose: (i) it contains microblogging-specific (bad) language, (ii) the sentiment class of each tweet was manually annotated and we can easily verify the correctness of and the reason for the annotation, (iii) the number of tweets is large enough to provide reasonable training data, and (iv) the tweets are relatively recent.
As a first processing step, we used the Google Translate (https://translate.google.com/, accessed on 2 September 2022) service to translate all the tweets of the dataset into Romanian. We eliminated all duplicated rows and sorted the dataset by Tweet Id. Thus, we obtained two datasets: the original Twitter US Airline Sentiment Tweets (in English) and its Romanian translation.
Automated translation affects the structural, grammatical and syntactical integrity of a text. The main metric used in the literature to measure the quality of an automated translator is the BLEU (Bilingual Evaluation Understudy) score. This score is computed by comparing a translation with one or more acceptable translations and checking for the presence/absence of particular words, the word ordering and the degree of distortion. On a scale from 0 to 100, a higher score represents a better translation (100 denoting a perfect translation). In [58], general English texts are translated to 50 different languages using Google Translate and the BLEU score is computed. The mean score over all the compared translations is ≈76. English to Romanian achieved a score of 84, which is considerably above average. The maximum BLEU score of 91 was achieved by English to Portuguese, while the minimum of 55 was achieved by English to Hindi. Similar results are also reported in [59], where English to Romanian obtained better than average results. In both works, better translations are obtained for languages which are in the same family as English or in a similar one. Translations from English to distant languages, such as Hindi or Hebrew, are the most negatively affected.
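To make the 0 to 100 scale concrete, the following minimal sketch computes a simplified, single-reference BLEU-style score from clipped n-gram precisions and a brevity penalty. It is an illustration of the textbook formulation only, not the evaluation code used in [58] or [59]; the function names `ngrams` and `simple_bleu`, the restriction to unigrams and bigrams, and the Romanian example sentence are assumptions made here.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all contiguous n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=2):
    """Simplified single-reference BLEU on a 0-100 scale (no smoothing)."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        # Clipped overlap: a candidate n-gram counts at most as often
        # as it occurs in the reference.
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        total = sum(cand.values())
        if overlap == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: candidates shorter than the reference are penalized.
    if len(candidate) >= len(reference):
        penalty = 1.0
    else:
        penalty = math.exp(1 - len(reference) / len(candidate))
    return 100 * penalty * math.exp(sum(log_precisions) / max_n)


reference = "zborul a fost anulat".split()    # hypothetical reference translation
candidate = "zborul fost anulat".split()      # hypothetical machine translation
score = simple_bleu(candidate, reference)
```

An identical candidate and reference yield 100, a candidate sharing no words with the reference yields 0, and a partially matching candidate such as the one above falls in between.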
For the purpose of running ML tasks, the dataset was split into training and test sets. The training set consists of approx. 11,000 instances, while the test set consists of the remaining approx. 3700 instances, thus ensuring a 75–25% split between training and test data. The split was made just after text preprocessing, such that the class distributions of the training and test sets were similar. Moreover, the English and Romanian training and test sets are identical in the sense that they contain the same instances.
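One way to obtain such a split is stratified sampling over instance indices, so that the same index lists can be reused for the English and the Romanian versions. The sketch below is an assumption about how this could be done, not the procedure actually used in the experiments; the function name `stratified_split`, the fixed seed and the toy label list (mirroring the 63/21/16 class shares) are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.25, seed=42):
    """Return (train_idx, test_idx) keeping class shares roughly equal."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels with the class shares reported for the dataset.
labels = ["negative"] * 63 + ["neutral"] * 21 + ["positive"] * 16
train_idx, test_idx = stratified_split(labels)
# The same indices select matching rows in both language versions, e.g.
# english_train = [english_tweets[i] for i in train_idx]
```

Because the split is expressed as indices rather than as copies of the texts, applying it to both datasets guarantees that the English and Romanian training and test sets contain the same instances.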
3.2. Text Preprocessing
Dealing with bad language is mandatory in an SA task over microblogging content [10]; thus, in this subsection we describe the specific text preprocessing efforts made in this respect. As indicated by Pota et al. [50], individualized preprocessing of the tweets is required in order to better exploit the hidden information of the input data. We developed a specialized preprocessing module, containing the following steps, applied in this specific order:
1. Extra white space removal (language-independent);
2. Custom word lemmatization and tokenization (language-dependent);
3. URL identification and removal (language-independent);
4. Emoji identification and replacement (language-independent);
5. Social media mention identification and removal (language-independent);
6. Extra consecutive character removal (language-independent);
7. Abbreviation replacement (language-dependent);
8. Stop-word removal (language-dependent);
9. Lower case capitalization (language-independent);
10. Punctuation mark removal (language-independent).
Language-independent steps can be applied in the same manner to both English and Romanian. In contrast, language-dependent steps require specific knowledge of Romanian or English.
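Several of the language-independent steps can be approximated with regular expressions. The sketch below covers URL removal, mention removal, extra consecutive character removal, lower-casing, punctuation removal and white space normalization. It is a simplified illustration, not the preprocessing module described here: the patterns, the function name `clean_tweet`, the choice to squeeze repeated characters down to two, and running white space normalization last (because the removals leave gaps) are all assumptions.

```python
import re
import string

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
REPEAT_RE = re.compile(r"(.)\1{2,}")      # three or more identical characters
SPACE_RE = re.compile(r"\s+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_tweet(text):
    """Apply a subset of the language-independent cleaning steps."""
    text = URL_RE.sub(" ", text)           # URL removal
    text = MENTION_RE.sub(" ", text)       # social media mention removal
    text = REPEAT_RE.sub(r"\1\1", text)    # "cooolll" -> "cooll" (assumed limit of 2)
    text = text.lower()                    # lower-casing
    text = text.translate(PUNCT_TABLE)     # punctuation mark removal
    return SPACE_RE.sub(" ", text).strip() # extra white space removal


example = clean_tweet("@united   the flight was cooolll!!! http://t.co/abc")
```

The language-dependent steps (lemmatization, abbreviation replacement, stop-word removal) need per-language resources and are deliberately left out of this sketch.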
For building the BERT-based classifier, we performed only steps 3 and 5, followed by an additional sentence-level tokenization. Next, we called the BertTokenizer to obtain the specific BERT encodings.
All steps except 2 and 4 are commonly used for text preprocessing and were applied in our work in the recommended fashion. In step 2, we instructed spaCy [57] not to lemmatize social media-specific tagged words, not to split tagged words during the tokenization phase, and not to remove negation elements during stop-word removal.
Step 4 is critical in our microblogging context, as emojis are a form of Twitter bad language. Kralj Novak et al. [60] report that about 4% of tweets contain emojis and that their sentiment polarity does not depend on the language. In this respect, they constructed the Emoji Sentiment Ranking lexicon containing the 751 most frequently used emojis, each annotated with its sentiment polarity (negative, neutral or positive). Here, we verify whether a token is found in the Full Emoji List (https://unicode.org/emoji/charts/full-emoji-list.html, accessed on 3 September 2022) and whether it has an associated sentiment in the Emoji Sentiment Ranking lexicon mentioned above. If both conditions hold, the token is replaced with its polarity, wrapped in a special prefix and suffix.
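The replacement in step 4 amounts to a dictionary lookup per token. The sketch below illustrates the idea with a three-entry stand-in lexicon. The entries and their polarities, the `emoji_` prefix and `_` suffix, and the function name `replace_emojis` are hypothetical, since the actual markers are not specified here, and the real Emoji Sentiment Ranking lexicon covers 751 emojis.

```python
# Hypothetical three-entry stand-in for the Emoji Sentiment Ranking lexicon.
EMOJI_POLARITY = {
    "\U0001F600": "positive",   # grinning face
    "\U0001F621": "negative",   # pouting face
    "\U0001F610": "neutral",    # neutral face
}


def replace_emojis(tokens, lexicon=EMOJI_POLARITY, prefix="emoji_", suffix="_"):
    """Replace each known emoji token with a marked polarity token."""
    return [
        f"{prefix}{lexicon[token]}{suffix}" if token in lexicon else token
        for token in tokens
    ]


tokens = ["great", "flight", "\U0001F600"]
replaced = replace_emojis(tokens)
```

Wrapping the polarity in a prefix and suffix keeps the generated token distinct from the ordinary words "positive" or "negative" that may occur in the tweet itself.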
3.3. Feature Extraction
Key to any NLP task is the internal representation of documents, i.e., properly selecting the features from the raw text and encoding them as numerical values, so as to keep the representation tractable or to enrich it with some language semantics [3]. In our work, we tested the most popular approaches suggested by the NLP literature [2,3]: TF-IDF, Word2Vec and Doc2Vec. We learned the embeddings on the English dataset and on its Romanian translation, and we restricted the vocabulary to the tokens which appear at least three times, removing a large number of infrequent tokens and tokens which may have been erroneously built in the preprocessing step.
We notice that the resulting English vocabulary contains around 3100 tokens, while the Romanian one contains around 4000, due to the fact that Romanian is more verbose than English.
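The vocabulary restriction and a basic TF-IDF weighting can be sketched in a few lines. This is an illustration under stated assumptions rather than the feature extraction code used in the experiments: "at least three times" is interpreted as total occurrences across the corpus, the weighting is the plain textbook form tf × log(N/df), and the function names and the toy Romanian tokens are hypothetical.

```python
import math
from collections import Counter


def build_vocabulary(docs, min_count=3):
    """Keep tokens occurring at least min_count times in the whole corpus."""
    counts = Counter(token for doc in docs for token in doc)
    return sorted(token for token, count in counts.items() if count >= min_count)


def tfidf_vectors(docs, vocab):
    """Encode each tokenized document as a TF-IDF vector over vocab."""
    index = {token: i for i, token in enumerate(vocab)}
    n_docs = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in set(doc) if tok in index)
    vectors = []
    for doc in docs:
        vector = [0.0] * len(vocab)
        for token, tf in Counter(doc).items():
            if token in index:
                vector[index[token]] = tf * math.log(n_docs / doc_freq[token])
        vectors.append(vector)
    return vectors


docs = [
    ["zbor", "anulat", "zbor"],
    ["zbor", "bun"],
    ["bagaj", "pierdut", "zbor"],
    ["bagaj", "bagaj", "bun"],
]
vocab = build_vocabulary(docs, min_count=3)
vectors = tfidf_vectors(docs, vocab)
```

In this toy corpus only two tokens reach the threshold, so every document becomes a vector of length 2; on the real data the same procedure yields vectors whose length equals the vocabulary size.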
In order to learn the Word2Vec and Doc2Vec embeddings for our data, we used the Gensim library [61]. For Word2Vec, we worked with the Continuous Bag-Of-Words (CBOW) architecture. For learning the Doc2Vec embedding, we used the Distributed Bag-Of-Words (DBOW) architecture with hierarchical softmax. In both cases, we set the embedding vector size to 200 and we trained each model with the following parameters: learning rate
,
, over 5
epochs.
For all three embeddings, the trained models were then applied on the testing sets.
3.4. Dimensionality Reduction
The abovementioned document representations, and especially TF-IDF, lead to high-dimensionality prediction problems [3], causing learning algorithms to run slowly or to require huge memory resources [31].
Therefore, we applied the PCA [42], NMF [43] and LSA [44] dimensionality reduction algorithms to the datasets represented with TF-IDF features, reducing the number of features to 500. This means that the reduced representation is around 6.2 times smaller for English and about 8 times smaller for Romanian.
We did not apply dimensionality reduction to the Word2Vec and Doc2Vec data because the desired vector size of each representation is set before feature extraction (200 in our case).
3.5. Classifier Selection
Following suggestions in the literature [2,3,31], we selected the following methods for building our classifiers:
Bernoulli Naive Bayes (Bernoulli NB), Support Vector Machine (SVM), Random Forest (RF) and Logistic Regression (LR) from the classical ML;
Deep Neural Network (DNN), Long Short-Term Memory (LSTM) and the Convolutional Neural Network (CNN) from the area of deep learning;
Multilingual BERT—to get a glimpse of the state-of-the-art results.
As indicated in Figure 1, the classical ML methods and the DNN were applied to the TF-IDF encoding, with and without dimensionality reduction, and to Doc2Vec. On Word2Vec, we constructed classifiers with the help of LSTM and CNN.
We implemented the classical ML algorithms with the help of the Scikit-Learn library [62], while for deep learning we used Keras [63].
For BERT, we used the model available in the Hugging Face transformers library (https://huggingface.co/docs/transformers/model_doc/bert, accessed on 5 September 2022), in its base multilingual uncased variant. On top of BERT, we added a hidden dense layer with 75 nodes and ReLU activation function, followed by the standard classification layer with 3 nodes, which produces the sentiment class. Adam was the selected optimizer, with a learning rate of
and
. The loss function was set to Categorical CrossEntropy.
3.6. Hyperparameter Optimization
When applying each of the abovementioned learning algorithms, we need to tune its parameters so as to minimize the generalization error [64,65]. Various approaches can be considered, such as exhaustive grid search, random search [65], Bayesian optimization [66] or evolutionary optimization (EO) [67].
Given the large number of parameters to optimize, the vast literature on the metaheuristic design of DNNs [36], recent applications of DL where parameters were selected with the help of genetic algorithms [68,69,70], and suggestions that EO can outperform Bayesian optimization [71], we decided to employ a classical genetic algorithm (GA) for hyperparameter search.
We used the Sklearn-genetic-opt library [72] to implement genetic algorithm-based hyperparameter optimization for our selected algorithms. Sklearn-genetic-opt makes use of the Deap framework (https://github.com/deap/deap, accessed on 5 September 2022) [73], which supplies many evolutionary algorithms for solving optimization problems.
The GA was designed as follows. Given a number N of parameters to optimize for some specific learning method, a chromosome is a vector containing one value selected for each parameter. A population consists of 20 individuals and is evolved over 40 generations with a crossover probability of 0.8 and a mutation probability of 0.1. Individuals are selected for the next generation with a standard elitist tournament of size 3. Internally, each individual is evaluated using accuracy as the fitness function, computed with 3-fold cross-validation.
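The loop described above can be sketched in plain Python. This is not the Sklearn-genetic-opt or Deap implementation used in the experiments; it only mirrors the stated settings (20 individuals, 40 generations, crossover probability 0.8, mutation probability 0.1, tournament size 3, elitism). The search space, the uniform crossover, the single-gene mutation and the `toy_fitness` function, which stands in for the mean 3-fold cross-validation accuracy, are all assumptions made for illustration.

```python
import random

# Hypothetical search space; the actual parameters are listed in Appendix A.
SEARCH_SPACE = {
    "C": [0.01, 0.1, 1.0, 10.0, 100.0],
    "penalty": ["l1", "l2"],
    "max_iter": [100, 200, 500],
}
KEYS = list(SEARCH_SPACE)


def toy_fitness(chromosome):
    """Stand-in for the mean 3-fold cross-validation accuracy."""
    score = 0.70
    score += 0.10 if chromosome["C"] == 1.0 else 0.0
    score += 0.05 if chromosome["penalty"] == "l2" else 0.0
    score += 0.05 if chromosome["max_iter"] == 200 else 0.0
    return score


def random_chromosome(rng):
    return {key: rng.choice(SEARCH_SPACE[key]) for key in KEYS}


def tournament(population, fitness, rng, size=3):
    """Pick the fittest of `size` randomly drawn individuals."""
    return max(rng.sample(population, size), key=fitness)


def crossover(parent_a, parent_b, rng):
    """Uniform crossover: each gene comes from either parent."""
    return {k: (parent_a[k] if rng.random() < 0.5 else parent_b[k]) for k in KEYS}


def mutate(chromosome, rng):
    """Resample one randomly chosen gene."""
    mutated = dict(chromosome)
    key = rng.choice(KEYS)
    mutated[key] = rng.choice(SEARCH_SPACE[key])
    return mutated


def evolve(fitness, pop_size=20, generations=40, p_cx=0.8, p_mut=0.1, seed=0):
    rng = random.Random(seed)
    population = [random_chromosome(rng) for _ in range(pop_size)]
    for _ in range(generations):
        next_gen = [max(population, key=fitness)]      # elitism: keep the best
        while len(next_gen) < pop_size:
            parent_a = tournament(population, fitness, rng)
            parent_b = tournament(population, fitness, rng)
            if rng.random() < p_cx:
                child = crossover(parent_a, parent_b, rng)
            else:
                child = dict(parent_a)
            if rng.random() < p_mut:
                child = mutate(child, rng)
            next_gen.append(child)
        population = next_gen
    return max(population, key=fitness)


best = evolve(toy_fitness)
```

In a real run, `toy_fitness` would be replaced by a function that trains the classifier with the chromosome's parameter values and returns its cross-validated accuracy, which is where almost all of the computation time is spent.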
In the case of the DNN, the parameters considered included the network capacity (the number of hidden layers and the number of units per layer), the activation function, the regularization function and the drop-out rate.
Since both CNN and LSTM need the embedding weight parameter, which is a 2D tensor, we modified the source code of Sklearn-genetic-opt in order to transmit multidimensional parameters directly to Deap.
Appendix A presents in full the parameters considered for evolutionary optimization, for all classifiers.
In general, convergence is seen after 15–20 generations, so evolving the populations over 40 generations is more than sufficient to obtain a good parameter selection.
For the BERT-based classifier, as learning just one model is very time consuming, we did not perform the evolutionary optimization procedure. Instead of cross-validation, we held out part of the training set for validation and let the learning optimize the loss function for several epochs. We noticed that the model rapidly overfits; thus, we stopped the learning after two epochs.
4. Experiments and Results
In this section, we present our experiments and discuss the results. We first construct models on both the original and the translated Twitter US Airline Sentiment Tweets dataset; next, we investigate how the best Romanian models perform on small, manually labeled, real-life Romanian datasets.
4.1. Constructing the Models
All the experiments were conducted on a powerful machine with the following specifications: 2 × Intel Xeon Gold 6230 CPUs (20 cores at 2.1 GHz), 128 GB DDR4 internal RAM and 8 × NVIDIA Tesla V100 32 GB GPUs. The source code was implemented in Python 3.9.
Table 1 presents the learning performance of the classifiers mentioned in
Section 3.5 on the test set, with or without dimensionality reduction. We assess the classification performance using accuracy and the weighted F1-measure. We also report the performance obtained on the original English dataset, applying the same data processing pipeline (without any dimensionality reduction), in order to see how much is lost through the automatic translation to Romanian.
We observe that the classification performance of all considered models trained on the original English dataset is very close to the one obtained on the Romanian translation. The differences in all classification schemes are
which can be considered negligible. We expected this result, as it is in line with similar experiments done with automatic translation for other languages [
27,
28,
29]. Furthermore, they confirm the validity of the processing and learning pipeline, as applied on the Romanian translated Twitter data.
We note that dimensionality reduction does not increase classification accuracy. Furthermore, similar accuracies are obtained with either TF-IDF or Word2Vec feature extraction, whereas Doc2Vec does not help in any scenario.
In terms of accuracy, Bernoulli NB, SVM, LR and DNN applied on TF-IDF encoding and CNN and LSTM applied on Word2Vec all perform almost identically. However, Bernoulli NB scores slightly better on the weighted F1-measure; therefore, we select Bernoulli NB as the best classifier for TF-IDF encoding. For Word2Vec encoding, LSTM slightly outperforms CNN in both accuracy and weighted F1-measure.
We took advantage of a very powerful machine to run all the experiments. Even so, the time spent on hyperparameter optimization and model learning is not negligible.
Table 2 lists the time spent on evolutionary optimization and on learning the final model with the optimal parameter set, for each tested classifier. EO helped us achieve a 1–3% improvement in the weighted F1-measure.
In
Table 2, we can also observe that in the majority of cases, the hyperparameter optimization process and the final model training times are higher for the Romanian classifiers. As stated in
Section 3.3, Romanian is more verbose than English, thus, the number of tokens learned for Romanian is larger. This fact might have contributed to the generation of more complex models when compared to English.
We note that, in general, learning a classical ML model takes less than a second, which is clearly less than the learning time required for a deep learning model, but searching for the best model parameters is very time consuming. In the case of the models trained in Romanian, optimizing the Bernoulli NB on the dataset without dimensionality reduction took about 27 min. Searching for the best structure and capacity of the DNN took about 3 h and 45 min. Searching for the best parameters for LSTM took more than 17 h, even though the vector embedding size is only 200. Therefore, learning an LSTM would be prohibitive in the absence of a well-equipped computing machine.
As expected, BERT provides state-of-the-art results for both the English and the Romanian datasets. The gap between our best result and BERT is larger for English than for Romanian. However, this comes with a high computational cost, as learning just one BERT-based classifier takes about 7 to 8 min, which makes hyperparameter optimization infeasible. If we applied the specification of the genetic algorithm presented in
Section 3.6, this would result, in the worst case, in 16,800 models learned, requiring about 11 days to run the experiment. However, as we noticed, we do not need to perform this optimization, as a single BERT-based model learned with the recommended parameters already supplies state-of-the-art results. The problem with a BERT-based classifier lies not in learning one model, but in the time needed to classify unknown instances. Whereas the other classifiers took negligible time (less than 1 s) to process the test set, the BERT-based model took about 44 s. This would prohibit us from employing the BERT-based classifier in media surveillance situations with an extremely high throughput of messages (e.g., during a prime-time TV show).
Given the small difference between the best achieved performance (the Bernoulli NB and LSTM classifiers) and BERT, and taking into account the reasonable learning and testing time of those models, we conclude that we can strongly consider the classical Bernoulli NB as our choice for the production environment required by our project.
4.2. Assessing the Models Performance on Real Cases
Given that the final purpose of our project is to apply the learned models to infer the polarity of any Romanian tweet, we manually labeled two small test sets, each containing 120 distinct tweets. The first includes tweets specific to the airline industry, comparable with the ones used for training our models, and the second includes general tweets. We applied to them the best models reported in the previous subsection (i.e., Bernoulli NB for TF-IDF encoding and LSTM for Word2Vec), the public demo of
sentimetric.ro (accessed on 8 September 2022) [
21] and the BERT-based classifier.
Each tweet was manually labeled by five human volunteers. Each one expressed an opinion about the polarity of the tweet, and the final sentiment was taken to be the one selected by the majority. Labeling statistics showing how the volunteers assessed the polarity are presented in
Table 3. We note that the labeling task appeared to be difficult for the volunteers: all five reached a unanimous decision for only 43 tweets (35.8%) in the airline industry-specific dataset and 47 tweets (39.2%) in the general tweets dataset. Furthermore, the polarity distribution of the tweets differs significantly from that of the Twitter US Airline Sentiment Tweets dataset (presented in the last row of
Table 3).
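The aggregation of the five volunteer opinions can be illustrated with the short sketch below. The function name and the returned fields are hypothetical. With five votes over three classes, the only split without a clear winner is 2-2-1; returning `None` in that case is an assumption of the sketch, not a description of how such tweets were handled in our labeling.

```python
from collections import Counter


def aggregate(votes):
    # Count the votes and compare the two most frequent labels.
    counts = Counter(votes).most_common()
    top_label, top_count = counts[0]
    tied = len(counts) > 1 and counts[1][1] == top_count
    return {
        "label": None if tied else top_label,   # None marks a 2-2-1 tie
        "unanimous": top_count == len(votes),   # all volunteers agree
    }


print(aggregate(["negative"] * 5))
print(aggregate(["negative", "negative", "negative", "neutral", "positive"]))
print(aggregate(["negative", "negative", "neutral", "neutral", "positive"]))
```

Counting the tweets for which `unanimous` is true gives the agreement figures quoted above, e.g., 43 out of 120 tweets corresponds to 35.8%.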
Polarity estimation results on the Romanian dataset with airline industry-specific tweets are presented in
Table 4.
Both encodings supply better results with our models (Bernoulli NB and LSTM) than
sentimetric.ro. This is expected as we learned our models on tweets specific for the aviation domain. BERT outperforms Bernoulli NB only by a slight margin.
Table 5 presents the models' results on the Romanian general tweets dataset. We notice that Bernoulli NB scores better than
sentimetric.ro in terms of the weighted F1-measure. LSTM scores worse.
sentimetric.ro assesses polarity better on the general domain than on the aviation domain, which is expected, as we assume that the engine was built for general use. BERT proves to be the state of the art, outperforming our best models by a margin of around 5%.
For both domains, our models’ results are worse than those obtained on the translated test set used in
Section 4.1, because the tweets are now real ones, not translated, and their target class distribution differs significantly; from a statistical point of view, the sets are drawn from different statistical populations.
Bernoulli NB is more robust than LSTM to novel tweets and to different domains. We suppose that this happens because, for LSTM, the Word2Vec embedding is learned on our very limited translated dataset rather than on the Romanian language as a whole.
4.3. Discussion and Further Work
Experiments presented in
Section 4.1 show that a standard method such as Bernoulli Naive Bayes applied to the classical TF-IDF encoding supplies results that fit our media surveillance needs. The performance of the Bernoulli NB classifier is slightly better than that of the other classifiers and lies within a narrow margin below that of a BERT-based classifier. Applying evolutionary optimization for the hyperparameter search allowed us to improve the weighted F1-measure of the optimized classifiers by 1% to 3%.
Bernoulli NB has the advantage of very fast inference on novel instances and is easily retrainable, which accommodates the volatility of the discussed topics. In contrast, although the BERT-based classifier does produce state-of-the-art results in terms of both accuracy and weighted F1-measure, its requirements in hardware resources and computational time make it infeasible for our practical needs.
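To make concrete why training and inference are so cheap, the following is a minimal pure-Python sketch of a Bernoulli Naive Bayes classifier. It is not our production code: Bernoulli NB only models whether a term occurs in a document, so the sketch works on sets of tokens; the function names, the Laplace smoothing value and the toy tweets (written without diacritics) are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_bernoulli_nb(docs, labels, alpha=1.0):
    # Training is a single counting pass: document frequency per class.
    vocab = sorted({tok for doc in docs for tok in doc.split()})
    class_docs = Counter(labels)
    doc_freq = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        doc_freq[label].update(set(doc.split()))
    model = {"vocab": vocab, "prior": {}, "p": {}}
    for c, n_c in class_docs.items():
        model["prior"][c] = math.log(n_c / len(docs))
        # Smoothed probability that term t is present in a document of class c.
        model["p"][c] = {
            t: (doc_freq[c][t] + alpha) / (n_c + 2 * alpha) for t in vocab
        }
    return model


def predict(model, doc):
    # Terms outside the training vocabulary are ignored.
    present = set(doc.split())
    best, best_lp = None, -math.inf
    for c, log_prior in model["prior"].items():
        lp = log_prior
        for t in model["vocab"]:
            p = model["p"][c][t]
            # Absent terms contribute as well: this is the Bernoulli event model.
            lp += math.log(p) if t in present else math.log(1.0 - p)
        if lp > best_lp:
            best, best_lp = c, lp
    return best


docs = [
    "zbor anulat groaznic",
    "intarziere groaznic bagaj pierdut",
    "zbor excelent multumesc",
    "echipaj excelent",
    "zbor la ora opt",
    "bagaj la poarta",
]
labels = ["negative", "negative", "positive", "positive", "neutral", "neutral"]

model = train_bernoulli_nb(docs, labels)
print(predict(model, "zbor groaznic"))               # negative
print(predict(model, "echipaj excelent multumesc"))  # positive
```

Since the model consists only of per-class term counts, retraining on new tweets amounts to updating these counts, which is what makes the classifier easy to refresh as discussion topics change.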
With the final experiments presented in
Section 4.2, we demonstrate that the selected classification model is suitable for our project production environment with general discussion topics, with a performance superior or at least equivalent to the already existing classifiers for the Romanian language although: (i) we learned a limited language model from a very specific dataset of only about 15,000 tweets; (ii) Romanian knowledge was produced with a public automatic translation service; and (iii) model learning was performed on domain-specific knowledge—the airline industry.
Therefore, we confirm that constructing datasets for Romanian NLP tasks by automatic translation from English could be a solution, especially for SA. If no extensive language model exists, and if computing time and hardware resources represent a barrier, we recommend the use of the classical TF-IDF encoding with a simple classifier such as Bernoulli NB. In our case, classical ML methods generalized more robustly than DL-based methods. This might be because the classical models used TF-IDF features, which do not take semantic and syntactic structures into account and are thus less affected by the automatic translation process than Word2Vec. For a wide-scope hyperparameter search, we suggest employing evolutionary optimization, which is capable of fine-tuning the classifiers in a reasonable time.
For our future work, we would like to repeat the experiment presented in
Section 4.2 with the following differences: using considerably more texts in order to be more representative, defining standardized labeling rules, employing the help of more human annotators, and excluding texts which proved to be hard to label even by humans. Having more texts which are better labeled by more human annotators should improve the accuracy of all the presented models. Additionally, the sentiment class distribution of the dataset could be balanced using various oversampling techniques. The models can be retrained on the balanced data following the same methodology presented in this work and their performance re-evaluated. If these modifications do not bring a significant improvement, the LIME (
https://github.com/marcotcr/lime, accessed on 10 September 2022) (Local Interpretable Model-Agnostic Explanations) model can be used to understand the reasons behind predictions and decide which models are more robust and trustworthy [
74].
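As an illustration of the simplest oversampling technique mentioned above, the sketch below performs random oversampling: instances of the smaller classes are duplicated at random until every class is as large as the biggest one. This is only one of several possible techniques; the function name and the toy data are hypothetical.

```python
import random
from collections import Counter, defaultdict


def random_oversample(texts, labels, seed=0):
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for text, label in zip(texts, labels):
        by_class[label].append(text)
    # Every class is grown to the size of the largest class.
    target = max(len(items) for items in by_class.values())
    out_texts, out_labels = [], []
    for label, items in by_class.items():
        # Draw duplicates (with replacement) to fill the gap to `target`.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for text in items + extra:
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels


texts = ["t1", "t2", "t3", "t4", "t5", "t6"]
labels = ["negative", "negative", "negative", "neutral", "neutral", "positive"]
new_texts, new_labels = random_oversample(texts, labels)
print(Counter(new_labels))
```

Oversampling should be applied to the training portion only, so that duplicated tweets do not leak into the data used for validation or testing.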
5. Conclusions
Within the larger setup of a media surveillance project [
20], we constructed a system capable of inferring the global sentiment polarity of Romanian tweets starting from an English dataset specific to the aviation industry, translated to Romanian. This paper describes our experience in designing the classification system and extracts several noteworthy conclusions in sentiment analysis of microblogging content for Romanian. As similar works treat the Romanian SA task only as a binary classification, we set the benchmark accuracy for a multinomial methodology consisting of three classes: negative, positive, and neutral. Bernoulli NB trained on TF-IDF features achieved an accuracy of around 78%, while BERT achieved the best result of 81%.
After carefully processing the Twitter data, in particular to properly handle bad language, we built and evaluated models constructed with various classifiers, including standard machine learning and the currently very popular deep learning. Given the large number of parameters to optimize for fine-tuning the classifiers, we opted to perform hyperparameter search with the help of evolutionary optimization. We found that the Bernoulli Naive Bayes classifier is the most robust one for both aviation industry-specific tweets and general ones, and that TF-IDF encoding should be used if no additional linguistic resources are available.
Regarding the performance measured with the help of accuracy and weighted F1-measure, we notice that it does not differ significantly between the original English dataset and its Romanian translation. Furthermore, although the training data were specific to the aviation industry, the classification performance achieved on a small dataset with general tweets seems to be slightly better than that of a commercial public demo available on the market.
Learning standard deep neural networks on a TF-IDF encoding or LSTM on a Word2Vec encoding brings comparable results, but at an increased computational cost. Doc2Vec encoding does not seem to help, as results are worse regardless of the classifier.
Further research is still needed in order to obtain even better results. LSTM requires pre-trained Word2Vec embeddings for the target language, which are not available for Romanian. Moreover, a larger and more balanced dataset on the general domain could be used for learning, but probably with an increased computational cost.