Lexicon-Based vs. BERT-Based Sentiment Analysis: A Comparative Study in Italian

Abstract: Recent evolutions in the e-commerce market have led consumers to attribute increasing importance to third-party product reviews before purchasing. In order to improve its offer by intercepting the discontent of consumers, the industry has placed increasing attention on systems able to identify the sentiment expressed by buyers, whether positive or negative. This has driven the development of two types of methodologies: those based on lexicons and those based on machine and deep learning techniques. This study proposes a comparison between these technologies in the Italian market, one of the largest in the world, exploiting an ad hoc dataset. Scientific evidence generally shows the superiority of language models such as BERT built on deep neural networks, but the comparison opens several considerations on the effectiveness and improvement of lexicon-based solutions in the presence of datasets of reduced size such as the one under study, a common condition for languages other than English or Chinese.


Introduction
The progressive increase in data available to analysts has seen a strong boost in recent years thanks to the growing capillarity with which social networks have spread. One of the major uses of the latter, which has contributed to their rise and to that of data, has been the availability of reviews of commercial products: the possibility of improving products and increasing their visibility, or of removing them from the market due to a bad reputation, has been one of the levers that has moved the interest of market operators towards tools such as the sentiment analysis of large amounts of data, in this case reviews, in order to attract more and more customers [1]. If before, the expression of an evaluation of a commercial product was the prerogative of experts in the field through traditional media, now the democratization of this process has brought this possibility to anyone with an internet connection and a means to access it. Consequently, there has never been such an increase in data, nor in the need for automatic means to extract and analyze its contents in order to modify and direct business strategies in an immediate manner.
In general, the classification of text according to its different aspects has been a focus of research in recent years: opinion mining and sentiment analysis [2,3], first through rule-based systems [4] and then through machine [5] and deep [6][7][8][9] learning, have constituted a continuously improving line of research, thanks to the arrival of increasingly sophisticated language models capable of exploiting prior knowledge and adapting it to the specific tasks for which they are employed, thus returning better results with less use of computing resources. In detail, language models based on deep neural networks have the advantage of being able to classify sentiment by automatically learning the key features from the datasets submitted in the training phase, processing sentences with both simple and complex structures; however exceptional, these results depend strongly on the language to be treated and, in particular, on the availability of large datasets on which to pre-train the model: this situation is common only for the English or Chinese languages, while everything else is generally classified as a low-resource language. This is where lexicon-based models come in: they exploit pre-built dictionaries specific to a language and a domain of interest and are based on formalisms and rules that, although unable to interpret sentences with particularly complex structures, are particularly effective in scenarios where the available data are limited: this argument is especially valid for a complex but low-resource language such as Italian.
The aim of this paper is to practically compare the performance of these two methods on an ad hoc dataset in the Italian language for the task of sentiment analysis, highlighting possible advantages and disadvantages of the two approaches in the presence of linguistic structures and constructions with specific terminologies, proposing altogether to:
• Verify the performance of one of the best language models available for the Italian language, i.e., BERT Base Italian XXL, on a dataset of reviews created ad hoc;
• Test the performance of one of the best NooJ-based lexical analysis systems available for the Italian language, starting from the Sentix and SentIta lexicons, on the same dataset;
• Understand and compare the performance of the two systems using tools such as SHAP for qualitative analysis and explainability of AI models.
The article is structured as follows: Section 2 describes the background and the most relevant related works, while Section 3 describes the tested architectures and experimental setup, the dataset used and the adopted evaluation metrics. In Section 4 the obtained results are discussed and, finally, in Section 5 conclusions and possible future works are drawn.

Background and Related Works
This section reviews the scientific literature concerning the two methodologies compared for the analysis of sentiment within texts: in Section 2.1, the machine- and deep-learning-based methods are discussed, while in Section 2.2 the lexicon-based methods are illustrated.

Machine and Deep Learning Based Approaches
In recent years, the continuous expansion of the phenomenon of online reviews has provided a push towards the use of techniques that could automate the process of sentiment analysis, aimed at quickly classifying consumer opinion. The techniques developed have been numerous: starting from the analysis of the frequency, role and position of terms in the texts [10] or of specific words and phrases [11], passing through the analysis of syntax [12] and of negations [13], more and more features have been engineered into machine learning algorithms based on the most disparate classifiers, such as maximum entropy and multinomial naïve Bayes [12], which were, however, limited by the need for large vocabularies from which to extract, during training, the features required for proper classification.
The introduction of embeddings [14] has been a turning point: through algorithms such as continuous bag of words (CBOW) or skip-gram it became possible to obtain a vector representation of the tokens constituting the texts, either providing the context and predicting the word or vice versa. The limit of this method is related to its static nature: once a word is mapped into the vector space during the creation of the embedding, its representation remains the same regardless of the context of the text being classified, and a token outside the training vocabulary is simply not recognized. Over time, variants of the original algorithm have been proposed, such as the faster GloVe [15] or char2vec [16], based on characters instead of words. Proposals specifically designed for sentiment analysis were also not lacking, such as embeddings trained on corpora specifically designed for sentiment analysis [17], later adding the ability to exploit lexical intensity [18] or training on multi-domain scenarios [19]. Readers may find interesting the approaches capable of taking further advantage of coreference and anaphora resolution techniques [20].
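The context-window mechanism behind skip-gram training can be sketched in a few lines of plain Python: for each centre word, training pairs are built from the words inside a fixed window (CBOW works in the opposite direction, predicting the centre word from its context). This is an illustrative toy, not the word2vec implementation itself, and the sentence is an invented example.

```python
# Generate (centre, context) training pairs as in skip-gram:
# each centre word is paired with every word inside its window.
def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, centre in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((centre, tokens[j]))
    return pairs

sentence = "il film era davvero bello".split()
print(skipgram_pairs(sentence, window=1))
```

A downstream model is then trained to predict the context word from the centre word for each pair, which is what shapes the vector space.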
Recently, the development of deep neural networks has allowed a further proliferation of embedding techniques [21], as constituents of the first input layer of such networks, whether convolutional, recurrent or transformer-based [22], the foundation of the most modern language models. Convolutional models employ a two-dimensional matrix to represent the generic sentence, such as those proposed by Kim [23] and Kalchbrenner et al. [24]. With the use of convolutional neural networks specialized for the target sentiment, Chen et al. [6] provided further improvement in the area of sentiment analysis. Recurrent models, on the other hand, use memory cells to process the state in which information about incoming and previous tokens is contained. The development of sentiment analysis in relation to such models has typically relied on bottom-up representations of sentences [25], then exploiting long short-term memory (LSTM) networks [26] to mitigate vanishing gradient and long-range dependency issues, thereby achieving important results in correct sentiment recognition [27,28]. At the same time, several language models relying on LSTM networks have begun to emerge, such as embeddings from language models [29] and universal language model fine-tuning [30]. The great innovation delivered by language models is the provision of networks already pre-trained on huge corpora: this made it possible to fine-tune such models using a small amount of task-specific data and far fewer computational resources.
Recently, with the introduction of transformers [31], additional language models have arisen, such as the generative pre-trained transformer [32] and bidirectional encoder representations from transformers (BERT) [33], often available in monolingual and multilingual versions whose commonalities and differences scientists are trying to understand [34], also in the field of sentiment analysis of online reviews [35]. These models have shown great performance improvements by overcoming the sequentiality of previous models and introducing operational parallelism, through which countless advantages deriving from context analysis have been shown, making the constituent embeddings themselves dynamic.

Lexicon-Based Approaches
Lexicon-based approaches rely on the assumption that the text semantic orientation is strictly related to the polarity of words and phrases that occur in it. This is related to content words, namely adjectives [36,37], adverbs [38], nouns [39] and verbs [40] and to phrases and sentences that contain them.
Although manually built lexicons are evidently more accurate than automatically built ones, especially in cross-domain sentiment analysis tasks, manual annotation is a costly activity in terms of human resources and time [41,42]. This is the cause of the proliferation of studies on automatic polarity lexicon creation and propagation, which perform this task through morphological methods [43,44], by exploiting the semantic relations of thesauri [45][46][47][48] and by using co-occurrence algorithms in large corpora [49][50][51]. Automatically created dictionaries seem to be more unstable, but are usually larger than the manually built ones. Size, however, does not always mean quality. It is common for these large dictionaries to have scarcely detailed information. Furthermore, a large number of entries may come with less detailed descriptions or, instead, with more noise.
Among the most relevant polarity lexicons for the English language, SentiWordNet [46] and the SO-CAL dictionary [41] have to be mentioned. SentiWordNet is based on WordNet 2.0 [52] and has been built by automatically associating each WordNet synset with three scores: Obj for objective terms, Pos and Neg for positive and negative terms. Each score ranges from 0.0 to 1.0. The values are determined on the basis of the proportion of eight ternary classifiers (with similar accuracy levels but different classification behaviors) that quantitatively analyze the glosses associated with every synset and assign them the proper label. SentiWordNet 3.0 [53] improves SentiWordNet 1.0. The main differences between them are the version of WordNet they annotate (3.0 for SentiWordNet 3.0) and the algorithm used to annotate WordNet, which, in the 3.0 version, along with the semi-supervised learning step, also includes a random-walk step that refines the scores. The SO-CAL dictionary [41], due to the low stability of automatically generated lexical databases, has been developed by hand-tagging, on an evaluation scale ranging from +5 to −5, the semantically oriented words found in a variety of sources, namely: a multi-domain collection of 400 reviews belonging to different categories [37]; 100 movie reviews from the Polarity Dataset [10,54]; and the whole General Inquirer dictionary [55]. The result was a dictionary of 2252 adjectives, 1142 nouns, 903 verbs and 745 adverbs. The adverb list has been automatically generated by matching adverbs ending in -ly to their potentially corresponding adjectives. Moreover, a set of multi-word expressions (152 phrasal verbs, e.g., to fall apart, and 35 intensifier expressions, e.g., a little bit) has also been taken into account.
In case of overlapping between a simple word (e.g., fun, +2) and a multi-word expression (e.g., to make fun of, −1) with different polarity, the latter possesses the higher priority in the annotation process.
Most of the state-of-the-art work on polarity lexicons for sentiment analysis purposes focuses on the English language. Thus, Italian lexical databases are mostly created by translating and adapting the English ones, such as SentiWordNet and WordNet-Affect. Italian polarity lexica that deserve to be mentioned are the Sentiment Italian Lexicon [56], also known as Sentix (https://valeriobasile.github.io/twita/sentix.html, accessed on 1 October 2021); SentIta [57]; the lexicon of the FICLIT+CS@UniBO System [58]; the CELI Sentiment Lexicon [59]; the Distributional Polarity Lexicon [60] and SenticNet [61]. Among others, Sentix merged the semantic information belonging to existing lexical resources in order to obtain an annotated lexicon of senses for Italian. Basically, MultiWordNet [62], the Italian counterpart of WordNet [52,63], has been used to transfer the polarity information associated with English synsets in SentiWordNet to Italian synsets, thanks to the multilingual ontology BabelNet [64]. The dictionary contains 59,742 entries for 16,043 synsets. SentIta is a semi-automatically built sentiment lexicon that combines polarity and intensity labels, which generate an evaluation scale that goes from −3 to +3 and a strength scale that ranges from −1 to +1. In this lexicon, all the adjectives included in the lexical resources of the Italian module of NooJ (https://www.nooj-association.org/resources.html, accessed on 1 October 2021) have been manually annotated with polarity and intensity scores. Afterwards, morphological finite state automata (FSA) have been used to semi-automatically extend the annotation over verbs, nouns and adverbs [65]. The result is a set of dictionaries of more than 20,000 entries, which has recently been enriched with taboo words [66], idioms [67] and emojis [68].
The lexicon of the FICLIT+CS@UniBO System, which has been created for the EVALITA 2014 SENTIPOLC task, includes adjectives and adverbs from the De Mauro-Paravia Italian dictionary and nouns and verbs from the Sentix database. All its lexical items have been classified according to their polarity by the use of the online sentiment analysis API provided by AI Applied (https://ai-applied.nl/text-apis, accessed on 1 October 2021). The CELI Sentiment Lexicon is a sentiment lexicon that contains simple words, multi-words and idioms, annotated with polarity, intensity, emotion and dominance. It is a proprietary resource that CELI sells with a license of use. The Distributional Polarity Lexicon is a large-scale polarity lexicon, which has been automatically created by deriving it through distributional models of lexical semantics, where the polarity of words is derived by sentences annotated with polarity. SenticNet [61] is a knowledge base for concept-level sentiment analysis, freely available also for the Italian language (SenticNet modules are available also for the English, Spanish, Portuguese, Indonesian and Vietnamese languages.), which does not merely use keywords and word co-occurrence counts, but deepens the implicit meaning associated with commonsense concepts by integrating logical reasoning within deep learning architectures. The resource provides semantic annotations associated with 200,000 natural language concepts, including polarity values that go from −1 to +1.

Materials and Methods
Hereafter, the tested architectures are introduced in Section 3.1, while the exploited dataset and the metrics used to evaluate performance are described in Section 3.2.

Widely used in a number of natural language processing tasks, such as named entity recognition or sentiment analysis, bidirectional encoder representations from transformers (BERT) [33] forms the basis of several breakthrough language models to date. The ability to analyze context bidirectionally, forward and backward, has allowed these kinds of models to specialize their previous knowledge, acquired through a long pre-training phase on huge corpora, to the specific task under study: if, on the one hand, the inner layers of the deep constituent network preserve their generalization capability, on the other hand, the outer layers of the network adapt themselves in a flexible way to the contents examined during the so-called fine-tuning phase that specializes the architecture, producing an ad hoc model.

Tested Architectures
The knobs through which this fine-tuning can be adjusted are the so-called hyper-parameters, the main ones of which are reported in Table 1. In detail, there are: the number of hidden layers that make up the transformer encoder, also called transformer blocks, equal to 12; the attention heads, also called self-attention [31], likewise equal to 12; the hidden size of the feed-forward networks and the maximum length of the input sequence, equal to 768 and 512, respectively; and the number of weights that make up the network, equal to 110 million (M). Finally, the learning rate, the number of epochs used for fine-tuning and the batch size are equal to 0.00001, 5 and 8 in our case, respectively. Tokenization and the management of out-of-vocabulary words are achieved through the WordPiece model [69,70], which identifies the common sub-words from which the dictionary is built. The separation between sentences is marked by the special token [SEP], while the special token [CLS] provides the output vector of hidden dimension H that represents the entire sequence and feeds the downstream classifier.
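The sub-word strategy of WordPiece can be illustrated with a greedy longest-match sketch: a word is split into the longest vocabulary pieces available, word-internal pieces carrying the "##" prefix, with a fallback to [UNK] when nothing matches. The toy vocabulary below is invented; the real tokenizer learns its vocabulary from data.

```python
# Greedy longest-match-first sub-word splitting in the WordPiece style:
# word-internal pieces are looked up with the "##" continuation prefix.
def wordpiece(word, vocab):
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                pieces.append(piece)
                break
            end -= 1
        else:
            return ["[UNK]"]  # no sub-word of any length matched
        start = end
    return pieces

vocab = {"bell", "##issimo", "##o", "b", "##e"}
print(wordpiece("bellissimo", vocab))  # ['bell', '##issimo']
```

This is how BERT keeps its dictionary compact while still being able to represent rare or inflected Italian word forms.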
The input of the final fully connected classification layer, i.e., the output of the transformer represented by the last hidden layer of the first token, is denoted as a vector C ∈ R^H, while W ∈ R^(K×H) is the parameter matrix of the classification layer, where K is the number of categories; the probability for each of them is calculated as:

P = softmax(C W^T)

Instead of employing the Categorical Cross Entropy loss function, valid for multi-class classification and provided by default by the BERT model of Hugging Face (https://huggingface.co/transformers/model_doc/bert.html, accessed on 1 October 2021), it was chosen in this case to employ the Binary Cross Entropy (BCE) loss function provided by the torch (https://pytorch.org/docs/stable/generated/torch.nn.BCELoss.html, accessed on 1 October 2021) library, which is more suitable for the single-label prediction case study; however, in order to have more numerical stability, we chose to employ BCE with Logits (https://pytorch.org/docs/stable/generated/torch.nn.BCEWithLogitsLoss.html, accessed on 1 October 2021) (BCEwL), which combines the BCE with a sigmoid by exploiting the LogSumExp function (https://en.wikipedia.org/wiki/LogSumExp, accessed on 1 October 2021). Given a batch size N, classification using BCEwL can be described as:

BCEwL = −(1/N) Σ_(n=1..N) [y_n log σ(x_n) + (1 − y_n) log(1 − σ(x_n))]

where x_n are the logits, y_n the target labels and σ the sigmoid function.

Transformer

The most important BERT architectural component is the transformer [31]. Its operation, starting from sub-word sequences x and y, consists in placing the so-called [CLS] token before x and the so-called [SEP] token after x and after y. The embedding function E and the normalization layer LN then produce the initial representation:

h^0 = LN(E([CLS] x [SEP] y [SEP]))

Passing through FF, the feed-forward layer with GELU (the element-wise Gaussian error linear units activation function [71]), and MHSA (the multi-head self-attention function [31]), each of the M transformer blocks updates the embedding as:

a^m = LN(h^(m−1) + MHSA(h^(m−1))), h^m = LN(a^m + FF(a^m)), m = 1, …, M

In each of the N attention heads, attention is computed as:

Att(Q, K, V) = softmax(Q K^T / √d_k) V

where Q, K and V are the query, key and value projections of the input and d_k is the head dimension.
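The numerical-stability trick behind BCEwL can be sketched in plain Python: log(sigmoid(x)) is rewritten via the identity log(1 + e^x) = max(x, 0) + log(1 + e^(−|x|)), so the loss never evaluates the sigmoid of an extreme logit directly. This is an illustrative re-implementation under our own function names, not the torch source.

```python
import math

# Stable form: max(x,0) - x*y + log(1 + exp(-|x|))
# is algebraically equal to -[y*log(s(x)) + (1-y)*log(1-s(x))].
def bce_with_logits(logits, targets):
    total = 0.0
    for x, y in zip(logits, targets):
        total += max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
    return total / len(logits)

# Naive form: fine on moderate logits, but log(0) blows up
# when the sigmoid underflows for very large negative logits.
def bce_naive(logits, targets):
    total = 0.0
    for x, y in zip(logits, targets):
        s = 1.0 / (1.0 + math.exp(-x))
        total += -(y * math.log(s) + (1 - y) * math.log(1 - s))
    return total / len(logits)

print(bce_with_logits([2.0, -1.5], [1, 0]))
print(bce_naive([2.0, -1.5], [1, 0]))  # same value on safe inputs
```

On a logit such as −1000 with target 1, the naive form fails while the stable form simply returns a loss of about 1000.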

BERT Base Italian XXL
In order to test BERT on our dataset in the Italian language, the version used was the best one provided through the Hugging Face framework (https://github.com/huggingface/ transformers, accessed on 1 October 2021) by the MDZ Digital Library team of the Bavarian State Library (https://huggingface.co/dbmdz/, accessed on 1 October 2021): BERT BASE Italian XXL. This version of BERT is pre-trained on texts taken from a recent Wikipedia dump plus various text collections from the OPUS (http://opus.nlpl.eu/, accessed on 1 October 2021) corpus plus the Italian OSCAR (https://traces1.inria.fr/oscar/, accessed on 1 October 2021) corpus, for a total of 81 GB of text and 13 billion tokens.

SHAP Explanation Approach
In addition to the Accuracy and F1 score metrics, and to better discuss the obtained results, the SHAP tool was used. SHAP employs a generic approach to explain the predictions of any given model: it perturbs model inputs while observing how the output changes [72], based on the idea that the contributions of specific features can be observed by hiding the relevant inputs. In particular, SHAP is based on Shapley's theory of coalition games: the features of a data instance play the role of the coalition players, while the prediction represents the payoff whose fair distribution is established on the basis of the Shapley values, which SHAP approximates in a time linear in the number of features. Although SHAP was born to explain tabular data and images, it is also well suited for use with language models such as BERT, evaluating the impact on sentiment prediction of the text fragments, i.e., features, that make up an input sentence, and explaining them.
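The coalition-game idea can be made concrete with a brute-force computation of exact Shapley values for a toy three-feature "model" (the payoff function below is invented): each feature's value is its average marginal contribution to the output over all orderings of the features. SHAP itself approximates these values rather than enumerating orderings, which is exponential in the number of features.

```python
from itertools import permutations

# Exact Shapley values by enumerating all feature orderings and
# averaging each feature's marginal contribution to the payoff.
def shapley_values(features, payoff):
    values = {f: 0.0 for f in features}
    perms = list(permutations(features))
    for order in perms:
        coalition = set()
        for f in order:
            before = payoff(frozenset(coalition))
            coalition.add(f)
            values[f] += payoff(frozenset(coalition)) - before
    return {f: v / len(perms) for f, v in values.items()}

# Hypothetical sentiment "model": "ottimo" (excellent) adds +2,
# "non" (not) flips the sign, "film" is neutral.
def payoff(coalition):
    score = 2.0 if "ottimo" in coalition else 0.0
    return -score if "non" in coalition else score

print(shapley_values(["non", "ottimo", "film"], payoff))
```

In this toy game the negation receives the whole −2 swing, the polar word averages out to 0 and the neutral word gets 0, which mirrors how SHAP attributes sentiment to individual text fragments.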

Lexicon-Based Method
The lexical method has been tested, in a document-level sentiment classification task, by exploiting a hand-built lexicon, SentIta, and an automatically built one, Sentix. Regarding the hand-made lexicon, the entries of SentIta are labeled with inflectional (FLX) and derivational (DRV) properties and with four sentiment tags: positive and negative for the property "polarity", and strong and weak for the property "intensity", as can be seen in the example below with the negative word sudicio (English: filthy):

sudicio,A+FLX=N106+DRV=SSIMO:N88+POLARITY=NEG+INTENSITY=STRONG

Differently, Sentix entries are associated with the part of speech (a for adjective), the WordNet synset ID, a positive and a negative score from SentiWordNet, a polarity score ranging from −1 to 1, and an intensity score ranging from 0 to 1:

sudicio, a, 00419289, 0, 0.75, −1.0, 0.75

In the experiment, words are not considered alone; instead, they are treated within a set of co-occurrence rules that modify their scores according to the syntactic contexts in which they occur. Lexical, morphological and syntactical indicators are the markers that guide the analysis of polar words in context, for instance comparison, e.g., Il suo motore era anche il più brioso [+2] [+3] (English: Its engine was also the most lively) and Un film peggiore di qualsiasi telefilm [−3] (English: A movie worse than whatever tv series).
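Under the field order shown above (lemma, part of speech, synset ID, positive score, negative score, polarity, intensity), a Sentix-style entry can be read into a record with a few lines of Python. This is a sketch assuming the comma-separated layout of the example; the actual resource may use a different separator.

```python
# Parse one Sentix-style entry into a dictionary, following the
# field order of the example: lemma, POS, synset, pos/neg scores,
# polarity in [-1, 1], intensity in [0, 1].
def parse_sentix(line):
    lemma, pos, synset, p, n, polarity, intensity = [f.strip() for f in line.split(",")]
    return {
        "lemma": lemma,
        "pos": pos,                     # 'a' = adjective
        "synset": synset,
        "pos_score": float(p),
        "neg_score": float(n),
        "polarity": float(polarity),
        "intensity": float(intensity),
    }

entry = parse_sentix("sudicio, a, 00419289, 0, 0.75, -1.0, 0.75")
print(entry["lemma"], entry["polarity"])  # sudicio -1.0
```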
The contextual shifting of words has been handled by generalizing over all the words endowed with the same prior polarity. Contextual operators, such as negation and comparison markers, intensifiers and down-toners, do not always change a sentence's polarity into its positive or negative counterpart; they often have the effect of increasing or decreasing the sentence score, so it is better to talk about valence shifting rather than switching. The idea is that the final polarity of an expression modified by the context can be modulated by taking into account, at the same time, both the polarity of the opinionated words and the strength of the contextual indicators. Tables 2-4 show examples of co-occurrence rules among negation operators and sentiment words.

Table 4. Negation rules with weak operators.

[Table columns: Strong Negation Operator | Sentiment Word | Word Polarity | Shifted Polarity]
A network of local grammars has been designed as a set of rules that compute the individual polarity scores of words according to the contexts in which they occur. In general, sentence annotation is performed using embedded local grammars in the shape of FSA. Local grammars are algorithms that, through grammatical, morphological and lexical instructions, formalize linguistic phenomena and parse texts. They are called local because, despite any generalization, they can be used only in the description and analysis of limited linguistic phenomena.
NooJ (https://www.nooj-association.org/, accessed on 1 October 2021) is the Natural Language Processing tool used in this work, in the lexicon-based task, for both the language formalization and the corpora pre-processing and processing, at the orthographical, lexical, morphological, syntactic and semantic levels [73]. The NooJ annotations can go through the description of simple word forms, multi-word units and discontinuous expressions. Lexical items and their semantic labels are systematically recalled into local grammars, which are algorithms in the shape of enhanced recursive transition networks (ERTN) that, through grammatical, morphological and lexical instructions, are exploited in order to formalize linguistic phenomena and to parse texts. The NooJ finite state automaton, illustrated in Figure 1, is an abstract device made up of a finite set of states (S) connected by transitions (t), with which it is possible to design a set of patterns able to recognize specific strings. An FSA always goes from the initial state (Si) to the final one (Sf). NooJ FSA are special kinds of ERTN that also allow the use of outputs that can describe the recognized patterns, embedded graphs, loops, variables (V) and constraints (C). Furthermore, they can be placed at any time in relation to electronic dictionaries, which can be recalled into each state of the ERTN. More in detail, with reference to the syntactic treatment of sentiment lexicons, co-occurrence rules have been computed through the finite-state technology by a network of more than 100 embedded graphs that confer the same polarity values to those expressions in which words belonging to the same classes occur and which are described by the same rules. Such classes of words correspond to the six values that range from −3 to +3, representing the different negative and positive word polarities. There are more than 80 rules and they refer to negation, comparison, intensity and combinations of their markers.
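The FSA idea can be sketched as a small state-transition table that moves from the initial state Si to the final state Sf when a token sequence matches a pattern, here an invented "negation + polar adjective" rule. NooJ graphs are of course far richer (embedded graphs, variables, constraints, outputs); this only illustrates the underlying mechanism.

```python
# Toy finite state automaton: recognize "negation + polar adjective".
NEGATIONS = {"non", "mai"}
POLAR_ADJ = {"bello", "brutto", "sudicio"}

# transitions: state -> list of (predicate on token, next state)
TRANSITIONS = {
    "Si": [(lambda t: t in NEGATIONS, "S1")],
    "S1": [(lambda t: t in POLAR_ADJ, "Sf")],
}

def recognises(tokens):
    state = "Si"
    for tok in tokens:
        for predicate, nxt in TRANSITIONS.get(state, []):
            if predicate(tok):
                state = nxt
                break
        else:
            return False  # no transition fired: pattern not matched
    return state == "Sf"

print(recognises(["non", "bello"]))   # True
print(recognises(["molto", "bello"])) # False
```

In NooJ, each state can additionally recall an electronic dictionary and emit annotations, which is how the recognized patterns feed the polarity computation.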
All the electronic sentiment dictionaries and the co-occurrence rules used in this paper have been formalized or converted into the NooJ format; therefore, in order to consider both SentIta and Sentix in context, by recalling them into the NooJ syntactic module for sentiment analysis, the Sentix polarity and intensity scores have been translated into the NooJ labels. Its lemmas have also been enriched with inflectional and derivational properties from the Italian module of NooJ. The purpose was to obtain a version of Sentix in the NooJ format that would also allow the syntactic treatment of its words. Once the polarities of the dictionary entries have been extracted and modified according to their syntactic contexts, the score of the whole review is simply measured as the arithmetic mean of all the oriented expressions extracted by the tool.
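The scoring strategy just described can be sketched as follows: prior word polarities are shifted by contextual operators and the review score is the arithmetic mean of the extracted oriented expressions. The mini-lexicon and the shifting rules below are invented for illustration and do not reproduce the actual SentIta rules.

```python
# Toy lexicon-based scorer: prior polarities, valence shifting by
# a preceding intensifier or negation, then the arithmetic mean.
LEXICON = {"bello": 2, "brutto": -2, "sudicio": -3}
INTENSIFIERS = {"molto": 1}   # strengthens in the direction of the polarity
NEGATIONS = {"non"}

def score_review(tokens):
    scores = []
    for i, tok in enumerate(tokens):
        if tok not in LEXICON:
            continue
        score = LEXICON[tok]
        prev = tokens[i - 1] if i > 0 else None
        if prev in INTENSIFIERS:  # valence shifting: intensify
            score += INTENSIFIERS[prev] if score > 0 else -INTENSIFIERS[prev]
        if prev in NEGATIONS:     # valence shifting: flip and attenuate
            score = -score // 2 if abs(score) > 1 else -score
        scores.append(score)
    return sum(scores) / len(scores) if scores else 0.0

print(score_review("il film era molto bello ma non brutto".split()))
```

Note how "non brutto" contributes a weak positive rather than a strong negative: this is the valence-shifting (rather than switching) behavior discussed above.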

Dataset and Evaluation Metrics
The dataset exploited in this paper has been built by extracting Italian opinionated documents from e-commerce and opinion websites, such as www.ciao.it accessed on 1 October 2021, www.amazon.it accessed on 1 October 2021, www.mymovies.it accessed on 1 October 2021, www.tripadvisor.it accessed on 1 October 2021. It is composed of 600 reviews (126,184 tokens) about six different products and services, namely cars, smartphones, books, movies, hotels and videogames. Each one of the mentioned categories is associated with 50 positive and 50 negative texts. The distinction between positive and negative reviews is based on the structured data directly selected by the users. This choice poses several challenges, related to:
• The identification of the proper polarity of the reviews that are close to neutrality;
• The treatment of the reviews that contain both positive and negative claims;
• The correct analysis of reviews in which positive and negative comments are not related to the described product, e.g., delivery issues, or plots in movie and book reviews [74].
Furthermore, performances have been evaluated through the following metrics:
• Accuracy, which states the proportion of labels correctly identified;
• F1 score, the harmonic mean of Precision and Recall.
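The two metrics can be computed from scratch on hypothetical gold and predicted labels (1 = positive review, 0 = negative review); the label vectors below are invented for illustration.

```python
# Accuracy: fraction of labels predicted correctly.
def accuracy(gold, pred):
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

# F1: harmonic mean of Precision and Recall for the positive class.
def f1_score(gold, pred, positive=1):
    tp = sum(g == p == positive for g, p in zip(gold, pred))
    fp = sum(p == positive and g != positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

gold = [1, 1, 0, 0, 1, 0]
pred = [1, 0, 0, 1, 1, 0]
print(accuracy(gold, pred), f1_score(gold, pred))
```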

Results and Discussion
The results of the tested systems are provided in Table 5. The analysis of the numerical results in terms of accuracy and F1 score easily identifies a winner in BERT, regardless of the lexicon used in combination with the NooJ tool. While it is evident that the language model employed performs well even with a modestly sized dataset, the difference is not overwhelming, and indeed the doubt arises that, by improving the lexicons, especially in cases where the data available for fine-tuning are even scarcer, lexicon-based methods may still outperform the more modern deep neural network-based methods. To gain a deeper understanding of the performance, for better or worse, of the systems employed, six reviews were analyzed in detail: two not recognized by either system, two recognized only by BERT and two recognized only by NooJ.

Quantitative Analysis
The analysis of the errors of the two methods employed in the experiment has been conducted by evaluating the segment-level Precision, Recall, F1 and Accuracy on the six reviews mentioned above, whose performance details are reported in Table 6. The analyses made with both the lexicon-based method and BERT have been compared with the evaluation of human annotators over each text portion produced by SHAP, which corresponds to the text segments displayed in Tables 7-12. As can be seen, these segments can be smaller or bigger than a sentence, according to the analyses produced by the tool. The results describing the analyses of the segments of the six reviews discussed in this section (Table 6) are obviously lower than the ones presented in Table 5, because they regard the error analysis and, for this reason, focus on the reviews on which the tools, in turn, produced incorrect results. In detail, the first rows of Table 6 show the overall performances of the tools on the six review segments. The results related to wrongly and correctly detected reviews, instead, refer to the reviews that have been, respectively, wrongly or correctly classified as positive or negative in the main task. As an example, wrongly detected reviews for BERT are the ones that have been correctly classified as a whole only by NooJ and the ones on which both systems failed.

Table 7. Neither BERT nor NooJ detects them correctly. Example A.
As soon as we entered the hotel there was a strong smell, which was also there in our rooms. My room had a completely broken wardrobe.
Because of the smell we had to sleep with the window open.
The sheets were clean. Good location.
I'm still in love with it even though I treat it badly (I've got cigarette burns on the seats, it's escaped!).
faults: I have a sunroof: it broke after the first warm and cold spells, because there are pieces of plastic that dry out after a while.
The glass whistles. The horn works badly. The steering is wide, too wide. But it's beautiful. Quiet engine, but if you go over 100 with the roof closed, it makes noise. Behind it you fuck well.

[Table residue: overall segment score −1.520; extracted expressions: −2 ("male", "badly"), −2 ("troppo largo", "too wide"), +2 ("bella", "beautiful"), +2 ("bene", "well"), ±1/±2 ("silenzioso", "silent"), −2 ("chiuso", "closed"); overall score and predicted label reported in the corresponding table.]

What emerged is that, as regards the method based on lexicons, the measurement of the average scores of the words and phrases occurring in the reviews is certainly not suitable for determining their correct orientation. In fact, although there is a very high precision at the segment level for the hand-annotated lexicon SentIta, this does not correspond to an adequate performance of the lexicon-driven strategy at the document level when compared to BERT (0.86 SentIta and 0.93 BERT). BERT performs significantly better on the segment-level task in the cases in which the entire documents are also classified properly (0.63 of Accuracy in wrongly annotated texts and 0.70 in the correct ones). The main difference between BERT and NooJ is that the first concentrates its errors on the segments that are labeled by the human annotators as neutral or ambivalent, while the errors of NooJ depend above all on the dictionary features. In fact, SentIta presents problems in terms of Recall and Sentix in terms of Precision: this depends on the number of entries of the two lexicons and also on the annotations contained in the two resources, as detailed in the following paragraph.

Table 9. Only BERT detects them correctly. Example A.
(such as Pinko, which is distinguished by its golden colour, and the D&G, very chic which is distinguished by the different colouring of the rear lights).
With the arrival of the new C3, the range has been reduced to its essentials and the model given a new name 'C3 Classic'. A used 2004/2005 model is around 4800/5000 euros max (models with few km and in excellent condition).
It will definitely be the car I buy right after the summer holidays, as my little hatchback has now decided to abandon me.
someone like me who has to drive two children, shopping to do, and why not, also to go shopping with friends.
Consiglio a tutti di andare a vederla e a provarla, non resterete delusi. (English: I recommend everyone to go and see it and try it out, you will not be disappointed.)
[Per-segment lexicon annotations condensed: "amiche"/"friends" (+1/+2), "delusi"/"disappointed" (−3/+3); segment scores +1.031, +0.872, +0.433.]
Furthermore, considering the entire dataset, it must be noticed that the reviews that cause the highest number of errors for both lexicons in the lexicon-based method (75%) are the positive ones, in the cases in which users discuss both strengths and weaknesses. The presence of negative expressions in positive reviews inevitably makes their scores decrease and shift towards neutral values. This idea is confirmed by the average absolute values of the reviews wrongly annotated by both NooJ and BERT, which do not exceed 0.5. Moreover, in 92% of the NooJ errors and in 100% of the BERT errors, the average absolute scores are lower than 1. This confirms that, above all in the case of the lexicon-based method, the scores of the sentences need to be put in relation with one another and with the semantics of the whole document. At the same time, it is also clear that a classification task can lead to misleading results when the classes cannot be sharply divided, because they are poles of a continuum. When the oriented texts are close to neutrality, or are characterized by ambivalent polarity, the scores attributed by the tool are near zero and the certainty level of the subjective judgments is very low, for both machines and humans.
Table 11. Only NooJ detects them correctly. Example A.
I still have to see if I'm completely convinced! I am very happy with the handling and the engine of this car. It's quick and has a precise gearbox...
And then the size of the passenger compartment is just fine for me as a small person, but when someone a little bigger than me gets in, it's hard for me to get my legs and arms into it comfortably, even though I've adjusted the seat.
All in all, these are small details. But if you drive every day they count.
I also have to say that this hotel has two flaws: (1) it is a bit noisy (especially if you stay during the weekend you can hear a lot of noise coming from the street).
(2) During our stay we went to the swimming pool, but the water was not heated and it was freezing cold, and we were told that if we wanted they could heat the water for the next day... Un albergo di questa categoria non può permettersi uno scivolone del genere! (English: A hotel of this category cannot afford such a slip!)
[Per-segment lexicon annotations condensed: "non era riscaldata"/"was not heated" (−2/−2), "ghiacciata"/"freezing" (−2), "detto"/"told" (−1/+2), "non"/"not" (−2/+1); segment scores −0.652 and −0.673.]
However, these remarks must not be considered drawbacks but research challenges in the sentiment analysis field, which basically motivate the need for a fine-grained textual analysis that cannot stop at the evaluation of the structured data provided by the users themselves.

Qualitative Analysis
To go deeper into the error analysis, this paragraph discusses the cases in which the described methodologies fail in the sentiment classification task. In this regard, the six emblematic reviews introduced previously have been analyzed in detail. Again, the reference is to those texts which have been improperly labeled by the tools. Both negative and positive incorrect attributions have been taken into account: errors made by NooJ, by BERT, and by both of them. Then, the performances of the tools have been compared with a human reading of such texts.
The first two reviews have been improperly labeled by both tools (Tables 7 and 8); the third and the fourth have been improperly classified by NooJ (Tables 9 and 10); and the last two refer to BERT errors (Tables 11 and 12). The main difference between the outputs of the tools is that NooJ extracts oriented items from the texts, which can be words, multi-word units or phrases, while BERT attributes probabilistic values to entire sentences.
In detail, Tables 7 and 8 refer to reviews that have been wrongly classified by both methods. Such reviews are made up of positive sentences that open and close the documents and of negative reports in the body of the texts. In the case of Table 7, for a human it is easy to understand that the review is negative, while BERT probably interprets the prominent position of the positive comments as a discourse marker, which makes the tool fail in the text classification. The NooJ annotation is negative for SentIta and positive for Sentix, but in any case too close to zero, and consequently to neutrality, to be considered appropriate. This is because of the presence of ambivalent (positive and negative) comments in the same text. Instead, the example reported in Table 8 is difficult to classify even for a human reader, due to the irrational romantic feeling of the author for an almost completely non-functioning car. All the sentences of the text contain dual connotations, which again produce a document-level score close to zero for NooJ and a misleading reading of the sentences for BERT.
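The averaging strategy whose weakness is discussed above can be sketched in a few lines; the function, the neutrality band of 0.5 and the toy segment scores are illustrative assumptions, not the paper's actual NooJ pipeline.

```python
def document_polarity(segment_scores, neutral_band=0.5):
    """Average segment-level polarity scores into a document label.

    Averages falling in (-neutral_band, +neutral_band) are treated as
    too close to zero to be a reliable positive/negative call.
    """
    if not segment_scores:
        return "neutral", 0.0
    avg = sum(segment_scores) / len(segment_scores)
    if avg >= neutral_band:
        return "positive", avg
    if avg <= -neutral_band:
        return "negative", avg
    return "neutral", avg

# An ambivalent review: strong positives and negatives cancel out,
# pushing the average toward zero although the text is not neutral.
label, score = document_polarity([+2, -2, +2, -2, -1])  # → ("neutral", -0.2)
```

A review made up of a positive opening, a string of negative remarks and a positive closing lands in the neutral band in exactly the way Tables 7 and 8 show, regardless of how strongly each segment is scored.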
Tables 9 and 10 describe the errors made by NooJ on reviews correctly annotated by BERT. In the first case, again, the presence of both positive and negative expressions brings the average score of NooJ close to zero. In the example reported in Table 10, the main problems with NooJ are related to Recall issues, because the method fails to extract all the oriented expressions actually included in the texts.
Tables 11 and 12 refer to reviews that have been correctly annotated by NooJ but that caused errors for BERT. The example of Table 11 is a positive text that is opened and closed by weakly negative comments, which are probably considered the most relevant ones by BERT and which, in the end, make the system classify the text as negative. The example of Table 12 is basically divided into two parts, the first positive and the second negative, so its classification is challenging for both methods. NooJ does not fail in the classification because the intensity of the positive items in the first part of the review is higher than that of the negative ones.
More generally, considering all the reviews included in the error analysis, with regard to the NooJ annotation, the presence of false positives can be noticed, e.g., dolci in the last review, which is a noun meaning desserts and not the positive adjective sweet, and also of false negatives, e.g., the sentence un albergo di questa categoria non può permettersi uno scivolone del genere (English: a hotel of this category cannot afford such a slip).
Overall, it is possible to generalize that the errors caused by BERT seem to be related to the prominence of oriented markers in the text, which can sometimes be misleading. The errors caused by the lexicon methods are mostly related to syntactic issues and to polysemy. As far as syntax is concerned, the case of long distance between sentiment indicators and contextual modifiers must be mentioned, as happens in non me la sento di dire che la distribuzione è così carente (English: I don't feel like saying that the distribution is so lacking). In this case the negation indicator non is ten words away from the polarity word carente, and the system fails in the evaluation of the negation scope of the adverb. Polysemy pertains, above all, to the lexicon built on corpora, Sentix, which is by far the one that produces the highest number of errors, despite the fact that it allows the recovery of expressions not contained in SentIta. In fact, scores are given to the entries of SentIta only if the polarity of the words is evident, even if of low intensity. Such a resource does not include words and expressions that are too close to neutrality, e.g., particolare (English: characteristic), words that can change their polarity according to different contexts, e.g., piccolo or grande (English: small or big), or domain-dependent polar words, e.g., silenzioso (English: quiet, in the phrase motore silenzioso).
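The long-distance negation failure described above can be reproduced with a toy scope heuristic; the fixed token window and the tiny one-entry lexicon are illustrative assumptions (the actual system relies on NooJ finite-state grammars, not on this code).

```python
def apply_negation(tokens, polarities, window=4):
    """Sum polarity scores, flipping the sign of polar words that
    fall within `window` tokens after a negation marker."""
    total = 0
    scope_left = 0  # tokens of negation scope still open
    for tok in tokens:
        if tok == "non":
            scope_left = window
            continue
        score = polarities.get(tok, 0)
        if scope_left > 0:
            score = -score
            scope_left -= 1
        total += score
    return total

lex = {"carente": -2}  # toy lexicon entry
sent = "non me la sento di dire che la distribuzione è così carente".split()
short = apply_negation(sent, lex, window=4)   # "carente" escapes the scope → -2
wide = apply_negation(sent, lex, window=12)   # negation reaches "carente" → +2
```

With the short window the sentence is wrongly read as negative, exactly the error reported for carente; widening the window fixes this example but would over-negate others, which is why the paper points toward syntactic grammars rather than a larger fixed scope.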
By contrast, the lemmas from Sentix, which are systematically described by their Synonym Sets (SynSets) specifying their different usages and shades of meaning, are included in the sentiment lexicon even when their context-independent polarity is not so clear. For example, piccolo is associated with five different SynSets, four of them with negative scores (limited in size, of little importance or influence or power) and one with a positive score (very young). In contrast, in SentIta, words with unpredictable connotations, such as piccolo, are not associated with positive or negative polarity scores but treated as downtoners/intensifiers, decreasing/intensifying the polarity of the co-occurring words. In the example of Table 10, we see a case of a (negated) negative usage of piccolo: La C3 non è troppo grande, ma nemmeno troppo piccola, l'ideale per una donna (English: The C3 is not too big, but not too small either, ideal for a woman). Neutral (1) and positive (2) usage examples of piccolo extracted from the whole text set are reported below:

1. La colazione è molto buona e suggestiva servita in un piccolo salone con caminetto (English: The breakfast is very good and evocative, served in a small lounge with a fireplace);
2.
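The different treatment of piccolo in the two resources can be contrasted with a minimal sketch; all entries and weights here are illustrative and are not taken from the actual lexicons.

```python
# SentIta-style design: "piccolo" carries no polarity of its own but
# acts as a downtoner that scales the next polar word it modifies.
SENTITA_MODIFIERS = {"piccolo": 0.5}     # illustrative downtoner weight
SENTITA_POLARITY = {"carente": -2.0, "bella": 2.0}

def sentita_score(words):
    """Score a phrase: pending modifiers scale the next polar word."""
    total, pending = 0.0, 1.0
    for w in words:
        if w in SENTITA_MODIFIERS:
            pending *= SENTITA_MODIFIERS[w]
        elif w in SENTITA_POLARITY:
            total += pending * SENTITA_POLARITY[w]
            pending = 1.0
    return total

# A Sentix-style design would instead assign "piccolo" fixed SynSet
# polarities, firing them even in neutral uses like "piccolo salone".
```

Under this scheme piccolo in piccolo salone contributes nothing on its own (sentita_score(["piccolo", "salone"]) is 0.0), whereas a fixed-polarity entry would wrongly mark the neutral breakfast-lounge example above as negative.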
This is why Sentix shows higher recall and lower precision than SentIta and BERT. Basically, Sentix is much larger than SentIta, but it indeed contains more noise.
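The trade-off just described follows directly from the standard precision/recall definitions; the counts below are invented purely to illustrate its direction and are not measured values from the study.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true/false positives and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Invented counts: a large, noisy lexicon (Sentix-like) misses few
# oriented items but fires spuriously; a small hand-curated one
# (SentIta-like) fires precisely but misses more.
noisy_p, noisy_r = precision_recall(tp=80, fp=40, fn=10)
curated_p, curated_r = precision_recall(tp=60, fp=5, fn=30)
```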

Conclusions
This work proposed a comparative study of the ability to analyze sentiment in the Italian language on an ad hoc dataset, between a language model based on deep neural networks, BERT Base Italian XXL, and a lexicon-based system, NooJ. In particular, this study aimed to highlight the limitations and advantages of a language model in contexts other than English, such as Italian, which are characterized by the scarcity of datasets and for which it may still make sense to rely on lexicon-based models, deepening the comparison with a qualitative analysis based on tools born for the explainability of artificial intelligence, such as SHAP.
Based on the results obtained, lexicon-based methods are to be preferred where the datasets are small and the available computational resources limited, at the cost of slightly lower performance. Looking forward, the path of language-model-based methods is more attractive: unresolved problems, such as the presence of sentiments of different polarities in the same text, can be addressed with new implementation solutions by analyzing the limitations of this study, which need to be framed in relation to the two proposed approaches.
With respect to the language models, the quality of the results depends both on the model used and on the data; therefore, besides testing the other models available for the Italian language, options could be extending the dataset used and exploiting other literature datasets. In addition, the study could be extended to multilingual/cross-lingual models and to datasets in different languages, testing approaches that leverage transfer learning techniques to face low-resource language scenarios such as the Italian one. Moreover, it could be useful to expand the classification to a multi-label scenario in which the detection of different emotional states (e.g., anger, happiness, humor, sadness, satire and so on) might weigh differently on the overall sentiment attribution.
On the other hand, with reference to the lexicon-based approach, we basically aim to enrich the syntactic rules for the annotation of the sequences that caused most of the errors discussed in the previous section. In detail, the reported limitations, which can be attributed to syntactic complexity and polysemy, can be addressed by providing grammar networks specialized in the disambiguation of phrases and parts of speech based on their textual context. The complexity of the FSAs will be increased, in order to make them capable of parsing larger phrases and sentences. In this way, it is possible to solve a large part of the long-distance dependencies and to attribute a more accurate part of speech to the sentiment items, according to their context of occurrence and to their syntactic-semantic relations. Moreover, another drawback highlighted in the lexical method is related to the measurement of the average polarity score of the words and phrases located within a review. A possible solution is to include discourse markers in the analyses, in order to evaluate not only each single sentiment marker of a review, but also textual items that express a change of opinion within the review, a synthesis of the orientation expressed by the opinion holder or, for example, a sudden denial of what has been stated previously within the same text.
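The discourse-marker idea sketched above could start from something as simple as re-weighting sentences introduced by contrastive or summarizing markers; the marker list, the weights and the function below are hypothetical, offered only as a sketch of the proposed direction.

```python
# Hypothetical weights: sentences opened by a contrastive or
# summarizing marker often signal a change of opinion or a synthesis
# of the reviewer's overall stance, so they count more.
MARKER_WEIGHTS = {
    "ma": 2.0,              # "but"
    "però": 2.0,            # "however"
    "in conclusione": 3.0,  # "in conclusion"
}

def weighted_document_score(sentences):
    """`sentences` is a list of (text, polarity_score) pairs."""
    num = den = 0.0
    for text, score in sentences:
        weight = 1.0
        for marker, w in MARKER_WEIGHTS.items():
            if text.lower().startswith(marker + " "):
                weight = max(weight, w)
        num += weight * score
        den += weight
    return num / den if den else 0.0
```

For a review like [("bella macchina", +2), ("ma i freni sono carenti", -2)] the plain average is 0, while the weighted score is -2/3: the contrastive second sentence pulls the document out of the neutral band toward its actual orientation.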
A very interesting scenario for future work concerns the possible hybridization of symbolic and sub-symbolic methods, so as to have systems able to exploit both lexicons, when pre-training resources are scarce, and prior knowledge, when the target use case allows it.