Improving the Accuracy in Sentiment Classification in the Light of Modelling the Latent Semantic Relations

This paper presents a Methodology for Improving the Accuracy of Text Classification in the Light of Modelling the Latent Semantic Relations (LSR). The aim of this Methodology is to find ways of eliminating the Limitations of Discriminant and Probabilistic methods for revealing LSR and to customize the Text Classification Process for more accurate recognition of text tonality, using knowledge about the text's Hierarchical Semantic Context in the form of a Corpora-based Hierarchical Sentiment Dictionary. The main scientific contribution of this research is a set of approaches to improving the qualitative characteristics of the Text Classification process: combining the Discriminant and Probabilistic methods, which decreases the influence of the Limitations of these methods on the LSR revealing process; considering each document as a complex structure, which allows documents to be evaluated integrally through separate classification of their topically complete textual components (paragraphs); and taking into account the features of the Argumentative type of documents (Reviews), which allows the author's subjective evaluation of text tonality to be used in developing the Text Classification methodology. The tonality expressed by the document's author has a significant, but not critical, effect on the qualitative indicators of Sentiment Recognition.


The rapid development of computer technology and the Internet in recent decades has made creating and accessing the information content of many web resources an integral part of people's private and professional activities. The content of information resources such as social networks, feedback services, web forums and blogs is actively populated by the users themselves and is publicly available.

Electronic copy available at: https://ssrn.com/abstract=3425530

This content, together with more official information (for example, financial statements of enterprises, scientific and news articles), forms a large array of unstructured text information containing a huge amount of explicit and hidden knowledge.

[...] in certain topical contexts. That is why, over time, the initial formulation of the tonality analysis task has become more detailed and has emerged as a separate problem of context-sensitive (or contextually-oriented) Sentiment Classification, which is to automatically determine the views of the user, expressed in the text, with respect to previously detected Topics.

In recent years, a number of methods for detecting topics and sentiment analysis have been [...]

where $tf(w, t)$ is the relative frequency of the $w$-th word in document $t$; $k(w, t)$ is the number of occurrences of the $w$-th word in the text $t$; $df$ is the total number of words in the text $t$; $D$ is the total number of documents in the collection.
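The weighting formula that these symbols belong to did not survive in this passage. As a minimal sketch, assuming the standard TF-IDF form (relative term frequency multiplied by the logarithm of the collection size over the number of documents containing the word), the computation could look as follows. The function name `tf_idf` and the toy documents are illustrative and not taken from the paper.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: weight} dict per tokenized document.

    Assumes the standard TF-IDF form: tf(w, t) * log(D / n_w),
    where n_w is the number of documents containing w.
    """
    D = len(docs)
    # n_w: in how many documents each word occurs at least once
    doc_freq = Counter(w for doc in docs for w in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)   # number of occurrences of each word in the text
        total = len(doc)        # total number of words in the text
        weights.append({
            w: (k / total) * math.log(D / doc_freq[w])
            for w, k in counts.items()
        })
    return weights

# Hypothetical toy collection of three tokenized documents
docs = [["good", "film", "good"], ["bad", "film"], ["good", "plot"]]
weights = tf_idf(docs)
```

A word that occurs in every document receives a weight of zero under this form, which is why very common words contribute little to the comparison of documents.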

Further on, to find the similarity of documents (terms) in terms of their relation to the same topic, different metrics can be applied. The most appropriate metric is the cosine of the angle between the vectors:

$$\cos(x, y) = \frac{x \cdot y}{\|x\| \, \|y\|}$$

where $x \cdot y$ is the scalar product of the vectors, and $\|x\|$, $\|y\|$ are their norms.
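The cosine measure can be sketched in a few lines of plain Python. The function name and the treatment of zero vectors (returning 0.0) are choices made for this illustration only.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(x, y))      # scalar product of x and y
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_x * norm_y)

# Vectors pointing in the same direction versus orthogonal vectors
same_direction = cosine_similarity([1, 2], [2, 4])
orthogonal = cosine_similarity([1, 0], [0, 1])
```

Because the measure depends only on direction, two documents of very different length but with the same proportions of terms are treated as maximally similar.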

That is why LSA tends to prevent multiple occurrences of a word in different topics and thus cannot be used effectively to resolve polysemy issues (Lim#4).

Perplexity is a measurement of how well a probability model predicts a sample. Low perplexity indicates that the probability distribution is good at predicting the sample.
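As an illustration of this definition, the sketch below assumes the common form of perplexity: the exponential of the negative average log-probability that the model assigns to each token of a held-out sample. The function name and the probabilities are hypothetical, and the paper's own formula and symbols are not reproduced here.

```python
import math

def perplexity(token_probs):
    """exp(-(1/N) * sum(log p_i)) over the N tokens of a sample."""
    if not token_probs:
        raise ValueError("sample must contain at least one token")
    if any(p <= 0 for p in token_probs):
        raise ValueError("probabilities must be positive")
    avg_log_prob = sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(-avg_log_prob)

# A model that assigns higher probabilities to the observed tokens
# obtains the lower perplexity.
confident = perplexity([0.5, 0.5, 0.5, 0.5])
uncertain = perplexity([0.1, 0.1, 0.1, 0.1])
```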

Unsupervised machine learning methods avoid dependence on training data. For these methods to work, one also needs a Corpus of documents, but the preliminary markup is [...]

Within the expert approach, the dictionary is compiled by experts. On the one hand, this approach is laborious and carries a high probability that domain-specific words will be absent from the dictionary; on the other hand, it yields a high-quality dictionary in terms of the adequacy of the assigned tonality.

In the dictionary approach, an initial small list of evaluation words is expanded using various dictionaries, for example, explanatory dictionaries or dictionaries of synonyms and antonyms. However, this approach also does not take the subject area into account.

In the approach based on text collections, a Dictionary is compiled through statistical analysis of labelled texts that, as a rule, belong to the subject domain in question.

In [18], a dictionary of emotional vocabulary, compiled manually by experts, was used to determine the tone of individual words. In this dictionary, each word and phrase is associated with the orientation of its tonality (positive/negative) and with its strength (in points).

The author's methods proposed in [25,26] are based on the dictionary approach: to determine the tonality of texts, a dictionary of evaluation words is used, in which each word has a numerical weight that determines the degree of the word's significance. In the method of working with the dictionary that is closest to the paper [27], two points need to be considered: firstly, the dictionary is created based on a statistical analysis of a training collection; secondly, the weights of the words are determined with the help of a genetic algorithm.

In most studies, the tone of the text is determined by calculating the weights of the appraisal words included in it:

where $W_T^C$ is the weight of text $T$ for tonality $C$; $w_i$ is the weight of the evaluated word $i$; [...] is the number of estimated bigrams of tonality $C$ in the text $T$.

Texts are classified according to the linear function:

If the value of the function $f$ is greater than zero, the text is positive; otherwise, it is negative.
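Neither the weight formula nor the linear function survives in this passage, so the following is only a sketch under two stated assumptions: the weight of a text for a tonality is the sum of the weights of the dictionary words found in it, and $f$ is the difference between the positive and the negative weight. The dictionaries, weights and function names are hypothetical.

```python
def tonality_weight(tokens, dictionary):
    """Sum the weights of the dictionary words that occur in the text."""
    return sum(dictionary.get(tok, 0.0) for tok in tokens)

def classify(tokens, positive_dict, negative_dict):
    """Assumed linear rule: f = W_positive - W_negative; f > 0 means positive."""
    f = tonality_weight(tokens, positive_dict) - tonality_weight(tokens, negative_dict)
    return ("positive" if f > 0 else "negative"), f

# Hypothetical dictionaries with weights in points
positive_dict = {"great": 2.0, "good": 1.0}
negative_dict = {"boring": 1.5, "bad": 1.0}

label, f = classify("a great but boring film".split(), positive_dict, negative_dict)
```

Under this rule a text with no dictionary words gives $f = 0$ and is therefore labelled negative, which follows the "otherwise" branch stated in the text.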

In this paper the following author's definitions will be used:

1. Term is a basic unit of discrete data.

• the need to express the author's own opinion in this type of document makes it possible to carry out a qualitative evaluation of the Sentiment Classification results on a guaranteed relevant dataset.

The choice of this type of document is, at the same time, considered a limitation on the scope of our research findings.

In this regard, the following scientific research questions (RQ) were raised:

As a dataset for demonstrating the basic workability of the methodology and evaluating the quality of its results, the Polish-language film reviews dataset from filmweb.pl was used.

Based on the maximum probability of matching the obtained Latent Probabilistic Topics to the CF, the Semantic (topical) clustering of the CF can be performed in this step. The results of this process for the training dataset are presented in Table 3.
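Assignment by maximum probability can be sketched as follows. The probability values and identifiers are hypothetical; in practice the per-topic probabilities would come from the fitted topic model.

```python
from collections import defaultdict

def cluster_by_max_probability(topic_probs):
    """Group item ids by the topic with the highest probability.

    topic_probs maps an item id to a list of per-topic probabilities.
    Ties are resolved in favour of the lowest topic index.
    """
    clusters = defaultdict(list)
    for cf_id, probs in topic_probs.items():
        best_topic = max(range(len(probs)), key=probs.__getitem__)
        clusters[best_topic].append(cf_id)
    return dict(clusters)

# Hypothetical probabilities of three topics for three items
topic_probs = {
    "cf1": [0.7, 0.2, 0.1],
    "cf2": [0.1, 0.3, 0.6],
    "cf3": [0.5, 0.4, 0.1],
}
clusters = cluster_by_max_probability(topic_probs)
```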

Step III. Identifying the Hidden Semantic Connection within the Documents

Mathematically, the reduced model, as the instrument for preliminary identification of LSR presence, is obtained by multiplying the results of the SVD transformation truncated to the chosen dimension k. A fragment of the Reduced model obtained in this step for the training dataset is presented in Table 5.

By comparing the red numbers in Table 5 with the zero values in the same positions of [...]

The results of applying the Rules of Adjustments to the training dataset are presented in Table 11.

[...] 10-point scale). A film review is assigned to SPCS if the reviewer's subjective assessment is more than 5 points, and to SNCS if it is equal to or less than 5 points.

During the methodology verification, a test dataset of 5000 Polish-language film reviews (2500 SPCS and 2500 SNCS) was analyzed. As a result, the two-level Contextual Hierarchical Structure of Topics (CHST) was defined ([...]

where $a$ is the number of documents related to category $S$ (positive, negative) and containing this bigram, and $b$ is the number of documents not related to category $S$ and containing this bigram as well.
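The two counts $a$ and $b$ can be obtained directly from a labelled collection, as in the sketch below. How the paper combines them into a score is not recoverable from this passage, so only the counting is shown; the function names and the toy reviews are hypothetical.

```python
def bigrams(tokens):
    """Return the set of adjacent word pairs in a token list."""
    return set(zip(tokens, tokens[1:]))

def count_a_b(bigram, labelled_docs, category):
    """Count documents containing `bigram` inside (a) and outside (b) `category`.

    labelled_docs is a list of (tokens, label) pairs.
    """
    a = b = 0
    for tokens, label in labelled_docs:
        if bigram in bigrams(tokens):
            if label == category:
                a += 1
            else:
                b += 1
    return a, b

# Hypothetical labelled reviews
docs = [
    ("very good film".split(), "positive"),
    ("not very good".split(), "negative"),
    ("very good plot and very good cast".split(), "positive"),
    ("bad film".split(), "negative"),
]
a, b = count_a_b(("very", "good"), docs, "positive")
```

Each document is counted once regardless of how many times the bigram occurs in it, which matches the definition of $a$ and $b$ as numbers of documents.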

The purpose of this layer is to evaluate the adequacy and prove the effectiveness of the hierarchical approach in improving the accuracy of the Sentiment Classification process.

The main tasks of this layer are:

In line with Assumption 2 of this study, scanning and recognition of topics for One- and Two-Level Classification are performed by paragraphs (elements of the document) [35].

To implement Procedure 3 (with the deepest Topics Identification process), the following algorithm was developed:

Step VII.1. This step scans the Adopted Training Sample texts and identifies the topics on the 2nd Level of the CHST for each Review paragraph. It is implemented by adding each Topic (Contextual Dictionary elements) from the CHST to the Training Sample as one of its Paragraphs and then using the LSA method to find the paragraphs that have Latent Semantic Relations with it.

Step VII.2. This step scans the part of the Training Sample for which Topics on the 2nd Level were identified, with the aim of finding the bigrams from the 2nd level of the CBSD that correspond to the Topic identified for each Paragraph.

Step V. For paragraphs for which topics were not defined in steps VII.1 and VII.3, this step searches for the bigrams from the 0th level of the CBSD.
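The lookup with a fallback to the 0th level can be sketched as follows. The layout of the dictionary (a `level2` mapping from topic to bigrams and a flat `level0` mapping) is an assumption made for this illustration and is not the paper's data structure; the entries are hypothetical.

```python
def find_sentiment_bigrams(tokens, topic, cbsd):
    """Return the dictionary bigrams found in one paragraph.

    cbsd is a hypothetical layout:
        {"level2": {topic: {bigram: tonality}}, "level0": {bigram: tonality}}
    If no topic was identified for the paragraph (or the topic has no
    level-2 entries), the 0th level is searched instead.
    """
    if topic is not None and topic in cbsd["level2"]:
        entries = cbsd["level2"][topic]
    else:
        entries = cbsd["level0"]
    found = {}
    for pair in zip(tokens, tokens[1:]):
        if pair in entries:
            found[pair] = entries[pair]
    return found

# Hypothetical dictionary content
cbsd = {
    "level2": {"acting": {("great", "cast"): "positive"}},
    "level0": {("good", "film"): "positive", ("great", "cast"): "positive"},
}
with_topic = find_sentiment_bigrams("a great cast".split(), "acting", cbsd)
without_topic = find_sentiment_bigrams("a good film".split(), None, cbsd)
```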

The rules for determining the presence of the elements of the Sentiment Dictionaries and word-modifiers in the text are presented in Table 14.

The implementation of this step involves the use of the following rules:

To test and evaluate the adequacy of the Sentiment Classification based on the CBSD phase, the following test datasets were used: for the first layer (CBSD Creation Algorithm), 5000 Polish-language film reviews (2500 TSP and 2500 TSN); for the second layer (Sentiment Classification Algorithm), 3000 Polish-language film reviews (1500 SPCS and 1500 SNCS) from filmweb.pl.

A film review is assigned to SPCS if the reviewer's subjective assessment is more than 5 points, and to SNCS if it is equal to or less than 5 points.

[...] Corpora-based Sentiment Dictionary was created (Table 17). [...] (Table 18). [...] (Table 19).