Topic-Based Document-Level Sentiment Analysis Using Contextual Cues

Abstract: Document-level Sentiment Analysis is a complex task that implies the analysis of large textual content that can incorporate multiple contradictory polarities at the phrase and word levels. Most of the current approaches either represent textual data using pre-trained word embeddings without considering the local context that can be extracted from the dataset, or they detect the overall topic polarity without considering both the local and global context. In this paper, we propose a novel document-topic embedding model, DOCTOPIC2VEC, for document-level polarity detection in large texts by employing general and specific contextual cues obtained through the use of document embeddings (DOC2VEC) and Topic Modeling. In our approach, (1) we use a large dataset with game reviews to create different word embeddings by applying WORD2VEC, FASTTEXT, and GLOVE; (2) we create DOC2VECs enriched with the local context given by the word embeddings for each review; (3) we construct topic embeddings TOPIC2VECs using three Topic Modeling algorithms, i.e., LDA, NMF, and LSI, to enhance the global context of the Sentiment Analysis task; (4) for each document and its dominant topic, we build the new DOCTOPIC2VEC by concatenating the DOC2VEC with the TOPIC2VEC created with the same word embedding. We also design six new Convolutional-based (Bidirectional) Recurrent Deep Neural Network Architectures that show promising results for this task. The proposed DOCTOPIC2VECs are used to benchmark multiple Machine and Deep Learning models, i.e., a Logistic Regression model, used as a baseline, and 18 Deep Neural Network Architectures. The experimental results show that the new embeddings and the new Deep Neural Network Architectures achieve better results than the baseline, i.e., DOC2VEC-based Logistic Regression.


Introduction
Opinion Mining and Sentiment Analysis are related research topics, at the intersection of Machine Learning and Natural Language Processing, that have recently been studied intensively [1][2][3][4][5][6]. The interest in these related topics is due to the wide range of applications where they can be used (e.g., advertising, politics, business, etc.) and the availability of large amounts of textual data. They are generally used to identify opinions and recognize the sentiments expressed, as well as the general polarity of a text, e.g., subjective or objective, positive or negative. The data sources that are mostly used in Opinion and Sentiment Analysis tasks are blogs, posts from social media, and comments from movie and product review sites or news articles [7]. These can be used to complete different tasks, such as emotion detection and sentiment classification.
Various types of neural networks have been employed to solve specific Opinion and Sentiment Analysis tasks more accurately, e.g., Recurrent Neural Networks (RNNs).

A DOC2VEC is constructed as the average of the WORDEMBs for the terms in the document. This embedding manages to preserve the contexts and semantics of words at the document level [11]. WORDEMBs add semantic context by encoding the position of words in a sentence before vectorizing the text. We use five WORDEMBs: (1) the WORD2VEC CBOW (Continuous Bag-of-Words) model; (2) the WORD2VEC SKIP-GRAM model; (3) the FASTTEXT CBOW model; (4) the FASTTEXT SKIP-GRAM model; and (5) the GLOVE model. WORD2VEC captures the context of a word in a document and its relationship with the words surrounding it. Furthermore, this embedding manages to encode the semantic and syntactic similarity of the words within the document. WORD2VEC uses two models to determine the local context: CBOW and SKIP-GRAM. The CBOW model predicts a word from its individual context by taking into account the context of all the words within the corpus. The SKIP-GRAM model takes a word and determines the words that are in the same context. FASTTEXT extends WORD2VEC by learning embedding vectors for the n-grams found within each word. FASTTEXT also uses the CBOW and SKIP-GRAM models. GLOVE enhances the local context information of words using global statistics, i.e., word co-occurrence.
We use Topic Modeling to extract the hidden latent semantic patterns and to add a general context to Sentiment Analysis by detecting and grouping documents with similar characteristics by subjects of interest. We employ different Topic Modeling algorithms, i.e., Latent Dirichlet Allocation (LDA) [12], Non-Negative Matrix Factorization (NMF) [13], and Latent Semantic Indexing (LSI) [14]. We encode these hidden patterns that add a general context to Sentiment Analysis into TOPIC2VECs. TOPIC2VECs are built as the average of the WORDEMBs of each topic's top-k terms, weighted by the terms' relevance. By employing TOPIC2VECs, we manage to encode context-based document grouping and to enhance each document's context by constructing the DOCTOPIC2VEC for the dominant topic as a concatenation between each document's DOC2VEC and TOPIC2VEC. Thus, documents that are similar in meaning and context, including polarity and opinion, will be closer to each other in the vector space than texts which are not necessarily related.
For the experiments, we use a large dataset consisting of game reviews. We create the DOCTOPIC2VECs using the discussed WORDEMBs and Topic Modeling algorithms. Each DOCTOPIC2VEC embedding is used in classification tasks that apply Logistic Regression (LOGREG) and neural networks with LSTM, GRU, Bidirectional, DENSE, and CNN layers. We also design six new Convolutional-based (Bidirectional) Recurrent Deep Neural Network (CNN-(BI)RNN) Architectures for the task of determining accurate document-level polarity. The results of our benchmark show that the accuracy is improved by about 5% when adding DOC2VEC contextual cues with the NMF and LSI Topic Modeling algorithms, compared to the baseline, i.e., DOC2VEC-based LOGREG Sentiment Analysis. Furthermore, the proposed new architectures outperform the state-of-the-art solution proposed in [3].
The main research questions we want to answer are: (Q 1 ) Does a Topic Modeling approach improve the overall accuracy of detecting the polarity of textual data? (Q 2 ) Can local context added by Word Embeddings and global context added by Topic Modeling improve the accuracy of the Sentiment Analysis task? (Q 3 ) Can a novel CNN-(BI)RNN architecture prove to be a better model for the Sentiment Analysis task?
Thus, by answering these questions, the main objective of this work is three-fold: (O 1 ) Analyze the impact of Topic Modeling on the Sentiment Analysis task; (O 2 ) Construct a novel embedding DOCTOPIC2VEC that encapsulates both local and global context in order to improve the accuracy of detecting the polarity of textual data; (O 3 ) Build a novel CNN-(BI)RNN architecture to increase the accuracy of the Sentiment Analysis task.
This paper is structured as follows. In Section 2, we discuss the current advancements in Sentiment Analysis techniques. Section 3 presents the proposed architecture and describes each component module, together with the algorithms and techniques used. In Section 4, we describe the dataset and our set of experiments, and then analyze and interpret the results. Section 5 draws the final conclusions and provides several future directions.

Related Work
Sentiment Analysis approaches can be classified into three categories: Machine Learning, Lexicon-based, and Hybrid [15]. Furthermore, these techniques are divided, based on the granularity level, into word (or aspect), sentence (or short text), and document (or long text) level.
There are not many solutions focusing on context-based Sentiment Analysis models. A context enrichment model for Sentiment Analysis is proposed in [4]. The authors add several processing steps, prior to sentiment classification, in order to augment the dataset with context. One important step discussed there is prior-polarity identification with SentiWordNet. Unfortunately, the authors do not clearly specify what the advantages of prior-polarity identification are, and their model is only conceptual, without any real experiments.
Most of the related previous works primarily use either only embeddings as text representation that are incorporated into the Sentiment Analysis model (e.g., [2,3]) or they consider Topic Modeling for determining the opinion by topic, and not to add context to the model (e.g., [16,17]).
In [3], the authors propose a Deep Learning 4CNN-BILSTM model for document-level Sentiment Analysis. Their model consists of four CNN layers and one BILSTM layer.
For the experiments, they use a relatively small number of documents, i.e., 2003 articles from French newspapers. They employ two optimizers, SGD and Adam, and WORD2VEC as the WORDEMB solution. The proposed model is compared with CNN, LSTM, BILSTM, and CNN-LSTM models, and the authors conclude that it achieves the best accuracy. Although they obtained a high accuracy for the 4CNN-BILSTM model, the results are not conclusive, as the experiments are performed on a small dataset. In our experiments, we also analyze their model, both in the version proposed by them and with added Topic Modeling.
Attention mechanisms condition the Sentiment Analysis model to pay attention to the features which contribute the most to the task. The authors of [18] propose a model based on LSTM layers with an attention mechanism. They use different approaches for the attention mechanism, i.e., convolution-based and pooling-based attention mechanisms, and for the word vectors used for training, i.e., pre-trained word vectors from WORD2VEC and randomly initialized word vectors. Their model obtained better results than baseline methods on two out of three datasets. The Attention-based Bidirectional CNN-RNN Deep Model (ABCDM) [19], another attention-based solution, uses independent BILSTM and GRU layers to extract both past and future contexts and an attention mechanism to put more or less emphasis on different words. To reduce the dimensionality and create new feature representations, the ABCDM model utilizes both convolutional layers and pooling techniques. This model achieves state-of-the-art performance when compared with other Neural Network architectures for the task of Sentiment Analysis on review and Twitter datasets.
An improved method for generating WORDEMBs used in Sentiment analysis is proposed in [2]. This method, Improved Word Vectors, uses Part-of-Speech, lexicon-based, and word position techniques together with WORD2VEC or GLOVE models. The performance of the proposed solution is tested using four different Deep Learning models and benchmark sentiment datasets. The results show that when using these embeddings, the accuracy of the model is slightly increased.
One solution that uses Topic Modeling for sentiment detection is presented in [17]. The authors combine shrinkage regression and Topic Modeling for detecting polarity in a Twitter dataset. The proposed model consists of two stages. In the first stage, they detect the polarity of the tweets using two shrinkage regression models. This type of regression adds a penalty in the way the loss function is calculated for models that have too many variables. During the second stage, the relevant topics are identified using LDA. The model estimates the sentiment of each topic using term sentiment scores.
Topic Modeling and WORDEMBs have been used together to analyze the sentiment of topics. However, they have never been applied to Sentiment Analysis at the document level, as we propose in this paper. This approach is used for aspect-based topic Sentiment Analysis [20,21], where Topic Modeling is used for aspect extraction and categorization without considering the global context. In [21], the authors combine domain-trained WORDEMBs and Topic Modeling for categorizing aspect-terms from online reviews. Their proposed model uses continuous WORDEMBs and the LDA algorithm. The model is tested using a small dataset, i.e., the restaurant reviews from the SemEval-2014 dataset consisting of 3841 sentences. One important limitation of their model is that it has a longer convergence time than the standard model and lower performance than supervised models. Several recent works also explore pre-trained language models for the Sentiment Analysis task, e.g., BERT [22], RoBERTa [23], ALBERT [24]. In [25], BERT is compared with an LSTM-based architecture and achieves an overall better f-measure. In [26], a RoBERTa Sentiment Analysis model is combined with key entity detection, based on the presumption that people are more prone to observe negative information. This approach improves the accuracy of the Sentiment Analysis task when compared with architectures consisting of BERT or RoBERTa transformers combined with SVM, LR, or NBM. DICET [1] is another transformer-based method for Sentiment Analysis. The novelty of DICET is that it enhances the data quality by handling noise within contexts. For this, it uses six types of embeddings, i.e., character embeddings, GLOVE, Part-of-Speech embeddings, Lexicon embeddings, ELMo [27], and BERT-based embeddings. The concatenated embeddings are fed to a BILSTM network with attention.
DICET has higher performance compared with Sentiment Analysis methods that use a standard single type of embedding, e.g., GLOVE or WORD2VEC, or other preprocessing methods, e.g., TFIDF.

The Data Preprocessing module cleans and transforms the textual data to make them suitable for analysis. The Word Embedding and TFIDF Vectorization modules encode the documents' words into vector representations. The Document Embedding module computes a vector for each document based on the Word Embedding. The Topic Modeling module uses the TFIDF document vectorization to extract the topics and their most relevant keywords. The Topic Embedding module constructs the vector representation of topics using the word embeddings. The Document-Topic Embedding module computes the new context-enhanced document embeddings using the topic and document embeddings, which add both semantic and syntactic context to the vector representation. The Classification module uses the new document-topic embeddings to classify documents and extract their polarity. The Evaluation module uses different metrics to determine the accuracy of the classification and the quality of the resulting models.

Data Preprocessing Module
The preprocessing step is important because text written by people can contain misspelled words, symbols, abbreviations, etc. that need to be removed or replaced to facilitate the execution of the subsequent tasks with greater accuracy [28]. The initial text is preprocessed using the following steps: (1) The text is cleaned by removing all JavaScript functions, HTML tags, and URLs; (2) The contractions are expanded; (3) The named entities are extracted while the rest of the text is lemmatized; (4) The punctuation and stop words, excluding negations (i.e., no, not, etc.), are removed; (5) The text is transformed to lowercase and then split into tokens; (6) Only the tokens that have a length greater than 3 or are negations are kept. Using this aggressive text preprocessing improves the algorithms' time performance, as the vocabulary is minimized to the essential tokens without excluding the terms which impact the polarity.
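The steps above can be sketched as follows (a regex-based sketch: the named-entity extraction and lemmatization of step (3) require an NLP library and are omitted here, and the contraction and stop-word tables are illustrative, not the ones used in the paper):

```python
import re

NEGATIONS = {"no", "not", "never"}
STOP_WORDS = {"the", "a", "an", "is", "it", "and", "or", "of", "to", "this"}
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "it's": "it is"}

def preprocess(text):
    # (1) strip HTML tags and URLs
    text = re.sub(r"<[^>]+>|https?://\S+", " ", text)
    # (2) expand contractions (toy table; a real pipeline uses a full map)
    for short, expanded in CONTRACTIONS.items():
        text = re.sub(re.escape(short), expanded, text, flags=re.IGNORECASE)
    # (4)-(5) lowercase, drop punctuation, split into tokens
    tokens = re.findall(r"[a-z]+", text.lower())
    # (4), (6) drop stop words; keep negations and tokens longer than 3 characters
    return [t for t in tokens
            if t in NEGATIONS or (t not in STOP_WORDS and len(t) > 3)]

print(preprocess("It's <b>not</b> a boring game! See https://example.com"))
# ['not', 'boring', 'game']
```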

Word Embedding Module
The word embedding models used in this paper are WORD2VEC, FASTTEXT, and GLOVE. Each embedding model (WORDEMB) generates word representations in a vector space. The context of each word within a document is captured when employing these embeddings. Moreover, these models also encode both the relationship and the similarity between words from a semantic and syntactic perspective.
WORD2VEC represents a textual dataset as a set of vectors and outputs a vector space [29]. The context similarity of a word within the dataset is determined by measuring the distance between the corresponding vectors in this space. WORD2VEC uses either the Continuous Bag-Of-Words (CBOW) or the SKIP-GRAM model to create the representation of words.
The CBOW model utilizes the context of a word as input and attempts to predict the word itself. The input layer of the model is represented by the one-hot encoded vectors corresponding to each context word. The average of the vectors from this layer is used to compute the input for the hidden layer. The hidden layer sends the weighted sum of its inputs to the next layer. Each term's probability value is computed by the network's last layer and is given as a final result in the form of a vector.
The SKIP-GRAM model, as opposed to CBOW, starts with the word as input and tries to generate its context. The input layer is the target word vector, while the output layer consists of the vectors with the probability values of the words appearing in the context of the target word. The hidden layer sends the weighted input to the following layer. The SKIP-GRAM model is generally used to discover the semantic similarity between words. Therefore, if two words have a similar context, these words might also have similar semantics.
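The difference between the two models can be illustrated by the training pairs they generate (a sketch only; real implementations additionally apply subsampling and negative sampling):

```python
def skipgram_pairs(tokens, window=2):
    """SKIP-GRAM: (target word, context word) training pairs."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs += [(target, tokens[j]) for j in range(lo, hi) if j != i]
    return pairs

def cbow_pairs(tokens, window=2):
    """CBOW: (context words, target word) training pairs."""
    out = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        out.append(([tokens[j] for j in range(lo, hi) if j != i], target))
    return out

print(skipgram_pairs(["fun", "racing", "game"], window=1))
# [('fun', 'racing'), ('racing', 'fun'), ('racing', 'game'), ('game', 'racing')]
```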

FASTTEXT
FASTTEXT is an unsupervised algorithm that uses the CBOW and SKIP-GRAM models for learning word embeddings [30]. This embedding is considered an extension of WORD2VEC as it follows a similar approach [31]. The difference is that the word is not considered the basic unit, but a bag of character n-grams. This facilitates better accuracy and a faster training time compared to WORD2VEC.

GLOVE
GLOVE (Global Vectors) is an unsupervised model applied for learning word embeddings [32]. In comparison to the other models described, i.e., WORD2VEC and FASTTEXT, GLOVE considers both local and global statistics of word-word co-occurrences in the corpus to obtain the vector representations of the words. It uses a term co-occurrence matrix that stores, for each word, the frequency of its appearance in the same context as another word. GLOVE captures the relationship between words by using the ratio of co-occurrence probabilities. Using this ratio, it extracts information from all the word vectors and identifies word analogies or synonyms within the same contexts.

Document Embedding Module
The document embeddings DOC2VEC (Equation (1)) are generated for each document d i in the dataset by adding the word embeddings WORDEMB(t) for all the terms t in the document and dividing the sum by the number of terms in the document (m i ). We build a DOC2VEC for each WORDEMB we previously discussed.
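As a sketch, Equation (1) amounts to averaging a document's word vectors (here wordemb is a hypothetical token-to-vector lookup, not a trained model):

```python
import numpy as np

def doc2vec(tokens, wordemb):
    """DOC2VEC(d i): the mean of WORDEMB(t) over the m i terms t of the document."""
    vecs = [wordemb[t] for t in tokens if t in wordemb]  # skip out-of-vocabulary terms
    return np.mean(vecs, axis=0)

# toy 2-dimensional word embeddings (hypothetical vectors)
wordemb = {"fun": np.array([1.0, 0.0]), "game": np.array([0.0, 1.0])}
print(doc2vec(["fun", "game"], wordemb))   # [0.5 0.5]
```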

TFIDF Vectorization
The TFIDF (term frequency-inverse document frequency) Vectorization module uses a bag-of-words approach to vectorize the documents given: (1) the vocabulary V = {t 1 , . . . , t m } of size m = ||V|| that contains the unique words or terms t j in the dataset D; (2) the multiplicity (co-occurrence) function which denotes the number of times t j appears in document d i . The term frequency TF (Equation (2)) is defined using this multiplicity function, while the inverse document frequency IDF (Equation (4)) uses the number of documents n j where a term t j ∈ V appears to penalize frequent terms that bring no information gain. Using the term weights, we can construct a document-term matrix A = {w ij | i = 1, n ∧ j = 1, m}, where rows correspond to documents and columns to terms. The cell value w ij is the weight (e.g., TF, TFIDF, etc.) of term t j in document d i .
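As an illustration, the document-term matrix A can be built as follows (this sketch uses one common TF/IDF variant, counts normalized by document length with a logarithmic IDF; the paper's exact formulas are Equations (2)-(4)):

```python
import math

def tfidf_matrix(docs):
    """Build a document-term matrix A with w_ij = TF(t_j, d_i) * IDF(t_j)."""
    vocab = sorted({t for d in docs for t in d})
    n = len(docs)
    # n_j: number of documents in which each term t_j appears
    df = {t: sum(t in d for d in docs) for t in vocab}
    A = []
    for d in docs:
        row = []
        for t in vocab:
            tf = d.count(t) / len(d)          # term frequency in the document
            idf = math.log(n / df[t])         # penalizes terms present everywhere
            row.append(tf * idf)
        A.append(row)
    return vocab, A

vocab, A = tfidf_matrix([["fun", "game"], ["boring", "game"]])
print(vocab)   # ['boring', 'fun', 'game']; 'game' gets weight 0 in every row
```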

Topic Modeling Module
This module utilizes statistical unsupervised Machine Learning methods, i.e., Topic Modeling, to extract hidden latent semantic patterns within our dataset. We use three models for this module, i.e., Latent Dirichlet Allocation (LDA) [12], Non-Negative Matrix Factorization (NMF) [13], and Latent Semantic Indexing (LSI) [14], the latter also known as Latent Semantic Analysis (LSA). The Topic Modeling algorithms use the document-term matrix A as input.

Latent Dirichlet Allocation
Latent Dirichlet Allocation (LDA) is a probabilistic model that groups various terms with similar meanings that represent the same notions [12]. It is one of the most popular Topic Modeling approaches [33]. The LDA algorithm relies on the assumption that random mixtures over latent topics can be used to generate documents. In this context, each topic is described by a multinomial distribution over the unique terms in the vocabulary. Thus, we can generate documents using techniques such as Gibbs sampling that draw <topic, word> pairs from a random mixture.
For k topics and a corpus of n documents D = {d i | i = 1, n}, where each document d i is a sequence of m i words t j ∈ V (j = 1, m) with document lengths modeled as Poisson distributions, i.e., m i ∼ Poisson(ξ), LDA uses the following process: (1) Determine a distribution of topics θ i for each document d i ; (2) Determine a distribution of words ϕ κ for each topic κ ∈ 1, k; (3) For each word t j in document d i , sample a topic and then the word itself. The distribution of words in topic κ is a Dirichlet distribution over the vocabulary, where ϕ κ = (ϕ κ1 , . . . , ϕ κm ) is an m-dimensional vector of probabilities, ϕ κj is the probability of word t j occurring in topic κ, and β = {β 1 , β 2 , . . . , β m } is an m-dimensional vector of positive reals β j > 0.
For each document d i (i = 1, n), we define the topic assignment z iκ described by a set of words t κj (j = 1, m) of size m i . Both z iκ and t κj follow multinomial distributions, i.e., z iκ ∼ Multinomial k (θ i ) and t κj ∼ Multinomial m (ϕ z iκ ).
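As an illustration, LDA can be fitted on a toy document-term matrix (this sketch assumes scikit-learn's LatentDirichletAllocation; the paper does not state which implementation it uses, and the counts below are hypothetical):

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical tiny document-term matrix (4 documents, 4 terms); the paper
# feeds the TFIDF matrix A from the Vectorization module and uses k = 10 topics.
A = np.array([[3, 0, 1, 0],
              [2, 0, 2, 0],
              [0, 4, 0, 1],
              [0, 3, 0, 2]])
lda = LatentDirichletAllocation(n_components=2, random_state=0)
theta = lda.fit_transform(A)   # theta_i: per-document topic distributions (rows sum to 1)
phi = lda.components_ / lda.components_.sum(axis=1, keepdims=True)  # phi_k: per-topic word distributions
print(theta.shape, phi.shape)  # (4, 2) (2, 4)
```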

Non-Negative Matrix Factorization
Non-Negative Matrix Factorization (NMF) is a dimensionality reduction paradigm based on linear algebra [34]. Experimental results show that NMF is a strong choice for extracting topics [35]. It is constructed on the premise that a matrix can be created as a product of two non-negative matrices. Thus, NMF factorizes a matrix A ∈ R n×m into two non-negative matrices W ∈ R n×k and H ∈ R k×m . With regard to Topic Modeling, these matrices have the following meaning: (1) A is a document-term matrix constructed using weighted term frequencies for a corpus containing n documents and a vocabulary of m terms; (2) W is the document-topic matrix that assigns to each document a membership degree for each of the k topics; (3) H is the topic-term matrix that assigns to each of the k topics the importance of each term.
To determine W and H, the objective function F(W, H) must be minimized by respecting the constraint that all the elements of W and H are non-negative. Equation (6) presents the objective function, where || · || F is the Frobenius norm.
To minimize the objective function, the values of W and H are updated iteratively (with τ the index of the iteration) until they stabilize (Equations (7) and (8)).
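The iterative scheme can be sketched with the classic Lee-Seung multiplicative update rules for the Frobenius objective (presumably what Equations (7) and (8) express; the small ε is our addition to guard against division by zero):

```python
import numpy as np

def nmf(A, k, iters=200, eps=1e-9):
    """Minimize ||A - WH||_F under W, H >= 0 with multiplicative updates."""
    rng = np.random.default_rng(0)
    n, m = A.shape
    W = rng.random((n, k))
    H = rng.random((k, m))
    for _ in range(iters):
        # multiplicative updates keep every entry of W and H non-negative
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H

A = np.array([[1.0, 2.0], [2.0, 4.0]])   # rank-1 toy document-term matrix
W, H = nmf(A, k=1)
print(np.linalg.norm(A - W @ H) < 0.1)   # A is recovered almost exactly
```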

Latent Semantic Indexing
Latent Semantic Indexing (LSI) tries to solve the problem of synonyms by identifying terms that statistically appear together. The algorithm's main consideration is that the randomness of word choice within documents hides an underlying latent semantic structure. To determine this latent structure, LSI employs the matrix factorization technique called Singular Value Decomposition (SVD). It identifies syntactically different but semantically similar terms using a hidden "concept" space.
Given the document-term matrix A of size n × m (n is the number of documents, m is the number of terms in the vocabulary), LSI uses SVD to factorize A into a product of three matrices, i.e., A = UΣV T .
(1) U is an n × k matrix that denotes the document-topic association. The columns of U are the eigenvectors u of AA T corresponding to its k non-zero eigenvalues. Moreover, the vectors u are unit orthogonal vectors, i.e., U T U = I, and are also called left singular vectors.
(2) V T is a k × m matrix that denotes the topic-keyword association. The columns of V are the eigenvectors v of A T A corresponding to its k non-zero eigenvalues. Moreover, the vectors v are unit orthogonal vectors, i.e., V T V = I, and are also called right singular vectors.
(3) Σ = diag(σ 1 , σ 2 , . . . , σ k ) is the diagonal matrix of singular values, sorted in decreasing order from the highest value to the smallest one, i.e., σ 1 ≥ σ 2 ≥ · · · ≥ σ k > 0.
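The truncated decomposition above can be sketched directly with NumPy's SVD (a sketch; production LSI implementations use sparse, truncated solvers instead of the full decomposition):

```python
import numpy as np

def lsi(A, k):
    """Truncated SVD: keep the k largest singular values of A = U Sigma V^T."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)  # s is sorted in decreasing order
    return U[:, :k], s[:k], Vt[:k, :]  # document-topic, singular values, topic-term

A = np.arange(12, dtype=float).reshape(4, 3)  # toy 4-document, 3-term matrix (rank 2)
U, s, Vt = lsi(A, k=2)
print(U.shape, s.shape, Vt.shape)  # (4, 2) (2,) (2, 3)
```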

Topic Embedding Module
To encode the global context that is hidden in the latent semantic structures defined by the randomness of words, we employ a topic vector embedding TOPIC2VEC that encodes the keywords of the k topics extracted using one of the Topic Modeling algorithms. TOPIC2VEC is the average of the word embeddings WORDEMB of the relevant terms t belonging to the topic z i (i = 1, k), weighted by their probability distribution p(t|z i ) within the topic z i . Equation (9) presents the proposed encoding, where the number of keywords considered for a topic z i is n i . We build a TOPIC2VEC for each topic model and WORDEMB we previously discussed.
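Equation (9) can be sketched as follows (top_terms pairs each of the n i keywords with its p(t|z i); normalizing by the sum of the weights is our assumption about the exact form of the weighted average):

```python
import numpy as np

def topic2vec(top_terms, wordemb):
    """TOPIC2VEC(z i): WORDEMB(t) of the topic's top keywords, averaged
    with weights p(t|z i)."""
    vecs = np.array([wordemb[t] for t, _ in top_terms])
    probs = np.array([p for _, p in top_terms])
    return (probs[:, None] * vecs).sum(axis=0) / probs.sum()

# hypothetical 2-dimensional embeddings and topic-keyword probabilities
wordemb = {"race": np.array([1.0, 0.0]), "track": np.array([0.0, 1.0])}
print(topic2vec([("race", 0.6), ("track", 0.2)], wordemb))  # [0.75 0.25]
```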

Document-Topic Embedding Module
The document-topic embeddings DOCTOPIC2VEC (Equation (10)) are generated by concatenating (operator ⊕) the TOPIC2VEC of the most dominant topic of a document with the document's DOC2VEC. We build a DOCTOPIC2VEC for each TOPIC2VEC we previously discussed, using the same WORDEMB for both the DOC2VEC and the TOPIC2VEC. By concatenating the DOC2VEC with the TOPIC2VEC to obtain the DOCTOPIC2VEC, we manage to combine the local context given by the document embedding (DOC2VEC) with the global context given by the topic embedding (TOPIC2VEC).
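Equation (10) then reduces to selecting the dominant topic and concatenating the two vectors (a sketch; in our pipeline topic_weights would come from the Topic Modeling module and topic_vecs from the Topic Embedding module):

```python
import numpy as np

def doctopic2vec(doc_vec, topic_weights, topic_vecs):
    """DOCTOPIC2VEC = DOC2VEC (+) TOPIC2VEC of the document's dominant topic."""
    dominant = int(np.argmax(topic_weights))          # most dominant topic
    return np.concatenate([doc_vec, topic_vecs[dominant]])

doc_vec = np.array([0.5, 0.5])
topic_vecs = [np.array([1.0, 0.0]), np.array([0.75, 0.25])]
print(doctopic2vec(doc_vec, [0.2, 0.8], topic_vecs))  # [0.5  0.5  0.75 0.25]
```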

Classification Module
For classification, we use the Logistic Regression (LOGREG) algorithm, which serves as a baseline, and multiple Deep Neural Network (DNN) Architectures.

Logistic Regression
Logistic Regression (LOGREG) is a classification algorithm successfully used, in many cases, as a baseline for the Sentiment Analysis task to predict the class in which an observation can be categorized [36,37]. The algorithm tries to minimize the error of the estimations made using the log-likelihood and to determine the parameters that produce the best estimations using gradient descent [38]. The log-likelihood function guarantees that the gradient descent algorithm can converge to the global minimum.

A Perceptron is a processing unit used to predict the label of an observation y = argmax y f (x, y) · w. The function f (x, y) is used to map all the possible feature representation <x, y> pairs to a new feature vector x and multiplies them by a weight vector w. The x vector must fulfill the following conditions: (1) it has a positive number of elements, and (2) its elements are real values.
GRU is a recurrent unit that has two gating mechanisms: (1) the update gate, and (2) the reset gate. The update gate is used as both the forget gate and the input gate. The reset gate determines what percentage of the previous hidden state contributes to the candidate state of the new step. Furthermore, the GRU has only one state component, i.e., the hidden state.
LSTM is a recurrent unit that uses in its design two components to represent its state: (1) the hidden state, given by a short-term memory component, and (2) the current cell state, achieved by the long-term memory component. The LSTM unit comprises a gating mechanism with three gates and a memory cell. The gating mechanism has the following gates: (1) the input gate, (2) the forget gate, and (3) the output gate. LSTM controls the gradients' values and avoids the problems of vanishing and exploding gradients by using the forget gate and the properties of the additive functions which compose the cell state gradients.
Bidirectional RNN (BIRNN) units allow for the use of information from both the previous and next state to make predictions about the current state. We use both BIGRU and BILSTM in our models.
Dense layers are regular deeply connected neural network layers that contain only PERCEPTRON units.
CNNs are Deep Neural Networks containing multiple convolutional hidden layers that apply a filter to the activation function. After a convolutional layer, it is customary to use a layer that employs a pooling mechanism. The pooling layer reduces the dimensions of the data returned by the convolutional layer. This reduction is achieved by combining the results of the previous layer into a single-layer neuron, whose output is then used as the input of the following layer. Considering these layers, we propose six new CNN-(BI)RNN architectures: CNN-BIGRU, CNN-3GRU, CNN-3BIGRU, CNN-BILSTM, CNN-3LSTM, and CNN-3BILSTM. When multiple recurrent layers are used, they form a stacked architecture.
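The simplest of these, CNN-BIGRU, can be sketched in Keras (a sketch only: the filter count, kernel size, GRU units, and the treatment of the DOCTOPIC2VEC as a one-channel sequence are our illustrative assumptions, not the paper's exact configuration):

```python
from tensorflow.keras import layers, models

def cnn_bigru(input_len=300, n_classes=3):
    """Sketch of a CNN-BIGRU architecture: a convolutional layer with
    pooling, followed by a bidirectional GRU and a softmax classifier."""
    model = models.Sequential([
        layers.Input(shape=(input_len, 1)),       # DOCTOPIC2VEC as a 1-channel sequence
        layers.Conv1D(64, 3, activation="relu"),  # convolution over the embedding
        layers.MaxPooling1D(2),                   # dimensionality reduction
        layers.Bidirectional(layers.GRU(64)),     # past and future context
        layers.Dense(n_classes, activation="softmax"),  # negative/neutral/positive
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

The stacked variants (e.g., CNN-3BIGRU) would repeat the recurrent layer three times with return_sequences=True on all but the last.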
Moreover, we implement the DNN Architecture presented in [3]. We use the same configurations for this DNN as presented in the original work. In the experiments, we name this architecture 4CNN-BILSTM.

Evaluation Module
Evaluation metrics are used to better understand the performance of a model and for fine-tuning the model on a given classification task. In our case, we are solving a multi-class classification problem where we are trying to determine the different polarities of a given text. Thus, we use the weighted accuracy measure for evaluating our models because it takes into account the distribution of classes within the dataset. The weighted accuracy ωA (Equation (11)) measures the per-class effectiveness of a classifier by employing the True/False Positive (TP i and FP i ) and True/False Negative (TN i and FN i ) rates. Given k classes y i (i = 1, k) and a dataset with n observations where n i observations are labeled with class y i , we can compute a weight ω i for each class y i using Equation (12).
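One possible implementation of this measure (a sketch using one common formulation, per-class accuracy weighted by ω i = n i / n; the paper's exact definition is given by Equations (11) and (12)):

```python
def weighted_accuracy(y_true, y_pred, classes):
    """Per-class accuracy (TP_i + TN_i) / n, averaged with weights n_i / n."""
    n = len(y_true)
    total = 0.0
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        tn = sum(t != c and p != c for t, p in zip(y_true, y_pred))
        n_c = sum(t == c for t in y_true)           # class size -> weight n_c / n
        total += (n_c / n) * ((tp + tn) / n)
    return total

print(weighted_accuracy([0, 1, 2, 2], [0, 1, 2, 1], [0, 1, 2]))  # 0.8125
```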

Dataset
For the experiments, we used a game reviews dataset containing textual data posted on the MetaCritic website (https://www.metacritic.com/, accessed on 28 September 2021). The original version of this dataset is presented in [39] and improved in [40]. From the dataset, we use only the reviews and the polarity assigned to each review, although the collected raw data also contain other information. The polarity was transformed from the initial string format (i.e., positive, neutral, negative) into integer format (i.e., 2, 1, 0). This dataset contains over 90,500 game reviews, each with an assigned polarity. After preprocessing, we were left with 90,165 reviews with the following class distribution: 15,721 negative, 22,433 neutral, and 52,011 positive. Out of the total number of 90,165 comments, 99.31% were in English, while 0.69% were in Spanish. As the number of comments in Spanish is negligible, we kept them to see if and how they impact our analysis. The vocabulary size is 23,016. The reviews contain from 1 to 1217 terms, with an average of 44.02. The reviews with a length between 1 and 50 words are the most common in the dataset, i.e., 66,713. The number of reviews with more than 100 words is 8538.
Experimentally, we have identified that the classification tasks perform better when the training and testing sets keep the proportions of the polarities of the entire dataset. For example, in a LOGREG classification experiment, if the data are split poorly, e.g., mostly positive reviews are used in the training dataset, the accuracy is lower than 55%. If equal proportions are created, based on three-quarters of the initial dataset, the accuracy improves up to 67%, whereas if the dataset is split using the initial proportions, this results in approximately 71% accuracy. Therefore, we conducted the classification experiments using 80% of the dataset for training and 20% for testing, i.e., 72,132 reviews for training and 18,033 for testing. We preserved the polarity distribution of reviews in both the training and testing subsets. Moreover, we identified that the better the data are cleaned, i.e., the fewer misspelled or foreign words are left in the dataset, the better the accuracy of the classification tasks, with an increase of up to 10% in accuracy compared to other data normalization methods.
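The stratified 80/20 split described above can be reproduced with scikit-learn's train_test_split (toy data; in our setting X would hold the DOCTOPIC2VECs and y the polarity labels):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)                 # 100 toy observations
y = np.array([0] * 20 + [1] * 30 + [2] * 50)      # class proportions 20/30/50

# stratify=y preserves the polarity distribution in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

print(np.bincount(y_te))   # [ 4  6 10]: the test set keeps the 20/30/50 ratio
```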

Word Embedding
To identify the best size for each WORDEMB, we tested various parameters and evaluated the resulting embeddings using three approaches: (1) Computing accuracy by identifying how well the model recognizes analogies; the test is performed using the questions-words dataset [41] that contains pairs of analogies from different domains; (2) Identifying the cosine similarity between words with positive and negative connotations that appear in the dataset, e.g., (fun, enjoyable), (boring, dull); and (3) Checking the words most similar to a common word in the dataset.
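The second evaluation approach can be sketched in a few lines of numpy; the 4-dimensional vectors below are hypothetical stand-ins for learned WORDEMB vectors.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity between two word vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical embeddings standing in for trained WORDEMB vectors.
vectors = {
    "fun":       np.array([0.9, 0.1, 0.2, 0.0]),
    "enjoyable": np.array([0.8, 0.2, 0.1, 0.1]),
    "boring":    np.array([-0.7, 0.3, 0.0, 0.2]),
}

# Words with the same connotation should score higher than opposing ones.
same = cosine_similarity(vectors["fun"], vectors["enjoyable"])
opposite = cosine_similarity(vectors["fun"], vectors["boring"])
print(same > opposite)  # True
```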
We determined experimentally, using a grid search, that (1) the best window size is four; (2) the best number of training epochs is 30; and (3) the best initial learning rate is 10^-2. Table 1 presents the final embedding sizes, determined after this evaluation, used for classification.

Document Embeddings
Using the five WORDEMBs, we construct a DOC2VEC for each review as an average of the WORDEMBs for the terms in the document. The size of the DOC2VEC is equal to the size of the WORDEMB used.
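This averaging step can be sketched as follows; the word vectors are hypothetical 4-dimensional placeholders, while in the paper the DOC2VEC dimension equals that of the actual WORDEMB.

```python
import numpy as np

# Hypothetical word embeddings (dimension 4 for illustration only).
word_emb = {
    "great": np.array([0.5, 0.1, 0.0, 0.3]),
    "game":  np.array([0.2, 0.4, 0.1, 0.0]),
    "story": np.array([0.0, 0.2, 0.6, 0.1]),
}

def doc2vec(tokens, embeddings):
    """DOC2VEC of a review: the average of the WORDEMBs of its known terms."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vectors, axis=0)

review = ["great", "game", "great", "story"]
vec = doc2vec(review, word_emb)
print(vec.shape)  # (4,) -- same dimension as the WORDEMB
```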

Topic Modeling
We identify 10 topics using the TFIDF document-term matrix as input together with the three Topic Modeling algorithms, i.e., LDA, NMF, and LSI. From each topic, the 15 most relevant features are used in the algorithm for computing the topic embeddings. The number of documents where a topic is the most relevant is presented in Table 2. Tables 3-5 present the results for LDA, NMF, and LSI, respectively.
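A minimal sketch of this step with scikit-learn, assuming a tiny toy corpus in place of the game reviews and 2 topics instead of 10; LSI is realized here via truncated SVD.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation, NMF, TruncatedSVD

# Tiny toy corpus standing in for the game reviews.
corpus = [
    "great racing game fast cars",
    "boring story terrible gameplay",
    "fun sports game great graphics",
    "awful racing controls bad cars",
]

# TF-IDF document-term matrix used as input for all three algorithms.
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(corpus)

n_topics = 2  # the paper uses 10 topics on the full dataset
models = {
    "LDA": LatentDirichletAllocation(n_components=n_topics, random_state=0),
    "NMF": NMF(n_components=n_topics, random_state=0),
    "LSI": TruncatedSVD(n_components=n_topics, random_state=0),  # LSI/LSA via SVD
}

for name, model in models.items():
    doc_topic = model.fit_transform(X)
    # One row per document, one column per topic.
    print(name, doc_topic.shape)
```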
Analyzing the results of Table 3, we observe that LDA extracts diverse topics that can be interpreted using the keywords, e.g., Topic 0 is related to racing games, while Topic 4 is related to sports games. Furthermore, LDA also manages to determine topics that capture hidden latent semantic patterns describing polarity, e.g., Topic 9 and Topic 7. We also note that LDA manages to detect and group together documents that have words in languages other than English, e.g., Topic 3, being the only algorithm among the three used in our analysis that picked up on this negligible percentage (0.69%) of comments.
As in the case of LDA, NMF (Table 4) manages to determine topics related to different game genres and polarity. Unlike LDA, NMF manages to discover topics that group together both polarity and game type, e.g., Topic 0, Topic 1, Topic 2. However, NMF fails to discover the comments written in a language other than English. Finally, LSI (Table 5) manages to determine topics related to the overall gameplay experience and users' opinions towards this aspect, e.g., Topic 0, Topic 2, Topic 5, Topic 6. Thus, most of the topics detected by LSI contain similar terms, e.g., game, play, together with terms that underline the polarity, e.g., good, awesome, beautiful, bad, fun, terrible.

Topic to Vector
For each topic determined by an algorithm, we build a TOPIC2VEC as the average of the WORDEMBs of the topic's relevant words, weighted by each word's importance within the topic. Thus, the size of the TOPIC2VEC is the same as the size of the used WORDEMB.
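The weighted average can be sketched as follows; both the word vectors and the (word, importance) pairs are hypothetical stand-ins for a Topic Modeling algorithm's output.

```python
import numpy as np

# Hypothetical WORDEMB vectors (dimension 4) for a topic's relevant words.
word_emb = {
    "racing": np.array([0.6, 0.1, 0.0, 0.2]),
    "cars":   np.array([0.5, 0.0, 0.1, 0.3]),
    "fast":   np.array([0.4, 0.2, 0.0, 0.1]),
}

# (word, importance) pairs as produced by a Topic Modeling algorithm.
topic = [("racing", 0.50), ("cars", 0.30), ("fast", 0.20)]

def topic2vec(topic_words, embeddings):
    """TOPIC2VEC: average of the relevant words' WORDEMBs weighted by importance."""
    weights = np.array([w for _, w in topic_words])
    vectors = np.array([embeddings[t] for t, _ in topic_words])
    return np.average(vectors, axis=0, weights=weights)

vec = topic2vec(topic, word_emb)
print(vec.shape)  # (4,) -- same dimension as the WORDEMB
```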

Document-Topic to Vector
A DOCTOPIC2VEC is created by concatenating the DOC2VEC with the TOPIC2VEC of the dominant topic for a document. The same WORDEMB is used when constructing the DOC2VEC and TOPIC2VEC embeddings that are concatenated for building the DOC-TOPIC2VEC embedding. Thus, the size of the DOCTOPIC2VEC is twice the size of the used WORDEMB.
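The concatenation itself is a one-liner; the two vectors below are hypothetical 4-dimensional placeholders built from the same WORDEMB.

```python
import numpy as np

# Hypothetical DOC2VEC and TOPIC2VEC built from the same 4-dimensional WORDEMB.
doc_vec = np.array([0.30, 0.20, 0.18, 0.17])
topic_vec = np.array([0.53, 0.07, 0.03, 0.22])

# DOCTOPIC2VEC: the DOC2VEC concatenated with the dominant topic's TOPIC2VEC.
doctopic_vec = np.concatenate([doc_vec, topic_vec])
print(doctopic_vec.shape)  # (8,) -- twice the WORDEMB size
```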

Classification Algorithms
The classification experiments with LOGREG are performed using both the DOC2VEC and the DOCTOPIC2VEC. To apply strong regularization to this model, we set the inverse regularization strength C to 10^-5 (in scikit-learn, a smaller C means stronger regularization).
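A minimal sketch of this configuration, assuming randomly generated toy vectors in place of the actual DOC2VEC/DOCTOPIC2VEC inputs.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy embeddings and polarity labels (0 = negative, 1 = neutral, 2 = positive);
# stand-ins for the DOC2VEC / DOCTOPIC2VEC inputs.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))
y = rng.integers(0, 3, size=300)

# C is the inverse regularization strength: C = 1e-5 gives strong regularization.
clf = LogisticRegression(C=1e-5, max_iter=1000)
clf.fit(X, y)
print(clf.predict(X[:5]).shape)  # (5,)
```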
Using the GRU units, we built multiple models: (1) One with a single GRU layer; (2) One with three GRU layers (3GRU); (3) One with a single BIGRU layer; and (4) One with three BIGRU layers (3BIGRU). All these models have a final DENSE layer used for the final classification. Each GRU layer is initialized with 128 units and a dropout of 0.2. The update and reset gates use the sigmoid activation function (Equation (13)), while the candidate hidden state uses the hyperbolic tangent function (Equation (14)). The sigmoid function maps its input to (0, 1) and is used for models that work with the probability of a variable. The hyperbolic tangent function maps its input to the (−1, 1) interval and is mainly used to better differentiate strongly negative values from values near 0. The DENSE output layer is initialized with the softmax activation function (Equation (15)) and with three as its dimension, corresponding to the number of possible values for the polarity. For multiclass classification, the softmax function is a generalized logistic activation function used to normalize the output of a network x = (x1, x2, ..., xK) to a probability distribution over the predicted output classes i = 1, ..., K. In our case, we set K = 3, as we are predicting the positive, negative, or neutral polarity.
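The softmax normalization described above (Equation (15)) can be sketched in a few lines of numpy; the three raw scores are illustrative placeholders for the DENSE layer's output.

```python
import numpy as np

def softmax(x):
    """Normalize a vector of K scores into a probability distribution."""
    e = np.exp(x - np.max(x))  # shift by the max for numerical stability
    return e / e.sum()

# Three illustrative raw scores, one per polarity class
# (e.g., negative, neutral, positive).
scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs.sum())            # 1.0
print(int(np.argmax(probs)))  # 0 -- the predicted class
```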
We keep the same initialization parameters for the LSTM and DENSE layers as for the GRU architectures. The input, output, and forget gates use the sigmoid activation function, while the hidden state and the cell input activation vector use the hyperbolic tangent function. The LSTM models use the same loss function and optimizer parameters.
As CNN architectures have proven to be an asset for text classification [10], we build a CNN Sentiment Analysis architecture with three layers: CNN, MAXPOOLING, and DENSE. We set the number of filters to 64 and the kernel size to half the size of the input vector, i.e., the DOC2VEC or DOCTOPIC2VEC. We also add the CNN and MAXPOOLING layers on top of the four GRU and four LSTM architectures to determine whether convolutions on top of recurrent layers improve the classification, as in [9]. Moreover, we implement the Deep Learning architecture presented in [3] using the same configuration; in the experiments, we name this architecture 4CNN-BILSTM.
For all the Deep Neural Network Architectures, we utilize a batch size of 5000 to accurately estimate the gradient error, at the cost of slowing the convergence of the learning process. The loss is computed using categorical cross-entropy, and the applied optimizer is Adam. Each network is trained for a maximum of 200 epochs, using an automated stopping mechanism that halts the execution if the accuracy does not improve during 20 successive epochs.
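The stopping mechanism can be sketched independently of any framework (Keras provides it as the EarlyStopping callback with a `patience` parameter); the per-epoch accuracies below are a synthetic stand-in for a real training history.

```python
def train_with_early_stopping(accuracies, max_epochs=200, patience=20):
    """Stop when accuracy has not improved for `patience` successive epochs.

    `accuracies` stands in for the per-epoch accuracy that a real
    training loop would produce.
    """
    best, epochs_without_improvement = float("-inf"), 0
    for epoch, acc in enumerate(accuracies[:max_epochs], start=1):
        if acc > best:
            best, epochs_without_improvement = acc, 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            return epoch, best  # stopped early
    return min(len(accuracies), max_epochs), best

# Accuracy improves for 5 epochs, then plateaus: training stops at epoch 25.
history = [0.50, 0.55, 0.60, 0.65, 0.70] + [0.70] * 100
print(train_with_early_stopping(history))  # (25, 0.7)
```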

Implementation
The entire pipeline is implemented in Python 3.7. For named entity recognition and lemmatization, we used the en_core_web_sm model from the SpaCy [42] package. We use the gensim [43] and python-glove [44] packages for the WORDEMBs and the scikit-learn [45] package for the TFIDF vectorization, the LOGREG classifier, and the Topic Modeling algorithms. All the DNN Architectures are implemented in Keras [46] with TensorFlow [47] as the tensor backend engine. The experiments are run on an NVIDIA® DGX Station™. The code is freely available online on GitHub at https://github.com/cipriantruica/DocTopic2Vec.

Results
Tables 6-10 present the average accuracy obtained over 10 distinct training experiments. As a baseline for the embeddings, we use the DOC2VEC, while, for classification, we use LOGREG. We utilize Stratified Cross-Validation to split the dataset into 80-20% training-testing sets with random seeding, i.e., 72,132 reviews for training and 18,033 for testing.
The proposed DOCTOPIC2VEC significantly improves document-level polarity detection for both NMF and LSI, outperforming the simple DOC2VEC by over 5%. In the case of LDA, we observe a decrease in accuracy. When using the WORD2VEC CBOW model to construct the DOCTOPIC2VEC (Table 6), we obtain the best results and the overall best accuracy (i.e., 0.7718) of all our experiments with the GRU architecture and the LSI topic model. For WORD2VEC SKIP-GRAM (Table 7), the CNN-BIGRU architecture with the DOCTOPIC2VEC for NMF obtains the best results. The CNN-BIGRU architecture also achieves the best results when building the DOCTOPIC2VEC using FASTTEXT and LSI (Tables 8 and 9). When using GLOVE and the LSI topic model to construct the DOCTOPIC2VEC (Table 10), the best results are obtained with the novel CNN-3BIGRU architecture. The experimental results show that the polarity detection accuracy is improved if the Topic Modeling algorithm meets at least one of the following two conditions: (1) The document to dominant topic distribution is balanced and manages to group context-related documents together; (2) The importance of the terms that belong to a topic has a small value range, in order to enhance the document vectorization with the context-dependent terms.
Thus, depending on the used Topic Modeling algorithm, the overall performance of the proposed model changes.
LSI manages to meet the first condition needed to improve the accuracy of the polarity detection task. The importance of the relevant keywords for a topic detected with LSI has values ≤1. These values influence the TOPIC2VEC values (Table 5). Thus, the final DOCTOPIC2VEC's values remain balanced across the entire encoding, and the context extracted through Topic Modeling, in conjunction with the distribution of documents to dominant topics, improves the classification task (Table 2).
NMF satisfies both conditions needed to improve the accuracy of the Sentiment Analysis task. For NMF, the importance of the relevant keywords is not normalized and takes values in the range [0, 6.64], but the majority of the values are still ≤1 (Table 4). When building the TOPIC2VEC for NMF, some dimensions have higher values, which adds more importance to the context-related words. During the training of the model, these higher values bias the corresponding dimensions in the classification task and influence the accuracy of detecting the document-level polarity by better grouping documents together. Moreover, the more balanced distribution of documents to dominant topics obtained by NMF (Table 2) also influences the context-based grouping of documents.
LDA does not meet either of the two conditions needed for an improved polarity detection model; thus, the accuracy decreases. When using LDA to build the DOCTOPIC2VEC, the LOGREG and RNN results are influenced by the importance of a word to a topic and by the distribution of documents to dominant topics. The relevant words for some topics have high importance (Table 3); thus, the TOPIC2VEC values are larger than the DOC2VEC values. Because the distribution of documents to dominant topics is not balanced (Table 2), the TOPIC2VEC with the highest values is assigned to the majority of documents. When concatenating the DOCTOPIC2VEC, the second half of the embedding and the imbalanced distribution of documents to dominant topics influence the classification task, and the results are similar to flipping a coin.
For the CNN with bidirectional RNN models, we obtain better results for the DOCTOPIC2VEC constructed with LDA, as the Deep Neural Network uses convolutions to select values. Therefore, the impact of the second half of the TOPIC2VEC values, as well as that of the imbalanced document grouping, is minimized, and for some tests, we obtain better results.
We observe that, on average, we obtain better results when using bidirectional models for both the fully connected (e.g., BIGRU, 3BIGRU, etc.) and convolutional architectures (e.g., CNN-BIGRU, CNN-3BIGRU, etc.). We note that, on average, the proposed new architectures perform better for this task. Furthermore, our models outperform the state-of-the-art 4CNN-BILSTM architecture. Stacking multiple layers of RNNs (e.g., 3GRU, 3BIGRU, etc.), with or without a CNN, brings very little improvement in accuracy over the architectures with a single layer; when they are better, the improvement is only ∼1%. The same observation can be drawn for the architecture that stacks multiple CNNs, i.e., 4CNN-BILSTM.
As a final remark, we compare our results with those obtained on the same dataset in [48] and in [40]. Our proposed Deep Neural Network architectures outperform by ∼10% the Transformer-based models in [40], which obtained an accuracy of only 0.67.

Conclusions
In this paper, we propose DOCTOPIC2VEC, a novel embedding that incorporates contextual cues through the use of Topic Modeling. We use a dataset with game reviews to learn different WORDEMB models, i.e., WORD2VEC, FASTTEXT, and GLOVE. Applying the different WORDEMBs, we create DOC2VECs for each review and TOPIC2VECs for each topic extracted by LDA, NMF, and LSI. A DOCTOPIC2VEC is constructed for each review as the concatenation of its DOC2VEC with the TOPIC2VEC of its dominant topic. Both the DOC2VEC and TOPIC2VEC use the same WORDEMB when they are concatenated into the DOCTOPIC2VEC. To prove the efficiency of the newly proposed DOCTOPIC2VEC in the task of Document-Level Sentiment Analysis, we implement different Deep Neural Network (DNN) Architectures using combinations of fully connected (i.e., GRU, LSTM, BIGRU, BILSTM, DENSE) and convolutional (CNN) layers. Furthermore, we propose six novel Convolutional-based Recurrent DNN Architectures that outperform the state-of-the-art 4CNN-BILSTM architecture [3].
The experimental results show an improvement in accuracy of ∼5% in determining the document-level polarity when employing the newly proposed context-enhanced DOCTOPIC2VEC for the NMF- and LSI-based topic embeddings over the baseline, i.e., DOC2VEC with LOGREG. These embeddings manage to improve the classification by: (1) Grouping context-related documents together through the document to dominant topic distribution; (2) Enhancing the document vectorization with the importance of the context-dependent terms that belong to the topic.
Furthermore, we observe that if the Topic Modeling algorithm does not meet these requirements, the polarity detection accuracy drops significantly, as in the case of LDA. Finally, we note that our proposed CNN-(BI)RNN architectures outperform by ∼10% the best performing state-of-the-art model applied on the same dataset in [40].
By combining Topic Modeling with the Sentiment Analysis task and obtaining better results, we manage to answer (Q1) and fulfill objective (O1). We answer (Q2) by adding local and global context through the novel DOCTOPIC2VEC embedding and improving the accuracy of detecting the polarity of textual data, thus achieving objective (O2). By introducing novel CNN-(BI)RNN Deep Learning Architectures that improve the accuracy of the Sentiment Analysis task, we answer our final research question (Q3) and complete objective (O3).
As future work, we aim to test other embeddings, e.g., MITTENS [49], which learns domain-specific representations; MOE [50], which manages word misspellings; and BERT [22], which considers a word's occurrence and position when computing its context. Furthermore, we plan to explore how the WORDEMBs used in this paper could be used with other neural networks, such as Hierarchical Attention Networks or Deep Belief Networks.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: