An Improved Model for Analyzing Textual Sentiment Based on a Deep Neural Network Using Multi-Head Attention Mechanism

Abstract: Due to the increasing growth of social media content on websites such as Twitter and Facebook, analyzing textual sentiment has become a challenging task. Therefore, many studies have focused on textual sentiment analysis. Recently, deep learning models, such as convolutional neural networks and long short-term memory, have achieved promising performance in sentiment analysis. These models have proven their ability to cope with the arbitrary length of sequences. However, when they are used in the feature extraction layer, the feature space is high-dimensional, the text data are sparse, and they assign equal importance to various features. To address these issues, we propose a hybrid model that combines a deep neural network with a multi-head attention mechanism (DNN–MHAT). In the DNN–MHAT model, we first design an improved deep neural network to capture the text's actual context and extract position-invariant local features by combining recurrent bidirectional long short-term memory units (Bi-LSTM) with a convolutional neural network (CNN). Second, we present a multi-head attention mechanism to capture the words in the text that are significantly related to long-distance and encoding dependencies, which adds a different focus to the information outputted from the hidden layers of Bi-LSTM. Finally, a global average pooling is applied for transforming the vector into a high-level sentiment representation to avoid model overfitting, and a sigmoid classifier is applied to carry out the sentiment polarity classification of texts. The DNN–MHAT model is tested on four reviews and two Twitter datasets. The results of the experiments illustrate the effectiveness of the DNN–MHAT model, which achieved excellent performance compared to the state-of-the-art baseline methods based on short tweets and long reviews.


Introduction
Sentiment analysis (SA) of text aims to extract and analyze knowledge from the personal information posted on the internet. Due to its wide range of industrial and academic applications, as well as the increasing growth of social networks, SA has become a hot topic in the field of natural language processing (NLP) in recent years [1]. Thus, different tools and techniques have been proposed to identify the polarity of documents. Polarity detection is a binary categorization task that plays a significant role in most SA applications [2]. Most of the previous approaches for SA have trained shallow techniques on carefully developed efficient features for obtaining satisfactory polarity categorization performances [3]. These models occasionally apply traditional classification approaches involving Naïve Bayes, support vector machines (SVM), and latent Dirichlet allocation (LDA) to linguistic properties, such as lexical features, part-of-speech (POS) tags, and n-grams. However, these approaches have two major drawbacks: (1) the feature distance on which the model must be trained is highly dimensional and scattered and thus affects the model performance; (2) the feature engineering operation is time intensive and an uphill task.
Several current works have suggested learning word embedding [4][5][6] to tackle the above limitations. Word embedding is a dense real-valued vector generated by a neural language model that considers various lexical associations [4,5]. Thus, this makes the employment of word embedding as the input of deep neural networks (DNN) highly common in existing NLP works [4]. In recent years, DNNs have gained increasing attention from many researchers in varied domains, such as medical informatics [6], finance [7], computer vision [8], and multimedia sentiment analysis [9].
DNNs have been suggested for analyzing text data that primarily focus on the performance of machine learning tasks or learning word embedding, such as categorization and clustering. Among the wide range of deep networks, recurrent neural networks (RNNs) and convolutional neural networks (CNNs) are more popular in research related to text processing [10]. The cause of this popularity is due to the fact that the CNN models can learn the local patterns, while the power of RNNs is demonstrated in sequential modelling. Although RNNs are used in several text processing applications, they cannot handle vanishing and exploding gradients, especially if the input data have long dependencies [4]. These dependencies are highly popular in most NLP approaches, especially in the domain of SA.
To deal with the above problem, long short-term memory (LSTM) was introduced [11], which has the ability to capture long dependencies. Because LSTM addresses these problems of RNNs, it has attracted the attention of many researchers in the field of NLP [12]. To consider both the previous and subsequent contexts, the bidirectional LSTM (Bi-LSTM) model was proposed, which combines forward and backward hidden layers. This model can cope with the sequential modelling issue and is widely employed in many NLP applications. However, it has two major drawbacks: (1) the high-dimensional input space common in text processing applications makes the model more complex and thus difficult to optimize; (2) the model cannot focus on the significant parts of the context information of the text. Many studies in the literature have addressed these problems. For instance, CNNs have been employed to extract meaningful patterns from text and reduce the dimensionality of the feature space [12]. The attention mechanism assigns different weights to focus on the significant parts of the context [4].
The current deep learning models for SA often handle a few issues and disregard others. For instance, Chatterjee et al. [13] used LSTM and two pre-trained word embeddings for extracting both semantics and sentiments for feeling recognition, but did not address the differences between the importance of various parts of sentences. A study by Liu et al. [14] combined Bi-LSTM with CNN and benefited from the attention mechanism, but this study did not consider the co-occurrence of long and short dependencies. Rezaeinia et al. [15] used CNNs and improved pre-trained word embeddings, but they did not take account of the different importance values of words and long dependencies.
The Google machine translation team presented a new concept, the multi-head attention mechanism (MHAT), in 2017 to capture related information in different subspaces via multiple distributed computations [16]. In our study, we employed the attention mechanism to select the most important contextual information, considered both forward and backward context dependencies, and assigned appropriate attention to various words in comments.
In this work, we propose a new deep learning model that combines a deep neural network with a multi-head attention mechanism (DNN-MHAT) for classifying textual sentiment. We first applied global vectors for word representation (GloVe) [5] to create word vectors automatically as the weights in the embedding layer. We also designed an improved deep neural network to capture the text's actual context and extract position-invariant local features by combining recurrent Bi-LSTM memory units with a CNN. Then, we devised a multi-head attention mechanism to capture the words in the text that are significantly related to long-distance and encoding dependencies, which adds effective weights to the different contextual features. Finally, the global average pooling layer was applied to obtain a multi-level pattern representation of the text sequences, and a Softmax classifier was applied for classifying the processed context information. The DNN-MHAT model was tested on four reviews and two Twitter datasets and compared with five state-of-the-art DNN baseline methods on SA and text classification datasets. DNN-MHAT outperformed the other five methods in terms of the popular performance measures in NLP and SA, on both short tweets and long reviews.
Our contributions are summarized as follows:
1. We propose a new deep learning model, namely, DNN-MHAT, for text classification and SA tasks. First, we design an improved deep neural network to capture the text's actual context and extract position-invariant local features using Bi-LSTM and CNN. Then, we present a multi-head attention mechanism to capture the words in the text that are significantly related to long-distance and encoding dependencies, assigning weighted importance to different information, efficiently enhancing the sentiment polarity of words and detecting the significant information in the text.
2. We investigate the effectiveness of the DNN-MHAT model on two types of datasets: long reviews and short tweets on social media. Compared to five existing deep structures, DNN-MHAT achieved better performance on both types of datasets.
The rest of this paper is structured as follows: Section 2 contains the Literature Review. Section 3 contains the Materials and Methods. Section 4 contains Experiments and Results. Finally, Section 5 contains the Conclusions and Future Work.

Sentiment Analysis
Most traditional SA research works have utilized supervised machine learning approaches as their clustering module or main classification [17]. These approaches exploited n-gram features and bag-of-words (BOW) techniques to classify and present user-created texts that bear sentiment [18]. These features are presented to cope with the issues of simple BOW techniques, such as overlooking the order of the word and the syntactic structures [19]. The major drawback of utilizing n-gram features, especially when n ≥ 3, is that the result of the feature space is highly dimensional. To handle this drawback, feature selection techniques have been widely applied in recent studies [20,21]. SVM, Naïve Bayes (NB) and artificial neural networks (ANN) are among the common methods employed to extract the meanings of users from their text, and have achieved good performance [22][23][24]. One of the problems that the supervised methods suffer from is that they are sometimes slow and require a large amount of time during training. To solve these problems, many methods based on unsupervised lexicons have been proposed [18,25]. These approaches are scalable, fast, and simple. However, they significantly rely on the lexicon, making them less accurate than their supervised counterparts [25,26]. Field dependency is another issue of lexicon-based approaches, making them less applicable for fields that do not contain specific lexicons.
Because both lexicon-based and supervised methods have advantages, a few researchers have combined them in various ways [27,28]. For instance, for SA at the entity level of tweets, Zhang et al. [29] proposed a method that consists of two steps. The first step is a high-recall step based on the supervised method. The second step is a high-precision step based on the lexicon method. Mudinas et al. [30] proposed a hybrid model for concept-based sentiment analysis that combines machine learning and lexicon-based methods. Their method provided a more accurate and justified explanation than purely statistical methods and outperformed lexicon-based methods in detecting the strength of sentiment and polarity.

DNN for Sentiment Analysis
In the sentiment analysis domain, most existing DNN-based works have been oriented towards learning word embedding or exploiting various types of DNNs for clustering or classification tasks. Word embeddings are generated for capturing word similarities and lexical relationships [31]. Unsupervised methods are usually used to generate such embeddings. These methods assume that words with similar contexts have similar meanings and therefore must have similar vectors. The drawback of this assumption is that the vectors of some words are similar, especially those occurring in a small neighborhood, even though the words are linguistically different. For instance, some words that carry feelings and have opposite meanings (e.g., bad and good) have similar vectors since they sometimes co-occur in similar contexts. To cope with this problem, a few studies have suggested sentiment-aware word vectors generated based on supervised approaches and large sentiment lexicons [4,32,33,34]. A study by Petrucci and Dragoni [35] suggested a new neural word embedding approach for multi-field SA. The authors addressed the major limitation of former approaches, which did not perform well when used in fields different from the one they were trained on, and their new approach performed better.
In SA applications, the LSTM and its variants [36] are widely utilized due to their ability to handle long-term dependencies. For instance, a novel model using LSTM and a recurrent neural network called P-LSTM was suggested by Chi Lu et al. [37]. The P-LSTM model used three-word phrase embedding rather than single word embedding. To extract accurate data from the text, the P-LSTM model presented the factor mechanism of the phrase that combines the feature vectors of the phrase embed layer and the hidden layer of LSTM. Ju et al. [38] presented a Cached LSTM model (CLSTM) that captured the semantic information of long texts. In recent years, Chatterjee et al. [13] introduced a multi-channel LSTM called SS-BED to detect sentiments in tweets. In the SS-BED model, Sentiment-Specific Word Embedding (SSWE) [39] and GloVe are employed in parallel for pre-trained word embedding. Three LSTM models are implemented sequentially for handling the long dependencies of texts. Finally, the two outputs of the feature vectors are sequenced as inputs in the fully connected layer. The SS-BED model does not address the differences in the importance of various parts of sentences.
CNNs are applied in applications of SA for extracting local features. These models are beneficial when the text is long and specific local features, such as n-grams, are significant. For instance, Rezaeinia et al. proposed a model based on CNN, which availed optimized word embedding to analyze the sentiment at the document level [15]. Their model optimized pre-trained GloVe and Word2Vec embedding [40] with positional, syntactical, and lexical features, but this study did not consider the different importance of words and long dependencies.
In recent years, the attention mechanism has been applied to optimize models of DNNs by allowing them to identify where to concentrate for learning. For instance, for binary sentiment classification, one BiLSTM layer and a global pooling mechanism model were suggested by Zabit et al. [14]. For text classification and question answering, Liu and Guo [41] proposed a hybrid model that combines Bi-LSTM, CNN and the attention mechanism, AC-BiLSTM. Their model used a one-dimensional CNN layer on the word embedding layer to extract local features, BiLSTM for extracting long dependencies, and an attention mechanism for focusing on significant text domains. The AC-BiLSTM model did not consider the co-occurrence of both long and short dependencies. Zhou et al. proposed a BiLSTM model with an attention mechanism to identify the significant features [42]. For text classification, a new attention model-based network called the hierarchical attention network (HAN) was proposed by Yang et al. [43]. The HAN model utilized two attention models at the sentence and words levels. They stacked the attention models on the outputs of gated recurrent unit GRU-based sequence encoders. The Google machine translation team presented a new concept of MHAT in 2017 [16] to capture related information in various sub-distances via multiple distributed computations.
Recently, a few researchers have proposed hybrid DNNs for SA. For instance, Mohammad et al. [44] suggested an attention-based bidirectional CNN-RNN deep model for sentiment analysis (ABCDM), which combines an attention mechanism and a bidirectional CNN-RNN deep model. This model first uses GloVe embedding as the weights of the embedding layer, then two bidirectional GRU and LSTM layers for extracting past and future contexts, and an attention mechanism for focusing on different words. Convolution and pooling mechanisms are applied to extract position-invariant local features and reduce the feature dimensions. A model that combines CNN and GRU with an attention mechanism, named ARC, was proposed by Wen and Li [45] to classify reviews and tweets. They employed three different CNN modules for extracting local n-gram and global patterns, together with bidirectional GRU units. However, these models do not accurately determine the various degrees of importance of forward and backward directions.
The major difference between our model and the DNN baseline models is that our proposed model considers the following significant features simultaneously: (i) short and long context dependencies, utilizing Bi-LSTM; (ii) the most significant features that are robust to positional changes, utilizing CNNs with various kernels, filter sizes, and pooling mechanisms; (iii) the words in the text that are significantly related to long-distance and encoding dependencies, utilizing a multi-head attention mechanism.

Materials and Methods
This section describes the overall structure of the DNN-MHAT model, which comprises six fundamental components: the input layer, convolutional neural network, long short-term memory, global average pooling layer, multi-head attention mechanism, and Softmax layer. The overall structure of the DNN-MHAT model is shown in Figure 1.
The key goal of the DNN-MHAT model is to detect the polarity of sentiment for the given sentences. In our method, first, we preprocessed the input data by tokenizing the input text, removing stop words, and dealing with the capitalization of words. Then, the tokenized texts were fed into the word embedding module. After that, the obtained word embedding vectors were fed into a CNN layer. The output of the CNN layer was fed into a Bi-LSTM layer. The output of the Bi-LSTM layer was fed into a multi-head attention module. After that, a global average pooling was applied to obtain the final representation. Finally, the final representation was fed into the Softmax classifier layer. Figure 2 shows the flowchart of the DNN-MHAT model.
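The preprocessing step described above (tokenizing, removing stop words, and handling capitalization) can be sketched as follows. This is only a minimal illustration: the function name `preprocess`, the regular-expression tokenizer, and the tiny stop-word list are assumptions, since the paper does not specify which tokenizer or stop-word list was used.

```python
import re

# Hypothetical, deliberately tiny stop-word list (the paper does not name one).
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "of", "to", "and", "in", "it", "this"}


def preprocess(text, stop_words=STOP_WORDS):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())  # lowercasing handles capitalization
    return [tok for tok in tokens if tok not in stop_words]


print(preprocess("This phone is GREAT and the battery lasts!"))
# ['phone', 'great', 'battery', 'lasts']
```

The resulting token list is what would be passed on to the word embedding module.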

Input Layer
A pre-trained GloVe embedding matrix $W_g \in \mathbb{R}^{n \times e}$ was utilized to create the input comment matrix, where $e$ and $n$ refer to the embedding dimension and the total number of words, respectively. For a comment vector $c \in \mathbb{R}^{m}$, where $m$ represents the maximum number of words or the padding length, each word $w_t$, $t \in [1, m]$, in the comment is mapped to its embedding vector through $W_g$.
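As a rough illustration of this lookup, the sketch below maps a tokenized comment to an $m \times e$ matrix by selecting rows of $W_g$. The names `embed_comment` and `vocab`, the use of index 0 for both padding and unknown words, and the toy values in `W_g` are assumptions made for the example, not details from the paper.

```python
def embed_comment(tokens, vocab, W_g, m):
    """Return an m x e matrix: one embedding row per word, padded or truncated to m."""
    # Assumption: index 0 doubles as the padding and unknown-word index.
    ids = [vocab.get(tok, 0) for tok in tokens][:m]
    ids += [0] * (m - len(ids))
    return [list(W_g[i]) for i in ids]


# Toy embedding matrix with n = 3 words and e = 3 dimensions (made-up values).
vocab = {"<pad>": 0, "good": 1, "bad": 2}
W_g = [[0.0, 0.0, 0.0], [0.5, 0.1, -0.2], [-0.4, 0.3, 0.2]]

X = embed_comment(["good", "movie", "bad"], vocab, W_g, m=4)
print(len(X), len(X[0]))  # 4 3
```

In practice the rows of `W_g` would come from the pre-trained GloVe vectors rather than hand-written numbers.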

Convolutional Neural Network
CNNs contain convolution layers that are employed in NLP applications for extracting local features. In CNNs, linear filters are used to perform the convolution process on the features of the input data. Initially, an embedding vector of size $e$ is generated for each word to apply the CNN to a sentence $S$ containing $s$ words. Then, a filter $f$ of size $e \times h$ is repeatedly applied to sub-matrices of the input feature matrix. The result is a feature map $M = [m_0, m_1, \ldots, m_{s-h}]$, as shown below:

$$m_i = f \cdot S_{i:i+h-1}$$

where $i = 0, 1, 2, \ldots, s-h$ and $S_{i:j}$ represents the sub-matrix of $S$ from row $i$ to row $j$. A sub-sampling or pooling layer is commonly applied to the feature maps to reduce their dimensions. Max-pooling is a common pooling strategy that selects the essential feature $b$ of the feature map, as shown in the following equation:

$$b = \max_{0 \le i \le s-h} m_i$$

The outputs of the pooling layer, either as pooled feature vectors or concatenated, are used as the input to the fully connected layer (see Figure 1).
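The convolution and max-pooling steps can be illustrated with a small sketch. It assumes a single filter, no bias or non-linearity, and stores the filter as $h$ rows of $e$ values so that it lines up with the rows of $S$; the function names and toy numbers are illustrative only.

```python
def feature_map(S, f):
    """Slide filter f (h rows x e columns) over the s rows of S and return [m_0, ..., m_{s-h}]."""
    s, h, e = len(S), len(f), len(f[0])
    return [
        sum(f[r][c] * S[i + r][c] for r in range(h) for c in range(e))
        for i in range(s - h + 1)
    ]


def max_pool(M):
    """Keep the single strongest response b of the feature map."""
    return max(M)


# Toy sentence matrix with s = 4 words and e = 2 dimensions, and a filter with h = 2.
S = [[1, 0], [0, 3], [2, 2], [1, -1]]
f = [[1, 0], [0, 1]]

M = feature_map(S, f)
b = max_pool(M)
print(M, b)  # [4, 2, 1] 4
```

Because only the maximum is kept, the pooled feature $b$ is the same wherever in the sentence the strongest match occurs, which is the position invariance the text refers to.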

Long Short-Term Memory
RNNs extend feed-forward neural networks with a recurrent hidden state whose activation depends on the previous states; they can therefore deal with variable-length sequences and automatically model contextual information. LSTM is an improved type of RNN (see Figure 3a) designed to solve the exploding/vanishing gradient issues faced by RNNs. The LSTM model contains a chain of recurrent memory units. Each LSTM unit contains three gates with different functions, namely the input gate $i_t$, the forget gate $f_t$, and the output gate $o_t$, and a memory cell $c_t$ that maintains its state over arbitrary time intervals. The gates regulate the flow of data entering and leaving the memory cell. Suppose $\tanh(\cdot)$, $\sigma(\cdot)$, and $\odot$ are the hyperbolic tangent function, the sigmoid function, and the element-wise product, respectively. $h_t$ is the hidden state vector at time $t$, and $x_t$ is the input vector at time $t$. $W$ and $U$ denote the weight matrices of the gates applied to the input $x_t$ and the hidden state $h_{t-1}$, respectively, and $b$ denotes the bias vectors. The forget gate $f_t$ defines what information to ignore from the cell state, as indicated by the following equation [41]:

$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$$

The input gate $i_t$ defines what must be stored by calculating $\tilde{c}_t$ and $i_t$ and combining them based on the following equations [41]:

$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$$
$$\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$

The output gate $o_t$ defines what information is outputted according to the cell state based on the following equations [41]:

$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$$
$$h_t = o_t \odot \tanh(c_t)$$

The LSTM model processes the sequence in one direction only, so it uses just the preceding information, whereas access to the following information as well is highly useful for sequence tasks.
The Bi-LSTM model comprises a forward LSTM layer $\overrightarrow{h}_t$ and a backward LSTM layer $\overleftarrow{h}_t$ (see Figure 3b). The core idea of the Bi-LSTM structure is that the forward layer $\overrightarrow{h}_t$ captures the previous sequential information, and the backward layer $\overleftarrow{h}_t$ captures the subsequent sequential information; both layers are connected to the same output layer. The most important feature of the Bi-LSTM architecture is that the full contextual information of the sequence is considered. Suppose that the input at time $t$ is the word embedding $w_t$, the output of the forward hidden layer at time $t-1$ is $\overrightarrow{h}_{t-1}$, and the output of the backward hidden layer at time $t+1$ is $\overleftarrow{h}_{t+1}$. The outputs of the forward and backward hidden layers at time $t$ are listed below [46]:

$$\overrightarrow{h}_t = L(w_t, \overrightarrow{h}_{t-1})$$
$$\overleftarrow{h}_t = L(w_t, \overleftarrow{h}_{t+1})$$

where $L(\cdot)$ indicates the hidden layer operation of the LSTM. The forward output vector $\overrightarrow{h}_t$ and the backward output vector $\overleftarrow{h}_t$ are both in $\mathbb{R}^{1 \times H}$, where $H$ indicates the number of hidden layer cells, and they must be combined to obtain the text feature:

$$h_t = [\overrightarrow{h}_t, \overleftarrow{h}_t]$$
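To make the gate computations and the forward/backward combination concrete, the following is a minimal sketch of an LSTM step and a Bi-LSTM pass using plain Python lists. It assumes the standard LSTM formulation, zero initial states, and concatenation as the way of combining the two directions; the helper names and the constant toy parameters are hypothetical and are not the trained weights of the model.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, x, U, h, b):
    """Return W x + U h + b for list-based matrices and vectors."""
    return [
        sum(w * xi for w, xi in zip(W_row, x))
        + sum(u * hi for u, hi in zip(U_row, h))
        + b_k
        for W_row, U_row, b_k in zip(W, U, b)
    ]


def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step; p maps names such as 'W_f' to parameters."""
    f_t = [sigmoid(z) for z in affine(p["W_f"], x_t, p["U_f"], h_prev, p["b_f"])]
    i_t = [sigmoid(z) for z in affine(p["W_i"], x_t, p["U_i"], h_prev, p["b_i"])]
    c_tilde = [math.tanh(z) for z in affine(p["W_c"], x_t, p["U_c"], h_prev, p["b_c"])]
    o_t = [sigmoid(z) for z in affine(p["W_o"], x_t, p["U_o"], h_prev, p["b_o"])]
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_tilde)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


def run_lstm(xs, p, H):
    """Run the LSTM over a sequence, starting from zero hidden and cell states."""
    h, c = [0.0] * H, [0.0] * H
    outputs = []
    for x_t in xs:
        h, c = lstm_step(x_t, h, c, p)
        outputs.append(h)
    return outputs


def bilstm(xs, p_fwd, p_bwd, H):
    """Concatenate forward and backward hidden states at every time step."""
    forward = run_lstm(xs, p_fwd, H)
    backward = run_lstm(xs[::-1], p_bwd, H)[::-1]  # re-align to original order
    return [f + b for f, b in zip(forward, backward)]


def toy_params(e, H, value):
    """Hypothetical constant-valued parameters, for illustration only."""
    p = {}
    for gate in "fico":
        p["W_" + gate] = [[value] * e for _ in range(H)]
        p["U_" + gate] = [[value] * H for _ in range(H)]
        p["b_" + gate] = [0.0] * H
    return p


xs = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5], [0.0, 1.0, 0.0], [1.0, 1.0, 1.0]]
features = bilstm(xs, toy_params(3, 2, 0.1), toy_params(3, 2, -0.1), H=2)
print(len(features), len(features[0]))  # 4 4
```

Each output row has $2H$ entries: the first $H$ summarize the words up to position $t$, and the last $H$ summarize the words from position $t$ to the end of the sequence.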

Multi-Head Attention Mechanism
Attention is a key component of the MHAT mechanism, but there is a fundamental difference in that the MHAT model can perform multiple distributed computations that handle complex information.

Scaled Dot-Product Attention
Scaled dot-product attention maps a query and a set of key-value pairs to an output. There are four steps for computing the attention, as follows [46]:
- The weight of each key with respect to the query is computed by considering their similarity. The proposed model uses the dot product to determine the similarity.
- The scaling operation is the next step, where the factor $\sqrt{d_k}$ is used as a moderator so that the dot product does not become too large.
- The Softmax function is used to normalize the obtained weights.
- The output is the sum of the corresponding values $V$ weighted by the normalized similarities.
According to the steps mentioned above, we obtained the following formula:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$
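The four steps can be followed one by one in a short sketch. It uses plain Python lists, with queries, keys, and values given as lists of row vectors; the function names and the toy inputs are illustrative assumptions.

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of scores."""
    top = max(z)
    exps = [math.exp(v - top) for v in z]
    total = sum(exps)
    return [v / total for v in exps]


def scaled_dot_product_attention(Q, K, V):
    """Return (outputs, weights) for lists of query, key, and value row vectors."""
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # Steps 1 and 2: dot-product similarity, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        # Step 3: normalize the scores into attention weights.
        alpha = softmax(scores)
        # Step 4: weighted sum of the value vectors.
        out = [sum(a * v[j] for a, v in zip(alpha, V)) for j in range(len(V[0]))]
        weights.append(alpha)
        outputs.append(out)
    return outputs, weights


# Two identical keys receive equal weight, so the output is the mean of the values.
out, alpha = scaled_dot_product_attention([[1.0, 0.0]], [[1.0, 1.0], [1.0, 1.0]], [[2.0, 0.0], [0.0, 2.0]])
print(alpha, out)  # [[0.5, 0.5]] [[1.0, 1.0]]
```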

Multi-Head Attention
MHAT is an improvement of the traditional attention mechanism, and it has excellent performance. Figure 4 shows the architecture of the MHAT mechanism. Initially, Q, K, and V are passed through a linear transformation and then used as the input of the scaled dot-product attention. This operation computes one head at a time, so it is carried out $m$ times, which is why it is called multi-head. The parameters $W$ for each linear transformation of Q, K, and V are different. The $m$ scaled dot-product attention outputs are concatenated, and the value obtained through a further linear transformation is utilized as the output of the MHAT [47]. The formula can be expressed as shown below:

$$\mathrm{Multihead}(Q, K, V) = \mathrm{concat}(\mathrm{head}_1, \ldots, \mathrm{head}_m)\, W^{o} \quad (15)$$
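The sketch below illustrates the multi-head computation under the common formulation in which each head applies its own projection matrices to Q, K, and V before the scaled dot-product attention. The per-head matrices `W_q`, `W_k`, `W_v`, the output matrix `W_o`, and the toy values are assumptions chosen so that the result is easy to check by hand; they are not parameters from the paper.

```python
import math


def matmul(A, B):
    """Multiply list-based matrices A (n x k) and B (k x p)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def softmax(z):
    top = max(z)
    exps = [math.exp(v - top) for v in z]
    total = sum(exps)
    return [v / total for v in exps]


def attention(Q, K, V):
    """Scaled dot-product attention on list-based matrices."""
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in matmul(Q, K_T)]
    return matmul(weights, V)


def multi_head(Q, K, V, heads, W_o):
    """heads is a list of (W_q, W_k, W_v) triples, one triple per head."""
    outputs = [
        attention(matmul(Q, W_q), matmul(K, W_k), matmul(V, W_v))
        for W_q, W_k, W_v in heads
    ]
    # Concatenate the head outputs position by position, then project with W_o.
    concat = [sum((head[t] for head in outputs), []) for t in range(len(Q))]
    return matmul(concat, W_o)


# Toy self-attention input: 3 words with 4 features each, and m = 2 heads.
X = [[1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0], [5.0, 6.0, 7.0, 8.0]]
zeros = [[0.0, 0.0] for _ in range(4)]                      # zero W_q, W_k: uniform attention
W_v1 = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]     # head 1 keeps features 0 and 1
W_v2 = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]     # head 2 keeps features 2 and 3
W_o = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]  # identity

out = multi_head(X, X, X, [(zeros, zeros, W_v1), (zeros, zeros, W_v2)], W_o)
print(len(out), len(out[0]))  # 3 4
```

With these toy weights every head attends uniformly, so each output row is simply the column-wise mean of `X`; trained projections would instead let each head focus on different words.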

Self-Attention
In this approach, we employed self-attention to extract the inner relations of sentences, i.e., $K = V = Q$ [48]. For instance, every input word carries out the attention computation with every other word of the sentence. Thus, the MHAT mechanism produces a weight matrix $\alpha$ and a feature representation $v$.

Global Average Pooling Layer
The fully connected network is the main architecture of the classification network, which contains an activation function, Softmax, for performing classification. The fully connected network stretches the feature map into a vector, applies a matrix multiplication to this vector, and eventually reduces its dimension. To obtain the corresponding result of every category, this vector is entered into a Softmax layer. The fully connected network has two major drawbacks: (i) the number of parameters is very large, which reduces the training speed; (ii) it is prone to overfitting. Global average pooling avoids these two shortcomings while achieving the same effect by averaging over the sequence of input features [49]. After applying the MHAT mechanism to the sentence, the feature matrix of the corresponding output is $v$, and the feature vectors of the words in the sentence are $v_1, v_2, \ldots, v_n$. The global average pooling of the input sentence is shown below:

$$v_{gap} = \frac{1}{n} \sum_{i=1}^{n} v_i$$
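As a small worked example of this averaging, the sketch below computes the mean of the word feature vectors dimension by dimension. The function name and the toy feature values are illustrative.

```python
def global_average_pooling(v):
    """Average n word feature vectors into one sentence-level vector."""
    n = len(v)
    return [sum(column) / n for column in zip(*v)]


# Three word vectors with two features each.
v_gap = global_average_pooling([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
print(v_gap)  # [3.0, 4.0]
```

The operation has no trainable parameters, which is why it does not add to the parameter count the way a fully connected layer does.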

Softmax Layer
To predict the sentiment, we fed the pooled vector v_gap directly into the Softmax layer, which outputs the predicted sentiment categories ŷ. To train and evaluate the proposed model, the cross-entropy loss was used to measure the gap between the predicted sentiment categories ŷ and the real sentiment categories y, accumulated over the sentences, where i represents the index number of the sentence.

The Bi-LSTM layer can use context to organize sequence information. MHAT can learn information from different representation subspaces and dimensions and fully capture long-range textual features, which plays a critical role in directly improving the sentiment analysis capability of the model. The pseudo-code of DNN-MHAT is shown in Algorithm 1.
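
The Softmax prediction and the cross-entropy loss described in this section can be sketched as follows. The sketch assumes the class scores are some affine function of v_gap (for example W v_gap + b, where W and b are symbols introduced here for illustration) and that the loss sums the per-sentence cross-entropy over sentences i; the scores and labels are toy values.

```python
import math

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    m = max(scores)                              # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Cross-entropy between a one-hot label and predicted probabilities."""
    return -sum(t * math.log(p + eps) for t, p in zip(y_true, y_pred))

# Hypothetical class scores (negative, positive) for two sentences, i = 0, 1.
scores = [[0.0, 2.0], [1.0, -1.0]]
labels = [[0.0, 1.0], [1.0, 0.0]]                # one-hot real categories y

y_hat = [softmax(s) for s in scores]             # predicted categories
loss = sum(cross_entropy(y, p) for y, p in zip(labels, y_hat))
print(y_hat, loss)
```

The loss shrinks toward zero as the probability assigned to the real category of each sentence approaches 1.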

Experiments and Results
This section describes the experiments conducted to assess the performance of the DNN-MHAT model for SA and text classification on different benchmark datasets. The experimental setup and baseline methods are presented first, followed by a discussion of the results.

Datasets
Our study conducted sentiment analysis and text classification tasks on datasets of long and short texts. The details of the datasets are as follows:
• APP: This dataset for Android applications [44] comprises 752,937 product reviews and metadata from Amazon.
• Kindle: This dataset for Kindle Store [44] comprises 982,619 product reviews and metadata from Amazon.
• Electronics: This dataset for Electronics [44] comprises 1,689,188 product reviews and metadata from Amazon.
• CDs: This dataset for CDs and Vinyl [44] comprises 1,097,592 product reviews and metadata from Amazon.
• Sentiment140: This dataset was generated at Stanford University [44] by computer science graduate students and comprises 1,600,000 tweets classified into positive and negative categories.

Table 1 presents the statistics of the datasets used with the proposed model in more detail.

Data preprocessing is considered an essential step in machine learning and data mining [50][51][52][53][54]. The reviews contain incomplete sentences, a large amount of noise, and weak wording, such as words without application, high repetition, imperfect words, and incorrect grammar. Unstructured data also affect sentiment classification results. Preprocessing the reviews is needed to maintain a regular structure and reduce such problems. Cleaning the data with filters, splitting the data into training and testing parts, and building datasets with favorite words are a few of the steps employed in our research. Without going into too much depth, we used the following techniques to prepare the data.

Tokenization
We divided the text into phrases, words, symbols, or other meaningful elements, thus forming a list of individual tokens for each comment. We then used each word in a comment as a feature for training our classifier.
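
A minimal sketch of this step is given below. It assumes a simple regular-expression tokenizer that separates words from standalone symbols; the paper does not state which tokenizer was used, so the pattern and function name are illustrative.

```python
import re

def tokenize(comment):
    """Split a comment into word tokens and standalone symbols."""
    # Words: runs of letters, digits, or apostrophes; symbols: any other
    # single non-space character.
    return re.findall(r"[A-Za-z0-9']+|[^\sA-Za-z0-9']", comment)

tokens = tokenize("Great app, works well!")
print(tokens)  # ['Great', 'app', ',', 'works', 'well', '!']
```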

Removing Stop Words
Comments contain stop words that carry little meaning, such as prepositions, and words that add no emotional value (or, also, able, etc.). The Natural Language Toolkit (NLTK) library provides a stop-word dictionary of neutral words that are not useful for sentiment analysis. To remove the stop words from a comment's text, we checked each word in the token list against this dictionary and excluded the matches.
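
The filtering step can be sketched as below. To keep the example self-contained, it uses a small hand-written stop-word set rather than the NLTK dictionary (which is available as `nltk.corpus.stopwords.words("english")` once the corpus has been downloaded); the set contents and function name are illustrative assumptions.

```python
# Small illustrative stop-word set; the NLTK English list is much longer.
STOP_WORDS = {"the", "a", "an", "of", "or", "also", "able", "is", "to", "in"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop every token whose lowercase form is in the stop-word set."""
    return [t for t in tokens if t.lower() not in stop_words]

tokens = ["the", "battery", "is", "also", "able", "to", "last", "long"]
filtered = remove_stop_words(tokens)
print(filtered)  # ['battery', 'last', 'long']
```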

Capitalization
Documents and texts containing many sentences with diverse capitalization can be a big problem when classifying large documents. The best approach to dealing with inconsistent capitalization is to convert every letter to lowercase. This technique maps all words in the text and document into the same feature space, but it causes a significant problem in the interpretation of some words (e.g., "US" (United States of America) becomes "us" (pronoun)).
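
The trade-off can be seen in a two-line sketch: lowercasing merges spelling variants of the same word, but it also merges distinct words such as "US" and "us". The function name is hypothetical.

```python
def normalize_case(tokens):
    """Lowercase every token so spelling variants share one feature."""
    return [t.lower() for t in tokens]

normalized = normalize_case(["The", "US", "told", "us"])
print(normalized)  # ['the', 'us', 'told', 'us']
```

After normalization, the country name and the pronoun are indistinguishable to the classifier.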

Parameter Settings
The DNN-MHAT model was implemented using TensorFlow 1.13.1 with Keras 2.2.4 in Python 3.7.1 on an Ubuntu 16.04 system with a quad-core Intel Core i7-7700K CPU and a GTX 1080 Ti GAMING X 11 GB GPU. To construct the input comment matrix C, the Tokenizer method uses a vocabulary of 100,000 words. We kept the first 45 words of each comment in the tweet datasets and the first 100 words in the review datasets by setting the padding sizes to 45 and 100, respectively. In the current study, the pre-trained and publicly available GloVe vectors were used as the weights of the embedding layer. The "Gigaword 5 + Wikipedia 2014" version of GloVe was used, which was trained on six billion tokens and has a vocabulary size of 400,000. The embedding size was set to 300. Other parameter settings used in the proposed model are shown in Table 2.
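
The construction of fixed-length inputs can be sketched in plain Python as follows. The sketch assumes that index 0 is reserved for padding, that out-of-vocabulary words are dropped, and that padding is appended at the end; these choices and the helper names are illustrative, not taken from the paper.

```python
from collections import Counter

PAD_ID = 0  # assumption: index 0 is reserved for padding

def build_vocab(tokenized_comments, max_words):
    """Map the max_words most frequent tokens to ids 1..max_words."""
    counts = Counter(tok for comment in tokenized_comments for tok in comment)
    return {tok: i for i, (tok, _) in enumerate(counts.most_common(max_words), start=1)}

def encode(comment, vocab, max_len):
    """Keep the first max_len in-vocabulary tokens, then right-pad with PAD_ID."""
    ids = [vocab[tok] for tok in comment if tok in vocab][:max_len]
    return ids + [PAD_ID] * (max_len - len(ids))

comments = [["good", "app"], ["good", "phone", "good", "battery"]]
vocab = build_vocab(comments, max_words=100_000)
tweet_row = encode(comments[0], vocab, max_len=45)     # tweet padding size
review_row = encode(comments[1], vocab, max_len=100)   # review padding size
print(len(tweet_row), len(review_row))  # 45 100
```

Note that Keras' `pad_sequences` pads and truncates at the front by default (`padding='pre'`, `truncating='pre'`), so keeping the first words of a comment with that utility would require `truncating='post'`.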

Evaluation Metrics
Four evaluation metrics, Accuracy (Acc), Recall (Re), Precision (Pr), and F1 measure (F1), were employed to evaluate the performance of the proposed model. These metrics are widely used in SA and text classification tasks and are computed as follows [18]:

$$Acc = \frac{TP + TN}{TP + TN + FP + FN}$$

$$Pr = \frac{TP}{TP + FP}$$

$$Re = \frac{TP}{TP + FN}$$

$$F1 = \frac{2 \times Pr \times Re}{Pr + Re}$$

where TN, FP, TP, and FN are the numbers of true negatives, false positives, true positives, and false negatives, respectively [18].
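
For concreteness, the four metrics can be computed from the confusion-matrix counts as in the sketch below. The function name and the example counts are hypothetical, and the guards that return 0.0 for empty denominators are a convention chosen here, not something stated in the paper.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return (accuracy, precision, recall, F1) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1

# Toy counts for the positive class of a 100-comment test set.
acc, pr, rec, f1 = classification_metrics(tp=40, tn=45, fp=5, fn=10)
print(f"{acc:.3f} {pr:.3f} {rec:.3f} {f1:.3f}")  # 0.850 0.889 0.800 0.842
```

When precision and recall are reported per class, as in the positive-class and negative-class F1 results, the same function is applied once per class with that class treated as "positive".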

Baseline Methods
In this work, we compared the DNN-MHAT model with five state-of-the-art DNN models that have been developed for sentiment polarity detection, as listed below:
• IWV [15]: This model, proposed for sentiment analysis, comprises three convolution layers, a max-pooling layer, and a fully connected layer.

Results
In this section, the proposed model is compared with the five baseline methods mentioned above for sentiment analysis on two types of datasets: four long-review datasets and two short-tweet datasets. Tables 3-6 show the results obtained for the four long-review datasets. In Tables 3-6, DNN-MHAT improved accuracy by 0.32%, 0.47%, 0.43%, and 0.38% on the Kindle, Electronics, CD, and App datasets, respectively. For the F1 measure, the improvements are 0.55%, 0.63%, 0.77%, and 0.48% for the positive class and 0.38%, 0.30%, 0.50%, and 0.65% for the negative class on the Kindle, Electronics, CD, and App datasets, respectively. As these accuracy and F1 results indicate, DNN-MHAT outperformed the other five methods. These improvements were mainly derived from (i) handling long dependencies in text using bidirectional LSTM layers, (ii) employing local features of varying lengths by applying CNN layers of different sizes, and (iii) assigning weights to the words in a review according to their significance, as obtained from the multi-head attention (MHAT) mechanism layer.

Tables 7 and 8 show the results obtained for the two short-tweet datasets. As shown in Tables 7 and 8, DNN-MHAT improved accuracy by 0.35% and 0.27% on the Sentiment140 and Airline Twitter datasets, respectively. For the F1 measure, the improvements are 0.69% and 0.58% for the positive class and 0.36% and 0.52% for the negative class on the Sentiment140 and Airline Twitter datasets, respectively. As these accuracy and F1 results indicate, DNN-MHAT also outperformed the other five models on the short tweets of the Twitter datasets.

Ablation Study
To test the effectiveness of our model, we report the validation loss, validation accuracy, training loss, and training accuracy for two datasets, the CD dataset and the Airline tweet dataset, as shown in Figures 5-8.

We also evaluated our DNN-MHAT model using a dataset from a different language. We ran the DNN-MHAT model on the ASTD [55] dataset in the Arabic language. For a fair comparison, we embedded sentences using AraBERT. Table 9 shows the performance of our model, which achieved the highest accuracy among the compared models.

Table 9. The accuracy of AraBERT DNN-MHAT for ASTD Arabic dataset.

| Model | Accuracy (%) |
|---|---|
| Arabic-BERT Base [56] | 71.4 |
| AraBERT [57] | 92.6 |
| Arabic BERT [58] | 91 |
| Our model | 92.8 |

To illustrate the performance of our proposed DNN-MHAT model, we ran it with different embedding layer sizes, namely 50, 100, 200, and 300, as shown in Figures 9 and 10. Because the embedding size affects the proposed model's performance, the DNN-MHAT model's accuracy was evaluated on two datasets with the number of epochs set to 5, 6, 7, 8, 9, and 10. As Figures 9 and 10 show, the embedding size of 300 performs better than the other embedding sizes on both the CD and Airline datasets.
The experiments illustrate that the pre-trained GloVe embedding achieves better results with an embedding size of 300 than with the other embedding sizes.

Discussion
The results show that the DNN-MHAT model outperformed the other five models in terms of both F1 measure and accuracy on the Twitter datasets. However, the improvements are smaller than those on the review datasets. The main reason is that tweets contain only a small number of words. The first feature extraction layer in this model is an RNN-based network designed to capture long dependencies, which short comments rarely contain; consequently, the DNN-MHAT model does not provide substantial improvements on short comments.
Because BiLSTM can access both the preceding and the following context, the information it produces can be regarded as two different representations of the text. Moreover, applying a multi-head attention mechanism to each text representation focuses better on the significant related information and avoids mutual interference between the representations. Thus, the multi-head attention mechanism in our model makes the determination of text semantics more accurate. Hence, our model effectively improves the accuracy of text classification and sentiment analysis.
However, our study was limited to document-level sentiment analysis; we did not consider aspect-level sentiment, which we leave for future work.

Conclusions and Future Work
For sentiment analysis, we propose a hybrid model that combines a deep neural network with a multi-head attention mechanism (DNN-MHAT) to tackle the sparsity and high dimensionality of text data. First, DNN-MHAT uses pre-trained GloVe word embedding vectors as the initial weights of the embedding layer. Second, a CNN layer extracts the local features of position invariants. Third, a recurrent Bi-LSTM unit captures the actual context of the text. After that, a multi-head attention mechanism is applied to the outputs of the Bi-LSTM to capture the words in a text that are significantly related to long-range and encoding dependencies. The purpose of this is to assign effective weights to the concatenated text representation. The MHAT places different emphasis on different words in a comment and hence makes the semantic representations more informative. Finally, global average pooling with a sigmoid classifier transforms the vector into a high-level sentiment representation, which avoids model overfitting, and performs the sentiment polarity classification of comments.
This study focused on detecting sentiment polarity at the document level. In future work, we plan to verify the effectiveness of our DNN-MHAT model at other levels, such as sentence-level and aspect-level sentiment analysis, and on other sentiment analysis tasks, such as helpfulness and rating prediction.