#### 3.2. Classifier Design

One of the novelties of our method lies in combining a BiLSTM network and a multi-head attention mechanism with emoticons to enhance microblog sentiment analysis. Specifically, we divide the classifier design into four layers, namely the input layer, BiLSTM layer, multi-head attention layer, and dense connection layer.

**Input layer:** the input layer generates the embeddings of words and emojis. After the above-mentioned preprocessing, the input microblog data are expressed as $\{x_1, \dots, x_i, \dots, x_m; y_1, \dots, y_j, \dots, y_n\}$, where $x_i$ is the $i$th word and $y_j$ is the $j$th emoticon in the microblog. After the words and emoticons are mapped to the corresponding embeddings, the input layer is represented as $\{w_1, \dots, w_i, \dots, w_m; e_1, \dots, e_j, \dots, e_n\} \in R^{(m+n)\times d}$, where $(m+n)$ is the total number of words and emoticons in the microblog and $d$ is the feature dimension.
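The lookup performed by the input layer can be sketched as follows; the vocabularies, token indices, and dimension $d$ are invented for illustration and are not the paper's actual settings:

```python
import numpy as np

rng = np.random.default_rng(5)
d = 8                                        # feature dimension (illustrative)
word_vocab = {'today': 0, 'is': 1, 'great': 2}
emoji_vocab = {'[smile]': 0, '[sun]': 1}
word_emb = rng.standard_normal((len(word_vocab), d))
emoji_emb = rng.standard_normal((len(emoji_vocab), d))

words = ['today', 'is', 'great']             # x_1 .. x_m
emojis = ['[smile]']                         # y_1 .. y_n

# map tokens to embeddings: {w_1..w_m; e_1..e_n} in R^{(m+n) x d}
W = np.stack([word_emb[word_vocab[x]] for x in words])
E = np.stack([emoji_emb[emoji_vocab[y]] for y in emojis])
inputs = np.concatenate([W, E], axis=0)      # shape (m + n, d)
```

In practice the embedding matrices would be pre-trained or learned jointly with the rest of the network; random initialization here only serves to make the shapes concrete.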

**BiLSTM layer:** LSTM is a variant of RNN that alleviates the vanishing gradient problem by introducing an input gate i, an output gate o, a forget gate f, and a memory cell state. This design improves the network's ability to retain input information over long spans, which makes LSTM well suited to modeling sequential data such as text. BiLSTM is a combination of a forward LSTM and a backward LSTM: the former processes the sequence from left to right, while the latter processes it from right to left. The biggest advantage of this structure is that the context on both sides of each position is fully considered. Next, we introduce the procedures of LSTM in detail.

An LSTM unit consists of three controlling gates, namely an input gate $i_t$, a forget gate $f_t$, and an output gate $o_t$, as well as a memory cell state $c_t$, all of which affect the unit’s ability to store and update information. The forget gate outputs a value between 0 and 1 based on the inputs $h_{t-1}$ and $w_t$ (see Equation (1)): an output of 1 means that the cell state information is completely retained, while an output of 0 means that it is completely discarded. Next, the input gate layer decides which values need to be updated, and a tanh layer creates a new candidate value vector $\tilde{c}_t$, which can be added to the cell state. Subsequently, the two are combined to update the cell state $c_t$ (see Equations (2)–(4)); finally, the output gate determines the output value based on the cell state (see Equations (5) and (6)):

$$f_t = \sigma(W_f w_t + U_f h_{t-1} + b_f) \qquad (1)$$

$$i_t = \sigma(W_i w_t + U_i h_{t-1} + b_i) \qquad (2)$$

$$\tilde{c}_t = \tanh(W_c w_t + U_c h_{t-1} + b_c) \qquad (3)$$

$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \qquad (4)$$

$$o_t = \sigma(W_o w_t + U_o h_{t-1} + b_o) \qquad (5)$$

$$h_t = o_t \odot \tanh(c_t) \qquad (6)$$

Among them, $W_f$, $U_f$, $b_f$, $W_i$, $U_i$, $b_i$, $W_c$, $U_c$, $b_c$, $W_o$, $U_o$, and $b_o$ are the internal training parameters of the LSTM, $\sigma(\cdot)$ is the sigmoid activation function, and $\odot$ denotes element-wise multiplication.
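As an illustration, the standard LSTM step referenced above as Equations (1)–(6) can be sketched in NumPy; the toy dimensions and random initialization are ours, not the paper's configuration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(w_t, h_prev, c_prev, params):
    """One LSTM step: w_t is the input embedding (d,),
    h_prev / c_prev are the previous hidden and cell states (k,)."""
    Wf, Uf, bf, Wi, Ui, bi, Wc, Uc, bc, Wo, Uo, bo = params
    f_t = sigmoid(Wf @ w_t + Uf @ h_prev + bf)      # forget gate, Eq. (1)
    i_t = sigmoid(Wi @ w_t + Ui @ h_prev + bi)      # input gate, Eq. (2)
    c_tilde = np.tanh(Wc @ w_t + Uc @ h_prev + bc)  # candidate values, Eq. (3)
    c_t = f_t * c_prev + i_t * c_tilde              # cell state update, Eq. (4)
    o_t = sigmoid(Wo @ w_t + Uo @ h_prev + bo)      # output gate, Eq. (5)
    h_t = o_t * np.tanh(c_t)                        # hidden state, Eq. (6)
    return h_t, c_t

# toy dimensions: d = 4 input features, k = 3 hidden units
rng = np.random.default_rng(0)
d, k = 4, 3
# 12 parameters in (W, U, b) triples for the f, i, c, o transforms
params = [rng.standard_normal((k, d)) if n % 3 == 0 else
          rng.standard_normal((k, k)) if n % 3 == 1 else
          np.zeros(k) for n in range(12)]
h, c = lstm_step(rng.standard_normal(d), np.zeros(k), np.zeros(k), params)
```

Because the hidden state is the product of a sigmoid gate and a tanh, every component of `h` stays within (−1, 1).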

The above is the calculation process of LSTM. As mentioned earlier, BiLSTM includes a forward LSTM and a backward LSTM. The forward $\overrightarrow{LSTM}$ reads the input from $w_1$ to $e_n$ to generate $\overrightarrow{h_t}$, and the backward $\overleftarrow{LSTM}$ reads the input from $e_n$ to $w_1$ to generate $\overleftarrow{h_t}$. The forward and backward context representations $\overrightarrow{h_t}$ and $\overleftarrow{h_t}$ are concatenated into one long vector, and the combined output is the representation of the input at the current time step:

$$h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$$

Finally, the output $[h_1, \dots, h_i, \dots, h_m, l_1, \dots, l_j, \dots, l_n]$ of the whole sentence is obtained, where $h_i$ and $l_j$ represent the hidden-layer outputs of words and emoticons, respectively. In addition, we set all of the intermediate layers in BiLSTM to return the complete output sequence, thereby ensuring that the output of each hidden layer retains long-distance information as much as possible.
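The bidirectional pass and concatenation can be sketched as follows; for brevity this uses a plain tanh recurrent cell as a stand-in for the LSTM unit, so it illustrates only the left-to-right/right-to-left structure, not the gating:

```python
import numpy as np

def run_rnn(inputs, W, U, b):
    """Run a simple tanh recurrent cell over a sequence (stand-in for LSTM)."""
    h = np.zeros(U.shape[0])
    outputs = []
    for x in inputs:
        h = np.tanh(W @ x + U @ h + b)
        outputs.append(h)
    return outputs

rng = np.random.default_rng(1)
d, k = 4, 3
seq = [rng.standard_normal(d) for _ in range(5)]   # w_1 .. e_n
Wf_, Uf_, bf_ = rng.standard_normal((k, d)), rng.standard_normal((k, k)), np.zeros(k)
Wb_, Ub_, bb_ = rng.standard_normal((k, d)), rng.standard_normal((k, k)), np.zeros(k)

fwd = run_rnn(seq, Wf_, Uf_, bf_)                  # left-to-right pass
bwd = run_rnn(seq[::-1], Wb_, Ub_, bb_)[::-1]      # right-to-left pass, realigned
hidden = [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]  # h_t = [fwd; bwd]
```

Note that the backward outputs are reversed again after the pass so that `hidden[t]` pairs the forward and backward states for the *same* position, yielding one 2k-dimensional vector per token.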

**Multi-head attention layer:** attention is a mechanism for improving the performance of RNN-based models, and its calculation is divided into three main steps. The first step is to use an attention function F to score the query against the key, yielding the score $s_i$; the two most common attention functions are additive attention and dot-product attention [35], and in this article we use the former. The second step is to normalize the scores $s_i$ with the softmax function, so as to obtain the weights $a_i$. The third step is to compute the attention output as the weighted average of all values with the weights $a_i$.
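The three steps can be sketched as follows; this is a generic additive-scoring example with made-up dimensions, not the exact model configuration:

```python
import numpy as np

def softmax(z):
    z = z - z.max()                  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def additive_attention(query, keys, values, W_q, W_k, v):
    # Step 1: score the query against each key with the additive function F
    scores = np.array([v @ np.tanh(W_q @ query + W_k @ k) for k in keys])
    # Step 2: softmax-normalize the scores s_i into weights a_i
    weights = softmax(scores)
    # Step 3: weighted average of the values with the weights a_i
    return weights @ values, weights

rng = np.random.default_rng(2)
k_dim, a_dim, n = 3, 5, 4
keys = rng.standard_normal((n, k_dim))
values = keys                        # keys double as values in this sketch
query = rng.standard_normal(k_dim)
W_q, W_k = rng.standard_normal((a_dim, k_dim)), rng.standard_normal((a_dim, k_dim))
v = rng.standard_normal(a_dim)
context, weights = additive_attention(query, keys, values, W_q, W_k, v)
```

The softmax in step 2 guarantees that the weights are non-negative and sum to one, so the step-3 result is a convex combination of the values.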

Figure 2 shows the flow chart of the attention mechanism.

Multi-head attention improves the traditional attention mechanism so that each head can extract features of the query and key in a different subspace. To be precise, these features come from Q and K, which are the projections of the query and key into that subspace. The operation shown in Figure 2 is performed once in each head, for a total of $h$ times. It should be noted that, in the multi-head attention mechanism, the attention function may be the scaled dot-product function, which is the same as in the traditional attention mechanism except for the scaling factor. In the experiments, $h$ must be tuned to determine the most suitable value for the task. Finally, the results returned by each head are concatenated and linearly transformed to obtain the multi-head attention output. Figure 3 shows the main idea. Next, we explain in detail how we use the multi-head attention mechanism for the task in this article.

Each word contributes differently to the conveyed sentiment, and the effect of combining words with emoticons also differs. Therefore, in this paper, we propose scoring emoticons and words, and then weighting the importance of each word in determining the emotional polarity of the microblog.

First, $w_i$ and $e_j$ are the embeddings corresponding to the words and emoticons in the microblog, respectively, where $w_i, e_j \in R^d$ and $d$ is the dimension of the word vector. Given these embeddings as input, the BiLSTM generates more abstract representations of the words and emojis, namely $h_i \in R^k$ and $l_j \in R^k$. Observing that many users post multiple emojis in the same microblog, or several identical emojis in a row, our method extracts only the non-repeated emoticons in a blog post and limits the number of distinct emoticons to at most 5. All extracted emojis are then combined to obtain the emoticon representation $q_v$.
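The deduplicate-and-cap extraction rule can be sketched as below; the function name and the bracketed emoticon-token convention are hypothetical, since the paper does not specify its tokenization:

```python
def extract_emoticons(tokens, max_emoticons=5):
    """Keep each distinct emoticon once, in order of first appearance,
    capped at max_emoticons, per the extraction rule described above."""
    seen = []
    for tok in tokens:
        if tok.startswith('[') and tok.endswith(']'):  # assumed emoticon marker
            if tok not in seen:
                seen.append(tok)
            if len(seen) == max_emoticons:
                break
    return seen

# repeated laughing emojis collapse to a single entry
result = extract_emoticons(['so', 'funny', '[laugh]', '[laugh]', '[ok]'])
print(result)  # → ['[laugh]', '[ok]']
```

The cap of 5 matches the limit stated above; runs of identical emojis therefore contribute only one representation to $q_v$.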

The attention function used in this paper is additive attention [36], which performs better in higher dimensions. We regard the word outputs $[h_1, h_2, \dots, h_m]$ of the BiLSTM hidden layers and the emoticon representation $q_v$ as the query and key in the attention mechanism, calculate the attentive scores, and perform a softmax normalization in order to obtain the weight $a_t$, which represents the importance of the $t$th word combined with emoticons for sentiment analysis:

$$s_t = v^T \tanh(W_h h_t + W_e q_v + b)$$

$$a_t = \frac{\exp(s_t)}{\sum_{t'=1}^{m}\exp(s_{t'})}$$

Among them, $W_h$ and $W_e$ are the weight matrices; $b$ is the bias; $v$ is the weight vector; and $v^T$ is the transpose of $v$. Accordingly, the output of each head is:

$$head_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$

Compared with an attention mechanism with only one head, multi-head attention allows the model to jointly attend to information in different feature subspaces, performing the attention function in parallel. Subsequently, we concatenate the results of all heads and apply a linear transformation, resulting in the final output of this layer:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(head_1, \dots, head_h)W^O$$

where $W_i^Q$, $W_i^K$, and $W_i^V$ are the parameter matrices that project Q, K, and V into different representation subspaces, $W^O$ is the output projection matrix, and $Q = q_v$, $K = V = [h_1, h_2, \dots, h_m]$.
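Putting these pieces together, the layer can be sketched as follows; the dimensions, head count, and random parameters are illustrative, and the final output projection is omitted for brevity:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def additive_head(q, K, V, W_h, W_e, b, v):
    """One head: additive scores of each key against the emoticon query,
    softmax weights a_t, then the weighted sum of the values."""
    scores = np.array([v @ np.tanh(W_h @ k + W_e @ q + b) for k in K])
    a = softmax(scores)
    return a @ V

def multi_head(q_v, H, heads):
    """Project Q, K, V per head, attend, and concatenate the head outputs."""
    outs = []
    for (WQ, WK, WV, W_h, W_e, b, v) in heads:
        q = WQ @ q_v            # Q = q_v projected into this head's subspace
        K = H @ WK.T            # K = [h_1..h_m] projected
        V = H @ WV.T            # V = [h_1..h_m] projected
        outs.append(additive_head(q, K, V, W_h, W_e, b, v))
    return np.concatenate(outs)

rng = np.random.default_rng(3)
k_dim, p_dim, a_dim, m, n_heads = 6, 3, 4, 5, 2
H = rng.standard_normal((m, k_dim))        # word outputs h_1 .. h_m
q_v = rng.standard_normal(k_dim)           # emoticon representation
heads = [(rng.standard_normal((p_dim, k_dim)),   # W_i^Q
          rng.standard_normal((p_dim, k_dim)),   # W_i^K
          rng.standard_normal((p_dim, k_dim)),   # W_i^V
          rng.standard_normal((a_dim, p_dim)),   # W_h
          rng.standard_normal((a_dim, p_dim)),   # W_e
          rng.standard_normal(a_dim),            # b
          rng.standard_normal(a_dim))            # v
         for _ in range(n_heads)]
out = multi_head(q_v, H, heads)            # shape: (n_heads * p_dim,)
```

Each head attends over the same word sequence but through its own projections, so the concatenated output mixes evidence from the different subspaces.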

**Dense connection layer:** finally, we feed the vectors from the previous layer into the densely connected layers. We use the ReLU function as the activation function to perform the nonlinear transformation. In the last densely connected layer, we apply a softmax operation to the output of the previous layer and finally obtain the sentiment classification of the microblog.
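A minimal sketch of this classification head, assuming one ReLU hidden layer and a binary positive/negative output (the actual layer sizes and class count are not specified here):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def classify(features, W1, b1, W2, b2):
    """Dense + ReLU for the nonlinear transformation,
    then dense + softmax over the sentiment classes."""
    hidden = relu(W1 @ features + b1)
    return softmax(W2 @ hidden + b2)

rng = np.random.default_rng(4)
feat_dim, hid_dim, n_classes = 6, 4, 2      # e.g. positive / negative
W1, b1 = rng.standard_normal((hid_dim, feat_dim)), np.zeros(hid_dim)
W2, b2 = rng.standard_normal((n_classes, hid_dim)), np.zeros(n_classes)
probs = classify(rng.standard_normal(feat_dim), W1, b1, W2, b2)
label = int(np.argmax(probs))               # predicted sentiment class
```

The softmax output is a probability distribution over the classes, and the predicted polarity is simply its argmax.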