The relationships between different aspects mentioned in the same sentence are modeled through additional attention layers. In addition, our model analyzes the relationships between words through entropy, introduces a global sentence representation into the existing attention mechanism, and designs additional auxiliary tasks to guide sentence learning. This paper also introduces position information and part-of-speech information to improve the selection ability of the model and thus predict sentiment polarity.
4.1. Sentiment Analysis Based on Entropy
Several recommendation systems make use of entropy measures such as fuzzy entropy [48], relative entropy [49], and maximum entropy [50]. It can be seen from Equation (1) that the information entropy $H(x)$ depends only on the probability distribution of the variable $x$ and not on its specific values. To some extent, this shows that information entropy can effectively avoid interference from noisy data and filter out users who contribute little scoring information to the scoring system. Users in the system affect the recommendation engine differently: some users' scores carry more information, while others carry less. Therefore, effectively filtering out users with little information can improve recommendation accuracy.
In order to introduce the user information entropy model into recommendation systems, for user $u$, the score set is represented by $R_u = \{r_{u,1}, r_{u,2}, \ldots, r_{u,n_u}\}$. In a scoring system with scores of 1 to 3, corresponding to positive, negative, and neutral, respectively, $r_{u,i} \in \{1, 2, 3\}$, where $r_{u,i}$ represents a score generated by user $u$ in the system. For user $u$, according to Equation (1), the information entropy is
$$H(u) = -\sum_{k=1}^{C} p_{k} \log p_{k},$$
where $C$ represents the number of scoring intervals ($C = 3$ in the three-point scoring system) and $p_{k}$ is the probability that user $u$'s score falls within interval $k$. The calculation of $p_{k}$ is as follows:
$$p_{k} = \frac{\sum_{i=1}^{n_u} \mathbb{I}(r_{u,i} = k)}{n_u}, \qquad (6)$$
where $n_u$ is the number of scores given by user $u$ and $\mathbb{I}(\cdot)$ is the indicator function, which equals 1 when its argument is true and 0 otherwise. Combining the entropy equation above with Equation (6), the information entropy can be calculated from the user's score values. From the perspective of information theory, shill users and the small number of normal users who produce noisy data are characterized by concentrated or extreme scoring, so this paper directly uses information entropy to measure the amount of information contained in a user's scores and filters out users with low information entropy in order to remove noisy data. For example, in Table 2 with scores of 1 to 3, if user $u$ evaluates 15 items and the scores are evenly distributed over 1 to 3, then the information entropy $H(u) = \log 3$ reaches its maximum. Because the scores are evenly distributed, it can be inferred that the user is relatively cautious and objective in scoring the corresponding items. In the other extreme case, user $u$ scores all items with 1, that is, $p_{1} = 1$ and $p_{2} = p_{3} = 0$, which, substituted into the formula, gives $H(u) = 0$. Therefore, the user's information entropy reaches its lowest value, and such data belong to noise. Intuitively, it can also be seen that this user's scoring behavior is too arbitrary and extreme, and its reliability is low.
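To make the calculation concrete, the sketch below computes the interval probabilities of Equation (6) and the resulting information entropy for a single user's score list, reproducing the two cases discussed above. The function names and the use of a base-2 logarithm are our own illustrative choices, not taken from the paper.

```python
import math
from collections import Counter

def score_probabilities(scores, C=3):
    """Equation (6): p_k is the fraction of the user's scores falling in interval k."""
    counts = Counter(scores)
    n = len(scores)
    return [counts.get(k, 0) / n for k in range(1, C + 1)]

def user_entropy(scores, C=3):
    """Information entropy H(u) = -sum_k p_k * log(p_k), skipping empty intervals."""
    return -sum(p * math.log2(p) for p in score_probabilities(scores, C) if p > 0)

user_entropy([1, 2, 3] * 5)   # evenly distributed scores -> log2(3) ~= 1.585 (maximum)
user_entropy([1] * 15)        # all scores equal to 1     -> 0.0 (minimum, noisy user)
```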
After this processing, a new scoring matrix $R'$ is obtained from the original scoring matrix $R$. Obviously, $R'$ has higher data quality, and a collaborative filtering model trained on $R'$ will also achieve higher recommendation accuracy.
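As a minimal sketch of this filtering step, the snippet below drops users whose score entropy falls below a threshold to form the filtered scoring matrix $R'$. The threshold value and the dictionary-of-lists representation of $R$ are illustrative assumptions, and the snippet reuses `user_entropy` from the sketch above.

```python
def filter_noisy_users(R, threshold=0.5):
    """Return the filtered scoring matrix R', keeping users with entropy >= threshold.

    R is assumed to be a dict mapping user id -> list of scores in {1, 2, 3};
    user_entropy() is the function defined in the previous sketch.
    """
    return {u: scores for u, scores in R.items() if user_entropy(scores) >= threshold}

R = {"careful_user": [1, 2, 3] * 5, "one_star_shill": [1] * 15}
R_prime = filter_noisy_users(R)   # only "careful_user" survives the filtering
```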
4.2. Model Architecture
When the interaction sequence is given non-uniform weights, the complexity of the model can be reduced, and the long-term information of the sequence can be captured more concisely. To achieve this effect, that is, to suppress the tendency of the attention weights toward a uniform distribution, reduce the number of actions that must be captured, and improve the ability to distinguish items, and considering that the attention weights are assigned adaptively by the model, we add an entropy regularization term on the attention weights to the original loss function, forming a new recommendation model structure, as shown in Figure 1. The main components of the model are described layer by layer from bottom to top.
(1) Input layer
The bottom layer of our model is the input layer, which is divided into three parts: the input of historical interaction items, the input of the target item, and the input of short-term sequential interaction items. The historical interaction items are represented by multi-hot encoding of their IDs. The target item is represented by one-hot encoding of its ID. The short-term sequential interaction items are likewise represented by multi-hot encoding of their IDs. The final result of the input layer is the encoded feature vector.
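The toy snippet below illustrates the two encodings used by the input layer: a one-hot vector for the single target item and a multi-hot vector for a set of interacted items. The vocabulary size and the example IDs are arbitrary illustrative values.

```python
import numpy as np

NUM_ITEMS = 10  # illustrative size of the item vocabulary

def one_hot(item_id, num_items=NUM_ITEMS):
    """One-hot encoding of the single target item ID."""
    v = np.zeros(num_items)
    v[item_id] = 1.0
    return v

def multi_hot(item_ids, num_items=NUM_ITEMS):
    """Multi-hot encoding of a set of interacted item IDs."""
    v = np.zeros(num_items)
    v[list(item_ids)] = 1.0
    return v

target_vec = one_hot(3)               # target item with ID 3
history_vec = multi_hot([0, 3, 7])    # historical interaction items
```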
(2) Embedding layer
The input layer is followed by the embedding layer, which is a fully connected layer. First, the original sequence is processed, and then the processed sequence is fed into the embedding layer to obtain the embedding vectors. The model can only handle sequences with a fixed length $n$, the maximum length of a training sequence. If the original sequence is longer than $n$, only the $n$ most recent actions are kept; if it is shorter than $n$, a 'padding item' is repeatedly added on the left. In this way, the input sequence is transformed into a fixed-length sequence $s = (s_{1}, s_{2}, \ldots, s_{n})$. A sparse vector can be embedded through a linear matrix, which gives it a meaningful dense representation. This property is very suitable for deep learning, especially in the recommendation field, where the recommendation list is determined according to the embedding similarity computed between users and items or between items.
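A minimal sketch of the preprocessing just described: sequences longer than $n$ keep only the $n$ most recent actions, and shorter ones are left-padded with a padding item (index 0 here, an illustrative convention).

```python
def to_fixed_length(seq, n, pad_item=0):
    """Truncate to the n most recent actions or left-pad with the padding item."""
    if len(seq) >= n:
        return list(seq[-n:])                     # keep the n most recent actions
    return [pad_item] * (n - len(seq)) + list(seq)

to_fixed_length([5, 9, 2], n=5)             # -> [0, 0, 5, 9, 2]
to_fixed_length([5, 9, 2, 7, 1, 4], n=5)    # -> [9, 2, 7, 1, 4]
```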
The sparse feature vector obtained from the input layer is transformed into a low-dimensional dense latent vector representation in the latent space. The sequence is embedded through the item embedding matrix $M \in \mathbb{R}^{|I| \times d}$, where $I$ represents the collection of all items and $d$ is the embedding dimension, which yields the embedding matrix $E$. In the sequential recommendation task, items follow a strict order, so position information also needs to be embedded. The self-attention mechanism cannot perceive position information; that is, exchanging the positions of elements in the sequence does not affect the final result. This failure to distinguish chronological order runs contrary to sequential recommendation. Therefore, a position embedding $P$ is added to the embedding matrix $E$ above to obtain the input embedding with position information, $\hat{E}$. Equation (7) gives the detailed definition:
$$\hat{E}_{i} = M_{s_{i}} + P_{i}. \qquad (7)$$
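The sketch below assembles the input embedding $\hat{E} = E + P$ of Equation (7) from lookup tables. In practice $M$ and $P$ are learned; the random initialization and the dimensions here are illustrative stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
num_items, n, d = 100, 5, 8               # |I|, max sequence length, embedding dim
M = rng.normal(size=(num_items + 1, d))   # item embedding matrix (row 0 = padding item)
P = rng.normal(size=(n, d))               # position embedding

def input_embedding(seq):
    """Equation (7): E_hat_i = M[s_i] + P_i for a fixed-length sequence."""
    E = M[np.asarray(seq)]                # look up item embeddings, shape (n, d)
    return E + P                          # add position information

E_hat = input_embedding([0, 0, 5, 9, 2])  # shape (n, d)
```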
(3) Stacking of self-attention blocks
This module is composed of one or more self-attention blocks stacked, and each self-attention block is composed of a self-attention layer, feed-forward network, residual connection, normalization layer, and dropout layer.
Self-attention layer: Equation (3) of the Transformer defines the scaled dot-product attention function. In essence, the attention function calculates the degree of correlation between $Q$ and $K$, distributes the weights according to that correlation, and computes the weighted sum of $V$. The input embedding $\hat{E}$ is converted into three matrices through learned projections $W^{Q}, W^{K}, W^{V} \in \mathbb{R}^{d \times d}$ and then fed into the self-attention function:
$$S = \mathrm{SA}(\hat{E}) = \mathrm{Attention}(\hat{E}W^{Q},\; \hat{E}W^{K},\; \hat{E}W^{V}).$$
The internal structure of the stacked self-attention blocks is shown in Figure 2.
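As a reference for the self-attention function above, here is a minimal numpy sketch of scaled dot-product self-attention with a causal mask so that each position only attends to earlier positions (the causal mask is standard in sequential recommendation, though the excerpt above does not spell it out). The randomly initialized projection matrices stand in for learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(E_hat, Wq, Wk, Wv, causal=True):
    """S = Attention(E_hat @ Wq, E_hat @ Wk, E_hat @ Wv) with scaling by sqrt(d)."""
    Q, K, V = E_hat @ Wq, E_hat @ Wk, E_hat @ Wv
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                       # (n, n) correlation of Q and K
    if causal:                                          # forbid attending to future items
        scores = np.where(np.tril(np.ones_like(scores)) == 1, scores, -1e9)
    A = softmax(scores, axis=-1)                        # attention weight matrix
    return A @ V, A                                     # weighted sum of V, plus the weights

rng = np.random.default_rng(0)
n, d = 5, 8
E_hat = rng.normal(size=(n, d))                         # stands in for the output of Equation (7)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
S, A = self_attention(E_hat, Wq, Wk, Wv)                # S: new representations, A: attention weights
```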
Feed-forward network: Considering that the self-attention layer is essentially a linear model and cannot capture nonlinear interactions among hidden features in different dimensions, a nonlinear activation function is needed to introduce nonlinearity; that is, a two-layer point-wise feed-forward network is added:
$$F_{i} = \mathrm{FFN}(S_{i}) = \mathrm{ReLU}(S_{i}W^{(1)} + b^{(1)})W^{(2)} + b^{(2)}, \qquad (9)$$
where $W^{(1)}, W^{(2)} \in \mathbb{R}^{d \times d}$ and $b^{(1)}, b^{(2)} \in \mathbb{R}^{d}$ are learnable parameters. The ReLU activation function is nonlinear, produces sparse activations, alleviates the problems of exploding and vanishing gradients, and converges quickly, which helps the model better mine relevant features.
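A direct transcription of the two-layer point-wise feed-forward network in Equation (9); the parameter arrays are assumed to be supplied (in practice they are learned).

```python
import numpy as np

def ffn(S, W1, b1, W2, b2):
    """Equation (9): FFN(S_i) = ReLU(S_i W1 + b1) W2 + b2, applied position-wise."""
    hidden = np.maximum(S @ W1 + b1, 0.0)   # ReLU nonlinearity
    return hidden @ W2 + b2
```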
Stacking of self-attention blocks: Stacking multiple self-attention blocks allows the model to learn more complex feature transformations. However, increasing the network depth in this way easily causes problems such as overfitting, vanishing gradients, and longer training times. Therefore, it is necessary to add a residual connection, a normalization layer, and a dropout layer. The stacking formula of multiple self-attention blocks is defined as follows:
$$g'(x) = x + \mathrm{Dropout}\big(g(\mathrm{LayerNorm}(x))\big), \qquad (10)$$
where $g$ represents the self-attention layer or the feed-forward network. LayerNorm is layer normalization, which is defined as
$$\mathrm{LayerNorm}(x) = \alpha \odot \frac{x - \mu}{\sqrt{\sigma^{2} + \epsilon}} + \beta,$$
where $\mu$ and $\sigma^{2}$ are the mean and variance of $x$, $\alpha$ and $\beta$ are learned scale and bias parameters, and $\odot$ denotes element-wise multiplication.
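A sketch of the block-stacking rule in Equation (10): each sublayer $g$ (self-attention or FFN) is wrapped with layer normalization, dropout, and a residual connection. The dropout rate, epsilon, and random seed are illustrative choices.

```python
import numpy as np

def layer_norm(x, alpha, beta, eps=1e-8):
    """LayerNorm(x) = alpha * (x - mean) / sqrt(var + eps) + beta, per position."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return alpha * (x - mu) / np.sqrt(var + eps) + beta

def sublayer(x, g, alpha, beta, drop_rate=0.2, training=True, seed=0):
    """Equation (10): x + Dropout(g(LayerNorm(x))), the residual-wrapped sublayer g."""
    out = g(layer_norm(x, alpha, beta))
    if training:                                     # inverted dropout on the sublayer output
        mask = np.random.default_rng(seed).random(out.shape) >= drop_rate
        out = out * mask / (1.0 - drop_rate)
    return x + out                                   # residual connection
```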
(4) Aggregation layer based on attention mechanism
A user interacts with more than one item, so the interaction function in the interaction layer produces multiple inner-product results between latent vectors. The purpose of the aggregation layer is to combine these inner-product results into a single representation so as to facilitate subsequent processing. Our model supports traditional aggregation strategies, such as max pooling.
Considering that different items contribute differently to the prediction, we use an attention-based aggregation strategy. The experimental part (Section 5) compares the attention-based aggregation strategy with common aggregation strategies, and the experimental results verify the effectiveness of the attention-based aggregation layer designed in this paper.
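The excerpt does not spell out the aggregation formula, so the sketch below shows one common form of attention-based aggregation as an assumption: a small scoring network assigns a weight to each item representation, and the weighted sum replaces max pooling. All parameter names and shapes are illustrative.

```python
import numpy as np

def attention_aggregate(H, W, b, q):
    """Aggregate item representations H (shape (n, d)) into a single vector.

    a_i = softmax_i( q . tanh(W h_i + b) );  output = sum_i a_i * h_i
    W: (d, d), b: (d,), q: (d,) are illustrative learnable parameters.
    """
    scores = np.tanh(H @ W + b) @ q            # one scalar relevance score per item
    scores = scores - scores.max()
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights @ H                          # attention-weighted sum instead of max pooling
```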
(5) Prediction layer
Using the idea of matrix factorization, at each time step $t$, the effective information $F_{t}$ extracted by the model and the item embeddings are combined through a dot product to calculate the score
$$r_{i,t} = F_{t} N_{i}^{\top},$$
which is then sorted for recommendation, where $r_{i,t}$ represents the possibility, given the interaction sequence $(s_{1}, s_{2}, \ldots, s_{t})$, that the next item predicted by the model is item $i$; $N \in \mathbb{R}^{|I| \times d}$ is the trained item embedding matrix, $I$ is the collection of all items, and $d$ is the dimension of the embedding vector.
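A sketch of this prediction step: the model output at step $t$ is dotted with every item embedding, and items are ranked by the resulting scores. The `top_k` cutoff is an illustrative parameter.

```python
import numpy as np

def predict_next(F_t, N, top_k=10):
    """r_{i,t} = F_t . N_i for every item i, then sort items by score for recommendation."""
    scores = N @ F_t                      # shape (|I|,): one relevance score per item
    ranked = np.argsort(-scores)          # item indices from highest to lowest score
    return ranked[:top_k], scores
```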
Specifically, an entropy regularization term is added to the original binary cross-entropy loss function in Equation (2). Initially, the distribution of the attention network is rather dispersed; thus, we add an entropy term to the loss function to make the distribution more concentrated. We call the result an entropy-enhanced attention network. The entropy regularization term is calculated from the self-attention matrix in the first self-attention block: the entropy value of each attention distribution is added to the loss function to form the new loss function in Equation (12), where $m$ denotes the audio and visual modalities, $r$ represents each distribution in $m$, the weight of the entropy term is a hyperparameter, and $K$ denotes the sentence length.
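Because the displayed loss is not fully recoverable from the excerpt, the following sketch shows one plausible reading of the entropy-enhanced objective: binary cross-entropy (Equation (2)) plus a weighted sum of the Shannon entropies of the rows of the first block's attention matrix. The weight `lam` and the exact form of the regularizer are assumptions, not the paper's confirmed formula.

```python
import numpy as np

def entropy_regularized_loss(y_true, y_pred, attention, lam=0.1, eps=1e-12):
    """Binary cross-entropy plus the entropy of each attention distribution.

    attention: self-attention matrix from the first block, rows summing to 1.
    Penalizing the row entropies pushes the weights toward concentrated,
    non-uniform distributions.
    """
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    bce = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    row_entropy = -np.sum(attention * np.log(attention + eps), axis=-1)
    return bce + lam * row_entropy.sum()
```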
The following Algorithm 1 shows the detailed flow used in the paper.
Algorithm 1: Recommendation based on attention networks with entropy function.
Input: user dataset User, item dataset Item, review dataset Review, vocabulary V
Output: user representation U, item representation I, recommendation list L
1: Initialize the embedding size, batch size, and number of negative samples;
2: for each epoch do
3:    split the datasets User, Item, and Review into
4:    training (80%), verification (10%), and testing (10%) sets;
5:    construct the input embedding $\hat{E}$ according to Equation (7);
6:    learn the FFN according to Equation (9);
7:    apply LayerNorm according to Equation (10);
8:    compute the loss function according to Equation (12);
9: end for
10: return the recommendation list L.