Enhancing Explainable Recommendations: Integrating Reason Generation and Rating Prediction through Multi-Task Learning

Zhu, Xingyu; Xia, Xiaona; Wu, Yuheng; Zhao, Wenxu

doi:10.3390/app14188303


¹ School of Computer Science, Qufu Normal University, Rizhao 276826, China

² Faculty of Education, Qufu Normal University, Qufu 273165, China

³ Chinese Academy of Education Big Data, Qufu Normal University, Qufu 273165, China

* Author to whom correspondence should be addressed.

† These authors contributed equally to this work.

Appl. Sci. 2024, 14(18), 8303; https://doi.org/10.3390/app14188303

Submission received: 9 July 2024 / Revised: 8 September 2024 / Accepted: 11 September 2024 / Published: 14 September 2024

(This article belongs to the Special Issue Applied and Innovative Computational Intelligence Systems: 3rd Edition)


Abstract

In recent years, recommender systems, which provide personalized recommendations by analyzing users’ historical behavior to infer their preferences, have become essential tools across various domains, including e-commerce, streaming media, and social platforms. They play a crucial role in enhancing user experience by mining vast amounts of data to identify what is most relevant to users. Among these, deep learning-based recommender systems have demonstrated exceptional recommendation performance. However, these “black-box” systems lack reasonable explanations for their recommendation results, which reduces their impact and credibility. To address this situation, an effective strategy is to provide a personalized textual explanation along with the recommendation. This approach has received increasing attention from researchers because it can enhance users’ trust in recommender systems through intuitive explanations. In this context, our paper introduces a novel explainable recommendation model named GCLTE. This model integrates Graph Contrastive Learning with transformers within an Encoder–Decoder framework to perform rating prediction and reason generation simultaneously. In addition, we combine the neural network layer with the transformer using a straightforward information enhancement operation. Finally, our extensive experiments on three real-world datasets demonstrate the effectiveness of GCLTE in both recommendation and explanation. The experimental results show that our model outperforms the top existing models.

Keywords:

explainable recommendation; multi-task learning; contrastive learning; rating prediction; reason generation; recommendation algorithms

1. Introduction

With the rise of e-commerce platforms and video websites, internet information is growing exponentially, leading to severe information overload and driving rapid development in recommender system technology. Recommender systems mitigate information overload by identifying relevant content from vast amounts of data, and they are widely used across various fields. E-commerce platforms (e.g., Amazon) use them to recommend products of potential interest and improve the shopping experience, while streaming and social media platforms (e.g., TikTok) use them to recommend content users may enjoy and to connect users with friends they might know. Traditional recommender systems, including collaborative filtering and content-based algorithms, primarily aim to provide accurate recommendations. Interestingly, the logical structure of these systems often yields explanations that align with user intuition.

Advancements in computer hardware and big data storage have propelled deep learning techniques to the forefront of computer science. Applying deep learning in recommender systems shows great promise [1]. Recent studies indicate that Contrastive Learning (CL) significantly enhances performance [2,3]. Graph data augmentation strategies create diverse user–item interaction graphs, providing additional supervised signals [4,5]. However, performance gains in CL-based recommendation models primarily stem from the contrastive loss mechanism, specifically InfoNCE [6], a form of Noise Contrastive Estimation that is widely used in Contrastive Learning to maximize the mutual information between positive sample pairs, rather than graph-based strategies [7].

However, while deep learning has expanded possibilities in recommender systems, it has also increased the opacity of these models, thus complicating their explainability [8,9]. We believe that if the recommender system provides a personalized explanation consistent with the user’s rating prediction, this will significantly enhance user trust and subsequently increase the system’s effectiveness.

Recommendation explanation methods are categorized into two types: embedded and post hoc [9,10]. Embedded methods modify the recommender system’s structure to enhance explainability [11,12,13,14], while post hoc methods use an external model to explain opaque recommendation models [15,16,17,18]. Embedded methods aim to enhance prediction performance and internal transparency, revealing the model’s decision-making process but often failing to deliver high-quality recommendation explanations [10]. Post hoc methods generate explanations for complex models by creating separate modules, decoupling the recommendation and explanation tasks. This decoupling makes model design more flexible and allows high-performance models that are otherwise difficult to explain to be given explanations.

In recent years, several explainable recommendation methods based on post-processing techniques have been developed. These methods generate natural text explanations for personalized recommendations using natural language processing [19,20,21,22]. We note that the Encoder–Decoder-based model offers good portability and structural simplicity. Effective results are achieved by transferring embedded latent representations between two tasks within this straightforward multi-task learning framework [15,16]. Meanwhile, the powerful language modeling capabilities of the transformer [23] have demonstrated significant potential for generating comments in interpretable recommendation systems.

The main contributions of this paper are as follows:

Based on the Encoder–Decoder framework, we integrate CL concepts with the transformer to propose a concise multi-task learning model that achieves accurate rating prediction and generates recommendation reasons. This multi-task learning model, enhanced by CL with latent representations, positively impacts recommendation results and additionally enhances the transformer’s output.

We introduce a straightforward, yet effective, information augmentation strategy. We adapt the general Graph Neural Network paradigm, incorporating CL to suit our multi-task learning model’s encoder. Additionally, we enhance the transformer’s inputs with CL-modified latent representations to generate personalized recommendation reasons.

Finally, we conducted experiments on three benchmark datasets. Extensive testing demonstrates that our proposed model surpasses existing models in both recommendation accuracy and explainability. Furthermore, this well-structured and portable multi-task learning model holds significant potential for future enhancements.

2. Related Work

The importance of generating explainable reasons from recommendation results was discussed in Section 1. The multi-tasking model presented in this paper includes rating prediction based on Graph Contrastive Learning and generating reason explanations using a transformer. In this section, we will review related work in these areas and examine multi-tasking recommendation architectures similar to our model.

2.1. CL-Based Rating Prediction

Graph Convolutional Networks (GCNs) [24], a variant of Graph Neural Networks (GNNs) [25], have emerged as a dominant force in research on DNN-based recommendation models due to their effective processing of structured data. To address the persistent data sparsity issue in GCN-based models, many researchers have incorporated CL into these systems to enhance recommendations through self-supervised learning [26].

Most previous contrastive recommendation systems have utilized graph augmentations for data enhancement [4,5,27]. A prime example is Self-supervised Graph Learning (SGL) [4], which incorporates node dropout, edge dropout, and random walk techniques. However, a study [28] developed SGL-WA, a variant without graph augmentations, and showed that augmentations are not essential for CL-based recommendations. The primary boost comes from the contrastive loss, InfoNCE. Building on prior research, Extremely Simple Graph Contrastive Learning (XSimGCL) [7] proposed integrating cross-layer contrast into the recommendation task to simplify the calculation process.

2.2. Explanation of Recommender Systems

Explanation in recommender systems initially focused on extracting features from users’ comments to explain recommendation results. Explicit Factor Models (EFMs) [29] extract product features from reviews and use matrix decomposition on a feature users–items matrix for recommendations. Mapping latent factors to specific features provides feature-level explanations. An Attentional Factorization Machine (AMF) [8] enhances Matrix Factorization by incorporating User Aspect Preference (UAP) and Item Aspect Quality (IAQ), using them as constraints in item rating matrix decomposition to boost recommendation performance. These methods enhance accuracy but their explanations remain limited.

Recently, advancements in text generation within natural language [30] have led to methods focusing on generating textual explanations rather than feature-level ones. These explanations, derived from review data, aim to give users the impression of a recommendation from a friend. The use of natural language generation for crafting these explanations has gained considerable attention. Earlier work in natural language generation relied on Recurrent Neural Networks (RNNs). For example, the Att2Seq model [31] incorporates an attention mechanism into the Seq2Seq framework, encoding input attributes into vectors using a Multilayer Perceptron (MLP) and generating words with a multilayer Long Short-Term Memory (LSTM) network [32]. Neural Rating Regression with Abstractive Tips Generation (NRT) [15] uses the Encoder–Decoder framework by encoding attribute features into vector form with an MLP and decoding these vectors into explanatory text with a Gated Recurrent Unit (GRU) [33]. It optimizes recommendation and natural language generation tasks within a multi-task learning framework.

One task of this paper is to explore the promising potential of transformers [23]. Their effectiveness in natural language understanding and generation is well documented [34,35]. Recently, the adoption of the transformer model and its success in natural language processing have inspired new approaches in recommender systems. PETER [22] employed a personalized transformer for explainable recommendations, showcasing the extensive potential of transformers in this domain.

2.3. Multi-Task Recommendation

This paper’s multi-task recommendation considers both the rating prediction and reason generation tasks. Neural Template Explanation Generation (NETE) [36] inputs specific features into a “neural template” to generate high-quality explanations alongside recommendations. Joint Multi-task Learning of Ratings and Review Summaries (J3R) [37] integrates MLP and pointer–generator networks within a multi-task framework to predict ratings and generate review summaries using shared latent representations of users and items. Encoder–Decoder and MLP-based Explainable Recommendation (EMER) [16], based on the Encoder–Decoder architecture inspired by NRT, incorporates a bidirectional recurrent network in the decoder and adds an attention mechanism. Co-Attentive Multi-Task Learning (CAML) [17] introduces a hierarchical co-attentive selector to integrate recommendation and explanation tasks.

The multi-task learning model proposed in this paper closely resembles the NRT described in Section 2.2. It uses the output of the recommendation task as input for the explanation task, optimizing both simultaneously within the same framework.

3. Problem Definition and Acronyms

3.1. Problem Definition

In this section, we define the problem formally. We introduce the dataset $D = \{U, V, R, C\}$, consisting of $U$ as the set of users, $V$ as the set of items, $R$ as the set of ratings, and $C$ as the set of user reviews of items. Given a user–item pair, our model’s task is to provide a predicted rating $\hat{r}_{ij}$ that indicates user $u_{i}$’s preference for item $v_{j}$. Additionally, it generates personalized explanatory text to justify the recommendation. The problem is formalized as follows:

Definition 1.

Given a user set $U$, item set $V$, rating set $R$, and review set $C$, the problem involves predicting the rating $\hat{r}_{ij}$ of a user $u_{i}$ for an unrated item $v_{j}$ and generating a recommendation rationale

d_{(i, j)} = \{d_{(i, j)}^{1}, d_{(i, j)}^{2}, d_{(i, j)}^{3}, \dots, d_{(i, j)}^{k}\}

where $d_{(i, j)}^{k}$ denotes the $k$th word of the generated text.

3.2. Acronyms

For ease of reference and clarity, this section provides a summary of all acronyms used in this paper and lists them in Table A1.

4. Our Model

In this section, we introduce GCLTE, our multi-task learning model designed for explainable recommendations. This model integrates Graph Contrastive Learning with transformers within an Encoder–Decoder framework to perform rating prediction and reason generation simultaneously. In our research, we found that the Encoder–Decoder structure effectively separates the recommendation task from the explanation task in explainable recommendations. Consequently, we designed a CL-based graph encoder and a transformer decoder to fully leverage the association between these tasks. Figure 1 depicts the specific architecture of GCLTE. The left section of the figure displays an N-layer CL-based GCN acting as the encoder. It extracts features from user $u_{i}$ and item $v_{j}$ and enhances the feature distribution in each convolutional layer through noise-based Contrastive Learning. After Contrastive Learning has been performed layer by layer through the convolutional and final layers, the predicted rating $\hat{r}_{ij}$ is obtained, and the current state vector $E_{hidden}$ is output. The right section of Figure 1 illustrates an N-layer transformer built on the attention mechanism. Here, our model concatenates the state vector $E_{hidden}$ from the encoder with the review embeddings from the dataset as input to the decoder. The transformer in the decoder then decodes the concatenated embedding to generate the text explanations. Next, we will detail each component of our model and explain how they can be collectively optimized within a multi-task learning framework.

4.1. CL-Based Graph Convolutional Neural Network Encoder

We constructed a unique CL-based GCN encoder. In this section, we will explain how it contributes to predictive rating generation. Learning vector representations, or embeddings, of users and items forms the core of modern recommender systems. From initial matrix decomposition techniques to recent deep learning-based methods, embeddings are typically derived by mapping from existing features that describe the users or items. This approach causes the encoding process to overlook the collaborative signal within user–item interactions. Later studies introduced the concept of high-order connectivity by incorporating GCNs into recommender systems. They achieve this by stacking multilayer graph convolutional layers to explore high-order connectivity between users and items.

4.1.1. Graph Convolution

Initially, for a user $u_{i}$ and an item $v_{j}$, we obtain their embedding vectors $e_{u_{i}}$ and $e_{v_{j}}$. Unlike Matrix Factorization (MF) or Neural Collaborative Filtering (NCF) models, our process follows LightGCN’s approach, performing embedding propagation among connected users and items. The key aggregation operations are as follows:

Message Construction: For a user–item connection pair $(u, i)$, we define the message from item $i$ to user $u$ as:

m_{u \leftarrow i} = f (e_{i}, e_{u}, p_{(u, i)})

(1)

The decay factor $p_{(u, i)}$ controls the propagation strength on edge $(u, i)$, indicating that the message’s intensity should diminish with the path’s length. The message encoding function $f (\cdot)$ is defined as:

m_{u \leftarrow i} = \frac{1}{\sqrt{| N_{u} | | N_{i} |}} \cdot e_{i}

(2)

where $p_{(u, i)}$ is set to $\frac{1}{\sqrt{| N_{u} | | N_{i} |}}$ and $N_{u}$ and $N_{i}$ denote the first-hop neighbors of $u$ and $i$, respectively.

Message Aggregation: This process collects the messages propagated by $u$’s neighbors to enhance $u$’s representation. The aggregation function is defined as:

e_{u}^{(1)} = \sum_{i \in N_{u}} m_{u \leftarrow i}

(3)

where $e_{u}^{(1)}$ denotes the representation of user $u$ after the first embedding propagation layer and the summation aggregates over the neighbors $N_{u}$. We apply the same aggregation method to obtain $e_{i}^{(1)}$. Based on this, we can compute higher-level embeddings:

e_{u}^{(k + 1)} = \sum_{i \in N_{u}} \frac{1}{\sqrt{| N_{u} | | N_{i} |}} e_{i}^{(k)}, \quad e_{i}^{(k + 1)} = \sum_{u \in N_{i}} \frac{1}{\sqrt{| N_{i} | | N_{u} |}} e_{u}^{(k)}

(4)

The matrix equivalent of Equation (4) is:

E^{(k + 1)} = (D^{- \frac{1}{2}} A D^{- \frac{1}{2}}) E^{(k)},

(5)

where $A$ represents the user–item interaction adjacency matrix for $M$ users and $N$ items and $D$ is a diagonal matrix of size $(M + N) \times (M + N)$. The element $D_{ii}$ indicates the number of non-zero elements in the $i$th row of $A$.

In traditional GCNs, no layer combination is employed, and the final output is simply the embedding from the last layer. However, this approach often leads to uniform feature representations across nodes after multiple convolutions, failing to meet the specific needs of recommender systems [38]. We consider the embeddings from all convolutional layers when forming the final user–item representation, as depicted in Figure 2. This method aims to achieve a graph convolution structure that incorporates self-connectivity:

e_{u} = \sum_{k = 0}^{K} a_{k} e_{u}^{(k)}, e_{i} = \sum_{k = 0}^{K} a_{k} e_{i}^{(k)}

(6)

The weight $a_{k}$, defined as $\frac{1}{k + 1}$ and always non-negative, represents the influence of the $k$th layer in synthesizing the final embedding. As illustrated in Figure 2, the matrix representation of Equation (6) can be written as follows:

\begin{aligned} E &= a_{0} E^{(0)} + a_{1} E^{(1)} + a_{2} E^{(2)} + \dots + a_{K} E^{(K)} \\ &= a_{0} E^{(0)} + a_{1} \tilde{A} E^{(0)} + a_{2} {\tilde{A}}^{2} E^{(0)} + \dots + a_{K} {\tilde{A}}^{K} E^{(0)} \end{aligned}

(7)

where $\tilde{A} = D^{- \frac{1}{2}} A D^{- \frac{1}{2}}$.
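As a rough illustration of Equations (4) to (7), the following sketch propagates embeddings over a tiny bipartite graph and combines the layers with the weights $a_{k} = \frac{1}{k + 1}$ stated above. The toy interactions, the embedding values, and the helper names (`normalized_adjacency`, `matmul`, `propagate`) are assumptions made for this example and are not taken from the GCLTE implementation, which would use sparse tensors rather than dense lists.

```python
import math


def normalized_adjacency(num_users, num_items, interactions):
    """Return A_tilde = D^{-1/2} A D^{-1/2} for the bipartite user-item graph."""
    n = num_users + num_items
    adj = [[0.0] * n for _ in range(n)]
    for u, i in interactions:
        # Items are indexed after users, so item i is node num_users + i.
        adj[u][num_users + i] = 1.0
        adj[num_users + i][u] = 1.0
    degree = [sum(row) for row in adj]
    return [
        [adj[r][c] / math.sqrt(degree[r] * degree[c]) if adj[r][c] else 0.0
         for c in range(n)]
        for r in range(n)
    ]


def matmul(a, b):
    """Plain dense matrix product of two lists of lists."""
    return [[sum(a[r][k] * b[k][c] for k in range(len(b)))
             for c in range(len(b[0]))] for r in range(len(a))]


def propagate(a_tilde, e0, num_layers):
    """Combine layer embeddings as in Equation (7): E = sum_k a_k A_tilde^k E^(0)."""
    weights = [1.0 / (k + 1) for k in range(num_layers + 1)]  # a_k = 1 / (k + 1)
    layer = e0
    combined = [[weights[0] * x for x in row] for row in e0]
    for k in range(1, num_layers + 1):
        layer = matmul(a_tilde, layer)  # E^(k) = A_tilde E^(k-1), Equation (5)
        combined = [[c + weights[k] * x for c, x in zip(crow, lrow)]
                    for crow, lrow in zip(combined, layer)]
    return combined


# Hypothetical toy graph: 2 users, 2 items, 3 interactions, 2-dimensional embeddings.
interactions = [(0, 0), (0, 1), (1, 1)]
a_tilde = normalized_adjacency(2, 2, interactions)
e0 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
final_embeddings = propagate(a_tilde, e0, num_layers=2)
```

Because no feature transformation or nonlinearity is applied, every layer is a single multiplication by $\tilde{A}$, which is what makes the closed form in Equation (7) possible.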

4.1.2. Noise-Based Contrastive Learning

We adopt the method proposed by SimGCL [28] for noise-based enhancement of embeddings, as illustrated in Figure 3. This involves adding random noise to the node vectors in each convolutional layer described in Section 4.1.1. The equations are expressed as follows:

e_{i}^{\prime} = e_{i} + \Delta_{i}^{\prime}, \quad e_{i}^{\prime\prime} = e_{i} + \Delta_{i}^{\prime\prime}

(8)

Here, $\Delta_{i}^{\prime}$ and $\Delta_{i}^{\prime\prime}$ represent noise perturbations in different directions, and their $L_{2}$ norm is maintained at a small constant value. This constraint keeps the perturbation small, so that the added noise mainly alters the direction, rather than the magnitude, of the node vectors. Additionally, each perturbation must be confined to the same hyperoctant as the node vector. $e_{i}^{\prime}$ and $e_{i}^{\prime\prime}$ are the resulting new node vectors in different directions after applying the perturbations. The mathematical equation for $\Delta$ is expressed as follows:

\Delta = \omega \odot \operatorname{sign} (e_{i}), \quad \omega \in \mathbb{R}^{d} \sim U (0, 1)

(9)

From this, the matrix representation, as detailed in Equation (7) during model training, can be redefined as follows:

\begin{aligned} E^{\prime} = \frac{1}{L} \Big( & (\tilde{A} E^{(0)} + \Delta^{(1)}) + (\tilde{A} (\tilde{A} E^{(0)} + \Delta^{(1)}) + \Delta^{(2)}) + \dots \\ & + ({\tilde{A}}^{L} E^{(0)} + {\tilde{A}}^{L - 1} \Delta^{(1)} + \dots + \tilde{A} \Delta^{(L - 1)} + \Delta^{(L)}) \Big) \end{aligned}

(10)

Equation (10) indicates that at each graph convolution layer, distinct noise vectors are added to the current node embeddings. The role of CL will be verified in subsequent ablation experiments.
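The perturbation in Equations (8) and (9) can be sketched as follows. The rescaling of the noise to a fixed $L_{2}$ norm `eps` is an assumption based on the constraint described in the text; the function names, the example vector, and the value of `eps` are hypothetical.

```python
import math
import random


def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def perturb(embedding, eps, rng):
    """Add sign-aligned uniform noise, rescaled so that its L2 norm equals eps."""
    # omega ~ U(0, 1), multiplied element-wise by sign(e_i) as in Equation (9).
    raw = [rng.random() * sign(e) for e in embedding]
    norm = math.sqrt(sum(r * r for r in raw))
    if norm == 0.0:
        return list(embedding)  # nothing to perturb (all-zero embedding)
    return [e + eps * r / norm for e, r in zip(embedding, raw)]


rng = random.Random(0)
node = [0.3, -0.7, 0.2]          # hypothetical node embedding e_i
view_a = perturb(node, 0.1, rng)  # e_i'
view_b = perturb(node, 0.1, rng)  # e_i''
```

Multiplying by `sign(e)` keeps every noise component on the same side of zero as the corresponding embedding component, which is how the sketch keeps the perturbed vector in the same region of the space as the original.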

4.2. Transformer Decoder Based on Attention Mechanism

The transformer has demonstrated powerful language modeling capabilities. However, integrating it as an explanatory module within a recommender system poses challenges, particularly in linking it with the recommendation component. Our study specifically explores how it can account for the potential connections between users and items while generating personalized text content. GCLTE is designed to integrate the recommendation and explanation tasks in a straightforward and efficient manner. This section describes the complete processing pipeline of the transformer-based decoder. The overall pseudocode for our transformer decoder is shown in Algorithm 1.

Algorithm 1 Transformer decoder.

Input: $x \in V^{*}$, a sequence of token IDs; ${Emb}_{hidden} \in \mathbb{R}^{d_{e} \times d_{gcn}}$, a hidden state embedding matrix.
Output: $P \in {(0, 1)}^{N_{V} \times \text{length} (x)}$, where each column of $P$ is a distribution over the vocabulary.
Hyperparameters: $\ell_{max}, L, H, d_{e}, d_{mlp} \in \mathbb{N}$.
Parameters: $\theta$ includes all of the following parameters:
  $W_{e} \in \mathbb{R}^{d_{e} \times N_{V}}$, $W_{p} \in \mathbb{R}^{d_{e} \times \ell_{max}}$, token and positional embedding matrices.
  For $l \in [L]$:
    $W_{l}$, multi-head attention parameters for layer $l$, see Equation (12),
    $\gamma_{l}^{1}, \beta_{l}^{1}, \gamma_{l}^{2}, \beta_{l}^{2} \in \mathbb{R}^{d_{e}}$, two sets of layer-norm parameters,
    $W_{mlp 1}^{l} \in \mathbb{R}^{d_{mlp} \times d_{e}}$, $b_{mlp 1}^{l} \in \mathbb{R}^{d_{mlp}}$, $W_{mlp 2}^{l} \in \mathbb{R}^{d_{e} \times d_{mlp}}$, $b_{mlp 2}^{l} \in \mathbb{R}^{d_{e}}$, MLP parameters.
  $\gamma, \beta \in \mathbb{R}^{d_{e}}$, final layer-norm parameters.
  $W_{u} \in \mathbb{R}^{N_{V} \times d_{e}}$, the unembedding matrix.

1: $\ell \leftarrow \text{length} (x)$
2: for $t \in [\ell]$: $e_{t} \leftarrow W_{e} [:, x [t]] + W_{p} [:, t]$
3: $X \leftarrow [e_{1}, e_{2}, \dots, e_{\ell}] + {Emb}_{hidden}$
4: for $l = 1, 2, \dots, L$ do
5:   for $t \in [\ell]$: $\tilde{X} [:, t] \leftarrow \text{layer\_norm} (X [:, t] \mid \gamma_{l}^{1}, \beta_{l}^{1})$
6:   $X \leftarrow X + \text{MHAttention} (\tilde{X} \mid W_{l}, \text{Mask} [t, t^{\prime}] = [[t \leq t^{\prime}]])$
7:   for $t \in [\ell]$: $\tilde{X} [:, t] \leftarrow \text{layer\_norm} (X [:, t] \mid \gamma_{l}^{2}, \beta_{l}^{2})$
8:   $X \leftarrow X + W_{mlp 2}^{l} \, \text{GELU} (W_{mlp 1}^{l} \tilde{X} + b_{mlp 1}^{l} 1^{\top}) + b_{mlp 2}^{l} 1^{\top}$
9: end for
10: for $t \in [\ell]$: $X [:, t] \leftarrow \text{layer\_norm} (X [:, t] \mid \gamma, \beta)$
11: return $P = \text{softmax} (W_{u} X)$

4.2.1. Augmenting Input Information with Hidden Information from Neural Networks

In the decoder section of our model, we concatenate the final embedding from the GCN encoder with the text embedding. This operation increases the dimension of the embedding, and the resulting augmented embedding serves as the input to the decoder. The matrix representation is as follows:

E_{input} = (E_{hidden}^{\top}, E_{text})

(11)

where $E_{text}$ is the embedding matrix derived from the comments and $E_{hidden}^{\top}$ is the transposed final-layer embedding matrix from the encoder. This configuration allows the decoder to consider the encoder’s final-layer state at each step of text generation, enhancing the personalization of the output text.

4.2.2. Self-Attention-Based Transformer Layer

In the decoder of GCLTE, we primarily utilize the transformer [23] for generating explanatory text. The transformer consists of $L$ identical layers, each featuring multi-head self-attention and a position-wise feed-forward network. Layer $l$ transforms the output $S_{l - 1}$ from layer $l - 1$ into $S_{l}$, where $l$ ranges from 1 to $L$. In the $H$-head attention layer, each attention head calculates the score identically. The score for one of the $H$ attention heads in layer $l$ is calculated as follows:

\begin{aligned} Att (Q_{l}, K_{l}, V_{l}) &= softmax \left( \frac{Q_{l} K_{l}^{\top}}{\sqrt{d}} \right) V_{l}, \\ Q_{l} &= S_{l - 1} W_{l}^{Q}, \\ K_{l} &= S_{l - 1} W_{l}^{K}, \\ V_{l} &= S_{l - 1} W_{l}^{V} \end{aligned}

(12)

where $S_{l - 1} \in \mathbb{R}^{| S | \times d}$ represents the output of layer $l - 1$ and $d$ is the dimension of the embedding. $W_{l}^{Q}$, $W_{l}^{K}$, and $W_{l}^{V}$ are the projection matrices.
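To make Equation (12) concrete, the sketch below computes a single attention head over a short sequence. It applies a causal mask, so each position attends only to itself and to earlier positions; this masking is an assumption consistent with autoregressive text generation rather than something Equation (12) states. The identity projection matrices and the two-token input are hypothetical values chosen to keep the arithmetic easy to follow.

```python
import math


def matmul(a, b):
    """Plain dense matrix product of two lists of lists."""
    return [[sum(a[r][k] * b[k][c] for k in range(len(b)))
             for c in range(len(b[0]))] for r in range(len(a))]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]


def causal_attention(s_prev, w_q, w_k, w_v):
    """One attention head: softmax(Q K^T / sqrt(d)) V with a causal mask."""
    q, k, v = matmul(s_prev, w_q), matmul(s_prev, w_k), matmul(s_prev, w_v)
    d = len(q[0])
    output = []
    for t, query in enumerate(q):
        # Position t only sees positions 0..t (causal mask).
        scores = [sum(a * b for a, b in zip(query, k[s])) / math.sqrt(d)
                  for s in range(t + 1)]
        weights = softmax(scores)
        output.append([sum(w * v[s][c] for s, w in enumerate(weights))
                       for c in range(len(v[0]))])
    return output


identity = [[1.0, 0.0], [0.0, 1.0]]   # hypothetical projection matrices
tokens = [[1.0, 0.0], [0.0, 1.0]]     # hypothetical S_{l-1} with |S| = 2, d = 2
attended = causal_attention(tokens, identity, identity, identity)
```

In a multi-head layer, the same computation runs once per head with separate projection matrices, and the per-head outputs are concatenated before the feed-forward network.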

4.3. Multi-Task Learning Loss Function

In this section, we frame the rating prediction task as a regression problem. The loss function is represented as follows:

L_{rating} = \frac{1}{| X |} \sum_{u_{i} \in U, v_{j} \in V} {({\hat{r}}_{i, j} - r_{i, j})}^{2}

(13)

where $X$ denotes the full training set, ${\hat{r}}_{i, j}$ is the predicted rating, and $r_{i, j}$ is the actual rating of user $u_{i}$ for item $v_{j}$.

During GCLTE’s training, random noise is added to the encoder’s convolutional layers. The contrastive loss between the unperturbed and perturbed embeddings is calculated as follows:

L_{cl} = L_{cl}^{users} + L_{cl}^{items}

(14)

\begin{aligned} L_{cl}^{users} &= - \frac{1}{N} \sum_{i = 1}^{N} \log \frac{\exp (\frac{u_{i} \cdot u_{i}^{\prime}}{\tau})}{\sum_{j = 1}^{N} \exp (\frac{u_{i} \cdot u_{j}^{\prime}}{\tau})}, \\ L_{cl}^{items} &= - \frac{1}{N} \sum_{i = 1}^{N} \log \frac{\exp (\frac{v_{i} \cdot v_{i}^{\prime}}{\tau})}{\sum_{j = 1}^{N} \exp (\frac{v_{i} \cdot v_{j}^{\prime}}{\tau})} \end{aligned}

(15)

where $N$ is the perturbation batch size, $u_{i}$ and $v_{i}$ are the unperturbed user and item embeddings, and $u_{i}^{\prime}$ and $v_{i}^{\prime}$ are the perturbed user and item embeddings. $\tau$ represents the temperature parameter, and $\exp (\cdot)$ denotes the exponential function, so each fraction is a softmax over the batch.
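The contrastive loss in Equations (14) and (15) can be illustrated with a small sketch. Normalizing the embeddings to unit length before taking dot products is an assumption commonly made with InfoNCE and is not stated in the equations; the function names and the toy vectors are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a non-zero vector to unit L2 norm."""
    length = math.sqrt(dot(v, v))
    return [x / length for x in v]


def info_nce(clean, perturbed, tau):
    """Average of -log softmax over the batch, with (i, i) as the positive pair."""
    clean = [normalize(v) for v in clean]
    perturbed = [normalize(v) for v in perturbed]
    loss = 0.0
    for i, anchor in enumerate(clean):
        logits = [dot(anchor, other) / tau for other in perturbed]
        peak = max(logits)
        # log of the denominator in Equation (15), computed stably.
        log_denominator = peak + math.log(sum(math.exp(z - peak) for z in logits))
        loss += log_denominator - logits[i]
    return loss / len(clean)


users = [[1.0, 0.0], [0.0, 1.0]]          # hypothetical unperturbed embeddings
users_noisy = [[0.9, 0.1], [0.1, 0.9]]    # hypothetical perturbed embeddings
items = [[1.0, 1.0], [1.0, -1.0]]
items_noisy = [[1.0, 0.8], [0.8, -1.0]]
l_cl = info_nce(users, users_noisy, tau=0.2) + info_nce(items, items_noisy, tau=0.2)
```

The loss is small when each embedding is closest to its own perturbed copy and grows when it is closer to the perturbed copy of another node, which is what pushes different nodes apart in the embedding space.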

For calculating text loss, we employ Negative Log Likelihood Loss as the loss function. It is calculated as follows:

L_{text} = - \frac{1}{S} \sum_{t = 1}^{S} \log P (y_{t} \mid y_{1}, y_{2}, \dots, y_{t - 1})

(16)

where $S$ denotes the length of the text. $P (y_{t} \mid y_{1}, y_{2}, \dots, y_{t - 1})$ represents the probability of generating the $t$th word based on the first $t - 1$ words in the sequence.
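A minimal sketch of Equation (16) follows, assuming the decoder has already produced one probability distribution over the vocabulary per position. The three-word vocabulary and the probability values are made up for illustration.

```python
import math


def text_nll(step_distributions, target_ids):
    """Mean negative log-probability assigned to the reference words."""
    total = 0.0
    for probabilities, target in zip(step_distributions, target_ids):
        total -= math.log(probabilities[target])
    return total / len(target_ids)


# Hypothetical decoder outputs for a two-word explanation over a 3-word vocabulary.
distributions = [[0.5, 0.25, 0.25], [0.25, 0.75, 0.0]]
reference = [0, 1]
loss = text_nll(distributions, reference)
```

Each distribution is conditioned on the preceding reference words during training, so the loss falls to zero only when the model assigns probability 1 to every reference word.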

We then combine the individual losses, weighted appropriately, to form the final loss:

L = λ_{r} L_{rating} + λ_{t} L_{text} + λ_{c} L_{cl} + λ_{l 2} L_{l 2}

(17)

where $\lambda_{r}$, $\lambda_{t}$, $\lambda_{c}$, and $\lambda_{l 2}$ are the weights assigned to each loss and $L_{l 2}$ represents the $L_{2}$ regularization loss, which helps prevent model overfitting.
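The weighted combination in Equation (17) can be sketched as below, together with the rating loss from Equation (13). Taking $L_{l 2}$ to be the sum of squared parameters is an assumption, and all loss values and weights in the example are hypothetical rather than the settings used in the experiments.

```python
def rating_mse(predicted, actual):
    """Mean squared error between predicted and actual ratings (Equation (13))."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


def l2_penalty(parameters):
    """Sum of squared parameter values (assumed form of the L2 term)."""
    return sum(w * w for w in parameters)


def total_loss(l_rating, l_text, l_cl, l_l2,
               lambda_r, lambda_t, lambda_c, lambda_l2):
    """Weighted sum of the four loss terms (Equation (17))."""
    return (lambda_r * l_rating + lambda_t * l_text
            + lambda_c * l_cl + lambda_l2 * l_l2)


# Hypothetical values for one training step.
l_rating = rating_mse([4.5, 3.0], [5.0, 3.0])
loss = total_loss(l_rating, l_text=2.0, l_cl=3.0, l_l2=l2_penalty([1.0, -2.0]),
                  lambda_r=1.0, lambda_t=0.5, lambda_c=0.1, lambda_l2=0.01)
```

Because all terms share one scalar objective, a single backward pass updates the encoder with gradients from both the rating task and the text task.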

5. Experimental Setup

We will conduct experiments tailored to our multi-task learning architecture, aligning with our research objectives. These include evaluations focused on rating prediction and reason generation, as well as ablation studies on CL with transformers. This section will detail our dataset, baseline model, evaluation metrics, and other relevant experimental settings.

5.1. Datasets

Our experiments use one 3-core dataset and two 5-core datasets. The “n-core” designation means each user and item has at least n interactions, enhancing data density and reliability.
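The n-core condition can be enforced with a simple filter such as the one sketched here. Repeating the filter until nothing changes is an assumption about how the datasets were prepared (removing a user can push an item below the threshold), and the toy interactions are hypothetical.

```python
from collections import Counter


def n_core(interactions, n):
    """Keep only interactions whose user and item both have at least n interactions."""
    kept = list(interactions)
    while True:
        user_counts = Counter(user for user, _ in kept)
        item_counts = Counter(item for _, item in kept)
        filtered = [(user, item) for user, item in kept
                    if user_counts[user] >= n and item_counts[item] >= n]
        if len(filtered) == len(kept):
            return filtered  # stable: every remaining user and item qualifies
        kept = filtered


# Hypothetical (user, item) pairs.
pairs = [("u1", "a"), ("u1", "b"), ("u2", "a"), ("u2", "b"), ("u3", "a")]
two_core = n_core(pairs, 2)
```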

REASONER [39]. REASONER is an explainable recommendation dataset with ground truths for multiple explanations. The annotators of ground truths are the same individuals who engage in user–item interactions.

Digital_Music [40]. The Digital Music dataset, available on Amazon.com, includes user ratings and reviews for digital music products like MP3 downloads and music streaming services.

Amazon_Instant_Video [40]. Amazon Instant Video includes user reviews and ratings for instant video services on Amazon, featuring TV shows, movies, and other streaming video content.

5.2. Baselines

Att2Seq [31]. Att2Seq is an attention-enhanced attribute-to-sequence model. The encoder represents input attributes as vectors, and the decoder uses these vectors to generate new comments with an attention mechanism to align words with attributes.

PETER [22]. PETER is a personalized transformer for interpretable recommendations. It links IDs with words by predicting words based on vectors corresponding to item–user ID positions.

NRT [15]. NRT is a multi-task learning framework that predicts ratings and generates high-quality summary cues to simulate user experiences. It uses a gated Recurrent Neural Network (RNN) to translate representations of users and items into concise sentences.

EMER [16]. EMER is a multi-task learning framework using an Encoder–Decoder structure and Multilayer Perceptron (MLP). It integrates item titles to generate recommendation rationales, facilitating both rating prediction and reason generation.

LightGCN [38]. LightGCN is a graph convolution-based recommender model that captures complex user behaviors and item characteristics. It enhances accuracy by omitting feature transformations and nonlinear activation layers found in traditional GCNs.

SGL [4]. SGL is a self-supervised learning model for recommender systems. It uses deep features of graph-structured data to enhance recommendations without needing large amounts of labeled data.

5.3. Evaluation Metrics

RMSE and MAE. We evaluate recommendation performance using Root Mean Square Error (RMSE) and Mean Absolute Error (MAE). Smaller RMSE and MAE values indicate that the predicted ratings are closer to the actual ratings. The mathematical expressions for RMSE and MAE are as follows:

\begin{aligned} RMSE &= \sqrt{\frac{1}{N} \sum_{u, i} {({\hat{r}}_{u, i} - r_{u, i})}^{2}}, \\ MAE &= \frac{1}{N} \sum_{u, i} |{\hat{r}}_{u, i} - r_{u, i}| \end{aligned}

(18)

Here, $N$ is the number of ratings between users and items, ${\hat{r}}_{u, i}$ is the predicted rating of item $i$ by user $u$, and $r_{u, i}$ is the actual rating.
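Equation (18) translates directly into code. The ratings in this sketch are made-up values used only to show how the two metrics respond to the same errors.

```python
import math


def rmse(predicted, actual):
    """Root mean square error over paired ratings."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))


def mae(predicted, actual):
    """Mean absolute error over paired ratings."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


predicted_ratings = [4.0, 3.0, 5.0]   # hypothetical predictions
actual_ratings = [5.0, 3.0, 3.0]      # hypothetical ground truth
scores = rmse(predicted_ratings, actual_ratings), mae(predicted_ratings, actual_ratings)
```

RMSE squares each error before averaging, so the single two-point miss in this example weighs more heavily in RMSE than in MAE.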

BLEU and ROUGE. We use BLEU [41] (Bilingual Evaluation Understudy) and ROUGE [42] (Recall-Oriented Understudy for Gisting Evaluation) to evaluate explanation performance. BLEU measures n-gram precision: BLEU-1 reflects word-level accuracy and BLEU-4 reflects sentence fluency. ROUGE evaluates text quality, with ROUGE-1 and ROUGE-2 scores computing Precision, Recall, and F-score for our model’s reason generation task. Higher BLEU and ROUGE scores indicate better text quality. The mathematical expressions for ROUGE are as follows:

\begin{aligned} \text{ROUGE}_{n/p} &= \frac{\sum_{u, i} Count (N)}{\sum_{u, i} Count (N_{\hat{S}})} \\ \text{ROUGE}_{n/r} &= \frac{\sum_{u, i} Count (N)}{\sum_{u, i} Count (N_{S})} \\ \text{ROUGE}_{n/f} &= \frac{2 \cdot \text{ROUGE}_{n/p} \cdot \text{ROUGE}_{n/r}}{\text{ROUGE}_{n/p} + \text{ROUGE}_{n/r}} \end{aligned}

(19)

where $N$ denotes the $n$-grams, i.e., sequences of $n$ consecutive words, that match between the generated text $\hat{S}$ and the real text $S$, and $N_{\hat{S}}$ and $N_{S}$ denote the $n$-grams of the generated text and the real text, respectively. The mathematical expression for BLEU is as follows:

\text{BLEU-}n = BP \cdot \exp \left( \sum_{u, i} \omega_{n} \log p_{n} \right)

(20)

where $BP$ is the brevity penalty, equal to 1 when the generated text is longer than the real text and to $\exp (1 - \frac{L_{S}}{L_{\hat{S}}})$ otherwise, with $L_{\hat{S}}$ and $L_{S}$ the lengths of the generated and real texts; $\omega_{n}$ is the weight; and $p_{n}$ represents the precision of the generated text against the real text at the $n$-gram level.
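As an illustration of the ROUGE expressions in Equation (19), the sketch below scores one generated sentence against one reference, counting each matching n-gram at most as often as it occurs in both texts. Equation (19) sums the counts over all user–item pairs, so this per-sentence version is a simplification, and the example sentences are invented.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[k:k + n]) for k in range(len(tokens) - n + 1))


def rouge_n(generated, reference, n):
    """Return (precision, recall, F-score) of n-gram overlap for one text pair."""
    gen, ref = ngrams(generated, n), ngrams(reference, n)
    overlap = sum((gen & ref).values())  # clipped count of matching n-grams
    precision = overlap / max(sum(gen.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


generated_text = "the movie was great fun".split()
real_text = "the movie was really great".split()
rouge_1 = rouge_n(generated_text, real_text, 1)
rouge_2 = rouge_n(generated_text, real_text, 2)
```

The precision returned here is also the clipped n-gram precision that BLEU builds on, which is why the two metric families tend to move together on short explanations.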

5.4. Experimental Settings

Our GCLTE model is implemented in PyTorch. In the encoder, we set the embedding dimension and the hidden layer dimension to 32 and the number of convolutional layers to 3. In the decoder, we set the embedding dimension and the hidden layer dimension to 512, with 3 transformer layers and 2 attention heads. The optimal values for the weights of the rating loss and text loss are searched within the range $[0, 2]$. During training, we use the Adam optimizer with an initial learning rate of 0.01, and training is halted once the loss has increased three times in total. For the baseline models, we use the parameter settings provided in their respective papers. For each dataset, we allocate 80% as the training set, 10% as the validation set, and the remaining 10% as the test set. We set the training batch size to 256 and the vocabulary size to 20,000 to retain the most frequent words. Finally, we independently run the experiments 5 times and report the average results. All models are trained on the same NVIDIA RTX 4090 GPU.
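The data split and the stopping rule described above can be sketched as follows. This is a minimal illustration rather than the training code of GCLTE: the names `split_indices` and `EarlyStopper`, the random shuffling with a fixed seed, and the choice to compare each loss with the loss of the previous epoch are assumptions made for this example.

```python
import random


def split_indices(n_samples, seed=0):
    """Shuffle sample indices and split them 80/10/10 (train/validation/test)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = int(0.8 * n_samples)
    n_valid = int(0.1 * n_samples)
    return (indices[:n_train],
            indices[n_train:n_train + n_valid],
            indices[n_train + n_valid:])


class EarlyStopper:
    """Signal a stop once the monitored loss has increased `patience` times in total."""

    def __init__(self, patience=3):
        self.patience = patience
        self.increases = 0
        self.previous = None

    def should_stop(self, loss):
        if self.previous is not None and loss > self.previous:
            self.increases += 1  # cumulative count, not reset by a later decrease
        self.previous = loss
        return self.increases >= self.patience


train, valid, test = split_indices(1000)

# Hypothetical per-epoch losses; the third increase occurs at epoch index 6.
stopper = EarlyStopper(patience=3)
losses = [1.00, 0.90, 0.95, 0.85, 0.88, 0.80, 0.83, 0.79]
stopped_at = next(epoch for epoch, loss in enumerate(losses) if stopper.should_stop(loss))
```

Counting increases cumulatively, instead of resetting the counter after every improvement, stops training earlier on noisy loss curves.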

6. Results and Discussion

In this section, we present the results of GCLTE compared to the baseline models on the rating prediction task and the reason generation task. We conduct ablation experiments to validate the necessity of our proposed information enhancement operation and to evaluate the effectiveness of the noise-based contrastive learning used in our encoder. We also investigate the benefit of our transformer decoder for the reason generation task in comparison with a strong RNN-based decoder. We conclude with a discussion of the hyperparameters of GCLTE and an analysis of its computational load. Bolding in the data tables in this section indicates the best performance in the comparison.

6.1. Rating Prediction Results

In this subsection, we evaluate the performance of our multi-task learning model in rating prediction. We compare GCLTE with several mainstream baseline models, including PETER, NRT, and EMER. These models are similar to ours and are frequently used in comparative experiments. LightGCN and SGL, which are graph-based, serve as commonly used baseline models in the field of recommender systems. The experimental results are presented in Table 1, Table 2 and Table 3.

As shown in Table 1, GCLTE achieves the lowest RMSE (1.1147) and MAE (0.9189) on the 3-core dataset (REASONER); the next-best model, NRT, reaches 1.1926 and 0.9484. We attribute this improvement to the addition of random noise in the graph convolution process during the rating prediction task, which enables a broader range of embedding representations and yields better rating prediction results, even with relatively sparse interaction data.

For the 5-core datasets (Amazon_Music and Amazon_Instant_Video), the results in Table 2 and Table 3 show that GCLTE continues to outperform most other models in terms of RMSE and MAE. Notably, SGL performs comparably to GCLTE and is slightly ahead on Amazon_Instant_Video (RMSE 1.0645 versus 1.0816). We attribute this similarity to the richer interaction information available for each user and item in these datasets, so that the dropout-based graph augmentation does minimal damage to the graph’s original structure. Both SGL and GCLTE reach lower RMSE and MAE than the other models, including LightGCN, and the gap between the two is small, which demonstrates the impact of the contrastive loss that SGL and GCLTE share during training.
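For reference, the two rating metrics compared in Table 1, Table 2 and Table 3 can be computed as in the short sketch below. It uses the standard definitions of RMSE and MAE; the function names and the toy ratings are assumptions made for this example and are not taken from the datasets.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between predicted and true ratings."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))


def mae(predicted, actual):
    """Mean absolute error between predicted and true ratings."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical ratings on a 1-5 scale; the errors are -1, 0, 1 and 2.
actual = [5.0, 3.0, 4.0, 1.0]
predicted = [4.0, 3.0, 5.0, 3.0]
rmse_value = rmse(predicted, actual)  # sqrt(6 / 4)
mae_value = mae(predicted, actual)    # 4 / 4
```

Because RMSE squares each error, the single error of 2 raises it above MAE here, which is why both metrics are usually reported together.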

6.2. Reason Generation Results

In this subsection, we analyze the performance of GCLTE in reason generation. We evaluate text quality with BLEU-1, BLEU-4, ROUGE-1, and ROUGE-2 (Precision, Recall, and F1-score).

According to the evaluation results presented in Table 4, Table 5 and Table 6, GCLTE surpasses most other models across all three datasets. In terms of BLEU-1 and BLEU-4, GCLTE outperforms most baseline models. Att2Seq achieves BLEU scores comparable to those of GCLTE on the REASONER and Amazon_Music datasets, and even a slightly higher BLEU-4 score on REASONER (3.5184 versus 3.4942), but it performs poorly on the Amazon_Instant_Video dataset. In contrast, GCLTE performs consistently well across all datasets, indicating strong generalization capabilities.

Regarding ROUGE scores on the REASONER dataset, GCLTE does not achieve the highest Precision or Recall, but it leads in F1-score among all models. Because the F1-score is the harmonic mean of Precision and Recall, this indicates that although GCLTE is not the best on either individual metric, it achieves the best balance between them. On the Amazon_Music and Amazon_Instant_Video datasets, GCLTE’s ROUGE scores exceed those of the other models in all metrics except Precision. We observe that some models with high Precision scores perform poorly in both Recall and BLEU-4. We interpret this as a sign of poorer fluency or readability in the generated text, potentially caused by more repetitive word usage. GCLTE remains stable across all metrics, achieving strong scores without any significant anomalies. This stability reflects the robust feature extraction capability of our graph convolutional encoder, which uses a contrastive strategy. Additionally, the transformer, enhanced with the encoder’s embedding state, produces high-quality text explanations.

6.3. Ablation Experiment

To validate the contribution of each component of GCLTE to rating prediction and reason generation, we conduct ablation experiments on the model’s components. The results for all datasets are presented in Table 7, where “GCLTE” represents our model without any ablation of its modules, “${GCLTE}_{W / CL}$” represents GCLTE without the Contrastive Learning module, “${GCLTE}_{W / E}$” represents GCLTE without the information enhancement operation, and “${GCLTE}_{GRU}$” represents GCLTE using GRU as the decoder. The “Improve” rows denote the enhancements of GCLTE relative to its variants, with the enhancement values bolded. The calculation method for “improve” is as follows:

\mathrm{improve} = \frac{|\mathrm{Metric}_{GCLTE} - \mathrm{Metric}_{variants}|}{\mathrm{Metric}_{variants}}

(21)
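As a worked illustration of Equation (21), the sketch below recomputes two “Improve” entries from the REASONER rows of Table 7. The function name `improve` mirrors the equation; the rounding to two decimals is an assumption about how the table values were reported.

```python
def improve(metric_gclte, metric_variant):
    """Relative change of GCLTE with respect to a variant, as in Equation (21)."""
    return abs(metric_gclte - metric_variant) / metric_variant


# BLEU1 on REASONER: GCLTE (20.8766) versus the variant without Contrastive Learning (18.1429).
bleu1_vs_no_cl = improve(20.8766, 18.1429)

# BLEU4 on REASONER: GCLTE (3.4942) versus the variant without information enhancement (2.0450).
bleu4_vs_no_enh = improve(3.4942, 2.0450)

print(round(bleu1_vs_no_cl, 2), round(bleu4_vs_no_enh, 2))
```

Because of the absolute value, the measure is positive both for metrics where lower is better (RMSE, MAE) and for metrics where higher is better (BLEU, ROUGE), so the direction of the change has to be read from the metric values themselves.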

6.3.1. The Need for Hidden State-Based Information Enhancement

To validate the necessity of the information enhancement operation, which uses the hidden state of the GCN to enrich the text embedding at the input of the transformer decoder, we removed the embedding splicing operation described in Equation (11) to obtain ${GCLTE}_{W / E}$ (without the enhancement). The comparison in Table 7 between ${GCLTE}_{W / E}$ and GCLTE shows that the rating prediction performance of ${GCLTE}_{W / E}$ remains nearly identical to that of GCLTE after the removal of the information enhancement operation. This is because rating prediction and reason generation are handled by two distinct, independently functioning modules within the model, which prevents mutual interference during multi-task learning optimization.

Additionally, we observe a significant decline in the quality of the generated text after removing the information enhancement from the hidden state of the GCN. This decline is due to the loss of personalized user and item information embedded in the GCN’s hidden state, which reduces the personalization in the generated text. Specifically, the decline in the quality of the generated text is more noticeable for the Amazon_Music dataset, which is larger than the others despite having a similar number of users. This results in a more informative hidden state of the GCN. When comparing ${GCLTE}_{W / E}$ with other variants in Table 7, the most significant performance degradation occurs with ${GCLTE}_{W / E}$, demonstrating that the information enhancement operation significantly boosts model performance.

6.3.2. Enhancement from Contrastive Loss

To verify the performance enhancement from the embedding-based Contrastive Learning added to the graph convolutional neural network in the model’s encoder, we removed the contrastive loss computation from the training process to obtain ${GCLTE}_{W / CL}$ (without contrastive loss). Table 7 shows that, on all three datasets, the RMSE and MAE of ${GCLTE}_{W / CL}$ are higher (worse) than those of GCLTE. This indicates that adding noise to the node embeddings when computing the contrastive loss during training improves the model’s rating prediction capability. Additionally, the text evaluation metrics of GCLTE are better than those of ${GCLTE}_{W / CL}$. We attribute this to the information enhancement operation performed on the text embeddings using the hidden state of the GCN layer at the input of the transformer decoder: the hidden state, enhanced by the contrastive loss, continues to positively influence the input of the transformer.
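The idea of contrasting two noise-perturbed views of the same node embeddings can be sketched in plain Python as follows. This is only an illustration of a generic noise-based InfoNCE loss, not the exact formulation of Section 4.1.2: the noise magnitude `eps`, the temperature, the sign-aligned uniform noise, and all function names are assumptions made for this example.

```python
import math
import random


def normalize(vector):
    """Scale a vector to unit L2 norm (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]


def perturb(vector, eps, rng):
    """Add random noise of L2 norm eps whose signs follow the signs of the vector."""
    noise = normalize([rng.random() for _ in vector])
    return [x + eps * math.copysign(n, x) for x, n in zip(vector, noise)]


def info_nce(view_a, view_b, temperature=0.2):
    """InfoNCE loss: row i of view_a should be closest to row i of view_b."""
    a = [normalize(v) for v in view_a]
    b = [normalize(v) for v in view_b]
    total = 0.0
    for i, a_i in enumerate(a):
        logits = [sum(x * y for x, y in zip(a_i, b_j)) / temperature for b_j in b]
        peak = max(logits)  # subtract the maximum for a numerically stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(a)


# Hypothetical two-dimensional node embeddings and two noisy views of them.
rng = random.Random(0)
embeddings = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
view_a = [perturb(v, 0.1, rng) for v in embeddings]
view_b = [perturb(v, 0.1, rng) for v in embeddings]
contrastive_loss = info_nce(view_a, view_b)
```

The loss is small when the two views of each node stay close to each other and far from the views of other nodes, which is the property the ablation above removes when the contrastive term is dropped.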

6.3.3. Advantage of Using a Transformer

Ref. [43] has demonstrated that transformers may perform poorly on some small datasets, specifically those with several million entries. This raises a pertinent question for our study: why is a transformer used for the reason generation task in GCLTE instead of a Recurrent Neural Network, which is more commonly employed in NLP tasks? To answer this question, we replaced the transformer module in GCLTE with a GRU, an RNN known for its simplicity and effectiveness, while keeping the encoder unchanged. The resulting variant, ${GCLTE}_{GRU}$, uses the encoder’s hidden state as the initial state of the GRU.

As shown in Table 7, the performance of ${GCLTE}_{GRU}$ in the rating prediction task remains nearly identical to that of GCLTE, as anticipated. Its text generation quality, however, degrades noticeably, indicating that the complex hidden states of GCN-based neural networks are not ideal inputs for GRUs, whereas transformers decode embeddings with complex relationships more effectively. The ablation in Section 6.3.1 showed that these embeddings positively impact the reason generation capability of the transformer. Consequently, GCLTE achieves superior results by using a transformer as the decoder.

6.4. Hyperparameters in Multi-Task Learning

In this subsection, we discuss the hyperparameters $\lambda_{r}$ and $\lambda_{t}$ as presented in Equation (17). We vary their values within $[0, 2]$ while regularizing $\lambda_{r}$ and $\lambda_{t}$, conduct experiments on the REASONER dataset, and plot the results in line graphs. In Figure 4a, the horizontal axis represents specific parameter settings, the vertical axis shows evaluation indicator values, and the impact of different hyperparameter settings on rating prediction performance is shown in the respective bars. In Figure 4b, the horizontal axis represents evaluation indicators, the vertical axis shows evaluation indicator values, and the dashed lines indicate different parameter settings.

As shown in Figure 4a, rating prediction performance is at its lowest when $\lambda_{r}$ is set to 0. Apart from this, the rating prediction performance shows little sensitivity to changes in the hyperparameters, allowing for more flexibility in setting these parameters with greater focus on the reason generation task. As depicted in Figure 4b, the model’s textual performance is optimal when both $\lambda_{r}$ and $\lambda_{t}$ are set to 1. This indicates that the two tasks in GCLTE do not compete, eliminating the need to assign differing weights to prioritize one over the other. Consequently, we set both hyperparameters $\lambda_{r}$ and $\lambda_{t}$ to 1 in our experiments.
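The role of the two weights can be illustrated with the following sketch, which combines a rating loss and a text loss and enumerates candidate settings in $[0, 2]$. It is a simplified illustration: the function name, the step size of 0.5, and the omission of any further terms that Equation (17) may contain are assumptions made for this example.

```python
def multi_task_loss(rating_loss, text_loss, lambda_r=1.0, lambda_t=1.0):
    """Weighted sum of the rating loss and the text loss (other terms omitted)."""
    return lambda_r * rating_loss + lambda_t * text_loss


# Candidate weights in [0, 2]; the step of 0.5 is an assumption for illustration.
candidates = [step * 0.5 for step in range(5)]
grid = [(lambda_r, lambda_t) for lambda_r in candidates for lambda_t in candidates]

# Hypothetical loss values for one batch.
equal_weights = multi_task_loss(0.8, 2.5)                # both weights set to 1
no_rating_term = multi_task_loss(0.8, 2.5, lambda_r=0.0)  # rating loss ignored
```

With $\lambda_{r} = 0$ the rating loss no longer contributes to the objective, which is consistent with the drop in rating prediction performance reported for that setting.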

6.5. Computational Load Analysis

In this subsection, we discuss the computational load of GCLTE. Table 8 compares NRT (RNN structure, complexity $O(n)$), PETER (transformer structure, complexity $O(n^{2})$), and our GCLTE (complexity $O(n^{2})$) on an NVIDIA RTX 4090 when processing the REASONER dataset. Although GCLTE takes longer per epoch and requires more GPU memory than NRT (due to its higher computational complexity), it needs far fewer epochs (13 versus 38), indicating that our model converges faster during training. Compared to PETER, GCLTE requires a similar time per epoch and somewhat more GPU memory, but it again needs fewer epochs (13 versus 29), resulting in a substantially shorter total training time (116.5 s versus 222.3 s).

7. Conclusions and Future Work

In this paper, we introduce a multi-task learning model designed for explainable recommendations. This model, based on an Encoder–Decoder structure, enhances a graph convolutional network encoder with Contrastive Learning and uses a transformer module as the decoder. We integrate rating prediction and reason generation through a simple and efficient information enhancement operation. Our experiments on three benchmark datasets demonstrate that integrating the two tasks leads to clear performance improvements. Additionally, the experimental results show that GCLTE outperforms current state-of-the-art models on most metrics in both the rating prediction and reason generation tasks.

However, GCLTE has not yet been tested on very large datasets. In future work, we plan to evaluate GCLTE on such datasets and make targeted improvements. Given its transformer module, we expect GCLTE to perform well at that scale. Moreover, we aim to enhance the model’s explainability, particularly in terms of reason generation.

Author Contributions

Methodology, X.Z.; Software, X.Z. and Y.W.; Validation, Y.W.; Formal analysis, W.Z.; Investigation, X.Z. and X.X.; Data curation, X.Z.; Writing—original draft, X.Z.; Writing—review & editing, X.X.; Visualization, W.Z.; Project administration, X.X.; Funding acquisition, X.X. All authors have read and agreed to the published version of the manuscript.

Funding

This study is supported by the Social Science Planning Project of Shandong Province (Grant No. 24BJYJ02).

Data Availability Statement

The original data presented in the study are openly available in [39,40].

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Appendix A

Table A1. Summary of acronyms.

Acronyms	Full Name
CL	Contrastive Learning
InfoNCE	A form of Noise Contrastive Estimation
GNN	Graph Neural Network
GCN	Graph Convolutional Network
RNN	Recurrent Neural Network
DNN	Deep Neural Network
MLP	Multilayer Perceptron
LSTM	Long Short-Term Memory
SGL	Self-supervised Graph Learning
XSimGCL	Extremely Simple Graph Contrastive Learning
EFM	Explicit Factor Model
AMF	Attentional Factorization Machine
UAP	User Aspect Preference
IAQ	Item Aspect Quality
Att2Seq	Attribute-to-Sequence Method
NRT	Neural Rating Regression with Abstractive Tips Generation
GRU	Gated Recurrent Unit
NETE	Neural Template Explanation Generation
J3R	Joint Multi-task Learning of Ratings and Review Summaries
EMER	Encoder–Decoder and MLP-based Explainable Recommendation
CAML	Co-Attentive Multi-Task Learning
MF	Matrix Factorization
NCF	Neural Collaborative Filtering
GCLTE	Our Graph Contrastive Learning with transformers within an Encoder–Decoder framework

References

1. Batmaz, Z.; Yurekli, A.; Bilge, A.; Kaleli, C. A review on deep learning for recommender systems: Challenges and remedies. Artif. Intell. Rev. 2019, 52, 1–37.
2. Jaiswal, A.; Babu, A.R.; Zadeh, M.Z.; Banerjee, D.; Makedon, F. A Survey on Contrastive Self-Supervised Learning. Technologies 2021, 9, 2.
3. Zhou, C.; Ma, J.; Zhang, J.; Zhou, J.; Yang, H. Contrastive Learning for Debiased Candidate Generation in Large-Scale Recommender Systems. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, Singapore, 14–18 August 2021; ACM: New York, NY, USA, 2021; pp. 3985–3995.
4. Wu, J.; Wang, X.; Feng, F.; He, X.; Chen, L.; Lian, J.; Xie, X. Self-supervised Graph Learning for Recommendation. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Montreal, QC, Canada, 11–15 July 2021; ACM: New York, NY, USA, 2021; pp. 726–735.
5. Yu, J.; Yin, H.; Gao, M.; Xia, X.; Zhang, X.; Hung, N.Q.V. Socially-Aware Self-Supervised Tri-Training for Recommendation. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, Singapore, 14–18 August 2021; ACM: New York, NY, USA, 2021; pp. 2084–2092.
6. He, K.; Fan, H.; Wu, Y.; Xie, S.; Girshick, R. Momentum Contrast for Unsupervised Visual Representation Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 9729–9738.
7. Yu, J.; Xia, X.; Chen, T.; Cui, L.; Hung, N.Q.V.; Yin, H. XSimGCL: Towards Extremely Simple Graph Contrastive Learning for Recommendation. IEEE Trans. Knowl. Data Eng. 2024, 36, 913–926.
8. Hou, Y.; Yang, N.; Wu, Y.; Yu, P.S. Explainable recommendation with fusion of aspect information. World Wide Web 2019, 22, 221–240.
9. Zhang, Y.; Chen, X. Explainable Recommendation: A Survey and New Perspectives. Found. Trends Inf. Retr. 2020, 14, 1–101.
10. Wang, X.; Chen, Y.; Yang, J.; Wu, L.; Wu, Z.; Xie, X. A Reinforcement Learning Framework for Explainable Recommendation. In Proceedings of the 2018 IEEE International Conference on Data Mining (ICDM), Singapore, 17–20 November 2018; pp. 587–596.
11. Wang, X.; Wang, D.; Xu, C.; He, X.; Cao, Y.; Chua, T.S. Explainable reasoning over knowledge graphs for recommendation. Proc. AAAI Conf. Artif. Intell. 2019, 33, 5329–5336.
12. He, X.; Chen, T.; Kan, M.Y.; Chen, X. TriRank: Review-aware Explainable Recommendation by Modeling Aspects. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, Melbourne, VIC, Australia, 19–23 October 2015; pp. 1661–1670.
13. Wang, H.; Zhang, F.; Wang, J.; Zhao, M.; Li, W.; Xie, X.; Guo, M. RippleNet: Propagating User Preferences on the Knowledge Graph for Recommender Systems. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, Torino, Italy, 22–26 October 2018; pp. 417–426.
14. Balloccu, G.; Boratto, L.; Fenu, G.; Marras, M. Hands on Explainable Recommender Systems with Knowledge Graphs. In Proceedings of the 16th ACM Conference on Recommender Systems, Seattle, WA, USA, 18–23 September 2022; pp. 710–713.
15. Li, P.; Wang, Z.; Ren, Z.; Bing, L.; Lam, W. Neural Rating Regression with Abstractive Tips Generation for Recommendation. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, Tokyo, Japan, 7–11 August 2017; pp. 345–354.
16. Zhu, J.; He, Y.; Zhao, G.; Bu, X.; Qian, X. Joint Reason Generation and Rating Prediction for Explainable Recommendation. IEEE Trans. Knowl. Data Eng. 2023, 35, 4940–4953.
17. Chen, Z.; Wang, X.; Xie, X.; Wu, T.; Bu, G.; Wang, Y.; Chen, E. Co-attentive Multi-task Learning for Explainable Recommendation. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19), Macao, China, 10–16 August 2019; pp. 2137–2143.
18. Peake, G.; Wang, J. Explanation Mining: Post Hoc Interpretability of Latent Factor Models for Recommendation Systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, London, UK, 19–23 August 2018; pp. 2060–2069.
19. Sun, P.; Wu, L.; Zhang, K.; Su, Y.; Wang, M. An Unsupervised Aspect-Aware Recommendation Model with Explanation Text Generation. ACM Trans. Inf. Syst. 2021, 40, 1–29.
20. Li, L.; Zhang, Y.; Chen, L. Personalized Prompt Learning for Explainable Recommendation. ACM Trans. Inf. Syst. 2023, 41, 1–26.
21. Lyu, Z.; Wu, Y.; Lai, J.; Yang, M.; Li, C.; Zhou, W. Knowledge Enhanced Graph Neural Networks for Explainable Recommendation. IEEE Trans. Knowl. Data Eng. 2023, 35, 4954–4968.
22. Li, L.; Zhang, Y.; Chen, L. Personalized Transformer for Explainable Recommendation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Virtual Event, 1–6 August 2021; pp. 4947–4957.
23. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is All You Need. In Proceedings of the NeurIPS, Long Beach, CA, USA, 4–9 December 2017; Volume 30.
24. Kipf, T.N.; Welling, M. Semi-supervised classification with graph convolutional networks. arXiv 2016, arXiv:1609.02907.
25. Gao, C.; Wang, X.; He, X.; Li, Y. Graph Neural Networks for Recommender System. In Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining, Tempe, AZ, USA, 21–25 February 2022; pp. 1623–1625.
26. Yu, J.; Yin, H.; Xia, X.; Chen, T.; Li, J.; Huang, Z. Self-Supervised Learning for Recommender Systems: A Survey. IEEE Trans. Knowl. Data Eng. 2024, 36, 335–355.
27. You, Y.; Chen, T.; Sui, Y.; Chen, T.; Wang, Z.; Shen, Y. Graph Contrastive Learning with Augmentations. Adv. Neural Inf. Process. Syst. 2020, 33, 5812–5823.
28. Yu, J.; Yin, H.; Xia, X.; Chen, T.; Cui, L.; Nguyen, Q.V.H. Are Graph Augmentations Necessary? Simple Graph Contrastive Learning for Recommendation. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, Madrid, Spain, 11–15 July 2022; ACM: New York, NY, USA, 2022; pp. 1294–1303.
29. Zhang, Y.; Lai, G.; Zhang, M.; Zhang, Y.; Liu, Y.; Ma, S. Explicit Factor Models for Explainable Recommendation Based on Phrase-Level Sentiment Analysis. In Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval, Gold Coast, Australia, 6–11 July 2014; pp. 83–92.
30. Gatt, A.; Krahmer, E. Survey of the State of the Art in Natural Language Generation: Core Tasks, Applications and Evaluation. J. Artif. Intell. Res. 2018, 61, 65–170.
31. Dong, L.; Huang, S.; Wei, F.; Lapata, M.; Zhou, M.; Xu, K. Learning to Generate Product Reviews from Attributes. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, Valencia, Spain, 3–7 April 2017; pp. 623–632.
32. Staudemeyer, R.C.; Morris, E.R. Understanding LSTM -- a tutorial into Long Short-Term Memory Recurrent Neural Networks. arXiv 2019, arXiv:1909.09586.
33. Cho, K.; van Merrienboer, B.; Gülçehre, Ç.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. arXiv 2014, arXiv:1406.1078.
34. Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I. Improving Language Understanding by Generative Pre-Training. OpenAI Blog. 2018. Available online: https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf (accessed on 1 January 2024).
35. Dong, L.; Yang, N.; Wang, W.; Wei, F.; Liu, X.; Wang, Y.; Gao, J.; Zhou, M.; Hon, H.W. Unified Language Model Pre-training for Natural Language Understanding and Generation. arXiv 2019, arXiv:1905.03197.
36. Li, L.; Zhang, Y.; Chen, L. Generate Neural Template Explanations for Recommendation. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, Virtual, 19–23 October 2020; pp. 755–764.
37. Avinesh, P.V.S.; Ren, Y.; Meyer, C.M.; Chan, J.; Bao, Z.; Sanderson, M. J3R: Joint Multi-task Learning of Ratings and Review Summaries for Explainable Recommendation. In Proceedings of the Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2019, Würzburg, Germany, 16–20 September 2019; Proceedings, Part III. Springer International Publishing: Cham, Switzerland, 2020; pp. 339–355.
38. He, X.; Deng, K.; Wang, X.; Li, Y.; Zhang, Y.; Wang, M. LightGCN: Simplifying and Powering Graph Convolution Network for Recommendation. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual, 25–30 July 2020; pp. 639–648.
39. Chen, X.; Zhang, J.; Wang, L.; Dai, Q.; Dong, Z.; Tang, R.; Zhang, R.; Chen, L.; Wen, J.R. REASONER: An Explainable Recommendation Dataset with Multi-aspect Real User Labeled Ground Truths Towards more Measurable Explainable Recommendation. arXiv 2023, arXiv:2303.00168.
40. Hou, Y.; Li, J.; He, Z.; Yan, A.; Chen, X.; McAuley, J. Bridging Language and Items for Retrieval and Recommendation. arXiv 2024, arXiv:2403.03952.
41. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. BLEU: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, 6–12 July 2002; pp. 311–318.
42. Lin, C.Y. ROUGE: A Package for Automatic Evaluation of Summaries. In Proceedings of the Text Summarization Branches Out, Barcelona, Spain, 25–26 July 2004; pp. 74–81.
43. Lu, Z.; Xie, H.; Liu, C.; Zhang, Y. Bridging the Gap between Vision Transformers and Convolutional Neural Networks on Small Datasets. Adv. Neural Inf. Process. Syst. 2022, 35, 14663–14677.

Figure 1. GCLTE: integrating Graph Contrastive Learning and transformers for rating prediction and reason generation. In the figure, “bos” represents the marker indicating the beginning of the input sequence.

Figure 2. Representation of final layer embedding.

Figure 3. Adding noise perturbation to node vectors.

Figure 4. Effects of hyperparameter settings on performance. (a) Effects of hyperparameter settings on rating prediction performance. (b) Effects of hyperparameter settings on reason generation performance.

Table 1. Ratings prediction based on REASONER dataset (test set).

	PETER	NRT	LightGCN	SGL	EMER	GCLTE
RMSE	1.3173	1.1926	1.2158	1.2030	1.3147	1.1147
MAE	1.0978	0.9484	1.0035	0.9617	1.1275	0.9189

Table 2. Ratings prediction based on Amazon_Music dataset (test set).

	PETER	NRT	LightGCN	SGL	EMER	GCLTE
RMSE	1.1269	1.0959	1.0981	1.0476	1.0954	1.0358
MAE	0.9050	0.8592	0.8715	0.8179	0.8611	0.8065

Table 3. Ratings prediction based on Amazon_Instant_Video dataset (test set).

	PETER	NRT	LightGCN	SGL	EMER	GCLTE
RMSE	1.1269	1.1191	1.1194	1.0645	1.1191	1.0816
MAE	0.9050	0.8891	0.8840	0.8432	0.8878	0.8476

Table 4. Reason generation based on REASONER dataset (test set).

Model	BLEU1	BLEU4	ROUGE_1/f	ROUGE_1/r	ROUGE_1/p	ROUGE_2/f	ROUGE_2/r	ROUGE_2/p
Att2Seq	20.5806	3.5184	22.5084	20.4895	26.0100	5.6244	5.2796	6.2503
PETER	17.1456	2.1726	19.6311	14.8421	29.4247	4.8076	3.5978	7.3658
NRT	18.5425	3.3937	20.6744	16.8437	27.9985	5.6636	4.7378	7.4232
EMER	18.6114	2.1851	22.0573	17.3172	30.9540	4.8569	4.0701	6.1467
GCLTE	20.8766	3.4942	22.7577	18.6529	27.1614	6.0271	5.2776	7.1744

Table 5. Reason generation based on Amazon_Music dataset (test set).

Model	BLEU1	BLEU4	ROUGE_1/f	ROUGE_1/r	ROUGE_1/p	ROUGE_2/f	ROUGE_2/r	ROUGE_2/p
Att2Seq	17.5467	1.5001	18.9466	14.9544	34.0016	3.2864	2.5508	6.8905
PETER	7.8509	0.3690	11.0091	6.2566	47.6637	0.9135	0.5177	3.9592
NRT	13.5462	1.1856	16.1556	10.2688	38.6763	3.3596	2.1277	8.1914
EMER	8.9918	0.7997	9.5330	5.2827	66.2802	1.9228	1.0618	14.1053
GCLTE	21.4245	1.7071	22.2740	17.8240	30.0254	4.0968	3.3816	5.2516

Table 6. Reason generation based on Amazon_Instant_Video dataset (test set).

Model	BLEU1	BLEU4	ROUGE_1/f	ROUGE_1/r	ROUGE_1/p	ROUGE_2/f	ROUGE_2/r	ROUGE_2/p
Att2Seq	12.8202	0.5419	18.4875	11.6949	44.9535	1.5134	0.9536	3.7028
PETER	7.8509	0.3690	11.0091	6.2566	47.6637	0.9135	0.5177	3.9592
NRT	11.4115	0.6270	16.0631	10.1750	38.9668	1.9961	1.2563	4.8961
EMER	13.3753	0.7190	17.6375	11.6058	37.6826	2.0430	1.3805	4.0564
GCLTE	21.1844	1.1768	23.2173	21.0422	26.1984	3.0755	2.9826	3.2088

Table 7. Ablation study based on the REASONER, Amazon_Music, and Amazon_Instant_Video datasets (test set).

Datasets	Model	RMSE	MAE	BLEU1	BLEU4	ROUGE_1/f	ROUGE_2/f
Reasoner	GCLTE	1.1147	0.9189	20.8766	3.4942	22.7577	6.0271
	${GCLTE}_{W / CL}$	1.2098	1.0023	18.1429	3.0023	20.0818	4.9880
	Improve	0.09	0.08	0.15	0.16	0.13	0.21
	${GCLTE}_{W / E}$	1.1189	0.9201	16.6446	2.0450	19.2808	4.7288
	Improve	0.01	0.01	0.25	0.71	0.18	0.27
	${GCLTE}_{GRU}$	1.1145	0.9193	16.1304	2.8761	18.6526	3.7841
	Improve	0.01	0.01	0.29	0.21	0.22	0.59
Music	GCLTE	1.0358	0.8065	21.4245	1.7071	22.2740	4.0968
	${GCLTE}_{W / CL}$	1.0967	0.8573	18.6119	1.2658	20.9199	3.1654
	Improve	0.06	0.06	0.15	0.35	0.06	0.29
	${GCLTE}_{W / E}$	1.0454	0.8021	13.9831	0.7928	16.1360	1.8572
	Improve	0.01	0.01	0.53	1.15	0.38	1.20
	${GCLTE}_{GRU}$	1.0397	0.8225	14.3898	1.1689	17.6789	3.3935
	Improve	0.01	0.02	0.49	0.46	0.26	0.21
Video	GCLTE	1.0816	0.8476	21.1844	1.1768	23.2173	3.0755
	${GCLTE}_{W / CL}$	1.1193	0.8844	17.0556	1.0914	20.6873	2.6436
	Improve	0.03	0.04	0.24	0.08	0.12	0.16
	${GCLTE}_{W / E}$	1.0817	0.8473	14.1067	0.8425	16.9328	1.7363
	Improve	0.01	0.01	0.50	0.40	0.37	0.77
	${GCLTE}_{GRU}$	1.0901	0.8440	14.3178	0.6712	18.1082	1.8650
	Improve	0.01	0.01	0.48	0.75	0.28	0.65

Table 8. Computational load analysis of NRT, PETER, and GCLTE.

Model	Total Time (s)	Epochs	Time/Epoch (s)	GPU Memory Usage (MiB)
NRT	130.1	38	2.6	2429
PETER	222.3	29	5.1	3243
GCLTE	116.5	13	4.5	3807

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
