1. Introduction
Recommender systems provide recommendations based on user preferences or past interactions. Common methods for recommendation include collaborative filtering, content-based filtering, and matrix factorization [1], while recent approaches use machine learning and deep learning techniques. Previous studies have primarily focused on improving the accuracy of recommendations; however, explaining the reasoning behind these recommendations can enhance user trust and improve the interpretability of the recommendation system.
Explainable recommendations provide comprehensible explanations for recommended items, such as “This item is recommended because you like this brand”. Some systems employ user reviews as input for explanation generation, utilizing the richness, diversity, and context-awareness of natural language to produce recommendations that users can easily understand. As a result, explainable recommendations have emerged as an important research area in recent years.
Approaches for generating textual explanations, in general, include template-based methods [2,3,4,5] and natural language generation (NLG) methods [6,7,8,9,10,11,12]. Template-based methods use predefined templates, extracting features or keywords to fill the blanks in the templates to produce explanations. These approaches have limited diversity and are time-consuming because they require predefined templates and feature extraction.
NLG-based methods, on the other hand, employ models to generate textual explanations. Recurrent Neural Networks (RNNs) were traditionally employed in this domain. However, contemporary research has demonstrated that Transformer-based architectures [13] offer enhanced performance. These models process sentences in parallel using attention mechanisms. For example, the PETER+ model [10] incorporates review-related keywords to aid explanation generation, while the MMCT model [8] enhances the process by integrating sentiment analysis and visual elements. Despite these advancements, existing methods still face challenges in generating personalized explanations.
The generation of personalized explanations can benefit from an analysis of user preferences, item features, and user review information. For example, the words used in a user’s reviews of high-rated items often differ from those of low-rated items, reflecting individual preferences that can serve as a basis for customized explanations. Although ratings have been considered in previous research, the primary focus is on rating prediction rather than improving explanation quality. By learning the relationship between personalized ratings and review vocabulary, recommendation systems can generate more tailored explanations.
Moreover, reviews highlight the aspects of interest to the user related to the item. The titles of items frequently reflect their theme and characteristics. Enabling the model to focus on item titles can yield explanations that are more relevant to the attributes of items. Additionally, users’ attention to specific features in their reviews reveals their personal preferences. Extracting feature-related keywords from user reviews can help discriminate between individual preferences so that the explanations can align with users’ interests. Summarizing review features can significantly improve the quality of personalized explanations in explainable recommender systems.
To address these challenges, we propose the EPER (Enhanced Personalization for Explainable Recommendation) model, which enhances the personalization and interpretability of explainable recommendation systems. The key contributions of this work are as follows:
Transformer-based explanation generation: Unlike prior methods that rely on RNN-based models, our approach employs Transformer architectures for improved parallel processing and contextual understanding.
Customized attention masking mechanism: We introduce a masking technique that prevents interference between the rating prediction and explanation generation tasks, improving the overall effectiveness of personalized recommendations.
Feature handling for personalized explanations: The model incorporates feature-related keywords extracted from user reviews and item titles to generate more relevant and personalized textual explanations.
Performance gains without image processing: Unlike MMCT [8], which requires image information for improved performance, our model achieves competitive or superior results without the need for image data, making it more flexible for practical applications.
Extensive empirical validation: We conduct comprehensive experiments on public datasets, demonstrating that EPER outperforms well-known models (e.g., NETE, PETER+, MMCT) in both rating prediction and explanation generation tasks.
The remainder of this paper is organized as follows:
Section 2 reviews related work on explainable recommendation models, covering both traditional template-based and modern neural-based approaches.
Section 3 introduces the proposed EPER model, detailing its architecture.
Section 4 describes the experimental setup, datasets, evaluation metrics, and comparative analysis with state-of-the-art models.
Section 5 concludes the paper.
2. Related Work
Explainable recommendation aims to generate human-interpretable justifications for suggested items while maintaining recommendation accuracy. In the literature, however, the terms “justification” and “explanation” are often used interchangeably. It is important to note that justifications generally refer to natural language statements that rationalize a model’s decisions, whereas explanations encompass a broader set of interpretability techniques—including context prediction, personalized feature handling, and other model-internal insights. In this work, our focus is on generating explanations that integrate these diverse interpretability strategies.
2.1. Template-Based Methods
Early research in explainable recommendation relied on template-based methods, which generate explanations by extracting structured product information (e.g., features, ratings, or keywords) and fitting them into predefined sentence templates. Representative works such as EFM [5], TriRank [2], and sCVR [4] apply phrase-level sentiment analysis to extract important aspects from user reviews and map them into fixed template-based sentences that align with user interests. NARRE [14] extends this approach by incorporating attention mechanisms to select relevant user reviews and present them as explanations.
Despite their interpretability, template-based methods suffer from limited adaptability and diversity, as they rely on manually predefined sentence templates. This lack of flexibility prevents them from generating highly personalized explanations tailored to individual user preferences. Moreover, manually constructing templates requires domain-specific expertise, making these methods less scalable for diverse recommendation scenarios.
2.2. RNN-Based and Hybrid NLG Methods
To improve upon template-based approaches, RNN-based NLG methods introduced more flexible text generation capabilities by leveraging deep learning. One of the first hybrid approaches, NETE [3], integrates template-based methods with neural generation techniques. It employs a Multi-Layer Perceptron (MLP) for rating prediction and a Gated Fusion Recurrent Unit (GFRU) for explanation generation. The GFRU consists of two GRU components: one for generating item feature words and another for sentence context words, with a Gated Fusion Unit (GFU) determining the final explanation at each time step.
Further improvements were introduced by SAER [11], which focuses on sentiment alignment between generated explanations and the predicted rating score. It incorporates Sentiment and Attribute Gates into a GRU-based model, allowing word selection to be directly influenced by the recommender module. This approach ensures that explanations are emotionally aligned with user expectations, improving personalization.
Additional developments in this category include ACMLM [15], which introduced personalized generation models for explanation tasks. R3 [16] enhances recommendation accuracy and explainability by extracting rationales from reviews, reducing spurious correlations. Despite these improvements, RNN-based models struggle with long-range dependencies, making them less effective at capturing complex contextual information. Additionally, their sequential nature prevents efficient parallelization, resulting in slower training and inference times compared with Transformer-based models.
2.3. Transformer-Based NLG Methods
With the emergence of self-attention mechanisms in Transformer-based architectures, explainable recommendation models have significantly improved in terms of efficiency, contextual learning, and parallel computation. Unlike RNNs, Transformers capture global dependencies across words, making them more effective at generating coherent and context-aware explanations. We briefly review the Transformer architecture [13] here.
The Transformer model leverages a self-attention mechanism that evaluates the importance of each word in the context of all other words in a sentence, allowing the model to capture global dependencies. Unlike RNNs, which process words sequentially, Transformers consider the entire input simultaneously. This global view enhances the model's ability to understand context and generate more coherent explanations. Processing tokens in parallel also substantially reduces training and inference times, which is especially beneficial when dealing with large datasets. Finally, the self-attention mechanism dynamically weighs the contribution of each word, leading to richer representations that improve both rating prediction and explanation generation.
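To make the mechanism concrete, the following is a minimal sketch of single-head scaled dot-product self-attention as described in [13]; the projection matrices, sequence length, and dimensionality here are illustrative values, not parameters of any specific model in this paper.

```python
import math
import torch

def scaled_dot_product_self_attention(x, w_q, w_k, w_v):
    """Minimal single-head self-attention over a sequence x of shape (T, d)."""
    q = x @ w_q          # queries  (T, d)
    k = x @ w_k          # keys     (T, d)
    v = x @ w_v          # values   (T, d)
    d = q.size(-1)
    scores = q @ k.transpose(0, 1) / math.sqrt(d)   # pairwise word-to-word scores (T, T)
    weights = torch.softmax(scores, dim=-1)         # each word attends to every word
    return weights @ v                              # context-aware representations (T, d)

# Illustrative usage: a sequence of 6 tokens with dimensionality 512.
T, d = 6, 512
x = torch.randn(T, d)
w_q, w_k, w_v = (torch.randn(d, d) for _ in range(3))
out = scaled_dot_product_self_attention(x, w_q, w_k, w_v)
print(out.shape)  # torch.Size([6, 512])
```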
Despite these advantages, Transformers come with challenges. Their parallel architecture demands significant computational resources and memory, making them computationally intensive compared with RNNs. Furthermore, while the self-attention mechanism enhances performance, it can also lead to difficulties in interpreting which specific parts of the input most strongly influence the final output.
One of the first Transformer-based models for explainable recommendations, PETER+ [10], employs a Transformer decoder to generate explanations while simultaneously predicting user ratings. It takes as input user IDs, item IDs, preprocessed feature words, and summary reviews, using customized attention masking to restrict unnecessary interactions between input components. While PETER+ provides high-quality textual explanations, it lacks dynamic feature integration, limiting its ability to adapt to missing or incomplete interaction data.
Building upon PETER+, MMCT [8] introduces multi-modal fusion, incorporating textual features, sentiment attributes, item images, and user reviews to enhance explanation diversity. It applies contrastive learning to improve personalization, using a Transformer-based encoder to model user–item relationships and a Transformer decoder for explanation generation. While MMCT demonstrates strong performance, its reliance on image-based features makes it less practical for scenarios where visual content is unavailable.
Another notable Transformer-based approach, PEPLER [17], employs pretrained language models for generating personalized explanations. It introduces discrete and continuous prompt learning techniques, treating user and item IDs as prompts to enhance the model's understanding of recommendation contexts. Unlike standard Transformer architectures, PEPLER benefits from pretraining on large-scale text corpora, improving fluency and coherence in generated explanations.
In addition to the works discussed above, several studies have specifically focused on natural language justifications. For example, a comprehensive survey on "Explanation and Justification in Machine Learning" that delineates the differences between natural language justifications and broader explanations is provided in [18]. Other notable works, such as [19,20,21], have also investigated justification generation strategies. While these studies primarily address justification, defined as natural language statements that rationalize decisions, our work focuses on generating broader explanations that integrate multiple interpretability techniques.
2.4. EPER: Enhancing Transformer-Based Explanation Models
Our proposed EPER (Enhanced Personalization for Explainable Recommendation) model builds upon Transformer-based NLG techniques while addressing key limitations of prior models.
EPER improves upon PETER+ by introducing a feature-handling mechanism that dynamically estimates missing interaction features during inference. Unlike previous models that rely solely on preprocessed feature words, EPER incorporates user–item features extracted from user reviews, ensuring that personalized explanations can be generated even when explicit feature data is missing.
In addition, EPER introduces a customized attention masking mechanism to prevent interference between rating prediction and explanation generation tasks. Prior models, such as NETE and SAER, often struggle to balance recommendation accuracy with explanation quality, as they tend to overemphasize either rating prediction or explanation generation. EPER effectively resolves this issue by employing multi-task learning, optimizing both objectives simultaneously.
Unlike MMCT, which requires multi-modal inputs, EPER achieves comparable or superior performance using only textual data. This makes EPER particularly useful in text-only recommendation settings, where additional image or sentiment features are unavailable. Moreover, compared with PEPLER, which leverages pretrained Transformers, EPER explicitly models user–item interactions and personalized feature selection, ensuring that generated explanations remain highly relevant to individual users.
2.5. Recent Advances in Explainable Recommendation
Beyond NLG-based methods, knowledge-enhanced and reinforcement learning approaches have emerged to further improve explainability. SKGAN (Social-Enhanced Knowledge Graph Attention Network) [22] integrates social network data with knowledge graphs, allowing for explainable recommendations that incorporate user relationships. Similarly, KRRL [23] applies reinforcement learning to enhance knowledge-aware reasoning in explainable recommendations. SERMON [24] integrates multi-modal contrastive learning to better model user preferences and item characteristics, facilitating reciprocal learning between textual and visual modalities for improved explainability.
Other recent advancements include PR4SR [25], which utilizes hierarchical reinforcement learning for session-based recommendations, and CrossDR-Gen [26], which incorporates spatial–temporal disentanglement representation to improve next-POI (point of interest) recommendations. These techniques demonstrate alternative strategies for explainable recommendations beyond Transformer-based NLG models.
3. Proposed Model
The goal of explainable recommendation is to provide a predicted rating and an explanation for the recommended item. Given a user ID $u$ and an item ID $i$, rating prediction is to predict the user's rating for the item, denoted as $\hat{r}_{u,i}$. Explanation generation is to generate an explanation for the user $u$ with respect to the item $i$, denoted as $\hat{E}_{u,i} = \{\hat{e}_1, \hat{e}_2, \ldots, \hat{e}_T\}$, where $\hat{e}_t$ is the $t$-th word in the explanation, $1 \le t \le T$, and $T$ is the length of the explanation.
Figure 1 shows the overall architecture of the proposed EPER (Enhanced Personalization for Explainable Recommendation) model. In Figure 1, the encoder block is depicted as consisting of multiple layers, each of which contains a multi-head self-attention layer followed by a position-wise feed-forward network. Residual connections and layer normalization accompany each sublayer to ensure stability during training. The decoder block is similarly structured, but with a key difference: it incorporates a masked multi-head self-attention layer in the first stage to prevent the decoder from accessing future tokens, followed by an encoder–decoder attention layer that fuses contextual information from the encoder, and finally a feed-forward network with the same residual and normalization operations. Both the Transformer encoder and the Transformer decoder contain two identical layers in our study.
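For readers who prefer code, the following is a minimal PyTorch sketch of the two-layer encoder/decoder stack described above. The hyperparameters (d = 512, 2 heads, 2 layers, feed-forward size 2048) follow the experimental settings in Section 4.1; the module wiring and toy tensors are illustrative, not the authors' released implementation.

```python
import torch
import torch.nn as nn

# Two-layer Transformer encoder and decoder, using PyTorch's built-in layers.
d_model, n_heads, n_layers, ffn_size = 512, 2, 2, 2048

encoder_layer = nn.TransformerEncoderLayer(
    d_model=d_model, nhead=n_heads, dim_feedforward=ffn_size, batch_first=True
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)

decoder_layer = nn.TransformerDecoderLayer(
    d_model=d_model, nhead=n_heads, dim_feedforward=ffn_size, batch_first=True
)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=n_layers)

# Toy forward pass: a batch of 4 source sequences (length 8) and target
# sequences (length 15). A causal mask keeps the decoder from seeing future tokens.
src = torch.randn(4, 8, d_model)
tgt = torch.randn(4, 15, d_model)
tgt_mask = nn.Transformer.generate_square_subsequent_mask(15)
memory = encoder(src)
out = decoder(tgt, memory, tgt_mask=tgt_mask)
print(out.shape)  # torch.Size([4, 15, 512])
```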
The EPER performs two tasks using the Transformer: rating prediction and explanation generation. During the construction (training) of the model for explainable recommendation, an input sequence $S$ and the corresponding review $W_{u,i} = \{w_1, w_2, \ldots, w_{|W_{u,i}|}\}$ are provided, where $w_j$ ($1 \le j \le |W_{u,i}|$) is a word in the vocabulary and $|W_{u,i}|$ is the length of the review. The sequence $S$ contains $u$, $i$, the rating $r_{u,i}$ of user $u$ for item $i$, the feature words $f_{u,i}$ in user $u$'s review for item $i$, and the corresponding item name $N_i = \{n_1, n_2, \ldots, n_{|N_i|}\}$ of item $i$, where $n_k$ ($1 \le k \le |N_i|$) is a word in the vocabulary. The input to the Transformer encoder consists of the corresponding embeddings, including user embeddings. The <bos> is the beginning-of-sentence token.
The task of rating prediction is to find the relationship between the user–item pair $(u, i)$ and its rating $r_{u,i}$. After training, the Transformer encoder has learned this relationship, and the first element of the encoder output (the hidden representation $H_{E,1}$) is used to predict the rating $\hat{r}_{u,i}$.
The task of explanation generation produces the corresponding explanatory text. It is aided by two subtasks: one performs context prediction ($C$) on $H_{E,2}$; the other performs personalized learning, which captures the important information from the input sequence via the Transformer encoder and outputs the hidden representation $H_E$. The entire $H_E$ is fed into the Transformer decoder to generate the explanation text (the review text), i.e., the explanation generation task. The results are output as the decoder hidden representation $H_D$, and the explanation words $\hat{e}_t$ are generated one by one.
3.1. Rating Prediction
The user–item ID pair $(u, i)$ is fed into the Transformer encoder, with the resulting output denoted as $H_{E,1}$. The pair $(u, i)$ is passed through the embedding layer, added to the positional encoding, and used as the input to the subsequent layers, as in a standard Transformer. The attention mechanism in the Transformer may learn the interactions of $(u, i)$. A Multi-Layer Perceptron (MLP) processes $H_{E,1}$, applying the sigmoid activation function to produce the numerical rating score, denoted as $\hat{r}_{u,i}$:

$$\hat{r}_{u,i} = \mathbf{W}_2\, \sigma(\mathbf{W}_1 H_{E,1} + \mathbf{b}_1) + \mathbf{b}_2,$$

where $\hat{r}_{u,i} \in \mathbb{R}$, $\mathbf{W}_1 \in \mathbb{R}^{d \times d}$, $\mathbf{W}_2 \in \mathbb{R}^{1 \times d}$, $\mathbf{b}_1 \in \mathbb{R}^{d}$, and $\mathbf{b}_2 \in \mathbb{R}$. Here, $\sigma$ is the sigmoid function, $\mathbf{W}_1$ and $\mathbf{W}_2$ are weight parameters, $\mathbf{b}_1$ and $\mathbf{b}_2$ are biases, and $d$ is the dimensionality.
For the rating prediction task, we use Mean Square Error (MSE) as the loss function, and the loss of rating prediction $\mathcal{L}_r$ is computed as

$$\mathcal{L}_r = \frac{1}{|\tau|} \sum_{(u,i) \in \tau} \left( r_{u,i} - \hat{r}_{u,i} \right)^2,$$

where $r_{u,i}$ is the ground-truth rating, $\hat{r}_{u,i}$ is the predicted rating, and $|\tau|$ is the number of user–item pairs in the training set $\tau$.
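As an illustration of the rating-prediction head and its MSE loss under the reconstruction above, the following sketch assumes a two-layer MLP with a sigmoid hidden activation over the first encoder output; layer shapes and names are assumptions rather than the authors' exact code.

```python
import torch
import torch.nn as nn

class RatingHead(nn.Module):
    """Predict a scalar rating from the first encoder hidden state H_E,1."""
    def __init__(self, d_model: int = 512):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_model)  # W1, b1
        self.fc2 = nn.Linear(d_model, 1)        # W2, b2
        self.sigmoid = nn.Sigmoid()

    def forward(self, h_e1: torch.Tensor) -> torch.Tensor:
        # h_e1: (batch, d_model) -> rating: (batch,)
        return self.fc2(self.sigmoid(self.fc1(h_e1))).squeeze(-1)

# MSE loss over a batch of ground-truth ratings.
head = RatingHead()
h_e1 = torch.randn(8, 512)                      # stand-in for encoder output H_E,1
true_ratings = torch.randint(1, 6, (8,)).float()
loss_r = nn.MSELoss()(head(h_e1), true_ratings)
```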
3.2. Context Prediction and Personalized Learning
Context prediction and personalized learning are two subtasks used to assist the explanation generation task. Context prediction captures interactions from the user–item pair $(u, i)$ and produces a corresponding result, which is converted into a probability distribution over all words in the vocabulary so as to reflect the relative likelihood of the words in the actual review explanation. Personalized learning uses customized masking with the Transformer encoder to learn the attention relationships among the $(u, i)$ interactions and then feeds the result into the Transformer decoder to assist explanation generation, ultimately improving the overall effectiveness of the generated explanations. The two subtasks are described below.
Context Prediction. As illustrated in Figure 1, the context prediction module leverages the hidden representation produced by the Transformer encoder, denoted as $H_E$. Notably, $H_E$ is composed of multiple elements, and we empirically select the second element, $H_{E,2}$, because it has been found to capture essential contextual nuances of the user–item interactions. By applying a linear transformation (with weight matrix $\mathbf{W}$ and bias $\mathbf{b}$) to $H_{E,2}$, followed by the softmax function [13], we obtain a probability distribution over the vocabulary $V$:

$$C = \mathrm{softmax}(\mathbf{W} H_{E,2} + \mathbf{b}),$$

where $C$ is a vector of size $|V|$, $\mathbf{W}$ is the weight parameter, $\mathbf{b}$ is the bias, and $d$ is the dimensionality. This formulation ensures that each component of $C$ reflects the likelihood of the corresponding token, thereby guiding the explanation generation process.
In the context prediction task, we use Negative Log-Likelihood (NLL) as the loss function, and the loss of context prediction $\mathcal{L}_c$ is computed as

$$\mathcal{L}_c = \frac{1}{|\tau|} \sum_{(u,i) \in \tau} \frac{1}{|W_{u,i}|} \sum_{t=1}^{|W_{u,i}|} -\log C_{w_t},$$

where $w_t$ represents the word at time step $t$ in the actual review $W_{u,i}$, $C_{w_t}$ represents the probability value in $C$ corresponding to the word $w_t$, $|W_{u,i}|$ is the length of the review text, and $|\tau|$ is the number of user–item pairs in the training set $\tau$.
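The context-prediction head and its NLL loss can be sketched as follows; the vocabulary size matches Section 4.1, while the batch size, review length, and tensor names are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model = 20000, 512
to_vocab = nn.Linear(d_model, vocab_size)       # W and b in the formula above

h_e2 = torch.randn(8, d_model)                  # stand-in for H_E,2 (batch of 8 pairs)
log_c = F.log_softmax(to_vocab(h_e2), dim=-1)   # log of the distribution C over V

# Ground-truth review word indices for each example, here 15 tokens per review.
review_ids = torch.randint(0, vocab_size, (8, 15))

# Negative log-likelihood of every review word under the single context
# distribution C of its example, averaged over words and examples.
token_nll = -log_c.gather(1, review_ids)        # (8, 15)
loss_c = token_nll.mean()
```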
Personalized Learning. As shown in Figure 1, the sequence $S = (u, i, r_{u,i}, f_{u,i}, N_i)$, after item embedding and positional embedding, is passed to the Transformer encoder to learn the interactions among the user, item, rating, focused features, and item title. Note that a custom masking method [10] is used to capture the relationships between sequence elements while preventing interference between the rating prediction and explanation generation tasks. The custom masking method allows mutual attention between $u$ and $i$ only, without referencing other information in the sequence $S$; the other elements in $S$ can attend to one another using standard attention computations. The output is represented as $H_E = [H_{E,1}, H_{E,2}, \ldots, H_{E,|S|}]$, where $H_E$ captures the personalized learning feature representations of $(u, i, r_{u,i}, f_{u,i}, N_i)$, aiding in the subsequent explanation generation task.
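As an illustration of how such a customized attention mask might be built for a PyTorch Transformer encoder, the sketch below assumes that positions 0 and 1 of the sequence hold the user and item tokens; the exact masking pattern used by EPER may differ, so this is only a sketch of the idea.

```python
import torch

def build_custom_mask(seq_len: int) -> torch.Tensor:
    """Boolean attention mask (True = blocked) for a sequence whose first two
    positions hold the user and item tokens. u and i may attend only to each
    other (and themselves); the remaining positions attend to everything."""
    mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)  # start fully open
    mask[0, 2:] = True   # u cannot attend to the rating/feature/title positions
    mask[1, 2:] = True   # i cannot attend to them either
    return mask

mask = build_custom_mask(seq_len=6)
print(mask.int())
# Pass as `mask=` to nn.TransformerEncoder(...); blocked pairs receive -inf scores.
```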
3.3. Explanation Generation
The input for explanation generation, comprising the personalized representation $H_E$, a special token <bos>, and the previously generated words, is processed by the Transformer decoder after item embedding and positional encoding. Similar to a standard Transformer, for each output position the model uses the input sequence up to time step $t-1$ to predict the current word $\hat{e}_t$ at time step $t$. The final explanation is $\hat{E}_{u,i} = \{\hat{e}_1, \hat{e}_2, \ldots, \hat{e}_T\}$.
Let the output of the Transformer decoder be represented as $H_D = [H_{D,1}, H_{D,2}, \ldots, H_{D,T}]$. Using the softmax function, the probability distribution for the $t$-th word is computed as

$$P_t = \mathrm{softmax}(\mathbf{W} H_{D,t} + \mathbf{b}),$$

where $P_t$ is a vector of size $|V|$, $\mathbf{W}$ is the weight parameter, $\mathbf{b}$ is the bias, and $d$ is the dimensionality. The predicted word $\hat{e}_t$ is then the word with the highest probability in the vocabulary. In the explanation generation task, we use NLL as the loss function, and the loss of explanation generation $\mathcal{L}_e$ is computed as

$$\mathcal{L}_e = \frac{1}{|\tau|} \sum_{(u,i) \in \tau} \frac{1}{|E_{u,i}|} \sum_{t=1}^{|E_{u,i}|} -\log P_t(e_t),$$

where $e_t$ represents the word at time step $t$ in the actual review $E_{u,i}$, $P_t(e_t)$ represents the probability value in $P_t$ corresponding to the word $e_t$, $|E_{u,i}|$ is the length of the review, and $|\tau|$ is the number of user–item pairs in the training set $\tau$.
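A minimal sketch of greedy, word-by-word generation with a Transformer decoder, consistent with the description above, is shown below; the embedding, decoder, and output-projection modules are toy placeholders, the <bos> index is assumed, and beam search or sampling could replace the argmax step.

```python
import torch
import torch.nn as nn

def greedy_generate(decoder, embed, to_vocab, memory, bos_id, max_len=15):
    """Generate an explanation one token at a time from the encoder memory H_E."""
    generated = [bos_id]
    for t in range(max_len):
        tgt = embed(torch.tensor([generated]))            # (1, t+1, d)
        tgt_mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        h_d = decoder(tgt, memory, tgt_mask=tgt_mask)     # (1, t+1, d)
        probs = torch.softmax(to_vocab(h_d[:, -1]), dim=-1)
        next_id = int(probs.argmax(dim=-1))               # highest-probability word
        generated.append(next_id)
    return generated[1:]                                  # drop <bos>

# Illustrative wiring with toy modules (vocabulary of 20,000 words, d = 512).
vocab_size, d_model = 20000, 512
embed = nn.Embedding(vocab_size, d_model)
to_vocab = nn.Linear(d_model, vocab_size)
dec_layer = nn.TransformerDecoderLayer(d_model, nhead=2, dim_feedforward=2048, batch_first=True)
decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
memory = torch.randn(1, 6, d_model)                       # stand-in for H_E
explanation_ids = greedy_generate(decoder, embed, to_vocab, memory, bos_id=1)
```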
3.4. Model Optimization and Model Inference
The overall model architecture integrates the losses from the various tasks for optimization. The total loss $\mathcal{L}$ is computed as

$$\mathcal{L} = \lambda_r \mathcal{L}_r + \lambda_c \mathcal{L}_c + \lambda_e \mathcal{L}_e,$$

where $\lambda_r$, $\lambda_c$, and $\lambda_e$ are the respective weight coefficients for $\mathcal{L}_r$, $\mathcal{L}_c$, and $\mathcal{L}_e$.
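Assuming the three loss weights reported in Section 4.1 correspond to λr, λc, and λe in that order, the joint objective can be assembled in the usual multi-task way, as in the short sketch below (the loss values are stand-ins for the outputs of the earlier sketches).

```python
import torch

# Stand-ins for the three task losses produced by the earlier sketches.
loss_r = torch.tensor(0.9, requires_grad=True)
loss_c = torch.tensor(5.2, requires_grad=True)
loss_e = torch.tensor(4.7, requires_grad=True)

# Weighted multi-task objective; the ordering (lambda_r, lambda_c, lambda_e)
# = (0.1, 1.0, 1.0) is an assumption about the values listed in Section 4.1.
lambda_r, lambda_c, lambda_e = 0.1, 1.0, 1.0
total_loss = lambda_r * loss_r + lambda_c * loss_c + lambda_e * loss_e
total_loss.backward()   # gradients then flow back to all three tasks' parameters
```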
When the model is used to predict the rating and the explanation during model inference, the only inputs required are $u$, $i$, $f_{u,i}$, and $N_i$. The rating prediction uses $u$, $i$, and the hidden representation $H_E$ to produce the rating $\hat{r}_{u,i}$. The explanation generation takes $f_{u,i}$, $N_i$, and <bos> to generate the final explanation $\hat{E}_{u,i}$.
3.5. User–Item Feature
The proposed EPER model, along with existing explainable recommendation models such as PETER+ [10] and MMCT [8], incorporates user–item features $f_{u,i}$ during inference. However, in real-world applications, these features are often unavailable because they are typically extracted from historical user reviews. This issue is particularly critical when making recommendations for new interactions, where the user has not provided a prior review for the item.
Example: Consider an e-commerce platform where a user purchases a newly released product. Since the user has not written a review for this product, existing explainable recommendation models cannot directly retrieve feature-based user preferences (e.g., “comfortable fit” and “durable material”). As a result, models relying solely on textual review-based feature extraction may fail to generate meaningful personalized explanations.
To address this challenge, we propose a feature-generation mechanism using a Transformer encoder. This mechanism predicts user–item features dynamically by leveraging user history and similar item attributes, enabling the model to generate explanations even in cases where interaction-based features do not exist. This method ensures that EPER remains effective in real-world scenarios, where explicit feature-based reviews are not always available.
The preprocessing uses a Transformer encoder. The Transformer encoder takes a user–item pair $(u, i)$ as input, and the feature vector $H_{E,2}$ from the second hidden position of the Transformer encoder is utilized for the prediction. $H_{E,2}$ is processed with a linear transformation and the softmax function, converting the output into a probability distribution, denoted as $F$, over all feature tokens in the feature set $V_f$:

$$F = \mathrm{softmax}(\mathbf{W}_f H_{E,2} + \mathbf{b}_f),$$

where $F$ is a vector of size $|V_f|$, $\mathbf{W}_f$ is the weight parameter, $\mathbf{b}_f$ is the bias, $d$ is the dimensionality, and $V_f$ is the feature set of user $u$ for item $i$. We then apply the argmax function to extract the feature with the highest probability and treat it as the predicted feature: $\hat{f}_{u,i} = \arg\max_{f \in V_f} F(f)$.
In this feature prediction, we use NLL as the loss function, and the loss of feature prediction $\mathcal{L}_F$ is computed as

$$\mathcal{L}_F = \frac{1}{|\tau|} \sum_{(u,i) \in \tau} -\log F(f_{u,i}),$$

where $f_{u,i}$ represents the ground-truth feature, $F(f_{u,i})$ represents the probability value in $F$ corresponding to that feature, and $|\tau|$ is the number of user–item pairs in the training set $\tau$.
Finally, the predicted feature $\hat{f}_{u,i}$ substitutes for $f_{u,i}$ in the inference phase of the described EPER model, and this practically used model is named EPER_F.
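A minimal sketch of the feature-prediction head, its NLL loss, and the argmax selection described above follows; the size of the feature set and the module names are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

num_features, d_model = 5000, 512               # assumed size of the feature set V_f
to_feature = nn.Linear(d_model, num_features)   # W_f, b_f in the formula above

h_e2 = torch.randn(8, d_model)                  # stand-in for H_E,2 of a batch of (u, i) pairs
log_f = F.log_softmax(to_feature(h_e2), dim=-1)

# Training: NLL loss against the ground-truth feature index of each pair.
true_feature_ids = torch.randint(0, num_features, (8,))
loss_f = F.nll_loss(log_f, true_feature_ids)

# Inference: pick the highest-probability feature as the predicted feature.
predicted_feature_ids = log_f.argmax(dim=-1)
```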
4. Experiments
To evaluate the effectiveness of the proposed EPER model, we conduct a series of experiments using public datasets and compare our results with existing baseline methods. This section details the experimental setup, datasets, baseline models, and evaluation metrics. We present comprehensive results on both explanation generation and rating prediction tasks, followed by a comparative analysis across different datasets. Additionally, an ablation study is conducted to analyze the impact of various design choices in our model. The experimental findings demonstrate the advantages of EPER over competing methods, particularly in generating high-quality, personalized explanations while maintaining competitive rating prediction performance.
4.1. Experimental Datasets and Setup
To assess the performance of the proposed EPER model, we conducted comprehensive experiments using well-established public datasets. The Amazon (https://cseweb.ucsd.edu/~jmcauley/datasets/amazon/links.html, accessed on 10 January 2024) e-commerce dataset [3,5,6,8,10,14,15,16,17,24], specifically its Clothing, Shoes, and Jewelry and Movies and TV subsets, as well as the Yelp (https://www.yelp.com/dataset, accessed on 10 January 2024) Challenge 2019 dataset [2,4,6,9,11,12,14,16,22], were selected based on their widespread usage in explainable recommendation research. These datasets have been utilized in prior studies to evaluate personalized recommendation models and natural language generation for recommendation explanations [2,3,4,5,6,8,9,10,11,12,14,15,16,17,22]. Their rich textual reviews, user–item interactions, and structured rating information make them ideal for both rating prediction and explanation generation tasks. Unless otherwise specified, the dataset used in the following context is the Amazon Clothing dataset.
While the Amazon e-commerce and Yelp Challenge datasets are widely used benchmarks for explainable recommendation tasks, we acknowledge that these datasets are primarily focused on e-commerce and service-based domains. The generalizability of the EPER model to other domains, such as healthcare, education, or financial services, depends on the availability of structured user–item interactions and textual reviews. Since the EPER model relies on Transformer-based architectures, it can learn meaningful representations from domain-specific text, making it adaptable with appropriate fine-tuning.
To address concerns about unreliable or low-quality data, our model incorporates attention mechanisms that emphasize relevant content while reducing noise from uninformative reviews. Additionally, feature extraction techniques, such as sentiment analysis and aspect-based filtering, help mitigate issues related to data sparsity and inconsistent review quality. Future research may explore domain adaptation techniques, such as pretraining on diverse datasets or incorporating external knowledge graphs, to further enhance the robustness of EPER across different domains.
Table 1 presents the statistics of the datasets after preprocessing. The data preprocessing follows modern research on explainable recommendation: a phrase-level sentiment analysis toolkit [27] is used to preprocess the text reviews. Each review includes a user ID, item ID, rating (1–5), a review explanation, and a feature word. The top 20,000 most frequent words are retained as the vocabulary V. The dataset is randomly divided into three subsets: 80% for training, 10% for validation, and 10% for testing. The split ensures that every user and item in the testing set also appears in the training set, and each user and item in the training set has at least one review.
Our experimental datasets represent large-scale benchmarks in the research on explainable recommendations: the Amazon Clothing dataset contains 179 K records, Amazon Movies comprises 441 K records, and the Yelp dataset includes 1.293 M records. These figures illustrate that EPER has been evaluated on data of substantial size.
With respect to performance evaluation, rating prediction uses the Root Mean Square Error (RMSE) and Mean Absolute Error (MAE) metrics, where lower scores indicate better performance. The evaluation of explanation generation covers two aspects: the quality of the generated text reviews and the level of personalization. The quality of the generated reviews is assessed by two common metrics, BLEU (Bilingual Evaluation Understudy) [28] and ROUGE (Recall-Oriented Understudy for Gisting Evaluation) [29]; higher scores indicate that the generated text is more similar to the actual text, suggesting better explanation generation quality. The level of personalization is evaluated using three metrics [3]: Feature Matching Ratio (FMR), Feature Coverage Ratio (FCR), and Feature Diversity (DIV). Among these, only a lower DIV score signifies better performance.
In addition to the standard BLEU and ROUGE metrics for assessing text quality, we evaluate the level of personalization using three metrics: Feature Matching Ratio (FMR), Feature Coverage Ratio (FCR), and Feature Diversity (DIV). Their formulas are defined as follows:

$$\mathrm{FMR} = \frac{1}{|N|} \sum_{(u,i)} \mathbb{1}\left( f_{u,i} \in \hat{E}_{u,i} \right),$$

where $\hat{E}_{u,i}$ denotes the explanation generated for the user–item pair $(u, i)$, $f_{u,i}$ represents the feature words present in the ground-truth summary review, and $|N|$ represents the total number of generated sentences. $\mathbb{1}(\cdot)$ is an indicator function: it equals 1 if the generated explanation contains a feature word from the ground-truth summary review; otherwise, it equals 0, indicating that the feature word is not included.

$$\mathrm{FCR} = \frac{N_g}{|F|},$$

where $|F|$ represents the total number of feature terms in the entire dataset, and $N_g$ denotes the number of distinct feature terms included in the generated explanations.

$$\mathrm{DIV} = \frac{2}{|N|\,(|N|-1)} \sum_{(u,i) \neq (u',i')} \left| F_{u,i} \cap F_{u',i'} \right|,$$

where $|N|$ denotes the total number of user–item summary reviews in the test set, and $F_{u,i}$ and $F_{u',i'}$ represent the feature sets corresponding to two different user–item pairs, respectively.
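As an illustration, the following sketch computes the three personalization metrics on toy data under the reconstructed formulas above; the single-feature-per-review simplification and the toy feature set are assumptions made for brevity.

```python
from itertools import combinations

# Toy data: one ground-truth feature word and one generated explanation per user-item pair.
gt_features = ["price", "waist", "fabric"]
explanations = [
    "the price was great and the quality is great",
    "they are a bit tight in the waist",
    "the color is beautiful",
]
all_dataset_features = {"price", "waist", "fabric", "color", "size"}  # the set F

# FMR: fraction of explanations containing their ground-truth feature word.
fmr = sum(f in e.split() for f, e in zip(gt_features, explanations)) / len(explanations)

# FCR: distinct dataset features that appear anywhere in the generated explanations.
generated_feature_sets = [set(e.split()) & all_dataset_features for e in explanations]
fcr = len(set().union(*generated_feature_sets)) / len(all_dataset_features)

# DIV: average feature overlap between every pair of generated explanations (lower is better).
pairs = list(combinations(generated_feature_sets, 2))
div = sum(len(a & b) for a, b in pairs) / len(pairs)

print(fmr, fcr, div)
```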
The experimental model was implemented using Python 3.8.16 and PyTorch 2.0.0 with CUDA 11.7 support; the GPU was an NVIDIA GeForce RTX 2080 Ti. The training parameters were set as follows: maximum training epochs: 100; embedding size ($d$): 512; Transformer layers and attention heads ($H$): 2; feed-forward neural network size: 2048; MLP layers: 2; loss weights ($\lambda_r$, $\lambda_c$, $\lambda_e$): 0.1, 1.0, 1.0; optimization: SGD; batch size: 128; and explanation length: 15. On the Amazon Clothing dataset, the model requires approximately 2.4 min per epoch for training, while on the Yelp dataset the training time is about 6.9 min per epoch. These measurements were obtained under identical conditions using a batch size of 128. Such results demonstrate that the EPER model is computationally efficient, making it suitable for practical applications in real-world recommendation systems.
4.2. Baseline Models
In this study, the baseline models are categorized into two tasks: rating prediction and explanation generation. Of these baseline models, only ACMLM [15] excludes a rating prediction component.
ACMLM [15]: Uses fine-tuned BERT [30] to generate personalized and diverse explanations.
NETE [3]: Utilizes a customized GRU model to learn and integrate specific features into sentence templates for better explanation generation.
PETER+ [10]: Incorporates personalized Transformer models with user and item IDs and feature words to aid in explanation generation.
SAER [11]: Employs a GRU model and includes an emotion alignment task, enabling recommendations to directly influence explanation generation.
MMCT [8]: Utilizes a personalized Transformer model to integrate multi-modal information and uses contrastive learning to generate explanations, improving quality. It requires item images for generating recommendation explanations.
4.3. Experimental Results
In this subsection, we present the experimental results of the EPER model, focusing on its performance in explanation generation and rating prediction compared with baseline models. We evaluate the quality of generated explanations using standard NLP metrics such as BLEU and ROUGE, followed by an analysis of personalization effectiveness through FMR, FCR, and DIV metrics. Next, we assess the model’s accuracy in rating prediction using RMSE and MAE. Additionally, we compare EPER across different datasets and conduct an ablation study to examine the impact of various design components. These results collectively highlight the strengths of EPER in producing personalized, high-quality recommendation explanations while maintaining strong predictive performance.
4.3.1. Performance of Explanation Generation
Table 2 shows the performance of different methods on the Amazon Clothing dataset. In the following context, bold scores indicate the best performance in each row, while underlined scores indicate the second best. Improv. (%) shows the improvement percentage of our proposed EPER model relative to the best competing model. In the table, “⭡” indicates that a higher value is better for the metric, and “⭣” indicates that a lower value is better.
Compared with ACMLM, NETE, SAER, and PETER+, MMCT shows only a small enhancement in text quality. When compared with the explainable recommendation methods mentioned above, our proposed EPER model demonstrates superior performance in text quality metrics (BLEU and ROUGE), as shown in Table 2. The EPER model outperforms other explanation generation methods in most indicators, with improvements of 1.71% in R1_P, 0.86% in R1_F, 3.27% in R2_P, 0.11% in R2_R, and 1.94% in R2_F. However, it slightly underperforms MMCT in B1, B4, and R1_R, with gaps of 4.44%, 2.3%, and 1.23%, respectively.
Next, we evaluate the level of personalization in the generated text using the FMR, FCR, and DIV metrics. As shown in Table 2, our EPER model outperforms or performs on par with all the other methods, indicating that it not only generates specific item features but also effectively incorporates feature information into the decoding process, making the explanations more personalized. Compared with the second-best method, our model achieves a 1.05% improvement in FMR, a 6.82% improvement in FCR, and performs equally well in DIV.
EPER's superiority is highlighted by its significant improvements in key performance indicators. Although EPER shows slight underperformance in B1, B4, and R1_R compared with MMCT, its enhancements in personalization metrics (FMR and FCR) and in text quality metrics (with improvements of 1.71% in R1_P and 3.27% in R2_P) clearly demonstrate its effectiveness. These results indicate that EPER is more adept at integrating feature-level information and contextual cues, thereby generating more personalized and high-quality explanations.
4.3.2. Performance of Rating Prediction
Table 3 presents the performance of rating prediction of various methods on the Amazon Clothing dataset. Although EPER’s RMSE (1.05) is marginally higher than that of MMCT (1.04), its superior performance in MAE (0.83 vs. 0.84) indicates more accurate overall rating predictions. The slight difference in RMSE can be attributed to the joint optimization strategy that balances rating prediction with explanation generation, ultimately enhancing the overall recommendation quality. This indicates mutual support between the generation task and the recommendation task during training. Additionally, learning user- and item-related information influences the effectiveness of recommendation performance.
4.3.3. Performance on Different Datasets
Various explainable recommendation methods were compared on the Amazon Movies and Yelp datasets. Because MMCT requires images, which are not available for all items in the Yelp dataset, the experimental results exclude the performance of MMCT.
Table 4 shows that on the Amazon Movies dataset, EPER outperforms other methods in terms of text quality with respect to the ROUGE metric, while EPER is second best for other metrics.
Table 5 shows that, on the Yelp dataset, EPER outperforms all the other methods on every metric except DIV, where it is second best. Examining the characteristics of the Amazon Movies dataset reveals that item names influence the performance of the EPER model: in that dataset there is no clear relationship between item names and review content, whereas the relationship between item names and review content is much more apparent in the Amazon Clothing and Yelp datasets.
Based on the above analysis, our proposed EPER model generally performs the best or slightly lags behind MMCT. However, MMCT requires the inclusion of image information, which is not always available for all items in practical applications. In brief, our proposed EPER model can achieve good results or even better ones without the need for image processing, making it more flexible in terms of dataset applicability.
4.3.4. Ablation Study
Several ablation studies were conducted to evaluate the impact of various design aspects of the EPER model.
Table 6 presents the experimental results, which can be analyzed from three perspectives: (1) the impact of eliminating certain inputs from the model, (2) the effect of individual loss components, and (3) the role of the masking mechanism.
(1) Elimination of Certain Inputs: The absence of any of the three key inputs, namely user ratings ($r_{u,i}$), item titles ($N_i$), or user–item features ($f_{u,i}$), results in degraded model performance across rating prediction, review quality, and personalization metrics. The most crucial input is the rating ($r_{u,i}$), followed by the item title ($N_i$), and then the user–item feature ($f_{u,i}$). The most notable observation is that removing user–item features results in a sharp decline in explainability and personalization quality. This is because $f_{u,i}$ represents key user preferences extracted from past reviews, which helps generate more personalized explanations. When it is unavailable, the model struggles to provide user-specific explanations.
In practical recommendation systems, interaction-based features like $f_{u,i}$ are often missing during inference, especially in cold-start scenarios or when recommending new products. To address this, EPER incorporates a feature-generation mechanism that dynamically predicts missing features using a Transformer encoder. This technique significantly improves performance, as seen in Table 6, but there is still room for enhancement, particularly in handling low-quality or noisy inferred features.
(2) The Effect of Individual Loss Components: Removing the rating loss $\mathcal{L}_r$ generally reduces review quality, as indicated by lower BLEU and ROUGE scores. However, precision scores for ROUGE_1, ROUGE_2, and the ROUGE_2 F-measure remain relatively stable.
Removing the context prediction loss $\mathcal{L}_c$ has a greater negative impact than removing $\mathcal{L}_r$, leading to consistent performance degradation across rating prediction, explanation quality, and personalization metrics. This highlights the importance of context prediction in generating high-quality explanations.
(3) The Role of the Masking Mechanism: The EPER_N variant (Table 6) represents EPER without the customized masking mechanism. Without masking, the model's personalization and review quality significantly degrade, as unintended attention interactions introduce noise into explanation generation. However, the impact on rating prediction is minimal, reinforcing that the masking mechanism is primarily beneficial for improving explanation quality.
4.3.5. Performance on User–Item Feature
As indicated in Table 6 and in Section 4.3.4, performance decreases greatly without the user–item feature $f_{u,i}$. However, as noted earlier, the user–item feature $f_{u,i}$ is often not available in practice. Therefore, we conducted an experiment to evaluate whether using the predicted user–item feature $\hat{f}_{u,i}$, obtained by the mechanism described in Section 3.5, can approach the performance of using the real user–item feature $f_{u,i}$. The model used in the comparison was PETER+, which also used $\hat{f}_{u,i}$ as a substitute for $f_{u,i}$ in the evaluation. Note that even though the overlap between $\hat{f}_{u,i}$ and $f_{u,i}$ is not large, the performance improvement is still noticeable.
Table 7 shows that, when the user–item feature $f_{u,i}$ is unavailable, EPER outperforms PETER+ in rating prediction, review quality, and personalization level. EPER achieves better scores on all metrics, including RMSE, MAE, BLEU, ROUGE, FMR, FCR, and DIV.
Table 8 shows that, using the user–item feature $\hat{f}_{u,i}$ computed by our mechanism, both models obtain performance gains, except in BLEU. Overall, EPER still outperforms PETER+.
4.3.6. Comparisons of Explanation Cases
Table 9 presents examples of explanations generated by different models on the Amazon Clothing dataset. In Case 1, the ground-truth user review is “The price is great” for a product rated 5.0, and in Case 2, the ground-truth states “It was very small in the waist” for a product rated 1.0. In addition to the numerical performance comparisons provided in previous sections, we now offer a deeper qualitative analysis of these explanation cases to assess their effectiveness, interpretability, and potential impact on user understanding.
Case 1 Analysis: In this example, EPER, MMCT, and PETER+ all generate explanations that mention the product’s price. Notably, EPER’s explanation—“The price was great and the quality is great”—not only emphasizes the attractive price but also supplements it with a comment on quality. This dual focus likely enhances the explanation’s clarity and relevance, helping users better understand the rationale behind the recommendation. Such a balanced explanation can increase user trust by aligning closely with the sentiments expressed in the ground-truth.
Case 2 Analysis: For the second example, where the ground-truth highlights a concern regarding the small waist size, EPER produces the explanation “They are a bit tight in the waist”. This output effectively captures the key issue and presents it in a concise manner, which is crucial for user interpretability. In contrast, PETER+ fails to address the concern properly, providing a vague statement that does not reflect the critical aspect of the user’s review. The clear, focused nature of EPER’s explanation suggests that it is better suited to inform users about specific product attributes that influence their purchasing decision.
Overall Qualitative Insights: Our expanded analysis indicates that EPER’s explanations tend to be more aligned with the underlying review sentiments and user preferences. By integrating multiple interpretability techniques, EPER not only achieves higher scores on standard metrics (BLEU, ROUGE, FMR, FCR, and DIV) but also produces explanations that are clearer and more actionable. This qualitative improvement is significant for applications where user understanding and trust in the recommendation system are paramount.
5. Conclusions
We have proposed an explainable recommendation model, EPER, based on the Transformer architecture, which provides both recommendation scores and personalized natural language explanations for items recommended to users. EPER simultaneously considers user–item IDs, user ratings, feature words, and item title information, effectively improving the text quality of personalized recommendation explanations. A masking mechanism is incorporated into the model design, enabling the rating and explanation tasks to complement each other without interference. Additionally, we proposed a feature-handling method to address the limitation of previous approaches that rely on interaction features that are not available at inference time.
Extensive experiments demonstrate that EPER achieves significant improvements in both explanation quality and recommendation scores compared with other methods. Notably, without requiring the preprocessing of image information, EPER outperforms or matches the overall performance of the well-known MMCT model. EPER overcomes the limitations of datasets lacking item image information, making it more widely applicable in practical applications. Furthermore, our proposed feature-handling method enhances the practical implementation while delivering high-quality personalized recommendation explanations.
While our experimental results demonstrate the effectiveness of EPER in explainable recommendations within e-commerce and service domains, several limitations should be acknowledged. First, the model assumes that textual explanations and structured metadata (e.g., user reviews and item names) are available. However, in domains with sparse or low-quality review data, performance may degrade significantly. Second, although the attention mechanisms improve explainability, the model does not explicitly model temporal user preferences—future extensions could incorporate sequential modeling techniques. Third, our current implementation does not incorporate multi-modal information such as product images or audio reviews, which could further enhance the personalization and relevance of generated explanations. Finally, the computational complexity of Transformer-based models remains a concern, particularly for real-time recommendation applications, necessitating future optimization strategies such as knowledge distillation or efficient Transformer architectures.
To address these challenges, future work can focus on adapting EPER to additional domains, such as healthcare, finance, and education, where explainability is essential but data structures differ significantly. Additionally, domain adaptation techniques, including self-supervised pretraining on multiple datasets, could help the model generalize beyond e-commerce recommendations. Exploring multi-modal learning approaches, where textual explanations are supplemented with visual and contextual information, could improve both accuracy and interpretability. Enhancing fairness and interpretability in recommendation explanations using causal inference or counterfactual reasoning techniques is another promising direction to minimize bias and improve trustworthiness in AI-driven recommendations.
In summary, the EPER model, by incorporating real ratings and item names as features into the learning process, effectively enhances the quality of explanation generation. It also successfully identifies the features users are concerned about, producing personalized recommendation explanations.