Collaborative Knowledge-Enhanced Recommendation with Self-Supervisions

Abstract: Knowledge-enhanced recommendation (KER) aims to integrate the knowledge graph (KG) into collaborative filtering (CF) to alleviate the sparsity and cold-start problems. State-of-the-art graph neural network (GNN)-based methods mainly focus on exploiting the connectivity between entities in the knowledge graph, while neglecting the interaction relation between items reflected in the user-item interactions. Moreover, the widely adopted BPR loss for model optimization fails to provide sufficient supervision for learning discriminative representations of users and items. To address these issues, we propose the collaborative knowledge-enhanced recommendation (CKER) method. Specifically, CKER uses a collaborative graph convolution network (CGCN) to learn the user and item representations from the connections between items in the constructed interaction graph and the connectivity between entities in the knowledge graph. Moreover, we introduce self-supervised learning to maximize the mutual information between the interaction- and knowledge-aware user preferences by deriving additional supervision signals. We conduct comprehensive experiments on two benchmark datasets, namely Amazon-Book and Last-FM, and the experimental results show that CKER outperforms the state-of-the-art baselines in terms of recall and NDCG on knowledge-enhanced recommendation.


Introduction
Recommender systems (RS) are an effective way to filter irrelevant information on the internet and meet users' personalized needs [1][2][3], and they are widely applied in search engines [4], e-commerce websites [5], etc. As a classical recommendation method, CF aims to generate recommendations by learning user and item representations from the user-item interactions [6][7][8]. However, CF faces the data sparsity issue and the cold-start problem (that is, making recommendations for new users who have not interacted with any items, or recommending new items that have not been interacted with by any users) [9][10][11], leading to unsatisfactory performance. To address these limitations, KER introduces the knowledge graph (KG) into CF to enrich the connectivity between items in scenarios where a knowledge graph is accessible, thus learning high-quality latent vectors for users and items to enhance recommendations.
Earlier methods for knowledge-enhanced recommendation can be divided into embedding-based methods [12][13][14][15] and path-based methods [16][17][18][19][20], where the embedding-based methods learn the user and item representations using the knowledge base, and the path-based methods enhance the recommendation by exploiting the connectivity patterns between items in the KG. For instance, Zhang et al. propose to enhance the item representations with the structural, textual and visual knowledge extracted from the knowledge base [12], and Wang et al. propose an end-to-end framework that performs explicit reasoning on the KG to learn path semantics and improve the recommendation interpretability [18].
Recently, the GNN-based methods were proposed by adopting the GNNs to unify the aforementioned two categories of methods [9,10,21,22]. For example, Wang et al. propose to detect the user interest by iteratively propagating the user preference in the KG [9], and Wang et al. further exploit the high-order connectivity among users, items and entities in the constructed collaborative knowledge graph [10].
Though the above-mentioned methods have achieved considerable performance, several limitations remain. First, the embedding- and path-based methods show unsatisfactory performance since they either leverage the semantic representation of items or model the semantic connectivity between items, without taking both aspects of information into consideration. Moreover, the state-of-the-art GNN-based recommenders mainly focus on exploiting the high-order connectivity between entities in the knowledge graph, neglecting the interaction relation (i.e., interacted with by the same user) between items. In addition, the widely adopted Bayesian personalized ranking (BPR) loss cannot provide sufficient supervision signals to learn accurate representations of users and items, and thus fails to effectively distinguish the candidate items and accurately rank them when making predictions.
To address the above issues, we propose the CKER method. Specifically, given the historical user-item interactions, we first construct an interaction graph to reflect the interaction relation between items. Then, we propose the CGCN for learning the user and item representations, which includes two channels that exploit the connections between items in the interaction graph and the semantic connectivity between entities in the knowledge graph, respectively. After that, we construct a user-item bipartite graph from the historical interactions, which is used to generate the interaction- and knowledge-aware user preferences from the representations of the user's interacted items produced by the two convolution channels. Next, we derive self-supervision by adopting the information noise-contrastive estimation (InfoNCE) [23] to maximize the mutual information between the interaction- and knowledge-aware preferences of each user, which is jointly trained with the main supervised learning.
Comprehensive experiments are conducted on two benchmark datasets, namely Amazon-Book and Last-FM. The significant improvement of CKER over the state-of-the-art baselines in terms of recall and NDCG demonstrates the effectiveness of our proposal.
We summarize the main contributions of this paper as follows:
1. To the best of our knowledge, we are the first to simultaneously consider the connection between items reflected in the user-item interactions and the semantic connectivity between entities in the knowledge graph;
2. We introduce self-supervised learning to derive additional supervision signals for enhancing the representation learning of users and items by maximizing the mutual information between the user's interaction- and knowledge-aware preferences;
3. Extensive experiments conducted on two benchmark datasets, namely Amazon-Book and Last-FM, demonstrate the superiority of CKER over the competitive baselines in terms of recall and NDCG on the knowledge-enhanced recommendation task.
The rest of this paper is organized as follows. First, the related literature is summarized in Section 2. Then, we detail our proposed CKER model in Section 3. After that, we describe the experimental settings in Section 4 and analyze the experimental results in detail in Section 5. Finally, we conclude this work and suggest our future directions in Section 6.

Related Work
In this section, we first review the previous work for the knowledge-enhanced recommendation in Section 2.1, and then summarize the related work about the self-supervised learning and its applications in RS in Section 2.2.

Knowledge-Enhanced Recommendation
The existing work on knowledge-enhanced recommendation can be mainly divided into three categories, i.e., the embedding-based methods, the path-based methods and the GNN-based methods. The embedding-based methods generally utilize the semantic connections in the KG to enhance the representation learning of items or users. For example, Cao et al. propose to perform multi-task learning by combining item recommendation and knowledge graph completion to introduce the item knowledge into the user preference generation [14]. Moreover, the path-based models aim to exploit the connectivity patterns between items in the KG for guiding the personalized recommendation. For instance, Yu et al. regard the KG as a heterogeneous information network and diffuse the user preferences along different meta-paths to generate the latent vectors of users and items [19]. In addition, Wang et al. design a knowledge-aware path recurrent network to capture the sequential dependencies of entities and relations on each path in the KG for detecting the user intent [18]. Building upon the embedding- and path-based methods, the GNN-based methods are proposed to simultaneously consider the semantic representation of items and the connectivity patterns between items in the KG. For example, Wang et al. propose to detect the user interest by propagating embeddings over the entities related to the user's interacted items in the knowledge graph [9]. Then, Wang et al. propose KGCN, which learns user-specific item embeddings by exploiting the high-order connectivity between entities in the KG [21], and Wang et al. further extend KGCN by adding regularization over the edge weights in the KG using label smoothness to alleviate the overfitting problem [22]. In addition, Wang et al. propose KGAT to integrate both the user-item interactions and the item knowledge into a hybrid collaborative knowledge graph for modeling the high-order relations among users, items and entities [10].
However, the existing knowledge-based recommenders fail to take both the interaction relation between items reflected in the user-item interactions and the connectivity pattern between items contained in the knowledge graph into consideration simultaneously, leading to unsatisfactory recommendation performance.

Self-Supervised Learning
Self-supervised learning (SSL) aims to train a network and learn data representations by automatically deriving supervision signals from the raw data. Existing methods can be categorized into generative methods [24][25][26] and contrastive methods [23][27][28][29]. The generative models aim to reconstruct the input data, where the popular methods include the variational autoencoder (VAE) [30,31], the generative adversarial network (GAN) [32], etc. In contrast, the contrastive models learn data representations by comparing a training sample from different views based on noise contrastive estimation (NCE) [33], such as from the global-local views [34] or the global-global views [27]. In this work, we adopt the contrastive method for deriving the self-supervision, which avoids introducing additional parameters.
Considering the effectiveness of SSL in various fields, SSL has recently also been introduced into RS for enhancing the recommendation [35][36][37][38][39][40]. For example, Yao et al. propose to apply SSL in the large-scale recommendation scenario to solve the data sparsity and long-tail problems [35]. Moreover, Wu et al. propose to apply SSL in graph-based collaborative filtering by comparing the user and item embeddings from different augmentation views to learn discriminative representations [36]. Furthermore, Yu et al. propose to adopt SSL to maximize the mutual information among the user preferences modeled by different channels to improve social recommendation [37]. In addition, Xia et al. design a dual channel hypergraph convolutional network for session-based recommendation, where SSL is utilized to enhance the hypergraph modeling by contrasting the session representations generated from the local graph and the global graph [38].
However, to the best of our knowledge, SSL has not been applied to the knowledge-enhanced recommendation task for improving the recommendation accuracy. Thus, in this paper, we introduce SSL to derive additional supervision by comparing the user preferences generated from the interaction graph and the knowledge graph for enhancing the representation of users and items.

Approach
In this section, we first formulate the definition of the knowledge-enhanced recommendation task. Then, we describe our proposed CKER method in detail, which mainly consists of five components, namely the CGCN module, the user preference generation, the main supervised learning module, the self-supervised learning module and the multi-task learning module.
The framework of CKER is shown in Figure 1. Given the user-item interactions, we first construct an interaction graph to establish the interaction relation between items. Then, for each item, we design a CGCN to update the item representations by propagating information from its neighbors in the interaction graph and the knowledge graph, respectively. After that, the user preference used for making predictions is generated by combining the interaction- and knowledge-aware preferences, which are modeled from the learned item representations. Next, we apply the Bayesian personalized ranking (BPR) loss as the main supervised loss to utilize the supervision in the user-item interactions by reconstructing the interaction matrix. In addition, we introduce self-supervised learning, which adopts InfoNCE to derive self-supervision between the generated interaction- and knowledge-aware preferences of each user. Finally, multi-task learning is conducted by combining the supervised and self-supervised losses for model optimization.
Let $U$ and $I$ be the user set and item set, respectively. The interactions between users and items are denoted as $O^+$, where each user-item pair $(u, i) \in O^+$ indicates that user $u$ interacted with item $i$. Moreover, the knowledge graph is $G_K = \{(h, r, t) \mid h, t \in V, r \in R\}$, where $V$ is a set of real-world entities and $R$ is the set of relations between entities. For example, (Braveheart, director, Mel Gibson) denotes that the movie Braveheart is directed by Mel Gibson, where Braveheart and Mel Gibson are the entities and director is the relation between them. In addition, the items in the interactions are a subset of the entities in the KG, i.e., $I \subset V$.
The aim of knowledge-enhanced recommendation is to learn the representations of users and items from the interaction data $O^+$ and the knowledge graph $G_K$, so as to predict how likely each user is to adopt the candidate items; the items ranked at the top-K positions then constitute the recommendation list for the user.
The main notations used in this paper are listed in Table 1.

Table 1. Main notations used in this paper.

Notation Description
In order to simultaneously exploit the interaction relation between items reflected in the user-item interactions and the item knowledge introduced by the knowledge graph, we propose the CGCN to learn the item representations, which consists of two channels, i.e., the interaction graph propagation and the knowledge graph propagation.

Interaction Graph Propagation
First, we take the interaction relation between items reflected in the user-item interactions into consideration. Specifically, given the interactions between users and items $O^+$, we first construct an interaction graph (IG) $G_I = \{I, E_I\}$, where the node set $I$ contains all items and $E_I$ is the edge set. Each edge $(i, i_\kappa) \in E_I$ indicates that item $i$ and item $i_\kappa$ were interacted with by the same user. Moreover, we apply max sampling according to the edge weights to select the $M_I$ most related items as the final neighbors of each item, so as to filter out the noise introduced by users' uncertain behavior patterns.
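The construction above can be illustrated with a minimal Python sketch. The text does not define the edge weights, so this sketch assumes the weight of an edge is the number of users who interacted with both items; the function name `build_interaction_graph` and the tie-breaking rule are hypothetical choices made for illustration only.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_interaction_graph(interactions, max_neighbors):
    """Return {item: [neighbor, ...]} with at most `max_neighbors` neighbors per item.

    `interactions` is an iterable of (user, item) pairs. The edge weight is
    ASSUMED to be the number of users who interacted with both items.
    """
    items_by_user = defaultdict(set)
    for user, item in interactions:
        items_by_user[user].add(item)

    # Count, for every item pair, how many users interacted with both items.
    weights = defaultdict(Counter)
    for items in items_by_user.values():
        for a, b in combinations(sorted(items), 2):
            weights[a][b] += 1
            weights[b][a] += 1

    # Max sampling: keep the heaviest edges (ties broken by item id).
    graph = {}
    for item, counter in weights.items():
        ranked = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
        graph[item] = [neighbor for neighbor, _ in ranked[:max_neighbors]]
    return graph
```

With `max_neighbors` playing the role of the neighbor budget, rarely co-interacted items are dropped from each neighborhood, which is one plausible reading of the noise filtering described above.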
After constructing the interaction graph, we propagate information from the neighbors of each item in $G_I$ to exploit the interaction relation between items for updating the item representations. More specifically, we adopt the light graph convolution (LGC) proposed in [41] to conduct the information propagation. Unlike [41], we adopt left normalization for its simplicity and low computation cost; the comparison of different normalization methods in CKER is left to future work. Specifically, the $l$-th layer graph convolution for item $i$ can be formalized as follows:

$$\hat{e}_i^{l} = \frac{1}{|N_i^I|} \sum_{i_\kappa \in N_i^I} \hat{e}_{i_\kappa}^{l-1} \tag{1}$$

where $\hat{e}_i^{l} \in \mathbb{R}^d$ is the propagated information for item $i$ at the $l$-th interaction graph convolution layer, $N_i^I$ is the set of neighbors of item $i$ in the interaction graph and $|N_i^I|$ is the number of neighbors, and $\hat{e}_{i_\kappa}^{l-1} \in \mathbb{R}^d$ is the representation of neighbor $i_\kappa \in N_i^I$ at the $(l-1)$-th convolution layer.
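As a rough illustration of left-normalized propagation, the following pure-Python sketch averages the neighbor vectors of each item. It is a toy version under stated assumptions, not the authors' implementation: the name `propagate_interaction` is hypothetical, and returning a zero vector for items without neighbors is an assumption.

```python
def propagate_interaction(embeddings, neighbors):
    """One left-normalized light graph convolution layer (mean of neighbor vectors).

    `embeddings` maps item -> list of floats, `neighbors` maps item -> list of items.
    """
    dim = len(next(iter(embeddings.values())))
    propagated = {}
    for item in embeddings:
        nbrs = neighbors.get(item, [])
        if not nbrs:
            # ASSUMPTION: an isolated item receives no message.
            propagated[item] = [0.0] * dim
            continue
        propagated[item] = [
            sum(embeddings[n][k] for n in nbrs) / len(nbrs) for k in range(dim)
        ]
    return propagated
```

There are no weight matrices or nonlinearities in this layer, which matches the light convolution design: the only operation is a normalized sum over neighbors.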

Knowledge Graph Propagation
Besides propagating information on the interaction graph, we utilize the knowledge graph convolution to exploit the connectivity between entities in the knowledge graph. More specifically, each triplet $(i, r, v) \in G_K$ denotes that entity $i$ and entity $v$ are connected by the relation $r$. Moreover, similar to the IG, we apply max sampling according to the edge weights to select the $M_K$ most related entities as the final neighbors of each entity to avoid introducing bias. In the KG, each tail entity has different semantics when paired with different relations; for example, the entity Mel Gibson plays the role of director and star in the two triplets (Braveheart, director, Mel Gibson) and (Braveheart, star, Mel Gibson), respectively. Thus, we obtain the neighbor information for each entity by aggregating its corresponding relation-tail pairs in the KG as follows:

$$\tilde{e}_i^{l} = \frac{1}{|N_i^K|} \sum_{(r, v) \in N_i^K} e_r \odot \tilde{e}_v^{l-1} \tag{2}$$

where $\tilde{e}_i^{l} \in \mathbb{R}^d$ is the propagated information for entity $i$ at the $l$-th knowledge graph convolution layer, $N_i^K$ is the set of neighbors of entity $i$ in the knowledge graph, $e_r$ is the relation vector generated by the embedding layer preceding the graph propagation architectures, $\tilde{e}_v^{l-1}$ is the representation of entity $v$ at the $(l-1)$-th convolution layer, and $\odot$ denotes the Hadamard product used for combination. For each triplet $(i, r, v)$, we propagate the information from the tail $v$ to the head $i$ under the relation $r$ by multiplying the latent vectors of the relation and the tail element-wise, so that the relational message is carried together with the tail in the information propagation.
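The relation-aware message passing described above can be sketched as follows. This is an illustrative toy in pure Python: the function name `propagate_knowledge` is hypothetical, and averaging the messages over the relation-tail pairs is an assumption made to mirror the interaction channel.

```python
def propagate_knowledge(entity_emb, relation_emb, kg_neighbors):
    """One knowledge graph convolution layer.

    `kg_neighbors[head]` is a list of (relation, tail) pairs. Each message is the
    element-wise (Hadamard) product of the relation and tail vectors.
    ASSUMPTION: messages are averaged over the pairs.
    """
    dim = len(next(iter(entity_emb.values())))
    out = {}
    for head in entity_emb:
        pairs = kg_neighbors.get(head, [])
        agg = [0.0] * dim
        for relation, tail in pairs:
            r_vec, t_vec = relation_emb[relation], entity_emb[tail]
            for k in range(dim):
                agg[k] += r_vec[k] * t_vec[k]
        out[head] = [x / len(pairs) for x in agg] if pairs else agg
    return out
```

Because the relation vector gates the tail vector per dimension, the same tail entity (for example, Mel Gibson) contributes a different message under the director relation than under the star relation.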

Multi-Layer Graph Convolutions
At the $l$-th CGCN layer, after propagating information on the interaction graph and the knowledge graph to generate the respective item representations, i.e., $\hat{e}_i^{l}$ and $\tilde{e}_i^{l}$, we combine them by sum pooling to obtain the final latent vector of item $i$:

$$e_i^{l} = \hat{e}_i^{l} + \tilde{e}_i^{l} \tag{3}$$

where $e_i^{l} \in \mathbb{R}^d$ is the representation of item $i$ generated by the $l$-th CGCN layer. Moreover, multiple CGCN layers can be stacked to exploit multi-hop connections between items in the IG and high-order connectivity between entities in the KG. Specifically, for item $i$, the $l$-th CGCN layer can be formalized as follows:

$$e_i^{l} = \frac{1}{|N_i^I|} \sum_{i_\kappa \in N_i^I} e_{i_\kappa}^{l-1} + \frac{1}{|N_i^K|} \sum_{(r, v) \in N_i^K} e_r \odot e_v^{l-1} \tag{4}$$

where $e_i^{l}$ and $e_i^{l-1}$ are the respective representations of item $i$ at the $l$-th and $(l-1)$-th CGCN layers, and $N_i^I$ and $N_i^K$ are the neighbors of item $i$ in the IG and the KG, respectively. After generating the representations of items at different CGCN layers, we obtain the final item representations by summing them together, which can be formalized as follows:

$$e_i^{*} = \sum_{l=0}^{L} e_i^{l} \tag{5}$$

where $e_i^{*} \in \mathbb{R}^d$ is the final representation of item $i$, $L$ is the number of CGCN layers, and $e_i^{0} \in \mathbb{R}^d$ is initialized by the embedding layer.
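A minimal sketch of the layer stacking follows, assuming the two propagation channels are available as callables. The names `cgcn`, `interaction_layer` and `knowledge_layer` are placeholders introduced here for illustration; the sketch only demonstrates the sum pooling of the two channels and the summation over layers, including the embedding-layer (layer 0) term.

```python
def add(u, v):
    """Element-wise sum of two equally sized vectors."""
    return [a + b for a, b in zip(u, v)]


def cgcn(initial, interaction_layer, knowledge_layer, num_layers):
    """Stack `num_layers` two-channel layers and sum the outputs of all layers.

    `interaction_layer` and `knowledge_layer` are callables mapping
    {item: vector} -> {item: vector}; both are hypothetical stand-ins for the
    two propagation channels.
    """
    current = initial
    final = {item: list(vec) for item, vec in initial.items()}  # layer-0 term
    for _ in range(num_layers):
        ig, kg = interaction_layer(current), knowledge_layer(current)
        current = {item: add(ig[item], kg[item]) for item in current}  # sum pooling
        final = {item: add(final[item], current[item]) for item in current}
    return final
```

Summing over layers lets the final vector mix information from different hop distances instead of relying only on the deepest layer.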

User Preference Generation
After learning the item representations by multi-layer CGCNs, we generate the user preference from the latent vectors of the user's interacted items. More specifically, we construct a user-item bipartite graph $G_U = \{U \cup I, E_U\}$ from the user-item interactions, where the node set $U \cup I$ includes all users and all items, and each edge $(u, i) \in E_U$ indicates that user $u$ has interacted with item $i$.
Given the generated item representations at the $l$-th interaction graph convolution layer, we obtain the user preference reflected in the interaction graph as follows:

$$\hat{e}_u^{l} = \frac{1}{|N_u^U|} \sum_{i \in N_u^U} \hat{e}_i^{l-1} \tag{6}$$

where $\hat{e}_u^{l} \in \mathbb{R}^d$ denotes the user preference aggregated from the representations of the user's interacted items at the $(l-1)$-th interaction graph convolution layer, and $N_u^U$ is the set of items interacted with by user $u$.
Then, similar to Equation (5), we generate the interaction-aware user preference by summing the preferences generated from different interaction graph convolution layers:

$$\hat{e}_u^{*} = \sum_{l=0}^{L} \hat{e}_u^{l} \tag{7}$$

where $\hat{e}_u^{*} \in \mathbb{R}^d$ is the interaction-aware preference of user $u$ generated by combining the user interest at different interaction graph convolution layers, and $\hat{e}_u^{0} \in \mathbb{R}^d$ is initialized by the embedding layer before the graph convolutions.
Similarly, we can also obtain the knowledge-aware user preference by aggregating the representations of items generated by multi-layer knowledge graph convolutions, which can be formalized as follows:

$$\tilde{e}_u^{l} = \frac{1}{|N_u^U|} \sum_{i \in N_u^U} \tilde{e}_i^{l-1} \tag{8}$$

$$\tilde{e}_u^{*} = \sum_{l=0}^{L} \tilde{e}_u^{l} \tag{9}$$

where $\tilde{e}_u^{*}$ is the knowledge-aware preference of user $u$ obtained by summing the preferences at different knowledge graph convolution layers, and $\tilde{e}_u^{0}$ is generated by the embedding layer before the graph convolutions.
Next, we can obtain the final user preference $e_u^{*}$ by combining the interaction- and knowledge-aware preferences together:

$$e_u^{*} = \hat{e}_u^{*} + \tilde{e}_u^{*} \tag{10}$$
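The user preference generation can be sketched in pure Python as below. Several details are assumptions rather than statements from the text: the per-layer aggregation is taken to be a mean over the user's interacted items, the two channel preferences are combined by element-wise sum, and the names `channel_preference` and `combine` are hypothetical.

```python
def channel_preference(items, item_layers, user_init):
    """Sum a user's layer-wise preferences within one channel.

    `item_layers[l][item]` holds the item vector produced by layer l of that
    channel (index 0 is the embedding-layer output), so the user preference at
    layer l + 1 is aggregated from `item_layers[l]`.
    ASSUMPTION: aggregation is a mean over the interacted `items`.
    """
    total = list(user_init)  # layer-0 term from the embedding layer
    for layer in item_layers:
        for k in range(len(total)):
            total[k] += sum(layer[i][k] for i in items) / len(items)
    return total


def combine(interaction_pref, knowledge_pref):
    """ASSUMPTION: the final preference is the element-wise sum of both channels."""
    return [a + b for a, b in zip(interaction_pref, knowledge_pref)]
```

The same `channel_preference` function serves both channels; only the item vectors passed in differ (interaction graph outputs versus knowledge graph outputs).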

Supervised Learning
After obtaining the item representations and generating the user preference, following [10], we adopt the inner product to make predictions as follows:

$$\hat{y}_{ui} = {e_u^{*}}^{\top} e_i^{*} \tag{11}$$

where $\hat{y}_{ui}$ is the predicted score measuring how likely user $u$ is to adopt item $i$. Then, to learn the trainable parameters in CKER (i.e., the ID embeddings of users and items), following [10], the Bayesian personalized ranking (BPR) loss is adopted as the optimization objective to utilize the supervision signals in the user-item interactions. Specifically, the BPR loss encourages the target items to be ranked at the top positions by enlarging the gap between the prediction scores of the ground truth and the negative sample as follows:

$$L_{main} = \sum_{(u, i, j) \in O} -\ln \sigma\left(\hat{y}_{ui} - \hat{y}_{uj}\right) \tag{12}$$

where $L_{main}$ is the main supervised loss, $O = \{(u, i, j) \mid (u, i) \in O^+, (u, j) \in O^-\}$ indicates the pairwise training data, $\sigma$ denotes the sigmoid function, and item $i$ is an item the user interacted with, while $j$ is an item randomly sampled from the unobserved interactions, i.e., $(u, j) \in O^- = (U \times I) \setminus O^+$.
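For concreteness, the inner-product scoring and the pairwise BPR objective can be sketched as follows. This is a toy version in pure Python; the function name `bpr_loss` is hypothetical, and regularization terms are deliberately left out.

```python
import math


def bpr_loss(user_vecs, item_vecs, triples):
    """Sum of -ln(sigmoid(score(u, i) - score(u, j))) over (u, i, j) triples."""

    def score(u, i):
        # Inner product between the user preference and the item representation.
        return sum(a * b for a, b in zip(user_vecs[u], item_vecs[i]))

    loss = 0.0
    for u, pos, neg in triples:
        z = score(u, neg) - score(u, pos)
        # -ln(sigmoid(-z)) == softplus(z); this form avoids overflow for large |z|.
        loss += max(z, 0.0) + math.log1p(math.exp(-abs(z)))
    return loss
```

The loss shrinks as the interacted item's score exceeds the sampled negative's score by a wider margin, which is the ranking behavior the objective is meant to encourage.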

Self-Supervised Learning
In order to maximize the mutual information between the user preferences obtained from the interaction graph and the knowledge graph, we introduce self-supervised learning to derive self-supervision signals using InfoNCE [23] for enhancing the representation of users and items. Specifically, for each user $u$, we obtain the interaction-aware and knowledge-aware preferences reflected in the IG and the KG as $\hat{e}_u^{*}$ and $\tilde{e}_u^{*}$, respectively. Assuming that the current mini-batch consists of $N$ users, the knowledge-aware preference of user $u$ should be more similar to the interaction-aware preference of user $u$ than to that of the other $N - 1$ users in the mini-batch. Based on this intuition, we adopt InfoNCE [23], which regards the pair of knowledge- and interaction-aware preferences of user $u$ (i.e., $\tilde{e}_u^{*}$ and $\hat{e}_u^{*}$) as the positive pair, and treats the pairs combining user $u$'s knowledge-aware preference with the interaction-aware preferences of the other users in the mini-batch (i.e., $\{(\tilde{e}_u^{*}, \hat{e}_{u_\kappa}^{*}) \mid \kappa = 1, \ldots, N,\ u_\kappa \neq u\}$) as the negative samples. We formalize the InfoNCE as follows:

$$L_{ssl} = -\sum_{u \in B} \log \frac{\exp\left(\mathrm{sim}(\tilde{e}_u^{*}, \hat{e}_u^{*}) / \lambda\right)}{\sum_{\kappa=1}^{N} \exp\left(\mathrm{sim}(\tilde{e}_u^{*}, \hat{e}_{u_\kappa}^{*}) / \lambda\right)} \tag{13}$$

where $L_{ssl}$ is the self-supervised loss, $B$ is the current mini-batch, $\mathrm{sim}(u, v) = \cos(u, v) = u^{\top} v / (\|u\| \|v\|)$ is the cosine similarity between two vectors $u$ and $v$, where $\| \cdot \|$ denotes the L2 norm, and $\lambda$ is a hyper-parameter for scaling the similarity. By introducing the additional supervision using InfoNCE, we can encourage the interaction-aware preferences of different users to be uniformly distributed in the latent space, so as to learn discriminative embeddings of users and items for better distinguishing them when making predictions.
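The contrastive objective described above can be illustrated with a small pure-Python sketch. Several choices here are assumptions, since the text does not pin them down: the similarity is divided by the scaling hyper-parameter (called `temperature` in the code), the denominator includes the positive pair, and the loss is summed over the users in the batch. The names `cosine` and `info_nce` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def info_nce(knowledge_prefs, interaction_prefs, temperature):
    """InfoNCE over a mini-batch; index n in both lists refers to the same user.

    ASSUMPTIONS: similarities are divided by `temperature`, and the positive
    pair is part of the denominator.
    """
    loss = 0.0
    for n, anchor in enumerate(knowledge_prefs):
        logits = [cosine(anchor, other) / temperature for other in interaction_prefs]
        log_denominator = math.log(sum(math.exp(x) for x in logits))
        loss += log_denominator - logits[n]  # -log softmax of the positive pair
    return loss
```

When each user's two preferences agree with each other more than with those of other users, the loss is small; pairing them with the wrong users makes it noticeably larger.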

Multi-Task Learning
After obtaining the main supervised loss $L_{main}$ by Equation (12) and the self-supervised loss $L_{ssl}$ by Equation (13), we conduct multi-task learning by combining them:

$$L = L_{main} + \alpha L_{ssl} \tag{14}$$

where $\alpha$ is a hyper-parameter that adjusts the weight between the two losses. A small $L_{main}$ indicates that the target items are ranked at top positions, and a small $L_{ssl}$ denotes that the items are well distributed in the embedding space by the self-supervised learning. Finally, the back-propagation through time (BPTT) algorithm [42] is applied to optimize CKER for learning the trainable parameters.
We detail the learning procedure of CKER in Algorithm 1. Given the observed user-item interactions $O^+$, we first construct the interaction graph and the user-item bipartite graph in lines 1 and 2, respectively. Then, for each training mini-batch $B \in X$, where $X$ denotes all mini-batches, we first conduct the information propagation by multi-layer CGCNs, which consist of the interaction and knowledge graph propagations from line 5 to 9; the results are then fused to generate the final item representations in line 10. Next, for each user-item pair, we obtain the user preference from line 12 to 18, including generating the interaction- and knowledge-aware preferences. After that, we sample the negative item in line 19 and look up the representations of items in line 20, which are fed into the prediction function together with the user preference to obtain the prediction scores in line 21. Then, we generate the main supervised loss in line 22 and the self-supervised loss in line 23, respectively. Finally, we obtain the multi-task training loss in line 25 and apply back-propagation to optimize the model in line 26.

Research Questions
We prove the effectiveness of CKER by addressing the following five research questions: What is the impact of the hyper-parameters α and λ on the performance of CKER?

Datasets and Evaluation Metrics
Two publicly available datasets, namely Amazon-Book and Last-FM, are adopted to evaluate the performance of CKER and the baselines. Amazon-Book is selected from Amazon-review, which is a widely used dataset for product recommendation. Last-FM is collected from the Last.fm online music system, where the tracks are regarded as the items, and we take the subset of the music listening records from January 2015 to June 2015 for experiments as in [10]. Moreover, for both Amazon-Book and Last-FM, the 10-core setting [43] is applied to ensure the data quality, where users with fewer than 10 interactions and items appearing fewer than 10 times are filtered out. In addition, following [10], for each dataset, we randomly select 80% of the historical interactions as the training set, and the remaining part constitutes the test set.
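The filtering step can be sketched as below. Whether the filtering is applied once or repeated until no user or item falls below the threshold is not stated in the text; this sketch assumes the iterative variant, and the function name `k_core` is hypothetical.

```python
from collections import Counter


def k_core(interactions, k):
    """Repeatedly drop users and items with fewer than k interactions.

    ASSUMPTION: filtering is repeated until the set of (user, item) pairs is stable.
    """
    pairs = set(interactions)
    while True:
        user_counts = Counter(u for u, _ in pairs)
        item_counts = Counter(i for _, i in pairs)
        kept = {
            (u, i) for u, i in pairs
            if user_counts[u] >= k and item_counts[i] >= k
        }
        if kept == pairs:
            return kept
        pairs = kept
```

Iterating matters because removing a sparse item can push a user below the threshold, which in turn can affect other items.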
Moreover, the item knowledge is constructed for each dataset. Specifically, items are mapped to the Freebase [44,45] entities via title matching if there is mapping available, where we consider the triplets that are directly related to the entities aligned with items, no matter which role (i.e., subject or object) it serves. Moreover, two-hop neighbor entities of items are taken into consideration to enrich the relations between entities. Here, introducing small hops of neighbors can merely incorporate limited connectivities between items in the KG into the user and item representation learning, while introducing large hops of neighbors easily brings in bias. In this paper, we follow the setting in [10], which takes two-hop neighbor entities into consideration. We would like to leave the investigation on the impact of the hop number of neighbors to our future work. In addition, in order to ensure the KG quality, inactive entities (i.e., appearing less than 10 times) and infrequent relations (i.e., appearing in less than 50 triplets) are filtered out in the KG data for both Amazon-Book and Last-FM. The statistics of Amazon-Book and Last-FM after processing are shown in Table 2. Following previous work [10], Recall@K and NDCG@K are adopted as the evaluation metrics to evaluate the recommendation performance. Recall@K measures whether the target items are contained in the top-K positions in the recommendation list, while NDCG@K takes the ranking of the target items into consideration, i.e., whether the recommender ranks the target items at right positions. Unless specified differently, K is set to 20 in our experiments.
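The two metrics can be illustrated with a short sketch. It assumes binary relevance and the common definitions in which Recall@K is the fraction of a user's test items that appear in the top-K list and NDCG@K discounts each hit by the logarithm of its rank; the function names are hypothetical, and the exact variant used in the experiments is not specified in the text.

```python
import math


def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant items that appear in the top-k of `ranked`."""
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG: DCG of the top-k list divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(pos + 2)
        for pos, item in enumerate(ranked[:k])
        if item in relevant
    )
    ideal = sum(1.0 / math.log2(pos + 2) for pos in range(min(k, len(relevant))))
    return dcg / ideal
```

Two lists that contain the same hits get the same Recall@K, but the list that places the hits earlier gets the higher NDCG@K.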
• MF [46] does not take the KG into consideration and learns the user and item representations by reconstructing the interaction matrix using matrix factorization, where users and items are simply represented by their ID embeddings;
• CKE [12] is an embedding-based method that learns the entity embeddings from the knowledge base by TransR [48], which are then combined with the ID embeddings of items generated by MF as the final item representations for item prediction;
• KGNN-LS [22] computes personalized item embeddings using GNNs on user-specific graphs transformed from the KG, and provides regularization over the edge weights using label smoothness to prevent overfitting;
• KGAT [10] combines the user-item graph and the KG into a holistic graph for modeling the collaborative information by exploiting the high-order connectivity among users, items and entities. Moreover, the importance of different neighbors is distinguished in an attentive way by learning discriminative representations of the interaction relationship and the KG relations;
• CKAN [47] explicitly encodes the collaborative signals in user-item interactions, which are then combined with the knowledge associations modeled by an attention mechanism to discriminate the contribution of different neighbors in the KG.

Experimental Setup
The hyper-parameters are tuned on the validation set, which is randomly separated from the interactions in the training set with a proportion of 10%. For a fair comparison, following [10], for both datasets, we set the embedding dimension and the batch size to 64 and 1024, respectively, and Adam [49] is adopted as the optimizer. Then, a grid search is conducted to find the optimal parameter settings on each dataset, covering the learning rate ρ and the L2 regularization η. In addition, the model parameters, i.e., the ID embeddings of users and items, are initialized with the Xavier [50] method. The best performing parameters on the two datasets are summarized in Table 3.

Overall Performance
The performance of our proposed CKER and the baselines is presented in Table 4. Here, the results of all baselines are directly taken from [10] since we adopt the same datasets and preprocessing method for the experiments. First, we can observe that the KG-free method MF performs worse than the other baselines in all cases on the two datasets, indicating the necessity of introducing the knowledge graph for enhancing recommendation. Moreover, compared to MF, CKE achieves slightly better performance, indicating the effectiveness of learning collaborative embeddings for items from the knowledge base.

Table 4. Model performance on Amazon-Book and Last-FM. The results of the best performing baseline and the best performer in each column are underlined and boldfaced, respectively. A significance marker denotes a significant improvement of CKER over the best baseline, using a paired t-test (p < 0.01).

Moreover, we can see that the GNN-based methods clearly outperform CKE and MF in terms of both Recall@20 and NDCG@20 on the two datasets, which indicates the utility of exploiting the high-order connectivity of items in the knowledge graph. Furthermore, by comparing CKAN to KGNN-LS, we can observe that CKAN generally performs better than KGNN-LS, except in terms of Recall@20 on Last-FM. A possible reason is that, besides modeling the item knowledge, CKAN further takes the collaborative signals in the user-item interactions into consideration for making recommendations. In addition, by exploring the multi-order connectivity among users, items and entities in an attentive way in the collaborative knowledge graph, KGAT performs best among the baselines in terms of both metrics on the two datasets.
Next, we move to the performance of our proposed CKER. First, it can be observed that CKER achieves the best performance in terms of both Recall@20 and NDCG@20 on two datasets. We attribute the improvements to the fact that (1) CKER can simultaneously exploit the interaction relation between items in the user-item interactions and the connectivity between entities in the knowledge graph; (2) CKER introduces the self-supervised learning to derive the self-supervision signals for enhancing the representation learning of users and items, which can help obviously distinguish different items on the basis of the main supervised learning. In addition, we can observe that CKER improves the performance above the best baseline KGAT by 8.88% and 8.01% in terms of Recall@20 and NDCG@20 on Amazon-Book, respectively, where the improvement rate is larger on Recall@20. However, the phenomenon is different on Last-FM, where a higher improvement rate is observed on the NDCG@20 metric (i.e., 11.83%) than that on Recall@20 (i.e., 8.93%). This could be due to the fact that the number of interactions and KG triplets on two datasets are different, which means that CKER contributes relatively more to hitting the target items in the recommendation list in scenarios where the item knowledge is more abundant than the interactions, while CKER improves the ability of ranking the target items at right positions more obviously in scenarios where the user-item interactions are relatively more sufficient.

Ablation Study
For RQ2, in order to validate the utility of each component in CKER, we conduct an ablation study by comparing CKER with the following variants:
• w/o IG removes the interaction graph propagation and learns the representation of users and items by exploiting only the knowledge graph;
• w/o KG removes the knowledge graph propagation and generates the user and item representations by merely exploiting the interaction graph;
• w/o SSL removes the self-supervisions between the interaction- and knowledge-aware user preferences and optimizes the model using only the BPR loss.
We present the results of CKER and its variants in Figure 2, where we evaluate the model performance on Amazon-Book by ranging the recommendation number from 10 to 50 to provide a comprehensive comparison. Similar phenomena can also be observed on the Last-FM dataset.
From Figure 2, we can observe that removing any component from CKER consistently decreases the recommendation accuracy. Moreover, w/o SSL achieves clearly better performance than w/o IG and w/o KG, indicating the effectiveness of our proposed CGCN, which simultaneously exploits the interaction relation between items in IG and the connectivity between entities in KG. Furthermore, compared with w/o IG, w/o KG achieves better performance in terms of both Recall@K and NDCG@K for the various values of K. This may be because, compared to the interaction graph, the knowledge graph contains noisy entities that are unrelated to personalized recommendation, which introduces bias in modeling the user preference. In addition, the performance gap between CKER and w/o SSL decreases as the recommendation number increases, especially on the Recall@K metric. This indicates that self-supervised learning contributes more clearly to the performance in short recommendation lists, which is practical in real-world applications since the interface for displaying the recommendation results may be limited, such as on mobile phones.

Impact of Layer Number
For RQ3, to investigate the impact of the number of graph convolution layers on the model performance, we compare CKER with its variants by ranging the layer number from 1 to 3. The results on Amazon-Book are provided in Figure 3; similar phenomena can also be observed on the Last-FM dataset.
From Figure 3, we can observe that for the variants w/o IG and w/o KG, the performance consistently increases as the layer number increases. This is because, when merely utilizing the KG propagation or the IG propagation, the model fails to provide sufficient connectivity between items for learning the representations. Moreover, the best performance of both w/o IG and w/o KG, achieved at layer number L = 3, is still worse than that of w/o SSL and CKER. However, unlike w/o IG and w/o KG, the performance of both w/o SSL and CKER decreases as the layer number increases. This may be because, by exploiting both the interaction and knowledge graphs, the recommender easily overfits due to the abundant connectivity between items. Moreover, the performance gap between w/o SSL and CKER decreases as the layer number increases. This could be explained by the fact that a larger number of graph convolutions leads to a more serious overfitting problem, which reduces the contribution of self-supervised learning to the performance.

Impact of Neighbor Number
For RQ4, in order to investigate the impact of the neighbor number in the interaction and knowledge graphs on the model performance, we apply a grid search by ranging the neighbor number in IG (denoted as M_I) and the neighbor number in KG (denoted as M_K) both in {2, 4, 8, 16, 32}. The results on Amazon-Book and Last-FM are presented in Tables 5 and 6, respectively.
From Table 5, we can observe that on Amazon-Book, the best performance in terms of both metrics is achieved at M_K = 32 and M_I = 2, where the neighbor number in the IG is clearly smaller than that in the KG. This may be because the IG is better able to exploit the connectivity between items than the KG, since entities unrelated to the recommendation task may be introduced in the KG, as stated in Section 5.2. Thus, a relatively smaller M_I is sufficient for achieving the best performance. Moreover, for the Last-FM dataset, the best performance of CKER in terms of both Recall@20 and NDCG@20 is achieved at M_K = 8 and M_I = 8. Compared to the results on Amazon-Book, the best-performing M_K decreases and the best-performing M_I increases. We attribute this difference to the fact that the numbers of interactions and triplets in the two datasets are different, as shown in Table 2. Since Last-FM contains fewer triplets than Amazon-Book, the M_K achieving the best performance on Last-FM is correspondingly smaller than that on Amazon-Book. Similarly, the larger number of interactions in Last-FM than in Amazon-Book explains the larger M_I achieving the best performance on the Last-FM dataset.
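The grid search over the two neighbor numbers can be sketched as follows. This is an illustrative sketch only: it assumes that a fixed number of neighbors is sampled per node (with replacement when a node has too few neighbors, and a self-loop fallback for isolated nodes), which is a common practice in GNN-based recommenders but is an assumption here rather than a description of the CKER implementation. The `evaluate` callback, standing in for training and validating a model with the given neighbor numbers, is hypothetical.

```python
import itertools
import random


def sample_fixed_neighbors(adjacency, node, size, rng):
    """Return exactly `size` neighbors of `node` from an adjacency dict."""
    neighbors = adjacency.get(node, [])
    if not neighbors:
        # Assumption: an isolated node falls back to a self-loop.
        return [node] * size
    if len(neighbors) >= size:
        return rng.sample(neighbors, size)  # without replacement
    return [rng.choice(neighbors) for _ in range(size)]  # with replacement


def grid_search(evaluate, values=(2, 4, 8, 16, 32)):
    """Try every (IG neighbor number, KG neighbor number) pair; keep the best score."""
    best = None
    for m_i, m_k in itertools.product(values, values):
        score = evaluate(m_i, m_k)
        if best is None or score > best[0]:
            best = (score, m_i, m_k)
    return best


# Toy usage with a hypothetical item-item interaction graph.
rng = random.Random(0)
interaction_graph = {"i1": ["i2", "i3", "i4"], "i5": []}
print(sample_fixed_neighbors(interaction_graph, "i1", 2, rng))
```

Fixing the sample size keeps the cost of each graph convolution bounded, so the 5 × 5 grid amounts to 25 independent training runs per dataset.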

Hyper-Parameter Analysis
For RQ5, to investigate the impact of the hyper-parameters α and λ on the performance of CKER, we perform a grid search by tuning α and λ in {0.005, 0.01, 0.05, 0.1, 0.5} and {6, 8, 10, 12, 14}, respectively. The results on Amazon-Book are presented in Table 7, and the results on the Last-FM dataset are shown in Table 8.
For the parameter α, we can see that for each λ, increasing α generally decreases the performance on Amazon-Book, while increasing the performance on Last-FM. A possible reason is that a larger α corresponds to a larger intensity of the self-supervisions, which distinguishes the items more clearly. As shown in Table 2, the item number in Last-FM is clearly larger than that in Amazon-Book; thus, a larger α is required on the Last-FM dataset to learn discriminative item representations. Moreover, as for the parameter λ, it is observed that for each α on both Amazon-Book and Last-FM, the performance of CKER generally first increases and then decreases as λ increases. This may be because the hyper-parameter λ controls the intensity of mining the hard negative samples [23,51], where a larger λ makes the learned item embeddings more uniformly distributed in the embedding space. Thus, with λ increasing from 6 to 14, CKER first learns more accurate representations of items, and then faces the overfitting problem, which degrades the recommendation accuracy.
Table 7. Impact of the hyper-parameters α and λ for Amazon-Book.
Table 8. Impact of the hyper-parameters α and λ for Last-FM.
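The roles of the two hyper-parameters can be illustrated with a small sketch. It assumes that α weights a contrastive self-supervised term against the BPR loss, and that λ acts as a scaling factor (an inverse temperature) on cosine similarities inside an InfoNCE-style loss with in-batch negatives. Both are assumptions made for illustration; the exact loss is defined in the method section and may differ. All function names are hypothetical, and the embeddings are assumed to be non-zero vectors.

```python
import math


def _softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def bpr_loss(pos_scores, neg_scores):
    """BPR loss: mean of -log(sigmoid(pos - neg)) over the sampled pairs."""
    losses = [_softplus(n - p) for p, n in zip(pos_scores, neg_scores)]
    return sum(losses) / len(losses)


def _normalize(vector):
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def contrastive_loss(view_a, view_b, lam):
    """InfoNCE-style loss between two views of the same users.

    Row u of view_a and row u of view_b form the positive pair; the other
    rows of view_b serve as negatives. `lam` scales the cosine similarities
    (assumed role of the hyper-parameter lambda).
    """
    a = [_normalize(v) for v in view_a]
    b = [_normalize(v) for v in view_b]
    total = 0.0
    for u, a_u in enumerate(a):
        logits = [lam * sum(x * y for x, y in zip(a_u, b_v)) for b_v in b]
        top = max(logits)  # subtract the maximum for a stable log-sum-exp
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_denominator - logits[u]
    return total / len(a)


def joint_loss(pos_scores, neg_scores, view_a, view_b, alpha, lam):
    """Main BPR objective plus the alpha-weighted self-supervised term."""
    return bpr_loss(pos_scores, neg_scores) + alpha * contrastive_loss(view_a, view_b, lam)
```

Under this assumed form, a larger scaling factor sharpens the softmax so that the negatives most similar to the anchor dominate the gradient, which is consistent with the interpretation of λ as controlling the mining of hard negatives.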

Conclusions and Future Work
In this paper, we propose a novel approach, i.e., the CKER method. First, CKER designs a CGCN to simultaneously exploit the interaction relation between items reflected in the user-item interactions and the item knowledge in the knowledge graph. Moreover, we apply self-supervised learning to derive additional supervision signals for distinguishing items by contrasting the user preferences generated from the interaction and knowledge graphs. Extensive experiments conducted on two benchmark datasets, namely Amazon-Book and Last-FM, validate that CKER can outperform the state-of-the-art baselines on the knowledge-enhanced recommendation task, achieving improvements of 8.88-8.93% in terms of Recall@20 and 8.01-11.83% in terms of NDCG@20. However, in scenarios where the knowledge graph is unavailable or hard to construct, the advantages of our proposed CKER method may not be noticeable, leading to unsatisfactory performance.
For future work, we would like to incorporate various sources of side information, such as social networks, as knowledge for enhancing the recommendation. Moreover, we are also interested in improving the applicability of KG for recommendation by automatically filtering the connectivities between entities which are unrelated to the item recommendation. In addition, we also plan to adopt more datasets to investigate the scalability of our proposal in various application scenarios.