MIGAN: Mutual-Interaction Graph Attention Network for Collaborative Filtering

Many web platforms now include recommender systems. Network representation learning has been a successful approach for building these efficient recommender systems. However, learning the mutual influence of nodes in the network is challenging. Indeed, it carries collaborative signals accounting for complex user-item interactions on user decisions. For this purpose, in this paper, we develop a Mutual Interaction Graph Attention Network “MIGAN”, a new algorithm based on self-supervised representation learning on a large-scale bipartite graph (BGNN). Experimental investigation with real-world data demonstrates that MIGAN compares favorably with the baselines in terms of prediction accuracy and recommendation efficiency.


Introduction
In the literature, there are numerous techniques for building recommendation systems. They can be classified either as collaborative, content-based, or hybrid filtering approaches [1][2][3][4]. Collaborative filtering is the most influential. It relies on identifying users with similar tastes for item recommendations. It leverages their feedback to make suggestions to the active user. Collaborative recommender systems have been implemented in multiple application areas [5][6][7].
Learning effective user/item representations from their interactions and side information in recommender systems is a challenging issue. Since most data have a graph structure and the graph neural network (GNN) has superiority in representation learning, using GNN in recommender systems is a flourishing field of research. Several works are based on GNN to perform recommendations, and other tasks, such as the graph convolutional network (GCN) [8] and graph attention network (GAT) [9]. GAT computes the representation of nodes by adaptively combining their neighborhoods' vectors using a self-attention mechanism with trainable attention weights. Wang et al. [10] suggest a knowledge graph attention network (KGAT) for KG-based recommendations. To enrich the representation, the authors consider the implicit collaborative information of multi-hop neighbors. In the work [11], the authors propose a neighbor-aware graph attention network for recommendation tasks to model the implicit correlations of neighbors. Unlike previous attention networks, our current work determines the most relevant weights characterizing the mutual influence among item-users. In addition to learning the interaction between user interests and item embeddings, it integrates a new component accounting for the mutual influence of items carrying collaborative signals on user decisions. MIGAN learns deeply the most relevant weights representing the users mutual influence on an item. It exploits the complex relation between the user profile and the item attributes. Its main advantage lies in its ability to discriminate the relative influence of various interactions between nodes. Our main contributions summarize as follows: • Our approach is based on self-supervised representation learning on a large-scale bipartite graph (BGNN). We have adapted this powerful representation for the recommendation task because its ability to model the dependencies between the nodes on a large scale. • The collaborative filtering recommender based on interactive neural attention networks takes advantage of the encoding potential of interactive attention between users and items. It learns the most significant weights representing users' mutual effect on the item. Consequently, exploiting this information improves the recommender systems' accuracy. • The empirical evaluation, including various real-world dataset, shows that MIGAN significantly outperforms the state-of-the-art baselines.
The rest of the paper is organized as follows. Section 2 describes the related literature. Section 3 presents in detail the proposed architecture. Section 4 discusses the experimental results. Section 5 summarizes the conclusions.

Related Work
Graph embedding models are one of machine learning's newest and fastest-growing subfields. Its strength is in its ability to take advantage of the intrinsic graph structure of a wide range of data types encountered in a wide range of applications. The graph format models a set of elements (represented by nodes) and their relationships (represented by edge) to capture structural information. Therefore, there have been proposed various graph embedding models in literature [12,13]. Node2Vec is an embedding model for converting graphs into numerical representations where each node in the network is used as a starting point to produce a corpus of random walks [14]. In a first-order random walk, each step is solely determined by the current state. The steps in a second-order random walk are determined by the current and prior states. The random walk corpus is fed through Word2Vec to build the node embeddings. Liang et al. [15] extend the variational autoencoders (VAEs) to collaborative filtering for implicit feedback. It models the collaborative information into multinomial likelihood (MultiVAE) for the data distribution to sample prediction for items on the long tail. Drif et al. [16] develop an ensemble variational autoencoder framework for recommendations (EnsVAE) that specifies a procedure to transform subrecommenders' predicted utility matrix into interest probabilities that allow the VAE to represent the variation in their aggregation. This architecture is based on two components: (1) GloVe content-based filtering recommender (GloVe-CBF) that exploits the strengths of embedding-based representations and stacking ensemble learning techniques to extract features from the item-based side information, and (2) a variant of neural collaborative filtering recommender, named the Gate Recurrent Unit-based Matrix Factorization (GRU-MF) recommender. It models a high level of non-linearities and exhibits interactions between users and items in latent embeddings, reducing user biases towards items that are rated frequently by users. Weng Lo et al. [17] suggested the graph flow data's extensive structural information based on a graph neural networks (GNNs). The latter works with the message passing concept, where a node collects its neighbors features and sends them to a node as a message. In recent years, the graph attention network approach has developed rapidly. Wang et al. [18] introduced a multi-dimension interaction-based attentional knowledge graph neural network (MI-KGNN) to improve recommendations based on knowledge graph (KG). MI-KGNN explores the interaction between users and the neighborhood during embedding propagation. In [10], the authors proposed a knowledge graph attention network (KGAT) modeling the high-order connectivity in knowledge graphs. It exploits the attention mechanism to determine important neighbors. In [19], the authors propose a multi-view graph attention network (MV-GAN) based on the heterogeneous information networks for the recommendation. They create attention networks at the node and path levels to learn user and product representations from every view. A view-level attention mechanism is developed to integrate various relationship types in multiple views cooperatively. Liu et al. [20] propose a contextualized graph attention network (CGAT) based on an entity's local and non-local context data in a knowledge graph. CGAT implements a graph attention method to record local context information while considering users' unique preferences for entities. The non-local context of an entity is also extracted using a biased random walk sampling method by CGAT. In fact, propagating information from nodes across the network is done during numerous iterations. The aggregated information at each node (node embedding) is a memory-and time-consuming task. Due to this fact, many GNNs for recommendation suffer from scalability limitations, unpredictable memory, and computational resource requirements on large graphs. To overcome these drawbacks, our work considers node representation learning on large-scale bipartite graphs.

Mutual-Interaction Graph Attention Network Approach
The proposed architecture considers the mutual collaborative information of the user's preference on the whole neighboring item to enrich the representation. It is based on selfsupervised representation learning on a large-scale bipartite graph (BGNN) [21]. Figure 1 illustrates the model architecture in detail. We first formulate the recommendation task on the bipartite graph. Secondly, we introduce the embedding representation based on BGNN and then describe the mutual interaction graph attention mechanism.

Problem Formulation
Let U = u 1 , u 2 , . . . , u n and I = i 1 , i 2 , . . . , i m be the sets of users and items, respectively, where n is the number of users, and m is the number of items. We assume that R n×m is the user-item rating matrix.
We formulate the recommendation task as a prediction problem as follows: R: utility matrix;r u : predicted rating for each user u ∈ U; U = u 1 , u 2 , . . . , u n : is the sets of users, where n is the number of users; I = i 1 , i 2 , . . . , i m is the set of the items , where m is the number of items; R ui : is the ground truth rating assigned by the user u on the item i.
We define the utility matrix as: where: K is latent space's dimension. S (n×m) : is the inner product of both user and item latent vectors. It is decomposed by the matrix factorization method into P ∈ R N×K and Q ∈ R M×K . To normalizeR, we apply the min/max scaling: where: min = min (r ui ) is the minimal rating; max = max (r ui ): is the maximal rating. The normalization eliminates user bias. In other words, users have different ways to rate the items. Some would only give high ratings to items they like, while others do the opposite. Normalizing users' ratings hide their bias by mapping them to values between 0 and 1. The lowest rating of each user is associated with 0, while 1 represents their highest value. It assists in leveraging more accurate collaborative-based recommendations. Table 1 reports the notations used in the rest of this paper. A bipartite graph is composed of two independent sets of vertices, U 1 and I 1 . The edges connect a vertex from one set U 1 to one in I 1 .
We define Bipartite Graphs as follows: Let G = (U 1 , I 1 , E) be a bipartite graph. e ij represents the edge between u i and i j .
X u ∈ R M×P : is defined as a feature matrix of node u i (X i is similarly written). Our work is based on the self-supervised node representation learning model [21] that can employ topology information as well as separate node attributes from two domains to increase the recommendation performance for large graph.

Embedding Representation Based on Bipartite Graph Neural Networks (BGNN)
He et al. [21] proposed a a self-supervised representation learning framework for largescale bipartite graphs. In this section, we adapt the outputs of this BGNN architecture, thus, we can deploy it in our recommendation system. Let us define the following notions: H u ∈ R P ( H i ∈ R Q , respectively): is nodes embedding representation for U 1 (I 1 , respectively). f emb : is an embedding model given by θ parameters.
The embedding of distinct node features X u and X i is written as: The architecture of f emb is based on two functions: (i) Inter-Domain Message Passing (IDMP) and (ii) Intra-Domain Alignment (IDA). We will describe briefly these functions and show how we prepare the outputs for the recommendation task (interested reader can refer to [21] for more details). IDMP enables one domain to aggregate information from the other domain, through the linked edges, as follows: such as: f u (resp. f i ): represents the IDMP function for this domain.
H i→u (resp. H u→i ): is the flow of aggregated information from I 1 (resp. U 1 ) to U 1 (resp. I 1 ). After that, the Intra-Domain Alignment (IDA) is deployed for theses two distinct features into a single representation. After the self-supervised training, the algorithm gives the domains representation of H 1 u and H 1 I . The adversarial loss L adv is used to compute the best results.
Thus, the Inter-Domain Message Passing (IDMP) is expressed as: where: B u = D −1 u B u is the normalization of B u (D u is the degree matrix of B u ). By normalizing the incidence matrix of the graph, the algorithm can effectively reduce the computational cost. σ is the activation function ReLU [22]. K denotes the depth index of the hidden features of the nodes in set U 1 (resp. I 1 ).
Let σ is the Intra-Domain Alignment (IDA) discriminator and φ is the IDMP generator. The discriminator loss function is expressed as follows: where: P σ,φ (source = 1|h) is the probability that the input feature vector h is from the source domain H i→u . The implementation for the self-supervised representation learning on large scale bipartite graph (BGNN) for the recommendation task is summarized in Algorithm 1.

Begin phase 1
Extract the graph from the rating matrix B u , B i = GetBipartiteGraph(R) phase 2 Computing embeddings for each item and user

The Interactive Attention Network Recommender
The interactive attention network recommendation system aim at identifying latent features that show the users and items mutual influence. The attention mechanism has been shown to be useful in a variety of machine learning applications, including image/video captioning [23,24]. Our proposed interactive concept extracts each participant's contribution from their compressed representation. Thus, it allows the proposed recommendation framework to model efficiently the interaction characteristic. This attention network model figures out which weights best represent the users' mutual effect on the item. Figure 2 explains the mutual interactions between users and items.
Algorithm 2 shows the architecture of the proposed interactive attention network. In order to anticipate a distribution over the items, we create combined user and item interactive attention maps. As a result, the co-attention mechanism detects a correlation between items and users and calculates the likelihood that an item will be of interest to indirect comparable individuals.
The first embedding layers e u and e i captures latent features of users p u and items q i . They are followed by Long-Short-Term-Memory (LSTM) layers to learn long sequences with long time lags. Each LSTM state includes two inputs: the current feature vector and the output vector h t−1 from the previous state. Its output vector is h t . We chose to apply the LSTM model as it exhibits interactions between users and items in latent embeddings. Each node embedding layer is chained with an LSTM layer that contains recurrent modules enabling long-range learning. Information from nodes neighbors gradually enhances the subsequent feature representation because LSTM has an augmented hidden state with non-linear mechanisms. It allows propagating without modification, updating, or reset-ting states using simple learned gating functions. The LSTM representation is expressed as follows: h tu = g 1 (p) (12) h ti = g 2 (q) (13) Figure 2. The interactive attention network recommender. In this example, user 1 and user 2 rate three similar items. They have strong interactive attention. User 2 and user 3 rate an item not seen by user 1. Therefore, one can deduce that this new item can also attract user 1. It is a first-order interaction. Moreover, one can deduce a mutual influence based on entities' dependencies at more than a first order interaction level. For example, user 1 influences user 4 (a similar user of user 3), generating a recommendation based on user 3 preferences.
The learned representation is H p and H q , respectively, with d × n dimensions for H p and d × m for H q . Users and items embedded inputs are projected into a vector representation space using the attention technique. In fact, this representation models the high-order non-linear mutual relationship. For the interactive attention mechanism, we build an attention maps in order to predict a distribution over the items. For this purpose, we compute a matrix L = tanh(H p W pq H q ) , where L ∈ R n×m , and W pq is a d × d a learnable parameters matrix. The features co-attention maps is defined as: The interactive attention model uses a tangent function to model the mutual interactions between users and items. Afterward, we compute the the probability distribution over the embedding space. The softmax function is used to generate the attention weights: where f : is a multi-layer neural network. Then, the high order interaction latent space of users and items is given by: where β p and β q : are the derived attention weights. As a result, the predicted matrixR ui is defined as: where f : is a dense layer using a sigmoid activation function. Finally, we train the model to minimize the loss function which is the Mean Absolute Error (MAE): Softmax function application O: is the batch.dot() function from Keras backend that is used between two tensors ((α p ) t and (G) t ), (α ( q) t and (S) t ) respectively. phase 4 Each output of function O are transposed and then used as input into a product function. After that, both results (β u β i ) can be summed by a concatenate function.
att ui = concatenation(β u , β i ) return att ui Algorithm 3 summarizes the overall mutual-interaction graph attention network approach. Network representation learning can tackle the recommendation problems by embedding nodes into a low-dimensional space R d . Furthermore, this adapted BGNN representation for recommendation task improves multi-hop relationship modeling and the training accuracy. Unlike previous research on graph neural networks for recommendation that only learn complex relationship between the target and their neighbors using attention network, our work learn the most important weights representing the users' mutual influence on the item based on the interactive attention. Preparing data to be passed to the BGNN foreach u ∈ U, i ∈ V do Rj = minmax(R) phase 2 extracting embeddings by BGNN-Class() User and Item embedding are followed by LSTM layers.

Experiments and Discussion
We conduct our experiments on MovieLens 1M that is commonly used for benchmarking recommendation frameworks. The MovieLens dataset [25] is a real, timestamped 5-star ratings of the MovieLens platform users on various films. The selected dataset contains 1 million ratings from 6000 users over 4000 movies. Table 2 shows the dataset description. We divide the datasets into 75% training data and 25% testing data in a stratified way.

Genomic Tags
User Demographics We evaluate MIGAN using Mean Average Precision and Normalized Discounted Cumulative Gain. The mean average precision (MAP) measures the accuracy of information retrieval. The results of recommender systems are frequently pruned to return the Top-k components. The value of k varies based on the application: a system might display the top three trending items or the best ten items that meet the current user's preferences. The Normalized Discounted Cumulative Gain (NDCG) calculates an item's normalized usefulness depending on its position in the final list. It is used to assess the Top-k recommended items' ranking quality. Interested readers can refer to [16] for more details.

Hyperparameters Analysis
Here, we report the hyperparameters analysis phase, which is performed separately for each MIGAN recommender variant. Evaluation is done with MAP@k and NDCG@k.
The main idea of our model is to pass the results of the BGNN through a neural network and then apply the attention model to it. Consequently, we perform the analysis on 1-6 variants and focus on the learning algorithms used by each hyperparameter. We have 3 hyperparameters and two kinds of recurrent networks: Embedding output size = 50, 75, 100, Neural network = LSTM, GRU. It gives us six variants. The architecture of each variant is as follows: As shown in Table 3, Variant 1 scores better than the other variants. Thus, it is picked for further tweaking. We generate the utility matrix based on the learned embeddings. Figure 3 illustrates the hyperparameters analysis.
We explore a range of values for each hyperparameter as reported below: Results illustrate that we achieve the best performance for the following settings: α = 50, θ = 3, τ = 100, σ = elu, λ = Adam.

Performance Comparison with the Baselines
In this subsection, MIGAN's final benchmark results are compared to the outcomes of certain baseline recommender systems. We execute the baselines in the same evaluation environment as MIGAN to guarantee that comparisons are fair. Furthermore, we deploy the NeuRec library. Furthermore, we deploy the NeuRec library. It is released on GitHub as open-source software under an MIT license, and it implements 33 neural recommender systems [26].
We compare the proposed recommender framework with the following baselines: • The stacked content-based filtering recommender: the work [16] developed a content-based recommender system based on the stacking ensemble learning. • Neural collaborative filtering (NCF) [27]: this work developed a recommender framework that uses the multi-layer perceptron to exploit the user-item interaction. • Variational Autoencoders for Collaborative Filtering (MultiVAE) [15]: This approach investigates the collaborative information in a multinomial distribution to recommend items on the long tail. • Node2Vec embedding: We propose a variant of MIGAN architecture, which deploys Node2Vec embedding representation instead of BGNN. Table 4 reports the performance of the various approaches under investigation using the MovieLens dataset. According to mean average precision, MIGAN outperforms the baselines. Indeed, it models the higher-order feature interactions. Figure 4 shows the MAP@k evaluation versus k-top items. As the MAP measure indicates the fraction of relevant articles in the top k suggestions averaged over all users, MIGAN creates a tailored recommendation. To put it differently, MIGAN outperforms the other models in recalling relevant items for the user. It retains the user-item interaction and generates a user-specific task recommendation. Furthermore, both recommender-based BGNN and the recommender-based Node2Vec are quite competitive. For example, MIGAN and Node2Vec achieve MAP@10 = 0.85, and MAP@10 = 0.84, respectively, outperforming the other baselines. The training loss equals 0.23 with dimensions d = 70 and 0.32 with d = 50 for BGGN and Node2vec, respectively. Node2Vec is a random walk-based node embedding method producing high memory consumption for large graphs. In contrast, the cascaded training used in BGGN does not involve loading the entire graph into memory. Consequently, it reduces the memory cost and training time. The MultiVAE recommender scores poorly. Indeed, it does not use a enough rich representation of data semantic. The neural collaborative filtering approach (Neural CF) shows a good score with k = 10, MAP@10 = 0.74 due to the appropriate representation of the interaction between user and item.   Figure 5 shows that MIGAN exhibit a high NDCG score on MovieLens. Its Top-k recommendation list is quite similar to the ground-truth list. MIGAN boosts the modeling of user-item interaction. Indeed, the more interested the users are in an item, the more likely users with similar preferences to recommend it. Note that the Node2Vec also presents good NDCG scores. The graph recommender effectively obtains the user's overall interest built by the neighborhood representation.

Conclusions
We propose and investigate a graph recommender where each user's recommended content is accurate and personalized. MIGAN is a collaborative filtering (CF) system. A neural graph represents the item and users. It computes a representation of nodes by combining their neighborhoods' vectors according to their mutual influence interaction by utilizing a co-attention mechanism with trainable attention weights. The attention weights are adjustable parameters computed by aggregating neighbor vectors. This neural graph architecture predicts the ratings assigned by the users to the items.
We perform a comparative evaluation of several configurations for the RS using the well-known dataset MovieLens. We use two metrics to quantify its accuracy: the mean average precision (MAP) and the normalized discounted cumulative gain (NDCG). The first is about the accuracy of the recommendation. The second is about the ranking of the recommended items. Comparing MIGAN performance with some baselines shows that it outperforms all its alternatives in MAP and NDGC scores. However, it would be very interesting to focus on a specific domain for recommendation tasks, such as taking the knowledge graph to distill attribute-based collaborative signals and compare MIGAN performance with knowledge graph attention network models. Thus, future work will investigate this framework for collaborative knowledge graphs involving contextual and semantic data. Future work will investigate this framework for other recommendation tasks involving contextual data.