Model Description of Similarity-Based Recommendation Systems

The quality of online services highly depends on the accuracy of the recommendations they can provide to users. Researchers have proposed various similarity measures based on the assumption that similar people like or dislike similar items or people, in order to improve the accuracy of their services. Additionally, statistical models, such as the stochastic block models, have been used to understand network structures. In this paper, we discuss the relationship between similarity-based methods and statistical models using the Bernoulli mixture models and the expectation-maximization (EM) algorithm. The Bernoulli mixture model naturally leads to a completely positive matrix as the similarity matrix. We prove that most of the commonly used similarity measures yield completely positive matrices as the similarity matrix. Based on this relationship, we propose an algorithm to transform the similarity matrix to the Bernoulli mixture model. Such a correspondence provides a statistical interpretation to similarity-based methods. Using this algorithm, we conduct numerical experiments using synthetic data and real-world data provided from an online dating site, and report the efficiency of the recommendation system based on the Bernoulli mixture models.


Introduction
In this paper, we study recommendation problems, in particular, the reciprocal recommendation. The reciprocal recommendation is regarded as an edge prediction problem of random graphs. For example, a job recruiting service provides preferable matches between companies and job seekers. The corresponding graph is a bipartite graph, where nodes are categorized into two groups: job seekers and companies. Directed edges from one group to the other are the expression of the user's interests. Using this, the job recruiting service recommends unobserved potential matches between users and companies. Another common example is online dating services. Similarly, the corresponding graph is expressed as a bipartite graph with two groups, i.e., males and females. The directed edges are the preference expressions among users. The recommendation system provides potentially preferable partners to each user. The quality of such services depends entirely on the prediction accuracy of the unobserved or newly added edges. The edge prediction has been widely studied as a class of important problems in social networks [1][2][3][4][5].
In recommendation problems, it is often assumed that similar people like or dislike similar items, people, etc. Based on this assumption, researchers have proposed various similarity measures. The similarity is basically defined through the topological structure of the graph that represents the relationship among users or items. Neighbor-based metrics, path-based metrics, and random walk based metrics are commonly used in this type of analysis. Then, a similarity matrix defined from the similarity measure is used for the recommendation. Another approach is employing the statistical models, such as stochastic block models [6], that are used to estimate network structures, such as clusters or edge distributions. The learning methods using statistical models often achieve high prediction accuracy in comparison to similarity-based methods. Details on this topic are reported in [7] and the references therein.
The main purpose of this paper is to investigate the relationship between similarity-based methods and statistical models. We show that a class of widely applied similarity-based methods can be derived from the Bernoulli mixture models. More precisely, the Bernoulli mixture model with the expectation-maximization (EM) algorithm [8] naturally derives a completely positive matrix [9] as the similarity matrix. The class of completely positive matrices is a subset of doubly nonnegative matrices, i.e., positive semidefinite and element-wise nonnegative matrices [10]. Additionally, we provide an interpretation of completely positive matrices as a statistical model satisfying exchangeability [11][12][13][14]. Based on the above argument, we connect the similarity measures using completely positive matrices to the statistical models. First, we prove that most of the commonly used similarity measures yield completely positive matrices as the similarity matrix. Then, we propose an algorithm that transforms the similarity matrix to the Bernoulli mixture model. As a result, we obtain a statistical interpretation of the similarity-based methods through the Bernoulli mixture models. We conduct numerical experiments using synthetic data and real-world data provided from an online dating site, and report the efficiency of the recommendation method based on the Bernoulli mixture models.
Throughout the paper, the following notation is used. Let [n] be {1, . . . , n} for a positive integer n. For the matrices A and B, A ≤ B denotes the element-wise inequality and 0 ≤ A denotes that A is entry-wise non-negative. The same notation is used for the comparison of vectors. The Euclidean norm (resp. 1-norm) is denoted as a (resp. a 1 ). For the symmetric matrix A, O A means that A is positive semidefinite.
In this paper, we mainly focus on directed bipartite graphs. The directed bipartite graph G = (X, Y, E) consists of the disjoint sets of nodes, X = {x 1 , . . . , x n }, Y = {y 1 , . . . , y m }, and the set of directed edges E ⊂ (X × Y) ∪ (Y × X). The sizes of X and Y are n and m, respectively. Using the matrices A = (a ij ) ∈ {0, 1} n×m and B = (b ji ) ∈ {0, 1} m×n , the adjacency matrix of G is given by where a ij = 1 (respectively b ji = 1) if and only if (x i , y j ) ∈ E (respectively (y j , x i ) ∈ E). For the directed graph, the adjacency matrix A is not necessarily symmetric. In many social networks, each node of the graph corresponds to each user with attributes such as the age, gender, preferences, etc. In this paper, an observed attribute associated with the node x i ∈ X (resp. y j ∈ Y) is expressed by a multi-dimensional vector x i (resp. y j ). In real-world networks, the attributes are expected to be closely related to the graph structure.

Recommendation with Similarity Measures
We introduce similarity measures commonly used in the recommendation. Let us consider the situation that each element in X sends messages to some elements in Y, and vice versa. The messages are expressed as directed edges between X and Y. The observation is, thus, given as a directed bipartite graph G = (X, Y, E). The directed edge between nodes is called the expression of interest (EOI) in the context of the recommendation problems [15]. The purpose is to predict an unseen pair (x, y) ∈ X × Y such that these two nodes will send messages to each other. This problem is called the reciprocal recommendation [15][16][17][18][19][20][21][22][23][24][25]. In general, the graph is sparse, i.e., the number of observed edges is much fewer than the number of all possible edges.
In social networks, similar people tend to like and dislike similar people, and are liked and disliked by similar people as studied in [15,26]. Such observations motivated us to define similarity measures. Let sim(i, i ) be a similarity measure between the nodes x i , x i ∈ X. In a slight abuse of notation, we write sim(j, j ) to indicate a similarity measure between the nodes y j , y j ∈ Y. Based on the observed EOIs, the score of x i 's interest to y j ∈ Y for i ∈ [n] is defined as If x i ∈ X is similar to x i and the edge (x i , y j ) exists, the user x i gets a high score even if (x i , y j ) ∈ E.
In the reciprocal recommendation, score(j → i) defined by is also important. The reciprocal score between x i and y j , score(i ∼ j), is defined as the harmonic mean of score(i → j) and score(j → i) [15]. This is employed to measure the affinity between x i and y j . Table 1 shows popular similarity measures including graph-based measures and a content-based measure [1]. For the node x i ∈ X in the graph G = (X, Y, E), let s i (resp.s i ) be the index set of outer-edges, {j|(x i , y j ) ∈ E} (resp. in-edges, {j|(y j , x i ) ∈ E}) and |s| be the cardinality of the finite set s. In the following, the similarity measures based on outer-edges are introduced on directed bipartite graphs. The set of outer-edges s i can be replaced withs i to define the similarity measure based on in-edges.
In graph-based measures, the similarity between the nodes x i and x i is defined based on s i and s i . Some similarity measures depend only on s i and s i , and others may depend on the whole topological structure of the graph. In Table 1, the first group includes the Common Neighbors, Parameter-Dependent, Jaccard Coefficient, Sørensen Index, Hub Depressed, and Hub Promoted. The similarity measures in this group are locally defined, i.e., sim(i, i ) depends only on s i and s i . The second group includes SimRank, Adamic-Adar coefficient, and Resource Allocation. They are also defined from the graph structure. However, the similarity between the nodes x i and x i depends on the topological structure more than s i and s i . The third group consists of the content-based similarity, which is defined by the attributes associated with each node.
Below, we supplement the definition of the SimRank and the content-based similarity. Table 1. Definition of similarity measures sim(i, i ) between the nodes x i and x i . The right column shows whether the similarity measure is a completely positive similarity kernel (CPSK); see Section 4.

Similarity
Definition/Condition of S = (sim(i, i )) CPSK Common neighbors [27] |s i ∩ s i | √ Parameter-dependent [28] |s Hub depressed [31] |s i ∩ s i |/ max{|s i |, |s i |} √ Hub promoted [32] |s i ∩ s i |/ min{|s i |, |s i |} × SimRank [33] S = cP T SP + D, c ∈ (0, 1) √ Adamic-adar coefficient [34] ∑ k∈s i ∩s i 1/ log |s k | √ Resource allocation [31] ∑ k∈s i ∩s i 1/|s k | √ Content-based similarity [17] 1 SimRank [33] and its reduced variant [35] are determined from the random walk on the graph. Hence, the similarity between the two nodes depends on the whole structure of the graph. For c ∈ (0, 1), the similarity matrix S = ( S ij ) i,j∈[n+m] on X ∪ Y is given as the solution of S ii = c ∑ k∈s i ,k ∈s i S kk |s i ||s i | for i = i , while the diagonal element S ii is fixed to 1. Let P ∈ [0, 1] (n+m)×n+m be the column-normalized adjacency matrix defined from the adjacency matrix of G = (X, Y, E). Then, S satisfies S = cP T SP + D, where D is a diagonal matrix satisfying 1 − c ≤ D ii ≤ 1. In the reduced SimRank, D is defined as (1 − c)I. For the bipartite graph, the similarity matrix based on the SimRank is given as a block diagonal matrix.
Suppose that The continuous variables in the features are appropriately discredited. The similarity measure is defined using the number of shared attributes, i.e., In RECON, the score is defined from the normalized similarity, i.e., score(i → j) = ∑ j a ij ∑ k a ik sim(j , j).
The similarity-based recommendation is simple but the theoretical properties have not been sufficiently studied. In the next section, we introduce statistical models and consider the relationship to similarity-based methods.

Bernoulli Mixture Models and Similarity-Based Prediction
In this section, we show that the similarity-based methods are derived from Bernoulli mixture models (BMMs). BMMs have been employed in some studies [36][37][38] for block clustering problems, Here, we show that the BMMs are also useful for recommendation problems.
Suppose that each node belongs to a class c ∈ [C]. Let π c (respectively π c ) be the probability that each node in X (respectively Y) belongs to the class c ∈ [C]. We assume that the class at each node is independently drawn from these probability distributions. Though the number of classes, C, can be different in each group, here we suppose that they are the same for simplicity. When the node x i ∈ X in the graph belongs to the class c, the occurrence probability of the directed edge from x i to y j ∈ Y is defined by the Bernoulli distribution with the parameter α cj ∈ (0, 1). As previously mentioned, the adjacency matrix of the graph consists of A = (a ij ) and B = (b ji ). We assume that all elements of A and B are independently distributed. For each x i ∈ X, the probability of (a ij ) j∈[m] is given by the BMM, and the probability of the adjacency submatrix A = (a ij ) is given by In the same way, the probability of the adjacency submatrix B is given by where β ci ∈ (0, 1) is the parameter of the Bernoulli distribution. Hence, the probability of the whole adjacency matrix A is given by where Ψ is the set of all parameters in the BMM, i.e., π c , π c , α cj and β ci for i ∈ [n], j ∈ [m] and c ∈ [C]. One can introduce the prior distribution on the parameter α cj and β cj . The beta distribution is commonly used as the conjugate prior to the Bernoulli distribution. The parameter is estimated by maximizing the likelihood for the observed adjacency matrix A. The probability P( A; Ψ) is decomposed into two probabilities, P(A) and P(B), which do not share the parameters. In fact, P(A) depends only on π c and α cj and P(B) depends only on π c and β cj . In the following, we consider the parameter estimation of P(A). The same procedure works for the estimation of the parameters in P(B).
The expectation-maximization (EM) algorithm [8] can be used to calculate the maximum likelihood estimator. The auxiliary variables used in the EM algorithm have an important role in connecting the BMM with the similarity-based recommendation methods. Using the Jensen's inequality, we find that the log-likelihood log P(A) is bounded below as where the parameter r = (r(c|i)) c,i is positive auxiliary variables satisfying ∑ c∈[C] r(c|i) = 1. In the above inequality, the equality holds when r(c|i) is proportional to π c ∏ j α The auxiliary variable r(c|i) is regarded as the class probability of x i ∈ X when the adjacency matrix is observed.
In the EM algorithm, the lower bound of the log-likelihood, i.e., the function J(r, Ψ; A) in (6) is maximized. For this purpose, the alternating optimization method is used. Firstly the parameter Ψ is optimized for the fixed r, and secondly, the parameter r is optimized for the fixed Ψ. This process is repeatedly conducted until the function value J(r, Ψ; A) converges. Importantly, in each iteration the optimal solution is explicitly obtained. The following is the learning algorithm of the parameters: Step 1: Step 2: The estimator of the parameter Ψ is obtained by repeating (7) and (8).
Using the auxiliary variables r(c|i), one can naturally define the "occurrence probability" of the edge from x i to y j . Here, the occurrence probability is denoted by score(i → j). Note that the auxiliary variable r(c|i) is regarded as the conditional probability that x i belongs to the class c. If x i belongs to the class c, the occurrence probability of the edge (x i , y j ) is α cj . Hence, the occurrence probability of the edge (x i , y j ) is naturally given by where the updated parameter α cj in (7) is substituted. The similarity measure sim(i, i ) on X in the above is defined by where r(i) := 1/n, r(i|c) := r(c|i)r(i)/π c and π c r(i|c)r(i |c).
The equality ∑ i r(i, i ) = r(i ) = 1/n holds for r satisfying the update rule (7). The above joint probability r(i, i ) clearly satisfies the symmetry, r(i, i ) = r(i , i). This property is the special case of the finite exchangeability [11,13]. The exchangeability is related to the de Finetti's theorem [39], and the statistical models with the exchangeability have been used in several problems such as Bayes modeling and classification [12,40,41]. Here, we use the finite exchangeable model for the recommendation systems.
Equation (9) gives an interpretation of the heuristic recommendation methods (1) using similarity measures. Suppose that a similarity measure sim(i, i ) is used for the recommendation. Let us assume that the corresponding similarity matrix S = (sim(i, i )) i,i ∈[n] is approximately decomposed into the form of the mixture model r in (10), i.e., Then, score(i → j) defined from S is approximately the same as that computed from the Bernoulli mixture model with the parameter Ψ that maximizes J(r, Ψ; A) for the fixed r(c|i) associated with S. On the other hand, the score for the recommendation computed from the Bernoulli mixture uses the maximum likelihood estimator Ψ that attains the maximum value of J(r, Ψ; A) under the optimal auxiliary parameter r(c|i). Hence, we expect that the learning method using the Bernoulli mixture model will achieve higher prediction accuracy as compared to the similarity-based methods, if the Bernoulli mixture model approximates the underling probability distribution of the observed data.
For i, i ∈ [n], the probability function r(i, i ) satisfying (11) leads to the n × n positive semidefinite matrix (r(i, i )) i,i ∈[n] with nonnegative elements. As a result, the ratio r(i, i )/r(i)r(i ) is also positive semidefinite with nonnegative elements. Let us consider whether the similarity measures in Table 1 yield the similarity matrix with expression (10). Next, we demonstrate that the commonly used similarity measures meet the assumption (12) under a minor modification.

Completely Positive Similarity Kernels
For the set of all n by n symmetric matrices S n , let us introduce two subsets of S n ; one is the completely positive matrices and the other is doubly nonnegative matrices. The set of completely positive matrices is defined as C n = {S ∈ S n |∃N ≥ 0 s.t. S = NN T }, and the set of doubly nonnegative matrices is defined as D n = {S ∈ S n |S ≥ 0, S O}. Though the number of columns of the matrix N in the completely positive matrix is not specified, it can be bounded above by n(n + 1)/2 + 1. This is because C n is expressed as the convex hull of the set of rank one matrices {qq T |q ∈ R n , q ≥ 0} as show in [11]. The Carathéodory's theorem can be applied to prove the assertion. More detailed analysis of the matrix rank for the completely positive matrices is provided by [42]. Clearly, the completely positive matrix is doubly nonnegative matrix. However, [10] proved that there is a gap between the doubly nonnegative matrix and completely positive matrix when n ≥ 5.
The similarity measure that yields the doubly nonnegative matrix satisfies the definition of the kernel function [43]. The kernel function is widely applied in machine learning and statistics [43]. Here, we define the completely positive similarity kernel (CPSK) as the similarity measure that leads to the completely positive matrix as the Gram matrix or similarity matrix. We consider whether the similarity measures in Table 1 Clearly, the linear sum of the completely positive matrices with non-negative coefficients yields completely positive matrices. Using this fact with the above lemma, we show that all measures in Table 1 except the HP measure are the CPSK. In the following, let a i = (a i1 , . . . , a im ) T ∈ {0, 1} m for i ∈ [n] be non-zero binary column vectors, and let A be the matrix A = (a ij ) ∈ {0, 1} n×m . The index set s i is defined as

Common Neighbors
The elements of the similarity matrix are given by Hence, S = AA T holds. The common neighbors similarity measure yields the CPSK. Parameter-Dependent: The elements of the similarity matrix are given by Hence, we have S = DAA T D T , where D is the diagonal matrix whose diagonal elements are 1/ a 1 λ 1 , . . . , 1/ a 1 λ n . The Parameter-Dependent similarity measure yields the CPSK.
Jaccard Similarity: Hence, the Jaccard similarity matrix S = (S ii ) i,i ∈[n] is given by Let us define the matrices S 0 and T (k) respectively by (S 0 ) ii = a T i a i /m and (T (k) ) ii = (ā T iā i /m) k . The matrix S is then expressed as S = S 0 • ∑ ∞ k=0 T (k) . Lemma 1 (i) guarantees that T (k) is the CPSK since T (1) is the CPSK. Hence, the Jaccard similarity measure is the CPSK.

Sørensen Index:
The similarity matrix S = (S ii ) based on the Sørensen Index is given as The integral part is expressed as the limit of the sum of the rank one matrix, Hence, the Sørensen index is the CPSK.
Hub Promoted: The hub promoted similarity measure does not yield the positive semidefinite kernel. Indeed, for the adjacency matrix the similarity matrix based on the Hub Promoted similarity measure is given as The eigenvalues of S are 1 and (2 ± √ 5)/2. Hence, S is not positive semidefinite.
Hub Depressed: The similarity matrix is defined as Since the min operation is expressed as the integral min{x, y} = In the same way as the Sørensen Index, we can prove that the Hub Depressed similarity measure is the CPSK.

SimRank:
The SimRank matrix S satisfies S = cP T SP + (1 − c)D for c ∈ (0, 1), where P ≥ 0 is properly defined from A and D is a diagonal matrix such that the diagonal element d ii satisfies 1 − c ≤ d ii ≤ 1. The recursive calculation yields the equality S = ∑ ∞ k=0 c k (P k ) T DP k , meaning that S is the CPSK.
Adamic-Adar Coefficient: Given the adjacency matrix A = (a ij ), the similarity matrix S = (S ii ) is expressed as where a ik a i k log ∑ a k is set to zero if ∑ a k ≤ 1. Hence, we have S = ADA T , where D is the diagonal matrix with the elements D kk = 1/ log ∑ a k for ∑ a k ≥ 2 and D kk = 0 otherwise. Since S = NN T with N = AD 1/2 holds, the similarity measure based on the Adamic-adar coefficient is the CPSK.
Resource Allocation: In the same way as the Adamic-adar coefficient, the similarity matrix is given as where the term a ik a i k ∑ a k is set to zero if ∑ a k ≤ 1. We have S = ADA T , where D is the diagonal matrix with the elements D kk = 1/ ∑ a k for ∑ a k ≥ 2 and D kk = 0 otherwise. Since S = NN T with N = AD 1/2 holds, the similarity measure based on resource allocation is the CPSK.

Content-Based Similarity:
The similarity matrix is determined from the feature vector of each node as follows, Clearly, S is expressed as the sum of rank-one matrix Hence, Content-based similarity is the CPSK.

Transformation from Similarity Matrix to Bernoulli Mixture Model
Let us consider whether the similarity matrix allows the decomposition in (10) for sufficiently large C. Then, we construct an algorithm providing the probability decomposition (10) that approximates the similarity matrix.

Decomposition of Similarity Matrix
We show that a modified similarity matrix defined from the CPSK is decomposed into the form of (10). Suppose the n by n similarity matrix S is expressed as (10). Then, we have where r(i) = r(i ) = 1/n. Taking the sum over i ∈ X, we find that the equality S1/n = 1 (13) should hold. If the equality (13) is not required, the completely positive matrix S will be always decomposed into the form of (10) up to a constant factor. The equality (13) does not necessarily hold even when the CPSK is used. Let us define the diagonal matrix D as and let S be Then, S1/n = 1 holds. Since S is the completely positive matrix, also S is the completely positive matrix. Suppose that S/n 2 is decomposed into FF T with the non-negative matrix F = ( f 1 , . . . , f C ) ∈ R n×C . Then, Let us define π c = f c 2 1 and r(i|c) = ( f c ) i / f c 1 ≥ 0. Since 1 T S1/n 2 = 1, we have ∑ c π c = 1. Moreover, the equality S1/n = 1 guarantees 1 n meaning that ∑ c π c r(i|c) = r(i) = 1/n for i ∈ X. Hence, we have The modification (14) corresponds to the change of the balance between the self-similarity and the similarities with others.

Decomposition Algorithm
Let us propose a computation algorithm to obtain the approximate decomposition of the similarity matrix S. Once the decomposition of S is obtained, the recommendation using the similarity measure is connected to the BMMs. Such a correspondence provides a statistical interpretation of the similarity-based methods. For example, the conditional probability r(c|i) is available to categorize nodes into some classes according to the tendency of their preferences once a similarity matrix is obtained. The statistical interpretation provides a supplementary tool for similarity-based methods.
The problem is to find π c and r(i|c) such that the equation ∑ c π c r(i|c)r(i |c) r(i)r(i ) approximates the similarity matrix S = S ii , where r(i) = 1/n = ∑ c π c r(i|c) for n = |X|. Here, we focus on the similarity matrix on X. The same argument is clearly valid for the similarity matrix on Y.
This problem is similar to the non-negative matrix factorization (NMF) [44]. However, the standard algorithm for the NMF does not work since we have the additional constraint, ∑ c π c r(i|c) = 1/n. Here, we use the extended Kullback-Leibler (ext-KL) divergence to measure the discrepancy [45]. The ext-KL divergence between the two matrices C = (c ij ) and D = (d ij ) with nonnegative elements is defined as The minimization of the ext-KL divergence between S ii and the model r(i, i )/r(i)r(i ) is formalized by There are many optimization algorithms that can be used to solve nonlinear optimization problems with equality constants. A simple method is the alternating update of π c and r(i|c) such as the coordinate descent method [46]. Once r(i|c) is fixed, however, the parameter π c will be uniquely determined from the first equality constraint in (16) under a mild assumption. This means that the parameter π c cannot be updated while keeping the equality constants. Hence, the direct application of the coordinate descent method does not work. On the other hand, the gradient descent method with projection onto the constraint surface is a promising method [47,48]. In order to guarantee the convergence property, however, the step-length should be carefully controlled. Moreover, the projection in every iteration is computationally demanding. In the following, we propose a simple method to obtain an approximate solution of (16) with an easy implementation.
The constraint ∑ c π c r(i|c) = r(i) = 1/n is replaced with the condition that the KL-divergence between the uniform distribution and ∑ c π c r(i|c) vanishes, i.e., 1 We incorporated this constraint into the objective function to obtain tractable algorithm. Eventually, the optimization problem we considered is where the minimization problem is replaced with the maximization problem and λ is the regularization parameter. For a large λ, the optimal solution approximately satisfies the equality constraint ∑ c π c r(i|c) = 1/n. For the above problem we use the majorization-minimization (MM) algorithm [49]. Let a cij , c ∈ [C], i, j ∈ [n] and b ci be the auxiliary positive variables satisfying a cij = a cji , ∑ c a cij = 1 and ∑ c b cj = 1. Then, the objective function is bounded below by For fixed π c and r(i|c), the optimal a cij and b ci are explicitly obtained. The optimal solutions of π c and r(i|c) for a given a cij and b ci are also explicitly obtained. As a result, we obtain the following algorithm to compute the parameters in the Bernoulli mixture model from the similarity matrix S. Algorithm 1 is referred to as the SM-to-BM algorithm. The convergence of the SM-to-BM algorithm is guaranteed from the general argument of the MM algorithm [49].
Note that the SM-to-BM algorithm yields an approximate BMM model, even if the similarity matrix S is not completely positive such as the Hub Promoted. However, the approximation accuracy is thought to be not necessarily high, since not-CPSK such as the Hub Promoted does not directly correspond to the exchangeable mixture model (10).

Input:
Similarity matrix S = (S ii ) i,i ∈[n] and the number of classes C.

Step 1.
Initial values of auxiliary variables a cii and b ci are defined.

Step 2.
Repeat (i) and (ii) until the solution converges to a point: For given a cii and b ci : For given r ic and π c : Step 3.
Terminate the algorithm with the output: "The similarity matrix S is approximately obtained from the Bernoulli mixture model with π c and the auxiliary variable r(c|i) = n · r(i|c)π(c)."

Numerical Experiments of Reciprocal Recommendation
We conducted numerical experiments to ensure the effectiveness of the BMMs for the reciprocal recommendation. We also investigated how well the SM-to-BM algorithm works for the recommendation. In numerical experiments, we compare the prediction accuracy for the recommendation problems.
Suppose that there exist two groups, X = {x 1 , . . . , x n } and Y = {y 1 , . . . , y m }. Expressions of interest between these two groups are observed and they are expressed as directed edges. Hence, the observation is summarized as the bipartite graph with directed edges between X and Y. If there exists two directed edges (x, y) and (y, x) between x ∈ X and y ∈ Y, the pair is a preferable match in the graph. The task is to recommend a subset of Y to each element in X and vice versa based on the observation. The purpose is to provide potentially preferable matches as much as possible.
There are several criteria used to measure the prediction accuracy. Here, we use the mean average precision (MAP), because the MAP is a typical metric for evaluating the performance of recommender systems; see [5,[50][51][52][53][54][55][56] and references therein for more details.
Let us explain the MAP according to [50]. The recommendation to the element x is provided as the ordered set of Y, i.e., y (1) , y (2) , . . . , y (m) , meaning that the preferable match between x and y (1) is regarded to be most likely to occur compared to y (2) , . . . , y (m) . Suppose that for each x ∈ X, the preferable matches with elements in the subset Y x ⊂ Y are observed in the test dataset. Let us define z i as z i = 1 if y (i) is included in Y x and otherwise z i = 0. The precision at the position k is defined as P@k = 1 k ∑ k i=1 z i . The average precision ν x is then given as the average of P@k with the weight z k , i.e., In the latter case, ν x = 1/m for m = m − 1, and ν x = 1 m + 1 2(m−1) for m = m − 2. The MAP is defined as the mean value of ν x over x ∈ X. The high MAP value implies that the ordered set over Y generated by the recommender system is accurate on average. We use the normalized MAP that is the ratio of the above MAP and the expected MAP for the random recommendation. The normalized MAP is greater than one when the prediction accuracy of the recommendation is higher than that of the random recommendation.
The normalized discounted cumulative gain (NDCG) [5,50,57] is another popular measure in the literature of information retrieval. However, the computation of the NDCG requires the true ranking over the node. Hence, the NDCG is not available for the real-world data in our problem setup.

Gaussian Mixture Models
The graph is randomly generated based on the attributes defined on each node. The size of X and Y is 1000. Suppose that x i ∈ X has the profile vector x i ∈ R 100 and the preference vector x i ∈ R 100 . Thus, the attribute vector of x i is given by (x i , x i ) ∈ R 200 . Likewise, the attribute vector (y j , y j ) ∈ R 200 of y j ∈ Y consists of the profile vector y j ∈ R 100 and the preference vector y j ∈ R 100 . For each x i ∈ X, 100 elements in Y, for example, y k 1 , . . . , y k 100 are randomly sampled. Then, the Euclidean distance between the preference vector x i of x i and the profile vector y k j of y k j , i.e., x i − y k j is calculated for each y k j . Then, the 10 closest y k j from x i in terms of the above distance are chosen and directed edges from x i to the 10 chosen nodes in Y are added. In the same way, the edges from Y to X are generated and added to the graph. The training data is obtained as a random bipartite graph. Repeating the same procedure again with a different random seed, we obtain another random graph as a test data.
The above setup imitates practical recommendation problems. Usually, a profile vector is observed for each user. However, the preference vector is not directly observed, while the preference of each user can be inferred via the observed edges.
In our experiments, the profile vectors and preference vectors are identically and independently distributed from the Gaussian mixture distribution with two components, i.e., meaning that each profile or preference vector is generated from N 100 (0, I 100 ) or N 100 (1, I 100 ) with probability 1/2. Hence, each node in X is roughly categorized into one of two classes, i.e., 0 or 1, that is the mean vector of the preference, x i . When the class of x i is 0 (resp. 1), the edge from x i is highly likely to be connected to y j having the profile vector generated from N 100 (0, I 100 ) (resp. N 100 (1, I 100 )). Therefore, the distribution of edges from X to Y will be well approximated by the Bernoulli mixture model with C = 2. Figure 1 depicts the relationship between the distribution of attributes and edges from X to Y. The same argument holds for the distribution of edges from Y to X.
Edges from X to Y. The bold edges mean that there are many edges between the connected groups. The broken edges mean that there are few edges between the connected groups.
In this simulation, we have focused on the recommendation using similarity measures based on the graph structure. The recommendation to each node of the graph was determined by (1), where the similarity measures in Table 1 or the one determined from the Bernoulli mixture model (10) were employed. Table 2 shows the averaged MAP scores with the median absolute deviation (MAD) over 10 repetitions with different random seeds. In our experiments, the recommendation based on the BMMs with the appropriate number of components outperformed the other methods. However, the BMMs with a high number of components showed low prediction accuracy. Below, we show the edge prediction based on the SM-to-BM algorithm in Section 5. The results are shown in Table 3. The number of components in the Bernoulli mixture model was set to C = 2 or C = 5. Given the similarity matrix S, the SM-to-BM algorithm yielded the parameter π c and r(i|c). Next, edges were predicted through the formula (10) using π c , r(i|c) and r(i) = ∑ c π c r(i|c). The averaged MAP scores of this procedure are reported in the column of "itr:0". We also examined the edge prediction by the BMMs with the parameter updated from the one obtained by the SM-to-BM algorithm, where the update formula is given by (7) and (8). The "itr:10" (resp. "itr:100") column shows the MAP scores of the edge prediction using 10 times (resp. 100 times) updated parameter. In addition, the "BerMix" shows the MAP score of the BMMs with the updated parameter from the randomly initialized parameter.
In our experiments, we found that the SM-to-BM algorithm applied to commonly used similarity measures improved the accuracy of the recommendation. The MAP score for the "itr:0" method achieved a higher accuracy than the original similarity-based methods. The updated parameter from "itr:0", however, did not improve the MAP score significantly. The results of "itr:10" and "itr:100" for similarity measures were almost the same when the model was the Bernoulli mixture model with C = 2. This is because the EM algorithm with 10 iterations achieved the stationary point of this model in our experiments. We confirmed that there was yet a gap between the likelihood of the parameter computed by the SM-to-BM algorithm and the maximum likelihood. However, the numerical results indicate that the SM-to-BM algorithm provides a good parameter for the recommendation in the sense of the MAP score.

Real-World Data
We show the results for real-world data. The data was provided from an online dating site. The set X (resp. Y) consists of n = 15, 925 males and m = 16, 659 females. The data were gathered from 3 January 2016 to 5 June 2017. We used 130,8126 messages from 3 January 2016 to 31 October 2016 as the training data. Test data consists of 177,450 messages from 1 November 2016 to 5 June 2017 [55]. The proportion of edges in the test set to all data set is approximately 0.12.
In the numerical experiments, half of the users were randomly sampled from each group, and the corresponding subgraph with the training edges were defined as the training graph. On the other hand, the graph with the same nodes in the training graph and the edges in the test edges were used as the test graph. Based on the training graph, the recommendation was provided and was evaluated on the test graph. The same procedure was repeated over 20 times and the averaged MAP scores for each similarity measure were reported in Table 4. In the table, the MAP score of the recommendation for X and Y were separately reported. So far, we have defined the similarity measure based on out-edges from each node of the directed bipartite graph, referred to as "Interest". On the other hand, the similarity measure defined by in-edges is referred to as "Attract". For the BMMs, "Attract" means that the model of each component is computed under the assumption that each in-edge is independently generated, i.e., the probability of (a ij ) i∈[n] is given by ∏ i α a ij ic (1 − α ic ) 1−a ij when the class of y j ∈ Y is c. In the real-world datasets, the SM-to-BM algorithm was not used, because the dataset was too large to compute the corresponding BMMs from similarity matrices.
As shown in the numerical results, the recommendation based on the BMMs outperformed the other methods. Some similarity measures such as the Common Neighbors or Adamic-Adar coefficient showed relatively good results. On the other hand, the Hub Promoted measure, that is not the CPSK, showed the lowest prediction accuracy. As well as the result for synthetic data, the BMMs with two to five components produced high prediction accuracy. Even for medium to large datasets, we found that the Bernoulli mixture model with about five components worked well. We expect that the validation technique is available to determine the appropriate size of components. Also, the similarity with "Interest" or "Attract" can be determined from the validation dataset. Table 4. MAP scores for real-world data. The bold face indicates the top two MAP scores in each column.

Similarity
Recomm. of Y to X Recomm. of X to Y MAP (MAD): MAP (MAD)

Discussions and Concluding Remarks
In this paper, we considered the relationship between the similarity-based recommendation methods and statistical models. We showed that the BMMs are closely related to the recommendation using completely positive similarity measures. More concretely, both the BMM-based method and completely positive similarity measures share exchangeable mixture models as the statistical model of the edge distribution. Once this was established, we proposed the recommendation methods using the EM algorithm to BMMs to improve similarity-based methods.
Moreover, we proposed the SM-to-BM algorithm that transforms a similarity matrix to parameters of the Bernoulli mixture model. The main purpose of the SM-to-BM algorithm is to find a statistical model corresponding to a given similarity matrix. This transformation provides a statistical interpretation for similarity-based methods. For example, the conditional probability r(c|i) is obtained from the SM-to-BM algorithm. This probability is useful to categorize nodes, i.e., users, into some classes according to the tendency of their preferences once a similarity matrix is obtained. The SM-to-BM algorithm is available as a supplementary tool for similarity-based methods.
We conducted numerical experiments using synthetic and real-world data. We numerically verified the efficiency of the BMM-based method in comparison to similarity-based methods. For the synthetic data, the BMM-based method was compared with the recommendation using the statistical model obtained by the SM-to-BM algorithm. We found that the BMM-based method and the SM-to-BM method provide a comparable accuracy for the reciprocal recommendation. Since the synthetic data is well approximated by the BMM with C = 2, the SM-to-BM algorithm was thought to reduce the noise in the similarity matrices. In the real-world data, the SM-to-BM algorithm was not examined, since our algorithm using the MM method was computationally demanding for a large dataset. On the other hand, we observed that the BMM-based EM algorithm was scalable for a large dataset. A future work includes the development of computationally efficient SM-to-BM algorithms.
It is straightforward to show that the stochastic block models (SBMs) [6] are also closely related to the recommendation with completely positive similarity measures. In our preliminary experiments, however, we found that the recommendation system based on the SBMs did not show a high prediction accuracy in comparison to other methods. We expect that detailed theoretical analysis of the relation between the similarity measure and statistical models is an interesting research topic that can be used to better understand the meaning of the commonly used similarity measures.