User Identity Linkage across Social Networks by Heterogeneous Graph Attention Network Modeling

Abstract: Today, social networks are becoming increasingly popular and indispensable, where users usually have multiple accounts. It is of considerable significance to conduct user identity linkage across social networks. We can comprehensively depict diversified characteristics of user behaviors, accurately model user profiles, conduct recommendations across social networks, and track cross social network user behaviors by user identity linkage. Existing works mainly focus on a specific type of user profile, user-generated content, and structural information. They have problems of weak data expression ability and ignored potential relationships, resulting in unsatisfactory performances of user identity linkage. Recently, graph neural networks have achieved excellent results in graph embedding, graph representation, and graph classification. As a graph has strong relationship expression ability, we propose a user identity linkage method based on a heterogeneous graph attention network mechanism (UIL-HGAN). Firstly, we represent user profiles, user-generated content, structural information, and their features in a heterogeneous graph. Secondly, we use multiple attention layers to aggregate user information. Finally, we use a multi-layer perceptron to predict user identity linkage. We conduct experiments on two real-world datasets: OSCHINA-Gitee and Facebook-Twitter. The results validate the effectiveness and advancement of UIL-HGAN by comparing different feature combinations and methods.


Introduction
With the rapid development of the Internet and information technology, social networks are becoming more and more indispensable and diverse. Because different social networks serve different functions, the purposes and behavior characteristics of their users also differ [1]. It is of considerable significance to conduct user identity linkage across social networks. By user identity linkage, we can comprehensively depict diversified characteristics of user behaviors, accurately model user profiles [2], conduct recommendations across social networks [3], and track user behaviors across social networks.
The existing methods of user identity linkage can be divided into three categories [4]: User profile-based, user-generated content-based, and structural information-based. User profiles mainly include a username, nickname, avatar, etc. Zafarani et al. [5] extracted username features that reflect patterns arising from human limitations, exogenous factors, and endogenous factors for user identity linkage. User-generated content includes time information, spatial information, and published content. [...] profiles in the latent user space. Li et al. [15] proposed a machine learning method to analyze the username pattern for user identity linkage.

User Generated Content-Based User Identity Linkage
User-generated content includes time information, spatial information, and published content. Goga et al. [16] proposed a supervised learning method, LU-Link, to extract geo-locations, timestamps, and writing styles from user-generated content for user identity linkage. Liu et al. [6] proposed Hydra to extract time information, spatial information, and others to analyze interest and writing style for the linkage of user identities. Li et al. [17] proposed U-UIM to calculate spatial similarity, time similarity, and content similarity. Riederer et al. [18] proposed a method, POIS, based on user trajectory for user identity linkage.

Structural Information-Based User Identity Linkage
Structural information can be expressed in two forms: Directed links and undirected links. Man et al. [7] proposed PALE, which uses a network embedding method to extract friend relationship information. Miao et al. [19] proposed a method, EUIA, for user identity linkage by comparing the embeddings of nodes in a low-dimensional space. Wang et al. [20] proposed a semi-supervised method, APAN, which extends the idea of DeepWalk [21] to embed nodes for user identity linkage. Zhou et al. [1] proposed a semi-supervised method, FRUI, to extract features from the neighborhood-based network. Li et al. [22] analyzed the similarities of k-hop neighbors and compared the effects of several friendship-based classifiers for user identity linkage.

Multiple Type-Based User Identity Linkage
Some methods use multiple kinds of features for user identity linkage at the same time. Park et al. [23] proposed a conditional random field-based method called JLA that uses user profiles and social relations for user identity linkage. Kong et al. [24] proposed a supervised learning method, MNA, that extracts style features from user-generated content and combines them with structural information. Nie et al. [25] proposed a method called DCIM to calculate the similarity of social relations and articles for the linkage of user identities. Wang et al. [8] proposed a method called LHNE that extracts friend relationships and user interests and represents them in the latent space for user identity linkage.

Distinction with Current Works
The existing methods for user identity linkage focus on extracting features from user profiles, user-generated content, and structural information. They mainly have the following problems: (1) Taken alone, user profiles, user-generated content, and structural information each have weak expression ability, resulting in unsatisfactory performances of user identity linkage.
(2) There are potential relationships between user profiles, user-generated content, and structural information. Ignoring these potential relationships leads to little improvement of user identity linkage performance. The distinction between this paper and the existing methods is that we focus on aggregating different types of features to solve the problems of weak data expression ability and ignored potential relationships.

Preliminary
In this section, we define the preliminary concepts used in this paper. A heterogeneous graph is a particular type of graph. Compared with a traditional graph, a heterogeneous graph contains many types of nodes and edges. In this paper, user profiles, user-generated content, structural information, and their features from different social networks are represented as nodes in a heterogeneous graph, and we build edges between these nodes and users. The node types include the user node, the user profile node, the user-generated content node, the structural information node, the category node, the feature node, and the structural embedding node. The category nodes are subdivisions of user profiles and user-generated content, such as the username in a profile. The feature nodes are unique and universal, and are extracted from user profiles and user-generated content. The structural embedding nodes are generated dynamically after the embedding of every user-to-user edge.
Definition 1 (Heterogeneous Social Network Graph). A heterogeneous social network graph is denoted as $G = \{V, E\}$, where $V = \{v_i \mid i = 1, \ldots, N\}$ is a node set and $E = \{e_{ij} \mid i, j = 1, \ldots, N\}$ is an edge set. The node set $V = \{V_U, V_P, V_G, V_S, V_C, V_F, V_E\}$ consists of several types of nodes: The user node $V_U$, the user profile node $V_P$, the user-generated content node $V_G$, the structural information node $V_S$, the category node $V_C$, the feature node $V_F$, and the structural embedding node $V_E$.
The edge pattern between the user node $v_i$ and the user node $v_j$ is $v_i$ [...]. The edge pattern between the user node $v_m$ and the feature node $v_i$ is $v_i$ [...].
An example of a heterogeneous social network graph is shown in Figure 1, which includes three user nodes, two user profile nodes, one user-generated content node, three structural information nodes, four category nodes, five feature nodes, and six structural embedding nodes. The user-to-user edges are of two types: the edge between user A and user B is a "friend" edge, and the edge between user A and user C is a "blacklist" edge. For user profiles, the "gender" category nodes linked with user A and user B connect to the "male" feature node, and the "name" category node linked with user A connects to three feature nodes. For user-generated content, the "interest" category node linked with user C connects to a feature node. For structural information, a structural embedding node is generated dynamically after the embedding of each user-to-user edge. The number of generated nodes depends on the types of the user-to-user edges.
Without loss of generality, we focus on user identity linkage between two social networks, denoted as $G_s$ and $G_t$, respectively. As shown in Figure 2, for users in the two social networks, we intend to predict the linkage pairs whose two accounts belong to the same identity.
Definition 2 (User Identity Linkage). Given two social networks $G_s$ and $G_t$, user identity linkage aims to predict whether a pair of entities $v_i \in G_s$ and $v_j \in G_t$ belong to the same user identity, i.e.,
$$y_{ij} = \begin{cases} 1, & \text{if } v_i \text{ and } v_j \text{ belong to the same user identity}, \\ 0, & \text{otherwise}. \end{cases} \tag{1}$$
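A heterogeneous social network graph in the sense of Definition 1 can be sketched as a typed adjacency structure. The snippet below is only an illustration under stated assumptions: the class `HeteroGraph`, the short type labels, and the node identifiers such as `A/profile` are hypothetical names chosen here, and the graph built is a small fragment of the Figure 1 example rather than the full figure.

```python
from collections import defaultdict

# The seven node types named in Definition 1 (the short labels are illustrative).
NODE_TYPES = {"user", "profile", "content", "structure",
              "category", "feature", "struct_emb"}


class HeteroGraph:
    """Undirected typed graph: every node has a type, edges may carry a label."""

    def __init__(self):
        self.node_type = {}                # node id -> node type
        self.neighbors = defaultdict(set)  # node id -> ids of adjacent nodes
        self.edge_label = {}               # frozenset({u, v}) -> label such as "friend"

    def add_node(self, node, ntype):
        if ntype not in NODE_TYPES:
            raise ValueError(f"unknown node type: {ntype}")
        self.node_type[node] = ntype

    def add_edge(self, u, v, label=None):
        if u not in self.node_type or v not in self.node_type:
            raise KeyError("both endpoints must be added before the edge")
        self.neighbors[u].add(v)
        self.neighbors[v].add(u)
        if label is not None:
            self.edge_label[frozenset((u, v))] = label

    def typed_neighbors(self, node, ntype):
        """Return N(node, ntype): the neighbors of `node` whose type is `ntype`."""
        return sorted(n for n in self.neighbors[node] if self.node_type[n] == ntype)


# Fragment of the Figure 1 example: users A and B are friends and share the "male" feature.
g = HeteroGraph()
for node, ntype in [("A", "user"), ("B", "user"),
                    ("A/profile", "profile"), ("B/profile", "profile"),
                    ("A/gender", "category"), ("B/gender", "category"),
                    ("male", "feature")]:
    g.add_node(node, ntype)
g.add_edge("A", "B", label="friend")
for user in ("A", "B"):
    g.add_edge(user, f"{user}/profile")
    g.add_edge(f"{user}/profile", f"{user}/gender")
    g.add_edge(f"{user}/gender", "male")

# Category nodes that share the "male" feature node.
print(g.typed_neighbors("male", "category"))
```

Keeping the type lookup separate from the adjacency sets makes the typed neighborhood query cheap to express, which is the operation the attention layers rely on.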

Methods
In this section, we introduce UIL-HGAN, the proposed user identity linkage method based on a heterogeneous graph attention network mechanism. The overall process is shown in Figure 3. Firstly, we embed the nodes and the different types of user-to-user edges into the latent space and generate the structural information nodes, shown as (1) in Figures 1 and 3. Secondly, we introduce the structure of the attention layer. We use multiple attention layers to represent the user nodes by aggregating the feature nodes, the category nodes, the structural embedding nodes, the user profile nodes, the user-generated content nodes, and the structural information nodes, shown as (2), (3), and (4) in Figures 1 and 3. Finally, we use a multi-layer perceptron to train the model and predict user identity linkage.

Embedding
The embedding process can be divided into two categories: The embedding of every user-to-user edge and the embedding of every node in the heterogeneous social network graph.

Embedding of Every User-to-User Edge
In the heterogeneous social network graph, the edges between users have several types. For example, in Figure 1, the edge between user A and user B is the "friend" edge, and the edge between user A and user C is the "blacklist" edge. We use the first-order proximity [26] to embed every user-to-user edge. Given two users $v_i, v_j \in V_U$ in the heterogeneous social network graph, the probability that an edge of type $t$ exists between $v_i$ and $v_j$ can be calculated as follows:
$$p_t(v_i, v_j) = \sigma\left(z_i^{\top} z_j\right), \tag{2}$$
where $z_i, z_j \in \mathbb{R}^{d'}$ are the $d'$-dimensional vectors of $v_i, v_j$ in the latent space, and $\sigma$ is the sigmoid function.
To obtain the vectors $z_i, z_j$ in the latent space, combining (2) with the log-likelihood function, we need to minimize the following:
$$O_t = -\sum_{(v_i, v_j) \in E_t} \log p_t(v_i, v_j). \tag{3}$$
To avoid trivial solutions, the loss function of the embedding can be written as follows:
$$L_t = -\sum_{(v_i, v_j) \in E_t} \left[ \log \sigma\left(z_i^{\top} z_j\right) + \sum_{k=1}^{K} \log \sigma\left(-z_i^{\top} z_{n_k}\right) \right], \tag{4}$$
where $t$ is the type of edge, $E_t$ is the set of user-to-user edges of type $t$, $z_{n_k}$ is the vector of the node reached by the $k$-th negative sampling edge, and $K$ is the total number of negative sampling edges. For all edge types $t \in T$, the total loss function is written as:
$$L_E = \sum_{t \in T} L_t. \tag{5}$$
Then, to facilitate the subsequent feature aggregation, we transform the embedding vector into $d$ dimensions in the latent space. The transformation is as follows:
$$h = \tanh(W z + b), \tag{6}$$
where $W \in \mathbb{R}^{d \times d'}$ is the transformation matrix, $b \in \mathbb{R}^{d}$ is the bias vector, $h \in \mathbb{R}^{d}$ is the transformed vector used as input to the subsequent attention layers, and tanh is the hyperbolic tangent function. Finally, according to the type of edges, the structural embedding nodes are created in the heterogeneous social network graph, and their embedding vector in the latent space is $h$.
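The edge embedding step can be illustrated with a few lines of plain Python. This is a minimal sketch, assuming the usual first-order proximity form in which the edge probability is the sigmoid of the inner product of the two endpoint vectors, and assuming that negative samples are supplied by the caller. The function names, the dictionary layout, and the toy vectors are hypothetical and are not taken from the authors' implementation.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def edge_probability(z_i, z_j):
    """Probability that an edge of the given type links the two users."""
    return sigmoid(dot(z_i, z_j))


def edge_loss(z, edges, negatives):
    """Negative-sampling loss for one edge type.

    z         : dict mapping a node id to its latent vector
    edges     : list of observed (i, j) user-to-user edges of this type
    negatives : dict mapping each observed edge to a list of sampled non-neighbors
    """
    loss = 0.0
    for i, j in edges:
        loss -= math.log(sigmoid(dot(z[i], z[j])))       # pull linked users together
        for n in negatives[(i, j)]:
            loss -= math.log(sigmoid(-dot(z[i], z[n])))  # push sampled users apart
    return loss


def transform(W, b, z):
    """h = tanh(W z + b): map an edge embedding to the attention-layer dimension."""
    return [math.tanh(dot(row, z) + bias) for row, bias in zip(W, b)]


# Toy example: a and b are linked, c is a negative sample for the edge (a, b).
good = {"a": [1.0, 0.0], "b": [1.0, 0.0], "c": [-1.0, 0.0]}
bad = {"a": [1.0, 0.0], "b": [-1.0, 0.0], "c": [1.0, 0.0]}
edges = [("a", "b")]
negatives = {("a", "b"): ["c"]}
print(edge_loss(good, edges, negatives) < edge_loss(bad, edges, negatives))
```

The comparison at the end shows the intended behavior: vectors that place linked users close together and sampled non-neighbors far apart receive the lower loss.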

Embedding of Every Node
In a heterogeneous social network graph, given any node $v_i \in V$, we randomly initialize the node embedding vector $h_i \in \mathbb{R}^{d}$ in the latent space with the Glorot uniform initializer. In the subsequent user identity linkage task, the embedding vectors of the nodes are updated in every training epoch.
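For illustration, Glorot uniform initialization draws each entry from a uniform distribution on [-limit, limit] with limit = sqrt(6 / (fan_in + fan_out)). The sketch below assumes that the node embeddings are stored as one table with one row per node, and that the two table dimensions play the roles of fan_in and fan_out, as common deep learning libraries do for two-dimensional weights. The function name and the table sizes are hypothetical.

```python
import math
import random


def glorot_uniform(n_rows, n_cols, rng):
    """Return an n_rows x n_cols table with entries drawn from U(-limit, limit)."""
    limit = math.sqrt(6.0 / (n_rows + n_cols))
    return [[rng.uniform(-limit, limit) for _ in range(n_cols)]
            for _ in range(n_rows)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
embeddings = glorot_uniform(n_rows=5, n_cols=4, rng=rng)  # 5 nodes, d = 4
print(len(embeddings), len(embeddings[0]))
```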

Multiple Attention Layers
In this subsection, we first introduce the structure of the attention layer. As shown in Figure 3, we use multiple attention layers, which include the category-level attention layers, the expression-level attention layers, and the user-level attention layers, to represent the user nodes by aggregating the feature nodes, the category nodes, the structural embedding nodes, the user profile nodes, the user-generated content nodes, and the structural information nodes in the latent space.

Definition of the Attention Layer
To aggregate multiple types of nodes in the latent space, we need to obtain the neighbor information between nodes. To achieve this goal, we introduce the attention layer, which operates on a graph. In the attention layer, we aggregate the 1-hop neighbor nodes by the attention mechanism. Consider a node $v_i$ in the heterogeneous social network graph and a neighbor node $v_j \in N(v_i, t)$, where $N(v_i, t)$ is the set of neighbor nodes of $v_i$ whose type is $t$. Let $h_i, h_j$ be the embedding vectors or the attention layer aggregated vectors of $v_i, v_j$ in the latent space. The attention [27] is calculated as follows:
$$e_{ij} = a^{\top} [h_i \,\|\, h_j],$$
where $a \in \mathbb{R}^{2d}$ is the learnable weight vector, and $\|$ is the concatenate function. The attention coefficient $e_{ij}$ indicates the importance of node $v_j$ to node $v_i$. To make the different attention coefficients easier to compare, we calculate the attention coefficients of all neighbor nodes whose type is $t$ and normalize them by a softmax function as follows:
$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{v_k \in N(v_i, t)} \exp(e_{ik})}.$$
Finally, we aggregate the embedding vectors, or the attention layer aggregated vectors, of all neighbor nodes of $v_i$ whose type is $t$:
$$h' = \tanh\left( \sum_{v_j \in N(v_i, t)} \alpha_{ij} h_j \right),$$
where tanh is the hyperbolic tangent function, and $h'$ is the current attention layer aggregated vector of node $v_i$, which is the input of the next attention layer or of the multi-layer perceptron. The calculation of $h'$ does not update the embedding vector $h_i$ of node $v_i$ in the latent space. Therefore, an attention layer maps $h_i$ and the vectors $h_j$ of the type-$t$ neighbors of $v_i$ to the aggregated vector $h'$.
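One attention layer over the neighbors of a single type can be sketched as follows. The sketch assumes a plain linear score on the concatenation of the two vectors, a softmax over the neighbors, and a tanh on the weighted sum. The original graph attention network [27] additionally applies a LeakyReLU to the score, and whether this method does so is not stated here, so the sketch omits it. The function name `attention_layer` and the toy vectors are hypothetical.

```python
import math


def attention_layer(h_i, neighbor_vectors, a):
    """Aggregate the vectors of the type-t neighbors of one node.

    h_i              : vector of the center node (length d); used only for scoring
    neighbor_vectors : list of neighbor vectors, each of length d
    a                : learnable weight vector of length 2d
    Returns (aggregated vector of length d, normalized attention weights).
    """
    # Unnormalized score: weight vector applied to the concatenation [h_i || h_j].
    scores = [sum(w * x for w, x in zip(a, h_i + h_j)) for h_j in neighbor_vectors]
    # Softmax over the neighbors (shifted by the maximum for numerical stability).
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    # Weighted sum of the neighbor vectors followed by tanh.
    aggregated = [
        math.tanh(sum(alpha * h_j[k] for alpha, h_j in zip(alphas, neighbor_vectors)))
        for k in range(len(h_i))
    ]
    return aggregated, alphas


center = [0.0, 0.0]
neighbors = [[1.0, 0.0], [0.0, 1.0]]
uniform_out, uniform_alphas = attention_layer(center, neighbors, [0.0, 0.0, 0.0, 0.0])
peaked_out, peaked_alphas = attention_layer(center, neighbors, [0.0, 0.0, 10.0, 0.0])
print(uniform_alphas, peaked_alphas)
```

With an all-zero weight vector every neighbor receives the same weight, while a large weight on one coordinate of the neighbor vector concentrates the attention on the neighbor that is large in that coordinate. Note that the center vector only enters the score and is not part of the weighted sum.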

Category-Level Attention Layer
The category-level attention layer aggregates the embedding vectors of the feature nodes into the category node in the latent space, as shown in (2) in Figure 1. Consider a category node $v_i \in V_C$ that connects to the feature nodes $v_j \in N(v_i, V_F)$. Let $h_i, h_j$ denote the embedding vectors of nodes $v_i, v_j$ in the latent space. The category-level attention layer aggregated vector $h'$ of node $v_i$ is obtained by applying the attention layer to $h_i$ and the vectors $h_j$ of all $v_j \in N(v_i, V_F)$, and $h'$ is the input of the next expression-level attention layer.

Expression-Level Attention Layer
The expression-level attention layer aggregates the vectors of the category nodes or the structural embedding nodes into the user profile node, the user-generated content node, or the structural information node in the latent space, as shown in (3) in Figure 1. Consider a node $v_i \in V_P \cup V_G \cup V_S$ that connects to the category nodes $v_j \in N(v_i, V_C)$ or to the structural embedding nodes $v_j \in N(v_i, V_E)$. Let $h_i$ denote the embedding vector of node $v_i$ in the latent space. In the expression-level attention layer, the vector $h_j$ differs from that of the category-level attention layer. If $v_j$ is a category node, let $h_j$ denote the attention layer aggregated vector of $v_j$. If $v_j$ is a structural embedding node, let $h_j$ denote the embedding vector of $v_j$ from (6). The expression-level attention layer aggregated vector $h'$ of node $v_i$ is obtained by applying the attention layer to $h_i$ and these vectors $h_j$, and $h'$ is the input of the next user-level attention layer.

User-Level Attention Layer
The user-level attention layer aggregates the vectors of the user profile nodes, the user-generated content nodes, and the structural information nodes into the user node in the latent space, as shown in (4) in Figure 1. Consider a user node $v_i \in V_U$ that connects to the user profile nodes, the user-generated content nodes, or the structural information nodes $v_j \in N(v_i, V_P \cup V_G \cup V_S)$. Let $h_i$ denote the embedding vector of node $v_i$ in the latent space, and let $h_j$ denote the attention layer aggregated vector of node $v_j$. The user-level attention layer aggregated vector $h'$ of node $v_i$ is obtained by applying the attention layer to $h_i$ and these vectors $h_j$, and $h'$ is the input of the multi-layer perceptron that predicts user identity linkage.
In summary, we use multiple attention layers, which include the category-level attention layers, the expression-level attention layers, and the user-level attention layers, to represent the user nodes by aggregating the feature nodes, the category nodes, the structural embedding nodes, the user profile nodes, the user-generated content nodes, and the structural information nodes in the latent space. In the heterogeneous social network graph, only the embedding vectors of the feature nodes and the structural embedding nodes are used as input to the attention layers. The embedding vectors of the other node types are used only to calculate the attention coefficients.
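The bottom-up order of the three levels can be made concrete with a toy example. The sketch below is an illustration under assumptions, not the authors' implementation: it reuses a plain linear attention score with softmax and tanh aggregation, gives each level its own weight vector, and uses made-up two-dimensional vectors. All names (`attend`, `emb`, `a_category`, and so on) are hypothetical.

```python
import math


def attend(h_i, inputs, a):
    """Attention-weighted tanh aggregation of `inputs`, scored against center h_i."""
    scores = [sum(w * x for w, x in zip(a, h_i + h_j)) for h_j in inputs]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    norm = sum(weights)
    return [math.tanh(sum(w / norm * h_j[k] for w, h_j in zip(weights, inputs)))
            for k in range(len(h_i))]


# Made-up embedding vectors (d = 2) for one user of the Figure 1 kind.
emb = {
    "male": [0.5, -0.5], "alice": [0.2, 0.1], "al1ce": [0.3, 0.0],  # feature nodes
    "gender": [0.0, 0.1], "name": [0.1, 0.1],                       # category nodes
    "friend_edge": [0.4, 0.4],                                      # structural embedding node
    "profile": [0.0, 0.0], "structure": [0.1, -0.1],                # expression-level nodes
    "user": [0.2, 0.2],                                             # user node
}
# One hypothetical weight vector per level (length 2d).
a_category = [0.1, -0.2, 0.3, 0.4]
a_expression = [0.0, 0.5, -0.1, 0.2]
a_user = [0.3, 0.3, 0.1, -0.4]

# Category level: feature nodes -> category nodes.
gender_vec = attend(emb["gender"], [emb["male"]], a_category)
name_vec = attend(emb["name"], [emb["alice"], emb["al1ce"]], a_category)
# Expression level: category or structural embedding nodes -> profile / structure nodes.
profile_vec = attend(emb["profile"], [gender_vec, name_vec], a_expression)
structure_vec = attend(emb["structure"], [emb["friend_edge"]], a_expression)
# User level: profile / structure nodes -> user node.
user_vec = attend(emb["user"], [profile_vec, structure_vec], a_user)
print(user_vec)
```

Only the feature vectors and the structural embedding vector flow upward as values. The embeddings of the category, profile, structure, and user nodes appear only as the center vector in the score, which matches the summary above.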

User Identity Linkage
In this subsection, we use a multi-layer perceptron to predict user identity linkage. Given two social networks $G_s$, $G_t$ and two user nodes $v_i \in G_s$, $v_j \in G_t$, let $h_i, h_j$ denote the user-level attention layer aggregated vectors of $v_i, v_j$, respectively. First, to improve the training speed and avoid vanishing gradients, batch normalization [28] is applied to $h_i$ and $h_j$ separately:
$$h_i \leftarrow \mathrm{BN}(h_i), \qquad h_j \leftarrow \mathrm{BN}(h_j),$$
where BN is the batch normalization layer. The input of the multi-layer perceptron is $x = [h_i \,\|\, h_j]$, where $\|$ is the concatenate function. The $l$-th layer of the neural network is defined as:
$$y_l(x) = \mathrm{ReLU}\left(W_l \, y_{l-1}(x) + b_l\right),$$
where $W_l, b_l$ are the $l$-th layer parameters, $\mathrm{ReLU}(a) = \max(0, a)$ is the linear rectification function, and $y_0(x) = x$. Finally, the neural network outputs the scalar $\hat{y} \in [0, 1]$ as the predicted value of user identity linkage:
$$\hat{y} = \sigma\left(W^{\top} y_l(x) + b\right),$$
where $\sigma$ is the sigmoid function, $W \in \mathbb{R}^{d \times 1}$ is the weight matrix, $d$ is the dimension of $y_l(x)$, and $b \in \mathbb{R}^{1}$ is the bias vector. Then the loss of the user identity linkage is:
$$L_U = -\sum_{(v_i, v_j)} \left[ y_{ij} \log \hat{y}_{ij} + (1 - y_{ij}) \log\left(1 - \hat{y}_{ij}\right) \right],$$
where $y_{ij}$ is the ground truth of the user identity linkage pair $(v_i, v_j)$ and $\hat{y}_{ij}$ is its predicted value.
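The prediction step can be sketched as a small forward pass. The following is a minimal illustration, assuming ReLU hidden layers, a sigmoid output, and a binary cross-entropy loss, which is the usual companion of a sigmoid output but is an assumption here. Batch normalization is left out to keep the sketch short, and all function names, weights, and inputs are hypothetical.

```python
import math


def affine(W, b, x):
    """Compute W x + b for a weight matrix given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bias for row, bias in zip(W, b)]


def relu(v):
    return [max(0.0, x) for x in v]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict_linkage(h_i, h_j, hidden_layers, w_out, b_out):
    """Forward pass: concatenate the two user vectors, apply ReLU layers, then sigmoid."""
    y = h_i + h_j  # list concatenation plays the role of [h_i || h_j]
    for W, b in hidden_layers:
        y = relu(affine(W, b, y))
    return sigmoid(sum(w * v for w, v in zip(w_out, y)) + b_out)


def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    y_pred = min(max(y_pred, eps), 1.0 - eps)  # guard against log(0)
    return -(y_true * math.log(y_pred) + (1 - y_true) * math.log(1.0 - y_pred))


# One hidden unit that sums its four inputs, followed by an identity output weight.
hidden = [([[1.0, 1.0, 1.0, 1.0]], [0.0])]
score = predict_linkage([1.0, 0.0], [0.0, 1.0], hidden, w_out=[1.0], b_out=0.0)
print(score, binary_cross_entropy(1, score))
```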

Overview of UIL-HGAN
In UIL-HGAN, we first embed the nodes and the different types of user-to-user edges into the latent space and generate the structural information nodes. Secondly, we use multiple attention layers to represent the user nodes by aggregating the feature nodes, the category nodes, the structural embedding nodes, the user profile nodes, the user-generated content nodes, and the structural information nodes. Finally, we use a multi-layer perceptron to train the model and predict user identity linkage. The total loss function is as follows:
$$L = L_E + L_U,$$
where $L_E$ is the total embedding loss of the user-to-user edges and $L_U$ is the loss of the user identity linkage. The model training process is shown in Algorithm 1, and the model parameters are updated with the Adam optimizer [29] to minimize the loss function. The model predicting process is shown in Algorithm 2.

Algorithm 1 Model training
Input: Two heterogeneous social network graphs $G_s$, $G_t$; user identity linkage pair set $S = \{(v_i, v_j, y_{ij}) \mid v_i \in G_s, v_j \in G_t\}$
1: Perform the embedding of every user-to-user edge in $G_s \cup G_t$;
2: Perform the embedding of every node in $G_s \cup G_t$;
3: for $(v_i, v_j, y_{ij})$ in $S$ do
4:   Calculate the aggregated vector $h_i$ of $v_i$ by multiple attention layers in $G_s$;
5:   Calculate the aggregated vector $h_j$ of $v_j$ by multiple attention layers in $G_t$;
6:   Calculate the predicted value $\hat{y}_{ij}$ of user identity linkage by the multi-layer perceptron;
7:   Calculate the total loss $L$;

Experiments
In this section, we conduct experiments on two real-world datasets to test and validate the effectiveness and advancement of UIL-HGAN. The first dataset consists of two kinds of social networks: OSCHINA and Gitee, and the second dataset consists of Facebook and Twitter.

OSCHINA-Gitee
OSCHINA is the largest open-source technology community in China, with 4 million members, and provides a platform for developers to discover and exchange technologies. Gitee is a code hosting platform with 5 million members that provides free private repository hosting. We used the breadth-first search algorithm to obtain 161,428 users from OSCHINA and 126,308 users from Gitee. The user information pages of OSCHINA provide Gitee links, which can be used as the ground truth for user identity linkage. Finally, we obtained a total of 5649 active users. We used this dataset to validate the effectiveness of UIL-HGAN.

Facebook-Twitter
Many methods evaluate the performance of user identity linkage on Facebook and Twitter. Facebook is a social platform where users can share pictures, links, and videos. Twitter is a microblog-like social network. We collected 102,893 users from Facebook and 80,378 users from Twitter. About.me is a third-party user information integration platform where users can add Facebook and Twitter links to their home page, which can be used as the ground truth for user identity linkage. Finally, we obtained a total of 8251 active users. We used this dataset to validate the advancement of UIL-HGAN.

Metrics for Comparison
In this paper, we use the popular evaluation metric hit-precision [14] to evaluate user identity linkage performance by comparing the top-$k$ candidates of the predicted values. The hit-precision is calculated as follows:
$$h(x) = \frac{k - (hit(x) - 1)}{k},$$
where $hit(x)$ is the position of the correct value in the top-$k$ candidates. If the correct value is not in the top-$k$ candidates, $hit(x) = k + 1$. For example, let $k = 5$ and let the top-$k$ candidate set be $\hat{Y} = \{\hat{y}_1, \hat{y}_2, \hat{y}_3, \hat{y}_4, \hat{y}_5\}$. If the correct value is $\hat{y} = \hat{y}_2$, then $hit(x) = 2$ and $h(x) = \frac{k-1}{k} = 0.8$. If the correct value $\hat{y} \notin \hat{Y}$, then $hit(x) = 6$ and $h(x) = 0$. For $N$ test users, we average the hit-precisions by
$$\frac{1}{N} \sum_{i=1}^{N} h(x_i).$$
In the experiments, the hit-precision results are reported as hit@k.
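The metric is easy to check against the worked example with k = 5. The sketch below assumes that the candidates of each test user are given as a list ranked from best to worst, and the function names and user identifiers are hypothetical.

```python
def hit_precision(ranked_candidates, correct, k):
    """Hit-precision of one test user: (k - (hit - 1)) / k, with hit = k + 1 on a miss."""
    top_k = ranked_candidates[:k]
    hit = top_k.index(correct) + 1 if correct in top_k else k + 1
    return (k - (hit - 1)) / k


def hit_at_k(test_cases, k):
    """Average hit-precision over (ranked_candidates, correct) pairs."""
    return sum(hit_precision(ranked, correct, k) for ranked, correct in test_cases) / len(test_cases)


ranking = ["u1", "u2", "u3", "u4", "u5"]
print(hit_precision(ranking, "u2", k=5))  # correct value in second position
print(hit_precision(ranking, "u9", k=5))  # correct value not among the top 5
print(hit_at_k([(ranking, "u2"), (ranking, "u9")], k=5))
```

The first two calls reproduce the values 0.8 and 0 from the example, and the last call averages them over the two test users.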

Validation of Effectiveness
In this subsection, we use the OSCHINA-Gitee dataset to validate the effectiveness of UIL-HGAN. In UIL-HGAN, the multiple attention layers consist of the category-level attention layer, the expression-level attention layer, and the user-level attention layer. We used (Username), (URL-ID), (Username, URL-ID), and (Username, URL-ID, Follow) to carry out four groups of comparative experiments that validate the effectiveness of each layer in the multiple attention layers. Username and URL-ID are user profiles, and Follow is structural information. In the experiment, we set the dimension of features in the latent space to 32, the ratio of positive to negative samples to 0.2, and the training ratio to 20%, 40%, 60%, and 80%. The multi-layer perceptron has two hidden layers, each of dimension 32. The results of the experiment are shown in Table 1 and Figure 5.
In Table 1 and Figure 5, the results of (Username) and (URL-ID) validate the effectiveness of the category-level attention layer, where the features of usernames and URL-IDs are aggregated. With an 80% training ratio, the hit@1 precisions of (Username) and (URL-ID) are 0.5620 and 0.5900, respectively, showing that the URL-ID feature performs better than the username feature in user identity linkage.
Username and URL-ID are aggregated in the expression-level attention layer. Comparing the results of (Username), (URL-ID), and (Username, URL-ID) in Figure 5, the curve of (Username, URL-ID) lies above the curves of (Username) and (URL-ID). This shows that (Username, URL-ID) achieves better user identity linkage performance and validates the effectiveness of the expression-level attention layer.
User profiles and structural information are aggregated in the user-level attention layer. Comparing the results of (Username, URL-ID) and (Username, URL-ID, Follow) in Table 1, the hit@k precisions of (Username, URL-ID, Follow) are larger than those of (Username, URL-ID). This shows that (Username, URL-ID, Follow) achieves better user identity linkage performance and validates the effectiveness of the user-level attention layer.
In summary, the results on the OSCHINA-Gitee dataset validate the effectiveness of each layer in the multiple attention layers and thus the effectiveness of UIL-HGAN. To a certain extent, UIL-HGAN solves the problems of weak data expression ability and ignored potential relationships.

Validation of Advancement
In this subsection, we use the Facebook-Twitter dataset to validate the advancement of UIL-HGAN. To facilitate the comparison with the existing user identity linkage methods, we use usernames and friend relationships in the dataset. In the experiment, we compared the following methods:
• SVR [15]: A supervised learning method that extracts the longest common substring, the longest common subsequence, the Jensen-Shannon distance, the edit distance, and others as username features for user identity linkage. We use support vector regression (SVR) to calculate the hit-precision;
• PALE [7]: A supervised user identity linkage method based on structural information that uses a network embedding method to represent nodes in a low-dimensional space;
• UIL-HGAN(P): A UIL-HGAN method using only usernames;
• UIL-HGAN(S): A UIL-HGAN method using only friend relationships.
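Three of the username features named for the SVR baseline are standard string measures and can be computed with short dynamic programs. The sketch below is only an illustration: the function names, the example usernames, and the normalization by the longer username length in `username_features` are assumptions made here, not details taken from [15].

```python
def longest_common_substring(s, t):
    """Length of the longest contiguous run of characters shared by s and t."""
    best = 0
    prev = [0] * (len(t) + 1)
    for ch_s in s:
        cur = [0] * (len(t) + 1)
        for j, ch_t in enumerate(t, start=1):
            if ch_s == ch_t:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def longest_common_subsequence(s, t):
    """Length of the longest, not necessarily contiguous, common subsequence."""
    prev = [0] * (len(t) + 1)
    for ch_s in s:
        cur = [0] * (len(t) + 1)
        for j, ch_t in enumerate(t, start=1):
            cur[j] = prev[j - 1] + 1 if ch_s == ch_t else max(prev[j], cur[j - 1])
        prev = cur
    return prev[-1]


def edit_distance(s, t):
    """Levenshtein distance: minimum number of insertions, deletions, substitutions."""
    prev = list(range(len(t) + 1))
    for i, ch_s in enumerate(s, start=1):
        cur = [i] + [0] * len(t)
        for j, ch_t in enumerate(t, start=1):
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ch_s != ch_t))
        prev = cur
    return prev[-1]


def username_features(a, b):
    """Feature vector for a username pair, scaled to [0, 1] by the longer length."""
    longest = max(len(a), len(b), 1)
    return [longest_common_substring(a, b) / longest,
            longest_common_subsequence(a, b) / longest,
            1.0 - edit_distance(a, b) / longest]


print(username_features("alice_w", "alice.wang"))
```

A feature vector of this kind would then be the input of a regressor such as SVR, with the linkage label as the target.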
For each user identity linkage method, we set the training ratio to 20%, 40%, 60%, and 80%, and calculate hit@1, hit@50, hit@100, hit@500, and hit@1000 separately. The experimental results are shown in Table 2. Within the same method, the hit-precision fluctuates across training ratios. As the hit@1 precision directly reflects the performance of user identity linkage, the hit@1 precisions of the different methods under different training ratios are shown in Figure 6 for an intuitive comparison.
UIL-HGAN(S) and PALE use only friend relationships for user identity linkage. When the training ratio is 20%, the hit@1 precision of PALE is 0.0620, while the hit@1 precision of UIL-HGAN(S) is 0.0740. When the training ratio is 80%, the hit@1 precision of PALE is 0.1370, while the hit@1 precision of UIL-HGAN(S) reaches 0.1500. This shows that, for friend relationships, our proposed method UIL-HGAN(S) performs better than PALE.
Both UIL-HGAN(P) and SVR use only usernames for user identity linkage. In Figure 6, regardless of the training ratio, the hit@1 precision curve of UIL-HGAN(P) lies above that of SVR, and the gap is more evident than the gap between UIL-HGAN(S) and PALE. This shows that, for usernames, our proposed method UIL-HGAN(P) performs better than SVR.
The comparisons above demonstrate the advancement of our proposed method for each feature type separately. UIL-HGAN uses both usernames and friend relationships in the experiments. In Figure 6, UIL-HGAN performs better than UIL-HGAN(P) and UIL-HGAN(S), which validates the advancement of our proposed method.
In summary, the results on the Facebook-Twitter dataset validate the advancement of UIL-HGAN. In the experiment, we also found that different features have different convergence rates. Aggregating multiple types of features for user identity linkage usually requires a longer training time to achieve better performance.

Conclusions
To solve problems of weak data expression ability and ignored potential relationships, we propose a novel method to represent user profiles, user-generated content, structural information, and their features in a heterogeneous graph. We propose a novel user identity linkage method based on a heterogeneous graph attention network mechanism called UIL-HGAN. We conduct experiments on two real-world datasets to test and validate the effectiveness and advancement of UIL-HGAN.
UIL-HGAN has two limitations. First, all inputs to an attention layer must have the same dimension, which to some extent makes it inconvenient to aggregate different features. Second, we focus on the aggregation of different types of features and use only a simple method to extract and create feature nodes in the experiment. In the future, more sophisticated features can be extracted as feature nodes to improve user identity linkage performance.