A Novel Weighted Meta Graph Method for Classification in Heterogeneous Information Networks

Abstract: There has been increasing interest in the analysis and mining of Heterogeneous Information Networks (HINs) and the classification of their components in recent years. However, there are multiple challenges associated with distinguishing different types of objects in HINs in real-world applications. In this paper, a novel framework is proposed for the weighted Meta graph-based Classification of Heterogeneous Information Networks (MCHIN) to address these challenges. The proposed framework has several appealing properties. In contrast to other proposed approaches, MCHIN can compute the weights of different meta graphs and mine the latent structural features of different nodes by using these weighted meta graphs. Moreover, MCHIN significantly enlarges the training sets by introducing the concept of Extension Meta Graphs in HINs. The extension meta graphs are used to augment the semantic relationships among the source objects. Finally, based on the ranking distribution of objects, MCHIN groups the objects into pre-specified classes. We verify the performance of MCHIN on three real-world datasets. As shown and discussed in the results section, the proposed framework outperforms the baseline algorithms.


Introduction
In the real world, there exist many kinds of entities which, together with their inter-relationships, can be represented as information networks [1], such as Bibliographic Information Networks [2], Wikipedia [3], and Facebook. Existing studies [4][5][6][7][8][9] mainly focus on representing and analyzing such systems using homogeneous information networks, which are composed of a single type of node and edge. For example, Figure 1a illustrates a co-author homogeneous network consisting of "authors" as nodes and "publishing" relationships as links. However, real-world systems are usually complex, mixed-type, and heterogeneous, and can be represented more realistically by heterogeneous information networks. For example, social networks contain different kinds of objects, such as accounts, blogs, and friends, and different types of links, such as publishing, forwarding, and commenting, between different pairs of objects. As an example, Figure 1b shows a bibliographic information network composed of multi-typed objects. There are authors (A), conferences (C), and papers (P) as nodes, and different relationships such as "publishing" and "writing" between different pairs of nodes.
One of the widely studied problems in Heterogeneous Information Networks (HINs) is classification, usually treated as a semi-supervised learning task in which the labels of objects are predicted based on the structural properties of, or the relationships between, nodes. Classification can be applied in different HIN applications, such as similarity search [1] and recommendation systems [10,11].
One of the most popular concepts used to study HINs is the meta path. Meta paths preserve the relationship information between different types of entities in HINs better than other broadly used concepts such as random walks. However, due to inherent limitations of meta paths, such as their length [1], they can only carry limited semantics. Another problem is that typical meta path generation approaches can produce many short paths or even isolated nodes, which makes processing and learning even less efficient [12]. To tackle the weaknesses of meta paths, reference [13] proposes the concept of meta graphs, which combine different meta paths and are shown to be significantly more useful in various tasks. Different meta graphs contribute differently to specific classification problems. However, to the best of our knowledge, none of the existing methods distinguishes the different semantics carried by different meta graphs. Here, we compute the weights of different meta graphs using a priori knowledge.
On the one hand, using labeled or annotated data can remarkably improve classification performance; on the other hand, acquiring labeled data or manually annotating the available data is difficult and expensive. In this paper, we extend the labeled objects based on extension meta graphs and iteratively enlarge the training dataset. Finally, we use MCHIN to calculate the relatedness of multi-typed objects and group the objects into pre-specified classes by using the ranking distribution of objects. The contributions of this study are summarized as follows:
• We identify more objects for the training sets by the Extension Meta Graphs in HINs, which can effectively utilize a priori knowledge and improve the performance of classification.
• We introduce MCHIN, which incorporates the semantics of extension meta graphs and relevance measurement on objects, and assigns categorical labels to each object by a ranking distribution function.
• We apply our proposed model to three real-world datasets and conduct a comprehensive analysis of the proposed model to gain more insights. The results show the effectiveness of MCHIN in the classification task in comparison to other state-of-the-art methods.
This paper is an extension of our previous conference paper "CHIN: Classification with META-PATH in Heterogeneous Information Networks" [9]. We have significantly improved our previous work and present new meta graph-based algorithms for the classification of heterogeneous information networks. The main differences from our previous paper are as follows.
• We present a new model based on meta graphs, and compute the weights of different meta graphs.
• We introduce a novel extension rule to extend HINs, which is proposed for the first time in this manuscript.
• We propose a novel framework, MCHIN, which integrates the extension meta graphs and the weighted meta graphs to update the information of objects. Moreover, we provide a description of the MCHIN algorithm.
• The datasets differ between the two papers, and this paper uses three datasets.
• We add Deepwalk, HIN2Vec, and SDNE as baselines in our experiments and give detailed information about the baselines. We add the weights of different meta graphs. In addition, we analyze, experimentally and theoretically, why meta graphs are more effective than meta paths.
Figure 1. Co-authorship homogeneous network and bibliographic heterogeneous network: (a) Co-author Homogeneous Network; (b) Bibliographic Heterogeneous Network. In the co-authorship homogeneous network (a), there is only one type of object (authors) and a single relationship (co-authorship). The bibliographic heterogeneous network (b), on the other hand, contains three types of objects: authors, papers, and conferences. The red lines represent the publishing relationships, and the black lines express the writing relationships between authors and papers.

Related Work
Numerous research studies have been conducted on the classification of networks. Heterogeneous information networks have two main advantages over their homogeneous counterparts: they make it possible to model different types of objects and relationships, and they preserve more complex topological structures with richer semantic information [14].
Classification of homogeneous information networks. Substantial attention has recently been given to the classification of network data or objects. Most previous studies focus on the structure of the networks, and others use unlabeled objects for classification [15][16][17]. Collective classification [18] is one of the most popular classification methods in network mining. These methods classify objects by their features and by the structure of the homogeneous information network. Zhou et al. [19] proposed the Learning with Local and Global Consistency (LLGC) algorithm. They design a function integrating sufficiently smooth coefficients to learn the general classification from the labeled data. The weighted-vote relational neighbor classifier (wvRN) [15,20] is another widespread classification algorithm proposed for mining network data. wvRN labels objects by considering a link selection mechanism.
Classification of heterogeneous information networks. One of the algorithms proposed for mining HINs is GNetMine, proposed by Ji et al. [21]. GNetMine is a transductive classification model based on graph regularization. Although designed for HINs, GNetMine can only handle data with a common topic. Wan et al. [16] consider class-level meta paths to construct more accurate classifiers and obtain the training labels to improve active learning in HINs. The RankClass algorithm, proposed by Ji et al. [17], is a more effective classification method that utilizes both classification and ranking operations. In RankClass, the ranking values are related to the classification in HINs. Deepwalk [22] is another popular embedding model. The input to Deepwalk is a series of object sequences produced by random walks, which are fed to the SkipGram language model, and the output is the feature representation of the objects. However, Deepwalk can only apply short random walks to obtain the semantics among the objects.

Problem Formalization
The related concepts and definitions are introduced as follows.
We define a HIN as a graph denoted by $G = (V, E, A, R)$, where $V$ and $E$ represent the sets of nodes and edges of the graph, respectively, and $A$ and $R$ represent the sets of node types and edge types, respectively. Definition 1 describes the meta-level structure of the network, as shown in Figures 2a and 3a.
Definition 1. Network schema [2,23]. The network schema is defined as $T_G = (A, R)$. $T_G$ is a meta template for the network $G = (V, E, A, R)$ with the object type mapping $\varphi : V \to A$ and the link type mapping $\psi : E \to R$.
In homogeneous networks, different paths connect two objects of the same type. In heterogeneous information networks, however, different paths can connect objects of different types, and these connecting paths imply different semantics. Formally, we call these paths meta paths.
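The HIN definition and its network schema can be illustrated with a minimal sketch. The class name `HIN`, its methods, and the toy node names are hypothetical and chosen only for illustration; the sketch stores the two type mappings and derives the schema from them.

```python
from dataclasses import dataclass, field


@dataclass
class HIN:
    """Toy container for G = (V, E, A, R) with the mappings phi and psi."""
    node_type: dict = field(default_factory=dict)  # phi: node -> object type
    edge_type: dict = field(default_factory=dict)  # psi: (u, v) -> link type

    def add_node(self, node, ntype):
        self.node_type[node] = ntype

    def add_edge(self, u, v, etype):
        self.edge_type[(u, v)] = etype

    def schema(self):
        """Return the network schema T_G as (object types, typed relations)."""
        types = set(self.node_type.values())
        relations = {
            (self.node_type[u], etype, self.node_type[v])
            for (u, v), etype in self.edge_type.items()
        }
        return types, relations


# Hypothetical bibliographic toy network: two authors, one paper, one conference.
g = HIN()
for name, t in [("a1", "A"), ("a2", "A"), ("p1", "P"), ("c1", "C")]:
    g.add_node(name, t)
g.add_edge("a1", "p1", "writing")
g.add_edge("a2", "p1", "writing")
g.add_edge("p1", "c1", "publishing")

types, relations = g.schema()
```

The schema collapses the two "writing" edges into a single typed relation between A and P, which is the sense in which it acts as a template for the network.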

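Definition 2. Meta path [1]. A meta path $P$ is a path defined on the network schema, denoted in the form $A_1 \xrightarrow{R_1} A_2 \xrightarrow{R_2} \cdots \xrightarrow{R_l} A_{l+1}$, which represents a composite relation $R = R_1 \circ R_2 \circ \cdots \circ R_l$ between object types $A_1, A_2, \ldots, A_{l+1}$, where $\circ$ denotes the composition operator on relations.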
Meta paths can express semantics between different objects, and different meta paths carry different semantic meanings. In Figure 1b, the meta path APA (Author-Paper-Author) shows that authors are co-writing a paper, so the links between authors and papers denote co-author relationships.
Definition 3. Classification in HINs. Given a class set $C = \{c_1, c_2, \cdots, c_{|c|}\}$ and an object set $V = \{v_1, v_2, \cdots, v_{|v|}\}$ in a HIN, where $|c|$ and $|v|$ denote the cardinalities of these sets, respectively, classification in HINs aims to map the object set $V$ into the class set $C$ through $f : V \to C$.
Definition 4. Meta graph [24]. A meta graph is denoted by $S = (A, L, n_t, n_s)$, where $A \subseteq V$ is a subset of nodes and $L \subseteq E$ is a subset of edges. The meta graph $S$ is a directed acyclic graph (DAG), and $n_t$, $n_s$ are a single source and target node, respectively.
Table 2. The neighbors of $A_1$ based on the meta graph (columns: Object, Meta Graph, Neighbors).
The matrix of a meta graph is built from products of the form $M = W_{A_1 A_2} \cdot W_{A_2 A_3} \cdots W_{A_l A_{l+1}}$, where $W_{A_i A_j}$ is the adjacency matrix created for objects of type $A_i$ and $A_j$. $M[x_i, y_j]$ represents the number of meta graph instances between objects $x_i \in A_1$ and $y_j \in A_{l+1}$ following meta graph $S$. In this paper, meta graphs are composed of various meta paths. We take Figure 2b as the running example to illustrate the computation of the meta graph matrix. For meta graph $D_1$, the matrix is computed as $M_{D_1} = W_{AP} \cdot W_{PA}$; the matrix of meta graph $D_2$ is computed analogously, combining the matrices of its parallel branches with the Hadamard (element-wise) product $\odot$.
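As a hedged illustration of the meta graph matrix, the sketch below computes $M_{D_1} = W_{AP} \cdot W_{PA}$ on toy data in plain Python. The second computation is an assumption, not the paper's formula for $D_2$: it uses a simplified meta graph AP{A; C}PA and combines its two parallel branches (papers sharing an author, papers sharing a conference) with a Hadamard product. The helper names and the toy matrices are hypothetical.

```python
def matmul(X, Y):
    """Plain-Python matrix product of X (n x m) and Y (m x p)."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def transpose(X):
    return [list(row) for row in zip(*X)]


def hadamard(X, Y):
    """Element-wise product of two matrices of equal shape."""
    return [[a * b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]


# Toy data: 3 authors, 2 papers, 2 conferences (both papers in conference 0).
W_AP = [[1, 0], [1, 1], [0, 1]]   # author-paper adjacency
W_PC = [[1, 0], [1, 0]]           # paper-conference adjacency
W_PA = transpose(W_AP)
W_CP = transpose(W_PC)

# Meta graph D1 = APA: number of shared papers between two authors.
M_D1 = matmul(W_AP, W_PA)

# Assumed form for a simplified AP{A; C}PA meta graph: the paper-paper step
# must be supported by both branches, hence the Hadamard product.
PP_via_A = matmul(W_PA, W_AP)
PP_via_C = matmul(W_PC, W_CP)
M_D2 = matmul(matmul(W_AP, hadamard(PP_via_A, PP_via_C)), W_PA)
```

Because the Hadamard product multiplies the per-branch counts, a paper pair contributes only when every branch connects it, which is what distinguishes a meta graph from a sum of independent meta paths in this sketch.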

Extension of Heterogeneous Information Networks Based on Meta Graph
There are numerous influence relations between objects in heterogeneous information networks. Meta graphs involve a series of object types and express binary relations defined on HINs, which further enrich these relationships. Here, we use the structure and semantics of objects in meta graphs to extract more information and improve the performance of the classification. Hence, we extend meta graphs to extension meta graphs to capture more object correlations. The extension rules are as follows: (1) both objects are of the same type; (2) both objects carry the same label. If two objects satisfy both rules, we link them with an edge. For convenience, the extension meta graphs can be expressed in the notation A for short.
To illustrate, consider the bibliographic heterogeneous network shown in Figure 4a. Here, four objects $P_1$, $P_2$, $P_4$, $A_1$ are already labeled (with three spades and a star). Without any extension, we can only relate $A_2$ to $A_1$ by APA (connecting $A_1$, $P_1$, and $A_2$). With the Extension Meta Graph A PA, we can also capture the relationship between $A_1$ and $A_3$, $A_4$, and $A_5$. According to meta graph $D_2$ in Figure 2, we can also link $P_1$ and $P_4$ by the meta path APAPA in $D_2$ (connecting $A_1$, $P_1$, $A_2$, $P_4$, and $A_5$). In this way, we extract more information from the HIN through the extension meta graphs.
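The two extension rules can be sketched as a small function that links every pair of labeled objects sharing both type and label. The function name `extension_edges` is hypothetical, and the assignment of the spade label to the three papers and the star label to $A_1$ is an assumption made only to build a toy input.

```python
from itertools import combinations


def extension_edges(node_type, labels):
    """Return edges between labeled objects of the same type and same label."""
    edges = set()
    for u, v in combinations(sorted(labels), 2):
        same_type = node_type[u] == node_type[v]     # rule (1)
        same_label = labels[u] == labels[v]          # rule (2)
        if same_type and same_label:
            edges.add((u, v))
    return edges


# Toy input (assumed labeling): three labeled papers and one labeled author.
node_type = {"P1": "P", "P2": "P", "P4": "P", "A1": "A", "A2": "A"}
labels = {"P1": "spade", "P2": "spade", "P4": "spade", "A1": "star"}

edges = extension_edges(node_type, labels)
```

Unlabeled objects such as `A2` never receive extension edges, so the extension only densifies the part of the network that a priori knowledge already covers.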

Measuring the Similarity Between Different Types of Objects
The relationships among different types of objects following the meta graphs are important for classification in HINs. Therefore, it is necessary to compute the similarity between the source and target objects following meta graphs in HINs. The similarity is calculated as follows. Given a meta graph $S = (A_1 A_2 \cdots A_{l+1})$ with weight $\theta_s$, where $A_1$ and $A_{l+1}$ are the types of the source object and the target object, respectively, we want to calculate the relatedness Rel between $a_{1i} \in A_1$ and $a_{(l+1)j} \in A_{l+1}$ (Equation (1)), where $w(a_{1i}, a_{(l+1)j})$ is the value of the weighted matrix entry $M[a_{1i}, a_{(l+1)j}]$. Initially, $w(a_{1i}, a_{(l+1)j})$ is the number of meta graph instances that connect $a_{1i}$ to $a_{(l+1)j}$.

Weight Learning of Meta Graph
To obtain the weight $\theta_s$ of each meta graph $s$, we construct the objective function $O$ from a priori knowledge in HINs. The objective function $O$ aims to maximize the similarity between objects with the same label and to minimize the similarity between objects with different labels. In $O$, $\lambda$ denotes the regularization parameter, $\|\cdot\|$ denotes the $\ell_2$ norm, and $Rel_s$ is the similarity of objects following meta graph $s$. The objective also relies on a function sign(). We take the partial derivative of Equation (2) with respect to $\theta_s$ and obtain the solution of the loss function.

The Framework of MCHIN
In a HIN, let $M_{ij}$ be the $n_i \times n_j$ adjacency matrix of meta graph $s_{ij}$ with weight $\theta_r$. $M_{ij,pq}$ is the value in the $p$-th row and $q$-th column of matrix $M_{ij}$, and it is also the weight of the edge between objects $x_{ip}$ and $x_{jq}$. We consider undirected graphs that satisfy $M_{ij} = M_{ji}^{T}$.
$$M_{ij,pq} = \begin{cases} 1 & \text{if } x_{ip} \text{ and } x_{jq} \text{ are adjacent} \\ 0 & \text{otherwise} \end{cases}$$
In a HIN, objects connected by edges tend to have similar qualities, such as similar rank scores. In the DBLP dataset, the higher a conference's rank score, the higher the rank scores of its papers. Therefore, we iteratively update the rank score of each object using its linked neighbors. The initial ranking distribution $P(x_{ip} \mid A_i, k)^0$ of objects is defined as a uniform distribution over the objects labeled with class $k$:
$$P(x_{ip} \mid A_i, k)^0 = \begin{cases} 1/l_{ik} & \text{if } x_{ip} \text{ is labeled with class } k \\ 0 & \text{otherwise} \end{cases}$$
where $l_{ik}$ denotes the number of $A_i$ objects labeled with the $k$-th class. Similar to References [21] and [25], linked objects are inclined to have similar labels according to the consistency assumption. In order to retain consistency with the pre-assigned labels, we add the a priori knowledge to the objective function obj. The first part of the function is the sum of the ranking distributions of objects and their neighbors, and the second part guarantees consistency with the initial labels. When obj is minimized, the iteration converges to the closed-form solution, according to [16].
We normalize $P(x_{ip} \mid A_i, k)^t$ after each iteration $t$. The MCHIN algorithm is described as follows.
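The initialization and normalization steps can be sketched as follows. The function names are hypothetical, and the normalization shown (rescaling the scores of one object type so that they sum to one within a class) is an assumption about the form of the normalization step rather than a transcription of the paper's equation.

```python
def initial_distribution(objects, labels, k):
    """Uniform distribution over the objects labeled k; zero elsewhere."""
    labeled_k = [x for x in objects if labels.get(x) == k]
    if not labeled_k:
        return {x: 0.0 for x in objects}
    share = 1.0 / len(labeled_k)          # 1 / l_ik
    return {x: (share if labels.get(x) == k else 0.0) for x in objects}


def normalize(scores):
    """Rescale non-negative scores so that they sum to one (assumed form)."""
    total = sum(scores.values())
    if total == 0:
        return dict(scores)
    return {x: s / total for x, s in scores.items()}


# Toy author set with hypothetical class labels "DB" and "DM".
authors = ["a1", "a2", "a3", "a4"]
labels = {"a1": "DB", "a2": "DB", "a3": "DM"}

p0_db = initial_distribution(authors, labels, "DB")
p_norm = normalize({"a1": 2.0, "a2": 1.0, "a3": 1.0, "a4": 0.0})
```

Starting from a distribution concentrated on the labeled objects means that unlabeled objects acquire rank mass only through their links, which is what lets the later iterations propagate a priori knowledge.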

Algorithm 1: The algorithm of MCHIN
Input: HIN $G = (V, E, A, R)$, meta graph $s$, meta graph matrix $M$
Output: Weights of meta graphs $\theta_s$, the probability of each object belonging to each class $k$
begin
    Extend the HIN following meta graph $s$ based on the extension rules, and obtain the extension meta graphs $s$;
    Initialize the objects' ranking distribution within each class $k$;
    while (not converged) do
        Compute the similarity of source objects and target objects following the extension meta graphs $s$, and obtain $Rel_s$ with Equation (1);
        Compute the weight $\theta_s$ according to Equation (4);
        Update the ranking distribution within each class $k$ with Equation (6);
    end
    Compute each object's posterior probability;
    return the weights of meta graphs $\theta_s$ and the probability of each object being labeled $k$;
end
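The iterative part of the algorithm can be illustrated with a deliberately simplified sketch. It is not the paper's Equation (6): it assumes a single object type, a fixed symmetric matrix `M` (standing in for an already weighted meta graph matrix), and a label-propagation-style update that mixes neighbor scores with the initial distribution using hypothetical parameters `alpha` and `lam`. Weight learning for $\theta_s$ is omitted, and the final label is simply the class with the highest rank score.

```python
def propagate(M, p0, alpha=0.1, lam=0.2, iters=200, tol=1e-12):
    """Assumed update: p <- normalize((lam * M p + alpha * p0) / (lam + alpha))."""
    n = len(p0)
    p = list(p0)
    for _ in range(iters):
        spread = [sum(M[i][j] * p[j] for j in range(n)) for i in range(n)]
        new = [(lam * spread[i] + alpha * p0[i]) / (lam + alpha) for i in range(n)]
        total = sum(new)
        if total > 0:
            new = [v / total for v in new]
        converged = max(abs(a - b) for a, b in zip(new, p)) < tol
        p = new
        if converged:
            break
    return p


def classify(M, seeds_by_class, **kwargs):
    """Rank objects within each class, then pick the best class per object."""
    ranks = {k: propagate(M, p0, **kwargs) for k, p0 in seeds_by_class.items()}
    return [max(ranks, key=lambda k: ranks[k][i]) for i in range(len(M))]


# Toy graph: objects 0-1 are linked, objects 2-3 are linked.
M = [[0, 1, 0, 0],
     [1, 0, 0, 0],
     [0, 0, 0, 1],
     [0, 0, 1, 0]]
seeds = {"DB": [1.0, 0.0, 0.0, 0.0], "DM": [0.0, 0.0, 1.0, 0.0]}

predicted = classify(M, seeds, alpha=0.1, lam=0.2)
```

In this toy graph each unlabeled object inherits the class of its only labeled neighbor; in the full method the matrix would be recomputed from the learned meta graph weights inside the loop.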

DBLP Dataset
The DBLP dataset is a real-world computer science bibliography dataset. We model the DBLP dataset as a HIN in this experiment. There are four types of objects in DBLP: Author (A), Conference (C), Paper (P), and Keywords (K). Among these objects, there are various relationships: "writing" relationships between authors and papers, "publishing" relationships between conferences and papers, and "containing" relationships between keywords and papers. The DBLP dataset covers four main research areas: Information Retrieval, Database, Artificial Intelligence, and Data Mining. These four areas can be considered as the a priori knowledge. The DBLP dataset involves 14,376 papers, 20 conferences, 14,475 authors, and 8920 keywords. There are 170,794 links in DBLP. The ground truth covers 4057 authors and all 20 conferences. We choose the set S = (APA, AP{A; C; T}PA) as the meta graphs in our experiments. For the classification, the authors are labeled.

YELP Dataset
Yelp is a website on which users can find and evaluate different businesses. We extract a subnetwork that contains four types of objects: 5000 restaurants (R), 257,953 Users (U), 10 Categories (C) (American, Mexican, Italian, Chinese, Japanese, Thai, Indian, Canadian, Middle Eastern, and Greek), and Reviews (E). Edges exist between restaurants and categories through the "belong to" relation, and categories are also the labels of businesses. The users are linked via the "friendship" relationship. The classification task is to classify the restaurants by category. As shown in Figure 3, the meta graph set is S = (RCR, RUR, RU{E; R}UR).

IMDB Dataset
We obtain the IMDB dataset from the movie website. The IMDB movie network contains movies (M), actors (A), and directors (D). These three types of objects form two main types of links: actors perform in movies (A-M) and directors direct movies (D-M). The movie genre is the a priori knowledge. We divide the movies into three genres: drama, comedy, and action. According to the semantic relationships between objects, we choose two meta graphs for the IMDB dataset in our experiment: S = (M{A; D}M, MA{D; M}AM).

Baselines
To evaluate the effectiveness of our model MCHIN, we compare it with the following algorithms:
• Deepwalk [22] adopts uniform random walks to learn network embeddings for homogeneous networks. It uses the SkipGram method to express the features of nodes. Deepwalk can be applied to homogeneous information networks.
• HIN2Vec [26] is applied to HINs by representing the meta paths. HIN2Vec designs a neural network model and utilizes meta paths to perform the classification task.
• SDNE [27] is a deep neural network based on a non-linear model. It jointly uses the first-order and second-order proximity to represent the structure of networks.
• MCHIN-ori is MCHIN without extension meta graphs. The original MCHIN (MCHIN-ori) is used to evaluate the effectiveness of the extension meta graphs.
LLGC, wvRN, and RankClass are introduced in the related work. LLGC, Deepwalk, and wvRN are all popular and widely used baseline methods. However, all three can only be applied to homogeneous information networks. In order to use these methods, we apply different meta paths to transform HINs into homogeneous information networks.

Classification of Nodes
In our experiments, we use the classical accuracy metric to verify the effectiveness of MCHIN. We perform experiments on three real-world datasets: DBLP, Yelp, and IMDB. For more convincing results, we randomly select 3%, 5%, 7%, and 10% of the objects in the DBLP dataset as labeled a priori knowledge and classify the rest of the data. Because the Yelp dataset is much sparser than DBLP, we randomly select x% of the objects as labeled (where x = 10, 30, 50, and 70). In the IMDB dataset, we choose the same ratios of labeled objects, x% (where x = 10, 30, 50, and 70), as for the Yelp dataset.
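The evaluation protocol described above (sample a fraction of objects as labeled, classify the rest, report accuracy) can be sketched as follows. The function names, the fixed seed, and the toy predictions are hypothetical; the sketch only illustrates the split and the metric, not the authors' evaluation scripts.

```python
import random


def split_labeled(objects, ratio, seed=0):
    """Randomly pick a fraction of objects as labeled; the rest are evaluated."""
    rng = random.Random(seed)
    shuffled = list(objects)
    rng.shuffle(shuffled)
    n_labeled = max(1, round(len(shuffled) * ratio))
    return shuffled[:n_labeled], shuffled[n_labeled:]


def accuracy(predicted, truth, eval_objects):
    """Fraction of evaluated objects whose predicted label matches the truth."""
    if not eval_objects:
        return 0.0
    correct = sum(1 for x in eval_objects if predicted[x] == truth[x])
    return correct / len(eval_objects)


# 3% labeled objects, as in the DBLP setting.
labeled, unlabeled = split_labeled(range(100), 0.03)

# Hypothetical predictions for four evaluated objects, one of them wrong.
truth = {1: "DB", 2: "DM", 3: "DB", 4: "DM"}
predicted = {1: "DB", 2: "DM", 3: "DM", 4: "DM"}
acc = accuracy(predicted, truth, [1, 2, 3, 4])
```

Accuracy is computed only on the objects outside the labeled sample, so it measures how well labels propagate to unseen objects rather than how well the seeds are memorized.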
In our work, $\lambda_{ij}$ indicates the selection of the important types of links during the ranking process. As discussed in Reference [17], we set $\alpha_i = 0.1$ and $\lambda_{ij} = 0.2$. These values are good enough to verify the validity of MCHIN.
The results are shown in Tables 3-5, with the highest performance in bold. On DBLP, when only 3% of the nodes are labeled, the accuracy of MCHIN reaches 87.6%. This exceeds Deepwalk by more than 18% and the best baseline model by 7%. As the proportion of labeled objects increases, MCHIN remains well ahead of the compared methods. The accuracy of MCHIN reaches 90.5% when 10% of the objects are labeled. Our proposed model MCHIN is significantly more effective than MCHIN-ori, especially when there are fewer labels, because MCHIN uses extension meta graphs. In Table 4, our method consistently outperforms the other baselines on the Yelp dataset as well. The results of the LLGC, wvRN, RankClass, and Deepwalk methods are quite similar to each other. There are two reasons: (1) the four baseline methods are generally applied to homogeneous information networks, so they lose useful semantics and complex information when they are applied to HINs; (2) none of the baseline approaches considers extra label information through different types of models in HINs. When extending the meta graphs, MCHIN augments the prior label sets. This improves accuracy by 1% on average over the MCHIN-ori method.
From Table 5, we can see that the accuracy of MCHIN is 3% to 8% higher than the best performance obtained by the baseline methods when the proportion of labeled objects changes from 30% to 90%. LLGC, wvRN, Deepwalk, and SDNE are generally used in homogeneous information networks, and none of them can capture the rich semantic information in HINs. Accordingly, the results of the LLGC, wvRN, Deepwalk, and SDNE methods are significantly lower than the others. Both the RankClass and HIN2Vec algorithms are designed for HINs, so they can gain more semantic information from different relationships. Using meta paths in HIN2Vec effectively improves accuracy. However, when 70% of the objects are labeled, the performance of MCHIN-ori is still up to 8% higher than that of HIN2Vec because it adds meta graphs. Furthermore, when we extend the meta graphs, the accuracy increases by 3% when 50% of the objects are labeled.

Comparison of Algorithms Using Single Meta Path On DBLP
Meta graphs contain various meta paths, which represent different semantics in HINs. In order to verify the impact of meta graphs, it is essential to compare them with their constituent elements, the meta paths. Due to space limitations, we only report the performance of meta graphs on the DBLP dataset. We compare the performance of four meta paths (APA, APAPA, APTPA, APCPA) and the meta graphs S = (APA, AP{A; C; T}PA) on the DBLP dataset. Figure 5a-e show the classification accuracy for authors based on the meta paths and meta graphs. We can see that our model MCHIN outperforms all the baselines on all the meta paths. The performance of MCHIN also remains stable under different ratios of training data. Across all the methods, the meta path APCPA performs better than APA and APAPA, as it can capture more semantic information in HINs. We further compare the performance of MCHIN based on the meta graphs S = (APA, AP{A; C; T}PA) with all four meta paths. From Figure 5e, we can see that the meta graph set (APA, AP{A; C; T}PA) performs better than all four meta paths. The single meta graph AP{A; C; T}PA outperforms the meta paths by at least 1%. This shows that meta graphs can capture more semantic information than meta paths.
Figure 6. Accuracy comparison of MCHIN corresponding to different meta paths and meta graphs.

Learning Weights of Meta Graphs In MCHIN
From our experiments and comparisons, we can see that different meta graphs express different semantics. To utilize the rich semantics of meta graphs, it is possible to weight meta graphs differently and assign higher weights to meta graphs with a higher impact on accuracy. Figure 5 shows the performance of MCHIN under meta paths and meta graphs on DBLP. The relative effectiveness of the meta graphs is determined by MCHIN via its weight assignment mechanism. Table 6 lists the weights of the meta graphs computed by MCHIN. MCHIN assigns the highest weight to the meta graph AP{A; C; T}PA, as AP{A; C; T}PA performs the best in the classification, which matches intuition. On the other hand, the weight of APA is around 0.01∼0.15 due to its poor ability to capture the main features for the classification task. The weights of the meta graphs on the Yelp dataset are also shown in Table 6. Because the labels of the links between restaurants are related to categories, our model assigns more weight (0.30∼0.40) to RCR than to RUR. The meta graph RU{E; R}UR achieves the highest weight among the meta graphs. In the IMDB dataset, MA{D; M}AM has a higher weight on average than M{A; D}M, as MA{D; M}AM expresses more semantics than M{A; D}M.
Table 6. Weights of meta graphs on the DBLP, Yelp, and IMDB datasets.

Conclusions
In this paper, we studied the classification problem in HINs and proposed a new algorithm, MCHIN, that iteratively classifies objects in HINs. A priori knowledge is used to extend the original heterogeneous information networks, which effectively captures the information hidden in their semantics and structure. It consequently provides a richer training set and improves classification performance. The proposed framework explores the schema of the network to weight each meta graph, and integrates the weighted meta graphs for more effective classification. MCHIN also calculates the similarity of different objects to extract richer semantic information from the HINs. The experimental results on different real datasets validate the superiority of MCHIN in comparison with other algorithms. In the future, we plan to apply our algorithm to other datasets. Another avenue for future research is to generalize the proposed method to other semi-supervised problems, such as the multi-label classification problem, in which an object in the network might have more than one label.