StructSim: Meta-Structure-Based Similarity Measure in Heterogeneous Information Networks

Abstract: Similarity measures in heterogeneous information networks (HINs) have become increasingly important in recent years. Most measures in such networks are based on the meta path, a relation sequence connecting object types. However, real-world scenarios contain many complex semantic relationships that cannot be captured by a meta path. Therefore, the meta structure, a directed acyclic graph of object and relation types, has been proposed. In this paper, we explore the complex semantic meanings in HINs and propose a meta-structure-based similarity measure called StructSim. StructSim models the probability of subgraph expansion with bias from a source node to a target node. Different from existing methods, StructSim assumes that the subgraph expansion is biased, i.e., the probability may differ when expanding from the same node to different nodes of the same type based on the meta structure. Moreover, StructSim defines the expansion bias by considering two types of node information: the out-neighbors of the currently expanded nodes and the in-neighbors of the next-hop nodes to be expanded. To facilitate the implementation of StructSim, we further design the node composition operator and the expansion probability matrix with bias. Extensive experiments on the DBLP and YAGO datasets demonstrate that StructSim is more effective than the state-of-the-art approaches.


Introduction
A heterogeneous information network (HIN) is a powerful way of modeling real-world relationships [1,2]. Due to their capability of containing rich inter-dependency between objects, HINs have recently attracted a lot of research attention [3]. HINs consist of multiple types of objects and relations. Figure 1 is a toy example of an HIN.
In the real world, HINs are ubiquitous; examples include DBLP [4] and YAGO [5]. HIN analysis can help discover interesting knowledge [6]. A wide variety of data mining problems have also been studied in HINs [7,8]. A similarity measure is the basis of many data mining tasks, such as clustering, classification and product recommendation. In this paper, we mainly focus on the similarity measure in HINs.
As an important tool for capturing semantics in HINs, a meta path is a sequence of relations connecting object types [9], which embodies rich semantic meanings. In HINs, different meta paths denote different semantic meanings; Figure 1 shows two meta paths, $p_1$ and $p_2$, as examples. So far, many meta-path-based similarity methods have been proposed in HINs, such as Path-Constrained Random Walk (PCRW) [10], PathSim [9] and HeteSim [11]. These methods can only capture simple semantic meanings between objects in HINs. However, there exists a wide variety of complex semantic relationships in large-scale HINs like knowledge graphs [12,13], which were originally introduced by Google in 2012 to optimize search results. These complex semantic relationships, such as the one expressed by S in Figure 1, cannot be well captured by a simple meta path. Therefore, the concept of a meta structure, which is a directed acyclic graph (DAG), has been proposed [14]. For example, in Figure 1, Toby Kebbell and Nigel Havers not only act in movies directed by Steven Spielberg, but also act in movies edited by Michael Kahn (film editor). This kind of complex semantic relationship can be expressed by the meta structure S in Figure 1, but not by the meta path $p_1$ or $p_2$ alone.
However, research on meta-structure-based similarity has, up until now, been relatively limited. Although BSCSE [14] has been proposed to measure such complex semantic similarities between objects, it does not fully leverage the structure information of a network. What is more, BSCSE holds that the probability of subgraph expansion is the same from the same node to different nodes of the same type based on the meta structure. In fact, the probability may be different.
Therefore, in this paper, we further consider the structure information of a network and introduce the bias of expansion to propose a novel meta-structure-based similarity measure called StructSim. In addition, to facilitate the implementation of StructSim, we further design the node composition operator and the expansion probability matrix with bias. As illustrated by the dotted box in S in Figure 1, the node composition operator combines two nodes in a layer of a meta structure into a composite node. In this way, we can conduct subgraph expansion in a way similar to random walk.
Appl. Sci. 2024, 14, x FOR PEER REVIEW
The main contributions of our work can be summarized as follows.


• We explore the complex semantic meaning analysis in large-scale HINs (e.g., knowledge graphs) and propose a novel meta-structure-based similarity measure named StructSim, which can well capture the complex similarity semantics between objects in HINs.
• We design the subgraph expansion with bias by considering comprehensive structure information of the network, which includes out-neighbor information on the current expansion nodes and in-neighbor information on the next-hop nodes to be expanded.
• We propose the Node-Composition Operator and the expansion probability matrix with bias so as to conveniently implement StructSim and conduct subgraph expansion in a way similar to random walk.
• We conduct extensive experiments on the DBLP and YAGO datasets, which demonstrate that StructSim is more effective than the state-of-the-art methods.
The remainder of the paper is organized as follows. Section 2 summarizes the related work. We illustrate the proposed method, StructSim, in Section 3. Section 4 discusses the experiments and results. We conclude the study in Section 5.

Related Work
In this section, we summarize similarity measures in homogeneous information networks and those in HINs.
The similarity measure in homogeneous information networks calculates the similarity of objects based on the link structure of a network. Personalized PageRank [15] is a classic representative, which employs the probability of random walk with restart to evaluate the similarity between objects. However, it is an asymmetric approach. SimRank [16] is a representative of symmetric measures, which uses the similarity of neighbors to measure the similarity between objects. Because of its computational complexity, several researchers have also studied how to improve its efficiency [17]. RoleSim [18] is proposed to measure the role similarity of objects. SCAN [19] leverages the immediate neighbor set to measure the similarity of objects. All these approaches consider only objects of the same type and do not consider the heterogeneity of a network.
The similarity measure in HINs mainly includes similarity based on the meta path and that based on the meta structure [20-23]. The former captures the simple semantics between objects, while the latter expresses the complex meanings between objects.
The approaches based on the meta path include PathCount [9], PCRW [10], PathSim [9], HeteSim [11], DPRel [22] and so on. PathSim can only measure the similarity between same-type objects, while other methods can measure the similarity between different-type objects. PathSim, PCRW and HeteSim are based on the random walk model; the difference lies in the method of random walk. DPRel [22] is a meta-path-based semi-metric measure for relevance measurement on objects in a general heterogeneous information network with a specified network schema. Recently, Shi et al. [24] presented a new path-based measure called PReP, which is studied from a probabilistic perspective. HowSim [23] is a newly proposed similarity measure which is meta-path-free and captures structural and semantic similarity simultaneously. In addition, RelSim [25] employs the meta path to measure the similarity between relation instances, such as <Google, Larry Page> and <Microsoft, Bill Gates> (Organization and Founder).
The approaches based on the meta structure include BSCSE [14] and MGP [21]. BSCSE employs the subgraph expansion model to capture complex semantic meanings among objects. MGP introduces the idea of PathSim to measure the proximity of distinct classes. Recently, RecurMS [26] has been proposed as a schematic structure in HINs and provides a unified framework for integrating all the meta paths and meta structures. The RecurMS-based similarity RMSS is defined as the weighted sum of the commuting matrices of the decomposed recurrent meta paths and meta trees. RMSS is robust to different meta paths or meta structures. Table 1 compares different similarity measures based on the meta path and meta structure in HINs.

The Proposed Method
In this section, we introduce the proposed method. More specifically, we first describe the definition of the layer of the meta structure. Then, we elaborate upon the strategy of subgraph expansion with bias and its semantic meaning. Finally, we propose the general formula of StructSim, analyze its characteristics and discuss its implementation.

Layer of Meta Structure
The layer of the meta structure denotes its depth, which is defined as follows.
Definition 1 (Layer of meta structure [14]). Given the meta structure $S = (N, M, o_s, o_t)$, we can divide the nodes in the meta structure into different layers according to its topological characteristics. $N$ is a set of nodes and $M$ is a set of edges. To be specific, $S[i] \subseteq N$ denotes the node set of the $i$-th layer in the meta structure. $S[i:j]$ $(1 \le i \le j)$ denotes the node set from the $i$-th layer to the $j$-th layer in the meta structure $S$. $d_S$ denotes the total number of layers in the meta structure, with $S[1:d_S] = N$.
In Figure 1, S is an example of a meta structure. The number of layers of S is 5, that is, $d_S = 5$. The node set of the first layer is denoted by $S[1]$, with $S[1]$ = {Person}. The node set of the second layer is $S[2]$, with $S[2]$ = {Movie}. Then, we can easily obtain $S[1:2]$ = {Person, Movie}. There is only one node in the first layer of S, while there are two nodes in the third layer of S. In this paper, $S[i]$ is regarded as a whole when expanding the subgraph, which will be illustrated in the implementation section.
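As a rough illustration of Definition 1, the following sketch assigns each node of a small DAG to a layer according to its longest distance from the source. The function name, the node labels and the edge list are assumptions chosen to mirror the shape of S described above (five layers, two nodes in the third layer); they are not taken from Figure 1 itself.

```python
from collections import defaultdict, deque


def meta_structure_layers(edges, source):
    """Group the nodes of a DAG into layers by longest distance from the source."""
    successors = defaultdict(list)
    in_degree = defaultdict(int)
    for u, v in edges:
        successors[u].append(v)
        in_degree[v] += 1

    depth = {source: 1}
    queue = deque([source])
    while queue:  # Kahn-style topological traversal
        u = queue.popleft()
        for v in successors[u]:
            depth[v] = max(depth.get(v, 1), depth[u] + 1)
            in_degree[v] -= 1
            if in_degree[v] == 0:  # every predecessor of v has been processed
                queue.append(v)

    layers = defaultdict(list)
    for node, d in depth.items():
        layers[d].append(node)
    return [sorted(layers[d]) for d in sorted(layers)]


# Hypothetical labels: the two Person nodes in the middle are kept distinct
# so that the third layer can hold both of them.
edges = [
    ("Person_source", "Movie_1"),
    ("Movie_1", "Person_editor"),
    ("Movie_1", "Person_director"),
    ("Person_editor", "Movie_2"),
    ("Person_director", "Movie_2"),
    ("Movie_2", "Person_target"),
]
layers = meta_structure_layers(edges, "Person_source")
for i, layer in enumerate(layers, start=1):
    print(f"S[{i}] = {layer}")
```

Under these assumed edges, `len(layers)` plays the role of $d_S$, and `layers[i - 1]` plays the role of $S[i]$.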

Subgraph Expansion with Bias
In HINs, similarity measures based on the meta structure employ subgraph expansion to model the complex semantic meanings between objects. The existing subgraph expansion strategy assumes that the expansion probability is the same when expanding to different nodes of the same type from the same layer of the meta structure in each step. In fact, the expansion probability may be different due to the difference between objects. Therefore, we propose a novel strategy of subgraph expansion with bias in this paper, which differs essentially from the current subgraph expansion. Based on the subgraph expansion with bias, we can calculate the subgraph expansion probability with bias.
We first describe the process of subgraph expansion and then focus on the bias factor in subgraph expansion. Similar to random walk based on the meta path, subgraph expansion based on the meta structure means that we traverse the network from a given source node along the given meta structure until the target node is reached. Owing to the complexity of the meta structure, the result of the traversal is a subgraph rather than a path. We describe the process with an example. Figure 2 shows the detailed process of subgraph expansion. To be specific, given the source node Toby Kebbell and the meta structure S in Figure 1, the first layer expansion yields subgraphs 2(a) and 2(b). We continue with the second layer expansion from 2(a) and 2(b) and obtain subgraphs 3(a) and 3(b), respectively. Different from the first layer expansion, two nodes must be expanded simultaneously in the second layer, as shown by the dotted rectangle in 3(a) and 3(b) of Figure 2. When two nodes are expanded simultaneously at a layer of the meta structure, we regard them as a whole for subgraph expansion in this study. To facilitate the expansion of this kind of complex structure, we further design the node composition operator, which is introduced in detail in the implementation section.
After the two nodes are treated as a whole, we can perform the subgraph expansion in a way similar to random walk. Then, we expand from 3(a) of Figure 2 to 4(a) and 4(b). In a similar way, we generate the subsequent subgraph expansions along the relations in every layer of the meta structure. Finally, the expansion finishes when the target node is reached.
Based on the subgraph expansion, we introduce the bias factor, which means that we perform the subgraph expansion with different probabilities from the source node to the target node along the layers of the meta structure in the whole network. More specifically, the bias means that the probability is different when expanding to different nodes of the same type from the same layer of the meta structure in each step. That is, we need to consider two kinds of neighbor information simultaneously: the out-neighbor information on the current node and the in-neighbor information on the next-hop node to be expanded. In other words, the probability of expanding to some nodes is high, while the probability of expanding to other nodes may be low in each step; namely, the expansion is biased. Figure 2 gives an example: two nodes, War Horse (film) and Wrath of the Titans, are reached when we traverse the network from Toby Kebbell along the relation actedIn of the first layer of the meta structure S. If we do not consider the bias, then the probability of expanding to War Horse (film) or to Wrath of the Titans is equal to 1/2. In fact, the probability is not always the same. That is, the probability of expanding to one node may be high, while the probability of expanding to the other may be low. Therefore, we devise the strategy of subgraph expansion with bias. For example, the probability from Toby Kebbell to War Horse (film) may be 3/4, while the probability from Toby Kebbell to Wrath of the Titans may be 1/4. The movie role may be a vital reason for this bias. For instance, Toby Kebbell may act as the leading role in one film and only make a guest appearance in another. It is plausible that Toby Kebbell pays more attention to the movie in which he acts as the leading role, which further reveals that he has different biases for the two movies.

General Formula of Similarity Measure Based on Meta Structure
In this section, we first describe the basic idea of the proposed method. Then, we propose the general formula of the similarity measure based on the meta structure. Finally, we describe the semantic meanings with a detailed example.

Basic Idea
To explore the complicated semantics between objects, we propose a novel meta-structure-based similarity measure called StructSim. The basic idea is that two objects are more similar if the probability of expansion from the source node to the target node along a given meta structure is greater.
Different from random walk based on the meta path, subgraph expansion based on the meta structure involves a more complex structure than the path. Moreover, we introduce the subgraph expansion with bias to make full use of the structure information of the network in this study, which differs essentially from the existing subgraph expansion.
Specifically, StructSim assumes that the expansion probability from the same node to different nodes of the same type may differ in each step when the subgraph is expanded along the meta structure. That is, the probability of expanding to some nodes is high, while the probability of expanding to other nodes may be low. Here, to embody the bias of subgraph expansion, StructSim simultaneously considers the node information at both ends of the expansion, including the out-neighbor information on the current node and the in-neighbor information on the next-hop node to be expanded. In short, the expansion is biased and StructSim models the probability of subgraph expansion with bias.

The Formula of StructSim
In this section, we give the formal definition of StructSim. Let $x$ and $y$ be the source and target nodes, respectively. Based on the meta structure $S$, the general formula of StructSim, Equation (1), accumulates $f(x, y|s)$ over the instances $s$ of $S$, where $f(x, y|s)$ denotes the similarity score between the source node $x$ and the target node $y$ based on an instance $s$ of the meta structure. Here, $f(x, y|s)$ admits several different definitions, such as the number of subgraphs and the expansion probability. In general, the number of subgraphs can be treated as a special case of the expansion probability. Therefore, we define $f(x, y|s)$ in Equation (2) as the product, over the layers of $s$, of $h(g_b(i|S, G), g_f(i|S, G))$, where $d_S$ denotes the number of layers of instance $s$ of the meta structure. $h(g_b(i|S, G), g_f(i|S, G))$ is the subgraph expansion probability with bias in the $i$-th layer, which can be obtained by subgraph expansion with bias from the source node to the target node along a given meta structure in the whole network. Here, the subgraph expansion with bias is a novel strategy proposed to make full use of the structure information of the network. Equation (2) shows that $f(x, y|s)$ can be regarded as the product of the expansion probabilities with bias of all layers of instance $s$ of the meta structure.
The function $h(g_b(i|S, G), g_f(i|S, G))$ in Equation (2) is the subgraph expansion probability with bias, and it depends on two factors: $g_b(i|S, G)$ and $g_f(i|S, G)$. Specifically, $g_b(i|S, G)$ is the set of expanded subgraphs in the $i$-th layer, which controls the extent of dependence on their out-neighbor information. $g_f(i|S, G)$ denotes the set of forward subgraphs of the next-hop nodes expanded from the $i$-th layer, which controls the extent of dependence on their in-neighbor information. By combining both types of information, we can determine the bias in subgraph expansion and then obtain the subgraph expansion probability with bias, as given by Equation (3).
In Equation (3), $g_b(i|S, G)$ is the set of expanded subgraphs in the $i$-th layer during the process of subgraph expansion, and $|g_b(i|S, G)|$ is the number of expanded subgraphs. $g_f(i|S, G)$ denotes the set of forward subgraphs of the next-hop nodes expanded from the $i$-th layer, and $|g_f(i|S, G)|$ is the number of forward subgraphs. $\omega$ is the weight that controls the proportion of out-neighbor and in-neighbor information in the expansion probability with bias.

Semantic Meaning Explanation with an Example
In this section, we use an example to describe the semantic meaning of the subgraph expansion probability with bias $h(g_b(i|S, G), g_f(i|S, G))$. Figure 1 is an HIN example. The given source node is Toby Kebbell and the meta structure is S. From $|g_b(1|S, G)|$ and $|g_f(1|S, G)|$, we can obtain the expansion probability with bias $h(g_b(1|S, G), g_f(1|S, G))$ of the first layer. In this paper, we give out-neighbors and in-neighbors equal weight in the expansion probability with bias, that is, $\omega = 0.5$. Consequently, we can obtain the expansion probability with bias from 1(a) to 2(a) using Equation (3). Different values of $\omega$ change the relative weight of out-neighbor and in-neighbor information in the expansion probability with bias. We can vary $\omega$ to determine the optimal value for a specific task.
In the same way, we can compute the subgraph expansion probability with bias of the remaining layers of the meta structure. Given an instance $s$ of the meta structure, the product of the probabilities with bias of all the layers of $s$ is the similarity between the source node $x$ and the target node $y$ based on $s$, denoted by $f(x, y|s)$. In general, given the source node $o_s$, the HIN $G = (V, E)$ and the meta structure $S$, we can obtain all instances of $S$ by expanding the subgraph along the layers of the meta structure in graph $G$. For instance, in Figure 1, through subgraph expansion from the source node Toby Kebbell along the meta structure S, we can obtain the instance of the meta structure shown by a dashed line. For each instance, we compute $f(x, y|s)$ and then accumulate the similarity scores over all instances of the meta structure to obtain the final similarity StructSim$(x, y|S)$. Thus, StructSim models the probability of subgraph expansion with bias from the source node $x$ to the target node $y$ along a given meta structure $S$. Two objects are more similar if they are connected by more instances of the meta structure.
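The accumulation described above can be sketched in a few lines. This is a minimal illustration, assuming that Equation (1) is a plain (unnormalized) sum over instances and that each instance is summarized by its list of per-layer expansion probabilities with bias; the function names and all probability values are hypothetical.

```python
from fractions import Fraction as F
from math import prod


def instance_score(layer_probabilities):
    """f(x, y | s): product of the per-layer expansion probabilities of one instance."""
    return prod(layer_probabilities)


def structsim_score(instances):
    """StructSim(x, y | S): accumulate f(x, y | s) over all instances s (assumed plain sum)."""
    return sum(instance_score(p) for p in instances)


# Two hypothetical instances connecting the same source and target nodes,
# each described by four expansion steps of a five-layer meta structure.
instances = [
    [F(3, 4), F(1, 2), F(1, 1), F(1, 2)],
    [F(1, 4), F(1, 2), F(1, 1), F(1, 2)],
]
score = structsim_score(instances)
print(score)
```

Exact fractions are used only to keep the arithmetic easy to check by hand; floating-point values would work the same way.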

The Implementation of StructSim
In this section, we study how to perform similarity computation based on a given meta structure. We first present two novel mechanisms: the Node-Composition Operator and the expansion probability matrix with bias. Then, we describe the implementation of StructSim based on both mechanisms.

Node-Composition Operator
In Figure 2, we describe the detailed process of subgraph expansion. However, the expansion differs between layers of the meta structure. For instance, the first layer expansion and the second layer expansion in Figure 2 are different. At the first layer expansion, only a single node is expanded, as shown by the subgraph expansion from 1(a) to 2(a) in Figure 2. However, at the second layer expansion, two nodes must be expanded simultaneously, as shown by the subgraph expansion from 2(a) to 3(a) in Figure 2. It is not easy to calculate the expansion probability in this second, more complicated case. Therefore, an effective solution is needed to solve this problem.
Inspired by the literature [11], we design the Node-Composition Operator to facilitate the expansion of the complex structure and the computation of the expansion probability matrix with bias. To be specific, we combine two nodes into a composition node. The relation connecting the composition node is called the composition edge. Based on the Node-Composition operation, we regard the composition node as a whole and perform subgraph expansion in a way similar to single-node expansion. Then, we can compute the subgraph expansion probability with bias in this situation. Figure 2 illustrates the Node-Composition Operator. From the process of subgraph expansion based on the meta structure in Figure 2, we can see that, in the second layer of subgraph expansion, we need to expand two nodes of the Person type at the same time from the node of the Movie type. The process from 2(a) to 3(a) in Figure 2 shows the expansion from W.Horse (film) to M.Kahn (film editor) and S.Spielg (director). After the Node-Composition operation, we obtain subgraph (3)a' in Figure 3. The relation connecting W.Horse (film) and the composition node (M.Kahn (film editor) and S.Spielg (director)) is also adjusted to be a composition edge, which is shown by the bold arrow in (3)a' of Figure 3. Then, the third layer expansion goes from a composition node to a single node. As illustrated by the process from 3(a) to 4(a) in Figure 2, we generate the subgraph expansion from M.Kahn and S.Spielg (a composition node) to E.Sun (a single node). After the Node-Composition operation, we obtain the subgraph shown as (4)a' in Figure 3.
Based on the Node-Composition Operator, the node sets in every layer of the meta structure can be regarded as a whole. Then, we can use a matrix to conveniently calculate the expansion probability, which is similar to the transition probability matrix in random walk. Finally, we can apply matrix multiplication to the implementation of StructSim.
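One way to picture the composition step is as a pairing of adjacency matrices. The sketch below is an assumption about how composition nodes and composition edges could be materialized, not the implementation used in this paper: a composition node is an (editor, director) pair, and a composition edge exists only when both underlying edges exist. The function names and the toy adjacency matrices (including the placeholder nodes Editor_X and Director_Y) are hypothetical.

```python
def compose_out(left, right):
    """Adjacency from a single-node layer into a composite layer.

    left[m][e] and right[m][d] become out[m][(e, d)] = left[m][e] * right[m][d],
    with the pair (e, d) stored at column e * len(right[0]) + d.
    """
    return [[x * y for x in row_l for y in row_r] for row_l, row_r in zip(left, right)]


def compose_in(left, right):
    """Adjacency from a composite layer into a single-node layer.

    left[e][n] and right[d][n] become out[(e, d)][n] = left[e][n] * right[d][n],
    with the pair (e, d) stored at row e * len(right) + d.
    """
    return [[x * y for x, y in zip(row_l, row_r)] for row_l in left for row_r in right]


# Rows: W.Horse (film), Wrath of the Titans. Columns: M.Kahn, Editor_X.
movie_to_editor = [[1, 0], [0, 1]]
# Rows: W.Horse (film), Wrath of the Titans. Columns: S.Spielg, Director_Y.
movie_to_director = [[1, 0], [0, 1]]
# Rows: M.Kahn, Editor_X. Column: E.Sun.
editor_to_movie = [[1], [0]]
# Rows: S.Spielg, Director_Y. Column: E.Sun.
director_to_movie = [[1], [0]]

# Composite order: (M.Kahn, S.Spielg), (M.Kahn, Director_Y),
#                  (Editor_X, S.Spielg), (Editor_X, Director_Y).
movie_to_pair = compose_out(movie_to_editor, movie_to_director)
pair_to_movie = compose_in(editor_to_movie, director_to_movie)
print(movie_to_pair)
print(pair_to_movie)
```

With this pairing, every layer is again connected to the next by a single matrix, which is what allows the expansion to be treated like a random walk step. The number of composite columns grows as the product of the two node counts, which matches the space concern raised in the discussion.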

Expansion Probability Matrix with Bias
Based on the Node-Composition Operator, we can easily define the expansion probability matrix with bias as follows.

Definition 2 (Expansion probability matrix with bias). $W_{ij}$ is an adjacency matrix between $S[i]$ and $S[j]$. The matrix obtained by normalizing $W_{ij}$ along its row vectors is denoted as $U_{ij}$, which is the expansion probability matrix of the layer relation $S[i] \xrightarrow{R} S[j]$. Suppose there exists a bias $\delta$ for $S[i]$ expanding to different objects in $S[j]$ of the same layer $j$. The expansion probability matrix with bias $M(\delta)_i$ denotes that an object in the $i$-th layer of the meta structure $S$ has a different bias for various objects in the $(i+1)$-th layer. The probability is the same within each row of the expansion probability matrix $U_{ij}$, while the probability may be different within each row of the expansion probability matrix with bias $U_{ij}(\delta)$. $S[i]$ denotes the node set in the $i$-th layer of the meta structure $S$. $S[i]$ falls into one of two situations. In the first, $S[i]$ contains only one node. In the second, $S[i]$ contains more than one node. In the first situation, it is easy to compute the expansion probability matrix with bias by adopting the existing transition probability matrix. However, it is difficult to calculate the expansion probability matrix with bias in the second situation. As illustrated for the Node-Composition Operator, $S[i]$ is regarded as a whole when performing subgraph expansion based on a meta structure. Therefore, we first apply the Node-Composition Operator to $S[i]$. Then, we adopt a strategy similar to the first situation to calculate the probability matrix.
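To make the contrast between $U_{ij}$ and a biased counterpart concrete, the sketch below row-normalizes a small adjacency matrix and then repeats the normalization after reweighting its entries. The row normalization follows Definition 2; the reweighting by an elementwise weight matrix is only a placeholder assumption and is not this paper's definition of the bias $\delta$. All names and values are hypothetical.

```python
from fractions import Fraction


def row_normalize(w):
    """Return U with U[a][b] = w[a][b] / sum(w[a]); all-zero rows stay zero."""
    result = []
    for row in w:
        total = sum(row)
        result.append([Fraction(x, total) if total else Fraction(0) for x in row])
    return result


def biased_row_normalize(w, delta):
    """Placeholder bias: scale each edge by delta[a][b], then row-normalize."""
    weighted = [[x * d for x, d in zip(row_w, row_d)] for row_w, row_d in zip(w, delta)]
    return row_normalize(weighted)


# One source node (Toby Kebbell) connected to two movies.
w_12 = [[1, 1]]
# Hypothetical weights expressing a stronger preference for the first movie.
delta_12 = [[3, 1]]

u_12 = row_normalize(w_12)
u_12_biased = biased_row_normalize(w_12, delta_12)
print(u_12)
print(u_12_biased)
```

With these hypothetical weights, the unbiased row gives each movie probability 1/2, while the reweighted row reproduces the 3/4 and 1/4 split used as an example earlier in this section.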
In this paper, we adopt a weighted combination operation to determine the function $f$. Based on the expansion probability matrix with bias $U_{ij}(\delta)$, we obtain the relevance matrix $SM_{structsim}(S)$ between the type of the source node $o_s$ and the type of the target node $o_t$ under the meta structure $S$ for StructSim. Here, $d_S$ denotes the total number of layers in the meta structure $S$. The final similarity score $SM_{structsim}(S)$ is obtained by matrix multiplication and corresponds to StructSim$(x, y|S)$ in Equation (1).
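As a rough sketch of how the matrix multiplication could be carried out, the snippet below multiplies one biased expansion matrix per pair of consecutive layers, so a meta structure with $d_S$ layers contributes $d_S - 1$ matrices. The chained product is an assumption inferred from the description and the stated complexity, and the names `matmul` and `relevance_matrix` are illustrative only.

```python
from functools import reduce
from typing import List, Sequence

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain dense matrix product with a shape check."""
    inner = len(b)
    if any(len(row) != inner for row in a):
        raise ValueError("inner dimensions do not match")
    cols = len(b[0])
    return [[sum(a_row[k] * b[k][j] for k in range(inner))
             for j in range(cols)]
            for a_row in a]


def relevance_matrix(layers: Sequence[Matrix]) -> Matrix:
    """Multiply the per-layer matrices from source layer to target layer.

    ASSUMPTION: layers[i] is the biased expansion matrix between
    layer i+1 and layer i+2, so len(layers) == d_S - 1.
    """
    if not layers:
        raise ValueError("at least one layer matrix is required")
    return reduce(matmul, layers)


# Toy meta structure with d_S = 3 layers: 1 source, 2 middle, 2 targets.
LAYERS = [
    [[0.375, 0.625]],
    [[1.0, 0.0], [0.5, 0.5]],
]
SM = relevance_matrix(LAYERS)
```

Row `x` and column `y` of the result would then be read as the score between source object `x` and target object `y`.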

Discussion
In this section, we analyze the computational complexity, characteristics and limitations of the proposed method. The time complexity is dominated by matrix multiplication, and the dimension and sparsity of the matrices also affect efficiency. The complexity is $O(m^3 (d_S - 1))$ for an instance of a meta structure $S$ with depth $d_S$, where $m$ is the average matrix dimension. In addition, the Node-Composition Operator may increase the matrix dimensions, especially in a large-scale information network.
To better analyze the characteristics of the proposed method, we summarize the relations between the different measures and their computation formulas in Table 1. The existing measures rely on basic strategies such as path/graph count or walking/expansion probability. Although these methods can capture semantics between objects, they do not make sufficient use of the structure information of a network. Therefore, we present the strategy of expansion with bias in order to make full use of the network information. Moreover, a complex structure expansion must also deal with multiple nodes expanding at the same time. To solve this problem, we design the Node-Composition Operator, which makes the subgraph expansion much easier.
These strategies reveal the difference between StructSim and BSCSE. BSCSE simulates the process of structure-constrained subgraph expansion. StructSim builds on BSCSE by further introducing the bias and the Node-Composition Operator, which capture comprehensive information between objects with complex semantic meanings. This property also suggests why StructSim is more effective than BSCSE: StructSim considers both out-neighbor and in-neighbor information when computing probabilities, not out-neighbor information alone. The main limitation is the efficiency of matrix multiplication, which remains a vital issue to be addressed. In addition, when the Node-Composition Operator is applied in some layer of the meta structure, the number of composition nodes of the same type can be large, which is costly both in the space needed to store them and in the matrix multiplication itself. Fortunately, several fast computation solutions exist, including dynamic programming and Monte Carlo [11].
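To give a feel for the stated bound, the following worked example plugs in hypothetical values, $m = 10^3$ and $d_S = 4$, which are not taken from the experiments. It assumes dense matrices and the cubic cost per product implied by the bound.

```latex
\[
  m^{3}\,(d_S - 1) \;=\; (10^{3})^{3} \times (4 - 1) \;=\; 3 \times 10^{9}
  \quad \text{scalar multiply-add operations.}
\]
```

Doubling $m$ under the same assumptions multiplies this count by eight, which is why growth in matrix dimension caused by node composition matters more than an extra layer.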
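One way a Monte Carlo approximation could look is sketched below: instead of multiplying the layer matrices, it samples expansions layer by layer and counts where they end. This is a generic sampling scheme under our own assumptions, not the specific solution of [11]; `monte_carlo_row`, `n_walks` and `seed` are illustrative names.

```python
import random
from typing import List, Optional, Sequence

Matrix = List[List[float]]


def sample_index(probs: Sequence[float], rng: random.Random) -> Optional[int]:
    """Draw an index according to probs; None if the row sums to less than 1
    and the draw falls outside it (the expansion stops)."""
    r = rng.random()
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return None


def monte_carlo_row(layers: Sequence[Matrix], source: int,
                    n_walks: int = 20000, seed: int = 0) -> List[float]:
    """Estimate one row of the product of the layer matrices by sampling."""
    rng = random.Random(seed)
    hits = [0] * len(layers[-1][0])
    for _ in range(n_walks):
        node: Optional[int] = source
        for u in layers:
            node = sample_index(u[node], rng)
            if node is None:
                break
        else:
            # The loop finished without stopping early: count the end node.
            hits[node] += 1
    return [h / n_walks for h in hits]


# Same toy layers as a 3-layer meta structure; the exact row is [0.6875, 0.3125].
LAYERS = [
    [[0.375, 0.625]],
    [[1.0, 0.0], [0.5, 0.5]],
]
ESTIMATE = monte_carlo_row(LAYERS, source=0)
```

The cost grows with the number of sampled expansions rather than with the cube of the matrix dimension, at the price of an approximate result.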

Datasets
To verify the effectiveness of StructSim, we conducted extensive experiments on the classic datasets YAGO [5] and DBLP [11]. YAGO is a large semantic knowledge base derived from Wikipedia, WordNet and GeoNames [5]. It currently contains more than 10 million entities and 120 million facts. In this paper, we mainly adopt the "Core Facts" part of YAGO, denoted by Yago-Core [27], which includes fact tuples of entities and relationships, such as <Steven Spielberg, directed, War Horse (film)>, and tuples of entity types. Table 2 shows the statistics of the Yago-Core dataset. DBLP contains the major conferences in four research fields: database, data mining, information retrieval and artificial intelligence. It has 14,475 authors, 14,376 papers, 8593 terms and 20 conferences, of which 4057 authors, 20 conferences and 100 papers are labelled.

Baselines
To validate the effectiveness of StructSim, we selected seven representative approaches as baselines: PathCount (the number of path instances) [9], PathSim (a standardized version of PathCount) [9], PCRW (the sum of the random-walk probabilities of all path instances) [10], StructCount (the number of instances of the meta structure) [14], SCSE (the probability of subgraph expansion) [14], BSCSE (a simulation of structure-constrained subgraph expansion) and MGP (the shared characteristic meta-graphs) [21].
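To make the two simplest baselines concrete, the sketch below computes PathCount and PathSim from a commuting matrix of a symmetric meta path. The normalization used for PathSim is the commonly stated form, twice the path count divided by the two self path counts; the toy author-by-venue matrix and the function names are our own illustration.

```python
from typing import List

Matrix = List[List[float]]


def commuting_matrix(a: Matrix) -> Matrix:
    """M = A * A^T: path counts for a symmetric meta path such as
    author-venue-author, given an author-by-venue count matrix A."""
    return [[sum(x * y for x, y in zip(row_i, row_j)) for row_j in a]
            for row_i in a]


def path_count(m: Matrix, x: int, y: int) -> float:
    """Number of path instances between x and y."""
    return m[x][y]


def path_sim(m: Matrix, x: int, y: int) -> float:
    """PathCount normalized by the self path counts of x and y."""
    denom = m[x][x] + m[y][y]
    return 2 * m[x][y] / denom if denom else 0.0


# Hypothetical toy data: 3 authors, 2 venues, entries are paper counts.
A = [[2, 1], [4, 2], [0, 3]]
M = commuting_matrix(A)
```

Here author 1 publishes in the same proportions as author 0 but twice as much, so PathCount favors the pair strongly while PathSim discounts the difference in volume.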

Effectiveness Experiments
To evaluate the effectiveness of StructSim, we conducted extensive experiments on entity resolution and clustering tasks. We adopted two popular criteria, Area Under the Curve (AUC) and Normalized Mutual Information (NMI) [28], to evaluate the performance of StructSim on the two tasks, respectively. We then further analyzed the traits of StructSim through a case study.

Entity Resolution Task
Entity resolution refers to finding pairs of entries that describe the same entity in different ways, which serves the purpose of data cleaning. For example, in YAGO, Presidency of Ronald Reagan and Ronald Reagan denote the same person but have different descriptions. Entity resolution aims to identify such a pair so that the data can be cleaned by deduplicating the entry.
We extracted the experimental data for the entity resolution task from YAGO. Specifically, we leveraged yago-fact tuples and entity types to obtain 2687 pairs of objects satisfying the semantic meaning revealed by the meta path $p_1$ in Figure 4. In total, the extracted data included 4521 different people. We then manually labeled the data with the help of Wikipedia and obtained 46 pairs of entities that denoted the same object. The remaining 2641 pairs of entities were regarded as negative samples.
For the entity resolution task, we employed the meta structure $S_1$ and the meta paths $p_1$ and $p_2$ in Figure 4 to compute the similarity between objects. We regarded each object as a source node and searched for its duplicated target node. Based on the similarity score, we ranked all entities other than the source node; the higher the score, the more likely the two entities were to be the same object, that is, a pair of objects referring to the same entity, such as Presidency of Ronald Reagan and Ronald Reagan. We drew the ROC curve by varying the threshold on the similarity score and computed the AUC value.
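The AUC evaluation described above can be illustrated with a small sketch. It uses the standard equivalence between the area under the ROC curve and the fraction of positive-negative pairs that the scores rank correctly, with ties counted as half; the function name and the toy scores are illustrative and unrelated to the reported results.

```python
from typing import Sequence


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve via pairwise comparison.

    labels[i] is 1 for a pair denoting the same entity, 0 otherwise.
    """
    pos = [s for s, l in zip(scores, labels) if l]
    neg = [s for s, l in zip(scores, labels) if not l]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative pair")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy similarity scores for four candidate pairs.
TOY_SCORES = [0.9, 0.8, 0.3, 0.1]
TOY_LABELS = [1, 0, 1, 0]
TOY_AUC = auc(TOY_SCORES, TOY_LABELS)
```

With 46 positive and 2641 negative pairs, the pairwise loop stays small, so no sorting-based optimization is needed for a dataset of this size.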
Table 3 shows the results of the different similarity measures in the entity resolution task. Here, $s = \beta s_{p_1} + (1 - \beta) s_{p_2}$ is the similarity formula for a linear combination of the meta paths, where $s_{p_1}$ and $s_{p_2}$ denote the similarity scores based on the meta paths $p_1$ and $p_2$, respectively. From Table 3, we can make four observations. (1) The similarity measures based on the meta structure perform better than those based on the meta path, which reveals that the meta structure can capture more complex semantic meaning than the meta path. Here, $S_1$ means that two people are not only subordinate to the same organization but also have a marital relationship with the same person; neither $p_1$ nor $p_2$ can express these complicated semantics. (2) StructSim performs better than StructCount, SCSE, BSCSE and MGP, which indicates that StructSim captures richer semantics by incorporating both out-neighbor and in-neighbor information when expanding the subgraph, rather than only the out-neighbor information that most similarity measures consider. (3) The linear combination of the meta paths outperforms the approaches based on a single meta path but is inferior to the meta-structure-based approaches. This demonstrates that the linear combination can capture more comprehensive information than a single meta path, yet it cannot capture semantics as complicated as those of the meta structure. (4) $p_1$ has some superiority over $p_2$ in the entity resolution task, which may be because two objects with a marital relationship with the same person are more likely to be the same person than two objects subordinate to the same organization.

Clustering Task
For the clustering task, we clustered authors based on the meta structure $S_2$ and the meta paths $p_3$ and $p_4$ in Figure 4. According to the research fields, we set the number of clusters to four. Based on the similarity matrices derived by the different similarity measures, we applied Normalized Cut [29] to cluster the authors. We ran each algorithm 100 times and recorded the average accuracy.
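For readers unfamiliar with the NMI criterion used to score the clusters, a minimal sketch follows. Several normalizations of mutual information are in use and the text does not say which one was applied; the version below assumes the geometric mean of the two entropies, and the function name is illustrative.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def nmi(true_labels: Sequence[Hashable], pred_labels: Sequence[Hashable]) -> float:
    """Normalized mutual information between two labelings.

    ASSUMPTION: normalization by sqrt(H(true) * H(pred)).
    """
    n = len(true_labels)
    if n == 0 or n != len(pred_labels):
        raise ValueError("labelings must be non-empty and of equal length")
    count_t = Counter(true_labels)
    count_p = Counter(pred_labels)
    joint = Counter(zip(true_labels, pred_labels))
    h_t = -sum(c / n * math.log(c / n) for c in count_t.values())
    h_p = -sum(c / n * math.log(c / n) for c in count_p.values())
    mi = sum(c / n * math.log(c * n / (count_t[a] * count_p[b]))
             for (a, b), c in joint.items())
    if h_t == 0 or h_p == 0:
        # Degenerate case: at least one labeling has a single cluster.
        return 1.0 if h_t == h_p else 0.0
    return mi / math.sqrt(h_t * h_p)


# Cluster ids are arbitrary, so a relabeled but identical partition is perfect.
PERFECT = nmi([0, 0, 1, 1], [1, 1, 0, 0])
UNRELATED = nmi([0, 0, 1, 1], [0, 1, 0, 1])
```

Because the score ignores the particular cluster ids, it suits a setting where the four clusters are compared against four research-field labels.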
Table 4 shows the results of the different similarity measures in the clustering task, together with the experimental results based on the linear combination of the meta paths. We can draw the following conclusions from Table 4. (1) StructSim outperforms StructCount, SCSE, BSCSE and MGP, which indicates that StructSim considers richer semantics than StructCount and SCSE. (2) $p_3$ performs better than $p_4$ in the clustering task, which reveals that $p_3$ captures more semantic information than $p_4$. That is, authors who have published papers in the same venue are more likely to be assigned to the same cluster. (3) The linear combination of the meta paths performs better than the approaches based on a single meta path. This demonstrates that the linear combination can capture more comprehensive information than any single meta path.
We also studied how the parameter $\beta$ influences the performance of the algorithms in the entity resolution task. We varied $\beta$ from 0 to 1 and recorded the AUC value, which is depicted in Figure 5. The results show that different weight combinations of the meta paths have a significant effect on performance. As $\beta$ increases, the performance first gradually increases and then decreases. The algorithms achieve the best AUC value when $\beta = 0.8$, which indicates that the meta path $p_1$ has a more significant impact than $p_2$ in the entity resolution task. This is also consistent with our intuition. Determining the proper weight of each meta path so as to achieve a better performance is a meaningful problem, which will be studied in future work.
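The parameter study can be pictured as a simple grid sweep over $\beta$ applied to the linear combination of two score lists. The sketch below is an illustration under our own assumptions: the evaluation metric is passed in as a callable, the grid of eleven values is hypothetical, and the toy metric is constructed only so that the sweep has a known optimum.

```python
from typing import Callable, List, Sequence, Tuple


def combine(s1: Sequence[float], s2: Sequence[float], beta: float) -> List[float]:
    """Linear combination beta * s1 + (1 - beta) * s2, element by element."""
    return [beta * a + (1 - beta) * b for a, b in zip(s1, s2)]


def sweep_beta(s1: Sequence[float], s2: Sequence[float],
               metric: Callable[[List[float]], float],
               steps: int = 10) -> List[Tuple[float, float]]:
    """Evaluate the metric for beta = 0, 1/steps, ..., 1."""
    results = []
    for k in range(steps + 1):
        beta = k / steps
        results.append((beta, metric(combine(s1, s2, beta))))
    return results


def best_beta(results: Sequence[Tuple[float, float]]) -> float:
    """Return the beta with the highest metric value (first one on ties)."""
    return max(results, key=lambda pair: pair[1])[0]


# Toy check: with s1 = [1] and s2 = [0] the combined score equals beta,
# and the artificial metric peaks where that score is 0.8.
TOY_RESULTS = sweep_beta([1.0], [0.0], lambda s: -(s[0] - 0.8) ** 2)
TOY_BEST = best_beta(TOY_RESULTS)
```

In an actual study, the metric callable would score the combined similarities against the labeled pairs, for example with AUC.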
Appl. Sci. 2024, 14, x FOR PEER REVIEW

Case Study
To illustrate the traits of StructSim, Table 5 shows the top five authors most similar to "Christos Faloutsos" for PCRW, PathSim and StructSim, based on different meta paths, the linear combination of the meta paths and the meta structure.
The results show that the similar authors returned by StructSim reflect more complex semantic meanings than those returned by the other methods, which focus only on partial semantics.
For example, in the ranking generated by PCRW based on APCPA, Charu C. Aggarwal is simply an author who has published papers in the same conferences as Christos Faloutsos, but he does not pursue the same research topics. Charu C. Aggarwal and Jiawei Han have published more papers in the same conferences as Christos Faloutsos and are therefore ranked in the top two. PathSim finds similar peer authors, such as Jiawei Han, who has the same reputation as Christos Faloutsos in the data mining field. Both PCRW and PathSim concentrate mainly on either the same conference or the same terms. StructSim, in contrast, takes both kinds of semantics into account. As a consequence, StructSim gives the best ranking quality under complicated semantics, which is consistent with human intuition.

Conclusions
In this article, we studied similarity measures for complex semantics in HINs. We proposed a meta-structure-based similarity measure called StructSim, which employs the subgraph expansion probability with bias to define the similarity between objects. Moreover, we designed the Node-Composition Operator and the expansion probability matrix with bias to facilitate the implementation of StructSim. Experiments on YAGO and DBLP show that StructSim performs better than the baselines. In addition, we analyzed the traits of StructSim through a case study.
In the future, we will further study semantic analysis approaches based on more complicated structures, such as motifs, which may be beneficial in various data mining tasks. We are also interested in computational complexity, which is very important in making an approach practical.
$\longrightarrow$ People between Toby Kebbell and Steven Spielberg, which means the actor acts in the movie directed by a specific director. The meta path People $\xrightarrow{\text{actedIn}}$ Movie $\xrightarrow{\text{edited}^{-1}}$ People between Toby Kebbell and Michael Kahn (film editor) denotes that the actor acts in the movie edited by a specific film editor. It is obvious that the semantic similarity between objects differs across meta paths. Therefore, the similarity measure in HINs is related to the meta path.

Figure 1. A toy example of a heterogeneous information network, meta path, and meta structure.

Figure 2. The process of subgraph expansion based on the meta structure.

Figure 2 denotes the process of subgraph expansion based on the meta structure. It is obvious that there exist two nodes connecting with Toby Kebbell through relation $\xrightarrow{\text{actedIn}}$, namely, War Horse (film) and Wrath of the Titans. Hence, the number of subgraphs in the first layer is $|g_b(1 \mid S, G)| = 2$. For subgraph 2(a), the forward subgraph of next-hop node W. Horse is 1(a); the number of forward subgraphs is $|g_f(1 \mid S, G)| = 1$. By a weighted combination of $|g_f(1 \mid S, G)|$ and

Figure 3. A toy example of the Node-Composition Operator.

Figure 4. Meta path and meta structure on entity resolution and clustering tasks.

Figure 5. The results of the linear combination of the meta path in the entity resolution task.

Table 1. A comparison of measures based on the meta path and meta structure in HINs.

Table 3. Effectiveness results of different similarity measures in the entity resolution task.

Table 4. Effectiveness results of different similarity measures in the clustering task.

Table 5. Top 5 most similar authors to "Christos Faloutsos" under different meta paths and meta structures in the DBLP dataset.