You are currently viewing a new version of our website. To view the old version click .
Applied Sciences
  • Article
  • Open Access

2 February 2024

An Algorithm for Finding Optimal k-Core in Attribute Networks

and
1
Chengdu Institute of Computer Applications, Chinese Academy of Sciences, Chengdu 610299, China
2
School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing 100049, China
*
Author to whom correspondence should be addressed.
This article belongs to the Topic Applied Computing and Machine Intelligence (ACMI)

Abstract

As a structural indicator of dense subgraphs, k-core has been widely used in community search due to its concise and efficient calculation. Many community search algorithms have been expanded on the basis of k-core. However, relevant algorithms often set k values based on empirical analysis of datasets or require users to input manually. Once users are not familiar with the graph network structure, they may miss the optimal solution due to an improper k setting. Especially in attribute social networks, characterizing communities with only k-cores may lead to a lack of semantic interpretability of communities. Consequently, this article proposes a method for identifying the optimal k-core with the greatest attribute score in the attribute social network as the target community. The difficulty of the problem is that the query needs to integrate both structural and textual indicators of the community while fully considering the diversity of attribute scoring functions. To effectively reduce computational costs, we incorporate the topological characteristics of the k-core and the attribute characteristics of entities to construct a hierarchical forest. It is worth noting that we name tree nodes in a way similar to pre-order traversal and can maintain the order of all tree nodes during the forest creation process. In such an attribute forest, it is possible to quickly locate the initial solution containing all query vertices and reuse intermediate results during the process of expanding queries. We conducted effectiveness and performance experiments on multiple real datasets. As the results show, attribute scoring functions are not monotonic, and the algorithm proposed in this paper can avoid scores falling into local optima. With the help of the attribute k-core forest, the actual query time of the Advanced algorithm has improved by two orders of magnitude compared to the BaseLine algorithm. In addition, the average F1 score of our target community has increased by 2.04 times and 26.57% compared to ACQ and SFEG, respectively.

1. Introduction

Graphs are often used to represent complex network structures containing large-scale entities, where vertices denote entities and edges illustrate relationships between entities. Common ones include co-authorship relationships in scholar networks [1], friend relationships in social networks [2,3], and protein reaction relationships in biological networks [4,5]. Dense subgraph search has attracted widespread attention from scholars, especially community search based on personalized query conditions such as query vertices and structural indicators. Unlike traditional community detection, community search can complete personalized search tasks, making the target subgraph easier to interpret. As an important community search model, k-core requires each node in the subgraph to have at least k neighbors. Compared with clique, quasi-clique, densest subgraph, k-truss, etc., k-core has become the basis of many dense subgraph algorithm models and has been widely used in industry and academia due to its simple structure and efficient calculation.
Even though the k-core has a lot of advantages, we find that many algorithm models are overly idealized when attempting to determine the k value during query execution. Some take the median as k for queries based on empirical analysis of the dataset, while others are manually specified by the user. It should be noted that different datasets show differences in terms of their characteristics. When employing different structural evaluation indicators such as independence, maximum density, and conductivity, the scoring trend may not necessarily be positively (or negatively) associated with k . As for manual assignment, its uncertainty is greater, and once users are not familiar with the network, the performance of the model will be greatly reduced. Therefore, this article aims to globally search for the optimal k-core containing the specified query vertices.
In addition, with the continuous improvement of sensor devices, entities’ semantic features have become richer, and attribute networks [6,7,8,9] have emerged. People are no longer satisfied with utilizing only structural features to characterize entity relationships in the target network. To improve the semantic interpretability of queries, many models have added attribute correlation evaluation indicators to compensate for semantic deficiencies in simple graphs. However, there is no unified attribute scoring function to evaluate the attribute correlation of subgraphs. Different algorithm models design different attribute rating functions for diverse networks and objectives. To adapt to more application scenarios, we extract the primary keys from different attribute scoring functions in order to grasp their essence.
Given the attribute graph, an attribute scoring function, a query vertex set, and a query attribute set, the goal of this article is to find a k-core that contains query vertices and has the highest attribute rating. This can directly solve the problem of missing the optimal solution due to improper k value selection and can also serve as the basis for other algorithm models. A basic algorithm is to find all k-cores containing query vertices, calculate scores one by one, and return the optimal solution. However, the large computational cost makes it unsuitable for large attribute graphs. Hence, this article proposes an algorithm for searching for the optimal k-core on the attribute networks, integrating structural and attribute indicators, and constructing an index forest, which can improve adaptability and speed. Our main contributions are as follows:
  • Propose the idea of an attribute-optimal k-core. Since it is possible to miss the optimal solution by determining k through empirical analysis or manual selection, this article searches for the optimal k-core on an attribute graph as the target community to find the optimal solution to the problem.
  • Propose the advanced query algorithm. Extract primary keys from commonly used attribute scoring functions to adapt to various scoring needs. Taking advantage of the nested nature of k-core, design bidirectional reachable tree nodes, improve the naming and storage methods of tree nodes, quickly locate tree nodes, and achieve reuse of the calculation process.
  • We conducted numerous experiments on multiple ground-truth datasets. The experimental results on these datasets show that the Advanced algorithm can effectively avoid the scoring function falling into local optima. Its main time consumption is in forest creation, and the actual query time is improved by two orders of magnitude compared to BaseLine.
The rest of this paper is organized as follows: We review the related work in Section 2. Section 3 formally defines the model and proposes two algorithms. The experiments conducted and their results are presented in Section 4. Finally, in Section 5, we summarize the main work and discuss future research directions.

3. Proposed Approach

3.1. Problem Formulation

The problem studied in this article is defined on an undirected attribute graph G = V , E , A , where V , E , A represent the vertex set, edge set and attribute set, respectively. n = V and m = E are the number of vertices and edges, generally m > n . For v V , v is an entity. Pair u , v E is an edge, implying that there is a certain relationship between entities u and v (such as co-authorship, friends, etc.). For each v , there is a set A v representing the text attributes of entity v . Different entities may share common attributes, which means that they may have a certain semantic similarity in these attributes. Table 2 shows the main notations appearing throughout the remain text.
Table 2. Notations and meanings.
To clarify the problem to be solved and the algorithm presented in this article, it is necessary to introduce the relevant concepts first.
Definition 1 (k-core).
Given a graph G and a positive integer k , the subgraph H is called a k-core of G , denoted as C k , if and only if H is a maximum connected subgraph and v H , deg r e e ( v )     k .
It is worth emphasizing that the k-core obtained by the classical Peeling algorithm may contain multiple connected subgraphs, some of which can be viewed as multiple kinds of k-cores. For example, in Figure 1, the subgraph composed of {4, …, 26} can be viewed as a 4-core in the past, which contains two connected subgraphs. But in this article, according to Definition 1, we consider the subgraph composed of vertex set {9, …, 14} as a separate 5-core, which is more appropriate in real-world application scenarios and more conducive to the subsequent creation of forests.
Figure 1. Sample network and its corresponding k-cores.
Definition 2 (core number).
For v   G , the core number of a vertex v is c o r e ( v ) = m a x { k | v C k } , which is the maximum value of k among all C k containing vertex v .
Generally, k max and k min are used to represent the maximum and minimum core number of the graph G , while C k max represents the most tightly structured subgraph in graph G .
k-core can only characterize the compactness between vertices structurally, but for attribute graphs, each vertex has a different number of keywords to describe its characteristics. In order to achieve personalized queries and further explain the reasons for community formation, various attribute community search models have designed multiple attribute scoring functions [16,18,19,20,21] based on actual application scenarios to evaluate the quality of attribute communities.
Different models define attribute scoring functions differently, but they are usually derived from some primary keys. Assuming that H is the subgraph to be evaluated and S is the query attribute set, this article studies the following primary keys commonly used in attribute scoring functions:
  • V ( H ) : the vertex set of subgraph H ;
  • n ( H ) : the number of vertices in H , i.e., n ( H ) = V ( H ) ;
  • A ( v ) : the attribute set of vertex v ;
  • V w : the set of vertices in subgraph H containing the attribute w ;
Based on the above primary keys, some attribute scoring functions are defined as follows.
f H , S = v V H A v S
Equation (1) determines the number of attributes shared by all vertices in subgraph H with the query attribute set S .
f ( H , S ) = w S V w V ( H ) 2 n ( H )
First proposed by [19], Equation (2) determines the criteria that should be followed for the rating function of attribute communities. It is particularly pointed out that attributes have different contributions to community ratings, and the more vertices they are shared, the stronger the explanatory power of attributes on community semantics. Therefore, Equation (2) not only depends on the number of vertices containing the attribute w but also considers the weight of the attribute w in the subgraph H .
f ( H , S ) = w S V w V ( H ) 2 n ( H ) S
Equation (3) is a modification of Equation (2), which rewards larger communities under the same conditions. That is, if each vertex of a large community covers most of the query attributes, it should receive a higher attribute score than its subgraph.
f ( H , S ) = 1 , S w | w v H A ( v ) 0 , o t h e r s
As shown in Equation (4), some modes only require the target subgraph to cover all query attributes and then propose more requirements from different aspects, such as keyword distance [17], community diameter [21,23,24], similarity [18,25], which enriches application scenarios. Therefore, the goal of this article is to find the best k-core on the attribute graph that maximizes Equations (1)–(3). The formal problem definition is as follows:
Definition 3 (attribute optimal k-core).
Given an undirected attribute graph G , a query vertex set Q , a query attribute set S , and an attribute score function f H , S , for any k min k k max , if C k contains all query vertices in Q and makes f H , S optimal, then the subgraph C k is the attribute optimal k-core of G .

3.2. Algorithms for Finding the Optimal k-Core

To find the attribute optimal k-core, this section proposes two solutions, including the BaseLine algorithm and the Advanced algorithm, and analyzes their complexity.

3.2.1. BaseLine Algorithm

The BaseLine algorithm is easy to understand and implement, including the following steps:
  • Step 1: Perform core decomposition on attribute graph G to obtain the core numbers of all vertices.
  • Step 2: For any k min k k max , select all vertices with c o r e v = k to construct all connected subgraphs as the candidate C k set.
  • Step 3: Filter out the subgraphs uncovering the query node set Q from the candidate C k set, and obtain the feasible solution C k set.
  • Step 4: Calculate attribute scores for all feasible solution C k , and return the subgraph with the best score as the optimal solution.
Performing core decomposition on the graph and constructing all C k based on the number of core vertices requires traversing all vertices and edges multiple times, which takes O ( k max ( V + E ) ) time. In the worst-case, if step 3 does not filter out any subgraphs, then attribute scores need to be calculated for all C k . Let O ( F k ) be the time to compute the score of C k , then the time complexity of the BaseLine algorithm is O ( k max ( V + E ) + k min k max F k ) . Although the BaseLine algorithm is accurate and can finish in polynomial time, it is not user-friendly for large attribute networks.

3.2.2. Advanced Algorithm

We find that the performance bottleneck of the BaseLine algorithm lies in constructing C k for each k min k k max and checking whether it contains the query vertex set Q . Most of the time, Q may only exist in part of C k , especially when the graph contains multiple connected subgraphs, resulting in more invalid computations. In addition, when Q changes, the previous calculation result cannot be reused, and the feasible C k set needs to be filtered again, which is very unfavorable for online queries. Therefore, we deeply analysis the characteristics of the C k set and design a two-stage query algorithm:
  • Stage 1: Utilize the nested nature of C k to construct a hierarchical attribute k-core forest and sort the tree nodes.
  • Stage 2: Locate the query vertex set Q to the relevant tree nodes, construct an initial feasible solution, and then gradually spread to the marginal tree nodes to find the optimal attribute C k .
1.
Constructing the attribute k-core forest
According to Definition 1, for k min k 1 < k 2 k max , their corresponding k-cores are C k 1 and C k 2 , respectively. If C k 1 and C k 2 belong to the same connected graph, then C k 2 C k 1 , which is the nesting property of k-cores [26,27]. Trees can be used to organize and represent the relationships between different C k . Combining the primary keys of these attribute scoring functions, we define the attribute C k tree node as follows.
Definition 4 (attribute k-core tree node).
For any C k in attribute graph G , there is a unique tree node N i t , where t is the unique identifier of the tree and i is the index number of the node in the tree. N i t contains p a r e n t , c h i l d r e n , v e r t e x S e t , and i n v e r t e d I n d e x , which are, respectively used to identify its parent tree node, children tree nodes, connected k-shell vertex set, and the keyword inverted indexes in v e r t e x S e t .
Definition 5 (parent tree node).
Given the attribute graph G , for k min k 1 < k 2 k max , their k-cores are C k 1 and C k 2 , respectively, and their tree nodes are represented by N k 1 t and N k 2 t . If N k 1 t and N k 2 t belong to the same tree, C k 2 C k 1 , and there is no k 1 < k * < k 2 , then N k 1 t is the parent tree node of N k 2 t .
Definition 6 (attribute k-core forest).
Given the attribute graph G , create attribute k-core tree nodes for each k min k k max . If the tree node has a parent tree node, then connect them. All trees form the attribute k-core forest.
It is important to note that k may not be continuous, so a tree node’s core number does not necessarily equal its father tree node’s core number minus 1. As shown in Figure 2, subtrees with roots N 1 0 and N 2 0 can be restored to 2-core and 4-core, respectively. However, due to the lack of vertices with c o r e v = 3 , N 2 0 becomes the legitimate child node of N 1 0 , but their core numbers are not continuous. Furthermore, according to Definition 1, the subgraph composed of {9, …, 14} is actually a 5-core. So, although N 2 0 and N 5 0 are both child nodes of N 1 0 , they are located at different C k . Therefore, in the algorithm design for building attribute forest, we should only create tree nodes for the actual k that exists, rather than all integer in 0 k k max .
Figure 2. Hierarchical forest and the sorted tree nodes.
In order to quickly locate the tree node where the query vertex is located during the query phase, we store the tree node in a set of Bins. We organize Bins according to the following rules:
  • If c o r e v of the connected k-shell set in N i t equals k , then add N i t to b i n k .
  • For tree nodes in the same b i n k , sort them by tree number t first. Then, for nodes belonging to the same tree, sort them by the minimum rank of vertices in the vertexSet.
So when locating the query vertex, we only need to search for the b i n k based on c o r e v , and further locate its corresponding tree node with help of binary search algorithm. Note that if the located tree nodes belong to different trees, the query ends immediately because our target subgraph is at least a connected graph.
Algorithm 1 shows the pseudo-code for constructing a hierarchical forest. On the basis of core decomposition, hierarchical trees are constructed for each connected subgraph, finally resulting in a hierarchical forest. Lines 5, 6 create the root node of each tree, and line 7 recursively constructs its child nodes. Note that in Procedure B u i l d _ c h i l d _ n o d e , the tree nodes are numbered in the preorder traversal order. It also maintains p a r e n t _ k and c h i l d _ k during each recursive, which can avoid establishing blank tree nodes and reduces the storage cost of forests. In addition, the algorithm initially leaves bins empty and then creates a new b i n k when discovering an unclassified core number k . In other cases, only update the corresponding b i n k . For a parent tree node, after all its child nodes are created, its attribute inverted index can be constructed as shown in line 10 of Procedure B u i l d _ c h i l d _ n o d e , since now all nodes in its v e r t e x S e t belong to the same connected k-shell.
Algorithm 1 Building forest
Input: A graph G = ( V , E , A )
Output: the hierarchical forest B i n s
1: t 0 , B i n s
2: compute c o r e v of each vertex in G by core decomposition;
3: construct connected S u b g r a p h s  from G ;
4: for g r a p h in S u b g r a p h s :
5:   k min min ( c o r e ( v ) )   in   C o r e ( g r a p h ) , n o d e _ i n d e x 0 ;
6:  create root tree node N n o d e _ i n d e x t  and update b i n k min ;
7:   B u i l d _ c h i l d _ n o d e ( N n o d e _ i n d e x t , k min ) ;
8:  update tree indicator t ;
9: return B i n s .
Procedure B u i l d _ c h i l d _ n o d e ( p a r e n t _ n o d e , p a r e n t _ k ) :
1: select vertices with c o r e v > p a r e n t _ k as candidate children vertex set;
2: if candidate children vertex set is not empty:
3:  obtain connected subgraphs composed of candidate vertex sets as c h i l d r e n _ g r a p h s ;
4:  for each graph in c h i l d r e n _ g r a p h s :
5:    n o d e _ i n d e x n o d e _ i n d e x + 1 ;
6:   select min ( c o r e v ) in this graph as c h i l d _ k ;
7:   create c h i l d _ n o d e N n o d e _ i n d e x t and update b i n c h i l d _ k ;
8:   update p a r e n t _ n o d e ;
9:    B u i l d _ c h i l d _ n o d e ( c h i l d _ n o d e , c h i l d _ k ) ;
10: create attribute reverse index for p a r e n t _ n o d e ;
11: end B u i l d _ c h i l d _ n o d e
As shown in Figure 2, the hierarchical forest corresponding to the network in Figure 1 is composed of two trees, including a total of eight tree nodes. Starting from the root of the tree, as the depth increases, the core number of tree nodes gradually increases (not necessarily continuously), indicating an increasingly compact structure.
2.
Node localization and extended query
We apply the attribute hierarchical forest to the Advanced query algorithm, which consists of two parts: identifying the Lowest Common Ancestor node (LCA), and expanding the query, as shown in Algorithms 2 and 3.
Algorithm 2 LCA algorithm
Input: the hierarchical forest B i n s , the query vertex set Q
Output: the lowest common ancestor L C A n c e s t o r
1: N o d e l i s t , i _ min N o n e , i _ max N o n e
2: for v  in  Q :
3:  locate b i n k  according c o r e v ;
4:  perform a binary search the tree node N i t , where v     N i t . v e r t e x S e t ;
5:  add N i t to N o d e l i s t ;
6:  update i _ min   and   i _ max ;
7: end for
8: if the tree indicator t   of   N i t   in   N o d e l i s t are different:
9:  return N o n e # the query nodes do not belong to the same tree and have no common ancestor.
10: else:
11:  compute the lowest common ancestor for N i _ min t and N i _ max t as L C A n c e s t o r ;
12: end if
According to the hierarchical tree construction process shown in Algorithm 1, the tree nodes are numbered in a preorder traversal manner, which endows the tree nodes with the following query properties.
Property 1.
For tree nodes N i t 1 and N j t 2 , if t 1 t 2 , then they do not belong to the same tree, that is, the vertices in their vertexSet are not connected.
Property 2.
For tree nodes N i 1 t and N i 2 t , if i 1 < i 2 , then N i 1 t is created before N i 2 t .
Property 3.
For tree nodes N i 1 t , N i t , N i 2 t , if i 1 < i < i 2 , then the LCA of N i 1 t and N i 2 t is also the LCA of all of them.
Firstly, the Algorithm 2 locates relevant tree nodes (see lines 3, 4) based on the core number of the query vertices, and records the minimum and maximum tree node index ( i _ min and i _ max , respectively). According to Property 1, if the located tree nodes do not belong to the same tree, the query ends prematurely. According to the Properties 2 and 3, the LCA of N i _ m i n t and N i _ m a x t is also the LCA of all tree nodes in N o d e l i s t . Therefore, it is only necessary to perform a classic LCA search algorithm for N i _ m i n t and N i _ m a x t once. If you want it to be faster, you can consider a fast LCA algorithm. The subtree with the LCA node as the root node can be restored to the tightest subgraph containing all query vertices. Algorithm 3 takes it as the initial feasible solution and enters the expanding query stage to search for the optimal solution.
As shown in Figure 2, when Q = { 4   ,   9   ,   16   ,   25 } , Algorithm 2 locates relevant tree nodes as N o d e l i s t = { N 2 0 ,   N 3 0 ,   N 4 0 ,   N 5 0 } , i _ min = 2 ,   i _ max = 5 . Simply locating the LCA of N 2 0 and N 5 0 as N 1 0 , which is also the ancestor of N 3 0 and N 4 0 . The subtree rooted on N 1 0 can be restored to a connected 2-core, which is the initial feasible solution containing all query vertices.
Algorithm 3 Expanding query algorithm
Input: the hierarchical fores B i n s , the query attribute set S , the attribute score function f H , S and L C A n c e s t o r
Output: the optimal attribute k-core C b e s t
1: V ( H ) , V w , C b e s t N o n e , s c o r e _ b e s t N o n e
2: initialize V ( H ) , V w by traversing the subtree rooted in L C A n c e s t o r ;
3: s c o r e _ b e s t compute f ( H , S ) with V ( H ) and V m ;
4: C b e s t H ;
5: f i n i s h e d _ s u b t r e e L C A n c e s t o r ;
6: while f i n i s h e d _ s u b t r e e . p a r e n t exists:
7:   c u r r e n t _ r o o t f i n i s h e d _ s u b t r e e . p a r e n t ;
8:  if c u r r e n t _ r o o t has more than a child tree node:
9:     for each c h i l d in c u r r e n t _ r o o t . c h i l d r e n but not the f i n i s h e d _ s u b t r e e :
10:   traverse the subtree rooted on c h i l d in pre-order, and iteratively update primary keys V ( H ) ,   V w ;
11:   compute s c o r e of H ;
12:   if s c o r e is better than s c o r e _ b e s t :
13:      update s c o r e _ b e s t , C b e s t ;
14:   f i n i s h e d _ s u b t r e e c u r r e n t _ r o o t ;
15: return C b e s t
In Algorithm 3, the primary keys and optimal solution C b e s t are first initialized with the subtree rooted in L C A n c e s t o r , as shown in lines 2–4. If L C A n c e s t o r is not the root of the tree, then we take its parent tree node as c u r r e n t _ r o o t and enter the expansion phase. Note that the subtree rooted on L C A n c e s t o r has completed the calculation of the primary keys and can be recorded as f i n i s h e d _ s u b t r e e (line 5). Subsequently, only other child subtrees of c u r r e n t _ r o o t will be traversed. The traversal process continuously updates the primary keys, and only calculates the score of the new graph when tracing back to c u r r e n t _ r o o t , as shown in lines 9–13. After that, we update f i n i s h e d _ s u b t r e e again and continue to expand upward until the root of the tree to obtain the optimal solution C b e s t .
Assuming Q = { 4 ,   16 ,   25 } , according to Algorithm 2, the L C A n c e s t o r is located as N 2 0 , that is, the initial feasible solution is C 4 . According to Algorithm 3, since N 2 0 is not the root of the tree, it needs to be extended upward to the N 1 0 node to calculate subgraph C 2 . Note that N 1 0 has two child nodes, but N 2 0 has already completed the calculation. It only needs to visit N 5 0 , update V H and V w , and complete the calculation of the 2-core score when going back to N 1 0 . In the same way, continue to expand to N 0 0 and calculate the score of C 1 . At this time, C b e s t is obtained.
Although the time of the Advanced algorithm can be divided into two parts: forest creation and extended query, once the hierarchical forest is created, it only needs dynamic maintenance [28]. Therefore, the main query time depends on the depth of L C A n c e s t o r . The greater the depth, the more iterations and the longer the time. Since Algorithm 3 computes scores only on the ancestor path of L C A n c e s t o r , the statistical results of primary keys can be reused, the query time will be significantly reduced.

4. Experiments and Results

In this part, multiple experiments are designed to verify the effectiveness and efficiency of the proposed algorithms. In order to find the optimal attribute k-core, we implement the BaseLine and Advanced algorithms for the attribute scoring functions (see Equations (1)–(3)) listed in Section 3.1.
The datasets in the experiments are shown in Table 3, where A V is the number of unique attributes in the dataset, A v ¯ is the average number of attributes per vertex, and k max is the maximum core number of the dataset. Each dataset contains several high-quality ground-truth communities and attributes. In particular, for the Facebook dataset, which contains 10 different heterogeneous networks. For convenience of expression, we use F X to represent the heterogeneous network centered on vertex X , such as F414. The Facebook dataset can be obtained from SNAP, and the Twitter and Flickr datasets are quoted from the literature [29,30].
Table 3. Statistics of datasets.
For the fairness and effectiveness of the experiment, we selected the top 10 ground-truth communities with the largest scale from each dataset for querying. For each community to be evaluated, select the top 30% of attributes that appear the most frequently and are different from other communities as their representative attributes. The F1 score is used as an evaluation indicator when evaluating community quality. By default, the number of query vertexes is 12, and the query attributes are all representative attributes of a community. A single-factor evaluation method is employed to evaluate the impact of parameters on the model.

4.1. Evaluating the Effectiveness of Algorithms

We refer to the subgraph containing all query vertices as a feasible solution. In order to observe the ratings of different k-cores as comprehensively as possible, we identify the most tightly structured ground-truth communities. By executing the BaseLine algorithm and recording the scores of all feasible solutions, Table 4 shows the optimal k-core with different scoring functions. Note that f 1 , f 2 , and f 3 , respectively, represent the scoring functions shown in Equations (1)–(3).
Table 4. Best k for different attribute scoring functions.
As can be seen from Table 4, for the same query conditions, the optimal k-core is quite different when adopting different attribute scoring functions. Figure 3 shows the rating curves of networks F107, F348, F1684, F1912, Twitter, and Flickr when adopting Equation (3). The results illustrate that the same scoring function has different trends on different datasets, and there are many local optimal solutions. This fully demonstrates that determining the k value by manual selection or dataset analysis cannot guarantee global optimality, which also means that the problem proposed in this article is necessary.
Figure 3. Attribute scoring of k-cores on different datasets. (a) F107 (b) F348 (c) F1684 (d) F1912 (e) Twitter (f) Flickr.

4.2. Evaluating the Performance of Algorithms

The time of the BaseLine algorithm includes core decomposition and score computation, while the time of the Advanced algorithm additionally includes building the forest. We have calculated the execution time of each part, which helps identify the efficiency bottleneck of the algorithm. Since multiple heterogeneous networks on Facebook differ significantly in scale, only the query times of a few larger heterogeneous networks are displayed here, as shown in Figure 4.
Figure 4. Runtime of algorithms on different datasets. (a) Runtime with Equation (1) (b) Runtime with Equation (2).
The experimental results show that regardless of the datasets, the Advanced algorithm is always faster than the BaseLine algorithm. Upon careful observation of the time consumption at different stages of the algorithm, we find that in the BaseLine algorithm, score calculation accounts for the largest proportion of time. This is consistent with theoretical analysis since the BaseLine algorithm must calculate the score of every feasible solution, and the calculation process cannot be shared. As for the Advanced algorithm, building a hierarchical forest costs the most time. The actual query time includes two parts: locating the LCA for query vertices and extending queries. With the excellent properties of attribute tree nodes and iterative updates of primary keys, the Advanced algorithm’s computation time is greatly reduced. If the forest construction time is not taken into account, the Advanced algorithm is two orders of magnitude faster than the BaseLine algorithm. By dynamically maintaining the forest, the Advanced algorithm can achieve millisecond-level online queries.

4.3. Estimating the Quality of Communities

In order to evaluate the quality of the target community discovered by the Advanced algorithm, we compare it with the communities queried by the ACQ and SFEG algorithms. This is because ACQ uses k-core as the structural indicator and constructs a k-shell tree for the entire graph, which is similar to our forest; SFEG proposes an elastic mechanism to alleviate premature stopping of greedy algorithms, which expands the applicability of local optimal solutions.
As shown in Figure 5, our target community quality is superior to that of ACQ and SFEG. The reason is that ACQ only supports one query vertex. When the query vertex set has multiple vertices, ACQ randomly selects one of them as the query vertex. Especially in large attribute networks, the attribute density of vertices is higher, coupled with overlapping communities, leading to poorer query performance, such as in Flickr. Although the SFEG algorithm supports multiple query nodes, it still requires manually specifying the challenge threshold for the greedy algorithm. This makes it unable to avoid falling into local optima.
Figure 5. Community quality on different datasets.

4.4. Parameter Sensitivity

This section evaluates the impact of query vertices and query attributes on algorithm efficiency and effectiveness through experiments. As mentioned earlier, after the forest construction is complete, the Advanced query algorithm only includes two stages: LCA (Algorithm 2) and Expand (Algorithm 3). This section will illustrate how parameters Q and S affect the efficiency of the Advanced algorithm and the quality of the target community.
As shown in Figure 6, the larger the query attribute set, the longer it takes to calculate the score, as there are more primary keys to calculate. The attribute density of a dataset also affects query time. In Flickr, A v ¯ = 24.5, while in F1684, A v ¯ = 7.85, which leads to a significant difference in the number of representative attributes of the ground-truth community. The higher the attribute density of the dataset, the more representative it contains, and the longer the average query time. The more representative attributes a query attribute set contains, the more accurately the community can be queried.
Figure 6. Varying query attributes. (a) Time with different attributes (b) F1 score with different attributes.
Figure 7a shows that the more query vertices there are, the more related tree nodes can be located, and the distribution may be broader as well. More importantly, the depth of their common ancestors may be shallower, which to some extent reduces the query time. Nevertheless, the core number of vertices in the dataset has a significant impact on query time. For example, despite its smaller size, F1912 has a much higher k max and average degree than Twitter, resulting in more query iterations and longer query times. Too many query vertices can lead to excessive dispersion of related tree nodes, large community size, and mixing with low-quality vertices. Especially in loosely structured networks, such as F1684 in Figure 7b, LCA is too close to the tree root, resulting in a loss of community quality.
Figure 7. Varying query vertices. (a) Time with different vertices (b) F1 score with different vertices.
Overall, the algorithm proposed in this article can quickly find the globally optimal k-core for the target community. But the experiment conducted in this article is based on high-quality, large-scale communities in datasets. If applied to development scenarios, the model’s accuracy will be affected. Additionally, it is necessary to design stable and efficient algorithms to maintain attribute forests in dynamic network graphs, which will be the model’s future improvement direction.

5. Conclusions

The aim of this article is to find the attribute optimal k-core community within the attribute network in order to overcome the problem that empirical analysis and manual selection of k values may miss the optimal solution. Hence, we propose the concept of global optimal attribute k-core and design the BaseLine algorithm. To address the bottleneck of the BaseLine algorithm, we have constructed a hierarchical k-core forest that integrates structural and attribute semantic indicators and utilizes k-core’s nested nature. The attribute forest can transform complex graph problems into classical tree operations, reducing the difficulty of queries. Based on attribute forests, we have developed a two-stage advanced algorithm. It can locate the LCA of all relevant tree nodes at once and iteratively update the primary keys during the extended query stage, recycling the intermediate calculation process. Several experiments on real datasets show that the attribute optimal k-core is primarily affected by the scoring function, the query vertices, and the query attributes. Excluding the time required to create the forest, the advanced algorithm is two orders of magnitude faster than the BaseLine algorithm. In terms of community quality, the target community is significantly better than ACQ and SFEG. In order to improve the effectiveness and efficiency of the algorithm in the future, we will further adapt more attribute scoring functions and optimize the forest maintenance method for dynamic graphs.

Author Contributions

Conceptualization, J.L. and Y.Z.; methodology, J.L. and Y.Z.; software, J.L.; validation, J.L. and Y.Z.; formal analysis, J.L.; investigation, J.L.; resources, J.L.; data curation, J.L.; writing—original draft preparation, J.L.; writing—review and editing, Y.Z.; visualization, J.L.; supervision, Y.Z.; project administration, Y.Z.; funding acquisition, Y.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Sichuan Sciences and Technology Program (No. 2023ZHCG0005).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

You can obtain the Facebook data set from https://snap.stanford.edu/ (accessed on 23 June 2023), and other data sets from https://github.com/he-tiantian/Attributed-Graph-Data/ (accessed on 28 June 2023).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Li, C.; Tang, Y.; Xiao, Z.; Li, T. Influential scholar recommendation model in academic social network. J. Comput. Appl. 2020, 40, 6. [Google Scholar]
  2. Sozio, M.; Gionis, A. The community-search problem and how to plan a successful cocktail party. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, 25–28 July 2010; pp. 939–948. [Google Scholar]
  3. Leskovec, J.; Mcauley, J. Learning to discover social circles in ego networks. Adv. Neural Inf. Process. Syst. 2012, 25, 539–547. [Google Scholar]
  4. Palla, G.; Derényi, I.; Farkas, I.; Vicsek, T. Uncovering the overlapping community structure of complex networks in nature and society. Nature 2005, 435, 814–818. [Google Scholar] [CrossRef] [PubMed]
  5. Cui, W.; Xiao, Y.; Wang, H.; Lu, Y.; Wang, W. Online search of overlapping communities. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, New York, NY, USA, 22–27 June 2013; pp. 277–288. [Google Scholar]
  6. Zhu, Q.; Hu, H.; Xu, C.; Xu, J.; Lee, W.C. Geo-social group queries with minimum acquaintance constraints. VLDB J. 2017, 26, 709–727. [Google Scholar] [CrossRef]
  7. Xie, X.; Song, M.; Liu, C.; Zhang, J.; Li, J. Effective influential community search on attributed graph. Neurocomputing 2021, 444, 111–125. [Google Scholar] [CrossRef]
  8. Sun, H.; Huang, R.; Jia, X.; He, L.; Sun, M.; Wang, P.; Sun, Z.; Huang, J. Community search for multiple nodes on attribute graphs. Knowl.-Based Syst. 2020, 193, 105393. [Google Scholar] [CrossRef]
  9. Lin, P.; Yu, S.; Zhou, X.; Peng, P.; Li, K.; Liao, X. Community search over large semantic-based attribute graphs. World Wide Web 2022, 25, 927–948. [Google Scholar] [CrossRef]
  10. Seidman, S.B. Network structure and minimum degree. Soc. Netw. 1983, 5, 269–287. [Google Scholar] [CrossRef]
  11. Batagelj, V.; Zaversnik, M. An o (m) algorithm for cores decomposition of networks. arXiv 2003, arXiv:cs/0310049. [Google Scholar]
  12. Liu, B.; Zhang, F. Incremental algorithms of the core maintenance problem on edge-weighted graphs. IEEE Access 2020, 8, 63872–63884. [Google Scholar] [CrossRef]
  13. Chowdhary, A.A.; Liu, C.; Chen, L.; Zhou, R.; Yang, Y. Finding attribute diversified community over large attributed networks. World Wide Web 2022, 25, 569–607. [Google Scholar] [CrossRef]
  14. Ghosh, B.; Ali, M.E.; Choudhury, F.M.; Apon, S.H.; Sellis, T.; Li, J. The flexible socio spatial group queries. Proc. VLDB Endow. 2018, 12, 99–111. [Google Scholar] [CrossRef]
  15. Islam, M.S.; Ali, M.E.; Kang, Y.B.; Sellis, T.; Choudhury, F.M.; Roy, S. Keyword aware influential community search in large attributed graphs. Inf. Syst. 2022, 104, 101914. [Google Scholar] [CrossRef]
  16. Fang, Y.; Cheng, C.K.; Luo, S.; Hu, J. Effective community search for large attributed graphs. Proc. VLDB Endow. 2016, 9, 1233–1244. [Google Scholar] [CrossRef]
  17. Zhang, Z.; Huang, X.; Xu, J.; Choi, B.; Shang, Z. Keyword-centric community search. In Proceedings of the 2019 IEEE 35th International Conference on Data Engineering (ICDE), Macao, China, 8–11 April 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 422–433. [Google Scholar]
  18. Liu, Q.; Zhu, Y.; Zhao, M.; Huang, X.; Xu, J.; Gao, Y. VAC, vertex-centric attributed community search. In Proceedings of the 2020 IEEE 36th International Conference on Data Engineering (ICDE), Dallas, TX, USA, 20–24 April 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 937–948. [Google Scholar]
  19. Huang, X.; Lakshmanan, L.V.S. Attribute truss community search. arXiv 2016, arXiv:1609.00090. [Google Scholar]
  20. Wang, C.; Wang, H.; Chen, H.; Li, D. Attributed community search based on effective scoring function and elastic greedy method. Inf. Sci. 2021, 562, 78–93. [Google Scholar] [CrossRef]
  21. Luo, J.; Cao, X.; Xie, X.; Qu, Q.; Xu, Z.; Jensen, C.S. Efficient attribute-constrained co-located community search. In Proceedings of the 2020 IEEE 36th International Conference on Data Engineering (ICDE), Dallas, TX, USA, 20–24 April 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 1201–1212. [Google Scholar]
  22. Chu, D.; Zhang, F.; Lin, X.; Zhang, W.; Zhang, Y.; Xia, Y.; Zhang, C. Finding the best k in core decomposition, A time and space optimal solution. In Proceedings of the 2020 IEEE 36th International Conference on Data Engineering (ICDE), Dallas, TX, USA, 20–24 April 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 685–696. [Google Scholar]
  23. Guo, T.; Cao, X.; Cong, G. Efficient algorithms for answering the m-closest keywords query. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, Victoria, Australia, 31 May–4 June 2015; pp. 405–418. [Google Scholar]
  24. Wang, K.; Wang, S.; Cao, X.; Qin, L. Efficient radius-bounded community search in geo-social networks. IEEE Trans. Knowl. Data Eng. 2020, 34, 4186–4200. [Google Scholar] [CrossRef]
  25. Luo, J.; Cao, X.; Xie, X.; Qu, Q. Best co-located community search in attributed networks. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, Beijing, China, 3–7 November 2019; pp. 2453–2456. [Google Scholar]
  26. Matula, D.W.; Beck, L.L. Smallest-last ordering and clustering and graph coloring algorithms. JACM 1983, 30, 417–427. [Google Scholar] [CrossRef]
  27. Sariyuce, A.E.; Seshadhri, C.; Pinar, A.; Catalyurek, U.V. Finding the hierarchy of dense subgraphs using nucleus decompositions. In Proceedings of the 24th International Conference on World Wide Web, Florence, Italy, 18–22 May 2015; pp. 927–937. [Google Scholar]
  28. Lin, Z.; Zhang, F.; Lin, X.; Zhang, W.; Tian, Z. Hierarchical core maintenance on large dynamic graphs. Proc. VLDB Endow. 2021, 14, 757–770. [Google Scholar] [CrossRef]
  29. He, T.; Liu, Y.; Ko, T.H.; Chan, K.C.; Ong, Y.S. Contextual correlation preserving multiview featured graph clustering. IEEE Trans. Cybern. 2019, 50, 4318–4331. [Google Scholar] [CrossRef] [PubMed]
  30. He, T.; Bai, L.; Ong, Y.S. Vicinal vertex allocation for matrix factorization in networks. IEEE Trans. Cybern. 2021, 52, 8047–8060. [Google Scholar] [CrossRef] [PubMed]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Article Metrics

Citations

Article Access Statistics

Multiple requests from the same IP address are counted as one view.