A Restart Local Search for Solving Diversified Top-k Weight Clique Search Problem

Abstract: The diversified top-k weight clique (DTKWC) search problem is an important generalization of the diversified top-k clique (DTKC) search problem with practical applications. It aims to find k maximal cliques that together cover the maximum total weight in a vertex-weighted graph. In this work, we propose a novel local search algorithm called TOPKWCLQ for the DTKWC search problem, which relies on two main strategies. First, a restart strategy is adopted, which repeats the construction and updating processes of the maximal weight clique set. Second, a scoring heuristic is designed that gives different priorities to the maximal weight cliques in the candidate set. In addition, a constraint model of the DTKWC search problem is constructed so that the quality of the solutions can be evaluated. Experimental results show that TOPKWCLQ outperforms the comparison algorithm on large-scale real-world graphs.


Introduction
Given an undirected graph G = (V, E), a clique is a subset of the vertices of G in which any two vertices are adjacent. The maximum clique (MC) is a clique with the largest cardinality in G. The maximum weight clique (MWC) problem generalizes MC by assigning a positive integer to each vertex as its weight. The diversified top-k clique (DTKC) search problem aims to find a set of at most k maximal cliques that covers as many vertices as possible, where k is a user-supplied parameter. The diversified top-k weight clique (DTKWC) search problem [1] seeks a set of at most k maximal weight cliques in G whose covered vertices have the largest total weight; it can be readily verified to be an NP-hard problem [2].
The MC problem and related problems have many real-world applications, such as combinatorial auctions [3], community detection [4,5] and video object segmentation [6]. Recently, considerable attention has also been paid to solving top-k problems on large graphs [2,7,8]. Such problems arise in practical applications such as influential communities [9] and motif discovery in molecular biology [10]. For example, a citation network can be represented as a graph G in which papers are the vertices and citation relationships are the edges between papers. The influence of a paper is viewed as its weight in G. Searching for the top-k maximal diverse groups covering different domains in G can then be regarded as finding a DTKWC solution.
To solve the MC and MWC problems on large-scale graphs effectively, a number of methods have been proposed. These algorithms are usually divided into two categories. The first consists of exact algorithms, which guarantee the optimality of their solutions, such as [11-13]. However, exact algorithms may fail to solve larger graphs within a reasonable time. The second consists of local search algorithms, which aim to find a suboptimal solution within a reasonable time on medium and large graphs. A large amount of effort has been devoted to designing local search algorithms; for example, many of them target the MWC problem, e.g., [14,15]. Although many algorithms exist for the MC and MWC problems, there are currently very few methods for diversified top-k cohesive groups. Yuan et al. [7,8] (2015, 2016) proposed the concept of DTKC and provided an approximate algorithm for it. Wu et al. [2] (2020) provided a local search algorithm for the DTKC search problem on large graphs, which is the state-of-the-art algorithm for that problem. Wu and Yin [16] (2021) introduced a problem of finding cohesive groups, named the DTKSP problem, and developed a local search method for it based on new heuristic strategies. Zhou et al. [1] (2021) encoded the DTKWC search problem into the weighted partial MaxSAT (WPMS) problem, using direct encoding (DE) and independent set partition based encoding (ISPE), and solved the resulting WPMS instances with state-of-the-art solvers. However, this method is of limited use on large real-world graphs, because it fails to encode large graphs into WPMS.
In this work, we propose a local search algorithm for the DTKWC search problem on large graphs. It provides a locally optimal solution within a reasonable time and avoids generating and storing all maximal weight cliques, thereby addressing the limitation described above. The algorithm, named TOPKWCLQ (which stands for top-k weight cliques), is based on two main strategies.
The first strategy is a restart method that deals with the cycling problem. TOPKWCLQ first initializes the set of maximal weight cliques, then repeatedly creates a new maximal weight clique and updates the set through a scoring function. When the candidate solution has not improved within a fixed number of steps, the algorithm restarts, keeping the best candidate solution found so far.
The second strategy is a scoring function that gives different priorities to the maximal weight cliques in the candidate solution. During the search, TOPKWCLQ constructs and then maintains a candidate solution of size at most k by adding or removing maximal weight cliques according to their score values. The score of a maximal weight clique is the total weight of the vertices that belong exclusively to that clique within the candidate solution.
To date, there is no suitable algorithm to compare against for the DTKWC search problem on large-scale real-world graphs. We therefore compare TOPKWCLQ with a commercial solver, CPLEX, using the constraint formulation proposed in this paper. Extensive experiments demonstrate that the proposed algorithm is both effective and efficient on large-scale real-world graphs.
The remainder of the paper is organized as follows. In Section 2, we present the necessary background on the diversified top-k weight clique search problem and formalize the DTKC and DTKWC search problems. In Section 3, we describe the TOPKWCLQ algorithm and the techniques it implements. In Section 4, we report extensive experimental results that demonstrate the high performance of TOPKWCLQ compared to CPLEX with our model on the DTKWC search problem. Finally, conclusions are given in Section 5.

Diversified Top-k Weight Clique Search Problem
In this section, we introduce the notation and basic definitions used for the DTKWC search problem. Then the proof of the NP-hardness of the DTKWC search problem is given. Next, we propose the constraint formulations that serve as the mathematical models of the DTKC and DTKWC search problems in the CPLEX solver.

Definition and Notations
A weighted graph $G = (V, E, w)$ is a graph with $|V|$ vertices and $|E|$ edges, where $w$ is a weight function that assigns to each vertex $v_i \in V$ a non-negative integer $w(v_i)$ representing its weight. The notation $v_i^{w(v_i)}$ indicates that vertex $v_i$ has weight $w(v_i)$.
Definition 1 (Maximal clique (MC)). Given an unweighted graph $G(V, E)$, a clique $c$ in $G$ is a set of vertices such that for any $u, v \in c$ ($u \neq v$), we have $(u, v) \in E$. A clique $c$ in $G$ is called a maximal clique if there exists no clique $c'$ in $G$ such that $c \subset c'$.
Definition 2 (Maximal weight clique (MWC)). Given a weighted graph $G(V, E, w)$, a weight clique $c$ in $G$ is a set of vertices such that for any $u, v \in c$ ($u \neq v$), we have $(u, v) \in E$, and the weight of $c$ is $\omega(c) = \sum_{v_i \in c} w(v_i)$. A weight clique $c$ in $G$ is called a maximal weight clique if there exists no clique $c'$ in $G$ such that $\omega(c) < \omega(c')$.
Given a set of maximal cliques $C = \{c_1, c_2, \ldots\}$, the coverage of $C$, denoted by $\mathrm{cov}(C)$, is the set of vertices covered by $C$, i.e., $\mathrm{cov}(C) = \bigcup_{c_i \in C} c_i$.
Definition 3 (Diversified top-k clique (DTKC)). Given an unweighted graph $G(V, E)$ and an integer $k$, the problem of diversified top-k clique search is to compute a set $C$ such that each $c \in C$ is a maximal clique, $|C| \le k$, and $|\mathrm{cov}(C)|$ is maximized. $C$ is called the diversified top-k cliques.
Given a set of maximal (weighted) cliques $C = \{c_1, c_2, \ldots\}$, the private vertices of a maximal (weighted) clique $c$ in $C$, denoted by $\mathrm{priv}(c, C)$, are the vertices of $c$ not contained in any other clique in $C$, i.e., $\mathrm{priv}(c, C) = c \setminus \mathrm{cov}(C \setminus \{c\})$. The weight of $C$, denoted by $W(C)$, is the total weight of the vertices in $G$ covered by the cliques in $C$:
$$W(C) = \sum_{v \in \mathrm{cov}(C)} w(v)$$
The overlap of $C$, denoted by $\mathrm{overlap}(C)$, is the set of vertices that are covered by more than one maximal clique in $C$.
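As an illustration of these definitions, the following minimal Python sketch computes the coverage, private vertices, covered weight and overlap for cliques stored as sets. The function names, the dictionary-based weight map and the toy data are assumptions made for this example only.

```python
from collections import Counter


def cov(cliques):
    """Vertices covered by at least one clique in the collection."""
    covered = set()
    for c in cliques:
        covered |= set(c)
    return covered


def priv(c, cliques):
    """Vertices of clique c that appear in no other clique of the collection."""
    others = [d for d in cliques if d is not c]
    return set(c) - cov(others)


def total_weight(cliques, w):
    """W(C): each covered vertex contributes its weight exactly once."""
    return sum(w[v] for v in cov(cliques))


def overlap(cliques):
    """Vertices covered by more than one clique."""
    counts = Counter(v for c in cliques for v in set(c))
    return {v for v, n in counts.items() if n > 1}


# Toy data (hypothetical): two cliques sharing vertex 3.
w = {1: 4, 2: 2, 3: 5, 4: 1, 5: 3}
c1 = frozenset({1, 2, 3})
c2 = frozenset({3, 4, 5})
C = [c1, c2]
```

On this toy input, vertex 3 is the only overlapping vertex, so it is counted once in `total_weight(C, w)` and excluded from both private sets.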
Definition 4 (Diversified top-k weight clique (DTKWC)). Given a weighted graph G(V, E, w) and an integer k, the problem of diversified top-k weight clique search is to compute a set C, such that each c ∈ C is a maximal weight clique, |C| ≤ k, and W(C) is maximized. C is called diversified top-k weight cliques.

Constraint Formulation for DTKWC Search Problem
The DTKWC search problem is a generalization of the DTKC search problem [2]. It aims to find, among all possible maximal clique sets of a given graph, a set of size at most k with maximum total weight and low overlap. Hence, we first give the formulation of the DTKC search problem and then extend it to the DTKWC search problem.
The DTKC search problem can be formulated as a mixed integer linear program (MILP) with objectives (2)-(3) subject to constraints (4)-(7), where $x_{ih}$ is the binary variable associated with vertex $i$, such that $x_{ih} = 1$ if vertex $v_i$ is in the $h$-th maximal clique and $x_{ih} = 0$ otherwise. $X_i$ is also a binary variable associated with vertex $i$: $X_i = 1$ if vertex $v_i$ is in some maximal clique, and $X_i = 0$ otherwise. Constraint (4) guarantees that there is an edge between every two vertices in a clique. Constraint (5) states that if some clique includes vertex $v_i$, then $X_i = 1$. Constraints (6) and (7) give the ranges of the variables.
Based on this formulation, the MILP of the DTKWC search problem has objectives (8)-(9) subject to constraints (10)-(13). Similarly, $x_{ih}$ and $X_i$ are the binary variables associated with vertex $i$: $x_{ih} = 1$ if vertex $i$ appears in the $h$-th maximal clique and $x_{ih} = 0$ otherwise, and $X_i = 1$ if vertex $v_i$ belongs to any maximal clique of $C$ and $X_i = 0$ otherwise. $w_i$ denotes the weight of vertex $i$. Constraints (10)-(13) have the same purposes as constraints (4)-(7), respectively.
In both formulations, the aim is to minimize the value of objective "OBJ1" on the basis of maximizing the value of objective "OBJ2". In this way, the optimal solution of the DTKC (DTKWC) search problem can be obtained.
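For orientation, one MILP that is consistent with the variable and constraint descriptions above can be written as follows. This is a hedged reconstruction under stated assumptions, not the authors' exact model: the constraint forms follow the verbal descriptions, while the overlap-style form of OBJ1 and the index ranges are assumptions. Setting every $w_i = 1$ gives the corresponding unweighted (DTKC) variant.

```latex
% Sketch only (requires amsmath); the OBJ1 form and index ranges are assumed.
\begin{align*}
\text{OBJ2:}\quad & \max \sum_{i \in V} w_i X_i \\
\text{OBJ1:}\quad & \min \sum_{i \in V} \Bigl( \sum_{h=1}^{k} x_{ih} - X_i \Bigr) \\
\text{s.t.}\quad  & x_{ih} + x_{jh} \le 1, && \forall (i,j) \notin E,\ i < j,\ h = 1,\dots,k \\
                  & X_i \le \sum_{h=1}^{k} x_{ih}, && \forall i \in V \\
                  & x_{ih} \in \{0,1\}, && \forall i \in V,\ h = 1,\dots,k \\
                  & X_i \in \{0,1\}, && \forall i \in V
\end{align*}
```

The first constraint forbids two non-adjacent vertices from sharing a clique slot $h$, and the second allows $X_i = 1$ only when some slot contains vertex $i$; since OBJ2 is maximized, $X_i$ is then driven to 1 for every covered vertex.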

TOPKWCLQ: A Local Search Method for the DTKWC Search Problem
In this section, we outline the framework of our algorithm. We use a restart strategy that interleaves the construction and updating of the maximal weight clique set to enhance the quality of the candidate solution.
The restart procedure of the local search abandons the previous trajectory and explores different maximal weight clique sets. We construct these sets by combining maximal weight cliques built from different starting vertices at each iteration. With this restart strategy, TOPKWCLQ can improve the quality of the current candidate solution step by step.
At each restart iteration, we repeatedly construct a new maximal weight clique and use the scoring function to eliminate one maximal weight clique from the current candidate solution, until no further improvement is found within the limited number of updating steps or the time limit is reached. Bounding the updating steps saves search time within a single iteration and lets the algorithm restart as soon as possible. After the updating procedure, the current candidate solution is a locally optimal solution, and the algorithm compares it with the candidate solution maintained from previous iterations, keeping the better one. Finally, when the time limit runs out, the algorithm returns a maximal weight clique set as the best solution.
In the following, we describe this random restart local search algorithm for the DTKWC search problem, called TOPKWCLQ.

Maximal Weight Clique Scoring Function
Before describing the algorithm framework, we first address a core issue in TOPKWCLQ: how to evaluate the priority of each maximal weight clique. During the search, TOPKWCLQ must maintain a maximal weight clique set of size at most k as a candidate solution of the DTKWC search problem. It is therefore important to balance solution quality and efficiency when deciding which maximal weight clique should be included in or eliminated from the current candidate solution. For this reason, we define a scoring function for each maximal weight clique during the updating process, based on the total weight of its private vertices as presented in Section 2.
Definition 5 (Score function (score(c))). Given a weighted graph $G = (V, E, w)$, a maximal weight clique set $C$ and a maximal weight clique $c_i$ of $G$ ($c_i \in C$), we use $\mathrm{score}(c_i)$ to define the benefit of $c_i$ after a maximal weight clique is added to the set $C$. The score of $c$ in $C$ is defined as
$$\mathrm{score}(c) = \sum_{v \in \mathrm{priv}(c, C)} w(v)$$
The maximal weight clique selection method used in the UpdateSolution procedure is based on this scoring function. After a new maximal weight clique is added to the candidate solution $C$, the score of every maximal weight clique in $C$ is computed, and the one with the smallest score value is eliminated.
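The scoring rule can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the function names and toy data are hypothetical, and ties are broken by removing the smaller clique, which is one reading of the tie-breaking rule stated in Algorithm 4.

```python
def score(c, cliques, w):
    """Total weight of the private vertices of c within the collection."""
    others = set()
    for d in cliques:
        if d is not c:
            others |= d
    return sum(w[v] for v in c - others)


def evict_min_score(cliques, w):
    """Return the collection without its lowest-scoring clique.

    Ties are broken by removing the smaller clique (an assumed reading).
    """
    worst = min(cliques, key=lambda c: (score(c, cliques, w), len(c)))
    return [c for c in cliques if c is not worst]


# Toy data (hypothetical): c2 shares vertex 3 with c1 and vertex 5 with c3.
w = {1: 4, 2: 2, 3: 5, 4: 1, 5: 3, 6: 2}
c1 = frozenset({1, 2, 3})
c2 = frozenset({3, 4, 5})
c3 = frozenset({5, 6})
C = [c1, c2, c3]
```

Here `c2` keeps only vertex 4 as a private vertex, so it has the lowest score and is the clique removed by `evict_min_score(C, w)`.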

TOPKWCLQ Algorithm: The Top-Level Algorithm
The proposed TOPKWCLQ algorithm (see the flowchart in Figure 1) combines an initialization procedure that generates a feasible initial solution with a local search procedure that improves the initial solution. The top level of TOPKWCLQ is outlined in Algorithm 1, as described below.
Algorithm 1
…
12:     C ← InitKCliques(G, m, RemainingSet);
13:     if (cov(C) = V) then
14:         return C;
15:     end if
16:     C ← LocalSearch(G, C, m, RemainingSet);
17:     if (W(C*) < W(C)) then
18:         C* ← C;
19:     end if
20: end while
21: return C*;
First, we introduce the basic framework of our algorithm, which is presented in Algorithm 1. The current best global solution C* is initialized as an empty set. TOPKWCLQ then loops until the elapsed time reaches the cutoff (lines 4-20). The parameters m_0 and m_max of the best from multiple selection (BMS) strategy, which is used in [2] to solve the DTKC search problem, are given in advance, and the value of m is updated inside the loop (lines 6-11). TOPKWCLQ then calls the function InitKCliques to construct enough maximal weight cliques as an initial solution (line 12). After the initialization procedure, if cov(C) equals V, TOPKWCLQ returns C as a candidate solution (lines 13-15); otherwise, it updates the current candidate solution C using the LocalSearch method (line 16). If the total weight of the vertices covered by C* is smaller than that of C, that is, W(C*) < W(C), then C* is replaced with C (lines 17-19). When the elapsed time exceeds the cutoff time, TOPKWCLQ stops searching and returns C*.
The technical details of the TOPKWCLQ algorithm are introduced in the following subsections. The function that creates a maximal weight clique from a random vertex is introduced in Section 3.3. Section 3.4 presents the initialization procedure, and Section 3.5 presents the local search updating procedure of our algorithm.
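The restart loop of Algorithm 1 can be sketched as follows. This is a minimal sketch under stated assumptions: `init`, `improve` and `weight` are hypothetical callables standing in for InitKCliques, LocalSearch and W, the update of the BMS parameter m (lines 6-11) is omitted, and resetting the remaining-vertex set at each restart is an assumption.

```python
import time


def restart_search(init, improve, weight, all_vertices, cutoff_seconds):
    """Keep the best solution found over repeated construct-and-improve rounds."""
    best = []
    start = time.monotonic()
    while time.monotonic() - start < cutoff_seconds:
        remaining = set(all_vertices)          # assumed: fresh set per restart
        current = init(remaining)              # stands in for InitKCliques
        if set().union(*current) == set(all_vertices):
            return current                     # every vertex is covered
        current = improve(current, remaining)  # stands in for LocalSearch
        if weight(best) < weight(current):
            best = current
    return best
```

Passing the three procedures as arguments keeps the sketch independent of any particular clique-construction routine; only the comparison against the best solution so far is fixed.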

Constructing a Maximal Weight Clique with Diversity
At each stage of our algorithm, we constantly need to find different maximal weight cliques to add to the candidate solution. We therefore design a method called GetClique, which uses the vertices in RemainingSet to construct maximal weight cliques according to the properties of the DTKWC search problem. Let CandSet denote the set of vertices that are adjacent to all vertices already in c. We also define a function b[v], used during the initialization procedure, that represents the benefit of a vertex v.
Algorithm 2 shows the pseudo-code of GetClique. First, c is initialized as an empty set. RemainingSet contains all vertices in V except those in the current candidate solution. If RemainingSet is empty (line 2), the algorithm returns the empty set c, which means that no further maximal weight clique can be created. Otherwise, GetClique randomly selects a vertex v from RemainingSet and adds it to c. The algorithm then adds all neighbours of v to CandSet. While CandSet is not empty, GetClique extends the current partial clique, using the BMS strategy (proposed in [17]) to select the next vertex to add (lines 7-22). If the cardinality of CandSet is smaller than the parameter m, the algorithm picks the vertex v in CandSet with the greatest b, breaking ties in favour of the older one; otherwise, GetClique selects the vertex with the greatest benefit among m vertices drawn randomly from CandSet. In this way, a good vertex is obtained by evaluating at most m vertices. CandSet is then updated for selecting the next vertex of the maximal weight clique.
…
8:     if (|CandSet| < m) then
9:         pick the vertex v from CandSet with the greatest b, breaking ties in favour of the older one;
10:     else
11:         v ← randomly select a vertex from CandSet;
12:         for (iter := 1 to m − 1) do
13:             v′ ← randomly select a vertex from CandSet;
14:             if …
15:                 …
16:             end if
17:         end for
18:     end if
19:     c ← c ∪ {v};
20:     remove v from RemainingSet;
21:     CandSet ← CandSet ∩ N(v);
22: end while
23: return c
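The construction step can be sketched in Python as below. This is a sketch under stated assumptions rather than the authors' code: the benefit used here (the vertex weight plus half the weight of its neighbours inside CandSet) is an assumed stand-in for b[v], the age-based tie-breaking is not modelled, and the function and parameter names are hypothetical.

```python
import random


def get_clique(adj, w, remaining, m, rng):
    """Grow one maximal clique from a random start vertex using BMS sampling.

    adj maps each vertex to the set of its neighbours; remaining is mutated.
    """
    clique = set()
    if not remaining:
        return clique
    v = rng.choice(sorted(remaining))
    clique.add(v)
    remaining.discard(v)
    cand = set(adj[v])  # vertices adjacent to every vertex in the clique

    def benefit(u):
        # Assumed benefit estimate; the original expression for b[v] is not used.
        return w[u] + sum(w[x] for x in adj[u] & cand) / 2

    while cand:
        if len(cand) < m:
            v = max(sorted(cand), key=benefit)
        else:
            pool = sorted(cand)
            v = rng.choice(pool)
            for _ in range(m - 1):  # best of m random samples
                u = rng.choice(pool)
                if benefit(u) > benefit(v):
                    v = u
        clique.add(v)
        remaining.discard(v)
        cand &= adj[v]  # keep only common neighbours
    return clique
```

Because the loop only stops when no common neighbour is left, the returned vertex set cannot be extended, so it is a maximal clique of the input graph.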

The Initialization Procedure
In this subsection, we explain the initialization procedure, outlined in Algorithm 3, which is the first stage of our algorithm. At the beginning of this procedure, the current candidate solution C is set to empty. Because a solution of the DTKWC search problem is a set of at most k maximal weight cliques, the procedure creates maximal weight cliques randomly through GetClique, introduced in the previous subsection. It stops when GetClique returns an empty result, meaning that no vertex remains outside C, or when k maximal weight cliques have been created. The starting vertices are drawn randomly from the set of vertices that have never been used as a starting vertex. By repeating this random process, we obtain diversified maximal weight cliques that do not depend on information acquired earlier in the process. Moreover, starting from an unvisited vertex when constructing a solution of the DTKWC search problem helps overcome the cycling problem, i.e., revisiting the same solution within a short time in the local search algorithm.
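The initialization loop described above can be sketched as follows. The clique builder is passed in as a callable; the simple greedy builder shown here is a hypothetical stand-in for GetClique (it ignores weights and the BMS strategy) and is included only so that the example runs on its own.

```python
import random


def init_k_cliques(k, remaining, get_clique):
    """Collect up to k cliques, stopping early when no start vertex is left."""
    solution = []
    while len(solution) < k:
        c = get_clique(remaining)
        if not c:
            break
        solution.append(frozenset(c))
    return solution


def make_greedy_builder(adj, rng):
    """Hypothetical stand-in for GetClique: random start, greedy extension."""
    def build(remaining):
        if not remaining:
            return set()
        v = rng.choice(sorted(remaining))
        clique, cand = {v}, set(adj[v])
        remaining.discard(v)
        while cand:
            u = min(cand)
            clique.add(u)
            remaining.discard(u)
            cand &= adj[u]
        return clique
    return build
```

With a graph made of two disjoint triangles, for instance, the loop stops after two cliques even when k is larger, because the set of unused start vertices is then empty.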

Local Search Updating
The candidate solution produced by the initialization procedure meets the requirements of the DTKWC search problem, but there is no guarantee that it is of high quality. Therefore, in this subsection, we design a local search method that improves the quality of this solution by exploring as many new maximal weight cliques as possible (line 4).
The local search method in Algorithm 4 explores different combinations of maximal weight cliques starting from a candidate solution, iteratively looking for a better combination C that covers vertices with a greater total weight. To this end, we add a new maximal weight clique c, created by GetClique, to the current candidate solution C, which gives a set C′. We then compute the score function explained in Section 3.1 for each of the k + 1 maximal weight cliques in C′ (line 8). Given these k + 1 score values, we delete the maximal weight clique with the smallest value from C′. In this way, we maintain a maximal weight clique set of size k as the new candidate solution (line 11).
…
4:     c ← GetClique(G, m, RemainingSet);
5:     if (c = ∅) then
6:         break;
7:     end if
8:     Compute score(c) of each maximal weight clique c in C′;
10:     c_min ← arg min_{c ∈ C′} {score(c)};
11:     Remove c_min from C′, breaking ties in favour of the smaller one;
12:     if (cov(C′) > cov(C)) then
13:         C ← C′, step ← 0;
14:     end if
15:     /* fs is the third parameter of TOPKWCLQ */
16:     if (step ≥ fs) then
17:         break;
18:     end if
19: end while
20: return C
However, we observe that a candidate solution sometimes cannot be improved by the normal local search method for a long time. We therefore add a fixed step limit, denoted by fs, to our local search framework: the loop is broken if the current solution has not been improved in fs steps.
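The add-one, drop-one update with the step limit fs can be sketched in Python as follows. This is a sketch under stated assumptions: improvement is measured by the covered weight W, the step counter is incremented on every non-improving round, ties are broken by removing the smaller clique, and `get_clique` is a hypothetical callable standing in for GetClique.

```python
def covered_weight(cliques, w):
    """W(C): total weight of the vertices covered by the cliques."""
    return sum(w[v] for v in set().union(*cliques))


def score(c, cliques, w):
    """Total weight of the vertices that only c covers."""
    others = set().union(*(d for d in cliques if d is not c))
    return sum(w[v] for v in c - others)


def local_search(solution, w, remaining, get_clique, fs):
    """Repeatedly add a new clique, drop the lowest-scoring one, keep gains."""
    best = list(solution)
    step = 0
    while True:
        c = get_clique(remaining)
        if not c:
            break
        trial = best + [frozenset(c)]
        worst = min(trial, key=lambda d: (score(d, trial, w), len(d)))
        trial = [d for d in trial if d is not worst]
        if covered_weight(trial, w) > covered_weight(best, w):
            best, step = trial, 0
        else:
            step += 1  # assumed: count rounds without improvement
        if step >= fs:
            break
    return best
```

If the newly added clique is itself the lowest-scoring one, the trial set equals the previous solution, so the round counts as non-improving and moves the search one step closer to the fs limit.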

An Example of DTKWC Search Problem
Example 1. Let us illustrate how to explore a solution for the DTKWC search problem using the sample weighted graph in Figure 2. Figure 2a gives a weighted graph $G(V, E, w)$ with ten vertices, where $v_i^{w_i}$ denotes vertex $v_i$ with $w_i = w(v_i)$. Assume the integer parameter $k = 2$. The best clique weight so far is $\omega(C_{max}) = 13$. During the first phase, InitKCliques creates a maximal weight clique set $C_{max} = \{c_1, c_2\}$, and the total weight of $C_{max}$ is 12. In the second phase, LocalSearch tries to determine a new maximal weight clique $c_3$ by GetClique. Suppose $c_3 = \{v_5^3, v_7^3\}$. We add $c_3$ into $C_{max}$. Now, $C_{max}$ contains 3 maximal weight cliques. We evaluate the quality of these three maximal weight cliques with the proposed scoring function.

Experimental Evaluation
In this section, we carry out extensive experiments to evaluate the performance of TOPKWCLQ on large weighted real-world graphs. Since, to our knowledge, no suitable heuristic or exact algorithm for the DTKWC search problem on large real-world graphs exists in the literature, we compare the results of the proposed algorithm with those obtained by the CPLEX solver, a commercial solver for many combinatorial optimization problems, using the constraint formulation of our mathematical model. The results obtained by CPLEX can therefore be used as a reference for solution quality. We first describe the weighted benchmark and then present the experimental preliminaries and the parameter settings.

The Benchmark
We evaluate the TOPKWCLQ algorithm on a benchmark of weighted real-world graphs, described below.
The large weighted real-world graphs in our experiments originate from the Network Data Repository online [18] (http://www.graphrepository.com/networks.php, accessed on 1 August 2021). Many of the real-world graphs used in our experiments have millions of vertices and tens of millions of edges. The benchmark, which includes 102 instances, was transformed from unweighted graphs into weighted graphs using the weighting function w(v_i) = (i mod 200) + 1 [19]. Moreover, most of these graphs have been used as experimental instances for the maximum vertex weight clique problem [6,20,21], the coloring problem [22], the maximum k-plexes problem [23] and the DTKC search problem [2]. Considering the relationship between the DTKWC search problem and these problems, these real-world graphs can naturally be used to evaluate the performance of our algorithm on the DTKWC search problem. The graphs were downloaded from the author's website (http://lcs.ios.ac.cn/~caisw/Resource/weightedmassive-graphs.zip, accessed on 1 August 2021).
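The weighting function is simple enough to state directly in code. In this short sketch, the function names are hypothetical and the vertex index i is assumed to be the integer identifier used in the graph file.

```python
def vertex_weight(i: int) -> int:
    """Weight of vertex i under w(v_i) = (i mod 200) + 1."""
    return (i % 200) + 1


def assign_weights(vertex_ids):
    """Build a weight map for an iterable of integer vertex identifiers."""
    return {i: vertex_weight(i) for i in vertex_ids}
```

The resulting weights always lie between 1 and 200 and repeat with period 200 in the vertex index.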
The graphs in our experiments are divided into 11 classes, including biological networks, collaboration networks, interaction networks, infrastructure networks, recommendation networks, retweet networks, scientific computing, social networks, facebook networks, technological networks, and web graphs.

Experimental Preliminaries and Parameter Tuning
The proposed algorithm TOPKWCLQ was implemented in C++ and compiled with the "-O3" flag on a CentOS machine with a 2.4 GHz CPU and 32 GB of RAM. We ran TOPKWCLQ 10 times independently on each instance, with random seeds from 1 to 10. Each run terminates when the time limit, set to 600 s in this paper, is reached. CPLEX terminates either when its lower and upper bounds converge or when a time limit of 3600 s is reached. We use the solution values of CPLEX to evaluate the quality of the solutions found by TOPKWCLQ.
For each real-world large graph used in our experiments, we set the parameter k to 10, 20, 30, 40, and 50 to obtain five DTKWC search problem instances. Hence, there were 102 × 5 = 510 DTKWC search problem instances in our experiments.
TOPKWCLQ uses three parameters for which well-working values must be found: m_0 and m_max are the minimum and maximum values used by the BMS strategy, respectively, and fs is the maximum number of updating steps allowed per iteration. Parameters m_0 and m_max are used in the BMS strategy inspired by [17]. The values of these parameters, listed in Table 1, were set according to a preliminary tuning experiment. The next subsection evaluates TOPKWCLQ against the lower bound ("LB") and the upper bound ("UB") of CPLEX on all 510 DTKWC search problem instances.

Experimental Results
Tables 2-6 present the experimental results on the benchmark instances described in Section 4.1 for k = 10, 20, 30, 40, and 50, respectively.
For each instance, the column "Instance" gives the instance name. For TOPKWCLQ, we report the maximum weight (w_b) and the average weight (w_a) obtained over 10 runs on each DTKWC search problem instance. We also report the average run time over 10 runs (Time, in seconds) needed by TOPKWCLQ to reach the maximum weight. A "0" in the time column indicates that TOPKWCLQ obtained the best solution in less than 0.01 s. To study the effectiveness of TOPKWCLQ on the DTKWC search problem, we compare it with the CPLEX solver (version 12.9) using the mathematical model (8)-(13) introduced in Section 2.2. The best lower bound (LB) and upper bound (UB) found by CPLEX are listed in the CPLEX columns. If CPLEX was unable to find a bound on an instance, the corresponding entry is marked by "-". If CPLEX was unable to load the model, the entry is marked by "N/A". For the items in a column, a bold value indicates that the algorithm obtained the same or a better objective value than the comparison algorithm.
For 19 out of 510 real-world instances, optimality can be proved, since the values of the lower bound ("LB") and upper bound ("UB") in the CPLEX columns are equal. In terms of computational time, TOPKWCLQ obtains the optimal values in less than one second (at most hundreds of seconds) in most cases. On the graph rt-twitter-copen, CPLEX always finds better objective values than TOPKWCLQ except on the instance with parameter k = 50. For another 84 out of 102 larger graphs, TOPKWCLQ also obtains good objective values where CPLEX failed. Based on the benchmark introduced in Section 4.1, Table 7 summarizes the computational results of CPLEX and TOPKWCLQ on the 102 real-world graphs.
From Table 7, we observe that for almost all instances under the five values of the parameter k, our TOPKWCLQ algorithm obtains better solutions than the lower bound of CPLEX. This indicates the superiority of the proposed algorithm TOPKWCLQ.

Conclusions
In this paper, we address the diversified top-k weight clique search problem and formalize the DTKC and DTKWC search problems. A scoring strategy is proposed to find diversified maximal weight cliques. Based on this scoring strategy and a random restart strategy, we then propose a local search algorithm for the DTKWC search problem, called TOPKWCLQ, which interleaves the construction and updating of the maximal weight clique set. Experiments on the real-world benchmark show the effectiveness and efficiency of our algorithm. As future work, we plan to investigate the enhanced configuration checking strategy used in [2] to further improve the performance of the algorithm.