An Improved Greedy Heuristic for the Minimum Positive Influence Dominating Set Problem in Social Networks

Abstract: This paper presents a performance comparison of greedy heuristics for a recent variant of the dominating set problem known as the minimum positive influence dominating set (MPIDS) problem. This APX-hard combinatorial optimization problem has applications in social networks. Its aim is to identify a small subset of key influential individuals in order to facilitate the spread of positive influence in the whole network. In this paper, we focus on the development of a fast and effective greedy heuristic for the MPIDS problem, because greedy heuristics are an essential component of more sophisticated metaheuristics. Thus, the development of well-working greedy heuristics supports the development of efficient metaheuristics. Extensive experiments conducted on a wide range of social networks and complex networks confirm the overall superiority of our greedy algorithm over its competitors, especially when the problem size becomes large. Moreover, we compare our algorithm with the integer linear programming solver CPLEX. While the performance of CPLEX is very strong for small and medium-sized networks, it reaches its limits when applied to the largest networks. However, even in the context of small and medium-sized networks, our greedy algorithm is, on average, only 2.53% worse than CPLEX.


Introduction
Dominating set problems have recently attracted much attention due to their potential application in a variety of real-life settings. Apart from the standard minimum dominating set problem [1,2], examples include the minimum connected dominating set problem [3], the minimum total dominating set problem [4] and the minimum vertex weight dominating set problem [5].

Problem Background and Motivation
A problem variant that has been studied especially in the context of online social networks is the minimum positive influence dominating set (MPIDS) problem. Here, a social network is modeled by a simple, connected undirected graph in which vertices represent individuals (people) and edges indicate relationships and interactions between them. The problem was first introduced by Wang et al. [6], based on the following motivation. With the explosive growth of online social networks, the need for social network analysis tools for studying the social influences and interactions between individuals within groups and organizations has become a primary concern. As an example, one of the most popular online social networks worldwide is Facebook. In January 2021, Facebook had approximately 2.74 billion users [7], more than any other social network. In addition, ideas and information propagated in social networks can have a significant impact (negative or positive) on society and on various aspects of people's lives. Social norms theory has shown that the behavior of individuals can be affected by perceptions

Problem Description and Existing Work
In technical terms, the MPIDS problem can be described as follows. Given a simple, connected undirected graph G = (V, E), the task is to find a dominating set of minimum cardinality such that at least half of the neighbors of each vertex form part of the dominating set. The problem was shown to be APX-hard [13]. Note that a problem is said to be APX-hard if there is an approximation-preserving polynomial-time reduction (a PTAS reduction) from every problem in APX to that problem. Moreover, if a problem is APX-hard, it is also NP-hard. Therefore, most of the past and current research efforts concerning the MPIDS problem are focused on greedy heuristics and on some evolutionary approaches. Greedy heuristics [14] are procedures that generate a solution step by step, making a locally optimal choice at each stage based on a so-called greedy function. We briefly review the existing approaches in the following. The first greedy algorithm for the MPIDS problem, referred to as Wang's greedy algorithm [13], is an H(∆)-approximation algorithm with O(n^3) time complexity, where n = |V|, ∆ is the maximum vertex degree, and H is the harmonic function. Another greedy algorithm, referred to as Raei's greedy, was published in [15]. This algorithm requires O(n^2) time. It differs from the previous one in the way in which the next vertex is chosen at each construction step, that is, in the greedy function used. An improved version of Wang's greedy, referred to as Fei's greedy, was proposed in [16]. It incorporates a tie-breaking strategy based on Raei's greedy. More recently, Pan et al. [17] presented a fast greedy heuristic with a complexity of O(n lg n + m) that outperforms all previous greedy approaches both in terms of solution quality and computation time. Therefore, Pan's greedy is considered the currently best-performing greedy algorithm for the MPIDS problem.
To the best of our knowledge, there are only two studies that have attempted to solve the MPIDS problem using metaheuristic approaches. Metaheuristics are approximate techniques for solving hard optimization problems of different types. They are among the most popular algorithms in the context of problem instances that are too large (or too complex) to be solved by exact techniques. Many metaheuristics are built upon subordinate algorithmic components such as greedy heuristics and local search algorithms. Examples of such metaheuristics include simulated annealing, tabu search, iterated local search, and greedy randomized adaptive search procedures. However, the family of metaheuristics also includes a whole range of bio-inspired techniques such as ant colony optimization, genetic and evolutionary algorithms, and particle swarm optimization. Especially in combinatorial optimization, metaheuristics have had considerable success in application areas such as scheduling, routing, bioinformatics, medical research, passenger and freight terminal operations, and data classification. We refer the interested reader to [18,19] for further information. The MPIDS problem was first tackled by a memetic algorithm [20], called ILPMA, which uses tabu search for improving solutions. The second one is a hybrid approach, referred to as HSIA, that combines a genetic algorithm with particle swarm optimization. Both ILPMA and HSIA share two common features: they incorporate a greedy randomized adaptive algorithm similar to GRASP [21] to seed the initial population with good solutions and their performances are compared with those of the greedy algorithms.

Motivation and Contribution
Wang's greedy, Raei's greedy and Fei's greedy share a common drawback: they are time-consuming and become highly inefficient with increasing graph size. This is mainly due to the vertex selection strategy employed at each step of the solution construction process. In particular, the choice of the next vertex to be included in the current partial solution requires the evaluation of the corresponding greedy function for all vertices that do not form part of the current partial solution, which takes O(n) time. This is reduced to O(∆) time in the case of both Pan's greedy and our proposal by considering only the neighbors of a particular vertex at each construction step. We expect our algorithm to outperform the currently best greedy algorithm (Pan's algorithm) for the following reasons. First, we develop a graph pruning procedure that identifies vertices that must form part of an optimal solution. Second, the vertex selection strategy of our algorithm exploits two greedy functions (cover-degree and need-degree), in contrast to Pan's greedy, which only makes use of cover-degree. Finally, in a post-processing step we remove redundant vertices.
Our motivation for the development of a fast and effective greedy algorithm is as follows. The development of high-performing metaheuristics for the MPIDS problem is still a challenge. However, the performance of metaheuristics depends largely on the performance of their main components. Greedy heuristics are a key component of many metaheuristics. They are used for generating initial solutions or for reconstructing partial solutions. The contribution of this paper is to review the available literature on greedy heuristics for the MPIDS problem and to compare their performance both in terms of solution quality and computation time. Furthermore, designing an efficient greedy approach is still a relevant challenge despite many prior attempts. Therefore, we also propose an improved greedy algorithm which outperforms the existing greedy approaches both on benchmark instances that have already been considered in the literature and on larger complex networks with millions of nodes. Our approach can be considered a further improvement of the basic idea of Pan's work [17].

Structure of the Paper
The rest of this paper is organized as follows. In Section 2 we give a technical description of the MPIDS problem. In Section 3, we review the available literature on greedy heuristics applied to the MPIDS problem. The proposed greedy algorithm is described in Section 4. In Section 5, we present and discuss the experimental results. Finally, Section 6 summarizes the work and offers directions for future work.

The Minimum Positive Influence Dominating Set Problem
We describe the MPIDS problem by starting with the basic notions and definitions used throughout this paper. Let G = (V, E) be an undirected graph that is both simple and connected. (Note that an undirected graph is called simple if it contains neither multiple edges between pairs of vertices nor loops.) Hereby, V is a set of n vertices (representing, for example, people) and E ⊂ V × V is a set of m edges (modeling, for example, relationships between those people). Let v and u be two distinct vertices from G. Now, v and u are said to be adjacent (neighbors) if they are connected by an edge. Further, the open neighborhood of v is defined as N(v) := {u ∈ V | (v, u) ∈ E}, and the degree of v is deg(v) := |N(v)|.
Definition 1 (Dominating set). A dominating set (DS) of G is a subset D ⊆ V such that each vertex v ∈ V \ D is adjacent to at least one vertex in D.

Definition 2 (Positive influence dominating set).
A positive influence dominating set (PIDS) of G is a dominating set D ⊆ V such that each vertex v ∈ V has at least half of its open neighbors in D, that is, at least ⌈deg(v)/2⌉ vertices of N(v) must belong to D for each v ∈ V.
Figure 1b shows an example of a DS of the graph shown in Figure 1a, where black vertices represent those that are in the DS. Figure 1c provides an example of a PIDS (black vertices). Vertex 6, for example, has degree 3 and ⌈3/2⌉ = 2 of its neighbors (vertex 4 and vertex 2) are part of the PIDS. The minimum dominating set (MDS) problem, which asks for a DS of minimum cardinality, can easily be formulated as an integer linear program (ILP). For convenience, we assume that V is enumerated as v_1, v_2, ..., v_n. A binary variable x_i is associated with each vertex v_i such that x_i = 1 if and only if v_i is part of the optimal solution, and x_i = 0 otherwise. The ILP model can then be stated as follows.

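$$\text{Minimize} \quad \sum_{i=1}^{n} x_i \qquad (1)$$
$$\text{subject to} \quad x_i + \sum_{v_j \in N(v_i)} x_j \ge 1 \quad \forall\, v_i \in V \qquad (2)$$
$$x_i \in \{0, 1\} \quad \forall\, v_i \in V \qquad (3)$$
Equation (2) ensures that the generated solution is a dominating set. Remember that N(v_i) denotes the open neighborhood of v_i.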

Definition 4 (Minimum positive influence dominating set problem).
Given an undirected simple graph G = (V, E), the minimum positive influence dominating set (MPIDS) problem asks to find a PIDS of G of minimum cardinality.
Based on the ILP model of the MDS problem from above, the MPIDS problem can also easily be stated in terms of an ILP as follows.

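$$\text{Minimize} \quad \sum_{i=1}^{n} x_i \qquad (4)$$
$$\text{subject to} \quad \sum_{v_j \in N(v_i)} x_j \ge \left\lceil \frac{\deg(v_i)}{2} \right\rceil \quad \forall\, v_i \in V \qquad (5)$$
$$x_i \in \{0, 1\} \quad \forall\, v_i \in V \qquad (6)$$
Equation (5) ensures that a feasible solution contains at least half of the neighbors of each vertex v_i ∈ V.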
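As an illustration of the neighborhood constraint, the following Python sketch checks whether a vertex subset is a PIDS and, for very small graphs only, finds a minimum PIDS by exhaustive enumeration. The function names `is_pids` and `min_pids_bruteforce` and the adjacency-dictionary representation are assumptions made for this example; they are not taken from the paper, and the enumeration is in no way a substitute for an ILP solver.

```python
from itertools import combinations
from math import ceil


def is_pids(adj, subset):
    """Check the neighborhood constraint: every vertex needs at least
    ceil(deg(v)/2) of its neighbors in the subset.

    adj maps each vertex to the set of its neighbors (simple undirected
    graph). In a graph without isolated vertices this condition also
    implies that the subset is a dominating set.
    """
    chosen = set(subset)
    return all(len(adj[v] & chosen) >= ceil(len(adj[v]) / 2) for v in adj)


def min_pids_bruteforce(adj):
    """Exhaustive search by increasing size; only sensible for tiny graphs."""
    vertices = list(adj)
    for size in range(len(vertices) + 1):
        for subset in combinations(vertices, size):
            if is_pids(adj, subset):
                return set(subset)


# Hypothetical toy instance: a cycle on four vertices.
cycle4 = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
best = min_pids_bruteforce(cycle4)
```

On this toy cycle, the set {0, 2} dominates every vertex but is not a PIDS, because vertices 0 and 2 have no neighbor in the set. This shows that the PIDS requirement is strictly stronger than domination.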

Greedy Heuristics
Greedy algorithms are very common techniques for constructing solutions to combinatorial optimization problems from scratch, in a step-by-step manner. They are used either as standalone algorithms or as subordinate algorithmic components of more sophisticated metaheuristics. Algorithm 1 shows the pseudo-code of a basic greedy algorithm for the MPIDS problem. It takes as input a simple, connected undirected graph G = (V, E) representing an instance of the MPIDS problem and provides as output a subset of vertices S ⊆ V corresponding to a positive influence dominating set. This algorithm builds a solution step by step, starting from an empty solution S = ∅. At each construction step, it adds one feasible vertex v* ∈ S̄ to S until a valid solution is obtained. Here, S̄ = V \ S denotes the set of vertices of V not belonging to S. Note that this set initially contains all the vertices of the graph.

Algorithm 1 MPIDS_Greedy()
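The basic construction scheme can be sketched in Python as follows. This is a minimal illustration and not the authors' implementation: the name `basic_greedy`, the adjacency-dictionary input, and the choice of greedy function (the number of not-yet-covered neighbors of a candidate) are assumptions of this example. The sketch re-evaluates the greedy function for every unselected vertex in each step, so it is slow and only meant to show the principle.

```python
from math import ceil


def basic_greedy(adj):
    """Build a PIDS by repeatedly adding the unselected vertex with the
    most uncovered neighbors.

    adj maps each vertex to the set of its neighbors (simple undirected
    graph). need[v] is the number of neighbors of v still missing in the
    solution; v counts as covered once need[v] <= 0.
    """
    need = {v: ceil(len(adj[v]) / 2) for v in adj}
    solution = set()
    while any(h > 0 for h in need.values()):
        # Greedy choice over ALL vertices not yet in the solution.
        best = max(
            (v for v in adj if v not in solution),
            key=lambda v: sum(1 for u in adj[v] if need[u] > 0),
        )
        solution.add(best)
        for u in adj[best]:
            need[u] -= 1  # each neighbor of the new vertex gains one selected neighbor
    return solution


# Hypothetical toy instance: a path on three vertices.
path3 = {0: {1}, 1: {0, 2}, 2: {1}}
result = basic_greedy(path3)
```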
At each construction step, the choice of the solution component to be added to the current partial solution is made deterministically based on a so-called greedy function, which plays an important role for the performance of the algorithm. It evaluates each solution component by measuring the local improvement obtained by adding the corresponding component to the incumbent partial solution. Most of the existing greedy algorithms for the MPIDS problem are based on at least one of the two greedy functions known as cover-degree and need-degree. To define them, let S ⊂ V be the incumbent partial solution, which is not yet a PIDS, and let N_S(v) := N(v) ∩ S denote the set of neighbors of v that belong to S. The remaining need of a vertex v ∈ V with respect to S is h_S(v) := ⌈deg(v)/2⌉ − |N_S(v)|. A vertex v is called covered if h_S(v) ≤ 0, and not covered otherwise. Consequently, S is a valid solution (i.e., a PIDS) if and only if all vertices of V are covered. Now, for a given v ∈ V \ S, cover-degree and need-degree are calculated as in Equation (7) and Equation (8), respectively.
$$\text{cover-degree}(v) := \left|\{u \in N(v) \mid h_S(u) > 0\}\right| \qquad (7)$$
$$\text{need-degree}(v) := \sum_{u \in N(v),\; h_S(u) > 0} h_S(u) \qquad (8)$$
The first one represents the number of uncovered neighbors of v with respect to S, whereas the second represents the total need of the uncovered neighbors of v. Table 1 indicates which of these two greedy functions were used in previous works from the literature. When an algorithm makes use of both functions, one of them serves as a secondary criterion for breaking ties, that is, for those cases in which the principal criterion provides the same value for several vertices. In the following, we recall the most recent and best-performing greedy algorithm, that is, Pan's greedy [17]. Note that, even though Pan's greedy uses the same greedy function (cover-degree) as Wang's greedy, its time complexity is much lower, for the following reason. While Wang's algorithm considers at each construction step any so-far unselected vertex for inclusion into the current partial solution, Pan's algorithm, after choosing the next unselected vertex v in ascending order with respect to the degree, only considers so-far unselected neighbors of v in subsequent construction steps until v is covered. The detailed pseudo-code of Pan's algorithm is provided in Algorithm 2. This pseudo-code is the same as in the original publication, except that it uses our notation and omits some unnecessary details in order to improve readability.

Algorithm 2 Pan's greedy algorithm [17]
Input: a simple, connected undirected graph G = (V, E)
Output: a positive influence dominating set S
1: Rename the vertices from V such that {v_1, v_2, ..., v_n} are the vertices in ascending order of the degree
2: for j = 1 to ρ do
...
8: u* ← argmax{cover-degree(u) | u ∈ N_S̄(v_i)}
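The selection strategy described for Pan's greedy (process vertices in ascending order of degree and, for each uncovered vertex, repeatedly add the unselected neighbor with the largest cover-degree) might be sketched as follows. This is a reading of the textual description only, not a transcription of the original pseudo-code. The names `pan_style_greedy` and `cover_degree` are chosen for this example, and the naive evaluation of cover-degree means the sketch does not attain the O(n lg n + m) bound of the original algorithm.

```python
from math import ceil


def pan_style_greedy(adj):
    """Cover vertices one by one in ascending order of degree.

    adj maps each vertex to the set of its neighbors (simple undirected
    graph). need[v] is the number of additional selected neighbors that v
    still requires; v is covered once need[v] <= 0.
    """
    need = {v: ceil(len(adj[v]) / 2) for v in adj}
    solution = set()

    def cover_degree(u):
        # Number of neighbors of u that are not yet covered.
        return sum(1 for w in adj[u] if need[w] > 0)

    for v in sorted(adj, key=lambda x: len(adj[x])):
        while need[v] > 0:
            # Only unselected neighbors of v are candidates.
            u_star = max((u for u in adj[v] if u not in solution), key=cover_degree)
            solution.add(u_star)
            for w in adj[u_star]:
                need[w] -= 1
    return solution


# Hypothetical toy instance: a star with center 0 and three leaves.
star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
result = pan_style_greedy(star)
```

On the star, the center is selected first because it is the only neighbor of the first leaf; two leaves follow because the center itself needs ⌈3/2⌉ = 2 selected neighbors.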

The Proposed Algorithm
In the following, we describe our improved greedy algorithm (henceforth referred to as IGA-PIDS), which is based on Pan's algorithm for the MPIDS problem.

The Greedy Procedure
The pseudo-code of IGA-PIDS is given in Algorithm 3. It receives as input a problem instance that consists of a simple, connected undirected graph G = (V, E) with n vertices, and works as follows. First, the given graph is pruned by applying procedure GraphPruning(G) (see Algorithm 4), which returns a partial solution S. This procedure identifies vertices that must form part of an optimal solution and adds them to S. This is done in order to speed up the solution construction process of the greedy algorithm. Suppose that v ∈ V is a pendant vertex, i.e., deg(v) = 1, and let u be its unique neighbor. First, u is added to S, because v must be covered and can only be covered by u. Second, if the degree of u is two, that is, deg(u) = 2, let w ≠ v be the second neighbor of u. As u must be covered by exactly one vertex, and as no other vertex benefits from choosing v for covering u, we can safely choose w for covering u. Therefore, w is added to S. At this point, let C be the set of all so-far uncovered vertices. Next, the algorithm picks yet uncovered vertices from C in increasing order of their degrees, that is, at each step it takes the uncovered vertex v_i with the smallest sub-index in C. The chosen vertex v_i is then covered by choosing neighbors of v_i from N_S̄(v_i), the set of neighbors of v_i that are not yet in S, until v_i is covered. In particular, at each step the vertex with the largest cover-degree value of all vertices in N_S̄(v_i) is chosen. If tie breaking is necessary, it is done with the need-degree function (see lines 8 and 9 of Algorithm 3). Finally, the algorithm terminates when all vertices are covered, that is, when S corresponds to a valid PIDS solution.
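The pruning rule and the two-level selection criterion described above could be sketched as follows. This is a hedged illustration, not the pseudo-code of Algorithms 3 and 4: the names `graph_pruning` and `iga_pids_sketch` are chosen for this example, ties on cover-degree are assumed to be broken in favor of the larger need-degree (the text does not state the direction), and the guard that skips w when u already has a selected neighbor is an assumption added here to avoid selecting both pendant neighbors of a degree-two vertex.

```python
from math import ceil


def graph_pruning(adj):
    """Collect vertices implied by pendant vertices (deg(v) = 1)."""
    chosen = set()
    for v in adj:
        if len(adj[v]) == 1:
            (u,) = adj[v]          # unique neighbor of the pendant vertex
            chosen.add(u)          # v can only be covered by u
            # Assumption of this sketch: add w only if u is still uncovered.
            if len(adj[u]) == 2 and not (adj[u] & chosen):
                (w,) = adj[u] - {v}
                chosen.add(w)
    return chosen


def iga_pids_sketch(adj):
    solution = graph_pruning(adj)
    need = {v: ceil(len(adj[v]) / 2) - len(adj[v] & solution) for v in adj}

    def score(u):
        uncovered = [need[w] for w in adj[u] if need[w] > 0]
        # Primary key: cover-degree; secondary key: need-degree.
        return (len(uncovered), sum(uncovered))

    for v in sorted(adj, key=lambda x: len(adj[x])):
        while need[v] > 0:
            u_star = max(adj[v] - solution, key=score)
            solution.add(u_star)
            for w in adj[u_star]:
                need[w] -= 1
    return solution


# Hypothetical toy instance: a path on four vertices.
path4 = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
result = iga_pids_sketch(path4)
```

On this path, pruning alone already yields the complete solution {1, 2}, so the greedy loop has nothing left to do.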

Removing Redundant Vertices
The results produced by our greedy algorithm may contain redundant vertices. A redundant vertex is a vertex that can be removed from a solution without making the solution invalid. Formally, a vertex v ∈ S is redundant if each vertex u from its neighborhood N(v) has strictly more than the required half of its neighbors in S. In other words, v ∈ S is redundant if and only if ∀u ∈ N(v) : h_S(u) < 0. Each vertex in S is checked in sequence to find out whether it is redundant. If this is the case, the vertex may safely be removed from S, and the values h_S(·) of all its neighbors are increased by one, because each of them loses one selected neighbor. The pseudo-code of the complete removal function is presented in Algorithm 5. This function has a time complexity of O(n + m). Note that our algorithm, by default, makes use of this function. However, we will also test our algorithm without removing redundant vertices. The resulting algorithm variant is henceforth labelled IGA-PIDS−.
Another way to implement this procedure, which is not considered here, would be to adopt the removal strategy presented in [23] for the minimum weight dominating set problem. There, all redundant vertices are initially grouped in a set called S_r. Then, a vertex from S_r is iteratively chosen to be removed according to a particular greedy function. In the case of the MPIDS problem, we could select the vertex with the smallest degree, for example. This iterative process stops once S_r becomes empty, that is, once all redundant vertices have been removed.
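A minimal sketch of the sequential redundancy check is given below. The name `remove_redundant` and the fixed scan order (sorted vertex labels, which assumes comparable labels) are choices made for this example and are not taken from Algorithm 5.

```python
from math import ceil


def remove_redundant(adj, solution):
    """Drop vertices whose removal keeps every neighbor covered.

    adj maps each vertex to the set of its neighbors; solution is a valid
    PIDS. need[u] < 0 means u has more selected neighbors than required,
    so it can afford to lose one.
    """
    solution = set(solution)
    need = {v: ceil(len(adj[v]) / 2) - len(adj[v] & solution) for v in adj}
    for v in sorted(solution):  # scan order is an assumption of this sketch
        if all(need[u] < 0 for u in adj[v]):
            solution.remove(v)
            for u in adj[v]:
                need[u] += 1  # u lost one selected neighbor
    return solution


# Hypothetical toy instance: a triangle whose three vertices are all selected.
triangle = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
reduced = remove_redundant(triangle, {0, 1, 2})
```

Because the need values are updated immediately after each removal, the solution stays valid throughout. In the triangle, only the first scanned vertex can be removed, since afterwards the remaining two vertices have no surplus left.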

Complexity
Here, we describe the time complexity of IGA-PIDS as presented in Algorithm 3. We assume that the problem instance G = (V, E) is represented by an adjacency list and the solution S by a list. Obviously, function GraphPruning(·) has complexity O(n). Line 3 is done in O(n lg n) time, as |C| ≤ |V| and n = |V| is the number of vertices in G. Since the size of S does not exceed n, all other lines, excluding lines 9 and 10, can be done in O(n) time. The running time of lines 9 and 10 is proportional to the sum of the degrees of all vertices in V. The handshaking lemma [24] states that ∑_{v∈V} deg(v) = 2m, where m = |E| is the number of edges in G. Therefore, these two lines require O(m) time. In the light of the above, we can conclude that IGA-PIDS runs in O(n lg n + m) time.

Computational Setting
In order to conduct a fair comparison with the existing greedy algorithms, all algorithms were implemented in ANSI C++, compiled with Cygwin GCC 4.4, and run on the same computing platform. The experiments were performed on a laptop equipped with a 64-bit 2.5-GHz Intel® Core™ i5-7200U processor and 8 GB of RAM. Moreover, in order to get an impression of the solution quality provided by IGA-PIDS, we applied the ILP solver ILOG CPLEX 12.10 in single-threaded mode (with a time limit of 2 CPU hours per instance) to all problem instances. Note that the default optimality gaps of CPLEX were used. These default values instruct CPLEX to stop when an integer feasible solution is reached that is within 0.01% of optimality. The experiments with CPLEX were performed on a cluster of machines with two Intel® Xeon® Silver 4210 CPUs with 10 cores of 2.20 GHz and 92 GB of RAM.

Problem Instances
We use three sets of benchmark instances for the experimental evaluation. The first two have already been used by other studies on the MPIDS problem, while the last one (SNAP networks) consists of large-scale complex networks that are studied for the first time in the context of the MPIDS problem.

1. Small social networks: this class of instances contains four well-known real and synthetic networks, namely American College Football (Football) [25], Zachary's Karate Club (Karate) [26], the Dolphins Network (Dolphins) [27] and the Jazz Network (Jazz) [28]. Characteristics such as the number of vertices and the number of edges of these networks are provided in Table 2. The values of the optimal solutions for each instance (except for the last one) were taken from [20], in which the authors also made use of CPLEX.

2. Large-scale social networks: this class of instances contains 13 real social networks which were provided by the authors of [20]; some of them were originally downloaded from the Network Data Repository [29]. Their characteristics, together with a brief description, are given in Table 3.

3. SNAP social networks: this class of instances contains 22 real complex networks with sizes ranging from 10^4 to 3 × 10^7 vertices. They were downloaded from the Stanford Large Network Dataset Collection [30]. All these instances were originally directed graphs; they were transformed into undirected graphs by neglecting arc orientations and considering parallel edges as one edge. Table 4 gives a brief description of the SNAP networks used in our experiments after preprocessing.
In general, note that the benchmark instances are assumed to be simple connected graphs without isolated vertices. For this purpose, we employed a preprocessing step for removing isolated vertices, self-loops and parallel edges.
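The preprocessing described above can be illustrated with a short Python sketch. The function name `to_simple_undirected` and the arc-list input format are assumptions of this example. The sketch drops self-loops, merges parallel and opposite arcs, and omits vertices that end up without any neighbor; it does not check or enforce connectivity.

```python
def to_simple_undirected(arcs):
    """Turn a list of directed arcs (u, v) into an undirected adjacency dict.

    Orientations are ignored, parallel edges collapse because neighbors are
    stored in sets, self-loops are skipped, and a vertex that only occurs
    in self-loops never enters the dictionary (i.e., isolated vertices are
    removed).
    """
    adj = {}
    for u, v in arcs:
        if u == v:
            continue  # self-loop
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


# Hypothetical arc list with a duplicate arc, an opposite arc and a self-loop.
arcs = [(1, 2), (2, 1), (1, 2), (3, 3), (2, 4)]
graph = to_simple_undirected(arcs)
```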

Results and Discussion
The numerical results for the first two benchmark sets, that is, for the social networks, are presented in Table 5, which has the following structure. The first column indicates the name of the social network. For each of the five greedy algorithms the table contains two columns. The first one, labeled Val, provides the objective function values of the solutions found for the corresponding problem instances. The other one, labeled Time(s), shows the corresponding running times in seconds. In this context, note that a computation time of 0.0 means that the time was below 0.01 s. The best result per table row is highlighted in bold font. Moreover, note that provably optimal results (see Table 6 for the full CPLEX results) are underlined.
The experimental results allow us to make the following observations. IGA-PIDS is able to obtain the best solution for 15 out of 17 problem instances. In contrast, all competitors together only return the best solution in 7 out of 17 cases. In addition, we can observe that both Pan's greedy and IGA-PIDS are significantly faster than the other three competitors (by approximately two orders of magnitude). Furthermore, IGA-PIDS is only marginally slower than Pan's greedy, which means that the improved solution quality obtained by IGA-PIDS is not achieved at the expense of a significantly increased computation time. Moreover, without taking our greedy approach into consideration, it is worth noting that Pan's greedy outperforms the other greedy algorithms both in terms of solution quality and computational efficiency. Finally, while the performance of each of the four competing greedy algorithms starts to degrade in the context of the largest problem instances, this is not the case for IGA-PIDS.
In order to study the effect of the proposed function for eliminating redundant vertices (see function Reduce(·) described in Section 4.2), Table 6 reports the performance of our greedy algorithm before (IGA-PIDS−) and after (IGA-PIDS) removing redundant vertices. From these results it can be concluded that most solutions suffer from the existence of redundant vertices. Thus, removing them provides better results, which comes, however, at the cost of additional computation time. Having said that, the increase of the average computation time is very moderate: from 0.014 s to only 0.017 s. In Table 6, we additionally present the results obtained with the ILP solver CPLEX. The obtained results show that the newest CPLEX version (12.10) is very efficient for the MPIDS problem. In fact, our greedy algorithm is able to match the results of CPLEX in only two cases (social networks socfb-nips-ego and Dolphins). Moreover, CPLEX provides provably optimal solutions in 11 out of 17 cases (within 2 h of computation time). Provably optimal results are marked by an asterisk. Nevertheless, the results provided by IGA-PIDS are, on average, only 2.53% worse than those of CPLEX (see the last table column, labeled Gap (%)).
The results for the large-scale SNAP networks are presented in Table 7, in the same format as outlined above. The results of CPLEX are also included. Moreover, note that no result is provided in those cases in which the computation time of the respective algorithm exceeded 7200 s (two hours). The results on these larger networks confirm our findings from the first two benchmark sets. However, the differences between the algorithms are now even more pronounced. Observe, for example, that, on average, IGA-PIDS produces solutions with more than 2000 fewer vertices per instance when compared to Pan's greedy. Moreover, the limits of CPLEX are clearly reached in the application to problem instances of the size and the complexity found in the Amazon* instances and cit-Patents. The results provided by CPLEX in these cases are far worse than those of IGA-PIDS, as indicated by the gaps. In fact, the results of IGA-PIDS improve by approx. 55% over the results of CPLEX in these cases. Therefore, IGA-PIDS is overall the best algorithm, even outperforming CPLEX with an average gap of approx. 42% between the IGA-PIDS solutions and the CPLEX solutions.
Finally, we decided to provide a statistical assessment of the obtained results by means of critical difference (CD) plots [31]. In order to produce the average rank of each greedy algorithm considering the complete set of 39 problem instances, the scmamp R package [31] first applies the Friedman test to compare the five approaches simultaneously. In this way, the hypothesis that the five techniques perform equally was rejected. Then, the package performs pairwise comparisons using the Nemenyi post-hoc test [32]. As mentioned above, the results are shown graphically by means of CD plots. Note that, in such a plot, each considered algorithm is placed on the horizontal axis according to its average ranking. The performances of those algorithms that are below the critical difference threshold (computed with a significance level of 0.05) are considered as statistically equivalent. This is indicated by bold horizontal bars joining the markers of the respective algorithm variants. Figure 2 shows two CD plots. The first one (Figure 2a) concerns the obtained solution quality. IGA-PIDS is clearly the best-performing algorithm, with statistical significance. Concerning computation times (Figure 2b) it can be observed that Pan's greedy is slightly faster than IGA-PIDS. However, no statistical difference can be detected between the running times of Pan's algorithm and those of IGA-PIDS.
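To make the ranking procedure concrete, the following Python sketch computes average ranks (with tied values sharing the mean of their ranks) and the Nemenyi critical difference CD = q_α · sqrt(k(k + 1)/(6N)) for k algorithms and N instances. It is an independent illustration and does not reproduce the scmamp implementation; the function names and the toy result matrix are hypothetical, and the critical value 2.728 for k = 5 at a significance level of 0.05 is taken from the commonly used table of Nemenyi critical values, which should be checked against the package documentation.

```python
from math import sqrt


def average_ranks(results):
    """results: one row per instance, one value per algorithm (lower is better).

    Returns the average rank of each algorithm; tied values receive the
    mean of the ranks they span.
    """
    k = len(results[0])
    totals = [0.0] * k
    for row in results:
        order = sorted(range(k), key=lambda a: row[a])
        i = 0
        while i < k:
            j = i
            while j + 1 < k and row[order[j + 1]] == row[order[i]]:
                j += 1
            shared = (i + j) / 2 + 1  # mean of ranks i+1 .. j+1
            for a in order[i:j + 1]:
                totals[a] += shared
            i = j + 1
    return [t / len(results) for t in totals]


def nemenyi_cd(k, n, q_alpha):
    """Critical difference of the Nemenyi post-hoc test."""
    return q_alpha * sqrt(k * (k + 1) / (6 * n))


# Hypothetical toy data: two instances, three algorithms.
toy = [[10, 12, 12], [5, 7, 6]]
ranks = average_ranks(toy)

# Setting of this paper: k = 5 algorithms, N = 39 instances.
cd = nemenyi_cd(5, 39, 2.728)
```

Two algorithms whose average ranks differ by less than the critical difference are treated as statistically equivalent, which is what the horizontal bars in a CD plot indicate.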

Conclusions
In this paper, we have studied an APX-hard combinatorial optimization problem in graphs and networks, the so-called minimum positive influence dominating set (MPIDS) problem. For tackling this problem, we presented a simple and fast improved greedy algorithm (IGA-PIDS) which is based on an effective exploitation of problem-specific knowledge. The performance of IGA-PIDS was evaluated on 17 social networks of different sizes, ranging from 34 to 23,628 vertices. Moreover, we applied our own greedy algorithm, as well as the competitors from the literature, to large-scale complex networks from the SNAP data set. The previously existing greedy heuristics for the MPIDS problem were re-implemented by ourselves for this purpose. Numerical results show that our greedy algorithm is able to outperform the other greedy algorithms from the literature, providing solutions of significantly higher quality, especially in the context of the larger SNAP networks. We were also able to show that CPLEX, even though performing very strongly on small and medium-sized problem instances, reaches its limits when tackling large-scale complex networks.
Concerning the limitations of this work, we have not yet related network characteristics to problem difficulty. In the future it would be very interesting to identify those network characteristics that make the problem difficult, both for CPLEX and for our greedy algorithm. Note that such characteristics are not necessarily the same for the two types of algorithms. Furthermore, the design of well-working metaheuristics for this problem seems very challenging. Considering the results of CPLEX from this work, it is clear that the two existing metaheuristics for the MPIDS problem cannot compete with CPLEX for most of the 17 social networks that were tackled in earlier works. The design of novel metaheuristics for the MPIDS problem will benefit from our improved greedy heuristic. Moreover, given the results of CPLEX, it might be a good idea to develop hybrid algorithms that combine metaheuristic elements with those of exact solvers such as CPLEX.