A Surprisal-Based Greedy Heuristic for the Set Covering Problem

In this paper, we exploit concepts from Information Theory to improve the classical Chvatal greedy algorithm for the set covering problem. In particular, we develop a new greedy procedure, called the Surprisal-Based Greedy Heuristic (SBH), which incorporates the computation of a "surprisal" measure when selecting the solution columns. Computational experiments, performed on instances from the OR-Library, showed that SBH yields a 2.5% improvement in objective function value over Chvatal's algorithm while retaining similar execution times, making it suitable for real-time applications. The new heuristic was also compared with Kordalewski's greedy algorithm, obtaining similar solutions in much shorter times on large instances, and with Grossman and Wool's algorithm for unicost instances, where SBH obtained better solutions.


Introduction
The Set Covering Problem (SCP) is a classical combinatorial optimization problem which, given a collection of elements, aims to find the minimum number of sets that incorporate (cover) all of these elements. More formally, let I be a set of m items and J = {S_1, S_2, ..., S_n} a collection of n subsets of I, where each subset S_j (j = 1, ..., n) is associated with a non-negative cost c_j. The SCP consists of finding a sub-collection of J that covers all the elements of I at minimum cost, the cost of a sub-collection being defined as the sum of the costs of its subsets.
The SCP finds applications in many fields. One of the most important is crew scheduling, where the SCP provides a minimum-cost set of crews covering a given set of trips. These problems include airline crew scheduling (see, for example, Rubin [1] and Marchiori [2]) and railway crew scheduling (see, for example, Caprara [3]). Other applications are the winner determination problem in combinatorial auctions, a class of sales mechanisms (Abrache et al. [4]), and vehicle routing (Foster et al. [5], Cacchiani et al. [6] and Bai et al. [7]). The SCP is also relevant in a number of production planning problems, as described by Vemuganti in [8], where solving it is often required in real time. In addition, it is worth noting that the set covering problem is equivalent to the hitting set problem [9]. Indeed, we can view an instance of set covering as a bipartite graph in which vertices on the left represent the items, whilst vertices on the right represent the sets, and edges represent the inclusion of items in sets. The goal of the hitting set problem is to find a subset of the right vertices, of minimum size, such that all left vertices are covered.
Garey and Johnson [10] proved that the SCP is NP-hard in the strong sense. Exact algorithms are mostly based on branch-and-bound and branch-and-cut techniques. Etcheberry [11] utilizes sub-gradient optimization in a branch-and-bound framework. Balas and Ho [12] present a procedure based on cutting planes from conditional bounds, i.e., valid lower bounds if the constraint set is amended by certain inequalities. Beasley [13] introduces an algorithm which blends dual ascent, sub-gradient optimization and linear programming. In [14], Beasley and Jornsten incorporate the algorithm of [13] into a Lagrangian heuristic. Fisher and Kedia [15] use continuous heuristics applied to the dual of the linear programming relaxation, obtaining lower bounds for a branch-and-bound algorithm. Finally, we mention Balas and Carrera [16], whose procedure is based on dynamic sub-gradient optimization and branch-and-bound. These algorithms were tested on instances involving up to 200 rows and 2000 columns in the case of Balas's and Fisher's algorithms, and 400 rows and 4000 columns in [13,14,16]. Among these algorithms, the fastest is Balas and Carrera's, with average times in the order of 100 s on small instances and 1000 s on the largest ones (on a Cray-1S computer). Caprara [17] compared these methods with the general-purpose ILP solvers CPLEX 4.0.8 and MINTO 2.3, observing that the latter have execution times competitive with those of the best exact algorithms for the SCP in the literature.
In most industrial applications it is important to rely on heuristic methods in order to obtain "good" solutions quickly enough to meet the expectations of decision-makers. To this purpose, many heuristics have been presented in the literature. The classical greedy algorithm proposed by Chvatal [18] sequentially inserts the set with the minimum score into the solution. Chvatal proved that its worst-case performance ratio does not exceed the harmonic number H(d) = Σ_{k=1}^{d} 1/k, where d is the size of the largest set. More recently, Kordalewski [19] described a new approximation heuristic for the SCP. His algorithm follows the same scheme as Chvatal's procedure, but modifies the score by including a new parameter, named difficulty. Wang et al. [20] presented the TS-IDS algorithm, designed for deep web crawling, which Singhania [21] later tested in a resource management application. Feo and Resende [22] presented a Greedy Randomized Adaptive Search Procedure (GRASP), in which they first constructed an initial solution through an adaptive randomized greedy function and then applied a local search procedure. Haouari and Chaouachi [23] introduced PROGRES, a probabilistic greedy search heuristic which uses diversification schemes along with a learning strategy.
Regarding Lagrangian heuristics, we mention the algorithm developed by Beasley [24] and later improved by Haddadi [25], which consists of a sub-gradient optimization procedure coupled with a greedy algorithm and Lagrangian cost fixing. A similar procedure was designed by Caprara et al. [26]; it includes three phases (sub-gradient, heuristic and column fixing), followed by a refining procedure. Beasley and Chu [27] proposed a genetic algorithm in which a variable mutation rate and two new operators are defined. Similarly, Aickelin [28] describes an indirect genetic algorithm, in which actual solutions are found by an external decoder function and another indirect optimization layer is then used to improve the result. Lastly, we mention Meta-RaPS, introduced by Lan et al. [29], an iterative search procedure that uses randomness as a way to avoid local optima. All the mentioned heuristics present computing times not compatible with real-time contexts. For example, Caprara's algorithm [26] produces solutions with an average computing time of about 400 s (on a DECstation 5000/240), when executed on non-unicost instances from Beasley's OR-Library with matrix sizes 500 × 5000 and 1000 × 10,000. Indeed, the difficulty of the problem leads to very high computational costs, which has led academics to research heuristics and meta-heuristics capable of obtaining good solutions, as close as possible to the optimum, in a very short time, in order to tackle real-time applications. In this respect, it is worth noting the paper by Grossman and Wool [30], which presents a comparative study of eight approximation algorithms for the unicost SCP, including several greedy variants, fractional relaxations and randomized algorithms. Other investigations carried out over the years include the following: Galinier et al. [31], who studied a variant of the SCP, called the Large Set Covering Problem (LSCP), in which sets are possibly infinite; Lanza-Gutierrez et al. [32], who were interested in the difficulty of applying metaheuristics designed for continuous optimization problems to the SCP; Sundar et al. [33], who proposed another algorithm to solve the SCP by combining an artificial bee colony (ABC) algorithm with local search; Maneengam et al. [34], who, in order to solve the green ship routing and scheduling problem (GSRSP), developed a set covering model based on route representation which includes berth time-window constraints; and, finally, Derpich et al. [35], who recently conducted an empirical complexity analysis of the set covering problem, among other problems.
In this paper, we exploit concepts from Information Theory (see Borda [36]) to improve Chvatal's greedy algorithm. Our purpose is to devise a heuristic able to improve the quality of the solution while retaining execution times similar to those of Chvatal's algorithm, making it suitable for real-time applications. The main contributions of the current work can be summarized as follows.

• The development of a real-time algorithm, named Surprisal-Based Greedy Heuristic (SBH), for the fast computation of high-quality solutions for the set covering problem. In particular, our algorithm introduces a surprisal measure, also known as self-information, to partly account for the problem structure while constructing the solution.
• A comparison of the proposed heuristic with three other greedy algorithms, namely Chvatal's greedy procedure [18], Kordalewski's algorithm [19] and the Altgreedy procedure [30] for unicost problems. SBH improves on the classical Chvatal greedy algorithm [18] in terms of objective function value and has the same scalability in computation time, while Kordalewski's algorithm produces slightly better solutions but with computation times much higher than those of SBH, making it impractical for real-time applications.
We emphasize that there is a plethora of other methods in the literature for solving the SCP, but most of them are time-consuming.We are only interested in fast heuristics that are compatible with real-time applications.
The remainder of the article is organized as follows. In Section 2 we describe the three algorithms involved in our analysis and illustrate SBH. Section 3 presents an experimental campaign comparing the greedy algorithms mentioned above. Finally, Section 4 reports some conclusions.

Problem Formulation
The SCP can be formulated as follows. In addition to the notation introduced in Section 1, let a_ij be a constant equal to 1 if item i is covered by subset j and 0 otherwise. Moreover, let x_j be a binary variable equal to 1 if subset j is selected and 0 otherwise. The objective function (1) minimizes the total cost of the selected columns, while constraints (2) impose the condition that every row is covered by at least one column.
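With this notation, the model referenced by (1) and (2) can be stated explicitly; the following is the standard ILP formulation of the SCP, consistent with the definitions above:

```latex
\begin{align}
\min \quad & \sum_{j=1}^{n} c_j x_j && (1)\\
\text{s.t.} \quad & \sum_{j=1}^{n} a_{ij} x_j \ge 1, \qquad i = 1, \ldots, m && (2)\\
& x_j \in \{0, 1\}, \qquad\quad\; j = 1, \ldots, n &&
\end{align}
```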

Greedy Algorithms
As explained in the previous section, we are interested in greedy procedures that produce good solutions in a very short time, suitable for real-time applications. SCP greedy algorithms are sequential procedures that identify the best unselected column with respect to a given score and insert it into the solution set. Let I_j be the set of rows covered by column j and J_i the set of columns covering row i. Algorithm 1 shows the pseudocode of Chvatal's greedy algorithm [18]. Each column j is given a score equal to the column cost c_j divided by the number |I_j| of rows covered by j. At each step, the algorithm inserts the column j* with the minimum score into the solution set.
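The scheme of Algorithm 1 can be sketched in a few lines; the following is a minimal Python rendition (our own naming; `cover[j]` plays the role of I_j):

```python
def chvatal_greedy(costs, cover):
    """Chvatal's greedy heuristic: cover[j] is the set of rows covered
    by column j; returns the indices of the selected columns."""
    uncovered = set().union(*cover)   # rows still to be covered
    solution = []
    while uncovered:
        # score: cost per newly covered row; select the minimum
        j_star = min(
            (j for j in range(len(costs)) if cover[j] & uncovered),
            key=lambda j: costs[j] / len(cover[j] & uncovered),
        )
        solution.append(j_star)
        uncovered -= cover[j_star]    # remove the rows just covered
    return solution
```

Subtracting the covered rows at each iteration reproduces the row-deletion step of the pseudocode.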
A variant of Chvatal's procedure for unicost problems, named Altgreedy, was suggested by Grossman and Wool [30]. This algorithm is composed of two main steps: in the first phase, the column covering the highest number of rows is inserted into the solution; then, in the second phase, some columns are removed from the solution set in lexicographic order, as long as the number of newly uncovered rows remains smaller than the number of new rows covered by the last insertion.
More recently, Kordalewski [19] proposed a new greedy heuristic, a recursive procedure that introduces two new terms: valuation and difficulty. In the first step, the valuation of each column j is computed by dividing the number of rows covered by j by the column cost, as in Chvatal's score: v_j = |I_j| / c_j. For each row i, a difficulty parameter is defined as the inverse of the sum of the valuations of the columns covering i, d_i = 1 / Σ_{j ∈ J_i} v_j, used to indicate how difficult it might be to cover that row. This is based on the observation that a low valuation implies a low probability of selection. The valuation is then recomputed from the difficulties as v_j = (Σ_{i ∈ I_j} d_i) / c_j, while the difficulty d is updated with the new valuations.
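A sketch of this valuation-difficulty recursion in Python follows (our own rendition; the number of refinement rounds is a free parameter here, and the exact stopping rule of [19] may differ):

```python
def kordalewski_valuations(costs, cover, n_rows, rounds=2):
    """Alternate valuation/difficulty updates as described above.
    cover[j] is the set of rows covered by column j; every row is
    assumed to be covered by at least one column."""
    n = len(costs)
    # initial valuation: rows covered per unit cost (Chvatal-like)
    v = [len(cover[j]) / costs[j] for j in range(n)]
    for _ in range(rounds):
        # difficulty of row i: inverse of the total valuation covering i
        d = [1.0 / sum(v[j] for j in range(n) if i in cover[j])
             for i in range(n_rows)]
        # new valuation: total covered difficulty per unit cost
        v = [sum(d[i] for i in cover[j]) / costs[j] for j in range(n)]
    return v
```

Columns covering hard-to-cover rows receive higher valuations; a greedy loop would then repeatedly select the column with the highest valuation.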

The SBH Algorithm
In this section, we describe the SBH greedy heuristic, which constitutes an improvement on the classic Chvatal greedy procedure. As illustrated in Section 2.2, Chvatal's algorithm assigns each column j a score equal to the unit cost of covering the rows in I_j, and then iteratively inserts the columns with the lowest score into the solution set. However, this approach is flawed when the rows in I_j are poorly covered. Indeed, it does not consider the probability that the rows i ∈ I_j are covered by other columns in J_i. Our algorithm aims to correct this by introducing an additional term expressing the "surprisal" that a column j is selected. Therefore, our score considers two aspects: the cost of a column j and the probability that the rows in I_j can be covered by other columns.
To formally describe our procedure, we introduce some concepts from Information Theory. The term information refers to any message which provides details about an uncertain event and is closely related to the probability of occurrence of that event.
Information is an additive and non-negative measure which is equal to 0 when the event is certain and grows as its probability decreases. More specifically, given an event A with probability of occurrence p_A, the self-information I_A is defined as

I_A = -log_2 p_A.

Self-information is also called surprisal because it expresses the "surprise" of seeing event A as the outcome of an experiment. In the SBH algorithm, at each stage we compute the surprisal of each column. The columns covering row i are considered independent of each other, so the probability that a specific one of them is selected (denoted as event Ā) is

p_Ā = 1 / |J_i|.

Therefore, the opposite event A, i.e., covering row i with a column different from the current one, has probability

p_A = 1 - 1/|J_i| = (|J_i| - 1) / |J_i|.

The self-information measure contained in this event is

I_i = -log_2 ((|J_i| - 1) / |J_i|).

Thanks to the additivity of the self-information measure, the surprisal of a column j can be written as

I_j = Σ_{i ∈ I_j} I_i = -Σ_{i ∈ I_j} log_2 ((|J_i| - 1) / |J_i|).

We modify Chvatal's cost of column j, i.e., c_j / |I_j|, by introducing the surprisal of j in the denominator, in order to favor columns with high self-information. In particular, at each step we select the column that minimizes

c_j / (|I_j| · I_j),

which is equivalent to

c_j / (-|I_j| · log_2 Π_{i ∈ I_j} (|J_i| - 1) / |J_i|).

This formulation amounts to minimizing the probability of the intersection of independent events, each of which covers a row i ∈ I_j with a column other than the current one. Two extreme cases can occur:
• if column j is the only one covering some row i ∈ I_j, it is no surprise that it is selected: in this case I_j is high (indeed infinite) and the modified cost of column j is 0, so that column j is included in the solution;
• if, on the other hand, all rows i ∈ I_j are covered by a high number of other columns in J_i, the surprisal I_j is very low. In this case, the cost attributed to column j is greater than its Chvatal cost.
To illustrate this concept, we now present a numerical example on a small instance with four columns, coverage matrix A and column cost vector c. At the first iteration of Chvatal's algorithm, the second column (the one with the lowest score) is selected; at the second iteration, the third column is selected. Note that, at this point, the first column covers only rows already covered by the selected columns, so it can no longer contribute new rows. At the third iteration, column 4 is selected, and the total cost of the resulting solution (columns 2, 3 and 4) amounts to 8 units. Computing the SBH score for each column instead, our algorithm selects the fourth column (the one with the least score) at the first iteration; at the second iteration, column 2 is selected and the procedure ends. In conclusion, our algorithm selects only two columns (4 and 2), with a total cost of 6 units, in contrast to Chvatal's greedy algorithm, which ends up with a greater solution cost. SBH outperforms Chvatal's procedure here because the latter cannot recognize that column 4 must necessarily be part of the solution.
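The same phenomenon can be reproduced in code. The sketch below runs both greedy loops on a small hypothetical instance of our own (not the matrix of the example above), in which column 2 uniquely covers a row and SBH therefore selects it immediately:

```python
import math

def greedy(costs, cover, score):
    """Generic sequential greedy: repeatedly insert the argmin-score column."""
    uncovered = set().union(*cover)
    solution = []
    while uncovered:
        j = min((k for k in range(len(costs)) if cover[k] & uncovered),
                key=lambda k: score(k, uncovered))
        solution.append(j)
        uncovered -= cover[j]
    return solution

def chvatal_score(costs, cover):
    # Chvatal: cost per newly covered row
    return lambda j, unc: costs[j] / len(cover[j] & unc)

def sbh_score(costs, cover):
    def score(j, unc):
        rows = cover[j] & unc
        s = 0.0
        for i in rows:
            n_i = sum(1 for k in range(len(costs)) if i in cover[k])  # |J_i|
            if n_i == 1:
                return 0.0        # j is the only column covering row i
            s -= math.log2((n_i - 1) / n_i)
        return costs[j] / (len(rows) * s)
    return score

# Hypothetical 4-row instance (our own data): column 2 uniquely covers row 3.
costs = [1.5, 1.0, 2.0]
cover = [{0, 1, 2}, {0, 1}, {2, 3}]
ch_sol = greedy(costs, cover, chvatal_score(costs, cover))   # [0, 2], cost 3.5
sbh_sol = greedy(costs, cover, sbh_score(costs, cover))      # [2, 1], cost 3.0
```

On this instance, Chvatal's score ties columns 0 and 1 and misses that column 2 is mandatory, paying for it later; SBH assigns column 2 a score of 0 and completes the cover more cheaply.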
It is worth noting that SBH has the same computational complexity as Chvatal's algorithm, since they require the same number of steps in order to compute the score measure.

Experimental Results
The aim of our computational experiments was to assess the performance of the SBH heuristic with respect to the other greedy heuristics proposed in the literature. We implemented the heuristics in C++ and performed our experiments on a stand-alone Linux machine with a 4-core processor clocked at 3 GHz and equipped with 16 GB of RAM. The algorithm was tested on 77 instances from Beasley's OR-Library [37]. Table 1 describes the main features of the test instances; in particular, the column density, i.e., the percentage of ones in the coverage matrix, and the column range, i.e., the minimum and maximum values of the objective function coefficients. The remaining column headings are self-explanatory. Instances are divided into sets having sizes ranging from 200 × 1000 to 1000 × 10,000. Set E contains small unicost instances of size 50 × 500. Sets 4, 5 and 6 were generated by Balas and Ho [12] and consist of small instances with low density, while sets A to E come from Beasley [13]. The remaining instances (sets NRE to NRH) are from [24]; they are significantly larger and their optimal solutions are not available. Similarly, Table 2 reports the features of seven large-scale real-world instances derived from the crew scheduling problem [26].
We compared SBH with Chvatal's procedure [18] (CH) and the heuristic by Kordalewski [19] (KORD).Tables 3-5 report the computational results for each instance under the following headings:

• Instance: the name of the instance, where the string before the dot refers to the set to which the instance belongs.
Looking at Table 3, it is worth noting that our heuristic, compared to Chvatal's greedy procedure, had a smaller gap, reduced from 12.65% to 11.03%, with an average improvement of 1.42%. Among these instances, SBH provided a better solution than [18] in 19 out of 24 problems. We point out that the best objective function value was given by Kordalewski's algorithm, which was slightly better than our SBH procedure (by only 0.59%), but slower. Similar observations can be derived from Table 4. Here, SBH performed better, even though it differed from the Kordalewski algorithm by only 0.07%. Comparing SBH with CH, it is worth noting that SBH obtained a worse solution in only 4 instances out of 45. SBH came close to the optimal solution, with an average gap of 10.69%, and was better than CH by 2.62%. The execution time over all the instances averaged 0.113 s for CH, 0.230 s for the Kordalewski procedure and 0.564 s for SBH. As the size of the instances increases (which is the case for the rail problems), Kordalewski's algorithm becomes much slower. Consequently, on these instances we compared only the CH and SBH heuristics; here, our SBH algorithm provided an average objective function improvement of 5.82% with comparable execution times. In conclusion, this first analysis showed that the new SBH heuristic generally produced results very similar to those of Kordalewski's heuristic. This is due to the fact that both heuristics consider the degree of row coverage, although in different ways, and, thus, the difficulty of covering the rows. However, the large amount of time the KORD algorithm took to solve the rail instances points out that only SBH meets the requirements of real-time applications. Finally, the average percentage improvement of SBH with respect to CH, taking into account all instances, i.e., sets 4-6, scp and rail, amounted to 2.5%. We next compared the
algorithms on unicost instances, obtained by setting the cost of all columns equal to 1, as in Grossman and Wool's paper [30]. In particular, we compared SBH with the Altgreedy (ALTG) algorithm proposed by Grossman and Wool [30], introduced in Section 2.2. The results are shown in Tables 6-8, where the subdivision of instances is the same as before. The additional column "SBH vs. ALTG" reports the percentage improvement of SBH with respect to the ALTG algorithm. Looking at Table 6, it is worth noting that the heuristic which performed best was Kordalewski's. Indeed, our heuristic SBH was worse than KORD by about 3.49%, while it was better than the other two greedy procedures, with a gap of 1.15%. Here, computation times were all comparable and ranged between 0.002 and 0.007 s. SBH improved its performance on larger instances, as shown in Tables 7 and 8. We would like to point out that ALTG and CH produced the same solution cost for all of the instances, except for the rail ones. In particular, SBH yielded an average improvement of 1.50% on CH and ALTG [30] on scp instances, and, respectively, of 1.39% and 12.97% on rail instances. Comparing SBH and KORD on the scp instances, we observed that they were very similar, with a 0.07% improvement. On the largest instances (Table 8), as said before, the computational time of KORD made it impractical for real-time applications. The analysis showed that, in most cases, SBH produced better solutions than the classical Chvatal algorithm. However, in a few instances CH presented a better solution. This phenomenon is attributable to the features of the instances. As shown in the example provided in Section 2.3, SBH immediately recognizes columns that must necessarily be present in the solution, while CH only selects them when they exhibit the lowest unit cost. In conclusion, the computational campaign revealed that SBH generally outperformed CH on instances containing columns with few covered rows.

Conclusions
In this paper, we proposed a new greedy heuristic, SBH, an improvement on the classical greedy algorithm proposed by Chvatal [18]. We showed that, in the vast majority of the test instances, SBH generated better solutions than Chvatal's algorithm and Altgreedy [30], and solutions of quality comparable to those of Kordalewski's algorithm [19]. Computational tests also showed that Kordalewski's algorithm is not suitable for real-time applications, since it presents very large execution times, while our SBH algorithm runs in a few seconds, even on very large instances.

Table 4 .
Results for instance sets scp.

Table 5 .
Results for instance set rail.

Table 7 .
Results for unicost instance sets scp.

Table 8 .
Results for unicost instance sets rail.