1. Introduction
The Set Covering Problem (SCP) is a classical combinatorial optimization problem which, given a collection of elements, aims to find the minimum number of sets that incorporate (cover) all of these elements. More formally, let I be a set of m items and $J=\{{S}_{1},{S}_{2},\dots ,{S}_{n}\}$ a collection of n subsets of I, where each subset ${S}_{j}$ ($j=1,\dots ,n$) is associated with a non-negative cost ${c}_{j}$. The SCP seeks a minimum-cost sub-collection of J that covers all the elements of I, the cost of a sub-collection being the sum of the costs of its subsets.
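The definition above can be illustrated with a minimal Python sketch. The toy instance and the helper names `is_cover` and `cover_cost` are assumptions introduced only for illustration; they do not come from the paper.

```python
from typing import Dict, Iterable, Set


def is_cover(items: Set[int], subsets: Dict[int, Set[int]], chosen: Iterable[int]) -> bool:
    """Return True if the chosen subsets together contain every item of I."""
    covered: Set[int] = set()
    for j in chosen:
        covered |= subsets[j]
    return items <= covered


def cover_cost(costs: Dict[int, float], chosen: Iterable[int]) -> float:
    """Cost of a sub-collection: the sum of the costs of its subsets."""
    return sum(costs[j] for j in chosen)


# Hypothetical toy instance: m = 5 items, n = 4 subsets.
items = {1, 2, 3, 4, 5}
subsets = {1: {1, 2, 3}, 2: {2, 4}, 3: {3, 4, 5}, 4: {5}}
costs = {1: 3.0, 2: 1.0, 3: 2.0, 4: 1.0}

# {S_1, S_3} covers every item at cost 5; {S_1, S_2} leaves item 5 uncovered.
print(is_cover(items, subsets, [1, 3]), cover_cost(costs, [1, 3]))
print(is_cover(items, subsets, [1, 2]))
```

In this representation, the SCP asks for the feasible choice of indices with the smallest value of `cover_cost`.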
The SCP finds applications in many fields. One of the most important is crew scheduling, where the SCP provides a minimum-cost set of crews in order to cover a given set of trips. These problems include airline crew scheduling (see, for example, Rubin [
1] and Marchiori [
2]) and railway crew scheduling (see, for example, Caprara [
3]). Other applications are the winner determination problem in combinatorial auctions, a class of sales mechanisms (Abrache et al. [
4]) and vehicle routing (Foster et al. [
5], Cacchiani et al. [
6] and Bai et al. [
7]). The SCP is also relevant in a number of production planning problems, as described by Vemuganti in [
8], wherein solving is often required in
real-time. In addition, it is worth noting that the set covering problem is equivalent to the hitting set problem [
9]. Indeed, we can view an instance of set covering as a bipartite graph in which vertices on the left represent the items, whilst vertices on the right represent the sets, and edges represent the inclusion of items in sets. The goal of the hitting set problem is to find a subset with the minimum number of right vertices such that all left vertices are covered.
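The correspondence with the hitting set problem can be sketched as follows. For every item we collect the indices of the subsets that contain it (the neighbours of a left vertex in the bipartite graph); a choice of subsets is then a set cover exactly when it hits every such family. The function names and the toy data are hypothetical, and the sketch assumes every item belongs to at least one subset.

```python
from typing import Dict, Iterable, Set


def to_hitting_set(subsets: Dict[int, Set[int]]) -> Dict[int, Set[int]]:
    """For each item i, build T_i = {j : i in S_j} (edges of the bipartite graph)."""
    families: Dict[int, Set[int]] = {}
    for j, members in subsets.items():
        for i in members:
            families.setdefault(i, set()).add(j)
    return families


def hits_all(families: Dict[int, Set[int]], chosen: Iterable[int]) -> bool:
    """True if the chosen subset indices intersect every family T_i."""
    chosen_set = set(chosen)
    return all(chosen_set & t for t in families.values())


# Hypothetical toy instance (illustration only).
subsets = {1: {1, 2, 3}, 2: {2, 4}, 3: {3, 4, 5}, 4: {5}}
families = to_hitting_set(subsets)

print(families[5])                 # subsets that contain item 5
print(hits_all(families, [1, 3]))  # a set cover, hence a hitting set
print(hits_all(families, [1, 2]))  # item 5 is not hit
```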
Garey and Johnson in [
10] have proven that the SCP is NP-hard in the strong sense. Exact algorithms are mostly based on branch-and-bound and branch-and-cut techniques. Etcheberry [
11] utilizes sub-gradient optimization in a branch-and-bound framework. Balas and Ho [
12] present a procedure based on cutting planes from
conditional bounds, i.e., valid lower bounds if the constraint set is amended by certain inequalities. Beasley [
13] introduces an algorithm which blends dual ascent, sub-gradient optimization and linear programming. In [
14], Beasley and Jornsten incorporate the algorithm of [
13] into a Lagrangian heuristic. Fisher and Kedia [
15] use continuous heuristics applied to the dual of the linear programming relaxation, obtaining lower bounds for a branch-and-bound algorithm. Finally, we mention Balas and Carrera [
16] with their procedure based on a dynamic sub-gradient optimization and branch-and-bound. These algorithms were tested on instances involving up to 200 rows and 2000 columns in the case of Balas and Fisher’s algorithms and 400 rows and 4000 columns in [
13,
14,
16]. Among these algorithms, the fastest is that of Balas and Carrera, with an average time on the order of 100 s on small instances and 1000 s on the largest ones (on a Cray-1S computer). Caprara [
17] compared these methods with the general-purpose ILP solvers CPLEX 4.0.8 and MINTO 2.3, observing that the latter have execution times competitive with those of the best exact algorithms for the SCP in the literature.
In most industrial applications it is important to rely on heuristic methods in order to obtain “good” solutions quickly enough to meet the expectations of decision-makers. To this purpose, many heuristics have been presented in the literature. The classical greedy algorithm proposed by Chvatal [
18] sequentially inserts the set with a minimum
score in the solution. Chvatal proved that the worst-case performance ratio does not exceed
$H\left(d\right)={\sum}_{i=1}^{d}\frac{1}{i}$, where
d is the size of the largest set. More recently, Kordalewski [
19] described a new approximation heuristic for the SCP. His algorithm follows the same scheme as Chvatal’s procedure, but modifies the score by including a new parameter, named
difficulty. Wang et al. [
20] presented the TS-IDS algorithm, designed for deep web crawling; later, Singhania [
21] tested it in a resource management application. Feo and Resende [
22] presented a Greedy Randomized Adaptive Search Procedure (GRASP), in which they first constructed an initial solution through an adaptive randomized greedy function and then applied a local search procedure. Haouari and Chaouachi [
23] introduced PROGRES, a probabilistic greedy search heuristic which uses diversification schemes along with a learning strategy.
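To make Chvatal's greedy scheme concrete, the sketch below takes the score of a set to be its cost divided by the number of still-uncovered items it contains, and repeatedly inserts the set with the minimum score. The tie-breaking rule (first set found), the function names and the toy instance are assumptions made for illustration; this is not the implementation evaluated in the paper.

```python
from typing import Dict, List, Set


def chvatal_greedy(items: Set[int], subsets: Dict[int, Set[int]],
                   costs: Dict[int, float]) -> List[int]:
    """Greedy cover: at each step pick the set with minimum cost per newly covered item."""
    uncovered = set(items)
    solution: List[int] = []
    while uncovered:
        best_j, best_score = None, float("inf")
        for j, members in subsets.items():
            gain = len(members & uncovered)
            if gain == 0:
                continue  # this set covers nothing new
            score = costs[j] / gain
            if score < best_score:  # ties: keep the first set found (assumption)
                best_j, best_score = j, score
        if best_j is None:
            raise ValueError("infeasible instance: some item belongs to no subset")
        solution.append(best_j)
        uncovered -= subsets[best_j]
    return solution


def harmonic(d: int) -> float:
    """H(d) = 1 + 1/2 + ... + 1/d, the bound on the worst-case performance ratio."""
    return sum(1.0 / i for i in range(1, d + 1))


# Hypothetical toy instance (illustration only).
items = {1, 2, 3, 4, 5}
subsets = {1: {1, 2, 3}, 2: {2, 4}, 3: {3, 4, 5}, 4: {5}}
costs = {1: 3.0, 2: 1.0, 3: 2.0, 4: 1.0}

solution = chvatal_greedy(items, subsets, costs)
greedy_cost = sum(costs[j] for j in solution)
print(solution, greedy_cost, harmonic(3))
```

On this toy instance the greedy procedure selects sets 2, 3 and 1 for a total cost of 6, whereas sets 1 and 3 alone cover all items at cost 5; the ratio 6/5 is within the bound $H(3)\approx 1.83$.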
Regarding Lagrangian heuristics, we mention the algorithm developed by Beasley [
24] and later improved by Haddadi [
25], which consists of a sub-gradient optimization procedure coupled with a greedy algorithm and Lagrangian cost fixing. A similar procedure was designed by Caprara et al. [
26], which includes three phases,
sub-gradient,
heuristic and
column fixing, followed by a refining procedure. Beasley and Chu [
27] proposed a genetic algorithm in which a variable mutation rate and two new operators are defined. Similarly, Aickelin [
28] describes an indirect genetic algorithm. In this procedure actual solutions are found by an external decoder function and then another indirect optimization layer is used to improve the result. Lastly, we mention Meta-Raps, introduced by Lan et al. [
29], an iterative search procedure that uses randomness as a way to avoid local optima. All the mentioned heuristics have computation times that are not compatible with real-world contexts. For example, Caprara’s algorithm [
26] produces solutions with an average computing time of about 400 s (on a DECstation 5000/240 CPU), if executed on non-unicost instances from Beasley’s OR Library, with
$500\times 5000$ and
$1000\times 10{,}000$ as matrix sizes. Indeed, the difficulty of the problem leads to very high computational costs, which has led academics to research heuristics and meta-heuristics capable of obtaining good solutions, as close as possible to the optimal, in a very short time, in order to tackle real-time applications. In this respect, it is worth noting the paper by Grossman and Wool [
30], in which a comparative study of eight approximation algorithms for the unicost SCP is presented. Among these were several greedy variants, fractional relaxations and randomized algorithms. Other investigations carried out over the years include the following: Galinier et al. [
31], who studied a variant of SCP, called the Large Set Covering Problem (LSCP), in which sets are possibly infinite; Lanza-Gutierrez et al. [
32], who were interested in the difficulty of applying metaheuristics designed for solving continuous optimization problems to the SCP; Sundar et al. [
33], who proposed another algorithm to solve the SCP by combining an artificial bee colony (ABC) algorithm with local search; Maneengam et al. [
34], who, in order to solve the green ship routing and scheduling problem (GSRSP), developed a set covering model based on route representation which includes berth time-window constraints; finally, an empirical complexity analysis of the set covering problem, and other problems, was recently conducted by Derpich et al. [
35].
In this paper, we exploit concepts from Information Theory (see Borda [
36]) to improve Chvatal’s greedy algorithm. Our purpose is to devise a heuristic able to improve the quality of the solution while retaining similar execution times to those of Chvatal’s algorithm, making it suitable for real-time applications. The main contributions of the current work can be summarized as follows.
The development of a real-time algorithm, named Surprisal-Based Greedy Heuristic (SBH), for the fast computation of high quality solutions for the set covering problem. In particular, our algorithm introduces a surprisal measure, also known as self-information, to partly account for the problem structure while constructing the solution.
A comparison of the proposed heuristic with three other greedy algorithms, namely Chvatal’s greedy procedure [
18], Kordalewski’s algorithm [
19] and the Altgreedy procedure [
30] for unicost problems. SBH improves the classical Chvatal greedy algorithm [
18] in terms of objective function and has the same scalability in computation time, while Kordalewski’s algorithm produces slightly better solutions but has computation times that are much higher than those of the SBH algorithm, making it impractical for real-time applications.
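For reference, the surprisal (self-information) of an event $x$ that occurs with probability $p(x)$ is defined in information theory as follows. The base of the logarithm only fixes the unit (base 2 gives bits); the choice of base 2 here is an assumption, and the way SBH derives probabilities from an instance is not restated here.

$$I(x) = -\log_{2} p(x)$$

Rare events (small $p(x)$) therefore carry a large surprisal, while certain events ($p(x)=1$) carry none.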
We emphasize that there is a plethora of other methods in the literature for solving the SCP, but most of them are time-consuming. We are only interested in fast heuristics that are compatible with real-time applications.
The remainder of the article is organized as follows. In
Section 2 we describe the three algorithms involved in our analysis and illustrate SBH.
Section 3 presents an experimental campaign which compares the greedy algorithms mentioned above. Finally,
Section 4 presents our conclusions.
3. Experimental Results
The aim of our computational experiments was to assess the performance of the SBH heuristic procedure with respect to the other greedy heuristics proposed in the literature. We implemented the heuristics in
C++ and performed our experiments on a stand-alone Linux machine with a 4 core processor clocked at 3 GHz and equipped with 16 GB of RAM. The algorithm was tested on 77 instances from Beasley’s OR Library [
37].
Table 1 describes the main features of the test instances and, in particular, the column density, i.e., the percentage of ones in matrix
a and column range, i.e., the minimum and maximum values of objective function coefficients. The remaining column headings are self-explanatory. Instances are divided into sets having sizes ranging from
$200\times 1000$ to
$1000\times 10{,}000$. Set
E contains small unicost instances of size
$50\times 500$. Sets 4, 5 and 6 were generated by Balas and Ho [
12] and consist of small instances with low density, while sets
A to
E come from Beasley [
13]. The remaining instances (sets
$NRE$ to
$NRH$) are from [
24]. Such instances are significantly larger and optimal solutions are not available. Similarly,
Table 2 reports features of seven large-scale real-world instances derived from the crew-scheduling problem [
26].
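The two instance features mentioned above, density and cost range, can be computed as in the following sketch. It assumes the instance is stored as a dense 0/1 matrix given as a list of rows; the function names and the small example are hypothetical, and the sketch is in Python for brevity rather than in the C++ used for the experiments.

```python
from typing import List, Sequence, Tuple


def column_density(matrix: List[List[int]]) -> float:
    """Percentage of ones in a 0/1 covering matrix stored as a list of rows."""
    m, n = len(matrix), len(matrix[0])
    ones = sum(sum(row) for row in matrix)
    return 100.0 * ones / (m * n)


def cost_range(costs: Sequence[float]) -> Tuple[float, float]:
    """Minimum and maximum objective function coefficients."""
    return min(costs), max(costs)


# Hypothetical 2 x 4 matrix: 4 ones out of 8 entries.
a = [[1, 0, 1, 0],
     [0, 1, 1, 0]]
print(column_density(a))
print(cost_range([3, 1, 2, 2]))
```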
We compared SBH with Chvatal’s procedure [
18] (CH) and the heuristic by Kordalewski [
19] (KORD).
Table 3,
Table 4 and
Table 5 report the computational results for each instance under the following headings:
Instance: the name of the instance, where the string before the dot refers to the set to which the instance belongs;
BS: objective function value of the best known solution;
SOL: the objective function value of the best solution determined by the heuristic;
TIME: the execution time in seconds;
GAP: percentage gap between BS and the SOL value.
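A common way to compute such a percentage gap, assumed here rather than taken from the text, is relative to the best known value BS:

$$\mathrm{GAP} = 100 \times \frac{\mathrm{SOL} - \mathrm{BS}}{\mathrm{BS}}$$

Under this assumption, a heuristic solution of cost 110 against a best known cost of 100 would have a gap of $10\%$.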
Columns “SBH vs. CH” and “SBH vs. KORD” report the percentage improvement of SBH w.r.t. CH and KORD, respectively. Regarding
Table 3, it is worth noting that our heuristic, compared to Chvatal’s greedy procedure, had a smaller average gap, decreasing from
$12.65\%$ to
$11.03\%$, with an average improvement of
$1.42\%$. Among these instances, SBH provided a better solution than [
18] in 19 out of 24 instances. We point out that the best objective function value was given by Kordalewski’s algorithm, which was slightly better than our SBH procedure (by only
$0.59\%$), but was slower.
Similar observations can be derived from
Table 4. Here, SBH performed better, even though it differed from the Kordalewski algorithm by only
$0.07\%$. Comparing SBH with CH, it is worth noting that only in 4 instances out of 45 did SBH obtain a worse solution. SBH came close to the optimal solution, with an average gap of
$10.69\%$, and was better than CH by
$2.62\%$. The execution time for all the instances averaged
$0.113$ s for CH,
$0.230$ s for the Kordalewski procedure and
$0.564$ s for SBH. As the size of the instances increased (which is the case in
$rail$ problems), Kordalewski’s algorithm became much slower. Consequently, on these instances we compared only the CH and SBH heuristics. On these instances, our SBH algorithm provided an average objective function improvement of
$5.82\%$ with comparable execution times. In conclusion, this first analysis showed that the new SBH heuristic generally produced very similar results with respect to Kordalewski’s heuristic. This is due to the fact that both heuristics consider the degree of row coverage, although in different ways, and, thus, the
difficulty in covering them. However, the large amount of time the KORD algorithm took to solve
$rail$ instances indicates that SBH is better suited to the requirements of real-time applications. Finally, the average percentage improvement of SBH with respect to CH, taking into account all instances, i.e., sets 4–6,
scp and
rail, amounted to
$2.5\%$.
We next compared the algorithms on unicost instances, obtained by setting the cost of all columns equal to 1, as in Grossman and Wool’s paper [
30]. In particular, we compared SBH with the
Altgreedy (ALTG) algorithm proposed by Grossman and Wool [
30], introduced in
Section 2.2. The results are shown in
Table 6,
Table 7 and
Table 8, where the subdivision of instances was the same as before. The additional column “SBH vs. ALTG” reports the percentage improvement of SBH with respect to the ALTG algorithm. Looking at
Table 6, it is worth noting that the best-performing heuristic was that of Kordalewski. Indeed, our heuristic SBH was worse than KORD by about
$3.49\%$, while it was better than the other two greedy procedures, with a gap of
$1.15\%$. Here, computation times were all comparable and ranged between
$0.002$ and
$0.007$ s. SBH improved its performance in larger instances, as shown in
Table 7 and
Table 8. We would like to point out that ALTG and CH produced the same solution cost for all of the instances, except for the
$rail$ ones. In particular, SBH yielded an average improvement of
$1.50\%$ on CH and ALTG ([
30]) on
$scp$ instances, and, respectively,
$1.39\%$ and
$12.97\%$ on
$rail$ instances. Comparing SBH and
KORD on the
$scp$ instances, we observed that they were very similar with a
$0.07\%$ improvement. In the largest instances (
Table 8), as noted above, it emerged that the computational time of KORD made it impractical for real-time applications. The analysis showed that, in most cases, SBH produced better solutions than the classical Chvatal algorithm. However, in a few instances CH presented a better solution. This phenomenon was attributable to the features of the instances. As shown in the example provided in
Section 2.3, SBH immediately recognizes columns that must necessarily be present in the solution, while CH only selects them when they exhibit the lowest unit cost. In conclusion, the computational campaign revealed that SBH generally outperformed CH when considering instances containing columns with few covered rows.