Gene-Similarity Normalization in a Genetic Algorithm for the Maximum k-Coverage Problem

The maximum k-coverage problem (MKCP) is a generalized covering problem that can be solved by genetic algorithms, but their operation is impeded by redundancy in the representation of MKCP solutions. We introduce a normalization step for candidate solutions, based on the distance between genes, which ensures that standard crossovers such as uniform and n-point crossover produce feasible solutions, and which improves solution quality. We present results from experiments in which this normalization was applied to a single crossover operation, as well as results for example MKCPs.


Introduction
The maximum k-coverage problem (MKCP) is regarded as a generalization of several covering problems. The problem has a range of applications in combinatorial optimization such as scheduling, circuit layout design, packing, facility location, and covering graphs by subgraphs [1,2]. Recently, besides some theoretical approaches [3,4], it has been extended to many real-world applications such as blog-watch [5], seed selection in a Web crawler [6], map-reduce [7], influence maximization in social networks [8], recommendation in e-commerce [9], sensor deployment in wireless sensor networks [10], multi-depot train driver scheduling [11], cloud computing [12], and location problems [13].
Let A = (a_ij) be an m × n 0-1 matrix, and let w_i be a weight attached to the ith row of A. The objective of MKCP is to choose k columns so as to maximize the sum of the weights of the rows that contain a '1' in at least one of the chosen k columns.
This problem can be represented formally as follows:

maximize ∑_{i=1}^{m} w_i · I( ∑_{j=1}^{n} a_ij x_j ≥ 1 )

subject to ∑_{j=1}^{n} x_j = k, x_j ∈ {0, 1}, j = 1, 2, . . . , n,

where I(·) is an indicator function (I(false) = 0 and I(true) = 1). We are concerned with the case that w_i = 1 for all i; in this case, we say that a row is covered by the selected k columns if it contains a '1' in at least one of these columns, and MKCP is to find k columns that cover as many rows as possible.
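For unit weights, this objective is straightforward to evaluate. The sketch below is our illustration, not code from the paper; the matrix A is a toy instance, and the function simply counts the rows covered by a chosen set of columns:

```python
def coverage(A, cols):
    """Number of rows of the 0-1 matrix A containing a '1' in at
    least one of the chosen columns (the case w_i = 1 for all i)."""
    return sum(1 for row in A if any(row[j] == 1 for j in cols))

# Toy 5x4 instance (0-indexed columns).
A = [[1, 0, 0, 0],
     [1, 1, 0, 1],
     [0, 1, 0, 0],
     [0, 0, 1, 0],
     [0, 0, 1, 1]]

print(coverage(A, {0, 2}))  # columns 0 and 2 cover rows 0, 1, 3, 4
```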
MKCP has many real-world applications. For example, MKCP can be applied to the following practical situation in museum touring [14]: suppose that a museum can operate only k guided tours.

Proposition 1. The relation ∼ is an equivalence relation.
Proof. For each i ≤ k, (Σ_i, •) forms a symmetric group S_i, where • is the function-composition operator [29]. The direct product P := ∏_{i=1}^{k} S_i is also a group [29] and therefore has an identity, which means that the relation ∼ is reflexive. For any σ ∈ P, its inverse σ^{-1} ∈ P exists, i.e., the relation ∼ is symmetric. The group P is closed under the operator •: if σ_1, σ_2 ∈ P, then σ_1 • σ_2 ∈ P, so the relation ∼ is also transitive. Taken together, the relation ∼ is an equivalence relation.
The equivalence relation of Definition 1 allows us to consider the real solution space (phenotype space) as the set of equivalence classes of elements in G, i.e., the quotient space G/∼.
We can measure the similarity of two vectors x and y in G, the space of MKCP solution encodings, using a distance metric D obtained by summing the values of a subsidiary metric d, which measures the difference between two columns of the matrix A defining the MKCP:

D(x, y) = ∑_{i=1}^{k} d(x_i, y_i). (1)

Here, the genes x_i and y_i of the two chromosomes represent chosen column indices. If we regard the indices as just labels, not column vectors, we can take d to be the discrete metric, which is zero if the two indices are the same and one otherwise. This satisfies all the conditions for a distance metric, including the triangle inequality.
An alternative metric is the Hamming distance, which measures the dissimilarity of two binary vectors. If we consider each column of the matrix A as a binary vector (the column vector), not just an integer, we can find the Hamming distance between any two column vectors, and hence between the corresponding genes x_i and y_i of two chromosomes in G. These distances can then be summed as shown in Equation (1). Figure 1 shows two distances in G calculated using the discrete metric and the Hamming distance. As shown in Figure 1c, the discrete metric simply compares the column indices of x_i and y_i and ignores the contents of the corresponding columns of A. In Figure 1d, we see that the distance between x_1 and y_1 is the Hamming distance between the first and the second column vectors of A.

Now, we establish a metric on the quotient space G/∼ by the following proposition:

Proposition 2. Let x = (x_1, x_2, . . . , x_k) and y = (y_1, y_2, . . . , y_k) be in G, and let the metric D on G be defined by Equation (1). Then

D̄(x̄, ȳ) := min_{σ ∈ ∏_{i=1}^{k} S_i} D(x, σ(y)) (2)

is a metric on G/∼, where σ_i(y) denotes the ith element of y permuted by σ.
Proof. Let σ be in the group ∏_{i=1}^{k} S_i. The computation D(x, y) := ∑_{i=1}^{k} d(x_i, y_i) is unaffected by the summation order, and thus D(x, y) = D(σ(x), σ(y)). Hence, σ is an isometry on G, and ∏_{i=1}^{k} S_i is an isometry subgroup. The relation ∼ is the equivalence relation obtained from ∏_{i=1}^{k} S_i. Hence, from [30,31], D̄(x̄, ȳ) is a metric on G/∼.
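Both gene-level metrics, and the summed chromosome distance D of Equation (1), can be sketched in a few lines (our illustration; for the Hamming variant, genes are passed as 0-1 column vectors rather than as indices):

```python
def discrete(i, j):
    """Discrete metric: 0 if the two genes are equal, 1 otherwise."""
    return 0 if i == j else 1

def hamming(u, v):
    """Hamming distance between two 0-1 column vectors of equal length."""
    return sum(a != b for a, b in zip(u, v))

def D(x, y, d):
    """Chromosome distance of Equation (1): the gene metric d summed
    over corresponding gene positions."""
    return sum(d(xi, yi) for xi, yi in zip(x, y))

# With indices as labels, the discrete metric applies; with column
# vectors, the Hamming distance applies.
print(D((1, 2), (1, 4), discrete))          # 1
print(D(((1, 1, 0), (0, 1, 1)),
        ((1, 1, 0), (0, 0, 1)), hamming))   # 1
```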

Normalization in MKCP
As shown above, redundancy in the integer representation of solutions to MKCP means that the encoding (genotype) space G is unnecessarily larger than the true solution (phenotype) space, which is the quotient space G/∼. Redundant representations can be expected to reduce the performance of genetic algorithms significantly; in particular, they undermine the effectiveness of standard crossovers defined using masks [32]. The problem of redundant representations has been addressed by a number of methods, such as adaptive crossover [33-36], among which the normalization technique [37] is representative. Normalization changes a parent genotype into a different genotype with the same phenotype that is similar to the genotype of the other parent, before a standard crossover is performed. It is based on adaptive crossovers [34,35], and many variants have appeared [31,38,39].

Preserving Feasibility
A representation is infeasible if it does not meet the requirements of a solution. Figure 2a shows an integer representation that is infeasible because it contains duplicate column indices, and Figure 2b shows a binary representation that is infeasible because it contains the wrong number of '1's. Figure 2 also shows how such infeasible solutions are likely to be created by standard crossover operators such as uniform and n-point crossovers. A repairing step can be performed to restore feasibility, but this has the effect of a mutation and may garble gene sequences inherited from the parents. With integer encoding, the problem can be avoided by rearranging the parents so that any shared column indices are in the same positions. Such a rearrangement naturally makes the offspring preserve feasibility, and no special repairing step is required. This form of normalization is easily implemented, as shown in the pseudocode in Figure 3, and takes O(k²) time.
Figure 4 shows how this normalization preserves feasibility for the example in Figure 2a.
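A minimal sketch of this feasibility-preserving normalization (our rendering of the idea in Figure 3, with hypothetical names; the indices within each chromosome are assumed distinct):

```python
def normalize_feasible(x, y):
    """Rearrange chromosome y so that every column index it shares
    with x occupies the same position as in x. O(k^2) time."""
    y = list(y)
    k = len(x)
    for i in range(k):
        for j in range(k):
            if i != j and y[j] == x[i]:
                y[i], y[j] = y[j], y[i]  # move the shared index to position i
                break
    return y

print(normalize_feasible((1, 2, 3), (3, 5, 1)))  # [1, 5, 3]
```

After this rearrangement, any mask-based crossover copies each shared index from the same position in both parents, so the offspring cannot contain duplicates.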

Normalization for Improving Solution Quality
A good solution to MKCP will consist of dissimilar columns. For example, in Figure 1, (1, 3) is a better solution than (1, 4) because the '1's in Columns 1 and 3 all occur in different rows, while Columns 1 and 4 share a '1' in Row 2.
In the context of a genetic algorithm, we would expect chromosomes whose genes correspond to dissimilar columns to be the most effective. Looking again at Figure 1, suppose that we apply a standard one-point crossover to the parents (1, 2) and (3, 4). If the cut point lies between the first gene and the second, then the offspring will be (1, 4), which covers the rows {1, 2, 3} of A. However, if the second parent is rearranged to (4, 3), the offspring becomes (1, 3), which covers {1, 2, 4, 5}. This example is illustrated in Figure 5. In general, we want the offspring to have genes that correspond to columns that are as dissimilar as possible. Thus, it is helpful to rearrange chromosomes so that genes corresponding to similar columns are in the same positions. The goal of this rearrangement can be formulated in terms of distance in the phenotype space G/∼. Consider two parents x and y in the genotype space G. Rearranging genes so that those corresponding to the most similar columns are located in the same positions is equivalent to searching for a permutation σ* such that the distance between the equivalence class of x and that of y is equal to the distance between x and σ*(y); this in turn is equivalent to finding the σ that minimizes the distance between x and the permuted y in Equation (2). The optimal rearrangement is achieved by considering all the permutations of the genes of the second parent and choosing the permutation that minimizes the sum of distances between the column vectors corresponding to gene pairs in the same locations. If the Hamming distance H is used to measure the dissimilarity between the column vectors corresponding to two genes, we choose the permutation σ* such that

σ* = argmin_{σ ∈ Σ_k} ∑_{i=1}^{k} H(x_i, σ_i(y)), (3)

where Σ_k is the set of all permutations of length k and σ_i(y) denotes the ith element of y permuted by σ.
We give an example in Figure 6, in which the chromosomes (1, 2) and (3, 4) are the parents, and we normalize (3, 4). Because k is 2, there are only two permutations, and we compute ∑_{i=1}^{k} H(x_i, σ_i(y)) for each. For the identity permutation σ¹, σ¹(y) = (3, 4); then H(x_1, σ¹_1(y)) = 4 and H(x_2, σ¹_2(y)) = 2, and their sum is 6. For the second permutation σ², which exchanges the two positions, σ²(y) = (4, 3); then H(x_1, σ²_1(y)) = 2 and H(x_2, σ²_2(y)) = 2, and their sum is 4. Since this is smaller than 6, σ² is the optimal permutation. Enumerating all k! permutations in this way is intractable for large k. However, the problem can be solved using the Hungarian method [40], a network-flow-based technique, which provides an optimal result and runs in O(k³) time [41]. Alternatively, we can use a fast heuristic [42], which runs in O(k²) time and produces results very close to the optimum. Either of these methods can be treated as a function that accepts a 2D array whose elements are the distances between two genes, and returns the permutation that minimizes the total distance between the chromosomes. This normalization is shown in the pseudocode of Figure 7.
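For small k, Equation (3) can be evaluated directly by enumerating all k! permutations, as the sketch below does. The four column vectors are made up by us (they are not the matrix of Figure 6), but their pairwise distances reproduce the same 6-versus-4 arithmetic. For larger k, the same minimum can be obtained by feeding the k × k distance matrix to an assignment solver such as an implementation of the Hungarian method.

```python
from itertools import permutations

# Made-up column vectors indexed 0..3 (not the data of Figure 6).
COLS = [(1, 1, 0, 0, 0),
        (0, 0, 1, 0, 1),
        (0, 0, 1, 1, 0),
        (1, 0, 1, 0, 0)]

def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))

def normalize_opt(x, y, cols):
    """Return y permuted to minimize the total Hamming distance between
    column vectors in the same gene positions (Equation (3)).
    Brute force over all k! permutations; exact, but only for small k."""
    return list(min(permutations(y),
                    key=lambda p: sum(hamming(cols[a], cols[b])
                                      for a, b in zip(x, p))))

# Identity order costs 4 + 2 = 6; the swapped order costs 2 + 2 = 4.
print(normalize_opt((0, 1), (2, 3), COLS))  # [3, 2]
```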
After we rearrange the second parent according to the permutation σ*, a standard crossover is applied. We now investigate the relation between this optimal rearrangement and feasibility. There may be more than one optimal rearrangement satisfying Equation (3), but we can show that one of them is always feasible: feasibility is preserved by an optimal permutation σ* that locates all common indices in the same positions, as we now prove.

Proposition 3. If x = (x_1, x_2, . . . , x_k) and y = (y_1, y_2, . . . , y_k) are two chromosomes and x_p = y_q, then there is an optimal permutation σ* satisfying Equation (3) such that σ*_p(y) = y_q.

Proof. Let σ be an optimal permutation with σ_p(y) ≠ y_q. There is an index r satisfying σ_r(y) = y_q. Let σ′ be the same permutation as σ, except that σ_p(y) and σ_r(y) are exchanged. Since x_p = y_q, we have H(x_p, σ′_p(y)) = H(x_p, y_q) = 0 and, by the triangle inequality, H(x_r, σ′_r(y)) = H(x_r, σ_p(y)) ≤ H(x_r, y_q) + H(y_q, σ_p(y)) = H(x_r, σ_r(y)) + H(x_p, σ_p(y)). Hence, ∑_{i=1}^{k} H(x_i, σ′_i(y)) ≤ ∑_{i=1}^{k} H(x_i, σ_i(y)), so σ′ is also optimal and satisfies σ′_p(y) = y_q.
This proof relies on the triangle inequality, which is a key property of any valid distance metric. Thus, Proposition 3 holds for distances other than the Hamming distance. We could instead use the discrete metric introduced in Section 2. This distance provides only a very rough comparison of two solutions, but Proposition 3 still holds. Equation (3) can be rewritten using the discrete metric ρ instead of the Hamming distance H:

σ* = argmin_{σ ∈ Σ_k} ∑_{i=1}^{k} ρ(x_i, σ_i(y)). (4)

In this case, feasibility is preserved by any optimal permutation σ* in Equation (4). The proof is quite similar to that of Proposition 3.

Proposition 4. If x = (x_1, x_2, . . . , x_k) and y = (y_1, y_2, . . . , y_k) are two chromosomes and x_p = y_q, then σ*_p(y) = y_q for every optimal permutation σ* in Equation (4).

Proof. Suppose that σ*_p(y) ≠ y_q. There is an index r satisfying σ*_r(y) = y_q. Let σ′ be the same permutation as σ*, except that σ*_p(y) and σ*_r(y) are exchanged. Since the indices of a feasible chromosome are distinct, x_r ≠ x_p = y_q, and so ρ(x_p, σ*_p(y)) + ρ(x_r, σ*_r(y)) = 1 + ρ(x_r, y_q) = 2, whereas ρ(x_p, σ′_p(y)) + ρ(x_r, σ′_r(y)) = ρ(x_p, y_q) + ρ(x_r, σ*_p(y)) ≤ 0 + 1 = 1 (∵ x_p = y_q by assumption). Hence, ∑_{i=1}^{k} ρ(x_i, σ′_i(y)) < ∑_{i=1}^{k} ρ(x_i, σ*_i(y)). This contradicts the assumption that σ* is optimal in Equation (4).
Using the discrete metric will only cause identical indices to be rearranged into the same positions. In fact, normalization by the discrete metric is exactly the same as the feasibility-preserving rearrangement introduced in Section 3.1.

Test Sets and Test Environments
Our experiments were conducted on 65 instances of 11 set cover problems with various sizes and densities from the OR-library [43]. Although these benchmark data were designed as set cover problems, they can also be treated as maximum k-coverage problems, as in [27]. Some details of these problems are presented in Table 1, where m and n are the numbers of rows and columns, respectively, and density is the percentage of '1's in the MKCP matrix A. The present authors previously experimented with fixed values of k of 10 and 20 [28]. In this study, however, we varied k through the tightness ratio α, which is the product of k and the density of a problem. The higher the tightness ratio, the larger the value of the objective function (the coverage) that we are likely to achieve; if the tightness ratio is 1, the optimum coverage is likely to be very close to m. We used tightness ratios of 0.8, 0.6, and 0.4.
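Since α = k × density, the value of k used for a given instance follows directly from its density. The helper below makes this explicit (our convention, including the rounding, which the paper does not specify):

```python
def k_for_tightness(alpha, density):
    """k such that k * density ≈ alpha; density is the fraction of
    '1's in A (e.g., 0.02 for a 2% instance). Rounding is our choice."""
    return max(1, round(alpha / density))

# A 2%-density instance with target tightness ratio 0.8:
print(k_for_tightness(0.8, 0.02))  # 40
```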
We implemented all the tested algorithms in C, compiled with gcc 5.4.0, and ran them on Ubuntu 16.04.6.

Effect of Normalization on a Crossover
To see whether or not normalization is effective at crossover, we performed experiments with three methods of rearranging the second of two parents before a crossover:
• REPAIR: The second parent is not rearranged before the crossover, but infeasible offspring are repaired to restore feasibility after the crossover.
• FP: The second parent is rearranged to produce only feasible offspring, using the normalization in Figure 3.
• OPT: The second parent is rearranged by the normalization in Figure 7 to minimize the sum of the distances, using the permutation σ* of Equation (3).
REPAIR produces feasible offspring by replacing duplicate column indices with indices chosen randomly from those not contained in the offspring. OPT was implemented using the Hungarian method [40], which runs in O(k³) time.
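A sketch of such a random repair (our rendering; `n` is the number of columns of the instance):

```python
import random

def repair(offspring, n):
    """Replace each duplicate column index with a randomly chosen index
    not contained in the offspring, restoring feasibility (REPAIR)."""
    available = [c for c in range(n) if c not in offspring]
    random.shuffle(available)
    seen, repaired = set(), []
    for g in offspring:
        if g in seen:
            g = available.pop()  # this replacement acts like a mutation
        seen.add(g)
        repaired.append(g)
    return repaired

print(repair([1, 1, 3], n=6))  # e.g., [1, 0, 3] -- the duplicate is replaced
```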
To determine the most effective method, we performed the following steps:
1. N parent chromosomes were randomly generated (N was set to 100).
2. The chromosomes were randomly grouped into N/2 pairs.
3. The second parent in each pair was rearranged using the methods described above.
4. A uniform crossover was applied to each pair.
5. We computed the mean and the standard deviation of the coverage of the N/2 offspring.

This procedure was applied to a single instance of each problem listed in Table 1 using REPAIR, FP, and OPT. Table 2 shows the results for each of these 11 instances. We see that OPT normalization outperforms the others and can therefore be expected to improve the performance of a GA. Moreover, the results of the one-tailed t-test show that the objective values of offspring produced using OPT normalization were better than those of their parents, whereas the values produced using REPAIR and FP were similar to those of their parents. This suggests that, compared with the other methods, OPT normalization strongly supports the crossover operator in searching the solution space in a promising direction, without any replacement strategy. Ave and SD are the average and the standard deviation of the fitness of 100 parents (in the "Parents" column) or 50 offspring (in the REPAIR, FP, and OPT columns). REPAIR produces feasible offspring by random repair. FP rearranges the second parent to produce feasible offspring using the normalization in Figure 3. OPT is the optimized normalization of the second parent. * One-tailed t-test of the null hypothesis that the result of a given method is equal to the fitness of the parents.

Performance of GAs with Normalization Methods
Our underlying evolutionary model is similar to that of CHC [44], which has been applied to many problems [45-50]. We paired a population of N chromosomes randomly and then applied crossover to each pair, generating a total of N/2 offspring. We ranked all parents and offspring, and the fittest N individuals among them became the population of the next generation. We used a population size of 100 in our experiments. We reinitialized the population, except for the best individual, if there was no change over kr(1 − r) generations, where r is a divergence ratio that was set to 0.25. The GA stopped after 500 generations and returned the best solution it had found. The pseudocode of our GA is given in Figure 8. In the following experiments, we changed only the normalization method within a single GA. We compared the output of our GA with the best result found in this study using the metric %-gap, which is 100 × |best − output|/best.
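A compact, runnable sketch of this evolutionary model (our simplification of Figure 8: random pairing, FP normalization, uniform crossover, and truncation selection of the fittest N; the restart rule and OPT normalization are omitted for brevity, and the instance at the bottom is a toy):

```python
import random

def normalize_feasible(x, y):
    """Align indices shared with x into the same positions (as in Figure 3)."""
    y = list(y)
    for i in range(len(x)):
        for j in range(len(y)):
            if i != j and y[j] == x[i]:
                y[i], y[j] = y[j], y[i]
                break
    return y

def ga(A, n, k, generations=50, N=20):
    """CHC-like loop: parents and offspring compete; the fittest N survive."""
    def fitness(sol):
        return sum(1 for row in A if any(row[j] for j in sol))
    pop = [random.sample(range(n), k) for _ in range(N)]
    for _ in range(generations):
        random.shuffle(pop)
        offspring = []
        for p1, p2 in zip(pop[::2], pop[1::2]):
            p2 = normalize_feasible(p1, p2)  # FP keeps offspring feasible
            offspring.append([a if random.random() < 0.5 else b
                              for a, b in zip(p1, p2)])
        pop = sorted(pop + offspring, key=fitness, reverse=True)[:N]
    return pop[0]

A = [[1, 0, 0, 0],
     [1, 1, 0, 1],
     [0, 1, 0, 0],
     [0, 0, 1, 0],
     [0, 0, 1, 1]]
best = ga(A, n=4, k=2)
```

Because FP normalization aligns all shared indices before the uniform crossover, every offspring in this loop is feasible and no repair step is needed.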
Thirty trials were performed for each method, and the averaged results are shown in Table 3. The results labeled RR-GA were produced without normalization: infeasible offspring were repaired randomly using the same method as REPAIR in Section 4.2. We also compared the results of the GA with a multi-start method using randomly generated solutions. In each run of this method, called Multi-Start, we sampled 10⁶ random solutions and chose the best one. Even RR-GA performed significantly better than Multi-Start, suggesting that GAs are an appropriate mechanism for solving the MKCP.
The results labeled FP-GA in Table 3 were produced by rearranging the genes of the second parent to produce feasible offspring without the need for repair. FP-GA outperformed RR-GA for large values of k but not for small ones; it seems that the implicit mutation introduced by repair is rather effective when the solution space is small.
The results labeled OPT-GA in Table 3 were produced by the GA with the proposed normalization. Using the same GA as FP-GA, OPT-GA rearranges the genes of the second parent to minimize the sum of the distances between genes (column vectors) before applying recombination. OPT-GA clearly outperforms Multi-Start and RR-GA, and the results of one-tailed t-tests show that OPT-GA also significantly outperforms FP-GA.
The results of the t-tests show that Multi-Start is clearly the worst technique, even though it is allowed 10⁶ evaluations, while the GA-based methods evaluate comparatively few (2.5 × 10⁴) chromosomes. RR-GA and FP-GA have similar performance, and OPT-GA clearly does best. Averages are taken over 30 runs. Multi-Start generates random solutions and chooses the best. RR-GA is a genetic algorithm with random repair. FP-GA is a genetic algorithm with FP normalization. OPT-GA is a genetic algorithm with OPT normalization. 1 One-tailed t-test with the null hypothesis RR-GA = Multi-Start. 2 One-tailed t-test with the null hypothesis FP-GA = RR-GA. 3 One-tailed t-test with the null hypothesis OPT-GA = FP-GA.

Conclusions
We presented the maximum k-coverage problem (MKCP) and analyzed its representation and solution space. When a GA is applied to the MKCP, we immediately encounter the issue of redundancy in the genotype space, which is larger than the phenotype space that we characterized as a quotient space. We introduced a method of normalizing chromosomes that ensures a crossover produces feasible offspring whose genes, viewed as column vectors of the MKCP matrix, are as dissimilar as possible. This normalization was implemented using the Hungarian method [40]. We performed experiments that showed the effectiveness of this approach.
In this study, we adopted two locus-based metrics: the discrete metric and its extension derived from the Hamming distance between genes (column vectors). However, other metrics, such as variants of the Cayley metric on permutations [51,52], may also be applied within the proposed theoretical framework. In the case of such non-locus-based metrics, a new crossover tailored to the metric would have to be designed. This is a promising direction, which we leave for future study.
As mentioned in the Introduction, the proposed theoretical framework can be applied to real-world applications such as cyber-physical social systems and public safety networks. We leave this applied work for future study. We also expect that this approach can be applied to other problems that share the same solution representation. By extending the technique to variable-length solution representations, as in [53,54], we believe it could also be applied to the set cover problem.