Fast Feature Selection in a GPU Cluster Using the Delta Test

Feature or variable selection still remains an unsolved problem, due to the infeasible evaluation of all the solution space. Several algorithms based on heuristics have been proposed so far with successful results. However, these algorithms were not designed for considering very large datasets, making their execution impossible, due to the memory and time limitations. This paper presents an implementation of a genetic algorithm that has been parallelized using the classical island approach, but also considering graphic processing units to speed up the computation of the fitness function. Special attention has been paid to the population evaluation, as well as to the migration operator in the parallel genetic algorithm (GA), which is not usually considered too significant; although, as the experiments will show, it is crucial in order to obtain robust results.


Introduction
The problem of variable selection is crucial for the time of the design models that classify or perform regression; so, the better the selection is, the more accurate that models can be designed [1][2][3].Although classification and regression are both modeling problems, they should be treated separately.Thus, the work presented in this paper focuses on regression, also known as function approximation.Formally, the function approximation problem is to determine, given a set of input/output pairs (x i , y i ) ∈ R d × R i = 1...N and an unknown function, f , such as f (x i ) ≈ y i .From this, the problem of variable selection can be defined as the search for the subset of variables that make it possible to build a model that approximates the data as accurately as possible.
At first, the method to determine the most adequate subset of variables could seem trivial: build a model for each possible solution.This approach could be considered as the brute force approach and would be optimal; however, it has critical problems: • It is not possible to compute all the combinations when the problem has a large number of variables, due to the computational cost; • The validity criterion depends in the model, which usually has several parameters that affect the final results considerably.
These problems have encouraged the use of non-parametric metrics (which are model independent) to be more objective about the time of evaluating a solution.One of the mainstream ways is to use mutual information (MI) [4] as a criterion.MI is defined as the difference between the sum of the individual entropies and the joint entropy and of two events (in the variable selection problem, the events are the output, Y , and the inputs, X).The higher it is, the more relevant for the model the variables selected for X should be.This approach has been successfully applied using several estimators to compute it.Another non-parametric criterion that has been applied to variable selection is the delta test (DT) [5].There are few papers that apply it, although its adequacy to perform variable selection has been theoretically demonstrated in [6,7].Concretely, in [7], there are comparisons of DT versus MI using Kraskov's estimator, where DT shows better performance.A recent comparison between both approaches has been done in [8], where the MI showed a slightly better performance according to experts' opinions.However, in [9], MI estimation using the Parzen window and Kraskov's method were outperformed by DT, considering objective metrics as the approximation error using least squares support vector machines and experts' opinions.The interesting thing about this last paper is that, due to the small number of samples, an exhaustive search is performed, so that the global optimum for each metric is obtained.This encourages the research presented in this paper, because it emphasizes how important it is to analyze as many solutions as possible.Therefore, this work is focused on optimizing a parallel genetic algorithm, which is implemented on a novel high performance computing (HPC) architecture using clusters of computers with several graphical processing units (GPUs).Using this architecture, speed up can be obtained by using GPUs to compute the fitness (in this case, the DT) and CPUs to parallelize the genetic algorithm (GA).Moreover, the parallelization of optimization algorithms seems to be the only way to obtain solutions in large dataset problems, as there are computation and time limitations if a single machine is used.This work presents analyses, as well one of the most important parameters in parallel genetic algorithms: the migration operator.
The rest of the paper is organized as follows: Section 2 introduces the delta test.Afterwards, Section 3 describes the design of the parallel genetic algorithm, which will perform the optimizations presented in the experiments included in Section 4. Finally, conclusions are discussed.

Delta Test in Variable Selection
In order to evaluate the goodness of an individual, the delta test (DT) [5] value obtained using the combination of variables is used, as it has been shown to be an adequate criterion [6,10].
In function estimation, the main problem is to find the correct signal and noise terms.As mentioned, the relationship between input (signal) x i and output y i is represented as where r i is the noise term.The usual assumption in many domains behind r i is that the term is independent and identically distributed following the normal distribution r i ∼ N (0, σ 2 ).In the training phase, the goal is to have a model with a good generalization ability that does not "learn" the noise, r i .
The DT provides a way to find out how much noise is present in the data before the modeling stage.That is, it is estimating the variance of the noise, σ 2 , or the mean squared error (MSE), which can be obtained without overfitting.
The DT is computed using the nearest neighbor approach as: where the first nearest neighbor of a point, x i , in the R d space is x N N (i) and y N N (i) is the output of x N N (i) .The DT is a special case of the Gamma Test [11], another noise variance estimator is based on nearest neighbor distributions.The difference is in the extra hyper-parameter present in the Gamma Test (the number of neighbors), while the DT uses only the first nearest neighbor, providing a fully non-parametric method.The reduction to only the closest neighbor still gives the unbiased estimator of the noise variance in the limit N → ∞ [10].

Computation of Delta Test Using Pre-Calculated Distances
Computation of the nearest neighbor in the naive way involves calculating the distances between each pair of samples: and returning the smallest d i,j and the corresponding index of the Nearest Neigbor, N N (i), for each sample.Since the focus is on examining non-empty subsets of variables that can share individual elements, a lot of time is wasted recomputing the squared differences to obtain d i,j .A simple solution to decrease the running time is to store that information into a N (N − 1)/2 × d matrix, where each row contains precomputed squared differences for a pair of samples (x i , x j ).Given this matrix, computing all pairwise distances for a given variable subset I ⊆ {1, 2, . . ., d} involves summing precomputed values for those I variables (i.e., the I-th columns of the matrix).

Computation of the Delta Test on a GPU
The computation of the k nearest neighbors (KNN) requires great computational effort, since it has to compute the pairwise distances between all the points and, then, sort them to choose the closest ones.In [12], an implementation of the KNN algorithm on a GPU (the code is available at [13]) is presented.In this approach, brute force is used to compute the distances between the input vectors, as they are independent of each other; therefore, they are easily parallelized using the single instruction multiple data (SIMD) paradigm.
After computing the distances, they need to be sorted to determine the closest neighbors.To do so, the insertion sort algorithm used is Quicksort; due to its recursive nature, it cannot be implemented on GPUs.
All the processes require about 10 times faster speed than other algorithms implemented in libraries, such as ANN (Approximate Nearest Neighbor [14]) which uses kd-trees and box-decomposition trees to improve performance.
Another advantage of using the GPU is that, although the computation is repeated constantly, it allows one to handle large problems, unlike the pre-calculated distance approach, which might have memory limitations.

Design of the Genetic Algorithm
Genetic algorithms (GAs) are a well-known optimization tool that has been applied to many problems.This kind of algorithm is based on the principles of natural evolution, where the offspring of the new generations are supposed to be better than their parents.The main advantage of GAs is that they have the ability to explore the solution space globally, but at the same time, they can exploit the regions where the best solutions are.
When a GA is designed, the first decision to be made is how an individual will represent a solution, that is, the solution encoding.The variable selection problem has a straightforward encoding using a binary chromosome, whose length is equal to the number of variables.If a gene within the chromosome equals one, the variable is selected; if it is zero, then the variable is discarded.
Once the encoding of the solutions is defined, it is necessary to determine the operators that will affect the evolution of the individuals.These are introduced in the following subsections.

GA Operator Description
There are several types of operators that will modify the chromosome of an individual.The first one is the selection operator, which determines if an individual is going to reproduce and generate offspring.If the selection is very restrictive and only the best individuals reproduce, the convergence of the algorithm is accelerated.On the other hand, if random individuals are chosen, there is the risk of not converging to a good solution, as well as a lack of robustness.The chosen operator is "binary tournament selection", which consists of taking two couples of individuals and selecting from each couple the individual that has better fitness.
The second one is the crossover operator, which is responsible for combining the genes of the parents to produce the offspring.For the binary encoding, many operators have been proposed, although the two-points binary crossover seems to have a compromise between the exploration/exploitation of the solutions [15].This operator defines two points in the chromosome where the individuals exchange genes, generating two new individuals.
Once the offspring are obtained, they might suffer a mutation in their chromosome.This mutation is the one that helps the algorithm to escape from local minima and jump to other regions of the solution space.The mutation selected was at the gene level, that is, if the individual is mutating, select-unselect a random variable.
Finally, to obtain the population t + 1, it has to be decided if the whole population, t, is going to be replaced (generational GA) or only some of the offspring replace their ancestors (stationary GA).As there is the need to obtain a high number of solutions, it is more adequate to use the newly generated individuals, but to maintain exploitation, the best individuals of the previous generation are kept.This is known as elitism [16].
The evolution continues until a stop criterion is satisfied.There are many possibilities and approaches proposed in the literature [17]; however, the proposed algorithm only considers an execution time limit of 600 s.This criterion affects all the previous operators, as well as their parameters (i.e., crossover and mutation probabilities).
The reason for choosing 600 s is because this value is considered as the maximum time an operator is willing to wait before a solution is provided.Even though variable selection is an of f − line problem, a human operator might decide to recompute solutions with new collected data at any time.
This time limit has been previously used in the literature for GAs [18,19], and it is a common time barrier used in many programs implemented by important companies, like Microsoft [20], Apple [21] , Apache [22] , etc.
Therefore, the limitation implies that the convergence should be fast, but diversity should still be reasonably maintained.These are the reasons to select the binary tournament (high selection pressure), two-point binary crossover with a high probability of obtaining a high exploitation of the solutions, mutation at a gene level, but with a high probability of maintain exploration, and elitism, to always keep the best chromosomes so far (high exploitation).
To summarize, the standard operators chosen for the algorithm are: (1) Crossover: two-point binary with 0.8 probability.

Parallelizing the GA
Apart from the optimization of the computation of the fitness function using GPUs, the available architecture allows the algorithm to be distributed through several machines in a classical cluster manner.GAs are intrinsically parallel in the sense that several operations can be done in parallel, because they are independent.However, modifications in the execution flow, like splitting the populations into sub-populations, also known as demes, lead to better results [23][24][25][26].

Island Model and Migration/Immigration Policy
Among the several approaches proposed in the literature to parallelize GAs [27], the multi-deme distributed GA (also known as the island model) is one of the most popular, due to its good behavior [26][27][28].This implementation consists of evolving isolated populations on different islands and, in some cases, exchanging individuals.The implementation of this paradigm usually maps an island to a processor.In the available architecture for this work, since a processor might have several cores, populations are mapped to cores.Regarding the use of the GPUs to compute the DT for each individual, an island uses one GPU to compute the distance matrix.
The mechanism that exchanges individuals between the islands is known as the migration operator.The benefits of this procedure can be different, depending on the way it is applied and the criterion that selects the individuals [29]; it can help to maintain diversity by exploring new solutions provided by the new individuals, or it might help exploitation, because all islands will share some equal individuals.The communication between the islands requires one to consider the following parameters: • Send policy: decides which individual or individuals should be sent.
• Acceptance policy: decides if one or several individuals coming from other islands are or are not included in the destination island.• Replacement policy: if new individuals are incorporated into the population, this policy must determine which individuals of the current population should be replaced (in order to maintain the population size).• Communication topology: defines the migration flow, that is, between which islands are individuals being exchanging.• Communication frequency: determines when the migration step will be carried out.
The analysis of the GA behavior as a function of the migration process has not been studied deeply since some years ago [30,31], although recent papers study this parameter, as it has an important influence on the results [32,33].Concretely, in [33], the authors propose the MultiKulti algorithm, whose main novelty is to introduce a criterion that determines if an island accepts a migration or does not.The results presented showed that the diversity and the number of individual evaluations before the algorithm finished increased using this migration filtering.
The migration rate is a parameter that can be random [34], fixed [26] or autoregulated, depending on the diversity of the population.However, to consider the population diversity to modulate the migration rate is equivalent to determining a replacement policy.The reason is that if the replacement policy determination is based as well on the current population and does not accept any replacement, it is as if the migration rate has been decreased.
The details on how the algorithm performs the selection of the individuals selected to be migrated and the ones to be replaced is analyzed in detail in Section 4.3.Regarding the migration topology, all the islands communicate with each other in a collective manner, as is depicted in Figure 1.

Population Distribution in the Cluster
Since the available architecture is a heterogeneous grid of computers, it is obvious that the time to complete a generation during the run will be different on each computer [35].As the distributed populations perform a collective communication broadcasting to individuals, the global performance of the algorithm would be injured if the fastest machines had to wait for the slower ones, wasting computing resources.In order to ameliorate this fact, the decision of setting different population sizes has been taken.Slower or overloaded machines will process less individuals, allowing these processes to require a shorter time to complete a generation.Therefore, during the collective communications, the waiting time for the synchronization will be reduced.Obviously, this approach is the same as increasing the predefined population size in the more powerful machines.
The question that arises from this policy is: how much should the size of the population be decreased/increased?The answer can be obtained empirically by measuring the time for one generation on each machine and obtaining the fraction between the fastest/slowest and the other time measurements.For example, if the time of M achine 1 is double that of M achine 2 , the population size for M achine 1 should be half (or the population size for M achine 2 should be doubled).
Therefore, a communication step must be performed at the beginning of the algorithms: the root process (rank 0) is the reference, so each process with its corresponding GPU performs the evaluation of the same individual (i.e., all variables selected) and waits for the message from the root process, indicating how much it took to perform the evaluation.Then, each process computes the number of individuals as: keeping in mind that the number of individuals must be even.

Optimizing the Computation of the Population Fitness
The implementation presented by [12] had to be adapted in order to use several GPUs within the same node.To do so, several functions from the NVIDIAlibrary have to be called, as is described in [36], to set the desired GPU by each process.For each time the KNN function is called, the whole reference dataset has to be copied to the GPU memory to compute the distances.Therefore, during the evaluation of the whole population, this process has to be repeated as many times as the number of individuals with the possibility of creating an overhead time that could be avoided, since a process will not use other GPUs than the one selected and the reference dataset is the same.The implementation of the function was modified, so that the complete population was given to the mex-function, which computes the KNN neighbors.Thus, the kernels, when computing the distances just have to accumulate or not accumulate a distance in a certain dimension depending on the value of the gene in the chromosome.
However, the implementation splits the queries to fit in the GPU memory, and this depends in the dimension of the dataset, performing a "memcopy" from the host to the device for each split.Therefore, the higher the dimension of the problem is, the longer it will take to compute the distances.
In this paper, this computation has been optimized, as the dimension of the dataset defined by each chromosome is different.For example, a chromosome that encodes a selection of just two variables does not need the other d − 2 variables.Therefore, the function distance computation function is called as many times as there are individuals, but cropping the dimensionality of the input vectors on each call to match the variable selection encoded on each chromosome.This process is depicted in Figure 2 with a chromosome that encodes a solution with two variables.This cropping is easily performed in MATLAB, due to its efficient matrix operations.Therefore, the computation of the fitness for individuals with a low number of variables becomes much faster, as the KNN function does not have to split the query to iterate through all the dimensions of the dataset.
This reduction in time is much higher than the one obtained by reducing the number of calls to select the GPU and to perform the copy of the reference set to the GPU memory.

Experiments
This section shows the performance of the proposed algorithm when compared to previous approaches using small and large datasets.Afterwards, an analysis of the migration policy chosen is made.

Cluster Architecture
The cluster that was configured had the described below that were interconnected, as Figure 3 shows.

Comparison with Previous
This section will analyze the performance and behavior of the proposed algorithm.First, small datasets are compared with a previous work; afterwards, the algorithm is applied to a large real-world dataset.

Small Datasets
The algorithm is compared with the approach presented in [37], which is a previous parallel version of the algorithm implemented in a regular cluster.For the sake of a fair comparison, the stop criteria will be the same: execution time.In [37], the time limit is set to 600 s, based on studies and recommendations from the industry arguing that that is the maximum time that an operator is willing to wait to see a solution.
The same population sizes were used (50,100 and 150), and the datasets processed were: (1) The Tecator dataset [38]: The Tecator dataset aims at performing the task of predicting the fat content of a meat sample on the basis of its near-infrared absorbance spectrum.The dataset contains 215 useful instances for interpolation problems, with 100 input channels and 3 outputs, although only one is going to be used (fat content).( 2) The Anthrokids modified dataset [39]:This dataset consists of several measures to predict a child's weight.It has 1019 instances and 55 variables.
The results are shown in Table 1.As the table reflects, the performance of the proposed algorithm is not impressive in comparison with the previous algorithm.In fact, it does not outperform the previous approach.However, it is remarkable that the difference between the solutions is not too big, and in some cases, the new algorithm outperforms the previous one, demonstrating that the proposed approach is a good algorithm.The reason because the previous algorithm obtains, in general, better solutions for these datasets is because the GA is able to generate more populations, due to the pre-calculation of the distance matrix, as it was described in Section 2.
Table 1.Performance in terms of the delta test (DT) value of the cluster of GPUs (pGPU) against sequential (seq.) and parallel approaches (where the number of processes is notated as np and the number of GPUs as nGPU) .

Large Datasets
The analysis of the previous results might discourage the use of GPUs instead of using a pre-calculation of the distances; however, when the dataset starts becoming a little bit bigger, this second approach is not possible any more.The reason is the application runs out of memory, so to use a cluster of GPUs, it is not a matter of performance in time or quality results, it is a matter of being able to provide a solution.
As an example, part of the dataset provided by the Spanish Institute of Statistics (Instituto Nacional de Estadística (INE)) that contains data about marital dissolutions in Spain is used.Several problems arise from this data, and one of them is predicting the dissolution process length, which is translated into the regression problem.
The data to be used consists of 19,967 input samples of 20 variables, which is divided into training (of 15,385) test (4568) sets.The size of the training set is too big to pre-calculate the distance matrix used in previous approaches [37].
The proposed approach is executed during 600 s using the same configuration of the previous section (detailed in Section 3.1), being able to finish only one generation, with 50 individuals providing a DT value of 0.001642 using 9 variables.Due to the high memory requirements, one of the computers with the oldest GPU model was delaying the other two, because of the synchronization step in the migration.Therefore, the experiments are repeated only with two nodes instead of three.The performance increases significantly, allowing the algorithm to evolve a mean of 7.6 generations, obtaining a DT value of 0.001589 using only 4 variables.
To test the validity of the selection provided, a model (a radial basis function neural network with 15 neurons) is designed using the methodology proposed in [40] without local search optimization.The experiments are done using all the variables and using the ones selected by the algorithm, obtaining the results shown in Tables 2 and 3.The approximation errors provided by the neural networks prove how critical the necessity of performing variable selection before modeling a real world problem is.The network maintains a good generalization capability, as the differences between the test error and the training error are small.

Comparison Between Different Migration
Another experiment carried out consists of a comparison of several approaches regarding how the communication between the islands is performed.In order to isolate as much as possible the effect of the migration parameters, the seeds of the random function used on each run is the same, so that the unique element that could modify the results is the migration step.The setting for the experiments is: Regarding the send policy, the two selected options are considered, as are the most popular in the literature.The acceptance policy consists of always accepting the incoming individuals or only accepting them if they are different from the best individual in the population.In the implementation, an individual i = i 1 , i 2 , ..., i d with i k ∈ 0, 1 is considered different from another individual if the Hamming distance between them is not higher than d/2.
As the tackled problem requires huge computational effort, only the worst individual replacement policy is chosen with the aim of converging to a good solution fast.Other approaches could consider other replacement policies.

Diversity of the Population
The value of the average Hamming distance between each pair of chromosomes is chosen to be the metric that determines the diversity of the population.For each run, the evolution of this parameter is recorded and, then, the average of the evolutions is computed; these values are graphically depicted in Figure 4.It is easy to see how the population maintains diversity when the conditional acceptance criterion is applied, and the configuration, SB-AA-RW, is the one that has less diversity in the population, confirming previous results.

Accuracy and Robustness
The reduction of the diversity by not using a conditional acceptance criterion could seem a disadvantage; however, as Table 4 shows, it is totally the opposite case, because it allows the algorithm to exploit the less divergent individuals more and to achieve better performance.This table also shows that the use of the conditional criterion does not affect the final result at all.

Conclusions
In this paper, we present a new algorithm that takes advantage of new high performance computing technologies.The main novelty is the use of a cluster, where the nodes have graphical processing units to compute the fitness function efficiently; this is a cluster of a cluster of GPUs.Another important contribution is the analysis of several migration policies, showing their influence on population diversity and result robustness.The performance of the algorithm for small datasets is acceptable when compared with previous methods, plus it is able to provide good results in a reasonable time.Even when the size of the dataset becomes too large, the proposed algorithm is able to provide a good solution, in contrast to the previous approach, which is not able to provide any.

Figure 2 .
Figure 2. The procedure to crop the query set based on the chromosome in order to make the computation of the distances and the data transfer faster.

Figure 4 .Table 4 .
Figure 4. Diversity of the population using different communication SB, send best; AA, always accept; RW, replace worst; AiD, accept if different; SR, send random.

Table 2 .
Delta test values for the large-sized dataset (standard deviation in brackets).

Table 3 .
Approximation errors (Normalized Root Mean Square Error -NRMSE ) of the large dataset with and without variable selection using an RBFNN (Radial Basis Function Neural Network) with 15 neurons.