Multi-Objective Evolutionary Instance Selection for Regression Tasks

The purpose of instance selection is to reduce the data size while preserving as much useful information stored in the data as possible and detecting and removing the erroneous and redundant information. In this work, we analyze instance selection in regression tasks and apply the NSGA-II multi-objective evolutionary algorithm to direct the search for the optimal subset of the training dataset and the k-NN algorithm for evaluating the solutions during the selection process. A key advantage of the method is obtaining a pool of solutions situated on the Pareto front, where each of them is the best for certain RMSE-compression balance. We discuss different parameters of the process and their influence on the results and put special efforts to reducing the computational complexity of our approach. The experimental evaluation proves that the proposed method achieves good performance in terms of minimization of prediction error and minimization of dataset size.


Introduction
Data preprocessing is a crucial step in data mining systems. It is frequently more important than the selection of the best prediction model, as even the best model cannot obtain good results if it learns using poor quality data [1]. A part of data preprocessing is data selection, which comprises feature selection and instance selection.
The purpose of instance selection is to preserve useful information stored in the data and reject the erroneous information, while reducing the data size by selecting an optimal set of instances. This allows for accelerating the predictive model training and to obtain a lower prediction error [2].
Reducing the data size also makes it easier to analyze the properties of the data by humans, as well as allowing the assessment of the expected performance of the prediction models [3]. In other words, by instance selection, we want to "compress the information". In this work, we consider instance selection in regression problems, which is a more complex task than instance selection in classification problems [4] an much less literature exists on this topic.
Instance selection finds practical application in a range of problems, where the data size can be reduced. For example, it can be applied to the datasets considered in this study, describing real-world problems from various domains. In addition, one of the authors took part in two practical implementations in the industry. The first one was an artificial intelligence-based system for controlling steel production in electric arc process, where there was a lot of data from the previous processes (such as the amount of energy, of different chemical compounds, etc.). The other one was in the electronics industry in a system for predicting the performance of the electronic appliances (power inverters and others) where the amount of data describing the parameters and behavior of the appliances was also very large. In both cases, there were regression problems with many redundant and erroneous data and instance selection was very useful to enable efficient further processing of the data.
The first difference between instance selection for classification and regression tasks that in classification it is enough to determine the class boundaries (the black thick line in Figure 1a), and to select only the instances needed to determine the boundaries [5]. The remaining instances (in the grayed area) can be removed. However, before removing them, the noisy instances, which do not match their neighbor class, must be removed first in order not to introduce the false classification boundaries.
In the case of instance selection for regression tasks, we also need to remove the noisy instances, which do not match their neighbors (instances A and B in Figure 1b) and then we can remove the instances that are very close to some other instances in terms of input and the output space (instances C and D in Figure 1b). However, the reduction cannot be so strong as in classification problems, where we need only the class boundaries because, in regression, each point in the data space is important.
In instance selection for classification tasks, we can obtain strong data reduction and we can also obtain more balanced class distribution and thus higher entropy H calculated as [6]: where c is the number of classes in the initial dataset and p(X i ) is the proportion of instances of i-th class to all instance in the dataset.
In regression problems, we can estimate the dependent variable y as a deterministic continuous function of the independent variables x and use differential entropy H d [7]: where S is area covered by x. We discuss the connections between measures of information and loss functions with instance selection performance in Section 3.4.
(a) (b) Figure 1. Instance selection in classification (a) and in regression (b). The axes represent the attributes x1 and x2. In classification, the red circle and blue cross represent points of two different classes. In regression, the height of the vertical line represents the output value of an instance and the circle shows its location in the input space.
However, in practice, instance selection is not so simple as in Figure 1, where there are only two attributes and a few instances. We cannot say which instance needs to be rejected without taking into account which other instances are also rejected. This is because the outcome depends on the set on which the predictive models are trained and thus we must consider the set of selected instances as a whole, which makes instance selection an NP-hard problem (see Section 3.1). For that reason, we decided to base the instance selection method on evolutionary algorithms, where the advantage of this approach is that the evolutionary algorithm evaluates the prediction quality on the entire subsets of selected instances and we do not have to explicitly define the relation of an instance to their neighbors (which in regression tasks is not as simple as in classification) to decide upon its selection or rejection.
Another advantage of evolutionary-based methods is the possibility of obtaining solutions with low prediction error (which in case of regression problems is typically expressed by RMSE-root mean square error) and strong data reduction (compression) at the same time.
As the discussed solution is based on multi-objective optimization, a key advantage of it is that we obtain a pool of solutions situated on the Pareto front, where each of them is the best for certain RMSE-compression balance and we can choose one of them, as will be discussed in Section 2.2.
In binary instance selection, each vector (instance) can be either selected or rejected. In instance weighting, the instances can be assigned real value weights between 0 and 1, which reflect the instance importance for building the predictive model. Then, the model includes the contribution of particular instances in the learning process proportionally to the weights assigned to them [8]. Obviously, we always want to select the most representative instances, so that the reduced set contains as much useful information and as low noise as possible.
First, we review the problems and existing solutions in instance selection with non-evolutionary (Section 1.1) and evolutionary approaches (Section 1.2). Then, we introduce our approach, which uses the multi-objective evolutionary NSGA-II algorithm [9] with k-NN as the inner evaluation algorithm and propose several modifications and improvements (Section 2). Finally, the discussion is supported with experimental evaluations (Section 3).

Non-Evolutionary Instance Selection Algorithms
Non-evolutionary instance selection algorithms usually are based on some local properties of the dataset, as the nearest neighbors or Voronoi cells, in order to assess which instances can be removed as noisy or redundant.
The CNN (Condensed Nearest Neighbor) algorithm was the first instance selection algorithm developed by Hart [10]. The purpose of CNN is to reject these instances that do not bring any additional information into the classification process. A popular noise filter is ENN (Edited Nearest Neighbor) proposed by Wilson [11] to increase classification accuracy by noise reduction. These two algorithms are still among the most useful due to their simplicity. In later years, more instance selection algorithms have been proposed for classification tasks [5,[12][13][14][15][16][17][18].
In the literature, DROP-3 and DROP-5 are frequently considered the most effective of them [11]. There were also some works to directly use information theory for instance selection in classification tasks. Son [19] proposed a method, where the data set was segmented into several partitions. Then, each partition was divided continuously based on entropy, until all partitions are pure or no further partitioning can be done. Then one can search for the representative instance in each partition. Kajdanowicz [20] introduced a method for comparison and selection of the training set using entropy-based distance.
Recently, adaptations of instance selection methods for multi-output classification problems were proposed [21]. A taxonomy and comparative study of instance selection methods for classification can be found in [2,22].
It is obvious that fewer papers addressed the problem of instance selection in regression tasks and one of the first approaches was presented by Zhang [23]. Since there are no classes in regression, several approaches to replace the "class" with another concept were used.
In [24,25], the class concept was replaced by some threshold distance. If the distance in the input space between two instances is greater than the threshold, they can be treated by the instance selection algorithm in the same way as different class instances in classification tasks. Another option is to perform discretization and convert the regression task to a multiple class classification task and then use instance selection algorithms for classification problems [26]. In [27], a Class Conditional Instance Selection for Regression (CCISR) was proposed, which was derived from the CCIS for classification [28]. CCIS created two graphs: one for the nearest neighbors of the same class as a given instance and another one for other class instances. A scoring function based on the distances in graphs was applied to evaluate the instances.
Guillen et al. [29] proposed the use of mutual information for instance selection in time series prediction. In the first step, the nearest neighbors of a given point were determined and then the mutual information between that point and each of its neighbors was calculated. If the loss of mutual information with respect to its neighbors was similar to the instances near the examined instance, this instance was included in the selected dataset. The authors of [30,31] extended this idea to instance selection in time series prediction by calculating the mutual information between every instance from the training set and the currently evaluated instance, then to order the training set in descending order by the distances and selected the predefined number of points.
In [32], an instance selection method for regression was based on recursive data partitioning. The algorithm started with partitioning the input space using the k-means clustering. If the ratio of the standard deviation to the mean of the group was less than a threshold, the element closest to the mean of each cluster was marked as a representative. Otherwise, the algorithm continued to split the leaf recursively.
In [25], an adaptation of DROP2 and DROP3 to regression tasks was presented and two solutions were proposed: to compare the accumulated error that occurs when an instance is selected and when it is rejected, and to use the already mentioned concept of the distance threshold. Since both ideas were used to adapt DROP2 and DROP3 to regression, four resultant algorithms were tested. DROP3-RT (Regression-Threshold) definitely worked best of the four methods and thus we use it for comparison in the experiments.
In [4], ensembles of instance selection methods for regression tasks were used. The ensembles consisted of several members of the same instance selection algorithm and by implementing bagging operated on different subsets of the original training set. A final decision was made by voting with a threshold. If an instance obtained more votes than the threshold, it was finally selected. Using a different threshold, a Pareto front of solutions could be obtained, i.e., no solution exists, which can improve both of the objectives more than any solution on the front. Results of this work are also used in the experimental comparison.
The problem with non-evolutionary instance selection algorithms is that in most cases they are based on certain assumptions and observations made by their authors about how the data is typically distributed. For example, the ENN algorithm removes the instances miss-classified by k-NN, considering them noisy. This is not always true, as they may also be boundary instances and indeed ENN has the tendency to smooth the class boundaries. There are also other assumptions in other algorithms that are more frequently true than not, but in some cases they are wrong.

Evolutionary Instance Selection Algorithms
Evolutionary instance selection algorithms do not make any assumptions about the dataset properties but verify iteratively large number of different subsets in an intelligent way to minimize the search space. This can result in better solutions or better (lower) Pareto front in the case of multiple solutions. On the other hand, this is usually achieved at the expense of much higher computational cost. For that reason, in this work, we pay special attention to limit the computational cost as far as possible, which is discussed in Section 2.3.
Tolvi [33] used genetic algorithms for outlier detection and variable selection in linear regression models, performing both operations simultaneously. However, he evaluated the model on two very small datasets (35 instances with two features and 21 instances with three features); nevertheless, in his experiments, evolutionary instance selection algorithms outperformed the non-evolutionary ones.
Shuning [34] concluded that genetic algorithm based instance selection for classification works best for low entropy datasets and with higher entropy, there will be less benefit from instance selection. In [35], an algorithm called Cooperative Coevolutionary Instance Selection (CCIS) was presented. The method used two populations evaluated cooperatively. The training set was divided into n approximately equal parts and each part was assigned to a sub-population. Each individual of a sub-population encoded a subset of training instances. Every sub-population was evolved using a standard genetic algorithm. The second population consisted of combinations of instance sets.
Tsaia [36] considered jointly instance and feature selection in an evolutionary approach. In addition, in [37], an evolutionary algorithm was presented for instance and feature selection and particular problems were assigned to several populations to handle each one separately and each population was optimizing a part of the problem. Then, the authors tried to join the obtained solutions in an attempt to obtain better results. Czarnowski [38,39] introduced an instance selection method that incorporates several ideas. First clustering was performed on the data and then, within the clusters, the selection was executed by the team of agents. The agents cooperated by sharing a population of solutions and refined the solutions using local search.
We found only two works describing the application of multi-objective evolutionary algorithms to instance selection, both to classification problems and both dated for 2017.
In [40], the MOEA/D algorithm was used in a coevolutionary approach integrating instance selection and generating the hyper-parameters for training an SVM. The two criteria used in that optimization were the reduction of the training set size and the performance with a given set of an SVM's hyper-parameters. The average results over some classification datasets were provided.
In [41], the authors also considered the over-fitting problem. At each iteration of the genetic algorithm, the training and validation partitions were updated in order to prevent the prototypes from over-fitting a single validation data set. Each time the partitions were updated, all of the solutions in the Pareto set were re-evaluated. However, only the results for 1-NN averaged over several classification problems was reported, similarly, as in the previous work, so it is not possible to compare the results with our method.

Multi-Objective Evolutionary Instance Selection for Regression (MEISR)
This section introduces our solution to instance selection in regression tasks called MEISR. MEISR uses a multi-objective evolutionary algorithm, which in the current implementation is NSGA-II [9], to direct the search for the optimal reduced training sets and the k-NN algorithm to determine the error on the training sets. First, particular aspects are discussed and finally the pseudo-code of the whole algorithm is presented.

Basic Concepts
As it was stated in the introduction, the two objectives of instance selection process are minimization of the number of instances in the reduced training set S and minimization of the error obtained on the test set by predictive models trained on the reduced training set S. The first objective is known as minimization of retention or maximization of reduction or compression. The second one in our case is expressed with root mean square error (RMSE) because RMSE is the standard and commonly used measure of regressor performance. Because the words "RMSE", "reduction" and "retention" all begin with "r", we will denote the various RMSE values on the test set with symbols starting with "r" (r0, r1, r2, r3) and we will denote the first objective with symbols starting with "c" (c0, c1, c2, c3), which will stand for retention or 1-compression: where N is the number of all instances in the training set T, N sel is the number of selected instances from T, which create the selected set S. We use the standard definitions of root mean square error on the training set (RMSE trn ) and on the test set (RMSE tst ): where y predicted_i is the predicted and y i is the actual value of the output variable for the i-th instance (of the training and test set respectively) and where N is the number of all instances in the training and N tst in the test set. To prevent over-fitting, either validation set or the other stop criterion can be used, as discussed in Section 2.6. We use a multi-objective evolutionary instance selection method based on the NSGA-II algorithm [9] to maximize compression and minimize RMSE and to obtain a set of solutions on the Pareto front (these are such solutions in which no other solution exists, which can improve both of the objectives more than any solution on the front-see Figure 2).
We chose NSGA-II because it is widely used and, despite the existence of newer algorithms, NSGA-II still gives the best or one of the best results for two-objective problems and its modification NSGA-III for problems with more than two objectives [42]. As the description of NSGA-II can be easily found in the literature [9], we are not going do describe it in detail, but rather discuss the issues specific to the proposed instance selection method, especially that the main focus of our work is on instance selection and not on genetic or evolutionary algorithms.
The NSGA-II algorithm was adjusted to direct the search for solutions for the instance selection task by setting the proper objectives, by using the proper encoding, and by implementing the proposed initialization schemes and mutation and crossover operators, as will be discussed in subsequent sections.
First, an initial population of P individuals is created, where each individual represents one reduced training set S. The length of the chromosome equals the number of instances in the original training set T. At each position of the chromosome, a value w represents the weight of a single instance. If w = 0, then the instance is rejected. If w > 0, then the instance is included in the prediction model (regressor) learning with the weight w. In the simplest binary case, we allow only for two different values of w: 0 and 1. Initially, all weights w have random values generated by the initialization method presented two subsections later.
Then, the evolutionary-based instance selection process starts and by adjusting the weights tries to find a group of the best solutions (reduced training sets S), located on the Pareto front. The quality of the solutions during the process is assessed by the two criteria already mentioned: retention (the lower the better) and the prediction error of the learning model (also the lower the better). A detailed discussion of the objective criteria is provided in the next subsection.

The Objectives
The first objective is a minimization of the number of instances of the training set (i.e., minimization of retention or maximization of compression).
The second objective is a minimization of RMSE. It must be distinguished between the final objective, which is a minimization of RMSE on the test set (RMSE tst ) and the objective used by the instance selection process, which is a minimization of RMSE on the training set (RMSE trn ). The final objective (RMSE tst ) cannot be minimized directly because the test set is not available while selecting the instances.
During the instance selection process RMSE trn is internally determined with the leave-one-out procedure, always using the k-NN algorithm as the inner prediction model (the regressor inside the instance selection process). In the final evaluation, RMSE tst can be determined using any prediction model trained on the reduced training set S. In the experimental section, the final prediction models we used were k-NN with different k parameters and MLP neural networks and the RMSE tst obtained on the test sets is reported in the experimental results.
Using a single objective genetic algorithm, a fitness function that incorporates all criteria must be defined. The typical definition of the fitness function for instance selection is: where α is a coefficient indicating the expected balance between the objectives. However, multi-objective solutions do not require the determination of the coefficient α; instead, they minimize two objectives: retention and RMSE trn . Moreover, the obtained results consist of a group of non-nominated individuals situated on the Pareto front (i.e., the reduced training sets S) with different trade-offs between objectives. In our study, the following encoding of individuals is used: each individual (each selected data set S) is encoded as a vector w = {w 1 , ..., w N }, where N stands for the size of the vector (which equals the number of instances in the original training set T). The vector w can only take specific values assigned from set {0, σ, 2σ, 3σ, ..., 1}, where σ depends on the algorithm parameter numLevels and it is calculated as follows: σ = 1/ (numLevels − 1). In case of the numLevels = 2 vector, w can have only binary values, in case of numLevels = 5 vector, w can take five values ({0.00, 0.25, 0.50, 0.75, 1.00}), etc. If numLevels = 0 a vector, w can have any real number values assigned. Such a process was aimed at increasing the readability of weights and the ability to test different variants of simulations.
In the experimental section, we present the results for binary instance selection (numLevels = 2) and real-value instance weighting (numLevels = 0) as these are two most characteristic encodings. Moreover, implementing instance weighting in regression problems frequently allows for some improvement in prediction quality for noisy data [8,43]. Thus, one of the aims of this work is to examine in what conditions implementing real value weights w i is beneficial, in spite of this making the algorithm more complex and interpretation of the results more difficult.
During the instance selection process, the following objectives are used by NSGA-II for binary instance selection: RMSE trn (w) = knn(S(w), T) while for real-value instance weighting the following objectives are directly used by NSGA-II: where RMSE trn (w) is obtained with the k-NN algorithm while predicting output of all instances from the original training set T, using the reduced (selected) training set S given by the weight vector w (each time without the instance currently being predicted). ret(w) in binary instance selection is the sum of instance weights (which equals in this case the number of selected instances), while, in instance weighting, it is a weighted sum of the instance weights w i and the number of instances with non-zero weights nred(w i ). nred(w i ) returns 1 if the instance is selected and 0 if it is rejected. β is a parameter that balances the sum of the instance weights and the sum of the not rejected instances. The first term, summing the instance weights, is needed to allow crossover and mutation operations to gradually reduce some of the instance weights w i . In case of instance weighting nred(w) is calculated as follows: where γ is a parameter that defines a weight, below which an instance is rejected (which is experimentally set as 0.01; however, the exact value is not so crucial, as the algorithm adjusts its behavior to that value by modifying the weights proportionally to γ). Such an approach allows the optimization algorithm to minimize the instance weights and, consequently, for a reduction of instances. Instances with weights lower than γ get rejected and are not taken into account by the k-NN algorithm, while instances with weights greater than γ are taken into account proportionally to their weights (as well by the inner evaluation as by the final prediction model), as will be discussed in the next section. However, when we report the retention in the experimental results, we take into account the only the number of all instances with non-zero weights.  An example of the obtained Pareto front and the four points of interest are shown in Figure 2. In orange: results on the training set, in green: the corresponding results on the test set. All the points obtained on the Pareto fronts for each dataset are presented numerically and graphically in the supplementary resources. Due to limited space in the paper, we present only four most characteristic points of interest (r0, c0), (r1, c1), (r2, c2) and also (r3, c3) the in cases where r3 < r1. Figure 2 is used to explain the results obtained in one fold of the 10-fold cross-validation. The corresponding values reported in the table are the average values over the whole 10-fold cross-validation: • r0-baseline RMSE tst obtained without instance selection. • c0-c0 is always 1, which means there is no compression on the whole dataset, for that reason the value does not occur in any table. • r1-RMSE tst obtained with instance selection for the point (c1, r1) in Figure 2; this is the RMSE tst obtained on the test set, while training the model on this reduced training set S, which is the point on the Pareto front with the lowest RMSE trn on the training set and the weakest compression. • c1-retention rate (1-compression) of the point (c1, r1). • r2-RMSE tst obtained with instance selection for the point (c2, r2) in Figure 2; this is the RMSE obtained on the test set, while training the model on this reduced training set, which was represented by the closest obtained point to the point (r0 trn , retention = 0); the brown diagonal line shows this distance d. This point was selected as a representative point because further increasing compression usually leads to sudden increase of RMSE trn and RMSE tst , making the area to the left of this point practically unusable. • c2-retention rate of the point (c2,r2). • r3, c3-RMSE tst and retention rate obtained with and additional run of the instance selection process with the alternative initialization (90% probability of each instance being included in the initial population). It was aimed to obtain RMSE tst lower than r1 and r0. However, this was useful only in a few cases; in other cases, r3 was equal r0 or to r1, which meant that no further decrease of RMSE below r0 or r1 (whichever was lower) was possible to obtain. One of the important reasons that it was not always possible to obtain the lowest RMSE with the main instance selection process at the point (c1, r1) was that the Pareto front was extending gradually during the optimization and in some cases. Before it would reach the point corresponding to (c3, r3), the test error for the same compression can already start to increase (we minimize RMSE trn and report the RMSE tst ), so we must stop the process earlier (see Section 2.6).
Although the target users of our method are scientists and engineers, who should understand their process and be able to select the appropriate solution from the Pareto front, we can suggest the solution with the lowest RMSE trn for predictive model learning and the solution marked by the point (c2, r2) in Figure 2 for analyzing the properties of the data. To enable a better choice, it may also be a good idea to display the front graphically (as can be done in the software available in the supplementary resources) so the user can quickly assess all the solutions.

k-NN as the Inner Evaluation Algorithm
The rationale behind choosing k-NN as the inner evaluation algorithm is the speed of this approach. This is because the full k-NN algorithm has to be performed only once before the optimization starts. In the case of other prediction algorithms, this would be either impossible or much more complex and thus less efficient. Let us assume that there are 96 individuals in the population and that the optimization requires 30 epochs. In this case, the value of the fitness function must be calculated 2880 times. Training any prediction model 2880 times would be computationally very costly.
Although, our previous experiments [24] showed that the best results in terms of RMSE-compression balance can usually be obtained if the inner evaluation model is the same algorithm as the final predictor, in this work, we sacrifice that small improvement in order to shorten the optimization process usually from two to three orders of magnitude. However, as we will show, when the final predictor is an MLP neural network, for the inner evaluation, we use k-NN with parameters, which makes its prediction as close to the prediction of the neural network as possible. In this way, we obtain better results, while still keeping the process time short.
In the case of the k-NN algorithm, we calculate the distance matrix between each pair of instances in the training set. Then, we create one two-dimensional array D i for each instance X i . The first dimension is N-the number of instances in the training set and the second dimension is three. The three values stored in the array are: dist(X i , X j ), j and y j . dist(X i , X j ) is the distances between the current instance X i and each other instance X j , which in the simplest case is a simple Euclidean distance and in a general case can be expressed by Equation (11). The second value j is the ordinal number of each instance in the dataset T. The third value is the output value y j of the j-th instance.
Then, we sort the arrays D i increasingly by dist(X i , X j ). At the prediction step, we only go through the beginning of the array, read the instance number j and check if this instance is selected, if not, then we go to the next instance, as long as we find k selected instances. Then, for the k nearest selected instances, we read their weights w j from the chromosome, their output value y j and predict this instance output value y predicted_i as the weighted average of the k outputs (which is a simple average in case of binary instance selection): where y j is the output value of the j-th neighbor of an instance X i and w j is the weight expressing the j-th neighbor importance (it should not be confused with the weight wd j related to the distance between the instances X i and X j as in the standard weighted k-NN algorithm). Only the neighbors with w j > γ are considered. In binary instance selection, w j = 1 always. To prevent an instance from being considered by the algorithm a neighbor of itself, we set the distance to the instance itself to a very large number as the maximum value of the Double type (1.79 × 10 308 ). The prediction step is extremely fast and only these steps are performed each time the RMSE trn is calculated. The distance between two instances X i and X j is calculated as: where A is the number of features (attributes), wa m is the weight of the m-th attribute and x im and x jm are the values of the m-th attribute of i-th and j-th instance respectively. In the simplest case, all the attribute weights equal 1. However, there are cases where it is useful to use attribute weighting and assign different weights to different attributes, e.g., as the absolute values of the correlation coefficient between each attribute and the output. For example, attribute weighting is useful in improving instance selection results if the final predictor is also k-NN using the attribute weights or when the final predictor is some other algorithm, which internally performs attribute weighting, e.g., an MLP neural network [44]. v is the exponent in Minkovsky distance measure; for v = 2, the measure becomes the Euclidean distance. There is usually no need to raise the expression in the bracket to the power 1/v (calculate a square root for v = 2) because we need only to find the nearest neighbors and not the distance to them (unless we use weighted k-NN, where the closest neighbors have a greater influence on the prediction).
If the final prediction model is i.e., a neural network, then, during the learning, an error the network makes as a response to presenting a given instance is multiplied by the instance weight: where Error is the total error, used in the network learning algorithms to adjust the network weights and f(.) is an error measure, which, for the frequently used mean square error measure, gives:

Population Initialization
The main goal of population initialization methods is to provide optimal coverage of the search space. This can shorten the evolutionary optimization and enable search for the solutions either more uniformly spread in the solution space or focused in some areas, depending on the needs.
Typical initialization assigns randomly generated values (weights) w i to each position i at the chromosome (to each instance) of the individuals (training datasets). For this purpose, different methods can be used (e.g., Pseudo-Random Generators [45], Quasi-Random Generators [46], population dependent methods [47]). In addition, methods based on value transformations [48], a priori knowledge about the problem or clustering can be used [49]. According to some authors, initialization should be adjusted to a specified problem [50] and it is also the case in instance selection tasks.
One of the best initialization methods that relies on population is Adaptive Randomness [47]. In this method, for each individual in population several candidates are created. Only the candidate for which that smallest distance to the rest of the population is the larger than for all other candidates is added to the population. The advantage of this method is that the created candidates do not have to be evaluated, only the distance to the already assigned individuals is calculated. Population-based methods do not depend on random number generators and number transformation methods. Due to this, they can be combined with other types of initialization methods [51]. In this work, such combinations have been tested with an idea to produce different initial values (e.g., initialize smaller values and examine the effect of such initialization on the obtained results).
Additionally, several methods that aim at the differentiation of initialized values (to achieve different degrees of instance selection reduction) are introduced: • Power transformation. In this method, the randomly generated number real number rnd (0 < rnd < 1) is raised to a certain power v as follows: where higher v results in lower values (because 0 > rnd < 1) obtained after transformation.

•
Fill transformation. This binary transformation method uses a predefined probability f 1 that determines balance between 0 and 1 as a result of initialization: • Spread initialization. The idea behind this method is to differentiate the individuals in population in terms of the different probabilities of occurrence of small and large values: where f 2 = 0.1 + h * n/N is a value dependent on particular individual in the population, n stands for index of the individual in the population, h is a parameter.
In case of binary values, the numbers generated by initialization methods are rounded to the closest acceptable value. Based on the preliminary experiments, which are available in the supplementary resources, we decided to use the adaptive randomness combined with power and fill transformations.

Genetic Operators: Crossover and Mutation
We used the multi-point multi-parent crossover. The number of instances in the original training set T (which equals the chromosome length) was N. The M split points of the chromosome were randomly selected for each child. Then, a parent providing the genetic material for each segment was randomly chosen with the probability proportional to its fitness (thus the same parent could be chosen for more than one segment). The number M influences the speed of the algorithm conversion and the experiments showed that the optimal number M for the fastest conversion can be roughly expressed as: round(n/10) for n ≤ 1000, 100 + round((n − 1000)/100) for n > 1000. (17) However, smaller numbers M than given by Equation (17) also provide similar results in the instance selection, but the optimization takes longer. P children were generated in this way. Then, the P children and P parents were merged together into one group and P individuals with the highest fitness value from the merged group were promoted into the next iteration. (It can be said that the probability of crossover was 100% and because of this way of promoting individuals to the next iteration, it was the only optimal probability).
A broad range of mutation probabilities could be used without significantly influencing the results. We used the probability of mutation of 0.1% per each chromosome position and the following mutation operation: where rnd stands for randomly generated real number from range 0, 1 , RND(0, numLevels − 1) stands for random generated integer number from set {0, ..., numLevels − 1}, m range stands for mutation range parameter (set experimentally as 0.2).

Training Time Minimization
In genetic algorithms, larger populations require fewer epochs of the algorithm to converge. However, some optimal population size exists from the viewpoint of the computational cost of the whole process. The cost can be approximated by the number of fitness function evaluations. According to our tests, this optimal population size was between 60 and 120 individuals. We decided to use 96 individuals because 96 was a multiple of the number of CPU cores in our server, so it scaled well in parallel computation. Larger populations increase computational costs but do not have any other negative impact on the process. However, if the population would be too small, it may limit the diversity of the individuals and prevent the algorithm from finding the best solutions.
Longer chromosomes, which represent lager datasets, require more epochs of the evolutionary algorithm to find the desired solutions. If too few epochs are used, under-fitting can occur and the solutions can be of poor quality, where the model fits neither the training data nor the test data enough well. On the other hand, using too many epochs can cause over-fitting-the model fits the training data too well and thus fails to fit the test data enough well (does not have generalization capabilities), which prevents good performance on the test data [52,53]. When over-fitting occurs, the error on the training set continuously decreases with further model learning, and the error on the test set starts increasing.
The biggest role in creating over-fitting can be attributed to the model getting fitted to the outstanding, noisy instances, when the model learning runs too long and thus the model gets too complex [52]. In our previous works [44,54], we studied various methods of preventing over-fitting in neural network learning. Noise removal by the ENN instance selection algorithm [11] and its extensions proved to be among the best methods. In the experimental section we compare the MEISR method to the extensions of ENN.
To prevent over-fitting in the MEISR algorithm, we used a standard early stopping approach [53]. By watching the error on a validation test (which is a part of the training set not used for model training but for online evaluation), the process can be stopped before the error starts increasing. In the thousands of experiments, we determined that, for the MEISR algorithm with 96 individuals in the population, the safe number of epochs, which always prevented over-fitting and which was high enough to train the model properly can be expressed as: where N is the number of instances in the training set T, e 0 = 8 for binary instance selection and e 0 = 24 for real value instance weighting. For example, this gives 20 epochs for 300 instances and 34 epochs for 30,000 instances for binary instance selection. Thus, this can be also used as an early stopping criterion, especially that the process proved highly repeatable. This criterion has an additional advantage that it allows using all available instances for the original training set and thus achieving the best possible selected subsets. On the other hand, controlling the error on a validation set can allow for more training epochs and thus better model adjustment. Based on the experiments, we can conclude that the stopping criterion given by Equation (19) more frequently allowed for better results, especially for smaller datasets (where the difference between the validation and test set was high) and also the process was frequently faster.

Pseudo-Code of the MEISR Algorithm
The pseudo-code in Algorithm 1 shows the instance selection process. NSGA-II uses the basic operations of the standard genetic algorithm: crossover and mutation, but, additionally, it utilizes a mechanism to generate a wide Pareto front (the solutions situated on the Pareto front are called non-dominated solutions) by calculating the so called crowding distance [9] between the solutions on the fronts (line 19) and uses it to favor the solutions that are far apart from other ones (line 20). for p = 1 to P do 9: for m = 1 to M do 10: parent(m) = select_parent(P) 11: crossover_point(m) = rnd(N)

Experimental Evaluation
This section presents the experimental evaluation of the MEISR algorithm. The experimental process is presented in Figure 3. To provide comprehensive results, the algorithm is evaluated with prediction models (regressors) that belong to three different groups of models, regarding the response to instance selection (1-NN, k-NN, MLP neural network). Finally, a comparison with several other instance selection methods is presented. Most of the analysis is based on the relative RMSE tst (a ratio of RMSE after instance selection to RMSE before instance selection) because our focus here is not to find the best predictive model, but to evaluate how much the performance of the given model will differ after applying instance selection. However, for a reference, we also present the absolute values of RMSE. The dataset properties are presented in Table A1. The experimental results are presented in Tables A2-A14, explained in Figure 2 and its description and summarized in Figure 4.

Data Sets
In case of instance selection, we never know the best solution because it is an NP-hard problem, where trying the number of all possible combinations from sets of thousands of instances is out of accessible computational resources (the number of 5000 combinations from 10,000 instances is about 10 3010 and even the number of 500 combinations from 1000 instances is about 10 300 ). Thus, we cannot use a dataset with the best results known and see how much we approached this. To ensure reliable and unbiased results, we used the standard benchmark datasets from the KEEL Repository [55], as shown in Table A1. The output values were standardized in the experiments to enable us to evaluate and compare the results better.

Experimental Setup
The experimental setup is shown in Figure 3. We conducted the experiments using our own software written in C# language. Very detailed experimental results and the software source code can be found in the supplementary resources at kordos.com/entropy2018. The interested reader can find much more information there than can fit into this paper and replicate the experiments using our software. The experimental process is presented in Figure 3.  Different predictive models display different sensitivity to instance selection. To capture a broad range of model responses to instance selection and at the same time to keep the paper length within a reasonable limit, we carefully chose three models, where each of them is a representative of one model group regarding the sensitivity to instance selection.

10-fold crossvalidation
The 1-NN algorithm is very sensitive to a change of single neighbor, as the prediction is based on that only neighbor and for this reason it is strongly influenced by instance selection. K-NN with higher k is less sensitive to a single neighbor change and thus the instance selection influences the RMSE tst to a lower degree than in 1-NN. The distance measures used for the k-NN algorithm are described in details in Section 2.3 and the optimal k values in Table A1.
There is also a group of predictive models, which base the prediction results (function approximation) on a broad neighborhood and thus their RMSE tst is much less dependent on instance selection; however, instance selection strongly accelerates the learning process. We chose an MLP neural network as a representative of this group, as it is one of the most popular models. We used a network with a typical structure for regression problems: one hidden layer with hyperbolic tangent transfer functions and one neuron in the output layer with the linear transfer function. The hidden layer consisted of six neurons if the number of attributes was below 12 and 12 neurons otherwise and was trained with the VSS algorithm [56] for 15 iterations (as this algorithm does not require more iterations). We also trained it with the well-known Rprop algorithm [57] for 60 epochs and the obtained results were almost the same as for VSS (the differences were statistically insignificant with t-test and Wilcoxon test), thus we do not report them here.
As the best results can be obtained if the evaluation model within instance selection is the same as the final predictive model [24], we used the same models at both positions, with the exception of when the final model was an MLP neural network. In this case, the evaluation model within instance selection was k-NN with optimal k because of the speed of the solution and the fact that an MLP network response to instance selection is closer to the response of k-NN with optimal k than to that of 1-NN.
We were able to find in the literature two papers suitable for comparison ( [4,25], which have been presented in Section 1.2), which considered instance selection for regression problems and presented the results on a set of several datasets, reporting the obtained compression and RMSE or r 2 correlation.
In [4], the Pareto front was used. Since the experiments were conducted on most of the same datasets from the Keel Repository using the optimal k in k-NN, we used the detailed results from the online resources to that paper and performed the comparison. There were eight methods in this paper and we included in comparisons the four best of them, namely: threshold-based CNN ensemble (TE-C), threshold-based ENN ensemble (TE-E), discretization-based CNN ensemble (DE-C), discretization-based ENN ensemble (DE-E) and compared with our MEISR method.
In [25], the results were presented only for a single point with 8-NN for each dataset. We obtained these from the author's detailed experimental results, which were also performed on most of the same datasets from the Keel Repository. Thus, we conducted the experiments with MEISR using 8-NN for a comparison with their results and measured the output additionally in r 2 correlation because that measure was used in the DROP3-RT method evaluation. Four methods were presented in this paper and DROP3-RT was the best one, so we included only DROP3-RT in the comparison with MEISR.
All the tests were performed using two testing procedures. The first one was 50% holdout, where randomly chosen 50% of instances were used for training and the remaining instances for testing (in case of odd instance number, the last instance was randomly assigned to one of the sets). The tests were performed 10 times with different random instances chosen for the test and training set each time. The average results over the 10 tests are reported. The reason of repeating this procedure 10 times is based on the recommendations that an experimental design should provide a sufficiently large number of measurements of the algorithm performance and, based on many analyses, 10 experimental measurements is the recommended standard [58,59].
The second procedure was a standard 10-fold cross-validation [59]. In this case, the dataset was first randomly divided into 10 parts with an equal number of instances (or almost equal it the instances cannot be divided equally into 10 parts). Then, the 10 measurements were performed; each of the 10 times a different part of the data was selected as a test set and the remaining nine parts as a training set.
In both cases, we reported the final goals of instance selection: the average RMSE over the 10 test sets and average retention over the 10 training sets.
The two most widely used statistical tests for determining if the difference between the results of two models over various datasets is non-random are the paired t-test and Wilcoxon signed-ranks test. The t-test assumes that the differences between the two compared random variables are distributed normally and is also affected by outliers which may decrease its power by increasing the estimated standard error. The Wilcoxon signed-ranks test does not assume normal distributions and is less influenced by outliers. However, when the assumptions of the paired t-test are valid, it is more powerful than the Wilcoxon test [60]. We used both of them and, in each case, both of them equally indicated the significance of the difference for the standard p-value of 0.05.
The standard deviations of the RMSE and retention in the experiments can be found in the online supplementary resources. We do not report them here to save the space because they are not used by the statistical tests, either by any other analysis.

Experimental Results and Discussion
The experimental results are summarized in Figure 4 (the figure presents the results in 10-fold cross-validation, but the results in 50% holdout were very similar). The numerical results for both testing procedures 50% holdout and 10-fold cross-validation are placed in the tables, where Table A2 summarizes all the results. The reduction (compression) is obviously independent of the final prediction model because the model is not used during the instance selection. However, RMSEtst depends on the final prediction model.
The average influence of instance selection on the results in both testing procedures: 50% holdout and 10-fold cross-validation was very similar, in spite that the absolute RMSE values were higher in 50% holdout due to a smaller training set size. The differences were usually below 1% and statistically insignificant according to Wilcoxon test. There were bigger differences between particular datasets, especially between the smallest ones, but over all datasets the average change of RMSE was comparable (r1/r0, r2/r0, r min /r0). Thus, it can be said that the MEISR algorithm performed equally well in both testing procedures. There were only bigger differences in the retention at the point c3-on average, the retention was higher by 11% in 50% holdout. This can be explained by the fact that in 50% holdout there are fewer instances in the original training set (50% vs. 90%), and as at the point c2 the instances are sparse and higher percentage of them must remain to allow sufficient instance density to train the model. The differences in c(r min ) can be bigger because, at that point, the Pareto Front is very flat and very little change in RMSE causes a much bigger change in retention.
The biggest improvement in terms of RMSE tst was observed for 1-NN as the final model (Tables A3 and A4), where the average RMSE tst decrease was about 3.5% for average retention 62% and 8.6% for average retention 76%.
In case of k-kNN with optimal k (Tables A5 and A6), on average, lower RMSE tst was obtained on the original uncompressed data than at point c1. However, at the point c3 with an average retention rate 85% we were able to reduce RMSE tst on average by 1.5%.
When the final prediction model was an MLP neural network, on average, the RMSEtst decreased by 0.8% at c1 and by 2.5% at rmin (Tables A7 and A8). At the point c2 (with the strongest compression), the increase of the RMSEtst was about three times lower than for 1-NN and k-NN. Thus, the conclusion is the MLP neural network is much less sensitive to instance selection. However, the unquestionable benefit of instance selection, in this case, was a reduction of data sizes and the shortening of the network learning process and thus giving the chance to try many different network configurations in a limited time. Nevertheless, as we have already mentioned, further RMSEtst decrease could be improved by using the MLP network also as the evaluation algorithm on the training set during the evolutionary optimization at the cost of much higher computational complexity.
The real-value instance weighting (Tables A9 and A10) gave better results (lower Pareto front) only in two areas: for very high compression and for noisy datasets, while, in all the other cases, binary instance selection was better.
The MEISR method outperformed the four ensemble based methods: threshold-based CNN ensemble (TE-C), threshold-based ENN ensemble (TE-E), discretization-based CNN ensemble (DE-C), discretization-based ENN ensemble (DE-E) for the two retention values of 0.5 and 0.25, for which we run the tests (see Table A11 and A12).
As we had to chose a single point from the Pareto front for comparison with the DROP3-RT method, in a case that we could not find a point with lower 1-r 2 and stronger compression, we decided to always use a point with stronger compression, even if the 1-r 2 (and RMSE) would be higher (see the results in Table A13 and A14). machineCPU r0 (unified for all methods) r1/r0|c1 -1-NN r2/r0|c2 -1-NN r3/r0|c3 -1-NN r1/r0|c1 -k-NN (optimal k) r2/r0|c2 -k-NN (optimal k) r3/r0|c3 -k-NN (optimal k) r1/r0|c1 -MLP r2/r0|c2 -MLP r3/r0|c3 -MLP r1/r0|c1 -real value inst.  Comparing with other instance selection methods, the MEISR method allowed for obtaining significantly better results than all the other methods. Only for the largest datasets, the RMSE obtained with DROP3-RT was in several cases lower, but the compression of MEISR was stronger. The statistical tests can be found in Table A15.

Information Distribution, Loss Functions and Error Reduction
As mentioned in the introduction, the entropy used as a measure of information can increase after instance selection in classification tasks and some analogy exists for instance selection for regression problems. Unlike in classification problems, differential entropy used for continuous data (Equation (2)) can also be negative, which means that the information contained in the data is small (which can be thought of as if 2 H is small, then H is negative) [7]. However, unlike only decision boundaries in classification, in regression, each point of the data space matters. For that reason, to assess the instance selection performance in regression, a measure that can show the detail differences between closely located points of the data space will be preferred. Namely, we need to know how well one point can be represented by another closely located point.
Several such measures can be used. The first one is cross entropy, which identifies the difference between two probability distributions P and Q. It measures the average number of bits needed to identify a point from the set P, if a coding scheme from another set Q is used. Cross-entropy loss increases when the predicted class frequently differs from the actual label. For discrete data, cross entropy can be written as: For continuous data, cross entropy can be defined per analogy as However, unlike in classification, in regression problems, we do not usually know the probability distributions. As cross entropy is used as a loss function in machine learning, other loss functions can also be useful to identify the difference between the two distributions. If the difference is high, we can assume that the data contains much noise. In addition, indeed the experimental evaluations confirmed this assumption, as in this case instance selection allowed for significantly decreasing the RMSE tst of the final predictor.
In regression tasks, the most common loss function is root mean square error (RMSE). Since this measure was used in the optimization, we will also use it to show the dependencies. The loss functions r0 for 1-NN and the relative RMSE tst (r1/r0) obtained for retention = c1 are shown in Figure 5, where r0 is the RMSE tst without instance selection and r1 is the RMSE tst obtained for compression c1. The Pearson correlation coefficients between these variables were −0.782 for 1-NN and −0.728 for k-NN with optimal k.
We also observed that the optimal k on the selected subset was frequently different than before instance selection, and it tended to converge to the range of 5 to 7. That is, if the optimal k was 2 before instance selection it may increase after and if it was 11 before it is more likely to decrease. It can be explained, as after the selection fewer instances remained in the dataset, so the distances between them were bigger and frequently the previous closest neighbor of the examined instance no longer existed and thus it had to be replaced by the average of some further still existing instances. On the other hand, a high value of the optimal k is characteristic for noisy datasets, as an average value of several neighbors is needed to mask the noise. The correlation is shown in Figure 5. Instance selection removes the outlier and noisy instances, thus no longer so many neighbors are required to mask the detrimental effect of noise. Therefore, in the cases, where the optimal k was above 11, we used k = 11 because otherwise the optimal k would decrease anyway during the instance selection process and starting the optimization with k = 11 allowed for obtaining lower RMSE trn and RMSE tst on the selected subset.
The dependency between r0 for 1-NN and the k value that was optimal on the original datasets (orgK in Table A1) characterized by correlation coefficient 0.844 is also shown in Figure 5.

Computational Complexity
In this case, contrary to popular belief, the evolutionary algorithm based solution does not have to be more computationally expensive than the non-evolutionary ones.
The MEISR instance selection process can be decomposed into two steps: 1.
The first step-calculating the distance matrices (see Section 2.3) has the complexity O(n 2 ).

2.
The second step-running the evolutionary optimization has the complexity O(nlogn)-because of the increasing number of epochs with dataset size.
One operation in the second step takes longer than in the first step. For small datasets, the second step is dominant, but, for the big ones, the first step. The measurements showed that for 900 instances the first step took about 10% of the total time, but, for 36,690 instances, it took about 65%.
Most of the non-evolutionary instance selection algorithms must also calculate the distance matrix or other equivalent matrix. Their complexity is between O(n 2 ) (ENN, RHMC, ELH) and O(n 3 ) (DROP1-5, GE, RNGE) [22]. We really observed that, for big datasets, the instance selection time with DROP3 grows much faster than with MEISR.
In the first step, the distances between each instances in the training set are first calculated in O(n 2 ) and then they are sorted in O(nlogn), so the complexity is O(n 2 )-the higher of the two. The time spent on calculating the distance matrix also grows with the number of attributes.
The second step consists of several operations. Calculating the fitness function has the complexity O(n) because the output value of n instances must be obtained, where n = N is the number of instances in the original training set. Obtaining the output value requires reading on average k non-zero positions from sorted output value arrays, where k is the number of nearest neighbors in the k-NN algorithm, which assuming a reduction rate of 50% requires reading 2k entries and calculating the average of them. The time spent in this step grows with the number of k, but much slower than linearly because also other operations are performed in this step, which do not depend on k, or depend but weaker than linearly, as crossover, mutation, and selection. The proportions of time spent at each step depend on particular software implementation. The experimental measurements confirmed that the complexity of the step can be considered O(nlogn).
In a practical software implementation, there is also a third-factor consuming time: the constant operations independent on the data, as calling functions, creating objects, etc. This factor is most significant with very small datasets and this is the reason that the times per one instance (two last columns in Table A16) are higher for the smallest dataset than for some of the following datasets. Figure 6 shows the dependency of the MEISR running time on the number of instances (left) for 1-NN and k-NN with optimal k and the percentage of the running time used to calculate the distance matrix used by k-NN. The two points marked as "85a" denote the tick dataset, which has 85 attributes (more than other datasets) and, for that reason, there is so high cost of calculating the distance matrix for this case. The measurements were performed using our software (available from the online supplementary resources) on a computer with two Xeon E5-2696v2 processors. Detailed values can be found in Table A16).

Performance Metrics for Multi-Objective Evolutionary Algorithms
To measure the behavior of the NSGA-II algorithm used within the instance selection process, we provide the performance metrics for multi-objective evolutionary algorithms for binary instance selection with k-NN with optimal k (for other sets of experiments, the metrics were very similar).
However, first, it must be emphasized that the metrics express only the performance of the evolutionary algorithm in terms of data reduction and RMSE trn on the training set and not of the whole process of instance selection, where the final objectives are data reduction and RMSE tst on the test set. Second, as it was discussed that we had to stop the optimization early to prevent over-fitting and, for that reason, we could not achieve performance metrics that were as good as could be obtained if the target objectives would be optimized directly.
We calculated the following popular metrics: Ratio of Non-dominated Individuals-RNI [61], Inverted Generational Distance-IGD [62] (which expresses closeness of the solutions to the true Pareto front), Uniform Distribution-UD [61] (which expresses distribution of the solutions, with the σ for UD metric set to 0.025), Maximum Spread-MS [63] (spread of the solutions) and HyperVolume indicator-HV [64,65] (which applies to several of listed categories, with a reference point for HV metric set to +5%).
The obtained values of metrics can be summed up as follows: • The RNI values are high, average 0.560, which means that more than 50% of the population formed a Pareto front.

•
The IGD values are low, average: 0.014 (lower = better), which means that the results were always close to the optimal Pareto front.

•
The UD values are low, average: 0.234 (lower = better), which means that most of the solutions were properly spread.

•
The MS values are high, average: 0.833 (higher = better), which means that obtained Pareto fronts are wide in range in comparison to optimal Pareto front.

•
The average HV values are 0.039, which means that the values of RMSE trn , even for high compression, did not increase significantly, which is good because lower RMSE is preferred and the very low compression is not present in the results, which is also good because the desired RMSE has already been achieved with stronger compression and thus HV covers a satisfactory part of the objectives' area.

Conclusions
We presented and experimentally evaluated an instance selection method applied for regression tasks, which uses the k-NN algorithm for error evaluation a multi-objective evolutionary algorithm (NSGA-II in the current implementation) for directing the search process. Different aspects of the solution were discussed, many improvements were proposed and different tests were performed. The following main conclusions can be drawn from this work: • A key advantage of the MEISR method is that we obtain several solutions from the Pareto front, and can choose one of them depending on our preferences of the RMSE-compression balance.
As the solutions create a Pareto front, each of them is best for a given balance between RMSE and compression, as explained in Section 2.2. If someone is not sure which solution to choose, then we suggest the solution with the lowest RMSE on the training set (RMSE trn ) for the purpose for machine learning and the solution (c2, r2) for analyzing the properties of the data, as explained in Figure 2 and in the text following it. This is valid for every dataset, as this is the characteristic feature of the multi-objective optimization itself and it is not dependent on the dataset properties. • k-NN is very well suited as the inner evaluation algorithm because of its speed-the distance matrix has to be calculated and sorted only once and then the prediction is extremely fast. This makes the computational cost of the method comparable to the cost on non-evolutionary instance selection algorithms (and some of them, as the DROP family have even higher cost), while the results are usually better (see Section 3.5, Figure 6 and Table A16 for details).

•
We were frequently able to preserve all the useful information in the dataset for the purpose of predictive model performance while reducing it size by about one third (see columns 4 and 5 in Tables A3-A8 and Figure 4).

•
Proper initialization of the population accelerates the instance selection process and helps to find the desired solutions (see Section 2.4).

•
The best results in terms of RMSE-compression balance can be obtained if the inner evaluation algorithm is the same as the final predictor [24]. For that reason, we used 1-NN as the internal regressor when 1-NN was used as the final prediction model and k-NN with optimal k as the internal regressor when k-NN with optimal k was used as the final prediction model.

•
Although our previous experiments showed [24] that when an MLP neural network as the final predictor better results were achieved if also the internal regressor was an MLP network, it would be very time-consuming, especially for bigger datasets, as the MLP network has to be trained each time (at least on the part of the data close to the currently evaluated point). Thus, we decided to use k-NN with optimal k as the internal regressor for the MLP neural network.

•
We noticed that there are two areas where real-value instance weighting can provide better results than binary instance selection: for very high compression, where it was usually able to achieve lower RMSE and for noisy datasets, which required high k value in k-NN (see Table A10).

•
The obtained RMSE tst with the k-NN algorithm for a given point on the Pareto front (for a given compression) can be approximately assessed by the measures of how well one instances can be substituted by other, e.g., the loss functions of cross entropy or RMSE. The lower values of the loss function correspond with lower possible decrease RMSE tst .

•
The MEISR method achieved better results than the other 12 instance selection methods for regression, for which we were able to obtain the experimental results to perform the comparison (see Tables A12 and A14) • We can observe that the "front" is less steep for the test set than for the training set. This is the compression strength grows and the RMSE increases on average slower on the test set than on the corresponding training set solutions (when moving from right to left in Figure 2, the green points are getting closer to the orange ones). This allows for choosing a point with higher compression, as the RMSE tst on that point is likely to grow less than RMSE trn • There were not significant differences between the algorithm performance tested with 50% holdout and 10-fold cross-validation. Only retention at the point c2 was on average higher 11% in the first case, as there were already fewer instances in the original training set.
We have also noticed two areas of possible improvement and we are going to investigate them in our future work: • A single front covered all solutions of interest in many cases, but not in all. One of the reasons is the tendency of multi-objective evolutionary algorithms to not cover the areas on the ends of the front, as was discussed and some solutions were proposed i.e., in [66,67]. The second reason is that the front extends gradually during the optimization and, to prevent over-fitting, we must stop the instance selection process before the front is fully extended.

•
Experimental comparison with other instance selection methods showed that, for the small and medium size datasets, the MEISR method has the greatest advantage over other methods (see Tables A12 and A14). However, for the largest datasets, while still the compression was always stronger, the obtained 1 − r 2 began to became similar to that of DROP3-RT method. Although MEISR optimized RMSE and 1 − r 2 was used only for comparison and the relation between RMSE and 1 − r 2 is not linear, using 1 − r 2 as the objective on the training set would most likely improve the results. This would be true also for the smaller datasets and the tendency would remain. A well known issue here is that, for genetic algorithms with longer chromosomes, the convergence is more difficult. Thus, we are going to investigate the possibilities of alternative encoding of the instances to limit the chromosome length.
To summarize: the presented method of instance selection in regression tasks has proved to work effectively and has several advantages. Thus, we believe it can be helpful for researchers and practitioners in industry. Moreover, there is probably still room for further improvements.

Conflicts of Interest:
The authors declare no conflict of interest. The founding sponsors had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, and in the decision to publish the results.
Results for each experimental case, averaged for all datasets, are presented the Table A2 and the statistical significance test results in Table A15. Results for particular configurations fo the MEISR algorithm are presented in Tables A3 -A10. Result comparison with other algorithms is presented in Tables A11, A12, A13 and A14.
The calculation times is and its particular components are presented in Table A16. Table A1. Datasets used in the experiments and their properties: number of instances (Ints.), number of attributes (Attr.), the optimal k -the k in k-NN that gives the lowest RMSE tst (orgK) and the k used in the experiments as optimal k (optK) for the reason explained further in the text.