EvoSplit: An evolutionary approach to split a multi-label data set into disjoint subsets

This paper presents a new evolutionary approach, EvoSplit, for the distribution of multi-label data sets into disjoint subsets for supervised machine learning. Currently, data set providers either divide a data set randomly or use iterative stratification, a method that aims to maintain the label (or label pair) distribution of the original data set in the different subsets. Following the same aim, this paper first introduces a single-objective evolutionary approach that tries to obtain a split that maximizes the similarity of each of those distributions independently. Second, a new multi-objective evolutionary algorithm is presented to maximize the similarity considering both distributions (label and label pair) simultaneously. Both approaches are validated using well-known multi-label data sets as well as large image data sets currently used in computer vision and machine learning applications. EvoSplit improves the splitting of a data set in comparison to iterative stratification according to different measures: Label Distribution, Label Pair Distribution, Examples Distribution, and the number of folds and fold-label pairs with zero positive examples.


Introduction
Supervised learning is the machine learning task of learning a function that maps an input to an output based on example input-output pairs [28]. An inducer receives a set of labeled examples as training data and makes predictions for unseen inputs [25,15]. Traditionally, each example is associated with a single label. However, in some problems an example might be associated with multiple labels. Multi-label machine learning has received significant attention in fields such as text categorization [22], image classification [38], health risk prediction [24], or electricity load monitoring [34], among others. In computer vision in particular, there are more and more applications and available data sets that involve multi-label learning [20,39,3].
Formally [40], suppose X = R^d (or Z^d) denotes the d-dimensional instance space, and Y = {y_1, y_2, ..., y_q} denotes the label space with q possible class labels. The task of multi-label learning is to learn a function h : X → 2^Y from the multi-label training set D = {(x_i, Y_i) | 1 ≤ i ≤ m}. For each multi-label example (x_i, Y_i), x_i ∈ X is a d-dimensional feature vector (x_i1, x_i2, ..., x_id) and Y_i ⊆ Y is the set of labels associated with x_i. For any unseen instance x ∈ X, the multi-label classifier C(·) predicts C(x) ⊆ Y as the set of proper labels for x.
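To make this notation concrete, a minimal sketch in Python of how such a data set can be represented (the toy values and names below are illustrative only, not taken from any of the data sets discussed in this paper):

    import numpy as np

    # Toy multi-label data set with m = 4 examples, d = 2 features and q = 3 labels.
    X = np.array([[0.2, 1.5],
                  [0.7, 0.3],
                  [1.1, 0.9],
                  [0.4, 2.0]])            # feature vectors x_i
    Y = [{0, 2}, {1}, {0, 1, 2}, {2}]     # label sets Y_i, subsets of {0, 1, 2}

    # Equivalent m x q binary indicator matrix, a common input format for classifiers.
    q = 3
    Y_bin = np.zeros((len(Y), q), dtype=int)
    for i, labels in enumerate(Y):
        Y_bin[i, list(labels)] = 1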
In supervised learning, experiments typically involve a first step of distributing the examples of a data set into two or more disjoint subsets [29]. If a large amount of training data is available, the holdout method [15] is used to distribute the examples into two mutually exclusive subsets called training set and test set, or holdout set. Sometimes the training set is also divided into two disjoint subsets to create a validation set. When training data is limited, k-fold cross-validation is used, which splits the data set into k disjoint subsets of approximately equal size.

Table 1: Imbalance measures of different multi-label data sets: size of the data set (m), number of labels (q), maximum number of labels in an example (Max Labels), maximum frequency of a label in the data set (Max Frequency), label cardinality (Card), label density (Dens), label diversity (Div), normalized label diversity (PDiv), theoretical complexity score (TCS), average imbalance ratio per label (avgIR), and SCUMBLE. Well-known multi-label data sets (top) have smaller size and lower complexity than data sets currently used in computer vision applications (bottom). These data sets will be used in the validation of the different methods presented in this paper.

In single-label data sets, those disjoint subsets are built by randomly and evenly distributing the examples belonging to the different class labels. However, splitting a multi-label data set is not straightforward, as a class over-represented in one subset will be under-represented in the other(s) [15].
Furthermore, as noted in [29], the random distribution of multi-label training examples into subsets suffers from the following practical problem: it can lead to test subsets lacking even just one positive example of a rare label, which in turn causes calculation problems for a number of multi-label evaluation measures. The typical way these problems are bypassed in the literature is through the complete removal of rare labels. This, however, implies that the performance of the learning systems on rare labels is unimportant, which is seldom true.
As mentioned by [32], multi-label classification usually follows predetermined train/test splits set by the data set providers, without any analysis of how well the examples are distributed across those splits.
Therefore, a method called stratification [15] or stratified sampling [29] was developed, in which a data set is split so that the proportion of examples of each class in each subset is approximately equal to that in the complete data set. Stratification improves upon regular cross-validation in terms of both bias and variance [15].
The data used for learning a classifier is often imbalanced [35,6], i.e. the class labels assigned to the instances are not equally represented. Traditionally, imbalanced classification has been addressed through techniques such as resampling, cost-sensitive learning, and algorithm-specific adaptations [23,21]. In deep learning, data augmentation has been used successfully to address imbalanced data sets [30,17].
Different measures have been proposed to estimate the degree of multilabelledness and the imbalance level of a data set [40,4,5]. The label cardinality Card is the average number of labels per example (Eq. (1)). The label density Dens normalizes the label cardinality by the number of possible labels in the label space (Eq. (2)). The label diversity Div is the number of distinct label sets that appear in the data set (Eq. (3)); it can also be normalized by the number of examples to indicate the proportion of distinct label sets, PDiv (Eq. (4)). The average imbalance ratio per label, avgIR, is the average of the imbalance ratios (IRLbl) between the majority label and each label y_i ∈ Y (Eq. (5)). SCUMBLE measures the level of concurrence among minority and majority labels. Finally, the Theoretical Complexity Score (Eq. (7)) is defined as

TCS(D) = log(m × q × Div(D))

Table 1 presents these measures for well-known multi-label data sets and for recent large multi-label image data sets used in machine learning and computer vision applications. The complexity of these latter data sets, in terms of size, number of labels, cardinality, and diversity, is much higher than that of traditional multi-label data sets. Some of these data sets, e.g. Microsoft COCO, are labeled not only with the different classes that appear in an example but also with the exact number of appearances of each class. This is why the frequency of the most frequent label in the Microsoft COCO data set is higher than one: on average, it appears several times per example (Fig. 1).
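As an illustration only, the sketch below computes some of these measures for a data set given as a list of label sets (avgIR and SCUMBLE are omitted; the function name is illustrative):

    import math

    def multilabel_measures(Y, q):
        # Y: list of label sets, one per example; q: number of possible labels.
        m = len(Y)
        card = sum(len(y) for y in Y) / m          # Card: average labels per example
        dens = card / q                            # Dens: Card normalized by q
        div = len({frozenset(y) for y in Y})       # Div: number of distinct label sets
        pdiv = div / m                             # PDiv: Div normalized by m
        tcs = math.log(m * q * div)                # TCS(D) = log(m * q * Div(D))
        return {"Card": card, "Dens": dens, "Div": div, "PDiv": pdiv, "TCS": tcs}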
The remainder of this paper is organized as follows. Section 2 reviews available methods for the stratification of multi-label data sets; Section 3 introduces and evaluates an evolutionary approach to obtain a stratified sampling of a data set; Section 4 proposes a multi-objective evolutionary algorithm to obtain an improved stratification; and Section 5 validates this latest evolutionary approach with large image data sets currently used in computer vision and machine learning applications, comparing the results with the official splits usually employed in the literature. Finally, Section 6 discusses the methods proposed in the paper and presents some future work.

Related works
The first approach to apply stratification to multi-label data sets was proposed by Sechidis et al. [29], who introduced an iterative stratification (IS) algorithm and evaluated the resulting splits with several measures. The Label Distribution (LD) measure (Eq. (8)) evaluates how much the proportion of positive and negative examples of each label in each subset deviates from that proportion in the complete data set. The Examples Distribution (ED) measure (Eq. (9)) evaluates the extent to which the final number of examples in each subset S_j deviates from the desired/expected number of examples in that subset. They also counted the number of folds that contain at least one label with no positive examples (FZ) and the number of fold-label pairs with no positive examples (FLZ). Finally, they mentioned that their approach might obtain worse results for multi-label classification methods that consider pairs of labels, e.g. Calibrated Label Ranking [13], as their stratification method only considers the distribution of single labels.
Similarly to the measures presented in Section 1, measures can be defined not only for single labels appearing in a data set but also for higher-order relations between them, i.e. the simultaneous appearance of labels (label pairs, triplets, ...), such as Card_k, Dens_k, Div_k, and PDiv_k. For instance, Card_2(D) indicates the average number of label pairs per example. Table 2 shows these measures for order 2 for the data sets previously analyzed.
Given the limitation mentioned by Sechidis et al. regarding pairs of labels, Szymański and Kajdanowicz [32] extended the Iterative Stratification approach to take second-order relationships between labels, i.e. label pairs, into account when performing stratification. The proposed algorithm is called Second Order Iterative Stratification (SOIS). Szymański and Kajdanowicz compared SOIS with IS and random distribution using the same measures (LD, ED, FZ, FLZ). They also included a new measure, the Label Pair Distribution (LPD) (Eq. (10)), an extension of the LD measure that operates on positive and negative subsets of label pairs instead of labels. Given E, the set of label pairs appearing in the data set, S_j^i and D^i are the sets of examples that have the i-th label pair from E assigned in subset S_j and in the entire data set, respectively. In most cases, SOIS obtains better results than IS.
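Since Eqs. (9) and (10) are not reproduced here, the following sketch only illustrates the idea behind both measures under assumed formulations (average deviation of subset sizes for ED, and an LD-style comparison of positive/negative ratios per label pair for LPD):

    def examples_distribution(subsets, desired_sizes):
        # ED (assumed form of Eq. (9)): average absolute deviation between the
        # obtained and the desired number of examples per subset.
        return sum(abs(len(S) - c) for S, c in zip(subsets, desired_sizes)) / len(subsets)

    def label_pair_distribution(subsets, Y, E):
        # LPD (assumed form of Eq. (10)): LD-style measure over the label pairs in E.
        # subsets: lists of example indices; Y: list of label sets, one per example.
        def pos_neg_ratio(indices, pair):
            pos = sum(1 for i in indices if pair <= Y[i])
            return pos / max(len(indices) - pos, 1)          # guard against division by zero
        total = 0.0
        for a, b in E:
            pair = frozenset((a, b))
            d_ratio = pos_neg_ratio(range(len(Y)), pair)
            total += sum(abs(pos_neg_ratio(S, pair) - d_ratio) for S in subsets) / len(subsets)
        return total / len(E)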
First approach: Single-objective evolutionary algorithm

This work proposes an evolutionary algorithm (EA), EvoSplit, to obtain the distribution of a data set into disjoint subsets, considering their desired sizes as a hard constraint. The structure of the evolutionary algorithm follows the process presented in Algorithm 1.

Characteristics of the algorithm
Let D be a multi-label data set, k the desired number of disjoint subsets S_1, ..., S_k of D, and c_1, ..., c_k the desired number of examples in each subset. Each individual is encoded as an integer vector of size |D|, in which each gene indicates the subset to which the corresponding example is assigned.
Different strategies can be used to generate new individuals by crossover and mutation. EvoSplit selects parents by ranking, recombination is performed using 1-point crossover, and mutation randomly reassigns 1% of the genes to a different subset. In most cases, this process would produce individuals that do not comply with the constraint of having c_i examples in subset S_i, i = 1...k. Therefore, a repairing process is applied that randomly reassigns examples/genes to other subsets until the constraint is fully satisfied.
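Algorithm 1 is not reproduced in this excerpt, so the following is only a sketch of the described loop, combining the operators above with the elitist survival of the n best individuals and the stopping criterion of gen_max generations without improvement; parameter values and helper names are illustrative, and the fitness functions are discussed next.

    import random

    def evosplit_single_objective(m, k, sizes, fitness, n_pop=50, n_offspring=50, gen_max=20):
        # One individual assigns each of the m examples (genes) to one of the k subsets.
        # Assumes sum(sizes) == m so that the repair step can always succeed.
        def repair(ind):
            # Randomly move genes from over-full to under-full subsets until
            # subset j holds exactly sizes[j] examples.
            counts = [ind.count(j) for j in range(k)]
            while any(c > s for c, s in zip(counts, sizes)):
                src = next(j for j in range(k) if counts[j] > sizes[j])
                dst = next(j for j in range(k) if counts[j] < sizes[j])
                g = random.choice([i for i, v in enumerate(ind) if v == src])
                ind[g] = dst
                counts[src] -= 1
                counts[dst] += 1
            return ind

        def make_child(pop):
            p1, p2 = random.sample(pop[:max(2, n_pop // 2)], 2)   # parents from the better-ranked half
            cut = random.randrange(1, m)                           # 1-point crossover
            child = p1[:cut] + p2[cut:]
            for g in random.sample(range(m), max(1, m // 100)):    # randomly reassign ~1% of genes
                child[g] = random.randrange(k)
            return repair(child)

        pop = sorted([repair([random.randrange(k) for _ in range(m)]) for _ in range(n_pop)],
                     key=fitness)
        best, stalled = fitness(pop[0]), 0
        while stalled < gen_max:
            pop = sorted(pop + [make_child(pop) for _ in range(n_offspring)], key=fitness)[:n_pop]
            f = fitness(pop[0])
            best, stalled = (f, 0) if f < best else (best, stalled + 1)
        return pop[0]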
This work considers two different fitness functions: a variant of the Label Distribution (LD) and the Label Pair Distribution (LPD), which were introduced in Section 2. The Label Distribution is appropriate for data sets in which a specific label can appear at most once in an example. However, for data sets in which a particular example might include several instances of the same label, Eq. (8) is not appropriate. As shown earlier, this is the case of well-known data sets in computer vision such as Microsoft COCO [20]. Therefore, the LD measure has been modified to also consider data sets with this characteristic.
Let λ_i^{S_j} and λ_i^{D} denote the number of appearances of label λ_i in subset S_j and in the data set D, respectively, and let L^{S_j} and L^{D} denote the total number of label appearances in subset S_j and in D, respectively. The modified Label Distribution measure, which is used as fitness function in EvoSplit, is then calculated following Eq. (11).
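Eq. (11) is not reproduced here; the sketch below implements one plausible reading of the modified measure, comparing, for every label and subset, the label's share of all label appearances in the subset against its share in the whole data set (each example is given as a Counter so that repeated labels, as in MS COCO, are counted):

    from collections import Counter

    def modified_label_distribution(subsets, Y, q):
        # subsets: lists of example indices; Y: list of Counters, label -> appearances.
        # Assumes non-empty subsets and at least one label appearance in the data set.
        def totals(indices):
            c = Counter()
            for i in indices:
                c.update(Y[i])
            return c, sum(c.values())                # lambda_i and L for this group of examples
        d_counts, d_total = totals(range(len(Y)))
        score = 0.0
        for S in subsets:
            s_counts, s_total = totals(S)
            score += sum(abs(s_counts[i] / s_total - d_counts[i] / d_total)
                         for i in range(q)) / q
        return score / len(subsets)                  # averaged over labels and subsets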
One could proceed similarly with the LPD measure. However, EvoSplit does not consider the number of appearances of each label in each example for this calculation, but only the co-occurrence of labels, as in the original LPD measure. Such a variant of the LPD would increase the number of pair combinations considerably and make it unlikely that different examples share the same pair.

Constraints
The application of evolutionary computation allows the introduction of constraints that all the individuals in the population must fulfill to be feasible. As mentioned in Section 1, the distribution of labels with few examples might lead to subsets lacking examples of those labels, which can hinder the validation and testing of multi-label classifiers. Therefore, EvoSplit introduces an optional constraint to ensure that, if possible, all subsets contain at least one example of each label. For instance, if a data set has to be split into three subsets and a label only appears in three examples, each of those examples will be assigned to a different subset. Other constraints could also be considered.
In case of generating an individual that does not fulfill the constraint, a repairing process similar to that explained before would be applied.
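A sketch of how this optional constraint could be checked for a candidate split (the helper name is illustrative; Y is again a list of label sets):

    def satisfies_label_constraint(subsets, Y, q, k):
        # Every label should appear in all k subsets; if a label occurs in fewer
        # than k examples, each of those examples must land in a different subset.
        for label in range(q):
            occurrences = sum(1 for y in Y if label in y)
            required = min(occurrences, k)
            covered = sum(1 for S in subsets if any(label in Y[i] for i in S))
            if covered < required:
                return False
        return True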

Results
Next, this work compares the performance of the proposed evolutionary approach with other alternatives to split a data set into disjoint subsets: official (the train/test division set by the providers of the data sets), random, and stratified (SOIS). All the experiments are run on an Intel Core i5-9600K CPU at 3.70 GHz with 64 GB RAM, under Ubuntu 16.04 LTS.
Similarly to the literature [29,32], this work has evaluated the different methods considering cross-validation of well-known multi-label data sets. For cross-validation, the size of each subset for the different data sets is presented in Table 3. For the evolutionary approaches, the parameters have been selected as follows:

Table 4 shows the evaluation, in terms of the Label Distribution measure, of the splits obtained with the different methods. Whether the constraint is considered or not, the evolutionary approach using the Label Distribution as fitness function obtains better results than any other method. The same happens when the evaluation is carried out in terms of the Label Pair Distribution (Table 5) and LPD is the measure employed as fitness function. However, in most cases, when the evolutionary algorithm tries to improve the distribution of single labels (by using LD as fitness function), it fails to distribute label pairs better than the Stratification method; the same happens in the other direction when the fitness function is LPD and the results are measured in terms of LD.
It is worth mentioning that the Stratification method does not consider the desired number of samples per fold as a hard constraint. Therefore, the final sizes of the subsets might deviate from the pre-established ones, as measured by the Examples Distribution and shown in Table 6.
Following [29], besides the Label Distribution and the Label Pair Distribution, the result of each alternative is also evaluated with additional measures (see Tables 7 and 8). From these results, it seems that the Stratification method works better than the proposed single-objective evolutionary algorithm if the aim is to obtain subsets that approximate well the distributions of both single labels and label pairs. This is probably due to the process that the Stratification method follows to distribute examples into subsets: first, considering label pairs in the distribution and, later, assigning the remaining examples based on single labels.

Second approach: Multi-Objective Evolutionary Algorithm
It therefore seems appropriate to consider both statistical measures, the Label Distribution and the Label Pair Distribution, in the optimization algorithm that splits a data set into disjoint subsets. Multi-objective optimization problems are those where the goal is to optimize several objective functions simultaneously. These functions have conflicting objectives, i.e. optimizing one affects the others. Therefore, there is not a unique solution but a set of solutions. The set of solutions in which the different objective components cannot be simultaneously improved constitutes a Pareto front, and each solution in the front represents a trade-off between the different objectives. Similarly to evolutionary algorithms for single-objective problems, multi-objective evolutionary algorithms (MOEA) [7] are heuristic algorithms to solve problems with multiple objective functions. The three goals of an MOEA are [36]: 1) to find a set of solutions as close as possible to the Pareto front (convergence); 2) to find a well-distributed set of solutions (diversity); and 3) to cover the entire Pareto front (coverage). Several MOEAs have been proposed in the literature. This work employs the Non-dominated Sorting Genetic Algorithm II (NSGA-II) [8].
NSGA-II has the three following features: 1) it uses an elitist principle, i.e. the elites of a population are given the opportunity to be carried to the next generation; 2) it uses an explicit diversity preserving mechanism (Crowding distance); and 3) it emphasizes the non-dominated solutions.
Table 6: Examples Distribution of different splitting algorithms. Only the stratification method deviates from zero.

Therefore, considering both the Label Distribution and the Label Pair Distribution as objective functions to be simultaneously optimized, NSGA-II will obtain a set of solutions, some of them optimizing one objective over the other and vice versa. From this set of solutions, EvoSplit selects the solution closest (using Euclidean distance) to the coordinate origin (Fig. 2).
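This selection step can be sketched in a few lines (assuming each non-dominated solution is summarized by its pair of objective values, LD and LPD):

    import math

    def pick_closest_to_origin(pareto_front):
        # pareto_front: list of (LD, LPD) pairs, one per non-dominated solution.
        # EvoSplit keeps the solution with the smallest Euclidean distance to the origin.
        return min(range(len(pareto_front)),
                   key=lambda i: math.hypot(*pareto_front[i]))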
This work has employed the implementation of NSGA-II offered by pymoo [1], a multi-objective optimization framework in Python, using the same parameters in terms of number of individuals, size of the offspring, and ending condition presented in Section 3.3. Table 9 shows the results obtained, for the different measures, by the splits generated with the unconstrained MOEA approach. The Examples Distribution measure is not shown as it is always zero, as with the previous evolutionary approaches. The obtained results are, in most cases, better than those obtained with the Stratification method. In this approach, unlike the previous single-objective evolutionary alternatives, results are good in terms of both LD and LPD.

Results
The MOEA approach obtains results in terms of LD close to those obtained by the single-objective approach optimizing only LD (see Table 4), and results in terms of LPD close to those obtained when the optimization is based only on LPD (see Table 5). These results are more balanced than those obtained with the single-objective evolutionary approaches, i.e. a good result in one of the measures does not come at the expense of the other. Additionally, results are also quite similar in terms of FZ and FLZ. For those data sets with FLZ different from zero, Table 10 shows the results obtained with the constrained alternative. For some data sets, FLZ is reduced without considerably affecting the LD and LPD measures.

Application to large image data sets
EvoSplit has also been validated using large multi-label data sets widely employed in computer vision applications, particularly using deep learning techniques: Microsoft COCO, ImageNet, and OpenImages.

Microsoft COCO
The Microsoft Common Objects in COntext (MS COCO) data set [20] contains 91 common object categories, 82 of them having more than 5,000 labeled instances. In total, the data set has 2,500,000 labeled instances in 328,000 images. This work has used the subset considered in the COCO Panoptic Segmentation Task, which includes 123,167 images. From these, 5,000 are selected for validation, and the remaining ones for training. The panoptic segmentation task involves assigning a semantic label and an instance id to each pixel of an image, which requires generating dense, coherent scene segmentations. Of all the data sets considered in this work, MS COCO is the one with the highest cardinality, with an average of more than 11 labels per image. It is also the only data set in which examples are labeled not merely with whether each class appears in an image, but with the number of times it appears. As shown in Table 1, this is why the label appearing most often in the data set (class 1 = person, Max Frequency) appears, on average, more than once per image.

Tencent ML-Images
The Tencent ML-Images database [39] is a multi-label image database with 18M images and 11K categories, collected from ImageNet [9] and OpenImages [16]. After a process of removal and inclusion of images, and relabeling of the data set, 10,756,941 images, covering 10,032 categories, are included from ImageNet. From these, 50,000 are randomly selected as validation set. Following a similar process, 6,902,811 training images and 38,739 validation images are selected from OpenImages, covering 1,134 unique categories. Finally, these images and categories from ImageNet and OpenImages are merged to construct the Tencent ML-Images database, which includes 17,609,752 training and 88,739 validation images (50,000 from ImageNet and 38,739 from OpenImages), covering 11,166 categories.

Results
The measures obtained after applying the different splitting methods to these data sets are shown in Tables 11 to 13. In almost all the cases (best results in bold), any evolutionary approach, whether single-objective or multi-objective, constrained or unconstrained, performs better than the official or stratified splitting methods. Similarly to the results shown for the traditional, smaller data sets, MOEA obtains the best combined results for the Label Distribution and the Label Pair Distribution in almost all the cases. For the Microsoft COCO and OpenImages data sets, the results for those measures improve by one or more orders of magnitude with respect to the official splits, i.e. those offered by the providers of the data sets.
With these data sets, the effect of using the constrained approach, in which the goal is to include all the labels in every fold, is even clearer. The introduction of the constraint makes it possible to obtain FZ and FLZ equal to zero for the OpenImages data set, while their values for the official split are 1 and 9 respectively. FLZ is also dramatically reduced for the ImageNet data set (361 vs. 19 using MOEA).

Discussion
This paper has presented EvoSplit, a novel evolutionary method to split a multi-label data set into disjoint subsets. Different alternatives, single-objective and multi-objective, using diverse measures as fitness functions, have been proposed. A constraint has also been introduced to ensure that, if possible, all labels are distributed among all subsets. In almost all the cases, the multi-objective proposal obtains state-of-the-art results, improving or matching the quality of the splits officially provided or obtained with iterative stratification methods. The improvement of EvoSplit over previous methods is most pronounced when applied to very large data sets, such as those currently used in machine learning and computer vision applications.
Moreover, the introduction of the constrained optimization decreases the chance of producing subsets with zero positive examples for one or more labels. This should have an effect on training, as there will be fewer labels for which there are no training or validation examples. A very relevant result is that EvoSplit is able to find splits that fulfill the constraint without greatly affecting the distribution of labels and label pairs.
EvoSplit is able to obtain better distributions of the original data sets while treating the desired size of the subsets as a hard constraint, i.e. ensuring that the Examples Distribution is equal to zero. This is not the case for the iterative stratification methods.
Only in the case of the ImageNet data set are the best results not obtained by the multi-objective EA but by the single-objective EA optimizing the Label Pair Distribution measure. An explanation for this might be the relation between label diversity and label-pair diversity in this data set, which has a particular characteristic: it contains almost eight times more distinct label pairs than distinct labels (see the diversity measures in Tables 1 and 2), whereas for all the other data sets this ratio is close to one. This larger diversity of label pairs might therefore benefit the optimization based only on label pairs.
In conclusion, EvoSplit supports researchers in the process of creating a data set by providing different evolutionary alternatives to split that data set by optimizing the distribution of examples into the different subsets. EvoSplit can, in the future, be extended to higher levels of relationship between labels, e.g. triplets, by implementing a many-objective evolutionary algorithm [19].

Availability of splits and code
The splits obtained with EvoSplit for the different data sets employed in this paper are freely available at https://github.com/FranciscoFlorezRevuelta/EvoSplit for use by the research community. The EvoSplit code will also be available at the same repository.