ProLSFEO-LDL: Prototype Selection and Label-Specific Feature Evolutionary Optimization for Label Distribution Learning

Label Distribution Learning (LDL) is a general learning framework that assigns an instance to a distribution over a set of labels rather than to a single label or multiple labels. Current LDL methods have proven their effectiveness in many real-life machine learning applications. In LDL problems, instance-based algorithms, and particularly the adapted version of the k-nearest neighbors method for LDL (AA-kNN), have proven to be very competitive, achieving acceptable results and allowing an explainable model. However, AA-kNN suffers from several handicaps: it has large storage requirements, its prediction is inefficient, and it presents a low tolerance to noise. The purpose of this paper is to mitigate these effects by adding a data reduction stage. The technique devised, called Prototype Selection and Label-Specific Feature Evolutionary Optimization for LDL (ProLSFEO-LDL), is a novel method to simultaneously address the prototype selection and label-specific feature selection pre-processing techniques. Both techniques pose a complex optimization problem with a huge search space. Therefore, we have proposed a search method based on evolutionary algorithms that allows us to obtain a solution to both problems in a reasonable time. The effectiveness of the proposed ProLSFEO-LDL method is verified on several real-world LDL datasets, showing significant improvements in comparison with using raw datasets.


Introduction
A supervised learning process is the machine learning task of training a function that maps an input to an output based on data points with known outputs. Classification is the process of predicting to which of a set of categories a new observation belongs. Thus, the purpose of classification is to obtain a model able to assign the right class to an unknown pattern. However, there is an increasing number of problems where a pattern can be associated with several labels simultaneously. Examples can be found in genetics [1], image classification [2], etc. The generalization of classic classification is Multi-Label Learning (MLL) [3][4][5][6], where multiple labels can be assigned to each instance.
Nevertheless, in many real-world problems we may face cases where MLL is still not enough, as the degree of description of each label is different. To name just one dataset used in this paper, the biological experiments on yeast genes [7] over a period of time yield different levels of gene expression in a time series. The exact level of expression at any point in time is of less importance. Therefore, we have devised a search method based on evolutionary algorithms [32] that allows us to obtain a solution to both problems in a reasonable time. To this end, we have adapted elements of classical evolutionary algorithms to manage LDL restrictions, and we have designed a representation of the solution and a way to measure its quality, which allows us to address both problems together.
In order to evaluate the proposal's proficiency, we will compare an LDL learner trained on data pre-processed with our ProLSFEO-LDL algorithm against the same learner trained on the raw training set, measuring six aspects of their performance. We will repeat the experiment over 13 real-world datasets and validate the results of the empirical comparisons using Wilcoxon and Bayesian Sign tests [33][34][35].
The rest of the paper is organized as follows. First, a brief review and discussion of the foundations of LDL, a description of data reduction techniques and an introduction to the evolutionary optimization process are given in Section 2. The proposed ProLSFEO-LDL method is described in Section 3. Then the details of the experiments are reported in Section 4. Finally, the results and conclusions are drawn in Section 5 and Section 6, respectively.

Preliminaries
In this section, the foundations and the most relevant studies carried out on LDL are presented (Section 2.1). Furthermore, some basic concepts on instance selection and label-specific feature selection for classification are introduced (Section 2.2), as well as the notions needed to optimize solutions using evolutionary algorithms (Section 2.3), providing the background required to properly present the study carried out in this paper.

Foundations of Label Distribution Learning
We can formalize an LDL problem as a set of m training samples S = {(x_1, D_1), ..., (x_m, D_m)}, where each x_i is an instance and D_i = {d_{x_i}^{y_1}, ..., d_{x_i}^{y_c}} is its label distribution, with Y = {y_1, ..., y_c} denoting the complete set of labels. The constant c is the number of labels and d_{x_i}^{y_j} is the description degree of the jth label y_j for the ith instance x_i. According to the definition, for each x_i the description degrees should meet the constraints d_{x_i}^{y_j} ∈ [0, 1] and ∑_{j=1}^{c} d_{x_i}^{y_j} = 1. The solution to an LDL problem can be addressed from several perspectives. Depending on the selected approach, the algorithm to be developed may differ significantly: either a brand-new algorithm designed specifically to deal with LDL constraints, or an adaptation of already available classification algorithms, reformulated to work with those constraints. The LDL study presented in [9] suggested six algorithms grouped into three categories. The first one is Problem Transformation (PT), a simple way to convert an LDL problem into a Single-Label Learning (SLL) [36] problem. PT transforms the training samples into weighted single-label examples; thus, any SLL algorithm may be applied. Two representative algorithms are PT-Bayes and PT-SVM. The second one is Algorithm Adaptation (AA), in which existing learning algorithms are adapted to handle the label distribution directly. Two suitable algorithms were presented: AA-kNN, an adaptation of the well-known k-nearest neighbors method [37], and AA-BP, a three-layer backpropagation neural network. Finally, Specialized Algorithms (SAs), in contrast to the indirect strategy of PT and AA, directly match the LDL problem. SA-IIS and SA-BFGS are two specialized algorithms that learn by optimizing an energy function based on the maximum entropy model. Subsequent works have successfully improved the results achieved by these original algorithms through different approaches.
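As a concrete illustration of the formalization above, the snippet below builds a toy LDL training set (the values and variable names are illustrative, not taken from any of the paper's datasets) and checks that each label distribution satisfies the two LDL constraints:

```python
import numpy as np

# Toy LDL training data: m = 2 instances with q = 2 features and c = 3 labels.
X = np.array([[0.2, 1.5],
              [0.9, 0.3]])
D = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.2, 0.6]])   # one label distribution per instance

def is_valid_label_distribution(D, tol=1e-9):
    """Check the two LDL constraints: every description degree lies in
    [0, 1] and each instance's degrees sum to 1."""
    in_range = np.all((D >= 0.0) & (D <= 1.0))
    sums_to_one = np.allclose(D.sum(axis=1), 1.0, atol=tol)
    return bool(in_range and sums_to_one)

print(is_valid_label_distribution(D))  # True
```

Single-label and multi-label learning are special cases of this layout: a one-hot row recovers SLL, and a row spreading mass uniformly over a label subset recovers MLL.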
The methods LDLogitBoost and AOSO-LDLogitBoost, proposed in [19], combine the boosting method and logistic regression within the LDL framework. Deep Label Distribution Learning (DLDL) [18] and LDL based on Ensemble Neural Networks (ENN-LDL) [38] are two examples of the successful application of neural networks to LDL. Modeled on differentiable decision trees [39], the end-to-end LDL Forests strategy proposed in [17] was used as the basis for Structured Random Forest (StructRF) [20]. BC-LDL [40] and DBC-LDL [41] use binary coding strategies to address the large-scale LDL problem. Classification with LDL (LDL4C) [42] is also an interesting approach, in which the learned label distribution model is used as a classification model. Feature selection on LDL [43] shows encouraging results by applying feature selection to LDL problems.

Prototype Selection and Label-Specific Feature Learning
Nowadays, one of the main challenges for supervised learning algorithms is still how to deal with raw datasets. Data pre-processing is an often neglected but important step in the data mining process [22]. Data gathering is often a poorly monitored process, resulting in low-quality datasets. If there is a lot of irrelevant and redundant information, or noisy and unreliable data, then knowledge discovery is more difficult to carry out. Data reduction techniques [22] allow us to obtain a reduced representation of the original data while maintaining its essential structure and integrity. Such pre-processing techniques usually lead to improved performance of subsequent learners.
A frequent problem with real datasets is their large volume, as well as the presence of noise and anomalies that complicate the learning process. Instance selection consists of choosing a subset of the data that achieves the original purpose of the data mining application as if the whole dataset were used [23]. The optimal outcome of instance selection is a stand-alone model, built from a minimal data sample, that can perform tasks with little or no performance degradation.
Instance selection has been successfully applied to various problem types such as imbalanced learning [44], data streams [45], regression [46] and subgroup discovery in large datasets [47]. The research conducted in [48] is also of interest, investigating the impact of instance selection on the underlying structure of a dataset by analyzing the distribution of sample types. However, as of today, instance selection has not been extensively researched in the domain of MLL, and only a few methods have been made available to date [26][27][28]. As for LDL, we have not been able to find any studies to date.
Prototype Selection [22] methods are instance selection methods that aim to find training sets offering the best accuracy and reduction rates when used with instance-based classifiers that rely on a similarity or distance measure. A widely used categorization of prototype selection methods distinguishes three types of techniques [24]: Condensation, where the aim is to retain border points while preserving the accuracy on the training set; Edition, where the objective is to eliminate boundary points that are considered noise or do not match their neighbors, without removing the internal points of each dataset; and Hybrid methods, which try to find a small training set while maintaining the performance of the classifier. Considering the trade-off between reduction and accuracy, the hybrid techniques are usually the best approach.
To name just a few successful cases when applying prototype selection, [49] proposes a prototype selection algorithm called MONIPS that has proved to be competitive with classical prototype selection solutions adapted to monotonic classification. The experimental study presented in [50] shows a significant improvement when prototype selection is applied to dynamic classifier and ensemble selection.
Another widely used pre-processing technique is the reduction of data dimensionality by means of feature selection [25]. The main idea is to replace the original set of features by a new subset that extracts the main information and provides an accurate classification. However, in MLL and LDL, the strategy of selecting a set of characteristics shared by all labels may not be optimal, since each label may be described by a specific subset of characteristics of its own. Label-specific feature learning has been widely studied in MLL. For instance, LIFT [29] first builds specific features for each label by performing cluster analysis on its positive and negative instances, and then conducts training and testing by querying the cluster results. LLSF [30] proposes learning label-specific features for each class label by considering pairwise (i.e., second-order) label correlations. MLFC [51] is also a multi-label learning method that attempts to learn the specific characteristics of each label by exploiting the correlations between labels. However, the studies we can find on this subject for LDL are still scarce. LDLSF [31], a method inspired by LLSF, is adapted to deal with LDL problems by jointly selecting label-specific features, selecting common features and exploiting label correlations.

Evolutionary Optimization
Evolutionary algorithms (EAs) [32] are stochastic search mechanisms based on notions of natural selection. EAs have been applied to a broad range of problems, including search problems [52] and optimization problems [53], in many areas such as economics, engineering and biology. The primary idea is to maintain a population of chromosomes, which represent valid solutions to the problem and evolve over time through a process of competition and targeted variation. CHC [54] is a well-known evolutionary model that introduces different techniques to achieve a trade-off between exploration and exploitation, such as incest prevention, reinitialization of the search process when the population converges or the search stops making progress, and competition between parents and offspring in the replacement process.
Prototype selection can be considered as a search problem where EAs can be applied. The search space consists of all the subsets of the training set, which can be represented by a binary chromosome with two possible states for each gene: if the gene value is 1 then the associated instance is included in the subset, whereas a value of 0 excludes it. The chromosome has as many genes as there are instances in the training set. EAs have been used to solve the prototype selection problem with promising results [55][56][57][58][59].
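The binary encoding just described can be sketched in a few lines. The data below is hypothetical (random values standing in for a real training set); the point is only how a chromosome decodes into a prototype subset:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training data: m = 6 instances with 4 features and 3 labels.
X = rng.random((6, 4))
D = rng.dirichlet(np.ones(3), size=6)   # each row is a valid label distribution

# One binary gene per training instance: 1 keeps the instance, 0 discards it.
chromosome = np.array([1, 0, 1, 1, 0, 1])

mask = chromosome.astype(bool)
X_subset, D_subset = X[mask], D[mask]
print(X_subset.shape[0])  # 4 prototypes selected out of 6
```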
Label-specific feature selection is a very complex task. If the original set contains q features and c labels, the objective is to find a q × c binary matrix where each ij position indicates whether the ith feature will be taken into consideration to predict the jth label. Many search strategies can be used to find a solution to this problem, but finding the optimal subset can be hugely time-consuming. Therefore, it is justifiable to use an evolutionary algorithm that gives us an approximate solution in an acceptable time. Several studies have successfully applied feature reduction to multi-label problems using evolutionary algorithms [60][61][62][63], but as of today no such technique has been used to solve the label-specific feature learning problem.

ProLSFEO-LDL: Prototype Selection and Label-Specific Feature Evolutionary Optimization for Label Distribution Learning
ProLSFEO-LDL is an evolutionary algorithm proposal adapted to LDL specificities that combines prototype selection and label-specific feature learning.
It uses the framework of CHC, where the search space is represented by a chromosome (or individual) in which we code the prototype selection and the label-specific features, as detailed in Section 3.1. Through the evolutionary algorithm described in Algorithm 1, we optimize this initial solution until we reach a solution that meets our expectations. We start from a parent population P of size N, randomly initialized, and generate an offspring population P' by crossing the individuals of the parent population. The recombination method used to obtain the offspring is detailed in Section 3.2. Then, a survival competition is held, where the best N individuals from the parent population P and the offspring population P' are selected to compose the next generation. To determine whether one individual is better than another, we need to evaluate the quality of each chromosome; for this, we use the fitness method described in Section 3.3. When the population converges or the search no longer progresses, the population is reinitialized to introduce new diversity into the search; for this purpose, we use a threshold t to control when the reinitialization takes place. This step is explained in Section 3.4. The process of evolution iterates over G generations and finally returns the best solution B found during the search.
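The evolutionary flow just described can be condensed into a simplified CHC loop. This is an illustrative sketch under stated assumptions (a toy fitness function, incest prevention omitted, and a simplified no-progress test), not a faithful transcription of Algorithm 1:

```python
import numpy as np

rng = np.random.default_rng(1)

def chc_skeleton(fitness, chrom_len, pop_size=10, generations=30, t_init=5):
    """Simplified CHC flow: HUX recombination, elitist survival of the best N
    among parents and offspring, and reinitialization when diversity is lost.
    The fitness is minimized."""
    pop = rng.integers(0, 2, size=(pop_size, chrom_len))
    fit = np.array([fitness(ch) for ch in pop])
    threshold = t_init
    for _ in range(generations):
        # Recombine consecutive pairs with HUX: swap half of the differing bits.
        children = []
        for i in range(0, pop_size - 1, 2):
            a, b = pop[i].copy(), pop[i + 1].copy()
            diff = np.flatnonzero(a != b)
            swap = rng.choice(diff, size=len(diff) // 2, replace=False) if len(diff) > 1 else []
            a[swap], b[swap] = b[swap], a[swap]
            children += [a, b]
        children = np.array(children)
        child_fit = np.array([fitness(ch) for ch in children])
        # Elitist replacement: keep the best pop_size of parents + offspring.
        merged = np.vstack([pop, children])
        merged_fit = np.concatenate([fit, child_fit])
        order = np.argsort(merged_fit)[:pop_size]
        new_pop, new_fit = merged[order], merged_fit[order]
        if new_fit.min() >= fit.min():   # no improvement: drain the threshold
            threshold -= 1
        pop, fit = new_pop, new_fit
        if threshold <= 0:               # reinitialize around the best solution
            best = pop[np.argmin(fit)]
            pop = rng.integers(0, 2, size=(pop_size, chrom_len))
            pop[0] = best
            fit = np.array([fitness(ch) for ch in pop])
            threshold = t_init
    return pop[np.argmin(fit)]

# Toy usage: minimize the number of ones in the chromosome.
best = chc_skeleton(lambda ch: ch.sum(), chrom_len=12)
print(best.sum())
```

In ProLSFEO-LDL, the toy fitness would be replaced by the wrapper evaluation of Section 3.3 and the chromosome length would be m + q × c.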
In the following sections, we explain in depth each part of the proposed method.

Representation of Individuals
Each individual should represent the two parts of the algorithm. On the one hand, the first m genes of the chromosome represent the selection of instances: if the gene value is 1, the associated instance is included in the subset; if the value is 0, it is not. On the other hand, the second part of the chromosome consists of q × c genes representing the matrix of label-specific features. A graphical representation of genes, chromosome and population is shown in Figure 1, where genes named ss_i indicate whether instance i will be included in the final set or not, and genes fs_jk indicate whether feature k will be taken into account to predict the specific label j. The chromosomes that make up the population are initialized by randomly drawing each gene from a discrete uniform distribution over {0, 1}.
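A minimal sketch of this two-part representation and its random initialization follows (toy sizes; the orientation of the decoded feature matrix, with one row per label, is an implementation choice):

```python
import numpy as np

rng = np.random.default_rng(42)

m, q, c = 5, 4, 3   # toy sizes: instances, features, labels

# Random initialization from a discrete uniform distribution over {0, 1}:
# the first m genes select prototypes (ss), the remaining q*c genes encode
# the label-specific feature matrix (fs).
chromosome = rng.integers(0, 2, size=m + q * c)

ss = chromosome[:m].astype(bool)                 # prototype-selection genes
fs = chromosome[m:].reshape(c, q).astype(bool)   # row j: feature subset for label y_j

print(ss.shape, fs.shape)
```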

Crossover
The crossover operator, HUX, performs a uniform crossover that randomly exchanges exactly half of the bits that differ between the two parent strings. No mutation is applied during the recombination phase of the CHC algorithm.
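HUX can be sketched as follows; since exactly half of the differing bits are exchanged, each child inherits the same number of differing bits from each parent regardless of which positions are drawn:

```python
import numpy as np

rng = np.random.default_rng(7)

def hux(parent_a, parent_b):
    """HUX sketch: swap exactly half of the bits that differ between two
    binary parent strings (no mutation is applied, as in CHC)."""
    a, b = parent_a.copy(), parent_b.copy()
    diff = np.flatnonzero(a != b)
    if len(diff) < 2:
        return a, b                  # nothing (or too little) to exchange
    swap = rng.choice(diff, size=len(diff) // 2, replace=False)
    a[swap], b[swap] = b[swap], a[swap]
    return a, b

p1 = np.array([1, 1, 1, 1, 0, 0])
p2 = np.array([0, 0, 0, 0, 0, 0])
c1, c2 = hux(p1, p2)
# The four differing bits are split evenly: each child ends up with two ones.
print(c1.sum(), c2.sum())  # 2 2
```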

Fitness Function
In order to measure the quality of a particular chromosome, we need to evaluate it using an LDL learner as a wrapper. As the prototype selection is focused on the optimization of the AA-kNN method, we have chosen this same learner, but with some adaptations to support feature selection. By default, AA-kNN uses all features to calculate the distances between instances. In our adaptation, each output label j is individually predicted using only the features selected in fs_j = {fs_j1, ..., fs_jq}, coded in the chromosome. The prediction obtained is then normalized so that the result is compatible with the LDL constraints.
The step sequence for evaluating a chromosome is as follows:
• Create a subset of instances: the selected prototypes are coded in the first m genes of the chromosome. We create a subset T by keeping only the elements of the original training set S that have an associated gene value of 1.
• Predict the label distributions: the adapted AA-kNN learner is applied over T, predicting each label with the specific feature subset coded in the chromosome and normalizing the result.
• Compute the fitness value: the distance between the predicted label distributions and the real ones is taken as the quality of the chromosome.
The objective is to minimize the fitness function.
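This wrapper evaluation can be sketched as below. It is an illustrative sketch under stated assumptions: function names are hypothetical, a leave-one-out estimate stands in for the cross-validation the paper uses inside the optimization, and the Chebyshev distance is used as the error to minimize:

```python
import numpy as np

def aaknn_lsf_predict(x, X_train, D_train, fs, k=4):
    # Adapted AA-kNN: each label j is predicted from the k nearest neighbours
    # found using only its selected features fs[j]; the assembled vector is
    # then renormalized so the prediction satisfies the LDL constraints.
    c = D_train.shape[1]
    pred = np.empty(c)
    for j in range(c):
        feats = np.flatnonzero(fs[j])
        if len(feats) == 0:                      # degenerate gene block: use all features
            feats = np.arange(X_train.shape[1])
        dist = np.linalg.norm(X_train[:, feats] - x[feats], axis=1)
        nearest = np.argsort(dist)[:k]
        pred[j] = D_train[nearest, j].mean()
    return pred / pred.sum()

def fitness(chromosome, X, D, q, c, k=4):
    # Decode the chromosome, keep the selected prototypes, and score the
    # subset with leave-one-out predictions; lower fitness is better.
    m = X.shape[0]
    ss = chromosome[:m].astype(bool)
    fs = chromosome[m:].reshape(c, q).astype(bool)
    if ss.sum() <= k:                            # too few prototypes to evaluate
        return 1.0                               # (an assumed worst-case penalty)
    Xs, Ds = X[ss], D[ss]
    errs = []
    for i in range(len(Xs)):
        keep = np.arange(len(Xs)) != i
        pred = aaknn_lsf_predict(Xs[i], Xs[keep], Ds[keep], fs, k)
        errs.append(np.max(np.abs(pred - Ds[i])))  # Chebyshev distance
    return float(np.mean(errs))
```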

Reinitialization
When the population converges or the search no longer progresses, the population is reinitialized to introduce new diversity into the search. The population is considered not sufficiently diverse when the threshold L reaches 1 (line 13 in Algorithm 1). This threshold is initialized with the parameterized value t (line 4 in Algorithm 1) and decreased by 1 each time the population does not evolve (line 12 in Algorithm 1). In such a case, only the chromosome that represents the best solution found during the search is kept in the new population, and the remaining individuals are randomly generated to fill in the rest of the population (lines 14-17 in Algorithm 1). In CHC, the typical value used for the threshold t is the total length of the chromosome divided by 4. In our case, due to the length of the chromosome, we choose a lower value that allows a faster convergence.
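The restart step itself is short; a minimal sketch (assuming a binary population array, with the function name chosen here for illustration):

```python
import numpy as np

def reinitialize(pop, fit, rng):
    """Restart sketch: keep only the best chromosome found so far and refill
    the remaining individuals at random, as in lines 14-17 of Algorithm 1."""
    best = pop[np.argmin(fit)].copy()
    new_pop = rng.integers(0, 2, size=pop.shape)
    new_pop[0] = best                # elitism: the best solution survives
    return new_pop

rng = np.random.default_rng(3)
pop = rng.integers(0, 2, size=(4, 6))
fit = np.array([0.4, 0.1, 0.9, 0.5])
new_pop = reinitialize(pop, fit, rng)
print((new_pop[0] == pop[1]).all())  # True: the fittest chromosome is kept
```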

Experimental Framework
This section is devoted to introducing the experimental framework used in the empirical study conducted in this paper. In our experiments, we have included 13 datasets covering a wide variety of real-world problems, described in Section 4.1. In order to evaluate the learners' proficiency, we employ the six measures described in Section 4.2 to evaluate the different aspects of the experiments reported in Section 4.3.

Datasets
There are a total of 13 real-world datasets employed in the experiments. A summary of their characteristics is shown in Table 1. The first ten datasets were gathered from biological experiments on the budding yeast Saccharomyces cerevisiae [7]. Each dataset consists of 2465 yeast genes, and an associated phylogenetic profile vector with a length of 24 is used to characterize each gene. In a biological experiment, the level of gene expression is usually uneven at each discrete time point, so each label corresponds to a time point and its description degree to the normalized gene expression level at that point.
Datasets JAFFE [64] and BU_3DFE [65] are two widely used facial expression image datasets. The JAFFE dataset contains a total of 213 gray-scale expression images, while BU_3DFE contains 2500 facial expression images. The images in JAFFE have been rated by 60 people on the six primary emotion labels (fear, disgust, happiness, anger, sadness and surprise) with a 5-level scale, and the images in BU_3DFE have been ranked by 23 people using the same scale as in JAFFE. Each image is represented by a 243-dimensional feature vector extracted by the Local Binary Patterns (LBP) method [66]. The score of each emotion is considered its description degree, and the normalized scores of the six emotions make up the label distribution of a particular facial expression image.
The Natural Scene dataset is compiled from 2000 natural scene images that have been inconsistently classified by ten human rankers. A 294-dimensional feature vector extracted in [2] represents each image and is linked with a multi-label selected from nine possible labels, i.e., sky, mountain, water, sun, cloud, snow, building, desert and plant. The rankers are then required to sort the labels in descending order of relevance. Predictably, the ratings for the same image from different rankers are highly inconsistent, so a nonlinear programming process [67] is applied to obtain the label distribution.

Evaluation Measure Selection
Several measures of distance/similarity between probability distributions can be applied to measure the distance/similarity between label distributions. We propose to use a set of six measures when comparing different LDL algorithms:

• Chebyshev Distance: a metric defined on a vector space where the distance between two vectors is the greatest of their differences along any coordinate dimension.
• Clark Distance: also called the coefficient of divergence, it is the square root of half of the divergence distance.
• Canberra Metric: a numerical measure of the distance between pairs of points in a vector space. It is a weighted version of the Manhattan distance, often used for data scattered around an origin.
• Kullback-Leibler Divergence: closely related to relative entropy, information divergence, and information for discrimination, it is a non-symmetric measure of the difference between two probability distributions p(x) and q(x). Specifically, the Kullback-Leibler divergence of q(x) from p(x) is a measure of the information lost when q(x) is used to approximate p(x).

• Cosine Coefficient: a metric used to measure how similar two non-zero vectors are irrespective of their size. It measures the cosine of the angle between the two vectors projected in a multidimensional space; the smaller the angle, the higher the cosine similarity.

• Intersection Similarity: it takes its largest value, 1, when all the terms of the first probability distribution are identical to the corresponding terms of the second. Otherwise, the similarity is less than 1; in the extreme case, when both distributions are very different, the similarity will be close to 0.
Each of these measures comes from a different syntactic family: the Minkowski family, the χ² family, the L1 family, the Shannon entropy family, the inner product family, and the intersection family, respectively [68]. Since they come from different families and differ significantly in both syntax and semantics, they have a good chance of reflecting different aspects of an LDL algorithm. Assuming that the real label distribution is D = {d_1, d_2, ..., d_c} and the predicted label distribution is D̂ = {d̂_1, d̂_2, ..., d̂_c}, the formulation of the six measures is summarized in Table 2.
Table 2. Evaluation measures for LDL learners. ↓ means that the lowest value is the best and ↑ means the opposite.
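For reference, the six measures can be sketched in a few lines. The formulations below follow the usual definitions of these distances and similarities in the literature and should be checked against the exact expressions in Table 2:

```python
import numpy as np

# d: real label distribution, p: predicted label distribution (both sum to 1).
def chebyshev(d, p):     return np.max(np.abs(d - p))                        # lower is better
def clark(d, p):         return np.sqrt(np.sum((d - p) ** 2 / (d + p) ** 2))  # lower is better
def canberra(d, p):      return np.sum(np.abs(d - p) / (d + p))              # lower is better
def kl_divergence(d, p): return np.sum(d * np.log(d / p))                    # lower is better
def cosine(d, p):        return np.dot(d, p) / (np.linalg.norm(d) * np.linalg.norm(p))  # higher is better
def intersection(d, p):  return np.sum(np.minimum(d, p))                     # higher is better

d = np.array([0.5, 0.3, 0.2])   # toy ground-truth distribution
p = np.array([0.4, 0.4, 0.2])   # toy predicted distribution
print(round(chebyshev(d, p), 2), round(intersection(d, p), 2))  # 0.1 0.9
```

Note that the KL divergence as written assumes strictly positive degrees; implementations typically add a small epsilon to avoid division by zero or log of zero.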

Experimental Setting
The proposed ProLSFEO-LDL algorithm is first applied, in a pre-processing step, to the raw datasets, obtaining a subset of training instances and a matrix of label-specific features. The subset of selected instances makes up the new pre-processed training set that the learner will train with. As mentioned before, the selected learner is AA-kNN. Regarding the label-specific features, we have adapted the AA-kNN algorithm to support feature selection: each label is predicted separately, using only the features marked as selected. The prediction obtained is normalized to make it compatible with the LDL constraints.
The results are compared with those obtained by the same learner without applying data reduction as a pre-processing step. All measures were computed over a merged set of the test predictions using a wrapped 10-fold cross-validation (10-fcv). Note that this procedure covers the entire process and differs from the cross-validation used to estimate the fitness function within the optimization process.
The parameters used in experiments are summarized in Table 3. The AA-kNN algorithm used in the fitness function of ProLSFEO-LDL method and in the subsequent learner requires a value for k, set to four neighbors. We have selected a small value for k in order to provide the most flexible fit with a low bias.

Results and Analysis
This section presents the results of the empirical studies and their analyses. We will compare the results obtained by AA-kNN with the results obtained by applying the pre-processing step ProLSFEO-LDL. In Table 4 the best outcome for each dataset and measure is highlighted in bold. The last row is the average aggregation result of each column. The best average is also highlighted in bold. As those algorithms have been tested using 10-fcv, the performance is represented using "mean±standard deviation".
The Wilcoxon test and Bayesian Sign test [33,34] are used to validate the results of the empirical comparisons. In the Bayesian Sign test, a distribution of the differences of the results achieved using methods L (AA-kNN) and R (ProLSFEO-LDL) is computed into a graphical space divided into three regions: left, rope and right. The location of most of the distribution in these sectors indicates the final decision: the superiority of algorithm L, statistical equivalence and the superiority of algorithm R, respectively. KEEL package [69] has been used to compute the Wilcoxon test and the R package rNPBST [70] was used to extract the graphical representations of the Bayesian Sign tests analyzed in the following empirical studies. The Rope limit parameter used to represent the Bayesian Sign test is 0.0001.
The outcome of both statistical tests applied to our method is represented in Table 5 (Wilcoxon test) and Figure 2 (Bayesian Sign test).
Comparing ProLSFEO-LDL with the standard LDL learner AA-kNN, we reach the following conclusions:

• The results of the different measures shown in Table 4 highlight the better ranking of ProLSFEO-LDL in the large majority of the datasets and measures.

• The Wilcoxon Signed Ranks test corroborates the significance of the differences between our approach and AA-kNN. As we can see in Table 5, all the hypotheses of equivalence are rejected with small p-values.

• With regard to the Bayesian Sign test, Figure 2 graphically represents the statistical significance in terms of precision between ProLSFEO-LDL and AA-kNN. The heat-maps clearly indicate the significant superiority of ProLSFEO-LDL, as the computed distributions are always located in the right region.
Another interesting aspect to analyze is the reduction ratio obtained by the proposed method. In Figure 3 we show the percentage of selected prototypes and features with respect to the initial training set. In the case of features, we represent the average percentage, since each output label can be described by a different number of features.
The average percentage is around 53% for both prototypes and features, varying most noticeably for the SJAFFE and SBU_3DFE datasets, where the reduction ratio decreases significantly. This percentage of data reduction has a huge impact on the performance of the learner: in instance-based methods like AA-kNN, handling about half of the prototypes and features leads to a reduction of the computation time in the prediction phase.

Conclusions
In this paper, we proposed a novel data reduction algorithm adapted to LDL constraints. It simultaneously addresses prototype selection and label-specific feature selection with two objectives: finding an optimal subset of samples to improve the performance of the AA-kNN learner, and selecting a subset of specific characteristics for each one of the output labels. Both tasks have been addressed as search problems using an evolutionary algorithm, based on CHC, to optimize the solution.
In order to verify the effectiveness of the designed solution, ProLSFEO-LDL has been applied to several real-world LDL datasets, showing significant improvements compared to the use of the raw training set. The results of the different measures highlight the better ranking of ProLSFEO-LDL, outcomes subsequently corroborated by statistical tests. In addition, the percentage of data reduction achieved leads to a significant improvement in prediction time.
In future studies we may introduce further comparisons with already existing approaches and also make improvements to the presented proposal, such as:
• It might be interesting to compare the results obtained in this study with the following techniques: LDLSF [31] and binary-coding-based LDL [40].
• Complement the experimental analysis by dealing with larger datasets (Big Data). To this end, we will complement the current proposal with some of the big data reduction techniques presented in [71].

• The experimental settings use the AA-kNN learner to measure the quality of the solution applied over the pre-processed dataset. Other, more powerful LDL learners could be considered for this task, but they must previously be adapted to support label-specific feature selection. With this, we will be able to carry out more adequate comparisons with state-of-the-art LDL methods like LDL Forests, proposed in [17], or StructRF [20].

• Another interesting study that we will undertake is how the presence of noise on the label side can affect performance. In real scenarios, data gathering is often an automatic procedure that can lead to incorrect sample labelling. It would be interesting to inject artificial noise into the training labels in order to check the robustness of the implemented approach.