Improving Multi-Instance Multi-Label Learning by Extreme Learning Machine

Multi-instance multi-label learning is a learning framework where every object is represented by a bag of instances and associated with multiple labels simultaneously. The existing degeneration strategy-based methods often suffer from two common drawbacks: (1) the user-specified parameter for the number of clusters may affect the effectiveness of the learned model; (2) SVM may bring a high computational cost when utilized as the classifier builder. In this paper, we propose an algorithm, namely multi-instance multi-label (MIML)-extreme learning machine (ELM), to address these problems. To the best of our knowledge, we are the first to utilize ELM for the MIML problem and to compare ELM and SVM on MIML. Extensive experiments have been conducted on real and synthetic datasets. The results show that MIMLELM tends to achieve better generalization performance at a higher learning speed.


Introduction
When utilizing machine learning to solve practical problems, we often represent an object as a feature vector, thereby obtaining an instance of the object. Further, associating the instance with a specific class label of the object, we obtain an example. Given a large collection of examples, the task is to learn a function mapping from the instance space to the label space, with the expectation that the learned function can predict the labels of unseen instances correctly. However, in some applications, a real-world object is often ambiguous: it consists of multiple instances and corresponds to multiple different labels simultaneously.
For example, an image usually contains multiple patches, each represented by an instance, while in image classification, such an image can belong to several classes simultaneously, e.g., an image can belong to mountains as well as Africa [1]. Another example is text categorization [1], where a document usually contains multiple sections, each of which can be represented as an instance, and the document can be regarded as belonging to different categories if viewed from different aspects, e.g., a document can be categorized as a scientific novel, Jules Verne's writing or even a book on traveling. The MIML (multi-instance multi-label) problem also arises in the protein function prediction task [2]. A domain is a distinct functional and structural unit of a protein. A multi-functional protein often consists of several domains, each fulfilling its own function independently. Taking a protein as an object, a domain as an instance and each biological function as a label, the protein function prediction problem exactly matches the MIML learning task.
In this context, multi-instance multi-label learning was proposed [1]. Similar to the other two multi-learning frameworks, i.e., multi-instance learning (MIL) [3] and multi-label learning (MLL) [4], the MIML learning framework also results from the ambiguity in representing real-world objects. However, MIML is more difficult than the other two frameworks: it studies the ambiguity in terms of both the input space (i.e., instance space) and the output space (i.e., label space), while MIL only studies the ambiguity in the input space and MLL only that in the output space. In [1], Zhou et al. proposed a degeneration strategy-based framework for MIML, which consists of two phases. First, the MIML problem is degenerated into the single-instance multi-label (SIML) problem through a specific clustering process; second, the SIML problem is decomposed into multiple independent binary classification (i.e., single-instance single-label) problems using the support vector machine (SVM) as the classifier builder. This two-phase framework has been successfully applied to many real-world applications and has been shown to be effective [5]. However, it could be further improved if the following drawbacks were tackled. On the one hand, the clustering process in the first phase requires a user-specified parameter for the number of clusters. Unfortunately, it is often difficult to determine the correct number of clusters in advance, and an incorrect number of clusters may hurt the accuracy of the learning algorithm. On the other hand, SIML is degenerated into single-instance single-label learning (SISL) in the second phase, which increases the volume of data to be handled and thus burdens the classifier building. Utilizing SVM as the classifier builder in this phase may suffer from a high computational cost and requires a number of parameters to be optimized.
In this paper, we propose to enhance the two-phase framework by tackling the two issues above and make the following contributions: (1) We utilize the extreme learning machine (ELM) [6] instead of SVM to improve the efficiency of the two-phase framework. To the best of our knowledge, we are the first to utilize ELM for the MIML problem and to compare ELM and SVM on MIML. (2) We design a method with a theoretical guarantee to determine the number of clusters automatically and incorporate it into the improved two-phase framework for effectiveness.
The remainder of this paper is organized as follows. In Section 2, we give a brief introduction to MIML and ELM. Section 3 details the improvements of the two-phase framework. Experimental analysis is given in Section 4. Finally, Section 5 concludes this paper.

The Preliminaries
This research is related to some previous work on MIML learning and ELM.In what follows, we briefly review some preliminaries of the two related works in Sections 2.1 and 2.2, respectively.

Multi-Instance Multi-Label Learning
In traditional supervised learning, the relationship between an object, its description and its label is a one-to-one correspondence. That is, an object is represented by a single instance and associated with a single class label. In this sense, we refer to it as single-instance single-label learning (SISL). Formally, let X be the instance space (or feature space) and Y the set of class labels. The goal of SISL is to learn a function f_SISL: X → Y from a given dataset {(x_1, y_1), (x_2, y_2), ..., (x_m, y_m)}, where x_i ∈ X is an instance and y_i ∈ Y is the label of x_i. This formalization is prevailing and successful. However, as mentioned in Section 1, many real-world objects are complicated and ambiguous in their semantics. Representing these ambiguous objects with SISL may lose important information and make the learning task problematic [1]. Thus, many real-world complicated objects do not fit this framework well.
In order to deal with this problem, several multi-learning frameworks have been proposed, e.g., multi-instance learning (MIL), multi-label learning (MLL) and multi-instance multi-label learning (MIML). MIL studies the problem where a real-world object described by a number of instances is associated with a single class label. The training set for MIL is composed of many bags, each containing multiple instances. In particular, a bag is labeled positive if it contains at least one positive instance and negative otherwise. The goal is to label unseen bags correctly. Note that although the training bags are labeled, the labels of their instances are unknown. This learning framework was formalized by Dietterich et al. [3] when they were investigating drug activity prediction. Formally, let X be the instance space (or feature space) and Y the set of class labels. The task of MIL is to learn a function f_MIL: 2^X → {−1, +1} from a given dataset {(X_1, y_1), (X_2, y_2), ..., (X_m, y_m)}, where X_i ⊆ X is a set of instances {x_1, x_2, ..., x_{n_i}} (x_j ∈ X, j = 1, 2, ..., n_i) and y_i ∈ {−1, +1} is the label of X_i. Multi-instance learning techniques have been successfully applied to diverse applications, including image categorization [7,8], image retrieval [9,10], text categorization [11,12], web mining [13], spam detection [14], face detection [15], computer-aided medical diagnosis [16], etc. Differently, MLL studies the problem where a real-world object is described by one instance but associated with a number of class labels. The goal is to learn a function f_MLL: X → 2^Y from a given dataset {(x_1, Y_1), (x_2, Y_2), ..., (x_m, Y_m)}, where x_i ∈ X is an instance and Y_i ⊆ Y a set of labels {y_1, y_2, ..., y_{l_i}} (y_j ∈ Y, j = 1, 2, ..., l_i). The existing work on MLL falls into two major categories. One attempts to divide multi-label learning into a number of two-class classification problems [17,18] or to transform it into a label ranking problem [19,20]; the other tries to exploit the correlation between the labels [21,22]. MLL has been found useful in many tasks, such as text categorization [23], scene classification [24], image and video annotation [25,26], bioinformatics [27,28] and even association rule mining [29,30].
MIML is a generalization of traditional supervised learning, multi-instance learning and multi-label learning, where a real-world object may be associated with a number of instances and a number of labels simultaneously. In some cases, transforming single-instance multi-label objects into MIML objects for learning may be beneficial. Before the explanation, we first introduce how to perform such a transformation. Let S = {(x_1, Y_1), (x_2, Y_2), ..., (x_m, Y_m)} be the dataset, where x_i ∈ X is an instance and Y_i ⊆ Y a set of labels. We can first obtain a vector v_l for each class label l ∈ Y by averaging all of the training instances of label l, i.e., v_l = (1/n_l) Σ_{(x_i, Y_i): l ∈ Y_i} x_i, where n_l is the number of training instances with label l. The benefits of such a transformation are intuitive. First, for an object associated with multiple class labels, if it is described by only a single instance, the information corresponding to these labels is mixed and thus difficult to learn. However, by breaking the single instance into a number of instances, each corresponding to one label, the structure information collapsed in the single-instance representation may become easier to exploit. Second, for each label, the number of training instances can be significantly increased. Moreover, when representing the multi-label object using a set of instances, the relation between the input patterns and the semantic meanings may become more easily discoverable. In some cases, understanding why a particular object has a certain class label is even more important than simply making an accurate prediction, and MIML offers a possibility for this purpose. For example, using MIML, we may discover that one object has label l_1 because it contains instance_n; it has label l_k because it contains instance_i; while the occurrence of both instance_1 and instance_i triggers label l_j. Formally, the task of MIML is to learn a function f_MIML: 2^X → 2^Y from a given dataset {(X_1, Y_1), (X_2, Y_2), ..., (X_m, Y_m)}, where X_i ⊆ X is a set of instances and Y_i ⊆ Y a set of labels.

A Brief Introduction to ELM
Extreme learning machine (ELM) is a generalized single hidden-layer feedforward network. In ELM, the hidden-layer node parameters are mathematically calculated instead of being iteratively tuned; thus, it provides good generalization performance at a learning speed thousands of times faster than traditional popular learning algorithms for feedforward neural networks [31].
As a powerful classification model, ELM has been widely applied in many fields.For example, in [32], ELM was applied for plain text classification by using the one-against-one (OAO) and one-against-all (OAA) decomposition scheme.In [31], an ELM-based XML document classification framework was proposed to improve classification accuracy by exploiting two different voting strategies.A protein secondary prediction framework based on ELM was proposed in [33] to provide good performance at extremely high speed.The work in [34] implemented the protein-protein interaction prediction on multi-chain sets and on single-chain sets using ELM and SVM for a comparable study.In both cases, ELM tends to obtain higher recall values than SVM and shows a remarkable advantage in computational speed.The work in [35] evaluated the multi-category classification performance of ELM on three microarray datasets.The results indicate that ELM produces comparable or better classification accuracies with reduced training time and implementation complexity compared to artificial neural network methods and support vector machine methods.In [36], the use of ELM for multiresolution access of terrain height information was proposed.The optimization method-based ELM for classification was studied in [37].
ELM not only tends to reach the smallest training error, but also the smallest norm of weights [6]. Given a training set D = {(x_i, t_i) | x_i ∈ R^n, t_i ∈ R^m, i = 1, ..., N}, an activation function g(x) and a hidden node number L, the pseudocode of ELM is given in Algorithm 1. More detailed introductions to ELM can be found in a series of published literature [6,37,38].
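The core of the procedure in Algorithm 1 is short enough to sketch directly: the input weights and biases are drawn at random, and the output weights are obtained analytically via the Moore-Penrose pseudoinverse. The following Python snippet is our own minimal illustration (not the paper's MATLAB implementation; function names are ours), assuming a sigmoid activation g(x):

```python
import numpy as np

def elm_train(X, T, L, seed=0):
    """Train a basic ELM: random hidden layer, analytic output weights."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    W = rng.uniform(-1.0, 1.0, size=(n_features, L))   # random input weights
    b = rng.uniform(-1.0, 1.0, size=L)                 # random hidden biases
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))             # hidden-layer output, g = sigmoid
    beta = np.linalg.pinv(H) @ T                       # Moore-Penrose pseudoinverse solution
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta
```

Because beta is computed in one shot rather than tuned iteratively, training cost is dominated by a single pseudoinverse of the N x L hidden-layer matrix, which is the source of ELM's speed advantage discussed above.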

The Proposed Approach MIMLELM
MIMLSVM is a representative two-phase MIML algorithm successfully applied in many real-world tasks [2]. It was first proposed by Zhou et al. in [1] and recently improved by Li et al. in [5]. MIMLSVM solves the MIML problem by first degenerating it into a single-instance multi-label problem through a specific clustering process and then decomposing the learning of multiple labels into a series of binary classification tasks using SVM. However, as mentioned, MIMLSVM may suffer from drawbacks in either of the two phases: in the first phase, the user-specified parameter for the number of clusters may affect the effectiveness of the learned model; in the second phase, utilizing SVM as the classifier builder may bring a high computational cost and require a great number of parameters to be optimized.
(Algorithm 1 — Input: DB: dataset; HN: number of hidden layer nodes; AF: activation function; Output: the trained ELM model.)

In this paper, we present another algorithm, namely MIMLELM, to make MIMLSVM more efficient and effective. In the proposed method: (1) We utilize ELM instead of SVM to improve the efficiency of the two-phase framework. To the best of our knowledge, we are the first to utilize ELM for the MIML problem and to compare ELM and SVM on MIML. (2) We develop a method with a theoretical guarantee to determine the number of clusters automatically, so that the transformation from MIML to SIML is more effective. (3) We exploit a genetic algorithm-based ELM ensemble to further improve the prediction performance.

Determination of the Number of Clusters
The first task for MIMLELM is to transform MIML into SIML. Unlike MIMLSVM, which performs the transformation through a clustering process with a user-specified parameter for the number of clusters, we utilize AIC [39], a model selection criterion, to determine the number of clusters automatically.
AIC is founded on information theory. It offers a relative estimate of the information lost when a given model is used to represent the process that generates the data. For any statistical model, the general form of AIC is AIC = −2 ln(L) + 2K, where L is the maximized value of the likelihood function for the model and K is the number of parameters in the model. Given a set of candidate models, the one with the minimum AIC value is preferred [39].
Let M_k be the model of the clustering result with k clusters C_1, C_2, ..., C_k, where the number of samples in C_i is m_i. Let X_i denote a random variable indicating the PD value between any pair of micro-clusters in C_i. Then, under a general assumption commonly used in the clustering community, X_i follows a Gaussian distribution with parameters (µ_i, σ_i²), where µ_i is the expected PD value between any pair of micro-clusters in C_i and σ_i² is the corresponding variance. That is, the probability density of X_i is:

p(x) = (1 / (√(2π) σ_i)) exp(−(x − µ_i)² / (2σ_i²))    (1)

Let x_ij (1 ≤ j ≤ C²_{m_i}) be an observation of X_i, where C²_{m_i} = m_i(m_i − 1)/2 is the number of pairs; the corresponding log-likelihood w.r.t. the data in C_i is:

ln L_i = −(C²_{m_i}/2) ln(2πσ_i²) − Σ_j (x_ij − µ_i)² / (2σ_i²)    (2)

Since the log-likelihood for all clusters is the sum of the log-likelihoods of the individual clusters, the log-likelihood of the data w.r.t. M_k is:

ln L(M_k) = Σ_{i=1}^{k} ln L_i    (3)

Further, taking the MLE (maximum likelihood estimate) of σ_i², i.e.:

σ̂_i² = (1/C²_{m_i}) Σ_j (x_ij − µ̂_i)²    (4)

and substituting it into Equation (3), we obtain:

ln L(M_k) = −Σ_{i=1}^{k} (C²_{m_i}/2) (ln(2πσ̂_i²) + 1)    (5)

Finally, in our case, the number of independent parameters K is 2k. Thus, the AIC of the model M_k is:

AIC(M_k) = Σ_{i=1}^{k} C²_{m_i} (ln(2πσ̂_i²) + 1) + 4k    (6)
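The computation of Equation (6) can be sketched as follows, assuming we already have, for each candidate clustering, the per-cluster lists of pairwise PD values (the helper name cluster_aic is ours; the variance guard for degenerate clusters is an added assumption):

```python
import numpy as np

def cluster_aic(pairwise_dists_per_cluster):
    """AIC of a clustering model M_k under the Gaussian assumption:
    AIC = -2 ln L + 2K with K = 2k (one mean and one variance per cluster)."""
    k = len(pairwise_dists_per_cluster)
    neg2_log_lik = 0.0
    for d in pairwise_dists_per_cluster:
        d = np.asarray(d, dtype=float)
        var = max(d.var(), 1e-12)  # MLE of sigma_i^2, guarded against zero
        # each cluster contributes n_i * (ln(2*pi*var) + 1) to -2 ln L
        neg2_log_lik += len(d) * (np.log(2 * np.pi * var) + 1.0)
    return neg2_log_lik + 2 * (2 * k)
```

Choosing k then amounts to evaluating cluster_aic for each candidate clustering and keeping the one with the minimum value, consistent with the AIC criterion above.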

Transformation from MIML to SIML
With the number of clusters computed, we start to transform the MIML learning task, i.e., learning a function f_MIML: 2^X → 2^Y, into a multi-label learning task, i.e., learning a function f_MLL: Z → 2^Y. Given an MIML training example (X_i, Y_i), the goal of this step is to get a mapping function z_i = φ(X_i), where φ: 2^X → Z, such that for any X_i, f_MIML(X_i) = f_MLL(φ(X_i)). As such, the proper labels of a new example X_k can be determined according to Y_k = f_MLL(φ(X_k)). Since the proper number of clusters has been automatically determined in Section 3.1, we implement the mapping function φ(·) by performing the following k-medoids clustering process. Initially, each MIML example (X_u, Y_u) (u = 1, 2, ..., m) is collected and put into a dataset Γ (Line 1). Then, a k-medoids clustering method is performed. In this process, we first randomly select k elements from Γ to initialize the k medoids M_t (t = 1, 2, ..., k). Note that, instead of being a user-specified parameter, k is automatically determined by Equation (6) in Section 3.1. Since each data item in Γ, i.e., X_u, is an unlabeled multi-instance bag instead of a single instance, we employ the Hausdorff distance [40] to measure the distance between two different multi-instance bags.
The Hausdorff distance is a famous metric for measuring the distance between two bags of points, which has often been used in computer vision tasks.
In detail, given two bags A = {a_1, a_2, ..., a_{n_A}} and B = {b_1, b_2, ..., b_{n_B}}, the Hausdorff distance d_H between A and B is defined as:

d_H(A, B) = max{ max_{a∈A} min_{b∈B} ||a − b||, max_{b∈B} min_{a∈A} ||a − b|| }

where ||a − b|| is the distance between the instances a and b. With the help of the medoids, every original multi-instance example X_u can be transformed into a k-dimensional numerical vector z_u, where the i-th component of z_u (i = 1, 2, ..., k) is the distance between X_u and the i-th medoid, i.e., d_H(X_u, M_i). In this way, each example X_u (u = 1, 2, ..., m) is represented by its structure information, i.e., the relationship between X_u and the k medoids. Figure 2 is an illustration of this transformation, where the dataset Γ is divided into three clusters, and thus, any MIML example X_u is represented as a three-dimensional numerical vector. After this process, we obtain the mapping function z_u = φ(X_u).
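The Hausdorff distance and the medoid-based mapping φ(·) can be rendered directly in code. This is our own sketch (Euclidean distance between instances is an assumption, as are the function names):

```python
import numpy as np

def hausdorff(A, B):
    """Hausdorff distance between two bags of instance vectors (rows)."""
    # pairwise Euclidean distances, shape (n_A, n_B)
    D = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    # max over A of nearest-in-B, and max over B of nearest-in-A
    return max(D.min(axis=1).max(), D.min(axis=0).max())

def bag_to_vector(X_u, medoids):
    """Map a bag to the k-dimensional vector of distances to the k medoids."""
    return np.array([hausdorff(X_u, M) for M in medoids])
```

Given the k medoid bags from the clustering step, calling bag_to_vector on every bag in Γ yields the single-instance representation z_u used by the multi-label learner.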

Transformation from SIML to SISL
After transforming the MIML examples (X_i, Y_i) into the SIML examples (z_i, Y_i), i = 1, 2, ..., m, the SIML learning task can be further transformed into a traditional supervised learning task (SISL), i.e., learning a function f_SISL: Z × Y → {−1, +1}. For this goal, we implement the transformation from SIML to SISL in such a way that for any y ∈ Y, f_SISL(z_i, y) = +1 if y ∈ Y_i and −1 otherwise. That is, the proper labels of z_i are {y | f_SISL(z_i, y) = +1}.
Figure 3 gives a simple illustration of this transformation. In a multi-label dataset, some instances have more than one class label, which makes it hard to train classifiers directly over the multi-label data. An intuitive solution to this problem is to use every multi-label example more than once during training. This is rational because every SIML example can be considered as a set of SISL examples, each with the same instance but a different label. Concretely, each SIML example is taken as a positive SISL example of all of the classes to which it belongs. As shown in Figure 3, every circle represents an SIML example. In particular, each example in area A has two class labels, "○" and "×", while the other examples have either the "○" label or the "×" label. According to the transformation from SIML to SISL mentioned above, an SIML example in area A, say (X_u, {○, ×}), should be transformed into two SISL examples, (X_u, ○) and (X_u, ×). Consequently, when training the "○" model, (X_u, {○, ×}) is considered as (X_u, ○); otherwise, it is considered as (X_u, ×). In this way, the SIML examples in area A are ensured to be used as positive examples in both classes "○" and "×". This method makes fuller use of the data and brings the experimental result closer to the true one.
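The transformation above is the standard binary-relevance decomposition, which can be sketched in a few lines (the helper name siml_to_sisl is ours, not the authors'):

```python
def siml_to_sisl(examples, labels):
    """For each label y, an SIML example (z_i, Y_i) yields the SISL example
    (z_i, +1) if y is in Y_i, and (z_i, -1) otherwise."""
    per_label = {y: [] for y in labels}
    for z, Y in examples:
        for y in labels:
            per_label[y].append((z, +1 if y in Y else -1))
    return per_label
```

Each entry of the returned dictionary is a plain binary training set, so one ELM (or SVM) can be trained per label, exactly as the decomposition in the text prescribes.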

ELM Ensemble Based on GA
So far, we have decomposed the MIML problem into the SISL problem using SIML as the bridge. Since an MIML example often has more than two class labels, the corresponding SISL problem is naturally a multi-class problem.
Two commonly-used methods for multi-class classification are one-against-all (OAA) and one-against-one (OAO) [41].For the N-class problem, OAA builds N binary classifiers, one for each class separating the class from the others.Instead, the OAO strategy involves N(N − 1)/2 binary classifiers.Each classifier is trained to separate each pair of classes.After all N(N − 1)/2 classifiers are trained, a voting strategy is used to make the final decision.However, a common drawback of the two strategies is that they both consider every trained classifier equally important, although the real performance may vary over different classifiers.
An ensemble classifier was proposed as an effective method to address the above problem.The output of an ensemble is a weighted average of the outputs of several classifiers, where the weights should be high for those classifiers performing well and low for those whose outputs are not reliable.However, finding the optimum weights is an optimization problem that is hard to exactly solve, especially when the objective functions do not have "nice" properties, such as continuity, differentiability, etc.In what follows, we utilize a genetic algorithm (GA)-based method to find the appropriate weights for each classifier.
The genetic algorithm [42] is a randomized search and optimization technique. In GA, the parameters of the search space are encoded in the form of strings called chromosomes. A collection of chromosomes is called a population. Initially, a random population is created. A fitness function associated with each string represents the degree of goodness of the string. Biologically-inspired operators, such as selection, crossover and mutation, are applied iteratively for a fixed number of generations or until a termination condition is satisfied.

Fitness Function
Given a training instance x, let d(x) be the expected output of x and o_i(x) the actual output of the i-th individual ELM. Moreover, let V be the validation set and w = [w_1, w_2, ..., w_N] a possible weight assignment, i.e., the chromosome of an individual in the evolving population. According to [43], the estimated generalization error of the ELM ensemble corresponding to w is:

E_w^V = (1/|V|) Σ_{x∈V} (ô(x) − d(x))²

where:

ô(x) = Σ_{i=1}^{N} w_i o_i(x)

It is obvious that E_w^V expresses the goodness of w: the smaller E_w^V is, the better w is. Thus, we use f(w) = 1/E_w^V as the fitness function.
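Under these definitions, the fitness evaluation might be sketched as follows (our own illustration; here outputs is assumed to hold one row of validation outputs per classifier, and the small eps guard against a zero error is an added assumption):

```python
import numpy as np

def ensemble_error(w, outputs, targets):
    """Estimated generalization error of the weighted ensemble on V:
    E_w^V = (1/|V|) * sum_x (sum_i w_i * o_i(x) - d(x))^2."""
    w = np.asarray(w, dtype=float)
    w = w / w.sum()                  # keep the weights normalized
    combined = outputs.T @ w         # outputs: (N_classifiers, |V|)
    return np.mean((combined - targets) ** 2)

def fitness(w, outputs, targets, eps=1e-12):
    """f(w) = 1 / E_w^V; eps avoids division by zero for a perfect ensemble."""
    return 1.0 / (ensemble_error(w, outputs, targets) + eps)
```

A weight vector that leans on the accurate classifiers yields a smaller E_w^V and hence a larger fitness, which is what the GA search below exploits.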

Selection
During each successive generation, a certain selection method is needed to rate the fitness of each solution and preferentially select the best solution. In this paper, we use roulette wheel selection, in which the fitness function associated with each chromosome is used to assign a probability of selection to each individual chromosome. If f_i is the fitness of individual i in the population, the probability of i being selected is:

p_i = f_i / Σ_{j=1}^{n} f_j

where n is the number of individuals in the population. In this way, chromosomes with higher fitness values are less likely to be eliminated, but there is still a chance that they may be.
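Roulette wheel selection can be sketched in a few lines (a sketch of the proportional-selection rule stated above; the function name is ours):

```python
import numpy as np

def roulette_select(fitnesses, rng):
    """Select one index with probability p_i = f_i / sum_j f_j."""
    f = np.asarray(fitnesses, dtype=float)
    p = f / f.sum()
    return rng.choice(len(f), p=p)
```

Sampling with these probabilities preserves the chance of keeping weaker chromosomes, which maintains diversity in the population.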

Crossover
We use normal single-point crossover. A crossover point is selected randomly between one and l (the length of the chromosome). Crossover probabilities are computed as in [44]. Let f_max be the maximum fitness value of the current population, f̄ the average fitness value of the population and f' the larger of the fitness values of the two solutions to be crossed. Then, the probability of crossover, µ_c, is calculated as:

µ_c = k_1 (f_max − f') / (f_max − f̄), if f' ≥ f̄;   µ_c = k_3, otherwise,

where the values of k_1 and k_3 are kept equal to 1.0 as in [44]. Note that when f_max = f̄, then f' = f_max, and µ_c is set equal to k_3. The aim behind this adaptation is to achieve a trade-off between exploration and exploitation. The value of µ_c is increased when the better of the two chromosomes to be crossed is itself quite poor. In contrast, when it is a good solution, µ_c is low, so as to reduce the likelihood of disrupting a good solution by crossover.

Mutation
Each chromosome undergoes mutation with a probability µ_m. The mutation probability is also selected adaptively for each chromosome as in [44]. That is, µ_m is given by:

µ_m = k_2 (f_max − f) / (f_max − f̄), if f ≥ f̄;   µ_m = k_4, otherwise,

where f is the fitness value of the chromosome and the values of k_2 and k_4 are kept equal to 0.5. Each position in a chromosome is mutated with probability µ_m in the following way. The value is replaced with a random variable drawn from a Laplacian distribution, p(ε) ∝ e^{−|ε − µ|/δ}, where the scaling factor δ sets the magnitude of the perturbation and µ is the value at the position to be perturbed. The scaling factor δ is chosen equal to 0.1. The old value at the position is replaced with the newly-generated value. By generating a random variable using a Laplacian distribution, there is a nonzero probability of generating any valid position from any other valid position, while the probability of generating a value near the old value is greater.
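The adaptive probabilities can be sketched as follows (our reading of the scheme in [44]; function names are ours, and the guard for f_max = f̄ follows the note above):

```python
def adaptive_pc(f_prime, f_max, f_avg, k1=1.0, k3=1.0):
    """Adaptive crossover probability: high for poor parents, low for good ones."""
    if f_prime >= f_avg and f_max > f_avg:
        return k1 * (f_max - f_prime) / (f_max - f_avg)
    return k3  # below-average parent, or f_max == f_avg

def adaptive_pm(f, f_max, f_avg, k2=0.5, k4=0.5):
    """Adaptive mutation probability for a chromosome of fitness f."""
    if f >= f_avg and f_max > f_avg:
        return k2 * (f_max - f) / (f_max - f_avg)
    return k4
```

Note how the best chromosome (f' = f_max) gets crossover probability zero, so top solutions are protected, while below-average chromosomes are crossed and mutated at the full rates k_3 and k_4.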
The above process of fitness computation, selection, crossover and mutation is executed for a maximum number of generations. The best chromosome seen up to the last generation provides the solution to the weighted classifier ensemble problem. Note that the sum of the weights w_i should be kept constant during the evolution. Therefore, it is necessary to normalize the evolved w. Thus, we use a simple normalization scheme that replaces w_i with w_i / Σ_{i=1}^{N} w_i in each generation.

Performance Evaluation
In this section, we study the performance of the proposed MIMLELM algorithm in terms of both efficiency and effectiveness.The experiments are conducted on an HP PC (Lenovo, Shenyang, China) with 2.33 GHz Intel Core 2 CPU, 2 GB main memory running Windows 7, and all algorithms are implemented in MATLAB 2013.Both real and synthetic datasets are used in the experiments.

Datasets
Four real datasets are utilized in our experiments. The first dataset is Image [1], which comprises 2000 natural scene images and five classes. The percentage of images belonging to more than one class is over 22%. On average, each image has 1.24 ± 0.46 class labels and 1.36 ± 0.54 instances. The second dataset is Text [22], which contains 2000 documents and seven classes. The percentage of documents with multiple labels is 15%. On average, each document has 1.15 ± 0.37 class labels and 1.64 ± 0.73 instances. The third and fourth datasets are from two bacteria genomes, i.e., Geobacter sulfurreducens and Azotobacter vinelandii [2], respectively. In the two datasets, each protein is represented as a bag of domains and labeled with a group of GO (Gene Ontology) molecular function terms. In detail, there are 397 proteins in Geobacter sulfurreducens with a total of 320 molecular function terms. The average number of instances per protein (bag) is 3.20 ± 1.21, and the average number of labels per protein is 3.14 ± 3.33. The Azotobacter vinelandii dataset has 407 proteins with a total of 320 molecular function terms. The average number of instances per protein (bag) is 3.07 ± 1.16, and the average number of labels per protein is 4.00 ± 6.97. Table 1 gives the summarized characteristics of the four datasets, where std. is the abbreviation of standard deviation.

Evaluation Criteria
In multi-label learning, each object may have several labels simultaneously, so the commonly-used evaluation criteria, such as accuracy, precision and recall, are not suitable. In this paper, four popular multi-label evaluation criteria, i.e., one-error (OE), coverage (Co), ranking loss (RL) and average precision (AP), are used to measure the performance of the proposed algorithm. Given a test dataset S = {(X_1, Y_1), (X_2, Y_2), ..., (X_p, Y_p)}, the four criteria are defined as below, where h(X_i) returns a set of proper labels of X_i, h(X_i, y) returns a real value indicating the confidence for y to be a proper label of X_i and rank_h(X_i, y) returns the rank of y derived from h(X_i, y).
one-error_S(h) = (1/p) Σ_{i=1}^{p} [[ argmax_{y∈Y} h(X_i, y) ∉ Y_i ]]

The one-error evaluates how many times the top-ranked label is not a proper label of the object. The performance is perfect when one-error_S(h) = 0; the smaller the value of one-error_S(h), the better the performance of h.

coverage_S(h) = (1/p) Σ_{i=1}^{p} max_{y∈Y_i} rank_h(X_i, y) − 1

The coverage evaluates how far it is needed, on average, to go down the list of labels in order to cover all of the proper labels of the object. It is loosely related to precision at the level of perfect recall. The smaller the value of coverage_S(h), the better the performance of h.

rloss_S(h) = (1/p) Σ_{i=1}^{p} (1 / (|Y_i||Ȳ_i|)) |{(y_1, y_2) ∈ Y_i × Ȳ_i : h(X_i, y_1) ≤ h(X_i, y_2)}|

where Ȳ_i denotes the complementary set of Y_i in Y. The ranking loss evaluates the average fraction of label pairs that are misordered for the object. The performance is perfect when rloss_S(h) = 0; the smaller the value of rloss_S(h), the better the performance of h.

avgprec_S(h) = (1/p) Σ_{i=1}^{p} (1/|Y_i|) Σ_{y∈Y_i} |{y' ∈ Y_i : rank_h(X_i, y') ≤ rank_h(X_i, y)}| / rank_h(X_i, y)

The average precision evaluates the average fraction of proper labels ranked above a particular label y ∈ Y_i. The performance is perfect when avgprec_S(h) = 1; the larger the value of avgprec_S(h), the better the performance of h.
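Three of the four criteria can be sketched for per-example score dictionaries h(X_i, ·) as follows (a simplified illustration of the standard definitions; tie-breaking in the ranking is left naive, and the data layout is our assumption):

```python
import numpy as np

def rank_of(scores, y):
    """Rank of label y (1 = highest confidence) under h(X_i, .)."""
    order = sorted(scores, key=scores.get, reverse=True)
    return order.index(y) + 1

def one_error(S):
    """Fraction of examples whose top-ranked label is not a proper label."""
    return np.mean([max(s, key=s.get) not in Y for s, Y in S])

def coverage(S):
    """Average depth needed in the ranking to cover all proper labels, minus one."""
    return np.mean([max(rank_of(s, y) for y in Y) - 1 for s, Y in S])

def avg_precision(S):
    """Average fraction of proper labels ranked at or above each proper label."""
    vals = []
    for s, Y in S:
        precs = []
        for y in Y:
            r = rank_of(s, y)
            above = sum(1 for yp in Y if rank_of(s, yp) <= r)
            precs.append(above / r)
        vals.append(np.mean(precs))
    return float(np.mean(vals))
```

Here S is a list of (scores, proper_labels) pairs; a perfect ranker gives one_error of 0 and avg_precision of 1, matching the definitions above.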

Effectiveness
In this set of experiments, we study the effectiveness of the proposed MIMLELM on the four real datasets. The four criteria mentioned in Section 4.2 are utilized for performance evaluation. In particular, MIMLSVM+ [5], one of the state-of-the-art algorithms for learning with multi-instance multi-label examples, is utilized as the competitor. The MIMLSVM+ (advanced multi-instance multi-label with support vector machine) algorithm is implemented with a Gaussian kernel, with the penalty factor cost set from 10^−3, 10^−2, ..., to 10^3. MIMLELM (multi-instance multi-label with extreme learning machine) is implemented with the number of hidden layer nodes set to 100, 200 and 300, respectively. Specifically, for a fair performance comparison, we modified MIMLSVM+ to include the automatic method for k and the genetic algorithm-based weight assignment. On each dataset, the data are randomly partitioned into a training set and a test set according to a ratio of about 1:1. The training set is used to build a predictive model, and the test set is used to evaluate its performance.
Experiments are repeated for thirty runs using random training/test partitions, and the average results are reported in Tables 2-5, where the best performance on each criterion is highlighted in boldface, '↓' indicates "the smaller the better" and '↑' indicates "the bigger the better". As seen from the results in Tables 2-5, MIMLSVM+ achieves better performance in all cases. Applying statistical tests (nonparametric ones) to the rankings obtained for each method on the different datasets according to [45], we find that the differences are significant. However, another important observation is that MIMLSVM+ is more sensitive to the parameter settings than MIMLELM. For example, on the Image dataset, the AP values of MIMLSVM+ vary in a wider interval [0.3735, ...]. Moreover, we conduct another set of experiments to evaluate the effect of each contribution in MIMLELM step by step. That is, we first modify MIMLSVM+ to include the automatic method for k, then use ELM instead of SVM and then include the genetic algorithm-based weight assignment.
The effectiveness of each option is gradually tested on four real datasets using our evaluation criteria.
The results are shown in Figure 4a-d, where SVM denotes the original MIMLSVM+ [5], SVM+k denotes the modified MIMLSVM+ including the automatic method for k, ELM+k denotes the usage of ELM instead of SVM in SVM+k and ELM+k+w denotes ELM+k further including the genetic algorithm-based weight assignment. As seen from Figure 4a-d, the options of including the automatic method for k and the genetic algorithm-based weight assignment improve the four evaluation criteria, while the usage of ELM instead of SVM in SVM+k slightly reduces the effectiveness. Since ELM can reach an effectiveness comparable to SVM at a much faster learning speed, combining the three contributions is the best option in terms of both efficiency and effectiveness.

Efficiency
In this series of experiments, we study the efficiency of MIMLELM by testing its scalability. That is, each dataset is replicated a different number of times, and then we observe how the training time and the testing time vary as the data size increases. Again, MIMLSVM+ is utilized as the competitor. As before, MIMLSVM+ is implemented with a Gaussian kernel, with the penalty factor cost selected from {10^-3, 10^-2, ..., 10^3}. MIMLELM is implemented with the number of hidden layer nodes set to 100, 200 and 300, respectively.
The experimental results are given in Figures 5-8. As we observed, when the data size is small, the efficiency difference between MIMLSVM+ and MIMLELM is not very significant. However, as the data size increases, the superiority of MIMLELM becomes more and more significant. This is particularly evident in terms of the testing time. In the Image dataset, the dataset is replicated 0.5-2 times with the step size set to 0.5. When the number of copies is two, the efficiency improvement could be up to 92.5% (from about 41.2 s down to about 21.4 s). In the Text dataset, the dataset is replicated 0.5-2 times with the step size set to 0.5. When the number of copies is two, the efficiency improvement could be even up to 223.3% (from about 23.6 s down to about 7.3 s). In the Geobacter sulfurreducens dataset, the dataset is replicated 1-5 times with the step size set to one. When the number of copies is five, the efficiency improvement could be up to 82.4% (from about 3.1 s down to about 1.7 s). In the Azotobacter vinelandii dataset, the dataset is replicated 1-5 times with the step size set to one. When the number of copies is five, the efficiency improvement could be up to 84.2% (from about 3.5 s down to about 1.9 s).

Statistical Significance of the Results
To explore the statistical significance of the results, we performed a nonparametric Friedman test followed by a Holm post hoc test, as advised by Demsar [45], to statistically compare algorithms over multiple datasets. The Friedman and Holm test results are reported as well.
The Friedman test [51] can be used to compare k algorithms over N datasets by ranking each algorithm on each dataset separately. The algorithm obtaining the best performance gets the rank of 1, the second best ranks 2, and so on. In case of ties, average ranks are assigned. Then, the average ranks of all algorithms over all datasets are calculated and compared. If the null hypothesis, which states that all algorithms perform equivalently, is rejected under the Friedman test statistic, post hoc tests, such as the Holm test [52], can be used to determine which algorithms perform statistically differently. When all classifiers are compared with a control classifier and p_1 ≤ p_2 ≤ ... ≤ p_(k-1), Holm's step-down procedure starts with the most significant p-value. If p_1 is below α/(k − 1), the corresponding hypothesis is rejected, and we are allowed to compare p_2 to α/(k − 2). If the second hypothesis is rejected, the test proceeds with the third, and so on. As soon as a certain null hypothesis cannot be rejected, all of the remaining hypotheses are retained as well.
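The rank-based procedure above can be sketched in Python. The function names are ours and `scipy` is assumed; the formulas are the standard Friedman statistic, the Iman-Davenport correction F_F used later in this section, the standard error for comparisons against a control, and Holm's step-down thresholds as described in the text.

```python
import numpy as np
from scipy.stats import rankdata


def average_ranks(scores):
    """Rank algorithms per dataset (rank 1 = best, higher score = better),
    assigning average ranks on ties, then average the ranks over datasets."""
    return np.mean([rankdata(-row) for row in np.asarray(scores)], axis=0)


def friedman_chi2(avg_ranks, N):
    """Friedman statistic: chi2_F = 12N/(k(k+1)) * (sum_j R_j^2 - k(k+1)^2/4)."""
    k = len(avg_ranks)
    return 12.0 * N / (k * (k + 1)) * (np.sum(np.square(avg_ranks)) - k * (k + 1) ** 2 / 4.0)


def iman_davenport(chi2, N, k):
    """F_F = (N-1) chi2 / (N(k-1) - chi2), distributed as F(k-1, (k-1)(N-1))."""
    return (N - 1) * chi2 / (N * (k - 1) - chi2)


def holm_se(k, N):
    """Standard error used when comparing k classifiers against a control over N datasets."""
    return np.sqrt(k * (k + 1) / (6.0 * N))


def holm_stepdown(pvals, alpha=0.05):
    """Holm's step-down: test sorted p-values against alpha/m, alpha/(m-1), ...;
    at the first non-rejection, all remaining hypotheses are retained."""
    pvals = np.asarray(pvals)
    order = np.argsort(pvals)
    m = len(pvals)
    reject = np.zeros(m, dtype=bool)
    for step, idx in enumerate(order):
        if pvals[idx] < alpha / (m - step):
            reject[idx] = True
        else:
            break
    return reject
```

For instance, the training-time comparison reported later (ten classifiers, four datasets) gives `iman_davenport(24.55, 4, 10) ≈ 6.43` and `holm_se(10, 4) ≈ 2.141`, matching the values in the text.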
In Figure 4a-d, we have conducted a set of experiments to gradually evaluate the effect of each contribution in MIMLELM. That is, we first modify MIMLSVM+ to include the automatic method for k, then use ELM instead of SVM and then include the genetic algorithm-based weights assignment. The effectiveness of each option is gradually tested on four real datasets using four evaluation criteria. To further explore whether the improvements are significantly different, we performed a Friedman test followed by a Holm post hoc test. In particular, Table 10 shows the rankings of each contribution on each dataset over criterion C. According to the rankings, we computed χ²_F = (12 × 4)/(4 × 5) × [Σ_j R_j² − (4 × 5²)/4] = 11.1 and F_F = (3 × 11.1)/(4 × 3 − 11.1) = 37. With four algorithms and four datasets, F_F is distributed according to the F distribution with 4 − 1 = 3 and (4 − 1) × (4 − 1) = 9 degrees of freedom. The critical value of F(3, 9) for α = 0.05 is 3.86, so we reject the null hypothesis. That is, the Friedman test reports a significant difference among the four methods. In what follows, we choose ELM+k+w as the control classifier and proceed with a Holm post hoc test. As shown in Table 11, with SE = √((4 × 5)/(6 × 4)) ≈ 0.913, the Holm procedure rejects the first hypothesis, since the corresponding p-value is smaller than the adjusted α. Thus, it is statistically believed that our method, i.e., ELM+k+w, has a significant performance improvement on criterion C over SVM. Similar cases can be found when the tests are conducted on the other three criteria; limited by space, we do not show them here.
In Tables 2-5, we compared the effectiveness of MIMLSVM+ and MIMLELM under different condition settings on four criteria, where, for a fair performance comparison, MIMLSVM+ is modified to include the automatic method for k and the genetic algorithm-based weights assignment, as MIMLELM does. According to the rankings in Table 12, we computed χ²_F ≈ 13.43 and F_F = (3 × 13.43)/(4 × 9 − 13.43) ≈ 1.79. With ten classifiers and four datasets, F_F is distributed according to the F distribution with 10 − 1 = 9 and (10 − 1) × (4 − 1) = 27 degrees of freedom. The critical value of F(9, 27) for α = 0.05 is 2.25. Thus, as expected, we could not reject the null hypothesis. That is, the Friedman test reports that there is not a significant difference among the ten methods on criterion C. This is because what we propose in this paper is a framework: equipped with it, the effectiveness of MIML can be improved further no matter whether SVM or ELM is employed. Since ELM is comparable to SVM in effectiveness [6,32,37], MIMLELM is certainly comparable to MIMLSVM+ in effectiveness. This confirms the general effectiveness of the proposed framework. Similar cases can be found when the tests are conducted on the other three criteria; limited by space, we do not show them here.

In Figures 5-8, we studied the training time and the testing time of MIMLSVM+ and MIMLELM for the efficiency comparison, respectively. To further explore whether the differences are significant, we performed a Friedman test followed by a Holm post hoc test. In particular, Table 13 shows the rankings of the training time, according to which we computed χ²_F ≈ 24.55 and F_F = (3 × 24.55)/(4 × 9 − 24.55) ≈ 6.43. With ten classifiers and four datasets, F_F is distributed according to the F distribution with 10 − 1 = 9 and (10 − 1) × (4 − 1) = 27 degrees of freedom. The critical value of F(9, 27) for α = 0.05 is 2.25, so we reject the null hypothesis. That is, the Friedman test reports a significant difference among the ten methods. In what follows, we choose ELM with HN = 200 as the control classifier and proceed with a Holm post hoc test. As shown in Table
16, with SE = √((10 × 11)/(6 × 4)) ≈ 2.141, the Holm procedure rejects the hypotheses from the first to the third, since the corresponding p-values are smaller than the adjusted α's. Thus, it is statistically believed that MIMLELM with HN = 200 achieves a significant training-time improvement over two of the MIMLSVM+ classifiers. In summary, the proposed framework can significantly improve the effectiveness of MIML learning. Equipped with the framework, the effectiveness of MIMLELM is comparable to that of MIMLSVM+, while the efficiency of MIMLELM is significantly better than that of MIMLSVM+.

Conclusions
MIML is a framework for learning with complicated objects and has been proven to be effective in many applications. However, the existing two-phase MIML approaches may suffer from an effectiveness problem arising from the user-specific cluster number and an efficiency problem arising from the high computational cost. In this paper, we propose the MIMLELM approach to learn with MIML examples quickly. On the one hand, the efficiency is highly improved by integrating the extreme learning machine into the MIML learning framework. To the best of our knowledge, we are the first to utilize ELM in the MIML problem and to conduct a comparison of ELM and SVM on MIML. On the other hand, we develop a theoretically-guaranteed method to determine the number of clusters automatically and exploit a genetic algorithm-based ELM ensemble to further improve the effectiveness.
where S_l is the set of all of the training instances x_i of label l. Then, each instance can be transformed into a bag B_i of |Y| instances by computing B_i = {x_i − v_l | l ∈ Y}. As such, the single-instance multi-label dataset S is transformed into an MIML dataset S = {(B_1, Y_1), (B_2, Y_2), ..., (B_m, Y_m)}.
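The transformation above can be sketched as follows. This is an illustrative sketch only: the function name is ours, and the prototype v_l of S_l is assumed here to be the mean of the instances carrying label l, since its exact definition is given elsewhere in the paper.

```python
import numpy as np


def silm_to_miml(X, Y):
    """Sketch: each single-instance multi-label example (x_i, Y_i) becomes a bag
    B_i = {x_i - v_l | l in Y}, one shifted copy of x_i per label in the label
    space Y. v_l is a prototype of S_l (all instances of label l); the mean is
    used here as an illustrative, assumed choice of prototype."""
    X = np.asarray(X, dtype=float)
    labels = sorted(set().union(*Y))              # the label space Y
    v = {l: X[[l in y for y in Y]].mean(axis=0)   # prototype v_l of S_l (assumption)
         for l in labels}
    bags = [np.stack([x - v[l] for l in labels]) for x in X]
    return list(zip(bags, Y))                     # the MIML dataset {(B_i, Y_i)}
```

Each resulting bag has exactly |Y| instances, so the bag size is the same for every example, as the text implies.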

Figure 1 .
Figure 1. The relationship among these four learning frameworks.

Algorithm 1: ELM
Input: DB: dataset; HN: number of hidden layer nodes; AF: activation function
Output: the output weight vector β
1 for i = 1 to L do
2   randomly assign input weight w_i;
3   randomly assign bias b_i;
4 calculate H;
5 calculate β = H†T

||a − b|| is used to measure the distance between the instances a and b, which takes the form of the Euclidean distance; max_{a∈A} min_{b∈B} ||a − b|| and max_{b∈B} min_{a∈A} ||b − a|| denote the maximized minimum distance between every instance in A and all instances in B, and between every instance in B and all instances in A, respectively. The Hausdorff distance-based k-medoids clustering method divides the dataset Γ into k partitions, the medoids of which are M_1, M_2, ..., M_k, respectively.
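Algorithm 1 can be sketched in a few lines; this is a minimal illustration under stated assumptions (sigmoid activation, weights and biases drawn uniformly from [-1, 1]), not the authors' exact implementation.

```python
import numpy as np


def elm_train(X, T, hn=100, seed=0):
    """Minimal ELM sketch following Algorithm 1: hn hidden nodes receive random
    input weights and biases, then the output weights are solved in closed form
    as beta = pinv(H) @ T, i.e., beta = H†T."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(-1.0, 1.0, size=(X.shape[1], hn))  # random input weights w_i
    b = rng.uniform(-1.0, 1.0, size=hn)                # random biases b_i
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))             # hidden-layer output matrix H
    beta = np.linalg.pinv(H) @ T                       # beta = H†T (Moore-Penrose)
    return W, b, beta


def elm_predict(X, W, b, beta):
    """Forward pass: recompute H for new inputs and apply the learned beta."""
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta
```

Because training reduces to a single pseudoinverse rather than iterative optimization, the learning speed advantage over SVM reported in the experiments is plausible by construction.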

3 Figure 2 .
Figure 2. The process of transforming multi-instance examples into single-instance examples.

Figure 3 .
Figure 3. The example of data processing.

Figure 4 .
Figure 4. Gradual effectiveness evaluation of each contribution in Multi-instance Multi-label with Extreme Learning Machine (MIMLELM). (a) Gradual evaluation on average precision (AP); (b) gradual evaluation on coverage (C); (c) gradual evaluation on one-error (OE); (d) gradual evaluation on ranking loss (RL).

Figure 5 .
Figure 5. The efficiency comparison on the Image dataset. (a) The comparison of the training time; (b) the comparison of the testing time.

Figure 6 .
Figure 6. The efficiency comparison on the Text dataset. (a) The comparison of the training time; (b) the comparison of the testing time.

Figure 7 .
Figure 7. The efficiency comparison on the Geobacter sulfurreducens dataset. (a) The comparison of the training time; (b) the comparison of the testing time.

Figure 8 .
Figure 8. The efficiency comparison on the Azotobacter vinelandii dataset. (a) The comparison of the training time; (b) the comparison of the testing time.


MIMLELM algorithm (fragment):
1 Γ ← {X_1, X_2, ..., X_m};
2 determine the number of clusters, k, using AIC;
3 randomly select k elements from Γ to initialize the k medoids {M_1, M_2, ..., M_k};
...
10 transform (X_u, Y_u) into an SIML example (z_u, Y_u), where z_u = (d_H(X_u, M_1), d_H(X_u, M_2), ..., d_H(X_u, M_k));
...
13 foreach y ∈ Y_u do
14   decompose (z_u, Y_u) into |Y_u| SISL examples
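The Hausdorff distance d_H defined earlier and the bag-to-vector transformation of step 10 can be sketched as follows; the function names are ours.

```python
import numpy as np


def hausdorff(A, B):
    """Max-min Hausdorff distance between bags A and B (one instance per row):
    the larger of the two directed distances max_a min_b ||a - b|| and
    max_b min_a ||b - a||, with ||.|| the Euclidean distance."""
    A, B = np.atleast_2d(A), np.atleast_2d(B)
    D = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)  # pairwise ||a - b||
    return max(D.min(axis=1).max(), D.min(axis=0).max())


def bags_to_vectors(bags, medoids):
    """Step 10: represent each bag X_u by its distances to the k medoids,
    z_u = (d_H(X_u, M_1), ..., d_H(X_u, M_k))."""
    return np.array([[hausdorff(X, M) for M in medoids] for X in bags])
```

After this step, every MIML example is a fixed-length vector, so any single-instance learner (here, ELM) can be applied.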

Table 1 .
The information of the datasets. std.: standard deviation.
Similarly, on the Image dataset, the C values of MIMLSVM+ vary in a wider interval [1.1201, 2.0000], while those of MIMLELM vary in a narrower range [1.5700, 2.0000]; the OE values of MIMLSVM+ vary in a wider interval [0.5783, 0.7969], while those of MIMLELM vary in a narrower range [0.6720, 0.8400]; and the RL values of MIMLSVM+ vary in a wider interval [0.3511, 0.4513], while those of MIMLELM vary in a narrower range [0.4109, 0.4750]. In the other three real datasets, we have similar observations. Moreover, we observe that in this set of experiments, MIMLELM works better when HN is set to 200.

Table 2 .
The effectiveness comparison on the Image dataset. AP: average precision; C: coverage; OE: one-error; RL: ranking loss; MIMLSVM+: multi-instance multi-label support vector machine; MIMLELM: multi-instance multi-label extreme learning machine.

Table 6 .
The effectiveness comparison of Extreme Learning Machine (ELM) and its variants on the Image dataset.

Table 7 .
The effectiveness comparison of ELM and its variants on the Text dataset.

Table 8 .
The effectiveness comparison of ELM and its variants on the Geobacter sulfurreducens dataset.

Table 9 .
The effectiveness comparison of ELM and its variants on the Azotobacter vinelandii dataset.

Table 10 .
Friedman test of the gradual effectiveness evaluation on criterion C.

Table 11 .
Holm test of the gradual effectiveness evaluation on criterion C.

Table 12 .
Friedman test of the effectiveness comparison in Tables 2-5 on criterion C.

Table 13 .
Friedman test of the training time.

Table 15 .
Friedman test of the testing time.