Article

A Fast Algorithm for Multi-Class Learning from Label Proportions

Fan Zhang, Jiabin Liu, Bo Wang, Zhiquan Qi and Yong Shi

1 School of Information and Communication Engineering, Beijing University of Posts and Telecommunications, Beijing 100876, China
2 School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing 100190, China
3 School of Information Technology and Management, University of International Business and Economics, Beijing 100029, China
4 School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190, China
5 Key Laboratory of Big Data Mining and Knowledge Management, Chinese Academy of Sciences, Beijing 100190, China
6 Research Center on Fictitious Economy and Data Science, Chinese Academy of Sciences, Beijing 100190, China
7 College of Information Science and Technology, University of Nebraska at Omaha, NE 68182, USA
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Electronics 2019, 8(6), 609; https://doi.org/10.3390/electronics8060609
Submission received: 17 April 2019 / Revised: 24 May 2019 / Accepted: 27 May 2019 / Published: 30 May 2019
(This article belongs to the Section Artificial Intelligence)

Abstract

Learning from label proportions (LLP) is a relatively new learning problem that has attracted wide interest in machine learning. Unlike standard supervised learning, the training data of LLP are grouped into bags, and only the proportion of each class in each bag is available. Many modern applications, such as modeling voting behavior and spam filtering, can be cast as this problem. However, time-consuming training remains a challenge for LLP and becomes a bottleneck when the number of bags or the bag size is large. In this paper, we propose a fast algorithm called multi-class learning from label proportions by extreme learning machine (LLP-ELM), which exploits the fast learning speed of the extreme learning machine to solve multi-class LLP. Firstly, we reshape the hidden layer output matrix and the training data target matrix of an extreme learning machine so that they carry proportion information instead of real labels. Secondly, a robust loss function with a regularization term is formulated, and two efficient solutions are provided for different cases. Finally, various experiments demonstrate a significant speed-up of the proposed model, with better accuracy on different datasets compared with several state-of-the-art methods.

1. Introduction

In the era of big data, many real-world applications involve multi-class problems. For example, the MNIST database of handwritten digits requires separating the 10 digits from 0 to 9. Generally speaking, the traditional way to solve multi-class problems is to apply supervised learning to train a multi-class classifier on labeled data, for example with random forests [1], support vector machines [2], convolutional neural networks [3] and boosting [4]. However, many real-world cases cannot be efficiently solved by supervised learning algorithms, mainly for the following two reasons:
On one hand, supervised learning algorithms need a large amount of labeled training data to obtain good performance. However, labeling becomes infeasible or very difficult as the amount of training data grows, because label information is often provided by human annotators and annotating large datasets is expensive and time consuming. Furthermore, multiple human annotators often provide inconsistent labels, which can degrade the performance of learning algorithms. On the other hand, instance labels are simply unavailable in some cases because of additional constraints. For example, in the analysis of client purchasing behavior [5], it is natural to apply supervised learning to model clients' transactions. However, revealing clients' key information may cause legal problems, especially when the information is provided to a third party. This necessitates the development of weakly supervised learning algorithms.
In practice, compared with accurately labeling individual samples, the proportions of different categories in each bag can often be obtained much more accurately and cheaply from prior knowledge or random sampling. For example, based on statistics or common sense, 80% of bears are black, 90% of Asians have black hair and 70% of living rooms have a TV [6]. A similar problem can be found in gene name tagging, where a word has a 75% probability of being a gene if it ends with the morpheme gene [7]. As a result, it is very meaningful to study the problem of learning from label proportions. Figure 1 illustrates the problem of multi-class learning from class proportions. In detail, there are images of three categories, pandas, butterflies and dolphins, divided into four disjoint bags. In each bag, the amount of each category is denoted by the size of a rectangle of the corresponding color, from which the proportion information can be obtained. A multi-class classifier can then be learned from these label proportions. On the right, pandas, butterflies and dolphins are separated according to this classifier.
Nowadays, more and more applications can be formulated as this problem, such as demographic classification [8], video event detection [9], presidential election analysis [10], traffic flow prediction [11], embryo implantation prediction [12] and SAR image classification [13].

1.1. Related Works

Learning from label proportions has been studied for several years, but publications on the issue are relatively rare. Kuck and de Freitas [14] first gave a solution for this problem, employing a Markov chain Monte Carlo algorithm. However, the high computational complexity of this algorithm is its main drawback.
Rüping [15] addressed this problem by means of support vector regression. In detail, the mean instance of each bag is required to comply with a soft label obtained from the label proportion, following the idea of inverse classifier calibration. Based on this idea, Cui et al. [5] proposed a new algorithm that replaces the SVR with an ELM to accelerate training. Both methods can achieve relatively good performance. However, Yu et al. [16] argued that this idea can perform poorly in some cases. Furthermore, neither of the two algorithms can deal with multi-class problems.
Recently, Yu et al. [16] presented a new method based on the large-margin framework. In detail, the objective function optimizes over the known label proportions as well as the unknown instance labels. This model outperforms the former methods in most situations and alleviates the need for restrictive assumptions on the data. However, the method has its limitations. On one hand, the algorithm cannot handle multi-class problems directly. On the other hand, its training efficiency is low, since it needs an alternating optimization process to obtain the final result.
Furthermore, Wang and Feng [17] proposed a matrix-based algorithm that solves the multi-class classification case directly, while the other methods have to perform a post-processing procedure for multi-class classification. In detail, this method forces the predicted class proportions to match the ground truth while preserving the sparsity of the predicted label vector of each individual sample, and a set of auxiliary variables is introduced to solve the optimization problem. Compared with the previous algorithms, the model shows great advantages in handling multi-class problems. However, the algorithm involves heavy matrix computations and suffers from high computational complexity when the feature dimension becomes considerably large.
Other methods can be found in References [18,19,20,21].

1.2. Motivation

As the amount of data has increased substantially, large-scale, multi-class data has become the norm for many machine learning problems. Generally, large-scale data means more time-consuming training, especially for weakly supervised learning, where the labels of the training data are inaccessible. This necessitates the development of fast learning algorithms for multi-class learning from label proportions. For example, in a political election [15], the voters can be divided into three groups: always-supportive voters, always-opposed voters and swing voters, where the last group votes for candidates according to the benefits offered to them. Every candidate would like to identify which class each voter belongs to according to the proportion information revealed by previous elections. The sooner the candidates obtain this information, the higher the probability that they can take the right actions to win the election, mainly because they will have more time to focus their attention on the regions where they can achieve the maximal gain. In other words, this is a typical LLP problem, and a fast solution to it can bring a crucial advantage to the candidates.
Although most of the proposed methods provide effective solutions to LLP [15,16,21,22,23], the time consumption of training is not fully considered. That is to say, as the bag size and the number of bags increase, they may need an unaffordable amount of time to yield a classifier. In order to improve the computational efficiency of the training process, we propose a fast training method based on the ELM, which is well known for its fast training speed.
The rest of the paper is organized as follows. First, the extreme learning machine is presented in Section 2, and then we describe the novel LLP algorithm and its solution in Section 3. After this, the experimental results are presented in Section 4. Finally, in Section 5, some ideas and conclusions of our work are given.

2. Background

In this section, we give a brief introduction to the traditional extreme learning machine [24,25]. Figure 2 shows the architecture of the ELM. In detail, it is a single-hidden-layer feed-forward network with three parts: input neurons, hidden neurons and output neurons. In particular, $h(x) = [h_1(x), \ldots, h_L(x)]$ is the nonlinear feature mapping of the ELM, with $h_j(x) = g(w_j \cdot x + b_j)$, and $\beta_j = [\beta_{j1}, \ldots, \beta_{jc}]^T$, $j = 1, \ldots, L$, are the output weights between the jth hidden node and the output nodes.
We are given $N$ samples $(x_i, t_i)$, $i = 1, \ldots, N$, where $x_i = [x_{i1}, \ldots, x_{id}]^T$ denotes the input feature vector and $t_i = [t_{i1}, \ldots, t_{ic}]^T$ is the corresponding label in one-hot form. In particular, $c$ and $d$ respectively denote the number of classes and the feature dimension. Consequently, a standard feed-forward neural network with $L$ hidden nodes can be expressed as:
$$\sum_{j=1}^{L} \beta_j\, g(w_j \cdot x_i + b_j) = o_i, \quad i = 1, \ldots, N,$$
where $w_j = [w_{j1}, w_{j2}, \ldots, w_{jd}]^T$ is the weight vector between the input neurons and the jth hidden neuron, and $\beta_j = [\beta_{j1}, \beta_{j2}, \ldots, \beta_{jc}]^T$, $j = 1, \ldots, L$, is the weight vector connecting the jth hidden neuron to the output neurons. According to Reference [25], the ELM can approximate these $N$ samples with zero error, that is, $\sum_{i=1}^{N} \|o_i - t_i\| = 0$. Thus, the above equations can be expressed as:
$$\sum_{j=1}^{L} \beta_j\, g(w_j \cdot x_i + b_j) = t_i, \quad i = 1, \ldots, N.$$
In particular, the above $N$ equations can be written compactly in matrix form:
$$H\beta = T,$$
where $H$ is the hidden layer output matrix of the single-hidden-layer feed-forward network and $T$ is the target matrix. More specifically, $H$ and $T$ have the form:
$$H = \begin{bmatrix} h(x_1) \\ \vdots \\ h(x_N) \end{bmatrix} = \begin{bmatrix} h_1(x_1) & \cdots & h_L(x_1) \\ \vdots & \ddots & \vdots \\ h_1(x_N) & \cdots & h_L(x_N) \end{bmatrix}_{N \times L}$$
and
$$T = \begin{bmatrix} t_1^T \\ \vdots \\ t_N^T \end{bmatrix} = \begin{bmatrix} t_{11} & \cdots & t_{1c} \\ \vdots & \ddots & \vdots \\ t_{N1} & \cdots & t_{Nc} \end{bmatrix}_{N \times c}.$$
In practice, the hidden node parameters $(w, b)$ of the ELM are randomly generated and then fixed without iterative tuning, which differs from traditional BP neural networks [25]. As a result, training an ELM is equivalent to finding the optimal solution for $\beta$, which is defined as:
$$\beta = \begin{bmatrix} \beta_1^T \\ \vdots \\ \beta_L^T \end{bmatrix} = \begin{bmatrix} \beta_{11} & \cdots & \beta_{1c} \\ \vdots & \ddots & \vdots \\ \beta_{L1} & \cdots & \beta_{Lc} \end{bmatrix}_{L \times c}.$$
Furthermore, $\beta$ can be computed by the following expression:
$$\beta^* = H^{\dagger} T,$$
where $H^{\dagger}$ is the Moore-Penrose generalized inverse of the matrix $H$.
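As a concrete illustration of this closed-form training step, the following minimal NumPy sketch implements a basic ELM with a sigmoid activation; the function names and the choice of activation are our own illustrative assumptions and are not taken from the paper or its released code.

```python
import numpy as np

def train_elm(X, T, L, seed=0):
    """Basic ELM: random hidden layer, least-squares output weights.
    X: (N, d) input features; T: (N, c) one-hot targets; L: number of hidden nodes."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.standard_normal((d, L))            # random input weights w_j (fixed, not tuned)
    b = rng.standard_normal(L)                 # random biases b_j
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))     # hidden layer output matrix H, shape (N, L)
    beta = np.linalg.pinv(H) @ T               # beta* = H^dagger T (Moore-Penrose pseudoinverse)
    return W, b, beta

def predict_elm(X, W, b, beta):
    """Return predicted class indices via argmax over the output layer."""
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return np.argmax(H @ beta, axis=1)
```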

3. The LLP-ELM Algorithm

In this section, we propose a fast algorithm for multi-class learning from label proportions, called LLP-ELM, which employs an extreme learning machine to solve the multi-class LLP problem. In order to apply the extreme learning machine to LLP, we reshape the hidden layer output matrix $H$ and the training data target matrix $T$ into new forms, such that $H$ is defined at the bag level and $T$ contains proportion information instead of labels.

3.1. Learning Setting

The LLP problem is described by a set of training data divided into several bags. Furthermore, compared to traditional supervised learning, we only know the proportions of the different categories in each bag instead of the ground-truth labels. In this paper, we consider the situation in which different bags are disjoint, and the nth bag of the training data is denoted $B_n$, $n = 1, \ldots, h$. Consequently, the total training data has the form:
$$D = B_1 \cup B_2 \cup \cdots \cup B_h, \qquad B_i \cap B_j = \emptyset, \; i \neq j,$$
where there are $h$ bags and $N$ is the total number of instances. Each bag consists of $m_n$ instances, with the constraint $\sum_{n=1}^{h} m_n = N$, and can be expressed as:
$$B_n = \{x_{n1}, \ldots, x_{n m_n}\}, \quad n \in \{1, 2, \ldots, h\}.$$
Meanwhile, $p_n$ is the class proportion vector of $B_n$ and $c$ denotes the total number of classes. More specifically, $p_n$ can be written in vector form:
$$p_n = [p_{n1}, \ldots, p_{nc}]^T,$$
where the mth element $p_{nm}$ is the proportion of the mth class in the nth bag, with the constraint $\sum_{m=1}^{c} p_{nm} = 1$. Furthermore, the total proportion information can be collected into a matrix:
$$P = \begin{bmatrix} p_1^T \\ \vdots \\ p_h^T \end{bmatrix} = \begin{bmatrix} p_{11} & \cdots & p_{1c} \\ \vdots & \ddots & \vdots \\ p_{h1} & \cdots & p_{hc} \end{bmatrix}_{h \times c}.$$

3.2. The LLP-ELM Framework

From the above learning setting of LLP, a classifier at the instance level is the final objective. To this end, we modify the original instance-level equations of the ELM into equations at the bag level. Specifically, we sum all the instance-level equations within each bag, and the resulting equation for the nth bag can be expressed as follows:
$$\sum_{j=1}^{L} \sum_{k=1}^{m_n} \beta_j\, g(w_j \cdot x_{nk} + b_j) = \sum_{k=1}^{m_n} t_{nk}, \quad n = 1, \ldots, h,$$
where $t_{nk}$ is the real label of the kth instance in the nth bag. Obviously, the real label information on the right-hand side is inaccessible to us; only the label proportions of each bag are available. We therefore rewrite the right-hand side of the above equation as:
$$\sum_{k=1}^{m_n} t_{nk} = m_n\, p_n, \quad n = 1, \ldots, h,$$
where $p_n$ is the label proportion vector of the nth bag. Substituting this expression into the bag-level equations above, we naturally obtain the following equations:
$$\sum_{j=1}^{L} \beta_j \sum_{k=1}^{m_n} g(w_j \cdot x_{nk} + b_j) = m_n\, p_n, \quad n = 1, \ldots, h.$$
In particular, similar to the original ELM [25], we can write the above equations in matrix form as follows:
$$H_p\, \beta = P,$$
where $H_p$ is the hidden layer output matrix at the bag level and $P$ is the training data target proportion matrix. More specifically, $H_p$ and $P$ have the form:
$$H_p = \begin{bmatrix} \sum_{k=1}^{m_1} h(x_{1k}) \\ \vdots \\ \sum_{k=1}^{m_h} h(x_{hk}) \end{bmatrix} = \begin{bmatrix} \sum_{k=1}^{m_1} h_1(x_{1k}) & \cdots & \sum_{k=1}^{m_1} h_L(x_{1k}) \\ \vdots & \ddots & \vdots \\ \sum_{k=1}^{m_h} h_1(x_{hk}) & \cdots & \sum_{k=1}^{m_h} h_L(x_{hk}) \end{bmatrix}_{h \times L}$$
and
$$P = \begin{bmatrix} m_1 p_1^T \\ \vdots \\ m_h p_h^T \end{bmatrix} = \begin{bmatrix} m_1 p_{11} & \cdots & m_1 p_{1c} \\ \vdots & \ddots & \vdots \\ m_h p_{h1} & \cdots & m_h p_{hc} \end{bmatrix}_{h \times c}.$$
Meanwhile, the final solution $\beta$ has the same form as in the original ELM, with dimension $L \times c$. Again, the minimum-norm solution to $H_p \beta = P$ is given by
$$\beta = H_p^{\dagger} P,$$
where $H_p^{\dagger}$ is the Moore-Penrose generalized inverse of the matrix $H_p$.
In order to obtain better generalization performance, we also follow Reference [24] and study the regularized ELM. In detail, the objective function is formulated as follows:
$$\min_{\beta \in \mathbb{R}^{L \times c}} \; \frac{1}{2}\|\beta\|^2 + \frac{C}{2} \sum_{i=1}^{N} \|e_i\|^2 \quad \text{s.t.} \quad h(x_i)\beta = t_i^T - e_i^T, \; i = 1, \ldots, N,$$
in which the first term of the objective function is a regularization term and $C$ is a parameter that trades off the two terms.
We equivalently reformulate this problem by substituting the constraints into the objective function:
$$\min_{\beta \in \mathbb{R}^{L \times c}} \; L_{\mathrm{ELM}} = \frac{1}{2}\|\beta\|^2 + \frac{C}{2}\|T - H\beta\|^2.$$
Note that the second term above can be replaced by its bag-level counterpart $\frac{C}{2}\|P - H_p\beta\|^2$. In other words, the final unconstrained optimization problem can be written as:
$$\min_{\beta \in \mathbb{R}^{L \times c}} \; L_{\mathrm{ELM}} = \frac{1}{2}\|\beta\|^2 + \frac{C}{2}\|P - H_p\beta\|^2.$$
In practice, this final objective is widely known as ridge regression or regularized least squares.
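To make the bag-level construction concrete, the sketch below assembles $H_p$ and the scaled target matrix $P$ from a list of bags and their proportion vectors; the helper name and the sigmoid hidden layer are our own illustrative choices, not the paper's released code.

```python
import numpy as np

def bag_level_matrices(bags, proportions, W, b):
    """bags: list of (m_n, d) arrays; proportions: list of length-c proportion vectors p_n.
    W, b: the fixed random hidden-layer parameters. Returns H_p (h x L) and P (h x c)."""
    H_rows, P_rows = [], []
    for X_n, p_n in zip(bags, proportions):
        H_n = 1.0 / (1.0 + np.exp(-(X_n @ W + b)))   # instance-level h(x_nk) for this bag
        H_rows.append(H_n.sum(axis=0))               # sum_k h(x_nk): one row of H_p
        P_rows.append(len(X_n) * np.asarray(p_n))    # m_n * p_n: one row of P
    return np.vstack(H_rows), np.vstack(P_rows)
```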

3.3. How to Solve the LLP-ELM

We follow the strategy of Reference [24] to solve this problem; the aim is to minimize the training error as well as the norm of the output weights. The objective function is convex, so it can be solved by setting its gradient with respect to $\beta$ to zero, which gives:
$$\beta - C\, H_p^T (P - H_p \beta) = 0.$$
This yields
$$\left(\frac{I}{C} + H_p^T H_p\right)\beta = H_p^T P,$$
where I is an identity matrix with dimension L.
The above equation is intuitive, and the final result can be obtained by directly inverting an $L \times L$ matrix. However, this is inefficient when the number of bags is smaller than the number of hidden neurons ($h < L$). Therefore, two solution methods are given in Remarks 1 and 2: when the number of bags exceeds the number of hidden neurons, we use Remark 1 to compute the output weights; otherwise, we use Remark 2. A short code sketch covering both cases follows the remarks.
Remark 1.
The solution of the optimality condition when h > L.
  • $H_p$ has more rows than columns, that is, the number of bags is larger than the number of hidden neurons.
  • Multiplying both sides of the above linear system by $(H_p^T H_p + \frac{I}{C})^{-1}$, which requires inverting an $L \times L$ matrix, we obtain the optimal solution
    $$\beta = \left(H_p^T H_p + \frac{I}{C}\right)^{-1} H_p^T P.$$
Remark 2.
The solution of the optimality condition when h < L.
  • Notice that when $h < L$ and $H_p$ has full row rank, $H_p H_p^T$ is invertible.
  • Restrict $\beta$ to be a linear combination of the rows of $H_p$: $\beta = H_p^T \alpha$.
  • Substitute $\beta = H_p^T \alpha$ into the optimality condition and multiply on the left by $(H_p H_p^T)^{-1} H_p$.
  • This step yields the following equation:
    $$\alpha - C\,(P - H_p H_p^T \alpha) = 0.$$
  • As a result, the final optimal solution is
    $$\beta = H_p^T \alpha = H_p^T \left(H_p H_p^T + \frac{I}{C}\right)^{-1} P.$$
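A compact sketch of both cases, under the same illustrative naming as before, might look as follows; it simply picks whichever matrix ($L \times L$ or $h \times h$) is cheaper to invert.

```python
import numpy as np

def solve_output_weights(H_p, P, C):
    """Regularized least-squares solution for beta (h = number of bags, L = hidden nodes)."""
    h, L = H_p.shape
    if h >= L:
        # Remark 1: invert an L x L matrix.
        A = H_p.T @ H_p + np.eye(L) / C
        return np.linalg.solve(A, H_p.T @ P)
    else:
        # Remark 2: invert an h x h matrix via beta = H_p^T alpha.
        A = H_p @ H_p.T + np.eye(h) / C
        return H_p.T @ np.linalg.solve(A, P)
```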
The solution process of the LLP-ELM model can be summarized in the following two steps:
  • Compute the training data target proportion matrix $P$ and the bag-level hidden layer output matrix $H_p$, as shown in Figure 3.
  • Obtain the final optimal solution for $\beta$ according to Remark 1 or Remark 2.
The details of the process are shown in Algorithm 1.
Algorithm 1 LLP-ELM
Input: Training data in bags { B n }; the corresponding proportion p n of each B n ; the activation function g(x) and the number of hidden nodes L.
Output: Classification model f(x, β )
Begin
 • Randomly generate the values $w_j$ and $b_j$ for the jth hidden node, $j = 1, \ldots, L$.
 • Compute the training data target proportion matrix P from the proportion information of each bag.
 • Compute the hidden layer output matrix at the bag level, $H_p$.
 • Obtain the output weights $\beta$ according to Remark 1 or Remark 2.
End
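For completeness, and assuming the illustrative helpers from the earlier sketches (train_elm/predict_elm, bag_level_matrices and solve_output_weights, all our own names), the procedure of Algorithm 1 can be exercised end to end on synthetic data roughly as follows:

```python
# End-to-end usage of the sketches above on synthetic data (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))                   # 200 instances, 10 features
y = (X[:, 0] + X[:, 1] > 0).astype(int)              # hidden ground-truth labels (2 classes)

L, d, C = 50, X.shape[1], 1.0
W, b = rng.standard_normal((d, L)), rng.standard_normal(L)

bags = [X[i:i + 8] for i in range(0, 200, 8)]        # disjoint bags of size 8
props = [np.bincount(y[i:i + 8], minlength=2) / 8 for i in range(0, 200, 8)]

H_p, P = bag_level_matrices(bags, props, W, b)       # bag-level matrices (Section 3.2 sketch)
beta = solve_output_weights(H_p, P, C)               # output weights (Remarks 1 and 2 sketch)
y_pred = predict_elm(X, W, b, beta)                  # instance-level predictions (Section 2 sketch)
print("training accuracy:", (y_pred == y).mean())
```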

3.4. Computational Complexity

From Remarks 1 and 2, we can see that the main time cost of our method is the matrix inversion. The dimension of the matrix to be inverted is the minimum of the number of bags $h$ and the number of hidden neurons $L$. Since the complexity of inverting an $n \times n$ matrix is $O(n^3)$, the cost of this step is $O(\min(L, h)^3)$.

4. Experiments

In this section, we evaluate the performance of our proposed algorithm on binary and multi-class datasets, comparing it with two methods from References [15,16] that have proven their advantages over previous methods. Additionally, the training times of the three algorithms on the different datasets are reported. Our source code is available at https://github.com/liujiabin008/LLP-ELM-GITHUB.

4.1. Experiment Setting

Each dataset is partitioned into two parts: 80% for training and 20% for testing. Furthermore, we randomly split the training data into bags of size 2, 4, 8, 16, 32 and 64. The reported results are the training time and classification accuracy averaged over 5 repetitions of each experiment.
For alter-∝SVM, the labels must be initialized; in practice, a stochastic initialization based on the proportion information of the different bags is employed. Furthermore, alter-∝SVM is run several times to reduce the influence of random initialization, and the run with the lowest objective value is chosen as the final result.
In particular, the attributes of each dataset are scaled to [−1, 1], and 3000 hidden neurons are used for LLP-ELM in the experiments. Furthermore, a linear kernel is used for alter-∝SVM and InvCal. All parameters are tuned by fivefold cross-validation. In detail, the parameters of the different algorithms are tuned over the following grids:
InvCal: $C_p \in \{0.1, 1, 10\}$, $\varepsilon \in \{0, 0.1, 0.01\}$
alter-∝SVM: $C \in \{0.1, 1, 10\}$, $C_p \in \{1, 10, 100\}$
LLP-ELM: $C \in \{0.1, 1, 0.01\}$
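As an illustration of the bag construction described above, the following helper (our own, not part of the released code) randomly splits a labeled training set into disjoint bags of a given size and computes the per-bag class proportions that serve as the only supervision:

```python
import numpy as np

def make_bags(X, y, bag_size, num_classes, seed=0):
    """Randomly split (X, y) into disjoint bags; return the bags and their class proportions.
    y must contain integer class indices; only the proportion vectors are given to the learner."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    bags, proportions = [], []
    for start in range(0, len(X), bag_size):
        sel = idx[start:start + bag_size]
        bags.append(X[sel])
        counts = np.bincount(y[sel], minlength=num_classes)
        proportions.append(counts / len(sel))
    return bags, proportions
```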

4.2. Binary Datasets

In this section, we compare our algorithm with two state-of-the-art methods, alter-∝SVM [16] and InvCal [15], on 9 binary classification problems. Table 1 summarizes the 9 datasets with their sizes and numbers of attributes, ordered from smallest to largest.
The final performance of the three algorithms is presented in Table 2, where the 9 datasets are arranged in order of increasing size and the bold numbers mark the best accuracy for each dataset.
As can be seen from this table, our method has higher accuracy than InvCal and alter-∝SVM on most datasets, achieving the best result in 40, the second best in 9 and the worst in 5 of the 54 comparisons. In particular, on the diabetes and a1a datasets, LLP-ELM is superior to the other two algorithms for all bag sizes. We can also observe that as the bag size increases, the accuracy of all three methods decreases to different degrees, mainly because larger bags provide less information.
Additionally, the average classification accuracies on the 9 datasets are shown in Figure 4 for the different bag sizes. In detail, alter-∝SVM, InvCal and LLP-ELM are denoted by the green, blue and red lines, respectively. From the results, we can clearly see that our method is superior to the other two algorithms in classification accuracy.
We also present the training times of the different methods in Table 3. From the table, we can see that LLP-ELM is much faster than the other two algorithms: several times faster than InvCal and hundreds of times faster than alter-∝SVM. Furthermore, larger datasets require more training time for all methods. In practice, the main time cost of our method is the matrix inversion, whose dimension is the minimum of the number of bags and the number of hidden neurons.
Additionally, the average relative times on the 9 datasets are shown in Figure 5 for the different bag sizes. In detail, alter-∝SVM, InvCal and LLP-ELM are denoted by the green, blue and red lines, respectively. Furthermore, to present the results more clearly, we take the base-10 logarithm of the average time and add a constant so that the final values are positive.
From the results, we can see that our algorithm is hundreds of times faster than alter-∝SVM and tens of times faster than InvCal. As the bag size increases, the training time of all three algorithms decreases to different degrees. For our algorithm this is straightforward: the dimension of the matrix inversion decreases as the bag size grows.

4.3. Multi-Class Datasets

In this section, 5 multi-class classification datasets are used to compare the different methods. A summary of these datasets is presented in Table 4. As alter-∝SVM [16] and InvCal [15] essentially learn binary classifiers, they adopt the one-vs.-all strategy on the multi-class datasets: the score of a sample belonging to each possible class is computed individually, and the class with the largest predicted value is chosen as the final label.
The final results of the three algorithms are presented in Table 5, where each column gives the results for a specific bag size. Our method has higher accuracy than the other two methods on most multi-class datasets, achieving the best result in 23 of the 30 comparisons. In particular, on the shuttle and satimage datasets, LLP-ELM is superior to the other two methods for all bag sizes. Similar to the binary case, a larger bag size results in lower accuracy.
Table 6 shows the training times of the different methods, and we can see that LLP-ELM is much faster than the other two algorithms on the multi-class datasets. This is because the other two algorithms need to train K binary classifiers, while our method only needs to train a single multi-class classifier.
Furthermore, the relative mean training times on the multi-class datasets are presented in Figure 6, where the x-axis represents the bag size, to better show the advantage of our algorithm. Similar to the results on the binary datasets, our model is faster than the other two methods. Moreover, the advantage in relative training time is even larger than in the binary case, mainly because our algorithm handles the multi-class problem directly.

4.4. Caltech-101

To better show the advantage of our method on multi-class problems, we use the Caltech-101 dataset [26] to compare the different algorithms. Specifically, butterflies, sunflowers, leopards, dolphins, elephants, cars, cups, dollar bills, laptops and pianos are chosen, giving 10 categories in total. The images are resized to 240 × 320 pixels, and HOG [27] is used to extract image features of dimension 43,200. More specifically, HOG (Histogram of Oriented Gradients) computes a histogram of gradients, with each gradient quantized by its angle and weighted by its magnitude. The experimental setting is similar to the multi-class case: LLP-ELM handles multi-class classification directly, while the other two methods use the one-vs.-all strategy.
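For reference, a HOG descriptor of this kind can be computed with scikit-image roughly as follows; the cell and block parameters are illustrative guesses, so the resulting dimensionality may differ from the 43,200-dimensional feature used in the paper.

```python
from skimage.io import imread
from skimage.transform import resize
from skimage.feature import hog

def caltech_hog_feature(path):
    """Load an image as grayscale, resize it to 240 x 320 and return a flat HOG descriptor."""
    img = imread(path, as_gray=True)                      # grayscale image
    img = resize(img, (240, 320), anti_aliasing=True)     # fixed size as in the experiments
    return hog(img, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), feature_vector=True)
```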
The final results are shown in Figure 7, and we can observe that LLP-ELM outperforms the other two methods for all bag sizes. The performance of alter-∝SVM is relatively worse than that of the other two methods, and its accuracy is close to 10% when the bag size is 64.
Furthermore, the training times of the three algorithms on Caltech-101 are shown in Figure 8, where the different methods are denoted by different colors. We again report relative times obtained by taking the logarithm of the raw results. Similar to the two previous settings, our method has a clear advantage in training time over the other two algorithms.

5. Conclusions

In this paper, we present a fast algorithm for multi-class learning from label proportions, called LLP-ELM, which significantly reduces training time. In detail, we reshape the hidden layer output matrix $H$ and the training data target matrix $T$ of the extreme learning machine into new forms, such that they contain proportion information instead of real labels. The method achieves competitive or even better classification accuracy compared with several state-of-the-art algorithms, while its learning speed is several to hundreds of times faster, which can be valuable in many situations. Furthermore, our model is naturally multi-class and can solve the multi-class LLP problem directly without any modification. In conclusion, the proposed method is a good choice for multi-class learning from label proportions, which has many practical applications.

Author Contributions

F.Z. provided guidance for the research and revised the paper. J.L. designed the software, and B.W. wrote the first draft. Z.Q. proposed the ideas, and Y.S. gave suggestions on the paper.

Funding

This research was funded by the National Natural Science Foundation of China (grant numbers 91546201 and 61702099), a key project of the National Natural Science Foundation of China (grant number 71110107026) and the Fundamental Research Funds for the Central Universities in UIBE (grant number 16QD17).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Breiman, L. Random Forest. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
  2. Suykens, J.A.K.; Vandewalle, J. Least Squares Support Vector Machine Classifiers; Kluwer Academic Publishers: Norwell, MA, USA, 1999; pp. 293–300. [Google Scholar]
  3. Lecun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef] [Green Version]
  4. Zhu, J.; Zou, H.; Rosset, S.; Hastie, T. Multi-class AdaBoost. Stat. Interface 2006, 2, 349–360. [Google Scholar]
  5. Cui, L.; Zhang, J.; Chen, Z.; Shi, Y.; Yu, P.S. Inverse extreme learning machine for learning with label proportions. In Proceedings of the IEEE International Conference on Big Data, Boston, MA, USA, 11–14 December 2017; pp. 576–585. [Google Scholar]
  6. Yu, F.X.; Cao, L.; Merler, M.; Codella, N.; Chen, T.; Smith, J.R.; Chang, S.F. Modeling Attributes from Category-Attribute Proportions. In Proceedings of the 22nd ACM International Conference on Multimedia, Orlando, FL, USA, 3–7 November 2014; pp. 977–980. [Google Scholar]
  7. Mann, G.S.; Mccallum, A. Simple, robust, scalable semi-supervised learning via expectation regularization. In Proceedings of the 24th International Conference on Machine Learning, Corvalis, OR, USA, 20–24 June 2007; pp. 593–600. [Google Scholar]
  8. Ardehaly, E.M.; Culotta, A. Co-training for Demographic Classification Using Deep Learning from Label Proportions. arXiv 2017, arXiv:1709.04108. [Google Scholar]
  9. Lai, K.T.; Yu, F.X.; Chen, M.S.; Chang, S.F. Video Event Detection by Inferring Temporal Instance Labels. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 2251–2258. [Google Scholar]
  10. Tao, S.; Dan, S.; Oconnor, B. A Probabilistic Approach for Learning with Label Proportions Applied to the US Presidential Election. In Proceedings of the 2017 IEEE International Conference on Data Mining (ICDM), New Orleans, LA, USA, 18–21 November 2017; pp. 445–454. [Google Scholar]
  11. Liebig, T.; Stolpe, M.; Morik, K. Distributed traffic flow prediction with label proportions: From in-network towards high performance computation with MPI. In MUD’15 Proceedings of the 2nd International Conference on Mining Urban Data; ACM: Lille, France, 2015; Volume 1392, pp. 36–43. [Google Scholar]
  12. Hernández-González, J.; Inza, I.; Crisol-Ortíz, L.; Guembe, M.A.; Iñarra, M.J.; Lozano, J.A. Fitting the data from embryo implantation prediction: Learning from label proportions. Stat. Methods Med. Res. 2016, 27, 1056–1066. [Google Scholar] [CrossRef] [PubMed]
  13. Ding, Y.; Li, Y.; Yu, W. Learning from label proportions for SAR image classification. Eurasip J. Adv. Signal Process. 2017, 2017, 41. [Google Scholar] [CrossRef]
  14. Kuck, H.; de Freitas, N. Learning about individuals from group statistics. arXiv 2012, arXiv:1207.1393. [Google Scholar]
  15. Rüping, S. SVM Classifier Estimation from Group Probabilities. In Proceedings of the 27th International Conference on International Conference on Machine Learning, Haifa, Israel, 21–24 June 2010; pp. 911–918. [Google Scholar]
  16. Yu, F.X.; Liu, D.; Kumar, S.; Jebara, T.; Chang, S.F. ∝SVM for learning with label proportions. In Proceedings of the 30th International Conference on International Conference on Machine Learning, Atlanta, GA, USA, 16–21 June 2013; pp. 504–512. [Google Scholar]
  17. Wang, Z.; Feng, J. Multi-class learning from class proportions. Neurocomputing 2013, 119, 273–280. [Google Scholar] [CrossRef]
  18. Fish, B.; Reyzin, L. On the Complexity of Learning from Label Proportions. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, Melbourne, Australia, 19–25 August 2017; pp. 1675–1681. [Google Scholar]
  19. Fan, K.; Zhang, H.; Yan, S.; Wang, L.; Zhang, W.; Feng, J. Learning a generative classifier from label proportions. Neurocomputing 2014, 139, 47–55. [Google Scholar] [CrossRef]
  20. Wang, B.; Chen, Z.; Qi, Z. Linear Twin SVM for Learning from Label Proportions. In Proceedings of the 2015 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT), Singapore, 6–9 December 2015; pp. 56–59. [Google Scholar]
  21. Qi, Z.; Wang, B.; Meng, F.; Niu, L. Learning With Label Proportions via NPSVM. IEEE Trans. Cybern. 2017, 47, 3293–3305. [Google Scholar] [CrossRef] [PubMed]
  22. Qi, Z.; Fan, M.; Tian, Y.; Niu, L.; Yong, S.; Peng, Z. Adaboost-LLP: A Boosting Method for Learning with Label Proportions. IEEE Trans. Neural Netw. Learn. Syst. 2017, 29, 3548–3559. [Google Scholar] [PubMed]
  23. Shi, Y.; Cui, L.; Chen, Z.; Qi, Z. Learning from label proportions with pinball loss. Int. J. Mach. Learn. Cybern. 2017, 10, 187–205. [Google Scholar] [CrossRef]
  24. Huang, G.B.; Zhou, H.; Ding, X.; Zhang, R. Extreme learning machine for regression and multiclass classification. IEEE Trans. Syst. Man Cybern. Part B 2012, 42, 513–529. [Google Scholar] [CrossRef] [PubMed]
  25. Huang, G.B.; Zhu, Q.Y.; Siew, C.K. Extreme learning machine: Theory and applications. Neurocomputing 2006, 70, 489–501. [Google Scholar] [CrossRef]
  26. Li, F.F.; Fergus, R.; Perona, P. Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. In Proceedings of the 2004 Conference on Computer Vision and Pattern Recognition Workshop, Washington, DC, USA, 27 June–2 July 2004. [Google Scholar]
  27. Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–25 June 2005; pp. 886–893. [Google Scholar]
Figure 1. Illustration of multi-class learning from class proportions. Images of three categories (pandas, butterflies and dolphins) are divided into four disjoint bags. In each bag, the amount of each category is denoted by the size of a rectangle of the corresponding color, from which the proportion information is obtained. A multi-class classifier is then learned from the label proportions. On the right, pandas, butterflies and dolphins are separated according to this classifier.
Figure 2. The architecture of the extreme learning machine (ELM). It is a single-hidden-layer feed-forward network with three parts: input neurons, hidden neurons and output neurons. In particular, $h(x) = [h_1(x), \ldots, h_L(x)]$ is the nonlinear feature mapping of the ELM, with $h_j(x) = g(w_j \cdot x + b_j)$, and $\beta_j = [\beta_{j1}, \ldots, \beta_{jc}]^T$, $j = 1, \ldots, L$, are the output weights between the jth hidden node and the output nodes.
Figure 3. The solution process of learning from label proportions by extreme learning machine (LLP-ELM).
Figure 4. The mean classification accuracies of the different methods on the machine learning datasets. The x-axis represents the bag size and the y-axis the mean accuracy; different algorithms are denoted by different colors.
Figure 5. The mean relative training time of the different methods on the machine learning datasets. The x-axis represents the bag size and the y-axis the relative time.
Figure 6. The mean relative training time of the different methods on the multi-class datasets. The x-axis represents the bag size and the y-axis the relative time.
Figure 7. The performance of the different methods on Caltech-101, where the x-axis is the bag size and the y-axis the accuracy. The red, blue and green lines denote LLP-ELM, InvCal and alter-∝SVM, respectively.
Figure 8. The training time of the different methods on Caltech-101. The x-axis is the bag size and the y-axis the relative time.
Table 1. Binary datasets in the experiment.

Dataset        Size   Attributes
sonar          208    60
heart          270    13
vote           435    16
breast-cancer  683    10
credit-a       690    15
diabetes       768    8
pima-indian    768    8
splice         1000   60
a1a            1065   119
Table 2. The final results under the optimal parameters with bag sizes 2, 4, 8, 16, 32 and 64. Bold numbers denote the best accuracies.

Dataset        Method       2            4            8            16           32           64
sonar          InvCal       0.76 ± 0.12  0.70 ± 0.11  0.72 ± 0.11  0.65 ± 0.14  0.59 ± 0.12  0.50 ± 0.13
sonar          alter-∝SVM   0.74 ± 0.09  0.64 ± 0.09  0.51 ± 0.11  0.595 ± 0.06 0.53 ± 0.10  0.49 ± 0.13
sonar          LLP-ELM      0.91 ± 0.02  0.78 ± 0.03  0.74 ± 0.04  0.68 ± 0.09  0.58 ± 0.04  0.55 ± 0.07
heart          InvCal       0.80 ± 0.05  0.79 ± 0.04  0.81 ± 0.06  0.71 ± 0.11  0.75 ± 0.07  0.73 ± 0.14
heart          alter-∝SVM   0.81 ± 0.04  0.79 ± 0.03  0.80 ± 0.03  0.78 ± 0.11  0.66 ± 0.20  0.77 ± 0.07
heart          LLP-ELM      0.88 ± 0.02  0.84 ± 0.02  0.78 ± 0.03  0.75 ± 0.11  0.76 ± 0.04  0.74 ± 0.09
vote           InvCal       0.95 ± 0.03  0.94 ± 0.03  0.94 ± 0.04  0.92 ± 0.02  0.89 ± 0.04  0.84 ± 0.07
vote           alter-∝SVM   0.95 ± 0.01  0.94 ± 0.03  0.94 ± 0.02  0.95 ± 0.01  0.91 ± 0.06  0.89 ± 0.01
vote           LLP-ELM      0.98 ± 0.01  0.97 ± 0.01  0.96 ± 0.01  0.95 ± 0.01  0.91 ± 0.01  0.90 ± 0.05
breast-cancer  InvCal       0.95 ± 0.01  0.94 ± 0.01  0.95 ± 0.02  0.95 ± 0.03  0.95 ± 0.01  0.90 ± 0.05
breast-cancer  alter-∝SVM   0.96 ± 0.02  0.96 ± 0.01  0.96 ± 0.02  0.96 ± 0.02  0.96 ± 0.01  0.97 ± 0.01
breast-cancer  LLP-ELM      0.97 ± 0.00  0.97 ± 0.00  0.97 ± 0.00  0.96 ± 0.01  0.93 ± 0.02  0.92 ± 0.04
credit-a       InvCal       0.85 ± 0.02  0.85 ± 0.02  0.85 ± 0.03  0.82 ± 0.02  0.82 ± 0.03  0.77 ± 0.09
credit-a       alter-∝SVM   0.85 ± 0.02  0.85 ± 0.02  0.81 ± 0.05  0.82 ± 0.02  0.64 ± 0.14  0.76 ± 0.08
credit-a       LLP-ELM      0.89 ± 0.01  0.88 ± 0.02  0.83 ± 0.02  0.80 ± 0.03  0.74 ± 0.08  0.76 ± 0.06
diabetes       InvCal       0.75 ± 0.03  0.71 ± 0.05  0.73 ± 0.04  0.67 ± 0.05  0.66 ± 0.05  0.64 ± 0.03
diabetes       alter-∝SVM   0.76 ± 0.02  0.73 ± 0.03  0.71 ± 0.04  0.67 ± 0.03  0.66 ± 0.04  0.66 ± 0.05
diabetes       LLP-ELM      0.78 ± 0.01  0.78 ± 0.02  0.75 ± 0.01  0.72 ± 0.02  0.67 ± 0.02  0.68 ± 0.02
pima-indian    InvCal       0.76 ± 0.03  0.70 ± 0.05  0.72 ± 0.04  0.70 ± 0.06  0.66 ± 0.07  0.65 ± 0.03
pima-indian    alter-∝SVM   0.75 ± 0.03  0.73 ± 0.03  0.70 ± 0.03  0.67 ± 0.04  0.66 ± 0.03  0.65 ± 0.02
pima-indian    LLP-ELM      0.78 ± 0.00  0.77 ± 0.01  0.75 ± 0.01  0.73 ± 0.01  0.71 ± 0.03  0.56 ± 0.07
splice-scale   InvCal       0.79 ± 0.02  0.73 ± 0.02  0.73 ± 0.06  0.65 ± 0.03  0.63 ± 0.05  0.60 ± 0.04
splice-scale   alter-∝SVM   0.78 ± 0.04  0.74 ± 0.03  0.71 ± 0.04  0.65 ± 0.05  0.66 ± 0.04  0.56 ± 0.17
splice-scale   LLP-ELM      0.94 ± 0.01  0.82 ± 0.03  0.78 ± 0.02  0.69 ± 0.03  0.65 ± 0.03  0.60 ± 0.05
a1a            InvCal       0.82 ± 0.02  0.78 ± 0.03  0.77 ± 0.02  0.71 ± 0.05  0.74 ± 0.03  0.71 ± 0.05
a1a            alter-∝SVM   0.82 ± 0.02  0.79 ± 0.04  0.79 ± 0.03  0.72 ± 0.06  0.76 ± 0.02  0.75 ± 0.02
a1a            LLP-ELM      0.90 ± 0.00  0.85 ± 0.01  0.81 ± 0.01  0.76 ± 0.02  0.76 ± 0.02  0.75 ± 0.03
Table 3. Training time (seconds) of the different algorithms with bag sizes 2, 4, 8, 16, 32 and 64. Bold numbers denote the least training time.

Dataset        Method       2      4      8      16     32     64
sonar          InvCal       0.83   0.38   0.35   0.33   0.32   0.31
sonar          alter-∝SVM   1.66   1.19   0.68   0.53   0.42   0.37
sonar          LLP-ELM      0.03   0.02   0.02   0.02   0.02   0.02
heart          InvCal       0.50   0.40   0.33   0.33   0.32   0.31
heart          alter-∝SVM   2.16   1.51   1.01   0.78   0.67   0.61
heart          LLP-ELM      0.03   0.02   0.02   0.02   0.02   0.02
vote           InvCal       0.61   0.46   0.35   0.33   0.32   0.31
vote           alter-∝SVM   3.83   2.82   2.88   2.14   1.73   1.54
vote           LLP-ELM      0.05   0.04   0.04   0.04   0.04   0.03
breast-cancer  InvCal       1.46   0.58   0.41   0.35   0.32   0.31
breast-cancer  alter-∝SVM   7.82   5.71   5.25   4.95   4.05   4.02
breast-cancer  LLP-ELM      0.08   0.06   0.05   0.05   0.05   0.05
credit-a       InvCal       1.64   1.61   0.43   0.35   0.34   0.32
credit-a       alter-∝SVM   9.39   7.35   6.84   6.07   5.32   4.95
credit-a       LLP-ELM      0.08   0.06   0.06   0.05   0.06   0.06
diabetes       InvCal       1.90   0.63   0.43   0.35   0.33   0.31
diabetes       alter-∝SVM   13.32  11.19  9.24   8.05   6.86   6.19
diabetes       LLP-ELM      0.10   0.07   0.06   0.06   0.06   0.06
pima-indian    InvCal       1.97   0.65   0.43   0.35   0.33   0.31
pima-indian    alter-∝SVM   14.3   10.69  9.16   7.66   6.89   6.44
pima-indian    LLP-ELM      0.09   0.07   0.06   0.06   0.06   0.06
splice-scale   InvCal       4.28   1.32   0.57   0.38   0.32   0.31
splice-scale   alter-∝SVM   25.2   25.42  22.32  18.56  15.67  13.42
splice-scale   LLP-ELM      0.13   0.10   0.09   0.08   0.09   0.09
a1a            InvCal       3.24   3.96   1.17   0.48   0.37   0.35
a1a            alter-∝SVM   48.80  43.77  47.14  40.31  33.96  29.81
a1a            LLP-ELM      0.25   0.17   0.15   0.13   0.14   0.15
Table 4. Multi-class datasets in the experiment.

Dataset     Size   Attributes   Classes
shuttle     1000   9            7
connect-4   1000   126          3
protein     1000   375          3
dna         2000   180          3
satimage    4435   36           6
Table 5. The final results under the optimal parameters with bag sizes 2, 4, 8, 16, 32 and 64 on the multi-class datasets. Bold numbers denote the best accuracies.

Dataset     Method       2            4               8            16           32           64
shuttle     InvCal       0.81 ± 0.02  0.84 ± 0.02     0.86 ± 0.03  0.85 ± 0.02  0.81 ± 0.02  0.81 ± 0.02
shuttle     alter-∝SVM   0.88 ± 0.03  0.87 ± 0.03     0.89 ± 0.03  0.85 ± 0.06  0.81 ± 0.13  0.73 ± 0.08
shuttle     LLP-ELM      0.93 ± 0.01  0.92 ± 0.01     0.92 ± 0.01  0.92 ± 0.01  0.92 ± 0.02  0.92 ± 0.02
connect-4   InvCal       0.78 ± 0.01  0.79 ± 0.03     0.78 ± 0.03  0.70 ± 0.05  0.76 ± 0.03  0.79 ± 0.03
connect-4   alter-∝SVM   0.79 ± 0.03  0.79 ± 0.02     0.76 ± 0.03  0.74 ± 0.04  0.72 ± 0.03  0.73 ± 0.03
connect-4   LLP-ELM      0.94 ± 0.01  0.86 ± 0.01     0.81 ± 0.02  0.77 ± 0.02  0.75 ± 0.04  0.77 ± 0.02
protein     InvCal       0.53 ± 0.08  0.49 ± 0.05     0.50 ± 0.03  0.48 ± 0.06  0.52 ± 0.01  0.47 ± 0.03
protein     alter-∝SVM   0.54 ± 0.05  0.49 ± 0.04     0.48 ± 0.05  0.43 ± 0.05  0.41 ± 0.02  0.40 ± 0.02
protein     LLP-ELM      0.79 ± 0.02  0.66 ± 0.01     0.59 ± 0.02  0.55 ± 0.01  0.50 ± 0.02  0.50 ± 0.02
dna         InvCal       0.92 ± 0.01  0.79 ± 0.02     0.66 ± 0.02  0.73 ± 0.03  0.76 ± 0.04  0.72 ± 0.03
dna         alter-∝SVM   0.92 ± 0.01  0.92.85 ± 0.01  0.91 ± 0.02  0.86 ± 0.05  0.77 ± 0.07  0.68 ± 0.08
dna         LLP-ELM      0.98 ± 0.00  0.94 ± 0.00     0.89 ± 0.01  0.81 ± 0.02  0.77 ± 0.03  0.68 ± 0.04
satimage    InvCal       0.75 ± 0.01  0.76 ± 0.01     0.70 ± 0.04  0.76 ± 0.03  0.76 ± 0.01  0.75 ± 0.02
satimage    alter-∝SVM   0.80 ± 0.01  0.81 ± 0.01     0.81 ± 0.01  0.78 ± 0.04  0.59 ± 0.05  0.61 ± 0.09
satimage    LLP-ELM      0.90 ± 0.00  0.89 ± 0.00     0.89 ± 0.00  0.87 ± 0.00  0.84 ± 0.00  0.80 ± 0.01
Table 6. Training time (seconds) of the different algorithms with bag sizes 2, 4, 8, 16, 32 and 64 on the multi-class datasets. Bold numbers denote the least training time.

Dataset     Method       2       4      8      16      32     64
shuttle     InvCal       22.35   7.35   3.49   3.38    2.53   2.31
shuttle     alter-∝SVM   42.12   32.53  23.71  22.36   22.15  22.21
shuttle     LLP-ELM      0.19    0.09   0.08   0.08    0.08   0.08
connect-4   InvCal       7.80    2.93   1.56   1.21    0.99   0.96
connect-4   alter-∝SVM   18.88   16.54  13.58  12.12   10.75  9.54
connect-4   LLP-ELM      0.15    0.11   0.09   0.09    0.09   0.09
protein     InvCal       5.65    4.30   1.47   1.35    0.97   0.91
protein     alter-∝SVM   52.74   37.93  28.55  24.00   22.27  20.99
protein     LLP-ELM      0.18    0.15   0.14   0.13    0.13   0.13
dna         InvCal       19.43   21.98  6.03   1.47    1.05   0.91
dna         alter-∝SVM   100.54  95.54  99.67  109.08  93.19  88.06
dna         LLP-ELM      0.37    0.23   0.20   0.19    0.19   0.20
satimage    InvCal       19.55   6.41   10.73  7.70    4.77   3.57
satimage    alter-∝SVM   743     710    930    800     730    700
satimage    LLP-ELM      1.23    0.58   0.40   0.36    0.36   0.36
