Binary Classification with a Pseudo Exponential Model and Its Application for Multi-Task Learning

In this paper, we investigate the basic properties of binary classification with a pseudo model based on the Itakura–Saito distance and reveal that the Itakura–Saito distance is the unique appropriate measure for estimation with the pseudo model in the framework of general Bregman divergence. Furthermore, we propose a novel multi-task learning algorithm based on the pseudo model, built within the ensemble learning framework. We focus on a specific setting of multi-task learning for binary classification problems: the set of features is assumed to be common among all tasks, which are our targets of performance improvement. We consider a situation in which the shared structure among the datasets is represented by divergences between the underlying distributions associated with the multiple tasks. We discuss statistical properties of the proposed method and investigate its validity with numerical experiments.


Introduction
In the framework of multi-task learning, we assume that there are multiple related tasks (datasets) sharing a common structure, and we can utilize that shared structure to improve the generalization performance of classifiers across the tasks [1,2]. This framework has been successfully employed in various kinds of applications, such as medical diagnosis. Most methods exploit the similarity among tasks to improve classifier performance by representing the shared structure as a regularization term [3,4]. We tackle this problem with a boosting method, which makes it possible to adaptively learn complicated problems at low computational cost. Boosting methods are notable implementations of ensemble learning and construct a better classifier by combining weak classifiers. AdaBoost is the most popular boosting method, and many variants, including TrAdaBoost for multi-task learning [5], have been developed. In face recognition [6], as well as web search ranking [7], the computational efficiency of boosting has attracted attention in the multi-task learning setting.
In this paper, we first reveal that AdaBoost can be derived by sequential minimization of the Itakura-Saito (IS) distance between an empirical distribution and a pseudo measure model associated with a classifier. The IS distance is a special case of the Bregman divergence [8] between two positive measures and is frequently used for non-negative matrix factorization (NMF) in the field of signal processing [9,10]. Second, we propose a novel boosting algorithm for multi-task learning based on the IS distance. We utilize the IS distance as a discrepancy measure between the pseudo models associated with the tasks and incorporate it as a regularizer into AdaBoost. The proposed method can capture the shared structure, i.e., the relationship between underlying distributions, by considering the IS distance between pseudo models constructed from classifiers. We discuss the statistical properties of the proposed method and investigate the validity of the regularization by the IS distance with experiments on synthetic datasets and a real dataset.
This paper is organized as follows. In Section 2, the basic settings are described, and a divergence measure is introduced. In Section 3, we briefly introduce the IS distance, a special case of the Bregman divergence, and investigate the relationship between the well-known ensemble algorithm AdaBoost and estimation with a pseudo model using the Itakura-Saito distance. In Section 4, we propose a method for multi-task learning derived from minimization of a weighted sum of divergences, and the performance of the proposed methods is examined in Section 5 using a synthetic dataset and a real dataset (a short version of this article was presented as a conference paper [11]; some theoretical results and numerical experiments have been added to the current version).

Settings
In this study, we focus on binary classification problems. Let x be an input and y ∈ Y = {±1} be a class label. Let us assume that J datasets D_j = {(x_i^j, y_i^j)}_{i=1}^{n_j} (j = 1, . . ., J) are given, and let p_j(y|x)r_j(x) and p̃_j(y|x)r̃_j(x) be the underlying distribution and the empirical distribution associated with the dataset D_j, respectively. Here, we assume that each conditional distribution of y given x is written as

p_k(y|x) = p_0(y|x) + δ_k(x) y,

where p_0(y|x) is a common conditional distribution for all datasets and δ_k(x) is a term specific to the dataset D_k. Note that Σ_{y∈Y} δ_k(x) y = 0 holds, because p_k(y|x) is a probability distribution. While a discriminant function F_k is usually constructed using only the dataset D_k, multi-task learning aims to improve the performance of the discriminant function for each dataset D_k with the help of the datasets D_j (j ≠ k). For this purpose, we consider a risk minimization problem defined with a pseudo model and the Itakura-Saito (IS) distance, a discrepancy measure frequently used in signal processing.
Let M = {m(y) : 0 ≤ m(y), Σ_{y∈Y} m(y) < ∞} be the space of all positive finite measures over Y. The Itakura-Saito distance between p, q ∈ M is defined as

IS(p, q; r) = ∫ r(x) Σ_{y∈Y} { p(y|x)/q(y|x) − log(p(y|x)/q(y|x)) − 1 } dx,

where r(x) is a marginal distribution of x shared by p, q ∈ M. Note that the IS distance is a statistical version of the Bregman divergence [12], which makes it possible to directly plug in the empirical distribution. We observe that IS(p, q; r) ≥ 0, and IS(p, q; r) = 0 if and only if p = q. Banerjee et al. [13] showed that there exists a unique Bregman divergence corresponding to every regular exponential family, and the Itakura-Saito distance is the one associated with the exponential distribution.
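As a concrete illustration, the integrand of the IS distance at a fixed x is Σ_y { p(y)/q(y) − log(p(y)/q(y)) − 1 }. The following sketch (our own toy example; the specific measure values are not from the paper) numerically checks the stated properties IS(p, q; r) ≥ 0 with equality if and only if p = q:

```python
import math

def is_distance_at_x(p, q):
    """Itakura-Saito discrepancy between two positive measures on Y = {-1, +1}
    at a fixed input x: sum_y { p(y)/q(y) - log(p(y)/q(y)) - 1 }."""
    total = 0.0
    for y in (-1, +1):
        ratio = p[y] / q[y]
        total += ratio - math.log(ratio) - 1.0
    return total

# The measures need not be normalized: they live in the space M, not in the
# probability simplex.
p = {-1: 0.3, +1: 1.2}
q = {-1: 0.5, +1: 0.8}

assert is_distance_at_x(p, p) == 0.0   # IS(p, p) = 0
assert is_distance_at_x(p, q) > 0.0    # IS(p, q) > 0 whenever p != q
```

Each summand r − log r − 1 (with r = p/q) is non-negative and vanishes only at r = 1, which is what makes the IS distance a valid discrepancy between un-normalized measures.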

Parameter Estimation with the Pseudo Model
Let q_F(y|x) be an (un-normalized) pseudo model associated with a function F(x),

q_F(y|x) = exp(y F(x)). (3)

Note that q_F(y|x) is not a probability function, i.e., Σ_{y∈Y} q_F(y|x) ≠ 1 in general. If q_F(y|x) is normalized, the model reduces to the classical logistic model:

q̄_F(y|x) = exp(y F(x)) / (exp(F(x)) + exp(−F(x))). (4)

When the function F is parameterized by θ, the maximum likelihood estimation (MLE) argmax_θ Σ_{i=1}^n log q̄_F(y_i|x_i), or equivalently minimization of the (extended) Kullback-Leibler (KL) divergence, is a powerful tool for the estimation of θ, and the MLE has properties such as asymptotic consistency and efficiency under some regularity conditions. Here, we consider parameter estimation with the pseudo model Equation (3) rather than the normalized model Equation (4).

Proposition 1. Let p(y|x) = q_{F_0}(y|x) be the underlying distribution. Then, we observe:

Proof. See Appendix A.

On the other hand, when we consider an estimation based on the extended KL divergence, i.e., argmin_F KL(p, q_F; r), where

KL(p, q; r) = ∫ r(x) Σ_{y∈Y} { p(y|x) log(p(y|x)/q(y|x)) − p(y|x) + q(y|x) } dx,

we observe the following.
Proposition 2. Let F_0 be a function (F_0 ≠ 0) and p(y|x) = q_{F_0}(y|x) be the underlying distribution. Then, we observe:

Proof. See Appendix B.
Remark 1. Let p(y|x) = q_{F_0}(y|x) be the underlying distribution. Then, the minimizer in Equation (8) or (9) of the extended KL divergence attains the Bayes rule.

The proposition and the remark show that the extended KL divergence is not completely appropriate for estimation with the pseudo model.

Characterization of the Itakura-Saito Distance
In this section, we investigate the characterization of the Itakura-Saito distance for estimation with the pseudo model, in the framework of the Bregman U-divergence. First, we briefly introduce the statistical version of the Bregman U-divergence [12]. The statistical version of the Bregman U-divergence is a discrepancy measure between positive measures in M defined by a generating function U, and it enables us to directly plug in the empirical distribution for estimation. The authors of [12] proposed a general boosting-type algorithm for classification using the Bregman U-divergence and discussed properties of the method from the viewpoint of information geometry [14]. By changing the generating function U, the Bregman U-divergence can acquire useful properties such as robustness against noise. For example, the β-divergence is a special case of the Bregman U-divergence and is frequently used for robust estimation in the context of unsupervised learning, such as clustering or component analysis [15,16]. Another example of the Bregman U-divergence is the η-divergence, which is employed to robustify classification algorithms and is closely related to probability models of mislabeling [17,18].
Let U be a monotonically-increasing convex function and ξ be the inverse function of U′, the derivative of U. From the convexity of the function U, the function ξ is monotonically increasing. The statistical version of the Bregman U-divergence between two measures p, q ∈ M is defined as follows:

D_U(p, q; r) = ∫ r(x) Σ_{y∈Y} { U(ξ(q(y|x))) − U(ξ(p(y|x))) − p(y|x) (ξ(q(y|x)) − ξ(p(y|x))) } dx. (11)

Note that the function ξ should be defined at least on z > 0.
Remark 2. The KL divergence and the Itakura-Saito distance are special cases of the Bregman U-divergence Equation (11), with generating functions U(z) = exp(z) and U(z) = −log(c − z) + c_1 (z < c), respectively, where c and c_1 are constants.
Here, we introduce the concept of reflection symmetry for the characterization of the IS distance.

Definition. A function f is said to be reflection-symmetric if f(z) = f(z^{−1}) holds for all z ≠ 0.
If the function f is reflection-symmetric, we observe that:

Because of this property, a reflection-symmetric function often has a singular point at z = 0, and to investigate the behavior of the function, we can employ the Laurent series

f(z) = Σ_{k=0}^∞ a_k z^k + Σ_{k=1}^∞ b_k z^{−k}.

Note that if the function f is holomorphic over R, then b_k = 0 for all k, and the Laurent series is equivalent to the Taylor series.
Remark 3. If the function f is reflection-symmetric and holomorphic over R, then a_k = b_k = 0 holds for all k ≥ 1, and f is a constant function.
For the Bregman U-divergence Equation (11), we observe the following lemma.

Lemma 4. Let F_0 be an arbitrary function, p(y|x) = q_{F_0}(y|x) be the underlying distribution and q_F(y|x) be the pseudo model Equation (3). If the Bregman U-divergence associated with the function U attains:

then the function ξ′(z)z² derived from U is reflection-symmetric. In addition, if the Bregman U-divergence associated with the function U attains:

Proof. See Appendix C.

Remark 4. Proposition 1 implies that the function ξ associated with the IS distance satisfies Lemma 4.
Remark 5. The propositions imply that the function U, i.e., the Bregman U-divergence, attaining Equation (15) or (16) is not unique, and there exist divergences other than the Itakura-Saito distance satisfying Equation (15) or (16). For example, for the function:

ξ′(z)z² is reflection-symmetric. The associated generating function U is written as:

where C_1 is a constant.
In the following theorem, we reveal the characterization of the Itakura-Saito distance for estimation with the pseudo model Equation (3) and the Bregman U-divergence.
Proof. See Appendix D.
Remark 6. If we assume that the function ξ′(z)z² derived from U is reflection-symmetric and holomorphic over R, then ξ′(z)z² is a constant function by Remark 3. Then, we obtain ξ(z) = c + b_1/z, where c and b_1 are constants, implying that the associated divergence is equivalent to the Itakura-Saito distance.

Relationship with AdaBoost
The IS distance between the underlying conditional distribution p(y|x) and the pseudo model q_F(y|x) is written as:

IS(p, q_F; r) = ∫ r(x) Σ_{y∈Y} p(y|x) exp(−y F(x)) dx + C, (21)

where C is a constant, and Equation (21) is equivalent to the expected loss of AdaBoost except for the constant term. Then, sequential minimization of an empirical version of Equation (21) is equivalent to the algorithm of AdaBoost, the most popular boosting method for binary classification. Furthermore, [12,19] showed that a gradient-based boosting algorithm can be derived from the minimization of the KL divergence or the Bregman U-divergence between the underlying distribution and a pseudo model. An important difference between those frameworks and our framework Equation (21) is the pseudo model employed. The pseudo model employed by the previous frameworks assumes a condition called the "consistent data assumption" and is defined with the empirical distribution, implying that the pseudo model varies depending on the dataset. On the other hand, the pseudo model Equation (3) employed in Equation (21) is fixed against the dataset, as usual statistical models are.
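To make the equivalence concrete, sequentially minimizing the empirical exponential loss (1/n) Σ_i exp(−y_i F(x_i)) yields the familiar AdaBoost updates. The sketch below (our own minimal implementation with decision stumps on a toy one-dimensional dataset; not the paper's experimental code) illustrates this:

```python
import math

def stump(threshold, sign):
    """Decision stump f(x) = sign * (+1 if x > threshold else -1)."""
    return lambda x: sign * (1 if x > threshold else -1)

def adaboost(xs, ys, T=10):
    """AdaBoost as sequential minimization of the empirical exponential loss
    (1/n) * sum_i exp(-y_i F(x_i)), i.e., the empirical IS risk up to a constant."""
    n = len(xs)
    w = [1.0 / n] * n                      # example weights
    combined = []
    for _ in range(T):
        # Pick the stump with the smallest weighted error rate.
        best, best_err = None, 1.0
        for thr in xs:
            for s in (+1, -1):
                f = stump(thr, s)
                err = sum(wi for wi, x, y in zip(w, xs, ys) if f(x) != y)
                if err < best_err:
                    best, best_err = f, err
        eps = max(best_err, 1e-12)         # guard against log(0)
        alpha = 0.5 * math.log((1 - eps) / eps)
        combined.append((alpha, best))
        # Multiplicative weight update, then renormalize.
        w = [wi * math.exp(-alpha * ys[i] * best(xs[i])) for i, wi in enumerate(w)]
        Z = sum(w)
        w = [wi / Z for wi in w]
    return lambda x: sum(a * f(x) for a, f in combined)

xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [-1, -1, -1, +1, +1, +1]
F = adaboost(xs, ys, T=5)
assert all((1 if F(x) > 0 else -1) == y for x, y in zip(xs, ys))
```

The weight update w_i ∝ w_i exp(−α y_i f(x_i)) is exactly a coordinate step of the exponential-loss minimization, which is what the IS-distance view of Equation (21) recovers.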
The IS distance between two pseudo models q_F(y|x) and q_{F′}(y|x) is written as:

IS(q_{F′}, q_F; r) = ∫ r(x) Σ_{y∈Y} { exp(y(F′(x) − F(x))) − y(F′(x) − F(x)) − 1 } dx = ∫ r(x) { 2 cosh(F′(x) − F(x)) − 2 } dx. (22)

Note that IS(q_{F′}, q_F; r) = IS(q_F, q_{F′}; r) holds for arbitrary q_F and q_{F′}, while the IS distance itself is not necessarily symmetric. Furthermore, note that this symmetry does not hold for the normalized models q̄_F and q̄_{F′}.

Application for Multi-Task Learning
There are two main types of frameworks for multi-task learning [20,21].
Case 1: There is a target dataset D_k, and our interest is to construct a discriminant function F_k utilizing the remaining datasets D_j (j ≠ k) or a priori constructed discriminant functions F_j (j ≠ k).

Case 2: Our interest is to simultaneously construct better discriminant functions F_1, . . ., F_J using all J datasets D_1, . . ., D_J by utilizing shared information among the datasets.

Case 1
In this section, we focus on the first framework above. Let us assume that discriminant functions F_j(x) (j ≠ k) are given or have been constructed by an arbitrary binary classification method. Then, let us consider the risk function:

L_k(F_k) = IS(p_k, q_{F_k}; r_k) + Σ_{j≠k} λ_{k,j} IS(q_{F_j}, q_{F_k}; r_k), (23)

where λ_{k,j} ≥ 0 (j ≠ k) are regularization constants. Note that the risk function depends on the functions F_j (j ≠ k), and the second term becomes small when the target discriminant function F_k is similar to the functions F_j (j ≠ k) in the sense of the IS distance; the second term thus corresponds to a regularizer incorporating the shared information among datasets into the target function F_k. Furthermore, note that the marginal distribution r_k is shared in the second term for ease of implementation and simplicity of theoretical analysis. An empirical version of Equation (23) is written as:

An algorithm is derived by sequential minimization of Equation (24), updating F_k to F_k + αf, i.e., (α, f) = argmin_{α,f} L̂_k(F_k + αf), where f is a weak classifier and α is a coefficient [22].
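A sketch of evaluating the regularized empirical risk may clarify the trade-off. The closed forms used here (exp(−yF) data term from Equation (21), and 2 cosh(F_k − F_j) − 2 for the IS distance between pseudo models from Equation (22)) are our reading of Equations (23)-(24), not a verbatim copy of the paper's expressions; the functions and the weight λ are illustrative assumptions:

```python
import math

def empirical_risk(F_k, others, lambdas, xs, ys):
    """Empirical Case 1 risk: exponential loss on the target dataset plus
    IS-distance regularizers pulling F_k toward the other tasks' functions."""
    n = len(xs)
    data_term = sum(math.exp(-y * F_k(x)) for x, y in zip(xs, ys)) / n
    reg_term = 0.0
    for lam, F_j in zip(lambdas, others):
        # IS(q_{F_j}, q_{F_k}; r_k) evaluated at the sample points.
        reg_term += lam * sum(2 * math.cosh(F_k(x) - F_j(x)) - 2 for x in xs) / n
    return data_term + reg_term

xs = [-1.0, -0.3, 0.4, 1.2]
ys = [-1, -1, +1, +1]
F_target = lambda x: 2.0 * x       # candidate discriminant function (assumed)
F_related = lambda x: 2.1 * x      # a related task's function (similar)
F_unrelated = lambda x: -2.0 * x   # an unrelated task's function (dissimilar)

# The regularizer is small for similar functions and large for dissimilar ones.
r_similar = empirical_risk(F_target, [F_related], [0.5], xs, ys)
r_dissimilar = empirical_risk(F_target, [F_unrelated], [0.5], xs, ys)
assert r_similar < r_dissimilar
```

Setting every λ_{k,j} = 0 leaves only the data term, which is exactly the empirical exponential loss of AdaBoost.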
(1) Initialize the function to F_k^0, and define weights for the i-th example with a function F as:

where:

(2) For t = 1, . . ., T:
(a) Select a weak classifier f_k^t ∈ {±1} that minimizes the following quantity:

where

In Step 1, F_k^0 is typically initialized as F_k^0(x) = 0. The quantity Equation (25) is a mixture of two terms: ε_1(f) is a weighted error rate of the classifier f, and ε_2(f) is the sum of the weights w_2(f), which represents the degree of discrepancy between f and F − F_j. ε_2(f) becomes large when F is updated by f so as to depart from F_j. Note that if we set λ_{k,j} = 0 for all j, the risk function Equation (24) coincides with that of AdaBoost, and the derived algorithm reduces to the usual AdaBoost.
Because the empirical risk function Equation (24) is convex with respect to F and F′, we can consider another version of the risk function:

where F̄_k(x) = Σ_{j≠k} (λ_{k,j}/λ_k) F_j(x). This risk function is upper bounded by the risk function Equation (24), implying that the effect of regularization by the shared information is weakened. The derived algorithm is almost the same as the one derived from Equation (24).

Case 2
In this section, we consider the simultaneous construction of discriminant functions F_1, . . ., F_J by minimizing the following risk function:

L(F_1, . . ., F_J) = Σ_{j=1}^J π_j L_j(F_j), (27)

where π_j (j = 1, . . ., J) is a positive constant satisfying Σ_{j=1}^J π_j = 1 and L_k is defined in Equation (23). Though we can directly minimize the empirical version of Equation (27), the derived algorithm is complicated and computationally heavy. We therefore derive a simplified algorithm utilizing the algorithm shown in Case 1, in which a target dataset is fixed.
Note that the empirical risk function cannot be monotonically decreased, because the minimization of L_k(F_k) involves a trade-off between the first term and the second, regularization term, and a decrease of L_k(F_k) does not necessarily mean a decrease of the regularization term.

Statistical Properties of the Proposed Methods
In this section, we discuss the statistical properties of the proposed methods. First, we focus on Case 1; the minimizer F*_k of the risk function Equation (23) satisfies the following:

which implies:

or equivalently:

This can be interpreted as a probabilistic model of asymmetric mislabeling [17,18]. In Equation (29), the confidence of classification is discounted by the results of the remaining discriminant functions when the classifier sgn(F*_k(x)) makes a decision different from those of sgn(F_j(x)) (j ≠ k).
Proof. We obtain Equation (32) by considering the Taylor expansion of Equation (29).
We observe that the discrepancy induced by δ_k is moderated by the mixture of the ǫ_j when the perturbations ǫ_j are independently and identically distributed.

Proposition 7. Let η_j(x) = F_j(x) − F_k(x) be the difference between two functions. Then, F*_k can be approximated as:

Proof. See Appendix E.
Proposition 8. Let F*_k be the minimizer of the risk function Equation (23) with λ_{k,j} ≠ 0 (j ≠ k). Then, we observe:

i.e., the proposed method improves the performance in the sense of the squared error when:

holds.
Proof.See Appendix F.
Second, we consider a property of the algorithm for Case 2.
Proposition 9. Let r(x) = r_j(x) (j = 1, . . ., J) be a common marginal distribution shared by all tasks. Then, the minimizer of the risk function is written as:

where

Proof. See Appendix G.
The only difference from Equation (28) is that the regularization is strengthened by π_j π_k λ_{j,k}; hence the same propositions as in Section 4.1 hold for Equation (36).

Comparison of Regularization Terms
The proposed method incorporates the regularization term defined by the IS distance into AdaBoost. In this section, we discuss a property of the regularization term.
Proposition 10. Let ǫ(x) be a perturbation function satisfying |ǫ(x)| ≪ 1. Then, we observe:

Proof. We obtain these approximations by considering the Taylor expansion up to second order.

Figure 1 shows the values of the divergences against the value of q̄_F(x). These relations imply that the KL divergence Equation (37) emphasizes the region of inputs x whose conditional distribution q̄_F(x) is nearly equal to 1/2, i.e., the classification boundary, while the IS distance Equation (39) focuses on the region of x whose conditional distribution is nearly equal to zero or one. The IS distance between pseudo models, Equation (40), i.e., the proposed method, is intermediate between Equations (37) and (39). This implies that the regularization Equation (40) with the IS distance puts more focus on the region far from the classification boundary compared with Equation (37), while Equation (39) tends to relatively ignore the region near the classification boundary. Furthermore, note that the employment of Equation (40) makes it possible to derive the simple algorithm shown in Section 4.1.
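This qualitative comparison can be checked numerically. The sketch below is our own construction, not the paper's Equations (37)-(40): it perturbs the discriminant function F by a small ε at a point where the normalized model assigns probability u to y = +1, and compares the resulting KL divergence between normalized models, IS distance between normalized models, and IS distance between pseudo models (the latter equals 2 cosh(ε) − 2, independent of u):

```python
import math

def divergences(u, eps):
    """Local sensitivity of three divergences to a small perturbation eps of F,
    at a point where the logistic model gives probability u to y = +1."""
    F = 0.5 * math.log(u / (1 - u))   # invert u = e^F / (e^F + e^-F)
    p = {+1: u, -1: 1 - u}
    Fp = F + eps
    q = {y: math.exp(y * Fp) / (math.exp(Fp) + math.exp(-Fp)) for y in (+1, -1)}
    kl = sum(p[y] * math.log(p[y] / q[y]) for y in (+1, -1))
    is_norm = sum(p[y] / q[y] - math.log(p[y] / q[y]) - 1 for y in (+1, -1))
    is_pseudo = 2 * math.cosh(eps) - 2   # IS distance between pseudo models
    return kl, is_norm, is_pseudo

kl_mid, is_mid, ps_mid = divergences(0.5, 1e-3)    # near the boundary
kl_far, is_far, ps_far = divergences(0.95, 1e-3)   # far from the boundary

assert kl_far < kl_mid                 # KL emphasizes the boundary region
assert is_far > is_mid                 # normalized IS emphasizes q near 0 or 1
assert abs(ps_far - ps_mid) < 1e-12    # pseudo-model IS weights both alike
```

Under this perturbation, the pseudo-model IS term sits between the two extremes: unlike KL it does not vanish away from the boundary, and unlike the normalized IS it does not grow there.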

Experiments
In this section, we investigate the performance of the proposed multi-task algorithm with synthetic datasets and a real dataset.

Synthetic Dataset
First, we investigate the performance of the proposed method using two synthetic datasets in the situation described in Case 2. We compared the proposed method with AdaBoost trained on each individual dataset and AdaBoost trained on all datasets simultaneously. We employed the boosting stump (a decision tree with only one node) as the weak classifier and fixed π_j = 1/J. A boosting-type method has a hyper-parameter T, the number of boosting steps, and the proposed method additionally has the hyper-parameters λ_{k,j}. In the experiment, we determined the parameters T and λ_{k,j} by validation. In particular, we investigated two scenarios for the determination of λ_{k,j}.
1. We set λ_{k,j} = λ for all j, k and determined λ.
2. We set λ_{k,j} = λ IS(q_{F̃_k}, q_{F̃_j}; r_k), where F̃_j is a discriminant function constructed by AdaBoost with the dataset D_j, and determined λ.
Scenario 2 can incorporate more detailed information about the relationship between tasks, and the proposed method can ignore the information of tasks having little shared information. In summary, we compared the following four methods:

A: The proposed method with λ_{k,j} determined by Scenario 1.
B: The proposed method with λ_{k,j} determined by Scenario 2.
C: AdaBoost trained on each individual dataset.
D: AdaBoost trained on all datasets simultaneously.

We utilized 80% of the training dataset for training the classifiers and the remaining 20% for validation. We repeated the above procedure 20 times and observed the averaged performance of the methods.

Dataset 1
We set the number J of tasks to three and assume that the marginal distribution of x is the uniform distribution on [−1, 1]², and that the discriminant function F_j (j = 1, 2, 3) associated with each dataset is generated by F_j(x) = (1 + c_{j,2})(x_1 − c_{j,1}) − x_2, where c_{j,1} ∼ N(0, 0.2²) and c_{j,2} ∼ N(0, 0.1²). In addition, we randomly added contamination noise on the label y. Under these settings, we generated a training dataset of 400 examples and a test dataset of 600 examples. The generated datasets are shown in Figure 2. We observe that each discriminant function and noise structure differs from the other two. Figure 3 shows boxplots of the test errors of each method for the datasets D_j (j = 1, 2, 3). We observe that the proposed method consistently outperforms individually trained AdaBoost and AdaBoost trained on all datasets simultaneously. The figure shows that the proposed method can incorporate shared information among datasets into the classifiers.

Figure 3. Boxplots of the test error of each method: A, proposed method with λ in Scenario 1; B, proposed method with λ in Scenario 2; C, AdaBoost trained with the individual dataset; D, AdaBoost trained with all datasets simultaneously; for three datasets, over the 20 simulation trials.
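The Dataset 1 generation process can be sketched as follows. The label-flip rate is our assumption (the text only states that contamination noise was added), and the use of Python's `random` module is an illustrative choice:

```python
import random

def make_task_dataset(n, seed, flip_rate=0.05):
    """One synthetic task following the Dataset 1 description:
    x uniform on [-1, 1]^2, labels from F_j(x) = (1 + c_{j,2})(x_1 - c_{j,1}) - x_2
    with c_{j,1} ~ N(0, 0.2^2) and c_{j,2} ~ N(0, 0.1^2).
    flip_rate is an assumed contamination level, not a value from the paper."""
    rng = random.Random(seed)
    c1 = rng.gauss(0.0, 0.2)
    c2 = rng.gauss(0.0, 0.1)
    data = []
    for _ in range(n):
        x1, x2 = rng.uniform(-1, 1), rng.uniform(-1, 1)
        y = 1 if (1 + c2) * (x1 - c1) - x2 > 0 else -1
        if rng.random() < flip_rate:   # contamination noise on the label
            y = -y
        data.append(((x1, x2), y))
    return data

# J = 3 related tasks: same functional form, task-specific c_{j,1}, c_{j,2}.
tasks = [make_task_dataset(400, seed=j) for j in range(3)]
assert len(tasks) == 3 and all(len(D) == 400 for D in tasks)
assert all(y in (-1, +1) for D in tasks for (_, y) in D)
```

Because all three tasks share the same linear form and differ only through the small random shifts c_{j,1} and c_{j,2}, their decision boundaries are similar but not identical, which is the structure the IS regularizer is meant to exploit.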

Dataset 2
We set the number J of tasks to six and assume that the marginal distribution of x is the uniform distribution on [−1, 1]². The discriminant functions associated with each dataset are generated by:

where c_{j,1} ∼ N(0, 0.1²) and c_{j,2} ∼ N(0, 0.1²). In addition, we randomly added contamination noise on the label y. Under these settings, we generated a training dataset of 400 examples and a test dataset of 600 examples. The generated datasets are shown in Figure 4. We observe that Datasets 1, 2 and 3 share one structure, and Datasets 4, 5 and 6 share another. Figure 5 shows boxplots of the test errors of each method for the datasets D_j (j = 1, . . ., 6). We omitted the result of AdaBoost trained on all datasets simultaneously (D) from the figure, because its performance is significantly worse than those of the other methods: the median of its classification errors is around 0.5. This is because the structures of Datasets 1, 2, 3 and Datasets 4, 5, 6 are opposite, so the labeling of the concatenated dataset appears random. We observe that the proposed method with Scenario 2 (B) improves performance over individually trained AdaBoost (C) and the proposed method with Scenario 1 (A). This is because the structure shared among Datasets 1, 2 and 3 carries no information about Datasets 4, 5 and 6 (and vice versa), and Method (B) can ignore the influence of the irrelevant information by adjusting λ_{k,j} in response to IS(q_{F_j}, q_{F_k}; r_k). Note that the performance of Method (A) is not badly degraded, because the regularization parameter λ_{k,j} was determined to be nearly zero, reducing to AdaBoost trained with the individual dataset.
Figure 6 shows examples of classification boundaries estimated by Methods A, B, C and D, for Dataset 6.

Real Dataset: School Dataset
In this section, we compared the proposed method (Scenario 2) with a binary decision-tree-based ensemble method called extremely randomized trees (ExtraTrees) [23], applied to a real dataset, "school data", collected by the Inner London Education Authority [24]. The dataset consists of examination records of 15,362 students from 139 secondary schools, i.e., we had 139 tasks. The dimension of the input x is 27, in which the originally categorical variables were transformed into dummy variables. The original target variable y_0 represents score values in the range [1, 70], and we transformed y_0 into a binary variable. We set the threshold to 20 to balance the ratio of classes (−1 : +1 = 7930 : 7432). We randomly divided the dataset of each task into 80% training data and 20% test data. In addition, we used 20% of the divided training dataset as a validation dataset to determine the hyper-parameter λ and the step number T. We repeated the above procedure 20 times and observed the average performance of the methods. Figure 7 shows the medians of the error rates over the 20 trials for the proposed method and ExtraTrees on the 139 tasks. The horizontal axis indicates the index of a task, ranked in increasing order of the median error rate of ExtraTrees. We observe that the proposed method is comparable to ExtraTrees and has an advantage especially for tasks on which the error rates of ExtraTrees are large.

Conclusions
In this paper, we investigated the properties of binary classification with the pseudo model and revealed that minimization of the Itakura-Saito distance between the empirical distribution and the pseudo model is equivalent to AdaBoost and provides properties suitable for binary classification. In addition, we pointed out that the Itakura-Saito distance is the unique divergence having a suitable property for estimation with the pseudo model in the framework of the Bregman divergence. Based on this framework, we proposed a novel binary classification method for multi-task learning, which incorporates information shared among tasks into the targeted task. The risk function of the proposed method is defined by a mixture of IS distances. The IS distance between pseudo models can be interpreted as a regularization term incorporating shared information among tasks into the binary classifier for the target task. We investigated the statistical properties of the risk function and derived computationally feasible boosting-based algorithms. Furthermore, we considered a mechanism for adjusting the degree of information sharing and numerically investigated the validity of the proposed methods.

Proof. We can express the function ξ(z) by a Laurent series as:

Then, we have:

Because of the assumption of reflection symmetry for z²ξ′(z) and Lemma 11, we have b_2 = 0 and k a_k = −(k + 2) b_{k+2} for all k ≥ 1. Thus, we obtain:

Then, we have:

From Equation (48) and the assumption of the reflection symmetry of the function z(ξ(z) − ξ(z/(z + z^{−1}))), we observe that, for all z,

where:

Since {h_k(z)}_{k=1}^∞ is functionally independent, we conclude that a_k = 0 for all k ≥ 1 or, equivalently, ξ(z) = c + b_1/z.
We now give a proof for Theorem 5 using Lemma 12.
Proof. If conditions Equations (19) and (20) hold, the functions ξ′(z)z² and z(ξ(z) − ξ(z/(z + z^{−1}))) are both reflection-symmetric from Lemma 4. From Lemma 12, the reflection symmetry of these two functions implies ξ(z) = b_1/z + c. Since the function ξ should be defined on z > 0, the generating function U derived from ξ is written as:

F. Proof of Proposition 8
We observe that:

Figure 2. The three generated datasets and decision boundaries.

Figure 4. The six generated datasets and decision boundaries.

Figure 6. Examples of classification boundaries estimated by Methods A, B, C and D for Dataset 6.
Figure 5. Boxplots of the test error of each method: A, proposed method with λ in Scenario 1; B, proposed method with λ in Scenario 2; C, AdaBoost trained with the individual dataset; for six datasets, over the 20 simulation trials.

Figure 7 .
Figure 7. Medians of error rates by the proposed method and extremely randomized trees (ExtraTrees) for the 139 tasks. The horizontal axis represents the index of a task, and the vertical axis indicates the median of error rates over 20 trials. Tasks are ranked in increasing order of the median error rate of ExtraTrees.