Open Access: this article is freely available and re-usable.

*Entropy* **2015**, *17*(8), 5673-5694; https://doi.org/10.3390/e17085673

Article

Binary Classification with a Pseudo Exponential Model and Its Application for Multi-Task Learning †

^{1} Future University Hakodate, 116-2 Kamedanakano, Hakodate Hokkaido 041-8655, Japan

^{2} The Institute of Statistical Mathematics, 10-3 Midori-cho, Tachikawa, Tokyo 190-8562, Japan

^{*} Author to whom correspondence should be addressed.

^{†} This paper is an extended version of our paper published in Proceedings of the MaxEnt 2014 Conference on Bayesian Inference and Maximum Entropy Methods in Science and Engineering, Amboise, France, 21–26 September 2014.

Academic Editors: Frédéric Barbaresco and Ali Mohammad-Djafari

Received: 11 May 2015 / Accepted: 3 August 2015 / Published: 6 August 2015

## Abstract

In this paper, we investigate the basic properties of binary classification with a pseudo model based on the Itakura–Saito distance and reveal that, within the framework of general Bregman divergences, the Itakura–Saito distance is the unique appropriate measure for estimation with the pseudo model. Furthermore, we propose a novel multi-task learning algorithm based on the pseudo model, in the framework of ensemble learning. We focus on a specific setting of multi-task learning for binary classification problems: the set of features is assumed to be common among all tasks, which are our targets of performance improvement. We consider a situation where the structures shared among the datasets are represented by divergences between the underlying distributions associated with the multiple tasks. We discuss the statistical properties of the proposed method and investigate its validity with numerical experiments.

Keywords:

multi-task learning; Itakura–Saito distance; pseudo model; un-normalized model

## 1. Introduction

In the framework of multi-task learning, we assume that there are multiple related tasks (datasets) sharing a common structure and that we can utilize the shared structure to improve the generalization performance of classifiers for the multiple tasks [1,2]. This framework has been successfully employed in various kinds of applications, such as medical diagnosis. Most methods utilize the similarity among tasks to improve the performance of classifiers by representing the shared structure as a regularization term [3,4]. We tackle this problem using a boosting method, which makes it possible to adaptively learn complicated problems with low computational cost. Boosting methods are notable implementations of ensemble learning and construct a better classifier by combining weak classifiers. AdaBoost is the most popular boosting method, and many variations, including TrAdaBoost for multi-task learning [5], have been developed. In face recognition [6], as well as web search ranking [7], the computational efficiency of boosting has attracted attention in the framework of multi-task learning.

In this paper, we first reveal that AdaBoost can be derived by a sequential minimization of the Itakura–Saito (IS) distance between an empirical distribution and a pseudo measure model associated with a classifier. The IS distance is a special case of the Bregman divergence [8] between two positive measures and is frequently used for non-negative matrix factorization (NMF) in the field of signal processing [9,10]. Second, we propose a novel boosting algorithm for multi-task learning based on the IS distance. We utilize the IS distance as a discrepancy measure between pseudo models associated with tasks and incorporate it as a regularizer into AdaBoost. The proposed method can capture the shared structure, i.e., the relationship between the underlying distributions, by considering the IS distance between the pseudo models constructed by the classifiers. We discuss the statistical properties of the proposed method and investigate the validity of the regularization by the IS distance with small experiments using synthetic datasets and a real dataset.

This paper is organized as follows. In Section 2, the basic settings are described, and a divergence measure is introduced. In Section 3, we briefly introduce the IS distance, a special case of the Bregman divergence, and investigate the relationship between AdaBoost, a well-known ensemble algorithm, and estimation with a pseudo model using the Itakura–Saito distance. In Section 4, we propose a method for multi-task learning, derived from the minimization of a weighted sum of divergences, and the performance of the proposed methods is examined in Section 5 using a synthetic dataset and a real dataset. (A short version of this article was presented as a conference paper [11]; some theoretical results and numerical experiments have been added to the current version.)

## 2. Settings

In this study, we focus on binary classification problems. Let x be an input and $y\in \mathrm{\mathcal{Y}}=\{\pm 1\}$ be a class label. Let us assume that J datasets ${\mathrm{\mathcal{D}}}_{j}={\{{x}_{i}^{\left(j\right)},{y}_{i}^{\left(j\right)}\}}_{i=1}^{{n}_{j}}$ ($j=1,...,J$) are given, and let ${p}_{j}\left(y\right|x){r}_{j}\left(x\right)$ and ${\tilde{p}}_{j}\left(y\right|x){\tilde{r}}_{j}\left(x\right)$ be the underlying distribution and the empirical distribution associated with the dataset ${\mathrm{\mathcal{D}}}_{j}$, respectively. Here, we assume that each conditional distribution of y given x is written as:

$${p}_{k}\left(y\right|x)={p}_{0}\left(y\right|x)+{\delta}_{k}\left(x\right)y$$

where ${p}_{0}\left(y\right|x)$ is a conditional distribution common to all datasets and ${\delta}_{k}\left(x\right)$ is a term specific to the dataset ${\mathrm{\mathcal{D}}}_{k}$. Note that ${\sum}_{y\in \mathrm{\mathcal{Y}}}{\delta}_{k}\left(x\right)y=0$ holds, because ${p}_{k}\left(y\right|x)$ is a probability distribution. While a discriminant function ${F}_{k}$ is usually constructed using only the dataset ${\mathrm{\mathcal{D}}}_{k}$, multi-task learning aims to improve the performance of the discriminant function for each dataset ${\mathrm{\mathcal{D}}}_{k}$ with the help of the datasets ${\mathrm{\mathcal{D}}}_{j}$ ($j\ne k$). For this purpose, we consider a risk minimization problem defined with a pseudo model and the Itakura–Saito (IS) distance, a discrepancy measure frequently used in the field of signal processing.

Let $\mathrm{\mathcal{M}}=\left\{m\left(y\right)|0\le {\sum}_{y\in \mathrm{\mathcal{Y}}}m\left(y\right)<\infty \phantom{\rule{0.277778em}{0ex}}\right\}$ be the space of all positive finite measures over $\mathcal{Y}$. The Itakura–Saito distance between $p,q\in \mathrm{\mathcal{M}}$ is defined as:

$$IS(p,q;r)=\int r\left(x\right)\sum _{y\in \mathrm{\mathcal{Y}}}\left\{log\frac{q\left(y\right|x)}{p\left(y\right|x)}-1+\frac{p\left(y\right|x)}{q\left(y\right|x)}\right\}dx$$

where $r\left(x\right)$ is a marginal distribution of x shared by $p,q\in \mathrm{\mathcal{M}}$. Note that the IS distance is a kind of statistical version of the Bregman divergence [12], which makes it possible to directly plug in the empirical distribution. We observe that $IS(p,q;r)\ge 0$ and $IS(p,q;r)=0$ if and only if $p=q$. Banerjee et al. [13] showed that there exists a unique Bregman divergence corresponding to every regular exponential family; the Itakura–Saito distance is associated with the exponential distribution.
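For the finite label set $\mathcal{Y}=\{\pm 1\}$, the summand of Equation (2) at a fixed x is a two-term sum that is easy to compute directly. A minimal numeric sketch (the helper name `is_distance` is ours, not the paper's):

```python
import numpy as np

def is_distance(p, q):
    """Itakura-Saito distance between two positive finite measures on a
    finite set: sum_y { log(q_y / p_y) - 1 + p_y / q_y }, i.e., the
    integrand of Equation (2) at a fixed x."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(np.log(q / p) - 1.0 + p / q))

# Non-negative, and zero iff p = q; note that p and q need not sum to one.
print(is_distance([0.4, 0.6], [0.4, 0.6]))      # 0.0
print(is_distance([0.4, 0.6], [0.8, 1.2]) > 0)  # True
```

Each term $-\log t - 1 + t$ with $t = p/q$ is non-negative and vanishes only at $t=1$, which is the source of the stated property $IS(p,q;r)\ge 0$ with equality if and only if $p=q$.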

## 3. Itakura–Saito Distance and Pseudo Model

#### 3.1. Parameter Estimation with the Pseudo Model

Let ${q}_{F}\left(y\right|x)$ be an (un-normalized) pseudo model associated with a function $F\left(x\right)$,

$${q}_{F}\left(y\right|x)=exp\left(F\left(x\right)y\right).$$

Note that ${q}_{F}\left(y\right|x)$ is not a probability function, i.e., ${\sum}_{y\in \mathrm{\mathcal{Y}}}{q}_{F}\left(y\right|x)\ne 1$ in general. If ${q}_{F}\left(y\right|x)$ is normalized, the model reduces to the classical logistic model as:

$${\overline{q}}_{F}\left(y\right|x)=\frac{exp\left(F\right(x\left)y\right)}{exp\left(F\right(x\left)\right)+exp(-F(x\left)\right)}.$$

When the function F is parameterized by θ, the maximum likelihood estimation (MLE) ${\mathrm{argmax}}_{\theta}{\sum}_{i=1}^{n}log{\overline{q}}_{F}\left({y}_{i}\right|{x}_{i})$ or equivalently minimization of the (extended) Kullback–Leibler (KL) divergence is a powerful tool for the estimation of θ, and the MLE has properties such as asymptotic consistency and efficiency under some regularity conditions. Here, we consider parameter estimation with the pseudo model Equation (3) rather than the normalized model Equation (4).
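The difference between the pseudo model Equation (3) and the normalized model Equation (4) can be made concrete as follows (a sketch; the function names are ours):

```python
import numpy as np

def q_pseudo(F, y):
    """Un-normalized pseudo model, Equation (3): exp(F(x) y)."""
    return np.exp(F * y)

def q_logistic(F, y):
    """Normalized (logistic) model, Equation (4)."""
    return np.exp(F * y) / (np.exp(F) + np.exp(-F))

F = 0.7
total_pseudo = q_pseudo(F, +1) + q_pseudo(F, -1)        # exp(F) + exp(-F), not 1
total_logistic = q_logistic(F, +1) + q_logistic(F, -1)  # exactly 1
```

The pseudo model is a positive measure rather than a probability distribution, which is precisely why a divergence defined on $\mathcal{M}$, such as the IS distance, is needed for estimation with it.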

**Proposition 1.** Let $p\left(y\right|x)={\overline{q}}_{{F}_{0}}\left(y\right|x)$ be the underlying distribution. Then, we observe:

$$\begin{array}{cc}\hfill \underset{F}{\mathrm{argmin}}IS(p,{q}_{F};r)& ={F}_{0},\hfill \end{array}$$

$$\begin{array}{cc}\hfill \underset{F}{\mathrm{argmin}}IS({q}_{F},p;r)& ={F}_{0}.\hfill \end{array}$$

**Proof.** See Appendix A.

On the other hand, when we consider an estimation based on the extended KL divergence, i.e., ${\mathrm{argmin}}_{F}KL(p,{q}_{F};r)$, where:

$$KL(p,q;r)=\int r\left(x\right)\sum _{y\in \mathrm{\mathcal{Y}}}\{p\left(y\right|x)log\frac{p\left(y\right|x)}{q\left(y\right|x)}-p\left(y\right|x)+q\left(y\right|x)\}dx,$$

we observe the following.

**Proposition 2.** Let ${F}_{0}$ ($\ne 0$) be a function and $p\left(y\right|x)={\overline{q}}_{{F}_{0}}\left(y\right|x)$ be the underlying distribution. Then, we observe:

$$\begin{array}{cc}\hfill {F}_{KL,1}=\underset{F}{\mathrm{argmin}}KL(p,{q}_{F};r)\ne & {F}_{0},\hfill \end{array}$$

$$\begin{array}{cc}\hfill {F}_{KL,2}=\underset{F}{\mathrm{argmin}}KL({q}_{F},p;r)\ne & {F}_{0}.\hfill \end{array}$$

**Proof.** See Appendix B.

**Remark 1.** Let $p\left(y\right|x)={\overline{q}}_{{F}_{0}}\left(y\right|x)$ be the underlying distribution. Then, the minimizer Equation (8) or (9) of the extended KL divergence attains the Bayes rule, i.e.,

$$\begin{array}{cc}\hfill sgn\left({F}_{KL,1}\left(x\right)\right)& =sgn\left({F}_{KL,2}\left(x\right)\right)=sgn\left(\frac{1}{2}log\frac{p(+1|x)}{p(-1|x)}\right).\hfill \end{array}$$

The proposition and the remark show that the extended KL divergence is not completely appropriate for estimation with the pseudo model.

#### 3.2. Characterization of the Itakura–Saito Distance

In this section, we investigate the characterization of the Itakura–Saito distance for estimation with the pseudo model, in the framework of the Bregman U-divergence. Firstly, we briefly introduce the statistical version of the Bregman U-divergence [12]. The statistical version of the Bregman U-divergence is a discrepancy measure between positive measures in $\mathcal{M}$ defined by a generating function U, and it enables us to directly plug in the empirical distribution for estimation. The authors of [12] proposed a general boosting-type algorithm for classification using the Bregman U-divergence and discussed properties of the method from the viewpoint of information geometry [14]. By changing the generating function U, the Bregman U-divergence can acquire useful properties such as robustness against noise. For example, the β-divergence is a special case of the Bregman U-divergence and is frequently used for robust estimation in the context of unsupervised learning, such as clustering or component analysis [15,16]. Another example of the Bregman U-divergence is the η-divergence, which is employed to robustify classification algorithms and is closely related to probability models of mislabeling [17,18].

Let U be a monotonically-increasing convex function and ξ be an inverse function of ${U}^{\prime}$, the derivative of U. From the convexity of the function U, the function ξ is a monotonically-increasing function. The statistical version of Bregman U-divergence between two measures $p,q\in \mathrm{\mathcal{M}}$ is defined as follows.

$$\begin{array}{c}\hfill {D}_{U}(p,q;r)=\int r\left(x\right)\sum _{y\in \mathrm{\mathcal{Y}}}\left\{U\left(\xi \left(q\left(y\right|x)\right)\right)-U\left(\xi \left(p\left(y\right|x)\right)\right)-p\left(y\right|x)\left(\xi \left(q\right(y\left|x\right))-\xi (p\left(y\right|x\left)\right)\right)\right\}dx.\end{array}$$

Note that the function ξ should be defined at least on $z>0$.

**Remark 2.** The KL divergence and the Itakura–Saito distance are special cases of the Bregman U-divergence Equation (11) with generating functions $U\left(z\right)=exp\left(z\right)$ and $U\left(z\right)=-log(c-z)+{c}_{1}$ ($z<c$), respectively, where c and ${c}_{1}$ are constants.

Here, we introduce the concept of reflection symmetry for the characterization of the IS distance.

**Definition 3.** A function $f\left(z\right)$ is reflection-symmetric if:

$$\begin{array}{c}\hfill f\left(z\right)=f\left({z}^{-1}\right)\end{array}$$

If the function f is reflection-symmetric, we observe that:

$$\begin{array}{c}\hfill \underset{z\to 0}{lim}f\left(z\right)=\underset{z\to \infty}{lim}f\left(z\right).\end{array}$$

Because of this property, the reflection-symmetric function often has a singular point at $z=0$, and to investigate the behavior of the function, we can employ the Laurent series as:

$$\begin{array}{c}\hfill f\left(z\right)=c+\sum _{k=1}^{\infty}\left({a}_{k}{z}^{k}+{b}_{k}{z}^{-k}\right).\end{array}$$

Note that if the function f is holomorphic over $\mathbb{R}$, then ${b}_{k}=0$ for all k, and the Laurent series is equivalent to the Taylor series.

**Remark 3.** If the function f is reflection-symmetric and holomorphic over $\mathbb{R}$, then ${a}_{k}={b}_{k}=0$ holds for all k, and hence, f is a constant function.

For the Bregman U-divergence Equation (11), we observe the following Lemma.

**Lemma 4.** Let ${F}_{0}$ be an arbitrary function, $p\left(y\right|x)={\overline{q}}_{{F}_{0}}\left(y\right|x)$ be the underlying distribution and ${q}_{F}\left(y\right|x)$ be the pseudo model Equation (3). If the Bregman U-divergence associated with the function U attains:

$$\begin{array}{c}\hfill {F}_{0}=\underset{F}{\mathrm{argmin}}{D}_{U}(p,{q}_{F};r),\end{array}$$

$$\begin{array}{c}\hfill {F}_{0}=\underset{F}{\mathrm{argmin}}{D}_{U}({q}_{F},p;r),\end{array}$$

**Proof.** See Appendix C.

**Remark 4.** Proposition 1 implies that the function ξ associated with the IS distance satisfies Lemma 4.

**Remark 5.** The propositions imply that the function U, i.e., the Bregman U-divergence, attaining Equation (15) or (16) is not unique; there exist divergences other than the Itakura–Saito distance that satisfy Equation (15) or (16). For example, the function:

$$\begin{array}{c}\hfill \xi \left(z\right)=-2{z}^{-\frac{2}{3}}-{z}^{-\frac{4}{3}}\end{array}$$

$$\begin{array}{cc}\hfill U\left(z\right)& ={\int}^{z}{\xi}^{-1}\left({z}^{\prime}\right)d{z}^{\prime}=-4\frac{-2+\sqrt{1-z}}{\sqrt{-1+\sqrt{1-z}}}+{C}_{1}\hfill \end{array}$$

In the following theorem, we reveal the characterization of the Itakura–Saito distance for estimation with the pseudo model Equation (3) and the Bregman U-divergence.

**Theorem 5.** Let $p\left(y\right|x)={\overline{q}}_{{F}_{0}}\left(y\right|x)$ be the underlying distribution and ${q}_{F}\left(y\right|x)$ be the pseudo model Equation (3). If the conditions:

$$\begin{array}{cc}\hfill {F}_{0}& =\underset{F}{\mathrm{argmin}}{D}_{U}(p,{q}_{F};r),\hfill \end{array}$$

$$\begin{array}{cc}\hfill {F}_{0}& =\underset{F}{\mathrm{argmin}}{D}_{U}({q}_{F},p;r)\hfill \end{array}$$

**Proof.** See Appendix D.

**Remark 6.** If we assume that the function ${\xi}^{\prime}\left(z\right){z}^{2}$ derived from U is reflection-symmetric and holomorphic over $\mathbb{R}$, then ${\xi}^{\prime}\left(z\right){z}^{2}$ is a constant function by Remark 3. Then, we obtain $\xi \left(z\right)=c+\frac{{b}_{1}}{z}$, where $c,{b}_{1}$ are constants, implying that the associated divergence is equivalent to the Itakura–Saito distance.

#### 3.3. Relationship with AdaBoost

The IS distance between the underlying conditional distribution $p\left(y\right|x)$ and the pseudo model ${q}_{F}\left(y\right|x)$ is written as:

$$\begin{array}{cc}\hfill IS(p,{q}_{F};r)& =C+\int r\left(x\right)\sum _{y\in \mathrm{\mathcal{Y}}}\left\{F\left(x\right)y+\frac{p\left(y\right|x)}{{q}_{F}\left(y\right|x)}\right\}dx\hfill \\ & =C+\int r\left(x\right)\sum _{y\in \mathrm{\mathcal{Y}}}p\left(y\right|x){e}^{-F\left(x\right)y}dx,\hfill \end{array}$$

where C is a constant; Equation (21) is equivalent to the expected loss of AdaBoost, except for the constant term. Then, sequential minimization of an empirical version of Equation (21) is equivalent to the algorithm of AdaBoost, the most popular boosting method for binary classification. Furthermore, [12,19] discussed that a gradient-based boosting algorithm can be derived from the minimization of the KL divergence or the Bregman U-divergence between the underlying distribution and a pseudo model. An important difference between these frameworks and our framework Equation (21) is the employed pseudo model. The pseudo model employed by the previous frameworks assumes a condition called the "consistent data assumption" and is defined with the empirical distribution, implying that the pseudo model varies depending on the dataset. On the other hand, the pseudo model Equation (3) employed in Equation (21) is fixed against the dataset, as usual statistical models are.
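The empirical counterpart of Equation (21) is the exponential loss $\frac{1}{n}\sum_i e^{-F(x_i)y_i}$, and minimizing it along the direction of a fixed weak classifier recovers the familiar AdaBoost step size $\alpha=\frac{1}{2}\log\frac{1-\epsilon}{\epsilon}$. A quick numeric check on synthetic labels (all names are ours):

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.choice([-1, 1], size=200)
# a weak classifier that agrees with y about 70% of the time
f = np.where(rng.random(200) < 0.7, y, -y)

w = np.ones(200) / 200                        # uniform weights (F = 0)
eps = float(w @ (f != y))                     # weighted error rate
alpha_closed = 0.5 * np.log((1 - eps) / eps)  # AdaBoost's step size

# brute-force minimization of the loss (1/n) sum_i exp(-a f_i y_i) over a grid
grid = np.linspace(-2.0, 2.0, 4001)
losses = [np.mean(np.exp(-a * f * y)) for a in grid]
alpha_grid = float(grid[np.argmin(losses)])
```

Since $f_i y_i\in\{\pm 1\}$, the empirical loss equals $(1-\epsilon)e^{-\alpha}+\epsilon e^{\alpha}$, whose exact minimizer is the closed form above; the grid minimizer agrees with it to within the grid spacing.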

The IS distance between two pseudo models ${q}_{F}\left(y\right|x)$ and ${q}_{{F}^{\prime}}\left(y\right|x)$ is written as,

$$\begin{array}{cc}\hfill IS({q}_{F},{q}_{{F}^{\prime}};r)& =\int r\left(x\right)\sum _{y\in \mathrm{\mathcal{Y}}}\left\{{F}^{\prime}\left(x\right)y-F\left(x\right)y-1+exp(F\left(x\right)y-{F}^{\prime}\left(x\right)y)\right\}dx\hfill \\ & =-2+\int r\left(x\right)\left\{exp(F\left(x\right)-{F}^{\prime}\left(x\right))+exp({F}^{\prime}\left(x\right)-F\left(x\right))\right\}dx.\hfill \end{array}$$

Note that $IS({q}_{{F}^{\prime}},{q}_{F};r)=IS({q}_{F},{q}_{{F}^{\prime}};r)$ holds for arbitrary ${q}_{F}$ and ${q}_{{F}^{\prime}}$, while the IS distance itself is not necessarily symmetric. Furthermore, note that the symmetric property does not hold for normalized models ${\overline{q}}_{F}$ and ${\overline{q}}_{{F}^{\prime}}$.

## 4. Application for Multi-Task Learning

We consider the following two problem settings for multi-task learning.

- Case 1: There is a target dataset ${\mathrm{\mathcal{D}}}_{k}$, and our interest is to construct a discriminant function ${F}_{k}$ utilizing the remaining datasets ${\mathrm{\mathcal{D}}}_{j}$ ($j\ne k$) or a priori constructed discriminant functions ${F}_{j}$ ($j\ne k$).
- Case 2: Our interest is to simultaneously construct better discriminant functions ${F}_{1},...,{F}_{J}$ using all J datasets ${\mathrm{\mathcal{D}}}_{1},...,{\mathrm{\mathcal{D}}}_{J}$ by utilizing shared information among the datasets.

#### 4.1. Case 1

In this section, we focus on the first framework above. Let us assume that the discriminant functions ${F}_{j}\left(x\right)$ ($j\ne k$) are given or are constructed by an arbitrary binary classification method. Then, let us consider a risk function:

$$\begin{array}{cc}\hfill {L}_{k}\left({F}_{k}\right)=& IS({p}_{k},{q}_{{F}_{k}};{r}_{k})+\sum _{j\ne k}{\lambda}_{k,j}IS({q}_{{F}_{k}},{q}_{{F}_{j}};{r}_{k})\hfill \\ \hfill =& \int {r}_{k}\left(x\right)\left\{\sum _{y\in \mathrm{\mathcal{Y}}}{p}_{k}\left(y\right|x){e}^{-{F}_{k}\left(x\right)y}+\sum _{j\ne k}{\lambda}_{k,j}\left\{{e}^{{F}_{k}\left(x\right)-{F}_{j}\left(x\right)}+{e}^{{F}_{j}\left(x\right)-{F}_{k}\left(x\right)}\right\}\right\}dx,\hfill \end{array}$$

where ${\lambda}_{k,j}\ge 0$ ($j\ne k$) are regularization constants. Note that the risk function depends on the functions ${F}_{j}$ ($j\ne k$); the second term becomes small when the target discriminant function ${F}_{k}$ is similar to the functions ${F}_{j}$ ($j\ne k$) in the sense of the IS distance, so the second term acts as a regularizer incorporating the shared information among the datasets into the target function ${F}_{k}$. Furthermore, note that the marginal distribution ${r}_{k}$ is shared in the second term for ease of implementation and simplicity of theoretical analysis.

An empirical version of Equation (23) is written as:

$${\overline{L}}_{k}\left({F}_{k}\right)=\frac{1}{{n}_{k}}\sum _{i=1}^{{n}_{k}}\left({e}^{-{F}_{k}\left({x}_{i}^{\left(k\right)}\right){y}_{i}^{\left(k\right)}}+\sum _{j\ne k}{\lambda}_{k,j}\left({e}^{{F}_{k}\left({x}_{i}^{\left(k\right)}\right)-{F}_{j}\left({x}_{i}^{\left(k\right)}\right)}+{e}^{{F}_{j}\left({x}_{i}^{\left(k\right)}\right)-{F}_{k}\left({x}_{i}^{\left(k\right)}\right)}\right)\right).$$

An algorithm is derived by sequential minimization of Equation (24) by updating ${F}_{k}$ to ${F}_{k}+\alpha f$, i.e., $(\alpha ,f)={\mathrm{argmin}}_{\alpha ,f}{\overline{L}}_{k}({F}_{k}+\alpha f)$, where f is a weak classifier and α is a coefficient [22].
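The empirical risk Equation (24) transcribes directly into code; a sketch evaluating it on arrays of function values over task k's sample (the function name and signature are ours):

```python
import numpy as np

def regularized_risk(F_k, y_k, F_others, lambdas):
    """Empirical risk of Equation (24): the exponential loss on task k plus
    IS-distance regularizers pulling F_k toward the other tasks' functions.
    F_k, y_k: values of F_k and labels on task k's sample; F_others: list of
    arrays of the fixed F_j values; lambdas: the constants lambda_{k,j}."""
    loss = np.exp(-F_k * y_k)
    reg = sum(lam * (np.exp(F_k - F_j) + np.exp(F_j - F_k))
              for lam, F_j in zip(lambdas, F_others))
    return float(np.mean(loss + reg))

F_k, y_k = np.array([0.5, -0.5]), np.array([1, -1])
base = regularized_risk(F_k, y_k, [], [])        # plain exponential loss
reg = regularized_risk(F_k, y_k, [F_k], [0.1])   # adds 0.1 * 2 per point
```

With all $\lambda_{k,j}=0$ the second term vanishes and only the AdaBoost exponential loss remains, consistent with the reduction noted later in this section.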

- (1)
- Initialize the function to ${F}_{k}^{0}$, and define weights for the i-th example with a function F as:$$\begin{array}{cc}\hfill {w}_{1}(i;F)& =\frac{{e}^{-F\left({x}_{i}^{\left(k\right)}\right){y}_{i}^{\left(k\right)}}}{{Z}_{1}\left(F\right)},\hfill \\ \hfill {w}_{2}(i;F)& =\frac{{\sum}_{j\ne k}{\lambda}_{k,j}{e}^{f\left({x}_{i}^{\left(k\right)}\right)(F\left({x}_{i}^{\left(k\right)}\right)-{F}_{j}\left({x}_{i}^{\left(k\right)}\right))}}{{Z}_{2}\left(F\right)}\hfill \end{array}$$$$\begin{array}{cc}\hfill {Z}_{1}\left(F\right)& =\sum _{i=1}^{{n}_{k}}{e}^{-F\left({x}_{i}^{\left(k\right)}\right){y}_{i}^{\left(k\right)}},\hfill \\ \hfill {Z}_{2}\left(F\right)& =\sum _{i=1}^{{n}_{k}}\sum _{j\ne k}{\lambda}_{k,j}\left({e}^{F\left({x}_{i}^{\left(k\right)}\right)-{F}_{j}\left({x}_{i}^{\left(k\right)}\right)}+{e}^{-F\left({x}_{i}^{\left(k\right)}\right)+{F}_{j}\left({x}_{i}^{\left(k\right)}\right)}\right).\hfill \end{array}$$
- (2)
- For $t=1,...,T$
- (a)
- Select a weak classifier ${f}_{k}^{t}\left(x\right)\in \{\pm 1\}$, which minimizes the following quantity:$$\epsilon \left(f\right)=\frac{{Z}_{1}\left({F}_{k}^{t-1}\right)}{{Z}_{1}\left({F}_{k}^{t-1}\right)+{Z}_{2}\left({F}_{k}^{t-1}\right)}{\epsilon}_{1}\left(f\right)+\frac{{Z}_{2}\left({F}_{k}^{t-1}\right)}{{Z}_{1}\left({F}_{k}^{t-1}\right)+{Z}_{2}\left({F}_{k}^{t-1}\right)}{\epsilon}_{2}\left(f\right).$$
- (b)
- Calculate a coefficient of ${f}_{k}^{t}$ by ${\alpha}_{k}^{t}=\frac{1}{2}log\frac{1-\epsilon \left({f}_{k}^{t}\right)}{\epsilon \left({f}_{k}^{t}\right)}$.
- (c)
- Update the discriminant function as ${F}_{k}^{t}={F}_{k}^{t-1}+{\alpha}_{k}^{t}{f}_{k}^{t}$.

- (3)
- Output ${F}_{k}^{T}\left(x\right)={F}_{k}^{0}\left(x\right)+{\sum}_{t=1}^{T}{\alpha}_{k}^{t}{f}_{k}^{t}\left(x\right)$.

In Step 1, ${F}_{k}^{0}$ is typically initialized as ${F}_{k}^{0}\left(x\right)=0$. The quantity Equation (25) is a mixture of two terms: ${\epsilon}_{1}\left(f\right)$ is the weighted error rate of the classifier f, and ${\epsilon}_{2}\left(f\right)$ is the sum of the weights ${w}_{2}(i;F)$, which represents the degree of discrepancy between f and $F-{F}_{j}$; ${\epsilon}_{2}\left(f\right)$ becomes large when updating F by f moves it away from ${F}_{j}$. Note that if we set ${\lambda}_{k,j}=0$ for all j, the risk function Equation (24) coincides with that of AdaBoost, and the algorithm derived above reduces to the usual AdaBoost.
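For reference, the $\lambda_{k,j}=0$ reduction, i.e., plain AdaBoost with decision stumps, can be sketched as follows (our own minimal implementation, not the paper's code):

```python
import numpy as np

def best_stump(X, y, w):
    """Exhaustively choose the decision stump minimizing the weighted error."""
    best = (np.inf, 0, 0.0, 1)
    for feat in range(X.shape[1]):
        for thr in np.unique(X[:, feat]):
            for sign in (1, -1):
                pred = sign * np.where(X[:, feat] > thr, 1, -1)
                eps = float(w @ (pred != y))
                if eps < best[0]:
                    best = (eps, feat, thr, sign)
    return best

def adaboost(X, y, T=10):
    F = np.zeros(len(y))                   # discriminant function values F^t
    for _ in range(T):
        w = np.exp(-F * y); w /= w.sum()   # weights w_1(i; F)
        eps, feat, thr, sign = best_stump(X, y, w)
        eps = float(np.clip(eps, 1e-12, 1 - 1e-12))
        alpha = 0.5 * np.log((1 - eps) / eps)                  # step (b)
        F += alpha * sign * np.where(X[:, feat] > thr, 1, -1)  # step (c)
    return F

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + 0.5 * X[:, 1] > 0, 1, -1)
train_err = float(np.mean(np.sign(adaboost(X, y, T=10)) != y))
```

The full Case 1 algorithm differs only in the selection criterion: the stump is chosen by the mixed quantity Equation (25) instead of the weighted error alone.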

Because the empirical risk function Equation (24) is convex with respect to the functions ${F}_{j}$, we can consider another version of the risk function:

$$\begin{array}{cc}\hfill {\overline{L}}_{k}\left({F}_{k}\right)& =\frac{1}{{n}_{k}}\sum _{i=1}^{{n}_{k}}\left({e}^{-{F}_{k}\left({x}_{i}^{\left(k\right)}\right){y}_{i}^{\left(k\right)}}+{\lambda}_{k}\left({e}^{{F}_{k}\left({x}_{i}^{\left(k\right)}\right)-{\overline{F}}_{k}\left({x}_{i}^{\left(k\right)}\right)}+{e}^{-{F}_{k}\left({x}_{i}^{\left(k\right)}\right)+{\overline{F}}_{k}\left({x}_{i}^{\left(k\right)}\right)}\right)\right)\hfill \end{array}$$

where ${\lambda}_{k}={\sum}_{j\ne k}{\lambda}_{k,j}$ and ${\overline{F}}_{k}\left(x\right)={\sum}_{j\ne k}\frac{{\lambda}_{k,j}}{{\lambda}_{k}}{F}_{j}\left(x\right)$. This risk function is upper bounded by the risk function Equation (24), implying that the effect of regularization by the shared information is weakened. The derived algorithm is almost the same as the one derived from Equation (24).

#### 4.2. Case 2

In this section, we consider the simultaneous construction of discriminant functions ${F}_{1},...,{F}_{J}$ by minimizing the following risk function:

$$L({F}_{1},...,{F}_{J})=\sum _{j=1}^{J}{\pi}_{j}{L}_{j}\left({F}_{j}\right)$$

where ${\pi}_{j}$ ($j=1,...,J$) are positive constants satisfying ${\sum}_{j=1}^{J}{\pi}_{j}=1$ and ${L}_{k}$ is defined in Equation (23).

Though we can directly minimize the empirical version of Equation (27), the derived algorithm is complicated and computationally heavy. We therefore derive a simplified algorithm utilizing the Case 1 algorithm, in which the target dataset is fixed.

- (1)
- Initialize functions ${F}_{1},...,{F}_{J}$.
- (2)
- For $t=1,...,T$:
- (a)
- Randomly choose a target index $k\in \{1,...,J\}$.
- (b)
- Update the function ${F}_{k}$ using the algorithm in Case 1 by S steps, with fixed functions ${F}_{j}$ ($j\ne k$).

- (3)
- Output learned functions ${F}_{1},...,{F}_{J}$.

Note that the empirical risk function does not necessarily decrease monotonically, because the minimization of ${L}_{k}\left({F}_{k}\right)$ trades off the first term against the second regularization term, and a decrease of ${L}_{k}\left({F}_{k}\right)$ does not necessarily imply a decrease of the regularization term.
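Structurally, the Case 2 procedure is a plain alternating loop; the `update_task` callback below stands in for S steps of the Case 1 algorithm applied to task k with the other functions fixed (both names are ours):

```python
import random

def multitask_rounds(n_tasks, update_task, T=100, S=1, seed=0):
    """Case 2 scheme: repeatedly pick a random target task k (step (a)) and
    refine its discriminant function by S Case-1 boosting steps while the
    remaining tasks' functions are held fixed (step (b))."""
    rng = random.Random(seed)
    for _ in range(T):
        k = rng.randrange(n_tasks)
        update_task(k, S)
    # the caller's task states now hold the learned functions F_1, ..., F_J

calls = []
multitask_rounds(3, lambda k, S: calls.append(k), T=12)
```

Only the randomly chosen task is touched in each round, which keeps each round cheap but is also why the overall empirical risk need not decrease monotonically.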

#### 4.3. Statistical Properties of the Proposed Methods

In this section, we discuss the statistical properties of the proposed methods. Firstly, we focus on Case 1. The minimizer ${F}_{k}^{*}$ of the risk function Equation (23) satisfies:

$$\begin{array}{cc}\hfill {\left.\frac{\delta {L}_{k}\left({F}_{k}\right)}{\delta {F}_{k}\left(x\right)}\right|}_{{F}_{k}={F}_{k}^{*}}& \propto -{p}_{k}(+1|x){e}^{-{F}_{k}^{*}\left(x\right)}+{p}_{k}(-1|x){e}^{{F}_{k}^{*}\left(x\right)}+\sum _{j\ne k}{\lambda}_{k,j}\left\{{e}^{{F}_{k}^{*}\left(x\right)-{F}_{j}\left(x\right)}-{e}^{{F}_{j}\left(x\right)-{F}_{k}^{*}\left(x\right)}\right\}=0,\hfill \end{array}$$

which implies:

$$\begin{array}{cc}\hfill {F}_{k}^{*}\left(x\right)& =\frac{1}{2}log\frac{{p}_{k}(+1|x)+{\sum}_{j\ne k}{\lambda}_{k,j}exp\left({F}_{j}\left(x\right)\right)}{{p}_{k}(-1|x)+{\sum}_{j\ne k}{\lambda}_{k,j}exp(-{F}_{j}\left(x\right))},\hfill \end{array}$$

or equivalently:

$$\begin{array}{cc}\hfill {p}_{k}\left(y\right|x)& ={p}_{0,k}\left(y\right|x)\left(1+\sum _{j\ne k}{\lambda}_{k,j}exp(-{F}_{j}\left(x\right)y)\right)-{p}_{0,k}(-y|x)\sum _{j\ne k}{\lambda}_{k,j}exp\left({F}_{j}\left(x\right)y\right),\hfill \end{array}$$

where ${p}_{0,k}\left(y\right|x)=\frac{exp\left({F}_{k}^{*}\left(x\right)y\right)}{exp\left({F}_{k}^{*}\left(x\right)\right)+exp(-{F}_{k}^{*}\left(x\right))}$. This can be interpreted as a probabilistic model of asymmetric mislabeling [17,18]. In Equation (29), the confidence of classification is discounted by the outputs of the remaining discriminant functions when the classifier $sgn\left({F}_{k}^{*}\left(x\right)\right)$ makes a decision different from those of $sgn\left({F}_{j}\left(x\right)\right)$ ($j\ne k$).

**Remark 7.** ${F}_{k}^{*}\left(x\right)\ge 0$ does not imply ${p}_{k}(+1|x)\ge \frac{1}{2}$ unless ${F}_{j}\left(x\right)=\frac{1}{2}log\frac{{p}_{k}(+1|x)}{{p}_{k}(-1|x)}$ holds.

**Proposition 6.** Let us assume that ${F}_{j}\left(x\right)$ satisfies:

$$\frac{exp\left({F}_{j}\left(x\right)y\right)}{exp\left({F}_{j}\left(x\right)\right)+exp(-{F}_{j}\left(x\right))}={p}_{0}\left(y\right|x)+{\epsilon}_{j}\left(x\right)y,\phantom{\rule{4pt}{0ex}}|{\epsilon}_{j}\left(x\right)|\ll 1.$$

Then, Equation (29) can be approximated as:

$${F}_{k}^{*}\left(x\right)\simeq \frac{1}{2}log\frac{{p}_{0}(+1|x)}{{p}_{0}(-1|x)}+\frac{1}{2{P}^{2}}\frac{P{\delta}_{k}\left(x\right)+{\sum}_{j\ne k}{\lambda}_{k,j}{\epsilon}_{j}\left(x\right)}{P+{\lambda}_{k}}$$

where $P=\sqrt{{p}_{0}(+1|x){p}_{0}(-1|x)}$ and ${\lambda}_{k}={\sum}_{j\ne k}{\lambda}_{k,j}$.

We observe that the discrepancy induced by ${\delta}_{k}$ is moderated by the mixture of the ${\epsilon}_{j}$ when the perturbations ${\epsilon}_{j}$ are independently and identically distributed.

**Proposition 7.** Let ${\eta}_{j}\left(x\right)={F}_{j}\left(x\right)-{F}_{k}\left(x\right)$ be the difference between the two functions. Then, ${F}_{k}^{*}$ can be approximated as:

$$\begin{array}{c}\hfill {F}_{k}^{*}\left(x\right)\simeq \frac{1}{2}log\frac{{p}_{k}(+1|x)}{{p}_{k}(-1|x)}+\frac{1}{P}\sum _{j\ne k}{\lambda}_{k,j}{\eta}_{j}\left(x\right).\end{array}$$

**Proof.** See Appendix E.

**Proposition 8.** Let ${\overline{F}}_{k}^{*}$ be the minimizer of the risk function Equation (23) with ${\lambda}_{k,j}=0$ ($j\ne k$). Then, we observe:

$$\begin{array}{cc}\hfill {\left({\overline{F}}_{k}^{*}\left(x\right)-\frac{1}{2}log\frac{{p}_{0}(+1|x)}{{p}_{0}(-1|x)}\right)}^{2}& \ge {\left({F}_{k}^{*}\left(x\right)-\frac{1}{2}log\frac{{p}_{0}(+1|x)}{{p}_{0}(-1|x)}\right)}^{2},\hfill \end{array}$$

if:

$$|{\delta}_{k}(x)|\ge \frac{|{\sum}_{j\ne k}{\lambda}_{k,j}{\epsilon}_{j}\left(x\right)|}{{\lambda}_{k}}$$

holds.

**Proof.** See Appendix F.

Secondly, we consider a property of the algorithm for Case 2.

**Proposition 9.** Let $r\left(x\right)={r}_{j}\left(x\right)$ ($j=1,...,J$) be a common marginal distribution shared by all tasks. Then, the minimizer of the risk function is written as:

$$\begin{array}{cc}\hfill {F}_{k}\left(x\right)& =\frac{1}{2}log\frac{{p}_{k}(+1|x)+{\sum}_{j\ne k}{\lambda}_{jk}{e}^{{F}_{j}\left(x\right)}}{{p}_{k}(-1|x)+{\sum}_{j\ne k}{\lambda}_{jk}{e}^{-{F}_{j}\left(x\right)}},\hfill \end{array}$$

**Proof.** See Appendix G.

The only difference from Equation (29) is that the regularization is strengthened by $\frac{{\pi}_{j}}{{\pi}_{k}}{\lambda}_{k,j}$, and hence, the propositions obtained for Case 1 also hold for Equation (36).

#### 4.4. Comparison of Regularization Terms

The proposed method incorporates the regularization term defined by the IS distance into AdaBoost. In this section, we discuss a property of the regularization term.

**Proposition 10.** Let $\epsilon \left(x\right)$ be a perturbation function satisfying $\left|\epsilon \left(x\right)\right|\ll 1$. Then, we observe:

$$\begin{array}{cc}\hfill KL({\overline{q}}_{F},{\overline{q}}_{F+\epsilon};r)& \simeq \int 2r\left(x\right)\epsilon {\left(x\right)}^{2}{\overline{q}}_{F}(+1|x){\overline{q}}_{F}(-1|x)dx,\hfill \end{array}$$

$$\begin{array}{cc}\hfill KL({q}_{F},{q}_{F+\epsilon};r)& \simeq \int \frac{r\left(x\right)}{2}\epsilon {\left(x\right)}^{2}\frac{1}{\sqrt{{\overline{q}}_{F}(+1|x){\overline{q}}_{F}(-1|x)}}dx,\hfill \end{array}$$

$$\begin{array}{cc}\hfill IS({\overline{q}}_{F},{\overline{q}}_{F+\epsilon};r)& \simeq \int 2r\left(x\right)\epsilon {\left(x\right)}^{2}\sum _{y\in \mathrm{\mathcal{Y}}}{\overline{q}}_{F}{\left(y\right|x)}^{2}dx,\hfill \end{array}$$

$$\begin{array}{cc}\hfill IS({q}_{F},{q}_{F+\epsilon};r)& \simeq \int r\left(x\right)\epsilon {\left(x\right)}^{2}dx.\hfill \end{array}$$

**Proof.** We obtain these approximations by considering the Taylor expansion up to the second order.

Figure 1 shows the values of the divergences against ${\overline{q}}_{F}\left(x\right)$. These relations imply that the KL divergence Equation (37) emphasizes the region of inputs x whose conditional distribution ${\overline{q}}_{F}\left(x\right)$ is close to $\frac{1}{2}$, i.e., the classification boundary, while the IS distance Equation (39) focuses on the region of x whose conditional distribution is close to zero or one. The IS distance between pseudo models Equation (40), i.e., the proposed method, is intermediate between Equations (37) and (39). This implies that the regularization Equation (40) with the IS distance puts more weight on the region far from the classification boundary than Equation (37) does, while Equation (39) tends to relatively ignore the region near the classification boundary. Furthermore, note that the employment of Equation (40) makes it possible to derive the simple algorithm shown in Section 4.1.
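The approximations of Proposition 10 are easy to verify numerically at a single x (taking r as a point mass there; the variable names are ours). For the pseudo models, Equation (22) gives the exact distance $e^{\epsilon}+e^{-\epsilon}-2$, which Equation (40) approximates by $\epsilon^2$:

```python
import numpy as np

F, eps = 0.8, 1e-3

# Equation (40): IS between pseudo models q_F and q_{F+eps} at one x,
# with the exact value taken from Equation (22)
is_exact = np.exp(eps) + np.exp(-eps) - 2.0
is_approx = eps ** 2

# Equation (37): KL between the normalized models at the same x
p1 = np.exp(F) / (np.exp(F) + np.exp(-F))                    # qbar_F(+1|x)
p2 = np.exp(F + eps) / (np.exp(F + eps) + np.exp(-F - eps))  # qbar_{F+eps}(+1|x)
kl_exact = p1 * np.log(p1 / p2) + (1 - p1) * np.log((1 - p1) / (1 - p2))
kl_approx = 2 * eps ** 2 * p1 * (1 - p1)
```

In both cases the exact value and the second-order approximation agree to several digits for small ε, consistent with the proof by Taylor expansion.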

## 5. Experiments

In this section, we investigate the performance of the proposed multi-task algorithm with synthetic datasets and a real dataset.

#### 5.1. Synthetic Dataset

Firstly, we investigate the performance of the proposed method using two synthetic datasets in the situation described in Case 2. We compared the proposed method with AdaBoost trained on each individual dataset and AdaBoost trained on all datasets simultaneously. We employed the boosting stump (a decision tree with only one node) as the weak classifier and fixed ${\pi}_{j}=1/J$. A boosting-type method has a hyper-parameter T, the number of boosting steps, and the proposed method additionally has the hyper-parameters ${\lambda}_{k,j}$. In the experiment, we determined T and ${\lambda}_{k,j}$ by validation. In particular, we investigated two scenarios for determining ${\lambda}_{k,j}$:

- Scenario 1: we set ${\lambda}_{k,j}=\lambda $ for all $j,k$ and determined λ by validation.
- Scenario 2: we set ${\lambda}_{k,j}=\frac{\lambda}{IS\left({q}_{{\widehat{F}}_{k}},{q}_{{\widehat{F}}_{j}};{r}_{k}\right)}$, where ${\widehat{F}}_{j}$ is the discriminant function constructed by AdaBoost on the dataset ${\mathrm{\mathcal{D}}}_{j}$, and determined λ by validation.

Scenario 2 can incorporate more detailed information about the relationships between tasks, allowing the proposed method to ignore tasks that share little information. In summary, we compared the following four methods:

- A: the proposed method with ${\lambda}_{k,j}$ determined by Scenario 1;
- B: the proposed method with ${\lambda}_{k,j}$ determined by Scenario 2;
- C: AdaBoost trained on each individual dataset;
- D: AdaBoost trained on all datasets simultaneously.

We utilized $80\%$ of the training dataset to train the classifiers and the remaining $20\%$ for validation. We repeated the above procedure 20 times and observed the averaged performance of the methods.
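The weighting rule of Scenario 2 can be sketched as follows. The function and variable names are illustrative, not from the paper; in practice, the matrix of pairwise IS distances would be computed from the pseudo models of the individually trained AdaBoost classifiers.

```python
import numpy as np

def scenario2_weights(pairwise_is, lam):
    """Scenario 2: lambda_{k,j} = lam / IS(q_{F_k}, q_{F_j}; r_k).

    pairwise_is[k, j] holds the (illustrative) IS distance between the pseudo
    models fitted individually on tasks k and j; the diagonal is unused.
    """
    J = pairwise_is.shape[0]
    lam_kj = np.zeros_like(pairwise_is, dtype=float)
    for k in range(J):
        for j in range(J):
            if j != k:
                # tasks far from task k (large IS distance) receive a small
                # weight, so their information is effectively ignored
                lam_kj[k, j] = lam / pairwise_is[k, j]
    return lam_kj

# e.g., task 2 is dissimilar to tasks 0 and 1
D = np.array([[0.0, 0.1, 5.0],
              [0.1, 0.0, 5.0],
              [5.0, 5.0, 0.0]])
W = scenario2_weights(D, lam=1.0)
```

With this toy distance matrix, tasks 0 and 1 share information strongly (weight 10), while the dissimilar task 2 is down-weighted (weight 0.2).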

#### 5.1.1. Dataset 1

We set the number J of tasks to three and assumed that the marginal distribution of x is uniform on ${[-1,1]}^{2}$ and that the discriminant function ${F}_{j}$ ($j=1,2,3$) associated with each dataset is generated by ${F}_{j}\left(x\right)=(1+{c}_{j,2})({x}_{1}-{c}_{j,1})-{x}_{2}$, where ${c}_{j,1}\sim \mathrm{\mathcal{N}}(0,0.{2}^{2})$ and ${c}_{j,2}\sim \mathrm{\mathcal{N}}(0,0.{1}^{2})$. In addition, we randomly added contamination noise to the labels y. Under these settings, we generated a training dataset of 400 examples and a test dataset of 600 examples for each task. The generated datasets are shown in Figure 2. We observe that each task's discriminant function and noise structure differ from the other two.
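Dataset 1 can be reproduced along the following lines. The label-flip rate used for the contamination noise is an assumption for illustration; the text states only that noise was added, not its level.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_task(n, flip=0.1, rng=rng):
    """One task of Dataset 1: x uniform on [-1, 1]^2, labels from
    F_j(x) = (1 + c_{j,2})(x_1 - c_{j,1}) - x_2, with label-flip noise.
    The flip rate 0.1 is an assumed value, not taken from the paper."""
    c1 = rng.normal(0.0, 0.2)                  # c_{j,1} ~ N(0, 0.2^2)
    c2 = rng.normal(0.0, 0.1)                  # c_{j,2} ~ N(0, 0.1^2)
    x = rng.uniform(-1.0, 1.0, size=(n, 2))    # uniform on [-1, 1]^2
    F = (1.0 + c2) * (x[:, 0] - c1) - x[:, 1]  # discriminant function F_j
    y = np.sign(F)
    noisy = rng.random(n) < flip
    y[noisy] = -y[noisy]                       # contamination noise on labels
    return x, y

# 400 training + 600 test examples for each of the J = 3 tasks
tasks = [make_task(400 + 600) for _ in range(3)]
```

Each task draws its own $({c}_{j,1},{c}_{j,2})$, so the three classification boundaries are related but not identical, matching the setting above.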

Figure 3 shows boxplots of the test errors of each method on datasets ${\mathrm{\mathcal{D}}}_{j}$ ($j=1,2,3$). We observe that the proposed method consistently outperforms both individually trained AdaBoost and AdaBoost trained on all datasets simultaneously. The figure shows that the proposed method can incorporate information shared among the datasets into the classifiers.

**Figure 3.** Boxplots of the test error of each method over the 20 simulation trials, for the three datasets: A, proposed method with λ in Scenario 1; B, proposed method with λ in Scenario 2; C, AdaBoost trained on the individual dataset; D, AdaBoost trained on all datasets simultaneously.

#### 5.1.2. Dataset 2

We set the number J of tasks to six and assume that the marginal distribution of x is uniform on ${[-1,1]}^{2}$. The discriminant functions associated with each dataset are generated by:

$$\begin{array}{c}\hfill {F}_{j}\left(x\right)=\left\{\begin{array}{cc}(1+{c}_{j,2})({x}_{1}-{c}_{j,1})-{x}_{2},\hfill & j=1,2,3,\hfill \\ -(1+{c}_{j,2})({x}_{1}-{c}_{j,1})+{x}_{2},\hfill & j=4,5,6,\hfill \end{array}\right.\end{array}$$

where ${c}_{j,1}\sim \mathrm{\mathcal{N}}(0,0.{1}^{2})$ and ${c}_{j,2}\sim \mathrm{\mathcal{N}}(0,0.{1}^{2})$. In addition, we randomly added contamination noise to the labels y. Under these settings, we generated a training dataset of 400 examples and a test dataset of 600 examples for each task. The generated datasets are shown in Figure 4. We observe that Datasets 1, 2 and 3 share one structure, while Datasets 4, 5 and 6 share another.

Figure 5 shows boxplots of the test errors of each method on datasets ${\mathrm{\mathcal{D}}}_{j}$ ($j=1,...,6$). We omitted the result of AdaBoost trained on all datasets simultaneously (D) from the figure, because its performance is significantly worse than those of the other methods: the median classification error is around $0.5$. This is because the structures of Datasets 1, 2, 3 and Datasets 4, 5, 6 are opposite, so the labeling of the concatenated dataset appears random. We observe that the proposed method with Scenario 2 (B) improves performance over individually trained AdaBoost (C) and the proposed method with Scenario 1 (A). This is because the structure shared among Datasets 1, 2 and 3 carries no information about Datasets 4, 5 and 6 (and vice versa), and Method (B) can ignore the influence of the irrelevant information by adjusting ${\lambda}_{k,j}$ according to $IS({q}_{{\widehat{F}}_{j}},{q}_{{\widehat{F}}_{k}};{r}_{k})$. Note that the performance of Method (A) is not severely degraded, because its regularization parameter ${\lambda}_{k,j}$ was determined to be zero by validation, reducing it to AdaBoost trained on the individual dataset.

Figure 6 shows examples of classification boundaries estimated by Methods A, B, C and D, for Dataset 6.

**Figure 5.** Boxplots of the test error of each method over the 20 simulation trials, for the six datasets: A, proposed method with λ in Scenario 1; B, proposed method with λ in Scenario 2; C, AdaBoost trained on the individual dataset.

**Figure 6.**Classification boundaries by Methods A, B, C and D for Dataset 6. The blue line is the true classification boundary, and the red line represents the estimated classification boundary.

#### 5.2. Real Dataset: School Dataset

In this section, we compared the proposed method (Scenario 2) with a binary decision tree-based ensemble method, called extremely randomized trees (ExtraTrees) [23], applied to a real dataset, "school data", collected by the Inner London Education Authority [24]. The dataset consists of examination records of 15,362 students from 139 secondary schools, i.e., we have 139 tasks. The dimension of the input x is 27, where the original categorical variables were transformed into dummy variables. The original target variable ${y}_{0}$ is a score in the range $[1,70]$, and we transformed it to a binary variable as:

$$y=\mathrm{sgn}({y}_{0}-20).$$

We set the threshold to 20 to balance the class ratio ($-1:+1=7930:7432$). We randomly divided the dataset of each task into $80\%$ for training and the remaining $20\%$ for testing. In addition, we used $20\%$ of the training portion as a validation dataset to determine the hyper-parameter λ and the step number T. We repeated the above procedure 20 times and observed the average performance of the methods. Figure 7 shows the medians of the error rates over the 20 trials for the proposed method and ExtraTrees across the 139 tasks. The horizontal axis indicates the task index, ranked in increasing order of the median error rate of ExtraTrees. We observe that the proposed method is comparable to ExtraTrees overall and has an advantage especially on tasks for which the error rates of ExtraTrees are large.
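The preprocessing and splitting described above can be sketched as follows. The input scores are synthetic placeholders, and scores of exactly 20 are assigned, by assumption, to the negative class (the paper does not specify how ties are handled).

```python
import numpy as np

rng = np.random.default_rng(0)

def binarize_and_split(y0, rng=rng):
    """Binarize raw exam scores y0 in [1, 70] as y = sgn(y0 - 20) and make
    the 80/20 train/test split, holding out 20% of the training part for
    validation.  Ties at y0 = 20 go to the negative class (an assumption)."""
    y = np.where(y0 > 20, 1, -1)       # y = sgn(y0 - 20)
    n = len(y)
    idx = rng.permutation(n)
    n_test = n // 5                    # 20% test
    test, train = idx[:n_test], idx[n_test:]
    n_val = len(train) // 5            # 20% of the training part for validation
    val, fit = train[:n_val], train[n_val:]
    return y, fit, val, test

# fake scores for illustration; the real data are per-school exam records
y0 = rng.integers(1, 71, size=100)
y, fit, val, test = binarize_and_split(y0)
```

In the experiment, this split would be applied independently to each of the 139 school tasks and repeated over the 20 trials.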

**Figure 7.**Medians of error rates by the proposed method and extremely randomized trees (ExtraTrees) for 139 tasks. The horizontal axis represents an index of a task, and the vertical axis indicates the median of error rates over 20 trials. Tasks are ranked in increasing order of the median error rate of the ExtraTrees.

## 6. Conclusions

In this paper, we investigated the properties of binary classification with the pseudo model and revealed that minimization of the Itakura–Saito distance between the empirical distribution and the pseudo model is equivalent to AdaBoost and provides suitable properties for binary classification. In addition, we pointed out that the Itakura–Saito distance is the unique divergence having a suitable property for estimation with the pseudo model in the framework of the Bregman divergence. Based on this framework, we proposed a novel binary classification method for multi-task learning, which incorporates information shared among tasks into the targeted task. The risk function of the proposed method is defined by a mixture of IS distances. The IS distance between pseudo models can be interpreted as a regularization term incorporating shared information among tasks into the binary classifier for the target task. We investigated statistical properties of the risk function and derived computationally feasible boosting-based algorithms. Furthermore, we considered a mechanism for adjusting the degree of information sharing and numerically investigated the validity of the proposed methods.

## Acknowledgments

This study was partially supported by a Grant-in-Aid for Young Scientists (B), 25730018, from MEXT, Japan. Shinto Eguchi and Osamu Komori were supported by the Japan Science and Technology Agency (JST), Core Research for Evolutionary Science and Technology (CREST).

## Author Contributions

Takashi Takenouchi made major contributions to employing the Itakura–Saito divergence, and Shinto Eguchi gave a proof for the characterization associated with the divergence. Takashi Takenouchi and Osamu Komori contributed to the statistical discussion for the multi-task learning.

## Conflicts of Interest

The authors declare no conflict of interest.

## Appendix

## A. Proof of Proposition 1

By a variational calculation, a minimizer of Equation (5) satisfies:

$$\begin{array}{cc}\hfill \frac{\delta IS(p,{q}_{F};r)}{\delta F\left(x\right)}& \propto \frac{{e}^{{F}_{0}\left(x\right)-F\left(x\right)}-{e}^{-{F}_{0}\left(x\right)+F\left(x\right)}}{{e}^{{F}_{0}\left(x\right)}+{e}^{-{F}_{0}\left(x\right)}}=0,\hfill \end{array}$$

and $F={F}_{0}$ satisfies the above equation for an arbitrary ${F}_{0}$, which concludes Equation (5). Furthermore,

$$\begin{array}{cc}\hfill \frac{\delta IS({q}_{F},p;r)}{\delta F\left(x\right)}& \propto \left({e}^{{F}_{0}\left(x\right)}+{e}^{-{F}_{0}\left(x\right)}\right)\left({e}^{F\left(x\right)-{F}_{0}\left(x\right)}-{e}^{-F\left(x\right)+{F}_{0}\left(x\right)}\right)=0,\hfill \end{array}$$

and $F={F}_{0}$ satisfies the above equation for an arbitrary ${F}_{0}$, concluding Equation (6).

## B. Proof of Proposition 2

By a straightforward variational calculation, we observe that a minimizer ${F}_{KL,1}$ of Equation (8) satisfies:

$$\begin{array}{cc}\hfill \frac{\delta KL(p,{q}_{F};r)}{\delta F\left(x\right)}& \propto -p(+1|x)+p(-1|x)+exp\left({F}_{KL,1}\left(x\right)\right)-exp(-{F}_{KL,1}\left(x\right))\hfill \\ & =\frac{-{e}^{{F}_{0}\left(x\right)}+{e}^{-{F}_{0}\left(x\right)}}{{e}^{{F}_{0}\left(x\right)}+{e}^{-{F}_{0}\left(x\right)}}+{e}^{{F}_{KL,1}\left(x\right)}-{e}^{-{F}_{KL,1}\left(x\right)}=0,\hfill \end{array}$$

and ${F}_{KL,1}={F}_{0}$ means ${F}_{0}\left(x\right)=0$ ($\forall x$), which concludes Equation (8). Furthermore, for Equation (9), ${F}_{KL,2}$ satisfies:

$$\begin{array}{cc}& \frac{\delta KL({q}_{F},p;r)}{\delta F\left(x\right)}\hfill \\ \hfill \propto & ({F}_{KL,2}\left(x\right)-{F}_{0}\left(x\right))({e}^{{F}_{KL,2}\left(x\right)}+{e}^{-{F}_{KL,2}\left(x\right)})+({e}^{{F}_{KL,2}\left(x\right)}-{e}^{-{F}_{KL,2}\left(x\right)})log({e}^{{F}_{0}\left(x\right)}+{e}^{-{F}_{0}\left(x\right)})\hfill \\ \hfill =& 0,\hfill \end{array}$$

and ${F}_{KL,2}={F}_{0}$ means ${F}_{0}\left(x\right)=0$ ($\forall x$), concluding Equation (9).

## C. Proof of Lemma 4

If Equation (15) holds, ${F}_{0}$ satisfies:

$$\begin{array}{cc}\hfill {\left.\frac{\delta {D}_{U}(p,{q}_{F};r)}{\delta F\left(x\right)}\right|}_{F={F}_{0}}& =\left(1-\frac{1}{{\sum}_{y\in \mathrm{\mathcal{Y}}}{q}_{{F}_{0}}\left(y\right|x)}\right)\sum _{y\in \mathrm{\mathcal{Y}}}y{\xi}^{\prime}\left({q}_{{F}_{0}}\left(y\right|x)\right){q}_{{F}_{0}}{\left(y\right|x)}^{2}\hfill \\ & \propto {\xi}^{\prime}\left({e}^{{F}_{0}\left(x\right)}\right){e}^{2{F}_{0}\left(x\right)}-{\xi}^{\prime}\left({e}^{-2{F}_{0}\left(x\right)}\right){e}^{-2{F}_{0}\left(x\right)}\hfill \\ & =0.\hfill \end{array}$$

By setting $z={e}^{{F}_{0}\left(x\right)}$, we have ${z}^{2}{\xi}^{\prime}\left(z\right)={z}^{-2}{\xi}^{\prime}\left({z}^{-1}\right)$; that is, the function ${\xi}^{\prime}\left(z\right){z}^{2}$ is reflection-symmetric.

If Equation (16) holds, ${F}_{0}$ satisfies:

$$\begin{array}{cc}& {\left.\frac{\delta {D}_{U}({q}_{F},p;r)}{\delta F\left(x\right)}\right|}_{F={F}_{0}}\hfill \\ \hfill =& \sum _{y\in \mathrm{\mathcal{Y}}}y{q}_{{F}_{0}}\left(y\right|x)\left\{\xi \left({q}_{{F}_{0}}\left(y\right|x)\right)-\xi \left({\overline{q}}_{{F}_{0}}\left(y\right|x)\right)\right\}\hfill \\ \hfill =& {e}^{{F}_{0}\left(x\right)}\left\{\xi \left({e}^{{F}_{0}\left(x\right)}\right)-\xi \left(\frac{{e}^{{F}_{0}\left(x\right)}}{{e}^{{F}_{0}\left(x\right)}+{e}^{-{F}_{0}\left(x\right)}}\right)\right\}-{e}^{-{F}_{0}\left(x\right)}\left\{\xi \left({e}^{-{F}_{0}\left(x\right)}\right)-\xi \left(\frac{{e}^{-{F}_{0}\left(x\right)}}{{e}^{{F}_{0}\left(x\right)}+{e}^{-{F}_{0}\left(x\right)}}\right)\right\}\hfill \\ \hfill =& 0,\hfill \end{array}$$

implying that the function $z\left\{\xi \left(z\right)-\xi \left(\frac{z}{z+{z}^{-1}}\right)\right\}$ is reflection-symmetric.

## D. Proof of Theorem 5

For the proof of the theorem, we firstly prepare the following lemmas.

**Lemma 11.** Let $f\left(z\right)$ be a reflection-symmetric function that is holomorphic on $z\ne 0$, with Laurent coefficients ${a}_{k}$, ${b}_{k}$ as in Equation (14). Then, ${a}_{k}={b}_{k}$ holds for all $k\ge 1$.

**Proof.** The function f can be expressed as Equation (14). Assume that there exists an integer ${k}_{0}$ such that ${a}_{{k}_{0}}\ne {b}_{{k}_{0}}$. From the reflection-symmetric property, we would have:

$$({a}_{{k}_{0}}-{b}_{{k}_{0}})({z}^{{k}_{0}}-{z}^{-{k}_{0}})=0$$

for all $z\ne 0$, which is a contradiction because ${z}^{{k}_{0}}-{z}^{-{k}_{0}}$ does not vanish identically. Hence ${a}_{k}={b}_{k}$ for all $k\ge 1$.

**Lemma 12.** Let $\xi \left(z\right)$ be a holomorphic function on $z\ne 0$. If the two functions:

$$\begin{array}{c}\hfill {\xi}^{\prime}\left(z\right){z}^{2},\phantom{\rule{4.pt}{0ex}}\text{and}\phantom{\rule{4.pt}{0ex}}z\left\{\xi \left(z\right)-\xi \left(\frac{z}{z+{z}^{-1}}\right)\right\}\end{array}$$

are both reflection-symmetric, then $\xi \left(z\right)=c+\frac{{b}_{1}}{z}$ for some constants c and ${b}_{1}$.

**Proof.**We can express the function $\xi \left(z\right)$ by a Laurent series as:

$$\begin{array}{c}\hfill \xi \left(z\right)=c+\sum _{k=1}^{\infty}\left({a}_{k}{z}^{k}+{b}_{k}{z}^{-k}\right).\end{array}$$

Then, we have:

$$\begin{array}{cc}\hfill {\xi}^{\prime}\left(z\right){z}^{2}& =\sum _{k=1}^{\infty}k\left({a}_{k}{z}^{k+1}-{b}_{k}{z}^{-k+1}\right)\hfill \\ & =-{b}_{1}-2{b}_{2}{z}^{-1}+\sum _{k=1}^{\infty}\left(k{a}_{k}{z}^{k+1}-(k+2){b}_{k+2}{z}^{-k-1}\right).\hfill \end{array}$$

Because of the assumption of reflection-symmetry for ${z}^{2}\xi \prime \left(z\right)$ and Lemma 11, we have ${b}_{2}=0$ and $k{a}_{k}=-(k+2){b}_{k+2}$ for all $k\ge 1$. Thus, we obtain:

$$\begin{array}{cc}\hfill \xi \left(z\right)& =\int -\frac{{b}_{1}}{{z}^{2}}+\sum _{k=1}^{\infty}{a}_{k}\left(k{z}^{k-1}+k{z}^{-k-3}\right)dz\hfill \\ & =c+{b}_{1}{z}^{-1}+\sum _{k=1}^{\infty}{a}_{k}\left({z}^{k}-\frac{k}{k+2}{z}^{-k-2}\right).\hfill \end{array}$$

Then, we have:

$$\begin{array}{cc}& z\left\{\xi \left(z\right)-\xi \left(\frac{z}{z+{z}^{-1}}\right)\right\}\hfill \\ \hfill =& {b}_{1}(1-(z+{z}^{-1}))+\sum _{k=1}^{\infty}{a}_{k}\left\{{z}^{k+1}(1-{(z+{z}^{-1})}^{-k})-\frac{k}{k+2}{z}^{-k-1}(1-{(z+{z}^{-1})}^{k+2})\right\}.\hfill \end{array}$$

From Equation (48) and the assumption of the reflection-symmetry of the function $z\left\{\xi \left(z\right)-\xi \left(\frac{z}{z+{z}^{-1}}\right)\right\}$, we observe that for all z,

$$\begin{array}{cc}\hfill z\left\{\xi \left(z\right)-\xi \left(\frac{z}{z+{z}^{-1}}\right)\right\}-{z}^{-1}\left\{\xi \left({z}^{-1}\right)-\xi \left(\frac{{z}^{-1}}{z+{z}^{-1}}\right)\right\}=& \sum _{k=1}^{\infty}{a}_{k}{h}_{k}\left(z\right)\hfill \\ \hfill =& 0\hfill \end{array}$$

where:

$$\begin{array}{c}\hfill {h}_{k}\left(z\right)=\left({z}^{k+1}-{z}^{-k-1}\right)\left\{1-{(z+{z}^{-1})}^{-k}+\frac{k}{k+2}\left\{1-{(z+{z}^{-1})}^{k+2}\right\}\right\}.\end{array}$$

Since the functions ${\left\{{h}_{k}\left(z\right)\right\}}_{k=1}^{\infty}$ are functionally independent, we conclude that ${a}_{k}=0$ for all $k\ge 1$ or, equivalently, $\xi \left(z\right)=c+\frac{{b}_{1}}{z}$.
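As a sanity check on Lemma 12, both functions are indeed reflection-symmetric, $f(z)=f({z}^{-1})$, for the IS generating choice $\xi (z)=\frac{1}{z}$ (taking ${b}_{1}=1$ and $c=0$ for illustration):

```python
import numpy as np

# xi(z) = 1/z is the IS case (b1 = 1, c = 0, assumed values for illustration)
xi = lambda z: 1.0 / z
g1 = lambda z: (-1.0 / z**2) * z**2                 # xi'(z) z^2, with xi'(z) = -1/z^2
g2 = lambda z: z * (xi(z) - xi(z / (z + 1.0 / z)))  # z{xi(z) - xi(z/(z + 1/z))}

z = np.linspace(0.2, 5.0, 50)
sym1 = np.allclose(g1(z), g1(1.0 / z))  # g1 is constant (-1), trivially symmetric
sym2 = np.allclose(g2(z), g2(1.0 / z))  # g2(z) = 1 - z - 1/z, symmetric in z <-> 1/z
```

Here $g_2(z)=1-z-{z}^{-1}$, which is visibly invariant under $z\mapsto {z}^{-1}$, in agreement with the lemma.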

We now give a proof for Theorem 5 using Lemma 12.

**Proof.** If condition Equations (19) and (20) hold, the functions ${\xi}^{\prime}\left(z\right){z}^{2}$ and $z\left\{\xi \left(z\right)-\xi \left(\frac{z}{z+{z}^{-1}}\right)\right\}$ are both reflection-symmetric from Lemma 4. From Lemma 12, the reflection-symmetric property of these two functions implies $\xi \left(z\right)=\frac{{b}_{1}}{z}+c$. Since the function should be defined on $z>0$, the generating function U derived from ξ is written as:

$$\begin{array}{c}\hfill U\left(z\right)=\int {\xi}^{-1}\left(z\right)dz={b}_{1}log(c-z)+{c}_{1}\phantom{\rule{4pt}{0ex}}(z<c),\end{array}$$

and the associated divergence becomes:

$$\begin{array}{cc}\hfill {D}_{U}(p,q;r)& =\int r\left(x\right)\sum _{y\in \mathrm{\mathcal{Y}}}\left\{-{b}_{1}log\frac{q\left(y\right|x)}{p\left(y\right|x)}-p\left(y\right|x)\left\{\frac{{b}_{1}}{q\left(y\right|x)}-\frac{{b}_{1}}{p\left(y\right|x)}\right\}\right\}dx\hfill \\ & =-{b}_{1}\int r\left(x\right)\sum _{y\in \mathrm{\mathcal{Y}}}\left\{log\frac{q\left(y\right|x)}{p\left(y\right|x)}+\frac{p\left(y\right|x)}{q\left(y\right|x)}-1\right\}dx\hfill \\ & =-{b}_{1}IS(p,q;r),\hfill \end{array}$$

which concludes the theorem.

## E. Proof of Proposition 7

From Equation (28), we observe:

$$\begin{array}{cc}& {F}_{k}^{*}\left(x\right)\hfill \\ \hfill =& log\frac{\sqrt{{p}_{k}(+1|x)+\frac{1}{4{p}_{k}(-1|x)}{\left({\sum}_{j\ne k}{\lambda}_{k,j}\left\{{e}^{-{\eta}_{j}\left(x\right)}-{e}^{{\eta}_{j}\left(x\right)}\right\}\right)}^{2}}-\frac{1}{2\sqrt{{p}_{k}(-1|x)}}{\sum}_{j\ne k}{\lambda}_{k,j}\left\{{e}^{-{\eta}_{j}\left(x\right)}-{e}^{{\eta}_{j}\left(x\right)}\right\}}{\sqrt{{p}_{k}(-1|x)}}\hfill \\ \hfill \simeq & \frac{1}{2}log\frac{{p}_{k}(+1|x)}{{p}_{k}(-1|x)}+\frac{1}{P}\sum _{j\ne k}{\lambda}_{k,j}{\eta}_{j}\left(x\right).\hfill \end{array}$$

## F. Proof of Proposition 8

We observe that:

$$\begin{array}{cc}& {\left({\overline{F}}_{k}^{*}\left(x\right)-\frac{1}{2}log\frac{{p}_{0}(+1|x)}{{p}_{0}(-1|x)}\right)}^{2}-{\left({F}_{k}^{*}\left(x\right)-\frac{1}{2}log\frac{{p}_{0}(+1|x)}{{p}_{0}(-1|x)}\right)}^{2}\hfill \\ \hfill =& \frac{1}{4{P}^{4}{(P+{\lambda}_{k})}^{2}}\left({\lambda}_{k}{\delta}_{k}\left(x\right)-\sum _{j\ne k}{\lambda}_{k,j}{\epsilon}_{j}\left(x\right)\right)\left(({\lambda}_{k}+2P){\delta}_{k}\left(x\right)+\sum _{j\ne k}{\lambda}_{k,j}{\epsilon}_{j}\left(x\right)\right),\hfill \end{array}$$

which implies the proposition.

## G. Proof of Proposition 9

The minimizer of the risk function Equation (27) satisfies:

$$\begin{array}{cc}\hfill \frac{\delta L({F}_{1},...,{F}_{J})}{\delta {F}_{k}}\propto & {e}^{{F}_{k}\left(x\right)}\left\{{\pi}_{k}{p}_{k}(-1|x)+\sum _{j\ne k}({\pi}_{k}{\lambda}_{k,j}+{\pi}_{j}{\lambda}_{j,k}){e}^{-{F}_{j}\left(x\right)}\right\}\hfill \\ & -{e}^{-{F}_{k}\left(x\right)}\left\{{\pi}_{k}{p}_{k}(+1|x)+\sum _{j\ne k}({\pi}_{k}{\lambda}_{k,j}+{\pi}_{j}{\lambda}_{j,k}){e}^{{F}_{j}\left(x\right)}\right\}\hfill \\ \hfill =& 0,\hfill \end{array}$$

implying Equation (36).

## References

1. Caruana, R. Multitask learning. *Mach. Learn.* **1997**, *28*, 41–75.
2. Pan, S.J.; Yang, Q. A survey on transfer learning. *IEEE Trans. Knowl. Data Eng.* **2010**, *22*, 1345–1359.
3. Argyriou, A.; Pontil, M.; Ying, Y.; Micchelli, C.A. A spectral regularization framework for multi-task structure learning. In *Advances in Neural Information Processing Systems 19*; MIT Press: Cambridge, MA, USA, 2007.
4. Evgeniou, A.; Pontil, M. Multi-task feature learning. In *Advances in Neural Information Processing Systems 19*; MIT Press: Cambridge, MA, USA, 2007.
5. Dai, W.; Yang, Q.; Xue, G.R.; Yu, Y. Boosting for transfer learning. In Proceedings of the 24th International Conference on Machine Learning, Corvallis, OR, USA, 20–24 June 2007; pp. 193–200.
6. Wang, X.; Zhang, C.; Zhang, Z. Boosted multi-task learning for face verification with applications to web image and video search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 142–149.
7. Chapelle, O.; Shivaswamy, P.; Vadrevu, S.; Weinberger, K.; Zhang, Y.; Tseng, B. Multi-task learning for boosting with application to web search ranking. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, 25–28 July 2010; pp. 1189–1198.
8. Cichocki, A.; Amari, S. Families of alpha- beta- and gamma-divergences: Flexible and robust measures of similarities. *Entropy* **2010**, *12*, 1532–1568.
9. Févotte, C.; Bertin, N.; Durrieu, J.L. Nonnegative matrix factorization with the Itakura–Saito divergence: With application to music analysis. *Neural Comput.* **2009**, *21*, 793–830.
10. Lefevre, A.; Bach, F.; Févotte, C. Itakura–Saito nonnegative matrix factorization with group sparsity. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Prague, Czech Republic, 22–27 May 2011; pp. 21–24.
11. Takenouchi, T.; Komori, O.; Eguchi, S. A novel boosting algorithm for multi-task learning based on the Itakura–Saito divergence. In Proceedings of the Bayesian Inference and Maximum Entropy Methods in Science and Engineering (MaxEnt 2014), Amboise, France, 21–26 September 2014; pp. 230–237.
12. Murata, N.; Takenouchi, T.; Kanamori, T.; Eguchi, S. Information geometry of U-boost and Bregman divergence. *Neural Comput.* **2004**, *16*, 1437–1481.
13. Banerjee, A.; Merugu, S.; Dhillon, I.S.; Ghosh, J. Clustering with Bregman divergences. *J. Mach. Learn. Res.* **2005**, *6*, 1705–1749.
14. Amari, S.; Nagaoka, H. *Methods of Information Geometry*; Translations of Mathematical Monographs, Volume 191; Oxford University Press: Providence, RI, USA, 2000.
15. Mihoko, M.; Eguchi, S. Robust blind source separation by beta divergence. *Neural Comput.* **2002**, *14*, 1859–1886.
16. Cichocki, A.; Cruces, S.; Amari, S.I. Generalized alpha-beta divergences and their application to robust nonnegative matrix factorization. *Entropy* **2011**, *13*, 134–170.
17. Takenouchi, T.; Eguchi, S. Robustifying AdaBoost by adding the naive error rate. *Neural Comput.* **2004**, *16*, 767–787.
18. Takenouchi, T.; Eguchi, S.; Murata, N.; Kanamori, T. Robust boosting algorithm against mislabeling in multi-class problems. *Neural Comput.* **2008**, *20*, 1596–1630.
19. Lebanon, G.; Lafferty, J. Boosting and maximum likelihood for exponential models. In *Advances in Neural Information Processing Systems 14*; MIT Press: Cambridge, MA, USA, 2002.
20. Evgeniou, T.; Pontil, M. Regularized multi-task learning. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Seattle, WA, USA, 22–25 August 2004; pp. 109–117.
21. Xue, Y.; Liao, X.; Carin, L.; Krishnapuram, B. Multi-task learning for classification with Dirichlet process priors. *J. Mach. Learn. Res.* **2007**, *8*, 35–63.
22. Mason, L.; Baxter, J.; Bartlett, P.; Frean, M. Boosting algorithms as gradient descent in function space. In *Advances in Neural Information Processing Systems 11*; MIT Press: Cambridge, MA, USA, 1999.
23. Geurts, P.; Ernst, D.; Wehenkel, L. Extremely randomized trees. *Mach. Learn.* **2006**, *63*, 3–42.
24. Goldstein, H. Multilevel modelling of survey data. *J. R. Stat. Soc. Ser. D* **1991**, *40*, 235–244.

© 2015 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/4.0/).