# Binary Classification with a Pseudo Exponential Model and Its Application for Multi-Task Learning


## Abstract


## 1. Introduction

## 2. Settings

## 3. Itakura–Saito Distance and Pseudo Model

#### 3.1. Parameter Estimation with the Pseudo Model

**Proposition 1.** Let $p(y|x)={\overline{q}}_{{F}_{0}}(y|x)$ be the underlying distribution. Then, we observe:

**Proof.** See Appendix A.

**Proposition 2.** Let ${F}_{0}$ ($\ne 0$) be a function and $p(y|x)={\overline{q}}_{{F}_{0}}(y|x)$ be the underlying distribution. Then, we observe:

**Proof.** See Appendix B.

**Remark 1.** Let $p(y|x)={\overline{q}}_{{F}_{0}}(y|x)$ be the underlying distribution. Then, the minimizer (Equation (8) or (9)) of the extended KL divergence attains the Bayes rule, i.e.,

#### 3.2. Characterization of the Itakura–Saito Distance

**Remark 2.** The KL divergence and the Itakura–Saito distance are special cases of the Bregman U-divergence (Equation (11)) with generating functions $U(z)=\exp(z)$ and $U(z)=-\log(c-z)+{c}_{1}$ ($z<c$), respectively, where c and ${c}_{1}$ are constants.
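The two reductions in Remark 2 can be checked numerically. Below is a minimal sketch of the pointwise Bregman U-divergence $U(\xi(q)) - U(\xi(p)) - p(\xi(q)-\xi(p))$, where ξ is the inverse of $U'$; the function names and the discretized, pointwise form are illustrative, not part of the paper's notation.

```python
import numpy as np

def bregman_u_divergence(p, q, U, xi):
    """Pointwise Bregman U-divergence d_U(p, q) = U(xi(q)) - U(xi(p)) - p*(xi(q) - xi(p)),
    where xi is the inverse of U'. Integrating over the domain gives the divergence."""
    return U(xi(q)) - U(xi(p)) - p * (xi(q) - xi(p))

p = np.array([0.2, 0.8])
q = np.array([0.3, 0.7])

# U(z) = exp(z): U' = exp, so xi = log, which recovers the extended KL terms
# q - p - p*log(q/p).
kl_terms = bregman_u_divergence(p, q, U=np.exp, xi=np.log)

# U(z) = -log(c - z) + c1 (z < c): U'(z) = 1/(c - z), so xi(u) = c - 1/u,
# which recovers the Itakura-Saito terms p/q - log(p/q) - 1.
c, c1 = 1.0, 0.0
is_terms = bregman_u_divergence(
    p, q,
    U=lambda z: -np.log(c - z) + c1,
    xi=lambda u: c - 1.0 / u)
```

The constants c and ${c}_{1}$ cancel in the divergence, which is why the Itakura–Saito terms come out independent of them.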

**Definition 3.** A function $f(z)$ is reflection-symmetric if:

**Remark 3.** If the function f is reflection-symmetric and holomorphic over $\mathbb{R}$, then ${a}_{k}={b}_{k}=0$ holds for all k, and hence f is a constant function.

**Lemma 4.** Let ${F}_{0}$ be an arbitrary function, $p(y|x)={\overline{q}}_{{F}_{0}}(y|x)$ be the underlying distribution and ${q}_{F}(x)$ be the pseudo model of Equation (3). If the Bregman U-divergence associated with the function U attains:

**Proof.** See Appendix C.

**Remark 4.** Proposition 1 implies that the function ξ associated with the IS distance satisfies Lemma 4.

**Remark 5.** The propositions imply that the function U (i.e., the Bregman U-divergence) attaining Equation (15) or (16) is not unique, and there exist divergences satisfying Equation (15) or (16) other than the Itakura–Saito distance. For example, a function:

**Theorem 5.** Let $p(y|x)={\overline{q}}_{{F}_{0}}(y|x)$ be the underlying distribution and ${q}_{F}(x)$ be the pseudo model of Equation (3). If the conditions:

**Proof.** See Appendix D.

**Remark 6.** If we assume that the function ${\xi}^{\prime}(z){z}^{2}$ derived from U is reflection-symmetric and holomorphic over $\mathbb{R}$, then ${\xi}^{\prime}(z){z}^{2}$ is a constant function by Remark 3. We then obtain $\xi(z)=c+\frac{{b}_{1}}{z}$, where $c,{b}_{1}$ are constants, implying that the associated divergence is equivalent to the Itakura–Saito distance.

#### 3.3. Relationship with AdaBoost

## 4. Application for Multi-Task Learning

- Case 1:
- There is a target dataset ${\mathrm{\mathcal{D}}}_{k}$, and our interest is to construct a discriminant function ${F}_{k}$ utilizing the remaining datasets ${\mathrm{\mathcal{D}}}_{j}$ ($j\ne k$) or a priori constructed discriminant functions ${F}_{j}$ ($j\ne k$).
- Case 2:
- Our interest is to simultaneously construct better discriminant functions ${F}_{1},...,{F}_{J}$ using all J datasets ${\mathrm{\mathcal{D}}}_{1},...,{\mathrm{\mathcal{D}}}_{J}$ by utilizing shared information among the datasets.

#### 4.1. Case 1

- (1)
- Initialize the function to ${F}_{k}^{0}$, and define weights for the i-th example with a function F as:$$\begin{array}{cc}\hfill {w}_{1}(i;F)& =\frac{{e}^{-F\left({x}_{i}^{\left(k\right)}\right){y}_{i}^{\left(k\right)}}}{{Z}_{1}\left(F\right)},\hfill \\ \hfill {w}_{2}(i;F)& =\frac{{\sum}_{j\ne k}{\lambda}_{k,j}{e}^{f\left({x}_{i}^{\left(k\right)}\right)(F\left({x}_{i}^{\left(k\right)}\right)-{F}_{j}\left({x}_{i}^{\left(k\right)}\right))}}{{Z}_{2}\left(F\right)},\hfill \end{array}$$ where the normalization constants are:$$\begin{array}{cc}\hfill {Z}_{1}\left(F\right)& =\sum _{i=1}^{{n}_{k}}{e}^{-F\left({x}_{i}^{\left(k\right)}\right){y}_{i}^{\left(k\right)}},\hfill \\ \hfill {Z}_{2}\left(F\right)& =\sum _{i=1}^{{n}_{k}}\sum _{j\ne k}{\lambda}_{k,j}\left({e}^{F\left({x}_{i}^{\left(k\right)}\right)-{F}_{j}\left({x}_{i}^{\left(k\right)}\right)}+{e}^{-F\left({x}_{i}^{\left(k\right)}\right)+{F}_{j}\left({x}_{i}^{\left(k\right)}\right)}\right).\hfill \end{array}$$
- (2)
- For $t=1,...,T$
- (a)
- Select a weak classifier ${f}_{k}^{t}$ taking values in $\{\pm 1\}$ that minimizes the following quantity:$$\epsilon \left(f\right)=\frac{{Z}_{1}\left({F}_{k}^{t-1}\right)}{{Z}_{1}\left({F}_{k}^{t-1}\right)+{Z}_{2}\left({F}_{k}^{t-1}\right)}{\epsilon}_{1}\left(f\right)+\frac{{Z}_{2}\left({F}_{k}^{t-1}\right)}{{Z}_{1}\left({F}_{k}^{t-1}\right)+{Z}_{2}\left({F}_{k}^{t-1}\right)}{\epsilon}_{2}\left(f\right).$$
- (b)
- Calculate the coefficient of ${f}_{k}^{t}$ by ${\alpha}_{k}^{t}=\frac{1}{2}\log\frac{1-\epsilon \left({f}_{k}^{t}\right)}{\epsilon \left({f}_{k}^{t}\right)}$.
- (c)
- Update the discriminant function as ${F}_{k}^{t}={F}_{k}^{t-1}+{\alpha}_{k}^{t}{f}_{k}^{t}$.

- (3)
- Output ${F}_{k}^{T}\left(x\right)={F}_{k}^{0}\left(x\right)+{\sum}_{t=1}^{T}{\alpha}_{k}^{t}{f}_{k}^{t}\left(x\right)$.
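The steps above can be sketched in code. Since the definitions of ${\epsilon}_{1}(f)$ and ${\epsilon}_{2}(f)$ are not reproduced above, this sketch assumes ${\epsilon}_{1}(f)$ is the ${w}_{1}$-weighted misclassification rate and ${\epsilon}_{2}(f)$ is the total ${w}_{2}$-mass of the candidate f; the toy dataset, the stump pool, and the auxiliary function ${F}_{j}$ are all illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy target task D_k and one a priori auxiliary function F_j (both illustrative).
X = rng.normal(size=(50, 2))
y = np.where(X[:, 0] + 0.1 * rng.normal(size=50) > 0, 1.0, -1.0)
F_aux = [lambda X: 0.9 * np.sign(X[:, 0])]   # hypothetical F_j, j != k
lam = [0.1]                                  # lambda_{k,j}

# Weak classifier pool: coordinate sign stumps f(x) = s * sign(x_d).
stumps = [(lambda X, d=d, s=s: s * np.sign(X[:, d]))
          for d in range(X.shape[1]) for s in (+1.0, -1.0)]

def Z1(F):  # Z_1(F) = sum_i exp(-F(x_i) y_i)
    return np.sum(np.exp(-F * y))

def Z2(F):  # Z_2(F) = sum_i sum_{j!=k} lambda_{k,j} (e^{F-F_j} + e^{-(F-F_j)})
    return sum(l * np.sum(np.exp(F - Fj(X)) + np.exp(-(F - Fj(X))))
               for l, Fj in zip(lam, F_aux))

T = 10
F = np.zeros(len(y))   # step (1): F_k^0 = 0, tracked as values on the sample
alphas = []
for t in range(T):     # step (2)
    z1, z2 = Z1(F), Z2(F)
    w1 = np.exp(-F * y) / z1
    best_eps, best_f = np.inf, None
    for f in stumps:   # step (2a): pick the candidate minimizing epsilon(f)
        fx = f(X)
        # w_2 depends on the candidate f through exp(f(x)(F - F_j)).
        w2 = sum(l * np.exp(fx * (F - Fj(X))) for l, Fj in zip(lam, F_aux)) / z2
        eps1 = np.sum(w1 * (fx != y))   # assumed: w1-weighted misclassification
        eps2 = np.sum(w2)               # assumed: total w2 mass of the candidate
        eps = (z1 * eps1 + z2 * eps2) / (z1 + z2)
        if eps < best_eps:
            best_eps, best_f = eps, f
    alpha = 0.5 * np.log((1 - best_eps) / best_eps)  # step (2b)
    F = F + alpha * best_f(X)                        # step (2c)
    alphas.append(alpha)
# step (3): F_k^T(x) = F_k^0(x) + sum_t alpha_t f_t(x)
```

Note that ${\epsilon}_{2}(f)>0$ always holds here, so the mixed error stays strictly inside $(0,1)$ and ${\alpha}_{k}^{t}$ remains finite even when a stump separates the target data perfectly.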

#### 4.2. Case 2

- (1)
- Initialize functions ${F}_{1},...,{F}_{J}$.
- (2)
- For $t=1,...,T$:
- (a)
- Randomly choose a target index $k\in \{1,...,J\}$.
- (b)
- Update the function ${F}_{k}$ using the algorithm in Case 1 by S steps, with fixed functions ${F}_{j}$ ($j\ne k$).

- (3)
- Output learned functions ${F}_{1},...,{F}_{J}$.
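The Case 2 procedure is a randomized round-robin over tasks; a minimal sketch follows, where `update_case1` is a hypothetical callback standing in for S steps of the Case 1 algorithm on the chosen task:

```python
import random

def case2(F, S, T, update_case1, seed=0):
    """Round-robin multi-task fitting. F is a dict {task index: discriminant function}.
    update_case1(k, F, S) is assumed to run S steps of the Case 1 algorithm on task k,
    with the other functions F_j (j != k) held fixed, and return the updated F_k."""
    rng = random.Random(seed)
    for _ in range(T):
        k = rng.choice(sorted(F))        # step (2a): pick a target task at random
        F[k] = update_case1(k, F, S)     # step (2b): S boosting steps on task k
    return F                             # step (3): all learned functions
```

Because only one ${F}_{k}$ moves per outer iteration while the rest stay fixed, each inner call sees a stationary regularization target, mirroring the alternating structure of the algorithm above.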

#### 4.3. Statistical Properties of the Proposed Methods

**Remark 7.** ${F}_{k}^{*}(x)\ge 0$ does not imply ${p}_{k}(+1|x)\ge \frac{1}{2}$ unless ${F}_{j}(x)=\frac{1}{2}\log\frac{{p}_{k}(+1|x)}{{p}_{k}(-1|x)}$ holds.
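When the half-log-odds calibration in Remark 7 does hold, it inverts to a sigmoid in 2F, so the sign of F and the $\frac{1}{2}$-threshold on the posterior coincide. A small sketch (the function name is ours):

```python
import math

def posterior_from_half_log_odds(F):
    """Invert F(x) = 0.5 * log(p(+1|x) / p(-1|x)) to p(+1|x) = 1 / (1 + exp(-2F))."""
    return 1.0 / (1.0 + math.exp(-2.0 * F))

# Round trip: posterior -> discriminant function -> posterior.
p = 0.8
F = 0.5 * math.log(p / (1 - p))
recovered = posterior_from_half_log_odds(F)
```

Remark 7's caveat is exactly that an ${F}_{j}$ not of this form breaks the correspondence between $F(x)\ge 0$ and ${p}_{k}(+1|x)\ge \frac{1}{2}$.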

**Proposition 6.** Let us assume that ${F}_{j}(x)$ satisfies:

**Proposition 7.** Let ${\eta}_{j}(x)={F}_{j}(x)-{F}_{k}(x)$ be the difference between the two functions. Then, ${F}_{k}^{*}$ can be approximated as:

**Proof.** See Appendix E.

**Proposition 8.** Let ${\overline{F}}_{k}^{*}$ be a minimizer of the risk function of Equation (23) with ${\lambda}_{k,j}=0$ ($j\ne k$). Then, we observe:

**Proof.** See Appendix F.

**Proposition 9.** Let $r(x)={r}_{j}(x)$ ($j=1,...,J$) be a common marginal distribution shared by all tasks. Then, the minimizer of the risk function is written as:

**Proof.** See Appendix G.

#### 4.4. Comparison of Regularization Terms

**Proposition 10.** Let $\epsilon(x)$ be a perturbation function satisfying $|\epsilon(x)|\ll 1$. Then, we observe:

**Proof.** These approximations follow from the Taylor expansion up to second order.

## 5. Experiments

#### 5.1. Synthetic Dataset

- We set ${\lambda}_{k,j}=\lambda$ for all $j,k$ and determined λ.
- We set ${\lambda}_{k,j}=\frac{\lambda}{IS\left({q}_{{\widehat{F}}_{k}},{q}_{{\widehat{F}}_{j}};{r}_{k}\right)}$, where ${\widehat{F}}_{j}$ is a discriminant function constructed by AdaBoost with the dataset ${\mathrm{\mathcal{D}}}_{j}$, and determined λ.

- A:
- The proposed method with ${\lambda}_{k,j}$ determined by Scenario 1.
- B:
- The proposed method with ${\lambda}_{k,j}$ determined by Scenario 2.
- C:
- AdaBoost trained with an individual dataset.
- D:
- AdaBoost trained with all datasets simultaneously.

#### 5.1.1. Dataset 1

**Figure 3.**Boxplots of the test error of each method: A—proposed method with λ in Scenario 1; B—proposed method with λ in Scenario 2; C—AdaBoost trained with the individual dataset; D—AdaBoost trained with all datasets simultaneously; for three datasets, over the 20 simulation trials.

#### 5.1.2. Dataset 2

**Figure 5.** Boxplots of the test error of each method: A, proposed method with λ in Scenario 1; B, proposed method with λ in Scenario 2; C, AdaBoost trained with the individual dataset; for six datasets, over the 20 simulation trials.

**Figure 6.**Classification boundaries by Methods A, B, C and D for Dataset 6. The blue line is the true classification boundary, and the red line represents the estimated classification boundary.

#### 5.2. Real Dataset: School Dataset

**Figure 7.**Medians of error rates by the proposed method and extremely randomized trees (ExtraTrees) for 139 tasks. The horizontal axis represents an index of a task, and the vertical axis indicates the median of error rates over 20 trials. Tasks are ranked in increasing order of the median error rate of the ExtraTrees.

## 6. Conclusions

## Acknowledgments

## Author Contributions

## Conflicts of Interest

## Appendix

## A. Proof of Proposition 1

## B. Proof of Proposition 2

## C. Proof of Lemma 4

## D. Proof of Theorem 5

**Lemma 11.** Let $f(z)$ be a reflection-symmetric function, holomorphic on $z\ne 0$. Then, ${a}_{k}={b}_{k}$ holds for all $k\ge 1$.

**Proof.** The function f can be expressed as in Equation (14); let us assume that there exists an integer ${k}_{0}$ such that ${a}_{{k}_{0}}\ne {b}_{{k}_{0}}$. By the reflection-symmetric property, we have:

**Lemma 12.** Let $\xi(z)$ be a holomorphic function on $z\ne 0$. If the two functions:

**Proof.** We can express the function $\xi(z)$ by a Laurent series as:

**Proof.** If Equations (19) and (20) hold, the functions ${\xi}^{\prime}(z){z}^{2}$ and $z\left\{\xi(z)-\xi\left(\frac{z}{z+{z}^{-1}}\right)\right\}$ are both reflection-symmetric by Lemma 4. From Lemma 12, the reflection-symmetric property of these two functions implies $\xi(z)=\frac{{b}_{1}}{z}+c$. Since the function should be defined on $z>0$, the generating function U derived from ξ is written as:

## E. Proof of Proposition 7

## F. Proof of Proposition 8

## G. Proof of Proposition 9

## References

- Caruana, R. Multitask learning. Mach. Learn. **1997**, 28, 41–75.
- Pan, S.J.; Yang, Q. A survey on transfer learning. IEEE Trans. Knowl. Data Eng. **2010**, 22, 1345–1359.
- Argyriou, A.; Pontil, M.; Ying, Y.; Micchelli, C.A. A spectral regularization framework for multi-task structure learning. In Advances in Neural Information Processing Systems 19; MIT Press: Cambridge, MA, USA, 2007.
- Evgeniou, A.; Pontil, M. Multi-task feature learning. In Advances in Neural Information Processing Systems 19; MIT Press: Cambridge, MA, USA, 2007.
- Dai, W.; Yang, Q.; Xue, G.R.; Yu, Y. Boosting for transfer learning. In Proceedings of the 24th International Conference on Machine Learning, Corvallis, OR, USA, 20–24 June 2007; pp. 193–200.
- Wang, X.; Zhang, C.; Zhang, Z. Boosted multi-task learning for face verification with applications to web image and video search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 142–149.
- Chapelle, O.; Shivaswamy, P.; Vadrevu, S.; Weinberger, K.; Zhang, Y.; Tseng, B. Multi-task learning for boosting with application to web search ranking. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, 25–28 July 2010; pp. 1189–1198.
- Cichocki, A.; Amari, S. Families of alpha-, beta- and gamma-divergences: Flexible and robust measures of similarities. Entropy **2010**, 12, 1532–1568.
- Févotte, C.; Bertin, N.; Durrieu, J.L. Nonnegative matrix factorization with the Itakura–Saito divergence: With application to music analysis. Neural Comput. **2009**, 21, 793–830.
- Lefevre, A.; Bach, F.; Févotte, C. Itakura–Saito nonnegative matrix factorization with group sparsity. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Prague, Czech Republic, 22–27 May 2011; pp. 21–24.
- Takenouchi, T.; Komori, O.; Eguchi, S. A novel boosting algorithm for multi-task learning based on the Itakura–Saito divergence. In Proceedings of the Bayesian Inference and Maximum Entropy Methods in Science and Engineering (MaxEnt 2014), Amboise, France, 21–26 September 2014; pp. 230–237.
- Murata, N.; Takenouchi, T.; Kanamori, T.; Eguchi, S. Information geometry of U-boost and Bregman divergence. Neural Comput. **2004**, 16, 1437–1481.
- Banerjee, A.; Merugu, S.; Dhillon, I.S.; Ghosh, J. Clustering with Bregman divergences. J. Mach. Learn. Res. **2005**, 6, 1705–1749.
- Amari, S.; Nagaoka, H. Methods of Information Geometry; Translations of Mathematical Monographs, Volume 191; Oxford University Press: Providence, RI, USA, 2000.
- Mihoko, M.; Eguchi, S. Robust blind source separation by beta divergence. Neural Comput. **2002**, 14, 1859–1886.
- Cichocki, A.; Cruces, S.; Amari, S.I. Generalized alpha-beta divergences and their application to robust nonnegative matrix factorization. Entropy **2011**, 13, 134–170.
- Takenouchi, T.; Eguchi, S. Robustifying AdaBoost by adding the naive error rate. Neural Comput. **2004**, 16, 767–787.
- Takenouchi, T.; Eguchi, S.; Murata, T.; Kanamori, T. Robust boosting algorithm against mislabeling in multi-class problems. Neural Comput. **2008**, 20, 1596–1630.
- Lebanon, G.; Lafferty, J. Boosting and maximum likelihood for exponential models. In Advances in Neural Information Processing Systems 14; MIT Press: Cambridge, MA, USA, 2002.
- Evgeniou, T.; Pontil, M. Regularized multi-task learning. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Seattle, WA, USA, 22–25 August 2004; pp. 109–117.
- Xue, Y.; Liao, X.; Carin, L.; Krishnapuram, B. Multi-task learning for classification with Dirichlet process priors. J. Mach. Learn. Res. **2007**, 8, 35–63.
- Mason, L.; Baxter, J.; Bartlett, P.; Frean, M. Boosting algorithms as gradient descent in function space. In Advances in Neural Information Processing Systems 11; MIT Press: Cambridge, MA, USA, 1999.
- Geurts, P.; Ernst, D.; Wehenkel, L. Extremely randomized trees. Mach. Learn. **2006**, 63, 3–42.
- Goldstein, H. Multilevel modelling of survey data. J. R. Stat. Soc. Ser. D **1991**, 40, 235–244.

© 2015 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Takenouchi, T.; Komori, O.; Eguchi, S.
Binary Classification with a Pseudo Exponential Model and Its Application for Multi-Task Learning. *Entropy* **2015**, *17*, 5673-5694.
https://doi.org/10.3390/e17085673
