# Projection Pursuit Through ϕ-Divergence Minimisation

## Abstract


## 1. Outline of the Article

#### 1.1. Huber’s analytic approach

**Remark 1.1.** Huber stops his algorithm when the Kullback–Leibler divergence equals zero or when the algorithm reaches the ${d}^{th}$ iteration; he then obtains an approximation of f from g:

#### 1.2. Huber’s synthetic approach

**Remark 1.2.** First, in a similar manner to the analytic approach, this methodology enables us to approximate, and even to represent, f from g:

#### 1.3. Proposal

**Remark 1.3.** As in the previous algorithms, we first provide an approximation, and even a representation, of f from g. To obtain an approximation of f, we stop our algorithm when the divergence equals zero, i.e., ${D}_{\varphi}({g}^{(j)},f)=0$ implies ${g}^{(j)}=f$ with $j\le d$, or when the algorithm reaches the ${d}^{th}$ iteration, i.e., we approximate f with ${g}^{(d)}$.
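To fix ideas, the stopped recursion of Remark 1.3 can be sketched numerically. The following is a minimal Python sketch on a discrete 2-D grid, in which axis-aligned directions stand in for the optimised co-vectors ${a}_{k}$ and the Kullback–Leibler divergence serves as the stopping criterion; the function names and the discretisation are ours, not the paper's.

```python
import numpy as np

def marginal(p, axis):
    """Marginal density of coordinate `axis` for a 2-D grid density."""
    return p.sum(axis=1 - axis)

def kl(p, q):
    """Discrete Kullback-Leibler divergence K(p, q) = sum p log(p / q)."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def projection_pursuit(f, g, directions=(0, 1), tol=1e-12):
    """Multiplicative update g_k = g_{k-1} * f_a / (g_{k-1})_a along each
    direction a, stopping as soon as K(g_k, f) vanishes (cf. Remark 1.3)."""
    gk = g.copy()
    for a in directions:
        fa = marginal(f, a)        # projection of f on direction a
        ga = marginal(gk, a)       # projection of the current g on direction a
        ratio = np.where(ga > 0, fa / np.where(ga > 0, ga, 1.0), 1.0)
        gk = gk * (ratio[:, None] if a == 0 else ratio[None, :])
        if kl(gk, f) < tol:        # D(g^(j), f) = 0 implies g^(j) = f
            break
    return gk
```

On a product density f, one pass over the two coordinate directions recovers f exactly, mirroring the representation property of the remark.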

**Example 1.1.** Let f be a density on ${\mathbb{R}}^{3}$ defined by $f({x}_{1},{x}_{2},{x}_{3})=n({x}_{1},{x}_{2})h({x}_{3})$, where n is a bi-dimensional Gaussian density and h is a non-Gaussian density. Let us also consider g, a Gaussian density with the same mean and variance as f.

**Example 1.2.** Assume that the φ-divergence is greater than the ${L}^{2}$ norm. Let us consider ${({X}_{n})}_{n\ge 0}$, a Markov chain with continuous state space E. Let f be the density of $({X}_{0},{X}_{1})$ and let g be the normal density with the same mean and variance as f.

## 2. The Algorithm

#### 2.1. The model

#### Elliptical laws

**Definition 2.1.**

- with Σ a $d\times d$ positive-definite matrix and μ a d-dimensional column vector,
- with ${\xi}_{d}$ referred to as the “density generator”,
- with ${c}_{d}$ a normalisation constant, such that ${c}_{d}=\frac{\Gamma (d/2)}{{(2\pi )}^{d/2}}{\left({\int}_{0}^{\infty}{x}^{d/2-1}{\xi}_{d}(x)dx\right)}^{-1}$, where ${\int}_{0}^{\infty}{x}^{d/2-1}{\xi}_{d}(x)dx<\infty $.
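As a sanity check on ${c}_{d}$, the constant can be evaluated numerically. The sketch below (ours) assumes the convention, as in Landsman and Valdez [9], in which the generator is applied to half the quadratic form, so that the Gaussian case corresponds to ${\xi}_{d}(x)={e}^{-x}$ and ${c}_{d}$ must reduce to ${(2\pi )}^{-d/2}$.

```python
import math
from scipy.integrate import quad

def c_d(d, xi):
    """c_d = Gamma(d/2) / (2 pi)^(d/2) * (integral_0^inf x^(d/2 - 1) xi(x) dx)^(-1)."""
    integral, _ = quad(lambda x: x ** (d / 2 - 1) * xi(x), 0, math.inf)
    return math.gamma(d / 2) / ((2 * math.pi) ** (d / 2) * integral)

# Gaussian density generator under the half-quadratic-form convention:
gaussian_xi = lambda x: math.exp(-x)
for d in (1, 2, 3, 5):
    print(d, c_d(d, gaussian_xi))   # matches (2 pi)^(-d/2)
```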

**Property 2.1.**

**Remark 2.1.**

**Definition 2.2.**

**Example 2.1.**

#### Choice of g

**Proposition 2.1.**

**Definition 2.3.**

**Remark 2.2.**

#### 2.2. Stochastic outline of our algorithm

## 3. Results

#### 3.1. Convergence results

#### 3.1.1. Hypotheses on f

- (H1): For all $\epsilon >0$, there is $\eta >0$ such that, for all $c\in {\Theta}^{{D}_{\varphi}}$ verifying $\parallel c-{a}_{k}\parallel \ge \epsilon $, we have $\mathbf{P}M(c,a)-\eta >\mathbf{P}M({a}_{k},a)$, with $a\in \Theta $.
- (H2): There exist $Z<0$ and ${n}_{0}>0$ such that $n\ge {n}_{0}$ implies ${sup}_{a\in \Theta}{sup}_{c\in {\{{\Theta}^{{D}_{\varphi}}\}}^{c}}{\mathbb{P}}_{n}M(c,a)<Z$.
- (H3): There is a neighbourhood V of ${a}_{k}$ and a positive function H such that, for all $c\in V$, we have $|M(c,{a}_{k},x)|\le H(x)$ ($\mathbf{P}$-a.s.) with $\mathbf{P}H<\infty $.
- (H4): There is a neighbourhood V of ${a}_{k}$ such that, for all $\epsilon >0$, there is an $\eta >0$ such that, for all $c\in V$ and $a\in \Theta $ verifying $\parallel a-{a}_{k}\parallel \ge \epsilon $, we have $\mathbf{P}M(c,{a}_{k})<\mathbf{P}M(c,a)-\eta $.
- (H5): The function φ is ${\mathcal{C}}^{3}$ on $(0,+\infty )$ and there is a neighbourhood ${V}_{k}^{\prime}$ of $({a}_{k},{a}_{k})$ such that, for all $(b,a)\in {V}_{k}^{\prime}$, the gradient $\nabla (\frac{g(x){f}_{a}({a}^{\top}x)}{{g}_{a}({a}^{\top}x)})$ and the Hessian $\mathcal{H}(\frac{g(x){f}_{a}({a}^{\top}x)}{{g}_{a}({a}^{\top}x)})$ exist ($\lambda$-a.s.), and the first order partial derivatives of $\frac{g(x){f}_{a}({a}^{\top}x)}{{g}_{a}({a}^{\top}x)}$ as well as the first and second order derivatives of $(b,a)\mapsto \rho (b,a,x)$ are dominated ($\lambda$-a.s.) by λ-integrable functions.
- (H6): The function $(b,a)\mapsto M(b,a)$ is ${\mathcal{C}}^{3}$ in a neighbourhood ${V}_{k}$ of $({a}_{k},{a}_{k})$ for all x, and the partial derivatives of $(b,a)\mapsto M(b,a)$ are all dominated in ${V}_{k}$ by a $\mathbf{P}$-integrable function $H(x)$.
- (H7): $\mathbf{P}\parallel \frac{\partial}{\partial b}M({a}_{k},{a}_{k}){\parallel}^{2}$ and $\mathbf{P}\parallel \frac{\partial}{\partial a}M({a}_{k},{a}_{k}){\parallel}^{2}$ are finite, and the expressions $\mathbf{P}\frac{{\partial}^{2}}{\partial {b}_{i}\partial {b}_{j}}M({a}_{k},{a}_{k})$ and ${I}_{{a}_{k}}$ exist and are invertible.
- (H8): There exists k such that $\mathbf{P}M({a}_{k},{a}_{k})=0$.
- (H9): ${(Va{r}_{\mathbf{P}}(M({a}_{k},{a}_{k})))}^{1/2}$ exists and is invertible.
- (H0): f and g are assumed to be positive and bounded and such that $K(g,f)\ge \int |f(x)-g(x)|dx$.
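Hypothesis $(H0)$ is a genuine restriction rather than an automatic fact: the Kullback–Leibler divergence dominates the ${L}^{1}$ distance for well-separated pairs but not for nearby ones. A quick numerical illustration (ours, with arbitrary Gaussian examples, not the paper's):

```python
import numpy as np
from scipy.stats import norm

def kl_and_l1(mu, grid=200001):
    """K(g, f) = int g log(g/f) dx and int |f - g| dx for f = N(0,1), g = N(mu,1),
    approximated by a Riemann sum on a fine truncated grid."""
    x = np.linspace(-12.0, 12.0 + mu, grid)
    dx = x[1] - x[0]
    f, g = norm.pdf(x), norm.pdf(x, loc=mu)
    kl = float(np.sum(g * np.log(g / f)) * dx)
    l1 = float(np.sum(np.abs(f - g)) * dx)
    return kl, l1

kl_far, l1_far = kl_and_l1(3.0)    # K = mu^2 / 2 = 4.5 dominates the L1 distance
kl_near, l1_near = kl_and_l1(0.1)  # K = 0.005 is *smaller* than the L1 distance
```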

#### 3.1.2. Estimation of the first co-vector of f

**Proposition 3.1.**

**Remark 3.1.**

**Proposition 3.2.**

- ${\stackrel{\u02c7}{a}}_{k}$ is an estimate of ${a}_{k}$ as defined in Proposition 3.2 with ${\stackrel{\u02c7}{g}}_{n}^{(k-1)}$ instead of g,
- ${\stackrel{\u02c7}{g}}_{n}^{(k)}$ is such that ${\stackrel{\u02c7}{g}}_{n}^{(0)}=g$, ${\stackrel{\u02c7}{g}}_{n}^{(k)}(x)={\stackrel{\u02c7}{g}}_{n}^{(k-1)}(x)\frac{{f}_{{\stackrel{\u02c7}{a}}_{k},n}({\stackrel{\u02c7}{a}}_{k}^{\top}x)}{{[{\stackrel{\u02c7}{g}}^{(k-1)}]}_{{\stackrel{\u02c7}{a}}_{k},n}({\stackrel{\u02c7}{a}}_{k}^{\top}x)}$, i.e., ${\stackrel{\u02c7}{g}}_{n}^{(k)}(x)=g(x){\Pi}_{j=1}^{k}\frac{{f}_{{\stackrel{\u02c7}{a}}_{j},n}({\stackrel{\u02c7}{a}}_{j}^{\top}x)}{{[{\stackrel{\u02c7}{g}}^{(j-1)}]}_{{\stackrel{\u02c7}{a}}_{j},n}({\stackrel{\u02c7}{a}}_{j}^{\top}x)}$.

#### 3.1.3. Convergence study at the ${k}^{\text{th}}$ step of the algorithm

**Proposition 3.3.**

**Theorem 3.1.**

#### 3.2. Asymptotic Inference at the ${k}^{\text{th}}$ step of the algorithm

**Theorem 3.2.**

**Theorem 3.3.**

#### 3.3. A stopping rule for the procedure

#### 3.3.1. Estimation of f

**Theorem 3.4.**

**Corollary 3.1.**

#### 3.3.2. Testing of the criteria

**Theorem 3.5.**

**Corollary 3.2.**

**Corollary 3.3.**

#### 3.4. Goodness-of-fit test for copulas

**Theorem 3.6.**

#### 3.5. Rewriting of the convolution product

**Proposition 3.4.**

**Theorem 3.7.**

#### 3.6. On the regression

**Remark 3.2.**

#### 3.6.1. The basic idea

**Theorem 3.8.**

**Remark 3.3.**

#### 3.6.2. General case

**Theorem 3.9.**

**Corollary 3.4.**

## 4. Simulations

**Simulation 4.1.**

| Our Algorithm | |
| --- | --- |
| Projection Study 0 | minimum : 0.0201741 |
| | at point : (1.00912, 1.09453, 0.01893) |
| | P-Value : 0.81131 |
| Test | ${H}_{0}$ : ${a}_{1}\in {\mathcal{E}}_{1}$ : True |
| ${\chi}^{2}$(Kernel Estimation of ${g}^{(1)}$, ${g}^{(1)}$) | 6.1726 |

**Simulation 4.2.**

| Our Algorithm | |
| --- | --- |
| Projection Study 0 | minimum : 0.002692 |
| | at point : (1.01326, 0.0657, 0.0628, 0.1011, 0.0509, 0.1083, 0.1261, 0.0573, 0.0377, 0.0794, 0.0906, 0.0356, 0.0012, 0.0292, 0.0737, 0.0934, 0.0286, 0.1057, 0.0697, 0.0771) |
| | P-Value : 0.80554 |
| Test | ${H}_{0}$ : ${a}_{1}\in {\mathcal{E}}_{1}$ : True |
| H(Est. of ${g}^{(1)}$, ${g}^{(1)}$) | 3.042174 |

**Simulation 4.3.**

| Our Algorithm | |
| --- | --- |
| Projection Study 0 | minimum : 0.0210058 |
| | at point : (1.001, 0.0014) |
| | P-Value : 0.989552 |
| Test | ${H}_{0}$ : ${a}_{1}\in {\mathcal{E}}_{1}$ : True |
| ${D}_{\varphi}$(Kernel Estimation of ${g}^{(1)}$, ${g}^{(1)}$) | 6.47617 |

| Method | Quantity | Value |
| --- | --- | --- |
| Our Regression | $E({Y}_{1})$ | -4.545483 |
| | $Cov({Y}_{1},{Y}_{0})$ | 0.0380534 |
| | $Var({Y}_{0})$ | 0.9190052 |
| | $E({Y}_{0})$ | 0.3103752 |
| | correlation $({Y}_{1},{Y}_{0})$ | 0.02158213 |
| Least squares method | ${\alpha}_{1}$ | -4.34159227 |
| | Std Error of ${\alpha}_{1}$ | 0.19870 |
| | ${\alpha}_{2}$ | 0.06803317 |
| | Std Error of ${\alpha}_{2}$ | 0.21154 |
| | correlation $({X}_{1},{X}_{0})$ | 0.04888484 |
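For reference, the slope and intercept implied by the reported moments follow from the best-linear-predictor formulas; this is a plain arithmetic check on the table, not the paper's estimator:

```python
# Moments reported for our regression (from the table above):
E_Y1, E_Y0 = -4.545483, 0.3103752
cov_Y1_Y0, var_Y0 = 0.0380534, 0.9190052

# Best linear predictor of Y1 given Y0: Y1 ~ intercept + slope * Y0.
slope = cov_Y1_Y0 / var_Y0          # ~ 0.0414
intercept = E_Y1 - slope * E_Y0     # ~ -4.5583
```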

**Figure 3.** Graph of the regression of ${X}_{1}$ on ${X}_{0}$ based on the least squares method (red) and based on our theory (green).

**Simulation 4.4.**

| Our Algorithm | |
| --- | --- |
| Projection Study 0 | minimum : 0.010920 |
| | at point : (1.09, -0.9701) |
| | P-Value : 0.889400 |
| Test | ${H}_{0}$ : ${a}_{1}\in {\mathcal{E}}_{1}$ : True |
| ${D}_{\varphi}$(Kernel Estimation of ${g}^{(1)}$, ${g}^{(1)}$) | 5.25077 |

| Simulation | a | Std Error of a | b | Std Error of b |
| --- | --- | --- | --- | --- |
| 1 | -4.83739 | 0.11149 | -0.95861 | 0.04677 |
| 2 | -4.56895 | 0.09989 | -0.88577 | 0.04225 |
| 3 | -4.4926 | 0.1057 | -1.2085 | 0.0452 |
| 4 | -4.70619 | 0.10350 | -1.04549 | 0.04235 |
| 5 | -4.40331 | 0.10248 | -1.00890 | 0.0438 |
| 6 | -4.61757 | 0.09813 | -1.20890 | 0.04649 |
| 7 | -4.40572 | 0.09172 | -1.16085 | 0.04091 |
| 8 | -4.39581 | 0.10174 | -1.38696 | 0.04487 |
| 9 | -4.42780 | 0.10018 | -0.93672 | 0.04066 |
| 10 | -4.55394 | 0.09923 | -0.98065 | 0.04382 |

**Simulation 4.5.**

- c is the Gaussian copula with correlation coefficient $\rho =0.5$,
- the Gumbel distribution parameters are $-1$ and 1 and
- the Exponential density parameter is 2.
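The data-generating mechanism above can be simulated as follows. This is our illustrative sketch, not the paper's code, assuming the Gumbel parameters are location $-1$ and scale 1 and the Exponential parameter is the rate $\lambda =2$:

```python
import numpy as np
from scipy.stats import norm, gumbel_r, expon

rng = np.random.default_rng(0)
rho, n = 0.5, 10_000

# Step 1: correlated standard normals define the Gaussian copula c_rho.
cov = np.array([[1.0, rho], [rho, 1.0]])
z = rng.multivariate_normal(np.zeros(2), cov, size=n)
u = norm.cdf(z)                                 # (U0, U1) has copula c_rho

# Step 2: push the uniforms through the inverse marginal cdfs.
x0 = gumbel_r.ppf(u[:, 0], loc=-1, scale=1)     # Gumbel(-1, 1) margin
x1 = expon.ppf(u[:, 1], scale=1 / 2)            # Exponential(rate 2) margin
```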

| Our Algorithm | |
| --- | --- |
| Projection Study number 0 | minimum : 0.445199 |
| | at point : (1.0142, 0.0026) |
| | P-Value : 0.94579 |
| Test | ${H}_{1}$ : ${a}_{1}\notin {\mathcal{E}}_{1}$ : True |
| Projection Study number 1 | minimum : 0.0263 |
| | at point : (0.0084, 0.9006) |
| | P-Value : 0.97101 |
| Test | ${H}_{0}$ : ${a}_{2}\in {\mathcal{E}}_{2}$ : True |
| K(Kernel Estimation of ${g}^{(2)}$, ${g}^{(2)}$) | 4.0680 |

**Figure 5.** Graph of the estimate of $({x}_{0},{x}_{1})\mapsto {c}_{\rho}({F}_{Gumbel}({x}_{0}),{F}_{Exponential}({x}_{1}))$.

## Application to real datasets

| Our Algorithm | |
| --- | --- |
| Projection Study 0 | minimum : 0.017345 |
| | at point : (0.027, 3.18) |
| | P-Value : 0.890210 |
| Test | ${H}_{0}$ : ${a}_{1}\in {\mathcal{E}}_{1}$ : True |
| K(Kernel Estimation of ${g}^{(1)}$, ${g}^{(1)}$) | 2.7704005 |

| Simulation | ${\alpha}_{1}$ | Std Error of ${\alpha}_{1}$ | ${\alpha}_{2}$ | Std Error of ${\alpha}_{2}$ |
| --- | --- | --- | --- | --- |
| 1 | 3.153694 | 0.230380 | 0.026578 | 0.004236 |

**Figure 6.** Graph of the regression of the log of Nokia on Sanofi based on the least squares method (red) and based on our theory (green).

| Date | Nokia | Log-of-Nokia | Sanofi | Date | Nokia | Log-of-Nokia | Sanofi |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 10/05/10 | 84.75 | 4.44 | 51.62 | 07/05/10 | 81.85 | 4.4 | 48.5 |
| 06/05/10 | 87.3 | 4.47 | 50.35 | 05/05/10 | 87.75 | 4.47 | 50.95 |
| 04/05/10 | 87.25 | 4.47 | 50.49 | 03/05/10 | 87.85 | 4.48 | 51.51 |
| 30/04/10 | 87.8 | 4.48 | 51.66 | 29/04/10 | 87.85 | 4.48 | 51.41 |
| 28/04/10 | 87.85 | 4.48 | 51.88 | 27/04/10 | 89 | 4.49 | 52.11 |
| 26/04/10 | 89.2 | 4.49 | 54.09 | 23/04/10 | 90.7 | 4.51 | 53.47 |
| 22/04/10 | 92.75 | 4.53 | 53.59 | 21/04/10 | 108.4 | 4.69 | 53.95 |
| 20/04/10 | 108.9 | 4.69 | 54.43 | 19/04/10 | 108.3 | 4.68 | 54.05 |
| 16/04/10 | 106.8 | 4.67 | 54.04 | 15/04/10 | 109.9 | 4.7 | 54.95 |
| 14/04/10 | 109.8 | 4.7 | 54.86 | 13/04/10 | 108.3 | 4.68 | 54.67 |
| 12/04/10 | 109.1 | 4.69 | 55.27 | 09/04/10 | 110.1 | 4.7 | 55.41 |
| 08/04/10 | 110.7 | 4.71 | 54.96 | 07/04/10 | 113.2 | 4.73 | 55.3 |
| 06/04/10 | 112.4 | 4.72 | 54.64 | 01/04/10 | 113.3 | 4.73 | 55.16 |
| 31/03/10 | 112.4 | 4.72 | 55.19 | 30/03/10 | 112.5 | 4.72 | 55.39 |
| 29/03/10 | 111.8 | 4.72 | 55.49 | 26/03/10 | 112.5 | 4.72 | 55.72 |
| 25/03/10 | 111.4 | 4.71 | 56.33 | 24/03/10 | 110.2 | 4.7 | 55.95 |
| 23/03/10 | 109.1 | 4.69 | 56.12 | 22/03/10 | 109.2 | 4.69 | 56.33 |
| 19/03/10 | 108.5 | 4.69 | 56.57 | 18/03/10 | 108.4 | 4.69 | 56.56 |
| 17/03/10 | 109.9 | 4.7 | 56.28 | 16/03/10 | 107 | 4.67 | 57.21 |
| 15/03/10 | 105.3 | 4.66 | 55.95 | 12/03/10 | 105 | 4.65 | 55.4 |
| 11/03/10 | 103 | 4.63 | 55.65 | 10/03/10 | 104 | 4.64 | 56.13 |
| 09/03/10 | 101.5 | 4.62 | 56.17 | 08/03/10 | 100.7 | 4.61 | 55.75 |
| 05/03/10 | 100.2 | 4.61 | 55.76 | 04/03/10 | 98.7 | 4.59 | 54.81 |
| 03/03/10 | 99.8 | 4.6 | 55.14 | 02/03/10 | 97.25 | 4.58 | 54.99 |
| 01/03/10 | 95.85 | 4.56 | 54.82 | 26/02/10 | 95.85 | 4.56 | 53.72 |
| 25/02/10 | 94.55 | 4.55 | 52.92 | 24/02/10 | 96.3 | 4.57 | 53.92 |
| 23/02/10 | 96.2 | 4.57 | 54.05 | 22/02/10 | 96.7 | 4.57 | 54.14 |
| 19/02/10 | 97.3 | 4.58 | 54.71 | 18/02/10 | 96.6 | 4.57 | 54.43 |
| 17/02/10 | 96.1 | 4.57 | 53.88 | 16/02/10 | 94.95 | 4.55 | 53.56 |
| 15/02/10 | 93.65 | 4.54 | 53.2 | 12/02/10 | 93.55 | 4.54 | 53.01 |
| 11/02/10 | 94.6 | 4.55 | 52.52 | 10/02/10 | 95.55 | 4.56 | 52.2 |
| 09/02/10 | 98.4 | 4.59 | 52.66 | 08/02/10 | 99.2 | 4.6 | 52.98 |
| 05/02/10 | 99.8 | 4.6 | 51.68 | 04/02/10 | 102.6 | 4.63 | 53.42 |
| 03/02/10 | 103.9 | 4.64 | 54.06 | 02/02/10 | 103.8 | 4.64 | 53.8 |
| 01/02/10 | 102.4 | 4.63 | 53.23 | 29/01/10 | 103.6 | 4.64 | 53.6 |
| 28/01/10 | 101.8 | 4.62 | 52.68 | 27/01/10 | 92.55 | 4.53 | 53.8 |
| 26/01/10 | 92.7 | 4.53 | 54.42 | 25/01/10 | 91.9 | 4.52 | 53.66 |
| 22/01/10 | 94.1 | 4.54 | 54.65 | 21/01/10 | 93.7 | 4.54 | 55.28 |
| 20/01/10 | 92.75 | 4.53 | 56.67 | 19/01/10 | 93.6 | 4.54 | 57.69 |
| 18/01/10 | 94.55 | 4.55 | 56.67 | 15/01/10 | 93.55 | 4.54 | 56.85 |
| 14/01/10 | 93.7 | 4.54 | 56.91 | 13/01/10 | 92.5 | 4.53 | 56.18 |
| 12/01/10 | 92.35 | 4.53 | 55.83 | 11/01/10 | 93 | 4.53 | 56.08 |

## 5. Critique of the Simulations

## 6. Conclusions

## References

- Friedman, J.H.; Stuetzle, W.; Schroeder, A. Projection pursuit density estimation. J. Amer. Statist. Assoc. **1984**, 79, 599–608.
- Huber, P.J. Projection pursuit. Ann. Statist. **1985**, 13, 435–525 (with discussion).
- Zhu, M. On the forward and backward algorithms of projection pursuit. Ann. Statist. **2004**, 32, 233–244.
- Yohai, V.J. Optimal robust estimates using the Kullback–Leibler divergence. Stat. Probab. Lett. **2008**, 78, 1811–1816.
- Toma, A. Optimal robust M-estimators using divergences. Stat. Probab. Lett. **2009**, 79, 1–5.
- Huber, P.J. Robust Statistics; Wiley: Hoboken, NJ, USA, 1981; republished in paperback, 2004.
- Diaconis, P.; Freedman, D. Asymptotics of graphical projection pursuit. Ann. Statist. **1984**, 12, 793–815.
- Cambanis, S.; Huang, S.; Simons, G. On the theory of elliptically contoured distributions. J. Multivariate Anal. **1981**, 11, 368–385.
- Landsman, Z.M.; Valdez, E.A. Tail conditional expectations for elliptical distributions. N. Am. Actuar. J. **2003**, 7, 55–71.
- Van der Vaart, A.W. Asymptotic Statistics; Cambridge Series in Statistical and Probabilistic Mathematics; Cambridge University Press: Cambridge, UK, 1998; Volume 3.
- Broniatowski, M.; Keziou, A. Parametric estimation and tests through divergences and the duality technique. J. Multivariate Anal. **2009**, 100, 16–36.
- Vajda, I. ${\chi}^{\alpha}$-divergence and generalized Fisher’s information. In Transactions of the Sixth Prague Conference on Information Theory, Statistical Decision Functions, Random Processes (dedicated to the memory of Antonín Spacek); Academia: Prague, Czechoslovakia, 1971; pp. 873–886.
- Black, F.; Scholes, M.S. The pricing of options and corporate liabilities. J. Polit. Econ. **1973**, 81, 637–654.
- Saporta, G. Probabilités, Analyse des données et Statistique; Technip: Paris, France, 2006.
- Scott, D.W. Multivariate Density Estimation: Theory, Practice, and Visualization; John Wiley and Sons: New York, NY, USA, 1992.
- Cressie, N.; Read, T.R.C. Multinomial goodness-of-fit tests. J. Roy. Statist. Soc. Ser. B **1984**, 46, 440–464.
- Csiszár, I. On topology properties of f-divergences. Studia Sci. Math. Hungar. **1967**, 2, 329–339.
- Liese, F.; Vajda, I. Convex Statistical Distances; Teubner-Texte zur Mathematik; B.G. Teubner Verlagsgesellschaft: Leipzig, Germany, 1987; Volume 95.
- Pardo, L. Statistical Inference Based on Divergence Measures; Statistics: Textbooks and Monographs; Chapman & Hall/CRC: Boca Raton, FL, USA, 2006; Volume 185.
- Zografos, K.; Ferentinos, K.; Papaioannou, T. ϕ-divergence statistics: sampling properties and multinomial goodness of fit and divergence tests. Comm. Statist. Theory Methods **1990**, 19, 1785–1802.
- Azé, D. Eléments d’analyse convexe et variationnelle; Ellipses: Paris, France, 1997.
- Touboul, J. Projection pursuit through φ-divergence minimisation. arXiv:0912.2883, 2009.
- Bosq, D.; Lecoutre, J.-P. Théorie de l’estimation fonctionnelle; Economica: Paris, France, 1999.

## Appendix

## A. Reminders

#### A.1. φ-Divergence

**Definition A.1.** We define the $\varphi$-divergence of P from Q, where P and Q are two probability distributions over a space Ω such that Q is absolutely continuous with respect to P, by

- with the Kullback–Leibler divergence, we associate $\phi (x)=xln(x)-x+1$;
- with the Hellinger distance, we associate $\phi (x)=2{(\sqrt{x}-1)}^{2}$;
- with the ${\chi}^{2}$ distance, we associate $\phi (x)=\frac{1}{2}{(x-1)}^{2}$;
- more generally, with power divergences, we associate $\phi (x)=\frac{{x}^{\gamma}-\gamma x+\gamma -1}{\gamma (\gamma -1)}$, where $\gamma \in \mathbb{R}\setminus \{0,1\}$;
- and, finally, with the ${L}^{1}$ norm, which is also a divergence, we associate $\phi (x)=|x-1|.$
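These choices of φ can be checked on a small discrete example. The sketch below is ours and uses the convention ${D}_{\varphi}(Q,P)={\sum}_{i}{p}_{i}\phi ({q}_{i}/{p}_{i})$:

```python
import numpy as np

def phi_divergence(q, p, phi):
    """D_phi(Q, P) = sum_i p_i * phi(q_i / p_i) for discrete distributions."""
    q, p = np.asarray(q, float), np.asarray(p, float)
    return float(np.sum(p * phi(q / p)))

phis = {
    "kullback-leibler": lambda x: x * np.log(x) - x + 1,
    "hellinger":        lambda x: 2 * (np.sqrt(x) - 1) ** 2,
    "chi2":             lambda x: 0.5 * (x - 1) ** 2,
    "L1":               lambda x: np.abs(x - 1),
}

q = np.array([0.2, 0.3, 0.5])
p = np.array([0.25, 0.25, 0.5])

# The phi associated with Kullback-Leibler reproduces sum q log(q / p):
kl = phi_divergence(q, p, phis["kullback-leibler"])
```

Note that each φ is convex with $\phi (1)=0$, so every ${D}_{\varphi}$ vanishes exactly when $Q=P$, in accordance with Property A.1.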

**Property A.1.** We have ${D}_{\varphi}(P,Q)=0\iff P=Q.$

**Property A.2.** The divergence function $Q\mapsto {D}_{\varphi}(Q,P)$ is convex and lower semi-continuous (l.s.c.) for the topology that makes all the applications of the form $Q\mapsto \int fdQ$ continuous, where f is bounded and continuous, as well as l.s.c. for the topology of the uniform convergence.

**Property A.3.** (Corollary (1.29), page 19 of Liese [18]). If $T:(X,A)\to (Y,B)$ is measurable and if ${D}_{\varphi}(P,Q)<\infty ,$ then ${D}_{\varphi}(P,Q)\ge {D}_{\varphi}(P{T}^{-1},Q{T}^{-1}),$ with equality when T is sufficient for $(P,Q)$.

**Theorem A.1.** (Theorem III.4 of Azé [21]). Let $f:I\to \mathbb{R}$ be a convex function. Then f is Lipschitz on every compact interval $[a,b]\subset int\{I\}.$ In particular, f is continuous on $int\{I\}$.

#### A.2. Miscellaneous

**Lemma A.1.** The set ${\Gamma}_{c}$ is closed in ${L}^{1}$ for the topology of the uniform convergence.

**Lemma A.2.** For all $c>0$, we have ${\Gamma}_{c}\subset {\overline{B}}_{{L}^{1}}(f,c),$ where ${B}_{{L}^{1}}(f,c)=\{p\in {L}^{1}{;\parallel f-p\parallel}_{1}\le c\}$.

**Lemma A.3.** G is closed in ${L}^{1}$ for the topology of the uniform convergence.

**Lemma A.4.** Let us consider the sequence $({a}_{i})$ defined in (2.3) page 1587.

**Proposition A.1.** Assume $(H1)$ to $(H3)$ hold. Then ${sup}_{a\in \Theta}\parallel {\stackrel{\u02c7}{c}}_{n}(a)-{a}_{k}\parallel $ tends to 0 a.s. and ${\stackrel{\u02c7}{\gamma}}_{n}$ tends to ${a}_{k}$ a.s.

**Theorem A.2.** Assume $(H0)$ to $(H3)$ hold. Then, for any $k=1,...,d$ and any $x\in {\mathbb{R}}^{d}$, we have $|{\stackrel{\u02c7}{g}}^{(k)}(x)-{g}^{(k)}(x)|={O}_{\mathbf{P}}({n}^{-1/2})$ and $\int |{\stackrel{\u02c7}{g}}^{(k)}(x)-{g}^{(k)}(x)|dx={O}_{\mathbf{P}}({n}^{-1/2})$, as well as $|K({\stackrel{\u02c7}{g}}^{(k)},f)-K({g}^{(k)},f)|={O}_{\mathbf{P}}({n}^{-1/2}).$

**Theorem A.3.** Assume that $(H1)$ to $(H3)$, $(H6)$ and $(H8)$ hold. Then $\sqrt{n}{(Va{r}_{\mathbf{P}}(M({\stackrel{\u02c7}{c}}_{n}({\stackrel{\u02c7}{\gamma}}_{n}),{\stackrel{\u02c7}{\gamma}}_{n})))}^{-1/2}({\mathbb{P}}_{n}M({\stackrel{\u02c7}{c}}_{n}({\stackrel{\u02c7}{\gamma}}_{n}),{\stackrel{\u02c7}{\gamma}}_{n})-{\mathbb{P}}_{n}M({a}_{k},{a}_{k}))\stackrel{\mathcal{L}\mathit{aw}}{\to}\mathcal{N}(0,I)$, where k represents the ${k}^{th}$ step of the algorithm and I is the identity matrix in ${\mathbb{R}}^{d}$.

## B. Study of the sample

**Proposition B.1.** Using the notations introduced in Broniatowski [11] and in Section 3.1, it holds that ${lim}_{n\to \infty}{sup}_{a\in {\mathbb{R}}_{*}^{d}}|({B}_{1}(n,a)-{B}_{2}(n,a))-{D}_{\varphi}(g\frac{{f}_{a}}{{g}_{a}},f)|=0.$

**Remark B.1.** With the Kullback–Leibler divergence, we can take for ${\theta}_{m}$ the expression ${m}^{-\nu}$, with $0<\nu <\frac{1}{4+d}$.

## C. Hypotheses’ discussion

#### C.1. Discussion of $(H2)$.

#### C.2. Discussion of $(H4)$.

- (0) We work with the Kullback–Leibler divergence,
- (1) We have $f(./{a}_{1}^{\top}x)=g(./{a}_{1}^{\top}x)$, i.e., $K(g\frac{{f}_{1}}{{g}_{1}},f)=0$ (we could also derive the same proof with f, ${g}^{(k-1)}$ and ${a}_{k}$).

## D. Proofs

**Proof of Lemma D.1.**

**Lemma D.1.** We have $g(./{a}_{1}^{\top}x,...,{a}_{j}^{\top}x)=n({a}_{j+1}^{\top}x,...,{a}_{d}^{\top}x)=f(./{a}_{1}^{\top}x,...,{a}_{j}^{\top}x)$.

**Proof of Lemma D.2.**

**Lemma D.2.** Should there exist a family ${({a}_{i})}_{i=1...d}$ such that $f(x)=n({a}_{j+1}^{\top}x,...,{a}_{d}^{\top}x)h({a}_{1}^{\top}x,...,{a}_{j}^{\top}x),$ with $j<d$ and with f, n and h being densities, then this family is an orthogonal basis of ${\mathbb{R}}^{d}$.

**Lemma D.3.** ${inf}_{a\in {\mathbb{R}}_{*}^{d}}{D}_{\varphi}({g}^{*},f)$ is reached when the ϕ-divergence is greater than the ${L}^{1}$ distance as well as the ${L}^{2}$ distance.

**Proof of Lemma D.4.**

**Lemma D.4.** For any $p\le d$, we have ${f}_{{a}_{p}}^{(p-1)}={f}_{{a}_{p}}$ (see Huber’s analytic method), ${g}_{{a}_{p}}^{(p-1)}={g}_{{a}_{p}}$ (see Huber’s synthetic method) and ${g}_{{a}_{p}}^{(p-1)}={g}_{{a}_{p}}$ (see our algorithm).

**Proof of Lemma D.5.**

**Lemma D.5.** If there exists p, $p\le d$, such that ${D}_{\varphi}({g}^{(p)},f)=0$, then the family ${({a}_{i})}_{i=1,...,p}$ derived from the construction of ${g}^{(p)}$ is free and orthogonal.

**Proof of Lemma D.6.**

**Lemma D.6.** If there exists p, $p\le d$, such that ${D}_{\varphi}({g}^{(p)},f)=0$, where ${g}^{(p)}$ is built from the free and orthogonal family ${a}_{1}$,...,${a}_{j}$, then there exists a free and orthogonal family ${({b}_{k})}_{k=j+1,...,d}$ of vectors of ${\mathbb{R}}_{*}^{d}$ such that ${g}^{(p)}(x)=g({b}_{j+1}^{\top}x,...,{b}_{d}^{\top}x/{a}_{1}^{\top}x,...,{a}_{j}^{\top}x){f}_{{a}_{1}}({a}_{1}^{\top}x)...{f}_{{a}_{j}}({a}_{j}^{\top}x)$ and such that ${\mathbb{R}}^{d}=Vect\{{a}_{i}\}\stackrel{\perp}{\oplus}Vect\{{b}_{k}\}$.

**Proof of Lemma D.7.**

**Lemma D.7.** For any continuous density f, we have ${y}_{m}=|{f}_{m}(x)-f(x)|={O}_{\mathbf{P}}({m}^{-\frac{2}{4+d}})$.

**Proof of Proposition 3.1.**

- $T:g(./{x}_{1})\frac{h({x}_{1}){f}_{1}({x}_{1})}{{g}_{1}({x}_{1})}\mapsto g(./{x}_{1}){f}_{1}({x}_{1})$
- $T:f(./{x}_{1}){f}_{1}({x}_{1})\mapsto f(./{x}_{1}){f}_{1}({x}_{1})$

**Proof of Proposition 3.3.** Proposition 3.3 follows immediately from Proposition B.1 page 1606 and Lemma A.1 page 1605.

**Proof of Theorem 3.1.** First, by the very definition of the kernel estimator, ${\stackrel{\u02c7}{g}}_{n}^{(0)}={g}_{n}$ converges towards g. Moreover, the continuity of $a\mapsto {f}_{a,n}$ and $a\mapsto {g}_{a,n}$ and Proposition 3.3 imply that ${\stackrel{\u02c7}{g}}_{n}^{(1)}={\stackrel{\u02c7}{g}}_{n}^{(0)}\frac{{f}_{a,n}}{{\stackrel{\u02c7}{g}}_{a,n}^{(0)}}$ converges towards ${g}^{(1)}$. Finally, since, for any k, ${\stackrel{\u02c7}{g}}_{n}^{(k)}={\stackrel{\u02c7}{g}}_{n}^{(k-1)}\frac{{f}_{{\stackrel{\u02c7}{a}}_{k},n}}{{\stackrel{\u02c7}{g}}_{{\stackrel{\u02c7}{a}}_{k},n}^{(k-1)}}$, we conclude by an immediate induction.

**Proof of Theorem 3.2.** First, from Lemma D.7, we derive that, for any x, ${sup}_{a\in {\mathbb{R}}_{*}^{d}}|{f}_{a,n}({a}^{\top}x)-{f}_{a}({a}^{\top}x)|={O}_{\mathbf{P}}({n}^{-\frac{2}{4+d}})$. Then, letting ${\Psi}_{j}=\frac{{f}_{\stackrel{\u02c7}{{a}_{j}},n}({\stackrel{\u02c7}{{a}_{j}}}^{\top}x)}{{\stackrel{\u02c7}{g}}_{\stackrel{\u02c7}{{a}_{j}},n}^{(j-1)}({\stackrel{\u02c7}{{a}_{j}}}^{\top}x)}-\frac{{f}_{{a}_{j}}({a}_{j}^{\top}x)}{{g}_{{a}_{j}}^{(j-1)}({a}_{j}^{\top}x)}$, we have ${\Psi}_{j}=\frac{1}{{\stackrel{\u02c7}{g}}_{\stackrel{\u02c7}{{a}_{j}},n}^{(j-1)}({\stackrel{\u02c7}{{a}_{j}}}^{\top}x){g}_{{a}_{j}}^{(j-1)}({a}_{j}^{\top}x)}(({f}_{\stackrel{\u02c7}{{a}_{j}},n}({\stackrel{\u02c7}{{a}_{j}}}^{\top}x)-{f}_{{a}_{j}}({a}_{j}^{\top}x)){g}_{{a}_{j}}^{(j-1)}({a}_{j}^{\top}x)+{f}_{{a}_{j}}({a}_{j}^{\top}x)({g}_{{a}_{j}}^{(j-1)}({a}_{j}^{\top}x)-{\stackrel{\u02c7}{g}}_{\stackrel{\u02c7}{{a}_{j}},n}^{(j-1)}({\stackrel{\u02c7}{{a}_{j}}}^{\top}x)))$, i.e., $|{\Psi}_{j}|={O}_{\mathbf{P}}({n}^{-\frac{1}{2}{\mathbf{1}}_{d=1}-\frac{2}{4+d}{\mathbf{1}}_{d>1}})$ since ${f}_{{a}_{j}}({a}_{j}^{\top}x)=O(1)$ and ${g}_{{a}_{j}}^{(j-1)}({a}_{j}^{\top}x)=O(1)$. We can therefore conclude as in the proof of Theorem A.2.

**Proof of Theorem D.1.**

**Theorem D.1.**In the case where f is known and under the hypotheses assumed in Section 3.1, it holds $\sqrt{n}\mathcal{A}.({\stackrel{\u02c7}{c}}_{n}({a}_{k})-{a}_{k})\stackrel{\mathcal{L}\mathit{aw}}{\to}\mathcal{B}.{\mathcal{N}}_{d}(0,\mathbf{P}\parallel \frac{\partial}{\partial b}M({a}_{k},{a}_{k}){\parallel}^{2})+\mathcal{C}.{\mathcal{N}}_{d}(0,\mathbf{P}\parallel \frac{\partial}{\partial a}M({a}_{k},{a}_{k}){\parallel}^{2})$ and $\sqrt{n}\mathcal{A}.({\stackrel{\u02c7}{\gamma}}_{n}-{a}_{k})\stackrel{\mathcal{L}\mathit{aw}}{\to}\mathcal{C}.{\mathcal{N}}_{d}(0,\mathbf{P}\parallel \frac{\partial}{\partial b}M({a}_{k},{a}_{k}){\parallel}^{2})+\mathcal{C}.{\mathcal{N}}_{d}(0,\mathbf{P}\parallel \frac{\partial}{\partial a}M({a}_{k},{a}_{k}){\parallel}^{2})$ where $\mathcal{A}=\mathbf{P}\frac{{\partial}^{2}}{\partial b\partial b}M({a}_{k},{a}_{k})(\mathbf{P}\frac{{\partial}^{2}}{\partial {a}_{i}\partial {a}_{j}}M({a}_{k},{a}_{k})+\mathbf{P}\frac{{\partial}^{2}}{\partial {a}_{i}\partial {b}_{j}}M({a}_{k},{a}_{k}))$, $\mathcal{C}=\mathbf{P}\frac{{\partial}^{2}}{\partial b\partial b}M({a}_{k},{a}_{k})$ and $\mathcal{B}=\mathbf{P}\frac{{\partial}^{2}}{\partial b\partial b}M({a}_{k},{a}_{k})+\mathbf{P}\frac{{\partial}^{2}}{\partial {a}_{i}\partial {a}_{j}}M({a}_{k},{a}_{k})+\mathbf{P}\frac{{\partial}^{2}}{\partial {a}_{i}\partial {b}_{j}}M({a}_{k},{a}_{k}).$

**Proof of Theorem 3.3.** We derive this theorem from Proposition B.1 and Theorem D.1.

**Proof of Theorem 3.4.** We recall that ${g}_{n}^{(k)}$ is the kernel estimator of ${\stackrel{\u02c7}{g}}^{(k)}$. Since the Kullback–Leibler divergence is greater than the ${L}^{1}$-distance, we then have ${lim}_{n}{lim}_{k}K({g}_{n}^{(k)},{f}_{n})\ge {lim}_{n}{lim}_{k}\int |{g}_{n}^{(k)}(x)-{f}_{n}(x)|dx$.

**Proof of Corollary 3.1.** Through the dominated convergence theorem and Theorem 3.4, we get the result by reductio ad absurdum.

**Proof of Theorem 3.5.** Through Proposition B.1 and Theorem A.3, we derive Theorem 3.5.

© 2010 by the author; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license http://creativecommons.org/licenses/by/3.0/.

## Share and Cite

**MDPI and ACS Style**

Touboul, J. Projection Pursuit Through *ϕ*-Divergence Minimisation. *Entropy* **2010**, *12*, 1581-1611.
https://doi.org/10.3390/e12061581
