# The Fisher Information as a Neural Guiding Principle for Independent Component Analysis


## Abstract


## 1. Introduction

#### 1.1. Combining Objective Functions

$y_i$ of neuron $i$, by the intrinsic parameters $a_i^k = (\hat{a})_i^k$ (with $k = 1, 2, \dots$ indexing the different internal degrees of freedom) of the neurons and by the inter-neural synaptic connectivity matrix $w_{ij} = (\hat{w})_{ij}$. Within the objective functional approach, one considers evolution equations:

#### 1.2. Hebbian Learning in Neural Networks

#### 1.3. Instantaneous Single Neuron

$N_w$ represents the number of incoming inputs $y_j$, which represent in this case either an external input or the activities of other neurons in a network. $b$ is a bias in the neuron's sensitivity, and $\overline{y}_j$ represents the (trailing) average of the input activity, such that only deviations from this average contribute to the integrated input. An objective function for the neural activity is, in this case, not present, and the evolution Equations (1) reduce to:

$\epsilon_b$ and $\epsilon_w$, separated the adaption rates from the definition of the respective objective functions.

#### 1.4. Information Theoretical Incentives for Synaptic Plasticity

## 2. Objective Functions for Synaptic Plasticity

$y_j$ and the output $y$ are correlated.

#### 2.1. Cubic Approximation

The roots $\pm x_0$ of the limiting function $G(x)$ (compare Figure 2a) are symmetric for the case $b = 0$, considered in the following, and scale $\sim N$ for large $N$ [32]. The Hebbian function $H(x)$ has, on the other hand, only a single root at $x = 0$ (viz. at $y = 0.5$), for $b = 0$. We are then led to the cubic approximation:

$1/N^{2} > 0$ could also be absorbed into the adaption rate $\epsilon_w$. In Figure 2b, the learning rule from Equation (6) is compared to the cubic approximation (Equation (8)).
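The symmetric roots $\pm x_0$ and their scaling with $N$ can be checked numerically. The following is a minimal sketch, assuming that the limiting function has the form $G(x) = N + x(1 - 2y(x))$ with the Fermi transfer function $y(x) = 1/(1 + e^{b - x})$; this form is an assumption taken from the earlier work [32] and is not restated in the text above, and the function names are illustrative.

```python
import math

def fermi(x, b=0.0):
    """Sigmoidal (Fermi) transfer function y(x) = 1 / (1 + exp(b - x))."""
    return 1.0 / (1.0 + math.exp(b - x))

def G(x, N, b=0.0):
    """ASSUMED limiting function G(x) = N + x (1 - 2 y(x)); not given in the text."""
    return N + x * (1.0 - 2.0 * fermi(x, b))

def positive_root(N, tol=1e-12):
    """Bisection for the positive root x0 of G at b = 0."""
    lo, hi = 0.0, 2.0 * N + 10.0  # G(lo) = N > 0 and G(hi) < 0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if G(mid, N) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

x0 = positive_root(2)
print(x0)                 # close to the value 2.4 quoted for N = 2 in Section 2.2
print(positive_root(50))  # approaches N for large N
```

Under this assumption $G$ is an even function of $x$ for $b = 0$, so $-x_0$ is a root as well.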

$p(y_j)$ of the input activities $y_j$. We now assume uncorrelated and symmetric input distributions,

**w** · **γ** will be fully rotationally invariant. Therefore, the result does not depend on the direction one chooses for the PCs. In particular, if one chooses the principal components to lie along the axes of reference, one can eliminate the linear correlation terms, without loss of generality.

$\epsilon_w \to 0$, and we obtain:

$\sigma_j$ of the $j$-th input, the excess kurtosis $K_j$ and the weighted average $\Phi$ of the afferent standard deviations.

#### 2.1.1. Scaling of Dominant Components

$\sigma_1$ (the first principal component, or FPC) and the rest of the directions having a small SD. In this context, the weight vector aligns with the FPC, resulting in one large weight ($w_1$). All other synaptic weights adapt to small values. Solving Equation (12) for the large component yields:

#### 2.1.2. Sensitivity to the Excess Kurtosis

$\sigma_i$ and excess kurtosis $K_i$, for $i = 1, 2$. Three types of solutions can then, in principle, exist:

$\lambda_{1/2}$ in each case and evaluate the stability of the fixpoints. A sketch of the fixpoints and their stability is presented in Figure 3.

- The trivial fixpoint $(0, 0)$ is always unstable, with positive eigenvalues: $$\lambda_{1,2}(0,0) = \epsilon_w \frac{x_0^2}{N^2}\left(\sigma_1^2,\ \sigma_2^2\right).$$
- For $(w_1^* \ne 0, 0)$, one finds the eigenvalues: $$\lambda_{1,2}(w_1^* \ne 0, 0) = \epsilon_w \frac{x_0^2}{N^2}\left(-2\sigma_1^2,\ \frac{\sigma_2^2 K_1}{K_1 + 3}\right).$$ The first eigenvalue $\lambda_1$ is hence always negative, with the sign of the second eigenvalue $\lambda_2$ depending exclusively on $K_1$. The fixpoint $(w_1^* \ne 0, 0)$ is hence stable/unstable for negative/positive $K_1$.
- The last term $3\Phi - x_0^2$ in Equation (12) is identical for all synapses. Two non-zero synaptic weights $(w_1^* \ne 0, w_2^* \ne 0)$ can hence only exist for identical signs of the respective excess kurtosis, $K_1 K_2 \geq 0$. It is easy to show that $(w_1^* \ne 0, w_2^* \ne 0)$ is unstable/stable whenever both $K_{1,2}$ are negative/positive, in accordance with Equation (15).

$K_1$, is negative.
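The eigenvalues quoted for the trivial fixpoint and for $(w_1^* \ne 0, 0)$ can be cross-checked numerically. The sketch below assumes that the averaged cubic flow of Equation (12) has the form $\dot{w}_j = -(\epsilon_w/N^2)\, w_j \sigma_j^2 \left(K_j w_j^2 \sigma_j^2 + 3\Phi - x_0^2\right)$ with $\Phi = \sum_k w_k^2 \sigma_k^2$. This is a reconstruction that is consistent with the term $3\Phi - x_0^2$ and with the eigenvalues given in the text, not a formula restated here; the parameter values and function names are arbitrary illustrations.

```python
import numpy as np

def averaged_flow(w, sigma, K, x0, N, eps=1.0):
    """ASSUMED averaged cubic flow for uncorrelated, symmetric inputs."""
    w, sigma, K = (np.asarray(a, dtype=float) for a in (w, sigma, K))
    phi = np.sum(w**2 * sigma**2)  # weighted average Phi
    return -(eps / N**2) * w * sigma**2 * (K * w**2 * sigma**2 + 3.0 * phi - x0**2)

def jacobian(f, w, h=1e-6):
    """Central finite-difference Jacobian of f at w."""
    w = np.asarray(w, dtype=float)
    J = np.zeros((w.size, w.size))
    for k in range(w.size):
        step = np.zeros_like(w)
        step[k] = h
        J[:, k] = (f(w + step) - f(w - step)) / (2.0 * h)
    return J

# Illustrative parameters; only x0 = 2.4 for N = 2 is taken from the text.
sigma, K, x0, N = [0.1, 0.05], [-1.0, -0.5], 2.4, 2
f = lambda w: averaged_flow(w, sigma, K, x0, N)

# Non-trivial fixpoint (w1*, 0) of the assumed flow.
w1_star = x0 / (sigma[0] * np.sqrt(K[0] + 3.0))

lam_trivial = np.sort(np.linalg.eigvals(jacobian(f, [0.0, 0.0])).real)
lam_single = np.sort(np.linalg.eigvals(jacobian(f, [w1_star, 0.0])).real)
print(lam_trivial)  # both positive: the trivial fixpoint is unstable
print(lam_single)   # both negative here, since K1 < 0
```

Changing `K[0]` to a positive value flips the sign of the second eigenvalue at $(w_1^*, 0)$, in line with the stability statement for negative/positive $K_1$.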

#### 2.1.3. Principal Component Analysis

$K_1 = K_2 < 0$, one finds that the phase space contracts faster around $\mathbf{w}^{(1)}$ when $\sigma_1^2 > \sigma_2^2$, and vice versa.

#### 2.2. Alternative Transfer Functions

$\lim_{x \to \mp\infty} y(x) = 0/1$. For example, in [37], the authors showed that this is indeed the case for an arc-tangential transfer function. An interesting transfer function to consider in this context is the rescaled error function $\mathrm{erf}(x - b)$,

^{2} by $x_0^2$, the squared roots for $b = 0$. Equation (20) reduces, interestingly and apart from an overall scaling factor, to the cubic approximation from Equation (8) for $b = 0$:

$S_j$ is the skewness of the input distribution $y_j$, as defined by:

($S_j = 0$) as the ones we have been treating, small values of $b$ produce only a shift in the effective $x_0$ (provided that $b^2$ is smaller than $x_0^2/2$).

$w_j = 0$ would become stable for negative $x_0^2 - b^2/2$. This has, however, not happened in the numerical simulations we performed, which resulted in values of $b \approx 1$ for target activity levels $\langle y \rangle$ as low as 0.1, while $x_0 = 2.4$ for $N = 2$ (which we used). Even sparser activity levels $\langle y \rangle \ll 1$ would require larger firing thresholds $b \gg 1$, and stable synaptic plasticity would then be achieved by selecting an appropriately large $N$, corresponding to values of $x_0$ such that $x_0^2 - b^2/2$ remains positive.

#### 2.3. The Stationarity Principle of Statistical Learning

#### 2.3.1. The Fisher Information with Respect to the Synaptic Flux

$p_\theta(y)$ with respect to a certain parameter $\theta$, becoming minimal whenever $\theta$ does not influence the statistics of $y$. The Fisher information is hence a suitable information theoretical functional for the implementation of the stationarity principle of statistical learning.
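The statement that the Fisher information becomes minimal when $\theta$ has no influence on the statistics of $y$ can be illustrated numerically. The sketch below assumes the standard textbook definition $F_\theta = \int p_\theta(y)\,(\partial \ln p_\theta(y)/\partial\theta)^2\, dy$; the Gaussian test densities and all function names are illustrative choices, not taken from the text.

```python
import numpy as np

def fisher_information(log_p, theta, y, h=1e-5):
    """Riemann-sum estimate of F = int p (d log p / d theta)^2 dy on a uniform grid y."""
    score = (log_p(y, theta + h) - log_p(y, theta - h)) / (2.0 * h)
    p = np.exp(log_p(y, theta))
    return float(np.sum(p * score**2) * (y[1] - y[0]))

def log_gauss_mean(y, theta, s=0.5):
    """Log-density of a Gaussian whose mean is the parameter theta."""
    return -0.5 * ((y - theta) / s) ** 2 - np.log(s * np.sqrt(2.0 * np.pi))

def log_gauss_fixed(y, theta, s=0.5):
    """Log-density that ignores theta: the statistics of y do not depend on it."""
    return log_gauss_mean(y, 0.0, s)

y = np.linspace(-6.0, 6.0, 20001)
F_dependent = fisher_information(log_gauss_mean, 0.0, y)
F_independent = fisher_information(log_gauss_fixed, 0.0, y)
print(F_dependent)    # approximately 1 / s**2 = 4
print(F_independent)  # 0.0
```

For the Gaussian with mean $\theta$ the analytic value is $1/s^2$, while the parameter-independent density attains the lower bound of zero.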

$N_w = 1$ afferent neurons. We define with:

$w_1$. Here, $y(y_1)$ is given by $\sigma(w_1(y_1 - \overline{y}_1) - b)$, as defined in Equation (3). There are two changes with respect to the bare Fisher information (Equation (27)).

- The operator $w_1\, \partial/\partial w_1$ corresponds to a dimensionless differential operator and, hence, to the log-derivative. The whole objective function $\mathcal{F}_{N_w=1}^{syn}$ is hence dimensionless.
- The average sensitivity is computed as an average over the probability distribution $p(y_1)$ of the presynaptic activity $y_1$, since we are interested in minimizing the time average of the sensitivity of the postsynaptic activity with respect to synaptic weight changes in the context of a stationary presynaptic activity distribution $p(y_1)$.

$p(y(y_1))$, for which $y$ is a monotonic function of $y_1$, we have:

$\mathbf{y} = (y_1, \dots, y_{N_w})$ the vector of afferent activities and with $p(\mathbf{y})$ the corresponding probability distribution function, we may generalize Equation (30) as:

$p(y(\mathbf{y}))$ from Equation (28) by $\frac{p(y_j)}{\partial y/\partial y_j}$, in what constitutes the independent synapse extension, and which represents the Fisher information with respect to the flux operator:

- Minimizing $\mathcal{F}_{N_w}^{syn}$, in accordance with the stationarity principle for statistical learning, leads to a synaptic weight vector $\mathbf{w}$ that is perpendicular to the gradient $\nabla_w(\log(p))$, consequently restricting the overall growth of the modulus of $\mathbf{w}$.
- In $\mathcal{F}_{N_w}^{syn}$, there is no direct cross-talk between different synapses. Equation (32) is hence adequate for deriving Hebbian-type learning rules in which every synapse has access only to locally-available information, together with the overall state of the postsynaptic neuron in terms of its firing activity $y$ or its membrane potential $x$. We call Equation (32) the local synapse extension with respect to other formulations allowing for inter-synaptic cross-talk.
- It is straightforward to show [37] that Equation (31) reduces to Equation (5), when using the relations from Equation (3), viz. $\mathcal{F}_{N_w}^{syn} = \mathcal{F}^{syn}$ when we identify $N \to N_w$. We have, however, opted to retain $N$ generically as a free parameter in Equation (5), allowing us to shift appropriately the roots of $G(x)$.

## 3. Results and Discussion

#### 3.1. Quantitative Comparison of the Model and the Cubic Approximation

$\sigma_j$ and kurtosis $K_j$ of the input distribution. Given that the input distribution has a finite width, the integrated input $x$ cannot fall into the minima of the objective function for every point in the distribution, but rather the cloud of $x$ points generated will tend to spread around these minima. The discrepancies in the rule from the cubic approximation in the vicinity and away from the minima are then expected to affect the final result of the learning procedure.

($y_1$, without loss of generality), the sum:

$\sigma_s$) with individual standard deviations $\sigma_s$, whose peaks are at a distance $\pm d$ from the center of the input range (0.5).

- $\sigma_s$ is adjusted, changing $d$, such that the overall standard deviation $\sigma_1$ remains constant. In this way, one can select with $d$ different kurtosis levels, while retaining a constant standard deviation. For $d = 0$, one gets a bounded (since $y_1 \in [0, 1]$) normal distribution with $K_1 \approx 0$ (slightly negative, since the distributions are bounded). In this way, we can evaluate the size of $w_1$ after training for a varying $K_1 \in [-2, 0)$ for any given $\sigma_1$.
- For the other $N_w - 1$ directions, we use bounded normal distributions with standard deviations $\sigma_i = \sigma_1/2$, as in [32].
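The way $d$ selects the kurtosis at fixed $\sigma_1$ can be made concrete with a short sketch. It assumes an equal-weight mixture of two Gaussians centered at $0.5 \pm d$, for which, ignoring the bounds, the variance is $\sigma_s^2 + d^2$ and the excess kurtosis is $-2(d/\sigma_1)^4$; the bounds on $[0, 1]$ are approximated here by clipping, and the sample size, seed and function names are illustrative assumptions.

```python
import numpy as np

def sigma_s(d, sigma1):
    """Width of each peak such that the overall standard deviation stays sigma1."""
    return np.sqrt(sigma1**2 - d**2)

def mixture_excess_kurtosis(d, sigma1):
    """Excess kurtosis of the unbounded two-Gaussian mixture."""
    return -2.0 * (d / sigma1) ** 4

def sample_bimodal(d, sigma1, n, rng):
    """Draw n samples with peaks at 0.5 +/- d, clipped to the input range [0, 1]."""
    signs = rng.choice([-1.0, 1.0], size=n)
    y = 0.5 + signs * d + sigma_s(d, sigma1) * rng.standard_normal(n)
    return np.clip(y, 0.0, 1.0)

def empirical_excess_kurtosis(y):
    c = y - y.mean()
    return np.mean(c**4) / np.mean(c**2) ** 2 - 3.0

rng = np.random.default_rng(0)
sigma1 = 0.1
for d in (0.0, 0.05, 0.08, 0.1):
    y = sample_bimodal(d, sigma1, 200_000, rng)
    # columns: d, sample SD, sample excess kurtosis, mixture prediction
    print(d, y.std(), empirical_excess_kurtosis(y), mixture_excess_kurtosis(d, sigma1))
```

For $d = \sigma_1$ the peak width vanishes and the mixture degenerates into two deltas with excess kurtosis $-2$, while $d = 0$ gives a single Gaussian with excess kurtosis close to zero.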

$w_1$ after training is presented together with the prediction (Equation (13)) from the cubic approximation, as a function of $K_1$ (the kurtosis in the $y_1$ direction), for a constant $b = 0$. In this case, we have used $\sigma_1 = 0.1$ and $\sigma_{i \neq 1} = \sigma_1/2$.

$K_1$, as indeed is observed in Figure 4b. When $K_1 = -2$, the input distribution becomes the sum of two deltas, and the rule is able to assign each delta to a root. The prediction is of course once again exact in this case. Otherwise, while the quantitative result differs from one rule to another, the qualitative behavior remains unchanged.

#### 3.2. Independent Component Analysis: An Application to the Nonlinear Bars Problem

$N_w$ inputs, where $N_w = L \times L$, each horizontal and vertical bar has a constant probability $p = 1/L$ of being present. Each input or pixel can take only two values: a low-intensity and a high-intensity value. Each bar then corresponds to a whole row or a column of high-intensity pixels, where at the intersection of two bars, the pixel has the same value (high) as in the rest of the bar, making the problem non-linear.
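The input ensemble described above can be sketched in a few lines. The intensity values `low` and `high`, the grid size and the function name are placeholders chosen for illustration, since the text only specifies that two intensity levels exist.

```python
import numpy as np

def bars_image(L, rng, low=0.0, high=1.0):
    """One L x L bars pattern; each of the 2L bars is present with probability 1/L."""
    p = 1.0 / L
    rows = rng.random(L) < p  # which horizontal bars are present
    cols = rng.random(L) < p  # which vertical bars are present
    # Non-linear superposition: a pixel is high if its row OR its column is active,
    # so intersections carry the same high value as the rest of the bar.
    mask = rows[:, None] | cols[None, :]
    return np.where(mask, high, low)

rng = np.random.default_rng(0)
L = 5
img = bars_image(L, rng)
x_in = img.ravel()  # input vector with N_w = L * L components
print(img)
```

Because the superposition is a logical OR rather than a sum, the pixel statistics are not a linear mixture of the individual bars, which is the non-linearity referred to in the text.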

## 4. Conclusions

## Acknowledgments

## Author Contributions

## Conflicts of Interest

## References

- Attwell, D.; Laughlin, S.B. An energy budget for signaling in the grey matter of the brain. J. Cereb. Blood Flow Metab. **2001**, 21, 1133–1145.
- Mink, J.W.; Blumenschine, R.J.; Adams, D.B. Ratio of central nervous system to body metabolism in vertebrates: its constancy and functional basis. Am. J. Physiol.-Regul. Integr. Comp. Physiol. **1981**, 241, R203–R212.
- Niven, J.E.; Laughlin, S.B. Energy limitation as a selective pressure on the evolution of sensory systems. J. Exp. Biol. **2008**, 211, 1792–1804.
- Bullmore, E.; Sporns, O. The economy of brain network organization. Nat. Rev. Neurosci. **2012**, 13, 336–349.
- Lee, H.; Battle, A.; Raina, R.; Ng, A.Y. Efficient sparse coding algorithms. In Advances in Neural Information Processing Systems: Proceedings of The First 12 Conferences; Jordan, M.I., LeCun, Y., Solla, S.A., Eds.; The MIT Press: Cambridge, MA, USA, 2001; pp. 801–808.
- Stemmler, M.; Koch, C. How voltage-dependent conductances can adapt to maximize the information encoded by neuronal firing rate. Nat. Neurosci. **1999**, 2, 521–527.
- Gros, C. Generating functionals for guided self-organization. In Guided Self-Organization: Inception; Prokopenko, M., Ed.; Springer: Berlin/Heidelberg, Germany, 2014; pp. 53–66.
- MacKay, D. Information-based objective functions for active data selection. Neural Comput. **1992**, 4, 590–604.
- Marler, R.T.; Arora, J.S. Survey of multi-objective optimization methods for engineering. Struct. Multidiscip. Optim. **2004**, 26, 369–395.
- Intrator, N.; Cooper, L.N. Objective function formulation of the BCM theory of visual cortical plasticity: Statistical connections, stability conditions. Neural Netw. **1992**, 5, 3–17.
- Kay, J.W.; Phillips, W. Coherent infomax as a computational goal for neural systems. Bull. Math. Biol. **2011**, 73, 344–372.
- Polani, D. Information: currency of life. HFSP J. **2009**, 3, 307–316.
- Zahedi, K.; Ay, N.; Der, R. Higher coordination with less control—A result of information maximization in the sensorimotor loop. Adapt. Behav. **2010**, 18, 338–355.
- Polani, D.; Prokopenko, M.; Yaeger, L.S. Information and self-organization of behavior. Adv. Complex Syst. **2013**, 16, 1303001.
- Prokopenko, M.; Gershenson, C. Entropy Methods in Guided Self-Organisation. Entropy **2014**, 16, 5232–5241.
- Der, R.; Martius, G. The Playful Machine: Theoretical Foundation and Practical Realization of Self-Organizing Robots; Springer: Berlin/Heidelberg, Germany, 2012; Volume 15.
- Markovic, D.; Gros, C. Self-organized chaos through polyhomeostatic optimization. Phys. Rev. Lett. **2010**, 105, 068702.
- Marković, D.; Gros, C. Intrinsic adaptation in autonomous recurrent neural networks. Neural Comput. **2012**, 24, 523–540.
- Triesch, J. Synergies between intrinsic and synaptic plasticity mechanisms. Neural Comput. **2007**, 19, 885–909.
- Linsker, R. Local synaptic learning rules suffice to maximize mutual information in a linear network. Neural Comput. **1992**, 4, 691–702.
- Chechik, G. Spike-timing-dependent plasticity and relevant mutual information maximization. Neural Comput. **2003**, 15, 1481–1510.
- Toyoizumi, T.; Pfister, J.P.; Aihara, K.; Gerstner, W. Generalized Bienenstock–Cooper–Munro rule for spiking neurons that maximizes information transmission. Proc. Natl. Acad. Sci. USA **2005**, 102, 5239–5244.
- Friston, K. The free-energy principle: A unified brain theory. Nat. Rev. Neurosci. **2010**, 11, 127–138.
- Mozzachiodi, R.; Byrne, J.H. More than synaptic plasticity: Role of nonsynaptic plasticity in learning and memory. Trends Neurosci. **2010**, 33, 17–26.
- Strogatz, S.H. Nonlinear Dynamics and Chaos: With Applications to Physics, Biology and Chemistry; Perseus Publishing: Boulder, CO, USA, 2001.
- Hebb, D.O. The Organization of Behavior: A Neuropsychological Theory; Psychology Press: Mahwah, NJ, USA, 2002.
- Oja, E. The nonlinear PCA learning rule in independent component analysis. Neurocomputing **1997**, 17, 25–45.
- Bi, G.Q.; Poo, M.M. Synaptic modifications in cultured hippocampal neurons: Dependence on spike timing, synaptic strength, and postsynaptic cell type. J. Neurosci. **1998**, 18, 10464–10472.
- Froemke, R.C.; Dan, Y. Spike-timing-dependent synaptic modification induced by natural spike trains. Nature **2002**, 416, 433–438.
- Izhikevich, E.M.; Desai, N.S. Relating STDP to BCM. Neural Comput. **2003**, 15, 1511–1523.
- Echeveste, R.; Gros, C. Two-trace model for spike-timing-dependent synaptic plasticity. Neural Comput. **2015**, 27, 672–698.
- Echeveste, R.; Gros, C. Generating functionals for computational intelligence: The Fisher information as an objective function for self-limiting Hebbian learning rules. Front. Robot. AI **2014**, 1.
- Bell, A.J.; Sejnowski, T.J. An information-maximization approach to blind separation and blind deconvolution. Neural Comput. **1995**, 7, 1129–1159.
- Martius, G.; Der, R.; Ay, N. Information driven self-organization of complex robotic behaviors. PLoS ONE **2013**, 8, e63400.
- Földiak, P. Forming sparse representations by local anti-Hebbian learning. Biol. Cybern. **1990**, 64, 165–170.
- Brunel, N.; Nadal, J.P. Mutual information, Fisher information, and population coding. Neural Comput. **1998**, 10, 1731–1757.
- Echeveste, R.; Gros, C. An objective function for self-limiting neural plasticity rules. Proceedings of the 23rd European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN), Bruges, Belgium, 22–24 April 2015.
- Hyvärinen, A.; Karhunen, J.; Oja, E. Independent Component Analysis; Wiley: New York, NY, USA, 2004; Volume 46.
- Bell, A.J.; Sejnowski, T.J. The “independent components” of natural scenes are edge filters. Vis. Res. **1997**, 37, 3327–3338.
- Paradiso, M. A theory for the use of visual orientation information which exploits the columnar structure of striate cortex. Biol. Cybern. **1988**, 58, 35–49.
- Seung, H.; Sompolinsky, H. Simple models for reading neuronal population codes. Proc. Natl. Acad. Sci. USA **1993**, 90, 10749–10753.
- Gutnisky, D.A.; Dragoi, V. Adaptive coding of visual information in neural populations. Nature **2008**, 452, 220–224.
- Bethge, M.; Rotermund, D.; Pawelzik, K. Optimal neural rate coding leads to bimodal firing rate distributions. Netw. Comput. Neural Syst. **2003**, 14, 303–319.
- Lansky, P.; Greenwood, P.E. Optimal signal in sensory neurons under an extended rate coding concept. BioSystems **2007**, 89, 10–15.
- Ecker, A.S.; Berens, P.; Tolias, A.S.; Bethge, M. The effect of noise correlations in populations of diversely tuned neurons. J. Neurosci. **2011**, 31, 14272–14283.
- Reginatto, M. Derivation of the equations of nonrelativistic quantum mechanics using the principle of minimum Fisher information. Phys. Rev. A **1998**, 58, 1775–1778.
- DeCarlo, L.T. On the meaning and use of kurtosis. Psychol. Methods **1997**, 2, 292.
- Comon, P. Independent component analysis, a new concept. Signal Process. **1994**, 36, 287–314.
- Hyvärinen, A.; Oja, E. Independent component analysis: Algorithms and applications. Neural Netw. **2000**, 13, 411–430.
- Girolami, M.; Fyfe, C. Negentropy and Kurtosis as Projection Pursuit Indices Provide Generalised ICA Algorithms. Proceedings of NIPS 96 Workshop on Blind Signal Processing and Their Applications, Snowmass, Aspen, CO, USA, 7 December 1996.
- Li, H.; Adali, T. A class of complex ICA algorithms based on the kurtosis cost function. IEEE Trans. Neural Netw. **2008**, 19, 408–420.

**Figure 1.** Organigram of the approach followed. The objective function $\mathcal{F}^{syn}$ for synaptic plasticity studied here can be motivated by the Fisher information for the synaptic flux. The resulting plasticity rule $\dot{w}_j$ for the synaptic weights is then investigated both through simulations and using a cubic approximation in $x$ (which becomes exact when using the error function as a transfer function $y(x) = \sigma(x - b)$; see Section 2.2), which allows one to derive analytic results for the dependence of the synaptic adaption on the kurtosis of the input statistics.

**Figure 2.** (**a**) The plasticity functions $G$ and $H$, as defined by Equation (7), here expressed entirely in terms of the output activity $y \in [0, 1]$, for clarity. $H$ represents the Hebbian contribution of the rule, with $G$ acting as a limiting factor, reverting the sign of Equation (6) for activity values close to 0/1. (**b**) Plot of the learning rule from Equation (6) together with the cubic approximation (Equation (8)), expressed this time as a function of the membrane potential $x$. Parameters: $b = 0$ and $N = 2$.

**Figure 3.** Sketch of the fixpoints of Equation (10), which approximates Equation (9), for two competing weights $w_1$ and $w_2$ as a function of the kurtosis $K_1$ and $K_2$ of the respective input directions. Open, full and half-full circles represent unstable fixpoints, stable fixpoints and saddles, respectively. The axes are expressed in terms of $w_i^2$, since the solutions are determined only up to a sign change.

**Figure 4.** (**a**) Functions $G$ and $H$, as in Figure 2a, now for Equation (7), here expressed entirely in terms of the output activity $y \in [0, 1]$, for clarity. (**b**) Final absolute value of the weight $w_1$ after training, with both learning rules (Equation (6) and Equation (20)), together with the prediction from the cubic approximation (Equation (13)), as a function of the kurtosis $K_1$ for the direction of the principal component. Parameters: $b = 0$, $\sigma_1 = 0.1$, $\sigma_{i \neq 1} = \sigma_1/2$ and $N_w = 100$. One observes that the prediction is practically exact in the case of the error transfer function, remaining qualitatively similar for the case of the Fermi transfer function (Equation (6)).

**Figure 5.** A single neuron, whose synaptic weights evolve according to Equation (6), is presented with a set of input images consisting of the non-linear superposition of a random set of bars. We find that, on subsequent iterations, the neuron becomes selective to either single bars (the independent components of the input distribution) or to points.

© 2015 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Echeveste, R.; Eckmann, S.; Gros, C. The Fisher Information as a Neural Guiding Principle for Independent Component Analysis. *Entropy* **2015**, *17*, 3838-3856.
https://doi.org/10.3390/e17063838
