#### 2.3. The Dynamics of On-Line Training in Stationary Environments

In the following, we sketch the successful theory of on-line learning [5,6,7,8,38] as, for instance, applied to the dynamics of LVQ algorithms in [39,46,47,48] and to on-line gradient descent in the SCM in [30,31,32,33,34,35,36]. We refer the reader to the original publications for details. The extensions to non-stationary situations with concept drift are discussed in Section 2.4.

The analysis follows the same key steps in both settings. We consider adaptive vectors ${\overrightarrow{w}}_{1,2}\in {\mathbb{R}}^{N}$ (prototypes in LVQ or student weights in the SCM) while the characteristic vectors ${\overrightarrow{B}}_{1,2}$ specify the target task (cluster centers in LVQ training, SCM teacher vectors for regression).

The consideration of the thermodynamic limit $N\to \infty $ is instrumental for the theoretical treatment. The limit facilitates the following key steps, which, eventually, yield an exact mathematical description of the training dynamics in terms of ordinary differential equations (ODE):

- (a)
Order Parameters

The many degrees of freedom, i.e., the components of the adaptive vectors, can be characterized in terms of only very few quantities. The definition of meaningful, so-called order parameters follows naturally from the specific mathematical structure of the model. After the presentation of a number $\mu $ of examples, as indicated by corresponding superscripts, we describe the system by the projections

${R}_{im}^{\mu}={\overrightarrow{w}}_{i}^{\mu}\cdot {\overrightarrow{B}}_{m}\phantom{\rule{1em}{0ex}}\text{and}\phantom{\rule{1em}{0ex}}{Q}_{ik}^{\mu}={\overrightarrow{w}}_{i}^{\mu}\cdot {\overrightarrow{w}}_{k}^{\mu}\phantom{\rule{1em}{0ex}}\text{with}\phantom{\rule{0.277778em}{0ex}}i,k,m\in \{1,2\}.$

Obviously, ${Q}_{11}^{\mu},{Q}_{22}^{\mu}$ and ${Q}_{12}^{\mu}={Q}_{21}^{\mu}$ relate to the norms and mutual overlap of the adaptive vectors, while the four quantities ${R}_{im}^{\mu}$ specify their projections onto the linear subspace spanned by the characteristic vectors $\{{\overrightarrow{B}}_{1},{\overrightarrow{B}}_{2}\}$.
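The bookkeeping behind these definitions can be illustrated in a finite-$N$ simulation. The sketch below uses arbitrary, randomly drawn vectors (not the actual LVQ/SCM dynamics) purely to show how $R_{im}$ and $Q_{ik}$ are computed:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10_000                     # large N mimics the thermodynamic limit

# Characteristic vectors (cluster centers or teacher weights);
# orthonormal unit vectors are used here purely for illustration
B1, B2 = np.zeros(N), np.zeros(N)
B1[0] = B2[1] = 1.0
# Adaptive vectors (prototypes or student weights), |w_i|^2 ≈ 1
w1 = rng.normal(0, 1 / np.sqrt(N), N)
w2 = rng.normal(0, 1 / np.sqrt(N), N)

# Order parameters: R[i, m] = w_i . B_m and Q[i, k] = w_i . w_k
W, B = np.stack([w1, w2]), np.stack([B1, B2])
R = W @ B.T
Q = W @ W.T
```

Note that the $N$-dimensional vectors are fully described by the seven numbers in `R` and `Q` (with `Q` symmetric), which is exactly the dimensional reduction the analysis exploits.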

- (b)
Recursions

For the order parameters, recursion relations can be derived directly from the learning algorithms in Equations (2) and (9), which are both of the generic form ${\overrightarrow{w}}_{k}^{\mu}\phantom{\rule{0.166667em}{0ex}}={\overrightarrow{w}}_{k}^{\mu -1}\phantom{\rule{0.166667em}{0ex}}+\Delta {\overrightarrow{w}}_{k}^{\mu},$ by considering the corresponding inner products:

Note that terms of order $\mathcal{O}(1/N)$ on the right-hand side (r.h.s.) of Equation (15) will be neglected in the following.

- (c)
Averages over the Model Data

Applying the central limit theorem (CLT), we can perform an average over the random sequence of independent examples. Note that $\Delta {\overrightarrow{w}}_{k}^{\mu}\propto {\overrightarrow{\xi}}^{\mu}$ or $\Delta {\overrightarrow{w}}_{k}^{\mu}\propto \left({\overrightarrow{\xi}}^{\mu}-{\overrightarrow{w}}_{k}^{\mu -1}\right)$, respectively.

Consequently, the current input ${\overrightarrow{\xi}}^{\mu}$ enters the r.h.s. of Equation (15) only through its norm $\mid \overrightarrow{\xi}{\mid}^{2}=\mathcal{O}(N)$ and the quantities

${h}_{i}^{\mu}={\overrightarrow{w}}_{i}^{\mu -1}\cdot {\overrightarrow{\xi}}^{\mu}\phantom{\rule{1em}{0ex}}\text{and}\phantom{\rule{1em}{0ex}}{b}_{m}^{\mu}={\overrightarrow{B}}_{m}\cdot {\overrightarrow{\xi}}^{\mu}.$

Since these inner products correspond to sums of many independent random quantities in our model, the CLT implies that the projections in Equation (16) are correlated Gaussian quantities for large $N$, and their joint density $P({h}_{1}^{\mu},{h}_{2}^{\mu},{b}_{1}^{\mu},{b}_{2}^{\mu})$ is completely determined by its first and second moments.
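This CLT statement can be checked empirically. The sketch below assumes the simplest isotropic input density (i.i.d. unit-variance components, as in the SCM case) and verifies that the low-order moments of the projections are given by the order parameters:

```python
import numpy as np

rng = np.random.default_rng(1)
N, P = 1_000, 10_000
B1 = np.zeros(N)                        # one characteristic vector
B1[0] = 1.0
w1 = rng.normal(0, 1 / np.sqrt(N), N)   # one adaptive vector, |w1|^2 ≈ 1
R11 = w1 @ B1                           # order parameters of this pair
Q11 = w1 @ w1

xi = rng.normal(size=(P, N))            # P isotropic inputs, unit variance
h1 = xi @ w1                            # projections h = w . xi
b1 = xi @ B1                            # projections b = B . xi

# CLT: (h1, b1) are jointly Gaussian with zero mean and second
# moments <h1^2> = Q11, <h1 b1> = R11, <b1^2> = 1
```

For the clustered LVQ density, the means are shifted by the cluster centers, but the same reduction to first and second moments applies.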

LVQ: For the clustered density, cf. Equation (4), the conditional moments read

where $i,k,l,m,n\in \{1,2\}$ and ${\delta}_{\dots}$ denotes the Kronecker delta.

SCM: In the simpler case of the isotropic, spherical density (Equation (13)) with $\lambda =0$ and ${v}_{1}={v}_{2}=1$, the moments reduce to

Hence, in both cases, the joint density of ${h}_{1,2}^{\mu}$ and ${b}_{1,2}^{\mu}$ is fully specified by the values of the order parameters in the previous time step and by the parameters of the model density. This important result enables us to average the recursion relations (Equation (15)) over the latest training example in terms of Gaussian integrals. Moreover, the resulting r.h.s. can be expressed in closed form in $\{{R}_{im}^{\mu -1},{Q}_{ik}^{\mu -1}\}$.
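The reduction to Gaussian integrals is what makes such closed forms possible. As a minimal illustration, with a hypothetical mean and variance in place of the actual model moments, the average of a Heaviside-type term over a Gaussian projection has an elementary closed form that a Monte Carlo average reproduces:

```python
import math
import numpy as np

rng = np.random.default_rng(2)
m, v = 0.3, 1.5   # hypothetical first and second central moments

# Monte Carlo average of the indicator Theta(x) over x ~ N(m, v)
x = rng.normal(m, math.sqrt(v), 200_000)
mc = np.mean(x > 0.0)

# Closed form: <Theta(x)> = Phi(m / sqrt(v)), with Phi the standard
# normal CDF, Phi(z) = (1 + erf(z / sqrt(2))) / 2
analytic = 0.5 * (1.0 + math.erf(m / math.sqrt(2.0 * v)))
```

Averages of the actual modulation terms work analogously, but over the correlated four-dimensional density $P(\{h_i, b_m\})$.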

- (d)
Self-Averaging Properties

The self-averaging property of the order parameters makes it possible to restrict the description to their mean values: fluctuations of the stochastic dynamics can be neglected in the limit $N\to \infty $. This concept has been borrowed from the statistical physics of disordered materials and has been applied frequently in the study of neural network models and learning processes [4,5,6,7]. For a detailed mathematical discussion in the context of sequential on-line learning, see [58].

Consequently, we can interpret the averaged Equation (15) directly as deterministic recursions for the means of $\{{R}_{im}^{\mu},{Q}_{ik}^{\mu}\}$, which coincide with their actual values in the thermodynamic limit.

- (e)
Continuous Time Limit and ODE

For $N\to \infty $, we can interpret the ratios on the left-hand sides of Equation (15) as derivatives with respect to the continuous learning time

$\alpha =\mu /N.$

This scaling corresponds to the plausible assumption that the number of examples required for successful training is proportional to the number of degrees of freedom in the system.

Averages are performed over the joint densities $P\left(\left\{{h}_{i},{b}_{m}\right\}\right)$ corresponding to the most recent, independently drawn input vector. Here, and in the following, we have omitted the index $\mu $.

The resulting sets of coupled ODE obtained from Equation (15) are of the generic form:

Here, the subscript stat indicates that the ODE describe learning from a stationary density, cf. Equation (4) or (13).

LVQ: For the classification model, we have to insert the terms

with the LVQ1 modulation functions ${f}_{i}$ from Equation (3) and (conditional) averages with respect to the density (Equation (4)).

SCM: In the modeling of regression in a student–teacher scenario, we obtain

where the quantities ${\rho}_{i}$ are defined in Equation (12) for the latest input vector and averages are performed over the isotropic input density (Equation (13)).

In both training scenarios considered here, the r.h.s. of Equation (20), as given by Equations (21) and (22), can be expressed in terms of elementary functions. For the straightforward yet lengthy results, we refer the reader to the original literature for LVQ [39,46] and SCM [31,32,33,34].

- (f)
Generalization error

After training, the success of learning is quantified in terms of the generalization error ${\epsilon}_{g}$, which can also be expressed as a function of the order parameters.

LVQ: In the case of classification, ${\epsilon}_{g}$ is given as the probability of misclassifying a novel, randomly drawn input vector. In the LVQ model, class-specific errors corresponding to data from the clusters $k=1,2$ in Equation (4) can be considered separately:

Here, ${\epsilon}_{k}$ is the class-specific misclassification rate, i.e., the probability for an example drawn from cluster $k$ to be assigned to $\widehat{k}\ne k$ with ${d}_{k}>{d}_{\widehat{k}}$. For the derivation of the class-wise and total generalization error for systems with two prototypes as functions of the order parameters, we also refer to [39]. One obtains

SCM: For regression, the generalization error is defined as an average $\langle \cdots \rangle $ of the quadratic deviation between student and teacher output over the isotropic density, cf. Equation (13):

the full form of which can be found in [31,32] for arbitrary $K$ and $M$. For $K=M=2$ with orthonormal teacher vectors, it simplifies to
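To make the definition of the class-wise LVQ errors concrete, the following sketch estimates $\epsilon_1$ by Monte Carlo for a hypothetical configuration of centers and prototypes (unit-variance clusters, prototypes aligned with their centers; all parameter values are illustrative, not taken from the model equations):

```python
import numpy as np

rng = np.random.default_rng(3)
N, P = 200, 20_000
B1, B2 = np.zeros(N), np.zeros(N)     # cluster centers (illustrative)
B1[0] = B2[1] = 1.0
w1, w2 = 0.8 * B1, 0.8 * B2           # prototypes aligned with the centers

# Draw P examples from cluster 1: unit-variance isotropic noise around B1
xi = B1 + rng.normal(size=(P, N))
d1 = ((xi - w1) ** 2).sum(axis=1)     # squared Euclidean distances
d2 = ((xi - w2) ** 2).sum(axis=1)
eps_1 = np.mean(d1 > d2)              # class-1 error: wrong prototype wins
```

In the theoretical analysis, this probability is not sampled but evaluated in closed form as a Gaussian integral over the order parameters.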

- (g)
Learning curves

The (numerical) integration of the ODE for a given training algorithm, model density, and specific initial conditions $\{{R}_{im}(0),{Q}_{ik}(0)\}$ yields the temporal evolution of the order parameters in the course of training.

Exploiting the self-averaging properties of the order parameters once more, we obtain the learning curves ${\epsilon}_{g}(\alpha )={\epsilon}_{g}\left(\{{R}_{im}(\alpha ),{Q}_{ik}(\alpha )\}\right)$, i.e., the generalization error after on-line training with $\alpha \phantom{\rule{0.166667em}{0ex}}N$ random examples.
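The whole procedure can be sketched end to end for an exactly tractable toy case: a single linear student learning a single linear teacher by plain on-line gradient descent, $\Delta \overrightarrow{w}=(\eta /N)(b-h)\overrightarrow{\xi}$. Under this assumption (a simplification of the SCM, not the committee machine itself), the averaged recursions yield $dR/d\alpha =\eta (1-R)$, $dQ/d\alpha =2\eta (R-Q)+{\eta}^{2}(1-2R+Q)$, and ${\epsilon}_{g}=(1-2R+Q)/2$. A simple Euler integration then produces the learning curve:

```python
import numpy as np

eta, dalpha, steps = 0.5, 0.01, 2_000   # learning rate, step; alpha_max = 20
R, Q = 0.0, 0.0                          # tabula rasa initial conditions

eps_g = []
for _ in range(steps):
    # ODE for the linear student-teacher toy model (assumption, see text)
    dR = eta * (1.0 - R)
    dQ = 2.0 * eta * (R - Q) + eta**2 * (1.0 - 2.0 * R + Q)
    R += dalpha * dR
    Q += dalpha * dQ
    eps_g.append(0.5 * (1.0 - 2.0 * R + Q))   # eps_g = <(h - b)^2> / 2
```

Here, $R\to 1$ and $Q\to 1$, so ${\epsilon}_{g}(\alpha )\to 0$: the student aligns perfectly with the teacher. The LVQ and SCM learning curves are obtained in exactly the same way, only with the lengthier right-hand sides of Equations (21) and (22).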