2. Notation
A parameterized curve in
${\mathbb{R}}^{d}$ is a continuous function
$\mathbf{f}:I\longrightarrow {\mathbb{R}}^{d}$ where
$I=[a,b]$ is a closed interval of the real line. The length of
$\mathbf{f}$ is given by
$$\mathcal{L}\left(\mathbf{f}\right)=\underset{a={t}_{0}<{t}_{1}<\dots <{t}_{n}=b}{sup}\sum _{i=1}^{n}{\Vert \mathbf{f}({t}_{i})-\mathbf{f}({t}_{i-1})\Vert}_{2}.$$
Let
${x}_{1},{x}_{2},\dots ,{x}_{T}\in B(0,\sqrt{d}R)\subset {\mathbb{R}}^{d}$ be a sequence of data, where
$B(\mathbf{c},R)$ stands for the
${\ell}_{2}$-ball centered at
$\mathbf{c}\in {\mathbb{R}}^{d}$ with radius
$R>0$. Let
${\mathcal{Q}}_{\delta}$ be a grid over
$B(0,\sqrt{d}R)$, i.e.,
${\mathcal{Q}}_{\delta}=B(0,\sqrt{d}R)\cap {\Gamma}_{\delta}$ where
${\Gamma}_{\delta}$ is a lattice in
${\mathbb{R}}^{d}$ with spacing
$\delta >0$. Let
$L>0$ and define for each
$k\in \{1,\dots ,p\}$ the collection
${\mathcal{F}}_{k,L}$ of polygonal lines
$\mathbf{f}$ with
k segments whose vertices are in
${\mathcal{Q}}_{\delta}$ and such that
$\mathcal{L}\left(\mathbf{f}\right)\le L$. Denote by
${\mathcal{F}}_{p}={\cup}_{k=1}^{p}{\mathcal{F}}_{k,L}$ all polygonal lines with a number of segments
$\le p$, whose vertices are in
${\mathcal{Q}}_{\delta}$ and whose length is at most
L. Finally, let
$\mathcal{K}\left(\mathbf{f}\right)$ denote the number of segments of
$\mathbf{f}\in {\mathcal{F}}_{p}$. This strategy is illustrated by
Figure 4.
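To make the grid construction concrete, here is a minimal Python sketch (illustrative only; the paper's implementation is in R, and taking ${\Gamma}_{\delta}$ to be the cubic lattice $\delta {\mathbb{Z}}^{d}$ is our assumption) that enumerates ${\mathcal{Q}}_{\delta}=B(0,\sqrt{d}R)\cap {\Gamma}_{\delta}$:

```python
import itertools
import math

def grid_in_ball(d, R, delta):
    """Enumerate Q_delta = B(0, sqrt(d)*R) ∩ Gamma_delta, where Gamma_delta
    is taken to be the cubic lattice delta * Z^d (an assumption)."""
    radius = math.sqrt(d) * R
    n = int(radius // delta)  # lattice coordinates range over [-n, n] on each axis
    points = []
    for index in itertools.product(range(-n, n + 1), repeat=d):
        p = tuple(delta * i for i in index)
        if math.dist(p, (0.0,) * d) <= radius:
            points.append(p)
    return points

grid = grid_in_ball(d=2, R=1.0, delta=0.5)  # candidate vertices for polygonal lines
```

The points of `grid` then serve as the candidate vertices from which the polygonal lines in ${\mathcal{F}}_{k,L}$ are built.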
Our goal is to learn a time-dependent polygonal line which passes through the “middle” of the data and gives a summary of all available observations
${x}_{1},\dots ,{x}_{t-1}$ (denoted by
${\left({x}_{s}\right)}_{1:(t-1)}$ hereafter) before time
t. Our output at time
t is a polygonal line
${\widehat{\mathbf{f}}}_{t}\in {\mathcal{F}}_{p}$ depending on past information
${\left({x}_{s}\right)}_{1:(t-1)}$ and past predictions
${\left({\widehat{\mathbf{f}}}_{s}\right)}_{1:(t-1)}$. When
${x}_{t}$ is revealed, the instantaneous loss at time
t is computed as
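Concretely, $\Delta (\mathbf{f},{x}_{t})$ is the squared distance from ${x}_{t}$ to its projection onto the polygonal line (a reading consistent with the uniform bound $\Delta (\mathbf{f},x)\le d{(2R+\delta )}^{2}$ used in Section 6). A minimal Python sketch with our own function names, not the paper's R code:

```python
import math

def dist2_to_segment(x, a, b):
    """Squared Euclidean distance from point x to the segment [a, b] in R^d."""
    ab = [bi - ai for ai, bi in zip(a, b)]
    ax = [xi - ai for ai, xi in zip(a, x)]
    denom = sum(c * c for c in ab)
    # clamp the projection parameter to [0, 1] so the projection stays on the segment
    t = 0.0 if denom == 0 else max(0.0, min(1.0, sum(u * v for u, v in zip(ab, ax)) / denom))
    proj = [ai + t * c for ai, c in zip(a, ab)]
    return sum((xi - pi) ** 2 for xi, pi in zip(x, proj))

def loss(vertices, x):
    """Delta(f, x): squared distance from x to the polygonal line whose
    consecutive vertices are given (assumed form of the instantaneous loss)."""
    return min(dist2_to_segment(x, a, b) for a, b in zip(vertices, vertices[1:]))
```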
In what follows, we investigate regret bounds for the cumulative loss based on (2). Given a measurable space
$\Theta $ (embedded with its Borel
$\sigma $-algebra), we let
$\mathcal{P}(\Theta )$ denote the set of probability distributions on
$\Theta $, and for some reference measure
$\pi $, we let
${\mathcal{P}}_{\pi}(\Theta )$ be the set of probability distributions absolutely continuous with respect to
$\pi $.
For any
$k\in \{1,\dots ,p\}$, let
${\pi}_{k}$ denote a probability distribution on
${\mathcal{F}}_{k,L}$. We define the
prior $\pi $ on
${\mathcal{F}}_{p}={\cup}_{k=1}^{p}{\mathcal{F}}_{k,L}$ as
where
${w}_{1},\dots ,{w}_{p}\ge 0$ and
${\sum}_{k=1}^{p}{w}_{k}=1$.
We adopt a quasi-Bayesian-flavored procedure: consider the Gibbs quasi-posterior (note that this is not a proper posterior in all generality, hence the term “quasi”):
where
as advocated by [32,35], who then considered realizations from this quasi-posterior. In the present paper, we will rather focus on a quantity linked to the mode of this quasi-posterior. Indeed, the mode of the quasi-posterior
${\widehat{\rho}}_{t+1}$ is
where
(i) is a cumulative loss term,
(ii) is a term controlling the variance of the prediction
$\mathbf{f}$ to past predictions
${\widehat{\mathbf{f}}}_{s},s\le t$, and
(iii) can be regarded as a penalty function on the complexity of
$\mathbf{f}$ if
$\pi $ is well chosen. This mode hence has a similar flavor to follow-the-best-expert or follow-the-perturbed-leader in the setting of prediction with experts (see [22,36], Chapters 3 and 4) if we consider each
$\mathbf{f}\in {\mathcal{F}}_{p}$ as an expert which always delivers constant advice. These remarks yield Algorithm 1.
Algorithm 1 Sequentially learning principal curves.
1: Input parameters: $p>0$, $\eta >0$, $\pi \left(z\right)={\mathrm{e}}^{-z}{\mathbb{1}}_{\{z>0\}}$ and penalty function $h:{\mathcal{F}}_{p}\to {\mathbb{R}}^{+}$
2: Initialization: For each $\mathbf{f}\in {\mathcal{F}}_{p}$, draw ${z}_{\mathbf{f}}\sim \pi $ and set ${\Delta}_{\mathbf{f},0}=\frac{1}{\eta}(h\left(\mathbf{f}\right)-{z}_{\mathbf{f}})$
3: For $t=1,\dots ,T$
4: Get the data ${x}_{t}$
5: Obtain ${\widehat{\mathbf{f}}}_{t+1}$ as the minimizer over ${\mathcal{F}}_{p}$ of the cumulative perturbed loss ${\sum}_{s=0}^{t}{\Delta}_{\mathbf{f},s}$, where ${\Delta}_{\mathbf{f},s}=\Delta (\mathbf{f},{x}_{s})$, $s\ge 1$.
6: End for
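The steps of Algorithm 1 can be sketched in Python as follows (a loose illustration with our own names; experts are abstract hashable objects, the perturbations are Exp(1) draws as prescribed by $\pi (z)={\mathrm{e}}^{-z}{\mathbb{1}}_{\{z>0\}}$, and the prediction is the mode, i.e., the minimizer of the cumulative perturbed loss):

```python
import random

def algorithm1(experts, loss_fn, data, eta, h):
    """Follow-the-perturbed-leader sketch of Algorithm 1 over a finite expert set."""
    random.seed(0)                                     # fixed seed for reproducibility
    z = {f: random.expovariate(1.0) for f in experts}  # z_f ~ pi, i.e., Exp(1)
    cum = {f: (h(f) - z[f]) / eta for f in experts}    # Delta_{f,0} = (h(f) - z_f) / eta
    predictions = []
    for x in data:
        f_hat = min(cum, key=cum.get)                  # mode of the quasi-posterior
        predictions.append(f_hat)
        for f in experts:
            cum[f] += loss_fn(f, x)                    # accumulate Delta_{f,t}
    return predictions
```

For instance, with two constant experts 0 and 1, the squared loss, and a stream of ones, the procedure quickly settles on expert 1.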
3. Regret Bounds for Sequential Learning of Principal Curves
We now present our main theoretical results.
Theorem 1. For any sequence ${\left({x}_{t}\right)}_{1:T}\in B(0,\sqrt{d}R)$, $R\ge 0$, and any penalty function $h:{\mathcal{F}}_{p}\to {\mathbb{R}}^{+}$, let $\pi \left(z\right)={\mathrm{e}}^{-z}{\mathbb{1}}_{\{z>0\}}$ and let $0<\eta \le \frac{1}{d{(2R+\delta )}^{2}}$. Then the procedure described in Algorithm 1 satisfies the regret bound below, where ${c}_{0}=d{(2R+\delta )}^{2}$ and ${S}_{T,h,\eta}$ denotes the penalized cumulative loss term.

The expectation of the cumulative loss of the polygonal lines
${\widehat{\mathbf{f}}}_{1},\dots ,{\widehat{\mathbf{f}}}_{T}$ is upper-bounded by the smallest penalized cumulative loss over all
$k\in \{1,\dots ,p\}$ up to a multiplicative term
$(1+{c}_{0}(\mathrm{e}-1)\eta )$, which can be made arbitrarily close to 1 by choosing a small enough
$\eta $. However, this will lead to both a large
$h\left(\mathbf{f}\right)/\eta $ in
${S}_{T,h,\eta}$ and a large
$\frac{1}{\eta}(1+\mathrm{ln}{\sum}_{\mathbf{f}\in {\mathcal{F}}_{p}}{\mathrm{e}}^{-h\left(\mathbf{f}\right)})$. In addition, another important issue is the choice of the penalty function
h. For each
$\mathbf{f}\in {\mathcal{F}}_{p}$,
$h\left(\mathbf{f}\right)$ should be large enough to ensure a small
${\sum}_{\mathbf{f}\in {\mathcal{F}}_{p}}{\mathrm{e}}^{-h\left(\mathbf{f}\right)}$, but not too large to avoid over-penalization and a larger value for
${S}_{T,h,\eta}$. We therefore set
for each
$\mathbf{f}$ with
k segments (where
$\left|M\right|$ denotes the cardinality of a set
M) since it leads to
The penalty function
$h\left(\mathbf{f}\right)={c}_{1}\mathcal{K}\left(\mathbf{f}\right)+{c}_{2}L+{c}_{3}$ satisfies (3), where
${c}_{1},{c}_{2},{c}_{3}$ are constants depending on
R,
d,
$\delta $,
p (this is proven in Lemma 3, in
Section 6). We therefore obtain the following corollary.
Corollary 1. Under the assumptions of Theorem 1, let
$$\eta =\frac{\sqrt{{c}_{1}p+{c}_{2}L+{c}_{3}}}{{c}_{0}\sqrt{(\mathrm{e}-1)T}}.$$
Then the procedure described in Algorithm 1 satisfies a regret bound in which ${r}_{T,k,L}={inf}_{\mathbf{f}\in {\mathcal{F}}_{p}}{\sum}_{t=1}^{T}\Delta (\mathbf{f},{x}_{t})+({c}_{1}k+{c}_{2}L+{c}_{3})$. Proof. Note that
and we conclude by setting
□
Sadly, Corollary 1 is not of much practical use since the optimal value for
$\eta $ depends on
${inf}_{\mathbf{f}\in {\mathcal{F}}_{p}}{\sum}_{t=1}^{T}\Delta (\mathbf{f},{x}_{t})$ which is obviously unknown, even more so at time
$t=0$. We therefore provide an adaptive refinement of Algorithm 1 in the following Algorithm 2.
Algorithm 2 Sequentially and adaptively learning principal curves.
1: Input parameters: $p>0$, $L>0$, $\pi $, h and ${\eta}_{0}=\frac{\sqrt{{c}_{1}p+{c}_{2}L+{c}_{3}}}{{c}_{0}\sqrt{\mathrm{e}-1}}$
2: Initialization: For each $\mathbf{f}\in {\mathcal{F}}_{p}$, draw ${z}_{\mathbf{f}}\sim \pi $, set ${\Delta}_{\mathbf{f},0}=\frac{1}{{\eta}_{0}}(h\left(\mathbf{f}\right)-{z}_{\mathbf{f}})$ and ${\widehat{\mathbf{f}}}_{0}=\underset{\mathbf{f}\in {\mathcal{F}}_{p}}{arginf}\phantom{\rule{4pt}{0ex}}{\Delta}_{\mathbf{f},0}$
3: For $t=1,\dots ,T$
4: Compute ${\eta}_{t}=\frac{\sqrt{{c}_{1}p+{c}_{2}L+{c}_{3}}}{{c}_{0}\sqrt{(\mathrm{e}-1)t}}$
5: Get data ${x}_{t}$ and compute ${\Delta}_{\mathbf{f},t}=\Delta (\mathbf{f},{x}_{t})+\left(\frac{1}{{\eta}_{t}}-\frac{1}{{\eta}_{t-1}}\right)\left(h\left(\mathbf{f}\right)-{z}_{\mathbf{f}}\right)$
6: Obtain ${\widehat{\mathbf{f}}}_{t+1}$ as the minimizer over ${\mathcal{F}}_{p}$ of the cumulative perturbed loss ${\sum}_{s=0}^{t}{\Delta}_{\mathbf{f},s}$
7: End for
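The adaptive step size of Algorithm 2 is explicit; a small helper (hypothetical naming, not part of the original R implementation) showing that ${\eta}_{t}$ decays like ${t}^{-1/2}$:

```python
import math

def eta_t(t, c0, c1, c2, c3, p, L):
    """Adaptive step size of Algorithm 2:
    eta_t = sqrt(c1*p + c2*L + c3) / (c0 * sqrt((e - 1) * t))."""
    return math.sqrt(c1 * p + c2 * L + c3) / (c0 * math.sqrt((math.e - 1.0) * t))
```

Quadrupling $t$ halves ${\eta}_{t}$, which is what makes the procedure horizon-free.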
Theorem 2. For any sequence ${\left({x}_{t}\right)}_{1:T}\in B(0,\sqrt{d}R)$, $R\ge 0$, let $h\left(\mathbf{f}\right)={c}_{1}\mathcal{K}\left(\mathbf{f}\right)+{c}_{2}L+{c}_{3}$, where ${c}_{1}$, ${c}_{2}$, ${c}_{3}$ are constants depending on $R,d,\delta ,\mathrm{ln}p$. Let $\pi \left(z\right)={\mathrm{e}}^{-z}{\mathbb{1}}_{\{z>0\}}$ and
$${\eta}_{t}=\frac{\sqrt{{c}_{1}p+{c}_{2}L+{c}_{3}}}{{c}_{0}\sqrt{(\mathrm{e}-1)t}},$$
where $t\ge 1$ and ${c}_{0}=d{(2R+\delta )}^{2}$. Then the procedure described in Algorithm 2 satisfies the regret bound below.

The message of this regret bound is that the expected cumulative loss of the polygonal lines
${\widehat{\mathbf{f}}}_{1},\dots ,{\widehat{\mathbf{f}}}_{T}$ is upperbounded by the minimal cumulative loss over all
$k\in \{1,\dots ,p\}$, up to an additive term which is sublinear in
T. The actual magnitude of this remainder term is
$\sqrt{kT}$. When
L is fixed, the number
k of segments is a measure of complexity of the retained polygonal line. This bound therefore yields the same magnitude as (1), which is the most refined bound in the literature so far ([18], where the optimal values for
k and
L were obtained in a model selection fashion).
4. Implementation
The argument of the infimum in Algorithm 2 is taken over
${\mathcal{F}}_{p}={\cup}_{k=1}^{p}{\mathcal{F}}_{k,L}$, which has a cardinality of order
${\left|{\mathcal{Q}}_{\delta}\right|}^{p}$, making any greedy search largely time-consuming. We instead turn to the following strategy: given a polygonal line
${\widehat{\mathbf{f}}}_{t}\in {\mathcal{F}}_{{k}_{t},L}$ with
${k}_{t}$ segments, we consider, with a certain proportion, the availability of
${\widehat{\mathbf{f}}}_{t+1}$ within a neighborhood
$\mathcal{U}\left({\widehat{\mathbf{f}}}_{t}\right)$ (see the formal definition below) of
${\widehat{\mathbf{f}}}_{t}$. This consideration is well suited for the principal curves setting, since if observation
${x}_{t}$ is close to
${\widehat{\mathbf{f}}}_{t}$, one can expect that the polygonal line which well fits observations
${x}_{s},s=1,\dots ,t$ lies in a neighborhood of
${\widehat{\mathbf{f}}}_{t}$. In addition, if each polygonal line
$\mathbf{f}$ is regarded as an action, we no longer assume that all actions are available at all times, and allow the set of available actions to vary at each time. This is a model known as “sleeping experts (or actions)” in prior work [
37,
38]. In this setting, defining the regret with respect to the best action in the whole set of actions in hindsight remains difficult, since that action might sometimes be unavailable. Hence, it is natural to define the regret with respect to the best ranking of all actions in hindsight according to their losses or rewards; at each round, one chooses among the available actions the one that ranks highest. Ref. [
38] introduced this notion of regret and studied both the fullinformation (best action) and partialinformation (multiarmed bandit) settings with stochastic and adversarial rewards and adversarial action availability. They pointed out that the
EXP4 algorithm [
37] attains the optimal regret in the adversarial rewards case but has a runtime exponential in the number of all actions. Ref. [
39] considered full and partial information with stochastic action availability and proposed an algorithm that runs in polynomial time. In what follows, we materialize our implementation by resorting to “sleeping experts”, i.e., a special set of available actions that adapts to the setting of principal curves.
Let
$\sigma $ denote an ordering of
the actions in ${\mathcal{F}}_{p}$, and
${\mathcal{A}}_{t}$ a subset of the available actions at round
t. We let
$\sigma \left({\mathcal{A}}_{t}\right)$ denote the highest ranked action in
${\mathcal{A}}_{t}$. In addition, for any action
$\mathbf{f}\in {\mathcal{F}}_{p}$ we define the reward
${r}_{\mathbf{f},t}$ of
$\mathbf{f}$ at round
$t,t\ge 0$ by
It is clear that
${r}_{\mathbf{f},t}\in (0,{c}_{0})$. The conversion from losses to gains is made in order to facilitate the subsequent performance analysis. The reward of an ordering
$\sigma $ is the cumulative reward of the selected action at each time:
and the reward of the best ordering is
${max}_{\sigma}{\sum}_{t=0}^{T}{r}_{\sigma \left({\mathcal{A}}_{t}\right),t}$ (respectively,
$\mathbb{E}\left[{max}_{\sigma}{\sum}_{t=1}^{T}{r}_{\sigma \left({\mathcal{A}}_{t}\right),t}\right]$ when
${\mathcal{A}}_{t}$ is stochastic).
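In this model, an ordering acts by playing its highest-ranked available action at each round; a compact sketch of $\sigma \left({\mathcal{A}}_{t}\right)$ and of the cumulative reward of an ordering (our own naming, for illustration):

```python
def select(ordering, available):
    """sigma(A_t): the highest-ranked action of `ordering` lying in `available`."""
    for action in ordering:
        if action in available:
            return action
    raise ValueError("no available action")

def ordering_reward(ordering, availability, rewards):
    """Cumulative reward of an ordering: at round t it earns the reward of its
    highest-ranked action among the available set A_t."""
    return sum(rewards[t][select(ordering, availability[t])]
               for t in range(len(availability)))
```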
Our procedure starts with a partition step which aims at identifying the “relevant” neighborhood of an observation $x\in {\mathbb{R}}^{d}$ with respect to a given polygonal line, and then proceeds with the definition of the neighborhood of an action $\mathbf{f}$. We then provide the full implementation and prove a regret bound.
Partition. For any polygonal line
$\mathbf{f}$ with
k segments, we denote by
$\stackrel{\rightharpoonup}{\mathbf{V}}=\left({v}_{1},\dots ,{v}_{k+1}\right)$ its vertices and by
${s}_{i},i=1,\dots ,k$ the line segments connecting
${v}_{i}$ and
${v}_{i+1}$. In the sequel, we use
$\mathbf{f}\left(\stackrel{\rightharpoonup}{\mathbf{V}}\right)$ to represent the polygonal line formed by connecting consecutive vertices in
$\stackrel{\rightharpoonup}{\mathbf{V}}$ if no confusion arises. Let
${V}_{i},i=1,\dots ,k+1$ and
${S}_{i},i=1,\dots ,k$ be the Voronoi partitions of
${\mathbb{R}}^{d}$ with respect to
$\mathbf{f}$, i.e., regions consisting of all points closer to vertex
${v}_{i}$ or segment
${s}_{i}$.
Figure 5 shows an example of Voronoi partition with respect to
$\mathbf{f}$ with three segments.
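The Voronoi-type assignment (is $x$ closer to a vertex ${v}_{i}$, or to the interior of a segment ${s}_{i}$?) can be sketched in Python as follows (an illustrative re-implementation, not the paper's code):

```python
import math

def voronoi_region(vertices, x):
    """Return ('V', i) if x is closest to vertex v_i, or ('S', i) if x is
    closest to the interior of segment s_i (indices are 1-based as in the text)."""
    best = None
    for i, v in enumerate(vertices, start=1):
        d2 = math.dist(v, x) ** 2
        if best is None or d2 < best[0]:
            best = (d2, ('V', i))
    for i, (a, b) in enumerate(zip(vertices, vertices[1:]), start=1):
        ab = [bj - aj for aj, bj in zip(a, b)]
        ax = [xj - aj for aj, xj in zip(a, x)]
        denom = sum(c * c for c in ab)
        t = sum(u * v for u, v in zip(ab, ax)) / denom if denom else 0.0
        if 0.0 < t < 1.0:  # projection falls strictly inside the segment
            proj = [aj + t * c for aj, c in zip(a, ab)]
            d2 = sum((xj - pj) ** 2 for xj, pj in zip(x, proj))
            if d2 < best[0]:
                best = (d2, ('S', i))
    return best[1]
```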
Neighborhood. For any
$x\in {\mathbb{R}}^{d}$, we define the neighborhood
$\mathcal{N}\left(x\right)$ with respect to
$\mathbf{f}$ as the union of all Voronoi partitions whose closure intersects with two vertices connecting the projection
$\mathbf{f}\left({s}_{\mathbf{f}}\left(x\right)\right)$ of
x to
$\mathbf{f}$. For example, for the point
x in
Figure 5, its neighborhood
$\mathcal{N}\left(x\right)$ is the union of
${S}_{2},{V}_{3},{S}_{3}$ and
${V}_{4}$. In addition, let
${\mathcal{N}}_{t}\left(x\right)=\left\{{x}_{s}\in \mathcal{N}\left(x\right),s=1,\dots ,t\right\}$ be the set of observations
${x}_{1:t}$ belonging to
$\mathcal{N}\left(x\right)$ and
${\overline{\mathcal{N}}}_{t}\left(x\right)$ be its average. Let
$\mathcal{D}\left(M\right)={sup}_{x,y\in M}{\Vert x-y\Vert}_{2}$ denote the diameter of set
$M\subset {\mathbb{R}}^{d}$. We finally define the local grid
${\mathcal{Q}}_{\delta ,t}\left(x\right)$ of
$x\in {\mathbb{R}}^{d}$ at time
t as
We can finally proceed to the definition of the neighborhood
$\mathcal{U}\left({\widehat{\mathbf{f}}}_{t}\right)$ of
${\widehat{\mathbf{f}}}_{t}$. Assume
${\widehat{\mathbf{f}}}_{t}$ has
${k}_{t}+1$ vertices
$\stackrel{\rightharpoonup}{\mathbf{V}}=(\underbrace{{v}_{1:({i}_{t}-1)}}_{\left(i\right)},\underbrace{{v}_{{i}_{t}:({j}_{t}-1)}}_{\left(ii\right)},\underbrace{{v}_{{j}_{t}:({k}_{t}+1)}}_{\left(iii\right)})$, where vertices of
$\left(ii\right)$ belong to
${\mathcal{Q}}_{\delta ,t}\left({x}_{t}\right)$ while those of
$\left(i\right)$ and
$\left(iii\right)$ do not. The neighborhood
$\mathcal{U}\left({\widehat{\mathbf{f}}}_{t}\right)$ consists of
$\mathbf{f}$ sharing vertices
$\left(i\right)$ and
$\left(iii\right)$ with
${\widehat{\mathbf{f}}}_{t}$, but can be equipped with different vertices
$\left(ii\right)$ in
${\mathcal{Q}}_{\delta ,t}\left({x}_{t}\right)$; i.e.,
where
${v}_{1:m}\in {\mathcal{Q}}_{\delta ,t}\left({x}_{t}\right)$ and
m is given by
In Algorithm 3, we initiate the principal curve
${\widehat{\mathbf{f}}}_{1}$ as the first component line segment whose vertices are the two farthest projections of data
${x}_{1:{t}_{0}}$ (
${t}_{0}$ can be set to 20 in practice) on the first component line. The reward of
$\mathbf{f}$ at round
t in this setting is therefore
${r}_{\mathbf{f},t}={c}_{0}-\Delta (\mathbf{f},{x}_{{t}_{0}+t})$. Algorithm 3 has an exploration phase (when
${I}_{t}=1$) and an exploitation phase (
${I}_{t}=0$). In the exploration phase, the algorithm observes the rewards of all actions and chooses an optimal perturbed action from the set
${\mathcal{F}}_{p}$ of all actions. In the exploitation phase, only the rewards of a subset of actions can be accessed, the rewards of the others being estimated by a constant, and we update our action within the neighborhood
$\mathcal{U}\left({\widehat{\mathbf{f}}}_{t-1}\right)$ of the previous action
${\widehat{\mathbf{f}}}_{t-1}$. This local update (or search) greatly reduces the computational complexity since
$\left|\mathcal{U}\left({\widehat{\mathbf{f}}}_{t-1}\right)\right|\ll \left|{\mathcal{F}}_{p}\right|$ when
p is large. In addition, this local search will be enough to account for the case when
${x}_{t}$ falls in
$\mathcal{U}\left({\widehat{\mathbf{f}}}_{t-1}\right)$. The parameter
$\beta $ needs to be carefully calibrated: it should not be too large, so as to ensure that the set
$cond\left(t\right)$ is nonempty; otherwise, all rewards are estimated by the same constant and thus lead to the same descending ordering of tuples for both
$\left({\sum}_{s=1}^{t-1}{\widehat{r}}_{\mathbf{f},s},\mathbf{f}\in {\mathcal{F}}_{p}\right)$ and
$\left({\sum}_{s=1}^{t}{\widehat{r}}_{\mathbf{f},s},\mathbf{f}\in {\mathcal{F}}_{p}\right)$. Therefore, we may face the risk of having
${\widehat{\mathbf{f}}}_{t+1}$ in the neighborhood of
${\widehat{\mathbf{f}}}_{t}$ even if we are in the exploration phase at time
$t+1$. Conversely, a very small
$\beta $ could result in a large bias for the estimate
$\frac{{r}_{\mathbf{f},t}}{\mathbb{P}\left({\widehat{\mathbf{f}}}_{t}=\mathbf{f}\mid {\mathcal{H}}_{t}\right)}$ of
${r}_{\mathbf{f},t}$. Note that the exploitation phase is close to, yet different from, label efficient prediction ([
40], Remark 1.1) since we allow an action at time
t to be different from the previous one. Ref. [
41] proposed the
geometric resampling method to estimate the conditional probability
$\mathbb{P}\left({\widehat{\mathbf{f}}}_{t}=\mathbf{f}\mid {\mathcal{H}}_{t}\right)$ since this quantity often does not have an explicit form. However, due to the simple exponential distribution of
${z}_{\mathbf{f}}$ chosen in our case, an explicit form of
$\mathbb{P}\left({\widehat{\mathbf{f}}}_{t}=\mathbf{f}\mid {\mathcal{H}}_{t}\right)$ is straightforward.
Algorithm 3 A locally greedy algorithm for sequentially learning principal curves.
1: Input parameters: $p>0$, $R>0$, $L>0$, $\epsilon >0$, $\alpha >0$, $1>\beta >0$ and any penalty function h
2: Initialization: Given ${\left({x}_{t}\right)}_{1:{t}_{0}}$, obtain ${\widehat{\mathbf{f}}}_{1}$ as the first principal component
3: For $t=2,\dots ,T$
4: Draw ${I}_{t}\sim \mathrm{Bernoulli}\left(\epsilon \right)$ and ${z}_{\mathbf{f}}\sim \pi $
5: Compute the ordering ${\widehat{\sigma}}^{t}$, i.e., sort all $\mathbf{f}\in {\mathcal{F}}_{p}$ in descending order according to their perturbed cumulative reward up to $t-1$
6: If ${I}_{t}=1$, set ${\mathcal{A}}_{t}={\mathcal{F}}_{p}$ and ${\widehat{\mathbf{f}}}_{t}={\widehat{\sigma}}^{t}\left({\mathcal{A}}_{t}\right)$ and observe ${r}_{{\widehat{\mathbf{f}}}_{t},t}$
7: If ${I}_{t}=0$, set ${\mathcal{A}}_{t}=\mathcal{U}\left({\widehat{\mathbf{f}}}_{t-1}\right)$, ${\widehat{\mathbf{f}}}_{t}={\widehat{\sigma}}^{t}\left({\mathcal{A}}_{t}\right)$ and observe ${r}_{{\widehat{\mathbf{f}}}_{t},t}$
8: Update the estimated rewards ${\widehat{r}}_{\mathbf{f},t}$, where ${\mathcal{H}}_{t}$ denotes all the randomness before time t and $cond\left(t\right)=\left\{\mathbf{f}\in {\mathcal{F}}_{p}:\mathbb{P}\left({\widehat{\mathbf{f}}}_{t}=\mathbf{f}\mid {\mathcal{H}}_{t}\right)>\beta \right\}$. In particular, when $t=1$, we set ${\widehat{r}}_{\mathbf{f},1}={r}_{\mathbf{f},1}$ for all $\mathbf{f}\in {\mathcal{F}}_{p}$, $\mathcal{U}\left({\widehat{\mathbf{f}}}_{0}\right)=\varnothing $ and ${\widehat{r}}_{{\widehat{\sigma}}^{1}\left(\mathcal{U}\left({\widehat{\mathbf{f}}}_{0}\right)\right),1}\equiv 0$
9: End for
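A heavily simplified Python sketch of the skeleton of Algorithm 3: it keeps the Bernoulli($\epsilon$) switch between exploring the whole action set and exploiting a neighborhood of the previous action, but omits the importance-weighted reward estimates governed by $\alpha$ and $\beta$ (all names are ours; this is not the paper's R implementation):

```python
import random

def algorithm3_sketch(actions, neighborhood, reward, T, eps, seed=0):
    """Exploration/exploitation skeleton: with probability eps choose among all
    actions, otherwise only among the neighborhood of the previous action."""
    rng = random.Random(seed)
    z = {a: rng.expovariate(1.0) for a in actions}   # exponential perturbations z_f
    est = dict(z)                                    # perturbed cumulative rewards
    previous = actions[0]
    history = []
    for t in range(T):
        if rng.random() < eps:                       # exploration: I_t = 1
            available = list(actions)
        else:                                        # exploitation: I_t = 0, local search
            available = neighborhood(previous)
        previous = max(available, key=lambda a: est[a])  # highest-ranked available action
        history.append(previous)
        for a in available:                          # observed rewards update estimates
            est[a] += reward(a, t)
    return history
```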
Theorem 3. Assume that $p>6$, $T\ge 2{\left|{\mathcal{F}}_{p}\right|}^{2}$, and let $\beta ={\left|{\mathcal{F}}_{p}\right|}^{-\frac{1}{2}}{T}^{-\frac{1}{4}}$, $\alpha =\frac{{c}_{0}}{\beta}$, ${\widehat{c}}_{0}=\frac{2{c}_{0}}{\beta}$ and $\epsilon =1-{\left|{\mathcal{F}}_{p}\right|}^{\frac{1}{2}-\frac{3}{p}}{T}^{-\frac{1}{4}}$. Then the procedure described in Algorithm 3 satisfies the regret bound below. The proof of Theorem 3 is presented in
Section 6. The regret is upper bounded by a term of order
$\mathcal{O}\left({\left|{\mathcal{F}}_{p}\right|}^{\frac{1}{2}}{T}^{\frac{3}{4}}\right)$, sublinear in
T. The term
$(1-\epsilon ){c}_{0}T={c}_{0}{\left|{\mathcal{F}}_{p}\right|}^{\frac{1}{2}}{T}^{\frac{3}{4}}$ is the price to pay for the local search (with a proportion
$1-\epsilon $) of polygonal line
${\widehat{\mathbf{f}}}_{t}$ in the neighborhood of the previous
${\widehat{\mathbf{f}}}_{t-1}$. If
$\epsilon =1$, we would have that
${\widehat{c}}_{0}={c}_{0}$, and the last two terms in the first inequality of Theorem 3 would vanish; hence, the upper bound reduces to that of Theorem 2. In addition, our algorithm achieves an order that is smaller (with respect to both the number
$\left|{\mathcal{F}}_{p}\right|$ of all actions and the total number of rounds
T) than that of [39], since at each time the set of available actions for our algorithm can be either the whole action set or a neighborhood of the previous action, while [39] considers at each time only a partial and independent stochastic set of available actions generated from a predefined distribution.
5. Numerical Experiments
We illustrate the performance of Algorithm 3 on synthetic and reallife data. Our implementation (hereafter denoted by
slpc, for Sequential Learning of Principal Curves) is conducted in the R language, and thus our most natural competitors are the R package
princurve, which implements the algorithm from [10], and
incremental SCMS, which implements the algorithm from [23]. We let
$p=50$,
$R={max}_{t=1,\dots ,T}{\Vert {x}_{t}\Vert}_{2}/\sqrt{d}$,
$L=0.1p\sqrt{d}R$. The spacing
$\delta $ of the lattice is adjusted with respect to data scale.
Synthetic data We generate a dataset
$\left\{{x}_{t}\in {\mathbb{R}}^{2},t=1,\dots ,500\right\}$ uniformly along the curve
$y=0.05\times {(x-5)}^{3}$,
$x\in [0,10]$.
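A minimal generator for this synthetic dataset (sampling uniformly in the $x$ coordinate is a simplifying assumption; the seed and function name are ours):

```python
import random

def synthetic_data(T=500, seed=0):
    """T points on the curve y = 0.05 * (x - 5)^3, with x drawn uniformly in [0, 10]."""
    rng = random.Random(seed)
    data = []
    for _ in range(T):
        x = rng.uniform(0.0, 10.0)
        data.append((x, 0.05 * (x - 5.0) ** 3))
    return data
```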
Table 1 shows the regret (first row) for
the ground truth (sum of squared distances of all points to the true curve),
princurve and incremental SCMS (sum of squared distances between observation ${x}_{t+1}$ and the curve fitted on observations ${x}_{1:t}$),
slpc (regret being equal to ${\sum}_{t=0}^{T-1}\mathbb{E}[\Delta ({\widehat{\mathbf{f}}}_{t+1},{x}_{t+1})]$ in both cases).
The mean computation times for different values of the time horizon T are also reported.
Table 1 demonstrates the advantages of our method
slpc, as it achieved the best trade-off between performance (in terms of regret) and runtime. Although
princurve outperformed the other two algorithms in terms of computation time, it yielded the largest regret, since it outputs a curve which does not pass through “the middle of the data” but rather bends towards the curvature of the data cloud, as shown in
Figure 6 where the predicted principal curves
${\widehat{\mathbf{f}}}_{t+1}$ for
princurve,
incremental SCMS and
slpc are presented.
incremental SCMS and
slpc both yielded satisfactory results, although the mean computation time of
slpc was significantly smaller than that of
incremental SCMS (the reason being that the eigenvectors of the Hessian of the estimated density need to be computed in
incremental SCMS).
Figure 7 shows, respectively, the estimated regret of
slpc and its per-round value (i.e., the cumulative loss divided by the number of rounds), both with respect to the round
t. The jumps in the per-round curve occur at the beginning, due to the initialization from a first principal component and to the collection of new data. As data accumulate, the vanishing pattern of the per-round curve illustrates that the regret is sublinear in
t, which matches our aforementioned theoretical results.
In addition, to better illustrate the way
slpc works between two epochs,
Figure 8 focuses on the impact of collecting a new data point on the principal curve. We see that only a local vertex is impacted, whereas the rest of the principal curve remains unaltered. This cut in algorithmic complexity is one of the key assets of
slpc.
Synthetic data in high dimension. We also apply our algorithm to a dataset
$\{{x}_{t}\in {\mathbb{R}}^{6},$$t=1,2,\dots ,200\}$ in higher dimension. It is generated uniformly along a parametric curve whose coordinates are
where
t takes 100 equidistant values in
$[0,2\pi ]$. To the best of our knowledge, [
10,
16,
18] only tested their algorithm on 2dimensional data. This example aims at illustrating that our algorithm also works on higher dimensional data.
Table 2 shows the regret for the ground truth,
princurve and
slpc.
In addition,
Figure 9 shows the behaviour of
slpc (green) on each dimension.
Seismic data. Seismic data spanning long periods of time are essential for a thorough understanding of earthquakes. The “Centennial Earthquake Catalog” [
42] aims at providing a realistic picture of the seismicity distribution on Earth. It consists of a global catalog of locations and magnitudes of instrumentally recorded earthquakes from 1900 to 2008. We focus on a particularly representative seismically active zone (a lithospheric border close to Australia) whose longitude ranges from E${130}^{\circ}$ to E${180}^{\circ}$ and whose latitude ranges from S${70}^{\circ}$ to N${30}^{\circ}$, with
$T=218$ seismic recordings. As shown in
Figure 10,
slpc recovers the tectonic plate boundary nicely, whereas both
princurve and
incremental SCMS with a well-calibrated bandwidth fail to do so.
Lastly, since no ground truth is available, we used the ${R}^{2}$ coefficient to assess the performance (residuals are replaced by the squared distance between data points and their projections onto the principal curve). The average over 10 trials was 0.990.
Back to seismic data. Figure 11 was taken from the USGS website (
https://earthquake.usgs.gov/data/centennial/) and gives the global locations of earthquakes for the period 1900–1999. The seismic data (latitude, longitude, magnitude of earthquakes, etc.) used in the present paper may be downloaded from this website.
Daily Commute Data. The identification of segments of personal daily commuting trajectories can help taxi or bus companies to optimize their fleets and increase frequencies on segments with high commuting activity. Sequential principal curves appear to be an ideal tool to address this learning problem: we tested our algorithm on trajectory data from the University of Illinois at Chicago (
https://www.cs.uic.edu/~boxu/mp2p/gps_data.html). The data were obtained from the GPS reading systems carried by two of the laboratory members during their daily commute for 6 months in Cook County and DuPage County, Illinois.
Figure 12 presents the learning curves yielded by
princurve and
slpc on geolocalization data for the first person, on May 30. A particularly remarkable asset of
slpc is that abrupt changes of curvature in the data sequence were perfectly captured, whereas
princurve does not enjoy the same flexibility. Again, we used the
${R}^{2}$ coefficient to assess the performance (where residuals are replaced by the squared distances between data points and their projections onto the principal curve). The average over 10 trials was 0.998.
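The modified ${R}^{2}$ used above replaces the residuals by squared distances to the fitted curve; a short sketch under our own naming, where `proj_dist2` maps a data point to its squared projection distance onto the fitted principal curve:

```python
def r_squared(points, proj_dist2):
    """Modified R^2: 1 - RSS/TSS, where RSS sums the squared distances from
    the data points to their projections onto the fitted principal curve."""
    d = len(points[0])
    mean = [sum(p[i] for p in points) / len(points) for i in range(d)]
    tss = sum(sum((p[i] - mean[i]) ** 2 for i in range(d)) for p in points)
    rss = sum(proj_dist2(p) for p in points)
    return 1.0 - rss / tss
```

A curve passing exactly through every point yields an ${R}^{2}$ of 1.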
6. Proofs
This section contains the proof of Theorem 2 (note that Theorem 1 is a straightforward consequence, with
${\eta}_{t}=\eta $,
$t=0,\dots ,T$) and the proof of Theorem 3 (which involves intermediary lemmas). Let us first define for each
$t=0,\dots ,T$ the following forecaster sequence
${\left({\widehat{\mathbf{f}}}_{t}^{\star}\right)}_{t}$
Note that
${\widehat{\mathbf{f}}}_{t}^{\star}$ is an “illegal” forecaster since it peeks into the future. In addition, denote by
the polygonal line in
${\mathcal{F}}_{p}$ which minimizes the cumulative loss in the first
T rounds plus a penalty term.
${\mathbf{f}}^{\star}$ is deterministic, and
${\widehat{\mathbf{f}}}_{t}^{\star}$ is a random quantity (since it depends on
${z}_{\mathbf{f}}$,
$\mathbf{f}\in {\mathcal{F}}_{p}$ drawn from
$\pi $). If several
$\mathbf{f}$ attain the infimum, we choose
${\mathbf{f}}_{T}^{\star}$ as the one having the smallest complexity. We now state the first (out of three) intermediary technical results.
Lemma 1. For any sequence ${x}_{1},\dots ,{x}_{T}$ in $B(0,\sqrt{d}R)$, inequality (5) holds. Proof. We proceed by induction on
T. Clearly, (5) holds for
$T=0$. Assume that (5) holds for
$T-1$:
Adding
${\Delta}_{{\widehat{\mathbf{f}}}_{T}^{\star},T}$ to both sides of the above inequality concludes the proof. □
By (5) and the definition of
${\widehat{\mathbf{f}}}_{T}^{\star}$, for
$k\ge 1$, we have
$\pi $-almost surely that
where
$1/{\eta}_{-1}=0$ by convention. The second and third inequalities are due, respectively, to the definitions of
${\widehat{\mathbf{f}}}_{T}^{\star}$ and
${\mathbf{f}}_{T}^{\star}$. Hence
where the second inequality is due to
$\mathbb{E}\left[{Z}_{{\mathbf{f}}_{T}^{\star}}\right]=0$ and
$\left(\frac{1}{{\eta}_{t}}-\frac{1}{{\eta}_{t-1}}\right)>0$ for
$t=0,1,\dots ,T$ since
${\eta}_{t}$ is decreasing in
t in Theorem 2. In addition, for
$y\ge 0$, one has
Hence, for any
$y\ge 0$
where
$u={\sum}_{\mathbf{f}\in {\mathcal{F}}_{p}}{\mathrm{e}}^{-h\left(\mathbf{f}\right)}$. Therefore, we have
We thus obtain
Next, we control the regret of Algorithm 2.
Lemma 2. Assume that ${z}_{\mathbf{f}}$ is sampled from the exponential distribution, i.e., $\pi \left(z\right)={\mathrm{e}}^{-z}{\mathbb{1}}_{\{z>0\}}$. Assume that ${sup}_{t=1,\dots ,T}{\eta}_{t-1}\le \frac{1}{d{(2R+\delta )}^{2}}$, and define ${c}_{0}=d{(2R+\delta )}^{2}$. Then for any sequence $\left({x}_{t}\right)\in B(0,\sqrt{d}R)$, $t=1,\dots ,T$, the bound below holds. Proof. Let us denote by
the instantaneous loss suffered by the polygonal line
${\widehat{\mathbf{f}}}_{t}$ when
${x}_{t}$ is obtained. We have
where the inequality is due to the fact that
$\Delta (\mathbf{f},x)\le d{(2R+\delta )}^{2}$ holds uniformly for any
$\mathbf{f}\in {\mathcal{F}}_{p}$ and
$x\in B(0,\sqrt{d}R)$. Finally, summing on
t on both sides and using the elementary inequality
${\mathrm{e}}^{x}\le 1+(\mathrm{e}-1)x$ if
$x\in (0,1)$ concludes the proof. □
Lemma 3. For $k\in \{1,\dots ,p\}$, we control the cardinality of the set $\left\{\mathbf{f}\in {\mathcal{F}}_{p},\mathcal{K}\left(\mathbf{f}\right)=k\right\}$ as below, where ${V}_{d}$ denotes the volume of the unit ball in ${\mathbb{R}}^{d}$. Proof. First, let
${N}_{k,\delta}$ denote the set of polygonal lines with
k segments and whose vertices are in
${\mathcal{Q}}_{\delta}$. Notice that
${N}_{k,\delta}$ is different from
$\{\mathbf{f}\in {\mathcal{F}}_{p},\mathcal{K}\left(\mathbf{f}\right)=k\}$ and that
Hence
where the second inequality is a consequence of the elementary inequality
$\left(\genfrac{}{}{0pt}{}{p}{k}\right)\le {\left(\frac{p\mathrm{e}}{k}\right)}^{k}$ combined with Lemma 2 in [16]. □
We now have all the ingredients to prove Theorems 1 and 2.
First, combining (6) and (7) yields that
Assume that
${\eta}_{t}=\eta $,
$t=0,\dots ,T$, and
$h\left(\mathbf{f}\right)={c}_{1}\mathcal{K}\left(\mathbf{f}\right)+{c}_{2}L+{c}_{3}$ for
$\mathbf{f}\in {\mathcal{F}}_{p}$; then
$(-\frac{1}{2}+{\sum}_{\mathbf{f}\in {\mathcal{F}}_{p}}{\mathrm{e}}^{-h\left(\mathbf{f}\right)})\le 0$ and moreover
where
and the second inequality is obtained with Lemma 1. By setting
we obtain
where
${r}_{T,k,L}={inf}_{\mathbf{f}\in {\mathcal{F}}_{p}}{\sum}_{t=1}^{T}\Delta (\mathbf{f},{x}_{t})+({c}_{1}k+{c}_{2}L+{c}_{3})$. This proves Theorem 1.
Finally, assume that
Since
$\mathbb{E}\left[\Delta ({\widehat{\mathbf{f}}}_{t}^{\star},{x}_{t})\right]\le {c}_{0}$ for any
$t=1,\dots ,T$, we have
which concludes the proof of Theorem 2.
Lemma 4. Using Algorithm 3, if $0<\epsilon \le 1$, $0<\beta <1$, $\alpha \ge \frac{(1-\beta ){c}_{0}}{\beta}$ and $\left|\mathcal{U}\left({\widehat{\mathbf{f}}}_{t-1}\right)\right|\ge 2$ for all $t\ge 2$, where $\left|\mathcal{U}\left({\widehat{\mathbf{f}}}_{t-1}\right)\right|$ is the cardinality of $\mathcal{U}\left({\widehat{\mathbf{f}}}_{t-1}\right)$, then we have the following. Proof. First notice that
${\mathcal{A}}_{t}=\mathcal{U}\left({\widehat{\mathbf{f}}}_{t-1}\right)$ if
${I}_{t}=0$, and that for
$t\ge 2$
where
$cond{\left(t\right)}^{c}$ denotes the complement of set
$cond\left(t\right)$. The first inequality above is due to the assumption that for all
$\mathbf{f}\in {\mathcal{A}}_{t}\cap cond\left(t\right)$, we have
$\mathbb{P}\left({\widehat{\sigma}}^{t}\left({\mathcal{A}}_{t}\right)=\mathbf{f}\mid {\mathcal{H}}_{t}\right)\ge \beta $. For
$t=1$, the above inequality is trivial since
${\widehat{r}}_{{\widehat{\sigma}}^{1}\left(\mathcal{U}\left({\widehat{\mathbf{f}}}_{0}\right)\right),1}\equiv 0$ by its definition. Hence, for
$t\ge 1$, one has
Summing both sides of inequality (8) over
t terminates the proof of Lemma 4. □
Lemma 5. Let ${\widehat{c}}_{0}=\frac{{c}_{0}}{\beta}+\alpha $. If $0<{\eta}_{1}={\eta}_{2}=\dots ={\eta}_{T}=\eta <\frac{1}{{\widehat{c}}_{0}}$, then we have the following. Proof. By the definition of
${\widehat{r}}_{\mathbf{f},t}$ in Algorithm 3, for any
$\mathbf{f}\in {\mathcal{F}}_{p}$ and
$t\ge 1$, we have
where in the second inequality we use that
${r}_{\mathbf{f},t}\le {c}_{0}$ for all
$\mathbf{f}$ and
t, and that
$\mathbb{P}\left({\widehat{\mathbf{f}}}_{t}=\mathbf{f}\mid {\mathcal{H}}_{t}\right)\ge \beta $ when
$\mathbf{f}\in \mathcal{U}\left({\widehat{\mathbf{f}}}_{t-1}\right)\cap cond\left(t\right)$. The rest of the proof is similar to those of Lemmas 1 and 2. In fact, if we define
$\widehat{\Delta}\left(\mathbf{f},{x}_{t}\right)={\widehat{c}}_{0}-{\widehat{r}}_{\mathbf{f},t}$, then one can easily observe the following relation when
${I}_{t}=1$ (a similar relation holds in the case
${I}_{t}=0$):
Then applying Lemmas 1 and 2 on this newly defined sequence
$\widehat{\Delta}\left({\widehat{\mathbf{f}}}_{t},{x}_{t}\right)$, $t=1,\dots ,T$, leads to the result of Lemma 5. □
The proof of the upcoming Lemma 6 requires the following submartingale inequality: let
${Y}_{0},\dots ,{Y}_{T}$ be a sequence of random variables adapted to the filtration
${\mathcal{H}}_{0},\dots ,{\mathcal{H}}_{T}$ such that for
$1\le t\le T$, the following three conditions hold:
Then for any
$\lambda >0$,
The proof can be found in Chung and Lu [
43] (Theorem 7.3).
Lemma 6. Assume that $0<\beta <\frac{1}{\left|{\mathcal{F}}_{p}\right|}$, $\alpha \ge \frac{{c}_{0}}{\beta}$ and $\eta >0$; then we have the following. Proof. First, we have almost surely that
Denote by
${Y}_{\mathbf{f},t}={r}_{\mathbf{f},t}-{\widehat{r}}_{\mathbf{f},t}$. Since
and
$\alpha >{c}_{0}\ge {r}_{\mathbf{f},t}$ uniformly for any
$\mathbf{f}$ and
t, we have uniformly that
$\mathbb{E}\left[{Y}_{t}\mid {\mathcal{H}}_{t}\right]\le 0$, satisfying the first condition.
For the second condition, if
$\mathbf{f}\in \mathcal{U}\left({\widehat{\mathbf{f}}}_{t-1}\right)\cap cond\left(t\right)$, then
Similarly, for
$\mathbf{f}\notin \mathcal{U}\left({\widehat{\mathbf{f}}}_{t-1}\right)\cap cond\left(t\right)$, one can have
$\mathrm{Var}\left({Y}_{t}\mid {\mathcal{H}}_{t}\right)\le {\alpha}^{2}$. Moreover, for the third condition, since
then
Setting
$\lambda =\sqrt{2T\left[\frac{{c}_{0}^{2}}{\beta}+{\alpha}^{2}(1-\beta )+{\left({c}_{0}+2\alpha \right)}^{2}\right]\mathrm{ln}\left(\frac{1}{\beta}\right)}$ leads to
Hence the following inequality holds with probability at least
$1-\left|{\mathcal{F}}_{p}\right|\beta $:
Finally, noticing that
${max}_{\mathbf{f}\in {\mathcal{F}}_{p}}{\sum}_{t=1}^{T}\left({r}_{\mathbf{f},t}-{\widehat{r}}_{\mathbf{f},t}\right)\le {c}_{0}T$ almost surely, we terminate the proof of Lemma 6. □
Proof of Theorem 3.
Assume that
$p>6$,
$T\ge 2{\left|{\mathcal{F}}_{p}\right|}^{2}$, and let $\beta $, $\alpha $, ${\widehat{c}}_{0}$, $\epsilon $ and the step size be as in the statement of Theorem 3.
With these values, the assumptions of Lemmas 4, 5 and 6 are satisfied. Combining their results leads to the following:
where the second inequality is due to the fact that the cardinality
$\left|\mathcal{U}\left({\widehat{\mathbf{f}}}_{t-1}\right)\right|$ is upper bounded by
${\left|{\mathcal{F}}_{p}\right|}^{\frac{3}{p}}$ for
$t\ge 1$. In addition, using the definition of
${r}_{\mathbf{f},t}$, namely
${r}_{\mathbf{f},t}={c}_{0}-\Delta (\mathbf{f},{x}_{t})$, terminates the proof of Theorem 3. □