Inconsistency of Template Estimation by Minimizing of the Variance/Pre-Variance in the Quotient Space

We tackle the problem of template estimation when data have been randomly deformed under a group action in the presence of noise. In order to estimate the template, one often minimizes the variance when the influence of the transformations has been removed (computation of the Fréchet mean in the quotient space). The consistency bias is defined as the distance (possibly zero) between the orbit of the template and the orbit of one element which minimizes the variance. In the first part, we restrict ourselves to isometric group actions; in this case the Hilbertian distance is invariant under the group action. We establish an asymptotic behavior of the consistency bias which is linear with respect to the noise level. As a result, the inconsistency is unavoidable as soon as the noise is large enough. In practice, template estimation with a finite sample is often done with an algorithm called "max-max". In the second part, also in the case of an isometric action of a finite group, we show the convergence of this algorithm to an empirical Karcher mean. Our numerical experiments show that the bias observed in practice cannot be attributed to the small sample size or to a convergence problem but is indeed due to the previously studied inconsistency. In the third part, we also present some insights into the case of a distance that is not invariant with respect to the group action. We will see that the inconsistency still holds as soon as the noise level is large enough. Moreover, we prove the inconsistency even when a regularization term is added.


1 Introduction

General Introduction
Template estimation is a well-known issue in different fields such as statistics on signals [KSW11], shape theory, computational anatomy [GMT00, JDJG04, CMT+04], etc. In these fields, the template (which can be viewed as the prototype of our data) can be (depending on the vocabulary) shifted, transformed, warped or deformed by different groups acting on the data. Moreover, due to the limited precision of the measurements, the presence of noise is almost always unavoidable. These mixed effects on data lead us to study the consistency of algorithms which claim to compute the template. A popular algorithm consists in minimizing the variance, in other words, in computing the Fréchet mean in the quotient space. This method has already been proved to be inconsistent [BC11, MHP16, DATP17]. In [BC11] the authors prove the inconsistency with a lower bound on the expectation of the error between the original template and the template estimated with a finite sample; they deduce that this expectation does not go to zero as the size of the sample goes to infinity. This work was done in a functional space, where functions were only observed at a finite number of points. In this case one can model these observable values on a grid. When the resolution of the grid goes to zero, one can show the consistency [PZ16] by using the Fréchet mean with the Wasserstein distance on the space of measures rather than in the space of functions. However, in (medical) images the number of pixels or voxels is finite.
In [MHP16], the authors demonstrated the inconsistency in a finite-dimensional manifold with Gaussian noise, when the noise level tends to zero. In our previous work [DATP17], we focused on the inconsistency with a Hilbert space (including the infinite-dimensional case) as ambient space. The present paper is an extension of a conference paper [DPA17].

Why Using a Group Action? Comparison with the Standard Norm
In the following, we take a simple example which justifies the use of a group action in order to compare the shapes of two functions. Suppose that you want to compare the functions of Figure 1. The simplest way to compare $f_0$ with $f_1$ would be to compute the $L^2$-norm (or any other norm) of $f_0 - f_1$; doing so, we get $\|f_0 - f_1\| \simeq 0.6$. Likewise $\|f_0 - f_2\| \simeq 0.6$; therefore the norm tells us that $f_0$ is at the same distance from $f_1$ and from $f_2$. Yet, our eyes would say that $f_0$ and $f_1$ have the same shape, contrary to $f_0$ and $f_2$. Therefore the simple use of the $L^2$-norm in the space of functions is not enough. To have a relevant way to compare functions, one can register the functions first. Firstly, we estimate the best time translation which aligns $f_0$ and $f_1$ and secondly, we compute the $L^2$-norm after this alignment step. In this example, we find that the distance is now $0.02$. On the contrary, after alignment the distance between $f_0$ and $f_2$ is still $0.6$. With this new way of comparing functions, the function $f_0$ looks like $f_1$ but does not look like $f_2$. This fits with our intuition. That is why we use a group action in order to perform statistics. In the following paragraph, we specify how to do it in general.
This idea of using deformations/transformations in order to compare things is not new. It was already proposed by D'Arcy Thompson [Tho42] at the beginning of the 20th century, in order to classify species.
Figure 1: The blue one ($f_0$) is a step function, the red one ($f_1$) is a translated version of the blue one to which noise has been added, and the green one ($f_2$) is the null function.
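The comparison of $f_0$, $f_1$ and $f_2$ described above can be sketched numerically. The following is a minimal illustration, not the code used for Figure 1: the functions are sampled on a hypothetical grid of 100 points, the step location and the translation of 18 samples are arbitrary choices, and the noise on $f_1$ is omitted so that the distance after registration is zero rather than 0.02. The helper names (`shift`, `l2_distance`, `registered_distance`) are assumptions introduced for the example.

```python
import math

def shift(u, k):
    """Cyclic time translation of the sampled function u by k samples."""
    n = len(u)
    return [u[(i - k) % n] for i in range(n)]

def l2_norm(u):
    """Discrete approximation of the L2 norm on [0, 1]."""
    return math.sqrt(sum(a * a for a in u) / len(u))

def l2_distance(u, v):
    """Plain L2 distance, without any alignment."""
    return l2_norm([a - b for a, b in zip(u, v)])

def registered_distance(u, v):
    """Distance after alignment: minimum over all cyclic translations of v."""
    return min(l2_distance(u, shift(v, k)) for k in range(len(v)))

n = 100
f0 = [1.0 if 20 <= i < 56 else 0.0 for i in range(n)]  # step function (hypothetical)
f1 = shift(f0, 18)                                     # translated copy, noise omitted
f2 = [0.0] * n                                         # null function

plain = (l2_distance(f0, f1), l2_distance(f0, f2))
aligned = (registered_distance(f0, f1), registered_distance(f0, f2))
print("without registration:", plain)
print("with registration:   ", aligned)
```

With these choices the two plain distances coincide, whereas only the distance to the translated copy collapses after registration.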

Settings and Notation
In this paper, we suppose that observations belong to a Hilbert space $(M, \langle\cdot,\cdot\rangle)$, and we denote by $\|\cdot\|$ the norm associated with the dot product $\langle\cdot,\cdot\rangle$. We also consider a group of transformations $G$ which acts on $M$, the space of observations. This means that $g' \cdot (g \cdot x) = (g' g) \cdot x$ and $e \cdot x = x$ for all $x \in M$, $g, g' \in G$, where $e$ is the identity element of $G$. Note that in this article, $g \cdot x$ is the result of the action of $g$ on $x$, and $\cdot$ should not be confused with the multiplication of real numbers, denoted $\times$.
The generative model is the following: we transform an unknown template $t_0 \in M$ with $\Phi$, a random and unknown element of the group $G$, and we add some noise. Let $\sigma$ be a positive noise level and $\varepsilon$ a standardized noise: $E(\varepsilon) = 0$, $E(\|\varepsilon\|^2) = 1$. Moreover we suppose that $\varepsilon$ and $\Phi$ are independent random variables. Finally, the only observable random variable is:
$$Y = \Phi \cdot t_0 + \sigma \varepsilon. \quad (1)$$
This generative model is commonly used in computational anatomy in diverse frameworks, for instance with currents [DPC+14, GM01], varifolds [Cha13], LDDMM on images [BMTY05], but also in functional data analysis [KSW11]. All these works are applied in different spaces; for instance, the varifold framework builds an embedding of surfaces into a Hilbert space, and a group of diffeomorphisms can deform these surfaces. Supposing a general group action on a space with the generative model (1) allows us to embed all these various situations into one abstract model, and to study template estimation in this abstract model.
Example of noise: if we assume that the noise is independent and identically distributed on each pixel or voxel with a standard deviation $w$, then $\sigma = \sqrt{N} w$, where $N$ is the number of pixels/voxels. However, the noise which we consider can be more general: we do not require the noise to be independent over each region of the space $M$.
Note that the inconsistency of template estimation can also be studied with an alternative generative model, called the backward model, where $Y = \Phi \cdot (t_0 + \sigma\varepsilon)$ [DATP17]. Some authors also use the term perturbation model, see [Huc11, Roh03, Goo91].
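As an illustration of the forward generative model and of the relation between the noise level and the per-coordinate standard deviation, here is a minimal sketch. It assumes a specific setting that is only one instance of the abstract model: $M = \mathbb{R}^N$ with the Euclidean norm, cyclic translations as group, a uniformly drawn translation, and i.i.d. Gaussian noise on each coordinate. The function name `sample_observation` and all numerical values are hypothetical.

```python
import math
import random

def shift(u, k):
    """Cyclic translation by k coordinates (one possible isometric action)."""
    n = len(u)
    return [u[(i - k) % n] for i in range(n)]

def sample_observation(t0, sigma, rng):
    """Draw Y = Phi . t0 + sigma * eps and also return the hidden shift and noise."""
    n = len(t0)
    w = sigma / math.sqrt(n)        # per-coordinate standard deviation, sigma = sqrt(N) * w
    k = rng.randrange(n)            # uniform random translation (assumption)
    noise = [rng.gauss(0.0, w) for _ in range(n)]
    y = [a + e for a, e in zip(shift(t0, k), noise)]
    return y, k, noise

rng = random.Random(0)
t0 = [1.0] * 8 + [0.0] * 8          # hypothetical template in R^16
sigma = 2.0
draws = [sample_observation(t0, sigma, rng) for _ in range(2000)]

# E(||sigma * eps||^2) should be close to sigma^2 since E(||eps||^2) = 1.
mean_noise_energy = sum(sum(e * e for e in noise) for _, _, noise in draws) / len(draws)
print(mean_noise_energy, sigma ** 2)
```

In practice only the first component of each draw is observable; the shift and the noise are returned here solely to check the normalization of the noise.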
Quotient space: the random transformation of the template by the group leads us to project the observation $Y$ into the quotient space. The orbit of a point $x \in M$ is $[x] = \{g \cdot x,\ g \in G\}$. The set of all orbits is called the quotient space of $M$ by the group $G$ and is denoted by:
$$Q = M/G = \{[x],\ x \in M\}.$$
As we want to do statistics on this space, we aim to equip the quotient with a metric. One often requires that $d_M$, the distance in the ambient space, is invariant under the action of $G$; this means that
$$\forall m, n \in M,\ \forall g \in G, \quad d_M(g \cdot m, g \cdot n) = d_M(m, n).$$
If moreover the orbits are closed sets (if the orbits are not closed sets, it is possible to have $d_Q([a],[b]) = 0$ even if $[a] \neq [b]$; in this case we call $d_Q$ a pseudo-distance. Nevertheless, this has no consequence in this paper if $d_Q$ is only a pseudo-distance), then
$$d_Q([x],[y]) = \inf_{g \in G} d_M(x, g \cdot y)$$
is well defined, and $d_Q$ is a distance in the quotient space. The quotient distance $d_Q([x],[y])$ is the distance between $x$ and $y'$, where $y'$ is the registration of $y$ with respect to $x$. We say in this case that $y'$ is in optimal position with respect to $x$.
One particular distance in the ambient space $M$, which we use throughout this article, is the distance given by the norm of the Hilbert space: $d_M(a,b) = \|a - b\|$. Moreover we say that $G$ acts isometrically on $M$ if $x \mapsto g \cdot x$ is a linear map which leaves the norm unchanged. In this case $d_M$, the distance given by the norm of the Hilbert space, is invariant under the group action. The quotient (pseudo-)distance is, in this case (see Figure 2):
$$d_Q([a],[b]) = \inf_{g \in G} \|a - g \cdot b\|.$$
Figure 2: Due to the invariant action, the orbits are parallel. Here the orbits are circles centred at 0. This is the case when the group $G$ is the group of rotations.
Remark 1.1. When $G$ acts isometrically on a Hilbert space $M$, by expansion of the squared norm we have:
$$d_Q([a],[b])^2 = \|a\|^2 - 2 \sup_{g \in G} \langle a, g \cdot b \rangle + \|b\|^2.$$
Thus, even if the quotient space is not a linear space, we have a "polarization identity" in the quotient space:
$$\sup_{g \in G} \langle a, g \cdot b \rangle = \frac{1}{2}\left(\|a\|^2 + \|b\|^2 - d_Q([a],[b])^2\right). \quad (2)$$
When the distance given by the norm is invariant under the group action, we define the variance of the random orbit $[Y]$ as the expectation of the squared (pseudo-)distance between the random orbit $[Y]$ and the orbit of a point $x$ in $M$:
$$F(x) = E\left(d_Q([x],[Y])^2\right) = E\left(\inf_{g \in G} \|x - g \cdot Y\|^2\right).$$
Since $F(g \cdot x) = F(x)$ for all $x \in M$ and $g \in G$, the variance $F$ is well defined in the quotient space. Moreover, given a sample $Y_1, \ldots, Y_n$ of the observable variable $Y$, one can define the empirical variance of a point $x$ in $M$:
$$F_n(x) = \frac{1}{n} \sum_{i=1}^n d_Q([x],[Y_i])^2 = \frac{1}{n}\sum_{i=1}^n \inf_{g \in G} \|x - g \cdot Y_i\|^2.$$
Definition 1.2. Template estimation is performed by minimizing $F_n$:
$$\operatorname{argmin}_{x \in M} F_n(x).$$
In order to study this estimation method, one can look at the limit of this estimator when the number of data $n$ tends to $+\infty$; in this case, the estimation becomes:
$$\operatorname{argmin}_{x \in M} F(x).$$
Note that, if the action is not isometric and the distance is not invariant either, a priori $d_Q$ is no longer a (pseudo-)distance in the quotient space (this point is discussed in Section 3). However one can still define $F$ and wonder whether the minimization of $F$ is a consistent estimator of $t_0$. In this case, we call $F$ a pre-variance.
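For a finite group, the quotient distance and the empirical variance can be computed by exhaustive search. The sketch below assumes, purely for illustration, that the group is the group of cyclic translations acting on $\mathbb{R}^4$; the names `quotient_sq_distance` and `empirical_variance` and the tiny sample are hypothetical. It also checks numerically that the empirical variance takes the same value on every point of an orbit.

```python
def shift(u, k):
    """Cyclic translation by k coordinates (an isometric action of Z/NZ)."""
    n = len(u)
    return [u[(i - k) % n] for i in range(n)]

def sq_norm(u):
    return sum(a * a for a in u)

def quotient_sq_distance(x, y):
    """Squared quotient distance: minimum over the group of ||x - g.y||^2."""
    return min(sq_norm([a - b for a, b in zip(x, shift(y, k))])
               for k in range(len(y)))

def empirical_variance(x, sample):
    """Empirical variance F_n(x): mean squared quotient distance to the sample."""
    return sum(quotient_sq_distance(x, y) for y in sample) / len(sample)

sample = [[0.0, 1.0, 0.0, 0.0],
          [0.0, 0.0, 2.0, 0.0],
          [1.0, 0.0, 0.0, 1.0]]
x = [1.0, 0.0, 0.0, 0.0]

fx = empirical_variance(x, sample)
fgx = empirical_variance(shift(x, 2), sample)   # same orbit, same variance
print(fx, fgx)
```

The exhaustive search costs one norm evaluation per group element, which is only practical for small finite groups.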

Questions and Contributions
This setting leads us to wonder about a few questions, listed below.
Questions:
• Is $t_0$ a minimum of the variance or the pre-variance?
• What is the behavior of the consistency bias with respect to the noise level?
• How to perform such a minimization of the variance? Indeed, in practice we have only a sample and not the whole distribution.
Contribution: In the case of an isometric action, we provide a Taylor expansion of the consistency bias when the noise level σ tends to infinity. As we do not have the whole distribution, we minimize the empirical variance given a sample. An element which minimizes this empirical variance is called an empirical Fréchet mean. We already know that the empirical Fréchet mean converges to the Fréchet mean when the sample size tends to infinity [Zie77]. Therefore our problem is reduced to finding an empirical Fréchet mean with a finite but sufficiently large sample. One algorithm called the "max-max" algorithm [AAT07] aims to compute such an empirical Fréchet mean. We establish some properties of the convergence of this algorithm. In particular, when the group is finite, the algorithm converges in a finite number of steps to an empirical Karcher mean (a local minimum of the empirical variance given a sample). This helps us to illustrate the inconsistency in this very simple framework.
We would like to insist on this point: in our generative model the noise is created in the ambient space, whereas the computation of the Fréchet mean is done in the quotient space; this interaction induces an inconsistency. Conversely, if one models the noise directly in the quotient space and computes the Fréchet mean in the quotient space, we have no reason to suspect any inconsistency.
Moreover, it is also possible to define and use isometric actions on curves [HCG+13, KSW11] or on surfaces [KKD+11], where our work can be directly applied. The previous works related to the inconsistency of template estimation [BC11, MHP16, DATP17] focused on isometric actions, which is restrictive for real applications. That is why we provide, in Section 3, some insights into the non-invariant case: the inconsistency also appears as soon as the noise level is large enough.
This article is organized as follows: Section 2 is dedicated to isometric actions. More precisely, in Section 2.2, we study the presence of the inconsistency and we establish the asymptotic behavior when the noise parameter $\sigma$ tends to $\infty$. In Section 2.4 we detail the max-max algorithm and its properties. In Section 2.5 we illustrate the inconsistency with synthetic data. Finally in Section 3, we prove the inconsistency for more general group actions, when the noise level is large enough. We do it in two settings: in the first one, the group contains a subgroup acting isometrically on $M$; in the second one, the group acts linearly on the space $M$.
2 Inconsistency of Template Estimation with an Isometric Action

Congruent Section and Computation of Fréchet Mean in Quotient Space
Given points $m$ and $y$, there is a priori no closed-form expression to compute the quotient distance $\inf_{g\in G}\|g\cdot m - y\|$. Therefore computing and minimizing the variance in the quotient space does not seem straightforward. There is one case where it may be possible: the existence of a congruent section. We say that $s : Q \to M$ is a section if $\pi\circ s = \mathrm{Id}$, where $\pi : M\to Q$ is the canonical projection onto the quotient space. Moreover we say that the section $s$ is congruent if:
$$\forall o, o' \in Q, \quad d_Q(o,o') = \|s(o) - s(o')\|.$$
Then $S = s(Q)$, the image of the quotient by the section, is a part of $M$ which has an interesting property:
$$\forall p, q \in S, \quad d_Q([p],[q]) = \|p - q\|.$$
In other words, the section gives us a part of $M$ containing a point of each orbit such that all points in $S$ are already registered. Moreover, if $s$ is a congruent section, then $s' : [m]\mapsto g\cdot s([m])$ is also a congruent section for any $g \in G$; hence without loss of generality we can assume that $t_0 = s([t_0])$.
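In this case, the variance is equal to:
$$F(m) = E\left(d_Q([m],[Y])^2\right) = E\left(\|s([m]) - s([Y])\|^2\right),$$
where we recognize the variance of the random variable $s([Y])$. As the element which minimizes the variance in a linear space is the expected value, the minimization then amounts to computing the expected value $E(s([Y]))$ in the ambient space.
Example 2.2. Let $G = \mathbb{Z}/N\mathbb{Z}$ act on $M = \mathbb{R}^N$ by time translation on coordinates:
$$\bar\tau\cdot(x_1,\ldots,x_N) = (x_{1+\tau},\ldots,x_{N+\tau}),$$
where indexes are taken modulo $N$. Take $p_1 = (0, 5, 0, \ldots, 0)$, $p_2 = (0, 3, 2, 0, \ldots, 0)$ and $p_3 = (2, 3, 0, \ldots, 0)$. By hand we can check that there are no points $q_1 \in [p_1]$, $q_2 \in [p_2]$, $q_3 \in [p_3]$ which are pairwise registered, i.e., such that $\|q_i - q_j\| = d_Q([p_i],[p_j])$ for all $i, j$. Thus, a congruent section in $Q = M/G$ does not exist.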
We can generalize this simple example by taking a non-finite group: Example 2.3. Let us take $M = L^2(\mathbb{R}/\mathbb{Z})$, the set of 1-periodic functions such that $\int_0^1 f^2(t)\,dt < +\infty$. $G = \mathbb{R}/\mathbb{Z}$ acts on $L^2(\mathbb{R}/\mathbb{Z})$ by time translation defined by:
$$(\tau\cdot f)(t) = f(t + \tau), \quad \tau \in \mathbb{R}/\mathbb{Z}.$$
Then a congruent section in $Q = M/G$ does not exist. Indeed, consider the functions $f_1$, $f_2$, $f_3$ of Figure 3. Let us suppose that a congruent section $s$ exists; then without loss of generality we can assume that $s([f_1]) = f_1$. As $s$ is congruent, $s([f_2])$ should be registered with respect to $f_1$. For $\tau \in \mathbb{R}/\mathbb{Z}$ we can verify that $\|f_1 - \tau\cdot f_2\| \ge \|f_1 - f_2\|$ and that this inequality is strict as soon as $\tau \ne 0$. Then $f_2$ is the only element of $[f_2]$ registered with $f_1$, so $s([f_2]) = f_2$. Likewise $s([f_3]) = f_3$; then we should have:
$$d_Q([f_2],[f_3]) = \|f_2 - f_3\|.$$
However it is easy to verify that $d_Q([f_2],[f_3]) < \|f_2 - f_3\|$. This is a contradiction. Therefore, a congruent section does not exist.
When a congruent section exists, the quotient can be included in a part $S$ of the ambient space $M$ on which the metrics $d_M$ and $d_Q$ coincide. The existence of a congruent section tells us that the quotient space is not so complicated: the quotient space is then embedded in the ambient space in a way that respects the distances in the quotient space and in the ambient space. In that case computations are easier: one projects the data on this part $S$ and takes the mean. When such a congruent section does not exist, computing the Fréchet mean in the quotient space is not so obvious. However, we can still establish proofs of inconsistency, which are less tight. In this article we prove that the method is inconsistent when the noise is large.

Inconsistency and Quantification of the Consistency Bias
We start with Theorem 2.4, which gives us the asymptotic behavior of the consistency bias when the noise level $\sigma$ tends to infinity. One key notion in Theorem 2.4 is the concept of fixed point under the action of $G$: a point $x \in M$ is a fixed point if $g \cdot x = x$ for all $g \in G$. We require that the support of the noise $\varepsilon$ is not included in the set of fixed points. However, this condition is almost always fulfilled. For instance, in $\mathbb{R}^n$ the set of fixed points under a linear group action is a null set for the Lebesgue measure (unless the action is trivial: $g\cdot x = x$ for all $g \in G$, but this situation is irrelevant).
Theorem 2.4. Let us suppose that the support of the noise $\varepsilon$ is not included in the set of fixed points under the group action. Let $Y$ be the observable variable defined in Equation (1). If the Fréchet mean of $[Y]$ exists, then we have the following lower and upper bounds of the consistency bias, noted CB:
$$\sigma K - 2\|t_0\| \le \mathrm{CB} \le \sigma K + 2\|t_0\|, \quad (3)$$
where $K = \sup_{v \in S} E\left(\sup_{g \in G}\langle v, g\cdot\varepsilon\rangle\right) \in (0,1]$ is a constant that depends only on the standardized noise and on the group action. The consistency bias has the following asymptotic behavior when the noise level $\sigma$ tends to infinity:
$$\mathrm{CB} = \sigma K + o(\sigma) \quad \text{as } \sigma \to +\infty. \quad (4)$$
In the following we denote by $S$ the unit sphere of $M$. For $v \in S$, we call $\theta(v) = E\left(\sup_{g \in G}\langle v, g\cdot\varepsilon\rangle\right)$, so that $K = \sup_{v \in S}\theta(v)$.
The sketch of the proof is the following:
• $K > 0$ because the support of $\varepsilon$ is not included in the set of fixed points under the action of $G$.
• $K \le 1$ is a consequence of the Cauchy-Schwarz inequality.
• The proof of Inequalities (3) is based on the triangular inequalities:
$$\|m_\star\| - \|t_0\| \le \mathrm{CB} = \inf_{g \in G}\|t_0 - g\cdot m_\star\| \le \|m_\star\| + \|t_0\|,$$
where $m_\star$ minimizes $F$: having a piece of information about the norm of $m_\star$ is enough to deduce a piece of information about the consistency bias.
• The asymptotic expansion of the consistency bias (4) is a direct consequence of Inequalities (3).
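Proof of Theorem 2.4. We denote by $S$ the unit sphere in $M$. In order to prove that $K > 0$, we take $x$ in the support of $\varepsilon$ such that $x$ is not a fixed point under the action of $G$. There exists $g_0 \in G$ such that $g_0\cdot x \ne x$. We set $v_0 = \frac{g_0\cdot x}{\|x\|} \in S$; we have $\langle v_0, g_0\cdot x\rangle = \|x\| > \langle v_0, x\rangle$ and, by continuity of the dot product, there exists $r > 0$ such that $\langle v_0, g_0\cdot y\rangle > \langle v_0, y\rangle$ for all $y \in B(x,r)$. As $x$ is in the support of $\varepsilon$ we have $P(\varepsilon\in B(x,r)) > 0$; it follows:
$$P\left(\sup_{g\in G}\langle v_0, g\cdot\varepsilon\rangle > \langle v_0,\varepsilon\rangle\right) > 0. \quad (5)$$
Thanks to Inequality (5) and the fact that $\sup_{g\in G}\langle v_0, g\cdot\varepsilon\rangle \ge \langle v_0,\varepsilon\rangle$, we have:
$$\theta(v_0) = E\left(\sup_{g\in G}\langle v_0, g\cdot\varepsilon\rangle\right) > E(\langle v_0,\varepsilon\rangle) = \langle v_0, E(\varepsilon)\rangle = 0.$$
Then we get $K \ge \theta(v_0) > 0$. Moreover, if we use the Cauchy-Schwarz inequality:
$$K \le \sup_{v\in S} E(\|v\| \times \|\varepsilon\|) = E(\|\varepsilon\|) \le \sqrt{E(\|\varepsilon\|^2)} = 1.$$
In order to prove Inequalities (3), we use the "polar" coordinates of a point in $M$ (see Figure 4): every point in $M$ can be represented by $(r,v)$ where $r\ge 0$ is the radius and $v$ belongs to $S$, the unit sphere in $M$; $v$ represents the "angle". We compute $F(m)$ as a function of $(r,v)$. In a first step, we minimize this expression as a function of $r$; in a second step we minimize it as a function of $v$. This makes the constant $K$ appear. As we said, let us take $r\ge 0$ and $v\in S$; we expand the variance at the point $rv$:
$$F(rv) = E\left(\inf_{g\in G}\|rv - g\cdot Y\|^2\right) = r^2 - 2r\,E\left(\sup_{g\in G}\langle v, g\cdot Y\rangle\right) + E(\|Y\|^2). \quad (6)$$
Indeed $\|g\cdot Y\| = \|Y\|$ thanks to the isometric action. We denote by $x^+ = \max(x,0)$ the positive part of $x$. Moreover we define the two following functions on $S$:
$$\lambda(v) = E\left(\sup_{g\in G}\langle v, g\cdot Y\rangle\right) \quad \text{and} \quad \tilde\lambda(v) = \lambda(v)^+.$$
The $r \ge 0$ which minimizes (6) is $\tilde\lambda(v)$, and the minimum value of the variance restricted to the half-line $\mathbb{R}^+ v$ is:
$$F(\tilde\lambda(v)\,v) = E(\|Y\|^2) - \tilde\lambda(v)^2.$$
To find the Fréchet mean of $[Y]$, we therefore maximize $\tilde\lambda(v)^2$ with respect to $v \in S$:
$$m_\star = \lambda(v_\star)\,v_\star \quad \text{with} \quad v_\star \in \operatorname{argmax}_{v\in S}\lambda(v). \quad (7)$$
Note that we remove the positive part and the square because $\operatorname{argmax}\lambda = \operatorname{argmax}(\lambda^+)^2$; indeed $\lambda$ takes a non-negative value. In order to prove it, let us remark that $\lambda(v) \ge E(\langle v, Y\rangle) = \langle v, E(Y)\rangle$ for all $v \in S$, so that $\lambda(v)$ and $\lambda(-v)$ cannot both be negative. As we said in the sketch of the proof, we are interested in getting information about the norm of $m_\star$:
$$\|m_\star\| = \lambda(v_\star) = \sup_{v\in S}\lambda(v).$$
Let $v \in S$; we have $-\|t_0\| \le \langle v, g\Phi\cdot t_0\rangle \le \|t_0\|$ because the action is isometric. Now we decompose $Y = \Phi\cdot t_0 + \sigma\varepsilon$ and we get:
$$\lambda(v) = E\left(\sup_{g\in G}\left(\langle v, g\Phi\cdot t_0\rangle + \sigma\langle v, g\cdot\varepsilon\rangle\right)\right) \le \|t_0\| + \sigma\theta(v), \quad (8)$$
$$\lambda(v) \ge -\|t_0\| + \sigma\theta(v). \quad (9)$$
By taking the largest value in these inequalities with respect to $v\in S$, we get by definition of $K$:
$$-\|t_0\| + \sigma K \le \|m_\star\| \le \|t_0\| + \sigma K. \quad (10)$$
Moreover we recall the triangular inequalities:
$$\|m_\star\| - \|t_0\| \le \mathrm{CB} \le \|m_\star\| + \|t_0\|. \quad (11)$$
Thanks to (10) and to (11), Inequalities (3) are proved.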

Remarks about Theorem 2.4 and Its Proof
We can ensure the presence of inconsistency as soon as the signal-to-noise ratio satisfies $\frac{\|t_0\|}{\sigma} < \frac{K}{2}$. Moreover, if the signal-to-noise ratio satisfies $\frac{\|t_0\|}{\sigma} < \frac{K}{3}$, then the consistency bias is not smaller than $\|t_0\|$, i.e., $\mathrm{CB} \ge \|t_0\|$. In other words, the Fréchet mean in the quotient space is too far from the template: the template estimation with the Fréchet mean in the quotient space is useless in this case. In [DATP17] we also gave lower and upper bounds as a function of $\sigma$, but these bounds were less informative than the bounds given by Theorem 2.4: they did not give the asymptotic behaviour of the consistency bias. Moreover, in [DATP17] the lower bound goes to zero when the template becomes close to the fixed points. This might suggest that the consistency bias is small for this kind of template. We prove here that this is not the case.
Note that Theorem 2.4 is not a contradiction with [KSW11] where the authors proved the consistency of template estimation with the Fréchet mean in quotient space for all σ > 0. Indeed their noise was included in the set of constant functions which are the fixed points under their group action.
The constant $K$ appearing in the asymptotic behaviour of the consistency bias (4) is a constant of interest. We can give several (but similar) interpretations of $K$:
• It follows from Equation (3) that $K$ is the consistency bias with a null template $t_0 = 0$ and a standardized noise ($\sigma = 1$).
• From the proof of Theorem 2.4 we know that $0 < K \le E(\|\varepsilon\|) \le 1$. On the one hand, if $G$ is the group of rotations then $K = E(\|\varepsilon\|)$, because for all $v$ such that $\|v\| = 1$, $\sup_{g\in G}\langle v, g\cdot\varepsilon\rangle = \|\varepsilon\|$, by aligning $v$ and $\varepsilon$. On the other hand, if $G$ acts trivially (which means that $g \cdot x = x$ for all $g \in G$, $x \in M$) then $K = 0$. The general case for $K$ lies between two extreme cases: the group for which the orbits are minimal (one point) and the group for which the orbits are maximal (the whole sphere). We can state that the more the group action has the ability to align the elements, the larger the constant $K$ is and the larger the consistency bias is.
• The squared quotient distance between two points is:
$$d_Q([a],[b])^2 = \|a\|^2 - 2\sup_{g\in G}\langle a, g\cdot b\rangle + \|b\|^2.$$
Thus $K = \sup_{v\in S} E\left(\sup_{g\in G}\langle v, g\cdot\varepsilon\rangle\right)$ encodes the level of contraction of the quotient distance (or folding). The larger $K$ is, the more contracted the quotient space is.
One disadvantage of Theorem 2.4 is that it ensures the presence of inconsistency for σ large enough but it says nothing when σ is small, in this case one can refer to [MHP16] or [DATP17].
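To give a feeling for the quantities entering the constant $K$, the following sketch estimates $\theta(v) = E(\sup_{g}\langle v, g\cdot\varepsilon\rangle)$ by Monte Carlo for two unit vectors. It is only an illustration under assumptions that are not part of the theorem: the group is the group of cyclic translations of $\mathbb{R}^{16}$, the noise is i.i.d. Gaussian with variance $1/N$ on each coordinate (so that $E(\|\varepsilon\|^2) = 1$), and the function name `theta` and the number of draws are hypothetical. Any value $\theta(v)$ is a lower bound of $K$, and a constant vector, which is a fixed point of the translations, gives a value close to zero.

```python
import math
import random

def shift(u, k):
    """Cyclic translation by k coordinates."""
    n = len(u)
    return [u[(i - k) % n] for i in range(n)]

def theta(v, n_draws, rng):
    """Monte Carlo estimate of E(sup_g <v, g.eps>) for the cyclic translation group."""
    n = len(v)
    w = 1.0 / math.sqrt(n)          # per-coordinate std so that E(||eps||^2) = 1
    total = 0.0
    for _ in range(n_draws):
        eps = [rng.gauss(0.0, w) for _ in range(n)]
        total += max(sum(a * b for a, b in zip(v, shift(eps, k)))
                     for k in range(n))
    return total / n_draws

rng = random.Random(1)
n = 16
spike = [1.0] + [0.0] * (n - 1)       # unit vector, far from the fixed points
flat = [1.0 / math.sqrt(n)] * n       # unit vector which is a fixed point

theta_spike = theta(spike, 4000, rng)  # a lower bound of K (up to Monte Carlo error)
theta_flat = theta(flat, 4000, rng)    # registration cannot help in this direction
print(theta_spike, theta_flat)
```

Computing $K$ itself would additionally require a maximization over the unit sphere, which this sketch does not attempt.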
We can remark that this theorem can be used as an alternative proof of the following corollary (which was already proved in [DATP17]), proving and quantifying the inconsistency when the template is a fixed point: Corollary 2.5. Let $G$ act isometrically on a Hilbert space $M$. Let $t_0$ be a fixed point, and $\varepsilon$ a standardized noise whose support is not included in the set of fixed points. Then estimating the template with the Fréchet mean is inconsistent. Moreover, if the Fréchet mean in the quotient space exists, then the consistency bias is equal to:
$$\mathrm{CB} = \sigma K.$$
Indeed, for $t_0 = 0$, which is a particular fixed point, we have $\mathrm{CB} = \sigma K$ thanks to Theorem 2.4. If $t_0$ is a fixed point not necessarily equal to $0$, we can define $Y' = Y - t_0 = 0 + \sigma\varepsilon$; for this random variable $0$ is the template, so we can apply the formula $\mathrm{CB} = \sigma K$ to the random variable $Y'$, which concludes the proof.
In the proof of Theorem 2.4, we have seen that the minimum of the variance restricted to the half-line $\mathbb{R}^+ v$ is determined by the quantity $\tilde\lambda(v) = E\left(\sup_{g\in G}\langle v, g\cdot Y\rangle\right)^+$, which acts as a registration score: $\tilde\lambda(v)$ tells you how good an idea it is to search for the Fréchet mean of $[Y]$ in the direction pointed by $v$: the larger $\tilde\lambda(v)$ is, the better the choice of $v$.
On the contrary, when this value is equal to zero, it is useless to search for the Fréchet mean in this direction.
Likewise, $\theta(v) = E\left(\sup_{g\in G}\langle v, g\cdot\varepsilon\rangle\right)$ is a registration score with respect to the noise: the larger $\theta(v)$, the more the unit vector $v$ looks like the noise after registration.
we have seen that its norm verifies: We see that the element $m$ which minimizes (12) does not depend on $\sigma$; in particular we can assume $\sigma = 0$ and wonder which elements minimize $F(m) = E\left(\inf_{g\in G}\|m - g\Phi\cdot t_0\|^2\right)$: it becomes clear that only the points in the orbit of $t_0$ can minimize this variance. Then, when the support of $\varepsilon$ is included in the set of fixed points, the estimation is consistent for all $\sigma$. This is an alternative proof of the consistency theorem of Kurtek et al. [KSW11].
In the proof of Theorem 2.4, we have seen that the direction of the Fréchet mean of $[Y]$ is given by the supremum over $v \in S$ of the quantity appearing in (7):
$$\lambda(v) = E\left(\sup_{g\in G}\left(\langle v, g\cdot\sigma\varepsilon\rangle + \langle v, g\Phi\cdot t_0\rangle\right)\right).$$
This equation is a good illustration of the difficulty of computing the Fréchet mean in the quotient space. Indeed, we have on one side the contribution of the noise $\langle v, g\cdot\sigma\varepsilon\rangle$ and on the other side the contribution of the template $\langle v, g\Phi\cdot t_0\rangle$, and we take the supremum of the sum of these two contributions over $g \in G$. Unfortunately the supremum of the sum of two terms is not equal to the sum of the suprema of each of these terms. Hence, it is difficult to separate these two contributions. However, we can intuit that when the noise is large, $\langle v, g\cdot\sigma\varepsilon\rangle$ prevails over $\langle v, g\Phi\cdot t_0\rangle$, and the use of the Cauchy-Schwarz inequality in Equations (8) and (9) proves it rigorously. We can conclude that, when the noise is large, the direction of the Fréchet mean in the quotient space depends more on the noise than on the template.
In practice, we cannot minimize the exact variance in the quotient space, because we have only a finite sample and not the whole distribution. In this section we study the estimation of the empirical Fréchet mean with the max-max algorithm. We assume that the group is finite. In this case, the registration can always be found by an exhaustive search. Hence, the numerical experiments which we conduct in Section 2.5 lead to an empirical Karcher mean in a finite number of steps. For a compact group acting continuously, the registration also exists but is not necessarily computable without approximation.

Template
If we have a sample $Y_1, \ldots, Y_I$ of independent and identically distributed copies of $Y$, then we define the empirical variance in the quotient space:
$$F_I(x) = \frac{1}{I}\sum_{i=1}^I d_Q([x],[Y_i])^2 = \frac{1}{I}\sum_{i=1}^I \min_{g_i\in G}\|x - g_i\cdot Y_i\|^2. \quad (13)$$
The empirical variance is an approximation of the variance. Indeed, thanks to the law of large numbers we have $\lim_{I\to\infty} F_I(x) = F(x)$ for all $x \in M$. One element which minimizes $F_I$ globally (respectively locally) is called an empirical Fréchet mean (respectively an empirical Karcher mean). For $x \in M$ and $g = (g_1, \ldots, g_I) \in G^I$, where $g_i \in G$ for all $i = 1..I$, we define an auxiliary function $J$ by:
$$J(x, g) = \frac{1}{I}\sum_{i=1}^I \|x - g_i\cdot Y_i\|^2.$$
The max-max algorithm (Algorithm 1) iteratively minimizes the function $J$ in the variable $x \in M$ and in the variable $g \in G^I$ (see Figure 5). First, we note that this algorithm is sensitive to the starting point. However we remark that $m_1 = \frac{1}{I}\sum_{i=1}^I g_i\cdot Y_i$ for some $g_i \in G$; thus without loss of generality, we can start from $m_1 = \frac{1}{I}\sum_{i=1}^I g_i\cdot Y_i$ for some $g_i \in G$. The empirical variance does not increase at each step of the algorithm since:
$$F_I(m_n) = J(m_n, g^n) \ge J(m_{n+1}, g^n) \ge J(m_{n+1}, g^{n+1}) = F_I(m_{n+1}).$$
Algorithm 1: Max-max algorithm
  Require: a starting point $m_0 \in M$, a sample $Y_1, \ldots, Y_I$.
  $n = 0$.
  while convergence is not reached do
    Minimizing $g \in G^I \mapsto J(m_n, g)$: we get $g_i^n$ by registering $Y_i$ with respect to $m_n$.
    Minimizing $x \in M \mapsto J(x, g^n)$: we get $m_{n+1} = \frac{1}{I}\sum_{i=1}^I g_i^n\cdot Y_i$.
    $n = n + 1$.
  end while
  $\hat m = m_n$
Proposition 2.7. As the group is finite, the convergence is reached in a finite number of steps.
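Proof of Proposition 2.7. The sequence $(F_I(m_n))_{n\in\mathbb{N}}$ is non-increasing. Moreover the sequence $(m_n)_{n\in\mathbb{N}}$ takes values in a finite set, which is $\left\{\frac{1}{I}\sum_{i=1}^I g_i\cdot Y_i,\ g_i \in G\right\}$. Therefore, the sequence $(F_I(m_n))_{n\in\mathbb{N}}$ is stationary. Let $n \in \mathbb{N}$ be such that $F_I(m_n) = F_I(m_{n+1})$. Hence the empirical variance did not decrease between step $n$ and step $n + 1$ and we have:
$$F_I(m_n) = J(m_n, g^n) = J(m_{n+1}, g^n) = J(m_{n+1}, g^{n+1}) = F_I(m_{n+1});$$
as $m_{n+1}$ is the unique element which minimizes $m \mapsto J(m, g^n)$, we conclude that $m_{n+1} = m_n$.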
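The alternating minimization described above can be sketched as follows for a finite group. This is a minimal illustration under stated assumptions, not the implementation used for the experiments: the group is the group of cyclic translations of $\mathbb{R}^{16}$, registration is done by exhaustive search (using the fact that, for an isometric action, minimizing the distance amounts to maximizing the dot product), the starting point is the first observation, and the names `register`, `max_max` and the parameter `max_iter` are hypothetical.

```python
import math
import random

def shift(u, k):
    """Cyclic translation by k coordinates."""
    n = len(u)
    return [u[(i - k) % n] for i in range(n)]

def register(y, m):
    """Translate of y closest to m (exhaustive search, maximizes <m, g.y>)."""
    return max((shift(y, k) for k in range(len(y))),
               key=lambda z: sum(a * b for a, b in zip(m, z)))

def empirical_variance(m, sample):
    """Mean squared distance between m and the registered observations."""
    return sum(sum((a - b) ** 2 for a, b in zip(m, register(y, m)))
               for y in sample) / len(sample)

def max_max(sample, m_init, max_iter=500):
    """Alternate registration of the data and averaging until m stops changing."""
    m = list(m_init)
    history = [empirical_variance(m, sample)]
    for _ in range(max_iter):
        registered = [register(y, m) for y in sample]              # minimize in g
        new_m = [sum(col) / len(sample) for col in zip(*registered)]  # minimize in x
        if new_m == m:                                             # m_{n+1} = m_n
            break
        m = new_m
        history.append(empirical_variance(m, sample))
    return m, history

rng = random.Random(2)
n, sigma = 16, 2.0
t0 = [1.0] * 8 + [0.0] * 8                                         # hypothetical template
sample = [[a + rng.gauss(0.0, sigma / math.sqrt(n))
           for a in shift(t0, rng.randrange(n))] for _ in range(200)]

m_hat, history = max_max(sample, sample[0])
print(len(history), history[0], history[-1])
```

The list `history` records the empirical variance at each step, which makes the monotonicity used in the proof directly observable.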
This proposition gives us a stopping criterion for the max-max algorithm: we stop the algorithm as soon as $m_n = m_{n+1}$. Let us call $\hat m$ the final result of the max-max algorithm. It may seem logical that $\hat m$ is at least a local minimum of the empirical variance. However this intuition may be wrong. Let us give a toy counterexample: suppose that we observe $Y_1, \ldots, Y_I$; due to the transformations of the group it is possible that $\sum_{i=1}^I Y_i = 0$. We can start from $m_1 = 0$ in the max-max algorithm; as $Y_i$ and $0$ are already registered, the max-max algorithm does not transform $Y_i$. At step two, we still have $m_2 = 0$; by induction the max-max algorithm stays at $0$ even if $0$ is not a Fréchet or Karcher mean of $[Y]$. Because $0$ is equally distant from all the points in the orbit of $Y_i$, $0$ is called a focal point of $[Y_i]$. The notion of focal point is important for the consistency of the Fréchet mean in manifolds [BB08]. Fortunately, the situation where $\hat m$ is not a Karcher mean is almost always avoided thanks to the following statement:
Proposition 2.8. Let $\hat m$ be the result of the max-max algorithm. If the registration of $Y_i$ with respect to $\hat m$ is unique for all $i \in \{1, \ldots, I\}$ (in other words, if $\hat m$ is not a focal point of $[Y_i]$), then $\hat m$ is a local minimum of $F_I$, i.e., an empirical Karcher mean.
Note that, if we call $z$ the registration of $y$ with respect to $m$, then the registration is unique if and only if $\langle m, z - g\cdot z\rangle \ne 0$ for all $g \in G\setminus\{e\}$.
Once the max-max algorithm has reached convergence, it suffices to test this condition for $\hat m$ obtained by the max-max algorithm and $Y_i$ for all $i$. This condition is in fact generic and is always obtained in practice.
Proof of Proposition 2.8. We call $g_i$ the unique element in $G$ which registers $Y_i$ with respect to $\hat m$: for all $h_i \in G\setminus\{g_i\}$, $\|\hat m - g_i\cdot Y_i\| < \|\hat m - h_i\cdot Y_i\|$. By continuity of the norm we have, for $a$ close enough to $\hat m$: $\|a - g_i\cdot Y_i\| < \|a - h_i\cdot Y_i\|$ for all $h_i \ne g_i$ (note that this argument requires a finite group). The registrations of $Y_i$ with respect to $\hat m$ and to $a$ are the same:
$$F_I(a) = \frac{1}{I}\sum_{i=1}^I\|a - g_i\cdot Y_i\|^2 = J(a, g) \ge J(\hat m, g) = F_I(\hat m),$$
because $m \mapsto J(m, g)$ has one unique local minimum $\hat m$.
Remark 2.9. We remark that the max-max algorithm is in fact a gradient descent. Gradient descent is a general method to find the minimum of a differentiable function. Here we are interested in the minimum of the variance $F$: let $m_0 \in M$ and define by induction the gradient descent of the variance $m_{n+1} = m_n - \rho\nabla F(m_n)$, where $\rho > 0$ and $F$ is the variance in the quotient space. In [DATP17], the gradient of the variance in the quotient space for a finite group and for a regular point $m$ was computed ($m$ is regular as soon as $g \cdot m = m$ implies $g = e$); this leads to:
$$m_{n+1} = m_n - 2\rho\left[m_n - E\left(g(Y, m_n)\cdot Y\right)\right],$$
where $g(Y, m_n)$ is the almost-surely unique element of the group which registers $Y$ with respect to $m_n$. Now if we have a set of data $Y_1, \ldots, Y_I$, we can approximate the expectation, which leads to the following approximated gradient descent:
$$m_{n+1} = m_n(1 - 2\rho) + \frac{2\rho}{I}\sum_{i=1}^I g(Y_i, m_n)\cdot Y_i;$$
now by taking $\rho = \frac{1}{2}$ we get $m_{n+1} = \frac{1}{I}\sum_{i=1}^I g(Y_i, m_n)\cdot Y_i$. So the approximated gradient descent with $\rho = \frac{1}{2}$ is exactly the max-max algorithm. However, the max-max algorithm for a finite group is proved to converge in a finite number of steps, which is not the case for gradient descent in general.
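The identification of the max-max update with a gradient step of size $\rho = \frac{1}{2}$ can be checked numerically on a toy sample. The sketch below assumes cyclic translations on $\mathbb{R}^4$ and uses the empirical gradient $2\left(m - \frac{1}{I}\sum_i g(Y_i, m)\cdot Y_i\right)$, which is valid at regular points with unique registrations; the function name `empirical_gradient` and the data are hypothetical.

```python
def shift(u, k):
    """Cyclic translation by k coordinates."""
    n = len(u)
    return [u[(i - k) % n] for i in range(n)]

def register(y, m):
    """Translate of y closest to m (exhaustive search over the finite group)."""
    return max((shift(y, k) for k in range(len(y))),
               key=lambda z: sum(a * b for a, b in zip(m, z)))

def empirical_gradient(m, sample):
    """Return (gradient, mean of registered data) at the point m."""
    registered = [register(y, m) for y in sample]
    mean = [sum(col) / len(sample) for col in zip(*registered)]
    grad = [2.0 * (a - b) for a, b in zip(m, mean)]
    return grad, mean

sample = [[0.0, 2.0, 1.0, 0.0],
          [1.0, 0.0, 0.0, 2.0],
          [0.5, 0.0, 2.5, 1.0]]
m = [0.0, 1.0, 0.0, 0.0]

grad, maxmax_update = empirical_gradient(m, sample)
rho = 0.5
gd_update = [a - rho * g for a, g in zip(m, grad)]   # one gradient step
print(gd_update, maxmax_update)
```

With any other step size the two updates differ, which is why the finite-step convergence of max-max does not carry over to gradient descent in general.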

Simulation on Synthetic Data
In this section, we consider data in a Euclidean space $\mathbb{R}^N$ equipped with its canonical dot product $\langle\cdot,\cdot\rangle$, and $G = \mathbb{Z}/N\mathbb{Z}$ acts on $\mathbb{R}^N$ by time translation on coordinates:
$$\bar\tau\cdot(x_1,\ldots,x_N) = (x_{1+\tau},\ldots,x_{N+\tau}),$$
where indexes are taken modulo $N$. This space models the discretization of functions defined on $[0, 1]$ with $N$ points. This action is found in [AAT07] and used for neuroelectric signals in [HCG+13]. The registration between two vectors can be made by an exhaustive search, but it is faster with the fast Fourier transform [CT65].

Max-Max Algorithm with a Step Function as Template
We display an example of a template and template estimation with the max-max algorithm on Figure 6a. This experiment was already conducted in [AAT07], but no explanation of the appearance of the bias was provided. We know from Section 2.4 that the max-max output is an empirical Karcher mean, and that this result can be obtained in a finite number of steps. Taking σ = 10 may seem extremely high, however the standard deviation of the noise at each point is not 10 but σ √ N = 1.25 which is reasonable.
The sample size is 10 5 , the algorithm stopped after 247 steps, andm the estimated template (in red on the Figure 6a) is not a focal points of the orbits [Yi], then Proposition 2.8 applies. We call empirical bias (noted EB) the quotient distance between the true template and the pointm given by the max-max result. On this experiment we have EB σ 0.11. Of course, one could think that we estimate the template with an empirical bias due to a too small sample size which induces fluctuation. To reply to this objection, we keep in memorym obtained with the max-max algorithm. If there was no inconsistency then we would have F (t0) ≤ F (m). We do not know the value of the variance F at these points, but thanks to the law of large number, we know that: Given a sample, we compute FI (t0) and FI (m) thanks to the definition of the empirical variance FI (13). We display the result on Figure 6b, this tends to confirm that F (t0) > F (m). In other word, the variance at the template is larger that the variance at the point given by the max-max algorithm.  Empirical variance at the template in blue and at the estimated template in red (b) Figure 6: Template t 0 and template estimationm on Figure 6a. Empirical variance at the template and template estimation with the max-max algorithm as a function of the size of the sample on Figure 6b. (a) Example of a template (a step function) and the estimated templatem with a sample size 10 5 in R 64 , is Gaussian noise and σ = 10. At the discontinuity points of the template, we observe a Gibbs-like phenomena; (b) Variation of F I (t 0 ) (in blue) and of F I (m) (in red) as a function of I the size of the sample. Since convergence is already reached, F (m), which is the limit of red curve, is below F (t 0 ): F (t 0 ) is the limit of the blue curve. Due to the inconsistency,m is an example of point such that F (m) < F (t 0 ). Figure 6a shows that the main source of the inconsistency was the discontinuity of the template. 
One may think that a continuous template would lead to a better behaviour. However, this is not the case, as shown in Figure 7. Even with a large number of observations created from a continuous template, we do not observe convergence to the template. In the example of Figure 7, the empirical bias satisfies $\mathrm{EB}/\sigma = 0.23$. In green we also display the mean of the data knowing the transformations; this produces a much better result, since in this case we have $\mathrm{EB}/\sigma = 0.04$.

3 Inconsistency in the Case of Non Invariant Distance under the Group Action

Notation and Hypothesis
In this Section, data still come from a Hilbert space $M$. However, we take a group of deformations $G$ which acts in a non invariant way on $M$. Starting from a template $t_0$, we consider a random deformation in the group $G$, namely a random variable $\Phi$ which takes values in $G$, and a standardized noise $\varepsilon$ in $M$ independent of $\Phi$. We suppose that our observable random variable is:
$$Y = \Phi\cdot t_0 + \sigma\varepsilon,$$
where $\sigma$ is the noise level. We suppose that $E(\|Y\|^2) < +\infty$, and we define the pre-variance of $Y$ in $M/G$ as the map defined by:
$$F(m) = E\Big(\inf_{g\in G}\|g\cdot m - Y\|^2\Big).$$
In this part we still study the inconsistency of template estimation by minimizing $F$.
We present two frameworks in which we can ensure the presence of inconsistency. In Section 3.3 we suppose that the group $G$ contains a non trivial subgroup $H$ which acts isometrically on $M$. However, some groups do not satisfy this hypothesis; that is why, in Section 3.4, we do not suppose that $G$ contains a subgroup acting isometrically, but we require that $G$ acts linearly on $M$. In both sections we prove inconsistency as soon as the variance $\sigma^2$ is large enough.
These hypotheses are reasonable: for example, the deformations considered in computational anatomy may include rotations, which form a subgroup $H$ of the diffeomorphic deformations that acts isometrically. Concerning the second case, an important example is:

Example 3.1. Let $G$ be a subgroup of the group of $C^\infty$ diffeomorphisms on $\mathbb{R}^n$. $G$ acts linearly on $L^2(\mathbb{R}^n)$ with the map:
$$(\varphi, f) \in G \times L^2(\mathbb{R}^n) \mapsto f\circ\varphi^{-1}.$$
Note that this action is not isometric: indeed, $f\circ\varphi^{-1}$ generally has a different $L^2$-norm than $f$, because a Jacobian determinant appears in the computation of the integral.
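To make the non-isometry of Example 3.1 concrete, the change of variables $x = \varphi(y)$ can be written out as follows. The dilation $\varphi(x) = 2x$ is only an illustrative choice and does not come from the original text.

```latex
\begin{align*}
\|f\circ\varphi^{-1}\|_{L^2}^2
  &= \int_{\mathbb{R}^n} f\big(\varphi^{-1}(x)\big)^2\,dx
   = \int_{\mathbb{R}^n} f(y)^2\,\big|\det D\varphi(y)\big|\,dy, \\
\varphi(x) = 2x \ \Longrightarrow\ 
\|f\circ\varphi^{-1}\|_{L^2}^2 &= 2^n\,\|f\|_{L^2}^2 .
\end{align*}
```

The norm is preserved for every $f$ only when $|\det D\varphi| = 1$ almost everywhere, which is the case for rotations but not for general diffeomorphisms.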

Where Did We Need an Isometric Action Previously?
Let $M$ be a Hilbert space, and $G$ a group acting on $M$. Can we define a distance in the quotient space $Q = M/G$, defined as the set of all the orbits? When the action is invariant, the orbits are parallel in the sense that $d_M(m, n) = d_M(g\cdot m, g\cdot n)$ for all $m, n \in M$ and for all $g \in G$. This implies that:
$$d_Q([m],[n]) = \inf_{g\in G} d_M(m, g\cdot n)$$
is a distance on $Q$. However, it is not necessarily the case when the action is no longer invariant. Let us take the following example:

Example 3.2. We call $C^\infty_{\mathrm{diff}}(\mathbb{R}^2)$ the set of the $C^\infty$ diffeomorphisms of $\mathbb{R}^2$. We equip $\mathbb{R}^2$ with its canonical Euclidean structure. We take $p = (-1, -1)$, $q = (1, 1)$ and $r = (2, 0)$ (see fig. 8).

Therefore, when the action is no longer invariant, a priori one cannot define a distance in the quotient anymore. If $Y$ is a random variable in $M$, the variance of its projection in the quotient can then no longer be defined through such a distance. However, $\inf_{g\in G}\|g\cdot a - b\|$ is non-negative and is equal to zero if $a = b$, so $\inf_{g\in G}\|g\cdot a - b\|$ is a pre-distance in $M$. Then $\inf_{g\in G}\|g\cdot m - Y\|$ measures the discrepancy between the random point $Y$ and the current point $m$. Even if the discrepancy measure is not symmetric or does not satisfy the triangle inequality, we can still define $F(x) = E(\inf_{g\in G}\|g\cdot x - Y\|^2)$ and call it the pre-variance of the projection of $Y$ into $M/G$, if $E(\|Y\|^2) < +\infty$. The elements which minimize this function are the elements whose orbits are the closest to the random point $Y$. Hence, we wonder if the template can be estimated by minimizing this pre-variance. Note that, once again, $F(g\cdot x) = F(x)$ for all $x \in M$ and $g \in G$. Then the pre-variance is well defined in the quotient space by $[x] \mapsto F(x)$. It is not surprising to use a discrepancy measure which is not a distance: for instance, the Kullback-Leibler divergence [KL51] is not symmetric although it is commonly used.
In the proof of inconsistency of Theorem 2.4, we used that the action was isometric in order to simplify the expansion of the variance in Equation (6): with $\|g\cdot Y\|^2 = \|Y\|^2$, there was only one term which depends on $g$, namely $\langle g\cdot m, Y\rangle$, and the two other terms could be pulled out of the infimum. When the action is no longer isometric, this trick is no longer available. To remedy this situation, in this article, we require that the orbit of the template is a bounded set.
In the following, we prove inconsistency even with a non isometric action (but only when the noise level is large enough if the template is not a fixed point). The different proofs always follow the same sketch: we look for a point $m$ such that $F(m) < F(t_0)$. To do so, it suffices to find an upper bound of $F(m)$ and a lower bound of $F(t_0)$, and to compare these two bounds.

Non Invariant Group Action, with a Subgroup Acting Isometrically
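In this subsection, $G$ acts on a Hilbert space $M$. We assume that there exists a subgroup $H \subset G$ such that $H$ acts isometrically on $M$. As $H$ is included in $G$, we deduce a useful link between the variance of $Y$ projected in $M/H$, denoted $F_H$, and the pre-variance of $Y$ projected in $M/G$:
$$F(m) = E\Big(\inf_{g\in G}\|g\cdot m - Y\|^2\Big) \le E\Big(\inf_{h\in H}\|h\cdot m - Y\|^2\Big) = F_H(m).$$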

Inconsistency when the Template Is a Fixed Point
We begin by assuming that the template $t_0$ is a fixed point under the action of $G$:

Proposition 3.3. Suppose that $t_0$ is a fixed point under the action of the group $G$. Let $\varepsilon$ be a standardized noise whose support is not included in the set of fixed points under the action of $H$, and $Y = \Phi\cdot t_0 + \sigma\varepsilon = t_0 + \sigma\varepsilon$. Then $t_0$ is not a minimizer of the pre-variance $F$.
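Proof. We write $F_H(m) = E(\inf_{h\in H}\|h\cdot m - Y\|^2)$ for the variance of $Y$ projected in $M/H$. The proof is based on three points:

1. By Corollary 2.5 applied to the isometric action of $H$, $t_0$ does not minimize $F_H$: there exists $m \in M$ such that $F_H(m) < F_H(t_0)$ (15). Note that in order to apply Corollary 2.5, we do not need $\Phi$ to take its values in $H$, because $t_0$ is a fixed point.

2. Because we take the infimum over more elements, we have $F(m) \le F_H(m)$ (16).

3. As $t_0$ is a fixed point under the action of $G$ and under the action of $H$, we have $F(t_0) = F_H(t_0) = E(\|t_0 - Y\|^2)$ (17).

With Equations (15)-(17), we get $F(m) \le F_H(m) < F_H(t_0) = F(t_0)$, and we conclude that $t_0$ does not minimize $F$.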

Inconsistency in the General Case for the Template
The following Proposition 3.4 tells us that when σ is large enough then there is an inconsistency.
Proposition 3.4. We suppose that the template is not a fixed point and that its orbit under the group $G$ is bounded. We consider $A \ge \sup_{g\in G}\frac{\|g\cdot t_0\|}{\|t_0\|}$ and $a \le \inf_{g\in G}\frac{\|g\cdot t_0\|}{\|t_0\|}$. Note that $a \le 1 \le A$ and that we have: $\forall g \in G$, $a\|t_0\| \le \|g\cdot t_0\| \le A\|t_0\|$.
We note: We suppose that θH > 0. If σ is bigger than a critical noise level noted σc defined as: Then we have inconsistency.
Note that in Section 2.2 we have proved inconsistency in the isometric case as soon as $\sigma > \frac{2\|t_0\|}{K}$, where $K \ge \theta_H$. In this proposition we thus find an analogous sufficient condition on $\sigma$, with a corrective term due to the non invariant action.
We have shown in [DATP17] that if the orbit of the template $[t_0]_H$ is a manifold, then $\theta_H > 0$ as soon as the support of $\varepsilon$ is not included in $T_{t_0}[t_0]^{\perp}$ (the normal space of the orbit of the template $t_0$ at the point $t_0$). If $[t_0]_H$ is not a manifold, we have also seen in [DATP17] that $\theta_H > 0$ as soon as $t_0$ is an accumulation point of $[t_0]_H$ and the support of $\varepsilon$ contains a ball $B(0, r)$. Hence, $\theta_H > 0$ is a rather generic condition.

Condition (18) can be reformulated as follows: there is inconsistency as soon as the signal to noise ratio $\frac{\|t_0\|}{\sigma}$ is sufficiently small. We remark the presence of the constants $\theta(t_0)$ and $\theta_H$ in Proposition 3.4. Constants of this kind were already present in the isometric case in the form $\theta\big(\frac{t_0}{\|t_0\|}\big) = \frac{1}{\|t_0\|}E(\sup_{g\in G}\langle t_0, g\cdot\varepsilon\rangle)$. Due to the polarization identity (2), we can state that it measures how much the template looks like the noise after registration, but only in the isometric case. However, we can expect this constant to play an analogous role in the non isometric case.

$\sqrt{20}$ then there is inconsistency. By the Cauchy-Schwarz inequality we have $E(\|\varepsilon\|) \le \sqrt{E(\|\varepsilon\|^2)} = 1$, thus the signal to noise ratio has to be rather small in order to fulfill this condition.

Proof of Proposition 3.4
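We define the following values:
$$\lambda_H = \frac{1}{\|t_0\|^2}E\Big(\sup_{h\in H}\langle h\cdot t_0, Y\rangle\Big) \quad\text{and}\quad \lambda(t_0) = \frac{1}{\|t_0\|^2}E\Big(\sup_{g\in G}\langle g\cdot t_0, Y\rangle\Big).$$
Note that $\lambda_H$ and $\lambda(t_0)$ are registration scores whose definitions are the same as that of the registration score used in the proof of Theorem 2.4 in Section 2 (only the normalization by $\|t_0\|$ is different). The proof of Proposition 3.4 is based on the following Lemma:

Lemma 3.6. If:
$$\lambda_H \ge 0 \quad (19)$$
and
$$a^2 - 2\lambda(t_0) + \lambda_H^2 > 0, \quad (20)$$
then $t_0$ is not a minimizer of the pre-variance of $[Y]$ in $M/G$.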
How can condition (20) be understood? In order to answer that question, let us imagine that $G = H$ acts isometrically. Then $a$ can be set to 1 and $\lambda(t_0) = \lambda_H$, so condition (20) becomes $\lambda_H^2 - 2\lambda_H + 1 = (\lambda_H - 1)^2 > 0$, and the conditions of Theorem 4.2 of [DATP17] aimed to ensure that $\lambda_H > 1$. Now let us return to the non invariant case: if $H$ is strictly included in $G$ such that $a$ is close enough to 1 and $\lambda(t_0)$ is close enough to $\lambda_H$, then one can think that condition (20) still holds. However, how close is "close enough" seems hard to quantify.
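Proof of Lemma 3.6. We write $F_H$ for the variance of $Y$ projected in $M/H$. The proof is based on the following points:

1. $F(\lambda_H t_0) \le F_H(\lambda_H t_0)$;

2. $F_H(\lambda_H t_0) < F(t_0)$.

With items 1 and 2 we get that $F(\lambda_H t_0) < F(t_0)$. Item 1 is just based on the fact that in the map $F$, we take the infimum over a larger set than in $F_H$. We now prove item 2. In order to do that, we expand the two quantities $F_H(\lambda_H t_0)$ and $F(t_0)$. For the first one, we use the fact that $H$ acts isometrically between Equations (21) and (22), together with the fact that $\lambda_H \ge 0$, because $\inf_{s\in S} -\lambda s = -\lambda \sup_{s\in S} s$ is true for any subset $S$ of $\mathbb{R}$ if $\lambda \ge 0$. Comparing the two expansions then gives item 2, thanks to hypothesis (20).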
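The comparison behind item 2 can be sketched as follows. This is a reconstruction, not the original Equations (21) and (22): it assumes that $\lambda_H = \frac{1}{\|t_0\|^2}E(\sup_{h\in H}\langle h\cdot t_0, Y\rangle)$ and $\lambda(t_0) = \frac{1}{\|t_0\|^2}E(\sup_{g\in G}\langle g\cdot t_0, Y\rangle)$, that $H$ acts linearly and isometrically, that $F_H$ denotes the variance of $Y$ projected in $M/H$, and that Condition (20) reads $a^2 - 2\lambda(t_0) + \lambda_H^2 > 0$.

```latex
\begin{align*}
F_H(\lambda_H t_0)
  &= E\Big(\inf_{h\in H}\|\lambda_H\, h\cdot t_0 - Y\|^2\Big) \\
  &= \lambda_H^2\|t_0\|^2
     - 2\lambda_H\,E\Big(\sup_{h\in H}\langle h\cdot t_0, Y\rangle\Big)
     + E(\|Y\|^2)
   = -\lambda_H^2\|t_0\|^2 + E(\|Y\|^2), \\
F(t_0)
  &= E\Big(\inf_{g\in G}\big(\|g\cdot t_0\|^2
     - 2\langle g\cdot t_0, Y\rangle + \|Y\|^2\big)\Big)
   \ge \big(a^2 - 2\lambda(t_0)\big)\|t_0\|^2 + E(\|Y\|^2), \\
F(t_0) - F_H(\lambda_H t_0)
  &\ge \big(a^2 - 2\lambda(t_0) + \lambda_H^2\big)\|t_0\|^2 > 0 .
\end{align*}
```

The second equality of the first computation is where isometry ($\|h\cdot t_0\| = \|t_0\|$) and $\lambda_H \ge 0$ are used; the lower bound on $F(t_0)$ only uses $\|g\cdot t_0\| \ge a\|t_0\|$.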
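Proof of Proposition 3.4. In order to prove Proposition 3.4, all we have to do is to prove that $\lambda_H \ge 0$ and that Condition (20) is fulfilled when $\sigma > \sigma_c$. Firstly, thanks to the Cauchy-Schwarz inequality, $\lambda_H$ is bounded from below by a quantity which is non-negative as soon as $\sigma \ge \frac{A\|t_0\|}{\theta_H}$. As $\sigma > \sigma_c \ge \frac{A\|t_0\|}{\theta_H}$, we get $\lambda_H \ge 0$, which proves (19). In the same way, we bound $\lambda(t_0)$ from above, and we deduce a lower bound of $a^2 - 2\lambda(t_0) + \lambda_H^2$ which is a quadratic polynomial $P(\sigma)$. For $\sigma > \sigma_c$, where $\sigma_c$ is the largest solution of the quadratic equation $P(\sigma) = 0$, we get $a^2 - 2\lambda(t_0) + \lambda_H^2 > 0$, and template estimation is inconsistent thanks to Lemma 3.6. The critical $\sigma_c$ is exactly the one given by Proposition 3.4.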

Linear Action
The result of the previous part has a drawback: it requires that the group of deformations contains a non trivial subgroup which acts isometrically. We now remove this hypothesis, but we require that the group acts linearly on data.

Inconsistency
In this Subsection we suppose that the group $G$ acts linearly on $M$. Once again, we can give a criterion on the noise level which leads to inconsistency:

Proposition 3.7. We suppose that the orbit of the template is bounded with: $\exists a \ge 0$, $A > 0$ such that $\forall g \in G$, $a\|t_0\| \le \|g\cdot t_0\| \le A\|t_0\|$.
We suppose that $A < \sqrt{2}$. In other words, the deformation of the template can multiply the norm of the template by less than $\sqrt{2}$. We also suppose that:
$$\theta(t_0) > 0. \quad (23)$$
Then there is inconsistency as soon as the noise level $\sigma$ is larger than a critical noise level $\sigma_c$.

Once again we find a condition which is similar to the isometric case, but due to the non invariant action we have here a corrective term which depends on $A$ and $a$. Note that as $G$ does not act isometrically, the results in [DATP17] do not apply in order to fulfill Condition (23). However, it is easy to fulfill this Condition thanks to the following Proposition:

Proposition 3.9. If $t_0$ is not a fixed point, and if the support of $\varepsilon$ contains a ball $B(0, \rho)$ for $\rho > 0$, then $\theta(t_0) > 0$.

Remark 3.10. It is possible to remove the condition $A < \sqrt{2}$ in Proposition 3.7. Indeed, let $h \in G$ be such that $\|h\cdot t_0\|$ is close enough to $\sup_{g\in G}\|g\cdot t_0\|$ for the constant $A$ associated with $h\cdot t_0$ to satisfy the bound. The template $t_0$ can be replaced by $h\cdot t_0$, since $\Phi\cdot t_0 + \sigma\varepsilon$ is equal to $(\Phi h^{-1})\cdot(h\cdot t_0) + \sigma\varepsilon$, and we apply Proposition 3.7 to the new template $h\cdot t_0$. We get that $h\cdot t_0$ does not minimize the pre-variance $F$, with $A \le \sqrt{2}$ (because the new template is $h\cdot t_0$). Since $h\cdot t_0$ does not minimize $F$, the original template $t_0$ does not minimize the pre-variance $F$ either, since $F(t_0) = F(h\cdot t_0)$.
This changes the critical σc since we apply Proposition 3.7 to h · t0 instead of t0 itself.

Proofs of Proposition 3.7 and Proposition 3.9
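As in Section 3.3 we first prove a Lemma:

Lemma 3.11. We define:
$$\lambda(t_0) = \frac{1}{\|t_0\|^2}E\Big(\sup_{g\in G}\langle g\cdot t_0, Y\rangle\Big).$$
Suppose that $\lambda(t_0) \ge 0$ and that:
$$a^2 - 2\lambda(t_0) + (2 - A^2)\lambda(t_0)^2 > 0. \quad (24)$$
Then $t_0$ is not a minimizer of $F$.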
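Proof of Lemma 3.11. Since $a\|t_0\| \le \|g\cdot t_0\| \le A\|t_0\|$ for all $g \in G$, the linearity of the action gives matching bounds on $\|g\cdot(\lambda t_0)\|$ for $\lambda \ge 0$. Recalling the definition of $\lambda(t_0)$ and combining it with these bounds (Equations (25) and (26)), we get an upper bound of $F(\lambda(t_0)t_0)$ and a lower bound of $F(t_0)$. Note that we use the fact that the action is linear in Equation (27). Comparing the two bounds, we obtain that $t_0$ is not a minimizer of $F$, since $F(\lambda(t_0)t_0) < F(t_0)$.

Proof of Proposition 3.7. By solving a quadratic inequality in $\lambda(t_0)$, we remark that Condition (24) holds as soon as $\lambda(t_0)$ is large enough (Condition (28)). Besides, as in Section 3.3.2, we can take a lower bound of $\lambda(t_0)$ by decomposing $Y = \Phi\cdot t_0 + \sigma\varepsilon$ and applying the Cauchy-Schwarz inequality, $\langle\Phi\cdot t_0, g\cdot t_0\rangle \ge -A^2\|t_0\|^2$. Thanks to Condition (28) and the fact that $\sigma > \sigma_c$, we get that $\lambda(t_0) \ge 0$ and that Condition (24) is fulfilled. Thus, there is inconsistency, according to Lemma 3.11.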
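The role of the bound $A < \sqrt{2}$ in Lemma 3.11 can be seen from the following sketch. It is a reconstruction under stated assumptions, not the original Equations (25)-(27): we write $\lambda = \lambda(t_0) = \frac{1}{\|t_0\|^2}E(\sup_{g\in G}\langle g\cdot t_0, Y\rangle)$, assume $\lambda \ge 0$, and use the linearity of the action in the form $g\cdot(\lambda t_0) = \lambda\, g\cdot t_0$.

```latex
\begin{align*}
F(\lambda t_0)
  &= E\Big(\inf_{g\in G}\big(\lambda^2\|g\cdot t_0\|^2
     - 2\lambda\langle g\cdot t_0, Y\rangle + \|Y\|^2\big)\Big) \\
  &\le \lambda^2 A^2\|t_0\|^2
     - 2\lambda\,E\Big(\sup_{g\in G}\langle g\cdot t_0, Y\rangle\Big)
     + E(\|Y\|^2)
   = \lambda^2 (A^2 - 2)\|t_0\|^2 + E(\|Y\|^2), \\
F(t_0)
  &\ge \big(a^2 - 2\lambda\big)\|t_0\|^2 + E(\|Y\|^2), \\
F(t_0) - F(\lambda t_0)
  &\ge \big(a^2 - 2\lambda + (2 - A^2)\lambda^2\big)\|t_0\|^2 .
\end{align*}
```

When $A < \sqrt{2}$ the coefficient $2 - A^2$ is positive, so the right-hand side is positive for $\lambda$ large enough, which is what a large noise level $\sigma$ provides.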
Figure 9: Representation of the three cases. In each of them, we can find an $x$ in the support of the noise such that $\langle x, g_0\cdot t_0\rangle > \langle x, t_0\rangle$, and by continuity of the dot product, $\langle\varepsilon, g_0\cdot t_0\rangle > \langle\varepsilon, t_0\rangle$ is an event with a non zero probability (for instance the ball in gray). This is enough in order to show that $\theta(t_0) > 0$. (a) Case 1: $t_0$ and $g\cdot t_0$ are linearly independent; (b) Case 2: $g\cdot t_0$ is proportional to $t_0$ with a factor $> 1$; (c) Case 3: $g\cdot t_0$ is proportional to $t_0$ with a factor $< 1$.

Example of a Template Estimation Which is Consistent
In order to underline the importance of the hypotheses, we give an example where the method is consistent:

Example 3.12. Let $M$ be a Hilbert space and $V$ a closed linear subspace of $M$. Then $G = V$ acts on $M$ by (see fig. 10):
$$(v, m) \in G\times M \mapsto m + v.$$
This action is not isometric: indeed, $m \mapsto m + v$ is not linear (except if $v = 0$). However, this action is invariant. Let us consider $V^\perp$, the orthogonal space of $V$. The variance in the quotient space is:
$$F(m) = E\Big(\inf_{v\in V}\|m + v - Y\|^2\Big) = E\big(\|p(m) - p(Y)\|^2\big),$$
where $p : M \to V^\perp$ is the orthogonal projection on $V^\perp$. Then it is clear that $t_0$ minimizes $F$. In fact, $s : [m] \mapsto p(m)$ is just a congruent section of the quotient (see Section 2.1). Here, once again, we see the role played by the congruent section (when it exists) in order to study the consistency.
Hence, is there a contradiction with Proposition 3.4 or Proposition 3.7, which prove inconsistency as soon as the noise level is large enough? In Proposition 3.4, we require that there is a subgroup acting isometrically. In this example the only element which acts linearly is the identity element $m \mapsto m + 0$, so $H = \{0\}$ is the only possibility. However, the support of the noise should not be included in the set of fixed points under the action of $H$. Here, all points are fixed under $H$, hence it is not possible to fulfill this condition. Example 3.12 is therefore not in contradiction with Proposition 3.4. It is not in contradiction with Proposition 3.7 either, since $G$ does not act linearly on data.

Figure 10: In the case of affine translation by vectors of $V$, the orbits are affine subspaces parallel to $V$. The distance between two orbits $[x]$ and $[y]$ is given by the distance between the orthogonal projections of $x$ and $y$ on $V^\perp$. This is an example where template estimation is consistent.
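The consistency of Example 3.12 can be checked numerically with the following minimal sketch. The dimensions, the random subspace $V$, the law of the translations and the function names are all hypothetical choices made for illustration; the noise is scaled so that $E(\|\varepsilon\|^2) = 1$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, I, sigma = 20, 5, 20000, 10.0

# Orthonormal basis Q of a hypothetical d-dimensional subspace V of R^N.
Q, _ = np.linalg.qr(rng.standard_normal((N, d)))

def project_orth(X):
    """Orthogonal projection p onto V^perp, applied to a vector or to each row of X."""
    return X - (X @ Q) @ Q.T

t0 = rng.standard_normal(N)                       # hypothetical template
v = 5.0 * (rng.standard_normal((I, d)) @ Q.T)     # random translations, all in V
eps = rng.standard_normal((I, N)) / np.sqrt(N)    # standardized: E||eps||^2 = 1
Y = t0 + v + sigma * eps                          # Y = Phi . t0 + sigma * eps

def empirical_variance(m):
    """F_I(m) = mean_i inf_{v in V} ||m + v - Y_i||^2 = mean_i ||p(m) - p(Y_i)||^2."""
    return float(np.mean(np.sum((project_orth(m) - project_orth(Y)) ** 2, axis=1)))

# Within V^perp, the empirical variance is minimized by the mean of the projected data.
m_hat = project_orth(Y).mean(axis=0)
gap = float(np.linalg.norm(m_hat - project_orth(t0)))

# Unit perturbation of the template orthogonal to V: it should increase the variance.
u = project_orth(rng.standard_normal(N))
u /= np.linalg.norm(u)

print(f"||m_hat - p(t0)|| = {gap:.4f}")
print(f"F_I(t0) = {empirical_variance(t0):.4f}, F_I(t0 + u) = {empirical_variance(t0 + u):.4f}")
```

Even with a noise level as high as in the experiments of Section 2, the empirical minimizer stays close to the orbit of $t_0$ here, in contrast with the shift-group experiments.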

Inconsistency with Non Invariant Action and Regularization
In practice, people add a regularization term to the function they minimize, in LDDMM [BMTY05, DPC+14], in Demons [LGP+13], etc. Indeed, given two points, one does not necessarily want to fit one exactly onto the other: even if one deformation matches these two points exactly, it may be an unrealistic deformation. So far, we have not studied the effect of such a term on the inconsistency.

Case of Deformations Close to the Identity Element of G
If we suppose that the deformation $\Phi$ of the template is close to the identity, it is useless to take the infimum over $G$, because $G$ contains big deformations. Perhaps one of these big deformations can reach the infimum in $F$, but this element is not the one which deformed the template in the generative model. Such big deformations should therefore not be taken into account. That is why, if we suppose that $G$ can be equipped with a distance $d_G$, we can assume that there exists $r > 0$ such that the deformation $\Phi$ belongs almost surely to $B = B(e, r) = \{g \in G,\ d_G(e, g) < r\}$.
Instead of defining $F(m)$ as $E(\inf_{g\in G}\|g\cdot m - Y\|^2)$, one can define $F(m) = E(\inf_{g\in B}\|g\cdot m - Y\|^2)$, and the previous proofs will still be true, when replacing for instance $\lambda(t_0)$ by $\lambda(t_0) = \frac{1}{\|t_0\|^2}E(\sup_{g\in B}\langle g\cdot t_0, Y\rangle)$, etc. Likewise, we need to replace the hypothesis "the support of $\varepsilon$ is not included in the set of fixed points" by "the support of $\varepsilon$ is not included in the set of fixed points under the action restricted to $B$".
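Note that restricting ourselves to $B$ is equivalent to adding the following regularization to the function $F$:
$$F(m) = E\Big(\inf_{g\in G}\big(\|g\cdot m - Y\|^2 + \mathrm{Reg}(g)\big)\Big), \quad\text{with } \mathrm{Reg}(g) = 0 \text{ if } g \in B \text{ and } \mathrm{Reg}(g) = +\infty \text{ otherwise}.$$
Moreover, considering only the elements in $B$ will automatically satisfy the condition $A < \sqrt{2}$ in Proposition 3.7, as long as the group $G$ acts continuously on the template and $r$ is small enough.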
3.6.2 Inconsistency in the Case of a Group Acting Linearly with a Bounded Regularization

In this Section we suppose that the group $G$ acts linearly. We also suppose that $A < \sqrt{2}$. The regularization term is a bounded map $\mathrm{Reg} : G \to [0, \Omega]$. Within this framework, we are still able to prove that there is inconsistency as soon as the noise level is large enough:

Proposition 3.13. Let $G$ be a group acting linearly on $M$. We suppose that the orbit of the template $t_0$ is bounded with $A = \sup_{g\in G}\frac{\|g\cdot t_0\|}{\|t_0\|} < \sqrt{2}$; the generative model is still $Y = \Phi\cdot t_0 + \sigma\varepsilon$. We define the pre-variance as:
$$F(m) = E\Big(\inf_{g\in G}\big(\|g\cdot m - Y\|^2 + \mathrm{Reg}(g)\big)\Big).$$
Then, as soon as the noise level $\sigma$ is large enough, $t_0$ is not a minimizer of $F$.
The proof is exactly the same as the proof of Proposition 3.7: we take 0 as a lower bound of the regularization term in the lower bound of $F(t_0)$, and we take $\Omega$ as an upper bound of the regularization term in the upper bound of $F(\lambda(t_0)t_0)$. We solve a similar quadratic equation in order to find the critical $\sigma$.

Conclusions and Discussion
We provided an asymptotic behavior of the consistency bias when the noise level $\sigma$ tends to infinity in the case of an isometric action. As a consequence, the inconsistency cannot be neglected when $\sigma$ is large. When the action is no longer isometric, inconsistency has also been shown when the noise level is large.
However, we have not answered this question: can the inconsistency be neglected? When the noise level is small enough, the consistency bias is small [MHP16,DATP17], hence it can be neglected. Note that the quotient space is not a manifold; this prevents us from using a priori the Central Limit theorem for manifolds proved in [BB08]. However, if the Central Limit theorem could be applied to the quotient space, the fluctuations would induce an error approximately equal to $\frac{\sigma}{\sqrt{I}}$, and if $K \ll \frac{1}{\sqrt{I}}$, then the inconsistency could be neglected because it is small compared to the fluctuations. One way to avoid the inconsistency is to use another framework, for instance a Bayesian paradigm [CDH16].
In the numerical experiments we presented, we have seen that the estimated template is crisper than the true template. The intuition is that the estimated template in computational anatomy with a group of diffeomorphisms is also more detailed. However, the true template is almost always unknown. It is then possible to believe that the computation of the template succeeded in capturing small details of the template, while these details are just an artifact due to the inconsistency. Moreover, in order to tackle this question, one needs a good model of the noise: for instance in [KSW11], the observations are curves, so what is a relevant noise in the space of curves?
In this article, we have considered actions which do not leave the distance invariant. Although we have only shown the inconsistency when the noise level is large enough, the inequalities used were not optimal at all; future works could surely improve these results and prove that inconsistency appears for small noise levels. Moreover, a quantification of the inconsistency should be established.