Efficiency Bound of Local Z-Estimators on Discrete Sample Spaces

Many statistical models over a discrete sample space have a normalization constant that is computationally intractable. As a result, the maximum likelihood estimator is often impractical. To circumvent this computational difficulty, alternative estimators such as pseudo-likelihood and composite likelihood, which require only a local computation over the sample space, have been proposed. In this paper, we present a theoretical analysis of such localized estimators. The asymptotic variance of localized estimators depends on the neighborhood system on the sample space. We investigate the relation between the neighborhood system and the estimation accuracy of localized estimators. Moreover, we derive the efficiency bound. The theoretical results are applied to investigate the statistical properties of existing estimators and some extended ones.


Introduction
For many statistical models on a discrete sample space, the computation of the normalization constant is often intractable. Because of that, the maximum likelihood estimator (MLE) is not of practical use for estimating probability distributions, although the MLE has some nice theoretical properties such as statistical consistency and efficiency under some regularity conditions [1].
In order to circumvent the computation of the normalization constant, alternative estimators that require only a local computation over the sample space have been proposed. In this paper, estimators based on such a concept are called localized estimators. Examples of localized estimators include pseudo-likelihood [2], composite likelihood [3,4], ratio matching [5,6], proper local scoring rules [7,8], and many others. These estimators are used for discrete statistical models such as conditional random fields [9], Boltzmann machines [10], restricted Boltzmann machines [11], discrete exponential family harmoniums [12], and Ising models [13].
In this paper, we present a theoretical analysis of localized estimators. We use the standard tools of statistical asymptotic theory. In our analysis, a class of localized estimators including pseudo-likelihood and composite likelihood is treated as an M-estimator or Z-estimator, which is an extension of the MLE [1]. The localized estimators require local computation around a neighborhood of observed points. Hence, the asymptotic variance of the localized estimator depends on the size of the neighborhood. We investigate the relation between the estimation accuracy and the neighborhood system. A similar result is given by [14], in which asymptotic variances of specific composite likelihoods are compared. In our approach, we consider a stochastic variant of localized estimators, and derive a general result that a larger neighborhood leads to a more efficient estimator under a simple condition. The pseudo-likelihood and composite likelihood are obtained as the expectation of a stochastic localized estimator. We derive the exact efficiency bound for the expected localized estimator. As far as we know, the derivation of the efficiency bound is a new result for a class of localized estimators, though upper and lower bounds have been proposed [14] for some localized estimators.
The rest of the paper is organized as follows. In Section 2, we introduce basic concepts such as pseudo-likelihood, composite likelihood, and Z-estimators. Section 3 is devoted to defining the stochastic local Z-estimator associated with a neighborhood system over the discrete sample space. In Section 4, we study the relation between the neighborhood system and the asymptotic efficiency of the stochastic local Z-estimator. In Section 5, we define the local Z-estimator as the expectation of the stochastic local Z-estimator, and present its efficiency bound. The theoretical results are applied to study asymptotic properties of existing estimators and some extended ones. Finally, Section 6 concludes the paper with discussions.

Preliminaries
M-estimators and Z-estimators were proposed as an extension of the MLE. In practice, M-estimators and Z-estimators are often computationally demanding due to the normalization constant in statistical models. To circumvent this computational difficulty, localized estimators have been proposed. We introduce some existing localized estimators, especially on discrete sample spaces. In later sections, we consider statistical properties of a localized variant of Z-estimators.
Let us summarize the notation used throughout the paper. Let $\mathbb{R}$ be the set of all real numbers. The discrete sample space is denoted as $X$. The statistical model $p_\theta(x)$ for $x \in X$ with the parameter $\theta \in \Theta \subset \mathbb{R}^d$ is also expressed as $p(x;\theta)$. A vector $a$ usually denotes a column vector, and $\cdot^T$ denotes the transposition of a vector or matrix. For a linear space $T$ and an integer $d$, $(T)^d$ denotes the $d$-fold product space of $T$, and the element $c \in (T)^d$ is expressed as $c = (c_1, \ldots, c_d)$. The sum of two subspaces $T_1$ and $T_2$ that are orthogonal to each other is denoted as $T_1 \oplus T_2$. The indicator function is denoted as $1[A]$, which takes 1 if $A$ is true and 0 otherwise.

M- and Z-Estimators
Suppose that samples $x_1, \ldots, x_m$ are independent and identically distributed (i.i.d.) from the probability $p(x)$ over the discrete sample space $X$. A statistical model $p_\theta(x) = p(x;\theta)$ with the parameter $\theta \in \Theta \subset \mathbb{R}^d$ is used to estimate $p(x)$. In this paper, our concern is the statistical efficiency of estimators. Hence, we suppose that the statistical model includes $p(x)$.
The MLE is commonly used to estimate $p(x)$. It uses the negative log-likelihood of the model, $-\log p_\theta(x)$, as the loss function, and the estimator is given by the minimizer of its empirical mean, $-\frac{1}{m}\sum_{i=1}^{m} \log p_\theta(x_i)$. Generally, the estimator obtained as the minimizer of a loss function is referred to as the M-estimator. The MLE is an example of an M-estimator. When the loss function is differentiable, the gradient of the loss function vanishes at the estimated parameter. Instead of minimizing the loss function, a solution of the system of equations also provides an estimator of the parameter. Such an estimator is called the Z-estimator [1]. In the MLE, the system of equations is given as
$$\frac{1}{m}\sum_{i=1}^{m} \nabla \log p_\theta(x_i) = 0,$$
where $0 \in \mathbb{R}^d$ is the null vector. The gradient $\nabla \log p_\theta(x)$ is known as the score function of the model $p_\theta(x)$. In this paper, the score function is denoted as
$$u_\theta(x) = \nabla \log p_\theta(x).$$
In general, the Z-estimator is defined as the solution of the system of equations
$$\frac{1}{m}\sum_{i=1}^{m} f_\theta(x_i) = 0,$$
where the $\mathbb{R}^d$-valued function $f_\theta(x) = f(x;\theta)$ is referred to as the identification function [15,16]. In the M-estimator, the identification function is given as the gradient of the loss function. In general, however, the identification function itself is not necessarily expressed as the gradient of a loss function, if it is not integrable. The identification function $f_\theta(x)$ is also called the Z-estimator with some abuse of terminology.
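As a small illustration of solving an estimating equation, the following sketch computes a Z-estimator for a one-parameter toy model. The model $p_\theta(x) \propto \exp(\theta x)$ on the support {0, 1, 2, 3}, the sample values, and all function names are assumptions made for this example only; they do not appear in the paper. The score of this model is $u_\theta(x) = x - E_\theta[x]$, and the empirical score equation is solved by bisection.

```python
import math

# Hypothetical toy model (not from the paper): p_theta(x) ∝ exp(theta * x) on {0, 1, 2, 3}.
SUPPORT = [0, 1, 2, 3]

def model_mean(theta):
    """Mean of x under p_theta, computed by direct enumeration."""
    weights = [math.exp(theta * x) for x in SUPPORT]
    z = sum(weights)
    return sum(x * w for x, w in zip(SUPPORT, weights)) / z

def score(x, theta):
    """Score u_theta(x) = d/dtheta log p_theta(x) = x - E_theta[x]."""
    return x - model_mean(theta)

def z_estimate(samples, lo=-20.0, hi=20.0, tol=1e-10):
    """Solve (1/m) sum_i u_theta(x_i) = 0 by bisection.

    The empirical mean of the score is decreasing in theta,
    because model_mean is increasing in theta.
    """
    def g(theta):
        return sum(score(x, theta) for x in samples) / len(samples)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

samples = [0, 1, 1, 2, 3, 3, 2, 1]
theta_hat = z_estimate(samples)
sample_mean = sum(samples) / len(samples)
```

At the solution, the model mean matches the sample mean, which is the usual moment-matching property of the MLE in an exponential family. The enumeration in `model_mean` is exactly the normalization step that becomes intractable on large sample spaces.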
The pseudo-likelihood does not require the normalization constant, and it satisfies the statistical consistency of the parameter estimation [2,17]. The identification function of the corresponding Z-estimator is obtained as the gradient vector of the loss function (2).
Example 2 (Composite likelihood). The composite likelihood was proposed as an extension of the pseudo-likelihood [3]. Suppose that $X$ is expressed as the product space as in Example 1. For the index subset $A \subset \{1, \ldots, n\}$, let $x_A = (x_i)_{i \in A}$ be the subvector of $x \in X$. For each $\ell = 1, \ldots, M$, suppose that $A_\ell$ and $B_\ell$ are a pair of disjoint subsets in $\{1, \ldots, n\}$, and let $C_\ell$ be the complement of the union $A_\ell \cup B_\ell$, i.e., $C_\ell = (A_\ell \cup B_\ell)^c$. Given positive constants $\gamma_1, \ldots, \gamma_M$, the loss function of the composite likelihood, $S_{\mathrm{CL}}$, is defined as
$$S_{\mathrm{CL}}(x;\theta) = -\sum_{\ell=1}^{M} \gamma_\ell \log p_\theta(x_{A_\ell} | x_{B_\ell}).$$
The composite likelihood using the subsets $A_\ell = \{\ell\}$, $B_\ell = A_\ell^c$ and the positive constant $\gamma_\ell = 1$ for $\ell = 1, \ldots, n$ yields the pseudo-likelihood. Like the pseudo-likelihood, the composite likelihood has statistical consistency under some regularity conditions [4].
Originally, the pseudo-likelihood and composite likelihood were proposed to deal with spatial data [2,3]. As a generalization of these estimators, a localized variant of scoring rules works efficiently for the statistical analysis of discrete spatial data [18].
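The following sketch illustrates why the pseudo-likelihood avoids the normalization constant. It assumes a hypothetical Ising chain $p_\theta(x) \propto \exp(\theta \sum_i x_i x_{i+1})$ on $\{-1,+1\}^n$ and the standard negative log pseudo-likelihood $-\sum_k \log p_\theta(x_k | x_{-k})$; the model and the function names are illustrative assumptions, not taken from the paper. The closed-form conditional is compared against a brute-force conditional built from unnormalized weights.

```python
import math

def neighbor_sum(x, k):
    """Sum of the chain neighbors of site k."""
    s = 0
    if k > 0:
        s += x[k - 1]
    if k < len(x) - 1:
        s += x[k + 1]
    return s

def neg_log_pl(x, theta):
    """Negative log pseudo-likelihood using the closed-form conditional
    p(x_k | x_-k) = exp(theta * x_k * s_k) / (2 * cosh(theta * s_k))."""
    total = 0.0
    for k in range(len(x)):
        s = neighbor_sum(x, k)
        total -= theta * x[k] * s - math.log(2.0 * math.cosh(theta * s))
    return total

def unnormalized(x, theta):
    """Unnormalized weight exp(theta * sum_i x_i x_{i+1})."""
    return math.exp(theta * sum(x[i] * x[i + 1] for i in range(len(x) - 1)))

def neg_log_pl_bruteforce(x, theta):
    """Same quantity, computed from ratios of unnormalized weights.
    The normalization constant cancels in every ratio."""
    total = 0.0
    for k in range(len(x)):
        flipped = list(x)
        flipped[k] = -x[k]
        num = unnormalized(x, theta)
        den = num + unnormalized(flipped, theta)
        total -= math.log(num / den)
    return total

x_demo = (1, -1, -1, 1)
gap = abs(neg_log_pl(x_demo, 0.7) - neg_log_pl_bruteforce(x_demo, 0.7))
```

Each term touches only the two configurations that differ at site k, so the cost is linear in n, whereas the normalization constant would require a sum over all $2^n$ configurations.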

A Stochastic Variant of Z-Estimators
In this section, we define a stochastic variant of Z-estimators. For the discrete sample space $X$, suppose that the neighborhood system $N$ is defined as a subset of the power set $2^X$, i.e., $N$ is a family of subsets of $X$. Let us define the neighborhood system at $x \in X$ by $N_x = \{e \in N \mid x \in e\}$. We assume that $N_x$ is not empty for any $x$. In some classes of localized estimators, the neighborhood system is expressed using an undirected graph on $X$ [7]. In our setup, the neighborhood system is not necessarily expressed by an undirected graph, and we allow the neighborhood system to possess multiple neighborhoods at each point $x$.
Let us define the stochastic Z-estimator. A conditional probability of the set $e \in N$ given $x \in X$ is denoted as $q(e|x)$. We assume that $q(e|x) = 0$ if $e \notin N_x$ throughout the paper. Given a sample $x$, we randomly generate a neighborhood $e$ from the conditional probability $q(e|x)$. Using i.i.d. copies of $(x, e)$, we estimate $p(x)$. Here, the statistical model $p_\theta(x)$ of the form (1) is used. We use the Z-estimator $f_\theta(x, e) = f(x, e; \theta) \in \mathbb{R}^d$ to estimate the parameter $\theta \in \Theta \subset \mathbb{R}^d$. The element of $f_\theta(x, e)$ is denoted as $f_{\theta,k}(x, e)$ or $f_k(x, e; \theta)$ for $k = 1, \ldots, d$. The expectation under the probability $p_\theta(x) q(e|x)$ is written as $E_{\theta,q}[\cdot]$. Suppose that the equality
$$E_{\theta,q}[f_\theta] = \sum_{x \in X} \sum_{e \in N} p_\theta(x)\, q(e|x)\, f_\theta(x, e) = 0 \qquad (3)$$
holds for all $\theta \in \Theta$. In addition, we assume that the vectors $E_{\theta,q}[\nabla f_{\theta,k}]$, $k = 1, \ldots, d$, are linearly independent, meaning that $f_\theta$ depends substantially on the parameter $\theta$ [19]. The solution of the system of equations
$$\frac{1}{m}\sum_{i=1}^{m} f_\theta(x_i, e_i) = 0 \qquad (4)$$
produces a statistically consistent estimator under some regularity conditions [1]. In the parameter estimation of the model $p_\theta(x)$, the stochastic Z-estimator is defined as the solution of (4) using an identification function satisfying (3). As shown in Section 5, stochastic Z-estimators are useful to investigate statistical properties of the standard pseudo-likelihood and composite likelihood in Examples 1 and 2.
The computational tractability of the stochastic Z-estimator is not necessarily guaranteed. The MLE using the score function $f_\theta(x, e) = u_\theta(x)$ is regarded as a stochastic Z-estimator for any $q(e|x)$, and it may not be computationally tractable because of the normalizing constant. As a class of computationally efficient estimators, let us define the stochastic local Z-estimator as the stochastic Z-estimator using $f_\theta$ that satisfies
$$E_{\theta,q}[f_\theta \mid e] = 0 \qquad (5)$$
for any neighborhood $e \in N$, where $E_{\theta,q}[\cdot \mid e]$ is the conditional expectation given $e$. The conditional probability $p(x|e)$ of $p_\theta(x) q(e|x)$ can take a positive value only when $x \in e$. Hence, $f_\theta(x, e)$ depends only on the neighborhood of $x$ and its computation will be tractable.
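Example 3 (Stochastic pseudo-likelihood). Let us define the stochastic variant of the pseudo-likelihood estimator. On the sample space $X = X_1 \times \cdots \times X_n$, the neighborhood system at $x$ is defined as $N_x = \{e_{x,k} \mid k = 1, \ldots, n\}$, where $e_{x,k} \subset X$ is given as
$$e_{x,k} = \{(x_1, \ldots, x_{k-1}, z, x_{k+1}, \ldots, x_n) \mid z \in X_k\}.$$
In order to estimate the parameters in complex models such as conditional random fields and Boltzmann machines, the union, $\cup_{e \in N_x} e$, is often used as the neighborhood at $x$ [2,9,17]. Let the conditional probability $q(e|x)$ on $N_x$ be defined as $q(e_{x,k}|x) = q_k$, $k = 1, \ldots, n$, where $q_1, \ldots, q_n$ are positive numbers satisfying $\sum_{k=1}^{n} q_k = 1$. The identification function of the stochastic pseudo-likelihood is defined by
$$f_\theta(x, e_{x,k}) = \nabla \log p_\theta(x_k | x_{-k}).$$
The conditional probability of $x$ given $e_{x,k}$ is
$$p(x | e_{x,k}) = \frac{p_\theta(x)\, q_k}{\sum_{z \in e_{x,k}} p_\theta(z)\, q_k} = p_\theta(x_k | x_{-k}),$$
where we used the equality $e_{x,k} = e_{z,k}$ for $z \in e_{x,k}$. Hence, the equality (5) holds for any $(q_k)_{k=1,\ldots,n}$.

Example 4 (Stochastic variant of composite likelihood). Let us introduce a stochastic variant of the composite likelihood on the sample space $X = X_1 \times \cdots \times X_n$. Let us define $e'_{x,\ell}$, $\ell = 1, \ldots, M$, by the subset $e'_{x,\ell} = \{y \in X \mid y_{B_\ell} = x_{B_\ell}\}$, and the neighborhood system $N'_x$ by $N'_x = \{e'_{x,\ell} \mid \ell = 1, \ldots, M\}$. We assume that the map from $\ell$ to $B_\ell$ is one to one. In other words, the disjoint subsets $A_\ell, B_\ell, C_\ell$ can be specified from the neighborhood $e'_{x,\ell}$. Suppose that the conditional probability $q'(e'|x)$ on $N'_x$ is defined as $q'(e'_{x,\ell}|x) = q'_\ell$ for $\ell = 1, \ldots, M$, where $q'_1, \ldots, q'_M$ are positive numbers satisfying $\sum_{\ell=1}^{M} q'_\ell = 1$. As in Example 3, we see that the conditional probability $q'(x|e'_{x,\ell})$ defined from $p_\theta(x) q'(e'|x)$ is given as $p_\theta(x_{A_\ell}, x_{C_\ell} | x_{B_\ell})$. Let us consider the identification function
$$f_\theta(x, e'_{x,\ell}) = \nabla \log p_\theta(x_{A_\ell} | x_{B_\ell}). \qquad (6)$$
Then, (5) holds under the joint probability $p_\theta(x) q'(e'|x)$. Indeed, we have
$$E_{\theta,q'}[f_\theta \mid e'_{x,\ell}] = \sum_{y_{A_\ell}, y_{C_\ell}} p_\theta(y_{A_\ell}, y_{C_\ell} | x_{B_\ell})\, \nabla \log p_\theta(y_{A_\ell} | x_{B_\ell}) = \sum_{y_{A_\ell}} \nabla p_\theta(y_{A_\ell} | x_{B_\ell}) = 0$$
for any $(q'_\ell)_{\ell=1,\ldots,M}$. In this paper, the Z-estimator using (6) is called the reduced stochastic composite likelihood (reduced-SCL). The stochastic composite likelihood (SCL) proposed in [20] is a randomized extension of the above $f_\theta(x, e')$. Let $z = (z_1, \ldots, z_M)$ be a binary random vector taking an element of $\{0, 1\}^M$, and $\alpha_1, \ldots, \alpha_M$ be positive constants. The SCL is defined as the Z-estimator obtained by
$$f_\theta(x, z) = \sum_{\ell=1}^{M} \alpha_\ell z_\ell \nabla \log p_\theta(x_{A_\ell} | x_{B_\ell}).$$
The statistical consistency and the asymptotic normality of the SCL are shown in [20].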
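The defining condition (5) of a stochastic local Z-estimator can be checked numerically for the stochastic pseudo-likelihood of Example 3. The sketch below assumes a hypothetical Ising chain with a scalar parameter; the model, the constants `THETA` and `N`, and the function names are illustrative choices, not part of the paper. The conditional mean of the identification function over each neighborhood $e_{x,k}$ is computed from unnormalized weights only.

```python
import math
from itertools import product

# Hypothetical Ising chain on {-1, +1}^N with scalar parameter THETA.
THETA = 0.6
N = 4
STATES = list(product([-1, 1], repeat=N))

def neighbor_sum(x, k):
    """Sum of the chain neighbors of site k."""
    return sum(x[j] for j in (k - 1, k + 1) if 0 <= j < N)

def unnormalized(x):
    """Unnormalized weight; the normalization constant is never computed."""
    return math.exp(THETA * sum(x[i] * x[i + 1] for i in range(N - 1)))

def identification(x, k):
    """f_theta(x, e_{x,k}) = d/dtheta log p_theta(x_k | x_-k) for the chain."""
    s = neighbor_sum(x, k)
    return x[k] * s - s * math.tanh(THETA * s)

def conditional_mean_of_f(x, k):
    """E[f | e_{x,k}]: average of f over the neighborhood obtained by varying x_k.
    The weights q_k are constant on the neighborhood and cancel."""
    members = [x[:k] + (v,) + x[k + 1:] for v in (-1, 1)]
    mass = sum(unnormalized(y) for y in members)
    return sum(unnormalized(y) * identification(y, k) for y in members) / mass

worst = max(abs(conditional_mean_of_f(x, k)) for x in STATES for k in range(N))
```

The value `worst` is zero up to rounding, in line with the claim that (5) holds for any choice of the sampling probabilities $(q_k)$.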

Neighborhood Systems and Asymptotic Variances
We consider the relation between neighborhood systems and statistical properties of stochastic local Z-estimators.

Tangent Spaces of Statistical Models
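At the beginning, let us introduce some geometric concepts to investigate statistical properties of localized estimators. These concepts are borrowed from information geometry [21]. For the neighborhood system $N$ with the conditional probability $q(e|x)$, let us define the linear space $T_{\theta,q}$ as
$$T_{\theta,q} = \{a(x, e) \mid E_{\theta,q}[a] = 0\}.$$
The inner product for $a_1, a_2 \in T_{\theta,q}$ is defined as $E_{\theta,q}[a_1 a_2]$. A geometric meaning of $T_{\theta,q}$ is the tangent space of the statistical model $\{p_\theta(x) q(e|x) \mid \theta \in \Theta\}$. For any $a \in T_{\theta,q}$ and sufficiently small $\varepsilon > 0$, the perturbation of $p_\theta(x) q(e|x)$ in the direction $a(x, e)$ leads to the probability function $p_\theta(x) q(e|x)(1 + \varepsilon a(x, e))$. Each element of the score function $u_{\theta,j}(x)$, $j = 1, \ldots, d$, is a member of $T_{\theta,q}$ by regarding $u_{\theta,j}(x)$ as a function of $(x, e)$ that does not depend on $e$. Let us consider the stochastic Z-estimator derived from $f_\theta \in (T_{\theta,q})^d$ for any $\theta$. It leads to a Fisher consistent estimator. Stochastic local Z-estimators use an identification function in the linear subspace
$$T^L_{\theta,q} = \{a \in T_{\theta,q} \mid E_{\theta,q}[a \mid e] = 0 \text{ for all } e \in N\}.$$
The orthogonal complement of $T^L_{\theta,q}$ in $T_{\theta,q}$ is denoted as $T^E_{\theta,q}$, which is given as
$$T^E_{\theta,q} = \{b \in T_{\theta,q} \mid b(x, e) \text{ does not depend on } x\}.$$
Indeed, the orthogonality of $T^L_{\theta,q}$ and $T^E_{\theta,q}$ is confirmed by
$$E_{\theta,q}[a b] = E_{\theta,q}\big[b(e)\, E_{\theta,q}[a \mid e]\big] = 0$$
for any $a \in T^L_{\theta,q}$ and any $b \in T^E_{\theta,q}$. In addition, any $f \in T_{\theta,q}$ can be decomposed into $f = (f - E_{\theta,q}[f \mid e]) + E_{\theta,q}[f \mid e]$, where the first term lies in $T^L_{\theta,q}$ and the second in $T^E_{\theta,q}$. The efficient score $u^I_\theta = (u^I_{\theta,j})_{j=1,\ldots,d}$ is defined as the projection of each element of the score $u_\theta$ onto $T^L_{\theta,q}$, i.e.,
$$u^I_{\theta,j}(x, e) = u_{\theta,j}(x) - E_{\theta,q}[u_{\theta,j} \mid e].$$
The efficient score is computationally tractable when the size of the neighborhood $e$ is not of exponential order but of linear or low-degree polynomial order in $n$, where $n$ is the dimension of $x$. The trade-off between the computational and statistical efficiency is presented in Theorems 1 and 2 in Section 4.2.
Another expression of the efficient score is
$$u^I_\theta(x, e) = \nabla \log p(x|e; \theta, q),$$
where the conditional probability $p(x|e; \theta, q)$ is defined from $p_\theta(x) q(e|x)$. The above equality is obtained by
$$\nabla \log p(x|e; \theta, q) = \nabla \log \frac{p_\theta(x)\, q(e|x)}{\sum_{x' \in X} p_\theta(x')\, q(e|x')} = u_\theta(x) - \sum_{x' \in X} p(x'|e; \theta, q)\, u_\theta(x') = u_\theta(x) - E_{\theta,q}[u_\theta \mid e].$$
We define $T^I_{\theta,q}$ as the subspace of $T^L_{\theta,q}$ spanned by $\{u^I_{\theta,k} \mid k = 1, \ldots, d\}$, and $T^A_{\theta,q}$ as the orthogonal complement of $T^I_{\theta,q}$ in $T^L_{\theta,q}$. As a result, we obtain
$$T_{\theta,q} = T^I_{\theta,q} \oplus T^A_{\theta,q} \oplus T^E_{\theta,q}.$$
We describe statistical properties of stochastic local Z-estimators using the above tangent spaces.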
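The two expressions of the efficient score can be compared numerically. The sketch below assumes a hypothetical three-site Ising chain with scalar parameter and single-site neighborhoods $e_{x,k}$ as in Example 3; the constants and function names are illustrative assumptions. It computes $u_\theta(x) - E[u_\theta \mid e]$ by enumeration and compares it with the derivative of $\log p_\theta(x_k | x_{-k})$.

```python
import math
from itertools import product

# Hypothetical model: p_theta(x) ∝ exp(THETA * t(x)) on {-1, +1}^3.
THETA = 0.4
N = 3
STATES = list(product([-1, 1], repeat=N))

def t(x):
    """Sufficient statistic of the chain: x0*x1 + x1*x2."""
    return x[0] * x[1] + x[1] * x[2]

Z = sum(math.exp(THETA * t(x)) for x in STATES)
P = {x: math.exp(THETA * t(x)) / Z for x in STATES}
MEAN_T = sum(P[x] * t(x) for x in STATES)

def score(x):
    """u_theta(x) = t(x) - E_theta[t]."""
    return t(x) - MEAN_T

def neighborhood(x, k):
    """e_{x,k}: all points that agree with x outside coordinate k."""
    return [x[:k] + (v,) + x[k + 1:] for v in (-1, 1)]

def efficient_score(x, k):
    """Projection form: u_theta(x) - E[u_theta | e_{x,k}]."""
    e = neighborhood(x, k)
    mass = sum(P[y] for y in e)
    cond_mean = sum(P[y] * score(y) for y in e) / mass
    return score(x) - cond_mean

def conditional_score(x, k):
    """Gradient form: d/dtheta log p_theta(x_k | x_-k) in closed form."""
    s = sum(x[j] for j in (k - 1, k + 1) if 0 <= j < N)
    return x[k] * s - s * math.tanh(THETA * s)

max_gap = max(abs(efficient_score(x, k) - conditional_score(x, k))
              for x in STATES for k in range(N))
```

The projection form here uses the full normalization only for convenience on this tiny space; the gradient form shows that the same quantity is available from local computations alone.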

Asymptotic Variance of Stochastic Local Z-Estimators
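Under a fixed conditional probability $q(e|x)$, we derive the asymptotically efficient stochastic local Z-estimator in the same way as in semi-parametric estimation [19,22]. In addition, we consider the monotonicity of the efficiency with respect to the size of the neighborhood. Given i.i.d. samples $(x_i, e_i)$, $i = 1, \ldots, m$, generated from $p(x) q(e|x)$, the estimator $\hat\theta$ of the parameter in the model (1) is obtained by solving the system of Equation (4), where $f_\theta \in (T_{\theta,q})^d$ for any $\theta \in \Theta$. Suppose that the true probability function $p(x)$ is realized by $p_\theta(x)$ of the model (1). As shown in [1], the standard asymptotic theory yields that the asymptotic variance of the above Z-estimator is given as
$$\lim_{m \to \infty} m\, V[\hat\theta] = E_{\theta,q}[f_\theta u_\theta^T]^{-1}\, E_{\theta,q}[f_\theta f_\theta^T]\, E_{\theta,q}[u_\theta f_\theta^T]^{-1}. \qquad (9)$$
The derivation of the asymptotic variance is presented in the Appendix for completeness of the presentation.
We consider the asymptotic variance of the stochastic local Z-estimators. A simple expression of the asymptotic variance is obtained using the efficient score $u^I_\theta$. Without loss of generality, the identification function of the stochastic local Z-estimator, $f_\theta \in (T^L_{\theta,q})^d$, is expressed as
$$f_\theta = u^I_\theta + a_\theta,$$
where $a_\theta \in (T^A_{\theta,q})^d$. The reason is briefly shown below. Suppose that $f_\theta = B_\theta u^I_\theta + a_\theta$ with $a_\theta \in (T^A_{\theta,q})^d$, where $B_\theta$ is a $d$ by $d$ matrix that does not depend on $x$ and $e$. The condition that the matrix $E_{\theta,q}[\nabla f_\theta^T]$ is invertible assures that $B_\theta$ is invertible, since
$$E_{\theta,q}[\nabla f_\theta^T] = -E_{\theta,q}[u_\theta f_\theta^T] = -E_{\theta,q}[u^I_\theta (B_\theta u^I_\theta + a_\theta)^T] = -E_{\theta,q}[u^I_\theta (u^I_\theta)^T]\, B_\theta^T$$
holds. In the above equalities, we use the formula $E_{\theta,q}[\nabla f_{\theta,k}] + E_{\theta,q}[u_\theta f_{\theta,k}] = 0$ for $f_{\theta,k} \in T_{\theta,q}$, which is obtained by differentiating the identity $E_{\theta,q}[f_\theta] = 0$, together with the orthogonality of $u_\theta - u^I_\theta \in (T^E_{\theta,q})^d$ to $T^L_{\theta,q}$. Clearly, $f_\theta(x, e)$ provides the same estimator as $B_\theta^{-1} f_\theta(x, e)$. See [19] for details of the standard form of Z-estimators.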

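Theorem 1. Let us define the $d$ by $d$ matrix $G^I_\theta$ by
$$G^I_\theta = E_{\theta,q}[u^I_\theta (u^I_\theta)^T].$$
Then, for a fixed conditional probability $q(e|x)$, the asymptotic variance of any stochastic local Z-estimator $\hat\theta$ satisfies the inequality
$$\lim_{m \to \infty} m\, V[\hat\theta] \succeq (G^I_\theta)^{-1}$$
in the sense of non-negative definiteness. The equality is attained by the Z-estimator using $u^I_\theta$.
Proof. Let us compute each matrix in (9). According to the above argument, without loss of generality, we assume $f_\theta = u^I_\theta + a_\theta$ with $a_\theta \in (T^A_{\theta,q})^d$. Then, we have
$$E_{\theta,q}[u_\theta f_\theta^T] = E_{\theta,q}[u^I_\theta (u^I_\theta + a_\theta)^T] = G^I_\theta, \qquad E_{\theta,q}[f_\theta f_\theta^T] = G^I_\theta + A, \quad A = E_{\theta,q}[a_\theta a_\theta^T].$$
As a result, we obtain
$$\lim_{m \to \infty} m\, V[\hat\theta] = (G^I_\theta)^{-1} (G^I_\theta + A) (G^I_\theta)^{-1} = (G^I_\theta)^{-1} + (G^I_\theta)^{-1} A\, (G^I_\theta)^{-1} \succeq (G^I_\theta)^{-1}.$$
When $f_\theta = u^I_\theta$, the matrix $A$ becomes the null matrix and the minimum asymptotic variance is attained.
The minimum variance of stochastic local Z-estimators is attained by the efficient score. This conclusion agrees with the result on the asymptotically efficient estimator in semi-parametric models including nuisance parameters [19,22].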
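Theorem 1 can be illustrated for a scalar parameter, where the asymptotic variance (9) reduces to $E[f^2] / E[u f]^2$. The sketch below assumes a hypothetical three-site Ising chain, single-site neighborhoods, and sampling weights `Q`; the perturbation `a` is one arbitrary element of $T^A_{\theta,q}$ constructed for this example. None of these choices come from the paper.

```python
import math
from itertools import product

# Hypothetical setup: three-site Ising chain, scalar THETA, weights Q on sites.
THETA, N = 0.4, 3
Q = [0.5, 0.3, 0.2]
STATES = list(product([-1, 1], repeat=N))

def t(x):
    return x[0] * x[1] + x[1] * x[2]

Z = sum(math.exp(THETA * t(x)) for x in STATES)
P = {x: math.exp(THETA * t(x)) / Z for x in STATES}
MEAN_T = sum(P[x] * t(x) for x in STATES)

def score(x):
    return t(x) - MEAN_T

def center(g, x, k):
    """g(x, k) minus its conditional mean over the neighborhood e_{x,k}."""
    e = [x[:k] + (v,) + x[k + 1:] for v in (-1, 1)]
    mass = sum(P[y] for y in e)
    return g(x, k) - sum(P[y] * g(y, k) for y in e) / mass

def expect(h):
    """Expectation under the joint probability p_theta(x) q(e_{x,k} | x)."""
    return sum(P[x] * Q[k] * h(x, k) for x in STATES for k in range(N))

def u_eff(x, k):
    """Efficient score u^I = u - E[u | e]."""
    return center(lambda y, j: score(y), x, k)

def g_raw(x, k):
    """An arbitrary element of T^L: x_k - E[x_k | e]."""
    return center(lambda y, j: float(y[j]), x, k)

G_I = expect(lambda x, k: u_eff(x, k) ** 2)
coef = expect(lambda x, k: g_raw(x, k) * u_eff(x, k)) / G_I

def a(x, k):
    """Component of g_raw orthogonal to u^I, i.e., an element of T^A."""
    return g_raw(x, k) - coef * u_eff(x, k)

def asymptotic_variance(f):
    """Scalar form of (9): E[f^2] / E[u f]^2."""
    return expect(lambda x, k: f(x, k) ** 2) / expect(lambda x, k: score(x) * f(x, k)) ** 2

bound = 1.0 / G_I
var_eff = asymptotic_variance(u_eff)
var_other = asymptotic_variance(lambda x, k: u_eff(x, k) + 0.8 * a(x, k))
```

In this sketch, `var_eff` coincides with the bound and `var_other` exceeds it, matching the decomposition $(G^I_\theta)^{-1} + (G^I_\theta)^{-1} A (G^I_\theta)^{-1}$ in the proof.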

Remark 1.
Let us consider the relation between the stochastic pseudo-likelihood $\nabla \log p_\theta(x_k | x_{-k})$ and the efficient score $u^I_\theta(x, e_{x,k})$. Suppose that the neighborhood system $N_x$ and the conditional distribution $q(e|x)$ on $N_x$ are defined as shown in Example 3. Then, we have $u^I_\theta(x, e_{x,k}) = \nabla \log p_\theta(x_k | x_{-k})$. Likewise, we find that the reduced-SCL, $\nabla \log p_\theta(x_{A_\ell} | x_{B_\ell})$, is equivalent to the efficient score under the setup in Example 4 when the index subset $A_\ell$ is defined as $B_\ell^c$.

Monotonicity of Asymptotic Efficiency
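As described in [23], for the composite likelihood estimator with the index pairs $(A_\ell, B_\ell)$, $\ell = 1, \ldots, M$, it is widely believed that by increasing the size of $A_\ell$ (and correspondingly decreasing the size of $B_\ell = A_\ell^c$), one can capture more dependency relations in the model and increase the accuracy. For the stochastic local Z-estimators, we can obtain the exact relation between the neighborhood system and asymptotic efficiency.
Let us consider two stochastic local Z-estimators: one is defined by $q(e|x)$ on the neighborhood system $e \in N_x$, and the other is given by $q'(e'|x)$ on the neighborhood system $e' \in N'_x$. The efficient scores are respectively written as $u^I_\theta(x, e)$ for $q(e|x)$ and $u'^I_\theta(x, e')$ for $q'(e'|x)$. In addition, let us define
$$G^I_\theta = E_{\theta,q}[u^I_\theta (u^I_\theta)^T], \qquad G'^I_\theta = E_{\theta,q'}[u'^I_\theta (u'^I_\theta)^T].$$

Theorem 2. Let $p(x, e, e')$ be the joint probability of $(x, e, e') \in X \times N \times N'$ and suppose that the probability functions $q(e|x)$, $q'(e'|x)$ and $p_\theta(x)$ are obtained from $p(x, e, e')$. We assume that
$$E\big[E[u_\theta \mid e] \,\big|\, e'\big] = E[u_\theta \mid e'] \qquad (10)$$
holds under the probability distribution $p(x, e, e')$. Then, we have
$$G^I_\theta \preceq G'^I_\theta, \qquad (11)$$
i.e., the efficiency bound $(G'^I_\theta)^{-1}$ of $N'_x$ and $q'(e'|x)$ is smaller than or equal to the efficiency bound $(G^I_\theta)^{-1}$ of $N_x$ and $q(e|x)$.
Proof. We use the basic formula of the conditional variance,
$$V[X] = E[V[X \mid Z]] + V[E[X \mid Z]], \qquad (12)$$
for random variables $X$ and $Z$. The above formula is applied to the score $u_\theta(x)$ and the efficient score $u^I_\theta(x, e) = u_\theta(x) - E[u_\theta \mid e]$. Since $G^I_\theta = E[V[u_\theta \mid e]]$ and $G'^I_\theta = E[V[u_\theta \mid e']]$, we have
$$G'^I_\theta - G^I_\theta = \big(V[u_\theta] - V[E[u_\theta \mid e']]\big) - \big(V[u_\theta] - V[E[u_\theta \mid e]]\big) = V[E[u_\theta \mid e]] - V[E[u_\theta \mid e']].$$
The last equality comes from the fact that the score $u_\theta(x)$ is common to both setups. Since the equality (10) holds, applying the formula (12) again with $X = E[u_\theta \mid e]$ and $Z = e'$ yields
$$V[E[u_\theta \mid e]] = E\big[V[E[u_\theta \mid e] \mid e']\big] + V[E[u_\theta \mid e']].$$
Thus, we obtain $V[E[u_\theta \mid e]] \succeq V[E[u_\theta \mid e']]$. As a result, we have (11).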
A similar inequality is derived in [24] for the mutual Fisher information. The mutual Fisher information is more similar to $V[E[u_\theta \mid e]]$ than to $G^I_\theta$. Theorem 13 of [24] corresponds to the one-dimensional version of the inequality $V[E[u_\theta \mid e]] \succeq V[E[u_\theta \mid e']]$.
Let us show an example that agrees with (10). We define two neighborhood systems $N = \{N_x \mid x \in X\}$ and $N' = \{N'_x \mid x \in X\}$ such that, for any $e \in N_x$, there exists $e' \in N'_x$ satisfying $e \subset e'$. For the joint probability $p(x, e, e')$, suppose that $x$ and $e'$ are conditionally independent given $e$ and that the conditional probability $r'(e'|e)$ derived from $p(x, e, e')$ is equal to zero unless $e \subset e'$. Under these conditions, $q'(e'|x)$ derived from $p(x, e, e')$ takes 0 if $e' \notin N'_x$. The conditional independence assures that $p(x, e, e')$ is expressed as $p(x, e, e') = p_\theta(x) q(e|x) r'(e'|e) = p(x|e) r(e|e') q(e')$. Hence, the conditional probability $p(x|e')$ is expressed as $\sum_{e \in N} p(x|e) r(e|e')$. Thus, we obtain
$$E[u_\theta \mid e'] = \sum_{e \in N} r(e|e')\, E[u_\theta \mid e] = E\big[E[u_\theta \mid e] \,\big|\, e'\big].$$
As a result, the better efficiency bound is obtained by the larger neighborhood. A similar result is presented in [25] for the composite likelihood estimators. The relation between the result in [25] and ours is explained in Section 5.3 of this paper.
Example 5. Let $N_x$ be a neighborhood system at $x$ endowed with the conditional distribution $q(e|x)$. Another neighborhood system is defined as $N'_x = \{X\}$ for all $x$, and $q'(e'|x) = 1$ for $e' = X$. Let us define $p(x, e, e') = p_\theta(x) q(e|x)$ for $e' = X$ and otherwise $p(x, e, e') = 0$. Since $e'$ always takes $X$, $x$ and $e'$ are conditionally independent given $e$. Thus, we have $G^I_\theta \preceq G'^I_\theta$. Indeed, $G'^I_\theta$ is the Fisher information matrix of the model $p_\theta(x)$.
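The monotonicity in Theorem 2 and Example 5 can be observed numerically for a scalar parameter, using the identity $G^I_\theta = E[V[u_\theta \mid e]]$ from the proof. The sketch below assumes a hypothetical three-site Ising chain and uniform sampling weights over neighborhoods; the model and function names are illustrative, not from the paper. It compares neighborhoods that vary one coordinate, neighborhoods that vary two coordinates, and the full sample space.

```python
import math
from itertools import product

# Hypothetical model: p_theta(x) ∝ exp(THETA * t(x)) on {-1, +1}^3.
THETA, N = 0.4, 3
STATES = list(product([-1, 1], repeat=N))

def t(x):
    return x[0] * x[1] + x[1] * x[2]

Z = sum(math.exp(THETA * t(x)) for x in STATES)
P = {x: math.exp(THETA * t(x)) / Z for x in STATES}
MEAN_T = sum(P[x] * t(x) for x in STATES)

def score(x):
    return t(x) - MEAN_T

def efficient_information(free_sets, weights):
    """G^I = E[V[u | e]] for neighborhoods that vary the coordinates in each
    free set and fix the remaining ones; weights are the sampling probabilities."""
    total = 0.0
    for free, w in zip(free_sets, weights):
        fixed = [i for i in range(N) if i not in free]
        groups = {}
        for x in STATES:
            key = tuple(x[i] for i in fixed)
            groups.setdefault(key, []).append(x)
        for members in groups.values():
            mass = sum(P[y] for y in members)
            mean = sum(P[y] * score(y) for y in members) / mass
            total += w * sum(P[y] * (score(y) - mean) ** 2 for y in members)
    return total

g_single = efficient_information([(0,), (1,), (2,)], [1 / 3] * 3)
g_pair = efficient_information([(0, 1), (0, 2), (1, 2)], [1 / 3] * 3)
fisher = efficient_information([(0, 1, 2)], [1.0])
var_t = sum(P[x] * t(x) ** 2 for x in STATES) - MEAN_T ** 2
```

With the whole sample space as the only neighborhood, the quantity reduces to the Fisher information (here the variance of `t`), which is the largest of the three values; the single-site value is the smallest.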
We compare the stochastic pseudo-likelihood and the reduced-SCL. Let $N_x = \{e_{x,k} \mid k = 1, \ldots, n\}$ be the neighborhood system defined in Example 3, and $N$ be $\cup_{x \in X} N_x$. The conditional distribution on $N_x$ is given by $q(e_{x,k}|x) = q_k$, $k = 1, \ldots, n$. As shown in Remark 1, the corresponding efficient score is nothing but the stochastic pseudo-likelihood, i.e., $u^I_\theta(x, e_{x,k}) = \nabla \log p_\theta(x_k | x_{-k})$. Let us define another neighborhood system $N'_x$ in the same way as Example 4. For the index subsets $B_\ell$ and $A_\ell = B_\ell^c$, $\ell = 1, \ldots, M$, we define $e'_{x,\ell}$ as $\{y \in X \mid y_{B_\ell} = x_{B_\ell}\}$ and $N'_x = \{e'_{x,\ell} \mid \ell = 1, \ldots, M\}$. Let $N'$ be $\cup_{x \in X} N'_x$. The conditional distribution on $N'_x$ is given as $q'(e'_{x,\ell}|x) = q'_\ell$ for $\ell = 1, \ldots, M$. Then, the efficient score associated with $N'$ and $q'$ is equal to the reduced-SCL, i.e., $u'^I_\theta(x, e'_{x,\ell}) = \nabla \log p_\theta(x_{A_\ell} | x_{B_\ell})$. As a direct conclusion of Theorem 2 and the above argument about the conditional independence between $x$ and $e' \in N'$ given $e \in N$, we obtain the following corollary.
Corollary 1. We define $N'_e$ for $e \in N$ by $N'_e = \{e' \in N' \mid e \subset e'\}$. Let $r'(e'|e)$ be a conditional probability on $N'_e$ given $e \in N$, where $r'(e'|e) = 0$ is assumed for $e' \notin N'_e$. If the equality $q'_\ell = \sum_{k=1}^{n} q_k\, r'(e'_{x,\ell} | e_{x,k})$ holds, the reduced-SCL with $N'$ and $q'$ is more efficient than the stochastic pseudo-likelihood with $N$ and $q$.
Example 6. Suppose that the size of $N'_e$ is the same for all $e \in N$ and that the size of the set $\{e \in N \mid x \in e \subset e'\}$ is the same for any $x \in X$ and $e' \in N'$ such that $x \in e'$. Let $q(e|x)$ (resp. $q'(e'|x)$) be the uniform distribution on $N_x$ (resp. $N'_x$). Then, the reduced-SCL is more efficient than the stochastic pseudo-likelihood. Indeed, with $r'(e'|e)$ uniform on $N'_e$, the assumption ensures that the sum $\sum_{e \in N} q(e|x)\, r'(e'|e)$ does not depend on $x$ and $e'$. Thus, the uniform distribution $q'(e'|x)$ meets the condition of the above corollary. For example, let $B_1, \ldots, B_M$ be all subsets of size $n - 2$ in $\{1, \ldots, n\}$. Then, we have $M = n(n - 1)/2$. The size of $N'_e$ is $n - 1$, and the size of $\{e \in N \mid x \in e \subset e'\}$ is equal to 2.
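The counts stated at the end of Example 6 can be confirmed by enumeration over index sets. In the sketch below, a single-site neighborhood is identified with its free index k and a larger neighborhood with its free index pair $A_\ell = B_\ell^c$, so that the single-site neighborhood is contained in the larger one exactly when k belongs to $A_\ell$. The function name and the return format are choices made for this example.

```python
from itertools import combinations

def example6_counts(n):
    """Return (M, sizes of N'_e over sites, sizes of {e : x in e subset e'} over pairs).

    fixed_sets are the subsets B_l of size n - 2; free_sets are A_l = B_l^c.
    """
    indices = range(n)
    fixed_sets = [frozenset(b) for b in combinations(indices, n - 2)]
    free_sets = [frozenset(indices) - b for b in fixed_sets]
    m = len(fixed_sets)
    # Number of larger neighborhoods containing the single-site neighborhood of k.
    larger_per_site = {k: sum(1 for a in free_sets if k in a) for k in indices}
    # Number of single-site neighborhoods contained in each larger neighborhood.
    sites_per_larger = {a: sum(1 for k in indices if k in a) for a in free_sets}
    return m, set(larger_per_site.values()), set(sites_per_larger.values())

counts_n5 = example6_counts(5)
```

Both counts are constant over sites and over larger neighborhoods, which is the regularity that Example 6 requires for the uniform distributions to satisfy the condition of Corollary 1.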

Local Z-Estimators and Efficiency Bounds
In this section, we define the local Z-estimator as the expectation of a stochastic local Z-estimator, and derive its efficiency bound.

Local Z-Estimators
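Computationally tractable estimators such as pseudo-likelihood and composite likelihood are obtained by the expectation of an identification function in $T^L_{\theta,q}$. Let us define the local Z-estimator as the Z-estimator using
$$\bar f_\theta(x) = E_{\theta,q}[f_\theta \mid x] = \sum_{e \in N_x} q(e|x)\, f_\theta(x, e),$$
where $f_\theta \in (T^L_{\theta,q})^d$. The conditional expectation given $x$ is regarded as the projection onto the subspace $T^X_{\theta,q}$, which is defined as
$$T^X_{\theta,q} = \{a \in T_{\theta,q} \mid a(x, e) \text{ does not depend on } e\}.$$
Let $\Pi_X$ be the projection operator onto $T^X_{\theta,q}$ and $\Pi^\perp_X$ be the one onto the orthogonal complement of $T^X_{\theta,q}$. Then, one can prove
$$\Pi_X[a] = E_{\theta,q}[a \mid x], \qquad \Pi^\perp_X[a] = a - E_{\theta,q}[a \mid x].$$
When the number of elements in the neighborhood system $N_x$ is reasonable, the computation of the local Z-estimator is tractable.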
Below, we show that some estimators are expressed as the local Z-estimator.
Example 7 (Pseudo-likelihood and composite likelihood). In the setup of Example 3, the conditional expectation of the efficient score, $E_{\theta,q}[u^I_\theta \mid x]$, yields the pseudo-likelihood when $q(e|x)$ is the uniform distribution on $N_x$. In the setup of Example 4, let us assume $A_\ell = B_\ell^c$ and $q'(e'_{x,\ell}|x) = q'_\ell$. Then, the conditional expectation of the efficient score $u'^I_\theta$ is
$$E_{\theta,q'}[u'^I_\theta \mid x] = \sum_{\ell=1}^{M} q'_\ell\, \nabla \log p_\theta(x_{A_\ell} | x_{B_\ell}),$$
which is the general form of the composite likelihood in Example 2 with $\gamma_\ell = q'_\ell$.

Efficiency Bounds
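We derive the efficiency bound of the local Z-estimator. Without loss of generality, the local Z-estimator $\bar f_\theta(x) \in (T^X_{\theta,q})^d$ is represented as
$$\bar f_\theta(x) = E_{\theta,q}[u^I_\theta + a_\theta \mid x], \qquad a_\theta \in (T^A_{\theta,q})^d.$$
Under the model $p_\theta(x)$, we calculate the asymptotic variance (9) of the local Z-estimator $\hat\theta$ using $\bar f_\theta(x)$. The matrix $E_{\theta,q}[u_\theta \bar f_\theta^T]$ in (9) is given as
$$E_{\theta,q}[u_\theta \bar f_\theta^T] = E_{\theta,q}[u_\theta (u^I_\theta + a_\theta)^T] = E_{\theta,q}[u^I_\theta (u^I_\theta)^T] = G^I_\theta.$$
Hence, we have
$$\lim_{m \to \infty} m\, V[\hat\theta] = (G^I_\theta)^{-1}\, E_{\theta,q}[\bar f_\theta \bar f_\theta^T]\, (G^I_\theta)^{-1}.$$
Here, the expectation $E_{\theta,q}[\bar f_\theta \bar f_\theta^T]$ can be written as the expectation under $p_\theta(x)$, i.e., $E_\theta[\bar f_\theta \bar f_\theta^T]$, since $u_\theta$ and $\bar f_\theta$ depend only on $x$. The orthogonal decomposition $f_\theta = \bar f_\theta + (f_\theta - \bar f_\theta)$ of $f_\theta = u^I_\theta + a_\theta$ leads to
$$E_{\theta,q}[f_\theta f_\theta^T] = E_\theta[\bar f_\theta \bar f_\theta^T] + E_{\theta,q}[(f_\theta - \bar f_\theta)(f_\theta - \bar f_\theta)^T] \succeq E_\theta[\bar f_\theta \bar f_\theta^T], \qquad (13)$$
meaning that the asymptotic variance of the stochastic local Z-estimator using $f_\theta(x, e)$ is larger than or equal to that of the local Z-estimator using $\bar f_\theta(x)$.
We consider the optimal choice of $a_\theta \in (T^A_{\theta,q})^d$ in $\bar f_\theta(x) = E_{\theta,q}[u^I_\theta + a_\theta \mid x]$. Let us define the subspace $T^{XA}_{\theta,q}$ as $\Pi_X T^A_{\theta,q} = \{\Pi_X[a] \mid a \in T^A_{\theta,q}\}$, and let $\Pi_{XA}$ be the projection operator onto $T^{XA}_{\theta,q}$. Then, we define $v^I_{\theta,j}(x) \in T^X_{\theta,q}$ as the projection of $u^I_{\theta,j}(x, e) \in T^I_{\theta,q}$ onto the orthogonal complement of $T^{XA}_{\theta,q}$ in $T^X_{\theta,q}$, i.e., $v^I_{\theta,j} = (\Pi_X - \Pi_{XA})[u^I_{\theta,j}]$ for $j = 1, \ldots, d$. In this paper, we call $v^I_\theta = (v^I_{\theta,1}, \ldots, v^I_{\theta,d})^T$ the local efficient score.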

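Theorem 3. Let us define the $d$ by $d$ matrix $H_\theta$ by
$$H_\theta = E_\theta[v^I_\theta (v^I_\theta)^T].$$
Then, the efficiency bound of the local Z-estimator $\hat\theta$ is given as
$$\lim_{m \to \infty} m\, V[\hat\theta] \succeq (G^I_\theta)^{-1} H_\theta\, (G^I_\theta)^{-1}.$$
The left-hand side of the above inequality is the asymptotic variance of the local Z-estimator. The equality is attained by the local Z-estimator using the local efficient score $v^I_\theta$.
We consider the relation between the local efficient score $v^I_\theta(x)$ and the score $u_\theta(x)$. We define $T^{ML}_{\theta,q}$ as the subspace spanned by the score $u_{\theta,j}(x)$, $j = 1, \ldots, d$. For any $a \in T^A_{\theta,q}$, we have $E_{\theta,q}[u_{\theta,j} E_{\theta,q}[a \mid x]] = E_{\theta,q}[u_{\theta,j}\, a] = 0$, $j = 1, \ldots, d$, meaning that $T^{ML}_{\theta,q}$ and $T^{XA}_{\theta,q}$ are orthogonal to each other. Hence, $T^X_{\theta,q}$ is decomposed into
$$T^X_{\theta,q} = T^{XA}_{\theta,q} \oplus T^{ML}_{\theta,q} \oplus T^{XC}_{\theta,q},$$
where $T^{XC}_{\theta,q}$ is the orthogonal complement of $T^{XA}_{\theta,q} \oplus T^{ML}_{\theta,q}$ in $T^X_{\theta,q}$. Eventually, subspaces in $T_{\theta,q}$ satisfy the following relations,
$$T_{\theta,q} = T^I_{\theta,q} \oplus T^A_{\theta,q} \oplus T^E_{\theta,q}, \qquad T^X_{\theta,q} = T^{XA}_{\theta,q} \oplus T^{ML}_{\theta,q} \oplus T^{XC}_{\theta,q}, \qquad T^{XA}_{\theta,q} = \Pi_X T^A_{\theta,q}.$$
Let us define $T^{XI}_{\theta,q}$ as the subspace spanned by the local efficient score $v^I_{\theta,j}(x)$, $j = 1, \ldots, d$. Under a mild assumption, $T^{XI}_{\theta,q}$ and $T^{ML}_{\theta,q}$ have the same dimension. Since $v^I_\theta(x)$ is orthogonal to $\Pi_X T^A_{\theta,q}$, $T^{XI}_{\theta,q}$ is included in $T^{ML}_{\theta,q} \oplus T^{XC}_{\theta,q}$. Hence, $T^{XC}_{\theta,q}$ is interpreted as the subspace expressing the information loss caused by the localization of the score $u_\theta$.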

Comparison of Local Z-Estimators
We compare the asymptotic variances of two local Z-estimators that are connected to composite likelihoods.
One estimator is defined from the neighborhood system $N$ which consists of the singletons $N_x = \{e_x\}$, $x \in X$. Here, we assume that $e_x = e_{x'}$ holds for $x' \in e_x$ and that $\cup_{x \in X} e_x = X$. Such a neighborhood system $N$ is called the equivalence class [25]. An equivalence class corresponds to a partition of the sample space. The conditional probability $q(e|x)$ takes 1 for $e = e_x$ and 0 otherwise. Let $u^I_\theta(x, e)$ be the efficient score defined from $N$ and $q(e|x)$, and $\bar u^I_\theta(x)$ be the local Z-estimator $\bar u^I_\theta(x) = E_{\theta,q}[u^I_\theta \mid x]$.
Another localized estimator is defined from the neighborhood system $N'$ which consists of $N'_x$, $x \in X$, where $N'_x$ is not necessarily a singleton. Suppose that $e_x \subset e'$ holds for any $e' \in N'_x$. The conditional probability $q'(e'|x)$ is defined as $q'(e'|x) = r'(e'|e_x)$, where $r'(e'|e_x)$ is a conditional probability of $e' \in N'_x$ given $e_x$. The corresponding efficient score is denoted as $u'^I_\theta(x, e')$, and let us define $\bar u'^I_\theta(x) = E_{\theta,q'}[u'^I_\theta \mid x]$ as the local Z-estimator associated with $N'$ and $q'(e'|x)$. From the definition, the joint probability $p_\theta(x) q(e|x) r'(e'|e)$ agrees with $q(e|x)$ and $q'(e'|x)$. Hence, we see that $x$ and $e'$ are conditionally independent given $e$, and Theorem 2 guarantees the inequality $(G'^I_\theta)^{-1} \preceq (G^I_\theta)^{-1}$. The efficient score $u^I_\theta(x, e)$ can take a non-zero value only when $e = e_x$. Hence, $u^I_\theta(x, e)$ is regarded as a function of $x$, i.e., $u^I_\theta(x, e) \in (T^X_{\theta,q})^d$, and the asymptotic variance of the local Z-estimator obtained by $\bar u^I_\theta$ is equal to $(G^I_\theta)^{-1}$. On the other hand, the asymptotic variance of the local Z-estimator derived from $\bar u'^I_\theta(x)$ is less than or equal to $(G'^I_\theta)^{-1}$ due to (13). Therefore, $\bar u'^I_\theta$ with $N'$ and $q'$ provides a more efficient estimator than $\bar u^I_\theta$ with $N$ and $q$.
Liang and Jordan presented a similar result in [25]. In their setup, the larger neighborhood system $N'_x$ is a singleton $\{e'_x\}$ and the smaller one, $N_x$, can have multiple neighborhoods at each $x$. In such a case, a similar relation holds, i.e., the estimator with $N'$ is more efficient. However, their approach is different from ours. In [25], the randomness is introduced over the patterns of the partition of $X$. Moreover, their identification function corresponding to our $\bar u^I_\theta(x)$ is decomposed into two terms; one is the term conditioned on the partition and the other is its orthogonal complement. On the other hand, our approach uses the decomposition of $u^I_\theta(x, e)$ into $\bar u^I_\theta(x)$ and its orthogonal complement.
In their analysis, the simplified expression of the asymptotic variance shown in (9) and the standard expression of the identification function, $f(x, e) = u^I_\theta(x, e) + a(x, e)$, are not used. Hence, the evaluation of the asymptotic variance yields a rather complex dependency on the estimator. As a result, their approach does not show the efficiency bound, though the asymptotic variance of the composite likelihood for exponential families is presented under the misspecified setup.

Closed Exponential Families
The so-called closed exponential family has an interesting property from the viewpoint of localized estimators, as presented in [26]. Let $p_\theta(x) = \exp\{\theta^T t(x) - c(\theta)\}$ be the exponential family defined for $x = (x_1, \ldots, x_n) \in X = X_1 \times \cdots \times X_n$ with the parameter $\theta \in \Theta \subset \mathbb{R}^d$. The function $t(x) \in \mathbb{R}^d$ is referred to as the sufficient statistic. Given disjoint index subsets $A, B \subset \{1, \ldots, n\}$, let $t_B(x)$ be all elements of $t(x)$ that depend only on $x_B$, and $t_{A,B}(x)$ be the other elements. Hence, $t_B(x)$ is expressed as $t_B(x_B)$. The parameter $\theta$ is correspondingly decomposed into $\theta = (\theta_{A,B}, \theta_B)$. Thus, we have $\theta^T t(x) = \theta_{A,B}^T t_{A,B}(x) + \theta_B^T t_B(x_B)$. The exponential family $p_\theta(x)$ is called the closed exponential family when the marginal distribution of $x_B$ is expressed as the exponential family with the sufficient statistic $t_B(x_B)$.
We consider the composite likelihood of the closed exponential family. For the pairs of two disjoint index subsets, $\{A_\ell, B_\ell\}$, $\ell = 1, \ldots, M$, suppose that any element of $t(x)$ is included in $t_{A_\ell,B_\ell}(x)$ for at least one $\ell$. Then, the local Z-estimator using the composite likelihood $\sum_{\ell=1}^{M} \log p_\theta(x_{A_\ell} | x_{B_\ell})$ is identical to the MLE [26]. Hence, the composite likelihood of the closed exponential family attains the efficiency bound of the MLE.
For the general statistical model $p_\theta(x)$, let us restate the above result in terms of the tangent spaces in $T_{\theta,q}$. Let us decompose $p_\theta(x)$ into $p_\theta(x) = p(x_A | x_B; \theta)\, p(x_B; \theta)$.
We assume that for any index subset $B$, all elements of $\nabla \log p(x_B; \theta)$ are included in $T^{ML}_{\theta,q}$, which is spanned by the elements of $u_\theta(x) = \nabla \log p_\theta(x)$. Then, $\nabla \log p(x_A | x_B; \theta)$ also lies in $(T^{ML}_{\theta,q})^d$. Thus, $\nabla \log p(x_{A_\ell} | x_{B_\ell}; \theta)$ is expressed as $C_\ell \nabla \log p_\theta(x)$ using a $d$ by $d$ matrix $C_\ell$. If $\sum_{\ell=1}^{M} C_\ell$ is invertible, the local Z-estimator obtained by $\sum_{\ell=1}^{M} \nabla \log p_\theta(x_{A_\ell} | x_{B_\ell})$ is identical to the MLE. In this case, $\Pi_X T^I_{\theta,q} = T^{ML}_{\theta,q}$, i.e., $T^{XI}_{\theta,q} = T^{ML}_{\theta,q}$ holds. Therefore, there is no information loss caused by the localization. The matrix $C_\ell$ for the closed exponential family is given as the projection matrix onto the subspace spanned by $t_{A_\ell,B_\ell}(x) - E_\theta[t_{A_\ell,B_\ell} \mid x_{B_\ell}]$ that is included in $T^{ML}_{\theta,q}$. The above result implies that the tangent space $T^{XC}_{\theta,q}$ expressing the information loss will be related to the score of the marginal distribution, $\nabla \log p_\theta(x_B)$.

Conclusions
In this paper, we investigated some statistical properties of stochastic local Z-estimators and local Z-estimators. The class of local Z-estimators includes pseudo-likelihood and composite likelihood. For stochastic local Z-estimators, we established the exact relation between neighborhood systems and the efficiency bound under a simple and general condition. In addition, the efficiency bound of the local Z-estimators was presented.
Future work includes the study of a more general class of localized estimators. Indeed, local Z-estimators do not include the class of proper local scoring rules [7]. It is worthwhile to derive the efficiency bound for more general localized estimators. Exploring applications of the efficiency bound will be another interesting direction of our study. In our setup, the local efficient score expressed by the projection of the score attains the efficiency bound among local Z-estimators. An important problem is to develop a computationally tractable method to obtain the projection onto tangent subspaces.