Abstract
Statistical models over a discrete sample space often face the computational difficulty of evaluating the normalization constant, which renders the maximum likelihood estimator impractical. In order to circumvent this difficulty, alternative estimators such as the pseudo-likelihood and composite likelihood, which require only a local computation over the sample space, have been proposed. In this paper, we present a theoretical analysis of such localized estimators. The asymptotic variance of a localized estimator depends on the neighborhood system on the sample space. We investigate the relation between the neighborhood system and the estimation accuracy of localized estimators. Moreover, we derive the efficiency bound. The theoretical results are applied to investigate the statistical properties of existing estimators and some extended ones.
1. Introduction
For many statistical models on a discrete sample space, the computation of the normalization constant is intractable. Because of this, the maximum likelihood estimator (MLE) is not of practical use for estimating probability distributions, although the MLE has nice theoretical properties such as statistical consistency and efficiency under some regularity conditions [1].
In order to circumvent the computation of the normalization constant, alternative estimators that require only a local computation over the sample space have been proposed. In this paper, estimators on the basis of such a concept are called localized estimators. Examples of localized estimators include pseudo-likelihood [2], composite likelihood [3,4], ratio matching [5,6], proper local scoring rules [7,8], and many others. These estimators are used for discrete statistical models such as conditional random fields [9], Boltzmann machines [10], restricted Boltzmann machines [11], discrete exponential family harmoniums [12], and Ising models [13].
In this paper, we present a theoretical analysis of localized estimators using standard tools from statistical asymptotic theory. In our analysis, a class of localized estimators including the pseudo-likelihood and composite likelihood is treated as an M-estimator or a Z-estimator, which are extensions of the MLE [1]. Localized estimators require local computation around a neighborhood of the observed points. Hence, the asymptotic variance of a localized estimator depends on the size of the neighborhood. We investigate the relation between the estimation accuracy and the neighborhood system. A similar result is given by [14], in which asymptotic variances of specific composite likelihoods are compared. In our approach, we consider a stochastic variant of localized estimators, and derive a general result that a larger neighborhood leads to a more efficient estimator under a simple condition. The pseudo-likelihood and composite likelihood are obtained as the expectation of a stochastic localized estimator. We derive the exact efficiency bound for the expected localized estimator. As far as we know, the derivation of the efficiency bound is a new result for this class of localized estimators, although upper and lower bounds have been proposed [14] for some localized estimators.
The rest of the paper is organized as follows. In Section 2, we introduce basic concepts such as the pseudo-likelihood, composite likelihood, and Z-estimators. Section 3 is devoted to defining the stochastic local Z-estimator associated with a neighborhood system over the discrete sample space. In Section 4, we study the relation between the neighborhood system and the asymptotic efficiency of the stochastic local Z-estimator. In Section 5, we define the local Z-estimator as the expectation of the stochastic local Z-estimator, and present its efficiency bound. The theoretical results are applied to study asymptotic properties of existing estimators and some extended ones. Finally, Section 6 concludes the paper with discussions.
2. Preliminaries
M-estimators and Z-estimators were proposed as extensions of the MLE. In practice, M-estimators and Z-estimators are often computationally demanding due to the normalization constant in statistical models. To circumvent this computational difficulty, localized estimators have been proposed. We introduce some existing localized estimators, especially those on discrete sample spaces. In later sections, we consider statistical properties of a localized variant of Z-estimators.
Let us summarize the notations to be used throughout the paper. Let be the set of all real numbers. The discrete sample space is denoted as . The statistical model for with the parameter is also expressed as . The vector a usually denotes the column vector, and denotes the transposition of vector or matrix. For a linear space T and an integer d, denotes the d-fold product space of T, and the element is expressed as . The product space of two subspaces and that are orthogonal to each other is denoted as . For the function of the parameter , denotes the gradient vector . The indicator function is denoted as that takes 1 if A is true and 0 otherwise.
2.1. M- and Z-Estimators
Suppose that samples are i.i.d. distributed from the probability over the discrete sample space . A statistical model with the parameter is employed to estimate . In this paper, our concern is the statistical efficiency of estimators. Hence, we suppose that the statistical model includes .
The MLE is commonly used to estimate . It uses the negative log-likelihood of the model, , as the loss function and the estimator is given by the minimum solution of its empirical mean, .
Generally, the estimator obtained by the minimum solution of a loss function is referred to as the M-estimator. The MLE is an example of M-estimators. When the loss function is differentiable, the gradient of the loss function vanishes at the estimated parameter. Instead of minimizing loss function, a solution of the system of equations also provides an estimator of the parameter. Such an estimator is called the Z-estimator [1]. In the MLE, the system of equations is given as
where is the null-vector. The gradient is known as the score function of the model . In this paper, the score function is denoted as
In general, the Z-estimator is defined as the solution of the system of equations
where the -valued function is referred to as the identification function [15,16]. For an M-estimator, the identification function is given as the gradient of the loss function. In general, however, an identification function is not necessarily expressed as the gradient of a loss function, since it may not be integrable. The identification function itself is also called the Z-estimator with some abuse of terminology.
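As a toy illustration of the Z-estimator viewpoint (a hypothetical one-parameter exponential family, not a model from the paper), on a very small sample space the normalization constant is tractable and the empirical score equation can be solved directly; the model and solver below are assumptions for illustration only:

```python
import math
import random

SPACE = [0, 1, 2, 3]  # a toy discrete sample space

def probs(theta):
    # p(x; theta) = exp(theta * x) / Z(theta); Z is cheap on a 4-point space
    w = [math.exp(theta * x) for x in SPACE]
    z = sum(w)
    return [wi / z for wi in w]

def model_mean(theta):
    # E_theta[x] under the model
    return sum(x * p for x, p in zip(SPACE, probs(theta)))

def mle(samples, lo=-20.0, hi=20.0, tol=1e-10):
    """Z-estimator form of the MLE: for this exponential family the
    empirical score equation reduces to moment matching,
        mean(samples) - E_theta[x] = 0,
    and E_theta[x] is increasing in theta, so bisection finds the root."""
    target = sum(samples) / len(samples)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if model_mean(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

random.seed(0)
samples = random.choices(SPACE, weights=probs(0.7), k=20000)
theta_hat = mle(samples)
```

The localized estimators introduced below are motivated precisely by the case where the sample space is too large for this direct evaluation of the normalization constant.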
2.2. Localized Estimators
Below, let us introduce some localized estimators. The statistical model defined on the discrete set is denoted by
for , where is the normalization constant at the parameter θ, i.e.,
Throughout the paper, we assume for all and all .
Example 1 (Pseudo-likelihood).
Suppose that is expressed as the product space , where are finite sets such as . For , let be the dimensional vector defined by dropping the k-th element of x. The loss function of the pseudo-likelihood, , is defined as the negative log-likelihood of the conditional probability defined from , i.e.,
The pseudo-likelihood does not require the normalization constant, and it satisfies the statistical consistency of the parameter estimation [2,17]. The identification function of the corresponding Z-estimator is obtained by the gradient vector of the loss Function (2).
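To make the locality of the pseudo-likelihood concrete, the following sketch fits a hypothetical three-site Ising chain (an illustration, not an example from the paper); each conditional probability depends only on the neighbors of one site, so the global normalization constant never appears:

```python
import math
import random
from itertools import product

STATES = list(product([-1, 1], repeat=3))  # 3-site Ising chain, x_k in {-1, +1}

def stat(x):
    # sufficient statistic of the chain: t(x) = x1*x2 + x2*x3
    return x[0] * x[1] + x[1] * x[2]

def neighbor_sum(x, k):
    # sum of the neighbors of site k in the chain 1-2-3
    return {0: x[1], 1: x[0] + x[2], 2: x[1]}[k]

def pl_gradient(theta, samples):
    """Gradient of the pseudo-likelihood loss sum_k -log p(x_k | x_{-k}; theta),
    averaged over the samples.  Each conditional involves only the neighbor
    sum s_k, so no global normalization constant is needed."""
    g = 0.0
    for x in samples:
        for k in range(3):
            s = neighbor_sum(x, k)
            g += -x[k] * s + s * math.tanh(theta * s)
    return g / len(samples)

def pseudo_likelihood_fit(samples, lo=-10.0, hi=10.0, tol=1e-9):
    # the PL loss is convex in theta, so its gradient is nondecreasing:
    # bisection on the gradient finds the minimizer
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if pl_gradient(mid, samples) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

random.seed(1)
true_theta = 0.5
weights = [math.exp(true_theta * stat(x)) for x in STATES]
samples = random.choices(STATES, weights=weights, k=5000)
theta_hat = pseudo_likelihood_fit(samples)
```

Consistent with [2,17], the estimate recovers the true parameter as the sample size grows, while every evaluation touches only a neighborhood of each observed point.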
Example 2 (Composite likelihood).
The composite likelihood was proposed as an extension of the pseudo-likelihood [3]. Suppose that is expressed as the product space as in Example 1. For the index subset , let be the subvector of . For each , suppose that and are a pair of disjoint subsets in , and let be the complement of the union , i.e., . Given positive constants , the loss function of the composite likelihood, , is defined as
The composite likelihood using the subsets and the positive constant for yields the pseudo-likelihood. Like the pseudo-likelihood, the composite likelihood enjoys statistical consistency under some regularity conditions [4].
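The displayed equation for the composite-likelihood loss is not reproduced above; in commonly used notation (an assumption here, following the conventions of [3,4]), it takes the form

```latex
\ell_{\mathrm{CL}}(x;\theta)
  = -\sum_{\ell=1}^{L} \beta_{\ell}\,
      \log p\bigl(x_{A_{\ell}} \mid x_{B_{\ell}};\theta\bigr),
  \qquad \beta_{\ell} > 0 .
```

Choosing each index subset \(A_{\ell}\) to be a single coordinate, \(B_{\ell}\) its complement, and \(\beta_{\ell}=1\) recovers the pseudo-likelihood loss of Example 1.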
Originally, the pseudo and composite likelihoods were proposed to deal with spatial data [2,3]. As a generalization of these estimators, a localized variant of scoring rules works effectively for the statistical analysis of discrete spatial data [18].
3. A Stochastic Variant of Z-Estimators
In this section, we define a stochastic variant of Z-estimators. For the discrete sample space , suppose that the neighborhood system N is defined as a subset of the power set , i.e., N is a family of subsets in . Let us define the neighborhood system at by . We assume that is not empty for any x. In some class of localized estimators, the neighborhood system is expressed using an undirected graph on [7]. In our setup, the neighborhood system is not necessarily expressed by an undirected graph, and we allow the neighborhood system to possess multiple neighbors at each point x.
Let us define the stochastic Z-estimator. A conditional probability of the set given is denoted as . We assume that if throughout the paper. Given a sample x, we randomly generate a neighborhood e from the conditional probability . Using i.i.d. copies of , we estimate . Here, the statistical model of the form (1) is used. We use the Z-estimator to estimate the parameter . The element of is denoted as or for . The expectation under the probability is written as . Suppose that the equality
holds for all . In addition, we assume that the vectors are linearly independent, meaning that depends substantially on the parameter θ [19]. The solution of the system of equations
produces a statistically consistent estimator under some regularity condition [1]. In the parameter estimation of the model , the stochastic Z-estimator is defined as the solution of (4) using the identification function satisfying (3). As shown in Section 5, stochastic Z-estimators are useful to investigate statistical properties of the standard pseudo-likelihood and composite likelihood in Examples 1 and 2.
The computational tractability of the stochastic Z-estimator is not necessarily guaranteed. The MLE using the score function is regarded as a stochastic Z-estimator for any , and it may not be computationally tractable because of the normalization constant. As a class of computationally efficient estimators, let us define the stochastic local Z-estimator as the stochastic Z-estimator using satisfying
for any neighborhood , where is the conditional expectation given e. The conditional probability of can take a positive value only when . Hence, depends only on the neighborhood of x and its computation will be tractable.
Example 3 (Stochastic pseudo-likelihood).
Let us define the stochastic variant of the pseudo-likelihood estimator. On the sample space , the neighborhood system at is defined as , where is given as
In order to estimate the parameters in complex models such as conditional random fields and Boltzmann machines, the union, , is often used as the neighborhood at x [2,9,17]. Define the conditional probability on as , where are positive numbers satisfying . The identification function of the stochastic pseudo-likelihood is defined by
for . Then, is equal to . The conditional probability derived from is given as
where we used the equality for . Hence, the equality (5) holds for any . When depends on x, is different from in general.
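The stochastic pseudo-likelihood can be sketched on a hypothetical three-site Ising chain (an illustration with assumed model and names, not taken from the paper): for each observation, one coordinate is drawn at random, and only that coordinate's conditional score enters the estimating equation:

```python
import math
import random
from itertools import product

STATES = list(product([-1, 1], repeat=3))  # hypothetical 3-site Ising chain

def neighbor_sum(x, k):
    # chain 1-2-3: neighbor sum of site k
    return {0: x[1], 1: x[0] + x[2], 2: x[1]}[k]

def identification(theta, x, k):
    # d/dtheta of -log p(x_k | x_{-k}; theta); it depends only on the
    # neighborhood of x, so it is "local" in the sense of the text
    s = neighbor_sum(x, k)
    return -x[k] * s + s * math.tanh(theta * s)

def stochastic_pl_fit(pairs, lo=-10.0, hi=10.0, tol=1e-9):
    # pairs = i.i.d. copies of (x, k); solve the estimating equation
    #   (1/m) sum_i identification(theta, x_i, k_i) = 0
    # by bisection (each summand is nondecreasing in theta)
    def mean_id(theta):
        return sum(identification(theta, x, k) for x, k in pairs) / len(pairs)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean_id(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

random.seed(2)
true_theta = 0.5
weights = [math.exp(true_theta * (x[0] * x[1] + x[1] * x[2])) for x in STATES]
xs = random.choices(STATES, weights=weights, k=30000)
pairs = [(x, random.randrange(3)) for x in xs]  # uniform q(e|x) over coordinates
theta_hat = stochastic_pl_fit(pairs)
```

The extra randomization over coordinates inflates the variance relative to the deterministic pseudo-likelihood, which is exactly the gap quantified in Sections 4 and 5.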
Example 4 (Stochastic variant of composite likelihood).
Let us introduce a stochastic variant of composite likelihood on the sample space . Below, notations in Example 2 are used. Let us define by the subset , and the neighborhood system by . We assume that the map from ℓ to is one to one. In other words, the disjoint subsets can be specified from the neighborhood . Suppose that the conditional probability on is defined as for , where are positive numbers satisfying . As in Example 3, we see that the conditional probability defined from is given as . Let us consider the identification function,
which is nothing but . Then, (5) holds under the joint probability . Indeed, we have
for any . In this paper, the Z-estimator using (6) is called the reduced stochastic composite likelihood (reduced-SCL). The stochastic composite likelihood proposed in [20] is a randomized extension of the above . Let be a binary random vector taking an element of , and be positive constants. The SCL is defined as the Z-estimator obtained by
The statistical consistency and asymptotic normality of the SCL are shown in [20].
4. Neighborhood Systems and Asymptotic Variances
We consider the relation between neighborhood systems and statistical properties of stochastic local Z-estimators.
4.1. Tangent Spaces of Statistical Models
To begin, let us introduce some geometric concepts for investigating statistical properties of localized estimators. These concepts are borrowed from information geometry [21]. For the neighborhood system N with the conditional probability , let us define the linear space as
The inner product for is defined as . A geometric meaning of is the tangent space of the statistical model . For any and sufficiently small , the perturbation of to the direction leads to the probability function . Each element of the score function is a member of by regarding as .
Let us consider the stochastic Z-estimator derived from satisfying for any θ. It leads to a Fisher consistent estimator. Stochastic local Z-estimators use an identification function in the linear subspace
The orthogonal complement of in is denoted as , which is given as
Indeed, the orthogonality of and is confirmed by
for any and any . In addition, any can be decomposed into
such that and .
The efficient score is defined as the projection of each element of the score onto , i.e.,
The efficient score is computationally tractable when the size of the neighborhood e is not of exponential order but of linear or low-degree polynomial order in n, where n is the dimension of x. The trade-off between computational and statistical efficiency is presented in Theorems 1 and 2 in Section 4.2.
Another expression of the efficient score is
where the conditional probability is defined from . The above equality is obtained by
We define as the subspace of spanned by , and let be the orthogonal complement of in . As a result, we obtain
We describe statistical properties of stochastic local Z-estimators using the above tangent spaces.
4.2. Asymptotic Variance of Stochastic Local Z-Estimators
Under a fixed conditional probability , we derive the asymptotically efficient stochastic local Z-estimator in the same way as in semi-parametric estimation [19,22]. In addition, we consider the monotonicity of the efficiency with respect to the size of the neighborhood. Given i.i.d. samples , generated from , the estimator of the parameter in the model (1) is obtained by solving the system of Equations (4), where for any . Suppose that the true probability function is realized by of the model (1). As shown in [1], the standard asymptotic theory yields that the asymptotic variance of the above Z-estimator is given as
The derivation of the asymptotic variance is presented in Appendix A for completeness.
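Since the displayed formulas are not reproduced in this version of the text, we record the asymptotic variance in commonly used (assumed) notation: for a Z-estimator with identification function \(\psi\), (9) is the usual sandwich form

```latex
V(\theta) = A(\theta)^{-1}\, B(\theta)\, A(\theta)^{-\top},
\qquad
A(\theta) = \mathbb{E}\bigl[\nabla_{\theta}\psi\bigr],
\quad
B(\theta) = \mathbb{E}\bigl[\psi\,\psi^{\top}\bigr],
```

evaluated at the true parameter. When \(\psi\) is the score of the model, \(A\) is minus the Fisher information matrix, \(B\) is the Fisher information matrix, and \(V\) reduces to the inverse Fisher information.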
We consider the asymptotic variance of the stochastic local Z-estimators. A simple expression of the asymptotic variance is obtained using the efficient score . Without loss of generality, the identification function of the stochastic local Z-estimator, , is expressed as
where . The reason is briefly shown below. Suppose that is decomposed into , where is a d by d matrix that does not depend on x and e. The condition that the matrix is invertible assures that is invertible, since
holds. In the above equalities, we use the formula
for that is obtained by differentiating the identity . Clearly, provides the same estimator as . See [19] for details of the standard form of Z-estimators.
Theorem 1.
Let us define the d by d matrix by
Then, for a fixed conditional probability , the asymptotic variance of any stochastic local Z-estimator satisfies the inequality
in the sense of the non-negative definiteness. The equality is attained by the Z-estimator using .
Proof.
Let us compute each matrix in (9). According to the above argument, without loss of generality, we assume for . The matrix is then expressed as
due to and . Let us define . Then, we have
As a result, we obtain
When , the matrix A becomes the null matrix and the minimum asymptotic variance is attained. ☐
The minimum variance of stochastic local Z-estimators is attained by the efficient score. This conclusion agrees with the result on asymptotically efficient estimators in semi-parametric models including nuisance parameters [19,22].
Remark 1.
Let us consider the relation between the stochastic pseudo-likelihood and efficient score . Suppose that the neighborhood system and the conditional distribution on are defined as shown in Example 3. Then, we have . Likewise, we find that the reduced-SCL, , is equivalent to the efficient score under the setup in Example 4 when the index subset is defined as .
4.3. Monotonicity of Asymptotic Efficiency
As described in [23], for the composite likelihood estimator with the index pairs , , it is widely believed that by increasing the size of (and correspondingly decreasing the size of ), one can capture more dependency relations in the model and increase the accuracy. For the stochastic local Z-estimators, we can obtain the exact relation between the neighborhood system and asymptotic efficiency.
Let us consider two stochastic local Z-estimators; one is defined by on the neighborhood system and the other is given by on the neighborhood system . The efficient scores are respectively written as for and for . In addition, let us define and .
Theorem 2.
Let be the joint probability of and suppose that probability functions, and , are obtained from . We assume that
holds under the probability distribution . Then, we have
i.e., the efficiency bound of and is smaller than or equal to that of and .
Proof.
We use the basic formula of the conditional variance
for random variables X and Z. The above formula is applied to the score and the efficient score . Note that holds. Then, we have
The last equality comes from the fact that the score is common in both setups. Since the equality (10) holds, again the Formula (12) with and yields
Thus, we obtain
As a result, we have (11). ☐
A similar inequality is derived in [24] for the mutual Fisher information. The mutual Fisher information is more closely related to than to . Theorem 13 of [24] corresponds to the one-dimensional version of the inequality .
Let us show an example that satisfies (10). We define two neighborhood systems and such that, for any , there exists satisfying . For the joint probability , suppose that x and are conditionally independent given e and that the conditional probability derived from is equal to zero unless . Under these conditions, derived from takes 0 if . The conditional independence assures that is expressed as . Hence, the conditional probability is expressed as . Thus, we obtain
As a result, the better efficiency bound is obtained by the larger neighborhood. A similar result is presented in [25] for the composite likelihood estimators. The relation of the result in [25] and ours is explained in Section 5.3 of this paper.
Example 5.
Let be a neighborhood system at x endowed with the conditional distribution . Another neighborhood system is defined as for all x, and for . Let us define for and otherwise . Since always takes , x and are conditionally independent given e. Thus, we have . Indeed, is the Fisher information matrix of the model .
We compare the stochastic pseudo-likelihood and reduced-SCL. Let be the neighborhood system defined in Example 3, and N be . The conditional distribution on is given by . As shown in Remark 1, the corresponding efficient score is nothing but the stochastic pseudo-likelihood, i.e., . Let us define another neighborhood system in the same way as Example 4. For the subsets and , we define as and . Let be . The conditional distribution on is given as for . Then, the efficient score associated with and is equal to the reduced-SCL, i.e., . As a direct consequence of Theorem 2 and the above argument on the conditional independence between x and given , we obtain the following corollary.
Corollary 1.
We define for by . Let be a conditional probability on given , where is assumed for . If the equality holds, the reduced-SCL with and is more efficient than the stochastic pseudo-likelihood with N and q.
Example 6.
Suppose that the size of is the same for all and that the size of the set is the same for any and such that . Let (resp. ) be the uniform distribution on (resp. ). Then, the reduced-SCL is more efficient than the stochastic pseudo-likelihood. Indeed, the assumption ensures that the sum does not depend on x and . Thus, the uniform distribution meets the condition of the above corollary. For example, let be all subsets of size in . Then, we have . The size of is , and the size of is equal to 2.
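The ordering of asymptotic variances discussed in this section can be checked exactly on a small model. The sketch below (a hypothetical three-site Ising chain with assumed notation, not an example from the paper) enumerates all eight states and evaluates the sandwich variance for the MLE, the pseudo-likelihood, and its stochastic variant with a uniformly drawn coordinate:

```python
import math
from itertools import product

# Exact sandwich variances on a hypothetical 3-site Ising chain,
# p(x; theta) proportional to exp(theta * (x1 x2 + x2 x3)).
THETA = 0.5
STATES = list(product([-1, 1], repeat=3))
W = [math.exp(THETA * (x[0] * x[1] + x[1] * x[2])) for x in STATES]
Z = sum(W)
P = [w / Z for w in W]

def neighbor_sum(x, k):
    return {0: x[1], 1: x[0] + x[2], 2: x[1]}[k]

def g_coord(x, k):
    # identification function of one conditional: d/dtheta -log p(x_k | x_{-k})
    s = neighbor_sum(x, k)
    return -x[k] * s + s * math.tanh(THETA * s)

def dg_coord(x, k):
    # its derivative in theta
    s = neighbor_sum(x, k)
    return s * s / math.cosh(THETA * s) ** 2

def expect(f):
    return sum(p * f(x) for p, x in zip(P, STATES))

# MLE: asymptotic variance is 1 / Fisher information = 1 / Var(t(x))
t = lambda x: x[0] * x[1] + x[1] * x[2]
fisher = expect(lambda x: t(x) ** 2) - expect(t) ** 2
v_mle = 1.0 / fisher

# Pseudo-likelihood: identification function is the sum over coordinates
A_pl = expect(lambda x: sum(dg_coord(x, k) for k in range(3)))
B_pl = expect(lambda x: sum(g_coord(x, k) for k in range(3)) ** 2)
v_pl = B_pl / A_pl ** 2

# Stochastic PL: a coordinate k is drawn uniformly, so the second moment
# averages g_k^2 instead of squaring the average (Jensen's inequality)
A_spl = expect(lambda x: sum(dg_coord(x, k) for k in range(3)) / 3)
B_spl = expect(lambda x: sum(g_coord(x, k) ** 2 for k in range(3)) / 3)
v_spl = B_spl / A_spl ** 2
```

On this toy model the computed values satisfy v_mle ≤ v_pl ≤ v_spl, matching the qualitative picture of this section: localization, and additional randomization over neighborhoods, can only increase the asymptotic variance.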
5. Local Z-Estimators and Efficiency Bounds
In this section, we define the local Z-estimator as the expectation of a stochastic local Z-estimator, and derive its efficiency bound.
5.1. Local Z-Estimators
Computationally tractable estimators such as pseudo-likelihood and composite likelihood are obtained by the expectation of an identification function in . Let us define the local Z-estimator as the Z-estimator using
where . The conditional expectation given x is regarded as the projection onto the subspace which is defined as
Let be the projection operator onto and be the one onto the orthogonal complement of . Then, one can prove and for . When the number of elements in each neighborhood is moderate, the computation of the local Z-estimator is tractable.
Below, we show that some estimators are expressed as the local Z-estimator.
Example 7 (Pseudo-likelihood and composite likelihood).
In the setup of Example 3, the conditional expectation of the efficient score, , yields the pseudo-likelihood when is the uniform distribution on . In the setup of Example 4, let us assume and . Then, the conditional expectation of the efficient score yields
which is the general form of the composite likelihood in Example 2 with .
5.2. Efficiency Bounds
We derive the efficiency bound of the local Z-estimator. Without loss of generality, the local Z-estimator is represented as
Under the model , we calculate the asymptotic variance (9) of the local Z-estimator using . The matrix in (9) is given as
Hence, we have
Here, the expectation can be written as the expectation under , i.e., , since and depend only on x. The orthogonal decomposition leads to
meaning that the asymptotic variance of the stochastic local Z-estimator using is larger than or equal to that of the local Z-estimator using .
We consider the optimal choice of in . Let us define the subspace as , and be the projection operator onto . Then, we define as the projection of onto the orthogonal complement of in , i.e.,
for . In this paper, we call the local efficient score.
Theorem 3.
Let us define d by d matrix as . Then, the efficiency bound of the local Z-estimator is given as
The equality is attained by the local Z-estimator using the local efficient score .
Proof.
has the orthogonal decomposition , where . Hence, we obtain and
The left-hand side of the above inequality is the asymptotic variance of the local Z-estimator. The equality is attained by the local Z-estimator using . ☐
We consider the relation between the local efficient score and the score . We define as the subspace spanned by the score . For any , we have
meaning that and are orthogonal to each other. Hence, is decomposed into
where is the orthogonal complement of in . Eventually, subspaces in satisfy the following relations,
Let us define as the subspace spanned by the local efficient score . Under a mild assumption, and have the same dimension. Since is orthogonal to , is included in . Hence, is interpreted as the subspace expressing the information loss caused by the localization of the score .
5.3. Relation to Existing Works
5.3.1. Comparison of Local Z-Estimators
We compare the asymptotic variances of two local Z-estimators that are connected to composite likelihoods.
One estimator is defined from the neighborhood system N which consists of the singleton . Here, we assume that holds for and . Such a neighborhood system N is called an equivalence class [25]. An equivalence class corresponds to a partition of the sample space. The conditional probability takes 1 for and 0 otherwise. Let be the efficient score defined from N and , and be the local Z-estimator .
Another localized estimator is defined from the neighborhood system which consists of , where is not necessarily a singleton. Suppose that holds for any . The conditional probability is defined as , where is a conditional probability of given . The corresponding efficient score is denoted as and let us define as the local Z-estimator associated with and .
From the definition, the joint probability agrees with . Hence, x and are conditionally independent given e, and Theorem 2 guarantees the inequality .
The efficient score can take a non-zero value only when . Hence, is regarded as a function of x, i.e., , and the asymptotic variance of the local Z-estimator obtained by is . On the other hand, the asymptotic variance of the local Z-estimator derived from is less than or equal to due to (13). Therefore, with and provides more efficient estimators than with N and q.
Liang and Jordan presented a similar result in [25]. In their setup, the larger neighborhood is a singleton and the smaller one, , can have multiple neighborhoods at each x. In such a case, the same relation holds, i.e., the estimator with is more efficient. However, their approach is different from ours. In [25], the randomness is introduced over the patterns of the partition of . Moreover, their identification function corresponding to our is decomposed into two terms; one is the term conditioned on the partition and the other is its orthogonal complement. On the other hand, our approach uses the decomposition of into and its orthogonal complement. In their analysis, the simplified expression of the asymptotic variance shown in (9) and the standard expression of the identification function, , are not used. Hence, the evaluation of the asymptotic variance yields a rather complex dependency on the estimator. As a result, their approach does not show the efficiency bound, though the asymptotic variance of the composite likelihood for exponential families is presented under a misspecified setup.
5.3.2. Closed Exponential Families
The so-called closed exponential family has an interesting property from the viewpoint of localized estimators, as presented in [26]. Let be the exponential family defined for with the parameter . The function is referred to as the sufficient statistic. Given disjoint index subsets , let be all elements of that depend only on , and be the other elements. Hence, is expressed as . The parameter θ is correspondingly decomposed into . Thus, we have . The exponential family is called a closed exponential family when the marginal distribution of is expressed as an exponential family with the sufficient statistic .
We consider the composite likelihood of the closed exponential family. For the pairs of two disjoint index subsets, , suppose that any element of is included in at least one ℓ. Then, the local Z-estimator using the composite likelihood is identical to the MLE [26]. Hence, the composite likelihood of the closed exponential family attains the efficiency bound of the MLE.
For the general statistical model , let us restate the above result in terms of the tangent spaces in . Let us decompose into
We assume that for any index subset B, all elements of are included in that is spanned by the elements of . Then, also lies in . Thus, is expressed as using a d by d matrix . If is invertible, the local Z-estimator obtained by is identical to the MLE. In this case, , i.e., holds. Therefore, there is no information loss caused by the localization. The matrix for the closed exponential family is given as the projection matrix onto the subspace spanned by that is included in . The above result implies that the tangent space expressing the information loss will be related to the score of the marginal distribution, .
6. Conclusions
In this paper, we investigated statistical properties of stochastic local Z-estimators and local Z-estimators. The class of local Z-estimators includes the pseudo-likelihood and composite likelihood. For stochastic local Z-estimators, we established an exact relation between neighborhood systems and the efficiency bound under a simple and general condition. In addition, we presented the efficiency bound of local Z-estimators.
Future work includes the study of a more general class of localized estimators. Indeed, local Z-estimators do not include the class of proper local scoring rules [7]. It is worthwhile to derive the efficiency bound for more general localized estimators. Exploring applications of the efficiency bound is another interesting direction of our study. In our setup, the local efficient score, expressed as the projection of the score, attains the efficiency bound among local Z-estimators. An important problem is to develop a computationally tractable method to obtain the projection onto tangent subspaces.
Conflicts of Interest
The author declares no conflict of interest.
Appendix A. Asymptotic Variance of Stochastic Local Z-Estimators
Given i.i.d. samples from , we estimate the parameter θ using the stochastic local Z-estimator obtained by
where the identification function satisfies for any . The Taylor expansion around the true parameter θ yields
where the element is given as . As m tends to infinity, the asymptotic distribution of is given as the multivariate normal distribution,
Since holds for any θ, the derivative is the null matrix. This fact yields
Hence, the asymptotic distribution of is the d-dimensional normal distribution with mean and variance .
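As a numerical sanity check of the asymptotic distribution derived above (on a hypothetical two-site Ising model, not a model from the paper), the following sketch exploits the fact that on this tiny model the pseudo-likelihood equation has the closed form tanh(theta_hat) = mean(x1*x2), and compares the Monte Carlo variance of the scaled error with the sandwich value, which evaluates to cosh(theta)^2 here:

```python
import math
import random

# Monte Carlo check of the sandwich variance on a hypothetical two-site
# Ising model p(x; theta) proportional to exp(theta * x1 * x2), x_k in {-1, +1}.
# The pseudo-likelihood estimator solves tanh(theta_hat) = mean(x1 * x2),
# and the sandwich variance B / A^2 equals cosh(theta)^2, which is also
# the inverse Fisher information (PL coincides with the MLE here).
THETA = 0.5
M = 500        # samples per replication
R = 2000       # replications
P_AGREE = (1.0 + math.tanh(THETA)) / 2.0  # P(x1 * x2 = +1)

random.seed(3)
scaled_errors = []
for _ in range(R):
    # simulate the sufficient statistic x1 * x2 directly
    s = sum(1 if random.random() < P_AGREE else -1 for _ in range(M))
    theta_hat = math.atanh(s / M)
    scaled_errors.append(math.sqrt(M) * (theta_hat - THETA))

mean_err = sum(scaled_errors) / R
emp_var = sum(e * e for e in scaled_errors) / R - mean_err ** 2
predicted = math.cosh(THETA) ** 2  # sandwich / inverse Fisher variance
```

The empirical variance of the scaled errors agrees with the predicted value up to Monte Carlo noise, consistent with the normal limit stated in this appendix.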
References
- Van der Vaart, A.W. Asymptotic Statistics; Cambridge University Press: Cambridge, UK, 2000. [Google Scholar]
- Besag, J. Spatial interaction and the statistical analysis of lattice systems. J. R. Stat. Soc. Ser. B 1974, 36, 192–236. [Google Scholar]
- Lindsay, B.G. Composite likelihood methods. Contemp. Math. 1988, 80, 220–239. [Google Scholar]
- Varin, C.; Reid, N.; Firth, D. An overview of composite likelihood methods. Stat. Sin. 2011, 21, 5–42. [Google Scholar]
- Hyvärinen, A. Connections between score matching, contrastive divergence, and pseudolikelihood for continuous-valued variables. IEEE Trans. Neural Netw. 2007, 18, 1529–1531. [Google Scholar] [CrossRef]
- Hyvärinen, A. Some extensions of score matching. Comput. Stat. Data Anal. 2007, 51, 2499–2512. [Google Scholar] [CrossRef]
- Dawid, A.P.; Lauritzen, S.; Parry, M. Proper local scoring rules on discrete sample spaces. Ann. Stat. 2012, 40, 593–608. [Google Scholar] [CrossRef]
- Kanamori, T.; Takenouchi, T. Graph-Based Composite Local Bregman Divergences on Discrete Sample Spaces. 2016; arXiv:1604.06568. [Google Scholar]
- Lafferty, J.D.; McCallum, A.; Pereira, F.C.N. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML ’01, San Francisco, CA, USA, 28 June–1 July 2001; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 2001; pp. 282–289. [Google Scholar]
- Ackley, D.H.; Hinton, G.E.; Sejnowski, T.J. A learning algorithm for Boltzmann machines. Cognit. Sci. 1985, 9, 147–169. [Google Scholar] [CrossRef]
- Smolensky, P. Information Processing in Dynamical Systems: Foundations of Harmony Theory. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1; MIT Press: Cambridge, MA, USA, 1986; pp. 194–281. [Google Scholar]
- Welling, M.; Rosen-Zvi, M.; Hinton, G.E. Exponential family harmoniums with an application to information retrieval. In Advances in Neural Information Processing Systems 17; Saul, L.K., Weiss, Y., Bottou, L., Eds.; MIT Press: Cambridge, MA, USA, 2005; pp. 1481–1488. [Google Scholar]
- Ising, E. Beitrag zur Theorie des Ferromagnetismus. Zeitschrift für Physik 1925, 31, 253–258. (In German) [Google Scholar] [CrossRef]
- Marlin, B.; de Freitas, N. Asymptotic efficiency of deterministic estimators for discrete energy-based models: Ratio matching and pseudolikelihood. In Uncertainty in Artificial Intelligence (UAI); AUAI Press: Corvallis, OR, USA, 2011. [Google Scholar]
- Gneiting, T. Making and evaluating point forecasts. J. Am. Stat. Assoc. 2011, 106, 746–762. [Google Scholar] [CrossRef]
- Steinwart, I.; Pasin, C.; Williamson, R.C.; Zhang, S. Elicitation and identification of properties. In Proceedings of the 27th Conference on Learning Theory, COLT 2014, Barcelona, Spain, 13–15 June 2014; pp. 482–526.
- Hyvärinen, A. Consistency of pseudolikelihood estimation of fully visible Boltzmann machines. Neural Comput. 2006, 18, 2283–2292. [Google Scholar] [CrossRef] [PubMed]
- Dawid, A.; Musio, M. Estimation of spatial processes using local scoring rules. AStA Adv. Stat. Anal. 2013, 97, 173–179. [Google Scholar] [CrossRef]
- Amari, S.; Kawanabe, M. Information geometry of estimating functions in semi-parametric statistical models. Bernoulli 1997, 3, 29–54. [Google Scholar] [CrossRef]
- Dillon, J.V.; Lebanon, G. Stochastic composite likelihood. J. Mach. Learn. Res. 2010, 11, 2597–2633. [Google Scholar]
- Cichocki, A.; Amari, S. Families of alpha- beta- and gamma-divergences: Flexible and robust measures of similarities. Entropy 2010, 12, 1532–1568. [Google Scholar] [CrossRef]
- Bickel, P.J.; Klaassen, C.A.J.; Ritov, Y.; Wellner, J.A. Efficient and Adaptive Estimation for Semiparametric Models; Springer-Verlag: New York, NY, USA, 1998. [Google Scholar]
- Asuncion, A.U.; Liu, Q.; Ihler, A.T.; Smyth, P. Learning with blocks: Composite likelihood and contrastive divergence. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, Sardinia, Italy, 13–15 May 2010; Volume 9, pp. 33–40.
- Zegers, P. Fisher information properties. Entropy 2015, 17, 4918–4939. [Google Scholar] [CrossRef]
- Liang, P.; Jordan, M.I. An asymptotic analysis of generative, discriminative, and pseudolikelihood estimators. In Proceedings of the 25th International Conference on Machine Learning, ICML ’08, Helsinki, Finland, 5–9 July 2008; ACM: New York, NY, USA, 2008; pp. 584–591. [Google Scholar]
- Mardia, K.V.; Kent, J.T.; Hughes, G.; Taylor, C.C. Maximum likelihood estimation using composite likelihoods for closed exponential families. Biometrika 2009, 96, 975–982. [Google Scholar] [CrossRef]
© 2016 by the author; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC-BY) license (http://creativecommons.org/licenses/by/4.0/).