The Onset of Parisi’s Complexity in a Mismatched Inference Problem

We show that a statistical mechanics model in which both the Sherrington–Kirkpatrick and Hopfield Hamiltonians appear, and which is equivalent to a high-dimensional mismatched inference problem, is described by a replica symmetry-breaking Parisi solution.


Introduction
Beginning with Parisi's seminal works on the Sherrington–Kirkpatrick (SK) model [1,2], the ideas and tools developed in spin glass theory have spread across many other research fields, such as computer science, probability theory, and neural networks [3][4][5][6][7]. From a mathematical perspective, efforts to rigorously prove Parisi's theory have yielded powerful techniques, such as interpolation methods [8,9], stochastic stability [10], and synchronization [11], which are currently instrumental in analyzing numerous disordered systems.
In this work, we consider a family of mean-field spin glasses whose Hamiltonian contains two types of random interactions: the first is of SK type, while the second is a Hopfield model with a finite number of patterns. This class of models can also be interpreted as a high-dimensional inference problem, known as the spiked Wigner model, in a mismatched setting [12][13][14][15][16].
Our main result is a representation of the thermodynamic limit of the quenched pressure per particle in terms of a variational problem of Parisi type. The proof relies on two main ingredients: Guerra's replica symmetry-breaking bound, which allows us to control the SK contribution, and adaptive interpolation, which is employed to linearize the Hopfield interaction. We start with a review of the SK model in Section 2 and then lay the ground for the inferential interpretation of the model under study in Section 3. In Section 4, we define the model and rigorously identify its exact solution. Finally, we describe some interesting challenges for future investigations.

The SK Model
The SK model was introduced in the 1970s by D. Sherrington and S. Kirkpatrick [17] and stands as an explicitly solvable mean-field spin glass. In their work, the authors discovered that the solution obtained through the replica symmetric (RS) approximation is not correct at low temperature. With a groundbreaking approach, Parisi identified a new type of solution, nowadays called replica symmetry breaking (RSB), which proved to be correct at any temperature, thereby revealing a novel mathematical and physical structure [18].
The SK model is defined by its Hamiltonian, which is a function of N spins σ = (σ_i)_{i≤N} ∈ {−1, 1}^N:

$$H_N(\sigma) = -\frac{1}{\sqrt{N}}\sum_{i,j\le N} z_{ij}\,\sigma_i\sigma_j\,, \qquad (1)$$

where z = (z_{ij})_{i,j≤N} is a collection of i.i.d. standard Gaussian random variables. In physical terms, the couplings between pairs of spins can be ferromagnetic or antiferromagnetic with equal probability. Consider also a random variable ξ with E|ξ| < ∞ and a collection ξ = (ξ_i)_{i≤N} iid ∼ ξ representing random external fields acting on the spins. The Parisi formula is a representation of the large-N limit of the pressure p^SK_N defined by

$$p^{SK}_N(\beta,h) = \frac{1}{N}\log\sum_{\sigma\in\{-1,1\}^N}\exp\Big(-\beta H_N(\sigma) + h\sum_{i\le N}\xi_i\sigma_i\Big)\,. \qquad (2)$$

In the definition (2), (β, h) ∈ R_{>0} × R are fixed parameters, and the dependence on the realization of the random collections z, ξ is kept implicit. One can prove [5] that p^SK_N converges, for almost all realizations of the disorder, to its average

$$\bar p^{SK}_N(\beta,h) = \mathbb E\, p^{SK}_N(\beta,h)\,. \qquad (3)$$

Notice that E, taken after the logarithm, averages both the collections z and ξ, which are called quenched variables. The Hamiltonian (1) can also be regarded as a centered Gaussian process with covariance

$$\mathbb E\,H_N(\sigma^1)H_N(\sigma^2) = N\,q_N(\sigma^1,\sigma^2)^2\,, \qquad q_N(\sigma^1,\sigma^2) = \frac{1}{N}\sum_{i\le N}\sigma^1_i\sigma^2_i\,, \qquad (4)$$

where q_N(σ^1, σ^2) is the overlap between two spin configurations σ^1 and σ^2. The Parisi variational principle for the limiting pressure per particle of this model was proved after almost three decades of efforts, and it is mainly due to the works of Guerra [8] and Talagrand [19]. We hereby summarize these milestones in a single theorem.
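As a concrete sanity check on the definition of the quenched pressure, the quantity E p^SK_N can be estimated numerically for a small system by enumerating all 2^N configurations and averaging over disorder samples. The sketch below is purely illustrative; the variable names and the double-sum convention over all pairs (i, j) are our own choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def sk_pressure(N, beta, h, rng, n_disorder=200):
    """Monte Carlo estimate of the quenched SK pressure E p_N^SK,
    by exact enumeration of all 2^N configurations (small N only)."""
    # rows of `sigma` are all configurations in {-1,+1}^N
    sigma = np.array([[1 - 2*((m >> i) & 1) for i in range(N)]
                      for m in range(2 ** N)], dtype=float)
    vals = []
    for _ in range(n_disorder):
        z = rng.standard_normal((N, N))   # i.i.d. Gaussian couplings
        xi = rng.standard_normal(N)       # random external fields
        # -beta*H + h*sum_i xi_i sigma_i, with H a double sum over (i,j)
        energy = (beta / np.sqrt(N)) * np.einsum('ki,ij,kj->k', sigma, z, sigma) \
            + h * sigma @ xi
        m = energy.max()                  # log-sum-exp for stability
        vals.append((m + np.log(np.exp(energy - m).sum())) / N)
    return float(np.mean(vals))

p = sk_pressure(8, 0.5, 0.3, rng)   # quenched pressure estimate at small N
```

At high temperature the estimate sits close to the annealed value log 2 + β²/2 plus the field contribution, as expected from Jensen's inequality.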
Theorem 1 (Parisi Formula [8,19]). Let M_{[0,1]} be the space of probability measures on [0, 1], β > 0 and y ∈ R. Consider the Parisi functional, defined as

$$\mathcal P(\chi, y, \beta) = \log 2 + \Phi_\chi(0, y, \beta) - \frac{\beta^2}{2}\int_0^1 s\,\chi([0,s])\,ds\,, \qquad (5)$$

where Φ_χ(s, y, β) solves the PDE

$$\partial_s\Phi_\chi + \frac{\beta^2}{2}\Big(\partial^2_y\Phi_\chi + \chi([0,s])\,\big(\partial_y\Phi_\chi\big)^2\Big) = 0\,, \qquad \Phi_\chi(1, y, \beta) = \log\cosh y\,. \qquad (6)$$

The following holds:

$$\lim_{N\to\infty}\bar p^{SK}_N(\beta,h) = \inf_{\chi\in\mathcal M_{[0,1]}}\mathbb E\,\mathcal P(\chi, h\xi, \beta)\,. \qquad (7)$$

The key tool of the proof is the (Gaussian) interpolation method, introduced in [9] in order to prove the existence of the large-N limit of p̄^SK_N. The thermodynamic equilibrium induced by the pressure p̄^SK_N is called quenched equilibrium and is defined as follows. Physical quantities (e.g., energy) are functions of the disorder variables z, ξ and the spin configuration σ. Given a function f(z, ξ, σ), its equilibrium value is defined as

$$\mathbb E\langle f\rangle_N\,, \qquad (8)$$

where G_N is the (random) Boltzmann–Gibbs distribution

$$G_N(\sigma) = \frac{\exp\big(-\beta H_N(\sigma) + h\sum_{i\le N}\xi_i\sigma_i\big)}{\sum_{\tau\in\{-1,1\}^N}\exp\big(-\beta H_N(\tau) + h\sum_{i\le N}\xi_i\tau_i\big)}\,. \qquad (9)$$

The measure E⟨·⟩_N is called a quenched measure and can be viewed as a two-step measuring process. Initially, for a given realization of the disorder variables z, ξ, one assumes that the system equilibrates according to the canonical Boltzmann–Gibbs distribution G_N, which defines a (random) measure on the space of spin configurations. The expectation with respect to G_N is denoted by ⟨·⟩_N, namely

$$\langle f\rangle_N = \sum_{\sigma\in\{-1,1\}^N} f(z,\xi,\sigma)\,G_N(\sigma)\,. \qquad (10)$$

In probabilistic terms, G_N defines a conditional measure given z and ξ. The remaining degrees of freedom z, ξ are then averaged according to their a priori distribution E.
An important role is played by the concept of replicas. Replicas are i.i.d. samples from G_N at fixed disorder. Hence, the equilibrium value of a function f(z, ξ, σ^1, ..., σ^n) of n replicas and the quenched variables z, ξ is defined by

$$\mathbb E\langle f\rangle_N = \mathbb E\sum_{\sigma^1,\dots,\sigma^n} f(z,\xi,\sigma^1,\dots,\sigma^n)\prod_{l\le n}G_N(\sigma^l)\,. \qquad (11)$$

The computation of derivatives of p̄^SK_N shows, using integration by parts, that the SK model is fully characterized by the (joint) distribution of the overlap array (q_N(σ^l, σ^{l'}))_{l,l'≤n} ≡ (q_{l,l'})_{l,l'≤n}, namely the overlaps between any finite number n of replicas with respect to the measure (11). The main feature of the Parisi theory is the characterization of the mentioned joint measure by means of two structural properties: (i) it is uniquely determined by a one-dimensional marginal, namely the distribution of q_{1,2}; (ii) the distribution of three replicas has, with probability one, an ultrametric support:

$$\lim_{N\to\infty}\mathbb E\big\langle \mathbf 1\big(q_{1,2}\ge\min(q_{1,3}, q_{2,3})\big)\big\rangle = 1\,. \qquad (12)$$

Despite having a mathematical proof of the Parisi formula (7) for the SK model, (i) and (ii) have been rigorously proved only for the mixed p-spin model [6,20,21], an extension of the SK model whose Hamiltonian also contains higher-order interactions (three-body, four-body, etc.). One of the crucial instruments to achieve rigorous control of the model is the so-called Ruelle Probability Cascades (RPCs), defined by Ruelle [22] when formalizing the properties of the Generalized Random Energy Model of Derrida [23]; see also the characterization of RPCs in terms of coalescent processes given in [24]. The first direct link between RPCs and the SK model appeared in the work of Aizenman–Sims–Starr [25], where the authors found a representation of the thermodynamic limit of the quenched pressure per particle in terms of the cavity field distribution. This representation strongly suggested that if the thermodynamic limit of the overlap distribution is described by an RPC, then the Parisi formula is correct.
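Because replicas are i.i.d. draws from G_N at fixed disorder, the quenched average of the two-replica overlap reduces exactly to a sum of squared single-spin Gibbs magnetizations, ⟨q_{1,2}⟩ = (1/N)Σ_i⟨σ_i⟩². This identity can be checked by brute force on a tiny system; the sketch below assumes zero external field and our own double-sum convention for the couplings.

```python
import numpy as np

rng = np.random.default_rng(5)
N, beta = 8, 0.4

z = rng.standard_normal((N, N))   # one quenched disorder realization
sigma = np.array([[1 - 2*((m >> i) & 1) for i in range(N)]
                  for m in range(2 ** N)], dtype=float)

# Boltzmann-Gibbs weights G_N(sigma) at fixed disorder
e = (beta / np.sqrt(N)) * np.einsum('ki,ij,kj->k', sigma, z, sigma)
w = np.exp(e - e.max())
w /= w.sum()

# replicas are i.i.d. samples from G_N, hence <q_{1,2}> = (1/N) sum_i <s_i>^2
m_i = w @ sigma                         # per-spin Gibbs averages <s_i>
q12_from_marginals = float(m_i @ m_i) / N

# brute-force double sum over replica pairs as a cross-check
q = sigma @ sigma.T / N                 # overlap matrix between configurations
q12_bruteforce = float(w @ q @ w)
```

Both computations agree to machine precision, illustrating that two replicas are coupled only through the shared disorder.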
The first signal that the overlap array is described by an RPC was found by Aizenman and Contucci [10], with the identification of stochastic stability, and by Ghirlanda and Guerra [26]. Both papers exhibit an (infinite) set of identities for the moments of the overlap array distribution. It turns out that these identities actually imply that the support of the joint distribution of the overlaps is ultrametric, as proved by Panchenko [27]. It should be noted that Panchenko's theorem requires identities for the overlap moments of all orders. The latter do not hold for the bare SK model, but it can be shown that there exists a perturbation of the Hamiltonian that forces the SK model to satisfy them without affecting the limit of the quenched pressure [28].
Once the validity of the Parisi formula (7) is established, it is natural to ask about the properties of its solution. The uniqueness of the minimizer of (7) was established by Auffinger and Chen [29], and its properties have been investigated, for example, in [30,31].
A relevant question about the minimizer is the following: for which values of the parameters (β, h) is the solution of (7) a Dirac delta δ_q for some q ∈ [0, 1]? In this case, we say that the model is replica symmetric, and the Parisi formula (7) reads

$$\lim_{N\to\infty}\bar p^{SK}_N(\beta,h) = \inf_{q\in[0,1]}\Big[\log 2 + \mathbb E\log\cosh\big(\beta z\sqrt q + h\xi\big) + \frac{\beta^2}{4}(1-q)^2\Big]\,, \qquad (13)$$

with z ∼ N(0,1) independent of ξ. The replica symmetric region can be identified [6,32] with the region of parameters (β, h) where the overlap is a self-averaging quantity, namely

$$\lim_{N\to\infty}\mathbb E\big\langle\big(q_{1,2} - q^*\big)^2\big\rangle = 0\,, \qquad (14)$$

where q^* is exactly the value that realizes the infimum in (13). The physics conjecture is that the replica symmetric region is identified by the so-called Almeida–Thouless condition [33]

$$\beta^2\,\mathbb E\cosh^{-4}\big(\beta z\sqrt{q^*} + h\xi\big) \le 1\,. \qquad (15)$$

The above conjecture is proved only in the case of a Gaussian external field, ξ_i ∼ N(0,1) [34]. An alternative characterization of the replica symmetric region has been obtained in [6,35]. If the minimizer corresponds to a non-trivial distribution (i.e., with non-zero variance), we say that replica symmetry breaking occurs, and the overlap is not a self-averaging quantity.
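The RS fixed point and the Almeida–Thouless stability check are easy to evaluate numerically. The sketch below uses the standard stationarity condition of the RS formula, q = E tanh²(βz√q + h), with a deterministic field ξ ≡ 1 for simplicity; the quadrature and iteration choices are our own.

```python
import numpy as np

def rs_fixed_point(beta, h, n_gh=40, tol=1e-10):
    """Solve q = E tanh^2(beta*z*sqrt(q) + h), z ~ N(0,1), by fixed-point
    iteration; the Gaussian expectation uses Gauss-Hermite quadrature."""
    x, w = np.polynomial.hermite_e.hermegauss(n_gh)  # nodes/weights for e^{-x^2/2}
    w = w / w.sum()                                  # normalize to a probability
    q = 0.5
    for _ in range(1000):
        q_new = float(np.sum(w * np.tanh(beta * x * np.sqrt(q) + h) ** 2))
        if abs(q_new - q) < tol:
            break
        q = q_new
    return q

def at_stable(beta, h, n_gh=40):
    """Almeida-Thouless condition: beta^2 * E sech^4(beta*z*sqrt(q*) + h) < 1."""
    q = rs_fixed_point(beta, h, n_gh)
    x, w = np.polynomial.hermite_e.hermegauss(n_gh)
    w = w / w.sum()
    return beta ** 2 * float(np.sum(w * np.cosh(beta * x * np.sqrt(q) + h) ** -4)) < 1

stable_high_T = at_stable(0.5, 0.3)   # high temperature: RS expected to hold
stable_low_T = at_stable(2.0, 0.3)    # low temperature: AT condition violated
```

At β = 0.5 the bound is satisfied trivially, while at β = 2 with a small field the stability condition fails, signalling replica symmetry breaking.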
The Parisi formula has been extended to other mean-field models with centered Gaussian interactions: vector spins [36], multispecies models [11,37,38], and multiscale models [39,40]. Finally, we mention that the SK model fulfills a remarkable universality property: as long as the z_ij are independent, centered, and of unit variance, the thermodynamic limit is still described by the Parisi solution [41].
In this work, we show that a class of non-centered Gaussian spin glasses admits a high-dimensional inference interpretation that extends the celebrated correspondence between the spiked Wigner model and the SK model on the Nishimori line, where replica symmetry is always fulfilled [3]. We show that the addition of an SK Hamiltonian to a Hopfield Hamiltonian with a finite number of patterns can be mapped onto a high-dimensional mismatched inference problem, where the statistician ignores the correct a priori distribution of the signal components they have to reconstruct. We shall see that even this slight mismatch may lead to the emergence of complexity, namely to the breakdown of replica symmetry, which is instead guaranteed under very mild hypotheses for optimal statisticians.

High-Dimensional Inference and Statistical Physics
High-dimensional inference aims at recovering a ground-truth signal, denoted ξ in the following, which is usually a vector with a very large number of components, from noisy observations of it, denoted by Y. The main feature of this setting is that the dimension of the signal, i.e., the number of real parameters to reconstruct, and the number of available observations are functions of one another, typically polynomial. For instance, for our purposes, ξ will be a vector in R^N and Y an N × N matrix, for a total of N² noisy observations. Hence, if the number of observations grows, so does the number of parameters to retrieve. Contrary to what happens in typical low-dimensional settings, where maximum likelihood or Maximum A Posteriori (MAP) approaches yield provably satisfactory reconstruction performances, in a high-dimensional setting this is not always the case. In particular, one needs more refined estimators that exploit the marginal posterior probabilities of the single signal components.
Both approaches described above are Bayesian, and the knowledge of a prior distribution on the signal components can play a key role, especially in high-dimensional problems. Furthermore, to compose the posterior measure of the entire signal, one needs the likelihood of the data, namely the probability of an outcome y of the variable Y given a certain ground-truth realization ξ = x. As we shall discuss shortly, under certain hypotheses, the Bayesian approach highlights a correspondence between relevant information-theoretic quantities and thermodynamic ones. Among others, a key quantity is the mutual information between the signal ξ and the observations Y, which quantifies the residual amount of information left in Y about ξ after the noise corruption. As intuition may suggest, the mutual information gives access to the best reconstruction error that is information-theoretically achievable.
Finally, we stress that the high dimensionality of the problem can induce phase transitions in some parameters of the model, such as the so-called signal-to-noise ratio (SNR), which tunes the strength of the signal with respect to that of the noise in the observations.

Bayes-Optimality and Nishimori Identities
For the sake of simplicity, we start by considering a signal ξ = (ξ_i)_{i≤N} ∈ R^N with i.i.d. (independent and identically distributed) components ξ_i iid∼ P_ξ, where P_ξ has a finite fourth moment. The observations at the statistician's disposal can be modeled as a stochastic function of the ground-truth signal: Y = F(ξ; z), where z is the source of randomness, or simply the noise. From a Bayesian perspective, knowing the function F translates directly into having the likelihood of the model, namely the conditional distribution dP_{Y|ξ=x}(y) = p_{Y|ξ=x}(y) dy, which we assume to have a density p_{Y|ξ=x}(y) with respect to the Lebesgue measure. Observe that the likelihood is strongly affected by the nature of the noise.
According to Bayes' rule, the posterior distribution of ξ given the data is

$$dP_{\xi|Y=y}(x) = \frac{p_{Y|\xi=x}(y)\,dP_\xi(x)}{Z(y)}\,, \qquad Z(y) = \int p_{Y|\xi=x}(y)\,dP_\xi(x)\,, \qquad (16)$$

where dP_ξ(x) = ∏_{i≤N} dP_ξ(x_i), and Z(y) is the probability of the given realization of the data, sometimes also called the evidence. In practice, the above posterior, which would be ideal to perform inference, is rarely available: the statistician may not be aware of the likelihood, of the correct prior distribution of the signal, or of both. This motivates the following definition of a special inference setting.

Definition 1 (Bayes optimality). The statistician is said to be Bayes optimal, or in the Bayes-optimal setting, if they are aware of both P_ξ and F(·; z); namely, they have access to the posterior (16).
The above is saying that an optimal statistician knows everything about the model except the ground truth ξ itself. The Bayes-optimal setting is thus often used as a theoretical framework to establish information-theoretic limits. Indeed, it is known that the mean square error between the ground truth and an estimator ξ̂(y) is minimized by an optimal statistician who uses the posterior mean as estimator, yielding the minimum mean square error (MMSE):

$$\mathrm{MMSE}_N := \frac1N\,\mathbb E\,\big\|\xi - \mathbb E[\xi\,|\,Y]\big\|^2 = \min_{\hat\xi}\,\frac1N\,\mathbb E\,\big\|\xi - \hat\xi(Y)\big\|^2\,. \qquad (17)$$

In the following, we shall denote averages with respect to the posterior by ⟨·⟩_Y.
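The superiority of the posterior mean over a MAP-type estimator is easy to see in a scalar Gaussian channel with a Rademacher prior, where the posterior mean is a tanh of the rescaled observation. The following Monte Carlo sketch is illustrative only; the channel, parameter names, and SNR value are our own choices.

```python
import numpy as np

rng = np.random.default_rng(1)
snr, n = 1.0, 200_000

xi = rng.choice([-1.0, 1.0], size=n)              # Rademacher ground truth
y = np.sqrt(snr) * xi + rng.standard_normal(n)    # Gaussian observations

# posterior mean for a Rademacher prior: E[xi | y] = tanh(sqrt(snr) * y)
posterior_mean = np.tanh(np.sqrt(snr) * y)
map_estimate = np.sign(y)                         # MAP (sign) estimator

mse_mean = float(np.mean((xi - posterior_mean) ** 2))
mse_map = float(np.mean((xi - map_estimate) ** 2))
```

The posterior-mean error sits strictly below the MAP error at any finite SNR, in line with the optimality of the conditional expectation in (17).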
Another important consequence of this setting is the so-called Nishimori identities, which can be stated as follows. Given any continuous bounded function f of the data Y, the ground truth ξ, and n − 1 i.i.d. samples (x^(k))_{k=2}^n from the posterior, one has

$$\mathbb E\big\langle f\big(Y,\xi,x^{(2)},\dots,x^{(n)}\big)\big\rangle_Y = \mathbb E\big\langle f\big(Y,x^{(1)},x^{(2)},\dots,x^{(n)}\big)\big\rangle_Y\,, \qquad (18)$$

where x^(1) ∼ P_{ξ|Y}. An elementary proof can be found in [42]. These identities enforce a symmetry between replicas drawn from the posterior and the ground truth. For instance, a direct application of the Nishimori identities yields

$$\mathbb E\big[\xi\cdot\langle x\rangle_Y\big] = \mathbb E\,\big\|\langle x\rangle_Y\big\|^2\,, \qquad\text{whence}\qquad \mathrm{MMSE}_N = \frac1N\Big(\mathbb E\,\|\xi\|^2 - \mathbb E\,\|\langle x\rangle_Y\|^2\Big)\,. \qquad (19)$$

It is important to stress that, as can be seen from the above equation, an optimal statistician is actually able to compute the minimum mean square error using their posterior. At this point, the reader will have noticed a similarity with the statistical mechanics formalism. In fact, it is possible to interpret Z(y) as the partition function of a model with Hamiltonian −log p_{Y|ξ=x}(y) and unit inverse absolute temperature. The pressure per particle of such a model is then

$$p_N := \frac1N\,\mathbb E\log Z(Y) = -\frac{H(Y)}{N}\,, \qquad (20)$$

namely minus the Shannon entropy of the data per signal component, which is related to the mutual information

$$\frac{I(\xi;Y)}{N} = \frac{H(Y) - H(Y\,|\,\xi)}{N}\,. \qquad (21)$$

The contribution coming from the conditional entropy H(Y | ξ) can be regarded as due to the noise alone since, for fixed ξ, the only randomness in Y is due to z.
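A direct consequence of the Nishimori identities — that the overlap of the posterior mean with the ground truth equals the expected squared posterior mean — can be checked numerically in a scalar Gaussian channel, where both sides reduce to moments of tanh. This is an illustrative sketch; the channel and sample sizes are our own choices.

```python
import numpy as np

rng = np.random.default_rng(2)
snr, n = 1.0, 1_000_000

xi = rng.choice([-1.0, 1.0], size=n)              # Rademacher ground truth
y = np.sqrt(snr) * xi + rng.standard_normal(n)    # Gaussian observations
post_mean = np.tanh(np.sqrt(snr) * y)             # <x>_Y for a Rademacher prior

lhs = float(np.mean(post_mean * xi))   # overlap of ground truth with one replica
rhs = float(np.mean(post_mean ** 2))   # overlap between two independent replicas
```

Up to Monte Carlo error, the two averages coincide: the ground truth is statistically indistinguishable from an extra replica drawn from the posterior.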
We stress here that Bayes optimality, via the Nishimori identities, is enough, under rather mild hypotheses [43], to grant replica symmetry in the model, i.e., the concentration of its order parameters. For the models we are interested in, the latter can be shown to imply finite-dimensional variational principles for the limiting mutual information.

The Spiked Wigner Model
The spiked Wigner model (SWM) was first introduced in [44] as a model for Principal Component Analysis (PCA) and has since been widely studied in the recent literature. Without any pretension of exhaustiveness, we refer the interested reader to [42,45–51]. For our purposes, we restrict ourselves to the case where the signal is an N-dimensional vector of ±1s, drawn from a Rademacher distribution, ξ_i iid∼ Unif{−1, +1}, and the observations take the form

$$Y = \sqrt{\frac{\mu}{N}}\,\xi\xi^\intercal + z\,,$$

where z_ij iid∼ N(0,1), and μ is a positive parameter called the signal-to-noise ratio. The statistician is tasked with the recovery of ξ given the observations Y. The Bayes-optimal posterior measure for this inference problem can be written directly as a Boltzmann–Gibbs random measure thanks to the Gaussian nature of the likelihood:

$$dP_{\xi|Y}(\sigma) = \frac{2^{-N}}{Z(Y)}\exp\Big(\sqrt{\frac{\mu}{N}}\sum_{i,j\le N}Y_{ij}\,\sigma_i\sigma_j\Big)\,,$$

where we have already exploited the fact that (ξ_i)² = σ_i² = 1, and we denote the posterior samples by σ. Since the quantity we are interested in is the quenched pressure of this model, which is connected to the mutual information I_N(Y; ξ)/N by a simple additive shift, we are allowed to perform a gauge transformation, z_ij → z_ij ξ_i ξ_j and σ_i → σ_i ξ_i, without altering its value. This results in a Hamiltonian that is independent of the original ground-truth signal,

$$-H_N(\sigma) = \sum_{i,j\le N}\Big(\sqrt{\frac{\mu}{N}}\,z_{ij} + \frac{\mu}{N}\Big)\sigma_i\sigma_j\,, \qquad (27)$$

whose couplings between spins are Gaussian random variables with mean equal to their variance. This condition identifies a peculiar region of the phase space of a spin glass model, called the Nishimori line. In fact, the Nishimori identities were first discovered and studied in the context of gauge spin glasses. Despite looking simpler, the above model retains most of the features we need for our study.
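The gauge-transformation argument can be verified exactly on a small instance: transforming z_ij → ξ_iξ_jz_ij and mapping the spike to the all-ones configuration leaves log Z unchanged realization by realization. The sketch below assumes the double-sum convention over all (i, j) used in this section; names and sizes are our own.

```python
import numpy as np

rng = np.random.default_rng(3)
N, mu = 8, 1.5

xi = rng.choice([-1.0, 1.0], size=N)   # Rademacher spike
z = rng.standard_normal((N, N))        # Wigner-type noise

sigma = np.array([[1 - 2*((m >> i) & 1) for i in range(N)]
                  for m in range(2 ** N)], dtype=float)

def log_partition(J):
    """log sum_sigma exp(sum_{ij} J_ij s_i s_j) over all 2^N configurations."""
    e = np.einsum('ki,ij,kj->k', sigma, J, sigma)
    m = e.max()
    return float(m + np.log(np.exp(e - m).sum()))

# posterior couplings built from the observation Y = sqrt(mu/N) xi xi^T + z
J_posterior = (mu / N) * np.outer(xi, xi) + np.sqrt(mu / N) * z
# gauged couplings: z_ij -> xi_i xi_j z_ij, spike mapped to the all-ones vector
J_gauged = (mu / N) * np.ones((N, N)) + np.sqrt(mu / N) * (np.outer(xi, xi) * z)

lhs = log_partition(J_posterior)
rhs = log_partition(J_gauged)
```

The two log-partition functions agree to machine precision, because the change of variables σ_i → σ_iξ_i is a bijection of {−1, +1}^N.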
For inference models with additive Gaussian noise, such as the one above, it is possible to prove the so-called I-MMSE relation:

$$\frac{1}{N}\frac{dI_N(Y;\xi)}{d\mu} = \frac{1}{4N^2}\,\mathbb E\,\big\|\xi\xi^\intercal - \langle\sigma\sigma^\intercal\rangle\big\|_F^2\,, \qquad (28)$$

where ∥·∥_F is the Frobenius norm and ⟨·⟩ denotes the expectation with respect to the Boltzmann–Gibbs measure induced by (27). Hence, once the mutual information is known, the MMSE can be accessed through a derivative with respect to the signal-to-noise ratio. A clarification is in order here: the above is the MMSE for the reconstruction of the rank-one matrix ξξ^⊺ because, due to flip symmetry, there is no actual information on the single vector ξ, but only on the spike ξξ^⊺.

Sub-Optimality and Replica Symmetry Breaking
There are several ways to break Bayes optimality. For example: the statistician does not know the signal-to-noise ratio μ [13,52]; the statistician adopts a likelihood different from that of the true model [14]; the statistician adopts a wrong prior [12,53]; combinations of the previous, and many others. We focus on the case of mismatched priors, where the statistician not only adopts a wrong prior on the ground-truth elements, but is also unaware of the rank M of the spiked matrix hidden inside the noise; the rest is assumed to be known. The channel of the inference problem is

$$Y = \sqrt{\frac{\mu}{N}}\sum_{k\le M}\xi^{(k)}\xi^{(k)\intercal} + z\,. \qquad (29)$$

If the statistician assumes a Rademacher prior for the signal components and a rank-one hidden matrix, they will write a posterior of the form

$$d\slashed P(\sigma\,|\,Y) \propto \exp\big(-\slashed H_N(\sigma;z,\xi)\big)\,, \qquad (30)$$

where

$$-\slashed H_N(\sigma;z,\xi) = \sum_{i,j\le N}\Big(\sqrt{\frac{\mu}{N}}\,z_{ij} + \frac{\mu}{N}\sum_{k\le M}\xi_i^{(k)}\xi_j^{(k)}\Big)\sigma_i\sigma_j\,. \qquad (31)$$

The slash on these quantities emphasizes that they are not the Bayes-optimal ones. In this setting, one can no longer rely on the Nishimori identities and, in principle, replica symmetry is no longer guaranteed. On the contrary, as we shall argue later on, a mismatch in the prior alone is already sufficient to cause replica symmetry breaking.

The Model
Let M be a fixed integer and k ∈ {1, ..., M}. Consider two independent random collections (z_ij)_{i,j≤N} iid∼ N(0,1) and (ξ_i^(k))_{i≤N, k≤M} iid∼ P_ξ. The above random collections play the role of quenched disorder in the model. Consider N Ising spins σ = (σ_i)_{i≤N} ∈ {+1, −1}^N and the Hamiltonian function

$$-H_N(\sigma) = \sum_{i,j\le N}\Big(\sqrt{\frac{\mu}{N}}\,z_{ij} + \frac{\nu}{N}\sum_{k\le M}\xi_i^{(k)}\xi_j^{(k)}\Big)\sigma_i\sigma_j + \sum_{i\le N}\sum_{k\le M}\lambda_k\,\xi_i^{(k)}\sigma_i\,. \qquad (32)$$

Here, the first sum is the interacting part H^int_N, while the second sum denotes the random external field acting on the spins. The Hamiltonian (32) is determined by the choice of M, μ, ν, λ, and P_ξ. For μ = ν, the interaction term H^int_N coincides with the Hamiltonian (31). Note that for some special choices of the parameters we recover well-known spin glass models:
• ν = 0 gives the SK model (1) at β = √μ with a random external field;
• μ = 0 gives the Hopfield model [6,7,18] with a finite number of patterns (ξ^(k))_{k≤M};
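For small N, the pressure of the combined SK-plus-Hopfield Hamiltonian can be computed by exact enumeration. Since the Hopfield part contributes a non-negative quantity proportional to the squared Mattis magnetizations, switching it on can only increase the pressure, realization by realization. The sketch below assumes the normalization conventions used in this section, with Rademacher patterns chosen for concreteness.

```python
import numpy as np

rng = np.random.default_rng(4)
N, M = 8, 2

z = rng.standard_normal((N, N))              # SK-type disorder
xi = rng.choice([-1.0, 1.0], size=(N, M))    # M patterns (Rademacher here)

sigma = np.array([[1 - 2*((m >> i) & 1) for i in range(N)]
                  for m in range(2 ** N)], dtype=float)

def pressure(mu, nu, lam):
    """(1/N) log sum_sigma exp(-H_N) for one disorder realization, with
    -H_N = sum_ij (sqrt(mu/N) z_ij + (nu/N) sum_k xi_i^k xi_j^k) s_i s_j
           + sum_i sum_k lam_k xi_i^k s_i."""
    J = np.sqrt(mu / N) * z + (nu / N) * (xi @ xi.T)
    e = np.einsum('ki,ij,kj->k', sigma, J, sigma) + sigma @ (xi @ lam)
    m = e.max()
    return float(m + np.log(np.exp(e - m).sum())) / N

lam = np.zeros(M)
p_full = pressure(1.0, 0.5, lam)   # SK plus Hopfield interaction
p_sk = pressure(1.0, 0.0, lam)     # nu = 0: pure SK part, same z
```

With the same noise realization, the Hopfield term adds (ν/N)Σ_k(σ·ξ^(k))² ≥ 0 to −H_N, so p_full dominates p_sk configuration by configuration.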
• μ = ν, with P_ξ the Rademacher distribution, gives the SK model on the Nishimori line (27). As we have seen in Section 3, the latter can also be viewed as a spiked Wigner model in the Bayes-optimal setting.
Notice that the entire model can be interpreted as a Hopfield model where the traditional Hebbian coupling is corrupted by additive Gaussian noise. Furthermore, if the Hebbian coupling is replaced by a constant matrix, the model reduces to an SK model with the addition of a ferromagnetic interaction, which was studied in [54].
Our main result is the computation of the thermodynamic limit of the pressure per particle

$$p_N(\mu,\nu,\lambda) = \frac1N\log\sum_{\sigma\in\{+1,-1\}^N}\exp\big(-H_N(\sigma)\big)\,,$$

whose variance can be shown to converge to 0 as O(N^{-1}), namely:

$$\mathbb V\big(p_N(\mu,\nu,\lambda)\big) \le \frac{K}{N}\,,$$

where K is a suitable positive constant.
We thus focus on p̄_N(μ, ν, λ) = E p_N(μ, ν, λ). The proof of this lemma makes use of the Efron–Stein concentration inequality to bound the variance; it is simple but tedious and follows closely that of ([12], Lemma 9). We are now in a position to state our main theorem.

Theorem 2. For all μ, ν ≥ 0 and λ ∈ R^M,

$$\lim_{N\to\infty}\bar p_N(\mu,\nu,\lambda) = \sup_{x\in\mathbb R^M}\Big\{-\nu\sum_{k\le M}x_k^2 + \inf_{\chi\in\mathcal M_{[0,1]}}\mathbb E\,\mathcal P\big(\chi,\,h(x,\xi),\,\sqrt{\mu}\big)\Big\}\,, \qquad (36)$$

where

$$h(x,\xi) = \sum_{k\le M}\big(\lambda_k + 2\nu x_k\big)\,\xi^{(k)}\,,$$

P is the Parisi functional (5) with a random external field, and E denotes the expectation with respect to ξ^(k) iid∼ P_ξ. The consistency equations are

$$x_k = \lim_{N\to\infty}\mathbb E\langle m_k\rangle_N\,, \qquad k\le M\,.$$

Moreover, there exists C > 0 such that, for any k ≤ M, one has |x_k| ≤ C at the optimum, and the supremum in (36) can be restricted to the compact set {x ∈ R^M : |x_k| ≤ C for all k ≤ M}.

The proof of the theorem is based on the concentration of the Mattis magnetization, namely the normalized scalar product between a spin configuration (or a sample from the wrong posterior measure) and one of the patterns ξ^(k):

$$m_k(\sigma) = \frac1N\sum_{i\le N}\xi_i^{(k)}\sigma_i\,. \qquad (40)$$

The Hamiltonian can thus be rewritten using (40) in the following form:

$$-H_N(\sigma) = \sqrt{\frac{\mu}{N}}\sum_{i,j\le N}z_{ij}\sigma_i\sigma_j + N\nu\sum_{k\le M}m_k^2(\sigma) + N\sum_{k\le M}\lambda_k\,m_k(\sigma)\,. \qquad (41)$$

The Mattis magnetization, in fact, plays the role of an order parameter for this model. The concentration we can prove is only in an integral average over some suitably small magnetic fields, which is still sufficient for our purposes.

Proposition 1. Let s_N = N^{-α} with α ∈ (0, 1/(2M)), and fix k ≤ M. For any y ∈ R, denote by ⟨·⟩_{N,y} the Boltzmann–Gibbs measure induced by the Hamiltonian H_{N,y}(σ) = H_N(σ) − y σ·ξ^(k). Then

$$\frac{1}{s_N}\int_{s_N}^{2s_N}\mathbb E\big\langle\big(m_k - \mathbb E\langle m_k\rangle_{N,y}\big)^2\big\rangle_{N,y}\,dy \xrightarrow[N\to\infty]{} 0$$

for all μ, ν ≥ 0 and λ ∈ R^M.
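To illustrate the mechanism behind the consistency equations, consider the Hopfield part alone for M = 1 with Rademacher patterns and the per-spin normalization ν m² used above. The quadratic term linearizes as ν m² = sup_x (2νx m − νx²), and in the simplest (replica-symmetric, zero-noise) reduction the optimal x solves x = tanh(2νx). This is a hypothetical toy reduction for illustration, not the full consistency system of Theorem 2.

```python
import numpy as np

def hopfield_rs_magnetization(nu, tol=1e-12):
    """Fixed point of x = tanh(2*nu*x), arising from the linearization
    nu*m^2 = sup_x (2*nu*x*m - nu*x^2), which is optimal at x = m."""
    x = 0.9
    for _ in range(10_000):
        x_new = float(np.tanh(2 * nu * x))
        if abs(x_new - x) < tol:
            break
        x = x_new
    return x

x_low = hopfield_rs_magnetization(0.3)   # below the transition (2*nu < 1): x = 0
x_high = hopfield_rs_magnetization(1.0)  # above the transition: nonzero x
```

The nontrivial solution appears exactly when 2ν > 1, the Curie–Weiss-type bifurcation of the self-consistency condition.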
We shall omit the proof of the above result, as it is completely analogous to the one in [12]. We will need an intermediate lemma that leads to it (see Lemma 2 later), together with a second key ingredient: the adaptive interpolation technique [48], combined with Guerra's replica symmetry-breaking upper bound for the quenched pressure of the SK model [8].
Proof of Theorem 2 (outline). Here, we outline the main steps of the proof of the variational principle for the thermodynamic limit. The proof is achieved via two bounds that match in the N → ∞ limit. Let us start by defining the interpolating Hamiltonian

$$-H_{N,\epsilon}(t,\sigma) = \sqrt{\frac{\mu}{N}}\sum_{i,j\le N}z_{ij}\sigma_i\sigma_j + (1-t)\,N\nu\sum_{k\le M}m_k^2(\sigma) + N\sum_{k\le M}\big(\lambda_k + R^{(k)}_\epsilon(t)\big)\,m_k(\sigma)\,, \qquad (43)$$

where ϵ ∈ [s_N, 2s_N]^M with s_N = N^{-α}, α ∈ (0, 1/(2M)), and where the interpolating functions

$$R^{(k)}_\epsilon(t) = \epsilon_k + \int_0^t r^{(k)}_\epsilon(s)\,ds\,,$$

with r^(k)_ϵ continuously differentiable on [0, 1] and non-negative, will be suitably chosen. With this interpolation, one is able to prove the following sum rule.

Proposition 2. The following sum rule holds:

$$\bar p_N(\mu,\nu,\lambda) = \bar p^{SK}_N\big(\sqrt{\mu},\,\lambda + R_\epsilon(1)\big) - \int_0^1\sum_{k\le M}\frac{\big(r^{(k)}_\epsilon(t)\big)^2}{4\nu}\,dt + \int_0^1\sum_{k\le M}\Delta^{(k)}_\epsilon(t)\,dt + O(s_N)\,,$$

where

$$\Delta^{(k)}_\epsilon(t) = \nu\,\mathbb E\Big\langle\Big(m_k - \frac{r^{(k)}_\epsilon(t)}{2\nu}\Big)^2\Big\rangle_{N,t;\epsilon}\,,$$

and p̄^SK_N(√μ, λ + R_ϵ(1)) denotes the quenched SK pressure at inverse temperature √μ with random external field Σ_{k≤M}(λ_k + R^(k)_ϵ(1)) ξ^(k).

The proof consists of the computation of the derivative of the interpolating pressure associated with the model (43) and follows closely that of ([12], Proposition 7), to which we refer the interested reader. Since the remainder Δ^(k)_ϵ is non-negative, the above proposition already yields a lower bound for the quenched pressure of our model when we choose the constant functions r^(k)_ϵ(t) = 2νx_k, where we use the Lipschitz continuity of the SK pressure in the magnetic fields to remove the ϵ-dependence.
The upper bound requires more attention. First, we notice that p̄^SK_N is convex in the magnetic fields and that the interpolating fields satisfy R^(k)_ϵ(t) ≥ ϵ_k ≥ 0. Hence, we can use Jensen's inequality and the Lipschitz continuity of p̄^SK to obtain an upper bound, provided the remainder E_ϵ ∫_0^1 Σ_{k≤M} Δ^(k)_ϵ(t) dt can be pushed to 0 by a proper choice of the interpolating functions R_ϵ. The choice is made through a system of coupled ODEs,

$$\dot R^{(k)}_\epsilon(t) = 2\nu\,\mathbb E\big\langle m_k\big\rangle_{N,t;\epsilon}\,, \qquad R^{(k)}_\epsilon(0) = \epsilon_k\,, \qquad k\le M\,.$$

One can easily check that the above system is regular enough to admit a unique solution on the interval t ∈ [0, 1]. Now, we use Guerra's bound for the SK pressure, which, importantly, is uniform in N, and we average over ϵ on both sides. With this choice, the remainder to push to 0 appears as

$$\mathbb E_\epsilon\int_0^1\sum_{k\le M}\Delta^{(k)}_\epsilon(t)\,dt = \nu\,\mathbb E_\epsilon\int_0^1\sum_{k\le M}\mathbb E\big\langle\big(m_k - \mathbb E\langle m_k\rangle_{N,t;\epsilon}\big)^2\big\rangle_{N,t;\epsilon}\,dt\,, \qquad (51)$$

where E_ϵ denotes the uniform average of ϵ over [s_N, 2s_N]^M. The goal is now to apply a concentration lemma of the type of Proposition 1 to (51), with K a positive constant in the resulting bound.
Notice that the integral in (51) is over ϵ, and not over the effective magnetic field of the model, which is instead R_ϵ(t). Nevertheless, we can integrate over the magnetic fields R_ϵ(t) with a change of variables, which involves a Jacobian larger than 1. In fact, thanks to Liouville's theorem ([55], Corollary 3.1, Chapter V), one can prove that

$$\frac{\partial R^{(k)}_\epsilon(t)}{\partial\epsilon_k} \ge 1 \qquad (52)$$

when ν ≥ 0. This allows us to bound the thermal fluctuations in (51) using (52) and then Liouville's theorem. Since ξ_i has a bounded second moment, the thermal fluctuations can be shown to vanish using the Cauchy–Schwarz inequality (cf. (44) and (50)). The fluctuations induced by the disorder can be bounded in a very similar fashion using (53). Hence, overall, (51) vanishes as N → ∞; the parameter δ can be chosen as a function of N in order to optimize the convergence rate, δ = s_N^{2M/3} N^{-1/3}. Using Fubini's theorem in (49) to exchange the t and ϵ averages, and then dominated convergence, one concludes the proof.
From the variational problem (36), we can also deduce the differentiability properties of the limiting pressure, obtaining the average values of the relevant thermodynamic quantities of the model; for instance,

$$\lim_{N\to\infty}\mathbb E\langle m_k\rangle_N = x_k^*\,, \qquad \lim_{N\to\infty}\mathbb E\langle q_{1,2}\rangle_N = \int_0^1 q\,d\chi^*(q)\,,$$

where χ* denotes the unique measure solving the Parisi variational principle of Theorem 1 at the optimal parameters, and x* the optimizer of (36).

Conclusions and Perspectives
In this paper, we offer an overview of the Parisi formula from a mathematical physics perspective, emphasizing its potential applications, particularly in addressing the mismatched inference problem outlined earlier. Building upon our previous work [12], we investigate a scenario where a statistician, tasked with reconstructing a finite-rank matrix, lacks knowledge about the underlying matrix generation process, including both the matrix elements and the rank. We consider the case in which the statistician assumes a rank-one matrix, leading to a mismatch between the "true" Bayes posterior and the one used for inference. Our key contribution is the proof that, contrary to what happens in the Bayes-optimal setting, this Bayesian mismatch induces replica symmetry breaking in the model. Consequently, we express the pressure of the corresponding spin glass as an infinite-dimensional variational principle over the space of distributions on [0, 1].
The chosen mismatch scenario shares some similarities with those studied in [57,58], with the fundamental difference that here the rank of the hidden matrix is finite. In a recent work [59], the authors consider a general case of mismatch, which includes mismatched priors and likelihoods. That paper proves a universality property with respect to the likelihood assumed by the statistician, provided that the observations remain independent given the ground truth. Despite these advancements, all the proofs available so far in the literature break down when considering a high-rank hidden matrix. To rigorously comprehend this scenario, addressing the solution of the Hopfield model is of crucial importance; however, to the best of our knowledge, its complete solution remains elusive [5,6].

