Signal detection in nearly continuous spectra and symmetry breaking

The large-scale behavior of systems with a large number of interacting degrees of freedom is suitably described by the renormalization group, starting from non-Gaussian distributions. Renormalization group techniques used in physics are therefore expected to be helpful where standard methods of data analysis break down. Signal detection and recognition for covariance matrices with nearly continuous spectra is currently an open issue in data science and machine learning. Using the field theoretical embedding introduced in arXiv:2011.02376 to reproduce experimental correlations, we show in this paper that the presence of a signal may be characterized by a phase transition with $\mathbb{Z}_2$-symmetry breaking. For our investigations, we use the nonperturbative renormalization group formalism, with a local potential approximation to construct an approximate solution of the flow. Moreover, we focus on a nearly continuous signal built as a perturbation of the Marchenko-Pastur law with many discrete spikes.


I. INTRODUCTION
The renormalization group (RG) is one of the most important discoveries of twentieth-century physics. It is a general idea rather than a specific law of nature, aiming to extract the relevant features of statistical or quantum states; its modern conception is due to Kadanoff and Wilson [1]-[2]. Introduced in statistical physics, it is in particular the most powerful concept to explain the universality of large-distance physics for systems involving a very large number of interacting degrees of freedom, without requiring a complete description of these fundamental degrees of freedom. The RG explains the universality and efficiency of effective descriptions of physical laws through a progressive dilution of information under coarse-graining, the effects of which are absorbed into the running parameters defining the effective theory [3]-[4]. The most universal formalization of the RG is based on the existence of an intrinsic hierarchy of degrees of freedom, such that we can progressively ignore some of them, "integrated out" into a less fundamental effective description of the remaining ones. For this reason, the RG is particularly relevant in many-body physics, for all problems involving a very large number of interacting degrees of freedom. In physics, this hierarchy is intrinsically related to the notion of scale, and the RG aims to construct a large-scale effective theory by integrating out microscopic degrees of freedom in such a way as to preserve long-distance physics; see [5]-[13]. More generally, the Kadanoff-Wilson idea is the statement that the best way to study a subset of the degrees of freedom of a large system is to integrate out the remaining ones. Standard incarnations of the RG take the form of a flow in the formal space of Hamiltonians (log-likelihoods in probability theory), describing a sequence of distributions having the same long-distance physics.
* vincent.lahoche@cea.fr † dine.ousmanesamary@cipma.uac.bj ‡ mohamed.tamaazousti@cea.fr
Data analysis and machine learning aim to extract relevant features from data sets of very large dimension. This is in particular the case within the big data paradigm. Principal component analysis (PCA) [14]-[27] looks for a linear projection onto a lower-dimensional space keeping only the relevant features, which is exactly what the RG aims to do. Thus, the RG is expected to be a relevant and competitive alternative to standard PCA. In this paper, we focus especially on a problem where standard PCA fails to provide a clean separation between "what is relevant" and "what we can ignore". This is, in particular, the case of nearly continuous spectra, as Figure 1 shows. For such a spectrum standard PCA does not work suitably, and the RG is expected to provide a distinction between signal and noise. This can be achieved through a field theoretical embedding, as considered in [14], from an analogy with what happens in standard field theory. The number of relevant terms in the Hamiltonian, spanning the distinguishable distributions at large scales, depends on the dimension of space $d$, and as such on the momentum distribution $\rho(p^2) = (p^2)^{d/2-1}$. The relevance of the couplings involved in the Hamiltonian thus depends on the momentum distribution. From this basic observation, it seems reasonable to investigate the RG flow associated with the eigenvalue distribution of the covariance matrix, through a suitable field theoretical embedding able to reproduce (at least partially) the data correlations and extract relevant features of the distributions. Note that such a strategy follows the current point of view on field theories, understood as effective descriptions at large scales of some partially understood microscopic physics [14]-[16]. In this way, a signal could be differentiated from noise by a simple comparison of the equivalence classes generated by the relevant couplings.
Note moreover that such a strategy does not allow one, a priori, to infer effective properties of the data. In this paper, we only claim to build an effective theory in the same class of long-range equivalence as the "true" theory; see Figure 2. This point of view was the one developed in [14]-[16]. In these papers, the authors were able to characterize the presence of a signal, and to estimate the breaking point between signal and noise, from the fact that the first non-Gaussian perturbation, which is relevant for a purely Marchenko-Pastur (MP) distribution, becomes irrelevant for a sufficiently strong signal. In this paper, we focus precisely on the asymptotic infrared (IR) aspects attached to the signal, and we show that a phase transition, corresponding to a breaking of the reflection symmetry, can be associated with it. We moreover justify the existence of an intrinsic detection threshold and show how this threshold could be used to construct a functional detection algorithm. Finally, we mention some open questions.

II. FRAMEWORK
We consider a data set described by a large $N \times P$ matrix $X_{ia}$, for $i = 1, 2, \cdots, N$ and $a = 1, 2, \cdots, P$. We assume $N, P \gg 1$ but $P/N < 1$. The covariance matrix $C$ is the $N \times N$ matrix with entries $C_{ij} = \sum_{a=1}^{P} X_{ia} X_{ja}$. Moreover, when the entries of $X$ are purely i.i.d., the eigenvalue distribution of the matrix $C_{ij}/N$ converges weakly toward the MP law [42]. Figure 1 provides a typical spectrum for $P = 1500$ and $N = 2000$. We denote by $\mu_{\text{exp}}(\lambda)$ the eigenvalue distribution.
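As a rough numerical illustration of the pure-noise case, the following Python sketch draws i.i.d. entries and compares the resulting spectrum with the edges of the MP support. It is only a sketch under assumptions not stated in the text: the entries are taken to be standard normal, the function names `mp_edges` and `noise_spectrum` are hypothetical, and the $P \times P$ Gram matrix $X^T X / N$ is diagonalized instead of the $N \times N$ matrix $C/N$, since the two share their non-zero eigenvalues.

```python
import numpy as np

def mp_edges(q, sigma2=1.0):
    """Edges of the MP support for aspect ratio q = P/N <= 1 and variance sigma2."""
    lo = sigma2 * (1.0 - np.sqrt(q)) ** 2
    hi = sigma2 * (1.0 + np.sqrt(q)) ** 2
    return lo, hi

def noise_spectrum(N=2000, P=1500, seed=0):
    """Eigenvalues of X^T X / N for an N x P matrix of i.i.d. N(0, 1) entries."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((N, P))   # assumption: Gaussian entries, unit variance
    gram = X.T @ X / N                # P x P, same non-zero spectrum as X X^T / N
    return np.linalg.eigvalsh(gram)   # returned in ascending order

if __name__ == "__main__":
    ev = noise_spectrum()
    lo, hi = mp_edges(1500 / 2000)
    print(f"empirical range: [{ev.min():.3f}, {ev.max():.3f}]")
    print(f"MP edges:        [{lo:.3f}, {hi:.3f}]")
```

For $P/N = 0.75$ and unit variance the upper MP edge is $(1 + \sqrt{0.75})^2 \approx 3.48$, which is the scale against which isolated eigenvalues would be judged in a standard PCA analysis.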
In [14]-[15] we introduced a field theoretical embedding aiming to reproduce data correlations. The framework describes a nearly continuous random field $\phi(p) \in \mathbb{R}$, the variables $p$ being defined such that $p^2$ is an eigenvalue of the covariance matrix translated from its largest eigenvalue $\lambda_0$. The field is equipped with a probability density $p[\phi] := e^{-S[\phi]}$, the functional Hamiltonian $S$ being defined by Eq. (1). For $g = 0$, the model is purely Gaussian, with 2-point correlation functions $\langle \phi(p)\phi(p') \rangle = (p^2 + m^2)^{-1} \delta_{p,-p'}$, where $\delta$ is the Kronecker delta and the notation $\langle X[\phi] \rangle$ denotes the mean value of the quantity $X$ with respect to the probability measure $e^{-S[\phi]} \prod_p d\phi(p)$. In that case, we reproduce exactly the experimental 2-point correlations given by the eigenvalues of the covariance matrix if, firstly, $m^2 = 1/\lambda_0$ and, secondly, the momenta $p$ are such that $p^2$ is distributed following the eigenvalue distribution of the covariance matrix. We denote by $\rho(p^2)$ this distribution, inferred from the knowledge of $\mu_{\text{exp}}(\lambda)$; the integration measure for the variable $p$ reads $\rho(p^2)\, p\, dp$ [14]-[15].
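One way to read the two matching conditions above is to assign a momentum to each eigenvalue $\lambda$ by equating the Gaussian propagator $(p^2 + m^2)^{-1}$ with $\lambda$, which with $m^2 = 1/\lambda_0$ gives $p^2 = 1/\lambda - 1/\lambda_0$. The Python sketch below implements this reading. The identification and the helper name `momenta_from_spectrum` are assumptions made for illustration; [14]-[15] should be consulted for the authors' construction.

```python
import numpy as np

def momenta_from_spectrum(eigenvalues):
    """Map strictly positive eigenvalues to squared momenta p^2 and a mass m^2.

    Assumed reading: 1 / (p^2 + m^2) = lambda with m^2 = 1 / lambda_0,
    where lambda_0 is the largest eigenvalue.
    """
    lam = np.asarray(eigenvalues, dtype=float)
    lam0 = lam.max()
    m2 = 1.0 / lam0
    p2 = 1.0 / lam - m2   # non-negative, and zero for the largest eigenvalue
    return p2, m2

if __name__ == "__main__":
    lam = np.array([0.5, 1.0, 2.0, 4.0])
    p2, m2 = momenta_from_spectrum(lam)
    print(p2, m2)
    print(1.0 / (p2 + m2))   # recovers the input eigenvalues
```

With this convention the largest eigenvalues sit at small $p^2$, so the region near the upper edge of the spectrum plays the role of the IR.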
The existence of $n$-point correlation functions that cannot be decomposed as a product of 2-point functions according to Wick's theorem requires removing the condition $g = 0$. The functional $U[\phi]$ is assumed to be a conservative and $\mathbb{Z}_2$-invariant polynomial in $\phi$, of the form given in Eq. (2). It is conservative in the usual sense in field theory, meaning that momenta are conserved at each vertex. The choice of these interactions and of the reflection symmetry $\phi \to -\phi$ is made for simplicity. Indeed, we aim to construct only an approximation, and to extract some relevant features of the momentum distributions able to discriminate between data and noise. See [25] for more details.
The RG flow can be constructed from the standard Wilson-Kadanoff procedure, partially integrating over modes with high momenta (ultraviolet (UV) modes). In such a field theoretical framework, it is suitable to use the functional renormalization group (FRG) to construct approximate solutions of the RG flow beyond perturbation theory [5]-[13]. The FRG is based on the effective Hamiltonian for the integrated modes, below some scale $k$, rather than on Hamiltonians for the remaining, not yet integrated modes above the scale $k$ (infrared (IR) modes). The effective Hamiltonian for the integrated degrees of freedom is denoted $\Gamma_k[M]$ and obeys the first-order differential equation (3). In this equation:
• $r_k(p^2)$, the regulator, plays the role of an effective mass depending on both the momentum and the infrared cut-off $k$. It vanishes for high momenta with respect to $k$ ($p^2/k^2 \gg 1$), whereas low-momentum modes are frozen and decouple from long-distance physics. Moreover, $r_k(p^2)$ vanishes for $k = 0$, ensuring that all the modes are integrated out.
• The effective averaged Hamiltonian $\Gamma_k[M]$ is defined from a slightly modified version of the Legendre transform of the free energy $W_k[j]$, Eq. (4), where $\Delta S_k[\phi] := \frac{1}{2} \sum_p \phi(p)\, r_k(p^2)\, \phi(-p)$. The free energy $W_k[j]$ is the generating functional of cumulants, $W_k[j] := \ln \left\langle \exp\left( \sum_p j(-p)\phi(p) + \Delta S_k[\phi] \right) \right\rangle$. This definition ensures that $\Gamma_k$ reduces to the microscopic Hamiltonian $S$ in the deep UV ($k^2 \gg 1$), where $r_k(p^2)$ is expected to be of order $k^2$. Moreover, for $k = 0$, $r_k(p^2)$ vanishes and $\Gamma_k$ reduces formally to the full effective Hamiltonian $\Gamma$, with all modes integrated out.
• The notation $\Gamma_k^{(2)}$ denotes the second derivative with respect to $M$, the classical field, defined in Eq. (5).
The exact flow equation (3) works in an infinite-dimensional space of functions and cannot be solved exactly in general. A standard method to construct approximate solutions is to truncate the flow to a finite-dimensional subspace, assumed to be relevant on physical grounds. In this paper we focus on the local potential approximation (LPA), assuming that the non-quadratic part of $\Gamma_k$ is spanned by local interactions of the form (2). For the quadratic part, we use the standard derivative expansion (DE), keeping only couplings of order $p^2$, Eq. (6). Working in the IR region and following the standard LPA assumptions, we project the flow equation onto a constant classical field, neglecting its momentum dependence: $M(p) = M \delta_{p0}$. It is suitable to include the term of order $(p^2)^0$ in the non-quadratic part. Denoting it by $U_k$, we assume the expansion (7) around a non-vanishing vacuum $\kappa$ for a constant classical field, where $\chi = M^2/2$. Although formally no dependence on the regulator is expected in the infrared limit $k = 0$, the truncation procedure may introduce a spurious dependence on the regulator [43]. To keep control of these spurious effects, we focus on the well-known Litim regulator, Eq. (8), which has been proved to be optimal [11]-[12] and is widely used in the literature [5]; in Eq. (8), $\theta$ denotes the Heaviside step function.
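The properties required of the regulator in the list above can be checked on the Litim form as it is commonly written, $r_k(p^2) = (k^2 - p^2)\,\theta(k^2 - p^2)$. The Python sketch below assumes this standard expression, without any field-strength factor; the function name `litim_regulator` is hypothetical.

```python
import numpy as np

def litim_regulator(p2, k):
    """Litim-type regulator (k^2 - p^2) * theta(k^2 - p^2), evaluated elementwise."""
    p2 = np.asarray(p2, dtype=float)
    return np.where(p2 < k**2, k**2 - p2, 0.0)

if __name__ == "__main__":
    p2 = np.array([0.0, 0.5, 1.0, 2.0, 10.0])
    print(litim_regulator(p2, k=1.0))        # non-zero only for p^2 < k^2
    print(p2 + litim_regulator(p2, k=1.0))   # regularized kinetic term
    print(litim_regulator(p2, k=0.0))        # regulator switched off at k = 0
```

For $p^2 < k^2$ the combination $p^2 + r_k(p^2)$ equals $k^2$, so low-momentum modes acquire an effective mass of order $k^2$, while modes with $p^2 > k^2$ are left untouched.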
The flow equation for the potential $U_k$, Eq. (9), can be deduced from equation (3) by setting the classical field to a constant. It is suitable to introduce the flow parameter $\tau := \ln \int_0^k p\, \rho(p^2)\, dp$ rather than $k$. Moreover, from the interpretation of the parameter $u_2$ as the asymptotic effective mass, it is suitable to assume the scaling $u_2 \sim k^2$. In this way, we are able to define a canonical dimension for all the couplings. In standard field theory, this canonical dimension allows one to convert the RG equations into an autonomous system. This is not true here, because the shape of the momentum distribution is not invariant under RG transformations. However, it is suitable to define the dimension such that the only source of explicit scale dependence is the linear term in the flow equation. From this requirement, one defines dimensionless quantities, denoted with a "bar", as in Eq. (10), leading to Eq. (11), with the notation $\dot{X} := dX/d\tau$. We deliberately only sketch the discussion of the dimension here; some details may be found in [14]-[15]. From (11) and (10), it is suitable to introduce the definition (12). The flow equation for the "dimensionless" parameter, Eq. (13), follows, with the quantities defined in Eqs. (14) and (15). The flow equations for the couplings $\kappa$ and $u_{2n}$ may finally be deduced from the condition (16).
Figure 3. Canonical dimensions for $\phi^4$ (blue curve), $\phi^6$ (orange curve), $\phi^8$ (green curve) and $\phi^{10}$ (red curve), associated with the pure MP law (purple curve) with variance equal to 1.
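Reading the flow parameter as $\tau = \ln \int_0^k p\, \rho(p^2)\, dp$, it can be estimated directly from a sample of squared momenta, since $p\, dp = d(p^2)/2$ turns the integral into half of the cumulative mass of $\rho$ below $k^2$. The Python sketch below does this; the normalization of $\rho$ to unit mass and the function name `flow_parameter` are assumptions made for illustration.

```python
import numpy as np

def flow_parameter(p2_samples, k):
    """Empirical tau(k) = ln( integral_0^k p rho(p^2) dp ).

    Assumes rho is normalized to unit mass and represented by the sample
    p2_samples. Returns -inf (with a NumPy warning) if no sample lies below k^2.
    """
    p2 = np.asarray(p2_samples, dtype=float)
    mass = np.mean(p2 <= k**2)    # integral of rho(u) du over [0, k^2]
    return np.log(0.5 * mass)     # p dp = d(p^2) / 2

if __name__ == "__main__":
    # Flat density in p^2 on [0, 1], i.e. the d = 2 power law rho = (p^2)^0.
    p2 = np.linspace(0.0, 1.0, 10001)
    for k in (0.25, 0.5, 1.0):
        print(k, flow_parameter(p2, k))
```

For a power law $\rho(p^2) = (p^2)^{d/2-1}$ this definition gives $\tau = d \ln k$ up to an additive constant, so it reduces to the usual logarithmic flow parameter in standard field theory.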
The corresponding flow equation can be deduced following the same strategy. Figure 3 shows the canonical dimensions for the first local interactions with the pure MP law. This picture shows the existence of two regions. For the last third of the spectrum, only two couplings are relevant, the sextic one being asymptotically marginal in accordance with power counting (the MP law behaving as $\rho(p^2) \sim (p^2)^{1/2}$ for small $p$). In contrast, for the first two thirds of the distribution, the number of relevant interactions may be very large. As discussed in [14], standard methods in field theory do not work suitably in such a case. One should expect the field theoretical approach to be relevant only for the last third of the spectrum, which we call the learnable region.
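The power counting invoked above can be made explicit in the textbook case of a scalar field theory in $d$ dimensions, where $\rho(p^2) = (p^2)^{d/2-1}$ and, with $u_2 \sim k^2$, the coupling of $\phi^{2n}$ has canonical dimension $d - n(d-2)$. The Python sketch below applies this textbook formula to the small-$p$ exponent of the MP law. It is an illustration under that assumption only and does not reproduce the scale-dependent dimensions shown in Figure 3; the function names are hypothetical.

```python
def effective_dimension(alpha):
    """Dimension d such that rho(p^2) ~ (p^2)^alpha, i.e. alpha = d/2 - 1."""
    return 2.0 * (alpha + 1.0)

def canonical_dimension(n, d):
    """Textbook canonical dimension of the phi^(2n) coupling in d dimensions."""
    return d - n * (d - 2.0)

if __name__ == "__main__":
    d = effective_dimension(0.5)   # MP law near the edge: rho ~ (p^2)^(1/2)
    print("effective dimension:", d)
    for n in range(1, 6):
        dim = canonical_dimension(n, d)
        kind = "relevant" if dim > 0 else ("marginal" if dim == 0 else "irrelevant")
        print(f"phi^{2 * n}: dimension {dim:+.1f} ({kind})")
```

With the exponent $1/2$ the effective dimension is $d = 3$, so in this counting the quartic coupling has positive dimension, the sextic coupling is marginal, and $\phi^8$ and higher interactions are irrelevant.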

III. $\mathbb{Z}_2$-SYMMETRY BREAKING AND SIGNAL DETECTION
Besides these analytic considerations, we provide in this section a first look at a numerical investigation of a more realistic signal, as illustrated in Figure 1. In our experiments, we focus on the distribution of the eigenvalues for two types of covariance matrices in the high-dimensional regime (typically, we consider $P = 1500$ and $N = 2000$, which gives $K = P/N = 0.75$). First, we consider the covariance matrix associated with i.i.d. random entries. The eigenvalue distribution of such a matrix converges, for large $P$ and $N$, to the MP law, which we interpret as data composed entirely of noise. Second, we consider a covariance matrix corresponding to a perturbation of the pure-noise case, obtained by adding a matrix of rank $R = 65$ (defining the size of the signal). In our experiments we fix the variance to one and $K = 0.75$. For such a spectrum, the learnable region is expected to lie between $\sim 2.5$ and $\sim 3.4$, where $\phi^4$ and $\phi^6$ are expected to be the only relevant interactions (this information can be obtained from the study of the canonical dimensions, as illustrated in Figure 3).
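A perturbed spectrum of the kind described above can be sketched in Python as follows. How the rank-$R$ matrix is built is not specified in this section, so everything about the signal here is an assumption made for illustration: it is taken as a product of two Gaussian factors, added to the data matrix with a hypothetical strength parameter `beta`, and the function name `spiked_spectrum` is likewise hypothetical.

```python
import numpy as np

def spiked_spectrum(N=2000, P=1500, R=65, beta=0.5, seed=0):
    """Eigenvalues of X^T X / N for X = noise + beta * (rank-R signal)."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal((N, P))     # pure-noise part, unit variance
    left = rng.standard_normal((N, R))
    right = rng.standard_normal((R, P))
    signal = left @ right / np.sqrt(R)      # rank at most R, unit-variance entries
    X = noise + beta * signal               # assumption: additive perturbation
    return np.linalg.eigvalsh(X.T @ X / N)  # ascending order

if __name__ == "__main__":
    edge = (1.0 + np.sqrt(1500 / 2000)) ** 2
    for beta in (0.0, 0.2, 0.5):
        ev = spiked_spectrum(beta=beta)
        print(f"beta={beta}: max eigenvalue {ev.max():.2f}, "
              f"{int((ev > edge).sum())} eigenvalues above the MP edge {edge:.2f}")
```

Varying `beta` moves the spikes relative to the MP bulk, which gives a simple way to generate spectra of different signal intensity for experimentation.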
To start, and following [14]-[15], we focus on the simpler version of the derivative expansion (DE), expanding the effective potential $U_k$ in powers of $m := M/N$. The derivation of the corresponding flow equations follows the same strategy as for (17), (18) and (19); see [14]-[15]. In Figure 4, we illustrate different viewpoints of the 3D compact region $R_0$ in the vicinity of the Gaussian fixed point where the RG trajectories obtained from the DE end in the symmetric phase, and are thus compatible with a symmetry restoration scenario for initial conditions corresponding to an explicit symmetry breaking. However, not all of these initial conditions are expected to be physically relevant in the deep IR. Indeed, for scales $k^2 \sim 1/N$, one expects to obtain a good approximation of the exact covariance matrix. By construction, this requires $u_2$ to reach a finite value, of the order of the inverse of the largest eigenvalue of the spectrum. In turn, this requires the dimensionless parameter $\bar{u}_2$ to be of order $N$.
The initial conditions compatible with this requirement are pictured in blue in the figure. In Figure 5 we show the same region $R_0$ using the LPA and equations (17), (18) and (19). This region is compact as well, and shrinks when we increase the intensity of the signal. Finally, we illustrate in Figure 6 how the (deep) IR potential changes according to the intensity of the signal. When the signal is weak the RG trajectories end in the symmetric phase; conversely, they stay in the non-symmetric phase when the signal is strong, providing explicit evidence of the relation between signal and symmetry breaking in the deep IR region.
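To make the distinction between the symmetric and the non-symmetric phase concrete, the Python sketch below locates the global minimum of a generic $\mathbb{Z}_2$-invariant sextic potential. The parametrization $U(M) = u_2 M^2/2 + u_4 M^4/4! + u_6 M^6/6!$, the grid search, and the function name `vacuum` are assumptions chosen for illustration; they are not the truncation or the flow equations used to produce Figures 4 to 6.

```python
import numpy as np

def vacuum(u2, u4, u6, m_max=10.0, n=200001):
    """Global minimum M >= 0 of U(M) = u2 M^2/2 + u4 M^4/24 + u6 M^6/720 on a grid.

    A result of 0 means the symmetric phase; a non-zero result means the
    Z_2 symmetry M -> -M is broken (the mirror minimum sits at -M).
    """
    M = np.linspace(0.0, m_max, n)
    U = u2 * M**2 / 2.0 + u4 * M**4 / 24.0 + u6 * M**6 / 720.0
    return M[np.argmin(U)]

if __name__ == "__main__":
    print(vacuum(1.0, 1.0, 1.0))     # all couplings positive: symmetric phase
    print(vacuum(-1.0, 1.0, 0.0))    # negative mass term: broken phase
    print(vacuum(0.1, -2.0, 10.0))   # negative quartic term: broken phase
```

The third call illustrates that, with $u_4 < 0$, a non-zero minimum can coexist with $u_2 > 0$, which is the textbook mechanism for a first-order transition; with $u_4 > 0$ the minimum moves continuously away from zero as $u_2$ changes sign.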

IV. CONCLUSION
In this paper, we investigated the RG of an effective field theory able to reproduce IR correlations, at least partially, in the learnable region, where both local $\phi^4$ and $\phi^6$ interactions are relevant. Focusing on local interactions, we constructed approximate solutions of the exact RG equation (3), using the standard DE and LPA. Some extended discussions can be found in [14]-[15], especially regarding the role of the anomalous dimension, which does not change our conclusions. Among the properties of the effective IR theories, we focused on the vacuum expectation value. We showed the existence of a nearly compact region $R_0$ in the vicinity of the Gaussian fixed point where the $\mathbb{Z}_2$-symmetry is always restored in the deep IR for purely noisy signals well described by the MP law. Furthermore, we observed that the size of this region $R_0$ is reduced when a signal deforms the asymptotic MP spectrum. This implies that some trajectories ending in the symmetric phase for pure noise end in a broken phase, with $\langle \phi \rangle \neq 0$, when the signal is added. Moreover, among the initial conditions allowed by the region $R_0$, only a subset is physically relevant, i.e. such that the inverse of the final mass $u_2$ is of the same magnitude as the expected largest eigenvalue of the (continuous part of the) spectrum. Thus, as soon as the deformation of the region $R_0$ reaches this physical subregion, some physically relevant trajectories are affected and leave the symmetric phase in the deep IR. This observation exhibits the existence of an intrinsic sensitivity threshold for signal detection based on the asymptotic vacuum expectation value.
This observation suggests a detection algorithm based on the existence of a phase transition in the deep IR region. This, however, remains an objective under investigation. In this study we focus on synthetic data, for which the notions of noise and signal are well understood, essentially in order to keep control of the perturbation; but we plan to investigate this framework on real data. Other questions concern the phase transition, which appears to be either first or second order, depending on how the power counting for the $\phi^6$ coupling is affected. The nature of the transition could be linked to a finer detection criterion. Finally, other questions concern the approach itself. Investigations of the more UV regions of the spectrum, for example, might require methods beyond the standard DE. The validity of the field theory approximation could also be questioned in the UV. All these questions are the subject of ongoing investigations.
To conclude, although our findings rely on a definition of noise based on the MP law, we plan to explore different mathematical incarnations of noisy signals, in different contexts, to confirm the universal character of our conclusions. Our investigations are thus also continuing for the Wigner distribution [41], as well as on more exotic dis-