Adaptive Importance Sampling for Equivariant Group-Convolution Computation

Abstract: This paper introduces an adaptive importance sampling scheme for the computation of group-based convolutions, a key step in the implementation of equivariant neural networks. By leveraging information geometry to define the parameter update rule for inferring the optimal sampling distribution, we show promising results for our approach when working with the two-dimensional rotation group SO(2) and von Mises distributions. Finally, we position our AIS scheme with respect to quantum algorithms for computing Monte Carlo estimations.


Introduction and Motivations
Geometric deep learning [1] is an emerging field gaining increasing traction thanks to its successful application to a wide range of domains [2][3][4]. In this context, equivariant neural networks (ENN) [5] have been shown to outperform conventional deep learning approaches in both accuracy and robustness, and appear as a natural alternative to data augmentation techniques [6,7] for achieving geometrical robustness.
One key bottleneck for scaling ENN to industrial applications lies in the numerical computation of the associated equivariant operators. More precisely, two main approaches have been used in the literature, namely a Monte Carlo sampling method [2] (which can be made exhaustive for small finite groups) and a generalized Fourier-based method [4,8,9]. However, these approaches suffer from scalability issues as the complexity of the underlying group increases (e.g., handling non-compact groups such as $SU(1,1)$ or large finite groups such as the symmetric group $S_n$ is challenging). Even for groups such as $SO(2)$, for which previous works on spherical harmonics can be leveraged, the efficient computation of a reliable estimate of the convolution remains a challenge in terms of convergence. In this context, the authors of [10] have proposed an efficient method for building adequate kernel functions to be used within steerable neural networks [11] by leveraging the knowledge of the infinitesimal generators of the considered Lie group and a Krylov approach for solving the linear constraints. We propose in this paper to cover the specific case of group-convolutional neural networks (G-CNN) [2,12], which rely in particular on the computation of group-based convolution operators. By leveraging information geometry as proposed in [13] for quantile estimation, we introduce here an adaptive importance sampling (AIS) variance reduction method based on information geometric optimization [14] to improve the convergence of Monte Carlo estimators for the numerical computation of group-based convolution feature maps, as used in several recent works [2,9,15]. We illustrate our approach on the two-dimensional rotation group $SO(2)$ using von Mises distributions [16] as the parametric sampling family, a set-up for which the Fisher information metric [17] can be computed using closed-form formulas.
Finally, we shed some light on the benefits of working toward a quantum version of our proposed AIS scheme in order to reach a quadratic speed-up [18]. Improving quantum Monte Carlo integration schemes is indeed a very active topic of research [19], mainly driven by applications within the financial industry [20]. Benchmarking with group-Fourier transform-based approaches, such as [21], which are more theoretically involved but which promise an exponential speed-up, will be of particular interest in this context.

Group Convolution and Expectation
We consider in the following a compact group $G$ with corresponding Haar measure $\mu_G$. As $\mu_G(G) < \infty$, we can choose $\mu_G$ so that $\int_G \mathrm{d}\mu_G = 1$ by using an adequate normalization.
We are interested in evaluating the group-based convolution operator $\psi_G$ defined below for functionals $f, k : G \to \mathbb{R}$ and $g \in G$:
$$\psi_G(g) = (k \ast f)(g) = \int_G k(h^{-1}g)\, f(h)\, \mathrm{d}\mu_G(h). \quad (1)$$
Using a probabilistic interpretation of (1), we can write
$$\psi_G(g) = \mathbb{E}\left[k(H^{-1}g)\, f(H)\right], \quad (2)$$
where $H$ is a $G$-valued random variable distributed according to $\mu_G$. The convolution can therefore be estimated with a Monte Carlo method by using the following estimator
$$\hat{\psi}_G^n(g) = \frac{1}{n} \sum_{i=1}^{n} k(h_i^{-1}g)\, f(h_i), \quad (3)$$
where $h_i \sim \mu_G$, and for which the efficiency can be improved through variance reduction techniques [22]. Building on [13], we describe in the following an adaptive importance sampling approach for the computation of (1). Similar ideas were also used in [23] for financial applications.
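For illustration purposes, here is a minimal Python sketch of the plain estimator (3) on $G = SO(2)$, where the group operation is addition of angles modulo $2\pi$, inversion is negation, and the normalized Haar measure is uniform. The functions f and k below are illustrative placeholders of our own (f mirrors the von Mises-type feature function used in the experiments later in the paper).

```python
import numpy as np

rng = np.random.default_rng(0)

def f(alpha):                       # example feature map on SO(2)
    return np.exp(3.0 * np.cos(alpha - np.pi / 2))

def k(alpha):                       # example kernel on SO(2)
    return np.cos(alpha)

def mc_convolution(g, n=10_000):
    """Estimate psi_G(g) = E[k(H^{-1} g) f(H)] with H ~ Haar (uniform)."""
    h = rng.uniform(0.0, 2 * np.pi, size=n)        # h_i ~ mu_G
    return np.mean(k((g - h) % (2 * np.pi)) * f(h))

print(mc_convolution(np.pi / 3))
```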

Adaptive Importance Sampling
We consider in the following a set $\Phi_\Theta$ of parametric probability density functions on $G$, where $\Theta$ represents the parameter space. Each density $\varphi_\theta \in \Phi_\Theta$ is assumed to be absolutely continuous with respect to the Haar measure $\mu_G$ of the group $G$, so that the corresponding probability measure can be written as $\mathrm{d}\mu_\theta = \varphi_\theta\, \mathrm{d}\mu_G$ and the Radon-Nikodym derivative $\omega_\theta = \mathrm{d}\mu_G / \mathrm{d}\mu_\theta = 1/\varphi_\theta$ can be considered. Using the conventional importance sampling approach, we can then write:
$$\psi_G(g) = \mathbb{E}_{\mu_\theta}\left[k(H^{-1}g)\, \omega_\theta(H)\, f(H)\right], \quad H \sim \mu_\theta.$$
The idea is then to choose a measure $\mu_{\theta^*}$ for which $\theta^*$ minimizes the variance $v_{k,f,g}$ of the random variable $k(H^{-1}g)\, \omega_\theta(H)\, f(H)$, which can be written as
$$v_{k,f,g}(\theta) = m^2_{k,f,g}(\theta) - \psi_G(g)^2, \quad \text{where} \quad m^2_{k,f,g}(\theta) = \mathbb{E}_{\mu_\theta}\left[\left(k(H^{-1}g)\, \omega_\theta(H)\, f(H)\right)^2\right].$$
Since $\psi_G(g)$ does not depend on $\theta$, minimizing the variance amounts to minimizing the second moment $m^2_{k,f,g}(\theta)$.
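As a concrete example, a minimal sketch of the reweighted estimator with a fixed von Mises proposal on $SO(2)$, reusing f, k, and rng from the previous sketch. With the Haar measure normalized as in Section 2, the von Mises density with respect to Haar is $e^{\kappa\cos(\alpha-\mu)}/I_0(\kappa)$, so the weight $\omega_\theta$ below is its reciprocal; the parameter values are our own choices.

```python
from scipy.stats import vonmises
from scipy.special import i0

def is_convolution(g, mu, kappa, n=10_000):
    """Importance-sampling estimate of psi_G(g) with a von Mises proposal."""
    h = vonmises.rvs(kappa, loc=mu, size=n, random_state=rng)  # h_i ~ mu_theta
    omega = i0(kappa) * np.exp(-kappa * np.cos(h - mu))        # Radon-Nikodym weight
    return np.mean(k((g - h) % (2 * np.pi)) * omega * f(h))

print(is_convolution(np.pi / 3, mu=np.pi / 2, kappa=2.0))
```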

Monte Carlo Estimator and Convergence
We assume that we can construct a sequence of parameters $(\theta_i)_{i=0}^{n-1}$, together with realizations $(h_i)_{i=1}^{n}$ of the random variables $(H_i)_{i=1}^{n}$ such that $H_i \sim \mu_{\theta_{i-1}}$ and that $\theta_n \to \theta^* \in \Theta$ as $n \to \infty$. We can then consider the following Monte Carlo estimator:
$$\hat{\psi}_G^n(g) = \frac{1}{n} \sum_{i=1}^{n} k(h_i^{-1}g)\, \omega_{\theta_{i-1}}(h_i)\, f(h_i). \quad (10)$$
Under usual integrability conditions, Theorem 3.1 of [13] states that $\hat{\psi}_G^n(g) \to \psi_G(g)$ almost surely as $n \to \infty$. Furthermore, we have the following distributional convergence result,
$$\sqrt{n}\left(\hat{\psi}_G^n(g) - \psi_G(g)\right) \xrightarrow{d} \mathcal{N}\left(0, \sigma^2\right),$$
where $\mathcal{N}(0, \sigma^2)$ refers to the Gaussian distribution with $0$ mean and variance $\sigma^2$.
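A minimal skeleton of the adaptive estimator (10) on $SO(2)$, again reusing the definitions from the previous sketches: at step $i$, a sample is drawn from the current proposal $\mu_{\theta_{i-1}}$, the weighted term is accumulated, and the parameter is then updated (the update rule itself is detailed in the next section; here it is an abstract callback).

```python
def ais_convolution(g, theta0, update, n=5_000):
    """Adaptive importance-sampling estimator (10) with von Mises proposals."""
    mu, kappa = theta0
    total = 0.0
    for i in range(1, n + 1):
        h = vonmises.rvs(kappa, loc=mu, random_state=rng)   # h_i ~ mu_{theta_{i-1}}
        w = i0(kappa) * np.exp(-kappa * np.cos(h - mu))     # omega_{theta_{i-1}}(h_i)
        total += k((g - h) % (2 * np.pi)) * w * f(h)
        mu, kappa = update((mu, kappa), h, g)               # theta_i from theta_{i-1}, h_i
    return total / n
```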

Natural Gradient Descent
We now discuss how to build the sequence of parameters $(\theta_i)_{i=0}^{n-1}$ and corresponding realizations $(h_i)_{i=1}^{n}$ introduced in Section 3.1, reminding ourselves that we have
$$m^2_{k,f,g}(\theta) = \mathbb{E}_{\mu_\theta}\left[\left(k(H^{-1}g)\, \omega_\theta(H)\, f(H)\right)^2\right] = \int_G k(h^{-1}g)^2\, f(h)^2\, \omega_\theta(h)\, \mathrm{d}\mu_G(h).$$
Assuming that the parameter space $\Theta \subseteq \mathbb{R}^m$ is a smooth manifold, we can consider the Fisher information metric $g$ on the density space $\Phi_\Theta$, which is defined as follows [17]:
$$g_{ij}(\theta) = \mathbb{E}_{\mu_\theta}\left[\frac{\partial \log \varphi_\theta(H)}{\partial \theta_i}\, \frac{\partial \log \varphi_\theta(H)}{\partial \theta_j}\right].$$
We then propose using a natural gradient descent strategy to minimize the quantity $m^2_{k,f,g}$, namely
$$\theta_{k+1} = \theta_k - \alpha_k\, F_k^{-1}\, \nabla_\theta\, m^2_{k,f,g}(\theta_k),$$
where $F_k$ is the Fisher information matrix, i.e., the representation of the Fisher metric as an $m \times m$ matrix, and $\alpha_k \in \mathbb{R}_+^*$. Assuming that the considered functions are smooth enough, it is possible to write:
$$\nabla_\theta\, m^2_{k,f,g}(\theta) = -\mathbb{E}_{\mu_\theta}\left[\left(k(H^{-1}g)\, f(H)\, \omega_\theta(H)\right)^2\, \nabla_\theta \log \varphi_\theta(H)\right].$$
Using a stochastic approximation scheme such as the Robbins-Monro algorithm [24] then leads to consider the following update rule,
$$\theta_i = \theta_{i-1} + \alpha_i\, F(\theta_{i-1})^{-1} \left(k(h_i^{-1}g)\, f(h_i)\, \omega_{\theta_{i-1}}(h_i)\right)^2 \nabla_\theta \log \varphi_\theta(h_i)\Big|_{\theta = \theta_{i-1}}. \quad (18)$$
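A sketch of one Robbins-Monro natural-gradient step for the von Mises family, following our reconstruction of (18). The Fisher matrix of this family is diagonal with the closed-form entries $\kappa A(\kappa)$ and $A'(\kappa) = 1 - A(\kappa)/\kappa - A(\kappa)^2$, where $A = I_1/I_0$; the step size and the positivity clipping of $\kappa$ are our own illustrative choices.

```python
from scipy.special import i0, i1

def fisher_vonmises(kappa):
    """Fisher information matrix of the von Mises family at (mu, kappa)."""
    A = i1(kappa) / i0(kappa)
    return np.diag([kappa * A, 1.0 - A / kappa - A**2])

def update(theta, h, g, step=0.01):
    """One stochastic natural-gradient step per our reconstruction of (18)."""
    mu, kappa = theta
    A = i1(kappa) / i0(kappa)
    omega = i0(kappa) * np.exp(-kappa * np.cos(h - mu))
    payoff2 = (k((g - h) % (2 * np.pi)) * f(h) * omega) ** 2
    score = np.array([kappa * np.sin(h - mu), np.cos(h - mu) - A])  # grad log phi
    nat_grad = np.linalg.solve(fisher_vonmises(kappa), payoff2 * score)
    mu, kappa = np.array([mu, kappa]) + step * nat_grad
    return mu % (2 * np.pi), max(kappa, 1e-3)   # keep kappa strictly positive
```

Passing this `update` to the `ais_convolution` skeleton of the previous section yields a full adaptive estimator.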

About IGO Algorithms
Information geometric optimization (IGO) algorithms are introduced in [14] as a unified framework to solve black-box optimization problems. IGO algorithms can be seen as estimating a distribution over the considered search space $X$ that yields small values of the target function $Q$ when sampled from. More precisely, the idea is to maintain at each iteration $t$ a parametric probability distribution $P_{\lambda^t}$ on the search space $X$, for $\lambda^t \in \Lambda \subseteq \mathbb{R}^p$, and to have the value $\lambda^t$ evolve over time so as to shift $P_{\lambda^t}$ toward giving more weight to points $x \in X$ associated with a lower value of $Q$.
The IGO algorithms described in [14] first transfer the function $Q$ from $X$ to $\Lambda$ by using an adaptive quantile-based approach and then apply a natural gradient descent by leveraging the Fisher information metric of the considered statistical model. The scheme described in Definition 5 of [14] leads to the following update rule for the parameter $\lambda^t$:
$$\lambda^{t+\delta t} = \lambda^t + \delta t\, I^{-1}(\lambda^t) \sum_{i=1}^{N} \hat{w}_i\, \nabla_\lambda \log P_\lambda(x_{i:N})\Big|_{\lambda = \lambda^t}, \quad (19)$$
where $I$ is the Fisher information matrix of the model, $x_1, \ldots, x_N$ are $N$ samples drawn according to $P_{\lambda^t}$ at step $t$, $x_{i:N}$ denotes the sample point ranked $i$-th according to $Q$ (i.e., $Q(x_{1:N}) < \ldots < Q(x_{N:N})$) and $\hat{w}_i = \frac{1}{N}\, w\!\left(\frac{i}{N}\right)$, with $w(q) = \mathbf{1}_{q < q_0}$ a quantile-based selection function of threshold $q_0$.
IGO algorithms could therefore be used in our context by setting $Q = m^2_{k,f,g}$ and $X = \Theta$ to infer the optimal value $\theta^* \in \Theta$. However, implementing the update rule (19) requires a priori a large number of evaluations of $m^2_{k,f,g}(\theta)$ to derive the sorted samples $x_{i:N}$, making this approach generally not well suited to our context.
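To make rule (19) concrete, here is a toy IGO-style step where $P_\lambda$ is an isotropic Gaussian $\mathcal{N}(m, s^2 I)$ with fixed scale $s$, for which the natural gradient with respect to the mean takes a particularly simple form. All names, the equal-weight normalization of the selected samples, and the constants are our own illustrative choices, not the settings of [14].

```python
import numpy as np

rng = np.random.default_rng(0)

def igo_step(Q, m, s=0.3, N=50, q0=0.25, dt=0.5):
    """One IGO-style step for P_lambda = N(m, s^2 I), fixed s: rank the N
    samples by Q, keep the best q0-fraction with equal weights, and move the
    mean along the natural gradient (here simply x - m, since the Fisher
    matrix for the mean of N(m, s^2 I) is I / s^2)."""
    x = m + s * rng.standard_normal((N, m.size))     # x_j ~ P_lambda_t
    order = np.argsort([Q(xj) for xj in x])          # ranks according to Q
    w_hat = np.where(np.arange(N) < q0 * N, 1.0 / (q0 * N), 0.0)
    return m + dt * np.sum(w_hat[:, None] * (x[order] - m), axis=0)
```

In our setting one would take $Q = m^2_{k,f,g}$ over $X = \Theta$, which is precisely where the cost of the $N$ evaluations of $Q$ per step becomes prohibitive.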

Application to SO(2)-Convolutions
We give here an application of our AIS approach for the computation of $SO(2)$-convolutions by using von Mises densities [16] for the weighting. This type of computation is in particular relevant when working with $SE(2)$-ENN by exploiting the semi-direct product structure $SE(2) = \mathbb{R}^2 \rtimes SO(2)$, as performed in [3].

Numerical Experiments
To numerically validate our approach, we have considered von Mises-type feature functions $f_{\kappa_0,\mu_0} : \alpha \mapsto e^{\kappa_0 \cos(\alpha - \mu_0)}$ and kernel functions $k : [0, 2\pi] \to \mathbb{R}$ modeled as small fully connected neural networks with one hidden layer of 128 neurons, ReLU activations, and uniform random weight initialization. To run our testing, we have used $\kappa_0 = 3$ and $\mu_0 = \pi/2$. Figure 1 shows the comparison between the results obtained with the estimator (10) using the adaptive importance sampling scheme and those obtained with the conventional estimator (3). We can in particular see that the adaptive importance sampling scheme converges faster to the theoretical value (here computed by using (3) with $n = 50{,}000$ and displayed in black in Figure 1), while providing much narrower confidence intervals (because of lower variance) than the conventional Monte Carlo estimator. Figure 2 shows the evolution of the parameter $\theta = (\mu, \kappa)$ as we iterate through the update rule (18), from which we can also observe a fast convergence.
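A sketch of this experimental setting, consistent with the description above: the von Mises-type feature function with $\kappa_0 = 3$, $\mu_0 = \pi/2$, and a kernel given by a tiny random MLP (1 input, 128 hidden units, 1 output, ReLU). Details beyond those stated in the text, such as the exact initialization range, are illustrative assumptions.

```python
kappa0, mu0 = 3.0, np.pi / 2
f_exp = lambda a: np.exp(kappa0 * np.cos(a - mu0))   # feature function

# Random 1 -> 128 -> 1 MLP with ReLU; uniform weight initialization (range assumed).
W1 = rng.uniform(-1, 1, size=(128, 1)); b1 = rng.uniform(-1, 1, size=128)
W2 = rng.uniform(-1, 1, size=(1, 128)); b2 = rng.uniform(-1, 1, size=1)

def k_exp(a):
    """Kernel on [0, 2*pi] given by the random MLP; accepts scalars or arrays."""
    hdn = np.maximum(0.0, W1 @ np.atleast_2d(a) + b1[:, None])   # ReLU layer
    return (W2 @ hdn + b2).ravel()
```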

Extension to SO(3)-Convolutions
Generalizing the above results to cover $SO(3)$-convolutions is of particular interest when using ENN for processing spherical data such as fish-eye images [4,25]. The Fisher-Bingham distribution [26], also known as the Kent distribution, can be leveraged in this context. More precisely, we have in this case, for $x \in S^2$ (the 2D-sphere in $\mathbb{R}^3$):
$$f(x) = \frac{1}{c(\kappa, \beta)} \exp\left(\kappa\, \gamma_1 \cdot x + \beta\left[(\gamma_2 \cdot x)^2 - (\gamma_3 \cdot x)^2\right]\right),$$
where the $\gamma_i$ for $i = 1, 2, 3$ are vectors of $\mathbb{R}^3$ such that the $3 \times 3$ matrix $\Gamma = [\gamma_1, \gamma_2, \gamma_3]$ is orthogonal and $c(\kappa, \beta)$ is a normalizing constant.
Although we defer the details of the derivation of the corresponding AIS estimator (10) to further work, we illustrate in Figure 3 that $SO(3)$-convolutions could also benefit from variance reduction methods by using a simple quasi-Monte Carlo scheme [27] with a three-dimensional Sobol sequence [28].
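A minimal sketch of such a quasi-Monte Carlo scheme: a scrambled three-dimensional Sobol sequence is mapped to quasi-uniform rotations through the standard quaternion construction (Shoemake's method), and the integrand is averaged over the resulting points. The integrand `phi` is a placeholder standing in for $h \mapsto k(h^{-1}g) f(h)$; all names are our own.

```python
import numpy as np
from scipy.stats import qmc
from scipy.spatial.transform import Rotation

def qmc_so3_mean(phi, n_log2=12, seed=0):
    """Quasi-Monte Carlo average of phi over SO(3) with a 3D Sobol sequence."""
    u1, u2, u3 = qmc.Sobol(d=3, scramble=True, seed=seed).random_base2(n_log2).T
    quat = np.stack([np.sqrt(1 - u1) * np.sin(2 * np.pi * u2),
                     np.sqrt(1 - u1) * np.cos(2 * np.pi * u2),
                     np.sqrt(u1) * np.sin(2 * np.pi * u3),
                     np.sqrt(u1) * np.cos(2 * np.pi * u3)], axis=1)
    hs = Rotation.from_quat(quat)            # quasi-uniform sample of SO(3)
    return np.mean([phi(hs[i]) for i in range(len(hs))])

# Example with a placeholder integrand (mean of a rotation-matrix entry):
print(qmc_so3_mean(lambda h: h.as_matrix()[0, 0]))
```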

Monte Carlo Methods in the Quantum Set-Up
Monte Carlo computations can generally benefit from a quadratic speed-up in a quantum computing set-up [18] and improving quantum Monte Carlo integration schemes is a very active topic of research [19], mainly driven by applications within the financial industry [20,29].
A similar speed-up can therefore be expected in our context by estimating (1) through the quantum amplitude estimation (QAE) algorithm [30]. For $g \in G$, we denote by $\varphi_g^{f,k} : G \to \mathbb{R}$ the function such that $\forall h \in G$, $\varphi_g^{f,k}(h) = k(h^{-1}g)\, f(h)$. We first construct an operator $U_{\mu_G}$ to load a discretized version of $\mu_G$ so that
$$U_{\mu_G}|0\rangle = \sum_{h \in \tilde{G}} \sqrt{p(h)}\, |h\rangle, \quad \text{with} \quad p(h) = \int_{B(h,r)} \mathrm{d}\mu_G(g),$$
where $\tilde{G}$ is a finite discretization of $G$ and $B(x, r)$ the ball of radius $r > 0$ centered in $x$, and build another unitary operator $U_\varphi$ to compute and load the values taken by $\varphi_g^{f,k}$ on $\tilde{G}$ (rescaled into $[0,1]$ as $\tilde{\varphi}_g^{f,k}$), defined by
$$U_\varphi |h\rangle|0\rangle = |h\rangle\left(\sqrt{1 - \tilde{\varphi}_g^{f,k}(h)}\,|0\rangle + \sqrt{\tilde{\varphi}_g^{f,k}(h)}\,|1\rangle\right).$$
Using the QAE algorithm on $U_\varphi U_{\mu_G}$ gives us access to an estimate of (1) after proper rescaling, with a precision of $\delta$ in $O(1/\delta)$ oracle calls. As described in Section 3.1, the AIS estimator (10) requires $O(1/\delta^2)$ samples to reach a precision of $\delta$, which is asymptotically less efficient than the above quantum estimator. However, no quantum advantage has been evidenced on current hardware for general Monte Carlo estimations, and further challenges with respect to the precision of the evaluation of the integrand $\varphi_g^{f,k}$ are expected in our specific context. Continuing to optimize the estimators in the classical set-up while keeping track of the progress made on the development of quantum hardware therefore appears a reasonable path to follow.
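To fix ideas, the following classical sketch computes the quantity that QAE would estimate on a discretized $SO(2)$, reusing f and k from the earlier sketches; the grid size and the min-max rescaling are our own illustrative choices, and with the uniform normalized Haar measure each cell simply carries mass $2^{-m}$.

```python
import numpy as np

# Discretize SO(2) into 2^m cells; p(h) is the Haar mass of each cell.
m_qubits, g = 8, np.pi / 3
grid = 2 * np.pi * np.arange(2**m_qubits) / 2**m_qubits
p = np.full(2**m_qubits, 1.0 / 2**m_qubits)

phi = k((g - grid) % (2 * np.pi)) * f(grid)    # integrand phi_g^{f,k} on the grid
lo, hi = phi.min(), phi.max()
phi_tilde = (phi - lo) / (hi - lo)             # rescaled into [0, 1] for QAE

# QAE would estimate sum_h p(h) * phi_tilde(h) to precision delta in O(1/delta)
# oracle calls; undoing the rescaling recovers the discretized estimate of (1).
print((hi - lo) * np.sum(p * phi_tilde) + lo)
```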

Conclusions and Further Work
By building on the approach proposed in [13] for quantile estimation, we have introduced in this paper an AIS variance reduction method for the computation of group-based convolution operators, a key component of equivariant neural networks. We have in particular used information geometry concepts to define an efficient update rule to infer the optimal sampling parametric distribution and have also shown promising results when working with the two-dimensional rotation group SO(2) and von Mises distributions.
Further work will include the study of non-compact groups such as SU(1,1) so as to improve the efficiency of the computations underlying the ENN introduced in [9]. As shown in [31], Souriau thermodynamics can be used to build Gaussian distributions over SU(1,1), which appear as natural candidates for applying the AIS scheme presented in this paper.
We have also seen that Monte Carlo computations can generally benefit from a quadratic speed-up in a quantum computing set-up. Further work will include the study of using AIS in this context so as to provide a generic and efficient quantum algorithm for group-convolution computation. Benchmarking with group-Fourier transform-based approaches such as [21], which are more theoretically involved but which promise an exponential speed-up, will also be of high interest, as will be the case for results coming from the emerging field of quantum geometric deep learning [32,33].
Author Contributions: All authors contributed equally to the paper. All authors have read and agreed to the published version of the manuscript.
Funding: This paper is the result of some research work conducted by the authors at Thales Group.
Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.
Data Availability Statement: Not applicable.

Conflicts of Interest: The authors declare no conflict of interest.