Article

SymmetryLens: Unsupervised Symmetry Learning via Locality and Density Preservation

Department of Physics, Bogazici University, Istanbul 34342, Turkey
* Author to whom correspondence should be addressed.
Symmetry 2025, 17(3), 425; https://doi.org/10.3390/sym17030425
Submission received: 14 January 2025 / Revised: 4 March 2025 / Accepted: 8 March 2025 / Published: 12 March 2025
(This article belongs to the Section Computer)

Abstract:
We develop a new unsupervised symmetry learning method that starts with raw data and provides the minimal generator of an underlying Lie group of symmetries, together with a symmetry-equivariant representation of the data, which turns the hidden symmetry into an explicit one. The method is able to learn the pixel translation operator from a dataset with only an approximate translation symmetry and can learn quite different types of symmetries that are not apparent to the naked eye. The method is based on the formulation of an information-theoretic loss function that measures both the degree of symmetry of a dataset under a candidate symmetry generator and a proposed notion of locality of the samples, which is coupled to symmetry. We demonstrate that this coupling between symmetry and locality, together with an optimization technique developed for entropy estimation, results in a stable system that provides reproducible results.

1. Introduction

The spectacular success of convolutional neural networks (CNNs) has triggered a wide range of approaches to their generalization. CNNs are equivariant under pixel translations, which is also a property of the information content in natural image data. By using a weight-sharing scheme that respects such an underlying symmetry, it has been possible to introduce a powerful inductive bias in line with the nature of the data, resulting in highly accurate predictive models. How can this approach be generalized?
Translations form a group, and, for those cases where a more general underlying symmetry group is known, a mathematically elegant generalization can be developed in the form of group convolutional networks [1]. However, data often have underlying symmetries that are not known explicitly beforehand. In such cases, a method for discovering the underlying symmetry from data would be highly desirable.
Learning symmetries from data would allow one to develop an efficient weight-sharing scheme as in CNNs, but, even without this practical application, simply being able to discover unknown symmetries is of considerable interest in itself. In many fields, symmetries in data provide deep insights into the nature of the system that produces the data. For example, in physics, continuous symmetries are closely related to conservation laws, and knowledge of the symmetries of a mechanical system allows one to develop both analytical [2] and numerical [3] methods for investigating dynamics. More generally, the discovery of a new symmetry in a dataset would almost surely trigger research into the underlying mechanisms generating that symmetry.
In this paper, we develop a new method for discovering symmetries from raw data. For a dataset that has an underlying unknown symmetry (analogous to translation symmetry) provided by the action of a Lie group, the model provides a generator of the symmetry (analogous to pixel translations). In other words, we develop an unsupervised method, which has no other tasks such as classification or regression: the data are provided to the system without any labels of any sort, and out comes the generator of the symmetry group action. The setup also creates, as a byproduct, a symmetry-based representation of the data, whose importance is forcefully emphasized by Higgins et al. [4] and Anselmi and Patel [5]. This symmetry-based representation can also be used as an adapter between raw data and regular CNNs.
In the context of symmetry learning, one sometimes restricts attention to symmetry transformations that act on samples only through an action on their component indices. For instance, in the case of images, a translated image is obtained by shifting the pixel indices in an appropriate way, without otherwise transforming the data. While such transformations form an important class of symmetry actions, our setting is more general in that we consider a symmetry group of transformations acting directly on samples, and so-called “regular representations” (of which pixel translations are an example) are only a special case of our approach.
Another avenue of generalization relevant to our methodology is the notion of a local symmetry, as opposed to the more commonly encountered notion of global or “rigid” symmetries. The translation symmetry of the underlying distribution of an image dataset does not come from the fact that rigidly translated versions of images are commonly encountered; they are not. One sometimes synthetically creates such samples for purposes of data augmentation, but, in reality, multiple pictures of the same exact setting are rare. What does happen is that real images consist of local building blocks whose locations have a distribution that is approximately invariant under translations (see Figure 1).
Each image consists of the superposition of a number of such building blocks, and the approximate translation symmetry of image data is, in this sense, tied to an underlying local version of the translation symmetry. We ignore the dependence between the locations of different object types. The importance of being able to deal with such a flexible/local notion of symmetry has been emphasized by Bronstein et al. [6] and Anselmi et al. [7].
With this motivation, we seek to develop a symmetry learning methodology that finds transformations that respect the underlying distribution of a dataset of samples and respect a notion of locality that is intertwined with the symmetry.
Before providing an outline of our methodology, let us provide a quick overview of the capabilities of our system. Consider a dataset whose samples are superpositions of local basis signals with locations randomly selected from a uniform distribution over a finite range (see Figure 2). Each localized signal, with possibly a different width and height, is a local building block in the sense described above. The distribution of such samples is approximately invariant under translations (modulo edge effects). The question is, given such a dataset, without any other guidance (e.g., a supervised learning problem on this dataset with ground truth labels that are invariant under the translation symmetry), can one find that the operation of (one-dimensional) pixel translations is a symmetry of this dataset? Our model is, indeed, able to start with a dataset of this form and discover the matrix that represents pixel translations, sending each component of a sample to the next component.
Although this is not a straightforward task, rediscovering the familiar case of (one-dimensional) pixel translations from such a dataset is perhaps the first thing to demand of a method that claims to learn generalized symmetries from raw data in an unsupervised manner. However, our approach allows the system to discover symmetries that are much less obvious. Suppose we take a dataset that is again approximately invariant under pixel translations and apply a fixed permutation to the pixels to obtain a new dataset. The underlying symmetry of this new dataset will be much less obvious to the naked eye: the generator of the symmetry will now be an operator like Pixel 1→Pixel 17→Pixel 12, etc. We demonstrate that our method can discover this symmetry just as well as it discovers pixel translation. Moreover, the model learns a symmetry-based representation as a byproduct, and this representation unscrambles the pixel order to make the hidden symmetry manifest. While we find this highly encouraging, this, too, is a setting where the symmetry operation acts on the pixel indices.
As a final example, consider a symmetry action in the form of frequency shifts, in the sense that the dataset under consideration comprises time-series data that are approximately symmetric under the three-step action of (1) Fourier transformation (more precisely, we use discrete sine transformation I), (2) shifting all the frequency components by the same amount (i.e., a translation in frequency space), and (3) inverse Fourier transformation (see Figure 2). This action clearly is not obtained by simply shuffling around the component indices of the input vectors. Our method learns the matrix that represents the minimal generator of this action as well. In this case, the symmetry-based representation constructed by the model as a byproduct comes out to be almost exactly a discrete Fourier transform matrix. In other words, by simply “looking at” a dataset that has an approximately local symmetry under frequency shifts, the method learns that the appropriate symmetry-based representation for this dataset is a Fourier transform.
In the following sections, we provide the technical details of the method itself, but the fundamental intuition is rather simple, and we next provide a quick description.
Our approach is based on formulating two key properties that a symmetry group of a dataset should satisfy (for the definition of a group in the context of symmetry learning, see, e.g., Helgason [8]), namely invariance and locality. An appropriate loss function that measures the degree to which these properties are satisfied enables us to find the underlying symmetries with gradient-based optimizers. While versions of invariance are often used in the symmetry learning literature, we believe that the inclusion of a locality property is what gives our method its power.
Invariance. The first natural characteristic of symmetry is the preservation of the underlying distribution. In other words, the original data and their transformed version under a symmetry map should have approximately the same distribution. In the case of images, this means that the joint distribution of all the pixel activations should be invariant under the global translation of pixels (ignoring boundary effects). For a general symmetry group action (beyond translations), such an invariance may not be obvious to the naked eye, but, once one knows the underlying symmetry, one should be able to confirm the behavior. (In the continuous case, one views data samples as sample functions or realizations of an underlying continuous stochastic process, and a stochastic process with the required invariance property is called a strongly stationary process under the action of the symmetry group. Apart from the mathematical appendices (Appendix A–Appendix F), in this paper, we restrict our attention to the discretized version of such a continuous underlying symmetry).
As emphasized by Desai et al. [9], the space of all the density-preserving maps [10] for a multidimensional dataset is large and includes maps that one would not normally want to call symmetries, so simply seeking density- (or distribution-) preserving maps will not necessarily allow one to discover symmetries. This property, nevertheless, is a reasonable characteristic to expect from a symmetry transformation. Implementing a loss function that measures the degree of density preservation will help us to achieve transformations with this property.
Locality. To identify the other fundamental property we will demand of a symmetry group action, we turn back to CNNs for inspiration. A CNN model not only respects the underlying translation symmetry but does so in a local way, in the sense that each filter has a limited spatial extent. This inductive bias, as well, corresponds to an underlying property of image data. As mentioned above, images contain representations of objects, which themselves have locality along the symmetry directions, and local filters do a good job of extracting such local information. While locality and symmetry are distinct properties, they are often coupled, and this coupling often hints at something fundamental about the processes generating the data. In physics, space and time translation symmetries and spacetime locality are intimately coupled, and this locality–symmetry coupling is a fundamental property of quantum field theories describing the standard model of particle physics [11].
We choose to enforce a generalized notion of locality as the second fundamental property that the symmetry group should satisfy. The symmetry directions should be coupled to the locality directions in data. To make this proposal concrete, consider the action of a single local CNN filter. If we apply the same filter to an image and its slightly translated versions, we obtain various scalar representations of the image. These scalars will be strongly dependent random variables that contain similar information when the amount of translation is small, but, with larger translations, the dependence and similarity decrease.
We will require that a general unknown group of symmetries should satisfy a similar locality property. Once again, let us start with the assumption that we have a “local filter”, i.e., a scalar-valued map, this time local along the unknown symmetry direction. Applying this filter to a sample, we obtain a scalar. First transforming the sample by a symmetry transformation and then applying the filter, we obtain another scalar. Assume that the information in the samples is indeed local along the symmetry direction. Then, for “small” transformations, the two scalars should be strongly dependent random variables that encode similar information. For “larger” transformations, the similarity should decrease. A loss function that quantifies this behavior would thus help us to find symmetry groups that have the required coupling with the locality.
In addition to the problem of making the concepts in the scare quotes above precise (which we address in our method sections), this approach also faces an immediate chicken–egg problem: to conduct a search for a set of symmetry transformations in this way, we would first need a scalar representation, a filter, which itself is local along the symmetry direction. But, to find such a filter, we first need to know the set of symmetry transformations (under which the filter will be local).
We solve this problem by a sort of bootstrapping: we seek the local scalar representation and the symmetry group simultaneously and self-consistently using an appropriate loss function that applies to both. As described below, this coupled search actually works, finding the appropriate minimal generators of symmetry, together with a filter that behaves as a “delta function” in the direction of symmetry transformations. As a byproduct, one also obtains a group-equivariant symmetry-based representation of data, which makes the hidden symmetries explicit. See Figure 3 for a visual representation of the method.
From the outset, we emphasize that we do not assume the data space to be a so-called homogeneous space under the action of the Lie group of symmetries. More explicitly, we do not assume that the symmetry group is large enough to map any given point in the sample space to any other point. This would be a highly restrictive assumption, which is not, for example, satisfied by image data.
We find it satisfying that the approach described here actually works, but the details, of course, matter. The dynamics of the optimization of a loss function are complicated, and various seemingly similar implementations of the same intuition can result in widely varying results. The method we detail in the following sections is a specific implementation of the ideas of locality and invariance described above, and it results in a highly accurate and stable system that provides reproducible results.
In summary, the main contributions of the paper are the following:
  • The formulation of a new approach to symmetry learning via maps that respect both symmetry and (a proposed notion of) locality, coupled in a natural way. The outputs of the method are both minimal generators of the symmetry and a symmetry-based data representation that, in a sense, makes the hidden symmetry manifest. This representation can be used as input for a regular CNN, allowing the model to work as an adapter between raw data and the whole machinery of CNNs.
  • The formulation of an information-theoretic loss function encapsulating the formulated symmetry and locality properties.
  • The development of optimization techniques (“time-dependent rank”, see Section 4.3.1) that result in highly robust and reproducible results.
  • The demonstration of the symmetry recovery and symmetry-based representation capabilities on quite different sorts of examples, including 1D pixel translation symmetries, a shuffled version of pixel translations, and frequency shifts using dataset dimensionalities as high as 33 (i.e., the relevant symmetry generators with the shape 33 × 33 ).
We note that, in the search for symmetry generators, there are no hard-coded simplifications such as sparsity in the symmetry generators or symmetry-based representations. The model indeed needs to learn the relevant matrices without utilizing an underlying simplifying assumption or factorization.
The rest of the paper is structured as follows. We will briefly summarize the current literature in Section 2, while Section 3 includes the theoretical setup motivating our approach. We describe the details of the method in Section 4, and we investigate the performance and robustness of the method in Section 5. Section 6 includes a summary of our contributions, future research directions, and limitations.

2. Related Work

Symmetry learning in neural networks is a rapidly evolving area of research, driven both by the value of symmetries for natural sciences and by the usefulness of symmetry-based inductive biases in model architectures. The current literature involves nuanced definitions for the symmetry learning problem as well as different symmetry learning schemes. The techniques used in symmetry learning can be roughly classified into supervised, self-supervised, and unsupervised learning approaches.
Supervised learning-based approaches combine a supervised problem with a search for symmetry maps that work in harmony with the supervised model in an appropriate sense. Benton et al. [12] use parametrized transformations of input data (such as rotations or affine transformations) for data augmentation and average the predictions over augmented samples to obtain a final prediction that is invariant under the chosen transformations. Using a loss function that encourages the exploration of a range of transformations, they are able to find, e.g., rotations of images as appropriate transformations. Romero and Lohit [13] start with a candidate symmetry group and focus on learning a subgroup of symmetry. Another supervised approach to symmetry learning is via meta-learning. Zhou et al. [14] form many supervised tasks tailored to the symmetry learning problem and use a weight-sharing architecture that is shared between tasks. A cascaded optimization tries to improve the supervised performance by updating both the matrix that determines the weight-sharing and the weights that are used in each task. This way, the model can discover convolutional weight-sharing from 1D translation-invariant data, which is impressive. However, the supervised tasks are a bit unnatural in that the synthetic data generation is conducted by using sampled filters, and the supervised task is to discover the ground truth filters. As symmetries are essential ingredients in physical theories, a wide range of approaches have been tried for symmetry learning in physics-adjacent settings. Craven et al. [15] train a neural network (NN) to approximate a function, and then candidate symmetries are tested by computing the values of the function on inputs transformed by the candidate symmetry and considering the “small transform” asymptotic behavior of the error of the NN. Forestano et al. [16] and Forestano et al. [17] start with specific quadratic forms used in the definition of various Lie groups (as the set of maps that preserve the given form) and then look for linear transformations near identity that leave the form invariant to recover the Lie algebra. Krippendorf and Syvaeri [18] set up a classification problem for predicting the value of a potential function with some underlying symmetries and then use the representations in the embedding layer to search for transformations that relate points that are close to each other in this layer.
Self-supervised approaches adopt auto-regressive setups for the symmetry learning problem. For datasets involving sequential frames, Sohl-Dickstein et al. [19] extract linear transformations relating subsequent time steps in a self-supervised manner, whereas, in the setup proposed by Dehmamy et al. [20], one has access to original data and a transformed version and aims to learn a Lie algebra generator that would relate the two by an approximate group action. In the setting of dynamical systems, Greydanus et al. [21] focus on learning a Hamiltonian, whereas Alet et al. [22] learn conserved quantities, which are both related to underlying symmetries. The self-supervised approach is most useful and interesting in the setting of dynamics of systems, but this also limits its applicability. However, within this setting, the lack of a need for labeled data is an advantage.
Unsupervised approaches to symmetry discovery work with a raw dataset and aim to output maps that represent symmetries of the dataset. A prominent line of research in this setting is based on Generative Adversarial Networks (GANs) [23]. Desai et al. [9] propose using a generator applying candidate symmetry transformations from a parametrized set, such as affine transformations, and a discriminator aiming to distinguish real samples from the transformed versions. The training converges to parameter values that represent symmetries of the dataset, but this approach neither provides the generator of the relevant transformation nor works with a high-dimensional setting without a low-dimensional parametrization for the set of candidate transformations. Yang et al. [24] propose enhancements such as regularization terms preventing identity collapse and encouraging the discovery of multiple Lie algebra generators and sampling the exponential coefficients of the generators from distributions to enable the discovery of subgroups. The main setup is similar in that only low-dimensional parametrized generators are discovered. Yang et al. [25] extend this idea to learn nonlinear group actions by learning a representation where the group action becomes linear simultaneously with the group generator. Except for the autoencoder used for the representation, the setup is similar to the setup proposed by Yang et al. [24]. Tombs and Lester [26] propose an approach that is similar in spirit to the GAN-based symmetry learning, where candidate symmetries are not generated by the model but are provided externally, and a model is trained to discriminate between real and transformed samples. The idea is that, if the model assigns similar probabilities for the sample being real or transformed, then one concludes that the candidate transformation is indeed a symmetry.
Instead of learning symmetry generators explicitly, one can search for a symmetry-equivariant representation directly. Anselmi et al. [7] propose an unsupervised approach for learning such equivariant representations for the case of permutation groups. They demonstrate that the proposed approach can learn equivariant representations for various subgroups of the symmetric group of six objects, acting on a six-dimensional dataset via component permutations. The generated synthetic datasets are completely explicitly symmetric (created by the action of the full group on an initial set of samples).
Among the approaches described above, the requirements for labeled datasets or structured data representations limit the applicability of supervised and self-supervised symmetry learning paradigms. Unsupervised methods offer the widest applicability in principle, but they are demonstrated to work for only up to four-dimensional irreducible representations (and our experiments reported below imply that they perform poorly for higher dimensions). Real-world datasets often involve much higher dimensionalities, which state-of-the-art unsupervised methods cannot deal with.
In this study, we formulate the symmetry learning problem in a representation learning framework where the symmetries are learned via a group convolution map. This makes the symmetry manifest, turning it into a simple translation symmetry, which we believe is unique in the symmetry learning literature. Our setup is able to accurately extract symmetries from datasets with dimensionalities as high as 33 without any hard-wired factorizations, which is by far the highest dimensionality we have seen in the unsupervised approach to symmetry learning. Our approach based on coupling locality and invariance also allows the method to consistently learn the minimal generators, thus enabling one to construct the full symmetry group.

3. Theoretical Setting and Data Model

In this section, we describe a simple setting for data-generating mechanisms that have the symmetry and locality properties described in our introductory discussion. This will both motivate our symmetry learning approach and guide the data generation for our experiments. Readers primarily interested in the description of the method and the results can skip this section in a first reading and peruse Appendix C.1 for information on our datasets.
Our aim is to use the spatial information locality and the translation symmetry of images for inspiration to describe the more general setting of symmetry and locality under a different group of transformations. We first consider the ideal case of continuous data (corresponding to images with “infinite resolution”) and then turn to a discretized version (corresponding to pixelated images). The relevant translation symmetry group for image data is the group of 2-dimensional translations, but we focus on the simpler case of 1-dimensional translations appropriate for “1-dimensional images” (or, more familiarly, time-series data).

3.1. The Continuous Setting

3.1.1. Introduction

A group is a set with an associative binary operation that has an identity element and an inverse for each element. The set of invertible transformations for a wide class of mathematical objects is described under the setting of group theory. The group of 1-dimensional translations is isomorphic to the additive group $\mathbb{R}$, each number representing the translation amount and the identity element corresponding to a translation by 0.
As mentioned in our Introduction in Section 1, images consist of building blocks that are spatially local, and the approximate translation symmetry of an image dataset is borne out of the random distribution of the locations of the building blocks (see Figure 1). In order to formalize the case of 1-dimensional translations, we first model the data-generating mechanisms for “1-dimensional images” as a simple class of stochastic processes and then move on to more general symmetries.

3.1.2. Processes with Symmetry and Locality Under 1-Dimensional Translations

We will think of each 1-dimensional image as a sample function (or a realization) of an underlying stochastic process. Intuitively, a real-valued 1-dimensional stochastic process is a “machine” that provides us a random function $f : \mathbb{R} \to \mathbb{R}$ each time we press a button, using an underlying distribution. A translation-invariant (or stationary) process is one where a sample function $f(t)$ and all its translated versions $f(t - \tau)$, $\tau \in \mathbb{R}$, are “equally likely”. Of course, it does not make sense to talk about the probability of a single sample function, and rigorously defining probability measures on function spaces is rather tricky. However, the intuitive picture mentioned provides a useful viewpoint that we will utilize below. (In the literature on stochastic processes, a stochastic process is more properly defined in terms of a family of maps from a probability space, and stationarity is commonly defined in terms of the joint distributions of function values $f(t_j)$ on finite sets $\{t_j\}_{j=1}^{N}$. For a formal definition of stationary processes, see Doob [27]).
To create a mathematical model for a translation-invariant process with local building blocks, consider a set $\{\phi_i\}_{i \in I}$ of functions that are localized (compactly supported). We would like to think of each $\phi_i$ as representing a type of object. If we shift each $\phi_i$ by an amount $\tau_i$ and add up the resulting functions, we obtain a signal $f(t)$ consisting of local building blocks at various locations, $f(t) = \sum_i \phi_i(t - \tau_i)$. If we could select each $\tau_i$ randomly using a uniform distribution, the process of creating such $f(t)$ would result in a stationary stochastic process.
While it is not possible to have a uniform probability distribution over the whole real line, this is a technical difficulty that can be overcome by, e.g., considering a stochastic process on a “large” circle instead of $\mathbb{R}$ (or using a stationary point process such as the Poisson process to obtain a collection of centers $\tau_{ik}$ for each $\phi_i$ with a translation-invariant distribution on $\mathbb{R}$ rather than a single center).
More generally, each $\phi_i$ could have additional parameters $\lambda_{ik}$ such as width or amplitude, which one could independently sample from their own distributions. In short, starting with a set of basis signals (“object types”) $\phi_i$ and sampling the centers and other parameters in an appropriate way, one obtains a stochastic process that is both symmetric (uniform distribution of centers) and local under translations. Sample functions of such a process are given by
$f(t) = \sum_{i} \sum_{k} \phi_i(t - \tau_{ik};\, \lambda_{ik}).$  (1)
To complete the analogy with (finite-size) images, one can crop each such sample f ( t ) to a finite interval. (In real images, opaque objects actually hide each other rather than combining in an additive way, but we ignore such considerations in this simple setting). See the left-hand side of Figure 4 for samples of the kind described here.

3.1.3. Processes with Symmetry and Locality Under a General 1-Dimensional Group Action

We would like to generalize the simple setting above to more general group actions. We will describe an abstract setting first and will then provide concrete examples.
Consider stochastic processes defined on a space $X$, with the space of sample functions $x$ denoted by $\mathcal{F}(X) = \{x : X \to \mathbb{R}\}$. Suppose we have an Abelian group $G$ (group operation written as $+$) with a given group action $\rho$ on $\mathcal{F}(X)$. In other words, for a sample function $x \in \mathcal{F}(X)$ and group element $\tau \in G$, the result of the action $\rho(\tau) \cdot x$ is another sample function, and $\rho$ satisfies $\rho(0) = \mathrm{id}$, $\rho(\tau_1) \cdot \rho(\tau_2) = \rho(\tau_1 + \tau_2)$, $\rho(-\tau) = [\rho(\tau)]^{-1}$. We will specialize to the case $X = \mathbb{R}$ and $G$ isomorphic to $\mathbb{R}$, but much of what we write will apply to more general Abelian Lie groups. We will keep $\rho$ unspecified; importantly, we do not assume that $\rho$ is given by the “regular representation”. In particular, we do not assume $\rho$ acts by simple translations as in $(\rho(\tau) \cdot x)(t) = x(t - \tau)$.
Given such an action ρ , we would like to describe a process that generates sample functions with invariance and locality properties as above, but, this time, the locality and invariance will be under ρ instead of translations. Taking ρ to be a simple translation for the moment, the following representation of the signal f ( t ) of (1) motivates our generalization:
$f(t) = \int \delta(t - \tau)\, f(\tau)\, d\tau$  (2)
$\qquad\;\; = \int (\rho(\tau) \cdot \delta^{(\rho)})(t)\, f(\tau)\, d\tau.$  (3)
Here, in (2), $\delta(t)$ denotes the usual Dirac delta function, and, in (3), $\delta^{(\rho)}(t)$ is just another name for it, the notation suggesting that this delta function is “a delta function along the symmetry action $\rho$ of simple translations”. The action $\rho(\tau) \cdot \delta^{(\rho)}$ by the translation operator $\rho(\tau)$ provides the shifted delta, $\delta(t - \tau)$.
If we now view the right-hand side of (3) as applying to a general ρ instead of only translations, we obtain the generalization we want. For a given group action ρ for the group G isomorphic to R , assuming one has an appropriate notion of a “delta function δ ( ρ ) along ρ ”, one can create a process that is invariant (stationary) and local under the group action ρ from a process that is invariant and local in the usual sense on R . One obtains samples f ( t ) from a process such as (1) on G and transforms each sample to a sample x ( t ) on X = R via
$x(t) = \int (\rho(\tau) \cdot \delta^{(\rho)})(t)\, f(\tau)\, d\tau.$  (4)
Here, ρ no longer represents simple translations, and δ ( ρ ) is no longer the usual delta function but is an appropriate delta “function” (or measure, or distribution) defined on X . (Here, the integration measure d τ on the right-hand side should properly be viewed as the Haar measure on the relevant Lie group, but, in this 1-dimensional setting, we do not lose much by treating it as the Lebesgue measure on R ; for higher-dimensional, possibly non-Abelian Lie groups, one will have to be more careful). While we specialize to X = R here (and mention a generalization to R D below), the formalism suggests greater generality.
To provide another concrete example of this abstract setting (in addition to translations), consider the group action of $G \cong \mathbb{R}$ on sample functions given by frequency shifts instead of translations. In other words, let $\rho(s)$ act on a sample function $x(t)$ via
$(\rho(s) \cdot x)(t) = e^{ist}\, x(t)$  (5)
which corresponds to a shift in Fourier space (for simplicity, we take the sample functions in this case to be complex-valued). For this action, an appropriate delta function along $\rho$ is a Fourier basis function $\delta_{k_0}^{(\rho)}(t) = e^{i k_0 t}$, where $k_0 \in \mathbb{R}$. Just as the regular Dirac delta function is sharply localized under translations in the sense that $\delta(t - t_0)$ and $\delta(t - t_0 - \tau)$ have zero overlap, the functions $\delta_{k_0}^{(\rho)}(t) = e^{i k_0 t}$ and $(\rho(s) \cdot \delta_{k_0}^{(\rho)})(t) = e^{i (k_0 + s) t}$ have zero overlap (e.g., these functions are orthogonal by Fourier analysis). Specializing to the case $k_0 = 0$, we can construct signals that are local along the action of $\rho$ by using (2) from a signal $f(t)$ as in (1) defined on the group $G = \mathbb{R}$:
$x(t) := \int (\rho(s) \cdot \delta^{(\rho)})(t)\, f(s)\, ds = \int e^{ist} f(s)\, ds.$  (6)
Thus, we obtain the pleasing result that, according to this setup, a signal that is local along the “frequency shift” action $\rho$ of the 1-dimensional group $G \cong \mathbb{R}$ is given by the Fourier transform of a signal $f(s)$ that is local under translations (of $s$). In other words, to obtain a stochastic process that is invariant and local under frequency shifts, we can start with a spatially local and spatially stationary process on the group $\mathbb{R}$ and take the Fourier transform of each sample.
We will leave an attempt at a rigorous description of the fully general version of this setup for future work and note that, in the general case, the delta function δ ( ρ ) on X is closely related to what is called an “approximate identity” in the abstract harmonic analysis literature (see Folland [28] for an introduction). In particular, such a delta will be required to satisfy
$(\rho(s) \cdot \delta^{(\rho)}) * (\rho(t) \cdot \delta^{(\rho)}) = \delta(s - t)$  (7)
where $*$ denotes the appropriate convolution on $X$, and the $\delta$ on the RHS is the Dirac delta function on the group $G \cong \mathbb{R}$. This means, in particular, that separately transformed versions of $\delta^{(\rho)}$ have zero overlap unless the transform parameters are the same.

3.2. The Discrete Setting: Synthetic Data Generation

3.2.1. Discrete Translation Symmetry

To create discrete signals that are local and symmetric under 1-dimensional discrete translations, we follow the procedure of creating a sample out of local basis signals as in (1) and replace $t$ with a discrete index $n$. The local basis signals $\phi_i(t)$ and sample functions $x(t)$ become vectors with components $\phi_{in}$ and $x_n$, respectively.
In principle, the component index $n$ ranges over all integers, but, for data generation purposes, it is restricted to a finite range. The basis signals $\phi_{in}$ we use are Gaussians (parametrized by width and amplitude) and Legendre functions (parametrized by width, amplitude, and order), with the index of the center being sampled uniformly over a finite range. The details of the process are given in Appendix C.1, but, as a quick summary, we use finite-dimensional vectors for each sample vector and use uniform distributions for centers, widths, and amplitudes of local basis signals, including superposition of a few such signals per sample. Finally, we add Gaussian noise on top of each sample for more realistic experimentation. Examples of the resulting samples can be seen in Figure 4 under the title of “Translation”.
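For concreteness, the following minimal Python/NumPy sketch illustrates a data-generating process of this kind; the parameter ranges and the Gaussian-only basis are illustrative placeholders rather than the exact settings of Appendix C.1:

```python
import numpy as np

def make_sample(d=33, n_signals=3, noise_std=0.05, rng=None):
    """Superpose a few localized Gaussian bumps with uniformly sampled centers,
    widths, and amplitudes, then add observation noise."""
    rng = np.random.default_rng() if rng is None else rng
    t = np.arange(d)
    x = np.zeros(d)
    for _ in range(n_signals):
        center = rng.uniform(0, d)           # uniform centers -> approximate translation invariance
        width = rng.uniform(1.0, 3.0)        # illustrative range
        amplitude = rng.uniform(0.5, 1.5)    # illustrative range
        x += amplitude * np.exp(-0.5 * ((t - center) / width) ** 2)
    return x + rng.normal(0.0, noise_std, size=d)

dataset = np.stack([make_sample() for _ in range(10000)])  # shape (n_samples, d)
```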

3.2.2. General Representations: Behavior Under Orthogonal/Unitary Transformations

Starting with a dataset that is symmetric and local under discrete translations, we can create datasets that are symmetric/local under different group representations by using a discrete version of the transformation (4). Here, we start by proving a transformation property of discrete versions of the symmetry generators and the “Dirac delta along group action” described above and then provide the procedures used in generating synthetic data.
The discrete version $\mathbb{Z}$ of the translation group $G \cong \mathbb{R}$ has a minimal generator, $1 \in \mathbb{Z}$, which is represented by the pixel (or component) translation operator. Similarly, for any representation of the 1-dimensional group $G \cong \mathbb{R}$, there will be a matrix for the minimal generator $\rho(1)$, which we call $\mathbf{G}$. With this notation, the discrete version of (4) is
$x_n = \sum_{p, j} (\mathbf{G}^p)_{nj}\, \delta_j^{(\rho)}\, f_p.$  (8)
For samples $f_m$ ($m \in \mathbb{Z}$) with a distribution local and invariant under component shifts, this provides a sample distribution for $x_n$ that is local and invariant under the action of the generator $\mathbf{G}$.
Now, let us consider the action of an invertible matrix Q on both sides. We obtain
$x'_n \equiv \sum_m Q_{nm}\, x_m = \sum_{m, j, p} Q_{nm} (\mathbf{G}^p)_{mj}\, \delta_j^{(\rho)} f_p = \sum_{m, j, k, l, p} Q_{nm} (\mathbf{G}^p)_{mj}\, Q^{-1}_{jk}\, Q_{kl}\, \delta_l^{(\rho)} f_p$  (9)
$x'_n = \sum_{p, l} (\mathbf{G}'^{\,p})_{nl}\, \delta_l^{\prime(\rho)}\, f_p$  (10)
where primes denote the transformed versions of the objects under $Q$. Thus, the transformed signal $x'$ has the symmetry/locality generator $\mathbf{G}'$ given in terms of the old one via a similarity transformation $\mathbf{G}' = Q \cdot \mathbf{G} \cdot Q^{-1}$, and the new “delta along the symmetry” $\delta'^{(\rho)}$ is given in terms of the old one via $\delta'^{(\rho)} = Q \cdot \delta^{(\rho)}$. In other words, starting with symmetry/locality under a given group representation $\rho$ with generator $\mathbf{G}$, we obtain symmetry/locality under $\rho'$ with generator $\mathbf{G}'$ by using a transformation. If the transformation matrix $Q$ is orthogonal (resp. unitary), then the new generator $\mathbf{G}'$ will also be orthogonal (resp. unitary), assuming the old one $\mathbf{G}$ is.
To create synthetic data in the case of 1-dimensional symmetry groups described above, we consider three symmetry group actions in our experiments: translations, permuted translations, and frequency shifts. Data samples for each symmetry can be found in Figure 4. In each case, we start with a 33-dimensional dataset that is symmetric/local under component translations $\mathbf{G} = \mathbf{T}$ with $(\mathbf{T}x)_i = x_{i-1}$ (ignoring edge effects; one could use cyclic translations to get rid of them) and apply an appropriate $Q$. For translations, we take $Q = I$. For frequency shifts, we take $Q = D^{-1}$, where $D$ is the discrete sine transform matrix of type I. This results in the generator $\mathbf{G}' = D^{-1} \cdot \mathbf{T} \cdot D$, which shifts the frequency components of the samples. For permuted translations, we take $Q = P$, where $P$ is a permutation matrix. Applying $P$ to each raw sample, we obtain samples that are symmetric and local under the powers of the permutation operator $\mathbf{G}' = P \cdot \mathbf{T} \cdot P^{-1}$. As can be seen in Figure 4, the symmetry/locality for the permutation and frequency shift cases is not apparent to the naked eye at all.
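As an illustration of this construction (a sketch, not the exact data pipeline; the permutation below is an arbitrary fixed one, and the orthonormal DST-I matrix is taken from SciPy), the three generators can be built from the one-pixel translation operator as follows:

```python
import numpy as np
from scipy.fft import dst

d = 33
T = np.eye(d, k=-1)                                      # (T x)_i = x_{i-1}, ignoring edge effects

D = dst(np.eye(d), type=1, norm='ortho', axis=0)         # orthonormal DST-I matrix
P = np.eye(d)[np.random.default_rng(0).permutation(d)]   # a fixed permutation matrix

G_translation = T                                        # Q = I
G_frequency   = np.linalg.inv(D) @ T @ D                 # Q = D^{-1}:  G' = D^{-1} T D
G_permuted    = P @ T @ np.linalg.inv(P)                 # Q = P:       G' = P T P^{-1}

# A dataset that is local/symmetric under T is mapped to one that is
# local/symmetric under G' by applying Q to each raw sample: x' = Q x.
```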

4. Materials and Methods

In this section, we formulate the learning problem and the loss function and discuss the training loop together with some useful optimization techniques. See Appendix F for information on the computational complexity of our method.

4.1. The Setup: Objects to Learn

Our model will start with a dataset $\mathcal{D} = \{x_i\}_{i=1}^{n}$, where $x \in \mathbb{R}^d$ and $n$ is the number of samples, and will learn an underlying symmetry group representation $\rho(s)$ and a filter $\psi$ that is local along the symmetry action in the sense suggested in our Introduction and made precise by the loss function below. An appropriate combination of these building blocks will provide a symmetry-based representation given by a matrix $L$, which we call the group convolution matrix.

4.1.1. Symmetry Generator

We will assume that the symmetry action $\rho(s)$ is a real unitary representation; in other words, for each $s$, $\rho(s)$ will be a $d \times d$ orthogonal matrix, $\rho(s) \in O(d)$. We will focus on symmetry actions that can be continuously connected to the identity, which restricts the actions of group elements to those with determinant 1, i.e., elements of $SO(d)$. While the underlying symmetry action $\rho(s)$ will be the representation of a 1-dimensional Lie group parametrized by the continuous parameter $s$, the model will learn a minimal discrete generator $\mathbf{G}$ of this action appropriate for the dataset at hand. With an appropriate choice of the scale of the $s$ parameter that parametrizes the group, we can write $\mathbf{G} = \rho(s = 1)$, which provides (for any integer $s$) $\rho(s) = \rho(1)^s = \mathbf{G}^s$. This will allow us to obtain the action of any group element by taking an appropriate power of the generator.
Any element of $SO(d)$ can be obtained by using the exponential map on an element of the Lie algebra of $SO(d)$, which is the algebra of $d \times d$ antisymmetric matrices [8]. In particular, our generator $\mathbf{G}$ should be given as the matrix exponential of an antisymmetric matrix. Instead of explicitly constraining $\mathbf{G}$ to be an element of $SO(d)$, we write this matrix in terms of an arbitrary $d \times d$ matrix $A$ via $\mathbf{G} = \exp\left(\frac{A - A^{T}}{2}\right)$, where the matrix exponential can be defined in terms of the Taylor series. The optimizer will thus seek an appropriate $A$ instead of a direct search for $\mathbf{G}$. (To optimize memory usage, one could use a more efficient parametrization of antisymmetric matrices).
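A minimal sketch of this parametrization is given below (in the actual model, the matrix exponential is computed inside an automatic-differentiation framework so that gradients flow to $A$; here SciPy is used only to check the group-theoretic properties):

```python
import numpy as np
from scipy.linalg import expm

d = 33
A = np.random.default_rng(0).normal(0.0, 1e-3, size=(d, d))  # unconstrained parameter matrix

G = expm((A - A.T) / 2)   # antisymmetrize, then exponentiate: G lies in SO(d)

# Sanity checks: orthogonality and unit determinant (up to numerical error).
assert np.allclose(G @ G.T, np.eye(d), atol=1e-6)
assert np.isclose(np.linalg.det(G), 1.0)
```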

4.1.2. The Resolving Filter

We will assume that the resolving filter ψ acts by an inner product, i.e., ψ itself will be given as a d-dimensional (column) vector. This will be closely related to the discrete analog of the delta function δ ( ρ ) described in Section 3. We do not use any constraints on this d-dimensional vector; however, we use a normalized version of it in our computations of the relevant dot products.

4.1.3. The Group Convolution Matrix

Following the approach sketched in the Introduction, we will formulate the learning problem in terms of a symmetry-based representation $\mathbf{y} \in \mathbb{R}^d$ of the data $x \in \mathbb{R}^d$ given in terms of the generator $\mathbf{G}$ and the filter $\psi$. For each $d$-dimensional sample vector $x$, the components $y_p$ of the symmetry-based representation $\mathbf{y}$ will consist of the application of the filter $\psi$ to transformed versions $\mathbf{G}^p x$ of the data, $y_p = \sum_{i,j} \psi_i \left( (\mathbf{G}^p)^{T} \right)_{ij} x_j$, where $p = 0, \pm 1, \pm 2, \ldots$. Note that this is completely analogous to a CNN, in which case the generator $\mathbf{G}$ is a pixel translation operator and $\psi$ is a local CNN filter.
These scalar representations $y_p$ of a sample will be the fundamental quantities on which we define the loss function measuring departures from stationarity and locality. In the $\mathbf{y}$ representation, the action of the generator $\mathbf{G}$ is represented by a simple shift in $p$, and thus, if $\mathbf{G}$ is indeed a symmetry generator, the joint distribution of the scalars $y_p$ should be invariant under shifts of $p$: $p \to p + n$, which is a simple translation symmetry. Similarly, if data are local along the action of $\mathbf{G}$ and $\psi$ is indeed local along the symmetry direction, $y_p$ and $y_{p'}$ should be similar/strongly dependent as random variables if $p$ and $p'$ are close and should be approximately independent otherwise (locality). Thus, if the distribution of $x$ is invariant and local under actions of $\mathbf{G}^s$, the distribution of $\mathbf{y}$ should be invariant and local under simple shifts in the components $y_p$.
While there is no a priori restriction on the range of powers p to use in the representation, in this paper, we choose the dimensionality of the representation y to be the same as the dimensionality of x . (Our methodology does not rely on this choice, and one could easily consider other ranges for p).
To obtain y from x directly via y = L x , we form the matrix L, which we call the group convolution matrix:
$L = \begin{pmatrix} \psi^T (\mathbf{G}^{P_{min}})^T \\ \psi^T (\mathbf{G}^{P_{min}+1})^T \\ \vdots \\ \psi^T (\mathbf{G}^{P_{max}})^T \end{pmatrix}$  (11)
where $P_{min}$ and $P_{max}$ denote the minimum and maximum powers of the generator to be used. In our experiments, we work with odd $d$ and pick $P_{max} = \frac{d-1}{2} = -P_{min}$.
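The following sketch builds $L$ directly from a given generator and filter (an illustrative reference implementation; the trained model assembles the same matrix from its learned parameters):

```python
import numpy as np

def group_convolution_matrix(G, psi):
    """Stack the rows psi^T (G^p)^T for p = P_min, ..., P_max, with P_max = (d-1)/2 = -P_min."""
    d = psi.shape[0]
    p_max = (d - 1) // 2
    psi = psi / np.linalg.norm(psi)                 # the normalized filter is used in the dot products
    rows = [psi @ np.linalg.matrix_power(G, p).T    # row for power p: psi^T (G^p)^T
            for p in range(-p_max, p_max + 1)]
    return np.stack(rows)                           # shape (d, d); y = L @ x
```

For example, if $G$ is the one-pixel translation operator and $\psi$ is a one-hot vector, the rows of $L$ are (ignoring edge effects) shifted one-hot vectors, and $\mathbf{y}$ is simply a re-centered copy of $x$, recovering the familiar CNN picture.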

4.2. The Loss Function

Our loss function will be computed from the transformed version y = L x of each batch of vectors x and will measure the degree to which the distribution of this transformed version is symmetric and local under component translations y p y p + s . The loss consists of three building blocks we call stationarity, locality, and information preservation. These are given in terms of the correlation between the components y p of the y representation, as well as various entropy and probability density terms for these components. We first describe the pieces of the loss function assuming one has access to estimates of these necessary quantities and will then explain the techniques used for estimating these quantities for each batch.
Note: For notational simplicity below, we use a non-negative indexing for the components $y_p$ of $\mathbf{y}$, i.e., $p = 0, 1, \ldots, d-1$, and use the notation $\langle \cdots \rangle_{[\text{conditions}]}$ for the average of the expression $\cdots$ inside the angle brackets over index combinations satisfying the conditions.

4.2.1. Stationarity/Uniformity

For true symmetry, we expect the joint probability distribution of the components of Y to be invariant under simple shifts in components (modulo boundary effects). Since estimating the joint probability density is impractical in high dimensions, we instead use the distributions of individual components and conditional distributions for pairs of components.
We denote the marginal probability density of the random variable $Y_j$ as $p_j$ and the conditional probability density of $y_j$ given $y_i$ by $p_{j|i}$. Our uniformity loss, in terms of these quantities, is given by
$\mathcal{L}_{uniformity} = \frac{1}{2} \left\langle \hat{D}_{KL}(p_m, p_n) \right\rangle_{[m = n-1]} + \left\langle \hat{D}_{KL}(p_{i|j}, p_{k|l}) \right\rangle_{[k = i+\Delta,\; l = j+\Delta,\; \Delta = 1]}$  (12)
where $\hat{D}_{KL}$ denotes the estimate of the KL divergence (see Appendix A.3 for the estimation procedure). Averages are taken over all possible $m, n$ and $i, j, k, l$ indices satisfying the specified constraints. The marginal probability term provides a first-order proxy for uniformity, and the conditional probability term takes relations between pairs of random variables into account. Overall, this loss function provides a second-order computationally tractable measure of the uniformity of the full distribution under shifts.
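As a rough illustration of the marginal part of this loss (in the model, the densities and KL divergences come from the learned estimators of Appendix A.3; the histogram version below is a crude, non-differentiable stand-in used only to show the quantity being measured):

```python
import numpy as np

def marginal_kl_uniformity(Y, bins=32, eps=1e-8):
    """Average KL divergence between histogram estimates of the distributions
    of neighboring components y_m and y_{m+1}; Y has shape (batch, d)."""
    d = Y.shape[1]
    lo, hi = Y.min(), Y.max()
    kls = []
    for m in range(d - 1):
        p, _ = np.histogram(Y[:, m], bins=bins, range=(lo, hi))
        q, _ = np.histogram(Y[:, m + 1], bins=bins, range=(lo, hi))
        p = (p + eps) / (p + eps).sum()
        q = (q + eps) / (q + eps).sum()
        kls.append(np.sum(p * np.log(p / q)))
    return float(np.mean(kls))
```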

4.2.2. Locality

Our locality loss consists of two pieces that we call alignment and resolution. Alignment loss aims to make neighboring pixels have similar information, and resolution aims to make faraway pixels have distinct information.
Alignment We would like successive components of the symmetry-based representation to not only have similar distributions (as in the uniformity loss) but to have similar values in each sample. We compute the sample Pearson correlation coefficient $\rho^{R}_{i, i+1}$ of the components $y_i$ and $y_{i+1}$ of the symmetry-based representation $\mathbf{y}$ for each batch and define the alignment loss $\mathcal{L}_{alignment}$ as the negative of the average of this quantity over all successive pairs
$\mathcal{L}_{alignment} = -\left\langle \rho^{R}_{i, i+1} \right\rangle_{[\,i \in [0, d-1)\,]}.$  (13)
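A batch-wise sketch of an alignment term of this form is given below (the sign convention is such that minimizing the loss maximizes the correlation between successive components):

```python
import numpy as np

def alignment_loss(Y, eps=1e-8):
    """Negative mean Pearson correlation between successive columns of the batch Y (shape: batch x d)."""
    Yc = Y - Y.mean(axis=0, keepdims=True)
    Yc = Yc / (Yc.std(axis=0, keepdims=True) + eps)
    rho = (Yc[:, :-1] * Yc[:, 1:]).mean(axis=0)   # sample correlation rho_{i,i+1} for each i
    return -rho.mean()
```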
Resolution We would like the components $y_i$ and $y_j$ of $\mathbf{y}$ to contain distinct information when $i$ and $j$ are not close. In the information theory literature, the quantity
$C(\mathbf{Y}) = \sum_{i=0}^{d-1} h(Y_i) - h(\mathbf{Y})$  (14)
called the total correlation measures the degree to which there is shared information between the components of the random vector $\mathbf{Y}$ ($h(Z)$ denotes the differential entropy of the random variable $Z$). For a local symmetry, only nearby components should share information, and all other pairs of components should be approximately independent, so we expect $C(\mathbf{Y})$ to be small and use an approximation to $C(\mathbf{Y})$ over each batch as our resolution loss.
As we describe in Section 4.3.1, we compute entropy estimates by using an epoch-dependent rank k. In effect, this increases the number of variables that comprise the entropy estimate from k = 1 to k = d over the course of the estimation and thus changes the overall scale of the estimated quantity accordingly. To use a normalized scale, we define the resolution loss for each batch to be
$\mathcal{L}_{resolution} = \frac{1}{d} \sum_{i=0}^{d-1} h(Y_i) - \bar{h}_k(\mathbf{Y})$  (15)
where $\bar{h}_k(\mathbf{Y})$ is an estimate of the “per rank” contribution to the entropy (see Appendix A.2 for the exact formula).
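For intuition, the (unnormalized) total correlation $C(\mathbf{Y})$ of (14) has a closed form if the batch is momentarily treated as Gaussian; the sketch below uses this Gaussian proxy purely as a diagnostic, whereas the model relies on the learned entropy estimators with the rank schedule of Section 4.3.1:

```python
import numpy as np

def gaussian_total_correlation(Y, eps=1e-6):
    """Gaussian approximation of C(Y) = sum_i h(Y_i) - h(Y), from the batch covariance of Y (batch x d)."""
    d = Y.shape[1]
    cov = np.cov(Y, rowvar=False) + eps * np.eye(d)
    sum_marginals = 0.5 * np.sum(np.log(2 * np.pi * np.e * np.diag(cov)))
    h_joint = 0.5 * (d * np.log(2 * np.pi * np.e) + np.linalg.slogdet(cov)[1])
    return sum_marginals - h_joint
```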

4.2.3. Information Preservation

While the correct directions in parameter space reduce the locality and uniformity loss terms described above, there is also a catastrophic solution for L that can minimize these terms: a constant map to a single point. Indeed, we would like to learn a transformation that preserves the information content of the data, and we add a final term to maximize the mutual information between the random vectors corresponding to input X and the output Y . For a deterministic map, this mutual information can be maximized by simply maximizing the entropy of the output (see the InfoMax principle proposed by Bell and Sejnowski [29]).
As mentioned above, our entropy estimate will use an epoch-dependent rank $k$, and we use the negative of a normalized per-rank version $\bar{h}_k(\mathbf{Y})$ (see Appendix A.2) of the output entropy as our information preservation loss:
$\mathcal{L}_{preservation} = -\bar{h}_k(\mathbf{Y}).$  (16)

4.2.4. Total Loss

Merging the overall loss terms, we have
$\mathcal{L} = \mathcal{L}_{alignment} + \alpha\, \mathcal{L}_{resolution} + \beta\, \mathcal{L}_{uniformity} + \gamma\, \mathcal{L}_{preservation}$  (17)
where $\alpha, \beta, \gamma \in \mathbb{R}$. Limited experimentation with only one dataset was able to provide a choice for these hyperparameters that ended up working well for all our experiments, making it unnecessary to perform separate tunings for datasets with different dimensionalities and symmetry properties. See Appendix E for a discussion of hyperparameter choice and sensitivity and ablation experiments confirming that all pieces of this loss function indeed contribute to model performance.

4.2.5. On the Coupled Effect of the Alignment and Resolution Terms

In this section, we argue that the combined effect of the alignment and resolution terms in the loss function pushes the symmetry-based (output) representation towards a (two-sided) Markov process.
Let $\mathbf{Y}$ denote the $d$-dimensional output representation learned by the model. Using the chain rule for entropy, the joint entropy of $\mathbf{Y}$ can be written as $h(\mathbf{Y}) = \sum_{i=1}^{d} h(Y_i \mid \mathbf{Y}_{<i})$, where $\mathbf{Y}_{<i}$ denotes the components $\{Y_1, Y_2, \ldots, Y_{i-1}\}$ when $i > 1$, and, for $i = 1$, it simply means no conditioning is applied (in this section, we use non-negative integers for indexing components). We define the following $m$th-order approximation to the chain rule formula, which only uses local dependencies of window size $m$:
$h^{(m)}(\mathbf{Y}) = \sum_i h\!\left(Y_i \mid \mathbf{Y}_{[i-1,\, i-m]}\right)$  (18)
where $\mathbf{Y}_{[i-1, i-m]}$ denotes the random variables $\{Y_j\}_{\,i-m \le j \le i-1}$, and $m = 0$ represents the case with no conditioning.
Since conditioning can only reduce entropy [30], we have
$h(\mathbf{Y}) \le h^{(d-1)}(\mathbf{Y}) \le \cdots \le h^{(1)}(\mathbf{Y}) \le h^{(0)}(\mathbf{Y}).$  (19)
The “resolution” (total correlation) part of the loss $C(\mathbf{Y}) = h^{(0)}(\mathbf{Y}) - h(\mathbf{Y})$ is minimized when all the conditional entropies in this chain are equal, which happens when the distribution of the variables is just the product distribution, i.e., when the $Y_i$ are all independent. However, the “alignment” loss (see Section 4.2.2) aims to maximize the correlation between successive components. Since independent variables have zero correlation, the alignment loss pushes successive components to be dependent. The mutual information $I(Y_i, Y_{i+1}) = h(Y_{i+1}) - h(Y_{i+1} \mid Y_i)$ between dependent variables is always positive, so the tendency of the alignment loss is to make $h(Y_{i+1})$ strictly greater than $h(Y_{i+1} \mid Y_i)$.
In short, the joint effect of the alignment and the resolution terms is to push the $h^{(1)}(\mathbf{Y})$ term in the chain (19) to be strictly less than $h^{(0)}(\mathbf{Y})$ and all the other terms $h^{(j)}(\mathbf{Y})$, $j \ge 1$, to be equal to each other, i.e.,
$h(\mathbf{Y}) = h^{(d-1)}(\mathbf{Y}) = \cdots = h^{(1)}(\mathbf{Y}) < h^{(0)}(\mathbf{Y})$  (20)
which means that the learned representation $(Y_j)$ is a Markov process. Repeating the argument in the other direction, we obtain a two-sided Markov process. Of course, the degree to which (20) can, in fact, be satisfied in a given problem is determined by the underlying data distribution.
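The entropy chain in (19) and (20) can be checked numerically on a toy Gaussian Markov process (a stationary AR(1)-type covariance); this is only an illustration of the argument above, not part of the method:

```python
import numpy as np

def gauss_entropy(cov):
    """Differential entropy of a multivariate Gaussian with covariance matrix `cov`."""
    cov = np.atleast_2d(cov)
    d = cov.shape[0]
    return 0.5 * (d * np.log(2 * np.pi * np.e) + np.linalg.slogdet(cov)[1])

d, r = 8, 0.6
idx = np.arange(d)
Sigma = r ** np.abs(idx[:, None] - idx[None, :])   # Sigma_ij = r^|i-j|: a (two-sided) Markov Gaussian

h_joint = gauss_entropy(Sigma)                                      # h(Y)
h0 = sum(gauss_entropy(Sigma[i:i + 1, i:i + 1]) for i in range(d))  # h^(0): no conditioning
h1 = gauss_entropy(Sigma[:1, :1]) + sum(                            # h^(1): condition on one neighbor
    gauss_entropy(Sigma[i - 1:i + 1, i - 1:i + 1]) - gauss_entropy(Sigma[i - 1:i, i - 1:i])
    for i in range(1, d))

print(h_joint, h1, h0)   # h(Y) equals h^(1) (up to rounding), and both are strictly less than h^(0)
```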

4.3. Training the Model

Our system is composed of two separate subsystems and training loops. One subsystem seeks the generator G of symmetries and the resolving filter ψ , which together form the group convolution matrix L, while the other consists of probability estimators. For each batch, densities and entropies of the data transformed by the group convolution matrix are estimated and used in the computation of the loss for symmetry learning. The training for the two subsystems is conducted jointly, with simultaneous gradient updates. See Figure 5 and Figure 6 for a visual representation of the two subsystems, and see Appendix E.1 for the details of the training process.

4.3.1. Training the Group Convolution Layer

We used the ADAM optimizer with learning rate decay for the group convolution layer, while the other parameters were kept at their default values. Since the higher powers computed for the relevant matrices are very sensitive to numerical errors, we preferred smaller learning rates. Numerical values and details can be found in Appendix E.1.
Due to the high dimensionality of the search space of the relevant matrices, accurate identification of true symmetries is challenging. The special initialization and optimization procedures described below help to avoid becoming stuck at poor local minima.
Controlling the rank of the joint entropy. To achieve efficient and stable optimization by first capturing the data’s gross features and then refining them over time, we use a time-dependent rank parameter k for the entropy estimator (A5). To adjust k during training, we use a normalized notion of training time t n , measuring the “amount of gradient flow” via
$t_n = \frac{\sum_{s=1}^{n} \mathrm{lr}(s)}{\sum_{s=1}^{T} \mathrm{lr}(s)}$  (21)
where $\mathrm{lr}(s)$ is the learning rate used at the training step (batch) $s$, and $n$ and $T$ are the current training step and the total number of training steps, respectively. We control the rank $k$ of the low-rank entropy estimator by setting $k = \lceil d \times t_n \rceil$ so that, by the end of the training, the rank reaches $d$.
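A small sketch of this schedule (the exponentially decaying learning-rate function used below is an illustrative placeholder; the actual schedule is given in Appendix E.1):

```python
import math

def rank_schedule(step, total_steps, d, lr):
    """Time-dependent rank k: t_n is the fraction of total 'gradient flow' (sum of learning rates) so far."""
    t_n = sum(lr(s) for s in range(1, step + 1)) / sum(lr(s) for s in range(1, total_steps + 1))
    return math.ceil(d * t_n)

lr = lambda s: 1e-3 * 0.999 ** s                                        # placeholder decaying learning rate
print([rank_schedule(n, 2000, 33, lr) for n in (1, 500, 1000, 2000)])   # rank grows from 1 to d = 33
```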
Noise injection to the resolution filter. We initialize the resolving filter with zeros and add Gaussian noise during the early stages of training before computing the loss for each batch,
$\psi \leftarrow \psi + \mathcal{N}(\mu = 0, \sigma = 0.1)\, \exp(-t/\tau).$  (22)
The amplitude of the noise is set to decay exponentially with a short time constant (of $\tau = 10$ epochs). As mentioned previously, to compute the transformed data $\mathbf{y}$, we use a normalized version of $\psi$ at each step: $\psi \leftarrow \psi / \|\psi\|_2$.
Initialization of the symmetry generator. We initialize the matrix $A$ used to obtain the symmetry generator $\mathbf{G}$ via the exponential map $\exp\left(\frac{A - A^{T}}{2}\right)$ using a normal distribution with $\sigma = 10^{-3}$ and $\mu = 0$ for its entries. This leads to an approximate identity matrix for the initial generator $\mathbf{G}$. Using smaller standard deviations did not affect the performance; however, significantly larger $\sigma$ values occasionally lead to learning a higher power of the underlying symmetry generator instead of learning the minimal generator.
Padding. We use padding for the symmetry generator and the filter in the sense that the symmetry matrix and the filter have dimensionality that is higher than the dimensionality of the data, but we centrally crop the matrix and the filter before applying them to the data. This is done to deal with finite-size (edge) effects, and, after experimenting with padding sizes of 6 to 33, we saw that the results are not sensitive to the padding size. Working with a cyclic/periodic symmetry would make the padding unnecessary, but this would mean working with a restrictive assumption on the underlying symmetry.

4.3.2. Training Probability Density Estimators

We train the probability density estimators using the entropy-minimization loss term proposed by Pichler et al. [31]. See Appendix E for the learning rate used.

5. Results

5.1. Results on Synthetic and Real Data

For each dataset listed in Table A2, our model was trained to learn the group action generator G and the resolving filter ψ, which together comprise the group convolution matrix L that provides a symmetry-based representation of the data. The group generator G has a known correct answer in each case (up to a power of ±1; e.g., both right and left translations are minimal generators of the translation group), which, of course, was not given to the model in any way.
A visual description of the experimental results is provided in Figure 7, Figure 8, Figure 9 and Figure 10, with Figure 7, Figure 8 and Figure 9 containing the results on 33-dimensional synthetic datasets with various symmetries and Figure 10 containing the results on a 27-dimensional real dataset (see Appendix C.2 for details). In each case, we compare the learned generator to the appropriately signed generator of the group. The main points are as follows:
  • In each experiment, the learned symmetry generator G is indeed very close to the underlying correct generator used in preparing the dataset. See Figure 7a–Figure 10a.
  • The group convolution operator L formed by combining the learned symmetry generator G with the learned resolving filter ψ as in (11) is approximately equal to the underlying transformation used in generating the dataset (see Figure 7b–Figure 10b). As a result, the matrix L is highly successful in reconstructing the underlying hidden local signals, resulting in a symmetry-based representation (see Figure 11).
Thus, by having access only to raw samples such as those in Figure 4, the model has been able to learn symmetry generators of quite different sorts: pixel translations, pixel shuffles, and frequency shifts were all learned with high accuracy using exactly the same model setup and hyperparameters. The group convolution operator L relates the raw data to a representation where the locality and symmetry are manifest. In the case of frequency shifts, the relevant transformation that completes this is the discrete sine transform (DST). We find it highly satisfying that, simply by looking at data as in Figure 4, the model was able to learn that the DST matrix provides the relevant representation; see Figure 8b.
We emphasize that the same hyperparameters have been used in each case: the learning rates, low-rank entropy estimation scheduling, batch size, etc., were the same for all the experiments. In other words, no fine-tuning was necessary for different data types. The results are also stable in the sense that training the system from scratch results in similar outputs each time. In our various experiments with the chosen settings, only once did we encounter a convergence problem where the training became stuck in an unsatisfactory local minimum.
We repeated each one of the seven experiments four times and reported the cosine similarity in Table 1 (cosine similarity here is obtained by treating the generator matrix as a vector, or, equivalently, using the Hilbert–Schmidt (or Frobenius) inner product between the two matrices of the symmetry generator). The error histograms over the entries of the symmetry generator are provided in Figure 7a–Figure 9a.
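For reference, the cosine similarity between two generator matrices under the Frobenius (Hilbert–Schmidt) inner product can be computed as in the following sketch (the shift matrix used here as the ground truth is just an example):

```python
import numpy as np

def matrix_cosine_similarity(G_learned, G_true):
    """Cosine similarity of two matrices treated as vectors (Frobenius inner product)."""
    return np.sum(G_learned * G_true) / (np.linalg.norm(G_learned) * np.linalg.norm(G_true))

d = 33
G_true = np.eye(d, k=1)                                    # one-step shift matrix (translation generator)
G_learned = G_true + 0.01 * np.random.default_rng(0).normal(size=(d, d))
print(matrix_cosine_similarity(G_learned, G_true))         # close to 1
```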
Some notable properties of the method worth emphasizing are as follows:
Learning the minimal group generator. A dataset with a pixel translation symmetry also has a symmetry under two-pixel translations, three-pixel translations, etc. In fact, each one of these actions could be considered as a generator of a subgroup of the underlying group of symmetries. Some works in the literature have successfully learned symmetry generators from datasets but ended up learning one of many subgroup generators instead of the minimal generator. Our inclusion of locality together with stationarity in the form of alignment and resolution losses has allowed the method to learn the minimal generator in each case.
Learning a symmetry-based representation. In addition to the symmetry generator, our model learns a resolving filter that is local along the symmetry direction. In all our experiments, this resulted in the appropriate “delta function along the symmetry direction” described in Section 3. In the case of simple translation symmetry, the filter has a single nonzero component; in the case of the frequency shift, it is a pure sinusoid. The convolution operator L obtained by combining the generator and the filter provides a symmetry-based representation of the data, turning the hidden symmetry into a simple translation symmetry. Such a representation is exactly the type of situation where regular CNNs are effective. Thus, for a given supervised learning problem, one can first train our model to learn the underlying symmetry, transform the data into the symmetry-based representation, and feed the resulting form into a regular multilayer CNN architecture, making the model an adapter between raw data and CNNs. In Appendix B, we prove the relevant equivariance property of the learned representation.
Stability. Our datasets are 33-dimensional, and thus the symmetry generator G and the symmetry-based representation matrix L have shapes of 33 × 33 . These dimensions are higher than in many of the works that have a comparable aim to ours (i.e., unsupervised learning of symmetries from raw data [9,24,25]). We emphasize that we do not have any built-in sparsity or factorization enforced on the matrices. The model has to indeed perform the searches over these large spaces. Our loss function involving locality, stationarity, and information preservation has resulted in a highly robust system that reliably avoids local minima and finds the correct symmetries in each of the examples we implemented. We emphasize that none of our datasets are explicitly or cleanly symmetric under the relevant group actions (see Figure 4 for samples). The almost perfect recovery of the symmetry generator from such samples is rather striking. Examples of the robust optimization dynamics can be seen in Figure 12.

5.2. Comparison with Other Unsupervised Symmetry Learning Approaches

The GAN-based methods SymmetryGAN [9] and LieGAN [24], like our method, are set up to learn symmetry transformations starting with raw data in an unsupervised manner. To compare the performance of these methods with ours, we ran experiments with a seven-dimensional version of our translation-invariant Gaussian dataset described above. In order to comply with the setting of [9,24], this time, we set up the dataset so that the translation is circulant (periodic). Our method’s setup can also be extended to cover circulant symmetries, but we attempted the run without this modification and in this sense provided an advantage to the GAN-based methods.
In the end, the best cosine similarity obtained by SymmetryGAN was 0.527 and that obtained by LieGAN was 0.425 , whereas the cosine similarity for our model was 0.958 . See Figure 13 and Figure 14 for a visualization of the learned generators by each method, and see Appendix D for the details of the comparison.

6. Discussion

Our experiments show that the method described in this paper can uncover the symmetries (generators of group representations) of quite different sorts, starting with raw datasets that are only approximately local/symmetric under a given action of a one-dimensional Abelian Lie group. In addition to symmetry generators, a symmetry-based representation [4] is also learned. The choice of the loss function and the approach used in optimization result in highly stable results. This is the outcome of extensive experimentation with different approaches to the same intuitive ideas described in the Introduction. It is satisfying, and perhaps not surprising, that the current successful methodology is the simplest among the range of approaches we tried.
In this paper, we focused on the action of one-dimensional Abelian Lie groups. More generally, the symmetry group will be multidimensional. The case of translation symmetry in the plane has two generators that commute with each other, and a generalization of the learning problem to such a case will involve adapting the loss function to encode locality and symmetry under the joint actions of those two generators, taking into account possible non-commutation of the latter. More generally, one should consider non-Abelian Lie groups whose generators do not commute with each other.
The datasets used in our experiments were 33-dimensional and thus are not what one would call “low-dimensional” in the context of symmetry learning. However, many real datasets have much higher dimensionalities. Our experiments indicate that our method can work on datasets with three times the dimensionality used in this paper without difficulty; however, for dimensions that are orders of magnitude higher, one will likely require further computational, and possibly methodological, improvements. Such improvements are necessary for, say, typical image datasets.
In this paper, the group actions learned by the model are all linear; i.e., the method learns a representation of the underlying group of symmetries by learning a matrix for each generator. However, more generally, symmetry actions can be nonlinear. By replacing the linear layer that represents the symmetry action with a more general nonlinear architecture it may be possible to apply the philosophy of this paper to the nonlinear case as well, but, of course, the practical problems with optimization and the architectural choices for representing a nonlinear map will require work and experimentation.
In transforming the data using the group convolution map L, we made the choice of having the number of output dimensions equal to the number of input dimensions. This meant setting the number of powers used for the symmetry generator G equal to the number of input dimensions. While this is a reasonable choice and is suited to the datasets here, other choices could be appropriate for other settings. We believe it would be worthwhile to experiment with other choices here, possibly by using datasets with cyclic group symmetries whose order is different from the number of dimensions of the datasets they act on.
Finally, let us note that data often have locality directions that are not necessarily symmetry directions. For instance, consider a zeroth-order approximation to the distribution of (say daily) temperatures around Earth. This distribution will have locality under rotations around any axis passing through the center of Earth but will be approximately symmetric only under azimuthal rotations. In other words, temperature varies slowly as you move in any direction, but only similar latitudes will have similar temperature distributions; the poles will not be as hot as the equator. Our model seeks group generators that have both locality and symmetry properties for the dataset at hand, so it is not appropriate for such a dataset. A generalization that would also work for cases like this would be an interesting and worthwhile project.

Author Contributions

Conceptualization, O.E. and A.O.; methodology, O.E. and A.O.; software, O.E.; validation, O.E. and A.O.; formal analysis, O.E. and A.O.; investigation, O.E.; resources, A.O.; data curation, O.E.; writing—original draft preparation, O.E. and A.O.; writing—review and editing, A.O.; visualization, O.E.; supervision, A.O.; project administration, A.O.; funding acquisition, A.O. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Turkish Scientific and Technical Research Council (TUBITAK) under the BIDEB-2232 program with grant number 118C203.

Data Availability Statement

All data in this study are generated synthetically, and the complete code for data generation, models, and optimization code, as well as the reproducible experiments, can be found in our GitHub repository at https://github.com/onurefe/SymmetryLens.git (version 1.0, accessed on 12 March 2025).

Acknowledgments

We would like to thank the current and former members of the EarthML research group for fostering a collegial and dynamic research environment, Ayse Ruya Efe for her assistance in figure preparation, and Nurgul Ergin for administrative support.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Appendix A. Estimation Procedures

Appendix A.1. Probability Estimation

For estimating probability densities, we use a modified version of the mixture-of-Gaussians approach described in [31], in which the estimators are trained by minimizing the entropy of the estimated probability densities. Since we need a large number, O(d²), of joint distributions, a straightforward application of the approach in [31] is computationally prohibitive, and we developed an O(d) approach, described below.
In the stationarity/uniformity loss of Section 4.2.1, we need estimates of the joint probability densities p_{ij} for pairs (y_i, y_j) of components of the transformed samples y. We compute these using a decomposition in terms of the marginal densities p_i and the conditional densities p_{i|j}: p_{ij}(y_i, y_j) = p_{i|j}(y_i | y_j) p_j(y_j). We describe the estimation of these two pieces below.

Appendix A.1.1. Marginal Probability Estimation

We use a Gaussian mixture model to estimate the marginal densities. Denoting the probability density function of a normal distribution with mean μ and standard deviation σ by K(u; μ, σ), we write the marginal distribution of variable y_i as a mixture of M Gaussians,
$$ \hat{p}_i(u) = \frac{1}{M}\sum_{m=1}^{M} w_{mi}\, K(u;\, \mu_{mi}, \sigma_{mi}) \qquad \text{(A1)} $$
where w_{mi} is the weight of Gaussian kernel m for variable y_i, normalized so that Σ_m w_{mi} = 1. We choose M = 4 and, as in [31], find the parameters w_{mi}, μ_{mi}, σ_{mi} by gradient descent on a loss function consisting of the estimated entropy ĥ_i = −E[log p̂_i(y_i)], where the expectation is computed as an average over each batch.
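A minimal sketch of this estimator and its entropy loss (an illustration only; in the actual training, the parameters are trainable variables updated by gradient descent):

```python
import numpy as np

def gaussian_pdf(u, mu, sigma):
    return np.exp(-0.5 * ((u - mu) / sigma) ** 2) / (np.sqrt(2.0 * np.pi) * sigma)

def mixture_density(u, w, mu, sigma):
    """Marginal density estimate of Equation (A1); w, mu, sigma each have shape (M,)."""
    M = len(w)
    return sum(w[m] * gaussian_pdf(u, mu[m], sigma[m]) for m in range(M)) / M

def marginal_entropy_loss(batch_yi, w, mu, sigma):
    """Estimated marginal entropy, averaged over a batch; minimized w.r.t. (w, mu, sigma)."""
    return -np.mean(np.log(mixture_density(batch_yi, w, mu, sigma) + 1e-12))
```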

Appendix A.1.2. Conditional Probability Estimation

Overview. To estimate the conditional probabilities p_{j|i}(u | y_i), we follow [31] and generalize the mixture model approach in (A1) to the parametrized form
$$ \hat{p}_{j|i}(u \mid y_i) = \frac{1}{M}\sum_{m=1}^{M} w_{mji}(y_i)\, K\big(u;\, \mu_{mji}(y_i), \sigma_{mji}(y_i)\big) \qquad \text{(A2)} $$
where the parameters w_{mji}(y_i), μ_{mji}(y_i), σ_{mji}(y_i) are now functions of the conditioning variable y_i and are to be learned. One could, in principle, train separate neural networks for each of these parameters, with one entropy loss −E[log p̂_{j|i}(y_j | y_i)] per pair; however, in our case, we need O(d²) such conditional estimators, and this straightforward approach becomes computationally prohibitive. For this reason, we propose a modification based on a “single-input multiple-output” approach, described next.
Instead of training a separate set of networks for each choice of conditioning coordinate y_i, we train a single network that takes in the whole vector y but masks all of its components except the relevant one. In other words, the estimated conditional probability for a given conditioning variable i is provided as
$$ \hat{p}_{j|i}(u \mid \mathbf{y}) = \frac{1}{M}\sum_{m=1}^{M} w_{mj}\big(e_i e_i^{T}\mathbf{y}\big)\, K\big(u;\, \mu_{mj}(e_i e_i^{T}\mathbf{y}),\, \sigma_{mj}(e_i e_i^{T}\mathbf{y})\big) \qquad \text{(A3)} $$
where e_i denotes the standard basis column vector, with a 1 in the ith entry and zeros everywhere else, e_i^T = (0, 0, …, 1, 0, …, 0), so that e_i e_i^T y is the masked input vector. Notice that the parameters w_{mj}, μ_{mj}, σ_{mj} no longer carry an index i labeling the conditioning coordinate; the network is forced to “notice” the relevant coordinate via the masking procedure and training. Once again, the loss function consists of the terms ĥ_{j|i} = −E[log p̂_{j|i}(y_j | y_i)], with the expectation approximated by an average over each batch for each pair (i, j). During training, we let each sample in a batch contribute to the estimates for a single conditioning variable i, with i looping over all coordinates as the sample index n increases, i.e., i ≡ n (mod d).
Architecture. We train three neural networks in total, one each for w, μ, and σ. For each sample, each network outputs M vectors (one per Gaussian kernel) with one component for each output coordinate j of p̂_{j|i}, in the form of a flattened array with d × M components. The three networks are identical except for their output activation functions.
Each network has d input neurons and d × M output neurons; the input represents the value of the conditioning component via the masking procedure described above, and the output corresponds to the flattened tensor of Gaussian parameters. We use a single hidden fully connected layer with 4 × d × M neurons.
We take the number of Gaussian kernels to be M = 4. We use the LeakyReLU activation function with α = 0.1 at the output of each layer except the last one. The final output activation functions for the three neural networks are given in Table A1, and a sketch of the architecture is given after the table.
Table A1. Activation functions for the conditional probability estimator.

| Estimated Quantity | Output Activation Function |
| --- | --- |
| Kernel weights | Softmax over the kernel axis |
| Mean value of each kernel | None (linear) |
| Variances of each kernel | Scaled hyperbolic tangent followed by exponentiation |
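A sketch of one such parameter network in Keras (our reading of the description above; the scaling constant inside the variance activation is an assumption, since its exact value is not specified here):

```python
import numpy as np
import tensorflow as tf

d, M = 33, 4    # data dimensionality and number of Gaussian kernels

def make_param_network(output_activation):
    """One of the three parameter networks (kernel weights, means, or variances).
    Input: the masked d-dimensional vector; output: a (d, M) array of parameters."""
    return tf.keras.Sequential([
        tf.keras.Input(shape=(d,)),
        tf.keras.layers.Dense(4 * d * M),
        tf.keras.layers.LeakyReLU(alpha=0.1),
        tf.keras.layers.Dense(d * M),
        tf.keras.layers.Reshape((d, M)),
        output_activation,
    ])

weight_net = make_param_network(tf.keras.layers.Softmax(axis=-1))              # softmax over kernels
mean_net   = make_param_network(tf.keras.layers.Activation("linear"))          # no activation
sigma_net  = make_param_network(
    tf.keras.layers.Lambda(lambda x: tf.exp(3.0 * tf.tanh(x))))                # scaled tanh + exp (scale assumed)

# Masking: only the conditioning coordinate i is passed through; the rest are zeroed.
y_batch = np.random.randn(8, d).astype("float32")
i = 5
mask = np.zeros(d, dtype="float32"); mask[i] = 1.0
kernel_weights = weight_net(y_batch * mask)     # shape (8, d, M)
```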

Appendix A.2. Multidimensional Entropy Estimation

To estimate the multidimensional differential entropy of the transformed data y, we use a multivariate Gaussian approximation. For a multivariate Gaussian with covariance matrix C whose eigenvalues are λ_i, the total entropy h(y) is given, up to constant factors and shifts, by h(y) = Σ_i h_i with h_i = log(λ_i).
For each batch, we compute the sample covariance matrix c_{ij} = cov(y_i, y_j) and sort its eigenvalues in descending order. We define a rank-k approximation to the entropy as a weighted average of the per-component contributions,
$$ \bar{h}_k = \frac{\sum_{i=1}^{d} w_{ki}\, \log(\lambda_i)}{\sum_{i=1}^{d} w_{ki}} \qquad \text{(A4)} $$
where the weights provide a soft threshold at i = k,
$$ w_{ki} = \frac{1}{e^{\alpha(i-k)} + 1} \qquad \text{(A5)} $$
with α a hyperparameter determining how smoothly the relative weights transition from 1 to 0 as i crosses k. Our experiments gave consistent results for various values of α > 1 (we chose α = 3.3).
Due to the normalization, (A4) should be thought of as a per-rank version of the low-rank entropy. In particular, when combining h ¯ k with the marginal entropies of y i during the computation of the total correlation loss of Section 4.2.2, it is more appropriate to combine the former with the average marginal entropy rather than the total marginal entropy.
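A sketch of this estimator in NumPy (an illustration only; a small constant is added inside the logarithm for numerical safety):

```python
import numpy as np

def low_rank_entropy(y_batch, k, alpha=3.3):
    """Per-rank low-rank entropy estimate of Equation (A4) for a batch of shape (batch, d)."""
    d = y_batch.shape[1]
    C = np.cov(y_batch, rowvar=False)
    lam = np.sort(np.linalg.eigvalsh(C))[::-1]      # eigenvalues in descending order
    i = np.arange(1, d + 1)
    w = 1.0 / (np.exp(alpha * (i - k)) + 1.0)       # soft threshold at i = k
    return np.sum(w * np.log(lam + 1e-12)) / np.sum(w)
```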

Appendix A.3. KL Divergence Estimation

Appendix A.3.1. KL Divergence of Marginal Probabilities

Given the marginal probability density estimates p̂_i and p̂_k for components Y_i and Y_k, we approximate the KL divergence D_KL(p_i ∥ p_k) by
$$ D_{\mathrm{KL}}(p_i \,\|\, p_k) \approx \mathbb{E}\!\left[\log \frac{\hat{p}_i(y_i)}{\hat{p}_k(y_i)}\right] $$
where the expectation E is computed as an average over the samples in a batch during optimization.
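A sketch of this batch estimate (an illustration only; the conditional case of Appendix A.3.2 is handled analogously, with both estimates evaluated at the same (y_i, y_j) pairs):

```python
import numpy as np

def kl_marginal(batch_yi, p_hat_i, p_hat_k, eps=1e-12):
    """Batch estimate of D_KL(p_i || p_k): average of log p_i(y_i) - log p_k(y_i)
    over samples y_i of the i-th component."""
    return np.mean(np.log(p_hat_i(batch_yi) + eps) - np.log(p_hat_k(batch_yi) + eps))

# Usage with the mixture_density estimator sketched in Appendix A.1.1 (parameters assumed given):
# kl_ik = kl_marginal(y[:, i],
#                     lambda u: mixture_density(u, w_i, mu_i, s_i),
#                     lambda u: mixture_density(u, w_k, mu_k, s_k))
```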

Appendix A.3.2. KL Divergence of Conditional Probabilities

Given the conditional probability density estimates p̂_{i|j} and p̂_{k|l} for components Y_i and Y_k conditioned on y_j and y_l, respectively, we approximate the KL divergence D_KL(p_{i|j} ∥ p_{k|l}) by
$$ D_{\mathrm{KL}}(p_{i|j} \,\|\, p_{k|l}) \approx \mathbb{E}\!\left[\log \frac{\hat{p}_{i|j}(y_i \mid y_j)}{\hat{p}_{k|l}(y_i \mid y_j)}\right] $$
where the expectation E is computed as an average over the samples in a batch during optimization.

Appendix B. Equivariance of the Group Convolution Layer

The equivariance properties of group convolutions are well known [1]. In this paper, we consider group actions (specifically representations) on the sample space that do not necessarily form a regular representation. In other words, the group action does not come from an action on the index set labeling the components. Here, we show that the equivariance property holds in our setting as well. The components of the symmetry-based representation y are given as
$$ y_p = \sum_{i,j} \psi_i\, \big(\rho^{T}(p)\big)_{ij}\, x_j $$
where x ∈ ℝ^d is the input signal and (ρ(p))_{ij} denotes the entries of the group representation matrix for the group element with index p (an integer index in the discrete version of a 1-dimensional Lie group action). The action of a group element s transforms the input as x_j ↦ Σ_k (ρ(s))_{jk} x_k. The resulting action ρ_y(s) on the symmetry-based representation y is
$$ \big(\rho_y(s)\cdot y\big)_p = \sum_{i,j,k} \psi_i\, \big(\rho^{T}(p)\big)_{ij}\, \big(\rho(s)\big)_{jk}\, x_k = \sum_{i,k} \psi_i\, \big(\rho^{T}(p-s)\big)_{ik}\, x_k = y_{p-s}, $$
that is,
$$ \rho_y(s)\cdot y = T_Y^{\,s}\cdot y, $$
where T_Y is the one-component translation operator acting on the symmetry-based representation. Thus, the proposed group convolution layer results in a group-equivariant representation for the symmetry group action, with the actions of the group being turned into simple component translations in the symmetry-based representation. This property makes the output of the model suitable for use with the machinery of regular (translation-based) CNNs.
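A small numerical check of this equivariance property for a toy cyclic representation (an illustration only; the learned representations in the paper need not be cyclic, so the wrap-around here is purely for convenience):

```python
import numpy as np

d = 8
rng = np.random.default_rng(0)

# Cyclic "translation" representation: rho(s) = S^s, with S the one-step circular shift
S = np.roll(np.eye(d), 1, axis=0)
rho = lambda s: np.linalg.matrix_power(S, s % d)

psi = rng.normal(size=d)                               # some resolving filter
L = np.stack([rho(p) @ psi for p in range(d)])         # row p gives y_p = sum_ij psi_i (rho^T(p))_ij x_j

x = rng.normal(size=d)
s = 3
y = L @ x                                              # symmetry-based representation of x
y_shifted = L @ (rho(s) @ x)                           # representation of the transformed input

print(np.allclose(y_shifted, np.roll(y, s)))           # True: acting with rho(s) translates y by s slots
```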

Appendix C. Datasets

We trained our model on a variety of synthetic and real datasets whose properties are given in Table A2.
Table A2. Real and synthetic datasets prepared for experiments. The parameters for each basis signal are sampled from a uniform distribution with the indicated ranges.

| ID | Signal Type | Invariance | Dimensionality | Parameter Ranges | Size (Samples) |
| --- | --- | --- | --- | --- | --- |
| 1 | Gaussian | Circulant translation | 7 | Scale: [0.2, 1.0), Amplitude: [0.5, 1.5) | 63 K |
| 2 | Gaussian | Translation | 7 | Scale: [0.2, 1.0), Amplitude: [0.5, 1.5) | 252 K |
| 3 | MNIST slices | Translation | 27 | — | 1.08 M |
| 4 | Gaussian | Translation | 33 | Scale: [0.5, 5.0), Amplitude: [0.5, 1.5) | 1.65 M |
| 5 | Legendre (l = 2–3, m = 1) | Translation | 33 | Scale: [6.0, 15.0), Amplitude: [0.5, 1.5) | 1.65 M |
| 6 | Gaussian | Permuted translation | 33 | Scale: [0.5, 5.0), Amplitude: [0.5, 1.5) | 1.65 M |
| 7 | Legendre (l = 2–3, m = 1) | Permuted translation | 33 | Scale: [6.0, 15.0), Amplitude: [0.5, 1.5) | 1.65 M |
| 8 | Gaussian | Frequency shift | 33 | Scale: [0.5, 5.0), Amplitude: [0.5, 1.5) | 1.65 M |
| 9 | Legendre (l = 2–3, m = 1) | Frequency shift | 33 | Scale: [6.0, 15.0), Amplitude: [0.5, 1.5) | 1.65 M |

Appendix C.1. Details of Synthetic Data

We start by creating a dataset that is local/symmetric under component translations and then apply an appropriate transformation to obtain datasets that are local and symmetric under a given unitary representation. The detailed procedure is as follows:
  • We select a parametrized family of local basis signals, such as the family of Gaussians parametrized by amplitude, center, and width.
  • We use a binomial distribution (probability p = 0.5, n = 5 trials) to determine the number of local signals to include in each sample.
  • We sample the parameters of the basis signal (e.g., center, width, and amplitude) from uniform distributions over finite ranges to obtain each local signal.
  • We add up the local pieces to obtain a single sample that has information locality under component translations (we call this the raw sample).
  • We apply an appropriate unitary transformation to the raw sample to obtain locality/symmetry under the desired group action.
  • Finally, we add Gaussian noise (with σ = 0.05) to the sample.
This procedure can provide any symmetry that is related to component translations via a similarity transformation. See Figure 4 for samples from datasets with different kinds of symmetries; a minimal sketch of the generation procedure is given below.
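A minimal sketch of this generation procedure for Gaussian basis signals (not from the released code; parameter ranges follow Table A2, and the optional unitary matrix encodes the hidden symmetry):

```python
import numpy as np

def gaussian_basis(z, amplitude, center, width):
    return amplitude * np.exp(-0.5 * ((z - center) / width) ** 2)

def make_sample(d=33, unitary=None, rng=np.random.default_rng(0)):
    """Generate one sample: a random number of local Gaussian signals, an optional hidden
    unitary transformation, and additive Gaussian noise."""
    z = np.arange(d) - d // 2
    sample = np.zeros(d)
    for _ in range(rng.binomial(n=5, p=0.5)):            # number of local signals in this sample
        A = rng.uniform(0.5, 1.5)                        # amplitude
        mu = rng.uniform(-1.5 * d, 1.5 * d)              # center, drawn from the tripled range
        sigma = rng.uniform(0.5, 5.0)                    # width/scale (33-dimensional case)
        sample += gaussian_basis(z, A, mu, sigma)
    if unitary is not None:
        sample = unitary @ sample                        # hide the translation symmetry
    return sample + rng.normal(0.0, 0.05, size=d)        # additive noise with sigma = 0.05
```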

Basis Signal Types

Gaussian signals. Gaussian basis signals f_gaussian(z; A, μ, σ) are parametrized by an amplitude A, center μ, and width σ. The input z is an integer ranging from −d/2 to d/2, labeling the components of the raw sample vectors. We sample the center μ uniformly from the extended (tripled) range −3d/2 to 3d/2 and then crop the resulting signals to the z range from −d/2 to d/2, to allow for signals that contain only the tail of a Gaussian.
Legendre signals. These signals are given in terms of the associated Legendre polynomials and provide localized waveforms that can change sign. The relevant parameters are the center c, scale s, amplitude A, and orders l, m: f_legendre^(l,m)(z; A, c, s) = A P_l^m(cos((z − c)/s)). We crop these signals to the range |z − c|/s ≤ π, i.e., we set the values outside this range to zero.
Once again, z is the discrete component index, ranging from −d/2 to d/2. For the l, m parameters, we use l = 2, m = 1 and l = 3, m = 1, with equal probability for each sample. We sample the centers as in the Gaussian case.
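A sketch of the Legendre basis signal using SciPy's associated Legendre function (an illustration only; the parameter values are arbitrary examples within the ranges of Table A2):

```python
import numpy as np
from scipy.special import lpmv    # associated Legendre function P_l^m(x)

def legendre_basis(z, amplitude, center, scale, l, m=1):
    """Localized basis signal A * P_l^m(cos((z - c)/s)), cropped to |z - c|/s <= pi."""
    arg = (z - center) / scale
    signal = amplitude * lpmv(m, l, np.cos(arg))
    return np.where(np.abs(arg) <= np.pi, signal, 0.0)

d = 33
z = np.arange(d) - d // 2
sig = legendre_basis(z, amplitude=1.0, center=0.0, scale=8.0, l=2)   # an l = 2, m = 1 waveform
```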

Appendix C.2. Details of Real Data

Symmetry–locality coupling is naturally found in image datasets as well as time-series datasets. To demonstrate the generality of the proposed method, we use a cropped version of the MNIST dataset, which consists of 28 × 28 grayscale images of handwritten digits. In order to enable uniform sampling, we zero-pad the MNIST dataset from the left and the right with 28 pixels, leading to images with dimensions of 84 × 28 . Afterwards, we randomly sample 27 × 1 -dimensional patches from the images, the coordinates of the patches being sampled uniformly from the image area. Finally, we drop the blank crops (crops with all pixel values equal to zero). This way, we obtain a dataset with 1.08 M samples, which we use to train our model.
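A sketch of this slicing procedure (not from the released code; we read the 27 × 1 patches as horizontal runs of 27 pixels, and we generate far fewer samples here than the 1.08 M used in the paper):

```python
import numpy as np
import tensorflow as tf

(x_train, _), _ = tf.keras.datasets.mnist.load_data()          # 28 x 28 grayscale digits
x_train = x_train.astype("float32") / 255.0

padded = np.pad(x_train, ((0, 0), (0, 0), (28, 28)))            # zero-pad left/right: 28 x 84 images
rng = np.random.default_rng(0)

patches = []
while len(patches) < 10_000:
    img = padded[rng.integers(len(padded))]
    r = rng.integers(img.shape[0])                              # row of the slice
    c = rng.integers(img.shape[1] - 27 + 1)                     # starting column of the 27-pixel run
    patch = img[r, c:c + 27]
    if patch.any():                                             # drop blank crops
        patches.append(patch)
patches = np.stack(patches)                                     # shape (10000, 27)
```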

Appendix D. Details of the Comparison with GAN-Based Methods

While our method is set up to learn the minimal generator of the symmetry group, LieGAN [24] aims to learn Lie algebra elements of the underlying Lie group, and SymmetryGAN [9] aims to find some group element. This difference makes it difficult to set up a numerical comparison of our method with the GAN-based methods. We approach this problem as follows. The minimal group generator for an underlying 1-dimensional symmetry will be some power of the exponential of the corresponding Lie algebra element. If LieGAN indeed learns a correct Lie algebra element, then an appropriate power of its exponential should provide the minimal generator exactly. Thus, we find the multiple of the candidate Lie algebra element learned by LieGAN whose exponential provides the best approximation (in the sense of cosine similarity) to the true symmetry generator and use that cosine similarity as the score for LieGAN. In the case of SymmetryGAN, we find the true group element that is closest to the candidate symmetry transformation learned by SymmetryGAN, compute the cosine similarity between this true symmetry and the candidate, and then perform the analogous comparison for the corresponding symmetry element learned by our model.
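For the LieGAN score, the search for the best multiple of the candidate Lie algebra element can be sketched as follows (an illustration only; the grid search and its range are our own arbitrary choices):

```python
import numpy as np
from scipy.linalg import expm

def cosine(A, B):
    return np.sum(A * B) / (np.linalg.norm(A) * np.linalg.norm(B))

def best_exponential_match(lie_candidate, G_true, t_grid=np.linspace(-5.0, 5.0, 2001)):
    """Find the multiple t of the candidate Lie algebra element whose exponential
    best matches the true minimal generator in cosine similarity."""
    scores = [cosine(expm(t * lie_candidate), G_true) for t in t_grid]
    best = int(np.argmax(scores))
    return t_grid[best], scores[best]
```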
The SymmetryGAN method [9] is based on a parametrization of the group elements, but the authors did not formulate a parametrization for datasets with more than four dimensions, noting instead the availability of factorization-based parametrizations. Based on this, we used a QR-decomposition-based parametrization to implement the symmetry generator in SymmetryGAN. Since the choice of parametrization can affect the performance of the method, we refer to this extended version as “SymmetryGAN-QR”. We experimented with LieGAN [24] without any modification.
We conducted the experiments on a synthetic dataset containing 2.56 M samples, matching the size of the largest dataset used in the experiments reported in [24], and used the same hyperparameter settings as in [9,24]. We trained both methods for 100 epochs and took the best-scoring epoch for the comparison with our method (thus, in a sense, giving an unfair advantage to both methods, since the scores are computed using the ground truth).
We trained our method on a much smaller sample from the same dataset, containing 63 K examples; by using 4000 epochs, we approximately equalized the total number of examples seen by the two approaches.
We observe that the performance of the GAN-based methods was poorer than the results reported in the corresponding papers. The experiments reported in those papers were based on dataset dimensionalities of up to 4. The number of parameters to learn increases quadratically with the dimension (d(d − 1)/2 parameters are needed in d dimensions), so a 7-dimensional symmetry-learning task is far more challenging than a 4-dimensional one, which we suspect is the reason for the reduction in performance. This also puts into context the success of our method on the 33-dimensional datasets reported above.

Appendix E. Hyperparameters and Loss Terms

To choose our hyperparameter settings, we first ran a few experiments to pick a parameter combination for the 33-dimensional “Legendre” dataset with translation symmetry (see Appendix C for dataset details) using a coarse tuning procedure. We then tried this same parameter combination with the other datasets and found that it provided similarly satisfactory results, making further dataset-specific parameter tuning unnecessary. We then performed a sensitivity analysis (see Appendix E.2) to confirm that the performance of the system is indeed relatively insensitive to the hyperparameters around our settings. Finally, we performed an ablation study to demonstrate that all the terms in the loss function indeed contribute to the success of the method.

Appendix E.1. Initial Choice of Hyperparameters

Apart from the learning rates, the ADAM optimizer was used with its default hyperparameters (β₁ = 0.9, β₂ = 0.999, ε = 10⁻⁷) for both the model and the estimators. Our choice of hyperparameters can be found in Table A3.
Table A3. Hyperparameters for our method.

| Estimator Learning Rate (×10⁻³) | Model Learning Rate (×10⁻³) | Total Learning Rate Decay | Alignment Coefficient | Uniformity Coefficient | Resolution Coefficient | Information Preservation Coefficient |
| --- | --- | --- | --- | --- | --- | --- |
| 2.5 | 0.1 | 0.1 | 1.0 | 2.0 | 1.0 | 2.0 |
With these settings, experiments were run on NVIDIA RTX 4090 GPUs using the TensorFlow library for the implementation. Training took approximately 2.5 h for the 7-dimensional datasets, 16.5 h for the 27-dimensional sliced MNIST dataset, and 28 h for the 33-dimensional datasets.

Appendix E.2. Elementary Effect Sensitivity Analysis

To investigate how the various hyperparameters affect the performance of the method, we performed a sensitivity analysis. We conducted 36 experiments whose parameters were sampled using SALib, the Sensitivity Analysis Library in Python (version 3.10) [32,33]. For each parameter, we used a sampling interval of ±25%. The Morris method [34] was applied to evaluate the influence of the various parameters on the cosine similarity between the learned symmetry generator and the ideal minimal generator. The method is based on the notion of an elementary effect d_i(X_1, X_2, …, X_n), defined as
$$ d_i(X_1, \ldots, X_n) = \frac{f(X_1, \ldots, X_i + \Delta, \ldots, X_n) - f(X_1, \ldots, X_i, \ldots, X_n)}{\Delta} $$
for a function f(X_1, X_2, …, X_n), where {X_j}_{j=1}^n ⊂ ℝ are the parameter values and Δ is a step size. The key idea is to measure the changes of the function f along various trajectories and to compute quantities measuring the overall effect of each parameter on the function values. In our case, the function f is the aforementioned cosine similarity. The important metrics are
  • μ_i: Mean effect, E[d_i];
  • μ_i*: Mean absolute effect, E[|d_i|];
  • σ_i: Standard deviation, √(E[(d_i − μ_i)²]);
where E[·] denotes the average over the trajectories; a minimal SALib sketch of this analysis is given below.
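A minimal sketch of this analysis with SALib (an illustration only; the central hyperparameter values and the dummy objective are placeholders, and in practice each evaluation trains the model and returns the cosine similarity):

```python
import numpy as np
from SALib.sample import morris as morris_sample
from SALib.analyze import morris as morris_analyze

names = ["estimator_lr", "model_lr", "lr_decay", "alignment",
         "uniformity", "resolution", "info_preservation", "noise"]
nominal = np.array([2.5e-3, 1e-4, 0.1, 1.0, 2.0, 1.0, 2.0, 0.1])     # central values (illustrative)
problem = {"num_vars": len(names), "names": names,
           "bounds": [[0.75 * v, 1.25 * v] for v in nominal]}         # +/- 25% sampling interval

rng = np.random.default_rng(0)
def run_experiment(params):
    """Placeholder objective; in practice: train the model and return the cosine similarity."""
    return float(rng.uniform(0.8, 1.0))

X = morris_sample.sample(problem, N=4, num_levels=4)    # 4 trajectories x (8 + 1) points = 36 runs
Y = np.array([run_experiment(x) for x in X])
Si = morris_analyze.analyze(problem, X, Y, num_levels=4)
print(Si["mu"], Si["mu_star"], Si["sigma"])
```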
Table A4. Elementary effect analysis results. Terms are sorted from top to bottom based on their significance.

| Parameter | μ | μ* [95% CI] | σ |
| --- | --- | --- | --- |
| Estimator learning rate | 0.08981 | 0.09123 [0.08902] | 0.10771 |
| Model learning rate | −0.03309 | 0.04060 [0.04458] | 0.06588 |
| Total learning rate decay | −0.03529 | 0.03682 [0.05176] | 0.06890 |
| Alignment coefficient | 0.02523 | 0.05384 [0.05455] | 0.08922 |
| Uniformity coefficient | 0.05345 | 0.05619 [0.09583] | 0.10371 |
| Resolution coefficient | −0.04455 | 0.04938 [0.06613] | 0.08628 |
| Information preservation coefficient | 0.07904 | 0.07909 [0.09634] | 0.11925 |
| Noise | −0.00298 | 0.01124 [0.00578] | 0.01424 |
In Table A5, we show the trajectories used in our experiments, and, in Table A4, we show the effects of the various hyperparameters on the model performance. Looking at the cosine similarities and the effect values, we see that, within the range of values used, the results are quite stable. This explains why we were able to obtain similar performance on all of our different datasets while using the same hyperparameter settings in all experiments, without any per-dataset tuning.
Table A5. Experiment configurations for elementary effect analysis. During sensitivity experiments, we used 7-dimensional translation-invariant Gaussian datasets composed of 252 K samples. Learning rates are in units of 10⁻³.

| # | Estimator Learning Rate | Model Learning Rate | Total Learning Rate Decay | Alignment | Uniformity | Resolution | Information Preservation | Noise | Cosine Similarity |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 2.292 | 0.075 | 0.167 | 1.083 | 2.167 | 1.250 | 0.917 | 0.133 | 0.816 |
| 2 | 3.125 | 0.075 | 0.167 | 1.083 | 2.167 | 1.250 | 0.917 | 0.133 | 0.955 |
| 3 | 3.125 | 0.075 | 0.167 | 1.083 | 1.500 | 1.250 | 0.917 | 0.133 | 0.815 |
| 4 | 3.125 | 0.075 | 0.167 | 1.083 | 1.500 | 1.250 | 0.917 | 0.000 | 0.827 |
| 5 | 3.125 | 0.075 | 0.167 | 1.083 | 1.500 | 0.917 | 0.917 | 0.000 | 0.943 |
| 6 | 3.125 | 0.108 | 0.167 | 1.083 | 1.500 | 0.917 | 0.917 | 0.000 | 0.856 |
| 7 | 3.125 | 0.108 | 0.100 | 1.083 | 1.500 | 0.917 | 0.917 | 0.000 | 0.948 |
| 8 | 3.125 | 0.108 | 0.100 | 0.750 | 1.500 | 0.917 | 0.917 | 0.000 | 0.843 |
| 9 | 3.125 | 0.108 | 0.100 | 0.750 | 1.500 | 0.917 | 1.250 | 0.000 | 0.999 |
| 10 | 1.875 | 0.092 | 0.167 | 1.083 | 2.500 | 1.083 | 1.083 | 0.067 | 0.885 |
| 11 | 2.708 | 0.092 | 0.167 | 1.083 | 2.500 | 1.083 | 1.083 | 0.067 | 0.988 |
| 12 | 2.708 | 0.092 | 0.167 | 1.083 | 2.500 | 0.750 | 1.083 | 0.067 | 0.982 |
| 13 | 2.708 | 0.092 | 0.167 | 1.083 | 1.833 | 0.750 | 1.083 | 0.067 | 0.976 |
| 14 | 2.708 | 0.092 | 0.167 | 0.750 | 1.833 | 0.750 | 1.083 | 0.067 | 0.997 |
| 15 | 2.708 | 0.092 | 0.167 | 0.750 | 1.833 | 0.750 | 0.750 | 0.067 | 0.997 |
| 16 | 2.708 | 0.125 | 0.167 | 0.750 | 1.833 | 0.750 | 0.750 | 0.067 | 0.986 |
| 17 | 2.708 | 0.125 | 0.167 | 0.750 | 1.833 | 0.750 | 0.750 | 0.200 | 0.997 |
| 18 | 2.708 | 0.125 | 0.100 | 0.750 | 1.833 | 0.750 | 0.750 | 0.200 | 0.997 |
| 19 | 3.125 | 0.125 | 0.100 | 0.750 | 2.500 | 1.083 | 1.083 | 0.000 | 0.990 |
| 20 | 3.125 | 0.125 | 0.100 | 0.750 | 2.500 | 0.750 | 1.083 | 0.000 | 0.999 |
| 21 | 3.125 | 0.092 | 0.100 | 0.750 | 2.500 | 0.750 | 1.083 | 0.000 | 0.992 |
| 22 | 3.125 | 0.092 | 0.100 | 1.083 | 2.500 | 0.750 | 1.083 | 0.000 | 0.988 |
| 23 | 3.125 | 0.092 | 0.100 | 1.083 | 1.833 | 0.750 | 1.083 | 0.000 | 0.987 |
| 24 | 2.292 | 0.092 | 0.100 | 1.083 | 1.833 | 0.750 | 1.083 | 0.000 | 0.987 |
| 25 | 2.292 | 0.092 | 0.167 | 1.083 | 1.833 | 0.750 | 1.083 | 0.000 | 0.997 |
| 26 | 2.292 | 0.092 | 0.167 | 1.083 | 1.833 | 0.750 | 1.083 | 0.133 | 0.994 |
| 27 | 2.292 | 0.092 | 0.167 | 1.083 | 1.833 | 0.750 | 0.750 | 0.133 | 0.993 |
| 28 | 3.125 | 0.108 | 0.133 | 0.917 | 2.167 | 1.083 | 1.083 | 0.133 | 0.997 |
| 29 | 3.125 | 0.108 | 0.200 | 0.917 | 2.167 | 1.083 | 1.083 | 0.133 | 0.994 |
| 30 | 3.125 | 0.108 | 0.200 | 0.917 | 2.167 | 0.750 | 1.083 | 0.133 | 0.993 |
| 31 | 3.125 | 0.075 | 0.200 | 0.917 | 2.167 | 0.750 | 1.083 | 0.133 | 0.991 |
| 32 | 3.125 | 0.075 | 0.200 | 0.917 | 1.500 | 0.750 | 1.083 | 0.133 | 0.994 |
| 33 | 2.292 | 0.075 | 0.200 | 0.917 | 1.500 | 0.750 | 1.083 | 0.133 | 0.995 |
| 34 | 2.292 | 0.075 | 0.200 | 1.250 | 1.500 | 0.750 | 1.083 | 0.133 | 0.982 |
| 35 | 2.292 | 0.075 | 0.200 | 1.250 | 1.500 | 0.750 | 1.083 | 0.000 | 0.987 |
| 36 | 2.292 | 0.075 | 0.200 | 1.250 | 1.500 | 0.750 | 0.750 | 0.000 | 0.818 |

Appendix E.3. Ablation Study

To demonstrate that each piece of our loss function indeed contributes to the success of the method, we conducted ablation experiments in which we dropped each piece in turn and retrained. For these experiments, we use the MNIST dataset (see Appendix C.2). See Figure A1, Figure A2, Figure A3 and Figure A4 for the results. It is clear that all components of the loss function indeed contribute to the performance, working in a complementary manner.
Figure A1. Results of the ablation experiment where the alignment term is excluded from the loss function. The system is still able to recover the symmetry generator to some extent, but the performance is poor; the cosine similarity is reduced to 0.770 from 0.999, which was the result we had with the full loss function, reported in Table 1. The lack of the alignment term results in a group convolution matrix that does not quite respect locality. (a) Ideal and learned symmetry generators (top) and error distributions (bottom). (b) Ideal and learned group convolution matrices (top) and error distributions (bottom).
Figure A2. Results of the ablation experiment where the resolution term is excluded from the loss function. As in Figure A1, the group convolution matrix mixes different components of input data, not quite respecting locality. This behavior is consistent with the assumption that the alignment and resolution terms together induce locality. The cosine similarity score is 0.713, compared to the value of 0.999 reported in Table 1. (a) Ideal and learned symmetry generators (top) and error distributions (bottom). (b) Ideal and learned group convolution matrices (top) and error distributions (bottom).
Figure A3. Results of the ablation experiment where the information preservation term is excluded from the loss function. In this case, the group convolution matrix becomes the zero matrix. This catastrophic solution saturates the uniformity loss maximally, as well as the resolution term (when all marginal and joint entropies are zero, the resolution term attains its lowest value). Without the term enforcing information preservation, the system naturally collapses to this singular solution. The symmetry generator then has no effect on the output representation, and the gradient of the loss with respect to it vanishes; therefore, the symmetry generator stays close to its initial value, which is close to the identity matrix. The cosine similarity score in this case is 0.001, demonstrating the inability to discover the symmetry generator. (a) Ideal and learned symmetry generators (top) and error distributions (bottom). (b) Ideal and learned group convolution matrices (top) and error distributions (bottom).
Figure A4. Results of the ablation experiment where the uniformity term is excluded from the loss function. The group convolution matrix has no regular structure and mixes the different components of the input data. Moreover, the model is unable to learn the true symmetry generator even partially, which is to be expected since the uniformity term has a primary role in our formulation of symmetry; the cosine similarity score in this case is 0.037. (a) Ideal and learned symmetry generators (top) and error distributions (bottom). (b) Ideal and learned group convolution matrices (top) and error distributions (bottom).

Appendix F. Complexity Analysis

Appendix F.1. Time Complexity

See Table A6 for a description of the computational complexity of our method. The number d denotes the dimensionality, k the number of Gaussian kernels used for each per-component probability density estimator, and h the size of the hidden layer used in the conditional probability estimator. In our experiments, we set k = 4 and h = 4 × k × d, and the dimensionalities of our datasets were 7, 27, and 33. The most costly operations in our algorithm are the eigendecompositions used in two steps, which lead to an overall time complexity of O(d³). Although insignificant at 33 dimensions, this would become a significant burden for datasets with much higher dimensionalities. Conditional probability estimation is another expensive operation, with time complexity O(d × h × k), which was O(d²) in our experiments since we took h proportional to d. There could be room for improvement here by choosing a smaller hidden layer size.
Table A6. Time complexities of the building blocks of our method.

| Operation | Description | Time Complexity |
| --- | --- | --- |
| Probability estimation | Estimation of per-component probabilities via Gaussian kernels. | O(d × k) |
| Conditional probability estimation | Estimation of conditional probabilities via Gaussian kernels parametrized with three neural networks. | O(d × h × k) |
| Forming the group convolution matrix | Formed by applying the generator to the resolving filter; we use an eigendecomposition-based algorithm for efficiency. | O(d³) |
| Joint entropy computation (covariance step) | Computing the covariance matrix. | O(d²) |
| Joint entropy computation (eigendecomposition step) | Eigendecomposition of the covariance matrix. | O(d³) |

Appendix F.2. Space Complexity

The space complexity of our method is O ( d 2 ) , with the dominant components being the conditional probability estimator and the symmetry generator. See Table A7 for details.
Table A7. Space complexities for modules. We exclude temporary memory usage since it depends on the specifics of the library used and the training setup (optimizer, etc.).

| Object | Description | Space Complexity |
| --- | --- | --- |
| Probability estimator | Estimation of probabilities via Gaussian kernels. | O(d × k) |
| Conditional probability estimator | Estimation of conditional probabilities via Gaussian kernels parametrized by three neural networks. | O(d × h × k) |
| Generator | The matrix representing the candidate symmetry generator. | O(d²) |
| Resolving filter | The learned resolving filter vector. | O(d) |

References

  1. Cohen, T.; Welling, M. Group equivariant convolutional networks. In Proceedings of the International Conference on Machine Learning. PMLR, New York, NY, USA, 19–24 June 2016; pp. 2990–2999. [Google Scholar]
  2. Babelon, O.; Bernard, D.; Talon, M. Introduction to Classical Integrable Systems; Cambridge Monographs on Mathematical Physics, Cambridge University Press: Cambridge, UK, 2003. [Google Scholar]
  3. Hairer, E.; Lubich, C.; Wanner, G. Geometric Numerical Integration: Structure-Preserving Algorithms for Ordinary Differential Equations, 2nd ed.; Springer Series in Computational Mathematics; Springer: Berlin/Heidelberg, Germany, 2010; Volume 31. [Google Scholar]
  4. Higgins, I.; Racanière, S.; Rezende, D. Symmetry-based representations for artificial and biological general intelligence. Front. Comput. Neurosci. 2022, 16, 836498. [Google Scholar] [CrossRef] [PubMed]
  5. Anselmi, F.; Patel, A.B. Symmetry as a guiding principle in artificial and brain neural networks. Front. Comput. Neurosci. 2022, 16, 1039572. [Google Scholar] [CrossRef] [PubMed]
  6. Bronstein, M.M.; Bruna, J.; Cohen, T.; Veličković, P. Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges. arXiv 2021, arXiv:2104.13478. [Google Scholar]
  7. Anselmi, F.; Evangelopoulos, G.; Rosasco, L.; Poggio, T. Symmetry-adapted representation learning. Pattern Recognit. 2019, 86, 201–208. [Google Scholar] [CrossRef]
  8. Helgason, S. Differential Geometry, Lie Groups, and Symmetric Spaces; Academic Press: Cambridge, MA, USA, 1979. [Google Scholar]
  9. Desai, K.; Nachman, B.; Thaler, J. Symmetry discovery with deep learning. Phys. Rev. D 2022, 105, 096031. [Google Scholar] [CrossRef]
  10. Ozakin, A.; Vasiloglou, N.; Gray, A.G. Density-Preserving Maps. In Manifold Learning: Theory and Applications; Ma, Y., Fu, Y., Eds.; CRC Press: Boca Raton, FL, USA, 2011. [Google Scholar]
  11. Weinberg, S. What is Quantum Field Theory, and What Did We Think It Is? arXiv 1997, arXiv:hep-th/9702027. [Google Scholar]
  12. Benton, G.; Finzi, M.; Izmailov, P.; Wilson, A.G. Learning invariances in neural networks from training data. Adv. Neural Inf. Process. Syst. 2020, 33, 17605–17616. [Google Scholar]
  13. Romero, D.W.; Lohit, S. Learning partial equivariances from data. arXiv 2021, arXiv:2110.10211. [Google Scholar]
  14. Zhou, A.; Knowles, T.; Finn, C. Meta-learning symmetries by reparameterization. arXiv 2020, arXiv:2007.02933. [Google Scholar]
  15. Craven, S.; Croon, D.; Cutting, D.; Houtz, R. Machine learning a manifold. Phys. Rev. D 2022, 105, 096030. [Google Scholar] [CrossRef]
  16. Forestano, R.T.; Matchev, K.T.; Matcheva, K.; Roman, A.; Unlu, E.B.; Verner, S. Accelerated discovery of machine-learned symmetries: Deriving the exceptional Lie groups G2, F4 and E6. Phys. Lett. B 2023, 847, 138266. [Google Scholar] [CrossRef]
  17. Forestano, R.T.; Matchev, K.T.; Matcheva, K.; Roman, A.; Unlu, E.B.; Verner, S. Deep learning symmetries and their Lie groups, algebras, and subalgebras from first principles. Mach. Learn. Sci. Technol. 2023, 4, 025027. [Google Scholar] [CrossRef]
  18. Krippendorf, S.; Syvaeri, M. Detecting symmetries with neural networks. Mach. Learn. Sci. Technol. 2020, 2, 015010. [Google Scholar] [CrossRef]
  19. Sohl-Dickstein, J.; Wang, C.M.; Olshausen, B.A. An unsupervised algorithm for learning lie group transformations. arXiv 2010, arXiv:1001.1027. [Google Scholar]
  20. Dehmamy, N.; Walters, R.; Liu, Y.; Wang, D.; Yu, R. Automatic symmetry discovery with lie algebra convolutional network. Adv. Neural Inf. Process. Syst. 2021, 34, 2503–2515. [Google Scholar]
  21. Greydanus, S.; Dzamba, M.; Yosinski, J. Hamiltonian neural networks. Adv. Neural Inf. Process. Syst. 2019, 32, 15353. [Google Scholar]
  22. Alet, F.; Doblar, D.; Zhou, A.; Tenenbaum, J.; Kawaguchi, K.; Finn, C. Noether networks: Meta-learning useful conserved quantities. Adv. Neural Inf. Process. Syst. 2021, 34, 16384–16397. [Google Scholar]
  23. Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Networks. arXiv 2014, arXiv:1406.2661. [Google Scholar] [CrossRef]
  24. Yang, J.; Walters, R.; Dehmamy, N.; Yu, R. Generative adversarial symmetry discovery. In Proceedings of the International Conference on Machine Learning. PMLR, Honolulu, HI, USA, 23–29 July 2023; pp. 39488–39508. [Google Scholar]
  25. Yang, J.; Dehmamy, N.; Walters, R.; Yu, R. Latent Space Symmetry Discovery. arXiv 2023, arXiv:2310.00105. [Google Scholar]
  26. Tombs, R.; Lester, C.G. A method to challenge symmetries in data with self-supervised learning. J. Instrum. 2022, 17, P08024. [Google Scholar] [CrossRef]
  27. Doob, J. Stochastic Processes; Wiley: Hoboken, NJ, USA, 1991. [Google Scholar]
  28. Folland, G.B. A Course in Abstract Harmonic Analysis; CRC Press: Boca Raton, FL, USA, 2016. [Google Scholar]
  29. Bell, A.J.; Sejnowski, T.J. An information-maximization approach to blind separation and blind deconvolution. Neural Comput. 1995, 7, 1129–1159. [Google Scholar] [CrossRef] [PubMed]
  30. Cover, T. Elements of Information Theory; Wiley series in telecommunications and signal processing; Wiley-India: New Delhi, India, 1999. [Google Scholar]
  31. Pichler, G.; Colombo, P.J.A.; Boudiaf, M.; Koliander, G.; Piantanida, P. A differential entropy estimator for training neural networks. In Proceedings of the International Conference on Machine Learning. PMLR, Baltimore, MD, USA, 17–23 July 2022; pp. 17691–17715. [Google Scholar]
  32. Herman, J.; Usher, W. SALib: An open-source Python library for Sensitivity Analysis. J. Open Source Softw. 2017, 2. [Google Scholar] [CrossRef]
  33. Iwanaga, T.; Usher, W.; Herman, J. Toward SALib 2.0: Advancing the accessibility and interpretability of global sensitivity analyses. Socio-Environ. Syst. Model. 2022, 4, 18155. [Google Scholar] [CrossRef]
  34. Morris, M.D. Factorial sampling plans for preliminary computational experiments. Technometrics 1991, 33, 161–174. [Google Scholar] [CrossRef]
Figure 1. Symmetry actions on local objects.
Figure 2. A visual representation of the model input and output. At the top, we see samples from the datasets used for model training; these datasets are approximately invariant and local under a given group action. The output, after training, is the learned minimal generator of the group action shown pictorially here and a symmetry-based representation that is shown as a matrix here. For the case on the left, the dataset is local and invariant under “1-dimensional pixel translations”. For the case on the right, the dataset is local/symmetric under frequency shifts (sending each Fourier basis vector to the next), and the relevant symmetry-based representation is a discrete Fourier transform. The model can recover the given symmetry generators in each one of these example cases and others.
Figure 3. Overview of the method. Each sample goes through the group convolution map formed by combining the resolving filter with the powers of the candidate symmetry transformation. The locality and symmetry losses are applied to the resulting representation. In the case shown here, the samples are local/symmetric under frequency shifts, and the group convolution matrix learned is effectively the matrix representing the discrete sine transform.
Figure 4. Samples from the synthetic datasets used for model training. We show 4 samples from datasets generated from Gaussian basis signals. In each figure, the x-axis represents the component index (the “pixel index”) of the sample vectors, and the y-axis is the amplitude of the corresponding component. The column on the left has samples from the raw datasets, which are local and invariant under simple translations of components; the second column uses datasets invariant under permuted translations; and the third column uses datasets invariant under frequency shifts, i.e., shifts of components in Fourier (discrete sine transform I) space.
Figure 5. The optimization loop for the group convolution matrix (symmetry–locality search).
Figure 6. The optimization loop for the probability density estimators.
Figure 7. The symmetry generator for the translation-invariant dataset is the 1-step translation operator, which is simply a shift matrix (with entries just below or above the diagonal equaling 1). On the left, we see that this matrix is learned with high accuracy. The group convolution matrix that provides the symmetry-based representation for translation symmetry is the identity operator. On the right, this operator, formed by combining the powers of the group generator with the learned resolving filter, is also learned to a high degree of accuracy. (a) Ideal and learned symmetry generators (top) and error distributions (bottom). (b) Ideal and learned group convolution matrices (top) and error distributions (bottom).
Figure 8. The symmetry generator for the frequency−shift−invariant dataset is the map that sends each discrete sine basis vector to the one with the next highest (frequency). On the left, we see that this matrix is learned to a high degree of accuracy. The group convolution matrix that provides the symmetry−based representation for this case is nothing but the discrete sine transform. On the right, we see that this matrix is also learned to a high degree of accuracy. The model was able to learn that the discrete sine transform was the relevant map for a symmetry−based representation by simply looking at samples, without any hints or hard coding (explicit or hidden). (a) Ideal and learned symmetry generators (top) and error distributions (bottom). (b) Ideal and learned group convolution matrices (top) and error distributions (bottom).
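For the frequency-shift case of Figure 8, the ideal operators can be written down explicitly: the orthonormal DST-I matrix plays the role of the group convolution map, and the generator is the 1-step shift conjugated back into pixel space. A minimal sketch, with variable names chosen for illustration:

```python
import numpy as np

N = 27

# Orthonormal DST-I matrix: S[j, k] = sqrt(2/(N+1)) * sin(pi*j*k/(N+1)),
# with j, k running from 1 to N.
j, k = np.meshgrid(np.arange(1, N + 1), np.arange(1, N + 1), indexing="ij")
S = np.sqrt(2.0 / (N + 1)) * np.sin(np.pi * j * k / (N + 1))
assert np.allclose(S @ S.T, np.eye(N))  # DST-I is orthogonal (and symmetric)

# A frequency shift sends each DST basis vector to the next one, i.e., it acts
# as a 1-step shift in the DST domain; in pixel space the ideal generator is
# the conjugated operator S^T @ shift @ S.
shift = np.eye(N, k=-1)
G_freq = S.T @ shift @ S
```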
Figure 9. The symmetry generator for the pixel permutation dataset is a specific permutation of the components, and the dataset is invariant and local under the action of the powers of this permutation. On the left, we see that the relevant permutation matrix is learned to a high degree of accuracy. The group convolution matrix that provides the symmetry-based representation for this case is the underlying permutation that relates the simple translation generator to the permutation generator. On the right, we see that this matrix is also learned to a high degree of accuracy. The model was able to extract the specific permutation of the “pixels” that was needed to obtain manifest locality/symmetry from the raw data. (a) Ideal and learned symmetry generators (top) and error distributions (bottom). (b) Ideal and learned group convolution matrices (top) and error distributions (bottom).
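The permuted-translation setup of Figure 9 has a similarly simple closed form: the generator is the ordinary shift matrix conjugated by the hidden permutation, and the ideal group convolution map is (up to an overall translation and sign) the inverse permutation. A minimal sketch under these assumptions, with an arbitrary permutation standing in for the hidden one:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 27

# A hypothetical hidden permutation that scrambles the pixel order.
perm = rng.permutation(N)
P = np.eye(N)[perm]        # permutation matrix: (P @ x)[i] = x[perm[i]]

shift = np.eye(N, k=-1)    # ordinary 1-step translation generator
G_perm = P @ shift @ P.T   # generator of the permuted translation symmetry

# The ideal group convolution map is the inverse permutation P.T (up to an
# overall translation and sign), which makes the locality manifest again.
```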
Figure 10. Training results for the 27-dimensional real dataset with an approximate translation symmetry, obtained by slicing the MNIST dataset. On the left, we see that the model learns the 1-step translation generator just as well as in the synthetic experiments, even though the properties of this dataset are very different. On the right, we see that the group convolution matrix is similarly accurate, approximately equal to the identity map (up to a sign), as in Figure 7b. (a) Ideal and learned symmetry generators (top) and error distributions (bottom). (b) Ideal and learned group convolution matrices (top) and error distributions (bottom).
Figure 11. After learning the symmetry generators, the resulting group convolution map can be used to obtain the symmetry-based representation of each sample. Here, we see that the symmetry-based representations recover the underlying local signals to a high degree of accuracy for both permutation- and frequency-shift-equivariant data. In both cases, we see the raw data vectors on the left, the result of the group convolution map in the center, and the hidden (unpermuted) signal on the right. (a) Permutation-equivariant data. (b) Frequency-shift-equivariant data.
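A small illustration of what Figure 11a depicts: a local signal scrambled by a hidden permutation is recovered by applying the group convolution map, which in this case acts as the inverse permutation. The bump parameters and the permutation below are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 27

# A local Gaussian bump, then scrambled by a hidden permutation (cf. Figure 11a).
x_local = np.exp(-0.5 * ((np.arange(N) - 13) / 2.0) ** 2)
perm = rng.permutation(N)
P = np.eye(N)[perm]
x_raw = P @ x_local        # what the model actually observes

# Once the permutation has been learned, the group convolution map acts as the
# inverse permutation and recovers the hidden local signal.
x_rep = P.T @ x_raw
assert np.allclose(x_rep, x_local)
```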
Figure 12. Training snapshots of the group convolution layer. The group convolution matrix evolves to a mapping that makes the hidden symmetry manifest. (a) Training snapshots for a dataset with the permuted translational symmetry. The group convolution matrix evolves to a form that unscrambles the permutation. (b) Training snapshots for a dataset with the frequency-shift symmetry. The group convolution matrix evolves into a form that inverts the DST-I transformation, i.e., converges to the transpose of the DST-I matrix.
Figure 13. The outcome of the GAN-based methods on the 7-dimensional dataset with circulant translation symmetry. We see that the GAN-based methods learn the symmetry generators rather crudely, with error rates slightly higher for LieGAN than for SymmetryGAN-QR. (a) Ideal and learned symmetry generators (top) and error distributions (bottom) for the LieGAN method. (b) Ideal and learned symmetry generators (top) and error distributions (bottom) for the SymmetryGAN-QR method.
Figure 14. The outcome of our method on the 7-dimensional dataset with circulant translation symmetry. We see that, compared to the GAN-based methods, our method learns the minimal symmetry generator and resulting group convolution matrix with higher accuracy. (a) Ideal and learned symmetry generators (top) and error distributions (bottom) for our method. (b) Ideal and learned group convolution matrices (top) and error distributions (bottom) for our method.
Table 1. Cosine similarity between the learned and ideal minimal generators.
Symmetry | Gaussian Dataset | Legendre Dataset | MNIST Slices
Translation | 0.999 ± 0.001 | 0.998 ± 0.003 | 0.999 ± 0.001
Permuted translation | 0.996 ± 0.002 | 0.998 ± 0.001 | -
Frequency space translation | 0.980 ± 0.003 | 0.991 ± 0.003 | -
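Table 1 reports cosine similarities between the learned and ideal minimal generators. One plausible way to compute such a score, treating each operator as a flattened vector, is sketched below; whether additional sign or scale ambiguities are accounted for is not specified here, so this is only an assumption.

```python
import numpy as np

def cosine_similarity(A: np.ndarray, B: np.ndarray) -> float:
    """Cosine similarity between two operators treated as flattened vectors."""
    a, b = A.ravel(), B.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Example: compare a (simulated) learned generator against the ideal 1-step
# shift generator; an accurate fit gives a value close to 1.
N = 27
G_ideal = np.eye(N, k=-1)
G_learned = G_ideal + 0.01 * np.random.default_rng(1).normal(size=(N, N))
print(cosine_similarity(G_learned, G_ideal))
```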