Butterfly Transforms for Efficient Representation of Spatially Variant Point Spread Functions in Bayesian Imaging

Bayesian imaging algorithms are becoming increasingly important in, e.g., astronomy, medicine and biology. Given that many of these algorithms compute iterative solutions to high-dimensional inverse problems, the efficiency and accuracy of the instrument response representation are of high importance for the imaging process. For efficiency reasons, point spread functions, which make up a large fraction of the response functions of telescopes and microscopes, are usually assumed to be spatially invariant in a given field of view and can thus be represented by a convolution. For many instruments, this assumption does not hold and degrades the accuracy of the instrument representation. Here, we discuss the application of butterfly transforms, which are linear neural network structures whose sizes scale sub-quadratically with the number of data points. Butterfly transforms are efficient by design, since they are inspired by the structure of the Cooley–Tukey fast Fourier transform. In this work, we combine them in several ways into butterfly networks, compare the different architectures with respect to their performance and identify a representation that is suitable for the efficient representation of a synthetic spatially variant point spread function up to a 1% error. Furthermore, we show its application in a short synthetic example.


Introduction
Images of astronomical objects are the result of measurements by physical instruments and intricate post-processing. In this procedure, instrument responses play an important role as they build the connection between the signal, i.e., the quantity of interest, and the observables.
Unfortunately, instrument responses are often non-trivial and hard to model in a simple and numerically efficient form. Examples for such instruments are the X-ray observatories eROSITA (extended ROentgen Survey with an Imaging Telescope Array) [1] and Chandra [2]. Both are challenging to compute due to their inhomogeneous behaviour in terms of space and energy. In order to efficiently perform statistical field inference, for example, by using NIFTy (Numerical Information Field Theory) [3][4][5], a Python software package for the numerical application of information field theory [6][7][8][9], these responses must be represented numerically in a way that is fast and differentiable. One promising candidate for the efficient representation of instrument responses are butterfly transforms, a linear neural network structure inspired by the structure of the fast Fourier transform (FFT) algorithm, whose size scales with O (N log N ), where N is the number of pixels.
In many cases, the measurement equation for some data d, taken with an instrument response R of the signal s assuming additive noise n, can be formulated as d = R(s) + n. Regarding photographic instruments this response R is a linear map that can be separated into two operations, D and O. Here, D describes the measurement process of the detector, while O represents the optical properties of the instrument. The latter is also referred to as point spread function (PSF). Since computers are used for the analysis of the experiments performed, the continuous signal space is approximated by a discrete pixelation and thus all operators can be represented as matrices.
If O can be approximated by a circulant matrix, a matrix consisting of cyclic permutations of the same row vector a, its matrix multiplication with any vector simplifies to a discrete convolution with a, meaning that it is spatially invariant and homogeneous, respectively. In many physically relevant cases, this homogeneity can approximately be assumed for a given observed area of the instrument. Additionally, the convolution theorem states that a convolution corresponds to a point-wise multiplication in Fourier space. Consequently, convolutional responses can be represented in an efficient way, due to the fact that one only has to store one N -entry vector instead of a N 2 matrix, as well as due to the efficiency one gains by replacing a discrete Fourier transformation by the fast Fourier transformation (FFT). Often the homogeneity assumption only holds up to a certain degree and in a limited field of view. For spatially variant PSFs, and thus non-circulant responses, efficient representations are urgently needed.
In this paper, we propose using butterfly transforms to represent spatially variant PSFs in order to build likelihoods for instruments such as eROSITA, Chandra, and many more. In particular, we present a way to parameterize butterfly transforms, combine them into networks, and compare different network architectures in terms of their efficiency and accuracy. Section 2 summarizes the implementation and application of butterfly factorizations and transforms in previous works. Section 3 describes how butterfly transforms are parameterized in this work and how they are inspired by the structure of the Cooley-Tukey-FFT algorithm. Section 4 gives a short introduction to information field theory and Section 5 describes different designs of likelihoods. In Section 6, we define a metric in order to compare different butterfly network architectures with respect to their capability to represent the synthetic response defined in Section 7. The results, which consider a comparison of different architectures, a comparison of execution times, and a mock application of a butterfly network, can be found in Section 8.

Related Work
Butterfly factorizations and transformations are becoming increasingly popular in the machine learning community for a variety of applications. Polcari describes in [10] how the generalization of the butterfly structure known from the FFT algorithm can be used for multi-layer decomposition of unitary matrices. In [11], Dao et al. proposed a way to learn fast linear transformation algorithms using butterfly factorizations. They were able to learn several fast linear transformations, e.g., FFT, discrete sine transform, etc., and showed that their approach can be used as an efficient replacement for generic matrices in machine learning pipelines. Alizadeh et al. [12] proposed butterfly transformations as a replacement for pointwise convolutions in depth-wise separable convolutions in convolutional neural networks. This is particularly important for architectures such as MobileNets [13][14][15] that are designed to run on mobile devices. Singhal et al. [16] used complex-valued butterfly transforms for hyperspectral image processing. The combination of complex-valued multi-scale feature representation with data-driven feature learning results in lighter, yet accurate classification models. Lin et al. [17] also proposed a new form of butterfly transform, called deformable butterflies, as a replacement for convolutional or fully connected layers in neural networks. Song et al. [18] instead used the butterfly algorithm to encrypt optical images.
Our contribution to the field demonstrates the application of butterfly transforms to efficiently represent spatially varying point spread functions, necessary for accurate Bayesian imaging. A proceeding paper presented an early stage of this work [19]. In this article, however, we go more into the details of the method, theoretically address the scaling of the networks, and introduce error maps as a new visualisation method. We also perform a comparison of the execution time between a butterfly network Python implementation and a full matrix-vector multiplication. In addition, we show the behaviour of the response approximation at a higher resolution and use butterfly networks as a response representation for a synthetic Bayesian imaging task.

Fast Fourier Transformation
Due to the convolution theorem, Fourier transformation is one of the key elements of convolutional processes and thus the algorithm of FFT is highly relevant for the representation of instrument responses on regular grids. The main idea of the FFT is to split the sum in the discrete Fourier transform (DFT), into two sums, over even and odd indices [20]. By using the mathematical properties of the N -th primitive root ω N = e −2πi N , it can be shown that Taking a closer look at the Fourier componentf k+ N 2 shows that one can reuse the same two components,f even k andf odd k , which were already calculated: This means that an N -sized Fourier transform can be separated into two N /2-sized Fourier transforms along the even and odd indices, also called the Danielson-Lanczos Lemma found in 1942 [21]. The componentsf even The two smaller Fourier transforms can be separated in the same way, resulting in a divide and conquer algorithm. Assuming that the initial value of N is a power of 2, this splitting can be applied log 2 (N ) times. Inspired by machine learning language, these iterations are called layers in the following. With N additions in each of these layers, the total computational complexity is about O(N log 2 N ). Comparing this to a regular DFT with its computational complexity of O(N 2 ) (N components with N summands) the amount of saved time in the FFT algorithm is significant.
Due to the layer-wise splitting into even and odd indices one has to spend some additional effort on renumbering and book-keeping in order to combine the correct indices. In 1965, Cooley and Tukey discovered that this iterative splitting into even and odd indices can be represented as a bit reversal of the indices (reading backwards in binary representation) [21]. Therefore, one does not have to take additional care of the right input ordering. Besides the "decimation in time algorithm" by Cooley and Tukey, there are other FFT algorithms that are not further considered in this work.

Butterfly Transform and Convolution
The data-flow diagram illustrating the algorithm of Equation (4) is often called a butterfly diagram because of its resemblance to a butterfly (see Figure 1). Here, the direction and color of the arrows pointing to a node describe the mathematical operations being performed. Since the abstraction of the FFT algorithm results in a similar data flow diagram, it will be called the butterfly transform in the following. As the butterfly diagrams always connect to two components, most of the descriptions used in the following, concerning their parameterization, are two-dimensional to keep the notation simple.  In order to generalize the FFT while preserving its efficient structure, we decompose the operations in Equation (4) into a diagonal operator Φ and a mixing operator Θ, as given in the following.
and thus f k For each component, we introduce free parameters that control how the operation deviates from an ordinary FFT. A general representation of Θ is obtained by parameterizing it by the sine and cosine of an angle θ, To preserve the generality of the transformation within one layer, the θs for different connected pairs, denoted by the index k in Equation (6), are independent. This means that for an N -size transformation there are N /2 θs, in each layer, regulating the interaction between two connected data points. Considering this parameterization, we obtain the Θ from Equation (5), i.e., the one for an FFT, by inserting θ = π 4 . The operator Φ is parameterized as This parameterization makes it possible to recover the correct phases for an FFT, but also to change them in an arbitrary way. The combination of Θ and Φ is sufficient to represent an entire FFT transform. To obtain an even more general transformation, the diagonal operator Γ, is introduced, which accounts for the real-valued amplitudes. This leads to a loss of unitarity for γ 1 , γ 2 = 0 in the combined transformation of Γ, Φ and Θ. Now, we can build a generic butterfly-structured transformation B, using the layered structure of an FFT as a guiding example. The subscript of the operators refers to the layer in the FFT algorithm and thus implies that the correct components are connected.
Given this butterfly transformation and the structure of a convolution operation, based on the convolution theorem, a butterfly convolution-like operator O can be formulated as In this equation the Λ operator corresponds to the Fourier transformed PSF. Usually, physically reasonable PSFs are real-valued in position space and thus complex-valued in harmonic space. Therefore, the Λ operator is defined as a diagonal operator, with complex values, B † , in Equation (11), denotes the adjoint of B. For some experiments, the parameters of B and B † were strictly coupled, called mirrored architecture in the following. For others, the parameters were independent, denoted by different indices, resulting in a non-mirrored architecture:

Multidimensional Butterfly Transformation
Since the butterfly structure is strongly related to the FFT, it would make sense to treat multidimensional butterfly transformations in the same way as multidimensional Fourier transformations. Therefore, butterfly transformations can be applied to each dimension separately. However, in this work the 2D application is slightly modified, in a way that the mixing operator Θ is applied to each axis separately (for the first axis all columns are transformed with the same θs, whereas for the second axis all row transformations share the same θs), but after this axis-wise Θ-transformation the operators Φ and Γ are applied as diagonal operators. Applying a 2D butterfly transform on a m × l 2D grid with m > l results in log 2 m layers for the first axis transformation and log 2 l layers for the second axis. The number of θs in one Θ i operator is then m 2 and l 2 , respectively. The operators Φ and Γ are then applied in log 2 m layers. Here, the number of φs and γs in one Φ i or Γ i operator is ml.
Another approach, next to the 2D application, is to reduce the number of dimensions to one (in this case, as we are dealing with images, from 2D to 1D) and just perform one butterfly transform to this one dimension. For the case of two-dimensional inputs the dimensionality reduction can be easily performed by concatenating all the column vectors to one long vector, which will be called flattening from now on. These two different approaches differ in the number of layers needed by the butterfly algorithm as well as in the number of parameters per layer. For a one-dimensional transform of a ml 1D grid, one obtains log 2 ml layers. Here, the number of parameters behaves differently as there are ml 2 in each Θ i . The number of parameters in one Φ i or Γ i is again ml.
Since in the latter case this results in a larger number of parameters, the 1D flattened transform is expected to be more flexible than the 2D transform. A more detailed listing of the number of parameters for each approach is shown in Table 1. Table 1. Comparison of the number of parameters needed for a 2D and 1D butterfly transform.

Name #θ #φ #γ
2D (m × l grid) m 2 log 2 m + l 2 log 2 l ml log 2 m ml log 2 m flat (ml) ml 2 log 2 ml ml log 2 ml ml log 2 ml Their different scaling behaviour can be easily compared by setting m = l. Then the total number of parameters is [l log 2 l + 2l 2 log 2 l] for the 2D and [5l 2 log 2 l] for the 1D flattened case. Figure 2 shows the different scaling behaviours of these two butterfly applications and of a full matrix representation as a function of the axis length l.

Information Field Theory
To reach a better understanding for the area of use for the efficient responses, a brief introduction to information field theory (IFT) [9] will be given. Information field theory is the application of information theory to physical fields. Probably the most important relation within information theory is Bayes' theorem, which connects a posterior with the likelihood, the prior, and the evidence. Here, the likelihood P (d|s) describes how likely it is to record specific data for a given signal, which incorporates knowledge about the instrument and the measurement process. The prior P (s) is chosen with respect to the physical knowledge one has about the observed quantity or situation. The evidence P (d) = ds P (d|s)P (s) is needed for the proper normalization of the posterior P (s|d). Another important quantity is the information Hamiltonian that is defined as the negative logarithm of the probability, H(d, s) = − ln[P (d, s)]. Due to the properties of the logarithm and the product rule of probabilities the information Hamiltonian, H is an additive quantity H(d, s) = H(d|s) + H(s). The likelihood can be computed from the noise statistic P (n|s) and the measurement equation, here in the form P (d|s, n) = δ(d − R(s) − n). Thus, the likelihood is P (d|s) = dn P (d|s, n)P (n|s) Assuming Gaussian priors for signal, G(s, S), and noise, G(n, N), and using Equation (15) the Hamiltonians simplify to However, if the measurement process follows Poisson statistics, which is the case for realistic photographic measurements, a Poissonian likelihood model has to be used. For this, we define the event density field s at a sky coordinate x as s x and the density of observed events in a detector bin i as µ i . The connection between the event density and the observed event density is described by the instrument response R i x in the equation R thus describes the optical properties of the instrument and the detector. For the case that all measured events are independent of each other, the likelihood P (d|µ) follows a Poisson distribution and thus, following the definition of the information Hamiltonian, the likelihood Hamiltonian is For a Poisson distribution the density s has to be strictly positive. Thus, the prior P (s) can no longer be Gaussian, but must follow a strictly positive distribution, e.g., a log-normal distribution or an inverse gamma distribution. Dropping all the constant terms, symbolized by = meaning "up to constants in any here relevant quantity", the likelihood Hamiltonian can be formulated as One way to find an estimate for the signal s is to maximize the probability P (s|d) by minimizing the joint Hamiltonian H(d, s), with respect to the signal s. This is the maximum a posteriori (MAP) approximation. Other inference methods including an uncertainty quantification are metric Gaussian variational inference (MGVI) [22] or geometric variational inference (geoVI) [23]. As a minimization algorithm, Newton-CG [24] was used throughout all experiments.

Parallel and Serial Likelihoods
Models for inference processes in NIFTy are built in a forward way, as so-called generative models. This means that a model of the physical signal is created first, followed by the instrument response. Applying the IFT formalism, described in Section 4, to a generative model with a butterfly convolution operator as a response yields a likelihood with dependencies on the signal s and the response parameters θ, φ, γ, and λ.
In addition to being able to use butterfly convolution operators with mirrored, nonmirrored, flat, and 2D configurations, they can be combined into a network built in parallel or in series. In the case where n multiple butterfly convolution operators are connected in series, the response operator representation in Equation (16) or (20) arises from the sequential application of multiple butterfly convolution operators, In contrast, with n butterfly convolution operators arranged in a parallel architecture, each butterfly convolution operator is applied to the signal, and the results are summed. The corresponding instrument response representation is then Before using a particular butterfly network as an instrument response function in an imaging application, it is trained on signal-data pairs of the instrument. For many instruments, specific knowledge of the PSFs is available in the form of signal-data pairs, either from an expensive one-time simulation or from extrapolation of calibration measurements. Using these signal-data pairs, the joint Hamiltonian H(d, s, θ, φ, γ, λ) is minimized with respect to the response parameters θ, φ, γ, and λ, resulting in a MAP approximation of the instrument. The initial values for these parameters, θ, φ, γ, and λ, are chosen such that all O i correspond to a convolution with a delta peak ( θ = π/4 , γ = 0, λ = 1, and φ according to the needed phases, see Section 3.1). The prior distribution of the parameters is assumed to be Gaussian, with means at the initial values and unit variance. The final goal of the minimization is to obtain an efficient digital twin of the real physical instrument.
Once a butterfly response is trained, it can be used for imaging with the corresponding instrument. For this, the response parameters are fixed to the inferred values θ, φ, γ, and λ, resulting in a response operator, which is linear in the signal s. The selection of a suitable generative model for s depends on the observation of interest. In order to obtain an uncertainty estimate for the physical signal s, the inference algorithms MGVI or geoVI can be used.

Evaluation of the Response Approximation
Before using a trained butterfly response in an inference algorithm, it must be certified that the mapping performed by the response representation is sufficiently accurate. Therefore, we compare the action caused by a signal, here a point source at position z, s(x) = δ(x − z), of the to-be-learned or simulated response with the butterfly response by their difference. This will be called the response approximation error To keep the evaluation simple, unit brightness point sources at all signal domain locations z ∈ Ω are considered. In the next step, the 2-norm ( s 2 = ∑ x∈Ω |s x | 2 ) of E(s) is calculated for all these sources individually and normalized by the 2-norm of the corresponding true signal response: where z defines a relative error for each pixel z and thus results in an image, called the error map, in the following. This error map depicts the errors introduced by the butterfly response in total and shows which areas are more reliable and in which areas the mapping deviates significantly from the approximated response. In order to quantify the total error with respect to all mapping errors, we calculate the 2-norm of the 4D matrix E z = E[δ(x − z)], containing the error images for all possible z-values and normalize it by dividing with the 2-norm of the matrix R z = R sim. [δ(x − z)], containing all true simulated responses resulting in the the total error ζ:

Synthetic Response
In order to investigate whether and to what degree butterfly networks are capable of approximating spatially variant PSFs, they were trained to approximate a synthetic response. This synthetic response can be regarded as the convolution of the signal s, which is a point source located at the position z, s(x) = δ(x − z), with a rotational symmetric PSF with a position-dependent shape, For the PSF a zero-centred Gaussian was chosen, with ρ = x 2 , where x is the coordinate vector of the image plane and x 2 = x 2 1 + x 2 2 is its length. The dependence on the position z of the point source is encoded in the variance σ 2 (z) of the Gaussian. To keep this spatial dependency simple, only the distance from the centre of the image c to the point source z, r = c − z 2 , influences the shape of the PSF. As this absolute value depends on the image resolution, r will be normalized by the maximal distance within the image,r = r/r max , to obtain a relative measure for the distance being in the interval [0, 1]. As indicated, the variance σ 2 is a function of this relative distancer between the point source at z and the image centre c, The two parameters are set to β = 0.01 and η = 10 −5 . Following Equation (28), larger distancesr lead to larger values of the variance σ 2 . This means that point sources with smaller values ofr are convolved with a sharper Gaussian, while point sources at further distances from the centre are convolved with broader Gaussians (see Figure 3). This results in an spatially variant PSF, which can be used to examine the expressiveness of the butterfly architecture.

Comparison of Architectures
In search of a butterfly network capable of representing spatially variant point spread functions, various architectures were compared, in terms of their ability to represent the synthetic response, differing in their number of butterfly convolution operators (BCOs), mirrored (mr) or non-mirrored (nmr) architecture, flat or 2D network design, and serial or parallel built likelihood (see Table 2). All of these networks were trained to approximate the synthetic response described in Section 7 by maximizing the posterior (MAP) until the optimization was sufficiently converged (300 Newton steps). As training data, a set of all possible PSFs within the given pixelation of 16 × 16 was used. The signals were fixed to be point sources with brightness values of 40 at the corresponding positions. A Gaussian likelihood was used and the noise covariance N was set to be diagonal with entries of 10 −6 . In order to gain a better understanding of the influence of some of these properties on the total approximation behaviour, the networks are regarded separately and with respect to their final total approximation errorζ in Table 2. The comparison of theζ value of Net 1 , Net 2 , and Net 3 with 1, 2, and 3 BCOs, but otherwise the same properties, shows that a higher number of BCOs lowers the total error and thus increases the approximation capability. The second property of interest is the kind of architecture used, mirrored or non-mirrored. Therefore theζ value of Net 3 , with its mirrored architecture, is compared to the one of Net 4 , with its non-mirrored architecture, while their other properties are equivalent. This shows that the non-mirrored architecture performs better than the mirrored one. The same conclusion can be drawn by comparingζ of Net 5 and Net 6 , which also only differ in their state of mirroring. In a similar way the flattened and 2D applications can be examined. Since Net 3 and Net 5 only differ in this property, their error values suggest that the flat application is superior to the 2D application with respect to the reconstruction capability. This is confirmed regarding the error of Net 4 and Net 6 , which are in a similar relationship.
Since more BCOs, flattening, and a non-mirrored architecture increase the number of parameters and thus lead to more degrees of freedom, it is assumed that these architectures are more flexible and can approximate the true response in a better way.
For the overall efficiency of the various networks it is not only important to approximate the synthetic response in a optimal way, but also to keep the number of parameters, and thus the network density (network density is defined here as the ratio of network parameters and number of entries in a full matrix representation), as low as possible (see Figure 4). In the examined cases, sparser architectures tend to perform worse in comparison to architectures with more parameters. Overall Net 4 approximates the synthetic response best with an 1% error. Net 6 , however, has only 44% of the parameters of Net 4 and is therefore less dense. This goes hand in hand with a slightly increased approximation error by an absolute value of 0.46% (see Table 2). In the end, the number of parameters of the butterfly networks still scales with O(N log N ). This means that they become less dense with increasing resolution.

Performance at Increased Resolution
Although the performance of the much sparser network Net 6 is only decreased by 0.46% in absolute error, Net 4 is still chosen as the best performing architecture for further discussion. In the following, we will investigate how the performance of the network depends on the resolution. Therefore, a Net 4 architecture was again trained on a complete set of signal responses of all possible point source positions with an increased resolution of 32 × 32. The noise covariance was not changed.
The results for 16 × 16 and 32 × 32 for the otherwise same butterfly network architecture are collected in Table 3. One can see that the approximation capability does not suffer, while the density decreases with increasing resolution. The density decreases because the number of butterfly network parameters scales with about N log N whereas a full matrix representation would scale with N 2 . This is the expected behaviour described in Section 3.3. Table 3. Comparison of the same network structure applied to different resolutions. The percentage error is on the same scale while the density decreases with higher resolution. In order to further validate the approximation of the mapping, one can compare the synthetic signal responses (see Figure 3) and the approximated butterfly response (see Figure 5) for the same 25 different point sources. One can see that the main features, the position, the size and the magnitude of the Gaussian are picked up by the network, but also that there are not defined areas in the logarithmic plot. This is due to negative regions in the butterfly signal response caused by lack of constraints on positivity in the non-mirrored architecture. For a closer investigation we also plotted the difference between the two signal responses (see Figure 6). As presumed, the peaks of the error pictures E(s) are on the same magnitude as the total errorζ. Since these are just a few samples for the mapping, which are not representative for the overall approximation, we also consider the error mapsˆ , described in Section 6, for further investigation of the errors made by the butterfly networks. Figure 7 shows that the relative errorˆ z is not constant over the image domain, but smaller in the centre than at its borders. Since the only property changing with the relative distancer from the image centre is the shape and size of the synthetic response, it appears that wider PSFs are more difficult to approximate than smaller ones. This is also visualized by showing the relationship betweenr z andˆ z for each pixel of the error map in a 2D histogram (see right plot Figure 7).

Comparison of Execution Times and Memory Consumption
In addition to measuring the ability to represent the true inhomogeneous point spread function, we want to test the speed of the butterfly network implementation in an application. Therefore, we use the Python package timeit to measure the execution time of a forward pass in seconds. Here, we compare the execution time of a butterfly convolution with fixed θ, φ, γ, and λ against a full matrix-vector product performed by numpy.matmul. Since the numpy.matmul implementation runs on multiple cores and the butterfly network implementation is not parallelized, we force the code to run on a single core to make the numbers more comparable. [OMP_NUM_THREADS = 1, OPENBLAS_NUM_THREADS = 1, MKL_NUM_THREADS = 1] Figure 8 and Table 4 show that for small N the butterfly convolution implementation is slower than full matrix-vector multiplication. The reason for this behaviour is that the butterfly convolution is mostly written in Python and thus has a huge overhead, whereas the matrix-vector multiplication is written in C and is highly optimised. Implementing the butterfly convolution in C should increase its speed significantly. However, this changes at higher resolution, where the butterfly convolution is considerably faster than the full matrix-vector mutliplication, despite being implemented in a slower language. This is due to the different scaling of the number of operations involved and their parameters. Since the number of parameters in a full matrix multiplication scales with N 2 , it was not possible to store the matrix on a laptop for N > (256 × 256). This memory consumption further illustrates the importance of this method for high resolution imaging with spatially varying point spread functions.

Imaging with a Butterfly Response
Since the previous sections show that the efficient representation of instrument responses using butterfly networks is indeed possible up to a certain representation error, this section presents a small imaging application with a trained butterfly network and a synthetic generated data set.
As already explained in Section 5, a forward model is built in order to infer a signal from collected data. To keep the validation simple, a synthetic signal consisting of 15 point sources with random positions in the signal domain and brightness values between 900 and 1000 is generated, also called ground truth (see Figure 9a) in the following. The corresponding observed event density is then calculated from this ground truth using the synthetic response from Section 7 (see Figure 9b). From this observed event density the synthetic data is generated by drawing from Poisson distributions with the respective rates.
As a model for the prior distribution P (s) of the point sources an inverse gamma distribution, f (x; q, α) = q α Γ(α) x −α−1 exp (− q x ), is chosen with α = 1 and q = 1. For the response operator in our Poissonian likelihood we use the butterfly network Net 4 (32x32) that was previously trained on the synthetic response, as described in Section 8.2. In order to ensure positivity in the signal response domain, we clip the signal response at a minimum of 10 −8 . Positivity is a necessary property here for the Poisson likelihood to be well-defined. Applying the IFT formalism, the joint Hamiltonian H(s, d) is constructed using these prior statistics and the Poissonian likelihood. For the inference we use the geoVI algorithm [23] with four mirrored samples. After inference with geoVI, we obtain a set of samples that represent the posterior distribution of the quantity of interest, in this case the signal. From these samples, we can calculate the posterior mean and standard deviation of the signal and the signal response. Comparing the posterior mean of the signal (see Figure 9c) to the ground truth (see Figure 9a), one can see that most of the signal is sufficiently reconstructed. The positions and brightness values in the centre are reconstructed very well, whereas there are some errors in the outer parts of the image. Most areas with larger errors also appear to have high signal posterior standard deviation values (see Figure 9e) and therefore should not be considered as reliable derived estimates. Thus, these reconstruction errors are not only caused by the response approximation error, which happened in the training process discussed earlier, but also due to the fact that the extreme blurring by the synthetic response at the image borders in combination with the Poisson shot noise causes reconstruction ambiguities in the signal space. These ambiguities lead to a larger error and higher standard deviation of the posterior.
In order to see that the algorithm has converged and is working correctly, one can also compare the data (not shown) with the reconstructed signal response (see Figure 9d) by looking at their difference (see Figure 9f). In summary, it can be seen that the efficient response representation can be successfully applied in imaging algorithms.

Discussion
The need for efficient response representations in imaging led to the development of the models presented in this work, which were inspired by earlier research on butterfly matrices [11]. The efficient structure of butterfly matrices, inherited of fast Fourier transforms (FFT), results in a sub-quadratic algorithm scaling with O(N log N ) that is capable of representing an expensively simulated synthetic response up to 1% error. To this end, Net 4 , a butterfly convolutional network with three butterfly convolution operators (BCOs) in series, non-mirrored architecture, and flat application is used, which is differentiable and thus suitable for the application as a response in generative models for measurement data. Furthermore, we could show that by upscaling the network to a higher resolution the accuracy does not suffer while the representation becomes, as expected, sparser compared to a full matrix representation. This network is also successfully used as a response representation for a spatially variant point spread function (PSF) in an imaging example with synthetic generated photon count data following Poisson statistics. We expect the butterfly network representation to give comparably good results for spatially varying PSFs of real instruments such as Chandra or eROSITA. Although these are more complex, and radially asymmetric, this can be argued because the radial symmetry of the synthetic example is not part of the butterfly transform parameterization. In addition, typical PSFs from real instrument are smaller and therefore sparser than the synthetic PSFs in this work and therefore require fewer degrees of freedom. We also see in our example that most of the approximation error is made in the outer regions of the PSFs, indicating that the error also becomes smaller for smaller PSFs.
To improve the computational performance through GPU support and parallelization, more advanced machine learning platforms such as TensorFlow [25] or PyTorch [26] could be considered. After sufficient training, the corresponding butterfly network can be used to perform high-fidelity imaging using information field theory and NIFTy. Additionally, other fields of application with a connection to slightly inhomogeneous processes are imaginable. These also include time-or energy-dependant processes with variable correlations. All in all, the method to represent instrument response functions introduced in this work is promising to improve imaging with complex photographic instruments and thus should be considered in further research.