L 1-Minimization Algorithm for Bayesian Online Compressed Sensing

In this work, we propose a Bayesian online reconstruction algorithm for sparse signals based on Compressed Sensing and inspired by L1-regularization schemes. A previous work has introduced a mean-field approximation for the Bayesian online algorithm and has shown that it is possible to saturate the offline performance in the presence of Gaussian measurement noise when the signal generating distribution is known. Here, we build on these results and show that reconstruction is possible even if prior knowledge about the generation of the signal is limited, by introduction of a Laplace prior and of an extra Kullback–Leibler divergence minimization step for hyper-parameter learning.


Introduction
It has become commonplace to talk about the recent "information explosion" or "data deluge".These expressions refer to a much faster growth in data production compared to all available data storage and to the even more evident divergence between the volume of data produced and the general data processing capacity.This state of things begs for more efficient ways of storing and analyzing data.Natural and man-made signals tend to be compressible, which means their innate redundancies allow them to be expressed as a small number of combinations of its components-they are sparse in some basis.In the last few decades a lot of effort has been directed to the development of compression techniques that allow the sampled data to be rewritten to a reduced number of bits.However, these techniques mean superfluous costs on the gathering and processing of the data.When talking about medical imaging that includes radiation, unnecessary data gathering almost certainly signifies collateral and unwanted health effects in the long run for the patients.
Compressed sensing (CS) is a framework for signal processing that presents itself as a successful resource that has produced many techniques for efficient information extraction [1].CS utilizes the fact that real-world signals can typically be represented by a small combination of elemental components [2,3] to eliminate the data redundancy directly in its acquisition (in contrast with what occurs in a digital camera, for example, that captures millions of point measurements each time a picture is taken only to discard a large portion of this data immediately after its acquisition through a image compression algorithm).In a standard scenario, CS utilizes the sparsity property to enable the recovery of various signals from much fewer samples of linear measurements than required by the Nyquist-Shannon theorem [4,5].The Nyquist rate [6,7] is a concept present in virtually all signal acquisition protocols used in consumer electronics and medical imaging devices, among others [2].Unfortunately, it implies the necessity of a high sampling frequency, which means an especially high demand if we consider the now pervasive high definition registers.Between 2004 and 2006, Candès, Tao and Donoho [4,8,9] showed that sparse signals could be exactly reconstructed with below-the-Nyquist-rate sampling.Moreover (and perhaps what made it so intriguing), the reconstruction could be done in non-adaptive fashion through L1-minimization.These seminal works gave rise to CS [5].It should be noted that "sampling" here has a twist, though, in CS, a sample of the signal is obtained by calculating its inner product against different test functions.All in all, the popularity of CS in the last few years made it pervasive through various disciplines, from Computer Science to Statistical Physics.
In this work, we consider the scenario of a signal generated by an unknown distribution and its recovery by means of a Bayesian online CS scheme.There is a vast literature on Bayesian schemes for CS (e.g., [10][11][12]) and their reconstruction have been thoroughly examined, not least by the Statistical Physics community [13][14][15].Online reconstruction schemes can be invaluable in situations where real-time inference is needed (e.g., magnetic resonance imaging), when the data set is too large to be entirely stored on a hard disk (e.g., modern simulation science) or in the presence of time-variant signals (e.g., data streaming).Algorithms for Bayesian online learning are well-known in the neural networks community [16][17][18] and, in many fields of engineering such as optimal control theory, where they are known as Kalman filters [19][20][21].In addition, an interesting compromise between online and offline learning is the so-called mini-batch learning [22], which has recently also found its way to the CS scenario [23].In direct connection with this letter, previous work [24] has already established the possibility of online CS reconstruction, but it suffered from an important limitation, namely the necessity of an exact knowledge about the generating distribution of the signal.
This paper is organized as follows: in Section 2, the basic CS problem is defined, together with two reconstruction methods-the L1-minimization problem called LASSO (Least Absolute Shrinkage and Selection Operator) and the Bayesian scheme.Section 3 introduces the Bayesian online CS framework as presented in [24].In Section 4, the main contribution of this work, an extension for the online framework in which imperfect knowledge about the signal generating distribution is allowed, is described.Section 5 presents the main results and Section 6 summarizes and discusses the main findings of this work.

Problem Setup
Let x 0 = (x 0 i ) ∈ R N be a real N-dimensional signal where each component is generated by a sparsity-inducing distribution φ(x) = (1 − ρ)δ(x) + ρg(x).It is assumed that 0 < ρ < 1 and that g(x) does not have a finite mass at the origin.Sequential linear projections of the signal with known independent random measurement vectors ) are obtained at instants µ = 1, 2, . . . .Let these random projections u µ ≡ A µ • x 0 pass through a (possibly noisy) channel before their result becomes available-this way, the value of each measurement, y µ , will be distributed according to P(y µ |u µ ) (Figure 1).A common goal in Compressed Sensing is to accurately recover the signal x 0 based on the knowledge of D t = {(A 1 , y 1 ), (A 2 , y 2 ), . . ., (A t , y t )} and of the functional form of the channel P(y|A • x).Since the beginnings of CS, the staple technique in the field [5] has been the LASSO, an optimization problem with an L1-regularization term [25], where y = (y µ ) t µ=1 , A is a matrix with all vectors A µ as rows and the Lp-norms are defined as z p = (∑ i |z i | p ) 1/p .The seminal works [4,8,9] were pivotal in proving that perfect signal reconstruction in absence of noise is possible by means of (1) even in subsampling scenarios (i.e., for ρ < α < 1, with α ≡ t/N) .The Statistical Physics community is no stranger for these results, with typical behaviour analysis for large systems, where N → ∞ [26,27] being accepted as guarantees for the validity of the method.
More recently, Bayesian techniques for signal reconstruction have also been proposed (e.g., [10,28]).They consist in estimating x 0 from the posterior distribution It has been shown [11] that the Bayesian reconstruction typically performs better than the L1-minimization scheme for fewer measurements.In fact, the minimum mean squared error (mmse) estimator x(D t ) = dx xP(x|D t ) is guaranteed [14] to minimize the mean square error mse defined for any possible estimate x as:

Bayesian Online Compressed Sensing
In this work, reconstruction of the signal is done in an online manner.That is, measurements are used one at a time and discarded thereafter (in opposition to the offline/batch scenarios presented above where the whole dataset D t is available during the entire reconstruction).In addition, we would like to achieve this objective with limited knowledge about the generating distribution φ(x).
Bayes' Theorem (2) can be easily transformed into an online update recursion [16].Since all measurements are independent by design, the likelihood can be factored as This way, the posterior distribution P(x|D t ) for the signal x after t measurements can be written In general, distribution P(x|D t ) can be complicated, so the calculation of x(D t ) is potentially difficult.Consider for now the Compressed Sensing scenario where the prior distribution is exactly the same as the known signal generating distribution φ(x).The δ(x) factor here adds a singularity to the prior φ(x), so that the posterior should not approximate a Gaussian distribution even for t → ∞, which was a crucial hypothesis in previous works [16] and would greatly simplify asymptotic calculations.In absence of such a simplification, a previous work [24] introduced the following mean-field approximation: for the posterior distribution P(x|D t ) in Label (5) as a device to facilitate marginal computations and (as a consequence) their expected values.In this expression, {(a t i , h t i )} is a set of natural parameters and is the normalization of the marginal distribution P i (x i |D t ) = dx \i P(x|D t ), where dx \i denotes integration over all {x j } j =i .So as not to introduce any biases, we define The mmse estimate of the signal and its variance can then be readily calculated from the approximate posterior as respectively.Equations ( 5) and ( 6) correspond to a sequence of update and projection steps-the update step adds new information obtained with the most recent measurement t, but transforms the posterior into a possibly intractable distribution; the projection step then returns this distribution to an easier exponential representation with independent components (Figure 2).The full scheme can be summarized as 2N update rules for the parameters {(a t i , h t i )} [24].
where Ω t (∆ t+1 ) = DzP(y t+1 |∆ t+1 + χ t+1 z) and Dz = (dz/ √ 2π) exp(−z 2 /2) is the standard Gaussian measure.It has been shown [24] that this online algorithm has an asymptotic error decay comparable to the offline reconstruction scheme when additive measurement noise is present.In the noiseless scenario, the offline Bayesian algorithm achieves zero error for finite t, while the online version decays exponentially.

Mismatched Priors and L1-Minimization Based Reconstruction
A limitation of the framework introduced above is the simple fact that, in a real-world scenario, one does not know the exact parameters of the generating distribution, and possibly not even some of its major characteristics except for the fact that the generated signals are somewhat sparse.Using this algorithm as a canvas, we propose a strategy based on a Bayesian formulation of the L1-regularized problem, relying on the use of the sparsity-inducing Laplace prior φ λ (x) = (2/λ) exp(−λ|x|) (12) and a learning scheme for λ concurrent with the measurements.The use of Laplace-like priors is not new in the literature [29][30][31], but, to the best of our knowledge, this has not been introduced as part of an online algorithm yet.We observe that the L1-estimator (1) can also be written in a Bayesian manner as xt = dx x lim β→∞ P L1 (x|D t ; β, λ), where with Z β,λ a normalization factor dependent on β and λ.This formulation suggests the replacement of the generating distribution φ(x) in Label ( 6) for φ λ (x), effectively altering the projection step of the Bayesian online algorithm to the similar expression where Z(a, h; λ) := dx exp −ax 2 /2 + hx − λ|x| .The first and second moments of this alternate approximate distribution are, respectively, paralleling Equations ( 8) and ( 9).Note that the natural parameters update Equations ( 10) and ( 11) remain unchanged, since they are independent of our choice of prior, be it φ(x), φ λ (x), or any other.
A common practical issue when making use of the L1-regularization scheme (1) is that the best value of the shrinkage parameter λ is data-dependent and not always clear.Nevertheless, its value is highly influential on the reconstructed model-there is a direct relation between λ and the number of non-zero coefficients in the estimate, which is reflected on the prediction accuracy of the model [32].Therefore, since it is crucial to estimate accurate values of the shrinkage parameter, usual strategies in offline scenarios include calculation of its best value through cross-validation procedures [25] or straight calculation of an ideal number of non-zero covariates by adding one at a time in a recursive manner [32].Unfortunately, these strategies are of little use in our present setting, where each measurement has to be used only once in an online manner.
Given any two distributions q(x) and r(x), the Kullback-Leibler divergence [33], D KL (q||r) = dx q(x) ln[q(x)/r(x)] is a common measure for their dissimilarity.Also called relative entropy, it has the interesting property that D KL (q||r) ≥ 0 always and it is zero if and only if q = r.Now, whenever distribution r(x) is of the exponential family, i.e., f (x) ∝ ∑ k ξ k s k (x), minimization of D KL (q||r) with respect to ξ k is equivalent to matching the moments dx q(x)s k (x) and dx r(x)s k (x).In particular, for the exact posterior P as defined in the left-hand side of Label (2), minimization of D KL (P|| P λ ) with respect to λ after measurement t, i.e., finding λ t such that is the same as solving the implicit equation for λ.Here, |x| P is the best possible estimate for the true sparsity of the original signal, Q 0 ≡ dx |x|φ(x).Our proposal is that λ t is, at any instant t, an optimal estimate for the shrinkage parameter.Moreover, it is not static.With the natural parameters {(a t i , h t i )} evolving, λ t must change as well to maintain |x| P λ Q 0 .Alas, the left-hand side of expression ( 18) is unknown.To overcome this difficulty, we make the approximation |x| P |m t |.An argument can be made in its defense: as more and more measurements are acquired and the approximate posterior distribution (14) The first limit has been proved in the large signal size limit N → ∞ in [24] for an exact prior; here, we assume its validity also in the present scenario.As for the second limit, Minkowski inequality guarantees that |x| P λ ≤ |m t |; with the collapse of the distribution P i (x i ) around the true value x 0 i at t → ∞, no probability mass crosses the x i = 0 axis, so that the equality limit is met.Therefore, these results mean a good estimate for the solution of ( 18) is given by λt that solves for λ, where the right-hand side can be directly calculated from P λ t−1 (x|D t ).This implicit expression corresponds to an extra step in the online algorithm for the recovery of the signal, which is shown in full below (Algorithm 1).
Algorithm 1 Online L1-based signal recovery for CS.
Find λt Equation ( Update Estimate signal means m t i and variances v t i Equations ( 15) and ( 16) 8: end while 9: return m t

Results and Discussion
We tested the algorithm on sparse signals generated by two different distributions of the form φ The δ(x) factor induces sparsity by making every component equal to zero with probability 1 − ρ.The non-sparse components could either come from a Gaussian distribution g(x) = (1/ √ 2π) exp(−x 2 /2) or from a binary distribution g(x) = (δ(x + 1) + δ(x − 1))/2.To better evaluate the accuracy losses incurred due to approximation (19), we also tested the algorithm for the ideal situation where Q 0 is known.To guarantee practical usefulness, all results correspond to the standard CS scenario Its noiseless counterpart can be obtained by taking the limit σ 2 n → 0, which leads to P(y|u) = δ(y − u). Figure 3 shows the average mean squared error normalized by x 0 2 2 resulting of simulations of the L1-based online CS algorithm with λ learning.Simulations for finite signal length size N showed decreasing error for increasing N-in order to correct for finite size effects, the curves in Figure 3 were obtained by extrapolating the results for finite N = 200, 500, 1000 and 2000 to N → ∞ by means of a quadratic fit.As expected, approximation (19) induces a significant accuracy loss, but, for α ≥ 1, the reconstruction error decays approximately exponentially, proving the usefulness of the reconstruction scheme.In all the examples shown, the solution to Equation (19) was obtained by numerically searching the value λt which best approximates both sides, i.e., by finding λt = arg min λ [ |x| P λ − |m t |] 2 .The empirical results prove what could have been a reasonable assumption: Figure 4 shows that in the early stages when only a few measurements have been obtained, even for an exact prior, |m t | is not an excellent approximation of |x| P .It also shows that the approximation gets progressively better with more measurements.For this reason, we found that the use of a small damping factor γ in the update of λ provides faster and more accurate learning.
Finding an optimal value for the sparsity-inducing parameter λ is not trivial.Ideally, one desires to find λ such that the P λ (x) would represent the actual sparsity of the original signal |x 0 i | .The limit λ → ∞ is undesirable, since the soft constraints in P λ (x) due to the finite value of the natural parameters {(a t i , h t i )} would not be enough to prevent the prior distribution to completely dominate the entire approximate posterior, leading all signal estimates to zero.At the same time, consider as a representative example the update Equations ( 10) and (11) for the noisy standard CS scenario, which can explicitly written as a t+1 i which means that all the natural parameters a t i = ∑ t µ=1 da µ i typically grow with more and more measurements.Similarly, whenever x 0 i is different from zero, any parameters h t i = ∑ t µ=1 dh µ i exhibit a growth in absolute value lead by the m i factor.For any fixed λ, these considerations mean that the regularizing effect of the prior φ λ (x) in Label ( 14) would vanish asymptotically in what would be a typical case of a prior being dominated by the likelihood.It is noteworthy that the introduction of a step for the minimization of the KL-divergence in the Bayesian online algorithm results in estimates λt that typically increase with α as well, in a scale similar to the other natural parameters (Figure 5).

Conclusions
In this paper, we proposed an online algorithm for Compressed Sensing.Based on an earlier version that expected exact knowledge of the signal generating distribution [24], the present adaptation made use of a double-exponential prior (12) to get rid of this limitation.In order to keep the signal sparsity always in focus during the signal reconstruction, we introduced in this work an extra step where an optimal value for the sparsity-inducing parameter λ can be estimated as a function of {(a t i , h t i )} (i.e., a function of the data D t ).This was achieved through minimization of the KL-divergence

Figure 1 .
Figure 1.Schematic representation of the measurement process in Compressed Sensing.The sparse signal is projected onto a (known) random measurement vector A. The result u = A • x goes through a channel giving rise to the value y ∼ P(y|u).

Figure 4 .
Figure 4. Difference between |x| P (from Label (6), where the exact prior is considered) and the true signal sparsity |x 0 i | |.The full line corresponds to the average of 100 simulations of the noisy standard CS scenario with N = 1000, ρ = 0.2, σ 2 n = 10 −4 and g(x) = (1/ √ 2π) exp(−x 2 /2).The dashed line is the true sparsity of the original signal Q 0 = dx 0 |x 0 |φ(x 0 ).Inset: Mean variance.Notice that the absolute value approximation gets progressively better with growing α.Indeed, its accuracy matches the diminishing of the posterior distribution's variance.

Figure 5 .
Figure 5.An example of the online CS algorithm where λ has been adjusted after all measurements so that the inferred signal sparsity was equal to the known value Q 0 .Note that λ t , just like the natural parameters {(a t i , h t i )}, typically grows exponentially.Here, N = 1000, ρ = 0.1, σ 2 n = 0 and g(x) = (1/ √ 2π) exp(−x 2 /2).