On the Complementarity of Sparse L0 and CEL0 Regularized Loss Landscapes for DOA Estimation

L0 sparse methods are not widespread in Direction-Of-Arrival (DOA) estimation yet, despite their potential superiority over classical methods in difficult scenarios. This comes from the difficulties encountered for global optimization on hill-climbing error surfaces. In this paper, we explore the loss landscapes of L0 and Continuous Exact L0 (CEL0) regularized problems in order to design a new optimization scheme. As expected, we observe that the recently introduced CEL0 penalty leads to an error surface with less local minima than the L0 one. This property explains the good behavior of the CEL0-regularized sparse DOA estimation problem for well-separated sources. Unfortunately, CEL0-regularized landscape enlarges L0-basins in the middle of close sources, and CEL0 methods are thus unable to resolve two close sources. Consequently, we propose to alternate between both error surfaces to increase the probability of reaching the global solution. Experiments show that the proposed approach offers better performance than existing ones, and particularly an enhanced resolution limit.


Introduction
The study of Direction-Of-Arrival (DOA) estimation has a long history in signal processing. Conventional methods [1] such as beamforming or Capon's method are still the subject of numerous works, e.g., [2]. However, they present degraded performance in the presence of multiple close sources. Subspace-based methods such as MUltiple SIgnal Classification (MUSIC) have been introduced to improve the resolution limit for multiple sources. Unfortunately, these methods fail in presence of correlated sources [3]. They also often require a priori knowledge of the number of sources and need a sufficient number of snapshots. Sparse DOA estimation has received much attention in the last decade due to its potential performance in such scenarios [4][5][6][7][8][9][10].
Sparsity naturally arises in DOA estimation when considering a discretization of the field-of-view in numerous candidate angles of arrival on a grid. The aim is to estimate a vector γ whose dimension covers the whole grid and whose only the very few entries corresponding to sources DOA are non-zero.
The purpose of sparse estimation, under the Single Measurement Vector (SMV) framework, is to retrieve the sparse vector γ ∈ C G from the noisy measurement y = Bγ + w, with y ∈ C N 2 , N 2 G , knowing a dictionary B. The dictionary B depends on the array's responses for the different angles of arrival candidates. It can be formulated as the following regularized problem: where the so-called 0 -norm is defined as γ 0 = Card g ∈ {1, . . . , G} : γ g = 0 , γ g being the the g-th component of vector γ. The 0 -norm is the natural measure of sparsity: it counts the number of non zero components of the vector. The regularization parameter λ aims to balance the relative importance between the data fidelity term 1/2 Bγ − y 2 2 and the 0 -norm enforcing sparsity of the solution. The sparse estimation problem can also be formulated as a constraint problem. The relationship between the two 0 -problems has been studied in [11]. Based on this study, recent theoretical results [12,13] have been provided for an off-line selection of λ so that the 0 -problems are equivalent. The regularization parameter is here chosen in accordance with those results.
The 0 -minimization problem is known to be NP-hard: its resolution usually requires an exhaustive search. The use of the very recently proposed global optimization method [14] is limited to small size problems and is thus unadapted here because of its huge computational cost. So far, many suboptimal methods have therefore been proposed, as the well-known Iterative Hard Thresholding (IHT) algorithm. IHT is a proximal gradient descent algorithm: it iteratively produces estimatesγ (i) so that the cost function J 0 decreases, starting from an initial pointγ (0) . However, the 0 -regularized error surface J 0 exhibits numerous local minima, and convergence is only proved to a stationary point. Convex relaxation of (1) by the 1 -norm is also a popular alternative. However, conditions [15] under which the sparse vector can be reliably recovered are usually too restrictive for practical applications as in DOA estimation.
More recently, minimization of a regularized criterion using nonsmooth nonconvex but continuous penalties has drawn considerable attention [16], and it has been shown in many applications that it can yield significantly better performance than with using the 1 -norm [17]. Such penalties include q -norms (0 < q < 1), Smoothly Clipped Absolute Deviation (SCAD), and Minimax Concave Penalty (MCP) [18]. The Continuous Exact 0 (CEL0) penalty [19] corresponds to the limit case of MCP for the SMV framework. CEL0 is shown to suppress some local minima of J 0 while preserving the global one. The CEL0-regularized cost function is: with and α i = 1 / B ·,i 2 2 , B ·,i being the i-th column of matrix B. 1 is the indicator function whose value is one if the given condition is respected and zero otherwise. Despite its promising interest, we have shown in [12] that traditional suboptimal optimization schemes of CEL0regularized functional as Iterative Reweighted 1 (IRL1) or Forward Backward (FB) are unable to resolve close sources. The aim of this paper is to investigate the properties of J 0 and J CEL0 loss surfaces in order to propose a sparse optimization strategy to resolve close sources . The goal is to improve the resolution limit of both MUSIC method, limited for low Signal-to-Noise Ratios, and existing sparse methods. The proposed approach follows an iterative scheme that requires little computational cost. It shows good performance for close sources and does not require any particular initialization.
Outline of the paper: we first explain the model and the sparse DOA estimation problem in Section 2. In Section 3, we compare J 0 and J CEL0 loss surfaces in the context of multi-source DOA estimation: an in-deep analysis of the minimizers is provided for two close sources. Based on this analysis, Section 4 presents the proposed optimization scheme, whose originality is to alternate between both loss surfaces. Numerical simulations of Section 5 finally show the validity and advantages of our approach.
Notations: Upper-case and lower-case boldface letters denote matrices and vectors, respectively. · * denotes the conjugate, · T the transpose and · H the conjugate transpose of a vector or matrix. x i is the i th component of vector x, and ω i the i th component of the set ω. Given a matrix X, the i th column is denoted X ·,i . Considering a matrix X of dimension N × G, X ω is the submatrix of X containing the columns indicated by the set ω ⊆ I G , where I G = {1, . . . , G} is the ordered index set. Similarly, x ω is the subvector of x defined as: with ω the number of elements in ω.

On-Grid Array Signal Modeling
Consider M far field narrow band sources impinging an array of N antennas from anglesθ m , m = 1 . . . M. For a single snapshot at time t, the output array signal where a(θ m ) is the steering vector (or array response) for the directionθ m ,s m (t) the complex envelope of the signal of the mth source, and n(t) ∈ C N a white gaussian noise vector of covariance E n(t)n H (t) = σ 2 n I N , where I N is the N × N identity matrix. Let us suppose the directions of the sources are part of a predefined set Θ = {θ 1 , . . . , θ G } resulting from the discretization of the field-of-view, with G N: for all arrival anglesθ m , m ∈ [1, . . . , M], there exists g ∈ [1, . . . , G] such thatθ m = θ g . This assumption is often considered in operational systems to measure the calibration table A = [a(θ 1 ), . . . , a(θ G )] containing the array responses for the angles in Θ. Considering this calibration table, the measurement x(t) can be expressed as: where s(t) ∈ C G is sparse with only M non-zero entries corresponding to the sources signalss m (t), with M G. Under the assumption that the sources are uncorrelated, it is interesting to use the vectorized covariance matrix in algorithms: it allows us to consider the contribution of multiple snapshots thus increasing the accuracy, without increasing the computational cost by much. It also has the advantage of being a SMV model, thus all associated methods can be used, and it additionally gives the possibility of estimating more sources than the number of sensors.

Vectorized Covariance Matrix Model
Considering uncorrelated sources, the covariance matrix R xx= E[x(t)x H (t)] is given by: withγ m the power of the mth source and I N the square identity matrix of dimension N. The vectorized covariance matrix, noted as r = vec(R xx ), is the vector obtained from the concatenation of the columns of R xx . It can be expressed as with b = a * ⊗ a, a * is the conjugate of a, and ⊗ is the Kronecker product. Considering a dictionary B = [b(θ 1 ), . . . , b(θ G )] computed from the calibration matrix A, we have r = Bγ + σ 2 n vec(I N ). Considering K finite samples, the covariance matrix is estimated bŷ , and we denoter the associated estimated vectorized covariance matrix. Let us suppose that the power of the noise is known. We consider the noisy observation vector y ∈ C N 2 computed as: with B ∈ C N 2 ×G . The noise vector w results from the estimation of R xx with a finite number of snapshots. The power vector γ ∈ C G is sparse and the indices of non-zero components indicate the directions of the sources. The aim of sparse DOA estimation is to retrieve the indices of non-zero components of vector γ in Equation (8), through the resolution of the problem given by Equation (1).

Description and Numerical Investigations of the Minimizers of J 0 and J CEL0
It is known that there are numerous local minima in J 0 , which complicates the minimization of the criterion. Initialization of iterative descent algorithms is particularly delicate, as also highlighted in the DOA estimation literature. In [12], we have successfully used CEL0 penalty for DOA estimation of well-separated sources. Those good results do not seem to be transposed to the case of close sources. It is important now to analyze more deeply the minimizers of 0 and CEL0 penalized problems in order to propose an optimization scheme able to resolve closer sources.

Simulation Setup
Although our approach is array and scenario independent, we illustrate it in this paper with the following setup. We consider an Uniform Circular Array (UCA) with N = 7 antennas and radius d = λ 0/2, where λ 0 is the wavelength. This array could allow for twodimensions direction-of-arrival estimation, but we limit ourselves to azimuth estimation. The −3 dB beamwidth of this array is 40°. UCAs are well known for their θ invariant performance. We study the case of M = 2 incoming sources located atθ 1 = 32°and a varyingθ 2 . The number of snapshots is fixed to K = 50. In this part, the received signal is noiseless. The field-of-view is the range [0, 360]°with a grid spacing of 0.5°(G = 720). The mutual coherence, corresponding to the maximum absolute correlation between two columns of the dictionary, is in this case close to 1: 1 methods are thus ineffective.

Minimizers of J 0
Let us define I G = {1, . . . , G} the ordered index set. For a given observation y ∈ C G and a set ω ⊆ I G , we define the constrained problem (C ω ) as follows: where γ i is the ith component of vector γ. Let us noteγ ω the subvector ofγ composed only with the terms indicated by ω :γ ω = γ ω 1 , . . . ,γ ω ω T , with ω the number of elements in ω. The constraint ensures the solutionsγ are sparse vectors whose indices of all non zero entries are in ω, i.e., all components whose indices are not in ω are null. We can then writeγ = Z p (γ ω ), where Z p is a zero padding operator in I G : For any ω ⊆ I G ,γ ∈ C G solves (C ω ) if and only ifγ ω ∈ C ω solves B H ω B ω x = B H ω y andγ = Z p (γ ω ). B ω is the submatrix of B composed only with the columns whose indices are in ω.
There are strong connections between the minimizers of the constrained problem (C ω ) and the minimizers of the regularized criterion J 0 (λ, ·) that we want to analyze. For y ∈ C N , given a set ω ⊆ I G , letγ solve problem (C ω ). Then for any λ, J 0 (λ, ·) reaches a (local) minimum atγ (Proposition 2.3 [20]). Conversely, for y ∈ C N and λ > 0, let J 0 (λ, ·) have a (local) minimum atγ. Thenγ solves (Cω) forω = supp(γ) (Lemma 2.4 [20]). Moreover, the (local) minimum that J 0 (λ, ·) has atγ is strict iff rank(Bω) = ω (Theorem 3.2 [20]). There is thus a large number of local minima: the number of supports ω ⊆ I G such that rank(B ω ) = ω that lead to strict local minima is upper bounded by ∑ N 2 k=0 ( G k ). In [20], it is shown that under mild conditions, J 0 (λ, ·) have a unique strict global minimizer. It is common knowledge that the optimal solution of the regularized problem given by Equation (1) depends on the regularization parameter λ, which balances the relative importance between data fidelity and sparsity. In most papers, this parameter is empirically tuned. In previous works [12,13], we proposed a theoretical analysis for an off-line selection of λ. In the sequel of this paper, λ will be selected in an appropriate interval I as defined in [12,13]. Figure 1a-d represents projections of J 0 for the scenario described above, forθ 1 = 32°a ndθ 2 = 62°in the noiseless case. Iso-levels of J 0 are reported for a vector γ having at most two non-zero components: γ = Z p (γ ω ) with ω = 2. Those components correspond to fixed directions θ ω 1 =θ 1 = 32°and θ ω 2 which changes on the different figures. On (a), θ ω 2 = 47°, which corresponds to 1 /2 θ 1 +θ 2 ; on (d), θ ω 2 =θ 2 = 62°. In between, we set: θ ω 2 = 52°and 57°. The values of the two components γ ω 1 and γ ω 2 , which are the only components allowed to be non-zero, are varying along the two axis. λ is fixed to 9.5, which belongs to the interval I. In each figure, we see four (local) minima: the local minimum at 0, local minima along the axis (i.e., one non-zero component), and those corresponding to strictly two non-zero components. The global minimum (black filled circle) is located on Figure 1d for γ ω 1 = γ ω 2 = 7 and its value is 2λ = 19.

Minimizers of J CEL0
The global minimizer of J 0 is preserved in J CEL0 , but the number of local minima of J CEL0 may be inferior to the number of local minima of J 0 . Particularly, a local minimum γ of J CEL0 verifies |γ i | ∈ {0} ∪ [ √ 2λ, +∞); hence local minimizers of J 0 having at least one component |γ i | ∈ (0, √ 2λ) are not local minimizers of J CEL0 [21]. Figure 1e-h represents the loss surfaces of J CEL0 . We observe the suppression of local minima on J CEL0 for θ ω 2 close to 1 /2 θ 1 +θ 2 = 47°, for which 0 < γ ω 2 < √ 2λ = 4.36. Some local minima of J 0 are also only critical points of J CEL0 . Moreover, the local minima that J 0 has at 0 is no longer one in J CEL0 , which is particularly interesting for the initialization of iterative optimization algorithms.
However, those good properties also have a disadvantage: it leads to "flat" minima, i.e., large connected regions where the error remains approximately constant. This is illustrated on Figure 2, comparing the minimum of J 0 and J CEL0 as a function of θ ω 1 and θ ω 2 for two (or one) non-zero components. Numerous points of the CEL0 surface are approximately at the level of the local minima corresponding to θ ω 1 = θ ω 2 = 1 /2(θ 1 +θ 2 ). This behavior appears for close sources when this point is also close to the global minimum. (h) J CEL0 , θ ω 1 = 32°, θ ω 2 = 62°F igure 1. Loss surfaces of J 0 (a-d) and J CEL0 (e-h) as a function of γ ω 1 and γ ω 2 , for γ = Z p (γ ω ), with ω = 2, i.e., at most two non-zero components corresponding to directions θ ω 1 =θ 1 and a varying θ ω 2 . When those directions correspond to the true onesθ 1 = 32°andθ 2 = 62°(d and h) the global minimum indicated by a black filled circle (γ ω 1 = γ ω 2 = 7) is equal to 2λ = 19. Local minima are indicated by the blue asterisks, while light blue diamonds represent critical points that are not local minima.

Alternating between Loss Surfaces
For well separated sources, we have shown in [12] than IRL1 algorithm used to minimize J CEL0 (IRL1-CEL0) gives better statistical results than IHT and at a lower computational cost. Indeed, IRL1-CEL0 benefits from the suppression of local minima in this case. Unfortunately, IRL1-CEL0 fails for close sources. This behavior is illustrated on Figure 3a, for true sources at 32°and 48°. Let us note that MUSIC fails for such close sources. The IRL1-CEL0 algorithm is rapidly attracted by a local bad basin corresponding to a few non-zero components for directions in the middle of the true directions (lack of resolution). In this example, denotingγ the final estimated vector andω the set indicating the non-zero components, we verify that B Ĥ ω Bωγω = B Ĥ ω y and rank(Bω) = ω = 5: it is a strict local minimum of J 0 . The IHT algorithm also remains stuck in a local minimum with this time numerous non-zero components (Figure 3b), forming two clusters around the true directions. In order to avoid being attracted by a bad basin and take advantage of both regularizations, we propose to alternate the minimization between them. Based on this heuristic, we propose the optimization scheme ALICE-L0 (Alternated Landscapes Iterations for Complementary Enhancement for 0 ) detailed on Algorithm 1.

Algorithm 1. Optimization Scheme ALICE-L0 (Alternated Landscapes Iterations for Complementary Enhancement for 0 )
Input: dictionary B, observation y, β = 1 /L, L Lipschitz-constant of B, τ 1 , τ 2 , stopping criteria Initialization: > lim and i outer < n lim do • i outer = i outer + 1 • Compute w weighting vector by: > lim,1 and j < n lim,1 do (weighted FISTA iterations) > lim,2 and k < n lim,2 do (IHT iterations) Figure 3. Solutions as iterations go by for close sources with no noise, for λ = 0.78. X-axis: iteration number. Y-axis: directions associated with components ofγ. The color represents the level of the components. True sourcesθ 1 = 32°and θ 2 = 48°are indicated by the black doted lines. Corresponding components of the optimal solution are equal to 7, all others are null.
We start the minimization considering the CEL0-regularized functional, and usinĝ γ (0) = 0 as initialization. Indeed, we previously saw that this local minimum in J 0 is suppressed in J CEL0 . Iterations of the weighted Fast Iterative Soft Thresholding Algorithm (weighted FISTA) aim to minimize the convex majorizer of the nonconvex CEL0 functional. For a weighting vector w, iterations use the proximal of the weighted 1 function, defined component by component as: After some iterations, the estimated vector is used as initialization for IHT, which performs minimization steps over the 0 -regularized cost function. The hard threshold corresponds to the proximal of the 0 -norm, defined by: We then loop back to alternate between the loss surfaces. The behavior of our algorithm is represented on Figure 3c for close sources (in this article, we set lim = 1 × 10 −6 , lim,1 = 1 × 10 −2 , lim,2 = 1 × 10 −6 , n lim = 2000, n lim,1 = 200, n lim,2 = 200, τ 1 = 0.9, τ 2 = 1). In this noiseless case, this is the only algorithm attaining the global minimum of J 0 .

Statistical Performance
The purpose of this section is to numerically quantify the algorithms' performance as a function of the sources' separation. For that, two criteria will be used: the first one is the percentage of outliers (only one estimated direction or located at more than half a beamwidth of the true directions), and the second one the Root-Mean-Square-Error (RMSE) between estimated and true directions, calculated without outliers. The simulation setup is the one described in Section 3.1, for a Signal-to-Noise Ratio per source equal to 0 dB. Results are presented in Figure 4: we observe that the proposed scheme ALICE-L0 outperforms other methods in terms of statistical accuracy and resolution limit. Let us note that the behavior of IHT is unreliable with noise, and thus statistical results are not presented.

Conclusions
We linked the operating limits of IHT and IRL1-CEL0 to the properties of corresponding loss landscapes in DOA estimation. To avoid the weaknesses of both criteria, an optimization scheme is proposed using alternatively J 0 and J CEL0 , i.e., alternating between the two regularizations. A particular implementation using λ obtained by [12,13] has been successfully tested, improving for example the resolution limit. Ongoing work concerns algorithm parameters (when to change of regularization) which are here left to the user.