Solution Strategies for Linear Inverse Problems in Spatial Audio Signal Processing

The aim of this study was to compare algorithms for solving inverse problems generally encountered in spatial audio signal processing. Tikhonov regularization is typically utilized to solve overdetermined linear systems in which the regularization parameter is selected by the golden section search (GSS) algorithm. For underdetermined problems with sparse solutions, several iterative compressive sampling (CS) methods are suggested as alternatives to traditional convex optimization (CVX) methods that are computationally expensive. The focal underdetermined system solver (FOCUSS), the steepest descent (SD) method, Newton’s (NT) method, and the conjugate gradient (CG) method were developed to solve CS problems more efficiently in this study. These algorithms were compared in terms of problems, including source localization and separation, noise source identification, and analysis and synthesis of sound fields, by using a uniform linear array (ULA), a uniform circular array (UCA), and a random array. The derived results are discussed herein and guidelines for the application of these algorithms are summarized.


Introduction
Numerous inverse problems exist in the field of acoustics.For example, nearfield acoustic holography (NAH) is a noise source identification method that reconstructs a surface field of the source on the basis of sound pressure measured in the nearfield of the source [1][2][3][4][5].The deconvolution approach for the mapping of acoustic sources (DAMAS) is also a method for noise source identification [6].Another example is the source signal separation problem, where an individual source signal is to be extracted from a mixed array of signals [7].Inverse problems can also be found in source sound field synthesis (SFS) problems, where the sound field produced by secondary sources is to be matched with a target field [8,9].Other examples include sound field control [10,11], crosstalk cancellation in binaural audio rendering [12], noise reduction in speech enhancement [13], room response equalization, and dereverberation [14,15].In linear range acoustics, each of these problems can be formulated as a linear system (Ax = b).The current study focused on the solutions of farfield noise source identification problems, sound source localization and separation problems, and sound field analysis (SFA) and synthesis (SFS) problems.Although inverse solutions of acoustic problems have long been investigated by researchers, according to our review of the literature, no conclusive results that give solution strategies and parameter choice guidelines can be found in the literature.Furthermore, although audio quality is the chief concern in practical audio reproduction, most previous research has examined general numerical accuracy and stability.This study explored these problems from the perspective of reproduced signal quality.Solution strategies were compared in a unified framework, and guidelines of parameter choice are summarized herein.
In general, inverse problems can be divided into two categories: overdetermined and square systems, and underdetermined systems.To solve overdetermined systems, least-squares methods, Tikhonov regularization (TIKR) [16], and truncated singular value decomposition (TSVD) are commonly used.Traditionally, Morozov's discrepancy principle, generalized cross-validation (GCV), and the L-curve method can be used to choose the regularization parameter in the TIKR method [17][18][19][20].However, solution methods that are better suited to audio applications than conventional approaches are proposed in this work.In particular, golden section search (GSS) [21] is applied to find optimal regularization parameters.To solve underdetermined problems, compressive sampling (CS) [22,23] solved by using convex optimization (CVX) [24][25][26] is a widely used approach that is known to be computationally expensive.In the present study, computationally efficient iterative approaches that incorporate sparsity constraints, including the focal underdetermined system solver (FOCUSS) [27], steepest descent (SD), Newton's (NT) method, and conjugate gradient (CG), were developed.These algorithms were compared for several audio application scenarios.The first scenario was sound source localization and separation using a uniform linear array (ULA) and a uniform circular array (UCA).To assess the separation quality, perceptual evaluation of audio quality (PEAQ), perceptual evaluation of speech quality (PESQ), and segmental signal-to-noise ratio (segSNR) were adopted [13,28].The second scenario was concerned with analyses and syntheses of sound fields.Recently, an integrated array system was developed on the basis of a freefield model for spatial audio recording and reproduction [8,9].This study extended the previous work to a reverberant environment; a live room was fitted with reflective walls.For the SFA, a 24-element circular microphone array was utilized to encode the sound field based on plane-wave decomposition, whereas in the SFS, a 32-element rectangular loudspeaker array was employed to decode the encoded sound field using three approaches.The third scenario was sound source localization and separation using a random array.

Inverse Solution Algorithms
In this section, an array model is given, along with its assumptions.Assume that the sound sources are at the farfield of the microphone array such that sound waves impinging on the array can be regarded as plane waves.In the following array model, time-harmonic dependence e jωt , j = √ −1 and ω as angular frequency, is assumed so that the model is essentially formulated in the frequency domain.M microphones and N sources are considered.The array pressure vector can be expressed as [29].
where p ∈ C M is the sound pressure vector received by the microphone, s ∈ C N is the source amplitude vector, v is the noise vector, and A ∈ C M×N is the steering matrix associated with the sources.Therefore, given the pressure measurement p and the steering matrix A, solving the problem of Equation (1) for the unknown source amplitude vector s is a linear inverse problem.Linear inverse problems can be divided into three categories: square systems (M = N), overdetermined systems (M > N), and underdetermined systems (M < N).In the following, solution strategies are presented for these categories.

TSVD and Least-Squares Problems
The most basic approach [16][17][18] to solve linear inverse problems is the least-squares method in which the following cost function is minimized: where e = p − As denotes the error vector and " • 2 " denotes the vector 2-norm.If matrix A is of full-column rank, the least-squares solution can be written as More generally, by the TSVD of where Σ represents a diagonal matrix with singular values at its diagonal entries, U and V represent unitary matrices, and A + represents the pseudoinverse of the matrix A defined as where Σ + = diag σ −1 1 , . . ., σ −1 r , 0, . . ., 0 ∈ C N×M , r = rank(A) [30,31].Note that the expression of Equation ( 4) is sufficiently general that it always provides the minimum-norm least-squares solution, even when the matrix A is rank-deficient.The square of the residual error becomes with u i being the ith column of U and I being the identity matrix.
In practice, the matrix A can contain small singular values and can be very ill-conditioned, which leads to numerical instability during the inversion of A. Two common methods to cope with the ill-posedness of the problem are TSVD and TIKR.Briefly, the TSVD method is simply to discard small singular values of the matrix A, whereas TIKR involves attempts to minimize the following cost function [16]: where the regularization parameter β weights the residual norm and the solution norm.After some manipulation, we derive with the following optimal solution: This result can be rewritten in terms of the TSVD of A as follows: where , with u i and v i being the ith column partition of the matrices U and V, denotes the filter function.It can also be shown that the minimum residual vector can be written as α i u i is the residual vector of the components of p orthogonal to The residual norm can be written as From Equation ( 9), the solution 2-norm can be written as

Choice of Regularization Parameter β
In traditional solution strategies for inverse problems, methods are available for choosing regularization parameters, such as Morozov's discrepancy principle, the Generalized Cross-Validation (GCV) method, and the L-curve method [19,20].The first two methods have been discussed extensively in the literature.Therefore, we subsequently focus on only the L-curve method for brevity.
The L-curve method is widely used for choosing regularization parameters in inverse solutions.In the curve, the solution norm is plotted versus the residual norm by varying the regularization parameter.From Equations ( 9) and (11), it is straightforward to find the solution norm and the residual norm Regularization helps to improve robustness against system perturbation and measurement noise.Insights can be gained by writing the solution error as where is the regularization error and A # e is the perturbation error.
Hence, when β → 0, f i → 1 for very well-conditioned problems with high signal-to-noise ratio (SNR) measurements, the solution error is dominated by the perturbation error and a few high-order modes are filtered out.In this case, the solution norm is sensitive to the choice of β.The solution tends to be undersmoothed and susceptible to measurement noise.Conversely, when β → ∞, f i → 0 for very ill-conditioned problems with low SNR measurements, the solution error is dominated by regularization error and numerous high-order modes are filtered out.The solution tends to be oversmoothed and fine details such as resolution are thus lost.In this case, the residual norm is sensitive to the choice of β.
The parameter β acts as a weighting factor between the residual norms and the solution norm.Choosing an appropriate β to strike the balance between these two terms is vital.However, conventional approaches such as GCV and the L-curve method do not always provide satisfactory results in this situation.In this paper, a new method is proposed to facilitate the choice of the regular parameter β for the TIKR method.
Setting the gradient of the cost function of the TIKR method, J = p − As Without loss of generality, assume A is of full-column rank.Note that which has an effective condition number (σ 2 1 + β 2 )/(σ 2 N + β 2 ).Therefore, if we want the condition number to be τ after regularization, we must require Let κ be the condition number of A; that is, κ = σ 1 /σ M .In general, for very ill-conditioned systems, κ τ 1, and Therefore, the regularization parameter β can be chosen to be the maximal singular value σ 1 of A divided by the regularized condition number τ.For example, one may choose that τ = 100, which causes 40-dB potential loss of SNR in the inverse solution.Normally, A tends to be ill-conditioned at low frequencies.Choosing a frequency-independent β may suffice for the worst-case scenario.Thus, the regularization parameter β is chosen according to the maximal condition number at a selected low frequency (100 Hz is selected in the following simulation).Next, a coarse search is performed by varying β in orders of 10.A potential interval in which an optimal β may exist is located by observing how an objective function, such as Perceptual Evaluation of Speech Quality (PESQ) [29], varies with β.Finally, a fine-grained search is performed in the potential interval by using the Golden Section Search (GSS) algorithm [21].
GSS is an optimization technique suited for finding the extremum of a unimodal function.It is a simple bracketing method that does not require evaluation of the gradient of the cost function.In each search, a probe point must be selected within the left and right brackets according to the golden ratio.The golden ratio can be defined as = 1.618 , ϕ is the golden ratio Let f (β) be the objective function (PESQ in our case) for which we wish to find the optimal β that maximizes f (β).First, f (β) has been evaluated at two points, β 1 and β 3 .The maximizing value is between β 1 and β 3 .The golden ratio can be used to find β 2 and β 4 .β 2 and β 4 can be written as ) is larger than f (β 4 ), a maximum clearly lies in the interval between β 1 and β 4 .Therefore, the new β 3 is equal to β 4. If f (β 2 ) is smaller than f (β 4 ), a maximum lies in the interval between β 2 and β 3 .Therefore, the new β 1 is equal to β 2 .Figure 1 shows the Schematic of golden section search.The process is repeated until the gap between β 2 and β 4 is small.The iteration process stops when the beta converges within a prespecified tolerance window (0.0001 in our case).The optimal beta β opt can be written as The algorithm is summarized as the following pseudocode: the new β3 is equal to β4.If f(β2) is smaller than f(β4), a maximum lies in the interval between β2 and β3.Therefore, the new β1 is equal to β2. Figure 1 shows the Schematic of golden section search.The process is repeated until the gap between β2 and β4 is small.The iteration process stops when the beta converges within a prespecified tolerance window (0.0001 in our case).The optimal beta opt  can be written as The algorithm is summarized as the following pseudocode: Therefore, this study developed a procedure for choosing optimal regularization parameter beta.The procedure involves four steps as follows: Therefore, this study developed a procedure for choosing optimal regularization parameter beta.The procedure involves four steps as follows: • Step 1. Select τ according to a condition number threshold.

•
Step 2. Select a constant β = σ 1 /τ as an initial guess.For a frequency-domain design, it may be necessary to choose a frequency-independent β for the worst-case scenario.

•
Step 3. Perform a coarse search by running the simulation forward and backward with 10 s powers of β.Locate a potential interval in which an optimal β may exist by observing the trend of an objective function, such as PESQ, with respect to β.

•
Step 4. Perform a fine search by using optimization methods such as the GSS algorithm to find the optimal regularization parameter β.

Underdetermined Systems
In this section, algorithms are presented for underdetermined problems, where the number of microphones (M) is lower than the number of potential sources (N).In this case, the solution is generally not unique unless we impose constraints.Although a pseudoinverse gives a unique minimum-norm least-squares solution, the resolution of the solution is generally not favorable because the solution error tends to be evenly distributed among all entries.Instead, we impose sparsity as the constraint to limit the cardinality (nonzero entries) of the solutions in the study, which suggests that pruning procedures of some sort must be incorporated into the iteration process.

CVX Algorithms
An underdetermined problem with sparse solutions can be written as the following CS problem: where • 1 denotes the vector 1-norm.Numerous methods are available for solving this constrained optimization problem [22,23].Suppose that the noise energy is constrained within a threshold ε that can be selected with reference to the aforementioned least-squares solution.This problem can be solved numerically by CVX.Freeware was adopted to conduct CVX in this study [24][25][26].

Iterative Approaches
In the following, we apply four iterative algorithms to solve underdetermined problems.The first method is Focal Underdetermined System Solver (FOCUSS) [27], which is an iterative technique well suited for finding sparse solutions to underdetermined linear systems.The algorithm has two integral parts: a low-resolution initial estimate of the real signal and the iteration process that refines the initial solution to the final localized solution.Because the system is underdetermined, the sensors are more numerous than the sources.In this case, we assume our dictionary contains 36 sources.These sources are located at 5 • intervals from the x label.Actually, this case has only three sources.Therefore, if the result is perfect, our final solution has only three nonzero solutions.
The FOCUSS algorithm can be summarized in three steps, In Equation ( 21), [diag(s k−1 )] converts the vector s k−1 into a diagonal weight matrix.The TIKR solution is used as the initial condition.Similar to other fixed-point iteration methods, the algorithms converge within finite numbers of iterations to the sparse solution with appropriate initial conditions.
The large term in the weighting reduces the 2-norm of q The relatively large entries in W reduce the contributions of the corresponding elements of s p to the cost, and the solution is nonzero in the source direction.The pseudoinverse in Equation ( 22) can also be implemented by using the TIKR method.Therefore, it can also be written as The FOCUSS-TIKR method is robust to noise because of the regularization parameter β.The iteration process stops when the solution converges within a prespecified tolerance window (0.0001 in our case).

Iterative Approaches: Promote Sparsity by Pruning
CVX algorithms can solve CS problems, but these algorithms are known to be very computationally expensive, which prevents their use in real-time processing.To address this challenge, several iterative approaches are proposed as follows.
In these iterative techniques, the quadratic residual function is to be minimized.The key step that executes the "compressive sampling" is a pruning process that must be incorporated into each iteration to promote sparsity, as illustrated in Figure 2. First, several main peaks as well as sidelobes may appear in the source diagram.We reset all elements in the source vector s below a prespecified threshold (s max − D) to zero.D is initially set to a very small value D 0 ; it is then increased incrementally by ∆D in each iteration, typically by the same ∆D in every step.
is to be minimized.The key step that executes the "compressive sampling" is a pruning process that must be incorporated into each iteration to promote sparsity, as illustrated in Figure 2. First, several main peaks as well as sidelobes may appear in the source diagram.We reset all elements in the source vector s below a prespecified threshold ( max s D  ) to zero.D is initially set to a very small value 0 D ; it is then increased incrementally by D  in each iteration, typically by the same D  in every step.The iterative pruning process is summarized with a flowchart in Figure 3.The stopping condition is in which ∇J(s) is the gradient of the quadratic residual function J(s).Three approaches were employed in this study to update the solutions in the iterative CS algorithms such that each quadratic residual function J(s) is minimized.
Appl.Sci.2017, 7, 582 9 of 30 The iterative pruning process is summarized with a flowchart in Figure 3.The stopping condition is (27) in which J  (s) is the gradient of the quadratic residual function J(s) .Three approaches were employed in this study to update the solutions in the iterative CS algorithms such that each quadratic residual function J(s) is minimized.Steepest Decent (SD) Method The SD algorithm is based on the notion that the search direction at each iteration is the negative gradient of the cost function in Equation ( 26) for minimization problems.The gradient vector of the quadratic residual function is given by  The SD algorithm is based on the notion that the search direction at each iteration is the negative gradient of the cost function in Equation ( 26) for minimization problems.The gradient vector of the quadratic residual function is given by where r = p − As is the residual vector.The new solution s is updated as where µ is the step size.To determine the optimal step size µ, let the vector g be g = Aw (30) It can be shown after some algebraic manipulations that From Equation ( 31), the step size µ to minimize F(s ) along the direction w can be found by setting the derivative of F(s ) with respect to µ to zero.Consequently, we obtain Newton's Method The NT method is also an iterative method gradient search.Recall the gradient of the cost function is where r is as defined before.Setting this gradient to zero leads to the optimal solution where Hence, the solution can be updated as In practical implementation, the pseudoinverse A + is usually of the form of TIKR, namely is singular for underdetermined problems.As a refinement of the algorithm, a step size µ can be used to rewrite the update equation as The CG method is an iterative algorithm well suited for the numerical solution of systems of linear equations, associated with symmetric and positive-definite matrices.Instead of the negative gradient used in the SD method, which occasionally causes zigzag convergence, the search direction of the CG method is a linear combination of the current negative gradient and the previous search direction.Development of the CG algorithm is based on nested Krylov subspaces.For details, we refer the interested readers to Ref. [32][33][34].For brevity, we only summarize the CG algorithm with the following pseudocode: To put the system of equations in Equation ( 1) into a more tractable form, we multiply by A H on both sides, which leads to the normal equations This equation is equivalent to finding the vector s for which the gradient of F(s) equals zero.

Comparison of Algorithms
This section presents the application of the preceding algorithms to acoustic source localization and separation problems through numerical simulation.These algorithms were compared for three example problems, with the aid of a uniform linear array (ULA) and a random array.In addition, an inverse solution involved in sound field analysis (SFA) and sound field synthesis (SFS) in spatial audio was investigated.Microphone data are synthetic and generated by the model of Equation (1).

Uniform Linear Array
In the numerical simulation shown in Figure 4, 10-microphone ULA was utilized to separate the signals emitted by three sources.The sources were located at the far field such that the plane wave assumption was valid.The spacing between adjacent microphones was 10 cm.This simulated underdetermined system contained 36 sources as our dictionary.We separated the signals and localized the signals in one stage.TIKR, FOCUSS, and CS-CVX algorithms were used to solve these inverse problems.Figure 5 shows the condition numbers of different frequencies.The problem is ill-conditioned at low frequencies.Source localization results are shown in Figure 6.• , was broadcasting a male speech signal.
The second source, located at θ = 90     The separation results obtained using TIKR, FOCUSS, and CS-CVX are summarized in Table 1.PESQ is an objective test measure for speech quality evaluation.It is a full-reference algorithm and analyzes the speech signal sample-by-sample after a temporal alignment of corresponding excerpts of reference and test signal.The mean opinion score (MOS) is calculated on the basis of PESQ ranging from 1 to 5; MOS signifies the difference in speech quality between the clean and the separated signals, which is affected by separation performance and signal distortion.The segmental SNR (segSNR) is defined as The separation results obtained using TIKR, FOCUSS, and CS-CVX are summarized in Table 1.PESQ is an objective test measure for speech quality evaluation.It is a full-reference algorithm and analyzes the speech signal sample-by-sample after a temporal alignment of corresponding excerpts of reference and test signal.The mean opinion score (MOS) is calculated on the basis of PESQ ranging from 1 to 5; MOS signifies the difference in speech quality between the clean and the separated signals, which is affected by separation performance and signal distortion.The segmental SNR (segSNR) is defined as Appl.Sci.2017, 7, 582 14 of 29 The segSNR correlates with the effect of noise reduction.The FOCUSS-PINV algorithm was observed to achieve the highest score in PESQ and segSNR (Table 1), although it required more computation time than TIKR and FOCUSS-TIKR.In our previous simulation, we simulated microphones that had no noise; therefore, our regularization parameter was very close to zero.In the current simulation, our microphones did have white noise with a magnitude equal to the magnitude of the microphone signals divided by 100.Therefore, the potential loss of SNR was 40 dB.
The regularization parameter is chosen by the maximal singular σ 1 value dividing 100 in 100 Hz.In our case, the maximal singular value was equal to 5.32.Next, a coarse search was performed by varying β in orders of 10 and using the GSS algorithm to find the optimal regularization parameter.In this case, the optimal regularization parameter was 0.0174, and the FOCUSS-PINV of the robustness to the noise was very ineffective because PINV artifacts caused discontinuities in regularization.CS-CVX was sensitive to the noise, and the segSNR and PESQ achieved lower scores than FOCUSS-TIKR did, as listed in Tables 2 and 3 (20dB).Therefore, noise was present, and FOCUSS-TIKR was the best choice in this case.It outperformed PESQ and segSNR.

Random Array
A simulation was conducted for localization and separation of two point sources located at (0, 0, −1 m) and (0.8, 0.3, −1 m), both of which were emitting clean speech signals, as illustrated in Figure 7.A 30-element random array with aperture dimension 0.48 m × 0.4 m situated at z = 0 was utilized to capture the signals emitted by these two sources.To set up the propagation matrix, 100 (10 × 10) equivalent sources were distributed on the image plane located 1 m away from the array.Figure 8 shows the condition numbers of different frequencies.The problem is ill-conditioned at low frequencies.
Appl.Sci.2017, 7, 582 16 of 30 equivalent sources were distributed on the image plane located 1 m away from the array.Figure 8 shows the condition numbers of different frequencies.The problem is ill-conditioned at low frequencies.Figure 9a-f show the source localization results obtained using six approaches.Two sources were correctly located on the noise map with varying degrees of resolution by all methods.A conventional method delay and sum (DAS) method (a) gave the poorest resolution, whereas the CS-CVX method provided the highest resolution.The SD method (d) and CG method (f) were acceptable but did not perform quite as well as the CS-CVX method.The NT method (e) yielded accurate source locations with a slightly increased sidelobe level, but it was the most computationally efficient (Table 2).In general,   equivalent sources were distributed on the image plane located 1 m away from the array.Figure 8 shows the condition numbers of different frequencies.The problem is ill-conditioned at low frequencies.Figure 9a-f show the source localization results obtained using six approaches.Two sources were correctly located on the noise map with varying degrees of resolution by all methods.A conventional method delay and sum (DAS) method (a) gave the poorest resolution, whereas the CS-CVX method provided the highest resolution.The SD method (d) and CG method (f) were acceptable but did not perform quite as well as the CS-CVX method.The NT method (e) yielded accurate source locations with a slightly increased sidelobe level, but it was the most computationally efficient (Table 2).In general, Figure 9a-f show the source localization results obtained using six approaches.Two sources were correctly located on the noise map with varying degrees of resolution by all methods.A conventional method delay and sum (DAS) method (a) gave the poorest resolution, whereas the CS-CVX method provided the highest resolution.The SD method (d) and CG method (f) were acceptable but did not perform quite as well as the CS-CVX method.The NT method (e) yielded accurate source locations with a slightly increased sidelobe level, but it was the most computationally efficient (Table 2).Table 4 presents a comparison of the separation results obtained using five methods.CS-CVX displayed the highest scores for PESQ and segSNR, despite being extremely time-consuming.Iterative CS approaches were determined to be far more computationally efficient than the CS-CVX method.The CG method attained the highest PESQ, but the lowest segSNR.This suggests that the favorable separation performance of the CG method comes at the price of signal distortion.The SD and NT methods demonstrated acceptable PESQ and high segSNR.Although signals were not perfectly separated by using these two methods, the incurred distortion was minor.In general, methods present a trade-off between separation performance and signal distortion.Table 5 shows the separation results with additive noise (SNR = 28 dB).All the methods were observed to suffer from the interference of noise; consequently, the values of PESQ and segSNR were notably low.However, all the methods were determined to be robust to noise.The present study also considered mismatches between equivalent sources (dictionaries) and real sources.Table 6 shows the separation results of the NT method with different levels of mismatch.Mismatch means that the real source is not precisely on the deployed source location (dictionary).Extreme mismatch means the real source is exactly on the center of four nearest the deployed source location.More descriptions are added to the revised manuscript.Unless the source was just at the center of the near dictionaries, the separation performance was high and not influenced by noise.

SFA and SFS
Depending on the sparsity of the sound sources, the SFA stage can be implemented in several manners [35][36][37][38].For the sparse-source scenario, a two-stage algorithm is utilized; the source bearings are estimated using the minimum power distortionless response (MPDR) [7] and the associated amplitudes of plane waves are estimated using the TIKR algorithm.For the nonsparse-source scenario, a one-stage algorithm based on the CS-CVX algorithm or the FOCUSS algorithm is employed.
The SFS stage is carried out using a loudspeaker array to reconstruct the sound field with the source bearing and amplitude obtained in the SFA stage.Pressure matching was employed for the SFS purpose in this study by sampling a large number of virtual control points in the interior area surrounded by the loudspeakers.The pressure matching procedure can be described as the following optimization problem: min where T is the amplitude vector of the Pth primary plane-wave component, is the steering vector for the dth primary plane-wave component to the K×P is the steering matrix from the plane-wave components obtained in the preceding SFA stage to the control points.Therefore, the optimal solution can be written as where "#" symbolizes some type of inverse operation on the matrix H(ω).In this study, TIKR was utilized to calculate the input signal amplitudes to the secondary sources.GSS can be used to find the optimal regularization parameter.
Experiments were conducted to validate the proposed audio analysis and synthesis system.In the SFA stage, a 24-element circular microphone array with a radius of 12 cm was utilized to capture and parameterize the sound field in an anechoic chamber (the recording room), as illustrated in Figure 10.In the SFS stage, a rectangular, 32-loudspeaker array was employed to reproduce in a live room (the reproduction room) the sound field previously encoded in the SFA stage.The walls of the room were lined with acoustically reflective boards (Figure 11).The SFS stage is carried out using a loudspeaker array to reconstruct the sound field with the source bearing and amplitude obtained in the SFA stage.Pressure matching was employed for the SFS purpose in this study by sampling a large number of virtual control points in the interior area surrounded by the loudspeakers.The pressure matching procedure can be described as the following optimization problem: where is the amplitude vector of the Pth primary plane-wave component, denotes the amplitude vector of the input signals to the L secondary loudspeaker sources, denotes the room response matrix, and is the steering vector for the dth primary plane-wave component to is the steering matrix from the plane-wave components obtained in the preceding SFA stage to the control points.Therefore, the optimal solution can be written as where "#" symbolizes some type of inverse operation on the matrix    H . In this study, TIKR was utilized to calculate the input signal amplitudes to the secondary sources.GSS can be used to find the optimal regularization parameter.
Experiments were conducted to validate the proposed audio analysis and synthesis system.In the SFA stage, a 24-element circular microphone array with a radius of 12 cm was utilized to capture and parameterize the sound field in an anechoic chamber (the recording room), as illustrated in Figure 10.In the SFS stage, a rectangular, 32-loudspeaker array was employed to reproduce in a live room (the reproduction room) the sound field previously encoded in the SFA stage.The walls of the room were lined with acoustically reflective boards (Figure 11).To process microphone output and loudspeaker input signals, multichannel analog-to-digital converters (M-32 AD) and digital-to-analog converters (M-32 DA) (RME, Haimhausen, Germany) were used with a sampling frequency of 16 kHz.
An audio codec system involves three inverse problems, namely the SFA stage, room response modeling, and the SFS stage.The condition numbers are plotted against the frequencies of three steering matrices in Figure 12a-c.Figure 12c indicates that the ill-posedness encountered in the room response modeling procedures must be addressed, with the aid of appropriate regularization methods.Large regularization parameters can increase the robustness of the inverse problem.In this study, we set the regularization parameter to 10.
(a) To process microphone output and loudspeaker input signals, multichannel analog-to-digital converters (M-32 AD) and digital-to-analog converters (M-32 DA) (RME, Haimhausen, Germany) were used with a sampling frequency of 16 kHz.
An audio codec system involves three inverse problems, namely the SFA stage, room response modeling, and the SFS stage.The condition numbers are plotted against the frequencies of three steering matrices in Figure 12a-c.Figure 12c indicates that the ill-posedness encountered in the room response modeling procedures must be addressed, with the aid of appropriate regularization methods.Large regularization parameters can increase the robustness of the inverse problem.In this study, we set the regularization parameter to 10.To process microphone output and loudspeaker input signals, multichannel analog-to-digital converters (M-32 AD) and digital-to-analog converters (M-32 DA) (RME, Haimhausen, Germany) were used with a sampling frequency of 16 kHz.
An audio codec system involves three inverse problems, namely the SFA stage, room response modeling, and the SFS stage.The condition numbers are plotted against the frequencies of three steering matrices in Figure 12a-c.Figure 12c indicates that the ill-posedness encountered in the room response modeling procedures must be addressed, with the aid of appropriate regularization methods.Large regularization parameters can increase the robustness of the inverse problem.In this study, we set the regularization parameter to 10.
(a) The results show that the source was accurately localized using MPDR.Next, the source signals were extracted using the TIKR algorithm.We also applied one-stage CS algorithms and one-stage FOCUSS algorithms, which located sources and separated their amplitudes in a single calculation.In the SFA experiment, loudspeaker sources positioned at the angles θ = 60 • , 240 • played two 10-s speech clips.After recording the source by CMA, we used three algorithms to extract the source signals.First, we applied the two-stage MPDR and TIKR algorithms.The MPDR spectrum is plotted as a function of angle and frequency in Figure 13a.The resulting frequency-averaged and normalized MPDR spectrum is illustrated in Figure 13b, which peaks at the angles θ = 60 • , 240 • as desired.The results show that the source was accurately localized using MPDR.Next, the source signals were extracted using the TIKR algorithm.We also applied one-stage CS algorithms and one-stage FOCUSS algorithms, which located sources and separated their amplitudes in a single calculation.The signals extracted using different methods were evaluated by using the MOS of the PESQ test.Results confirmed that the TIKR performed well in signal separation with satisfactory audio quality.The results are summarized in Table 7.The signals extracted using different methods were evaluated by using the MOS of the PESQ test.Results confirmed that the TIKR performed well in signal separation with satisfactory audio quality.The results are summarized in Table 7.One sample coherence function between one loudspeaker and one microphone is shown in Figure 14, indicating the signal quality to be poor below 200 Hz.Therefore, band-limited processing was applied for all frequencies up to 200 Hz in the SFS stage.In this frequency range, pressure matching was used on the basis of the room response model.One sample coherence function between one loudspeaker and one microphone is shown in Figure 14, indicating the signal quality to be poor below 200 Hz.Therefore, band-limited processing was applied for all frequencies up to 200 Hz in the SFS stage.In this frequency range, pressure matching was used on the basis of the room response model.The SFS stage was conducted for three different methods.The coherence between the loudspeaker and the microphone was poor below 200 Hz; therefore, the signals below 200 Hz were not processed.Method 1, band-limited processing, was applied from 200 Hz to the spatial aliasing frequency, 952 Hz, in the SFS stage.In this frequency range, pressure matching was performed on the basis of the room response model.Below 200 Hz, unprocessed audio signals were fed directly to the loudspeakers.Above 952 Hz, a simple vector panning [39] approach was adopted.The optimal regularization parameter β achieving the highest MOS in room response modeling was calculated using GSS [21] as β = 0.0008634.
In the second method, instead of a vector panning method, we used DAS to process signals above 952 Hz.In the third method, we used pressure matching to obtain signals above 200 Hz.The use of different regularization parameters in pressure matching results in different levels of localization performance and audio quality.
Figure 15a,c shows the MPDR spectrum and the normalized MPDR spectrum obtained using the third method for β = 0.01 and β = 10, respectively.Low values of the regularization parameter β yielded higher localization performance than high values did.These two signals were compared with the clean signal through the PESQ test.The results showed that the high β ensured satisfactory voice quality, whereas the lowβ impaired voice quality.
The results of three localization methods are presented in Figure 16a-f.The MPDR spectra are plotted as functions of angle and frequency in Figure 16a,c,f.The resulting frequency-averaged and normalized MPDR spectra are shown in Figure 16b,d,f.The SFS stage was conducted for three different methods.The coherence between the loudspeaker and the microphone was poor below 200 Hz; therefore, the signals below 200 Hz were not processed.Method 1, band-limited processing, was applied from 200 Hz to the spatial aliasing frequency, 952 Hz, in the SFS stage.In this frequency range, pressure matching was performed on the basis of the room response model.Below 200 Hz, unprocessed audio signals were fed directly to the loudspeakers.Above 952 Hz, a simple vector panning [39] approach was adopted.The optimal regularization parameter β achieving the highest MOS in room response modeling was calculated using GSS [21] as β = 0.0008634.
In the second method, instead of a vector panning method, we used DAS to process signals above 952 Hz.In the third method, we used pressure matching to obtain signals above 200 Hz.The use of different regularization parameters in pressure matching results in different levels of localization performance and audio quality.
Figure 15a,c shows the MPDR spectrum and the normalized MPDR spectrum obtained using the third method for β = 0.01 and β = 10, respectively.Low values of the regularization parameter β yielded higher localization performance than high values did.These two signals were compared with the clean signal through the PESQ test.The results showed that the high β ensured satisfactory voice quality, whereas the low β impaired voice quality.
The results of three localization methods are presented in Figure 16a-f.The MPDR spectra are plotted as functions of angle and frequency in Figure 16a,c,f.The resulting frequency-averaged and normalized MPDR spectra are shown in Figure 16b,d,f.

Conclusions
This study developed algorithms for solving inverse problems generally encountered in spatial audio signal processing.The TIKR algorithm was shown to solve overdetermined problems.However, the regularization parameter in the TIKR method was not effectively chosen.This study thus presents a guideline for choosing the optimal regularization parameter β in the TIKR method.Specifically, choosing the optimal β involves dividing the maximal singular value at low frequency by the threshold and then running the simulation forward and backward for powers of 10 of β.Optimization methods such as GSS can be used by observing the trend of an objective function (such as PESQ).Some trade-offs must be made between localization performance and voice quality.In general, a high β results in a small solution norm with high voice quality, whereas a low β yields a small residual norm with high localization performance.
Inverse problems in noise sound source localization and separation problems can be solved by 1-stage and 2-stage (overdetermined and underdetermined), each of the 1-stage and 2-stage methods has its advantages and disadvantages.In general, the 1-stage methods provide both localization and separation results with good performance.The 2-stage methods give slightly better separation performance than the 1-stage methods.From our experience, PESQ correlates better with separability and segSNR correlates better with distortion.
For 1-stage (underdetermined) problems, iterative CS algorithms have been developed for solving acoustic inverse problems, with applications to localization and separation.The results demonstrate that the CS-CVX method was effective in solving CS problems, despite being computationally expensive.Iterative CS methods achieved comparable performance to the CS-CVX method for CS problems, in far less computation time.The FOCUSS-TIKR and CG methods attained high PESQ, whereas the SD and NT methods attained high segSNR.In general, iterative CS methods were determined to perform better than the TIKR method.For 1-stage methods, FOCUSS-TIKR attains the highest MOS value of PESQ for clean signals, while the Newton method performs the best.Both methods require less CPU time than the CS-CVX.
Inverse solution approaches are also useful in solving SFA and SFS problems.In this study, three inverse problems were solved for implementing an audio codec system.Because of the ill-posed yield at low frequencies, particularly in the room response modeling stage, choosing an appropriate regularization parameter β was crucial.Therefore, this stage required a larger regularization parameter than the SFA and SFS stages required.In the analysis stage, the one-stage CS algorithm was determined to be more computationally expensive than the two-stage TIKR algorithm.In the synthesis stage, the first method performed well in localization, but did not perform well in

Conclusions
This study developed algorithms for solving inverse problems generally encountered in spatial audio signal processing.The TIKR algorithm was shown to solve overdetermined problems.However, the regularization parameter in the TIKR method was not effectively chosen.This study thus presents a guideline for choosing the optimal regularization parameter β in the TIKR method.Specifically, choosing the optimal β involves dividing the maximal singular value at low frequency by the threshold and then running the simulation forward and backward for powers of 10 of β.Optimization methods such as GSS can be used by observing the trend of an objective function (such as PESQ).Some trade-offs must be made between localization performance and voice quality.In general, a high β results in a small solution norm with high voice quality, whereas a low β yields a small residual norm with high localization performance.
Inverse problems in noise sound source localization and separation problems can be solved by 1-stage and 2-stage (overdetermined and underdetermined), each of the 1-stage and 2-stage methods has its advantages and disadvantages.In general, the 1-stage methods provide both localization and separation results with good performance.The 2-stage methods give slightly better separation performance than the 1-stage methods.From our experience, PESQ correlates better with separability and segSNR correlates better with distortion.
For 1-stage (underdetermined) problems, iterative CS algorithms have been developed for solving acoustic inverse problems, with applications to localization and separation.The results demonstrate that the CS-CVX method was effective in solving CS problems, despite being computationally expensive.Iterative CS methods achieved comparable performance to the CS-CVX method for CS problems, in far less computation time.The FOCUSS-TIKR and CG methods attained high PESQ, whereas the SD and NT methods attained high segSNR.In general, iterative CS methods were determined to perform better than the TIKR method.For 1-stage methods, FOCUSS-TIKR attains the highest MOS value of PESQ for clean signals, while the Newton method performs the best.Both methods require less CPU time than the CS-CVX.
Inverse solution approaches are also useful in solving SFA and SFS problems.In this study, three inverse problems were solved for implementing an audio codec system.Because of the ill-posed yield at low frequencies, particularly in the room response modeling stage, choosing an appropriate regularization parameter β was crucial.Therefore, this stage required a larger regularization parameter than the SFA and SFS stages required.In the analysis stage, the one-stage CS algorithm was determined to be more computationally expensive than the two-stage TIKR algorithm.In the synthesis stage, the first method performed well in localization, but did not perform well in reproduced voice quality.As compared with the third method, the second method reproduced signals with boosted high-frequency content above 952 Hz with poor localization.The third method had the highest performance in terms of voice quality and localization performance.

Figure 1 .
Figure 1.Schematic of golden section search.If f (β 2 ) is higher than f (β 4 ), β 3 is equal to β 4 in the next iteration.

Figure 2 .
Figure 2. Pruning process to promote sparsity of inverse solution.

Figure 2 .
Figure 2. Pruning process to promote sparsity of inverse solution.

Figure 4 .
Figure 4. Numerical simulation of 10-microphone uniform linear array utilized to separate the signals emitted by three sources.The first source, located at

Figure 4 .
Figure 4. Numerical simulation of 10-microphone uniform linear array utilized to separate the signals emitted by three sources.The first source, located at θ = 45• , was broadcasting a male speech signal.

Figure 4 .
Figure 4. Numerical simulation of 10-microphone uniform linear array utilized to separate the signals emitted by three sources.The first source, located at

Figure 7 .
Figure 7. Arrangement for simulation of localization and separation of two sources.

Figure 7 .
Figure 7. Arrangement for simulation of localization and separation of two sources.

Figure 7 .
Figure 7. Arrangement for simulation of localization and separation of two sources.

T
denotes the amplitude vector of the input signals to the L secondary loudspeaker sources, H(ω) ∈ C K×L denotes the room response matrix, and b d =

Figure 10 .
Figure 10.Sound field analysis experimental arrangement in a 5.4 m × 3.5 m × 2 m anechoic room.Figure 10.Sound field analysis experimental arrangement in a 5.4 m × 3.5 m × 2 m anechoic room.

Figure 10 .
Figure 10.Sound field analysis experimental arrangement in a 5.4 m × 3.5 m × 2 m anechoic room.Figure 10.Sound field analysis experimental arrangement in a 5.4 m × 3.5 m × 2 m anechoic room.

Figure 11 .
Figure 11.Sound field synthesis experimental arrangement in a 3.6 m × 3.6 m × 2 m live room fitted with reflective walls.

Figure 11 .
Figure 11.Sound field synthesis experimental arrangement in a 3.6 m × 3.6 m × 2 m live room fitted with reflective walls.

30 Figure 11 .
Figure 11.Sound field synthesis experimental arrangement in a 3.6 m × 3.6 m × 2 m live room fitted with reflective walls.

Figure 12 .
Figure 12.Plots of condition numbers versus frequencies of each steering matrix in three inverse problems: (a) sound field analysis stage; (b) room response modeling; and (c) sound field synthesis stage.

Figure 13 .
Figure 13.Localization of one music source signal with sampling rate 16 kHz, located at the angles 60 , 240   in an anechoic chamber.By using a uniform circular array with a radius of 12 cm, the source direction can be identified by the peak in the angular spectrum.(a) Minimum power distortionless response (MPDR) spectrum plotted versus angle and frequency.(b) Frequencyaveraged and normalized MPDR spectrum.

Figure 13 .
Figure 13.Localization of one music source signal with sampling rate 16 kHz, located at the angles 60 • , 240 • in an anechoic chamber.By using a uniform circular array with a radius of 12 cm, the source direction can be identified by the peak in the angular spectrum.(a) Minimum power distortionless response (MPDR) spectrum plotted versus angle and frequency.(b) Frequency-averaged and normalized MPDR spectrum.

Figure 14 .
Figure 14.Coherence curve measured from one loudspeaker to one microphone in a live room.A PULSE analysis platform made from Brüel & Kjaer was arranged to measure the coherence curve.A white noise signal with sampling rate 16 kHz was used as the driving signal of the loudspeaker.

Figure 14 .
Figure 14.Coherence curve measured from one loudspeaker to one microphone in a live room.A PULSE analysis platform made from Brüel & Kjaer was arranged to measure the coherence curve.A white noise signal with sampling rate 16 kHz was used as the driving signal of the loudspeaker.

Figure 15 .
Figure 15.Localization results in the sound field synthesis experiment by the third method with different regularization parameters.β = 10 (a,b) and β = 0.01 (c,d).

Figure 15 .
Figure 15.Localization results in the sound field synthesis experiment by the third method with different regularization parameters.β = 10 (a,b) and β = 0.01 (c,d).

Figure 16 .
Figure 16.Localization results in the sound field synthesis experiment with three different approaches.(a,b) first method, (c,d) second method, and (e,f) third method.

Figure 16 .
Figure 16.Localization results in the sound field synthesis experiment with three different approaches.(a,b) first method, (c,d) second method, and (e,f) third method.

Table 1 .
Separation performance of the Tikhonov regularization (TIKR) and focal underdetermined system solver (FOCUSS) methods for three sources in the underdetermined system.

Table 3 .
Separation performance of the TIKR, FOCUSS and CS-CVX methods for three sources with additive white noises (20dB SNR).

Table 4 .
Separation performance of five algorithms for two speech sources.

Table 5 .
Separation performance of five algorithms for two speech sources with additive noise.

Table 6 .
Separation performance of Newton's algorithm for two speech sources with different mismatch conditions.

Table 7 .
Mean opinion score of perceptual evaluation of speech quality for source signal separation

Table 7 .
Mean opinion score of perceptual evaluation of speech quality for source signal separation at the angles 60 • , 240 • with speech signals using Tikhonov regularization (TIKR), compressive sampling-convex optimization (CS-CVX), and focal underdetermined system solver (FOCUSS).