Diffusion-Based Sound Source Localization Using a Distributed Network of Microphone Arrays

Davide Albertini; Alberto Bernardini; Gioele Greco; Augusto Sarti

doi:10.3390/s25072078

Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, 20133 Milan, Italy

* Author to whom correspondence should be addressed.

Sensors 2025, 25(7), 2078; https://doi.org/10.3390/s25072078

This article belongs to the Special Issue Acoustic Sensing and Monitoring in Urban and Natural Environments (2nd Edition)

Abstract

Traditionally, microphone array networks for 3D sound source localization rely on centralized data processing, which can limit scalability and robustness. In this article, we recast the task of sound source localization (SSL) with networks of acoustic arrays as a distributed optimization problem. We then present two approaches to solving this problem: one is computationally centralized, while the other is computationally distributed and based on an Adapt-Then-Combine (ATC) diffusion strategy. In particular, we address 3D SSL with a network of linear microphone arrays, each of which estimates a stream of 2D directions of arrival (DoAs) and cooperates with the others to localize a single sound source. We develop adaptive cooperation strategies that penalize the arrays with the most detrimental effects on localization accuracy and improve performance through error-based and distance-based penalties. The performance of the method is evaluated using increasingly complex DoA stream models and simulated acoustic environments characterized by various levels of reverberation and signal-to-noise ratio (SNR). Furthermore, we investigate how the performance is related to the connectivity of the network and show that the proposed approach maintains high localization accuracy and stability even in sparsely connected networks.

Keywords:

sound source localization; wireless acoustic sensor networks; ATC diffusion; microphone array processing

1. Introduction

In recent years, networks of distributed microphone arrays have gained popularity and have been used in various acoustic signal processing applications [1,2,3,4,5,6,7,8,9]. One of the most important tasks of these networks is sound source localization (SSL) and tracking, which can be a primary task or support other algorithms for which the position of one or more sound sources is valuable information [10,11,12,13,14,15,16,17]. Typical application scenarios for SSL methods are audio surveillance, video conferencing, and automotive systems [3,18,19].

The SSL problem has been extensively studied in the literature on distributed microphone arrays, and several approaches can be classified based on the type of acoustic parameter used for localization [1]. Examples of acoustic parameters used for SSL are time delay between microphones [10,11,12,16,20], measurements of sound energy [21,22], power measures obtained through beamforming techniques [13,17,23], and estimation of the direction of arrival (DoA) of sound sources [14,24,25,26,27,28,29]. More recently, techniques based on deep learning have used latent space features to map acoustic signals to the position of sound sources [15].

In this manuscript, we focus on SSL methods that utilize DoAs, which are defined as either one angle (in 2D space) or two angles (in 3D space), indicating the direction of a sound source with respect to a reference direction. In particular, we focus on the SSL framework introduced in [24], which allows 3D SSL using only 2D DoA measurements. This approach is advantageous because 2D DoAs can be computed quite efficiently, and are therefore well suited for low-cost, low-power microphone arrays commonly used in microphone array networks.

Regardless of the chosen acoustic parameters for SSL, most methods are based on optimization problems, where the goal is to fit a sound propagation model with acoustic measurements or features extracted from acoustic arrays [1]. Traditionally, SSL methods use centralized processing to solve the aforementioned optimization problems, where data from all arrays in the network are collected and processed by a dedicated node, often referred to as the Fusion Center, which then performs localization [1]. As a result, systems of this kind have a critical point of failure (i.e., the Fusion Center) that must also guarantee a communication bandwidth high enough to process measurements from all sensor nodes [3]. Therefore, there is a growing interest in developing computationally distributed solutions that allow for estimation of the quantities of interest (e.g., the position of a sound source) by distributing the computational load to all acoustic arrays and leveraging their cooperation to achieve better performance. In addition, distributed approaches are desirable due to their higher scalability, robustness, and low power consumption [30].

Computationally distributed approaches in microphone array networks have been investigated for various acoustic signal processing tasks, including signal estimation [31,32], beamforming techniques [5,33], active noise control [34], and acoustic echo cancellation [35]. However, to the best of our knowledge, few computationally distributed SSL methods have been proposed, and they are mostly limited to 2D environments [17,36,37].

Recently, the authors of the present manuscript, building on the centralized framework for 3D SSL with 2D DoAs of [24], proposed a computationally distributed 3D SSL method [38] that uses a network of planar microphone arrays. In [38], SSL is described as a distributed minimization problem and is approached with an Adapt-Then-Combine (ATC) diffusion strategy [30,39]. This formulation has two major advantages over [24]. First, the approach is computationally distributed and divides the computational load across the arrays. Second, the approach is adaptive, i.e., each array handles a stream of 2D DoAs instead of single DoA measurements as in [24]. This allows the microphone array network to adapt to changes in the distribution of DoA streams, which are usually influenced by noise and unfavorable acoustic effects, and to learn their statistical moments. As a result, localization accuracy can be improved by penalizing arrays that acquire unreliable measurements. By responding to DoA streams, the tracking of sound sources is also automatically integrated into this approach, under the assumption that the sound source moves relatively slowly.

In this manuscript, we extend the work of the conference paper [38] in several ways. With respect to [38], which considers a simple DoA stream model based on the assumption that the acoustic environment is anechoic, this manuscript introduces more sophisticated DoA stream models, along with applications of the method in simulated reverberant environments. Building on these extensions, we propose a new set of data exchange policies between acoustic arrays that are specifically tailored to improve localization accuracy in more complex acoustic scenarios. Moreover, the work in [38] considered only a fully connected network topology, where each node is directly connected to every other node. In this manuscript, we consider different connected topologies and investigate how performance is affected when network connectivity is gradually reduced. Our results show that the performance degradation is negligible as long as a connected network topology is considered. This emphasizes the effectiveness and resilience of the proposed distributed approach. We also test the robustness of the proposed approach at different reverberation levels and signal-to-noise ratios (SNRs). We show that our method converges even under challenging conditions and present a way to control the stability of the position estimate after convergence.

The manuscript is structured as follows. Section 2 introduces the SSL framework used throughout the paper and formulates the SSL task as an optimization problem. Section 3 discusses a centralized solution to this optimization problem that is able to handle streams of DoAs and penalize noisy arrays. Section 4 presents a computationally distributed approach that uses an ATC diffusion strategy to solve the SSL problem. This section also introduces new cooperation strategies between array nodes to improve performance. Section 5 evaluates the accuracy and robustness of the proposed methods, while Section 6 examines the convergence speed and steady-state stability of the approach. Section 7 assesses the resilience of the method under reduced network connectivity. Finally, Section 8 offers concluding remarks and discusses potential future developments.

2. Background on 3D Sound Source Localization with Linear Microphone Arrays

We tackle the problem of localizing a sound source in a 3D space using a network of linear microphone arrays, where each array measures a 2D DoA. Let us consider an acoustic environment in which a single sound source is located at the coordinates

s = {[s_{x}, s_{y}, s_{z}]}^{T}

. Let us also consider K spatially distributed linear microphone arrays whose reference points are located at the coordinates

m_{k} = {[m_{k, x}, m_{k, y}, m_{k, z}]}^{T}

for

k = 1, \dots, K

. Each array is oriented according to a unit vector

v_{k} = {[cos α_{k} cos β_{k}, sin α_{k} cos β_{k}, sin β_{k}]}^{T}

, where

α_{k}

and

β_{k}

stand for the azimuth and elevation angle, respectively. The goal is to determine the position of the sound source based on the DoAs detected by the microphone arrays.

2.1. Sound Source Localization Framework

As in [24], we approach 3D SSL with linear arrays by performing 2D DoA triangulation in the same plane as the sound source; namely, the plane defined by

z = s_{z}

. The goal of triangulation for SSL is to determine the location of the sound source as the intersection of acoustic rays emanating from the source and passing through the microphone arrays, where the acoustic rays are parameterized by the position of the array and its DoA estimate. Since in our context the triangulation takes place on the plane

z = s_{z}

, this poses a challenge as the acoustic rays from the source to the microphone arrays do not normally lie on this plane. Therefore, we consider the projection of the microphone arrays onto the plane

z = s_{z}

. The coordinates of the projected arrays’ reference points are denoted as

m_{k}^{'} = {[m_{k, x}^{'}, m_{k, y}^{'}, s_{z}]}^{T}

. To ensure the correctness of this approach, each projected array must preserve the 2D DoA

ϑ_{k} = θ_{k} + α_{k}

, where

θ_{k}

is the local DoA estimate of the array. The distance of the array to the sound source must also be preserved after projection, i.e.,

∥ s - m_{k} ∥ = ∥ s - m_{k}^{'} ∥

. Given these constraints, the projection of each array depends on the sound source position

s

, and the coordinates are computed as [24]

\begin{matrix} m_{k, x}^{'} & = s_{x} - C_{k} P_{k} + Q_{k} \sqrt{D_{k} - C_{k}^{2}}, \\ m_{k, y}^{'} & = s_{y} - C_{k} Q_{k} - P_{k} \sqrt{D_{k} - C_{k}^{2}}, \end{matrix}

(1)

where

\begin{matrix} C_{k} & = v_{k}^{T} (s - m_{k}), & D_{k} & = {∥ s - m_{k} ∥}^{2}, \\ P_{k} & = cos α_{k}, & Q_{k} & = sin α_{k} . \end{matrix}

A representation of a projected array and the corresponding local DoA is shown in Figure 1.

Figure 1. A projected microphone array centered at

m_{k}^{'}

, with orientation unit vector

v_{k}^{'} = {[cos α_{k}, sin α_{k}, 0]}^{T}

measures the DoA

θ_{k}

of a sound source located at

s

in the far field.
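The projection in Equation (1) can be checked numerically. The following is a minimal sketch, assuming NumPy and hypothetical values for the source position, the array reference point, and the array angles (none of these values come from the experiments in this paper). It computes the projected reference point and lets one verify that both the distance to the source and the inner product with the orientation vector, and hence the local DoA, are preserved.

```python
import numpy as np

def orientation(alpha, beta):
    """Orientation unit vector v_k for azimuth alpha and elevation beta."""
    return np.array([np.cos(alpha) * np.cos(beta),
                     np.sin(alpha) * np.cos(beta),
                     np.sin(beta)])

def project_array(s, m, alpha, beta):
    """Project the array reference point m onto the plane z = s_z, Eq. (1)."""
    v = orientation(alpha, beta)
    d = s - m
    C = v @ d                      # C_k = v_k^T (s - m_k)
    D = d @ d                      # D_k = ||s - m_k||^2
    R = np.sqrt(D - C**2)          # zero only if the source lies on the array axis
    P, Q = np.cos(alpha), np.sin(alpha)
    return np.array([s[0] - C * P + Q * R,
                     s[1] - C * Q - P * R,
                     s[2]])

# Hypothetical geometry, chosen only for illustration.
s = np.array([2.0, 1.5, 1.5])      # source position
m = np.array([0.1, 0.2, 0.5])      # array reference point
alpha, beta = 0.3, 0.1             # azimuth and elevation in radians
m_proj = project_array(s, m, alpha, beta)

print("projected reference point:", m_proj)
print("distance before / after:", np.linalg.norm(s - m), np.linalg.norm(s - m_proj))
```

Because the projected array has orientation [cos α_k, sin α_k, 0]^T, the quantity C_k is the same before and after projection, which is what keeps the measured angle unchanged.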

Having defined the parameters of the projected arrays, we can determine the acoustic rays required for triangulation on the plane

z = s_{z}

. An acoustic ray intersecting a sound source and a projected microphone array k is described by

a_{k} s_{x} + b_{k} s_{y} + c_{k} = 0,

(2)

where

a_{k}

,

b_{k}

, and

c_{k}

are parameterized in terms of the DoA

ϑ_{k}

of the array as

\{\begin{matrix} a_{k} = sin (ϑ_{k}) \\ b_{k} = - cos (ϑ_{k}) \\ c_{k} = m_{k, y}^{'} cos (ϑ_{k}) - m_{k, x}^{'} sin (ϑ_{k}) . \end{matrix}

(3)

However, in a real acoustic scenario, Equation (2) is generally not satisfied due to the error introduced by the DoA estimation process and adverse room acoustic phenomena. Nevertheless, from Equation (2), we can derive an error function that measures the agreement between the measurements of each array and the assumed acoustic ray propagation model. Therefore, we define the fitting error of each array DoA measure as

e_{k} (s, ϑ_{k}) = sin ϑ_{k} s_{x} - cos ϑ_{k} s_{y} + c_{k} (s, ϑ_{k}) .

(4)

Employing the fitting error of each array in (4), we can express the considered SSL task by writing the following optimization problem:

\begin{matrix} arg min_{s} J^{glob} (s) & = arg min_{s} \sum_{k = 1}^{K} e_{k} {(s, ϑ_{k})}^{2} \\ & = arg min_{s} \sum_{k = 1}^{K} J_{k} (s, ϑ_{k}), \end{matrix}

(5)

where we use

J_{k} (s, ϑ_{k}) = e_{k} {(s, ϑ_{k})}^{2}

to denote the local cost function of array k. Consequently, by minimizing the global cost

J^{glob} (s)

, we aim to find the source location

s

that best fits all measured DoAs across the network.
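As a worked check of Equations (1)–(4), the projected coordinates can be substituted into the fitting error. The derivation below is not part of the original text; it follows directly from the definitions of a_k, b_k, c_k, C_k, D_k, P_k, and Q_k given above, together with ϑ_k = θ_k + α_k.

```latex
\begin{aligned}
e_k(s, \vartheta_k)
  &= a_k \,(s_x - m'_{k,x}) + b_k \,(s_y - m'_{k,y}) \\
  &= \sin\vartheta_k \left( C_k P_k - Q_k \sqrt{D_k - C_k^2} \right)
   - \cos\vartheta_k \left( C_k Q_k + P_k \sqrt{D_k - C_k^2} \right) \\
  &= C_k \sin(\vartheta_k - \alpha_k) - \sqrt{D_k - C_k^2}\, \cos(\vartheta_k - \alpha_k) \\
  &= C_k \sin\theta_k - \sqrt{D_k - C_k^2}\, \cos\theta_k .
\end{aligned}
```

For a noiseless local DoA, cos θ_k = C_k / \sqrt{D_k} and sin θ_k = \sqrt{D_k - C_k^2} / \sqrt{D_k}, so the two terms cancel and e_k vanishes at the true source position, as expected from Equation (2).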

2.2. Centralized Resolution Approach Presented in [24]

A centralized solution to the optimization problem (5) can be obtained by using the following Gauss–Newton iterative method [24]:

s_{i} = s_{i - 1} - {[\nabla e (s_{i - 1})]}^{†} e (s_{i - 1})

(6)

where

e (s_{i}) = {[e_{1} (s_{i}, ϑ_{1}), e_{2} (s_{i}, ϑ_{2}), \dots, e_{K} (s_{i}, ϑ_{K})]}^{T}

,

s_{i} = {[s_{x, i}, s_{y, i}, s_{z, i}]}^{T}

is the position estimate at iteration index i, ^† denotes the Moore–Penrose pseudoinverse, and

\nabla e (s) = {[\nabla e_{1} (s, ϑ_{1}), \dots, \nabla e_{K} (s, ϑ_{K})]}^{T}

is the K × 3 Jacobian of the error vector, built from the per-array error gradients

\nabla e_{k} (s, ϑ_{k}) = {[\frac{\partial e_{k} (s, ϑ_{k})}{\partial s_{x}}, \frac{\partial e_{k} (s, ϑ_{k})}{\partial s_{y}}, \frac{\partial e_{k} (s, ϑ_{k})}{\partial s_{z}}]}^{T} .
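The iteration in Equation (6) can be sketched in a few lines. The example below is an illustrative sketch rather than the implementation of [24]: it assumes NumPy, a hypothetical four-array geometry, and noiseless DoAs. The fitting error is written after substituting Equation (1) into Equation (4), and the gradient follows Equations (14) and (15) given later in the paper.

```python
import numpy as np

def error_and_grad(s, m, alpha, beta, vartheta):
    """Fitting error e_k and its gradient with respect to s."""
    P, Q = np.cos(alpha), np.sin(alpha)
    v = np.array([P * np.cos(beta), Q * np.cos(beta), np.sin(beta)])
    d = s - m
    C = v @ d
    R = np.sqrt(d @ d - C**2)                  # sqrt(D_k - C_k^2)
    a, b = np.sin(vartheta), -np.cos(vartheta)
    e = a * (C * P - Q * R) + b * (C * Q + P * R)
    Theta = (d - v * C) / R
    grad = -a * (Q * Theta - v * P) + b * (P * Theta + v * Q)
    return e, grad

def true_doa(s, m, alpha, beta):
    """Noiseless global DoA: local angle theta_k plus the azimuth alpha_k."""
    v = np.array([np.cos(alpha) * np.cos(beta),
                  np.sin(alpha) * np.cos(beta),
                  np.sin(beta)])
    d = s - m
    return np.arccos(v @ d / np.linalg.norm(d)) + alpha

def gauss_newton_step(s, arrays, varthetas):
    """One update of Eq. (6): s - pinv(Jacobian) @ error vector."""
    K = len(arrays)
    e = np.empty(K)
    J = np.empty((K, 3))
    for k, ((m, alpha, beta), th) in enumerate(zip(arrays, varthetas)):
        e[k], J[k] = error_and_grad(s, m, alpha, beta, th)
    return s - np.linalg.pinv(J) @ e

# Hypothetical geometry (reference point, azimuth, elevation) for illustration.
arrays = [(np.array([0.0, 0.0, 0.5]), 0.0, 0.0),
          (np.array([4.0, 0.0, 1.0]), np.pi / 2, 0.0),
          (np.array([4.0, 3.0, 2.0]), np.pi, 0.2),
          (np.array([0.0, 3.0, 1.5]), -np.pi / 2, -0.2)]
s_gt = np.array([2.0, 1.2, 1.5])
varthetas = [true_doa(s_gt, *arr) for arr in arrays]

s = s_gt + np.array([0.2, -0.1, 0.1])          # start near the source
for _ in range(10):
    s = gauss_newton_step(s, arrays, varthetas)
print("estimate after 10 iterations:", s)
```

With noiseless DoAs the true source position is a fixed point of the update, since the error vector is zero there.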

Although the approach converges to a solution quite quickly [24], it has several limitations in terms of flexibility and localization performance. Regarding flexibility, Ref. [24] is designed for single DoA measurements and remains an “instantaneous” approach that is not able to learn from data streams and to adapt accordingly. Moreover, since it is computationally centralized, its scalability and robustness are limited. As far as localization performance is concerned, Ref. [24] lacks a mechanism to scale the contribution of individual array measurements, preventing the penalization of arrays that negatively impact localization accuracy.

2.3. Distributed Resolution Approach Presented in [38]

Our previous conference paper [38] presents a computationally distributed approach based on ATC diffusion. It is designed starting from the theoretical framework established in [24], adapting it to DoA streams while incorporating a mechanism to penalize arrays with the most detrimental effects on localization. However, Ref. [38] is also characterized by important limitations. Firstly, only anechoic environments were considered for the performance evaluation, resulting in a simple DoA stream model. Moreover, only fully connected topologies were considered and no studies on the performance of networks with reduced connectivity were presented.

3. Adaptive Sound Source Localization

We first present a computationally centralized approach to solving the 3D SSL problem outlined in Section 2. This method improves upon the one proposed in [24] by incorporating penalty factors to scale down the contribution of arrays that negatively impact the localization task. In preparation for this discussion, we briefly describe the network model.

3.1. Network Topology

Let us assume the K microphone arrays are connected by a network topology modeled as a digraph, where the nodes represent the microphone arrays and the edges represent communication links between the node pairs. We define the neighborhood of a node k as the set

N_{k}

of all nodes connected to k by an edge. Given two neighboring nodes k and l, a non-negative scalar

w_{l k}

is used by node k to scale the data it receives from node l, and can be interpreted as the degree of trust node k assigns to node l, and the converse holds for

w_{k l}

. A depiction of a network of this sort is shown in Figure 2. We can collect the scaling coefficients of the entire network into a

K \times K

combination matrix

W = [w_{l k}]

.

Figure 2. Example of network topology with K microphone arrays. The green area highlights the neighborhood

N_{k}

of agent k.

To enforce convergence and stability properties, it is common to require

W

to be a left-stochastic and primitive matrix [39,40]. Left-stochasticity is enforced by requiring all columns of

W

to sum to one. Additionally, we impose a strongly connected topology, ensuring a path exists between any two nodes and that at least one node is connected to itself. This last condition automatically guarantees that

W

is primitive [30].

3.2. Centralized Adaptive SSL

A first step towards a more general SSL problem formulation of (5), which enables the introduction of penalty factors to scale the contribution of noisy arrays, is given by

arg min_{s} J^{weight} (s) = arg min_{s} \sum_{k = 1}^{K} q_{k} J_{k} (s, ϑ_{k}) .

(7)

This more general problem formulation leads to different solutions compared to the unweighted problem in (5). The solutions of these two optimization problems coincide only if the

q_{k}

terms are equal or if the local costs are minimized at the same location [30]. However, in the SSL framework under consideration, this is not the case. From the theory of adaptive networks, we know that the solution of (7) can be interpreted as a Pareto-optimal solution [40]. Different choices of the scaling factors

q_{k}

will lead to different Pareto-optimal solutions. From now on, we denote a Pareto-optimal solution with

s^{*}

, which we also call the network’s limit point. It is important to emphasize that the limit point

s^{*}

does not necessarily correspond to the actual position of the source, since in real scenarios biases of DoA estimations lead to different solutions.

For the solution of (7), we here consider a centralized approach based on stochastic gradient descent (SGD) and characterized by the following iteration:

s_{i} = s_{i - 1} - \sum_{k = 1}^{K} q_{k} \nabla_{s} J_{k} (s_{i - 1}, ϑ_{k, i})

(8)

where

\nabla_{s}

denotes the gradient operator with respect to

s

. Unlike in Equation (6), here the instantaneous DoA measurements at each node are replaced by DoA streams, i.e.,

ϑ_{k} = ϑ_{k, i}

, which endows the approach with learning and adaptation capabilities. Note also that the use of DoA streams leads to perturbations during descent to a Pareto-optimal solution

s^{*}

. These perturbations are inherent in SGD-based approaches, as the local costs and thus the global cost function are evaluated with noisy DoA measures. As a result, the network never fully reaches

s^{*}

, but continues to orbit around this solution after an initial transient.

Without loss of generality, we express

q_{k}

as

q_{k} = μ_{k} p_{k} (W)

, where

μ_{k}

is an array-specific parameter, while

p_{k} (W)

depends on the network topology. This leads to the following centralized approach:

s_{i} = s_{i - 1} - \sum_{k = 1}^{K} μ_{k} p_{k} (W) \nabla_{s} J_{k} (s_{i - 1}, ϑ_{k, i}) .

(9)

Here,

μ_{k} > 0

is a step-size parameter that determines the adaptation speed of each array, whereas

p_{k} (W)

is set as the kth entry of the Perron eigenvector of the combination matrix

W

. Since we constrain

W

to be left-stochastic and primitive, the existence of the Perron eigenvector is guaranteed by the Perron–Frobenius theorem [41]. More precisely, the eigenvector satisfies

Wp = p, 1^{T} p = 1, p_{k} > 0, k = 1, \dots, K .

(10)
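The conditions in Equation (10) are straightforward to evaluate numerically. The following sketch assumes NumPy and a small, hypothetical 2 × 2 left-stochastic combination matrix; it extracts the eigenvector associated with the eigenvalue one and rescales it so that its entries sum to one.

```python
import numpy as np

def perron_vector(W):
    """Eigenvector p with W p = p, scaled so that its entries sum to one."""
    W = np.asarray(W, dtype=float)
    if not np.allclose(W.sum(axis=0), 1.0):
        raise ValueError("W must be left-stochastic (each column sums to one)")
    vals, vecs = np.linalg.eig(W)
    idx = np.argmin(np.abs(vals - 1.0))     # eigenvalue closest to one
    p = np.real(vecs[:, idx])
    return p / p.sum()                      # also fixes the sign of the eigenvector

# Hypothetical combination matrix; entry [l, k] is the weight w_{lk}.
W = np.array([[0.5, 0.25],
              [0.5, 0.75]])
p = perron_vector(W)
print("Perron eigenvector:", p)
```

For this particular W, solving W p = p by hand gives p = [1/3, 2/3]^T, so the second array would receive twice the weight of the first in Equation (9).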

This specific choice of

p_{k} (W)

aligns with the goal of obtaining the equivalent centralized implementation of the distributed diffusion ATC approach described in [38], which considers only fully connected networks. Specifically, this choice ensures that the centralized SGD approach of (9) yields the same iterates

s_{i}

as the corresponding fully connected distributed implementation with combination matrix

W

[39]. The process of setting the combination weights in

W

in order to scale noisy arrays is described in Section 4.2.

4. ATC Diffusion Sound Source Localization

In this section, we extend the computationally distributed approach of [38] to solve the problem in (7). First, we revisit the ATC diffusion-based 3D SSL approach of [38]. Then, we introduce novel cooperation strategies and propose two metrics to evaluate distributed approaches for SSL.

4.1. ATC SSL Algorithm

According to [38], a computationally distributed solution to (7) can be found by employing a distributed version of the SGD, called Adapt-Then-Combine (ATC) diffusion [30]. The ATC diffusion method consists of two steps. In the adapt step, a local SGD step is performed by each node. In the combine step, each node combines the results of its neighbors. Applying ATC diffusion to the problem in Equation (7) leads to the following recursion, defined for each agent k:

\{\begin{matrix} ψ_{k, i} = s_{k, i - 1} - μ_{k} \nabla_{s} J_{k} (s_{k, i - 1}, ϑ_{k, i}) \\ s_{k, i} = \sum_{l \in N_{k}} w_{l k} ψ_{l, i} \end{matrix} .

(11)

As each local cost function in (5) is a quadratic term, its gradient can be expressed as

\nabla_{s} J_{k} (s, ϑ_{k}) = 2 e_{k} (s, ϑ_{k}) \nabla_{s} e_{k} (s, ϑ_{k}) .

(12)

Therefore, we can rewrite (11) as

\{\begin{matrix} ψ_{k, i} = s_{k, i - 1} - 2 μ_{k} e_{k} (s_{k, i - 1}, ϑ_{k, i}) \nabla_{s} e_{k} (s_{k, i - 1}, ϑ_{k, i}) \\ s_{k, i} = \sum_{l \in N_{k}} w_{l k, i} ψ_{l, i} \end{matrix}

(13)

where

\nabla_{s} e_{k} (s, ϑ_{k}) = [\begin{matrix} - a_{k} (Q_{k} Θ_{k, x} - v_{k, x} P_{k}) + b_{k} (P_{k} Θ_{k, x} + v_{k, x} Q_{k}) \\ - a_{k} (Q_{k} Θ_{k, y} - v_{k, y} P_{k}) + b_{k} (P_{k} Θ_{k, y} + v_{k, y} Q_{k}) \\ - a_{k} (Q_{k} Θ_{k, z} - v_{k, z} P_{k}) + b_{k} (P_{k} Θ_{k, z} + v_{k, z} Q_{k}) \end{matrix}]

(14)

and

Θ_{k, j} = \frac{s_{j} - m_{k, j} - v_{k, j} C_{k}}{\sqrt{D_{k} - C_{k}^{2}}} with j = x, y, z .

(15)

It is worth noting that the quantities

a_{k}

and

b_{k}

depend on the instantaneous DoA measures

ϑ_{k, i}

, while the quantities

C_{k}

and

D_{k}

depend on the previous estimate of the source position held by array k,

s_{k, i - 1}

, but such dependencies have been omitted for the sake of readability.
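The gradient in Equations (14) and (15) can be validated against finite differences. The sketch below assumes NumPy and hypothetical values for the geometry and the DoA; it evaluates the fitting error literally through Equations (1), (3), and (4), and compares the analytic gradient with a central-difference approximation.

```python
import numpy as np

def fitting_error(s, m, alpha, beta, vartheta):
    """e_k of Eq. (4), using the projected coordinates of Eq. (1)."""
    P, Q = np.cos(alpha), np.sin(alpha)
    v = np.array([P * np.cos(beta), Q * np.cos(beta), np.sin(beta)])
    d = s - m
    C = v @ d
    R = np.sqrt(d @ d - C**2)
    mx = s[0] - C * P + Q * R                               # m'_{k,x}
    my = s[1] - C * Q - P * R                               # m'_{k,y}
    c = my * np.cos(vartheta) - mx * np.sin(vartheta)       # c_k, Eq. (3)
    return np.sin(vartheta) * s[0] - np.cos(vartheta) * s[1] + c

def fitting_error_grad(s, m, alpha, beta, vartheta):
    """Analytic gradient of e_k with respect to s, Eqs. (14) and (15)."""
    P, Q = np.cos(alpha), np.sin(alpha)
    v = np.array([P * np.cos(beta), Q * np.cos(beta), np.sin(beta)])
    d = s - m
    C = v @ d
    R = np.sqrt(d @ d - C**2)
    Theta = (d - v * C) / R                                 # Eq. (15)
    a, b = np.sin(vartheta), -np.cos(vartheta)
    return -a * (Q * Theta - v * P) + b * (P * Theta + v * Q)

def numerical_grad(f, s, h=1e-6):
    """Central-difference approximation of the gradient of f at s."""
    g = np.zeros(3)
    for j in range(3):
        step = np.zeros(3)
        step[j] = h
        g[j] = (f(s + step) - f(s - step)) / (2.0 * h)
    return g

# Hypothetical values, for illustration only.
s = np.array([1.2, 0.8, 1.0])
m = np.array([0.1, 0.2, 0.5])
alpha, beta, vartheta = 0.3, 0.1, 1.0

g_an = fitting_error_grad(s, m, alpha, beta, vartheta)
g_num = numerical_grad(lambda x: fitting_error(x, m, alpha, beta, vartheta), s)
print("analytic :", g_an)
print("numerical:", g_num)
```

Note that the error depends on s both explicitly and through the projected coordinates, which is why the gradient contains the terms in Θ_{k,j} and v_{k,j}.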

From Equation (13), we also note that there are two sets of tunable parameters that can influence the dynamics of the ATC diffusion recursion, namely, the step-size parameter

μ_{k}

and the combination weights

w_{l k}

. Both parameters influence the solution of the problem in Equation (7), leading to different Pareto-optimal solutions

s^{*}

[40]. From this point on, we assume a uniform step-size parameter

μ_{k} = μ

across all nodes, so that it no longer influences the attained solutions

s^{*}

. Instead, combination weights

w_{l k}

are set to achieve different limit point solutions

s^{*}

[30,39], and we refer to the policies for determining these coefficients as combination policies. In the context of SSL, it is desirable to adjust these coefficients to obtain limit point solutions

s^{*}

that are closer to the actual sound source, thereby increasing localization accuracy.

In the case where the network is fully connected, and thus no zero weights are present in the combination matrix

W

, it is worth noting that each node could potentially compute the centralized implementation discussed in Section 3.2. This implementation is characterized by the same localization accuracy and convergence behavior as the distributed, fully connected implementation.

4.2. Adaptive Combination Policies

We now discuss how combination policies can be used to improve the localization accuracy of the proposed distributed SSL method. In general, we can distinguish two classes of combination policies: one where the combination coefficients are kept constant and one where they change from iteration to iteration [30]. In this paper, we focus on the latter case and thus refer to the resulting combination strategies as adaptive combination strategies. The main idea is to adaptively adjust the combination weights in the matrix

W

to penalize arrays that have the most detrimental effect on localization accuracy by assigning them small combination weights. To achieve this goal, at each iteration i, each agent k assigns a penalty term

γ_{l, i}

, with

l \in N_{k}

, to its neighbors as

γ_{l, i} = (1 - ζ) γ_{l, i - 1} + ζ φ_{l, i},

(16)

where

φ_{l, i}

is an arbitrary penalty factor for the current iteration and

0 \leq ζ \leq 1

is a forget rate that smooths the penalty factors. Then, the combination weights assigned by node k to its neighbors are calculated as follows:

w_{l k, i} = \{\begin{matrix} \frac{1}{γ_{l, i}} {(\sum_{m \in N_{k}} \frac{1}{γ_{m, i}})}^{- 1}, & if l \in N_{k} \\ 0, & otherwise \end{matrix}

(17)

where we normalize the weights so that the resulting combination matrix is left-stochastic. Penalty factors can be chosen in several ways. Below, we propose three different combination policies aimed at increasing the localization accuracy of ATC diffusion SSL and discuss their applicability.
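Equations (16) and (17) amount to an exponential smoothing of the penalties followed by an inverse-penalty normalization. The following sketch assumes NumPy; the function names, the small guard `eps` against division by zero, and the convention that the neighborhood of node k contains k itself are assumptions made for illustration and are not specified in the text.

```python
import numpy as np

def smooth_penalty(gamma_prev, phi, zeta):
    """Eq. (16): exponentially smoothed penalty term."""
    return (1.0 - zeta) * gamma_prev + zeta * phi

def combination_weights(gamma, neighbors, K, eps=1e-12):
    """Eq. (17): weights w_{lk} that node k assigns to all K nodes.

    gamma     : array of length K with the current penalties gamma_{l,i}
    neighbors : indices of the nodes in N_k (assumed to include k itself)
    eps       : guard against division by zero (an assumption of this sketch)
    """
    w = np.zeros(K)
    inv = 1.0 / np.maximum(gamma[neighbors], eps)
    w[neighbors] = inv / inv.sum()          # nodes outside N_k keep weight zero
    return w

# Hypothetical penalties for a four-node network.
gamma = np.array([1.0, 2.0, 4.0, 8.0])
w_k = combination_weights(gamma, neighbors=[0, 1, 2], K=4)
print("weights assigned by node k:", w_k)
print("smoothed penalty:", smooth_penalty(1.0, 3.0, zeta=0.5))
```

In this example the inverse penalties of the three neighbors are 1, 0.5, and 0.25, so the weights are 4/7, 2/7, and 1/7, and the node outside the neighborhood receives zero weight. Stacking these vectors as columns yields a left-stochastic combination matrix.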

4.2.1. Error Penalty Factor

The first penalty factor we propose utilizes the fitting errors of neighboring nodes. At each iteration, each agent k penalizes its neighbors based on

| e_{l} (s_{l, i - 1}, ϑ_{l, i}) |

, where

l \in N_{k}

. This penalty factor represents the absolute value of the fitting error between the source position estimate and the data measurements from the neighboring nodes. Thus, the penalty factor is calculated as follows:

φ_{l, i} = | e_{l} (s_{l, i - 1}, ϑ_{l, i}) |,

(18)

and then smoothed according to Equation (16).

4.2.2. Cost Penalty Factor

Another option is to utilize the values of the cost functions associated with the neighbors of each node as a penalty factor. As the cost function is defined as a quadratic error, each agent can employ the following penalty factor for its neighbors:

φ_{l, i} = J_{l} (s_{l, i - 1}, ϑ_{l, i}) = e_{l}^{2} (s_{l, i - 1}, ϑ_{l, i}) .

(19)

Furthermore, assuming a uniform step size among all nodes, i.e.,

μ_{l} = μ_{k} = μ

, and a fully connected network topology, it is possible to verify that the policy resulting from the employment of this penalty factor corresponds to the well-known neighbor-centered adaptive relative variance combination policy [30] from adaptive networks theory.

4.2.3. Distance Penalty Factor

As a last option, we propose to use the distance between each array and the predicted position of the sound source as the penalty factor. This is because, especially in reverberant environments, distant sources are expected to impair the DoA estimation methods more than nearby sources. Furthermore, distant arrays impair the localization more than nearby arrays for the same DoA error. Therefore, we may use the following penalty factor:

φ_{l, i} = {∥ m_{l} - s_{l, i - 1} ∥}^{2} .

(20)

Regardless of which approach is used to determine the combination weights, each of the proposed policies requires the exchange of additional parameters between agents. Without any combination policies, nodes only exchange their estimates for the position of the sound source during the combine step of the ATC diffusion iteration. Similarly, the distance penalty does not require any additional data exchange if we assume that self-localization is performed once before SSL, allowing each node to know the positions of all its neighbors. Instead, when using either the error penalty in Equation (18) or the cost penalty in Equation (19), an additional parameter, namely,

e_{l} (s_{l, i - 1}, ϑ_{l, i})

, must be exchanged between neighboring nodes. However, even if the additional parameter

e_{l} (s_{l, i - 1}, ϑ_{l, i})

is exchanged, the communication bandwidth of each agent remains small unless data are exchanged very frequently.

To summarize the proposed approach, Algorithm 1 provides the pseudocode for each agent, detailing the iterative application of diffusion-based SSL and the combination policy. Complementing this, Figure 3 presents the corresponding block scheme.

Algorithm 1 Diffusion-Based SSL With Linear Arrays

% For each array k, at each new iteration i, do
% Adapt step
$ψ_{k, i} = s_{k, i - 1} - 2 μ_{k} e_{k} (s_{k, i - 1}, ϑ_{k, i}) \nabla_{s} e_{k} (s_{k, i - 1}, ϑ_{k, i})$
% Update penalties according to the selected combination strategy
for each neighbor $l \in N_{k}$ do
    compute the penalty factor $φ_{l, i}$ according to one of (18)–(20)
    $γ_{l, i} = (1 - ζ) γ_{l, i - 1} + ζ φ_{l, i}$
end for
% Compute combination weights
for each neighbor $l \in N_{k}$ do
    $w_{l k, i} = \frac{1}{γ_{l, i}} {(\sum_{m \in N_{k}} \frac{1}{γ_{m, i}})}^{- 1}$
end for
% Combine step
$s_{k, i} = \sum_{l \in N_{k}} w_{l k, i} ψ_{l, i}$

Figure 3. Schematic of the ATC diffusion-based SSL for node k in an example topology. Each node updates its local estimate (adapt step) and then merges intermediate estimates from neighbors (combine step) using an adaptive combination policy.
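A compact simulation of Algorithm 1 with the error penalty of Equation (18) may look as follows. This is a sketch under stated assumptions, not the code used for the experiments: it assumes NumPy, a hypothetical four-array fully connected network, Gaussian DoA noise with a standard deviation of 0.02 rad, an initial penalty of one for every node, and a guard `eps` against division by zero.

```python
import numpy as np

def orientation(alpha, beta):
    """Orientation unit vector v_k."""
    return np.array([np.cos(alpha) * np.cos(beta),
                     np.sin(alpha) * np.cos(beta),
                     np.sin(beta)])

def error_and_grad(s, m, alpha, beta, vartheta):
    """Fitting error of Eq. (4) and its gradient, Eqs. (14) and (15)."""
    P, Q = np.cos(alpha), np.sin(alpha)
    v = orientation(alpha, beta)
    d = s - m
    C = v @ d
    R = np.sqrt(d @ d - C**2)
    a, b = np.sin(vartheta), -np.cos(vartheta)
    e = a * (C * P - Q * R) + b * (C * Q + P * R)
    Theta = (d - v * C) / R
    return e, -a * (Q * Theta - v * P) + b * (P * Theta + v * Q)

def true_doa(s, m, alpha, beta):
    """Noiseless global DoA: theta_k of Eq. (26) plus the azimuth alpha_k."""
    d = s - m
    return np.arccos(orientation(alpha, beta) @ d / np.linalg.norm(d)) + alpha

def atc_iteration(S, gamma, arrays, varthetas, neighbors,
                  mu=0.1, zeta=0.1, eps=1e-12):
    """One ATC iteration for all nodes; S has shape (K, 3)."""
    K = len(arrays)
    psi = np.empty_like(S)
    phi = np.empty(K)
    for k, ((m, alpha, beta), th) in enumerate(zip(arrays, varthetas)):
        e, g = error_and_grad(S[k], m, alpha, beta, th)
        psi[k] = S[k] - 2.0 * mu * e * g            # adapt step, Eq. (13)
        phi[k] = abs(e)                             # error penalty, Eq. (18)
    gamma = (1.0 - zeta) * gamma + zeta * phi       # smoothing, Eq. (16)
    S_new = np.empty_like(S)
    for k in range(K):
        nbrs = neighbors[k]
        inv = 1.0 / np.maximum(gamma[nbrs], eps)
        w = inv / inv.sum()                         # weights, Eq. (17)
        S_new[k] = w @ psi[nbrs]                    # combine step
    return S_new, gamma

# Hypothetical geometry (reference point, azimuth, elevation).
arrays = [(np.array([0.0, 0.0, 0.5]), 0.0, 0.0),
          (np.array([4.0, 0.0, 1.0]), np.pi / 2, 0.0),
          (np.array([4.0, 3.0, 2.0]), np.pi, 0.2),
          (np.array([0.0, 3.0, 1.5]), -np.pi / 2, -0.2)]
K = len(arrays)
neighbors = [list(range(K)) for _ in range(K)]      # fully connected network
s_gt = np.array([2.0, 1.2, 1.5])
clean = np.array([true_doa(s_gt, *arr) for arr in arrays])

rng = np.random.default_rng(0)
S = np.tile(np.array([1.5, 1.5, 1.0]), (K, 1))      # common initial estimate
gamma = np.ones(K)
for i in range(500):
    noisy = clean + rng.normal(0.0, 0.02, size=K)   # assumed DoA stream noise
    S, gamma = atc_iteration(S, gamma, arrays, noisy, neighbors)
print("network-average estimate:", S.mean(axis=0))
```

Because each penalty depends only on the neighbor it refers to, a single vector `gamma` is shared by all nodes here; in a real deployment each node would keep its own copy for its neighborhood. Replacing `phi[k] = abs(e)` with the squared error or the squared distance gives the policies of Equations (19) and (20).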

4.3. Metrics

To evaluate the performance of distributed SSL and the effectiveness of the different combination strategies, we introduce two metrics, namely, the mean absolute error (MAE) and the mean squared deviation (MSD). Both metrics are used to describe the steady-state behavior of the network and are calculated for each sensor node. An averaged version of these metrics across all nodes can be used to evaluate the global performance of the network.

4.3.1. Mean Absolute Error

The MAE measures localization accuracy as the error between the predicted source position and the actual position of the sound source, which we denote as

s^{GT}

. Mathematically, we define the node and the network MAE as

\begin{matrix} {MAE}_{k} & ≜ lim_{i \to \infty} E ∥ s_{k, i} - s^{GT} ∥, \\ {MAE}_{avg} & ≜ \frac{1}{K} \sum_{k = 1}^{K} {MAE}_{k}, \end{matrix}

(21)

respectively. In practice, we cannot use the theoretical definitions of the MAE as in Equation (21) because these require averaging over an infinite number of iterations. Therefore, in this paper, we estimate the MAE by stopping the diffusion-based SSL algorithm after

I_{\max} \gg 1

iterations to ensure that the network has reached a steady state and estimate the node and network MAE as

\begin{matrix} {\hat{MAE}}_{k} & ≜ \frac{1}{N_{it}} \sum_{i = I_{\max} - N_{it}}^{I_{\max}} ∥ s_{k, i} - s^{GT} ∥ \\ {\hat{MAE}}_{avg} & ≜ \frac{1}{K} \sum_{k = 1}^{K} {\hat{MAE}}_{k} \end{matrix}

(22)

where

N_{it}

denotes the number of steady-state iterations.

4.3.2. Mean Square Deviation

MSD evaluates the stability properties of diffusion methods [30] by quantifying the mean squared error (MSE) between each node’s iterate at steady state and the estimated network limit point. We define the MSD of each node and that of the entire network as

\begin{matrix} {MSD}_{k} & ≜ lim_{i \to \infty} E ∥ s_{k, i} - s^{*} ∥, \\ {MSD}_{avg} & ≜ \frac{1}{K} \sum_{k = 1}^{K} {MSD}_{k} . \end{matrix}

(23)

Similarly to before, we cannot use the theoretical definitions of MSD as in Equation (23) because we do not know the limit point of the network

s^{*}

and the definitions are averages over an infinite number of iterations. Moreover, it is often more interesting to show “instantaneous” values of the MSD, which show how the deviations of the iterates around the reached network limit points evolve over time. In this sense, we define the “instantaneous” MSD as

\begin{matrix} {\hat{MSD}}_{k} (i) & ≜ ∥ s_{k, i} - {\hat{s}}^{*} ∥, \\ {\hat{MSD}}_{avg} (i) & ≜ \frac{1}{K} \sum_{k = 1}^{K} {\hat{MSD}}_{k} (i), \end{matrix}

(24)

where the network limit point is estimated as

{\hat{s}}^{*} ≜ \frac{1}{K N_{it}} \sum_{k = 1}^{K} \sum_{i = I_{\max} - N_{it}}^{I_{\max}} s_{k, i} .

(25)
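The two estimates in Equations (24) and (25) can be sketched in a few lines of Python. The snippet below is only an illustration: the function name `instantaneous_msd` and the (K, I, 3) storage layout of the iterates are assumptions, and the limit point is averaged over the last N_it stored iterations.

```python
import numpy as np


def instantaneous_msd(trajectories, n_it):
    """Estimate the network limit point and the instantaneous MSD curves.

    trajectories: array of shape (K, I, 3) with the iterates s_{k,i}
                  (assumed layout).
    n_it:         number of final iterations used to estimate the limit point.
    """
    trajectories = np.asarray(trajectories, dtype=float)
    # Eq. (25): average the iterates of all nodes over the final iterations.
    s_star_hat = trajectories[:, -n_it:, :].mean(axis=(0, 1))
    # Eq. (24): distance of every iterate from the estimated limit point.
    msd_k = np.linalg.norm(trajectories - s_star_hat, axis=-1)  # shape (K, I)
    msd_avg = msd_k.mean(axis=0)                                # shape (I,)
    return s_star_hat, msd_k, msd_avg
```

Plotting `msd_avg` against the iteration index gives a curve of the kind used in Section 6 to study convergence and steady-state behavior.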

5. Testing the Localization Accuracy

In this section, we evaluate the localization accuracy of the proposed diffusion-based SSL method using both DoA stream models and an acoustic simulation. The evaluation is performed under a fully connected network topology, establishing a clear baseline for assessing the impact of combination policies without the complexities introduced by sparser connectivity. The influence of partial network connectivity is explored in Section 7. Throughout this section, we consider a room of size

4 m \times 3 m \times 3 m

, with eight microphone arrays placed near the room edges, with reference positions

m_{k}

and orientations

v_{k}

set according to Table 1. We also consider different candidate sound source positions

s^{GT}

placed on a uniformly distributed 2D grid of

23 \times 23

points for two different planes:

z = 0.45

and

z = 1.5

. As for the diffusion-based SSL method, we consider a uniform step size across all nodes

μ_{k} = μ = 0.1

. Sound sources are assumed to be stationary in this set of experiments.

Table 1. Reference points and orientations of microphone arrays used in Section 5.

5.1. DoA Stream Models

We begin by evaluating the accuracy of the proposed ATC diffusion SSL method using different DoA stream models. These models represent the combined effects that room acoustics and the DoA estimators used in each array have on the generation of DoA streams. This approach allows us to test the accuracy of the method in controlled, simplified scenarios and to investigate how localization accuracy is affected by factors such as biases in DoA estimation.

We employ three different DoA stream models. Each of them is a perturbed version of the actual DoA

θ_{k}^{GT}

, which we define as

θ_{k}^{GT} = arccos (\frac{v_{k}^{T} (s^{GT} - m_{k})}{∥ s^{GT} - m_{k} ∥}) .

(26)
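For illustration, Equation (26) can be evaluated as in the following Python sketch. It assumes that the orientation vector v_k has unit norm; the function name `ground_truth_doa` is introduced here only for the example, and the clipping step is a numerical safeguard that is not part of the equation.

```python
import numpy as np


def ground_truth_doa(s_gt, m_k, v_k):
    """Angle (in radians) between the array axis v_k and the source direction.

    s_gt: source position, m_k: array reference point, v_k: unit orientation
    vector of the array; all are length-3 sequences.
    """
    d = np.asarray(s_gt, dtype=float) - np.asarray(m_k, dtype=float)
    cos_theta = np.dot(np.asarray(v_k, dtype=float), d) / np.linalg.norm(d)
    # Clip to [-1, 1] to guard against round-off before taking the arccos.
    return np.arccos(np.clip(cos_theta, -1.0, 1.0))
```

Since a linear array only observes the angle with respect to its own axis, the returned value lies in [0, π].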

The simplest DoA stream model is based on the assumption that DoAs measured by each array are drawn from the following distribution:

θ_{k, i} \sim G (θ_{k}^{GT}, σ),

(27)

where

G (ν, ϖ)

is a normal distribution with mean

ν

and variance

ϖ

. Hence, it is assumed that each agent measures unbiased DoAs with the same variance

σ

. We will refer to this DoA measurement distribution as model I. This was the only case considered in [38].

A more complex DoA stream model also considers a different DoA measurement variance for each array, i.e., DoAs are generated from the distribution

θ_{k, i} \sim G (θ_{k}^{GT}, σ_{k}) .

(28)

We refer to this DoA stream distribution as model II, according to which each agent has its own variance

σ_{k} \sim L (ν, ϖ)

, and

L

denotes a log-normal distribution with mean

ν

and variance

ϖ

.

However, in real acoustic scenarios, DoA estimates are often biased due to reverberation and low SNR. Therefore, a more realistic DoA stream model, which we refer to as model III, considers biased agents, each with its own measurement bias. The distribution for model III is given by

θ_{k, i} \sim G (ν_{k}, σ_{k}),

(29)

where

ν_{k} \sim G (θ_{k}^{GT}, σ_{b})

and

σ_{b}

models the variance of the measurement bias, and where

σ_{k}

has the same distribution as model II.
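The three DoA stream models can be sketched as follows. This is a hedged illustration rather than the generator used for the experiments: it assumes that the second argument of each distribution acts as a standard deviation expressed in degrees, and that the log-normal distribution is specified by its own mean and standard deviation. The function names and default values are assumptions chosen to mirror the settings reported below.

```python
import numpy as np


def lognormal_from_mean_std(rng, mean, std, size):
    """Draw log-normal samples whose own mean and std are `mean` and `std`."""
    sigma_ln_sq = np.log1p((std / mean) ** 2)
    mu_ln = np.log(mean) - 0.5 * sigma_ln_sq
    return rng.lognormal(mu_ln, np.sqrt(sigma_ln_sq), size)


def doa_stream(theta_gt_deg, n_iter, model, rng, sigma=10.0,
               spread_mean=1.9, spread_std=0.6, sigma_b=0.8):
    """Generate a (K, n_iter) array of DoAs in degrees for model I, II or III."""
    if model not in ("I", "II", "III"):
        raise ValueError("model must be 'I', 'II' or 'III'")
    theta_gt_deg = np.asarray(theta_gt_deg, dtype=float)
    K = theta_gt_deg.size
    if model == "I":
        # Same spread for every array.
        spread = np.full(K, sigma)
    else:
        # Models II and III: one spread per array, drawn once per stream.
        spread = lognormal_from_mean_std(rng, spread_mean, spread_std, K)
    if model == "III":
        # Model III: each array is biased around its true DoA.
        center = rng.normal(theta_gt_deg, sigma_b)
    else:
        center = theta_gt_deg
    # One row per array, one column per iteration index i.
    return rng.normal(center[:, None], spread[:, None], size=(K, n_iter))
```

For example, `doa_stream([30.0, 90.0, 120.0], 500, "III", np.random.default_rng(0))` returns one biased stream of 500 DoAs for each of three arrays.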

We now compare the localization accuracy of the proposed diffusion-based SSL with the various DoA stream models discussed above. In particular, we evaluate the accuracy by computing the localization error

{\hat{MAE}}_{avg}

according to Equation (22) and we perform

N_{trials}

Monte Carlo simulations for each possible source position in the grid to average the effect of the different DoA stream realizations.

The results are shown in Figure 4, where the metric

{\hat{MAE}}_{avg}

is shown at each possible source position for all different DoA stream models. In particular, top to bottom, Figure 4 depicts the accuracy of the proposed approach using DoA streams drawn from models I, II, and III, while different columns show different combination policies. The leftmost column shows the results when no adaptive combination policy is used and therefore all combination coefficients are set to

w_{l k, i} = w_{l k} = 1 / Q

. The other three columns, in order from left to right, in Figure 4 show the results obtained when the combination coefficients are set according to the distance penalty factor in Equation (20), the error penalty factor in Equation (18), and the cost penalty factor in Equation (19), respectively. For this analysis, we set

N_{trials} = 50

,

N_{it} = 100

, and

I_{\max} = 500

. These values were selected to ensure that, from iteration

I_{\max} - N_{it}

onward, all subsequent iterations corresponded to steady-state behavior. DoA streams generated with model I were obtained by setting

σ = 10 °

, streams generated with model II were obtained by setting

σ_{k} \sim L (1.9 °, 0.6 °)

, and those generated with model III were obtained by using the same

σ_{k}

distribution and setting

ν_{k} \sim G (θ_{k}^{GT}, 0.8)

. We verify that DoA measurement biases negatively impact the accuracy of the proposed method. More interestingly, we observe that the error-based policies (i.e., the ones using either the error or the cost as penalty factors) have higher accuracy when the DoA measurements on each array are unbiased (i.e., for DoA stream models I and II). This is to be expected since these policies penalize arrays with higher costs, and for unbiased DoA streams, higher costs are associated with less accurate estimates of the sound source position. However, this is no longer the case for DoA streams generated according to model III, where arrays are characterized by biased DoA estimates. Indeed, the local costs of each array may have minima at positions far from each other and from the actual source position

s^{GT}

. In this scenario, the adaptive distance policy shows better accuracy, as distant arrays have on average the worst estimates of DoA and thus of source position. Therefore, depending on the specific DoA estimation procedure, which in turn leads to different DoA streams, it may be appropriate to choose the combination policy for diffusion-based SSL accordingly.

Figure 4. Average MAE,

{\hat{MAE}}_{avg}

, for the DoA stream models from Section 5.1. Each subfigure shows results on a

23 \times 23

grid in the

x y

-plane at

z = 0.45

(top) and

z = 1.5

(bottom), with

N_{trials} = 50

realizations per DoA stream model. Columns (from left to right) compare no combination strategy, distance combination strategy (20), error combination strategy (18), and cost combination strategy (19). (a)

{\hat{MAE}}_{avg}

for DoA stream Model I with

σ = 10 °

. (b)

{\hat{MAE}}_{avg}

for DoA stream Model II with

σ_{k} \sim L (1.9 °, 0.6 °)

. (c)

{\hat{MAE}}_{avg}

for DoA stream Model III with

σ_{k} \sim L (1.9 °, 0.6 °)

and

ν_{k} \sim G (θ_{k}^{GT}, 0.8)

.

5.2. Acoustic Simulations

We now evaluate the localization performance of diffusion-based SSL in a simulated reverberant environment. To simulate the acoustic environment, microphone signals are generated using the image source method [42]. In particular, this method is used to generate the room impulse responses (RIRs) from every candidate source position in the 3D grid to every microphone. We then convolve the sound source signal with the simulated RIRs and corrupt the resulting signals with additive white Gaussian noise at a signal-to-noise ratio (SNR) of 20 dB to obtain the simulated microphone signals. As a sound source, we used a 30-second male speech signal from the TIMIT database with a sampling frequency of

F_{s} = 16

kHz. It is also assumed that the sound source is omnidirectional.

Let us again consider

K = 8

uniform linear arrays (ULAs) of microphones whose reference points and orientations are set again according to Table 1. Also, each array is composed of

N_{mic} = 6

elements with an inter-element distance of

δ = 2

cm. Therefore, the 3D coordinates of the nth microphone in array k can be defined as

m_{k, n} = m_{k} + v_{k} (δ (n - \frac{N_{mic} + 1}{2}))

(30)

where

m_{k, n}

denotes the nth microphone position within the kth array, and

n = 1, \dots, N_{mic}

. To obtain a DoA stream, we divide the simulated microphone signals into frames of length 2048 samples, each of which represents the data measurement at iteration step i. In each frame, microphone arrays estimate an “instantaneous” DoA using a beamforming technique. Specifically, we transform the received microphone signals into the time–frequency domain using the short-time Fourier transform (STFT) with a Hamming analysis window of length 256 samples overlapped by 50%. This results in a total of

W = 15

time windows per frame.
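The array geometry of Equation (30) and the window count per frame can be checked with a short Python sketch. The function names are assumptions introduced for this example, and the window count assumes that no padding is applied at the frame boundaries.

```python
import numpy as np


def ula_positions(m_k, v_k, n_mic=6, delta=0.02):
    """Microphone coordinates of a uniform linear array, following Eq. (30).

    m_k: array reference point, v_k: unit orientation vector,
    delta: inter-element distance in meters.
    """
    m_k = np.asarray(m_k, dtype=float)
    v_k = np.asarray(v_k, dtype=float)
    n = np.arange(1, n_mic + 1)
    # Signed offsets along the array axis, symmetric around the reference point.
    offsets = delta * (n - (n_mic + 1) / 2)
    return m_k + offsets[:, None] * v_k


def windows_per_frame(frame_len=2048, win_len=256, overlap=0.5):
    """Number of full analysis windows that fit in one frame (no padding)."""
    hop = int(win_len * (1 - overlap))
    return (frame_len - win_len) // hop + 1
```

With the values used here (frames of 2048 samples, windows of 256 samples, 50% overlap), the hop size is 128 samples and `windows_per_frame()` evaluates to 15, in agreement with the value of W stated above.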

Let

X_{k, n} (t_{p}, ω_{q})

denote the STFT of the nth microphone signal of the kth array, evaluated at a time–frequency bin

(t_{p}, ω_{q})

, where the index

t_{p}

refers to the pth time segment, and

ω_{q}

refers to the qth frequency bin. Note that the STFTs are calculated at each iteration index i, but we omit this dependency to simplify the notation. The minimum variance distortionless response (MVDR) pseudospectrum at

ω_{q}

is given by [43]

H_{k} (θ_{r}, ω_{q}) = \frac{1}{a^{H} (θ_{r}, ω_{q}) G_{k}^{- 1} (ω_{q}) a (θ_{r}, ω_{q})},

(31)

where

a (θ_{r}, ω_{q})

is the far-field propagation vector for each microphone array, computed for a set of sampled angles

θ_{r}

, in radians, providing the desired angular resolution. The elements of the propagation vector

a (θ_{r}, ω_{q})

are given by

{[a (θ_{r}, ω_{q})]}_{n} = e^{j \frac{ω_{q}}{c} δ (n - \frac{N_{mic} + 1}{2}) cos (θ_{r})}

(32)

On the other hand,

G_{k} (ω_{q})

in Equation (31) represents the sample estimate of the array covariance matrix, defined as

G_{k} (ω_{q}) = \frac{1}{W} \sum_{p = 1}^{W} x_{k} (t_{p}, ω_{q}) x_{k}^{H} (t_{p}, ω_{q}),

(33)

where

x_{k} (t_{p}, ω_{q}) = [X_{k, 1} (t_{p}, ω_{q}), \dots, X_{k, N_{mic}} (t_{p}, ω_{q})]

. Then, the DoA estimated by agent k at iteration index i is

θ_{k, i} = arg max_{θ_{r}} h_{k} (θ_{r}),

(34)

where

h_{k} (θ_{r})

is the geometric mean of the pseudospectrum values

H_{k} (θ_{r}, ω_{q})

along the frequency axis, with frequency bins in the range

[500 \; \mathrm{Hz}, 4 \; \mathrm{kHz}]

.
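The DoA estimation chain of Equations (31) to (34) can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the implementation used in the experiments: the speed of sound c = 343 m/s, the small diagonal loading added before inversion, and the function names are all choices made for this example. The quadratic form uses the conjugate transpose of the steering vector, which is the standard choice for complex-valued vectors.

```python
import numpy as np


def steering_vector(theta, omega, n_mic=6, delta=0.02, c=343.0):
    """Far-field propagation vector of Eq. (32); theta in rad, omega in rad/s."""
    n = np.arange(1, n_mic + 1)
    return np.exp(1j * omega / c * delta * (n - (n_mic + 1) / 2) * np.cos(theta))


def mvdr_doa(X, omegas, thetas, n_mic=6, delta=0.02, c=343.0, loading=1e-6):
    """Estimate one DoA from the STFT snapshots of a single frame.

    X:      complex array of shape (W, Q, n_mic), W time windows, Q bins.
    omegas: angular frequencies of the Q selected bins, in rad/s.
    thetas: candidate angles in radians.
    """
    log_h = np.zeros(len(thetas))
    for q, omega in enumerate(omegas):
        x = X[:, q, :]                          # snapshots, shape (W, n_mic)
        G = x.T @ x.conj() / x.shape[0]         # sample covariance, Eq. (33)
        # Diagonal loading (an assumption) keeps the inversion well conditioned.
        G = G + loading * np.trace(G).real / n_mic * np.eye(n_mic)
        G_inv = np.linalg.inv(G)
        for r, theta in enumerate(thetas):
            a = steering_vector(theta, omega, n_mic, delta, c)
            H = 1.0 / np.real(a.conj() @ G_inv @ a)   # pseudospectrum, Eq. (31)
            log_h[r] += np.log(H)
    # Summing logs ranks the angles exactly as the geometric mean over
    # frequency does, so the maximizer is the one in Eq. (34).
    return thetas[np.argmax(log_h)]
```

Working with the sum of log-pseudospectra avoids numerical underflow or overflow when many frequency bins are combined, while leaving the location of the maximum unchanged.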

We now present the localization results obtained from this acoustic simulation. First, as an illustrative example, Figure 5 shows the room layout (4 m × 3 m × 3 m), the positions of the eight microphone arrays (as listed in Table 1), and the true as well as estimated source locations. In this instance, the source is positioned at

s = {[0.45, 0.83, 1.50]}^{T}

m. Additionally, we illustrate the trajectories—the location estimates at each iteration—obtained using our diffusion-based SSL algorithm with all proposed combination policies.

Figure 5. Simulated acoustic environment and localization results. The actual source position (depicted as a star) is at

s = {[0.45, 0.83, 1.50]}^{T}

m, and microphone array centers are shown as filled circles. The trajectory of estimated source locations, obtained using our diffusion-based SSL algorithm, illustrates the localization process.

Further, following the methodology in the previous subsection, we evaluate localization accuracy using the average mean absolute error (

{\hat{MAE}}_{avg}

), computed over the same 3D grid. However, in this case, we consider only a single realization, corresponding to the DoA stream obtained from the simulated acoustic environment. The results are shown in Figure 6, where the considered diffusion SSL and the method of [24] are compared. As we can see, the proposed approach always achieves higher accuracy than the centralized approach of [24], since the adaptive policies aim to penalize the arrays with the most harmful effects on localization. Moreover, the combination policies always improve the results compared to the trivial ATC diffusion implementation where no combination policies are used.

Figure 6.

{\hat{MAE}}_{avg}

at each candidate source position in a 23 × 23 grid on the

x y

plane for

z = 0.45

(top) and

z = 1.5

(bottom). Each column represents a different combination strategy, with the exception of the last column, which refers to the technique of Canclini et al. [24].

We note that the localization performance obtained with the combination policy based on the error penalty factor is similar to that obtained with the policy based on the distance penalty factor. The reason is that, unlike in the simpler DoA stream models of the previous subsection, in this simulated acoustic environment and with the chosen DoA estimator the DoA variance increases together with the bias.

We are also interested in evaluating the robustness of the proposed method in adverse acoustic scenarios by assessing how the localization accuracy changes as the reverberation time T60 and the SNR vary. Specifically, we measure the localization accuracy for different T60 values between 0 and 1 s at a fixed SNR of 20 dB. We also measure the localization accuracy for a fixed T60 of 0.4 s and SNR values varying between 10 and 35 dB. Instead of a point-by-point analysis, in this set of experiments we measure the average MAE over all the source positions in a subsampled grid of

5 \times 5 \times 3

points in the room, in order to have a mean volumetric localization accuracy value for each T60/SNR pair. In line with previous experiments, we assess the localization accuracy of every source location in the grid using the metric

{\hat{MAE}}_{avg}

, with

N_{it} = 100

. This metric is then averaged across the grid to obtain the mean volumetric average of localization accuracy, denoted as

ν_{{\hat{MAE}}_{avg}}

. These results are summarized in Figure 7.

Figure 7. Average localization accuracy

ν_{{\hat{MAE}}_{avg}}

across the entire simulated acoustic environment (as described in Section 5.2), comparing the proposed method with various combination policies and the baseline method of Canclini et al. [24]. On the left,

ν_{{\hat{MAE}}_{avg}}

is obtained with T60 ranging from 0 to 1 s with a fixed SNR of 20 dB, while on the right,

ν_{{\hat{MAE}}_{avg}}

is obtained with a prescribed SNR ranging from 10 to 35 dB and a fixed T60 of 0.4 s.

As expected, the performance of all methods deteriorates with increasing T60 and decreasing SNR. Nevertheless, even under the worst conditions, the diffusion-based methods outperform the centralized method proposed in [24]. We also observe that all the proposed adaptive combination policies result in improved accuracy and stability. Notably, the policy based on the cost penalty factor consistently delivers the best performance in terms of MAE.

6. A Study on Convergence and Stability Behavior

We now discuss the network dynamics of the proposed method, both in terms of convergence rate and steady-state stability. To assess these properties, we use the “instantaneous” MSD in Equation (24) as a metric, and evaluate it across several iterations. The network limit point is estimated according to Equation (25) with

N_{it} = 100

. Both analyses consider, without loss of generality, a single source position

s = {[1.22, 0.97, 0.97]}^{T}

, generating a stream of DoAs according to model III (Section 5.1), and are averaged across 100 DoA stream realizations.

6.1. Convergence Speed

To evaluate the convergence speed of the proposed method, Figure 8 shows the average “instantaneous” MSD for each combination policy. The results show that combination strategies generally slow down convergence compared to scenarios with uniform weights (no combination policy). In particular, the cost policy significantly reduces convergence rates. This could be explained by the similarity between the cost policy and the adaptive relative variance policy, which is known for its slow convergence [44]. We also observe that among the combination policies, the distance-based policy achieves the fastest convergence rate, closely matching the rate observed when no combination policy is applied. This makes the distance-based policy preferable in scenarios where faster convergence is a priority.

Figure 8. Average “instantaneous” MSD for each combination policy across 100 realizations of DoA streams, obtained using model III and considering a sound source at

s = {[1.22, 0.97, 0.97]}^{T}

.

We also analyze the evolution of the localization accuracy at each time-step by presenting the “instantaneous” network MAE in Figure 9, defined as

{\hat{MAE}}_{avg} (i) = \frac{1}{K} \sum_{k = 1}^{K} ∥ s_{k, i} - s^{GT} ∥

, to complement Figure 8. These results indicate that while some combination policies achieve faster convergence, they have minimal impact on overall localization accuracy. In particular, the distance-based policy, though slightly less accurate in the long run, provides a substantial improvement in convergence speed with only a minor trade-off in accuracy.

Figure 9. Average MAE for each combination policy across 100 realizations of DoA streams, obtained using model III and considering a sound source in

s = {[1.22, 0.97, 0.97]}^{T}

.

6.2. Steady-State Stability

We now examine how the step size

μ_{k}

influences the steady-state stability of the proposed ATC diffusion SSL. In the literature on diffusion networks, fixed step sizes

μ_{k}

are typically used [30,39,44,45]. Although they slow down convergence rates compared to decaying step sizes, they allow for adaptation to drifts in the data collected by the sensor nodes and are therefore often preferred [39]. Moreover, when both the local and global costs are convex, it is known that the step size can control the stability of the network solution at steady state. Recent studies have also confirmed this finding in non-convex environments [45], such as the ones commonly encountered in acoustic SSL, including the proposed approach.

Figure 10 shows the average “instantaneous” MSD for three different values of

μ

, namely, 0.1, 0.25, and 0.5. The results show that reducing the step size reduces the oscillations around the limit point at steady state, but at the cost of a slower convergence rate. This observation applies to all combination strategies; for brevity, we therefore show results only for the distance combination policy, although similar behavior can be observed for the other strategies as well.
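The step-size trade-off can be reproduced on a toy problem. The following Python sketch is deliberately not the SSL cost of this paper: it assumes a scalar quadratic cost at each node (estimating a constant from noisy observations), a fully connected network, and uniform combination weights, and all names are hypothetical. It only illustrates how an adapt-then-combine recursion with a larger fixed step size ends up with larger steady-state fluctuations.

```python
import numpy as np


def toy_atc_diffusion(mu, n_iter=4000, K=8, noise_std=1.0, seed=0):
    """Run a toy ATC diffusion recursion and return its steady-state MSD.

    Each node k observes d_{k,i} = target + noise and minimizes
    0.5 * (d_{k,i} - s)^2 with a stochastic-gradient step of size mu.
    """
    rng = np.random.default_rng(seed)
    target = 1.0
    W = np.full((K, K), 1.0 / K)      # uniform weights, fully connected
    s = np.zeros(K)
    history = np.empty((n_iter, K))
    for i in range(n_iter):
        d = target + noise_std * rng.normal(size=K)
        psi = s + mu * (d - s)        # adapt step (local gradient update)
        s = W.T @ psi                 # combine step (weighted neighbor average)
        history[i] = s
    # Use the second half of the run as the steady-state window.
    tail = history[n_iter // 2:]
    # Mean squared deviation around the empirical limit point.
    return np.mean((tail - tail.mean()) ** 2)
```

Calling `toy_atc_diffusion(0.05)` and `toy_atc_diffusion(0.5)` shows a clearly smaller steady-state deviation for the smaller step size, at the price of a longer transient.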

Figure 10. Average “instantaneous” MSD for different values of

μ

, across 100 realizations of DoA streams, obtained using model III and considering a sound source in

s = {[1.22, 0.97, 0.97]}^{T}

. The results show the MSD performance if the distance combination policy is used.

These results highlight the trade-off between convergence speed and steady-state stability across different combination policies and step-size choices. Although formal convergence guarantees in this highly non-convex scenario remain challenging, extensive tests with varying source positions, room dimensions, and DoA stream statistics have shown that the proposed ATC diffusion SSL algorithm consistently converges to a stable solution. Notably, no instances of divergence were observed.

7. Impact of Network Connectivity on Localization Accuracy

We now investigate how the localization performance is influenced by reduced network connectivity. Specifically, we modify the network by reducing the number of neighbors that each node has based on their Euclidean distance. In other words, if the distance between nodes

(k, l)

exceeds a certain threshold, both combination weights

w_{k, l}

and

w_{l, k}

are set to zero. Starting from the same array network configuration as in Section 5.2, we can find three distinct network topologies by gradually decreasing this distance threshold before the network splits into two separate subnetworks. These topologies are depicted in Figure 11, where self-loops have been omitted to avoid clutter. Their degree of connectivity is quantified by the well-known algebraic connectivity, which is defined as the second smallest eigenvalue of the Laplacian of the combination matrix

W

[46]. To assess the overall localization accuracy across different network connectivities, we computed the average MAE as described in the previous section, further averaging the error over every point in a

23 \times 23 \times 3

grid of possible source locations. For this experiment, we applied the cost penalty factor. Figure 12 shows the distribution of MAE values for each network connectivity. The results show that the localization accuracy is not significantly affected by sparser topologies, highlighting the effectiveness of the proposed method. The fully connected network topology exhibits the highest accuracy and therefore serves as the performance benchmark.
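The distance-based pruning of links and the resulting algebraic connectivity can be sketched as follows. As a simplifying assumption, the sketch builds the Laplacian from an unweighted adjacency matrix (a link exists whenever two nodes are within the threshold) rather than from the combination matrix itself; the function name is hypothetical.

```python
import numpy as np


def algebraic_connectivity(positions, max_dist):
    """Second smallest Laplacian eigenvalue of a distance-thresholded graph.

    positions: array of shape (K, 3) with the node reference points.
    max_dist:  links between nodes farther apart than this are removed.
    """
    positions = np.asarray(positions, dtype=float)
    # Pairwise Euclidean distances between all nodes, shape (K, K).
    dist = np.linalg.norm(positions[:, None, :] - positions[None, :, :], axis=-1)
    A = (dist <= max_dist).astype(float)
    np.fill_diagonal(A, 0.0)                 # self-loops do not enter the Laplacian
    L = np.diag(A.sum(axis=1)) - A           # graph Laplacian
    eigvals = np.linalg.eigvalsh(L)          # ascending order, L is symmetric
    return eigvals[1]
```

A value of zero indicates that the network has split into disconnected subnetworks, which is the point at which the threshold can no longer be decreased.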

Figure 11. Different network topologies and their relative algebraic connectivity used to study performance as a function of connectivity.

Figure 12. Distribution of

{\hat{MAE}}_{avg}

for different network connectivities.

8. Conclusions and Future Work

In this work, we reformulated the general problem of sound source localization for a network of acoustic arrays as a distributed optimization problem, where the arrays measure streams of acoustic parameters and cooperate to localize a sound source. In particular, we proposed ATC diffusion as a technique for localizing acoustic sources through cooperation between microphone arrays. We also discussed how localization performance can be improved by using different weighting schemes for communication between arrays, which we call combination policies.

As an example, we presented an ATC diffusion-based SSL method that enables 3D localization of a single sound source using 2D direction-of-arrival (DoA) measurements obtained from spatially distributed linear microphone arrays. This approach extends the work in [38]. Ad hoc combination policies were developed to improve localization accuracy, and all demonstrated superior performance compared to uniform combination policies, where communication links between agents are uniformly weighted. These results hold true for both statistical DoA stream models and simulated acoustic environments. We also showed that stability and convergence properties of the proposed approach can be controlled by the step-size parameters.

Future work will address more complex scenarios with multiple sound sources and different cost functions based on other sound propagation models and acoustic parameters, e.g., TDOAs, acoustic energy, and sound intensity.

Author Contributions

Conceptualization and methodology, D.A. and A.B.; investigation and software, D.A. and G.G.; writing—original draft preparation, D.A.; writing—review and editing, A.B. and G.G.; supervision, A.S. All authors have read and agreed to the published version of the manuscript.

Funding

The authors would like to thank ST Microelectronics for their generous support in funding this research.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Acknowledgments

The authors would like to acknowledge the Multilayered Urban Sustainability Action (MUSA) project, which is funded by the European Union, for their contributions to this work.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Cobos, M.; Antonacci, F.; Alexandridis, A.; Mouchtaris, A.; Lee, B. A Survey of Sound Source Localization Methods in Wireless Acoustic Sensor Networks. Wirel. Commun. Mob. Comput. 2017, 3956282. [Google Scholar]
Cobos, M.; Antonacci, F.; Mouchtaris, A.; Lee, B. Wireless Acoustic Sensor Networks and Applications. Wirel. Commun. Mob. Comput. 2017, 2017, 1085290. [Google Scholar] [CrossRef]
Bertrand, A. Applications and trends in wireless acoustic sensor networks: A signal processing perspective. In Proceedings of the 18th IEEE Symposium on Communications and Vehicular Technology in the Benelux (SCVT), Ghent, Belgium, 22–23 November 2011; Volume 18, p. 6. [Google Scholar]
Turchet, L.; Fazekas, G.; Lagrange, M.; Ghadikolaei, H.S.; Fischione, C. The Internet of Audio Things: State of the Art, Vision, and Challenges. IEEE Internet Things J. 2020, 7, 10233–10249. [Google Scholar] [CrossRef]
Bertrand, A.; Moonen, M. Distributed Node-Specific LCMV Beamforming in Wireless Sensor Networks. IEEE Trans. Signal Process. 2012, 60, 233–246. [Google Scholar] [CrossRef]
Markovich-Golan, S.; Gannot, S.; Cohen, I. Distributed Multiple Constraints Generalized Sidelobe Canceler for Fully Connected Wireless Acoustic Sensor Networks. IEEE/ACM Trans. Audio Speech Lang. Process. 2013, 21, 343–356. [Google Scholar] [CrossRef]
Haddad, D.; Lima, M.; Martins, W.; Biscainho, L.; Nunes, L.; Lee, B. Acoustic Sensor Self-Localization: Models and Recent Results. Wirel. Commun. Mob. Comput. 2017, 2017, 7972146. [Google Scholar] [CrossRef]
Liaquat, M.; Munawar, H.S.; Rahman, A.; Qadir, Z.; Kouzani, A.; Mahmud, M.A. Localization of Sound Sources: A Systematic Review. Energies 2021, 14, 3910. [Google Scholar] [CrossRef]
Furnon, N.; Serizel, R.; Essid, S.; Illina, I. Attention-based distributed speech enhancement for unconstrained microphone arrays with varying number of nodes. In Proceedings of the 29th European Signal Processing Conference (EUSIPCO), Dublin, Ireland, 23–27 August 2021; pp. 1095–1099. [Google Scholar] [CrossRef]
Brandstein, M.; Adcock, J.; Silverman, H. A closed-form location estimator for use with room environment microphone arrays. IEEE/ACM Trans. Audio Speech Lang. Process. 1997, 5, 45–50. [Google Scholar] [CrossRef]
Compagnoni, M.; Bestagini, P.; Antonacci, F.; Sarti, A.; Tubaro, S. Localization of Acoustic Sources Through the Fitting of Propagation Cones Using Multiple Independent Arrays. IEEE Trans. Audio Speech Lang. Process. 2012, 20, 1964–1975. [Google Scholar] [CrossRef]
Canclini, A.; Antonacci, F.; Sarti, A.; Tubaro, S. Acoustic Source Localization With Distributed Asynchronous Microphone Networks. IEEE Trans. Audio Speech Lang. Process. 2013, 21, 439–443. [Google Scholar] [CrossRef]
Astapov, S.; Preden, J.S.; Berdnikova, J. Simplified acoustic localization by linear arrays for wireless sensor networks. In Proceedings of the 2013 18th International Conference on Digital Signal Processing (DSP), Santorini, Greece, 1–3 July 2013; pp. 1–6. [Google Scholar] [CrossRef]
Alexandridis, A.; Mouchtaris, A. Multiple Sound Source Location Estimation in Wireless Acoustic Sensor Networks Using DOA Estimates: The Data-Association Problem. IEEE/ACM Trans. Audio Speech Lang. Process. 2018, 26, 342–356. [Google Scholar] [CrossRef]
Evers, C.; Löllmann, H.W.; Mellmann, H.; Schmidt, A.; Barfuss, H.; Naylor, P.A.; Kellermann, W. The LOCATA Challenge: Acoustic Source Localization and Tracking. IEEE/ACM Trans. Audio Speech Lang. Process. 2020, 28, 1620–1643. [Google Scholar] [CrossRef]
Yang, J.; Zhong, X.; Chen, W.; Wang, W. Multiple Acoustic Source Localization in Microphone Array Networks. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 334–347. [Google Scholar] [CrossRef]
Çakmak, B.; Dietzen, T.; Ali, R.; Naylor, P.; Waterschoot, T.V. A Distributed Steered Response Power Approach to Source Localization in Wireless Acoustic Sensor Networks. In Proceedings of the 2022 International Workshop on Acoustic Signal Enhancement (IWAENC), Bamberg, Germany, 5–8 September 2022; pp. 1–5. [Google Scholar] [CrossRef]
Wang, H.; Chen, C.E.; Ali, A.; Asgari, S.; Hudson, R.; Yao, K.; Estrin, D.; Taylor, C. Acoustic sensor networks for woodpecker localization. In Advanced Signal Processing Algorithms, Architectures, and Implementations XV; Society of Photo Optical: Bellingham, WA, USA, 2005; Volume 5910, p. 591009. [Google Scholar] [CrossRef]
Wang, C.; Griebel, S.; Brandstein, M. Robust automatic video-conferencing with multiple cameras and microphones. In Proceedings of the 2000 IEEE International Conference on Multimedia and Expo. ICME2000. Proceedings. Latest Advances in the Fast Changing World of Multimedia (Cat. No.00TH8532), New York, NY, USA, 30 July–2 August 2000; Volume 3, pp. 1585–1588. [Google Scholar] [CrossRef]
Canclini, A.; Bestagini, P.; Antonacci, F.; Compagnoni, M.; Sarti, A.; Tubaro, S. A Robust and Low-Complexity Source Localization Algorithm for Asynchronous Distributed Microphone Networks. IEEE/ACM Trans. Audio Speech Lang. Process. 2015, 23, 1563–1575. [Google Scholar] [CrossRef]
Blatt, D.; Hero, A. Energy-based sensor network source localization via projection onto convex sets. IEEE Trans. Signal Process. 2006, 54, 3614–3619. [Google Scholar] [CrossRef]
Meng, W.; Xiao, W. Energy-Based Acoustic Source Localization Methods: A Survey. Sensors 2017, 17, 376. [Google Scholar] [CrossRef] [PubMed]
Brendel, A.; Laufer-Goldshtein, B.; Gannot, S.; Kellermann, W. Learning-Based Acoustic Source Localization Using Directional Spectra. In Proceedings of the 2019 IEEE 8th International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP), Le Gosier, Guadeloupe, 15–18 December 2019; pp. 276–280. [Google Scholar] [CrossRef]
Canclini, A.; Antonacci, F.; Sarti, A.; Tubaro, S. Distributed 3D Source Localization from 2D DOA Measurements Using Multiple Linear Arrays. Wirel. Commun. Mob. Comput. 2017, 2017, 1049141. [Google Scholar] [CrossRef]
Kaplan, L.; Le, Q.; Molnar, N. Maximum likelihood methods for bearings-only target localization. In Proceedings of the 2001 IEEE International Conference on Acoustics, Speech, and Signal Proceedings (ICASSP), Salt Lake City, UT, USA, 7–11 May 2001; Volume 5, pp. 3001–3004. [Google Scholar] [CrossRef]
Bishop, A.N.; Anderson, B.D.O.; Fidan, B.; Pathirana, P.N.; Mao, G. Bearing-Only Localization using Geometrically Constrained Optimization. IEEE Trans. Aerosp. Electron. Syst. 2009, 45, 308–320. [Google Scholar] [CrossRef]
Wang, Z.; Luo, J.; Zhang, X. A Novel Location-Penalized Maximum Likelihood Estimator for Bearing-Only Target Localization. IEEE Trans. Signal Process. 2012, 60, 6166–6181. [Google Scholar] [CrossRef]
Griffin, A.; Alexandridis, A.; Pavlidi, D.; Mastorakis, Y.; Mouchtaris, A. Localizing multiple audio sources in a wireless acoustic sensor network. Signal Process. 2015, 107, 54–67. [Google Scholar] [CrossRef]
Choi, J.; Zotter, F.; Jo, B.; Yoo, J. Multiarray Eigenbeam-ESPRIT for 3D Sound Source Localization With Multiple Spherical Microphone Arrays. IEEE/ACM Trans. Audio Speech Lang. Process. 2022, 30, 2310–2325.
Sayed, A.H. Adaptation, Learning, and Optimization over Networks. Found. Trends Mach. Learn. 2014, 7, 311–801.
Bertrand, A.; Moonen, M. Distributed Adaptive Node-Specific Signal Estimation in Fully Connected Sensor Networks—Part I: Sequential Node Updating. IEEE Trans. Signal Process. 2010, 58, 5277–5291.
Hassani, A.; Bertrand, A.; Moonen, M. Distributed GEVD-based signal subspace estimation in a fully-connected wireless sensor network. In Proceedings of the 2014 22nd European Signal Processing Conference (EUSIPCO), Lisbon, Portugal, 1–5 September 2014; pp. 1292–1296.
Guo, X.; Yuan, M.; Ke, Y.; Zheng, C.; Li, X. Distributed node-specific block-diagonal LCMV beamforming in wireless acoustic sensor networks. Signal Process. 2021, 185, 108085.
Ferrer, M.; de Diego, M.; Piñero, G.; Gonzalez, A. Active noise control over adaptive distributed networks. Signal Process. 2015, 107, 82–95.
Ruiz, S.; van Waterschoot, T.; Moonen, M. Distributed Combined Acoustic Echo Cancellation and Noise Reduction in Wireless Acoustic Sensor and Actuator Networks. IEEE/ACM Trans. Audio Speech Lang. Process. 2022, 30, 534–547.
Dorfan, Y.; Plinge, A.; Hazan, G.; Gannot, S. Distributed Expectation-Maximization Algorithm for Speaker Localization in Reverberant Environments. IEEE/ACM Trans. Audio Speech Lang. Process. 2018, 26, 682–695.
Grinstein, E.; Brookes, M.; Naylor, P.A. Graph Neural Networks for Sound Source Localization on Distributed Microphone Networks. In Proceedings of the 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5.
Albertini, D.; Greco, G.; Bernardini, A.; Sarti, A. Diffusion-Based Sound Source Localization Using Networks of Planar Microphone Arrays. In Proceedings of the 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5.
Sayed, A.H. Adaptive Networks. Proc. IEEE 2014, 102, 460–497.
Chen, J.; Sayed, A.H. Distributed Pareto-optimal solutions via diffusion adaptation. In Proceedings of the IEEE Statistical Signal Processing Workshop (SSP), Ann Arbor, MI, USA, 5–8 August 2012; pp. 648–651.
Horn, R.; Johnson, C. Matrix Analysis; Cambridge University Press: Cambridge, UK, 2012.
Allen, J.; Berkley, D. Image method for efficiently simulating small-room acoustics. J. Acoust. Soc. Am. 1979, 65, 943–950.
Van Trees, H.L. Parameter Estimation II. In Optimum Array Processing; John Wiley & Sons, Ltd.: Hoboken, NJ, USA, 2002; Chapter 9; pp. 1139–1317.
Yu, C.; Sayed, A.H. A Strategy for Adjusting Combination Weights Over Adaptive Networks. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vancouver, BC, Canada, 26–31 May 2013; p. 5.
Vlaski, S.; Sayed, A.H. Distributed Learning in Non-Convex Environments—Part I: Agreement at a Linear Rate. IEEE Trans. Signal Process. 2021, 69, 1242–1256.
Fiedler, M. Algebraic connectivity of graphs. Czechoslov. Math. J. 1973, 23, 298–305.

Figure 1. A projected microphone array centered at $m_{k}'$, with orientation versor $v_{k}' = [\cos α_{k}, \sin α_{k}, 0]^{T}$, measures the DoA $θ_{k}$ of a sound source located at $s$ in the far field.

Figure 2. Example of network topology with $K$ microphone arrays. The green area highlights the neighborhood $N_{k}$ of agent $k$.

Figure 3. Schematic of the ATC diffusion-based SSL for node k in an example topology. Each node updates its local estimate (adapt step) and then merges intermediate estimates from neighbors (combine step) using an adaptive combination policy.

Figure 4. Average MAE, $\widehat{\mathrm{MAE}}_{\mathrm{avg}}$, for the DoA stream models from Section 5.1. Each subfigure shows results on a $23 \times 23$ grid in the $xy$-plane at $z = 0.45$ (top) and $z = 1.5$ (bottom), with $N_{trials} = 50$ realizations per DoA stream model. Columns (from left to right) compare no combination strategy, distance combination strategy (20), error combination strategy (18), and cost combination strategy (19). (a) $\widehat{\mathrm{MAE}}_{\mathrm{avg}}$ for DoA stream Model I with $σ = 10^{\circ}$. (b) $\widehat{\mathrm{MAE}}_{\mathrm{avg}}$ for DoA stream Model II with $σ_{k} \sim L(1.9^{\circ}, 0.66^{\circ})$. (c) $\widehat{\mathrm{MAE}}_{\mathrm{avg}}$ for DoA stream Model III with $σ_{k} \sim L(1.9^{\circ}, 0.6^{\circ})$ and $ν_{k} \sim G(θ_{k}^{GT}, 0.8)$.

Figure 5. Simulated acoustic environment and localization results. The actual source position (depicted as a star) is at $s = [0.45, 0.83, 1.50]^{T}$ m, and microphone array centers are shown as filled circles. The trajectory of estimated source locations, obtained using our diffusion-based SSL algorithm, illustrates the localization process.

Figure 6. $\widehat{\mathrm{MAE}}_{\mathrm{avg}}$ at each candidate source position in a $23 \times 23$ grid on the $xy$-plane for $z = 0.45$ (top) and $z = 1.5$ (bottom). Each column represents a different combination strategy, with the exception of the last column, which refers to the technique of Canclini et al. [24].

Figure 7. Average localization accuracy $ν_{\widehat{\mathrm{MAE}}_{\mathrm{avg}}}$ across the entire simulated acoustic environment (as described in Section 5.1), comparing the proposed method with various combination policies and the baseline method of Canclini et al. [24]. On the left, $ν_{\widehat{\mathrm{MAE}}_{\mathrm{avg}}}$ is obtained with T60 ranging from 0 to 1 s and a fixed SNR of 20 dB, while on the right, $ν_{\widehat{\mathrm{MAE}}_{\mathrm{avg}}}$ is obtained with a prescribed SNR ranging from 10 to 35 dB and a fixed T60 of 0.4 s.

Figure 8. Average “instantaneous” MSD for each combination policy across 100 realizations of DoA streams, obtained using Model III and considering a sound source located at $s = [1.22, 0.97, 0.97]^{T}$.

Figure 9. Average MAE for each combination policy across 100 realizations of DoA streams, obtained using Model III and considering a sound source located at $s = [1.22, 0.97, 0.97]^{T}$.

Figure 10. Average “instantaneous” MSD for different values of $μ$ across 100 realizations of DoA streams, obtained using Model III and considering a sound source located at $s = [1.22, 0.97, 0.97]^{T}$. The results refer to the case in which the distance combination policy is used.

Figure 11. Different network topologies and their corresponding algebraic connectivity, used to study performance as a function of connectivity.
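The algebraic connectivity referred to in Figure 11 is, following Fiedler, the second-smallest eigenvalue of the graph Laplacian. As a minimal sketch, the snippet below computes it for two illustrative eight-node topologies. The ring and fully connected graphs, and the helper names `algebraic_connectivity`, `ring_adjacency`, and `complete_adjacency`, are assumptions chosen for illustration and are not necessarily the topologies shown in the figure.

```python
import numpy as np


def algebraic_connectivity(adjacency):
    """Second-smallest eigenvalue of the graph Laplacian L = D - A."""
    A = np.asarray(adjacency, dtype=float)
    laplacian = np.diag(A.sum(axis=1)) - A
    # eigvalsh returns the eigenvalues of a symmetric matrix in ascending order
    eigenvalues = np.linalg.eigvalsh(laplacian)
    return eigenvalues[1]


def ring_adjacency(K):
    """Adjacency matrix of an undirected ring with K nodes (hypothetical topology)."""
    A = np.zeros((K, K))
    for k in range(K):
        A[k, (k + 1) % K] = 1.0
        A[(k + 1) % K, k] = 1.0
    return A


def complete_adjacency(K):
    """Adjacency matrix of a fully connected graph with K nodes (hypothetical topology)."""
    return np.ones((K, K)) - np.eye(K)


K = 8  # same number of nodes as the arrays listed in Table 1
lambda_ring = algebraic_connectivity(ring_adjacency(K))
lambda_full = algebraic_connectivity(complete_adjacency(K))
print(f"ring: {lambda_ring:.4f}, fully connected: {lambda_full:.4f}")
```

A sparsely connected ring yields a much smaller value than the fully connected graph, which is why this quantity is a convenient scalar summary of how well connected a topology is.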

Figure 12. Distribution of $\widehat{\mathrm{MAE}}_{\mathrm{avg}}$ for different network connectivities.

Table 1. Reference points and orientations of microphone arrays used in Section 5.

ID	$m_{k}$ [m]	$(α_{k}, β_{k})$
1	${[0.25, 0.25, 0.25]}^{T}$	(−135°, 45°)
2	${[3.75, 0.25, 0.25]}^{T}$	(45°, 45°)
3	${[3.75, 2.75, 0.25]}^{T}$	(45°, 0°)
4	${[0.25, 2.75, 0.25]}^{T}$	(225°, 0°)
5	${[0.25, 0.25, 2.75]}^{T}$	(−45°, 0°)
6	${[3.75, 0.25, 2.75]}^{T}$	(135°, 45°)
7	${[3.75, 2.75, 2.75]}^{T}$	(135°, 0°)
8	${[0.25, 2.75, 2.75]}^{T}$	(45°, 45°)
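As a rough illustration of how the orientations in Table 1 relate to the projected versor of Figure 1, the sketch below evaluates $[\cos α_{k}, \sin α_{k}, 0]^{T}$ for each array. It assumes that $α_{k}$ in Table 1 is the same angle that appears in the Figure 1 expression. $β_{k}$ is not used, and the names `ALPHA_DEG` and `projected_versor` are hypothetical.

```python
import math

# Angles alpha_k in degrees, copied from Table 1 (array ID -> alpha_k).
ALPHA_DEG = {1: -135.0, 2: 45.0, 3: 45.0, 4: 225.0,
             5: -45.0, 6: 135.0, 7: 135.0, 8: 45.0}


def projected_versor(alpha_deg):
    """Unit vector [cos(alpha), sin(alpha), 0] for an angle given in degrees."""
    a = math.radians(alpha_deg)
    return (math.cos(a), math.sin(a), 0.0)


versors = {k: projected_versor(a) for k, a in ALPHA_DEG.items()}
for k, v in versors.items():
    print(k, tuple(round(c, 4) for c in v))
```

Under this assumption, the entries −135° (array 1) and 225° (array 4) describe the same projected direction, since the two angles differ by a full turn.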


© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
