1. Introduction
Sound propagates at a relatively low speed compared to electromagnetic waves, allowing measurable time differences in arrival at spatially separated microphones. These time differences provide critical information for estimating the angle of arrival (AoA) of an acoustic source. Conventional sound source localization (SSL) systems utilize multiple microphones, each receiving propagated signals independently. Beamforming techniques are commonly employed in such systems [1], where the received signals are aligned in time—typically by applying delays—and then summed to reinforce signals arriving from a particular direction. The direction that maximizes the beamformed output corresponds to the estimated AoA. This approach relies on scanning a range of angles and selecting the one with the highest output energy or correlation, forming the basis of delay-and-sum beamforming or more advanced variants such as Capon [2,3] and MUSIC [4] algorithms.
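To make the scanning idea concrete, the following is a minimal MATLAB sketch of delay-and-sum scanning for a uniform linear array with integer-sample delays; the geometry, variable names, and angle grid are illustrative assumptions and not the configuration used later in this paper.

```matlab
% Minimal delay-and-sum AoA scan for a uniform linear array (illustrative sketch only).
% x: [numSamples x numMics] synchronized recordings, fs: sampling rate (Hz),
% d: inter-microphone spacing (m), c: speed of sound (m/s).
function thetaHat = dasScan(x, fs, d, c)
    [numSamples, numMics] = size(x);
    thetaGrid = linspace(-pi/2, pi/2, 181);              % candidate arrival angles
    energy = zeros(size(thetaGrid));
    for k = 1:numel(thetaGrid)
        tau = (0:numMics-1) * d * sin(thetaGrid(k)) / c; % per-microphone delays (s)
        y = zeros(numSamples, 1);
        for m = 1:numMics
            shift = round(tau(m) * fs);                  % integer-sample delay
            y = y + circshift(x(:, m), -shift);          % align and sum
        end
        energy(k) = sum(y.^2);                           % beamformed output power
    end
    [~, idx] = max(energy);                              % steered direction with maximum energy
    thetaHat = thetaGrid(idx);
end
```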
Despite their effectiveness, conventional SSL methods depend on independent signal channels for each microphone, typically realized using dedicated analog-to-digital converters (ADCs). This configuration is necessary for estimating the time difference of arrival (TDoA)—the time gap between signal arrivals at spatially separated sensors, which is fundamental to determining the direction of the sound source. As the number of microphones increases to achieve finer localization resolution, the system must incorporate a proportional number of ADCs, along with additional circuitry and communication interfaces to handle the resulting data streams. This leads to significant increases in system complexity, power consumption, and cost [5,6,7]. These challenges become especially pronounced in sonar systems or spatially distributed receiver arrays, where synchronization, wiring, and distributed processing further complicate implementation [8]. In such scenarios, reducing the number of ADCs while maintaining accurate AoA estimation is a non-trivial but highly desirable goal.
An alternative approach to reducing the number of signal channels is inspired by biological auditory systems, particularly monaural and binaural sound localization mechanisms [9,10,11,12,13]. In these systems, structural features surrounding the receivers—such as the human pinnae, head, and torso—induce reflections and diffractions that shape the incoming sound waves in direction-dependent ways. These shape-induced modifications result in spectral cues that can be processed to infer the AoA. Various algorithms have been developed to exploit these cues for localization using only one or two microphones. However, the localization performance of such bio-inspired methods is often constrained by the physical limitations of the structural components, such as their size, geometry, and placement [14]. While the human auditory system demonstrates impressive localization capability using binaural cues, replicating this performance in artificial systems remains challenging, particularly under dynamic or reverberant conditions.
To overcome the limitations of conventional and bio-inspired SSL systems, we propose a novel architecture that reduces the number of ADCs while preserving reliable directional information. Rather than relying on structural diffraction and reflection as in monaural or binaural systems, the proposed method utilizes the clearer time delay information from multiple spatially separated microphones. These signals are summed through a simple analog adder into a single waveform, requiring only one ADC for digitization. If the inter-microphone delays can be estimated from the composite signal, AoA estimation comparable to that of conventional multi-channel systems can be achieved. Although the microphone configuration is not physically compact, the system remains straightforward to implement due to the simplicity of the analog addition and the associated processing algorithm.
To derive the inter-microphone time delays from a single-channel signal, conventional methods such as time-domain cross-correlation or generalized cross-correlation are often employed, but these approaches generally exhibit limited resolution and reduced robustness under noisy or reverberant conditions [15,16,17]. To address these limitations, this study adopts homomorphic deconvolution (HD), a well-established method that transforms convolution into an additive operation in the cepstral domain, allowing clearer identification of delay components [18,19]. Building on our previous work [20,21,22], we utilize parametric HD techniques based on classical spectral estimation algorithms including Yule–Walker [23], Prony [24,25], and Steiglitz–McBride [26,27]. These methods estimate propagation model coefficients that encode the delay structure between microphones. In this study, parametric HD is applied to extract TDoA features from a single-channel signal generated by combining microphone inputs via analog addition. Although the HD process yields coefficients, further processing is required to obtain interpretable TDoA features—specifically pole locations and magnitudes—which are then used as inputs to a multilayer perceptron (MLP) for final AoA estimation.
The MLP [28] is employed to perform regression from extracted features to estimate the AoA. While the non-parametric HD distribution contains rich information, it generally requires a large and complex MLP to handle its high dimensionality. In contrast, a compact representation of features allows for a smaller and more efficient architecture. This is conceptually similar to convolutional neural networks (CNNs) [29,30], where early layers perform effective feature extraction, reducing the load on subsequent layers. The structure of the MLP, particularly how neurons are connected, influences the quality of the regression output. This study explores how such structural differences affect localization accuracy. The use of an MLP is especially appropriate in this context due to the nonlinear nature of the time delay distribution with respect to AoA, which is not easily modeled using conventional methods. The interplay between feature representation and MLP structure is central to the overall performance of the proposed SSL system.
Figure 1 illustrates the system configuration, in which sound signals from a moving object are received by three spatially separated microphones, combined through an analog adder, and processed via homomorphic deconvolution, feature extraction, and a neural network to estimate the AoA.
In summary, this work introduces a compact and scalable SSL framework that minimizes hardware complexity while maintaining high prediction accuracy. By combining signals from multiple microphones into a single channel and leveraging model-based parametric methods, the system extracts concise and informative features that reduce the complexity of the subsequent MLP regression while preserving estimation performance. The proposed approach achieves reliable AoA estimation without the need for multiple synchronized ADCs and integrates simulation and real-world data to enhance generalization across diverse conditions. This framework not only advances single-channel localization techniques but also provides a flexible platform applicable to a broad range of multi-sensor signal analysis problems.
This paper is organized as follows. Section 2 reviews related works in sound source localization, highlighting key methodologies and contributions relevant to this study. Section 3 introduces the parametric HD method and explains how time delays are extracted and structured as input features. Section 4 describes the proposed SSL system, including the receiver configuration, feature extraction strategy, and MLP structures. Section 5 presents simulation results evaluating localization performance across a range of parameters and models, followed by an analysis of prediction accuracy for various directions. Section 6 provides experimental validation using data collected in an anechoic chamber. Finally, Section 7 discusses the most effective models and parameter settings and offers concluding remarks.
2. Related Works
Several review articles have provided comprehensive overviews of SSL technologies, highlighting the evolution of classical methods, learning-based models, and bio-inspired strategies [31,32,33]. These works offer a broad foundation for understanding the trends and challenges in SSL research. Recent studies have explored a variety of approaches to sound source localization based on time-delay estimation and learning-based frameworks.
Table 1 summarizes key features of 16 related works, focusing on feature extraction methods, learning models, and specific contributions relevant to this study.
This paper builds upon the author’s long-standing research in SSL. Earlier works introduced underwater SSL using beamforming for rapid, scalable acoustic tracking, but with substantial multi-channel synchronization and computational burdens. To overcome these limitations, subsequent studies proposed compact binaural and monaural SSL systems tailored to airborne sound propagation, emphasizing feasibility in constrained environments [50,51,52,53]. However, the performance of structure-dependent monaural/binaural methods can degrade under complex reflection and diffraction from surrounding structures and is sensitive to geometric variations. To address these issues while retaining a simple signal flow, a single-channel SSL paradigm aggregates multiple sensors via an analog adder and estimates inter-sensor delays from the composite signal. HD is introduced to estimate these time delays, and model-based (parametric) HD algorithms provide the fundamental basis of the single-channel SSL system [20]. More recently, single-channel SSL approaches were realized, demonstrating promising performance with machine-learning models such as linear regression and Gaussian process regression [21,22]. The present work advances this line of research by incorporating neural networks and a novel feature-extraction process via parametric HD. All acoustic experiments have been conducted using a consistent testbed within the same anechoic chamber [54] to ensure comparability across results.
3. Parametric Homomorphic Deconvolution
This section summarizes the HD algorithm as applied in this study, with reference to our previous works [20,21,22]. HD is a form of homomorphic signal processing that separates convolved signals by transforming the multiplication in the frequency domain into addition in the cepstral domain, followed by signal separation and inverse transformation [18,19]. The method is particularly useful in acoustic signal processing, where propagation effects often manifest as convolution with unknown impulse responses. By exploiting the duality between convolution and multiplication, HD enables effective isolation of time-delay features embedded in single-channel or multi-path environments. This approach enhances interpretability and allows more compact feature representation, especially valuable when used in conjunction with machine learning models.
As shown in Figure 2, the HD system consists of two cascaded homomorphic processing stages—forward and backward conversions. The input signal, formed by convolution of a source and a propagation function, is first transformed using the fast Fourier transform (FFT). The logarithm of the magnitude of the FFT output is taken to convert multiplicative relationships into additive ones. An ensemble average is applied to suppress noise, leveraging the inverse proportionality between ensemble length and noise variance. The result is then inverse transformed back to the time domain using the inverse FFT, yielding the real cepstrum.
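A minimal sketch of this forward conversion is given below, assuming the summed microphone signal has already been segmented into an N-by-L matrix xFrames with one ensemble frame per column; the variable names are illustrative.

```matlab
% Forward homomorphic conversion: FFT -> log|.| -> ensemble average -> inverse FFT.
% xFrames: [N x L] matrix, one column per frame of the summed (single-channel) signal.
N = size(xFrames, 1);                      % FFT length
X = fft(xFrames, N, 1);                    % frequency domain, per frame
logMag = log(abs(X) + eps);                % multiplicative -> additive (eps avoids log(0))
logMagAvg = mean(logMag, 2);               % ensemble average suppresses noise variance
realCepstrum = real(ifft(logMagAvg));      % back to the quefrency (time) domain: real cepstrum
```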
A frequency-invariant window function is then applied to isolate the time-delay related component. This operation selectively passes the region of interest in the cepstral domain, typically beyond a predefined time threshold, thus emphasizing delayed signal components. The backward conversion begins by applying the exponential and conjugation operations, followed by another FFT to generate the propagation function. The conjugation operation is applied here to utilize the symmetry properties of the discrete Fourier transform (DFT), enabling the inverse DFT computation using the DFT through complex conjugation [55]. In the non-parametric HD, this output provides a time-delay distribution. The peaks of the resulting sequence indicate the TDoA.
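The following sketch shows one conventional realization of the windowing and backward conversion; the quefrency threshold nCut and the exact ordering of intermediate steps are assumptions, and the conjugation trick shown computes the inverse DFT with a forward FFT.

```matlab
% Cepstral windowing and backward conversion (non-parametric HD), one possible realization.
nCut = 8;                                    % assumed quefrency threshold (samples)
w = [zeros(nCut, 1); ones(N - nCut, 1)];     % frequency-invariant window: keep delayed region
cWin = realCepstrum .* w;                    % isolate the time-delay-related cepstral component
logSpec = fft(cWin);                         % back to the log-spectral domain
spec = exp(logSpec);                         % undo the logarithm
prop = conj(fft(conj(spec))) / N;            % inverse DFT via conjugation + forward FFT
delayDist = abs(prop);                       % peaks of this sequence indicate the TDoA
```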
To improve robustness and resolution, the parametric HD replaces the final stage with a model-based approach. Instead of directly performing an inverse FFT, the system fits a rational transfer function of the form:

$$H(z) = \frac{B(z)}{A(z)} = \frac{\sum_{m=0}^{q} b_m z^{-m}}{1 + \sum_{m=1}^{p} a_m z^{-m}} \tag{1}$$
This structure is estimated using classical spectral estimation techniques, namely Yule–Walker, Prony, and Steiglitz–McBride algorithms. Yule–Walker models the system as autoregressive (AR), estimating only the denominator coefficients. Prony and Steiglitz–McBride implement autoregressive moving average (ARMA) modeling, estimating both numerator and denominator coefficients.
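As a sketch of this estimation step, the corresponding MATLAB Signal Processing Toolbox routines can be applied to the complex sequence entering the final stage (here reusing spec from the earlier sketch); the orders, and the assumption that these routines are called directly on that sequence, are illustrative, and the released implementation [59,60,61] should be consulted for the exact interface.

```matlab
% Model-based final stage: fit the rational transfer function of Equation (1)
% to the complex sequence produced just before the final inverse FFT.
p = 3;                                  % denominator (AR) order
q = 3;                                  % numerator (MA) order, used by the ARMA estimators

aYW        = aryule(spec, p);           % Yule-Walker: AR model, denominator only
bYW        = 1;                         % AR model has a trivial numerator
[bPR, aPR] = prony(spec, q, p);         % Prony: ARMA, numerator and denominator
[bSM, aSM] = stmcb(spec, q, p);         % Steiglitz-McBride: iterative ARMA refinement
```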
The use of complex input in these algorithms allows for flexible pole-zero placement, enhancing accuracy in estimating the TDoA. The estimated impulse response $\hat{h}[n]$ from $H(z)$ is evaluated by computing the following:

$$\hat{h}[n] = H(z)\big|_{z = e^{j 2\pi n / N}}, \qquad n = 0, 1, \ldots, N-1 \tag{2}$$

Here, $N$ is the FFT length, and $n$ denotes the discrete index. The peaks of $|\hat{h}[n]|$ correspond to time delays between microphones. Statistical performance of HD depends on the signal-to-noise ratio (SNR), ensemble average length, parametric method, and model order. Among parametric methods, Steiglitz–McBride generally shows lower bias and variance, while Yule–Walker and Prony are more sensitive to noise and shorter ensemble lengths. For a more detailed explanation of the parametric HD algorithms, readers are encouraged to consult the author’s earlier publications [20,21,22].
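Continuing the sketch, Equation (2) and the peak search can be realized with freqz and findpeaks; bSM and aSM are the Steiglitz–McBride coefficients from the previous sketch, and the choice of peak detector is an assumption.

```matlab
% Evaluate the fitted model on the unit circle over N points (Equation (2)) and locate
% the peaks, which play the role of the non-parametric time-delay distribution.
H = freqz(bSM, aSM, N, 'whole');        % \hat{h}[n] = H(e^{j*2*pi*n/N}), n = 0..N-1
[pk, loc] = findpeaks(abs(H));          % peak locations approximate the TDoA (in samples)
```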
4. Methodology
In the previous section, the HD process was discussed as a method to extract time delay information from the received signal. Building upon that foundation, this section presents the overall methodology for the proposed SSL system. Specifically, the discussion is divided into three subsections: receiver configuration, feature extraction, and multilayer perceptron structures. The receiver configuration subsection addresses the ambiguity and considerations involved in achieving an optimal sensor arrangement. The feature extraction subsection describes the structure and organization of the features derived from the parametric HD output. Finally, the multilayer perceptron structures subsection introduces three potential configurations designed to estimate the AoA through regression, leveraging the extracted features.
4.1. Receiver Configuration
In previous work [22], the optimal receiver configuration for three microphones placed on a planar circle with a 32 cm radius was determined using a brute-force approach. The goal was to maximize the diversity of the time delay distributions across different arrival angles, effectively minimizing the similarity between time delay profiles for different configurations. Due to the symmetric properties inherent to the time delay distribution relative to the AoA, multiple optimal configurations were identified. The receiver configuration used in this study is shown in Figure 3. Solid dots represent microphones, and arrows indicate the distance and direction vectors (DDVs) between the microphones.
Based on the DDVs and the incoming signal direction, the time delay distribution for each angle can be computed, as shown in Figure 4a. In the figure, black dots represent the location of the time delay in samples, assuming a sampling frequency of 48 kHz, at each given incoming angle. The three DDVs provide independent delay patterns with a periodicity of π radians. Due to this periodicity, two pairs of angles produce identical time delay patterns, specifically at 1.0151, 1.4352, 1.9777, and 2.3890 radians. Color-coded lines highlight these angle pairs that result in similar time delay characteristics. It is important to note that unoptimized receiver configurations generally exhibit a higher number of such ambiguities.
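A small sketch of this computation for a far-field plane wave is shown below, assuming the DDVs are stored as rows of a matrix ddv in meters and using an assumed speed of sound; the function name and variables are illustrative.

```matlab
% Far-field time delays (in samples) implied by the DDVs for a given arrival angle theta.
% ddv: [3 x 2] distance-and-direction vectors in meters (one per row), theta in radians.
function delaySamples = ddvDelays(ddv, theta)
    fs = 48e3;                          % sampling rate (Hz)
    c  = 343;                           % speed of sound (m/s), assumed
    u  = [cos(theta); sin(theta)];      % unit vector of the incoming direction
    tau = ddv * u / c;                  % inter-microphone delays (seconds)
    delaySamples = round(fs * tau);     % quantized delay locations, cf. Figure 4a
end
```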
These angular identities can be further visualized using matrix operations. From Figure 4a, a binary matrix $\mathbf{D}$ is constructed, where each column represents a specific incoming angle and each row corresponds to a time sample, with ‘1’ marking the location of a time delay and ‘0’ elsewhere. By computing the Gram matrix [56] below, a square similarity matrix is generated, as shown in Figure 4b.

$$\mathbf{G} = \mathbf{D}^{\mathsf{T}} \mathbf{D} \tag{3}$$
The diagonal elements of $\mathbf{G}$ represent the inner products of each angle with itself, achieving a maximum value of 3 due to the three DDVs. Off-diagonal elements reflect the similarity between different angles, where larger values indicate identical or highly similar time delay patterns. The specific locations of maximum off-diagonal values are highlighted in Figure 4b using red lines and arrows. The sum of the elements in the $\mathbf{G}$ matrix provides an overall measure of similarity for the given receiver configuration; lower total values indicate greater diversity in the time delay distributions. This method was previously used to identify optimal configurations in the author’s earlier work [22]. These ambiguities suggest that the regression process based on MLP architectures may encounter certain confusion in distinguishing these angles.
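A compact sketch of this similarity analysis follows, assuming the binary matrix D has been built as described; the variable names are illustrative.

```matlab
% Similarity analysis of a candidate receiver configuration via the Gram matrix (Eq. (3)).
% D: binary matrix, rows index time samples, columns index incoming angles,
% with ones at the delay locations predicted for each angle (cf. Figure 4a).
G = D.' * D;                       % G(i,j): number of shared delay samples between angles i and j
selfSim = diag(G);                 % equals 3 for three DDVs
offDiag = G - diag(selfSim);       % large off-diagonal entries flag ambiguous angle pairs
score   = sum(G(:));               % lower total -> more diverse time delay distributions
```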
4.2. Feature Extraction
The parametric HD algorithm produces output in the form of model coefficients, typically for the denominator and potentially for the numerator as well. These coefficients must be properly organized and managed to serve as input features for the MLP tasked with estimating the arrival angle. To efficiently reduce the size of the feature set, it is desirable to avoid full evaluation of the parametric HD output based on the complete response. Instead, this study leverages spectral estimation theory, which indicates that the angles of the poles in the parametric HD model carry significant information about the peak locations in the signal structure. Therefore, we visualize the pole angles and their corresponding magnitudes to investigate the feasibility of using them as regression features in the following steps.
The designed feature of the parametric HD output, denoted as $f_k$, is a complex number whose angle represents the pole angle and whose magnitude reflects the evaluated magnitude at that angle, as in the equation below.

$$f_k = \left| H\!\left(e^{j\theta_k}\right) \right| e^{j\theta_k}, \qquad \theta_k = \angle z_k, \quad 1 + \sum_{m=1}^{p} a_m z_k^{-m} = 0 \tag{4}$$

Note that the polynomial equation in Equation (4) is the denominator of Equation (1) in the parametric HD model.
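A sketch of this feature construction from the fitted coefficients follows (reusing bSM and aSM from the earlier sketch); evaluating the model magnitude at the pole angles with freqz is an assumption consistent with Equation (4).

```matlab
% Pole-angle / magnitude features from the fitted parametric HD model (cf. Equation (4)).
zPoles = roots(aSM);                        % poles = roots of the denominator polynomial
thetaK = angle(zPoles);                     % pole angles (feature phase)
magK   = abs(freqz(bSM, aSM, thetaK));      % model magnitude evaluated at the pole angles
fK     = magK .* exp(1j * thetaK);          % complex-valued features f_k
mlpIn  = [thetaK; magK];                    % 2*order real-valued inputs fed to the MLP
```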
Figure 5 illustrates the distribution of these features for individual arrival angles, with color indicating the incoming angle. The x-axis shows the sum of pole angles (in radians), while the y-axis represents the logarithm of the evaluated absolute magnitude sum. Two cases are displayed: Figure 5a for an order 2 model and Figure 5b for an order 10 model. The feature distributions reveal a certain degree of clustering by incoming angle, suggesting that an MLP can potentially distinguish between arrival angles based on these extracted features.
Figure 6 illustrates how the pole locations and corresponding magnitudes represent the time delay characteristics. Figure 6a shows the nonparametric HD distribution for an incoming angle of 0 radians, where three dominant peaks correspond to the time delays between microphones. In contrast, the Yule–Walker (AR model) and Prony (ARMA model) show smooth evaluated distributions, with red lines marking pole locations and magnitudes. These methods tend to spread the dense population of peaks and may place some poles at irrelevant locations. The Steiglitz–McBride (ARMA model) provides a much more accurate reconstruction, with peak locations and magnitudes closely matching the nonparametric HD distribution. With a properly chosen model order (here, order 3), the Steiglitz–McBride method demonstrates the best agreement with true time delays, offering the most promising feature extraction performance. Higher-order parametric HD models also show comparable results when the Yule–Walker and Prony methods are used.
4.3. Multilayer Perceptron Structures
In designing the artificial neural network for SSL, achieving simplicity and computational efficiency is prioritized over adopting excessively complex architectures. Studies have shown that MLPs can achieve competitive performance in sound localization tasks, particularly when paired with structured and informative feature extraction methods such as spectral features or time-delay related descriptors [57]. Although deep learning models like CNNs and recurrent neural networks (RNNs) have demonstrated strong performance for spatial and temporal modeling, their added complexity often outweighs their benefits when the input features are already designed to capture essential information. Thus, a simpler MLP architecture is considered a practical and interpretable choice for this application.
To further enhance model efficiency and learning capacity, a branched MLP structure is adopted in this work. By deploying multiple smaller independent MLPs in parallel, each specialized in processing distinct subsets of the extracted features, the system can effectively capture diverse acoustic characteristics without requiring deep or wide monolithic networks [58]. After independent feature processing, the outputs from the parallel branches are merged and processed through an additional fully connected MLP stage, enabling efficient integration of specialized information. This branched design reduces both layer and neuron counts compared to conventional architectures while maintaining or even improving learning performance, making it highly suitable for real-time and resource-constrained SSL applications.
The feature set provided to the MLP consists of pole locations and their corresponding absolute magnitudes. As a result, the number of features is twice the order of the parametric HD model. Based on these features, three different MLP structures are considered in this study. The first is the fully connected structure, where all extracted features are directly fed into a conventional fully connected MLP. The second is the phase–magnitude pair structure, in which each pole angle and corresponding magnitude pair are processed by a small independent MLP with two inputs; the outputs from these units are then merged and fed into a fully connected MLP for regression. The third structure is the phase–magnitude group structure, where all pole angles and all magnitudes are separately processed by independent MLP branches before merging into a final fully connected MLP. These alternative structures aim to explore different strategies for organizing feature information to optimize localization accuracy. Illustrations of these structures are shown in Figure 7.
Table 2 describes the key parameters for each structure. The feature input layers are identical across all configurations. Since each pole represents both an angle and a magnitude, the number of input features is twice the order of the parametric HD model. The primary differences among the structures lie in their preprocessing layers. The fully connected structure employs a single parallel MLP with 64 neurons in the hidden layer. The phase–magnitude pair structure uses multiple parallel MLPs, each with two inputs (angle and magnitude), and each branch has four neurons in its hidden layer, matching the HD model order. The phase–magnitude group structure separates all angles and magnitudes into two distinct MLP branches, each with 32 neurons in the hidden layer. After preprocessing, all structures feed into a shared processing MLP consisting of three sequential layers with 32, 64, and 32 neurons, respectively. The network concludes with a single output neuron for the regression task. The parameters of the processing layers were selected through extensive simulation experiments to optimize localization performance.
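As an illustrative sketch only, the phase–magnitude group structure of Table 2 could be assembled with the Deep Learning Toolbox as shown below; the layer names, the ReLU activations, and the two-input training workflow (e.g., combined datastores or a dlnetwork) are assumptions rather than the authors' exact implementation.

```matlab
% Phase-magnitude group structure as a layer graph (Deep Learning Toolbox sketch).
% Pole angles and magnitudes are preprocessed by separate branches, then merged.
order = 3;                                        % parametric HD model order (example)

lgraph = layerGraph();
lgraph = addLayers(lgraph, [
    featureInputLayer(order, 'Name', 'angles')    % pole-angle group input
    fullyConnectedLayer(32, 'Name', 'fcA')
    reluLayer('Name', 'reluA')]);
lgraph = addLayers(lgraph, [
    featureInputLayer(order, 'Name', 'mags')      % magnitude group input
    fullyConnectedLayer(32, 'Name', 'fcM')
    reluLayer('Name', 'reluM')]);
lgraph = addLayers(lgraph, [
    concatenationLayer(1, 2, 'Name', 'merge')     % merge the two branches
    fullyConnectedLayer(32, 'Name', 'fc1')        % shared processing layers: 32-64-32
    reluLayer('Name', 'relu1')
    fullyConnectedLayer(64, 'Name', 'fc2')
    reluLayer('Name', 'relu2')
    fullyConnectedLayer(32, 'Name', 'fc3')
    reluLayer('Name', 'relu3')
    fullyConnectedLayer(1, 'Name', 'out')         % single neuron for AoA regression
    regressionLayer('Name', 'mse')]);
lgraph = connectLayers(lgraph, 'reluA', 'merge/in1');
lgraph = connectLayers(lgraph, 'reluM', 'merge/in2');
```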
5. Simulations
In the previous section, the proposed SSL algorithm was implemented by sequentially combining parametric HD, feature extraction, and MLP structures. This section presents the realization and evaluation of the SSL system using computer-generated simulation data. System performance is analyzed under various conditions, including different MLP structures and parametric HD model orders, to determine the optimal configuration for real-world applications. To facilitate reproducibility, an open-source MATLAB (Version 2024a) implementation of the model-based parametric HD used in this study has been released; a repository link is provided [59,60,61]. In addition, a detailed acoustic propagation model was developed to accurately replicate the free-field conditions of the anechoic chamber [54], ensuring that the simulated environment closely matches the actual experimental setup. The simulation data generation method is consistent with the author’s previous research [20,21,22], providing continuity and reliability for comparative analysis.
The simulation parameters are summarized in Table 3. In the simulation, three microphones are positioned according to the configuration shown in Figure 3 and receive independent acoustic signals sampled at 48 kHz. These signals are combined into a single data frame for processing by the parametric HD algorithm with an ensemble length of 200. The acoustic source is a wideband white noise signal, band-limited by a linear-phase low-pass filter with a cutoff frequency of approximately 12 kHz. Incoming angles are uniformly divided between 0 and 3 radians into 1000 discrete steps. A total of 100,000 data frames are randomly generated and partitioned into training, validation, and test sets without overlap. Each network training process is performed for up to 1000 iterations to ensure sufficient convergence.
The MLPs are trained as supervised regressors by minimizing the mean squared error (MSE) between the predicted angle $\hat{\theta}$ and the ground truth $\theta$:

$$\mathcal{L}(\mathbf{w}) = \frac{1}{M} \sum_{i=1}^{M} \left( \hat{\theta}_i(\mathbf{w}) - \theta_i \right)^2$$

where $\mathbf{w}$ denotes the network parameters and $M$ is the number of training frames. While this loss function is non-convex in $\mathbf{w}$, it is smooth and provides well-behaved gradients for regression; model selection and reporting use the root-mean-square error (RMSE) on a held-out validation set.
Optimization employs the limited-memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS) quasi-Newton method [62] in a full-batch regime. L-BFGS approximates the inverse Hessian from a small history of gradient and parameter differences, enabling faster and more stable convergence than first-order methods such as stochastic gradient descent for the smooth squared-error loss function. A strong Wolfe line search enforces sufficient decrease and curvature conditions, reducing sensitivity to step-size tuning and yielding precise weight updates [63]. Training proceeds for up to 1000 iterations with validation assessed periodically (every 100 iterations), and it terminates upon meeting standard tolerance criteria on gradient norm and relative objective reduction.
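Since the branched structures require the custom graph sketched earlier, the following shows only how the fully connected baseline could be trained with fitrnet (Statistics and Machine Learning Toolbox), which uses an L-BFGS solver; the variable names, layer sizes, and tolerance values are illustrative assumptions rather than the authors' exact settings.

```matlab
% L-BFGS training of the fully connected baseline with periodic validation (illustrative).
% XTrain: [numFrames x 2*order] features, YTrain: AoA targets in radians.
mdl = fitrnet(XTrain, YTrain, ...
    'LayerSizes',          [64 32 64 32], ...   % preprocessing + shared processing layers
    'Activations',         'relu', ...
    'IterationLimit',      1000, ...
    'GradientTolerance',   1e-6, ...
    'ValidationData',      {XVal, YVal}, ...
    'ValidationFrequency', 100, ...
    'Standardize',         true);

rmseTest = sqrt(mean((predict(mdl, XTest) - YTest).^2));   % RMSE on the held-out test set
```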
Table 4 presents the RMSE values on the test dataset across different parametric HD methods, MLP structures, and model orders. The RMSE values are computed based on 1000 training iterations, or fewer if early stopping is triggered due to convergence or overfitting. However, in all simulations performed, the training continued through all 1000 iterations without meeting the stop conditions. In the table, light grey cells indicate the minimum RMSE for each row, while dark grey cells denote the global minimum within each parametric HD method. Results show that the Yule–Walker and Prony methods achieved their lowest RMSE values at the maximum tested model order of 10, whereas the Steiglitz–McBride method reached its optimal performance at order 3. Among the MLP structures, the phase–magnitude group structure demonstrated the best overall localization performance.
The variation in optimal model order across parametric HD methods can be better understood by referring to the feature extraction results shown in Figure 6. The Yule–Walker and Prony methods tend to produce smoothed distributions, where the estimated pole locations and magnitudes do not fully capture the distinct time delay characteristics. In contrast, the Steiglitz–McBride method yields sharp, dominant peaks with highly accurate pole placements that align closely with true time delays. Since the number of microphones is three, a model order of three for the Steiglitz–McBride method is sufficient to represent the required time delay information effectively. Meanwhile, the relatively less precise representations of the Yule–Walker and Prony methods benefit from higher model orders, where additional poles help compensate for their limitations in expressing the time delay structure.
This paper employs low-to-moderate model orders for parametric HD. Although a higher order can refine the fit, the tendency within the examined range already indicates the behavior at larger orders: the gain tapers while complexity and numerical sensitivity increase. All parametric variants require solving linear systems and evaluating higher-order filters; therefore, raising the order elevates runtime and may amplify noise effects. The model order also determines the feature dimension provided to the MLP, which enlarges the network size, parameter count, and latency. In line with the low-complexity objective, compact orders are reported as the most suitable accuracy–complexity trade-off, and the Steiglitz–McBride formulation yielded sharp and stable delay estimates at comparatively small order.
In terms of MLP structure, the phase–magnitude group structure exhibits the best performance among the three evaluated configurations. Both the fully connected structure and the phase–magnitude pair structure demonstrate similar performance levels but fall short of the phase–magnitude group structure. The branched MLP structures intentionally restrict certain neuron connections to guide more structured information flow, allowing the network to better extract meaningful representations for the regression task. In particular, separating the phase and magnitude inputs into distinct branches, as done in the phase–magnitude group structure, enhances information extraction while maintaining efficiency through limited layers and neurons. It is important to note that all MLP structures were designed to have similar total numbers of layers and neurons for a fair comparison. Although the phase–magnitude pair structure also introduces clustering by processing each pole individually, this granularity appears to be too fine for effective information consolidation. Meanwhile, the fully connected structure could potentially benefit from increased depth and width due to its densely interconnected architecture, which is better suited for handling larger and more complex feature representations.
Figure 8 presents the best prediction outcomes from the simulation for each parametric HD method. The left column of the figure illustrates the distribution of prediction means along with shaded regions representing the ±RMSE range around the target values. The right column displays linear regression plots with corresponding $R$ values. The $R$ value, also known as the Pearson correlation coefficient, indicates the strength and direction of the linear relationship between the predicted and actual values. An $R$ value close to 1 signifies high prediction accuracy. All prediction plots reveal four noticeable performance glitches, which correspond to the time delay ambiguities identified in Figure 4. The outer pair of glitches demonstrates a broader spread in regression outputs, resulting from the greater angular distance associated with their ambiguity. In contrast, the inner pair shows narrower distributions. These types of discrepancies are inherent to systems using only three microphones, where limited spatial diversity restricts the system’s ability to fully resolve time delay ambiguities.
The Yule–Walker and Prony methods show similar prediction distributions in Figure 8. The minor differences in RMSE between the two are reflected in the $R$ values and linear regression trends. In the left column figures, the solid dark red lines represent the mean of the predicted distributions, while the green lines in the right column indicate the estimated linear regression lines. The Steiglitz–McBride method demonstrates significantly improved performance, producing narrower and more accurate prediction distributions. Its regression line closely aligns with the ground truth. Compared to the other methods, the Steiglitz–McBride method is not immune to performance glitches, but it confines these issues to a very narrow range of affected points. The resulting prediction distributions around the ambiguous angles are sharper and more focused, reinforcing the method’s robustness despite the inherent limitations of using three microphones.
6. Results
The acoustic experiments for validating the proposed SSL system were conducted in an anechoic chamber that conforms to ISO 3745 [64] guidelines. Specifically, the chamber meets the hemi-free-field requirements for the 1 kHz–16 kHz 1/3 octave band and satisfies the free-field conditions for the 250 Hz–16 kHz 1/3 octave band [54]. This study adopts the free-field mode, which requires complete acoustic insulation using wedge-shaped absorbers. The experimental setup includes three microphones placed at predetermined positions (Figure 3) and a single speaker arranged to ensure far-field propagation. The directional angle between the speaker and the microphone array is precisely guided using a line laser mounted above the speaker. A minimum distance of one meter is maintained between the speaker and the receivers to preserve the far-field condition.
The microphones are installed on a rigid frame constructed from lumber and plastic supports to maintain the array’s shape and height during angular rotation. To ensure horizontal propagation, both the speaker and receiver array are positioned at the same vertical level. Angular adjustments are made using a pair of engraved saw-toothed wheels, which allow rotation in discrete π⁄18 (10°) increments. As shown in Figure 9, the microphone array follows a radial layout on concentric circles, centered around the rotational axis. Signals from three condenser microphones (C-2, Behringer, Tortola, British Virgin Islands) are merged using an analog mixer (MX-1020, MPA Tech, Seoul, Republic of Korea) to form a single-channel signal. This signal is digitized via an audio interface (Quad-Capture, Roland, Hamamatsu, Japan) and processed using the proposed SSL algorithm. The speaker (HS80M, Yamaha, Hamamatsu, Japan) is also connected to the same interface and produces the wideband acoustic signals. Real-time data acquisition and playback are managed in MATLAB (Version 2024a) using audio stream input/output compatible system objects. For each angle, a 5 min recording is conducted; the first and last second are excluded to eliminate transitional noise, leaving approximately 5 min of usable data per direction. The experimental parameters are summarized in Table 5.
Although the experimental data provide realistic environmental measurements, their angular coverage is limited due to discrete sampling in π⁄18 increments. Consequently, a neural network trained exclusively on such sparse experimental data may not generalize well to unseen angles. To address this, a hybrid learning strategy is employed that combines synthetic (simulation-based) data with real-world (experimentally collected) data. In machine learning terms, synthetic data offer broad and continuous angular coverage, while real-world data reflect true sensor characteristics and acoustic propagation effects. For this study, 70,000 training frames are generated from simulation, and 3490 frames are collected experimentally. These two datasets are integrated to train the neural network, enabling robust prediction performance over the entire angular range.
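A minimal sketch of assembling the hybrid training set is given below, assuming the synthetic and experimental feature matrices and angle targets are already available under the illustrative names shown.

```matlab
% Hybrid training set: concatenate synthetic and experimentally collected frames.
% XSyn/YSyn: simulated features/angles, XReal/YReal: anechoic-chamber measurements.
XTrain = [XSyn; XReal];                 % 70,000 + 3490 feature rows
YTrain = [YSyn; YReal];
idx    = randperm(numel(YTrain));       % shuffle the combined set before training
XTrain = XTrain(idx, :);
YTrain = YTrain(idx);
```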
Figure 10 illustrates the prediction results on the test set for three learning strategies. The validation and test datasets are disjoint from the training data to ensure objective evaluation. The left column shows networks trained only on synthetic data, the middle column presents results from hybrid training (synthetic + real-world), and the right column shows networks trained solely on real-world data. For each configuration, prediction accuracy is evaluated using real-world data only, with the $R$ value and linear regression lines used to assess fit quality.
Networks trained only on synthetic data perform poorly on real-world inputs, while those trained only on experimental data fail to generalize to unseen (simulated) conditions. This highlights the domain bias induced by each training strategy. In contrast, hybrid learning demonstrates balanced prediction performance across data domains and angular ranges. Among the parametric HD methods, Yule–Walker and Prony produce similar results under hybrid training, with Yule–Walker showing slightly better accuracy. The Steiglitz–McBride method clearly outperforms the others, yielding the highest $R$ values and most consistent regression trends across angles. Despite its superior performance, localized degradation is still observed near ambiguous angles where identical time delay distributions lead to convergence issues—a known limitation for three-microphone arrays.
Table 6 summarizes the RMSE on the experimental test set under the three learning strategies. Each row corresponds to a specific parametric HD method using its optimal model order determined in the simulation. The Steiglitz–McBride method consistently yields the lowest RMSE, achieving a minimum of 0.0279 when trained solely on experimental data due to its focused exposure to limited conditions. The Yule–Walker method also demonstrates competitive performance, particularly under the hybrid learning strategy, with RMSE values approaching those of the Steiglitz–McBride method. Prony, on the other hand, shows slightly higher RMSE in all cases, indicating comparatively lower generalization capability. However, Figure 10 shows that Steiglitz–McBride’s specialized training does not generalize well to broader angle scenarios. The hybrid training approach offers more consistent performance across all angular regions. These results confirm the strength of the Steiglitz–McBride method and highlight the value of integrating real-world data to enhance both accuracy and generalization in neural network-based SSL systems.
For reference, the performance of the proposed neural network–based regression method was compared with results from our previous Gaussian process regression (GPR) framework, which employed the same experimental dataset. Under the real-world only training condition for seen angles, the present method achieved RMSEs of 0.0731, 0.1701, and 0.0279 for Yule–Walker, Prony, and Steiglitz–McBride models, respectively, compared to 0.1631, 0.1679, and 0.0132 in the GPR-based approach. For the synthetic-only condition representing unseen angles, the proposed method obtained RMSEs of 0.4317, 1.2378, and 0.3922, which outperform the GPR-based results of 0.7047, 0.9727, and 0.3503 for Yule–Walker, Prony, and Steiglitz–McBride, respectively. While differences in training strategy and model architecture influence the detailed trends, these results demonstrate that the neural network regression framework can achieve competitive or superior accuracy, particularly in real-world conditions, while offering improved scalability and flexibility compared to the GPR approach.
7. Conclusions
This paper presents a novel method for localizing the angle of arrival based on parametric homomorphic deconvolution with neural network regression. The proposed sound source localization system constructs a single-channel input by summing the signals from three spatially distributed microphones through an analog adder. The forward and backward homomorphic systems operate in cascade to perform deconvolution, estimating the propagation function that captures the time delay between receivers. The parametric homomorphic deconvolution approach utilizes spectral estimation techniques—Yule–Walker, Prony, and Steiglitz–McBride—to represent the time-delay distribution in model-based form. From this representation, pole location and magnitude information are extracted to form the input features for regression. A series of multilayer perceptrons are explored to estimate the angle of arrival, including fully connected, phase–magnitude pair, and phase–magnitude group structures. Among these, the branched structure based on phase–magnitude grouping exhibits the best performance with compact architecture. The optimal receiver configuration was determined in advance using a time-delay similarity matrix, and three-microphone configurations were employed for both simulation and experimental validation. The Steiglitz–McBride method demonstrated consistent advantages, providing sharp and accurate predictions with reduced model order. Meanwhile, the Yule–Walker and Prony methods also showed improved performance with increased model order, with Yule–Walker performing slightly better than Prony in most cases. Simulations confirmed the reliability of the proposed system, and experiments in an anechoic chamber yielded precise predictions when proper parameters were applied.
This study expands upon the prior work involving Gaussian process regression by adopting neural network-based regression to improve scalability and learning flexibility. Compared to handcrafted kernels, neural networks can autonomously model nonlinear regression mappings, which enhances accuracy across a wide range of trained and untrained angles. A hybrid learning strategy, combining synthetic and real-world datasets, further improves generalization and robustness of the system. The predictive results demonstrate the system’s suitability for low-complexity deployment, as the analog summation and single analog-to-digital converter structure minimizes hardware overhead. The proposed system maintains a fixed computational requirement regardless of receiver count, unlike conventional localization systems that scale with sensor number. Despite minor prediction discrepancies around ambiguous angles due to time-delay symmetry, the system delivers stable and accurate performance. The proposed framework extends to reverberant spaces, mobile platforms, and sensor-fault-tolerant operation. Beyond sound source localization, the deconvolution–regression pipeline applies to multi-sensor phase analysis and other sequential signal tasks. Future work will profile computational costs for embedded use, focusing on feature extraction, while maintaining a shallow, low-overhead machine learning model with compact model orders.