1. Introduction
Sound propagates at a relatively low speed compared to electromagnetic waves, allowing measurable time differences in arrival at spatially separated microphones. These time differences provide critical information for estimating the angle of arrival (AoA) of an acoustic source. Conventional sound source localization (SSL) systems utilize multiple microphones, each receiving propagated signals independently. Beamforming techniques are commonly employed in such systems [1], where the received signals are aligned in time—typically by applying delays—and then summed to reinforce signals arriving from a particular direction. The direction that maximizes the beamformed output corresponds to the estimated AoA. This approach relies on scanning a range of angles and selecting the one with the highest output energy or correlation, forming the basis of delay-and-sum beamforming or more advanced variants such as Capon [2,3] and MUSIC [4] algorithms.
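To make the scanning idea concrete, the following is a minimal MATLAB sketch of delay-and-sum scanning for a uniform linear array with integer-sample delays; the geometry, variable names, and angle grid are illustrative assumptions and not the configuration used later in this paper.

```matlab
% Minimal delay-and-sum AoA scan for a uniform linear array (illustrative sketch only).
% x: [numSamples x numMics] synchronized recordings, fs: sampling rate (Hz),
% d: inter-microphone spacing (m), c: speed of sound (m/s).
function thetaHat = dasScan(x, fs, d, c)
    [numSamples, numMics] = size(x);
    thetaGrid = linspace(-pi/2, pi/2, 181);              % candidate arrival angles
    energy = zeros(size(thetaGrid));
    for k = 1:numel(thetaGrid)
        tau = (0:numMics-1) * d * sin(thetaGrid(k)) / c; % per-microphone delays (s)
        y = zeros(numSamples, 1);
        for m = 1:numMics
            shift = round(tau(m) * fs);                  % integer-sample delay
            y = y + circshift(x(:, m), -shift);          % align and sum
        end
        energy(k) = sum(y.^2);                           % beamformed output power
    end
    [~, idx] = max(energy);                              % steered direction with maximum energy
    thetaHat = thetaGrid(idx);
end
```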
Despite their effectiveness, conventional SSL methods depend on independent signal channels for each microphone, typically realized using dedicated analog-to-digital converters (ADCs). This configuration is necessary for estimating the time difference of arrival (TDoA)—the time gap between signal arrivals at spatially separated sensors, which is fundamental to determining the direction of the sound source. As the number of microphones increases to achieve finer localization resolution, the system must incorporate a proportional number of ADCs, along with additional circuitry and communication interfaces to handle the resulting data streams. This leads to significant increases in system complexity, power consumption, and cost [5,6,7]. These challenges become especially pronounced in sonar systems or spatially distributed receiver arrays, where synchronization, wiring, and distributed processing further complicate implementation [8]. In such scenarios, reducing the number of ADCs while maintaining accurate AoA estimation is a non-trivial but highly desirable goal.
An alternative approach to reducing the number of signal channels is inspired by biological auditory systems, particularly monaural and binaural sound localization mechanisms [9,10,11,12,13]. In these systems, structural features surrounding the receivers—such as the human pinnae, head, and torso—induce reflections and diffractions that shape the incoming sound waves in direction-dependent ways. These shape-induced modifications result in spectral cues that can be processed to infer the AoA. Various algorithms have been developed to exploit these cues for localization using only one or two microphones. However, the localization performance of such bio-inspired methods is often constrained by the physical limitations of the structural components, such as their size, geometry, and placement [14]. While the human auditory system demonstrates impressive localization capability using binaural cues, replicating this performance in artificial systems remains challenging, particularly under dynamic or reverberant conditions.
To overcome the limitations of conventional and bio-inspired SSL systems, we propose a novel architecture that reduces the number of ADCs while preserving reliable directional information. Rather than relying on structural diffraction and reflection as in monaural or binaural systems, the proposed method utilizes the clearer time delay information from multiple spatially separated microphones. These signals are summed through a simple analog adder into a single waveform, requiring only one ADC for digitization. If the inter-microphone delays can be estimated from the composite signal, AoA estimation comparable to that of conventional multi-channel systems can be achieved. Although the microphone configuration is not physically compact, the system remains straightforward to implement due to the simplicity of the analog addition and the associated processing algorithm.
To derive the inter-microphone time delays from a single-channel signal, conventional methods such as time-domain cross-correlation or generalized cross-correlation are often employed, but these approaches generally exhibit limited resolution and reduced robustness under noisy or reverberant conditions [15,16,17]. To address these limitations, this study adopts homomorphic deconvolution (HD), a well-established method that transforms convolution into an additive operation in the cepstral domain, allowing clearer identification of delay components [18,19]. Building on our previous work [20,21,22], we utilize parametric HD techniques based on classical spectral estimation algorithms including Yule–Walker [23], Prony [24,25], and Steiglitz–McBride [26,27]. These methods estimate propagation model coefficients that encode the delay structure between microphones. In this study, parametric HD is applied to extract TDoA features from a single-channel signal generated by combining microphone inputs via analog addition. Although the HD process yields coefficients, further processing is required to obtain interpretable TDoA features—specifically pole locations and magnitudes—which are then used as inputs to a multilayer perceptron (MLP) for final AoA estimation.
The MLP [28] is employed to perform regression from extracted features to estimate the AoA. While the non-parametric HD distribution contains rich information, it generally requires a large and complex MLP to handle its high dimensionality. In contrast, a compact representation of features allows for a smaller and more efficient architecture. This is conceptually similar to convolutional neural networks (CNNs) [29,30], where early layers perform effective feature extraction, reducing the load on subsequent layers. The structure of the MLP, particularly how neurons are connected, influences the quality of the regression output. This study explores how such structural differences affect localization accuracy. The use of an MLP is especially appropriate in this context due to the nonlinear nature of the time delay distribution with respect to AoA, which is not easily modeled using conventional methods. The interplay between feature representation and MLP structure is central to the overall performance of the proposed SSL system.
Figure 1 illustrates the system configuration, in which sound signals from a moving object are received by three spatially separated microphones, combined through an analog adder, and processed via homomorphic deconvolution, feature extraction, and a neural network to estimate the AoA.
In summary, this work introduces a compact and scalable SSL framework that minimizes hardware complexity while maintaining high prediction accuracy. By combining signals from multiple microphones into a single channel and leveraging model-based parametric methods, the system extracts concise and informative features that reduce the complexity of the subsequent MLP regression while preserving estimation performance. The proposed approach achieves reliable AoA estimation without the need for multiple synchronized ADCs and integrates simulation and real-world data to enhance generalization across diverse conditions. This framework not only advances single-channel localization techniques but also provides a flexible platform applicable to a broad range of multi-sensor signal analysis problems.
This paper is organized as follows. Section 2 reviews related works in sound source localization, highlighting key methodologies and contributions relevant to this study. Section 3 introduces the parametric HD method and explains how time delays are extracted and structured as input features. Section 4 describes the proposed SSL system, including the receiver configuration, feature extraction strategy, and MLP structures. Section 5 presents simulation results evaluating localization performance across a range of parameters and models, followed by an analysis of prediction accuracy for various directions. Section 6 provides experimental validation using data collected in an anechoic chamber. Finally, Section 7 discusses the most effective models and parameter settings and offers concluding remarks.
2. Related Works
Several review articles have provided comprehensive overviews of SSL technologies, highlighting the evolution of classical methods, learning-based models, and bio-inspired strategies [31,32,33]. These works offer a broad foundation for understanding the trends and challenges in SSL research. Recent studies have explored a variety of approaches to sound source localization based on time-delay estimation and learning-based frameworks.
Table 1 summarizes key features of 16 related works, focusing on feature extraction methods, learning models, and specific contributions relevant to this study.
This paper builds upon the author’s long-standing research in SSL. Earlier works introduced underwater SSL using beamforming for rapid, scalable acoustic tracking, but with substantial multi-channel synchronization and computational burdens. To overcome these limitations, subsequent studies proposed compact binaural and monaural SSL systems tailored to airborne sound propagation, emphasizing feasibility in constrained environments [50,51,52,53]. However, the performance of structure-dependent monaural/binaural methods can degrade under complex reflection and diffraction from surrounding structures and is sensitive to geometric variations. To address these issues while retaining a simple signal flow, a single-channel SSL paradigm aggregates multiple sensors via an analog adder and estimates inter-sensor delays from the composite signal. HD is introduced to estimate these time delays, and model-based (parametric) HD algorithms provide the fundamental basis of the single-channel SSL system [20]. More recently, single-channel SSL approaches were realized, demonstrating promising performance with machine-learning models such as linear regression and Gaussian process regression [21,22]. The present work advances this line of research by incorporating neural networks and a novel feature-extraction process via parametric HD. All acoustic experiments have been conducted using a consistent testbed within the same anechoic chamber [54] to ensure comparability across results.
3. Parametric Homomorphic Deconvolution
This section summarizes the HD algorithm as applied in this study, with reference to our previous works [20,21,22]. HD is a form of homomorphic signal processing that separates convolved signals by transforming the multiplication in the frequency domain into addition in the cepstral domain, followed by signal separation and inverse transformation [18,19]. The method is particularly useful in acoustic signal processing, where propagation effects often manifest as convolution with unknown impulse responses. By exploiting the duality between convolution and multiplication, HD enables effective isolation of time-delay features embedded in single-channel or multi-path environments. This approach enhances interpretability and allows more compact feature representation, especially valuable when used in conjunction with machine learning models.
As shown in Figure 2, the HD system consists of two cascaded homomorphic processing stages—forward and backward conversions. The input signal, formed by convolution of a source and a propagation function, is first transformed using the fast Fourier transform (FFT). The logarithm of the magnitude of the FFT output is taken to convert multiplicative relationships into additive ones. An ensemble average is applied to suppress noise, leveraging the inverse proportionality between ensemble length and noise variance. The result is then inverse transformed back to the time domain using the inverse FFT, yielding the real cepstrum.
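A minimal sketch of this forward conversion is given below, assuming the summed microphone signal has already been segmented into an N-by-L matrix xFrames with one ensemble frame per column; the variable names are illustrative.

```matlab
% Forward homomorphic conversion: FFT -> log|.| -> ensemble average -> inverse FFT.
% xFrames: [N x L] matrix, one column per frame of the summed (single-channel) signal.
N = size(xFrames, 1);                      % FFT length
X = fft(xFrames, N, 1);                    % frequency domain, per frame
logMag = log(abs(X) + eps);                % multiplicative -> additive (eps avoids log(0))
logMagAvg = mean(logMag, 2);               % ensemble average suppresses noise variance
realCepstrum = real(ifft(logMagAvg));      % back to the quefrency (time) domain: real cepstrum
```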
A frequency-invariant window function is then applied to isolate the time-delay related component. This operation selectively passes the region of interest in the cepstral domain, typically beyond a predefined time threshold, thus emphasizing delayed signal components. The backward conversion begins by applying the exponential and conjugation operations, followed by another FFT to generate the propagation function. The conjugation operation is applied here to utilize the symmetry properties of the discrete Fourier transform (DFT), enabling the inverse DFT computation using the DFT through complex conjugation [55]. In the non-parametric HD, this output provides a time-delay distribution. The peaks of the resulting sequence indicate the TDoA.
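The following sketch shows one conventional realization of the windowing and backward conversion; the quefrency threshold nCut and the exact ordering of intermediate steps are assumptions, and the conjugation trick shown computes the inverse DFT with a forward FFT.

```matlab
% Cepstral windowing and backward conversion (non-parametric HD), one possible realization.
nCut = 8;                                    % assumed quefrency threshold (samples)
w = [zeros(nCut, 1); ones(N - nCut, 1)];     % frequency-invariant window: keep delayed region
cWin = realCepstrum .* w;                    % isolate the time-delay-related cepstral component
logSpec = fft(cWin);                         % back to the log-spectral domain
spec = exp(logSpec);                         % undo the logarithm
prop = conj(fft(conj(spec))) / N;            % inverse DFT via conjugation + forward FFT
delayDist = abs(prop);                       % peaks of this sequence indicate the TDoA
```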
To improve robustness and resolution, the parametric HD replaces the final stage with a model-based approach. Instead of directly performing an inverse FFT, the system fits a rational transfer function of the form:

$$H(z) = \frac{B(z)}{A(z)} = \frac{\sum_{m=0}^{q} b_m z^{-m}}{1 + \sum_{m=1}^{p} a_m z^{-m}} \tag{1}$$
This structure is estimated using classical spectral estimation techniques, namely Yule–Walker, Prony, and Steiglitz–McBride algorithms. Yule–Walker models the system as autoregressive (AR), estimating only the denominator coefficients. Prony and Steiglitz–McBride implement autoregressive moving average (ARMA) modeling, estimating both numerator and denominator coefficients.
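As a sketch of this estimation step, the corresponding MATLAB Signal Processing Toolbox routines can be applied to the complex sequence entering the final stage (here reusing spec from the earlier sketch); the orders, and the assumption that these routines are called directly on that sequence, are illustrative, and the released implementation [59,60,61] should be consulted for the exact interface.

```matlab
% Model-based final stage: fit the rational transfer function of Equation (1)
% to the complex sequence produced just before the final inverse FFT.
p = 3;                                  % denominator (AR) order
q = 3;                                  % numerator (MA) order, used by the ARMA estimators

aYW        = aryule(spec, p);           % Yule-Walker: AR model, denominator only
bYW        = 1;                         % AR model has a trivial numerator
[bPR, aPR] = prony(spec, q, p);         % Prony: ARMA, numerator and denominator
[bSM, aSM] = stmcb(spec, q, p);         % Steiglitz-McBride: iterative ARMA refinement
```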
The use of complex input in these algorithms allows for flexible pole-zero placement, enhancing accuracy in estimating the TDoA. The estimated impulse response $\hat{h}[n]$ from $H(z)$ is evaluated by computing the following:

$$\hat{h}[n] = H(z)\big|_{z = e^{j 2\pi n / N}}, \qquad n = 0, 1, \ldots, N-1 \tag{2}$$

Here, $N$ is the FFT length, and $n$ denotes the discrete index. The peaks of $|\hat{h}[n]|$ correspond to time delays between microphones. Statistical performance of HD depends on the signal-to-noise ratio (SNR), ensemble average length, parametric method, and model order. Among parametric methods, Steiglitz–McBride generally shows lower bias and variance, while Yule–Walker and Prony are more sensitive to noise and shorter ensemble lengths. For a more detailed explanation of the parametric HD algorithms, readers are encouraged to consult the author’s earlier publications [20,21,22].
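Continuing the sketch, Equation (2) and the peak search can be realized with freqz and findpeaks; bSM and aSM are the Steiglitz–McBride coefficients from the previous sketch, and the choice of peak detector is an assumption.

```matlab
% Evaluate the fitted model on the unit circle over N points (Equation (2)) and locate
% the peaks, which play the role of the non-parametric time-delay distribution.
H = freqz(bSM, aSM, N, 'whole');        % \hat{h}[n] = H(e^{j*2*pi*n/N}), n = 0..N-1
[pk, loc] = findpeaks(abs(H));          % peak locations approximate the TDoA (in samples)
```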
4. Methodology
In the previous section, the HD process was discussed as a method to extract time delay information from the received signal. Building upon that foundation, this section presents the overall methodology for the proposed SSL system. Specifically, the discussion is divided into three subsections: receiver configuration, feature extraction, and multilayer perceptron structures. The receiver configuration subsection addresses the ambiguity and considerations involved in achieving an optimal sensor arrangement. The feature extraction subsection describes the structure and organization of the features derived from the parametric HD output. Finally, the multilayer perceptron structures subsection introduces three potential configurations designed to estimate the AoA through regression, leveraging the extracted features.
4.1. Receiver Configuration
In previous work [22], the optimal receiver configuration for three microphones placed on a planar circle with a 32 cm radius was determined using a brute-force approach. The goal was to maximize the diversity of the time delay distributions across different arrival angles, effectively minimizing the similarity between time delay profiles for different configurations. Due to the symmetric properties inherent to the time delay distribution relative to the AoA, multiple optimal configurations were identified. The receiver configuration used in this study is shown in Figure 3. Solid dots represent microphones, and arrows indicate the distance and direction vectors (DDVs) between the microphones.
Based on the DDVs and the incoming signal direction, the time delay distribution for each angle can be computed, as shown in Figure 4a. In the figure, black dots represent the location of the time delay in samples, assuming a sampling frequency of 48 kHz, at each given incoming angle. The three DDVs provide independent delay patterns with a periodicity of π radians. Due to this periodicity, two pairs of angles produce identical time delay patterns, specifically at 1.0151, 1.4352, 1.9777, and 2.3890 radians. Color-coded lines highlight these angle pairs that result in similar time delay characteristics. It is important to note that unoptimized receiver configurations generally exhibit a higher number of such ambiguities.
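A small sketch of this computation for a far-field plane wave is shown below, assuming the DDVs are stored as rows of a matrix ddv in meters and using an assumed speed of sound; the function name and variables are illustrative.

```matlab
% Far-field time delays (in samples) implied by the DDVs for a given arrival angle theta.
% ddv: [3 x 2] distance-and-direction vectors in meters (one per row), theta in radians.
function delaySamples = ddvDelays(ddv, theta)
    fs = 48e3;                          % sampling rate (Hz)
    c  = 343;                           % speed of sound (m/s), assumed
    u  = [cos(theta); sin(theta)];      % unit vector of the incoming direction
    tau = ddv * u / c;                  % inter-microphone delays (seconds)
    delaySamples = round(fs * tau);     % quantized delay locations, cf. Figure 4a
end
```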
These angular identities can be further visualized using matrix operations. From Figure 4a, a binary matrix $\mathbf{D}$ is constructed, where each column represents a specific incoming angle and each row corresponds to a time sample, with ‘1’ marking the location of a time delay and ‘0’ elsewhere. By computing the Gram matrix [56] below, a square similarity matrix is generated, as shown in Figure 4b.

$$\mathbf{G} = \mathbf{D}^{\mathsf{T}} \mathbf{D} \tag{3}$$
The diagonal elements of $\mathbf{G}$ represent the inner products of each angle with itself, achieving a maximum value of 3 due to the three DDVs. Off-diagonal elements reflect the similarity between different angles, where larger values indicate identical or highly similar time delay patterns. The specific locations of maximum off-diagonal values are highlighted in Figure 4b using red lines and arrows. The sum of the elements in the $\mathbf{G}$ matrix provides an overall measure of similarity for the given receiver configuration; lower total values indicate greater diversity in the time delay distributions. This method was previously used to identify optimal configurations in the author’s earlier work [22]. These ambiguities suggest that the regression process based on MLP architectures may encounter certain confusion in distinguishing these angles.
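A compact sketch of this similarity analysis follows, assuming the binary matrix D has been built as described; the variable names are illustrative.

```matlab
% Similarity analysis of a candidate receiver configuration via the Gram matrix (Eq. (3)).
% D: binary matrix, rows index time samples, columns index incoming angles,
% with ones at the delay locations predicted for each angle (cf. Figure 4a).
G = D.' * D;                       % G(i,j): number of shared delay samples between angles i and j
selfSim = diag(G);                 % equals 3 for three DDVs
offDiag = G - diag(selfSim);       % large off-diagonal entries flag ambiguous angle pairs
score   = sum(G(:));               % lower total -> more diverse time delay distributions
```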
4.2. Feature Extraction
The parametric HD algorithm produces output in the form of model coefficients, typically for the denominator and potentially for the numerator as well. These coefficients must be properly organized and managed to serve as input features for the MLP tasked with estimating the arrival angle. To efficiently reduce the size of the feature set, it is desirable to avoid full evaluation of the parametric HD output based on the complete response. Instead, this study leverages spectral estimation theory, which indicates that the angles of the poles in the parametric HD model carry significant information about the peak locations in the signal structure. Therefore, we visualize the pole angles and their corresponding magnitudes to investigate the feasibility of using them as regression features in the following steps.
The designed feature of the parametric HD output, denoted as $f_k$, is a complex number whose angle represents the pole angle and whose magnitude reflects the evaluated magnitude at that angle, as in the equation below.

$$f_k = \left| H\!\left(e^{j\theta_k}\right) \right| e^{j\theta_k}, \qquad \theta_k = \angle z_k, \quad 1 + \sum_{m=1}^{p} a_m z_k^{-m} = 0 \tag{4}$$

Note that the polynomial equation in Equation (4) is the denominator of Equation (1) in the parametric HD model.
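A sketch of this feature construction from the fitted coefficients follows (reusing bSM and aSM from the earlier sketch); evaluating the model magnitude at the pole angles with freqz is an assumption consistent with Equation (4).

```matlab
% Pole-angle / magnitude features from the fitted parametric HD model (cf. Equation (4)).
zPoles = roots(aSM);                        % poles = roots of the denominator polynomial
thetaK = angle(zPoles);                     % pole angles (feature phase)
magK   = abs(freqz(bSM, aSM, thetaK));      % model magnitude evaluated at the pole angles
fK     = magK .* exp(1j * thetaK);          % complex-valued features f_k
mlpIn  = [thetaK; magK];                    % 2*order real-valued inputs fed to the MLP
```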
Figure 5 illustrates the distribution of these features for individual arrival angles, with color indicating the incoming angle. The x-axis shows the sum of pole angles (in radians), while the y-axis represents the logarithm of the evaluated absolute magnitude sum. Two cases are displayed: Figure 5a for an order 2 model and Figure 5b for an order 10 model. The feature distributions reveal a certain degree of clustering by incoming angle, suggesting that an MLP can potentially distinguish between arrival angles based on these extracted features.
Figure 6 illustrates how the pole locations and corresponding magnitudes represent the time delay characteristics. Figure 6a shows the nonparametric HD distribution for an incoming angle of 0 radians, where three dominant peaks correspond to the time delays between microphones. In contrast, the Yule–Walker (AR model) and Prony (ARMA model) show smooth evaluated distributions, with red lines marking pole locations and magnitudes. These methods tend to spread the dense population of peaks and may place some poles at irrelevant locations. The Steiglitz–McBride (ARMA model) provides a much more accurate reconstruction, with peak locations and magnitudes closely matching the nonparametric HD distribution. With a properly chosen model order (here, order 3), the Steiglitz–McBride method demonstrates the best agreement with true time delays, offering the most promising feature extraction performance. Higher-order parametric HD models also show comparable results when the Yule–Walker and Prony methods are used.
4.3. Multilayer Perceptron Structures
In designing the artificial neural network for SSL, achieving simplicity and computational efficiency is prioritized over adopting excessively complex architectures. Studies have shown that MLPs can achieve competitive performance in sound localization tasks, particularly when paired with structured and informative feature extraction methods such as spectral features or time-delay related descriptors [57]. Although deep learning models like CNNs and recurrent neural networks (RNNs) have demonstrated strong performance for spatial and temporal modeling, their added complexity often outweighs their benefits when the input features are already designed to capture essential information. Thus, a simpler MLP architecture is considered a practical and interpretable choice for this application.
To further enhance model efficiency and learning capacity, a branched MLP structure is adopted in this work. By deploying multiple smaller independent MLPs in parallel, each specialized in processing distinct subsets of the extracted features, the system can effectively capture diverse acoustic characteristics without requiring deep or wide monolithic networks [58]. After independent feature processing, the outputs from the parallel branches are merged and processed through an additional fully connected MLP stage, enabling efficient integration of specialized information. This branched design reduces both layer and neuron counts compared to conventional architectures while maintaining or even improving learning performance, making it highly suitable for real-time and resource-constrained SSL applications.
The feature set provided to the MLP consists of pole locations and their corresponding absolute magnitudes. As a result, the number of features is twice the order of the parametric HD model. Based on these features, three different MLP structures are considered in this study. The first is the fully connected structure, where all extracted features are directly fed into a conventional fully connected MLP. The second is the phase–magnitude pair structure, in which each pole angle and corresponding magnitude pair are processed by a small independent MLP with two inputs; the outputs from these units are then merged and fed into a fully connected MLP for regression. The third structure is the phase–magnitude group structure, where all pole angles and all magnitudes are separately processed by independent MLP branches before merging into a final fully connected MLP. These alternative structures aim to explore different strategies for organizing feature information to optimize localization accuracy. Illustrations of these structures are shown in Figure 7.
Table 2 describes the key parameters for each structure. The feature input layers are identical across all configurations. Since each pole represents both an angle and a magnitude, the number of input features is twice the order of the parametric HD model. The primary differences among the structures lie in their preprocessing layers. The fully connected structure employs a single parallel MLP with 64 neurons in the hidden layer. The phase–magnitude pair structure uses multiple parallel MLPs, each with two inputs (angle and magnitude), and each branch has four neurons in its hidden layer, matching the HD model order. The phase–magnitude group structure separates all angles and magnitudes into two distinct MLP branches, each with 32 neurons in the hidden layer. After preprocessing, all structures feed into a shared processing MLP consisting of three sequential layers with 32, 64, and 32 neurons, respectively. The network concludes with a single output neuron for the regression task. The parameters of the processing layers were selected through extensive simulation experiments to optimize localization performance.
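As an illustrative sketch only, the phase–magnitude group structure of Table 2 could be assembled with the Deep Learning Toolbox as shown below; the layer names, the ReLU activations, and the two-input training workflow (e.g., combined datastores or a dlnetwork) are assumptions rather than the authors' exact implementation.

```matlab
% Phase-magnitude group structure as a layer graph (Deep Learning Toolbox sketch).
% Pole angles and magnitudes are preprocessed by separate branches, then merged.
order = 3;                                        % parametric HD model order (example)

lgraph = layerGraph();
lgraph = addLayers(lgraph, [
    featureInputLayer(order, 'Name', 'angles')    % pole-angle group input
    fullyConnectedLayer(32, 'Name', 'fcA')
    reluLayer('Name', 'reluA')]);
lgraph = addLayers(lgraph, [
    featureInputLayer(order, 'Name', 'mags')      % magnitude group input
    fullyConnectedLayer(32, 'Name', 'fcM')
    reluLayer('Name', 'reluM')]);
lgraph = addLayers(lgraph, [
    concatenationLayer(1, 2, 'Name', 'merge')     % merge the two branches
    fullyConnectedLayer(32, 'Name', 'fc1')        % shared processing layers: 32-64-32
    reluLayer('Name', 'relu1')
    fullyConnectedLayer(64, 'Name', 'fc2')
    reluLayer('Name', 'relu2')
    fullyConnectedLayer(32, 'Name', 'fc3')
    reluLayer('Name', 'relu3')
    fullyConnectedLayer(1, 'Name', 'out')         % single neuron for AoA regression
    regressionLayer('Name', 'mse')]);
lgraph = connectLayers(lgraph, 'reluA', 'merge/in1');
lgraph = connectLayers(lgraph, 'reluM', 'merge/in2');
```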
5. Simulations
In the previous section, the proposed SSL algorithm was implemented by sequentially combining parametric HD, feature extraction, and MLP structures. This section presents the realization and evaluation of the SSL system using computer-generated simulation data. System performance is analyzed under various conditions, including different MLP structures and parametric HD model orders, to determine the optimal configuration for real-world applications. To facilitate reproducibility, an open-source MATLAB (Version 2024a) implementation of the model-based parametric HD used in this study has been released; a repository link is provided [59,60,61]. In addition, a detailed acoustic propagation model was developed to accurately replicate the free-field conditions of the anechoic chamber [54], ensuring that the simulated environment closely matches the actual experimental setup. The simulation data generation method is consistent with the author’s previous research [20,21,22], providing continuity and reliability for comparative analysis.
The simulation parameters are summarized in Table 3. In the simulation, three microphones are positioned according to the configuration shown in Figure 3 and receive independent acoustic signals sampled at 48 kHz. These signals are combined into a single data frame for processing by the parametric HD algorithm with an ensemble length of 200. The acoustic source is a wideband white noise signal, band-limited by a linear-phase low-pass filter with a cutoff frequency of approximately 12 kHz. Incoming angles are uniformly divided between 0 and 3 radians into 1000 discrete steps. A total of 100,000 data frames are randomly generated and partitioned into training, validation, and test sets without overlap. Each network training process is performed for up to 1000 iterations to ensure sufficient convergence.
The MLPs are trained as supervised regressors by minimizing the mean squared error (MSE) between the predicted angle $\hat{\theta}$ and the ground truth $\theta$:

$$\mathcal{L}(\mathbf{w}) = \frac{1}{M} \sum_{i=1}^{M} \left( \hat{\theta}_i(\mathbf{w}) - \theta_i \right)^2$$

where $\mathbf{w}$ denotes the network parameters and $M$ is the number of training frames. While this loss function is non-convex in $\mathbf{w}$, it is smooth and provides well-behaved gradients for regression; model selection and reporting use the root-mean-square error (RMSE) on a held-out validation set.
Optimization employs the limited-memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS) quasi-Newton method [62] in a full-batch regime. L-BFGS approximates the inverse Hessian from a small history of gradient and parameter differences, enabling faster and more stable convergence than first-order methods such as stochastic gradient descent for the smooth squared-error loss function. A strong Wolfe line search enforces sufficient decrease and curvature conditions, reducing sensitivity to step-size tuning and yielding precise weight updates [63]. Training proceeds for up to 1000 iterations with validation assessed periodically (every 100 iterations), and it terminates upon meeting standard tolerance criteria on gradient norm and relative objective reduction.
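Since the branched structures require the custom graph sketched earlier, the following shows only how the fully connected baseline could be trained with fitrnet (Statistics and Machine Learning Toolbox), which uses an L-BFGS solver; the variable names, layer sizes, and tolerance values are illustrative assumptions rather than the authors' exact settings.

```matlab
% L-BFGS training of the fully connected baseline with periodic validation (illustrative).
% XTrain: [numFrames x 2*order] features, YTrain: AoA targets in radians.
mdl = fitrnet(XTrain, YTrain, ...
    'LayerSizes',          [64 32 64 32], ...   % preprocessing + shared processing layers
    'Activations',         'relu', ...
    'IterationLimit',      1000, ...
    'GradientTolerance',   1e-6, ...
    'ValidationData',      {XVal, YVal}, ...
    'ValidationFrequency', 100, ...
    'Standardize',         true);

rmseTest = sqrt(mean((predict(mdl, XTest) - YTest).^2));   % RMSE on the held-out test set
```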
Table 4 presents the RMSE values on the test dataset across different parametric HD methods, MLP structures, and model orders. The RMSE values are computed based on 1000 training iterations, or fewer if early stopping is triggered due to convergence or overfitting. However, in all simulations performed, the training continued through all 1000 iterations without meeting the stop conditions. In the table, light grey cells indicate the minimum RMSE for each row, while dark grey cells denote the global minimum within each parametric HD method. Results show that the Yule–Walker and Prony methods achieved their lowest RMSE values at the maximum tested model order of 10, whereas the Steiglitz–McBride method reached its optimal performance at order 3. Among the MLP structures, the phase–magnitude group structure demonstrated the best overall localization performance.
The variation in optimal model order across parametric HD methods can be better understood by referring to the feature extraction results shown in Figure 6. The Yule–Walker and Prony methods tend to produce smoothed distributions, where the estimated pole locations and magnitudes do not fully capture the distinct time delay characteristics. In contrast, the Steiglitz–McBride method yields sharp, dominant peaks with highly accurate pole placements that align closely with true time delays. Since the number of microphones is three, a model order of three for the Steiglitz–McBride method is sufficient to represent the required time delay information effectively. Meanwhile, the relatively less precise representations of the Yule–Walker and Prony methods benefit from higher model orders, where additional poles help compensate for their limitations in expressing the time delay structure.
This paper employs low-to-moderate model orders for parametric HD. Although a higher order can refine the fit, the tendency within the examined range already indicates the behavior at larger orders: the gain tapers while complexity and numerical sensitivity increase. All parametric variants require solving linear systems and evaluating higher-order filters; therefore, raising the order elevates runtime and may amplify noise effects. The model order also determines the feature dimension provided to the MLP, which enlarges the network size, parameter count, and latency. In line with the low-complexity objective, compact orders are reported as the most suitable accuracy–complexity trade-off, and the Steiglitz–McBride formulation yielded sharp and stable delay estimates at comparatively small order.
In terms of MLP structure, the phase–magnitude group structure exhibits the best performance among the three evaluated configurations. Both the fully connected structure and the phase–magnitude pair structure demonstrate similar performance levels but fall short of the phase–magnitude group structure. The branched MLP structures intentionally restrict certain neuron connections to guide more structured information flow, allowing the network to better extract meaningful representations for the regression task. In particular, separating the phase and magnitude inputs into distinct branches, as done in the phase–magnitude group structure, enhances information extraction while maintaining efficiency through limited layers and neurons. It is important to note that all MLP structures were designed to have similar total numbers of layers and neurons for a fair comparison. Although the phase–magnitude pair structure also introduces clustering by processing each pole individually, this granularity appears to be too fine for effective information consolidation. Meanwhile, the fully connected structure could potentially benefit from increased depth and width due to its densely interconnected architecture, which is better suited for handling larger and more complex feature representations.
Figure 8 presents the best prediction outcomes from the simulation for each parametric HD method. The left column of the figure illustrates the distribution of prediction means along with shaded regions representing the ±RMSE range around the target values. The right column displays linear regression plots with corresponding $R$ values. The $R$ value, also known as the Pearson correlation coefficient, indicates the strength and direction of the linear relationship between the predicted and actual values. An $R$ value close to 1 signifies high prediction accuracy. All prediction plots reveal four noticeable performance glitches, which correspond to the time delay ambiguities identified in Figure 4. The outer pair of glitches demonstrates a broader spread in regression outputs, resulting from the greater angular distance associated with their ambiguity. In contrast, the inner pair shows narrower distributions. These types of discrepancies are inherent to systems using only three microphones, where limited spatial diversity restricts the system’s ability to fully resolve time delay ambiguities.
The Yule–Walker and Prony methods show similar prediction distributions in Figure 8. The minor differences in RMSE between the two are reflected in the $R$ values and linear regression trends. In the left column figures, the solid dark red lines represent the mean of the predicted distributions, while the green lines in the right column indicate the estimated linear regression lines. The Steiglitz–McBride method demonstrates significantly improved performance, producing narrower and more accurate prediction distributions. Its regression line closely aligns with the ground truth. Compared to the other methods, the Steiglitz–McBride method is not immune to performance glitches, but it confines these issues to a very narrow range of affected points. The resulting prediction distributions around the ambiguous angles are sharper and more focused, reinforcing the method’s robustness despite the inherent limitations of using three microphones.
6. Results
The acoustic experiments for validating the proposed SSL system were conducted in an anechoic chamber that conforms to ISO 3745 [64] guidelines. Specifically, the chamber meets the hemi-free-field requirements for the 1 kHz–16 kHz 1/3 octave band and satisfies the free-field conditions for the 250 Hz–16 kHz 1/3 octave band [54]. This study adopts the free-field mode, which requires complete acoustic insulation using wedge-shaped absorbers. The experimental setup includes three microphones placed at predetermined positions (Figure 3) and a single speaker arranged to ensure far-field propagation. The directional angle between the speaker and the microphone array is precisely guided using a line laser mounted above the speaker. A minimum distance of one meter is maintained between the speaker and the receivers to preserve the far-field condition.
The microphones are installed on a rigid frame constructed from lumber and plastic supports to maintain the array’s shape and height during angular rotation. To ensure horizontal propagation, both the speaker and receiver array are positioned at the same vertical level. Angular adjustments are made using a pair of engraved saw-toothed wheels, which allow rotation in discrete π⁄18 (10°) increments. As shown in Figure 9, the microphone array follows a radial layout on concentric circles, centered around the rotational axis. Signals from three condenser microphones (C-2, Behringer, Tortola, British Virgin Islands) are merged using an analog mixer (MX-1020, MPA Tech, Seoul, Republic of Korea) to form a single-channel signal. This signal is digitized via an audio interface (Quad-Capture, Roland, Hamamatsu, Japan) and processed using the proposed SSL algorithm. The speaker (HS80M, Yamaha, Hamamatsu, Japan) is also connected to the same interface and produces the wideband acoustic signals. Real-time data acquisition and playback are managed in MATLAB (Version 2024a) using audio stream input/output compatible system objects. For each angle, a 5 min recording is conducted; the first and last second are excluded to eliminate transitional noise, leaving approximately 5 min of usable data per direction. The experimental parameters are summarized in Table 5.
Although the experimental data provide realistic environmental measurements, their angular coverage is limited due to discrete sampling in π⁄18 increments. Consequently, a neural network trained exclusively on such sparse experimental data may not generalize well to unseen angles. To address this, a hybrid learning strategy is employed that combines synthetic (simulation-based) data with real-world (experimentally collected) data. In machine learning terms, synthetic data offer broad and continuous angular coverage, while real-world data reflect true sensor characteristics and acoustic propagation effects. For this study, 70,000 training frames are generated from simulation, and 3490 frames are collected experimentally. These two datasets are integrated to train the neural network, enabling robust prediction performance over the entire angular range.
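A minimal sketch of assembling the hybrid training set is given below, assuming the synthetic and experimental feature matrices and angle targets are already available under the illustrative names shown.

```matlab
% Hybrid training set: concatenate synthetic and experimentally collected frames.
% XSyn/YSyn: simulated features/angles, XReal/YReal: anechoic-chamber measurements.
XTrain = [XSyn; XReal];                 % 70,000 + 3490 feature rows
YTrain = [YSyn; YReal];
idx    = randperm(numel(YTrain));       % shuffle the combined set before training
XTrain = XTrain(idx, :);
YTrain = YTrain(idx);
```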
Figure 10 illustrates the prediction results on the test set for three learning strategies. The validation and test datasets are disjoint from the training data to ensure objective evaluation. The left column shows networks trained only on synthetic data, the middle column presents results from hybrid training (synthetic + real-world), and the right column shows networks trained solely on real-world data. For each configuration, prediction accuracy is evaluated using real-world data only, with the $R$ value and linear regression lines used to assess fit quality.
Networks trained only on synthetic data perform poorly on real-world inputs, while those trained only on experimental data fail to generalize to unseen (simulated) conditions. This highlights the domain bias induced by each training strategy. In contrast, hybrid learning demonstrates balanced prediction performance across data domains and angular ranges. Among the parametric HD methods, Yule–Walker and Prony produce similar results under hybrid training, with Yule–Walker showing slightly better accuracy. The Steiglitz–McBride method clearly outperforms the others, yielding the highest $R$ values and most consistent regression trends across angles. Despite its superior performance, localized degradation is still observed near ambiguous angles where identical time delay distributions lead to convergence issues—a known limitation for three-microphone arrays.
Table 6 summarizes the RMSE on the experimental test set under the three learning strategies. Each row corresponds to a specific parametric HD method using its optimal model order determined in the simulation. The Steiglitz–McBride method consistently yields the lowest RMSE, achieving a minimum of 0.0279 when trained solely on experimental data due to its focused exposure to limited conditions. The Yule–Walker method also demonstrates competitive performance, particularly under the hybrid learning strategy, with RMSE values approaching those of the Steiglitz–McBride method. Prony, on the other hand, shows slightly higher RMSE in all cases, indicating comparatively lower generalization capability. However, Figure 10 shows that Steiglitz–McBride’s specialized training does not generalize well to broader angle scenarios. The hybrid training approach offers more consistent performance across all angular regions. These results confirm the strength of the Steiglitz–McBride method and highlight the value of integrating real-world data to enhance both accuracy and generalization in neural network-based SSL systems.
For reference, the performance of the proposed neural network–based regression method was compared with results from our previous Gaussian process regression (GPR) framework, which employed the same experimental dataset. Under the real-world only training condition for seen angles, the present method achieved RMSEs of 0.0731, 0.1701, and 0.0279 for Yule–Walker, Prony, and Steiglitz–McBride models, respectively, compared to 0.1631, 0.1679, and 0.0132 in the GPR-based approach. For the synthetic-only condition representing unseen angles, the proposed method obtained RMSEs of 0.4317, 1.2378, and 0.3922, which outperform the GPR-based results of 0.7047, 0.9727, and 0.3503 for Yule–Walker, Prony, and Steiglitz–McBride, respectively. While differences in training strategy and model architecture influence the detailed trends, these results demonstrate that the neural network regression framework can achieve competitive or superior accuracy, particularly in real-world conditions, while offering improved scalability and flexibility compared to the GPR approach.
7. Conclusions
This paper presents a novel method for localizing the angle of arrival based on parametric homomorphic deconvolution with neural network regression. The proposed sound source localization system constructs a single-channel input by summing the signals from three spatially distributed microphones through an analog adder. The forward and backward homomorphic systems operate in cascade to perform deconvolution, estimating the propagation function that captures the time delay between receivers. The parametric homomorphic deconvolution approach utilizes spectral estimation techniques—Yule–Walker, Prony, and Steiglitz–McBride—to represent the time-delay distribution in model-based form. From this representation, pole location and magnitude information are extracted to form the input features for regression. A series of multilayer perceptrons are explored to estimate the angle of arrival, including fully connected, phase–magnitude pair, and phase–magnitude group structures. Among these, the branched structure based on phase–magnitude grouping exhibits the best performance with compact architecture. The optimal receiver configuration was determined in advance using a time-delay similarity matrix, and three-microphone configurations were employed for both simulation and experimental validation. The Steiglitz–McBride method demonstrated consistent advantages, providing sharp and accurate predictions with reduced model order. Meanwhile, the Yule–Walker and Prony methods also showed improved performance with increased model order, with Yule–Walker performing slightly better than Prony in most cases. Simulations confirmed the reliability of the proposed system, and experiments in an anechoic chamber yielded precise predictions when proper parameters were applied.
This study expands upon the prior work involving Gaussian process regression by adopting neural network-based regression to improve scalability and learning flexibility. Compared to handcrafted kernels, neural networks can autonomously model nonlinear regression mappings, which enhances accuracy across a wide range of trained and untrained angles. A hybrid learning strategy, combining synthetic and real-world datasets, further improves generalization and robustness of the system. The predictive results demonstrate the system’s suitability for low-complexity deployment, as the analog summation and single analog-to-digital converter structure minimizes hardware overhead. The proposed system maintains a fixed computational requirement regardless of receiver count, unlike conventional localization systems that scale with sensor number. Despite minor prediction discrepancies around ambiguous angles due to time-delay symmetry, the system delivers stable and accurate performance. The proposed framework extends to reverberant spaces, mobile platforms, and sensor-fault-tolerant operation. Beyond sound source localization, the deconvolution–regression pipeline applies to multi-sensor phase analysis and other sequential signal tasks. Future work will profile computational costs for embedded use, focusing on feature extraction, while maintaining a shallow, low-overhead machine learning model with compact model orders.