1. Introduction
Modern megacities face a set of interrelated transport safety problems. Urbanization and increased car usage result in greater traffic density and create a critical load on the street network, especially at its nodal points, i.e., crossroads.
According to the 2023 World Health Organization data, about 1.19 million people die in road accidents every year. Road accidents are the main cause of death for children and young people aged from 5 to 29 [1]. Studies show that crossroads remain high-risk areas within the urban road network. According to international research, up to 25% of all traffic accidents with serious consequences occur where transport flows intersect [2]. Complex movement trajectories, limited visibility, and the need for simultaneous decision making by multiple people create the conditions for accidents to happen.
The challenges of urban mobility can be tackled by developing road situation monitoring systems as part of the smart transportation system (STS) concept [3]. Historically, the key technologies used in STS included visual and radar-based methods that are used in most modern traffic control systems [4,5].
The development of the V2X (Vehicle-to-Everything) standards [6] introduced new requirements for monitoring systems, including the need for reliable event detection in any weather conditions and limited visibility. However, conventional road monitoring systems based on cameras, radars, and lidars demonstrate limited efficiency in adverse weather conditions (heavy rain or snow), insufficient illumination, or visual obstacles [7].
Multiple research studies confirm that cameras, while providing large amounts of semantic information, depend heavily on illumination and atmospheric transmittance. Although radar systems are less sensitive to weather conditions, their resolution is too limited for precise positioning and event type classification. Lidars provide high 3D positioning precision, but are associated with high capital and operating costs, and their efficiency degrades during heavy precipitation.
The fundamental limitation of all the mentioned technologies is the requirement for a direct line of sight to the object, which is often impossible to meet in dense urban areas or when large vehicles or other visual obstacles are present. This line-of-sight requirement creates a critical blind spot for situational awareness at occluded intersections and fuels the search for alternative and complementary approaches to road situation monitoring.
In this context, acoustic monitoring methods are deemed the most promising. This is because the acoustic sound source localization methods do not require direct object visibility, and they can be used in any weather and lighting conditions. The physical nature of sound waves allows them to diffract around obstacles and propagate in a complex urban environment, thus ensuring a different approach to monitoring [8].
Modern digital signal processing algorithms based on Time Difference of Arrival (TDoA) and methods such as Generalized Cross-Correlation with Phase Transform (GCC-PHAT) [9] can be used to localize typical accident sounds (collisions, brake squeals, emergency vehicle sirens) precisely, even in the context of intense urban noise [10] and multipath sound propagation [11].
The development of STS and V2X standards provides new opportunities to improve the promptness of data exchange between vehicles and urban infrastructure. Acoustic sensors may be a key component of such infrastructure, facilitating early accident detection and prompt warning of road users. The combined analysis of audio and video data is especially promising because it can significantly improve the precision and response rate for “connected” crossroads by reducing the warning transmission time and improving the coordination between the vehicles and the urban monitoring network [12].
The conducted analysis of patents and R&D solutions confirms the growing interest in acoustic technologies for transport infrastructure. The existing solutions can be divided into basic TDoA systems, adaptable complexes with noise suppression, smart platforms with ML modules, and multimodal structures combining acoustic and visual data [13,14,15,16,17,18,19,20]. This wide range of approaches indicates that acoustic methods are recognized as promising, but it also shows that there is no versatile solution tailored for urban crossroads monitoring.
However, despite the multitude of algorithmic approaches that show good results in laboratory tests and mathematical modeling [21,22], their implementation in a real-life urban landscape is associated with unique challenges. Uncontrolled background noise (traffic, construction work, pedestrians’ voices), multipath sound propagation due to reflection from buildings and road surfaces, and the need to work with real rather than ideal hardware require comprehensive field tests and subsequent algorithm optimization.
Many existing solutions have drawbacks, including poor adaptation to a specific urban environment and a lack of specialized classification of accident sounds. Additionally, there is a significant gap between the theoretical research on sound localization algorithms and their practical implementation as complete systems [23,24,25,26,27,28,29].
This study presents the development, upgrade, and comprehensive experimental (laboratory and field) testing of the hardware/software prototype of the acoustic monitoring system to detect accidents on crossroads designed for subsequent integration with the V2X infrastructure.
To accomplish these goals, this study aims to tackle the following issues:
Developing the architecture and creating a hardware/software prototype of the localization system based on the TDoA and GCC-PHAT algorithms, taking into consideration the urban environment requirements.
Conducting laboratory testing of the prototype to verify its operability in controlled conditions.
Conducting primary field tests to detect systemic problems and limitations of the basic algorithm version in real-life urban conditions.
Upgrading the algorithm based on the field data analysis, including extended signal preprocessing and dynamic adjustment of its parameters.
Evaluating the accuracy and reliability of the upgraded system during repeated field tests.
Analyzing the possibility of integrating the developed solution with V2X systems, considering time and reliability parameters.
2. Materials and Methods
2.1. Calculating TDOA Using GCC-PHAT
The TDOA was calculated using the GCC method proposed by Knapp and Carter in 1976 for sound source localization problems [9]. This approach relies on various weight functions, including ML, ROTH, and PHAT, to process the signals in the frequency domain. The algorithm operates in two stages: during Stage 1, the Time Difference of Arrival is calculated for each pair of microphones. During Stage 2, the sound source is localized [30] based on the obtained delay times and the known microphone array geometry.
To determine the signal arrival direction, we used the Generalized Cross-Correlation method with Phase Transform (GCC-PHAT). This method is known as one of the most effective in TDOA calculation, owing to its high precision and resilience against reverberation [31]. The GCC-PHAT is based on normalizing the cross-power spectrum so that only the phase information of the signals is preserved. For two discrete signals $x_1(n)$ and $x_2(n)$, the GCC-PHAT function is determined by the following expression:
$$G_{\mathrm{PHAT}}(f) = \frac{X_1(f)\,X_2^{*}(f)}{\left|X_1(f)\,X_2^{*}(f)\right|} \tag{1}$$
where $X_1(f)$ and $X_2(f)$ are the spectra of the sound signals $x_1(n)$ and $x_2(n)$, and $X_2^{*}(f)$ is the complex conjugate of $X_2(f)$.
The required TDOA is calculated by determining the argument of the global maximum of the output signal:
$$\hat{\tau} = \arg\max_{\tau}\; \mathcal{F}^{-1}\left\{G_{\mathrm{PHAT}}(f)\right\}(\tau) \tag{2}$$
where $\hat{\tau}$ stands for the estimated TDOA and $\mathcal{F}^{-1}$ is the inverse Fourier transform applied to the normalized cross-power spectrum of the two sound signals.
When calculating the TDOA, we assume that the sound wavefronts can be approximated by parallel lines in the vicinity of the microphone pair. Assuming that the direction to the sound source is within the range of 0–90° relative to the baseline connecting Microphone 1 and Microphone 2, the arrival angle estimate $\hat{\theta}$ can be obtained using the result of Equation (2) in the following formula:
$$\hat{\theta} = \arccos\left(\frac{\Delta d}{d}\right) = \arccos\left(\frac{c\,\hat{\tau}}{d}\right) \tag{3}$$
where $d$ is the distance between the microphones of the pair, $\Delta d = c\,\hat{\tau}$ is the additional distance that a sound wave travels to the second microphone after reaching the first one (with $\hat{\tau}$ being the TDOA from Equation (2)), $c$ is the propagation speed of sound in the medium, and $\hat{\theta}$ is the calculated signal arrival angle relative to the line connecting Microphones 1 and 2, measured counterclockwise [32].
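The delay estimate and the far-field angle relation described above can be illustrated with a minimal NumPy sketch. It is not the prototype code: the function names `gcc_phat` and `arrival_angle_deg`, the zero-padding length, the small regularization constant, and the optional `interp` upsampling factor are assumptions made for illustration only.

```python
import numpy as np


def gcc_phat(sig, ref, fs, max_tau=None, interp=1):
    """Estimate the delay of `sig` relative to `ref` (in seconds) with GCC-PHAT.

    A positive result means that `sig` arrives later than `ref`.
    """
    n = sig.shape[0] + ref.shape[0]            # zero-pad to avoid circular wrap-around
    sig_f = np.fft.rfft(sig, n=n)
    ref_f = np.fft.rfft(ref, n=n)
    cross = sig_f * np.conj(ref_f)             # cross-power spectrum
    cross /= np.abs(cross) + 1e-12             # PHAT weighting: keep the phase only
    cc = np.fft.irfft(cross, n=interp * n)     # (optionally upsampled) correlation

    max_shift = interp * n // 2
    if max_tau is not None:                    # restrict the search to plausible lags
        max_shift = min(int(interp * fs * max_tau), max_shift)

    # Reorder so that index `max_shift` corresponds to zero lag.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = int(np.argmax(np.abs(cc))) - max_shift
    return shift / float(interp * fs)


def arrival_angle_deg(tau, d, c=343.0):
    """Far-field arrival angle (degrees) relative to the microphone baseline."""
    cos_theta = np.clip(c * tau / d, -1.0, 1.0)  # guard against |c*tau| > d
    return float(np.degrees(np.arccos(cos_theta)))
```

For example, `gcc_phat(ch2, ch1, fs=44100, max_tau=0.05)` would return the delay of channel 2 relative to channel 1, which can then be passed to `arrival_angle_deg` together with the microphone spacing.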
2.2. Software Implementation of the Algorithm
Python 3.10, with key libraries including NumPy, SciPy, sounddevice, and matplotlib, was selected to implement the sound localization system prototype. Python was selected due to its flexibility, the availability of a large set of libraries for scientific computing, and its fast prototyping capabilities.
The complete signal processing chain, from raw audio capture to coordinate estimation, is illustrated in Figure 1. The pipeline consists of three key stages that successively convert the initial audio data into sound source coordinates:
- (1) Signal preprocessing, involving bandpass filtering (100–3000 Hz for laboratory or 500–2000 Hz for field conditions), median filtering to suppress impulse noise, and amplitude normalization;
- (2) Active segment selection by identifying the 2 s fragment with the highest energy within a sliding window;
- (3) Time-delay estimation between all microphone pairs using the GCC-PHAT algorithm with subsample interpolation for enhanced accuracy, followed by localization via optimization.
During the first stage, four independent audio channels are captured either from a WAV file or in real time using the sounddevice module. The captured signals then undergo preprocessing, including Chebyshev Type II bandpass filtering (500–2000 Hz in field conditions) for effective suppression of low-frequency rumble and high-frequency noise, as well as median filtering to remove isolated impulse spikes. This approach significantly improves the signal-to-noise ratio and prepares the data for subsequent analysis.
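The preprocessing step can be sketched with SciPy as follows. The 4th-order Chebyshev Type II design and the 500–2000 Hz band come from the text; the stopband attenuation (40 dB), the median kernel size, and the use of zero-phase filtering are assumptions chosen for illustration. Note that `scipy.signal.cheby2` interprets the band edges as stopband edges, so the effective passband is somewhat narrower than the nominal band.

```python
import numpy as np
from scipy.signal import cheby2, medfilt, sosfiltfilt


def preprocess(x, fs, band=(500.0, 2000.0), order=4,
               stop_atten_db=40.0, med_kernel=5):
    """Bandpass-filter, median-filter, and peak-normalize one audio channel.

    stop_atten_db and med_kernel are illustrative values, not taken from the paper.
    """
    # Chebyshev Type II bandpass as second-order sections (numerically stable).
    sos = cheby2(order, stop_atten_db, band, btype="bandpass", fs=fs, output="sos")
    # Zero-phase filtering: applying it identically to every channel
    # leaves the inter-channel delays unchanged.
    y = sosfiltfilt(sos, np.asarray(x, dtype=float))
    # Short median filter to knock down isolated impulse spikes.
    y = medfilt(y, kernel_size=med_kernel)
    # Amplitude normalization to a peak of 1.
    peak = np.max(np.abs(y))
    return y / peak if peak > 0 else y
```

The same function would be applied to each of the four channels with identical parameters, so that the filtering does not introduce channel-dependent delays.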
After that, the energy is calculated within a 2 s sliding window (with a step of 0.5 s) for each of the four channels, and the segment with the highest total power is selected, as it most likely contains the accident sound (sirens, collision, brakes).
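A minimal sketch of this energy-based selection is given below, assuming the channels are stored as a NumPy array of shape (channels, samples). The function name `select_active_segment` and the cumulative-sum implementation are illustrative choices; only the 2 s window and 0.5 s step come from the text.

```python
import numpy as np


def select_active_segment(channels, fs, win_s=2.0, hop_s=0.5):
    """Return (start, stop) sample indices of the window with the highest
    total energy summed over all channels."""
    channels = np.asarray(channels, dtype=float)
    win = int(win_s * fs)
    hop = int(hop_s * fs)
    n = channels.shape[1]
    if n <= win:                                   # recording shorter than one window
        return 0, n

    power = np.sum(channels ** 2, axis=0)          # instantaneous power over channels
    csum = np.concatenate(([0.0], np.cumsum(power)))
    starts = np.arange(0, n - win + 1, hop)        # candidate window starts
    energy = csum[starts + win] - csum[starts]     # energy of each window
    best = int(starts[np.argmax(energy)])
    return best, best + win
```

The returned indices can be used to slice all four channels identically, for example `segment = channels[:, start:stop]`, before delay estimation.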
To estimate the delay times, we used the GCC-PHAT algorithm with the maximum delay limited to 0.05 s and a subsample refinement of the correlation peak obtained from the normalized cross-power spectrum. Then, a system of non-linear equations is formed for all the microphone pairs:
$$\left\|\mathbf{x} - \mathbf{m}_i\right\| - \left\|\mathbf{x} - \mathbf{m}_j\right\| = c\,\Delta t_{ij} \tag{4}$$
where $\mathbf{x}$ is the coordinate vector for the estimated sound source position (m), $\mathbf{m}_i$ and $\mathbf{m}_j$ are the coordinate vectors for microphones i and j (m), $c$ is the sound speed in air (≈343 m/s), and $\Delta t_{ij}$ is the signal Time Difference of Arrival between microphones i and j (s).
The system of equations is solved by non-linear optimization. This solution is resilient to noise spikes in the time difference of arrival estimates (Δt), which increases the robustness and stability of the localization under interference.
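One way to solve such a pairwise TDoA system is sketched below with `scipy.optimize.least_squares`. The sign convention (the delay for pair (i, j) is the arrival time at microphone i minus that at microphone j), the centroid starting point, and the function names are assumptions for illustration. The default linear loss is used here; a robust loss such as `loss="soft_l1"` is a possible option when individual delays are outliers.

```python
from itertools import combinations

import numpy as np
from scipy.optimize import least_squares

SPEED_OF_SOUND = 343.0  # m/s, value quoted in the text


def tdoa_residuals(x, mics, pairs, taus, c=SPEED_OF_SOUND):
    """Residuals (||x - m_i|| - ||x - m_j||) - c * tau_ij, in meters."""
    dist = np.linalg.norm(mics - x, axis=1)
    return np.array([(dist[i] - dist[j]) - c * tau
                     for (i, j), tau in zip(pairs, taus)])


def localize(mics, pairs, taus, x0=None, c=SPEED_OF_SOUND):
    """Estimate the source position from pairwise delays by least squares."""
    mics = np.asarray(mics, dtype=float)
    if x0 is None:
        x0 = mics.mean(axis=0)          # start from the array centroid (assumption)
    result = least_squares(tdoa_residuals, x0, args=(mics, pairs, taus, c))
    return result.x


def all_pairs(n_mics):
    """All unordered microphone index pairs, e.g. 6 pairs for 4 microphones."""
    return list(combinations(range(n_mics), 2))
```

With four microphones, `all_pairs(4)` yields six delay equations for two (or three) unknown coordinates, so the system is overdetermined and individual delay errors are averaged out to some extent.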
Finally, the results are visualized on a two-dimensional chart: the microphones and the calculated source point are displayed in the same field, while the delay values and coordinates are shown in a table, as can be seen in Figure 2.
This implementation covers the entire signal processing cycle, from the capture of raw audio data to the visualization of sound event coordinates.
2.3. Experimental Setup
The experimental setup consists of four omnidirectional condenser microphones placed at random spots to test the algorithms with an imperfect configuration. This approach was selected to model the real-life conditions of an urban environment where perfect sensor placement geometry is often impossible.
The specifications of the equipment are shown in Table 1.
The hardware synchronization of all four channels through the UMC1820 audio interface ensured precise temporal alignment, minimizing intrinsic timing jitter. The effective timing resolution for delay (TDoA) measurements was determined according to the sampling rate (44.1 kHz), providing a theoretical discrete resolution of approximately 22.7 µs. The subsample interpolation employed in the GCC-PHAT implementation further refined this resolution to an estimated effective accuracy in the range of 1–5 µs under the tested signal-to-noise conditions. For signal conditioning, 4th-order Chebyshev Type II bandpass filters were implemented digitally in the preprocessing stage, with the specific passband (100–3000 Hz for laboratory tests, 500–2000 Hz for field tests) chosen to optimize the signal-to-noise ratio for the target acoustic events while suppressing out-of-band interference.
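The timing figures above translate directly into path-length resolution. The short calculation below is an illustrative sketch that assumes a sound speed of 343 m/s (the value used elsewhere in the text); the variable and function names are hypothetical.

```python
FS_HZ = 44_100           # sampling rate from the text
SPEED_OF_SOUND = 343.0   # m/s, assumed

# One sample period and the path-length difference it corresponds to.
sample_period_s = 1.0 / FS_HZ                          # about 22.7 microseconds
path_per_sample_m = SPEED_OF_SOUND * sample_period_s   # about 7.8 mm


def delay_to_path_mm(delay_us, c=SPEED_OF_SOUND):
    """Convert a time resolution in microseconds to a path difference in mm."""
    return c * delay_us * 1e-6 * 1e3


print(f"sample period: {sample_period_s * 1e6:.1f} us")
print(f"path difference per sample: {path_per_sample_m * 1e3:.2f} mm")
print(f"1-5 us subsample range: {delay_to_path_mm(1):.2f}-{delay_to_path_mm(5):.2f} mm")
```

Under these assumptions, the discrete resolution corresponds to a path difference of roughly 8 mm, and the 1–5 µs subsample range to roughly 0.3–1.7 mm.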
Highly sensitive microphones with a wide frequency range were selected to record both low-frequency vibrations (collisions, brake squeals) and high-frequency components (sirens). The usage of standard computing components implies that the system can be deployed without using expensive special equipment.
Critical to the TDoA method is the precise temporal alignment of all audio channels. This was achieved via hardware synchronization: all four microphones were connected to and sampled simultaneously by a single multichannel audio interface (UMC1820), which provides sample-accurate clock synchronization across its input channels, eliminating internal timing skew. This setup ensured that any measured time difference (Δt) between channels was attributable solely to the physical propagation delay of the sound wave, not to discrepancies in the recording hardware.
2.4. Experiment Methodology
The experiment was conducted in three stages:
Mathematical modeling: testing the theoretical bases of localization algorithms in controlled conditions using synthetic data.
Laboratory testing: testing the developed system in a room with controlled acoustic conditions.
Field testing: testing the operability of the system in a real-life urban environment.
Localization accuracy was evaluated at each stage by comparing the calculated coordinates with the true position of the sound source. The key evaluation metrics included the following:
Mean absolute error for coordinates (Δx, Δy, Δz);
Standard deviation (σ);
Euclidean distance between the true and calculated positions;
Successful localization percentage (error < 1 m).
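These metrics can be computed from paired arrays of estimated and true positions, as in the sketch below. It assumes that the standard deviation is taken over the Euclidean errors with the sample (n − 1) normalization, which the text does not specify; the function name and the dictionary keys are illustrative.

```python
import numpy as np


def localization_metrics(estimated, true, success_radius_m=1.0):
    """Compute accuracy metrics for arrays of shape (n_tests, n_dims)."""
    estimated = np.asarray(estimated, dtype=float)
    true = np.asarray(true, dtype=float)
    diff = estimated - true
    dist = np.linalg.norm(diff, axis=1)            # Euclidean error per test
    return {
        "mae_per_axis": np.mean(np.abs(diff), axis=0),    # mean |dx|, |dy|, ...
        "mean_distance": float(dist.mean()),
        "std_distance": float(dist.std(ddof=1)),          # sample std (assumption)
        "success_rate": float(np.mean(dist < success_radius_m)),
    }
```

For example, two tests with errors of 0 m and 5 m give a mean distance of 2.5 m and a success rate of 0.5.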
During the first stage, the theoretical bases of localization algorithms were tested in controlled conditions using synthetic data. To facilitate a comprehensive evaluation of the localization algorithms in question, we developed a simulation module in MATLAB R2020b to analyze their precision and resilience against noise in thoroughly controlled conditions.
Modeling involved the successive implementation of the following stages:
Microphone set parameterization: four receivers in the X–Y plane at a distance of 5 m from the center.
Calculating the theoretical signal arrival delays for the source at a specific point (7 m, 10 m, 5 m).
Adding Gaussian noise with a 5 dB SNR to imitate the urban background.
Applying the algorithms and evaluating the localization error.
To make the results reliable, the testing was performed over 100 independent runs with different noise realizations. This produced reference data on algorithm accuracy and helped identify the algorithms that would be most promising for the prototype.
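The delay calculation and noise injection steps can be illustrated in Python (the study itself used MATLAB). The sketch assumes that the four receivers lie on the coordinate axes at 5 m from the origin, which is only one possible reading of the described layout; the function names are hypothetical.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s


def theoretical_delays(source, mics, c=SPEED_OF_SOUND):
    """Arrival delays (s) at each microphone relative to microphone 0."""
    dist = np.linalg.norm(mics - source, axis=1)
    return (dist - dist[0]) / c


def add_noise(signal, snr_db, rng):
    """Add white Gaussian noise so that the result has the requested SNR (dB)."""
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / 10.0 ** (snr_db / 10.0)
    return signal + rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)


# Assumed layout: receivers on the axes of the X-Y plane, 5 m from the center.
MICS = np.array([[5.0, 0.0, 0.0], [-5.0, 0.0, 0.0],
                 [0.0, 5.0, 0.0], [0.0, -5.0, 0.0]])
SOURCE = np.array([7.0, 10.0, 5.0])   # source point from the text

delays = theoretical_delays(SOURCE, MICS)
```

Repeating `add_noise` with a fresh random generator state in each of the 100 runs would give the independent noise realizations described above.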
The second stage included the testing of the developed system in a room with controlled acoustic conditions. Laboratory testing was used to verify the basic operability of the system and its hardware components with minimum external interference. Four omnidirectional condenser microphones were attached to office pedestals at a height of 0.6 m, each 2 m away from the central point, making a square with a side length of 4 m, as shown in Figure 3.
The sound source was a recording of sirens played on a smartphone, which was moved between 10 previously marked points along the square’s perimeter. For each position, the data from the four channels were recorded in the WAV format. Example waveforms of the signals after preprocessing are shown in Figure 4.
The final stage involved testing the operability of the system in a real-life urban environment. Following successful laboratory tests, we carried out a series of field tests to evaluate the resilience of the algorithm to distance scaling, multipath reflections, and uncontrolled noise.
The field tests aimed to evaluate the system’s operability in conditions close to a real-life urban crossroads. They were conducted on an open-air site at a university campus, on an even concrete surface between buildings and without any additional sound-damping measures. This created an acoustic environment typical of urban areas with active background noise (traffic, construction work, pedestrians).
The field trials employed a two-stage methodology. First, synchronized multichannel audio data were recorded on-site for all test points to create a representative dataset. Subsequently, this recorded dataset was processed by the software pipeline in a controlled environment to enable precise, repeatable analysis and iterative algorithm development. The recorded processing time of 95 ms per event, as reported in the Results and Conclusion, was measured during this offline processing stage using the same codebase, confirming that the computational performance meets the real-time requirement (<200 ms) for V2X integration.
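A per-event processing time of this kind can be measured offline with a simple timing harness such as the sketch below. The function name `measure_latency_ms`, the number of repeats, and the choice of reporting the median and worst case are assumptions; `process_event` stands for whatever callable runs the full pipeline on one recorded event.

```python
import time


def measure_latency_ms(process_event, *args, repeats=10):
    """Run `process_event(*args)` several times and return
    (median, worst-case) wall-clock latency in milliseconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        process_event(*args)
        timings.append((time.perf_counter() - start) * 1e3)
    timings.sort()
    return timings[len(timings) // 2], timings[-1]


def meets_budget(latency_ms, budget_ms=200.0):
    """Check a measured latency against the real-time budget quoted in the text."""
    return latency_ms < budget_ms
```

Reporting the worst case alongside the median is a design choice here, since a real-time budget is usually violated by outliers rather than by the typical run.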
The hardware configuration of the field setup included four omnidirectional microphones installed on tripods at a height of 1.5 m. The microphones were placed at the vertices of a square with a side length of 14 m, thus covering a control area of a size typical for an urban crossroads. The synchronized sound signal was recorded using a multichannel audio interface connected to a portable computer, which served as the main data collection and primary processing module.
The side length of 14 m was chosen to model the geometric scale of a standard urban intersection, ensuring the array covers the key conflict zones while allowing for practical deployment of microphones on existing street infrastructure at the corners.
During the experiment, a speaker playing a record of sirens was placed at 11 predetermined points within the controlled area. This helped us evaluate the localization precision of the algorithm at various positions relative to the microphones, including central and peripheral points.
The field setup configuration, hardware platform, and testing site overview are shown in Figure 5, Figure 6 and Figure 7.
This field testing produced a representative audio database reflecting the operation of the system under real-life acoustic interference, multiple-path sound propagation, and scaling distances typical of the urban environment.
3. Results
3.1. Mathematical Modeling Results
The simulation setup implemented in MATLAB R2020b imitated the key characteristics of the urban acoustic environment: the presence of additive noise and spatial positioning of the sensors.
The virtual microphone set configuration included four receivers placed at the vertices of a 5 m square. A single-point Gaussian impulse was modeled as the sound source located at the point with the coordinates of (7, 10, 5) m, which allowed for the algorithm’s precision evaluation when the object was positioned outside the geometrical center of the microphone set. For the maximum approximation to real-life conditions, the synthetic signals for each microphone were complemented with an additive Gaussian white noise, providing for a 5 dB signal/noise ratio (SNR), which is typical for a busy urban street.
Each of the algorithms under analysis (GCC and GCC-PHAT) was used in a series of 100 independent runs with different noise realizations to ensure the statistical significance of the results. During each run, the delay times were identified, and the system of non-linear equations was solved to determine the coordinates of the source.
The modeling results presented in the aggregate table facilitated a clear ranking of the algorithms by their accuracy (Table 2).
The analysis of the data showed the following:
The Cross-Correlation and GCC-PHAT algorithms are precise and resilient. Their mean errors for all the coordinates were under 0.2 m, and the small scatter of the estimates (σ ≈ 0.1 m) implies that the results are highly replicable even under noisy conditions. A small but statistically significant advantage of GCC-PHAT is associated with its phase normalization, which effectively suppresses the impact of amplitude distortion and reverberation.
Considering its precision, resilience to reverberation, and satisfactory computation time, we selected the GCC-PHAT algorithm for implementation in the prototype.
3.2. Laboratory Testing Results for the Basic Prototype
After the successful mathematical modeling, we developed the basic version of the hardware–software prototype and tested it in laboratory conditions. The purpose of this stage was to test the operability of the system in a controlled acoustic environment and identify any problems that were not accounted for during modeling (e.g., equipment imperfections, multipath reflections).
The recorded signal processing included bandpass filtering (100–3000 Hz) to suppress low-frequency babble and high-frequency interference, calculating delay times using the cross-correlation method, and solving the system of equations to evaluate the coordinates.
The results of 10 laboratory test runs showed that the mean absolute localization error amounted to 0.45 m with a standard deviation of 0.23 m. The best results (≤0.3 m) were observed at the points located closer to the center of the microphone set, while on the periphery (Points 2 and 5), the error value increased to 0.85 and 0.66 m, respectively.
The obtained results presented in Table 3 and visualized in Figure 8 (a 3D visualization of one of the experimental runs where the calculated source position is compared to its true position) confirmed the fundamental operability of the selected algorithmic approach under controlled conditions. The key causes of deviations included the lack of rigid hardware channel synchronization, multipath reflections from room surfaces, and single false correlation peaks in noise segments.
Laboratory testing identified two key limitations of the basic version:
The lack of a rigid hardware synchronization of the recording channels resulted in timestamp jitter, introducing fluctuations of up to ±0.5 m in the resulting evaluation.
Multipath reflections from the walls and furniture in the room distorted the impulse shape, leading to the emergence of false peaks in the correlation function and, as a result, errors in delay time assessments.
These results confirmed the fundamental operability of the system but clearly identified the areas that needed to be improved before field testing.
3.3. Field Testing Results and Systemic Problem Identification
The transition to field testing identified the fundamental limitations of the basic version of the algorithm. Field testing showed that the laboratory version of the algorithm is unfit for real-life street conditions because it does not account for scaling and reverberation effects. When estimating the position of the source, the algorithm systematically underestimated the distance: it returned 0.5 m for a true distance of 1 m, 0.7 m for 2 m, and 1.3 m for 3.5 m.
Multipath echoes (reflections from buildings, construction sites, and asphalt) distorted the impulses, thus introducing false correlation peaks. We found that 100% of the delays were within the permitted max_tau range for small distances, but at larger source distances many of the estimates could not be completed or produced incorrect results. The mean localization error in field conditions for the basic version of the algorithm was 2.3 m, which was deemed unacceptable for practical usage.
The analysis of the identified problems led to a set of upgrades that address the limitations of the first version of the algorithm; they were integrated into the second version of the script. The key improvements were as follows:
Extended signal preprocessing: the 500–2000 Hz bandpass filter suppressed both low-frequency rumble (construction machinery) and high-frequency interference (wind, tires). The median filter smoothed out single impulse spikes (shouts, doors shutting).
Smart identification of useful signals: the automatic identification of the 2 s segment with the greatest total energy ensured that the algorithm only analyzed “clean” sections and ignored prolonged segments with background noise.
Localization core improvement: limiting the peak search with max_tau (0.05 s) excluded the delays that are impossible given the setup geometry, which immediately eliminated up to 50% of false detections. The subsample peak refinement improved the accuracy of arrival time estimates to within tens of microseconds.
Robust solver for the system of equations: for the final coordinate calculation, we switched to non-linear optimization using the least squares method (least_squares), which produced more stable results in the presence of small channel timing mismatches.
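The geometric reasoning behind the max_tau limit can be made explicit: for a microphone pair separated by a distance d, no direct-path delay can exceed d/c. The sketch below computes this bound for every pair of the 14 m field square, assuming microphone coordinates centered on the origin and c = 343 m/s; the function name and layout are illustrative.

```python
from itertools import combinations

import numpy as np


def max_pair_delays(mics, c=343.0):
    """Largest physically possible delay (s) for every microphone pair: d_ij / c."""
    mics = np.asarray(mics, dtype=float)
    return {
        (i, j): float(np.linalg.norm(mics[i] - mics[j]) / c)
        for i, j in combinations(range(len(mics)), 2)
    }


# Assumed coordinates of the 14 m field square, centered on the origin.
FIELD_MICS = [(-7.0, -7.0), (7.0, -7.0), (7.0, 7.0), (-7.0, 7.0)]

bounds = max_pair_delays(FIELD_MICS)
pairs_over_cap = [pair for pair, tau in bounds.items() if tau > 0.05]
```

Under these assumptions, the side pairs are bounded by about 40.8 ms and the two diagonal pairs by about 57.7 ms, so a single 0.05 s cap lies between the two values; a per-pair cap derived from `bounds` would be one possible refinement.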
3.4. Upgraded Algorithm Validation Results
To test the algorithm improvements, we repeated and processed 11 field runs with predetermined source coordinates. For each point, we recorded four tracks (one per microphone) and then compared the calculated coordinates with the true ones (see Table 4).
Repeated tests of the upgraded system demonstrated significant improvements in all metrics. The results are shown in Table 5.
Repeated testing of the upgraded algorithm demonstrated a significant improvement in localization precision and reliability, especially close to the center of the microphone set:
Number of successful tests: 5 out of 11;
Borderline results: 4 out of 11;
Erroneous evaluations: 2 out of 11;
Mean error for successful tests: ≈0.65 m;
Total mean deviation (all tests included): ≈1.69 m;
Standard deviation: σ ≈ 1.16 m;
The number of false delays dropped by 70% compared to Version 1.
For enhanced clarity, the combined accuracy and latency results of the final system are summarized in Table 6 below.
To illustrate the real-life operation of the algorithm, Figure 9 and Figure 10 show two typical field tests: Control Point 1 (true position of the source (0; 0) m) and Point 10 (true position of the source (−4; −1) m). In the charts, the blue dots show the microphones, the green triangle marks the true position of the source, and the orange star marks the position calculated by the system.
As Figure 9 shows, near the center ((0; 0) m), the algorithm produces very accurate estimates (error ≈ 0.14 m), while in Figure 10 (Point 10) the displacement increases to ≈1.3 m due to reverberation effects and TDoA values falling outside the max_tau range.
Thus, the implemented upgrades improved the localization accuracy by 52% and considerably increased the stability of the system’s operation in the complex acoustic conditions of the urban environment.
4. Conclusions
The conducted research confirmed the high efficiency and practical applicability of the developed acoustic accident monitoring sensor for smart crossroads. We used the iterative approach to the development of a system, including the stage-wise algorithm testing from mathematical modeling and laboratory trials to complex field testing under conditions close to a real-life urban environment.
The research features the comparison of sound localization algorithms in additive noise conditions (SNR = 5 dB). The GCC-PHAT was selected based on the precision, robustness, and calculation efficiency criteria. Mathematical modeling showed that this algorithm has a mean localization error of about 0.15 m, demonstrating the optimal balance for real-time systems.
Based on the selected method, we developed and tested a fully functional prototype of the localization system. The hardware part included a set of four microphones and a multichannel audio interface, while the software part was implemented in Python using the SciPy libraries. The results suggest that efficient solutions can be developed with a standard computing platform without expensive specialized equipment.
To adapt the algorithm to urban environment conditions, we implemented several upgrades. The primary field testing exposed a systemic error in the basic version of the algorithm that amounted to 2.3 m. To eliminate it, we took several actions: introduced extended signal preprocessing (bandpass and median filtering), implemented automatic energy-rich signal segment identification, optimized the correlation peak search domain (max_tau), and shifted towards non-linear optimization when solving the system of localization equations.
The upgraded algorithm was validated for 11 control points, showing a reduction of the mean localization error to 1.1 m, which corresponds to a 52% precision improvement. In addition, the number of false detections dropped by 70%, and the standard deviation decreased to 0.15 m, indicating increased stability of the system’s operation.
The performance of the system is an important practical result: the processing time for one acoustic event was 95 ms, which is significantly lower than the threshold of 200 ms. This parameter confirms that the developed system can be integrated with the V2X infrastructure for prompt warning transmission to road users.
The developed acoustic monitoring system complies with the precision and response rate criteria of urban safety problems and builds a technological base for subsequent development as part of the smart city concept. Its development prospects are linked to the solution of several research and development problems.
The first one involves making the system more intelligent through the integration of machine learning algorithms. The goals may include the development and training of a classifier to determine the types of emergency acoustic events (e.g., collision, brake squealing, sirens). The implementation of this function should help reduce the number of false detections and improve the quality of warnings in V2X systems.
The second stipulates the network scaling of the architecture. Creating a distributed network of synchronized acoustic sensors to cover large traffic intersections is deemed a relevant problem. This shall require data coordination and aggregation protocols for neighboring nodes, thus facilitating the shift from point-wise localization to tracking sound sources in space. The system’s architecture, based on standardized hardware and geometrically scalable algorithms, provides a clear pathway for deployment as a distributed network of sensors across multiple intersections within an urban transport network.
Extending this calibration effort, future work will include the development of adaptive detection thresholds to maintain system performance across the wide range of acoustic conditions inherent to urban traffic, from free-flowing periods to dense, congested states.
This calibration and threshold adaptation must be validated through testing in authentic, high-noise crossroads environments, supplementing the campus trials with data collected at operational intersections featuring intense, mixed traffic flows, construction activity, and other complex urban noise sources.
While this study validates the localization performance and latency of the core acoustic pipeline, the development and benchmarking of a robust sound event classifier for distinguishing emergency events (e.g., collisions, sirens) from ambient urban noise remains a key objective for future work to realize a fully autonomous detection system.
The experiments were conducted only with siren sound. However, real urban soundscapes typically involve a wide variety of acoustic sources. Future work will specifically involve creating and testing the system against a comprehensive database of real-world sounds, including collision impacts, various siren types, tire screeches, and pedestrian shouts, to evaluate and optimize its detection and classification performance across the full spectrum of critical acoustic events, and to assess the consistency of the GCC-PHAT localization performance across these diverse acoustic signatures.
Additionally, investigating the potential benefits and trade-offs of using directional microphones represents another research direction. While omnidirectional microphones provide uniform coverage, directional microphones could improve the signal-to-noise ratio for specific approach lanes and potentially simplify the array geometry, though at the cost of increased system complexity and the need for precise mechanical or electronic steering.
The third and the most complex one can be classified as sensor fusion. The reliability can be improved through the real-time combination of acoustic data and information flows from cameras, lidars, and radars. The combined processing of heterogeneous data shall help overcome the limitations typical of each sensor type and produce a universal interference-proof image of the road situation.
Fourth, the optimization of the microphone array geometry for specific urban environments warrants detailed investigation. A systematic analysis of the influence of inter-microphone spacing and array aperture on localization sensitivity and distance estimation accuracy will provide practical guidelines for deploying the system at intersections with varying sizes and layouts.
Additionally, field calibration across diverse urban morphologies is required to precisely quantify the system’s effective range and performance limits under varying noise conditions, from dense city centers to suburban intersections.
Thus, the presented solution opens extensive opportunities for research. Its further development in line with the prospects discussed should help create a high-level, economically efficient, and technologically justified infrastructure aiming to significantly improve safety at urban crossroads in the context of the evolving smart city concept and V2X communications.