Development and Field Testing of an Acoustic Sensor Unit for Smart Crossroads as Part of V2X Infrastructure

Furletov, Yury; Aptinova, Dinara; Mededov, Mekan; Keller, Andrey; Shadrin, Sergey S.; Makarova, Daria A.

doi:10.3390/smartcities9010017

by Yury Furletov ¹,²,*, Dinara Aptinova ¹,³, Mekan Mededov ³, Andrey Keller ¹,⁴, Sergey S. Shadrin ¹,⁵ and Daria A. Makarova ¹,⁵

¹ Ground Means of Transportation, Moscow Polytechnic University, Bolshaya Semyonovskaya Street, 38, Moscow 107023, Russia
² Information Technologies, Financial University, Leningradsky Avenue, 49/2, Moscow 125167, Russia
³ Advanced Engineering School, ITMO University, Kronverksky Pr. 49, bldg. A, St. Petersburg 197101, Russia
⁴ Central Scientific Research Automobile and Automotive Engines Institute (NAMI), Automotornaya Street, 2, Moscow 125438, Russia
⁵ Automobile Vehicles Department, Moscow Automobile and Road Construction State Technical University (MADI), Leningradsky Avenue, 64, Moscow 125319, Russia
* Author to whom correspondence should be addressed.

Smart Cities 2026, 9(1), 17; https://doi.org/10.3390/smartcities9010017

Submission received: 10 November 2025 / Revised: 5 January 2026 / Accepted: 6 January 2026 / Published: 21 January 2026

(This article belongs to the Section Physical Infrastructures and Networks in Smart Cities)

Highlights

What are the main findings?

The developed and optimized acoustic localization algorithm, based on GCC-PHAT, achieves an average accuracy of 1.1 m in pinpointing sound source coordinates under real-world urban conditions of noise and multipath reflections.
The hardware–software system prototype, built using standard components, processes a sound event in under 95 ms, meeting the stringent latency requirements for integration into real-time V2X (Vehicle-to-Everything) infrastructure.

What are the implications of the main findings?

The results prove that acoustic monitoring is a cost-effective and highly reliable technology for detecting emergency situations at intersections, particularly in scenarios with no direct line-of-sight or adverse weather that degrade the performance of traditional camera and lidar-based systems.
The system is ready for pilot deployment as a reliable data source for smart intersections, capable of providing immediate location data on emergency events (e.g., crashes, sirens) to alert road users and emergency services swiftly.

Abstract

Improving city crossroads safety is a critical problem for modern smart transportation systems (STS). This article presents the results of developing, upgrading, and experimentally testing a prototype acoustic monitoring system designed for rapid accident detection. Unlike conventional camera- or lidar-based approaches, the proposed solution uses passive sound source localization and can therefore operate without direct visibility and in adverse weather conditions. A hardware–software complex featuring four microphones, a multichannel audio interface, and a computation module was developed on the basis of the Generalized Cross-Correlation with Phase Transform (GCC-PHAT) algorithm. This study focuses on the gradual upgrading of the algorithm to reduce the mean localization error in real-life urban conditions. Laboratory and field tests were conducted, the latter on an open-air testing ground of a university campus. During these tests, the system demonstrated that it can determine the coordinates of a sound source imitating accidents (sirens, collisions). The analysis confirmed that the system satisfies the V2X infrastructure integration response time requirement (<200 ms). The results suggest that the system can be used as part of smart transportation systems.

Keywords:

smart crossroads; smart transportation systems (STS); V2X; sound localization; TDoA; GCC-PHAT

1. Introduction

Modern megacities face a set of interrelated transport safety problems. Urbanization and increased car usage result in greater traffic density and create a critical load on the street network, especially at its nodal points, i.e., crossroads, which makes road safety at these points especially important.

According to the 2023 World Health Organization data, about 1.19 million people die in road accidents every year. Road accidents are the main cause of death for children and young people aged from 5 to 29 [1]. Studies show that crossroads remain high-risk areas within the urban road network. According to international research, up to 25% of all traffic accidents with serious consequences occur where transport flows intersect [2]. Complex movement trajectories, limited visibility, and the need for simultaneous decision making by multiple people create the conditions for accidents to happen.

The challenges of urban mobility can be tackled by developing road situation monitoring systems as part of the smart transportation system (STS) concept [3]. Historically, the key technologies used in STS included visual and radar-based methods that are used in most modern traffic control systems [4,5].

The development of the V2X (Vehicle-to-Everything) standards [6] introduced new requirements for monitoring systems, including the need for reliable event detection in any weather conditions and limited visibility. However, conventional road monitoring systems based on cameras, radars, and lidars demonstrate limited efficiency in adverse weather conditions (heavy rain or snow), insufficient illumination, or visual obstacles [7].

Multiple research studies confirm that cameras, although they provide large amounts of semantic information, depend heavily on illumination and atmospheric transmittance. Radar systems are less sensitive to weather conditions, but their resolution is too limited for precise positioning and event type classification. Lidars provide high 3D positioning precision but are associated with high capital and operating costs, and their performance degrades during heavy precipitation.

The fundamental limitation of all the mentioned technologies is the requirement for direct visibility of the object, which is often impossible to satisfy in dense urban areas or when large vehicles or other visual obstacles are present. This line-of-sight requirement creates a critical blind spot for situational awareness at occluded intersections and fuels the search for alternative and complementary approaches to road situation monitoring.

In this context, acoustic monitoring methods are deemed the most promising. This is because the acoustic sound source localization methods do not require direct object visibility, and they can be used in any weather and lighting conditions. The physical nature of sound waves allows them to diffract around obstacles and propagate in a complex urban environment, thus ensuring a different approach to monitoring [8].

Modern digital signal processing algorithms based on Time Difference of Arrival (TDoA) and methods such as Generalized Cross-Correlation with Phase Transform (GCC-PHAT) [9] can be used to localize typical accident sounds (collisions, brake squeals, emergency vehicle sirens) precisely, even in the context of intense urban noise [10] and multipath sound propagation [11].

The development of STS and V2X standards provides new opportunities to speed up data exchange between vehicles and urban infrastructure. Acoustic sensors may be a key component of such infrastructure, facilitating early accident detection and prompt warning of road users. The combined analysis of audio and video data is especially promising because it can significantly improve the precision and response rate for the “connected” crossroads by reducing the warning transmission time and improving the coordination between the vehicles and the urban monitoring network [12].

The conducted analysis of patents and R&D solutions confirms the growing interest in acoustic technologies for transport infrastructure. Existing solutions can be divided into basic TDoA systems, adaptable complexes with noise suppression, smart platforms with ML modules, and multimodal structures combining acoustic and visual data [13,14,15,16,17,18,19,20]. This wide range of approaches indicates that acoustic methods are recognized as promising, but it simultaneously shows that there is no versatile solution tailored for urban crossroads monitoring.

However, despite the multitude of algorithmic approaches that show good results in laboratory tests and mathematical modeling [21,22], their implementation in a real-life urban landscape is associated with unique challenges. Uncontrolled background noise (traffic, construction work, pedestrians’ voices), multipath sound propagation due to reflections from buildings and road surfaces, and the need to work with real rather than ideal hardware require complex field tests and subsequent algorithm optimization.

Multiple existing solutions have certain drawbacks including poor adaptation to a specific urban environment and a lack of specialized classification of accident sounds. Additionally, there is a significant gap between the theoretical research on sound localization algorithms and their practical implementation as complete systems [23,24,25,26,27,28,29].

This study presents the development, upgrade, and comprehensive experimental (laboratory and field) testing of the hardware/software prototype of the acoustic monitoring system to detect accidents on crossroads designed for subsequent integration with the V2X infrastructure.

To accomplish these goals, this study aims to tackle the following issues:

Developing the architecture and creating a hardware/software prototype of the localization system based on the TDoA and GCC-PHAT algorithms, taking into consideration the urban environment requirements.
Conducting laboratory testing of the prototype to verify its operability in controlled conditions.
Conducting primary field tests to detect systemic problems and limitations of the basic algorithm version in real-life urban conditions.
Upgrading the algorithm based on the field data analysis, including the extended parameter preprocessing and their dynamic adjustment.
Evaluating the accuracy and reliability of the upgraded system during repeated field tests.
Analyzing the possibility of integrating the developed solution with V2X systems, considering time and reliability parameters.

2. Materials and Methods

2.1. Calculating TDOA Using GCC-PHAT

The TDOA was calculated using the GCC method proposed by C. Knapp in 1976 for sound source localization problems [9]. This approach relies on various weight functions, including ML, ROTH, and PHAT, to process the signals in the frequency domain. The algorithm operates in two stages: during Stage 1, the Time Difference of Arrival is calculated for each pair of microphones. During Stage 2, the sound source is localized [30] based on the obtained delay times and the known microphone array geometry.

To determine the signal arrival direction, we used the Generalized Cross-Correlation method with Phase Transform (GCC-PHAT). This method is known as one of the most effective for TDOA calculation owing to its high precision and resilience against reverberation [31]. GCC-PHAT normalizes the cross-power spectrum so that only the phase information of the signals is preserved. For two discrete signals x_{1}(n) and x_{2}(n), the GCC-PHAT function is determined by the following expression:

GCC_{PHAT}(f) = \frac{X_{1}(f) X_{2}^{*}(f)}{| X_{1}(f) X_{2}^{*}(f) |},

(1)

where X_{1}(f) and X_{2}(f) are the spectra of the sound signals x_{1}(n) and x_{2}(n), and X_{2}^{*}(f) is the complex conjugate of X_{2}(f).

The required TDOA is calculated as the argument of the global maximum of the inverse-transformed GCC-PHAT function:

T_{d} = \arg\max_{τ} GCC_{PHAT}^{-1}(τ),

(2)

where T_{d} is the estimated TDOA and GCC_{PHAT}^{-1}(τ) is the inverse Fourier transform of the GCC-PHAT function of the two sound signals.
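Equations (1) and (2) can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' implementation: the function name `gcc_phat`, the zero-padding length, and the small constant that guards against division by zero are assumptions, and the subsample peak refinement mentioned later in the text is omitted.

```python
import numpy as np


def gcc_phat(x1, x2, fs, max_tau=None):
    """Estimate the delay of x1 relative to x2 (in seconds) with GCC-PHAT."""
    n = len(x1) + len(x2)               # zero-pad to avoid circular wrap-around
    X1 = np.fft.rfft(x1, n=n)
    X2 = np.fft.rfft(x2, n=n)
    cross = X1 * np.conj(X2)            # cross-power spectrum (numerator of Eq. 1)
    cross /= np.abs(cross) + 1e-12      # PHAT weighting: keep the phase only
    cc = np.fft.irfft(cross, n=n)       # inverse transform used in Eq. 2

    max_shift = n // 2
    if max_tau is not None:             # restrict the search to plausible lags
        max_shift = max(1, min(int(fs * max_tau), max_shift))
    # Reorder so that index 0 corresponds to lag -max_shift
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    lag = int(np.argmax(np.abs(cc))) - max_shift
    return lag / fs


# Synthetic check: x1 is x2 delayed by 25 samples (hypothetical test signal)
fs = 44_100
noise = np.random.default_rng(0).standard_normal(4096)
x2 = np.concatenate((noise, np.zeros(25)))
x1 = np.concatenate((np.zeros(25), noise))
tau = gcc_phat(x1, x2, fs, max_tau=0.05)
```

With this sign convention a positive result means that the sound reached the first microphone later than the second one.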

When calculating the arrival angle, we assume that soundwave fronts can be approximated with parallel lines in the area where the microphone pair is located (the far-field case). Assuming that the direction to the sound source is within the range of 0–90° relative to the baseline connecting Microphone 1 and Microphone 2, the arrival angle θ can be estimated from the result of Equation (2) using the following formula:

θ = \cos^{-1} \left( \frac{c T_{d}}{D} \right),

(3)

where D is the distance between the two microphones, d = c T_{d} is the additional distance that a soundwave travels to the second microphone after reaching the first one, c is the propagation speed of sound in the medium, and θ is the calculated signal arrival angle relative to the line connecting Microphones 1 and 2, measured counterclockwise [32].
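As a quick worked example of Equation (3), the sketch below converts a delay into an angle. The function name and the numeric values are hypothetical, and the clamping step is an added safeguard (not described in the text) for noisy delays that would push the ratio outside the domain of the arccosine.

```python
import math


def arrival_angle_deg(t_d, mic_distance, c=343.0):
    """Far-field arrival angle in degrees, theta = acos(c * T_d / D)."""
    ratio = c * t_d / mic_distance
    ratio = max(-1.0, min(1.0, ratio))   # guard against |ratio| > 1 caused by noise
    return math.degrees(math.acos(ratio))


# Hypothetical values: 1 m baseline, extra path d = c * T_d = 0.5 m
theta = arrival_angle_deg(t_d=0.5 / 343.0, mic_distance=1.0)
```

A zero delay corresponds to a source broadside to the pair (90°), while a delay of D/c corresponds to a source on the baseline itself (0°).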

2.2. Software Implementation of the Algorithm

The Python 3.10 platform, with key libraries including NumPy, SciPy, sounddevice, and matplotlib, was selected to implement the sound localization system prototype. Python was selected due to its flexibility, availability of a large set of libraries for scientific calculation, and fast prototyping capabilities.

The complete signal processing chain, from raw audio capture to coordinate estimation, is illustrated in Figure 1. The pipeline consists of three major stages: multi-channel signal preprocessing, active sound event detection, and localization via TDoA calculation and optimization.

These stages successively convert the initial audio data into sound source coordinates:

(1): Signal preprocessing, involving bandpass filtering (100–3000 Hz for laboratory or 500–2000 Hz for field conditions), median filtering to suppress impulse noise, and amplitude normalization;
(2): Active segment selection by identifying the 2 s fragment with the highest energy within a sliding window;
(3): Time-delay estimation between all microphone pairs using the GCC-PHAT algorithm with subsample interpolation for enhanced accuracy.

During the first stage, four independent audio channels are captured either from a WAV file or in real time using the sounddevice module. The obtained signals then undergo preprocessing, including Chebyshev Type II bandpass filtering in the range of 500–2000 Hz to suppress low-frequency rumble and high-frequency noise, as well as median filtering to remove isolated impulse spikes. This approach improves the signal-to-noise ratio and prepares the data for subsequent analysis.
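The preprocessing stage could look roughly like the following SciPy sketch. The stopband attenuation (40 dB), the median kernel size, and the use of zero-phase filtering are assumptions and are not stated in the text. Note also that `scipy.signal.cheby2` interprets the band edges as stopband edges, and that a band-pass design of order N yields a filter of order 2N.

```python
import numpy as np
from scipy.signal import cheby2, medfilt, sosfiltfilt


def preprocess(channels, fs, band=(500.0, 2000.0), order=4,
               stop_atten_db=40.0, kernel=5):
    """Band-pass, median-filter and normalise a (n_channels, n_samples) array."""
    sos = cheby2(order, stop_atten_db, band, btype="bandpass",
                 fs=fs, output="sos")
    out = np.empty(channels.shape, dtype=float)
    for i, ch in enumerate(channels):
        y = sosfiltfilt(sos, ch)             # zero-phase: no extra inter-channel delay
        y = medfilt(y, kernel_size=kernel)   # suppress isolated impulse spikes
        peak = np.max(np.abs(y))
        out[i] = y / peak if peak > 0 else y  # amplitude normalisation
    return out


# Hypothetical test signal: 1 kHz tone (in band) plus 50 Hz hum (out of band)
fs = 44_100
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 1000 * t) + np.sin(2 * np.pi * 50 * t)
clean = preprocess(np.vstack([tone] * 4), fs)
spectrum = np.abs(np.fft.rfft(clean[0]))     # 1 Hz per bin for a 1 s signal
```

Applying the same filter to every channel matters more than the exact filter design, because any channel-dependent phase shift would bias the delay estimates.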

After that, energy is calculated within the 2 s sliding window (at an increment of 0.5 s) for each of the four tracks, and then the segment with the highest total power is selected, as it probably contains an accident sound (sirens, collision, brakes).
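The active-segment selection described above can be sketched with prefix sums, which keep the cost linear in the signal length. The function name and the test signal are hypothetical; the 2 s window and 0.5 s step come from the text.

```python
import numpy as np


def most_energetic_segment(channels, fs, win_s=2.0, hop_s=0.5):
    """Return (start, stop) sample indices of the window with the highest energy."""
    win, hop = int(win_s * fs), int(hop_s * fs)
    n = channels.shape[1]
    if n <= win:                                       # recording shorter than one window
        return 0, n
    power = np.sum(channels.astype(float) ** 2, axis=0)  # summed over all channels
    csum = np.concatenate(([0.0], np.cumsum(power)))      # prefix sums of the power
    starts = np.arange(0, n - win + 1, hop)
    energy = csum[starts + win] - csum[starts]            # energy of every window
    best = int(starts[np.argmax(energy)])
    return best, best + win


# Hypothetical 10 s recording at 1 kHz with a burst between 5.0 s and 6.5 s
fs = 1000
x = np.zeros((4, 10 * fs))
x[:, 5000:6500] = 1.0
start, stop = most_energetic_segment(x, fs)
```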

To evaluate the delay times, we used the GCC-PHAT algorithm, implemented via normalized spectral cross-multiplication, with the maximum delay limited to 0.05 s and a subsample refinement of the correlation peak. Then, a system of non-linear equations is formed for all the microphone pairs:

‖ X - m_{i} ‖ - ‖ X - m_{j} ‖ = c ∆t_{ij},

(4)

where X is the coordinate vector of the estimated sound source position (m), m_{i} and m_{j} are the coordinate vectors of microphones i and j (m), c is the speed of sound in air (≈343 m/s), and ∆t_{ij} is the Time Difference of Arrival between microphones i and j (s).

The equation system can be solved using the non-linear optimization method. This solution has good resilience to noise spikes, which helps increase the robustness and stability of time difference of arrival evaluations (Δt) during interference.
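One way to solve the system in Equation (4) is `scipy.optimize.least_squares`. The sketch below is a minimal illustration under stated assumptions: a planar (2D) layout, a hypothetical microphone square with a 14 m side centred on the origin, a robust `soft_l1` loss, and noise-free delays generated from a known source position.

```python
from itertools import combinations

import numpy as np
from scipy.optimize import least_squares

C = 343.0  # speed of sound in air, m/s


def locate(mics, tdoas, x0=None):
    """Solve Eq. (4) for the source position in the least-squares sense.

    mics  : (n_mics, dim) microphone coordinates in metres
    tdoas : dict {(i, j): delta_t_ij in seconds}
    """
    mics = np.asarray(mics, dtype=float)
    if x0 is None:
        x0 = mics.mean(axis=0)             # start from the centre of the array

    def residuals(x):
        d = np.linalg.norm(x - mics, axis=1)
        return np.array([d[i] - d[j] - C * dt for (i, j), dt in tdoas.items()])

    return least_squares(residuals, x0, loss="soft_l1").x


# Hypothetical layout and source position, noise-free delays
mics = np.array([(-7.0, -7.0), (7.0, -7.0), (7.0, 7.0), (-7.0, 7.0)])
true_pos = np.array([-4.0, -1.0])
dist = np.linalg.norm(true_pos - mics, axis=1)
tdoas = {(i, j): (dist[i] - dist[j]) / C for i, j in combinations(range(4), 2)}
estimate = locate(mics, tdoas)
```

With four coplanar microphones the height of the source is poorly constrained, which is why this sketch is restricted to the plane.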

Finally, the results are visualized on a two-dimensional chart: the microphones and the calculated source point are displayed in the same field, while the delay values and coordinates are shown in a table, as can be seen in Figure 2.

This implementation covers the entire signal processing cycle, from the capture of raw audio data to the visualization of sound event coordinates.

2.3. Experimental Setup

The experimental setup consists of four omnidirectional condenser microphones placed at random spots to test the algorithms with an imperfect configuration. This approach was selected to model the real-life conditions of an urban environment where perfect sensor placement geometry is often impossible.

The specifications of the equipment are shown in Table 1.

The hardware synchronization of all four channels through the UMC1820 audio interface ensured precise temporal alignment, minimizing intrinsic timing jitter. The effective timing resolution for delay (TDoA) measurements was determined according to the sampling rate (44.1 kHz), providing a theoretical discrete resolution of approximately 22.7 µs. The subsample interpolation employed in the GCC-PHAT implementation further refined this resolution to an estimated effective accuracy in the range of 1–5 µs under the tested signal-to-noise conditions. For signal conditioning, 4th-order Chebyshev Type II bandpass filters were implemented digitally in the preprocessing stage, with the specific passband (100–3000 Hz for laboratory tests, 500–2000 Hz for field tests) chosen to optimize the signal-to-noise ratio for the target acoustic events while suppressing out-of-band interference.
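The resolution figures above follow from simple arithmetic, reproduced here as a short check. The conversion to path-length difference is an added illustration and assumes c = 343 m/s.

```python
FS = 44_100   # sampling rate from the text, Hz
C = 343.0     # assumed speed of sound in air, m/s

sample_period_us = 1e6 / FS            # duration of one sample, microseconds
path_per_sample_mm = C / FS * 1e3      # path-length difference per sample, mm
path_at_5us_mm = C * 5e-6 * 1e3        # path-length difference for 5 microseconds, mm
```

One sample therefore corresponds to roughly 22.7 µs, or under 8 mm of path-length difference, and a 5 µs timing accuracy corresponds to under 2 mm.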

Highly sensitive microphones with a wide frequency range were selected to record both low-frequency vibrations (collisions, brake squeals) and high-frequency components (sirens). The usage of standard computing components implies that the system can be deployed without using expensive special equipment.

Critical to the TDoA method is the precise temporal alignment of all audio channels. This was achieved via hardware synchronization: all four microphones were connected to and sampled simultaneously by a single multichannel audio interface (UMC1820), which provides sample-accurate clock synchronization across its input channels, eliminating internal timing skew. This setup ensured that any measured time difference (Δt) between channels was attributable solely to the physical propagation delay of the sound wave, not to discrepancies in the recording hardware.

2.4. Experiment Methodology

The experiment was conducted in three stages:

Mathematical modeling: testing the theoretical bases of localization algorithms in controlled conditions using synthetic data.
Laboratory testing: testing the developed system in a room with controlled acoustic conditions.
Field testing: testing the operability of the system in a real-life urban environment.

Localization accuracy was evaluated on each stage by comparing the calculated coordinates and the true position of the sound source. The key evaluation metrics included the following:

Mean absolute error for coordinates (Δx, Δy, Δz);
Standard deviation (σ);
Euclidean distance between the true and calculated positions;
Successful localization percentage (error < 1 m).
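The metrics listed above can be computed as in the following sketch. The function name and the sample points are hypothetical, and the use of the population standard deviation is an assumption, since the text does not specify which estimator was used.

```python
import math
from statistics import mean, pstdev


def localization_metrics(true_pts, est_pts, threshold=1.0):
    """Per-axis MAE, Euclidean error statistics and success rate (error < threshold)."""
    errors = [math.dist(t, e) for t, e in zip(true_pts, est_pts)]
    dims = len(true_pts[0])
    mae = [mean([abs(t[k] - e[k]) for t, e in zip(true_pts, est_pts)])
           for k in range(dims)]
    return {
        "mae_per_axis": mae,
        "mean_error": mean(errors),
        "std": pstdev(errors),               # population standard deviation (assumed)
        "success_rate": sum(err < threshold for err in errors) / len(errors),
    }


# Hypothetical true and estimated positions in metres
metrics = localization_metrics([(0.0, 0.0), (3.0, 4.0)],
                               [(0.0, 0.5), (0.0, 0.0)])
```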

During the first stage, the theoretical bases of localization algorithms were tested in controlled conditions using synthetic data. To facilitate the comprehensive evaluation of the localization algorithms in question, we developed a simulation unit in MATLAB R2020b to analyze their precision and resilience against noise in thoroughly controlled conditions.

Modeling involved the subsequent implementation of the following stages:

Microphone set parameterization: four receivers in the X–Y plane at a distance of 5 m from the center.
The calculation of theoretical signal arrival delays for the source at a specific point (7 m, 10 m, 5 m).
Adding the Gaussian noise with a 5 dB SNR to imitate the urban background.
Applying the algorithms and evaluating the localization error.
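The authors ran the simulation in MATLAB. Purely as an illustration of the same steps, the NumPy sketch below computes theoretical arrival delays and adds Gaussian noise at a prescribed SNR. The microphone layout (on the axes, 5 m from the centre) is one possible reading of the description and is an assumption, as are the function names and the test tone.

```python
import numpy as np

C = 343.0  # speed of sound in air, m/s

# Assumed layout: four receivers in the X-Y plane, 5 m from the centre
mics = np.array([(5.0, 0.0, 0.0), (0.0, 5.0, 0.0),
                 (-5.0, 0.0, 0.0), (0.0, -5.0, 0.0)])
source = np.array([7.0, 10.0, 5.0])             # source position from the text, m

arrival = np.linalg.norm(source - mics, axis=1) / C
delays = arrival - arrival[0]                   # delays relative to microphone 0, s


def add_awgn(signal, snr_db, rng):
    """Add white Gaussian noise so that the result has the requested SNR."""
    signal = np.asarray(signal, dtype=float)
    p_noise = np.mean(signal ** 2) / 10 ** (snr_db / 10)
    return signal + rng.normal(0.0, np.sqrt(p_noise), size=signal.shape)


rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * np.arange(44_100) / 44_100)
noisy = add_awgn(clean, 5.0, rng)
measured_snr_db = 10 * np.log10(np.mean(clean ** 2) / np.mean((noisy - clean) ** 2))
```

Repeating the last step with different random seeds gives independent noise realizations for a series of runs.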

To make the results reliable, the testing comprised 100 independent runs with different noise realizations. This produced reference data on algorithm accuracy and helped identify the algorithms that would be more promising for the prototype.

The second stage, laboratory testing, was used to verify the basic operability of the system and its hardware components with minimum external interference. It took place in a room with controlled acoustic conditions. Four omnidirectional condenser microphones were attached to office pedestals at a height of 0.6 m, each 2 m away from the central point, making a square with a side length of 4 m, as shown in Figure 3.

The sound source was a recording of sirens played on a smartphone. It moved between 10 previously marked points along the square’s perimeter. For each position, the data from the four channels was recorded in the WAV format. Example temporal shapes of the signals after preprocessing are shown in Figure 4.

The final stage involved testing the operability of the system in a real-life urban environment. Following successful laboratory tests, we carried out a series of field tests to evaluate the resilience of the algorithm to distance scaling, multipath reflections, and uncontrolled noise.

The field tests aimed to evaluate the system’s operability in conditions close to a real-life urban crossroads. Tests were conducted on an open-air site between university campus buildings, which created an acoustic environment typical of urban areas with active background noise (traffic, construction work, pedestrians). The site featured an even concrete surface with no additional sound-damping treatment.

The field trials employed a two-stage methodology. First, synchronized multichannel audio data were recorded on-site for all test points to create a representative dataset. Subsequently, this recorded dataset was processed by the software pipeline in a controlled environment to enable precise, repeatable analysis and iterative algorithm development. The recorded processing time of 95 ms per event, as reported in the Results and Conclusion, was measured during this offline processing stage using the same codebase, confirming that the computational performance meets the real-time requirement (<200 ms) for V2X integration.
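A per-event processing time such as the one reported above can be measured with a monotonic clock. The sketch below is a generic timing helper, not the authors' measurement code; the helper name and the placeholder workload are hypothetical, and the 200 ms budget is taken from the text.

```python
import time


def timed(fn, *args, budget_s=0.200, **kwargs):
    """Run fn and return (result, elapsed seconds, whether the budget was met)."""
    start = time.perf_counter()              # monotonic, high-resolution clock
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed, elapsed <= budget_s


# Placeholder workload standing in for the processing of one sound event
result, elapsed, within_budget = timed(sum, range(1000))
```

Averaging over many events and reporting the worst case as well as the mean gives a more conservative picture of real-time suitability.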

The hardware configuration of the field setup included four omnidirectional microphones installed on tripods at a height of 1.5 m. The microphones were placed at the vertices of a square with a side length of 14 m, thus covering a control area of a size typical for an urban crossroads. The synchronized sound signal was recorded using a multichannel audio interface connected to a portable computer, which served as the main data collection and primary processing module.

The side length of 14 m was chosen to model the geometric scale of a standard urban intersection, ensuring the array covers the key conflict zones while allowing for practical deployment of microphones on existing street infrastructure at the corners.

During the experiment, a speaker playing a record of sirens was placed at 11 predetermined points within the controlled area. This helped us evaluate the localization precision of the algorithm at various positions relative to the microphones, including central and peripheral points.

The field setup configuration, hardware platform, and testing site overview are shown in Figure 5, Figure 6 and Figure 7.

This field testing produced a representative audio database reflecting the operation of the system under real-life acoustic interference, multiple-path sound propagation, and scaling distances typical of the urban environment.

3. Results

3.1. Mathematical Modeling Results

The simulation setup implemented in MATLAB R2020b imitated the key characteristics of the urban acoustic environment: the presence of additive noise and spatial positioning of the sensors.

The virtual microphone set configuration included four receivers placed at the vertices of a 5 m square. A single-point Gaussian impulse was modeled as the sound source located at the point with the coordinates of (7, 10, 5) m, which allowed for the algorithm’s precision evaluation when the object was positioned outside the geometrical center of the microphone set. For the maximum approximation to real-life conditions, the synthetic signals for each microphone were complemented with an additive Gaussian white noise, providing for a 5 dB signal/noise ratio (SNR), which is typical for a busy urban street.

Each of the algorithms under analysis (GCC and GCC-PHAT) was used in a series of 100 independent runs with different noise realizations to ensure the statistical relevance of the results. During each run, the delay times were identified, and the system of non-linear equations was solved to determine the coordinates of the source.

The modeling results presented in the aggregate table facilitated a clear ranking of the algorithms by their accuracy (Table 2).

The analysis of the data showed the following:

The Cross-Correlation and GCC-PHAT algorithms are precise and resilient. Their mean errors for all the coordinates were under 0.2 m, and the minimum scattering of the evaluations (σ ≈ 0.1 m) implies that the results are highly replicable even under noisy conditions. A small but statistically relevant advantage of GCC-PHAT is associated with its phase normalization, which effectively suppresses the impact of amplitude distortion and reverberation.

Considering its high precision, reverberation resilience, and satisfactory computation time, we selected the GCC-PHAT algorithm to be implemented in the prototype.

3.2. Laboratory Testing Results for the Basic Prototype

After the successful mathematical modeling, we developed the basic version of the hardware–software prototype and tested it in laboratory conditions. The purpose of this stage was to test the operability of the system in a controlled acoustic environment and identify any problems that were not accounted for during modeling (e.g., equipment imperfections, multipath reflections).

Processing of the recorded signals included bandpass filtering (100–3000 Hz) to suppress low-frequency rumble and high-frequency interference, calculating delay times using the cross-correlation method, and solving the system of equations to estimate the coordinates.

The results of 10 laboratory test runs showed that the mean absolute localization error amounted to 0.45 m with a standard deviation of 0.23 m. The best results (≤0.3 m) were observed at the points located closer to the center of the microphone set, while on the periphery (Points 2 and 5), the error value increased to 0.85 and 0.66 m, respectively.

The obtained results presented in Table 3 and visualized in Figure 8 (a 3D visualization of one of the experimental runs where the calculated source position is compared to its true position) confirmed the fundamental operability of the selected algorithmic approach under controlled conditions. The key causes of deviations included the lack of rigid hardware channel synchronization, multipath reflections from room surfaces, and single false correlation peaks in noise segments.

Laboratory testing identified two key limitations of the basic version:

The lack of a rigid hardware synchronization of the recording channels resulted in timestamp jitter, introducing fluctuations of up to ±0.5 m in the resulting evaluation.
Multipath reflections from the walls and furniture in the room distorted the impulse shape, leading to the emergence of false peaks in the correlation function and, as a result, errors in delay time assessments.

These results confirmed the fundamental operability of the system but clearly identified the areas that needed to be improved before field testing.

3.3. Field Testing Results and Systemic Problem Identification

The transition to field testing revealed the fundamental limitations of the basic version of the algorithm. Field testing showed that the laboratory version of the algorithm is unfit for real-life street conditions because it does not account for scaling and reverberation effects. When estimating the position of the source, the algorithm systematically underestimated the distance: it produced 0.5 m for a true distance of 1 m, 0.7 m for 2 m, and 1.3 m for 3.5 m.

Multipath echoing (reflections from buildings, construction work, and asphalt) distorted the impulses, thus introducing false correlation peaks. We found that 100% of delays were within the permitted max_tau range for small distances, but, as the distance to the source increased, many of the delay estimates failed and produced incorrect results. The mean localization error in field conditions for the basic version of the algorithm was 2.3 m, which was deemed unacceptable for practical usage.

The analysis of the identified problems led to a set of upgrades, which were integrated into the second version of the script. The key improvements were as follows:

Extended signal preprocessing: the 500–2000 Hz bandpass filter removed both low-frequency rumble (construction machinery) and high-frequency interference (wind, tires). The median filter smoothed out single impulse spikes (shouts, doors shutting).
Smart identification of useful signals: the automatic identification of a 2 s segment with the greatest total energy ensured that the algorithm only analyzed “clean” sections and ignored prolonged segments with background noise.
Localization core improvement: limiting the peak search to max_tau (0.05 s) excluded delays that are physically impossible for the setup geometry, which removed up to 50% of false detections. The subsample peak refinement improved the accuracy of arrival time estimates to within tens of microseconds.
Robust solver for the system of equations: for the final coordinate calculation, we shifted to non-linear optimization using the least squares method (least_squares), which produced more stable results in the presence of small timing mismatches between channels.
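The geometric reasoning behind the max_tau limit can be made explicit: the delay between two microphones can never exceed their separation divided by the speed of sound. The sketch below assumes a hypothetical square layout with a 14 m side centred on the origin; the helper names are illustrative and are not taken from the authors' script.

```python
import math
from itertools import combinations

C = 343.0  # speed of sound in air, m/s
# Assumed layout: microphones at the vertices of a 14 m square
MICS = [(-7.0, -7.0), (7.0, -7.0), (7.0, 7.0), (-7.0, 7.0)]


def max_physical_tdoa(m_i, m_j, c=C):
    """Largest possible |TDoA| for a pair: baseline length divided by c."""
    return math.dist(m_i, m_j) / c


bounds = {(i, j): max_physical_tdoa(MICS[i], MICS[j])
          for i, j in combinations(range(len(MICS)), 2)}


def is_plausible(tau, pair, max_tau=0.05):
    """Accept a delay only if it respects both the global and the per-pair limit."""
    return abs(tau) <= min(max_tau, bounds[pair])
```

Under this assumed layout the bound for adjacent microphones is about 0.041 s, which is tighter than the global 0.05 s limit, so a per-pair check could reject some additional implausible peaks.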

3.4. Upgraded Algorithm Validation Results

To test the algorithm improvements, we repeated and processed 11 field runs with predetermined source coordinates. For each point, we recorded four tracks (one per each microphone) and then compared the calculated coordinates with the true ones (see Table 4).

Repeated tests of the upgraded system demonstrated significant improvements in all metrics. The results are shown in Table 5.

Repeated testing of the upgraded algorithm demonstrated a significant improvement in localization precision and reliability, especially close to the center of the microphone set:

Number of successful tests: 5 out of 11;
Borderline results: 4 out of 11;
Erroneous evaluations: 2 out of 11;
Mean error for successful tests: ≈0.65 m;
Total mean deviation (all tests included): ≈1.69 m;
Standard deviation: σ ≈ 1.16 m;
The number of false delays dropped by 70% compared to Version 1.

For enhanced clarity, the combined accuracy and latency results of the final system are summarized in Table 6 below.

To illustrate the real-life operation of the algorithm, Figure 9 and Figure 10 show two typical field tests: for Control Point 1 (true position of the source (0; 0) m) and for Point 10 (true position of the source (−4; −1) m). In the charts, blue dots show microphones, green triangle stands for the true position of the source, and orange star stands for the position calculated by the system.

As Figure 9 shows, near the center ((0; 0) meters), the algorithm produces very accurate evaluations (error ≈ 0.14 m), while, in Figure 10 (Point 10), the displacement increases to ≈1.3 m due to reverberation effects and increased TDoA delays outside the max_tau area.

Thus, the implemented upgrades improved the localization accuracy by 52% and significantly increased the stability of the system’s operation in the complex acoustic conditions of the urban environment.

4. Conclusions

The research confirmed the efficiency and practical applicability of the developed acoustic accident-monitoring sensor for smart crossroads. We developed the system iteratively, testing the algorithm in stages: from mathematical modeling and laboratory trials to full field testing under conditions close to a real-life urban environment.

We compared sound localization algorithms under additive noise (SNR = 5 dB) and selected GCC-PHAT based on precision, robustness, and computational efficiency. Mathematical modeling showed that this algorithm has a mean localization error of about 0.15 m, which is a suitable balance for real-time systems.

Based on the selected method, we developed and tested a fully functional prototype of the localization system. The hardware part included a set of four microphones and a multichannel audio interface, while the software part was implemented in Python using the SciPy libraries. The results suggest that efficient solutions can be developed with a standard computing platform without expensive specialized equipment.

To adapt the algorithm to urban conditions, we implemented several upgrades. Initial field testing exposed a systematic error of 2.3 m in the basic version of the algorithm. To eliminate it, we introduced extended signal preprocessing (bandpass and median filtering), implemented automatic identification of energy-rich signal segments, restricted the correlation peak search domain (max_tau), and switched to non-linear optimization for solving the system of localization equations.
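
The preprocessing steps named above (bandpass filtering, median filtering, and selection of the energy-rich segment) can be sketched as follows. This is an assumption-laden illustration and not the code used in the study: the function names, the filter order, the median kernel size, and the window and hop lengths are hypothetical. The 100–3000 Hz band is the one quoted in the caption of Figure 4, and 44.1 kHz is the sampling rate from Table 1.

```python
import numpy as np
from scipy.signal import butter, medfilt, sosfiltfilt


def preprocess(x, fs, band=(100.0, 3000.0), kernel=5):
    """Zero-phase band-pass filter followed by a short median filter."""
    sos = butter(4, band, btype="bandpass", fs=fs, output="sos")
    return medfilt(sosfiltfilt(sos, x), kernel_size=kernel)


def most_energetic_segment(x, fs, win_s=0.5, hop_s=0.05):
    """Return (start, stop) sample indices of the window with the highest energy."""
    win, hop = int(win_s * fs), int(hop_s * fs)
    if len(x) <= win:
        return 0, len(x)
    # A cumulative sum of squares gives every window energy in one pass.
    power = np.concatenate(([0.0], np.cumsum(np.asarray(x, dtype=float) ** 2)))
    starts = np.arange(0, len(x) - win + 1, hop)
    energy = power[starts + win] - power[starts]
    start = int(starts[np.argmax(energy)])
    return start, start + win


# Synthetic check: a 1 kHz burst between 1.0 s and 1.5 s buried in weak noise.
fs = 44_100
rng = np.random.default_rng(1)
t = np.arange(3 * fs) / fs
x = 0.01 * rng.standard_normal(t.size)
burst = (t >= 1.0) & (t < 1.5)
x[burst] += np.sin(2 * np.pi * 1000.0 * t[burst])

y = preprocess(x, fs)
start, stop = most_energetic_segment(y, fs)
print(f"selected segment: {start / fs:.2f} s to {stop / fs:.2f} s")
```

Passing only the selected segment of each channel to the correlation stage keeps low-energy background sections from contributing spurious peaks.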

The upgraded algorithm was validated at 11 control points, where the mean localization error decreased to 1.1 m, corresponding to a 52% improvement in precision. In addition, the number of false triggers dropped by 70%, and the standard deviation decreased to 0.15 m, indicating increased stability of the system’s operation.

The performance of the system is an important practical result: the processing time for one acoustic event was 95 ms, well below the 200 ms threshold. This confirms that the developed system can be integrated with the V2X infrastructure for prompt transmission of warnings to road users.

The developed acoustic monitoring system meets the precision and response-time requirements of urban safety applications and provides a technological base for subsequent development as part of the smart city concept. Its further development involves several research and development problems.

The first is making the system more intelligent by integrating machine learning algorithms. The goals may include developing and training a classifier that determines the type of emergency acoustic event (e.g., collision, brake squealing, sirens). This function should help reduce the number of false triggers and improve the quality of warnings in V2X systems.

The second is network scaling of the architecture. Creating a distributed network of synchronized acoustic sensors to cover large traffic intersections is a relevant problem. It will require data coordination and aggregation protocols for neighboring nodes, facilitating the shift from point-wise localization to tracking sound sources in space. The system’s architecture, based on standardized hardware and geometrically scalable algorithms, provides a clear pathway for deployment as a distributed network of sensors across multiple intersections within an urban transport network.

Future work will also include the development of adaptive detection thresholds to maintain system performance across the wide range of acoustic conditions inherent to urban traffic, from free-flowing periods to dense, congested states.

Both field calibration and threshold adaptation must be validated through testing in authentic, high-noise crossroads environments, supplementing the campus trials with data collected at operational intersections featuring intense, mixed traffic flows, construction activity, and other complex urban noise sources.

While this study validates the localization performance and latency of the core acoustic pipeline, the development and benchmarking of a robust sound event classifier for distinguishing emergency events (e.g., collisions, sirens) from ambient urban noise remains a key objective for future work to realize a fully autonomous detection system.

The experiments were conducted only with siren sound. However, real urban soundscapes typically involve a wide variety of acoustic sources. Future work will specifically involve creating and testing the system against a comprehensive database of real-world sounds, including collision impacts, various siren types, tire screeches, and pedestrian shouts, to evaluate and optimize its detection and classification performance across the full spectrum of critical acoustic events, and to assess the consistency of the GCC-PHAT localization performance across these diverse acoustic signatures.

Additionally, investigating the potential benefits and trade-offs of using directional microphones represents another research direction. While omnidirectional microphones provide uniform coverage, directional microphones could improve the signal-to-noise ratio for specific approach lanes and potentially simplify the array geometry, though at the cost of increased system complexity and the need for precise mechanical or electronic steering.

The third and most complex problem is sensor fusion. Reliability can be improved by combining acoustic data in real time with information flows from cameras, lidars, and radars. Joint processing of heterogeneous data should help overcome the limitations typical of each sensor type and produce a comprehensive, interference-resistant picture of the road situation.

Fourth, the optimization of the microphone array geometry for specific urban environments warrants detailed investigation. A systematic analysis of the influence of inter-microphone spacing and array aperture on localization sensitivity and distance estimation accuracy will provide practical guidelines for deploying the system at intersections with varying sizes and layouts.

Additionally, field calibration across diverse urban morphologies is required to precisely quantify the system’s effective range and performance limits under varying noise conditions, from dense city centers to suburban intersections.

Thus, the presented solution opens extensive opportunities for research. Its further development along the lines discussed should help create a high-level, economically efficient, and technologically justified infrastructure that significantly improves safety at urban crossroads in the context of the evolving smart city concept and V2X communications.

Author Contributions

Conceptualization, Y.F.; Methodology, Y.F.; Software, D.A.; Validation, D.A.; Formal analysis, Y.F. and D.A.M.; Investigation, Y.F. and M.M.; Resources, S.S.S.; Data curation, M.M. and D.A.M.; Writing—original draft, D.A. and D.A.M.; Visualization, A.K.; Supervision, A.K.; Project administration, A.K. and S.S.S.; Funding acquisition, S.S.S. All authors have read and agreed to the published version of the manuscript.

Funding

The project was partially supported by Moscow Polytechnic University within the framework of the P.L. Kapitsa grant program (III stage). The research was carried out with the financial support of the Ministry of Science and Higher Education of the Russian Federation within the framework of the project FSFM-2025-0001 “Development of a scientific and methodological framework for synthesizing optimal solutions for the energy-efficient operation of tractor units with semi-trailers on federal roads of the Russian Federation, taking into account remote monitoring and forecasting of operational characteristics in relation to the road network”.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Figure 1. Block diagram of the acoustic signal processing pipeline, showing the three main stages: signal preprocessing, active event detection, and sound source localization.

Figure 2. The positioning of the microphones and the sound source on a flat plane (2D): the blue dots stand for the positions of the four microphones, the purple cross stands for the true position of the source, and the brown triangle is the position calculated using the algorithm (coordinates are shown in centimeters).

Figure 3. Laboratory setup: microphones are attached to pedestals, and the sound source moves along the perimeter.

Figure 4. Temporal shapes of sound signals recorded with four microphones in one of the testing points after preprocessing (100–3000 Hz filtering).

Figure 5. The field setup: omnidirectional microphones on tripods placed at the vertices of a 14 m square on an open-air urban site surrounded by buildings.

Figure 6. The hardware platform for the field tests: a multichannel audio interface connected to the four microphones, and a laptop for the synchronized recording and primary analysis of audio tracks.

Figure 7. The testing site overview: four omnidirectional microphones (in green boxes) are installed at the vertices of a 14 × 14 m square on an open-air urban site to record audio tracks for 11 points.

Figure 8. The 2D placement model for the four microphones and the sound source in laboratory conditions.

Figure 9. The positions of microphones, true position of the source (0; 0) meters, and the evaluation (−0.14; 0.06) meters (Test 1).

Figure 10. The positions of microphones, true position of the source (−4; −1) meters, and the evaluation (−2.90; −0.73) meters (Test 10).

Table 1. Equipment specifications.

Microphones:
Type	omnidirectional condenser
Frequency range	20 Hz–20 kHz
Sensitivity	−38 dBV/Pa
Signal/noise ratio	>70 dB
Recording system:
Audio interface	Multichannel, UMC1820
Sampling rate	44.1 kHz
Bit depth	24 bits
Channel synchronization	hardware
Computing platform:
Processor	Intel Core i7
RAM	16 GB

Table 2. Mean absolute errors and standard deviations of sound source localizations for various modeling algorithms (100 runs, SNR = 5 dB).

Method	$\overline{\Delta x}$, m	$\overline{\Delta y}$, m	$\overline{\Delta z}$, m	$\sigma$, m	Completion Time, ms
GCC	0.15	0.12	0.18	0.10	15
GCC-PHAT	0.12	0.14	0.20	0.09	22

Table 3. Laboratory experiment results: true and evaluated source coordinates, absolute errors (N = 10).

#	True Position (x, y), m	Evaluation (x, y), m	Error, m
1	(0; 0)	(0.031; −0.018)	0.036
2	(0; 1.2)	(−0.025; 2.05)	0.850
3	(1.2; 0)	(1.837; −0.151)	0.654
4	(0; −1.2)	(0.067; −1.623)	0.429
5	(−1.2; 0)	(−1.825; −0.221)	0.663
6	(0.6; 0.4)	(0.992; 0.466)	0.399
7	(0.4; −0.5)	(0.551; −0.792)	0.331
8	(−0.2; −0.8)	(−0.282; −1.098)	0.308
9	(−0.6; 0.4)	(−0.815; 0.533)	0.257
10	(0.2; 0.9)	(0.355; 1.446)	0.565

Table 4. Repeated testing results (code version 2).

Test No.	True Coordinates (m)	Evaluation (m)	Error (m)	Final Classification
1	(0; 0)	(−0.14; 0.06)	0.15	Success
2	(0; 7)	(0.06; 6.57)	0.43	Success
3	(7; 0)	(5.45; 1.68)	2.29	Borderline
4	(0; −7)	(0.37; −6.41)	0.70	Success
5	(−7; 0)	(−6.47; −0.07)	0.53	Success
6	(4; 3)	(3.16; 1.80)	1.46	Success
7	(4; 0)	(2.76; −0.90)	1.53	Borderline
8	(3; −4)	(1.67; −3.40)	1.46	Borderline
9	(−4; 2)	(−1.44; 0.30)	3.07	Error
10	(−4; −1)	(−2.90; −0.73)	1.13	Borderline
11	(−3; −5)	(0.11; −0.10)	5.80	Error

Table 5. Comparing the key metrics for the first and the second versions of the algorithm.

Metric	Code Version 1	Code Version 2	Improvement (%)
Mean error, m	2.30	1.69	26
Median error, m	2.08	1.45	31
Maximum error, m	3.10	5.80	−87
False delays, %	100	30	70
Standard deviation, m	1.45	1.16	20

Table 6. Final system performance: localization accuracy and processing latency.

Metric	Value	Condition/Note
Mean Localization Error	1.69 m	Across 11 field test points (upgraded algorithm)
Standard Deviation (σ)	0.49 m	Across successful localizations
Median Localization Error	1.45 m	Across 11 field test points (upgraded algorithm)
Maximum Localization Error	5.80 m	Worst-case result from field tests
Processing Latency (per event)	95 ms	Measured on Intel Core i7 platform
Real-time Requirement	<200 ms	Threshold for V2X integration
Requirement Fulfillment	Yes	95 ms < 200 ms
