A Novel Driving Noise Analysis Method for On-Road Traffic Detection

Effective noise reduction and abnormal feature extraction are important for detecting abnormal sounds in urban traffic operations. However, to improve the detection accuracy for continuous traffic flow and even overlapping vehicle bodies, effective methods capable of achieving a suitable signal-to-noise ratio and appropriate characteristic parameters should be explored. In view of the disadvantages of traditional traffic detection methods, such as Short-Time Energy (STE) and Mel Frequency Cepstral Coefficients (MFCC), this study adopts an improved spectral subtraction method to analyze traffic noise. Through the feature fusion of STE and MFCC coefficients, an innovative feature parameter, E-MFCC, is obtained, on which a traffic noise detection solution based on Triangular Wave Analysis (TWA) is built. App Designer in MATLAB was used to establish a traffic detection simulation platform. The experimental results showed that, compared with accuracies of 67.77% and 76.01% for traffic detection using the traditional STE and MFCC methods, respectively, the detection accuracy of the proposed TWA is significantly improved, attaining 91%. The results demonstrated the effectiveness of the proposed traffic detection method in solving the overlapping problem, thus achieving accurate detection of road traffic volume and improving the efficiency of road operation.


Introduction
In many cities throughout the world, the existing road network infrastructure needs to be maintained and improved, for example by widening existing roads and making use of intelligent transportation system technologies. The number of on-road sensors is usually huge, yet they cannot capture all traffic flow conditions for short-term traffic flow prediction. At the same time, traffic noise emissions may be related to the following attributes: the power unit component; the interaction between speed, traffic flow type and road slope; the rolling noise component; and the relation function between traffic speed and road surface [1]. As one of the major input data sources, a traffic noise spectral profile was used to estimate road traffic flow in real time, with an average correct classification rate of about 96% [2]. Taking audio signals from two nearby sensors, the generalized cross correlation (GCC) function was combined with a particle filter to jointly estimate speed and wheelbase length. For road traffic monitoring using sound, a design method for the microphone array was put forward based on the correlation function of acoustically observed vehicle trajectories [3]. Probabilistic noise models with more explicit road surface marker (RSM) features were developed to analyze the results of RSM feature detection under various driving conditions [4]. Based on the probabilistic sensor model, an RSM model designed by Jo et al. utilized a particle filter to update the measurements, thereby improving the localization performance [5].
By extracting the peak power envelope in the traffic noise signal, Torija used a microphone array to record the sound of vehicles and estimate the number of vehicles, with the lanes treated separately. Using the local spectral features of sound to identify overlapping sound events and separate the background noise has a certain effect on the recognition of overlapping signals, and can be applied to the situation of overlapping traffic noise.
This paper proposes an improved spectral subtraction method to deal with traffic noise. A new feature parameter, E-MFCC, is defined through the feature fusion of STE and MFCC coefficients, and a new traffic detection method based on traffic noise is proposed within the framework of TWA to improve traffic detection accuracy. The remainder of the paper is structured as follows: Section 2.1 introduces the principle of traffic detection based on traffic noise, including pretreatments and analyses of the collected traffic noise data. Section 2.2 analyzes the characteristic parameters. Using experimental data collected in the field, Section 2.3 assesses the detection performance of the three traffic detection methods, so as to verify the effectiveness of the proposed detection method. Finally, conclusions and future research directions are provided in Section 3.

Pretreatment and Characteristic Analysis of Driving Noise
Voice endpoint detection (VED) technology has been widely used in voice detection [25], with the main purpose of distinguishing voice and non-voice segments in the input signal. The core of using endpoint detection technology to detect traffic volume is to set thresholds and confirm vehicle signal frames. The existence of a vehicle is determined by the threshold value, as shown in Figure 1. According to the corresponding abscissa of the waveform diagram, the oversaturated road segment can be roughly determined. The red dotted line in the figure represents the starting point of the vehicle segment, and the green line is the ending point, both of which are roughly judged from the waveform of the driving noise signal. When the signal-to-noise ratio is low, judging vehicle segments from the waveform alone is not recommended. For example, two vehicles are included in Figure 1, but only one would be detected in this way. Therefore, it is necessary to set a threshold, as shown in Figure 1, by which two sections of the waveform can be obtained.

Pretreatment
Borkar et al. extracted the MFCC of driving noise and, based on a neuro-fuzzy classifier characterized by MFCC, used SVM to classify traffic at three density levels: low traffic (40 km/h), medium traffic (20-40 km/h) and high traffic (0-20 km/h), showing that the classification accuracy was over 95% [26]. Kaur et al. collected traffic noise data in a 'busy street' and a 'quiet street', respectively, and extracted various time- and frequency-based features such as short-term zero crossing rate (ZCR), short-term energy (STE), root mean square (RMS) and MFCC, yielding classification accuracies of 91.8% with a neural network and 93% with SVM [27].
Vehicles driving on the road are often accompanied by abnormal short-term signals, such as vehicle horns and emergency braking, which may be erroneously detected as passing vehicles. Based on high and low thresholds, the vehicle noise signal detection and analysis algorithm shown in Figure 2 is proposed. The algorithm consists of six steps:
Step 1: The original traffic noise signal collected in real time on the road is imported into the processing system.
Step 2: The collected traffic noise signal is preprocessed and filtered to obtain smooth traffic noise signal data.
Step 3: Feature vectors are extracted as the configuration parameters, and the eigenvalue $F_i$ of each frame of the driving noise signal is calculated successively.
Step 4: The high threshold $E_1$ is calculated, and the low threshold is calculated as $E_2 = a \cdot \max_{1 \le j \le n} E_j$, where $0 \le a \le 10$ and $a$ is increased by 0.1 each time, so as to set the minimum vehicle signal frame.
Step 5: If $F_i \ge E_1$ and $F_r \ge E_2$ $(r = i + 1, i + 2, \dots, i + l^{(v)} - 1)$, then $i^{(s)} = i$, where $l^{(v)}$ is the minimum length of the vehicle segment and $i^{(s)}$ is the initial endpoint of the $Q$th driving noise segment. Then, $F_i$ is calculated successively from $i = i + l$ until $F_i < E_2$ and $F_s < E_2$ $(s = i + 1, i + 2, \dots, i + l^{(n)} - 1)$; then $i^{(o)} = i + l^{(n)} - 1$, where $l^{(n)}$ is the minimum length of the environmental noise segment and $i^{(o)}$ is the end point of the $Q$th driving noise segment. The interval $[i^{(s)}, i^{(o)}]$ represents the $Q$th driving noise segment. The preceding steps are then repeated.
Step 6: The entire algorithm iterates between the detection state and the vehicle signal state. Completing one cycle corresponds to detecting one passing vehicle, so the number of cycles equals the number of vehicles. By traversing all data sequences, the objective of road traffic flow detection is achieved.
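The two-state iteration of Steps 5 and 6 can be illustrated with a minimal Python sketch. It assumes the usual double-threshold convention, in which a segment starts when the frame feature reaches the high threshold and the following frames stay at or above the low threshold, and ends once a run of frames falls below the low threshold. The function name, the parameter names and the sample values are hypothetical and are not taken from the paper.

```python
def detect_vehicle_segments(features, high, low, min_vehicle_len, min_noise_len):
    """Return (start, end) frame-index pairs, one per detected vehicle segment.

    Assumed rule (not the paper's exact formulation): a segment starts at frame i
    when features[i] >= high and the next min_vehicle_len - 1 frames are >= low;
    it ends with the first run of min_noise_len frames that are all below low.
    """
    segments = []
    n = len(features)
    i = 0
    while i < n:
        window = features[i:i + min_vehicle_len]
        starts_here = (
            features[i] >= high
            and len(window) == min_vehicle_len
            and all(f >= low for f in window[1:])
        )
        if not starts_here:
            i += 1  # detection state: keep scanning
            continue
        # Vehicle signal state: advance until a quiet run of min_noise_len frames.
        j = i + min_vehicle_len
        while j < n and not all(f < low for f in features[j:j + min_noise_len]):
            j += 1
        end = min(j + min_noise_len - 1, n - 1)
        segments.append((i, end))
        i = end + 1  # one completed cycle corresponds to one vehicle
    return segments


# Hypothetical per-frame feature values containing two vehicle passes.
example = [0.0, 0.0, 0.9, 0.8, 0.7, 0.1, 0.0, 0.0, 0.9, 0.9, 0.9, 0.0, 0.0]
found = detect_vehicle_segments(example, high=0.5, low=0.3,
                                min_vehicle_len=3, min_noise_len=2)
print(len(found), found)
```

Requiring a minimum vehicle length is what keeps a single-frame spike, such as a horn, from being counted as a vehicle in this sketch.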
The preprocessing mainly comprises the following operations: pre-emphasis, framing, windowing, normalization and noise reduction [27]. After the noise signal is sampled, an FIR high-pass filter, known as pre-emphasis, is applied to facilitate the analysis of the audio samples. The purpose is to increase the high-frequency resolution of the audio signal. Pre-emphasis has a certain inhibitory effect on the low-frequency signal. After pre-emphasis, the high-frequency component of the traffic noise signal is significantly increased, and the overall amplitude of the signal becomes smaller. The waveform of the traffic noise signal after pre-emphasis becomes smoother when no cars are present, which benefits the subsequent signal processing and feature extraction.
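As an illustration of pre-emphasis, the sketch below applies the common first-order FIR high-pass filter y[n] = x[n] - alpha * x[n-1]. The coefficient value 0.97 is a typical choice assumed here for illustration; the paper does not state the coefficient it used.

```python
def pre_emphasis(samples, alpha=0.97):
    """First-order FIR high-pass filter: y[n] = x[n] - alpha * x[n-1].

    alpha = 0.97 is an assumed, commonly used value, not the paper's setting.
    """
    if not samples:
        return []
    out = [samples[0]]  # the first sample has no predecessor and is kept as-is
    for n in range(1, len(samples)):
        out.append(samples[n] - alpha * samples[n - 1])
    return out


# A constant (low-frequency) input is attenuated,
# while an alternating (high-frequency) input is amplified.
print(pre_emphasis([1.0, 1.0, 1.0, 1.0], alpha=0.5))
print(pre_emphasis([1.0, -1.0, 1.0, -1.0], alpha=0.5))
```

The two printed sequences show the behaviour described above: slowly varying content shrinks in amplitude, whereas rapidly varying content is emphasized.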

Framing and Windowing
The traffic noise signal is a non-stationary random signal, which can be regarded as a quasi-steady-state process within a rather short time range. During traffic noise signal processing, the entire signal needs to be divided into frames, generally 10-30 ms long, to ensure the stability of the input signal. To this end, a traffic noise signal of length $L$ is framed according to Equation (1),
where $f$ is the total number of frames after splitting, $L$ is the signal length, $T_i$ is the displacement of the $(i+1)$th frame relative to the $i$th frame, namely the frame shift, $X_i$ is the overlapping part between the two frames and $N_i$ is the frame length, with $X_i = N_i - T_i$. The data are finally divided into $f$ frames.
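The framing step can be sketched as follows. The frame count is computed with the usual relation f = (L - X) / T, with X = N - T, rounded down; this is assumed here to correspond to Equation (1). The numerical example uses the 48 kHz sampling rate, 25 ms frame length and 12.5 ms frame shift reported in the data acquisition section, and the one-second signal length is a hypothetical value.

```python
def frame_count(signal_len, frame_len, frame_shift):
    """Number of full frames: floor((L - N) / T) + 1, i.e. (L - X) / T with X = N - T."""
    if signal_len < frame_len:
        return 0
    return (signal_len - frame_len) // frame_shift + 1


def split_frames(signal, frame_len, frame_shift):
    """Split a sample sequence into overlapping frames of equal length."""
    count = frame_count(len(signal), frame_len, frame_shift)
    return [signal[i * frame_shift:i * frame_shift + frame_len] for i in range(count)]


sample_rate = 48000                     # Hz
frame_len = sample_rate * 25 // 1000    # 25 ms   -> 1200 samples
frame_shift = sample_rate * 125 // 10000  # 12.5 ms -> 600 samples
print(frame_len, frame_shift, frame_count(sample_rate, frame_len, frame_shift))
print(split_frames(list(range(10)), frame_len=4, frame_shift=2))
```

With these settings, consecutive frames overlap by half a frame, so one second of audio yields 79 full frames.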

Normalized Processing
During driving noise signal collection, even for the same vehicle, the collected signal amplitude differs because of factors such as speed and the location of the acquisition equipment. To eliminate the influence of these factors on traffic detection, a selection judgment is required when locating a weak traffic signal within a strong signal. Euclidean distance is used as the judgment function. Because the audio is output as a continuous signal, a distance discrimination index is applied to the collected signal amplitude to eliminate the influence and weight of the location of the collection device, and the expression is shown in Equation (2), where $X_{ij}(m, n)$ and $Y(m, n)$ are the driving signals at any point in the selected area. After the template is selected, it is fixed; if $D = 0$, the signals are equal; if $D > 0$, normalization is performed as shown in Equation (3). The normalization process eliminates order-of-magnitude differences between the data of each dimension and avoids the large calculation errors caused by large differences in the magnitude of the feature data. To analyze the collected road vehicle data accurately, noise reduction is needed.
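The two operations named above can be sketched in Python as follows. The Euclidean distance is the standard definition; the min-max scaling is only one common way of removing order-of-magnitude differences and is an assumption made for illustration, since the exact form of Equation (3) is not reproduced here. The function names are hypothetical.

```python
import math


def euclidean_distance(x, y):
    """Standard Euclidean distance D between two equally long sample sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def min_max_normalize(values):
    """Scale values linearly into [0, 1] (assumed normalization, for illustration)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant input carries no relative scale
    return [(v - lo) / (hi - lo) for v in values]


print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))   # D > 0: the signals differ
print(euclidean_distance([1.0, 2.0], [1.0, 2.0]))   # D = 0: the signals are equal
print(min_max_normalize([2.0, 4.0, 6.0]))
```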

Improved Spectral Subtraction Noise Reduction
In the process of spectral subtraction, it is necessary to determine the length of the leading noise segment and the values of the parameters. Because the background noise in an acquired audio signal is completely random, it is likely that, at some spectral lines, the actual noise value will be greater than the estimated one. In this case, if basic spectral subtraction is used to reduce noise, the background noise cannot be removed completely and many burr peaks will be retained, which greatly reduces the noise reduction effect. The flow chart of this method is shown in Figure 3. The improved spectral subtraction with multi-window spectral estimation is mainly based on the basic spectral subtraction technique. A single orthogonal data window is replaced by multiple orthogonal data windows, and the red dotted line in Figure 3 marks the improved part. The original traffic noise signal time-domain sequence is $x_{td}(n_0)$, the signal sequence after pretreatment is $x_{td}(n)$, and $x_{td}(n)$ is expressed as the $i$th-frame traffic noise signal. The general spectral subtraction noise reduction comprises five steps: time-frequency domain conversion, noise estimation, phase angle calculation, spectral subtraction and frequency-time domain conversion. The first three steps, time-frequency domain conversion, noise estimation and phase angle calculation, are the important preparatory work for spectral subtraction, as given in Equation (4). The spectral subtraction process subtracts the average noise energy from the energy of the frequency-domain signal $x^{fd}_i(m)$ of each frame, and the amplitude $x^{ss}_i(m)$ after spectral subtraction is obtained as in Equation (5), where $a$ is the subtraction factor and $b$ is the gain compensation factor. Then, the signal sequence $y_i(k)$ after spectral subtraction is obtained by combining this amplitude with the phase angle. Basic spectral subtraction uses only one data window during noise reduction; it is improved here through multi-window spectral estimation.
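The noise estimation and subtraction steps of basic spectral subtraction can be sketched as below. The sketch assumes the common over-subtraction form, in which $a$ scales the subtracted noise estimate and $b$ sets a spectral floor relative to that estimate; the exact form of Equation (5) may differ. It operates on per-bin power values only, leaves out the time-frequency conversions and the phase, and does not include the multi-window estimation. Function names and numerical values are hypothetical.

```python
def average_noise_power(leading_frames_power):
    """Average the per-bin power over the leading (vehicle-free) noise frames."""
    n_frames = len(leading_frames_power)
    n_bins = len(leading_frames_power[0])
    return [sum(frame[m] for frame in leading_frames_power) / n_frames
            for m in range(n_bins)]


def spectral_subtract(frame_power, noise_power, a=4.0, b=0.001):
    """Per-bin power after subtraction (assumed over-subtraction form).

    Each bin becomes frame_power - a * noise_power when that exceeds the floor
    b * noise_power, and is set to the floor otherwise. The defaults for a and b
    are illustrative values, not the paper's settings.
    """
    cleaned = []
    for p, d in zip(frame_power, noise_power):
        residual = p - a * d
        floor = b * d
        cleaned.append(residual if residual > floor else floor)
    return cleaned


noise = average_noise_power([[1.0, 2.0], [3.0, 2.0]])   # two leading noise frames
print(noise)
print(spectral_subtract([10.0, 1.0], noise, a=2.0, b=0.5))
```

The floor prevents negative power values in bins where the instantaneous noise exceeds the average estimate, which is the situation that leaves burr peaks in the basic method.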
where $P(k, i)$ and $P_y(k, i)$ are the multi-window power spectral density and the smoothed power spectral density of the $k$th spectral line of the $i$th frame, respectively. PMTM denotes the estimation of the multi-window power spectral density; taking the $i$th frame as the center, it and the $M$ frames before and after it, $2M + 1$ frames in total, are averaged. $|\bar{X}_i(k)|$ is the average magnitude spectrum, $g(k, i)$ is the gain factor and the magnitude spectrum after spectral subtraction is $|\hat{X}_i(k)|$.
The short-time energy of the traffic noise signal $y_i(n)$ in the $i$th frame is calculated as shown in Equation (7), where $y_i(n)$ is the value of a frame, $n = 1, 2, \dots, L$, $1 \le i \le f_n$ and $L$ is the frame length. The sum of the squared amplitudes of the $i$th frame is the STE value at the corresponding time point of the signal. Figure 4 shows the original waveform of a section of traffic noise and the corresponding STE. In Figure 4, the STE of the traffic noise signal changes with time, and the energy difference between segments with and without vehicles is significant. When the environmental noise is strong, the existence of congested segments still needs to be identified; the STE feature extraction algorithm is relatively easy to implement, which saves processing time when the amount of data is large. Zero Crossing Rate (ZCR) indicates the number of times a signal waveform passes through the horizontal axis (zero level) within a frame [28]. The definition of the short-time average zero crossing rate is shown in Equation (8).
The sonogram reflects the dynamic spectrum characteristics of the sound signal and plays an important role in signal processing [29]. In the sonogram, the abscissa represents time and the ordinate represents frequency. Because three-dimensional information is mapped onto a two-dimensional plane, the amplitude is expressed by the depth of the color.
A deeper color indicates a larger amplitude at that point. The FFT result of the traffic noise signal is shown in Equation (9), and the calculated spectrum characteristics of the traffic noise signal can be expressed by the matrix shown in Equation (10),
where $A_{sv}$ represents the spectrum feature and $a_i$ represents the amplitude of the traffic noise signal after the FFT, with $m = 1, 2, \dots, \lfloor N/2 \rfloor + 1$. Because $|Y_i(k)|$ and $|Y_i(N - k + 2)|$ are equal when $k$ is indexed from 1, only $\lfloor N/2 \rfloor + 1$ sample points are needed. According to Equations (9) and (10), the spectrum feature of the traffic noise signal is extracted, as shown in Figure 6. Figure 6 shows a section of the traffic noise signal waveform and its corresponding spectrogram. The frequency content of the signal collected in the road environment lies below 8000 Hz, and the color of the traffic noise sections is deeper than that of the ambient noise sections. The traffic noise information is continuously distributed from the low-frequency region to the medium- and high-frequency regions, whereas the environmental noise information is mainly concentrated in the low- and medium-frequency regions. It can be seen from Figure 6 that above the intermediate-frequency region of about 3000 Hz, the frequency distribution of the environmental noise signal is rather small and can be ignored relative to the traffic noise signal. Therefore, the existence of congested segments can be easily judged by setting a frequency threshold.
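The frame-level features discussed in this section, namely short-time energy, zero crossing count and the one-sided magnitude spectrum, can be sketched in plain Python as follows. The zero crossing convention (a zero sample is treated as non-negative) is an assumption, and the direct DFT is used only to keep the example self-contained; a practical implementation would use an FFT routine. Indices are zero-based in the code, so the symmetry $|Y(k)| = |Y(N - k + 2)|$ of the text appears as `|Y[k]| == |Y[N - k]|`.

```python
import cmath
import math


def short_time_energy(frame):
    """Sum of squared sample amplitudes of one frame."""
    return sum(x * x for x in frame)


def zero_crossing_count(frame):
    """Number of sign changes between consecutive samples (zero counts as positive)."""
    signs = [1 if x >= 0 else -1 for x in frame]
    return sum(1 for s0, s1 in zip(signs, signs[1:]) if s0 != s1)


def one_sided_magnitude(frame):
    """Magnitudes of DFT bins 0 .. floor(N/2); the remaining bins mirror these."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        mags.append(abs(acc))
    return mags


frame = [1.0, 0.0, -1.0, 0.0]   # hypothetical frame: one cycle over four samples
print(short_time_energy(frame))
print(zero_crossing_count(frame))
print([round(m, 6) for m in one_sided_magnitude(frame)])
```

Stacking the one-sided magnitudes of successive frames column by column gives the kind of time-frequency matrix from which a sonogram is drawn.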

Feature Extraction and Fusion
This study proposes the Triangular Wave Analysis (TWA) technique for traffic volume processing. The analyses of the STE- and MFCC-based traffic detection methods in [30] show that both methods have certain limitations in detecting overlapping congested segments. For the fusion, $d^{0}_m(i, n)$ is taken from $d_m(i, n)$ in MFCC, and the STE indicator $E_i$ of the traffic noise signal is calculated. The first and last two frames of the short-time energy $E_i$ are discarded, because these frames were not included in calculating $d_m(i, n)$, so that the length of $E_i$ matches that of $d^{0}_m(i, n)$. $E_i$ is then multiplied by the index, which gives the E-MFCC feature of the frame-$i$ signal.
Unlike the MFCC method, E-MFCC does not require the leading NIS frames of the audio to be set aside to describe the background noise and calculate its average. The E-MFCC feature is therefore feasible for traffic detection and is superior to both STE and MFCC.

Extremum Extraction
For the extracted new feature envelope, the extremum extraction method searches the local extremum points of the numerical sequence to extract the extrema of the original sequence. The new feature E-MFCC is taken as the original data of the traffic noise signal after digitization. Assuming that the original data matrix is $X$, it can be expressed as in Equation (11), where $X$ is the original data matrix, $T_i$ and $P_i$ are the original time and amplitude, respectively, $i = 1, 2, \dots, m$ indexes the original data and $m$ is its last index. The specific process is as follows. First, the beginning and end points are initialized as in Equation (12), where $E(\cdot)$ is the extremum matrix and $d$ is the number of extremum points. Then, each point is judged against the following conditions: when $P_{i-1} > P_i$ and $P_i < P_{i+1}$, it is a minimum point; when $P_{i-1} < P_i$ and $P_i > P_{i+1}$, it is a maximum point. The corresponding extremum points $T_i$ and $P_i$ are stored in $E(\cdot)$. The above two steps are the basic principle and implementation process of extremum extraction.
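A minimal Python sketch of this extremum search is given below. It assumes that the first and last samples are stored as boundary points during initialization and that strict inequalities are used, so flat plateaus are not reported; both choices are assumptions for illustration. The function name and the sample values are hypothetical.

```python
def extract_extrema(times, amplitudes):
    """Return (time, amplitude, kind) tuples for boundary points and local extrema.

    kind is "endpoint", "max" or "min". Interior points are classified with
    strict comparisons against both neighbours.
    """
    m = len(amplitudes)
    if m == 0:
        return []
    extrema = [(times[0], amplitudes[0], "endpoint")]   # assumed initialization
    for i in range(1, m - 1):
        prev, cur, nxt = amplitudes[i - 1], amplitudes[i], amplitudes[i + 1]
        if prev < cur and cur > nxt:
            extrema.append((times[i], cur, "max"))
        elif prev > cur and cur < nxt:
            extrema.append((times[i], cur, "min"))
    if m > 1:
        extrema.append((times[-1], amplitudes[-1], "endpoint"))
    return extrema


# Hypothetical smoothed feature curve with two peaks separated by one valley.
points = extract_extrema([0, 1, 2, 3, 4], [0.0, 2.0, 1.0, 3.0, 0.0])
print(points)
print(sum(1 for _, _, kind in points if kind == "max"))
```

Each maximum flanked by two lower points forms the apex of one candidate triangular wave in the subsequent analysis.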
The extracted E-MFCC features and the smoothed E-MFCC features are shown in Figure 7, where the peaks represent the extrema; the waveform after extremum extraction of the smoothed E-MFCC characteristic curve is shown in Figure 7b.

Formation and Combination of Triangular Waves
The formation and combination of triangular waves mainly includes four parts: triangular wave formation, triangular wave up-combination, triangular wave down-combination and frame width expansion.
$T_1$ is the interference triangular wave that appears during a monotonic increase, $T_2$ is the interference triangular wave that appears during a monotonic decrease and $T_3$ is the interference triangular wave that includes both $T_1$ and $T_2$. To achieve satisfactory detection accuracy, it is necessary to carry out triangular wave up-combination, triangular wave down-combination and frame width expansion for these three cases. Figure 8b shows the up-combination of triangular waves, which is the first step of the combination rule and addresses the $T_1$ case. Comparing Figure 8b with Figure 8a, the monotonically increasing interference triangular wave of type $T_1$ has been eliminated after the up-combination. Figure 8c shows the second step of the combination rule, which addresses the $T_2$ case. Comparing Figure 8c with Figure 8d, it can be seen that after the down-combination, the monotonically decreasing interference triangular wave of type $T_2$ has been eliminated. At the same time, after both the up- and down-combination, the $T_3$ interference wave is basically eliminated. After the up- and down-combination, however, the peak values of some triangular waves are still very small; that is, the signal in some environmental noise segments is very weak. The waveform formed by frame width expansion is shown in Figure 8d. The frame width expansion algorithm is designed to solve the problem of abnormal noise (horns, birdsong). The core idea is to expand the frame width of overlapped congested segments. By setting the minimum frame length of the congested segment, the traffic volume is obtained. The triangular wave analysis algorithm thus solves the problem of separating overlapping vehicle segments and can be applied in practice.

Experimental Data Acquisition
The Nanbin Road segment between the Caiyuanba Yangtze River Bridge and the Chongqing Yangtze River Bridge was selected as the experimental data acquisition section. The road runs from west to east along the Yangtze River (the north side is adjacent to the river), with a total length of around 1.3 km, and the south side is bordered by mountains with no side branches. The segment was intentionally selected to avoid pedestrian noise disturbance, as no shops, restaurants or similar premises are nearby, thus ensuring the quality of traffic noise collection. On urban arterial roads, noise from pedestrians and shops, together with factors such as congestion, seriously disturbs data collection; therefore, off-peak hours on the two-lane section were selected to collect the traffic noise. The driving noise signal can be converted from the time domain to the frequency domain by the fast Fourier transform (FFT), and the spectrogram of the signal is then extracted and analyzed.
The data collection points and the collection schematic diagram are shown in Figure 9. As shown in Figure 9b, the equipment used in the study included a recording pen, a mobile phone and a computer. The recording pen performs lossless recording at 1536 kbps with a 48 kHz sampling rate in WAV format, and supports line-in recording and built-in microphone recording. The sensitivity of the microphone is −58 ± 3 dB, the working temperature is −25 to 70 °C and it is an omnidirectional electret condenser microphone. The frame length is set to 25 ms (namely 1200 sample points) and the frame shift is set to 12.5 ms. Video of the same segment was recorded synchronously with the smartphone to obtain the vehicle statistics, which serve as the actual traffic volume of the experiment. The recorded traffic noise data were saved in WAV format and the video data in MP4 format as a reference for obtaining the accurate traffic volume. The collected data were processed with Premiere Pro CC 2017 to remove the noise data from traffic in the opposite direction and to count the actual traffic volume for 60 min. The final synthetic sample was named 'traffic noise', and the actual traffic volume is 667 pcu.

Evaluating Indicator
According to Ma et al. [31], the traffic detection evaluation indexes selected in this study are the traffic detection accuracy $r_c$, the false detection rate $r_w$ and the missed detection rate $r_m$, defined in Equation (13), where $n_a$ is the actual number of vehicles in the synchronously acquired video data, $n_t$ is the total number of vehicles detected from the traffic noise, $n_{(t,c)}$ is the number of correctly detected vehicles among the total detected, $n_{(t,w)}$ is the number of vehicles wrongly detected from the noise signal, with $n_{(t,w)} = n_t - n_{(t,c)}$, and $n_m$ is the number of undetected vehicles.
The basic spectral subtraction and the improved spectral subtraction are used to filter and denoise a section of the original traffic noise signal, as shown in Figure 10. Background environmental noise (distant vehicles) is superimposed on the signal. In Figure 10c, the traffic noise after noise reduction by basic spectral subtraction is clearly distinguishable from the original waveform, with the noise marked by the red rectangular box. In Figure 10d, after filtering and noise reduction with the improved spectral subtraction, the signal of the non-congested segment is close to zero in the time-domain diagram and can be ignored, as marked by the red dotted rectangular box. In the spectrogram, there is no frequency distribution in the non-vehicle section, and the spectrum of the vehicle noise section is smoother.
Figure 10. Noise reduction algorithm sonogram and waveform comparison: (a) original driving noise signal and its spectrogram; (b) noisy driving noise signal and its spectrogram; (c) driving noise signal after spectral subtraction and its spectrogram; (d) driving noise signal after improved spectral subtraction and its spectrogram.
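The three evaluation indexes defined at the start of this section can be computed as in the following sketch. The denominators are assumptions made for illustration: accuracy and missed detection rate are taken relative to the actual count $n_a$, with $n_m = n_a - n_{(t,c)}$, and the false detection rate is taken relative to the detected count $n_t$; Equation (13) may define them differently. The input counts are hypothetical and are not results from the paper.

```python
def detection_rates(n_actual, n_detected, n_correct):
    """Return accuracy r_c, false detection rate r_w and missed detection rate r_m.

    Assumed definitions (for illustration only):
      r_c = n_correct / n_actual
      r_w = (n_detected - n_correct) / n_detected
      r_m = (n_actual - n_correct) / n_actual
    """
    if n_actual <= 0:
        raise ValueError("n_actual must be positive")
    n_wrong = n_detected - n_correct      # n_(t,w) = n_t - n_(t,c)
    n_missed = n_actual - n_correct       # assumed relation for n_m
    return {
        "accuracy": n_correct / n_actual,
        "false_rate": n_wrong / n_detected if n_detected > 0 else 0.0,
        "missed_rate": n_missed / n_actual,
    }


# Hypothetical counts: 100 vehicles on video, 95 detections, 90 of them correct.
rates = detection_rates(n_actual=100, n_detected=95, n_correct=90)
for name, value in rates.items():
    print(f"{name}: {value:.2%}")
```

Under these assumed definitions the accuracy and the missed detection rate sum to one, which matches the pairing of the reported 91% accuracy and 9% missing rate for TWA.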

Noise Reduction Performance Comparison
A traffic statistics system based on traffic noise signal is established by App Designer in MATLAB R2020a. Different frame lengths directly affect the accuracy of detection. With regard to TWA characteristics, the parameter setting of the minimum frame length of the traffic segment signal is the key to detect the traffic volume.
Taking 10 frames as the initial minimum frame length of the vehicle signal segment, the value is successively increased from 20 frames upward to test the traffic noise data of a segment with a traffic flow of 12 vehicles, as presented in Figure 11. In the figure, the red triangles represent triangular waves detected more than once, whereas the triangular boxes with red dotted lines represent missed triangular waves. As can be seen from Figure 11a-c, as the frame length increases, the number of detected triangular waves becomes closer to the real number of vehicles, and the number of repeatedly detected triangular waves decreases from six to two. In Figure 11d, at 40 frames, the number of detected triangular waves equals the real number of vehicles. As the frame length increases further, fewer triangular waves are detected: three vehicles are missed at 50 frames and four at 60 frames. That is to say, when the minimum frame length of the vehicle signal segment is set within a certain range, the number of vehicles can be accurately detected. Through multiple experiments and comparisons, the minimum frame length of the vehicle signal segment was set within the range of 35-45 frames.

Establishment of Simulation Platform
First, the platform's selection area for traffic noise signal acquisition and its feature extraction operation panel are determined, covering acoustic signal acquisition and acoustic signal reading. Then, the noise signal is denoised and preprocessed, followed by the selection of a traffic detection algorithm: the STE detection algorithm, the MFCC detection algorithm or the TWA algorithm.
Among the three traffic detection algorithms, traffic detection based on STE requires the selection of high and low thresholds. The traffic detected by the three detection algorithms is presented in Figure 12. Following the threshold parameter selection method for endpoint detection in Zhang and Pan [32], the high and low threshold parameters were set to 0.13 and 0.11, respectively, to obtain vehicle counts closer to the field situation. For traffic detection with TWA, the minimum length of the vehicle signal segment is set to 40 frames, and the other parameters are the same as for MFCC detection.
The broken line A in Figure 13 is the distribution of $n_a$; $a_1$, $a_2$ and $a_3$ represent the $n_t$ distributions under the STE, MFCC and TWA methods; $b_1$, $b_2$ and $b_3$ represent the $n_{(t,w)}$ distributions under the STE, MFCC and TWA methods; $c_1$, $c_2$ and $c_3$ represent the $n_m$ distributions under the STE, MFCC and TWA methods; and $d_1$, $d_2$ and $d_3$ represent the $n_{(t,c)}$ distributions under the STE, MFCC and TWA methods. From Figure 13, the distributions of $n_t$ and $n_{(t,c)}$ for TWA were found to be the closest to broken line A, with the lowest $n_m$ distribution. The statistical results of the three detection algorithms are shown in Table 2. The second column of Table 2 gives the vehicle missing rate of the three methods: the average missing rate of TWA is 9%, that of MFCC is 23.69% and that of STE is 32.23%, i.e., $R_{T1} < R_{M1} < R_{S1}$. The third column gives the vehicle error detection rate: the average error detection rate of TWA is 9.15%, that of MFCC is 3.75% and that of STE is 1.05%, i.e., $R_{S2} < R_{M2} < R_{T2}$. The fourth column gives the vehicle detection accuracy: the average accuracy of TWA attains 91%, much higher than that of MFCC (76.01%) and STE (67.77%), i.e., $R_{S3} < R_{M3} < R_{T3}$. TWA can better detect overlapping car bodies and has good stability in a time-varying environment.

Conclusions and Future Research Directions
This paper proposes a novel traffic noise analysis method for on-road traffic detection. The innovative feature parameter E-MFCC was defined through the fusion of STE and MFCC principal component features. Then, the extremum points were identified by traversing the entire characteristic curve, and the overlapped segment signals were separated by the triangular wave algorithms. The final number of triangular waves is the number of vehicles detected in the driving noise signal, which has theoretical and practical significance for expanding and improving current traffic detection technologies, e.g., loop detectors or video camera sensors.
The average accuracies of STE, MFCC and TWA were calculated as 67.77%, 76.01% and 91%, respectively, indicating that TWA has rather good accuracy and is effective for detecting traffic volume. Although the results are promising, this study has limitations with regard to the data and the approach. First, this study focuses only on one-way lane traffic detection, without considering two-way lanes, which may lead to underutilization of information. In follow-up work, the applicability of TWA to multi-lane scenarios will be investigated. Furthermore, during traffic noise pretreatment and feature extraction, the window function, frame length, frame shift and feature dimension were mostly selected by personal experience, without a set of mature theories or methods to guide the parameter configuration. Investigating different parameter settings may help to shed light on the results of the study. Finally, counting vehicles from driving sound is not the end point of traffic detection, because different vehicle types need to be converted into passenger car units to obtain more accurate traffic statistics; therefore, the next part of the study will involve vehicle type recognition based on driving sound, and further theoretical research is required to improve result quality in the field.