Spatio-Temporal Analysis of Urban Acoustic Environments with Binaural Psycho-Acoustical Considerations for IoT-Based Applications

Sound pleasantness or annoyance perceived in urban soundscapes is a major concern in environmental acoustics. Binaural psychoacoustic parameters are helpful to describe generic acoustic environments, as it is stated within the ISO 12913 framework. In this paper, the application of a Wireless Acoustic Sensor Network (WASN) to evaluate the spatial distribution and the evolution of urban acoustic environments is described. Two experiments are presented using an indoor and an outdoor deployment of a WASN with several nodes using an Internet of Things (IoT) environment to collect audio data and calculate meaningful parameters such as the sound pressure level, binaural loudness and binaural sharpness. A chunk of audio is recorded in each node periodically with a microphone array and the binaural rendering is conducted by exploiting the estimated directional characteristics of the incoming sound by means of DOA estimation. Each node computes the parameters in a different location and sends the values to a cloud-based broker structure that allows spatial statistical analysis through Kriging techniques. A cross-validation analysis is also performed to confirm the usefulness of the proposed system.


Introduction
Noise has become a big problem in big cities. It has been shown that affects human behavior, health and even children's cognition [1]. Different measurements and studies of this environmental factor have been performed in the past [2,3]. The normal way to do noise measurements is to collect noise samples with a sound level meter, however this technique has many drawbacks. On the one hand, only local measurements in a grid are taken and on the other, it is expensive due to the measuring equipment and personnel costs. Moreover, these studies have been performed observing objective parameters, the evaluation of the equivalent sound pressure level [4]. However, previous works have shown that the evaluation of psychoacoustic parameters, such as loudness and sharpness, fits better to the assessment of noise subjective annoyance [5,6]. Psychoacoustic research has been widely studied and standards for evaluating subjective annoyance and calculating psychoacoustic parameters have been created [7,8].
In 2014, the International Standardization Organization introduced the standard ISO 12913 about soundscape [9], following the recommendations of the COST action Soundscape [10]. The standard defines the soundscape as: acoustic environment as perceived or experienced and/or understood by a person or people, in context [9]. This context includes the relationships between person and activity and place, in space and time. The interaction between the different actors in the acoustic environment establish a dynamic context which is able to influence soundscape through different mechanisms as: (1) the auditory sensation, (2) the interpretation of auditory sensation, and (3) the responses to the acoustic environment. In this way, the soundscape differs from the acoustic environment (defined as: sound at the receiver from all sound sources as modified by the environment). In this work, the study is mainly oriented to the analysis of the acoustic environment, but as the evaluation is made in terms of the estimation of the psycho-acoustic annoyance (because it is not related to the direct subjective response of people) by using the Zwicker's model. Here there is no specific qualification of the sound by subjective perception, but the description of the environment with a simplified version of the Zwicker's model is used. This can be related to the subjective response, but this will be done in a later study. Other aspects that tackle this study are related to the development of an infrastructure to evaluate the acoustic environment in terms of binaural psycho-acoustic parameters and also its spatial and temporal analysis.
In the context of modern urban planning and smart cities, IoT systems are receiving increasing attention due to their ability to monitor different problematic areas within a population center [11,12]. These problems are mostly related to the environment, but also to e-Health, Intelligent Transportation Systems (ITS) and other emerging technological areas with high demands [13][14][15].
This work explains the implementation of an IoT-based system oriented to gather binaural psycho-acoustic information from indoor and outdoor environments. As the annoyance model proposed uses only loudness and sharpness as the main parameters, some simplifications have been made. In order to explain the performance of this parameters in the considered environments, spatial and temporal analysis have been done in the measurements to check the validity of the spatial statistical model proposed using kriging.
The paper is organised as follows: in Section 2, a brief state-of-the-art is sketched. In Section 3, the theoretical and practical methodology used to deploy the experiments, both in indoor and outdoor environments, with spatial statistical analysis (using kriging method) are described, discussing also hardware, software and signal processing considerations. Also in the previous section, the procedure to obtain the binaural loudness and sharpness is described. In Section 4, the results are presented and discussed. Finally, Section 5 summarizes the main conclusions of this work.

Related Works
Different authors have implemented different monitoring systems for environmental noise and have evaluated the related annoyance. The use of WASNs for noise monitoring was the main focus of the works in [16,17]. The authors used a WASN with Tmote-Sky motes [18] and Tmote Invent (TmI), for traffic noise monitoring using the equivalent level (L eq,T ) and counting the number and type of vehicles. In that deployment, the sampling frequency was set at 8 kHz. As an important conclusion, the authors stated that Tmote-Sky had excessive self-noise making it inappropiate for acquiring sound measurements, while, TmI offered good audio features. In that research, no specific procedure for calibration was provided. In [19,20], the authors deployed a WASN in Ostrobothnia (Finland). In these references, the authors evaluated different tests to assess the noise impact. They measured the L eq,T with T = 125 ms using a sampling frequency of 33 kHz, with 14 calibrated nodes (MicaZ from Crossbow -now MEMSIC-with an ad-hoc acquisition circuit to allow a dynamic range of 60 dB), and synchronised during 96 h, they obtained good results. Other works such as [21][22][23] used mobile phones for noise pollution monitoring. Although the authors obtained interesting results, in this paper there is a lack of information about the recording conditions which means less accuracy in the noise measurements. When evaluating noise parameters, the location of the measuring devices should follow specific rules [24].
In the previous works [16][17][18][19][20][21][22][23], the measurements focused on the L eq,T or its A-weighted version (ITU-R 468), L eqA,T . The calculation of these measurements was made by applying a frequency-selective filter that considers the frequency range around 3-6 kHz, where the human ear is most sensitive.
From the previous references, we can see similar works more oriented to the soundscape description. In [25,26], the referred authors focus the description of the soundscape in the evaluation of noise levels and psycho-acoustic parameters from recordings gathered in walk-around. However, none of them include an analysis based on binaural psycho-acoustic annoyance, which probably should be more accurate from the psycho-acoustic point of view in a human being.
In the literature, the evaluation of Psycho-acoustic Annoyance (PA) is mainly based on the work by Zwicker & Fastl [27]. Different authors have tried to focus this research taking into account the subjective perception combined with specific psychoacoustic parameters [28]. Loudness is an important factor affecting the perceived psychoacoustic annoyance of a sound [27,29]. This parameter is defined as the subjective intensity of sound, which qualifies sounds ranging from quiet to loud. It is mainly related to the perceived amplitude of a sound. There are some models to evaluate this parameter which deliver a numerical estimation of the loudness level based on some objective characteristics of the sound. The best known models are Zwicker's [27] and Moore's [30], which are standardised in the 2 parts of ISO 532 [8]. They are based on a monaural stimulus and pass through different filtering stages which are similar to the ear natural filters and simulates the hearing system. Recently, new research in psychoacoustics has shown that monaural loudness is not enough to assess the acoustic loudness and different models for binaural loudness were developed [28,31]. The implementation of the Zwicker's model [32] allowed a computing perspective for loudness monitoring. In a previous work [33], the authors developed a binaural loudness monitoring system by using the Zwicker's model and using binaural synthesis by means of a combination of Head-Related Transfer Functions (HRTFs) and microphone array processing.
In [34], the authors evaluated the environmental noise with a mobile application and display the results in a map using the kriging method. In [35], the authors measured 5 minutes sound pressure level (SPL) values with a WASN and also evaluated spatial cross-validation using the kriging method in a small city in Spain. In [36], the authors implemented a system oriented to compute the psychoacoustic parameters in the server side from recorded audio chunks. In [33], the authors implemented a edge-computing system by using different Raspberry Pi 3 (Rpi3) nodes in order to perform an evaluation of performance when computing the binaural loudness directly on the Rpi3 nodes.
In this paper, an improvement of the previous work [33] with the binaural loudness is done by incorporating into the model the computation of the binaural sharpness. Moreover, a simplified version of the Zwicker's psycho-acoustic annoyance model is evaluated (assuming specific conditions), mainly to assess the spatial distribution of the subjective annoyance in an indoor and outdoor environments.

Methodology
The implementation of a WASN to evaluate a simplified version of the psycho-acoustic annoyance based on an IoT framework considers different steps: (a) the hardware and software implementation of the nodes for the computation of psycho-acoustic parameters, (b) the IoT framework configuration and (c) the cross-validation and spatial statistical analysis.
In [33], the procedure to obtain the binaural loudness is described. In the present work, the binaural sharpness calculation is also implemented in order to evaluate a simplified version of the psycho-acoustic annoyance. As the specific loudness is obtained for each 1/3-frequency band, left and right sharpness is calculated using the weighting function g (z) described in Equation (1) to compute the value of the sharpness in each channel. In order to compute the binaural sharpness, the same proportion value as in binaural loudness calculation (see Equation (2)) is applied [28].

Simplifications for the Zwicker's Model
In this work, a simplified implementation of the Zwicker's annoyance model is proposed. Assuming, in this case, locations where the environmental noise have few tonal components that provokes low inter-modulation in the environmental sound were selected. Here, the model could be simplified by reducing Fluctuation Strength (F) and Roughness (R) components and making the term w FR ≈ 0 in the Zwicker's annoyance model [27]. With this simplification the equation for the PA is presented in Equation (3 where L 5 is the percentile 5 for the loudness, and w S is (S − 1.75) · 0.25 log 10 (L 5 + 10) if sharpness S > 1.75 and 0 if sharpness S ≤ 1.75.
Here this simplification is needed, because at least at this moment the computation of binaural loudness and sharpness takes a long time (5-6 s for a 1 s of recorded audio) with the Rpi3 nodes. The use of more parameters in the model would collapse the computing capacity of the node for monitoring uses.
The binaural psycho-acoustic annoyance (BPA) is introduced here like the PA but using the binaural parameters (i.e., binaural loudness and binaural sharpness) to compute this binaural annoyance.

Hardware and Software Implementation
The sensing and computing device is composed of two subsystems: the acquisition subsystem and the processing core. The processing core is based on a Raspberry Pi [39] and the acquisition subsystem is composed by an array of four microphones of a Sony PlayStation3 Eye camera (PS3-Eye). The PS3-Eye microphone array works to process each channel using 16-bit depth and a sampling rate of 16 kHz, with a signal-to-noise ratio (SNR) of 90 dB and a power consumption of 500 mAh. Different works have used this device, mainly due to its camera, e.g., for eye tracking systems [40] or to improve the precision of multi-touch displays [41], but also for sound source mapping [42]. In this work, the principal device used is the microphone array for the audio capture of the acoustic sensor device. The distance between the outermost microphones is 62 mm, and the two middle microphones have reversed polarity. The device was located in two different environments (one indoor and one outdoor).
The core of the processing part of the node is a Rpi3 Model B, that it is used in the acquisition and publishing stages. The technical characteristics of the Rpi3 include a 1.2 GHz 64-bit quad-core ARMv8 CPU, 1 GB RAM, 40 GPIO pins, 4 USB ports, a full HDMI port, an 10/100MB Ethernet port and integrated 802.11n Wireless LAN, and also Bluetooth Low Energy (BLE). Also, the USB ports and the GPIO pins are a good solution, providing the Rpi with the possibility to have a range of peripherals available, such as WiFi antennas, ZigBee modules, microphones, cameras and connections with other devices, e,g., Arduino. A Power-Over-Ethernet connection [43] was chosen to power the board, in which an Ethernet cable provide both electrical power and data connection to the Rpi3.
The design of the device was made according to the model shown in Figure 1 [33]. In the software section, the Rpi3 platform has a Linux-based OS used for the programming in a high-level programming language, such as Python. The main tasks of the developed algorithm are: acquisition and windowing of the audio signal, binaural synthesis, loudness and sharpness calculation, binaural psycho-acoustic parameters calculation, monoaural signal creation, third-octave band filtering stage, sound pressure level calculation, storage, publishing and processing of the results. Also, a copy of the different parameters evaluated is stored in the internal memory of the Rpi3 sensing device.
First, the acquisition stage is carried out, recording the audio from the microphone array in chunks of 1 second. Direction-Of-Arrival (DOA) estimates for a set of frequency bins are obtained by processing the audio input from the two central microphones in the Short-Term Fourier Transform (STFT) domain [33,44,45]. Then, in the following stage, the binaural synthesis is performed to create a signal that imitates the human hearing system response by using HRTFs and the DOA information resulting from the previous stage. Next, the binaural loudness and binaural sharpness are calculated from the binaural signal synthesised. Finally, the results are uploaded to a Cloud-based service (namely ThinkSpeak) for storing and publishing data. This service allows data processing, using the ThinkSpeak API, allowing the connection with Matlab. This connection allows to compute the PA, based on the approximation explained in Section 3.1 by Equation (3) and finally, the results are analysed in RStudio to investigate the temporal and spatial correlation between the different nodes. Then, the spatial statistics of the environment is analysed using the kriging method [46,47], in order to assess the most annoying areas. This analysis is made by determining the interpolated PA in different locations between the nodes and considering the error of this method. Finally, the evaluation is conducted following a leave-one-out cross-validation approach with a linear model technique and analyzing the Root Mean Squared Error (RMSE). At the same time, before the binaural synthesis stage, the monophonic channels are converted into a monaural signal to extract acoustic parameters of the full spectrum Sound Pressure Level (SPL) and from a third octave band A-filtering stage SPL (SPL(A)). In reference [30], the authors describe in detail the calibration procedure of the nodes for the measurement of the SPL and the binaural loudness and sharpness. Finally, these results are sent to the cloud and to the internal memory too. Also, the correlation between temporal evolution of these parameters and the psycho-acoustic one's is determined.

Results and Discussion
In this section, the WASN performance is evaluated carrying out two experiments in an indoor and outdoor environment. Moreover, the results of these experiments are presented and analyzed.
The experiments planned for this work consider an indoor and an outdoor environment. The indoor environment corresponds to the location of 4 nodes in different offices of the Computer Science Dpt at the ETSE in the University of Valencia and the outdoor environment corresponds to the location of different nodes in the façades of houses in a neighborhood from Murcia city close to a highway. The system performance is evaluated and the results are presented and analysed. Figure 2 shows the ThingSpeak configuration for the data channels in the indoor environment, collecting SPL, channels left and right for loudness (L), binaural loudness (BL), channels left and right for sharpness (S), and binaural sharpness (BS).

Indoor Environment
The different nodes considered in this experiment were placed at different offices in the third floor of the first block in the ETSE. Here two offices (one with a professor and fluent activity of students during few hours and other with only a professor working), the secretary office of the department and a laboratory with one student working during several hours were selected. Although the activity was different in each location, it was not too high, therefore the noise levels and the computed psycho-acoustic annoyance had low variability. All the nodes were connected to Internet and every node was linked to a different channel in the ThingSpeak platform during 8 h. Figure 3a shows the location of the four nodes from the upper part of the building. Table 1 shows the statistical values (mean and standard deviation) for each node for the indoor measurements. Table 2 shows the correlation matrix of binaural loudness, binaural sharpness and SPL.

Outdoor Environment Measurement
The outdoor environment corresponds to an urbanization near a highway. Therefore, the traffic noise is considerable. The measurement points were located in two different lines of buildings. The buildings in the second line (far away from the highway) are more protected from the traffic noise, but these buildings are near a pedestrian area where some children were playing during few hours. In this two different areas, the activity is also the criteria selected. In this location, five nodes were placed in different façades of the urbanization houses. Nodes 1 and 2 were located in the ground floor, nodes 3 and 4 were located in the second floor and node 5 was placed in the first floor. All of them in the façade of the corresponding house. Figure 3b shows the location of these nodes.   Table 3 shows the statistical values (mean and standard deviation) for each node for the outdoor measurements. Table 4 shows the correlation matrix of binaural loudness, binaural sharpness and SPL.  Tables 2 and 4, correlation of the temporal series of the different considered parameters (i.e., SPL, binaural loudness and binaural sharpness) show clear differences. This fact is related to the different nature of the parameters considered. In general, SPL is more related to loudness, but sharpness differs, because it depends mainly on the centroid of the spectral content of the signals in each environment.  In order to perform a validation of the values and the number of selected points for the linear modeling of the set of measured values, a Leave-One-Out-Cross-Validation algorithm in R-Studio software package (with the "Linear Model" method) was used for the whole set of time-averaged measurements, corresponding to the spatial locations. In Table 5 the results of this test are shown in order to select the nodes in the indoor environment. A linear model is generated to predict the % established in the first column (%test/%training) with the mean value of the BPA in every location. In this table, Rsquared is the Pearson's squared correlation coefficient and MAE is the maximum absolute error. RMSE is the minimum of the list of results obtained with this algorithm. According to this table, the selection of a validation set with 60% of the nodes is enough for our network. Table 6 shows the results of the cross-correlation test to predict the temporal values of BPA in the indoor environment. In addition, a linear model is generated to predict the values of BPA in the test percentage established in the first column (as in the previous case). Here, the whole temporal series was used in each location and from these results we can predict with high accuracy the test values training the model with the 50% of the measurements (as shown the RMSE and MAE values).

Modelling and Cross-Validation of the Measurement Sets
In Table 7, the results of the test to select the nodes in the outdoor environment are presented. This prediction is made using a linear model for the % established in the first column (%test/%training) with the mean value of the BPA at each location. According to this table, the selection of a validation set with 60% of the nodes is enough for the network of measuring platforms. Table 8 shows the results of the cross-correlation test to predict the temporal values of BPA in the outdoor environment and generates a linear model to predict the values of BPA in the test percentage established in the first column (as in the previous case). As in the previous case, we used the whole temporal series in each location and from this results we can predict quite well the test values training the model with the 50% of the measurements (as shown the RMSE and MAE values).
It can be seen that in both cases (the indoor and outdoor environments), the relationship between the BPA and binaural loudness + binaural sharpness is better correlated than the BPA and SPL.
In this case, location 4 in the outdoor environment corresponds to the place of the most exposed node to the traffic noise. Therefore, as the variance of the temporal values for the SPL and the binaural loudness and sharpness is the lowest one (see Table 3), both for the test set and the training set, the random selection of time samples for the LOOCV gave for the RMSE and MAE a very low value in the cases of 80% and 50% of the training set, making a really good matching for the temporal prediction of the SPL and the BPA (from the binaural loudness and sharpness).

Spatial Statistical Analysis (Kriging Method)
Spatial statistics allows the use of several methods. The most common methods are Inverse Distance Weighted (IDW), spline and kriging. IDW is a simple and intuitive deterministic interpolation method based on principle that sample values closer to the prediction location have more influence on prediction value than sample values farther apart. The major disadvantage of IDW is "bull's eye" effect and edgy surface. Spline is also a deterministic interpolation method which fits mathematical function through input data to create smooth surface. Kriging is a method based on spatial autocorrelation [46].
The computed PA from the measurements of binaural loudness and sharpness establish a data set in relation to different locations with its GPS coordinates, longitude and latitude. By denoting the PA computed with the simplified Zwicker's model at a location x as Y(x), this data set is defined as {Y(x), x ∈ D}, where D are all the locations of the modelling sets, following the kriging technique [47].
In this context, the objective of this proposed model is the prediction of Y(x 0 ) in any location x 0 , particularly those within the validation set. The annoyance reports contain information of the set of covariables included. Therefore, Y(x) is modeled as a tendency function of the covariables better involved in the process which explains its variability in a large extent plus some random error which is explained by the short term variability, i.e., where µ(x) = E[Y(x)] and δ(x) is a stationary Gaussian process with zero mean, whose spatial dependence characterization is given by the variogram γ [48]: where Var denotes the variance and h is an offset. This variogram represents the main function of the kriging method, which presents different procedures such as simple kriging, ordinary kriging, universal kriging, indicator kriging, co-kriging, etc, attending to different statistical aspects considered in the covariable set. Ordinary kriging is the most widely used kriging method. It is used to estimate a value at a point of a region for which a variogram is known, using data in the neighborhood of the estimation location and also can be used to estimate a block value [49].
In this study, the variogram has been calculated, using ordinary kriging and a spherical model, with the R statistical package for variogram fitting [50]. Figure 5a shows the spatial distribution of the BPA in the indoor environment and Figure 5b shows the spatial distribution of the BPA in the outdoor environment. The map shows the evaluation of the equation 3 with the kriging method showing the points between the measuring points. Also Figure 6a,b show the relative error of the kriging estimation showing that the error of the Kriging estimation is kept bounded inside the region between the measuring points below 25% and grows up as the estimated point is in the outer part and far away from the centroid of the measurement point.  Figure 6 shows the relative error distribution obtained in the Kriging estimation. The estimation absolute error range for the BPA in the indoor environment (see Figure 6a) is [0.0064, 0.1082] which is reasonably low and is bellow the BPA obtained from measurements. The greater error is focused in the region outside the building where no measurements are done. For the outdoor environment (see Figure 6b), the estimation error range is [0.9525, 24.2659] which is a larger range compared to the BPA obtained from measurements. The greater error is located in the region next to the highway.

A Preliminar Subjective Study
In order to check the subjective response in the indoor and outdoor environments, a short poll was conducted to a group of people related to each environment. The poll asked for some social aspects (gender and age) and the valuation of some aspects related to the soundscape, following the Swedish Soundscape-Quality Protocol which assess the perceived affective quality using the following eight adjectives: Pleasant, Unpleasant, Eventful, Uneventful, Monotonous, Exciting, Calm, and Noisy [51].
An Android application for mobile has been developed with the short poll (translated into Spanish) to allow a quick data gathering of the surveyants. In order to locate the response of every person answering the poll, the GPS coordinates have been registered with the same application.
The analysis of the results of this poll is shown in Table 9. According to this values the response in the indoor environment is generally qualified as pleasant and mostly uneventful (calm and nearly monotonous), although at certain hours students make the soundscape a bit exciting.
Taking into account the values in the outdoor environment, in general, the values are a little bit higher for noisy and unpleasant, which means that the people living in this area seems to like the place but stand the noise from the highway and like the noise made by children activities.

Conclusions
In this work, an IoT system collecting information from each acoustic environment in different environments (indoor and outdoor) is deployed. The nodes of the system gather audio information and compute psycho-acoustic parameters (loudness and sharpness) performing the binaural synthesis by considering the HRTFs. In the indoor environment, the nodes were placed in different rooms at the Computer Science Department of the University of Valencia and in the outdoor environment, the nodes were located in the outdoor of different houses in a residential area in Murcia.
From the binaural psycho-acoustic parameters gathered and the SPL, different statistical analysis were done. The correlation between these parameters was performed, showing that the binaural loudness and the SPL correlated well in nearly every location (indoor and outdoor, at a 99% significance level). Also, the binaural sharpness worked fine with the SPL, but with smaller Pearson coefficient values.
From the simplified version of the Zwicker's Psycho-acoustic Annoyance model, the values of the Binaural Psycho-acoustic Annoyance were computed to determine the subjective annoyance at each node location in the indoor and outdoor environment. At this point, different cross-validation analysis were done in order to determine the validity of the BPA spatial and temporal values. To this end, a Leave-One-Out-Cross-Validation algorithm was used to determine the best modelling set (a good training set was established with 50% of the values). In all cases (indoor and outdoor), the binaural loudness and sharpness explained better the relationship with BPA than the sound pressure level, both for spatial averages and for temporal series. Finally, a spatial statistical analysis was done in the indoor and the outdoor environment by using an Ordinary Kriging technique. The error of the Kriging estimation is kept bounded inside the region between the measuring points below 25% and grows up outside this region. Also a short on-site subjective study has been done. According to the results, the BPA model has been validated for the indoor and outdoor environment. Further work is needed for the spatial validation of the subjective assessment.
Currently, further work is being done with the implementation of the complete Zwicker's Psycho-acoustic Annoyance model, adding the Roughness and Fluctuation Strength algorithms to the implementation of the binaural model. Acknowledgments: Authors would like to thank the Spanish Ministry of Economy and Competitiveness for the funding, coming from the grant BIA2016-76957-C3-1-R and TIN2016-78799-P. Authors would also like to thank all the people who has given access to their homes and working places to do the measurements.
Author Contributions: Jaume Segura and Juan M. Navarro conceived and designed the experiments; José Montoya and Santiago Felici performed the experiments; José Montoya and Máximo Cobos programed the binaural sharpness algorithm, Juan J. Pérez programed link to ThinkSpeak and Ana M. Torres analyzed the data; Jaume Segura wrote the paper.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: