Leveraging Urban Sounds: A Commodity Multi-Microphone Hardware Approach for Sound Recognition

City noise and sound are measured and processed in order to draft appropriate government legislation and regulations, ultimately aimed at contributing to a healthier environment for humans. Urban noise analysis is primarily carried out to report misconduct to the appropriate authorities or to correct a misuse of council resources. We believe that urban sounds carry more information than has been extracted to date. In this paper we present a cloud-based urban sound analysis system for the capturing, processing and trading of urban sound-based information. By leveraging modern artificial intelligence algorithms running on a FOG computing city infrastructure, we show how the presented system can offer a valuable solution for exploiting urban sound information. A specific focus is given to the hardware implementation of the sound sensor and its multimicrophone architecture. We discuss how the presented architecture is designed to allow the trading of sound information between independent parties, transparently, using cloud-based sound processing APIs combined with an inexpensive consumer-grade microphone.


Introduction
Traditionally, urban noise is measured and analyzed with the purpose of drawing up appropriate regulations [1,2] for a healthier urban environment [3]. As such, most urban noise monitoring activities are carried out with the sole purpose of reporting misconduct (e.g., street noise, noisy bars) to the appropriate authorities or correcting a misuse of council resources (e.g., bus re-routing, city planning) [4]. Furthermore, noise data is often, if not always, collected and processed with dedicated expensive devices including high-end microphones [5,6]. The employment of costly hardware represents a limiting factor for the businesses that can be built around urban sound analysis [7]. As a result, the trading of sound-related information is almost invariably limited to measurements of absolute noise levels that sound experts provide to government-like institutions [8].
However, from urban sounds, it is possible to extract several other types of valuable information related to human activities (e.g., vehicle traffic, garbage collection, delivery of goods, gunshots, leisure activities, etc.). If this information could be made available in an easy-to-use manner and at a low cost, it could enable the development of several intelligent services and products in the urban environment.
We propose an urban sound classification system that is designed to enable the large-scale exploitation of urban sound. Our system intends to achieve this goal in the following ways:
1. Instead of using proprietary and costly acoustic sensors, we propose to explore the use of consumer-grade commodity hardware for noise capturing and measurement, as a means to drastically reduce the costs of sound exploitation systems.
In this respect, our idea is to follow a strategy similar to that of the successful driver-less car manufacturers [9]. They realised that if they could design an artificial intelligence solution that could run with inexpensive hardware (e.g., vision cameras), a driver-less car could become affordable. Before that, the driver-less car industry was very much heading towards vehicles equipped with expensive light detection and ranging (LIDAR) technology, heavily limiting the widespread adoption and commercialisation of those vehicles.
In the personal voice assistant market, Amazon took a similar approach. They developed Alexa, an AI solution that runs on a fairly inexpensive US$100 multimicrophone-based device capable of talking and, eventually, answering any question [10].

2. Our system will offer sophisticated, yet easy-to-use, sound processing and artificial intelligence (AI) algorithms for extracting non-straightforward information from urban sounds. Such algorithms will only marginally depend on the capabilities of the sound capturing device.
For that, the system will be heavily cloud-based (more precisely, based on FOG computing), and the services will be available through web-based APIs. By doing so, the data collected by the consumer-grade sound capturing devices can be exploited by third-party applications, possibly through a pay-per-use business model.
The overarching goal of the envisioned idea, hopefully reflected in this paper, is to show that this approach can enable new socially responsible business opportunities in the area of urban sound analysis by enabling the creation of intelligent products and services for the city.

Urban Acoustic Event Detection Applications
There is a significant body of work focusing on Acoustic Event Detection (AED) in urban environments. In general terms, AED has been mainly driven by classification-based approaches. Among them, we should differentiate between one-class novelty detection, oriented to the identification of events of interest against a majority class (e.g., background noise), and multiclass classification, trained to detect a closed set of previously defined acoustic classes of interest (e.g., gunshots, screams).
Within the one-class approach, Ntalampiras et al. [11] assume the existence of an unbounded acoustic class to detect hazardous situations, instead of the classical bounded multiclass scenario, because of the local, occasional, diverse and unpredictable nature of this kind of sound in a real scenario. In [12], Nakagima et al. tackle a similar problem using a deep neural network sound recognition algorithm, using both real-life sounds and artificially mixed sounds. Some researchers have concluded that, since the unbounded class is a complex problem for any machine-learning algorithm to solve, multiclass AED could take into account the signal-to-noise ratio of the events [12,13].
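As a minimal sketch of the one-class idea described above, the following Python example models only the background class and flags any frame whose level departs strongly from it. It is not the method of [11] or [12]: the energy-based feature, the class name `BackgroundNoveltyDetector`, the threshold `k` and the synthetic test frames are all assumptions made for illustration, and the reported "SNR" is only a rough proxy (frame level minus the learned background level).

```python
import math
import statistics


def frame_level_db(frame):
    """Return the RMS level of one frame of samples in dB (relative to full scale 1.0)."""
    rms = math.sqrt(sum(s * s for s in frame) / len(frame))
    return 20.0 * math.log10(max(rms, 1e-12))


class BackgroundNoveltyDetector:
    """One-class detector: learns the background level only, flags frames far above it."""

    def __init__(self, k=3.0):
        self.k = k            # hypothetical threshold, in standard deviations
        self.mean_db = None
        self.std_db = None

    def fit(self, background_frames):
        levels = [frame_level_db(f) for f in background_frames]
        self.mean_db = statistics.mean(levels)
        # Floor the spread so a perfectly steady background does not give a zero threshold.
        self.std_db = max(statistics.pstdev(levels), 0.5)
        return self

    def snr_db(self, frame):
        """Level of the frame above the learned background level, in dB."""
        return frame_level_db(frame) - self.mean_db

    def is_novel(self, frame):
        return self.snr_db(frame) > self.k * self.std_db


def tone(amplitude, n=256, period=16):
    """Synthetic test frame: a sine wave with the given amplitude."""
    return [amplitude * math.sin(2 * math.pi * i / period) for i in range(n)]


# Background frames with slightly varying level (invented values).
background = [tone(a) for a in (0.010, 0.011, 0.009, 0.010, 0.012)]
detector = BackgroundNoveltyDetector().fit(background)

quiet_frame = tone(0.010)
loud_frame = tone(0.5)   # stands in for an event such as a scream or a gunshot
print(detector.is_novel(quiet_frame), detector.is_novel(loud_frame))
```

A multiclass detector would instead be trained on labelled examples of every class of interest; the one-class form shown here needs background recordings only, which is what makes it attractive when hazardous sounds are rare and unpredictable.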
On the other hand, multiclass AED has also provided good results in different application domains for outdoor monitoring. For instance, in [14], a reliable detector of audio events in very noisy environments is described. More recently, in [15], a two-class classification approach has been shown to outperform its one-class counterpart in detecting anomalous events for the purpose of producing reliable road traffic noise maps within the DYNAMAP project [16]. In this context, valuable literature resources can be found in the proceedings of the annual DCASE event (Detection and Classification of Acoustic Scenes and Events), a competition that presents different challenges in each call, including several real-life ones. These challenges range from acoustic scene classification [17], through the tagging of online content from Freesound, to bird audio detection [18], among others. Among the different works presented at DCASE, it is worth mentioning the approach introduced in [19], based on the use of a multimicrophone architecture for acoustic event detection in indoor environments. This solution takes advantage of the fact that the device is already installed at home. The work also argues that the same approach could plausibly perform well in outdoor environments too.
Finally, let us note that specific challenges have to be addressed when employing microphone arrays for AED purposes [20]. As reported in [21], a grid-based method can be successfully leveraged for the localization of multiple sound sources within a Wireless Acoustic Sensor Network (WASN) setup, where each sensor node, containing an array of microphones, can process the direction-of-arrival of a sound sample inside a processing node. For instance, in [22], the authors use a far-field random array to localise and separate several noise sources by means of a two-stage noise source algorithm, mainly solved by Tikhonov regularisation and maximum likelihood estimation. FISONIC is a FIWARE-based project focused on the continuous analysis of the urban environment with multiple microphones in a single sensor [6]. These are the only two projects that employ multiple microphones where signal processing is carried out in the cloud [23]. It is found that an affordable multi-microphone platform enables a substantial improvement in the accuracy of the AED algorithms deployed in the sensors of a WASN.
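To illustrate what processing the direction-of-arrival on a multi-microphone node can involve, the sketch below estimates the arrival angle for a single pair of microphones from the time difference of arrival, found by brute-force cross-correlation. This is a textbook two-microphone, far-field approximation and not the grid-based method of [21] or the two-stage algorithm of [22]; the function names, the microphone spacing and the sampling rate are assumptions chosen for the example.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, approximate value in air at about 20 degrees Celsius


def best_lag(ref, other, max_lag):
    """Return the lag (in samples) at which `other` best matches `ref`.

    A positive lag means the sound reached `other` later than `ref`.
    """
    best, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, r in enumerate(ref):
            j = i + lag
            if 0 <= j < len(other):
                score += r * other[j]
        if score > best_score:
            best, best_score = lag, score
    return best


def direction_of_arrival(ref, other, mic_distance_m, sample_rate_hz):
    """Estimate the far-field arrival angle in degrees (0 = broadside) for a microphone pair."""
    # Physically possible delays are bounded by the microphone spacing.
    max_lag = int(math.ceil(mic_distance_m / SPEED_OF_SOUND * sample_rate_hz))
    delay_s = best_lag(ref, other, max_lag) / sample_rate_hz
    # Clamp to [-1, 1] so rounding of the lag cannot push asin out of its domain.
    sin_theta = max(-1.0, min(1.0, delay_s * SPEED_OF_SOUND / mic_distance_m))
    return math.degrees(math.asin(sin_theta))


def click_at(index, length=256):
    """Synthetic recording: silence with a short decaying click starting at `index`."""
    signal = [0.0] * length
    for k, value in enumerate([1.0, -0.6, 0.3, -0.1]):
        signal[index + k] = value
    return signal


# The same click reaches the second microphone 7 samples later (invented scenario).
mic_a = click_at(100)
mic_b = click_at(107)
angle = direction_of_arrival(mic_a, mic_b, mic_distance_m=0.1, sample_rate_hz=48000)
print(round(angle, 1))
```

With more than two microphones, the pairwise estimates can be combined to resolve the front/back ambiguity that a single pair leaves open; practical systems also tend to use more noise-robust correlation measures than the plain product used here.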

Proposed Urban Sound Measurement System
Sound processing is proposed as a means to allow the mining of urban sound for the generation of valuable derived information. For a successful exploitation, this information is then traded for a profit between parties. Seen as a whole, a successful urban sound processing system must be designed to maximize the trading element of sound-derived information. Figure 1 gives an overview of the presented system. The proposed solution is composed of a variable number of microphones, all wirelessly connected to a local server located in the vicinity of the microphone network (FOG computing). The same server, responsible for part of the processing, is connected to a cloud service similar to Amazon Web Services (AWS), where a second processing stage would take place. Here the actual sound processing algorithms are used remotely via an API scheme. The existence of a sound processing API structure would imply the possibility of having multiple API providers, possibly competing among themselves for the best performance. Once the sound information is extracted, its trading and the establishment of the actual business service take place in a more traditional and customizable environment (e.g., web page, smartphone). This is where the value-added trading between client and service provider takes place.
It should be noted that the existence of a FOG processing unit allows for less stringent data rate requirements on the city network installation. Furthermore, the important issue of data privacy, typical of sound recording applications, is partially addressed by the use of certificate exchange security on the edge computing platform.
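The data rate argument can be made concrete with a small sketch. Assuming, purely for illustration, that the FOG node reduces each second of 16-bit audio to a short JSON summary before forwarding it to the cloud, the uplink volume drops by orders of magnitude. The record layout, the field names and the 16 kHz sampling rate below are hypothetical and are not part of the system described in this paper.

```python
import json
import math


def summarise_second(samples, sensor_id, timestamp_s):
    """Reduce one second of audio samples (floats in [-1, 1]) to a small summary record."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return {
        "sensor_id": sensor_id,          # hypothetical field names
        "t": timestamp_s,
        "level_dbfs": round(20.0 * math.log10(max(rms, 1e-12)), 1),
        "peak": round(max(abs(s) for s in samples), 3),
    }


def uplink_saving(samples, record, bytes_per_sample=2):
    """Ratio between the raw PCM size and the JSON summary size for the same second."""
    raw_bytes = len(samples) * bytes_per_sample
    summary_bytes = len(json.dumps(record).encode("utf-8"))
    return raw_bytes / summary_bytes


SAMPLE_RATE = 16000  # assumed sampling rate in Hz

# One second of a synthetic 440 Hz tone standing in for captured audio.
one_second = [0.1 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE)]
record = summarise_second(one_second, sensor_id="mic-01", timestamp_s=0)
print(record)
print("raw/summary size ratio:", round(uplink_saving(one_second, record)))
```

In a real deployment the summary would more likely carry classifier features or detected event labels than a bare level, but the trade-off is the same: the more processing the FOG node performs, the less raw audio has to cross the city network.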

Multimicrophone Sensor and Commodity Hardware
In 2015, Amazon commercialised their first intelligent personal assistant, named Amazon Echo (https://en.wikipedia.org/wiki/Amazon_Echo). The challenge for Amazon was to design an artificial intelligence hosted in the cloud, later named Alexa, that would allow an inexpensive multimicrophone unit to ideally understand and answer any voice question. As Echo sales show, Amazon greatly succeeded in this task.
As Amazon did with Echo, we propose a similar system structure where we host most of the sophisticated sound processing algorithms in a FOG computing platform (https://www.cisco.com/c/dam/en_us/solutions/trends/iot/docs/computing-overview.pdf) (refer to Figure 1). Freeing the microphones from the need to perform any heavy sound processing allows for the use of an inexpensive network of commercial-grade microphones. The possibility of employing inexpensive microphones would not only mean a lower overall cost for the system but would also open the commercialization of urban sound sensors to non-experts in sound. This scenario is much more convenient than the one where urban microphones are expensive professional-grade devices, built and sold by fewer companies.
The Amazon Echo (docs.aws.amazon.com/a4b/latest/ag/a4b-ag.pdf) itself represents a great example of commodity hardware that can be purchased or rented inexpensively. Once customized for outdoor use, the Amazon Echo is certainly one possible candidate for the sound sensor that we intend to explore in this research. Other types of commercial-grade microphones will be considered in the future. The Amazon Echo is not only economically convenient; its multimicrophone architecture is also a promising feature that, if exploited well, could offer great additional advantages.

Business Opportunities
Frost and Sullivan's research estimates that the global Smart City market will be valued at US$2T by 2025, with AI being a key driver and Europe the region with the largest number of smart city project investments globally [24]. The mining of urban sound is a rich source of information from which to develop services in the context of the Smart City. Imagine, for instance, a high-end real-estate agency that wishes to offer a characterization of the soundscape of the properties in its catalogue as an added value to its customers.
In this case, the need for characterizing the soundscape is temporary and should include measurements from several locations and over time (from days to a few weeks). To do so, the real-estate agency needs either to install a network of dedicated high-end acoustic sensors or to hire a sound professional to make several manual measurements. Both options can be prohibitively expensive if the need for the measurements is not permanent.
We aim at allowing new businesses to exploit urban sounds within the scope of Smart Cities. The proposed technology allows organizations to leverage their services by taking advantage of noise information collected from a network of low-cost commodity sensors. Its main differentiating factor is that it will not depend on any existing infrastructure. Instead, since it relies on inexpensive commodity hardware, the solution can be deployed, even temporarily, at different locations to conveniently mine noise/sound information and build services. Let us describe two scenarios that illustrate the envisioned use of the proposed solution.
Noise City Management in Barcelona: Throughout the year, the Barcelona City Council receives many complaints that are difficult to deal with because they require the identification of the origin of the problem. This is often done by sending a sound technician to measure a specific location, which generally reveals a static and limited description of the problem. In such a scenario, the city council could install the proposed solution in the area of interest. For instance, a small sound network installed around the Sagrada Familia church could be used to determine what percentage of noise is coming from traffic, leisure activities, or urban services (e.g., garbage collection). Once the study is completed, the sensor network would be taken down. With this information, the city council could take the necessary actions or create noise-limiting plans to protect the neighbours.
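One possible way to turn classified sound events into the percentage breakdown mentioned in this scenario is sketched below. It assumes that the classification stage outputs, for each event, a source label, a level in decibels and a duration; this format, the function name and the example values are invented for illustration. Levels are converted to linear energy before being summed, since decibel values cannot be added directly.

```python
from collections import defaultdict


def noise_share_by_source(events):
    """Return each source's share (in %) of the total acoustic energy.

    `events` is an iterable of (label, level_db, duration_s) tuples.
    """
    energy = defaultdict(float)
    for label, level_db, duration_s in events:
        # Convert the dB level to a linear energy-like quantity, weighted by duration.
        energy[label] += 10.0 ** (level_db / 10.0) * duration_s
    total = sum(energy.values())
    if total == 0:
        return {}
    return {label: 100.0 * e / total for label, e in energy.items()}


# Invented example values, for illustration only.
events = [
    ("traffic", 70.0, 600),
    ("leisure", 70.0, 300),
    ("urban services", 80.0, 10),
]
shares = noise_share_by_source(events)
for label, share in shares.items():
    print(f"{label}: {share:.0f}%")
```

In this example the ten seconds of urban services at 80 dB carry as much energy as 100 seconds of traffic at 70 dB, which shows why short but loud activities can account for a noticeable share of the total.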
Security at Exhibitions or Festivals: Barcelona, like many major cities, hosts several international festivals (e.g., Mobile World Congress) that attract millions of people every year. Security is a paramount priority for the responsible authorities. State police and local law enforcement authorities spend a great deal of energy and money to guarantee the maximum level of security for the attendees. Within this scenario, the organisations behind such a large exhibition would rent the presented solution and, via its cloud, share the real-time audio information with the local authorities, who could be immediately alerted in case of a possible threat (e.g., gunshot sounds or screams).

Conclusions
In this paper we have presented an urban sound processing system that employs inexpensive multimicrophone sensors together with a FOG computing architecture. The proposed system represents a valuable solution for the data collection, processing and trading of urban sound. In a way, similarly to what Amazon did with their cloud product Alexa, we proposed to partially decentralize the sound processing task to a more agile FOG server. This will naturally free the consumer-grade sound sensors from the difficult task of processing, allowing noise-based services and products to be developed inexpensively. Additionally, the proposed system will allow all sound processing algorithms to be used remotely by any third party via convenient APIs. Lastly, the cloud structure of the proposed solution would facilitate the trading of sound-derived information between parties who are not experts in the field of sound signal processing, directly fostering the business aspect behind urban sound exploitation.

Figure 1. System overview of the proposed solution.