Article

An Urban Acoustic Rainfall Estimation Technique Using a CNN Inversion Approach for Potential Smart City Applications

by
Mohammed I. I. Alkhatib
1,
Amin Talei
1,*,
Tak Kwin Chang
1,
Valentijn R. N. Pauwels
2 and
Ming Fai Chow
1
1
Department of Civil Engineering, School of Engineering, Monash University Malaysia, Jalan Lagoon Selatan, Bandar Sunway 47500, Selangor, Malaysia
2
Department of Civil Engineering, Monash University, Clayton, VIC 3800, Australia
*
Author to whom correspondence should be addressed.
Smart Cities 2023, 6(6), 3112-3137; https://doi.org/10.3390/smartcities6060139
Submission received: 10 October 2023 / Revised: 28 October 2023 / Accepted: 6 November 2023 / Published: 16 November 2023

Abstract

The need for robust rainfall estimation has increased with more frequent and intense floods due to human-induced land use and climate change, especially in urban areas. Besides the existing rainfall measurement systems, citizen science can offer unconventional methods to provide complementary rainfall data that enhance spatial and temporal data coverage. This demand for accurate rainfall data is particularly crucial in the context of smart city innovations, where real-time weather information is essential for effective urban planning, flood management, and environmental sustainability. Therefore, this study provides a proof of concept for a novel method of estimating rainfall intensity from its recorded audio in an urban area, which can be incorporated into a smart city as part of its real-time weather forecasting system. This study proposes a convolutional neural network (CNN) inversion model for acoustic rainfall intensity estimation. The developed CNN rainfall sensing model showed a significant improvement in performance over the traditional approach, which relies on the loudness feature as an input, especially for simulating rainfall intensities above 60 mm/h. In addition, a CNN-based denoising framework was developed to attenuate unwanted noises in rainfall recordings, achieving up to 98% accuracy on the validation and testing datasets. This study and its promising results are a step towards developing an acoustic rainfall sensing tool for citizen-science applications in smart cities. However, further investigation is necessary to upgrade this proof of concept for practical applications.

1. Introduction

Smart cities represent a vision for urban development that relies heavily on technology to improve the overall efficiency of urban systems and enhance citizens' quality of life [1]. The need for smart cities grows as the world continues to urbanise at an unprecedented rate. Rapid urbanisation brings economic and environmental challenges such as congestion, pollution, resource scarcity, rapid flooding, and increased energy consumption [2,3,4]. To address these challenges, cities are turning to technology to optimise transportation [5], energy distribution [6], waste management [7], urban drainage design and management [8], healthcare [9], education [10], and other public services. These technologies harness the power of data, the Internet of Things (IoT), artificial intelligence, and other innovations to create more resilient urban environments [1].
However, one area that still lacks attention within the smart city concept is the development of robust systems that contribute to real-time weather monitoring and forecasting; most countries still rely on conventional techniques such as ground-based rain gauges and weather radars [11] to monitor or forecast weather parameters such as rainfall.
From a hydrological perspective, there is also a need for improved spatial rainfall resolution in urban areas, as floods have become more frequent and intense due to the increase in impervious surfaces and climate change, causing human fatalities and economic losses. Rainfall intensity estimation tools that are robust in both spatial and temporal resolution can therefore play a pivotal role in flood forecasting and mitigation. This is even more critical in tropical urban areas, where rainfall is more intense and patchily distributed at local scales than in temperate climates. Nevertheless, deploying conventional rain gauges for spatially high-resolution coverage is costly and challenging in urban areas with tall buildings, as such sensors have specific installation requirements, and most low-to-middle-income countries cannot afford such equipment [11]. Moreover, each technique has its limitations; most importantly, weather radars depend on rain gauges for calibration [11]. Yet a rain gauge can only provide a point estimate of rainfall, meaning a dense network is required to provide high spatial resolution and support a robust calibration process for radars.
In response to the spatial limitation of rain gauge networks, recent efforts focusing on developing unconventional, advanced, robust rainfall sensing techniques are constantly emerging, such as telecommunication networks [12], image-based rainfall estimation [13], and citizen-science rain gauges [14]. The main goal of such techniques is not to replace the existing rainfall sensing infrastructures such as rain gauges and weather radars but rather to provide a complementary data source to fill gaps in the collected data and improve the spatial resolution of rainfall observation over a catchment.
Of all the unconventional rainfall sensing techniques, citizen science stands out the most due to the number of successful applications emerging in different countries such as the USA [15], Thailand [16], Nepal [17], the Netherlands [18], and Mexico [19]. One of these projects is the Community Collaborative Rain, Hail and Snow Network (CoCoRaHS) [15]. CoCoRaHS, established in 1997, involves US citizens of different knowledge backgrounds and skills measuring daily rainfall using a mini rain gauge installed at their home or in their backyard and reporting it on a website. So far, more than 200,000 participants are reportedly involved in the CoCoRaHS project [20].
Due to the success of citizen science in rainfall estimation and the lack of further development in terms of tools for citizens to utilise in rainfall intensity data collection, the present study aims to explore the idea of acoustic rainfall estimation in an urban area for citizen-science application for three reasons.
First, citizen-science rain gauges can be easily employed in rural areas; however, their installation in an urban area could be challenging as such an area is mainly surrounded by tall buildings, which may obstruct deployment.
Secondly, there have been almost two decades of successful applications of the passive acoustic listener (PAL) technique in marine environments for rainfall estimation [21]. The idea of the PAL is based on identifying rainfall droplet acoustic signatures in a marine environment and applying an inversion algorithm to match it with a predefined acoustic signature captured in a laboratory setting for different raindrop sizes. This methodology will eventually infer the rainfall drop size distribution and intensity.
The third reason is that almost all studies in the literature have used acoustic rainfall analysis for rainfall classification purposes, which may not fulfil the rainfall data needs for hydrological modelling and forecasting applications where the actual rainfall depth with its spatial and temporal characteristics is needed. Furthermore, rainfall classification is a subjective analysis as the predefined rainfall ranges used to classify rain can differ significantly from one country to another. For example, in Malaysia, very heavy rainfall is defined for rainfall intensities ≥60 mm·h−1 [22], while in China, intensities ≥25 mm·h−1 are classified as violent rains [23]; however, a 25 mm·h−1 rainfall intensity is considered merely moderate in Malaysia. Therefore, a classification model developed in China would not be helpful in a tropical country like Malaysia. Thus, to correct the direction of future research, this study explores acoustic rainfall intensity estimation.
For the abovementioned reasons, this study focuses on developing an acoustic rainfall estimator for urban areas with potential citizen-science application in smart cities. However, two aspects must be considered in developing a viable technique for urban areas: (1) urban noise handling and (2) dealing with the varying rainfall sound characteristics generated by various impact surface types. Most recently, Wang et al. [24] conducted a study on classifying rainfall intensity into six categories from surveillance audio data using synthetically generated rainfall data and induced urban noise. This work and some similar studies on synthetic or natural rain audio [23,25,26,27] have focused on rainfall classification and have ignored denoising procedures for the captured audio recordings, perhaps due to the complexity of the process or because it might not be seen as a crucial step in rainfall classification. For example, Brown et al. [28] reported that applying speech filtering techniques to rainfall audio recordings would automatically remove the rain sound from the recordings, since such filtering techniques remove stationary noise (also known as background noise) and rainfall noise behaves like background noise. Such a filtering technique therefore might not help exclude urban noises, as it may remove the rain sound, which is the desired data. In conclusion, there is a need for an approach to at least attenuate, if not remove, unwanted urban noises (e.g., passing cars, thunder, animal calls, human speech, etc.) in a rainfall recording. On this point, attention is drawn towards bioacoustics studies such as the work of Brown et al. [28] and Ferroudj [29], where classification models were developed to classify 1 min audio recordings in a forest area as either containing rainfall noise or not, after which the rainfall audio clips were removed from the audio datasets. Therefore, this study investigates a similar approach where a classification model is developed using state-of-the-art convolutional neural networks (CNNs). However, rather than classifying 1 min audio clips, shorter frames within each 1 min clip are classified as either rainfall or no-rainfall, and the no-rainfall frames are then replaced with adjacent rainfall frames. Further justification for this proposed approach is provided in the methodology.
The second aspect required of the targeted acoustic rainfall estimation technique is handling the different acoustic characteristics of sound generated by raindrops impacting various types of surfaces (e.g., concrete, soil, vegetation, glass, etc.) in an urban setting. Hypothetically, rainfall can be estimated from a rainfall audio recording by considering the association between loudness levels and rainfall intensity. As a rule of thumb, a higher rainfall intensity generates louder sound levels than a low-intensity rain. Thus, distinguishing rainfall intensity based on sound loudness is possible for a specific environment. However, when considering two different surfaces or environments (e.g., flexible and solid surfaces), the captured sound will be a mix of two different acoustic characteristics. Hence, it will be challenging for a single model to estimate rainfall intensity for a mixed-surface environment based on sound loudness levels alone. Therefore, other acoustic features must be combined with loudness features to enable a model to differentiate rainfall intensity in a mixed urban environment. The main question is thus which acoustic features would work best in a model estimating rainfall intensity from a mixed environment. Several studies have highlighted the challenge of extracting acoustic features and selecting those that can be mapped to a desired acoustic class [25,30,31,32,33,34,35]. However, recent advancements in computer vision provide solutions that can eliminate the manual feature extraction and selection process, and these have shown considerable success in different fields such as bioacoustics [30], flood detection [36], fire event detection [37], transportation monitoring and planning [38], urban/coastal bottom modelling and object/vessel extraction [39,40], and other fields of application. In this study, a state-of-the-art convolutional neural network (CNN) is adopted for rainfall intensity estimation from audio recordings, where the initial convolutional and pooling layers in the network are expected to carry out the feature extraction and selection process, while the last layer carries out the regression [41].
To this end, the primary objective of this study was to investigate the possibility of using a deep learning (CNN) approach for rainfall intensity estimation, rather than classification, in a tropical urban area, to better understand the limitations of using such a technique in an urban setting, and to propose future directions for bringing this technique to actual applications. The current study is therefore the first to propose CNN-based urban acoustic rainfall intensity sensing from an open environment using CNN inversion for potential citizen-science applications. This would work as a complementary data source to existing rainfall sensing tools to improve the spatial resolution of rainfall. In addition, this study proposes a novel CNN-based urban acoustic rainfall denoising framework to remove unwanted noises from rainfall audio recordings. Finally, this study provides the largest urban tropical rainfall dataset reported in the literature, collected on-site and off-site, with rainfall intensities higher than 30 mm·h−1.

2. Methodology

2.1. Overview

The methodology for this study is summarised schematically in Figure 1. The study revolves around two main components, each with four stages. The first component is the CNN denoising framework development, which consists of four main stages: (1) data collection of urban noise audio recordings; (2) feature extraction of 5 s log-Mel spectrograms; (3) the development of a binary CNN classifier using the 5 s log-Mel spectrograms as inputs to classify audio recordings into rainfall or no-rainfall recordings; and (4) the development of a CNN-based urban rainfall denoising framework to remove no-rainfall audio segments from each 1 min rainfall audio clip. The second component is the CNN-based acoustic rainfall sensing model development, which also consists of four main stages: (1) data collection of rainfall audio recordings and weather station readings both on-site and off-site; (2) feature extraction of 1 min decibel and log-Mel spectrograms; (3) the development of a CNN acoustic rainfall estimator using the decibel and log-Mel spectrograms as inputs, identifying the best spectrogram format for predicting rainfall intensity in an urban environment, and comparing its performance on the validation and testing datasets against an artificial neural network (ANN) model with loudness features as input; and (4) testing the model on an unseen dataset collected off-site using a professional recorder. More detailed information and justification for the proposed methodology are provided in the following sections.

2.2. Study Site and Data Collection

The study site of this research is Monash University (Malaysia Campus) and its surroundings (up to a ~2 km radius). Data collected at the Monash University Malaysia campus are referred to as on-site, while any data collected outside the campus are referred to as off-site in this study.

2.2.1. On-Site Calibration and Validation Dataset

This study was conducted at Monash University (Malaysia campus) and its surrounding area (see Figure 2). A total of five points (A–E) were selected for sound data collection, as shown in Figure 2, with the details of each environment listed in Table 1. These five environments are physically and acoustically different, covering a wide range of sounds generated by rainfall impacting various surfaces typically seen in an urban environment. The main criterion in selecting the five locations was to provide the model with the boundaries of the problem from a loudness point of view, spanning the loudest to the quietest environment, which was achieved through locations A and B, respectively. The remaining locations provided data within these loudness boundaries to help the model generalise better to new areas. Moreover, other constraints contributed to the selection of the five locations to maximise the amount of collected data, namely (1) a constant electricity supply to the recorder and (2) the security of the recorders against human interference.
In hydrological applications, the recorded data of a rain gauge (weather station) can be generalised over a catchment depending on the catchment topography, climate, etc. For a tropical and fairly flat (non-mountainous) area of 1 to 900 km2, one rain gauge is recommended for reliable rainfall measurement by different sources, including the World Meteorological Organization (WMO) [42] and the hydrology textbook by Raghunath [43]. Therefore, one rain gauge was used at the data collection site in this study. In similar works in the literature, Bedoya et al. [44] placed their recorders 100 m away from the rain gauge, while Ferroudj [29] placed the study's four recorders approximately 1 km from the weather station.
Moreover, it is worth noting that none of the selected locations are purely surrounded by a single surface but rather a mixture of surfaces from concrete, interlock tiles, steel, glass, etc. This mimics what would be expected for off-site urban environments. Although there are several surface combinations (or environments) in an urban area that those five selected locations might not cover, they still represent the most common environments that citizens might encounter in an urban setting.
Also, it is worth keeping in mind that this study is the first step towards developing a potential citizen-science-based acoustic rainfall sensing technique. Therefore, a continuous and diverse dataset is crucial for developing a rainfall-sensing tool with strong generalisation capability. The data collection in this study was conducted from 1 September 2020 to 31 August 2022, covering a period of two years. The rainfall events with varying durations and intensities from the two monsoon seasons (the Southwest monsoon season from May to September and the Northeast monsoon season from November to March) in Peninsular Malaysia [45] were captured at five data collection points.
The rainfall data were collected using a weather station (Watchdog Spectrum 2000 brand (Aurora, IL, USA)) with a 1 min resolution and a minimum detection sensitivity of 0.25 mm·min−1 (15 mm·h−1). Moreover, this study focused on using professional audio recorders rather than smartphones or mixed devices to reduce uncertainty and ensure better control over the audio data collection process, especially in the model development stage. Such a decision was made to avoid additional uncertainty from different phone brands, audio recording apps, and even audio file formats until a clear understanding of the potential and limitations of the proposed acoustic rainfall sensing technique was formulated. In this study, professional Zoom H2n field sound recorders were used for audio data collection, where the audio files were saved in an uncompressed WAV format at a sampling frequency of 44.1 kHz at 16-bit depths. The recorder gain was set to 5.0 out of 10.0. The data collection resulted in 18,404 (1 min resolution) pairs of rainfall and audio data with a maximum rainfall intensity of 3 mm·min−1 (180 mm·h−1).
The dataset was split into training (80%), validation (10%), and testing (10%) datasets with a similar statistical distribution to allow for a fair and consistent assessment of the model performance. The training and validation datasets were used for model calibration (including input combination selection) and fine-tuning the model parameters; moreover, the testing dataset was reserved as an unseen dataset to evaluate the final calibrated model’s performance. The data-driven models were trained on a mixed dataset from the five locations to allow the models to learn the dataset’s variations and better generalise to a new site within the acoustic range of the five locations.
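As an illustration of this splitting step, the sketch below produces an 80/10/10 split while keeping a similar rainfall-intensity distribution in each subset by stratifying on intensity quantiles. The DataFrame layout and column name are hypothetical, and the exact stratification procedure used in the study is not specified in the text.

```python
import numpy as np
import pandas as pd

def stratified_split(df, intensity_col="rain_mm_min", seed=42):
    """Split paired audio/rainfall records 80/10/10 while preserving a
    similar rainfall-intensity distribution in each subset."""
    rng = np.random.default_rng(seed)
    # Bin intensities (quartiles here) so each bin is sampled proportionally.
    bins = pd.qcut(df[intensity_col], q=4, duplicates="drop")
    train, val, test = [], [], []
    for _, group in df.groupby(bins, observed=True):
        idx = rng.permutation(group.index.to_numpy())
        n = len(idx)
        n_train, n_val = int(0.8 * n), int(0.1 * n)
        train.append(df.loc[idx[:n_train]])
        val.append(df.loc[idx[n_train:n_train + n_val]])
        test.append(df.loc[idx[n_train + n_val:]])
    return pd.concat(train), pd.concat(val), pd.concat(test)
```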
Figure 2. Monash University Malaysia campus and the selected points for data collection [46].

2.2.2. Off-Site Validation Dataset

An off-site validation dataset was collected with audio recorders to gauge the generalisation capacity of the model developed on the on-site dataset. Although the information learned by the model could be limited because only five data collection points were used in calibration, off-site validation provides a reliability check on the model for future applications. For this, two sites for audio data collection were considered. The first (Off-Site #1) is a fully urbanised region (with an approximate area of 0.95 km2) located 0.3 to 1.2 km away from the main weather station (Rain Gauge #1) of this study (see Figure 3). The second site (Off-Site #2) is roughly 6 km away from the main weather station (Rain Gauge #1). In addition, Rain Gauge #2 was installed near the second audio data collection site to robustly measure rainfall for better model validation (see Figure 3). The off-site data collection spanned from June to November 2022, resulting in 1448 min (~24 h) of rainfall data from 21 rainfall events and their corresponding audio data captured from diverse urban environments at Off-Sites 1 and 2.

2.3. Acoustic Feature Representation

In this study, two different acoustic time-frequency features (2D audio representation) are examined: (1) decibel spectrograms and (2) log-Mel spectrograms. Both represent audio as an image where the x-axis is time, the y-axis is frequency, and the colour intensity is the energy or loudness of a noise. Figure 4a,b provide a sample of both the decibel and log-Mel spectrogram.
The main difference between the decibel spectrogram and the log-Mel spectrogram is the way the frequency is represented. The decibel spectrogram gives equal weight (linear) to all frequencies in a spectrogram, while the log-Mel spectrogram resembles how humans perceive sound, wherein the energy of sound is distributed on the Mel scale.
The log-Mel spectrogram has shown great success in several acoustic classification applications [23,25,31,47,48,49]. This is justified because most of the energy of most noises lies within the low-frequency bands, and the log-Mel spectrogram emphasises those low-frequency bands. Therefore, a log-Mel spectrogram is used in this study to build the denoising model.
For the regression model, however, the decision was to test and compare the decibel and log-Mel spectrograms. This decision is mainly due to the structure of rainfall noise in the time–frequency domain. Unlike most conventional noises, rainfall noise spreads its energy more or less equally across all frequencies, which a decibel spectrogram represents well. This, however, does not justify ruling out the log-Mel spectrogram. Therefore, both spectrograms were tested and compared in terms of CNN model performance.
The calculations of both the decibel spectrogram and log-Mel spectrogram are as follows:
First, the audio recordings are split into 1 min clips corresponding to the 1 min rainfall data generated by the weather station. From there, a decibel spectrogram is generated by calculating the squared magnitude of the short-term Fourier transform (STFT) coefficients on a logarithmic scale (dB) [50]. In this study, the 1 min audio clips are split into short audio frames x(n) of length M with 50% overlap between each frame. A discrete Fourier transform is applied to each frame using Equation (1):
X(m, k) = \sum_{n=0}^{N-1} x(n + mH) \, w(n) \, e^{-i \omega_k n}, \quad 0 \le k \le N-1
where X(m, k) is the kth Fourier coefficient of the mth time frame, \omega_k = 2\pi k / N is the frequency of the kth sinusoid, n is the discrete-time step, H is the hop size parameter, and N is the discrete-time window length. In this study, N and H are taken as 1024 and 512, respectively.
A Hamming window function is applied to each frame to smooth the discontinuities at the beginning and end of each audio signal frame, using Equation (2):
w(n) = 0.54 - 0.46 \cos\!\left(\frac{2\pi n}{M-1}\right), \quad n = 0, 1, 2, \ldots, M-1
From there, each Fourier coefficient is associated with a physical time (seconds) and frequency (hertz) position using Equations (3) and (4):
T_{\mathrm{coef}}(m) = \frac{m \cdot H}{F_s}
F_{\mathrm{coef}}(k) = \frac{k \cdot F_s}{N}
where F_s is the audio sampling rate in hertz.
The magnitude of the STFT is taken as in Equation (5):
y(m, k) = |X(m, k)|^2
where X(m, k) are the Fourier coefficients obtained from Equation (1).
From there, the magnitude is converted to a decibel scale using Equation (6):
P(m, k) = 20 \log_{10}\big(y(m, k)\big)
The log-Mel spectrogram is calculated by applying the Mel scale to the frequencies of the decibel spectrogram, separating the frequency axis into n Mel bands [51], using Equation (7):
f_m = 2595 \times \log_{10}\!\left(1 + \frac{f}{700}\right)
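Equations (1)–(7) correspond to standard spectrogram computations, so both input formats can be produced with an audio library. The sketch below uses librosa with N = 1024, H = 512, and a Hamming window as above; the choice of 64 Mel bands follows the VGGish input convention and is an assumption, not a value stated in the text.

```python
import numpy as np
import librosa

def decibel_and_logmel(path, sr=44100, n_fft=1024, hop=512, n_mels=64):
    """Compute the decibel and log-Mel spectrograms of a 1 min audio clip."""
    y, sr = librosa.load(path, sr=sr, mono=True)

    # Decibel spectrogram: squared STFT magnitude on a dB scale (Eqs. 1-6).
    stft = librosa.stft(y, n_fft=n_fft, hop_length=hop, window="hamming")
    power = np.abs(stft) ** 2
    db_spec = librosa.power_to_db(power, ref=np.max)

    # log-Mel spectrogram: Mel filter bank applied to the power spectrum (Eq. 7).
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop, n_mels=n_mels,
        window="hamming", power=2.0)
    logmel_spec = librosa.power_to_db(mel, ref=np.max)
    return db_spec, logmel_spec
```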

2.4. Convolutional Neural Networks

Unlike traditional machine-learning-based classification methods, CNNs are feed-forward neural networks that can extract deep features from input data [41]. CNNs are widely used in computer vision and audio processing, for example in speech-related tasks [49], image recognition [39,40,52], and acoustic classification [32], with successful practical applications in different fields.
A typical CNN model consists of a 2D convolutional layer, a pooling layer (sometimes), and a fully connected layer [41]. The convolutional layers are responsible for extracting features from 2D inputs. The pooling layers perform a downsampling operation that reduces the spatial size of the feature map and removes redundant spatial information. Finally, the fully connected layer is responsible for the classification or regression task.
Pooling layers can perform either max or average pooling. The difference between the two is reflected in their names: max pooling divides the input into rectangular pooling regions and returns the maximum value of each region, while average pooling returns the average value of each region [41]. The input to the pooling layer is the feature map generated by the convolutional layer. The choice between max and average pooling is mainly based on trial and error to find which works best for the problem.
Other types of layers can be present in the network, such as batch normalisation or dropout layers. The batch normalisation layer normalises the activations and gradients propagating through a network, making network training an easier optimisation problem [53]. Incorporating batch normalisation layers between convolutional layers can accelerate network training and mitigate the sensitivity to network initialisation [53]. Dropout layers, in turn, introduce randomness by setting input elements to zero with a given probability (those elements are dropped during training), which is mainly used in neural networks to prevent overfitting [54]. Only the convolutional and fully connected layers in a CNN include trainable parameters [41]. These two layer types use a nonlinear activation function, typically ReLU, tanh, or sigmoid. Most CNNs use the ReLU activation function since it handles the vanishing gradient problem better than the tanh and sigmoid functions. ReLU performs a threshold operation on each element by setting any input value less than zero to zero. The learning process in CNNs is driven by backpropagation and gradient descent [41], with the learnable network parameters updated during training based on the loss function values.
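As an illustration of how these layer types fit together, the following minimal PyTorch sketch stacks convolution, batch normalisation, ReLU, max pooling, dropout, and a fully connected output layer; it is a generic example of the layer types described above, not the network used in this study.

```python
import torch
import torch.nn as nn

class SmallAudioCNN(nn.Module):
    """Illustrative CNN combining the layer types discussed above."""
    def __init__(self, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),  # feature extraction
            nn.BatchNorm2d(16),                          # stabilise training
            nn.ReLU(),
            nn.MaxPool2d(2),                             # downsample feature maps
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(p=0.5),                           # regularisation
            nn.LazyLinear(n_classes),                    # classification/regression head
        )

    def forward(self, x):  # x: (batch, 1, freq_bins, time_frames)
        return self.classifier(self.features(x))
```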

2.5. CNN-Based Urban Rainfall Denoising Framework Development

This study is conducted in an open environment where noises from several urban sound sources are inevitably recorded along with rainfall noise. Therefore, understanding soundscape noises and their influence on audio data is important for developing an efficient denoising framework. The soundscape is mainly composed of three types of noises: (1) anthrophony, (2) biophony, and (3) geophony. The anthrophonic noises are human-sourced (e.g., machines, humans, music, etc.). On the other hand, biophonic noises are attributed to biological sources, including birds, insects, animals, and other living organisms [55]. Lastly, geophonic noises originate from geophysical and environmental sources, such as rain, wind, thunder, and other natural noises [56].
To better visualise the different noise sources during rainfall, the spectrograms of 1 min audio clips are presented in Figure 5. In Figure 5a, the noise due to thunder (lasting approximately 5 s) appears in bright white as a localised and concentrated signal. Figure 5b shows a car-passing noise (lasting 10 s), along with another random noise (most likely attributed to an impact) lasting a short period of 1–2 s. In Figure 5c, two distinct noises can be seen: the first is a passing motorbike with a spectrogram signature similar to car noise; the second is multiple car horns lasting 5 s, with a spectrogram signature similar to thunder but with lower intensity. Lastly, Figure 5d shows bird song during a low-intensity event, appearing in the spectrogram as vertically concentrated signals spaced at varying intervals. Thus, most noises other than rainfall in a 1 min recording do not last long, except music or human speech, which would be expected to last longer if someone is talking or listening to music close to a recording device during rainfall.
Therefore, a denoising framework is proposed in this study, as shown in Figure 6. A 1 min clip is first split into 12 (5 s) frames; 5 s was selected because most noises other than rainfall in the 1 min recordings last from 5 to 15 s. A log-Mel spectrogram is then generated for each of the 12 frames and fed into a binary classifier that labels each frame as rainfall or no-rainfall. The classifier's outputs for the 12 frames of a 1 min recording are concatenated and passed through an if-else conditional statement: if more than 8 frames in the recording are classified as no-rainfall, the recording is deleted; otherwise, the no-rainfall frames are replaced with adjacent rainfall frames and the recording is saved (see the sketch below). The threshold of 8 frames accounts for the chance of misclassification by the model and was selected through trial and error based on visual analysis; a less conservative number could be explored in the future.
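A minimal sketch of this frame-replacement rule is given below, assuming the classifier's per-frame decisions are already available. The choice of the nearest rainfall frame as the replacement is our reading of "adjacent frames", and the function name is hypothetical.

```python
import numpy as np

def denoise_one_minute(frames, is_rain, drop_threshold=8):
    """Apply the frame-replacement rule to one 1 min clip.

    frames  : list of 12 audio arrays, each 5 s long
    is_rain : list/array of 12 booleans from the binary CNN classifier
    Returns the cleaned list of frames, or None if the clip is discarded.
    """
    is_rain = np.asarray(is_rain, dtype=bool)
    if int((~is_rain).sum()) > drop_threshold:
        return None  # too little rain content: delete the recording

    rain_idx = np.flatnonzero(is_rain)
    cleaned = list(frames)
    for i in np.flatnonzero(~is_rain):
        # replace the no-rain frame with its nearest rainfall frame
        nearest = rain_idx[np.argmin(np.abs(rain_idx - i))]
        cleaned[i] = frames[nearest]
    return cleaned
```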
A convolutional neural network (CNN) model was selected for the binary classifier since CNNs have shown state-of-the-art results in several environmental sound classification studies [23,25,32,37]. The ~600 h of audio collected in this study yielded around 500,000 5 s clips, which were used for training the CNN model. CNNs require large datasets for training; however, with a limited dataset, transfer learning [57] can reduce both the computation time and the amount of data needed compared with training a CNN from scratch. Transfer learning is a powerful method for training a neural network on a limited dataset: a network pretrained on a massive dataset (consisting of over 1 million samples) is repurposed, and the acquired knowledge is applied to a specific task of interest. This approach capitalises on the pretrained model's knowledge, improving the network's performance on the smaller target dataset. The underlying assumption is that generic features learned on a large dataset can be transferred to similar tasks solving a different problem. Therefore, a transfer learning approach was used to speed up the training process and improve the generalisability of the model.
Transfer learning [57] was used in this study by adapting a pretrained network on audio spectrograms to solve a new problem by fine-tuning the entire network on a new dataset. A low learning rate was applied to the convolutional and pooling layers to slow down learning, and a high learning rate factor was applied to the final fully connected (FC) layers to speed up learning for the new problem.
A VGGish network [47] was employed in this study for the denoising framework. The VGGish network is designed by Google to generate feature embeddings (feature extraction) from audio grayscale log-Mel spectrograms with a fixed input size of 96 × 64 bins and trained on the Audio Set dataset, which consists of 100 million units of audio data from YouTube videos [58]. The VGGish network employs four groups of convolutional layers with a filter (kernel size) of 3 × 3, a stride of 1 × 1, and a r e l u activation function, followed by a max pooling layer with a 2 × 2 filter size and stride of 2 × 2. Three fully connected layers follow the convolutional groups. The full architecture of the network is shown in Table 2 with the number of layers, the filter number, size, and stride.
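A minimal PyTorch sketch of this fine-tuning strategy is shown below. The VGGishLike class is only a stand-in for the pretrained VGGish backbone (whose actual architecture is given in Table 2 and whose weights would be loaded from a pretrained checkpoint), and the learning rates are illustrative values, not those used in the study.

```python
import torch
import torch.nn as nn

class VGGishLike(nn.Module):
    """Skeleton mirroring the features / fully connected split of VGGish;
    real pretrained weights would be loaded into `features`."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(            # pretrained conv/pool groups
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.classifier = nn.Linear(128, 2)        # new rain / no-rain head

    def forward(self, x):
        return self.classifier(self.features(x))

model = VGGishLike()

# Fine-tune the whole network, but learn slowly in the pretrained
# convolutional layers and quickly in the new fully connected head.
optimizer = torch.optim.Adam([
    {"params": model.features.parameters(),   "lr": 1e-5},
    {"params": model.classifier.parameters(), "lr": 1e-3},
])
loss_fn = nn.CrossEntropyLoss()  # cross entropy over the two classes
```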

2.6. Acoustic Rainfall Sensing Model Development

2.6.1. Baseline Model

The baseline model for the acoustic rainfall estimator in this study is a fully connected model with ReLU activations, N layers, and M neurons per layer. The model input is the average signal amplitude (ASA) given by Equation (8), and the output is the rainfall intensity. The best-performing model had N = 4 layers and M = 2155 neurons; these parameters were selected using a Bayesian search sweeping N over [1, 6] and M over [10, 4000].
\mathrm{ASA} = \frac{1}{N} \sum_{n} x(n)^2
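As a small illustration, the ASA feature of Equation (8) can be computed directly from the raw samples of a 1 min clip; the sketch below assumes the clip is already loaded as a NumPy array.

```python
import numpy as np

def average_signal_amplitude(x):
    """Average signal amplitude (Eq. 8): the mean of the squared
    audio samples x(n) of a 1 min clip."""
    x = np.asarray(x, dtype=float)
    return float(np.mean(x ** 2))
```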

2.6.2. CNN-Based Acoustic Rainfall Sensing Model

The literature presents an abundance of CNN architectures developed for acoustic classification applications, differing in the number, size, and order of layers. However, the literature presents no CNN architectures for acoustic data inversion applications. Therefore, this study draws on CNN models used in acoustic classification applications and builds on them.
In their methodology for CNN-based rainfall intensity estimation from images, Yin et al. [13] utilised a CNN pretrained on a large image classification dataset and adapted it to a regression problem through transfer learning. The main goals of the present study are to explore CNNs for acoustic rainfall intensity estimation and to identify the best input format for the CNN model. Therefore, transfer learning with a pretrained CNN was not used here, but it will be explored in the future for the final model development as more data are gathered. Likewise, building a CNN architecture from scratch will be explored in the future with more data; at this stage, we rely on a pre-existing audio CNN model, and a VGGish network architecture [47] was employed. A regression layer was added to the VGGish network so that it predicts rainfall intensity values from rainfall audio decibel or log-Mel spectrograms without transfer learning, allowing a comparison of the effect of spectrogram type on the CNN's rainfall intensity predictions.
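A minimal sketch of such a regression variant is shown below: a convolutional feature extractor (a stand-in for the VGGish feature layers, not the Table 2 architecture) followed by a single-output regression layer, trained from random initialisation. All layer sizes are illustrative.

```python
import torch
import torch.nn as nn

# Stand-in convolutional backbone followed by a single-output regression
# layer that maps a decibel or log-Mel spectrogram to rainfall intensity.
model = nn.Sequential(
    nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(128, 1),                      # added regression layer
)

spectrogram = torch.randn(8, 1, 64, 96)     # (batch, channel, mel bins, frames)
predicted_intensity = model(spectrogram).squeeze(-1)  # shape: (8,)
```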

2.7. Loss Functions

For training the binary CNN denoising model, a binary cross entropy loss function is utilised using Equation (9):
\mathrm{CE}_{\mathrm{loss}} = -\frac{1}{N} \sum_{n=1}^{N} \sum_{i=1}^{K} T_{ni} \ln Y_{ni}
where Y_{ni} is the model prediction, T_{ni} is the target class, N is the number of observations (mini-batch size), and K is the number of classes.
For training the acoustic rainfall estimator, a half mean squared error loss function is utilised using Equation (10):
\mathrm{MSE}_{\mathrm{loss}} = \frac{1}{2N} \sum_{i=1}^{M} (X_i - T_i)^2
where X_i is the model prediction, T_i is the target value, M is the total number of responses in X (mini-batch), and N is the total number of observations in X (mini-batch size).
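The two loss functions translate directly into code; the sketch below implements Equations (9) and (10) for dense prediction/target tensors, assuming the single-response case for the half mean squared error.

```python
import torch

def binary_cross_entropy_loss(Y, T):
    """Eq. (9): mean cross entropy over N observations and K classes.
    Y are predicted class probabilities (e.g. after softmax), T one-hot
    targets, both of shape (N, K)."""
    return -(T * torch.log(Y)).sum(dim=1).mean()

def half_mse_loss(X, T):
    """Eq. (10): half mean squared error between predictions X and targets T."""
    return 0.5 * torch.mean((X - T) ** 2)
```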

2.8. Performance Criteria

2.8.1. CNN-Based Urban Rainfall Denoising Model

Four performance criteria were utilised to evaluate the denoising binary CNN classifier model: (1) accuracy, (2) recall, (3) specificity, and (4) precision percentage (%). These four criteria were selected based on relevant research studies [24,25,32,47,59,60,61,62] that deal with audio-data-driven classification problems. These criteria are calculated as follows:
  • Accuracy percentage (%) calculated using Equation (11):
    \mathrm{Accuracy}\ (\%) = \frac{TN + TP}{TN + TP + FN + FP} \times 100
  • Recall percentage (%) calculated using Equation (12):
    \mathrm{Recall}\ (\%) = \frac{TP}{TP + FN} \times 100
  • Specificity percentage (%) calculated using Equation (13):
    \mathrm{Specificity}\ (\%) = \frac{TN}{TN + FP} \times 100
  • Precision percentage (%) calculated using Equation (14):
    \mathrm{Precision}\ (\%) = \frac{TP}{TP + FP} \times 100
    where TN is the true negative value, TP is the true positive value, FN is the false negative value, and FP is the false positive value.
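For completeness, Equations (11)–(14) can be computed from the confusion-matrix counts of the rain/no-rain classifier with a few lines of code, as sketched below.

```python
def classification_metrics(tp, tn, fp, fn):
    """Eqs. (11)-(14): accuracy, recall, specificity and precision (%)
    from confusion-matrix counts."""
    accuracy = 100 * (tp + tn) / (tp + tn + fp + fn)
    recall = 100 * tp / (tp + fn)
    specificity = 100 * tn / (tn + fp)
    precision = 100 * tp / (tp + fp)
    return accuracy, recall, specificity, precision
```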

2.8.2. CNN-Based Acoustic Rainfall Sensing Model

Three performance criteria were utilised to evaluate the data-driven regression models: (1) coefficient of determination, (2) root mean square error, and (3) mean absolute error. These three criteria were selected based on relevant research studies [63,64,65,66,67,68] that deal with data-driven regression problems. These criteria are calculated as follows:
  • Coefficient of determination (R2) calculated using Equation (15):
    R^2 = \left[ \frac{\sum_{i=1}^{n} (R_i - \bar{R})(\hat{R}_i - \bar{\hat{R}})}{\sqrt{\sum_{i=1}^{n} (R_i - \bar{R})^2 \times \sum_{i=1}^{n} (\hat{R}_i - \bar{\hat{R}})^2}} \right]^2
  • Root mean square error (RMSE) calculated using Equation (16):
    \mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{n} (R_i - \hat{R}_i)^2}{n}}
  • Mean absolute error (MAE) calculated using Equation (17):
    \mathrm{MAE} = \frac{\sum_{i=1}^{n} |R_i - \hat{R}_i|}{n}
    where R_i and \hat{R}_i are the observed and simulated rainfall values, respectively; \bar{R} and \bar{\hat{R}} are the average observed and simulated rainfall values, respectively; and n is the total number of observations.
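A short NumPy sketch of Equations (15)–(17), treating R2 as the squared Pearson correlation between observed and simulated intensities, is given below.

```python
import numpy as np

def regression_metrics(obs, sim):
    """Eqs. (15)-(17): R^2 (squared correlation), RMSE and MAE between
    observed and simulated 1 min rainfall intensities (mm/min)."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    r2 = np.corrcoef(obs, sim)[0, 1] ** 2
    rmse = np.sqrt(np.mean((obs - sim) ** 2))
    mae = np.mean(np.abs(obs - sim))
    return r2, rmse, mae
```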

3. Results and Discussions

3.1. Rainfall Data Analysis

Figure 7 shows the distribution of collected on-site rainfall data. The collected data in this study follow a skewed distribution towards low rainfall intensities. Based on the statistical distribution of the rainfall data, the data were split into four categories: very low, low, moderate, and high intensity.
The zero values in the collected dataset reflect a very low intensity rainfall below 0.25 mm·min−1 that happened during the rainfall events, forming 24.0% of the entire dataset. The low-intensity rainfall data (0.25 to 0.50 mm·min−1) contain 50.5% of the recorded data, while the moderate rainfall intensity (0.75 to 1.70 mm·min−1) includes 23.5% of the dataset. Finally, the high-intensity rainfall data (2.00 to 3.00 mm·min−1) include around 2.0% of the dataset.
Figure 8a,b shows samples of 5 s log-Mel spectrograms for the five locations during a similar rainfall event (at a rainfall intensity of 0.25 mm·min−1) and for different urban noises during rainfall. Figure 8a shows different rainfall sound patterns for the same event, indicating the diversity of the data collected in this study; the differences in rainfall sound characteristics across the five locations are associated with their different environmental acoustics. Figure 8b shows different urban noises during rainfall and visually demonstrates how drastically those noise patterns differ from rainfall recordings. Additional urban noise samples are provided in the Supplementary Materials.

3.2. CNN-Based Urban Acoustic Denoising Framework

The model's performance in terms of accuracy, precision, specificity, and recall on the training, validation, and testing datasets is shown in Table 3. The model achieved around 98% accuracy, recall, specificity, and precision in classifying 5 s audio recordings into rainfall and no-rainfall recordings. The developed classifier was then used within the proposed denoising framework to reduce urban noise in the rainfall audio recordings before they were used to develop the acoustic rainfall sensing model in the following stage. Figure 9 shows a sample of audio recordings before and after the denoising process using the developed framework; most of the significant urban noises in the rainfall recordings are attenuated. The outputs of the denoising framework were rigorously checked manually to confirm their validity before being used for the acoustic rainfall sensing model development.

3.3. CNN-Based Acoustic Rainfall Sensing Model

The denoised rainfall dataset was split into training, validation, and testing datasets, while care was taken for a statistically fair data distribution among them. Table 4 shows the statistical measures of the collected rainfall observations used for the training, validation, and testing datasets.
The performance of the baseline FC model and the CNN models in terms of R2, RMSE, and MAE on the training, validation, and testing datasets is shown in Table 5. The results show that the CNN models outperform the baseline model on the validation and testing datasets by around 100%, 40%, and 40% in terms of R2, RMSE, and MAE, respectively. Moreover, the CNN models' performance on the on-site validation and testing (unseen) datasets is consistent with their calibration performance, which indicates no overfitting problems. The log-Mel spectrogram CNN model performed slightly better than the decibel spectrogram model on the on-site validation dataset, by around 5%, 8%, and 6% in terms of R2, RMSE, and MAE, and on the on-site testing dataset, by around 2%, 4%, and 4%.
In addition, Table 6 shows the CNN models' performance on the testing dataset for each of the five environments separately. In terms of RMSE, both CNN models show a similar pattern across the five environments: the highest performance is at Environment A (greenhouse with Perspex ceiling), followed by Environment D (food court with glass ceilings), and the lowest performance is at Environment C (main gate). This finding could be due to the potentially louder sounds generated at Environments A and D compared to Environments B and C. In other words, environments with loud rainfall noise appear to allow the model to distinguish different rainfall intensities better.
Interestingly, the decibel spectrogram CNN model performs noticeably worse in Environment E (umbrella) than the log-Mel spectrogram CNN model, even though Environment E has the loudest rainfall noise of all the locations. One reason for this result could be that the log-Mel spectrogram represents the differences in rainfall intensity at Environment E better than the decibel spectrogram. Moreover, the results for both the decibel and log-Mel spectrogram CNN networks show a strong capability to predict rainfall intensities below 60 mm·h−1, while the models struggle to predict rainfall intensities higher than 60 mm·h−1 (see Table 7). This limitation stems from the relationship between intensity and loudness (see Figure 10), which is not linear but logarithmic with an upper threshold of around 60 mm·h−1, above which rainfall intensities become less distinctive. Sato et al. [69] highlighted a similar relationship between rainfall and sound intensity. This also indicates that following the methodology of Ma and Nystuen [70] in marine areas and Bedoya et al. [44] in forest areas, where only the loudness feature was used as a model input, will not work effectively in a tropical urban area and would result in excessive underestimation of high rainfall intensities.
The CNN models in this study were calibrated to predict rainfall intensity at a 1 min resolution. However, in real-life applications, the typical time resolutions for rainfall intensity are 5 min, 10 min, 15 min, 30 min, hourly, and daily [22]. Therefore, a 5 min resolution was chosen as an example to showcase model performance at such resolutions. Figure 11a,b shows the parity plots of simulated versus observed rainfall for the developed log-Mel spectrogram CNN model on the testing dataset at both 1 min and 5 min resolutions. The parity plots show strong agreement between the simulated and observed rainfall, with cross-correlation (CC) values of 0.874 and 0.927, respectively. As expected, the model shows varying degrees of underestimation and overestimation at different ranges of rainfall intensities at the 1 min resolution; the underestimation of high rainfall intensities at 1 min resolution is reduced at the coarser 5 min resolution. One potential reason for the discrepancy at 1 min resolution is that the audio recorder registers the rainfall sound continuously, while the tipping rain gauge records rainfall values discretely, which could cause the underestimation and overestimation of very high and low rainfall intensities, respectively. As a result, if the ground truth data were captured by an alternative technique, such as an optical or acoustic disdrometer, better calibration and testing agreement between observed and simulated values might be obtained, especially for low intensities.
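The 5 min results above involve aggregating consecutive 1 min intensities; a simple sketch of this aggregation (assuming plain averaging of complete, non-overlapping 5 min windows, which is our assumption of how Figure 11b was produced) is shown below.

```python
import numpy as np

def to_5min_intensity(intensity_1min_mm_per_min):
    """Aggregate consecutive 1 min intensities (mm/min) into 5 min mean
    intensities, dropping any incomplete trailing window."""
    x = np.asarray(intensity_1min_mm_per_min, float)
    n = (len(x) // 5) * 5
    return x[:n].reshape(-1, 5).mean(axis=1)
```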

3.4. Local CNN Explainability

Deep learning models are black boxes, and it is challenging to understand the influencing factors that lead to specific predictions. However, several studies in the literature have proposed visualisation techniques for CNN models that reveal which parts of the input the model focuses on for a specific (individual) prediction. One of these techniques is gradient-weighted class activation mapping (Grad-CAM) [71]. Grad-CAM examines the gradient information of a target flowing into the CNN model's final convolutional layer. The output of Grad-CAM is a coarse heat map highlighting the parts of the input (2D representation) that contribute most to the final prediction. This method does not provide global explainability (over all predictions) but rather local explainability (for single predictions). The technique efficiently helps explain the classification decisions of a model; for example, when classifying cats and dogs, Grad-CAM shows which parts (features) of the cat or dog the CNN model focuses on to make its decision.
Figure 12 shows a sample of the Grad-CAM localised heat maps on the validation dataset for the five locations and at different rainfall intensities. It should be noted that the presented 1 min events are not the same for the five locations at a specific rainfall intensity. Previous analysis showed that rainfall noise is distributed all over the frequency range, not at a specific frequency range. However, interestingly, the Grad-CAM results show that the CNN model focuses on narrow frequency bands for the five locations and all rainfall intensities. Previously, Bedoya et al. [44], Ma and Nystuen [70], and Gaucherel and Grimaldi [72] proposed using a loudness feature at a specific frequency band for rainfall intensity estimation in marine, rural, and forest areas. In the case of the CNN model, those frequency bands are not fixed but rather more adaptive, which might be better tailored to generalise to different locations in an urban environment at different rainfall intensities. Interestingly, the Grad-CAM results for several cases show that the CNN model tends to focus on the upper part of the log-Mel spectrograms (high-frequency audio range), which might indicate the possibility of having critical information at that range, helping the model to predict rainfall intensities (see Figure 12). Therefore, future studies should consider exploring higher audio sampling frequencies, which could contribute to improving the CNN model’s capabilities in predicting higher rainfall intensities.
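A minimal Grad-CAM sketch for a spectrogram-based CNN is given below; it uses standard PyTorch hooks, `last_conv` must be the model's final convolutional layer, and for the regression model the "target" is simply the predicted intensity. This is a generic illustration, not the exact implementation used in the study.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, last_conv, spectrogram):
    """Grad-CAM heat map for a single spectrogram (1, 1, freq, time):
    weight the last conv layer's feature maps by the gradients of the
    output with respect to them, apply ReLU, and upsample to input size."""
    acts, grads = {}, {}
    h_fwd = last_conv.register_forward_hook(
        lambda module, inputs, output: acts.update(a=output))
    h_bwd = last_conv.register_full_backward_hook(
        lambda module, grad_in, grad_out: grads.update(g=grad_out[0]))

    output = model(spectrogram)   # predicted rainfall intensity (or class score)
    model.zero_grad()
    output.sum().backward()       # gradients of the target w.r.t. the feature maps

    h_fwd.remove(); h_bwd.remove()
    weights = grads["g"].mean(dim=(2, 3), keepdim=True)   # global-average-pooled gradients
    cam = F.relu((weights * acts["a"]).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=spectrogram.shape[-2:],
                        mode="bilinear", align_corners=False)
    return (cam / (cam.max() + 1e-8)).squeeze()           # normalised heat map

# Example with the illustrative regressor sketched earlier (hypothetical index):
# heat_map = grad_cam(model, last_conv=model[3], spectrogram=torch.randn(1, 1, 64, 96))
```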

3.5. Potential Application in Citizen Science—Proof of Concept

In this section, the off-site dataset described in Section 2.2.2 is used to further test the developed model's generalisability to new urban environments. The log-Mel spectrogram CNN model's performance on the off-site dataset (1 min resolution) was R2 = 0.523, RMSE = 0.379 mm·min−1, and MAE = 0.275 mm·min−1. Figure 13 compares observed and simulated rainfall intensities (at 5 min resolution) for the off-site validation dataset. The model shows initially promising off-site performance; however, it overestimated rainfall intensity, especially for low and moderate intensities (see Figure 13). Naturally, more data from new collection points in the future can help recalibrate the model for better generalisation, and further improvements are still needed to enhance the potential of such a tool before it reaches a practical application stage.
An alternative approach to improving the model's outcome on off-site data is bias correction, applied until more data are gathered to recalibrate the model. Such an approach normally requires ground truth observations, which are not available in all situations. Song [73] proposed a workaround for bias correction without ground truth observations, in which a machine learning model is fitted on a set of model predictions (input) and the corresponding residuals (output), the residuals being the differences between predicted values and ground truth observations. When observed values are unavailable, the fitted model can then be used to predict the residuals, and the bias correction is carried out with these predicted residuals.
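The sketch below illustrates this residual-based idea under simple assumptions: a hypothetical linear model maps predictions to residuals on events where ground truth exists and is then used to correct new predictions. It is not the residual-rotation procedure demonstrated in the next paragraph.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_residual_model(predicted, observed):
    """Fit a simple model mapping predicted intensities to residuals
    (predicted - observed) on events where ground truth is available."""
    predicted = np.asarray(predicted, float).reshape(-1, 1)
    residuals = predicted.ravel() - np.asarray(observed, float)
    return LinearRegression().fit(predicted, residuals)

def bias_correct(residual_model, predicted):
    """Subtract the estimated residuals from new predictions."""
    predicted = np.asarray(predicted, float)
    est_residual = residual_model.predict(predicted.reshape(-1, 1))
    return np.clip(predicted - est_residual, 0.0, None)  # intensities stay non-negative
```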
In this paper, the available off-site dataset does not allow us to train a machine learning (ML) model on predicted values and their residuals, so instead we demonstrate to what degree a bias correction would improve the results. In this case, we apply a bias correction using residual rotation. First, the required residual rotation for all 21 events is identified and averaged; this provides a rotation that could be applied to new events (this demonstration assumes that the ground truth data are available). Second, the residuals are rotated and subtracted from the predicted values. The performance on the bias-corrected off-site dataset (1 min resolution) was R2 = 0.623, RMSE = 0.310 mm·min−1, and MAE = 0.222 mm·min−1, an improvement of approximately 19%, 18%, and 19% in terms of R2, RMSE, and MAE compared with the uncorrected data. In addition, Figure 14 compares observed and bias-corrected simulated rainfall intensities (at 5 min resolution) for the off-site validation dataset. The bias correction reduced the overestimation at low and moderate rainfall intensities but increased the number of underestimations in those ranges. Bias correction can therefore somewhat improve the model's off-site performance, and it is worth investigating different types of bias correction in the future.

4. Conclusions and Future Works

In the context of smart city innovations, monitoring and forecasting weather parameters, such as rainfall, are essential. This study investigated the potential of developing an urban acoustic rainfall sensing tool to contribute to citizen-science applications in a smart city. First, a CNN model was developed for acoustic rainfall detection and as a backbone for a denoising framework to reduce the presence of unwanted noises in urban rainfall recordings. The developed model trained on log-Mel spectrograms showed a detection accuracy of 98% on the validation and testing datasets. In the future, alternative CNN models could be explored with different audio frame lengths or alternative techniques to achieve a higher performance.
In addition, a CNN-based approach for acoustic rainfall intensity estimation was investigated by comparing the performance of decibel and log-Mel spectrograms as inputs to the VGGish model. Both spectrograms showed close performance, with the log-Mel spectrogram performing slightly better, by around 5%, 8%, and 6% in terms of R2, RMSE, and MAE on the validation dataset and 2%, 4%, and 4% on the testing dataset. The developed CNN models were also compared with a baseline fully connected (FC) model trained on loudness as input; the CNN models showed a significant improvement over the baseline, of around 100%, 40%, and 40% in terms of R2, RMSE, and MAE for both the validation and testing datasets. The CNN models also showed significant improvements in predicting rainfall intensities higher than 60 mm·h−1. At this stage, this study has shown that a CNN can be used to predict rainfall intensity values from rainfall audio. Future studies should therefore focus on exploring different CNN architectures and existing CNN models through transfer learning to improve the predictability of rainfall intensity from rainfall sound.
Finally, the developed model's generalisability was evaluated on an off-site dataset from different urban environments. The model's performance on the off-site dataset was lower than on the on-site dataset, as expected. Therefore, future efforts should aim to increase data diversity by collecting acoustic recordings from different urban environments to upgrade the model training. Furthermore, the model's performance with different recording devices and audio formats should also be studied. This initial study provided the required knowledge and techniques for acoustic rainfall estimation through citizen science; the use of smartphones and surveillance audio will therefore be the main topic of our next research study.

Supplementary Materials

The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/smartcities6060139/s1, Figure S1: Sample of 5 s log-Mel spectrogram of (a) rainfall recordings from the five environments and (b) different urban noises during rainfall.

Author Contributions

Conceptualization, M.I.I.A., T.K.C. and A.T.; methodology, M.I.I.A. and A.T.; software, M.I.I.A.; validation, A.T., T.K.C., V.R.N.P. and M.F.C.; formal analysis, M.I.I.A.; investigation, M.I.I.A.; resources, A.T. and M.F.C.; data curation, M.I.I.A. and T.K.C.; writing—original draft preparation, M.I.I.A.; writing—review and editing, M.I.I.A., A.T., V.R.N.P., T.K.C. and M.F.C.; visualization, M.I.I.A., A.T. and T.K.C.; supervision, A.T., V.R.N.P. and M.F.C.; project administration, A.T., M.F.C., V.R.N.P. and T.K.C.; funding acquisition, A.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Ministry of Higher Education Malaysia, grant number FRGS/1/2021/TK0/MUSM/02/6.

Data Availability Statement

A sample of the audio data with its corresponding rainfall intensity, along with the codes for the models’ development, is provided in the following link: https://github.com/mohammedalkhatib69/Monash-Urban-Acoustic-Rainfall-Sensor.git (accessed on 9 November 2023).

Acknowledgments

The authors acknowledge the financial support of Monash University (Malaysia Campus) through the merit PhD scholarship for Mohammed I.I. Alkhatib.

Conflicts of Interest

The authors have no relevant financial or non-financial interests to disclose.

References

  1. Gracias, J.S.; Parnell, G.S.; Specking, E.; Pohl, E.A.; Buchanan, R. Smart Cities—A Structured Literature Review. Smart Cities 2023, 6, 1719–1743. [Google Scholar] [CrossRef]
  2. Wu, W.; Lin, Y. The impact of rapid urbanization on residential energy consumption in China. PLoS ONE 2022, 17, e0270226. [Google Scholar] [CrossRef] [PubMed]
  3. Chen, H.; Jia, B.; Lau, S.Y. Sustainable urban form for Chinese compact cities: Challenges of a rapid urbanized economy. Habitat. Int. 2008, 32, 28–40. [Google Scholar] [CrossRef]
  4. Deng, J.S.; Wang, K.; Hong, Y.; Qi, J.G. Spatio-temporal dynamics and evolution of land use change and landscape pattern in response to rapid urbanization. Landsc. Urban Plan. 2009, 92, 187–198. [Google Scholar] [CrossRef]
  5. Damadam, S.; Zourbakhsh, M.; Javidan, R.; Faroughi, A. An Intelligent IoT Based Traffic Light Management System: Deep Reinforcement Learning. Smart Cities 2022, 5, 1293–1311. [Google Scholar] [CrossRef]
  6. Calvillo, C.F.; Sánchez-Miralles, A.; Villar, J. Energy management and planning in smart cities. Renew. Sustain. Energy Rev. 2016, 55, 273–287. [Google Scholar] [CrossRef]
  7. Mingaleva, Z.; Vukovic, N.; Volkova, I.; Salimova, T. Waste Management in Green and Smart Cities: A Case Study of Russia. Sustainability 2019, 12, 94. [Google Scholar] [CrossRef]
  8. Keung, K.L.; Lee, C.K.M.; Ng, K.K.H.; Yeung, C.K. Smart City Application and Analysis: Real-time Urban Drainage Monitoring by IoT Sensors: A Case Study of Hong Kong. In Proceedings of the 2018 IEEE International Conference on Industrial Engineering and Engineering Management (IEEM), Bangkok, Thailand, 16–19 December 2018; pp. 521–525. [Google Scholar] [CrossRef]
  9. Ghazal, T.M.; Hasan, M.K.; Alshurideh, M.T.; Alzoubi, H.M.; Ahmad, M.; Akbar, S.S.; Al Kurdi, B.; Akour, I.A. IoT for Smart Cities: Machine Learning Approaches in Smart Healthcare—A Review. Future Internet 2021, 13, 218. [Google Scholar] [CrossRef]
  10. Gomede, E.; Gaffo, F.; Briganó, G.; de Barros, R.; Mendes, L. Application of Computational Intelligence to Improve Education in Smart Cities. Sensors 2018, 18, 267. [Google Scholar] [CrossRef]
  11. Kidd, C.; Becker, A.; Huffman, G.J.; Muller, C.L.; Joe, P.; Skofronick-Jackson, G.; Kirschbaum, D.B. So, how much of the Earth’s surface is covered by rain gauges? Bull. Am. Meteorol. Soc. 2017, 98, 69–78. [Google Scholar] [CrossRef]
  12. Overeem, A.; Leijnse, H.; Uijlenhoet, R. Country-wide rainfall maps from cellular communication networks. Proc. Natl. Acad. Sci. USA 2013, 110, 2741. [Google Scholar] [CrossRef]
  13. Yin, H.; Zheng, F.; Duan, H.F.; Savic, D.; Kapelan, Z. Estimating rainfall intensity using an image-based deep learning model. Engineering 2022, 21, 162–174. [Google Scholar] [CrossRef]
  14. Muller, C.L.; Chapman, L.; Johnston, S.; Kidd, C.; Illingworth, S.; Foody, G.; Overeem, A.; Leigh, R.R. Crowdsourcing for climate and atmospheric sciences: Current status and future potential. Int. J. Climatol. 2015, 35, 3185–3203. [Google Scholar] [CrossRef]
  15. Plunket, W.W. A Case Study of Travis County’s Precipitation Events Inspired by a ‘Hyperlocal’ Approach from NWS and CoCoRaHS Data. Master’s Thesis, Texas State University, San Marcos, TX, USA, December 2020. [Google Scholar]
  16. Mapiam, P.P.; Methaprayun, M.; Bogaard, T.; Schoups, G.; Veldhuis, M.-C.T. Citizen rain gauges improve hourly radar rainfall bias correction using a two-step Kalman filter. Hydrol. Earth Syst. Sci. 2022, 26, 775–794. [Google Scholar] [CrossRef]
  17. Davids, J.C.; Devkota, N.; Pandey, A.; Prajapati, R.; Ertis, B.A.; Rutten, M.M.; Lyon, S.W.; Bogaard, T.A.; Van de Giesen, N. Soda bottle science-citizen science monsoon precipitation monitoring in Nepal. Front. Earth Sci. 2019, 7, 46. [Google Scholar] [CrossRef]
  18. Tipaldo, G.; Allamano, P. Citizen science and community-based rain monitoring initiatives: An interdisciplinary approach across sociology and water science. Wiley Interdiscip. Rev. Water 2017, 4, e1200. [Google Scholar] [CrossRef]
  19. Shinbrot, X.A.; Muñoz-Villers, L.; Mayer, A.; López-Portillo, M.; Jones, K.; López-Ramírez, S.; Alcocer-Lezama, C.; Ramos-Escobedo, M.; Manson, R. Quiahua, the first citizen science rainfall monitoring network in Mexico: Filling critical gaps in rainfall data for evaluating a payment for hydrologic services program. Citiz. Sci. 2020, 5, 19. [Google Scholar] [CrossRef]
  20. COCORAHS. Community Collaborative Rain, Hail and Snow Network. Available online: http://cocorahs.org (accessed on 31 May 2020).
  21. Anagnostou, M.N.; Nystuen, J.A.; Anagnostou, E.N.; Papadopoulos, A.; Lykousis, V. Passive aquatic listener (PAL): An adoptive underwater acoustic recording system for the marine environment. Nucl. Instrum. Methods Phys. Res. A 2011, 626–627, S94–S98. [Google Scholar] [CrossRef]
  22. MASMA. Urban Stormwater Management Manual for Malaysia; Urban Stormwater Management Manual for Malaysia: Kuala Lumpur, Malaysia, 2012. [Google Scholar]
  23. Wang, X.; Wang, M.; Liu, X.; Glade, T.; Chen, M.; Xie, Y.; Yuan, H.; Chen, Y. Rainfall observation using surveillance audio. Appl. Acoust. 2021, 186, 108478. [Google Scholar] [CrossRef]
  24. Wang, X.; Glade, T.; Schmaltz, E.; Liu, X. Surveillance audio-based rainfall observation: An enhanced strategy for extreme rainfall observation. Appl. Acoust. 2023, 211, 109581. [Google Scholar] [CrossRef]
  25. Chen, M.; Wang, X.; Wang, M.; Liu, X.; Wu, Y.; Wang, X. Estimating rainfall from surveillance audio based on parallel network with multi-scale fusion and attention mechanism. Remote Sens. 2022, 14, 5750. [Google Scholar] [CrossRef]
  26. Avanzato, R.; Beritelli, F. An innovative acoustic rain gauge based on convolutional neural networks. Information 2020, 11, 183. [Google Scholar] [CrossRef]
  27. Avanzato, R.; Beritelli, F.; Di Franco, F.; Puglisi, V.F. A convolutional neural networks approach to audio classification for rainfall estimation. In Proceedings of the 2019 10th IEEE International Conference on Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications (IDAACS), Metz, France, 5 December 2019; pp. 285–289. [Google Scholar] [CrossRef]
  28. Brown, A.; Garg, S.; Montgomery, J. Automatic rain and cicada chorus filtering of bird acoustic data. Appl. Soft Comput. 2019, 81, 105501. [Google Scholar] [CrossRef]
  29. Ferroudj, M.M. Detection of Rain in Acoustic Recordings of the Environment Using Machine Learning Techniques. Master’s Thesis, Queensland University of Technology, Brisbane, Australia, 2015. Available online: https://eprints.qut.edu.au/82848/ (accessed on 31 October 2021).
  30. Gan, H.; Zhang, J.; Towsey, M.; Truskinger, A.; Stark, D.; van Rensburg, B.J.; Li, Y.; Roe, P. A novel frog chorusing recognition method with acoustic indices and machine learning. Future Gener. Comput. Syst. 2021, 125, 485–495. [Google Scholar] [CrossRef]
  31. Xie, J.; Zhu, M. Investigation of acoustic and visual features for acoustic scene classification. Expert Syst. Appl. 2019, 126, 20–29. [Google Scholar] [CrossRef]
  32. Himawan, I.; Towsey, M.; Roe, P. Detection and Classification of Acoustic Scenes and Events. 2018. Available online: https://github.com/himaivan/BAD2 (accessed on 1 February 2021).
  33. Valada, A.; Burgard, W. Deep spatiotemporal models for robust proprioceptive terrain classification. Int. J. Rob. Res. 2017, 36, 1521–1539. [Google Scholar] [CrossRef]
  34. Valada, A.; Spinello, L.; Burgard, W. Deep feature learning for acoustics-based terrain classification. In Robotics Research; Bicchi, A., Burgard, W., Eds.; Springer International Publishing: Cham, Switzerland, 2017; Volume 2, pp. 21–37. [Google Scholar] [CrossRef]
  35. Alías, F.; Socoró, J.; Sevillano, X. A Review of Physical and Perceptual Feature Extraction Techniques for Speech, Music and Environmental Sounds. Appl. Sci. 2016, 6, 143. [Google Scholar] [CrossRef]
  36. Sarker, C.; Mejias, L.; Maire, F.; Woodley, A. Flood Mapping with Convolutional Neural Networks Using Spatio-Contextual Pixel Information. Remote Sens. 2019, 11, 2331. [Google Scholar] [CrossRef]
  37. Lee, B.-J.; Lee, M.-S.; Jung, W.-S. Acoustic Based Fire Event Detection System in Underground Utility Tunnels. Fire 2023, 6, 211. [Google Scholar] [CrossRef]
  38. Tamagusko, T.; Correia, M.G.; Rita, L.; Bostan, T.-C.; Peliteiro, M.; Martins, R.; Santos, L.; Ferreira, A. Data-Driven Approach for Urban Micromobility Enhancement through Safety Mapping and Intelligent Route Planning. Smart Cities 2023, 6, 2035–2056. [Google Scholar] [CrossRef]
  39. Polap, D.; Wlodarczyk-Sielicka, M. Classification of Non-Conventional Ships Using a Neural Bag-of-Words Mechanism. Sensors 2020, 20, 1608. [Google Scholar] [CrossRef]
  40. Polap, D.; Włodarczyk-Sielicka, M. Interpolation merge as augmentation technique in the problem of ship classification. In Proceedings of the 2020 15th Conference on Computer Science and Information Systems (FedCSIS), Sofia, Bulgaria, 6–9 September 2020; pp. 443–446. [Google Scholar] [CrossRef]
  41. Yamashita, R.; Nishio, M.; Do, R.K.G.; Togashi, K. Convolutional neural networks: An overview and application in radiology. In Insights into Imaging; Springer Verlag: Berlin/Heidelberg, Germany, 2018; Volume 9, pp. 611–629. [Google Scholar] [CrossRef]
  42. Miller, S. Handbook for Agrohydrology; Natural Resources Institute: Chatham, UK, 1994. [Google Scholar]
  43. Raghunath, H.M. Hydrology Principles Analysis Design, 2nd ed.; New Age International (P) Limited: New Delhi, India, 2006. [Google Scholar]
  44. Bedoya, C.; Isaza, C.; Daza, J.M.; López, J.D. Automatic identification of rainfall in acoustic recordings. Ecol. Indic. 2017, 75, 95–100. [Google Scholar] [CrossRef]
  45. Suhaila, J.; Jemain, A.A. Fitting daily rainfall amount in Malaysia using the normal transform distribution. J. Appl. Sci. 2007, 7, 1880–1886. [Google Scholar] [CrossRef]
  46. Google Earth v9.151.0.1, Monash University Malaysia 3°03′50″ N, 101°35′59″ E, Elevation 18M. 2D Building Data Layer. Google. Available online: http://www.google.com/earth/index.html (accessed on 5 December 2021).
  47. Hershey, S.; Chaudhuri, S.; Ellis, D.P.W.; Gemmeke, J.F.; Jansen, A.; Moore, R.C.; Plakal, M.; Platt, D.; Saurous, R.A.; Seybold, B.; et al. CNN architectures for large-scale audio classification. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 131–135. [Google Scholar] [CrossRef]
  48. Chu, S.; Narayanan, S.; Kuo, C.-J. Environmental sound recognition with time–frequency audio features. IEEE Trans. Audio Speech Lang. Process. 2009, 17, 1142–1158. [Google Scholar] [CrossRef]
  49. Kapil, P.; Ekbal, A. A deep neural network based multi-task learning approach to hate speech detection. Knowl. Based Syst. 2020, 210, 106458. [Google Scholar] [CrossRef]
  50. Proakis, J.G.; Manolakis, D.G. Digital Signal Processing Principles, Algorithms, and Applications; Pearson Education: Karnataka, India, 1996. [Google Scholar]
  51. Umesh, S.; Cohen, L.; Nelson, D. Fitting the Mel scale. In Proceedings of the 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99 (Cat. No.99CH36258), Phoenix, AZ, USA, 15–19 March 1999; Volume 1, pp. 217–220. [Google Scholar] [CrossRef]
  52. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef]
  53. Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv 2015, arXiv:1502.03167. [Google Scholar]
  54. Hinton, G.E.; Srivastava, N.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R.R. Improving neural networks by preventing co-adaptation of feature detectors. arXiv 2012, arXiv:1207.0580. [Google Scholar]
  55. Halfwerk, W.; Holleman, L.J.M.; Lessells, C.M.; Slabbekoorn, H. Negative impact of traffic noise on avian reproductive success. J. Appl. Ecol. 2011, 48, 210–219. [Google Scholar] [CrossRef]
  56. Pijanowski, B.C.; Villanueva-Rivera, L.J.; Dumyahn, S.L.; Farina, A.; Krause, B.L.; Napoletano, B.M.; Gage, S.H.; Pieretti, N. Soundscape ecology: The science of sound in the landscape. Bioscience 2011, 61, 203–216. [Google Scholar] [CrossRef]
  57. Pratt, L.Y. Discriminability-based transfer between neural networks. In Advances in Neural Information Processing Systems; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 1993; pp. 204–211. [Google Scholar]
  58. Gemmeke, J.F.; Ellis, D.P.W.; Freedman, D.; Jansen, A.; Lawrence, W.; Moore, R.C.; Plakal, M.; Ritter, M. Audio Set: An ontology and human-labeled dataset for audio events. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 776–780. [Google Scholar] [CrossRef]
  59. Alzubaidi, L.; Zhang, J.; Humaidi, A.J.; Al-Dujaili, A.; Duan, Y.; Al-Shamma, O.; Santamaría, J.; Fadhel, M.A.; Al-Amidie, M.; Farhan, L. Review of deep learning: Concepts, CNN architectures, challenges, applications, future directions. J. Big Data 2021, 8, 53. [Google Scholar] [CrossRef] [PubMed]
  60. Xie, J.; Hu, K.; Zhu, M.; Yu, J.; Zhu, Q. Investigation of different CNN-based models for improved bird sound classification. IEEE Access 2019, 7, 175353–175361. [Google Scholar] [CrossRef]
  61. Huang, C.J.; Yang, Y.J.; Yang, D.X.; Chen, Y.J. Frog classification using machine learning techniques. Expert Syst. Appl. 2009, 36, 3737–3743. [Google Scholar] [CrossRef]
  62. Xie, J.; Towsey, M.; Zhang, J.; Roe, P. Investigation of acoustic and visual features for frog call classification. J. Signal. Process. Syst. 2020, 92, 23–36. [Google Scholar] [CrossRef]
  63. Chang, F.-J.; Chen, Y.-C. A counterpropagation fuzzy-neural network modeling approach to real time streamflow prediction. J. Hydrol. 2001, 245, 153–164. [Google Scholar] [CrossRef]
  64. Xu, L.; Zhao, J.; Li, C.; Li, C.; Wang, X.; Xie, Z. Simulation and prediction of hydrological processes based on firefly algorithm with deep learning and support vector for regression. Int. J. Parallel Emergent Distrib. Syst. 2020, 35, 288–296. [Google Scholar] [CrossRef]
  65. Chen, S.-H.; Lin, Y.-H.; Chang, L.-C.; Chang, F.-J. The strategy of building a flood forecast model by neuro-fuzzy network. Hydrol. Process. 2006, 20, 1525–1540. [Google Scholar] [CrossRef]
  66. Chang, T.; Talei, A.; Chua, L.; Alaghmand, S. The impact of training data sequence on the performance of neuro-fuzzy rainfall-runoff models with online learning. Water 2018, 11, 52. [Google Scholar] [CrossRef]
  67. Chang, F.-J.; Chiang, Y.-M.; Tsai, M.-J.; Shieh, M.-C.; Hsu, K.-L.; Sorooshian, S. Watershed rainfall forecasting using neuro-fuzzy networks with the assimilation of multi-sensor information. J. Hydrol. 2014, 508, 374–384. [Google Scholar] [CrossRef]
  68. Chang, T.K.; Talei, A.; Alaghmand, S.; Ooi, M.P.-L. Choice of rainfall inputs for event-based rainfall-runoff modeling in a catchment with multiple rainfall stations using data-driven techniques. J. Hydrol. 2017, 545, 100–108. [Google Scholar] [CrossRef]
  69. Sato, H.; Kurisu, K.; Morimoto, M.; Maeda, M. Effects of rainfall rate on physical characteristics of outdoor noise from the viewpoint of outdoor acoustic mass notification system. Appl. Acoust. 2021, 172, 107616. [Google Scholar] [CrossRef]
  70. Ma, B.B.; Nystuen, J.A. Passive acoustic detection and measurement of rainfall at sea. J. Atmos. Ocean Technol. 2005, 22, 1225–1248. [Google Scholar] [CrossRef]
  71. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 618–626. [Google Scholar] [CrossRef]
  72. Gaucherel, C.; Grimaldi, V. The Pluviophone: Measuring rainfall by its sound. J. Vib. Acoust. Trans. ASME 2015, 137, 034504. [Google Scholar] [CrossRef]
  73. Song, J. Bias corrections for random forest in regression using residual rotation. J. Korean Stat. Soc. 2015, 44, 321–326. [Google Scholar] [CrossRef]
Figure 1. Schematic of the research methodology of the present study.
Figure 3. Off-site data collection area with rain gauge locations.
Figure 4. Sample of audio spectrograms: (a) decibel spectrogram and (b) log-Mel spectrogram.
Figure 5. Sample spectrograms of different noises during a rainfall event, including (a) thunder noise, (b) impact noise and car passing noise, (c) motorbike passing and car horn noises, and (d) singing birds.
Figure 6. Urban rainfall recording denoising framework.
Figure 7. Rainfall data distribution based on the 1 min resolution measured data points.
Figure 8. Sample of 5 s log-Mel spectrogram of (a) rainfall recordings from the five environments and (b) different urban noises during rainfall.
Figure 9. Sample of 1 min log-Mel spectrograms containing rainfall and urban noises (a) before applying the denoising framework and (b) after applying the denoising framework.
Figure 10. Relationship plot between average signal amplitude (ASA) and rainfall intensity (mm·min−1).
Figure 11. Parity plots of observed vs. simulated rainfall intensities by the log-Mel spectrogram CNN model on the testing dataset (on site) using (a) 1 min data resolution and (b) 5 min data resolution.
Figure 12. Grad-CAM on a sample of the validation dataset, which highlights the areas contributing to the prediction decision of the log-Mel spectrogram CNN model.
Figure 13. Parity plots of observed vs. simulated rainfall intensities by the log-Mel spectrogram CNN model on the off-site dataset using 5 min data resolution.
Figure 14. Parity plots of observed vs. simulated rainfall intensities by the bias-corrected log-Mel spectrogram CNN model on the off-site dataset using 5 min data resolution.
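As a companion to Figure 12, the following is a minimal sketch of how a Grad-CAM heatmap [71] can be produced for a single-output regression CNN in TensorFlow/Keras; the model and layer names are placeholders rather than those of the trained models in this study.

```python
import numpy as np
import tensorflow as tf


def grad_cam(model, patch, conv_layer_name):
    """Grad-CAM heatmap for a CNN with a single regression output.

    `model` is any Keras CNN taking a (96, 64, 1) log-Mel patch and
    `conv_layer_name` is its last convolutional layer; both are placeholders.
    """
    grad_model = tf.keras.Model(
        inputs=model.inputs,
        outputs=[model.get_layer(conv_layer_name).output, model.output])
    x = tf.convert_to_tensor(patch[np.newaxis, ...], dtype=tf.float32)
    with tf.GradientTape() as tape:
        conv_maps, prediction = grad_model(x)
        target = prediction[:, 0]                     # predicted rainfall intensity
    grads = tape.gradient(target, conv_maps)          # d(prediction) / d(feature maps)
    weights = tf.reduce_mean(grads, axis=(1, 2))      # channel weights via global average
    cam = tf.reduce_sum(conv_maps * weights[:, tf.newaxis, tf.newaxis, :], axis=-1)
    cam = tf.nn.relu(cam)[0]                          # keep positive contributions only
    cam = cam / (tf.reduce_max(cam) + 1e-8)           # normalise to [0, 1]
    return cam.numpy()                                # resize/overlay on the spectrogram to plot
```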
Table 1. Details of the data collection points in the study site.

Point | Location | Remarks | Distance from the Rain Gauge (m)
Sound Recording Stations
A | Greenhouse | The most dominant surface is a perspex transparent flexible rooftop surface, while the recording area is surrounded by metallic surfaces | 172
B | Storage room on the rooftop of building 5A | The most dominant surface is a hard concrete surface | 63
C | University main gate | The most dominant surfaces are covered by interlock pavement and flexible canopy | 110
D | Food court | A glass canopy with a mix of vegetation cover | 37
E | Umbrella setup | A standard flexible and waterproofing umbrella fabric | 105
Rainfall Gauge Station
F | The rooftop of building 5 (rain gauge location) | Located in the centre of all other recording points. Moreover, it is the nearest recorder to one of the rain gauges | 0
Table 2. VGGish model architecture.

Layer No. | Components
Layer 1 | Input layer with 96 × 64 × 1
Layer 2 | Convolutional layer with 64 kernels, 3 × 3 kernel size, 2 × 2 stride step, zero padding, and ReLU activation function
Layer 3 | Max pooling layer with 2 × 2 kernel size, 2 × 2 stride step, and zero padding
Layer 4 | Convolutional layer with 128 kernels, 3 × 3 kernel size, 1 × 1 stride step, zero padding, and ReLU activation function
Layer 5 | Max pooling layer with 2 × 2 kernel size, 2 × 2 stride step, and zero padding
Layer 6 | Convolutional layer with 256 kernels, 3 × 3 kernel size, 1 × 1 stride step, zero padding, and ReLU activation function
Layer 7 | Convolutional layer with 256 kernels, 3 × 3 kernel size, 1 × 1 stride step, zero padding, and ReLU activation function
Layer 8 | Max pooling layer with 2 × 2 kernel size, 2 × 2 stride step, and zero padding
Layer 9 | Convolutional layer with 512 kernels, 3 × 3 kernel size, 1 × 1 stride step, zero padding, and ReLU activation function
Layer 10 | Convolutional layer with 512 kernels, 3 × 3 kernel size, 1 × 1 stride step, zero padding, and ReLU activation function
Layer 11 | Max pooling layer with 2 × 2 kernel size, 2 × 2 stride step, and zero padding
Layer 12 | Fully connected layer with 4096 neurons and a ReLU activation function
Layer 13 | Fully connected layer with 4096 neurons and a ReLU activation function
Layer 14 | Fully connected layer with 2 neurons and a sigmoid activation function
Layer 15 | A modified classification layer with a binary cross-entropy loss function
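For illustration, the architecture in Table 2 could be expressed in TensorFlow/Keras roughly as follows; "zero padding" is interpreted here as "same" padding, and the optimiser is an assumption, so this sketch approximates rather than reproduces the exact training setup.

```python
import tensorflow as tf
from tensorflow.keras import layers


def build_vggish_classifier(input_shape=(96, 64, 1)):
    """VGGish-style rain/noise classifier following the layer list in Table 2."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=input_shape),                                    # Layer 1
        layers.Conv2D(64, 3, strides=2, padding="same", activation="relu"),   # Layer 2
        layers.MaxPooling2D(2, strides=2, padding="same"),                    # Layer 3
        layers.Conv2D(128, 3, strides=1, padding="same", activation="relu"),  # Layer 4
        layers.MaxPooling2D(2, strides=2, padding="same"),                    # Layer 5
        layers.Conv2D(256, 3, strides=1, padding="same", activation="relu"),  # Layer 6
        layers.Conv2D(256, 3, strides=1, padding="same", activation="relu"),  # Layer 7
        layers.MaxPooling2D(2, strides=2, padding="same"),                    # Layer 8
        layers.Conv2D(512, 3, strides=1, padding="same", activation="relu"),  # Layer 9
        layers.Conv2D(512, 3, strides=1, padding="same", activation="relu"),  # Layer 10
        layers.MaxPooling2D(2, strides=2, padding="same"),                    # Layer 11
        layers.Flatten(),
        layers.Dense(4096, activation="relu"),                                # Layer 12
        layers.Dense(4096, activation="relu"),                                # Layer 13
        layers.Dense(2, activation="sigmoid"),                                # Layer 14
    ])
    # Layer 15: binary cross-entropy on the two-neuron sigmoid output
    # (the Adam optimiser is an assumption, not taken from the paper).
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model
```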
Table 3. CNN model performance in terms of accuracy, recall, specificity, and precision percentage on the training, validation, and testing datasets.

Dataset | Accuracy (%) | Recall (%) | Specificity (%) | Precision (%)
Training | 98.5 | 98.8 | 98.1 | 98.1
Validation | 98.7 | 98.9 | 98.4 | 98.4
Testing | 98.6 | 98.8 | 98.4 | 98.4
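The scores in Table 3 follow from the binary (rain vs. noise) confusion matrix; the helper below is an illustrative sketch of how they can be computed (the variable names are not from the paper).

```python
import numpy as np


def classification_metrics(y_true, y_pred):
    """Accuracy, recall, specificity, and precision (%) for binary labels (1 = rain)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return {
        "accuracy":    100 * (tp + tn) / (tp + tn + fp + fn),
        "recall":      100 * tp / (tp + fn),        # true-positive rate
        "specificity": 100 * tn / (tn + fp),        # true-negative rate
        "precision":   100 * tp / (tp + fp),
    }
```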
Table 4. The statistical measures of rainfall data (mm·min−1) for this study's training, validation, and testing datasets.

Data Split | Min | Max | Median | Mean | SD | Skewness
Training (80%) | 0.000 | 3.000 | 0.250 | 0.468 | 0.503 | 1.603
Validation (10%) | 0.000 | 3.000 | 0.250 | 0.470 | 0.506 | 1.619
Testing (10%) | 0.000 | 3.000 | 0.250 | 0.465 | 0.498 | 1.581
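The summary statistics in Table 4 can be reproduced with standard NumPy/SciPy routines, as in the sketch below; the use of the sample standard deviation is an assumption.

```python
import numpy as np
from scipy.stats import skew


def rainfall_summary(intensity_mm_per_min):
    """Min, max, median, mean, SD, and skewness of 1 min rainfall intensities."""
    x = np.asarray(intensity_mm_per_min, dtype=float)
    return {
        "min": x.min(), "max": x.max(), "median": float(np.median(x)),
        "mean": x.mean(), "sd": x.std(ddof=1),      # sample SD (ddof=1) is an assumption
        "skewness": float(skew(x)),
    }
```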
Table 5. Models' performances in terms of R², RMSE, and MAE on the training, validation, and testing datasets.

Model | Dataset | R² | RMSE (mm·min−1) | MAE (mm·min−1)
Baseline FC model | Training | 0.380 | 0.384 | 0.275
Baseline FC model | Validation | 0.374 | 0.398 | 0.285
Baseline FC model | Testing | 0.350 | 0.413 | 0.293
Decibel-Spectrogram-CNN | Training | 0.785 | 0.233 | 0.170
Decibel-Spectrogram-CNN | Validation | 0.753 | 0.252 | 0.177
Decibel-Spectrogram-CNN | Testing | 0.747 | 0.251 | 0.183
Log-Mel-Spectrogram-CNN | Training | 0.819 | 0.214 | 0.159
Log-Mel-Spectrogram-CNN | Validation | 0.789 | 0.233 | 0.166
Log-Mel-Spectrogram-CNN | Testing | 0.764 | 0.242 | 0.175
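The regression scores in Table 5 (and Table 6) are standard; a minimal sketch of how R², RMSE, and MAE can be computed from observed and simulated 1 min intensities is given below for reference.

```python
import numpy as np


def regression_scores(observed, simulated):
    """R^2, RMSE, and MAE between observed and simulated intensities (mm·min−1)."""
    obs = np.asarray(observed, dtype=float)
    sim = np.asarray(simulated, dtype=float)
    resid = obs - sim
    ss_res = np.sum(resid ** 2)
    ss_tot = np.sum((obs - obs.mean()) ** 2)
    return {
        "R2":   1.0 - ss_res / ss_tot,               # coefficient of determination
        "RMSE": float(np.sqrt(np.mean(resid ** 2))),
        "MAE":  float(np.mean(np.abs(resid))),
    }
```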
Table 6. CNN models' performances in terms of R², RMSE, and MAE on the testing dataset for each of the five data collection points.

Model | Environment | R² | RMSE (mm·min−1) | MAE (mm·min−1)
Decibel-Spectrogram-CNN | A | 0.800 | 0.239 | 0.173
Decibel-Spectrogram-CNN | B | 0.747 | 0.246 | 0.176
Decibel-Spectrogram-CNN | C | 0.749 | 0.276 | 0.202
Decibel-Spectrogram-CNN | D | 0.756 | 0.246 | 0.182
Decibel-Spectrogram-CNN | E | 0.594 | 0.262 | 0.193
Log-Mel-Spectrogram-CNN | A | 0.810 | 0.233 | 0.170
Log-Mel-Spectrogram-CNN | B | 0.735 | 0.252 | 0.179
Log-Mel-Spectrogram-CNN | C | 0.753 | 0.271 | 0.196
Log-Mel-Spectrogram-CNN | D | 0.791 | 0.230 | 0.167
Log-Mel-Spectrogram-CNN | E | 0.676 | 0.228 | 0.167
Table 7. CNN models' performance in terms of RMSE and MAE on the testing dataset for rainfall intensities below 60 mm·h−1 and at or above 60 mm·h−1.

Model | Intensity Class | RMSE (mm·min−1) | MAE (mm·min−1)
Decibel-Spectrogram-CNN | <60 mm·h−1 | 0.208 | 0.158
Decibel-Spectrogram-CNN | ≥60 mm·h−1 | 0.550 | 0.439
Log-Mel-Spectrogram-CNN | <60 mm·h−1 | 0.195 | 0.149
Log-Mel-Spectrogram-CNN | ≥60 mm·h−1 | 0.393 | 0.295
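Table 7 splits the evaluation at 60 mm·h−1, which corresponds to 1 mm·min−1 at the 1 min working resolution; the sketch below illustrates one way such a class-wise evaluation could be carried out (the helper names are illustrative, not from the paper).

```python
import numpy as np

THRESHOLD_MM_PER_MIN = 60.0 / 60.0   # 60 mm·h−1 expressed at the 1 min resolution


def errors_by_intensity_class(observed, simulated, threshold=THRESHOLD_MM_PER_MIN):
    """RMSE and MAE for the <60 mm·h−1 and >=60 mm·h−1 rainfall classes."""
    obs = np.asarray(observed, dtype=float)
    sim = np.asarray(simulated, dtype=float)
    scores = {}
    for label, mask in (("<60 mm/h", obs < threshold), (">=60 mm/h", obs >= threshold)):
        resid = obs[mask] - sim[mask]
        scores[label] = {"RMSE": float(np.sqrt(np.mean(resid ** 2))),
                         "MAE": float(np.mean(np.abs(resid)))}
    return scores
```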
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
