Ambient Sound Recognition of Daily Events by Means of Convolutional Neural Networks and Fuzzy Temporal Restrictions

Abstract: The use of multimodal sensors to describe activities of daily living in a noninvasive way is a promising research field in continuous development. In this work, we propose the use of ambient audio sensors to recognise events which are generated from the activities of daily living carried out by the inhabitants of a home. An edge-fog computing approach is proposed to integrate the recognition of audio events with the smart boards where the data are collected. To this end, we compiled a balanced dataset which was collected and labelled in controlled conditions. A spectral representation of sounds was computed and used as input to convolutional neural networks, which recognised ambient sounds with encouraging results. Next, fuzzy processing of audio event streams was included in the IoT boards by means of temporal restrictions defined by protoforms to filter the raw audio event recognition, which is key in removing false positives in real-time event recognition.


Introduction
Activity recognition (AR) has become an active research topic [1] focused on detecting human behaviours in smart environments [2]. Sensing human activity has been adopted in smart homes [3] with the aim of improving quality of life, allowing people to stay independent in their own homes for as long as possible [4].
In initial approaches, there was a predominance of binary sensors used to describe daily human activities within smart environments in a noninvasive manner. Next, a new generation of devices emerged to integrate a richer perspective in sensing smart objects and people's activities. Among them, the following types of sensors stand out: (i) wearable devices, which have been used to analyse activities and gestures [5]; (ii) location devices, which at present reach extremely high accuracy in indoor contexts [6]; (iii) vision sensors (visible-light or thermal-infrared sensors) in video and image sequences [7]; (iv) audio sensors [8] that recognise events based on audio information. This has been followed by a new trend of multimodal sensors that has enabled the use of general-purpose sensing technologies to monitor activities.
AR approaches are mainly grouped into two categories: knowledge-driven approaches [9] and data-driven approaches [10]. A number of previous AR studies have focused on classifying activities where the beginning and end of the activities, and therefore, the key features are known beforehand, which is referred to as explicit segmentation [11] or offline evaluation, as they do not provide real-time capabilities in AR. However, including real-time capabilities is a key requirement in AR in order to provide responses to real-world conditions [12], enabling adequate assistance services. In real-time AR, where the beginning and end of the events are unknown, approaches based on sliding windows to segment the data stream are required [11]. In addition, in the context of multimodal sensors, the use of deep learning models has shown promising performance in processing multimedia data [12].
In this work, we focus on the recognition of daily events by means of ambient sound devices and using deep learning models integrated with smart boards. The contributions of this work can be summarised as follows:
• Collecting a dataset of audio samples of events related to activities of daily living which are generated within indoor spaces;
• Integrating a fog-edge architecture with the IoT boards where the audio samples are collected to provide real-time recognition of audio events;
• Evaluating the performance of deep learning models for offline and real-time recognition of ambient daily living events in naturalistic conditions;
• Describing a straightforward fuzzy processing of audio event streams by means of temporal restrictions modelled on linguistic protoforms to improve the audio recognition.
The remainder of the paper is organised as follows: In Section 1, we review related works and methodologies; in Section 2, we describe the devices, architecture, and methodology of the approach; in Section 3, we present the results of a case study of event recognition; in Section 4, we detail our conclusions and ongoing work.

Related Works
The integration of technology into smart environments to support our daily lives in an immersive and ubiquitous way was introduced by ubiquitous computing as the age of calm technology, when technology recedes into the background of our lives [13]. From this visionary perspective at the beginning of the 1990s to our present Internet of Things, two key characteristics have been exploited over the last 30 years: (i) immersiveness or low invasiveness of integrated devices (both on our bodies and in our environment) and (ii) smart connected devices which provide interpretable outcomes from the information collected by sensors.
As described above, ambient binary sensors have been proposed to describe daily activities in indoor spaces [14] with the goal of deploying immersive sensors, providing encouraging results with accurately labelled datasets [15] under data-driven approaches [10]. Nowadays, the burgeoning growth of devices is promoting multimodal sensors which typically integrate video, audio, and wearable sensors [16], and other IoT devices with increasing high-capacity computing. The new trends are converging toward synthetic sensors [17], which are deployed to sense everything in a given room, enabling the use of general-purpose sensing technologies in order to monitor activities by means of sensor fusion. In this context, audio processing by smart microphones for the labelling of audible events is opening up a promising research field within AR [8].
On the architecture of components for learning and communication of devices, the paradigms of edge computing [18] or fog computing [19] have located the data and services within the devices where sensors are integrated, providing virtualised resources and engaged location-based services at the edge of the mobile networks under a new perspective of the Internet of Things (IoT) to develop collaborative smart objects which interact with each other and cooperate with their neighbours to reach common goals [20].
In the machine learning models for AR, describing sensor information under data-driven approaches has depended on the type of sensors, whether inertial [5] or binary sensors [15], where integrated methodologies to exploit spatial-temporal features have been proposed [21]. Additionally, deep learning (DL) has also been shown to be a suitable approach in AR to discover and extract features from sensors [22]. DL is well suited to multimodal sensor recognition, such as vision and audio, where obtaining hierarchical features to reduce complexity is key. Regarding vision sensors, the use of thermal vision has been proposed to guarantee privacy while preventing dangers such as falling by means of convolutional neural networks (CNNs) [23].
In the field of audio recognition, the combination of CNNs [24] with the use of spectrogram for sound representation [25] has been proven to generate encouraging results in sound recognition, which can be used for environmental sound classification [26][27][28] and music signal analysis [29,30]. Specifically, both the use of log-Mel spectrogram (LM) and Mel-frequency cepstral coefficient (MFCC) has been proposed for robust representation in sound classification [31].
In the given field of environmental sound recognition in indoor spaces, we highlight several approaches. In [32], the recognition of events, such as a bouncing ball or cricket, was carried out by means of spectral representation of sound with frame-level features, which was learned using Markov models. In [33], two classes of sounds (i.e., tapping and washing hands) were recognised using spectral and histogram of sounds by SVM in naturalistic conditions within a geriatric residence. In part of the study by [8], 3D spatial directional microphones allowed high-quality multidirectional audio to be captured to detect events and the location of sounds in an environment. For this purpose, Mel-frequency cepstral coefficients are computed as spatial features which are related to events using Gaussian and hidden Markov models. In [34], 30 events were collected to recognise the 7 rooms or spaces where the inhabitant carried out activities (bathroom, bedroom, house entrance, kitchen, office, outdoor, and workshop). In this work, log-Mel spectrograms were also evaluated for sound event classification, together with a DL model (VGG-16) pretrained with YouTube audios, with encouraging results but where accuracy was demonstrated to differ notably between controlled conditions and real-life contexts.
Moreover, fuzzy logic has been demonstrated to provide suitable sensor representation from the first AR methods [35] to recent works [21]. In addition, fusing and aggregating heterogeneous data from sensors have become key in edge-fog distributed architectures [36]. In concrete terms, the representation of temporal features by means of fuzzy logic has increased performance in several contexts of AR [10,37]. In addition, fuzzy logic has provided an interpretable representation of outcomes for low-level processing of sensor data [38] and has improved accuracy in uncertain and inaccurate sensor data [39]. Protoforms and fuzzy logic were proposed by Zadeh [40] as a useful knowledge model for reasoning [41] and summarisation [42] of data under uncertainty. The use of protoforms [43] and fuzzy rules to infer knowledge has provided suitable representations [44].
Based on the works and the approaches reviewed in this section, in this work, we present a dataset focused on daily living events in indoor environments to enhance AR using smart IoT boards. The proposed audio recognition model was based on spectral information of audio samples, together with learning from CNNs, which provides high-performance recognition with automatic spatial feature extraction. The audio predictions from DL models were filtered using fuzzy protoforms, which define temporal restrictions, to provide a coherent recognition of daily audio events. In addition, a case scenario in naturalistic conditions was evaluated to analyse the impact of the recognition of daily events in real time.

Materials and Methods
In this section, we describe the devices, architecture, and methods proposed for ambient sound recognition of daily events by means of smart boards and CNNs. First, in Section 2.1, we present the IoT board and audio sensors in an edge-fog architecture for collecting and labelling environmental sounds. Second, in Section 2.2, a DL model for ambient sound recognition of daily events is presented using a Mel-frequency spectrogram and CNNs. Third, in order to filter the raw audio event recognition, fuzzy processing of audio event streams is included in the IoT boards by means of temporal restrictions defined by protoforms, which is detailed in Section 2.3.

Materials: Devices and Architecture
In this section, we describe the materials and devices proposed for sound recognition of daily living events in smart environments. In the context of the Internet of Things and ubiquitous computing, the integration of devices into the spaces where the data are collected is characterised by immersiveness and low invasiveness. Here, an edge-fog computing approach was implemented.
First, we proposed the use of audio sensors connected to smart boards to collect and recognise sound events. The selected smart board was Raspberry Pi [45], which enables computing capabilities for machine learning, including deep learning models [46]. The audio sensors integrated were low-cost microphones with a USB connector, providing plug-and-play connectivity with Raspberry Pi under Raspbian Operating System. In Figure 1, we show both connected devices deployed in a bathroom. The aim of integrating audio sensors into smart boards for the recognition of daily events was to (i) collect sound samples for training purposes, (ii) train deep learning models from labelled sound samples, and (iii) recognise audio events to evaluate the trained models in a real-time context. The programming language used to code the application embedded into the Raspberry Pi was Python [47], and the deep learning models were implemented on Python with Keras, an open source library for neural networks [48]. The remote services for labelling of data and spreading the recognised output of audio events in real time were developed under MQTT, which provided a publish/subscribe protocol for wireless sensor networks [49]. This approach was inspired by the paradigms of fog and edge computing [50].
Second, for the purpose of collecting and labelling sound samples from smart environments, the Raspberry Pi collected sound samples of a given duration in the smart board in real time. In addition, the Raspberry Pi board was subscribed to an MQTT topic, where the start and end of each event were published to label a given sound event from a mobile application. Between the start and end of the time interval, the board stored the sound samples, associating each instance with a label. The mobile application for labelling sound samples was developed in Android [51], providing a mobile tool to label the events in a handheld device. In order to facilitate the task of labelling while the daily tasks are performed, NFC tags were placed on the objects and furniture involved in the events, such as doors or taps. The NFC tags automatically activated labelling in the mobile application when touched by the user, sending the start and end of a sound label under MQTT. In Figure 1, we show the NFC tags and the mobile application for labelling sound events.
Third, the recognition model of sound events was trained with the labelled data, computing real-time recognition of ambient sounds. For this purpose, the deep learning model for sound recognition, which is described in Section 2.2, had been previously trained and stored in the Raspberry Pi. The model received the segments of audio samples from the audio sensor as input and classified them according to the target labels. The prediction for each target was published by MQTT in real time to be reachable by other smart devices or AR models.
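As a sketch of this publishing step, each per-window prediction can be serialised and pushed to the broker over MQTT. The broker address, topic name, and payload fields below are hypothetical illustrations rather than the authors' exact implementation; the client library assumed is paho-mqtt.

```python
import json
import time

BROKER = "192.168.1.10"        # hypothetical address of the local MQTT broker
TOPIC = "home/audio/events"    # hypothetical topic for recognised audio events

def event_message(label, prob, ts=None):
    """Serialise one 3 s window prediction as a JSON payload."""
    return json.dumps({
        "label": label,                              # predicted event class
        "prob": round(float(prob), 3),               # classifier confidence
        "ts": ts if ts is not None else time.time()  # prediction timestamp
    })

def publish_prediction(label, prob):
    """Publish a prediction so other smart devices or AR models can subscribe."""
    import paho.mqtt.client as mqtt  # requires paho-mqtt on the Pi
    client = mqtt.Client()
    client.connect(BROKER)
    client.publish(TOPIC, event_message(label, prob))
    client.disconnect()
```

Only the event label and its degree leave the board; the raw audio never does, which is what preserves the inhabitant's privacy in this edge approach.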
Fourth, fuzzy filtering of raw audio event prediction was carried out by means of temporal restrictions using linguistic protoforms in an interpretable way. This enabled us to filter predictions which did not match with protoforms defined by fuzzy temporal windows and fuzzy quantifiers. The architecture of components of the proposed approach is described in Figure 2.

Deep Learning Model for Ambient Sound Recognition of Daily Events
In this section, we describe a classifier model for ambient sound recognition of daily events based on spectral representation and DL models. First, as detailed previously, the translation from unidimensional digital audio samples to bidimensional spatial representation based on spectrogram features (a picture of sound) provides encouraging results in ambient audio classification [25].
In this work, a window size of 3 s was defined to segment and collect the ambient audio samples, as it provides a suitable time interval for audio recognition [26]. The sampling frequency of the ambient audio sensor was set to 44.1 kHz.
Next, we extracted two representations of the spectrum of each sound, which were evaluated as input by different CNNs:
• Log-Mel spectrogram (LM), calculated as a time-frequency representation of the audio signal using a log power spectrum on a nonlinear Mel scale of frequency. Setting the length of the fast Fourier transform window to 2048 produces images sized 128 × 130.
• Log-scaled Mel-frequency cepstral coefficients (MFCCs) with 13 components, computed from the raw audio signals; these represent the spectrum of the sound using a linear cosine transform of a log power spectrum on a nonlinear Mel scale of frequency [52]. As traditional MFCCs use between 8 and 13 cepstral coefficients [53], we selected 13 features to provide the most representative information of the audio samples. With this configuration, the resulting MFCC spectrogram of positive frequencies produces images sized 13 × 130.
In Figure 3, we show MFCCs of the audio samples collected from daily living events, which were subsequently used as inputs to be classified with the corresponding sound labels using a CNN.
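The two representations above can be sketched with librosa. The hop length is not stated in the text; a value of 1024 is an assumption that reproduces the reported 128 × 130 and 13 × 130 shapes for a 3 s clip at 44.1 kHz (librosa's centred STFT yields 1 + ⌊N/hop⌋ frames).

```python
SR = 44100       # sampling rate used in this work
CLIP_S = 3       # 3 s segmentation window
N_FFT = 2048     # FFT window length stated in the text
HOP = 1024       # assumed hop length; reproduces 130 frames per 3 s clip

def spectrogram_frames(n_samples, hop=HOP):
    """Frames produced by librosa's centred STFT: 1 + floor(N / hop)."""
    return 1 + n_samples // hop

def log_mel(y, sr=SR):
    """Log-Mel spectrogram; shape (128, 130) for a 3 s clip under HOP=1024."""
    import librosa
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=N_FFT,
                                       hop_length=HOP, n_mels=128)
    return librosa.power_to_db(S)

def mfcc13(y, sr=SR):
    """13 MFCC components; shape (13, 130) for a 3 s clip under HOP=1024."""
    import librosa
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=N_FFT, hop_length=HOP)
```

With these settings, `spectrogram_frames(3 * 44100)` gives the 130 time frames reported for both representations.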
CNNs are described as feature extractors and classifiers with encouraging results in image recognition [54]. The use of different CNN models with several layers of feature extraction [26,31] has been proposed for ambient audio recognition purposes according to the representation of the spectrum of the sounds. Therefore, in this work, two CNN models were evaluated: (i) a CNN model with five convolutional layers for MFCC processing, where a single average pooling layer is included after the convolutions due to the reduced input space of 13 × 130 × 1, and (ii) a CNN model with five convolutional layers and max pooling reductions; their configurations are shown in Table 1. The models were implemented with Keras under Python to enable real-time integration with Raspberry Pi, using an edge-computing approach which publishes the detected events without exposing sensitive audio sensor data from homes, guaranteeing the privacy of the inhabitants.
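A minimal sketch of the MFCC-branch model is shown below. The filter counts and kernel sizes are placeholders, not the configuration from Table 1 (which is not reproduced here); what the sketch preserves is the stated structure of five convolutional layers followed by a single average pooling before the classifier.

```python
def conv_out(n, kernel, stride=1, padding=0):
    """Output length of one convolution dimension: (n + 2p - k) // s + 1."""
    return (n + 2 * padding - kernel) // stride + 1

def build_mfcc_cnn(n_classes):
    """Hypothetical 5-convolution CNN for the 13 x 130 x 1 MFCC input."""
    from tensorflow import keras
    from tensorflow.keras import layers
    model = keras.Sequential(
        [layers.Input(shape=(13, 130, 1))]
        # five convolutional layers; 'same' padding keeps the small input usable
        + [layers.Conv2D(32, (3, 3), padding="same", activation="relu")
           for _ in range(5)]
        # a single average pooling after the convolutions, as in the text
        + [layers.GlobalAveragePooling2D(),
           layers.Dense(n_classes, activation="softmax")]
    )
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

The `conv_out` helper makes explicit why pooling is deferred: without padding, even a 3 × 3 kernel shrinks the 13-bin frequency axis on every layer, so repeated max pooling (suitable for the 128 × 130 LM input) would exhaust it.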

Fuzzy Protoforms to Describe Daily Events from Audio Recognition Streams
In this section, we describe the formal representation of audio streams computed under a linguistic representation [36]. The aim of fuzzy processing is to include a filtering process of audio classification in real-time conditions in order to provide temporal restrictions and criteria to identify a given event.
The stream of audio recognition from a smart audio sensor s_j is composed of a set of predictions s_j = {m_0, ..., m_t}, where each m_i is the event predicted at timestamp t_i. From the sensor streams, we defined protoforms which integrate an interpretable, rich, and expressive approach that models the expert knowledge in the stream linguistically. The protoform has the shape Q_k V_r T_j, where Q_k, V_r, and T_j are identifiers of the following linguistic terms:
• V_r defines a crisp term whose value is directly related to a recognised event r.
• T_j defines a fuzzy temporal window (FTW) j over which the audio event V_r is aggregated. The FTWs are described according to the distance from the current time t* to a given timestamp t_i as Δt_i = t* − t_i, using the membership function μ_{T_j}(Δt_i), which assigns a degree of relevance in [0, 1] to the time elapsed Δt_i between the point of time t_i and the current time t*.
• An aggregation function of V_r over T_j computes a unique aggregation degree of the occurrence of the event V_r within the temporal window T_j. To this end, a t-norm and a t-conorm are defined to aggregate a linguistic term and a temporal window, using the fuzzy weighted average (FWA) [55] to compute the degree of the linguistic term in the temporal window. In this way, the t-norm computes the temporal degree for each point of time in the temporal window, and the t-conorm aggregates these degrees over the whole temporal window into a unique representative degree.
• Q_k is a fuzzy quantifier k that filters and transforms the aggregation degree of the audio event V_r within the FTW T_j. The quantifiers defined in this domain are represented by fuzzy sets [56]. The quantifier applies a transformation μ_{Q_k}: [0, 1] → [0, 1] to the aggregated degree [57].
In this work, a protoform was defined for each event or audio class to be recognised.
The protoform defines temporal restrictions using the relevance of the term (quantifier) in the temporal window (FTW) under conditions of relative normality. For example, the phrase many vacuum cleaner sounds for half a minute determines a protoform in which the term many defines the quantifier, and the term half a minute defines the temporal window. The degree of the protoforms, which is computed between 0 and 1, determines the degree of truth of the recognition of the audio event. Applying these temporal restrictions enabled the removal of false positives in AR, which is key in analysing the normality of behaviours in daily life.
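The vacuum-cleaner example can be sketched as follows. The membership shapes and parameters below are illustrative assumptions; the actual TS/TL definitions and the parameters of Tables 6 and 7 are not reproduced here.

```python
def clip01(x):
    """Clamp a degree to the unit interval [0, 1]."""
    return max(0.0, min(1.0, x))

def ftw_membership(dt, t1=10.0, t2=30.0):
    """FTW relevance of a prediction dt seconds old: fully relevant up to
    t1 s ago, fading linearly to 0 at t2 s (illustrative parameters)."""
    return clip01((t2 - dt) / (t2 - t1))

def quantifier_many(x, a=0.3, b=0.7):
    """'many' quantifier: 0 below a, 1 above b, linear in between."""
    return clip01((x - a) / (b - a))

def protoform_degree(stream, event, t_now):
    """Degree of truth of 'many <event> sounds for half a minute'.
    stream is a list of (timestamp, predicted_label) pairs; the fuzzy
    weighted average weights each prediction by its FTW relevance."""
    weights, values = [], []
    for ts, label in stream:
        mu = ftw_membership(t_now - ts)   # t-norm step: temporal degree
        if mu > 0.0:
            weights.append(mu)
            values.append(1.0 if label == event else 0.0)
    if not weights:
        return 0.0
    # co-norm step: aggregate all degrees into one via FWA
    fwa = sum(m * v for m, v in zip(weights, values)) / sum(weights)
    return quantifier_many(fwa)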

Results
In this section, we present the results of the approach. First, a collection of ambient sounds from daily living events in the home is presented, together with the evaluation of the proposed methodology in offline and real-time conditions in different case studies. The data were collected in a home with four rooms (living room, bedroom, kitchen, and bathroom) for an inhabitant who lives there as their usual residence.
First, we created a dataset of ambient sounds from daily living events in the home. The selected activities/events to be recognised in the case study are detailed in Table 2. For each label, a balanced dataset of 100 sound samples with a duration of 3 s was collected in naturalistic conditions. For the labelling of events, the mobile application described in Section 2.1 was integrated to determine the start and end of each event. In the first evaluation, a cross-validation method was carried out to analyse the capabilities of the audio recognition model in offline conditions over the collected and balanced dataset with an explicit segmentation of the audio samples with a window size of 3 s. Next, the approach was evaluated in real time over four scenarios in which audio samples were collected from ambient microphones while the inhabitant carried out activities of daily living in naturalistic conditions. The case studies have a duration of 2220 s, with a total of 760 samples analysed.
The dataset of audio samples collected in this work and the labels of the scenes are available in the following repository: https://github.com/AuroraPR/Ambiental-Sound-Recognition (last accessed 15 July 2021), which includes the implementation of the proposed methods with Python and Keras.

Offline Case Study Evaluation
In this section, we describe the results provided by the deep learning models based on CNN and LM and MFCC representation for ambient sound recognition with the data collected and a public dataset in an offline context using 10-fold cross-validation.

• Ad hoc ambient audio dataset. The dataset includes audio samples which were collected in a single home and labelled with an explicit segmentation of 3 s for events that occurred in controlled conditions, using the approach described in Section 2.1. All classes described in Table 2 are included in the dataset.
• Audioset dataset (repository: https://research.google.com/audioset/, last accessed 15 July 2021). This public dataset provides YouTube videos labelled over the segment where a given sound occurs. From the categories of the dataset, we selected 12 events related to our classes: "Toilet flush", "Conversation", "Dishes, pots, and pans", "Alarm clock", "Water", "Water tap", "Printer", "Microwave oven", "Doorbell", "Door", "Telephone ringing", and "Silence". The sounds collected from Audioset correspond to a balanced dataset with 60 files per class, with an explicit segmentation of the sound events.
For each dataset, we present a comparison of the confusion matrices for each fold in the cross-validation that was computed. First, in Figure 4 we present the performance of the DL models in ambient sound recognition of daily events for the ad hoc dataset. Second, in Figure 5, we present the performance of the DL models in the Audioset dataset. In Table 3, we describe the metrics of f1 score, precision, and recall for both DL models, and the evaluated datasets.
As can be observed, the ad hoc ambient audio dataset yields excellent results for both CNN models in controlled conditions, with CNN + MFCC showing the best results. However, the performance in sound recognition of daily events with the Audioset dataset is highly deficient. This is because the audio samples from YouTube videos include noise, overlap with other sounds, and audio generated from heterogeneous sources. For interested readers, the Audioset samples are available in the repository of this work. Consequently, collecting an ad hoc ambient audio dataset is strongly recommended given the weak sampling from heterogeneous sources. For the ad hoc ambient audio dataset, we report the number of trainable parameters, learning time, millions of instructions (up to 40 epochs), and evaluation time on a Raspberry Pi 3B, whose core frequency is 400 MHz, in Table 4.
Based on these results, in the next section, we describe the evaluation in real-time conditions using the best configuration with the ad hoc ambient audio dataset and the model based on CNN + MFCC which also requires fewer computational resources for audio recognition learning and evaluation.

Real-Time Case Study Evaluation
Next, we present the results for the evaluation of four scenes at a home in naturalistic conditions using the CNN + MFCC model trained on the ad hoc ambient audio dataset. The scenes comprised sequences of activities such as the following:
• (Scene 1) The inhabitant arrived home, went to the kitchen and started talking, then started using cutlery, turned on the extractor fan for a long while, turned on the tap, turned on the microwave, and was called on the phone.
In this context, a new label is necessary to recognise idle as an event class, which corresponds to the absence of target events, including silence and other ambient sounds produced by the inhabitant. The addition of the idle label is key for AR learning in real-time conditions [11,15]. For evaluation purposes, idle activity was included using a scene cross-validation, where each scene is learned with idle audio samples from the other scenes, together with the offline dataset of target events.
In Table 5, we detail the performance of the CNN + MFCC ambient sound recognition model, comparing the ground truth against the inferred classification by means of F1-score, accuracy, precision, and recall for each scene.

Fuzzy Protoforms and Fuzzy Rules
In this section, we describe the linguistic protoforms which define temporal restrictions on the raw audio predictions in order to provide a coherent recognition of daily audio events. The FTWs and fuzzy quantifiers were defined with membership functions based on the TS and TL functions (listed in Abbreviations). In Tables 6 and 7, we describe the membership functions for quantifiers and FTWs, together with the protoforms for each audio event, which define the temporal restrictions for normality.
The impact of filtering the raw audio events from the recognition model was evaluated for the real-time scenarios (offline evaluation was not possible, as it does not provide a stream of daily events). Beyond the encouraging results described in the previous section, in these scenes, we identified the recognition of scarce audio events which are not related to the correct occurrence of events. In Figure 6, we show the ground truth and raw predicted audio events in a timeline for the four scenes, including the detection of false-positive events. In Table 8, we describe the false positives and negatives computed from the time interval detection of home events using (i) raw processing and (ii) fuzzy temporal restrictions. Computing the false positives and negatives of time intervals has been described as a relevant metric for detecting events in activity recognition regardless of their duration [10]. The evaluation of these audio events in temporal windows using protoforms, which determine a minimal restriction for recognition, enabled filtering out the most spurious occurrences, as well as defining a degree of adherence to the protoform between 0 and 1. The use of fuzzy temporal restrictions provides an encouraging method, reducing the false positives of the raw audio recognition from 24 occurrences to 2 while introducing only 2 false negatives.

Limitations of the Work
The activity recognition methods and devices proposed in this work present encouraging performance in offline and real-time recognition of ambient audio events. A balanced dataset with 100 samples per label is sufficient to work in controlled and naturalistic conditions; however, translating the results to deployments "in the wild" [34] would require a larger dataset and additional data preprocessing methods, such as clustering and augmentation. Evaluation with Audioset provided highly deficient results due to noise, overlapping with other sounds, and audio generation from heterogeneous sources. Evaluating audio events in different domains will require extensive datasets and complex processing for domain adaptation methods [58].

Conclusions and Ongoing Work
In this work, we evaluated the capabilities of audio recognition models based on spectral information and deep learning to identify ambient events related to the daily activities of inhabitants in a home. To this end, an edge-fog computing approach with smart boards was presented, which enabled the evaluation and recognition of audio samples within the devices while preserving the privacy of the users. Fuzzy processing of audio event streams was included in the IoT boards to filter the raw prediction of audio events by means of temporal restrictions defined by protoforms. The fuzzy processing of audio recognition proved crucial in real-time scenarios to avoid false positives and provide a coherent recognition of daily events detected from protoforms which are directly defined in linguistic terms.
In ongoing research, we aim to integrate a fusion of heterogeneous sensors, such as wearable and binary sensors, to increase the sensing capabilities of audio recognition with other daily activity events. In addition, fuzzy rules could enhance the knowledge-based definition of activities with steady processing from raw data, integrating the data collected from different sensors.

Data Availability Statement: The dataset of audio samples collected in this work and the labels of the scenes are available in the following repository: https://github.com/AuroraPR/Ambiental-Sound-Recognition (last accessed 15 July 2021), which includes the implementation of the proposed methods with Python and Keras.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

CNN convolutional neural network
IoT Internet of Things
MFCC Mel-frequency cepstral coefficient
AR activity recognition