Automatic Scene Recognition through Acoustic Classification for Behavioral Robotics

Abstract: Classification of complex acoustic scenes under real-time scenarios is an active domain which has lately engaged several researchers from the machine learning community. A variety of techniques have been proposed for the classification of acoustic patterns or scenes, including natural soundscapes such as rain/thunder and urban soundscapes such as restaurants/streets. In this work, we present a framework for automatic acoustic classification for behavioral robotics. Motivated by several texture classification algorithms used in computer vision, a modified feature descriptor for sound is proposed which combines 1-D local ternary patterns (1D-LTP) with the baseline method, Mel-frequency cepstral coefficients (MFCC). The extracted feature vector is then classified using a multi-class support vector machine (SVM), which is selected as the base classifier. The proposed method is validated on two standard benchmark datasets, i.e., DCASE and RWCP, and achieves accuracies of 97.38% and 94.10%, respectively. A comparative analysis demonstrates that the proposed scheme performs exceptionally well compared to other feature descriptors.


Introduction
Robotics is the branch of artificial intelligence concerned with designing robots that can perform tasks and interact with the environment without human intervention. Although the mechanical control technology of robots has been remarkably well developed in recent years, the ability of robots to perceive and analyse their surrounding environment, especially auditory scenes, still requires significant research effort. Acoustic-based classification complements vision-based classification in a number of ways. First, considering the field of view, microphones are more nearly omni-directional than even wide-angle camera lenses. Second, audio signals require significantly smaller bandwidth and less processing power. Third, acoustic classification is more reliable, as the parameters of image/video processing algorithms are affected by variations in light intensity, which increases the probability of false alarms. Detection and classification of acoustic scenes can help to facilitate human-robot interaction and widen the application domain of behavioral and assistive robotics.
One of the key aspects of designing an acoustic classification system is the selection of proper signal features that achieve an effective discrimination between different sound signals. Sounds coming from a general environment are considered neither music nor speech, but a collection of audio signals that resemble noise. While sufficient research has focused on music and speech analysis, very little work has been done on a concrete analysis of feature selection for the classification of environmental sounds. One of the main objectives of this research is to investigate the effect of multiple features on the efficiency of an environmental scene classification system.
The state of the art for acoustic scene classification features a number of approaches. Table 1 presents a summary of some notable works in this domain, which are discussed as follows. In [1], an approach based on local binary patterns (LBP) is adopted to construct the spectrogram image of environmental sounds. The LBP features are enhanced by incorporating local statistics, normalized and finally classified by a linear SVM. The accuracy is validated against the RWCP dataset. In [2], the authors studied sound classification in a non-stationary noise environment. At first, probabilistic latent component analysis (PLCA) is performed for noise separation. Further, regularized kernel Fisher discriminant analysis (KFDA) is adopted for multi-class sound classification. The method is validated on the RWCP dataset. In [3], acoustic classification is performed using large-scale audio feature extraction. First, a large number of spectral, cepstral, energy and voice-related features are extracted from highly variable recordings. Then, a sliding window approach is adopted with an SVM to classify short recordings. Finally, majority voting is employed to classify long recordings. The work further proposes Mel spectra as the most relevant features. In [4], features based on LBP from the logarithm of the Gammatone-like spectrogram are proposed. However, LBP is sensitive to noise and discards important information. Therefore, a two-projection-based LBP feature descriptor is also proposed that captures the texture information of the spectrogram of sound events. In [5], a matching pursuit (MP) algorithm is used to extract effective time-frequency features from sounds. The MP technique uses a dictionary of atoms for feature selection, resulting in a set of features that are flexible and physically interpretable. In [6], the Fast Fourier Transform (FFT) is used to extract the spectral power and duration of event-based sounds. A number of features are extracted, which include time-domain zero crossings, spectral centroid, roll-off, flux and MFCC. Further, sound classification is done using an SVM and a multi-layer perceptron (MLP). In [7], a combination of log frequency cepstral coefficients (LFCC), Gaussian mixture models (GMMs) and a maximum likelihood criterion is employed to recognize various sound events for a cleaning robot. Experimental results demonstrate that the LFCC-based approach performs better than MFCC in low signal-to-noise ratio (SNR) environments. Human accuracy in performing similar classification tasks is also evaluated by experiments.
In [8], a feature extraction pipeline is proposed for analyzing audio scene signals. Features are computed from a histogram of gradients (HOG) of the constant Q-transform followed by an appropriate pooling scheme. The performance of the proposed scheme is tested on several datasets including Toy, East Anglia (EA) and another dataset named Litis Rouen collected by the authors. In [9], the MP algorithm is used to extract useful Gabor atoms from the input audio stream. MP is applied over the whole duration of the acoustic event. The time-frequency features are constructed from atoms in order to capture the temporal and spectral information of a sound event. Further, the classification is done using a random forest classifier. Deep neural network (DNN) based transfer learning is proposed in [12] for acoustic classification. First, the DNN is trained on a source domain task that performs mid-level feature extraction. Then, the pre-trained model is re-used on the DCASE target task. In [13], the authors propose that a dilated convolutional neural network (CNN) architecture performs environmental sound classification better than a CNN with max pooling. The effect of dilation rate and number of layers on performance is also investigated. The work in [14] proposes a hierarchical approach to classify different sound events such as silence, non-silence, speech, non-speech, music and noise. In contrast to a classical one-step classification scheme, a different set of effective features is selected at each level. In [15], a hearing aid system is proposed for real-time recognition of various sounds. The system is based on generating an audio fingerprint, i.e., a brief summary of an audio file which collects a number of features including spectrogram zero crossings (ZC), MFCCs, linear prediction coefficients (LPCs) and log area ratio (LAR). The recognition is done on self-collected sound samples using a K-nearest neighbors (KNN) classifier. The system achieves a maximum accuracy of 99%. In [16], the authors propose an automatic emotion classification system for music sounds. The work utilizes several features of the sound wave, i.e., peak value, average height, the number of half wavelengths, average width and beats per minute. Finally, regression analysis is performed to recognize various emotions from the sound. The system achieves an average accuracy of 77%. In [17], a sound identification method for a mobile robot in home and office environments is proposed. A simple sound database called Pitch-Cluster-Maps (PCMs), based on a vector quantization technique, is constructed and its codebook is generated using the binarized frequency spectrum. The works in [18,19] demonstrate that acoustic local ternary patterns (LTPs) perform better than MFCCs for the fall detection problem. In [20], various CNN architectures are used to classify soundtracks from a dataset of 70 million training videos (5.24 million hours) with 30,871 video-level labels. Experiments are performed using fully connected DNNs, VGG [21], AlexNet [22], Inception [23] and ResNet [24].
The acoustic scene classification approach proposed in this work has the following contributions.
• An extended feature descriptor is proposed which takes advantage of a modified 1-D LTP in combination with MFCC.
• A feature fusion methodology is adopted, which exploits the complementary strengths of both MFCC and modified 1-D LTP features to generate a serial vector.
• To provide better insight, a set of classifiers is tested on two standard benchmark datasets. This supports researchers in selecting the best classifier for this application.
The rest of the paper is organized as follows. In Section 2, the proposed method of acoustic scene classification is discussed. Section 3 describes the experimental setup and datasets. The performance results are presented and discussed in Section 4 and, finally, Section 5 concludes the paper.

System Overview
Figure 1 shows the overall architecture of the proposed acoustic scene classification system. The sound signal is captured from the environment through a microphone. It is digitized using an ADC in the preprocessing step and fed into the feature extraction stage. The MFCC and 1D-LTP features are extracted from the digital sound signal, fused together into a joint feature vector and finally classified using an SVM classifier. The main processing steps of the proposed system are discussed as follows.

1-D Local Ternary Patterns
Local binary patterns (LBPs) are among the most prominent feature descriptors in the field of computer vision and image analysis [25]. The basic idea behind LBP is to compare each pixel of an image with its neighborhood. Each comparison of an image pixel with its neighbors results in a binary value, '0' or '1'. This helps to summarize the local structure in an image and yields powerful feature descriptors for a number of applications such as face recognition [26] and texture analysis [27]. LBPs are invariant to monotonic grey-scale changes and have low computational cost [28]. Applying the LBP method to 1-D signals such as sound helps to obtain useful information about the local temporal dynamics of the sound. LBPs provide discriminative features for several kinds of sound, as exhibited by the works on music genre recognition [29] as well as environmental sound classification [1]. However, LBPs are highly affected by noise and fluctuations in acoustic samples [1]. In order to further improve the discriminative power of LBP, LTPs were proposed for face recognition in 2010 [30], and later applied in a number of works [31][32][33]. In contrast to LBPs, which encode the relationships of 'greater than' or 'less than' between a pixel and its neighbor, LTPs reflect the 'greater than', 'equal to' or 'less than' relationships. Under the same sampling conditions, LTPs help to achieve more discriminative and sophisticated sound features as compared to 1D-LBPs.
The analog audio signal is first digitized with sampling frequency $F_s$ to form a discrete signal X[i] having N samples. The 1D-LTPs of the sampled signal X[i] are computed using a sliding window approach. Consider a signal sample x[i] with amplitude α placed at the center of a window of size P + 1. The upper and lower amplitude thresholds are defined as (α + t) and (α − t), respectively, where t is an arbitrary constant. From the amplitudes of the signal samples that lie in the window, a ternary code vector F of size P is obtained: a quantization function Q(x[i]) assigns each neighboring sample one of three values according to whether its amplitude lies above the upper threshold, between the two thresholds, or below the lower threshold. From the ternary code vector, the upper and lower local ternary patterns, LTP_upper[i] and LTP_lower[i], are then computed. Figure 2 illustrates the extraction of 1D-LTP features for one sample of a discrete audio signal.
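The sliding-window procedure described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the ternary values +1/0/−1, the binary weighting 2^k of the neighbor positions, and the split into an upper pattern (built from the +1 entries) and a lower pattern (built from the −1 entries) follow the usual LTP construction from the image-processing literature. The function names `ternary_code` and `ltp_codes` are hypothetical, and the pooling of the codes into the 13-bin feature used later is not covered here.

```python
def ternary_code(window, t):
    """Ternary code of the P neighbors around the center sample of `window`.

    Assumed encoding: +1 above (alpha + t), -1 below (alpha - t), 0 otherwise.
    """
    mid = len(window) // 2
    alpha = window[mid]
    neighbors = list(window[:mid]) + list(window[mid + 1:])
    code = []
    for x in neighbors:
        if x > alpha + t:
            code.append(1)
        elif x < alpha - t:
            code.append(-1)
        else:
            code.append(0)
    return code


def ltp_codes(signal, P=8, t=0.0002):
    """Upper and lower 1D-LTP codes for every sample that has a full window.

    P is the number of neighbors (assumed even), so the window has P + 1 samples.
    """
    half = P // 2
    upper, lower = [], []
    for i in range(half, len(signal) - half):
        code = ternary_code(signal[i - half:i + half + 1], t)
        # Upper pattern keeps the +1 entries, lower pattern keeps the -1 entries.
        upper.append(sum(2 ** k for k, c in enumerate(code) if c == 1))
        lower.append(sum(2 ** k for k, c in enumerate(code) if c == -1))
    return upper, lower


if __name__ == "__main__":
    toy_signal = [0.0, 0.1, 0.0, -0.1, 0.0]
    print(ltp_codes(toy_signal, P=2, t=0.05))
```

With t = 0, every neighbor that differs from the center sample is coded as +1 or −1, so small fluctuations change the code; a positive t creates a dead zone around α, which is the source of the noise tolerance attributed to LTPs.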


Mel-Frequency Cepstral Coefficients (MFCC)
MFCCs are a baseline method that has been widely used in the analysis of audio signals. Although primarily designed for speech recognition [34,35], they have been a popular feature of choice in automatic scene classification [36,37]. The MFCCs are the coefficients that collectively make up the Mel Frequency Cepstrum (MFC), a representation of the short-term power spectrum of a sound based on a linear cosine transform of a log power spectrum on a non-linear Mel scale of frequency. The frequency bands of the MFC are equally spaced on the Mel scale, which closely approximates the human auditory system's response. Such a representation of the sound signal extracts discriminant features which help to achieve environmental sound classification with good accuracy.
Figure 3 shows a standard pipeline for the extraction of MFCC features. In the first step, the digitized sound signal is segmented into short frames, each having N samples. Next, the periodogram-based power spectrum is estimated for each frame. Let $s_i(n)$ denote the time-domain signal (of N samples) that belongs to frame i. Its Discrete Fourier Transform (DFT) is calculated as

$$S_i(k) = \sum_{n=1}^{N} s_i(n)\, h(n)\, e^{-j 2 \pi k n / N}, \qquad 1 \le k \le K,$$

where K denotes the length of the DFT and h(n) denotes the N-sample-long analysis window. In this work, a Hamming window is used to realize a high-pass FIR filter that emphasizes the high-frequency part of the signal and removes DC content. In the next step, the output of the complex Fourier transform is magnitude squared and the power spectral estimate of frame i is computed as

$$P_i(k) = \frac{1}{N} \left| S_i(k) \right|^2.$$

Then, a set of Mel-scaled filter banks is computed and applied to the power spectrum of each frame. The Mel scale is approximately linear for frequencies below 1000 Hz and logarithmic above it. To compute the filter bank energy spectrum, each filter is multiplied by the power spectrum computed above and the coefficients are added up. The Mel-filtered spectrum of frame i is computed as

$$\tilde{S}_i(l) = \sum_{k=1}^{K} P_i(k)\, H_l(k), \qquad l = 1, \dots, L,$$

where L denotes the total number of filters and $H_l$ denotes the transfer function of the lth filter. In the final step, the logarithm of the Mel-filtered energy spectrum is computed and the Discrete Cosine Transform (DCT) is applied to it. Mathematically,

$$c_i(n) = \sum_{l=1}^{L} \log\!\big(\tilde{S}_i(l)\big) \cos\!\left(\frac{\pi\, n\, (l - 0.5)}{L}\right),$$

where $n = 1, \dots, L$ is the cepstral coefficient number. In the proposed framework, the initial 13 MFCCs are used for scene classification.
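The MFCC pipeline for a single frame (window, power spectrum, Mel filter bank, logarithm, DCT) can be sketched in plain Python as below. This is an illustrative sketch under several assumptions that do not come from the paper: the Hamming coefficients 0.54/0.46, the Mel mapping 2595·log10(1 + f/700), triangular filters, 26 filters, a one-sided spectrum, and a naive O(N²) DFT chosen only to keep the example dependency-free. All function names are hypothetical.

```python
import cmath
import math


def hz_to_mel(f):
    # Common analytic approximation of the Mel scale (assumed, not from the paper).
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def power_spectrum(frame):
    """Periodogram of one Hamming-windowed frame (one-sided, naive DFT)."""
    n_samples = len(frame)
    windowed = [s * (0.54 - 0.46 * math.cos(2 * math.pi * n / (n_samples - 1)))
                for n, s in enumerate(frame)]
    spectrum = []
    for k in range(n_samples // 2 + 1):
        acc = sum(w * cmath.exp(-2j * math.pi * k * n / n_samples)
                  for n, w in enumerate(windowed))
        spectrum.append(abs(acc) ** 2 / n_samples)
    return spectrum


def mel_filterbank(num_filters, n_fft, fs):
    """Triangular filters whose centers are equally spaced on the Mel scale."""
    top = hz_to_mel(fs / 2.0)
    mel_points = [i * top / (num_filters + 1) for i in range(num_filters + 2)]
    bins = [int(math.floor((n_fft + 1) * mel_to_hz(m) / fs)) for m in mel_points]
    bank = []
    for l in range(1, num_filters + 1):
        lo, mid, hi = bins[l - 1], bins[l], bins[l + 1]
        filt = [0.0] * (n_fft // 2 + 1)
        for k in range(lo, mid):        # rising edge
            filt[k] = (k - lo) / (mid - lo)
        for k in range(mid, hi):        # falling edge
            filt[k] = (hi - k) / (hi - mid)
        bank.append(filt)
    return bank


def mfcc_frame(frame, fs, num_filters=26, num_ceps=13):
    """First `num_ceps` cepstral coefficients of one frame."""
    spectrum = power_spectrum(frame)
    bank = mel_filterbank(num_filters, len(frame), fs)
    energies = [sum(h * p for h, p in zip(filt, spectrum)) for filt in bank]
    # A small floor avoids log(0) for empty or silent bands.
    log_e = [math.log(max(e, 1e-12)) for e in energies]
    return [sum(log_e[l - 1] * math.cos(math.pi * n * (l - 0.5) / num_filters)
                for l in range(1, num_filters + 1))
            for n in range(1, num_ceps + 1)]


if __name__ == "__main__":
    fs = 16000
    tone = [math.sin(2 * math.pi * 1000 * n / fs) for n in range(256)]
    print(mfcc_frame(tone, fs))
```

In practice an FFT routine would replace the naive DFT loop; the rest of the structure is unchanged.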

Feature Fusion
The 1D-LTP and MFCC features extracted above are fused together to form a joint feature vector for classification. The fusion of 1D-LTP and MFCC features helps to obtain a more sophisticated feature representation which has better discriminative properties as well as an accurate representation in the frequency domain. The fusion process is a simple serial concatenation of the 1D-LTP and MFCC feature vectors.
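Serial fusion amounts to appending one vector to the other. A minimal sketch follows, assuming each clip has already been summarized as one 1D-LTP vector and one MFCC vector; the function name `fuse_features` and the placeholder values are hypothetical.

```python
def fuse_features(ltp_features, mfcc_features):
    """Serial (concatenation) fusion of two per-clip feature vectors."""
    return list(ltp_features) + list(mfcc_features)


if __name__ == "__main__":
    ltp = [0.1] * 13    # placeholder values for 13 1D-LTP bins
    mfcc = [1.5] * 13   # placeholder values for 13 MFCC coefficients
    print(len(fuse_features(ltp, mfcc)))
```

Because the two feature types can have very different numeric ranges, a per-feature normalization step before classification is a common addition; whether the authors apply one is not stated.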

Classification
The classification stage employs a multiclass SVM. The basic idea of an SVM is to find a hyperplane that separates D-dimensional data into two classes [38]. The SVM is a discriminative model for classification that principally depends on two basic assumptions. First, complex classification problems can be solved through simple linear discriminative functions by transforming the data into a high-dimensional space. Second, the decision surface is determined only by those training points that lie close to it, on the supposition that they provide the most relevant information for classification [39]. SVMs were originally proposed as binary classifiers. However, in real scenarios, data is to be classified into multiple classes. This is done by using a multiclass SVM, for which either a one-against-one (OAO) or a one-against-all (OAA) approach can be used [40]. For the acoustic scene classification setup proposed in this work, the joint feature vector extracted in the previous stage is used to train a multiclass SVM with the OAO approach.
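The one-against-one scheme trains one binary classifier per pair of classes and lets them vote. The sketch below illustrates only this voting structure; to stay dependency-free it uses a nearest-centroid rule as a stand-in for each binary SVM, which is an assumption made for illustration and not the classifier used in the paper. The function names and the toy labels are hypothetical.

```python
from collections import Counter
from itertools import combinations


def centroid(rows):
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def fit_one_against_one(X, y):
    """Fit one binary model per unordered class pair.

    Each "model" here is just the pair of class centroids (stand-in for an SVM).
    """
    models = {}
    for a, b in combinations(sorted(set(y)), 2):
        ca = centroid([x for x, lab in zip(X, y) if lab == a])
        cb = centroid([x for x, lab in zip(X, y) if lab == b])
        models[(a, b)] = (ca, cb)
    return models


def predict_one_against_one(models, x):
    """Each pairwise model casts one vote; the most voted class wins."""
    votes = Counter()
    for (a, b), (ca, cb) in models.items():
        votes[a if sq_dist(x, ca) <= sq_dist(x, cb) else b] += 1
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    X = [[0, 0], [0, 1], [5, 5], [5, 6], [10, 0], [10, 1]]
    y = ["rain", "rain", "street", "street", "cafe", "cafe"]
    models = fit_one_against_one(X, y)
    print(len(models), predict_one_against_one(models, [0.2, 0.4]))
```

For C classes the scheme needs C(C − 1)/2 binary classifiers, i.e., 105 for a 15-class problem and 136 for a 17-class problem.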

Setup
Experiments were performed using MATLAB 2016a on a 2.2 GHz Intel i7 processor with 8 GB RAM. The extracted features are MFCCs (13 coefficients) and 1D-LTPs (13 bins) with threshold t = 0.0002. Classification is performed with various SVM kernels, of which the quadratic and cubic kernels are retained because of their best performance [41]. The training/testing split is fixed at 80/20 (80% for training and 20% for testing) for both datasets. The performance of the classifier is measured through classification accuracy averaged over k-fold cross validation. The value k = 10 has been selected based on experimentation, as it generally results in the best accuracy with low bias, modest variance and low correlation. The classifier accuracy is measured as

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN},$$

where TP stands for true positive, TN for true negative, FP for false positive and FN for false negative.
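The evaluation protocol can be sketched as follows: the samples are shuffled and divided into k folds, each fold serves once as the test set, and the accuracy is averaged over the k runs. This is a minimal sketch under assumptions; the strided fold assignment, the fixed random seed and the function names `k_fold_indices` and `accuracy` are hypothetical choices, and the original experiments were run in MATLAB rather than Python.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


def accuracy(tp, tn, fp, fn):
    """Fraction of correct decisions among all decisions."""
    return (tp + tn) / (tp + tn + fp + fn)


if __name__ == "__main__":
    # 1170 = 15 classes x 78 samples, the size quoted for the DCASE set.
    for train, test in k_fold_indices(1170, k=10):
        print(len(train), len(test))
    print(accuracy(tp=45, tn=45, fp=5, fn=5))
```

The cross-validated accuracy is then the mean of the per-fold accuracies obtained by training on `train` and scoring on `test`.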
The performance of the proposed approach is also compared with several state-of-the-art audio feature representation techniques, i.e., MFCC, 1D-LBP and linear prediction cepstral coefficients (LPCC).

Datasets
An important challenge in acoustic scene classification for robotics is the collection of a proper environmental sound database. Since there is an infinite number of sounds, no single database can cover all of them. Therefore, no robotic system is capable of recognizing all sounds. Instead, the scene recognition capability is limited by the application domain and the set of tasks performed by the particular robot. In order to have an initial reference for comparison, two standard benchmark datasets are selected, i.e., (a) the real world computing partnership (RWCP) sound scene dataset [42] and (b) the DCASE challenge dataset [43].
RWCP is one of the first datasets collected for scene understanding. It contains sounds of various audio sources which were moved using a mechanical device. Recordings were made using a linear array of 14 microphones and a semi-spherical array of 54 microphones with a DAT recorder at a 48 kHz sampling frequency and 16-bit resolution. The average length of a sound sample is about 1 s. The proposed feature descriptor was tested on an experimental dataset consisting of 17 different environmental sounds, shown in Table 2 (a) along with the number of samples for each class.
The DCASE challenge dataset consists of sounds recorded in fifteen different urban environments. The duration of each sound clip is 30 s and the recordings were made in London. The dataset contains 15 classes of urban sounds, each with 78 sound samples, as given in Table 2 (b). The RWCP and DCASE databases contain a variety of sound classes that model general indoor and outdoor environments. We believe that verifying the performance of our proposed solution on these databases can help to realize intelligent systems for advanced applications such as sound localization [44] and human-robot interaction [45,46].
As discussed earlier, 1D-LTP features are discriminative. The scatter plots of Figures 4 and 5 show the distribution of 1D-LTPs for several classes of the RWCP and DCASE datasets. These plots demonstrate that the 1D-LTP feature values that belong to the same class lie close to each other, whereas the features belonging to different classes lie relatively far apart. Features having these strong discriminative properties result in good classification accuracy.

Results and Discussion
The accuracy trend for both datasets is shown in Figure 6. Table 3 presents the overall classification accuracy of the proposed and existing methods along with their computational time in seconds. The results show that the proposed method (i.e., 1D-LTP + MFCC) achieves better accuracy, with a computational time smaller than or comparable to that of other approaches. The combination of features has the advantage of capturing textural features and complex acoustic structures, which makes the proposed framework more effective in the classification of complex environmental scenes.
To get better insight, a few other performance metrics are also investigated, including sensitivity, specificity and error rate. Moreover, for a fair comparison, two classifier families, i.e., SVM and KNN, are considered due to their greater number of variants. Table 4 provides a comparison of seven classifiers on the DCASE dataset. The SVM with quadratic kernel (SVM-Q) shows better results in terms of accuracy, specificity and error rate, while the SVM with cubic kernel (SVM-C) and KNN weighted (KNN-W) show better sensitivity. In Table 5, the performance results are presented for the RWCP dataset. The SVM-Q classifier performs best in terms of accuracy and error rate, while better sensitivity and specificity values are achieved by the KNN medium (KNN-M) and SVM-C, respectively. The confusion matrix of the proposed approach for the DCASE dataset is shown in Figure 7. The figure shows that all classes except the city center class have an accuracy of more than 90%. The confusion matrix of the proposed approach for the RWCP dataset is shown in Figure 8. Here, the phone class has an accuracy of 89%, whereas all the remaining classes have an accuracy above 90%. The classification results of Figures 7 and 8 confirm the accuracy and validity of the proposed feature classification technique. To assess the reliability and robustness of our proposed method, confidence intervals on both datasets are also provided for two state-of-the-art classifiers. Figure 9 shows the confidence intervals with the minimum, maximum and average classification values of both classifiers. From these results, it is evident that SVM-Q can be selected as the standard classifier for this application.
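For a multi-class problem, sensitivity, specificity and error rate are usually computed per class from the confusion matrix by treating that class as positive and all others as negative. The sketch below uses these standard one-vs-rest definitions, which are assumed here because the text does not spell them out; the function name `per_class_metrics` and the toy matrix are hypothetical.

```python
def per_class_metrics(confusion, cls):
    """One-vs-rest metrics for class index `cls`.

    confusion[i][j] = number of samples of true class i predicted as class j.
    """
    total = sum(sum(row) for row in confusion)
    tp = confusion[cls][cls]
    fn = sum(confusion[cls]) - tp                 # class samples that were missed
    fp = sum(row[cls] for row in confusion) - tp  # other samples assigned to cls
    tn = total - tp - fn - fp
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "error_rate": (fp + fn) / total,
    }


if __name__ == "__main__":
    toy_confusion = [
        [8, 2, 0],
        [1, 9, 0],
        [0, 1, 9],
    ]
    print(per_class_metrics(toy_confusion, 0))
```

Averaging these per-class values over all classes gives a single figure per classifier of the kind compared in the tables.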

Conclusions
Scene classification is an important task in behavioral robotics. Using acoustic signals for environmental scene classification complements vision-based classification in many ways. This study aimed to select image texture classification features and investigate their effect on the classification of sound signals. In particular, the work proposes a modified feature descriptor as a combination of 1D-LTPs and MFCCs. Our analysis and simulation results for the two reference datasets, i.e., DCASE and RWCP, show that 1D-LTPs exhibit good discriminative properties for sound signals. On the other hand, the MFCCs, as the baseline method, approximate the behavior of the human auditory system. Fusing 1D-LTPs with MFCCs achieves a more sophisticated and discriminative feature representation of environmental sounds. The proposed fused feature vector is classified with various kernels of a multi-class SVM. Results demonstrate that the fused features, classified by an SVM with a quadratic kernel, achieve higher accuracy than other feature representations. The proposed system can be applied to a number of practical indoor and outdoor robotic scenarios.

Materials
Two publicly available datasets, RWCP and DCASE, are utilized in this research. The RWCP dataset is available at [42] and DCASE is available at: http://dcase.community/challenge2018/index.

Figure 1. System Architecture for Acoustic Scene Classification.

Figure 4. Acoustic scene classification evaluation over DCASE and RWCP dataset.

Figure 6. Classification performance of the proposed 1D-LTP and several other features over DCASE and RWCP dataset. The performance of classifier is measured through classification accuracy averaged over 5-fold cross validations.

Figure 7. Confusion matrix of the proposed approach for DCASE dataset.

Figure 8. Confusion matrix of the proposed approach for RWCP dataset.

Figure 9. Confidence interval against two selected classifiers on benchmark datasets.

Funding:
The authors extend their appreciation to the Deanship of Scientific Research at King Saud University for funding this work through research group No. RG-1438-034.

Table 1. Summary of published works on acoustic scene classification.

Table 2. Details of Individual Classes of RWCP and DCASE Datasets.

Table 3. Performance results for DCASE and RWCP datasets.

Table 4. Performance of various classifiers for proposed feature extraction approach for DCASE dataset.

Table 5. Performance of various classifiers for proposed feature extraction approach for RWCP dataset.