1. Introduction
Robotics is the branch of artificial intelligence concerned with designing robots that can perform tasks and interact with the environment without human intervention. Although the mechanical control technology of robots has developed remarkably in recent years, the ability of robots to perceive and analyse their surroundings, especially auditory scenes, still requires significant research effort. Acoustic classification complements vision-based classification in a number of ways. First, considering the field of view, microphones are more nearly omni-directional than even wide-angle camera lenses. Second, audio signals require significantly smaller bandwidth and lower processing power. Third, acoustic classification is more reliable in the sense that image/video processing algorithms are sensitive to variations in light intensity, which increases the probability of false alarms. Detection and classification of acoustic scenes can facilitate human-robot interaction and broaden the application domain of behavioral and assistive robotics.
One of the key aspects of designing an acoustic classification system is the selection of signal features that can effectively discriminate between different sound signals. Sounds coming from a general environment are neither music nor speech, but a collection of audio signals that resemble noise. While substantial research has focused on music and speech analysis, comparatively little work has addressed feature selection for the classification of environmental sounds. One of the main objectives of this research is therefore to investigate the effect of multiple features on the efficiency of an environmental scene classification system.
The state-of-the-art for acoustic scene classification features a number of approaches.
Table 1 presents a summary of notable works in this domain, which are discussed as follows. In [1], an approach based on local binary patterns (LBP) is adopted to construct the spectrogram image of environmental sounds. The LBP features are enhanced by incorporating local statistics, normalized, and finally classified with a linear SVM. The accuracy is validated on the RWCP dataset.
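To illustrate the LBP idea used in [1], the following sketch computes basic 8-neighbour LBP codes over a spectrogram and pools them into a normalized histogram feature. This is a minimal illustration, not the authors' enhanced descriptor; it assumes a precomputed magnitude spectrogram as a 2-D NumPy array, and the function names are ours.

```python
import numpy as np

def lbp_image(spec):
    """Basic 8-neighbour local binary patterns over a 2-D spectrogram
    `spec`; returns one 0-255 code per interior pixel."""
    # Offsets of the 8 neighbours, ordered clockwise from top-left.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    center = spec[1:-1, 1:-1]
    codes = np.zeros_like(center, dtype=np.uint8)
    for bit, (dy, dx) in enumerate(offsets):
        # Shifted view aligned with the interior region.
        neigh = spec[1 + dy:spec.shape[0] - 1 + dy,
                     1 + dx:spec.shape[1] - 1 + dx]
        codes |= ((neigh >= center).astype(np.uint8) << bit)
    return codes

def lbp_histogram(spec):
    """256-bin normalized LBP histogram used as a feature vector."""
    codes = lbp_image(spec)
    hist = np.bincount(codes.ravel(), minlength=256).astype(float)
    return hist / hist.sum()
```

The histogram is what would typically be fed to the linear SVM; normalization keeps recordings of different lengths comparable.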
In [2], the authors study sound classification in a non-stationary noise environment. First, probabilistic latent component analysis (PLCA) is performed for noise separation; then, regularized kernel Fisher discriminant analysis (KFDA) is adopted for multi-class sound classification. The method is validated on the RWCP dataset. In [3], acoustic classification is performed using large-scale audio feature extraction. First, a large number of spectral, cepstral, energy and voice-related features are extracted from highly variable recordings. Then, a sliding-window approach is adopted with an SVM to classify short recordings. Finally, majority voting is employed to classify long recordings. The work further identifies Mel spectra as the most relevant features.
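The sliding-window-plus-voting scheme of [3] can be sketched as follows. This is a minimal illustration, not the authors' implementation: the per-window classifier is an arbitrary callable standing in for a trained SVM, and the window parameters are illustrative.

```python
import numpy as np
from collections import Counter

def sliding_windows(signal, win_len, hop):
    """Split a 1-D signal into overlapping short windows."""
    starts = range(0, len(signal) - win_len + 1, hop)
    return [signal[s:s + win_len] for s in starts]

def classify_recording(signal, win_len, hop, predict):
    """Label each short window with `predict` (any per-window
    classifier, e.g. a trained SVM), then majority-vote the window
    labels to obtain a single label for the whole recording."""
    votes = [predict(w) for w in sliding_windows(signal, win_len, hop)]
    return Counter(votes).most_common(1)[0][0]

# Toy stand-in classifier: labels a window by its mean energy.
toy_predict = lambda w: "loud" if np.mean(w ** 2) > 0.5 else "quiet"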
In [4], features based on LBP from the logarithm of the Gammatone-like spectrogram are proposed. However, LBP is sensitive to noise and discards important information; therefore, a two-projection-based LBP feature descriptor is also proposed that captures the texture information of the spectrogram of sound events. In [5], a matching pursuit (MP) algorithm is used to extract effective time-frequency features from sounds. The MP technique uses a dictionary of atoms for feature selection, resulting in a set of features that are flexible and physically interpretable.
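The core greedy loop of matching pursuit can be sketched as follows. A toy dictionary of unit-norm cosine atoms stands in for the Gabor dictionaries typically used in this literature; the atom choice and sizes are illustrative.

```python
import numpy as np

def matching_pursuit(signal, dictionary, n_atoms):
    """Greedy matching pursuit: at each step pick the dictionary atom
    (unit-norm column) with the largest inner product with the
    residual, record its index and coefficient, and subtract its
    contribution from the residual."""
    residual = signal.astype(float).copy()
    picks = []
    for _ in range(n_atoms):
        corr = dictionary.T @ residual
        k = int(np.argmax(np.abs(corr)))
        picks.append((k, float(corr[k])))
        residual -= corr[k] * dictionary[:, k]
    return picks, residual

# Toy dictionary: unit-norm cosine atoms of length 64 at 8 frequencies.
n = 64
freqs = np.arange(1, 9)
atoms = np.stack([np.cos(2 * np.pi * f * np.arange(n) / n)
                  for f in freqs], axis=1)
atoms /= np.linalg.norm(atoms, axis=0)
```

The selected atom indices and coefficients form the sparse, physically interpretable representation the text refers to.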
In [6], the Fast Fourier Transform (FFT) is used to extract the spectral power and duration of event-based sounds. A number of features are extracted, including time-domain zero crossings, spectral centroid, roll-off, flux and MFCCs. Sound classification is then done using an SVM and a multi-layer perceptron (MLP).
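A minimal, NumPy-only sketch of several of these descriptors (zero crossings, spectral centroid, roll-off and flux) is given below; the frame length, sample rate and roll-off fraction are illustrative choices, not values from [6].

```python
import numpy as np

def frame_features(frame, prev_mag=None, sr=16000, rolloff_pct=0.85):
    """Simple time- and frequency-domain descriptors for one audio
    frame: zero-crossing count, spectral centroid, roll-off frequency
    and spectral flux against the previous frame's magnitude spectrum."""
    # Zero crossings: sign changes between consecutive samples.
    zc = int(np.sum(np.abs(np.diff(np.sign(frame))) > 0))
    mag = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    # Centroid: magnitude-weighted mean frequency.
    centroid = float(np.sum(freqs * mag) / (np.sum(mag) + 1e-12))
    # Roll-off: frequency below which rolloff_pct of the energy lies.
    cum = np.cumsum(mag)
    rolloff = float(freqs[np.searchsorted(cum, rolloff_pct * cum[-1])])
    # Flux: squared spectral change versus the previous frame.
    flux = 0.0 if prev_mag is None else float(np.sum((mag - prev_mag) ** 2))
    return {"zcr": zc, "centroid": centroid,
            "rolloff": rolloff, "flux": flux}, mag
```

In a full pipeline these per-frame values would be aggregated (e.g. mean and variance over frames) before being passed to the SVM or MLP.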
In [7], a combination of log frequency cepstral coefficients (LFCCs), Gaussian mixture models (GMMs) and a maximum likelihood criterion is employed to recognize various sound events for a cleaning robot. Experimental results demonstrate that the LFCC-based approach performs better than MFCC in low signal-to-noise ratio (SNR) environments. Human accuracy on similar classification tasks is also evaluated experimentally.
In [8], a feature extraction pipeline is proposed for analyzing audio scene signals. Features are computed from a histogram of gradients (HOG) of the constant-Q transform, followed by an appropriate pooling scheme. The performance of the proposed scheme is tested on several datasets, including Toy, East Anglia (EA) and the Litis Rouen dataset collected by the authors. In [9], the MP algorithm is used to extract useful Gabor atoms from the input audio stream. MP is applied over the whole duration of an acoustic event, and time-frequency features are constructed from the atoms in order to capture the temporal and spectral information of a sound event. Classification is then done using a random forest classifier. Deep neural network (DNN) based transfer learning is proposed in [12] for acoustic classification. First, the DNN is trained on a source-domain task that performs mid-level feature extraction. Then, the pre-trained model is re-used on the DCASE target task. In [13], the authors show that a dilated CNN architecture performs better environmental sound classification than a CNN with max pooling. The effect of the dilation rate and the number of layers on performance is also investigated. The work in [14] proposes a hierarchical approach to classify different sound events such as silence, non-silence, speech, non-speech, music and noise. In contrast to a classical one-step classification scheme, a different set of effective features is selected at each level. In [15], a hearing aid system is proposed for real-time recognition of various sounds. The system is based on generating an audio fingerprint, i.e., a brief summary of an audio file, which collects a number of features including spectrogram zero crossings (ZC), MFCCs, linear prediction coefficients (LPCs) and log area ratios (LAR). Recognition is done on self-collected sound samples using a K-nearest neighbors (KNN) classifier, and the system achieves a maximum accuracy of 99%. In [16], the authors propose an automatic emotion classification system for music sounds. The work utilizes several features of the sound wave, i.e., peak value, average height, the number of half wavelengths, average width and beats per minute. Finally, regression analysis is performed to recognize various emotions from the sound; the system achieves an average accuracy of 77%. In [17], a sound identification method for a mobile robot in home and office environments is proposed. A simple sound database called Pitch-Cluster-Maps (PCMs), based on a vector quantization technique, is constructed, and its codebook is generated from binarized frequency spectra. The works in [18,19] demonstrate that acoustic local ternary patterns (LTPs) outperform MFCCs for the fall detection problem. In the literature, various convolutional neural network (CNN) architectures have been used to classify soundtracks from a dataset of 70 million training videos (millions of hours of audio) with video-level labels [20]. Experiments are performed using fully connected DNNs, VGG [21], AlexNet [22], Inception [23] and ResNet [24].
The acoustic scene classification approach proposed in this work makes the following contributions.
An extended feature descriptor is proposed which takes advantage of a modified 1-D LTP in combination with MFCC.
A feature fusion methodology is adopted, which exploits the complementary strengths of both MFCC and modified 1-D LTP features to generate a serial vector.
To provide better insight, a set of classifiers is tested on two standard benchmark datasets, supporting researchers in selecting the best classifier for this application.
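To make the serial fusion idea concrete, the sketch below encodes a generic 1-D local ternary pattern histogram and concatenates it with a precomputed MFCC vector. It illustrates the fusion step only; the thresholding scheme and parameters are generic LTP conventions, not the exact modified 1-D LTP proposed in this work.

```python
import numpy as np

def ltp_1d_histogram(signal, radius=1, threshold=0.05):
    """1-D local ternary patterns: compare each sample with its left
    and right neighbours; differences within +/-threshold map to 0,
    above to +1, below to -1.  Following the usual LTP split, the
    upper (+1) and lower (-1) patterns are encoded separately and
    pooled into one normalized histogram."""
    x = np.asarray(signal, dtype=float)
    center = x[radius:-radius]
    upper = np.zeros(len(center), dtype=int)
    lower = np.zeros(len(center), dtype=int)
    for bit, off in enumerate((-radius, radius)):
        d = x[radius + off:len(x) - radius + off] - center
        upper |= (d > threshold).astype(int) << bit
        lower |= (d < -threshold).astype(int) << bit
    h = np.concatenate([np.bincount(upper, minlength=4),
                        np.bincount(lower, minlength=4)]).astype(float)
    return h / h.sum()

def fuse_features(ltp_hist, mfcc_vec):
    """Serial fusion: concatenate the LTP histogram with a
    (precomputed) MFCC vector into a single descriptor."""
    return np.concatenate([ltp_hist, np.asarray(mfcc_vec, dtype=float)])
```

The fused serial vector is what would be passed to the classifiers compared in Section 4.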
The rest of the paper is organized as follows. In Section 2, the proposed method of acoustic scene classification is discussed. Section 3 describes the experimental setup and datasets. The performance results are presented and discussed in Section 4 and, finally, Section 5 concludes the paper.
4. Results and Discussion
The accuracy trend for both datasets is shown in Figure 6. Table 3 presents the overall classification accuracy of the proposed and existing methods, along with their computational times in seconds. It can be observed that the proposed method (i.e., 1D-LTP + MFCC) achieves better accuracy with a computational time smaller than or comparable to the other approaches.
To gain further insight, a few other performance metrics are also investigated, including sensitivity, specificity and error rate. Moreover, for a fair comparison, two classifier families, i.e., SVM and KNN, are considered owing to their larger number of variants.
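These metrics follow directly from the confusion matrix; a small sketch (assuming rows hold true classes and columns hold predictions):

```python
import numpy as np

def per_class_metrics(cm):
    """Per-class sensitivity and specificity plus the overall error
    rate, from a square confusion matrix `cm`."""
    cm = np.asarray(cm, dtype=float)
    total = cm.sum()
    tp = np.diag(cm)
    fn = cm.sum(axis=1) - tp   # missed instances of each class
    fp = cm.sum(axis=0) - tp   # false alarms for each class
    tn = total - tp - fn - fp
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    error_rate = 1.0 - tp.sum() / total
    return sensitivity, specificity, error_rate
```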
Table 4 provides a comparison of seven classifiers on the DCASE dataset. The SVM with quadratic kernel (SVM-Q) shows the best results in terms of accuracy, specificity and error rate, while the SVM with cubic kernel (SVM-C) and the weighted KNN (KNN-W) show better sensitivity. Table 5 reports the performance results for the RWCP dataset. The SVM-Q classifier achieves the highest accuracy and the lowest error rate, while the best sensitivity and specificity values are achieved by the medium KNN (KNN-M) and SVM-C, respectively.
Classification results for individual classes of the DCASE dataset are shown by the confusion matrix in Figure 7. The figure shows that all classes except the city center class achieve high per-class accuracy. The confusion matrix of the proposed approach for the RWCP dataset is shown in Figure 8. Here, the phone class has a comparatively lower accuracy, whereas all the remaining classes perform well. The classification results of Figure 7 and Figure 8 confirm the accuracy and validity of the proposed feature classification technique. To demonstrate the robustness of our proposed method, confidence intervals on both datasets are also provided for two state-of-the-art classifiers.
Figure 9 shows the confidence intervals, i.e., the minimum, maximum and average classification accuracy of both classifiers. From these results, it is evident that SVM-Q can be selected as a standard classifier for this application.
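The min/mean/max summary reported in Figure 9 corresponds to a simple aggregation over evaluation folds; the sketch below adds a normal-approximation interval on the mean (the 95% level and the example per-fold accuracies are illustrative, not values from our experiments).

```python
import numpy as np

def accuracy_interval(fold_accuracies):
    """Summarize per-fold accuracies as (min, mean, max) plus a
    normal-approximation 95% confidence interval on the mean."""
    a = np.asarray(fold_accuracies, dtype=float)
    mean = a.mean()
    # Standard error of the mean with the sample standard deviation.
    half = 1.96 * a.std(ddof=1) / np.sqrt(len(a))
    return a.min(), mean, a.max(), (mean - half, mean + half)
```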