Acoustic Scene Classification Using Efficient Summary Statistics and Multiple Spectro-Temporal Descriptor Fusion

Abstract: This paper presents a novel approach to acoustic scene classification based on efficient acoustic feature extraction using the fusion of spectro-temporal descriptors. Grounded in the neuroscience finding that the "auditory system summarizes the temporal details of sounds using time-averaged statistics to understand acoustic scenes", we devise an efficient computational framework for sound scene classification that fuses multiple time-frequency descriptors with discriminant information enhancement. To characterize the rich information in sound, i.e., local structures on the time-frequency plane, we adopt 2-dimensional local descriptors. A more critical issue is how to logically 'summarize' those local details into a compact feature vector for scene classification. Although 'time-averaged statistics' are suggested by the psychological investigation, directly computing the time average of local acoustic features is not a sound approach, because the arithmetic mean is vulnerable to extreme values, which are expected to be generated by interfering sounds irrelevant to the scene category. To tackle this problem, we develop a time-frame weighting approach that enhances sound textures and suppresses scene-irrelevant events. Robust acoustic features for scene classification can then be characterized efficiently. The proposed method was validated on the Rouen dataset, which consists of 19 acoustic scene categories with 3026 real samples. Extensive results demonstrate the effectiveness of the proposed scheme.


Introduction
Environmental sounds, which are an integral part of multimedia data, carry a wealth of information, such as location and activities. To utilize this audio information efficiently for indexing massive multimedia content, considerable research effort has been devoted in recent years to developing acoustic scene classification (ASC) systems using advanced signal processing and machine learning techniques [1][2][3]. Although some progress has been made, the key issues in acoustic scene understanding, i.e., the development of acoustic feature representations and of an efficient framework, remain open questions in the field.
Psychological research reveals that the "auditory system summarizes the temporal details of sounds using time-averaged statistics to understand acoustic scenes" [4,5]. Inspired by this result, many works adopt descriptive statistics to characterize textures in an acoustic scene for content-based classification [2,6]. Standard approaches to ASC first convert the input audio signal to a time-frequency representation (TFR) using either handcrafted features, such as the Mel-scale spectrogram and Mel-frequency cepstral coefficients (MFCCs), or learning-based features, e.g., unsupervised feature learning with neural networks. Statistical moments, e.g., mean, variance, skewness and kurtosis, are then employed to convert the TFR (a matrix) to a compact feature vector [7][8][9] for statistical classification using supervised learning. Figure 1 shows a general flowchart of ASC. Although descriptive statistics are off-the-shelf tools that can be used for acoustic scene understanding, the "summarization" performed by the auditory system is essentially different from arithmetic averaging. Time averaging implies that every frame of acoustic features carries similar information for ASC. In contrast, the auditory system adopts an adaptive scheme of "keeping the environment-specific acoustic information, while weeding out the irrelevant detail", according to neuroscience research [10]. In other words, the auditory system has a content-enhancing function that directs more attention to discriminative patterns in acoustic scenes, which is well suited to acoustic scene parsing.
This paper attempts to mimic the discriminant-content-enhancing function of auditory perception for ASC by using data-driven statistical machine learning. We begin with the primary mathematical formulation: we assume that an ambient sound $s(t)$ can be decomposed into two components,

$$s(t) = s_r(t) + s_i(t), \qquad (1)$$

where, in ASC tasks, $s_r(t)$ and $s_i(t)$ denote the scene-relevant and scene-irrelevant sounds, respectively. The first component usually exhibits high temporal homogeneity and thus delivers the predominant discriminative information of an acoustic scene. Furthermore, it can be described as a superposition of many similar scene-relevant acoustic events over background textures [4], which have been extensively investigated in psychological research on human auditory perception [7,11]. The second component, $s_i(t)$, denotes scene-irrelevant sound, which hardly contributes to ASC. For instance, speech can occur in different acoustic scenes, such as a street, a shop, or a cafe. It is noteworthy that $s_i(t)$ usually presents complex spectro-temporal patterns and stronger energy, and can therefore severely affect ASC performance. These outliers with respect to the current acoustic scene category should thus be handled carefully at the feature extraction stage. By examining the characteristics of the two components, we found that $s_r(t)$ commonly presents stationary statistical properties over time; in contrast, scene-irrelevant sounds, which exhibit complex spectro-temporal patterns over short periods, are discretely superimposed. In this study, we exploit this structural difference to discern $s_i$ from $s_r$. The main contributions of this work are as follows.
• Recent ASC research has shown that 2-dimensional (2D) local descriptors, such as local binary patterns (LBP) [12] and histograms of oriented gradients (HOG) [13], are efficient for describing environmental sounds. We perform intensive tests to evaluate various local descriptors for ASC. Furthermore, we propose a framework to aggregate multiple 2D descriptors for ASC.
• To enhance scene-specific sound patterns, we conduct novelty detection over the audio clip. Both sound textures and superimposed scene-relevant events reside in a principal subspace of the acoustic features owing to their high temporal homogeneity. In contrast, scene-irrelevant sounds generate distinct deviations from the subspace. Based on this analysis, a series of weights can be derived that indicate the importance of each frame for representing the scene.
• To summarize local acoustic patterns efficiently, we employ a weighted averaging scheme that converts the spectro-temporal distribution (a matrix) to a compact vector. A multi-feature aggregation scheme is further applied to fuse the discriminant information conveyed by the local descriptors.
According to validation studies on real data, the proposed approach achieved superior performance compared with other recent results.

Related Works
In computational auditory scene analysis, standard approaches are primarily composed of two steps: feature extraction and statistical classification. The first step tackles the issue of developing an efficient feature representation of an audio clip for ASC. A wide variety of acoustic features have been investigated, including the log-scale spectrogram [1], cepstral features (MFCCs) [6] and gammatone cepstral coefficients (GTCC) [2]. These features are mainly taken from the automatic speech recognition (ASR) field and have proved effective in characterizing rich low-level information from the audio signal through a time-frequency (matrix) representation (TFR). Subsequently, statistical moments, e.g., means and standard deviations, are employed to convert the time-frequency representation (a matrix) to a compact feature vector [2]. In addition, various types of summary statistics have been adopted to characterize discriminant information in both the time and frequency domains, such as the zero-crossing rate (ZCR) and spectral kurtosis [3]. In recent years, much attention has been paid to applying advanced feature extraction methods to ASC, such as bag-of-features [13] and sparse coding models [1], to distill discriminant information from noisy recordings. For the second step, typical statistical classification algorithms are Gaussian mixture models (GMM) and support vector machines (SVM) [14]. More recently, inspired by various successful applications in both computer vision and speech recognition, there is an emerging trend in ASC research to shift from conventional classification techniques to deep neural network-based methods. This tendency is evident in the latest DCASE challenge series (the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events) [15]. Many researchers have introduced Convolutional Neural Networks (CNNs) to derive an informative and robust audio data representation for ASC in a data-driven manner, such as in [16,17]. Specifically, the CNN-based approach is able to jointly optimize the acoustic feature representation and the classification algorithm. However, the current open datasets for ASC research are much smaller than those for computer vision, which may leave such models insufficiently trained. In addition, ensembles of several ASC architectures have proved effective in boosting ASC accuracy. For instance, in the 2017 DCASE evaluation, the well-ranked submissions were based on CNNs aggregated with other deep neural network models, i.e., Recurrent Neural Networks (RNN) and Multilayer Perceptrons (MLP) [18]. The present work builds on this survey of state-of-the-art research.
In this study, we propose a novel ASC approach that consists of three major components: spectro-temporal feature extraction using multiple time-frequency representations (TFRs), extraction of discriminant-content-enhancing weights, and a multi-descriptor aggregation scheme for ASC. Figure 2 shows the processing flow of the proposed ASC system. We introduce the key components as follows.

Time-Frequency Representation (TFR)
Audio waveforms are commonly transformed to time-frequency representations (TFRs), in which temporal and spectral information can be characterized simultaneously. In this study, we evaluated several TFRs, including the Mel-spectrogram, MFCCs [2] and the constant-Q spectrogram [19]. We denote TFRs as $S(f, t)$ in Figure 2, where $t$ is the frame index.

Spectro-Temporal Descriptors
Based on the TFRs, we further employ local descriptors to characterize spectro-temporal patterns in a 2D fashion. Several local descriptors are evaluated: Higher-order Local Auto-Correlation (HLAC), Local Binary Patterns (LBP) and Histogram of Oriented Gradients (HOG). Using these descriptors, we convert raw TFRs to mid-level, local structure-based representations that facilitate ASC. In Figure 2, the feature extracted by a local descriptor is denoted by $D(n, k)$, where $n$ is the time index and $k$ is the feature dimension.

Higher-order Local Auto-Correlation (HLAC)
Higher-order Local Auto-Correlation (HLAC) features are conventional local descriptors for extracting patterns in a 2D patch [20]. They have been successfully applied to a wide variety of real applications, including texture and face classification. HLAC features are built on the higher-order autocorrelation function

$$x(a_1, \ldots, a_N) = \int S(r)\, S(r + a_1) \cdots S(r + a_N)\, dr.$$

The mask patterns of HLAC are shown in Figure 3. For audio, $S(r)$ denotes the TFR, $r = [t_r, f_r]$ is a reference point on the time-frequency plane, and $(a_1 = [t_{a1}, f_{a1}],\ a_2 = [t_{a2}, f_{a2}])$ is a set of displacements. HLAC extraction is limited to a 3 × 3 local region, which yields 35 individual mask patterns. We introduce a sliding window on the TFR covering 3 consecutive spectral frames, from which HLAC features are extracted. The window shifts one frame at a time. Since acoustic features are assumed to be highly correlated within a local region, more discriminative features can be obtained by computing HLAC.
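The autocorrelation sums behind HLAC can be illustrated with a minimal sketch. It assumes the TFR window is a NumPy array indexed as `S[f, t]`, sums only over interior reference points so that every displaced point stays inside the window, and lists three example masks rather than the full set of 35. The function name `hlac_feature` and the `EXAMPLE_MASKS` list are hypothetical and not taken from the paper.

```python
import numpy as np

def hlac_feature(S, displacements):
    """Sum over interior points r of S(r) * prod_i S(r + a_i).

    S is a 2D array indexed [frequency, time]; each displacement is a
    (df, dt) pair with entries in {-1, 0, 1}, i.e. within a 3 x 3 region.
    """
    F, T = S.shape
    prod = S[1:F - 1, 1:T - 1].copy()          # S(r) at interior points
    for df, dt in displacements:
        # Shifted view of the same size gives S(r + a_i).
        prod = prod * S[1 + df:F - 1 + df, 1 + dt:T - 1 + dt]
    return float(prod.sum())

# Three illustrative masks (order 0, 1 and 2); the full HLAC set has 35.
EXAMPLE_MASKS = [
    [],                    # order 0: S(r)
    [(0, 1)],              # order 1: S(r) S(r + a1), next frame
    [(0, -1), (0, 1)],     # order 2: previous and next frame
]

def hlac_vector(S, masks=EXAMPLE_MASKS):
    """Stack one feature per mask for a single 3-frame window."""
    return np.array([hlac_feature(S, m) for m in masks])
```

Calling `hlac_vector` on each 3-frame window, shifted one frame at a time, gives one feature vector per time index.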

Local Binary Patterns (LBP)
Local binary patterns (LBP) are effective local descriptors that have been applied to texture classification [21] and sound classification [12], among others. LBP converts local structures into binary patterns by comparing neighbouring values with the central pixel, as illustrated in Figure 4. In this study, we adopt LBP as a spectro-temporal feature extractor. The general formulation for LBP can be written as: where

Histogram of Oriented Gradients (HOG)
Histogram of Oriented Gradients (HOG) is one of the most important 2D local descriptors in image processing; it counts occurrences of gradient orientations in a localized patch. It has also been successfully applied to sound processing [13]. In a similar vein, we introduce the HOG descriptor to characterize spectro-temporal structures in acoustic scenes through cumulative oriented gradients over local regions of the TFR. Figure 5 shows the primary mechanism of HOG features, which capture rich spectro-temporal dynamics on the time-frequency plane. The extracted HOG acoustic features are well suited to characterizing the wide variations in environmental sound.
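A minimal sketch of the orientation histogram for one local TFR region follows. It assumes that the 8 × 8 region and 8 orientations reported in the experimental settings refer to HOG, that orientations are signed (0 to 2π), and that each histogram is L2-normalized. These choices and the name `hog_cell` are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def hog_cell(patch, n_orient=8):
    """Magnitude-weighted orientation histogram of one local TFR region.

    patch is a 2D array indexed [frequency, time].
    """
    grad_f = np.gradient(patch, axis=0)        # gradient along frequency
    grad_t = np.gradient(patch, axis=1)        # gradient along time
    magnitude = np.hypot(grad_t, grad_f)
    # Signed orientation in [0, 2*pi); unsigned would be an alternative.
    angle = np.mod(np.arctan2(grad_f, grad_t), 2.0 * np.pi)
    bins = (angle / (2.0 * np.pi) * n_orient).astype(int)
    bins = np.minimum(bins, n_orient - 1)      # guard the upper edge
    hist = np.bincount(bins.ravel(), weights=magnitude.ravel(),
                       minlength=n_orient)
    return hist / (np.linalg.norm(hist) + 1e-12)
```

For example, a region whose energy rises linearly along time puts all of its gradient mass in the first orientation bin.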

Figure 5. HOG coding scheme on spectro-temporal local regions (panels: spectro-temporal gradients; local structure).

Acoustic Summary Statistics Extraction for ASC
Although spectro-temporal descriptors can extract rich details of the acoustic signal, a more critical issue arises in summarizing those local patterns into a compact feature vector for classification. In this study, we devise an efficient approach to extract robust summary statistics from audio scene data. The algorithmic flow is presented in Algorithm 1, and the details are described as follows.

Unsupervised Novelty Analysis of Acoustic Scene
Our goal is to enhance scene-relevant acoustic patterns and to suppress scene-irrelevant events during summary statistics extraction. To this end, it is necessary to discern the two components in environmental sound clips. Since, according to (1), the two categories of sound are linearly combined, we adopt a (linear) subspace method to detect events in an acoustic scene in an unsupervised manner. Sound textures, which are composed of the superposition of many highly similar acoustic events, will reside in the principal acoustic subspace; in contrast, scene-irrelevant events are expected to lie at a distinct distance from the subspace. The procedure is based on principal component analysis (PCA) [22]. We start by computing the correlation matrix of the input features $d_n$, $n \in [1, N]$, where $d_n$ is an acoustic feature vector extracted from one audio sample (i.e., a 30 s clip); eigendecomposition is then performed. Let $[v_1, \ldots, v_K]$ denote the subspace accommodating the predominant textures, which is spanned by the $K$ eigenvectors with the highest eigenvalues; $K$ is determined by the contribution rate $\eta_K$. The deviation distance $h$ of each feature vector from the subspace is then computed. By examining the residual $h$, we are able to detect outlier events in the acoustic scene. Sound textures, owing to their high temporal homogeneity, will generate $h$ that follows a normal distribution. In contrast, scene-irrelevant events superimposed on the textures introduce a long tail into the histogram of $h$, so that $h$ can no longer be well described by a Gaussian. Based on this property, we introduce the Gaussianity measures of kurtosis and skewness to discern acoustic events, denoted $v_{kurt}(h)$ and $v_{skew}(h)$, respectively. Thresholds on the two measures, $\varepsilon_{kurt}$ and $\varepsilon_{skew}$, are set experimentally to detect scene-irrelevant events. If no events are detected by thresholding, uniform weights are applied, since the clip consists mostly of homogeneous textures. Otherwise, we develop weights to suppress scene-irrelevant events as follows.
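The novelty analysis above can be sketched as follows. Several forms are assumptions made for illustration and may differ from the authors' formulation: the correlation matrix is taken as the uncentred average of outer products, the contribution rate as the cumulative eigenvalue ratio, the deviation distance as the Euclidean norm of the projection residual, kurtosis as the non-excess fourth standardized moment, and an event is flagged when either threshold is exceeded. The function names are hypothetical.

```python
import numpy as np

def subspace_residual(D, eta=0.99):
    """Return (h, V): per-frame distance to the principal subspace and its basis.

    D has shape (N, dim), one feature vector d_n per row.
    """
    R = D.T @ D / D.shape[0]                   # correlation matrix (assumed uncentred)
    eigvals, eigvecs = np.linalg.eigh(R)       # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    ratio = np.cumsum(eigvals) / np.sum(eigvals)
    # Smallest K whose cumulative contribution rate reaches eta.
    K = min(int(np.searchsorted(ratio, eta)) + 1, D.shape[1])
    V = eigvecs[:, :K]
    residual = D - D @ V @ V.T                 # component outside the subspace
    return np.linalg.norm(residual, axis=1), V

def has_irrelevant_events(h, eps_kurt=7.0, eps_skew=1.0):
    """Flag a clip whose residual h is too heavy-tailed to be Gaussian."""
    z = (h - h.mean()) / (h.std() + 1e-12)
    skew = np.mean(z ** 3)
    kurt = np.mean(z ** 4)                     # equals 3 for a Gaussian
    return bool(kurt > eps_kurt or skew > eps_skew)
```

The default values 0.99, 7 and 1 are those reported in the experimental settings for $\eta_K$, $\varepsilon_{kurt}$ and $\varepsilon_{skew}$.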

Textures-Enhancing Weights Generation
To enhance sound textures for ASC, we develop weights for feature frames based on their membership probabilities with respect to the acoustic scene. The Laplace model, owing to its robustness to outliers, is introduced to derive such probabilities [22]. By fitting $h$ to the model, $\mu_h$, $b_h$ and the likelihood $p(h) = \frac{1}{2 b_h} \exp\left(-\frac{|h - \mu_h|}{b_h}\right)$ can be estimated. The Laplace model assigns much less probability density to events that are unrelated to the scene category. To further enhance sound textures, we introduce soft thresholding with threshold $\tau = \mu_h + \sqrt{2}\, b_h$. This setting ensures that textures obtain the highest weight (1, from the probability aspect), while scene-irrelevant events are suppressed by smaller weights. Figure 6 presents an example of texture-enhancing weight generation using the proposed method.
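The weight generation can be sketched as follows, under stated assumptions. The Laplace parameters use the standard maximum-likelihood estimates (median for $\mu_h$, mean absolute deviation from the median for $b_h$). The threshold is taken as $\tau = \mu_h + \sqrt{2}\, b_h$, where the leading term is assumed to be $\mu_h$. The soft-threshold form itself, weight 1 up to $\tau$ followed by Laplace decay, is a hypothetical choice that matches the description of textures receiving weight 1; the name `texture_weights` is also hypothetical.

```python
import numpy as np

def texture_weights(h):
    """Per-frame weights from the residual h (1D array)."""
    mu = np.median(h)                          # ML estimate of Laplace location
    b = np.mean(np.abs(h - mu)) + 1e-12        # ML estimate of Laplace scale
    tau = mu + np.sqrt(2.0) * b                # threshold (leading mu is assumed)
    # Assumed soft threshold: full weight below tau, Laplace decay above it.
    decay = np.exp(-np.maximum(h - tau, 0.0) / b)
    return np.where(h <= tau, 1.0, decay)
```

With this form, frames whose residual lies within roughly one standard deviation of the Laplace fit keep full weight, and the weight falls smoothly for frames farther from the subspace.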

Summary Statistics Computation
Based on the weights $w(n)$ and the feature representation $D(n, k)$, we derive the summary statistics as the weighted average of the frame-wise feature vectors, where $d_n$ is the $n$-th feature vector in $D \in \mathbb{R}^{N \times K}$ and $x$ is the extracted feature vector.
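A minimal sketch of this step is given below. It assumes the weights are normalized to sum to one before averaging, so that uniform weights reduce to plain time averaging; the normalization and the name `summary_statistics` are assumptions for illustration.

```python
import numpy as np

def summary_statistics(D, w=None):
    """Weighted average over frames: x = sum_n w(n) d_n.

    D has shape (N, K); w has shape (N,). With w=None the weights are
    uniform (1/N), i.e. simple time averaging.
    """
    D = np.asarray(D, dtype=float)
    if w is None:
        w = np.full(D.shape[0], 1.0 / D.shape[0])
    w = np.asarray(w, dtype=float)
    w = w / w.sum()                            # normalization is an assumption
    return w @ D
```

A frame with weight 0 has no influence on $x$, which is how a loud scene-irrelevant event is kept from dominating the clip-level feature.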

Class Score Fusion for Classification
As shown in Figure 2, we fuse the discriminant information characterized by multiple spectro-temporal descriptors for ASC. To this end, we estimate the class membership probabilities of an input sound clip from each acoustic feature with a probabilistic SVM, which gives a probabilistic interpretation of the distance between the input data and the classification hyperplane in the (kernel) feature space. For the HLAC stream, the SVM is trained on the pairs $\{x_{HLAC,m}, y_m\}$, i.e., the HLAC feature vector extracted from the $m$-th training clip and the corresponding label. The parameters $w_{svm}$ and $b_{svm}$ are determined by quadratic programming, and logistic regression is then performed to compute $A$ and $B$. Finally, we derive the class score $l_{HLAC,m}$. In the same vein, the class probabilities $l_{LBP,m}$ and $l_{HOG,m}$ are computed from the features $\{x_{LBP,m}, y_m\}$ and $\{x_{HOG,m}, y_m\}$. We then employ linear programming to calibrate the multi-stream class scores through a weighted combination, in which the fusion weights $\alpha_i$ are tuned explicitly at the training/validation stage. The estimated score fusion is expected to achieve higher accuracy than simple majority voting.
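The fusion step can be sketched as a weighted sum of per-descriptor class probabilities. In the sketch below, the linear programming used in the paper is replaced by a plain grid search over convex weights, purely as a stand-in. The function names, the grid step, and the constraint that the weights sum to one are assumptions.

```python
import numpy as np
from itertools import product

def fuse_scores(scores, alphas):
    """Weighted sum of class-probability arrays, each of shape (M, C)."""
    return sum(a * s for a, s in zip(alphas, scores))

def tune_alphas(scores, labels, step=0.1):
    """Grid search for convex fusion weights on validation data.

    This is a stand-in for the linear programming step, not the paper's solver.
    """
    best_alphas, best_acc = None, -1.0
    grid = np.arange(0.0, 1.0 + 1e-9, step)
    for head in product(grid, repeat=len(scores) - 1):
        last = 1.0 - sum(head)
        if last < -1e-9:                       # weights must stay non-negative
            continue
        alphas = tuple(head) + (max(last, 0.0),)
        predicted = fuse_scores(scores, alphas).argmax(axis=1)
        acc = float(np.mean(predicted == labels))
        if acc > best_acc:
            best_alphas, best_acc = alphas, acc
    return best_alphas, best_acc
```

Here `scores` would hold the validation-set class probabilities from the HLAC, LBP and HOG streams, and the tuned weights are then applied unchanged to test clips.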

Dataset and Parameters
We validate the proposed scheme through extensive experiments on the LITIS Rouen dataset [14], which includes 19 classes of real acoustic scenes with 3026 clips of 30 s length. It is noteworthy that LITIS Rouen is an open dataset, which provides a standardized way to present and compare results. Figure 7 shows the distribution of audio samples among classes. Sounds are recorded at a 22.05 kHz sampling rate and 16-bit depth. Twenty splits are provided, each partitioning the data into 80% training and 20% test sets. In our experiments, we set the Fourier analysis window length to 30 ms with half overlap. Sixty Mel filters were applied to extract the Mel-spectrogram. To obtain the CQT spectrogram, the number of bins per octave was set to 48. The local region size and the number of orientations were set to 8 × 8 and 8, respectively. 2D smoothing was applied to the TFRs by convolving them with a Gaussian kernel whose parameter was set to 3. For acoustic subspace extraction, we set the contribution rate $\eta_K$ to 0.99. The event detection thresholds $\varepsilon_{kurt}$ and $\varepsilon_{skew}$ were set to 7 and 1, respectively. At the classification stage, the Gaussian kernel parameter was set to 0.1.
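As a worked example of what these settings imply at the frame level, the sketch below converts the window length and overlap into samples and counts the frames in one clip. The rounding (floor) and the absence of padding are assumptions, and the helper name `frame_layout` is hypothetical.

```python
SAMPLE_RATE_HZ = 22050      # 22.05 kHz recordings
WINDOW_MS = 30              # Fourier analysis window length
CLIP_SECONDS = 30           # each clip lasts 30 s

def frame_layout(sample_rate_hz=SAMPLE_RATE_HZ, window_ms=WINDOW_MS,
                 clip_seconds=CLIP_SECONDS):
    """Return (window, hop, n_frames) for one clip, in samples and frames."""
    window = sample_rate_hz * window_ms // 1000    # floor is an assumption
    hop = window // 2                              # half overlap
    n_samples = sample_rate_hz * clip_seconds
    n_frames = 1 + (n_samples - window) // hop     # no padding (assumption)
    return window, hop, n_frames
```

Under these assumptions a clip is described by 2003 frames of 661 samples each, which is the value of $N$ in the novelty analysis and in the weighted average.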

Evaluation of TFRs with Local Descriptors
We began the experiments by testing the acoustic features extracted by local descriptors over various time-frequency representations (TFRs). Specifically, we examined four kinds of TFRs, chosen for their popularity in ASC research: the Fourier spectrogram, the Mel-scaled spectrogram (Mel-spectrogram), the constant-Q transform spectrogram (CQT) and Mel-frequency cepstral coefficients (MFCCs). The two-dimensional local descriptors HLAC, LBP and HOG were then applied to further characterize spectro-temporal patterns on the time-frequency plane. Together, this gave 12 combinations of TFRs with local descriptors. Since this evaluation was dedicated to acoustic feature comparison, simple time averaging was applied to produce the feature vector for each audio clip, i.e., $w$ was set to $1/N$ in (9); an SVM classifier with a Gaussian kernel was used for multi-class classification. Table 1 summarizes all the results, from which we can see that the highest ASC precision was achieved by the HOG descriptor over the CQT spectrogram. The comparison also reveals that although the Mel-spectrogram and MFCCs are widely applied, they are not optimal when working with local descriptors, because much local detail is lost during feature extraction. In addition, Figure 8 presents the class-wise feature distribution of the LITIS Rouen dataset using the CQT spectrogram with HOG features. According to the feature space visualization, compact clusters are expected to yield higher ASC accuracy, while scattered clusters can be difficult to classify.

Evaluation of Acoustic Summary Statistics Extraction Scheme
We next validate the proposed summary statistics extraction experimentally. In the previous experiment, the TFR obtained by the constant-Q transform proved superior for the ASC task. In this test, we further investigate the feature weighting scheme for acoustic summary statistics extraction. Figure 9 presents the contribution of feature-enhancing weighting. The results indicate that the proposed feature weighting scheme is broadly applicable for generating efficient acoustic features for ASC. In addition, we establish a multiple spectro-temporal feature aggregation scheme to further improve ASC accuracy. Figure 10 presents the class-wise precision obtained by the proposed method. According to the results, our scheme produced high accuracies (over 98.5%) for the classes bus, car, kid game, restaurant and high-speed train. However, in several cases the current method failed to make a confident classification, such as for the busy street, quiet street, and shop classes. In Table 2, we compare our result with state-of-the-art performances reported in the latest publications. The proposed ASC scheme achieved superior accuracy compared with the other methods. Notably, our approach also outperformed a very recent work using the optimal fusion of multiple convolutional neural networks (CNNs) [24]. CNNs are undoubtedly a powerful learning method for general pattern classification tasks; however, the dataset in this study is small, which may leave CNN models insufficiently trained. This may be the main reason that our method outperforms CNNs. The comparison confirms the advantage of the proposed scheme for the ASC task.

Conclusions
This paper presented a novel scheme for acoustic scene classification based on robust summary statistics extraction and efficient class score fusion. To characterize spectro-temporal patterns in sound, we evaluated various time-frequency representations with efficient local descriptors. Motivated by the finding in psychological research that the auditory system keeps relevant information about the acoustic environment while weeding out irrelevant details, we developed a novel scheme to extract summary statistics for ASC that enhances discriminative scene-specific sound textures and suppresses scene-irrelevant events. Finally, we aggregated the multi-way discriminant information characterized by various local descriptors through optimal weighted averaging. The proposed method was validated on the Rouen dataset, and the experimental results demonstrated the superiority of the proposed approach.

Figure 1. General framework of an acoustic scene classification system. ASC can be performed by examining homogeneous structures of audio data.

Figure 2. Flowchart of the proposed acoustic scene classification approach.
is the gray-level pixel value at spatial position $r_j$, and $[[\cdot]]$ yields 1 only if the bracketed condition is met and 0 otherwise. $L_c = \{r_j\}_{j=1}^{J}$ indicates the local 2D structure surrounding $c \in \mathbb{R}^2$, comprising $J$ spatial positions $r_j$ close to $c$, and $\tau_c$ is the gray-level value of the central pixel. For ordinary LBP, $J$ is set to 8 and hence the local patch is limited to 3 × 3. In this study, we use the prototype version.
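The LBP coding described above can be sketched for the ordinary 3 × 3 case ($J = 8$). The sketch assumes that the bracketed condition is "neighbour value is at least the central value", that the eight neighbours are visited clockwise from the top-left corner, and that codes are pooled into a 256-bin histogram. These choices and the function names are illustrative assumptions.

```python
def lbp_code(patch):
    """8-bit LBP code of a 3 x 3 patch given as a list of three rows."""
    center = patch[1][1]
    # Neighbour order (clockwise from top-left) is an assumption.
    offsets = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    code = 0
    for j, (row, col) in enumerate(offsets):
        if patch[row][col] >= center:          # [[.]] is 1 when the condition holds
            code += 1 << j
    return code

def lbp_histogram(tfr):
    """256-bin histogram of LBP codes over all interior points of a 2D TFR."""
    hist = [0] * 256
    for i in range(1, len(tfr) - 1):
        for k in range(1, len(tfr[0]) - 1):
            patch = [row[k - 1:k + 2] for row in tfr[i - 1:i + 2]]
            hist[lbp_code(patch)] += 1
    return hist
```

Applying `lbp_histogram` to a short run of consecutive frames gives one candidate feature vector per time index.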

Figure 6. Example of acoustic texture-enhancing weight generation.

Figure 7. Number of scene recordings per category in the Rouen dataset.

Figure 9. Classification accuracy comparison with and without the feature-enhancing weighting scheme.

Figure 10. Classification accuracies for each scene category obtained by the proposed method.

Table 1 .
by using TFRs with local descriptors.