Empirical Mode Decomposition-Based Feature Extraction for Environmental Sound Classification

In environmental sound classification, log Mel band energies (MBEs) are considered the most successful and commonly used features for classification. The underlying algorithm, the fast Fourier transform (FFT), is valid only under certain restrictions. In this study, we address these limitations of the Fourier transform and propose a new method to extract log Mel band energies using amplitude modulation and frequency modulation. We present a comparative study between log Mel band energy features extracted traditionally by the Fourier transform and those extracted by our new approach. This approach estimates instantaneous frequency (IF) and instantaneous amplitude (IA), which are used to construct a spectrogram from which log Mel band energies are extracted. The estimation of IA and IF is made by combining empirical mode decomposition (EMD) with the Teager–Kaiser energy operator (TKEO) and the discrete energy separation algorithm. A Mel filter bank is then applied to the estimated spectrogram to generate EMD-TKEO-based MBEs, or simply EMD-MBEs. In addition, we employ the EMD method to remove signal trends from the original signal and generate another type of MBE, called S-MBEs, using the FFT and a Mel filter bank. Four datasets were utilised, and convolutional neural networks (CNNs) were trained using Fourier transform-based MBEs (FFT-MBEs), EMD-MBEs, and S-MBEs. In addition, CNNs were trained with an aggregation of all three feature types and with a combination of FFT-MBEs and EMD-MBEs. Individually, FFT-MBEs achieved higher accuracy than EMD-MBEs and S-MBEs. In general, the system trained with the combination of all three features performed slightly better than the systems trained with each feature separately.


Introduction
Environmental sound classification (ESC) has received significant attention over the past few years. Environmental sounds range from traffic, birds, and rain to sounds produced by human activity in houses, offices, cafes, and numerous other locations. In contrast to speech, environmental sounds are generated by a wide variety of sources, both human and non-human. In addition to vision, humans largely identify events and surroundings through the sounds present in the environment. The classification of such sounds provides a plethora of opportunities for improving human-machine interaction, thereby improving automation and security.
Neural networks have played a vital role in the growth of classification systems, including image classification systems and improved speech and environmental sound classification systems. Convolutional neural networks (CNNs) are at the forefront of this change, along with recurrent neural networks (RNNs) and long short-term memory (LSTM) networks, which are still used in many systems [1][2][3][4][5][6][7][8]. In image classification, the image itself is used as the feature. In speech and sound classification, an image-like representation of the sound is provided through the time-frequency-energy information of the signal, namely, a spectrogram. The Fourier transform is used extensively for spectrogram construction and feature extraction. It has dominated since its inception and, consequently, the terms 'spectrum' and 'spectrogram' have become synonymous with the Fourier transform of the data [9][10][11]. In classification systems, the Mel filter bank energies are extracted using a fast Fourier transform-based algorithm to generate Mel spectrograms. Whether these systems are trained from scratch using time-frequency representations of sounds [6,12,13] or transfer learning is used to adapt systems trained on images to sound classification [5,[14][15][16], they employ the Fourier transform for feature extraction. However, although the Fourier transform is valid under extremely general conditions, there are some crucial restrictions to performing Fourier spectral analysis [17,18]. The system must be linear and the data must be ergodic and stationary; failing to meet these criteria yields results with little physical sense. Sound is a time-varying signal whose frequency and energy change depending on the source generating the sound, which implies that the assumptions of stationarity and ergodicity may not be satisfied.
In addition, the Fourier spectrum establishes global uniform harmonic components, so additional components are needed to represent data that are non-stationary and globally non-uniform. Consequently, it spreads the energy over a wide frequency range. To analyse non-stationary data in the time domain, numerous Fourier components are required, diverting energy to a much wider frequency scale. Furthermore, Fourier spectral analysis uses a priori-defined basis functions that require additional harmonic components to analyse deformed wave profiles. Features based on the short-time Fourier transform (STFT), built on the fast Fourier transform algorithm introduced by Cooley and Tukey in 1965 [19], are predominantly used for extracting frequency-domain features [1,[20][21][22][23]. The wavelet transform, a windowed transform whose window is scaled in both time and frequency [25], addresses the limitations of the STFT. Wavelet analysis [24] thus provides a means of analysing non-stationary data. However, the wavelet transform still requires an a priori-defined basis in the form of the wavelet function, which makes wavelet analysis non-adaptive. The most commonly used Morlet wavelet is Fourier-based and suffers from the same shortcomings as Fourier analysis [18,26,27].
Due to the ubiquitous usage of Fourier spectral analysis, the notions of instantaneous frequency (IF) and instantaneous amplitude (IA) are relatively less accepted [18]. Traditionally, frequency is defined with sine and cosine basis functions spanning the whole data length with constant amplitude. Under this definition, instantaneous frequency would also have to be defined on a sine or cosine basis, making at least one complete oscillation compulsory. Such a definition makes no sense for a non-stationary signal that changes over time. In real life, most systems are non-linear and generate non-stationary data [28,29]. To cater to the non-linear and non-stationary nature of signals, Huang [18] introduced empirical mode decomposition (EMD), a novel, adaptive, and highly efficient method for decomposing temporal signals and analysing non-stationary and non-linear time series. This method decomposes the signal into a finite number of oscillatory units called intrinsic mode functions (IMFs). These modes are extracted from the local characteristics of the time series and are zero-mean, symmetric AM-FM components. The decomposition is highly adaptive and is based on the direct extraction of energy associated with local time scales. Using the Teager-Kaiser energy operator (TKEO), we can extract instantaneous frequency and amplitude from the IMFs, allowing us to locate any event on both the time and frequency scales. The IMFs serve as the basis in this case and are calculated for every signal rather than being defined a priori. EMD combined with the TKEO thus provides an estimate of instantaneous amplitude (IA) and instantaneous frequency (IF) for any non-stationary signal without a predefined basis function; the basis is generated dynamically for each signal.
The advantage of this method is that, unlike other techniques for generating time-frequency features, it uses instantaneous amplitude and frequency analysis and does not rely on predefined basis functions; estimates are produced sample by sample rather than requiring a minimum number of samples for analysis. In addition, this preliminary work opens another path for future development and applications of IA and IF in other domains where time-frequency analysis is required. The EMD method has been used in speech recognition systems [30][31][32] and human emotion recognition systems [33]. EMD has also been used, in conjunction with the FFT, to classify respiratory sounds [34]: EMD extracts the IMFs, the best IMF is selected based on an entropy parameter, and gammatone cepstral coefficients (GTCC) are extracted using the Fourier transform and the discrete cosine transform. However, in that approach, the underlying issues of the Fourier transform persist.
In an ESC system, we are interested in the feature extraction stage, since the system relies heavily on the type of features used to learn sound events. In this paper, we introduce the use of empirical mode decomposition along with the Teager-Kaiser energy tracking operator to estimate instantaneous frequency and amplitude, which are used to construct spectrogram features for classification with neural networks. We apply the most commonly used Mel filter banks to this spectrogram, yielding a novel Mel filter bank feature based on a spectrogram generated from IA and IF. The EMD method decomposes the signal into several mono-component IMFs; to each IMF, the TKEO and DESA methods are applied to obtain the IA and IF information of the signal. We call the Mel spectrogram obtained through EMD and TKEO the empirical mode decomposition Mel filter bank energies (EMD-MBEs). We also introduce S-MBEs, in which the signal trend is removed from the signal using the EMD method. We compared our proposed features with fast Fourier transform-based Mel filter bank energies (FFT-MBEs) on four ESC datasets. We propose an aggregation of all three features, which improves accuracy over traditional FFT-based log Mel filter bank energies.
The rest of the paper is organised as follows. Section 2 explains the EMD and TKEO methods in detail. Section 3 demonstrates the feature extraction process using the FFT and EMD-TKEO methods. Section 4 describes the experimental setup, and Section 5 discusses the results obtained. The final section presents the conclusion.

Empirical Mode Decomposition-Teager-Kaiser Energy Operator (EMD-TKEO) Method
We use the empirical mode decomposition method to decompose the environmental sound into its intrinsic mode functions (IMFs), as described in the next section. The combination of EMD and the TKEO is used to estimate instantaneous amplitude (IA) and instantaneous frequency (IF). Afterwards, a Mel filter bank is applied to obtain Mel filter bank energies (MBEs). These features are then used to train machine learning algorithms. The proposed system is depicted in Figure 1. In this paper, we extend the work of P. Maragos [35,36] from speech and underwater acoustic signals to extracting AM-FM modulation information from environmental sound signals. In [35], the authors defined a real-valued signal with a combined AM-FM structure as

x(t) = a(t) cos(φ(t)),

where a(t) is the time-varying amplitude and φ(t) is the phase. This expression can be utilised to formulate a signal as [37]

x(t) = Re[ Σ_{i=1}^{N} a_i(t) e^{jφ_i(t)} ] + rest(t), with f_i(t) = (1/2π) dφ_i(t)/dt,

where rest(t) is the last component, containing very low-frequency information that can be neglected, Re represents the real part, φ_i(t) is the phase, and a_i(t) and f_i(t) are the instantaneous amplitude and instantaneous frequency, respectively, of the ith IMF.

Empirical Mode Decomposition
EMD is a method of decomposing a non-stationary signal into a collection of mono-component AM-FM signals. These mono-component signals are referred to as intrinsic mode functions (IMFs). The IMFs are extracted through an envelope-subtraction process, and a linear combination of all the extracted IMFs reconstructs the original signal. The signal is decomposed in the time domain, hence preserving its time-varying frequency and amplitude. In contrast to the Fourier transform, EMD does not require an a priori-defined basis function: the Fourier transform uses harmonic components of the signal, whereas EMD is based on the oscillations present in the signal. The oscillatory decomposition is defined by the sifting process. The signal is examined for local maxima and minima, from which the upper and lower envelopes are determined via cubic splines. The mean envelope, representing the trend of the signal, is generated from the upper and lower envelopes and subtracted from the original signal to create an IMF candidate. Before counting this candidate as an actual IMF, a test is conducted: the number of zero crossings and the number of extrema must differ by no more than one. If the candidate satisfies the criteria, it is counted as an IMF; otherwise, the sifting process is applied again until the conditions are met. The IMF obtained is then stored and subtracted from the original signal to start a new sifting process for the next IMF. The method is repeated until the residual is monotonic or contains no more than one extremum [18].

Sifting Process for IMFs
In simplest terms, the EMD method can be viewed as a filter that sifts through the signal and breaks it down into mono-component signals, defined above as IMFs. A function is an intrinsic mode function when it satisfies the following criteria:
1. The number of extrema (maxima and minima) and the number of zero crossings must be equal or differ at most by one;
2. The mean of the envelopes obtained through the local maxima and local minima must be equal to zero at all times.
The IMFs are obtained through a process known as the sifting process, described in Algorithm 1 [33]:

Algorithm 1: Sifting process for intrinsic mode functions
Input: a sound event signal x(t)
Output: collection of IMFs
1. Compute all local extrema of the signal x(t): local maxima and local minima;
2. Construct the upper envelope E_u(t) and lower envelope E_l(t) by joining the local maxima and local minima, respectively, with a cubic spline;
3. Calculate the mean of the envelopes: m(t) = (E_u(t) + E_l(t)) / 2;
4. Subtract the mean from the signal to obtain a new data sequence from which the low frequency is removed: r(t) = x(t) − m(t);
5. Repeat steps 1-4 until r(t) is an IMF (satisfying the two conditions above);
6. Subtract this IMF r(t) from the original signal x(t): res(t) = x(t) − r(t);
7. Repeat steps 1-6 on res(t) until the residual no longer meets the above-mentioned conditions of an IMF, yielding all IMFs.

The number of IMFs extracted from a particular signal depends on two factors.

1. The process terminates naturally when the residual res(t) is either a monotonic function or a function with only one extremum.
2. Alternatively, the number of IMFs is subject to a stopping criterion, where the user terminates the sifting process after a particular number of IMFs have been created.
In the first case, the output of the EMD sifting process delivers N IMFs r_1(t), r_2(t), ..., r_N(t) along with the residual signal res(t) of the original signal x(t), which can then be presented as a linear combination of all the IMFs and the residual:

x(t) = Σ_{i=1}^{N} r_i(t) + res(t).
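The sifting procedure above can be sketched in a few dozen lines of Python. The following is a minimal, illustrative implementation, not the exact code used in this study; the maximum numbers of sifting iterations and IMFs (`max_sift`, `max_imfs`) are assumed hyper-parameters.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def local_extrema(x):
    """Indices of strict local maxima and minima of a 1-D signal."""
    d = np.diff(x)
    left = np.concatenate(([0.0], d))    # d[i-1] at index i
    right = np.concatenate((d, [0.0]))   # d[i]   at index i
    maxima = np.where((left > 0) & (right < 0))[0]
    minima = np.where((left < 0) & (right > 0))[0]
    return maxima, minima

def is_imf(x):
    """IMF test: extrema count and zero-crossing count differ by at most one."""
    maxima, minima = local_extrema(x)
    n_ext = len(maxima) + len(minima)
    n_zc = int(np.sum(np.abs(np.diff(np.signbit(x).astype(int)))))
    return abs(n_ext - n_zc) <= 1

def sift(x, t, max_sift=50):
    """Repeatedly subtract the mean envelope until an IMF emerges."""
    r = x.copy()
    for _ in range(max_sift):
        maxima, minima = local_extrema(r)
        if len(maxima) < 2 or len(minima) < 2:
            break                                     # cannot build envelopes
        upper = CubicSpline(t[maxima], r[maxima])(t)  # upper envelope
        lower = CubicSpline(t[minima], r[minima])(t)  # lower envelope
        r = r - (upper + lower) / 2.0                 # remove mean envelope
        if is_imf(r):
            break
    return r

def emd(x, max_imfs=8):
    """Decompose x into IMFs plus a residual, so that x = sum(imfs) + res."""
    t = np.arange(len(x), dtype=float)
    res = np.asarray(x, dtype=float).copy()
    imfs = []
    for _ in range(max_imfs):
        maxima, minima = local_extrema(res)
        if len(maxima) + len(minima) < 3:  # residual near-monotonic: stop
            break
        imf = sift(res, t)
        imfs.append(imf)
        res = res - imf
    return imfs, res
```

By construction, summing the returned IMFs and the residual reproduces the input exactly. Production EMD libraries add refined stopping criteria and boundary handling for the spline envelopes.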
With this method, the signal x(t) is decomposed empirically into a finite number of functions. The IMFs of an audio recording of a car passing are shown in Figure 2. Each IMF can be used separately to obtain the instantaneous frequency (IF) and instantaneous amplitude (IA) for sound event detection systems, as explained in the next section.
In the case of early stopping, the original signal cannot be reconstructed, as some information is discarded deliberately. However, in some cases, it could be used to remove low frequency components from the parent signal. In [38], the authors used the first five IMFs, on the basis that those IMFs gave an ample amount of information about energy and pitch in their study.

Teager-Kaiser Energy Operator (TKEO)
The energy separation algorithm (ESA) is applied to extract the IA and IF information from the signal, as standalone IMFs do not provide meaningful information about IA and IF. The EMD method takes a multi-component signal and provides mono-component IMFs. Introduced by J.F. Kaiser [39], the TKEO is an energy tracking operator that, used with an energy separation algorithm, computes the IA and IF features without the integrals required by the Hilbert and Fourier transforms; rather, it relies entirely on differentiation. This property gives the TKEO good localisation [40], making it natural to use for local estimation of the IA and IF functions. The TKEO is a non-linear operator that computes the energy of the signal as a product of the squares of the amplitude and frequency of the signal, given as in [41]:

Ψ[r_i(t)] = (ṙ_i(t))² − r_i(t) r̈_i(t),     (5)

where ṙ_i(t) and r̈_i(t) are the first- and second-order derivatives of r_i(t). For a discrete-time signal r_i(n), Equation (5) can be written as [42]:

Ψ[r_i(n)] = r_i²(n) − r_i(n − 1) r_i(n + 1).     (6)

The instantaneous features are extracted by applying the ESA in discrete form to the signals. The discrete energy separation algorithm (DESA) [41] provides the IA and IF. We used DESA-1 in this study, given as:

y(n) = x(n) − x(n − 1),
Ω(n) = arccos( 1 − (Ψ[y(n)] + Ψ[y(n + 1)]) / (4 Ψ[x(n)]) ),
|a(n)| = sqrt( Ψ[x(n)] / (1 − (1 − (Ψ[y(n)] + Ψ[y(n + 1)]) / (4 Ψ[x(n)]))²) ).

Here, x(n) is a mono-component signal and Ω(n) is the instantaneous digital frequency in radians per sample; the DESA-1 algorithm should be applied to mono-component signals only. In [36], the authors proposed the use of a low-pass filter to smooth the output of the energy tracking operator, having found that the operator introduced a high-frequency error component. To eliminate this issue, a seven-point linear binomial low-pass smoothing filter with impulse response (1, 6, 15, 20, 15, 6, 1) is applied after the TKEO, as shown in Figure 3 [33,36].
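As a concrete illustration, the discrete TKEO, the DESA-1 estimates, and the seven-point binomial smoother can be sketched as follows. The endpoint handling and the small regularisation constant `eps` are assumptions made for numerical robustness, not details taken from the paper.

```python
import numpy as np

def tkeo(x):
    """Discrete TKEO: psi[n] = x[n]^2 - x[n-1]*x[n+1] (endpoints copied)."""
    psi = np.empty_like(x)
    psi[1:-1] = x[1:-1] ** 2 - x[:-2] * x[2:]
    psi[0], psi[-1] = psi[1], psi[-2]
    return psi

def desa1(x, fs, eps=1e-12):
    """DESA-1 estimates of IA and IF (in Hz) for a mono-component signal x."""
    y = np.empty_like(x)
    y[1:] = x[1:] - x[:-1]                        # y[n] = x[n] - x[n-1]
    y[0] = y[1]
    px = tkeo(x)
    py = tkeo(y)
    py_next = np.concatenate((py[1:], py[-1:]))   # psi[y(n+1)]
    G = 1.0 - (py + py_next) / (4.0 * px + eps)
    G = np.clip(G, -1.0, 1.0)
    omega = np.arccos(G)                          # digital frequency, rad/sample
    ia = np.sqrt(np.abs(px) / (1.0 - G ** 2 + eps))   # instantaneous amplitude
    inst_f = omega * fs / (2.0 * np.pi)           # instantaneous frequency, Hz
    return ia, inst_f

def binomial_smooth(x):
    """Seven-point binomial smoother, impulse response (1,6,15,20,15,6,1)."""
    h = np.array([1.0, 6.0, 15.0, 20.0, 15.0, 6.0, 1.0])
    return np.convolve(x, h / h.sum(), mode="same")
```

For a constant-frequency sinusoid, DESA-1 recovers the amplitude and frequency essentially exactly away from the signal edges, which makes it easy to sanity-check on synthetic tones.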

Mel Band Energies
The general Mel band energies (MBEs) are computed through the discrete Fourier transform as follows. Let x(n) be a discrete audio signal with sampling rate f_s. It is divided into P frames, each of length N samples with N/2 overlapping samples, such that the pth frame of the signal x(n) is given as:

x_p(n) = x(n + (p − 1) N/2), n = 0, 1, ..., N − 1.

The input signal x(n) can be represented as a matrix of size N × P as X = [x_1, x_2, ..., x_p, ..., x_P]. When calculating the DFT of each frame x_p, one assumes that the signal repeats infinitely, which introduces spectral leakage. To avoid spectral leakage, a Hanning window is applied,
w(n) = 0.5 (1 − cos(2πn/(N − 1))), n = 0, 1, ..., N − 1,

and the discrete Fourier transform of the windowed signal is given as:

X_p(k) = Σ_{n=0}^{N−1} x_p(n) w(n) e^{−j2πkn/N}.

Here, k = 0, 1, 2, ..., N − 1, where N represents the number of points used by the FFT for a particular frame x_p. Using the sampling rate f_s of the input signal, the frequency corresponding to bin k can be computed as l_f(k) = k f_s / N, and the frequency resolution as f_res = l_f(k + 1) − l_f(k). The DFT of the pth frame can be represented as X_p = [X_p(0), X_p(1), ..., X_p(N − 1)]. To obtain Mel filter bank energies, we multiply the DFT matrix X_p(k) by the Mel filter bank matrix M(m, k); a logarithm is then applied to obtain log Mel band energies of size F × P:

E(m, p) = log( Σ_k M(m, k) |X_p(k)|² ).     (12)
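The FFT-MBE pipeline described above (framing with overlap, Hanning window, DFT, Mel filter bank, logarithm) can be sketched as below. The triangular filter construction and the Mel-scale formula m = 2595 log10(1 + f/700) are standard, commonly used choices, assumed here rather than taken from the paper; frame hop and band counts are illustrative defaults.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_bank(fs, n_fft, n_mels):
    """Triangular Mel filter bank matrix M of shape (n_mels, n_fft // 2 + 1)."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    M = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):                 # rising slope of triangle m
            M[m - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):                 # falling slope of triangle m
            M[m - 1, k] = (r - k) / max(r - c, 1)
    return M

def log_mbe(x, fs, n_fft=1024, hop=512, n_mels=40, eps=1e-10):
    """Log Mel band energies of shape (n_mels, P): frame, window, DFT, Mel, log."""
    win = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[p * hop : p * hop + n_fft] * win
                       for p in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2  # |X_p(k)|^2
    M = mel_filter_bank(fs, n_fft, n_mels)
    return np.log(M @ power.T + eps)
```

A pure tone should concentrate its energy in the Mel band whose triangle covers the tone frequency, which offers a quick way to verify the filter bank alignment.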

EMD-Mel Band Energies
In this method, the AM and FM components are used to construct a magnitude spectrum; a Mel filter bank is then applied to obtain Mel band energies. Figure 5 demonstrates the process of obtaining these MBEs, which we call empirical mode decomposition-based Mel filter bank energies (EMD-MBEs). The first step is the decomposition of the signal into its components using EMD. These distinct, adaptively decomposed components are the intrinsic mode functions (IMFs). For each IMF, the instantaneous amplitude a(i, n) and instantaneous frequency f(i, n) are obtained through the energy tracking operator and the energy separation algorithm, where i = 1, 2, ..., N_IMFs and n = 1, 2, 3, ..., N_b, with N_b the number of samples in the input signal. In this study, we used the Teager-Kaiser energy tracking operator (TKEO) and the discrete energy separation algorithm (DESA-1), which compute the IA and IF over the complete length of each IMF. The TKEO and DESA algorithms in [36,43,44] have estimated IA and IF with an error of less than 10^−3. A framing/windowing function is then applied to the IA and IF obtained from the TKEO: we apply the Hanning window to obtain short overlapping frames of instantaneous frequency f_p(i, n_p) and instantaneous amplitude a_p(i, n_p). Afterwards, to obtain the magnitude spectrum, we use the definition provided by [37], in which the Hilbert-Huang transform is defined as a generalised Fourier transform and the spectrum is derived from the time-frequency distribution of the instantaneous energy envelope, i.e., the squared magnitude of the amplitude envelope. In this study, we used the instantaneous energy envelope |a(i, n)|² and summed it over a larger number of frequency bands than the Hilbert spectrum [33,37].
This enables us to distribute the energy over a large number of frequencies and obtain higher frequency resolution. To derive Mel band energies, we summed the energies over frequency bands defined as in the previous section. The relation is defined as (13):

E(k, p) = Σ_i Σ_{n=1}^{N} |a_p(i, n)|² 1_{B_k}( f_p(i, n) ),     (13)

where i indexes a single IMF, N represents the number of samples in the frame, and B_k represents a particular sub-band defined as B_k = [l_f(k), l_f(k + 1)], where k = 1, 2, 3, ..., N − 1 and l_f(k) = k f_s / N. The indicator function of a set Ω is given as:

1_Ω(x) = 1 if x ∈ Ω, and 0 otherwise.

After using (13), we obtain a matrix of size N × P. This matrix is multiplied, similarly to Equation (12), by the Mel filter bank matrix M(m, k) of shape F × N. The resulting matrix has shape F × P; taking the log, we obtain the log Mel band energies.
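The binning step of (13), i.e., summing the windowed instantaneous energy |a_i(n)|² of each IMF into FFT-like frequency bins per frame, can be sketched as follows. The inputs `ia` and `inst_f` stand for the IA and IF tracks from the TKEO/DESA-1 stage; the frame and hop sizes are illustrative. Multiplying the returned matrix by a Mel filter bank and taking the log then yields EMD-MBEs.

```python
import numpy as np

def emd_spectrogram(ia, inst_f, fs, n_fft=1024, hop=512):
    """
    Bin the windowed instantaneous energy |a_i(n)|^2 of every IMF into
    frequency bins B_k = [k*fs/n_fft, (k+1)*fs/n_fft) per frame.
    ia, inst_f : arrays of shape (n_imfs, n_samples) from TKEO/DESA-1.
    Returns E of shape (n_fft // 2 + 1, n_frames).
    """
    n_imfs, n_samples = ia.shape
    n_bins = n_fft // 2 + 1
    n_frames = 1 + (n_samples - n_fft) // hop
    win = np.hanning(n_fft)
    E = np.zeros((n_bins, n_frames))
    for p in range(n_frames):
        s = p * hop
        for i in range(n_imfs):
            a = ia[i, s : s + n_fft] * win            # windowed IA frame
            f = inst_f[i, s : s + n_fft]
            k = np.clip((f * n_fft / fs).astype(int), 0, n_bins - 1)
            np.add.at(E, (k, p), a ** 2)              # indicator-style summation
    return E
```

A single synthetic "IMF" with constant instantaneous frequency should place all of its energy in the single bin containing that frequency.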

S-MBE
Sounds produced in any environment are composed of complex and random changes, and the presence of a signal trend has a negative effect on frequency-domain power spectral analysis and time-domain correlation analysis, potentially resulting in the loss of information in the low-frequency spectrum. To counter such problems, we apply a method of extracting Mel filter bank energies (MBEs) after removing the signal trend, calling these S-MBEs. In the literature, researchers have used a version of SMFCC for emotion recognition [45]; we present a different version that extracts the feature before applying the discrete cosine transform (DCT) to the MBEs. The complete process of extracting S-MBEs is presented in Figure 6. In this method, EMD decomposes the signal, and the signal trend is extracted from the IMFs using the zero-crossing rate (ZCR) detection method. The trend, denoted T[n], is defined as the sum of all IMFs satisfying the ZCR condition (Equations (14) and (15)) [45]:

T[n] = Σ_{i ∈ S_T} r_i[n],     (14)

where S_T is the set of IMFs satisfying

ZCR(r_i) < R,     (15)

and R represents the zero-crossing rate threshold. Afterwards, the reconstructed signal S[n] is obtained by removing the signal trend T[n] from the input signal.

S[n] = x[n] − T[n]
Finally, the MBEs are calculated using FFT and Mel filter banks on the reconstructed signal, as shown in Figure 6.
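The trend-removal step can be sketched as below. The IMFs are assumed to be precomputed by an EMD routine, and the zero-crossing-rate threshold `rate_threshold` is a hypothetical value, since the paper does not specify R.

```python
import numpy as np

def zcr(x):
    """Zero-crossing rate: fraction of adjacent sample pairs with a sign change."""
    return float(np.mean(np.abs(np.diff(np.signbit(x).astype(int)))))

def remove_trend(x, imfs, rate_threshold=0.01):
    """
    S-MBE pre-processing sketch: the trend T[n] is the sum of IMFs whose
    zero-crossing rate falls below the threshold R (rate_threshold, an
    assumed value), and the reconstructed signal is S[n] = x[n] - T[n].
    """
    trend = np.zeros_like(np.asarray(x, dtype=float))
    for imf in imfs:
        if zcr(imf) < rate_threshold:
            trend += imf
    return x - trend, trend
```

Because EMD orders IMFs from high to low frequency, the low-ZCR IMFs captured by the threshold are the slowly varying components that make up the trend; the MBEs are then computed from the detrended signal S[n] with the usual FFT pipeline.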

Features
The Mel spectrograms from the aforementioned techniques are depicted in Figures 7 and 8. The FFT-based Mel spectrogram is shown in Figure 7a, along with the S-MBE in Figure 7b. The EMD-based Mel spectrogram is depicted in Figure 8.

Databases
The development of an ESC system relies heavily on the database. Sound classification is a vast topic comprising many categories, such as acoustic scene classification, sound event classification, and environmental sound classification. We utilised databases of acoustic scenes, sound events, and environmental sounds. The databases are described below.

Acoustic Scene Classification Dataset
Detection of acoustic scenes has been considered a complex problem by the research community for several years, and various efforts have been dedicated to solving it. Acoustic scenes contain acoustic events in a particular environment such as metro stations, airports, and train stations. Classifying these categories is difficult due to the similar nature of the sound events occurring in those environments. To address this, the Detection and Classification of Acoustic Scenes and Events (DCASE) challenge provides a dataset containing audio recordings of 10 different categories: airport, indoor shopping mall, metro station, pedestrian street, public square, street with a medium level of traffic, travelling by train, travelling by bus, travelling by underground metro, and urban park (TUT Urban Acoustic Scenes 2018 dataset).

Low-Complexity Acoustic Scene Classification Dataset
This dataset is provided by the DCASE community [46] and comprises recordings from 12 European cities in 10 distinct acoustic scenes. The 10 scenes are grouped into three separate categories as follows:
• Indoor scenes: airport, indoor shopping mall, and metro station;
• Outdoor scenes: pedestrian street, public square, street with a medium level of traffic, and urban park;
• Transportation-related scenes: travelling by bus, travelling by tram, and travelling by underground metro.
The audio is recorded at 48 kHz with 24-bit resolution in binaural format using a single recording device. The dataset is divided into a development set and an evaluation set. Due to the unavailability of labels for the evaluation set, the system was evaluated on the development set only. The development set contains 40 h of audio recordings divided into a training set and a test set; each audio file is 10 s long. The baseline system [47] is evaluated on the development set using log Mel filter bank energy features.

Urbansound8k
Urbansound8k is a dataset containing 10 different classes and 8732 short-duration (up to 4 s) files [48,49]. The collection is composed of environmental sounds: air conditioner, car horn, children playing, dog bark, drilling, engine idling, gun shot, jackhammer, siren, and street music. Recordings are organised into 10 folds for cross-validation and were recorded at a 22.05 kHz sampling frequency.

Custom Database
The audio recordings were collected from FreeSound [50] from several contributors. Each recording was registered by a different publisher, with different locations, lengths, equipment, and sampling rates. The recordings were gathered for four categories: rain, wind, car passing, and human walking. The original sampling rates ranged from 44,100 Hz to 96,000 Hz. The database was processed to obtain uniform characteristics: ten-second audio files were extracted at a sampling rate of 44,100 Hz, resulting in 750 files of 10 s each, with a total duration of 125 min [51].

Classification Model
We used convolutional neural networks (CNNs) in this study; CNNs have been widely used with Mel band energies for the classification of environmental sounds. For the Acoustic Scene Classification Dataset and the Low-Complexity Acoustic Scene Classification Dataset, we used the baseline CNN1 model [46]. All parameters were selected according to the baseline model described by the authors, in order to evaluate the same system with different feature inputs. For Urbansound8k and the custom dataset, we used the CNN2 model [51] given in Table 1.
In this study, we compared two feature extraction techniques: the proposed EMD-based technique and the FFT-based technique. To compare the performance of the two methods, we employed baseline systems, since they are built on the simple extraction of Mel band energy features and no additional pre- or post-processing is applied during training and testing. The system that reached the highest accuracy on the evaluation set of the acoustic scene classification dataset [52] uses an ensemble of features based on adaptive temporal division and classifies with a VGGish-based neural network; its score on the development dataset is not published. The leading system on the development dataset [53] uses MFCC features with I-vector back-end processing and a fusion of CNNs and I-vectors to make predictions. Similarly, for the low-complexity acoustic scene classification dataset, the leading system uses a ResNet with a restricted receptive field. For Urbansound8k, different systems have been proposed [16,54]; these use feature pre- and post-processing, transfer learning, and other methods to enhance accuracy. In contrast to these state-of-the-art systems, which employ different architectures and different pre- and post-processing methods, we followed the path of the baseline systems and compared the proposed EMD-based Mel band energies against FFT-based Mel band energies. This allows us to evaluate the performance of both features on the same system, with the same parameters, without any pre- or post-processing of the features. The specifications of the systems used are described below.

Table 1. Convolutional neural network model specification.

Layers | CNN1 | CNN2
Layer 1 | Conv2D(32, (7,7)), ReLU, strides = 1, MaxPool2D(5,5) | Conv2D(64, (4,4)), tanh, strides = 1, MaxPool2D

• For the acoustic scene classification datasets, the log Mel band energies, EMD-log Mel band energies, and log-scaled S-MBEs were calculated with the same characteristics, resulting in the same shape. The input shape is 40 × 500. The system was trained for 200 epochs with a mini-batch size of 16 and data shuffling between epochs, using the Adam optimizer [55] with an initial learning rate of 0.001. Model performance was checked after each epoch on the validation set, and the best-performing model was chosen.
• For Urbansound8k, the log Mel band energies, EMD-log Mel band energies, and log-scaled S-MBEs were extracted with 60 Mel bands, a window size of 1024 samples, and a hop length of 512 samples. The input size for the CNN was 60 × 41, and silent segments were discarded. The Urbansound8k dataset was trained using 10-fold cross-validation. The network was trained for 300 epochs with the Adagrad optimizer [56].
• For the custom database, the log Mel band energies, EMD-log Mel band energies, and log-scaled S-MBEs were extracted with 128 Mel bands and 50% overlap. The custom dataset was trained using seven-fold cross-validation. The system was trained for 200 epochs with the Adagrad optimizer, with an initial learning rate of 0.001.
• To evaluate the experimental results, this paper uses classification accuracy as a metric, i.e., the proportion of correctly classified test samples [46], to allow comparison between the feature extraction methods under the same evaluation metrics.

Results and Discussion
We trained the convolutional neural networks with the parameters given in the previous section. Each model was evaluated on a test set, and the average classification accuracy was computed. First, we trained the models using one feature at a time; for each feature, the system was trained and evaluated. Afterwards, we combined the time-frequency analysis techniques: first FFT-MBEs and EMD-MBEs, with the model trained and evaluated on the aggregation of these two features; then the two proposed features, EMD-MBEs and S-MBEs, combined with the traditional FFT-MBEs. The models were likewise trained and evaluated on the aggregation of all the features.
The average classification accuracies for each database with the different features are presented in Table 2. The class-wise average classification accuracies for the Acoustic Scene Classification Dataset with FFT-based log MBEs, EMD-based log MBEs, log S-MBEs, and their combinations are presented in Table 3. Similarly, for the Low-Complexity Dataset, Urbansound8k, and the custom database, class-wise mean classification accuracies are presented in Tables 4, 5, and 6, respectively.
It is evident that the FFT-based log MBEs performed better than the EMD-based log MBEs and log S-MBEs for every database. However, in some cases, the combination of all the features improves the performance of the system with respect to the single FFT-based log MBE features. As shown in Tables 2-6, the EMD-MBE method outperforms FFT-MBE in some categories. The CNNs were able to make better inferences for some categories than others; it is, however, uncertain what led the CNNs to learn the features of one method, or of one category, better than another, and no established evaluation methods exist to explain what drives such differences in CNN performance. In the EMD-MBE method, for each sample of a mono-component IMF in the time domain, we obtained an equivalent instantaneous amplitude (IA) and instantaneous frequency (IF) component. To obtain a shape similar to that of FFT-MBE, we performed framing on the training sample and then summed over frequency bands; each frame consists of IA and IF samples. The total number of frequency components in a single frame of EMD-MBE for the original signal is determined by the number of samples in the window, summed over frequency bands, and the number of IMFs extracted. In contrast, in FFT, the windowing function is applied directly to the input signal, and the frequency and energy components are computed over this window; this may introduce spurious higher harmonics in the result, as mentioned earlier. The FFT-MBE is smoother and contains more components per frame, whereas the EMD-MBE is limited by the number of IMFs and the number of samples in the window. Furthermore, empirical mode decomposition suffers from mode mixing, which is mainly caused by intermittence and noise.
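The per-sample IA/IF estimation described above can be sketched with the discrete Teager-Kaiser energy operator and an energy separation algorithm. The DESA-2 variant is assumed here for illustration; the paper's exact formulation may differ.

```python
import numpy as np

def tkeo(x):
    # Discrete Teager-Kaiser energy operator:
    # psi[n] = x[n]^2 - x[n-1] * x[n+1]
    return x[1:-1] ** 2 - x[:-2] * x[2:]

def desa2(x):
    # DESA-2: estimate instantaneous frequency (rad/sample) and
    # instantaneous amplitude from TKEO of the signal and of its
    # symmetric difference y[n] = x[n+1] - x[n-1].
    psi_x = tkeo(x)
    y = x[2:] - x[:-2]           # aligned with psi_x
    psi_y = tkeo(y)
    psi_x = psi_x[1:-1]          # re-align after the second TKEO
    ratio = np.clip(1.0 - psi_y / (2.0 * psi_x + 1e-12), -1.0, 1.0)
    inst_freq = 0.5 * np.arccos(ratio)
    inst_amp = 2.0 * psi_x / np.sqrt(np.abs(psi_y) + 1e-12)
    return inst_amp, inst_freq

# Sanity check on a pure tone: IA tracks the amplitude,
# IF tracks the angular frequency.
n = np.arange(2000)
omega, amp = 0.3, 1.5
x = amp * np.cos(omega * n)
ia, inst_f = desa2(x)
```

Applied to each mono-component IMF, this yields one IA and one IF value per time-domain sample, which is exactly what the framing and band-summation steps then consume.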
Sudden changes in the signal, such as noise interference or intermittently distributed high-frequency components, are one of the main causes of mode mixing: they make parts of the signal locally high-frequency and thereby produce local extreme values. The envelope generated by these jumping local extrema causes the IMF to disagree with the underlying time scale and to carry over into different frequency components of the original signal, which then cannot be effectively separated according to their characteristic time scales [57]. Mode mixing affects the subsequent decomposition components: the time-frequency distribution of the following IMFs becomes ambiguous and, eventually, the EMD decomposition process loses its physical meaning [58,59]. Many researchers have studied this issue, and several solutions have been proposed [60-62].
The EMD-MBE method requires more computational resources and time to extract features than the FFT-based MBE method. One evident reason is the calculation of the IMFs during the decomposition of the signal. Secondly, algorithms for computing the FFT have been studied extensively, culminating in the Cooley-Tukey algorithm in widespread use today [19]. Similar efforts could be made in the future for the EMD method in order to reduce its computational overhead.

Conclusions
The main objective of this paper was to introduce an adaptive time-frequency analysis method for audio signals and to perform a comparative analysis with the traditionally used time-frequency analysis method. These methods were evaluated based on their performance as features in an environmental sound classification system. The traditionally used method, the Fourier transform, is valid under some general conditions and relies on an a priori-defined basis. An adaptive method for decomposing a signal into multiple components, empirical mode decomposition (EMD), introduced by Huang et al., is applied to obtain intrinsic mode functions (IMFs) as components. The discrete energy separation algorithm, based on the Teager-Kaiser energy operator (TKEO), is applied to each IMF individually to obtain the instantaneous amplitude (IA) and instantaneous frequency (IF) on a local time scale. Afterwards, a windowing function is applied to generate spectrograms, which are summed together, and a Mel filter bank is then applied to generate log Mel band energies. We also proposed S-MBEs in this paper, which use EMD to compute the signal trend, which is subsequently removed from the original signal; log Mel band energies are then computed using the fast Fourier transform (FFT). The features extracted with the proposed method estimated the change of frequencies with respect to time similarly to the traditional method, but with different intensities. This could be attributed to the fact that the EMD-TKEO method estimated the IA and IF, which were summed together with a fixed window size to match the dimensions of the Mel filter banks. We compared the performance of features extracted with the proposed methods against features extracted from FFT-based log Mel band energies. Two different CNN systems were employed in this study to evaluate the feature performance on four different databases.
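As a minimal sketch of the spectrogram-construction step, one can frame each IMF's IA/IF pair and accumulate the energy of every sample into the frequency bin selected by its IF value, then sum the per-IMF spectrograms. The binning scheme and the constants below are assumptions for illustration, not the paper's exact procedure:

```python
import numpy as np

def if_spectrogram(ia_list, if_list, n_bins=513, win=1024, hop=512):
    # Build a spectrogram from per-IMF instantaneous amplitude/frequency:
    # within each frame, every IA sample contributes its energy to the
    # bin selected by its IF value; per-IMF spectrograms are summed.
    n = min(len(a) for a in ia_list)
    n_frames = 1 + (n - win) // hop
    spec = np.zeros((n_bins, n_frames))
    for ia, inst_f in zip(ia_list, if_list):
        for t in range(n_frames):
            sl = slice(t * hop, t * hop + win)
            # Map IF in [0, pi] rad/sample onto bin indices [0, n_bins-1].
            bins = np.clip((inst_f[sl] / np.pi * (n_bins - 1)).astype(int),
                           0, n_bins - 1)
            np.add.at(spec[:, t], bins, ia[sl] ** 2)
    return spec

# Two hypothetical IMF components with constant IA and IF.
n = 4096
ia_list = [np.ones(n), 0.5 * np.ones(n)]
if_list = [np.full(n, 0.3), np.full(n, 1.2)]
spec = if_spectrogram(ia_list, if_list)
```

The resulting matrix plays the role of the FFT power spectrogram, so the same Mel filter bank and log compression can be applied to it unchanged.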
The results demonstrate that the performance of the EMD-based Mel band energies (EMD-MBEs) lagged behind that of the FFT-based Mel band energies (FFT-MBEs), and that S-MBEs performed worst among the three features under evaluation for every database. The aggregation of all three features resulted in an improvement in accuracy over FFT-MBEs alone. This improvement reflects the fact that EMD-MBEs performed better than FFT-MBEs for some classes, so the combination of the methods improved the overall result. The analysis of the lower performance of the proposed method reveals that, in the estimation of the time-frequency representation, the resolution is limited by the number of IMFs and the window size. Furthermore, during the decomposition of the signal into IMFs, the EMD method suffers from the mode mixing problem, which degrades the quality of the extracted features. In the future, different EMD variants will be considered to obtain feature estimates with performance similar to that of FFT-based features.