Improvement of Speech/Music Classification for 3GPP EVS Based on LSTM

Competition in speech recognition technology for smartphones is intensifying as Internet of Things (IoT) devices become widespread. Robust speech recognition requires detecting speech signals in various acoustic environments. Speech/music classification, which allows signal processing to be optimized according to the classification result, has been widely adopted as an essential part of various electronics applications, such as multi-rate audio codecs, automatic speech recognition, and multimedia document indexing. In this paper, we propose a new technique that uses long short-term memory (LSTM) to improve the robustness of the speech/music classifier of the enhanced voice services (EVS) codec, which has been adopted as a voice-over-LTE (VoLTE) speech codec. For effective speech/music classification, the feature vectors fed to the LSTM are chosen from the features of the EVS. To cope with the diversity of music data, a large amount of data is used for training. Experiments show that LSTM-based speech/music classification provides better results than the conventional EVS speech/music classification algorithm across various conditions and types of speech/music data, especially at low signal-to-noise ratios (SNRs).


Introduction
Speech/music classification algorithms are an important component of variable rate speech coding and coverage of the communication bandwidth, and provide an effective means of enhancing the capacity of the bandwidth. In addition, a major concern in speech coding is optimizing for the input signal, and input types other than speech, such as music, are being investigated. Speech/music classification algorithms are therefore essential for providing high sound quality in speech coding. Recently, a number of adaptive multi-rate (AMR) voice codecs have been proposed to efficiently utilize the limited bandwidth resources available [1-3]. Accurate speech/music classification is necessary because the bit rate allocated to each input type affects the voice characteristics of these adaptive multi-rate voice codecs [4,5].
Recently, further improvements in speech/music classification have been achieved by adopting several machine learning techniques, such as the support vector machine (SVM) [6,7], Gaussian mixture model (GMM) [8], and deep belief network (DBN) [9] for the selectable mode vocoder (SMV) codec. The speech/music classifier of the enhanced voice services (EVS) codec, the 3rd Generation Partnership Project (3GPP) standard speech codec for the voice-over-LTE (VoLTE) network, is also based on the GMM, but its features are calculated either at the current frame or as a moving average between the current and previous frames [10]. The speech/music classifier performs binary classification, but music is more diverse than speech, so the task can be regarded as a multiclass classification problem across musical genres. The GMM is not suitable for solving multiclass classification problems due to scalability issues. In this paper, we propose a robust speech/music classifier based on long short-term memory (LSTM) [11,12], which handles the vanishing gradient problem [13] better than conventional recurrent neural networks (RNNs). In an audio signal, a high correlation exists between signal samples in consecutive frames of the sequence. LSTMs, a particular type of RNN, were proposed as a way of extending neural networks (NNs) to sequential signals. Their recurrent connections allow LSTMs to use prior frames and make them better suited to handling sequential data than non-recurrent NNs. The proposed method employs an LSTM with a feature vector derived from the EVS codec. To appraise the accuracy of the proposed algorithm, speech/music classification experiments are performed under a variety of simulated conditions.

Conventional 3GPP Enhanced Voice Services
The EVS is a 3GPP speech codec designed for the VoLTE network. The EVS codec supports two modes: ACELP (for low and intermediate bitrates) and MDCT (for intermediate and high bitrates). The mode is selected depending on channel capacity and speech quality requirements.
The EVS speech/music classifier runs on each 20 ms frame that voice activity detection (VAD) marks as "active".

Feature Selection
The speech/music classification method applied to the EVS codec reuses the 68 parameters calculated in the early stages of codec preprocessing to minimize complexity. For the initial selection of the features to be used in the Gaussian mixture model, the EVS codec uses the technique proposed by Karnebäck [14], which analyzes the correlation matrix of all the features. In this way, the candidate feature sets with minimal cross-correlation are analyzed, and features are selected by calculating the discrimination probability $U_h$, where $m_h^{(mus)}$ and $m_h^{(sp)}$ are the histograms of feature $h$ generated on the music and the speech training databases, respectively, and $J$ is the total number of frames in the database. The following 12 features are selected from the initial 68 features according to the discrimination probabilities $U_h$: five LSF parameters, normalized correlation, open-loop pitch, spectral stationarity, non-stationarity, tonality, spectral difference, and residual LP error energy [3].

GMM-Based Method
The GMM is estimated via the expectation-maximization algorithm [15] on a speech/music database and is a weighted sum of $L$ component Gaussian densities, given by the following equation:

$$p(z) = \sum_{k=1}^{L} \omega_k \, \mathcal{N}(z \mid \mu_k, \Sigma_k)$$

where $\mathcal{N}(z \mid \mu_k, \Sigma_k)$ are the component Gaussian densities, $\omega_k$ are the component weights, and $z$ is a normalized $N$-dimensional feature vector. The GMM generates two probabilities, $p_m$ and $p_s$, for the music probability and the speech probability, respectively. By analyzing the values of the music and speech probabilities in each frame, a discrimination measure between music and speech can be obtained by subtracting the log-probabilities:

$$y = \log(p_m) - \log(p_s)$$

Symmetry 2018, 10, 605
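To make the discrimination measure concrete, the following is a minimal sketch rather than the EVS implementation. It assumes diagonal covariances for brevity, takes the measure as the music log-probability minus the speech log-probability (so positive values favour music), and uses made-up two-dimensional toy models; the function names are hypothetical.

```python
import math

def log_gaussian_diag(z, mean, var):
    """Log density of a Gaussian with diagonal covariance (an assumption for brevity)."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
        for x, m, v in zip(z, mean, var)
    )

def gmm_log_prob(z, weights, means, variances):
    """log p(z) for a weighted sum of Gaussians, computed with log-sum-exp."""
    terms = [
        math.log(w) + log_gaussian_diag(z, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))

def discrimination(z, music_gmm, speech_gmm):
    """y = log p_m - log p_s; positive values favour music (assumed sign convention)."""
    return gmm_log_prob(z, *music_gmm) - gmm_log_prob(z, *speech_gmm)

# Toy models as (weights, means, variances); the numbers are illustrative only.
music_gmm = ([0.5, 0.5], [[2.0, 2.0], [3.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]])
speech_gmm = ([0.7, 0.3], [[-2.0, -1.0], [-1.0, -3.0]], [[1.0, 1.0], [1.0, 1.0]])

y_music_like = discrimination([2.5, 1.5], music_gmm, speech_gmm)
y_speech_like = discrimination([-1.5, -2.0], music_gmm, speech_gmm)
```

Working in the log domain avoids numerical underflow when the component densities are very small, which is why the sketch never evaluates the probabilities directly.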

Context-Based Method
The GMM-based speech/music classification responds very rapidly to changes from speech to music and vice versa. To make effective use of the discrimination potential of the GMM-based approach, $y$ is sharpened and smoothed by an adaptive auto-regressive filter, in which the superscript $[-1]$ denotes the previous frame and $\gamma_c$ is a filter factor. If the segment is energetically significant, the scaled relative frame energy will be close to 1, and for background noise it will be close to 0.01. Accordingly, if the SNR is high, more weight is given to the current frame, whereas if the SNR is low, the classifier depends more on past data because it is difficult to make accurate short-term decisions. This situation potentially occurs when $y$ is smaller than 0 and smaller than its value in the previous frame. In this case, the filter also uses $g$, the gradient of the GMM approach, where $g^{[-1]}$ is initialized to the value of $-y$ each frame. Finally, a varying number of previous frames (0 to 7) is combined according to the characteristics of the signal to make the speech/music decision [10].
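The effect of the filter factor can be illustrated with a generic first-order auto-regressive smoother. This is a sketch under stated assumptions, not the EVS filter: the update rule, the function name `smooth_scores`, and the fixed factors 0.1 and 0.9 are all hypothetical, and the codec's adaptive, energy-dependent factor is not reproduced here.

```python
def smooth_scores(scores, factors, initial=0.0):
    """First-order auto-regressive smoothing: s = g * s_prev + (1 - g) * y.

    A small factor g trusts the current frame; a large g leans on past frames.
    This is a generic smoother, assumed here for illustration only.
    """
    smoothed = []
    prev = initial
    for y, g in zip(scores, factors):
        if not 0.0 <= g <= 1.0:
            raise ValueError("filter factor must lie in [0, 1]")
        prev = g * prev + (1.0 - g) * y
        smoothed.append(prev)
    return smoothed

# Raw per-frame scores with a single outlier frame (made-up numbers).
raw = [4.0, 4.0, -6.0, 4.0, 4.0]
trusting = smooth_scores(raw, [0.1] * len(raw))  # behaves like a high-SNR setting
cautious = smooth_scores(raw, [0.9] * len(raw))  # behaves like a low-SNR setting
```

With the small factor the smoothed score follows the outlier and changes sign, while with the large factor it stays positive throughout, which mirrors the trade-off between fast response and stable decisions described above.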

Proposed LSTM-Based Speech/Music Classification
In this paper, an improved LSTM-based speech/music classification algorithm applicable to the framework of a speech/music classifier is proposed. LSTMs are sequence-based models of key importance for speech processing, natural language understanding, natural language generation, and many other areas. Because speech/music signals are highly correlated in time, the LSTM is an appropriate method for classifying speech/music. As shown in Figure 1, the LSTM performs speech/music classification when the EVS codec decides that the frame is active; the figure also shows the conventional EVS codec and the RNN-based algorithm for comparison. The feature vectors for speech/music classification are limited to the 12 features used in the conventional EVS.

The LSTM unit consists of an input activation function, a single memory cell, and three gates (input $i_t$, forget $f_t$, and output $o_t$), as shown in Figure 2. The input gate $i_t$ either lets the input signal change the memory cell state or blocks it. The forget gate $f_t$ controls what to remember and what to forget in the cell, and helps avoid vanishing gradients. Finally, the output gate $o_t$ either allows the memory cell state to influence other neurons or prevents this influence. With the addition of a memory cell, the LSTM can capture very complex, long-term dynamics and overcome the vanishing gradient problem. For an input $x_t$, the LSTM unit calculates a hidden/control state $h_t$, a block input $g_t$, and a memory cell state $c_t$, which is an encoding of everything the cell has observed until time $t$:

$$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i)$$
$$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f)$$
$$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o)$$
$$g_t = \phi(W_{xc} x_t + W_{hc} h_{t-1} + b_c)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot g_t$$
$$h_t = o_t \odot \phi(c_t)$$

where $W_{ij}$ are the weight matrices, $\odot$ is the point-wise product with the gate value, $b_j$ is the bias, $\phi(x)$ is the activation function, and $\sigma(x)$ is the logistic sigmoid. As shown in Figure 3, LSTM units are gathered together to form layers and are connected at each time step.
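A single time step of an LSTM unit with three sigmoid gates, a block input, and a memory cell can be sketched in plain Python as follows. This is an illustrative sketch of the commonly used LSTM formulation, not the authors' implementation: the activation function is assumed to be tanh, the hidden size of 4 and the random weights are arbitrary, and the helper names (`affine`, `lstm_step`, `random_params`) are hypothetical. Only the input size of 12 follows the text.

```python
import math
import random

def sigmoid(x):
    """Logistic sigmoid, used for the three gates."""
    return 1.0 / (1.0 + math.exp(-x))

def affine(W_x, x, W_h, h, b):
    """Compute W_x x + W_h h + b with matrices stored as lists of rows."""
    return [
        sum(w * v for w, v in zip(row_x, x))
        + sum(w * v for w, v in zip(row_h, h))
        + bias
        for row_x, row_h, bias in zip(W_x, W_h, b)
    ]

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step; params maps 'i', 'f', 'o', 'c' to (W_x, W_h, b)."""
    pre = {
        name: affine(W_x, x, W_h, h_prev, b)
        for name, (W_x, W_h, b) in params.items()
    }
    i = [sigmoid(a) for a in pre["i"]]    # input gate
    f = [sigmoid(a) for a in pre["f"]]    # forget gate
    o = [sigmoid(a) for a in pre["o"]]    # output gate
    g = [math.tanh(a) for a in pre["c"]]  # block input (tanh is an assumption)
    # New cell state: keep part of the old state, add gated new input.
    c = [ft * cp + it * gt for ft, cp, it, gt in zip(f, c_prev, i, g)]
    # Hidden state: gated, squashed view of the cell state.
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c

def random_params(n_in, n_hidden, rng):
    """Small random weights and zero biases, for demonstration only."""
    def matrix(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {
        name: (matrix(n_hidden, n_in), matrix(n_hidden, n_hidden), [0.0] * n_hidden)
        for name in "ifoc"
    }

# Run the unit over five frames of 12 features each (random stand-in data).
rng = random.Random(0)
params = random_params(n_in=12, n_hidden=4, rng=rng)
h, c = [0.0] * 4, [0.0] * 4
for _ in range(5):
    frame = [rng.gauss(0.0, 1.0) for _ in range(12)]
    h, c = lstm_step(frame, h, c, params)
```

Because the hidden and cell states are passed from one call to the next, each decision can depend on earlier frames, which is the property that motivates using an LSTM for highly time-correlated speech/music signals.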

Experiments and Results
To evaluate the proposed method, we compared the LSTM-based speech/music classification algorithm with the EVS method. The evaluation used the TIMIT speech database [16], Billboard year-end CDs from 1980 to 2013, and classical music CDs. Music of different genres (jazz, hip-hop, classic, blues, etc.) was collected from the music CDs. The entire music database was 221 h long, so a large amount of data was employed in the experiment.
To train the LSTM-based speech/music classifier, 60 h of speech signal and 160 h of music signal were randomly selected from the database. The length of each speech segment ranged from 6 to 12 s, and the length of each music segment ranged from 3 to 12 min. To create noisy environments, we added babble, car, white, pink, factory1, and factory2 noises from the NOISEX-92 database to the clean speech data at 5, 10, 15, and 20 dB SNR. The initial parameters of the proposed LSTM are listed in Table 1. For testing, we randomly chose 20 h of speech data and 61 h of music data, which were kept separate from the training data. The data were sampled at 16 kHz with a frame size of 20 ms. To calculate the accuracy of the algorithm, each frame was manually labelled and compared to the corresponding classification result of the classifier.
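Adding noise at a target SNR and splitting the result into 20 ms frames at 16 kHz can be sketched as below. This is a minimal sketch under assumptions, not the authors' data pipeline: a sine tone stands in for clean speech, Gaussian noise stands in for the NOISEX-92 recordings, SNR is computed over the whole signal, and the function names are hypothetical.

```python
import math
import random

SAMPLE_RATE_HZ = 16000
FRAME_MS = 20
FRAME_LENGTH = SAMPLE_RATE_HZ * FRAME_MS // 1000  # samples per 20 ms frame

def mean_power(signal):
    """Average power of a signal given as a list of samples."""
    return sum(s * s for s in signal) / len(signal)

def mix_at_snr(clean, noise, snr_db):
    """Scale the noise so that 10*log10(P_clean / P_noise) equals snr_db, then add it."""
    if len(noise) < len(clean):
        raise ValueError("noise must be at least as long as the clean signal")
    noise = noise[:len(clean)]
    gain = math.sqrt(mean_power(clean) / (mean_power(noise) * 10.0 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(clean, noise)]

def split_frames(signal, frame_length=FRAME_LENGTH):
    """Cut a signal into non-overlapping frames, dropping any incomplete tail."""
    count = len(signal) // frame_length
    return [signal[k * frame_length:(k + 1) * frame_length] for k in range(count)]

# One second of stand-in data: a 200 Hz tone as "speech" and Gaussian noise.
rng = random.Random(0)
clean = [math.sin(2.0 * math.pi * 200.0 * n / SAMPLE_RATE_HZ) for n in range(SAMPLE_RATE_HZ)]
noise = [rng.gauss(0.0, 1.0) for _ in range(SAMPLE_RATE_HZ)]

# The four SNR conditions named in the text.
noisy = {snr_db: mix_at_snr(clean, noise, snr_db) for snr_db in (5, 10, 15, 20)}
frames = split_frames(noisy[5])
```

At 16 kHz a 20 ms frame holds 320 samples, so one second of audio yields 50 frames for the classifier.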
To illustrate the performance difference, Figure 4 shows the speech/music classification results for a test speech/music segment. The figure shows that the proposed method classifies speech and music in agreement with the manual marking (silence = 0, speech = 1, music = 2) and yields better results in white noise (5 dB SNR) than the EVS-based and RNN-based algorithms.
To appraise the performance of the proposed algorithm, its speech/music classification accuracy was investigated. The results are shown in Table 2. The test results verify that the proposed LSTM-based method effectively improves the performance of the EVS. In particular, in pink and factory1 noise environments at 5 dB SNR, the accuracy of the proposed method is significantly higher than that of the EVS-based method. In the comparison between the RNN and the LSTM, the LSTM outperforms the RNN in all conditions.
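Frame-level accuracy against manual labels can be computed as in the sketch below. It assumes the label coding silence = 0, speech = 1, music = 2 used for the manual marking; the function names and the example label sequences are hypothetical and are not results from the paper.

```python
LABELS = {0: "silence", 1: "speech", 2: "music"}

def frame_accuracy(reference, predicted):
    """Fraction of frames whose predicted label matches the manual label."""
    if len(reference) != len(predicted) or not reference:
        raise ValueError("label sequences must be non-empty and equally long")
    return sum(r == p for r, p in zip(reference, predicted)) / len(reference)

def per_class_accuracy(reference, predicted):
    """Accuracy computed separately over the frames of each manual label."""
    result = {}
    for label in sorted(set(reference)):
        indices = [k for k, r in enumerate(reference) if r == label]
        result[label] = sum(predicted[k] == label for k in indices) / len(indices)
    return result

# Made-up manual labels and classifier decisions for eight frames.
manual = [0, 1, 1, 1, 2, 2, 2, 2]
decided = [0, 1, 1, 2, 2, 2, 1, 2]

overall = frame_accuracy(manual, decided)
by_class = per_class_accuracy(manual, decided)
```

Reporting accuracy per class alongside the overall figure helps when the test set is unbalanced, as it is here with more music than speech.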

Figure 1. (a) Block diagram of the conventional EVS speech/music classification. (b) Block diagram of the recurrent neural network based speech/music classification. (c) Block diagram of the proposed speech/music classification.


Figure 2. Detailed schematic of the LSTM unit, containing three gates, a block input, and a block output.

Figure 3. (Output labels: $y_1, y_2, \ldots, y_T$.)


Figure 4. Speech with white noise (5 dB SNR) and music waveform, and the decision results of the speech/music classification algorithms (manual marking: silence = 0, speech = 1, music = 2). (a) Test waveform, (b) decision of the EVS codec, (c) decision of the RNN-based algorithm, and (d) decision of the LSTM algorithm.


Table 1. Parameter setting of the proposed LSTM.

Table 2. Comparison of speech/music classification accuracy.
