Bimodal Emotion Recognition Model for Minnan Songs

Most existing research studies the emotion of Minnan songs from the perspectives of music analysis theory and music appreciation, and does not explore automatic emotion recognition for Minnan songs. In this paper, we propose a model consisting of four main modules that classifies the emotion of Minnan songs from bimodal data, namely song lyrics and audio. In the proposed model, an attention-based Long Short-Term Memory (LSTM) neural network extracts lyrical features, and a Convolutional Neural Network (CNN) extracts audio features from the spectrum. The two kinds of extracted features are then combined by multimodal compact bilinear pooling, and the combined features are input to the classifying module to determine the song emotion. We designed three experiment groups to investigate the classification performance of combinations of the four main parts, to compare the proposed model with current approaches, and to examine the influence of a few key parameters on the performance of emotion recognition. The results show that the proposed model performs better than all other experimental groups. With an appropriate combination of parameters, the accuracy, precision and recall of the proposed model exceed 0.80.


Introduction
Minnan songs (also called Hokkien songs) are an important part of ancient Chinese music, and their pronunciation, grammar and tone are quite different from those of Mandarin songs. Minnan songs, which originated in the 1930s and took shape in the 1950s and 1960s, are widespread in Southern Fujian and Taiwan [1-3]. They are popular among people in Southern Fujian and Taiwan, overseas Chinese and Chinese businessmen in Southeast Asia [1,4,5]. The vivid rhythms and lyrics of Minnan songs contain rich cultural and spiritual treasures, and have become a spiritual link connecting the Chinese in Southern Fujian [1]. The broad market demand and the significant influence of Minnan songs have gradually attracted the attention of experts and scholars [5]. Current studies of the emotion of Minnan songs mainly take the perspectives of music analysis theory and music appreciation [6], such as context, linguistic mode and the history of music development. These studies provide suggestions for passing on Minnan songs, as well as guidance for social media in developing the Minnan song market. In the Internet era, the availability of cloud-based streaming music applications with extensive libraries has made song recommendation systems popular. Current music recommendation systems are mostly based on song similarity and user acoustic similarity and a combination of the two factors, respectively, and their results showed that the proposed approaches can obtain efficient solutions compared with other alternative state-of-the-art strategies [34].
Instead of using lyrics or audio alone, some researchers combine the two to improve the accuracy of emotion recognition. Xiao et al. proposed a hybrid system that combines lyric features and audio features with fusion methods, and experiments showed that the hybrid system, with fewer training samples, could achieve the same or better classification accuracy than systems using lyrics or audio alone [35]. In 2015, Jamdar et al. proposed a method that considers both the lyrical and audio features of songs to detect their emotion [36]. In [36], linguistic association rules are applied when extracting lyrical features to ensure that ambiguity is addressed properly, and audio features are extracted from a music intelligence platform, Echo Nest. The emotion classification is performed with the KNN algorithm by using feature weighting and stepwise threshold reduction [36]. Lee et al. proposed a convolutional attention network model to learn the features of both speech and text data, and the model obtained better results for classifying emotions on the benchmark datasets [37]. Some studies considered other mid-level or high-level audio features, such as chord progression and genre metadata [27,38-40]. Lin et al. presented the association between genre and emotion, and proposed a two-layer scheme that could exploit this correlation for emotion classification [38]. Schuller et al. [40] incorporated genre, ballroom dance style, chord progression and lyrics to classify music emotion, and the experiments showed that most of the considered factors improved the classification accuracy.
Inspired by these studies, this paper studies the emotion recognition of Minnan music by using both lyrics and audio. We collected Minnan songs released from the 1980s to the 2010s from the Kugou platform. The lyrical features are extracted by an attention-based Long Short-Term Memory (LSTM) neural network, and the audio features are extracted from the Mel spectrum by a proposed Convolutional Neural Network (CNN). The two kinds of extracted features are then combined by multimodal compact bilinear pooling. The music emotion is finally determined by fully-connected layers and a softmax function. We set three groups of experiments according to the data used, i.e., experiments with the lyric modality, experiments with the audio modality and experiments with both the lyric and audio modalities. In the experiments, we also set comparison groups to study the performance of approaches from current studies and to investigate the effect of a few key parameters on music emotion recognition (MER).
The rest of the paper is structured as follows: Section 2 briefly introduces the proposed model and presents its four main parts in detail. Section 3 presents the dataset and the preprocessing strategy. Section 4 reports the experiment settings, the results of different combinations of the four main parts and the comparison of the proposed model with recent studies, and shows the effect of parameters on the performance of MER. Section 5 discusses the advantages and parameter settings of the proposed approach. Section 6 concludes this paper.

Model
The overall architecture of the proposed model is shown in Figure 1 and consists mainly of four modules. In the first two modules (denoted as modules A and B in Figure 1), two separate unimodal models learn the most discriminative features from lyrics and audio, respectively. An attention mechanism based on an LSTM (Long Short-Term Memory) neural network is used to highlight the most emotional words in the lyrics. The CNN (Convolutional Neural Network) model, with a few stacked convolution-pooling layers, extracts the audio features from the input spectrum. Instead of simply concatenating the generated audio features and lyric features, we combine the two types of features by Multimodal Compact Bilinear pooling (MCB) in the third module (denoted as C in Figure 1). Then, in the fourth module (denoted as D in Figure 1), the combined joint representation is fed to the classifier to make the final decision of emotion classification.

Attention-Based LSTM for Extracting Lyric Features
In the lyrics of a song, some words usually express stronger emotion than others. In our model, we use the attention-based LSTM (Long Short-Term Memory) neural network to highlight the most important words. Let $W = \{W_1, W_2, \ldots, W_i, \ldots, W_n\}$ denote the lyrics of a set of $n$ songs, where $W_i$ is the one-hot vector representation of the lyrics of the $i$th song. For the lyrics of each song $W_i$, we embed the words into a vector space $S_i = \{s_i^1, s_i^2, \ldots, s_i^j, \ldots, s_i^L\}$ as in Equation (1), where $P_e$ denotes the matrix of embedding parameters, $E$ is the embedding dimension and $L$ is the length of the lyrics.
Then higher-level features $\{f_i^1, \ldots, f_i^j, \ldots, f_i^L\}$ are extracted from the word embedding vectors $s_i^j$ as in Equation (2), where $\theta_s$ denotes the parameters of the LSTM and $B$ is the size of the LSTM cell.
According to the effect of each word on the emotion classification, the attention mechanism assigns a different weight $\beta_i^j$ to each word feature $f_i^j$. The normalized attention over all words is calculated with Equations (3) and (4), in which $\exp(\cdot)$ is the exponential function and the unnormalized attention weight reflects how closely the word feature $f_i^j$ is related to emotions. The parameters $K$ and $b$ are the learned weight matrix and bias term, respectively, and $\varphi(\cdot)$ is the Tanh activation function. The attended feature is computed as a weighted sum over the word annotations, as in Equation (5). The attended feature of the $i$th song's lyrics is thus generated as in Equation (6), where $\theta_a^{(t)}$ denotes the weight parameters.
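The attention step described above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than the authors' implementation: the score of each word is taken to be $\varphi(K f_i^j + b)$ with a Tanh activation, the weights are a softmax over those scores, and the attended feature is the weighted sum. The function name `attention_pool`, the shape of `K` (a vector that maps each feature to a scalar score) and all sizes are assumptions made for the example.

```python
import numpy as np

def attention_pool(F, K, b):
    """Attention-weighted sum of per-word features.

    F: (L, B) array, one LSTM output per word.
    K: (B,) weight vector (assumed shape; the text only calls K a weight matrix).
    b: scalar bias.
    Returns (Z, beta): attended feature of shape (B,) and weights of shape (L,).
    """
    scores = np.tanh(F @ K + b)        # unnormalized attention weight per word
    exp_scores = np.exp(scores - scores.max())  # shift for numerical stability
    beta = exp_scores / exp_scores.sum()        # normalized attention (softmax)
    Z = beta @ F                       # weighted sum over the word features
    return Z, beta

rng = np.random.default_rng(0)
num_words, B = 12, 64                  # arbitrary sizes for the example
F = rng.standard_normal((num_words, B))
K = rng.standard_normal(B) * 0.1
Z, beta = attention_pool(F, K, b=0.0)
```

Words with a larger score receive a larger share of the weighted sum, which is how the most emotional words come to dominate the lyric representation.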

CNN Model for Extracting Audio Features
The rhythm and melody of music, which are usually relevant to song emotion, are mainly determined by the distribution and variance of signal energy in the time and frequency domains. The Mel spectrogram is a visual representation of how the frequency spectrum of a signal varies over time, and it is computed by applying a nonlinear transform to the frequency axis of the short-time Fourier transform (STFT). The Mel spectrogram is widely used in audio signal applications because it is simple to estimate and provides a low-level acoustic representation.
As shown in Figure 1B, the CNN model uses only the Mel spectrogram of the audio signal as input, and a few stacked convolution-pooling layers extract the audio features. The convolution is written in Equation (7), where $C^{l}$ is the output feature map of layer $l$ and $C^{l-1}$ is the feature map of layer $l-1$ that serves as its input. In Equation (7), $w^{l,f_k}$ is the parameter of the $k$th filter from layer $l-1$ to $l$; $b^{(l,f_k)}$ is the bias of layer $l$; and $\varphi(\cdot)$ is the activation function, such as ReLU, Sigmoid or Tanh. Here we use the Tanh function.
Each convolution layer is followed by a max pooling layer that downsamples the feature map. The pooling operation is written in Equation (8). Finally, after the flatten operation, the output feature of the $i$th song is denoted as $V_i$.
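One convolution-pooling stage of the kind described above can be sketched with NumPy as follows. It is a minimal illustration, not the authors' network: it uses a single 3 × 3 filter, no padding, Tanh activation and non-overlapping 2 × 2 max pooling. The function names, the padding choice and the random input standing in for a Mel spectrogram are assumptions.

```python
import numpy as np

def conv2d_tanh(x, w, b):
    """Slide one filter over x without padding, add the bias, apply Tanh.

    As in most deep learning libraries, the filter is not flipped
    (cross-correlation).
    """
    kh, kw = w.shape
    out = np.empty((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(x[r:r + kh, c:c + kw] * w) + b
    return np.tanh(out)

def max_pool2d(x, size=2):
    """Non-overlapping max pooling; rows/columns that do not fill a window are dropped."""
    rows, cols = x.shape[0] // size, x.shape[1] // size
    x = x[:rows * size, :cols * size]
    return x.reshape(rows, size, cols, size).max(axis=(1, 3))

rng = np.random.default_rng(0)
spec = rng.standard_normal((192, 128))      # stand-in for a 192 x 128 Mel spectrogram
w = rng.standard_normal((3, 3)) * 0.1       # one 3 x 3 filter
feature_map = max_pool2d(conv2d_tanh(spec, w, b=0.0))
v = feature_map.ravel()                     # flatten operation
```

A real model would use many filters per layer and learn `w` and `b` by gradient descent; the sketch only shows how the shapes change from the spectrogram to the flattened feature vector.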

Multimodal Compact Bilinear Pooling
Bilinear pooling takes the outer product of two vectors. Compared with the element-wise product or simple concatenation, bilinear pooling allows all elements of the two vectors to interact, but its high dimensionality and very large number of parameters lead to a high computation cost and an over-fitting problem. To reduce the number of parameters and avoid computing the outer product explicitly, we adopt Multimodal Compact Bilinear pooling (MCB), which projects the joint outer product to a lower-dimensional space, to combine the extracted audio and lyric features.
A vector can be projected to a lower dimension by the count sketch projection function $\Psi$ [41]. For example, the lyric features $Z_i \in \mathbb{R}^B$ of the $i$th song are projected to a representation in $\mathbb{R}^d$ with a lower dimension ($d < B$). Rather than calculating the outer product of the two generated feature vectors directly, in Equation (9) MCB computes the count sketch of the outer product as the convolution of the two count sketches, where $*$ is the convolution operator.
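The count sketch identity behind MCB can be checked numerically. The sketch below is a generic NumPy illustration, not the authors' code: each input is projected with a random hash and random signs, and the two sketches are combined by circular convolution computed with FFTs. For comparison it also builds the count sketch of the explicit outer product, using the summed hashes modulo $d$ and the product of the signs. All names and sizes are assumptions chosen for the example.

```python
import numpy as np

def make_sketch_params(n, d, rng):
    """Random hash h: {0..n-1} -> {0..d-1} and random signs s in {-1, +1}."""
    h = rng.integers(0, d, size=n)
    s = rng.choice([-1.0, 1.0], size=n)
    return h, s

def count_sketch(x, h, s, d):
    """Project x to length d by accumulating s[i] * x[i] into bucket h[i]."""
    y = np.zeros(d)
    np.add.at(y, h, s * x)
    return y

def mcb(z, v, params_z, params_v, d):
    """Circular convolution of the two count sketches, computed with FFTs."""
    sketch_z = count_sketch(z, params_z[0], params_z[1], d)
    sketch_v = count_sketch(v, params_v[0], params_v[1], d)
    return np.fft.irfft(np.fft.rfft(sketch_z) * np.fft.rfft(sketch_v), n=d)

rng = np.random.default_rng(0)
B, M, d = 300, 200, 128                     # arbitrary sizes with d below both inputs
z, v = rng.standard_normal(B), rng.standard_normal(M)
pz, pv = make_sketch_params(B, d, rng), make_sketch_params(M, d, rng)
joint = mcb(z, v, pz, pv, d)

# Count sketch of the explicit outer product, for comparison only.
h_joint = (pz[0][:, None] + pv[0][None, :]) % d
s_joint = pz[1][:, None] * pv[1][None, :]
direct = count_sketch(np.outer(z, v).ravel(), h_joint.ravel(), s_joint.ravel(), d)
```

Here `joint` and `direct` agree up to floating-point error, while `mcb` never forms the 300 × 200 outer product, which is the saving the method relies on.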

Classifier for Determining Song Emotion
The combined audio and lyric features are fed to module D, which consists of Fully-Connected (FC) layers and a softmax function; the formula is written in Equation (10). Here $\theta_l^t$ denotes the parameters of the FC layers, and $\Psi_{ZV}$ is the joint embedding generated by MCB in Equation (9). The output vector $o$ contains the probabilities of the emotion categories for a song.
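The forward pass of such a classifier can be sketched as below. This is an illustrative reading, not the authors' implementation: it assumes a 1024-dimensional joint embedding, one hidden layer of 512 Tanh units and 7 output classes (the sizes reported later in the experiment settings), with random weights standing in for trained parameters. The function and variable names are hypothetical.

```python
import numpy as np

def softmax(x):
    """Convert scores to probabilities; the shift avoids overflow in exp."""
    e = np.exp(x - x.max())
    return e / e.sum()

def classify(joint, W1, b1, W2, b2):
    """Hidden FC layer with Tanh, then an output FC layer with softmax."""
    hidden = np.tanh(W1 @ joint + b1)
    return softmax(W2 @ hidden + b2)

rng = np.random.default_rng(0)
joint = rng.standard_normal(1024)                      # stand-in for the MCB output
W1, b1 = rng.standard_normal((512, 1024)) * 0.01, np.zeros(512)
W2, b2 = rng.standard_normal((7, 512)) * 0.01, np.zeros(7)
o = classify(joint, W1, b1, W2, b2)                    # one probability per emotion
predicted = int(np.argmax(o))                          # index of the most likely emotion
```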

Dataset
We collected 1162 Minnan songs released between 1980 and 2017 from the Kugou music platform, and the obtained songs include lyrical texts and audio in MP3 format. Unlike the mainstream emotional classification for songs, the emotional expression of Minnan songs is labeled, based on the research in [42], with seven categories: love (相爱), lovelorn (失恋), inspirational (励志), lonely (漂泊孤苦), homesickness (思乡), miss someone (思人) and leave (离别). The detailed information for the seven types of songs is shown in Table 1.

Data Preprocessing
The data processing includes two parts: the processing of the lyrical texts and the processing of the audio data. The lyrics of modern Minnan songs consist of Chinese and written Hokkien (the written form of the Minnan language). The Jieba segmentation module [43] only works on Chinese texts, so its segmentation of Minnan lyrics is inaccurate. We therefore add a new Minnan dictionary [44] to the Jieba database to form a Minnan-Jieba dictionary. To preprocess the lyrics, nontextual characters are removed first, and the lyrics are then segmented with the Minnan-Jieba dictionary.
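The effect of extending the segmenter's dictionary can be illustrated without Jieba itself. The sketch below is a stand-in and not Jieba's algorithm: it removes non-CJK characters with a regular expression (one possible reading of "nontextual characters") and segments by forward maximum matching against a word set. The two toy dictionaries and the sample line are hypothetical and exist only for this example.

```python
import re

def clean(line):
    """Keep CJK characters only; punctuation, digits, Latin text and spaces are dropped."""
    return re.sub(r"[^\u4e00-\u9fff]", "", line)

def segment(text, vocab, max_len=4):
    """Forward maximum matching: take the longest dictionary word at each position.

    A single character is always accepted, so unknown text falls apart
    into one-character tokens.
    """
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                words.append(piece)
                i += size
                break
    return words

mandarin = {"朋友", "放弃"}          # toy Mandarin dictionary (hypothetical)
minnan = {"毋通"}                    # toy Minnan entry (hypothetical dictionary content)
line = clean("朋友，毋通放弃！")
before = segment(line, mandarin)             # Minnan word is split into characters
after = segment(line, mandarin | minnan)     # Minnan word is kept whole
```

Without the Minnan entry the word 毋通 is broken into two single characters; with it the word survives as one token. With Jieba, custom entries are typically added through `jieba.load_userdict` or `jieba.add_word`.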
In this paper we use two kinds of preprocessing for the song audio. The first uses the open-source integrated platform OpenSMILE [45] to extract physical characteristics (for example, signal energy, loudness and pitch) from the song audio. We first convert the song audio into WAV format, then extract features from each WAV file with OpenSMILE and normalize the extracted features. The extracted physical attributes (whose dimension is denoted as $k$) include frame energy, frame intensity, critical spectrum and others.
The second kind of audio preprocessing generates the Mel spectrum of the song audio. The climax of a song can reflect the real feelings of the song [4], so we select $n$ seconds in the middle of the song audio. We then compute a Mel spectrogram from the selected $n$-second segment, and we set the size of the obtained spectrum to 192 × 128 in the following experiments. The window size (n_fft) is set to 1024 and the distance between adjacent windows (hop length) is 512, i.e., there is a 50% overlap between two adjacent windows.
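The segment selection and the framing arithmetic above can be made concrete with a short sketch. It covers only the parts stated in the text: taking the middle $n$ seconds and counting analysis windows for a window size of 1024 and a hop length of 512. The sampling rate of 22,050 Hz, the function names and the silent stand-in waveform are assumptions; the Mel filter bank and the step that produces the final 192 × 128 size are not specified in the text and are not shown.

```python
import numpy as np

def middle_segment(signal, sr, n_seconds):
    """Return the n_seconds of audio centered on the midpoint of the signal."""
    length = int(n_seconds * sr)
    start = max((len(signal) - length) // 2, 0)
    return signal[start:start + length]

def frame_count(num_samples, n_fft=1024, hop_length=512):
    """Number of full analysis windows when the signal is not padded."""
    if num_samples < n_fft:
        return 0
    return 1 + (num_samples - n_fft) // hop_length

sr = 22050                          # assumed sampling rate, not given in the text
song = np.zeros(200 * sr)           # stand-in for a 200-second waveform
clip = middle_segment(song, sr, n_seconds=30)
frames = frame_count(len(clip))     # windows available for the STFT
overlap = 1 - 512 / 1024            # fraction shared by adjacent windows
```

Under these assumptions a 30-second clip yields far more frames than either side of a 192 × 128 image, so some form of resizing or aggregation along the time axis would be needed to reach the stated size.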

Experiment Settings
The experiments were designed as three groups according to the data used, and the groups differ in three aspects: the method for extracting features, the classifier and the main parameters, as shown in Table 2. The first two groups, Ga1-Ga7 and Gb1-Gb6, are single-modal experiments that use lyrical text and song audio alone as input data, respectively. The multimodal experimental groups, Gc1-Gc6, classify song emotion by combining song lyrics and song audio. In Table 2, the parameters $d_1$ and $d_2$ are the dimensions of the output lyric features and output audio features, respectively; the classifiers used in our experiments are the commonly used SVM (Support Vector Machine) and fully-connected (FC) layers combined with softmax (Figure 1D); and the methods for extracting features are TF-IDF, LSTM, attention-based LSTM (Figure 1A), OpenSMILE and the CNN (Convolutional Neural Network, Figure 1B). In addition, we set experimental comparisons for the approach used to combine the extracted lyric and audio features. Further, current approaches [46-51] were also evaluated on our Minnan music dataset. Groups Ga8-Ga9 and Gb7-Gb8 are the approaches that use unimodal data, and Gc7-Gc8 are the experiments with multimodal data.
Group Ga1, the baseline for Ga2-Ga9, extracts features from the song lyrics by TF-IDF and classifies the emotion of songs with an SVM classifier; the dimension $d_1$ of the extracted lyric features is set to 297. Group Ga2 extracts lyric features with an LSTM model and recognizes song emotion with an SVM classifier. To compare classifiers, group Ga3, with $d_1 = 64$, replaces the SVM of Ga2 with FC layers and softmax (Figure 1D). Group Ga4 also uses the module in Figure 1D, but the dimension of the extracted features increases to $d_1 = 128$. In groups Ga6-Ga7, we apply the attention-based LSTM (Figure 1A) and the module in Figure 1D, and set $d_1$ to 64 and 128, respectively. Group Ga5 has the same settings as Ga6 except that it uses the SVM classifier. In addition, the approaches in [46,47], used for comparison with groups Ga6-Ga7, apply a transfer learning-based DNN model and a Naive Bayes model, respectively, to estimate music emotion from the lyric modality.
In groups Gb1-Gb8, the song audio is used as the data source. Group Gb1 extracts audio features directly by OpenSMILE, and the dimension $d_2$ of the extracted features is 384. Gb3 first uses the CNN module to extract audio features from the spectrogram and then inputs the extracted features to an SVM to classify the song emotion. In contrast to Gb3, group Gb4 uses fully-connected layers and softmax (Figure 1D) to determine the emotion of songs. In groups Gb5 and Gb6, the parameter $n$ increases to 30 s, the classifier is the module in Figure 1D, and the values of $d_2$ are set to 64 and 128, respectively. Note that groups Gb4-Gb6 use modules B and D of Figure 1, which are main parts of our proposed model. Groups Gb7-Gb8, for comparison with Gb4-Gb6, are audio-based approaches from current studies [48,49]: [48] combines a CNN and an LSTM to classify speech emotion, and [49] applies a deep neural network to determine speech emotion.
Groups Gc1-Gc8 use both song lyrics and song audio as data sources, with three different feature-extraction methods and two types of classifiers. The feature-extraction methods and classifiers are the same as in groups Ga1-Gb8, and we select the best main parameters $d_1 = 64$, $d_2 = 64$ and $n = 30$ for Gc2-Gc6, and $d_1 = 297$, $d_2 = 384$ for Gc1. In groups Gc1-Gc3, the extracted lyric and audio features are concatenated directly, while in Gc4-Gc6 the two types of features are combined by MCB (Multimodal Compact Bilinear pooling), i.e., Figure 1C. Further, two current approaches [50,51] that use both the lyric and audio modalities are denoted as groups Gc7-Gc8 and serve as comparisons for our proposed model Gc6.
The structure of our proposed model is set as follows. In module A, the LSTM has 256 hidden neurons in each cell. In the proposed CNN in Figure 1B, one 3 × 3 convolution layer and one 2 × 2 max-pooling layer are stacked for learning the representation. The audio and lyric features are combined by the MCB in Figure 1C into a vector of dimension 1024. The FC layers in Figure 1D are set as 1024-512-7 (Tanh). The model is trained with the commonly used stochastic gradient descent with a learning rate of 0.01. To avoid over-fitting, we apply dropout with a probability of 0.5 to the FC layers, and dropout with a probability of 0.3 before and after the module in Figure 1C. All implementations of our model are trained on 2 × NVIDIA Tesla V100.

Results for Unimodal Data
The experimental results consist of two parts according to the modality of the input data: the unimodal and the multimodal experiments. Table 3 reports experiments based on one modality (lyrics or audio) with different classifiers, feature-extraction methods and main parameters. As in most current studies, the classification performance is evaluated by accuracy, precision and recall. Accuracy is the ratio of the number of correct predictions to the total number of samples and is defined in Equation (11). Precision is the ratio of the number of correct results to the number of all obtained results and is given in Equation (12). Recall, written as Equation (13), is the ratio of the number of correct results to the total number of results that should have been obtained. In Equations (11)-(13), TP, TN, FP and FN represent "True Positives", "True Negatives", "False Positives" and "False Negatives", respectively.
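The three measures follow directly from the four counts: accuracy is (TP + TN) / (TP + TN + FP + FN), precision is TP / (TP + FP) and recall is TP / (TP + FN). The sketch below computes them for one emotion class treated as the positive class against all others. How the per-class values are aggregated over the seven emotions is not stated in the text, so the one-vs-rest counting, the function names and the toy labels are assumptions.

```python
def one_vs_rest_counts(y_true, y_pred, label):
    """Count TP, TN, FP, FN when `label` is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0   # no positive predictions: report 0

def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0   # no positive samples: report 0

# Toy labels for illustration only.
y_true = ["love", "love", "leave", "lonely", "leave", "love"]
y_pred = ["love", "leave", "leave", "lonely", "love", "love"]
tp, tn, fp, fn = one_vs_rest_counts(y_true, y_pred, "love")
acc, prec, rec = accuracy(tp, tn, fp, fn), precision(tp, fp), recall(tp, fn)
```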
In the experiments, we use multiple splits of the train and test datasets, with the fraction of training samples increasing from 0.5 to 0.9. Every experimental group was run 20 times under each split. The performance of group Ga1, which extracts lyrical features by TF-IDF and trains an SVM classifier, is the worst among the experimental groups based on the lyric modality. The precision of Ga2, with the same SVM classifier, increases by more than 10% when features are extracted with an LSTM. Compared with Ga2, the accuracy/precision/recall increases by 2-7% in Ga3, where the classifier is replaced with the module in Figure 1D, i.e., the fully-connected layers and softmax. In Ga4, the dimension of the extracted lyric features grows to 128, but the result of Ga4 is slightly lower than that of Ga3. With the attention mechanism added in groups Ga5-Ga7, the accuracy is up to 10% higher than that of Ga2-Ga4 under the same classifier and parameters, and the accuracy, precision and recall reach up to 0.45, 0.38 and 0.35, respectively. As with Ga3 and Ga4, the comparison between Ga6 and Ga7 also shows that increasing $d_1$ does not ensure a better emotion recognition result.
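The evaluation protocol described at the start of this paragraph (several train/test fractions, 20 runs each) can be sketched with the standard library. This is an illustration under assumptions: the text gives only the range 0.5 to 0.9, so the step of 0.1, the purely random (non-stratified) shuffling, the seed and the function name are choices made for the example.

```python
import random

def repeated_splits(num_samples, train_fraction, runs, seed=0):
    """Yield (train_idx, test_idx) for `runs` independent random splits."""
    rng = random.Random(seed)
    n_train = round(num_samples * train_fraction)
    for _ in range(runs):
        idx = list(range(num_samples))
        rng.shuffle(idx)
        yield idx[:n_train], idx[n_train:]

fractions = [0.5, 0.6, 0.7, 0.8, 0.9]      # assumed step of 0.1
plan = {f: list(repeated_splits(1162, f, runs=20)) for f in fractions}
train_idx, test_idx = plan[0.9][0]         # first run of the 90:10 split
```

Each group's reported figure would then be the average of a metric over the 20 runs of a given fraction.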
Comparing the results of groups Ga1-Ga7 with those of groups Gb1-Gb6 shows that the classification results based on the audio modality generally outperform the results based on the lyric modality. The accuracy of group Gb1 with $d_2 = 384$ is almost the same as that of group Gb2 with $d_2 = 1582$, so a larger dimension of audio features extracted by OpenSMILE may not improve the classification accuracy. The performance of group Gb1 is at least 10% lower than that of Gb3, which means that OpenSMILE is worse than the CNN for extracting features from song audio. With the classifier of Gb4 set to fully-connected layers and softmax (i.e., the module in Figure 1D), the classification performance of Gb4 is much better than that of Gb3, and the accuracy/precision/recall increases to 0.51/0.49/0.44 under the 90:10 train-test split. The accuracy/precision/recall of group Gb5 increases when the parameter $n$ grows from 15 in Gb4 to 30 in Gb5. The classification performance of group Gb6 with dimension $d_2 = 128$ is about 5% higher than that of Gb5 with $d_2 = 64$.
In addition to the above results, we also investigate song emotion recognition with current single-modality approaches on the Minnan song dataset. The results for the lyric modality are groups Ga8-Ga9, and the results for the audio modality are groups Gb7-Gb8. Group Ga8 applies the transfer learning-based DNN models of [46] to estimate song emotion from lyrics. The accuracy and precision of group Ga8 are slightly lower than those of the attention-based LSTM (i.e., Ga6-Ga7), while the recall of Ga8 is far lower than that of Ga6-Ga7. In group Ga9, a simple Naive Bayes machine learning approach [47] is used for music emotion classification based on lyrics, and its result is similar to that of group Ga1. Two audio-based emotion recognition approaches (i.e., groups Gb7 and Gb8) are also compared with our designed experiments. The method proposed in [48] combines a CNN and an LSTM to classify speech emotion. Although this method achieves a better performance on our dataset than groups Gb5 and Gb6, its architecture is complex and its computation cost is high. The approach in [49] uses a deep multilayered neural network for emotion recognition, and its result is similar to that of Gb5. However, the result of group Gb6 with the parameter $d_2 = 128$ is still slightly higher than that of Gb8.
Table 3. Average performance of unimodal data based experimental groups.

Results of Bimodal Data
The results of emotion recognition based on the bimodal data, i.e., the combination of lyrics and audio, are presented in Figure 2 and Table 4. Tables 3 and 4 show that the results with bimodal data are, in general, significantly higher than those of the experiment groups with unimodal data.
Group Gc1 extracts audio features and lyric features by OpenSMILE and TF-IDF, respectively, and its accuracy increases to at most 0.4 in Figure 2, which is about 10% higher than that of groups Ga1 and Gb1. Group Gc2 combines the feature-extraction methods and main parameters of Ga3 and Gb5, and its accuracy is about 20% higher than that of Ga3 and Gb5. Group Gc3, which combines groups Ga6 and Gb5, achieves a better performance than Ga6 and Gb5, with an accuracy/precision/recall of up to 0.78/0.79/0.76, respectively. In Table 4 and Figure 2, groups Gc4-Gc6 are the results of combining the two types of features by multimodal compact bilinear pooling (MCB), while groups Gc1-Gc3 concatenate the two types of features directly. We can see that the accuracy/precision/recall of groups Gc4-Gc6 is about 6% higher than that of Gc1-Gc3. Notably, group Gc6 (i.e., our proposed model in Figure 1) performs better than the other groups in Table 4 and Figure 2, which suggests that our model can recognize the emotion of songs more effectively. Groups Gc7-Gc8 are the results of two current studies [50,51], which combine two or more data sources for emotion recognition. The precision of group Gc7 is no more than 5% lower than that of our proposed model Gc6, while the accuracy and recall of Gc7 are over 10% lower than those of Gc6. In group Gc8, the performance of MER is significantly worse than that of Gc6, as the model used in Gc8 only focuses on positive and negative emotion in [50].

Discussion
From Tables 3 and 4 we can conclude that the accuracy/precision/recall obtained with the Mel spectrogram is generally higher than that obtained with lyrics. The experimental results also show that combining the audio and lyric modalities achieves a much better performance than using either modality alone. Moreover, combining the extracted audio and lyric features by multimodal compact bilinear pooling improves the results of Minnan music emotion recognition (MER) compared with concatenating the two types of extracted features directly. The results in Table 3 show that the combined modules of our proposed model are better than the approaches of current studies. Table 4 and Figure 2 show that our proposed model performs best on MER compared with the multimodal models of the two current studies. Using the audio features extracted from the Mel spectrogram by the CNN module of our proposed model also achieves a satisfactory level of performance. However, Minnan MER with features extracted by OpenSMILE is the worst when only the audio modality is used, which indicates that the audio features extracted by OpenSMILE cannot describe the characteristics of Minnan songs well. The experimental results also show that the attention-based LSTM is more effective for extracting lyric features than the plain LSTM.
To study the effect of parameters on the performance of music emotion recognition, we design a few comparison groups in the experiments. Tables 3 and 4 show that the longer the selected segment $n$ (in seconds) used to generate the Mel spectrum, the better the classification performance on song emotion. However, in contrast to previous studies, a large dimension $d_1$ of extracted lyric features may lead to an unsatisfactory classification result. The comparisons between the OpenSMILE-based groups also indicate that a large dimension of audio features extracted by OpenSMILE cannot guarantee a high classification performance.

Conclusions and Future Work
This paper proposes a model with four modules that combines audio features and lyrical features to classify the emotion of songs. In the first module, lyric features are extracted by an attention-based LSTM; in the second module, audio features are extracted from the Mel spectrum by a proposed CNN model; in the third module, the extracted audio and lyric features are combined by multimodal compact bilinear pooling; and in the fourth module, the emotion of a song is determined through fully-connected layers with softmax. We set three experimental groups to investigate the performance of different combinations of the four modules in the proposed model. The experimental results show that our proposed model performs best on music emotion recognition (MER) compared with all other experimental groups and the approaches of current studies. Moreover, combining the audio and lyric modalities achieves a much better performance than using only the audio or the lyric modality. Furthermore, MER that combines the extracted audio and lyric features with multimodal compact bilinear pooling is better than MER that concatenates the two types of extracted features directly. The comparisons show that the attention strategy is effective for obtaining a satisfactory classification result when extracting features from lyrics. Notably, MER with OpenSMILE is the worst when only the audio modality is used, which means that the audio features extracted by OpenSMILE cannot describe the characteristics of Minnan songs well.
Further, to explore the influence of parameters on the performance (i.e., accuracy, precision and recall) of the experimental groups, we use different parameter combinations as comparison groups in the experiments. When the Mel spectrum is used to extract audio features, the longer the selected segment of song audio, the better the classification performance on song emotion. In contrast, a larger dimension of features extracted by OpenSMILE may lead to an unsatisfactory classification result.
There are a few interesting directions that can be explored further. For example, the synchronization between lyrics and audio, i.e., the correspondence between each lyrical line and its portion of the song audio, should be studied when extracting features from lyrics and audio. Taking synchronization into account may greatly improve the accuracy of emotion recognition. Separating the vocal part from the song audio during audio preprocessing should also be tried, so that emotion classification experiments that extract features from the vocal signal can be compared with those that use the mixed instrument signal. A model that combines the vocal, instrumental and lyric modalities could also be proposed for music emotion recognition. In this paper, we combine the lyrical features and audio features by multimodal compact bilinear pooling. Other fusion approaches, for example, an attention mechanism, can be studied to improve the efficiency of emotion recognition.