A Novel Emotion-Aware Hybrid Music Recommendation Method Using Deep Neural Network

Abstract: Emotion-aware music recommendation has gained increasing attention in recent years, as music has the ability to regulate human emotions. Exploiting emotional information has the potential to improve recommendation performance. However, conventional studies identified emotion as discrete representations and could not predict users' emotional states at time points when no user activity data exists, let alone account for the influences posed by social events. In this study, we proposed an emotion-aware music recommendation method using deep neural networks (emoMR). We modeled a representation of music emotion using low-level audio features and music metadata, and modeled the users' emotion states using an artificial emotion generation model with endogenous and exogenous factors capable of expressing the influences posed by events on emotions. The two models were trained using a designed deep neural network architecture (emoDNN) to predict the music emotions for the music and the music emotion preferences for the users in a continuous form. Based on the models, we proposed a hybrid approach combining content-based and collaborative filtering for generating emotion-aware music recommendations. Experiment results show that emoMR performs better in the metrics of Precision, Recall, F1, and HitRate than the other baseline algorithms. We also tested the performance of emoMR on two major events (the death of Yuan Longping and the Coronavirus Disease 2019 (COVID-19) cases in Zhejiang). Results show that emoMR takes advantage of event information and outperforms the other baseline algorithms.


Introduction
The development of the modern Internet has witnessed the thriving of personalized services that exploit users' preference data to help them navigate through the enormous amount of heterogeneous content on the Internet. Recommender systems are such services developed to help users filter out useful personalized information [1][2][3]. In the field of music recommendations, the large amount of digital music content provides a huge opportunity for recommenders to suggest music content that meets users' preferences and reduces the search cost for users to find their favorite music [4]. Existing music recommenders leverage the user-item matrix containing both the users' and music items' traits [5], utilizing acoustic metadata, editorial metadata, and cultural metadata [6], and even location-based [7] and context-based [8] information to make better recommendations. However, as a kind of emotional stimulus, music has the power to influence human emotion cognition [9].
Thus, music recommendation based on the impact of music on human emotion has received research interest in both the academic and commercial sectors. To make music recommendations emotion-aware, studies focus mainly on music emotion recognition [10] and users' emotional or affective preferences for music [11].
Neurobiological studies and clinical practices have revealed that the human brain has particular substrates to process emotional information in music [12], making recognizing music emotions critical in utilizing complex human emotion information. Several studies have proposed methods to recognize music emotions to help make recommendations. Deng et al. constructed a relational model linking the acoustic (or low-level) features of music to its emotional impact. They utilized the users' historical playlist data to generate emotion-related music recommendations [13]. Rachman et al. used music lyrics data to build a psycholinguistic model for music emotion classification [14].
In addition, studies have proved that individuality has significant impacts on music emotion recognition [15]. Thus, from the individual perspective of user preferences, studies have also offered insights into user emotion state representation and emotion preference recognition. Some studies tried to obtain the emotional state of the users directly from the users' biophysical signals. Hu et al. built the link between music-induced emotion and biophysical signals for emotion-aware music information retrieval [16]. Ayata et al. detected user emotions for music recommendations by obtaining biophysical signals from the users via wearable physiological sensors [17]. However, the biophysical data can be expensive to obtain for a music recommender.
There is strong evidence that the brain mechanisms mediating musical emotion recognition also deal with the analysis and evaluation of the emotional content of complex social signals [18]. Modern social media services provide valuable information about user preferences and psychological traits [19]. It is not surprising that social media footprints, such as likes, posts, comments, social events, etc., are being exploited to infer individual traits. Researchers have turned to rich user activity and feedback data on social media [20,21] for music emotion preference modeling. Park et al. proposed an emotion state representation model by extracting social media data to recommend music [22]. Shen et al. proposed a model to represent user emotion state as short-term preference and used data from social media to provide music recommendations based on the model [23]. Other improvements in emotion-aware music recommendations include the context-based method [24], hybrid approaches [25], the incorporation of deep learning methods [26], etc.
Though huge progress has been made in emotion-aware music recommendations, problems remain in existing models and methods. Firstly, existing studies tend to represent music emotion and user emotion states in a discrete manner, such as the six Ekman basic emotions used by Park et al. [22] and Polignano et al. [27]. However, studies have proved that there are more categories of emotions beyond the six basic ones [28], and that it is better to model emotion in a continuous way [29]. Secondly, user emotion states extracted from social media data can only represent the user's emotion state or music emotion preference at certain time points; as stated by Shen et al. [23], emotion accounts for a short-term preference that drifts across time [30]. Since the recommender should generate recommendations whenever the user enters the system, there is still a need for continuous emotion state recognition across time. Thirdly, it has been proven that events, especially major social events, regulate human emotions [31], while existing studies fail to utilize event information to improve emotion state representation.
In this paper, we proposed an emotion-aware music recommendation method (emoMR) that utilizes a deep neural network. Representing music emotions continuously, generating user emotion states when no user activity data is present, and capturing event-related information to refine the emotion states are the three main issues this paper deals with. We proposed a music emotion representation model that exploits the low-level features of the music audio signals and the music metadata to predict the music emotion representations in a continuous form based on the emotion valence-arousal model [32]. We also proposed an emotion state representation model based on an artificial emotion generation model. Event-related information and implicit user feedback from social media were used as the exogenous factors of the model, while the endogenous factors were constructed using the human circadian rhythm model. The two proposed models were trained via a deep neural network architecture (emoDNN) and can generate accurate predictions. We took a hybrid approach that combines the advantages of both the content-based and the collaborative filtering approaches to generate recommendations. The results of the experiments show that emoMR outperforms the other baseline algorithms in the selected metrics. Although the length of the recommendation list reduces the performance of the compared methods, emoMR still has an advantage over other emotion-aware music recommendation methods. The results also suggest that our method is able to use event-related information to improve the quality of emotion-aware music recommendations.
The main contributions are summarized as follows: 1. The proposed models represent music emotions and users' music emotion preferences under certain emotion states in a continuous form of valence-arousal scores, allowing for complex emotion expressions in music recommendations; 2. The models are trained with a deep neural network (emoDNN) that enables the rapid prediction of both music emotions and user emotion states, even at time points when no user activity data exists; 3. Event-related information is incorporated to allow the hybrid approach to generate music recommendations considering the influences posed by events on emotions.
The rest of this paper is arranged as follows: Section 2 overviews the related works. Section 3 explains the recommendation models, which consist of the music emotion representation model, the emotion state representation model, and the deep neural network framework. Section 4 demonstrates the experiments and results of comparing emoMR with baseline algorithms. Section 5 concludes the whole paper.

Related Works
Compared to popular recommendation models currently used in music recommendations, emotion-aware music recommenders offer higher quality recommendations by introducing an additional preference of emotions, since music itself has the ability to regulate human emotions [33]. The striking ability of music to induce human emotions has invited researchers to retrieve emotional information from the music and exploit emotion-related data from users for music recommendations. Novel approaches have been forged to utilize emotion-related data, such as exploiting user interactions to make emotion-aware music recommendations [34]. In this section, we discuss several works on improving music recommendations by exploiting emotion-related information.
The first factor in exploiting emotional information in music recommendations is the representation of emotions. The basic idea behind music emotion representation is using acoustical clues to predict music emotions. Acoustical clues, or low-level audio features of music, can be used to predict human feelings [35]. Barthet et al. proposed an automated music emotion recognition model using both frequency-domain and temporal acoustic traits of music to aid context-based music recommendations [36]. Several other models, such as the emotion-triggering low-level feature model [37], musical texture and expressivity features [38], and the acoustic-visual emotion Gaussians model [39], have been proposed for music recommendations with emotion information. Other studies leveraged the fruits of music information retrieval [40] and combined low-level audio features with music metadata such as lyrics [41] and genre [42] to represent music emotions. While acoustical clues are widely used in music emotion recognition, new types of music are invented every year. Combining both the low-level audio features and the descriptive metadata is therefore more promising for music emotion representation.
The second factor in exploiting emotional information in music recommendations is the representation of the user's emotion state. Basically, there are two kinds of approaches to representing the user's emotion state: One is to obtain physiological emotion metrics directly from the users, and the other is to predict the user emotion state from the activity data generated by the user. Although recommenders or personalized systems can directly ask their users for their current emotion state, it is hard for a user to describe their emotional status. Therefore, studies have used biophysical signals or metrics (such as electromyography and skin conductance [43]) to recognize user emotions. Some studies exploit the technologies used in cognitive neuroscience to represent user emotion states through electroencephalography (EEG) [44,45] in music-related activities. Ayata et al. even built an emotion-based music recommendation system by tracking user emotion states with wearable physiological sensors [17]. The problem with the direct approach is that it is expensive to obtain the biophysical metrics data from the users, and existing biophysical signal acquisition equipment might interfere with the user experience. The other approach, however, exploits the user activity data and involves no expensive data acquisition sessions, which is regarded as more promising. Activities like the operations and interactions on music systems can be used to represent user emotion states [46,47]. Activities on social media are a fruitful source of user emotion state representations. Rosa et al. used a lexicon-based sentiment metric with a correction factor based on the user profile to build an enhanced sentiment metric representing user emotion states in music recommendation systems [48]. Deng et al. modeled user emotions in music recommendations with user activity data crawled from the Chinese microblogging service Weibo [49].
A recent study used implicit user feedback data from social networks to build an emotion-aware music recommender based on hybrid information fusion [50]. Another study used an affective coherence model to build an emotion-aware music recommender, utilizing data from social media to compute an affective coherence score and predict the user's emotion state [27]. Among the studies that utilized activity data from social media, sentiment analysis is the dominant technology used to extract and build the emotion state representations.
The third factor in exploiting emotion information in music recommendations is the recommendation model itself. Since music is a special kind of recommendation item, the models used for emotion-aware music recommendations vary according to the specific research scenarios. Lu et al. used a content-based model to predict music emotion tags in music recommendations [51]. Deng et al. used a collaborative filtering model in music recommendations after representing the user emotion state through data from Weibo [49]. Kim et al. employed a tag-based recommendation model to recommend music after semantically generating emotion tags [52]. Han et al. employed a context-based model after classifying music emotion in music recommendations [53]. Besides these models, hybrid models that combine the advantages of multiple recommendation models have also been applied in emotion-aware music recommendations, especially with the help of deep learning [54]. New information processing technologies such as big data analysis [55] and machine learning have accelerated music recommendations in many ways [56]. Therefore, compared to the other models, hybrid approaches can achieve better performance. Deep learning technology helps improve not only the recommendation models [57], but also the emotion recognition (representation) [58].
To sum up, although conventional studies exploit both the acoustical clues and the metadata of music, few have investigated the approach of representing music emotions in a continuous manner. Existing studies have not proposed a way to represent users' emotion states at time points when no user activity data exists. Conventional studies also tend to ignore the influences posed by events on human emotions. All of these call for further investigation.

Model Construction
Recommending music regarding the emotional preference of the users calls for exploiting emotional information, which entails identifying the emotional characteristics of the music and recognizing the emotional preference of the user under a certain emotional state. In this section, we first propose a music emotion representation model, then an emotion state representation model. Based on the two models, we describe the hybrid approach taken by our recommendation process. The general architecture of the proposed recommendation method (emoMR) is illustrated in Figure 1. It is composed of 3 layers from the bottom up: the data layer, the model layer, and the application layer. The data layer provides data for the model layer, and the application layer generates the recommendation lists upon the model layer. The details of each layer are described as follows:

1. The data layer. This layer offers 5 kinds of data for further processing. The user portrait data contains information inherited from conventional recommendation systems, such as the user profile from the registration system. The social media data contains the users' activity data from social media, which is used as implicit user feedback in our recommendation method. The music acoustic data contains the audio signal data of the music. The music metadata contains descriptive data of the music, such as the genre, artist, lyrics, etc. The event data contains public opinion data on certain events.

2. The model layer. In the model layer, the music emotion representation model and the emotion state representation model generate the music emotion representation and emotion state representation using data from the data layer. The models are trained to predict music emotions for the music and the music emotion preferences for users with a deep neural network (emoDNN). The hybrid recommendation model, combining the content-based and collaborative filtering recommendation approaches, uses the data generated by the trained models to make recommendations.

3. The application layer. In this layer, the proposed method generates music recommendation lists for users.
As an emotional stimulus, music can be quite complex to classify into a certain emotion category [42]. One piece of music may sound like Joy to one person but Sad to another, which happens ubiquitously; perhaps a distressing event had just happened to the latter. The human cognitive process of emotion is typically complex, and a person may be in several emotional states simultaneously under a certain situation [59]. For example, one might be in a complex emotional state of both Joy and Fear while playing horror video games. To express the music emotion characteristics and human emotion states quantitatively, we model emotion using the valence-arousal model [32], which quantifies emotions by expressing them as points in a two-dimensional plane. The model is illustrated in Figure 2. The plane's horizontal axis is valence, ranging from unpleasant to pleasant, and its vertical axis is arousal, which indicates the activation or energy level of the emotion. Therefore, this paper proposes a music emotion representation model and an emotion state representation model based on the emotion valence-arousal model.
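To make the valence-arousal representation concrete, a minimal Python sketch follows; the coordinate values are illustrative assumptions, not values from this paper.

```python
from dataclasses import dataclass

@dataclass
class VAEmotion:
    """An emotion expressed as a point on the valence-arousal plane."""
    valence: float  # unpleasant (-1.0) .. pleasant (+1.0)
    arousal: float  # calm (-1.0) .. activated (+1.0)

# Illustrative coordinates only -- the exact positions are assumptions:
joy = VAEmotion(valence=0.8, arousal=0.6)        # pleasant, energetic
sadness = VAEmotion(valence=-0.7, arousal=-0.5)  # unpleasant, low energy
fear = VAEmotion(valence=-0.6, arousal=0.7)      # unpleasant, high energy

# A mixed state (e.g., Joy + Fear while playing a horror game) can be
# expressed as an intermediate point rather than a discrete label:
mixed = VAEmotion(
    valence=(joy.valence + fear.valence) / 2,
    arousal=(joy.arousal + fear.arousal) / 2,
)
```

This continuous form is what allows the model to express blends that a discrete six-emotion scheme cannot.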

The Music Emotion Representation Model
The emotion of a piece of music is the emotional response of the person who listens to it. However, the actual emotional response is hard to obtain. Thus, our music emotion representation model takes the approach of generating quantitative representations of the music emotion by utilizing the features of the music to predict the emotional response. Unlike other studies, which tend to classify music emotion, we represent it as a tuple E_i = <s_valence, s_arousal>, in which s_valence indicates the emotion valence score, s_arousal indicates the emotion arousal score, and E_i indicates the emotion of music i.
For a piece of music, 2 sets of features are exploited: 1. Low-level audio features, which are extracted from the audio data; 2. Music descriptive metadata, such as genre, year, artist, lyrics, etc.

Low-Level Audio Features Extraction
Studies have shown that the low-level audio features of pitch, Zero-Crossing Rate (ZCR), Log-Energy (LE), Teager Energy Operator (TEO), and Mel-Frequency Cepstral Coefficients (MFCC) can determine the emotional state of music audio signals [60][61][62][63]. The extraction methods for these features are illustrated as follows:

1. Pitch: Pitch extraction calculates the distances between the peaks of a given segment of the music audio signal. Let Sig_i denote the audio segment, k denote the pitch period of a peak, and Len_i denote the window length of the segment; the pitch feature can be obtained using Equation (1).

2. ZCR: The Zero-Crossing Rate describes the rate of sign changes of the signal during a signal frame. It counts the times the signal changes across positive and negative values. The definition of ZCR is shown in Equation (2), where Len_w is the length of the signal windows and s_f(·) is the sign function defined in Equation (3).

3. LE: This feature estimates the energy of the amplitude of the audio signal. The calculation can be formulated as Equation (4).

4. TEO: This feature also links to the energy of the audio signal, but from a nonlinear perspective. The TEO of a signal segment can be calculated using Equation (5).

5. MFCC: The MFCC is derived from a mel-scale frequency filter-bank. The calculation of MFCC can be obtained by first segmenting audio signals into Len_w frames and then applying a Hamming Window (HW), defined by Equation (6), to each frame, where Len_s is the number of samples in a given frame. For an input signal Sig_i, the output signal O_i is defined as Equation (7). The output signal is then converted into the frequency domain by the Fast Fourier Transform (FFT). A weighted sum of spectral filter components is then calculated with triangular filters. The Mel spectrum is then obtained by Equation (8).

These features reveal emotions due to their connections to the polarity and energy of the emotions. Pitch and ZCR show great discrimination between pleasant and unpleasant emotions, Log-Energy and TEO measure the energy of different emotions, and MFCC compacts signal energy into its coefficients. Together, the features describe the music emotion from both the valence and the arousal perspective.
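As a sketch of how such features might be computed, the following NumPy functions implement textbook forms of ZCR, Log-Energy, the Teager Energy Operator, and the Hamming-window step of MFCC on a single signal frame. These are standard formulations and may not match Equations (2)-(7) term for term; the small floor inside `log_energy` is an added numerical safeguard, not part of the original definition.

```python
import numpy as np

def zero_crossing_rate(frame: np.ndarray) -> float:
    """Fraction of adjacent-sample pairs whose signs differ (cf. Equation (2))."""
    signs = np.sign(frame)
    signs[signs == 0] = 1  # treat exact zeros as positive
    return float(np.mean(signs[:-1] != signs[1:]))

def log_energy(frame: np.ndarray) -> float:
    """Log of the summed squared amplitude (cf. Equation (4))."""
    return float(np.log(np.sum(frame ** 2) + 1e-12))

def teager_energy(frame: np.ndarray) -> np.ndarray:
    """Teager Energy Operator psi[n] = x[n]^2 - x[n-1]*x[n+1] (cf. Equation (5))."""
    return frame[1:-1] ** 2 - frame[:-2] * frame[2:]

def mfcc_window(frame: np.ndarray) -> np.ndarray:
    """Apply a Hamming window (cf. Equation (6)) before the FFT step of MFCC."""
    return frame * np.hamming(len(frame))
```

In practice, a feature-extraction library such as librosa would supply production-grade versions of these routines; the sketch only illustrates the per-frame arithmetic.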

Music Metadata Exploitation
Aside from the low-level audio features, music also has important metadata that provides additional information about the music. Unfortunately, in the common practice of the music industry, there is no unified metadata structure standard. Therefore, we refined the metadata structure in [64] and applied it to the music emotion representation model. We eliminated the notation, attention-metadata, and usage fields from the structure because they are within the user domain and hard to obtain when training the model. The metadata of rights and ownership, publishing, production, record-info, carrier, and website were also eliminated due to their irrelevance to music emotion. After this refinement, 6 classes of metadata are kept in our structure; they are listed in Table 1. The metadata is then represented by Meta = {M_i | i = 1, 2, ..., 6}. The labels and categorical metadata are converted into numerical representations. Unlike the other metadata, the lyrics (M_5) contain blocks of long text instead of labels. To leverage the texts in the lyrics and represent M_5 as labels, the lyrics texts are processed by standard Natural Language Processing (NLP) technologies, such as stop-word removal, tokenization, and Term Frequency-Inverse Document Frequency (TF-IDF) score generation. The emotion of the lyrics can then be obtained by sentiment analysis. As described in [27], sentiment analysis generates scores for the 6 Ekman basic emotions EE = {Joy, Anger, Sadness, Surprise, Fear, Disgust}. Let S(·) denote the score for a particular basic emotion; then S_t = {S_t(emo) | emo ∈ EE} is the emotion vector of item t. To fit the emotion vector of the lyrics into our valence-arousal based model, that is, M_5 = <Valence, Arousal>, it is converted as follows:

1. Normalization: Min-max normalization is used to convert the 6 elements in S_t to values within the range [0, 1]. Let S̄_i(emo) denote the emo score of the ith lyrics after normalization, as described by Equation (9).

2. Calculate valence and arousal: The emotions E_po = {Joy, Anger, Surprise} are treated as positive, while E_ne = {Sadness, Fear, Disgust} are treated as negative. The valence and arousal values of s_i can be obtained by Equation (10).

The music emotion representation model predicts the emotional response, which is a <Valence, Arousal> vector, from the music feature vector. The music feature vector V_music is composed of the low-level audio features of the music and the metadata of the music. Posts and comments on a piece of music reflect the affection and emotional response of users to the music [50]. Label data for training the model can be obtained by exploiting the implicit user feedback from social network posts and comments using text sentiment analysis and converting it into <Valence, Arousal> vectors with the method described above.

The Deep Neural Network for the Music Emotion Representation Model
The works by Tallapally et al. [65] and Zarzour et al. [66] provide an approach to using a deep neural network to predict user ratings, scores, and preferences. The main idea of building the music emotion representation model with a deep neural network is to use the music feature vector V_music to predict the emotional response it brings about. The response, which indicates the user's emotion-related preference, can be quantified by the <Valence, Arousal> vector. We proposed emoDNN to predict the vector; the general architecture is illustrated in Figure 3. The architecture uses two deep neural networks to predict the Valence and the Arousal respectively. Each network consists of an input layer, hidden layers, and an output layer. The input layer takes the vector V_music, within which the features are converted to numerical representations as mentioned above. The hidden layers can be customized to investigate the feature-emotion interactions. The output layer is the predicted Valence or Arousal. The total number of nodes in the hidden layers is set by Equation (11) to balance the accuracy and training speed of emoDNN [67], where N_hidden is the number of nodes in the hidden layers, N_input and N_output are the numbers of nodes in the input and output layers respectively, and τ is a constant in [1, 10]. The loss function is set to the Mean Squared Error (MSE) shown in Equation (12), where E(x_i) is the ith actual output, y_i is the ith expected output, w is the weight, and b is the bias. To handle over-fitting and vanishing gradients, the activation function is set to the Rectified Linear Unit (ReLU), as shown in Equation (13).
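A minimal NumPy sketch of the emoDNN building blocks follows. The `sqrt(N_input + N_output) + τ` form of Equation (11) is an assumption based on a common hidden-layer sizing heuristic, since the equation body is not reproduced here; ReLU and MSE are the standard definitions referenced by Equations (12) and (13).

```python
import numpy as np

def hidden_nodes(n_input: int, n_output: int, tau: int = 5) -> int:
    """Hidden-layer size rule of thumb (one common form of Equation (11)):
    N_hidden = sqrt(N_input + N_output) + tau, with tau in [1, 10]."""
    return int(round(np.sqrt(n_input + n_output))) + tau

def relu(x: np.ndarray) -> np.ndarray:
    """ReLU activation (Equation (13))."""
    return np.maximum(0.0, x)

def mse(predicted, expected) -> float:
    """Mean Squared Error loss (Equation (12))."""
    return float(np.mean((np.asarray(predicted) - np.asarray(expected)) ** 2))

def forward(v_music: np.ndarray, weights: list, biases: list) -> np.ndarray:
    """Forward pass of one emoDNN branch (predicts Valence or Arousal).

    ReLU on every hidden layer, linear output for the regression score."""
    h = v_music
    for w, b in zip(weights[:-1], biases[:-1]):
        h = relu(h @ w + b)
    return h @ weights[-1] + biases[-1]
```

A real implementation would use a deep learning framework with backpropagation; the sketch only fixes the shapes and loss of the two regression branches.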

The Emotion State Representation Model
The emotion state representation model aims to identify the music emotion preference of the user in a certain emotion state at arbitrary time points. Recent emotion-aware recommendation studies use discrete user-generated content from social networks to identify user emotion [26] or transform emotion into context-awareness [8]. These studies fail to represent a continuous user emotion state and are thus unable to represent the user emotion state at an arbitrary time point. For example, if a user only interacts with the system and social media in the morning, traditional methods cannot represent the emotion state of the user at 15:00, because no user feedback data is available at 15:00.
To represent the emotion state of the users at arbitrary time points, an emotion generation model was built based on the cognitive theory of emotion [68] and the cognitive appraisal theory of emotion [69]. The theories suggest that music stimulates the human cognitive process and evokes emotion, a cognitive process that consists of several spontaneous components happening simultaneously. The model is used to generate emotions artificially in a continuous manner [70]. We utilized the model to generate the emotion state of the user at arbitrary time points. A simplified version of the model is depicted in Figure 4. We defined the 4 modules in the model according to our need to generate the emotion state at arbitrary time points: 1. The environment module utilizes the information in the environment that affects user emotion. The information creates a context that can be identified from social networks and public opinion. This module influences the emotion evaluation of the user and contributes to the exogenous factors of the emotion state; 2. The experience module is controlled by the characteristics of the user and contributes to the endogenous factors of the emotion state; 3. The evaluation module synthesizes the output of the exogenous and endogenous factors and generates the emotion state at arbitrary time points by evaluation; 4. The emotion module identifies the music emotion preference under the emotion state generated by the evaluation module.

The Exogenous Factors of the Emotion State
Information from the environment that affects the user emotion has long been investigated by studies that raise context as an influential factor [8]. The information acts as a kind of social stimulus and influences user emotion. The influence has been proven by cognitive attention studies [71] and neuroscience [72]. Among the many types of data, implicit user feedback information such as ratings, reviews, and clicks was introduced to address the influences of environmental information on user emotion [50]. Based on the social network data used by Polignano et al. [27], we propose three types of data as the exogenous factors of the emotion state: events, posts, and comments from social networks. The sentiments of these texts help distinguish user emotion states.
Let T_estat be the time point for estimating the user emotion state, and let TH_range be the time range. The events, posts, and comments from the social network (e.g., Weibo, Facebook, Twitter) within the range are collected, processed by sentiment analysis, and converted into valence-arousal vectors, as described in Section 3.1.2. In particular, the public opinions on events can be collected from the trends of the social networks (e.g., Weibo top trends). The exogenous factors at T_estat (EXF_T_estat) consist of one two-element vector sen_type = <Valence, Arousal> per data type, holding the valence and arousal scores of the texts of that type. Since trends on social networks tend to have a longer duration than individual social network activities, the event information neglected in other studies can help decide the user emotion state where traditional social stimuli, such as the implicit user feedback of ratings, reviews, and clicks, fail to be present.
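The collection step can be sketched as follows. The tuple layout of the items and the mean aggregation per data type are assumptions, since the text only specifies that each type yields a <Valence, Arousal> vector within the time range.

```python
import numpy as np

def exogenous_factors(items, t_estat, th_range):
    """Aggregate valence-arousal vectors of events, posts, and comments
    that fall within [t_estat - th_range, t_estat].

    `items` is a list of (timestamp, item_type, valence, arousal) tuples
    already produced by sentiment analysis; the three-type grouping
    mirrors EXF_T_estat in the text. Mean aggregation per type is an
    assumption -- the paper's exact combination rule is not shown."""
    exf = {}
    for item_type in ("event", "post", "comment"):
        vas = [(v, a) for (ts, t, v, a) in items
               if t == item_type and t_estat - th_range <= ts <= t_estat]
        exf[item_type] = tuple(np.mean(vas, axis=0)) if vas else (0.0, 0.0)
    return exf
```

The "event" entries would typically come from trend data, which persists longer than individual posts and therefore keeps this vector populated even when a user has no recent activity of their own.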

The Endogenous Factors of the Emotion State
Endogenous factors relate to the inherent mechanism of human emotions. Studies have shown that emotions are regulated by the bio-physical or bio-chemical mechanisms of the human body [73,74]. The performances of these mechanisms vary over time and can be described by the circadian rhythm [75]. The circadian rhythm acts as an internal timing system that has the power of regulating human emotion [76]. Many of human beings' physiological indicators vary according to the circadian rhythm, such as hormones, blood sugar, body temperature, etc. [77]. This causes oscillation in human emotion. Figure 5 shows the mRNA and protein abundance oscillation across a certain period of circadian time (in hours) obtained by the study of Olde Scheper et al. [78].
The circadian rhythm, an endogenous cycle of the human biological process that recurs in approximately 24-hr intervals, can be treated as a smooth rhythm with added noise. It can be modeled with data from known periods (24 hr each). A study proposed a cosine fit of the circadian rhythm curve [79]. The cosine model is shown in Equation (14):

CR_i = MESOR + A cos(2πt_i/λ + ϕ) + σ_i, (14)

where CR_i is the value of the circadian rhythm model at time i, MESOR is the Midline Estimating Statistic of Rhythm [80], A is the amplitude of the oscillation, t_i is the time point (sampling time), λ is the period, ϕ is the acrophase, and σ_i is an error term.
We use the cosine model to represent the endogenous factors of the user emotion state. Its parameters can be obtained from the users' extracted emotional information data, as proposed by Qian et al. [50], across a 24-hr period. Thus, the endogenous factors of the emotion state can be modeled as:
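As a minimal illustration of the cosinor form in Equation (14) (the parameter values below are invented for demonstration, and the error term σ is omitted from this deterministic sketch), the endogenous factor at a sampling time can be evaluated directly:

```python
import math

def circadian_value(t, mesor, amplitude, period=24.0, acrophase=0.0):
    """Cosinor model: CR(t) = MESOR + A * cos(2*pi*t/period + acrophase).
    The error term sigma_i is omitted in this deterministic sketch."""
    return mesor + amplitude * math.cos(2 * math.pi * t / period + acrophase)

# Hypothetical parameters: baseline (MESOR) 0.5, amplitude 0.2, 24-hr
# period, acrophase 0 so the peak falls at t = 0.
print(circadian_value(0.0, mesor=0.5, amplitude=0.2))   # peak, near 0.7
print(circadian_value(12.0, mesor=0.5, amplitude=0.2))  # trough, near 0.3
```

In practice the four parameters would be estimated by least-squares fitting against a user's 24-hr emotional information data rather than set by hand.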

The Emotion State Representation
In the above artificial emotion generation model, the evaluation module represents the emotion state by synthesizing the exogenous and endogenous factors of the emotion state. However, user properties such as age and gender influence not only users' perception and appraisal of exogenous factors [81] but also their endogenous factors [82]. Thus, to represent the emotion state of a user at a certain time, we combine the exogenous and endogenous factors with the user properties UP_user used in conventional music recommendation studies [83]. The emotion state representation is shown as:
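A simple way to picture this combination (the field layout below is a hypothetical example, not the paper's exact encoding) is to concatenate the three parts into a single input vector V_estat:

```python
# Illustrative sketch: build the emotion state input vector V_estat by
# concatenating exogenous factors (EXF), endogenous factors (ENF), and
# user properties (UP_user). The field layout is a hypothetical example.
def emotion_state_vector(exf, enf, up):
    """exf: flat list of valence-arousal scores, enf: circadian rhythm
    value(s), up: encoded user properties (e.g., age, gender one-hot)."""
    return list(exf) + list(enf) + list(up)

v_estat = emotion_state_vector(
    exf=[-0.7, 0.5, 0.1, 0.05],  # event + post <valence, arousal> pairs
    enf=[0.62],                  # circadian rhythm value at T_estat
    up=[0.31, 1.0, 0.0],         # normalized age, gender one-hot
)
print(len(v_estat))  # 8
```

This vector is what the emotion state representation emoDNN would consume as its input, as described in the next subsection.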

The Music Emotion Preference Identification Using Deep Neural Network
Identifying the music emotion preference of a user at a certain time is the main purpose of constructing the emotion state representation model. We use the same deep neural network structure (emoDNN) as in Section 3.1.3 to learn the model. Section 3.1 proposed a music emotion representation model in which the emotion of a piece of music is represented by a valence-arousal vector. The label values for training the emotion state representation model are obtained from the emotions (valence-arousal vectors generated by the music emotion representation model) of the music liked (posted, reposted, liked, or positively commented on) by the user at T_estat.
The nodes in the hidden layers are tuned to fit the input vector V_estat. The loss function and activation function are the same as those for the music emotion representation model. However, as the endogenous factors introduce a cosine function, the model cannot be treated as linear. Thus, we update the model and add a higher-order polynomial to fit it, as shown in Equation (15). The higher-order polynomial uses the parameters of the cosine fit model of the endogenous factors to bring the oscillations of the user's circadian rhythm into the model.
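As a hedged sketch of this idea (Equation (15)'s exact form is not reproduced here; the polynomial degree and the way the cosine term enters are our own assumptions for illustration), a raw feature can be augmented with higher-order polynomial terms plus the circadian cosine term:

```python
import math

def augment_features(x, t, degree=3, period=24.0, acrophase=0.0):
    """Augment a raw feature value x with higher-order polynomial terms
    and the circadian cosine term at sampling time t. The degree and the
    way the cosine term enters are illustrative assumptions; Equation (15)
    in the paper defines the exact form."""
    poly = [x ** d for d in range(1, degree + 1)]          # x, x^2, x^3
    circ = math.cos(2 * math.pi * t / period + acrophase)  # circadian term
    return poly + [circ]

print(augment_features(2.0, t=0.0))  # [2.0, 4.0, 8.0, 1.0]
```

The augmented vector lets a model with otherwise linear structure capture the nonlinearity introduced by the circadian oscillation.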

The Recommendation Process
The emotion-aware music recommendation method proposed in this paper (emoMR) consists of the following steps: music emotion representation generation, user music emotion preference calculation, similarity calculation, and recommendation list generation. The method is a hybrid that incorporates both a content-based method and a collaborative filtering method. The process is depicted in Figure 6.
Figure 6. The recommendation process of emoMR. (The process branches on whether existing music emotion preference information is available: the content-based branch calculates music emotion representations on music items and gets music of similar music emotion, while the collaborative filtering branch calculates the music emotion preference for the current user emotion state and gets music preferred by users with similar music emotion preferences; both branches feed into similarity calculation and recommendation list generation.)
When a user User_i enters the recommendation process at time T_estat, the method first decides whether the music emotion preference of the user exists in the system. The process can then be described as:
1. If the user's music emotion preference information exists in the system, the music emotion representation is calculated using the music emotion representation model. The emoDNN for the music emotion representation model can be trained and its parameters saved for future use. The emotion representations for the music can also be stored to accelerate future recommendation cycles.
2. If the user's music emotion preference information does not exist in the system, the music emotion preference for the user's current emotion state is calculated using the emotion state representation model. The parameters for the trained emotion state representation emoDNN and the calculated user music emotion preference can be stored to accelerate future recommendation cycles.
3. The content-based method is used to get a list of music whose music emotion is similar to the music emotion preference at the given time T_estat. The music emotion vector is added to the music feature vector used by existing content-based music recommenders [84]. The similarities of the music to the emotion-preferred music are calculated and ranked to generate a list of music.
4. The collaborative filtering method is used to get the music with the music emotion preferred by users of similar music emotion preference at the given time T_estat. The music emotion preference vector is added to the user's feature vector. The similarities between User_i and other users are calculated and the users ranked accordingly. The preferred music lists of the users in the similar-user list are ranked by the music emotion preferences of the users, and the Top-K music items are taken from these lists.
5. The generated music list is used as the recommendation list.
The similarities in the process are calculated by cosine similarity [85], depicted in Equation (16):

sim(a, b) = (a · b) / (‖a‖ ‖b‖). (16)

Music emotion preference is a significant factor affecting music recommendations; however, conventional music preferences are also significant. Thus, we combined the two by adding the music emotion vectors to the existing feature vectors of the music and adding the music emotion preference vectors to the existing feature vectors of the users. Therefore, the results generated by our method take advantage of both conventional preferences and music emotion preferences.
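As an illustrative sketch of this similarity-based ranking (the feature layouts and the toy catalog below are our own assumptions, not the paper's data), cosine similarity over emotion-augmented feature vectors can be computed and used to rank candidates:

```python
import math

def cosine_similarity(a, b):
    """Equation (16): sim(a, b) = a.b / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def rank_by_similarity(query_vec, items, top_k=10):
    """items: {item_id: feature vector with the <valence, arousal> emotion
    vector appended}. Returns the top_k item ids most similar to query_vec."""
    scored = sorted(items.items(),
                    key=lambda kv: cosine_similarity(query_vec, kv[1]),
                    reverse=True)
    return [item_id for item_id, _ in scored[:top_k]]

# Toy catalog: conventional features plus appended emotion scores.
catalog = {
    "song_a": [0.9, 0.1, 0.8, 0.6],   # ..., valence, arousal
    "song_b": [0.1, 0.9, -0.5, 0.2],
    "song_c": [0.8, 0.2, 0.7, 0.5],
}
print(rank_by_similarity([1.0, 0.0, 0.8, 0.6], catalog, top_k=2))
# ['song_a', 'song_c']
```

The same routine serves both branches: item-to-item similarity for the content-based method, and user-to-user similarity for the collaborative filtering method.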

Experiments
In this section, the performance of the proposed emotion-aware music recommendation method emoMR is evaluated experimentally and compared against alternative approaches. We introduce the dataset used in the experiments, the experiment design, the model training, and the evaluation. We implemented a toy system to control the experiment procedures and collect metrics data.

Datasets
According to the design of our method, the following data need to be fed to our method to train the models and generate recommendations: the audio files of the music, the metadata of the music, the activity data of the users within at least a 24-hr period, and social media data including posts, comments, and events. Existing datasets such as Netflix, GrooveShark, and Last.fm [86] contain no emotion-related features. To train the models proposed in our work and test the performance of emoMR against other approaches, we used a hybrid dataset consisting of:
1. Data from the myPersonality [27] dataset. This dataset comes with information about 4 million Facebook users and 22 million posts over their timelines. By filtering this dataset, we extracted 109 users who posted music links and 509 pieces of music posted in the form of music links. This dataset provides not only user tags but also music information and implicit user feedback information to aid the training of the music emotion representation model and the emotion state representation model.
2. Data acquired from a music social platform. The platform is a Chinese music streaming service (NetEase Cloud Music, https://music.163.com (accessed on 12 July 2021)) with an embedded community for users to share their lives with music. We scraped the music chart with a web crawler (https://github.com/MiChongGET/CloudMusicApi (accessed on 12 July 2021)) and got 500 high-ranking songs. The metadata and audio files of these 500 songs and of the 509 in the myPersonality dataset were acquired. The music in the myPersonality dataset consisted mostly of relatively old English songs, while the 500 high-ranking songs were mostly Chinese; as a result, the two datasets have only 38 songs in common. We also searched the community and acquired 200 users related to the 971 songs and 116,079 user activities, including user likes, posts, reposts, and comments.
3. Data acquired from the social network (Weibo, https://weibo.com (accessed on 13 July 2021)). We acquired 105,376 event-related records from Weibo, posted within the time window of the user activities of the music social platform data. We use this data to feed event-related public opinion information into emoMR.
The ratio of the data for training against testing was around 3:1. The text information of the music metadata and user activities was processed with text sentiment analysis services to obtain the emotion representations and construct the label data. The English contents were processed by the IBM Tone Analyzer (https://www.ibm.com/watson/services/tone-analyzer (accessed on 17 July 2021)), and the Chinese contents by Baidu AI (https://ai.baidu.com/tech/nlp/emotion_detection (accessed on 17 July 2021)). Notably, the GDPR regulation [87] has placed restrictions on the exploitation of user data in order to limit illegal use. The third-party dataset and the datasets acquired from the music streaming and social network services are guaranteed to be anonymized by the corresponding APIs of the services. The datasets will be used for our research purposes only and will not be used for commercial purposes.

Model Training
To train the music emotion representation model and the emotion state representation model, we use 70% of the training data to train the emoDNNs and 30% to test the trained models. We referred to a novel music recommendation system using deep learning [88] for the hyper-parameter settings and the training process. The hyper-parameters for the two emoDNNs are listed in Tables 2 and 3. The number of hidden layers of the two emoDNNs was set to 5 according to Equation (11) and the experience derived from Liu et al. [67].

Table 2. Hyper-parameters for the music emotion representation model emoDNN.

Parameter       Setting
Training epoch  300
Batch           20
Optimizer       Adam
Learn rate      0.05

Table 3. Hyper-parameters for the emotion state representation model emoDNN.

Parameter       Setting
Training epoch  500
Batch           10
Optimizer       Adam
Learn rate      0.01

The training process uses the above hyper-parameter settings to learn the weights for the two models. As shown in Figure 7a, the loss function values for the training and validation sets plunge as the epoch reaches 50. When the epochs reach 300, the loss function values drop to 0.017 for training and 0.007 for validation. Figure 7b shows the accuracy changes over the epochs: at epoch 300, the accuracy reaches 0.938 for training and 0.964 for validation. The output accuracy on the testing set is 97.46%.
As shown in Figure 8a, the loss function values for the training and validation sets plunge as the epoch reaches 50. When the epoch reaches 300, the loss function values drop to 0.0027 for training and 0.0029 for validation. Figure 8b shows the accuracy changes over the epochs: at epoch 300, the accuracy reaches 0.871 for training and 0.813 for validation. The output accuracy on the testing set is 88.81%.
The trained models' accuracies are 97.46% for the music emotion representation model and 88.81% for the emotion state representation model. These accuracies are sufficient for generating representations and predicting emotion preferences for music in the testing dataset. Due to the nonlinear nature of the emotion state representation model, introduced by the human circadian rhythm and stochastically occurring events, its accuracy is noticeably lower than that of the music emotion representation model. This leaves room for future improvement.

Baseline Algorithms and Metrics
The main purpose of the experiment is to compare the proposed emoMR with alternative approaches. For the comparison, we selected four baseline algorithms: two that do not take emotion information into account and two that also utilize emotion information.
CB and SCBCF are the two approaches that do not take emotion information into account. They are item similarity-based algorithms that compute similarity scores in a straightforward manner and are widely used in music recommendations.
EARS and EMERS are the two algorithms that also utilize emotion information. They demonstrate not only the ability to use item similarities but also the ability to use social network information and hybrid approaches. The two algorithms incorporate social network data to exploit the implicit user feedback in music recommendations and are emotion-aware.
To test the performance of emoMR against the baseline algorithms, we leveraged metrics commonly used in music recommendation performance comparisons [8]: Precision, Recall, F1, and HitRate [27], listed in Equations (17)-(20). Precision is the fraction of recommended music pieces that are relevant; Recall is the fraction of relevant music pieces that are recommended; F1 is the harmonic mean of Precision and Recall; HitRate is the fraction of hits:

Precision = |M_rec ∩ M_pick| / |M_rec| (17)
Recall = |M_rec ∩ M_pick| / |M_pick| (18)
F1 = 2 · Precision · Recall / (Precision + Recall) (19)
HitRate = N_hits / TOP-N (20)

where M_rec is the music recommended in the recommendation list and M_pick is the music actually picked by the user. If a piece of music contained in a recommendation list for User_i is picked, it is a hit; N_hits is the number of hits, and TOP-N is the number of Top-N recommendations.
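A minimal implementation of the set-based metrics (the example lists of song ids are invented; Equations (17)-(19) are implemented directly, while HitRate is omitted since it is counted per recommendation list) can be sketched as:

```python
def precision_recall_f1(recommended, picked):
    """Precision = |rec ∩ picked| / |rec|; Recall = |rec ∩ picked| / |picked|;
    F1 is their harmonic mean (Equations (17)-(19))."""
    rec, rel = set(recommended), set(picked)
    hits = len(rec & rel)
    precision = hits / len(rec) if rec else 0.0
    recall = hits / len(rel) if rel else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical example: 5 recommended songs, of which 2 were picked
# by the user out of 4 actually picked songs.
p, r, f1 = precision_recall_f1(
    recommended=["s1", "s2", "s3", "s4", "s5"],
    picked=["s2", "s5", "s9", "s10"],
)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.4 0.5 0.444
```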

Results
To assess the performance of our emoMR, we compared the Precision, Recall, F1, and HitRate of emoMR against those of the baseline algorithms. As the recommendation task was set to Top-N recommendations to suit the output of the above methods, the N in Top-N has a significant influence on the performance of the methods. Therefore, the comparison was first made on a top-10 recommendation task, which is common practice among the recommendation systems of various music streaming services. Then, the performance variations were compared by varying N from 5 to 20. Finally, to test the event-related emotion utilization ability of emoMR, we compared the performances of the methods on two top-10 recommendation tasks using data within the time ranges of two major events on Chinese social media platforms.

Performance Comparison in the Top-10 Recommendation Task
We compared the performance of emoMR against the baseline algorithms CB, SCBCF, EARS, and EMERS on the 4 metrics in a top-10 recommendation task. The results are shown in Figure 9.
The results show that emoMR outperforms the other baseline algorithms in all 4 metrics. The performance of the plain Content-Based algorithm (CB) is the worst. The SCBCF algorithm outperforms the CB algorithm due to its ability to exploit social network data and the hybrid approach it incorporates. EARS, EMERS, and emoMR exploit not only social network data but also emotion-related information. However, EARS cannot utilize the low-level audio features, and EMERS takes a content-based-like approach; thus, they fuse less information than emoMR.
The results suggest that, compared to the second-highest performing algorithm EMERS, emoMR improved Precision by 4.06%, 15
Figure 9. The performance comparison of the methods in the top-10 recommendation task.

Performance Comparison in the Top-N Recommendation Tasks
The performance of the five methods was also evaluated in Top-N tasks where N varies from 5 to 20 to reveal the influence of N on the performance. The results are shown in Figure 10.
The results show that as N varies from 5 to 20, the Precision of the five methods decreases, whilst Recall, F1, and HitRate increase. As the length of the recommendation list increases, precision drops, which means the tail items of the recommendation list contribute less to the quality of the recommendation. The precision of the emotion-aware algorithms also exceeds that of the non-emotion-aware algorithms. The increases in Recall and F1 suggest that lengthening the recommendation list increases the advantages of emoMR over the other four algorithms. However, the HitRates suggest that EMERS outperforms emoMR on top-5 recommendations and that top-10 recommendations show the greatest advantage of emoMR over EMERS in HitRate.
The comparisons under the top-5 recommendation task suggest the following results: the Precision of emoMR is 3.53%, 12.50%, 18.78%, and 90.24% higher than those of EMERS, EARS, SCBCF, and CB, respectively. The F1 of emoMR is 0.61%, 13.70%, 19.39%, and 101.76% higher than those of EMERS, EARS, SCBCF, and CB, respectively. The Recall of emoMR has no significant advantage over EMERS in this situation, while it is 13.95%, 19.51%, and 104.16% higher than those of EARS, SCBCF, and CB, respectively. The HitRate of emoMR is 2.53% lower than that of EMERS, while it is 71.11%, 140.62%, and 305.26% higher than those of EARS, SCBCF, and CB, respectively. Besides, emoMR regains its advantage over EMERS in HitRate as the length of the recommendation list increases, and emoMR keeps leading the performance as the N in Top-N goes beyond 5.

Performance Comparison during Event-Related Time Ranges

To test the event-related emotion utilization ability of emoMR, we compared the methods on two top-10 recommendation tasks during two major events: the death of Yuan Longping (EVT1) and the COVID-19 cases in Zhejiang (9-12 June 2021). We grabbed the data from Weibo trends during the events' time ranges. The performance results are shown in Figure 11.
The results show that emoMR significantly outperforms the baseline algorithms in Precision, Recall, F1, and HitRate when recommending within the event time ranges. The advantages of emoMR over the other algorithms are even larger than in the event-unaware top-10 recommendation comparison.
EVT1 caused a generally sad atmosphere on the Chinese social media platform Weibo as the breaking news of Yuan Longping's death went viral.
Figure 11. The performance comparison of the methods in recommending top-10 items during event-related time ranges.

Discussions
The results suggest that emoMR outperforms all the other baseline algorithms in the top-10 recommendation task. As studies suggest that fusing more information helps improve the recommendation performance of Top-N tasks [16], we believe this is why emoMR outperforms the other algorithms. The significant performance gaps between emotion-aware methods (emoMR, EMERS, and EARS) and emotion-unaware methods (SCBCF and CB) also indicate that the ability to utilize emotion-related information gives recommendation methods an advantage in music recommendations.
Our method maintains its advantage over the other baseline algorithms as the length of the recommendation list expands. Although the HitRate of EMERS exceeded that of emoMR in top-5 recommendations, emoMR regains its advantage in HitRate from top-10 through top-20 recommendations. We believe the advantages of emoMR are gained by allowing more emotion variations to show up in the recommendation list. The music of similar music emotions and the users of similar music emotion preferences are ranked by the corresponding similarities; a longer recommendation list means a wider range of similarities, which includes more emotion variations.
The advantages of emoMR in recommending top-10 items during event-related time ranges mark its unique ability to utilize event information in emotion-aware music recommendations. As studies suggest that the brain mechanisms mediating musical emotion recognition also deal with the analysis and evaluation of the emotional content of complex social signals [18], social events are able to trigger the emotion cognition process. The events bring emotion-related information through public opinion. The sentiments of event-related public opinion are a critical part of the exogenous factors of the emotion state representation model. The results suggest that emoMR is able to utilize public opinion information during major events to refine the emotion state representation by offering more information to the exogenous factors.

Conclusions
The development of personalized services on the Internet has called for new technologies to improve service quality. To improve the recommendation quality of music recommenders, researchers have started introducing information that had not previously been exploited. This paper proposed an emotion-aware hybrid music recommendation method using deep learning (emoMR) to introduce emotion information into music recommendations. Experiment results show that emoMR performs better in the metrics of Precision, Recall, F1, and HitRate than the other baseline algorithms in the top-10 recommendation task. Meanwhile, emoMR keeps its advantage over the other baseline algorithms as the length of the recommendation list increases. We also tested the performance of emoMR on two major events (the death of Yuan Longping and the COVID-19 cases in Zhejiang). Results show that emoMR takes advantage of the event information and outperforms the other baseline algorithms. Our work contributes to the development of recommenders that include novel aspects of information and can be applied to a wide range of fields [91] to promote user satisfaction. Our innovative contributions in this paper are as follows:
1. The proposed method predicts music emotions in a continuous form, enabling a more precise and flexible representation of music emotions than traditional discrete representations. By modeling the music emotion representation as a <Valence, Arousal> vector based on the emotion valence-arousal model using low-level audio features and music metadata, we build a deep neural network (emoDNN) to predict the emotion representations for the music items. Compared to the discrete representations of music emotions in other studies, our model predicts the music emotion in a continuous form of valence and arousal scores.
2. The proposed method predicts users' music emotion preferences whenever needed, while traditional methods can only generate predictions at time points when user feedback data are present. By modeling the users' emotion states using an artificial emotion generation model with endogenous factors generated by the human circadian rhythm model and exogenous factors consisting of events and implicit user feedback, we use the emoDNN to predict the users' music emotion preferences under the emotion states represented by the model. Benefiting from the continuity of the human circadian rhythm, the model is able to predict continuously across time regardless of the absence of user feedback data at a given time point.
3. The proposed method can utilize event information to refine the emotion generation, which provides a more accurate emotion state description than traditional artificial emotion generation models. With the introduction of events in the exogenous factors of the user emotion state representation model, emoMR is able to express the influence of events (especially major social events) on user emotions.
4. The proposed method employs a hybrid approach combining the advantages of both content-based and collaborative filtering approaches to generate the recommendations. Theoretically, our findings contribute to the theory of emotion recognition by reflecting the cognitive theory of emotion [68] and the cognitive appraisal theory of emotion [69] with social signals.
We acknowledge some limitations in our study that allow opportunities for future research: (1) Although the accuracy of the emotion state representation model is acceptable, there is still room for improvement compared to the accuracy of the music emotion representation model. Future work can benefit from improving the accuracy of the emotion state representation model. (2) Limited by the research paradigm and the designed model, our method does not directly output the emotion of a person at a given time point. Future work can benefit from designing an intermediate output of emotion prediction to aid the recommendation process and validate the accuracy of the recommendation. (3) Limited by the approaches of the study and the dataset used, our study did not take users with minority personal traits, such as bipolar traits, into consideration. Future work can benefit from taking humanistic care of minorities. (4) The proposed method utilized the human circadian rhythm model to generate emotion state representations, which requires at least 24 hr of user activity data. Future work can benefit from exploiting models that require less data.
Author Contributions: S.W. and C.X. built the framework of the whole paper. S.W. and Z.T. carried out the collection and preprocessing of the data. S.W. designed the whole experiment and method. Z.T. implemented the experiment. C.X. and A.S.D. provided analytical and experimental tools. S.W. and C.X. wrote the manuscript. C.X., A.S.D. and Z.T. revised the manuscript. All authors read and approved the final manuscript.