Style Transformation Method of Stage Background Images by Emotion Words of Lyrics

Abstract: Recently, with the development of computer technology, deep learning has expanded to the field of art, which requires creativity, a uniquely human ability, and an understanding of the human emotions expressed in art so that they can be processed as data. The field of art is integrating with various industrial fields; in stage art, for example, artificial intelligence (AI) is being used to create visual images. Because it is difficult for a computer to process the emotions expressed in songs as data, existing stage background images for song performances are designed by humans. Recently, research has been conducted to enable AI to design stage background images on behalf of humans. However, there is no research on reflecting the emotions contained in song lyrics in stage background images. This paper proposes a style transformation method to reflect emotions in stage background images. First, multiple verses and choruses are derived from the song lyrics, one at a time, and the emotion words included in each verse and chorus are extracted. Second, the probability distribution of the emotion words is calculated for each verse and chorus, and the image with the most similar probability distribution is selected, for each verse and chorus, from an image dataset tagged with emotion words in advance. Finally, for each verse and chorus, the stage background image with the transferred style is output. In an experiment, the similarity between the stage background image and the image transferred to the style of an image with a similar emotion word probability distribution was 38%, whereas the similarity between the stage background image and the image transferred to the style of an image with a completely different emotion word probability distribution was 8%. The proposed method reduced the total variation loss from 1.0777 to 0.1597. The total variation loss is the weighted sum of the content loss and the style loss. This shows that the style-transferred image stays close to the edge information of the content of the input image, and that its style is close to that of the target style image.


Introduction
Advances in computer technology have led to technological innovations such as information revolution, big data processing, and active use of networks. These innovations have increased interest in AI [1,2]. In recent years, AI has been researched in the field of art, which requires creativity, an inherent ability of humans. The field of art has been integrated with various industrial fields, such as AI, which is used in stage art in combination with stage effects. When a singer dances and sings, the audience views the singer's stage performance in combination with stage effects. Stage effects determine the stage mood using several important elements, such as lighting, music, acting, and stage background, which visually convey emotions associated with the songs to the audience. Among stage effects, stage background has been transitioning from the expression method of using props to the expression method of media performance that uses images through large light-emitting diode (LED) screens or projectors [3]. In general, background stage images used in media performances are selected in advance by professional stage designers at the planning stage. Stage background images of the singing performance represent the emotions expressed in song lyrics. It is difficult for computers to represent emotions that humans have been manually designing for stage background images. Recently, research was conducted to enable AI to design stage background images in place of humans. In this approach, a stage background image recommendation system is used to automatically compose stage background images according to dance styles without professional stage designers. However, the limitation of the stage background images selected through conventional recommendation systems is that the emotions to be represented in the song lyrics are not reflected in the stage performance. 
Ideally, the emotions expressed in song lyrics would be represented through stage background images during stage performances. Research on reflecting the emotions contained in song lyrics in a stage background is scarce; however, it is possible to draw on research that transforms background images according to their meaning or purpose by synthesizing them with text or images containing the meaning to be represented. Some research partially transforms images using the content contained in text [4][5][6], and other research transfers the style of an image, such as its color, line, and texture, to another image [7][8][9][10][11]. The existing stage background image recommendation system recommends images for dancers, but it does not take the characteristics of the song lyrics into account.
This paper proposes a method to transform the multiple styles of stage background images based on the emotion words contained in each verse and chorus of song lyrics. First, the lyrics selected by the user are divided into sentences. Multiple verses and choruses are derived from the lyrics, one at a time, and compared with the emotion word dictionary to extract the emotion words included in each verse and chorus. Second, the probability distribution of the emotion words is calculated for each verse and chorus, and the image with the most similar probability distribution is selected, for each verse and chorus, from an image dataset tagged with emotion words in advance. Finally, the stage background image with the transferred style is output for each verse and chorus. The advantages of the proposed method are as follows.

• It uses emotion words contained in song lyrics to transform the style of stage background images. Audience immersion can be increased by using stage background images to represent emotions expressed in the song lyrics used for singing in stage performances.
• Emotions that are complex to represent using computers can be represented.
• Certain emotions that are difficult for humans to determine intuitively can be represented because the proposed method can transform the style of images based on an image with a high correlation with the emotion represented using lyrics.
The remainder of this paper is organized as follows. Section 2 introduces the stage background recommendation system, methods of extracting the visual features of emotion words, and image style transformation methods reported in related work, and examines their limitations and technical constraints. Section 3 describes the method proposed to derive emotion words contained in the verses and choruses of song lyrics to reflect the emotions in stage background images and apply the style that is directly related to the derived emotion words to the stage background images. Section 4 verifies whether the stage background images are transformed according to the probability distribution of emotion words represented for each verse and chorus. Section 5 summarizes the findings and describes the limitations of this research and future research directions.

Related Work
In this section, we introduce the existing stage background recommendation system, methods of extracting the visual features of emotion words and image style transformation methods. Song lyric-based style transformation methods are compared, and their limitations and technical constraints are examined.

Stage Background Image Recommendation System
A stage background image recommendation system recommends stage background images by reflecting the dancer's preferences and dance styles such as ballet, belly dance, street dance, modern dance, tango, and waltz [3]. Dancers choose familiar or favorite stage background images. Therefore, the stage background images can be artistic images or actual photographs that the dancers prefer. Reference [3] proposed a model that predicts users' preferred images through social media. The proposed model predicts the K number of images that the dancer (user) is most likely to use as stage background images via three procedures. First, the features of the images shared by the dancer on social media (Pinterest) are extracted. Second, the profile of the dancer is learned based on the features of the shared images. Third, the interest level of the dancer in each candidate image is predicted, and the candidate images are ranked according to the dancer's predicted interest level. However, because only dances are reflected, the stage background images from the stage background recommendation system cannot represent the emotions that a stage performance aims to express through lyrics.

Emotion Classification
To express emotions using images that are difficult to process using computers, research was conducted recently to improve the accuracy of the sentimental understanding of human emotions [12][13][14][15][16][17][18].
Human emotions are visualized and used in psychotherapy, image search, etc. In general, models for representing emotions are divided into two types: categorical emotion states (CES) models, which classify emotions into several basic categories such as fear, amusement, and sadness, and dimensional emotion space (DES) models, which use a three-dimensional emotion space such as arousal, time, and harmony. As it is difficult to construct a multidimensional emotion space using the information about time included in song lyrics, we used the CES method, which assigns images to basic emotion categories. The CES method is easy for users to understand and convenient for the emotion classification of images. The research in [12] used the CES method to extract principles-of-art-based emotion features (PAEF), which classify the emotion features included in images, to understand the relationship between artistic principles and emotions. PAEF are a combination of representation features derived from the principles of balance, emphasis, harmony, variety, gradation, and movement. PAEF are used to classify the basic emotion words evoked in humans through images. A psychological study classified common basic emotion words into eight categories based on responses to images measured through facial electromyography, heart rate, finger temperature, etc. That is, the emotions contained in images are classified into eight categories, which define anger, disgust, fear, and sadness as negative emotions and amusement, awe, contentment, and excitement as positive emotions [13]. These are called images of emotional levels, whereby an image of emotional level refers to the relationship between the style, such as color, saturation, brightness, and contrast, and the emotional effect derived from art theory [17]. The level of the basic emotion words defined in the eight categories is classified for images. 
To evaluate this, the participants looked at the images, selected the most appropriate basic emotion category, and evaluated the emotional labels of the images. However, because it is not possible to visualize the features of images classified with the level of basic emotion words, there is no way of knowing the images that are appropriate for the stage background. Therefore, it is necessary to derive a method of visualizing the features of each basic emotion to find its relationship with the song lyrics and reflect them in the stage background images.

Style Transfer
Style is transferred to reflect the features of each basic emotion word in the stage background images. Usually, style transfer is used to transfer the image style. Style transfer consists of content image and style image. Content image refers to an image that has information such as an object or a common landscape that people can usually recognize, and style image refers to an image that has information such as color or texture that will be combined with the content image. Style transfer transfers the style based on a convolutional neural network (CNN) [10,11] and a generative adversarial network (GAN) [5][6][7]. Style transfer based on the CNN model extracts features by separating content and style in an image. Training is performed to extract content features from deep layers and extract style features from middle layers through the CNN model. The GAN model is used to change the content in detail, but in this research, the CNN model is used because it changes the overall image style. Table 1 presents the difference between the existing methods of transforming image style and the proposed method. The research in Zhao et al. [12] investigates the concept of the principle of art and its effect on emotion and classifies emotion images into eight basic emotion words. However, because many images are classified for each basic emotion word, it is difficult to find an appropriate image for the stage background image.

Method of Transferring Image Style Based on Song Lyrics
This section presents the proposed method for transferring stage background images using the emotion words contained in song lyrics. The proposed method consists of a lyrics preprocessing stage, which extracts the probability distribution of emotion words for each verse and chorus from the selected lyrics, and an emotion image processing stage, which transfers the styles of the images related to the extracted probability distributions for each verse and chorus. The number of output images with representative emotion image styles applied is equal to the sum of the number of verses and choruses in the selected lyrics. Figure 1 gives an overview of the proposed method, and Table 2 describes the two stages. In the lyrics preprocessing stage, the lyrics selected by a user are separated into verses and choruses, and the probability distribution of the emotion words contained in each verse and chorus is extracted separately. In the emotion image processing stage, emotion images with tags, where the tags are matched to the corresponding emotion images in advance, are selected according to the emotion words extracted from each verse and chorus, and the stage background image is transferred to the style of the selected emotion image for each verse and chorus.

Table 2. Description of the stages of the proposed method.
Stage | Description
Lyric preprocessing | The lyrics selected by the user are divided into verses and choruses, and a probability distribution of emotion words is extracted for each verse and chorus.
Emotion image processing | From the emotion images with tags, the appropriate images are selected for each verse and chorus, and the styles of the selected images are transferred to the stage background image.

Step 1: Lyric Preprocessing
The lyric preprocessing step is composed of a sentence divider, a verse/chorus extractor, and an emotion word probability distribution extractor. The sentence divider divides the lyrics into sentences. The verse/chorus extractor separates the selected lyrics into verses and choruses. The emotion word probability distribution extractor extracts the probability distribution of the basic emotion words contained in each verse and chorus. Figure 2 is an overview of step 1. The sentence divider divides the lyrics into a set of sentences by considering capital letters. The set of sentences in the lyrics is defined as the unprocessed set $L_{U_i}$. All sentences in $L_{U_i}$ are processed into a set $L_{T_i}$, which is classified into verses and choruses through the classification process. This is repeated until no sentences are left in $L_{U_i}$ and all sentences in $L_{T_i}$ have been processed. The verse/chorus extractor executes the following process. The lyrics selected by the user consist of $n$ verses and $m$ choruses. The $i$th sentence input to $L_{T_i}$ is compared with the sentences in $L_{U_i}$, and the frequency with which it is repeated is checked. The set of sentences with no repetition in the lyrics is classified as verse $L_{V_n} = \{l_{V_n 1}, l_{V_n 2}, \ldots, l_{V_n i}\}$, and the set of sentences that are repeated in the lyrics is classified as chorus $L_{C_m}$. The emotion word probability distribution extractor compares $L_{V_n}$ and $L_{C_m}$ with Emolex (an emotion dictionary) [19] and finds the matching emotion words. Emolex consists of a total of 14,182 words classified into the basic emotion words, as shown in Figure 3, and also indicates whether each word is a positive or a negative emotion. All basic emotion words are expressed by eight emotions, categorized into the positive emotions of anticipation, joy, surprise, and trust and the negative emotions of anger, disgust, fear, and sadness, as proposed by Plutchik [20], to provide a high-dimensional emotion lexicon [19]. 
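The verse/chorus extraction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: each non-empty line of the lyrics is treated as one sentence (the paper splits on capital letters), repetition is detected by exact case-insensitive matching, and the function names `split_sentences` and `extract_verses_and_choruses` are hypothetical.

```python
from collections import Counter
from itertools import groupby


def split_sentences(lyrics: str) -> list[str]:
    """Split lyrics into sentences; assumption: one non-empty line is one sentence."""
    return [line.strip() for line in lyrics.splitlines() if line.strip()]


def extract_verses_and_choruses(sentences: list[str]) -> list[tuple[str, list[str]]]:
    """Label a sentence 'chorus' if it occurs more than once in the lyrics,
    otherwise 'verse', then merge consecutive sentences that share a label."""
    freq = Counter(s.lower() for s in sentences)
    labelled = [("chorus" if freq[s.lower()] > 1 else "verse", s) for s in sentences]
    return [(label, [s for _, s in group])
            for label, group in groupby(labelled, key=lambda pair: pair[0])]


if __name__ == "__main__":
    demo = "A one\nA two\nHey hey\nHo ho\nB one\nHey hey\nHo ho\n"
    for label, lines in extract_verses_and_choruses(split_sentences(demo)):
        print(label, lines)
```

One consequence of this simple grouping is that two choruses sung back to back would be merged into a single segment; a fuller implementation would need an additional rule to separate them.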
The extracted emotion words in each verse and chorus are replaced with the basic emotion words into which they are classified. The occurrences of each basic emotion word are counted: $b_1$ is the number of anticipation words, $b_2$ that of joy, $b_3$ that of trust, $b_4$ that of surprise, $b_5$ that of anger, $b_6$ that of fear, $b_7$ that of sadness, and $b_8$ that of disgust. The probability distribution of the basic emotion words is then calculated. The set that counts the eight basic emotion words contained in $L_{T_i}$ is defined as $B$. The counts of the eight basic emotion words included in the $n$th verse $L_{V_n}$ are stored in $B_{V_n}$, and those included in the $m$th chorus $L_{C_m}$ are stored in $B_{C_m}$. The probability distributions of the basic emotion words included in the verses and choruses are defined as $U_V = \mathrm{softmax}(B_{V_n})$ and $U_C = \mathrm{softmax}(B_{C_m})$, and are calculated as in Equation (1).
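As a hedged illustration of the counting and softmax steps, the sketch below uses the standard softmax definition, $U_k = e^{b_k} / \sum_j e^{b_j}$. The lexicon `TOY_LEXICON` is a hypothetical stand-in and does not reproduce real Emolex entries; `count_basic_emotions` and `softmax` are illustrative names.

```python
import math
import re

# Order follows b_1 ... b_8 in the text.
BASIC_EMOTIONS = ["anticipation", "joy", "trust", "surprise",
                  "anger", "fear", "sadness", "disgust"]

# Hypothetical miniature lexicon for illustration only (not real Emolex data).
TOY_LEXICON = {
    "hero": ["joy", "trust"],
    "fade": ["sadness"],
    "hope": ["anticipation", "joy"],
}


def count_basic_emotions(sentences, lexicon):
    """Return the count vector B = [b_1, ..., b_8] for one verse or chorus."""
    counts = [0] * len(BASIC_EMOTIONS)
    for sentence in sentences:
        for word in re.findall(r"[a-z']+", sentence.lower()):
            for emotion in lexicon.get(word, []):
                counts[BASIC_EMOTIONS.index(emotion)] += 1
    return counts


def softmax(counts):
    """Numerically stable softmax over a list of counts."""
    peak = max(counts)
    exps = [math.exp(c - peak) for c in counts]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    b = count_basic_emotions(["My hero will fade", "I hope"], TOY_LEXICON)
    print(b)
    print(softmax(b))
```

Note that softmax assigns a non-zero probability even to emotions whose count is zero, so the resulting distribution is smoother than a simple normalization of the counts.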

Step 2: Emotion Image Processing
The emotion image processing step is composed of a representative emotion images selector and a representative emotion images style transfer. The representative emotion images selector searches the emotion images with tags and selects the images whose probability distributions are similar to $U_V$ and $U_C$. The representative emotion images style transfer transfers the stage background image into the styles of the selected emotion images with tags. Figure 4 shows an overview of step 2. The representative emotion images selector selects the emotion images with tags whose distributions are similar to the probability distributions $U_V$ and $U_C$ obtained for each verse and chorus in the lyric preprocessing step. In total, 1000 emotion images with tags were downloaded from Flickr and defined as $I_i$. $P_{I_i}$ is defined as the set that counts the eight basic emotion words contained in $I_i$; that is, $P_{I_i}$ holds the number of times the $i$th image was classified into each basic emotion word through peer evaluation. The probability distribution of the basic emotion words included in the $i$th image is $Y_i$ and is calculated as in Equation (2).
Finally, to select the emotion images with tags associated with the lyrics selected by the user, the representative emotion images selector finds $\hat{Y}_i$, the image distribution most similar to each distribution $U$ in $U_V$ and $U_C$. The selected index $i$ of $\hat{Y}_i$ is the one at which the difference between the probability distributions is minimal, as defined in Equation (3).
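The selection step can be sketched as a nearest-neighbor search over the tagged image distributions. The text does not state which difference measure Equation (3) uses, so the L1 distance below is an assumption, and `select_representative_image` is a hypothetical name.

```python
def l1_distance(p, q):
    """Sum of absolute differences between two distributions (assumed measure)."""
    return sum(abs(a - b) for a, b in zip(p, q))


def select_representative_image(u, image_distributions):
    """Return the index i whose distribution Y_i is closest to the lyric distribution U."""
    return min(range(len(image_distributions)),
               key=lambda i: l1_distance(u, image_distributions[i]))


if __name__ == "__main__":
    u_verse = [0.30, 0.30, 0.10, 0.05, 0.05, 0.05, 0.10, 0.05]
    tagged = [
        [0.125] * 8,
        [0.25, 0.35, 0.10, 0.05, 0.05, 0.05, 0.10, 0.05],
        [0.05, 0.05, 0.05, 0.05, 0.30, 0.30, 0.10, 0.10],
    ]
    print(select_representative_image(u_verse, tagged))
```

Running the selector once per verse and chorus yields the $(n + m)$ style images that are passed to the style transfer step.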
The representative emotion images style transfer transfers the styles of the $(n + m)$ emotion images with tags derived by the representative emotion images selector to the stage background image through a CNN model-based style transfer algorithm. Style refers to features such as color, saturation, brightness, contrast, stroke, and texture. Figure 5 shows the method of outputting an image with a transferred style through a CNN that extracts the content features of the stage background image and a CNN that extracts the style features of the emotion images with tags. The CNN model uses the VGG-16 network with normalized weights, and average pooling was used instead of max pooling. Style features are based on the Gram matrix, which ignores spatial information and captures features such as texture and color. Since the correlations of the feature maps of multiple layers, not a single layer, are considered at the same time, static information, rather than the global layout information of the image, is obtained at multiple scales. Style transfer [9][10][11] is applied in the representative emotion images style transfer as follows. There are three types of input images: the stage background image related to the selected lyrics is defined as the content image $I_C$, the image $I_i$ selected by the representative emotion images selector is defined as the style image $I_S$, and a noise image is defined as $I_N$. A noise image is a random variation of brightness or color information in an image. This paper synthesizes the content of $I_C$ and the style of $I_S$ on $I_N$. The CNN model is composed of a total of five blocks, $B_{1,\ldots,5}$, and each block $B$ consists of two convolution layers and one pooling layer. After each block, content features and style features are extracted. When the input is $I_C$, the output of the second block is defined as the content features, and when the input is $I_S$, the output of each block is defined as the style features. 
Content features should contain the location information of the objects included in $I_C$ and the edge information of those objects, and the style features are the correlations of the feature maps. The content features of $I_C$ and the style features of $I_S$ jointly minimize the distance from the features of $I_N$. The content loss $L_C$ is calculated by comparing the content features of $I_C$ with those of $I_N$, as in Equation (4).
$I_C$ is fed forward through the network. The style loss $L_S$ is calculated by comparing the style features of $I_S$ with those of $I_N$, as in Equation (5).
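The two losses can be illustrated with a small pure-Python sketch. It follows the widely used Gram-matrix formulation of CNN-based style transfer; the exact normalization constants of Equations (4) and (5) are not shown in this text, so the factors below are assumptions. A feature map is represented as a list of channels, each a flat list of activations, and all function names are illustrative.

```python
def gram_matrix(features):
    """Channel-by-channel inner products; spatial arrangement is discarded."""
    return [[sum(a * b for a, b in zip(fi, fj)) for fj in features] for fi in features]


def content_loss(content_features, generated_features):
    """Half the squared difference between two feature maps (assumed scaling)."""
    return 0.5 * sum((g - c) ** 2
                     for cf, gf in zip(content_features, generated_features)
                     for c, g in zip(cf, gf))


def style_layer_loss(style_features, generated_features):
    """Squared difference between Gram matrices, scaled by 1 / (4 C^2 N^2) (assumed)."""
    channels, positions = len(style_features), len(style_features[0])
    gs, gg = gram_matrix(style_features), gram_matrix(generated_features)
    squared = sum((a - b) ** 2 for rs, rg in zip(gs, gg) for a, b in zip(rs, rg))
    return squared / (4.0 * channels ** 2 * positions ** 2)


def style_loss(style_layers, generated_layers, weights=None):
    """Weighted sum of per-block style losses; equal weights unless given."""
    if weights is None:
        weights = [1.0 / len(style_layers)] * len(style_layers)
    return sum(w * style_layer_loss(s, g)
               for w, s, g in zip(weights, style_layers, generated_layers))


if __name__ == "__main__":
    original = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]
    mirrored = [[3.0, 2.0, 1.0], [0.0, 1.0, 0.0]]  # same values, positions reversed
    print(content_loss(original, mirrored), style_loss([original], [mirrored]))
```

The example in the `__main__` block shows why the Gram matrix suits style: reversing the spatial positions changes the content loss but leaves the style loss at zero, because the channel correlations are unchanged.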
The pixel-level information disappears as the layers deepen, but the semantic information of $I_C$ remains the same. Style features should be independent of spatial features. Low-level convolution layers represent low-level features such as edges and maintain a higher resolution. The deeper the layer, the more difficult it is to visualize and interpret features such as edges because they are not directly connected to $I_C$. High-level convolution layers capture semantic and less granular spatial information. Style features capture information that considers the image globally at multiple magnifications. However, artifacts occur while transferring $I_C$ into styles. This implies that $I_N$, which is output through the model, loses content information, including the edge information of the objects in $I_C$. The deformation error of the image with the transferred style should therefore be minimized. The edge information of $I_C$ is recovered through the Sobel edge detector [21]. The Sobel edge detector is used to reduce the generation of artifacts without losing content features, and then $\alpha$ and $\beta$ are calculated. $\alpha$ controls the preservation of the content image, and $\beta$ controls the preservation of the style image. The detector detects the content features of $I_C$, making the content features of the style-transferred image $I_T$ even stronger. The content feature difference between $I_C$ and $I_T$ is defined as $L_V$, and optimization is performed. In 2D images, Sobel edge detection is performed in two directions, vertical and horizontal. The total variation loss $L_T$ is calculated based on $\alpha$, $\beta$, $L_C$, and $L_S$, as in Equation (6). The noise image $I_N$ is updated through back propagation based on the total variation loss $L_T$.
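A minimal sketch of two-direction Sobel edge detection and of the weighted total loss is given below. The Sobel kernels are the standard 3x3 ones. The way the edge term $L_V$ is measured (mean squared difference of edge maps) and the form $L_T = \alpha L_C + \beta L_S$ are assumptions based on the description in the text, since Equation (6) is not reproduced here; the function names are hypothetical.

```python
import math

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]   # responds to horizontal intensity changes
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]   # responds to vertical intensity changes


def sobel_edges(image):
    """Gradient magnitude of a 2D grayscale image (list of rows), without padding."""
    height, width = len(image), len(image[0])
    edges = []
    for y in range(1, height - 1):
        row = []
        for x in range(1, width - 1):
            gx = sum(SOBEL_X[j][i] * image[y - 1 + j][x - 1 + i]
                     for j in range(3) for i in range(3))
            gy = sum(SOBEL_Y[j][i] * image[y - 1 + j][x - 1 + i]
                     for j in range(3) for i in range(3))
            row.append(math.hypot(gx, gy))
        edges.append(row)
    return edges


def edge_difference(content_image, transferred_image):
    """Assumed form of L_V: mean squared difference between the two edge maps."""
    ec, et = sobel_edges(content_image), sobel_edges(transferred_image)
    diffs = [(a - b) ** 2 for rc, rt in zip(ec, et) for a, b in zip(rc, rt)]
    return sum(diffs) / len(diffs)


def total_loss(alpha, beta, l_content, l_style):
    """Assumed form of Equation (6): weighted sum of content and style losses."""
    return alpha * l_content + beta * l_style


if __name__ == "__main__":
    step = [[0, 0, 1, 1] for _ in range(4)]   # image with one vertical edge
    flat = [[0, 0, 0, 0] for _ in range(4)]   # image with no edges
    print(sobel_edges(step), edge_difference(step, flat))
```

A larger $\alpha$ relative to $\beta$ keeps the output closer to the content image, while a larger $\beta$ favors the style image.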
Optimization proceeds with the limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) algorithm [22] to find the minimum of $L_T$.

Experiments
The probability distribution accuracy verification experiment and the style variation quality verification experiment were performed. It is important that the style of the stage background image is well transformed according to the distribution of emotion words included in the lyrics. This paper verified whether the styles of the stage background image change according to the probability distribution of emotion words included in each verse and chorus and verified the CNN-based style transfer performance.

Dataset and Experimental Environment
The datasets used to verify the proposed method are the NRC Word-Emotion Association Lexicon (Emolex) and images from Unsplash. The Emolex dataset is a list of English words and their associations with eight basic emotion words (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) and two sentiments (negative and positive). The annotations were performed manually. It includes 6475 English words, and 281 English words were used in the experiment. Table 3 presents the number and distribution of emotion words for each English word to facilitate the use of the Emolex dataset in this experiment. Unsplash is a high-quality open-access image dataset that can be used for further research on machine learning, image quality and search engines. We downloaded 1000 abstract images from Unsplash, and seven colleagues classified them into anger, disgust, fear, sadness, amusement, awe, content, and excitement. Table 3 presents the classified results.
Amusement is a compound emotion of anger and joy, awe of fear and surprise, content of joy and trust, and excitement of surprise and joy. The lyrics selected by the user for the experiment were those of "Forgotten heroes", shown in Figure 6, which serve as the input to the experiment. The experimental environment consisted of Windows 10, an Intel i7-7700 CPU, an Nvidia Titan RTX graphics card with 24 GB of memory, and 40 GB of DDR4 RAM. The proposed system was developed using Python, and the CNN model was implemented using the TensorFlow deep learning library. Figure 7 shows the result of extracting verses and choruses from the lyrics in Figure 6 using the verse/chorus extractor. The lyrics in Figure 6 consist of 44 sentences, and each word in each sentence was compared with all the words in the entire set of sentences. A total of 12 consecutive sentences that were repeated twice were extracted as the chorus, and 20 non-repeated sentences were extracted as verses. Since all 44 sentences must be compared with the emotion words, the sentences are split into individual words.
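The compound-emotion decomposition stated at the start of this paragraph suggests how the colleagues' category votes for an image could be turned into counts over the eight basic emotion words. The text does not specify the aggregation, so the rule below (each vote adds one count to every basic emotion of its category) is an assumption, and the names are illustrative.

```python
from collections import Counter

BASIC_EMOTIONS = ["anticipation", "joy", "trust", "surprise",
                  "anger", "fear", "sadness", "disgust"]

# Decomposition of the eight image categories, exactly as given in the text.
CATEGORY_TO_BASIC = {
    "anger": ["anger"],
    "disgust": ["disgust"],
    "fear": ["fear"],
    "sadness": ["sadness"],
    "amusement": ["anger", "joy"],
    "awe": ["fear", "surprise"],
    "content": ["joy", "trust"],
    "excitement": ["surprise", "joy"],
}


def basic_emotion_counts(votes):
    """Turn a list of category votes for one image into basic emotion counts."""
    tally = Counter(emotion for vote in votes for emotion in CATEGORY_TO_BASIC[vote])
    return [tally[emotion] for emotion in BASIC_EMOTIONS]


if __name__ == "__main__":
    # Seven hypothetical votes for a single image.
    print(basic_emotion_counts(["awe"] * 4 + ["content"] * 2 + ["sadness"]))
```

The resulting count vector plays the role of $P_{I_i}$ and can be converted into a probability distribution in the same way as the lyric counts.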

Experiment Results
Using the emotion word probability distribution extractor, we compared the emotion words in the lyrics with those in Emolex, as shown in Figure 8. When the words matched, the words in the lyrics were replaced with the basic emotion words. As shown in Figure 8, a total of seven emotion words (alarm, watch, youth, lately, pill, different, and show) were matched in Verse 1; a total of eight emotion words (build, right, stand, couch, hero, old, forgot, and bold) in the chorus; a total of five emotion words (hero, old, smile, fade, and visit) in Verse 2; and a total of two emotion words (know and hope) in Verse 3. Including duplicates, the emotion words were matched a total of 199 times in the song lyrics, and the matched emotion words were replaced with the Emolex-based basic emotion words. The extractor counts the total number of replaced basic emotion words and calculates the probability distribution of each basic emotion word for each verse and chorus. The representative emotion image selector compares the probability distributions of the emotion words extracted in step 1 with the probability distributions of the emotion image tags and selects an image for each verse and chorus, as shown in Figure 9. For each verse and chorus, the most similar probability distribution is searched for in the dataset of emotion images with tags. The probability distribution of Verse 1 is most similar to that of (A), and the probability distribution of Chorus 1 is most similar to that of (B). However, the disadvantage of this style transfer method is that many high-frequency artifacts occur. Because the image is two-dimensional, the Sobel edge detector extracts the edge features of the content in the horizontal and vertical directions and thereby strengthens the edge features of the content. 
In Figure 10a,b, the edge features of the content are extracted, and the content edge information is well maintained. Figure 10c,d show the high-frequency composition of the image to which the style is applied, but the content edge information is lost as the style is transferred. Figure 11a,b maintain the content features even when the style is transferred because the edge features are strengthened through the Sobel edge detector. The outputs of the representative emotion image style transfer are shown in Table 4; optimization was performed by minimizing the style loss and the content loss. Figure 12 shows that the total variation loss is minimized from 1.0777 to 0.1597. Table 5 shows the result of comparing the histogram distributions using the compareHist function. The styles of images (a) and (b), which have emotion word probability distributions similar to that of the lyrics, and of image (c), which has a different emotion word probability distribution, were transferred to the stage background image, and the similarity of the resulting images was compared. When the pixel distributions of two images are similar, the similarity of the images is high, and vice versa. The compareHist function allows the comparison of image features such as contrast, color distribution, and brightness. Images (a), (b), and (c) are each compared with the target image. HISTCMP_CORREL (correlation) [23] is a correlation computed over the histogram bins of the two images and is calculated as in Equation (7). The closer the value is to 1, the more similar the images are. $H$ is a histogram, and $N$ is the total number of histogram bins. HISTCMP_CHISQR (chi-squared distribution) [23] measures the spread between the distributions of pixel values. It is calculated as in Equation (8), and the closer it is to 0, the more similar the images are.
HISTCMP_INTERSECT (intersection) [23] computes the similarity of two discrete probability distributions as their intersection, as in Equation (9); for normalized histograms the value lies between 0 and 1. The closer to 1, the more similar the images are.
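For reference, the comparison measures can be written out in a few lines of pure Python. The sketch below follows the formulas documented for OpenCV's `compareHist` methods and also includes the Bhattacharyya distance that is discussed next. It operates on plain lists of bin counts and is only an illustration of the measures, not the code used in the experiment; Equations (7) to (10) of the paper are not reproduced in this text.

```python
import math


def correlation(h1, h2):
    """HISTCMP_CORREL: 1 means identical shape, -1 means opposite."""
    m1, m2 = sum(h1) / len(h1), sum(h2) / len(h2)
    num = sum((a - m1) * (b - m2) for a, b in zip(h1, h2))
    den = math.sqrt(sum((a - m1) ** 2 for a in h1) * sum((b - m2) ** 2 for b in h2))
    return num / den if den else 1.0


def chi_square(h1, h2):
    """HISTCMP_CHISQR: 0 means identical; bins where h1 is zero are skipped."""
    return sum((a - b) ** 2 / a for a, b in zip(h1, h2) if a != 0)


def intersection(h1, h2):
    """HISTCMP_INTERSECT: sum of bin-wise minima; in [0, 1] for normalized input."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def bhattacharyya(h1, h2):
    """HISTCMP_BHATTACHARYYA: 0 means full overlap, 1 means no overlap."""
    overlap = sum(math.sqrt(a * b) for a, b in zip(h1, h2))
    scale = math.sqrt(sum(h1) * sum(h2))
    return math.sqrt(max(0.0, 1.0 - overlap / scale)) if scale else 1.0


def normalize(hist):
    """Scale a histogram so that its bins sum to 1."""
    total = sum(hist)
    return [value / total for value in hist]


if __name__ == "__main__":
    a, b = normalize([1, 2, 3, 4]), normalize([4, 3, 2, 1])
    print(correlation(a, b), chi_square(a, b), intersection(a, b), bhattacharyya(a, b))
```

Because the measures have different ranges and directions (higher is better for correlation and intersection, lower is better for chi-squared and Bhattacharyya), they are best read side by side rather than combined into one score.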
HISTCMP_BHATTACHARYYA [24] calculates the degree of overlap of two probability distributions, as in Equation (10). The closer to 0, the more similar the images are. Table 6 shows the histograms of the RGB distribution, hue, and value. The horizontal axis of each graph represents the color tone from 0 to 255, with the left side representing the dark area and the right side representing the bright area. The vertical axis represents the total number of pixels at each tone, that is, the number of pixels in the image for each of the 256 pixel values. Table 7 shows the results of inputting various lyrics. By inputting the song lyrics of "Meant to be this way", "Sax is my cardio" and "Heart of lion (Leo)", the stage background images were transformed according to the proposed method for each verse and chorus. The lyrics of "Meant to be this way" consist of two verses and two choruses, and a total of 12 emotion words were extracted: 4 emotion words from the 14 sentences in verse 1, 5 emotion words from the 14 sentences in the chorus, and 3 emotion words from the 14 sentences in verse 2. The lyrics of "Sax is my cardio" consist of two verses and two choruses, and a total of 28 emotion words were extracted: 14 emotion words from the 12 sentences in verse 1, 7 emotion words from the 8 sentences in the chorus, and 7 emotion words from the 12 sentences in verse 2. The lyrics of "Heart of a lion (Leo)" consist of three verses and two choruses, and a total of 32 emotion words were extracted: 5 emotion words from the 8 sentences in verse 1, 10 emotion words from the 13 sentences in the chorus, 3 emotion words from the 8 sentences in verse 2, and 9 emotion words from the 8 sentences in verse 3. 
The styles were transferred by selecting the images with the most similar probability distributions for each verse and chorus through the emotion words extracted from each song lyrics. We confirmed through the results in Table 7 that the styles are well transformed even from complex stage background images.

Conclusions
This paper proposed a method to transfer stage background images into styles based on the emotion words contained in each verse and chorus from lyrics selected by a user. First, multiple verses and choruses were derived from the lyrics, one at a time, and compared with the emotion word dictionary to extract the emotion words included in each verse and chorus. Next, the image with the most similar probability distribution to the corresponding probability distribution was selected based on the probability distribution of emotion words included in the lyrics, and the styles were transferred to the stage background image for each verse and chorus. In the experiment, the performance of the style transfer was verified, and the probability distribution of the emotion words in the transformed stage background image was verified as similar to the probability distributions of the song lyrics. Experimental results showed that the proposed method reduced total variation loss from 1.0777 to 0.1597. This result shows that the style transferred image is close to edge information about the content of the input image, and the style is close to the target style image. In addition, stage background image and images of transferred styles with similar emotion words probability distributions were 38% similar, and stage background image and image of transferred styles with completely different probability distributions were 8% similar.
Owing to the limitations of lexicon-based approaches, the design of a more suitable emotion analysis model remains a task for future work. Models that extract emotions by considering full sentences take sentences as input, but lyrics do not follow a complete sentence structure, so it is difficult to use them as input to such models. Even for full sentences, accuracy is limited because of the uncertainty in specifying all the emotions corresponding to the words of each sentence. Therefore, in this paper, a limited set of emotion words was selected and used.