Natural language processing (NLP) employs computational and linguistic techniques to understand human language in the form of text and speech. Prominent contributions to the field of NLP under active research include sentiment analysis with mood detection, such as happy, sad, angry, and so forth. Mood classification has been studied from two points of view: the first treats emotions as discrete units, while the second classifies emotions in groups within a dimensional space. Dimensional models try to interpret human emotions as vectors in spaces of two or three dimensions.
This section deals with the first part of a multi-modal emotion recognition system, namely the NLP subsystem, which uses text information as input data. The text information available comes in the form of lyrics. Lyrics are the words that make up a song and contain strong and meaningful information about the emotional state a song can evoke in a listener. To mimic the effect lyrics have on a listener, we build an NLP/deep learning system that predicts the emotion that the textual part (lyrics) of a song can cause in a listener.
2.2.1. Word Embeddings
In NLP models, the words that constitute a corpus, in our case lyrics, carry no direct numeric information. For the lyrics to be meaningful to the models we developed, we had to represent each word with a vector; these vectors are known as word embeddings. To compute these vectors we followed several well-known methodologies, such as Bag of Words (BoW) [31], TF-IDF [32], Word2Vec [33], GloVe [34], and BERT embeddings [35], which produce vectors of different representations and dimensions.
The Bag of Words (BoW) method [31] is the simplest text-to-vector technique, describing the occurrence of words within a document. Each text is represented by the set (bag) of its words. Specifically, given a vocabulary $V = \{w_1, w_2, \ldots, w_v\}$, where $v$ is the number of distinct words, and a text $C$, the BoW algorithm constructs a vector $b$ of length $v$; element $b_i$ of the vector expresses how many times the word $w_i$ occurs in the text $C$. Having, for example, the vocabulary $V = \{\textit{yesterday}, \textit{my}, \textit{dog}, \textit{met}, \textit{a}\}$ and a text document $C$ = 'Yesterday my dog met a dog', the generated vector will be $b = (1, 1, 2, 1, 1)$. Each BoW involves a vocabulary of known words and a score for the presence of those words. The model is only concerned with whether known words occur in the document.
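As a minimal sketch of this counting step (the vocabulary below is only the toy example above, not the vocabulary used for the lyrics corpus), the BoW vector can be computed as follows:

from collections import Counter

def bag_of_words(text, vocabulary):
    """Count how many times each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

vocabulary = ["yesterday", "my", "dog", "met", "a"]
print(bag_of_words("Yesterday my dog met a dog", vocabulary))  # [1, 1, 2, 1, 1]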
In this approach, we look at the histogram of the words within the lyrics document, i.e., we consider each word count as a feature. The intuition is that lyrics documents are similar if they have similar content, and that from the content alone we can learn something about the meaning of the document. The complexity of this algorithm comes both from deciding how to design the vocabulary of known words (or tokens) and from how to score the presence of known words.
For many years, the Bag of Words algorithm was used with great success in many problems in the field of natural language processing, and with small variations it is still used today. On the other hand, the BoW method has several disadvantages, as the only information it retains is the number of occurrences of words. The most important disadvantage is that all information about the syntactic structure of the text is lost. An example of this problem is the sentences 'Make love not war' and 'Make war not love': the algorithm will give exactly the same vector for their representation, although their meanings are completely different.
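Reusing the bag_of_words helper from the sketch above, this order-insensitivity is easy to demonstrate:

vocabulary = ["make", "love", "not", "war"]
print(bag_of_words("Make love not war", vocabulary))  # [1, 1, 1, 1]
print(bag_of_words("Make war not love", vocabulary))  # [1, 1, 1, 1] -- identical vector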
The TF-IDF method [32] is a statistical method that calculates the relevance of a word in a text from a set of texts. This value depends on two quantities: the Term Frequency (TF) and the Inverse Document Frequency (IDF). The TF value can be calculated in several ways, the simplest being the frequency with which a term $t$ appears in a document $d$. The IDF expresses how widespread the word is across all the documents and is calculated as the logarithm of the ratio of the total number of documents to the number of documents in which the word appears.
The TF-IDF value for a word $w$ in a text $d$ from a set of texts $D$ is calculated by Formula (8), where values close to 0 mean the significance of the word in the text is small, while values close to 1 indicate great importance.
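Formula (8) itself is not reproduced in this excerpt; for reference, a standard formulation consistent with the description above (the paper's Formula (8) may use a normalised variant so that the score stays between 0 and 1) is:

\[
\mathrm{tfidf}(w, d, D) = \mathrm{tf}(w, d)\cdot \mathrm{idf}(w, D),
\qquad
\mathrm{idf}(w, D) = \log\frac{|D|}{\lvert\{d' \in D : w \in d'\}\rvert}
\]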
The examination of the significance of words over a large set of documents is considered the main advantage of the TF-IDF method. For example, words that appear often in a text, such as 'the' or 'a', are not important to the semantics of a document; the algorithm manages to recognize these words and assign them small importance. Instead, a word like 'bug', for example, could appear frequently in a document about insects or about debugging code; in these cases, the algorithm can detect such words and assign them significant weight. The drawback of this method is that it ignores the text's syntactic structure, as the BoW method does. In addition, the effectiveness of this method depends on the set of texts (corpus) containing a large number of documents.
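As an illustration of how such scores could be computed in practice (an assumption on our part, since the paper does not state which implementation was used), the following minimal sketch uses scikit-learn's TfidfVectorizer (version 1.0 or later); note that scikit-learn applies a smoothed IDF, so the exact values differ slightly from the textbook formula:

from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus: words that appear in every document, such as 'the',
# receive low weights, while words rare across documents receive high weights.
docs = [
    "the dog chased the cat",
    "the bug crawled on the leaf",
    "we fixed the bug in the code",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)      # sparse (documents x vocabulary) matrix

print(vectorizer.get_feature_names_out())   # vocabulary learned from the corpus
print(tfidf.toarray().round(2))             # TF-IDF weight of each word per document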
The term Word2Vec [33,36] refers to a set of models used for the vector representation of the words that a text contains. A Word2Vec model receives as input a large volume of documents (corpus) and produces a vector space $\mathbb{R}^N$, with $N$ typically a few hundred dimensions. Each unique word belonging to the corpus is represented in the vector space by a vector, in such a way that words with similar contexts are close together. In this way, the correlation of words that have similar conceptual meaning is achieved.
In the Word2Vec models, the input words are presented using one-hot encoding, which means that each vector has length $V$, the vocabulary size, and consists of zeros in all its elements except for the item that represents that word in the vocabulary, which takes the value 1. These word vectors are inputs to a neural network, in whose hidden layer the inputs are combined with the weight matrices, and at the output layer the softmax function is applied, predicting the correct positions of the ones and, consequently, the correct words in the one-hot vectors of the output.
For the vector representation of the words, these models choose between two architectures, CBoW and Skip-gram. The CBoW (Continuous Bag of Words) model takes as input the neighboring words of a target word and tries to predict the target word at the output. In contrast, the Skip-gram model accepts a word as input and predicts its neighboring words as output. The Skip-gram model works best for small volumes of data and manages to represent rare words better. On the other hand, the CBoW model represents more common words better and is faster to train.
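As an illustration only (this is not the pipeline described in this work), the two architectures can be tried out with the gensim library; the snippet below assumes gensim 4.x parameter names and a toy corpus:

from gensim.models import Word2Vec

# Toy corpus: one tokenized sentence (lyric line) per list entry.
sentences = [
    ["make", "love", "not", "war"],
    ["yesterday", "my", "dog", "met", "a", "dog"],
    ["the", "dog", "chased", "the", "cat"],
]

# sg=1 selects the Skip-gram architecture, sg=0 the CBoW architecture.
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

print(skipgram.wv["dog"].shape)         # (50,) embedding of the word 'dog'
print(skipgram.wv.most_similar("dog"))  # nearest words in the learned space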
The GloVe (Global Vectors) method [34,37], similarly to Word2Vec, uses vector word representations. The advantage of this method is that it not only uses local attributes, such as contextual words, but also incorporates global attributes. The model is capable of recognizing words that, although appearing in the context of another word, do not convey any conceptual content. For example, in the sentence 'The dog chased the cat', the word 'the' precedes the words 'dog' and 'cat' but does not give them any information, which an NLP system must also recognize.
Firstly, the GloVe model creates a vector space utilizing a set of documents. It achieves this by calculating and utilizing the co-occurrence of words in a volume of documents (corpus). The key component of the method is the co-occurrence matrix $C$ of size $V \times V$ for a vocabulary of size $V$, where each element $C_{ij}$ expresses the probability that the word $w_j$ occurs in the context of the word $w_i$.
Then, the GloVe method utilizes the co-occurrence matrix and calculates ratios of the co-occurrence values of two words to discover their conceptual relationship. Suppose that $P_{wk}$ expresses the co-occurrence, that is, the probability that the word $k$ belongs to the context of the word $w$. For example, if we have two words $w_1$ and $w_2$ and a context word $k$ related to $w_1$ but not to $w_2$, then the ratio $P_{w_1 k}/P_{w_2 k}$ would be large, because the term $P_{w_1 k}$ would have a value close to 1 while the term $P_{w_2 k}$ a value close to 0. Respectively, for a context word related only to $w_2$ the ratio would have a very low value, while for context words related to both words (or to neither) the ratio would have a value close to 1.
The procedure that the GloVe method follows to predict contextual words is to use the probabilistic co-occurrence ratios by performing regression. In conclusion, GloVe generates a function $f$ that takes a word as an argument and produces a vector for that word. The characteristic of these vectors is that adding and subtracting them corresponds not only to operations in the vector space but also in the conceptual space, a well-known example being the analogy $\textit{king} - \textit{man} + \textit{woman} \approx \textit{queen}$.
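For illustration only (this is not the system described here), pretrained GloVe vectors can be queried for such analogies through gensim's downloader API; the model name below is an assumption, and the vectors are downloaded on first use:

import gensim.downloader as api

# Downloads pretrained 100-dimensional GloVe vectors on first use.
glove = api.load("glove-wiki-gigaword-100")

# Vector arithmetic in the embedding space mirrors the conceptual analogy
# king - man + woman ~ queen.
result = glove.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)  # expected to print something like [('queen', ...)]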
The model we will mainly use for lyric analysis, BERT, takes as input sequences of text and transforms them internally to tensors of size (3, 128). BERT’s tokenizer is responsible for this process and it is described in more detail subsequently.
The BERT model requires a specific representation of the input data in order to work correctly, called BERT embeddings [35]. This representation requires breaking the input sequences into tokens; BERT provides a custom tokenizer for this process. Specifically, BERT's tokenizer creates a sequence of word tokens, matching each input word to BERT's dictionary. When tokenization is complete, the tokenizer puts the token [CLS] at the beginning and the token [SEP] at the end of each sentence. BERT expects three parallel vectors for each input; these vectors have a fixed length of 128 and are named input_ids, input_masks, and segment_ids. The vector input_ids is constructed from the identification IDs of each token in the input sequence, as provided by the model's dictionary. The vector input_masks is constructed with ones in the first n items and zeros in the remaining 128 − n items, where n is the length of the input sequence. Last, the vector segment_ids helps to separate the sentences that construct an input sequence; for example, the items that match the first sentence will be zeros, the items that match the second sentence will be ones, the items that match the third sentence will be twos, and so forth. Input sequences longer than 128 tokens are truncated, and input sequences shorter than 128 tokens are padded with empty tokens.
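As a minimal sketch (the paper does not specify the implementation; we assume the Hugging Face transformers library, in which input_masks and segment_ids correspond to attention_mask and token_type_ids), the three vectors can be produced as follows:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Two sentences of a lyric fragment, padded/truncated to the fixed length 128.
encoding = tokenizer(
    "Make love not war.", "Yesterday my dog met a dog.",
    padding="max_length", truncation=True, max_length=128,
)

print(len(encoding["input_ids"]))       # 128 token IDs from BERT's dictionary
print(encoding["attention_mask"][:12])  # ones for real tokens, zeros for padding
print(encoding["token_type_ids"][:12])  # zeros for the first sentence, ones for the second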