2.2. Informational Entropy
The first interesting applications of statistics to poetry were mainly based on concepts of information theory, such as entropy and informational energy. This way, poetry was treated from the point of view of statistical mechanics. From this point of view, each letter of the alphabet present in a given text was associated with a random variable having a certain probability of occurrence.
These probabilities can be approximated empirically by the associated frequencies measured from the text, see Figure 1. According to the Law of Large Numbers, the quotient between the number of occurrences of a letter and the text length tends to the theoretical probability of the letter. Thus, the poem can be modeled by a finite random variable $X$, which takes letter values $X \in \{a, b, \dots, z\}$ and is described by the probability distribution

$$\begin{array}{c|cccc} X & a & b & \cdots & z \\ \hline P(X) & p(a) & p(b) & \cdots & p(z) \end{array}$$

This probability distribution table provides the letter frequency structure of the text. Now, each letter is supposed to contribute some “information” to the text, which is given by the negative log-likelihood function; for instance, the letter “a” contains an information equal to $-\log_2 p(a)$. The minus sign was considered for positivity reasons, while the logarithm function was chosen for its properties (the information contained by two letters is the sum of their individual information, namely, $-\log_2 [p(a)\, p(b)] = -\log_2 p(a) - \log_2 p(b)$). A weighted average of these letter information values is given by the expression of informational entropy

$$H = -\sum_{x \in \{a, \dots, z\}} p(x) \log_2 p(x).$$

The statistical measure $H$ represents the amount of information produced, on average, for each letter of a text. It is measured in bits. We recall that one bit is the measure describing the information contained in a choice between two equally likely choices. Consequently, the information contained in $2^N$ equally likely choices is equal to $N$ bits. It is worth noting that the previous entropy expression mimics the formula introduced in information theory by C.E. Shannon [1] in 1948.
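As an illustration of the entropy formula, the following minimal Python sketch estimates the letter probabilities of a text from its frequencies and evaluates $H$ in bits per letter. The function names and the choice to ignore everything except the 26 letters a to z are assumptions made for this example, not a description of the software used for the experiments.

```python
import math
from collections import Counter


def letter_distribution(text):
    """Empirical probabilities of the letters a-z (case-insensitive, other characters ignored)."""
    counts = Counter(ch for ch in text.lower() if "a" <= ch <= "z")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("text contains no letters a-z")
    return {letter: n / total for letter, n in counts.items()}


def entropy_bits(distribution):
    """Informational entropy H = -sum p(x) log2 p(x), in bits per letter."""
    return -sum(p * math.log2(p) for p in distribution.values() if p > 0)


sample = "She walks in beauty, like the night"
print(round(entropy_bits(letter_distribution(sample)), 3))
```

A one-line sample is far too short to approximate the entropy of a language; the Law of Large Numbers argument above only applies to long texts.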
Another interpretation of entropy is the following. Assume that during a text compression each letter is binary encoded as a sequence that consists of 0s and 1s. Then, the entropy is the average number of binary digits needed to represent each letter in the most efficient way. Given that the entropy of English is about 4 bits per letter and the ASCII code (an abbreviation of American Standard Code for Information Interchange, which is a character encoding standard for electronic communication) translates each letter into 8 binary digits, it follows that this representation is not very efficient.
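The coding interpretation can be checked numerically with a Huffman code, a standard prefix code that is not discussed in this paper and is used here only as an illustration. The sketch below builds the code lengths from letter counts and compares the average code length with the entropy; for a Huffman code the average length lies between $H$ and $H + 1$ bits per letter. All function names are hypothetical.

```python
import heapq
import math
from collections import Counter


def huffman_code_lengths(freqs):
    """Return {symbol: code length in bits} for a Huffman code built from symbol counts."""
    if len(freqs) == 1:
        return {sym: 1 for sym in freqs}
    # Heap items are (weight, tie_breaker, {symbol: depth}); the tie breaker
    # guarantees that the dictionaries themselves are never compared.
    heap = [(w, i, {sym: 0}) for i, (sym, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return heap[0][2]


def average_code_length(freqs):
    """Average number of binary digits per symbol under the Huffman code."""
    lengths = huffman_code_lengths(freqs)
    total = sum(freqs.values())
    return sum(freqs[s] * lengths[s] for s in freqs) / total


def entropy_bits(freqs):
    """Entropy of the empirical distribution defined by the counts."""
    total = sum(freqs.values())
    return -sum((n / total) * math.log2(n / total) for n in freqs.values())


counts = Counter(ch for ch in "she walks in beauty like the night" if ch.isalpha())
print(round(entropy_bits(counts), 3), round(average_code_length(counts), 3))
```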
The value of the entropy H is related in poetry to the state of organization of the poem, such as rhythm or poem structure. For instance, a poem with a short verse has a smaller entropy, while a poem written in white verse (no rhythm) tends to have a larger entropy. Consequently, a poem has a smaller entropy than a regular plain text, due to its structure. The amount of constraint imposed on a text due to its structure is called redundancy. Consequently, poetry has more redundancy than plain text. Entropy and redundancy also depend on the language used, being complementary to each other, in the sense that the more redundant a text is, the smaller its entropy. For instance, a laconic text has reduced redundancy, since the omission of any word can cause non-recoverable loss of information.
The maximum of entropy is reached in the case when the random variable X is uniformly distributed, i.e., in the case when the text contains each letter an equal number of times, namely when each letter frequency is $1/26$. The maximum entropy value is given by $H_{\max} = \log_2 26 \approx 4.70$ bits per letter.
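The relation between entropy and redundancy can be made concrete with a short computation. The sketch below assumes Shannon's relative redundancy, $1 - H/H_{\max}$, evaluated at the single-letter level; the paper itself describes redundancy only qualitatively, so this formula and the function name are assumptions of the example.

```python
import math

ALPHABET_SIZE = 26
H_MAX = math.log2(ALPHABET_SIZE)  # maximum entropy, about 4.70 bits per letter


def relative_redundancy(entropy, alphabet_size=ALPHABET_SIZE):
    """Fraction of the maximum entropy that is not used: 1 - H / H_max (assumed definition)."""
    return 1.0 - entropy / math.log2(alphabet_size)


# Letter-level redundancy for an entropy of 4.14 bits per letter (Shannon's English value).
print(round(H_MAX, 2), round(relative_redundancy(4.14), 3))
```

A lower entropy gives a higher redundancy under this definition, which matches the complementary behavior described above.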
However, this maximum is far from being reached by any language, due to its own peculiarities, such as a preference towards using certain words more often, or a bias towards using certain language structures. For instance, in English, there is a high frequency of letter “e” and a strong tendency of letter “t” to be followed by letter “h”. Further, “q” is almost always followed by the letter “u”, and so on.
It has been noticed that each poet has his or her own specific entropy range, due to the usage of favorite words or a specific poem structure preferred by the poet. It has also been conjectured that it should be possible to recognize the author of a text just from the text entropy. After examining several examples, we conclude that it is easier to recognize the language of a poem than the poet.
Our experiments have shown that the entropy range for the English language is between 4.1 and 4.2 bits per letter. Other languages have a different range. For instance, the Romanian language has a range between 3.7 and 3.9 bits per letter. English has more entropy because the probability distribution of letters in English is more even than in Romanian, a fact that can also be seen from a careful look at Figure 1.
In the following, an entropy table is provided containing a few characteristic English writings.
| Text | Entropy (bits per letter) |
|---|---|
| Jefferson the Virginian | 4.158 |
| Romeo and Juliet | 4.182 |
| King Richard III | 4.214 |
| Byron vol. I | 4.184 |
| Byron vol. II | 4.184 |
It is worth noting that Byron’s first poetry volume and Byron’s second volume (containing letters and journals) have the same entropy, 4.184 bits, which shows consistency across the works of this writer. We also note that Shakespeare’s works (Macbeth, Romeo and Juliet, Othello, King Richard III) contain a larger variation than Byron’s, taking values in the range 4.1–4.2 bits. The entire text of Jefferson the Virginian has an entropy of 4.158 bits (quite close to the entropy of 4.14 bits found by Shannon considering just a part of this work). The English translation of the works of the German philosopher Nietzsche (1844–1900) has an entropy of 4.147 bits. The lowest entropy occurs for the Bible chapter on Genesis, 4.105 bits, a fact explained by a strong tendency in repetition of certain specific biblical words.
For the sake of comparison, we shall provide next an entropy table containing a few Romanian authors:
We have considered Eminescu’s famous poem, Luceafarul (The Morning Star), Ispirescu’s fairy tales and Creangă’s Amintiri din Copilărie (Memories of Childhood). It is worth noting that all these writings have entropies that agree to two decimals, namely 3.87 bits. We infer that the entropy of the Romanian language is smaller than the entropy of English. One plausible explanation is the rarity of some letters in Romanian, such as “q”, “w”, “x”, and “y”, compared with English.
However, there are also some major limitations of entropy as a tool for assessing information. First of all, the entropy does not take into account the order of letters. This means that if we scramble the letters of a nice poem into some rubbish text, we still have the same entropy value. Other questions have been raised, such as whether the values of the random variable X should include blank spaces and punctuation signs, a fact that would definitely decrease the information entropy value.
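The insensitivity of entropy to letter order is easy to verify. The sketch below shuffles the characters of a line with a fixed random seed and checks that the letter entropy is unchanged; the sample line and the function names are chosen only for illustration.

```python
import math
import random
from collections import Counter


def entropy_bits(text):
    """Letter entropy in bits; counts are sorted so the sum does not depend on letter order."""
    counts = Counter(ch for ch in text.lower() if "a" <= ch <= "z")
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in sorted(counts.values()))


def scramble(text, seed=0):
    """Return a random permutation of the characters of text."""
    chars = list(text)
    random.Random(seed).shuffle(chars)
    return "".join(chars)


line = "so we'll go no more a roving so late into the night"
shuffled = scramble(line)
print(shuffled)
print(math.isclose(entropy_bits(line), entropy_bits(shuffled)))  # True
```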
Even if entropy is a robust tool of assessing poems from the point of view of the information content, it does not suffice for all purposes, and hence it was necessary to consider more statistical measures for assessing the information of a text.
2.3. Informational Energy
If information entropy was inspired by thermodynamics, then information energy $I$ is a concept inspired by the kinetic energy of gases. This concept was also used to assess poetry features related to poetic uncertainty, as introduced and used for the first time by Onicescu [2] in the mid-1960s. Its definition resembles the definition of the kinetic energy as in the following

$$I = \frac{1}{2} \sum_{x \in \{a, \dots, z\}} p(x)^2.$$

This statistical measure is also an analog of the moment of inertia from solid mechanics. It measures the ease of rotation of the letter probability distribution $\{p(x)\}$ about the $x$-axis. Heuristically, the uniform distribution will have the lowest informational energy, as it has the lowest ease of rotation about the $x$-axis, see Figure 2.
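This fact can be shown using Lagrange multipliers as follows. Let $p_i$, $1 \le i \le 26$, denote the probability of each letter. We look for the distribution that minimizes $I = \frac{1}{2} \sum_{i=1}^{26} p_i^2$ subject to the constraints $0 \le p_i \le 1$ and $\sum_{i=1}^{26} p_i = 1$. The associated Lagrangian is

$$L(p_1, \dots, p_{26}, \lambda) = \frac{1}{2} \sum_{i=1}^{26} p_i^2 - \lambda \Big( \sum_{i=1}^{26} p_i - 1 \Big).$$

Solving the Lagrange equations $\partial L / \partial p_i = p_i - \lambda = 0$, we obtain $p_i = \lambda = 1/26$, which is a uniform distribution. Since the Hessian $\partial^2 L / \partial p_i \partial p_j = \delta_{ij}$ is positive definite, this solution corresponds to a global minimum.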
Therefore, a short verse and a regular rhythm in poetry tend to increase the informational energy. On the other hand, using free verse tends to decrease the value of $I$. In addition to this, it has been noticed that entropy and informational energy have inverse variations, in the sense that any poem with a large informational energy tends to have a smaller entropy, see Marcus [3]. We note that the entropy attains its maximum for the uniform distribution, while the informational energy reaches its minimum for the same distribution.
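A minimal sketch of the informational energy computation follows. It assumes the kinetic-energy-like form $I = \frac{1}{2}\sum_x p(x)^2$; the factor $\frac{1}{2}$ is an assumption of this example, chosen because it is consistent with the magnitudes reported for English texts in this section. Under this assumption the uniform distribution over 26 letters gives the minimum value $1/52 \approx 0.0192$.

```python
from collections import Counter


def informational_energy(text):
    """Informational energy I = 1/2 * sum p(x)^2 over the letters a-z (assumed normalization)."""
    counts = Counter(ch for ch in text.lower() if "a" <= ch <= "z")
    total = sum(counts.values())
    return 0.5 * sum((n / total) ** 2 for n in counts.values())


# Minimum value, reached by the uniform distribution over 26 letters.
UNIFORM_ENERGY = 0.5 * 26 * (1 / 26) ** 2

sample = "She walks in beauty, like the night"
print(round(UNIFORM_ENERGY, 4), round(informational_energy(sample), 4))
```

Any non-uniform letter distribution gives a value above `UNIFORM_ENERGY`, in agreement with the Lagrange multiplier argument.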
To get an idea about the magnitude and variation of the informational energy for the English language we provide the following table:
| Text | Informational energy |
|---|---|
| Jefferson the Virginian | 0.0332 |
| Romeo and Juliet | 0.0323 |
| King Richard III | 0.0314 |
| Byron vol. I | 0.0322 |
| Byron vol. II | 0.0324 |
For the sake of comparison with Romanian language we include the next table:
We note that the English language has a smaller informational energy than Romanian, the range for the former being 0.031–0.033, while for the latter it is 0.036–0.042.
It is worth noting that most of the entropy limitations encountered before also carry over to the informational energy.
The fact that there can exist a fine poem and a rubbish text with the same entropy and informational energy (we can transform one into the other by just permuting the letters in the poem) shows the drastic limitation of these two information tools. The main limitation consists of using the probability of individual letters, which completely neglects the meaning cast in the words of the poem. In order to fix this problem, one should consider the random variable $X$ to take values in a dictionary rather than in an alphabet. This way, the variable $X$, which is still finite, has thousands of outcomes rather than just 26. For instance, if in a poem the word “luck” is used 5 times and the poem has 200 words, then we compute the probability $p(\text{luck}) = 5/200 = 0.025$. Doing this for all words, we can compute an informational entropy and informational energy at the word level, which makes more sense than at the letter level. It has been noticed that the words with the highest frequency in English are “the”, “of”, “and”, “I”, “or”, “say”, etc. The interested reader can consult the word frequency graph given by Figure 1 for their empirical probabilities.
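The word-level computation can be sketched as follows. The tokenizer is a simplifying assumption (lower-case words, with apostrophes kept inside a word), the informational energy uses the assumed $\frac{1}{2}\sum p^2$ normalization, and the token "filler" is a placeholder used only to reproduce the 5-in-200 example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case word tokens; apostrophes inside a word are kept (a simplifying assumption)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())


def word_probabilities(words):
    """Empirical probability of each word in a list of tokens."""
    counts = Counter(words)
    total = len(words)
    return {w: n / total for w, n in counts.items()}


def entropy_bits(probs):
    """Word-level entropy in bits per word."""
    return -sum(p * math.log2(p) for p in probs.values())


def informational_energy(probs):
    """Word-level informational energy, 1/2 * sum p^2 (assumed normalization)."""
    return 0.5 * sum(p * p for p in probs.values())


# The example from the text: "luck" appears 5 times in a 200-word poem.
words = ["luck"] * 5 + ["filler"] * 195
print(word_probabilities(words)["luck"])  # 0.025
```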
Consequently, if poem A uses each word equally often (fact that confers the poem a specific joyful structure) and poem B uses each word only once (fact that means the poem uses a richer vocabulary), then poem A will have the largest possible entropy and poem B will have the lowest. This fact is based on the fact that the equiprobable distribution reaches the highest entropy.
Unfortunately, even this approach of defining a hierarchical entropy at the level of words is flawed. If we consider two poems, the original one and the same poem with its words written in reverse order, then both poems have the same entropy and informational energy, even if the poetic sense is completely lost in the latter. This is because the order of words matters in conferring a specific meaning. The poetic sense is destroyed by changing the word order. This can be explained by considering each word as a cohesive group of letters with strong statistical dependencies on nearby groups of letters. This problem can be fixed using stochastic processes.
2.4. Markov Processes
A mathematical model implementing the temporal structure of the poem is needed. The order of words can be modeled by a Markov process, by which each word is related with other words in the dictionary using a directional tree. Each node in this tree is associated with a word and each edge is assigned a transition probability describing the chance of going from one word to another.
The “markovian” feature refers to the fact that the process does not have any memory; it just triggers the next word, given one word, with a certain probability. For example, the word “magic” can be followed by any of the words “trick”, “wand”, “land”, or “carpet”. In a poem, using the phrase “magic trick” might not be very poetical, but using “magic wand” or “magic land” seems to be better.
From this point of view, writing a poem means navigating through a directional tree whose nodes are words in a given vocabulary of a certain language, see Figure 3. Taking products of transition probabilities of edges joining consecutive words in the poem, we obtain the total probability of the poem as a sequence of random variables (words) extracted from the dictionary. This approach involves a temporal structure between words, which marks its superiority over the aforementioned approaches. If one had access to this directional tree and its probabilities, then one would be able to generate new poems, statistically similar to the ones used to construct the tree.
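The word-level Markov model described above can be sketched in a few lines. The code estimates transition probabilities from consecutive word pairs, evaluates the probability of a word sequence as a product of transition probabilities (computed as a sum of base-2 logarithms, with the first word taken as given), and generates a new sequence. The toy corpus and all function names are assumptions made for illustration.

```python
import math
import random
from collections import Counter, defaultdict


def transition_probabilities(words):
    """Estimate P(next word | current word) from consecutive word pairs."""
    pair_counts = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        pair_counts[current][nxt] += 1
    return {
        w: {nxt: n / sum(c.values()) for nxt, n in c.items()}
        for w, c in pair_counts.items()
    }


def sequence_log2_probability(words, trans):
    """log2 of the product of transition probabilities along the sequence."""
    total = 0.0
    for current, nxt in zip(words, words[1:]):
        p = trans.get(current, {}).get(nxt, 0.0)
        if p == 0.0:
            return float("-inf")  # the sequence uses an unseen transition
        total += math.log2(p)
    return total


def generate(trans, start, length, seed=0):
    """Random walk of at most `length` words, starting from `start`."""
    rng = random.Random(seed)
    out = [start]
    while len(out) < length and out[-1] in trans:
        choices = trans[out[-1]]
        out.append(rng.choices(list(choices), weights=list(choices.values()))[0])
    return out


corpus = "magic wand magic land magic carpet magic wand".split()
trans = transition_probabilities(corpus)
print(trans["magic"])
print(generate(trans, "magic", 6))
```

Because each step depends only on the current word, the generated text has no long-range coherence, which is the memory limitation of this approach.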
This method in its current form is also limited. One reason is that the process does not have any memory. For instance, if a poem talks about “Odysseus” from Greek mythology, then words such as “legendary”, “hero”, “ingenious” will have a larger probability to appear later in the poem, even if they do not directly follow the word “Odysseus”.
A superior method, which is not markovian, will be described in the second part of the paper and regards the use of neural networks in poetry.
2.5. N-Gram Entropy
In his 1951 paper, Shannon [4] studied the predictability of the English language, introducing the concept of the $N$-gram entropy. This is a sequence of conditional entropies, $F_0, F_1, F_2, \dots$, given by

$$F_N = -\sum_{i,j} p(b_i, j) \log_2 p_{b_i}(j) = -\sum_{i,j} p(b_i, j) \log_2 p(b_i, j) + \sum_{i} p(b_i) \log_2 p(b_i),$$

where $b_i$ is a block of $N-1$ letters, $j$ is an actual letter, $p(b_i, j)$ is the probability of occurrence of the $N$-letter block $(b_i, j)$, $p(b_i)$ is the probability of the block $b_i$, and $p_{b_i}(j)$ is the conditional probability of letter $j$ to follow the block $b_i$, which is given by $p_{b_i}(j) = p(b_i, j) / p(b_i)$.

$F_N$ can be considered as the conditional entropy of the next letter $j$ when the preceding $N-1$ letters are given. Therefore, $F_N$ measures the entropy due to statistics extending over $N$ adjacent letters of the text. This approach is useful when one tries to answer the following question: Given $N-1$ letters of a text, find the letter which is most likely to appear next in the text.

The $N$-gram entropies, $F_N$, are considered as an approximating sequence of another object, a long-range statistic, called the language entropy, defined by $\lim_{N \to \infty} F_N$ and measured in bits (per letter). Computing the language entropy of a given poetry author is a complex computational problem. However, the first few $N$-gram entropies can be computed relatively easily and were found by Shannon [4].

First, regardless of the author, $F_0 = \log_2 26 \approx 4.70$ (bits per letter), given that the language is English (with an alphabet of 26 letters). The 1-gram involves the entropy associated with letter frequency

$$F_1 = -\sum_{j} p(j) \log_2 p(j) = 4.14$$

for the English language. For any poetry author, his or her $F_1$ entropy should be smaller than 4.14 (bits per letter) due to the style structure constraints. It is worth noting that $F_1 = H$, where $H$ is the aforementioned informational entropy. The interpretation of $F_1$ is the number of random guesses one would take on average to reach the correct next letter in a text, without having any knowledge about previous letters.

The digram entropy $F_2$ measures the entropy over pairs of two consecutive letters as

$$F_2 = -\sum_{i,j} p(i, j) \log_2 p_i(j) = -\sum_{i,j} p(i, j) \log_2 p(i, j) + \sum_{i} p(i) \log_2 p(i) = 7.70 - 4.14 = 3.56 \text{ bits per letter}.$$

$F_2$ can be interpreted as the number of random guesses one would take on average to reach the correct next letter in a text, knowing the previous letter.

The trigram entropy $F_3$ can also be computed as

$$F_3 = -\sum_{i,j,k} p(i, j, k) \log_2 p_{ij}(k) = -\sum_{i,j,k} p(i, j, k) \log_2 p(i, j, k) + \sum_{i,j} p(i, j) \log_2 p(i, j) = 11.0 - 7.7 = 3.3 \text{ bits per letter}.$$

$F_3$ can be interpreted as the number of random guesses one would take on average to reach the correct next letter in a text, knowing the previous two letters.
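An empirical estimate of the $N$-gram entropies can be sketched by using the fact that the conditional entropy of the next letter equals the entropy of $N$-letter blocks minus the entropy of $(N-1)$-letter blocks. In the code below, blocks wrap around the end of the text so that the block counts of different lengths stay consistent; this wrap-around, the function names, and the letters-only preprocessing are assumptions of the example.

```python
import math
from collections import Counter


def letters_only(text):
    """Keep only the letters a-z, lower-cased."""
    return "".join(ch for ch in text.lower() if "a" <= ch <= "z")


def block_entropy(text, n):
    """Entropy in bits of the empirical distribution of n-letter blocks (0 for n = 0).

    Blocks wrap around the end of the text, so every position starts exactly one block.
    """
    if n == 0:
        return 0.0
    extended = text + text[: n - 1]
    blocks = Counter(extended[i : i + n] for i in range(len(text)))
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in blocks.values())


def ngram_entropy(text, n):
    """Estimate of F_n = H(n-letter blocks) - H((n-1)-letter blocks), for n >= 1."""
    return block_entropy(text, n) - block_entropy(text, n - 1)


periodic = "aabb" * 4
print([ngram_entropy(periodic, n) for n in (1, 2, 3)])  # [1.0, 1.0, 0.0]
```

For the periodic toy text, one previous letter leaves one bit of uncertainty about the next letter, while two previous letters determine it completely. On real texts the estimates are only meaningful when the text is much longer than the number of possible $N$-letter blocks that actually occur.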
2.7. Conveyed Information
Shannon’s work [4] is mainly concerned with the predictability of the English language. One way to continue this idea is to ask the following question: How much information is conveyed by a block of letters about the next letter in the text?
This is related to the previous question of finding the conditional probability of the next letter given a previous letter block. For this we recall the following notion of information theory, which can be found in more detail in [5].
Let $X$ and $Y$ be two random variables. The information conveyed about $X$ by $Y$ is defined by $I(X|Y) = H(X) - H(X|Y)$, where $H(X)$ is the entropy of random variable $X$ and $H(X|Y)$ is the conditional entropy of $X$ given $Y$. In our case $X$ represents the random variable taking values in the alphabet $\{a, b, \dots, z\}$, while $Y$ is the random variable taking $(N-1)$-letter block values; each of these blocks can be considered as a function $f : \{1, \dots, N-1\} \to \{a, b, \dots, z\}$.
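The information conveyed by an $(N-1)$-letter group about the next letter in the text will be denoted by $I_{1,N-1}$ and is computed as follows

$$I_{1,N-1} = H(X) - H(X|Y) = F_1 - F_N,$$

where $F_N$ is the $N$-gram entropy of Section 2.5.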
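Similarly, we can define the information conveyed by an $(N-2)$-letter group about the next 2-letter group in the text by

$$I_{2,N-2} = F_1 + F_2 - F_{N-1} - F_N.$$

This leads inductively to the formula of the information conveyed by an $M$-letter group about the next $k$-letter group in a text, expressed in terms of the $N$-gram entropies $F_j$, as

$$I_{k,M} = \sum_{j=1}^{k} F_j - \sum_{j=M+1}^{M+k} F_j.$$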
We have the following information inequalities:
Proposition 1. For any integers we have:
Since the conveyed information is non-negative, , then , fact that implies the first inequality.
The previous inequality can be written telescopically as
which after cancelling similar terms leads to the desired inequality. □
It is worth noting that for
, the second inequality becomes
, for any
, which implies
, for all
. This shows that the entropy associated with letter frequency, $F_1$, is the largest among all $N$-gram entropies. This fact can also be noticed from the numerical values presented in Section 2.5.
Another remark is that the inequalities of Proposition 1 are strict as long as the text bears some meaningful information. This follows from the fact that $I(X|Y) = 0$ if and only if $X$ and $Y$ are independent random variables.
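As a numerical illustration, the information conveyed by a block of preceding letters about the next letter can be evaluated from the $N$-gram entropies. The sketch assumes that this quantity equals $H(X) - H(X|Y) = F_1 - F_N$, with $H(X) = F_1$ and $H(X|Y) = F_N$, and uses the 26-letter values reported by Shannon in 1951; the dictionary and function names are hypothetical.

```python
# N-gram entropies of English in bits per letter, as reported by Shannon (1951).
SHANNON_F = {0: 4.70, 1: 4.14, 2: 3.56, 3: 3.3}


def conveyed_information(f, n):
    """Information conveyed by an (n-1)-letter block about the next letter: F_1 - F_n."""
    return f[1] - f[n]


for n in (2, 3):
    print(n - 1, "previous letter(s):", round(conveyed_information(SHANNON_F, n), 2), "bits")
```

Under these assumptions one previous letter conveys about 0.58 bits and two previous letters about 0.84 bits about the next letter.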
2.9. Comparison of Two Texts
In this section, we shall introduce a method of comparing two texts, by evaluating the information proximity of one text with respect to another. We shall first consider a benchmark text in plain English with letter distribution p. This text will provide the distribution of letters in English.
The second text can be a poetry text, also in English, with letter distribution $q$. We shall compare the texts by computing the Kullback–Leibler relative entropy, $D_{KL}(p \,\|\, q)$. This measures a certain proximity between distributions $p$ and $q$ in the following sense. If the benchmark text is considered efficient in expressing some idea, using poetry to express the same idea can be less efficient; thus, the expression $D_{KL}(p \,\|\, q)$ measures the inefficiency of expressing an idea using poetry rather than using plain English. From the technical point of view, the Kullback–Leibler relative entropy is given by the formula

$$D_{KL}(p \,\|\, q) = \sum_{x \in \{a, \dots, z\}} p(x) \log_2 \frac{p(x)}{q(x)}.$$
The text with letter distribution $q$ will be called the test text, while the one with distribution $p$, the benchmark text. We shall consider the benchmark text to be the entire text of the book Jefferson the Virginian by Dumas Malone, as Shannon himself considered it initially while investigating the prediction of the English language [4]. The first test text is the first volume of Byron’s poetry. A simulation in R provides the relative entropy value
The next test text is taken to be Othello
by Shakespeare, case in which
This shows a larger measure of inefficiency when trying to express Malone’s ideas using Shakespearian language rather than Byron-type language. This can be explained by the fact that Byron used a variant of English closer to Malone’s language than Shakespeare did. Consequently, the language used in Othello is farther from the benchmark English than the one used in Byron’s poetry.
Considering further the test text as the Genesis
chapter, we obtain
This value, larger than the previous two, can be explained by the even older version of English used by the Bible.
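The relative entropy between the letter distributions of two texts can be computed as in the sketch below. It is written in Python rather than R and is not the code used for the reported simulations; the function names and the two short sample strings are assumptions. The divergence is infinite when the test text never uses a letter that occurs in the benchmark text, so in practice both texts should be long enough to contain every letter.

```python
import math
from collections import Counter


def letter_distribution(text):
    """Empirical probabilities of the letters a-z in a text."""
    counts = Counter(ch for ch in text.lower() if "a" <= ch <= "z")
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}


def kl_divergence_bits(p, q):
    """Relative entropy sum_x p(x) log2(p(x) / q(x)) of benchmark p with respect to test q."""
    total = 0.0
    for x, px in p.items():
        if px == 0:
            continue
        qx = q.get(x, 0.0)
        if qx == 0:
            return math.inf  # q misses a letter that p uses
        total += px * math.log2(px / qx)
    return total


benchmark = letter_distribution("the quick brown fox jumps over the lazy dog")
test_text = letter_distribution("pack my box with five dozen liquor jugs, the end")
print(round(kl_divergence_bits(benchmark, test_text), 4))
```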
These comparisons are based solely on the single letter distribution. It is interesting to consider a future direction of research, which compares two texts using a more complex relative entropy defined by

$$D_N(p \,\|\, q) = \sum_{b} p(b) \log_2 \frac{p(b)}{q(b)},$$

where the sum runs over all $N$-letter blocks $b$, which is a Kullback–Leibler relative entropy considering $N$-letter blocks rather than single letters. We obviously have that $D_1(p \,\|\, q)$ is the single-letter Kullback–Leibler relative entropy.
One question is whether the quotient $D_N(p \,\|\, q)/N$ approaches a certain value for $N$ large. In this case, the value of the limit would be the relative entropy of the languages.
Another variant of comparing two texts is to compute the cross-entropy of one with respect to the other. The cross-entropy formula is given by

$$S(p, q) = -\sum_{x \in \{a, \dots, z\}} p(x) \log_2 q(x).$$

Since an algebraic computation provides

$$S(p, q) = D_{KL}(p \,\|\, q) + H(p),$$

the cross-entropy between two texts is obtained by adding the entropy of the benchmark text to the Kullback–Leibler relative entropy of the texts. Since $D_{KL}(p \,\|\, q) \ge 0$, we have $S(p, q) \ge H(p)$. Given the distribution $p$, the minimum of the cross-entropy $S(p, q)$ is realized for $q = p$, and in this case $S(p, p) = H(p)$.
The idea of comparing two texts using the relative or cross-entropy will be used to measure the information deviation of RNN-generated poems from genuine Byron poetry.
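The decomposition of the cross-entropy into the entropy of the benchmark distribution plus the Kullback–Leibler relative entropy can be verified numerically. The two small distributions below are arbitrary examples, and the function names are assumptions of this sketch.

```python
import math


def entropy_bits(p):
    """H(p) = -sum p(x) log2 p(x)."""
    return -sum(px * math.log2(px) for px in p.values() if px > 0)


def cross_entropy_bits(p, q):
    """Cross-entropy -sum p(x) log2 q(x); assumes q(x) > 0 wherever p(x) > 0."""
    return -sum(px * math.log2(q[x]) for x, px in p.items() if px > 0)


def kl_divergence_bits(p, q):
    """Relative entropy sum p(x) log2(p(x) / q(x)); same assumption on q."""
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)


p = {"a": 0.5, "b": 0.3, "c": 0.2}   # benchmark distribution (example values)
q = {"a": 0.4, "b": 0.4, "c": 0.2}   # test distribution (example values)

lhs = cross_entropy_bits(p, q)
rhs = kl_divergence_bits(p, q) + entropy_bits(p)
print(math.isclose(lhs, rhs), lhs >= entropy_bits(p))  # True True
```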
2.10. Neural Networks in Poetry
The comprehension of hidden probabilistic structures built among the words of a poem and their relation with the poetical meaning is a complex problem for the human mind and cannot be captured by a simple statistical or mathematical model. This difficulty comes from the fact that each poet has his or her own specific style features, while using an enormous knowledge of and control over the statistics of the language he or she writes in. Therefore, we need a system that is able to learn abstract hierarchical representations of the poetical connotations represented by words and phrases.
Given the complexity of this task, one approach is to employ machine learning. This means to look for computer implemented techniques that are able to capture statistical regularities in a written text. After learning the author’s specific writing patterns, the machine should be able to adapt to the author’s style and generate new texts, which are statistically equivalent with the ones the machine was trained on.
It turned out recently that one way to successfully approach this complex problem is to employ a specific type of machine learning model, called a Recurrent Neural Network (RNN), which dates back to 1986 and is based on David Rumelhart’s work [6]. This belongs to a class of artificial neural networks where connections between nodes form a directed graph along a temporal sequence, similar to the directional tree previously mentioned in Section 2.4.
The main feature of these networks is allowing the network to exhibit temporal dynamic behavior. The network input and the output vectors at time $t$ are denoted respectively by $X_t$ and $Y_t$, while $h_t$ represents the hidden state at time $t$. In its simplest case, the transition equations of an RNN take the following form

$$h_t = \tanh(W h_{t-1} + U X_t + b), \qquad Y_t = V h_t + c,$$

see Figure 4. Thus, the hidden states take values between $-1$ and 1, while the output is an affine function of the current hidden state. Matrices $W$ and $U$ represent the hidden-to-hidden state transition and input-to-hidden state transition, respectively. $b$ and $c$ denote bias vectors. During training the network parameters $(W, U, V, b, c)$ adjust such that the output sequence $(Y_t)_t$ gets as close as possible to the sequence of target vectors, $(Z_t)_t$. This tuning process, by which a cost function, such as

$$C(W, U, V, b, c) = \frac{1}{2} \sum_{t} \| Z_t - Y_t \|^2,$$

is minimized, is called learning. This idea is the basis of many revolutionizing technologies, such as machine translation, text-to-speech synthesis, speech recognition, automatic image captioning, etc. For instance, in machine translation from English to French, $(X_t)_t$ represents a sentence in English, and $(Y_t)_t$ represents the French translation sentence. The statistical relationships between the English and French languages are stored in the hidden states and parameters of the network, which are tuned in such a way that the translation is as good as possible.
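A forward pass through such a network can be sketched in plain Python. The code assumes the simplest recurrent form, $h_t = \tanh(W h_{t-1} + U X_t + b)$ and $Y_t = V h_t + c$, which matches the description of hidden states between $-1$ and 1 and an affine output, together with a squared-error cost; the matrix names, the list-based vectors, and the tiny numerical example are assumptions made for illustration. Training, i.e., adjusting the parameters to reduce the cost, is not shown.

```python
import math


def matvec(matrix, vector):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def rnn_forward(xs, W, U, V, b, c, h0):
    """Run h_t = tanh(W h_{t-1} + U x_t + b), y_t = V h_t + c over the input sequence xs."""
    h = h0
    ys = []
    for x in xs:
        pre = [wh + ux + bi for wh, ux, bi in zip(matvec(W, h), matvec(U, x), b)]
        h = [math.tanh(a) for a in pre]          # hidden state, each entry in (-1, 1)
        ys.append([vh + ci for vh, ci in zip(matvec(V, h), c)])  # affine output
    return ys, h


def squared_error_cost(ys, targets):
    """Cost 1/2 * sum_t ||z_t - y_t||^2 between outputs and target vectors."""
    return 0.5 * sum(
        (y - z) ** 2 for yv, zv in zip(ys, targets) for y, z in zip(yv, zv)
    )


# One hidden unit, one input, one output (illustrative numbers only).
ys, h_last = rnn_forward(
    xs=[[0.0], [1.0]], W=[[0.0]], U=[[1.0]], V=[[2.0]], b=[0.0], c=[1.0], h0=[0.0]
)
print(ys, squared_error_cost(ys, [[1.0], [2.0]]))
```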
Each network has a certain capacity, which can be interpreted as the network’s ability to fit a large variety of target functions. This depends on the number of its parameters. If the network contains just a few units, the number of parameters is small and its capacity is low. Consequently, the network will underfit the data; for instance, in the previous example, the network cannot translate complex sentences. For this reason, the capacity has to be enhanced. There are two ways to do that: (i) by increasing the number of hidden units (horizontal expansion) and (ii) by considering several layers of hidden states (deep neural network).
When increasing the network capacity two difficulties can occur: the vanishing gradient problem and the exploding gradient problem. The first problem can be addressed by considering special hidden units such as LSTM cells, which were introduced in 1997 by Hochreiter and Schmidhuber [7]. We will not go into the details here, but the reader is referred to [5].
In the rest of the paper we shall use an RNN consisting of a single layer of LSTM cells to learn statistical features of Byron’s poems and then use these parameters to generate new Byron-type poems. The way we shall generate new poems is to input a certain “seed” phrase and use the trained network to generate the prediction of the next 400 characters which would follow the provided seed. We shall obtain a sequence of “poems” of increasing quality as training progresses. We shall then assess their informational deviation from a genuine Byron text, using the statistical measures introduced in the first part of the paper. When this deviation is small enough we shall stop training.
The next sections take the reader through all the steps necessary for the aforementioned goal. These include data collection, data processing and cleaning, choosing the neural model, training, and conclusions.
2.10.2. Data Collection
Any machine learning model is based on learning from data. Therefore, the first step is to collect data, as much and of as high a quality as possible. For our project, we found Byron’s works online at Project Gutenberg, www.gutenberg.org. Since the website states “this eBook is for the use of anyone anywhere at no cost and with almost no restrictions whatsoever”, it is clear we may use it. Lord Byron’s work comprises seven volumes. All contain poetry except the second volume, which deals with letters and journals. Among all these files, we shall use for training only volume I of poetry, which can be found at http://www.gutenberg.org/ebooks/8861.
The reason for which we restricted our study to only one volume is the limited processing power available, as the training was done on a regular PC laptop.
2.10.3. Data Processing and Clean-Up
The quality of data is an important ingredient in the data preparation step, since if the data are not well cleaned, regardless of the complexity of the neural architecture used, the results will not be of good quality.
In our case, almost all poems contain lots of footnotes and other explanatory notes, which are written in plain English by the publisher. Therefore, our next task was to remove these notes, including all forewords or comments from the publisher, keeping only Byron’s own work. This can take a while, as it is done semi-manually. The reader can find the cleaned-up text file at http://machinelearningofannarbor.org/book1.txt.
2.10.4. Choosing the Model
The neural network model employed in the analysis of Byron’s poetry is presented in Figure 5. It consists of an RNN with 70 LSTM cells (with hidden states $h_1, \dots, h_{70}$), on top of which is added a dense layer with 50 cells. We chose only one layer in the RNN because a multiple-layer RNN has more network parameters, which might eventually lead to overfitting the data, which is restricted at this time to only one volume of poetry. The output of the network uses a softmax activation function, which provides a probability distribution for the next predicted character. The input is done through 70 input variables, $x_1, \dots, x_{70}$. If the poetry text is written as a sequence of characters (not necessarily only letters) as

$$c_1, c_2, c_3, c_4, \dots,$$

then the text was parsed into chunks of 70 characters and fed into the network as follows. The first input values are the first 70 characters in the text,

$$x_1 = c_1, \; x_2 = c_2, \; \dots, \; x_{70} = c_{70}.$$

The next input sequence is obtained by shifting to the right by a step of 3 characters,

$$x_1 = c_4, \; x_2 = c_5, \; \dots, \; x_{70} = c_{73}.$$

The third input sequence is obtained after a new shift,

$$x_1 = c_7, \; x_2 = c_8, \; \dots, \; x_{70} = c_{76}.$$

Given the text length, the number of input sequences is 124,765. It also turns out that the number of distinct characters in Byron’s poems is 63. This includes, besides the 26 letters and 10 usual digits, other characters such as comma, period, colon, semicolon, quotation marks, question marks, accents, etc., see Figure 6. Therefore, the network output will be 63-dimensional.
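The parsing of the text into overlapping 70-character inputs with a shift of 3 characters can be sketched as follows. Pairing each window with the character that follows it, as the prediction target, is an assumption of this example (the network is described as predicting the next character), as are the function names and the one-hot encoding of characters.

```python
def make_windows(text, window=70, step=3):
    """Split text into overlapping input windows and the character that follows each one."""
    inputs, targets = [], []
    for start in range(0, len(text) - window, step):
        inputs.append(text[start : start + window])
        targets.append(text[start + window])
    return inputs, targets


def one_hot(ch, vocab):
    """Encode a character as a 0/1 vector indexed by the sorted vocabulary."""
    vec = [0] * len(vocab)
    vec[vocab.index(ch)] = 1
    return vec


# Toy text of 100 characters standing in for the cleaned poetry file.
text = "".join(chr(97 + i % 26) for i in range(100))
vocab = sorted(set(text))          # distinct characters of the text
inputs, targets = make_windows(text)
print(len(inputs), len(vocab), targets[0])
```

For the cleaned poetry text, `vocab` would contain the 63 distinct characters, so each encoded character and each network output would be 63-dimensional.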
The activation function in the dense layer is a ReLU function (which is preferable to the sigmoid activation function, as it usually provides better accuracy). Learning uses the Adam minimization algorithm (adaptive moment estimation), a variation of the gradient descent method with an adaptive learning rate. Training uses batches of size 80 and runs for 150 epochs. Training takes roughly 3 min per epoch, or about 7 h for all 150 epochs.
Since this is a multi-class classification problem (with 63 classes), the loss function is taken to be the categorical cross-entropy. Each class corresponds to a character given in Figure 6. If $q$ denotes the probability distribution of the next character forecasted by the network and $p$ is the probability distribution of the true character in the text, then during training the network minimizes the cross-entropy

$$S(p, q) = -\sum_{i=1}^{63} p_i \log_2 q_i.$$

Therefore, learning tries to minimize the average number of bits needed to identify a character correctly using the probability distribution $q$, rather than the true distribution $p$
. Since the text contains poetry from only the first volume, in order to avoid overfitting, a dropout of
has been introduced after both the LSTM and the dense layers.
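The softmax output and the categorical cross-entropy loss for a single predicted character can be sketched as below. The true distribution is taken to be one-hot at the correct character, so the sum in the cross-entropy reduces to a single term. The function names are assumptions; the loss is expressed in bits to match the text, whereas deep learning libraries usually report it with natural logarithms (dividing by $\ln 2$ converts to bits).

```python
import math


def softmax(logits):
    """Turn raw network outputs into a probability distribution over the classes."""
    m = max(logits)                         # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy_bits(true_index, q):
    """Cross-entropy -sum p_i log2 q_i when p is one-hot at true_index."""
    return -math.log2(q[true_index])


# An untrained network with equal outputs for all 63 characters.
q_uniform = softmax([0.0] * 63)
print(round(categorical_cross_entropy_bits(0, q_uniform), 3), round(math.log2(63), 3))
```

A network that has learned nothing assigns probability $1/63$ to every character and pays $\log_2 63$ bits per character; training aims to lower this value on the poetry text.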