Evaluating the Irregularity of Natural Languages

In the present work, we quantify the irregularity of different European languages belonging to four linguistic families (Romance, Germanic, Uralic and Slavic) and an artificial language (Esperanto). We modified a well-known method to calculate the approximate and sample entropy of written texts. We find differences in the degree of irregularity between the families, and our method, which is based on the search for regularities in a sequence of symbols, consistently distinguishes between natural and synthetic randomized texts. Moreover, we extended our study to the case where multiple scales are accounted for, as in multiscale entropy analysis. Our results revealed that real texts have non-trivial structure compared to the ones obtained from randomization procedures.


Introduction
Several studies have reported spatio-temporal organization properties in natural languages. Two representative findings of universal features of natural language are the Zipf and Heaps laws, which are based on word frequency and number of different words, respectively [1-4]. From a more basic perspective, human language can also be considered as a sequence of symbols which contains information encoded in the patterns (words) needed to communicate. For instance, the frequency of appearance of the symbols is different for every language, and so are the declension and verbal conjugation rules. There are also restrictions in the order of appearance of bigrams, trigrams and, in general, n-grams; for example, in English and Spanish the letter "q" is almost always followed by "u". The way these restrictions and other factors modulate the structure and randomness of the language can potentially be evaluated by means of concepts like entropy, as proposed by Shannon [5,6]. The use of entropy-related algorithms to estimate orderliness in natural language has revealed that language is neither regular nor random, but the direct quantification of the presence of randomness is not an easy task. Several studies have used the concept of entropy by means of an n-gram analysis [7], a binary simplification [8], nonparametric entropy estimation [9], mutual information of letters [10], information-based energy [11], complexity quantification [12] and an entropy-word approach [13].
However, entropy-information measures based on the regularity of pattern statistics have not been widely employed to evaluate the "complexity" of natural language. A straightforward way to quantify the similarity of two words is to count the number of letters that they have in common; words coming from the same root, diminutives, augmentatives or the "functional shift" of certain words are good examples of these similarities. In the context of dynamical systems, there are well-known methods to measure the repetition of patterns in a time series: the approximate entropy (ApEn) and its derivatives [14,15]. The ApEn quantifies the regularity in a time series: a lower value of ApEn indicates a more regular behavior, whereas a high value is assigned to more irregular sequences. This method has been successfully applied to analyze time series from several sources [16-20].
Here we adopt a similar approach based on the ApEn algorithm in order to evaluate the levels of complexity in four language families (Romance, Germanic, Slavic and Uralic). Our goal is to determine the dominance of regularities in written texts, which are considered as finite time series. Our method was applied to several texts from different languages. The results reveal that, for texts belonging to the same family, the ApEn decreases in a similar fashion as the pattern length increases. Moreover, we extend our study to evaluate the multiscale behavior of entropy for the assessment of regularities at different scales, as suggested by Costa et al. [21]. Additionally, we apply our methodology to two synthetic cases: randomized versions of the original texts and a text written in Esperanto. We found significant differences between real and synthetic texts, observing a higher complexity for the real sequences compared to the randomized ones across different scales.
The paper is organized as follows. First, we present the methodology used throughout the article, including the modified method to calculate the ApEn for sequences of symbols. Next, the main results of the study are described. Finally, we provide some final remarks.

Approximate Entropy of a Text
Within the context of information theory, the entropy of a sequence of symbols (from an alphabet with L elements) is given by the so-called Shannon entropy $H_S = -\sum_{j=1}^{L} p_j \log p_j$, with $p_j$ the probability of symbol j. The Shannon entropy measures the average uncertainty of a discrete variable and represents the average information content [6]. For sequences composed of blocks with n symbols, the block entropy $H_n$ measures the uncertainty assigned to a word of length n [22,23]. The difference entropy $h_n = H_{n+1} - H_n$ represents the uncertainty related to the appearance of the (n + 1)th symbol given that the n preceding symbols are known [22]. For dynamical systems, the mean rate of creation of information is given by the Kolmogorov-Sinai (KS) entropy, which measures the unpredictability of systems changing with time [24]. However, numerical calculation of the KS entropy requires very long sequences, and it is therefore not practical for real sequences. In order to overcome this limitation, Grassberger et al. [25] proposed the $K_2$ entropy, a lower bound of the KS entropy, to evaluate the dimensionality of chaotic systems. Later, as an extension of the $K_2$ entropy, Pincus [14] introduced the Approximate Entropy (ApEn) to evaluate the regularity in a given time series. The ApEn provides a direct measure of the degree of irregularity or randomness in a time series and, in the context of physiological signals, is used as a measure of system complexity: smaller values indicate greater regularity, and greater values convey more disorder or randomness [14,17].
Here we introduce a modified ApEn algorithm for the regularity analysis of a written text. Our method follows a procedure similar to the ApEn proposed by Pincus [14] and can be summarized as follows. For a given text {s(1), s(2), s(3), ..., s(N)} of N elements, where an element can be a letter or symbol (including the space), we define a set of patterns of length m, where $S_m(i)$ is the pattern of m symbols from s(i) to s(i + m − 1). Next, we look for matches between patterns: two patterns match if the "distance" between them does not exceed a given value. The distance we use is the number of positions at which the corresponding symbols are different, known as the Hamming distance [26], and the tolerance r is the maximum number of differing positions allowed. Next, we calculate the number $n_i^m$ of patterns $S_m(j)$ with $j \le N - m + 1$ such that $h(S_m(i), S_m(j)) \le r$, where h denotes the Hamming distance. Then, the quantity $C_i^m(r) = n_i^m / (N - m + 1)$ is defined, representing the probability of finding patterns within distance r of the template pattern $S_m(i)$.
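The pattern-matching step described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' code: the function names `hamming` and `match_probabilities` are hypothetical, and the text is assumed to be an ordinary Python string in which every character (including the space) is one symbol.

```python
def hamming(a, b):
    """Number of positions at which two equal-length patterns differ."""
    if len(a) != len(b):
        raise ValueError("patterns must have equal length")
    return sum(x != y for x, y in zip(a, b))


def match_probabilities(text, m, r):
    """Return C_i^m(r) for every template pattern S_m(i).

    A pattern counts as a match when its Hamming distance to the
    template is at most r; the template matches itself, as in ApEn.
    """
    n_patterns = len(text) - m + 1
    patterns = [text[i:i + m] for i in range(n_patterns)]
    probabilities = []
    for template in patterns:
        n_i = sum(1 for other in patterns if hamming(template, other) <= r)
        probabilities.append(n_i / n_patterns)
    return probabilities
```

The double loop is quadratic in the text length, which is acceptable for short segments; a faster implementation would be needed for long texts.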
Following Pincus [14], we define the Approximate Entropy in the case of texts as
$$\mathrm{ApEn}(m, r, N) = \Phi^{m}(r) - \Phi^{m+1}(r), \tag{1}$$
where $\Phi^{m}(r)$ and $\Phi^{m+1}(r)$ are given by $\Phi^{m}(r) = \frac{1}{N-m+1} \sum_{i=1}^{N-m+1} \ln C_i^{m}(r)$ and $\Phi^{m+1}(r) = \frac{1}{N-m} \sum_{i=1}^{N-m} \ln C_i^{m+1}(r)$, respectively. As in the context of time series, the statistic represented by ApEn quantifies the degree of regularity/irregularity in a given text, and it is conceived as approximately equal to the negative average natural logarithm of the conditional probability that two patterns that are similar for m symbols remain similar for m + 1 symbols [14]. Although ApEn is very useful for distinguishing a variety of deterministic/stochastic processes, it has been reported that there is a bias in ApEn because the method counts each pattern as matching itself. Under particular circumstances, this bias causes ApEn to underestimate the entropy or to provide an unreliable value for a given time series. Therefore, the development of an alternative method was desirable to overcome the limitations of ApEn. On the basis of the $K_2$ and ApEn algorithms, Richman and Moorman [15] introduced the so-called sample entropy (SampEn) to reduce the bias in ApEn. One of the advantages of SampEn is that it does not count self-matches and is not based on a template-wise approach. Discounting the self-matches is justified since the entropy is conceived as a measure of the rate of information production, and self-matches do not add new information [15]. Following the definition of Richman and Moorman [15], we can also define SampEn(m, r, N) in the case of texts as
$$\mathrm{SampEn}(m, r, N) = -\ln \frac{U^{m+1}}{U^{m}},$$
where $U^{m+1}$ and $U^{m}$ are the probabilities that two patterns will match (with a tolerance of r) for m + 1 and m symbols, respectively [27]. As in the case of ApEn, SampEn is conceived as the negative natural logarithm of the conditional probability that two sequences similar for m points remain similar at the next point, with a tolerance of r, without counting the self-matches. A lower value of SampEn indicates a more regular behavior of a symbol sequence, whereas high values are assigned to more irregular sequences. We remark that both ApEn and SampEn represent families of statistics that depend on the sequence length N, the tolerance parameter r and the pattern length m.
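To make the two statistics concrete, the following Python sketch computes both for a string. It is a minimal illustration under stated assumptions, not the authors' implementation: the names `apen_text` and `sampen_text` are hypothetical, ApEn uses the standard Pincus form with self-matches counted, and SampEn follows the Richman and Moorman convention of using the first N − m templates for both pattern lengths and excluding self-matches.

```python
import math


def hamming(a, b):
    """Number of positions at which two equal-length patterns differ."""
    return sum(x != y for x, y in zip(a, b))


def apen_text(text, m, r):
    """ApEn(m, r, N) = Phi^m(r) - Phi^(m+1)(r); self-matches are counted."""
    def phi(k):
        n = len(text) - k + 1                      # number of patterns of length k
        patterns = [text[i:i + k] for i in range(n)]
        total = 0.0
        for template in patterns:
            matches = sum(1 for p in patterns if hamming(template, p) <= r)
            total += math.log(matches / n)         # matches >= 1 because of the self-match
        return total / n
    return phi(m) - phi(m + 1)


def sampen_text(text, m, r):
    """SampEn(m, r, N) = -ln(U^(m+1) / U^m); self-matches are excluded."""
    n = len(text) - m                              # templates available for both lengths

    def matching_pairs(k):
        patterns = [text[i:i + k] for i in range(n)]
        return sum(
            1
            for i in range(n)
            for j in range(n)
            if i != j and hamming(patterns[i], patterns[j]) <= r
        )

    b = matching_pairs(m)
    a = matching_pairs(m + 1)
    if b == 0:
        return math.nan                            # undefined: no matches of length m
    if a == 0:
        return math.inf                            # no matches survive at length m + 1
    return -math.log(a / b)
```

Both normalizations cancel in the SampEn ratio, so only the pair counts are needed. A call such as `apen_text(text, m=3, r=1)` returns a value near zero for a strictly repetitive string and a larger value for a less regular one.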

Results and Discussion
Prior to the description of our results, we briefly explain the main steps of our method for the simple case of a very short text. Let us consider the beginning of the famous soliloquy from Hamlet: "To-be-or-not-to-be". The length of the sentence is 18 and the average word length is approximately 2, i.e., the space mark repeats approximately every three symbols. A natural value for m is therefore 3, and we set the tolerance value r = 1 (33% of the pattern length). Starting with the first letter from the left, the 16 subseries we can build are $S_1$ = {To-}, $S_2$ = {o-b}, ..., $S_{16}$ = {-be}. After performing the procedure described in the previous section, we find that, for the beginning of the soliloquy, Equation (1) gives ApEn(3, 1, 18) = 0.215. This is an intermediate value, which indicates that the sentence is moderately predictable compared to the case where the position of the symbols is randomized ($\mathrm{ApEn}_{\mathrm{rand}}$ = 0.435 on average over five independent realizations).
First, we analyze literary texts from each of the 14 languages described in Table 1. For an extended dataset, which includes two more books for each language, please refer to the supporting information online at [28]. The texts were downloaded from the website of Project Gutenberg (http://www.gutenberg.org). In order to avoid finite-size effects and to validate our method for relatively short sequences, we restrict ourselves to segments with 5000 symbols and repeat the calculations for 10 segments of this length [29]. In our case we have kept the punctuation marks and the space mark as symbols. In what follows, we will only refer to ApEn values, since we obtain the same qualitative results using either the SampEn or the ApEn algorithm; particular differences will be discussed elsewhere.
Table 1. Books written in different languages considered in our study. For each book, we also include the linguistic family, the language, the number of symbols in the alphabet (L), the mean ApEn values for m = 5 and m = 6 with r = 2, the total number of words N and the total number of different words M. The Esperanto text (L = 28) used in the analysis is La Batalo del Vivo, by Charles Dickens. For an extended dataset see supporting information online at [28].
Figure 1 shows the calculations of the average ApEn for several values of m and a fixed value r = 2. We notice that for m = 6, the ApEn values obtained for each language tend to be close when they are grouped according to the family to which they belong, allowing a comparison between families (see Table 1). For this m value, the Romance family exhibits the highest value of ApEn, followed by the Uralic, Germanic and Slavic families, indicating that different levels of regularity/irregularity are observed in the analyzed language families (Figure 1a-d). It is also worth mentioning that languages that belong to the same family display a similar profile as the pattern length increases. While the value of entropy for the Romance, Slavic and Uralic families decreases almost monotonically, for the Germanic family the entropy exhibits only a small change between m = 6 and m = 7, revealing that the level of irregularity remains approximately constant for these pattern lengths, which roughly correspond to the mean word length of these languages [7]. Notably, the ApEn curves of the Romance languages overlap much more than those of the other families.
To further compare the ApEn profiles in texts, we also studied two artificial cases: Esperanto and randomized versions of the original sequences. Invented languages like Esperanto are attempts to simplify natural languages by suppressing, for example, irregular verbs and including words from different languages to make them universal. For the randomized version, we consider a text which is randomly generated with the same symbol and space probabilities as observed in the real text; ten independent realizations were constructed. The ApEn values for the randomized versions are shown in Figure 1f. In Figure 1e we also show the behavior of entropy in terms of m for Esperanto. For r = 2, we observe that at m = 3 the entropy value is close to the values observed in the majority of natural languages, and a rapid decay is observed between m = 4 and m = 5, this decline being much faster than the one observed in real texts (see Figure 1a-d). We note that for m = 3 and m = 4, a higher value of ApEn is observed for random texts than for real texts, and then the entropy decreases dramatically for larger values of the pattern length. We remark that for short patterns the ApEn is high because the frequencies of patterns of length m and m + 1 are quite different, indicating a high irregularity in the text, as expected for random sequences. When the entropy values (corresponding to pattern lengths 3-10) from the different languages were pairwise compared with their corresponding random version, we found significant differences in almost all cases (p-value < 0.05 by Student's t-test, see Table 2 for details).
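As an illustration of how such randomized texts might be produced, the sketch below draws each symbol independently according to the empirical symbol frequencies of the original text. This is an assumption about the procedure, not a description of the authors' code; the function names `randomized_text` and `shuffled_text` and the `seed` parameter are hypothetical. A plain shuffle, which preserves the symbol counts exactly, is included as an alternative.

```python
import random
from collections import Counter


def randomized_text(text, seed=None):
    """Draw len(text) symbols independently with the empirical frequencies of text."""
    rng = random.Random(seed)
    counts = Counter(text)
    symbols = list(counts)
    weights = [counts[s] for s in symbols]
    return "".join(rng.choices(symbols, weights=weights, k=len(text)))


def shuffled_text(text, seed=None):
    """Randomly permute the symbols, keeping every symbol count unchanged."""
    rng = random.Random(seed)
    chars = list(text)
    rng.shuffle(chars)
    return "".join(chars)
```

Calling either function with ten different seeds would give ten independent realizations of the same source text.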
In order to further characterize the effects of the parameters m and r on the entropy values, in Figure 2 we show the calculations of ApEn for 36 pairs of values of the parameter r and the pattern length m. Recall that the r value represents the similarity criterion based on the Hamming distance, i.e., the number of positions at which the corresponding symbols may differ. Thus, r takes values between 1 and m − 1. As shown in Figure 2, we find that in most cases the entropy increases as the parameter r increases for a fixed value of the pattern length m, whereas for a fixed r the entropy value tends to decrease as m increases. Note that for Germanic languages this general behavior is not observed as m increases (Figure 2(b1)). As a general remark on the dependence of entropy on the parameters m and r, we notice that an acceptable value of r should be set according to the level of discrepancy allowed between two patterns (a fraction of the pattern length), since for larger m values and small values of r an almost perfect match is required every time, and longer sequences are needed to obtain reliable statistics.
Finally, to compare the behavior of the entropy values, we applied Fisher's linear discriminant [30] to the data shown in Figure 1a-d. This technique is very useful to determine whether the ApEn profiles could potentially classify languages into the Romance, Germanic, Slavic and Uralic families. For this analysis we considered the ApEn values (corresponding to pattern lengths 3-10) from ten segments of 5000 symbols for each language. The data for the 14 languages were then projected down to the two-dimensional scatter plot presented in Figure 3. We observe a separation between clusters formed by languages that belong to the same linguistic family.

Multiscale Entropy Analysis of Texts
In the context of biological signals, Costa et al. [21] introduced the multiscale entropy analysis (MSE) to evaluate the relative complexity of time series across multiple scales. This method was introduced to explain the fact that, in the context of biological signals, single-scale entropy methods (such as ApEn) assign higher values to random sequences from certain pathologic conditions, whereas an intermediate value is assigned to signals from healthy systems [17,21]. It has been argued that these results may lead to erroneous conclusions about the level of complexity displayed by these systems. Here we adopt a similar approach with the idea of evaluating the complexity of written texts by accounting for multiple time scales.
We now explain the main steps of the modified MSE for the analysis of texts. Given the original sequence {s(1), s(2), s(3), ..., s(N)}, a coarse-graining process is applied. A scale factor τ is used to generate new sequences whose elements are formed by concatenating the symbols of non-overlapping segments of length τ. Thus, the coarse-grained sequences for a scale factor τ are given by $y^{\tau}_{k,j} = s_{(k-1)\tau+j} \cdots s_{k\tau+j-1}$, with $1 \le k \le N/\tau$ and $1 \le j \le \tau$, where the dots denote concatenation. We observe that for τ = 1 the original sequence is recovered, whereas for τ > 1 the length of the new sequences is reduced to N/τ. We note that for each scale factor τ there are τ coarse-grained sequences derived from the original one, as was recently pointed out in the composite MSE [31,32]. Next, to complete the MSE steps, the ApEn algorithm is applied to the sequence $y^{\tau}_{k,j}$ for each scale to evaluate the regularity/irregularity in the new block-sequences. In order to improve the statistics, the entropy is calculated for each of the τ coarse-grained sequences (j = 1, ..., τ) at a given τ, and the MSE value is given by the average of these entropies. Finally, the entropy value is plotted against the scale factor.
A very simple example of the coarse-graining procedure can be illustrated with the Hamlet soliloquy: "To-be-or-not-to-be". For τ = 2 we obtain $y^{2}_{1,1}$ = {To}, $y^{2}_{2,1}$ = {-b}, ..., $y^{2}_{9,1}$ = {be} and $y^{2}_{1,2}$ = {o-}, $y^{2}_{2,2}$ = {be}, ..., $y^{2}_{8,2}$ = {-b}. We note that these new sequences have components formed by two-letter blocks, which are the input for the modified ApEn algorithm described in the previous section. In practice, each new τ-block component is treated as a single character for calculation purposes.
Figure 4 presents the results of the MSE analysis for the real and synthetic texts described in the previous section. The value of the entropy for scales one and two is higher for random texts than for real ones (Figure 4a-d). It is noticeable that for texts from natural language the entropy value decreases moderately as the scale factor increases, compared to the rapid decrease observed for synthetic random data, such that for scales larger than 3 the entropy values for random sequences are smaller than the ones from original texts. Similar conclusions were obtained for Esperanto and its random version (data not shown).
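The coarse-graining step of the modified MSE can be sketched as follows. This is an illustrative Python sketch rather than the authors' code: the names `coarse_grain` and `multiscale_entropy` are hypothetical, and the entropy estimator is passed in as a function so that any ApEn implementation that treats each block as a single symbol can be used.

```python
def coarse_grain(text, tau):
    """Return the tau coarse-grained sequences of text for scale factor tau.

    Sequence j (0-based here, j + 1 in the text's notation) starts at
    offset j and consists of consecutive non-overlapping blocks of tau symbols.
    """
    sequences = []
    for j in range(tau):
        n_blocks = (len(text) - j) // tau
        blocks = [text[j + k * tau: j + (k + 1) * tau] for k in range(n_blocks)]
        sequences.append(blocks)
    return sequences


def multiscale_entropy(text, scales, entropy):
    """Average entropy(sequence) over the tau coarse-grained sequences per scale.

    entropy is any callable that accepts a list of blocks and returns a number.
    """
    result = {}
    for tau in scales:
        values = [entropy(seq) for seq in coarse_grain(text, tau)]
        result[tau] = sum(values) / len(values)
    return result
```

For the soliloquy example with τ = 2, `coarse_grain("To-be-or-not-to-be", 2)` yields one sequence of nine blocks starting with "To" and one of eight blocks starting with "o-", matching the blocks listed above.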
As has been found when MSE was applied to biological signals, signals that exhibit long-range correlated behavior are more complex than uncorrelated ones. When applied to natural language, our results show that the temporal organization of natural languages (with some differences between them) exhibits a more complex structure than the sequences constructed by randomization. These results are also consistent with previous studies, which report the presence of long-range correlations in written texts [33,34]. Note that for scales one and two, randomized versions exhibit higher values of entropy than the real ones, while for scales larger than three the original sequences are more complex than their random versions. We also note that random data from the Slavic and Uralic families remain very close to the entropy values of real texts compared to the Romance and Germanic cases, where a clear separation is observed for scale factors smaller than six.

Conclusions
We have presented a modified version of the approximate entropy method which is suitable for the evaluation of irregularities on multiple temporal scales in written texts from natural languages. First, we described the modified ApEn and SampEn methodologies by considering the repetition of patterns subject to a threshold Hamming distance. This entropy-based statistic (ApEn) was defined as approximately equal to the negative average natural logarithm of the conditional probability that two symbol patterns that are similar for a length m remain similar when the length is increased by one element [14]. We applied this algorithm to different natural languages which belong to four families. Our results showed that natural languages are neither regular nor random but exhibit different levels of irregularity, which are similar for languages belonging to the same family. The application of the Fisher linear discriminant analysis to the ApEn values revealed that the four language families are segregated. In addition, the modified MSE method was applied to compare the multiscale features of the same real texts and of their randomized versions. We found that real sequences exhibit a non-trivial structure compared to texts obtained from randomizations, i.e., natural languages have a more complex structure across different local scales than sequences with arbitrary order. Finally, we point out that additional studies are needed to fully characterize natural language predictability as well as to consider corrections in the calculations of multiscale entropy values [32].

Figure 1. Approximate entropy (ApEn) as a function of the pattern length m for four families of European languages. In all cases we set r = 2. Here symbols represent the mean value of the entropy measure for 10 segments, each with 5000 symbols, and the error bars represent the standard deviation. The four families considered here are: (a) Romance (Latin, Spanish, Italian and French); (b) Germanic (English, German, Swedish and Dutch); (c) Slavic (Russian, Polish, Serbian and Czech); and (d) Uralic (Finnish and Hungarian). We also show the cases of (e) Esperanto and (f) random versions of the cases in panels (a-d). For natural languages, we observe similarities in the decay profile of the entropy between languages which belong to the same family. We notice that for the Germanic family, the entropy measure remains almost constant for pattern lengths between m = 6 and m = 7. Esperanto shows a fast decay between m = 4 and m = 5, while random texts present a high value of entropy for m = 3 and m = 4 with an abrupt decay for m greater than 4. For results of the extended dataset see supporting information online at [28].

Figure 3. Results of the application of a linear classification analysis to data derived from four language families. Here we show the projection of ApEn values from pattern lengths between m = 3 and m = 10 (see Figure 1). For each m value and for each language, we considered ten segments of length 5000 to obtain ten ApEn values. Next, languages were labeled in classes according to the linguistic family to which they belong (Romance, Germanic, Slavic, Uralic). The eight-dimensional vectors comprising the eight ApEn values (pattern lengths 3-10) are used to create the two-dimensional projection. We observe that the families are segregated. For results of the extended dataset see supporting information online at [28].

Figure 4. Multiscale entropy analysis (MSE) for 14 natural languages from 4 European families and their corresponding randomized sequences. Symbols represent the mean value of the ApEn for 10 segments and the error bars the standard deviation. The length of each segment is 5000 data elements and we used the values m = 3 and r = 2. Each panel shows the results for the (a) Romance; (b) Germanic; (c) Slavic and (d) Uralic families. Note that for scales one and two, randomized versions exhibit higher values of entropy than the real ones, while for scales larger than three the original sequences are more complex than their random versions. We also note that random data from the Slavic and Uralic families remain very close to the entropy values of real texts compared to the Romance and Germanic cases, where a clear separation is observed for scale factors smaller than six.