Classification of Literary Works: Fractality and Complexity of the Narrative, Essay, and Research Article

A complex network as an abstraction of a language system has attracted much attention during the last decade. Linguistic typological research using quantitative measures is a current research topic based on the complex network approach. This research aims at showing that the node degree, betweenness, shortest path length, clustering coefficient, and nearest neighbourhoods' degree, as well as more complex measures such as the fractal dimension, the complexity of a given network, the Area Under Box-covering, and the Area Under the Robustness Curve, allow the literary works of Mexican writers to be classified according to their genre. Precisely 87% of the full word co-occurrence networks were classified as fractal. Also, empirical evidence is presented that supports the conjecture that lemmatisation of the original text is a renormalisation process of the networks that preserves their fractal property and reveals stylistic attributes by genre.


Introduction
The complex network as an abstraction of a language system has attracted attention in the last decade. Current linguistic research based on the complex network approach follows three major lines [1,2]: the characterisation of human language as a multi-level system, linguistic typological research using quantitative measures, and the relationship between the system-level complexity of human language and its microscopic features.
Word co-occurrence networks and their measures have been widely employed to analyse syntactic features for multiple purposes, such as identifying authors' writing styles [3][4][5][6][7][8] and evaluating machine translations [9]. Also, Ferraz de Arruda, Nascimento Silva [10], as well as F. de Arruda, Q. Marinho [11], built a complex network where the nodes represent adjacent paragraphs that share a minimum semantic content, to classify the text as real (written by an author) or randomly constructed (built from random blocks of real texts).
In most of the research mentioned above, well-known measures such as node degree (k), shortest path length (spl), betweenness (b), clustering coefficient (cc), and the average of nearest neighbourhoods' degree (nnd) are applied to characterise the word co-occurrence networks. The k, b, and nnd are centrality measures that characterise local properties of the network and are useful for authorship attribution [3][4][5][6][7][8]. However, these measures do not capture the global network structure that could give us insight into the literary genre. This research aims at showing that local and global measures of the word co-occurrence networks of literary works by Mexican writers let us classify them according to genre. Thus, the following research questions are formulated:

1. Are measures of the complex network useful to classify literary works by genre?
2. Is the full word co-occurrence network of literary works fractal?
3. Do pre-processing tasks, such as the deletion of numbers, punctuation, and functional words, and lemmatisation, generate fractal networks?

Measures of Complex Networks
Formally, a network is defined by G = (V, E), where V is the set of vertices or nodes, and E is the set of edges. Complex networks exhibit non-trivial topological features that do not occur in simple networks such as lattices or random graphs [12], and their overall behaviour cannot be predicted by observing the behaviour of their nodes [13]. Since complex network theory has its roots in graph theory, some measures are presented below.
The degree of a node i is defined by:

$$k_i = \sum_{j=1}^{N} v_{ij} \qquad (1)$$

where j represents a given neighbour of the node i, and N is the total number of neighbours. The value of v_ij is defined as one if there is a connection between nodes i and j, and as zero otherwise. Similarly, the betweenness of a node is defined as:

$$b_i = \sum_{j \neq m} \frac{L_{jm}(i)}{L_{jm}} \qquad (2)$$

where L_jm is the number of shortest paths between nodes j and m, and L_jm(i) is the number of shortest paths between nodes j and m that go through i. The average nearest neighbourhoods' degree (nnd) of a given node can be computed by:

$$nnd_i = \frac{1}{k_i} \sum_{j \in V(i)} k_j \qquad (3)$$

where k_i is the degree of the node i, the set V(i) contains its nearest neighbours, and k_j is the degree of a given neighbour. A definition of network clustering is expressed by:

$$cc = \frac{3\tau}{spl(2)} \qquad (4)$$

where τ is the number of triangles of the network and spl(2) is the number of paths of length two. A "triangle" is a set of three nodes in which each is connected to the other two.
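As an illustration of the degree, nearest neighbourhoods' degree, and clustering measures, the following is a minimal sketch in plain Python. The function name `local_measures` and the toy edge list are hypothetical, and the clustering coefficient is computed under the assumption that spl(2) counts every connected triple (every path of length two, whether or not a third edge closes it), which is the usual transitivity convention. Betweenness is omitted to keep the sketch short.

```python
from itertools import combinations


def local_measures(edges):
    """Return (degree, nnd, cc) for an undirected, unweighted network.

    degree[i] is the number of neighbours of node i,
    nnd[i] is the mean degree of the neighbours of node i,
    cc is 3 * triangles / (number of paths of length two).
    """
    # Build an adjacency map: node -> set of neighbours.
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    degree = {i: len(neighbours) for i, neighbours in adj.items()}
    nnd = {i: sum(degree[j] for j in adj[i]) / degree[i] for i in adj}

    # Count triangles: unordered triples in which every pair is connected.
    triangles = sum(
        1
        for a, b, c in combinations(sorted(adj), 3)
        if b in adj[a] and c in adj[a] and c in adj[b]
    )
    # Assumption: a path of length two is any pair of edges sharing a centre node.
    paths_of_length_two = sum(k * (k - 1) // 2 for k in degree.values())
    cc = 3 * triangles / paths_of_length_two if paths_of_length_two else 0.0
    return degree, nnd, cc


# Toy network: a triangle a-b-c with a pendant node d attached to c.
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
degree, nnd, cc = local_measures(toy_edges)
```

For this toy network the node c has degree three, its neighbours have degrees 2, 2, and 1, and the single triangle out of five paths of length two gives cc = 0.6.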

Fractality of Complex Networks
A fractal is an object that is similar to itself on all scales [14]. A network is a fractal network if its box-covering follows the power law given by:

$$N_b(l) = \beta \, l^{-d_b} \qquad (5)$$

where N_b(l) is the minimum number of boxes of diameter l needed to cover the network (the box-covering procedure that gives this number is detailed later), β is the scaling factor, and d_b is the box dimension of a complex network, which can be obtained as follows:

$$d_b = \frac{\ln \beta - \ln N_b(l)}{\ln l} \qquad (6)$$

On the other hand, a non-fractal network is characterised by a sharp decay of N_b(l) with l, described by an exponential function as follows [15,16]:

$$N_b(l) \sim e^{-l/l_0} \qquad (7)$$

where l_0 is a characteristic length.

Complexity of Networks
The complexity measure of a network proposed by Lei, Liu [17] is defined as: where d(G) = |E|/ 4CR 3 /3∆ is the absolute density [18]; |E|, C, R, and ∆ are the number of edges, circumference, radius, and diameter of the network, respectively.
is known as structure entropy based on degree and betweenness [17], where k is the Boltzmann constant, |V| is the number of nodes, , and b max is the maximum value of the betweenness computed by the Equation (2). This measure captures the topology of the networks, but it is not affected by scales and their types.

Box-Covering of Complex Networks
To obtain N_b(l), consider the phrase "No one behind, no one ahead". Its word co-occurrence network is shown in Figure 1. The number of boxes needed to cover the network is trivial at the two extremes: for l = 1 it is the number of nodes of the network, and for l = ∆ + 1, where ∆ is the diameter of the network, it is one. For example, N_b(l = 1) = 4 and N_b(l = ∆ + 1) = 1 for the network of Figure 1. For l from 2 to ∆, N_b(l) is not trivial. To obtain N_b(l = 2), we first compute a dual network (G') from the original (G) as follows: given a distance l, two nodes i and j are connected in the dual network if the distance l_ij between them in G is greater than or equal to l. For example, we start the procedure from the node "no" (see Figure 2); "no" and "behind" have a distance of two in G; thus, they will be connected in G'. Next, the node "ahead" is chosen as the starting node; since the distance from it to "behind" is two, a connection in G' will be drawn (see Figure 2). Then, the nodes of G' must be coloured following a single rule: two nodes that are directly connected must be painted different colours. The nodes of the resulting coloured dual network G' are mapped to the original network G. The number of colours of G' represents the minimum number of boxes N_b(l) needed to cover the network for a given value of l. The nodes of G with the same colour belong to the same box. The procedure described above is repeated until l = ∆ + 1. For further details of the box-covering algorithm, the reader is referred to the work of Chaoming, Lazaros [19]. Since the curve l vs. N_b(l) characterises the topology of the network, the area under the box-covering curve (AUB) was also included in the measures of the word co-occurrence network.
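The dual-network colouring described above can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the implementation used in this research: the function names are hypothetical, punctuation is assumed to break word adjacency (which reproduces the distances quoted for the example phrase), a greedy colouring is used (so the result is an upper bound on the minimum number of boxes in general), and the AUB is assumed to be the plain trapezoidal area under the l vs. N_b(l) curve.

```python
import re
from collections import deque


def cooccurrence_network(text):
    """Link adjacent words; assumption: punctuation breaks adjacency."""
    adj = {}
    for segment in re.split(r"[^\w\s]+", text.lower()):
        words = segment.split()
        for w in words:
            adj.setdefault(w, set())
        for a, b in zip(words, words[1:]):
            if a != b:
                adj[a].add(b)
                adj[b].add(a)
    return adj


def all_distances(adj):
    """Shortest path lengths between all pairs, by breadth-first search."""
    dist = {}
    for source in adj:
        d = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in d:
                    d[v] = d[u] + 1
                    queue.append(v)
        dist[source] = d
    return dist


def box_count(adj, l, dist):
    """Greedy colouring of the dual network G' for box size l.

    Two nodes are linked in G' when their distance in G is >= l, so they
    must receive different colours (boxes). Returns the number of colours.
    """
    colour = {}
    for i in sorted(adj):
        used = {c for j, c in colour.items()
                if dist[i].get(j, float("inf")) >= l}
        c = 0
        while c in used:
            c += 1
        colour[i] = c
    return max(colour.values()) + 1


adj = cooccurrence_network("No one behind, no one ahead")
dist = all_distances(adj)
diameter = max(max(d.values()) for d in dist.values())
counts = [box_count(adj, l, dist) for l in range(1, diameter + 2)]  # [4, 3, 1]
# Assumed AUB: trapezoidal area with unit spacing in l.
aub = sum((a + b) / 2 for a, b in zip(counts, counts[1:]))
```

For the example phrase the network is a star centred on "one", so the diameter is two and the sketch returns four boxes for l = 1, three for l = 2, and one for l = 3.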

Robustness of Fractal Networks
Intentional network attacks are based on different centrality measures, such as the node degree or betweenness. They differ in how those centrality measures are computed: either the degree or betweenness is computed once for the whole network before the attack is performed, or the centrality measure is recomputed after each node is removed [20][21][22][23]. The fraction of nodes that must be removed to break down a fractal network (p_c) by a random attack is close to one, that is, nearly all the nodes; thus, these networks are extremely robust [24]. On the other hand, this robustness decreases drastically when the nodes with a high degree are selected to be removed [20,25]. This vulnerability to intentional attack arises because a few nodes with a high degree maintain the connectivity of the network [26]. The robustness of each network is quantified by the size of the largest connected component C_c after removing a fraction p of nodes from the network [20,24,26,27]; when C_c(p_c) ≈ 0, the network has been disintegrated. The value of p_c is low for fragile networks and high for robust networks.
Although the p_c value is useful for measuring the overall damage caused by the attack strategy, it does not reflect the damage of an individual node removal. For example, Figure 3 shows the plot of C_c vs. p, where the value of p_c is 0.5 and 0.49 for networks one and two, respectively. This means that, for both networks, it is necessary to remove approximately 50% of the nodes to disintegrate them into components that contain at most one node. However, based on Figure 3, the removal of nodes from network two causes more damage than the removal of those from network one. This damage can be quantified by computing the Area Under the Robustness Curve (AURC), which is 0.0956 for network one and 0.060 for network two: the higher the value, the higher the robustness of the network. The AURC of the attack performed by node degree was included as a measure of network robustness instead of p_c.
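The robustness curve and its area can be illustrated with a short sketch in plain Python. The function names are hypothetical, and two choices are assumptions rather than details taken from this research: nodes are removed in decreasing order of their initial degree (the degree is not recomputed after each removal), and the AURC is the trapezoidal area under C_c vs. p with C_c expressed as a fraction of the original number of nodes.

```python
def largest_component_size(adj, removed):
    """Size of the largest connected component once `removed` nodes are deleted."""
    seen = set(removed)
    best = 0
    for start in adj:
        if start in seen:
            continue
        stack = [start]
        seen.add(start)
        size = 0
        while stack:
            u = stack.pop()
            size += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        best = max(best, size)
    return best


def robustness_curve(adj):
    """Return (p, C_c) points for a degree-based attack.

    Assumption: the attack order uses the initial degree of every node.
    """
    n = len(adj)
    order = sorted(adj, key=lambda u: (-len(adj[u]), u))
    removed = set()
    ps = [0.0]
    cs = [largest_component_size(adj, removed) / n]
    for step, node in enumerate(order, start=1):
        removed.add(node)
        ps.append(step / n)
        cs.append(largest_component_size(adj, removed) / n)
    return ps, cs


def aurc(ps, cs):
    """Trapezoidal area under the robustness curve."""
    return sum((p1 - p0) * (c0 + c1) / 2
               for p0, p1, c0, c1 in zip(ps, ps[1:], cs, cs[1:]))


# Two toy networks with five nodes each: a star and a path.
star = {"hub": {"a", "b", "c", "d"},
        "a": {"hub"}, "b": {"hub"}, "c": {"hub"}, "d": {"hub"}}
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"},
        "d": {"c", "e"}, "e": {"d"}}
aurc_star = aurc(*robustness_curve(star))
aurc_path = aurc(*robustness_curve(path))
```

In this toy comparison the star collapses as soon as its hub is removed, so its AURC is lower than that of the path, which mirrors the reading that a higher AURC indicates a more robust network.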

Materials and Methods
From seven Mexican writers (Juan José Arreola Zúñiga, Carlos Fuentes Macías, Jorge Ibargüengoitia Antillón, Carlos Monsiváis Aceves, José Emilio Pacheco Berny, Octavio Irineo Paz Lozano, and Alfonso Reyes Ochoa), 21 essays, 21 narratives (15 tales and six novels), and 21 research articles formed the corpus for this research (see Table 1). Notably, some authors, such as Carlos Fuentes, Jorge Ibargüengoitia, and José Emilio Pacheco, wrote titles classified as essays, tales, or novels. The essays, narratives, and research articles were published between 1911 and 2019. All the titles were obtained in an electronic format such as PDF and then converted to plain text.
The node degree (k), betweenness (b), shortest path length (spl), clustering coefficient (cc), and nearest neighbourhoods' degree (nnd), as well as more complex measures such as the fractal dimension (d_b) obtained by Equation (6), the complexity of a network c(G) given by Equation (8), the Area Under Box-covering (AUB), and the Area Under the Robustness Curve (AURC), were computed for each network of each title. Statistical analysis was carried out to select those measures that differ significantly by literary genre and produce a better classification.
Then the Support Vector Machine (SVM), Naïve Bayes (NB), Decision Tree (DT), and Neural Network (NN) implemented in Weka [28] and four data mining views (described later and based on the measures mentioned above) were employed to classify the literary works. The hyperparameter optimisation of the data mining techniques was conducted by sequential model-based algorithm configuration [29,30]. The hidden layers and the nodes' learning function of NN were 28 and sigmoid, respectively. The polynomial kernel was used in SVM, and all measures of the networks were normalised before training and validating SVM and NN. The NB technique was used with a normal distribution to estimate the probabilities of the network measures. DT uses the C4.5 algorithm [31].
The efficacy of each data mining technique and data mining view was validated by 5-fold cross-validation, comparing the Area under the Receiver Operating Characteristic Curve (AROC). The AROC is useful to measure the performance of a data mining technique when the dataset is unbalanced [32]. Values of AROC closer to 1 mean a better classification than those closer to 0.5. This analysis shows the impact of the data mining techniques and the measures on the classification of literary works. These results answer research question one (see Figure 4). Also, the accuracy of classification is presented as additional information; it is computed as (True Positive (TP) + True Negative (TN)) / (TP + False Positive (FP) + False Negative (FN) + TN). The computation of AROC and accuracy is well known for a two-class problem. For a multi-class problem, each class in turn is considered as positive and all the others as negative. This means that TP, TN, FP, and FN are calculated for each class. Therefore, a confusion matrix and an AROC curve are obtained for each class (see [33,34] for more details).
A set of four word co-occurrence networks was obtained for each title. The first network was built using the full text. The second was obtained by deleting numbers and functional words. The third was created by adding a lemmatisation stage after the deletion of numbers and functional words, and the fourth was attained only through a lemmatisation stage (see Figure 4).
The networks obtained by using the full text, by deleting numbers and functional words, by adding a lemmatisation stage after the deletion of numbers and functional words, and through only a lemmatisation stage were then classified as fractal or non-fractal. Thus, research questions two and three are answered.
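The one-versus-rest treatment of a multi-class problem described in this section, in which each genre in turn is considered as positive and all the others as negative, can be sketched in plain Python. The function names and the six toy labels are hypothetical and only illustrate how TP, FP, FN, TN, and the accuracy are obtained per class; they are not results from the corpus.

```python
def one_vs_rest_counts(y_true, y_pred, positive):
    """Return (TP, FP, FN, TN) treating `positive` as the positive class."""
    tp = fp = fn = tn = 0
    for truth, predicted in zip(y_true, y_pred):
        if predicted == positive and truth == positive:
            tp += 1
        elif predicted == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def accuracy(tp, fp, fn, tn):
    """Accuracy = (TP + TN) / (TP + FP + FN + TN)."""
    return (tp + tn) / (tp + fp + fn + tn)


# Hypothetical labels, for illustration only.
y_true = ["essay", "essay", "tale", "novel", "article", "tale"]
y_pred = ["essay", "tale", "tale", "novel", "article", "tale"]

per_class = {
    genre: one_vs_rest_counts(y_true, y_pred, genre)
    for genre in sorted(set(y_true))
}
per_class_accuracy = {g: accuracy(*counts) for g, counts in per_class.items()}
```

With these toy labels one essay is predicted as a tale, so the essay class has one false negative and the tale class has one false positive, while the novel and article classes are classified without error.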

Results and Discussion
Tables 2-5 show the descriptive statistics by literary genre of the four types of networks: the first was built using the full text; the second was built by deleting numbers, punctuation marks, and functional words; the third was built by adding a lemmatisation stage; and the fourth was built through only a lemmatisation stage. They are denoted by the subscripts f, nf, l, and ol, respectively.
Table 2. Mean and standard deviation by genre of node degree (k), betweenness (b), shortest path length (spl), clustering coefficient (cc), nearest neighbourhoods' degree (nnd), fractal dimension (d_b), complexity c(G), the Area Under Box-covering (AUB), and the Area Under the Robustness Curve (AURC) of the networks built using the full text.
Table 5. Mean and standard deviation by genre of node degree (k), betweenness (b), shortest path length (spl), clustering coefficient (cc), nearest neighbourhoods' degree (nnd), fractal dimension (d_b), complexity c(G), the Area Under Box-covering (AUB), and the Area Under the Robustness Curve (AURC) of the networks built only by a lemmatisation stage.
Although spl_nf, cc_nf, d_bnf, and AURC_nf show a significant difference, they do not provide information beyond that provided by the measures of the full-text networks to differentiate the genre. For example, spl_nf is only statistically different for the novel and tale (see Table 6). However, spl_f is statistically different for the novel, essay, and both the research article and tale. Thus, spl_nf, cc_nf, and d_bnf were not included in the set of measures to build data mining models. Table 6 summarises the significant statistical difference for spl_f and spl_nf.

Finally, the one-way ANOVA conducted on the individual influence of essay, tale, novel, and research article on spl_l and AURC_l shows significant effects.
Table 6. The subsets built using the significant statistical differences between spl_f and spl_nf induced by the novel, essay, research article, and tale. The value at the intersection of each row and column is the mean of each measure for a given genre.

After these analyses, spl_f, k_f, nnd_f, cc_f, b_f, d_bf, AURC_f, AUB_f, c(G)_f, spl_l, k_l, nnd_l, cc_l, b_l, d_bl, AURC_l, AUB_l, and c(G)_l were selected to classify the genre of each literary work. This set of measures is a data mining view named DV1, and DV1 was compared with a data mining view named DV2 that contains all the measures computed on the three types of co-occurrence networks described previously. Also, a third data mining view named DV3, which contains only the measures spl, k, nnd, cc, and b obtained from the three types of co-occurrence networks, was tested to show that measures such as d_b, c(G), AUB, and AURC contribute to capturing the features of the literary genre.
Since the influences of the data mining technique and the data mining view on the AROC need to be tested, a two-way ANOVA is appropriate for this purpose, provided the data are normal and homoscedastic [32,35]. However, the AROC generated by our experiments does not meet these assumptions; thus, a Scheirer-Ray-Hare test [36,37] was used instead. The Scheirer-Ray-Hare test shows there is a significant difference among the AROC of the data mining views, H(2) = 21.496, p < 0.001; the data mining techniques, H(3) = 84.79, p < 0.001; and the interaction between both, H(6) = 30.167, p < 0.001. Figure 5 summarises the effects of both the data mining view and the data mining technique on the AROC, which are detailed below.
A Kruskal-Wallis test shows that DV1, DV2, and DV3 affect the median of the AROC: H(2) = 21.496, p < 0.001.
A posthoc Mann-Whitney test using a Dunn-Sidak adjustment [38] (α = 0.0169) shows that the median of DV1 (Mdn = 0.975) is higher than that of DV2 (Mdn = 0.968), U(N_DV1 = 400, N_DV2 = 400) = 68704, z = −3.59, p < 0.001, and that of DV3 (Mdn = 0.955), U(N_DV1 = 400, N_DV3 = 400) = 66131, z = −4.388, p < 0.001. Thus, the statistical analysis carried out on the measures of the three types of networks is useful to select relevant measures that increase the AROC. No statistical difference was found between DV2 and DV3, U(N_DV2 = 400, N_DV3 = 400) = 78117, z = −0.59, p = 0.236. This evidence suggests that the well-known measures used to build DV3 (node degree, shortest path length, betweenness, clustering coefficient, and the average of nearest neighbourhoods' degree), applied in previous research to identify authors' writing styles [3][4][5][6][7][8], are not enough to produce a higher AROC. On the other hand, more complex measures such as d_b, c(G), AUB, and AURC improve the classification.
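The adjusted significance level of the Dunn-Sidak correction can be checked with a short calculation. The sketch below assumes a family-wise level of 0.05, which is not stated explicitly in the text but yields values close to the quoted ones: three pair-wise comparisons among three data mining views, and six pair-wise comparisons among four data mining techniques. The function name is hypothetical.

```python
def dunn_sidak_alpha(family_alpha, comparisons):
    """Per-comparison level: 1 - (1 - alpha)^(1 / m)."""
    return 1 - (1 - family_alpha) ** (1 / comparisons)


# Assumption: family-wise alpha of 0.05.
alpha_views = dunn_sidak_alpha(0.05, 3)       # about 0.01695, close to the quoted 0.0169
alpha_techniques = dunn_sidak_alpha(0.05, 6)  # about 0.00851, close to the quoted 0.0085
```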

Similarly, the Kruskal-Wallis test shows that the data mining technique (NN, SVM, NB, or DT) affects the median of the AROC, H(3) = 84.793. A posthoc Mann-Whitney test using a Dunn-Sidak adjustment [38] (α = 0.0085) shows that the medians of both NN (Mdn = 1.00) and SVM (Mdn = 0.975) were higher than those of NB (Mdn = 0.968) and DT (Mdn = 0.911). See the corresponding row and column of Table 7 for the result of each pair-wise test; e.g., row NB and column NN show a significant difference: U(N_NN = 300, N_NB = 300) = 36784.5, z = −3.995, p < 0.0001. No statistical difference between NN (Mdn = 1.00) and SVM (Mdn = 0.975) was found.
Table 7. Pair-wise Mann-Whitney test using a Dunn-Sidak adjustment (α = 0.0085) among data mining techniques. The intersection of a row and a column presents the result of the test between the two data mining techniques.
To support the conjecture that deleting numbers, punctuation, and functional words does not have a significant effect on the AROC, the NN models based on DV1 and on a fourth data mining view named DV4 were compared. DV4 contains the measures from the networks built using the full text (spl_f, k_f, nnd_f, cc_f, b_f, d_bf, AURC_f, AUB_f, and c(G)_f) and those from the networks built using only a lemmatisation stage (spl_ol, k_ol, nnd_ol, cc_ol, b_ol, d_bol, AURC_ol, AUB_ol, and c(G)_ol). The Mann-Whitney test shows no statistical difference between the AROC of DV1 (Mdn = 1) and DV4 (Mdn = 0.98): U(N_DV1 = 100, N_DV4 = 100) = 4793, z = −0.631, p = 0.528. The accuracy of DV1 and DV4 is 0.93 for both. Thus, the deletion of numbers and punctuation marks does not help to reveal stylistic attributes by genre, as lemmatisation does. Furthermore, all these stages together modify the network fractality, as the evidence presented later suggests. The accuracy of the NN model based on DV1, DV2, DV3, and DV4 is 0.93, 0.90, 0.89, and 0.93, respectively.
To classify each network as fractal or non-fractal, the Akaike Information Criterion (AIC) [39] was computed for the four networks of each title: the first based on the full text; the second obtained by deleting numbers, punctuation marks, and functional words; the third created by adding a lemmatisation stage; and the fourth attained only through lemmatisation. The AIC is useful to classify networks as fractal or non-fractal [40]. To select the better mathematical model, first the AIC for the power (denoted by subscript P) and exponential (denoted by subscript E) models, Equations (5) and (7), was computed; then the minimum value was chosen (AIC_min). ∆AIC_i was computed as AIC_i − AIC_min, where i denotes the power or the exponential model. The AIC's rule of thumb is that the two models are statistically different if ∆AIC is greater than two; thus, the model with ∆AIC = 0 should be selected [41,42]. Table S2 of the supplementary material shows that the difference between ∆AIC_P and ∆AIC_E for about 87% of the full word co-occurrence networks is higher than two; thus, the mathematical model for the relation l vs. N_b(l) computed by the box-covering algorithm for these networks is the power model (see Equation (5)). Although for 13% of the networks a model cannot be reliably selected based on ∆AIC, the power model obtained the lowest value. Thus, most of the full word co-occurrence networks of literary works are fractal. This result supports the fractality found in other languages and in English literature by different mathematical analyses [43][44][45][46]. Notably, selecting the better model based on the adjusted coefficient of determination (R2) is rather difficult.
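The model selection by ∆AIC described above can be sketched in plain Python. This is an illustration under stated assumptions, not the procedure used to produce the supplementary tables: both models are fitted by ordinary least squares on ln N_b(l) (the power model against ln l, the exponential model against l), the AIC is taken in its least-squares form n ln(RSS/n) + 2k with k = 2 parameters, and the function names and synthetic box counts are hypothetical.

```python
import math


def fit_line(x, y):
    """Ordinary least squares for y = a + b * x; returns (a, b, rss)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    rss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    return a, b, rss


def aic_least_squares(rss, n, k):
    """Assumed AIC form for least-squares fits; the floor avoids log(0)."""
    return n * math.log(max(rss, 1e-12) / n) + 2 * k


def classify_box_counts(ls, nbs, threshold=2.0):
    """Label a box-covering curve as 'power', 'exponential' or 'undetermined'.

    Returns (label, delta_aic, d_b), where d_b is the negative slope of
    ln N_b(l) against ln l, i.e. the exponent of the power model.
    """
    log_nb = [math.log(v) for v in nbs]
    n = len(ls)
    _, slope_power, rss_power = fit_line([math.log(l) for l in ls], log_nb)
    _, _, rss_exp = fit_line([float(l) for l in ls], log_nb)
    aic = {
        "power": aic_least_squares(rss_power, n, 2),
        "exponential": aic_least_squares(rss_exp, n, 2),
    }
    aic_min = min(aic.values())
    delta = {model: value - aic_min for model, value in aic.items()}
    best = min(delta, key=delta.get)
    other = "exponential" if best == "power" else "power"
    # Rule of thumb: the models differ only if the other model's delta exceeds two.
    label = best if delta[other] > threshold else "undetermined"
    return label, delta, -slope_power


# Synthetic curve following a power law with exponent 1.8 and small noise.
ls = list(range(1, 9))
nbs = [200 * l ** -1.8 * (1 + 0.01 * (-1) ** l) for l in ls]
label, delta, d_b = classify_box_counts(ls, nbs)
```

Because both fits share the same response variable, their residual sums of squares are directly comparable; a network whose box counts follow the synthetic power law is labelled as fractal, and the estimated d_b is close to the exponent used to generate it.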
Similarly, Table S3 of the supplementary material shows that the difference between ∆AIC_P and ∆AIC_E for about 89% of the word co-occurrence networks built by deleting numbers, punctuation marks, and functional words suggests they are fractal; 2% were classified as exponential, and 9% were undetermined (since ∆AIC ≤ 2). However, adding a lemmatisation stage to the previous ones dilutes the fractality (25.3% are fractal, 33.3% are exponential, and 41.3% are undetermined); see Table S4. The lemmatisation stage alone preserves the fractality of the full-text networks (87% are fractal and 13% are undetermined; see Tables S2 and S5), and it produces no difference in the AROC of the classification of literary works according to their genre. Thus, this supports the conjecture that lemmatisation is a kind of renormalisation of a complex network that preserves the fractality. This paves the way to compare this linguistic renormalisation with that introduced by Song, Havlin [16].

Conclusions
This research aims at showing that measures of the word co-occurrence networks of literary works by Mexican writers classify them according to literary genre. The local measures (such as node degree and the average of nearest neighbourhoods' degree) and the global measures (shortest path length, betweenness, and clustering coefficient), widely used in previous research to identify authors' writing styles, produce acceptable values of AROC. However, more elaborate measures using the fractal dimension, complexity, the AUB, and the AURC show an improvement of AROC. These measures capture the topology based on the minimum number of boxes needed to cover the network, the robustness, and the complexity measured by structural entropy and density. Precisely 87% of the full word co-occurrence networks were classified as fractal. Thus, these findings support the conjecture that fractality occurs in the literary works of Mexican writers, as was previously reported for English literature. Also, the empirical evidence suggests that the lemmatisation of literary works is a renormalisation stage that preserves the fractality of the original text. On the contrary, the deletion of numbers, punctuation marks, and functional words, combined with lemmatisation, dilutes the fractality. The number of literary works included in this study limits the generalisation of this conjecture. Also, it would be interesting for future research to compare the renormalisation induced by a lemmatisation stage (linguistic renormalisation) with the renormalisation of networks based on the box-covering algorithm.
Supplementary Materials: The following are available online at http://www.mdpi.com/1099-4300/22/8/904/s1, Table S1: Title, primary author, and the genre of the literary works; Table S2: The adjusted determination coefficient, AIC, ∆AIC, and classification for the networks based on the full text; Table S3: The adjusted determination coefficient R2, AIC, ∆AIC, and classification for the networks obtained by deleting numbers, punctuation marks, and functional words; Table S4: The adjusted determination coefficient R2, AIC, ∆AIC, and classification for the networks obtained by deleting numbers, punctuation marks, and functional words, and by a lemmatisation stage; and Table S5: The adjusted determination coefficient, AIC, ∆AIC, and classification for the networks obtained only by a lemmatisation stage.
Funding: This research was funded by Instituto Politécnico Nacional, grant number SIP20200169.