Topological Signature of 19th Century Novelists: Persistence Homology in Context-Free Text Mining

Topological Data Analysis (TDA) refers to a collection of methods that find the structure of shapes in data. Although TDA methods have recently been used in many areas of data mining, they have not been widely applied to text mining tasks. In most text processing algorithms, the order in which different entities appear or co-appear is lost. Assuming that this lost order is an informative feature of the data, TDA may play a significant role in closing the resulting gap in the state of the art of text processing. Once extracted, the topology of different entities throughout a textual document may reveal additional information about the document that is not reflected in the features produced by conventional text processing methods. In this paper, we introduce a novel approach that employs TDA in text processing in order to capture and use the topology of same-type entities in textual documents. First, we show how to extract topological signatures from the text using persistent homology, i.e., a TDA tool that captures the topological signature of a data cloud. Then we show how to utilize these signatures for text classification.


Introduction
A common approach in Topological Data Analysis (TDA) is to capture the shape, or the underlying structure of shapes, in data. Using topology in data science is relatively new, though computational topology and computational geometry have existed in applied mathematics for many years. Only recently has TDA been considered as an alternative to conventional machine learning algorithms. Specifically, TDA has been considered for dealing with high-dimensional noisy data sets. Here the common approach is to capture the shapes as the main characteristics of the data and dismiss the rest as noise or irrelevant information. New contributions to TDA often target clustering, dimensionality reduction or descriptive modeling. In other words, wherever the shape and/or the structure of shapes in data is valuable, TDA may provide reasonable solutions.
TDA methods have not been widely applied to natural language processing and, subsequently, text mining. There is no evidence that this is due to any weakness of topology in text processing. Of course, it is not easy to define meaningful shapes in textual documents. Still, TDA may address some of the challenges in text mining. The majority of algorithms in text processing and information retrieval are based on bag-of-words models, which do not consider the order of tokens and the flow of language. There have been efforts in traditional machine learning to include order information in the feature selection step, e.g., by including part-of-speech tags or parts of the parse tree. But these ideas are not enough to capture the real value of order in the text. Thus, there is still a large gap between the capability of current text mining algorithms and the importance of order in text documents. This is exactly where topological data analysis may help: it can be useful for designing more efficient order-preserving algorithms in text analysis.
Here we introduce a novel algorithm that employs TDA tools for text classification. We show how the order (the topology of words) in the text may play a role in classification tasks. To evaluate the capabilities of our idea, we use a set of long novels by different novelists and look for topological features in the graph of the main characters (persons) in each novel. Such graphs are constructed based only on the positions at which each character appears in a novel. For each novel, such a graph is a context-free product of the main characters: it measures only the co-appearance of characters throughout the novel and contains no additional information about each character. Then, comparing it with the graphs of other novels by different novelists, we try to predict the author. Obviously, this is a metaphoric problem statement. Nevertheless, the results might be extended to more applied problems in different applications of text processing. But even before that, it is important to show that topological signatures exist in the text, and that these signatures are worth extracting. Thus, we need a reliable answer to the metaphoric problem statement as a proof of concept.

Fundamental Definitions
In topological data analysis, a data cloud is often viewed in the form of simplicial complexes. Here a simplex may denote a single data point (0-simplex), a line between two data points (1-simplex), a triangle (2-simplex), etc. Generally, a subset consisting of (k + 1) data points is called a k-simplex. Additionally, the (−1)-simplex describes the empty set [1]. A simplicial complex is a set of simplices that is closed under taking subsets: any subset of a simplex in the simplicial complex is also in the simplicial complex. A few instances of k-simplices and an example of a simplicial complex are shown in Figure 1. In data science, high-dimensional data sets often come in the form of a data cloud, a large set of data points in a high-dimensional space. Thus, we need techniques to translate these records into a meaningful topological structure, similar to the visual interpretation that makes a set of close yet discrete points meaningful (e.g., distinguishing a continuous shape out of discrete points). The technique that does this in TDA is called persistent homology [2][3][4][5]. Note that homology refers to the set of holes in the shape at each dimension. One may measure these holes in terms of Betti numbers. These numbers summarize topological properties of a shape while carrying no additional geometric characteristics. The i-th Betti number is defined as the number of i-dimensional holes in a simplicial complex [2,6]. More specifically, $\beta_0$ is the number of connected components, $\beta_1$ is the number of 1-D holes (loops), $\beta_2$ is the number of 2-D voids, etc. Betti numbers for some topological shapes are shown in Figure 2. In our study we focus only on $\beta_0$ and $\beta_1$, i.e., the number of components and the number of loops respectively.
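As a small illustration of $\beta_0$ and $\beta_1$, the following Python sketch computes both numbers for the simplest non-trivial case, a graph viewed as a 1-dimensional simplicial complex (vertices and edges only, no filled triangles). In that special case the number of independent loops is E − V + $\beta_0$. The function name and the toy input are assumptions made for illustration; higher-dimensional complexes require a full homology computation that this sketch does not attempt.

```python
def betti_numbers_of_graph(vertices, edges):
    """Return (beta0, beta1) for a graph seen as a 1-dimensional simplicial complex.

    vertices: list of distinct hashable labels (the 0-simplices).
    edges: iterable of 2-tuples of vertex labels (the 1-simplices).
    """
    parent = {v: v for v in vertices}

    def find(v):
        # Union-find root lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    # Ignore self-loops and duplicate edges so that E counts distinct 1-simplices.
    unique_edges = {frozenset(e) for e in edges if e[0] != e[1]}
    for e in unique_edges:
        a, b = tuple(e)
        parent[find(a)] = find(b)

    beta0 = len({find(v) for v in vertices})            # connected components
    beta1 = len(unique_edges) - len(vertices) + beta0   # independent loops
    return beta0, beta1


# A hollow triangle a-b-c plus an isolated point d:
# two components and one loop.
print(betti_numbers_of_graph(["a", "b", "c", "d"],
                             [("a", "b"), ("b", "c"), ("c", "a")]))
```

If the triangle were filled in as a 2-simplex, the loop would no longer count as a hole, which is exactly the distinction that the graph-only formula cannot express.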
Persistent homology is a tool in topological data analysis that captures the topological signature of a data cloud. Decreasing the spatial resolution, one may connect all data points that are close enough to each other. In this way the points may form a loop. Eventually, as the data points appear closer to each other, a subset of k + 1 different points may get close enough to each other to be treated as a k-simplex, which contains no loop (no hole). We may require every two points in the simplex to be within a fixed range (radius) of each other. As this radius increases gradually, many loops (or equivalently holes) may appear and disappear in each dimension. The persistence diagram captures the birth and the death of all holes in a certain dimension [2]. Alternatively, the birth and the death of holes can be shown with barcodes, where the lifetimes of holes are captured and plotted as one-dimensional bars [7,8]. An example of these barcodes is shown in Figure 3. Here the information structure based on thresholding distances is called Rips filtration [9]. In a Rips complex, any k-simplex consists of k + 1 nodes whose pairwise distances are less than or equal to the threshold. The notion of decreasing spatial resolution suggests that the underlying metric is Euclidean distance, although it can easily be replaced with other metrics. Nonetheless, many popular filtrations in TDA follow the same logic. For a comprehensive review of the concepts in TDA, we refer the reader to [4,6,10].
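For dimension zero, the birth and death of components under a Rips filtration can be traced with a few lines of code. The Python sketch below is illustrative only: the function name and the toy distance matrix are assumptions, and it covers connected components only (loops in dimension one need a proper persistent homology library). Every point is born at radius 0, and a component dies at the threshold at which it merges into another one.

```python
def h0_barcode(dist):
    """Return the dimension-0 bars (birth, death) of a Rips filtration.

    dist: symmetric distance matrix given as a list of lists.
    An edge (i, j) enters the filtration when the threshold reaches dist[i][j].
    """
    n = len(dist)
    if n == 0:
        return []
    parent = list(range(n))

    def find(v):
        # Union-find root lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    # Process edges in order of increasing length (increasing threshold).
    edges = sorted((dist[i][j], i, j) for i in range(n) for j in range(i + 1, n))
    bars = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj        # two components merge at threshold d
            bars.append((0.0, d))  # one of them dies here
    bars.append((0.0, float("inf")))  # the last component never dies
    return bars


# Three points on a line at positions 0, 1 and 3.
toy = [[0.0, 1.0, 3.0],
       [1.0, 0.0, 2.0],
       [3.0, 2.0, 0.0]]
print(h0_barcode(toy))
```

Long bars correspond to components that stay separate over a wide range of thresholds; short bars are usually read as noise.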

Related Work
Wagner et al. in [11] used a flag complex over the vector space representation of a corpus to compute similarities among documents. The authors mostly focused on the efficiency of computing homology on the high-dimensional sparse matrices found in text mining. They proposed a novel path: using cosine distance as the distance measure, thresholding the distance, and changing the threshold gradually to detect all the complete sub-graphs (called cliques) in the resulting graph. A flag complex is the set of all complete sub-graphs of the graph. Here, discrete Morse theory was used to compress the flag complexes and compute Betti numbers.
Zhu in [12] introduced a new application of persistent homology in text mining and suggested a novel text representation. The methodology begins by dividing a document into a fixed number of blocks of text and calculating a vector representation (tf-idf) for each block. Then an undirected graph is constructed based on the cosine similarity of each pair of blocks. Each block is represented as a node, and an edge between two nodes represents their relation if and only if the cosine similarity between the corresponding blocks exceeds a certain threshold. Sweeping the threshold over the range between zero and the global maximum of the similarities, one can quantify the changes in the resulting graph via persistent homology. The author focused on $\beta_0$ and $\beta_1$ only, to study the number of clusters and holes. Intuitively, the number of holes (or equivalently the number of loops) is a good sign of "tie-back" in the document, and the persistence diagram or barcode of $\beta_1$ may reveal it. Here the information structure (called Similarity Filtration) follows the same logic as Rips filtration, with the angular distance between nodes (text blocks) as the metric. Since this method ignores the order of blocks in the document, the author injects the order into the model by assuming that there is an edge between each pair of subsequent text blocks, no matter what the threshold is. This assumption forces the model always to consider the order of text blocks. The modified filtration is called Similarity Filtration with Time Skeleton (SIFTS). The method was applied to nursery rhymes as ideal repetitive documents and, in contrast, to some other stories. SIFTS was also applied to a set of writings consisting of two major groups, child writing and adolescent writing. Since the significant difference between the groups might be the result of longer writings in the adolescent group, the process was repeated for this group with writings truncated to an appropriate length. The difference in the number of holes was still significant, though the smallest angular threshold for the appearance of holes was no longer significantly different between the groups.
Guan et al. in [13] proposed an unsupervised method of key-phrase extraction that prunes the semantic graph of candidate phrases via homology analysis. Usually in document summarization for clustering, classification or information retrieval, a relatively large set of candidate phrases is pruned and ranked using supervised approaches. The authors constructed the graph of candidate phrases by connecting the phrases that share at least a certain portion of common tokens (words). Then a topological collapsing algorithm [14] is used to remove dominated vertices iteratively. Note that a vertex $v_i$ is dominated by $v_j$ if and only if the set of neighbors of $v_i$ is a subset of the set of neighbors of $v_j$. Intuitively, if all the neighbors of a vertex (phrase) are connected to another vertex, we may assume that the latter phrase can explain the same concept better, or at least no worse. The authors evaluated the method on the SemEval-2010 corpus and the NUS corpus and reported higher performance (precision, recall and F1) than the traditional methods.
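The iterative removal of dominated vertices can be sketched in Python as follows. This is a hedged illustration of the domination rule stated above, not the implementation used in [13] or [14]. It assumes closed neighborhoods (each vertex is counted among its own neighbors), a common convention for such collapses; the function name, the removal order and the toy graphs are also assumptions.

```python
def collapse_dominated(adjacency):
    """Iteratively remove dominated vertices and return the surviving vertex set.

    adjacency: dict mapping each vertex to the set of its neighbours
               (undirected graph, so the relation is assumed symmetric).
    Assumption: v is dominated by u when N[v] is a subset of N[u], where
    N[x] is the closed neighbourhood (x together with its neighbours).
    """
    adj = {v: set(nbrs) for v, nbrs in adjacency.items()}
    changed = True
    while changed:
        changed = False
        for v in sorted(adj):  # fixed order keeps the result deterministic
            closed_v = adj[v] | {v}
            if any(u != v and closed_v <= (adj[u] | {u}) for u in adj):
                for u in adj[v]:
                    adj[u].discard(v)  # detach v from its neighbours
                del adj[v]
                changed = True
                break  # restart the scan on the reduced graph
    return set(adj)


# A path a-b-c collapses to a single vertex; a 4-cycle has no dominated vertex.
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
cycle = {"a": {"b", "d"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"a", "c"}}
print(collapse_dominated(path))
print(sorted(collapse_dominated(cycle)))
```

The cycle survives because its loop is a genuine topological feature, whereas the path carries no hole and reduces to one representative vertex.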
There are a few other works in which topological data analysis has been applied to natural language processing and text mining. Chiang in [15] suggested that quantifiers over simplicial complexes stand as an efficient and effective clustering algorithm for high-dimensional data such as the vector space representation of a corpus. Torres-Tramón et al. in [16] developed a topic detection method for Twitter data. The authors employed the Mapper algorithm [17] to map the vector space (i.e., the term frequency matrix) to a sequence of graph representations. Then the most frequent features in the most connected components are retrieved as the best candidates for interesting topics. Zadrozny and Garbayo in [18] introduced a sheaf model to distinguish contradictions and disagreements among textual documents. They assumed that the underlying theories in documents may have shared quantifiers and predicates which can construct partial orders (a sheaf model). In such partial orders, the global sections and local sections determine the level of disagreement.

Methodology
More than by any text processing literature, our work is inspired by recent developments in the application of TDA to time series analysis [19][20][21][22][23]. In these works persistent homology is usually employed to study the changes in the topology of d-dimensional time series or of the delay embedding of 1-dimensional time series. In many of these studies, if we replace a time series (a sequence of data) with a long text document (another sequence of data), the idea might still be valid, though the application has changed. Recall that even in the first attempt to apply TDA in the area of natural language processing [12], Zhu tries to consider the order of text blocks, though the defined order is forcibly injected into the model. Still, the main idea of using TDA to capture "the order" in the text is quite novel. Looking at text documents as sequences of different entities may enable us to capture the ordered appearances and, more importantly, the co-appearances of implied or directly mentioned entities (e.g., names, topics, etc.). This is where TDA may act as the interpreter of ordered data. Moreover, the same techniques as in topological signal processing are applicable here.
To provide an example of how TDA may capture the order in text documents, we chose 75 novels from Project Gutenberg by six novelists of the romanticism era in the nineteenth century. We propose a novel method that uses persistent homology to predict the author based only on the graph of the main characters in the novel. The list of authors and the number of books that we used from each one can be seen in Table 1. For each book we downloaded the text version and removed extra information such as metadata and the table of contents. To extract the appearances of characters throughout the novel, we used the named entity recognizer (NER) of the Stanford CoreNLP API [24]. The books were split by sentence, tokenized, and annotated with named entity tags. Then we extracted each entity tagged as "PERSON" together with its position in the book, based on its order in the list of all tokens in the document. Through this process we created, for each book, a list consisting of every character in the novel and the places where that character appeared, in order of appearance. To reduce the noise, we kept only the indices of the 10 most important (i.e., the most frequent) characters in each novel.
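The bookkeeping in this extraction step can be sketched in Python as below. The sketch does not call CoreNLP; it assumes, purely for illustration, that the tagger output has already been flattened into a list of (token, tag) pairs, and it treats each PERSON-tagged token as a character name. The function name and the toy sentence are hypothetical.

```python
from collections import Counter, defaultdict


def top_character_positions(tagged_tokens, k=10):
    """Map each of the k most frequent PERSON tokens to its token positions.

    tagged_tokens: list of (token, ner_tag) pairs in document order
                   (an assumed, simplified form of the tagger output).
    """
    positions = defaultdict(list)
    for index, (token, tag) in enumerate(tagged_tokens):
        if tag == "PERSON":
            positions[token].append(index)  # position = order among all tokens
    counts = Counter({name: len(idx) for name, idx in positions.items()})
    # Keep only the k most frequently mentioned characters.
    return {name: positions[name] for name, _ in counts.most_common(k)}


tagged = [("Emma", "PERSON"), ("met", "O"), ("Harriet", "PERSON"),
          ("and", "O"), ("Emma", "PERSON"), ("smiled", "O")]
print(top_character_positions(tagged, k=1))
```

Multi-token names and aliases of the same character would need extra handling that this sketch leaves out.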
To define the distance between two characters in a novel (e.g., character A and character B), we use the sets of indices in the novel where each of them appears. Let $I$ denote the indices for character A and $J$ denote the indices for character B, as in Equation 1:

$$I = (i_1, i_2, \ldots, i_n), \qquad J = (j_1, j_2, \ldots, j_m) \tag{1}$$

Assuming $m \ge n$, to have an equal number of indices we use a subset of the indices in $J$; let $J^*$ denote the chosen indices of $J$. In the algorithm that chooses the elements of $J^*$, we try to minimize the distance between the elements of $I$ and $J^*$ at the same positions, as in Equation 2. The constraint there guarantees that exactly $n$ unique indices are chosen from $J$.
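One possible way to carry out this selection is a small dynamic program, sketched in Python below. Since the exact objective of Equation 2 is not reproduced here, the sketch makes two loud assumptions: the cost is the sum of absolute differences between elements at the same position, and the chosen indices keep their original order in $J$. The function name and the toy indices are hypothetical.

```python
def choose_aligned_subset(I, J):
    """Pick len(I) indices from J (order preserved) that stay close to I position-wise.

    Assumed objective: minimise sum_k |I[k] - J_star[k]| over all
    order-preserving choices of n = len(I) distinct elements of J.
    """
    n, m = len(I), len(J)
    if n > m:
        raise ValueError("J must have at least as many indices as I")
    INF = float("inf")
    # best[a][b]: minimal cost of matching I[:a] with a elements chosen from J[:b].
    best = [[INF] * (m + 1) for _ in range(n + 1)]
    for b in range(m + 1):
        best[0][b] = 0
    for a in range(1, n + 1):
        for b in range(a, m + 1):
            skip = best[a][b - 1]                                # J[b-1] not used
            take = best[a - 1][b - 1] + abs(I[a - 1] - J[b - 1])  # J[b-1] paired with I[a-1]
            best[a][b] = min(skip, take)
    # Backtrack to recover which elements of J were chosen.
    chosen = []
    a, b = n, m
    while a > 0:
        if best[a][b] == best[a][b - 1]:
            b -= 1
        else:
            chosen.append(J[b - 1])
            a -= 1
            b -= 1
    chosen.reverse()
    return chosen


print(choose_aligned_subset([10, 50, 90], [5, 12, 48, 70, 95]))
```

Each element of J is used at most once, so exactly n unique indices are returned, which mirrors the role of the constraint described above.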
Now we can define the distance between character A and character B as in Equation 3. Here $\tilde{I}$ and $\tilde{J}$ are the normalized versions of $I$ and $J^*$ respectively, where each element of $I$ and $J^*$ is divided by the novel length, i.e., the total number of words in the novel. We use the expanded or contracted versions of $\tilde{I}$ and $\tilde{J}$, in which each element is raised to the power $p$; these are denoted $\tilde{I}^{(p)}$ and $\tilde{J}^{(p)}$ respectively. In the equation, $WD_{0.5}$ is the Wasserstein distance of order 0.5. Note that for $t = 0$ the function measures the distance between the original vectors $\tilde{I}$ and $\tilde{J}$. The order 0.5 makes the function more sensitive to the smaller element-wise distances. In addition, using different values of the parameter $t$ enables us to focus on closer characters at the beginning of the novel ($t > 0$) or closer characters at the end of the novel ($t < 0$). This may reveal different topological signatures of each novelist.
Using the pairwise distances among characters in each novel, we applied Rips filtration and constructed the persistence diagrams for each novel. For each novel, using three different choices of $t$ ($t = 0$, $-\epsilon$, and $+\epsilon$) lets us consider different scenarios and cover almost all of the topological characteristics obtainable for any choice of $t$. In practice, we used $\epsilon = 0.1$ and constructed three persistence diagrams for each novel based on the distances defined in Equation 3. To build the persistence diagrams and compute quantities over them, we used the R package TDA [25]. Some samples of persistence diagrams (where $t = 0$) are shown in Figure 4. Having the persistence diagrams for all the novels, each time we select two novelists and calculate the distances among all the novels we have for those two novelists. To quantify the difference between each pair of novels, we can simply use their persistence diagrams. To measure the difference between two persistence diagrams, we used the Wasserstein distance [26] of order one at dimension one and dimension zero. Then the distance between two novels (e.g., X and Y) is defined by Equation 4.
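The character-level distance described for Equation 3 can be sketched in Python under stated assumptions. The sketch assumes that the order-q distance between two equal-length index vectors is computed element-wise as $(\frac{1}{n}\sum_k |a_k - b_k|^q)^{1/q}$ with q = 0.5, and it takes the exponent p as an explicit argument because the mapping from $t$ to $p$ is not spelled out here. The function name, the argument names and the toy values are hypothetical.

```python
def character_distance(I, J_star, novel_length, p=1.0, q=0.5):
    """Distance between two characters from their aligned appearance indices.

    I, J_star: equal-length sequences of token positions.
    novel_length: total number of tokens, used to normalise positions to [0, 1].
    p: exponent applied to the normalised positions (p = 1 leaves them unchanged).
    q: order of the element-wise distance (assumed form, see lead-in).
    """
    if len(I) != len(J_star) or len(I) == 0:
        raise ValueError("I and J_star must be non-empty and of equal length")
    a = [(i / novel_length) ** p for i in I]
    b = [(j / novel_length) ** p for j in J_star]
    mean_term = sum(abs(x - y) ** q for x, y in zip(a, b)) / len(a)
    return mean_term ** (1.0 / q)


# Two characters that appear close together early on and far apart later.
print(character_distance([100, 5000], [120, 9000], novel_length=10000))
```

With q below one, many small gaps between paired appearances lower the distance more than one large gap raises it, which matches the stated sensitivity to the smaller element-wise distances.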
As mentioned before, we have three different persistence diagrams for each novel, based on three different choices of $t$. Also, each of these diagrams covers two dimensions, 1 and 0. Let $PD^1_t$ and $PD^0_t$ denote the persistence diagrams based on the distances with parameter $t$ at dimension 1 and dimension 0 respectively. In other words, $PD^1_t$ only considers loops and $PD^0_t$ only considers components. Then let $WD$ denote the Wasserstein distance of order one between two persistence diagrams.

Results and Discussion
We used a 5-Nearest Neighbors (5-NN) algorithm with 10-fold cross validation to predict the authors of the novels. Note that we used balanced subsets of novels for cross validation. Since we had a different number of novels for each novelist, for the binary classification of novels by each pair of novelists we used a relatively large number of iterations (n = 250). In each iteration, a sample of novels by the novelist who had more novels was randomly chosen to obtain a balanced set of novels (where the class ratio is one). Then we used the balanced set for the 10-fold cross validation. Because the classes are balanced, chance-level accuracy is 50%, so when n is large enough any accuracy higher than 50% is statistically significant. The choice of 5 nearest neighbors comes from our initial experiments. In our 5-NN algorithm, we set the weight of each neighbor's vote inversely proportional to the square of its distance. Intuitively, we may be unable to identify a lifelong writing style for a writer. However, it is a safe assumption that novelists usually repeat the style of one particular novel in a few other novels, so the algorithm is capable of finding valid neighbors. Note that here the style refers only to the relations among novel characters.
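The voting rule can be sketched in Python as follows, given precomputed distances from a query novel to the labeled novels. The function name, the small constant that guards against division by zero and the toy numbers are assumptions added for illustration; the cross-validation loop and the balancing step are omitted.

```python
from collections import defaultdict


def weighted_knn_predict(distances, labels, k=5, eps=1e-12):
    """Predict a label by distance-weighted voting among the k nearest items.

    distances[i]: distance from the query novel to labeled novel i.
    labels[i]: author of labeled novel i.
    Each neighbour votes with weight 1 / distance**2 (eps avoids division by zero).
    """
    nearest = sorted(range(len(labels)), key=lambda i: distances[i])[:k]
    votes = defaultdict(float)
    for i in nearest:
        votes[labels[i]] += 1.0 / (distances[i] ** 2 + eps)
    return max(votes, key=votes.get)


# One very close novel by author "A" outweighs four moderately close novels by "B".
dists = [0.1, 0.9, 1.0, 1.1, 1.2, 5.0]
authors = ["A", "B", "B", "B", "B", "A"]
print(weighted_knn_predict(dists, authors))
```

This weighting fits the intuition above: a single novel with a very similar character structure is treated as stronger evidence than several loosely similar ones.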
Table 1 shows the accuracy of each binary classification task in percentages. Over a total of around 69,000 predictions on our small data set of 75 novels, the average accuracy was 77.0%. As Table 1 suggests, it is sometimes easy to distinguish between the topological signatures of writers, e.g., Dostoyevsky vs. Austen. On the other hand, in some cases two different writers may have similar signatures, e.g., Dostoyevsky vs. Scott, as shown in Figure 4.
We did not use co-reference resolution for the entity detection task in our work. We believe that the state of the art in co-reference resolution is not helpful enough to be employed in our algorithm. Note that for a character in a novel, we retrieved only a portion of the appearance indices, since we lost many of the co-references. As a result, in our entity detection the precision for each entity (i.e., character) is substantially higher than the recall. This is exactly why we chose not to use any co-reference resolution algorithm: our method is far more sensitive to precision than to recall in the entity detection phase. Wherever an entity is indirectly implied (e.g., by a pronoun), there is a high probability that the same entity (character) is directly mentioned a few lines before or after. Recall that an advantage of topology is that it is not very sensitive to the choice of metric. In other words, small changes in the distances only affect geometric properties that are not of much use for our study. Note that for co-reference resolution tools precision is usually higher than recall [27]. Still, a co-reference resolution algorithm, even with an accuracy of around 90%, may easily harm our model: for each entity, the precision of entity detection will decrease, the distance functions may experience a large shift, and eventually the topological signature will be lost.
Here we discuss the existence of a topological signature in the text. We chose a metaphoric example to show how these signatures may be extracted from novels. The question, however, is whether we can extend the results to more applied areas. Even if the answer is yes, there might still be some concern about the computational cost, which is beyond the scope of this research. Last but not least, in terms of accuracy, for some applications these topological features might be substantially weaker than the features that traditional text processing provides. But intuitively, these features may still carry additional information that is lost in traditional text mining. Thus, one opportunity is to use topological features in addition to the other features. This sounds reasonable at least when the computational cost is not a major concern.

Conclusion
As we have shown in our results, a topological signature of the writer exists in the writing. For our contribution, we worked only on names (characters) and used a context-free graph of the characters' relations to capture topological features. The novel algorithm we used is by definition robust to translation and even to the length of text documents. However, the accuracy is not high enough to use the algorithm directly in a real-world application. Yet the existence of a topological signature in the context-free graph is much more important than the accuracy itself in this particular task. Choosing different model parameters or definitions of the distance functions may improve the accuracy of this algorithm. There are also other directions that may improve our algorithm. One may use other features (e.g., topics, concepts, parts of speech, etc.) in addition to the persons to arrive at a more precise algorithm. The other possibility is to use topological features as additional variables (alongside the other features) when running a model.

Figure 2. Betti numbers for a single point, a circle, a sphere and a torus. In a k-dimensional space, the n-th Betti number is always zero for any n ≥ k.

Figure 3. A simple data cloud (left) with its persistence diagram at dimension one, which illustrates the birth and the death of loops (middle), and the equivalent barcode representation (right).

Figure 4. Persistence diagrams of the graphs of characters in different novels.

Table 1. Average accuracy of binary classification, given a labeled set of novels by two novelists and using 10-fold cross validation. The numbers in parentheses are the total number of novels for each novelist. The accuracy values are in percentages.