Tensor-Based Semantically-Aware Topic Clustering of Biomedical Documents

Abstract: Biomedicine is a pillar of the collective, scientific effort of human self-discovery, as well as a major source of humanistic data codified primarily in biomedical documents. Despite their rigid structure, maintaining and updating a considerably-sized collection of such documents is a task of overwhelming complexity, mandating efficient information retrieval and, to this end, the integration of clustering schemes. The latter should work natively with inherently multidimensional data and higher order interdependencies. Additionally, past experience indicates that clustering should be semantically enhanced. Tensor algebra is the key to extending the current term-document model to more dimensions. This article proposes an alternative keyword-term-document strategy whose algorithmic cornerstones are third order tensors and MeSH ontological functions. It is based on scientometric observations that keywords typically possess more expressive power than ordinary text terms. This strategy has been compared against a baseline using two different biomedical datasets, the TREC (Text REtrieval Conference) genomics benchmark and a large custom set of cognitive science articles from PubMed.


Introduction
Cognitive science, namely the study of the mind and its processes [1][2][3], has recently gained significant momentum, which can be attributed to a number of reasons. It is a major driver of the big data age along with online social networks, the semantic web and computational systems theory, to name a few. Recent sociological and demographic studies conducted in the majority of the Western world, including the EU [4] and the U.S. [5], reveal that one of the biggest challenges of the next decade will be the healthcare costs associated with cognitive issues. Similar trends at the planetary scale can be found in reports compiled by the UN Population Division [6][7][8]. Thus, the systematic analysis of cognitive science literature is of immediate interest not only to researchers, but also to healthcare planners, government agencies, hospital administrators, insurance companies, equipment manufacturers and software developers.
With the creation of PubMed (http://www.ncbi.nlm.nih.gov/pubmed) in 1996, the largest public online database under the ultimate administrative oversight of NIH, a massive collection spanning millions of life science articles is available to researchers. Over the past two decades, it has been substantially enriched and currently contains more than fourteen million abstracts, whereas it accepts and serves more than seventy million queries of six terms each on average per month. Indicative of its enormous topic diversity is the fact that through Entrez (French for "enter"), the PubMed-coupled ...

Notation: the notation used in this article includes the tensor product along the $k$-th dimension ($\times_k$), the Frobenius tensor or matrix norm ($\|\cdot\|_F$), the vector with $n$ entries of 1, the indicator function for a predicate $p$, the harmonic mean $H(s_1, \ldots, s_n)$ of the values $s_1, \ldots, s_n$, as well as the expected value $E[X]$, the variance $\mathrm{Var}[X]$, the skewness $\kappa_3[X]$ and the kurtosis $\kappa_4[X]$ of a random variable $X$.

Previous Work
Document clustering has gained much interest in biomedicine [12]. PubMed abstracts are clustered with frequent words and near terms in [13]. A graph algorithm based on flow simulation is considered in [14], whereas advanced techniques are proposed in [15].
Biomedical ontologies in conjunction with mining of biomedical texts led to the technique of word sense disambiguation (WSD), which maps documents to different topics. Ontologies and meta-data assist the clustering algorithms [16]. Event-based text mining systems in the context of biomedicine as an annotation scheme are the focus of [17]. On the other hand, domain-specific information extraction systems regarding event-level information with automatic causality recognition are proposed in [18]. Human gene ontologies are described in [19,20]. U-Compare, an integrated text mining and NLP system based on the Unstructured Information Management Architecture (UIMA) framework (UIMA: http://uima.apache.org/), is presented in [21].
Using the MeSH ontology for biomedical document clustering is popular in scientific literature [22][23][24][25]. Various clustering approaches such as suffix tree clustering were supplemented with ontological information in [26], whereas the accuracy of similarity metrics is discussed in [27]. A knowledge domain scheme based on bipartite graphs with MeSH is presented in [28]. Two serious limitations facing approaches that use the MeSH thesaurus are discussed in [29].
Tensor algebra [30,31] and the closely-associated field of multilayer graphs [32,33] are some of the primary algorithmic tools for dealing with higher order data, along with higher order statistics [34,35] and multivariate polynomials [36,37]. The Tucker and Kruskal tensor forms [38] hold a central place in tensor algebra, as they allow alternative tensor representations appropriate for certain linear algebraic operations such as tensor-matrix multiplication, tensor compression [39], tensor regularization and factor discovery. Models for tensor data mining have been outlined in [40]. A very recent work combining tensors and semantics for medical information retrieval is [41].

Architecture
The proposed system architecture is shown in Figure 1. The interaction between the various components has been kept to a minimum, and feedback loops have been avoided. However, in future versions, the tensor can be updated either incrementally or in batch mode with information extracted from the queries.

Python Tools
Python is well known in the developer community for its rich library ecosystem. The objective of the Entrez document retrieval system is to provide a single entry point for seamless and efficient access across those health-related public databases that are under NIH administrative supervision, including, among others, PubMed, MEDLINE, preMEDLINE and the NCBI database. Thus, Entrez is the key to a vast body of medical knowledge through advanced text queries. As a consequence, APIs for Entrez have been implemented for most, if not all, major programming languages, such as the NCBI API for Java, the NCBI Toolkit for C++ and Biopython for Python. The functionality of each Entrez API and the associated library should include at least methods for retrieving articles based on keywords, terms, authors, DOI (Digital Object Identifier) or the unique PubMed identifier, as well as for providing pointers to related documents or supplementary data and traversing document lists in both directions.
The native document format supported by Entrez is XML (ftp://ftp.ncbi.nlm.nih.gov/bioproject). The latter, being structurally balanced, semantically enriched with tags and properties, and possessing a strict tree hierarchy, is particularly suited to parsing techniques such as those found in the Xerces family of Java parsers. Moreover, the highly structured XML format is appropriate for graph databases, such as TitanDB and Neo4j (https://neo4j.com) or, with appropriate conversion to JSON, for document databases, such as MongoDB. Table 2 contains the XML tags described in the public Entrez XML schema. An XML schema is one of the two means for formatting an XML document in a tree structure, the other one being DTD (Document Type Definition) [42]. Generally, a schema is preferred because of its increased flexibility, being itself written in XML. In contrast, DTD is based on a terse and restricted SGML syntax, which provides compatibility with the SGML standard at the expense of a steeper learning curve [43].
Biopython is one such PubMed API aiming at providing seamless and fully-fledged Entrez functionality, including document retrieval in a multitude of ways. Table 3 summarizes the methods that are associated with the basic Entrez functionality. Once Biopython has been installed through pip or another Python package manager, it can be invoked as follows:
>>> from Bio import Entrez
>>> Entrez.email = 'name@domain.org'   # contact address sent along with each request
>>> EntryPoint = Entrez.einfo()        # handle on the list of available Entrez databases
>>> XMLArticle = EntryPoint.read()     # raw XML response
>>> EntryPoint.close()
Key ontological MeSH operations such as search and least common ancestor location can be automated with NLTK (http://www.nltk.org), a common library for natural language processing. Moreover, NLTK has been integrated with additional functionality for word- and sentence-level syntactic analysis, term similarity metrics, including the Wu-Palmer [44], the Leacock-Chodorow [45] and the Jiang-Conrath [46] metrics, and methods for sub-thesaurus construction and maintenance. For instance, using NLTK and the term cognitive, the entries of Table 4 were located in the MeSH ontology. Notice that the entries of Table 4 are located at very different levels of the MeSH tree hierarchy, ranging from a high level of abstraction such as F03.615 down to very specialized issues like F04.096.628.255.500. Thus, subsequent searches started at high abstraction levels such as F02.463 and H01.158, which were identified by pruning the MeSH identifiers to their first two segments.
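The pruning of MeSH identifiers to their first two segments can be illustrated with plain string handling. The following is a minimal sketch only: the helper name prune_mesh_id is hypothetical and is not part of NLTK or of the implementation described here, and the two tree numbers are the ones quoted above.

```python
def prune_mesh_id(tree_number, segments=2):
    """Keep only the first `segments` dot-separated parts of a MeSH tree number."""
    return ".".join(tree_number.split(".")[:segments])

# Tree numbers quoted in the text; a two-segment identifier is left unchanged.
entries = ["F03.615", "F04.096.628.255.500"]
roots = sorted({prune_mesh_id(e) for e in entries})
print(roots)
```

Collecting the pruned identifiers in a set removes duplicates, so each high-level branch is searched only once.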
The following code segment displays how NLTK can parse a simple sentence.
>>> import nltk as nl
>>> from stemming.porter2 import stem
>>> from nltk.corpus import stopwords   # a from-import needs the package name, not the alias
>>> print(nl.word_tokenize('Hello world!'))

Tensor Toolbox
Tensor Toolbox is a recent MATLAB toolbox by Sandia National Laboratories for direct support of tensors and certain associated key functions [30]. Although MATLAB has inherently supported multidimensional arrays since its earliest editions, Tensor Toolbox offers considerably more flexibility, a set of new and equivalent tensor types, including natural, compressed, Tucker and Kruskal forms, and a broad set of methods for handling these primary data types. These primary data types are respectively denoted as tensor, sptensor, ttensor and ktensor, constituting an important semantic difference compared to the default MATLAB approach, which treats all multidimensional arrays as ordinary matrix types. Since tensors naturally represent keyword-term-document triplets, Tensor Toolbox is an indispensable software component of our implementation. Provided the Tensor and the Communications Systems toolboxes have been properly installed, the following MATLAB commands populate a sparse third order tensor and store it in Tucker form:
It is obvious from the way tensors are defined that they are direct generalizations of matrices. Indeed, a matrix $M \in \mathbb{R}^{I_1 \times I_2}$ is the linear algebraic vehicle for coupling the row space $\mathbb{R}^{I_1}$ and the column space $\mathbb{R}^{I_2}$. Of course, for square and invertible matrices, the row and column spaces coincide.
Tensor $G$, which will contain properly-defined values for the keyword-term-document triplets, is populated by $I_k$ keywords and $I_t$ terms stored in $I_d$ documents, making it a third order tensor $G \in \mathbb{R}^{I_k \times I_t \times I_d}$. Note that the proposed $G$ is but one of the ways for extending the established term-document model, which is based on a second order tensor, namely a matrix $M \in \mathbb{R}^{I_t \times I_d}$. For instance, in [47], a term-author-document model is proposed. The latter is based on empirical scientometric evidence in favor of the semantic role authors play in the process of information retrieval [10,11]. A common point with the proposed model, besides both relying on third order tensors, is that they are inspired by OLAP (Online Analytical Processing) cubes [48]. Regarding the tensor dimensions, it should be noted that, although three dimensions are easy to visualize and handle, they are by no means a golden rule.
As stated earlier, $G \in \mathbb{R}^{I_k \times I_t \times I_d}$, essentially the algebraic cornerstone of the proposed technique, is a third order and real valued tensor simultaneously coupling the keyword, term and document spaces. The entries of $G$ are associated with the document retrieval process. Concretely, let $k[i_1]$, $t[i_2]$ and $d[i_3]$ respectively denote the $i_1$-th keyword, the $i_2$-th term and the $i_3$-th document, where $1 \leq i_1 \leq I_k$, $1 \leq i_2 \leq I_t$ and $1 \leq i_3 \leq I_d$. The entry $G[i_1, i_2, i_3]$ is assigned according to the following four factor double tf-idf (term frequency-inverse document frequency) scheme: In Equation (1), the first pair of terms $p_k[i_1, i_3]$ and $q_k[i_1, i_3]$ forms a standard tf-idf scheme based only on keywords and documents: while the second pair of terms $p_t[i_2, i_3]$ and $q_t[i_2, i_3]$ constitutes the second tf-idf scheme:
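Since the scheme combines a keyword-document tf-idf pair with a term-document tf-idf pair, populating a sparse keyword-term-document tensor can be sketched in Python as follows. This is a sketch under stated assumptions rather than the exact formula of the article: it assumes that the four factors are multiplied, that the tf factor is the relative frequency within a document and that the idf factor is the logarithm of the collection size over the document frequency. The function names and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(token_lists):
    """Per-document {token: tf * idf} weights.

    Assumed textbook factors (not taken from the article):
    tf = count / document length, idf = log(N / document frequency).
    """
    n_docs = len(token_lists)
    df = Counter(tok for tokens in token_lists for tok in set(tokens))
    return [
        {tok: (cnt / len(tokens)) * math.log(n_docs / df[tok])
         for tok, cnt in Counter(tokens).items()}
        for tokens in token_lists
    ]

def build_tensor(docs):
    """Sparse tensor {(keyword, term, doc index): value}; zero entries are skipped."""
    kw = tf_idf([d["keywords"] for d in docs])
    tm = tf_idf([d["terms"] for d in docs])
    tensor = {}
    for i3, (kw_w, tm_w) in enumerate(zip(kw, tm)):
        for k, wk in kw_w.items():
            for t, wt in tm_w.items():
                if wk * wt != 0.0:
                    # Assumption: the four factors are combined by multiplication.
                    tensor[(k, t, i3)] = wk * wt
    return tensor

# Toy collection (hypothetical data).
docs = [
    {"keywords": ["fMRI", "memory"], "terms": ["brain", "scan", "memory"]},
    {"keywords": ["EEG", "memory"], "terms": ["brain", "signal"]},
]
G = build_tensor(docs)
for index, value in sorted(G.items()):
    print(index, round(value, 4))
```

Under these assumptions, a keyword or term that occurs in every document receives a zero idf factor and therefore contributes no entries, which keeps the tensor sparse.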

Analytics
Tensor density, like sparsity for large matrices, is a significant metric which, besides indicating compression potential, may reveal interesting patterns along many dimensions. Its definition is straightforward.
Definition 2. The density $\rho$ of a tensor $T$ is defined as the ratio of the number of its non-zero elements to its total number of elements, which can be easily found by multiplying the size of each dimension. Thus:
$$\rho = \frac{|\{(i_1, \ldots, i_p) : T[i_1, \ldots, i_p] \neq 0\}|}{I_1 I_2 \cdots I_p}$$
Definition 3. Along similar lines, the log density of $T$ is defined as the ratio of the logarithm of the number of its non-zero elements to the logarithm of its total number of elements, essentially being the ratio of the magnitudes of the respective numbers.
Besides its natural interpretation, the log density usually leads to larger values, which in turn result in numerically stable computations in formulae where it appears in denominators.
The Frobenius norm of a tensor $T$, denoted by $\|T\|_F$, is an algebraic indicator of the overall strength of the tensor entries, which is indirectly tied to compressibility. Recall that the Frobenius norm for a matrix $M \in \mathbb{R}^{I_1 \times I_2}$ is defined as:
$$\|M\|_F = \sqrt{\sum_{i_1=1}^{I_1} \sum_{i_2=1}^{I_2} M[i_1, i_2]^2}$$
Both density and the Frobenius norm are related in their own way to compression potential, which is critical given the large volume of data typically held in tensors. The former plays the same role as with matrices, whereas the latter indicates whether there are strong or weak connections between keywords, terms and documents. Since both provide a data summary in the form of a scalar, they give quick and overall information regarding the tensor status at the expense of aggregating information about each dimension into this single value. Thus, both metrics can be used as building blocks for composite ones, which examine each dimension separately.
Definition 4. The Frobenius norm $\|T\|_F$ of a $p$-th order tensor $T \in \mathbb{R}^{I_1 \times I_2 \times \ldots \times I_p}$ is the square root of the sum along each dimension of its elements squared:
$$\|T\|_F = \sqrt{\sum_{i_1=1}^{I_1} \cdots \sum_{i_p=1}^{I_p} T[i_1, \ldots, i_p]^2}$$
Generally, there is no consensus as to which values of $\|T\|_F$ indicate strong connections on average. In order to derive bounds, probabilistic techniques can be employed by treating the elements of $T$ as being drawn from a distribution. One way is to observe that $\|T\|_F^2$ is, up to normalization by the number of elements, the sample approximation of $E[T^2[i_1, \ldots, i_p]]$. Then, since the Frobenius norm is always positive, as all-zero tensors are not under consideration, the Markov inequality:
$$\mathrm{prob}\{X \geq \alpha\} \leq \frac{E[X]}{\alpha}, \quad \alpha > 0$$
can be used to derive a bound. For instance, if the elements of $T$ are drawn from a Gauss distribution, then $\|T\|_F^2$, suitably scaled, follows a noncentral chi square distribution.
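Definitions 2 and 3 translate directly into code. The following is a minimal Python sketch, assuming that the tensor is stored sparsely as a dictionary from index tuples to values together with its shape; this storage format and the function names are illustrative assumptions and not those of Tensor Toolbox.

```python
import math

def density(tensor, shape):
    """Number of non-zero entries over the total number of entries."""
    nnz = sum(1 for value in tensor.values() if value != 0)
    return nnz / math.prod(shape)

def log_density(tensor, shape):
    """Logarithm of the non-zero count over the logarithm of the total count."""
    nnz = sum(1 for value in tensor.values() if value != 0)
    total = math.prod(shape)
    if nnz == 0 or total <= 1:
        raise ValueError("log density needs a non-zero entry and more than one element")
    return math.log(nnz) / math.log(total)

# A 10 x 10 x 10 tensor with ten non-zero entries on its main diagonal.
shape = (10, 10, 10)
T = {(i, i, i): 1.0 for i in range(10)}
print(density(T, shape), log_density(T, shape))
```

For this toy tensor the density is 0.01 while the log density is one third, which illustrates why the latter tends to yield larger values for sparse tensors.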
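The Frobenius norm and the Markov bound can be sketched in Python as follows. This is a minimal sketch assuming a sparse dictionary representation of the tensor; the function names are hypothetical, and the bound shown is the standard Markov inequality for a non-negative random variable, applied here to the squared entries.

```python
import math

def frobenius_norm(tensor):
    """Square root of the sum of the squared entries of a sparse tensor."""
    return math.sqrt(sum(value * value for value in tensor.values()))

def markov_bound(mean, threshold):
    """Upper bound on prob{X >= threshold} for a non-negative X with the given mean."""
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    return min(1.0, mean / threshold)

# A 2 x 3 x 1 tensor with two non-zero entries.
shape = (2, 3, 1)
T = {(0, 0, 0): 3.0, (1, 2, 0): 4.0}
norm = frobenius_norm(T)
# Sample estimate of the expected squared entry: squared norm over the element count.
mean_square = norm ** 2 / math.prod(shape)
print(norm, markov_bound(mean_square, 25.0))
```

The bound is loose by construction, so it serves as a quick sanity check rather than as a sharp estimate of how many entries are strong.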

Metric Fusion
Again, in order to take into account information about each dimension separately for a third order tensor, it suffices to fix the last index and create a metric $\nu$ that takes into consideration the density of each resulting matrix separately. Thus, if the density of each separate matrix $T[:, :, i_3]$ for a fixed value of $i_3$ is defined as:
$$\rho_{i_3} = \frac{|\{(i_1, i_2) : T[i_1, i_2, i_3] \neq 0\}|}{I_1 I_2}$$
then the set of tensor densities along the third dimension $\rho_{i_3}$ can be used to build the following aggregative metric:
$$\nu = H(\rho_1, \ldots, \rho_{I_3}) = \frac{I_3}{\sum_{k=1}^{I_3} 1/\rho_k}$$
The harmonic mean ensures that $\nu$ will tend to be close to the smallest of $\rho_{i_3}$.
For a third order tensor $T \in \mathbb{R}^{I_1 \times I_2 \times I_3}$, a related metric can be constructed by first fixing one of the three indices, treating the remaining two dimensions as a sequence of matrices, computing the Frobenius norm for each such matrix and taking the harmonic mean of these norms. Deciding which index is to be fixed is important, as it essentially determines a tensor partitioning. For the purposes of this article, the last index $i_3$ is fixed, thus creating a metric $\mu$, which ranges over the documents:
$$\mu = H(\|T[:, :, 1]\|_F, \ldots, \|T[:, :, I_3]\|_F) = \frac{I_3}{\sum_{k=1}^{I_3} 1/\|T[:, :, k]\|_F} \quad (11)$$
Notice that $T[:, :, k]$ in (11) denotes the matrix created by fixing $i_3$ to $k$ while the two remaining indices stay unaltered, thus creating an $I_1 \times I_2$ matrix. If $I_3$ is large, which may well be the case for document collections, then it would also make sense to compute statistical measures such as sample versions of variance, skewness and kurtosis.
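The two fused metrics can be sketched in Python with the harmonic mean from the standard library. This is a minimal sketch assuming a sparse dictionary representation with zero-based indices; the function name slice_stats and the toy tensor are hypothetical.

```python
import math
from statistics import harmonic_mean

def slice_stats(tensor, shape):
    """Density and Frobenius norm of every slice T[:, :, i3] of a third order tensor."""
    size_1, size_2, size_3 = shape
    nonzeros = [0] * size_3
    squares = [0.0] * size_3
    for (_, _, i3), value in tensor.items():
        if value != 0:
            nonzeros[i3] += 1
            squares[i3] += value * value
    densities = [n / (size_1 * size_2) for n in nonzeros]
    norms = [math.sqrt(s) for s in squares]
    return densities, norms

# A 2 x 2 x 2 tensor: the first slice holds two entries, the second holds one.
shape = (2, 2, 2)
T = {(0, 0, 0): 3.0, (1, 1, 0): 4.0, (0, 1, 1): 2.0}
densities, norms = slice_stats(T, shape)
nu = harmonic_mean(densities)  # fused density metric
mu = harmonic_mean(norms)      # fused Frobenius norm metric
print(nu, mu)
```

Both values are pulled toward the weaker slice, which is the behavior the harmonic mean is chosen for.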

Tensor Tucker Form
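Definition 5. The multiplication $T \times_k v$ along the $k$-th dimension between a $p$-th order tensor $T \in \mathbb{R}^{I_1 \times \ldots \times I_k \times \ldots \times I_p}$ and a vector $v \in \mathbb{R}^{I_k}$ is a $(p-1)$-th order tensor $A \in \mathbb{R}^{I_1 \times \ldots \times I_{k-1} \times I_{k+1} \times \ldots \times I_p}$ with elements:
$$A[i_1, \ldots, i_{k-1}, i_{k+1}, \ldots, i_p] = \sum_{i_k=1}^{I_k} T[i_1, \ldots, i_p] \, v[i_k]$$
Definition 6. The multiplication $T \times_k M$ along the $k$-th dimension between a $p$-th order tensor $T \in \mathbb{R}^{I_1 \times \ldots \times I_k \times \ldots \times I_p}$ and a matrix $M \in \mathbb{R}^{L \times I_k}$ is a $p$-th order tensor $B \in \mathbb{R}^{I_1 \times \ldots \times I_{k-1} \times L \times I_{k+1} \times \ldots \times I_p}$ with elements:
$$B[i_1, \ldots, i_{k-1}, l, i_{k+1}, \ldots, i_p] = \sum_{i_k=1}^{I_k} T[i_1, \ldots, i_p] \, M[l, i_k]$$
Definition 7. Tucker tensor factorization is defined as:
$$T = C \times_1 U_1 \times_2 U_2 \cdots \times_p U_p$$
where $C$ is the core tensor and $U_k$ are the basis matrices. The Tucker factorization is one of the possible generalizations of the SVD for matrices:
$$M = U \Sigma V^T$$
which is the core of the term-document information retrieval model and the starting point of a number of document clustering schemes. In order to compute the Tucker factorization, the higher order SVD is employed. The latter is based on cyclically updating each of the basis matrices $U_k$ until they all converge according to a criterion.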
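The two products of Definitions 5 and 6 can be sketched in plain Python for sparse tensors. This is a minimal sketch assuming a dictionary from index tuples to values and zero-based dimension numbering, so that k = 1 below refers to the second dimension; the function names are hypothetical and do not correspond to Tensor Toolbox methods.

```python
def mode_k_vector(tensor, k, v):
    """T x_k v: contract dimension k of a sparse tensor with the vector v."""
    result = {}
    for index, value in tensor.items():
        rest = index[:k] + index[k + 1:]          # drop the contracted index
        result[rest] = result.get(rest, 0.0) + value * v[index[k]]
    return result

def mode_k_matrix(tensor, k, m):
    """T x_k M: replace dimension k by the row index of M (a list of rows)."""
    result = {}
    for index, value in tensor.items():
        for row_number, row in enumerate(m):
            new_index = index[:k] + (row_number,) + index[k + 1:]
            result[new_index] = result.get(new_index, 0.0) + value * row[index[k]]
    return result

# A 2 x 2 x 2 tensor with three non-zero entries.
T = {(0, 0, 0): 1.0, (0, 1, 0): 2.0, (1, 1, 1): 3.0}
A = mode_k_vector(T, 1, [1.0, 1.0])    # sums over the second dimension
B = mode_k_matrix(T, 1, [[1.0, 1.0]])  # a 1 x 2 matrix shrinks that dimension to size 1
print(A)
print(B)
```

Multiplying by a one-row matrix yields the same values as the vector product, except that the contracted dimension is kept with size one instead of being removed.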

Queries
Similarly to the term-document matrix case, tensor $G$ can be queried regarding a set of keywords $\{k[i_1]\}$ or a set of terms $\{t[i_2]\}$. Said queries can be cast in terms of linear algebra as tensor-vector multiplications. In addition, $G$ allows queries about both terms and keywords.
To generate a query vector, it suffices to place one at each position corresponding to a query term and zero elsewhere. For a more detailed description, see the collection querying algorithm in [47].
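A term query expressed as a tensor-vector multiplication can be sketched in Python as follows. This is a sketch under assumptions and not the querying algorithm of [47]: the tensor is a dictionary from zero-based (keyword, term, document) indices to values, and the per-document score is assumed to be obtained by additionally summing over the keyword dimension, which amounts to a second multiplication with a vector of ones. All names and numbers are hypothetical.

```python
def query_vector(vocabulary, query_terms):
    """Indicator vector: one at the positions of the query terms, zero elsewhere."""
    wanted = set(query_terms)
    return [1.0 if word in wanted else 0.0 for word in vocabulary]

def score_documents(tensor, n_docs, q):
    """Contract the term dimension with q, then sum over keywords (assumed scoring)."""
    scores = [0.0] * n_docs
    for (_, i2, i3), value in tensor.items():
        scores[i3] += value * q[i2]
    return scores

# Toy vocabulary and tensor entries (hypothetical values).
terms = ["brain", "memory", "scan"]
G = {(0, 1, 0): 0.5, (0, 2, 0): 0.25, (1, 1, 1): 0.4, (1, 0, 2): 0.9}
q = query_vector(terms, ["memory"])
scores = score_documents(G, 3, q)
print(q, scores)
```

Documents can then be ranked by score; a keyword query works the same way with the roles of the first two dimensions exchanged.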

Document Clustering
When partitioning a set $S$ into $k$ subsets, a heuristic approach is necessary since the number of ways $b_k$ to perform such partitioning equals [49]: The generating function $B(z)$ of the recursively defined sequence $b_k$ is: which can be proven using the property of partial sums for any integer sequence.

Baseline Methodology
The processing steps of the baseline methodology are extensively described in previous works [23][24][25]. Initially, the web document repository (PubMed) is queried in order to retrieve the web documents that are later processed. Specifically, we have used the PubMed API (PubMed API: http://www.ncbi.nlm.nih.gov/books/NBK25500/). After the results $D = \{d_1, d_2, \ldots, d_n\}$ are retrieved in the initial step, each result item $d_i$ consists of six different items: title, author names, abstract, keywords, conference/journal name and publication date. Next, document representation takes place, where the methodology enriches the corresponding texts with annotations from a specific ontology. Consequently, each document is represented as a term frequency-inverse document frequency (tf-idf) vector, and some terms of the vector are annotated and mapped on senses identified from MeSH.
As a last step, the vectors-documents are clustered by utilizing k-means.
Require: Query vector q
Ensure: Clusters produced
8: tf-idf Clusters ← K-Means(M) {where the cosine similarity metric is applied to k-means}
9: end for
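The vector representation and clustering steps of the baseline can be sketched in pure Python. This is a minimal sketch under assumptions: the tf-idf weighting uses textbook factors, the MeSH annotation step is omitted, the initial centroids are taken from explicitly given seed documents so that the run is deterministic, and all names and toy documents are hypothetical. Documents are assumed to be non-empty token lists.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """One tf-idf vector per document over the sorted collection vocabulary."""
    n_docs = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    vocabulary = sorted(df)
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append([tf[t] / len(doc) * math.log(n_docs / df[t]) for t in vocabulary])
    return vectors

def cosine(u, v):
    """Cosine similarity; zero when either vector is all zeros."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)

def kmeans_cosine(vectors, seeds, iterations=10):
    """k-means where each vector joins the most cosine-similar centroid."""
    centroids = [list(vectors[s]) for s in seeds]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        labels = [max(range(len(centroids)), key=lambda c: cosine(v, centroids[c]))
                  for v in vectors]
        for c in range(len(centroids)):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(column) / len(members) for column in zip(*members)]
    return labels

docs = [
    ["brain", "memory", "fmri"],
    ["brain", "memory", "eeg"],
    ["gene", "protein", "dna"],
    ["gene", "protein", "rna"],
]
labels = kmeans_cosine(tfidf_vectors(docs), seeds=[0, 2])
print(labels)
```

In practice the seeds would be chosen at random or by a dedicated initialization heuristic, and a library implementation would replace these loops for collections of realistic size.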

Data Synopsis
In order to evaluate our proposed tensor based scheme, the TREC Genomics 2007 dataset (TREC Genomics Track: http://ir.ohsu.edu/genomics/) serves as an evaluation benchmark. For a more detailed description of the specific dataset, see [25,50]. It is worthwhile to mention that in the TREC Genomics 2007 dataset, about 160,000 documents from about 50 genomics-related journals are considered.

Baseline Method
Regarding the clustering procedure, the k-means algorithm is employed with the following parameters. The number of derived clusters is 20, while cosine similarity was utilized for identifying underlying document similarities. Regarding the tensor scheme, Tucker factorization, which is a higher-order SVD generalization, has been executed, and the rows and columns of the basis matrices corresponding to the $c$ largest entries of the core tensor have been selected. This is similar to selecting the $c$ largest singular values of the SVD in the matrix case.
We have compared the produced clusters for both schemes by using precision, recall and F-measure scores.
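For reference, the three scores can be computed as in the following Python sketch. It uses the standard set-based definitions and assumes that each cluster is compared against a set of documents known to be relevant for the same topic; how clusters were matched to topics in the experiments is not specified here, so the function name and the document identifiers are hypothetical.

```python
def precision_recall_f(retrieved, relevant):
    """Precision, recall and balanced F-measure of a retrieved set against a relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Four documents placed in a cluster, three documents truly on the topic.
p, r, f = precision_recall_f({1, 2, 3, 4}, {2, 3, 5})
print(p, r, f)
```

The F-measure is the harmonic mean of precision and recall, so it is dominated by the weaker of the two.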
As can be seen in Tables 5, 6 and 7, the proposed representations and clustering strategies achieve notable precision, recall and F-measure for a small and average number of processed documents. As the number of processed documents increases, the performance of the corresponding methods seems to decrease. By observing Table 8, it is deduced that density is a decreasing function of tensor size. Notice that p(%) denotes the percentage of the documents used to extract these results. Table 9 shows how $\|G\|_F$ compares to $\log I_d$. As with density, this ratio falls with $I_d$. This can be interpreted as the weakening of document connections. When few documents are available, it is easy to derive strong connections between them. On the other hand, as the collection is augmented with more documents, topical associations lose strength due to the increased subject variability.

Custom PubMed Dataset
Before analyzing the precision and recall characteristics of the proposed model, it is worth looking at the tensor contents and specifically at the term list. The twenty most and the twenty least common keywords in the collection and their frequencies are shown in Table 10. Additionally, Table 11 contains the corresponding information for the text terms. In these tables, the frequency $f$ for both keywords and terms is computed based on the entire document collection. It is no surprise to see common technical and medical terms at the top of the list of Table 10. For instance, both fMRI and EEG analysis are widespread techniques with many MATLAB implementations. Moreover, older or more specialized terms are less frequent. For example, shell shock or shellshock is a rather negatively-charged WWI-era term, which is now largely replaced by post-traumatic. Notice that closely-associated keywords, such as clinical and evaluation, have similar frequencies, which is expected. Rare keywords also pertain to other physiological conditions, probably from papers establishing a connection between brain and body functionality. The situation is similar in Table 11, where the top twenty and the bottom twenty terms are shown. In comparison to Table 10, the terms are more diversified, covering a broader number of topics including many secondary ones, and thus, the gaps between terms are considerably narrower. Obviously, the terms cognitive, cognition and brain are present in literally every document of the collection, which was anticipated. In contrast to Table 10, there hardly appears to be a connection between the least frequent terms and the topic. In fact, the right-hand side of Table 10 could appear in virtually any medical collection about any topic and still make some sense. This implies that there is definitely compression potential in the original collection, as a portion of documents can be replaced by a combination of eigen-documents or, in the case of redundant information determined by a large number of generic terms, it can be simply discarded.
Notice that the majority of the terms of the second column of Table 11 probably refer to the subjects undergoing some kind of treatment or monitoring. Furthermore, the first column of Table 11 shares many common entries with the corresponding column of Table 10. This can be attributed to the fact that a keyword, which carries significant semantic information, is very likely to be used in the text of a document. Furthermore, the frequency of terms fMRI and EEG equals roughly the sum of the frequency of the academic papers and the clinical data documents of Table 12. A possible explanation is that these types of documents are the most likely to refer to clinical methodology, while the remaining document types address auxiliary topics.
It is of interest to examine the similarity between the keyword set $S_k$ and the term set $S_t$, as any high relevance between them would mean the tensor can be reduced to a term-document matrix. Their similarity was assessed with the dynamic time warping (DTW) metric, which works on vectors of lengths $p$ and $q$. First, it defines a metric $\delta_{i,j}$ between the members of both vectors and then relies on the recurrence relation:
$$\gamma_{i,j} = \delta_{i,j} + \min\{\gamma_{i-1,j}, \gamma_{i,j-1}, \gamma_{i-1,j-1}\}$$
to compute the shortest transformation and its cost $\gamma^*$ between those two vectors, incrementally creating a shortest cost path in a $p \times q$ tableau. Since $S_k$ and $S_t$ contain words, $\delta_{i,j}$ was selected to be the Levenshtein distance. One fine point is that DTW requires vectors, which are ordered, whereas sets are by definition unordered. To overcome this, $S_k$ and $S_t$ were sorted in descending order according to word frequency and, if needed, lexicographically as well, to break any frequency ties. This preserves not only the words in each set, but also their significance. Another subtlety is that the atomic operations are character and not word oriented. Once $\gamma^*$ was computed, it was expressed as a fraction of the worst case scenario, which is the deletion of each character of $S_t$ followed by the insertion of each character of $S_k$. For the given sets, this ratio was 0.1181, which means that there is little overlapping between their sorted versions.
Table 13 presents tensor density as a function of $I_d$. Similarly to the benchmark dataset, it is also a decreasing function of tensor size. Table 14 presents the ratio of $\|G\|_F$ to $\log I_d$. The weakening of document connections is caused similarly to the benchmark dataset case.
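The comparison of the two sorted word lists can be sketched in Python as follows. This is a minimal sketch assuming the common three-way DTW recurrence with the Levenshtein distance as the element metric; the function names and the two short word lists are hypothetical, and the normalization follows the worst case described above, namely the total number of characters in both lists.

```python
def levenshtein(a, b):
    """Edit distance between two words (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(previous[j] + 1,                       # deletion
                               current[j - 1] + 1,                    # insertion
                               previous[j - 1] + (char_a != char_b)))  # substitution
        previous = current
    return previous[-1]

def dtw(xs, ys, dist):
    """Cost of the cheapest warping path between two ordered lists."""
    inf = float("inf")
    gamma = [[inf] * (len(ys) + 1) for _ in range(len(xs) + 1)]
    gamma[0][0] = 0.0
    for i, x in enumerate(xs, 1):
        for j, y in enumerate(ys, 1):
            gamma[i][j] = dist(x, y) + min(gamma[i - 1][j],
                                           gamma[i][j - 1],
                                           gamma[i - 1][j - 1])
    return gamma[len(xs)][len(ys)]

# Word lists assumed to be already sorted by descending frequency.
keywords = ["eeg", "fmri"]
terms = ["eeg", "mri"]
cost = dtw(keywords, terms, levenshtein)
worst = sum(len(w) for w in keywords) + sum(len(w) for w in terms)
print(cost, cost / worst)
```

Because the element metric counts characters, long words influence the ratio more than short ones, which matches the remark that the atomic operations are character oriented.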

Conclusions
This article presented a semantically-aware topic-based document clustering scheme for biomedical articles, which can be further applied to other biomedical collections. The core of this scheme is a keyword-term-document third order tensor, namely a three-dimensional array, whose values are derived from a tf-idf scheme. The latter is a generalization of the established term-document matrix model, which is widely used in information retrieval, both in research and in industrial-grade systems. The advantage of the proposed representation is the semantic enrichment, which is achieved with the inclusion of keywords. Scientometric research suggests that keywords regularly carry more semantic information than ordinary terms. A variation of this model is to mix keywords from MeSH with keywords retrieved from PubMed.
The proposed methodology has been compared to the term-author-document model outlined in [47] in terms of compression potential, precision and recall. Both were implemented in MATLAB using the Tensor Toolbox. The experimental results suggest that inclusion of keywords instead of authors increases precision and, to an extent, recall.
Regarding future work directions, a number of extensions are possible. The sparsity patterns of larger tensors should be analyzed, and if possible, compression techniques such as those proposed in [51] should be applied. Furthermore, effective density patterns should be investigated. Another research point is the addition of update operations to the proposed model, namely of insertion and deletion operations, thus yielding a more flexible scheme. A related topic is the development of persistent methodologies for tensors, such as those in [52], in order to support the efficient retrieval of past versions. Finally, real-time analytics are gaining attention with the recent combination of streaming algorithms and tensors.

Table 2. PubMed document XML tags.
1: identify documents set $D = \{d_1, d_2, \ldots, d_n\}$
2: ∀ result item $d_i = \{t_i, an_i, a_i, k_i, cj_i, pd_i\}$ in $D$
3: for each $d_i$ in $D$ do
4: calculation of MeSH vectors $M = \{M_{d_1}, M_{d_2}, \ldots, M_{d_n}\}$
5: end for
6: use as input, titles, keywords and abstracts: $d_i = \{t_i, k_i\}$
7: for each $M_{d_i}$ in $M$ do

Table 9. $\|G\|_F$ ratio to $\log I_d$.

Table 10. Collection keywords (frequency f as a percentage).

Table 11. Collection terms (frequency f as a percentage).

Table 12. Document types (frequency f as a percentage).

Table 12 contains the frequency of each document type in the collection. Approximately half of the collection consists of scientific papers, which is consistent with the role of PubMed, supplemented by another half of auxiliary documents of various types.

Table 14. $\|G\|_F$ ratio to $\log I_d$.