TREASURE: Text Mining Algorithm Based On Affinity Analysis and Set Intersection to Find the Action of Tuberculosis Drugs against Other Pathogens

Abstract: Tuberculosis (TB) is one of the top causes of death in the world. Though TB is known as the world’s most infectious killer, it can be treated with a combination of TB drugs. Some of these drugs can be active against other infective agents, in addition to TB. We propose a framework called TREASURE (Text mining algoRithm basEd on Affinity analysis and Set intersection to find the action of tUberculosis dRugs against other pathogEns), which particularly focuses on the extraction of various drug–pathogen relationships in eight different TB drugs, namely pyrazinamide, moxifloxacin, ethambutol, isoniazid, rifampicin, linezolid, streptomycin and amikacin. More than 1500 research papers from PubMed are collected for each drug. The data collected for this purpose are first preprocessed, and various relation records are generated for each drug using affinity analysis. These records are then filtered based on the maximum co-occurrence value and set intersection property to obtain the required inferences. The inferences produced by this framework can help the medical researchers in finding cures for other bacterial diseases. Additionally, the analysis presented in this model can be utilized by the medical experts in their


Introduction
According to the World Health Organization (WHO) report, the number of TB cases went from 2.2 million to 2.8 million in the year 2015 [1]. It also stated that the global estimates went up from 9.6 million to 10.4 million, and the number of deaths caused by TB doubled in India. In 2016, the greatest number of new cases of multidrug-resistant TB (MDR-TB) was reported in India. By WHO estimates in 2017, around 400,000 people died among the 2.7 million people affected by TB [2]. In the year 2019, around 2.64 million TB cases were reported by the WHO in India, and estimates show that 40% of the Indian population was affected by TB bacteria. Many journals, conferences, and patents indexed in PubMed are dedicated to the study of TB, TB surveys and anti-TB drugs. PubMed is a free search engine that primarily accesses the MEDLINE database, and it comprises more than 32 million citations and abstracts of biomedical literature [3]. Much health-related information, such as diseases and their symptoms, prevention and treatment, and plants and their activity and usage, is present in this database.
Tuberculosis (TB) is a contagious and an infectious bacterial disease caused by a bacterium named Mycobacterium tuberculosis that mainly affects the lungs [4]. It can also affect other parts of the body like the spine, kidney and brain. TB bacteria are spread through air when an infected person sneezes, coughs, or speaks. It is a curable disease. A combination of antibiotic medications is given for patients with active symptoms and they must undergo a long course of treatment [5,6]. According to WHO, around 63 million lives were saved through TB treatment since the year 2000.
The demand for TB drugs is always high, and over the years, many anti-TB agents have been developed. The treatment for pulmonary TB (lungs affected) involves two antibiotics for six months and an additional two antibiotics for the first two months. The first-line anti-TB agents such as rifampicin, isoniazid, pyrazinamide and ethambutol are the prescribed antibiotics [7]. The same combination of medications can be used for the treatment of Extrapulmonary TB (outside the lungs). The drugs are usually developed to attack or kill the bacterium that causes the disease. Each drug has its own specific molecular target to which it binds and produces its pharmacological effect [8]. Their mechanism of action may also be effective against other pathogens. Thus, these drugs can also be used in the treatment of other diseases. This is known as drug repurposing and it is gaining attention due to the significant rise in the costs of pharmaceutical R&D. The pharmaceutical companies are looking for such repurposing strategies [9,10].
The existing text mining algorithms can be used to classify drugs with respect to a particular disease [11,12]. However, there are no reports of any inferences drawn on the activity of drugs against a spectrum of pathogens, in addition to the pathogen responsible for an infectious disease. This motivated the need to design a technology-based solution by using TB drugs-based PubMed abstracts as the source dataset. For this purpose, we collected 5 years of recent abstracts on various TB drugs research from PubMed. The large volume of extracted data should be mined to find the underlying patterns. The proposed framework, TREASURE, finds various relations among the preprocessed data, with the help of affinity analysis, and it uses set intersection property to filter out these patterns among the relation sets based upon their occurring frequency. This framework is applied on various TB drugs datasets to determine the various other infections these drugs are effective against in addition to TB. This method might help various researchers in the field of drug discovery and drug interaction studies, doctors and also the pharmaceutical companies.

Literature Review
The gene-disease relationships have a great significance in diagnosis, treatment and prevention of diseases. Though these associations are deeply investigated by the researchers, much of their underpinnings are yet to be explained. Text mining was performed on documents from the PubMed database to predict the gene-disease relationships based on the cosine similarity between the gene vectors and disease vectors [13]. This method integrates the MeSH database, co-occurrence methods and term weight to predict gene-disease relationships. Chemical and drug information is accumulated in all sorts of text documents like industry reports, patents and scientific articles. This has led to the development of many text mining applications. The PubMed biomedical literature database is a valuable source of information. An R package was developed by combining the advantages of existing text mining algorithms, to analyze the PubMed abstracts [14]. A review on the tools, methods and applications of text mining for chemical compounds, was presented to determine their structures and identify relationships between chemicals and other entities [15].
Latent Dirichlet Allocation (LDA) is a probabilistic topic model that aims to give a topic perspective solution. Researchers have proposed various topic models based on LDA. An algorithm called Bio-LDA was introduced to identify latent topics using biological terms and uncover putative relations among bio-terms and topics [16]. A survey was presented to discover trends, research development and intellectual structures of topic modelling based on LDA [17]. To extract research topics from Alzheimer's disease-related papers, the LDAP framework was proposed, which combined LDA with an affinity propagation algorithm [18]. Various analyses, such as research trend analysis, were performed on the results. Another model, named latent semantic analysis (LSA), was used to semantically interface PubMed abstracts to gene ontology [19]. The PubMed abstracts for this model were obtained based on the semantic similarity between the user query and the abstracts. Three keyword extraction models were proposed, in which the first one was based on LSA and the right singular matrix, and the other two models were based on Shannon entropy [20]. The proposed models are not length dominating, and they have a low redundancy. For the analysis of patent data, an intelligent system based on principal component analysis and logistics was proposed [21]. This system extracts the features from the patents database and classifies them into categories such as software, biological, business and chemical.
The prediction of TB survivability has been a challenging problem for many years. A study was presented on various data mining approaches that have been utilized for tuberculosis diagnosis and prognosis, and a best prediction model for TB survivability was developed accordingly [22]. A regulatory network was proposed to detect the action of genes required for Mycobacterium tuberculosis (M.tb) persistence, using text mining models [23]. Its purpose was to suggest candidates for new drug targets and to provide fresh insights on the persistence mechanism of M.tb. A method to find close association rules within TB data was proposed and applied on a dataset containing the real medical records of TB patients [24]. This method determined the association of one symptom to another. Different types of media content analysis are carried out for different applications.
Taking all these as a motivation, this paper proposes a model to deal with PubMed datasets in the identification of the various pathogens the TB drugs are active against, in addition to M.tb. The analyzed TB drugs can be effective against some other diseases, caused by the pathogens that these drugs work against.

Materials and Methods
This article proposes a TREASURE framework, that follows affinity analysis to find various relations between the dataset and filters out the patterns among these records via set intersection and occurring frequency. Algorithm 1 gives the outline of the TREASURE model. Thousands of documents pertaining to each drug are collected from the PubMed database. The collected documents are preprocessed as they undergo tokenizing, stemming, stop words removal and tf-idf calculation phases. The preprocessed data are visualized through word cloud to give an idea about the frequently occurring words in the collected dataset. Then, the preprocessed data undergoes an affinity analysis phase, which determines various co-occurring relationships among the words in the dataset and generates the relation records accordingly. These relation records are then filtered, based on maximum co-occurrence value and set intersection property, to obtain the resultant set. Thus, different resultant sets for different TB drugs are obtained and are then analyzed to provide various inferences. The overall architecture of the TREASURE framework is depicted in Figure 1.

Algorithm 1 Outline of TREASURE model
Input: Abstracts from PubMed Database
Output: A filtered resultant set containing frequently occurring word patterns
1. Data preprocessing
2. Generate the relation records via affinity analysis
3. Based on the co-occurrence value of each element and set intersection property, the resultant set is filtered out from the relation records

Data Preprocessing
The TREASURE model gathers data from the PubMed database. As the data gathering methods are loosely controlled, the gathered data might contain garbage values, out-of-range values, missing values, etc. These data must be transformed into a useful and efficient format, which is done through various preprocessing steps. Algorithm 2 shows the TREASURE data preprocessing. The gathered data are first cleaned to handle the noisy and missing data. Then, tokenization breaks the sentences or clauses into separate words, and the punctuation in the sentences is removed. In any dataset, there are always some commonly used words such as "the", "for", "should", and "to", which do not contribute to any kind of learning. They are known as stop words, and removing them is an important step in data preprocessing. In Python, the NLTK (Natural Language Toolkit) library has a list of stop words. An additional list of stop words, if given as user input, is also removed. The dataset might contain words like affects, affecting, and affected, all of which mean the same thing. Such words are reduced to their root word or word stem by stripping their suffixes or prefixes, and this process is referred to as stemming.
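The tokenization, stop-word removal and stemming steps described above can be sketched as follows. This is only a minimal, dependency-free illustration: the short `STOP_WORDS` set and the suffix-stripping `naive_stem` function are hypothetical stand-ins for the NLTK stop-word list and stemmer that the paper uses, and the function names are assumptions made for this example.

```python
import re

# Tiny illustrative stop-word list; NLTK's list is much larger.
STOP_WORDS = {"the", "for", "should", "to", "is", "in", "of", "a", "an", "and"}

# Suffixes stripped by this toy stemmer, longest first (not a real stemming algorithm).
SUFFIXES = ("ing", "ed", "s")


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens, dropping punctuation."""
    return re.findall(r"[a-z]+", text.lower())


def naive_stem(word):
    """Strip one common suffix if the remaining stem keeps at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text, extra_stop_words=()):
    """Tokenize, remove stop words (built-in plus user-supplied), then stem."""
    stop_words = STOP_WORDS | set(extra_stop_words)
    return [naive_stem(tok) for tok in tokenize(text) if tok not in stop_words]
```

With this sketch, "affects", "affecting" and "affected" all reduce to the same stem, which is the behaviour the stemming step relies on. A production pipeline would use a proper stemmer, since naive suffix stripping mangles many words.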
The next step in data preprocessing is to extract the important words through tf-idf calculation. Once the words are reduced to their stems, the term frequency (tf) and the inverse document frequency (idf) are computed for each word as given in Equations (1) and (2). The tf-idf weight is a statistical measure used to determine the importance of a word (term) to a document (doc) in the dataset [25,26]. Its calculation is given in Equation (3).
$$\mathrm{tf}(term, doc) = \frac{\text{frequency of } term \text{ in } doc}{\text{total number of words in } doc} \tag{1}$$
$$\mathrm{idf}(term) = \ln\left(\frac{\text{total number of documents}}{\text{number of documents containing } term}\right) \tag{2}$$
$$\text{tf-idf}(term, doc) = \mathrm{tf}(term, doc) \times \mathrm{idf}(term) \tag{3}$$

Algorithm 2 TREASURE Data Preprocessing
Input: PubMed abstracts in a csv file
Output: Preprocessed data in a csv file
1. Loop through the entire csv file
2. Calculate term frequency for the words in document as given in Equation (1)
3. Calculate inverse document frequency as given in Equation (2)
4. Compute tf-idf values for the words as given in Equation (3)
5. Set a minimum threshold value for tf-idf
6. Open a new csv file
6.1 For row_i write each word of document_i whose tf-idf > threshold
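The tf-idf steps of Algorithm 2 can be illustrated with a short sketch that follows Equations (1) to (3) and keeps only the words whose weight exceeds the threshold. The function name `tf_idf_filter` and the in-memory list-of-lists input are assumptions made for this example; the paper reads from and writes to csv files.

```python
import math
from collections import Counter


def tf_idf_filter(documents, threshold):
    """Keep, per document, the words whose tf-idf exceeds the threshold.

    documents: list of token lists, one list per preprocessed abstract.
    """
    n_docs = len(documents)
    # Number of documents containing each term (denominator of Equation (2)).
    doc_freq = Counter(term for doc in documents for term in set(doc))

    filtered = []
    for doc in documents:
        counts = Counter(doc)
        kept = []
        for term in counts:  # iterates in order of first appearance
            tf = counts[term] / len(doc)             # Equation (1)
            idf = math.log(n_docs / doc_freq[term])  # Equation (2)
            if tf * idf > threshold:                 # Equation (3)
                kept.append(term)
        filtered.append(kept)
    return filtered
```

Note that a word appearing in every document has idf = ln(1) = 0, so it is always removed regardless of the threshold. Each returned list would correspond to one row of the output csv file.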

Generation of Relation Records Using Affinity Analysis
The preprocessed data must be integrated to derive any inference. For example, consider the sentence "Amikacin is used in the treatment of non-tuberculous mycobacterial (NTM) disease". This sentence exhibits the relationship between amikacin and non-tuberculous mycobacteria. Such co-occurrence relationships can be determined by the affinity analysis technique. Algorithm 3 shows the generation of relation records using affinity analysis. In this technique, we consider each document as a transaction and each word as an item. To determine various connections between items, some formal definitions of measures like support, confidence and lift are needed [27][28][29].
Support is a simple yet important metric in affinity analysis. It is given in Equation (4). The support of (A ∪ B), where A and B are item sets, is the ratio of the number of transactions that contain all items of (A ∪ B) to the total number of transactions in the dataset.
$$\mathrm{support}(A \cup B) = \frac{\text{number of transactions containing all items of } A \cup B}{\text{total number of transactions}} \tag{4}$$
Confidence denotes the likelihood of certain items occurring together. It is given in Equation (5) and is defined as the proportion of transactions containing item set A that also contain item set B.
$$\mathrm{confidence}(A \rightarrow B) = \frac{\mathrm{support}(A \cup B)}{\mathrm{support}(A)} \tag{5}$$
Lift is another important measure in affinity analysis. As given in Equation (6), it is the ratio of the probability of A and B occurring together to the product of the probabilities of A and B, i.e., the probability expected if there were no association between them.
$$\mathrm{lift}(A \rightarrow B) = \frac{\mathrm{support}(A \cup B)}{\mathrm{support}(A)\,\mathrm{support}(B)} \tag{6}$$
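As a small worked illustration of the three measures, the sketch below computes support, confidence and lift on a toy collection of transactions. The four transactions are invented for this example and are not taken from the paper's datasets; the function names are likewise assumptions.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, a, b):
    """Proportion of transactions containing A that also contain B."""
    return support(transactions, set(a) | set(b)) / support(transactions, a)


def lift(transactions, a, b):
    """Observed co-occurrence of A and B relative to what independence predicts."""
    return confidence(transactions, a, b) / support(transactions, b)


# Toy transactions: each set stands for the preprocessed words of one abstract.
transactions = [
    {"amikacin", "ntm", "treatment"},
    {"amikacin", "ntm"},
    {"amikacin", "mrsa"},
    {"linezolid", "mrsa"},
]

print(support(transactions, {"amikacin", "ntm"}))       # 2 of 4 transactions
print(confidence(transactions, {"amikacin"}, {"ntm"}))  # 0.5 / 0.75
print(lift(transactions, {"amikacin"}, {"ntm"}))        # (0.5 / 0.75) / 0.5
```

Here the item set {amikacin, ntm} appears in 2 of 4 transactions, so its support is 0.5; amikacin alone has support 0.75, giving a confidence of about 0.67; and the lift is about 1.33, above 1, which indicates that the two words co-occur more often than independence would predict.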

Algorithm 3 TREASURE Relation Records Generation via Affinity Analysis
Input: Preprocessed csv file, minimum number of items in a set as min_length, minimum co-occurrence value as min_support and the minimum conditional probability as min_confidence
Output: A JSON file containing a list of relation records with corresponding confidence, support and lift values
1. Read each item in the file
2. Calculate support for every item as given in Equation (4)
3. Insert every item into a frequent dataset whose support ≥ min_support
4. For each item in the frequent dataset calculate confidence and lift values as given in Equations (5) and (6)
5. Insert every rule into a JSON file whose confidence and items count are greater than the corresponding threshold
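A rough sketch of the flow in Algorithm 3 is given below. It enumerates item sets by brute force, which is only practical for small vocabularies, whereas the paper relies on an existing affinity analysis library. The function names, the `max_length` cap, the record field names, and the restriction to item sets of at least two words (so that every record has an antecedent and a consequent) are assumptions made for this illustration.

```python
import json
from itertools import combinations


def relation_records(transactions, min_support, min_confidence, min_length=2, max_length=3):
    """Enumerate item sets by brute force and return the rules passing the thresholds."""
    transactions = [set(t) for t in transactions]
    n = len(transactions)

    def support(items):
        return sum(items <= t for t in transactions) / n

    vocabulary = sorted(set().union(*transactions))
    records = []
    for size in range(max(min_length, 2), max_length + 1):
        for combo in combinations(vocabulary, size):
            items = frozenset(combo)
            sup = support(items)
            if sup == 0 or sup < min_support:
                continue  # not a frequent item set
            # Try every split of the item set into antecedent A and consequent B.
            for k in range(1, size):
                for antecedent in combinations(combo, k):
                    a = frozenset(antecedent)
                    b = items - a
                    conf = sup / support(a)
                    if conf >= min_confidence:
                        records.append({
                            "items": sorted(items),
                            "antecedent": sorted(a),
                            "consequent": sorted(b),
                            "support": sup,
                            "confidence": conf,
                            "lift": conf / support(b),
                        })
    return records


def to_json(records):
    """Serialize the relation records as a JSON string."""
    return json.dumps(records, indent=2)
```

The string returned by `to_json` could be written to a file to obtain one JSON file of relation records per drug dataset.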

Filtering Relation Records Based On Maximum Co-Occurrence Value and Set Intersection Property
The generated relation records contain interrelated words with corresponding support, confidence and lift values. However, not all the generated relations will have a useful meaning. Therefore, we need to filter these records to extract the primary combinations of words. For this purpose, the set intersection property is applied on the records. Among the intersecting sets, the one with the maximum co-occurrence value, i.e., the relation that has occurred most frequently, is selected and added to the resultant set. This is done to prevent the repetition of the same inference and to obtain as many unique inferences as possible. For example, ['capreomycin', 'injectable'] and ['capreomycin', 'tuberculosis'] are intersecting sets. The inference obtained here is that capreomycin is an injectable drug used in the treatment of TB. Among these, ['capreomycin', 'injectable'] has the maximum co-occurrence value and thus, it is added to the resultant set. Algorithm 4 shows the filtration of relation records. By applying this technique, the most essential relations among the records are filtered out and various inferences are obtained.

Algorithm 4 TREASURE Relation Records Filtration based on Maximum Co-occurrence Value and Set Intersection
Input: A list of relation records from the JSON file as D
Output: A filtered resultant set S containing frequently occurring word patterns
1. Initialize an empty dictionary ED and an empty set ES
2. For each i in range (length (D))
2.1 For each j in range (i + 1, length (D))
...
Display the resultant set S
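One possible reading of this filtering step is sketched below: every pair of records is compared, and whenever two records share at least one word (a non-empty set intersection), the one with the lower co-occurrence value is discarded. This pairwise rule, the function name `filter_records`, the `(items, support)` input format and the support values in the usage example are all assumptions; the bookkeeping with the dictionary ED and the set ES in Algorithm 4 may differ.

```python
def filter_records(records):
    """Keep a record only if no intersecting record has a higher support.

    records: list of (items, support) pairs,
             e.g. (["capreomycin", "injectable"], 0.02).
    """
    dropped = set()
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            items_i, support_i = records[i]
            items_j, support_j = records[j]
            # Non-empty set intersection: the two relations share a word.
            if set(items_i) & set(items_j):
                # Drop the relation with the lower co-occurrence value
                # (on a tie, the later record is dropped).
                dropped.add(j if support_i >= support_j else i)
    return [records[k] for k in range(len(records)) if k not in dropped]


# Hypothetical support values, chosen only to illustrate the rule.
example = [
    (["capreomycin", "injectable"], 0.020),
    (["capreomycin", "tuberculosis"], 0.015),
    (["genes", "inha", "katg"], 0.010),
]
print(filter_records(example))
```

In this example the two capreomycin relations intersect, so only the one with the larger support is kept, while the unrelated gene relation passes through unchanged.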

Data Preprocessing
Eight drugs are analyzed with this model. For this purpose, the abstract of each document is collected, as it provides the accurate and necessary information about the paper. Each PubMed abstract has a unique ID called the PMID. The metapub library in Python takes these IDs as input and extracts their corresponding abstracts. The number of document abstracts collected for each drug from PubMed is given in Table 1. The NLTK package in Python is used to perform tokenization and stemming and to remove stop words. The tf-idf value is calculated for the words obtained, and the minimum threshold is set around 0.03 to remove unnecessary words. This threshold value was chosen by trial and error in the range 0.02 to 0.05. The results are then stored in a csv file such that each row corresponds to an abstract and contains the preprocessed words of that abstract. In this way, eight different csv files are created. Figure 2 shows the word cloud representation of the preprocessed data obtained for the eight different TB drugs. Among the preprocessed data, some words occur more frequently than others, and the frequency of a word indicates its importance. Word clouds help to identify such words, as the size of each word in the cloud indicates its frequency of occurrence. Therefore, the preprocessed data are visualized in the form of a word cloud to identify such important words.

Generation of Relation Records Using Affinity Analysis
This method is used to determine the relationships within the preprocessed data. Each document is considered as a transaction and each word in the document is considered as an item. The apyori library in Python is used to perform the affinity analysis. The support, confidence and lift values are computed to determine various connections between the items, and the frequently occurring trends among the data are identified.
The minimum co-occurrence value (support) is set around 0.007 after various trials in the range 0.004 to 0.008. The minimum conditional probability (confidence) is set to 0.5 after trial and error in the range 0.4 to 0.6, and the threshold for the minimum number of items in a set is kept as 1 after various trials in the range 1 to 3. The generated relation records are then stored in a JSON file. Therefore, eight JSON files with corresponding relation records are obtained. Tables 2-9 present the sample item sets obtained for the different TB drug datasets with corresponding support, confidence and lift values.

Filtering Relation Records Based On Maximum Co-Occurrence Value and Set Intersection Property
The generated JSON files consist of the relation records. These records are filtered to determine the most weighted relationships, and Python sets are used for this purpose. The set intersection is applied among the records to group similar relations and to obtain unique inferences at the end. Then, the relation with the maximum co-occurrence value from each group of records is extracted and added to the resultant set. Therefore, eight different filtered resultant sets, each corresponding to a drug, are obtained through this method. Figure 3 shows the resultant sets obtained for the eight different TB drugs, with the co-occurrence value corresponding to each item of each set. The records present in the resultant set provide some important information about the drug. For example, in Figure 3c, ['genes', 'inha', 'katg'] is one of the elements of the resultant set obtained for the isoniazid dataset. The following inference is obtained for this record. M.tb poses a great challenge at the scientific level as it acquires gene mutations, which develop resistance to the drugs and treatment forms. The mutations at codon 315 (amino acid position) of the katG gene are associated with high resistance to isoniazid. Therefore, isoniazid is ineffective in the treatment of M.tb with this mutation [30]. In contrast, the inhA gene mutations are associated with low-level resistance to isoniazid, and thus high doses of the drug can be used for the treatment of M.tb with this mutation [31]. Similarly, various inferences can be obtained from the resultant sets of the eight different TB drugs. Thus, the filtered resultant sets provide the most essential and unique inferences of the corresponding dataset. Since the relation records are generated using affinity analysis, they do not contain much noise. The resultant set is obtained through the set intersection property and hence, there will not be any repetition of records. The obtained resultant set contains various inferences.
These inferences can help the medical experts in their research and can also lead to some discoveries. If this method is applied on different disease datasets, we may extract some common features and relationships between the diseases. In this paper, the resultant sets of eight different TB drugs are analyzed and compared to discover such inferences. Though these drugs are developed to work against the pathogens causing TB, there are chances that they might work against other infections. Extracting such inferences may help in the development of new treatments for some diseases.

Discussion
The resultant sets are analyzed and an important inference is extracted about the drugs: the pathogens that these drugs are active against, in addition to TB, as given in Figure 4. The antibiotic drugs developed for a particular disease work against the pathogen causing that disease. They inhibit the growth of the microbial targets without harming the host. Each class of antibiotics has a unique mechanism of action. They can be inhibitors of cell wall synthesis, inhibitors of protein synthesis, inhibitors of cell membrane function, inhibitors of nucleic acid synthesis or inhibitors of other metabolic processes [32]. The mechanism depends on the nature of their structure and their affinity to the target. The mechanism of action of a drug can work against many pathogens. Therefore, that drug can be used in the treatment of diseases caused by such pathogens. For example, moxifloxacin is also effective against Helicobacter pylori, a bacterium that causes ulcers, which might progress to stomach cancer [33,34]. Thus, in addition to TB, moxifloxacin can also be used in the treatment of these two diseases. A Gram-positive bacterium named Methicillin-Resistant Staphylococcus aureus (MRSA) causes various infections such as skin infections, pneumonia and sepsis [35]. The drugs moxifloxacin, amikacin and linezolid are effective against MRSA in addition to TB and thereby can be used in the treatment of the mentioned infections. Similarly, the TB drugs can be effective against various other pathogens or infections, and thereby they can be used in the treatment of various diseases caused by such pathogens. This kind of TB drug-pathogen relationship is extracted through our framework. The results are compared with some important information sources such as doctors and medical research papers, and they closely agree with the inferences obtained from these sources [36][37][38][39].
This inference might help medical researchers in various drug analyses, in finding treatment for various other bacterial diseases, etc. Researchers who carry out in vitro testing can also benefit from this study by extracting this inference from about 1000 papers, which would have otherwise been extremely cumbersome under usual settings.
Another result obtained through this framework is depicted in Figure 5. The resultant sets of the TB drugs mentioned in this figure contained either antiretroviral or immunodeficiency virus in them. This inference states that some of the TB drugs are also used in antiretroviral therapy to treat Human Immunodeficiency Virus (HIV). This inference shows the inter-relationships between TB and HIV. People diagnosed with HIV have a high chance of getting infected with TB pathogens, as HIV weakens the immune system [40,41]. The body then cannot fight the TB pathogens, and the infection can quickly progress into TB disease. TB has become the leading cause of death among the people affected by HIV. If a person is affected by TB, it is important for them to know about their HIV status. Similarly, various inferences can be extracted through this framework. It is evident that this way of data extraction has identified many different and important results. This tool, when applied to other drug, disease or plant datasets, can draw out various important inferences and thereby can help medical researchers to speed up their background research work. This framework can be considered as a complementary tool for doctors and medical research experts in their drug discovery studies, drug interaction studies and bacteria analysis, and can pave the way to discovering treatments for some incurable diseases.
To assess the accuracy of the proposed framework, it is compared with existing topic modelling techniques, namely latent Dirichlet allocation (LDA), latent Dirichlet allocation with affinity propagation (LDA with AP) and latent semantic analysis (LSA). The performance of our framework is evaluated using the following standard measures [42,43].
Precision is the measure of how much of the information returned by the system is correct. It is given in Equation (7). It is the ratio of the number of correctly predicted observations to the total number of predicted observations in the resultant set.
$$\mathrm{Precision} = \frac{\text{number of correctly predicted observations}}{\text{total number of predicted observations}} \tag{7}$$
Recall, as given in Equation (8), is the measure of the relevant information extracted by the system. It is the ratio of correctly predicted observations to all the observations in the class.
$$\mathrm{Recall} = \frac{\text{number of correctly predicted observations}}{\text{total number of observations in the class}} \tag{8}$$
F-Measure is the harmonic mean of precision and recall and is computed as given in Equation (9). It balances both the precision and recall with a single score.
$$\text{F-Measure} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{9}$$
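The three evaluation measures can be computed from a set of predicted words and a set of benchmark words, as in the sketch below. The function name and the placeholder words are assumptions for illustration and do not reproduce the paper's benchmark data.

```python
def precision_recall_f(predicted, benchmark):
    """Return (precision, recall, F-measure) for predicted vs. benchmark words."""
    predicted, benchmark = set(predicted), set(benchmark)
    correct = len(predicted & benchmark)  # correctly predicted observations
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(benchmark) if benchmark else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Placeholder words: 3 of the 4 predictions appear among the 5 benchmark words.
predicted = {"w1", "w2", "w3", "w9"}
benchmark = {"w1", "w2", "w3", "w4", "w5"}
print(precision_recall_f(predicted, benchmark))
```

In this toy case the precision is 3/4 = 0.75, the recall is 3/5 = 0.6, and the F-measure is 2(0.75)(0.6)/(0.75 + 0.6), about 0.667.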
The above-mentioned measures are calculated for the various algorithms. The graphical comparisons of these measures for four different drug datasets, namely moxifloxacin, linezolid, streptomycin and rifampicin, are depicted in Figure 6a-c. From Figure 6c, we can observe that the TREASURE model gives high accuracy (71.85%) when compared to the LDA, LDA with AP and LSA algorithms. The benchmark words for comparing the results of each algorithm were collected from renowned information sources such as WHO, Healthline, MEDLINE and NIH. Hence, it is evident that this way of data extraction has provided appropriate results in the identification of various TB drug-pathogen relationships.

Conclusions
Tuberculosis is a potentially serious infectious disease caused by Mycobacterium tuberculosis that usually attacks the lungs. This paper is a novel attempt to propose a framework named TREASURE to analyze various TB drugs from PubMed literature and identify the action of these drugs against other pathogens. The lack of effective analysis tools to discover different drug-pathogen relationships necessitated the proposal of the TREASURE model. We analyzed eight different TB drugs, namely pyrazinamide, moxifloxacin, ethambutol, isoniazid, rifampicin, linezolid, streptomycin and amikacin, and identified various other pathogens or infections that these drugs are effective against, in addition to TB. We generated relation records from the datasets using affinity analysis and filtered these records using the maximum co-occurrence value and set intersection property to obtain the results. This method provides inferences based on various drug-pathogen relationships, which can help medical experts to speed up their background research work and thereby save time and manpower. In the future, it can also be used to find remedies for some incurable diseases. In this application, we use only text datasets. In future research, we intend to combine this model with image processing and analyze its performance.