The Treasury Chest of Text Mining: Piling Available Resources for Powerful Biomedical Text Mining

Abstract: Text mining (TM) is a semi-automated, multi-step process able to turn unstructured data into structured data. The relevance of TM has increased with the application of machine learning (ML) and deep learning (DL) algorithms in its various steps. When applied to biomedical literature, text mining is named biomedical text mining, and its specificity lies in both the type of analyzed documents and the language and concepts retrieved. The array of documents that can be used ranges from scientific literature to patents or clinical data, and the biomedical concepts often include, though they are not limited to, genes, proteins, drugs, and diseases. This review aims to gather the leading tools for biomedical TM, summarily describing and systematizing them. We also surveyed several resources to compile the most valuable ones for each category.


Introduction
Text mining is already widely used, mainly on social media, e.g., on Twitter, to explore disease symptoms [1], reactions to public regulations [2], or the opioid crisis [3]. Moreover, it has been increasingly applied in various industries, such as the financial sector, for decision-making processes [4]. Big text data in this sector, from websites or even social media, has been used for stock price prediction, financial fraud detection, and market forecasting [4]. In the health sector, particularly in diabetes, corpus-based terminology has been automatically extracted from online texts, manuals, and professional papers. It was used to develop a terminology list of the phrases patients use while browsing, to compare professional and common terminology, and to evaluate the statistics of different terminologies in two different languages [5]. In the pharmaceutical sector, the automatic extraction of meaningful terminology from pharmaceutical documents has been applied to classify those documents [6].
This review aims at providing a broad perspective of the major developments in biomedical text mining. First, in Section 1, the basis of text mining and some fundamental definitions will be clarified. In this scope, the focus will be mainly on natural language processing (NLP) methods, a field that allies artificial intelligence (AI), linguistics, and computer science (CS) methodologies. Thenceforth, in Section 2, the most recent and relevant resources for each step will be presented.

After accomplishing the NER step, NEN algorithms are invoked to assign semantics and coherence to all the retrieved tokens, solving disambiguation. As such, they constitute an essential step in the automated construction of a biomedical database describing and relating concepts, which can be organized either as a hierarchy or as a set of relationships. Abbreviation recognition and synonym recognition are advantageous to unify and normalize biomedical terms [12]. Biomedical NEN intends to map entity terms in biomedical text to canonical entities in a particular knowledge base, i.e., a database that compiles information about a topical domain in a hierarchical or relational manner [15]. Furthermore, NEN models can include additional steps such as abbreviation resolution, in which acronyms are restored to their original long forms using abbreviation dictionaries [16]. So, after the NER step, NEN will normalize the terms: for example, the tool will recognize the term 'IL6' as the abbreviation of 'Interleukin 6', whereas the NER step only associates 'IL6' with the category 'gene' or 'protein'. For further information on this step and more detailed examples, readers may explore other articles, both within the biomedical domain and beyond, that describe this step and all its complexity [17][18][19].
Lastly, RE is a task that aims to automatically identify syntactic and semantic relations between the entities retrieved in the previous text mining tasks [20,21]. Basic RE methods encompassed simple systems based on co-occurrence statistics, which evolved into more intricate ones using syntactic analysis and machine learning (ML)/deep learning (DL) models [21,22]. The extracted relations are expressed in a machine-understandable format ready for post-text mining analysis [23]. In the biomedical field, relations among entities are pivotal to understanding complex biological mechanisms, since new relations can be retrieved from previously known ones. The extraction of homo- and heterogeneous interactions between chemicals, diseases, genes, proteins, and/or other classes is needed to decipher new knowledge, mainly in fields such as regulatory pathways, metabolic processes, or adverse drug reactions [20,21]. For example, in the sentence 'Individuals with a BRCA1 gene mutation are more likely to develop breast cancer at a younger age', NER can recognize that BReast CAncer gene 1 (BRCA1) is a gene and that 'breast cancer' is a disease and categorize them accordingly. NEN is responsible for disambiguating this previous step, categorizing all keywords correctly and recognizing the term 'BRCA1' as the abbreviation of 'breast cancer susceptibility gene 1'. Lastly, RE is able to associate the BRCA1 gene with the disease 'breast cancer'. Relation extraction can be difficult, and there are several different approaches at this stage; some articles explain these approaches in more depth with thorough examples [24][25][26].
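As a concrete, hand-made illustration (not the output of any of the cited tools), the sketch below represents the data a NER, NEN, and RE pipeline could produce for the example sentence. The identifiers are real (NCBI Gene 672 for BRCA1, MeSH D001943 for breast neoplasms), but their selection here is ours.

```python
# Hand-crafted illustration of the outputs of a NER -> NEN -> RE pipeline.
sentence = ("Individuals with a BRCA1 gene mutation are more likely "
            "to develop breast cancer at a younger age")

# NER: surface mentions with character offsets and predicted categories.
ner_output = []
for mention, category in [("BRCA1", "gene"), ("breast cancer", "disease")]:
    start = sentence.find(mention)
    ner_output.append({"mention": mention, "category": category,
                       "span": (start, start + len(mention))})

# NEN: each mention normalized to a canonical knowledge-base identifier.
nen_output = {"BRCA1": "NCBIGene:672", "breast cancer": "MESH:D001943"}

# RE: a sentence-level statement that the two normalized entities are related.
re_output = ("NCBIGene:672", "associated_with", "MESH:D001943")

for entity in ner_output:
    print(entity["mention"], entity["category"], "->",
          nen_output[entity["mention"]])
print("relation:", re_output)
```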

Text Mining Challenges: What Makes Text Mining Complex?
Part of text mining's complexity lies in the fact that different sources compile data in different formats, which often requires specific techniques [7,11]. These types of data frequently lack common structural frameworks and can contain errors such as improper grammar, spelling mistakes, or semantic ambiguities. Text errors increase the complexity of data pre-processing and text mining analysis [7,11,27]. The recognition and mapping of certain terms in the NER and NEN steps can also be troublesome. In fact, biomedical NER is usually considered more challenging, since the automatic identification of biomedical terms faces numerous difficulties due to irregularities in how known entities are named [11,12]. Common challenges arise when terms are not part of the ontology used, as misspellings or ambiguity in the terms' designations can occur. Hence, to deal with this issue, choosing the right corpus and/or ontology is crucial. This is particularly true for genes and proteins, whose nomenclature is frequently messier, since proteins and genes can share the same abbreviation and different ontologies may use different spellings [7,15]. However, this type of heterogeneity and ambiguity can also occur in key classes such as drugs or chemicals [10,15]. Correctly choosing a corpus to train a text mining model and then retrieving relations from text is an intricate task, because the complexity of grammatical constructions hinders machine retrieval of relations; at the same time, incorporating data from external sources can foster advances in the RE step [28].
Lastly, biological knowledge is complex, and the lack of certain specific information can yield conflicting answers. For instance, the same species under different conditions (e.g., age, gender, treatment) may not present the same biological state, and what happens in one species may not happen in another. These differences, if not noted, may lead to different answers upon text mining application [29].

Traditional Versus Machine Learning Driven Text Mining
Traditional text mining approaches consist of finding patterns in the evaluated text based on previously known patterns of interest. Arguably, the simplest way to mine text is by using dictionaries to find words of interest in the text and grammatical rules to find the relations between those words. Both techniques rely on the existence of dictionaries containing the words of interest and grammatical rules for the given language. As such, they struggle to find new patterns different from the already known ones [30].
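The sketch below implements this dictionary- and rule-based strategy under toy assumptions: the dictionaries, the co-occurrence rule, and the relation label are all invented for this example rather than taken from any cited tool. The second call illustrates the weakness just described, since a synonym absent from the dictionary is silently missed.

```python
import re

# Toy dictionaries of terms of interest (traditional, rule-based mining).
gene_dictionary = {"BRCA1", "TP53", "IL6"}
disease_dictionary = {"breast cancer", "lung cancer"}

def mine_sentence(sentence):
    # Rule: a known gene and a known disease in the same sentence
    # are proposed as a candidate relation.
    genes = [g for g in gene_dictionary if re.search(rf"\b{g}\b", sentence)]
    diseases = [d for d in disease_dictionary if d in sentence.lower()]
    return [(g, "related_to", d) for g in genes for d in diseases]

print(mine_sentence("Individuals with a BRCA1 mutation often develop breast cancer."))
# 'FANCS' (a BRCA1 synonym) is not in the dictionary, so nothing is found:
print(mine_sentence("Individuals with a FANCS mutation often develop breast cancer."))
```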
The advance of ML methods made their application to text mining possible, allowing a step forward in performance. ML algorithms are fed with labeled text, where the results are known, to learn a model that connects the text with the results in such a way that it generalizes to new, unseen text. This allows the user to train models on already classified text instead of explicitly defining the rules and words to be looked for. One caveat of this technique is the requirement of a large amount of labeled data, i.e., already classified text, which might be unavailable for the task at hand. To enlarge the labeled data available to ML methods, one can use other, simpler methods, such as dictionaries and rule-based methods: for example, the occurrence of specific words in the text can be used to create a labeled data set containing approximately correct examples. This technique is sometimes referred to as a weakly or distantly supervised approach and was already used in works such as Wright et al. [36].

Figure 2 shows an example of a sentence annotated for the three tasks: NER, NEN, and RE. The annotation for NER is at the token level and usually follows the IOB format, which stands for inside, outside, and beginning. Tokens marked as O do not belong to any entity, the first token of an entity is marked as B, and the remaining tokens of the entity are marked as I. In the case of multi-class NER, the class name can be attached after the B and I tags, such as B-Disease, B-Gene, or B-Chemical, among others. The labels for the NEN task are at the entity level and may cover a sequence of tokens, for example, "breast cancer", which is represented by two tokens. In both NER and NEN, tokens marked as O are not part of any entity. Finally, the label for RE is at the sentence level and states whether two entities in a sentence are related (or unrelated) to each other. In this example, it states that the "breast cancer" disease is related to the "BRCA1" gene.

To process text using ML models, the text must first be converted to a numeric form. With the advances of DL and Neural Networks (NN), word2vec [31] became a very popular model that represents words as embeddings, that is, vectors of numbers relating words to other words or concepts. These vectors are computed from co-occurrences in sentences, allowing the system to obtain interesting relationships without labeled data. This allows the use of a much larger set of text to initially train the model; a smaller amount of labeled text can then be used to fine-tune it to the final task. This technique improves the performance of the models and allows the user to transfer the initially learned vectors between different tasks [32]. More recently, several systems have obtained good results with DL mechanisms that combine embeddings with attention mechanisms, which allow the system to focus on elements of the embedding. By using different attention mechanisms to construct encoders/decoders, systems such as Embedding from Language Models (ELMo) [34] and Bidirectional Encoder Representations from Transformers (BERT) [33] have achieved impressive performance. More specific to biology, there is a BERT version trained on biomedical text, called Bidirectional Encoder Representations from Transformers for Biomedical Text Mining (BioBERT), which will be detailed in the next sections [35].
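To make the annotation scheme of Figure 2 concrete, the sketch below hand-labels the running example at all three levels; the IOB tags, identifiers, and relation label are assigned by us for illustration.

```python
# Token-level IOB labels for NER, hand-assigned for the running example.
tokens = ["Individuals", "with", "a", "BRCA1", "gene", "mutation",
          "are", "more", "likely", "to", "develop",
          "breast", "cancer", "at", "a", "younger", "age"]
iob_tags = ["O", "O", "O", "B-Gene", "O", "O",
            "O", "O", "O", "O", "O",
            "B-Disease", "I-Disease", "O", "O", "O", "O"]

# NEN labels live at the entity level (one label per mention)...
nen_labels = {"BRCA1": "NCBIGene:672", "breast cancer": "MESH:D001943"}
# ...and the RE label lives at the sentence level (related/unrelated).
re_label = ("BRCA1", "breast cancer", "related")

for token, tag in zip(tokens, iob_tags):
    print(f"{token:12s} {tag}")
```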

Biomedical Corpora
Models of biomedical text mining require specific data to be successfully and effectively trained and evaluated. Therefore, the development and implementation of novel resources to train and evaluate these algorithms have become a fundamental process to obtain better results and unveil new insights within the biomedical field. Due to the rapid growth of biomedical literature, literature knowledge annotation and extraction are becoming increasingly demanding [35,37]. Indeed, one of the biggest challenges of text mining in the biomedical field of research is the construction of appropriate biomedical corpora, i.e., sets of annotated texts and/or sentences that establish terms and/or relationships within a domain [38,39]. General corpora challenges lie in the differing word distributions across domains, as well as in domain-dependent terms and expressions whose meaning is specific to the area and which rarely appear in documents from other domains [35,37]. The annotation process is a crucial step in biomedical corpora construction, since, when poorly performed, it severely hinders the accuracy and benefit of biomedical text mining tools [38]. This annotation process can be performed through manual curation based on guidelines, which, although producing what is known as a gold standard corpus due to its higher-quality outcomes, is time-consuming and requires knowledge of the linguistic and semantic fields.
Annotated corpora are crucial for both training and evaluation in text mining tasks, since their text is enriched by adding features and patterns [40]. Nevertheless, it is necessary to take into consideration the type of annotations present in the corpus as well as the task to be performed. If the goal is to use an annotated corpus for NER, then the corpus must have the entities of interest for a given category annotated, to enable the identification of terms for that category [41]. However, in the biological context, more than knowing which entities are involved in a process, it is often more valuable to retrieve relations from the text to understand how these entities interact, in order to gain knowledge of the biological system [42]. Thus, to perform RE, annotated corpora with the same type of relationship between entities and their characteristics must be used [43]. In the biomedical field, most frequently used corpora focus on genes and/or proteins. These include the GENome Information Acquisition (GENIA) [44], Colorado Richly Annotated Full Text (CRAFT) [45], and Critical Assessment of Information Extraction in Biology (BioCreative) II [46] corpora. Furthermore, depending on the purpose of the study, other corpora can be used, contributing to a wider range of categories, such as diseases (the National Center for Biotechnology Information (NCBI) disease corpus [47]) and chemicals (the CHEMDNER corpus for chemical compound and drug name recognition [48]).
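Several of these corpora, including the NCBI disease corpus, are distributed in the PubTator format: title and abstract lines marked with "pmid|t|" and "pmid|a|", followed by tab-separated annotation lines and a blank line between documents. The sketch below parses that format; the output field names and document layout are our own choices, and the annotation lines are assumed to carry the six fields used by the NCBI disease corpus.

```python
# Minimal parser sketch for PubTator-formatted corpora (e.g., NCBI disease).
def parse_pubtator(path):
    documents, doc = [], {"text": "", "annotations": []}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:                      # blank line ends a document
                if doc["text"]:
                    documents.append(doc)
                doc = {"text": "", "annotations": []}
            elif "|t|" in line or "|a|" in line:
                doc["pmid"], _, text = line.split("|", 2)
                doc["text"] += text + " "
            else:                             # tab-separated annotation line
                pmid, start, end, mention, etype, concept_id = line.split("\t")[:6]
                doc["annotations"].append(
                    {"mention": mention, "type": etype, "id": concept_id,
                     "span": (int(start), int(end))})
    if doc["text"]:
        documents.append(doc)
    return documents
```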

Text Mining Toolkits
Performing text mining tasks from the ground up can be complex. Toolkits are key to a simpler implementation of complex text mining tasks, such as text processing (tokenization, stemming, part-of-speech tagging), NER, NEN, and RE, without losing the versatility necessary for a high-quality approach. These general toolkits can then be adapted for specific contexts, in this case, the biomedical domain.

General Architecture for Text Engineering (GATE) was originally released in 1999 and has since evolved to encompass a family of tools [65]. GATE tackles text mining and NLP problems and comprises three main components: the GATE Document Manager (GDM), the Collection of REusable Objects for Language Engineering (CREOLE), and the GATE Graphical Interface (GGI). GDM acts as storage for all the information created by the language engineering systems. CREOLE does the actual text processing and analysis; existing algorithms can be integrated using wrappers around those methods, but it is also possible to develop approaches not included in the toolkit through CREOLE's API. GGI, as the name suggests, provides a graphical interface for the various tools and resources provided in GATE. This workflow has been successfully applied in several fields, such as the life sciences and medicine, including cancer research, medical records analysis, and drug patent-related research [66].

Unstructured Information Management Applications (UIMA) is yet another software framework, with an architecture based on four main components: acquisition, unstructured information analysis at the document level, unstructured information analysis at the collection level, and structured information access. Acquisition is used to retrieve the documents to process and allows the integration of external applications. Document-level analysis performs a wide range of tasks on each document, such as language translation, grammatical parsing, named-entity detection, document summarization, and document classification. Collection-level analysis can be performed on a whole collection or sub-collection of documents to infer common characteristics between documents, such as glossaries of terms, taxonomies, feature vectors, databases of extracted relations, and detected entities. The structured information access component allows browsing the knowledge obtained from documents or searching the available methods to perform these tasks. Semantic search uses the document-level and collection-level analyses and annotations to return an ordered list of documents. This toolkit also presents a directory service to browse through the different text processing tools and knowledge source adapters, a tool for uniform access to several knowledge sources in this architecture. By recombining Unstructured Information Management (UIM) technology, UIMA can accelerate scientific advances through text mining, allowing the development of other tools [67]. UIMA has also been successfully used for clinical diagnosis [70], as well as a wrapper for an annotator [71].

ClearTK is a toolkit for statistical NLP, developed for UIMA in 2008 [68]. First, it performs feature extraction on an annotated corpus using a wide range of methods with different complexities. Then, it passes the features to the training data consumer, used to generate a training data file that can then be further used in an ML model building library. In 2014, ClearTK 2.0 was released as a further development and adaptation of the original toolkit according to community feedback [69].
Both GATE and UIMA approaches are quite complex and need a deep understanding of their architectures, not only for their use but also for the development of applications that can be used within those architectures.
BioC is a simple workflow for NLP and text mining tasks, originally implemented in C++ but extended to Python, Perl, Go, and Ruby. The BioC toolkit is based on XML and converts from this format into BioC data classes and vice-versa using two connectors, one for input and one for output. Between the input and the output, it is possible to perform several text processing tasks. A good application example of this toolkit was the release of the PubMed Central (PMC) corpus in the BioC XML format, allowing easier text retrieval tasks without losing information present in the document set. This format might also flatten the learning curve for researchers getting started with text mining [72]. As it is a widely used toolkit, new adaptations have been released, such as tmBioC, which makes the necessary changes to other algorithms, such as DNorm [73], tmVar [74], SR4GN [75], tmChem [76], GenNorm [77], and PubTator [78], so that they become compatible with BioC, further increasing BioC's application potential [79]. BioC has been used in web text mining tools [80,81] as well as in ontologies and annotations [82,83].

Some tools used in the biomedical domain lack an adaptation for the specific scientific context. Due to this, these tools are mainly used for pre-processing purposes, such as the tokenization of sentences or words; examples are the Stanford CoreNLP and Natural Language ToolKit (NLTK) toolkits. Stanford CoreNLP was originally developed for in-house use; nonetheless, it was later released to provide a tool with a simpler architecture that does not require deep knowledge to be used. It is implemented in Java and can be run either from the API or the command line. It provides a wide range of annotators, including part-of-speech tagging and NER, among others. This toolkit was designed for the English and Chinese languages, but models for other languages can be easily constructed. As with languages, other annotators can also be added to a pipeline within Stanford CoreNLP [84]. This toolkit has been used for processing in biomedical text mining pipelines [85].

NLTK was developed in Python to solve some of the challenges inherent to the teaching of computational linguistics and NLP. As such, it was developed to be simple to use, work with consistent data structures, be easily expanded through the development of new tools, and have detailed documentation. Furthermore, NLTK includes models from the biomedical domain regarding protein-protein interactions, e.g., the BioCreative protein-protein interaction corpus (http://www.nltk.org/nltk_data/, accessed on 1 July 2021). Hence, this toolkit is capable of performing a wide range of NLP operations relevant to the biomedical field, such as tokenization, parsing, token tagging, and text classification. In fact, NLTK was the toolkit used for text pre-processing in the GeNomics & Informatics (GNI) corpus, which highlights its usefulness in the biomedical field despite its broad scope of application [57].
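As an illustration of these pre-processing steps, the sketch below runs NLTK tokenization, part-of-speech tagging, and stemming on a biomedical sentence; the example sentence is ours, and the one-time NLTK model downloads are shown explicitly.

```python
import nltk
from nltk.stem import PorterStemmer

# One-time downloads of the tokenizer and tagger models used below.
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

sentence = "BRCA1 mutations increase the risk of breast cancer."
tokens = nltk.word_tokenize(sentence)       # tokenization
tagged = nltk.pos_tag(tokens)               # part-of-speech tagging
stems = [PorterStemmer().stem(t) for t in tokens]  # stemming

print(tagged)
print(stems)
```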

Text Mining Tools for NER, NEN, and RE
Even though a variety of tools is available to perform NER, the most widely accepted State-Of-The-Art (SOTA) model is BioBERT, a model based on Google's BERT, pre-trained on biomedical corpora encompassing PubMed abstracts and PMC full-text articles. For NER, BioBERT uses bidirectional transformers and directly learns WordPiece embeddings during pre-training and fine-tuning, improving its performance [35]. HUNER is a stand-alone NER tool that was trained on 34 corpora covering five entity types (chemicals, cell lines, diseases, genes, and species) using scientific literature and patents. HUNER was also evaluated on the CRAFT corpus, outperforming previous SOTA tools such as GNormPlus and tmChem by 5-13 pp on the chemical, species, and gene entity types. The HUNER model uses Long Short-Term Memory Conditional Random Fields (LSTM-CRF) to learn feature correlations and predict the entities' tags [86]. To facilitate its use and forgo Docker, HunFlair was released, combining an improved HUNER, in which a bidirectional LSTM-CRF (biLSTM-CRF) model is used, with the implementation of the Flair NLP framework. This eased its use even for inexperienced users [87]. To tackle the NER challenges related to the spelling errors mainly found in clinical records, a NER tool named Cimind was developed, which overcomes spelling errors through a phonetic approach. The authors developed a new dataset that provides multiple versions of the 10th revision of the International Classification of Diseases (ICD) in both English and French, so the system can rely on a dataset storing the double metaphone code for every word in each available language [88].
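The sketch below shows how a BioBERT-based NER model could be loaded and applied through the Hugging Face transformers library. The checkpoint name is a placeholder assumption: the base BioBERT weights carry no NER classification head, so in practice a checkpoint already fine-tuned on a NER corpus (or one fine-tuned locally) would be needed for meaningful predictions.

```python
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          pipeline)

# Placeholder checkpoint; a NER-fine-tuned BioBERT should be used instead.
checkpoint = "dmis-lab/biobert-base-cased-v1.1"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# num_labels=3 stands for a toy O/B/I scheme; the head is randomly
# initialized here and would require fine-tuning before real use.
model = AutoModelForTokenClassification.from_pretrained(checkpoint,
                                                        num_labels=3)

ner = pipeline("token-classification", model=model, tokenizer=tokenizer,
               aggregation_strategy="simple")
print(ner("Individuals with a BRCA1 mutation often develop breast cancer."))
```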
Neural Biomedical named Entity Recognition and multi-type Normalization (BERN) is a joint NER and NEN tool, which combines the NER model from the SOTA BioBERT with a high-performance NEN model for each entity type in a single step. Moreover, BERN applies probability-based decision rules to the entities retrieved at the NER stage to differentiate coinciding ones. Besides its command-line use via its GitHub repository, BERN is also available as a RESTful web service. Other methods were developed to tackle specific problem areas, such as genetic variants (SETH) and diseases (AuDis). SETH is a tool that performs NER and NEN on natural language text specific to genetic variants, facilitating the identification of genetic variants, aligning these variants with established nomenclatures, and linking them to databases covering multiple mutation types. SETH's modular implementation makes it simple to substitute the gene recognition tool to be used, so it can adapt to other types of texts [89]. AuDis is a disease NER and NEN tool that applies a CRF-based model to the NER task, optimized with multiple post-processing steps and improved abbreviation resolution and stopword filtering. Disease mentions are normalized to specific concepts in an existing repository through a dictionary-lookup method developed by the authors [90].
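A minimal sketch of calling the BERN RESTful web service follows. Both the endpoint URL and the JSON payload/response schema below are our assumptions rather than confirmed interface details; the BERN repository documents the current interface and should be consulted before use.

```python
import requests

# Assumed endpoint and payload schema; verify against the BERN docs.
endpoint = "https://bern.korea.ac.kr/plain"
payload = {"text": "Individuals with a BRCA1 mutation often "
                   "develop breast cancer."}

response = requests.post(endpoint, json=payload, timeout=30)
# Expected (assumed): recognized entities with categories and normalized IDs.
print(response.json())
```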
Disease-Expression Relation Extraction from Text (DEXTER) [91] is an RE tool that extracts information from the literature regarding gene and microRNA expression in a disease-related framework, returning the expression level, the experimental context, and the compared conditions. DEXTER's first step, NER, is performed using the Stanford CoreNLP toolkit to gather all entities that are genes, microRNAs, or diseases. Then, the retrieved sentences undergo a search for trigger words, i.e., words specific to DEXTER's field. Next, parsing is used to extract relations and build entity1-relation-entity2 triplets from a Standard Dependency Graph (SDG). DEXTER is available both as a stand-alone tool and as part of the BioExpress resource [92]. Protein-protein association Extraction with Deep Language (PEDL) is another RE tool that predicts protein-protein associations using a Multi-Instance Learning (MIL) framework. PEDL is a two-step approach that first uses a BERT model to extract information and then derives relationship predictions from the transformer layers using CLaSsification (CLS) tokens. PEDL uses distantly supervised data to retrieve all protein pairs and relations from the Protein Interaction Database (PID), and directly supervised data from gold standard datasets of the BioNLP shared tasks. PEDL was able to extend the knowledge present in pathway databases by predicting additional protein-protein associations, a strong indicator of the usefulness of such approaches [93]. Biomedical Relation (BioRel), a dataset for distantly supervised RE, is a full text mining approach encompassing all text mining steps. BioRel uses Medline as corpus and the Unified Medical Language System (UMLS), specifically the MetathesauRus RELationships (MRREL), as knowledge base (KB), since it includes binary relations. Relationships were assessed through the National Drug File-Reference Terminology (NDFRT), and the relations between genes and cancer were retrieved from the National Cancer Institute (NCI) to add to the MRREL vocabulary. MetaMap was used to retrieve entities from the corpus and normalize them to UMLS. Further feature extraction was performed using the StanfordNLP tool, and distantly supervised annotation labels were created before the final steps of filtering and building the final dataset. Moreover, BioRel was tested and proved a useful resource for training Deep Neural Network (DNN) models [23].
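To ground the contrast with these learned models, the sketch below implements the simple co-occurrence baseline for RE mentioned earlier in this review; the tagged sentences and the threshold of two co-occurrences are invented for the example, and entity tagging is assumed to come from an upstream NER/NEN step.

```python
from collections import Counter
from itertools import combinations

# Entity sets per sentence, as an upstream NER/NEN step might produce them.
tagged_sentences = [
    {"BRCA1", "breast cancer"},
    {"BRCA1", "breast cancer", "TP53"},
    {"TP53", "lung cancer"},
]

# Count how often each entity pair co-occurs in a sentence.
pair_counts = Counter()
for entities in tagged_sentences:
    for pair in combinations(sorted(entities), 2):
        pair_counts[pair] += 1

# Pairs co-occurring at least twice are proposed as candidate relations.
candidates = [pair for pair, n in pair_counts.items() if n >= 2]
print(candidates)   # [('BRCA1', 'breast cancer')]
```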
Beyond the methods highlighted in this section, a recap of useful text mining tools is given in Table 2. BioCreative, the National Center for Text Mining (NaCTeM), and NCBI also compile several text mining tools at http://biocreative.sourceforge.net/bionlp_tools_links.html, http://www.nactem.ac.uk/software.php, and https://www.ncbi.nlm.nih.gov/research/bionlp/Tools/ (accessed on 1 July 2021), respectively.

Web-Based Applications
Despite the accuracy and broad applicability of the tools presented above, less experienced users tend to prefer web-based applications for their analyses. Hence, several web servers are available to retrieve key concepts from biomedical data.
PubTator Central (PTC) is a freely accessible, daily updated server able to automatically annotate more than 30 M PubMed abstracts and more than 3 M full-text articles from the PMC-TM subset. PTC is commonly used in biocuration support, gene prioritization, genetic disease analysis, literature-based knowledge discovery, and downstream text mining. This web server recognizes tokens and classifies them into six categories: genes or proteins, genetic variants, diseases, chemicals, species, and cell lines. Each of these categories was trained using the same taggers as PubTator, or re-trained to increase performance whenever a new corpus became available, as in the case of variants, which used tmVar 2.0, and chemicals, which used an improved version of TaggerOne re-trained with the BioCreative V Chemical Disease Relation (BC5CDR) and CHEMDNER corpora. Normalization is conducted with a Convolutional Neural Network (CNN) able to identify the correct bioconcept through the syntax and semantics of the surrounding words. This CNN was trained with human-curated databases, attaining human-comparable accuracy. Annotated articles and abstracts are freely available through a raw-text input via the command line, but also on the PTC web server (https://www.ncbi.nlm.nih.gov/research/pubtator/, accessed on 1 July 2021) [106]. This web tool has been used in several COrona VIrus Disease 2019 (COVID-19)-related works [107][108][109].
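PTC annotations can also be retrieved programmatically. The sketch below uses the PubTator export endpoint as documented around this tool's release; the URL, parameters, and example PMID are assumptions that should be checked against the current API documentation.

```python
import requests

# Assumed PubTator export endpoint; verify against the current PTC docs.
url = ("https://www.ncbi.nlm.nih.gov/research/pubtator-api/"
       "publications/export/pubtator")
# Example PMID taken from NCBI's own API examples (assumed still valid).
response = requests.get(url, params={"pmids": "19894120"}, timeout=30)

# Expected: PubTator-format text (title/abstract lines plus
# tab-separated annotation lines with normalized concept IDs).
print(response.text)
```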
SciLite is another web server, available through Europe PMC (https://europepmc.org/, accessed on 1 July 2021), that helps users find essential concepts in documents and correlate them with available resources and tools. This server not only identifies genes or proteins, organisms, diseases, GO terms, chemicals, and accession numbers, but is also capable of linking these concepts to related databases and providing more information on them without leaving the web server page. For instance, it can retrieve protein structures corresponding to Protein Data Bank (PDB) accession numbers found in documents. SciLite also has an evaluation mechanism in which users can confirm or report annotations to improve the text mining algorithms [110]. SciLite has been used to improve access to protein motif articles [111] as well as in annotators [112].
Textpresso is another web server with two main goals: splitting full articles into individual sentences, and categorizing terms in article databases and sentences so that they can be easily searched (https://textpressocentral.org/tpc, accessed on 1 July 2021). Although this web server is dedicated to Caenorhabditis elegans literature, it may be extended to other organisms. Words can be classified into several biological concepts, such as genes, alleles, cell groups, and phenotypes, or can be correlated as associations, regulations, or biological processes. When combined, these classes form an ontology, and the article corpus can be marked with words of these classifications. Users can search for one or several keywords from these classes, and the web server provides the sentences of articles containing those words to help users select relevant articles. At the moment, Textpresso holds more than 2.5 million full-text articles from several corpora. Overall, Textpresso helps users identify important articles and focus on article information related to the user's query [113]. This tool has been used to accelerate the annotation process for the creation of knowledge graphs [30] as well as for annotation and curation in more complex pipelines [114].
Egas (https://demo.bmd-software.com/egas/, accessed on 1 July 2021) is a web-based tool designed to be user-friendly, with an extensive interface built around six main components: project management, project and document navigators, processing tools, account management, concept and relation type filters, and real-time collaboration. Usually, this tool's workflow begins with the selection and import of study documents, which can be local files in raw text, A1, or BioC format, or a query for PubMed abstracts as well as PubMed Central full-text documents. If a query is used, a list of documents is presented to the user for selection [115]. After document selection, the next step is to automatically annotate the retrieved documents; for that, the Biomedical Concept Annotation System (BeCAS) REST API [116] is used to annotate a range of biomedical entities such as genes, proteins, or drugs. To analyze protein-protein interactions, an ML model was deployed to recognize protein names, using BioThesaurus [117] for normalization, followed by a rule-based approach for protein-protein interaction recognition [115]. Annotation results are then displayed in the document viewer, and annotated documents can be exported in the A1 or BioC formats [115]. This web tool was used, for example, in work on the semi-automatic curation of text data related to a rare disease [118].
PolySearch2 is a text mining web tool, available at http://polysearch.ca/index (accessed on 1 July 2021), that can effectively relate two entities in a "given X, find all associated Y" type of query [119]. PolySearch2 mines several sources, encompassing MEDLINE and PMC papers, Wikipedia, US patents, open-access textbooks, and MedlinePlus, comprising a 43-million-article text collection with the further integration of 13 public databases [119]. The entities available for such requests range from human diseases, genes, single nucleotide polymorphisms, proteins, drugs, metabolites, toxins, metabolic pathways, organs, tissues, subcellular organelles, positive health effects, negative health effects, and drug actions to ontology terms from a plethora of biological and chemical taxonomies [119]. The dictionaries that enable this type of query include over 1.13 million terms and 2.84 million synonyms gathered from various sources [119]. PolySearch2 was used as an evaluation tool in [120], and also for searching proteins related to liver cancer [121].
Finding Associated Concepts with Text Analysis (FACTA)+, available at http://www.nactem.ac.uk/facta/ (accessed on 1 July 2021), was released to expand the original FACTA version [122] and fill the need for a tool that identifies and explores a range of associations [123], adding to the previous web servers. Originally, FACTA allowed users to obtain entities relevant and related to a query [122]. Searches are performed over MEDLINE based on input queries, which can be a word, an ID, or a combination of both, and the results fall within six categories: human genes or proteins, diseases, symptoms, drugs, enzymes, or chemical compounds. Concept identification in documents is accomplished by a dictionary-based approach that involves the Universal Protein resource (UniProt) [124], BioThesaurus [117], UMLS [125], the Human Metabolome DataBase (HMDB) [126], the Kyoto Encyclopedia of Genes and Genomes (KEGG) [127], and DrugBank [128]. In FACTA+, three new features were added, ranging from the recognition of biomolecular events to the discovery of indirect associations and an enhanced visualization of results. For event recognition, it is important to detect triggers, i.e., words considered indicative of a relation between two entities, which FACTA+ does via an ML-based approach to NER that uses CRF models. To discover hidden associations, FACTA+ considers that if a central entity is related to two different entities, then these two entities may also be related to each other. Because such associations are noisy, it is of the utmost importance to rank the possible indirect associations correctly, favoring their retrieval whilst keeping them reliable. To present these hidden associations, a new treemap visualization scheme was introduced for directly associated concepts, along with a treemap with linking, in which co-occurrence serves as the relation strength measure, for indirectly associated concepts [123]. This web tool has been used to extract information about the indirect interactions between post-translational modifications of histone proteins [129].
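The hidden-association heuristic just described can be captured in a few lines. In the sketch below, the direct association table and the concept names are invented for the example, and the number of shared pivot concepts is used as a crude ranking score.

```python
# Toy table of directly associated concepts (pivot entities per concept).
direct = {
    "drug_X": {"gene_B", "gene_D"},
    "disease_Y": {"gene_B", "gene_E"},
    "drug_Z": {"gene_E"},
}

def indirect_score(a, c):
    # Two concepts are proposed as indirectly associated when they share
    # pivot concepts; more shared pivots means a stronger candidate.
    return len(direct[a] & direct[c])

pairs = [("drug_X", "disease_Y"), ("drug_X", "drug_Z")]
for a, c in sorted(pairs, key=lambda p: -indirect_score(*p)):
    print(a, c, indirect_score(a, c))
```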
A wide range of tools is available online to help researchers automatically annotate biomedical literature by resorting to text mining techniques. However, two main sub-tasks, RE and scoring functions, still need to be further improved in the coming years.

Public Databases That Incorporate Text Mining Models
Databases that gather biological information often incorporate information retrieved via text mining approaches to widen their data. This is the case for several databases, such as the Search Tool for the Retrieval of Interacting Genes/Proteins (STRING), the Search Tool for Interacting Chemicals (STITCH), the microRNA-Target interactions dataBase (miRTarBase), the Biological General Repository for Interaction Datasets (BioGRID), and DisGeNET.
The STRING database seeks to gather, score, and incorporate available protein-protein interaction data and complement this information with computational predictions [130]. STRING ultimately aims to establish a global network of proteins' direct physical interactions and indirect functional interactions. Nowadays, this database includes protein-protein interaction information for 5090 different organisms and more than 24.6 million proteins.
Two proteins are related if they are functionally associated, meaning they both contribute to a particular biological function. To be considered functionally associated, they may interact physically or share a specific cellular pathway. These associations can be established through genomic information, co-expression, text mining, biochemical experiments, or pathway information. Each association has a score and a number of associated views to evaluate the protein-protein interaction information and give a confidence estimate. Text mining associations are obtained through a statistical co-citation analysis of more than 28 million PubMed, Medline, and Online Mendelian Inheritance in Man (OMIM) full articles and abstracts [130].
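STRING's associations, including the text mining channel scores, can be queried through its public REST API. The sketch below follows STRING's documented endpoint and parameter names, but these should be verified against the current API version before use.

```python
import requests

# STRING REST API: interaction partners for a protein (documented method;
# URL and parameter names should be checked against the current version).
url = "https://string-db.org/api/tsv/interaction_partners"
params = {"identifiers": "BRCA1", "species": 9606, "limit": 5}

response = requests.get(url, params=params, timeout=30)
# Expected: tab-separated partners with per-channel scores,
# including the text mining score.
print(response.text)
```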
STITCH aims to integrate protein-protein and protein-chemical interactions into a single database with a global network for each organism [131]. This database covers 430 k different chemicals and the corresponding binding affinities, so users can analyze a chemical's impact on a protein of interest. The STITCH database combines experimental data from ChEMBL [132], the Psychoactive Drug Screening Program (PDSP) Ki Database [133], and PDB [134] with computational predictions and information from manually curated datasets such as DrugBank [135], the GPCR-ligand database [136], Matador [137], the Therapeutic Targets Database [138], the Comparative Toxicogenomics Database [139], and pathway databases like KEGG [140], Reactome [141], and BioCyc [142]. Redundant interactions across these datasets are counted only once to improve the interaction confidence level, and the final score is computed based on the strongest described binding affinity. Information from these experimental and manually curated datasets is integrated with text mining predictions, involving co-occurrence text mining and NLP of MEDLINE and Research Portfolio Online Reporting Tools (RePORTER) abstracts and PubMed full articles. The later STITCH version filters out, for each organism tissue, the proteins and chemicals that are not associated with that particular tissue [131].
MiRTarBase is an online database of interactions between gene targets and microRNAs, non-coding RNAs of 18-25 nucleotides that regulate gene expression [143]. This database holds more than 479,000 curated microRNA-target interactions (MTIs), involving more than 4000 microRNAs and more than 23,000 target genes, from more than 11,000 manually curated articles identified through a text mining system with a scoring scheme. Several databases and tools were integrated into miRTarBase to improve accuracy, covering gene information, microRNA regulator information, disease information, and gene and microRNA expression [143].
The BioGRID database aims to curate and store protein, genetic, and chemical interactions for humans and model organisms [144]. This database holds more than 1.74 million biological interactions from more than 70 species, manually curated from more than 55 thousand articles and other databases. BioGRID takes interaction data from experimental evidence and is guided by text mining approaches.
DisGeNET is a knowledge management platform that establishes associations of genes and genomic variants with human diseases [145]. This comprehensive database includes over 24,000 diseases and traits, 17,000 genes, and 117,000 genomic variants, as well as over 625,000 gene-disease associations and over 115,000 variant-disease associations. Both gene-disease and variant-disease associations were extracted based on gene or variant list similarity, as well as with text mining tools applied to the scientific literature via the Literature-derived Human Gene-Disease Network (LHGDN) or the BeFree text mining resources. These tools can identify and standardize entities and relationships, and analyze linguistics and semantics to identify relationships between genotype and phenotype [145].

Future Perspectives
As the number of available biological and biomedical literature repositories quickly increases, the manual curation of every publication is becoming tougher, and the prioritization of important experimental publications is becoming harder. This makes the integration of text mining methods into researchers' daily work a necessity, as it allows scoring publications regarding their interaction information and automates article annotation [144]. The main caveat of text mining resources lies in their lack of centralization. This fact severely hinders tool comparison, and even the finding of such tools, regardless of user proficiency. Approaches to gathering resources and creating a centralized database must be a priority. To this end, an attempt to find and aggregate such resources was made by Amália Lourenço's lab [146]. In this work, over 135 active websites were found and characterized regarding text mining tools. Despite the results presented, similar work must be done for repositories, universities' websites, and the biomedical literature. This highlights the need for, as mentioned above, the creation of a centralized and permanently updated database of such tools. Recently, the bio.tools website by ELIXIR, a centralized option whose input is user-dependent, has come to encompass many NLP (https://bio.tools/t?topicID=%22topic_0218%22, accessed on 1 July 2021) and text mining (https://bio.tools/t?operationID=%22operation_0306%22, accessed on 1 July 2021) resources.
The blooming of text mining tools is also boosted by the improvement of DL algorithms and their ability to provide new insights. However, the RE step is still far from being solved. To address these difficulties, text mining competitions such as BioCreative, the Biomedical Natural Language Processing Workshop (BioNLP), or the Biomedical Linked Annotation Hackathon (BLAH) take place, most of them yearly, to encourage discussion and push the boundaries of text mining. New methods to improve the incorporation of data from KBs, new corpora suitable for an increasing range of subjects, and new models towards a higher-accuracy RE step are, generally speaking, the first demands.
Systems Biology is a hot area nowadays, despite its complexity due to the inherent difficulty of gathering vast insights from related fields, such as those provided by omics methodologies. Text mining can effectively connect information from different sources, integrate it, and provide an accurate way to visualize the results, often providing even deeper insights and hence streamlining this broader area.

Funding: This work was funded by COMPETE 2020-Operational Programme for Competitiveness and Internationalisation and Portuguese national funds via FCT-Fundação para a Ciência e a Tecnologia, under projects POCI-01-0145-FEDER-031356 and UIDB/04539/2020. The authors would also like to acknowledge STRATAGEM-New diagnostic and therapeutic tools against multidrug-resistant tumors, CA17104.