PREGO: A Literature and Data-Mining Resource to Associate Microorganisms, Biological Processes, and Environment Types

To elucidate ecosystem functioning, it is fundamental to recognize what processes occur in which environments (where) and which microorganisms carry them out (who). Here, we present PREGO, a one-stop-shop knowledge base providing such associations. PREGO combines text mining and data integration techniques to mine such what-where-who associations from data and metadata scattered in the scientific literature and in public omics repositories. Microorganisms, biological processes, and environment types are identified and mapped to ontology terms from established community resources. Analyses of comentions in text and co-occurrences in metagenomics data/metadata are performed to extract associations and a level of confidence is assigned to each of them thanks to a scoring scheme. The PREGO knowledge base contains associations for 364,508 microbial taxa, 1090 environmental types, 15,091 biological processes, and 7971 molecular functions with a total of almost 58 million associations. These associations are available through a web portal, an Application Programming Interface (API), and bulk download. By exploring environments and/or processes associated with each other or with microbes, PREGO aims to assist researchers in design and interpretation of experiments and their results. To demonstrate PREGO’s capabilities, a thorough presentation of its web interface is given along with a meta-analysis of experimental results from a lagoon-sediment study of sulfur-cycle related microbes.


Introduction
Microbes are omnipresent and impact global ecosystem functions [1] through their abundance [2], versatility [3], and interactions [4]. These facts have inspired microbiologists from diverse scientific fields to study their genotype and phenotype [5], their metabolism [6], and their interactions with the environment [7]. All this work has resulted in a wealth of knowledge available in the form of literature and experimental data. Literature contains vast amounts of information in free-text form that overwhelms researchers. Advanced text mining methods [8] have been developed to address this issue. Experimental data and their metadata require mining [9] as well for their integration, mostly through metagenomic mining from online repositories. Hence, the combination of this knowledge about microbial [...] theory [45]. These analyses and resources are important because microbiologists can enrich their data to explore hypotheses but also to identify potential gaps in knowledge regarding these associations [46].
Here, we present PREGO, a hypothesis generation web resource that is designed to be useful to microbiologists, in particular microbial ecologists and environmental microbiologists. Its specific aims include: (a) the gathering of source data, metadata, and literature, followed by the extraction of the microorganism, process, and environment associations contained therein, and (b) making this mined knowledge base available to life sciences researchers via an easy-to-use and easy-to-explore web portal. PREGO can also be useful to systems microbiologists and large-scale data analysts through bulk download and programmatic access. We document the principles, analysis methodology, and contents behind PREGO. Finally, we demonstrate PREGO's capabilities for researcher support through a case study involving sulfate-reducing microorganisms.

Materials and Methods
PREGO is a resource designed to assist molecular ecologists in acquiring a single-point overview of what-where-who (process-environment-organism) associations. The system comprises two main parts: (a) a server that periodically harvests data and extracts process-environment-organism associations from the scientific literature, environmental samples, and genome annotation sequences (Figure 1, steps 1 to 5) and (b) a web-based interface as well as an Application Programming Interface (API) that provide users and programmers with a friendly way to extract and navigate across the process-environment-organism associations (Figure 1, step 6).
Microorganisms 2022, 10, x FOR PEER REVIEW 3 of 23

Figure 1. Scientific text, environmental sample data, and genomic annotations are handled with respective methodologies in order to standardize their entities. Named Entity Recognition and Comention/Co-occurrence analysis is the common framework used to build a weighted association network with nodes being the entity identifiers. Lastly, all these associations are available through a Web interface and an API. All these steps have been implemented in an autonomous way with regular cycles of updates (see Appendix B). Icons used from the Noun Project released under CC BY: Books by Shakeel Ch., Bacteria by Maxim Kulikov, ftp by DinosoftLab, Mountain by Diane, Ship on Sea by farra nugraha, River by Chanut is Industries.

Entity Types, Channels, and Associations
PREGO supports three entity types: Process, Environment, and Organism. For interoperability and consistency, an ontology or taxonomy is adopted for each type of entity. Processes are represented as Gene Ontology (GO) terms and are grouped either as Biological processes (GObp) or as Molecular functions (GOmf). Environments are represented by terms from the Environment Ontology (ENVO). Organisms are represented by the NCBI Taxonomy identifiers of microbial taxa (Bacteria, Archaea, and unicellular eukaryotes). For the unicellular eukaryotes, a custom list of taxa was compiled from a curated list.
PREGO's contents are divided into three distinct channels of information based on data origin and format (Figure 1, step 1). The Literature channel exploits scientific publications, i.e., abstracts and full-text open access scientific publications (Table 1 and Section 2.2). Through the Annotated Genomes and Isolates channel, PREGO retrieves genome annotations and their accompanying metadata (Table 1 and Section 2.3). Finally, the Environmental Samples channel supports the integration of metagenomic analyses from both amplicon and shotgun studies. These include taxonomic and functional profiles along with their corresponding metadata (Table 1, more details in Section 2.4). In cases in which the retrieved data and metadata are in text form, they are standardized to the aforementioned identifiers and taxonomies using a Named Entity Recognition (NER) tool, namely the EXTRACT tagger [32,47]. In cases where data contain KEGG Orthology terms and/or UniRef identifiers, they are mapped to the respective GOmf terms using the mapping files available from UniProt (see Appendix A). Associations are extracted after the mapping and standardization of the entities from each resource (Figure 1, step 3).
The association extraction pipeline is distinct for each channel and resource because of differences in data type and origin (see prego_gathering_data in the Availability of Supporting Source Codes section). For navigation purposes, the large number of associations returned to the user requires some form of sorting; ideally, one that ranks the most trustworthy associations at the top. For this reason, each channel of PREGO has a dedicated scoring scheme bounded within the (0, 5] interval for consistency. The scoring scheme of each channel is elaborated in Appendix C.

Text Mining of Scientific Literature
PREGO implements a text mining methodology to extract associations of the aforementioned entities from literature. Text mining operates on a corpus that comprises scientific abstracts from MEDLINE® and PubMed® and full-text articles from the PubMed Central® Open Access Subset (PMC OA Subset) [48]. The building and periodic update of the corpus is possible through the NCBI File Transfer Protocol (FTP) services. PREGO also has a dedicated text-mining dictionary (see Availability of Supporting Source Codes section) that contains the entity identifiers, names, synonyms, and neglected words (stop words). The PREGO dictionary incorporates the evaluated ORGANISMS [31] and ENVIRONMENTS [49] dictionaries as well as the experimental dictionaries of Gene Ontology Biological Process and Molecular Function.
Text mining is subsequently performed on the corpus using the dictionary through the EXTRACT tagger [32,47]. The tagger recognizes the entities of the dictionary in each abstract and full-text article and assigns a score to their co-mentions. The score is sensitive to the structural level of the text at which the co-mention occurs: it is highest when the two entities appear in the same sentence, lower when they appear in the same paragraph, and lowest when they only appear in the same article. All these are integrated and normalized to a single score for each association, as implemented in STRING 9.1 [34] (see Appendix C for more details). In addition, the tagger extracts each mention in every article to provide the origin of each association it extracts.
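The structure-sensitive co-mention counting described above can be sketched as follows. This is a minimal illustration, not PREGO's implementation: the function names, the weights W_DOC, W_PAR, and W_SENT, the exponent ALPHA, and the observed-over-expected form of the score are all assumptions made for the example, and the actual parameters and normalization are those given in Appendix C. Each document is modeled as a list of paragraphs, each paragraph as a list of sentences, and each sentence as the set of entity identifiers tagged in it.

```python
from collections import defaultdict
from itertools import combinations

# Illustrative placeholder parameters (assumptions, not PREGO's actual values).
W_DOC, W_PAR, W_SENT = 1.0, 2.0, 0.2
ALPHA = 0.6


def weighted_comentions(corpus):
    """Return {(entity_a, entity_b): weighted count} over all documents.

    A pair gains W_DOC for each document mentioning both entities, plus
    W_PAR if they also share a paragraph and W_SENT if they also share
    a sentence in that document.
    """
    counts = defaultdict(float)
    for document in corpus:
        par_pairs, sent_pairs = set(), set()
        doc_entities = set()
        for paragraph in document:
            par_entities = set()
            for sentence in paragraph:
                sent_pairs.update(combinations(sorted(sentence), 2))
                par_entities |= sentence
            par_pairs.update(combinations(sorted(par_entities), 2))
            doc_entities |= par_entities
        for pair in combinations(sorted(doc_entities), 2):
            counts[pair] += (W_DOC
                             + W_PAR * (pair in par_pairs)
                             + W_SENT * (pair in sent_pairs))
    return dict(counts)


def comention_scores(counts):
    """Combine each weighted count with an observed-over-expected ratio."""
    total = sum(counts.values())
    marginal = defaultdict(float)
    for (a, b), c in counts.items():
        marginal[a] += c
        marginal[b] += c
    return {
        (a, b): c ** ALPHA * (c * total / (marginal[a] * marginal[b])) ** (1 - ALPHA)
        for (a, b), c in counts.items()
    }


# One toy document: entities "A" and "B" share a sentence, "C" is in another paragraph.
toy_corpus = [[[{"A", "B"}], [{"C"}]]]
toy_counts = weighted_comentions(toy_corpus)
toy_scores = comention_scores(toy_counts)
```

In this toy corpus the same-sentence pair receives a larger weighted count and a larger score than the pairs that only share the document, which is the ranking behavior the paragraph describes.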

Annotated Genomes and Isolates
Annotated genomes and isolates comprise the most trustworthy data in PREGO's knowledge base because they refer to a single species/strain and also have manually curated metadata. Among other data types, JGI-IMG [27,50] includes millions of genes from isolated genomes (isolates), single-amplified genomes (SAGs), and metagenome-assembled genomes (MAGs). Such annotations, along with their corresponding metadata, were collected using web-parsing technologies. Their metadata, describing their related environment/ecosystem, were tagged using the EXTRACT tagger to infer organisms-environments associations. The annotated KEGG terms were mapped to GOmf terms (see Appendix A). The GOmf terms were then used to extract organisms-processes associations.
The output of the Struo pipeline [51] applied to the Genome Taxonomy Database (GTDB) (v.03-RS86) [52] was used to enrich the organisms-processes associations. A set of 21,276 representative genomes, accompanied by UniRef50 annotations, was retrieved using the provided FTP server. The annotations were then mapped to GOmf terms (see Appendix A). Related GTDB genomes were mapped to their corresponding NCBI taxa (see Appendix A). All associations extracted from these resources were arbitrarily assigned a confidence level of four out of five. This score choice reflects the high quality of these data and metadata.
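The mapping steps of this channel can be sketched as below. The sketch is illustrative only: the function name, the dictionary-based inputs, and all identifiers in the usage example are hypothetical placeholders, and the actual mapping files are those described in Appendix A. Only the fixed confidence level of four comes from the text.

```python
# Fixed confidence level for this channel (four out of five, as stated in the text).
GENOME_CHANNEL_SCORE = 4.0


def genome_associations(genome_annotations, annotation_to_gomf, genome_to_taxon):
    """Return sorted (taxon_id, gomf_id, score) triples.

    genome_annotations: {genome_id: [annotation_id, ...]}  (e.g., KEGG or UniRef50 ids)
    annotation_to_gomf: {annotation_id: [gomf_id, ...]}    (hypothetical mapping table)
    genome_to_taxon:    {genome_id: ncbi_taxon_id}         (hypothetical mapping table)
    """
    associations = set()
    for genome, annotations in genome_annotations.items():
        taxon = genome_to_taxon.get(genome)
        if taxon is None:
            continue  # genomes without an NCBI taxon mapping are skipped
        for annotation in annotations:
            # annotations without a GOmf mapping contribute nothing
            for go_term in annotation_to_gomf.get(annotation, ()):
                associations.add((taxon, go_term, GENOME_CHANNEL_SCORE))
    return sorted(associations)


# Usage with placeholder identifiers (not real database entries).
example = genome_associations(
    {"genome_1": ["ann_a", "ann_b"], "genome_2": ["ann_a"]},
    {"ann_a": ["GO:x1"], "ann_b": ["GO:x1", "GO:x2"]},
    {"genome_1": "taxon_1"},
)
```

Using a set before sorting collapses duplicate associations that arise when several annotations of one genome map to the same GOmf term.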
In addition, BioProject data were integrated into PREGO using the NCBI FTP/e-utils services [48]. The BioProject ids that were integrated are those that have been assigned a PubMed abstract, a unicellular taxon, and Genome sequencing as data type. Then, using the text mining pipeline, associations were extracted connecting the assigned taxon with the rest of the entities that appear in the abstracts. The resulting associations were assigned a confidence level of three (out of five) because the method combines curated data with text mining.

Environmental Samples
The MGnify [26] and MG-RAST [28] repositories provide a great number of public metagenomic records. In the PREGO framework, both amplicon and shotgun metagenomic analyses are retrieved periodically along with their corresponding metadata. Data retrieval from these resources is possible through their Application Programming Interfaces (APIs). Marker gene analyses are retrieved, and organisms-environments associations are extracted by measuring the co-occurrence of taxa with the various environmental types (e.g., biomes, materials, features, etc.). These associations emerge when a taxon is reported together with a certain environmental type that is mentioned in the metadata of a sample (metadata-based co-occurrence). Similarly, analyses of metagenomic samples along with their corresponding metadata and annotations are also retrieved, and organisms-environments, organisms-processes, and processes-environments associations are extracted. The processes-environments associations are possible through the co-occurrence of the functional annotations of metagenomes with the environmental metadata of the samples.
In all cases, the EXTRACT tagger is used on the microorganism names and the corresponding metadata of each sample to map them to identifiers (NCBI ids, ENVO terms, GOmf, GObp). All associations in this channel are scored based on the number of samples in which the entity of interest co-occurs with specific sample metadata (e.g., environmental type) or annotations (functional or taxonomic annotations). The same scoring scheme was implemented across the resources of the channel (see Appendix C for more details), and it ranks these associations with a value in the (0, 5] interval.
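The sample-based co-occurrence counting can be illustrated with the short sketch below. It assumes that each sample has already been reduced to a list of taxon identifiers and a list of metadata terms; the function names and the identifiers in the usage example are hypothetical. The rescaling in to_score is a deliberately simple placeholder that only keeps values within (0, 5]; it is not the scoring scheme of Appendix C.

```python
from collections import Counter


def sample_cooccurrence(samples):
    """Count, for every (taxon, term) pair, the number of samples containing both.

    samples: iterable of (taxa, terms) pairs, one per sample.
    """
    counts = Counter()
    for taxa, terms in samples:
        # sets ensure a pair is counted at most once per sample
        for taxon in set(taxa):
            for term in set(terms):
                counts[(taxon, term)] += 1
    return counts


def to_score(count, max_count, ceiling=5.0):
    """Placeholder rescaling into (0, ceiling]; NOT the Appendix C scheme."""
    return ceiling * count / max_count


# Usage with placeholder identifiers (not real sample records).
toy_samples = [
    (["taxon_1", "taxon_2"], ["env_a"]),
    (["taxon_1"], ["env_a", "env_b"]),
    (["taxon_1", "taxon_1"], ["env_b"]),
]
toy_counts = sample_cooccurrence(toy_samples)
top = max(toy_counts.values())
toy_scores = {pair: to_score(n, top) for pair, n in toy_counts.items()}
```

Counting samples rather than individual mentions means that a taxon listed several times within one sample does not inflate its association with that sample's environment.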

Sequence Search
In the case of organisms, PREGO enables sequence-based queries, meaning that a sequence (amplicon) can be used as an entry point as if it were a taxon name. To this end, a custom database was built from a set of reference databases for four commonly used marker genes. For 16S and 18S rRNA, the SILVA database (v.138) [53] and the PR2 database (version_4.14.0) [54,55] were used. Cytochrome c oxidase I (COI) [56] is another commonly used marker gene; for this reason, Midori 2 (version GB243) [57] was integrated into PREGO's custom database. Finally, for the Internal transcribed spacer (ITS), common in studies focusing on Fungi, the UNITE (version 8.3, accessed 10.05.2021) [58] database was added.

Back-End Server and Front-End Implementation
PREGO is a multi-tier web-based application. It is hosted on a Debian server (DELL R540, 20 cores, 64 GB RAM). Custom API clients (written in Python) are responsible for retrieving the data and metadata from each source (Figure 1, step 2). These clients as well as the subsequent methodology (Figure 1, steps 3 to 6) are run in regular update cycles using custom daemons (see Appendix B, Figure A1). The mamba/blackmamba web framework handles communication with the Postgres database that holds the associations, as well as the client-side communication. HTML 5, Ajax, jQuery, and custom JavaScript enhance the user web experience. PREGO supports widely used browsers (e.g., Chrome, Firefox, Safari, Edge) on various operating systems, such as Windows 10, Linux (Ubuntu 18), and macOS (10.12, 11).

The PREGO Web Resource
Users can access the PREGO contents through its web User Interface (UI) (Figures 2 and 3), its Application Programming Interface (API) (Figure 4), or bulk download of all associations (Appendix D). The User Interface comes with two search fields: a plain text search and a sequence search (Figure 2a). The latter is used when the user wants to search for a taxon sequence (see Section 2.5 for supported sequence databases). The plain text search supports three types of entry points: the user can search for a taxon name, e.g., Methanosarcina mazei, an environmental type, e.g., lagoon, or a biological process, e.g., methanogenesis. For all entry points, PREGO returns an overview page consisting of tabs with associations of the entity of interest with the entities of the two other types (Figure 2b-d) as well as Documents and Downloads tabs (Figure 2e,f).
Regarding the association tabs, when a taxon is used as a query, PREGO returns an overview page consisting of tabs for environments, biological processes, and molecular functions. When an environmental type is used as input, PREGO returns the organisms that have been found to be related to it, as well as the Biological Processes observed in the given environment. Lastly, if a biological process is under study, PREGO returns a tab with the organisms along with another tab with the Environments related to the process. Notably, only the associations with scores higher than 0.5 are presented in the web platform and are sorted in descending order based on their score. The score is visualized with a five-star system (see Appendix C for the scoring scheme).
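The display rule described above (keep scores higher than 0.5, sort in descending order, show stars) can be sketched as follows. The threshold and the ordering come from the text; the function names, the rounding of a score up to whole stars, the ASCII rendering, and the scores in the usage example are assumptions made for illustration only.

```python
import math

DISPLAY_THRESHOLD = 0.5  # stated in the text


def displayed_associations(associations, threshold=DISPLAY_THRESHOLD):
    """Keep (entity, score) pairs with score > threshold, highest score first."""
    kept = [item for item in associations if item[1] > threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


def stars(score, max_stars=5):
    """Hypothetical rendering: round the score up to whole stars ('*' filled, '-' empty)."""
    filled = min(max_stars, math.ceil(score))
    return "*" * filled + "-" * (max_stars - filled)


# Usage with made-up scores (not values taken from PREGO).
rows = displayed_associations(
    [("sediment", 4.2), ("forest", 0.5), ("sludge", 3.0), ("soil", 0.7)]
)
for name, score in rows:
    print(f"{name:10s} {stars(score)} {score:.1f}")
```

The strict comparison mirrors the wording "higher than 0.5", so an association scoring exactly 0.5 is not displayed in this sketch.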
Every association tab contains three tables with associations derived from the PREGO channels (see Section 2) along with their supporting evidence. The user can both search and scroll through these tables, which makes knowledge extraction easier in cases where, for example, Isolate data contain hundreds of associations. In the Literature channel, each association is supported by the scientific articles with text-mining identified co-mentions. When a user clicks on an association, a popup window appears. This window displays abstracts or excerpts of full text with the associated entities highlighted (Figure 3a). Additionally, the Environmental Samples and the Annotated Genomes and Isolates channels provide evidence for each association through links to more detailed information. In the former channel, when users click on an association, they are redirected to pertinent sample pages of MGnify (Figure 3b). Similarly, the latter redirects users to JGI and NCBI genomes when the associations originated from JGI-IMG and Struo, respectively (Figure 3c).
The Documents tab includes a list of scientific publications where the queried entity is mentioned. Through the Downloads tab, users are able to get all of the PREGO associations found for their query, per entity type (e.g., all the environments found related to an organism) and per channel (e.g., all the Environments found related to an organism through the Literature channel). This data retrieval functionality is also available via the PREGO API (syntax described in Figure 4). Finally, all PREGO associations are available for bulk download from each channel (see Table A2).

PREGO in Action
To demonstrate PREGO's potential, we present four different ways that PREGO can assist molecular ecologists. The demo focuses on the sulfate-reducing microorganisms (SRMs) as well as the processes and environments that relate to sulfate reduction. Through this demo, we highlight how the different channels may provide complementary insights regarding different taxonomic levels and different association types.

Which Environments Are Related to a Taxon?
Based on Pavloudi et al. (2017) [59], several bacterial and archaeal SRMs were found in lagoonal sediments after amplifying and sequencing the dissimilatory sulfite reductase β-subunit (dsrB). Using PREGO for the case of Desulfobacteraceae, the family to which the majority of the observed OTUs of the study belonged, several environmental types similar to lagoons were retrieved from both the Literature and the Environmental Samples channels (Figure 3a,b). Moreover, most of them had a high z-score, such as "sediment", "sludge", and "activated sludge". Several dissimilar environmental types were also associated with Desulfobacteraceae, e.g., "oil reservoir", indicating them as potential environments where sulfate reduction takes place. However, the presence of taxa within that family in different environments, from "sea water" to "forest" and "Wastewater treatment plant", may suggest that this family has ubiquitous representatives in diverse conditions. At the species level, searching for Desulfatiglans anilini (https://prego.hcmr.gr/example1, accessed on 24 December 2021), the most abundant species in Pavloudi et al. (2017), and for the Desulfatiglans anilini DSM 4660 strain (https://prego.hcmr.gr/example2, accessed on 24 December 2021), PREGO provides associations with the "Anaerobic sediment", "Marine oxygen minimum zone", and "Anaerobic digester sludge" terms. These associations further corroborate the relationship between the species and sulfate reduction. More specifically, the "sulfur spring" ENVO term was retrieved from the Environmental Samples channel as well.

Which Biological Processes and Molecular Functions Are Related to a Taxon?
According to Pavloudi et al. (2017), Desulfatiglans anilini plays an important role in sulfate reduction. The Biological Processes provided by PREGO's Literature channel include the GO terms "Sulfate reduction", "Sulfide oxidation", and "Sulfide ion homeostasis", which support this claim. In addition, the "Denitrification pathway" term was retrieved. This is rather interesting, as it is in line with what Pavloudi et al. (2017) discussed about the SRMs and their ability to use various electron acceptors, e.g., nitrate and nitrite. Furthermore, PREGO's Molecular Function tab provides more insight on this example. Several GO terms related to sulfate reduction (e.g., terms related to "sulfite reductase") were associated with the DSM 4660 strain and the Desulfatiglans anilini species in multiple channels. Interestingly, in the case of the strain query, the Annotated Genomes channel returned many GO terms related to nitrogen fixation, e.g., "nitric oxide dioxygenase activity".

Which Taxa Are Related to a Biological Process?
PREGO can also be used to report organisms that relate to a certain biological process. Searching for "dissimilatory sulfate reduction" associations with taxa (https://prego.hcmr.gr/example3, accessed on 24 December 2021) resulted in several taxa that were mentioned in the Pavloudi et al. (2017) study. For example, taxa such as Thermodesulfobacteria and Thermodesulfovibrio were found among the entries with the highest score (e.g.,) based on the Literature channel. The other two channels did not contain any associations. Using "Sulfate assimilation" (https://prego.hcmr.gr/example4, accessed on 24 December 2021) as the biological process input, PREGO returned several genera that were missing from the results for "dissimilatory sulfate reduction". Hence, a manual search of the GObp terms that describe the actual biological process of interest is more insightful.

Are There Any Associations between Environments and Biological Processes?
Are there other environmental types, besides lagoonal sediments, in which sulfate assimilation occurs? For this question, and for "dissimilatory sulfate reduction" (https://prego.hcmr.gr/example3, accessed on 24 December 2021) in particular, PREGO assigns the highest score to "sediment" while, among others, "anoxic water", "oil reservoir", "mud volcano", and "basalt" are also returned as environments potentially associated with sulfate reduction.
Conversely, PREGO provides insight into the processes occurring in a specific environmental type. For example, searching for the biological processes that take place in "basalt" (https://prego.hcmr.gr/example5, accessed on 24 December 2021), processes like "Nitrogen fixation" and "Reactive nitrogen species metabolic process" stand out, whereas sulfate reduction is not among the associations. However, when asking for "Mafic lava" (https://prego.hcmr.gr/example6, accessed on 24 December 2021), both the "nitrogen fixation" and "Sulfur compound metabolic process" terms are returned. This highlights the suggestions of Pavloudi et al. (2017) regarding the potential use of various electron acceptors by the different strains present in different environmental types.

PREGO Contents
PREGO contains the literature, environmental samples, and genome annotations of the resources shown in Table 1. The extracted contents of these resources have resulted in a knowledge base with ~364 K distinct taxonomic groups (out of a pool of ~620 K Bacteria, Archaea, and microbial eukaryotes, based on NCBI Taxonomy), of which ~258 K are at the species level (Table 2). These taxa are associated with ~1 K Environment Ontology terms, ~15 K GObp terms, and ~7.9 K GOmf terms. Combining the above, PREGO maintains a knowledge base of entities and associations between them that form a multipartite network with entities as nodes and scored associations between them as weighted links. As shown in Figure 5, in its current version (December 2021), the PREGO knowledge base covers 157 bacterial phyla (107 are Candidatus), 23 phyla from archaea (18 are Candidatus), and 22 unicellular eukaryotic phyla described in the NCBI Taxonomy database. The number of bacterial taxa present among the associations of each phylum ranges from the order of tens, as in the case of Candidatus Coatesbacteria, to hundreds of thousands, e.g., Actinobacteriae. The number of environmental types found among the PREGO associations for each phylum ranges from just a few to up to 1000. Similarly, the number of biological processes that have been related to the various phyla may range from less than a dozen, e.g., Yanofskybacteria, to several thousands, e.g., Bacteroidetes. On the contrary, the number of molecular functions found to be related to taxa of each phylum is rather constant in all three domains.

PREGO Contents
In its current version and according to the NCBI Taxonomy that it is based on, PREGO manages to cover a great range of microbial taxa, as most (if not all) phyla are present in the knowledge base (Figure 5). The different number of organism entities per phylum highlights the varying number of members of the various phyla. On the contrary, the similar number of molecular functions in all cases indicates the robustness of the main metabolic processes required for life. With respect to biological processes, their number per phylum varies to some extent, especially in the case of Bacteria and Archaea. This may be because, in many cases, phyla that have been recently described using molecular techniques have not been studied extensively yet, e.g., Candidatus Delongbacteria. As expected, the number of environmental types that have been associated with members of each phylum varies, as some phyla may be universally present, while others could be strongly niche-specific (e.g., Hydrothermarchaeota).
Because of its three different channels, PREGO manages to extract associations both at the species and at higher taxonomic levels. The Isolates channel supports explicit associations at the species level (Table 3 and Figure S3). Interestingly, the number of such genomes seems to have reached a plateau for now, as PREGO-like platforms include the same order of magnitude. The Literature channel, on the other hand, promotes the extraction of associations at higher taxonomic levels (Table 3 and Figure S1). This also applies to environment-organisms associations derived from the Environmental Samples channel (Table 3 and Figure S2). Associations regarding biological processes, though, are strongly enhanced by the Literature channel and the massive increase of literature.
Table 3. The associations between entities of PREGO after co-occurrence analysis. The supported entity types of associations are Environments-Biological Processes, Environments-Molecular Functions, Taxa-Environments, Taxa-Biological Processes, and Taxa-Molecular Functions.
Additionally, the text mining methodology of the Literature channel has retrieved most of the entities present in the PREGO knowledge base (Table 2). A significant contribution to the taxa with associations is due to the processing of the PMC OA subset by the text mining pipeline of the Literature channel. This is in line with reports from other applications of text mining that use full-text articles [60]. However, the resulting associations are suggestive because of the nature of text mining and are therefore subject to further review by the users.

Related Tools' Functionality and Content
There is an emerging niche for tools similar to PREGO that bring forward microbe associations and metadata. Table 4 summarizes the common and different features of BacDive, WoM, the NMDC data portal, and PREGO. All of them associate microbes with environments and with biological/metabolic processes.
BacDive is a well-established platform with a focus on phenotype and cultivation information for about 100,000 prokaryotes (bacteria and archaea). It has a high level of curation for most of its input types, such as literature, internal databases, and personal collections. The NMDC data portal has published the scheme, the user interface, and a demonstrative collection of samples that will be populated later on. Its standout features are the spatial visualization with coordinates and the detailed information about the samples, e.g., sequencing instruments and methodology. An alternative approach is taken by WoM, which aims to bind chemistry to microbes. An environment, in particular, is defined as the starting metabolite pool that is transformed by an organism. Another tool is The Microbe Directory, which contains fully curated metadata for about 8000 microbes from all superkingdoms. This tool focuses on conditions of growth and on host taxa.
Complementary to these tools, PREGO contains associations of bacteria, archaea, and eukaryotes. Its distinctive features are the associations of environments with processes/functions and the large-scale literature integration with text mining. Importantly, most of the tools are complementary to each other with minimal overlap, an indication of the opportunities for further innovative synergies.

PREGO Next Steps
PREGO is a user-friendly association mining and sharing platform. Its modular web architecture grants it the flexibility for further improvements in the aforementioned aspects, namely: source datasets, user interface, and expansion of the entity and association scope. Regarding datasets, additional data, such as transcriptomes from MGnify and other records annotated with metadata from studies in EuroPMC (https://ebi-metagenomics.github.io/blog/2021/11/17/Publication-Annotations/, accessed on 24 December 2021) [61], could be incorporated. Similarly, the standards-compliant annotated records of the NMDC data platform (https://data.microbiomedata.org/, accessed on 24 December 2021) could serve as an additional resource with their high-quality metadata [16,17]. Reciprocally, if requested, pertinent literature and association summaries could be programmatically offered to interested third parties. Furthermore, the entity types supported by the PREGO system could be expanded. For example, GOmf terms could be upgraded to a search-entry point to the system. In addition, disease- and tissue-describing terms, already supported by the EXTRACT system [32] that underlies PREGO, could enter the PREGO ecosystem of associated entities. From a statistics perspective, the calculation of a combined association score, when an association is reported by more than one channel of information, could be another feature to add.
The user interface can be enhanced to support multiple entity and/or sequence queries, instead of single ones. Sequences can be processed by taxonomy assignment pipelines (e.g., PEMA [62]) and the resulting taxa used to search PREGO for associations. In addition, network visualization tools, like Arena3D web [63], could allow interactive browsing of associations through multi-layered graphs. Enrichment analyses, like those performed by OnTheFly 2.0 [64] or Flame [65], can be incorporated. Omics data analysis pipelines, like MiBiOmics [66], environment associations with sequences using SeqEnv [67], and biogeochemical associations with metagenomic data with DiTing [68] could be enabled by comparing the associations pertinent to different groups of entities. The computationally intensive tasks of multiple queries, taxonomy assignments to sequences, and enrichment analysis could be offered by our in-house High Performance Computing facility (https://hpc.hcmr.gr/, accessed on 24 December 2021) [69].

Figure S1: Summary of all the unique entities per phylum for each of the four entity types (in log10 scale) that appear in PREGO coming from the Literature channel.
Figure S2: Summary of all the unique entities per phylum for each of the four entity types (in log10 scale) that appear in PREGO coming from the Environmental samples channel.
Figure S3: Summary of all the unique entities per phylum for each of the four entity types (in log10 scale) that appear in PREGO coming from the Annotated genomes and Isolates channel.

Finally, as STRUO annotations refer to GTDB genomes, publicly available mappings (http://ftp.tue.mpg.de/ebio/projects/struo/GTDB_release89/metadata/, accessed on 24 December 2021) were used to link the genomes used with their corresponding NCBI Taxonomy entries.

MG-RAST metagenomes and JGI/IMG isolates annotations come with KEGG Orthology (KO) terms; Struo-oriented genome annotations, on the other hand, have Uniprot50 ids. The mappings from KO to GOmf and from Uniprot50 to GOmf are implemented via the UniProtKB mapping files available on its FTP server (see the idmapping.dat and idmapping_selected.tab files). By using the 3-column mapping file, the initial annotations were mapped to GOmf. As a complement, a list of metabolism-oriented KO terms has been built (see prego_mappings in the Availability of Supporting Source Codes section).
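The cross-mapping idea can be illustrated with a minimal sketch that joins two identifier types through a shared UniProtKB accession. It assumes rows in a 3-column layout (accession, identifier type, identifier); the type labels `KO` and `GO`, the sample accessions, and the function name `build_cross_mapping` are assumptions made for illustration and are not taken from the PREGO code. Restricting the GO terms to the molecular function namespace would be a further step that is not shown.

```python
from collections import defaultdict


def build_cross_mapping(rows, source_type, target_type):
    """Link identifiers of two types that share a UniProtKB accession."""
    # Collect, per accession, the identifiers of the two types of interest.
    per_accession = defaultdict(lambda: {"source": set(), "target": set()})
    for accession, id_type, identifier in rows:
        if id_type == source_type:
            per_accession[accession]["source"].add(identifier)
        elif id_type == target_type:
            per_accession[accession]["target"].add(identifier)

    # Every source id inherits the target ids found on the same accession.
    mapping = defaultdict(set)
    for ids in per_accession.values():
        if not ids["target"]:
            continue  # accession has no target id, so nothing to link
        for source_id in ids["source"]:
            mapping[source_id].update(ids["target"])
    return dict(mapping)


# Hypothetical rows; a real file would be read line by line and split on tabs.
sample_rows = [
    ("P00001", "KO", "K00001"),
    ("P00001", "GO", "GO:0004022"),
    ("P00002", "KO", "K00001"),
    ("P00002", "GO", "GO:0008270"),
    ("P00003", "KO", "K00002"),
]

ko_to_go = build_cross_mapping(sample_rows, "KO", "GO")
```

Grouping by accession first keeps the sketch to a single pass over the rows, which matters for mapping files that are too large to load and join naively.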

Daemons
An important component of the PREGO approach (Figure A1) is the regular updates, which keep PREGO in line with advances in the literature and in microbiology data. The updates are implemented with custom scripts called daemons, which are executed regularly on cycles ranging from one month to six months. This variation occurs because of the API requirements of each web resource as well as the computational intensity of extracting associations from the retrieved data. Each daemon is attached to a resource because its data retrieval methods (API, FTP) and the subsequent steps, shown in Figure A1, require special handling and multiple scripts (see prego_daemons in the Availability of Supporting Source Codes section).
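The per-resource update cycles can be pictured with a small scheduling sketch. It is only an illustration under stated assumptions: the dictionary `UPDATE_CYCLE_MONTHS`, its interval values, and the functions `months_between` and `daemons_due` are hypothetical and do not describe the actual prego_daemons scripts.

```python
from datetime import date

# Hypothetical update cycles in months; the values are illustrative only.
UPDATE_CYCLE_MONTHS = {
    "literature": 1,
    "mg_rast": 6,
    "jgi_img": 6,
}


def months_between(earlier, later):
    """Whole calendar months elapsed between two dates (days are ignored)."""
    return (later.year - earlier.year) * 12 + (later.month - earlier.month)


def daemons_due(last_runs, today):
    """Return the resources whose update cycle has elapsed, sorted by name."""
    due = []
    for name, interval in UPDATE_CYCLE_MONTHS.items():
        last = last_runs.get(name)
        # A resource that was never updated is always due.
        if last is None or months_between(last, today) >= interval:
            due.append(name)
    return sorted(due)


last_runs = {"literature": date(2021, 11, 24), "mg_rast": date(2021, 9, 1)}
due_now = daemons_due(last_runs, date(2021, 12, 24))
```

Keeping one interval per resource mirrors the point made above that each resource has its own retrieval constraints and computational cost.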

Appendix C.1. Scoring
Scoring in PREGO is used to answer the questions:
• Which associations are more trustworthy?
• Which associations are more relevant to the user's query?
Relevant, informative, and probable associations are presented to the user through the three channels discussed previously. Each channel has its own scoring scheme for the associations it contains, and all scores are fitted to the interval (0, 5] to maintain consistency. Score values are displayed as stars. The Genome Annotation and Isolates channel assigns fixed scores that depend on the resource, because genome annotation is straightforward and the microbe id is known a priori. The Environmental Samples channel, on the other hand, is based on samples, which contain metagenomes and OTU tables. It therefore has two levels of organization: microbes with metadata, and sample identifiers. Each association of two entities is scored based on the number of samples in which they co-occur. The Literature channel scoring scheme is based on the co-mention of a pair of entities in each document, paragraph, and sentence. These differences in the nature of the data call for different scoring schemes across channels. The contingency table (Table A1) of two random variables, X and Y, is the starting point for the calculation of scores. For example, X = 1 might be a specific NCBI id and Y = 1 an ENVO term. Then $c_{1,1}$ is the number of instances in which the terms X = 1 and Y = 1 co-occur, i.e., the joint frequency. The marginals, $c_{1,\cdot}$ and $c_{\cdot,1}$ for X and Y, respectively, are the backgrounds for each entity type. Different handling of these frequencies leads to different measures. There is no perfect scoring scheme, only the one that works best in a particular instance. Consequently, scoring requires testing different measures and their parameters.

Table A1. Contingency table of co-occurrences between entities X = x and Y = y. This is the basic structure for all scoring schemes. $c_{x,y}$ is the count of co-occurrences of these entities. $c_{x,\cdot}$ is the count of x with all the entities of type Y (e.g., molecular function). Conversely, $c_{\cdot,y}$ is the count of y with all the entities of type X (e.g., taxonomy).

          Yes          No           Total
Yes       $c_{x,y}$    $c_{x,0}$    $c_{x,\cdot}$