A Compendium of Chemical Class and Use Type Open Access Databases

With the ever-increasing production and registration of chemical substances, obtaining reliable and up-to-date information on their use types (UT) and chemical classes (CC) is of crucial importance. We evaluated the current status of open access chemical substance databases (DBs) regarding UT and CC information, using the “Meta-analysis of the Global Impact of Chemicals” (MAGIC) graph as a benchmark. A decision tree-based selection process was used to choose the most suitable of 96 databases. To compare the DB content, an extensive quantitative and qualitative analysis was performed for 100 weighted, randomly selected chemical substances. Four DBs yielded more UT and CC results, in both quantity and quality, than the current MAGIC graph: the European Bioinformatics Institute DB, ChemSpider, the English Wikipedia, and the National Center for Biotechnology Information (NCBI). The NCBI, along with its subsidiary DBs PubChem and Medical Subject Headings (MeSH), showed the best performance according to the defined criteria. For the analysis of large datasets, harmonisation of the available information might be beneficial, as the available DBs mostly aggregate information without harmonising it.


Summary
In the 21st century, a continuously increasing number of compounds and substances is used and potentially released into the environment [1]. Understanding how they affect the environment is the main objective of ecotoxicological research [2]. To further our understanding of their impact and to better comprehend environmental chemicals in general, additional information detailing their use type and associated chemical classes is urgently needed. In this context, substance databases (DBs) play an essential role as information sources for research, governmental institutions, regulation, citizen science, and companies alike ([3]; see also Appendix A Database Compendium References). To date, an almost unmanageably broad range of DBs is publicly available (see Appendix A Database Compendium References).
In ecotoxicology, DBs are vital to easily store, manage, and access information, for instance, concerning adverse effects on a trophic and species level [4]. Generally, DBs may contain similar information on substances but serve different purposes. Examples are the Kyoto Encyclopedia of Genes and Genomes (KEGG), the European Chemicals Agency (ECHA), and DrugBank, which convey information on genomes, chemical registration data, and pharmaceutical information, respectively [5–7]. Other DBs focus particularly on chemical structures and the ontology of chemicals [8]. Therefore, information obtained from DBs often needs to be harmonised and verified before use [9]. For chemical substances, the use type (UT; a descriptor of the purpose(s) for which a chemical is used) and the chemical class (CC; a descriptor of the group to which a specific chemical belongs according to its molecular structure) are relevant for understanding their pathways into the environment, as well as their potential effects and behaviour in the environment [10,11]. However, little research is available on the quality and quantity of UT and CC information in open access chemical substance DBs.
A decision tree-based selection process (Figure 1) was applied that evaluated relevant factors such as accessibility and extent, followed by a final UT and CC analysis of the best performing DBs. The decision tree was designed to find DBs that are frequently curated and updated and that are open access for researchers worldwide. The DBs remaining after the application of the selection criteria were evaluated along with the "Meta-analysis of the Global Impact of Chemicals" (MAGIC) graph, a labelled property graph DB [12]. The MAGIC graph harmonises and integrates multiple DBs and focuses on potential environmental impact chemicals (PEIC), accounting for synonyms of substance and compound names. It currently relies on the Pesticide Action Network (PAN) DB for UT information, which provides only limited UT and CC information. Therefore, we evaluated which other DBs would be a suitable addition with regard to the quality and quantity of their UT and CC information, using a set of 100 weighted, randomly selected chemicals (see Methods, Section 3.3). The goal was to identify the UT and CC DBs most suitable for integration into the MAGIC graph DB regarding data quality and quantity, thereby enabling, e.g., the interpretation of exposure patterns in the environment among CC and UT. High-quantity and high-quality UT and CC information will subsequently be integrated into the MAGIC graph.

The "Chemical Class and Use Type Compendium" Dataset
The current version of the compendium of CC and UT DBs can be accessed via https://static.magic.eco/Compendium. Tables 1 and 2 provide an overview of the columns used in the compendium's two worksheets, with Table 1 explaining the first worksheet (decision tree selection process; see Section 2.2) and Table 2 explaining the second worksheet (UT and CC validation process).

Table 1. Description of the columns and their content as used in the Chemical Class (CC) and Use Type (UT) Compendium sheet 1-DB selection process (https://static.magic.eco/Compendium). Each row in the compendium describes a different DB.

Column: Description
Name DB: Full name of the DB, according to the information provided by the website.
Criterion 1: Whether the DB still exists.
Criterion 2: Whether a search can be performed in the DB.
Criterion 3: The URL used is unique or the DB is unique.
Criterion 4: The website has a DB with a visible connection to the topic (chemical substance information, pesticide information, etc.).
Criterion 5: Whether the website and DB are current (updated in the last six months) or receive extensive updates once a year (in which case the last 18 months are reviewed and taken into account).
Criterion 6: Legal status: whether scraping of data is allowed, under what copyright the data is available, and whether the website allows download/usage of data and integration into another website, or use of the majority of the DB information for a research paper.
Criterion 7: The website is free of charge for academic/scientific purposes or in general, meaning no associated licence fees.
Criterion 8: The DB data is unique and original to the website or, if not, the website harmonises multiple data sources.
Criterion 9: There are more than 1500 PEICs available on the website, or the DB has unique information on individual substances that is not available in other DBs.
Criterion 10: Uses chemical class or use type information and provides one or both of these.
Criterion 11: Is not entirely integrated by another DB that reached the same stage in the decision tree, or has additional information that is not part of another DB (e.g., original DB).
Criterion 12: Softer criterion: the website/DB is global and not only of regional relevance. (DBs with below 10k entries that are specific to a particular state or country are considered regional; DBs with above 50k substance entries are generally seen to be of international relevance.)
URL: The URL to the landing page through which the DB can be publicly accessed.
Initial discard criteria: The criteria why the DB was discarded.
Chemical Abstracts Service (CAS) Number: Information on the availability of CAS information within the DB.
Last access: The point in time when the DB/website was last accessed.

Table 2. Columns of the second compendium worksheet (UT and CC validation process).

Column: Description
CAS Number: The CAS number of the substance.
Substance Name: The common name of the substance (IUPAC/product name).
SMILES: Simplified Molecular Input Line Entry Specification (SMILES) structure description of the substance.
Substance known *: Indicates if the substance was found in the DB (1 = yes, 0 = no).
DB UT *: If and which UT information could be found regarding the substance ("Unknown" indicates that no information was available).
DB CC *: If and which CC information could be found regarding the substance (CAS & Substance name) ("Unclassified" indicates that no information was available).

Database Selection Process
Using the criteria described in Table 1, a total of 96 DBs were evaluated. Of those, 21 were discarded in decision tree criteria 1-3 (Table 3), indicating that they were not uniquely identifiable or did not provide any search function. Four DB websites were mentioned in the literature but no longer existed or were no longer accessible when tried in November 2019. Seven DBs were discarded as no search could be performed (i.e., second criterion). Reasons were, e.g., that the website still existed but had discontinued its DB or search function due to funding [13], or that it was undergoing maintenance during the analysis period [14]. Seven DBs were discarded due to the URL (i.e., third criterion). In two cases, the URL was identical, while the others had different URLs that led to the same DB. Twenty-four DBs had a different focus (criterion 4), which means that they did not provide substance property information. BindingDB.org [15] is an example of a DB that may have other uses (interactions between small molecules and proteins); however, it did not fulfil the fourth criterion and was therefore excluded. Ten DBs were discarded due to criterion 5, as they were updated infrequently. While some DBs had no unique data and therefore only mirrored another DB (criterion 8), none were discarded during this step, as they had already been discarded before. An example is the LookChem shop page, which displayed substance information that came directly from Wikipedia [16]. Twelve DBs were either not openly accessible or not free of charge (criteria 6 and 7) and were discarded, as they did not fulfil the open access criterion. Such websites may allow manual searches, like Cayman [17], but forbid crawling the DB webpage (i.e., automatically querying large subsets of data) or extracting data through batch searches without purchasing an annual licence (e.g., Herts [18]; criterion 7).
Further, five DBs were discarded for comprising fewer than 1500 substances according to criterion 9. Seven were omitted because they did not fulfil criterion 10, i.e., they did not include relevant UT or CC information. Seven DB websites were discarded following criterion 11, which checked whether the DB was wholly integrated into another DB or website. Examples are PubChem and Medical Subject Headings (MeSH), which are completely integrated into the National Center for Biotechnology Information (NCBI) [19], and the major restructuring of some U.S.-based DBs, like TOXNET [20], which were integrated into PubChem [21] and PubMed [22], amongst others, on 16 December 2019 [20]. According to the final criterion 12, five DBs were discarded as they did not contain large sets of PEIC information of global relevance. After this decision tree-based evaluation, nine DBs were further analysed (see also Table 2 or Section 2.3) and their suitability for integration into the MAGIC graph was assessed (see Methods, Section 3.3). Detailed information on the further evaluated DBs can be found in the Chemical Class and Use Type Compendium Sheet 2 (https://static.magic.eco/Compendium) and Tables 4 and 5.

Table 3. The twelve selection criteria, their description, the number of discarded DBs during each decision step, and the number of DBs remaining after each decision step. The right column describes the number of DBs (out of the total of 96 DBs evaluated) that fulfilled the respective criterion. The nine DBs left were further analysed along with the "Meta-analysis of the Global Impact of Chemicals" (MAGIC) graph DB. (For more details on decision criteria, see Table 1.)

UT and CC Database Quantitative Extent and Quality
For the DBs left after the decision tree analysis (i.e., NCBI, Comptox, Wikipedia, European Bioinformatics Institute (EBI), ChemSpider, KEGG, MAGIC graph, National Pesticide Information Center Product Research Online (NPRO), ECHA, and DrugBank; Table 4), the content was qualitatively and quantitatively evaluated using a total of 100 weighted, randomly selected chemical substances (for detailed methods, see Section 3.3). Between 15 and 100 of the 100 chemical substances were listed in each of the remaining DBs. DrugBank, with 15 substance entries, had the smallest number of substances matched, whereas the NCBI website, with 100 out of 100, had the highest (Table 4). Comptox and ChemSpider followed closely (98 and 99 substances, respectively). Wikipedia (English version), as a general DB with a broad range of topics and entries [23], scored better than many DBs with a specific topical focus (e.g., NPRO, KEGG, DrugBank) [5,7,24], but less well than DBs designed with a general focus on chemical substances and their properties (NCBI, ECHA, Comptox, EBI, ChemSpider) [6,19,25–27]. This was expected, as the criteria were designed to find a DB that has detailed, up-to-date information on a broad range of relevant substances. Concerning quantitative UT and CC information, the NCBI DB collection has the most substance entries that include this information (Table 4). It is the primary domain for various substance-related DBs, like PubChem and MeSH, with millions of compound and substance entries [21]. The current MAGIC graph DB search resulted in 56 (out of the total of 100) substance entries with UT information and 48 substance entries with CC information. Hence, NCBI performed better than the MAGIC graph concerning UT information (86 substances with UT information), as did Comptox (85%) and Wikipedia (63%).
Concerning CC information, NCBI contained 82 substances with CC information, followed by ChemSpider (68), Wikipedia (58), and EBI (50); all four performed better than the current MAGIC graph (48 substances).
The qualitative evaluation assessed how detailed the information contained in the DB is; for example, "herbicide" or "insecticide" was considered higher-quality UT information than "pesticide" (for detailed methods, see Section 3.3). The results can be seen in Table 5. The NCBI DB group performed best overall.
Among all assessed DBs, the MAGIC graph had the lowest ratio of high-quality CC entries to CC entries overall (56%). NPRO did not provide any CC information and, therefore, could not be evaluated in this respect. In general, the NCBI ranked highest, with 78 substances found with high-quality CC information (95%) and 79 substances found with high-quality UT information (92%), as it included 22 more substances than the next highest ranked DB. No DB had a higher share of high-quality entries than the NCBI regarding CC. However, five DBs exceeded the NCBI's 92% rate of high-quality UT information, with NPRO ranking first (97%; 37 of 38 substances with UT information had high-quality information). This could be explained by the fact that the NCBI, in general, has more entries with UT information (79 of its 86 substance entries with UT information also had high-quality UT information; see also the number of quality UT entries as a percentage of the total substance list in Table 5). Consequently, relative to the original set of 100 substances, the NCBI provides the highest rate of quality UT and CC information.
Therefore, the NCBI, with its subdomains PubChem and MeSH, is the best performing of all 96 tested DBs regarding available UT and CC substance information and appears to be most suitable. As the most promising candidate, it was selected for integration into the MAGIC graph. However, future integration of other DBs, like NPRO for specific pesticide UT information, or KEGG, which offered comparably fewer but high-quality results, might provide valuable additional sources. The unique strengths and weaknesses of all 96 DBs can be viewed in the Compendium.

Table 4. Quantitative analysis of the nine selected DBs and the current MAGIC graph. The number of substances found in the DB, as well as the percentage of substances that were found, are provided. Furthermore, the number of substance entries with UT information, as well as the number of substance entries with CC information, is described. Lastly, the percentage of substances with CC/UT information is compared to the substance entries that were found in total, and the result is given in [%].

Table 5. Qualitative analysis of the nine selected DBs and the current MAGIC graph. The number of substances with detailed (quality) UT information in the DB and the percentage of substances with quality UT information compared to all substances with UT information that were found are provided. Furthermore, the number of substance entries with quality CC information, as well as the percentage of substances with quality CC information compared to all substances with CC information, is described.

Database Compilation
The search for adequate DBs was performed from November until December 2019. First, an extensive search for DBs focusing on substance property data was performed. DBs were compiled through a literature search using search engines such as "Google.com", "Google Scholar", and "Web of Science". The search was performed using the keywords "Substance", "Use type", "Chemical compound", "Ecotoxicology", "Toxicology", "Chemical Class", and "Database". Keywords were also combined into search strings, and different variations of some keywords were used. However, the search was limited to substance DBs that were available in German or English. More than 50 relevant scientific papers and approximately 500 websites, domains, and sub-domains were examined in the process. Generally, any DB found was added to the initial dataset (see Supplementary Material Use type and Chemical Class Compendium sheet 1).

Decision Tree-Based Database Evaluation
DBs were evaluated using the decision tree given in Figure 1 between December 2019 and August 2020. The criteria were assessed in an order that accounts for their importance combined with their verifiability, proceeding from general to more in-depth criteria. The decision tree was designed with the aim of finding chemical substance DBs that are frequently updated, provide a large, unique dataset, and are open access for researchers worldwide. However, DBs not fulfilling a particular criterion could still be of value for other research questions and purposes. In the following paragraphs, the criteria and the reasons for their implementation are described in detail, in decision tree order. The website hosting the DB first had to fulfil the following three criteria. Firstly, it needed to have some measures to ensure data integrity and, if, e.g., extracted from a dated paper, still needed to exist. Secondly, it was required that a substance search could be performed in the DB. The substance used for this criterion was dichlorodiphenyltrichloroethane (DDT; CAS: 50-29-3) due to its extensive documentation and high ecotoxicological relevance. However, this criterion was met regardless of whether the DB included the substance, as long as the search could be accessed and yielded a result. As a third criterion, a unique URL was mandatory, as multiple DB searches can refer to the same larger DB. Likewise, a website might just refer to other websites without hosting a DB itself. All DBs with duplicate URLs were excluded. Websites that had the same primary domain, and were thus connected DBs, were manually compared, and only one of them was kept in the dataset.
The fourth criterion focused on the question of whether the DB had a direct relation to the topic, i.e., whether it concerns chemical substances and their properties. In the next criterion, the date of the last update that the DB received was considered. Frequent updates were required (i.e., updated within the last 18 months). In the sixth criterion, the respective DB's copyright and open access policy were verified. For criterion 7, the pricing model of the website was checked, as only free-of-charge DBs were desired. This criterion ensured that the method presented in this paper is reproducible on a broad scale. Criterion 8 focused on the uniqueness of the information in a DB, i.e., that the DB does not merely reproduce information from another DB, or that it harmonises more than one source. As an example, a substance web store has a DB including information obtained from Wikipedia for all relevant substances, and both DBs are listed in the DB set. In this case, the original DB would be chosen. Another example would be a DB that unites multiple other DBs or their substance information without information loss.
The next criterion (9) assessed whether the DB listed more than 1500 PEICs. If not, the question was asked whether it conveyed information on unique, relevant substances or compounds. For criterion 10, we assessed whether the DB contained CC or UT information, or both. As with criterion 4, DBs that were discarded during this step might be useful for other purposes. Criterion 11 evaluated whether DBs were wholly integrated into other DBs, to account for complete overlap between DBs. In this step, even original data sources were discarded in favour of meta-databases and DB collections, like NCBI or ChemSpider, as they accumulate information. The last criterion (12) assessed the spatial applicability of the DB, distinguishing DBs with global, universal relevance from regional DBs that include substances of only local importance and were therefore deemed less relevant for the present study.
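The sequential logic of the decision tree can be illustrated with a minimal Python sketch. It assumes that the yes/no answer for each criterion has already been recorded manually per DB; the function names, the data layout, and the DB names are hypothetical and are not part of the compendium. A DB is discarded at the first criterion it fails, which is why a DB that fails several criteria is counted only once.

```python
# Criterion numbers follow Table 1. The per-DB yes/no answers are assumed to
# have been recorded manually; the DB names below are hypothetical.
CRITERIA = range(1, 13)


def first_failed_criterion(answers):
    """Return the first criterion (1-12) that a DB fails, or None if it passes all.

    `answers` maps criterion number -> bool. A missing answer counts as a
    failure, so a DB is never kept without every criterion being checked.
    """
    for criterion in CRITERIA:
        if not answers.get(criterion, False):
            return criterion
    return None


def run_decision_tree(databases):
    """Split DBs into those kept and those discarded, grouped by failed criterion."""
    kept = []
    discarded = {}
    for name, answers in databases.items():
        failed = first_failed_criterion(answers)
        if failed is None:
            kept.append(name)
        else:
            discarded.setdefault(failed, []).append(name)
    return kept, discarded


passes_all = {c: True for c in CRITERIA}
example_dbs = {
    "example_db_a": passes_all,
    "example_db_b": {**passes_all, 4: False},            # off-topic content
    "example_db_c": {**passes_all, 7: False, 9: False},  # licence fee; also small
}
kept, discarded = run_decision_tree(example_dbs)
print(kept)       # ['example_db_a']
print(discarded)  # {4: ['example_db_b'], 7: ['example_db_c']}
```

Recording the first failed criterion per DB corresponds to the "Initial discard criteria" column of the compendium, although the actual evaluation was carried out manually.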

Chemical Class and Use Type Verification
The DBs that passed the decision tree analysis (n = 9) and the MAGIC graph were subsequently tested thoroughly for their quantitative and qualitative properties using 100 randomly selected substances from the MAGIC graph DB. The majority of chemicals could be found in 1-3 DBs. Substances of high importance for ecotoxicological research (e.g., pesticides) were often found in >7 DBs. Substances were generally considered to be of higher importance if they had a larger number of linked DBs in the MAGIC graph, because the MAGIC graph primarily aggregates ecotoxicological DBs. A list of all substances in the MAGIC graph was compiled (n = 19,069), including the number of datasets in which each substance was found. From this list, a weighted random sample was drawn, in which the weight of each substance equals the number of datasets that contain it. Figure 2a shows the original dataset with the total of 19,069 substances found in the MAGIC graph, and Figure 2b shows the weighted random sample. In this way, the 100 substances were weighted to represent ecotoxicological importance, with the number of datasets linked to a substance within the MAGIC graph serving as a proxy.
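A weighted draw of this kind could be sketched as follows. The sampling procedure actually used is not specified here, so this is only one possible approach: it assumes sampling without replacement and uses the Efraimidis-Spirakis key method, in which each substance receives the key u^(1/w) (u uniform in [0, 1), w the number of linked datasets) and the k substances with the largest keys are selected. The function name and the placeholder input data are hypothetical.

```python
import random


def weighted_sample(weights, k, seed=None):
    """Draw k distinct substances with probability increasing in their weight.

    `weights` maps a substance identifier to the number of datasets that
    contain it. Assumption: sampling is done without replacement.
    """
    rng = random.Random(seed)
    # Efraimidis-Spirakis keys: larger weights push the key towards 1.
    keyed = [
        (rng.random() ** (1.0 / weight), substance)
        for substance, weight in weights.items()
        if weight > 0
    ]
    keyed.sort(reverse=True)
    return [substance for _, substance in keyed[:k]]


# Hypothetical input: placeholder identifiers mapped to dataset counts (1-8).
dataset_counts = {f"substance_{i}": 1 + (i % 8) for i in range(1000)}
sample = weighted_sample(dataset_counts, k=100, seed=42)
print(len(sample), len(set(sample)))  # 100 100
```

Fixing the seed makes the draw reproducible, which helps when the same substance list has to be searched manually across several DBs.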
Subsequently, manual searches were performed in each DB to determine whether it contained entries for the designated substances. The manual search was performed in late December 2019 and January 2020. Afterwards, each DB was ranked by assigning its entries to one of three categories: (0) no information, (1) basic information, and (2) detailed information.
The sum of substances with any information regarding UT and CC (substances with basic information (1) plus substances with detailed information (2)) was calculated for each of the nine DBs and the MAGIC graph. This sum is the foundation for the quantitative analysis, in which it was compared to the number of substances found in the DB and to the total number of substances given (n = 100) (the results can be found in Table 4). The sum of substances with detailed information (2) was calculated for the qualitative analysis (the results can be found in Table 5). Here, the sums of substances with detailed UT and CC information were compared to the sums of substances found with any (basic or detailed) UT and CC information, respectively, as well as to the total number of substances given (n = 100).
Subsequently, the results were compared to the other DBs and ranked according to quality and quantity of information available. From this, the DB with the best performance was chosen.
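The tallies behind the quantitative and qualitative analysis can be reproduced with a short Python sketch. It assumes that each of the 100 substances has been assigned one category per DB and per information type, coded here as 0 (no information), 1 (basic), and 2 (detailed); the function name, the dictionary keys, and the numeric coding are illustrative assumptions. The example input is built from the NCBI UT counts reported above (86 substances with UT information, 79 of them detailed).

```python
from collections import Counter

NO_INFO, BASIC, DETAILED = 0, 1, 2  # assumed coding of the three categories


def summarise(categories, n_total=100):
    """Summarise the per-substance categories of one DB for UT or CC."""
    counts = Counter(categories)
    with_info = counts[BASIC] + counts[DETAILED]  # quantitative analysis
    detailed = counts[DETAILED]                   # qualitative analysis
    return {
        "with_info": with_info,
        "detailed": detailed,
        # Share of detailed entries among all entries with any information.
        "detailed_pct_of_info": round(100 * detailed / with_info) if with_info else None,
        # Share of detailed entries relative to the full substance list.
        "detailed_pct_of_total": round(100 * detailed / n_total),
    }


# NCBI UT counts from the text: 79 detailed, 7 basic, 14 without UT information.
ncbi_ut = [DETAILED] * 79 + [BASIC] * 7 + [NO_INFO] * 14
print(summarise(ncbi_ut))
# {'with_info': 86, 'detailed': 79, 'detailed_pct_of_info': 92, 'detailed_pct_of_total': 79}
```

The two percentages answer different questions: the first rewards DBs whose few entries are detailed (as for NPRO), while the second also reflects how many of the 100 substances a DB covers.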

User Notes
The website https://magic.eco/ provides access to the most recent version of the MAGIC graph. It offers the possibility to visit individual chemical identifiers and to discover the synonyms and generalisations and the data that is currently connected to the chemical. The website also provides a user with an option to download an up-to-date version of the Microsoft® Excel worksheet published with this data descriptor (https://static.magic.eco/Compendium).

Funding: This study was funded by the German Society for the Advancement of Sciences (DFG SCHU 2271/6-2).

Acknowledgments: We thank Felix Hoegerl for technical support.