Graphing Ecotoxicology: The MAGIC Graph for Linking Environmental Data on Chemicals

Abstract: Assessing the impact of chemicals on the environment and addressing subsequent issues are two central challenges to their safe use. Environmental data are continuously expanding, requiring flexible, scalable, and extendable data management solutions that can harmonize multiple data sources with potentially differing nomenclatures or levels of specificity. Here, we present the methodological steps taken to construct a rule-based labeled property graph database, the “Meta-analysis of the Global Impact of Chemicals” (MAGIC) graph, for potential environmental impact chemicals (PEICs), and its subsequent application in harmonizing multiple large-scale databases. The resulting data encompass 16,739 unique PEICs attributed to their corresponding chemical class, stereo-chemical information, valid synonyms, use types, unique identifiers (e.g., Chemical Abstract Service registry number, CAS RN), and others. These data provide researchers with additional chemical information for a large number of PEICs and can also be publicly accessed using a web interface. Our analysis has shown that data harmonization can increase up to 98% when using the MAGIC graph approach compared to relational data systems for datasets with different nomenclatures. The graph database system and its data appear more suitable for large-scale analysis where traditional (i.e., relational) data systems are reaching conceptual limitations.
Dataset: The dataset can be found in Supplementary Materials, http://www.mdpi.com/2306-5729/4/1/34/s1.


Summary
The primary concern of ecotoxicology is the impact of chemicals on the environment [1]. To assess this impact at a large scale, i.e., in a continental or global context, data on environmental concentrations, effects, use types or application rates have to be incorporated into a consistent structure. Today, science can rely on numerous databases providing these data (Table 1) for potential environmental impact chemicals (PEICs, e.g., pesticides, industrial chemicals, flame retardants, and solvents). However, linking them takes significant harmonization effort, even after a common semantic framework has been established, i.e., even after their integration into a coherent base. Among the most fundamental reasons hindering instant data linkage and affecting dimensions of the ecotoxicological data are differing nomenclatures and differing levels of specificity (see Table 2).
Differing data specificity is also an issue when linking chemical data. Measured environmental concentrations or effect endpoints may be provided in different databases specifically for any kind of isomers, including data specific to enantiomers or diastereomers, or at the level of unique compound structures, ignoring stereo-compositions and, thus, including isomeric and racemic mixtures. Differing specificity, if not addressed, substantially hinders the integration of data from different sources for some of the ecotoxicologically most important groups of compounds, e.g., insecticides [12], that may act substantially differently based on their stereo-chemical composition [13].
For analyses that cover only relatively small sets of PEICs, problems of nomenclature and specificity can be handled manually by expert judgment. Knowledge of PEICs thereby allows constructing data analysis workflows that cover all deviations in chemical names and that reasonably span different levels of specificity.
Larger analyses that cover several dozen or more PEICs are often based on relational data representations [12]. When only two different naming schemes are involved, e.g., if only two data sources are linked, differing identifiers of the same chemical can still be resolved by establishing a synonym table. However, relational database joins are costly, and linking more than two different data sources by joining their chemical identifiers (e.g., chemical name) with synonym tables increases the processing complexity significantly, quickly reaching points where complex data analyses become cumbersome [14]. Moreover, resolving different levels of specificity within and among relational datasets requires sophisticated techniques that entail even more effort to develop and that can hardly be established without a significant loss of performance [15]. At least when combining more than two data sources, which is a requirement of many ecotoxicological meta-analyses, relational data representations are suboptimal due to their constraints in performance and usability [15,16,17].
Labeled property graph databases represent an effective tool to address the aforementioned issues of extendibility, scalability, and flexibility [14,18,19]. Briefly, a labeled property graph consists of nodes (vertices) that are connected through relationships (directed edges) [19]. Both nodes and relationships can be labeled to distinguish functional roles and can be enriched with properties (see Appendix A for further details). In contrast to relational database systems, the number and type of relationships between entities is, thus, not strictly defined, which allows linking information very flexibly [18,20]. This flexibility and the graph's emphasis on relationships appear well suited for the establishment of a synonym database that can also resolve hierarchical relationships [14,15]. Consequently, in recent years, graph databases have evolved as a technical alternative to the established relational database systems, featuring large-scale business (e.g., logistics, social media, and health management) and scientific applications (e.g., web science and sociology) where relational solutions become infeasible [21,22]. However, even after an extensive literature review, we could not find any published approach that uses a graph for managing and analyzing data in ecotoxicology.
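To make the structure of a labeled property graph more concrete, the following minimal Python sketch models nodes and directed relationships that both carry properties. The class names, the direction of the relationships and the example content are assumptions made for illustration only; the label and relationship type strings are those listed in rules G2 and G4 of Appendix B, and the sketch does not reproduce the actual MAGIC implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A vertex carrying one label and free-form properties."""
    label: str
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    """A directed, typed edge that may also carry properties."""
    start: Node
    end: Node
    type: str
    properties: dict = field(default_factory=dict)


# Hypothetical example content, for illustration only.
chemical = Node("Chemical", {"name": "permethrin"})
identifier = Node("ChemicalIdentifier", {"name": "Permethrin"})
link = Relationship(identifier, chemical, "identifies", {"type": "name"})

# A further kind of relationship can be added without changing any schema.
dataset = Node("Dataset", {"name": "ExampleDataset"})
extra = Relationship(dataset, identifier, "uses_identifier")

relationships = [link, extra]
# Types of all relationships that start at the identifier node.
outgoing = [r.type for r in relationships if r.start is identifier]
print(outgoing)
```

In this representation, adding a new kind of link only requires appending another `Relationship`; no table definition has to be altered.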
The aim of this study, conducted by the research group "Meta-Analysis of the Global Impact of Chemicals" (MAGIC), was to assess the usability of graph databases for large-scale ecotoxicological meta-analyses that integrate and link a wide range of relevant data (Table 1). In addition, multiple ecotoxicologically relevant databases were used to perform a data harmonization, using the U.S. EPA Chemical Dashboard (CDDB) [11] as a synonym provider, demonstrating the method's applicability in a large-scale ecotoxicological scope and quantifying the method's advantages compared to relational joins. The data were subsequently reprojected into tabular form, granting easy accessibility to researchers and professionals. The Microsoft® Excel worksheet published with this data description summarizes the information that is currently contained in the MAGIC graph in a tabular format, while an up-to-date version of the MAGIC graph can be explored using our website (https://magic.eco; see User Notes). Harmonized data for 16,739 PEICs in the MAGIC graph contain information about unique identifiers (CAS RN and DTXSID), valid synonyms, respective chemical classes, use type classification and their inclusion in various databases.
We are positive that the MAGIC graph can serve as reliable proof that, with graph databases, one already has a suitable data integration tool at hand. As it is already being actively integrated in large-scale risk analysis at the national level [23], we expect the MAGIC graph to find further applications and to become a central tool in trans-national or global risk analyses in the future (DFG SCHU 2271/6-2). The MAGIC graph is publicly available and will provide a continuously expanding feature set, allowing researchers to take advantage of graph database solutions.

Database
The MAGIC graph contains 16,731 PEICs (see Microsoft® Excel worksheet for complete list). For these chemicals, chemical identifiers (n = 66,636) used by relevant datasets (Table 1) are stored and linked to the chemicals they identify. Each chemical has a preferred name for consistent creation of output. Use types and chemical classes, as provided by external datasets, are included in the MAGIC graph and linked with the chemical identifier used by the external dataset. The resulting schema (Figure 1) allows collecting chemical information over multiple databases regardless of the individually used identifiers by navigating the graph (Figure 2). Up-to-date contents of the graph can be retrieved using the website https://magic.eco (also see User Notes).

Summary Microsoft® Excel Worksheet
Published with this data descriptor is a Microsoft® Excel worksheet that summarizes the content of the MAGIC graph. The columns of this worksheet are described in Table 3.

Chemical
The preferred name of the chemical as derived from the CDDB. In most instances, the name given here equals the preferred name of the CDDB.

CAS RN
The currently valid Chemical Abstract Service registry number as given by the CDDB. Alternative CAS RNs, such as deleted numbers, are given under synonyms if they are used by at least one of the databases included in the MAGIC graph.

DTXSID
The substance identifier of the distributed structure-searchable toxicity database as provided by the CDDB.

Synonyms
Additional identifiers of the chemical. Synonyms are only listed if they are used by at least one of the databases included in the MAGIC graph.

Chemical Class
The chemical class according to the PAN and WQP datasets. Only chemicals occurring in one of these datasets are classified, and classifications are given here as is. Chemical classifications will be extended by considering further databases and harmonized among databases in the future.

Stereochemical
An "x" indicates stereo-chemical information is associated with the chemical.

Insecticide
An "x" indicates that the chemical is used as an insecticide according to the PAN database.

Herbicide
An "x" indicates that the chemical is used as an herbicide according to the PAN database.

Fungicide
An "x" indicates that the chemical is used as a fungicide according to the PAN database.

Microbiocide
An "x" indicates that the chemical is used as a microbiocide according to the PAN database.

Other Uses
A list of other uses of this chemical (excluding insecticide, herbicide, fungicide and microbiocide) according to the PAN and WQP databases. As with chemical classes, use type classification will be improved continuously over the next versions of the MAGIC graph.

WQP
Entries marked "x" indicate that the WQP database contains records of this chemical, using any of its identifiers.

ECOTOX
Entries marked "x" indicate that the ECOTOX database contains records of this chemical, using any of its identifiers.

FOODTOX
Entries marked "x" indicate that the FOODTOX database contains records of this chemical, using any of its identifiers.

USE
Entries marked "x" indicate that the USE database contains records of this chemical, using any of its identifiers.

PAN
Entries marked "x" indicate that the PAN database contains records of this chemical, using any of its identifiers.

Database Linkage and Pesticide Use Types
The databases listed in Table 1 were integrated into the MAGIC graph and subsequently analyzed individually regarding the chemical identifiers they contain (Table 4). These databases were selected because they are the most comprehensive resources for large-scale ecotoxicological core data from governmental and non-governmental sources. The MAGIC graph made it possible to evaluate how many of the identifiers used by each dataset actually identified chemicals, and how the identified chemicals were distributed among chemicals with stereo-chemical information and those without. Further, the number of synonymous identifiers within each dataset was identified.
Notes to Table 4: (1) Identifiers that were linked to specific structurally unique compounds using the CDDB. (2) Percentages refer to the entirety of chemical identifiers in the dataset. (3) Identifiers used by the respective dataset that could not be linked to a specific chemical using the CDDB. (4) Synonyms refer to the number of additional chemical identifiers attributed to chemicals. (5) Percentages refer to the entirety of chemicals in the dataset.
The considered databases vary in the absolute number of chemicals they cover and the proportion of identifiers for chemicals (Table 4). For instance, 35% of WQP identifiers are not categorized as a "chemical", because they refer to mixtures, physical attributes (e.g., temperature and flow velocity), biological parameters (e.g., algal density and toxicity endpoints) or other, non-chemical information. Lower proportions of chemical identifiers may primarily indicate that the respective database is not only focused on PEICs but also on other entities, such as formulations, mixtures, etc. However, lower proportions may also be a result of low-quality data reporting, such as non-adherence to standardized nomenclature.
The characterization of databases further reveals that PEICs with isomeric information constitute 11-18% of chemicals in all analyzed databases (Table 4). Integration of hierarchical structuring is therefore a valuable feature of graphs that not only allows for a more detailed differentiation among chemicals but also enables transparent analyses over multiple levels of specificity. Synonym analysis shows that, for instance, in the ECOTOX database, synonymous relationships are rare (n = 28; <0.2%), which underlines the suitability of CAS RNs as identifiers. Nonetheless, while CAS RNs uniquely identify chemicals, there may be multiple CAS RNs (e.g., CAS RN vs. deprecated CAS RN) referring to the same chemical (e.g., cyfluthrin). This may produce spurious analysis results if unaddressed. With the MAGIC graph, however, analyses are based on chemicals instead of identifiers, and all data related to a chemical are considered equally, regardless of the chemical identifier used.
We also assessed to what extent the MAGIC graph allows linking more chemicals over the different datasets compared to a relational approach where only same-spelling identifiers were considered linkable (Figure 3). We found that linkage increased only marginally (1-2%) when using the graph if both merged databases used CAS RN (see Table 4). This increase, although small, underlines that, even with CAS RN, nomenclature can be an issue for data linkage. Relational joins using same-spelling names were only successful for 0-63% compared to the graph approach (Figure 3), signifying that joins relying on names are substantially affected by differing nomenclatures. In contrast, the graph approach successfully linked 21-99% of entries. With relational joins, it was impossible to link CAS RN from one dataset and chemical names from another, whereas, with the graph approach, we successfully linked 98% of the data from ECOTOX and FOODTOX, a linkage that depends on using CAS RN and names simultaneously (Table 4). While this case may also be partially resolved using relational joins, prior manual harmonization of chemical identifiers would be required, which is time-intensive, yet unnecessary when using the graph approach. Figure 3 provides further information, e.g., on fractions for individual database pairs or total linkage of chemical data. In large-scale ecotoxicological assessments, transcending national or continental boundaries, harmonization and subsequent linking of data may become infeasible, while the MAGIC graph approach can provide better performance and coverage compared to traditional relational joins.
Figure 3. Complete linkage presumes that all chemicals listed in the smaller database are contained in the larger one. Linkage with a relational approach, where only CAS RN (light blue) or same-spelling chemical identifiers match (dark blue), is compared to the additional gain with the graph approach (orange). See Table 4 for the types of identifiers that were available for each database.
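The difference between a same-spelling join and linkage through a synonym graph can be sketched in a few lines of Python. The identifier lists and the `identifies` mapping below are small illustrative stand-ins, not data exported from the MAGIC graph, and the `linkage` helper is a hypothetical function that relates linked chemicals to the size of the smaller dataset, following the notion of complete linkage used in Figure 3.

```python
# Stand-in for the graph: identifier -> chemical it identifies (illustrative).
identifies = {
    "52645-53-1": "permethrin",
    "Permethrin": "permethrin",
    "1912-24-9": "atrazine",
    "Atrazine": "atrazine",
}

dataset_a = ["52645-53-1", "Atrazine"]   # mixes a CAS RN and a name
dataset_b = ["Permethrin", "1912-24-9"]  # mixes a name and a CAS RN

# Relational-style join: only identically spelled identifiers match.
same_spelling = set(dataset_a) & set(dataset_b)

# Graph-style linkage: resolve identifiers to chemicals first, then match.
chemicals_a = {identifies[i] for i in dataset_a if i in identifies}
chemicals_b = {identifies[i] for i in dataset_b if i in identifies}
linked = chemicals_a & chemicals_b


def linkage(n_linked, a, b):
    """Share of the smaller dataset that could be linked."""
    return n_linked / min(len(a), len(b))


print(linkage(len(same_spelling), dataset_a, dataset_b))
print(linkage(len(linked), dataset_a, dataset_b))
```

In this toy case the same-spelling join links nothing, whereas resolving identifiers to chemicals first links both entries.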
The use types and chemical classes of the PAN database give an example of how data included in the MAGIC graph can be used for characterizing datasets: the ECOTOX, FOODTOX and PAN databases cover a broad range of chemicals, including similar proportions of insecticides, herbicides, fungicides and microbiocides (Figure 4). In contrast, the USE dataset shows a higher proportion of insecticides, herbicides and fungicides, and a lower proportion of microbiocides, reflecting its focus on agricultural pesticide applications. Similarly, the WQP contains relatively fewer data on insecticides, herbicides and fungicides, since the number of chemicals that are breakdown products (classified as other use type) in this environmental concentration dataset is rather high. The integration of the PAN database use types into the MAGIC graph thus enables an unprecedentedly comprehensive overview of the kind of PEICs that are contained in individual datasets (Figure 4). In addition, the successful data harmonization further demonstrates that ancillary chemical data can be readily incorporated into the MAGIC graph. For instance, supplementary regulatory information (e.g., regulatory status, environmental quality criteria) may now be added with little effort.

Methods
In contrast to relational database management systems, graph databases do not depend on predefined schemata. Briefly, nodes, relationships, labels, types and properties can be added, modified and removed ad hoc and as needed. While this tremendously facilitates the management of changing and growing heterogeneous datasets, it also complicates the usage of these data. Without a static and technically binding schema, the current semantics have to be discovered dynamically: one has to find out what kinds of nodes there are, what properties they have, how nodes are related to other nodes, etc. These concerns were addressed by specifying features of the data model informally outside the database, including the semantics of node labels, relationships between nodes and restrictions on properties. To maintain consistency between this specification and the content of the database, as well as to safeguard data integrity, 32 rules, each checking a specific aspect of the data model, were implemented (Appendix B, Table A1). Rules were iteratively formulated by expert judgment whenever new conceptual or technical requirements arose, while generally aiming to keep the set of rules small. Violations of the rules result in notifications that have to be resolved manually or semi-automatically (Figure 5). This rule-based approach provides a balanced tradeoff between the benefits of an agreed schema and the flexibility of a graph database. Turning the a priori schema known from relational databases into a posteriori applied consistency rules also resulted in workflows that resemble those of test-driven development [24]. For instance, extension of the domain of the graph database application, e.g., by additionally linking taxonomic data to effect data, is achieved in the two following steps. First, one specifies and implements a set of additional rules, e.g., "species and genus are allowed labels", "an effect must be linked to a species", "a species belongs to a genus", etc., and afterwards modifies the database by adding nodes and relationships until all rules are fulfilled. Fulfillment of all rules then marks a new version of the database application that provides additional information.
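The idea of consistency rules that are applied a posteriori can be illustrated with a minimal Python sketch. The dictionary-based node representation, the function names and the example nodes are assumptions for illustration; only the wording of rules G1, G2 and G5 and the allowed labels come from Appendix B, and the actual rule implementation of the MAGIC graph may differ.

```python
ALLOWED_LABELS = {"ChemicalIdentifier", "Chemical", "Dataset"}  # rule G2


def rule_g1(node):
    """G1: all nodes must have a single label."""
    return len(node["labels"]) == 1


def rule_g2(node):
    """G2: only a set of predefined labels is allowed for nodes."""
    return set(node["labels"]) <= ALLOWED_LABELS


def rule_g5(node):
    """G5: every item must have a name."""
    return bool(node.get("name"))


RULES = {"G1": rule_g1, "G2": rule_g2, "G5": rule_g5}


def check(nodes):
    """Return (rule id, node name) pairs for every violation found."""
    violations = []
    for node in nodes:
        for rule_id, rule in RULES.items():
            if not rule(node):
                violations.append((rule_id, node.get("name")))
    return violations


# Hypothetical graph content with deliberate violations.
nodes = [
    {"labels": ["Chemical"], "name": "permethrin"},
    {"labels": ["Chemical", "Species"], "name": "example"},
    {"labels": ["Dataset"]},
]
print(check(nodes))
```

Extending the domain in a test-driven manner would then amount to adding or relaxing rules first (for example, adding a species label to `ALLOWED_LABELS`) and modifying the data until `check` returns an empty list.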

Initially, six publicly available chemical databases were identified and compared regarding the quality of synonyms they provide for an array of organic pesticides (n = 655). After extensive quantity and quality assessments of the generated synonym links, the U.S. EPA Chemical Dashboard [11], containing approximately 765,000 chemical entries, was chosen as a synonym provider (see Appendix C).
In the first implementation, synonymous chemical identifiers (e.g., substance names) were interlinked directly (Figure 6a). However, following this concept, the number of steps necessary to collect all synonyms of a given identifier varied between queries, which resulted in complex queries. This concept also complicated the estimation of the quality of synonym relationships, as two distant identifiers could be linked over relationships of different certainty. A later, refined representation distinguished between the chemical itself and its identifiers (Figure 6b), leading to a representation where the step sequence for collecting all synonyms of a chemical is well defined and only requires two steps. This adjustment improved the computational efficiency while allowing further chemical identifiers to be added without increasing the maximum number of steps required.
Consistency between chemical query results was achieved by attributing each chemical a preferred identifier (used by the CDDB), so that chemicals can be identified in a default way. A descriptive property was attributed to relationships between identifiers and chemicals to reflect the identification type, e.g., CAS RN. Data output for chemicals can, thus, be restricted to specific types of relationships for identifying chemicals.
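The two-step collection of synonyms and the restriction of output by identification type can be sketched as follows. The edge list, the spelling variant "Permethrine" and the function `synonyms` are hypothetical and serve only to illustrate the refined representation, in which identifiers are linked to a chemical rather than to each other.

```python
# Edges: (identifier, chemical, identification type). Illustrative content.
IDENTIFIES = [
    ("52645-53-1", "permethrin", "CAS RN"),
    ("Permethrin", "permethrin", "name"),
    ("Permethrine", "permethrin", "name"),  # hypothetical spelling variant
    ("1912-24-9", "atrazine", "CAS RN"),
]


def synonyms(identifier, id_type=None):
    """Collect synonyms in two steps: identifier -> chemical -> identifiers."""
    # Step 1: find the chemical(s) the given identifier points to.
    chemicals = {c for i, c, _ in IDENTIFIES if i == identifier}
    # Step 2: collect all other identifiers of those chemicals,
    # optionally restricted to one identification type.
    return sorted(
        i
        for i, c, t in IDENTIFIES
        if c in chemicals
        and i != identifier
        and (id_type is None or t == id_type)
    )


print(synonyms("Permethrin"))
print(synonyms("Permethrin", id_type="CAS RN"))
```

Because every identifier is at most one step away from its chemical, adding further identifiers lengthens the edge list but not the number of steps.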
Substances relevant in ecotoxicological contexts can be described at varying levels of specificity (e.g., isomerism), which was addressed by creating hierarchical chemical sub-graphs. For example, permethrin (Figure 7), an insecticidal compound, represents a stereoisomeric mixture of cis- and trans-permethrin isomers. The respective relationships between chemicals were resolved considering the presence of stereo-layers in their standard InChI strings [25], creating a hierarchical sub-graph (Figure 7). Further distinction of hierarchical levels (e.g., enantiomers and diastereomers) is currently not technically possible, as standard InChI strings do not support this operation [25]. However, it is also rarely needed for ecotoxicological assessments using field concentrations.
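How the presence of stereo-layers might be used to derive a hierarchy can be sketched with simple string handling. The sketch assumes that the stereochemical layers of a standard InChI are those prefixed with /b, /t, /m and /s, and that removing them yields the string of the stereo-unspecific parent, which is a simplification that may not hold for every compound. The function names are hypothetical, alanine is used merely as a compact illustrative molecule, and the procedure actually used for the MAGIC graph may differ.

```python
STEREO_PREFIXES = ("b", "t", "m", "s")  # assumed stereo layer prefixes


def has_stereo_layer(inchi):
    """True if any layer after the version prefix is a stereo layer."""
    return any(layer[:1] in STEREO_PREFIXES for layer in inchi.split("/")[1:])


def without_stereo(inchi):
    """Drop stereo layers to approximate the stereo-unspecific parent."""
    layers = inchi.split("/")
    kept = [layers[0]] + [
        layer for layer in layers[1:] if layer[:1] not in STEREO_PREFIXES
    ]
    return "/".join(kept)


def specifies(child, parent):
    """True if `child` is a stereo-specific form of `parent`."""
    return (
        has_stereo_layer(child)
        and not has_stereo_layer(parent)
        and without_stereo(child) == parent
    )


# Illustrative strings: alanine without and with stereo information.
flat = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)"
l_form = flat + "/t2-/m0/s1"

print(has_stereo_layer(l_form), specifies(l_form, flat))
```

A pair for which `specifies` returns true could then be connected by a relationship of type "specifies" in the hierarchical sub-graph.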

G1
"All nodes must have a single label": Ensuring that each node has exactly one label results in a graph that is easier to maintain, as other rules can refer to specific sets of nodes without having to deal with possible labeling overlaps. If the graph gets more complex in the future, it might, however, become advantageous to allow multiple labels per node.

G2
"Only a set of predefined labels is allowed for nodes": Restricting labels to a predefined set prevents nodes in the graph that are not targeted by rules. The MAGIC graph may currently contain nodes with labels "ChemicalIdentifier", "Chemical" and "Dataset".

G3
"Nodes should have a label with an associated view": Making sure that each label has a defined way that it is represented by the front-end website makes the MAGIC graph completely navigable.

G4
"Only a set of predefined types is allowed for relationships": Predefining types assures that all relationships are addressed by rules. Currently, the relationship types "identifies", "prefers_identifier", "uses_identifier" and "specifies" are allowed in the MAGIC graph.

G5
"Every item in the MAGIC graph must have a name": Naming items provides an endpoint for visiting the item using the website front-end and facilitates modifications and synchronization of the graph by allowing identification of individual nodes. For many types of items, such as chemical identifiers, the name is a natural part of the data.

G6
"Labels should have an associated edit view": Edit views allow smaller modifications of items using the website front-end and the provision of such edit views helps in maintaining the graph database.

G7
"MAGIC graph items should have at least one ingoing relationship": Items without an ingoing relationship lack significance because they cannot be reached by navigating the graph relationships, and they should not be part of the graph. Some items are considered as entry points to the graph (e.g., datasets) and are marked as globally available. The rule does not apply to these items.

CI1
"All chemical identifier nodes must have a timestamp": Registering the date of item creation assists in synchronizing chemical identifiers with external data sources and helps to resolve rule conflicts by indicating which item is more recent.

CI2
"All chemical identifiers that actually identify a chemical should be linked to exactly one chemical": A chemical identifier should identify a chemical, otherwise it is irrelevant for the chemical graph. However, there are two typical occasions when a chemical identifier does not identify a chemical: (1) when a data source from which data were imported into the graph listed an identifier as a chemical identifier, but further investigation revealed that the identifier did not refer to a chemical in a strict sense (e.g., it identifies a mixture of chemicals); and (2) when a chemical identifier was not found by the synonym provider. Violating this rule gives the user a chance to recognize and mitigate the second occasion, e.g., by adding manual synonym relationships. The user also has a chance to mark a chemical identifier in such a way that it does not trigger this rule anymore (by marking it as a chemical identifier that does not actually identify a chemical), which also signifies that the rule violation was recognized and managed manually.

CI3
"All chemical identification relationships should have a timestamp": Registering the date of relationship creation assists in synchronization with external data sources and helps to resolve rule conflicts by indicating which relationship is more recent.

CI4
"All chemical identifications should have a type": Specifying the mode in which a chemical identifier identifies a chemical helps in estimating the quality and uncertainty of the relationship. It also helps to output specific sets of identifiers, e.g., only CAS RN. The types used for specifying the relationship of identification are currently not restricted but may be a predefined set in the future.

CI5
"All chemical identifiers should be used by at least one dataset": The data sources where a chemical identifier is used should be given. If this is not the case, retracing the origin of chemical identifiers is not possible, which decreases the overall quality of the MAGIC graph.

CI6
"All chemical identifiers should have an identifier from a predefined list showing what is actually identified": Other rules depend on the information that a chemical identifier actually identifies a chemical; that is, a stricter meaning of "chemical" applies than in some other databases. To provide this information, a chemical identifier should describe what it actually identifies. Currently, the following possibilities are considered: chemical, mixture, unmatched chemical (no corresponding chemical could be identified by any means), unspecific (not specific enough to identify exactly one chemical) and ignored (for any reason).

CI7
"A chemical identifier that does not actually identify a chemical should not be linked with a chemical": Specifying that a chemical identifier identifies a chemical only makes sense if the chemical identifier is marked as actually identifying a chemical. If this is not the case, but an identifying relationship exists nonetheless, a manual examination of the case is advised.

C1
"All chemicals must have a timestamp": Registering the date of item creation assists in resolving conflicts involving chemicals by indicating which item is more recent.

C2
"All chemicals should have exactly one preferred name": According to the MAGIC graph data model, a chemical may have many names. Making sure that every chemical has exactly one designated preferred name still allows it to be referred to in outputs in a harmonized way.

C3
"All chemicals should be identified by at least one chemical identifier": Chemicals that have no identifier cannot be related to actual chemicals and should be removed from the graph.

C4
"All identifier preferences should have a timestamp": Name preferences of chemicals, especially when taken from external sources, may change over time. On these occasions, timestamps help to identify the more recent preference.

C5
"All chemicals should indicate whether they have bond stereo-chemical information": To understand which level of specificity regarding stereo-chemistry a chemical has, presence or absence of stereo-information at double bonds should be indicated.

C6
"All chemicals should indicate whether they have tetrahedral stereo-chemical information": To understand which level of specificity regarding stereo-chemistry a chemical has, presence or absence of stereo-information at tetrahedral stereo centers should be indicated.

C7
"Chemicals with stereo-information should specify other chemicals or indicate to not do so": The purpose of considering stereo-information is to distinguish two levels of specificity regarding stereo-chemistry: absence and presence of stereo-information. In the case of stereo-information presence, a chemical should specify a chemical without stereo-information but with the same chemical structure, thus representing the two levels of specificity in the graph. For some chemicals, it is not reasonable to find a chemical with the same structure but without stereo-information. In this case, the more specific chemical should be marked such that this rule can be ignored.

C8
"Chemicals may not specify themselves": Violations of this rule may occur when stereo-information in external data sources changes.

D1
"All datasets must have a timestamp": Registering the date of item creation assists in resolving conflicts in datasets by indicating which item is more recent.

D2
"All datasets must have a title": A title offers room for a short description of the dataset that is more extensive than its name but, unlike the name, is not used as an identifier.

D3
"All datasets must have a description": A description is an even more extensive opportunity to characterize a dataset by text.

D4
"All datasets should have an indicator of whether they are published": Distinguishing between published and non-published datasets makes it possible to decide which datasets are accessible through the website front-end.

D5
"All datasets should be published": At least at later stages, after inserting a dataset into the graph and fixing possible rule violations, the dataset should be published to make its data available.

D6
"All datasets must have at least one author": Assigning authors to a dataset is an attribution to the persons who were responsible for inserting the dataset into the graph.

UTI1
"All Use Type Identifiers should be used by at least one dataset": Use type identifiers originate from datasets and attributions to these datasets should be given.

UTI2
"All Use Type Identifiers should be used by at least one chemical identifier": All use type identifiers should be linked to at least one chemical identifier, otherwise they are of limited use for assessments.

CCI1
"All Chem Class Identifiers should be used by at least one dataset": Chemical class identifiers originate from datasets and an attribution to this dataset should be given.

CCI2
"All Chem Class Identifiers should be used by at least one chemical identifier": All chemical class identifiers should be linked to at least one chemical identifier, otherwise they are of limited use for assessments.

SyncCDDB1
"All identifiers in the MAGIC graph that actually identify a chemical should have exactly one match in the CDDB": Identifiers that have no match in the CDDB have been removed from there and should also be removed from the MAGIC graph. In some (rare) cases, chemical identifiers have two or more matches in the CDDB. These cases should be resolved manually, e.g., by ignoring the chemical identifier.

SyncCDDB2
"All identifiers of a specific chemical in the MAGIC graph should have the same preferred name and DTXSID in the CDDB": Having different preferred names among the synonym identifiers of a chemical is a strong indicator that synonym relationships in the CDDB have changed. This should result in an update of synonym relationships in the MAGIC graph as well.

SyncCDDB3
"The preferred name of a chemical in the MAGIC graph should be the same as the preferred name in the CDDB": Preferred names of the CDDB may change. Making sure we use the same preferred name in the MAGIC graph as in the CDDB circumvents the need to establish a custom scheme for preferred names.

SyncCDDB4
"The stereo-information of a chemical in the MAGIC graph should be the same as the stereo-information of that chemical in the CDDB": This rule captures changes in the chemical structure stored in the CDDB. Again, these changes should be synchronized with the MAGIC graph to reflect the most recent specifying relationships.

SyncCDDB5
"The identifier type of the relationship between a chemical identifier and a chemical in the MAGIC graph should be the same as in the CDDB": Synchronizing the identifier type between CDDB and MAGIC graph eliminates the necessity to manage a custom set of identifier types while retaining the benefits of typed identifiers.

SyncCDDB6
"All identifiers in the MAGIC graph that do not actually identify a chemical should have no match in the CDDB": On some occasions, new identifiers become recognized by the CDDB. This rule captures those instances where the newly recognized identifiers match identifiers in the MAGIC graph that previously have been marked as not actually identifying chemicals.

Appendix C
In total, 655 substance names, categorized as organic contaminants, were obtained from the Water Quality Portal (https://www.waterqualitydata.us/) and used for benchmarking six databases. First, successful synonym attribution was compared quantitatively (Table A3) between databases, and then the quality of synonym relationships was manually assessed by validating the correctness of generated links using assigned standard InChI-Keys. Attribution of synonyms and InChI-Keys was high for all six databases with the exception of one (ChemSpider), which was removed from subsequent analyses due to its comparatively low coverage (Table A3). Following this, manual validation of assigned InChI-Keys revealed that attribution was frequently incorrect or query responses were ambiguous, except for the U.S. EPA Chemical Dashboard (CDDB). Although multiple factors leading to misattribution of InChI-Keys or synonyms were identified, automatic aggregation of synonym lists from online sources lacking expert curation was found to be the most prevalent factor adversely affecting attribution quality. The CDDB, that, unlike

Figure 1. Schema of the "Meta-analysis of the Global Impact of Chemicals" (MAGIC) graph depicting typed relationships (arrows) between labeled nodes (circles).

Figure 2. Network map of the MAGIC graph. A subset of chemical identifiers is shown as green dots. Larger circles outlined in black represent the datasets included in the graph. Grey lines connect the datasets with chemical identifiers, and individual identifiers with chemicals (cyan circles, rarely visible).

Figure 3. Linkage of chemical data relative to a theoretical maximum for different databases. Complete linkage presumes that all chemicals listed in the smaller database are contained in the larger one. Linkage with a relational approach, where only CAS RN (light blue) or same-spelling chemical identifiers match (dark blue), is compared to the additional gain with the graph approach (orange). See Table 4 for the types of identifiers that were available for each database.

Figure 4. PAN use types of the chemicals in different datasets after linking them with the MAGIC graph. Chemicals may have multiple use types and are then included in several categories.

Figure 5. Workflow for adding data to the MAGIC graph while maintaining its integrity.

Figure 6. Two representations of four synonym identifiers (1-4) in a graph. (a) Links between identifiers indicate known synonym relationships. The synonyms of an identifier are all other identifiers directly or indirectly connected to it. (b) Identifiers point to the identified chemical (blue circle). All identifiers pointing to the same chemical are synonyms. The chemical has a preferred identifier (double arrow).

Figure 7. Representation of some synonyms and different levels of specificity for permethrin. Identifiers (green) refer (solid arrows) to chemicals (blue). Chemicals with stereo-information refer (dashed arrow) to a structurally identical chemical without stereo-information.

Table 2. Examples of problems occurring when linking data from different sources.

Table 3. Description of the submitted Microsoft® Excel worksheet.

Table 4. Characterization of ecotoxicologically relevant datasets using the MAGIC graph.

Table A1. Rules defining the MAGIC chemical graph.