Standartox: Standardizing Toxicity Data

Abstract: An increasing number of chemicals such as pharmaceuticals, pesticides and synthetic hormones are in daily use all over the world. In the environment, chemicals can adversely affect populations and communities and in turn related ecosystem functions. To evaluate the risks from chemicals for ecosystems, data on their toxicity, which are typically produced in standardized ecotoxicological laboratory tests, are required. The results from ecotoxicological tests are compiled in (meta-)databases such as the United States Environmental Protection Agency (EPA) ECOTOXicology Knowledgebase (ECOTOX). However, for many chemicals, multiple ecotoxicity data are available for the same test organism. These can vary strongly, thereby causing uncertainty in related analyses. Given that most current databases lack aggregation steps or are confined to specific chemicals, we developed Standartox, a tool and database that continuously incorporates the ever-growing number of test results in an automated process workflow that ultimately leads to a single aggregated data point for a specific chemical-organism test combination, representing the toxicity of a chemical. Standartox can be accessed through a web application and an R package. Dataset: https://doi.org/10.5281/zenodo.3785031 Dataset License: MIT


Summary
An increasing number of chemicals such as pharmaceuticals, pesticides and synthetic hormones are in daily use all over the world. In Europe alone, some 100,000 chemicals are estimated to be in current use, of which 30,000 are produced in quantities larger than one ton per year [1]. Except for pesticides, which are released into the environment deliberately, most chemicals enter the environment as a result of their use through different paths (e.g., atmospheric emission and deposition or discharge through wastewater) [2]. In the environment, chemicals can adversely affect populations and communities and in turn related ecosystem functions [3,4,5,6,7]. Ultimately, this may compromise nature's contribution to human well-being, for example the ecosystem services of clean drinking and irrigation water as well as food production [8,9,10]. Pollution with man-made chemicals has been identified as one of three major environmental problems for which research gaps hamper the derivation of planetary boundaries, i.e., thresholds beyond which irreversible state shifts may occur [11,12]. Bernhardt et al. [13] argue that the knowledge gap on how chemicals affect populations, communities and in turn ecosystem functions and services may impede the accomplishment of the Sustainable Development Goals [14] of the United Nations. Even highly regulated chemicals, such as pesticides, have been shown to cause strong adverse effects on non-target organisms, such as birds [5], aquatic insects [15] or fish [10], calling the current regulation efforts into question [16].
To evaluate the risks from chemicals to ecosystems, data on their toxicity are required, which are typically produced in standardized ecotoxicological laboratory tests. For example, Morrissey et al. [17] used ecotoxicological test results from 49 insects and crustaceans to evaluate the effect of neonicotinoid insecticides in the aquatic ecosystem. Furthermore, Malaj et al. [4] compiled experimental toxicity test results for 223 chemicals to assess the risk from chemicals to freshwater ecosystems in Europe. Similarly, permissible environmental concentrations are often derived from these test data, typically in combination with safety factors to account for uncertainties. The test data mainly relate to a few well-tested standard organisms, such as the brown rat Rattus norvegicus, the water flea Daphnia magna and the microalga Raphidocelis subcapitata. Nevertheless, a much greater variety of organisms has been used in ecotoxicological experiments.
To date, only a few initiatives exist that aim to create a public resource of ecotoxicological data, such as the United States Environmental Protection Agency (EPA) ECOTOXicology Knowledgebase (ECOTOX) (ca. 1,000,000 test results, 13,000 taxa, 12,000 chemicals) [18], the German Environmental Agency's Information System Ecotoxicology and Environmental Quality Targets (ETOX) [19], the Pesticides Properties DataBase (PPDB) (ca. 2000 pesticides) [20] or the EnviroTox database [21,22]. The former two compile all available results from experiments into a database. However, for many chemicals, multiple ecotoxicity values are available for the same test organism. These can vary strongly, thereby causing uncertainty in related analyses [23,24]. Moreover, the lack of associated quality information and heterogeneous units hamper reproducible science. The PPDB, in contrast, provides single ecotoxicity values only for pesticides and a few selected test organisms, thereby covering only a minor fraction of the vast amount of ecotoxicological data. The EnviroTox database is limited to aquatic organisms. Moreover, data analyses often require links to additional data resources, for example to append additional chemical and species information (e.g., chemical properties, habitat of species), which calls for more automated procedures.
We therefore developed Standartox, a tool and database that aims to overcome the limitations of other databases by continuously incorporating the ever-growing number of test results in an automated process workflow that ultimately leads to a harmonized ecotoxicity data collection and provides methods to derive single aggregated ecotoxicity values for a specific chemical-organism test combination. Standartox makes use of the publicly available and quarterly updated ECOTOX database [25] and restricts the data to commonly used endpoints in ecotoxicology, such as half maximal effective concentrations (EC50) or no-observed-adverse-effect concentrations (NOEC), leading to about 600,000 ecotoxicological test results, including about 8000 chemicals, tested on about 10,000 taxa in the current version. Standartox users can filter test results according to several parameters, e.g., refining a search for ecotoxicity data on organisms occurring in specific habitats or regions of the world. Above all, Standartox aggregates ecotoxicological test results in a standardized way, by calculating the minimum, the geometric mean and the maximum of the results for each chemical and the associated, user-defined test parameters. This reduces the variability between risk assessments that is due to the selection of different ecotoxicological test data [23]. Thereby, Standartox provides the basis for reproducible science and combines information from different sources to simplify the derivation of risk indicators such as Species Sensitivity Distributions (SSD) and Toxic Units (TU), which represent two prominent concepts to assess effects on organisms in ecotoxicology [26,27,28]. Besides aggregating ecotoxicological test results, Standartox provides a concise overview of the tested chemicals, allowing the identification of potential knowledge gaps.
Moreover, Standartox could help to reduce the millions of animals used for toxicity testing each year by facilitating access to ecotoxicity data, which is in line with, for example, the guidelines of the Organisation for Economic Co-operation and Development (OECD) [29,30]. Standartox comes with two front-ends, a web application (http://standartox.uni-landau.de) and the R [31] package standartox, providing convenience structures and thereby largely reducing processing time for users.

Data Description
Standartox constitutes a collection of quality-checked ecotoxicological test results. It is built on the ECOTOX database [25], whose data are processed, cleaned and harmonized to retrieve comparable toxicity endpoints. Subsequently, filter and aggregation methods are created to allow for the retrieval of single toxicity equivalents for specific experimental conditions. The ECOTOX database is updated quarterly, providing on average 5228 (2014-2019) new toxicity entries. These are included in Standartox with each update.

Filters
The data can be restricted to three endpoint groups, namely half maximal effective/lethal concentration/dose values (e.g., EC50, LD50), henceforth abbreviated as XX50, lowest observed effect concentrations/levels (LOEC/L), henceforth abbreviated as LOEX, and no observed effect concentrations/levels (NOEC/L), henceforth abbreviated as NOEX (Table A2). Standartox allows the ecotoxicity data to be filtered by effect groups (e.g., mortality, population, growth) (Figure 1A) and concentration types (e.g., formulation, active ingredient) as well as test durations (in hours). In addition to these test-specific parameters, Standartox data entries can be filtered by chemical-specific parameters such as the CAS number and chemical roles (e.g., pesticides, metals, drugs) (Figure 1B) and classes (e.g., organochlorine, triazine) (Figure 1C). Furthermore, the Standartox data can be refined to certain taxonomic groups (Figure 1D) as well as organism-specific parameters, such as the organisms' habitat (e.g., freshwater, marine, terrestrial) (Figure 1E) and distribution (e.g., Europe, South America) (Figure 1F).
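
The filter logic described above can be illustrated with a minimal Python sketch. It is not the Standartox implementation: the record fields, the function name filter_records and all concentration values are hypothetical and chosen only to mirror the filter parameters named in the text (endpoint group, effect group, habitat and test duration in hours).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """One hypothetical test result; field names are assumptions."""
    cas: str
    taxon: str
    endpoint: str       # one of "XX50", "LOEX", "NOEX"
    effect: str         # e.g. "mortality", "growth"
    habitat: str        # e.g. "freshwater", "terrestrial"
    duration_h: float   # test duration in hours
    concentration: float


def filter_records(records, endpoint=None, effect=None, habitat=None,
                   duration_h=None):
    """Keep records matching every filter that is set (None = no filter).

    duration_h is an inclusive (min, max) range in hours.
    """
    lo, hi = duration_h if duration_h is not None else (float("-inf"),
                                                        float("inf"))
    return [
        r for r in records
        if (endpoint is None or r.endpoint == endpoint)
        and (effect is None or r.effect == effect)
        and (habitat is None or r.habitat == habitat)
        and lo <= r.duration_h <= hi
    ]


# Made-up example values, for illustration only.
RECORDS = [
    Record("1912-24-9", "Daphnia magna", "XX50", "mortality", "freshwater", 48, 10.0),
    Record("1912-24-9", "Daphnia magna", "NOEX", "growth", "freshwater", 504, 0.2),
    Record("1912-24-9", "Oncorhynchus mykiss", "XX50", "mortality", "freshwater", 96, 5.0),
    Record("1912-24-9", "Eisenia fetida", "XX50", "mortality", "terrestrial", 336, 80.0),
]

selected = filter_records(RECORDS, endpoint="XX50", habitat="freshwater",
                          duration_h=(24, 96))
print([r.taxon for r in selected])
```

Treating an unset filter as "no restriction" keeps the parameters independent of each other, so any combination of test-, chemical- and organism-specific filters can be applied in one pass.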

Aggregation
Typically, species exhibit a differential sensitivity towards chemicals (Figure 2A). Moreover, multiple ecotoxicity values are available for individual species-chemical combinations and these can also exhibit high variability due to several factors such as durations of ecotoxicity tests (Figure 2B), experimental conditions and physiological or genetic fitness differences between test individuals or populations. Not every factor is recorded though, leading to unexplainable variability (Figure 2C). To aggregate multiple ecotoxicity values of the filtered data set into a single value on the desired taxonomic level (e.g., for an individual species, across species of a genus or family) and chemical grouping (e.g., across all pesticides), Standartox provides several aggregation methods, including the minimum, the maximum and the geometric mean. The geometric mean is preferred over the arithmetic mean, because it is less influenced by outliers and is suitable for skewed data. Furthermore, the geometric mean is preferable over the median, because the median completely ignores the tails of the data distribution, making it unreliable for small data sets [32]. Posthuma et al. [33] showed the usefulness of SSDs and their underlying geometric mean aggregations when assessing environmental effects of chemicals. In the course of the aggregation process, outliers that exceed 1.5 times the interquartile range are flagged to caution Standartox users. However, they are considered in the aggregation, given that the geometric mean is relatively robust against outliers.
Overall, Standartox provides a harmonized and reproducible approach to aggregate ecotoxicity data.

Figure 2. Ecotoxicity values (XX50) in Standartox illustrating (A) differential variability and data distribution between species (i.e., Xenopus laevis-Amphibian, Raphidocelis subcapitata-Algae, Oncorhynchus mykiss-Fish, Lemna minor-Macrophyte) for the chemical atrazine in 96 h tests, (B) how the variability in toxicity tests with zinc sulfate and Daphnia magna varies with test duration and (C) high variability that is not explained by the available test characteristics in the case of cupric sulfate tested on Pimephales promelas for 96 h. Red dots depict Standartox geometric mean estimates and red error bars show the associated standard deviation. Black dots depict the raw data. To facilitate readability, data points are randomly scattered along a hypothetical y-axis and are greyed out if within the violins.
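
To make the aggregation step concrete, the following Python sketch computes the minimum, geometric mean and maximum of a set of toxicity values and flags outliers with a 1.5 times interquartile range rule. It is a sketch under stated assumptions, not the Standartox code: the function names are hypothetical, the outlier rule is assumed to use the usual fences (below Q1 - 1.5 IQR or above Q3 + 1.5 IQR) on the untransformed values, and the quartile method is the default of Python's statistics.quantiles.

```python
import math
import statistics


def geometric_mean(values):
    """Geometric mean computed on the log scale; values must be > 0."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def flag_outliers(values):
    """Return one boolean per value: True if outside the 1.5 * IQR fences.

    Assumption: fences are Q1 - 1.5 * IQR and Q3 + 1.5 * IQR on raw values.
    With fewer than four values no quartiles are estimated and nothing
    is flagged.
    """
    if len(values) < 4:
        return [False] * len(values)
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v < lower or v > upper for v in values]


def aggregate(values):
    """Aggregate values; flagged outliers are counted but NOT removed,
    mirroring the description in the text."""
    return {
        "min": min(values),
        "gmn": geometric_mean(values),
        "max": max(values),
        "n": len(values),
        "n_outliers": sum(flag_outliers(values)),
    }


# Made-up values for illustration only.
print(aggregate([1.0, 10.0, 100.0]))
print(flag_outliers([1, 2, 3, 4, 5, 6, 7, 8, 1000]))
```

Because the geometric mean averages logarithms, a single value that is three orders of magnitude too high shifts it far less than it would shift the arithmetic mean, which is the robustness argument made above.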

Accuracy Assessment
To validate Standartox results, we compared geometric means resulting from the aggregation in Standartox to the corresponding values from other databases, for chemicals where data were available in both resources. The PPDB provides ecotoxicity data on a few selected species commonly used in chemical risk assessment that have been manually quality controlled through expert judgment [20]. The vast majority of aggregated values (91.9%) of Standartox lie within one order of magnitude of the corresponding PPDB values (n = 3601). This share increases to 92.6% when restricting the comparison to Standartox values where data from at least five experiments are available. Similarly, we compared Standartox to ecotoxicity values for Daphnia magna from the ChemProp [34] software, which estimates LC50 values via quantitative structure-activity relationship (QSAR) models [35]. We found that 95% of Standartox values lie within one order of magnitude of the ChemProp values (n = 179). The remaining differences are not necessarily an indication of lower quality of Standartox estimates but may also reflect the wider range of experimental conditions for which data are available in the database underlying Standartox as well as inaccurate predictions of the QSAR models (Figure 3).
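
The comparison metric used above, the share of value pairs that agree within one order of magnitude, can be sketched in a few lines of Python. The sketch assumes that "within one order of magnitude" means the two values differ by at most a factor of 10 in either direction; the function names and the example pairs are hypothetical.

```python
import math


def within_one_order(a, b):
    """True if a and b differ by at most a factor of 10 (assumption)."""
    return abs(math.log10(a) - math.log10(b)) <= 1.0


def share_within_one_order(pairs):
    """Fraction of (aggregated, reference) pairs within a factor of 10."""
    hits = sum(within_one_order(a, b) for a, b in pairs)
    return hits / len(pairs)


# Made-up (aggregated value, reference value) pairs in the same unit.
pairs = [(1.0, 5.0), (2.0, 30.0), (100.0, 20.0), (0.5, 0.04)]
print(share_within_one_order(pairs))
```

Working on log10 differences makes the criterion symmetric, so it does not matter which of the two databases is treated as the reference.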

Perspectives
Novel predictive frameworks incorporating chemical mode of action and species traits emphasize the need for holistic and automated analyses of large-scale ecotoxicological data [36,37]. Indeed, the increasing amount of data from ecotoxicological tests and experiments that is becoming available has elicited several initiatives to harmonize these data. These initiatives partly aim for overlapping goals, yet have limitations or objectives that distinguish them from Standartox. Comptox is a web tool published by the EPA which, similar to Standartox, allows for filtering test results and the retrieval of additional chemical information as well as predicted toxicity data [38], such as 48 h Daphnia magna LC50 values. However, toxicity estimations are limited to standard test organisms, and the tool lacks the possibility for automated data retrieval [39]. Comptox is built on the Aggregated Computational Toxicology Resource (ACToR) database, which constitutes the basis for several applications published by the EPA. It collects physicochemical and toxicological data on more than 500,000 environmental chemicals and pharmaceutical compounds from various resources and presents them in a curated list on the web [40,41]. However, no filter mechanisms or aggregation methods are provided in ACToR per se.
The EnviroTox database, which also uses, amongst others, the ECOTOX database as an input, has recently been published [21,22]. In contrast to Standartox, EnviroTox is restricted to selected aquatic organisms (i.e., fish, amphibians, invertebrates and algae) and experimental durations (at least 24 h) and uses a rule-based algorithm to derive single ecotoxicity values. Besides, EnviroTox provides additional information on toxicity endpoints, such as acute or chronic classifications and mode of action assignments. We intentionally omitted such classifications given that the approach to classification may vary with the purpose of the study or because of different classification schemes [42]. The EnviroTox database allows for an aggregation into single toxicity values for individual taxa, whereas Standartox performs this aggregation for individual chemical-taxa combinations. However, the Standartox results for individual taxa-chemical combinations could easily be aggregated across chemicals in a second step to provide a similar aggregation to that performed in EnviroTox.
The Etox database collects ecotoxicity test information and provides methods to filter it. Like the ECOTOX database, it lacks methods to perform aggregations of the ecotoxicity data and only provides manual (non-automated) access. In contrast to the ECOTOX database, the Etox database cannot be downloaded as a whole.
The PPDB provides data only on pesticides, and as mentioned before, it provides single quality controlled values only for commonly used taxa, e.g., Daphnia magna or Raphidocelis subcapitata.
In summary, none of the above-mentioned initiatives aim for an automated and standardized aggregation method of exposure endpoints for individual chemicals. In addition, they lack the possibility to access the databases through a common high level programming language, such as R. An overview of the filter and aggregation methods as well as the accessibility of the presented databases is presented in Table 1.

Table 1. Overview on databases that provide ecotoxicological data. Abbreviations: ALL: Most important test parameters, including chemical, taxon, duration for filtering ecotoxicological data are incorporated. Web: Accessible via a web application through a graphical user interface. API: Accessible via an application programming interface.

| Database | Filter | Aggregation, Selection | Access |
| Comptox [39] | Chemical | no | Web, file |
| Ecotox [25] | ALL | no | Web, file |
| EnviroTox [21] | ALL | chemical, organism | Web |
| Etox [19] | ALL | no | Web |
| Pesticides Properties DataBase (PPDB) [20] | fixed values | manual selection | Web, file |
| Standartox | ALL | chemical, organism | API, Web |

As outlined above, toxicity estimates from different studies can vary strongly due to a wide range of experimental conditions such as pH, temperature and conductivity [43,44]. Integrating these conditions into the aggregated estimates would certainly improve toxicity estimates. However, the current implementation of Standartox omits these conditions, because the ECOTOX database only provides sparse records on experimental conditions. The most frequently provided experimental conditions are temperature (77%), pH (56%), hardness (27%), dissolved oxygen (18%), alkalinity (15%) and salinity (9%). For all other conditions less than 5% of data entries are available. A text-mining approach, where a literature reference is associated with ecotoxicity raw data, iterating through the individual publications could potentially increase this number, e.g., Compson et al. [45] successfully applied text-mining techniques to retrieve species trait data.

Methods
An automated processing pipeline downloads the quarterly released ECOTOX database, performs several preparation steps on it and exports a final Standartox data set. This data set is accessible via a web application and an application programming interface (API). An API provides the means for machine communication between a host and a client and thus allows scriptable data queries. To facilitate the API access, the R [31] package standartox was built. All data presented in this paper are derived from the Standartox build based on the ECOTOX release of 12 December 2019. The code for Standartox is located in the two GitHub repositories andschar/standartox-build (https://github.com/andschar/standartox-build) and andschar/standartox (https://github.com/andschar/standartox). The former contains code to process the data and to build the web application and the API; the latter contains code to build the R package. Most of the code is written in R 3.6.1 and associated packages (Table A4) and in Structured Query Language (SQL) for PostgreSQL 9.6.1. A graphical overview of the most important processing steps is given in Figure 4.

Processing
Standartox downloads the quarterly released ECOTOX database and builds it into a local PostgreSQL database. Subsequently, SQL functions for further processing the data are implemented. In addition, lookup tables that enable the conversion of units such as duration and concentration are created. A meta-table providing information such as the release version of the ECOTOX database is added. Then, the provided Chemical Abstracts Service (CAS) numbers and taxonomic names are used to query additional information from publicly available databases on chemicals and organisms, respectively. This includes the Compendium of Pesticide Common Names [46], the Chemical Entities of Biological Interest (ChEBI) database [47], the Chemical Identifier Resolver (CIR) service [48], the PubChem database [49], Eurostat [50] and Wikidata [51] for chemicals, and the World Register of Marine Species (WoRMS) [52], the Global Biodiversity Information Facility (GBIF) [53] and the freshwaterecology.info database [54] for habitat and spatial distribution of organisms (Table A1). Given that taxonomic names can be ambiguous, e.g., the genus Eisenia can refer to an alga and a worm, we first match the taxa names against specific database identifiers and subsequently check their accordance with the underlying ECOTOX data taxonomy. Then, we query the actual data by using the identifiers. In a next step, the data are added to Standartox to enable filtering for specific chemical roles (e.g., drug, metal, pesticide, personal care product) and classes (e.g., pyrethroid, carbamate) as well as spatial distribution (i.e., continents) and habitat preferences (e.g., freshwater) of individual taxa. Taxa that were not identified to at least genus level are excluded, because relative toxicity comparisons have been shown not to be meaningful for higher taxonomic levels [24,55,56].
Finally, the Standartox data set is compiled, which includes the harmonisation of data, e.g., through conversion of test concentration and duration units. In total, 1237 distinct concentration units are converted to six harmonized ones (i.e., g/L, g/m², ppb, g/g, L/L and L/m²) when conversion is possible. Likewise, the 126 distinct duration units are converted to hours whenever this is unambiguously possible. To guarantee appropriate unit conversion and harmonisation, we compared the results of an automated unit conversion to a manual one for each of the distinct concentration and duration units. This assures that 652 of the 1237 concentration units (95.3% of the data) are converted correctly. The remainder could not be converted and is removed. Furthermore, the units are cleaned, for example by removing additional information in the field, such as food, soil or ai, that is also coded in other variables and hinders the processing of units. Concentrations that are given as rates such as per day (e.g., mg/kg/day) are multiplied by the days of the test and then converted. Experimental endpoints are restricted to three groups, namely NOEX, LOEX and XX50. Other endpoints, such as bioconcentration factors, non-half maximal effective concentrations (e.g., IC10, EC25, LD99) or maximum acceptable toxicant concentrations are removed. Along with that, a catalog listing all distinct entries and value ranges for categorical and continuous variables, respectively, is created. The compiled Standartox data set together with the catalog is exported and accessible via the web application and, through the R package, via the API.
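
The lookup-table approach to unit harmonisation described above can be sketched as follows. This Python sketch is illustrative only: the table contents cover a handful of units rather than the 1237 found in ECOTOX, and the names CONC_UNITS, DURATION_UNITS_H, convert_concentration and convert_duration are hypothetical. Units without a lookup entry return None, corresponding to entries that cannot be converted and are removed.

```python
# Hypothetical lookup tables: source unit -> (harmonized unit, factor).
CONC_UNITS = {
    "g/L": ("g/L", 1.0),
    "mg/L": ("g/L", 1e-3),
    "ug/L": ("g/L", 1e-6),
    "ng/L": ("g/L", 1e-9),
    "mg/kg": ("g/g", 1e-6),
    "ug/g": ("g/g", 1e-6),
}

# Hypothetical lookup table: source unit -> hours per unit.
DURATION_UNITS_H = {"h": 1.0, "min": 1.0 / 60.0, "d": 24.0, "wk": 168.0}


def convert_concentration(value, unit, per_day=False, test_days=None):
    """Return (value, harmonized unit) or None if not convertible.

    Rates given per day (e.g. mg/kg/day) are first multiplied by the
    number of test days, as described in the text, then converted.
    """
    if unit not in CONC_UNITS:
        return None
    if per_day:
        if test_days is None:
            return None  # rate cannot be resolved without the duration
        value = value * test_days
    target, factor = CONC_UNITS[unit]
    return value * factor, target


def convert_duration(value, unit):
    """Return the duration in hours or None if the unit is unknown."""
    factor = DURATION_UNITS_H.get(unit)
    return None if factor is None else value * factor


print(convert_concentration(5.0, "mg/L"))
print(convert_concentration(2.0, "mg/kg", per_day=True, test_days=3))
print(convert_duration(2, "d"))
print(convert_concentration(1.0, "furlongs"))
```

Keeping the conversion factors in a table rather than in code makes it straightforward to compare the automated conversion against a manually curated one, unit by unit, as was done for validation.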

Application Methods
When accessed, the web application and the API load the compressed serialized Standartox data into memory and allow the user to interact with them. The user can then call the functions stx_filter() and stx_aggregate() that filter and aggregate the data according to specific parameters (Table 2). The interactive web application is built in R using the shiny framework, which runs with the help of a shiny server [57]. The API is built by using the R package plumber [58], which allows for the creation of Representational State Transfer (REST) APIs from R. REST is a software architectural style that defines web service communication rules. The API is reachable via the Internet Protocol (IP) address 139.14.20.252 and port 8000. Three API-endpoints (/catalog, /filter, and /meta) can be queried (Table A3). The /catalog API-endpoint returns a JavaScript Object Notation (JSON) file containing a catalog of possible filter parameters to choose from. The /filter API-endpoint returns the filtered Standartox table as a compressed serialized binary file created by the R package fst [59], to reduce size and allow for fast user queries. Lastly, the /meta API-endpoint returns a JSON file with meta information, such as the timestamp of the request and the used Standartox version. The API is designed to be used with the R package standartox and therefore uses serialization methods specific to R (rds() from the R package base and fst() from the R package fst). To facilitate the API usage, the R package standartox was created.

User Notes
Users can access Standartox either via the web application (http://standartox.uni-landau.de) or via the R package standartox. By accessing the web application, users can filter and download the resulting data sets as a comma-separated values (csv) file. Users of the R package can directly load the data within R. The R package provides the two functions stx_catalog() and stx_query(). The former queries a catalog of possible Standartox parameters into an R list object. The latter allows users to set the Standartox filter parameters and to fetch the actual data. It returns an R list of three tables (i.e., R data.frames) containing the filtered data set, the aggregated data set and a table with the meta information retrieved from the API endpoints. A short R-code example is given below (Listing 1) and a detailed description of the usage of the R package is provided on its GitHub page (https://github.com/andschar/standartox).

Conclusions
Due to the steady incorporation of new ecotoxicity data, the aggregated values produced by Standartox can be subject to change with future updates. We regard this as an advantage rather than a drawback because other published works that aim in a similar direction often constitute a singular effort or require manual work for each update. Standartox, in contrast, automates the update process, yet still provides access to its older versions, assuring reproducibility and version control.
In comparison to rule-based approaches for the derivation of single ecotoxicity values, Standartox has the advantage of being free from the subjectivity of a set of human-induced rules. Above all, Standartox provides quick access because it is designed to be queried via the R language. Due to the increasing amount of available ecotoxicological test data, it becomes fundamental to provide and distribute ecotoxicity information in adequate formats, both easily accessible for humans and easily processable for machines. Standartox meets these requirements and puts its focus on the aggregation of toxicity data, thereby adding a piece to the puzzle of modern ecotoxicological data analyses.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: