New Models to Predict the Acute and Chronic Toxicities of Representative Species of the Main Trophic Levels of Aquatic Environments

To assess the impact of chemicals on an aquatic environment, toxicological data for three trophic levels are needed to address the chronic and acute toxicities. The use of non-testing methods, such as predictive computational models, was proposed to avoid or reduce the need for animal models and speed up the process when there are many substances to be tested. We developed predictive models for Raphidocelis subcapitata, Daphnia magna, and fish for acute and chronic toxicities. The random forest machine learning approach gave the best results. The models gave good statistical quality for all endpoints. These models are freely available for use as individual models in the VEGA platform and for prioritization in JANUS software.


Introduction
About 300 million tonnes of chemicals that are used in consumer and industrial products are discharged into wastewaters and find their way into natural waters every year. Additional pollution comes from diffuse sources in agriculture, where it is estimated that about 140 million tonnes of fertilizers and several million tonnes of pesticides are applied annually [1]. Therefore, aquatic communities are exposed to many chemicals that can be toxic for them and humans [2,3], even in low concentrations. The management of pollution from the release of synthetic chemicals has become a cause for concern for the scientific community, regulators, and the public [4].
Risk assessment of chemicals is necessary to prevent and control pollution due to anthropogenic chemicals [5]. Information on aquatic toxicity is needed to assess the hazards and risks of chemicals to freshwater organisms. Ideally, all aquatic organisms should be tested, and the most sensitive species should be selected for assessment to protect all aquatic organisms. However, since it is impossible to test the toxicity of chemicals on all aquatic organisms one by one, only representative aquatic organisms are selected. Algae, crustaceans, and fish belong to different trophic levels, i.e., primary producers and primary and secondary consumers, and are considered relevant for the protection of aquatic ecosystems [6].
Besides hazard and risk assessments, toxicity data from these representative organisms are used for the prioritization of chemicals, which are accepted by regulatory authorities, for instance, to screen persistent, bioaccumulative, and toxic chemicals [4].
performed well for a limited chemical domain but were not useful to assess a diverse set of chemical structures [11]. Most of the QSAR models focus on acute aquatic toxicity [12][13][14][15][16][17][18][19][20][21] and only a few relate to chronic aquatic toxicity. Claeys et al. [22] and Austin and Eadsforth [23] developed models for chronic narcosis toxicity and chronic non-polar narcosis toxicity in fish, respectively. Fan et al. [24] developed a local model that was based on 48 substituted benzenes to predict the chronic toxicity of chemicals toward Daphnia magna. Ding and colleagues [4] developed classification models to predict the chronic toxicity of chemicals toward Daphnia magna and the green algae Pseudokirchneriella subcapitata. ECOSAR software contains chronic toxicity models for algae, daphnids, and fish [25]. Several QSAR models, such as those included in ECOSAR, were used to predict the toxicity of chemicals, but they yielded poor correlations and still need improvement [6]. QSARCHE contains non-polar and polar narcosis QSAR models for chronic toxicity in fish [26]. However, given the current state of development of chronic aquatic toxicity models, considerably more work is still needed [27].
The objective of the present work was to develop QSAR models to predict the acute and chronic toxicity of various chemicals toward algae (Raphidocelis subcapitata, previously known as Pseudokirchneriella subcapitata), Daphnia magna, and fish (several species) based on OECD Good Laboratory Practice (GLP) data [28]. It is expected that these models will help to fill gaps in the information on the acute and chronic toxicity of several chemicals for all the trophic levels of freshwater ecosystems.
This work is a part of a project called JANUS (Joining Environmental, Ecotoxicological, and Toxicological Assessment of Chemical Substances with Non-Testing Methods within a Unified Screening), funded by the Bundesministerium für Umwelt, Naturschutz und nukleare Sicherheit (BMU), that aims for the implementation of a new strategy for the prioritization of chemicals according to PBT; endocrine disruption; and carcinogenic, mutagenic, and toxic-to-reproduction (CMR) properties.
The QSAR models described here are used to assess the ecotoxicity of chemicals within the toxicity criterion of the JANUS prioritization scheme. They advance the work done in a previous project, called PROMETHEUS [29], which was based on calculations of the acute-to-chronic ratio (ACR) to check the fulfillment of the toxicity criterion for PBT assessment.

Results
We checked a large number of substances with experimental values on organisms representing the three trophic levels, as required by the European regulation [5], for acute and chronic values. This collection of values is large compared with other collections, particularly on chronic toxicity. Chronic toxicity is the most important value according to the European regulation [5], and the acute value has to be calculated only if information regarding the chronic value is missing. We applied a quality check regarding the chemical structures and the consistency of property values when more than one value was found. The presence of multiple values offers an advantage, increasing the reliability of the experimental value. However, when there are significant differences between the experimental values, this increases the uncertainty and may indicate the presence of errors or difficulties with a certain substance. For this reason, we adopted the procedure described to prune the dataset, excluding substances with uncertain experimental values.
Then, we developed QSAR random forest models (the tree ensemble method) for three trophic levels-Raphidocelis subcapitata, Daphnia magna, and fish-for both acute and chronic toxicities. We made the distribution of the endpoint data uniform and split the dataset to get a training and a validation set. We calculated the descriptors and pruned them using a genetic algorithm (gaselect) or variable selection using random forest (VSURF) approaches. For each model, we checked different parameters to identify ideal thresholds for the applicability domain (AD) and examined them in every possible combination. We assessed the performance by including and excluding high leverage compounds. A total of thirty-two possible AD combinations were evaluated; then, the best one was selected based on the best compromise in terms of coverage and performance in 10-fold cross-validation. Each model was validated on an external set of data. Table 1 summarizes the statistics for the training and the external validation sets, with the variable selection method chosen for each model, the number of descriptors, and details of the parameters used to define the AD. For each endpoint and trophic level, the table also shows the statistics with and without the AD. E(L)C 50 represents the acute toxicity, i.e., the concentration necessary to produce the effect (death) in half of the exposed population, whereas the no observed effect concentration (NOEC) represents the chronic effect. a E(L)C 50 is the concentration that causes the effect (death) in 50% of the exposed population. b NOEC is the no observed effect concentration. c R 2 is the determination coefficient. d MAE is the mean absolute error. e RMSE is the root mean squared error. f AD is the applicability domain of the model.
The statistics were evaluated using the determination coefficient (R 2 ), the mean absolute error (MAE), and the root-mean-squared error (RMSE). All models gave high statistics for the training set. This was expected since the random forest approach was adopted. The most interesting values were those related to the validation set, which contained substances that were never used to build up the model (about 20% of the total number of substances available). These values provided a good assessment of the expected predictivity of the models.
The model for Raphidocelis subcapitata acute toxicity (EC 50 ) gave an R 2 of 0.60 for the validation set. Though this does indicate uncertainty in the prediction, it is still acceptable for this kind of endpoint. The use of the information on AD did not substantially improve the predictions. The models for Raphidocelis subcapitata gave the worst results. However, the results were worse toward acute toxicity.
The model for Daphnia magna acute toxicity (EC 50 ) gave an R 2 of 0.69 for the validation set. Although this indicates uncertainty in the prediction, it is still acceptable for this kind of endpoint. Information on the AD did not substantially improve the predictions. In the case of models for NOEC, the R 2 in the validation sets was even better at 0.78, and this is useful, also because this endpoint is much less studied than acute toxicity. The results for fish were comparable to those for Daphnia magna, with a small reduction in the statistical quality. Furthermore, in fish, there was an improvement in performance regarding chronic toxicity. This is a good result too.

Discussion
We developed a battery of models for three trophic levels that are important for aquatic toxicity: Raphidocelis subcapitata, Daphnia magna, and fish for both acute and chronic toxicities. The statistical parameters were good, and similar or better than those that were previously published.
For the algal acute toxicity, too few models are available, and the majority are for specific classes of chemicals [30,31]. This can explain the better statistics compared to ours.
The models for chronic toxicity toward Raphidocelis subcapitata and algae in general are very limited. Ding et al. [4] developed classification models, which are quite different from the regression models that predict continuous values. However, they obtained worse models than the classification models toward Daphnia magna, and this confirmed the greater difficulty in predicting this endpoint.
Considering the low number of available models for algal toxicity (acute and chronic), it is clear that more efforts are needed to develop new models and have a sufficient number of models to better assess these endpoints. We explained how using more than one model for the same endpoint may help the user in the assessment [32].
While many models were published for fish and Daphnia magna acute toxicities, the number of studies for chronic effects is much scarcer.
Recent studies on Daphnia magna acute toxicity had similar difficulties to those reported above and achieved R 2 values of about 0.60-0.68 for the validation set [33,34]. However, Khan and colleagues used 175 [33] and 133 [34] compounds; therefore, our model used a much larger set of compounds: 428. Thus, it is quite probable that the present model is more robust and has a larger AD than the previous ones. We also noticed that the model by Khan et al. [34] is specific for biocides, and this might explain the lower R 2 , and the other model [33] is for pharmaceuticals. If we consider older models, their performance was less satisfactory [35].
Few models about the chronic toxicity for Daphnia magna have been published. The model developed by Fan et al. [24] is a multilinear regression model with five descriptors and gave an R 2 for the test set of 0.736. However, this model is quite different from the one we developed because this is only for substituted benzenes and is validated on ten substances.
Many models have been published for acute fish toxicity. Older models gave worse performances [36]. More recent models achieved similar or even better results. For instance, the models by Khan et al. [33] achieved an R 2 of 0.8 for the validation set.
Very few models have been published about fish chronic toxicity. Claeys et al. [22] developed a model for nonpolar narcosis based on 49 substances for the training set and 20 for the test set. LogP was the key parameter. Even fewer substances were available for polar narcosis (13). The R 2 for the training and test sets for nonpolar narcosis were 0.76 and 0.73, respectively, which were quite similar to the R 2 of 0.74 of the ECOSAR model for neutral organics [37]. However, these values refer only to nonpolar substances without a specific mode of action or reactivity. As we said, the model for polar narcosis had only a very small number of compounds, and the user needs a previous tool to see whether the substance of interest follows the narcosis mechanism; for the other cases, the substances were outside the AD of the system. Both the ECOSAR and the QSARCHE (the software developed by Claeys et al. [22]) were criticized later by Austin and Eadsforth [23] for the use of some inaccurate data in these models. They proposed another model for NOEC for nonpolar narcotics, which achieved an R 2 of 0.89 for the test set [23]. There were 10 substances in the test set and 19 in the training set. Thus, for this model too, there are the general limitations of the small number of substances at the basis of the model and the need to know or predict whether the target substance is a nonpolar narcotic.
In general, there is the need for further studies addressing aquatic toxicity for the three trophic levels for both acute and chronic toxicities. This study represents a contribution in this direction.
The models we describe can be applied for the assessment of chemical substances in a facilitated way because they were implemented within the VEGAHUB tools [38]. One is VEGA, which is a freely available platform that contains dozens of models. The availability within the same system of the QSAR models, of the software assessing the applicability domain for each substance, and a tool to visualize the six most similar substances for read-across offers great improvements compared to the past situation. This provides a way to evaluate the confidence of the prediction for the specific chemical. The other tool is JANUS, which is freely available software that was designed to assess and prioritize the chemicals according to their PBT properties.

The Datasets
We collected the experimental toxicity values from short-and long-term aquatic toxicity tests of chemicals that were conducted by the Japanese Ministry of Environment on several organisms; these have been publicly available on the official website [39] since March 2016.
The aquatic toxicity tests were done according to the OECD GLP standards [28] and the OECD official guidelines for several species and endpoints [40][41][42][43][44]. In view of the purposes and the context of the JANUS project, we selected the following representative organisms of the three trophic levels of freshwater ecosystems: We generated the chemical structures of the compounds as SMILES notations [45] from the chemical name and CAS RN using ChemCell [46] and Marvin View 17.12, 2012017, ChemAxon [47]. We manually checked the correctness and consistency of the chemical structures, chemical names, and CAS RN using several databases, including ChemIDplus Advanced [48], PubChem [49], ChemSpider [50], and DSSTox [51]. We added the structures that were not automatically generated using these databases.
Then we pruned the initial datasets. This was done mostly because classical QSAR approaches, particularly for the calculation of molecular descriptors, cannot deal with certain types of structures (inorganic, disconnected structures like salts, etc.). We excluded metal complexes, inorganics, mixtures of structural isomers, ambiguous structures, nonionic surfactants, mixtures, complex disconnected structures, chemicals whose names and CAS RN did not correspond, and "substances of unknown or variable composition, complex reaction products, or biological materials" (UVCBs). We also neutralized the salts.
As the next step, we selected continuous experimental values, excluding those reported as a range, as above or below a certain numerical threshold, or only approximate. We kept the toxicity values from experimental conditions of the assays as they are defined in the OECD guidelines [40][41][42][43][44]. For instance, we excluded toxicity values from 0-48 h assays on Raphidocelis subcapitata and LC 50 for fish after 120 h of exposure. We also eliminated pH-adjusted toxicity values for fish and Daphnia magna. We calculated the molecular weight of each chemical structure to convert the experimental toxicity value from milligrams per liter to millimoles per liter.
We checked the multiple values for the same species and endpoint for each substance: the difference between the largest and the smallest values had to be within a factor of 10, i.e., one log unit, when the experimental conditions and the reliability of the studies were the same, as reported in [52]. When possible, we looked for the outlier(s); if not found, we eliminated the data and the substance was not used. We also checked whether the experimental toxicity values were higher than the water solubility. If it was so, we eliminated this chemical.
Once the dataset was pruned, we obtained the number of chemicals for the different trophic levels and endpoints that are reported in Table 2. Since after the pruning, the Japanese Ministry of Environment dataset contained only 35 chemicals for chronic fish toxicity, we searched other sources of experimental data for this endpoint. We retrieved many experimental data from the ECOTOX Aquire database [53], which was updated in July 2017, and pruned this dataset according to several criteria: A further source of data was the PROMETHEUS dataset on chronic fish toxicity [29], which included experimental data from eChemPortal [54] and several datasets that were extracted from OECD QSAR Toolbox 3.2 [55].
We used the same criteria as above to check the data from these two additional sources. We merged all the datasets on chronic fish toxicity from various sources [39,[53][54][55], checking multiple values and duplicates. The final chronic fish toxicity dataset contained 94 chemicals.
We calculated the median and the arithmetic and geometric means of the multiple values in millimoles per liter to check whether there were differences between them to integrate the multiple values. We found a very good correlation (R 2 ≈ 1) and we chose the geometric mean, as recommended by [52]. We calculated the logarithm of the geometric mean to normalize the data. We also tried a Box-Cox transformation [56], optimizing the nated pH-adjusted toxicity values for fish and Daphnia magna. We calculated the molecular weight of each chemical structure to convert the experimental toxicity value from milligrams per liter to millimoles per liter.
We checked the multiple values for the same species and endpoint for each substance: the difference between the largest and the smallest values had to be within a factor of 10, i.e., one log unit, when the experimental conditions and the reliability of the studies were the same, as reported in [52]. When possible, we looked for the outlier(s); if not found, we eliminated the data and the substance was not used. We also checked whether the experimental toxicity values were higher than the water solubility. If it was so, we eliminated this chemical.
Once the dataset was pruned, we obtained the number of chemicals for the different trophic levels and endpoints that are reported in Table 2. Table 2. Number of chemicals for each trophic level from acute short-term and chronic toxicity tests available in the Japanese Ministry of Environment's database after pruning of chemical structures and experimental values. The numbers of substances before pruning are given in parentheses.  (37) Since after the pruning, the Japanese Ministry of Environment dataset contained only 35 chemicals for chronic fish toxicity, we searched other sources of experimental data for this endpoint. We retrieved many experimental data from the ECOTOX Aquire database [53], which was updated in July 2017, and pruned this dataset according to several criteria:


Taxonomic: animals, fish, and standard test species.  Test results: endpoint (NOEC) and effect measurement (mortality).  Test conditions: test location (laboratory), exposure media (freshwater), and exposure types (flow-through, renewal).  Chemical analysis: measured.  Purity > 80% and "not reported"; if the purity was "not reported," we checked the chemical grade (eliminated: experimental, practical, and technical grades).  Organism life stage: egg(s), embryo(s), blastula, eyed egg or stage, and eyed embryo. A further source of data was the PROMETHEUS dataset on chronic fish toxicity [29], which included experimental data from eChemPortal [54] and several datasets that were extracted from OECD QSAR Toolbox 3.2 [55].
We used the same criteria as above to check the data from these two additional sources. We merged all the datasets on chronic fish toxicity from various sources [39,[53][54][55], checking multiple values and duplicates. The final chronic fish toxicity dataset contained 94 chemicals.
We calculated the median and the arithmetic and geometric means of the multiple values in millimoles per liter to check whether there were differences between them to integrate the multiple values. We found a very good correlation (R 2 ≈ 1) and we chose the geometric mean, as recommended by [52]. We calculated the logarithm of the geometric mean to normalize the data. We also tried a Box-Cox transformation [56], optimizing the ʎ value for each dataset. Since the Box-Cox transformation gave better results in terms of value for each dataset. Since the Box-Cox transformation gave better results in terms of normalization of the data, we used this instead of the logarithm to normalize the data. We excluded data that fell outside the range (mean of the Box-Cox transformed values) ± 3 SD (standard deviation).
For each dataset, we normalized the SMILES using istMolBase 1.0.3 [57]. We neutralized the normalized SMILES of the compounds in the datasets of acute and chronic toxicity endpoints for Daphnia magna and fish. We used ionized normalized SMILES in the case of EC 50 72 h and NOEC 72 h datasets for Raphidocelis subcapitata since pH is a critical issue in algae [41]. According to the OECD guideline [41], two growth media can be used: one at pH 7.5 and another at pH 8.1. The two growth media can give different results, depending on the pH, especially for ionizing substances [41]. We had no information on what medium was used to perform the assays; therefore, we calculated the main microspecies at pH 7.5 and 8.1 for all the SMILES using JChem [58] and eliminated compounds with SMILES that changed depending on pH. The dataset containing the CAS RN, the SMILES, and the experimental values of each model are included as Supplementary Material (FinalDataset.xls).

The QSAR Models
To build up the models, we divided the datasets into training and test sets in a ratio of 80:20. To obtain a uniform distribution of the endpoint values between the two subsets, we applied an activity and descriptors sampling method. First, principal component analysis (PCA) was done on all the 2D descriptors that were calculated using the Dragon 7.0 Extension for KNIME [57,59]; the first two principal components were selected. Five random compounds were selected. Then, we chose the most dissimilar compound from the sample pool according to the first two principal components and the response using several combinations of distance metrics and scoring functions. Then, this compound was added to the pool and the operation was repeated until the desired number of compounds for the training set was reached (80% of the substances).
As explained before, the Dragon 7.0 Extension for KNIME was used to calculate 2D descriptors, excluding compounds whose Ghose-Crippen octanol-water partition coefficient [60] (ALOGP) could not be calculated (because this descriptor was the one that was most closely correlated to the response). Dragon 7.0 can calculate 3839 2D descriptors. A large part of this number of descriptors is likely to be redundant or not informative; therefore, one must apply methods to reduce the variables to train the model (pruning). The procedure adopted was divided into three phases:
All the descriptors that correlated higher than 0.95 (Pearson) with at least one other descriptor were eliminated; 3.
A genetic algorithm (gaselect) or variable selection using random forest (VSURF) was applied.
For each dataset, we had two pools of variables, one selected with gaselect and the other with VSURF. Both datasets were imported into a KNIME workflow to derive the models.
Among the several algorithms used, a random forest (RF) called tree ensemble gave the best results in terms of performance. This algorithm builds a series of regression trees with different rows and different variables (according to certain parameters) and then the results are aggregated as an ensemble of models. The parameters for the variables of each tree and the number of compounds are selected on the basis of the performance of several models (hyperparameter-tuning research) using R 2 as a metric of the bootstrap (100 iterations) cross-validation on the training set.
Two approaches were applied to define the applicability domain (AD) of the models: 1. The first approach explored the structural domain of the model. This was done by recording the degree of structural similarity of a given compound to those in the training set. A distance matrix containing distances for each pair of compounds in the training set was calculated; then, for each compound in the training set, we calculated the mean distance to its first k neighbors. The training set chemicals were then sorted on the basis of these distances and the value corresponding to a given percentile of the distribution of distances was used as a threshold (T D ), beyond which, chemicals were excluded from the AD. For the external validations, the procedure was repeated, calculating the distance of each validation set chemical from their neighbors in the training set; then, T D was used to identify the compounds outside the AD. For the present work, we used the Euclidean and Manhattan distance metrics; values assigned to k were 1 and 5; values assigned to T D were those corresponding to the 100th, 97.5th, 95th, and 90th percentiles of the training set distance distribution.
2. The second approach was based on the derivation of a so-called "error model", which predicts the uncertainty of the predictions from an "activity model". An activity model is a classical model that is based on chemical descriptors as independent variables and an endpoint (e.g., a biological activity) as the dependent variable. An error model is derived from the same training set as the associated activity model; the differences from the cross-validated absolute errors (previously generated by the activity model) are the dependent variables, while the independent variables are a series of AD metrics that reflect the accuracy of the predictions that are made by the activity model. Six AD metrics were used as "descriptors": wRMS1 is the weighted root-mean-squared difference between the predicted activity of the target and the observed activity of its five neighbors in the training set; wRMS2 is the weighted root-mean-squared difference between the predicted activity of the five neighbors of the target and the observed activity of the same neighbors; SIMILARITYNEAREST1 and SIMILARITYNEAREST5 are the Euclidean distance of the target from one and five neighbors in the training set, respectively; TREE_SD is the standard deviation of the prediction of the target among the RF trees; and PREDICTED is the prediction for the target. The RF algorithm was used for the error model. Training set chemicals were sorted based on errors in the predictions that were estimated by the error model; then, the value corresponding to a given percentile of the distribution of predicted errors was used as a threshold (T E ), beyond which, chemicals were excluded from the AD. The same T E was applied to the predicted errors that were calculated for the validation set chemicals. For the present work, the values that were assigned to T E were those corresponding to the 100th, 90th, 75th, and 65th percentiles of the training set error distribution.