Models for the No-Observed-Effect Concentration (NOEC) and Maximal Half-Effective Concentration (EC50)

Typical in silico models for ecotoxicology focus on a few endpoints, but there is a need to increase the diversity of these models. This study proposes models using the NOEC for the harlequin fly (Chironomus riparius) and EC50 for swollen duckweed (Lemna gibba) for the first time. The data were derived from the EFSA OpenFoodTox database. The models were based on the correlation weights of molecular features used to calculate the 2D descriptor in CORAL software. The Monte Carlo method was used to calculate the correlation weights of the algorithms. The determination coefficients of the best models for the external validation set were 0.74 (NOAEC) and 0.85 (EC50).


Introduction
The aquatic ecosystem has an important role for humans, animals, plants, and aquatic organisms.Water pollution resulting from humans as consumers and other anthropogenic activities such as agriculture and industry have negative impacts on aquatic organisms.Hazard and risk assessments are needed to investigate the adverse effects of chemicals on water life such as aquatic plants, fish, and algae [1][2][3].
Considering the amount of chemicals released into the environment, fast, accurate, and cost-effective assessments are required.Several organizations and regulations such as the United States Environmental Protection Agency (US EPA), REACH regulation (Registration, Evaluation, Authorization and Restriction of Chemicals) in Europe, the European Center for Validation of Alternative Methods (ECVAM) of the European Union, and the European Union Commission Scientific Committee on Toxicity, Ecotoxicity, and Environment (CSTEE) suggest and encourage an in silico approach for risk assessment [1].Among the in silico approaches, the quantitative structure-activity relationship (QSAR) is a rapid and low-cost method to create models to test a large number of chemicals, reducing the number of time-consuming in vivo tests [4].
There are several assays to investigate aquatic ecotoxicity, and some are validated and standardized [5][6][7][8][9][10].Swollen duckweed (Lemna gibba) has many attributes that make it useful for ecosystem health assessment [11][12][13].The OECD Test Guidelines (TG) 221 are designed to assess the toxicity of substances to freshwater aquatic plants of the genus Lemna (duckweed).Plant cultures are exposed to different concentrations of the test substance for seven days to test the substance-related effects on vegetative growth.The effects are quantified by comparing growth in test solutions to the control and the concentration that causes x% inhibition of growth is determined (e.g., EC50).
The harlequin fly (Chironomus riparius) is used for ecotoxicological studies to assess acute and sub-lethal effects of chemicals in soil and water [14][15][16].The OECD TG 218, 219, and 233 [5,6,8] are designed to assess the effects of prolonged exposure to chemicals on the lifecycle of the sediment-dwelling freshwater dipteran Chironomus sp.The instar chironomid larvae are placed in a sediment-water system together with the test substance.Chironomid emergence and development rates are measured.This test shows the concentration that caused x% reduction in emergence, larvae survival and growth (e.g., EC50), and/or the no-observed-effect concentration (NOEC).The OECD TG 233 [8], compared to the former, covers all the lifecycles of the first generation and the early part of the second generation, and effects are measured as the TG described previously.
The NOEC and EC50 are endpoints largely used in ecotoxicology to define the critical effects caused by long-term exposure (NOEC) [17] and acute exposure (EC50) [18] to substances for non-target organisms.In this paper, we collected NOEC and EC50 data from the EFSA OpenFoodTox database [19] and developed two QSAR models for the prediction of ecotoxicological endpoints: NOEC for the harlequin fly and EC50 for swollen duckweed.These models are based on the correlation weights of molecular features used to calculate the 2D descriptor in the CORAL software (http://www.insilico.eu/coral/accessed on 11 June 2024).The Monte Carlo method has been used for many other models [20][21][22][23][24][25][26][27].

Data
Experimental NOEC for harlequin fly (n = 122) and EC50 swollen duckweed (n = 94) were collected from the EFSA OpenFoodTox database version 6 (https://zenodo.org/records/8120114, accessed on 16 April 2024).These values were converted into the logarithmic scale.The chemical substances with the experimental data for the two endpoints we studied are pesticides; for other families of substances, it is not common to perform this kind of experimental study.The datasets with the information on the compound structures and experimental values are reported in Table S1 (EC50 for swollen duckweed) and Table S2 (NOEC for harlequin fly).The folder "2D chemical structures", available in Supplementary Materials, reports the 2D chemical structures of the compounds for both datasets.

Model
In this work, models of endpoints are regression relations of the following form [28,29]: The DCW(T, N) is the SMILES-based optimal descriptor, which is the sum of the so-called correlation weights.The numerical data on correlation weights were calculated with the Monte Carlo method.T and N are parameters of the Monte Carlo optimization procedure.T is the threshold to define active molecular features extracted from a simplified molecular input-line entry system (SMILES).T = 5 means that all molecular features present in the training set at least five times are active; otherwise, they are inactive.N is the number of epochs (iterations) of the Monte Carlo procedure.Starting from the same dataset, the substances were assigned to active training, passive training, calibration, or validation subsets.This splitting was repeated three times; thus, we developed three models for each endpoint.

Optimal SMILES-Based Descriptors
The descriptor used here was calculated as follows: where S k is the SMILES atom, i.e., the undivided part of the SMILES.In other words, S k is a single symbol ('C', 'O', etc.) or a group of symbols that cannot be examined separately because they define the atom jointly ('Cl', 'Br', etc.).The CW(x) are the correlation weights of the corresponding molecular features.

Monte Carlo Optimization
The optimization of the correlation weights is based on the search for the maximum of the target function (TF) calculated as follows: where IIC is the index of the ideality of correlation.CII is the correlation intensity index.The application of IIC and CII improves the predictive potential of models (i.e., the statistical quality of the validation set) but to the detriment of values on the training set [30,31].The use of IIC and CII is actually "statistical forcing"; the correlation weights are optimized to be more favorable for the calibration set than for the training sets.It is hoped (and often justified) that a favorable effect on the calibration set will be accompanied by a favorable effect on the external validation set.This way, the system prefers a general model rather than a model that is too close to the training set-in this last case, there is the risk of sticking to a model that is overfitting.

Validation of the Models
There are a number of conceptually different criteria for the predictive potential of QSAR models.Some of them are based on calculations with the training set only.Other criteria are calculated using an external validation set.In our case, the models were validated using the following criteria for the predictive potential: • The determination coefficient (R 2 ); • The correlation ideality index (IIC) [30,31]; • The correlation intensity index (CII) [30,31]; • RMSE, which is the root mean squared error; • MAE, which is the mean absolute error; • F, which is the Fischer F ratio.
In this way, we calculated statistics suitable both for the internal and the eternal validation approaches.

The New Models for the Environmental Protection
Our environment needs better protection.Unfortunately, this calls for a huge effort, to cover an incredibly large number of species.In silico models may be a valid help in this effort, but so far, most of the models relate to a few endpoints; typically, in silico models address fish, daphnia, and algae, partly because these are commonly requested by current regulation.Here, we introduce new models for other species, exploiting data from the OpenFoodTox database.In particular, we addressed the harlequin fly (Chironomus riparius) and swollen duckweed (Lemna gibba).Collections of experimental values from sound databases offer a powerful driver for the development of transparent and reproducible in silico models.Thus, this paper promotes this vision.Using the datasets from authoritative databases boosts the development of in silico models, and it is convenient to identify algorithms that can quickly process datasets and develop models.CORAL, the software applied in our study, is an example.

Models for NOEC for Harlequin Fly
Table 1 contains the main statistical characteristics of models for logNOEC for the harlequin fly.The statistical values are for each subset (active training, passive training, calibration, and validation), also considering the leave-one-out statistics, indicated as Q 2 .More details on the statistical parameters are given in Table S3 in Supplementary Materials.These models, for each of the three splits starting from the same overall dataset, are the following: Table S4 in Supplementary Materials shows the experimental and calculated values and the applicability domain (AD) of logNOEC for the model obtained in the case of split 1.

Models for EC50 for Swollen Duckweed
Table 2 gives the statistical characteristics of models for logEC50.More details on the statistical parameters are given in Table S5 in Supplementary Materials.These models are the following, for the three splits:  Considering the results on the two endpoints, there is a similar situation.The results for the validation sets are better than those for the active and passive training sets.For instance, in the model for NOEC (Table 1), the R 2 values for the validation sets are in the range of 0.72-0.75,while the R 2 values for the training sets are in the range of 0.26-0.58.The situation is similar for the second endpoint, EC50 (Table 2).In this case, the R 2 values for the training sets are in the range of 0.26-0.49,always lower than the R 2 values of the validation sets, which are in the range of 0.77-0.85.This is due to the mechanism of the CORAL model which forces the optimization of the values of the calibration sets.
Table S6 in Supplementary Materials shows the experimental and calculated values and the applicability domain (AD) of EC50 for swollen duckweed for the model obtained in the case of split 1.
Compared with the common practice of splitting the set of substances into training and validation sets, our approach is more elaborate.We further split the set of substances to be used for the model development into three subsets, namely active training, passive training, and calibration, as described in the Materials and Methods section.In practice, the initial steps of model building, when active and passive training sets are used, are preliminary phases to identify the components of the models.The structure of the proposed model is decided in the calibration phase.Thus, the results on the calibration set are conceptually closer to those in the usual process where only one training set is used.For this reason, the values of the validation set should be compared only with the calibration set, which in fact are much more similar than those in the preliminary phases of the model using the active and passive training sets, which are related to a not fully developed model.

Discussion
Our models were developed using SMILES as the format to represent the chemical structures.This is one of the most commonly used formats for in silico models.There are several advantages associated with the use of this format.It is quite compact; for instance, the structure of 1-tert-butyl-3,5-dimethyl-2,4,6-trinitrobenzene requires 62 bytes when represented with the SMILES, 998 bytes using a connected table, and 2066 bytes using and MDL MOL file.This speeds up the calculation and makes the use of the developed models more convenient and faster.The SMILES format is somehow more readable, at least for expert users.The SMILES structures can be easily found on the internet, for instance, from PubChem (https://pubchem.ncbi.nlm.nih.gov/accessed on 11 June 2024).There are disadvantages too.The SMILES format does not address the 3D structure, which is, however, not used in most in silico models for ecotoxicity.Indeed, 3D models require more complexity in handling the structure, which has to be optimized, and the benefits resulting from the improved description of the chemical structure are usually not appreciable due to the uncertainty of the experimental data.Furthermore, the SMILES format is not unique, even if we used the canonical SMILES.Different formats may be used, for instance, for the description of the nitro groups, and this has to be standardized before modeling.
The SMILES format can cover chirality.However, in our case, we did not use the chiral representation, since in our datasets, there was no information about the toxicity values specific to individual enantiomers.This is a common situation, and pure enantiomers are only used within specific sectors when the price of the product can afford the physical separation of less active enantiomers.
Beyond these general considerations regarding the use of SMILES for the representation of the substances, there is a particular advantage related to the use of the CORAL software.In CORAL, the SMILES is directly used by the software, without the need to calculate molecular descriptors starting from the 2D or 3D structure, which is typical in most in silico models.Indeed, the software processes directly the SMILES code through the identification of the sequence of characters.This represents an important simplification of the modeling scheme.This is convenient from the point of view of using and comparing models.The obtained results also allow us to conclude that the correlation ideality index can be a useful measure of the predictive potential of models.
The Monte Carlo method involves the use of some random process.The figures (Figures 1 and 2) representing the constructed models as well as their statistical quality (Tables 1 and 2) show that, nevertheless, in both cases, both the determination coefficients and the mean square errors are reproduced during the mentioned random processes.Thus, the method under consideration can be considered useful from both practical and heuristic points of view.
values specific to individual enantiomers.This is a common situation, and pure enantiomers are only used within specific sectors when the price of the product can afford the physical separation of less active enantiomers.
Beyond these general considerations regarding the use of SMILES for the representation of the substances, there is a particular advantage related to the use of the CORAL software.In CORAL, the SMILES is directly used by the software, without the need to calculate molecular descriptors starting from the 2D or 3D structure, which is typical in most in silico models.Indeed, the software processes directly the SMILES code through the identification of the sequence of characters.This represents an important simplification of the modeling scheme.This is convenient from the point of view of using and comparing models.The obtained results also allow us to conclude that the correlation ideality index can be a useful measure of the predictive potential of models.
The Monte Carlo method involves the use of some random process.The figures (Figures 1 and 2) representing the constructed models as well as their statistical quality (Tables 1 and 2) show that, nevertheless, in both cases, both the determination coefficients and the mean square errors are reproduced during the mentioned random processes.Thus, the method under consideration can be considered useful from both practical and heuristic points of view.This study was carried out to assess the feasibility of the approach used for modeling experimental data for two endpoints, which have not been studied so far.The only example we found in the literature refers to a model for swollen duckweed, using 13 substances in total; thus, it can be considered a preliminary study, without a proper validation of the results [32].The paucity of existing models demonstrates the need to generate new models for these poorly represented endpoints.However, it should be noted that it is possible to expand the search space of this modeling system by assessing mechanistic interpretations and deeper analysis of the reproducibility of the results of the described random processes on large datasets.This is what is planned to be carried out in the near future.

Conclusions
We introduced new QSAR models for two endpoints of ecotoxicological interest for the first time.The data came from the EFSA database OpenFoodTox.This illustrates the easy way to exploit data that are more and more frequently available to protect our environment by better considering a larger set of species than those usually addressed.We used CORAL software to obtain models, which simply need the SMILES, without calculating molecular descriptors.We investigated different ways to optimize the results using new features of the CORAL software.The IIC and CII components of the target function are significant, since without these, the models are not satisfactory (their predictive potentials are low).These models will be implemented in the VEGAHUB website (www.vegahub.eu;accessed on 11 June 2024) for free use.

Supplementary Materials:
The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/toxics12060425/s1,Table S1: information on the compounds (ID, CAS number, name, and SMILES) and the experimental values used for the EC50 model of swollen duckweed.Table S2: information on the compounds (ID, CAS number, name, and SMILES) This study was carried out to assess the feasibility of the approach used for modeling experimental data for two endpoints, which have not been studied so far.The only example we found in the literature refers to a model for swollen duckweed, using 13 substances in total; thus, it can be considered a preliminary study, without a proper validation of the results [32].The paucity of existing models demonstrates the need to generate new models for these poorly represented endpoints.However, it should be noted that it is possible to expand the search space of this modeling system by assessing mechanistic interpretations and deeper analysis of the reproducibility of the results of the described random processes on large datasets.This is what is planned to be carried out in the near future.

Conclusions
We introduced new QSAR models for two endpoints of ecotoxicological interest for the first time.The data came from the EFSA database OpenFoodTox.This illustrates the easy way to exploit data that are more and more frequently available to protect our environment by better considering a larger set of species than those usually addressed.We used CORAL software to obtain models, which simply need the SMILES, without calculating molecular descriptors.We investigated different ways to optimize the results using new features of the CORAL software.The IIC and CII components of the target function are significant, since without these, the models are not satisfactory (their predictive potentials are low).These models will be implemented in the VEGAHUB website (www.vegahub.eu;accessed on 11 June 2024) for free use.

Figure 1 .
Figure 1.Splitting correlations for the active and passive training sets for logNOEC models for the harlequin fly.Black dots represent the substances in the subset.

Figure 1 .
Figure 1.Splitting correlations for the active and passive training sets for logNOEC models for the harlequin fly.Black dots represent the substances in the subset.

Figure 2 .
Figure 2. Splitting correlations for the active and passive training, calibration, and validation sets for logEC50 for swollen duckweed.Black dots represent the substances in the subset.

Figure 2 .
Figure 2. Splitting correlations for the active and passive training, calibration, and validation sets for logEC50 for swollen duckweed.Black dots represent the substances in the subset.

Table 1 .
The statistical characteristics of model logNOEC for the harlequin fly.A, P, C, and V are active training, passive training, calibration, and validation sets.
* Please see Section 2.3 for the description of the statistical parameters.

Table 2 .
The statistical characteristics on model logEC50.A, P, C, and V are active training, passive training, calibration, and validation sets.
* Please see Section 2.3 for the description of the statistical parameters.