Monte Carlo Models for Sub-Chronic Repeated-Dose Toxicity: Systemic and Organ-Specific Toxicity

The risk-characterization of chemicals requires the determination of repeated-dose toxicity (RDT). This depends on two main outcomes: the no-observed-adverse-effect level (NOAEL) and the lowest-observed-adverse-effect level (LOAEL). These endpoints are fundamental requirements in several regulatory frameworks, such as the Registration, Evaluation, Authorization and Restriction of Chemicals (REACH) and the European Regulation of 1223/2009 on cosmetics. The RDT results for the safety evaluation of chemicals are undeniably important; however, the in vivo tests are time-consuming and very expensive. The in silico models can provide useful input to investigate sub-chronic RDT. Considering the complexity of these endpoints, involving variable experimental designs, this non-testing approach is challenging and attractive. Here, we built eight in silico models for the NOAEL and LOAEL predictions, focusing on systemic and organ-specific toxicity, looking into the effects on the liver, kidney and brain. Starting with the NOAEL and LOAEL data for oral sub-chronic toxicity in rats, retrieved from public databases, we developed and validated eight quantitative structure-activity relationship (QSAR) models based on the optimal descriptors calculated by the Monte Carlo method, using the CORAL software. The results obtained with these models represent a good achievement, to exploit them in a safety assessment, considering the importance of organ-related toxicity.


Introduction
The European Commission (EC) requires continuous efforts to develop new approaches in the safety assessment of chemicals consistent with the regulatory framework. The European Directive on cosmetics, for example, laid out a new vision in Europe for the safety assessment of cosmetics, moving from the consolidated, animal-based, toxicological approach to a novel paradigm that bans the use of animals for toxicity tests [1,2]. Regulators explicitly require a deep examination of RDT, since it gives crucial details on the critical adverse effects resulting from repeated exposure to a certain substance in a limited period [3,4]. REACH, which prescribes within its annexes an assessment on RDT [3], firmly supports the application of new-approach methodologies in order to provide forward-looking solutions in the field of risk assessment.
RDT studies usually employ a wide array of standards; indeed, they present different exposure times and routes as well as various models, such as rodent or non-rodent species [5]. The exposure times can be divided into three groups: (i) sub-acute toxicity studies, which have a duration of up to 28 days (usually 2-4 weeks); (ii) sub-chronic toxicity,

Characterization of the Datasets
The datasets contain chemicals from different sources covering several functional categories, such as pesticides, drugs, food additives and cosmetics. Figure 1 shows the histograms of the ToxPrint chemotypes identified in the datasets, reported in Tables S1-S4 of the Supporting Information. The chemotypes matching the structures in the datasets are shown if the frequency is over 10%. Out of 729 chemotypes, respectively, 376, 318, 211 and 328 were found in the general, kidney, brain and liver datasets. The most frequent ToxPrint chemotypes in all datasets were aromatic benzenes, C=O carbonyl generic, halides, alkanes, carboxamide, aminocarbonyl, heterocycles, aliphatic-aromatic ether and aromatic amines, all with over a 16% frequency in the datasets. The chemical space, covered by chemicals in terms of the structural and physico-chemical properties, is described in Tables S25-S28  (Supplementary Materials). Datasets present a similar structural and physico-chemical profile. This represents important evidence, which allows us to say that these models could be suitable for predicting external data with a chemical space in line with what we defined here. lows us to say that these models could be suitable for predicting external data with a chemical space in line with what we defined here.

Models Developed
The CORAL models were based on the hybrid optimal descriptors extracted from the simplified molecular-input line-entry system (SMILES) and molecular graph. The models for the predictions of NOAEL and LOAEL in rats for sub-chronic repeated-dose toxicity are given in Equations (1)-(8), while Table 1 shows the statistical characteristics of the models. We calculated the validation metrics for each model, such as the determination coefficient (R 2 ), the leave-one-out (LOO) cross-validated determination coefficient (Q 2 ), predictive R 2 (R 2 pred), the criteria of predictability Q 2 F3, the concordance correlation coefficient (CCC), the index of ideality of correlation (IIC), the mean absolute error (MAE), the root mean squared error (RMSE) and the Fischer F-ratio (F) [12][13][14]. The predictions and experimental values, together with the extracted attributes of the models, are reported in Tables S9-S24 in the Supplementary Materials.

Models Developed
The CORAL models were based on the hybrid optimal descriptors extracted from the simplified molecular-input line-entry system (SMILES) and molecular graph. The models for the predictions of NOAEL and LOAEL in rats for sub-chronic repeated-dose toxicity are given in Equations (1)-(8), while Table 1 shows the statistical characteristics of the models. We calculated the validation metrics for each model, such as the determination coefficient (R 2 ), the leave-one-out (LOO) cross-validated determination coefficient (Q 2 ), predictive R 2 (R 2 pred), the criteria of predictability Q 2 F3 , the concordance correlation coefficient (CCC), the index of ideality of correlation (IIC), the mean absolute error (MAE), the root mean squared error (RMSE) and the Fischer F-ratio (F) [12][13][14]. The predictions and experimental values, together with the extracted attributes of the models, are reported in Tables S9-S24 in the Supplementary Materials. The most important criterion for assessing the predictive potential of a model is the statistical quality of an external validation set. In the general models, the predictive ability R 2 in the validation sets reached 0.55 and 0.53, respectively, for NOAEL and LOAEL. The external validation metrics of the CCC and Q 2 F3 for NOAEL were 0.69 and 0.53, and 0.63 and 0.32 for LOAEL. The LOO cross-validation technique Q 2 was used to assess the robustness of the models: it ranged from 0.50 to 0.54 for NOAEL and 0.49 to 0.54 for LOAEL. A value of 0.5 is an acceptable threshold for the Q 2 reported in the literature [13].
The performances were better in the models for the organ-specific toxic effects than in the general models. In the kidney models, the predictive ability R 2 in the validation sets reached 0.59 for NOAEL and 0.69 for LOAEL. The external validation metrics, CCC and Q 2 F3 , were 0.75 and 0.84 for NOAEL, and 0.82 and 0.81 for LOAEL. The Q 2 ranged from 0.48 to 0.58 for NOAEL and from 0.49 to 0.63 for LOAEL. In the brain models, R 2 in the validation sets reached 0.53 for NOAEL and 0.69 for LOAEL. The agreement between the experimental and calculated values, expressed by the metric CCC, was 0.67 for NOAEL and 0.80 for LOAEL. The criterion of predictability Q 2 F3 was 0.81 for NOAEL and 0.82 for LOAEL. Moreover, Q 2 ranged between 0.46 and 0.51 for NOAEL and 0.44 and 0.61 for LOAEL. Last but not least, in the liver models, R 2 reached 0.55 and 0.61 in the validation sets for NOAEL and LOAEL, respectively. The metric CCC reached similar values for NOAEL and LOAEL. The criterion of predictability Q 2 F3 was 0.55 for both NOAEL and LOAEL. The robustness of the models expressed by Q 2 reached similar values: higher than 0.72 for NOAEL and LOAEL. It can be noticed that LOAEL performs better than NOAEL in all the cases. Considering the complexity of modeling these endpoints, the overall performance is quite satisfactory for all the models. It is possible to observe that the performance of the liver models is better than the other organ-specific models, showing higher robustness, expressed by Q 2 , and lower MAE and RMSE values.

General Models
The general NOAEL model was built up using the equivalent distribution in the active training set (n = 140), passive training set (n = 140), calibration set (n = 140), and validation set (n = 141). The model was calculated as follows: The general LOAEL model was built up using the equivalent distribution in the active training set (n = 142), passive training set (n = 142), calibration set (n = 137), and validation set (n = 137). The model was calculated as follows:

Kidney Models
The NOAEL kidney model was built up using the equivalent distribution in the active training set (n = 95), passive training set (n = 95), calibration set (n = 45), and validation set (n = 45). The model was calculated as follows: The LOAEL kidney model was built up using the equivalent distribution in the active training set (n = 97), passive training set (n = 102), calibration set (n = 46), and validation set (n = 38). The model was calculated as follows:

Brain Models
The NOAEL brain model was built up using the equivalent distribution in the active training set (n = 23), passive training set (n = 22), calibration set (n = 23), and validation set (n = 22). The model was calculated as follows: The LOAEL brain model was built up using the equivalent distribution in the active training set (n = 22), passive training set (n = 23), calibration set (n = 22), and validation set (n = 23). The model was calculated as follows:

Liver Models
The NOAEL liver model was built up using the equivalent distribution in the active training set (n = 97), passive training set (n = 94), calibration set (n = 30), and validation set (n = 31). The model was calculated as follows: The LOAEL liver model was built up using the equivalent distribution in the active training set (n = 96), passive training set (n = 95), calibration set (n = 30), and validation set (n = 31). The model was calculated as follows:

Discussion
In the present study, the in silico models were developed for predicting NOAEL and LOAEL for the sub-chronic toxicity data (90 days of exposure). The development of the in silico models for RDT represents a challenging task [9,15]. The uncertainty that distinguishes the process derives, in particular, from the experimental design [6]. Indeed, different designs can produce contrasting NOAEL and LOAEL for the same chemical [9]. This variability makes the development of computational models extremely challenging. In recent years, only a few NOAEL and LOAEL models have been developed [6,10,15]. Table 2 shows the statistical characteristics that are in common with the models developed here, which we will discuss in Section 3.1. The complexity of developing models for NOAEL and LOAEL arises also from the fact that NOAEL and LOAEL are related to an extended range of effects that could arise in different target organs. These aspects, as well as the extensive modes and mechanism of actions, are difficult to consider separately, and often only partial information is provided. In this view, all of these features contribute to the final toxicity value. This study proposed new models for examining the critical effects that could arise in target organs, bearing in mind the abovementioned criticisms and limitations. To the best of our knowledge, this is one of the few attempts to develop NOAEL and LOAEL models for organ-specific toxicity. Predicting target-organ toxicity is a focal point in the next-generation risk assessment (NGRA) [16] and in the future, predictions of organ-level effects, in particular for the liver, should represent the basis of in silico modeling [17]. Moreover, data with a more defined understanding of the mechanisms/effects and/or adverse outcome pathways (AOP) is needed to improve the prediction strategy [17]. A thorough investigation of the ADME processes of a chemical, i.e., its ability to penetrate certain biological barriers and its distribution within the human body, is fundamental in the forward-looking NGRA approach [16]. Additionally, the physiologically-based kinetic (PBK) models, which are fundamental in determining the exposure to possible toxicants, can provide fundamental indications of concerns linked to target organs [18]. Considering this context, the QSAR models for organ-specific toxicity, which we have developed in this work, can provide outcomes useful for creating a future integrated-testing strategy. Finally, our work follows the regulatory updates and aims to improve the riskassessment procedure, introducing new alternative methods. NOAEL and LOAEL are fundamental endpoints for human health risk assessment under different regulatory contexts, such as cosmetics and REACH [3,4]. A significant use of time and money characterizes the in vivo RDT studies as well as the substantial number of animals involved. According to the cosmetics regulation, the in vivo RDT studies are not part of the testing procedures anymore [1], and REACH [3] calls for research to reduce the number of animals, according to the 3R principle [19]. Therefore, there is a pressing demand for new and valid alternatives for these endpoints.
However, the uncertainty of the NOAEL and LOAEL values, which is strongly dependent on the experimental design, tends to be reflected in the predictions obtained with the in silico models. Moreover, the link between the descriptors and systemic effects could not be linearly explained, since there is no weighty mechanistic principle [20]. Despite all these limitations, a reliable predictivity for NOAEL and LOAEL may be obtained using a combination of several techniques, such as QSARs, read-across, physiologicallybased pharmacokinetic modeling (PBPK), the threshold of toxicological concern (TTC), and in vitro methods.

Comparison with Other Models
To the best of our knowledge, there are no NOAEL and LOAEL regression models related to organ-specific toxicity reported in the literature. We will therefore compare the general NOAEL and LOAEL models with the published and implemented models in free software.
In the last few years, other models have been developed for NOAEL and LOAEL in rats for sub-chronic repeated-dose toxicity (90 days of exposure). One model is currently freely available for predictions in the VEGA software [10,21]. The authors in [10] looked to studies in rodents to build NOAEL models. They considered studies with values for 28 days of treatment, dividing by 3 in order to approximate 90 days [4]. The dataset collected consisted of 140 chemicals with toxicity data from the Integrated Risk Information System (IRIS), Hazard Evaluation Support System (HESS) and Munro databases (https://www.epa. gov/iris, accessed on 5 January 2021) and from the OECD QSAR Toolbox, version 3.2 (https://qsartoolbox.org/, accessed on 5 January 2021). The performance of the model [10] is reported in Table 2.
In 2020, two new models (NOAEL and LOAEL) were developed [6]. These models were built using toxicity data from the Fraunhofer RepDose database, a non-public database. Studies only for the sub-chronic data from 90 days of oral toxicity in rats (Rattus norvegicus) were included. The models were based on 326 compounds. The statistical characteristics of the base model and the model based on chemicals inside the applicability domain (AD) are reported in Table 2. In this work, the models are only based on the public sources collected recently, covering more compounds than the other models reported in Table 2. The presence of the various functional groups and the chemical domains described by the physico-chemical properties, highlight the chemical heterogeneity of our datasets.
The most significant criterion for comparing the predictive potential of the models considers the statistical quality of an external (validation) set. For the NOAEL model, the validation set in [6] consisted of 38 compounds reaching R 2 = 0.65. When excluding the compounds outside the AD, the predictive capacity reached R 2 = 0.69, with 33 compounds covered. In the present work, the NOAEL model was based on a validation set of 141 compounds and reached R 2 0.55. Excluding compounds outside the AD improved the R 2 , reaching 0.61 with a 77% coverage of compounds. The NOAEL model in [10] reached an R 2 of, respectively, 0.60 and 0.61 in the whole validation set and in the validation inside the AD. However, the validation set had fewer compounds than the previous ones.
The LOAEL model in [6] reached R 2 = 0.59. The predictive ability reached R 2 = 0.62, excluding compounds outside the AD, with 33 compounds inside it. In this work, the LOAEL model was based on a validation set of 137 compounds and reached R 2 0.53. R 2 was similar, excluding compounds outside the AD with 74% of compounds.

Dataset Collection and Preparation
The datasets used to develop the QSAR models were built using the NOAEL and LOAEL experimental sub-chronic (90 days of exposure) oral-toxicity data in rats, from different publicly available databases, in line with their conformity to the OECD guideline, 408 [22]. These sources include the Munro and HESS databases, both of which are available from the OECD QSAR Toolbox, version 4.4 (https://qsartoolbox.org/, accessed on 5 January 2021), the IRIS database from the U.S. Environmental Protection Agency (EPA) (https://www.epa.gov/ iris, accessed on 5 January 2021), the COSMOS database [23] (https://www.ng.cosmosdb.eu/ about, accessed on 5 January 2021), the U.S. EPA's ToxRefDB, v. 2.0 [24], and the European Food Safety Authority (EFSA) OpenFoodTox database [25,26]. In order to ensure consistent data, we took into account only data from the sub-chronic toxicity studies of oral exposure in rats (90 days, but also 91 and 92 days in order to extend the number of data), removing variability and inconsistencies due to interspecies' differences [27]. Further, 2D SMILES chemical structures were retrieved for each substance. In some cases, SMILES were already present in the databases, otherwise, they were retrieved using a semi-automated data curation and quality checking workflow, implemented in the KNIME Analytics Platform, v. 4 [28,29].
Once a full dataset was set up under these rules, we rejected inorganic compounds, metal complexes and data related to mixtures or polymeric structures. Ionized structures were neutralized and counterions were excluded. Duplicates were analyzed and detected by checking SMILES at the 2D level, and by the CAS number. For compounds with multiple data, the lowest value was preserved, adopting a conservative approach. The experimental values (in mg/kg body weight (bw) per day) were converted to a logarithmic scale. Finally, to make the data more consistent, only chemicals with an NOAEL equal to or lower than LOAEL were maintained.
Another step was to retrieve data related not only to general-systemic toxicity but also to organ-specific toxicity. Therefore, following the same procedure, we built one dataset for general NOAEL and LOAEL and separate datasets using NOAEL and LOAEL data where the indication of the adverse effect is associated with a specific target/organ for the liver, kidney and brain. This provides a general dataset of 573 unique chemicals. For the organ-related data, we obtained 353 chemicals for the liver, 289 for the kidney and 91 for the brain. The datasets containing IDs, CAS numbers, SMILES and experimental values are reported in the Supporting Information, Tables S1-S4. A further curation aimed at the development of a specific model will be described in the next sections. The complete structure of these datasets is shown in Table 3.

Characterization of the Datasets
The chemical structures of the molecules (Supplementary Materials, Tables S1-S4) were analyzed with the ChemoTyper tool, v. 1.0, which searches and identifies 729 ToxPrint chemotypes (https://www.chemotyper.org, accessed on 16 April 2021). The complete lists and histograms of the chemotypes identified in the datasets can be found, respectively, in Tables S5-S8 and in Figures S1-S4 (Supplementary Materials).
Chemicals were also analyzed on the basis of their physico-chemical properties. These were calculated using the 'Molecular Properties' node of the analytics platform, KNIME (https://www.knime.com/knime-analytics-platform, accessed on 16 April 2021). The properties included the molecular weight, hydrogen bond acceptors, hydrogen bond donors, rotatable bonds count, Lipinski's rule of five, topological polar surface area, atomic polarizabilities, bond polarizabilities and MannholdlogP. Tables S25-S28 in the Supplementary Materials show the mean, maximum and minimum values of the selected properties for each dataset.

Further Curation
To build the models, we removed the compounds with a low-frequency distribution of NOAEL and LOAEL values, as they were considered noise. For the liver dataset, an analysis was done to identify the possible structural or response outliers and activity cliffs, employing the Small Dataset Curator tool, v. 1.0.0 [30,31] and the istSimilarity tool, v. 1.0.7 [32]. The final number of compounds and the range of the NOAEL and LOAEL endpoints are summarized in Table 3. Figures S5-S8 in the Supplementary Materials show the frequency distribution of the experimental values.

Model Development and Optimal Descriptors
We used the CORAL software, v. 2020 (www.insilico.eu/coral, accessed on 6 June 2021), to build the QSAR models with the Monte Carlo method. The models were based on the so-called hybrid optimal descriptor, accounting for the correlation weights (CW) of both the SMILES and hydrogen-suppressed molecular graph (HSG) attributes [33,34].
The datasets were split randomly into the active training, passive training, calibration and validation sets [33,35]. Each subset had its own task. The task for the active training set was to calculate and identify which correlation weights gave the largest correlation between the experimental and predicted endpoint for the active training set. The task for the passive training set was to analyze if these correlation weights gave a reasonable correlation coefficient for similar compounds. The calibration set was used to detect overtraining. The validation set was used to evaluate the predictive potential of the model, using substances never used in the phase of model building.
The general form of a CORAL model can be described by the following one-variable Equation (9): where C 0 and C 1 are the intercept and the slope, DCW is the optimal descriptor, T* and N* are the best threshold and the best number of epochs of the Monte Carlo optimization. The best threshold is an integer to classify a molecular feature as active or rare. The building process never involves rare features. Both thresholds give the best statistical performance for the calibration set. The optimal descriptor-a function of molecular features-that we applied, comprises the following components, as listed in Equation (10): where,  The others APP attributes are defined in the same manner. Table 4 contains the types of attributes involved in the models with corresponding references. Table 4. SMILES and molecular graph attributes.

Attribute Description
S, SS, SSS Single SMILES atom, a combination of two SMILES atoms and a combination of three SMILES atoms [36,37].
NNC Nearest neighbors codes [33,39]. The correlation weights were based on the index of ideality of correlation (IIC) and on the correlation intensity index (CII), both of which were calculated using the compounds in the calibration set, as described in the literature [14,41]. Table 5 shows details of the eight models developed. Screenshots of the methods M1-M8, utilized to build up the models, are shown in Figures S9-S16 (Supplementary Materials).

Definition of the Applicability Domain
The applicability domain (AD) implemented in CORAL is based on a probabilistic measure defect SMILES , which measures the quality of the molecular features extracted from the compound [33,36]. It is defined according to the distribution of the molecular features (S k ) extracted from SMILES or HSG in the active training and calibration sets. The defect of S k was calculated as: where P(S k ) and N(S k ) are, respectively, the probability and frequency of S k in the active training set. P (S k ) and N (S k ) are, respectively, the probability and frequency of S k in the calibration set. The defect of the SMILES was calculated as: where NA is the number of non-blocked molecular features extracted from SMILES or HSG. The criterion for the reliability of a prediction is defined as below in Equation (13): where D is the average of the statistical SMILES-defect for the compounds in the active training set [33]. Compounds having defect SMILES greater than the calculated threshold 2 × D were considered outside the AD. In contrast to a traditional interpretation of the applicability domain, this approach provides a semi-qualitative criterion, named the statistical defect. Indeed, this is a measure of the presence of the molecular features observed in SMILES. If the majority of the molecular features are rare, then the SMILES have a large defect, while if the majority of the molecular features are represented many times in both the training set and in the calibration set, the defect is small. Tables S9−S24 in the Supplementary Materials report the statistical quality and the calculated threshold for the models applying the AD procedure described.

Conclusions
The present study provides new QSAR regression models for NOAEL and LOAEL, in particular, new models for NOAEL and LOAEL for the liver, kidney and brain. Considering the difficulty to obtain a robust model for these endpoints, predictions are quite satisfying, particularly for the organ-specific models. This is encouraging, with a view to a future integrated approach, where the prediction of NOAEL and LOAEL can be used to support a new multi-tier assessment strategy. Particularly convincing are the results for the liver, an organ of central importance for the risk assessment of various kinds of chemicals. These QSAR models will be implemented in the VEGA platform (https://www.vegahub.eu, accessed on 2 December 2021). The VEGA system will also show the six most similar substances and an evaluation of the reliability of the predictions. They have the important advantage of being freely available for the end users. Therefore, they can be used by the scientific community, worldwide, to support their activities. These models can be used for a hazard characterization of chemicals as well as for screening, the prioritization of chemicals and as supporting evidence for a risk-assessment approach.
Moreover, these models offer an innovative perspective for the risk-assessment tools developed by the authors and their colleagues, in the context of the LIFE VERMEER project (https://www.life-vermeer.eu/, accessed on 3 May 2022).
Finally, as we also previously highlighted within the text, some limitations were present in this work due to the intricacy of the experimental design, the dependency of the dose administered and the complexity of the toxicity mechanisms behind the adverse effects. It is worth underlining that, in the future, these models could be refined, applying the concept of the BMD, which is more reliable than NOAEL, since it is less influenced by the sample size and selection of the treatment dose. Moreover, with the awareness of their importance to refine the prediction of repeated-dose toxicity, a deeper evaluation of relevant issues, such as the role of the metabolism and mechanism of actions, will be taken into account in order to increase the work we are doing to provide advanced strategies for risk assessment and regulatory purposes.