Toxicity Assessment of the Binary Mixtures of Aquatic Organisms Based on Different Hypothetical Descriptors

Modern industrialization has led to the creation of a wide range of organic chemicals, especially in the form of multicomponent mixtures, thus making the evaluation of environmental pollution more difficult by normal methods. In this paper, we attempt to use forward stepwise multiple linear regression (MLR) and nonlinear radial basis function neural networks (RBFNN) to establish quantitative structure–activity relationship models (QSARs) to predict the toxicity of 79 binary mixtures of aquatic organisms using different hypothetical descriptors. To search for the proper mixture descriptors, 11 mixture rules were performed and tested based on preliminary modeling results. The statistical parameters of the best derived MLR model were Ntrain = 62, R2 = 0.727, RMS = 0.494, F = 159.537, Q2LOO = 0.727, and Q2pred = 0.725 for the training set; and Ntest = 17, R2 = 0.721, RMS = 0.508, F = 38.773, and q2ext = 0.720 for the external test set. The RBFNN model gave the following statistical results: Ntrain = 62, R2 = 0.956, RMS = 0.199, F = 1279.919, Q2LOO = 0.955, and Q2pred = 0.855 for the training set; and Ntest = 17, R2 = 0.880, RMS = 0.367, F = 110.980, and q2ext = 0.853 for the external test set. The quality of the models was assessed by validating the relevant parameters, and the final results showed that the developed models are predictive and can be used for the toxicity prediction of binary mixtures within their applicability domain.


Introduction
It has been widely recognized that toxic chemicals in the environment do not exist individually but are mixtures of each other; thus, research on both single and mixed toxic compounds is important [1,2]. Today, water system pollution is worthy of attention. With modern industrial development, various water bodies suffer from damage due to industrial wastewater discharge and the discharge of pesticides, organic chemicals, and other pollutants from all walks of life, with the destruction of water bodies posing a great threat to the ecological environment and human health [3,4]. In the aquatic world, water pollutants hardly ever exist as a compound alone but rather contaminate water bodies as a mixture. Under existing human risk assessment regulations, the toxicity performance of toxic chemicals is simply assessed by information on the toxicity of their individual compounds, whereas the fact is that the toxicity performance of an individual compound is vastly divergent from that of a mixture and that interactions between the components of a mixture can lead to synergistic and antagonistic reactions of an individual compound, resulting in significant changes in the toxicity performance of a mixture compared to a single one [5]. To address these issues, a risk assessment of the toxicity of mixtures seems to be greatly important and thus necessary.

Model Development for Individual Compounds
As shown in Figure 1, in the current study, 35 compounds present in the aque environment were used for model development. First, a total of 614 descriptors were tained after molecular optimization. Then, 348 nonconforming descriptors were eli nated after the CODESSA software heuristic method (HM), and 85 descriptors w pruned out of the descriptor pool after descriptor relevance screening. Finally, 181 scriptors were left to build the model. Through the forward stepwise multiple regress method, the five representative descriptors were selected for the construction of both MLR and the RFBNNs models. (The total and the eliminated descriptor in this study present in the supplementary material.) To evaluate the models, the leave-out many cro validation (LOM) and Y randomization test methods were performed. For the individ groups, the results are listed in Table 1.

Model Development for Individual Compounds
As shown in Figure 1, in the current study, 35 compounds present in the aqueous environment were used for model development. First, a total of 614 descriptors were obtained after molecular optimization. Then, 348 nonconforming descriptors were eliminated after the CODESSA software heuristic method (HM), and 85 descriptors were pruned out of the descriptor pool after descriptor relevance screening. Finally, 181 descriptors were left to build the model. Through the forward stepwise multiple regression method, the five representative descriptors were selected for the construction of both the MLR and the RFBNNs models. (The total and the eliminated descriptor in this study are present in the supplementary material.) To evaluate the models, the leave-out many cross-validation (LOM) and Y randomization test methods were performed. For the individual groups, the results are listed in Table 1. As seen from the table, five descriptors were chosen for the construction of the relationship through a forward stepwise multiple linear regression approach: (1) minimum atomic state energy for a C atom (Min-C); (2) relative number of N atoms (Rn-N); (3) total point charge component of the molecular dipole (Tot-pc); (4) maximum e-n attraction for a C-H bond (Max-C-H); and (5) HOMO energy (HOMO). It is expressed by Equation (1): N train = 28, R 2 = 0.887, RMS = 0.398, F = 204.660, Q 2 LOO = 0.887, and R 2 pred = 0.938; and N test = 7, R 2 = 0.987, RMS = 0.297, F = 374.332, and q 2 ext =0.933. The statistical results reveal that the model has excellent statistical reliability for the internal training set and outstanding predictive power for the external test set (the minimum accepted value for Q 2 LOO , R 2 pred , and q 2 ext is 0.5, and for R 2 is 0.6; in addition, the smaller the RMS value, the larger the F value, and the higher the quality of the model). In Table 2, the predicted pEC50 values, experimental pEC50 values, residual values, and individual compound names for an individual compound by each of the two models are shown, along with figures depicting the experimental and predicted value curves for the training and test sets under each of the two models in Figure 2. An assessment of the correlation between the individual descriptors is necessary for binary mixture descriptor generation, and when the pairwise correlation between two individual descriptors fell below 0.8 [19], it was demonstrated that the individual descriptors were highly independent of each other and were able to avoid chance correlation effects due to interdependence, for which we examined the cross-correlation matrix of the five descriptors, as shown in Table 3.

Model Development for Binary Mixture Compounds
In the following, five descriptors were chosen to build the mixture compound models. Regarding the generation of binary mixture descriptors, although it has been shown that descriptors are generated simply by addition [20], which is not the case, it is based on this that 11 different mixing rules (Table 1) were applied to generate the binary mixture descriptors. Not only was the choice of the hybrid rule compared in terms of the evaluation of the model quality, but also the ninth hybrid rule was more dominant in terms of the contribution of the descriptors to the molar ratio [11], and thus this hybrid rule was finally used for the construction of the hybrid model. The equation for this is An assessment of the correlation between the individual descriptors is necessary for binary mixture descriptor generation, and when the pairwise correlation between two individual descriptors fell below 0.8 [19], it was demonstrated that the individual descriptors were highly independent of each other and were able to avoid chance correlation effects due to interdependence, for which we examined the cross-correlation matrix of the five descriptors, as shown in Table 3. Table 3. Inter-correlation between the five descriptors.

Rn-N HOMO Tot-pc Min-C Max-C-H
Rn-N +1.000

Model Development for Binary Mixture Compounds
In the following, five descriptors were chosen to build the mixture compound models. Regarding the generation of binary mixture descriptors, although it has been shown that descriptors are generated simply by addition [20], which is not the case, it is based on this that 11 different mixing rules (Table 1) were applied to generate the binary mixture descriptors. Not only was the choice of the hybrid rule compared in terms of the evaluation of the model quality, but also the ninth hybrid rule was more dominant in terms of the contribution of the descriptors to the molar ratio [11], and thus this hybrid rule was finally used for the construction of the hybrid model. The equation for this is In addition, it will allow molecules with larger descriptor values to be more dominant in the case of large differences between the component descriptors.
When looking at the values of the statistical parameters for the internal validation, it can be demonstrated that the model has good robustness in conjunction with statistical reliability, while the external validation parameters of the model also indicate that the model has a better predictive power. Furthermore, one has a higher requirement for the model with the statistical covariates presented in Table 4. In a standard comparison of the statistical parameters (R 2 > 0.6, q 2 ext > 0.5, and k ≈ 1 (k is the slope of the regression line through the origin)), the quality of the model is in line with the requirements. The set of relevant statistics predicted by the mixtures under each of the two models is shown in Table 5; the graphs of the experimental and predicted values for the training set as well as the test set are shown in Figure 3. Moreover, Figure 4 shows the scatter plots of the residuals for all data under both models. Table 5. The number of chemicals in the mixtures, ratio of toxic unit, experimental pEC50 mix, predicted pEC50 mix, and their corresponding residuals.

Mixture
No.   In addition, it will allow molecules with larger descriptor values to be more dominant in the case of large differences between the component descriptors.

Chemicals in the Mixture
Finally, the MLR for the mixture in which the equation expression is constructed based on the mixing rules and the selected descriptors is as follows: When looking at the values of the statistical parameters for the internal validation, it can be demonstrated that the model has good robustness in conjunction with statistical reliability, while the external validation parameters of the model also indicate that the model has a better predictive power. Furthermore, one has a higher requirement for the model with the statistical covariates presented in Table 4. In a standard comparison of the statistical parameters (R 2 > 0.6, q 2 ext > 0.5, and k ≈ 1 (k is the slope of the regression line through the origin)), the quality of the model is in line with the requirements.
The set of relevant statistics predicted by the mixtures under each of the two models is shown in Table 5; the graphs of the experimental and predicted values for the training set as well as the test set are shown in Figure 3. Moreover, Figure 4 shows the scatter plots of the residuals for all data under both models.

RBFNN Results Analysis
Generally, nonlinear models have outstanding predictive power compared to linear ones. In the current work, an RBFNN model was constructed using the same descriptors as those used to construct the MLR model, and the quality of the model was evaluated by randomly dividing the training set as well as the test set. In the construction of the RBFNN model, the structure of the three-layer network was constructed as 5-nk-1, denoting the number of cells in the input, hidden, and output layers. For the radial basis function (RBF), the width (r) range was controlled by starting at 0.1 in increments of 0.1 until increasing to 4. The optimal width value found for the individual compound RBFNN model was r = 4, and the optimal width value for the mixture was r = 1.6.    The prediction data of the RBFNN models for the individual compound and the binary mixture are shown in Tables 1 and 2, respectively, and the plots of the experimental and predicted values for the training set and the test set are shown in Figures 2 and 3, respectively. In addition, the scatter plots of the residuals of the two models are also shown in Figure 4. The statistical parameters for the group of the individual compounds were N train = 28, R 2 = 0.864, RMS = 0.436, F = 165.309, Q 2 LOO = 0.864, and R 2 pred = 0.847 for the training set, and N test = 7, R 2 = 0.941, RMS = 0.466, F = 79.300, and q 2 ext = 0.834 for the test set, and it is apparent from the observations that the model exhibits superior reliability as well as predictiveness. The statistical parameters of the nonlinear model for the mixture were N train = 62, R 2 = 0.956, RMS = 0.199, F = 1279.919, Q 2 LOO = 0.955, and Q 2 pred = 0.855 for the training set; and N test = 17, R 2 = 0.880, RMS = 0.367, F = 110.980, and q 2 ext = 0.853 for the test set. Analysis of the statistical results shows that the robustness of the model as well as its predictive power is somewhat enhanced compared to the MLR modeling. In addition, the other parameters of the model quality were calculated as shown in Table 4.

Validation of the Models
Y randomization tests are typically applied to confirm the degree of chance correlation of regression models. In the current work, 10 tests were conducted for each of the two models, and the amount of parameter validation for both models is shown in Table 6. As seen from the table, the low R 2 and RMS, along with the high MAE values (see below), indicated that the chance correlation of the models will barely exist [21]. The expression for MAE is where n is the number of the example, y (a) is the experimental value of a single example, andŷ (b) is the predicted value of a single example.
In the following, a fivefold cross-validation algorithm was applied to assess the robustness of the models built. The validation parameters, including R 2 , F, and RMS, were used to evaluate the two models, as shown in Tables 7 and 8

Model Applicability Domain Analysis
In the present study, the visual application domains of the models that can typically be observed and analyzed using Williams plots are shown in Figure 5, where the horizontal coordinate is the leverage value, the vertical coordinate is the standardized cross-validation residual, the outlier criterion (read line) for the x coordinate is set to 3 m/n (m is the number of 5 descriptors chosen and n is the number of 62 compounds used as the training set), and the outlier criterion (read line) for the y coordinate is specified as ± 3σ (σ = 0.967). In Figure 5, it can be clearly observed that both the training set compounds and the test set compounds are within the domain, indicating that a reasonably close relationship can be established between the selected descriptors and the toxicity of the compounds, and within the domain, the model is able to fit the relevant mixture toxicity predictions.
In the present study, the visual application domains of the models that can typically be observed and analyzed using Williams plots are shown in Figure 5, where the horizontal coordinate is the leverage value, the vertical coordinate is the standardized cross-validation residual, the outlier criterion (read line) for the x coordinate is set to 3 m/n (m is the number of 5 descriptors chosen and n is the number of 62 compounds used as the training set), and the outlier criterion (read line) for the y coordinate is specified as ± 3σ (σ = 0.967). In Figure 5, it can be clearly observed that both the training set compounds and the test set compounds are within the domain, indicating that a reasonably close relationship can be established between the selected descriptors and the toxicity of the compounds, and within the domain, the model is able to fit the relevant mixture toxicity predictions.

Discussion of Selected Descriptors
In the present work, the built model can be used to predict the compounds, including aldehydes (AHS), cyanides (CGS), sulfonamides (SAS), and methomyl (TMP). These substances contain the C, H, O, N, and S atoms. Five descriptors, namely, HOMO, Rn-N, Totpc, Max-C-H, and Min-C, were used to construct the QSAR model. HOMO and Rn-N have a positive effect on the increase in toxicity, and Tot-pc, Max e-n, and Min-C-H have a

Discussion of Selected Descriptors
In the present work, the built model can be used to predict the compounds, including aldehydes (AHS), cyanides (CGS), sulfonamides (SAS), and methomyl (TMP). These substances contain the C, H, O, N, and S atoms. Five descriptors, namely, HOMO, Rn-N, Tot-pc, Max-C-H, and Min-C, were used to construct the QSAR model. HOMO and Rn-N have a positive effect on the increase in toxicity, and Tot-pc, Max e-n, and Min-C-H have a negative effect on the increase in toxicity. Rn-N is a constitutional descriptor, which, in the present work, mainly concerning organic molecules containing cyano and nitrogencontaining heterocycles in the compounds, is positively correlated with the increase in toxicity. HOMO, Tot-pc, Max-C-H, and Min-C are quantum-chemical descriptors; HOMO has the property of being an electron donor in a chemical reaction, and the higher the energy is, the higher the toxicity value. Tot-pc is a class of descriptors describing the polarity of a molecule, where the size of the molecular dipole depends on the distribution of the point charges. MAX-C-H describes the electron-to-charge attraction of two atoms in a C-H bond. When a bond is formed, more electronegative atoms participate in the bonding orbitals to gain some of the electrons, and the electron-to-charge attraction decreases, resulting in a decrease in electronegativity, which has a positive effect on the decrease in toxicity. Min-C is a calculation of the energy of the ground state of the C atom; the lower the energy is, the more stable and the lower the toxicity value.

Datasets
In the present work, 35 compounds widely present in aqueous environments, including 13 aldehydes (AHS), 11 cyanides (CGS), 10 sulfonamides (SAS), and 1 methomyl (TMP), were obtained from Mainak Chatterjee et al. [22]. In an aqueous environment, the above compounds can cause damage to the environment in the form of either individual compounds or in mixtures, which directly or indirectly have an impact on human health. The names of the individual compounds, the experimental, predicted, and residual values (pEC50) by the MLR model and the RBFNN model are presented in Table 2 (pEC50 units  are in moles per liter).
Additionally, the data in Table 2 were randomly divided into 28 training sets as well as 7 test sets (marked with asterisk) to assess the performance of the individual compound models. The selection of molecular descriptors highly correlated with single toxicity in the whole dataset.
Seventy-nine binary mixtures are listed in Table 5 along with the values obtained by each of the two modeling techniques. Toxicity ratios for the binary mixtures of two individual compounds and compositional information are also listed in this table. As the QSAR studies usually did, binary mixtures of 79 species were randomly divided into 62 training sets as well as 17 test sets (marked with an asterisk) for assessing the quality of the binary mixture models.

Molecular Descriptors Generation and Selection
In this study, the molecular descriptors were employed as quantitative representations of the molecular structural features and were then used to build the relationship between the representative descriptors and the target of the toxicity or activity. The process is as follows: the structures of the individual compounds were first drawn by Chemdraw (PerkinElmer Informatics, Inc: Massachusetts, MA, USA) [23], followed by preliminary molecular optimization in the HyperChem 6.0 program (Hypercube, Inc., Waterloo, ON, Canada) [24] software through molecular mechanics MM+ force fields. Then, the preliminary optimized molecular structures were further optimized using the semiempirical AM1 method in the Polak-Ribière algorithm until the root mean square gradient reached 0.001 kcal/mol. Last, the molecular structures were optimized in the MOPAC 6.0 software package (Indiana University: Bloomington, IN, USA) [25], and the structure files derived from HyperChem and MOPAC were selected for structural descriptors, geometric descriptors, electrostatic descriptors, quantum chemical descriptors, and topological descriptors using the CODESSA 2.63 program (University of Florida, Gainesville, FL, USA) [26] after optimization with the same root-mean-square gradient. Furthermore, 7 of the other descriptors obtained from HyperChem (including logP) were added to the descriptor pool.
Doing this, we needed to find the representative descriptors that are more related to the toxicity of the single compounds. Thus, a heuristic method in CODESSA software was employed, which can be used to calculate a pool of relevant descriptors and subsequent determinations of the most suitable descriptors for the construction of the model.
The generalization of the descriptors for the toxicity assessment of the binary mixtures is a challenge compared to the generation of descriptors for individual compounds. Normally, the approach applied to generating mixture descriptors is the weighted descriptor generation approach [27]. Supposing that the hypothetical descriptors do not follow the simple addition, other calculation rules have been performed [11,[28][29][30][31], from which the optimal mixture rule was determined based on the preliminary modeling results. For each mixing rule, the selection standard was considered in terms of the reliability and robustness of the model quality (R 2 (correlation coefficient), R 2 adj (adjusted correlation coefficient), F (Fisher test), and Q 2 LOO (leave-one-out correlation coefficient)), and the expression of the equation, as shown in Table 1.

Multiple Linear Regressions
As the easiest model-building statistical technique, MLR has been commonly implemented in quantitative constructive relationship models to solve regression analysis problems. It can predict the values of two or more explanatory variables from the corresponding variables, and it is essentially an extension of ordinary least squares (OLS) regression involving two and more explanatory variables as a mathematical statistical technique. Typically, multiple linear regression uses molecular descriptors as X variables to establish a mathematical relationship with the desired activity value Y (pEC50), which involves dividing the overall dataset into a test set and a training set. In a regression model, the regression coefficient bn and the intercept b 0 of the model have the following relationship: Regularly for the model, reliability and predictiveness are generally assessed using statistical parameters, including R 2 , RMS, F, Q 2 LOO , R 2 pred , q 2 ext , etc. For the development of the MLR model, we have chosen to do this in the CODESSA 2.63 program (University of Florida, Gainesville, FL, USA).

Radial Basis Function Neural Networks (RBFNN)
During the construction of the QSAR, one can consider not only the best multivariate linear model available by constructing the molecular descriptor versus the desired activity value (pEC50) but also some nonlinear models to establish the relationship, such as the RBFNN. The specifics of radial basis function neural networks have been described in several papers [32,33]. Briefly, a radial basis function neural network consists of an input layer, a hidden layer and an output layer. The input layer is virtually just an input vector and does not involve the processing of information; the hidden layer consists of k radial basis function (RBF) units; and the output layer is composed of linear neurons (LNS) [32,34]. In general, the radial basis function (RBF) serves as a Gaussian function defined by the center (Cj) and the width (Rj). The radial basis function (RBF) implements the nonlinear transformation by measuring the Euclidean distance between the input vector (X) and the center of the radial basis function (Cj): where y k stands for the k th output unit of the input vector X, w kj for the weight relationship between the k th output unit and the j th implied layer unit, and b k for the respective bias. The determination of centers and width plays a decisive role in model development. Multiple methods are used to select centers. In the current study, we chose to employ a forward subset selection routine to select the centers from the training set samples. Regarding the width selection, the width range was from 0.1 to 4, in increments of 0.1, and the best width was ultimately selected. Afterwards, the connection weights between the hidden and output layers were selected using the least squares method. For the development of the RBFNN model, we have chosen to do this in MATLAB software (Online access: https://www.mathworks.com/products/matlab.html, access on 25 November 2021).
The RBFNN model was evaluated using the same statistical parameters as the MLR model.

Conclusions
Toxicity estimation of 79 aquatic mixtures was performed by quantitative constitutive relationship modeling through MLR and RBFNN methods. Eleven different mixing rules of the hypothetical descriptors were considered to obtain the proper models. Statistical results show that the developed MLR models are more robust as well as predictive, while the RBFNN models have a better model quality compared to the former. Furthermore, the statistical results show that the developed descriptors have excellent performance for the toxicity of mixtures wirhin the applicability domain range. We conclude that the models can be effective for the toxicity prediction of aquatic contaminants and have practical value for ecological assessment.
Supplementary Materials: The following supporting information can be downloaded at: https: //www.mdpi.com/article/10.3390/molecules27196389/s1. The total and the eliminated descriptor.
Funding: This work was financially supported by the Natural Science Foundation of Shandong Province in China (ZR2021MB024) and National Natural Science Foundation of China (21778047, 21675138, 21705139).