Towards the Development of Global Nano-Quantitative Structure–Property Relationship Models: Zeta Potentials of Metal Oxide Nanoparticles

Zeta potential indirectly reflects the surface charge of nanoparticles in solution and can be used to represent the stability of a colloidal solution. Because the synthesis, testing, and evaluation of new nanomaterials are expensive and time-consuming, it would be helpful to estimate an approximate range of properties for untested nanomaterials using computational modeling. We collected the largest dataset of zeta potential measurements of bare metal oxide nanoparticles in water (87 data points). The dataset was used to develop quantitative structure–property relationship (QSPR) models. Essential features of nanoparticles were represented using a modified simplified molecular input line entry system (quasi-SMILES). The quasi-SMILES strings reflected the size-dependent behavior of zeta potentials, as the representation included information about both the chemical composition and the size of the nanoparticles. Three mathematical models were generated using the Monte Carlo method, and their statistical quality was evaluated (R² for the training set varied from 0.71 to 0.87; for the validation set, from 0.67 to 0.82; root mean square errors for both training and validation sets ranged from 11.3 to 17.2 mV). The developed models were analyzed and linked to aggregation effects in aqueous solutions.


Introduction
Massive production of nanoparticle-based materials results in their release into the environment. It is widely known that certain physical characteristics of nanoparticles, such as size, shape, charge, chemical composition, and the pH of the solution, may directly influence their toxicity [1][2][3][4]. Nanoparticles become involved in processes of dissolution, agglomeration, and settlement when released into the aquatic environment [5]. Changes in the stability or surface charges of nanoparticles in solution are known to induce toxic effects [6].
Zeta potential (ζ) can indirectly represent both the stability and the surface charge of a nanomaterial [6]. Zeta potential measurements are among the characteristics recommended for experimental testing of nanomaterials [7]. In general, experimental testing is expensive, so it is vital to have robust theoretical approaches that can be applied for the preliminary computational modeling of various properties of nanomaterials [8]. One of the most popular approaches is the quantitative structure–property relationship (QSPR) technique [8]. QSPR is used to establish links between features of the nanomaterial's structure and the target property. It has been widely applied over the last decade to predict nanoparticle properties [8][9][10][11][12][13][14][15][16]. To date, only three research articles have reported QSPR modeling of zeta potentials [17][18][19]. These nano-QSPR models focused only on chemical composition-dependent descriptors [17][18][19]. They include a variety of core compositions but do not take size-dependent effects into consideration (namely, the datasets do not contain nanoparticles of the same core composition but different sizes). Zeta potentials are size-dependent, so the inability of previously developed models to address this feature significantly lowers their transferability.
In order to develop a global model, the size-dependent behavior of zeta potentials should be taken into account [20]. This would facilitate the creation of revised and extended datasets that contain chemically diverse nanoparticles of varied sizes. However, a fundamental problem, the scarcity and inconsistency of experimental data on zeta potentials, must be overcome first. This problem arises from differences between synthesis and measurement protocols [21]. Variations in measuring protocols may result in a significant variance in data and may lead to inaccuracy in data assessment. Thus, an established list of criteria should be followed during the development of databases suitable for computational modeling. Accordingly, the first aim of this research article is to gather experimental reports on zeta potential measurements and analyze the reliability of the collected data points.
Apart from the reliability of sources of experimental data, the problem of computational characterization of nanoparticles still exists [1]. Nanoparticles are characterized by high structural complexity. In most cases, traditional QSPR methods are unable to distinguish structural features of nanoparticles [11]. To address this issue, specific descriptors for nanoparticles have recently been developed [8]. Despite promising results, the creation of a unified and simple system of nanoparticle representation remains an open question. The second aim of this research article is to create a universal system of representation that can be used to describe nanomaterials of different sizes.
In the present study, the authors collected (from multiple literature sources) and curated 87 data points of zeta potential measurements in aqueous solutions for differently sized nanomaterials made of silica and metal oxides. Features of nanoparticles were represented using a modified version of the simplified molecular input line entry system (quasi-SMILES) [22][23][24]. Nano-QSAR models were constructed and provided the basis for a mechanistic interpretation of the results. The performance of the presented descriptors was compared with that of other nano-descriptors.

Data Collection and Curation
The current study focuses on zeta potential measurements of silicon and metal oxide nanoparticles in water. The main obstacle we faced was data curation, as the data from multiple sources were inconsistent and contradictory. The initial criteria applied in the literature search were as follows:

1. Zeta potentials were measured for no less than three different oxide nanoparticles.
2. Measurements were conducted for non-coated nanoparticles in pure water.
3. Core composition, nominal size, and the size of the aggregate in the water were reported.
4. Contradictory data points (reports of the same core composition and the same size, but with significant differences in zeta potential values) were removed.
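The last criterion can be illustrated with a minimal sketch. It assumes that "significant differences" means a spread above a fixed tolerance; the 20 mV default, the record keys, and the sample values below are hypothetical and are not taken from the collected dataset.

```python
from collections import defaultdict

def drop_contradictory(records, max_spread_mv=20.0):
    """Keep only (core, size) groups whose zeta potential values agree.

    records: list of dicts with the hypothetical keys 'core', 'size_nm', 'zeta_mv'.
    max_spread_mv: assumed tolerance; the source does not state a number.
    """
    groups = defaultdict(list)
    for rec in records:
        groups[(rec["core"], rec["size_nm"])].append(rec)

    kept = []
    for group in groups.values():
        values = [rec["zeta_mv"] for rec in group]
        # A group is contradictory if its values differ by more than the tolerance.
        if max(values) - min(values) <= max_spread_mv:
            kept.extend(group)
    return kept

# Made-up records for illustration only.
records = [
    {"core": "TiO2", "size_nm": 25, "zeta_mv": -12.0},
    {"core": "TiO2", "size_nm": 25, "zeta_mv": 30.0},   # contradicts the row above
    {"core": "ZnO", "size_nm": 40, "zeta_mv": 21.0},
    {"core": "ZnO", "size_nm": 40, "zeta_mv": 24.0},    # consistent repeat
    {"core": "SiO2", "size_nm": 15, "zeta_mv": -35.0},
]
cleaned = drop_contradictory(records)
```

In this sketch the whole contradictory group is discarded rather than one of its members, since there is no basis for choosing between conflicting reports.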

Quasi-Simplified Molecular Input Line Entry System (SMILES) Optimal Descriptors and Model Generation
As mentioned in the introduction, features of the molecular architecture of metal oxide nanoparticles, combined with the size-dependent behavior of the target property, render traditional QSPR tools unsuitable. At the same time, a model for such data can be built under the paradigm that an "endpoint is a mathematical function of all available eclectic data" [34][35][36]. This paradigm can be applied using a technique known as "quasi-SMILES." The quasi-SMILES system of structural representation is a suitable tool to encode any available eclectic data. To represent nanoparticles' structures using quasi-SMILES, the numerical parameters from ... In the presented case, the optimal descriptor is a translator of eclectic information into the predictive model [37][38][39]. For instance, using Scheme 1, Al2O3 nanoparticles form the first row in Table 1 (Table S1 and Figure S1 for the nominal size; Table S2 and Figure S2 for the size in water).
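The idea of encoding composition and size in a single string can be sketched as follows. This is not the encoding of Scheme 1, which is not reproduced in this text: the equal-width binning, the number of classes, the size ranges, and the `[N…]`/`[W…]` tokens are all hypothetical choices made for illustration.

```python
def size_class(value, lower, upper, n_classes=9):
    """Map a size to an integer class 1..n_classes by equal-width binning."""
    if upper <= lower:
        raise ValueError("upper must exceed lower")
    fraction = (value - lower) / (upper - lower)
    fraction = min(max(fraction, 0.0), 1.0)          # clamp out-of-range sizes
    return min(int(fraction * n_classes) + 1, n_classes)

def quasi_smiles(core_smiles, nominal_nm, water_nm,
                 nominal_range=(5.0, 100.0), water_range=(50.0, 2000.0)):
    """Append hypothetical size tokens to the SMILES of the core oxide.

    [N<k>] encodes the class of the nominal size, [W<k>] the class of the
    aggregate size in water. Both ranges are assumed, not taken from the source.
    """
    n = size_class(nominal_nm, *nominal_range)
    w = size_class(water_nm, *water_range)
    return f"{core_smiles}[N{n}][W{w}]"

# An Al2O3 particle with made-up sizes: 50 nm nominal, 500 nm in water.
example = quasi_smiles("O=[Al]O[Al]=O", 50.0, 500.0)
```

Two particles with the same core but different sizes then receive different strings, which is what allows a model built on such strings to capture size dependence.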
The collected dataset was split into training, invisible training, calibration, and validation sets. The training set was used to calculate correlation weights and to build the model. The invisible training set was used to determine whether the correlation between DCW(T*,N*), where DCW stands for the descriptor of correlation weights, and the values of the zeta potentials is satisfactory for structurally similar nanoparticles outside of the training set. The calibration set was used to detect the onset of overtraining. The validation set was used to estimate the predictive potential of the model.
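A random four-way split of this kind can be sketched as below. The fractions and the seed are assumptions chosen for illustration; the actual composition of the sets is the one given in Table S3.

```python
import random

SET_NAMES = ("training", "invisible_training", "calibration", "validation")

def split_four_ways(items, fractions=(0.35, 0.25, 0.2, 0.2), seed=0):
    """Randomly partition items into the four sets (fractions are assumed)."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)

    split, start = {}, 0
    for i, (name, fraction) in enumerate(zip(SET_NAMES, fractions)):
        # The last set takes whatever remains, so no item is lost to rounding.
        if i == len(SET_NAMES) - 1:
            end = len(shuffled)
        else:
            end = start + round(fraction * len(shuffled))
        split[name] = shuffled[start:end]
        start = end
    return split

# 87 data points, identified here simply by their index.
split = split_four_ways(range(87))
```

Changing `seed` gives a different random split, which mirrors the use of several splits to check that the model quality does not depend on one particular partition.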
The model was developed using the Monte Carlo approach [19,35,36]. Final models were represented by the following equation:

$$\zeta = C_0 + C_1 \times \mathrm{DCW}(T^*, N^*)$$

where C_0 and C_1 are the regression coefficients; T is the threshold, i.e., the integer used to divide attributes of quasi-SMILES into two classes: (i) rare, if the attribute occurs fewer than T times in the training set, and (ii) frequent, if it occurs T or more times; and N is the number of epochs of the Monte Carlo optimization. The DCW is calculated as

$$\mathrm{DCW}(T, N) = \sum_{k} \mathrm{CW}(A_k)$$

where A_k is an attribute of quasi-SMILES, and T and N are parameters of the Monte Carlo optimization. The CW(A_k) are correlation weights for the different A_k. The correlation weights were used to calculate the DCW(T*,N*) with a maximal value for the target function (TF):

$$\mathrm{TF} = R_{\mathrm{TRN}} + R_{\mathrm{iTRN}} - \left| R_{\mathrm{TRN}} - R_{\mathrm{iTRN}} \right| \times \mathrm{Const}$$

where R_TRN and R_iTRN are correlation coefficients between the optimal descriptor and zeta potential for the training and invisible training sets, respectively. Const is an empirical parameter that, in the current study, was set to 0.1.
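The optimization described above can be sketched in a few lines. The sketch assumes that DCW is the sum of the correlation weights of the attributes in a quasi-SMILES, that the target function is TF = R_TRN + R_iTRN − |R_TRN − R_iTRN| × Const, and that rare attributes receive a weight of zero. The greedy random-perturbation search, the step size, and the toy data are simplifications introduced here and are not necessarily what the authors' software does.

```python
import math
import random
from collections import Counter

def pearson_r(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def dcw(attributes, weights):
    """Sum of correlation weights; attributes without a weight contribute zero."""
    return sum(weights.get(a, 0.0) for a in attributes)

def target_function(weights, trn, itrn, const=0.1):
    """Assumed form: TF = R_TRN + R_iTRN - |R_TRN - R_iTRN| * const."""
    r_trn = pearson_r([dcw(a, weights) for a, _ in trn], [z for _, z in trn])
    r_itrn = pearson_r([dcw(a, weights) for a, _ in itrn], [z for _, z in itrn])
    return r_trn + r_itrn - abs(r_trn - r_itrn) * const

def optimize(trn, itrn, threshold=1, epochs=30, step=0.2, seed=0):
    """Greedy random search over the weights of frequent attributes (a simplification)."""
    rng = random.Random(seed)
    counts = Counter(a for attrs, _ in trn for a in attrs)
    frequent = sorted(a for a, c in counts.items() if c >= threshold)
    weights = {a: rng.uniform(-1.0, 1.0) for a in frequent}
    best = target_function(weights, trn, itrn)
    for _ in range(epochs):
        for a in frequent:
            old = weights[a]
            weights[a] = old + rng.uniform(-step, step)
            score = target_function(weights, trn, itrn)
            if score > best:
                best = score          # keep the improving perturbation
            else:
                weights[a] = old      # otherwise revert it
    return weights, best

# Toy data: (attributes of a quasi-SMILES, zeta potential in mV). Values are made up.
trn = [(("Al", "O", "n1"), 35.0), (("Al", "O", "n3"), 20.0),
       (("Ti", "O", "n1"), -5.0), (("Ti", "O", "n3"), -25.0),
       (("Si", "O", "n2"), -40.0)]
itrn = [(("Al", "O", "n2"), 28.0), (("Ti", "O", "n2"), -15.0),
        (("Si", "O", "n1"), -30.0)]
weights, tf_value = optimize(trn, itrn, threshold=1, epochs=30)
```

Because the invisible training set enters only through the target function and never through the counting of frequent attributes, the penalty term discourages weights that fit the training set much better than the invisible training set.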
The statistical quality of an attribute (A) of the model for a given split (i.e., training, invisible training, calibration, and validation sets) can be estimated via defect(A), as defined in [36]. Using the values of defect(A) for all attributes of a quasi-SMILES involved in building the model, one can estimate the defect of the quasi-SMILES as

$$\mathrm{defect}(\text{quasiSMILES}) = \sum_{k} \mathrm{defect}(A_k)$$

If the given split is "good," then defect(A) is equal to zero for all attributes of the quasi-SMILES. In reality, for the majority of cases, defect(A) > 0. The average value of defect(quasiSMILES), calculated for the training set, is used to separate quasi-SMILES into two categories: (i) those within the domain of applicability and (ii) outliers.

Alternative Descriptors
As mentioned in the introduction, the majority of currently available descriptors for both conventional organics and nanomaterials do not take into consideration size-dependent effects of nanoparticles [8]. We compared the presented quasi-SMILES parameters with (a) quantum-chemical descriptors (calculated for small clusters as discussed in Mikolajczyk A. et al. [17]) and (b) ionic characteristics (calculated based on the chemical formula as discussed by Sizochenko N. et al. [16]).
The analysis of the distribution of zeta potential values in the collected dataset (Figure 1) shows that the data are almost normally distributed, slightly skewed toward high positive values. Predictive models were developed for three random splits of the data. The splitting of the initial dataset is presented in Table S3. Table 2 contains the correlation weights obtained by the Monte Carlo optimization procedure. As can be seen, each model assigned different weights to the same attributes. Table 3 contains the statistical characteristics of the developed quasi-SMILES-based models. According to Table 3, these characteristics were satisfactory. At the same time, all attempts to build models using quantum-chemical descriptors or ionic characteristics adopted from the literature failed [17,18]. R² for the training set for all non-quasi-SMILES models was below 0.47, which is a sign of random modeling. In principle, quantum-chemical parameters are capable of representing size-dependent effects; however, their true power has never been established, as authors have only conducted calculations for clusters of a predefined size [9,17,18]. A similar situation is observed for the ionic characteristics (R² for the training set < 0.5): descriptors derived from the chemical formula alone are not capable of representing the size-dependent behavior of zeta potential.
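The two statistics used to compare the models can be computed as sketched below. The sketch assumes that R² denotes the squared Pearson correlation between observed and predicted values, which is one common convention; the text does not state which definition was used. The observed and predicted values are made up.

```python
import math

def r_squared(observed, predicted):
    """Squared Pearson correlation between observed and predicted values (assumed definition)."""
    n = len(observed)
    mean_o, mean_p = sum(observed) / n, sum(predicted) / n
    sop = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    soo = sum((o - mean_o) ** 2 for o in observed)
    spp = sum((p - mean_p) ** 2 for p in predicted)
    if soo == 0 or spp == 0:
        return 0.0
    return sop ** 2 / (soo * spp)

def rmse(observed, predicted):
    """Root mean square error, in the units of the endpoint (mV for zeta potential)."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

# Made-up zeta potentials in mV.
observed = [-30.0, -10.0, 10.0, 30.0]
predicted = [-25.0, -15.0, 15.0, 25.0]
r2_value = r_squared(observed, predicted)
rmse_value = rmse(observed, predicted)
```

The two numbers answer different questions: R² measures how well the predictions track the trend in the data, while RMSE gives the typical size of an individual prediction error in mV.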
Let us take a closer look at quasi-SMILES based models. According to the defect(quasiSMILES), the model for Split 1 contained 9 outliers in the training set and 1 in the test set (~13% of the total number of nanoparticles), the model for Split 2 contained 12 outliers in the training set and 5 in the test set (~19.5% of the total dataset), and the model for Split 3 contained 6 outliers in the training set and 1 in the test set (~8%). As data on zeta potentials is very sensitive, a variation in the number of outliers is related to a variation in measurement distributions in the collected database [36].
However, the high RMSE for validation sets (up to 17.2 mV) reflects potential inaccuracies in distinguishing stable from unstable nanoparticles. In other words, the obtained models are useful for predicting the sign of the charge (positive/negative), but have only limited usefulness for stability prediction (nanoparticles with ζ < −20 mV or ζ > 20 mV are considered stable), due to the high deviation. More research is needed to address these issues. It should be noted that the quality of any predictive model is determined by its ability to adequately predict endpoints for external objects, that is, objects that are invisible during model development [37]. At the same time, an excellent statistical quality of a model for the training set is often an indicator of overfitting [38]. In the present case (Table 4), the predictive potential (external prediction) is suitable, as overfitting is not observed.
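The effect of the prediction error on the stability assignment can be made concrete with a small sketch. The ±20 mV criterion is the one quoted above; treating a prediction as "uncertain" when the interval of plus or minus one RMSE crosses a stability boundary is an assumption made here for illustration and is not a procedure described in the study.

```python
def classify(zeta_mv, rmse_mv=0.0, threshold_mv=20.0):
    """Return (charge sign, stability label) for a predicted zeta potential.

    The stability label is 'uncertain' when the interval
    [zeta - rmse, zeta + rmse] straddles +threshold or -threshold.
    """
    if zeta_mv > 0:
        sign = "positive"
    elif zeta_mv < 0:
        sign = "negative"
    else:
        sign = "neutral"

    low, high = zeta_mv - rmse_mv, zeta_mv + rmse_mv
    if low > threshold_mv or high < -threshold_mv:
        stability = "stable"        # whole interval lies outside [-20, 20] mV
    elif low >= -threshold_mv and high <= threshold_mv:
        stability = "unstable"      # whole interval lies inside [-20, 20] mV
    else:
        stability = "uncertain"     # interval crosses a stability boundary
    return sign, stability

# With an RMSE of 17.2 mV, a prediction of +30 mV cannot be assigned with confidence.
result = classify(30.0, rmse_mv=17.2)
```

Under this assumption, only predictions beyond roughly ±37 mV or within about ±3 mV receive a definite label, which illustrates why an RMSE of this size limits the use of the models for stability prediction.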
Scatterplots of observed versus predicted values are presented in Figure 2. In general, the scatter of data for the invisible training sets (Figure 2(1c, 2c, and 3c)) is quite significant. This can be explained by the difficulty of fitting data obtained from various sources. Among the three developed quasi-SMILES-based models, the most reliable predictions for the invisible training set are considered to be those obtained for Model 3.
Table 4. The resultant DCW(1,30) value represents the summation of all correlation weights and is equal to 1.435.
Having data on several runs of the Monte Carlo optimization, one can select attributes of quasi-SMILES that have solely positive correlation weights. These attributes can be interpreted as promoters of an increase in zeta potential. On the other hand, attributes of quasi-SMILES that have negative correlation weights in several runs of the optimization can also be extracted. Those can be interpreted as promoters of a decrease in zeta potential. Table 5 contains examples of promoters of the increase or decrease of zeta potential.
Table 5. Factors related to the increase (positives) or to the decrease (negatives) of zeta potentials.
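The selection of promoters across runs can be sketched as follows. The attribute names and weights are made up, and restricting the comparison to attributes present in every run is a choice made for this sketch.

```python
def classify_attributes(runs):
    """Group attributes by the sign of their correlation weights across runs.

    runs: list of dicts mapping attribute -> correlation weight, one per run.
    Only attributes present in every run are considered.
    """
    shared = set(runs[0]).intersection(*runs[1:])
    groups = {"increase": [], "decrease": [], "undefined": []}
    for attribute in sorted(shared):
        values = [run[attribute] for run in runs]
        if all(v > 0 for v in values):
            groups["increase"].append(attribute)    # positive in every run
        elif all(v < 0 for v in values):
            groups["decrease"].append(attribute)    # negative in every run
        else:
            groups["undefined"].append(attribute)   # mixed signs or zero weights
    return groups

# Made-up correlation weights from three hypothetical runs.
runs = [
    {"Al": 0.8, "Si": -1.1, "n1": 0.2, "O": 0.1},
    {"Al": 1.3, "Si": -0.4, "n1": -0.3, "O": 0.5},
    {"Al": 0.5, "Si": -0.9, "n1": 0.6},
]
promoters = classify_attributes(runs)
```

An attribute whose weight changes sign between runs is left unassigned, since no stable direction of influence can be attributed to it.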

Conclusions
The authors here have gathered experimental reports on zeta potential measurements of nano-sized metal oxides and analyzed collected data points, selecting for further studies only those that are reliable and comparable among different publications. In this study, a simple workflow was developed and applied, which allowed for the use of modeling methods even for quite complex data collected from different sources. Specific quasi-SMILES descriptors for the assessment of zeta potentials were calculated and tested. The presented quasi-SMILES descriptors directly take into account the size of nanoparticles, being capable of reflecting the size-dependent behavior of zeta potentials. At the same time, the developed descriptors do not require complex or long-term computations. The resulting models showed reasonable statistical characteristics. Thus, the general modeling workflow, due to its simplicity and transparency, can be applied for nano-QSAR modeling. The presented database can be used as a basis for extensive nano-QSPR modeling in the future.