The System of Self-Consistent Models: The Case of Henry’s Law Constants

Data on Henry’s law constants make it possible to systematize the geochemical conditions that affect the state of the atmosphere and, consequently, trigger climate change. Henry’s law constants are needed to assess processes related to atmospheric contamination by pollutants, the most important of which are those capable of long-range transport; this ability is closely related to the values of the constants. Chemical changes in gaseous mixtures affect the fate of atmospheric pollutants and, in turn, ecology, climate, and human health. Since the number of organic compounds present in the atmosphere is extremely large, it is desirable to develop models suitable for predicting the constants of the large pool of organic molecules that may occur there. Here, we report the development of such a model for predicting the Henry’s law constants of 29,439 compounds using the CORAL software (2023). The statistical quality of the model is characterized by a coefficient of determination of about 0.81 (on average) for the training and validation sets.


Introduction
Henry's law constants are needed to assess processes related to atmospheric pollutants, particularly their transport in the atmosphere. Chemical changes in the gaseous phase affect the fate of atmospheric pollutants and, in turn, ecology, climate, and human health.
Quantitative structure-property relationships (QSPRs) are one of the tools applied to describe the physicochemical endpoints that characterize the state of the Earth's atmosphere. Such endpoints include, for example, the rate constants for reactions of OH radicals, the rate constants of reactions of ozone with organic and inorganic compounds, vapor pressure, and Henry's law constants [1].
Experimental measurements of Henry's law constants have been reported using different techniques, including headspace gas chromatography, modified headspace techniques, phase ratio variation, the differential headspace method, and dilution techniques [2]. However, accurate Henry's law constant values are currently unavailable for many compounds. Instrumental problems make the experimental determination of these values difficult and expensive. Consequently, theoretical methods that can robustly predict this endpoint for diverse compound types are essential [2].
Global climate change, attributed to the increased levels of greenhouse gases produced by the use of fossil fuels, is recognized as the most important challenge of our time. The increasing concentration of greenhouse gases in the atmosphere contributes to global warming, which is already increasing the frequency of extreme weather events. Mitigating climate change has proved very difficult owing to the complex problems of distributing resources across countries. In the future, geochemical and geophysical impacts are expected to cause significant harm to ecosystems.
The use of regulatory impact assessment is increasingly becoming a standard procedure for preparing regulations in OECD countries. As noted above, climate impact assessment is a priority object of regulatory activity around the world, aimed at monitoring and controlling climate change [1,2]. Since Henry's law constants are a source of valuable information on the environmental effects of organic compounds, QSPRs can serve as a tool to simulate these constants [3-5].
It is necessary to take into account that, in addition to providing numerical values (or at least estimates) of Henry's law constants, computer experiments, as they develop and improve, can be a source of additional information that helps formulate and test theoretical ideas about how molecular architecture influences the values of these constants. This, in turn, can provide a basis for finding technological solutions that are less dangerous for the environment, as well as for human health, farm animals, and plants.
Cross-validation was initially used to estimate the prediction error of a mathematical modeling procedure. For a couple of decades, it was believed that cross-validation estimates the prediction error without bias. Nonetheless, numerous reports in the cheminformatics literature show that cross-validated figures of merit cannot be trusted. Instead of cross-validation, other algorithms can provide the user with statistical measures of model reliability or, more precisely, measures of the reliability of a model's predictive potential. The essence of such algorithms is to use many ways of distributing the available data into non-overlapping training and control samples. Such distributions can be carried out in any manner, but random divisions into the above-mentioned sets are probably the most "reliable", or at least the most impartial. Another important condition for confidence in model verification carried out this way is the consideration and comparison of many such random distributions: the more divisions into training and control sets are considered, the more reason there is to regard the resulting assessments as reliable. Such verification is referred to in the literature as a system of self-consistent models [6]. Unlike previous approaches used to test or evaluate the predictive potential of models, the system of self-consistent models is also a method for constructing a model. Owing to the significant contribution of random transformations of the available data when building such models, the results obtained can hardly be assessed as artificial; they are as natural as chance itself. However, this "naturalness" contains not only attractive but also unpleasant possibilities, which are a consequence of the slow increase in accuracy and reliability with an increasing number of computer experiments. The Monte Carlo method is suitable for analyzing and comparing the mentioned random distributions into training and validation sets. Still, it also requires considering "extra" unsuccessful distributions. Because these "excess experiments" are taken into account, the corresponding comparisons are characterized by significant dispersion, and their fair assessment requires a large number of samples (models).
The present study reports the results of an attempt to apply a system of self-consistent models [6,7] to assess the predictive potential of models of Henry's law constants simulated by the CORAL software (http://www.insilico.eu/coral, accessed on 12 October 2023).

The System of Self-Consistent Models
The system of self-consistent models represents the process of testing the predictive potential of an applied approach. It involves building models with several random distributions of the available data into training and validation sets. Perhaps the process could be considered a variation of two-deep cross-validation [8]. The process can be represented as follows: Mi is the i-th model calculated with Equation (3), with the correlation weights obtained by the Monte Carlo optimization, which maximizes the target function calculated with Equation (5). Vk is the list of compounds distributed to the validation set in the k-th split of the data. Figure 1 shows the principle of selecting the compounds used to validate the i-th model with the validation set of the k-th split.
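In set terms, the scheme of Figure 1 amounts to removing from Vk every compound that model Mi saw during its development. A minimal sketch follows; representing a split as a dictionary of sets is an illustrative assumption, not CORAL's internal format:

```python
def external_validation_set(split_i, split_k):
    """Compounds of the k-th split's validation set that were not used
    to develop the i-th model (i.e., not in A, P, or C of split i);
    this is the '*' subset of Figure 1."""
    seen_by_model_i = split_i["A"] | split_i["P"] | split_i["C"]
    return split_k["V"] - seen_by_model_i
```

For example, a compound that appears in Vk but nowhere in split i is kept, while any Vk compound reused in the training or calibration sets of split i is discarded, so the i-th model is always tested on genuinely unseen compounds.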


The Statistical Quality of Models
The model of pHLC observed in the case of split 1 is the following:

pHLC = 3.8465 (±0.0005) + 1.1055 (±0.0001) × DCW(3, 15)    (2)

Figure 2 shows the results of the model graphically.
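Numerically, Equation (2) is a simple linear map from descriptor value to endpoint. The function name and the sample DCW value below are illustrative, not taken from the paper:

```python
def phlc_split1(dcw):
    """Equation (2), split 1: pHLC = 3.8465 + 1.1055 * DCW(3, 15)."""
    return 3.8465 + 1.1055 * dcw

# a hypothetical compound with DCW(3, 15) = 10.0 gets pHLC = 14.9015
```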

Table 1 contains the statistical quality of the models observed for ten random splits, showing several statistical parameters. The values of these parameters are quite similar across the different splits, which indicates that the methodology is robust and replicable. This is reasonably expected given the large number of substances within each split; thus, we can assume that the distribution of substances within each split is representative of the general behavior and that there are not many substances with deviating values. The large number of substances facilitates the task; still, the spread of values for Henry's constant is quite large, over N orders of magnitude. Thus, despite the modeling difficulties, the developed model is a good one. Furthermore, within each split, the values for the different sets (active training, passive training, calibration, and validation) are also quite constant and good, which is another demonstration of the quality of the model, which is expected to provide the same kind of performance when used on new substances. For instance, the R2 values are good, around 0.8.
…and Q2F3 = improved cross-validation criteria [10]; <Rm2> = average rm2 metrics [11]; RMSE = root mean squared error; MAE = mean absolute error; F = Fischer F-ratio; Nac = the number of parameters for the Monte Carlo optimization (the number of active SMILES attributes).
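The headline statistics of Table 1 (R2, RMSE, MAE) can be reproduced from observed and predicted endpoint values. The sketch below uses the standard textbook definitions, which may differ in detail from CORAL's implementation (for instance, CORAL may report the squared correlation coefficient as R2):

```python
def regression_metrics(obs, pred):
    """Coefficient of determination (R2), root mean squared error (RMSE),
    and mean absolute error (MAE) for paired observed/predicted values."""
    n = len(obs)
    mean_obs = sum(obs) / n
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "RMSE": (ss_res / n) ** 0.5,
        "MAE": sum(abs(o - p) for o, p in zip(obs, pred)) / n,
    }
```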
Tables 2-4 represent the system of self-consistent models. Despite the similarity of the statistical quality of the models, one can see that there are differences. Various distributions of the available data into training (i.e., A, P, and C) and validation sets resulted in somewhat different models, and some of them are better than others. Figure 3 shows that, over all splits, there is an anti-correlation between the number of SMILES attributes involved in the Monte Carlo optimization and the average values of the determination coefficients for the validation sets of the models. Thus, a paradoxical situation is observed: a smaller number of optimized parameters is accompanied by an improvement in the predictive potential of the model. A possible explanation is that there is a preferred number of inputs for the model to be generally correct, and adding more inputs causes a loss of predictivity due to overfitting [12].
Since all models are wrong, the researcher, whether a model developer or a model user, must be on guard; that is, one must avoid a situation where fighting mice (chasing high prediction accuracy) distracts from the tiger in the room (the impossibility of escaping the uncertainty principle, i.e., the contradiction posed by the need to consider large amounts of experimental data, which guarantees an increase in the uncertainty of the forecast). All this leads to a seditious conclusion: it is bad if there are few data, and equally bad if there are too many. This means there must be some "correct" middle ground where there are exactly as many data as needed for a useful model (a model that is still wrong).

Why Are Models Needed?
The simplest and most traditional answers to the question highlighted in the title are that experiments are expensive and take time to complete. However, this is just the tip of the iceberg. There are quite deep epistemological needs for building models. Since practice is the basis of the development of society, the development of various models is dictated primarily by the need for practical solutions to real problems, and here problems arise that are much more unpleasant and complex than defining the domain of applicability or the unambiguity of algorithms. One should start by determining who the consumer of the model is. The complication of models is rarely accompanied by their actual improvement. As a result, the consumer of the model, despite being interested in using it, often simply does not know how to do so. Under such circumstances, it becomes likely that the only user of the model remains its developer. Avoiding this unpleasant situation requires significant effort on the part of model developers. All models are wrong [13], and only some of them are useful. How do we recognize useful models? It seems that not the least condition for the usefulness of a model is its popularity. In other words, a model is useful if it is used. However, popularity does not mean usefulness; at the very least, without useful results from its use, a model cannot be considered useful. Many factors determine the usefulness of a model, but the first is the clarity of the result, or the ability to quickly assess how reliable the outcome is. The reliability of a result starts with its reproducibility. Often, reproducibility is not absolute and is accompanied by some variance in the result, such as the statistical quality of the model expressed in terms of the coefficient of determination and the root mean squared error. However, in reality, the model for the endpoint of interest is often the value extracted from complex computer experiments obtained for a single "correct" distribution of the available data into the training and validation subsystems. What is bad about this? First of all, such a model may turn out to be a beautiful accident that does not reflect the real difficulties and paradoxes of predicting the endpoint in question; this situation is especially "dangerous" when few data are available. If there are a lot of data, the probability that some selected split into training and validation sets is fortuitously successful becomes smaller as the amount of data available for model development grows. The result is a situation similar to the uncertainty principle: the more data, the more reliable the result, which, however, is most likely less accurate. In other words, when determining a model from a small number of available data, the coefficient of determination will be close to 1; when determining the same model from a large number of data, the coefficient of determination will not be comparable to 1.

Model as a Hired Worker in a Workshop
Almost always, the model acts as a kind of assistant in solving various problems. Here, a certain analogy arises with the cooperation between a hired worker and the owner of a workshop or factory. Following this analogy, like a worker, the model must have certain capabilities and qualities. A model must be able to do something. In the context of QSPR, the model must be able, having received a standard task (input data), to predict something corresponding (the expected value) to the input data offered to it. At the same time, a hired worker should not take on arbitrary tasks but only those in which they are an expert. Likewise, the model should refuse a problem that it cannot reliably solve. The model should at least refuse to attempt an impossible task, but it is better if the model can explain why the problem cannot be solved with this model. This leads to the formulation of a quite important concept of QSPR simulation known as the "applicability domain". Typically, the applicability domain is calculated based on distances in the multidimensional descriptor space or on the similarity of molecular graphs. However, another possibility is for the user to determine the scope of applicability of the model. It would be ideal to determine the applicability domain during the process of building the model, that is, taking into account the wishes of the potential user. In other words, the model should be self-sufficient, able to regulate and adequately evaluate its actions according to the "proposed task".

A Model Is Either Knowledge or Delusion
A model can be knowledge if certain conditions for its development and use are met. The model must be carefully tested. All models are wrong [13]. All models are random events [14]. For a model to be useful, it must work to create new knowledge. For a model to produce knowledge, its results must, by elementary logic, be reproducible.
The approach considered here (the system of self-consistent models) is an attempt to create, or at least simulate, the ability of a model, independently of the user, to "check its own actions". Self-consistency means not only (or even not so much) the similarity of the statistical quality of the models but rather the verification of the predictive potential on compounds truly "invisible" during the development of the model (i.e., compounds selected according to the scheme shown in Figure 1).
It may seem that the system of self-consistent models is conceptually similar to cross-validation [15]. However, whereas cross-validation studies one model and the effect of removing one or several (e.g., 5-fold [16]) participants (compounds), the system of self-consistent models studies and compares the predictive potentials of many models built on different training sets. The proposed approach provides a departure from the naive Q2 [8] and makes it possible to evaluate the predictive potential in terms of the index of ideality of correlation, which incorporates information about both the determination coefficient and the mean absolute error (MAE).
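For concreteness, one published form of the index of ideality of correlation scales the correlation coefficient by the balance between the MAEs of over- and under-predicted compounds. The sketch below assumes that form; it is not taken from this paper:

```python
def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def iic(obs, pred):
    """Index of ideality of correlation (assumed form): correlation times
    the ratio of the smaller to the larger of the negative-residual and
    positive-residual MAEs, so asymmetric errors lower the score."""
    residuals = [o - p for o, p in zip(obs, pred)]
    neg = [abs(d) for d in residuals if d < 0]
    pos = [abs(d) for d in residuals if d >= 0]
    if not neg or not pos:  # one-sided errors: no asymmetry penalty applies
        return pearson_r(obs, pred)
    mae_neg, mae_pos = sum(neg) / len(neg), sum(pos) / len(pos)
    return pearson_r(obs, pred) * min(mae_neg, mae_pos) / max(mae_neg, mae_pos)
```

When the errors are perfectly symmetric, the penalty factor is 1 and the index reduces to the plain correlation coefficient.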
Similar to the world of material motion, in which all the events visible and tangible to us in everyday life take place, such as traffic, sports, or exchange rates, there is a world of probabilistic actions, accidents, and even catastrophes that influence each other, yet are neither visible nor tangible to us. Perhaps quantitative structure-property/activity relationships (QSPRs/QSARs) allow one to look into this world of accidents and trends that affect each other. There is no mysticism here, but the phenomena occurring in such a space are not always described ideally and reliably. In other words, one encounters situations that, contrary to logic, are possible and unpleasant. For example, the quality of the calculations (models) can be affected by the collection of substances available in the database or by the list of priorities and criteria selected in the software used to build the QSPRs. In any case, it remains an indisputable axiom that models of random events constitute knowledge only when they are understandable and allow verification by establishing and confirming their reproducibility.
The use of various descriptors to predict the physicochemical properties of substances has been criticized many times, covering many aspects of QSPR use. The main point of criticism has been the very idea of correlation: correlation does not reveal causes. Another point of criticism is the impossibility of accurately determining the applicability domain. However, the practical use of QSPRs is increasing rather than decreasing. Two circumstances could improve the quality of the discussion between critics and supporters of QSPRs: (i) QSPRs cannot be expected to replace experiments; and (ii) a QSPR can become undeniably useful if QSPRs become a language of communication between the experimenter and the developer of the QSPR, not merely a mathematical tool.

Data
Experimental Henry's law constants (atm·m3/mol) were obtained from [1], expressed as the negative decimal logarithm (pHLC), and examined as the endpoint for QSPR analysis. The large experimental data set (n = 29,439) was distributed into four sets using ten different splits. The resulting sets were the active training (A, ≈25%), passive training (P, ≈25%), calibration (C, ≈25%), and validation (V, ≈25%) sets; the average percentage of similarity between these splits was less than 30%. Each of the above sets has a defined task. The active training set is used to build the model: molecular features extracted from the SMILES of the active training set are involved in the Monte Carlo optimization, which assigns correlation weights to these features so as to maximize a target function calculated from the descriptor (the sum of the correlation weights) and the endpoint on the active training set. The task of the passive training set is to check whether the model obtained for the active training set is satisfactory for SMILES not involved in the active training set. The calibration set should detect the start of overtraining (overfitting): at the beginning of the optimization, the correlation coefficients between the experimental endpoint values and the descriptor increase simultaneously for all sets; then the correlation coefficient for the calibration set reaches a maximum (the start of overtraining), and further optimization decreases it. The optimization should be stopped when overtraining starts. After the Monte Carlo optimization is stopped, the validation set is used to assess the predictive potential of the obtained model.
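The four-way distribution described above can be sketched as follows. The even quartering and the use of Python's random module are illustrative assumptions; CORAL performs its own splitting:

```python
import random

def make_split(compound_ids, seed):
    """Randomly distribute compounds into active training (A), passive
    training (P), calibration (C), and validation (V) sets, ~25% each."""
    rng = random.Random(seed)
    shuffled = list(compound_ids)
    rng.shuffle(shuffled)
    q = len(shuffled) // 4
    return {
        "A": set(shuffled[:q]),
        "P": set(shuffled[q:2 * q]),
        "C": set(shuffled[2 * q:3 * q]),
        "V": set(shuffled[3 * q:]),
    }

# ten different random splits, as in the study
splits = [make_split(range(29439), seed) for seed in range(10)]
```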

Model
The model values of Henry's law constants were calculated as

pHLC = C0 + C1 × DCW(3, 15)    (3)

where C0 and C1 are regression coefficients; the optimal SMILES-based descriptor DCW uses two thresholds: the first, equal to 3, defines the active and inactive attributes of the simplified molecular input-line entry system (SMILES), and the second, equal to 15, is the number of epochs of the Monte Carlo optimization process [25]. See Supplementary Materials Tables S1-S4.

Optimal Descriptor
Active SMILES attributes have so-called correlation weights (CWs), which are used to calculate the descriptor DCW(3, 15) by the formula

DCW(3, 15) = Σ CW(Sk) + Σ CW(SSk)    (4)

Table 5 contains an example of the calculation of DCW(3, 15).
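A toy illustration of Equations (3) and (4), where attributes are drawn from a SMILES string as single symbols (Sk) and adjacent pairs (SSk). The character-level extraction and the sample weights below are simplifications introduced for illustration; CORAL's actual attribute parsing is more elaborate:

```python
def smiles_attributes(smiles):
    """Simplified Sk (single symbols) and SSk (adjacent pairs) extraction."""
    singles = list(smiles)
    pairs = [smiles[i:i + 2] for i in range(len(smiles) - 1)]
    return singles + pairs

def dcw(attributes, cw):
    """Equation (4): sum of the correlation weights of active attributes;
    attributes absent from cw (inactive) contribute nothing."""
    return sum(cw.get(a, 0.0) for a in attributes)

def predict_phlc(smiles, cw, c0, c1):
    """Equation (3): pHLC = C0 + C1 * DCW(3, 15)."""
    return c0 + c1 * dcw(smiles_attributes(smiles), cw)
```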

The Monte Carlo Optimization
Equation (4) requires numerical values for the above correlation weights. Monte Carlo optimization is the tool used to calculate them. The target function (Equation (5)) for the Monte Carlo optimization is built from rAT and rPT, the correlation coefficients between the observed and predicted endpoints for the active and passive training sets, respectively.
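A schematic Monte Carlo loop is sketched below. The target function TF = rAT + rPT − |rAT − rPT| × 0.1 is a form reported in CORAL-related publications; the weighting constant 0.1, the perturbation step, and the data layout are assumptions made for illustration:

```python
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def dcw(attributes, cw):
    """Descriptor: sum of correlation weights of a compound's attributes."""
    return sum(cw.get(a, 0.0) for a in attributes)

def target_function(cw, active, passive):
    """TF = rAT + rPT - |rAT - rPT| * 0.1 (assumed weighting constant)."""
    r_at = pearson_r([dcw(m, cw) for m, _ in active], [y for _, y in active])
    r_pt = pearson_r([dcw(m, cw) for m, _ in passive], [y for _, y in passive])
    return r_at + r_pt - abs(r_at - r_pt) * 0.1

def monte_carlo(active, passive, epochs=15, seed=0):
    """Greedy Monte Carlo: keep a random perturbation of a correlation
    weight only if it improves the target function."""
    rng = random.Random(seed)
    attrs = sorted({a for mol, _ in active for a in mol})
    cw = {a: 1.0 for a in attrs}
    best = target_function(cw, active, passive)
    for _ in range(epochs):
        for a in attrs:
            old = cw[a]
            cw[a] = old + rng.uniform(-0.5, 0.5)
            tf = target_function(cw, active, passive)
            if tf > best:
                best = tf
            else:
                cw[a] = old  # revert perturbations that do not help
    return cw, best
```

Because only improving moves are accepted, the target function is non-decreasing over the epochs; in CORAL, the calibration set (not shown here) decides when to stop this loop.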

Conclusions
In this work, the system of self-consistent models generated by the CORAL software was applied to develop a predictive model based on a large set of substances with values of Henry's law constants. The suggested version of the optimal descriptor and the Monte Carlo optimization provide satisfactory predictive potential for all ten random splits of the experimental data. In addition, the self-consistency of the models was confirmed by testing the predictive potential of each model on the validation sets of the other splits (Tables 2-4).

Figure 1. The principle of the selection of the list of compounds from the validation set of the k-th split to confirm the predictive potential of the i-th model. The symbol '*' denotes the subset of compounds of the validation set of the k-th split, excluding compounds included in the active training, passive training, or calibration set of the i-th split.


Figure 2. The graphical representation of the model for pHLC observed in the case of split 1. A1, P1, C1, and V1 are the active training, passive training, calibration, and validation sets, respectively.


Figure 3. Anti-correlation between the number of active SMILES attributes involved in the Monte Carlo optimization and the average determination coefficient of the models over all splits.


Table 1. The statistical quality of the models observed for ten random splits into the active training (A), passive training (P), calibration (C), and validation (V) sets.

(*) n = the number of compounds in a set; R2 = determination coefficient; CCC = concordance correlation coefficient…

Table 2. The numbers of compounds included in the process of checking the predictive potential of the i-th model with the k-th split.

Table 3. Determination coefficients observed when testing the predictive potential of the i-th model with the k-th split, together with the average values and dispersions of the determination coefficients.

Table 4. Mean absolute error values observed when testing the predictive potential of the i-th model with the k-th split.