The first group includes methods that can be used on top of QRPR models implemented by any machine learning method (denoted as Universal AD), and the second one includes methods that are integral parts of the machine learning (ML) methods implementing the QRPR models (denoted as ML-dependent AD). Such models not only predict reaction property values but also estimate prediction confidence, so the reactions predicted with high confidence are within the AD. Here, we only consider regression models and the corresponding ADs.
2.1.1. Universal Applicability Domain Definition Approaches
Universal AD definition methods can be used on top of QRPR models, which can be implemented by any suitable machine learning method. In this study, the Random Forest Regressor (see below) is used for this purpose. Some AD definition methods (Leverage, Nearest Neighbours approach, One-Class SVM, Two-Class Y-inlier/Y-outlier classifier) return a continuous value indicating the reliability of prediction. When using these methods, it is necessary to choose a threshold for this value and, for some of them, the values of other adjustable hyperparameters. Such methods correspond to the reliability aspect of AD definition according to Hanser et al. [12]. Other methods (Bounding Box, Fragment Control, Reaction Type Control) give a binary answer as to whether a test object is within the AD or not. Bounding Box and Fragment Control do not have any adjustable hyperparameters, whereas Reaction Type Control has one (the neighbourhood radius, see below). Such methods correspond to the applicability aspect of AD definition according to Hanser et al. [12].
Leverage. This method is based on the Mahalanobis distance to the centre of the training-set distribution. The leverage h of a chemical reaction is calculated from the "hat" matrix as h = x_i^T (X^T X)^(-1) x_i, where X is the training-set descriptor matrix and x_i is the descriptor vector of reaction i. The leverage threshold is usually defined as h* = 3(M + 1)/N, where M is the number of descriptors and N is the number of training examples. Chemical reactions with leverage values h > h* are considered chemically different from the training-set reactions, so they are marked as X-outliers [3,4]. This approach is denoted hereafter as Leverage. Its drawback is the absence of strict rules for choosing the threshold h* [25]. As an alternative, an optimal threshold value h* can be found in an internal cross-validation procedure by maximizing some AD performance metric. This method is denoted hereafter as Lev_cv. Leverage has no internal hyperparameters, while Lev_cv has one, the optimal threshold, which needs to be adjusted.
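As an illustration, the leverage computation and the conventional threshold can be sketched with NumPy on synthetic descriptor data (the matrix shapes and the formula h* = 3(M + 1)/N follow the definitions above; the data are synthetic):

```python
import numpy as np

def leverages(X_train, X_test):
    # h_i = x_i^T (X^T X)^(-1) x_i, the diagonal of the "hat" matrix
    core = np.linalg.inv(X_train.T @ X_train)
    return np.einsum("ij,jk,ik->i", X_test, core, X_test)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))      # N = 100 training reactions, M = 5 descriptors

h_train = leverages(X, X)
h_star = 3 * (5 + 1) / 100         # conventional threshold h* = 3(M + 1)/N
in_ad = h_train <= h_star          # X-inliers: reactions with h <= h*
```

A useful sanity check is that the training-set leverages sum to the rank of X (here 5), since they are the diagonal of the projection matrix X(X^T X)^(-1) X^T.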
Nearest Neighbours approach (denoted as Z-kNN). This AD definition is based on the distance(s) between the current reaction and the closest training-set reaction(s). Usually, one nearest neighbour is considered (k = 1). If the distance exceeds a user-defined threshold, the prediction is considered unreliable and the reaction is treated as an X-outlier. The threshold value is commonly taken as Dc = Zσ + <y>, where <y> is the average and σ is the standard deviation of the Euclidean distances between nearest neighbours in the training set, and Z is an empirical parameter controlling the significance level, with a recommended value of 0.5 [3,4,25]. Such a method is denoted as Z-1NN. An optimal threshold can also be found in an internal cross-validation procedure by maximizing some AD performance metric; in this case, the method is denoted as Z-1NN_cv.
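The Z-1NN threshold Dc = Zσ + <y> can be sketched as follows with scikit-learn on synthetic descriptors (note that when querying the training set against itself, the first neighbour of each point is the point itself and must be skipped):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 5))
X_test = rng.normal(size=(20, 5))

# Euclidean distance from each training reaction to its nearest training neighbour;
# column 0 is the point itself (distance 0), so take column 1
nn = NearestNeighbors(n_neighbors=2).fit(X_train)
d_train = nn.kneighbors(X_train)[0][:, 1]

Z = 0.5                                    # recommended significance parameter
Dc = Z * d_train.std() + d_train.mean()    # threshold Dc = Z*sigma + <y>

# A test reaction is an X-inlier if its 1-NN distance is within the threshold
d_test = nn.kneighbors(X_test, n_neighbors=1)[0][:, 0]
in_ad = d_test <= Dc
```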
One-Class Support Vector Machine (denoted as 1-SVM). The one-class Support Vector Machine method reveals highly populated zones in descriptor space by maximizing the distance between a separating hyperplane and the origin of the feature space implicitly defined by some Mercer kernel. The decision function of such a model returns (+1) for reactions that fall into highly populated zones (within AD, i.e., X-inliers) and (−1) for reactions outside the AD (X-outliers) [9,26]. 1-SVM models were built in this study using the scikit-learn library [27]. The method requires the fitting of two hyperparameters: nu (which sets an upper bound on the fraction of errors and a lower bound on the fraction of support vectors) and gamma (the parameter of the RBF kernel used), whose optimal values can be found in cross-validation (see Table S1). Other hyperparameters were set to default values.
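A minimal sketch of this AD check with scikit-learn's OneClassSVM follows; the nu and gamma values here are illustrative placeholders, not the cross-validated optima from Table S1, and the data are synthetic:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(size=(300, 5))
X_test = np.vstack([
    rng.normal(size=(10, 5)),             # drawn from the training distribution
    rng.normal(loc=8.0, size=(10, 5)),    # far outside the populated zone
])

# nu bounds the fraction of training errors; gamma is the RBF kernel parameter
model = OneClassSVM(kernel="rbf", nu=0.05, gamma=0.1).fit(X_train)
labels = model.predict(X_test)   # +1 = X-inlier (within AD), -1 = X-outlier
```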
Two-Class Y-inlier/Y-outlier Classifier (denoted as 2CC). In this case, a binary classifier learns to distinguish Y-inliers from Y-outliers. First, QRPR models are built to predict quantitative characteristics of chemical reactions. Chemical reactions with a high prediction error estimated in cross-validation (more than 3 × RMSE) are labelled as Y-outliers, while the remaining reactions are labelled as Y-inliers. After that, a binary classification model is trained to discriminate between them and provide a confidence score that a given reaction is a Y-inlier for the corresponding QRPR model. Although this method seems quite straightforward, we have not found its application in the literature. Unfortunately, it cannot be applied if there are no or too few Y-outliers. In this study, the Random Forest Classifier implemented in the scikit-learn library [27] was used for building the binary classification model. The method requires setting the values of two hyperparameters: max_features (the number of features considered upon tree branching) and the probability threshold p* (see Table S1). If the predicted probability of belonging to the Y-inliers is greater than p*, the prediction of reaction characteristics by the QRPR model is considered reliable (within AD, or X-inlier). Other hyperparameters of the Random Forest Classifier were set to defaults, except the number of decision trees, which was set to 500.
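The 2CC pipeline (cross-validated errors → 3 × RMSE labelling → binary classifier) can be sketched on synthetic data as below; the injected outliers, the p* value and the tree counts are illustrative assumptions, not the settings tuned in this study:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = X[:, 0] + 0.1 * rng.normal(size=300)
y[:15] += 5.0        # synthetic hard-to-predict reactions (future Y-outliers)

# Step 1: label Y-outliers from cross-validated QRPR prediction errors
qrpr = RandomForestRegressor(n_estimators=100, random_state=0)
errs = np.abs(y - cross_val_predict(qrpr, X, y, cv=5))
rmse = np.sqrt((errs ** 2).mean())
y_inlier = (errs <= 3 * rmse).astype(int)   # 1 = Y-inlier, 0 = Y-outlier

# Step 2: train the binary classifier and threshold its Y-inlier probability
clf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y_inlier)
p_star = 0.5                                 # illustrative threshold
in_ad = clf.predict_proba(X)[:, 1] > p_star  # column 1 = P(Y-inlier)
```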
Bounding Box (denoted as BB). This approach defines the AD as a D-dimensional box with each edge spanning the range between the minimum and maximum values of the corresponding descriptor. If at least one descriptor value for a given reaction falls outside the range defined by the minimum and maximum values over the training examples, the reaction is considered outside the AD of the corresponding QRPR model [3,4]. The approach has no adjustable hyperparameters.
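Since the check is a simple per-descriptor range test, it can be sketched in a few lines of NumPy (the descriptor values below are synthetic):

```python
import numpy as np

def bounding_box_ad(X_train, X_test):
    lo, hi = X_train.min(axis=0), X_train.max(axis=0)
    # X-inlier only if every descriptor lies within the training range
    return ((X_test >= lo) & (X_test <= hi)).all(axis=1)

X_train = np.array([[0.0, 1.0],
                    [2.0, 3.0],
                    [1.0, 2.0]])          # ranges: [0, 2] and [1, 3]
X_test = np.array([[1.0, 2.0],           # within both ranges
                   [3.0, 2.0]])          # first descriptor out of range
in_ad = bounding_box_ad(X_train, X_test)
```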
Fragment Control (denoted as FC). In this case, if the Condensed Graph of Reaction [28,29] (Figure 1) representing a given reaction contains fragments (subgraphs) missing in the training set, the reaction is considered an X-outlier (out of AD) whenever the corresponding QRPR model is applied [8,30]. FC can formally be considered a special case of Bounding Box for fragment descriptors. This method has no adjustable parameters.
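At its core, FC reduces to a set-containment test on fragments. A sketch with hypothetical fragment labels (real fragments would be subgraph descriptors generated from the CGR, not the toy strings used here):

```python
# Hypothetical fragment vocabulary observed in the training set
train_fragments = {"C-C", "C=O", "C-O", "C-N"}

def fragment_control_ad(reaction_fragments, train_fragments):
    # X-inlier only if every fragment of the reaction was seen in training
    return reaction_fragments <= train_fragments   # subset test

in_ad_known = fragment_control_ad({"C-C", "C=O"}, train_fragments)
in_ad_novel = fragment_control_ad({"C-C", "C#N"}, train_fragments)  # unseen C#N
```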
Reaction Type Control (denoted as RTC_cv). This method was first proposed in our previous publication [16] for predicting kinetic parameters. If the reaction centre of a chemical reaction is absent from the reactions in the training set, the reaction is considered out of AD (X-outlier). The reaction centre is detected using reaction signatures [31]. Signature creation includes (1) representing a chemical reaction as a Condensed Graph of Reaction (CGR, Figure 1, top centre), (2) highlighting one or more reaction centres, identified as sets of adjacent dynamic atoms and bonds on the CGR, (3) considering the atoms in a neighbourhood of radius R around each reaction centre (Figure 1, top right), (4) introducing a canonical numbering of the atoms of the reaction centre and its neighbourhood using an algorithm similar to the Morgan algorithm, and (5) encoding the signature as a SMILES-like canonical string generated by the CGRtools library (Figure 1) [31]. For every atom, the hybridization, the number of adjacent heavy atoms and the chemical element symbol are encoded in the signature. In order to distinguish whether an aromatic cycle is part of the reaction centre or its closest substituent, we introduced a separate hybridization type for aromatic atoms. Thus, sp3, sp2 and sp hybridizations for aliphatic atoms and the "aromatic" hybridization were used. The signature includes atomic labels both on the reaction centre itself and in its neighbourhood. The neighbourhood radius R is a hyperparameter of the method; if the radius is set to 0, the reaction signature includes only the atoms of the reaction centre. Since this hyperparameter needs to be selected, the method is denoted as RTC_cv.
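Once signatures are generated, the AD test is a membership check whose strictness grows with the radius R. The sketch below uses hypothetical signature strings (real ones are canonical SMILES-like strings produced by CGRtools) to show how a reaction can pass at R = 0 but fail at R = 1:

```python
# Hypothetical signatures at two neighbourhood radii for the training set
train_signatures = {
    0: {"C=O>>C-O", "C-Cl>>C-N"},              # reaction centre only
    1: {"C(C)=O>>C(C)-O", "C(c)-Cl>>C(c)-N"},  # centre + radius-1 neighbourhood
}

def rtc_ad(reaction_sigs, train_sigs, radius):
    # X-inlier only if the signature at the chosen radius was seen in training
    return reaction_sigs[radius] in train_sigs[radius]

# A query whose centre is known but whose radius-1 environment is new
query = {0: "C=O>>C-O", 1: "C(N)=O>>C(N)-O"}
in_ad_r0 = rtc_ad(query, train_signatures, radius=0)   # centre was seen
in_ad_r1 = rtc_ad(query, train_signatures, radius=1)   # neighbourhood differs
```

This also makes clear why R must be tuned in cross-validation: larger radii make the AD stricter, trading coverage for reliability.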
2.1.2. ML-Dependent Applicability Domain Definition Approaches
The variance in the prediction density given by a Gaussian Process Regression model (denoted as GPR-AD). Gaussian Process Regression (GPR) assumes that the joint distribution of a real-valued property of chemical reactions and their descriptors is multivariate normal (Gaussian), with the elements of its covariance matrix computed by means of special covariance functions (kernels). For every reaction, a GPR model produces a posterior conditional distribution (the so-called prediction density) of the reaction property given the vector of reaction descriptors. The prediction density is normal (Gaussian), with the mean corresponding to the predicted value of the property and the variance corresponding to the prediction confidence [32]. If the variance is greater than a predefined threshold σ*, the chemical reaction is considered an X-outlier (out of AD). This AD definition method is denoted as GPR-AD. The GPR implementation in the scikit-learn library was used [27]. The method requires the adjustment of three hyperparameters: alpha, which stands for the noise level (and also acts as a regularizer of the model); gamma, the parameter of the RBF kernel that serves as the covariance function (see Table S1); and the variance threshold σ*. The optimal values of the hyperparameters are determined using internal cross-validation. Other hyperparameters of the Gaussian Process are set to defaults.
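The behaviour of GPR-AD can be sketched on a 1D toy problem: the predictive standard deviation is small near the training data and grows far from it, so thresholding it separates X-inliers from X-outliers. Note that scikit-learn's RBF kernel is parameterized by a length_scale rather than gamma, and the σ* below is an illustrative value, not a tuned one:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
X_train = rng.uniform(-3, 3, size=(50, 1))
y_train = np.sin(X_train[:, 0]) + 0.05 * rng.normal(size=50)

# alpha is the assumed noise level; RBF plays the role of the covariance function
gpr = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-2)
gpr.fit(X_train, y_train)

X_test = np.array([[0.0], [10.0]])   # inside vs far outside the training range
y_pred, y_std = gpr.predict(X_test, return_std=True)

sigma_star = 0.5                      # illustrative variance threshold
in_ad = y_std < sigma_star            # X-inlier where the model is confident
```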
The variance in predictions made by an ensemble of QRPR models (denoted as RFR_VAR). The variance in predictions made by an ensemble of QSAR/QSPR models is often applied as a score for determining their AD [5,6]. In this study, we extend this approach to QRPR models. Here, we consider a chemical reaction to be within the AD (X-inlier) if the variance in the property values predicted by the ensemble of models is less than a given threshold σ*. The optimal value of σ* can be found using an internal cross-validation procedure by maximizing an AD quality metric (see below). One way to estimate the required prediction variance is to build a QRPR model on the whole training set using the Random Forest Regression (RFR) machine learning method, which provides the mean of the predictions made by the individual random trees (taken as the predicted value of the reaction property) and their variance (taken as a measure of prediction confidence). This approach is denoted hereafter as RFR_VAR. In this study, a modified version of the Random Forest Regression method (RFR, 500 trees) implemented in the scikit-learn library [27] was used.
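With the standard scikit-learn RandomForestRegressor (the paper's modified version is not reproduced here), the per-tree predictions needed for the variance score can be collected from the estimators_ attribute; the σ* below is an illustrative value on synthetic data:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_train = rng.uniform(-3, 3, size=(200, 2))
y_train = X_train[:, 0] * X_train[:, 1] + 0.1 * rng.normal(size=200)

rfr = RandomForestRegressor(n_estimators=500, random_state=0)
rfr.fit(X_train, y_train)

X_test = np.array([[0.0, 0.0], [15.0, -15.0]])
# Predictions of the individual trees in the ensemble
per_tree = np.stack([tree.predict(X_test) for tree in rfr.estimators_])
y_pred = per_tree.mean(axis=0)   # ensemble mean = predicted property value
y_var = per_tree.var(axis=0)     # ensemble variance = AD score

sigma_star = 1.0                 # illustrative threshold
in_ad = y_var < sigma_star
```

The ensemble mean computed this way coincides with rfr.predict, since the forest prediction is the average over its trees.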