#### 4.3.1. Model Generation

A conformal predictor makes valid predictions according to a user-defined significance level. The significance level is the percentage of errors, acceptable to the user, that the model may commit. In a binary classification problem, class labels are assigned to new compounds by comparing them to calibration sets with known labels (mutagenic and nonmutagenic). These calibration sets are randomly selected from the training set (see the following section on model validation). If the prediction outcome for a new compound is higher than the set significance level, i.e., the compound is similar enough to the corresponding predictions for the calibration set compounds for class M (mutagenic) or NM (nonmutagenic), respectively, the new compound is assigned that class label. This procedure is performed for each new compound and each label (class) in the dataset. Consequently, for a binary classification problem there are four possible outcomes: a new instance can be labelled with either of the two classes, assigned both labels (*both* classification), or neither label (*empty* classification). The procedure is illustrated in Figure 2 and described below in more detail.

The percentage of trees in the random forest ensemble predicting each of the two classes (class probability) is used as a conformal prediction similarity (conformity) measure. Conformal prediction assigns classes to new compounds by comparing the class probability against the corresponding sorted list of class probabilities for the calibration set associated with each RF model.

The predicted class probabilities for classes M and NM of the new compound are placed in the sorted lists of calibration set probabilities for classes M and NM, respectively, adding one compound to the list for each class. The position of the new compound in each of these two sorted lists is determined, and the fraction of the list with lower probabilities is calculated. This fraction is compared to the corresponding significance level set by the user. For a new compound to be assigned a class, the calculated fraction must be larger than or equal to the set significance level.

The four possible outcomes from a binary classification task are illustrated in Figure 2 and described in the following section:

New compound C1 has predicted class probabilities for classes M and NM of 0.73 and 0.27, respectively. Placing these probabilities in the corresponding sorted calibration set lists of probabilities results in position 6 for class M and position 1 for class NM. The corresponding calculated fractions, called conformal prediction p-values, are 0.75 and 0.0, respectively. The set significance level in this example is 0.20, and new compound C1 can be assigned to class M (0.75 ≥ 0.20) but not to class NM (0.0 < 0.20). Similarly, new compound C2 can only be assigned to class NM. For the two remaining new compounds, C3 and C4, the situation is somewhat different. For new compound C3, the calculated fractions for both classes are above the set significance level and, consequently, this compound is assigned to both class M and class NM (the *both* class). For new compound C4 the situation is the opposite: both calculated fractions are below the set significance level and new compound C4 cannot be assigned to either of the two classes by the model (the *empty* class). For new compound C4 it should be noted, for clarity, that 7 decision trees did not give a class assignment, i.e., the resulting leaf node was unable to provide a majority class vote.
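The p-value calculation in the example above can be sketched in Python. The calibration probabilities below are hypothetical, chosen so that compound C1 reproduces the positions and p-values from the text:

```python
def conformal_p_value(cal_probs, new_prob):
    """Fraction of the combined, sorted list with a lower class
    probability than the new compound."""
    lower = sum(1 for p in cal_probs if p < new_prob)
    return lower / (len(cal_probs) + 1)  # +1: the new compound joins the list

# Hypothetical calibration set probabilities (7 compounds per class)
cal_M = [0.10, 0.25, 0.40, 0.55, 0.60, 0.70, 0.90]
cal_NM = [0.90, 0.75, 0.60, 0.45, 0.40, 0.30, 0.28]

significance = 0.20
p_M = conformal_p_value(cal_M, 0.73)    # 6 of 8 positions below -> 0.75
p_NM = conformal_p_value(cal_NM, 0.27)  # 0 of 8 positions below -> 0.0

# A class is assigned when its p-value reaches the significance level
prediction_set = [c for c, p in [("M", p_M), ("NM", p_NM)] if p >= significance]
print(prediction_set)  # ['M'] -- C1 is assigned class M only
```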

For more examples of how conformal prediction is carried out, we refer the reader to [21].

The performance of a conformal predictor is often measured by its *validity* and *efficiency*. A conformal predictor is *valid* if the percentage of errors does not exceed the set significance level. In conformal prediction, a prediction is considered correct if it includes the correct class label, which means that *both* predictions are always correct and, vice versa, *empty* predictions are never correct (i.e., always erroneous). Thus, the *validity* for class A (M or NM) is the number of correct class predictions plus the number of *both* class predictions for class A compounds, divided by the total number of predicted class A compounds:
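Written as an equation (notation ours):

```latex
\mathrm{validity}_A = \frac{n_{\text{correct},A} + n_{\text{both},A}}{N_A}
```

where $n_{\text{correct},A}$ is the number of correct single-class predictions for class A compounds, $n_{\text{both},A}$ the number of *both* predictions for class A compounds, and $N_A$ the total number of predicted class A compounds.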

The efficiency in conformal prediction is calculated as the percentage of single class predictions, regardless of whether they are correct or not, in relation to the total number of predicted compounds. Thus, if 75% of the predicted compounds are assigned to exactly one of class M or class NM, then the efficiency of the conformal prediction model is 0.75. The rest of the compounds (25%) are, consequently, predicted as *both* or *empty* class compounds.
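As a sketch, both measures can be computed from a list of prediction sets (one set of labels per compound); the data below are illustrative, echoing the four outcomes for C1–C4:

```python
def validity(pred_sets, true_labels):
    # A prediction is correct if it contains the true label:
    # "both" sets are always correct, "empty" sets never are.
    correct = sum(1 for s, y in zip(pred_sets, true_labels) if y in s)
    return correct / len(pred_sets)

def efficiency(pred_sets):
    # Fraction of single-label predictions, correct or not.
    singles = sum(1 for s in pred_sets if len(s) == 1)
    return singles / len(pred_sets)

# Illustrative predictions for four compounds (cf. C1-C4 in Figure 2)
pred_sets = [{"M"}, {"NM"}, {"M", "NM"}, set()]
true_labels = ["M", "M", "NM", "NM"]

print(validity(pred_sets, true_labels))  # 0.5 (C1 and C3 contain the true label)
print(efficiency(pred_sets))             # 0.5 (C1 and C2 are single-label)
```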

A common trade-off in conformal prediction is that between the validity and the efficiency of the model.

We have used the RF algorithm [35] for deriving the underlying models in our conformal predictors. The models were developed using Python, Scikit-learn [36] version 0.17, and the nonconformist package [37] version 1.2.5. Binary classification models were built based on RF using the Scikit-learn RandomForestClassifier with 100 trees and all other options set at their default values.

#### 4.3.2. Model Validation

The data set was randomly divided into a training (70%, 656 compounds) and an external test (30%, 280 compounds) set. This procedure was repeated 50 times to generate 50 pairs of random training and external test sets. The cross-conformal prediction method described by Sun et al. [38] was then applied, and each training set was further divided into a proper training set and a calibration set, 80% and 20%, respectively, using five random cross-validation folds (Figure 3, left part). The proper training set was used for deriving the RF model and the calibration set for predicting the conformal prediction *p*-values of the test set. For each of the five folds, the *p*-value predictions on the corresponding external test set were stored, and the results over all 50 test sets were used for class assignments in accordance with the set significance levels (Figure 2) and reported in Table 1. In addition, from each of the five proper training sets generated by the fivefold cross-validation, an internal training (80%) and internal test set (20%) were randomly selected. The latter set was then predicted and assigned classes using the same procedure as for the external test set, over all 50 internal test sets.
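The cross-conformal loop can be sketched as follows, on synthetic data. Each of the five folds yields an 80/20 proper-training/calibration split; averaging the per-fold p-values is one common aggregation choice and is an assumption here, not necessarily the exact scheme used:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
# 70/30 training / external test split, as in the text
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

def p_values(cal_probs, test_probs):
    # Fraction of calibration probabilities below each test probability
    return np.array([(cal_probs < p).sum() / (len(cal_probs) + 1)
                     for p in test_probs])

fold_p = []  # per-fold p-values for one class (class 1 here)
for tr_idx, cal_idx in KFold(n_splits=5, shuffle=True,
                             random_state=0).split(X_tr):
    # tr_idx covers 80% (proper training), cal_idx 20% (calibration)
    rf = RandomForestClassifier(n_estimators=100, random_state=0)
    rf.fit(X_tr[tr_idx], y_tr[tr_idx])
    cal_probs = rf.predict_proba(X_tr[cal_idx])[:, 1]
    test_probs = rf.predict_proba(X_te)[:, 1]
    fold_p.append(p_values(cal_probs, test_probs))

# Aggregate the five folds into one p-value per external test compound
p_class1 = np.mean(fold_p, axis=0)
print(p_class1.shape)
```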

In order to quantify the quality of the derived models, the following classification metrics were used [39]:
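The equations are not reproduced here; standard metrics consistent with the definitions that follow (in which $p_o$ and $p_e$ enter Cohen's kappa) would be:

```latex
\mathrm{sensitivity} = \frac{TP}{TP + FN}, \qquad
\mathrm{specificity} = \frac{TN}{TN + FP}, \qquad
\kappa = \frac{p_o - p_e}{1 - p_e}
```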

where:

TP = true positives (no. of mutagenic compounds correctly classified as mutagenic);

FP = false positives (no. of nonmutagenic compounds incorrectly classified as mutagenic);

TN = true negatives (no. of nonmutagenic compounds correctly classified as nonmutagenic);

FN = false negatives (no. of mutagenic compounds incorrectly classified as nonmutagenic);

po = observed accuracy = (TP + TN)/(TP + FP + TN + FN);

pe = expected accuracy = [(TP + FN)(TP + FP) + (TN + FP)(TN + FN)]/(TP + FP + TN + FN)².
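The definitions of po and pe above combine into Cohen's kappa; a small sketch with illustrative confusion-matrix counts:

```python
def cohens_kappa(tp, fp, tn, fn):
    n = tp + fp + tn + fn
    po = (tp + tn) / n  # observed accuracy
    # expected accuracy, as defined in the text
    pe = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / n**2
    return (po - pe) / (1 - pe)

# Illustrative counts: 80 TP, 10 FP, 85 TN, 15 FN
print(round(cohens_kappa(tp=80, fp=10, tn=85, fn=15), 3))  # 0.737
```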