# Predictive Capability of QSAR Models Based on the CompTox Zebrafish Embryo Assays: An Imbalanced Classification Problem

## 1. Introduction

## 2. Results

## 3. Discussion

## 4. Materials and Methods

#### 4.1. Data Set and Chemical Representation

#### 4.2. Machine Learning Methods

^{1}-norm penalty). The quantities optimized during training are the regression coefficients (weights, bias), while the penalty is usually tuned as a hyperparameter.

RF is an ensemble classifier. Ensemble classification algorithms follow a paradigm in which multiple “weak classifiers” are trained and aggregated to improve the prediction capability and lower the prediction error. The weak learners here are decision trees, and the aggregation is conducted by means of bootstrapping (each tree is trained on a part of the data and a subset of the features) and final voting. RF is considered a non-linear method. The hyperparameter space of RF can be large and complex. Commonly optimized hyperparameters are the tree depth, the number of trees, the class weights, and the number of features utilized.

The MLP is a fully connected neural network. Neural networks are machine learning algorithms in which multiple learners are connected in layers. The learners (neurons) learn parameters (weights, bias) from the data and are “activated” by means of a non-linear function such as the sigmoid function. Hyperparameters commonly optimized in an MLP are the number of layers, the penalty function, the learning rate, and the activation function.
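The bootstrapping and voting steps described for RF can be illustrated with a minimal sketch. It is not the implementation used in this study: one-split trees (decision stumps) stand in for full decision trees, and the function names, the toy data, and the defaults (`n_trees=25`, `n_features=1`, `seed=0`) are assumptions chosen for illustration only.

```python
import random
from collections import Counter

def fit_stump(X, y, feature_ids):
    """Fit a one-split tree: pick the feature, threshold and direction with the fewest errors."""
    best = None
    for f in feature_ids:
        for t in sorted({row[f] for row in X}):
            for above_is_active in (True, False):
                preds = [int((row[f] > t) == above_is_active) for row in X]
                errors = sum(p != label for p, label in zip(preds, y))
                if best is None or errors < best[0]:
                    best = (errors, f, t, above_is_active)
    return best[1:]  # (feature index, threshold, direction)

def predict_stump(stump, row):
    f, t, above_is_active = stump
    return int((row[f] > t) == above_is_active)

def fit_forest(X, y, n_trees=25, n_features=1, seed=0):
    """Train each weak learner on a bootstrap sample and a random feature subset."""
    rng = random.Random(seed)
    n_rows, n_cols = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        rows = [rng.randrange(n_rows) for _ in range(n_rows)]  # sample rows with replacement
        feats = rng.sample(range(n_cols), n_features)          # random subset of features
        forest.append(fit_stump([X[i] for i in rows], [y[i] for i in rows], feats))
    return forest

def predict_forest(forest, row):
    """Aggregate the weak learners by majority vote."""
    votes = Counter(predict_stump(stump, row) for stump in forest)
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): two features, both low for inactives (0) and high for actives (1).
X = [[0.1, 0.2], [0.2, 0.1], [0.3, 0.4], [0.4, 0.3],
     [0.6, 0.7], [0.7, 0.6], [0.8, 0.9], [0.9, 0.8]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

forest = fit_forest(X, y)
print(predict_forest(forest, [0.05, 0.05]), predict_forest(forest, [0.95, 0.95]))
```

A single stump trained on an unlucky bootstrap sample can be wrong, but the vote over many stumps is far less likely to be, which is the motivation for aggregating weak learners.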

_{2}[30] expressed, which is named here Real-Accuracy (RA) and defined by Equations (2) and (3):

#### 4.3. Modelling

## 5. Conclusions

## Supplementary Materials

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Conflicts of Interest

## Sample Availability

## Appendix A

**Figure 1.** Scatter plots of the values of four model quality parameters against MCC for the 209 models on the respective test sets: (**a**) Real Accuracy, (**b**) Cohen’s Kappa, (**c**) Accuracy and (**d**) Balanced Accuracy.

**Figure 2.** Boxplots of MCC CV values for the training set (**a**) and MCC Test values for the test set (**b**) for the 209 models generated for 19 endpoints (on the X-axis). The threshold MCC value of 0.20 is marked by the dashed horizontal line. The median value of the quality metric for each endpoint is given by the horizontal line in each box.

**Figure 3.** Structural fragments represented by the fingerprints utilized in the final model for the JAW endpoint. The purple circle denotes the center of the fingerprint, whose radius encompasses the atoms denoted by the yellow circles. The asterisk denotes a continuation of the structure.

**Table 1.** Pearson correlation coefficients between quality metrics obtained for the test set across the 209 models.

| | Real Accuracy | MCC | Cohen’s Kappa | Accuracy | Balanced Accuracy |
|---|---|---|---|---|---|
| Real Accuracy | 1 | 0.84 | 0.86 | −0.39 | −0.28 |
| MCC | 0.84 | 1 | 0.97 | −0.24 | −0.19 |
| Cohen’s Kappa | 0.86 | 0.97 | 1 | −0.18 | −0.21 |
| Accuracy | −0.39 | −0.24 | −0.18 | 1 | 0.59 |
| Balanced Accuracy | −0.28 | −0.19 | −0.21 | 0.59 | 1 |
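As a reminder of how each entry in Table 1 is obtained, the following minimal sketch computes a Pearson correlation coefficient between two metrics evaluated over the same set of models. The score lists are made up for illustration and are not values from this study; the function name `pearson_r` is likewise an assumption.

```python
import math

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical test-set scores for five models (illustration only).
mcc = [0.10, 0.25, 0.40, 0.05, 0.30]
kappa = [0.08, 0.22, 0.41, 0.04, 0.27]
accuracy = [0.95, 0.90, 0.88, 0.97, 0.91]

print(f"MCC vs. Cohen's Kappa: {pearson_r(mcc, kappa):.2f}")
print(f"MCC vs. Accuracy:      {pearson_r(mcc, accuracy):.2f}")
```

In this toy example, as in Table 1, a metric that rewards predicting the majority class can correlate negatively with MCC.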

**Table 2.** Data set overview sorted by the number of active compounds per endpoint. All endpoints are binary variables taking only the values 1 or 0 (active or inactive). The number of missing values for each endpoint is given in the last column (“Missing Values”).

| Endpoint | Negative (0) | Positive (1) | Missing Values |
|---|---|---|---|
| AXIS | 882 | 108 | 28 |
| ActivityScore | 812 | 187 | 19 |
| BRAI | 930 | 60 | 28 |
| CFIN | 942 | 48 | 28 |
| CIRC | 972 | 18 | 28 |
| EYE | 913 | 77 | 28 |
| JAW | 881 | 109 | 28 |
| MORT | 884 | 115 | 19 |
| NC | 977 | 13 | 28 |
| OTIC | 949 | 41 | 28 |
| PE | 874 | 116 | 28 |
| PFIN | 936 | 54 | 28 |
| PIG | 945 | 45 | 28 |
| SNOU | 883 | 107 | 28 |
| SOMI | 952 | 38 | 28 |
| SWIM | 958 | 32 | 28 |
| TRUN | 934 | 56 | 28 |
| TR | 912 | 78 | 28 |
| YSE | 867 | 123 | 28 |
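The degree of class imbalance implied by Table 2 can be made explicit with a short calculation. The sketch below uses the counts of a few endpoints copied from the table; the helper name `active_fraction` is an assumption introduced for illustration.

```python
# (inactive, active) counts copied from Table 2 for a few endpoints.
counts = {
    "ActivityScore": (812, 187),
    "YSE": (867, 123),
    "JAW": (881, 109),
    "CIRC": (972, 18),
    "NC": (977, 13),
}

def active_fraction(inactive, active):
    """Share of labelled compounds that are active for one endpoint."""
    return active / (inactive + active)

# List the endpoints from most to fewest actives.
for endpoint, (inactive, active) in sorted(counts.items(), key=lambda kv: kv[1][1], reverse=True):
    fraction = active_fraction(inactive, active)
    print(f"{endpoint:14s} actives={active:4d}  fraction={fraction:.3f}  inactive:active={inactive / active:.1f}:1")
```

For these counts, the active class ranges from about 18.7% of the labelled compounds (ActivityScore) down to about 1.3% (NC).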

| | Positive (Model) (1) | Negative (Model) (0) |
|---|---|---|
| Positive (Experimental) (1) | TP | FN |
| Negative (Experimental) (0) | FP | TN |
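To show how the four cells of the confusion matrix feed the quality metrics compared in this work, the sketch below computes Accuracy, Balanced Accuracy, MCC and Cohen’s Kappa from TP, FN, FP and TN using their standard textbook definitions. Real Accuracy is omitted because its defining equations are not reproduced here. The function name and the convention of returning 0 when a denominator vanishes are assumptions of this sketch.

```python
import math

def classification_metrics(tp, fn, fp, tn):
    """Standard binary-classification metrics from the confusion matrix cells."""
    n = tp + fn + fp + tn
    accuracy = (tp + tn) / n
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    balanced_accuracy = (sensitivity + specificity) / 2
    # MCC; set to 0 by convention (an assumption here) when a row or column is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_observed = accuracy
    p_chance = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / n ** 2
    kappa = (p_observed - p_chance) / (1 - p_chance) if p_chance != 1 else 0.0
    return {"accuracy": accuracy, "balanced_accuracy": balanced_accuracy,
            "mcc": mcc, "kappa": kappa}

# A trivial model that predicts "inactive" for every compound of the NC endpoint
# (977 inactive and 13 active compounds, see Table 2).
trivial = classification_metrics(tp=0, fn=13, fp=0, tn=977)
for name, value in trivial.items():
    print(f"{name:18s} {value:.3f}")
```

The trivial model reaches an accuracy of about 0.987 while its MCC and Cohen’s Kappa are 0 and its balanced accuracy is 0.5, which illustrates why plain accuracy is uninformative for heavily imbalanced endpoints.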

| Classifier | Feature Set | * Scaling | ** Feature Selection | Endpoints |
|---|---|---|---|---|
| Logistic regression | Fingerprints | No | No | 19 |
| Multilayer perceptron | Fingerprints | No | No | 19 |
| Random forest | Descriptors | No | No | 19 |
| Random forest | Descriptors | No | Yes | 19 |
| Random forest | Fingerprints | No | No | 19 |
| Logistic regression | Descriptors | Yes | No | 19 |
| Logistic regression | Descriptors | Yes | Yes | 19 |
| Multilayer perceptron | Descriptors | Yes | No | 19 |
| Multilayer perceptron | Descriptors | Yes | Yes | 19 |
| Random forest | Descriptors | Yes | No | 19 |
| Random forest | Descriptors | Yes | Yes | 19 |
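The table above lists 11 modelling configurations, each applied to 19 endpoints, which accounts for the 209 models referred to in the figure and table captions. The sketch below enumerates these combinations; the variable names are assumptions, and the endpoint names are taken from Table 2.

```python
ENDPOINTS = [
    "AXIS", "ActivityScore", "BRAI", "CFIN", "CIRC", "EYE", "JAW", "MORT", "NC", "OTIC",
    "PE", "PFIN", "PIG", "SNOU", "SOMI", "SWIM", "TRUN", "TR", "YSE",
]

# (classifier, feature set, scaling, feature selection), one tuple per table row.
CONFIGURATIONS = [
    ("Logistic regression", "Fingerprints", False, False),
    ("Multilayer perceptron", "Fingerprints", False, False),
    ("Random forest", "Descriptors", False, False),
    ("Random forest", "Descriptors", False, True),
    ("Random forest", "Fingerprints", False, False),
    ("Logistic regression", "Descriptors", True, False),
    ("Logistic regression", "Descriptors", True, True),
    ("Multilayer perceptron", "Descriptors", True, False),
    ("Multilayer perceptron", "Descriptors", True, True),
    ("Random forest", "Descriptors", True, False),
    ("Random forest", "Descriptors", True, True),
]

# One model per (endpoint, configuration) pair.
models = [(endpoint, *config) for endpoint in ENDPOINTS for config in CONFIGURATIONS]
print(len(CONFIGURATIONS), len(ENDPOINTS), len(models))  # 11 19 209
```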

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

