# Parsimonious Optimization of Multitask Neural Network Hyperparameters


## Abstract


## 1. Introduction

## 2. Results

#### 2.1. Grid Search and Convergence of GA and TPE

The overall non-error rate (NER_{T}) associated with the architectures generated by all the possible combinations of hyperparameters at the considered levels (i.e., GS results, 196,608 combinations) has a peculiar distribution for each dataset. In particular, the NURA dataset seems to be easier to model, with about half of the architectures (46%) providing a NER_{T} greater than 60%, while for the Tox21 and ClinTox datasets this fraction was reduced to 21% and 25%, respectively.
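For scale, the size of such an exhaustive grid is just the product of the number of levels per hyperparameter. The level counts below are hypothetical, chosen only so that the product matches the 196,608 combinations reported above (the actual levels are listed in Figure 5):

```python
from math import prod

# Hypothetical level counts: eight hyperparameters with four levels each
# and one with three (the real grid is defined in Figure 5).
levels = [4] * 8 + [3]

# An exhaustive grid search evaluates every combination once.
n_combinations = prod(levels)
print(n_combinations)  # 196608
```

Any parsimonious strategy (GA, TPE, RS) samples only a tiny fraction of this space, which is what makes the computational-time comparison in Table 1 so lopsided.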

The difference in NER_{T} between the best GA and the best GS solutions ranged from 1% (for NURA) to 3% (for ClinTox) (Table 1). TPE (third column in Figure 1) seems to explore the hyperparameter space more effectively than GA, yielding satisfactory results for most trials. However, the best TPE combinations provided a NER_{T} comparable to GA, with an average difference always below 0.5%.

#### 2.2. Random Search vs. GA and TPE

Figure 2 compares the best 10 solutions found by each strategy in terms of NER_{T}. According to two-tailed t-tests (Appendix A, Table A1, Table A2 and Table A3), both GA and TPE provided significantly higher NER_{T} than RS for all datasets. Among them, GA and TPE were significantly different only for the NURA dataset.
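The pairwise comparisons in Appendix A follow the standard two-sample t-test recipe: the variances are pooled when an F test does not reject their equality, and Welch's approximation is used otherwise. A minimal pure-Python sketch (not the authors' code):

```python
import math
from statistics import mean, variance


def two_sample_t(a, b, equal_var=True):
    """Two-sample t statistic and degrees of freedom.

    equal_var=True pools the variances (Student's t); False uses Welch's
    approximation, as done when the F test rejects equal variances.
    """
    na, nb = len(a), len(b)
    va, vb = variance(a), variance(b)  # sample variances
    if equal_var:
        sp = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)  # pooled variance
        se = math.sqrt(sp * (1 / na + 1 / nb))
        df = na + nb - 2
    else:
        se = math.sqrt(va / na + vb / nb)
        df = (va / na + vb / nb) ** 2 / (
            (va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1)
        )  # Welch–Satterthwaite degrees of freedom
    return (mean(a) - mean(b)) / se, df
```

The t statistic is then compared against the two-tailed critical value at the 95% confidence level for the resulting degrees of freedom, as reported in Tables A1–A6.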

Solutions were also compared in terms of the correct prediction of active and inactive molecules (sensitivity (SN_{T}) and specificity (SP_{T}), respectively), as shown in Figure 3. In all cases, two populations of solutions could be observed, one with SN_{T} and SP_{T} around 50% (worst solutions) and the other with solutions approaching those of the best GS combinations (upper right corner). GA and TPE solutions (orange and green points) converged to the upper right corner (i.e., with SN_{T} and SP_{T} values approaching 100% of correct identification). For all considered datasets, the final GA population (red points) included architectures associated with optimal classification performance.

For the ClinTox dataset, some solutions showed a SN_{T} significantly greater than SP_{T}. This can be due to the extreme data imbalance in opposite directions for the two tasks, one of which had 94.1% while the other had only 7.1% of active samples in the training set.

#### 2.3. Performance on External Test Set

On the external test set, GA and TPE significantly outperformed RS in terms of NER_{T} according to a t-test (at a 95% confidence level) for two of the three datasets (NURA and Tox21, see Appendix A, Table A4, Table A5 and Table A6). For the ClinTox dataset, both GA and TPE results were comparable to the RS ones; indeed, all three approaches provided a NER_{T} 9% lower on average than the best GS result. For the other two datasets, GA and TPE provided better results in terms of NER_{T}, with only 1% and 2% differences from the best GS solution for the NURA and Tox21 datasets, respectively. GA and TPE provided comparable results in terms of NER_{T} for all datasets. In general, TPE tends to provide solutions showing a slightly higher SP_{T} and a slightly lower SN_{T} than GA.

However, the difference between GA and TPE in terms of SN_{T} and SP_{T} was always 2% or less.

For the Tox21 dataset, the best GS solution showed a difference between SP_{T} and SN_{T} of around 9%.

#### 2.4. GA and DoE

A hyperparameter was considered relevant when the sign of its D-optimal model coefficient was coherent with the relative frequency of its levels among the best architectures in terms of NER_{T}. Otherwise, the importance of the hyperparameter was considered inconclusive.

## 3. Discussion

The difference in NER_{T} between the best results of GA, TPE, and GS ranged from 1% (for NURA) to 3% (for ClinTox).

## 4. Materials and Methods

#### 4.1. Dataset

#### 4.2. Multitask Neural Network

#### 4.3. Classification Performance of Multitask Neural Networks

TP_{t}, TN_{t}, FP_{t}, and FN_{t} were computed as the number of true positives, true negatives, false positives, and false negatives for the t-th task. To compare the overall performance of models, ‘global’ sensitivity, specificity, and non-error rate measures (SN_{T}, SP_{T}, NER_{T}) were computed as follows [25]:

$$SN_{T} = 100 \cdot \frac{\sum_{t} TP_{t}}{\sum_{t} \left( TP_{t} + FN_{t} \right)} \qquad SP_{T} = 100 \cdot \frac{\sum_{t} TN_{t}}{\sum_{t} \left( TN_{t} + FP_{t} \right)} \qquad NER_{T} = \frac{SN_{T} + SP_{T}}{2}$$

where SN_{T} and SP_{T} represent the percentage of active and inactive molecules correctly predicted over all tasks, respectively.
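In code, these global measures reduce to summing the per-task confusion counts before computing sensitivity and specificity. A minimal sketch with made-up counts for two tasks:

```python
def global_metrics(confusions):
    """confusions: per-task dicts with TP, TN, FP, FN counts.

    Returns global sensitivity, specificity and non-error rate (percent),
    pooling the counts over all tasks.
    """
    tp = sum(c["TP"] for c in confusions)
    fn = sum(c["FN"] for c in confusions)
    tn = sum(c["TN"] for c in confusions)
    fp = sum(c["FP"] for c in confusions)
    sn_t = 100.0 * tp / (tp + fn)   # active molecules correctly predicted
    sp_t = 100.0 * tn / (tn + fp)   # inactive molecules correctly predicted
    ner_t = (sn_t + sp_t) / 2.0     # overall non-error rate
    return sn_t, sp_t, ner_t


# Hypothetical counts for a two-task model
tasks = [
    {"TP": 8, "FN": 2, "TN": 5, "FP": 5},
    {"TP": 4, "FN": 6, "TN": 9, "FP": 1},
]
print(global_metrics(tasks))  # (60.0, 70.0, 65.0)
```

Pooling the counts before dividing (rather than averaging per-task rates) is what makes these measures robust to tasks with very few positives or negatives.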

#### 4.4. Optimization Strategies

For GA, each candidate combination of hyperparameters was encoded as a chromosome, and the algorithm iterated the following steps:

- evaluation of the fitness function (NER_{T}) in 3-fold cross validation for each chromosome of the population;
- generation of a new population by selection, crossover, and mutation driven by the fitness (NER_{T}) of the population;
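A minimal, self-contained sketch of such a GA over a categorical hyperparameter grid follows. The grid values and the toy fitness are hypothetical stand-ins: in the paper, the fitness is NER_{T} estimated in 3-fold cross-validation.

```python
import random

# Hypothetical hyperparameter levels (the actual grid is given in Figure 5)
GRID = {
    "hidden_layers": [1, 2, 3],
    "neurons": [64, 128, 256, 512],
    "dropout": [0.0, 0.25, 0.5],
    "learning_rate": [1e-4, 1e-3, 1e-2],
}


def fitness(chromosome):
    # Placeholder for NER_T in 3-fold cross-validation; this toy score
    # simply rewards matching an arbitrary "good" configuration.
    target = {"hidden_layers": 2, "neurons": 256,
              "dropout": 0.25, "learning_rate": 1e-3}
    return sum(chromosome[k] == v for k, v in target.items()) / len(target)


def genetic_search(pop_size=10, generations=20, mut_rate=0.2, seed=0):
    rng = random.Random(seed)
    keys = list(GRID)
    # Random initial population of chromosomes
    pop = [{k: rng.choice(GRID[k]) for k in keys} for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]           # truncation selection
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = {k: rng.choice((a[k], b[k])) for k in keys}  # uniform crossover
            for k in keys:                        # random mutation
                if rng.random() < mut_rate:
                    child[k] = rng.choice(GRID[k])
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)
```

Keeping the top half of each generation (elitism) guarantees the best chromosome found so far is never lost, which mirrors the final GA population reported in Figure 1.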

TPE builds a probabilistic model of the objective and iterates the following steps:

- divide the observed hyperparameter combinations into two groups, where the first (x_{1}) contains observations that gave the best scores and the second (x_{2}) all other observations;
- estimate the densities l(x_{1}) and g(x_{2}) using Parzen estimators (or kernel density estimators);
- sample a set of candidate hyperparameters from l(x_{1}), evaluate them in terms of the ratio g(x_{2})/l(x_{1}), and return the candidate that yields the minimum value of this ratio (i.e., the maximum of l(x_{1})/g(x_{2})), corresponding to the greatest expected improvement. These hyperparameters are then evaluated on the objective function.

Both GA and TPE used NER_{T} as the score to maximize. We chose predefined values for each hyperparameter in order to constrain the possible combinations to the grid search space.
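The TPE loop above can be sketched for a single continuous hyperparameter. Everything here (the Gaussian kernel, the fixed bandwidth, and the γ split) is an illustrative simplification, not the implementation used in the paper:

```python
import math
import random


def kde(points, bandwidth):
    """Parzen (Gaussian kernel) density estimate over observed values."""
    def density(x):
        k = sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points)
        return k / (len(points) * bandwidth * math.sqrt(2 * math.pi))
    return density


def tpe_suggest(history, gamma=0.25, n_candidates=50, bandwidth=0.1, seed=0):
    """history: list of (x, score) pairs, higher score is better.

    Returns the next value of x to try.
    """
    rng = random.Random(seed)
    ordered = sorted(history, key=lambda xs: xs[1], reverse=True)
    n_best = max(1, int(gamma * len(ordered)))
    best = [x for x, _ in ordered[:n_best]]          # x_1: best observations
    rest = [x for x, _ in ordered[n_best:]] or best  # x_2: all the others
    l = kde(best, bandwidth)
    g = kde(rest, bandwidth)
    # Sample candidates from l(x) and keep the one maximising l(x)/g(x),
    # i.e. minimising g(x)/l(x): the greatest expected improvement.
    candidates = [rng.gauss(rng.choice(best), bandwidth)
                  for _ in range(n_candidates)]
    return max(candidates, key=lambda x: l(x) / max(g(x), 1e-12))
```

In practice, libraries such as Optuna and Hyperopt (cited in the references) provide production TPE samplers that handle mixed categorical and continuous spaces.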

#### 4.5. Software

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest

## Sample Availability

## Appendix A

**Table A1.**Two-tailed t-test with a hypothesized zero mean difference for the three datasets considering the best result among 10 replicas between genetic algorithms (GA) and random search (RS) on 3-fold cross validation. Variances were assumed unequal only for Tox21 (p-value of F test 0.007).

| | NURA (GA, RS) | ClinTox (GA, RS) | Tox21 (GA, RS) |
|---|---|---|---|
| Mean | 94.1, 93.4 | 91.0, 88.5 | 74.6, 73.0 |
| Variance | 0.001, 0.003 | 0.023, 0.037 | 0.004, 0.023 |
| Pooled Variance | 0.002 | 0.031 | - |
| Degrees of freedom | 18 | 18 | 12 |
| t value | 3.643 | 3.214 | 3.104 |
| P(T ≤ t) two-tail | 0.002 | 0.005 | 0.009 |
| t critical (95%) | 2.101 | 2.101 | 2.179 |

**Table A2.**Two-tailed t-test with a hypothesized zero mean difference for the three datasets considering the best result among 10 replicas between genetic algorithms (GA) and tree-structured Parzen estimator (TPE) on 3-fold cross validation. Variances were assumed unequal for Tox21 and ClinTox (p-value of F test 0.017 and 0.024, respectively).

| | NURA (GA, TPE) | ClinTox (GA, TPE) | Tox21 (GA, TPE) |
|---|---|---|---|
| Mean | 94.1, 94.5 | 91.0, 91.1 | 74.6, 74.6 |
| Variance | 0.001, 0.001 | 0.02, 0.006 | 0.004, 0.002 |
| Pooled Variance | 0.001 | - | - |
| Degrees of freedom | 18 | 13 | 13 |
| t value | −2.466 | −0.147 | 0.100 |
| P(T ≤ t) two-tail | 0.024 | 0.885 | 0.922 |
| t critical (95%) | 2.101 | 2.160 | 2.160 |

**Table A3.**Two-tailed t-test with a hypothesized zero mean difference for the three datasets considering the best result among 10 replicas between tree-structured Parzen estimator (TPE) and random search (RS) on 3-fold cross validation. Variances were assumed equal.

| | NURA (TPE, RS) | ClinTox (TPE, RS) | Tox21 (TPE, RS) |
|---|---|---|---|
| Mean | 94.5, 93.4 | 91.1, 88.5 | 74.6, 73.0 |
| Variance | 0.001, 0.003 | 0.006, 0.037 | 0.018, 0.024 |
| Pooled Variance | 0.002 | 0.022 | 0.021 |
| Degrees of freedom | 18 | 18 | 18 |
| t value | 5.441 | 3.935 | 2.454 |
| P(T ≤ t) two-tail | 0.004 | 0.001 | 0.024 |
| t critical (95%) | 2.101 | 2.101 | 2.101 |

**Table A4.**Two-tailed t-test with a hypothesized zero mean difference for the three datasets considering the best result among 10 replicas between genetic algorithms (GA) and random search (RS) on the external test set. Variances were assumed unequal for Tox21 and NURA (p-value of F test 0.04 and 0.02).

| | NURA (GA, RS) | ClinTox (GA, RS) | Tox21 (GA, RS) |
|---|---|---|---|
| Mean | 94.4, 93.7 | 86.4, 86.5 | 77.6, 75.6 |
| Variance | 0.002, 0.007 | 0.163, 0.226 | 0.009, 0.032 |
| Pooled Variance | - | 0.194 | - |
| Degrees of freedom | 10 | 18 | 14 |
| t value | 2.370 | −0.084 | 3.024 |
| P(T ≤ t) two-tail | 0.034 | 0.934 | 0.009 |
| t critical (95%) | 2.160 | 2.101 | 2.144 |

**Table A5.**Two-tailed t-test with a hypothesized zero mean difference for the three datasets considering the best result among 10 replicas between genetic algorithms (GA) and tree-structured Parzen estimator (TPE) on an external test set. Variances were assumed equal.

| | NURA (GA, TPE) | ClinTox (GA, TPE) | Tox21 (GA, TPE) |
|---|---|---|---|
| Mean | 94.4, 94.7 | 86.4, 86.6 | 77.6, 77.4 |
| Variance | 0.002, 0.002 | 0.162, 0.451 | 0.009, 0.015 |
| Pooled Variance | 0.002 | 0.307 | 0.012 |
| Degrees of freedom | 18 | 18 | 18 |
| t value | −1.852 | −0.088 | 0.269 |
| P(T ≤ t) two-tail | 0.080 | 0.931 | 0.790 |
| t critical (95%) | 2.101 | 2.101 | 2.101 |

**Table A6.**Two-tailed t-test with a hypothesized zero mean difference for the three datasets considering the best result among 10 replicas between tree-structured Parzen estimator (TPE) and random search (RS) on an external test set. Variances were assumed equal.

| | NURA (TPE, RS) | ClinTox (TPE, RS) | Tox21 (TPE, RS) |
|---|---|---|---|
| Mean | 94.7, 93.7 | 86.6, 86.5 | 77.4, 75.6 |
| Variance | 0.002, 0.007 | 0.451, 0.225 | 0.014, 0.031 |
| Pooled Variance | 0.005 | 0.338 | 0.023 |
| Degrees of freedom | 18 | 18 | 18 |
| t value | 3.450 | 0.020 | 2.629 |
| P(T ≤ t) two-tail | 0.003 | 0.984 | 0.017 |
| t critical (95%) | 2.101 | 2.101 | 2.101 |

## References

- Ramsundar, B.; Kearnes, S.; Riley, P.; Webster, D.; Konerding, D.; Pande, V. Massively Multitask Networks for Drug Discovery. arXiv
**2015**, arXiv:1502.02072. [Google Scholar] - Xu, Y.; Ma, J.; Liaw, A.; Sheridan, R.P.; Svetnik, V. Demystifying Multitask Deep Neural Networks for Quantitative Structure–Activity Relationships. J. Chem. Inf. Model.
**2017**, 57, 2490–2504. [Google Scholar] [CrossRef] - Lipinski, C.F.; Maltarollo, V.G.; Oliveira, P.R.; Da Silva, A.B.F.; Honorio, K.M. Advances and Perspectives in Applying Deep Learning for Drug Design and Discovery. Front. Robot. AI
**2019**, 6, 108. [Google Scholar] [CrossRef][Green Version] - Passos, D.; Mishra, P. An Automated Deep Learning Pipeline Based on Advanced Optimisations for Leveraging Spectral Classification Modelling. Chemometrics Intellig. Lab. Syst.
**2021**, 215, 104354. [Google Scholar] [CrossRef] - Bergstra, J.; Bardenet, R.; Bengio, Y.; Kégl, B. Algorithms for Hyper-Parameter Optimization. In Proceedings of the Advances in Neural Information Processing Systems, Granada, Spain, 12–15 December 2011; Shawe-Taylor, J., Zemel, R., Bartlett, P., Pereira, F., Weinberger, K.Q., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2011; Volume 24. [Google Scholar]
- Smith, L.N. A Disciplined Approach to Neural Network Hyper-Parameters: Part 1—Learning Rate, Batch Size, Momentum, and Weight Decay. arXiv
**2018**, arXiv:1803.09820. [Google Scholar] - Bergstra, J.; Bengio, Y. Random Search for Hyper-Parameter Optimization. J. Mach. Learn. Res.
**2012**, 13, 281–305. [Google Scholar] - Liashchynskyi, P.; Liashchynskyi, P. Grid Search, Random Search, Genetic Algorithm: A Big Comparison for NAS. arXiv
**2019**, arXiv:1912.06059. [Google Scholar] - Elsken, T.; Metzen, J.H.; Hutter, F. Neural Architecture Search: A Survey. J. Mach. Learn. Res.
**2019**, 20, 1997–2017. [Google Scholar] - Holland, J.H. Genetic Algorithms. Sci. Am.
**1992**, 267, 66–73. [Google Scholar] [CrossRef] - Ballabio, D.; Vasighi, M.; Consonni, V.; Kompany-Zareh, M. Genetic Algorithms for Architecture Optimisation of Counter-Propagation Artificial Neural Networks. Chemom. Intell. Lab. Syst.
**2011**, 105, 56–64. [Google Scholar] [CrossRef] - Er, M.J.; Liu, F. Parameter tuning of MLP neural network using genetic algorithms. In Proceedings of the Sixth International Symposium on Neural Networks (ISNN 2009), Wuhan, China, 26–29 May 2009; Wang, H., Shen, Y., Huang, T., Zeng, Z., Eds.; Springer: Berlin/Heidelberg, Germany, 2009; pp. 121–130, ISBN 9783642012167. [Google Scholar]
- Ganapathy, K. A Study of Genetic Algorithms for Hyperparameter Optimization of Neural Networks in Machine Translation. arXiv
**2020**, arXiv:2009.08928. [Google Scholar] - Akiba, T.; Sano, S.; Yanase, T.; Ohta, T.; Koyama, M. Optuna: A Next-Generation Hyperparameter Optimization Framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA, 3–7 August 2019; Association for Computing Machinery: New York, NY, USA, 2019; pp. 2623–2631. [Google Scholar]
- Bergstra, J.; Komer, B.; Eliasmith, C.; Yamins, D.; Cox, D.D. Hyperopt: A Python Library for Model Selection and Hyperparameter Optimization. Comput. Sci. Discov.
**2015**, 8, 014008. [Google Scholar] [CrossRef] - Yang, H.H.; Amari, S. Complexity Issues in Natural Gradient Descent Method for Training Multilayer Perceptrons. Neural Comput.
**1998**, 10, 2137–2157. [Google Scholar] [CrossRef] [PubMed] - Li, L.; Jamieson, K.; DeSalvo, G.; Rostamizadeh, A.; Talwalkar, A. Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization. J. Mach. Learn. Res.
**2017**, 18, 6765–6816. [Google Scholar] - Yang, L.; Shami, A. On Hyperparameter Optimization of Machine Learning Algorithms: Theory and Practice. Neurocomputing
**2020**, 415, 295–316. [Google Scholar] [CrossRef] - Valsecchi, C.; Grisoni, F.; Motta, S.; Bonati, L.; Ballabio, D. NURA: A Curated Dataset of Nuclear Receptor Modulators. Toxicol. Appl. Pharmacol.
**2020**, 407, 115244. [Google Scholar] [CrossRef] - Wu, Z.; Ramsundar, B.; Feinberg, E.N.; Gomes, J.; Geniesse, C.; Pappu, A.S.; Leswing, K.; Pande, V. MoleculeNet: A Benchmark for Molecular Machine Learning. Chem. Sci.
**2018**, 9, 513–530. [Google Scholar] [CrossRef][Green Version] - Cela Torrijos, R.; Phan-Tan-Luu, R. Introduction experimental designs. In Comprehensive Chemometrics, 2nd ed.; Elsevier: Oxford, UK, 2020; pp. 205–208. [Google Scholar]
- Valsecchi, C.; Collarile, M.; Grisoni, F.; Todeschini, R.; Ballabio, D.; Consonni, V. Predicting Molecular Activity on Nuclear Receptors by Multitask Neural Networks. J. Chemom.
**2020**, 4, e3325. [Google Scholar] [CrossRef] - Rogers, D.; Hahn, M. Extended-Connectivity Fingerprints. J. Chem. Inf. Model.
**2010**, 50, 742–754. [Google Scholar] [CrossRef] - Caruana, R. Multitask Learning. Mach. Learn.
**1997**, 28, 41–75. [Google Scholar] [CrossRef] - Ballabio, D.; Grisoni, F.; Todeschini, R. Multivariate Comparison of Classification Performance Measures. Chemom. Intellig. Lab. Syst.
**2018**, 174, 33–44. [Google Scholar] [CrossRef] - Leardi, R. D-Optimal Designs. In Encyclopedia of Analytical Chemistry; John Wiley & Sons, Ltd.: Hoboken, NJ, USA, 2018; pp. 1–11. [Google Scholar]
- KodeSrl Dragon (Software for Molecular Descriptor Calculation), Version 7.0; 2016. Available online: https://chm.kode-solutions.net/pf/dragon-7-0/ (accessed on 22 January 2018).
- Python Software Foundation. Python Language Reference. Version 3.6. Available online: https://www.python.org/ (accessed on 24 April 2019).
- Keras. Available online: https://keras.io/ (accessed on 18 February 2021).
- Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.; Citro, C.; Corrado, G.S.; Davis, A.; Dean, J.; Devin, M.; et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. arXiv
**2016**, arXiv:1603.04467. [Google Scholar]

**Figure 1.**Ordered distribution of the overall non-error rate (3-fold cross-validation) of the architectures found by the optimization methods (columns, grid search, GS, genetic algorithm, GA, and tree-structured Parzen estimator, TPE) for NURA, ClinTox, and Tox21 datasets (rows). For GA, red and orange bars represent the performance of the final population (10 chromosomes) and all the other tested architectures, respectively.

**Figure 2.**Overall non-error rate (NER_{T}) of the best 10 solutions for each dataset and optimization method (tree-structured Parzen estimator, TPE, genetic algorithms, GA, and random search, RS, in green, red, and blue, respectively). Error bars are calculated considering the variability over 10 replicas.

**Figure 3.**Sensitivity (SN_{T}) vs. specificity (SP_{T}) for the three datasets with 3-fold cross validation. Each point represents an architecture found by grid search (GS, grey points), random search (RS, light blue points), tree-structured Parzen estimator (TPE, green points), all tested architectures by genetic algorithms (GA, orange points) and architectures in the final GA populations (red points).

**Figure 4.**Coefficients of the D-optimal models and relative frequency bars of the hyperparameters among the genetic algorithms (GA) final population (upper bar) and the best 1000 architectures obtained with the grid search (GS) strategy (lower bar). For quantitative hyperparameters, the bars are colored according to Figure 5.

**Figure 5.**Hyperparameters to be tuned and the levels considered. Levels are colored to facilitate the comprehension of Figure 4.

**Table 1.**Results in terms of overall non-error rate (NER_{T}) considering grid search (GS), genetic algorithms (GA), random search (RS), and tree-structured Parzen estimator (TPE) as optimization strategies in 3-fold cross validation. The computational time for 3-fold cross-validation is also reported in hours (h). Mean and confidence interval among 10 replicas are reported for GA, RS, and TPE results.

| Cross-Validation | NURA Best NER_{T} | NURA Time (h) | ClinTox Best NER_{T} | ClinTox Time (h) | Tox21 Best NER_{T} | Tox21 Time (h) |
|---|---|---|---|---|---|---|
| GS | 95.1 | 4905.3 | 93.7 | 1205.2 | 76.4 | 2858.7 |
| GA | 94.1 ± 0.2 | 3.3 | 91.0 ± 1.1 | 1.0 | 74.6 ± 0.5 | 2.3 |
| TPE | 94.5 ± 0.2 | 5.1 | 91.1 ± 0.5 | 1.6 | 74.6 ± 0.8 | 2.8 |
| RS | 93.4 ± 0.4 | 3.2 | 88.5 ± 1.4 | 0.8 | 73.0 ± 1.0 | 2.2 |

**Table 2.**Results in terms of overall non-error rate (NER_{T}), sensitivity (SN_{T}), and specificity (SP_{T}) considering grid search (GS), genetic algorithms (GA), random search (RS), and tree-structured Parzen estimator (TPE) as optimization strategies on the external test set. Confidence intervals among 10 replicas are reported for the GA, TPE, and RS results.

| External Set | NURA NER_{T} | NURA SN_{T} | NURA SP_{T} | ClinTox NER_{T} | ClinTox SN_{T} | ClinTox SP_{T} | Tox21 NER_{T} | Tox21 SN_{T} | Tox21 SP_{T} |
|---|---|---|---|---|---|---|---|---|---|
| GS | 95.5 | 95.3 | 95.7 | 95.6 | 94.9 | 96.2 | 78.7 | 74.3 | 83.1 |
| GA | 94.4 ± 0.3 | 94.6 ± 0.4 | 94.1 ± 0.4 | 86.4 ± 2.9 | 88.8 ± 2.6 | 84.7 ± 4.5 | 77.6 ± 0.7 | 75.9 ± 0.8 | 79.3 ± 1.8 |
| TPE | 94.7 ± 0.3 | 94.5 ± 0.3 | 95.0 ± 0.7 | 86.6 ± 3.1 | 87.6 ± 5.3 | 85.5 ± 3.5 | 77.4 ± 0.8 | 75.5 ± 2.0 | 79.4 ± 2.4 |
| RS | 93.7 ± 0.6 | 94.1 ± 0.5 | 93.2 ± 0.9 | 86.5 ± 3.4 | 88.4 ± 3.1 | 84.7 ± 4.5 | 75.7 ± 1.3 | 74.6 ± 2.2 | 76.7 ± 3.6 |

| Dataset | Description | No. Tasks | No. Samples | Ref. |
|---|---|---|---|---|
| NURA | Qualitative bioactivity annotations for 11 selected nuclear receptors. | 30 | 14,963 | [19] |
| ClinTox | Qualitative data of drugs approved by the FDA and those that have failed clinical trials for toxicity reasons. | 2 | 1472 | [20] |
| Tox21 | Qualitative toxicity measurements on 12 biological targets. | 12 | 7586 | [20] |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Valsecchi, C.; Consonni, V.; Todeschini, R.; Orlandi, M.E.; Gosetti, F.; Ballabio, D. Parsimonious Optimization of Multitask Neural Network Hyperparameters. *Molecules* **2021**, *26*, 7254.
https://doi.org/10.3390/molecules26237254
