# Multi-Task Neural Networks and Molecular Fingerprints to Enhance Compound Identification from LC-MS/MS Data


## Abstract


## 1. Introduction

## 2. Materials and Methods

#### 2.1. Data Collection, Curing and Dimensionality Reduction

- records with missing instrument and/or collision energy annotations;
- ambiguous records with collision energy units that were not compatible with the matched instrument type;
- records with spectra acquired in negative ionisation mode;
- records showing inconsistency between MW and precursor ions;
- records with spectra acquired with ion trap instrumentation;
- records with spectra acquired with collision energy values (CE) lower than 5 V or higher than 70 V;
- records with spectra measured with Atmospheric Pressure Photo-Ionisation (APPI), Atmospheric Pressure Chemical-Ionisation (APCI), Linear Trap (LT) and Orbitrap, due to the few entries available in the database;
- records with precursor ions different from the most common ones, such as [M+H]^{+}, [M+Na]^{+}, [M+K]^{+}, and [M+NH4]^{+}.

- discarding records with spectra showing fewer than two peaks;
- for each spectrum, removing peaks at m/z values higher than that of the precursor ion.
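The record-level and peak-level filters listed above can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the record field names (`instrument`, `collision_energy_v`, `ion_mode`, `adduct`), the function names, and the exact set of accepted adducts are assumptions, and the peak-count check is applied before the precursor cut-off simply because the text lists the steps in that order.

```python
from typing import Dict, List, Optional, Tuple

Peak = Tuple[float, float]  # (m/z, relative intensity)

# Assumed set of accepted adducts; the text names these only as examples.
COMMON_ADDUCTS = {"[M+H]+", "[M+Na]+", "[M+K]+", "[M+NH4]+"}


def keep_record(record: Dict) -> bool:
    """Apply some of the record-level filters; field names are hypothetical."""
    ce = record.get("collision_energy_v")
    # Drop records with missing instrument or collision energy annotations.
    if record.get("instrument") is None or ce is None:
        return False
    # Drop spectra acquired in negative ionisation mode.
    if record.get("ion_mode") != "positive":
        return False
    # Drop collision energies lower than 5 V or higher than 70 V.
    if not 5.0 <= ce <= 70.0:
        return False
    # Drop uncommon precursor ions.
    return record.get("adduct") in COMMON_ADDUCTS


def clean_peaks(peaks: List[Peak], precursor_mz: float,
                min_peaks: int = 2) -> Optional[List[Peak]]:
    """Return the cleaned peak list, or None if the spectrum is discarded."""
    # Discard spectra with fewer than `min_peaks` peaks.
    if len(peaks) < min_peaks:
        return None
    # Remove peaks at m/z values above the precursor ion.
    return [(mz, inten) for mz, inten in peaks if mz <= precursor_mz]
```

Whether the peak count should be checked again after the precursor cut-off is not stated in the text, so the sketch leaves that step out.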

^{+} and 2) spectra with fewer than five peaks, in order to increase the amount of peak information available for model training. Applying these additional curation steps yielded a second, reduced dataset with 12,550 spectra.

**X**, where each row represents a spectrum and each column an m/z value; thus, each entry x_{ij} of the matrix is the signal measured at the j-th m/z for the i-th spectrum. The m/z range was set from 45.0 Da to 704.5 Da, the most comprehensive range across all the available spectra. The resolution was set to one decimal place to reduce data sparseness and keep the data size manageable, giving spectral vectors of 6596 elements. Since the resolution of the original data was higher than one decimal place, the intensity of each element was set to the maximum among all fragments that shared the same mass at one decimal place. This preprocessing ensured that each spectrum had at least one peak with intensity equal to 100. Finally, for each spectrum, the intensity values were scaled by dividing by the maximum intensity (100) to obtain an appropriate range for the subsequent modelling with artificial neural networks.
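The binning described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: peaks are given as (m/z, intensity) pairs with intensities on a 0 to 100 relative scale, m/z values are rounded (rather than truncated) to one decimal place, and the names `bin_spectrum`, `MZ_MIN`, `MZ_MAX` and `STEP` are hypothetical.

```python
import numpy as np

MZ_MIN, MZ_MAX, STEP = 45.0, 704.5, 0.1
# Number of 0.1-wide m/z positions between 45.0 and 704.5, inclusive.
N_BINS = int(round((MZ_MAX - MZ_MIN) / STEP)) + 1  # 6596


def bin_spectrum(peaks):
    """Convert a list of (m/z, intensity) peaks to a fixed-length vector."""
    vec = np.zeros(N_BINS)
    for mz, inten in peaks:
        idx = int(round((mz - MZ_MIN) / STEP))
        if 0 <= idx < N_BINS:
            # Peaks sharing the same m/z at one decimal place keep the maximum.
            vec[idx] = max(vec[idx], inten)
    # Scale by the maximum relative intensity (100) to the range [0, 1].
    return vec / 100.0
```

Stacking one such vector per spectrum row-wise gives a matrix with 6596 columns, matching the vector size stated in the text.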

#### 2.2. Molecular Fingerprints (FPs)

#### 2.3. Multi-Task Modelling

#### 2.3.1. Artificial Neural Networks (ANNs)

#### 2.3.2. Validation Protocol

#### 2.3.3. Performance Measures for Multi-Task Modelling and Similarity Matching

NER_{t} is the Non-Error Rate achieved on the t-th task, which is defined as:

$$\mathrm{NER}_{t} = \frac{\mathrm{Sn}_{t} + \mathrm{Sp}_{t}}{2}$$

where Sn_{t} and Sp_{t} represent the sensitivity and specificity for the t-th task, respectively, and are calculated as follows:

$$\mathrm{Sn}_{t} = \frac{\mathrm{TP}_{t}}{\mathrm{TP}_{t} + \mathrm{FN}_{t}}, \qquad \mathrm{Sp}_{t} = \frac{\mathrm{TN}_{t}}{\mathrm{TN}_{t} + \mathrm{FP}_{t}}$$

where TP_{t}, TN_{t}, FP_{t} and FN_{t} are the numbers of true positives (bits = 1 correctly predicted), true negatives (bits = 0 correctly predicted), false positives (bits = 0 erroneously predicted as 1) and false negatives (bits = 1 erroneously predicted as 0) for the t-th task. Therefore, the higher the sensitivity and specificity, the higher the NER and the better the model in terms of prediction accuracy. On the other hand, the percentage of correctly predicted bits was not used as a measure to assess the model quality, because this index is known to be biased when dealing with unbalanced classification tasks [41]; this was the case here, since the molecular fingerprints are very sparse and thus characterised by a very high percentage of zero values (77%).
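The per-task measure can be sketched in Python as follows, assuming that NER_{t} is the arithmetic mean of the sensitivity and specificity of task t and that true and predicted fingerprints are stored as binary matrices with one row per spectrum and one column per task (bit). The function name `ner_per_task` is hypothetical.

```python
import numpy as np


def ner_per_task(y_true, y_pred):
    """Return the Non-Error Rate of each task (column) as a 1-D array."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = np.sum(y_true & y_pred, axis=0)    # bits = 1 correctly predicted
    tn = np.sum(~y_true & ~y_pred, axis=0)  # bits = 0 correctly predicted
    fp = np.sum(~y_true & y_pred, axis=0)   # bits = 0 predicted as 1
    fn = np.sum(y_true & ~y_pred, axis=0)   # bits = 1 predicted as 0
    # A task with no positives (or no negatives) yields NaN instead of an error.
    with np.errstate(divide="ignore", invalid="ignore"):
        sn = tp / (tp + fn)
        sp = tn / (tn + fp)
    return (sn + sp) / 2.0
```

Because sensitivity and specificity are averaged with equal weight, a model that predicts 0 for every bit scores 0.5 on a task even when zeros dominate, which is why this measure is less affected by class imbalance than the percentage of correctly predicted bits.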

The Jaccard–Tanimoto similarity index (JT_{i}) was used to evaluate the similarity between the true fingerprint of the compound and the fingerprint predicted by the multi-task model from the corresponding MS spectrum [43,44], as follows:

$$\mathrm{JT}_{i} = \frac{a_{i}}{a_{i} + b_{i} + c_{i}}$$

where a_{i} is a measure of similarity and corresponds to the number of bits equal to 1 in both the true and predicted fingerprints, b_{i} is a measure of dissimilarity and is equal to the number of bits equal to 1 in the true and 0 in the predicted fingerprint, and c_{i} is again a measure of dissimilarity and is equal to the number of bits equal to 0 in the true and 1 in the predicted fingerprint. Average similarities can then be calculated over all the JT_{i} values of specific sets of spectra.
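A minimal Python sketch of this similarity, using the counts a, b and c as defined in the text, is given below. The function name `jaccard_tanimoto` is hypothetical, and returning 0.0 when both fingerprints contain no active bits is an arbitrary convention chosen for the sketch, not something stated in the source.

```python
import numpy as np


def jaccard_tanimoto(true_fp, pred_fp):
    """Jaccard-Tanimoto similarity between two binary fingerprints."""
    t = np.asarray(true_fp, dtype=bool)
    p = np.asarray(pred_fp, dtype=bool)
    a = int(np.sum(t & p))    # bits equal to 1 in both fingerprints
    b = int(np.sum(t & ~p))   # 1 in the true, 0 in the predicted fingerprint
    c = int(np.sum(~t & p))   # 0 in the true, 1 in the predicted fingerprint
    denom = a + b + c
    # Assumed convention: two all-zero fingerprints score 0.0.
    return a / denom if denom > 0 else 0.0


def average_jt(true_fps, pred_fps):
    """Average similarity over a set of spectra (one fingerprint pair each)."""
    return float(np.mean([jaccard_tanimoto(t, p)
                          for t, p in zip(true_fps, pred_fps)]))
```

Bits equal to 0 in both fingerprints do not enter the formula, so the index is not inflated by the many zero bits of sparse fingerprints.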

#### 2.3.4. Tuning of Artificial Neural Network Hyperparameters

JT_{val} and JT_{train} are the average Jaccard–Tanimoto similarity indices for the validation and training sets, respectively. In this way, we maximised the similarity between the predicted and true fingerprints for the validation set while limiting potential overfitting by minimising the difference between performance on the training and validation sets.
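The objective function itself is not reproduced in this section. As a purely illustrative sketch, the Python function below assumes one hypothetical form that matches the verbal description: the validation similarity minus the absolute gap between training and validation similarity. Both the function name and the penalty form are assumptions, not the authors' formula.

```python
def tuning_objective(jt_train: float, jt_val: float) -> float:
    """Hypothetical objective to maximise during hyperparameter tuning.

    Rewards a high average validation similarity and penalises the gap
    between training and validation similarity (assumed penalty form).
    """
    return jt_val - abs(jt_train - jt_val)
```

Under this assumed form, two configurations with the same validation similarity are ranked by how closely training performance tracks validation performance, so the less overfitted configuration is preferred.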

#### 2.4. Software and Code

## 3. Results

#### 3.1. Optimisation of the Multi-Task Model

#### 3.2. Model Performance and Similarity Matching

#### 3.3. Diagnostic for the Matching of Experimental and Predicted Fingerprints

## 4. Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest

## Sample Availability

## References

1. Zang, X.; Monge, M.E.; Fernández, F.M. Mass Spectrometry-Based Non-Targeted Metabolic Profiling for Disease Detection: Recent Developments. Trends Analyt. Chem. **2019**, 118, 158–169.
2. Sannino, A.; Bolzoni, L. GC/CI–MS/MS Method for the Identification and Quantification of Volatile N-Nitrosamines in Meat Products. Food Chem. **2013**, 141, 3925–3930.
3. He, Y.; Zhang, Z.M.; Ma, P.; Ji, H.C.; Lu, H.M. GC-MS Profiling of Leukemia Cells: An Optimized Preparation Protocol for the Intracellular Metabolome. Anal. Methods **2018**, 10, 1266–1274.
4. Ji, H.; Deng, H.; Lu, H.; Zhang, Z. Predicting a Molecular Fingerprint from an Electron Ionization Mass Spectrum with Deep Neural Networks. Anal. Chem. **2020**, 92, 8649–8653.
5. Gosetti, F.; Mazzucco, E.; Zampieri, D.; Gennaro, M.C. Signal Suppression/enhancement in High-Performance Liquid Chromatography Tandem Mass Spectrometry. J. Chromatogr. A **2010**, 1217, 3929–3937.
6. Gross, J.H. Tandem Mass Spectrometry. In Mass Spectrometry: A Textbook; Gross, J.H., Ed.; Springer International Publishing: Cham, Switzerland, 2017; pp. 539–612. ISBN 9783319543987.
7. Litsa, E.; Chenthamarakshan, V.; Das, P.; Kavraki, L. Spec2Mol: An End-to-End Deep Learning Framework for Translating MS/MS Spectra to de-Novo Molecules. ChemRxiv **2021**.
8. Guijas, C.; Montenegro-Burke, J.R.; Warth, B.; Spilker, M.E.; Siuzdak, G. Metabolomics Activity Screening for Identifying Metabolites That Modulate Phenotype. Nat. Biotechnol. **2018**, 36, 316–320.
9. Werner, E.; Heilier, J.-F.; Ducruix, C.; Ezan, E.; Junot, C.; Tabet, J.-C. Mass Spectrometry for the Identification of the Discriminating Signals from Metabolomics: Current Status and Future Trends. J. Chromatogr. B Analyt. Technol. Biomed. Life Sci. **2008**, 871, 143–163.
10. Patti, G.J.; Tautenhahn, R.; Siuzdak, G. Meta-Analysis of Untargeted Metabolomic Data from Multiple Profiling Experiments. Nat. Protoc. **2012**, 7, 508–516.
11. Scheubert, K.; Hufsky, F.; Petras, D.; Wang, M.; Nothias, L.-F.; Dührkop, K.; Bandeira, N.; Dorrestein, P.C.; Böcker, S. Significance Estimation for Large Scale Metabolomics Annotations by Spectral Matching. Nat. Commun. **2017**, 8, 1494.
12. Fan, Z.; Alley, A.; Ghaffari, K.; Ressom, H.W. MetFID: Artificial Neural Network-Based Compound Fingerprint Prediction for Metabolite Annotation. Metabolomics **2020**, 16, 104.
13. Heinonen, M.; Shen, H.; Zamboni, N.; Rousu, J. Metabolite Identification and Molecular Fingerprint Prediction through Machine Learning. Bioinformatics **2012**, 28, 2333–2341.
14. Shrivastava, A.D.; Swainston, N.; Samanta, S.; Roberts, I.; Wright Muelas, M.; Kell, D.B. MassGenie: A Transformer-Based Deep Learning Method for Identifying Small Molecules from Their Mass Spectra. Biomolecules **2021**, 11, 1793.
15. Da Silva, R.R.; Dorrestein, P.C.; Quinn, R.A. Illuminating the Dark Matter in Metabolomics. Proc. Natl. Acad. Sci. USA **2015**, 112, 12549–12550.
16. Matsuda, F. Technical Challenges in Mass Spectrometry-Based Metabolomics. Mass Spectrom. **2016**, 5, S0052.
17. Hufsky, F.; Böcker, S. Mining Molecular Structure Databases: Identification of Small Molecules Based on Fragmentation Mass Spectrometry Data. Mass Spectrom. Rev. **2017**, 36, 624–633.
18. Hill, A.W.; Mortishire-Smith, R.J. Automated Assignment of High-Resolution Collisionally Activated Dissociation Mass Spectra Using a Systematic Bond Disconnection Approach. Rapid Commun. Mass Spectrom. **2005**, 19, 3111–3118.
19. Heinonen, M.; Rantanen, A.; Mielikäinen, T.; Kokkonen, J.; Kiuru, J.; Ketola, R.A.; Rousu, J. FiD: A Software for Ab Initio Structural Identification of Product Ions from Tandem Mass Spectrometric Data. Rapid Commun. Mass Spectrom. **2008**, 22, 3043–3052.
20. Allen, F.; Greiner, R.; Wishart, D. Competitive Fragmentation Modeling of ESI-MS/MS Spectra for Putative Metabolite Identification. Metabolomics **2015**, 11, 98–110.
21. Willett, P. Similarity-Based Virtual Screening Using 2D Fingerprints. Drug Discov. Today **2006**, 11, 1046–1053.
22. Rogers, D.; Hahn, M. Extended-Connectivity Fingerprints. J. Chem. Inf. Model. **2010**, 50, 742–754.
23. MassBank of North America. Available online: https://mona.fiehnlab.ucdavis.edu/ (accessed on 13 October 2021).
24. LC-MS/MS to Fingerprints Dataset. Available online: https://michem.unimib.it/download/data/lc-ms-ms-to-fingerprints-dataset/ (accessed on 1 August 2022).
25. Zou, H.; Hastie, T.; Tibshirani, R. Sparse Principal Component Analysis. J. Comput. Graph. Stat. **2006**, 15, 265–286.
26. Sjöstrand, K.; Clemmensen, L.H.; Larsen, R.; Einarsson, G.; Ersbøll, B. SpaSM: A MATLAB Toolbox for Sparse Statistical Modeling. J. Stat. Softw. **2018**, 84, 1–37.
27. Durant, J.L.; Leland, B.A.; Henry, D.R.; Nourse, J.G. Reoptimization of MDL Keys for Use in Drug Discovery. J. Chem. Inf. Comput. Sci. **2002**, 42, 1273–1280.
28. Ramsundar, B.; Kearnes, S.; Riley, P.; Webster, D.; Konerding, D.; Pande, V. Massively Multitask Networks for Drug Discovery. arXiv **2015**, arXiv:1502.02072.
29. Sosnin, S.; Vashurina, M.; Withnall, M.; Karpov, P.; Fedorov, M.; Tetko, I.V. A Survey of Multi-Task Learning Methods in Chemoinformatics. Mol. Inform. **2018**, 38, e1800108.
30. Dahl, G.E.; Jaitly, N.; Salakhutdinov, R. Multi-Task Neural Networks for QSAR Predictions. arXiv **2014**, arXiv:1406.1231.
31. Rosenblatt, F. The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain. Psychol. Rev. **1958**, 65, 386–408.
32. Caruana, R. Multitask Learning. Mach. Learn. **1997**, 28, 41–75.
33. Hoerl, A.E.; Kennard, R.W. Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics **1970**, 12, 55–67.
34. Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. J. Mach. Learn. Res. **2014**, 15, 1929–1958.
35. Tibshirani, R. Regression Shrinkage and Selection via the Lasso. J. R. Stat. Soc. **1996**, 58, 267–288.
36. Yao, Y.; Rosasco, L.; Caponnetto, A. On Early Stopping in Gradient Descent Learning. Constr. Approx. **2007**, 26, 289–315.
37. Ramsundar, B.; Eastman, P.; Walters, P.; Pande, V. Deep Learning for the Life Sciences: Applying Deep Learning to Genomics, Microscopy, Drug Discovery, and More; O’Reilly Media, Inc.: Sebastopol, CA, USA, 2019; ISBN 9781492039785.
38. Bakker, B. Task Clustering and Gating for Bayesian Multitask Learning. J. Mach. Learn. Res. **2003**, 4, 83–99.
39. Pérez, N.F.; Ferré, J.; Boqué, R. Calculation of the Reliability of Classification in Discriminant Partial Least-Squares Binary Classification. Chemometrics Intellig. Lab. Syst. **2009**, 95, 122–128.
40. Valsecchi, C.; Grisoni, F.; Consonni, V.; Ballabio, D. Consensus versus Individual QSARs in Classification: Comparison on a Large-Scale Case Study. J. Chem. Inf. Model. **2020**, 60, 1215–1223.
41. Ballabio, D.; Grisoni, F.; Todeschini, R. Multivariate Comparison of Classification Performance Measures. Chemom. Intell. Lab. Syst. **2018**, 174, 33–44.
42. Valsecchi, C.; Collarile, M.; Grisoni, F.; Todeschini, R.; Ballabio, D.; Consonni, V. Predicting Molecular Activity on Nuclear Receptors by Multitask Neural Networks. J. Chemom. **2020**, 36, e3325.
43. Jaccard, P. The Distribution of the Flora in the Alpine Zone. New Phytol. **1912**, 11, 37–50.
44. Todeschini, R.; Ballabio, D.; Consonni, V. Distances and Other Dissimilarity Measures in Chemometrics. In Encyclopedia of Analytical Chemistry; Wiley: Hoboken, NJ, USA, 2015.
45. Akiba, T.; Sano, S.; Yanase, T.; Ohta, T.; Koyama, M. Optuna: A Next-Generation Hyperparameter Optimization Framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining; Association for Computing Machinery: New York, NY, USA, 2019; pp. 2623–2631.
46. Valsecchi, C.; Consonni, V.; Todeschini, R.; Orlandi, M.E.; Gosetti, F.; Ballabio, D. Parsimonious Optimization of Multitask Neural Network Hyperparameters. Molecules **2021**, 26, 7254.
47. RDKit MACCS Keys. Available online: https://github.com/rdkit/rdkit-orig/blob/master/rdkit/Chem/MACCSkeys.py (accessed on 15 July 2022).
48. Python Software Foundation. Python Language Reference, Version 3.6. Available online: https://www.python.org/ (accessed on 4 September 2022).
49. Chollet, F. Keras. Available online: https://keras.io/ (accessed on 18 February 2021).
50. Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.; Citro, C.; Corrado, G.S.; Davis, A.; Dean, J.; Devin, M.; et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. arXiv **2016**, arXiv:1603.04467.
51. Ballabio, D. A MATLAB Toolbox for Principal Component Analysis and Unsupervised Exploration of Data Structure. Chemom. Intell. Lab. Syst. **2015**, 149, 1–9.
52. Bechtold, B. Violin Plots for Matlab. Available online: https://github.com/bastibe/Violinplot-Matlab (accessed on 15 July 2022).
53. Seber, G.A.F. Multivariate Observations; Wiley: Hoboken, NJ, USA, 2009; p. 686. ISBN 9780470317310.

**Figure 1.** Score plot of the first and second MDS coordinates for the molecular fingerprints. Fingerprints of the 12K and 40K datasets are coloured in blue and grey, respectively.

**Figure 2.** Bar plot of the effects of the ANNs hyperparameters on the objective function with their 95% confidence intervals. AF: activation function; BS: batch size; DO: dropout; LR: learning rate; NTS: number of task-specific neurons; N: number of neurons; PA: patience; OT: optimisation type.

**Figure 3.** Violin plot of computational times required to train replicates of ANNs with 500 SPCA scores, 1000 SPCA scores and the 6596 raw MS features for the dataset 40K.

**Figure 4.** Score plot of the first and second MDS dimensions for the 40K test set predicted fingerprints; (**a**) fingerprints are coloured in greyscale: the higher the similarity between predicted and true fingerprint, the darker the colour; (**b**) predicted fingerprints of exemplificative chemicals are coloured in blue (high accuracy between predicted and experimental fingerprints) and orange (low accuracy).

**Figure 6.** Exemplificative chemicals with high accuracy between predicted and experimental fingerprints.

**Figure 7.** Violin plot showing the distribution of the percentage of active bits in the fingerprints of chemicals with low and high accuracy of predicted fingerprints.

**Table 1.** Number of MS spectra and compounds included in the training, validation and test sets for the 12K and 40K datasets.

| Set | MS Spectra (40K Dataset) | Compounds (40K Dataset) | MS Spectra (12K Dataset) | Compounds (12K Dataset) |
|---|---|---|---|---|
| Training | 29,279 | 4040 | 9037 | 2804 |
| Validation | 6059 | 806 | 1828 | 577 |
| Test | 5233 | 711 | 1685 | 501 |
| Total | 40,571 | 5557 | 12,550 | 3882 |

**Table 2.** Tested levels and optimal values of ANNs hyperparameters for the architectures trained with the 12K and 40K datasets.

| Hyperparameter | Minimum and Maximum Level | Optimal Hyperparameters 40K Dataset | Optimal Hyperparameters 12K Dataset |
|---|---|---|---|
| Number of neurons (N) | 50–100 | 100 | 100 |
| Task-specific neurons (NTS) | 250–500 | 250 | 500 |
| Learning rate (LR) | 0.0001–0.01 | 0.006 | 0.0025 |
| Activation function (AF) | Sigmoid, ReLU | Sigmoid | Sigmoid |
| Dropout (DO) | 0–0.5 | 0.30 | 0.33 |
| Batch size (BS) | 2000–4000 | 2000 | 600 |
| Patience (PA) | 50–150 | 95 | 114 |
| Optimisation type (OT) | Adam, SGD, RMSprop | RMSprop | RMSprop |

**Table 3.** Non-Error Rate (NER) and average Jaccard–Tanimoto similarity (JT) achieved on the training, validation and test sets with the ANNs trained on the 12K and 40K datasets with different numbers of features.

| Dataset | Spectra | Features | Train NER | Train JT | Validation NER | Validation JT | Test NER | Test JT |
|---|---|---|---|---|---|---|---|---|
| 40K | 40,571 | 1000 | 0.82 | 0.56 | 0.69 | 0.48 | 0.69 | 0.44 |
| 40K | 40,571 | 500 | 0.81 | 0.54 | 0.69 | 0.48 | 0.69 | 0.45 |
| 40K | 40,571 | 6596 | 0.79 | 0.50 | 0.69 | 0.45 | 0.70 | 0.43 |
| 12K | 12,550 | 500 | 0.84 | 0.61 | 0.70 | 0.49 | 0.69 | 0.47 |
| 12K | 12,550 | 6596 | 0.82 | 0.58 | 0.70 | 0.47 | 0.70 | 0.46 |


© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Consonni, V.; Gosetti, F.; Termopoli, V.; Todeschini, R.; Valsecchi, C.; Ballabio, D.
Multi-Task Neural Networks and Molecular Fingerprints to Enhance Compound Identification from LC-MS/MS Data. *Molecules* **2022**, *27*, 5827.
https://doi.org/10.3390/molecules27185827
