Comparative Analysis of Binary Similarity Measures for Compound Identification in Mass Spectrometry-Based Metabolomics
Abstract
1. Introduction
2. Results
2.1. Theoretical Considerations
- (1) The similarity measures 1 (Jaccard), 2 (Dice), 3 (3W-Jaccard), 4 (Sokal–Sneath), and 12 (Kulczynski) are strictly order preserving with respect to one another;
- (2) The similarity measures 5 (Cosine) and 15 (Hellinger) are strictly order preserving with respect to each other;
- (3) The similarity measures 7 (McConnaughey) and 8 (Driver–Kroeber) are strictly order preserving with respect to each other.
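The three groupings above can be traced to simple algebraic relations between the expressions in the table of measures: each measure in group (1) is an increasing function of the Kulczynski ratio c/(a+b), Hellinger is an increasing function of Cosine, and McConnaughey equals 2 × Driver–Kroeber − 1. The Python sketch below checks these relations on a small grid of counts. It assumes a + b ≥ 1 and non-empty spectra (a + c ≥ 1, b + c ≥ 1); the function names and the grid are illustrative and not taken from the paper.

```python
from fractions import Fraction
from itertools import product
import math

# a: peaks only in the query, b: peaks only in the reference, c: shared peaks.
def kulczynski(a, b, c):
    return Fraction(c, a + b)

def jaccard(a, b, c):
    return Fraction(c, a + b + c)

def dice(a, b, c):
    return Fraction(2 * c, a + b + 2 * c)

def three_w_jaccard(a, b, c):
    return Fraction(3 * c, a + b + 3 * c)

def sokal_sneath(a, b, c):
    return Fraction(c, 2 * a + 2 * b + c)

def mcconnaughey(a, b, c):
    return Fraction(c * c - a * b, (a + c) * (b + c))

def driver_kroeber(a, b, c):
    return Fraction(c * (a + b + 2 * c), 2 * (a + c) * (b + c))

def cosine(a, b, c):
    return c / math.sqrt((a + c) * (b + c))

def hellinger(a, b, c):
    return 1.0 - math.sqrt(max(0.0, 1.0 - cosine(a, b, c)))

# Assumed domain: the spectra differ (a + b >= 1) and neither is empty.
TRIPLES = [(a, b, c) for a, b, c in product(range(6), repeat=3)
           if a + b >= 1 and a + c >= 1 and b + c >= 1]

def groups_1_and_3_hold():
    """Exact (rational) check of the relations behind groups (1) and (3)."""
    for a, b, c in TRIPLES:
        k = kulczynski(a, b, c)
        ok = (jaccard(a, b, c) == k / (1 + k)
              and dice(a, b, c) == 2 * k / (1 + 2 * k)
              and three_w_jaccard(a, b, c) == 3 * k / (1 + 3 * k)
              and sokal_sneath(a, b, c) == k / (2 + k)
              and mcconnaughey(a, b, c) == 2 * driver_kroeber(a, b, c) - 1)
        if not ok:
            return False
    return True

def group_2_holds(tol=1e-12):
    """Floating-point check that a larger Cosine gives a larger Hellinger score."""
    scored = sorted((cosine(*t), hellinger(*t)) for t in TRIPLES)
    return all(h2 > h1 for (x1, h1), (x2, h2) in zip(scored, scored[1:])
               if x2 - x1 > tol)

print(groups_1_and_3_hold(), group_2_holds())
```

Because every relation is a strictly increasing transformation, the measures within a group rank any set of library spectra identically, which is consistent with the identical accuracies reported for each group in the results tables.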
2.2. EI Mass Spectra-Based Identification
2.2.1. Scores of Binary Similarity Measures
2.2.2. Accuracies of Binary Similarity Measures
2.3. ESI Mass Spectra-Based Identification
2.3.1. Scores of Binary Similarity Measures
2.3.2. Accuracies of Binary Similarity Measures
3. Discussion
4. Materials and Methods
4.1. Binary Similarity Measures
4.2. Mass Spectra Libraries
Index | Name | Expression | Range |
---|---|---|---|
1 | Jaccard | c/(a+b+c) | [0, 1) |
2 | Dice | 2c/(a+b+2c) | [0, 1) |
3 | 3W-Jaccard | 3c/(a+b+3c) | [0, 1) |
4 | Sokal–Sneath | c/(2a+2b+c) | [0, 1) |
5 | Cosine | c/√((a+c)·(b+c)) | [0, 1) |
6 | Mountford | 2c/(c(a+b)+2ab) | [0, 2] |
7 | McConnaughey | (c²−ab)/((a+c)·(b+c)) | [−1, 1) |
8 | Driver–Kroeber | c(a+b+2c)/(2(a+c)·(b+c)) | [0, 1) |
9 | Simpson | c/min(a+c,b+c) | [0, 1) |
10 | Braun–Banquet | c/max(a+c,b+c) | [0, 1) |
11 | Fager–McGowan | c/√((a+c)·(b+c)) − 1/(2·√(max(a+c,b+c))) | (−1/2, 1) |
12 | Kulczynski | c/(a+b) | [0, ∞) |
13 | Intersection | c | [0, ∞) |
14 | Hamming | 1/(a+b) | (0, 1] |
15 | Hellinger | 1 − √(1 − c/√((a+c)·(b+c))) | [0, 1) |
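As a rough illustration of the expressions tabulated above, the following Python sketch evaluates all 15 measures from the counts a, b, and c. The function name, the integer keys (which mirror the Index column), and the input checks are assumptions made for this example. In particular, the listed ranges appear to presuppose a + b ≥ 1, so the sketch rejects identical or empty binary spectra rather than guessing a convention for them.

```python
import math

def binary_similarity_scores(a, b, c):
    """Return {index: score} for the 15 measures, keyed by the table's Index column.

    a: peaks only in the query, b: peaks only in the reference, c: shared peaks.
    """
    if min(a, b, c) < 0 or a + b < 1 or a + c < 1 or b + c < 1:
        raise ValueError("expects a, b, c >= 0 with a+b >= 1, a+c >= 1, b+c >= 1")
    q, r = a + c, b + c                # number of peaks in query and reference
    cos = c / math.sqrt(q * r)
    return {
        1: c / (a + b + c),                               # Jaccard
        2: 2 * c / (a + b + 2 * c),                       # Dice
        3: 3 * c / (a + b + 3 * c),                       # 3W-Jaccard
        4: c / (2 * a + 2 * b + c),                       # Sokal-Sneath
        5: cos,                                           # Cosine
        6: 2 * c / (c * (a + b) + 2 * a * b),             # Mountford
        7: (c * c - a * b) / (q * r),                     # McConnaughey
        8: c * (a + b + 2 * c) / (2 * q * r),             # Driver-Kroeber
        9: c / min(q, r),                                 # Simpson
        10: c / max(q, r),                                # Braun-Banquet
        11: cos - 1 / (2 * math.sqrt(max(q, r))),         # Fager-McGowan
        12: c / (a + b),                                  # Kulczynski
        13: c,                                            # Intersection
        14: 1 / (a + b),                                  # Hamming
        15: 1 - math.sqrt(max(0.0, 1 - cos)),             # Hellinger
    }

scores = binary_similarity_scores(a=1, b=2, c=3)
print(scores[1], scores[7], scores[8])
```

None of the 15 expressions uses d, the count of positions where both spectra have no peak, so only a, b, and c are needed as inputs.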
4.3. Compound Identification by a Mass Spectra Library
5. Conclusions
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Similarity Measure | Rank 1 | Rank 2 | Rank 3 |
---|---|---|---|
1 | 27.49 (26.89,28.07) | 35.08 (34.43,35.72) | 39.41 (38.76,40.07) |
2 | 27.49 (26.89,28.07) | 35.08 (34.43,35.72) | 39.41 (38.76,40.07) |
3 | 27.49 (26.89,28.07) | 35.08 (34.43,35.72) | 39.41 (38.76,40.07) |
4 | 27.49 (26.89,28.07) | 35.08 (34.43,35.72) | 39.41 (38.76,40.07) |
5 | 29.11 (28.50,29.71) | 37.27 (36.61,37.92) | 42.03 (41.38,42.68) |
6 | 27.51 (26.92,28.10) | 35.63 (34.99,36.28) | 40.19 (39.55,40.86) |
7 | 31.24 (30.62,31.86) | 40.24 (39.59,40.90) | 45.36 (44.68,46.02) |
8 | 31.24 (30.62,31.86) | 40.24 (39.59,40.90) | 45.36 (44.68,46.02) |
9 | 20.71 (20.17,21.25) | 20.80 (20.25,21.34) | 20.90 (20.34,21.44) |
10 | 18.32 (17.81,18.83) | 23.78 (23.21,24.36) | 26.65 (26.07,27.24) |
11 | 29.78 (29.17,30.39) | 38.09 (37.43,38.74) | 42.88 (42.22,43.54) |
12 | 27.49 (26.89,28.07) | 35.08 (34.43,35.72) | 39.41 (38.76,40.07) |
13 | 15.21 (14.75,15.69) | 15.40 (14.91,15.89) | 15.60 (15.11,16.09) |
14 | 26.16 (25.57,26.76) | 33.25 (32.63,33.89) | 37.34 (36.71,38.00) |
15 | 29.11 (28.50,29.71) | 37.27 (36.62,37.93) | 42.03 (41.38,42.68) |
Similarity Measure | Rank 1 | Rank 2 | Rank 3 |
---|---|---|---|
1 | 52.24 (50.23,54.29) | 59.56 (57.64,61.49) | 62.83 (60.90,64.76) |
2 | 52.24 (50.23,54.29) | 59.56 (57.64,61.49) | 62.83 (60.90,64.76) |
3 | 52.24 (50.23,54.29) | 59.56 (57.64,61.49) | 62.83 (60.90,64.76) |
4 | 52.24 (50.23,54.29) | 59.56 (57.64,61.49) | 62.83 (60.90,64.76) |
5 | 53.37 (51.36,55.38) | 60.32 (58.39,62.24) | 64.13 (62.20,66.05) |
6 | 48.01 (46.00,49.98) | 54.37 (52.45,56.38) | 56.72 (54.75,58.73) |
7 | 51.15 (49.14,53.16) | 58.23 (56.26,60.23) | 61.87 (59.94,63.79) |
8 | 51.15 (49.14,53.16) | 58.23 (56.26,60.23) | 61.87 (59.94,63.79) |
9 | 42.78 (40.85,44.70) | 45.21 (43.20,47.22) | 47.59 (45.63,49.60) |
10 | 50.31 (48.26,52.28) | 57.22 (55.25,59.19) | 61.07 (59.15,63.04) |
11 | 53.33 (51.32,55.34) | 60.36 (58.39,62.29) | 63.83 (61.95,65.80) |
12 | 52.24 (50.23,54.29) | 59.56 (57.64,61.49) | 62.83 (60.90,64.76) |
13 | 36.12 (34.24,38.09) | 39.18 (37.21,41.15) | 41.23 (39.22,43.24) |
14 | 47.34 (45.33,49.31) | 52.32 (50.36,54.37) | 54.46 (52.45,56.43) |
15 | 53.37 (51.36,55.38) | 60.32 (58.39,62.24) | 64.13 (62.20,66.05) |
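The rank columns in the accuracy tables can be read as top-k identification accuracies. A minimal Python sketch of such a computation follows, assuming that the column for rank k reports the percentage of query spectra whose correct compound appears among the k highest-scoring library entries. The function name, the score-matrix layout, and the tie-breaking rule (library order) are hypothetical choices for illustration; how the intervals in parentheses were obtained is not covered here.

```python
def top_k_accuracy(scores, truth, k):
    """Percentage of queries whose true library entry is among the top k matches.

    scores[i][j]: similarity between query i and library entry j (higher is better).
    truth[i]:     index of the correct library entry for query i.
    Ties are broken by library order, an arbitrary choice made for this sketch.
    """
    hits = 0
    for row, true_index in zip(scores, truth):
        # Stable sort: entries with equal scores keep their library order.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if true_index in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(truth)

# Toy example: three queries scored against a three-entry library.
toy_scores = [
    [0.9, 0.5, 0.1],
    [0.2, 0.8, 0.6],
    [0.3, 0.4, 0.7],
]
toy_truth = [0, 2, 0]
for k in (1, 2, 3):
    print(k, round(top_k_accuracy(toy_scores, toy_truth, k), 2))
```

Accuracy computed this way cannot decrease as k grows, which matches the pattern across the three rank columns.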
Query Mass Spectra | Reference Mass Spectra = 0 | Reference Mass Spectra = 1 |
---|---|---|
0 | d | b |
1 | a | c |
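To make the contingency table concrete, the sketch below counts a, b, c, and d for a pair of binary spectra. It assumes each spectrum is given as a collection of m/z positions at which a peak is present (value 1) and that `mz_axis` lists every position under consideration; the function and argument names are hypothetical.

```python
def contingency_counts(query_peaks, reference_peaks, mz_axis):
    """Return (a, b, c, d) for two binary spectra defined on a common m/z axis.

    a: peak in the query only, b: peak in the reference only,
    c: peak in both, d: peak in neither.
    """
    axis = set(mz_axis)
    q = set(query_peaks) & axis        # ignore peaks outside the axis
    r = set(reference_peaks) & axis
    c = len(q & r)
    a = len(q - r)
    b = len(r - q)
    d = len(axis - (q | r))
    return a, b, c, d

# Toy example on unit-mass positions 40..89 (50 positions in total).
a, b, c, d = contingency_counts(
    query_peaks=[41, 43, 57, 71],
    reference_peaks=[43, 57, 85],
    mz_axis=range(40, 90),
)
print(a, b, c, d)  # 2 1 2 45
```

The four counts always sum to the number of positions on the axis, which gives a quick sanity check on the binarization.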
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Kim, S.; Kato, I.; Zhang, X. Comparative Analysis of Binary Similarity Measures for Compound Identification in Mass Spectrometry-Based Metabolomics. Metabolites 2022, 12, 694. https://doi.org/10.3390/metabo12080694