Joint Representation Learning for Retrieval and Annotation of Genomic Interval Sets
Abstract
1. Introduction
2. Results
2.1. Overview
2.2. Scenario 1. Retrieve Region Sets for a Metadata Query
2.3. Scenario 2. Annotate Unlabeled Region Sets
2.4. Scenario 3. Retrieve Region Sets for a Query Region Set
2.5. Annotating External Data with a Pre-Trained Model
3. Discussion
4. Methods
4.1. Training StarSpace Models
4.2. Tokenization Process
4.3. Models with Different and Combined Label Sets
4.4. Training Procedure
4.5. GEO Projection
4.6. Evaluation Metrics
4.6.1. R-Precision
4.6.2. F1 Score
4.6.3. Mean Reciprocal Rank
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Gharavi, E.; LeRoy, N.J.; Zheng, G.; Zhang, A.; Brown, D.E.; Sheffield, N.C. Joint Representation Learning for Retrieval and Annotation of Genomic Interval Sets. Bioengineering 2024, 11, 263. https://doi.org/10.3390/bioengineering11030263