Mining Two Decades of Soybean Genomics Literature Using Rule-Based Text Mining: Chromosome-Resolved Patterns of Glyma Gene Mentions
Abstract
1. Introduction
2. Results
2.1. Corpus-Level Distribution of Soybean Gene Mentions
2.2. Chromosome-Level Distribution of Gene Mentions
2.3. Temporal Dynamics of Chromosome-Specific Gene Reporting
2.4. Most Mentioned Soybean Genes
3. Discussion
4. Materials and Methods
4.1. PubMed Corpus Collection
4.2. Gene Name Recognition via Rule-Based Text Mining
4.3. Data Aggregation and Normalization
5. Conclusions
Limitations and Future Work
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| Abbreviation | Definition |
|---|---|
| NLP | Natural Language Processing |
| QTL | Quantitative Trait Locus |
| GWAS | Genome-Wide Association Study |
| RNA-seq | RNA Sequencing |
| CRISPR | Clustered Regularly Interspaced Short Palindromic Repeats |
| NCBI | National Center for Biotechnology Information |
| PMID | PubMed Identifier |
| XML | Extensible Markup Language |
| GO | Gene Ontology |
| PPR | Pentatricopeptide Repeat |
| NSF | N-ethylmaleimide-Sensitive Factor |
| PEBP | Phosphatidylethanolamine-Binding Protein |
| CCT | CONSTANS, CONSTANS-like, and TOC1 domain |
| CSV | Comma-Separated Values |




| Gene_ID | Number of Abstracts | Description |
|---|---|---|
| Glyma.20G085100 | 4 | CCT domain-containing protein; IPR010402 (CCT domain); GO:0005515 (protein binding) |
| Glyma.18G022500 | 4 | Alpha-soluble NSF attachment protein 2; IPR000744 (NSF attachment protein); GO:0005515 (protein binding), GO:0006886 (intracellular protein transport) |
| Glyma.06G095100 | 2 | MYB/SANT-like domain-containing protein; IPR024752 (Myb/SANT-like domain) |
| Glyma.19G193400 | 2 | Basic leucine zipper (bZIP) transcription factor HBP-1a; IPR004827 (basic leucine zipper domain); GO:0003700 (DNA-binding transcription factor activity), GO:0043565 (sequence-specific DNA binding) |
| Glyma.19G194300 | 2 | Flowering locus T (FT)-like protein; IPR008914 (phosphatidylethanolamine-binding protein, PEBP family) |
| Glyma.19G195400 | 2 | Beta-fructofuranosidase (cell wall invertase); IPR001362 (glycoside hydrolase family 32), IPR008985 (lectin/glucanase superfamily), IPR023296 (beta-propeller domain); GO:0005975 (carbohydrate metabolic process) |
| Glyma.11G108300 | 2 | Cytochrome P450 family protein; IPR001128 (cytochrome P450); GO:0005506 (iron ion binding), GO:0020037 (heme binding), GO:0055114 (oxidation–reduction process) |
| Glyma.09G284700 | 2 | Peroxidase family protein; IPR010255 (heme peroxidase); GO:0004601 (peroxidase activity), GO:0006979 (response to oxidative stress), GO:0020037 (heme binding), GO:0055114 (oxidation–reduction process) |
| Glyma.03G227300 | 2 | Phytochrome-like protein kinase; light-sensing photoreceptor protein; IPR001294 (phytochrome domain); GO:0000155 (sensor kinase activity), GO:0007165 (signal transduction), GO:0009584 (detection of visible light) |
| Glyma.16G149300 | 2 | Cytochrome P450 family protein; IPR001128 (cytochrome P450); GO:0005506 (iron ion binding), GO:0020037 (heme binding), GO:0055114 (oxidation–reduction process) |
| Glyma.03G226000 | 2 | Mannan endo-1,4-beta-mannosidase; IPR017853 (glycoside hydrolase superfamily); GO:0005975 (carbohydrate metabolic process) |
| Glyma.17G090200 | 2 | RING-H2 zinc finger protein; IPR013083 (RING-type zinc finger); GO:0005515 (protein binding), GO:0008270 (zinc ion binding) |
| Glyma.05G243400 | 2 | Elongation factor Tu (EF-Tu)-like GTP-binding protein; IPR000795 (GTP-binding domain), IPR027417 (P-loop NTP hydrolase); GO:0003924 (GTPase activity), GO:0005525 (GTP binding) |
| Glyma.14G194300 | 2 | Fatty acid desaturase 8; IPR005804 (fatty acid desaturase); GO:0006629 (lipid metabolic process), GO:0055114 (oxidation–reduction process) |
| Glyma.09G171200 | 2 | Pentatricopeptide repeat (PPR) protein; IPR002885 (PPR repeat); GO:0005515 (protein binding) |
| Glyma.04G167900 | 2 | Light-harvesting chlorophyll a/b-binding protein; IPR022796 (chlorophyll-binding protein); GO:0016020 (membrane) |
| Glyma.07G102300 | 2 | FAD/NAD(P)-binding oxidoreductase; IPR001327 (oxidoreductase NAD-binding domain); GO:0016491 (oxidoreductase activity), GO:0050660 (FAD binding), GO:0055114 (oxidation–reduction process) |
| Glyma.06G204300 | 2 | TCP transcription factor family protein; IPR005333 (TCP domain) |
| Glyma.17G152300 | 2 | Purine permease family protein; IPR004853 (transporter domain) |
| Glyma.02G304700 | 2 | Phytochromobilin:ferredoxin oxidoreductase; IPR009249 (ferredoxin-dependent bilin reductase); GO:0010024 (phytochromobilin biosynthesis), GO:0055114 (oxidation–reduction process) |
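The identifiers in the table above follow the `Glyma.NNGnnnnnn` layout, where the two digits after the dot give the chromosome. A minimal sketch of how such identifiers could be recognized with a regular expression and tallied per abstract and per chromosome is shown below. The pattern, the function names (`extract_gene_ids`, `chromosome_of`), and the two toy abstract strings are illustrative assumptions, not the authors' actual rules or data.

```python
import re
from collections import Counter

# Assumed pattern: "Glyma." + two-digit chromosome + "G" + six-digit locus
# number. Illustrative only; not necessarily the rule used in the study.
GLYMA_ID = re.compile(r"\bGlyma\.(\d{2})G(\d{6})\b")


def extract_gene_ids(text):
    """Return the unique Glyma IDs in one abstract, in order of first appearance."""
    seen = []
    for match in GLYMA_ID.finditer(text):
        gene_id = match.group(0)
        if gene_id not in seen:
            seen.append(gene_id)
    return seen


def chromosome_of(gene_id):
    """Return the chromosome number encoded in a Glyma ID as an int."""
    match = GLYMA_ID.fullmatch(gene_id)
    if match is None:
        raise ValueError(f"not a recognized Glyma ID: {gene_id!r}")
    return int(match.group(1))


# Hypothetical abstract snippets, invented for this example.
abstracts = [
    "Glyma.20G085100 and Glyma.18G022500 were differentially expressed; "
    "Glyma.20G085100 was validated by qPCR.",
    "A copy number variant at Glyma.18G022500 was associated with resistance.",
]

# Count each gene once per abstract, so repeated mentions within the same
# abstract do not inflate the tally.
abstract_counts = Counter()
for text in abstracts:
    abstract_counts.update(extract_gene_ids(text))

# Aggregate the per-gene abstract counts by chromosome.
chromosome_counts = Counter()
for gene_id, n_abstracts in abstract_counts.items():
    chromosome_counts[chromosome_of(gene_id)] += n_abstracts

print(abstract_counts)
print(chromosome_counts)
```

Counting unique identifiers per abstract, rather than raw occurrences, corresponds to the "Number of Abstracts" column; a different counting unit would need a different aggregation step.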
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Kassem, M.A.; Knizia, D.; Meksem, K. Mining Two Decades of Soybean Genomics Literature Using Rule-Based Text Mining: Chromosome-Resolved Patterns of Glyma Gene Mentions. Int. J. Mol. Sci. 2026, 27, 3398. https://doi.org/10.3390/ijms27083398
