A Natural Language Processing Method Identifies an Association Between Bacterial Communities in the Upper Genital Tract and Ovarian Cancer

Polio, Andrew; Wagner, Vincent; Bender, David P.; Goodheart, Michael J.; Gonzalez Bosquet, Jesus

doi:10.3390/ijms26157432

Open Access Communication

by Andrew Polio ¹,*, Vincent Wagner ¹, David P. Bender ¹,², Michael J. Goodheart ¹,² and Jesus Gonzalez Bosquet ¹,²

¹ Department of Obstetrics and Gynecology, University of Iowa, 200 Hawkins Dr., Iowa City, IA 52242, USA

² Holden Comprehensive Cancer Center, University of Iowa Hospitals and Clinics, Iowa City, IA 52242, USA

* Author to whom correspondence should be addressed.

Int. J. Mol. Sci. 2025, 26(15), 7432; https://doi.org/10.3390/ijms26157432

Submission received: 24 April 2025 / Revised: 26 July 2025 / Accepted: 29 July 2025 / Published: 1 August 2025

(This article belongs to the Section Molecular Oncology)

Abstract

Bacterial communities within the female upper genital tract may influence the risk of ovarian cancer. In this retrospective cohort pilot study, we aimed to detect differences in bacterial communities between ovarian cancer and normal control samples using topic modeling, a natural language processing tool. RNA was extracted and analyzed using the VITCOMIC2 pipeline. Topic modeling was used to assess differences in bacterial communities: the ldatuning package identified the optimal number of latent topics, and Latent Dirichlet Allocation (LDA) assessed topic differences between high-grade serous ovarian cancer (HGSOC) and controls. Results were validated using The Cancer Genome Atlas (TCGA) HGSOC dataset. A total of 801 unique taxa were identified, with 13 bacterial taxa differing significantly between HGSOC and normal controls. LDA modeling revealed a latent topic associated with HGSOC samples that contained Escherichia/Shigella and Corynebacterineae. Pathway analysis using the KEGG database suggests differences in several biologic pathways, including oocyte meiosis, aldosterone-regulated sodium reabsorption, gastric acid secretion, and long-term potentiation. These findings support the hypothesis that bacterial communities in the upper female genital tract may influence the development of HGSOC by altering the local environment, with potential functional differences between HGSOC and normal controls. However, further validation is required to confirm these associations and determine their mechanistic relevance.

Keywords:

ovarian cancer; prediction model; natural language processing; microbiome; RNA sequencing; RNAseq

1. Introduction

The human microbiome is a symbiotic community of bacteria, fungi, and viruses that live on or within the human body, with specific functions, properties, and interactions within its environment [1,2]. Bacterial communities may influence the risk of cancer within the female genital tract, specifically ovarian cancer, by altering the local microenvironment. Alterations within the microbiome may lead to inflammation, immune response modulation, genetic or epigenetic changes, or microenvironment modulation, which can lead to the development of gynecologic cancers [3,4,5]. Historically, identification of microbes within the microbiome relied on culture studies; however, 16S rRNA gene sequencing is now the test of choice. It is a culture-independent method that uses the highly conserved 16S rRNA gene to evaluate bacterial diversity in complex microbiomes [6,7]. Topic modeling, a natural language processing tool, can be used to assess latent interactions between microbes, providing a representation of the bacterial communities related to various states. Along with functional analysis, this allows unique insight into the bacterial community structure associated with healthy or disease states [8].

The objective of this pilot study was to detect differences in the bacteria of the upper genital tract between high-grade serous ovarian cancer (HGSOC) and normal control samples using 16S rRNA sequencing and topic modeling. Results were externally validated in The Cancer Genome Atlas (TCGA) HGSOC dataset. Pathway analysis was then performed using the Kyoto Encyclopedia of Genes and Genomes (KEGG) database.

2. Results

A total of 253 patients with advanced or recurrent HGSOC were identified, of whom 193 had available tissue. Ultimately, 112 patients had tissue samples with good-quality RNA, which were processed for RNA sequencing. In addition, 20 patients undergoing salpingectomy for benign indications were identified; 12 of these patient samples had good-quality RNA and were processed for RNA sequencing (Figure 1).

Following 16S rRNA analysis, a total of 801 unique bacterial taxa were identified in the 124 samples. Univariate analysis highlighted thirteen taxa that differed significantly in relative abundance counts between HGSOC and control samples (p < 0.05) (Figure 2).

LDA modeling was applied to discern latent topics within the bacterial community data. Using the ldatuning package, the optimal number of latent topics was identified as 83. Topic #81 was identified as significantly different in HGSOC samples, with a positive log2 fold change (FDR-adjusted p < 0.05). This topic included bacteria such as Escherichia/Shigella and Corynebacterineae (Figure 3 and Figure 4). To validate these findings, a similar analysis was performed using data from the TCGA dataset. In total, 868 unique taxa were identified, with 43 as the optimal number of latent topics. Topics #19 and #36 were found to be significant and included 33% of the genera observed in the initial significant topic, including Escherichia/Shigella and Corynebacterineae.
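To illustrate how a topic count can be scored, the following is a minimal sketch of a density-style criterion in the spirit of the CaoJuan2009 metric: it averages the pairwise cosine similarity between topic distributions, and a lower value indicates better-separated topics. The function name and toy matrices are assumptions for illustration; this is not the ldatuning implementation and was not part of the study code.

```python
import numpy as np


def mean_pairwise_cosine(topic_term: np.ndarray) -> float:
    """Average cosine similarity over all distinct pairs of topics.

    topic_term: topics x taxa matrix of per-topic probabilities.
    Lower values mean the topics overlap less.
    """
    norms = np.linalg.norm(topic_term, axis=1, keepdims=True)
    unit = topic_term / norms
    similarity = unit @ unit.T
    k = topic_term.shape[0]
    # Keep only the upper triangle (each unordered pair once, no diagonal).
    upper = similarity[np.triu_indices(k, k=1)]
    return float(upper.mean())


# Hypothetical topic-by-taxon matrices (toy values, not study data).
well_separated = np.array([[0.9, 0.1, 0.0, 0.0],
                           [0.0, 0.0, 0.1, 0.9]])
overlapping = np.array([[0.4, 0.3, 0.2, 0.1],
                        [0.3, 0.4, 0.1, 0.2]])

score_separated = mean_pairwise_cosine(well_separated)
score_overlapping = mean_pairwise_cosine(overlapping)
```

In practice such a score would be computed for each candidate topic count and compared alongside the other metrics, rather than used alone.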

After predicting pathway abundance with PICRUSt2 (v 2.6.2), we performed differential pathway analysis between HGSOC samples and controls using the KEGG pathway database. This identified differences in multiple signaling pathways, including oocyte meiosis, aldosterone-regulated sodium reabsorption, gastric acid secretion, and long-term potentiation (negative log2 fold change < 1, p = 0.003) (Figure 5).

The results were then externally validated using the TCGA HGSOC dataset. A similar analysis was performed which identified 868 unique taxa. Topic modeling was then again utilized to analyze changes in bacterial communities between the two sample groups (Figure 6). An optimal latent topic number was identified as 43.

With the topic number set, LDA identified several significantly different topics, including topics #19 and #36. These topics included 33% of the genera observed in the significant topic #81 from our cohort, including Escherichia/Shigella and Corynebacterineae (Figure 7).
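The overlap between topics can be expressed as a simple set computation. The sketch below assumes the percentage is taken relative to the genera in the discovery topic, which the text does not state explicitly; the placeholder genus names (GenusA and so on) are hypothetical and only Escherichia/Shigella and Corynebacterineae come from the study.

```python
def shared_fraction(reference_genera, validation_genera):
    """Fraction of reference genera that also appear in the validation set."""
    reference = set(reference_genera)
    if not reference:
        raise ValueError("reference_genera must not be empty")
    return len(reference & set(validation_genera)) / len(reference)


# Toy example: two named taxa from the study plus hypothetical placeholders.
discovery_topic = ["Escherichia/Shigella", "Corynebacterineae",
                   "GenusA", "GenusB", "GenusC", "GenusD"]
validation_topics = ["Escherichia/Shigella", "Corynebacterineae", "GenusX"]

overlap = shared_fraction(discovery_topic, validation_topics)
```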

3. Discussion

The current study presents a novel application of natural language processing (NLP) through topic modeling to identify potentially significant differences in bacterial communities in the upper genital tract of patients with HGSOC. Our findings revealed differences in bacterial communities between HGSOC and normal control samples, which may reflect underlying dysbiosis; however, these associations remain correlative. Among these, taxa such as Escherichia/Shigella and members of the Corynebacterineae were notably associated with HGSOC, consistent with previous literature suggesting that alterations in the microbiome may contribute to disease pathogenesis [9,10]. Their enrichment in HGSOC samples aligns with prior studies that have reported their presence in ovarian cancer or peritoneal environments [3,4,9,10]. Emerging evidence suggests that microbiome communities within the upper female genital tract may influence the development and/or progression of ovarian cancer. While causality remains elusive, several studies have reported associations between microbiome dysbiosis, inflammation, and tumorigenesis. Microbiome involvement in the initiation and progression of cancer can be mediated by modulation of the immune system. Escherichia/Shigella, for example, are Gram-negative bacteria that can activate Toll-like receptor 4 (TLR4), promoting inflammation and activation of the NF-κB pathway and producing pro- and anti-inflammatory cytokines. The bacterial metabolite lipopolysaccharide (LPS) is highly immunogenic and can stimulate tumor progression through TLR4, which subsequently induces phosphatidylinositol-3-kinase (PI3K) and activation of epithelial-mesenchymal transition (EMT) [11,12,13]. Microbial dysbiosis and overactivation of these pathways may promote a cancer-permissive environment [14].

Furthermore, the enrichment of specific taxa may exert biologic influence through hormonal modulation. One such mechanism is microbially mediated change in estrogen metabolism. Prior work has shown that certain gut and reproductive tract microbes possess β-glucuronidase activity, which may modulate systemic estrogen levels by reactivating conjugated estrogens in the enterohepatic circulation, potentially leading to elevated local hormonal concentrations in the pelvic environment. Altered estrogen signaling can promote proliferation and DNA damage and impair apoptosis in ovarian tissue, contributing to ovarian carcinogenesis [15,16,17]. In our previous work, we observed that certain bacterial taxa were associated with patterns of somatic DNA methylation and genomic alterations in HGSOC, suggesting that microbiota may shape tumor evolution by influencing host epigenetic and transcriptional programs [18]. Taken together, these observations support the hypothesis that microbial communities may contribute to the composition and evolution of the tumor microenvironment.

However, bacterial function often relies on, or is influenced by, other members of the specific community [19]. Using topic modeling, it is possible to deconstruct complex microbial communities into a set of topics that represent the distribution of full communities [20]. LDA modeling identified topic #81 as the latent topic that differed significantly, with a positive fold change in HGSOC samples compared to controls, indicating an overrepresentation of specific bacterial taxa in cancerous tissues. This supports the hypothesis that the composition of a specific bacterial community may play an integral role in the development of disease. This topic primarily included bacterial species known to reside in the genital tract, reinforcing the hypothesis that local microbial dysbiosis influences ovarian cancer development.

The use of NLP and topic modeling offers a unique perspective on microbial analysis. It allows investigation not only of how communities differ quantitatively, but also of how they differ in composition. By treating bacterial communities as “topics”, much as words cluster in textual data, we were able to model the high-dimensional interactions between different bacterial species and identify potentially meaningful patterns and associations with ovarian cancer.
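The analogy can be made concrete with a small sketch in which each sample plays the role of a document and each taxon the role of a word. The code below is a minimal collapsed Gibbs sampler for LDA written for illustration only; the study used R packages, and the function name, hyperparameters (alpha, beta), and toy count matrix here are assumptions rather than the authors' settings.

```python
import numpy as np


def fit_lda(counts, n_topics, n_iter=50, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on a samples x taxa count matrix.

    Returns (sample_topic, topic_taxon), both row-normalized.
    """
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts, dtype=int)
    n_samples, n_taxa = counts.shape

    # Expand the count matrix into one (sample, taxon) token per read.
    sample_idx, taxon_idx = np.nonzero(counts)
    reps = counts[sample_idx, taxon_idx]
    docs = np.repeat(sample_idx, reps)
    words = np.repeat(taxon_idx, reps)

    # Random initial topic assignment and the matching count tables.
    z = rng.integers(n_topics, size=docs.size)
    n_dk = np.zeros((n_samples, n_topics))
    n_kw = np.zeros((n_topics, n_taxa))
    n_k = np.zeros(n_topics)
    np.add.at(n_dk, (docs, z), 1)
    np.add.at(n_kw, (z, words), 1)
    np.add.at(n_k, z, 1)

    for _ in range(n_iter):
        for i in range(docs.size):
            d, w, k = docs[i], words[i], z[i]
            # Remove the token, resample its topic, then add it back.
            n_dk[d, k] -= 1
            n_kw[k, w] -= 1
            n_k[k] -= 1
            p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + n_taxa * beta)
            k = rng.choice(n_topics, p=p / p.sum())
            z[i] = k
            n_dk[d, k] += 1
            n_kw[k, w] += 1
            n_k[k] += 1

    sample_topic = (n_dk + alpha) / (n_dk + alpha).sum(axis=1, keepdims=True)
    topic_taxon = (n_kw + beta) / (n_kw + beta).sum(axis=1, keepdims=True)
    return sample_topic, topic_taxon


# Toy data: six samples, four taxa, two obvious co-occurring communities.
toy_counts = np.array([[20, 20, 0, 0]] * 3 + [[0, 0, 20, 20]] * 3)
sample_topic, topic_taxon = fit_lda(toy_counts, n_topics=2, n_iter=30)
```

Each row of `sample_topic` gives the mixture of topics in one sample, and each row of `topic_taxon` gives the taxa that make up one topic; these correspond to the per-sample topic distributions and per-topic bacterial probabilities discussed in the text.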

Additionally, our pathway analysis revealed potential alterations in several biologic processes between samples. The oocyte meiosis and aldosterone-regulated sodium reabsorption pathways are relevant to ovarian function and were noted to be downregulated in HGSOC samples [21]. Disruptions in oocyte meiosis can affect cell division, chromosomal integrity, and repair mechanisms, resulting in the accumulation of mutations and promotion of cancer development. Alterations within the aldosterone pathway can affect sodium and other electrolyte homeostasis and fluid retention, which in turn affect the local tumor microenvironment. The changes in these two pathways may reflect microbial influence on host signaling via microbially derived metabolites. For example, microbially derived short-chain fatty acids (SCFAs) can modulate histone deacetylase activity, and microbiota can impact estrogen metabolism; both are relevant to ovarian physiology and tumorigenesis [22]. Changes within systemic metabolic and organismal systems, including gastric acid secretion and long-term potentiation, suggest the potential for broader effects resulting from local microbiome imbalances within the genital tract. While not mechanistically definitive, taken together these findings raise the hypothesis that such changes reflect broader metabolic disruption that may affect the immune response, cell signaling pathways, and the local tumor environment, contributing to the development and growth of cancers within the genital tract.

A strength of this study is the integration of multiple analytic tools, which allowed us to identify differences in composition and to gain a potential understanding of the functional implications of these microbial shifts. Validation of our results with the TCGA dataset also adds robustness and external validity to our findings. The limitations of this study include its retrospective nature, which limits our ability to establish causal relationships between microbial differences and the development of disease. There is the potential for confounding factors, such as patient demographics, antibiotic use, or other factors that may contribute to changes within the microbiome. Discrepancies between datasets may be due to variability in sample acquisition, sequencing depth, or RNA extraction methods. PICRUSt2 predictions are limited by their reliance on reference genomes and 16S rRNA data. Future studies using shotgun metagenomics will be essential to validate functional predictions and explore strain-specific contributions. Additionally, while results were validated using an independent dataset, TCGA, the study did not employ independent analytical platforms to validate and confirm the microbial signal. Therefore, these findings could potentially be subject to contamination and should be interpreted as hypothesis generating, pending further validation. Additional studies that follow microbiome changes over time, as well as clinical outcomes, would help establish a more causal role in the development of cancer. Finally, the use of topic modeling carries its own limitations, including a susceptibility to overfitting, especially with smaller sample sizes. Although the sample size, particularly for the control group, was small, we sought to account for this through validation with the TCGA dataset, and future efforts will include validating these findings with larger datasets.

4. Materials and Methods

We performed a single institution, retrospective, cohort pilot study comparing abundance of bacterial presence between HGSOC and control samples. Tumor samples were collected from patients with HGSOC undergoing cytoreductive surgery (cases) and compared to patient samples collected at the time of surgery for benign indications (controls).

4.1. Specimen Acquisition

Tissue samples were obtained from the Department of Obstetrics and Gynecology Gynecologic Oncology Bank (IRB, ID#200209010), which is part of the Women’s Health Tissue Repository (WHTR, IRB, ID#201809807). A separate approval was given by the University of Iowa (UI) Institutional Review Board (IRB, ID#201202714) to collect 20 normal fallopian tube samples in coordination with the University of Iowa Tissue Procurement Core Facility to be used as controls. Tubal samples came from the junction of the ampullary and fimbriated end of fallopian tubes of volunteers without any family or personal history of cancers who were scheduled to undergo salpingectomy for benign indications (mainly sterilization). Fallopian tubes were chosen as controls because the fallopian tube is currently understood to be the likely site of origin of HGSOC [23]. No patient indicating a personal or family history of cancer was included. All tissues archived in the WHTR were originally obtained from adult patients under informed consent in accordance with University of Iowa IRB guidelines. RNA was then extracted from epithelial tissue from the junction of the ampullary and fimbriated end of the fallopian tubes. Twenty normal fallopian tube specimens were obtained. Of those, 12 produced viable RNA for analysis. RNA from both the fallopian tube and HGSOC specimens had been previously extracted and purified in a prior study [18].

4.2. RNA Sequencing and Metagenomic Analysis

RNA was extracted, and the resulting sequencing data were processed using the VITCOMIC2 (v 3.0) pipeline, which analyzes 16S rRNA gene sequences from high-throughput sequencing data to visualize the phylogenetic composition of metagenomic samples. Files from sequencing were pre-processed with fastp (v 0.23.4), and then seqkit (v 2.3.0) was applied to convert FASTQ to FASTA format [24,25]. Finally, MAPseq (v 2.0.1alpha) was used to map sequences against reference 16S rRNA sequences. MAPseq also provides a curated reference of full-length rRNA genes that are pre-classified into taxonomic categories based on the NCBI taxonomy and the All-species Living Tree Project dataset. We used hits with an identity >94% and an alignment length of ≥75 base pairs. These parameters helped exclude likely contaminant sequences and human RNA, supporting a high-confidence microbial profile. Host depletion was achieved by focusing on reads mapping specifically to conserved 16S rRNA genes. VITCOMIC2 (v 3.0) was then used to determine the bacterial composition of the samples [26,27].
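The hit-filtering rule can be sketched as follows. This is a minimal illustration that assumes each mapped read has already been parsed into a record with a percent identity and an alignment length; the `Hit` class and its field names are hypothetical and do not reflect the MAPseq output format.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """Hypothetical record for one read mapped to a reference 16S sequence."""
    read_id: str
    taxon: str
    identity: float   # percent identity of the alignment
    aln_length: int   # alignment length in base pairs


MIN_IDENTITY = 94.0   # hits must exceed this identity (>94%)
MIN_ALN_LENGTH = 75   # hits must be at least this long (>=75 bp)


def keep_hit(hit: Hit) -> bool:
    """Apply the identity and alignment-length thresholds stated in the text."""
    return hit.identity > MIN_IDENTITY and hit.aln_length >= MIN_ALN_LENGTH


def count_taxa(hits):
    """Count retained hits per taxon."""
    return Counter(hit.taxon for hit in hits if keep_hit(hit))


# Toy hits (illustrative values, not study data).
hits = [
    Hit("r1", "Escherichia/Shigella", 98.5, 120),  # kept
    Hit("r2", "Escherichia/Shigella", 94.0, 150),  # dropped: identity not >94
    Hit("r3", "Corynebacterineae", 96.0, 74),      # dropped: too short
    Hit("r4", "Corynebacterineae", 95.2, 75),      # kept
]
taxon_counts = count_taxa(hits)
```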

The phyloseq package (v 1.52.0) was used for the representation and analysis of microbiome census data. DESeq2 (v 1.49.3) was used to normalize, log2 transform, and analyze count data. A univariate analysis identified differences in relative bacterial 16S rRNA abundance counts. All samples were processed using standardized extraction, handling, and sequencing pipelines at a single institution to minimize batch effects or technical confounding.
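As a rough illustration of count normalization, the sketch below implements a median-of-ratios size-factor calculation followed by a log2 transform with a pseudocount. It is a simplified stand-in written for this example, not the DESeq2 code, and the pseudocount and toy matrix are assumptions. Sparse microbiome tables often contain few taxa with non-zero counts in every sample, which this simple version requires.

```python
import numpy as np


def size_factors(counts):
    """Median-of-ratios size factors for a taxa x samples count matrix."""
    counts = np.asarray(counts, dtype=float)
    # Use only taxa observed in every sample so the geometric mean is defined.
    positive = (counts > 0).all(axis=1)
    if not positive.any():
        raise ValueError("no taxon has non-zero counts in every sample")
    log_counts = np.log(counts[positive])
    log_geo_mean = log_counts.mean(axis=1, keepdims=True)
    return np.exp(np.median(log_counts - log_geo_mean, axis=0))


def normalize_log2(counts, pseudocount=1.0):
    """Divide each sample by its size factor, then log2 transform."""
    counts = np.asarray(counts, dtype=float)
    return np.log2(counts / size_factors(counts) + pseudocount)


# Toy matrix: sample 2 was sequenced twice as deeply as sample 1.
toy = np.array([[10, 20],
                [100, 200],
                [5, 10],
                [0, 4]])
factors = size_factors(toy)
log2_norm = normalize_log2(toy)
```

After normalization, the first three taxa have equal values in both samples, so the remaining difference in the fourth taxon is not an artifact of sequencing depth.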

4.3. Natural Language Processing Analysis

Topic modeling with Latent Dirichlet Allocation (LDA) was used to assess changes in bacterial communities between samples in two steps. (1) First, the ldatuning package determined the optimal number of latent topics for the analysis. This tool is commonly used in NLP to optimize topic models such as LDA; it systematically tests multiple topic counts and applies statistical metrics to identify the best-fitting model. (2) Then, we used the topicmodels package to evaluate differences in bacterial communities by examining topic distributions. Differences between the two groups (HGSOC vs. controls) were considered statistically significant at false discovery rate (FDR)-adjusted p-values < 0.05. A schematic of this pipeline is available in the Supplementary Materials.
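The group comparison step can be sketched as follows, assuming a p-value has already been obtained for each topic from some two-group test (the specific test is not stated here). The code shows a Benjamini-Hochberg FDR adjustment and a log2 fold change of mean topic proportions; the function names, the pseudocount, and the toy numbers are assumptions for illustration.

```python
import math


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def log2_fold_change(case_values, control_values, pseudocount=1e-6):
    """log2 ratio of mean topic proportion in cases versus controls."""
    case_mean = sum(case_values) / len(case_values)
    control_mean = sum(control_values) / len(control_values)
    return math.log2((case_mean + pseudocount) / (control_mean + pseudocount))


# Toy per-topic p-values and topic proportions (not study data).
raw_p = [0.001, 0.02, 0.03, 0.5]
fdr_p = benjamini_hochberg(raw_p)
significant = [i for i, p in enumerate(fdr_p) if p < 0.05]
lfc = log2_fold_change([0.4, 0.4], [0.05, 0.05])
```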

4.4. Prediction of Functional Profiles

Functional profiles cannot be directly identified using 16S rRNA gene sequence data, so several methods have been developed to predict microbial community functions from taxonomic profiles. One of them is PICRUSt2 (Phylogenetic Investigation of Communities by Reconstruction of Unobserved States, v 2.6.2), which was developed to predict functions from 16S marker sequences [28]. Functional prediction was based on gene families derived from KEGG orthologs and Enzyme Commission (EC) numbers. With PICRUSt2 we predicted pathway abundance, and then, with the R package ggpicrust2, we performed differential pathway abundance analysis between cases and controls [29].
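The core idea of predicting function from taxonomy can be sketched as a matrix product: taxon abundances are corrected for 16S copy number and then multiplied by the gene content attributed to each taxon. The example below is a deliberately simplified illustration, not the PICRUSt2 algorithm, which additionally infers gene content for each sequence from a reference phylogeny; here the gene content table is simply assumed to be given, and all values are toy numbers.

```python
import numpy as np


def predict_function_abundance(taxa_counts, copy_numbers, gene_content):
    """Predict per-sample function abundance from taxon counts.

    taxa_counts:  samples x taxa matrix of 16S read counts
    copy_numbers: 16S gene copies per genome for each taxon
    gene_content: taxa x functions matrix of gene copies per genome
    """
    taxa_counts = np.asarray(taxa_counts, dtype=float)
    copy_numbers = np.asarray(copy_numbers, dtype=float)
    gene_content = np.asarray(gene_content, dtype=float)
    # Convert 16S read counts to approximate genome counts.
    genomes = taxa_counts / copy_numbers
    # Sum each taxon's contribution to every function.
    return genomes @ gene_content


# Toy inputs: two samples, two taxa, two functions.
taxa_counts = [[10, 4],
               [0, 8]]
copy_numbers = [5, 2]
gene_content = [[1, 0],
                [2, 3]]
predicted = predict_function_abundance(taxa_counts, copy_numbers, gene_content)
```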

4.5. Analysis Validation

Validation with the TCGA database: Validation was performed using the TCGA HGSOC dataset. After permission to access controlled data was granted by the Genomic Data Commons (GDC) Data Portal (dbGaP#29868), TCGA HGSOC RNAseq files (N = 423) in BAM format were downloaded from women with HGSOC. As previously described, we used the VITCOMIC2 pipeline to assess the phylogenetic composition of metagenomic samples. Briefly, BAM files were converted to FASTQ format with samtools. FASTQ files were then pre-processed with fastp, and seqkit was applied to convert them to FASTA format. Finally, MAPseq was used to map sequences against reference 16S rRNA sequences with a curated reference of taxonomic categories. We used hits with an identity >94% and an alignment length of ≥75 base pairs. We used VITCOMIC2 to determine the bacterial composition of the samples, phyloseq for the representation and analysis of microbiome census data, and DESeq2 to normalize, log2 transform, and analyze count data.

5. Conclusions

In conclusion, NLP methods identified different bacterial communities in the upper female genital tract associated with HGSOC compared to normal controls. Differences in bacterial communities may be related to functional differences between HGSOC samples and normal controls. While this provides preliminary evidence suggesting potential association, this is hypothesis generating and further study is needed to confirm these associations and the role of the microbiome in the development of ovarian cancer. These results should therefore be interpreted as exploratory. Further studies should include larger sample sizes to validate these results, include mechanistic investigations, longitudinal tracking of microbiome changes, and investigate potential biomarkers for early detection in ovarian cancer.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/ijms26157432/s1.

Author Contributions

Conceptualization, J.G.B.; methodology, J.G.B.; validation, J.G.B.; formal analysis, J.G.B.; resources, J.G.B., M.J.G., and D.P.B.; data curation, A.P. and J.G.B.; writing—original draft preparation, A.P. and J.G.B.; writing—review and editing, J.G.B., V.W., A.P., M.J.G., and D.P.B.; visualization, A.P. and J.G.B.; supervision, J.G.B.; project administration, J.G.B. and M.J.G.; funding acquisition, J.G.B. and M.J.G.; M.J.G. is responsible for assembling and maintaining the tumor bank utilized for this study. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the NIH 5R01CA99908-18 (K. Leslie PI), Department of Defense OC190352 (K. Leslie PI), and by the Research Fund of the Gynecologic Oncology Division of the University of Iowa Hospitals and Clinics. Also, it was supported in part by the American Association of Obstetricians and Gynecologists Foundation (AAOGF) Bridge Funding Award.

Institutional Review Board Statement

The study was conducted according to the guidelines of the Declaration of Helsinki and approved by the Institutional Review Board of University of Iowa (IRB ID#200209010, approved on 19 September 2005; IRB ID#201809807, approved 10 April 2019).

Informed Consent Statement

Informed consent was obtained from all the subjects involved in the study.

Data Availability Statement

Clinical data are not publicly available due to patient privacy. Datasets can be browsed by their accession number: GSE156699. The validation part of this study was performed in silico, with de-identified publicly available data. All data from TCGA are available at their website: https://portal.gdc.cancer.gov/, accessed 20 December 2024. Software utilized by this study is also publicly available at Bioconductor website: http://bioconductor.org/, accessed 20 December 2024.

Acknowledgments

The authors would like to acknowledge the work of the University of Iowa Core laboratories.

Conflicts of Interest

All the authors have nothing to disclose. This does not alter our adherence to the journal policies on sharing data and materials.

References

1. Aggarwal, N.; Kitano, S.; Puah, G.R.Y.; Kittelmann, S.; Hwang, I.Y.; Chang, M.W. Microbiome and Human Health: Current Understanding, Engineering, and Enabling Technologies. Chem. Rev. 2022, 123, 31.
2. Madhogaria, B.; Bhowmik, P.; Kundu, A. Correlation between human gut microbiome and diseases. Infect. Med. 2022, 1, 180–191.
3. Li, C.; Feng, Y.; Yang, C.; Wang, D.; Zhang, D.; Luo, X.; Zhang, H.; Huang, H.; Zhang, H.; Jiang, Y.; et al. Association between vaginal microbiota and the progression of ovarian cancer. J. Med. Virol. 2023, 95, e28898.
4. Laniewski, P.; Ilhan, Z.E.; Herbst-Kralovetz, M.M. The microbiome and gynaecological cancer development, prevention and therapy. Nat. Rev. Urol. 2020, 17, 232–250.
5. Chambers, L.M.; Bussies, P.; Vargas, R.; Esakov, E.; Tewari, S.; Reizes, O.; Michener, C. The Microbiome and Gynecologic Cancer: Current Evidence and Future Opportunities. Curr. Oncol. Rep. 2021, 23, 92.
6. Sanschagrin, S.; Yergeau, E. Next-generation sequencing of 16S ribosomal RNA gene amplicons. J. Vis. Exp. 2014, 90, 51709.
7. Wensel, C.R.; Pluznick, J.L.; Salzberg, S.L.; Sears, C.L. Next-generation sequencing: Insights to advance clinical investigations of the microbiome. J. Clin. Investig. 2022, 132, e154944.
8. Shrode, R.L.; Ollberding, N.J.; Mangalam, A.K. Looking at the Full Picture: Utilizing Topic Modeling to Determine Disease-Associated Microbiome Communities. bioRxiv 2023.
9. Asangba, A.E.; Chen, J.; Goergen, K.M.; Larson, M.C.; Oberg, A.L.; Casarin, J.; Multinu, F.; Kaufmann, S.H.; Mariani, A.; Chia, N.; et al. Diagnostic and prognostic potential of the microbiome in ovarian cancer treatment response. Sci. Rep. 2023, 13, 730.
10. Miao, R.; Badger, T.C.; Groesch, K.; Diaz-Sylvester, P.L.; Wilson, T.; Ghareeb, A.; Martin, J.A.; Cregger, M.; Welge, M.; Bushell, C.; et al. Assessment of peritoneal microbial features and tumor marker levels as potential diagnostic tools for ovarian cancer. PLoS ONE 2020, 15, e0227707.
11. Wang, Q.; Zhao, L.; Han, L.; Fu, G.; Tuo, X.; Ma, S.; Li, Q.; Wang, Y.; Liang, D.; Tang, M.; et al. The differential distribution of bacteria between cancerous and noncancerous ovarian tissues in situ. J. Ovarian Res. 2020, 13, 8.
12. Zhang, M.; Mo, J.; Huang, W.; Bao, Y.; Luo, X.; Yuan, L. The ovarian cancer-associated microbiome contributes to the tumor’s inflammatory microenvironment. Front. Cell Infect Microbiol. 2024, 21, 1440742.
13. Sipos, A.; Ujlaki, G.; Mikó, E.; Maka, E.; Szabó, J.; Uray, K.; Krasznai, Z.; Bai, Z. The role of the microbiome in ovarian cancer: Mechanistic insights into oncobiosis and to bacterial metabolite signaling. Mol. Med. 2021, 27, 33.
14. Sadrekarimi, H.; Gardanova, Z.R.; Bakhshesh, M.; Ebrahimzadeh, F.; Yaseri, A.F.; Thangavelu, L.; Hasanpoor, Z.; Zadeh, F.A.; Kahrizi, M.S. Emerging role of human microbiome in cancer development and response to therapy: Special focus on intestinal microflora. J. Transl. Med. 2022, 20, 301.
15. Ervin, S.M.; Li, H.; Lim, L.; Roberts, L.R.; Liang, X.; Mani, S.; Redinbo, M.R. Gut microbial β-glucuronidases reactivate estrogens as components of the estrobolome that reactivate estrogens. J. Biol. Chem. 2019, 294, 18586–18599.
16. Parida, S.; Sharma, D. The Microbiome–Estrogen Connection and Breast Cancer Risk. Cells 2019, 8, 1642.
17. He, S.; Li, H.; Yu, Z.; Zhang, F.; Liang, S.; Liu, H.; Chen, H.; Lü, M. The Gut Microbiome and Sex Hormone-Related Diseases. Front. Microbiol. 2021, 12, 711137.
18. Reyes, H.D.; Devor, E.J.; Warrier, A.; Newtson, A.M.; Mattson, J.; Wagner, V.; Duncan, G.N.; Leslie, K.K.; Gonzalez-Bosquet, J. Differential DNA methylation in high-grade serous ovarian cancer (HGSOC) is associated with tumor behavior. Sci. Rep. 2019, 9, 17996.
19. Rowland, I.; Gibson, G.; Heinken, A.; Scott, K.; Swann, J.; Thiele, I.; Tuohy, K. Gut microbiota functions: Metabolism of nutrients and other food components. Eur. J. Nutr. 2018, 57, 1–24.
20. Kim, A.; Sevanto, S.; Moore, E.R.; Lubbers, N. Latent Dirichlet Allocation modeling of environmental microbiomes. PLoS Comput. Biol. 2023, 19, e1011075.
21. Ma, X.; Xu, R.; Chen, J.; Wang, S.; Hu, P.; Wu, Y.; Que, Y.; Du, W.; Cai, X.; Chen, H.; et al. The epithelial Na⁺ channel (ENaC) in ovarian granulosa cells modulates Ca²⁺ mobilization and gonadotrophin signaling for estrogen homeostasis and female fertility. Cell Commun. Signal. 2024, 22, 398.
22. Ramos Meyers, G.; Samouda, H.; Bohn, T. Short Chain Fatty Acid Metabolism in Relation to Gut Microbiota and Genetic Variability. Nutrients 2022, 14, 5361.
23. Erickson, B.K.; Conner, M.G.; Landen, C.N., Jr. The role of the fallopian tube in the origin of ovarian cancer. Am. J. Obstet. Gynecol. 2013, 209, 409–414.
24. Chen, S.; Zhou, Y.; Chen, Y.; Gu, J. fastp: An ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 2018, 34, i884–i890.
25. Shen, W.; Le, S.; Li, Y.; Hu, F. SeqKit: A Cross-Platform and Ultrafast Toolkit for FASTA/Q File Manipulation. PLoS ONE 2016, 11, e0163962.
26. Yilmaz, P.; Parfrey, L.W.; Yarza, P.; Gerken, J.; Pruesse, E.; Quast, C.; Schweer, T.; Peplies, J.; Ludwig, W.; Glöckner, F.O. The SILVA and “All-species Living Tree Project (LTP)” taxonomic frameworks. Nucleic Acids Res. 2014, 42, D643–D648.
27. Mori, H.; Maruyama, T.; Yano, M.; Yamada, T.; Kurokawa, K. VITCOMIC2: Visualization tool for the phylogenetic composition of microbial communities based on 16S rRNA gene amplicons and metagenomic shotgun sequencing. BMC Syst. Biol. 2018, 12 (Suppl. 2), 30.
28. Douglas, G.M.; Maffei, V.J.; Zaneveld, J.R.; Yurgel, S.N.; Brown, J.R.; Taylor, C.M.; Huttenhower, C. PICRUSt2 for prediction of metagenome functions. Nat. Biotechnol. 2020, 38, 685–688.
29. Yang, C.; Mai, J.; Cao, X.; Burberry, A.; Cominelli, F.; Zhang, L. ggpicrust2: An R package for PICRUSt2 predicted functional profile analysis and visualization. Bioinformatics 2023, 39, btad470.

Figure 1. Patient population. Normal fallopian tube samples from patients with no risk factors and no personal/family history of ovarian cancer. Out of the 20 samples, 12 were suitable and were sequenced. Samples were from HGSOC patients that underwent surgical intervention at the University of Iowa Hospitals and Clinics and had their tumors sequenced.

Figure 2. Comparison of 16S rRNA gene expression between HGSOC and control samples. (A) Heatmap of the normalized counts of the 30 most frequent bacterial 16S rRNA sequences found by RNAseq for HGSOC and control samples. Abundance counts are represented in green. Analysis was performed with the phyloseq R package (v 4.4.1). (B) Heatmap of the log2-transformed normalized 16S rRNA counts found by RNAseq that differed between HGSOC and control samples following univariate analysis, N = 13. Analysis was performed with the DESeq2 R package. Log2-transformed 16S rRNA expression is represented on a blue–red scale.

Figure 3. Analysis to identify the optimal number of latent topics in the cohort. The FindTopicsNumber function from the ldatuning (v 1.0.3) package was used to identify the optimal number of latent topics using both minimization (CaoJuan2009, Arun2010) and maximization (Griffiths2004, Deveaud2014) metrics. Based on these metrics, 83 topics were selected for the model. The horizontal axis shows the number of topics tested, from 0 to 120. The vertical axis shows the percentage of variation.

Figure 4. Topic modeling using Latent Dirichlet Allocation (LDA). LDA, a natural language processing tool for topic modeling, was used to assess differentially abundant topics between HGSOC and control samples. Left panel: topic #81 shows a positive log2 fold change (>1), with an over 9-fold change between cancer and control samples and a significant FDR-adjusted p-value (p < 0.05). Right panel: per-topic probabilities (horizontal axis) of individual bacteria (vertical axis).
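
The two quantities reported in the left panel, a log2 fold change and an FDR-adjusted p-value, can be sketched in a few lines of Python. The sketch below assumes the Benjamini-Hochberg procedure for the FDR adjustment, which the caption does not specify; the helper names, the pseudocount, and the numbers are hypothetical and are not taken from the study.

```python
import math


def log2_fold_change(mean_case, mean_control, pseudo=1e-9):
    """log2 ratio of mean topic probability in cases versus controls.

    The small pseudocount (an assumption) avoids division by zero.
    """
    return math.log2((mean_case + pseudo) / (mean_control + pseudo))


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical example: a topic nine times more probable in cases.
lfc = log2_fold_change(0.09, 0.01)

# Hypothetical raw p-values for four topics.
adj = benjamini_hochberg([0.001, 0.04, 0.03, 0.5])
print(round(lfc, 2), [round(p, 4) for p in adj])
```

For orientation, a 9-fold ratio corresponds to a log2 fold change of about 3.17, well above the threshold of 1 (a 2-fold change) shown in the panel.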

Figure 5. KEGG pathway differences between HGSOC and normal controls. Multiple pathways were significantly different, including pathways in environmental information processing, metabolism, and organismal systems. Middle panel, vertical axis: KEGG names of the significant pathways; horizontal axis: relative gene expression abundance (log2 transformed) in the described pathways. Right panel: log2 fold change with direction (negative: lower in cancer than in normal; positive: higher in cancer than in normal). Far right: adjusted p-value of the difference.

Figure 6. Analysis to identify the optimal number of latent topics in the TCGA cohort. The FindTopicNumber function from the ldatuning (v 1.0.3) package was used to identify the optimal number of latent topics using both minimization (CaoJuan2009, Arun2010) and maximization (Griffiths2004, Deveaud2014) metrics. Based on these metrics, 43 was selected as the optimal topic number to proceed. Horizontal axis: number of topics tested, from 0 to 120. Vertical axis: percentage of variation.

Figure 7. Topic modeling using Latent Dirichlet Allocation (LDA) in the TCGA dataset. Left panel: topics #19 and #36 show positive log2 fold changes (>1), and topic #16 shows a negative log2 fold change between cancer and control samples. All three have significant FDR-adjusted p-values (p < 0.05). Right panel: probabilities (horizontal axis) of individual bacteria (vertical axis) in topic #36.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Polio, A.; Wagner, V.; Bender, D.P.; Goodheart, M.J.; Gonzalez Bosquet, J. A Natural Language Processing Method Identifies an Association Between Bacterial Communities in the Upper Genital Tract and Ovarian Cancer. Int. J. Mol. Sci. 2025, 26, 7432. https://doi.org/10.3390/ijms26157432
