Dynamics of Fecal Microbiota with and without Invasive Cervical Cancer and Its Application in Early Diagnosis

Simple Summary
The fecal microbiome has been suggested to be linked to invasive cervical cancer (ICC). Considering that ICC is common in women, it is important to identify bacterial signatures from the fecal microbiota that contribute to classifying cervical cancer. Although previous studies have suggested possible biomarkers based on fecal microbiota, limited information exists on the diagnostic ability of gut microbiota-derived signatures for detecting early ICC. The purpose of this study was to investigate the potential association between early ICC and fecal microbiota and to examine whether fecal microbiota-derived markers can be utilized as a non-invasive tool to diagnose early ICC using machine learning (ML) techniques. Further studies that characterize the identified bacterial genera quantitatively and qualitatively and validate our model in larger cohorts are needed to address causality in the association between cervical cancer and microbes.

Abstract
The fecal microbiota is being increasingly implicated in the diagnosis of various diseases. However, evidence on changes in the fecal microbiota in invasive cervical cancer (ICC) remains scarce. Here, we aimed to investigate the fecal microbiota of our cohorts, develop a diagnostic model for predicting early ICC, and identify potential fecal microbiota-derived biomarkers using amplicon sequencing data. We obtained fecal samples from 29 healthy women (HC) and 17 women with clinically confirmed early ICC (CAN). Although Shannon's diversity index did not differ significantly, the Chao1 index and the number of observed operational taxonomic units (OTUs) in the fecal microbiota differed significantly between the CAN and HC groups. Furthermore, there were significant differences in the taxonomic profiles between HC and CAN; Prevotella was significantly more abundant in the CAN group and Clostridium in the HC group. Linear discriminant analysis effect size (LEfSe) analysis was applied to validate the taxonomic differences at the genus level. Furthermore, we identified a set of seven bacterial genera that were used to construct a machine learning (ML)-based classifier model to distinguish CAN from HC. The model had high diagnostic utility (area under the curve [AUC] = 0.913) for predicting early ICC. Our study provides an initial step toward exploring the fecal microbiota and may help clinicians in diagnosis.


Introduction
Invasive cervical cancer (ICC) is one of the common health problems for women worldwide, affecting approximately 500,000 women every year [1]. Human papillomavirus (HPV) is the major cause of 95-100% of cases of ICC [2]. Although HPV infection is the primary cause, it alone does not determine the development of cervical cancer. Most HPV infections are cleared, and only a small proportion of infected women develop cervical intraepithelial neoplasia or ICC. Thus, tumorigenesis is still insufficiently understood. Martin et al. [3] demonstrated that complex host variations are significant in carcinogenesis.
The intestinal microbiota co-exist with their hosts symbiotically and play an essential role in host health and disease. Studies have shown that fecal microbes orchestrate not only the host's metabolism but also their immune response. Several attempts have been made to theorize the association between intestinal microbiota and ICC [4][5][6][7]. Sims et al. [8] reported differences in fecal microbial composition between patients with ICC and healthy controls, indicating that their gut microbiota reflect etiologic or clinical differences by age. Wang et al. [9] demonstrated Proteobacteria, Parabacteroides, Escherichia-Shigella, and Roseburia as possible biomarkers for ICC diagnosis.
Although these studies demonstrated the potential association between fecal microbiota and ICC, identifying biomarkers to predict ICC remains a challenge. The fecal microbiota contains a large number of bacterial types [10], and the number of possible correlations between these bacteria is larger still. Moreover, the relationship between the fecal microbiota and ICC might be obscured by variations in data [11]. A further difficulty is that different bacterial compositions may provide similar functionality [12]. These problems are analogous to the challenges faced by geneticists [13], where a number of genetic interactions can be associated with cervical cancer, making it difficult to determine the causative agents [14].
Given the importance of ICC and the potential relationship between the fecal microenvironment and cervical health, there exists a powerful rationale for the development of a prediction model using fecal microbiota-based biomarkers that might be used to predict early ICC. We investigated the fecal microbiota of a well-characterized cohort that included patients with biopsy-proven disease. Furthermore, we applied machine learning (ML) algorithms to discover possible interactions associated with early ICC and identify potential microbial markers. In particular, we aimed to develop a prediction model for early ICC diagnosis in the form of a classifier.

Participants' Characteristics
This study enrolled 46 participants in total: 29 healthy women (HC) and 17 patients with invasive cervical cancer (CAN). The HC group included healthy volunteers, aged 20-54 years, without a symptomatic history of the reproductive tract for the last 10 years. The 17 patients with early-stage cervical cancer, aged 29-64 years, were recruited before the loop electrosurgical excision procedure in our hospital between January and August 2020. Additional patients' characteristics are presented in Table 1.

Dynamics of Fecal Microbiota
A total of 46 fecal microbial samples from 46 participants were examined. A median of 14,315 (range, 5953 to 125,338) high-quality sequences were obtained after performing quality control using DADA2 software. Figure 1 shows the taxonomic composition at the phylum (Figure 1A) and genus (Figure 1B) levels. The fecal microbiota was mainly dominated by Firmicutes (60.86%), Bacteroidetes (35.73%), and Actinobacteria (1.02%) at the phylum level and by Bacteroides (24.74%), Faecalibacterium (11.33%), Prevotella (7.95%), and Blautia (5.31%) at the genus level. To evaluate differences in intestinal microbial α-diversity (i.e., Chao1, Shannon's index, and observed OTUs), we compared the diversity and richness indices of HC and CAN (Figure 2A). A comparison of Chao1 showed a significant reduction in HC (p = 0.0098), whereas the opposite trend was seen in observed OTUs (p = 0.012). However, no statistically significant difference in Shannon's diversity was observed (p = 0.63). To evaluate the extent of dissimilarity in fecal microbial composition, β-diversity was computed based on the Bray-Curtis distance. A principal coordinates analysis (PCoA) plot was used to present the microbial composition of each group, and ADONIS analysis was performed to compare the dispersion among health statuses. Upon analyzing β-diversity at the genus level, we observed significant differences (p = 0.001) in the clustering of each group in the PCoA plot (Figure 2B). Upon close examination of the PCoA1 and PCoA2 axes of our samples, we observed significant differences in PCoA1 (p < 0.001 by the Wilcoxon test). However, no significant difference was observed in PCoA2.
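The diversity measures named above can be illustrated with a minimal Python sketch. It is not the pipeline used in this study (the indices were obtained with the tools cited in the methods); the function names, the natural-log base for Shannon's index, and the bias-corrected form of Chao1 are assumptions chosen for illustration.

```python
import math

def observed_features(counts):
    """Number of taxa with a non-zero count in one sample."""
    return sum(1 for c in counts if c > 0)

def shannon_index(counts):
    """Shannon diversity H' = -sum(p_i * ln p_i); natural log is an assumption."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def chao1(counts):
    """Bias-corrected Chao1 richness: S_obs + F1 * (F1 - 1) / (2 * (F2 + 1))."""
    singletons = sum(1 for c in counts if c == 1)
    doubletons = sum(1 for c in counts if c == 2)
    return observed_features(counts) + singletons * (singletons - 1) / (2 * (doubletons + 1))

def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two equal-length count vectors."""
    numerator = sum(abs(a - b) for a, b in zip(u, v))
    denominator = sum(a + b for a, b in zip(u, v))
    return numerator / denominator
```

Richness estimators such as Chao1 depend on rare counts, whereas Shannon's index also weights evenness, which is one reason the three α-diversity measures need not agree.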

Comparative Analysis of the Fecal Microbial Taxa between HC and CAN
Assigned sequences were used to evaluate differences in taxonomic abundances between CAN and HC at various taxonomic levels. We observed significant changes in fecal microbiota structures between HC and CAN. At the phylum level, the differences in abundance of Bacteroidetes, Firmicutes, and Proteobacteria did not reach statistical significance (data not shown). At the family level, the abundance of Lachnospiraceae (p = 0.001) and Turicibacteraceae (p = 0.0017) was significantly higher in HC, whereas the abundance of Tissierellaceae (p = 0.001), Prevotellaceae (p = 0.001), and Actinomycetaceae (p = 0.0023) was higher in CAN (Figure S1). At the genus level, Prevotella (p = 0.0022) was notably more abundant in CAN, whereas Unclassified Lachnospiraceae (p = 0.0022) and Clostridium (p = 0.0035) were more abundant in HC. Other genera, namely Finegoldia, Anaerococcus, Peptostreptococcus, Peptoniphilus, Varibaculum, Parvimonas, and Dialister, were enriched in the CAN group, and the difference was statistically significant (Figure 3). Linear discriminant analysis effect size (LEfSe) analysis was performed to compare the most differentially abundant taxa of each group (Figure 4). Genera with linear discriminant analysis (LDA) scores exceeding 3.5 were taken as representative of each group, and the results were relatively consistent with those of the abundance analysis (Figure 3). Both the LDA graph (Figure 4A) and the cladogram (Figure 4B) showed that 11 genera (Varibaculum, Actinobaculum, Corynebacterium, Dialister, WAL_1855D, Peptostreptococcus, Peptoniphilus, Anaerococcus, Streptococcus, Finegoldia, and Prevotella) were more abundant in the CAN group, whereas 12 genera (Bacteroides, Faecalibacterium, Blautia, Unclassified Clostridiales, Clostridium, Roseburia, Unclassified Lachnospiraceae, Ruminococcus, Gemmiger, Haemophilus, Bifidobacterium, and Lachnospira) were more abundant in the HC cluster.

Ecological Network and Correlation Analysis
The symbiotic network of each group was built based on Spearman's correlation coefficient for genera with rho values (correlation coefficient) greater than 0.5 and p < 0.05 (Figure 5). A module is a group of genera that are connected among themselves but have far fewer connections with genera outside the group. As shown in Figure 5, in the HC group, the symbiotic network consisted of nine modules with 49 nodes (genera) and 172 edges; five of the nine modules had ≥5 nodes, and most relationships were positive. By contrast, in the CAN group, only six modules were obtained from network analysis, with 12 nodes and 23 edges. Only two modules had ≥5 nodes; overall, they also showed positive relationships. We then investigated the potential correlation between disease features and fecal microbiota composition to explore whether it reflects the heterogeneity of the microbial community (Figure S2). Within the HC group, Firmicutes and Bacteroidetes positively correlated with each other (rho = 0.83, p < 0.001). By contrast, in the CAN group, Bacteroidetes and Actinobacteria showed a moderate positive correlation (rho = 0.55, p < 0.05). Based on their correlation pattern, we further investigated the Firmicutes to Bacteroidetes ratio (F/B ratio). Case-control studies [15][16][17][18][19] have demonstrated that higher ratios are observed in patients with diseases such as hypertension, chronic fatigue syndrome, and autism. However, this trend was not seen in our cohort (p = 0.182).
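How a correlation network of this kind can be assembled is sketched below in Python. This is an illustration only, not the Cytoscape workflow used here: the function names are hypothetical, the cutoff is applied to the absolute value of rho (an assumption, made because both positive and negative relationships are reported), and the p < 0.05 filter is omitted for brevity.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho, computed as the Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

def network_edges(abundance, rho_cutoff=0.5):
    """Return (genus_a, genus_b, rho) for every pair with |rho| above the cutoff.

    `abundance` maps a genus name to its abundances across samples.
    """
    names = sorted(abundance)
    edges = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            rho = spearman_rho(abundance[a], abundance[b])
            if abs(rho) > rho_cutoff:
                edges.append((a, b, rho))
    return edges
```

The resulting edge list is the kind of input a network tool would take; module detection and the significance filter would be separate steps.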

Predictive Model Based on Fecal Microbiota
To determine whether differences in fecal microbial composition can be considered potential biomarkers for distinguishing CAN from HC, we applied microbiota classification based on L1 (LASSO) regression to our sequencing data (Figure 6A). Our model selected seven genera as the most important features (Prevotella, Peptostreptococcus, Finegoldia, Ruminococcus, Clostridium, Pseudomonas, and Turicibacter). Prevotella and Turicibacter were the top predictors in our model. The area under the curve (AUC) of the receiver operating characteristic (ROC) curve was computed to evaluate the predictive ability. As shown in Figure 6B, the discriminant model based on the seven genera effectively distinguished CAN from HC (mean AUC = 0.913). Of the seven genera selected, two were >3-fold more abundant in HC than in CAN, whereas three were >3-fold more abundant in CAN (Table 2). Microbial signatures were further validated by applying random forest (RF) to the original dataset and checking for overlapping features. The trained RF selected 21 genera as the most important features, five of which overlapped with the genera selected as features in our prediction models and LEfSe analysis (Figure S3). Furthermore, the RF model showed considerably high prediction accuracy (AUC = 0.91 and 0.88 in the training and test sets, respectively; Figure S4). Thus, ML models based on fecal microbiota could distinguish between the CAN and HC groups in our cohort, indicating that intestinal microbiota can be used as potential biomarkers to predict early-stage ICC.
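For readers less familiar with the AUC, the following Python sketch computes it from its rank-based definition: the probability that a randomly chosen positive sample (here, CAN) receives a higher model score than a randomly chosen negative one (HC). The function name and the 1/0 label coding are assumptions for illustration; the AUC values reported above were produced by the packages cited in the methods, not by this code.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: 1 for the positive class, 0 for the negative class.
    scores: model outputs, where higher means more likely positive.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # tied scores count as half a correct ranking
    return wins / (len(positives) * len(negatives))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two groups.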

Discussion
ICC is known as a multifactorial disease that is affected by various genetic and environmental factors. Although HPV infection is critical for the incidence of cervical cancer [20], the impact of other potential factors on early-stage ICC has not been actively studied. In this study, we aimed to investigate the fecal microbiota of women with and without ICC and build a diagnostic model based on an ML algorithm. We found that the dynamics of the fecal microbiota differed between healthy women and patients with early ICC. Significant differences in α- and β-diversity between patients with ICC and cancer-free controls were observed, demonstrating compositional differences in the fecal microbial community. Furthermore, we developed an ML-based prediction model to detect the presence of early ICC using the relative abundance of specific bacterial genera. Preliminary results of this study revealed that ML-based ROC analysis can predict and detect early ICC. Additionally, we demonstrated the diagnostic accuracy of fecal microbiota-based biomarkers to predict this disease. Most of the seven bacterial genera used for building this model were directly or indirectly linked to human health. Furthermore, our model had considerably high accuracy in detecting early-stage ICC (AUC = 0.913).
Recently, Wang et al. [9] reported a strong association between the gut microbiome and cervical cancer by comparing the gut microbial composition between five healthy controls and eight patients with cervical cancer. The authors demonstrated that in patients with cervical cancer, α-diversity had an increasing trend, and a clear separation was found between the groups for β-diversity. Moreover, they identified several differentially abundant gut microbial taxa at various taxonomic levels. For example, Proteobacteria was significantly higher in abundance in the cancer group, and Escherichia-Shigella, Roseburia, Pseudomonas, Lachnoclostridium, Lachnospiraceae_UCG-004, Dorea, Unidentified Lachnospiraceae, Fusicatenibacter, Lachnospiraceae_UCG-010, Yersinia, and Succinivibrio had a significantly higher abundance at the genus level. However, our analysis of fecal microbiota in healthy women vs. patients with ICC is in contrast with that of Wang et al. We observed significantly lower observed OTUs in patients with cervical cancer. Shannon's index was also lower, although the difference was not statistically significant. Furthermore, our analysis of compositional differences in the fecal microbial community according to health status showed dissimilarities in the abundance of specific genera in patients with cervical cancer. Differences in the demographic characteristics of each cohort and in the bioinformatic methods used to identify bacterial composition may cause such dissimilarities [21][22][23]. In the previous study, the 16S rRNA gene was amplified with primers targeting the V4 region [9]. However, primer pairs targeting the V4-V5 region were used in our study, which might contribute to the differential abundance of bacterial genera.
Emerging studies have suggested potential links between the increased abundance of Prevotella and several disorders, such as bacterial vaginosis, metabolic disorders, low-grade systemic inflammation, and periodontitis [24][25][26][27][28]. Larsen et al. [29] demonstrated the association between the abundance of Prevotella and mucosal inflammation mediated by T helper type 17 (Th17). Other studies have also supported the role of Prevotella. Gosmann et al. [30] revealed that in the vaginal tract, Prevotella contributes to the activation of a Th17 immune response. Li et al. showed a causal role of a Prevotella-enriched gut enterotype, suggesting its contribution to a specific disease [31]. A study in mice conducted by Elinav et al. [32] revealed a potential role of Prevotella in the fecal microbial environment by promoting dextran sulfate sodium (DSS)-induced colitis. In the patient group of our study, Ruminococcus and Clostridium, known as butyrate-producing bacteria, were decreased. As a major nutrient of the intestinal tract, butyrate plays important roles in controlling inflammation and preventing leaky gut, and it regulates intestinal autophagy and energy metabolism in the human colon [33][34][35]. Thus, a reduction of butyrate-producing bacteria may affect general intestinal health, thereby affecting vaginal health. Taken together, Prevotella, Ruminococcus, and Clostridium may be linked to early ICC risk.
ML techniques have been used to assess the association between the microbiome and numerous disease states [36][37][38]. Prominent studies have established the concept of fecal microbiota as non-invasive diagnostic tools for certain diseases or cancers, including hepatocellular carcinoma (CRC), nonalcoholic fatty liver disease (NAFLD), and type 2 diabetes (T2D) [39][40][41][42]. In this study, we demonstrated characteristic differences in the fecal microbiota of our cohort, identified seven specific biomarkers, constructed a prediction model, and validated its diagnostic accuracy. Thus, fecal microbiota-derived microbial markers may become potential tools for the diagnosis of early ICC. Further studies to validate fecal microbiota-derived biomarkers in larger cohorts from various ethnic populations and countries are needed to improve the accuracy and stability of early ICC diagnosis.
We acknowledge the limitations associated with our study, which include the following: (1) Our findings provide only preliminary evidence of an association between fecal microbiota and early ICC compared to cancer-free controls. Therefore, we could not address whether changes in the fecal microbial community are a cause or a consequence of tumor development, and additional studies are essential to assess how bacterial genera play a role in cervical health, affecting ICC. (2) Our study was based on 16S rRNA gene sequencing, which limits the identification of bacterial species, and evaluated the role of the fecal microbiota alone. (3) Although our data support the hypothesis that Prevotella is associated with ICC, they do not support causality. (4) Age is a confounding variable for the incidence of ICC. However, changes in the vaginal microbiome with respect to age were not evaluated. Further studies need to validate the results of this study according to age. (5) It is well known that HPV infection affects the incidence and progression of ICC. However, this study did not evaluate the potential association between HPV infection and alterations in fecal microbiota. Furthermore, since gut microbiota might be affected by the diet and lifestyle of the individual, further studies are needed to assess how those environmental factors are correlated with the fecal microbiota of ICC patients. Nonetheless, key strengths of this study include a prospective prediction model, which successfully predicted cervical cancer with a high AUC score. Moreover, we identified seven bacterial genera that are differentially abundant between normal women and patients with cancer from a well-characterized cohort based on fecal microbial composition.

Study Population
We obtained ethical approval from the Institutional Review Board of Kyungpook National University Chilgok Hospital (KNUMC 2015-10-033, 16-11-2015), according to the Declaration of Helsinki. In the CAN group, patients were clinically staged according to the 2009 International Federation of Gynecology and Obstetrics (FIGO) staging system [43]. Patients with a history of preoperative chemotherapy, radiotherapy, or administration of antibiotics were excluded. Fecal samples were collected using Transwab tubes (Sigma, Dorset, UK) from healthy women and patients with early ICC. All collected fecal swabs were sent to the laboratory and stored at −80 °C until further processing.

DNA Extraction and 16S rRNA Gene Sequencing
Total fecal DNA was extracted from each sample using the QIAamp PowerFecal Pro DNA Kit (Qiagen, Hilden, Germany), following the protocol provided by the manufacturer. Bacterial DNA concentration was measured using a Qubit® 2.0 Fluorometer (Life Technologies, Carlsbad, CA, USA), and the quality of extracted DNA was assessed by electrophoresis. For amplicon sequencing, DNA isolated from each sample was amplified with universal primer pairs targeting the V4-V5 regions of bacterial 16S rRNA genes, 515F (5′-barcode-GTGCCAGCMGCCGCGGTAA-3′) and 907R (5′-barcode-CCGYCAATTCMTTTRAGTTT-3′). PCR was performed under the conditions previously described [44].

Bioinformatic Analysis
For amplicon sequencing analysis, the generated raw single-end reads were acquired from the Ion Torrent Software Suite in FASTQ format and processed using the Quantitative Insights Into Microbial Ecology 2 (QIIME2) v. 2020.8 software [45]. Sequences were then processed through quality filtering, trimming, dereplicating, and denoising using DADA2 to obtain amplicon sequence variants (ASVs) [46]. Briefly, quality filtering was based on a mean frequency of 18,768, and reads were trimmed where the Q score (sequencing quality) fell below 30 and then denoised. ASVs with an abundance below 0.1% of the mean sample depth were filtered out prior to further analysis. Non-bacterial, mitochondrial, and chloroplast sequences were removed. Representative sequences were then assigned a taxonomy using a custom-trained naïve Bayes ML classifier, trained against the Greengenes 13_8 database at a 99% cutoff value. Sequences were rarefied at a minimum sequencing depth of 5953 reads. All samples in the OTU table were subsampled to equal depths prior to further analysis.
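The rarefaction step (subsampling every sample to the same depth, here 5953 reads) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the QIIME2 implementation: the function name, the dictionary input format, and the fixed random seed are hypothetical, and reads are drawn without replacement.

```python
import random
from collections import Counter

def rarefy(counts, depth, seed=0):
    """Subsample a {taxon: read_count} dict to `depth` reads without replacement.

    Returns None when the sample has fewer than `depth` reads, mirroring the
    usual practice of dropping samples that are too shallow.
    """
    total = sum(counts.values())
    if total < depth:
        return None
    # Expand the table into one entry per read, then draw `depth` of them.
    pool = [taxon for taxon, n in counts.items() for _ in range(n)]
    rng = random.Random(seed)
    return dict(Counter(rng.sample(pool, depth)))
```

Choosing the depth of the shallowest sample, as was done here, retains every sample at the cost of discarding reads from deeper ones.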

Statistical Analysis
General statistical analyses and visualization of our sequencing results were carried out using RStudio 1.0.153 (https://www.rstudio.com/) and the Calypso web application [47]. Alpha diversity indexes within the samples were evaluated using an ANOVA test to measure statistical significance between the two groups (HC vs. CAN). PCoA was performed to analyze and visualize patterns of β-diversity based on the Bray-Curtis dissimilarity. Two-dimensional PCoA analysis was conducted using R with the vegan, reshape, and ggplot2 packages [48][49][50]. Because each axis in PCoA has a unique value that represents the degree of variation along that axis, we also tested the statistical difference between groups on each axis. Phylogenies were manipulated using GraPhlAn [51]. The difference between the groups at the various taxonomic levels was calculated using the Wilcoxon rank-sum test. To further control error, the false discovery rate (FDR) was applied, and genera with an FDR of <0.05 were considered differentially abundant and visualized using a pie chart. In addition, the symbiotic network relationship between each group's microbiota was computed and displayed using Cytoscape (v3.7.2) according to Spearman's correlation coefficient [52]. To discover biomarkers or genomic features that characterize the differences between biological conditions, LDA effect size (LEfSe) analysis was conducted using the Huttenhower Galaxy web application (http://huttenhower.sph.harvard.edu/galaxy/). For this analysis, the Kruskal-Wallis test was used to detect features with significant differential abundance among classes. Abundance differences of bacterial genera between the HC and CAN clusters were identified using the LEfSe approach. Then, LDA was performed to evaluate the effect size of each feature (p < 0.05 and LDA score >3.5).
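The FDR step can be illustrated with a short Python sketch of the Benjamini-Hochberg adjustment. The text does not name the FDR procedure used, so Benjamini-Hochberg is an assumption here, as is the function name.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

A genus would then be called differentially abundant when its adjusted value falls below 0.05, the threshold stated above.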
The correlations between the phyla within each group were estimated using Spearman's correlation coefficient and plotted using the Corrplot and PerformanceAnalytics packages [53,54]. To test the association between the composition of fecal microbiota and cervical cancer, the relative abundance was log-transformed and normalized to z-scores for model construction. In a five-fold cross-validation setup with 10 iterations, an L1-regularized (LASSO) logistic regression model [55] was fitted to the training set and then evaluated on the test set within every cross-validation fold. Every machine learning step, including data preprocessing, model construction, feature selection, and evaluation of the final model, was performed using the SIAMCAT package [56]. To validate the bacterial signature for distinguishing between HC and CAN, we constructed an additional ML model based on RF to build a classifier from the same sample set with the randomForest package [57]. Briefly, the sample set was randomly split into two sets with the same proportion of each group (29 samples for the training set and 17 for the test set), as described by Chakravarthy et al. [58]. The model was constructed using genera abundances with default parameters. Feature selection was performed by an iterative feature elimination step to optimize this model, and the final model was built based on the selected features using the caret package [59]. The AUC of the ROC curve was calculated to measure the accuracy of the classifier. A Venn diagram was drawn to present overlapping potential biomarkers between the ML models and LEfSe analysis using the VennDiagram package [60].
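The preprocessing and LASSO steps can be sketched in Python as follows. This is a minimal illustration, not the SIAMCAT implementation used in this study: the function names, the log base, the pseudocount, the penalty strength `lam`, and the proximal gradient solver are all assumptions chosen to show how an L1 penalty drives uninformative coefficients to exactly zero.

```python
import math

def log_zscore(column, pseudocount=1e-6):
    """Log10-transform relative abundances, then standardize to z-scores."""
    logged = [math.log10(v + pseudocount) for v in column]
    mean = sum(logged) / len(logged)
    sd = math.sqrt(sum((v - mean) ** 2 for v in logged) / (len(logged) - 1))
    return [(v - mean) / sd for v in logged]

def soft_threshold(value, amount):
    """Proximal operator of the L1 penalty: shrink toward zero by `amount`."""
    if value > amount:
        return value - amount
    if value < -amount:
        return value + amount
    return 0.0

def fit_l1_logistic(rows, labels, lam=0.1, step=0.1, n_iter=500):
    """Fit L1-penalized logistic regression by proximal gradient descent.

    rows: list of feature vectors (one per sample); labels: 1 or 0.
    Returns (weights, bias); the bias term is not penalized.
    """
    n, d = len(rows), len(rows[0])
    weights, bias = [0.0] * d, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(rows, labels):
            z = bias + sum(w * xi for w, xi in zip(weights, x))
            err = 1.0 / (1.0 + math.exp(-z)) - y  # predicted probability minus label
            grad_b += err / n
            for j in range(d):
                grad_w[j] += err * x[j] / n
        bias -= step * grad_b
        # Gradient step on the loss, then shrinkage from the L1 penalty.
        weights = [soft_threshold(w - step * g, step * lam)
                   for w, g in zip(weights, grad_w)]
    return weights, bias
```

Features whose weights remain non-zero after fitting are the ones the model has selected. In a cross-validation setup such as the one described above, the fit would be repeated on each training fold and scored on the held-out fold.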

Conclusions
In conclusion, we suggest an association between fecal microbiota and ICC. Furthermore, ML models based on the fecal microbiota may support women in the prevention of cervical cancer through early diagnosis. Although challenges remain in translating knowledge of the fecal microbiota into the clinic for early ICC diagnosis, our findings provide an initial step toward exploring fecal microbiota for clinicians' decision-making and monitoring of early ICC.
Supplementary Materials: The following are available online at http://www.mdpi.com/2072-6694/12/12/3800/s1, Figure S1: Box-plots showing difference of bacterial abundance at the family level. All samples in each group were visualized by colored dots. The lines in the boxes represented median of values. p values were computed to compare their statistical significance by Wilcoxon test and adjusted using false discovery rate (FDR) procedure. Figure S2: Correlation matrix of fecal microbiota composition by phylum within each group. The distributions of each phyla are displayed on the diagonal in the form of histogram based on kernel density estimation. The bottom of the diagonal represents scatter plots with fitted line. Positive and negative correlations are shown in positive and negative numbers, respectively. (A) HC, (B) CAN. * p < 0.05; ** p < 0.1; *** p < 0.005. Figure S3: Venn diagram depicting the overlap of potential biomarkers selected by each of feature selection methods. Five genera were overlapped with features selected using the LASSO regression-based prediction model. Figure S4: ROC curves presenting predictive accuracy based on random forest analysis. (A) ROC curve of training set (B) ROC curve of test set. Each set was evaluated 300 times and mean AUC are presented. Grey lines and box plots present each of 300 runs. The black line represents median of those values.