Pangenome Analysis Reveals a High Degree of Genetic Diversity in Gardnerella vaginalis : An In Silico Approach

: The genus Gardnerella comprises Gram-variable, anaerobic, hemolytic, and non-motile bacilli, with four known species, where Gardnerella vaginalis is the main species responsible for bacterial vaginosis (BV). However, quantifying this species is challenging due to a lack of data and underreporting. Despite its signiﬁcance, particularly for women, and the availability of several genomes in online databases, genomic analyses and studies on effective treatments still lack details. This study aimed to conduct bioinformatic analyses focused on pangenomics to investigate the complete gene repertoire of the species. Genomes of the bacterium available in online databases were used for comparative genomics, genomic plasticity, gene synteny, and pangenome prediction analyses. The results revealed considerable genome variability, indicating a highly diverse pangenome. The low number of genes in the core genome and similarity analysis conﬁrmed this variability. Three pathogenicity islands, two resistance islands, and nine genomic islands were identiﬁed, suggesting horizontal gene transfer events during evolution. These ﬁndings underscore the need for sequencing new G. vaginalis genomes to better comprehend its variability and adaptation patterns.


Introduction
The genus Gardnerella was proposed in the 1980s by Greenwood and Pickett after conducting Adansonian analyses, DNA-DNA hybridization methods, electron microscopy, and biochemical analyses against microorganisms of other genera such as Haemophilus, Pasteurella, and Streptococcus [1].They believed there was no relationship between Haemophilus vaginalis, previously proposed as the Gram-negative microorganism isolated in both men and women [1][2][3], and these bacteria.
The species Gardnerella vaginalis has a thin layer of peptidoglycan and is considered Gram-positive.However, this layer can bleach, making it appear Gram-negative and, thus, it is classified as Gram-variable.They are non-motile bacilli and may present fimbriae or not [4][5][6].This bacterium has a well-understood but often underestimated pathogenic potential.It can form biofilms, hindering effective treatment and increasing its chance of survival in the host [7].Additionally, it can produce enzymes with prolidase and sialidase activities, which function as toxins, enabling greater adhesion and destruction of human tissue, and are also related to cases of premature births and abortions [8][9][10].Another important virulence factor is vaginolysin, which causes disturbances in the host immune system and lysis of erythrocytes [11][12][13].
In this new genus, the G. vaginalis species accounts for more than 50% of bacterial vaginosis (BV) cases in women and is the most recently studied [14].Previous studies Venereology 2023, 2 133 allowed the description of three more new species, later characterized by Vaneechoutte et al. in 2019 [6], named Gardnerella leopoldii sp.nov., Gardnerella biotic sp.nov., and Gardnerella swidsinskii sp.nov.This classification was possible by analyzing the isolated G. vaginalis species using 16S RNA gene amplification [4,8,11,15,16] and neighboring cluster analysis by Ahmed et al. [17].Additionally, all these species are related to cases of BV, with some species having a more significant relation to symptoms of the clinical condition than others, according to some Nugent/Amsel criteria [18].
The vaginal microbiota is composed mainly of Döderlein Bacillus, which are microorganisms of the genus Lactobacillus responsible for maintaining the acidic pH of the vagina, protecting against pathogens through the production of hydrogen peroxide (H 2 O 2 ).An imbalance in this vaginal flora causes bacterial vaginosis (BV)-type infections, where there is a decrease in lactobacilli [18][19][20].In association, an increase in anaerobic bacteria such as G. vaginalis is observed, resulting in a higher vaginal pH and making the patient susceptible to BV.
The main characteristics of BV, besides the increased pH, are the discomfort reported due to a white, gray, or yellowish vaginal discharge and a foul-smelling odor that may increase with coitus or menstruation.If left untreated, it can cause more severe problems such as salpingitis and endometritis in women, and in men, it can cause itching in the penile area and discomfort when urinating [21,22].
BV is a widespread infection reported among sexually active women; it is estimated that between 10% and 30% of women are diagnosed worldwide [23,24].Additionally, countries with a high frequency of HIV report that the frequency of BV is 50% in women.Studies have shown that the primary pathogens related to vaginosis are Candida sp.(56.3%), followed by G. vaginalis (35%) [25].Several risk factors are related to the development of this pathology, including having many sexual partners, using intrauterine devices (IUDs) and vaginal douches, and engaging in oral sex [26][27][28][29].Moreover, a strong relationship exists between the very early initiation of sexual activity and a low educational level [27,30].
Through oncotic cytology, more specifically the Papanicolaou technique, the inflammatory and infectious processes can be diagnosed along with their etiological agents.This exam is usual in women's health routine every year; it is easy to perform, is a low-cost approach, and is an advantage that facilitates the diagnosis of BV and enables faster treatment [31].However, a confirmatory diagnosis of BV requires the presence of at least three of four signs: (I) positive amine test; (II) presence of clue cells; (III) grayish or yellowish discharge; and (IV) vaginal pH higher than 4.7 [32][33][34].
BV treatment focuses primarily on restoring the balance of the vaginal flora by decreasing the number of anaerobic microorganisms and increasing the number of lactobacilli.There is a range of therapeutic options, including using antibiotics such as nitroimidazole or even establishing physiological control of the vaginosis through hydrogen peroxide ingestion via vaginal douching [3, [34][35][36][37][38][39].However, despite all these treatments, the number of BV cases has constantly increased over the years.
The study of G. vaginalis is necessary and essential, given the number of different genomes deposited in databases, and it is of great medical importance, especially for females.There is a need for more genomic studies, such as using bioinformatics, to assist in future research on this species.Such works can help identify the distribution patterns of the species using recent whole-genome molecular epidemiology techniques and identify potential virulence factors that may be related to pathological factors.In this work, several methodologies are used to identify mechanisms and the diversity of the organisms to help in future research with the species G. vaginalis.

Genome Information
RefSeq and NCBI (National Center for Biotechnology Information) databases (https: //www.ncbi.nlm.nih.gov/-accessed on 9 August 2022) were used to download the genomes in ".fna" format to perform the genomic analyses of G. vaginalis.Sixteen genomes were used for phylogenomic and phylogenetic analyses, along with seven complete genomes of G. vaginalis and nine representative genomes from different clades, according to the tree available at the NCBI (Table 1).To perform the pangenome analysis, 107 genomes of G. vaginalis available in the NCBI (till the first half of 2022) in the ".faa" format were used, which were classified as follows: 7 complete genomes; 1 chromosome; 63 scaffolds; and 36 contigs.The annotation of each of these genomes was performed with the Prokka software (version 1.14.6,Tseemann, 2020) [40] using the default parameters plus the options: 'addgenes' and 'rnammer'.More information about the 107 genomes is described in Supplementary Table S1.
In addition to the genomes of the bacteria under study, the reference genome of the organism Alloscardovia omnicolens (DSM21503) was downloaded in the same formats as mentioned above.It was added to all the analyses to function as an outgroup, which helps the analyses to be more parsimonious when analyzing the group of interest [41].

Phylogenomic Analysis and Phylogenetic Reconstruction
The software Gegenees (version 3.1, Segerman, 2019) [42] was used to perform the phylogenomic analyses; its methodology is based on the similarity analysis between the genomes through a strategy followed by genome sequence fragmentation and DNA alignment using a BLASTn algorithm to obtain a distance matrix and the phylogeny analysis.
The genome fragmentation occurs using user-defined lengths and step sizes.Here, we used a size of 500 bp and a step size of 500 bp too.Finally, a heatmap was generated, showing the percentage of similarity between the strains in a range of 0 to 100%.
In addition, an analysis to identify polymorphic genes and whether there is phylogenetic interference in a gene-by-gene analysis was performed using the online software PGAdb-builder [43] with a wgMLST (whole genome multilocus sequence typing) analysis.The resulting PGAdb profile was performed using 90% coverage and 90% identity filters.This profile was exported in the ".newick" format to the MEGA software (version 11, Kumar, 2021) [44] for generating a phylogenetic tree.

Genomic Plasticity via Identification of Genomic Islands
Plasticity analyses were performed with GIPSy (version 1.1.3,Soares, 2016) (Genomic Island Prediction Software) to identify genomic islands, genomic regions acquired via horizontal gene transfer (HGT), which contain information about the microorganism lifestyle.The prediction is based on several features, such as codon usage and G + C content; insertion sequences or flanking tRNAs flanking; presence of transposases; and varied size [45].The genomic islands were classified into pathogenicity, resistance, and metabolic islands with their respective genes.For this analysis, one non-pathogenic organism was chosen to be compared with the reference organism chosen in this work.
The genomic islands identified with GIPSy were plotted using the BRIG software (version 0.95, Alikhan, 2011) (BLAST Ring Image Generator) [46], which performs a comparative analysis of all genomes against a reference genome.A circular figure was generated where each ring represents a different genome, and the blank areas represent deletion areas.The last ring exhibits the genomic islands.

Gene Synteny
The software Mauve (version snapshot_2015_02_13 build 0, Darling Lab, 2015) [47], along with the progressive Mauve algorithm, was used for gene synteny analysis, where it was possible to identify possible gene rearrangement events.By fragmenting the genomes into pre-defined sizes, the software built multiple genome alignments identifying locally collinear blocks (LCB) plotted in a figure, which is usable for identifying these rearrangements.

Prediction of Orthologous Genes and Pangenome Development
Using the Orthofinder software (version 2.5.4,Emms, 2021) [48], the prediction of orthologous genes was performed, where all genomes were compared against all using the DIAMOND algorithm (version 2.0.14, Buchfink, 2021).This software uses the MCL program (Markov Clustering algorithm) for such prediction [49].Through in-house script (orthofinder_pangenome_splitter.pl), three ortho groups were generated, which are core genome, which is composed of all genes that are commonly shared by all strains, involved mainly in essential cellular processes; shared, which is composed of genes that are present in two or more strains but not all; and singletons that are classified as strain-specific genes, which are present in only one strain.The genes in the latter subset are related to adaptation processes, both to the environment and the host [50,51].
To analyze the pangenome, Heap's Law was used together with in-house scripts (pandev.pl) to estimate the fixed parameters and the least squares fitting of the exponential regression decay (core genome and singletons).For the fixed parameters from the pangenome, Heap's Law is used.It is an empirical law, and it is represented by the formula n = k* N ˆy, where a given number of genes (n) is calculated for a given number of genomes (N), and k and y (α = 1 − y) are free parameters.When α > 1, according to the law, one can consider this pangenome closed, meaning the addition of genomes has no substantial effect in the total number of genes.When α < 1, the pangenome is considered open, meaning there will be a significant increase in the number of genes for each new genome added to the analysis.The formula used for least-squares fit is represented by n = k × exp[−x/t] + tgθ, where n is the number of genes, and k, t, and tgθ are free parameters.This law is utilized to infer how many genes will compose the core genome after stabilized [52,53] and to estimate approximately how many genes are added by each new sequenced genome.

Orthology Assignments and Functional Annotation
The analyses of the core genome, shared genes, and singletons subsets were classified according to the Cluster of Orthologous Genes (COG), where the CDS of the subsets were aligned and classified according to the functional categories of the COG.For this, eggNOG-mapper (version 2.1.11,Cantalapiedra, 2023) was used, a database that performs this annotation according to orthologous genes [54].Each sequence was mapped using either the hidden Markov model (HMM) or DIAMOND to align with the eggNOG database.The best matching sequence of the target sequence is classified according to its taxonomy and finally categorized and annotated according to gene ontology (GO) [55], KEGG pathways [56], and COG functional categories [57].

Phylogenetic Analysis
The Gegenees software analyses demonstrated a color variation from green to red showing the degree of similarity between the strains (Figure 1), where almost all the complete genomes are close with similarity ranging from 71 to 100% inside this cluster (cluster 1).Strains such as UMB0833, JCP7659, and JCP8070, scaffold genomes, grouped in cluster 2, with similarity ranging from 72 to 100% inside the cluster, with a medium similarity of 38 to 46% with cluster 1 and a low similarity with cluster 3 (15 to 19%).In addition, G. vaginalis GV37, 409-05 (both complete genomes), and UMBB0913, scaffold, grouped in cluster 3, showing a low similarity of approximately 18% with the other two clusters.The others represented in an orange shade had a median similarity and comprised scaffold genomes.
Venereology 2023, 2, FOR PEER REVIEW 6 others represented in an orange shade had a median similarity and comprised scaffold genomes.
Figure 1.Heatmap with 7 complete genomes of G. vaginalis and 9 representative genomes of clades, according to the tree available at the NCBI.In green, a high similarity can be observed ranging from 65% to 100%; in orange, a median similarity ranging from 38% to 46%; and finally, a low similarity in red, ranging from 1% to 22%.
The data generated with the PGAdb-builder software were plotted in the MegaX software to visualize a phylogenetic tree (Figure 2), where the strains form the same clusters found in the heatmap of Gegenees.
Figure 1.Heatmap with 7 complete genomes of G. vaginalis and 9 representative genomes of clades, according to the tree available at the NCBI.In green, a high similarity can be observed ranging from 65% to 100%; in orange, a median similarity ranging from 38% to 46%; and finally, a low similarity in red, ranging from 1% to 22%.
The data generated with the PGAdb-builder software were plotted in the MegaX software to visualize a phylogenetic tree (Figure 2), where the strains form the same clusters found in the heatmap of Gegenees.
Figure 1.Heatmap with 7 complete genomes of G. vaginalis and 9 representative genomes of clades according to the tree available at the NCBI.In green, a high similarity can be observed ranging from 65% to 100%; in orange, a median similarity ranging from 38% to 46%; and finally, a low similarity in red, ranging from 1% to 22%.
The data generated with the PGAdb-builder software were plotted in the MegaX soft ware to visualize a phylogenetic tree (Figure 2), where the strains form the same clusters found in the heatmap of Gegenees.
Figure 2. The phylogenetic tree shows the bootstrap percentage with 1000 bootstraps on the tree branches, ranging from 28 as the lowest percentage to 100 as the highest one.The same clusters found on the heatmap may be visualized here.

Genomic Plasticity
For the prediction of genomic islands, the genome plasticity analysis with software GIPSy was performed, where we can observe through the BRIG's results the pathogenicity and resistance islands plotted in the outermost circle of the image, with comparative genomic plasticity results.For the comparative genomic plasticity analysis, it was used a reference genome according to the NCBI, which is the G. vaginalis strain FDAARGOS 568.In the circle image, each ring represents one strain, and the presence of blank regions in each ring demonstrates a deletion of this genome part concerning the reference genome in the most central ring.It can be seen that there are two large deletion regions in almost all the genomes and the other deletions appear only in preliminary genomes and are present in almost all the complete genomes.The last ring related to the genome of Alloscardovia omnicolens DSM21503 is a ring with many deletions compared to the species studied.
Figure 3 also shows the presence of three pathogenicity islands, two resistance islands, and two metabolic islands in the last three outermost rings.Resistance island 1 and pathogenicity island 2 overlap approximately at a position of 200 kpb, and resistance island 2 and pathogenicity island 3 overlap approximately at a position of 500~600 kpb.These regions can present genes related to virulence increase and resistance against antibiotics, and some draft genomes can present some deletions at the beginning of these islands.The other islands also present some deletions when compared with some genomes.
Figure 3. Genomic plasticity prediction with pathogenicity islands (PAI), resistance islands (RI), and metabolic islands (MI) of G. vaginalis.Each ring represents one strain of the working group, ranging from complete genomes in inner rings to draft genomes in the outermost rings.At the outermost, three rings represent the genomics islands-in red, the pathogenicity islands; in black, the resistance islands; and in purple, the metabolic islands.

Gene Synteny
The software Mauve was used to visualize the synteny between the various genomic sequences, where through the homology between them, it was possible to visualize possible rearrangements.Synteny is observed through gene blocks for better identification.Several broken regions may be observed, with many inversions and deletions of many regions, in the comparison between all genomes (Figure 4).A remarkable homogeneity was observed in cluster 1 (Supplementary Figure S4), with highly conserved blocks when compared to clusters 2 and 3 (Supplementary Figure S5 and S6), which showed inversions and deletions.Supplementary Figures S1-S3 show a comparative analysis between the clusters according to the similarity predicted in the previous results.When the comparative analysis was conducted, the genomes classified as scaffolds showed a greater, more significant deletion.Furthermore, Supplementary Figure S3 shows a more homogeneous relationship of these genomes forming clusters, while the other figures of clusters 1 and 2 (Supplementary Figures S1 and S2) show the presence of more broken regions.

Gene Synteny
The software Mauve was used to visualize the synteny between the various genomic sequences, where through the homology between them, it was possible to visualize possible rearrangements.Synteny is observed through gene blocks for better identification.Several broken regions may be observed, with many inversions and deletions of many regions, in the comparison between all genomes (Figure 4).A remarkable homogeneity was observed in cluster 1 (Supplementary Figure S4), with highly conserved blocks when compared to clusters 2 and 3 (Supplementary Figure S5 and S6), which showed inversions and deletions.

Pangenome Development
OrthoFinder separated these genomes into three different groups, core genes, shared genes, and singletons to predict orthologous genes, where 3563 non-redundant genes were found and classified as pangenome genes, with 208 genes being in the core genome group and 237 genes in the singletons group.
For the analyses with pangenome, the alpha value from Heap's Law when all genomes are analyzed was less than 1, being approximately 0.75 (α = 1 − 0.248), inferring an open genome.However, the core genome is close to closing, with 208 genes and stabilizing with 152 genes (Figure 5). Figure 6 shows the number of singletons in each of the genomes used, where the number changes, and some have more singletons than others.

Pangenome Development
OrthoFinder separated these genomes into three different groups, core genes, shared genes, and singletons to predict orthologous genes, where 3563 non-redundant genes were found and classified as pangenome genes, with 208 genes being in the core genome group and 237 genes in the singletons group.
For the analyses with pangenome, the alpha value from Heap's Law when all genomes are analyzed was less than 1, being approximately 0.75 (α = 1 − 0.248), inferring an open genome.However, the core genome is close to closing, with 208 genes and stabilizing with 152 genes (Figure 5). Figure 6 shows the number of singletons in each of the genomes used, where the number changes, and some have more singletons than others.

Orthologous Genes Characterization
The orthologous gene analyses are plotted in Figures 7 and 8, showing the COG classifications of the core genome, shared genome, and singletons, respectively.It is observed that the categories between these two analyses differ regarding the classification,

Orthologous Genes Characterization
The orthologous gene analyses are plotted in Figures 7 and 8, showing the COG classifications of the core genome, shared genome, and singletons, respectively.It is observed that the categories between these two analyses differ regarding the classification, Figure 6.A singletons diagram.On side (A), a diagram represents the "singletons" curve generated using the least-squares fits law, with the tangent theta equal to 0.147, i.e., for each newly sequence genome, approximately one gene is added to the pangenome.On side (B), a Venn diagram showing singletons present in each strain.The number of singletons ranged from 0 to 46 genes in the strains.

Orthologous Genes Characterization
The orthologous gene analyses are plotted in Figures 7 and 8, showing the COG classifications of the core genome, shared genome, and singletons, respectively.It is observed that the categories between these two analyses differ regarding the classification, prevailing in the core, the COG J, which is related to translation, ribosomal, and biogenesis functions, differently from what is seen from the shared and singleton types that prevail but are not yet categorized and have unknown functions.
prevailing in the core, the COG J, which is related to translation, ribosomal, and biogenesis functions, differently from what is seen from the shared and singleton types that prevail but are not yet categorized and have unknown functions.
The second category that is more prevalent in core classifications is COG F, more related to nucleus transport and metabolism, and for shared and singletons classifications, COG L is related to the replication, recombination, and repair of genomic content.Furthermore, there was no prevalence of a common category in these analyses.Each bar is related to one classification, COG J, with more genes classified, followed by COG F classification.Note: the column that displays "-" is related to the empty cells in the program's result, that is, genes that were unable to be classified by the program.Each bar is related to one classification, COG J, with more genes classified, followed by COG F classification.Note: the column that displays "-" is related to the empty cells in the program's result, that is, genes that were unable to be classified by the program.The second category that is more prevalent in core classifications is COG F, more related to nucleus transport and metabolism, and for shared and singletons classifications, COG L is related to the replication, recombination, and repair of genomic content.Furthermore, there was no prevalence of a common category in these analyses.

Discussion
G. vaginalis is a species of great medical importance, being one of the primary pathogens responsible for bacterial vaginosis, with high importance in women's health, and there are several reports and a certain problematization in the notification of these diseases as well as in the understanding of the consequences that it brings and how the different strains of this same species can behave differently in the host.Little is known about studies using comparative genomics to analyze the gene diversity of G. vaginalis.First, sixteen (16) genomes considered representative of the clusters grouped in the phylogenetic tree in the NCBI databases were used to analyze their distribution at the genomic level, their similarities, and differences throughout the evolution.
In all comparative analyses performed in this work, the strains present the same pattern.There is a formation of three clusters, with the first grouping all of the complete genomes; the second grouping the draft genome of the strains UMB0833, JCP7659, and JCP8070; and finally, the third cluster grouping the strains GV37, 409-05 (these being complete genomes), and UMB0913, considered as draft genome.Cluster 1 presents similarity above 80%, being always grouped in the same way and with a few deletions.Although close, it shows much translocation and a few inversion processes over time, which can be related to adaptation processes.Cluster 2 presents a median similarity, around 45%, with more deleted regions and much translocation and inversion process.Moreover, cluster 3 was more unexpected because it grouped different degrees of complementary types, being less similar with more events of evolution when compared with the other clusters, but inside this cluster, the genomes are so close and present almost the same patterns.
Firstly, in an attempt to explain how these clusters were formed, we researched more information about the epidemiology of these genomes, especially their extraction site.Unfortunately, only five samples had their extraction sites identified in the NCBI, as shown in Table 1 of the materials and methods.Once these sites had been identified, we realized that the clustering of these species was not related to their extraction site, since species as UMB0736 and UMB0298, both extracted in Maywood, USA were allocated in cluster 1; UMB0833, also extracted in Maywood, USA, was allocated in cluster 2; and, finally, GV37, which were extracted in Paris, France and UMB0913, which were extracted in Maywood, USA, were grouped in cluster 3.Then, although most of the species that we are able to identify were extracted from the same place, they do not show high clonality patterns, which corroborates the high plasticity found in the genomic plasticity analyses.
However, this work corroborates another work in diversity analysis inside the Gardnerella genus, where they found a polyphyletic organization, sorting this genus into nine distinct genotypes.In agreement with this work, our cluster 3, which grouped GV37, 409-05, and UMB0913 strains, can be considered as part of GGtype4, a group that presented genomes from strains that were considered less virulent organisms, and the other strains belonging to cluster 1 can be considered as part of GGtype9, a group which presents genomes from strains that were considered as more virulent organisms [59].In addition, this work also suggests that the UMB0833 strain, part of cluster 2, can be considered as GGtype7, which presents characteristics like cluster 1 and may be classified as more virulent organisms.A study by Bohr L. et al. [60] used a different methodology, from DNA extraction to the characterization of the genes present in the pangenome and also corroborates the findings in our study and in the aforementioned study, in which they separated the clusters in the same way.Furthermore, this study demonstrates the significance of lateral gene transfer in the diversity process under discussion.Our findings, as depicted in Figure 3, have successfully identified certain genes associated with pathogenicity and resistance through genomic islands originating from horizontal gene transfer.Our research paves the way for more in-depth investigations into the genes related to these mechanisms in an effort to ascertain whether HGT may indeed play a pivotal role in influencing diversity.Moreover, it guides future endeavors aimed at exploring these genes and their encoded proteins for potential utilization as diagnostic markers, as well as therapeutic and preventive measures against bacterial vaginosis.This genomic difference between the strains of G. vaginalis in different groups was also observed in several studies, such as in the amplified restriction analysis of ribosomal DNA (ARDRA) identifying two genotypes among 17 samples of G. vaginalis, where they considered the existence of a biotype 3 even though it was not identified due to the limiting number of isolates and, also, to the region where they were obtained [11].Other comparative analysis studies showed the existence of four different biotypes of G. vaginalis in the analysis of neighboring clusters between genes and perceived the formation of unexpected clades between them, which corroborates our study [17].This was later confirmed in PCR analyses of the cpn 6 gene sequence, confirming the previous result where four different groups were formed within the same species [6].
A pangenome analysis was performed in this work, where the genes were separated into three distinct groups: the core genes present in all strains; shared genes among some but not all strains; and singletons, with genes specific to each strain.The latter two groups are related to the process of evolution and adaptation of each strain, thus helping in better understanding the behavior of this species during its evolution.In the results, the species showed a wide-open pangenome with an alpha value lower than 1 (0.75), a core genome with approximately 208 genes, and the number of singletons added by each sequenced genome to the pangenome was shown to be very low after stabilization, with an estimate of less than one gene added.The core stabilization value is very close (~152 genes), and the wide-open pangenome and the tangent of theta value of singletons indicate the possibility of this pangenome closing, as genes are being added; however, it may take some time for this to happen.This variation is also observed in the Venn diagram of singleton genes (Figure 6B), where some genomes have more genes, being more variable than others.For example, strain AMD with 46 genes, 5-1 with 43 genes, and 409-05 with 29 genes.
Altogether, we can better understand the result found in the functional analysis of COG that represents a functional analysis of the behavior of this bacterium.When analyzing the core genome, proteins classified as COG J were more abundant than the others.COG J harbors proteins associated with translation, ribosomal structure, and biogenesis, which are functions commonly related to the functioning and survival of the bacterium, thus being essential for proliferation, differentiation, and development [61], as well as also helping the pathogen to adapt to changes in external physicochemical parameters; furthermore, another highly abundant functional category in the core genome was COG F, which is related to nucleotide transport and metabolism that are essential to all genomes once it is related to energy production in the organism.From another perspective, when we compare the COG categories between the singleton and shared genes with those of the core genes, in its totality, there are many genes in categories that are not very well described yet, which can be possibly accounted to genes acquired over the years in the process of adaptation of each strain, which would explain this diversity as well.In addition, for this analyzed set, the second category found was COG L, related to replication, recombination, and repair, which have essential roles in the adaptation process and diversity [62,63].

Conclusions
In conclusion, it can be observed that some genomes exhibit greater similarity among themselves when considering genomes within the same clade, as opposed to comparisons with genomes from different clades.When examining the evolutionary history and, in particular, the arrangement of their genes, some degree of this similarity becomes evident.However, a more pronounced divergence is now apparent when analyzing the gene order across the entirety of the genomes over the course of evolution.This comprehensive understanding sheds light on the genome's diversity.Moreover, regions of horizontal gene transfer were identified, which may serve as a pivotal factor contributing to this diversity.When compared with the pangenome, it becomes apparent that new genes can be added with the discovery of new genomes, further amplifying this aspect.Analysis of the functions associated with these genes was anticipated when examining the core genome, with genes fulfilling crucial metabolic roles for survival.Nonetheless, the other two subsets contain genes with unknown functions, which may potentially generate new genes and contribute to an expanded diversity.
These findings hold significant relevance for future studies focusing on the search for resistance and virulence genes as targets for diagnostics or therapeutics.However, they should be corroborated through in vitro and in vivo experimentation.

Figure 3 .
Figure3.Genomic plasticity prediction with pathogenicity islands (PAI), resistance islands (RI), and metabolic islands (MI) of G. vaginalis.Each ring represents one strain of the working group, ranging from complete genomes in inner rings to draft genomes in the outermost rings.At the outermost, three rings represent the genomics islands-in red, the pathogenicity islands; in black, the resistance islands; and in purple, the metabolic islands.

Figure 4 .
Figure 4. Gene synteny analysis.All genomes were broken and presented in continuous blocks.Each color represents one region that has passed through evolution for inversions, translocations, or deletions.

Figure 4 .
Figure 4. Gene synteny analysis.All genomes were broken and presented in continuous blocks.Each color represents one region that has passed through evolution for inversions, translocations, or deletions.

Figure 5 .
Figure 5.The pangenome development.In blue, the pangenome growth curve with Heap's law correction stabilizes approximately in just over 3000 genes, characterizing an open pangenome.The core genome stabilization curve is in orange, finalizing in approximately 208 genes.

Figure 6 .
Figure 6.A singletons diagram.On side (A), a diagram represents the "singletons" curve generated using the least-squares fits law, with the tangent theta equal to 0.147, i.e., for each newly sequence genome, approximately one gene is added to the pangenome.On side (B), a Venn diagram showing singletons present in each strain.The number of singletons ranged from 0 to 46 genes in the strains.

Figure 5 . 10 Figure 5 .
Figure 5.The pangenome development.In blue, the pangenome growth curve with Heap's law correction stabilizes approximately in just over 3000 genes, characterizing an open pangenome.The core genome stabilization curve is in orange, finalizing in approximately 208 genes.

Figure 6 .
Figure 6.A singletons diagram.On side (A), a diagram represents the "singletons" curve generated using the least-squares fits law, with the tangent theta equal to 0.147, i.e., for each newly sequence genome, approximately one gene is added to the pangenome.On side (B), a Venn diagram showing singletons present in each strain.The number of singletons ranged from 0 to 46 genes in the strains.

Figure 7 .
Figure 7. COG functional classification of core genome.Bar graph representing the COG classifications of the core genome.Each bar is related to one classification, COG J, with more genes classified, followed by COG F classification.Note: the column that displays "-" is related to the empty cells in the program's result, that is, genes that were unable to be classified by the program.

Figure 7 .
Figure 7. COG functional classification of core genome.Bar graph representing the COG classifications of the core genome.Each bar is related to one classification, COG J, with more genes classified, followed by COG F classification.Note: the column that displays "-" is related to the empty cells in the program's result, that is, genes that were unable to be classified by the program.

Venereology 2023, 2 , 12 Figure 8 .
Figure 8. COG functional classification of shared and singleton's genes.Bar graph representing the COG classifications of shared plus singletons group.Each bar is related to one classification, COG S, with more genes classified, followed by COG L classification.Note: the column that displays "-" is related to the empty cells in the program's result, that is, genes that were unable to be classified by the program.

Figure 8 .
Figure 8. COG functional classification of shared and singleton's genes.Bar graph representing the COG classifications of shared plus singletons group.Each bar is related to one classification, COG S, with more genes classified, followed by COG L classification.Note: the column that displays "-" is related to the empty cells in the program's result, that is, genes that were unable to be classified by the program.

Table 1 .
Information about the 16 genomes of G. vaginalis strains.