Genome-Wide Comparative Analysis of Lactiplantibacillus pentosus Isolates Autochthonous to Cucumber Fermentation Reveals Subclades of Divergent Ancestry

Lactiplantibacillus pentosus, commonly isolated from commercial cucumber fermentation, is a promising candidate for starter culture formulation due to its ability to achieve complete sugar utilization to an end pH of 3.3. In this study, we conducted a comparative genomic analysis encompassing 24 L. pentosus and 3 Lactiplantibacillus plantarum isolates autochthonous to commercial cucumber fermentation and 47 Lactobacillales reference genomes to determine species specificity and provide insights into niche adaptation. Results showed that metrics such as average nucleotide identity (ANI) score, emulated Rep-PCR-(GTG)5, computed multi-locus sequence typing (MLST), and multiple open reading frame (ORF)-based phylogenetic trees can robustly and consistently distinguish the two closely related species. Phylogenetic trees based on the alignment of 587 common ORFs separated the L. pentosus autochthonous cucumber isolates from olive fermentation isolates into clade A and B, respectively. The L. pentosus autochthonous clade partitions into subclades A.I, A.II, and A.III, suggesting substantial intraspecies diversity in the cucumber fermentation habitat. The hypervariable sequences within CRISPR arrays revealed recent evolutionary history, which aligns with the L. pentosus subclades identified in the phylogenetic trees constructed. While L. plantarum autochthonous to cucumber fermentation encodes only Type II-A CRISPR arrays, autochthonous L. pentosus clade B codes for Type I-E and L. pentosus clade A hosts both types of arrays. L. pentosus 7.8.2, for which phylogeny could not be defined using the varied methods employed, was found to uniquely encode four distinct Type I-E CRISPR arrays and a Type II-A array. Prophage sequences in varied isolates evidence the presence of adaptive immunity in the candidate starter cultures isolated from vegetable fermentation as observed in dairy counterparts.
This study provides insight into the genomic features of industrial Lactiplantibacillus species, the level of species differentiation in a vegetable fermentation habitat, and diversity profile of relevance in the selection of functional starter cultures.


Introduction
Diverse lactobacilli and Leuconostoc species are frequently isolated from vegetable fermentations, including cabbage, cucumber, green tomato, green bean, okra, olive, and onion, making them relevant for the formulation of starter cultures [1–11]. Starter cultures encompassing Lactiplantibacillus plantarum and Lactiplantibacillus pentosus are often used in the fermentation of blanched garlic, carrot, cucumber, olives, kimchi, and kale, as well as cereal grains [5,12–18]. Fermentative bacteria such as lactobacilli are implicated in the conversion of sugars to lactic acid and the production of ascorbic acid, glutathione, flavonoid aglycones, and distinguishable volatile compounds, as well as in enhancing total
L. plantarum and L. pentosus isolates and genome sequences used in this study: Twenty-seven pure cultures were included in this study, of which twenty-two were isolated from a commercial cucumber fermentation tank filled with recycled cover brine in 2009 as described by Pérez-Díaz et al. [32]. The twenty-two cultures include L. plantarum 3.2.8 and 7.8.4, and twenty L. pentosus cultures. L. pentosus 7.8.2 and 7.2.20 were isolated from an independent commercial cucumber fermentation tank filled with fresh cover brine in 2010 [32]. L. pentosus LA0445 is of cucumber fermentation origin, isolated in 1983, and MU045 is a mutant derivative of this strain [36,37]. L. plantarum T1R2b was isolated in 2020 from a low-salt cucumber fermentation that produced an irregular slimy cover brine [38].
The sequencing, assembly, and annotation, along with growth conditions, of the genomic sequences corresponding to the L. plantarum and L. pentosus cultures autochthonous to cucumber fermentation and included in this study are described by Page & Pérez-Díaz [39] and are available from the National Center for Biotechnology Information (NCBI) database under BioProject Accession Number PRJNA674638. GenBank accession numbers corresponding to the genome sequences derived from the autochthonous and allochthonous lactic acid bacteria included in this study are listed in Table 1.
Figure 1. Descriptive data of the Lactiplantibacillus plantarum (■) and Lactiplantibacillus pentosus (□) genome sequence features and coding gene type. A significant difference (p-value < 0.05) was calculated for panels identified with an asterisk (*).
ANI score calculation for the L. plantarum and L. pentosus genome sequences: ANI scores between an L. pentosus and an L. plantarum genome sequence were calculated in the IMG system [40] (analysis run in July 2022). Estimated ANI scores for pairwise comparisons of L. pentosus or L. plantarum homolog genomes were calculated in KBase via the FastANI v0.13 automated pipeline [41] (analysis run in March 2022).
Construction of phylogenetic trees: Three distinct phylogenetic trees were constructed for definitive culture identification and comparison of species boundaries. The first phylogenetic tree was constructed based on a computer-emulated Rep-PCR using the (GTG)5 primer [42] for the genome sequences of several lactic acid bacteria described in Table 1. The metadata for the genome sequences described in Table 1 that are autochthonous to cucumber may be found in Page & Pérez-Díaz [39]. Amplicons from a computed Rep-PCR-(GTG)5 were generated with Geneious software (v. 2021.1.1, www.geneious.com, analysed in June 2021) for each genome sequence of interest. The band patterns produced for each genome sequence were compared using Bionumerics software v. 7.6.3 (www.applied-maths.com). Similarity matrices of the densitometric curves from each sample were calculated using Pearson correlation coefficients and clustered via the unweighted pair group method with arithmetic averages (UPGMA).
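The clustering step described above (Pearson similarity followed by UPGMA) can be illustrated with a minimal sketch. This is not the Bionumerics implementation: the function names, the use of 1 − r as the distance, and the toy densitometric curves are all assumptions made for illustration only.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length curves."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def upgma(curves):
    """Cluster curves by UPGMA using the distance 1 - Pearson r.

    Returns the merge history as (cluster_a, cluster_b, distance) tuples,
    where each cluster is a tuple of sample names.
    """
    names = list(curves)
    # Pairwise distances between the original samples (assumed: d = 1 - r).
    base = {
        frozenset((a, b)): 1.0 - pearson(curves[a], curves[b])
        for a, b in combinations(names, 2)
    }
    clusters = [(name,) for name in names]
    merges = []

    def average_distance(c1, c2):
        # UPGMA: unweighted mean over all cross-cluster sample pairs.
        total = sum(base[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Merge the two closest clusters, then repeat.
        c1, c2 = min(
            combinations(clusters, 2),
            key=lambda pair: average_distance(*pair),
        )
        merges.append((c1, c2, average_distance(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


# Hypothetical densitometric curves (band intensity along a lane).
toy_curves = {
    "isolate_1": [0, 5, 9, 5, 0, 1, 0, 4],
    "isolate_2": [0, 4, 9, 6, 0, 1, 0, 5],
    "isolate_3": [7, 0, 1, 0, 8, 0, 6, 0],
}
history = upgma(toy_curves)
```

In this toy input the two isolates with similar band patterns merge first, and the third joins at a larger distance, which is the behavior expected of a Pearson/UPGMA dendrogram.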
A second phylogenetic tree was produced using a computer-emulated MLST run in AutoMLST [43] (analysis run in December 2021) based on the 79 core genes described in Table S1, which were automatically selected by the software. This analysis was focused on the clustering of autochthonous and allochthonous L. plantarum and L. pentosus genomes, whose accession numbers are listed in Table 1. The genome sequence of Pediococcus damnosus DSM20331 was selected by AutoMLST as the tree root.
A third phylogenetic tree was generated in PATRIC (analysis run in December 2021) [44] via the Codon Tree method, which employs RAxML to generate phylogenetic distances between sequences of L. pentosus genomes exclusively [45]. Accession numbers for the L. pentosus genomes used, from cucumber fermentation autochthonous and allochthonous isolates, are listed in Table 1. No duplications or deletions were allowed for the analysis. The 587 ORFs listed in Table S1 were used for the construction of the phylogenetic tree.
Tree diagrams were generated for the computed MLST and multiple ORFs phylogenetic analysis outputs with Interactive Tree of Life (iTOL) v.6.5.7 [46] in June 2022. Default parameters were used for the construction of trees.
CRISPR-Cas system identification and characterization: The CRISPRCasTyper pipeline was used to scrutinize the genome sequences of interest [47]. CRISPR spacers were extracted and visualized using the CRISPRViz pipeline [48]. The extracted spacer sequences were passed to BLASTn searches in the NCBI nucleotide database to identify potential protospacer sequences. The positive BLAST hits had an e-value smaller than 1 × 10⁻³ and an identity score greater than 85%. The flanking regions of the protospacer matches were extracted and aligned to identify putative protospacer adjacent motifs (PAM) (Supplementary Figure S1). The BLAST-webserver search was conducted against the nucleotide collection [49] in December 2021.
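The hit-selection criteria stated above (e-value below 1 × 10⁻³ and identity above 85%) can be expressed as a short filter. The sketch below assumes, purely for illustration, that the hits were exported in the standard 12-column BLAST tabular layout; the study itself used the BLAST web server, and the spacer and subject names shown are hypothetical.

```python
import csv
import io

# Default column order of BLAST tabular output (assumed export format).
COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]

MAX_EVALUE = 1e-3    # hits must have an e-value smaller than this
MIN_IDENTITY = 85.0  # hits must have percent identity greater than this


def filter_hits(handle):
    """Return the tabular BLAST rows that satisfy both thresholds."""
    reader = csv.DictReader(handle, fieldnames=COLUMNS, delimiter="\t")
    kept = []
    for row in reader:
        if float(row["evalue"]) < MAX_EVALUE and float(row["pident"]) > MIN_IDENTITY:
            kept.append(row)
    return kept


# Hypothetical hits: only the first row passes both criteria.
example = io.StringIO(
    "spacer_01\tphage_X\t96.88\t32\t1\t0\t1\t32\t1500\t1531\t2e-08\t54.7\n"
    "spacer_02\tphage_Y\t84.38\t32\t5\t0\t1\t32\t220\t251\t4e-04\t38.2\n"
    "spacer_03\tphage_Z\t100.00\t20\t0\t0\t1\t20\t90\t109\t0.15\t30.1\n"
)
hits = filter_hits(example)
```

Applying both thresholds jointly matters for short queries such as spacers, where a perfect but very short match can still carry a weak e-value.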
Identification of putative prophage sequences: A survey of putative prophage sequences was conducted among the genome sequences derived from the cucumber-autochthonous L. plantarum and L. pentosus using the Phage Search Tool enhanced release (PHASTER, analysis run in February 2022) [50,51]. The multiple contigs option was selected for the prophage sequence analysis and default parameters were applied for the other options in the service. Accession numbers for the lactobacilli genome sequences used are listed in Table 1.

Results and Discussion
Features of the L. pentosus and L. plantarum genome sequences: The genome sizes of most LAB are between 1.8 and 3.4 megabases [52], which falls at the lower end of the bacterial genome spectrum spanning approximately 500 kilobases to 12 megabases [53]. LAB are thought to be highly specialized given their evolutionary path and historical association with select niches over the course of human association and use for fermentation processes in relatively nutrient-rich habitats [52]. Such bacteria are known for their ability to use horizontal gene transfer for adaptation [54]. Generally, bacteria exhibit a mutational bias that deletes superfluous sequences, which produces smaller genomes with a greater degree of specialization [55].
The L. pentosus genomes studied here ranged in size from 3.59 to 3.83 Mbp with 46% GC content, while the L. plantarum genome sequences were found to be significantly smaller at 3.38 to 3.48 Mbp with 44% GC content (Figure 1). These genome sizes and GC percentages are consistent with the ranges reported in the PATRIC database for each species [19]. The benefits derived by L. pentosus from the additional genes relative to L. plantarum and the metabolic versatility associated with such genes could confer a competitive advantage [56,57]. Figure 1 shows that the predicted signal peptides and proteins, specifically transmembrane proteins, are more abundant in the L. pentosus genomes than in the L. plantarum counterparts. It is documented that L. pentosus is more commonly isolated from cucumber and olive fermentations than L. plantarum [2,10]. Such observations suggest that the larger L. pentosus genome may be advantageous in sensing its habitat, fermentation in this case.
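For readers less familiar with the two summary metrics reported above, the following minimal sketch shows how genome size (in Mbp) and GC content can be derived from assembled contigs. The function names and the toy contigs are hypothetical, and ambiguous bases are excluded from the GC denominator as a simplifying assumption.

```python
def gc_content(sequence):
    """Percent G+C among unambiguous bases (A, C, G, T) of a sequence."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    if gc + at == 0:
        raise ValueError("sequence contains no unambiguous bases")
    return 100.0 * gc / (gc + at)


def genome_summary(contigs):
    """Total size in Mbp and overall GC percent for a list of contigs."""
    joined = "".join(contigs)
    return len(joined) / 1e6, gc_content(joined)


# Hypothetical contigs, far shorter than a real draft genome.
toy_contigs = ["ATGCGCATTA", "GGCCATATNN"]
size_mbp, gc_percent = genome_summary(toy_contigs)
```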
ANI is widely used to determine whether multiple genomes belong to the same species [34,58]. As expected, the ANI value calculated from the comparison of an L. pentosus and an L. plantarum genome sequence was approximately 80%, substantially less than the 95% cut-off score for species membership (Figure 2A). The ANI calculations described here were generated through pairwise comparisons between an autochthonous L. plantarum or L. pentosus genome sequence and a homolog genome sequence or a reference genome (Figure 2A). In contrast to the L. plantarum autochthonous to cucumber fermentation (Figure 2B), the L. pentosus counterparts yielded varied ANI values above the 95% threshold, suggesting a degree of intraspecies variation worth studying to discern adaptive diversity of value in the optimization of starter cultures for cucumber fermentations (Figure 2C).
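The species-membership rule applied to the ANI scores can be sketched as a simple threshold test. The 95% cut-off is taken from the text; treating the cut-off as inclusive, along with the genome labels and the scores below, are illustrative assumptions and not data from this study.

```python
SPECIES_CUTOFF = 95.0  # ANI percent cut-off for species membership


def classify_pairs(ani_scores, cutoff=SPECIES_CUTOFF):
    """Label each genome pair by comparing its ANI score to the cut-off."""
    return {
        pair: "same species" if ani >= cutoff else "different species"
        for pair, ani in ani_scores.items()
    }


# Hypothetical pairwise ANI scores (percent).
toy_scores = {
    ("pentosus_a", "pentosus_b"): 98.7,
    ("pentosus_a", "plantarum_a"): 80.1,
    ("plantarum_a", "plantarum_b"): 99.2,
}
calls = classify_pairs(toy_scores)
```

Scores that clear the cut-off but differ from one another, as described for the L. pentosus isolates, would all be labeled "same species" here; resolving that intraspecies variation requires the phylogenetic analyses rather than the threshold alone.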
The 27 isolates included in this study were previously identified through the alignment of the 16S rDNA partial sequence to those of reference strains and the recA amplicon size as described by Torriani et al. [10,31]. The identification of 26 out of the 27 isolates in the study was confirmed via the ANI scores calculated as well as via the whole genome sequence alignments to reference genomes in the NCBI and the IMG/M databases, the exception being L. pentosus 7.2.4, which was initially classified as L. plantarum [10].

L. pentosus and L. plantarum intraspecies genetic diversity as defined through the construction of phylogenetic trees:
The simulation of Rep-PCR-(GTG)5 through emulated banding patterns from whole genome sequences discriminated between the L. pentosus and L. plantarum autochthonous to cucumber fermentation (Figure 3) and replicated the separation previously observed by Pérez-Díaz et al. [32] for L. pentosus LA0445 from six commercial fermentation isolates of the same species (1.2.13, 3.8.24, 1.2.11, 1.8.6, 1.8.9, and 3.2.37) using benchtop Rep-PCR-(GTG)5. The L. pentosus strain separation identified using computer-emulated and benchtop Rep-PCR-(GTG)5 phylogenetic clustering was also detected in AutoMLST analysis including 79 common ORFs (Figure 4A) and using PATRIC-based clustering aligning 587 common ORFs (Figure 4B). The four analyses suggest at least two distinct L. pentosus clades, A and B, which include identical clusters of isolates. L. pentosus clade A includes isolates collected in North Carolina from fermentation day one to day thirty, while L. pentosus clade B includes three isolates collected from fermentation day seven or later. Among these are LA0445, which was derived from a commercial fermentation conducted in 1983, as well as MU045, a derivative of the former deficient in malic acid decarboxylation [59]. Two L. pentosus isolated from cucumber fermentation conducted in North Carolina (2009) also belong to clade B, 7.2.4 and 7.2.11. Two L. pentosus isolated from a cucumber fermentation conducted in Minnesota (7.2.20 and 7.8.2), as well as multiple isolates collected on days seven and 14 of the Carolinian counterparts, did not sort into clade A or B (Figure 4A).
A subsequent phylogenetic construction including only L. pentosus strains employed 1000 core ORFs and expanded the L. pentosus clade A described in Figure 4 to include isolates that did not sort into clade A or B in the previous trees (Figure 5). Clade A can be further subdivided into three subclades, A.I, A.II, and A.III. Subclade A.I includes twelve isolates, ten of which were collected on day one or three of fermentation. Subclade A.II is a group of two isolates, one each from the cucumber fermentations conducted in North Carolina and Minnesota, both isolated on day seven. Subclade A.III includes five L. pentosus isolated on day seven or 14 of the cucumber fermentation conducted in North Carolina. L. pentosus 7.8.2 was isolated from the cucumber fermentation performed in Minnesota and did not sort with isolates from clade A or B but is consistently associated with L. pentosus reference genomes from multiple sources (Figure 5). This phylogenetic analysis also included multiple strains derived from olive fermentation, which, with the exception of a single isolate associated with subclade A.I, clustered in several distinct clades not associated with L. pentosus clades A or B.
The three autochthonous L. plantarum isolates were associated with counterparts from multiple habitats based on the three phylogenetic analyses employing emulated Rep-PCR-(GTG)5, computed MLST data (AutoMLST) involving 79 common ORFs, and 587 common ORFs via the PATRIC Codon Tree builder (Figure 4). L. plantarum 3.2.8 and 7.8.4 share a common ancestor with the L. plantarum strain ZJ316, which is a human fecal isolate with reported probiotic properties [60]. L. plantarum T1R2b clustered into a separate branch from 3.2.8 and 7.8.4, which has common ancestry with L. plantarum B21, a fermented sausage isolate able to produce a broad spectrum bacteriocin B21AG, which is also a circular plantacyclin [61]. The L. plantarum T1R2b draft genomic DNA sequence, however, lacks the putative genes encoding such bacteriocin [62]. It is relevant to note that two of the three L. plantarum autochthonous to cucumber fermentation are capable of prevailing in the native habitat and consistently produce slimy brines. Thus, exploitation of such L. plantarum isolates in cucumber fermentation is limited.
Occurrence and diversity of CRISPR-Cas immune systems: The hypervariable nature of CRISPR loci has been used in the past for the genotyping of various genera and species, including starter cultures and food pathogens, notably Streptococcus thermophilus for the former [63,64] and E. coli and Salmonella for the latter [65,66]. Here, we sought to exploit the hypervariable CRISPR arrays in the L. pentosus and L. plantarum genomes to discern intraspecies diversity and ancestry. Both type I-E and type II-A CRISPR-Cas systems were identified in the L. pentosus genomes, with canonical cas operon structures and occurrence of variable spacers in CRISPR arrays (Table 2). Distinct type II-A arrays were detected in L. plantarum isolates, as well as in L. pentosus 7.8.2 and the L. pentosus subclades A.I, A.II, and A.III. The fact that no type II-A spacers were shared between the subclades confirms the significance of the isolates' clustering observed in phylogenetic analyses (Figure 6). The identified CRISPR arrays in the L. plantarum isolates did not share similar spacer sequences despite the close association of these isolates in phylogenetic trees. No CRISPR-Cas system was detected in the L. plantarum isolate T1R2b.
L. pentosus isolates 7.8.46 and 7.2.20, which belong to subclade A.II, shared common type I-E CRISPR arrays with subclade A.III but not with subclade A.I, suggesting that A.II shares more recent ancestry with A.III than with subclade A.I (Figure 7). L. pentosus 7.8.46 and 7.2.20 also shared a type II-A array that was not detected in other L. pentosus isolates, further supporting the separation of these isolates from subclades A.I and A.III. Type I-E CRISPR arrays detected in the L. pentosus isolate 7.8.2 were distinct from other L. pentosus CRISPR sequences, which reflected the separation of strain 7.8.2 from other L. pentosus isolates described in phylogenetic analysis (Figure 8).
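The spacer-sharing comparison used above to relate isolates can be sketched as a pairwise set intersection over the spacer content of each array. This is an illustration only and not the CRISPRViz workflow: the isolate labels and spacer sequences are hypothetical, and only exact, same-orientation matches are counted.

```python
from itertools import combinations


def shared_spacers(arrays):
    """Return, for each pair of isolates, the spacers present in both arrays.

    Pairs with no spacer in common are omitted from the result.
    """
    shared = {}
    for a, b in combinations(sorted(arrays), 2):
        common = set(arrays[a]) & set(arrays[b])
        if common:
            shared[(a, b)] = common
    return shared


# Hypothetical spacer content of three CRISPR arrays.
toy_arrays = {
    "isolate_A1": ["ACGTTGCA", "GGATCCAA", "TTGACCGT"],
    "isolate_A2": ["GGATCCAA", "CCATGGTT"],
    "isolate_B1": ["AAGCTTGG", "TCGATCGA"],
}
result = shared_spacers(toy_arrays)
```

Isolates that share spacers are inferred to share recent ancestry, whereas the absence of any shared spacer, as for the third toy isolate, is consistent with a separate lineage.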
Four type I-E spacers were found to match known phages previously isolated from the same fermentation tanks (Figure 9) [10,26]. Unexpectedly, three separate spacers in L. pentosus 7.2.11 matched two different phages, ϕJL-1 and ϕSha, while a fourth spacer in L. pentosus 7.2.4 targeted ϕSha (Figure 9). All protospacers were flanked by the putative protospacer adjacent motif (PAM) sequence 5′-AAA-3′ on the 5′ end, reflecting CRISPR-mediated immunity in L. pentosus against these phages during industrial fermentation. This is consistent with the role of CRISPR-Cas systems as providers of adaptive immunity as originally demonstrated in dairy starter cultures [67].
Figure 7. The Type I-E CRISPR spacers were extracted and aligned from the L. pentosus clade A genomes, subclade A.I, subclade A.II, and subclade A.III. All L. pentosus strains were isolated from commercial cucumber fermentations. Differently colored boxes represent individual spacer sequences in each CRISPR array. Identically colored boxes in two or more isolates indicate a shared spacer sequence between those isolates. Asterisks indicate CRISPR spacer arrays detected in separate subclades.
Figure 8. Type I-E CRISPR spacers extracted and aligned from L. pentosus clade B genomes and the L. pentosus 7.8.2 genome. All L. pentosus strains were isolated from commercial cucumber fermentations. Differently colored boxes represent individual spacer sequences in each CRISPR array. Identically colored boxes in two or more isolates indicate a shared spacer sequence between those isolates.
Putative prophage profiles detected in the L. pentosus and L. plantarum genome sequences: It is estimated that 10% of the bacterial community in commercial cucumber fermentations is susceptible to bacteriophage infection, of which a fifth is attributed to L. pentosus and L. plantarum [68]. Phages of the Myoviridae and Siphoviridae are commonly isolated from vegetable fermentations [68,69]. Five intact prophage sequences of the Siphoviridae sp. were detected in 7 of the 24 L. pentosus isolates (Table 3). Siphoviridae ctu0P1 was found in the genomic DNA sequences of six L. pentosus isolated on days one and three (A)