Comparative Genomics Analyses Reveal the Differences between B. longum subsp. infantis and B. longum subsp. longum in Carbohydrate Utilisation, CRISPR-Cas Systems and Bacteriocin Operons

Bifidobacterium longum is one of the most widely distributed and abundant Bifidobacterium in the human intestine, and has been proven to have a variety of physiological functions. In this study, 80 strains of B. longum isolated from human subjects were classified into subspecies by ANI and phylogenetic analyses, and the functional genes were compared. The results showed that there were significant differences in carbohydrate metabolism between the two subspecies, which determined their preference for human milk oligosaccharides or plant-derived carbohydrates. The predicted exopolysaccharide (EPS) gene clusters had large variability within species but without difference at the subspecies level. Four subtype CRISPR-Cas systems presented in B. longum, while the subtypes I-U and II-C only existed in B. longum subsp. longum. The bacteriocin operons in B. longum subsp. infantis were more widely distributed compared with B. longum subsp. longum. In conclusion, this study revealed the similarities and differences between B. longum subsp. infantis and B. longum subsp. longum, which could provide a theoretical basis for further exploring the probiotic characteristics of B. longum.


Introduction
Bifidobacterium longum consists of three subspecies, including B. longum subsp. suis, B. longum subsp. longum and B. longum subsp. infantis, which is one of the most abundant Bifidobacterium in the intestines [1]. B. longum subsp. longum is widely present in the human intestine of different ages, while B. longum subsp. infantis mainly exists in the intestine of breast-fed infants, and B. longum subsp. suis is mainly isolated from the gastrointestinal tract of piglets or cattle [2]. At present, B. longum subsp. infantis and B. longum subsp. longum have been widely used in dairy products, functional foods and probiotic products, and they have various physiological functions such as regulating immunity [3,4] and maintaining intestinal balance [5]. Although the genetic relationship between B. longum subsp. infantis and B. longum subsp. longum is extremely similar, there

Average Nucleotide Identity (ANI) Values
A python script [13] (https://github.com/widdowquinn/pyani) (accessed on 20 July 2021) was used to calculate the ANI values between each two genomes, and TBtools was used to cluster and visualise the resulting matrix [14]. Ten genomes of B. longum subsp. suis in the NCBI RefSeq database were supplemented for calculating ANI (Table S1).

Phylogenetic Analyses
The orthologous gene analysis was carried out using orthomclV2.0.9 software [15], the mafft-7.313 [16] was used to align the orthologous gene sequences of different strains and phylogeny software was used to analyse the evolutionary relationship via the neighbourjoining (NJ) method and construct a phylogenetic tree. The optimization of the phylogenetic tree was completed on the online website (http://www.evolgenius.info/evolview/) (accessed on 15 December 2020) [17].

Pan-Genome and Core-Genome Analysis
The pan-genome and the function of core genes were analysed via PGAP v1.2.1 [18], and functional classification of core genes was based on the COG database.

Whole Genome and Orthologous Gene Comparison
The sequence similarities of whole genomes among different strains in the same subspecies were visualised by BLAST Ring Image Generator (BRIG) [19]. OrthomclV2.0.9 software [15] was used to analyse the orthologous genes of B. longum subsp. infantis and B. longum subsp. longum and to compare the similarities and differences between the two subspecies.

Genotype and Phenotype Analysis of Carbohydrate Metabolism
The HMM method in HMMER-3.1 was used to annotate all the genomes. The CAZy database was used to predict the carbohydrate active enzyme genes [20], and BLAST was used to compare the carbohydrate metabolism gene clusters. The growth of 41 B. longum strains on the medium with six carbohydrates as the sole carbon source was determined, including L-arabinose, L-fucose, lacto-N-tetraose (LNT), 2 -fucosyllactose (2 FL), 3 -sialyllactose (3 SL) and galactooligosaccharide (GOS). Glucose-free mMRS medium with bromocresol purple as an indicator was prepared and the carbohydrate solution filtered through a 0.22-µm sterile membrane filter was added to the glucose-free mMRS medium at a final concentration of 1% (w/v). After 48 h anaerobic culturing at 37 • C, a colour change of the medium was observed, and the experiment was repeated independently three times.

CRISPR-Cas Systems Prediction
CRISPRCasFinder was used to predict the CRISPR systems, together with cas genes [21]. A phylogenetic tree based on Cas1 protein amino acid sequences and repeat nucleic acid sequences was constructed via MEGA X [22], and sequence alignment was performed using MUSCLE and UPGMA methods to construct a phylogenetic tree. The conserved repeat sequence secondary structure was visualised via RNAfold [23], and Weblogo was used to predict the conservation of RNA secondary structure [24]. The potential prophages in B. longum were predicted by PHASTER [25], and a local BLAST database was built based on the prophage sequences to analyse the match between CRISPR spacers and the prophages.

Bacteriocin Prediction
The online database BAGEL4 was used to predict the potential bacteriocins in B. longum [26]. Core peptide BLAST was performed to confirm the bacteriocins identified by BAGEL4.

Statistical Analysis
The number of GH genes, GT genes and CRISPR spacers in B. longum genomes were statistically analysed. GraphPad Prism 9.0 (GraphPad Software Inc., San Diego, CA, USA) was utilised for data analysis and plotting, and the significant difference of gene number was evaluated by t test.

ANI and Phylogenetic Analysis of B. longum
ANI is the average of all orthologous genes consistency in the two genomes, which can be used for species classification. As 16S rRNA gene comparison analysis cannot accurately distinguish the subspecies, all the B. longum strains in the study were classified at the subspecies level via ANI and phylogenetic analysis. The clustering results based on ANI values showed that 90 strains could be divided into three subspecies ( Figure 1, Table S2). The mean of ANI value between B. longum subsp. infantis and B. longum subsp. longum was 95.21%, while the mean of ANI between B. longum subsp. suis and B. longum subsp. infantis or B. longum subsp. longum were 95.80% and 96.29%, respectively. Moreover, the mean of ANI between B. longum M2CF0114 and B. longum subsp. longum or B. longum subsp. suis were 96.90% and 95.97%, respectively. Therefore, B. longum M2CF0114 should be classified as B. longum subsp. longum, although the clustering of ANI puts it in the same branch as B. longum subsp. suis.
Phylogenetic analysis was performed to further confirm the classification of the strains. OrthoMCL was used to analyse the orthologous genes of 90 B. longum strains; a total of 849 orthologous genes were obtained, and a phylogenetic tree of B. longum was constructed based on the orthologous genes ( Figure 2). The results showed that 90 strains were located on three branches, 39 strains were located on the same branch with B. longum subsp. longum NCC2705 (type strain) and 39 strains were located on the same branch with B. longum subsp. infantis ATCC15697 (type strain); B. longum Su859 and LMG 21814 among the 10 strains on the third branch have been proved to be B. longum subsp. suis [2,27]. Moreover, it was further confirmed that M2CF0114 belonged to B. longum subsp. longum. In addition, those analyses showed that some strains were misclassified in the previous study; for instance, 157F, ATCC55813 and 35624 should be B. longum subsp. longum instead of B. longum subsp. infantis, and BXY01, CMCCP0001 and JDM301 should be classified as B. longum subsp. suis rather than B. longum subsp. longum.  Phylogenetic analysis was performed to further confirm the classification of the strains. OrthoMCL was used to analyse the orthologous genes of 90 B. longum strains; a total of 849 orthologous genes were obtained, and a phylogenetic tree of B. longum was constructed based on the orthologous genes ( Figure 2). The results showed that 90 strains were located on three branches, 39 strains were located on the same branch with B. longum subsp. longum NCC2705 (type strain) and 39 strains were located on the same branch with B. longum subsp. infantis ATCC15697 (type strain); B. longum Su859 and LMG 21814 among the 10 strains on the third branch have been proved to be B. longum subsp. suis [2,27]. Moreover, it was further confirmed that M2CF0114 belonged to B. longum subsp. longum. In addition, those analyses showed that some strains were misclassified in the previous study; for instance, 157F, ATCC55813 and 35624 should be B. longum subsp. longum instead of B. longum subsp. infantis, and BXY01, CMCCP0001 and JDM301 should be classified as B. longum subsp. suis rather than B. longum subsp. longum. Pan-genome refers to all the genes of one species, consisting of three parts: core genes, non-essential genes and strain-specific genes. The pan-genome analysis of B. longum subsp. longum showed that as the number of added genomes increased, new genes continued to appear, and the number of pan-genome gradually increased; when the number of genomes reached 35, the number of pan-genomic genes increased slowly (Figure 3a), which indicated that the B. longum subsp. longum genome was nearly closed. The pan-genome of B. longum subsp. longum contained 6645 genes, and its core genome possessed 1043 genes. The COG annotation showed that those genes related to translation, ribosomal structure and biogenesis accounted for the highest proportion in the core genome (13.42%), followed by function unknown (10.83%), amino acid transport and metabolism related genes (9.97%), and the genes related to carbohydrate transport and metabolism accounted for 7.57% in the core genome ( Figure 3b). The pan-genome curve of B. longum subsp. infantis showed that it was approximately closed; as the number of added genomes increased, the rate of change in the number of core genomes slowed down (Figure 3a). The core genome size of B. longum subsp. infantis strains was 1139 genes, which accounted for approximately 16.06% of the B. longum subsp. infantis pan-genome. The functional composition of the COG of B. longum subsp. infantis was not significantly different from that in B. longum subsp. longum (Figure 3b), and only slightly higher than B. longum subsp. longum (12.20%) in genes related to amino acid transport and metabolism.

General Genome Features of B. longum subsp. infantis and B. longum subsp. longum
The general genome features of B. longum are shown in

Pan-and Core-Genome of B. longum subsp. infantis and B. longum subsp. longum
Pan-genome refers to all the genes of one species, consisting of three parts: core genes, non-essential genes and strain-specific genes. The pan-genome analysis of B. longum subsp. longum showed that as the number of added genomes increased, new genes continued to appear, and the number of pan-genome gradually increased; when the number of genomes reached 35, the number of pan-genomic genes increased slowly (Figure 3a), which indicated that the B. longum subsp. longum genome was nearly closed. The pangenome of B. longum subsp. longum contained 6645 genes, and its core genome possessed 1043 genes. The COG annotation showed that those genes related to translation, ribosomal structure and biogenesis accounted for the highest proportion in the core genome

Whole Genome and Orthologous Gene Comparison
The whole genomes of B. longum subsp. infantis and B. longum subsp. longum were compared via BRIG. B. longum subsp. infantis ATCC15697 and B. longum subsp. longum NCC2705 were taken as the reference genomes of each subspecies, respectively. The genomes of B. longum subsp. longum included seven main regions of variation ( Figure 4a); regions A, C and F mainly included the genes related to transport and metabolism of substances, especially carbohydrates; region D mainly included the extracellular polysaccharide synthesis gene clusters; the genes in region B were related to replication, recombination and repair and transcription; region E mainly included the genes related to prophages, transposons and function unknown; and region G consisted of genes related to transcription or defence mechanisms. B. longum subsp. infantis had more variable regions compared with B. longum subsp. longum (Figure 4b). Among them, regions a, b, d and h had the same functions as the regions D, C, A and E of B. longum subsp. longum; region c was mainly related to signal transduction mechanisms; regions e and g contained genes related to replication, recombination and repair; and the main genes in regions f and i had unknown function.
increased, the rate of change in the number of core genomes slowed down (Figure 3a). The core genome size of B. longum subsp. infantis strains was 1139 genes, which accounted for approximately 16.06% of the B. longum subsp. infantis pan-genome. The functional composition of the COG of B. longum subsp. infantis was not significantly different from that in B. longum subsp. longum (Figure 3b), and only slightly higher than B. longum subsp. longum (12.20%) in genes related to amino acid transport and metabolism.

Whole Genome and Orthologous Gene Comparison
The whole genomes of B. longum subsp. infantis and B. longum subsp. longum were compared via BRIG. B. longum subsp. infantis ATCC15697 and B. longum subsp. longum NCC2705 were taken as the reference genomes of each subspecies, respectively. The genomes of B. longum subsp. longum included seven main regions of variation ( Figure 4a); regions A, C and F mainly included the genes related to transport and metabolism of substances, especially carbohydrates; region D mainly included the extracellular polysaccharide synthesis gene clusters; the genes in region B were related to replication, recombination and repair and transcription; region E mainly included the genes related to prophages, transposons and function unknown; and region G consisted of genes related to transcription or defence mechanisms. B. longum subsp. infantis had more variable regions compared with B. longum subsp. longum (Figure 4b). Among them, regions a, b, d and h had the same functions as the regions D, C, A and E of B. longum subsp. longum; region c was mainly related to signal transduction mechanisms; regions e and g contained genes The results of orthologous genes analysis showed that the number of orthologous genes in B. longum subsp. longum and B. longum subsp. infantis were 1097 and 1175 ( Figure 4c,d), respectively. The difference between orthologous genes of the two subspecies was further compared, and a total of 41 genes only existed in B. longum subsp. infantis, while there were only ten specific genes for B. longum subsp. longum (Table S3). Specific genes of B. longum subsp. infantis mainly included urease gene clusters (BLON_RS00555-BLON_RS00605) and sialic acid metabolism gene clusters (BLON_RS03265-BLON_RS03305) [6], while the specific-genes of B. longum subsp. longum included transcriptional regulator, transpeptidase and amidohydrolase.

Carbohydrate Utilization Genotype and Phenotype of B. longum subsp. infantis and B. longum subsp. longum
Carbohydrate metabolism was one of the main differences between B. longum subsp. infantis and B. longum subsp. longum. In this study, the carbohydrate metabolism phenotype of B. longum and its association with genotype were analysed. The prediction of the CAZy database showed that those 80 B. longum strains encoded 65 glycosylhydrolase (GH) families and 15 glycosyltransferase (GT) families; in addition, GH19, GH30, GH35, GH43_34, GH50, GH57 and GH13_6 only existed in very few genomes, while GH20, GH2, GH77, GH13_11, GH42, GH3 and GH51 were abundant in B. longum. 4c,d), respectively. The difference between orthologous genes of the two subspecies was further compared, and a total of 41 genes only existed in B. longum subsp. infantis, while there were only ten specific genes for B. longum subsp. longum (Table S3). Specific genes of B. longum subsp. infantis mainly included urease gene clusters (BLON_RS00555-BLON_RS00605) and sialic acid metabolism gene clusters (BLON_RS03265-BLON_RS03305) [6], while the specific-genes of B. longum subsp. longum included transcriptional regulator, transpeptidase and amidohydrolase. The cluster analysis of the distribution and abundance of GH families showed that the GH families of B. longum could be divided into two groups, in which group A was B. longum subsp. infantis and group B was B. longum subsp. longum (Figure 5a). In addition, the significant difference in the number of GH genes between B. longum subsp. infantis and B. longum subsp. longum was evaluated by t-test, and the results showed that 21 out of 65 GH families were significantly different (p < 0.05, data not shown). The GH27, GH121 and GH127 families only existed in B. longum subsp. longum. Only B. longum subsp. longum APC1480 lacked the GH27 family; the GH121 family was absent in JSWX9M5, CCFM762, DJO10A and YS108R, while the GH127 family was conserved in all the B. longum subsp.
longum strains. The GH29, GH33 and GH95 families were conserved in all the B. longum subsp. infantis, and only a few B. longum subsp. longum strains (such as JSWX9M5 and CCFM752) consisted of those genes. The number of GH43 and its subfamily, GH51, in B. longum subsp. infantis was less than that in B. longum subsp. longum. In addition, the 15 GT families in B. longum could also be divided into two groups (Figure 5b), but there was no significant distinction at the subspecies level (except GT2, p < 0.05, data not shown).

Predicted EPS Gene Clusters in B. longum subsp. infantis and B. longum subsp. longum
Priming glycosyltransferase (p-GTF) is a key enzyme for synthesizing exopolysaccharides and catalyses the first step in synthesizing many heteropolysaccharides [28]. The genomic analysis of B. longum showed that B. longum subsp. longum ATCC 55813, CCFM756, DJO10A and M2CF0114, as well as B. longum subsp. infantis JSWX6M2, lacked the p-GTF (cpsD or rfbP), and only B. longum subsp. longum NCC2705 had two p-GTFs. The phylogenetic tree based on the genes encoding p-GTF showed that the p-GTF genes of B. longum subsp. infantis FHNFQ45M2, FGZ19I2M3 and FHeNJZ3M1 were different from other strains. The p-GTF gene in other B. longum subsp. infantis strains was rfbP; while there were two types of p-GTF genes in B. longum subsp. longum (Figure 6a).
EPS gene clusters defined in two Bifidobacterium strains were used as templates to perform BLAST comparison with the remaining B. longum, including B. animalis subsp. lactis DSM 10140, containing most of the genes described in the "consensus" LAB-EPS cluster [29] and B. longum subsp. longum 35624 [30]. The results showed that the predicted There was a 43kb HMOs utilisation gene cluster in B. longum subsp. infantis [6]. The BLAST using B. longum subsp. infantis ATCC15697 (type strain) as a template showed that the four glycosylhydrolase genes in the HMOs utilization gene cluster, including beta-galactosidase (BLON_RS12085), alpha-L-fucosidase (BLON_RS12095), exo-alphasialidase (BLON_RS12155) and beta-hexosaminidase (BLON_RS12185), were conserved in all B. longum subsp. infantis, and the diversities of different strains mainly existed in the four regions (Figure 5c). The number of MFS transporters was different in region A, and a few strains had IS3 and other hypothetical proteins; the difference in region B is the number of ABC transporter permease and ABC transporter substrate-binding proteins. SDZC2M4, FHeNJZ3M1, HeNJZ8M1, BT1, 1888B and IN-07 had inserted genes in region C and the number of SBPs in this region was different; six genes in region D were deleted in JSSZ7M7, FJND2M2, FZJJH13M4, 2, 4 and TPY12-1. Only B. longum subsp. longum JSWX9M5 and CCFM752 possessed the HMOs utilisation gene clusters similar to that in B. longum subsp. infantis (Figure 5c). B. longum subsp. longum JSWX9M5 lacked a beta-galactosidase (BLON_RS12085), and part of SBP compared with B. longum subsp. infantis ATCC15697, B. longum subsp. longum CCFM752 had only one L-fucosidase (BLON_RS12095) gene and a linked LacI family transcriptional regulator. Interestingly, there was a beta-galactosidase gene (BLON_RS12085) in most B. longum subsp. longum strains (except FHuBZX17M2, JSWX9M5 and NCTC13219), but there was no complete gene cluster. In addition, other genes related to HMOs metabolism were found in B. longum subsp. infantis ATCC15697, including two fucosidase clusters, one sialidase cluster, one beta-galactosidase gene (GH42), and two beta-hexosaminidase (GH20) genes as well as gene clusters related to LNB utilisation ( Table 2). The B. longum subsp. longum NCC2705 arabinose utilization gene cluster consisted of six genes (Table 3), among which the genes of BL_RS05095, BL_RS05100 and BL_RS05105 were found in all the strains. Genes araA, araB and araD were present in all B. longum subsp. longum (except CECT7347), but most B. longum subsp. infantis did not have those three genes. B. longum subsp. infantis FGZ17I1M1, FGZ19I1M3, FGZ19I2M3 and FGZ23I1M2 only consisted of araD, B. longum subsp. infantis FHNFQ45M2, and JSSZ7M7 only possessed araB. Additionally, only B. longum subsp. infantis SDZC2M4 and FZJJH13M4 had the same arabinose metabolism gene cluster as that in B. longum subsp. longum. The growth of B. longum with six carbohydrates as the sole carbon source in vitro was determined. All B. longum could grow with GOS and LNT as the sole carbon source. There was a significant difference in the utilisation of the other four carbohydrates between the two subspecies. All the B. longum subsp. infantis strains were able to grow with fucose as the sole carbon source, while B. longum subsp. longum could not metabolise it (except CCFM752 and JSWX9M5). Additionally, all the B. longum subsp. longum strains could use arabinose; by contrast, only a few B. longum subsp. infantis strains could grow in the presence of arabinose, including FJSYZ1M3, SDZC2M4 and FZJJH13M4. The utilisation of 2'FL and 3'SL by B. longum showed that both HMOs supported the growth of all B. longum subsp. infantis, while the vast majority of B. longum subsp. longum were not able to metabolise HMOs (except CCFM752 and JSWX9M5) (Figure 5d).

Predicted EPS Gene Clusters in B. longum subsp. infantis and B. longum subsp. longum
Priming glycosyltransferase (p-GTF) is a key enzyme for synthesizing exopolysaccharides and catalyses the first step in synthesizing many heteropolysaccharides [28]. The genomic analysis of B. longum showed that B. longum subsp. longum ATCC 55813, CCFM756, DJO10A and M2CF0114, as well as B. longum subsp. infantis JSWX6M2, lacked the p-GTF (cpsD or rfbP), and only B. longum subsp. longum NCC2705 had two p-GTFs. The phylogenetic tree based on the genes encoding p-GTF showed that the p-GTF genes of B. longum subsp. infantis FHNFQ45M2, FGZ19I2M3 and FHeNJZ3M1 were different from other strains. The p-GTF gene in other B. longum subsp. infantis strains was rfbP; while there were two types of p-GTF genes in B. longum subsp. longum (Figure 6a).
EPS gene clusters defined in two Bifidobacterium strains were used as templates to perform BLAST comparison with the remaining B. longum, including B. animalis subsp. lactis DSM 10140, containing most of the genes described in the "consensus" LAB-EPS cluster [29] and B. longum subsp. longum 35624 [30]. The results showed that the predicted EPS gene clusters were highly variable in different strains, and only some strains showed similar gene cluster composition. For instance, the EPS gene clusters in B. longum subsp. longum FGXBM14M, FHeJZ28M10, FHuBZX17M2 and FSCREG2M33, as well as B. longum subsp. infantis FHuNCS6M8, FJND16M4 and JSWX23M3, were relatively similar to that in B. longum subsp. longum 35624; all the EPS gene clusters in B. longum subsp. infantis FGZ17I1M1, FGZ19I1M3, FGZ23I1M2 and ATCC15697 contained multiple glycosyltransferases (Figure 6b,c).

CRISPR-Cas Systems in B. longum subsp. infantis and B. longum subsp. longum
CRISPR-Cas systems provided adaptive immunity for prokaryotes through DNAencoded and RNA-mediated nucleic acid targeting mechanisms. B. longum CRISPR-Cas systems were predicted by CRISPRCasFinder, and the existence of cas genes and the repeats sequence Evidence_Level = 4 in CRISPR loci were used as the basis for evaluating the existence of CRISPR-Cas systems in genomes. The results showed that 25 out of 40 B. longum subsp. longum strains had a CRISPR-Cas system, including four subtypes I-C, I-E, I-U and II-C, while 24 out of 40 B. longum subsp. infantis strains had CRISPR-Cas systems with subtypes I-C and I-E (Figure 7a, Table S4). Interestingly, there was another type of I-C Cas protein cluster in some subtypes of I-E in B. longum subsp. infantis, while B. longum subsp. longum with subtype I-E did not possess the two types of Cas proteins.
The Cas1 protein is the best phylogenetic marker for the evolution of CRISPR-Cas systems. Cas1 protein was present in the CRISPR-Cas systems of all the strains except B. longum subsp. longum APC1480. The phylogenetic tree, constructed based on the sequence of Cas1 protein, showed that the Cas1 proteins of different subtypes were located on different branches (Figure 8a). The phylogenetic analysis of subtypes I-C and I-E Cas1 proteins showed that the subtype I-C Cas1 proteins of B. longum subsp. infantis and B. longum subsp. longum were located on two branches, while the subtype I-E Cas1 protein had no obvious difference between the two subspecies ( Figure 8a).
A total of 24 CRISPR were predicted in B. longum subsp. longum (except M120R013). The length of type I-C repeats was 32 nucleotides and the length of type II-C and I-U repeats was 36 nucleotides, while the length of type I-E repeats was different (Table S4), and the predicted RNA secondary structure was diverse (Figure 7b). There were three variable nucleotides in the 4th, 13th and 15th positions in the repeat sequence of the four type I-E strains, but there was no significant influence on the secondary structure (Figure 7b). A total of 24 CRISPR were predicted in B. longum subsp. infantis. Except for B. longum subsp. infantis 4, the type I-C repeat sequence was conserved; the repeat sequences of FHNFQ45M2 and JSSZ7M7 were different from other type I-E strains (Table S4, Figure 7c). Phylogenetic analysis was performed on repeat sequences of B. longum, and the results showed that repeat sequences could distinguish the different subtypes of the CRISPR-Cas system in B. longum (Figure 8b); repeat sequences of type I-C CRISPR-Cas system could distinguish B. longum subsp. infantis and B. longum subsp. longum, while repeat sequences of the type I-E CRISPR-Cas system had no obvious diversity between the two subspecies.  (Figure 6b,c).  encoded and RNA-mediated nucleic acid targeting mechanisms. B. longum CRISPR-Cas systems were predicted by CRISPRCasFinder, and the existence of cas genes and the repeats sequence Evidence_Level = 4 in CRISPR loci were used as the basis for evaluating the existence of CRISPR-Cas systems in genomes. The results showed that 25 out of 40 B. longum subsp. longum strains had a CRISPR-Cas system, including four subtypes I-C, I-E, I-U and II-C, while 24 out of 40 B. longum subsp. infantis strains had CRISPR-Cas systems with subtypes I-C and I-E (Figure 7a, Table S4). Interestingly, there was another type of I-C Cas protein cluster in some subtypes of I-E in B. longum subsp. infantis, while B. longum subsp. longum with subtype I-E did not possess the two types of Cas proteins.  CRISPR spacers revealed the immune record of the strain and the challenges overcome during DNA invasion. The number of subtypes I-C spacers in B. longum subsp. infantis was significantly less than that of B. longum subsp. longum (p < 0.05), while there was no significant difference in the number of subtypes I-E spacers between the two subspecies ( Figure 7d). Most of the predicted prophages in B. longum were incomplete, and intact prophages only existed in B. longum subsp. longum APC1480 and C1A13 (Table S5) of Cas1 protein, showed that the Cas1 proteins of different subtypes were located on different branches (Figure 8a). The phylogenetic analysis of subtypes I-C and I-E Cas1 proteins showed that the subtype I-C Cas1 proteins of B. longum subsp. infantis and B. longum subsp. longum were located on two branches, while the subtype I-E Cas1 protein had no obvious difference between the two subspecies (Figure 8a).
(a) (b) Figure 8. Phylogenetic tree based on Cas1 proteins (a) and repeat sequences (b). CRISPR-Cas subtypes or subspecies are shown on the right, and each group was marked with colour. The tree was drawn with UPGMA using 1000 bootstrap replicates. Bootstrap values were recorded on the nodes.
A total of 24 CRISPR were predicted in B. longum subsp. longum (except M120R013). The length of type I-C repeats was 32 nucleotides and the length of type II-C and I-U repeats was 36 nucleotides, while the length of type I-E repeats was different (Table S4), and the predicted RNA secondary structure was diverse (Figure 7b). There were three variable nucleotides in the 4th, 13th and 15th positions in the repeat sequence of the four type I-E strains, but there was no significant influence on the secondary structure ( Figure  7b). A total of 24 CRISPR were predicted in B. longum subsp. infantis. Except for B. longum subsp. infantis 4, the type I-C repeat sequence was conserved; the repeat sequences of FHNFQ45M2 and JSSZ7M7 were different from other type I-E strains (Table S4, Figure  7c). Phylogenetic analysis was performed on repeat sequences of B. longum, and the results showed that repeat sequences could distinguish the different subtypes of the CRISPR-Cas system in B. longum (Figure 8b); repeat sequences of type I-C CRISPR-Cas system could distinguish B. longum subsp. infantis and B. longum subsp. longum, while repeat sequences of the type I-E CRISPR-Cas system had no obvious diversity between the two subspecies.
CRISPR spacers revealed the immune record of the strain and the challenges overcome during DNA invasion. The number of subtypes I-C spacers in B. longum subsp. infantis was significantly less than that of B. longum subsp. longum (p < 0.05), while there was no significant difference in the number of subtypes I-E spacers between the two subspecies (Figure 7d). Most of the predicted prophages in B. longum were incomplete, and intact prophages only existed in B. longum subsp. longum APC1480 and C1A13 (Table S5)   BAGEL was used to predict the potential bacteriocin operons in B. longum, and different types of bacteriocin were distinguished according to the bacteriocin classification method proposed in previous studies. There was no bacteriocin operon in most B. longum subsp. longum strains; only two bacteriocin operons were predicted in four genomes. A lantibiotic (BLD_1648) was predicted in DJO10A, CECT7347 and JSWX9M5, and there was a class II bacteriocin Propionicin_SM1 (originally isolated from Propionibacterium jannaschii) in APC1480 (Table 4, Figure 10a). Differently from B. longum subsp. longum, there were 12 bacteriocin operons in B. longum subsp. infantis. All the predicted bacteriocins were class I bacteriocin except for Propionicin_SM1, and seven belonged to class Ia bacteriocin. Only nine B. longum subsp. infantis strains had no bacteriocin operons. The comparative analysis of the distribution of bacteriocin operons and strains clustering revealed that seven B. longum subsp. infantis strains without predicted bacteriocin were located on the same branch of the phylogenetic tree ( Figure 10a). Additionally, the gene clusters of four bacteriocins widely distributed in B. longum were further analysed. The three class I bacteriocins were lanthipeptides, the gene cluster consisting of a gene encoding core peptide, genes encoding a lantibiotic modifying enzyme (LanM), a lantibiotic biosynthesis protein (LanC), a transcriptional regulatory protein (LanR) and genes relating to signal transduction (LanK) and lantibiotic mersacidin transporter(LanT). There was one gene encoding core peptide, one transposase and some genes with other functions in class II bacteriocin Propionicin_SM1 (Figure 10b).

Bacteriocin Operons in B. longum subsp. infantis and B. longum subsp. longum
BAGEL was used to predict the potential bacteriocin operons in B. longum, and different types of bacteriocin were distinguished according to the bacteriocin classification method proposed in previous studies. There was no bacteriocin operon in most B. longum subsp. longum strains; only two bacteriocin operons were predicted in four genomes. A lantibiotic (BLD_1648) was predicted in DJO10A, CECT7347 and JSWX9M5, and there was   Thiopeptide  -4  Thiopeptide  -LAPs  -5  Id, LAPs  -Sactipeptides  -3 Ic, Sactipeptides -bacteriocins were lanthipeptides, the gene cluster consisting of a gene encoding core peptide, genes encoding a lantibiotic modifying enzyme (LanM), a lantibiotic biosynthesis protein (LanC), a transcriptional regulatory protein (LanR) and genes relating to signal transduction (LanK) and lantibiotic mersacidin transporter(LanT). There was one gene encoding core peptide, one transposase and some genes with other functions in class II bacteriocin Propionicin_SM1 (Figure 10b).

Discussion
As there are certain differences in phenotypes and genotypes between B. longum subsp. infantis and B. longum subsp. longum, such as carbohydrate utilisation, research on B. longum should be carried out at the subspecies level; the close genetic relationship makes it difficult for the common methods of species identification to accurately distinguish the two subspecies. Therefore, the classification of B. longum subsp. infantis and B. longum subsp. longum is the pre-requisite for comparative genomic analysis. ANI can be used to assess the genetic relationship between subspecies at the genomic level. The results of this study showed that ANI between B. longum subsp. suis and B. longum subsp. longum, and that between B. longum subsp. suis and B. longum subsp. infantis, was higher than that between B. longum subsp. longum and B. longum subsp. infantis, which was consistent with the previous report [8], indicating that the relationship between B. longum subsp. infantis and B. longum subsp. longum was relatively distant, and their genetic determinants were different. In addition, the results also showed that ANI among strains of the same subspecies were all greater than 97%, while ANI between different subspecies were all less than 97%. However, the results obtained by different ANI calculation software are slightly different, which may cause classification errors [31], and ANI combined with phylogenetic analysis to distinguish similar species was more reliable. In this study, through further phylogenetic analysis, B. longum subsp. infantis and B. longum subsp. longum were distinguished clearly, and there were some misclassified strains in the public database. Previous studies also found similar results, in which a phylogenetic tree was constructed based on 43 selected reference genes and confirmed that B. longum subsp. longum JDM301 should be B. longum subsp. suis, while B. longum subsp. infantis 157F and ATCC 55813 should be B. longum subsp. longum [8]. The phylogenetic tree constructed based on the core genes of 33 B. longum strains also showed errors in the classification of B. longum 157F [9]. The comparison of B. longum core genes also proved that strain 35624 was indeed B. longum subsp. longum [30]. B. longum subsp. longum BXY01 and CMCCP0001 were also reclassified as B. longum subsp. suis in similar phylogenetic studies [9,11]. Therefore, the classification of different B. longum subspecies could be achieved via the combination of ANI and phylogenetic analysis.
The comparison of general characteristics of B. longum genomes showed that there were differences in the size of different genomes. The genome of B. longum subsp. infantis was generally larger than that of B. longum subsp. longum, the average GC content of B. longum subsp. infantis was slightly lower than that of B. longum subsp. longum and the number of genes in B. longum subsp. infantis was also more than that in B. longum subsp. longum. The pan-genome of 23 B. longum subsp. longum strains has been analysed previously and was confirmed as open, but the number of strains involved was relatively small [8]. Similarly, other pan-genome analyses of B. longum contained only a small number of B. longum subsp. infantis strains [9,11]. The difference between B. longum subsp. infantis and B. longum subsp. longum may lead to bias analysis of pan-genome features. Therefore, in this study, the pan-genome of 40 B. longum subsp. infantis strains and 40 B. longum subsp. longum strains was analysed and compared. The results showed that the pan-genome of B. longum subsp. infantis and B. longum subsp. longum was nearly closed, which indicated that, with the addition of new strains, new genes would appear in the genomes of the two subspecies, but the probability of new genes appearing is low.
The comparison of whole genome sequence revealed the differences in the intraspecies genetic information of B. longum subsp. longum and B. longum subsp. infantis. The results showed that different genomes of B. longum subsp. infantis had greater diversities than those of B. longum subsp. longum. Strains of both subspecies had variability in carbohydrate transport and metabolism genes and EPS gene clusters. The comparison of homologous gene diversities revealed the interspecific differences between the two subspecies. The number of specific genes in B. longum subsp. infantis was four times than that in B. longum subsp. longum, and it mainly included two larger gene clusters. This may be related to the utilisation of breast milk by B. longum subsp. infantis. The urease gene cluster (BLON_RS00555-BLON_RS00605) was related to the utilisation of urea in breast milk. As the protein concentration in breast milk often makes it difficult to meet the increasing nitrogen demand during the growth of infants, the urea in breast milk becomes another potential nitrogen sources for infants and their gut microbes; bacterial urease (EC5.3.1.5) plays a major role in urea nitrogen recovery (UNS) [6]. Comparative genomic hybridization analysis of 15 B. longum strains has been performed in a previous study, and the results indicated that genes and their activity of urea metabolism were only conserved in B. longum subsp. infantis [32]. The sialic acid metabolism gene cluster (BLON_RS03265-BLON_RS03305) is related to the metabolism of HMOs. Although this gene cluster did not exist in B. longum subsp. longum strains in this study, it was found that ten B. longum subsp. longum strains isolated from younger subjects contained the genes encoding sialidase homologous to that in B. longum subsp. infantis ATCC 15697 via gene enrichment [33]. The existence of specific genes in B. longum subsp. infantis may further explain why B. longum subsp. infantis are more widely present in the intestine of breast-fed infants.
The proportion of annotated gene encoding enzymes involved in complex carbohydrates metabolism were more than 8% in the B. longum genome [9,34]. GTs and GHs are responsible for the synthesis of glycosidic bonds and hydrolysis (or modification), respectively, and are the two major enzymes related to metabolising carbohydrates. In this study, the GH and GT families of B. longum were compared and the results showed that B. longum subsp. infantis and B. longum subsp. longum had significant differences in the composition of GHs, while GTs showed no difference at the subspecies level. The GH27 family contained various enzymes such as α-galactosidase and β-L-arabinopyranosidase, the GH121 family mainly encoded β-L-arabinobiosidase, the GH127 family contained β-Larabinofuranosidases and the GH43 family mainly included α-L-arabinanase, β-xylosidase, L-arabinofuranosidase and other enzymes related to the metabolism of complex plantderived polysaccharides. Moreover, the GH29, GH33 and GH95 families were related to HMOs utilization. Therefore, there were gene families related to the utilisation of HMOs in B. longum subsp. infantis, while the number of gene families related to the utilisation of plant-derived polysaccharides in B. longum subsp. longum was greater, which was consistent with the previous reports [8,9].
In earlier studies, it was found that B. longum subsp. infantis can grow well in a medium with HMOs as the sole carbon source, while B. longum subsp. longum could not utilise HMOs or had weak utilisation ability [35][36][37], but the latest research has found that certain B. longum subsp. longum strains could also utilise 2'FL and 3'FL in HMOs [38,39]. Therefore, the metabolism-related genes of the three HMOs, 2'FL, 3'SL and LNT, were analysed and further determined the metabolic ability of B. longum. Both B. longum subsp. longum and B. longum subsp. infantis could metabolise LNT (Galβ 1-3GlcNAc linkage) similar to previous reports [38,40]. LNT was decomposed into LNB and lactose under the action of β-hexosaminidases [6], and a gene encoding β-hexosaminidases and genes related to LNB metabolism were present in both B. longum subsp. infantis and B. longum subsp. longum. Additionally, beta-galactosidases (BLON_RS10470) was also related to LNT metabolism [41], which explains the reason for B. longum subsp. longum metabolising LNT. All B. longum subsp. infantis strains in this study could metabolise the three HMOs assayed. The deletion of some transporters and the insertion of gene fragments in the HMO utilisation gene cluster did not affect the metabolic ability of B. longum subsp. infantis on 2'FL and 3'SL. The presence of partial HMO utilisation gene clusters in B. longum subsp. longum JSWX9M5 and CCFM752 enabled them to metabolise 2'FL and 3'SL. The HMO utilisation gene cluster in B. longum subsp. longum JSWX9M5 was more similar to that in B. longum subsp. infantis ATCC15697, while the HMO utilisation gene cluster in B. longum subsp. longum CCFM752 was consistent with that found in B. longum subsp. longum SC596 [38]. Interestingly, B. longum subsp. longum CCFM752 could also metabolise 3'SL, although there was no presence of sialidase in its genome.
Previous studies have shown that B. longum subsp. infantis could use L-fucose, but B. longum subsp. longum could not [42]; hence, L-arabinose was considered to be a potential marker distinguishing B. longum subsp. longum and B. longum subsp. infantis [7]. It was found that a permease and a fucosidase in B. longum subsp. infantis ATCC15697 seemed to replace the three arabinose metabolism genes (BL_RS05080-BL_RS05090) in B. longum subsp. longum NCC2705 [43]. No fucosidase clusters and fucose metabolism ability were found in other B. longum subsp. longum strains, except for JSWX9M5 and CCFM752. It was found that there was an arabinose metabolism gene cluster in B. longum subsp. infantis SDZC2M4 and FZJJH13M4, and in vitro growth studies also proved that both the two strains could metabolise arabinose. It is worth noting that an arabinose gene cluster was not found in the genome of B. longum subsp. infantis FJSYZ1M3. Those studies showed that the differences in carbohydrate metabolism-related genes determined the different carbohydrate metabolism capabilities of strains, and further revealed that B. longum subsp. infantis preferred to metabolise HMOs, while B. longum subsp. longum had a stronger utilisation capacity for plant-derived carbohydrates. Kujawska et al. analysed the B. longum isolated from infant faecal samples from 1 to 18 months and found that the difference in