Genomic Analysis of Prophages from Klebsiella pneumoniae Clinical Isolates

Klebsiella pneumoniae is an increasing threat to public health and represents one of the most concerning pathogens involved in life-threatening infections. The resistant and virulence determinants are coded by mobile genetic elements which can easily spread between bacteria populations and co-evolve with its genomic host. In this study, we present the full genomic sequences, insertion sites and phylogenetic analysis of 150 prophages found in 40 K. pneumoniae clinical isolates obtained from an outbreak in a Portuguese hospital. All strains harbored at least one prophage and we identified 104 intact prophages (69.3%). The prophage size ranges from 29.7 to 50.6 kbp, coding between 32 and 78 putative genes. The prophage GC content is 51.2%, lower than the average GC content of 57.1% in K. pneumoniae. Complete prophages were classified into three families in the order Caudolovirales: Myoviridae (59.6%), Siphoviridae (38.5%) and Podoviridae (1.9%). In addition, an alignment and phylogenetic analysis revealed nine distinct clusters. Evidence of recombination was detected within the genome of some prophages but, in most cases, proteins involved in viral structure, transcription, replication and regulation (lysogenic/lysis) were maintained. These results support the knowledge that prophages are diverse and widely disseminated in K. pneumoniae genomes, contributing to the evolution of this species and conferring additional phenotypes. Moreover, we identified K. pneumoniae prophages in a set of endolysin genes, which were found to code for proteins with lysozyme activity, cleaving the β-1,4 linkages between N-acetylmuramic acid and N-acetyl-D-glucosamine residues in the peptidoglycan network and thus representing genes with the potential for lysin phage therapy.


Introduction
Klebsiella pneumoniae is an opportunistic and commensal gram-negative human pathogen prevalent in the hospital environment. This bacterium is mainly found in gastrointestinal and respiratory tracts and on the skin of healthy individuals, but in recent years it has become one of the world's leading causes of community and hospital-acquired infections, such as urinary tract infections (UTIs), pneumonia, septicaemia, and wound/soft tissue infections, with an increasing mortality rate, particularly in immunocompromised individuals, neonates, and the elderly [1][2][3][4][5].
Due to its widespread distribution and genetic plasticity, K. pneumoniae is one of the most important multidrug-resistant (MDR) pathogens and has been classified as an have been proposed as alternative antimicrobial agents to treat infections in the postantibiotic era [29][30][31]. Recent research has produced some promising results regarding the use of endolysins against K. pneumoniae. A recombinant endolysin from the K. pneumoniae phage KP27 was produced, and its peptidoglycan-degrading activity was demonstrated against gram-negative bacteria by the co-incubation of bacteria and endolysin [32]. In a study by Walmagh et al. (2013), five endolysins were characterized, including two endolysins from K. pneumoniae phages K11 and KP32, and their muralytic activity on the peptidoglycan of several gram-negative bacterial species was demonstrated [33]. In another study, two endolysins (ElyA1 and ElyA2) combined with colistin were tested against A. baumannii, P. aeruginosa and K. pneumoniae, and one of them displayed activity against 13 out of 17 strains of K. pneumoniae [34].
In this work, we aimed to evaluate the prophage presence in clinical isolates of K. pneumoniae from an outbreak in a Portuguese tertiary-care hospital. Also, we aimed to understand how prophages can contribute to the rapid evolution of this bacterial pathogen. Moreover, we have identified and characterized putative endolysin genes encoded by these prophage genomes that can potentially be used for phage lysin therapy.

K. pneumoniae Isolates Genomes
A total of 40 multiclonal K. pneumoniae isolates from 23 patients hospitalized in intensive care unit at SAMS Hospital, a Portuguese tertiary-care hospital, were recently sequenced by whole genome sequencing (WGS) [35] and the genomes were screened for prophage presence.

Prophage Identification
PHASTER (PHAge Search Tool Enhanced Release) [36] and Prophage Hunter Tool [37] were used, allowing for the identification and annotation of putative prophages within contigs of each K. pneumoniae genome (last accessed January 2021). Concurrently, bacterial genomes were also annotated using the open-access tool RAST: Rapid Annotation using Subsystem Technology [38][39][40], and the identified prophage genes were extensively analysed in terms of sequence and structure to evaluate its homology with bacteriophagederived regions. The annotation of prophage coding sequences found by the three different methods was compared (data not shown).
All prophage sequences were manually sorted and curated, and the insertion sites were determined as shown below. A complete prophage sequence is not often present in one contig, and to overcome this limitation, whenever possible, prophages were scaffolded using BLAST with a query of the prophages from K. pneumoniae subsp. pneumoniae HS11286 (GenBank Accession: CP003200.1, genome region from 1288358 to 1338717) and K. pneumoniae strain FDAARGOS_775 (GenBank Accession: NZ_CP040993.1, genome region from 3328442 to 3378114) to check for homologies in the contigs in a similar way, as described elsewhere [41]. Both genomes were available on NCBI database (https://www.ncbi.nlm.nih.gov/, last accessed 15 January 2021) as reference genomes for K. pneumoniae and, since both have integrated prophages, we also referred to them to determine the correct insertion sites. The insertion sites of the prophages were identified whenever the prophage 5 and 3 ends were contiguously flanked by bacterial genes in a contig. The last bacterial gene before the prophage sequence and the first bacterial gene after the prophage sequence were identified.

Prophage Classification
The identified intact prophages were classified in silico into their respective phage families based on the prophage structural head-neck-tail proteins using VIRFAM [44]. The prophage lifestyles were classified using PHACTS [45] and were additionally assessed by manually inspecting the genomes for genes related to lifestyle (e.g., integrases).

Prophage Pan-Genome
The core-and pan-genome of K. pneumoniae intact prophages were determined using Roary (version 3.13) [46], using as settings for core genome the genes present in at least 50% of the prophage intact genomes, a minimum BLASTp percentage identity of 40, 50, 60, 70, 80 or 90%, and −s option. These settings were used to determine the most suitable parameters for determining the prophage pan-genome, as previously described [47].

Prophage Phylogenetic Analysis
Intact prophage sequences were queried against all K. pneumoniae phages sequences available on the PATRIC website (https://www.patricbrc.org, last accessed January 2021) [48], which had 256 sequences in January 2021, and against public databases using phagelimited BLASTn [42] to identify similar phages. Hits with a query cover of at least 50% were considered similar phages and those with query covers below 50% were considered close phages.
The prophage genomes were aligned using MAFFT version 7 [49] default options. Maximum likelihood phylogenetic trees from the alignments were produced using FastTree 2.1.11 [50]. The produced trees were visualized and annotated using Interactive Tree Of Life (iTOL) v6 [51].

Endolysins Identification, Gene Ontology Analysis and Functional Annotation
Since defective prophages can also harbor lysins, we considered all prophages identified (intact and defective) for endolysins identification. Together with our prophage sequences, we also analysed a set of 17 annotated phages identified during prophage phylogenetic analysis, which share homology with our prophages. A total of 167 prophage sequences (150 sequences originally identified + 17 phage annotated sequences) were submitted to bioinformatic analysis for the identification of putative phage endolysins in terms of sequence homology using BLAST [42] and structural homology using the open-access tools Phyre2 [43] and SWISS-MODEL [52]. Gene Ontology (GO) identifiers and related GO terms were assigned to the identified endolysins using the QuickGo web server (http://www.ebi.ac.uk/QuickGO/, last accessed July 2021).

Endolysin Phylogenetic Analysis
Endolysin genomic and proteomic sequences were aligned using MAFFT version 7 [49] with default parameters. The genome phylogenetic tree was constructed using the Jukes-Cantor substitution model and the proteome phylogenetic tree was constructed using the Le Gascuel substitution model in PHYML 3.3.20180621 (Geneious Prime version 2021.1.1). The identity matrix generated during the construction of the phylogenetic trees was used to infer nucleotides and proteins endolysins identity. Trees were visualized and annotated using Interactive Tree Of Life (iTOL) v6 [51].

Identification and Prevalence of Prophages in K. pneumoniae Strains
In the present study, the genome sequences of 40 K. pneumoniae clinical isolates from 23 patients were analysed with a web server tool for identification and annotation of prophage sequences within bacterial genomes, PHASTER, with default arguments [36].
A total of 150 prophage-like elements were detected (Supplementary Table S1), of which 104 were classified as intact, 39 as incomplete and 7 as questionable (Figure 1a, Supplementary Table S2). One strain was found to harbor only one prophage, while the remaining 39 strains (97.5%) have at least two prophages, which indicates that prophages are abundant in the K. pneumoniae genome. The total number of prophages per strain ranged from 1 to 10, with an average of 3.7 prophages per strain, and most strains harbored either two (n = 13) or four (n = 11) prophages ( Figure 1b). A significantly higher prevalence of incomplete and questionable prophages was expected since intact prophages are usually under strong selection or genetic degradation by bacteria for rapid deletion from bacterial genomes [53,54]. Instead, we found that 97.5% of the K. pneumoniae strains contain intact prophages. Strains isolated from patients 1, 6, 21, 23, 24, 25 and 26 contained only intact prophages, whereas strains isolated from patients 2, 3, 5, 9, 11, 14, 16, 17, 18 and 19 contained intact and incomplete prophages, and strains isolated from patients 4, 7, 8, 10, 13 and 15 contained intact, incomplete, and questionable prophages (Supplementary Table S2). Furthermore, patients were colonized with one K. pneumoniae strain, except patients 1, 3, 15, 17, 19, 23 and 26, which were colonized with 6, 2, 4, 5, 3, 2 and 2 strains, respectively. It is important to note that the genomes were divided into contigs, which implies that PHASTER may have underestimated the correct number of intact prophages (some were split into different contigs and identified as incomplete or questionable prophages).

Genome Characteristics of K. pneumoniae Prophages
The shortest (remnant) prophage sequence is 8.9 kbp, and the biggest is 60.8 kbp, with the coding sequences (CDS) number ranging from 12 to 75. The average GC% content in all 150 prophages is 52.2% (min 45.1%, max 60.2%), while the average bacterial GC% content is 57.1% (min 54.9%, max 57.4%) (Supplementary Table S1), which suggests horizontal gene transfer of the prophage region.
For the purpose of our analysis, incomplete and questionable prophage sequences divided into different contigs were scaffolded as described in materials and methods. After this analysis, prophages still considered questionable, and incomplete were grouped as defective prophages. Prophages smaller than 28 kbp were considered not intact because they lacked a prophage genomic structure and were difficult to distinguish from other integrative elements. Only prophages with identified integrase and/or at least one gene involved in biological processes (e.g., terminase, endolysin, capsid, tail fibers) were considered intact.
According to this criterion, prophages were found to be intact in 104 of the 150 prophage sequences (63.3%) (

Classification of K. pneumoniae Prophages
Intact prophages identified were in silico assigned to a family taxon using the VIRFAM website [44]. Classification was based on genes considered to be the most indicative of its family: major capsid protein, large terminase subunit, tail tape measure protein and tail sheath protein. All prophages could be assigned to a family. The majority, 62 (59.6%) was assigned to the Myoviridae family, 40 (38.5%) to the Siphoviridae family, and 2 (1.9%) to the Podoviridae family ( Figure 2). This is in accordance with the estimated distribution described in the literature [55,56]. , which agrees with their usually described size around 40-42 kbp, containing about 55 genes. However, we identified only two prophages belonging to this family, which may justify lesser variability. Interestingly, all the strains colonizing each patient harbor prophages belonging to the same family, except patients 17, 24 and 25. According to Table 1, patient 17 had five isolates, of which three (Kp4874, Kp4875, Kp4876) harbored prophages belonging only to Siphoviridae family; the Kp4872 harbored Siphoviridae (n = 2) and Podoviridae (n = 1) prophages, and the Kp4873 harbored Siphoviridae (n = 2) and Myoviridae (n = 1) prophages. On the other hand, the isolate Kp4886 from patient 24 harbored three prophages belonging to Siphoviridae (n = 2) and Myoviridae (n = 1), and the isolate Kp4887 from patient 25 harbored Myoviridae (n = 1), Siphoviridae (n = 1) and Podoviridae (n = 1) prophages. All prophages were predicted to have a temperate lifestyle by PHACTS, except for Kp4866-6 prophage, which was not confidently predicted to have a lytic lifestyle. However, all prophage genomes, including Kp4866-6, contained an integrase gene and a BLASTn showed that Kp4866-6 have similarity to Klebsiella michiganensis (up to 99.94% identity and 75% query coverage) and K. pneumoniae (up to 97.28% identity and 57% query coverage) genomes, could indicate that have also a temperate lifestyle.
Concerning the pan-genome (i.e., the entire set of genes in genomes) and the core genome (i.e., the set of genes that are present in all genomes), we found that K. pneumoniae prophages have an open pan-genome ( Figure 3) made of 892 genes (considering a 40% identity threshold for BLASTp) or 1285 genes (if the threshold is raised to 90%). Considering the typical high genomic diversity of prophages [47], we considered the threshold of 40% for BLASTp to list the prophage pan-genome (Supplementary Table S3). Additionally, no core genes (present in at least half of the prophages) are found for thresholds of protein identity higher than 50%. However, considering the presence in at least 50% of the genomes as a core gene we could find 3 core genes if considering the protein identity threshold of 50% and 16 core genes for a threshold of 40% identity. In Supplementary Table S3, it is possible to observe that 389 genes are singletons, present in one genome only.

Genomic and Proteomic Phylogenetic Relationships between K. pneumoniae Prophages
The 104 intact prophage sequences were compared to a list of 256 K. pneumoniae phage sequences available on the PATRIC website, and we used the BLASTn [42] tool for prophage identification. Hits with a query cover of at least 50% were considered similar prophages and query covers ranging from 20% to 50% were considered close phages. Using this criterion, 13 Klebsiella phages were identified that were highly similar (≥50% genome homology) to our prophage sequences, as well as 4 Klebsiella phages and 1 Pseudomonas phage (VW-6B) with ≥20% genome homology ( Table 1).
The similarity of prophage genomes was determined using an MAFFT alignment with default arguments and quantified as a heat-map matrix (Supplementary Figure S1). Whole-genome analysis revealed nine clusters of prophages with a genome identity above 50%, indicating strong evolutionary relationships. To understand the diversity of the prophage identified, a genomic phylogenetic tree was generated. Confirming our previous results, most of the prophages cluster by family group (Figure 4). Clusters C1-C4 and C5-C8 are comprised of sub-clusters containing highly related prophages (more than 70% identity), belonging to the Myoviridae and Siphoviridae families, respectively. Even for areas of lower identities, prophages tend to cluster according to family. Cluster C9 was revealed to be a mixed cluster with higher diversity, comprising prophages from Myoviridae, Siphoviridae and Podoviridae families, demonstrating that sub-clusters of the same family can be scattered in the phylogenetic tree and have an enormous genomic diversity.

Presence of Virulence Factors and Antibiotic Resistance Genes within K. pneumoniae Prophages
Prophages, even if defective, have implications on bacterial lifestyle, fitness, virulence, and the evolution of their bacterial host [21,25]. So, we searched for the presence of virulence factors and antimicrobial resistance genes encoded by the 150 prophages identified. Our analysis revealed the absence of any of the virulence or antimicrobial resistance-associated genes, using the available databases described in Materials and Methods. Since a rapid spreading of bacteria pathogenicity and an increase in host fitness is expected to be linked with prophages, we decided to analyse all prophage genomes and considered virulence genes to be those that might influence bacterial capacity to invade its host, evade or inhibit the host immune defense and survive and proliferate in different environmental conditions, in a similar way as described in Costa et al. (2018) [57]. By grouping the virulence and fitness genes in classes (Supplementary Table S4), we found that the TraR/DksA family transcription regulator was the most prevalent potential virulence factor. This family of transcriptional regulators is proposed to regulate a diverse set of genes, including those involved in virulence, the activation of stress response and providing indirect fitness advantages for the host [58,59]. Other potential virulence factors that may confer advantages to the bacteria-harboring the prophage were also found. These include the membrane-associated factor lipoprotein, which has been shown to play a direct role in virulence-associated functions, such as colonization, invasion, evasion of host defense, and immunomodulation [60]; the molecular chaperone DnaJ, that can have important functions in the assembly and replication of phage particles but may also be involved in bacterial motility and adhesion to the host, and has been described as essential for the virulence and colonization of Streptococcus pneumoniae, P. aeruginosa and Salmonella spp. [61][62][63]; UmuCD proteins, which are involved in persistence under stress conditions and already described in K. pneumoniae [64]; the serine/threonine phosphatase protein, an enzyme sensing and responding to environmental signals resulting from entering the host [65]; and the acetyltransferase family protein, which are enzymes indirectly involved in antibiotic, xenobiotic resistance and play a role in bacterial virulence [66].
Moreover, it was observed that the prophage Kp4852-1 had a closely adjacent T6SSassociated gene (ImpB protein) along the downstream regions of the prophage insertion site (Supplementary Figure S2, Supplementary Table S1). T6SS components belong to Type VI secretion system, which has been identified in several different pathogenic bacteria and appears to play different roles related to pathogenicity, host-range determination and/or niche adaptation [67,68].

Identification of Putative Endolysins within K. pneumoniae Prophages Genomes
Endolysins encoded by prophage genomes have attracted increased interest, particularly in the context of emerging antibiotic resistance. To study the nature of the endolysins encoded by prophages, our set of 150 K. pneumoniae prophages genomes, jointly with a set of 17 Klebsiella phagic genomes in GenBank, were analysed for putative endolysin identification (Supplementary Table S5). We were able to identify 132 endolysin sequences (115 endolysins from our dataset plus 17 endolysins from phages annotated), except for the K. pneumoniae intact prophage Kp4852-4 and 34 defective prophages, for which no endolysin was identified, pointing to the presence of cryptic prophages that are no longer able to pursue a lytic cycle. Endolysins were also present in some defective prophages, even if their sequences were still incomplete; such is the case of the prophages Kp4858-4, Kp4858-5, Kp4859-4 and Kp4859-5.
To determine the relationship and diversity of K. pneumoniae endolysins, we performed a genome and proteome heat map analysis of prophage sequences (Supplementary Figure S3). Genomic analysis revealed seven clusters of endolysins with genome identity above 60%, indicating strong evolutionary relationships. Interestingly, proteomic analysis revealed a diversity in the amino acid and nucleotide sequence, while still allowing the identification of the same seven clusters with an amino acid sequence identity above 40%, reinforcing precise function conservation [69].
To understand if the endolysins clusters formed were related to a prophage family, we constructed genomic and proteomic phylogenetic trees and provided family information, as shown in Figure 5. Genomic clusters N1, N3-N5 and N7 (Figure 5a, also identified in Supplementary Figure S3a) have more than 60% nucleotides identity and are comprised of sub-clusters containing even highly related endolysins (>80% nucleotides identity). These sub-clusters are composed of endolysins that belong to prophages of the same family. Clusters N2 and N6 are composed of endolysins of different prophage families, although sharing a 65% and 45% nucleotide identity, respectively. Nevertheless, clusters of the same prophage family are scattered in the tree, demonstrating that endolysins of these prophage families can have significantly divergent genomes. A similar analysis was made when observing the proteomic tree (Figure 5b, also identified in Supplementary Figure S3b). Six clusters (clusters P1-P4 and P6-P7) of high endolysins proteome identity (>60% amino acids identity) are also clusters (clusters N1, N3-N7) with high genomic identity (>60% identity). Interestingly, cluster P5, which corresponds to genomic cluster N2, is a more diverse group, but still comprises related endolysins (>40% amino acids, > 45% nucleotides identity) and includes sub-clusters of highly related endolysins, with 95% amino acids and nucleotides identity belonging to different prophage families. Still, these results demonstrate a strong agreement between both analyses. Trees were analysed and annotated using Interactive Tree Of Life (iTOL) v6 [51]. Tree branches represent Myoviridae (green); Siphoviridae (blue); Podoviridae (red); and Ackermannviridae (black) families (in silico determined). Grey-shaded circles represent clusters with identities higher than 50%. Small, shaded circles represent endolysins groups. Lysozymes/muramidases (orange); Chitinases (yellow); and Endopeptidases (light blue).
The GO knowledgebase is the world's largest source of information on the functions of genes. In this work, the gene ontology for phage endolysin was analysed using the QuickGo website (http://www.ebi.ac.uk/QuickGO/, last accessed July 2021), which is a web-based tool that allows easy browsing of the Gene Ontology (GO) provided by the GO Consortium annotation groups. Here, we searched three specific GO terms: lysozyme, chitinase and endopeptidase (Supplementary Figure S4). The GO term GO:0003796 corresponds to lysozyme activity, whose ontology by molecular function describes this protein as "Catalysis of the hydrolysis of the beta-(1->4) linkages between N-acetylmuramic acid and N-acetyl-D-glucosamine residues in a peptidoglycan". The GO term GO:0004568 corresponds to chitinase activity, whose ontology by molecular function describes this protein as "Catalysis of the hydrolysis of (1->4)-beta linkages of N-acetyl-D-glucosamine (GlcNAc) polymers of chitin and chitodextrins". For both, all the relationships in the ancestor chart are "Is a", including the molecular function of hydrolase activity and peptidoglycan muralytic activity (Supplementary Figure S4a). The GO term GO:0004175 corresponds to endopeptidase activity, whose ontology by molecular function describes this protein as "Catalysis of the hydrolysis of internal, alpha-peptide bonds in a polypeptide chain". For both lysozyme activity (GO:0003796) and chitinase activity (GO:0004568), all the relationships in the Ancestor Chart are "Is a", including the molecular function of hydrolase activity and peptidoglycan muralytic activity (Supplementary Figure S4b). For endopeptidase activity (GO:0004175), most of the relationships are "Is a", including the molecular function of peptidase activity and catalytic activity, acting on a protein and hydrolase activity; but it is also possible to see one "Part of", showing the biological process of proteolysis (Supplementary Figure S4c).
Although endolysins included into Group 1-5 have different nucleotides and amino acid sequences, they fall into the same class. All of them are lysozymes/muramidases and can hydrolyse β-1,4 linkages between N-acetylmuramic acid and N-acetyl-D-glucosamine residues in peptidoglycan as well as inside N-acetyl-D-glucosamine residues. In the case of the chitinases (Group 4), the enzyme binds to chitin and randomly cleaves glycosidic linkages in chitin and chitodextrins in a non-processive mode, generating chitooligosaccharides and free ends on which exo-chitinases and exo-chitodextrinases can act [70]. Remarkably, Group 6 includes one endopeptidase identified from Klebsiella phage ST846-OXA48phi9.1 (GenBank: MK416021), which share a higher identity (higher than 80%) with endolysin R21 like-proteins despite their different function. The endolysins identified here had a predicted molecular weight between 16.06 and 24.31 kDa, consistent with the molecular weight described in the literature for phages of gram-negative bacteria [71].

Discussion
The spread of highly virulent and antibiotic-resistant K. pneumoniae strains, both in hospitals and natural environments, requires more knowledge about Klebsiella prophages as mediators of gene transfer often offering advantageous features to the host, as well as their antibacterial potential (including gene-encoded products involved in host lysis called endolysins). Particular attention has been given to phages and endolysins which demonstrate activity on highly virulent and multidrug-resistant pathogens including K. pneumoniae [32].
In this study, we analysed 40 recently sequenced K. pneumoniae clinical strains and found prophages in all genomes. Almost all genomes harbored more than one prophage, consistent with the fact that K. pneumoniae is one of the species with more prophages among widely sequenced bacteria [72,73], suggesting that prophages are important for its biology. Since prophages are involved in the transduction of genetic material horizontally, the presence of the same prophages in different isolates indicates their horizontal movement and importance in genomic plasticity or evolution [74]. Most prophages were intact (70.9%), which may indicate a recent integration, and 29.1% were defective (incomplete or questionable). Incomplete and questionable prophages often lack essential phage functions [73] and therefore our further analysis was focused on intact prophages. Only a few studies have characterized the prevalence of prophages in K. pneumoniae species, although they are genetic elements that significantly contribute to genome variability, evolution, and virulence of their bacterial hosts [32,55,56,72,73,75].
In our analysis, the size of K. pneumoniae prophage genomes varied from 8.9 to 60.8 kbp, with an average of 37.4 kbp, which agrees with the literature for what has already been described for K. pneumoniae and enterobacteria [73,75,76]. Using an in silico approach, among the 104 intact prophages, we found Myoviridae to be the most represented family (59.6%), followed by Siphoviridae (38.5%) and Podoviridae (1.9%). The same distribution was observed in other studies [55,56]. Myoviridae, which are typically the largest phage family, had size genomes below average, while Siphoviridae had sizes slightly above the average. PHASTER analysis of draft genomes (especially if these were distributed through different contigs) could erroneously delimit prophages, for which all prophage sequences analysed were manually curated. Although we have manually curated the prophage insertion sites and scaffolded prophages that were split in several contigs, this may result in unexpected or variable genome sizes. These differences may also result from the acquisition of bacterial genes adjacent to the prophage during repeated excision and integration cycles or loss and genetic degradation that caused the reduction of its genome size [76].
Prophages of K. pneumoniae were found to be greatly similar. Our comparison of 104 intact prophages revealed that some prophages shared more than 50% genome identity, indicating strong evolutionary relationships. Comparing our prophage genomes against public databases, we found 17 Klebsiella phages which share similarity with the 104 intact prophages in terms of query coverage and identity (in some cases higher than 60%) and this helped us to identify nine clusters composed by identical prophages. Moreover, related clusters tend to group prophages of the same family. However, at same time, the prophages were also greatly diversified. A comparison of these nine clusters revealed less than 30% of genome identity and we found a Pseudomonas phage VW-6B that shares an identity higher than 31% with one of the prophages identified here. This may indicate that the common ancestor of these two species was infected by a phage that co-evolved with the host bacteria during speciation, or by phage transfer between species.
K. pneumoniae prophages have an open pan-genome, meaning that for each new prophage genome added, new genes contribute for the pan-genome. Thus, the inclusion of more prophages is expected to raise the number of the pan-genome size of 892 genes so far determined, which is also corroborated by the high percentage of singleton genes (43.6%, 389/892). On the other hand, the reduced number of core genes points to high sequence diversity, only preserving essential structural genes.
Some prophages carry genes that can alter the features of the host, ranging from increased host fitness to increased virulence, and many studies have reported the connection of the pathogen virulence to the acquisition of prophages [21,77]. In fact, even defective prophages are considered as potential mobile elements carrying virulence factors [25].
Thus, although open-access research tools did not find virulence factors, a detailed analysis showed otherwise, revealing several potential virulence factors that can be related to bacteria fitness and influence the ability of the bacterium to colonize its host and survive in adverse environments.
Prophages and their bacterial hosts have common evolutionary interests since the proliferation of the host also results in increased prophage population. Thus, some prophages provide the bacterium beneficial traits, such as increased fitness, and confer new virulence factors and/or antibiotic resistance genes exploited for bacterial pathogenesis [25,76]. Accordingly, we identified several putative virulence factors, such as TraR/DksA family transcriptional regulator, membrane-associated lipoprotein, molecular chaperone DnaJ and other proteins with functions in persistence under stress conditions, interaction with host cells and regulation of virulence gene expression. TraR family regulators may also play a role in prophage propagation by interfering with the host mechanisms of regulation, increasing the bacterial conjugation, and improving the transmission of the prophage by lateral gene transfer between bacteria [59]. Taken together, K. pneumoniae may benefit from carrying a prophage due to the putative beneficial genes carried by them. Thus, K. pneumoniae prophages may confer an evolutionary fitness benefit to the host due to the presence of virulence factors, which should be further study. On the other hand, our study confirmed the previous description by Perdigão et al. (2020), where patterns of resistance-conferring genes related to antimicrobial resistance were associated with chromosomal mutations and plasmid mobilization [35]. The prophages sequences described here did not contain the genes responsible by antibiotic resistance in these K. pneumoniae isolates.
Prophages have coevolved with bacteria for more than a billion years and have developed efficient strategies to lyse and thus kill their bacterial host at the end of the lytic cycle for progeny release [16]. Endolysins are proteins used in this lytic process and have been used in many scientific works for the development of antibacterial therapeutics [71,[78][79][80]. Depending on cleavage sites, they can be categorized into four different groups: (a) N-acetylmuramidases (lysozymes), (b) N-acetyl-β-D-glucosaminidases (glycosidases) (c) N-acetylmuramoyl-L-alanine amidases and (d) endopeptidases [71,81]. In this study, we exploited the K. pneumoniae sequenced genomes to identify endolysins in their prophages and characterized them. We identified 132 endolysins (115 endolysins from our prophage genomes and 17 lysins harbored by the most related phages and the endolysins were assigned to three groups: lysozymes/muramidases, glycosidases/chitinases and endopeptidases. The group of lysozymes included endolysins related to endolysin LyzP1 and R21, which are two lysozymes of phages P1 and 21, respectively, and were the first endolysins showing a signal-anchor-release (SAR) domain [82]. In contrast to phage lysozymes like T4, which accumulate a fully folded and enzymatically active endolysin in the cytoplasm, SAR endolysins are the first endolysins secreted as enzymatically inactive form anchored to the membrane by the N-terminal SAR domain to avoid premature lysis of the infected host [83]. The group of chitinases are members of glycoside hydrolase (GH) family 18 and 19 and are also a broad, lysozyme-like superfamily cleaving chitin, which is the second most abundant biopolymer on the planet and is a linear, insoluble homopolymer composed of β-1,4 linked subunits of N-acetyl glucosamine polymers, a structure uncommon in bacterial cell walls. Most of the bacterial chitinases isolated and sequenced so far are included in GH family 18, have a molecular weight range of 20-60 kDa and are smaller than plant chitinases (40-85 kDa) [70], which agrees with the molecular weight predicted here. The third group included endopeptidases, but this group was only identified in the related Klebsiella phages, 48ST307 and ST846-OXA48phi9.1.
Our study shows that the diversity of K. pneumoniae prophages is vast and continues to expand. In addition, this genome analysis serves as a basis for the characterization and evolutionary relationship of prophages harbored by K. pneumoniae and identification of relevant proteins, such as endolysins, which may have biomedical applications for new and protein-based antimicrobials. Future work will focus on cloning and characterizing K. pneumoniae endolysins and studying their lysis activity and potential as bactericidal products.