Analysis of 19 Highly Conserved Vibrio cholerae Bacteriophages Isolated from Environmental and Patient Sources Over a Twelve-Year Period

The Vibrio cholerae biotype “El Tor” is responsible for all of the current epidemic and endemic cholera outbreaks worldwide. These outbreaks are clonal, and it is hypothesized that they originate from the coastal areas near the Bay of Bengal, where the lytic bacteriophage ICP1 (International Centre for Diarrhoeal Disease Research, Bangladesh cholera phage 1) specifically preys upon these pathogenic outbreak strains. ICP1 has also been the dominant bacteriophage found in cholera patient stools since 2001. However, little is known about the genomic differences between the ICP1 strains that have been collected over time. Here, we elucidate the pan-genome and the phylogeny of the ICP1 strains by aligning, annotating, and analyzing the genomes of 19 distinct isolates that were collected between 2001 and 2012. Our results reveal that the ICP1 isolates are highly conserved and possess a large core-genome as well as a smaller, somewhat flexible accessory-genome. Despite its overall conservation, ICP1 strains have managed to acquire a number of unknown genes, as well as a CRISPR-Cas system which is known to be critical for its ongoing struggle for co-evolutionary dominance over its host. This study describes a foundation on which to construct future molecular and bioinformatic studies of these V. cholerae-associated bacteriophages.


28
Vibrio cholerae is a globally-distributed bacterium and the causal agent of the disease cholera, a 29 potentially severe intestinal illness that affects ~1-5 million people resulting in up to ~140,000 deaths 30 annually [1]. The current (seventh) pandemic is comprised of V. cholerae biotype 'El Tor' 31 (predominantly serotype O1) and is responsible for current endemic and epidemic disease [2].

32
Epidemic disease outbreaks sweep the globe periodically and have been traced back to a single 33 lineage that has emerged from the Bay of Bengal region in multiple waves over the last half-century 34 [3]. Despite the overall genetic heterogeneity of this lineage, in which individual outbreaks are 35 nearly always clonal [4], there is an abundance of subtle variation and horizontal transfer that has 36 been observed between outbreaks over time [5,6]. It is hypothesized that the Bay of Bengal serves as 37 a reservoir where El Tor strains circulate throughout the year exchanging genetic material and 38 undergoing ecological selection before infiltrating coastal communities. They are subsequently 39 transported by infected individuals to larger cities where they can be transmitted globally [5,7]. This 40 mechanism is thought to create a bottleneck for strains and result in the clonality of outbreaks.

41
Bacteriophage, viruses that uniquely infect bacteria, are extremely abundant in the 47 In Bangladesh, ICP1 has been found in water samples [12,13] and it has been identified as the 48 dominant phage in cholera patient stool samples since 2001 [11]. The persistence of this phage over 49 time indicates that V. cholerae has strategies to limit ICP1 predation, and that ICP1 can evolve to 50 overcome such defenses. Indeed, from this natural genetic laboratory, several complex and 51 surprising adaptations/acquisitions have occurred in the race for survival between V. cholerae and 52 ICP1. These include self-mobilizing chromosomal islands that can provide a rapid and efficient 53 response to ICP1 infection [14] and the first known example of a bacteriophage-encoded 54 CRISPR-Cas system [15]. Initial characterization of eight ICP1 isolates collected between 2001-2011 55 noted the relative low level of diversity and lack of major genomic rearrangements, deletions or 56 insertions [11] (with the exception of its remarkable CRISPR-Cas acquisition [15]). Here we build

66
Nineteen ICP1 bacteriophage genomes were acquired from various sources (Table 1). Those 67 genomes not specifically described below were downloaded from the NCBI GenBank database.

68
Their metadata (if available) were used to inform our reporting of isolation source and isolation 69 year. In cases where this information differed from what was reported in a genome's original 70 publication we deferred to the GenBank database. Isolation year was used to standardize the ICP1 71 bacteriophage naming convention: ICP1_YEAR_X where "X" is sequentially assigned letter (A-Z) 72 based on the order in which genomes were named. The only exception is the original, 'ancestral' 73 ICP1 isolate, which is simply referred to as 'ICP1' [11].

74
Genomes not already available on GenBank include: ICP1_2006_E and ICP1_2011_A, which 75 were isolated and sequenced as described previously [15]. ICP1_2011_A was assembled using CLC

90
--rand_start --n_rand_starts 10 -b 100). A companion phylogeny based on the concatenated 91 core-genome (described below) was constructed using the same methods. Dendroscope3 v3.5.9 [23] 92 was used to generate a tanglegram joining both trees with ICP1 ancestral set as the outgroup. The 93 whole-genome alignment was also used to determine a consensus ICP1 sequence using CLC performed on all genomes to find additional possible coding sequences using Prodigal v2.6.3 [26] 103 with default settings and a confidence cutoff ≥95%. All newly identified putative-ORFs from each 104 genome were then blasted against each other with the same conditions as above, grouped by 105 homology and given an iterative numerical identifier based on their locations in the genome 106 relative to the extant ORFs (i.e. ORF1, ORF2, ORF2.1, ORF2.2, ORF3, etc.). ORFs were then 107 categorized into groups based on how many of the 19 genomes they were found it. If an ORF was 108 present in all 19 genomes, it was considered part of the core-genome, otherwise, it was considered 109 part of the accessory-genome. Core and accessory ORFs were also mapped to the BRIG alignment.

110
The core-genome ORF protein sequences were concatenated by strain to create a core-genomic, 111 pseudo-genome for each ICP1 strain and subjected to the phylogenetic analysis as above.  Overall, the topologies between whole-genome and core-genome trees were highly similar 141 with a few minor exceptions. Both share identical clustering and have almost no leaf-level 142 differences within clusters B, C and D. Cluster A showed an inversion of the phylogenetic 143 differences of the leaves between the two alignments. For instance, ICP1_2012_A has the most 144 divergent core-genome (from all other strains) but is the least divergent within its cluster when 145 the whole-genome was considered.

183
Despite their occurrence across all genomes in this study, the core-genome ORFs were not all 184 equally conserved at the sequence level. By comparing average pairwise nucleotide and amino acid 185 sequence identity, we were able to resolve the core ORFs into three distinct groups: conserved-core,

186
synonymous-core and divergent-core ( Figure 4). The conserved-core was comprised of 49 ORFs 187 that shared perfect sequence identity among all the genomic sequences. Correspondingly, they 188 shared identical amino acid sequence identity. The synonymous-core included 26 ORFs that 7 of 11 possessed a small amount of nucleotide diversity between genomes, but all mutations were silent, 190 and the amino acid primary structure was therefore identical. Finally, the divergent-core contained 191 the remaining 110 ORFs which were diverse in both nucleotide and amino acid pairwise identity.

192
These divergent ORFs encompassed a range pairwise identity similarities from 99.97% to 91.57% at 193 the nucleotide level and 99.98% to 91.83% for amino acid sequences (Table S1). The degree of 194 pairwise identity difference was not correlated with an ORF's sequence length (Figure 4).

203
The accessory ORFs were distributed among all 19 genomes in a complex manner (Table S2) 204 and grouped into 15 levels of occurrence. These ranged from occurrences in 18 genomes (n=12) to 205 singletons (n=14) (Figure 3) with several patterns of accessory ORF co-occurrence (

212
The majority of ORFs in both the core and accessory genomes were classified as 213 hypothetical due to a lack of an informative BLAST identification. Only 18 of the core ORFs 214 (9.7%) currently had a predicted function, two in the conserved-core, three in the 216 with a putative or known function (

228
We have found that the genomes of ICP1 are surprisingly well-conserved between all isolates 229 over the twelve-year period in which they were isolated. This is demonstrated in the whole-genome

248
Despite the stable conservation of the core-genome, ICP1 also possesses a diverse collection of 249 accessory genes that have been integrated into several locations around the pan-genome (Figure 2).

250
These accessory ORFs appear to be both single acquisitions as well as integrations of larger loci 251 (Table S2). The most notable of the latter is the previously described acquisition of a CRIPSR-Cas 252 system which is used as a weapon in the arms race between ICP1 and V. cholerae [14,15]. This may 253 suggest that other mechanisms of molecular warfare would be likely targets for acquisition, but if 254 so, the conserved domain analyses failed to reveal mechanisms already known. Interestingly, 255 though, the trend has been for genomic ORF counts to diminish over time culminating in the latest 256 isolate from India in 2012 which possessed the fewest overall number of ORFs, i.e. the smallest 257 accessory-genome ( Figure S1). This could be due to shedding of outdated or detrimental genes, 258 possibly because the acquisition of a CRISPR-Cas system provided enough flexibility to obviate the 259 need for other mechanisms. And although ICP1_2012_A is CRISPR-Cas negative, the other five 260 most recent isolates are positive. It is also possible that this is a regional difference though, and 261 more sequenced genomes will help to determine if this trend is chronological or geographical or 262 spurious.

271
As we advance our understanding of how cholera spreads globally, it will be important to also 272 continue tracking ICP1's phylogeny and genetic composition so that we may develop a better 273 understanding of its co-evolution with V. cholerae and attempt to disentangle the complex molecular 274 and ecological interactions that may play an important role in defining cholera outbreaks.

275
Supplementary Materials: The following are available online at www.mdpi.com/xxx/s1, Figure S1: ORF 276 occurrence trends by year and genome length, Table S1: Core-genome ORFs pairwise similarity and CDD hits, 277              Divergent Core Table S1: Core-genome information. The data for each ORF represented in Figure 4 is listed in columns 2 and 3. Columns 4 and 5 contain the standard deviations for those pairwise similarity values. The other columns contain information about putative gene function and conserved domain homology.