2.1. Identification of an Evolutionarily Young Gammaretroviral ERV in Polar Bears
A virome study examining the brain tissue of two polar bears, Knut of the Berlin Zoological Garden (Berlin, Germany) and Jerka of the Wuppertal Zoological Garden (Wuppertal, Germany), identified sequences with 73% to 80% identity, at the DNA level, to porcine endogenous retrovirus (PERV),
Mus dunni endogenous retrovirus (MDEV), Feline sarcoma virus (FeLV), and Murine leukemia virus (MLV) genes [
23].
A blastn search was performed on all 72,214 scaffolds of the draft polar bear genome sequence using, as query sequences, brain derived transcriptome retroviral sequences with the highest percentage similarities to MDEV and PERV [
21]. In the RNA-seq virome data, MDEV and PERV sequences attracted the most non-overlapping transcriptome reads, which derived from different parts of the retroviral genome, and were therefore analyzed further [
23]. The blastn search revealed four loci in four different polar bear scaffolds with significant similarity to the potential retroviral sequences (
Table S1). Presumed endogenous retrovirus-harboring loci were extracted from the respective polar bear scaffold sequences, including ~15,000 bp both upstream and downstream of the blastn-identified scaffold regions. The exact boundaries of the identified ERV loci were then determined using Repeatmasker (
Table S2), and additional sequence comparisons, as described in [
32,
33,
34]. Repeatmasker results indicated that the identified proviruses belonged to the gammaretrovirus genus of retroviruses. The proviruses were annotated as koala retrovirus (KoRV) sequences, specifically KoRV_I-int, indicating high similarity to KoRV (
Table S2) [
35,
36].
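The extraction step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' script: the function names `flanking_window` and `extract_region` are hypothetical, coordinates are assumed to be 1-based and inclusive as in blastn tabular output, and the window is assumed to be clamped to the scaffold ends when fewer than 15,000 bp of flank are available.

```python
def flanking_window(hit_start, hit_end, scaffold_length, flank=15_000):
    """Return the 1-based inclusive (start, end) of a hit plus its flanks.

    Assumption: the window is clamped to the scaffold boundaries when the
    hit lies closer than `flank` bp to either end of the scaffold.
    """
    # Minus-strand blastn hits are reported with start > end, so sort first.
    low, high = sorted((hit_start, hit_end))
    start = max(1, low - flank)
    end = min(scaffold_length, high + flank)
    return start, end


def extract_region(scaffold_seq, start, end):
    """Slice a scaffold string using 1-based inclusive coordinates."""
    return scaffold_seq[start - 1:end]


if __name__ == "__main__":
    # Hypothetical hit at 20,000-21,000 on a 100,000 bp scaffold.
    print(flanking_window(20_000, 21_000, 100_000))
```

A hit close to the start of a short scaffold simply yields a window beginning at position 1, so the flank on that side is shorter than requested.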
The Retrotector script was used as an independent verification method for the presence of retroviral motifs in the polar bear scaffolds. Retrotector analysis revealed retroviral motifs similar to known gammaretroviral sequences [
33,
34,
37] identifying proviral genes
gag, protease (
pro), polymerase (
pol), and envelope (
env) as shown in
Figure 1. A dUTPase domain was not identified. Retrotector generated reconstructed retroviral protein sequences (called puteins) from the defective reading frames of the proviral loci by introducing 10 frameshifts. GAG and PRO puteins were predicted by Retrotector for all four scaffolds, with percentage identities of 94.8% and 80%, respectively. The majority of the differences found in the alignments were in scaffold 7, which also contains three frameshifts in the GAG putein (
Figure 1). Excluding scaffold 7 sequences, the percentage sequence identity increased to 99% for both alignments. Putein sequences for the
pol genes of scaffold 1 and 162 were generated from provirus regions harboring assembly gaps and were therefore incomplete, while proviral sequences within scaffold 7 and scaffold 200 appeared to be intact with regard to the
pol gene region with a percentage identity of 95.6% (
Figure 1). Retrotector analysis corrected 5 frameshifts during the scaffold 1 POL putein prediction, further indicating the incompleteness of the identified ORF. The
pol gene in scaffold 1 was interrupted by an inserted Gypsy-like element and a SINE-MIRb element encompassing nucleotides 4652–5549 of the provirus.
Figure 1.
Multiple sequence alignments of the four UrsusERV proviral loci as identified in the polar bear genome sequence are shown. Proviral long terminal repeat (LTR), gag, pro, pol and env regions are indicated. Scaffold 1 and scaffold 162 proviral sequences harbor assembly gaps and lack proviral pol and env regions, indicated by horizontal black lines. The scaffold 200 provirus is structurally complete and could potentially produce functional proteins. Sequence differences relative to the consensus sequence are indicated by black vertical lines, with the scaffold 7 provirus exhibiting the greatest overall divergence. Only the scaffold 200 provirus harbors a complete env gene, which is partially present in scaffold 162 and scaffold 7 and completely absent from scaffold 1.
Motifs conserved in retroviral envelope proteins were detected by Retrotector in all proviral sequences, with the exception of the scaffold 1 proviral sequence that was lacking the
env gene region. The
env gene putein domains in scaffolds 7 and 162 appeared to be affected by mutations. Generation of puteins for both scaffolds identified frameshifts, whereas the provirus in scaffold 200 displayed an intact ENV ORF (
Figure 1). Envelope puteins, as generated by Retrotector, displayed 91.8% sequence identity among alignable regions.
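To make the phrase "identity among alignable regions" concrete, the following Python sketch computes percentage identity between two rows of an alignment while skipping columns in which either sequence has a gap. It is an assumption-laden illustration: the function name `percent_identity` is hypothetical, the example sequences are invented, and the reported values in the text were obtained across multi-sequence putein alignments whose exact identity definition is not stated here.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage identity over columns where both sequences are ungapped.

    Assumption: columns with a gap in either row are treated as
    non-alignable and excluded from both numerator and denominator.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip non-alignable column
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        return float("nan")  # nothing to compare
    return 100.0 * matches / compared


if __name__ == "__main__":
    # Invented toy alignment: 4 matches over 5 ungapped columns.
    print(percent_identity("MKT-LV", "MRTALV"))
```

Counting gapped columns as mismatches instead would lower the value, which is one reason identity figures are only comparable when the same convention is used.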
The National Center for Biotechnology Information (NCBI) Conserved Domain database analysis of proviral scaffolds identified the same motifs and frameshifts as Retrotector for all four proviral sequences and further highlighted the degree of similarity with respect to conserved motifs in other retroviral proteins [
38]. Further analyses of all proviral loci indicated tRNA
Pro as the primer binding site for viral replication initiation. Sequence analysis further indicated that proviral loci in scaffold 1, 7 and 162 are unlikely to produce functional viruses due to frameshifts, deletions, insertions, and repetitive elements that interrupt the sequences. The provirus in scaffold 200, on the other hand, harbored intact retroviral motifs in the correct reading frame, an indication that functional viral proteins could potentially be produced from this locus (
Figure 1).
A majority rule proviral consensus sequence of 7878 nt was generated from the four identified proviral loci. A consensus sequence that included LTR,
gag,
pro,
pol, and partial
env gene characteristics was used for additional blastn searches of the polar bear draft genome sequence scaffolds. Blastn search using the consensus sequence identified no additional proviral loci. However, 15 solitary LTRs were identified. A multiple sequence alignment of identified UrsusERV LTR sequences (proviral and solitary) revealed two subgroups with group-specific nucleotide differences that we designated LTR-A and LTR-B (
Figure 2). Out of a total of 23 LTRs, 12 belonged to the LTR-A subgroup, 10 belonged to the LTR-B subgroup and one appeared to be a recombinant. The scaffold 1 and 200 proviruses were flanked by LTR-A and the scaffold 7 provirus was flanked by LTR-B.
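A majority rule consensus of the kind described above can be sketched as follows. This is a simplified illustration rather than the tool used in the study: the function name `majority_consensus` is hypothetical, gaps are assumed to count as a character state (columns where the gap is most common are dropped), and ties are assumed to be written as `N`.

```python
from collections import Counter


def majority_consensus(aligned_seqs, gap="-", ambiguous="N"):
    """Build a majority rule consensus from equal-length aligned sequences.

    Assumptions: the most frequent character in each column wins, columns
    whose most frequent character is a gap are omitted, and ties between
    the two most frequent characters are emitted as `ambiguous`.
    """
    if not aligned_seqs:
        raise ValueError("at least one sequence is required")
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("aligned sequences must have equal length")

    consensus = []
    for column in zip(*aligned_seqs):
        ranked = Counter(column).most_common(2)
        if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
            consensus.append(ambiguous)  # no single majority character
        elif ranked[0][0] != gap:
            consensus.append(ranked[0][0])
        # otherwise the gap is the majority state: drop the column
    return "".join(consensus)


if __name__ == "__main__":
    # Invented toy alignment of four sequences.
    print(majority_consensus(["ACGT-A", "ACGTTA", "ACCT-A", "TCGT-A"]))
```

With only four proviral loci, ties are more likely than in larger alignments, so the tie-handling convention has a visible effect on the resulting consensus.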
Figure 2.
Multiple sequence alignments of UrsusERV LTRs, as identified in the polar bear genome sequence, are shown. LTR-A represents the first group of LTRs while LTR-B represents the second. Proviral 5′ LTRs and 3′ LTRs are indicated by numbers 1 and 2, respectively, in the sequence names. Solitary LTRs are indicated accordingly. Some identified LTRs only comprised sub-regions of the full-length LTR. Colored vertical lines represent nucleotide differences among the LTR sequences with respect to a majority rule consensus sequence (C: blue; G: yellow; A: red; T: green). Horizontal lines indicate gaps in the alignment while dark grey shaded regions indicate assembly gaps in some of the LTR sequences. The 3′ LTR in the scaffold 162 provirus is a recombinant between an LTR-A and LTR-B type LTR.
We observed that the scaffold 162 provirus was flanked by a 5′ LTR characteristic of an LTR-A sequence, yet the 3′ LTR was characteristic of an LTR-B sequence for the 5′ most 200 nt and an LTR-A-like sequence for the 3′ most 300 nt. UrsusERV LTR sequences were therefore analyzed for recombination events using four computational methods, specifically, GARD, DualBrothers, Recombination Analysis Tool (RAT, Norwich, UK) and RECCO (Saarbrücken, Germany) [
39,
40,
41,
42,
43]. All four methods identified a recombination event for the 3′ LTR in the scaffold 162 provirus occurring between nt 177 and 207 of the 3′ LTR sequence (
Figure S1). Phylogenetic analysis by the GARD recombination script clustered the first sub-region of the scaffold 162 LTR-A/B, ranging from 1 to 177 bp, within the LTR-B clade, while the second region, ranging from 207 bp onwards, clustered within the LTR-A clade, supporting a recombination event (
Figure S1). The RECCO computational method confirmed the presumably recombined LTR sequence (“scaffold162_1-18,384_LTR-A/B” in
Figure S1) with a
p-value < 0.001. RECCO also indicated another potential recombination event in the Scaffold 143 LTR (
Table 1).
Table 1.
RECCO analysis identified potential recombination events in two of the UrsusERV LTRs *.
Sequence | Start | End | Savings | Sequence p-value |
---|---|---|---|---|
scaffold143_3,240,578-3,261,249_LTR-A_solitary | 311 | 311 | 8 | 0.000999 |
scaffold162_1-18,384_LTR-A/B | 186 | 216 | 30 | 0.000999 |
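The logic behind locating a recombination breakpoint between two LTR subgroups can be illustrated with a toy Python sketch. It is not GARD, DualBrothers, RAT or RECCO and performs no statistical testing: it only labels each diagnostic alignment column (where the two group consensus sequences differ) by the group the query matches, then reports the interval in which the label switches. The function names and the example sequences are hypothetical.

```python
def parental_calls(query, ref_a, ref_b, gap="-"):
    """Label diagnostic columns of three aligned sequences as 'A' or 'B'.

    Only columns where ref_a and ref_b differ are informative. Columns
    containing a gap, or where the query matches neither reference, are
    skipped. Positions are 1-based alignment columns.
    """
    calls = []
    for pos, (q, a, b) in enumerate(zip(query, ref_a, ref_b), start=1):
        if a == b or gap in (q, a, b):
            continue
        if q == a:
            calls.append((pos, "A"))
        elif q == b:
            calls.append((pos, "B"))
    return calls


def breakpoint_intervals(calls):
    """Return (last_pos_before, first_pos_after) for each label switch."""
    intervals = []
    for (pos1, label1), (pos2, label2) in zip(calls, calls[1:]):
        if label1 != label2:
            intervals.append((pos1, pos2))
    return intervals


if __name__ == "__main__":
    # Invented sequences: the query resembles B at first, then A.
    ref_a = "AAAAAAAAAA"
    ref_b = "CACACACACA"
    query = "CACAAAAAAA"
    print(breakpoint_intervals(parental_calls(query, ref_a, ref_b)))
```

A breakpoint can only be narrowed to the stretch between two neighbouring diagnostic sites, which is why recombination events are reported as intervals rather than single positions.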
Target site duplications flanking proviral LTRs were also examined. Retroviral integration generally generates a target site duplication (TSD) of 4–5 bp immediately flanking the provirus [
44]. Four bp TSDs were identified for the four proviral insertions and in 10 out of 15 solo LTRs (
Table 2). The remaining five LTR sequences were identified in small scaffold sequences with TSD information being unavailable in the current draft genome version.
Table 2.
UrsusERV LTR and solitary LTR 4 bp target site duplication (TSD) sequences.
Polar Bear Scaffolds | TSD Sequence |
---|---|
Scaffold 1 | CATC/CATC |
Scaffold 7 | ACTT/ACTT |
Scaffold 162 | AGGC/AGGC |
Scaffold 200 | ATGG/ATGG |
Scaffold 182 | CTTC/CTTC |
Scaffold 30 | CTCT/CTCT |
Scaffold 80 | GTGT/GTGT |
Scaffold 42 | ATCT/ATCT |
Scaffold 52 | TATC/TATC |
Scaffold 217 | GTAG/GTAG |
Scaffold 143 | TGAA/TGAA |
Scaffold 72 | GCAC/GCAC |
Scaffold 155 | AGAC/AGAC |
Scaffold 5 | GTAT/GTAT |
Scaffold 18,513 | N/A |
Scaffold 18,762 | N/A |
Scaffold 23,372 | N/A |
Scaffold 35,249 | N/A |
Scaffold 62,681 | N/A |
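A check for target site duplications like those listed in Table 2 can be sketched in Python by comparing the bases immediately upstream of the 5′ LTR with those immediately downstream of the 3′ LTR (or either side of a solitary LTR). This is a minimal sketch under stated assumptions: the function name `find_tsd` is hypothetical, element coordinates are 1-based and inclusive, only exact matches are accepted, and an element too close to a scaffold end returns `None`, analogous to the N/A entries.

```python
def find_tsd(scaffold_seq, element_start, element_end, sizes=(4, 5)):
    """Return the target site duplication flanking an element, or None.

    element_start and element_end are 1-based inclusive coordinates of the
    full element (5' LTR start to 3' LTR end, or a solitary LTR).
    Assumption: a TSD is an exact repeat of one of the given sizes, and
    flanks containing N (unresolved bases) are not accepted.
    """
    for size in sizes:
        left_begin = element_start - 1 - size
        right_end = element_end + size
        if left_begin < 0 or right_end > len(scaffold_seq):
            continue  # not enough flanking sequence on this scaffold
        left = scaffold_seq[left_begin:element_start - 1]
        right = scaffold_seq[element_end:right_end]
        if left == right and "N" not in left.upper():
            return left
    return None


if __name__ == "__main__":
    # Invented scaffold: 10 bp element flanked by the 4 bp repeat CATC.
    scaffold = "GGGCATC" + "TGTAGAAACA" + "CATCTTT"
    print(find_tsd(scaffold, 8, 17))
```

Exact matching is a deliberately strict choice; older insertions may carry mutations in one copy of the duplication, which a tolerance for one mismatch would accommodate.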
2.3. UrsusERV Molecular and Genome Screening of Bear Species
We further investigated the presence of UrsusERV sequences in additional polar bears and other bear species. Majority rule consensus sequences of proviral and LTR sequences were used to screen the giant panda draft genome assembly (BGI-Shenzhen AilMel 1.0 December 2009) available through the University of California, Santa Cruz (UCSC) Genome Browser [
48]. Blat searches failed to identify significantly similar sequences indicating that UrsusERV is not present in all members of the Ursidae family, further supporting the age estimations based on LTR sequence divergence. We next examined other bear species, specifically whole genome sequencing data from black bear (
Ursus americanus), brown bear (
Ursus arctos), grizzly bear (
Ursus arctos ssp.), an ancient polar bear jawbone [
28,
31], and six unrelated modern polar bears that were obtained from the National Center for Biotechnology Information (NCBI) Sequence Read Archive [
27]. Short read data were transformed to fastq files and mapped to the four polar bear retroviral scaffolds using the Burrows–Wheeler aligner [
49]. Analyses of the polar bear mapped reads indicated the presence of all four proviruses in the six polar bear whole genome sequenced animals. Reads identical to UrsusERV proviral sequence were also identified in the ancient polar bear jawbone sequence data. Genome sequence data from the black and brown bears identified sequences with 93.9% identity to the identified consensus proviral sequence in both bear species indicating that UrsusERV is present in the
Ursinae clade of the
Ursidae family [
50]. This also suggests that integration of UrsusERV-like viruses in bears occurred earlier than 2 million years ago (mya; the age of the oldest dated provirus), as the black bear and brown/polar bear clades diverged from a common ancestor over 7 mya.
All the UrsusERV sequences flanking the polar bear proviral LTRs in the polar bear draft genome sequence were identified in the other six polar bear whole genome draft sequences. The scaffold 200 3′ proviral insertion site was the only one identified in the 130,000-year-old polar bear jawbone sample, providing evidence that the same insertion site for at least one provirus was present in the ancient polar bear population. Sequence coverage was not as comprehensive as for the modern bear sequences and thus we cannot exclude that the other proviruses were present in this sample [
31]. Brown and black bear whole genome draft sequences were also screened for the proviral insertion sites identified in polar bears. While the proviral flanking sequences could be identified in the black bear genome, they lacked a proviral integration, i.e., the flanking sequences were uninterrupted by UrsusERV. Therefore, while black bears are positive for UrsusERV sequences, the virus has integrated in distinct locations in the genome relative to polar bears. This could in part explain the differences in age estimates between the polar bear derived ERVs and the divergence time of black bear and brown bear/polar bear lineages, in addition to the overall lower percentage identity of UrsusERV derived from black bears relative to those found in brown or polar bears. Out of the four female brown bears examined, two had the same UrsusERV insertion sites as polar bears for all the proviruses except for the provirus located in scaffold 200, which was polar bear specific, as all brown bears tested lacked an integration at this genomic location although the flanking sequences were identified (
Table S3). Two additional brown bear genome sequences, from Kenai and Admiralty Island, were found to have the same provirus insertions found on polar bear scaffolds 7 and 162, while the polar bear scaffold 1 integration was absent, indicating insertion site polymorphisms in brown bears for the scaffold 1 proviral integration (
Table S3) [
51].
Molecular screening for UrsusERV sequences was also performed by sequence-specific PCR using genomic DNA extracted from a giant panda, spectacled bear (
Tremarctos ornatus), Syrian brown bear (
Ursus arctos syriacus), black bear, and polar bear to further verify the bioinformatic analysis. Genomic DNA was subjected to PCRs using primers designed to amplify a 312 bp region within the
gag gene region showing little sequence divergence between three bioinformatically identified UrsusERV proviruses. PCR products of expected sizes were amplified from polar bear, Syrian brown bear and black bear, but not from spectacled bear and giant panda (
Figure 3). Sanger sequencing of the PCR products confirmed that the target
gag region of UrsusERV was amplified (
Figure S2). This approach further demonstrates that UrsusERV is absent from the panda and spectacled bear genomes but present in all examined members of the
Ursinae subfamily of bears.
Figure 3.
Gel electrophoretic separation of PCR products amplified from five bear species. PCR products and a 100 bp DNA ladder were separated through a 1.5% agarose gel. As indicated by a PCR product of an expected size of 312 bp, UrsusERV sequences were present in the examined polar, brown and black bears but absent from spectacled bears and the giant panda. Relevant sized DNA marker bands are indicated on the left.
2.4. UrsusERV Phylogenetic Analysis
We next examined phylogenetic relationships of UrsusERV with other retroviruses. Protein consensus sequences were generated for the GAG, POL, and ENV genes using the Retrotector putein predictions. The protein consensus sequences were searched against the NCBI protein database and sequences with identity >35% were extracted. The UrsusERV consensus, individual scaffold protein sequences, and protein sequences extracted from NCBI protein database were aligned using MAFFT [
52]. Resulting protein alignments included murine, primate, koala, bat, avian, pig, and feline gammaretroviruses. Galidia ERV protein sequences, a basal class 1 gammaretroviral group, were used as an outgroup to perform Bayesian phylogenetic analysis (
Figure 4) [
6]. GAG and POL trees displayed almost identical topologies, with UrsusERV and the gibbon ape, koala, swine, bat, and killer whale retroviruses forming a clade sister to the murine and feline retrovirus clade. UrsusERV demonstrated greatest affinity with the PERVs for
gag and was a sister clade to PERVs, gibbon ape leukemia viruses (GALVs), KoRVs and
Mus caroli endogenous retrovirus (McERV)/
Mus dunni endogenous retrovirus (MDEV) for
pol and
env (
Figure 4). Among the individual proviruses, the scaffold 1 and 200 sequences, representing the two youngest viral integrations, occupied derived positions in the GAG and POL putein trees. The ERV with the oldest integration date (scaffold 7) and the recombinant ERV on scaffold 162 occupied basal positions. This was not the case for the ENV putein tree, where most scaffold sequence branching patterns could not be resolved and the youngest ERV (scaffold 200) occupied a basal position in the UrsusERV clade. However, with the exception of scaffold 200, much or all of the ENV protein was absent, complicating phylogenetic analysis. Black and brown bear UrsusERV consensus sequences were constructed using the polar bear UrsusERV consensus sequence as reference. Phylogenetic analysis based on nucleotide sequences was also performed. Where alignable, the black bear UrsusERV sequence was basal to the UrsusERV clade, while the brown bear UrsusERV sequence formed a sister clade to the polar bear sequence. Comparison of the UrsusERV clade with the other retroviral nucleotide sequences produced results identical to the tree topologies from protein sequences (
Figures S3 and S4). The relative ages of bear lineages with respect to estimated bear lineage age, geological epoch and UrsusERV invasion events are shown in
Figure 5.
Figure 4.
Phylogenetic analysis of consensus and individual UrsusERV proviral proteins within the Retroviridae. Bayesian phylogenetic trees are shown for GAG, POL, and ENV proteins. Because of limited variation among the sequences, the protease was analyzed together with the polymerase. Posterior probabilities >50% are shown. All sequences and a description of the analysis are included in the materials and methods section. UrsusERV (highlighted red) consensus and proviral sequences form a distinct clade that in all three analyses is closely related to PERV sequences.
Figure 5.
Maximum clade probability tree from the mtDNA and nuclear DNA alignment, displaying the BEAST chronogram estimation. Clade ages were estimated using fossil data points, with a log-normal distribution and the GTR model. Median divergence ages are shown above the blue horizontal bars, which indicate the 95% highest posterior intervals of the estimation. The green arrow indicates the first UrsusERV insertion into the Ursinae clade, which most likely affected all members of the clade. The red arrow indicates the second UrsusERV insertion into the brown bear clade, while the blue arrow represents the scaffold 200 provirus, which appears to be polar bear specific.
Comparison of host and retroviral phylogenies can reveal discordances that provide evidence for cross species transmissions of retroviruses. If an ERV and its hosts have co-evolved, the species tree and retroviral tree should be largely concordant. Discordant trees then indicate retroviral introgression among lineages or independent infection of different lineages by the same or related retroviruses [
53]. Comparison of the host and retroviral phylogenetic relationships indicated substantial cross-species transmissions in all lineages. UrsusERVs identified in the bears do not exhibit a consistent host-pathogen co-evolutionary pattern, which is consistent with multiple invasions of the bear lineage from an unknown reservoir (
Figure 6).
Figure 6.
Tanglegram illustration of the phylogenies of gammaretroviruses and their hosts. The host tree on the left was based on mtDNA
cytochrome b gene sequences, and the retroviral tree on the right was based on the polymerase gene nucleotide sequence. Ursinae species that were positive for UrsusERV and the corresponding ERV sequences are shown in red. Phylogenetic trees were generated using maximum likelihood analysis as implemented in RAxML [
54]. The bootstrap consensus trees illustrated were inferred from 500 replicates with the percentage bootstrap given next to each branch. Evolutionary phylogeny comparison of host
versus retrovirus illustrates lack of co-evolution and cross species transmission events.