Members of the Papillomaviridae taxonomic family have a small circular double-stranded DNA genome of around 8 kb in length that is packaged in a non-enveloped icosahedral capsid. Papillomaviruses (PVs) have been particularly well studied in humans due to their association with multiple disease states, including cervical cancer and other malignancies. Well over 200 human papillomavirus (HPV) types have been identified to date. Historically, the first PV discovered was cottontail rabbit PV (current name SfPV1), which was also the first DNA tumour virus described [1
]. PVs infect most mammal species (both terrestrial and marine), several birds, reptiles, and fish [2
]. In well-studied host species, some PV type infections are asymptomatic; therefore, in-depth study of vertebrates’ epithelial viromes may significantly increase the number of known PVs. After the first fully sequenced PV genomes were published [4
], the first sequence analyses of PVs were also performed [6
]. Subsequently, there have been several studies of the ancestral and more recent evolution of PVs [11
]. However, the evolutionary origin of PVs is not well understood, although it is assumed to be ancient.
PV sequences can be found in different nucleotide databases: in ENA (European Nucleotide Archive) there are ~25,000 sequences, and in NCBI (National Center for Biotechnology Information) there are 25,189 sequences with the taxonomic restriction Papillomaviridae
. In NCBI 1686 entries are found with length 6300 to 9500 nucleotides and with taxonomic restriction Papillomaviridae
, mostly corresponding to PV complete genomes (this redundant set includes isolates, etc.). “NCBI refseq”, which is a subset of the NCBI nucleotide collection containing only reference genomes (a non-redundant database), contains 135 reference PV genomes. In the UniProtKB (UniProt Knowledgebase) database, there are 556 entries in the manually annotated SwissProt and 12,302 in the computer-annotated TrEMBL (TrEMBL contains the translations of all coding sequences present in the EMBL Nucleotide Sequence Database not yet integrated in Swiss-Prot). In UniProt “complete proteomes” (“complete proteome”—all proteins annotated for species or isolate), 97 PV proteomes can be found, including 37 “reference proteomes” (“reference proteomes” are a representative cross-section of the taxonomic diversity to be found within UniProtKB “complete proteome”, they include the proteomes of well-studied model organisms and other proteomes of interest for biomedical and biotechnological research; for more details, see [16
]). In the PAVE (Papillomavirus Episteme [2
]) database, which was curated by experts in the field, 340 PV types with 3150 protein sequences are found (as of 8 June 2017) [3
]. However, whether sequence information alone is enough to tell us something about the deep evolutionary history of PVs and their origin is open to debate.
Viruses are fast-evolving entities. PV coding sequences have been estimated to evolve ~5 times faster on average than the nuclear coding sequences of their mammalian hosts [14]. The evolutionary rate of the PV E1 protein is estimated to be 1.76 × 10⁻⁸ substitutions/nt/year for Lambdapapillomaviruses infecting Felidae; 7.1 × 10⁻⁹ substitutions/nt/year for mammalian PVs; and 1.1 × 10⁻⁸ substitutions/nt/year for nonmammalian amniote PVs [11], compared to 2.2 × 10⁻⁹ for mammalian nuclear coding sequences [21
]. In general, the short-term evolutionary rates of viruses (and other genomes) are much faster than long-term evolutionary rates due in part at least to the loss of deleterious mutations from the population [22
]. Thus, the sequence space sampled by viruses is even larger than that expected from long-term evolutionary rates. It is estimated that PVs have existed for at least ~315 million years [23
]. Considering this, PV proteins may still have homologs in the biosphere (outside of PVs), but without significant sequence similarity.
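The rates quoted above can be put on a common scale by dividing each PV E1 rate by the mammalian nuclear rate. The following is a minimal sketch of that arithmetic; the dictionary keys and the function name are illustrative only. Note that the E1 rates and the ~5-fold average estimate come from different cited studies, so the ratios are not expected to match that figure exactly.

```python
# Rates quoted in the text, in substitutions/nt/year.
HOST_RATE = 2.2e-9  # mammalian nuclear coding sequences

# PV E1 rates quoted in the text; the labels are illustrative.
pv_e1_rates = {
    "Lambdapapillomaviruses (Felidae)": 1.76e-8,
    "mammalian PVs": 7.1e-9,
    "nonmammalian amniote PVs": 1.1e-8,
}


def fold_over_host(rate, host_rate=HOST_RATE):
    """Return how many times faster `rate` is than the host rate."""
    return rate / host_rate


# Round to one decimal place; the inputs carry only 2-3 significant figures.
fold_changes = {
    name: round(fold_over_host(rate), 1) for name, rate in pv_e1_rates.items()
}

for name, fold in fold_changes.items():
    print(f"{name}: {fold}x the mammalian nuclear rate")
```

On these numbers the E1 rates are roughly 3 to 8 times the host rate, which brackets the ~5-fold average quoted for PV coding sequences.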
It has been known for more than three decades that structure is more conserved than sequence [24
]. Challis and Schmidler have shown that including structural information enables better phylogenetic inference for distant relationships [26
]. Additionally, Herman et al. have shown that including structural information significantly reduces the uncertainty of alignments and of phylogenetic tree topologies, indicating that structure contains more information than can be obtained from sequences alone [27]. This is especially important in the case of viruses, which are able to sample a huge amount of sequence space and lose sequence similarity within a relatively short time (compared to organisms). Thus, it is essential to include structural information in order to study deep evolutionary relationships.
A common view of proteins is that they are composed of domains: independent functional, evolutionary and structural units, often linked by unstructured polypeptide chains. A protein (polypeptide chain) can be divided in silico into domains according to multiple criteria, and domain borders depend on the domain assignment method. Domains are more likely than whole proteins to be monophyletic, as one protein may consist of many domains with very different phylogenetic histories. Thus, protein domains, and especially structural domains, can be used to study the evolutionary history (origin) of viral proteins.
In this study, the structural information of protein domains was used to find distant homologs of PV proteins and to shed more light on the evolutionary history of PVs. Our results show that only half of the PV protein domains have a relative in the rest of the sequenced biosphere. The E1 replication protein shows the most connections, with cellular organisms and viruses alike. The capsid protein L1 has an evolutionary relationship with the rest of the virosphere. However, for a number of PV protein domains, distant homologs could not be detected.
Several aspects of the molecular biology of PVs are quite well known; however, their origin and their evolutionary relationship to other organisms are still enigmatic. In this work, the occurrence of PV protein domains was used to study the relations of PV domains with other domains characterised so far and to study the origin and/or evolution of PV proteins and PVs.
PVs, similar to several other viral families, encode proteins without detectable structural homologs in cellular organisms [45
]. This trend can be quantitatively evaluated in different ways [40
]. As shown in Figure 3
(see also Figure S3
) and in the analysis of protein domain occurrence at higher taxonomic levels in citation [40
], PVs have a high fraction of protein domains that are either not found in cellular superkingdoms or found in only a small fraction of cellular genomes. In this respect (location of PV in Figure S3
and shape of the PV lines in Figure 3
and Figure S2
compared to Figure 4 in Reference [40
]), PVs and Polyomaviridae
are more similar to RNA viruses and ssDNA viruses than to dsDNA viruses. This kind of bimodal or U-shaped distribution is also confirmed independently at the structural level. The relationship of PV protein structural domains to other structural domains was assessed and visualised with the “Galaxy of folds” toolkit. Only the E1 helicase domain is located in a densely populated region (close relationship), and at least four domains are located in very sparse regions (Figure 2).
4.1. SUPERFAMILY Limitations
The SUPERFAMILY resource is a useful tool for deep evolutionary studies; unfortunately, it has its own limitations. Different HMM models of SCOP families from the same SF may not readily recognise sequences from a (structural) sibling family, especially in the case of viruses. For example, when the HMM model of the PV E1 DBD is searched against all known sequences, it does not recognise Large-T antigen DBD sequences from Polyomaviridae
. However, PV E1 DBD and Polyomaviridae
Large-T antigen DBD are classified to the same SF in SCOP. In SUPERFAMILY, the SF hits are collected as a union of all of the respective SF HMM results [37
]. In addition, SUPERFAMILY is limited to protein structural domains classified in SCOP. Unfortunately, not all protein structures of interest are in the SCOP database (not in SCOP 1.75 [35
], SCOP2 [46
] or SCOPe [47]).
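The union rule described above, in which a superfamily (SF) assignment is the union of the hits of all family-level HMM models belonging to that SF, can be sketched as follows. This is only an illustration under stated assumptions: the hit tuples, the family model labels and the sequence identifiers are hypothetical placeholders, not SUPERFAMILY output or its actual data model.

```python
from collections import defaultdict

# Hypothetical family-level HMM results: (superfamily, family model, sequence id).
# All labels and identifiers below are placeholders for illustration.
family_hits = [
    ("SF_55464", "PV E1 DBD model", "pv_E1"),
    ("SF_55464", "Large-T antigen DBD model", "polyoma_LT"),
    ("SF_55464", "geminiviral Rep DBD model", "gemini_Rep"),
    ("SF_52540", "AAA-ATPase model", "pv_E1"),
]


def superfamily_union(hits):
    """Collect, per superfamily, the union of sequences hit by any of its family models."""
    sf_members = defaultdict(set)
    for superfamily, _family_model, seq_id in hits:
        sf_members[superfamily].add(seq_id)
    return dict(sf_members)


sf_assignments = superfamily_union(family_hits)
print(sorted(sf_assignments["SF_55464"]))
```

In this toy example the PV E1 DBD model never needs to recognise the Large-T antigen sequence itself; the two end up in the same SF because each is found by its own family model.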
Because of the gap between structural classification and current data in the PDB database, biologically/virologically suspicious results were re-evaluated using the most recent data. For example, SF_52540 was found in most Parvoviridae
species, but SF_55464 only in a subset. The structure of the respective domain in SCOP is solved for Adeno-associated virus (Dependoparvovirus
) (SCOP and PDB representative “1m55”). The HMM model based on “1m55” recognises protoparvoviruses (Parvovirinae
) but not bocaparvoviruses (Parvovirinae
) in an HMMER “hmmsearch” run. However, based on published structures, the structural and functional similarity of bocaparvovirus “4kw3”, dependoparvovirus “1m55” and protoparvovirus “4pp4” provides evidence that bocaparvoviruses and probably Densovirinae
(another subfamily of Parvoviridae
family) have a homologous domain to SF_55464 [48
]. Hopefully, the next release of SCOP (and SUPERFAMILY) will include up-to-date viral structural information. To avoid subjective bias, SUPERFAMILY data and extended structural analysis data were interpreted separately. The quality of the data in databases is very important in this kind of study. The amount of data used in our work is still manageable, allowing us to check the correctness and quality of both the input data and our results; in larger-scale analyses, however, this is not feasible. One PV-related example is the network-like relationship study of dsDNA viruses [49
]. In the publication, data showed a connection (Figure 1 in [49
]) between PV and polyomaviruses, which most likely corresponds to Bandicoot Papillomatosis virus, a chimera containing capsid proteins from PV and a replication protein with a DnaJ domain from polyomaviruses. This connection was misinterpreted by the authors in the text. Therefore, to avoid or minimise misinterpretations in large-scale studies, each scientific community should keep its data as correct as possible, to give confidence in the results of large-scale analyses.
4.2. Capsid Protein Connects PVs with the Rest of the Virosphere
The PV major capsid protein L1 has structural relatives at the SF level only in Polyomaviridae
. In addition to L1, Polyomaviridae
also encodes domains structurally similar to E1 DBD and E1 helicase (including the hexamerisation subdomain). PVs and Polyomaviridae
are the only known viruses with nucleosomes inside the virion [50
]. The thirty-year-old statement by Favre et al. “The existence of a viral core containing DNA and cellular histones may be a further common structural characteristic of papovaviruses.” (Papovaviruses—old name of PVs and polyomaviruses together) is still valid and this characteristic is not only common but also specific for these viruses [50
]. Thus, there are several lines of evidence that PVs and Polyomaviridae
are clearly evolutionarily related.
According to published non-hierarchical structural analysis and supported by the common FOLD level in SCOP, PV L1 and Polyomaviridae
major capsid protein VP1 belong to the “single jelly-roll” (eight-stranded beta barrel) capsid lineage also called “Picorna-like lineage”. The single jelly-roll capsid lineage contains capsid proteins from a number of other viral families, including Circoviridae
, and Parvoviridae
together with numerous families of RNA viruses [52
]. Viral families in this lineage have different replication strategies and infect hosts in both Eukaryota and Bacteria. As noted earlier, SCOP classification to the same FOLD level does not guarantee a common ancestor; however, it also does not exclude one. Thus, most likely PVs are connected to the wider virosphere via their major capsid protein L1.
4.3. E2 DBD Most Likely Does Not Originate from Gammaherpesviruses
As summarized in Figure 4
, the E2 DBD domain has a connection only with gammaherpesviruses. According to SUPERFAMILY results and HMMER searches, only members of the genus Lymphocryptovirus give significant hits. However, published structures of rhadinovirus (genus Rhadinovirus
is another member of Gammaherpesvirinae
subfamily) proteins (PDB codes 4blg, 2yq1, 4k2j and 5a76) prove that functionally and structurally homologous proteins are also found in rhadinoviruses [54]. The divergence time of the gammaherpesviruses in which the SF_54957 domain is found has not been estimated explicitly; however, published data allow it to be bounded at no more than ~200 million years ago [58
]. PVs have existed for at least ~315 million years [23] and, assuming virus-host co-divergence also for fish viruses, PVs are most likely older still, at least ~415 million years [60
]. Therefore, PV E2 DBD does not originate from gammaherpesviruses, at least not after their divergence.
4.4. Replication Protein Connects PVs with the Rest of the Biosphere
SF_55464 is also found in more than a hundred bacterial species (Table 3) with a wide and sparse phylogenomic distribution. Most of the bacterial hits have the best e-value for the relaxase HMM model (as in the case of bacterial plasmids). Non-relaxase hits in Bacteria are found only in 5 phytoplasma species. Extending the search to non-complete genomes increases the number of bacterial hits to a few thousand sequences. Thus, at least via the relationship with the relaxase domain, PVs have connections with bacteria and bacterial plasmids. The relationship of geminiviral replication proteins to plasmids has been published; however, the direction of the transfer is not clear [61].
The genomes phylogenetically closest to currently known PV hosts (vertebrates) in which SF_55464 is found in the SUPERFAMILY database are among Fungi (Supplementary Materials, File S1). Extending the search to non-complete genomes, we also found SF_55464 in some Metazoa; however, closer examination shows that these are all most likely misannotations. These sequences were almost identical to bacterial ones and, if we exclude a very recent transfer “from Bacteria to Eukaryota” (which is possible but very unlikely), they are most likely contaminants or part of the sequenced organism’s microbiota. Detailed examination of true positive eukaryotic hits showed that in Stramenopiles this domain is encoded by mitochondrial DNA, widening the phylogenetic distribution of RCR domains. All SF_55464 true positives in eukaryotes give their best hit to the geminiviral Rep HMM model. To test whether the eukaryotic hits are taxonomically restricted sequences or just moderately divergent members of some other protein domain family, we performed a reciprocal sequence search using the SF_55464 eukaryotic hits (the parts of the sequences corresponding to SF_55464) as queries in “phmmer” and “tblastn”. Only sequences from three organisms belonging to Basidiomycota
(Serpula lacrymans var. lacrymans
S7.9, Pisolithus tinctorius
Marx 270 v1.0 and Laccaria bicolor
S238N-H82) recognised each other’s SF_55464 sequences (and, after them, viral sequences from Geminiviridae). All other eukaryotic sequences give hits only to viral sequences, mostly from Geminiviridae. Thus, in the sequenced biosphere, these eukaryotic hits do not have close homologs in other organisms (even among sequences not yet annotated as protein). This indicates that these sequences are taxonomically restricted and not on the periphery of some unidentified protein domain family. Therefore, PVs have connections to the sequenced eukaryotic world only via distant relatives in the virosphere.
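The logic of the reciprocal search can be illustrated with a small sketch: each query is mapped to the set of sequences it hits, and two queries count as recognising each other only if each appears in the other's hit list. The hit table and all identifiers below are hypothetical toy data, not parsed “phmmer” or “tblastn” output, and the function name is likewise an assumption made for illustration.

```python
# Hypothetical hit table: query sequence -> set of sequences it hits.
# Identifiers are placeholders; real input would be parsed from search output.
toy_hits = {
    "fungus_A": {"fungus_B", "gemini_Rep"},
    "fungus_B": {"fungus_A", "gemini_Rep"},
    "alveolate_X": {"gemini_Rep"},
}


def mutual_hits(hits):
    """Return sorted pairs of queries that each appear in the other's hit set."""
    pairs = set()
    for query, targets in hits.items():
        for target in targets:
            # A target that was never used as a query cannot be reciprocal.
            if target != query and query in hits.get(target, set()):
                pairs.add(tuple(sorted((query, target))))
    return sorted(pairs)


reciprocal_pairs = mutual_hits(toy_hits)
print(reciprocal_pairs)
```

In the toy data only the two fungal queries recognise each other, while the third query hits nothing but a viral sequence, mirroring the pattern reported above.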
SF_55464 is found in all or almost all members of PV, Polyomaviridae
, and Geminiviridae
. It is also found in 10 viruses currently classified into the newly proposed family Genomoviridae
and in a single member of Circoviridae
(out of 45 in the database). In the phylogenetic tree of Rep proteins of circular single-stranded DNA (ssDNA) viruses, genomoviral Rep proteins form a well-supported monophyletic clade which branches as a sister group of Geminiviridae
and both are more distantly related to Circo- and Nanoviridae
]. The Rep protein tree is supported by structural analyses showing that Rep proteins of Geminiviridae
are indeed structurally related [44
]. Nanovirus and circovirus Rep protein structures are not yet classified in SCOP. Thus, in the virosphere, extended structural analyses of E1 DBD relatives connect PV with Polyomaviridae
, and Genomoviridae
. However, connections outside the virosphere are still restricted to bacterial plasmids, bacteria, very few eukaryotes and a few red algal plasmids.
4.5. Closest Domain Pair of E1 Protein Is Found Far from Known PV Hosts
Since SF_55464 and SF_52540 coexist as a domain pair in the E1 protein, the existence of this pair in other genomes was studied. In Bacteria, this combination is found in 119 species in SUPERFAMILY “complete genomes”, mostly as a combination of the “relaxase” domain with the “Tandem AAA-ATPase domain”. The domain pair was present in very few eukaryotes, with a very sparse phylogenomic distribution: three Fungi, one Alveolata and one Amoebozoa. In contrast to the bacterial sequences, these have the best fit to the geminiviral “DNA-binding domain of REP protein” and “Extended AAA-ATPase domain” HMM models.
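Screening for the E1 domain pair amounts to asking which proteins carry both superfamily assignments at once. A minimal sketch of that filter follows; the assignment table and protein identifiers are hypothetical placeholders, and real input would come from SUPERFAMILY domain assignments.

```python
# The two superfamilies that coexist in the E1 protein, as named in the text.
E1_PAIR = frozenset({"SF_55464", "SF_52540"})

# Hypothetical per-protein superfamily assignments (placeholder identifiers).
toy_assignments = {
    "pv_E1": ["SF_55464", "SF_52540"],
    "plasmid_relaxase": ["SF_55464", "SF_52540"],
    "gemini_Rep": ["SF_55464"],
    "host_helicase": ["SF_52540"],
}


def proteins_with_pair(assignments, pair=E1_PAIR):
    """Return sorted ids of proteins whose assignments include every domain in `pair`."""
    return sorted(
        protein for protein, domains in assignments.items() if pair <= set(domains)
    )


pair_carriers = proteins_with_pair(toy_assignments)
print(pair_carriers)
```

This check ignores domain order and spacing within the protein; whether adjacency should also be required is a design choice the text does not specify.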
In the case of viruses, the E1 domain pair was found in Polyomaviridae
, and in 21 other viruses. Sharing SF_55464 and SF_52540 (including the hexamerisation subdomain) is true for Polyomaviridae
(or at least for Parvovirinae
). PVs and Polyomaviridae
are dsDNA viruses; however, Parvoviridae
belongs to the ssDNA viruses and encapsidates linear ssDNA. Two proteins from this list encoded by human herpesviruses (HHV) belong to roseoloviruses, namely HHV6A and HHV6B. These viruses encode a protein that is most likely of parvovirus origin. Herpesviruses are known helper viruses for some parvoviruses [63] and, according to the phylogenetic distribution of this protein (and its phylogenetic tree), there has been a virus-to-virus transfer in the direction from parvoviruses to roseoloviruses (during the last 100 million years, as estimated from [58]).
In plasmids, the SF_55464 and SF_52540 domain combination is also found (Table 4). In most of the plasmids, the sequence regions assigned to SF_55464 have the best e-value for the “relaxase” HMM model and regions assigned to SF_52540 have the best hit to the “Tandem AAA-ATPase” HMM (as in Bacteria). Only the sequences of the phytoplasma plasmids and the red algal plasmids of Porphyra pulchra have the best e-value for the geminiviral “DNA-binding domain of REP protein” HMM and for SF_52540 models other than “Tandem AAA-ATPase”.
Considering “Virus to host” and “Host to virus” gene transfers and recombination between different viruses, and accepting the statement by Rohwer and Barott that “When considering the virosphere, extremely unlikely events become probabilistic certainties.”, it is very difficult to estimate the evolutionary history or trajectory of these domains [65
]. It is possible to generate the phylogenetic tree of these sequences but it is much harder to find a root. Work on the age of some of these viral genera and families may give some information and restrictions, but this is beyond the scope of the current study.
Summarizing over all PV protein domains, only domains encoding less than half of the total annotated coding sequence show a confident evolutionary connection to the rest of the biosphere. This half includes ~1/5 of total amino acids (E1 DBD and helicase) showing a connection to sequenced and annotated cellular proteins and less than 1/20 of total amino acids (E2 DBD) showing a connection with gammaherpesviruses.
PVs are clearly related to Polyomaviridae, sharing structural homologs of capsid protein and two domains of replication protein at SCOP SF level. Both viral families have dsDNA viral genomes packed into nucleosomes inside the viral particle. Parvoviridae shares two replication related domains and, including extended structural similarity, also the capsid protein with PVs and with Polyomaviridae. As ssDNA viruses, Parvoviridae do not have nucleosomes in virions.
The relationship of PV, Polyomaviridae and Parvoviridae to Geminiviridae, Circoviridae, Nanoviridae, and Genomoviridae is not as clear, and their exact relationship is beyond the scope of this work. They all have SF_55464 and, according to extended structural analysis, they all (except Nano- and Genomoviridae) have a common capsid protein (there are no structural predictions for Nano- and Genomoviridae capsid proteins).
The major capsid protein L1 and replication protein E1 connect PVs to the rest of the virosphere; E1 DBD also connects PVs to bacterial plasmids, bacteria and red algal plasmids. Excluding the E1 helicase domain, the connections to eukaryotic protein domains are almost non-existent, even when available structural information is included. There are clear connections with other parts of the biosphere, but the exact evolutionary trajectory of PV proteins is not yet known. There are still almost no hints as to how PVs as a whole originated and how they became vertebrate- and epithelium-specific. The evolutionary history of the other PV protein domains, which have not been found in cellular organisms, is equally mysterious.
Most likely the last common ancestor of PV, Polyomaviridae
, and Parvoviridae
, or more precisely, the genome coding the ancestral replication and/or capsid protein of these viruses, inhabited a marine environment. Only very few non-fungal and non-vertebrate marine eukaryotic genomes have been sequenced. Thus, most likely, there is unexplored sequence and structure space in both cellular and viral taxa, as well as in other types of mobile elements in marine environments. Further characterisation (sequencing is only one part of the characterisation) of this and other biotopes will give more information and thus more hints on the origin of PV proteins. On the other hand, some connections between PVs and other viruses or cellular organisms may be lost forever due to gene loss events. For example, in the PV family, the E6 gene was lost at least twice in different virus clades [15].
To our current knowledge, PVs are connected to the rest of the biosphere via their replication and major capsid proteins. The origin and/or evolutionary history of the other domains is still unknown. This makes the question “When and how did PVs originate?” of continuing interest.