Origin and Deep Evolution of Human Endogenous Retroviruses in Pan-Primates

Human endogenous retroviruses (HERVs) are viral “fossils” in the human genome that originated from the ancient integration of exogenous retroviruses. Although HERVs have sporadically been reported in nonhuman primate genomes, their deep origination in pan-primates remains to be explored. Hence, based on the in silico genomic mining of full-length HERVs in 49 primates, we performed the largest systematic survey to date of the distribution, phylogeny, and functional predictions of HERVs. Most importantly, we obtained conclusive evidence of nonhuman origin for most contemporary HERVs. We found that various supergroups, including HERVW9, HUERSP, HSERVIII, HERVIPADP, HERVK, and HERVHF, were widely distributed in Strepsirrhini, Platyrrhini (New World monkeys) and Catarrhini (Old World monkeys and apes). We found that numerous HERVHFs are spread by vertical transmission within Catarrhini and one HERVHF was traced in 17 species, indicating its ancient nature. We also discovered that 164 HERVs were likely involved in genomic rearrangement and 107 HERVs were potentially coopted in the form of noncoding RNAs (ncRNAs) in humans. In summary, we provided comprehensive data on the deep origination of modern HERVs in pan-primates.


Introduction
Human endogenous retroviruses (HERVs) are retroviral remnants in the human genome derived from the ancient integration of exogenous retroviruses and occupy approximately 8% of the human genomic DNA [1,2]. HERVs are mainly formed via two major mechanisms: (1) horizontal transmission, in which exogenous retroviral RNA is integrated into the host genome, thus becoming a provirus that will produce infectious virus; (2) vertical transmission, in which the past retroviral infection of germline cells results in a provirus with Mendelian heritability [1,3].
Some evidence has shown that HERVs also appear in nonhuman primates [9][10][11]. Furthermore, LTR dating indicated that some HERVs could have entered the genomes of primate ancestors 25 million years ago (MYA) [12], although such dating results should be considered cautiously [9]. To decode the deep origination of HERVs, we performed the systematically in silico mining survey of HERVs in 49 primate genomes covering all major families of the order Primates. Based on these extensive data, we present the full landscape of evolutionary pathways leading to the generation of modern HERVs.

Genome Screening and Identification of HERVs
For the strict identification of full-length HERVs, we first used the RT (reverse transcriptase) sequences from the human genome available in the gEVE database [13] and performed TBLASTN searches [14] to screen the 49 primate genomes with a cutoff e-value of 0.00001. All genome assemblies were accessed from National Center for Biotechnology Information (NCBI) Assembly Database (https://www.ncbi.nlm.nih.gov/assembly/ accessed on 30 November 2021) and the Sequence Read Archive (SRA) Database (https: //www.ncbi.nlm.nih.gov/sra/ accessed on 30 November 2021) under an accessible BioProject accession code: PRJNA785018. Then, the data generated from the TBLASTN analysis were transformed into gene transfer format (GTF), and repetitive data were removed according to the contig names, start positions, and end positions. BEDTools [15] was employed to merge locations with distances of less than 1000 base pairs (bps).
To reduce false positive results, the merged sequences were extracted, and DIAMOND BLASTX searches [16] were performed against all primate (taxonomy ID 9443) alone sequences and viruses (taxonomy ID 10239) in the Reference Sequence (RefSeq) database [17] with an evalue of 0.00001. The best alignment results were extracted and screened based on query names and bit scores, and phylogenetic analysis (see below for details) was performed with reference RT sequences of HERVs from a previous publication [18] to confirm whether the screened results represented true RT domains of HERVs. The recognized sequences were extended to a length of 10,000 bp from both ends for further LTR (long terminal repeat) identification.
LTR-Harvest [19] with the default parameters was utilized to determine the boundaries of each LTR of HERVs. The internal sequences were searched against HERV proteins in the gEVE database by using DIAMOND BLASTX with an e-value of 0.00001. The results were transformed into browser extensible data (BED) files, merged, and extracted as previously described. Another round of DIAMOND BLASTX searches was performed to ascertain the nature of HERV genes. The results were manually checked, genomic fragments were merged based on the orders and locations in primate genomes to reduce repeatability, and only translations of at least 100 amino acids in the length of HERV genes were retained. For classification, phylogenetic reconstruction was performed using pol sequences aligned by MAFFT [20] with the parameter "-auto", and the alignments were trimmed by trimAl [21] with the parameter "-gt 0.1" or "-gt 0.5". IQ-TREE2 [22] was applied to construct the maximum likelihood (ML) trees with the parameters "-B 1000 -alrt 1000".

Vertical Transmission Identification
To identify vertical transmission events, BLASTN searches [14] were performed with the full-length HERVs with both flanking regions (~2000 bp in length). Hits that met the following three conditions were extracted: (1) two HERV sequences showed 90% coverage and identity; (2) two HERV LTR-flanking sequences were matched with an identity over 90%; and (3) the BLASTN results of both LTR-flanking sequences showed over 25% coverage, with at least one result showing 80% coverage. Only candidate HERVs that simultaneously met all three conditions were selected for further analysis. Vertical transmission-associated paired HERVs were analyzed with the igraph package [23], and vertical transmission events were estimated based on the species that contained paired HERVs. The species tree of primates was generated using TimeTree [24], and the bubble pie chart was created for visualization with the scatterpie package [25].

Genomic Rearrangement Analysis
Homologous recombination (i.e., between two similar HERVs in different genomic locations in a given species) leading to genomic rearrangement might have occurred during primate genome evolution [26]. We first attempted to detect the rearrangement signals by performing phylogenetic LTR reconstructions according to different HERV categories and primate species based on the 5 and 3 LTR sequences of full-length HERVs. We collected sequences that did not cluster in pairs (i.e., the 5 LTR and 3 LTR of a single HERV) in the phylogenetic trees. We next tested their coverage and identity by performing BLASTN searches with the same query and subject. We selected the paired LTRs with higher bit scores from different HERV sources (e.g., the bit score of 5 LTR-1 vs. 5 LTR-2 was higher than that of 5 LTR-1 vs. 3 LTR-1). Then, we checked whether these paired LTRs matched other LTRs (e.g., 5 LTR-1 matched 5 /3 LTR-2 and 3 LTR-1 matched 3 /5 LTR-2). We reasoned that the HERVs with matching LTRs may be subjected to genomic rearrangement.

HERVs-Derived ncRNA Verification in the Human Genome
We employed the coordinates of all known human transcripts from Ensembl database version 104 and ncRNAs [27] from the NONCODE database [28] in Genome Reference Consortium Human Build 38 (GRCh38/hg38) and retained those that intersected with the coordinates of HERVs in the human genome by using BEDTools with the parameter "intersect -wo -s". We only selected the results for which the coverage of at least one feature in a pair of features was equal to 100% and predicted the possible HERV-derived ncRNA molecules based on these results.

HERVs Are Widely Dispersed in the Genomes of Old World Monkeys and Apes
To identify HERVs, we first analyzed the genomes of 49 species of primates to identify the RT domains of reference HERVs because the RT domains of HERVs are often used to distinguish HERVs and other retroviruses [18,29,30]. Briefly, we performed a first round of TBLASTN analysis to search the HERV RT domains of primate genomes, and a second round of DIAMOND BLASTX analysis was then performed to exclude those RT domains that were better aligned with host proteins or other viral proteins. Next, we performed phylogenetic analysis to verify whether these RT sequences belonged to HERVs. We extended the length of the verified sequences and identified their LTR boundaries. We subsequently estimated the internal sequences between two LTRs with DIAMOND BLASTX, and only sequences that contained RT domains and showed the correct ordering of other genes (e.g., gag-pro-pol-env) were identified as classifiable HERVs, and hence defined as "full-length HERVs" (F-HERVs). We reconstructed the pol gene of each HERV and performed phylogenetic analysis to classify the HERVs ( Figure 1A). The HERVs were classified according to their phylogenetic relationships with reference sequences and the similarity of their RT domains with reference RT sequences.
In total, we identified 2301 classifiable F-HERV copies ( Figure 1A,B & Supplementary Data S2), with most of them found in Catarrhini, ( Figure 1B). The limited numbers identified in this study reflected our rigorous search methodology and the limitations of using full-length HERVs. The greatest number of the identified F-HERVs belonged to the HERVHF supergroup, whose members reportedly integrated into Catarrhini genomes at least 30-45 million years ago [31][32][33][34]. It is worth noting that we found HERVH sequences not only in the species in which they have been reported previously (Homo sapiens, Gorilla gorilla gorilla, Pongo abelii, Papio anubis, Chlorocebus aethiops, Callithrix jacchus, Pan troglodytes, Nomascus siki and Aotus nancymaae) but also in some new genera of Catarrhini, such as Mandrillus, Rhinopithecus, and Colobus ( Figure 1B), further confirming the widespread and ancient nature of HERVHF. We also found other types of F-HERVs, such as HERVW9, HERVIPADP, HERVK, and HSERVIII members ( Figure 1B), in primates, which was consistent with previous studies [35][36][37][38] but with hosts expanded in this study. Together, these results demonstrated that F-HERVs are ancient, and humans inherited such elements via vertical transmission from nonhuman primates.   G o rilla g o ril la gor illa chr X . 14 2 5453 5 1.14 25

Numerous HERVHFs Are Spread by Vertical Transmission within Catarrhini
If vertical transmission events occurred, the sequences of the two viruses and their flanking sequences should be the same in different primate genomes. However, over a long evolutionary history, many mutations accumulate in HERVs, which makes it difficult to identify vertical transmission events. Therefore, we set the following strict criteria for identifying possible vertical transmission events: (1) two HERVs must show high identity and coverage; (2) the flanking sequences of the two HERVs must show high identity; and (3) at least one of the flanking sequences must show high coverage ( Figure 1C).
In total, we discovered 1226 F-HERVs that may participate in vertical transmission and identified 408 vertical transmission events ( Figure 1D). All of the vertical transmission events were identified within Catarrhini, and more than half of these (222 of 408) were found in apes. According to HERV classification, most of the vertically transmitted F-HERVs belonged to the HERVHF group, which was consistent with the distribution of HERVs ( Figure 1B). Interestingly, we found that several F-HERVs may have infiltrated the common ancestor of Old-World monkeys and apes, including 10 HERVHF, 2 HERVK, 1 HERVIPADP, 1 HSERVIII, and 1 HUERSP members ( Figure 1D). Strikingly, one HERVHF was vertically transmitted from Old World monkeys to apes, and the pathway of its vertical transmission was traced in 17 species ( Figure 1E & Supplementary Data S4-S6). In addition, we estimated the time of F-HERV integration based on the time tree of these 49 primates and the vertical transmission events detected within them ( Figure 1D). We speculated that detectable vertical transmission of F-HERVs occurred from 0 to 29.4 MYA and that nearly 25% of vertical transmission events (118 in 468) occurred at 9.1 MYA, when Gorilla gorilla gorilla separated from Homo sapiens.

Some F-HERVs May Be Involved in Genomic Rearrangement
HERVs are not only molecular 'fossils' of ancient retroviruses but are also functional in host genomes under certain circumstances [39][40][41]. One of the functions of HERVs is mediating host genomic recombination, leading to potential genomic rearrangement [26,42,43]. When a HERV is integrated into a host genome, the two LTRs of that one element should be more similar to each other than to the LTRs of any other element, although they accumulate mutations after integration and residence in the germ line [26]. Therefore, if a HERV has two similar but different LTRs, genomic recombination may occur within that HERV.

Discussion
In the past 30 years, many HERVs have been identified in human genomes, but there has been little systematic research on HERVs in other nonhuman primates. In this study, we used all known HERV sequences to determine the classifiable HERVs in 49 species of primates and annotated their specific loci in the host genomes (Supplementary Data S2). We only discovered 292 F-HERVs in humans, which was much lower than the number indicated by previous research [8,47]. One major reason for this difference was that we used only the RT sequences of HERVs in the human genomes available from gEVE, rather than using exogenous retroviral pol sequences to search HERVs, because we were focused on tracing the origin of different types of HERVs in human genomes and the RT domain is the most conserved domain that can be used to distinguish retroviruses [29,48]. Indeed, many HERVs accumulate mutations or are even lost during long-term evolution, and phylogenetic analysis based on these proteins sometimes cannot rebuild their phylogenetic relationships. Although we identified fewer F-HERVs in the human genome through our pipeline, these F-HERVs showed a relatively intact genomic structure and covered 5 superclasses of HERVs ( Figure 1A,B). In addition, further investigation revealed that some of these F-HERVs were involved in vertical transmission. Thus, these F-HERVs could help us to effectively pursue the origin of HERVs.
Vertical transmission events of HERVs provide strong evidence that could indicate the origin of HERVs. One of the most ancient HERVs (HERVLs) reported to date, which integrated into an ancestor of all extant placental mammals at more than 100 MYA, was identified based on this line of reasoning [49]. We found that most vertical transmissionrelated F-HERVs in human genomes were derived from those present in Hominidae, and the others came from Hylobates and Old World monkeys (Supplementary Data S3). We estimated that the integration times of these F-HERVs, which ranged from 9.1 MYA to 29.4 MYA ( Figure 1C and Supplementary Data S3) depended on the time of separation, and this result was consistent with previous reports [50,51]. Vertical transmission events spanning long periods are difficult to track because of the strict definition of vertical transmission, which requires very high sequence similarity and coverage of HERVs and their flanking sequences. Mutations in HERVs show a positive correlation with time, and we were, therefore, unable to identify vertical transmission that may have occurred in the ancestors of NWM or Strepsirrhini. Another reason for the unsuccessful detection of vertical transmission was that the total number of F-HERVs found in the first step was small. If we were to consider the HERVs that have lost their RT domains, different results might be obtained.
HERVs are capable of causing homologous recombination due to their high sequence similarities. Many studies have analyzed HERV-related gene recombination by comparing the genomes of different individuals [52][53][54]. However, the available genomes from the different individuals of the same species are insufficient. We assume that homologous recombination takes place between two HERVs of the same type (e.g., HERVHF) and that they share highly similar sequences but show differences in their LTRs. When homologous recombination occurs, such HERVs will exchange their internal sequences, leading to a pair of analogous but different LTRs. We conjecture that homologous recombination takes place on the basis of this assumption ( Figure 2B,C & Supplementary Data S7), and the results should be treated with caution because recombination of endogenous retroviruses which had microhomologic sequences has also been reported in other mammals [55].
Recently, HERVs have been reported to be associated with many human diseases, including cancer and infectious and autoimmune diseases, and the mechanisms underlying the functions of HERVs in these illnesses also vary (e.g., acting as promoters or enhancers to regulate gene expression or encoding peptides that participate in immune regulation) [39,40,[56][57][58]. Therefore, it is important to consider HERVs that have the potential to be expressed. To identify potentially expressed F-HERVs in humans, we intersected the coordinates of all human F-HERVs and all known transcripts from other public databases and found that some HERVs may be transcribed into ncRNAs and functional under certain conditions, such as viral infections [40]. We also performed searches of F-HERVs from other nonhuman primates in the RefSeq database and the nucleotide sequence (nt) database of NCBI with BLASTN, and we did not find any credible transcripts strongly related to these HERVs. However, some of these F-HERVs showed high similarities with F-HERVs from humans, so we surmised they may have homologous functions.
In conclusion, we traced the F-HERVs present in the human genome back to nonhuman primates and found that some HERVs originated before the speciation of Hominidae, Hylobates, and Old-World monkeys. In addition, some of the F-HERVs that we identified were possibly functional from the perspective of genomics or transcriptomics, likely indicating long-term co-option. Together, these findings could help us to better understand the deep origin and evolution of modern HERVs.

Supplementary Materials:
The following supporting information can be downloaded at: https: //www.mdpi.com/article/10.3390/v14071370/s1, Data S1: The common name, scientific name and classification of each primate we analyzed in this research; Data S2: The loci of each HERV we identified in the genomes of 49 primates; Data S3: The vertical transmission events in human genomes; Data S4: The alignments of the 5LTR and flanking sequences of a widely vertical transmitted HERV-H in 17 species; Data S5: The alignments of the 3LTR and flanking sequences of a widely vertical transmitted HERV-H in 17 species; Data S6: The alignments of the internal sequences of a widely vertical transmitted HERV-H in 17 species; Data S7: The LTRs' blastn results of possible HERVs which were involved in host gene recombination; Data S8: The alignments of the 5LTR sequences of HERV in Figure 2B; Data S9: The alignments of the 3LTR sequences of HERV in Figure 2B; Data S10: The relationship between HERVs in human genomes and Human ncRNA.

Data Availability Statement:
The data presented in this study are available in the article and the Supplementary Materials here. Additional data related to this article may be acquired from the authors.