Genetic Analysis and Evolutionary Changes of the Torque teno sus Virus

The torque teno sus virus (TTSuV) is an emerging virus threating the Suidae species of unclear pathogenicity, although it was previously reported as a worsening factor of other porcine diseases, in particular, porcine circovirus associated disease (PCVAD). Here, a comprehensive codon usage analysis of the open reading frame 1 (ORF1), which encodes the viral capsid protein, was undertaken for the first time to reveal its evolutionary history. We revealed independent phylogenetic processes for the two genera during TTSuV evolution, which was confirmed by principal component analysis (PCA). A low codon usage bias was observed in different genera and different species, with Kappatorquevirus a (TTSuVk2a) displaying the highest, which was mainly driven by mutation pressure and natural selection, especially natural selection. Overall, ATs were more abundant than GCs, along with more A-ended synonymous codons in relative synonymous codon usage (RSCU) analysis. To further confirm the role of natural selection and TTSuV adaptation to the Suidae species, codon adaptation index (CAI), relative codon deoptimization index (RCDI), and similarity index (SiD) analyses were performed, which showed different adaptations for different TTSuVs. Importantly, we identified a more dominant role of Sus scrofa in the evolution of Iotatorquevirus (TTSuV1), with the highest CAI values and lowest RCDI values compared to Sus scrofa domestica. However, in TTSuVk2, the roles of Sus scrofa and Sus scrofa domestica were the same, regarding codon usage, with similar CAI and RCDI values. Our study provides a new perspective of the evolution of TTSuV and valuable information to develop control measures against TTSuV.


Introduction
The torque teno virus (TTV) was first identified in Homo sapiens and then in numerous host species and, therefore, was considered to have a broad host range [1,2]. TTV infecting the Suidae species is named the torque teno sus virus (TTSuV). Until now, the pathogenicity of TTSuV was considered uncertain, with an unclear role as a negative factor for other diseases, especially porcine circovirus associated disease (PCVAD) [3]. Similar to porcine circovirus, the genome of TTSuV is a single-stranded, negative-sense circular DNA of 2.8 kb [4]. TTSuV belongs to two different genera, Iotatorquevirus

Recombination and Phylogeny
According to the International Committee on Taxonomy of Viruses (ICTV) (https://talk.ictvonline. org/taxonomy/), TTSuV1 and TTSuV2 belong to different genera, Iotatorquevirus and Kappatorquevirus, given their high divergence (>56%). Therefore, the two genera were analyzed independently. A total of 181 sequences published on the National Center for Biotechnology Information (NCBI) GenBank (https://www.ncbi.nlm.nih.gov/genbank/) were chosen for analysis.

Principal Component Analysis
The first two axes accounted for the majority of the total variation (54.84%), therefore they were chosen for principal component analysis (PCA). As shown in Figure 1C, the Iotatorquevirus and Kappatorquevirus genera grouped separate from each other. In contrast, TTSuV1a and TTSuV1b clustered closer together than TTSuVk2a and TTSuVk2b, in agreement with the phylogenetic distribution.

Principal Component Analysis
The first two axes accounted for the majority of the total variation (54.84%), therefore they were chosen for principal component analysis (PCA). As shown in Figure 1C, the Iotatorquevirus and Kappatorquevirus genera grouped separate from each other. In contrast, TTSuV1a and TTSuV1b clustered closer together than TTSuVk2a and TTSuVk2b, in agreement with the phylogenetic distribution.

Nucleotide and Codon Composition
Nucleotide A was the most abundant, regardless of species (Figure 2), followed by nucleotides G, C, and T in TTSuV1 and G, C, and T in TTSuVk2. In addition, ATs were more abundant than GCs.

Nucleotide and Codon Composition
Nucleotide A was the most abundant, regardless of species (Figure 2), followed by nucleotides G, C, and T in TTSuV1 and G, C, and T in TTSuVk2. In addition, ATs were more abundant than GCs. Furthermore, in terms of codon composition, A was the most abundant nucleotide, being more than 50% among all the codons at the third position regardless of genera (Supplementary Materials).
Furthermore, in terms of codon composition, A was the most abundant nucleotide, being more than 50% among all the codons at the third position regardless of genera (Supplementary materials).

RSCU in Different Species
Within the 18 most abundant synonymous codons, A-ended codons were the most frequent among the four synonymous codons, with an occurrence ten times in TTSuV1 and eight times in TTSuVk2. The frequencies of the other three synonymous codons were: five for C-ended, two for Tended, and one for G-ended in TTSuV1, while six for T-ended and four for C-ended in TTSuVk2 (Table 1). In addition, eight (CTA, GTA, AGT, CCA, ACA, GCA, AGA, GGA) and nine (CTA, ATA, GTA, AGC, ACT, GCT, AAA, AGA, GGA) preferred codons had relative synonymous codon usage (RSCU) values of more than 1.6, indicating over-represented codons in TTSuV1 and TTSuVk2, respectively, and a higher frequency of A-ended synonymous codons. In addition, the RSCU relative to species was performed. We found that in TTSuV1, the 18 most abundant synonymous codons displayed no difference, whereas distinct RSCU patterns were observed in TTSuVk2a and TTSuVk2b. Moreover, to analyze the impact of the host on the RSCU pattern of TTSuV, the RSCU value of reference hosts, Sus scrofa, Sus scrofa domestica, and TTSuV, were compared. No complete coincidence nor antagonism existed among them. The ratio of coincidence/antagonism in TTSuV1 was 6/12 (both in TTSuV1a and TTSuV1b) compared to Sus scrofa and 11/7 compared to Sus scrofa domestica while it was 2/16 (except 4/14 in TTSuVk2a) in Sus scrofa and 5/13 (7/11 in TTSuVk2a and 6/12 in TTSuVk2b) in Sus scrofa domestica in TTSuV2 ( Figure S1).

RSCU in Different Species
Within the 18 most abundant synonymous codons, A-ended codons were the most frequent among the four synonymous codons, with an occurrence ten times in TTSuV1 and eight times in TTSuVk2. The frequencies of the other three synonymous codons were: five for C-ended, two for T-ended, and one for G-ended in TTSuV1, while six for T-ended and four for C-ended in TTSuVk2 (Table 1). In addition, eight (CTA, GTA, AGT, CCA, ACA, GCA, AGA, GGA) and nine (CTA, ATA, GTA, AGC, ACT, GCT, AAA, AGA, GGA) preferred codons had relative synonymous codon usage (RSCU) values of more than 1.6, indicating over-represented codons in TTSuV1 and TTSuVk2, respectively, and a higher frequency of A-ended synonymous codons. In addition, the RSCU relative to species was performed. We found that in TTSuV1, the 18 most abundant synonymous codons displayed no difference, whereas distinct RSCU patterns were observed in TTSuVk2a and TTSuVk2b. Moreover, to analyze the impact of the host on the RSCU pattern of TTSuV, the RSCU value of reference hosts, Sus scrofa, Sus scrofa domestica, and TTSuV, were compared. No complete coincidence nor antagonism existed among them. The ratio of coincidence/antagonism in TTSuV1 was 6/12 (both in TTSuV1a and TTSuV1b) compared to Sus scrofa and 11/7 compared to Sus scrofa domestica while it was 2/16 (except 4/14 in TTSuVk2a) in Sus scrofa and 5/13 (7/11 in TTSuVk2a and 6/12 in TTSuVk2b) in Sus scrofa domestica in TTSuV2 ( Figure S1).

Factors Shaping the Codon Usage of TTSuV
To further estimate factors influencing the low codon usage bias of TTSuV, an ENC-plot analysis and neutrality analysis were performed. ENC-plot analysis showed that all the TTSuV strains located under the standard curve, indicating that mutation pressure was not the sole force influencing codon usage. In particular, the analysis revealed general clustering among individual species ( Figure 4A).

Factors Shaping the Codon Usage of TTSuV
To further estimate factors influencing the low codon usage bias of TTSuV, an ENC-plot analysis and neutrality analysis were performed. ENC-plot analysis showed that all the TTSuV strains located under the standard curve, indicating that mutation pressure was not the sole force influencing codon usage. In particular, the analysis revealed general clustering among individual species ( Figure 4A).   Next, the level of contribution of mutation pressure and natural selection were investigated. GC12s and GC3s regression showed a significant difference (p < 0.05), with a slope of 0.1270 (R 2 = 0.09522), indicating that mutation pressure accounts for 12.7%, while natural selection accounts for 87.3%. For TTSuVk2, we found a very significant difference (p < 0.001), with a slope of 0.1334 (R 2 = 0.07590), indicating that mutation pressure accounts for 13.34% and 86.66% for natural selection. In terms of species-specific neutrality plots, the slopes were 0.08221 and 0.1129 for TTSuV1a and TTSuV1b, respectively, and 0.1674 and 0.1788 for TTSuVk2a and TTSuVk2b, respectively ( Figure 4B). Overall, natural selection is the dominant factor shaping the evolution of different TTSuV species.

Species-Specific Codon Adaptation and Deoptimization Pattern of TTSuV
Next, we explored the level of species adaptation and deoptimization. Firstly, we determined the codon adaptation index (CAI) to investigate the expression level of TTSuV in different hosts, including Sus scrofa and Sus scrofa domestica. TTSuV1 displayed a higher adaptation compared to TTSuVk2, with CAI values ranging from 0.601 to 0.659, while for TTSuVk2, values ranged from 0.571 to 0.628, regardless of hosts. We observed that TTSuV1a exhibited the highest CAI value, while TTSuVk2a exhibited the lowest. We found TTSuV to be more adapted to Sus scrofa, except for TTSuVk2a, whcih had similar values for Sus scrofa and Sus scrofa domestica ( Figure 5A). Regarding the codon deoptimization of TTSuV in respect to its hosts, TTSuVk2 displayed the highest deoptimization by relative codon deoptimization index (RCDI) for both Sus scrofa and Sus scrofa domestica, especially for TTSuVk2a. Furthermore, for TTSuV1, high RCDI values were observed in Sus scrofa domestica in comparison to Sus scrofa (values 1.665 and 1.569, respectively). For TTSuVk2, the values against Sus scrofa and Sus scrofa domestica were 1.947 and 2.052, respectively. In addition, regarding species-specific groups, TTSuVk2a displayed the highest deoptimization to Sus scrofa domestica ( Figure 5B).

Selection Pressure on TTSuV is Species-Specific
A similarity index (SiD) analysis was performed to detect the degree to which the hosts', Sus scrofa and Sus scrofa domestica, codon usage pattern impacts the virus codon usage pattern. We found that, in comparison to Sus scrofa domestica, the role of Sus scrofa was more important in shaping the evolution of TTSuV, especially in TTSuVk2 (SiD values of Sus scrofa domestica on TTSuV1: 0.068 and in TTSuVk2: 0.132; SiD values of Sus scrofa on TTSuV1 and TTSuVk2: 0.094 and 0.149, respectively) ( Figure 5C). In addition, the influence of Sus scrofa on the individual species was more important than Sus scrofa domestica, with a more significant role on TTSuVk2b.

TTSuV Dinucleotide Abundance
Dinucleotide composition revealed that there was preference on the usage of dinucleotides. None of the 16 dinucleotides were under-represented, with the lowest frequency for CpG (0.789 ± 0.058) and TpT being over-represented with a value of 1.32 ± 0.11 in TTSuV1. In contrast, a wider range of dinucleotides were identified in TTSuVk2. CpG, GpT, and TpG were under-represented with mean ± SD values of 0.706 ± 0.047, 0.668 ± 0.055, and 0.742 ± 0.091, respectively, and TpT and CpT were over-represented with mean ± SD values of 1.237 ± 0.047, and 1.344 ± 0.104, respectively ( Figure 6). represented in red, while TTSuV1a, TTSuV1b, TTSuVk2a, and TTSuVk2b are represented in purple, blue, pink and green, respectively.
Regarding the codon deoptimization of TTSuV in respect to its hosts, TTSuVk2 displayed the highest deoptimization by relative codon deoptimization index (RCDI) for both Sus scrofa and Sus scrofa domestica, especially for TTSuVk2a. Furthermore, for TTSuV1, high RCDI values were observed in Sus scrofa domestica in comparison to Sus scrofa (values 1.665 and 1.569, respectively). For TTSuVk2, the values against Sus scrofa and Sus scrofa domestica were 1.947 and 2.052, respectively. In addition, regarding species-specific groups, TTSuVk2a displayed the highest deoptimization to Sus scrofa domestica ( Figure 5B).

Selection Pressure on TTSuV is Species-Specific
A similarity index (SiD) analysis was performed to detect the degree to which the hosts', Sus scrofa and Sus scrofa domestica, codon usage pattern impacts the virus codon usage pattern. We found that, in comparison to Sus scrofa domestica, the role of Sus scrofa was more important in shaping the evolution of TTSuV, especially in TTSuVk2 (SiD values of Sus scrofa domestica on TTSuV1: 0.068 and in TTSuVk2: 0.132; SiD values of Sus scrofa on TTSuV1 and TTSuVk2: 0.094 and 0.149, respectively) ( Figure 5C). In addition, the influence of Sus scrofa on the individual species was more important than Sus scrofa domestica, with a more significant role on TTSuVk2b.

TTSuV Dinucleotide Abundance
Dinucleotide composition revealed that there was preference on the usage of dinucleotides. None of the 16 dinucleotides were under-represented, with the lowest frequency for CpG (0.789  0.058) and TpT being over-represented with a value of 1.32  0.11 in TTSuV1. In contrast, a wider range of dinucleotides were identified in TTSuVk2. CpG, GpT, and TpG were under-represented with mean  SD values of 0.706  0.047, 0.668  0.055, and 0.742  0.091, respectively, and TpT and CpT were over-represented with mean  SD values of 1.237  0.047, and 1.344  0.104, respectively ( Figure 6).

Discussion
TTSuV in pig farms is considered to be ubiquitous worldwide. Although of uncertain pathogenicity, it still remains a threat to the porcine industry, based on previous studies showing its ability to worsen PCVAD [18,19]. Therefore, to avoid the potential risk to pig farms, a better understanding of the evolutionary changes of TTSuV is one of the important steps to develop preventive measures. Although previous studies performed codon usage analysis on TTSuV [20,21],

Discussion
TTSuV in pig farms is considered to be ubiquitous worldwide. Although of uncertain pathogenicity, it still remains a threat to the porcine industry, based on previous studies showing its ability to worsen PCVAD [18,19]. Therefore, to avoid the potential risk to pig farms, a better understanding of the evolutionary changes of TTSuV is one of the important steps to develop preventive measures. Although previous studies performed codon usage analysis on TTSuV [20,21], no accurate comparison among different species nor correlation with adaptation were studied. Here, an extensive codon usage analysis based on the ORF1 gene was carried out for the first time to reveal the evolutionary changes and host-specific adaptation of the TTSuV species.
Recombination analysis detected a potential intra-species recombination signal in the ORF1 gene, especially in TTSuVk2a, in agreement with previous reports [8]. Phylogenetic analysis of the ORF1 gene combined with the nucleotide divergence revealed an independent evolutionary process for the two genera. Importantly, PCA revealed similar independent distributions of individuals, except for several overlaps in TTSuV1a and 1b, which might indicate divergence from a common ancestor [22]. In terms of nucleotide composition, ATs were preferred over GCs, with nucleotide A being more abundant. Additionally, we revealed dinucleotide abundance differences. The CpG composition was the lowest and under-represented in TTSuV1, while for TTSuVk2, it was low but not the lowest among the 16 dinucleotides. CpG in relation to RSCU analysis showed that all the NCG and CGN synonymous codons were less than one, indicative of negative codon usage for both TTSuV1 and TTSuVk2. CpG deficiency is a reflection of unmethylated CpG. Unmethylated CpG act as signatures for the innate immune system [23]. Whether this is the case for TTSuVs remains to be invistigated.
Generally, multiple factors drive codon usage bias, with mutation pressure and natural selection accounting for the biggest effect in most species [24]. Overall, the codon usage of TTSuV was low for both TTSuV1 and TTSuVk2. However, it was higher in TTSuVk2 than in TTSuV1, especially in TTSuVk2a. Low codon usage bias might be beneficial for virus replication in hosts which have different codon usage patterns [25], such as the chicken anemia virus (CAV), belonging to the same family, the Anelloviridae, but previously classified as Circoviridae [26], which has an ENC value of 55 ± 0.93 based on the ORF1 gene [27]. The porcine circovirus 2 (PCV2) has an ENC value of 54.31 [28], while the PCV3 has an ENC value of 55.52 [29], which enables prevalence in swine. Therefore, low codon usage bias in TTSuV may facilitate replication. Additionally, we investigated the forces driving the low codon usage bias of TTSuV using ENC-plot analysis and neutrality analysis. We found that both mutation pressure and natural selection drive the evolution of TTSuV, especially natural selection, having a more dominant effect in TTSuV1 compared to TTSuVk2.
RSCU analysis revealed that A-ended synonymous codons were preferred in TTSuV, in line with the overall AT-rich composition. We also compared the RSCU pattern between TTSuV and hosts and found a mixed phenomenon with different magnitudes of coincidence and in-coincidences in species-specific groups with a high ratio of coincidence/antagonism in TTSuV1 to Sus scrofa domestica. Consistent codon usage patterns allow effective amino acid translation, while inconsistent codon usage patterns are beneficial for protein folding [30]. This phenomenon suggested host selection pressure and possible adaptation of TTSuV, especially TTSuV1 to Sus scrofa domestica [31].
CAI analysis reflects the gene expression level to host cells and thus the effect of natural selection on virus evolution [32]. We performed CAI, RCDI and SiD analysis against the reference hosts Sus scrofa and Sus scrofa domestica. CAI analysis revealed that TTSuV1 and TTSuVk2 experienced different evolution dynamics: TTSuV1 was more adapted to different host species, especially to Sus scrofa, while for TTSuVk2, the same CAI values were found, except for TTSuVk2b. In particular, TTSuV1 was more expressed in Sus scrofa, while the same expression level was identified in Sus scrofa and Sus scrofa domestica for TTSuVk2a. Low RCDI values indicated high expression or replication in hosts [31]. RCDI analysis of TTSuV revealed higher values in Sus scrofa domestica, compared to Sus scrofa, both for TTSuV1 and TTSuVk2, especially in Sus scrofa domestica for TTSuVk2a. Using SiD analysis, we found that Sus scrofa and Sus scrofa domestica imposed similar selection pressure on TTSuVk2, especially on TTSuVk2a. However, in TTSuV1, the role of Sus scrofa was more important than Sus scrofa domestica. Overall, the above results indicate the significant role of Sus scrofa in shaping the evolution of TTSuV, specially TTSuV1, and both Sus scrofa and Sus scrofa domestica in the TTSuVk2a, which might contribute to the high prevalence of TTSuVk2a worldwide.
In conclusion, to explain the codon usage changes during the evolution of TTSuV, an extensive codon usage analysis was performed for the first time. We found that TTSuV1 and TTSuVk2 experienced different evolution dynamics. The low codon usage bias could benefit TTSuV host adaptation. The dominant role of natural selection might be one of the factors shaping the previously reported substitution rate [8]. In addition, CAI, RCDI, and SiD analysis uncovered the important role of Sus scrofa in the evolution of TTSuV, especially in TTSuV1a. Therefore, future pathogenicity studies should focus on TTSuV in Sus scrofa.

Sequence Data
A total of 181 complete TTSuV ORF1 genes were downloaded from the National Center for Biotechnology Information (NCBI) GenBank (https://www.ncbi.nlm.nih.gov/genbank/) up to February 2019. Nucleotide sequences were aligned in amino acids and then translated to nucleotide in MEGA 7 (Arizona State, USA; Commonwealth of Pennsylvania, USA) [33].

Recombination and Phylogenetic Analysis
A recombination signal was detected with the Recombination Detection Program (RDP4) (Cape Town, South Africa) [34]. A total of seven methods, including RDP, GENECONV, Chimaera, MaxChi, BootScan, SiScan, and 3Seq were applied with default settings, except for the replication, which was 1000 and a cut off p value was 0.01. More than four methods were needed to identify recombination for the sequences to be considered recombinant and further confirmed by SimPlot (v3.5.1) (Maryland, USA) [35]. After removal of recombinant sequences, a maximum likelihood (ML) phylogenetic tree was inferred using RAxML (v8.2.10) (Heidelberg, Germany) [36], based on the general time-reversible (GTR)+I+Γ substitution model identified by ModelGenerator (Nottingham, UK) [37].

Sequence Composition
The characterization of the sequence composition of the TTSuV1 and TTSuV2 ORF1 gene coding sequences included: (i) Nucleotide composition (A%, T%, G%, C%) (calculated using BioEdit (California, USA)); (ii) nucleotide frequency at the first, second, and third position of synonymous codons (A3s%, T3s%, G3s%, C3s%); (iii) G+C at the first, second, third position of synonymous codons; (iv) overall GC and AT frequency. Points (ii), (iii), and (iv) were calculated using the software CodonW (Oxford, UK) (http://codonw.sourceforge.net/). Given that Met and Trp encode only for ATG and TGG, respectively, and that TGA, TAA, and TAG are stop codons, they were excluded from the analysis.

Relative Synonymous Codon Usage
The RSCU values of a synonymous codon refers to the relative probability of its observed frequency to its expected frequency, assuming that all codons for an amino acid are used equally, removing the effect of amino acid composition on the use of codons [38]. Equation (1): where g ij is the observed number of the i th codon for the j th amino acid, which has n i kinds of alternative synonymous codons. A value of 1 is the boundary of the positive (>1) and negative (<1) codon usage. Values higher than 1.6 and less than 0.6, indicate over-represented and under-represented synonymous codons, respectively [39].

Principal Component Analysis
PCA is the representation of major tends of the codon usage pattern of a given coding sequence based on a multivariate statistical method. For the analysis, a 59-dimensional vector, which relates to the RSCU values of the 59 sense codons, represents each sequence [25]. Each axis value was calculated by CodonW.

Effective Number of Codons Analysis
ENC indicates the degree of codon usage bias, excluding gene length, as well as the occurrence of amino acids [25]. Here, it was calculated using the following formula (2): where F k (k = 2,3,4,6) means the average F k in the k-fold degenerate amino acid family and was calculated with the formula (3): where n means the summary of the codons for the corresponding amino acid. In addition, S was calculated as follows (4): where n i means the total frequency of the ith codon for the corresponding amino acid. The ENC value ranged from 20 to 61, with the value 20 indicating extreme codon usage bias and 61 indicating no codon bias [40]. Thus, larger ENC values indicate lower codon usage bias. Here, an ENC value of less than 35 was considered as significant high codon usage bias [41]. Furthermore, to better understand factors shaping the codon usage bias, ENC-plot analysis with GC3s plotted against the ENC values was completed. Mutation pressure was considered to be the only factor constraining the codon usage when the observed value sat on the standard curve. Otherwise, other factors shaped codon usage [42]. The ENC values were calculated as follows (5): where s means the percent of GC at the third position of synonymous codons (GC3s).

Neutrality Analysis
Neutrality analysis, the relationship between GC12s and GC3s, is commonly applied to distinguish the dominant role of mutation pressure and natural selection. The slope of the regression line when GC12s are plotted against GC3s indicates equilibrium in mutation-selection pressure. Thus, points distributed along the diagonal line indicate balance among the three codon positions, with no or little effect by selection pressure [43]. In general, the regression slope is the expression of the extent of neutrality.

Codon Adaptation Index and Relative Codon Deoptimization Index
CAI is considered to be a determination of the codon usage tendency of the virus to its corresponding hosts and displays the expression level of the respective coding sequence. With a range of 0-1, higher CAI values indicate higher preference, thus, more adaptation to hosts [44]. The CAI values were calculated using the CAIcal server (Tarragona, Spain) [45]. Sus scrofa and Sus scrofa domestica were used as reference hosts, with the relative synonymous codon usage of the two hosts retrieved from the Codon Usage Database (http://www.kazusa.or.jp/codon/). In contrast, RCDI is a measure of the tendency of the codon deoptimization of virus to its hosts. Here, it was calculated using the RCDI/eRCDI sever (Tarragona, Spain) [45]. A RCDI value more than 1, and closer to 1, means the virus is adapted to the host codon usage pattern [31].

Similarity Index
SiD is the measure that the host codon usage pattern has on shaping the virus codon usage pattern. It ranges from 0 to 1. A higher value indicates a more influencing role [46]. It was calculated using the following Formula (6) and (7):

Dinucleotide Abundance Analysis
The 16 dinucleotide abundance was calculated in the software DAMBE (Ottawa, Canada) [47], including the expected and observed frequencies. The comparison between the expected and observed value was performed using the following odds (8): where f xy is the observed occurrence of dinucleotide XY and f y f x is the expected occurrence of dinucleotide XY [48]. P xy more than 1.23 suggests over-represented dinucleotide abundance, while, P xy less than 0.78 suggests under-represented dinucleotide abundance.