Molecular and Structural Evolution of Porcine Epidemic Diarrhea Virus

Simple Summary To analyze the evolutionary characteristics of the highly contagious porcine epidemic diarrhea virus (PEDV), the complete genomes of 647 PEDV strains were analyzed. Eight amino acid (aa) sites of the S protein showed strong signals of positive selection, and seven of them were located on the surface of the S protein (S1 domain), suggesting a high selection pressure of S protein. Topologically, the S gene is more representative of the evolutionary relationship at the genome-wide level than are other genes. Structurally, the evolutionary pattern is highly S1 domain-related. The haplotype networks of the S gene showed that the strains are obviously clustered geographically in the lineages corresponding to genotypes GI and GII. The three distinguishable nucleic acid sites in the haplotypes assay suggested a putative evolutionary mechanism in PEDV. These findings provide several new fundamental insights into the evolution of PEDV and guidance for developing effective prevention countermeasures against PEDV. Abstract To analyze the evolutionary characteristics of the highly contagious porcine epidemic diarrhea virus (PEDV) at the molecular and structural levels, we analyzed the complete genomes of 647 strains retrieved from the GenBank database. The results showed that the spike (S) gene exhibited larger dS (synonymous substitutions per synonymous site) values than other PEDV genes. In the selective pressure analysis, eight amino acid (aa) sites of the S protein showed strong signals of positive selection, and seven of them were located on the surface of the S protein (S1 domain), suggesting a high selection pressure of S protein. Topologically, the S gene is more representative of the evolutionary relationship at the genome-wide level than are other genes. Structurally, the evolutionary pattern is highly S1 domain-related. The haplotype networks of the S gene showed that the strains are obviously clustered geographically in the lineages corresponding to genotypes GI and GII. The alignment analysis on representative strains of the main haplotypes revealed three distinguishable nucleic acid sites among those strains, suggesting a putative evolutionary mechanism in PEDV. These findings provide several new fundamental insights into the evolution of PEDV and guidance for developing effective prevention countermeasures against PEDV.


Introduction
The enteric disease porcine epidemic diarrhea (PED) in suckling piglets, characterized by severe diarrhea, vomiting and dehydration, was first reported in Europe in the early 1970s [1]. The etiology of porcine epidemic diarrhea virus (PEDV) has caused severe epidemics since the 1990s in Asia, including Japan [2] and Korea [3]. The devastating outbreaks of PEDV in China since 2006 [4][5][6], and in the United States since 2013 [7,8] led to a serious threat to swine health. Until now, PEDV has been the primary viral pathogen causing porcine diarrhea in China. The mortality rate of suckling piglets caused by PED often reaches 80-100%, resulting in serious economic losses in the swine industry.
PEDV is an enveloped, positive-sense, single-stranded RNA virus, belonging to the genus Alphacoronavirus in the family Coronaviridae of the order Nidovirales. Among the four PEDV structural proteins, namely spike (S), envelope (E), membrane (M), and nucleocapsid (N), the S protein dominates the surface of virus particles and mediates direct binding to cell receptors, is a major target to induce neutralizing antibodies [9][10][11], and can also be the target in vaccine development [12]. In coronavirus, as the major surface protein that directly interacts with cellular receptors, the S protein bears the greatest evolutionary pressure and the S gene contains the most variable regions in the entire PEDV genome, including the strains reported in the United States in 2013-2014 [8,13] with insertions and deletions in the S gene (S-INDEL). Thus, the diversity of the S gene is the phylogenetic marker in PEDV evolutionary analysis [13][14][15].
Despite the discoveries in the levels of the complete genome and the S gene, several fundamental issues related to the evolutionary patterns of PEDV remain unknown. In this study, we investigated the molecular divergence between PEDV and other related coronaviruses and carried out genetic analyses of PEDV complete genomes and structural protein genes, particularly from the viewpoint of the crystal structure of the S protein. This work could provide new insights into the evolution of PEDV and its spread pattern in animals.

Data Collection of PEDV and Other Related Viruses
To analyze the viral evolution of PEDV, part of the PEDV complete genome sequences in GenBank (https://www.ncbi.nlm.nih.gov/genbank, accessed on 20 June 2022) were not analyzed due to the high cell-adapted passages, or the artificially modified mutations. So, a total of 647 qualified whole-genome sequences of PEDV (from 22 countries) were selected for analysis in this study. Genome sequences of other related viruses were downloaded from GenBank. Details on the data set are summarized in Supplementary Table S1 and Figure S1.

Phylogenetic Analyses
The sequences of the PEDV complete genome, S gene and the combined gene fragments (ORF3, E, M, and N) were aligned using MUSCLE v3.8 [16]. The alignments were trimmed and edited by trimAl [17]. The codon alignments of the conserved ORFs of ORF1ab, S, ORF3, E, M, and N were further concatenated for down-stream evolutionary analysis. To infer phylogenetic trees, an ML approach was applied using RAxML [18]. The best-fit nucleotide substitution model was determined using Module Test-NG [19]. The model GTR + I + G4 was used in RAxML, which was run on 1000 bootstrap replicates.

Positive Selection of Amino Acids
The ratio of dN (nonsynonymous substitutions per nonsynonymous site) to dS (synonymous substitutions per synonymous site) substitution rates was identified as a value of ω (dN/dS). Positive selection was analyzed using EasyCodeML [20]. The M7 (beta) and M8 (beta and ω > 1) models were compared. In the M7 model, ω followed a beta distribution (0 ≤ ω ≤ 1), and in the M8 model, a proportion p0 of sites had ω drawn from the beta distribution, and the remaining sites with proportion p1 were positively selected and had ω1 > 1 [21]. After a likelihood-ratio test (LRT) for the pairwise comparisons of codon models using EasyCodeML, the naive empirical Bayes (NEB) and Bayes empirical Bayes (BEB) methods [22] were used to identify amino acid residues that potentially evolved under selection. The threshold for identifying amino acid sites under selection is a posterior probability of 0.95 [23].

Codon Usage Bias Analysis
The RSCU (Relative Synonymous Codon Usage) values of each codon in the PEDV CV777 genome (AF353511.1) were calculated. The RSCU value for each codon was the observed frequency of this codon divided by its expected frequency under equal usage among the amino acid [24]. The codons with RSCU > 1 were defined as preferred codons, and those with RSCU < 1 were defined as unpreferred codons. The FOP (frequency of optimal codons) value of each gene was calculated as the number of preferred codons divided by the total number of preferred and unpreferred codons.

Amino Acid Alignment
Amino acid sequences of S and N proteins were aligned to reveal the sequence identity using ESPript 3.0 [25]. The positively selected sites in the S and N proteins from NEB analysis were labeled (*: p > 95%; **: p > 99%).

Protein Spatial Structure Analysis
We generated aligned sequences of S protein to reveal the highly mutated regions using the sequence logo generator WebLogo 3 [26]. The highly mutated regions in the S protein were labeled in the structural model (spheres mode) of the S protein based on the PDB 6vv5 [27]; the brown transparent labeled region indicated the S1 domain.

Haplotype Network
The software DnaSP v6 [28] was used to generate multi-sequence aligned haplotype data, and PopART v1.7 [29] was used to draw haplotype networks based on the haplotypes generated by DnaSP v6. The evolution pattern of the PEDV was analyzed based on the representative strains identified by dominant haplotypes using MEGA-X [30].

The Divergence between PEDV and Other Related Coronaviruses
We concatenated six ORFs (ORF1ab, S, ORF3, E, M, N) and used CODEML in the PAMLX to calculate the pairwise dN, dS, and ω values between PEDV and other viruses ( Table 1). The results showed that the dS value varied across genes in CV777 and the other viruses, and that the S gene exhibited larger dS values than did other genes (Table 1), which could be caused by a high mutation rate or by natural selection that favors synonymous substitutions. The results of codon usage bias analysis (Supplementary Table S2) showed no difference between the frequency of optimal codons (FOP) of the S gene and the genomic average (0.656 versus 0.659).

Positive Selection of PEDV and Related Coronaviruses
The genome-wide ω value from 0.0782 to 0.3278 (Table 1) between PEDV CV777 and other viruses indicates a strong negative selection on the nonsynonymous sites, which means that from 67.22% to 92.18% of the nonsynonymous mutations were removed during viral evolution. In the assay of extent of positive selection, the S genes of all the viruses were analyzed using the M7 (beta: neutral and negative selection) and M8 (beta & ω > 1: neutral, negative, and positive selection) models in CODEML. The M8 model (lnL = −18,515.815, np = 22) was a significantly better fit than the M7 (lnL = −18,531.752, np = 20) model (p = 1.20 × 10 −7 ), suggesting that some aa substitutions were favored by positive selection. Under the M8 model, 91.436% (p0) of the nonsynonymous substitutions were estimated under neutral evolution or purifying selection (0 ≤ ω ≤1), and 8.564% (p1) of the nonsynonymous substitutions were under positive selection (ω = 1.503). A naive empirical Bayes (NEB) analysis suggested that eight aa sites showed strong signals of positive selection, seven of these positively selected sites were in the N-terminal domain (NTD) of the S protein (surface of the S protein), and one site (1101A) was in the S2 domain ( Figure 1A,B, Supplementary Figures S2 and S3). Two sites (166R and 213R) of the S protein with the highest values of "post mean +-SE for ω" in the NEB analysis were also identified as positively selected sites in the Bayes empirical Bayes (BEB) analysis. The results of the positive selection assay indicated that these sites were responsible for the evolution of S protein sequences, and deserve further functional studies. In the N gene, the M8 model (lnL = −5144.209, np = 22) was not a better fit than the M7 (lnL = −5145.170, np = 20) model (p = 0.382), suggesting that positive selection was not favored. Only one aa site on the N protein showed a strong signal for positive selection (247V) in the NEB analysis ( Figure 1C), but it did not in the BEB analysis.

Molecular Phylogenetic Analysis of PEDV Strains
The phylogenetic trees of PEDVs were constructed based on the sequence of the complete genome (represented by six concatenated ORFs, including ORF1ab, S, ORF3, E, M and N), spike, and ORF3-E-M-N, respectively. Topologically, as shown in Figure 2, the similarity between the complete genome and spike is higher than that of the complete genome and ORF3-E-M-N. Genotypes of G1 and G2 showed clear differentiation in both the phylogenetic trees of the complete genome and spike, while the tree of ORF3-E-M-N was cross-connected topologically compared with the complete genome and spike, indicating that the spike gene is more representative of the evolutionary relationship at the genome-wide level than are other genes. Geographically, the American strains in genotype GI were clustered with most of the strains from Japan and South Korea. The

Molecular Phylogenetic Analysis of PEDV Strains
The phylogenetic trees of PEDVs were constructed based on the sequence of the complete genome (represented by six concatenated ORFs, including ORF1ab, S, ORF3, E, M and N), spike, and ORF3-E-M-N, respectively. Topologically, as shown in Figure 2, the similarity between the complete genome and spike is higher than that of the complete genome and ORF3-E-M-N. Genotypes of G1 and G2 showed clear differentiation in both the phylogenetic trees of the complete genome and spike, while the tree of ORF3-E-M-N was cross-connected topologically compared with the complete genome and spike, indicating that the spike gene is more representative of the evolutionary relationship at the genome-wide level than are other genes. Geographically, the American strains in genotype GI were clustered with most of the strains from Japan and South Korea. The strains from China were distributed in both the genotypes GI and GII with a similar ratio in all three phylogenetic trees.

Molecular Phylogenetic Analysis of PEDV Strains
The phylogenetic trees of PEDVs were constructed based on the sequence of the complete genome (represented by six concatenated ORFs, including ORF1ab, S, ORF3, E, M and N), spike, and ORF3-E-M-N, respectively. Topologically, as shown in Figure 2, the similarity between the complete genome and spike is higher than that of the complete genome and ORF3-E-M-N. Genotypes of G1 and G2 showed clear differentiation in both the phylogenetic trees of the complete genome and spike, while the tree of ORF3-E-M-N was cross-connected topologically compared with the complete genome and spike, indicating that the spike gene is more representative of the evolutionary relationship at the genome-wide level than are other genes. Geographically, the American strains in genotype GI were clustered with most of the strains from Japan and South Korea. The strains from China were distributed in both the genotypes GI and GII with a similar ratio in all three phylogenetic trees.

PEDV Mutation from the Perspective of S Protein Spatial Structure
The mutation sites in the S protein based on spatial structure were analyzed. According to the results of the mutation assay using 647 sequences of S protein ( Figure 3A) and illustrated in the spatial structural data of PEDV S protein (PDB: 6vv5), we refrained from modeling residues showed as gray because these are missing from publicly available structures. It was found that most of the highly mutated residues of S protein were located in the S1 domain, the surface of the S protein ( Figure 3B), which was consistent with the assay results of positively selected sites ( Figure 1A), indicating that the evolutionary pattern of PEDV S protein is highly S1 domain-related.
The mutation sites in the S protein based on spatial structure were analyzed. Accor ing to the results of the mutation assay using 647 sequences of S protein ( Figure 3A) an illustrated in the spatial structural data of PEDV S protein (PDB: 6vv5), we refrained fro modeling residues showed as gray because these are missing from publicly availab structures. It was found that most of the highly mutated residues of S protein were locate in the S1 domain, the surface of the S protein ( Figure 3B), which was consistent with t assay results of positively selected sites ( Figure 1A), indicating that the evolutionary pa tern of PEDV S protein is highly S1 domain-related.

The Evolutionary History of PEDV Lineages
In the assay of PEDV lineage formation, the putative lineages of PEDV were foun when we constructed the haplotype networks using the PEDV S gene (297 representati strains out of 647 strains). The results showed that the strains were obviously clustere geographically in the lineages ( Figure 4A) corresponding to genotype GI and GII, whi was consistent with the results of the phylogenetic analysis. We performed alignme analysis on representative strains of the main haplotypes ( Figure 4B), and three disti guishable nucleic acid sites among those strains were found, including, a 12 nt insertio of the GII (AACCAGGGTGTC/T), a 3 nt insertion of the GII (G/AAT), and a 6 nt deletio of the GII (GGAAAA, or AATAGA), respectively. Interestingly, the linked strain TW/Yu lin550/2018 (MK673545.1) between GI and GII showed a completely different pattern the 6 nt site (AATAGA, aa of Asn-Arg), which is in a highly mutated region of the domain as shown in Figure 3 (site III).

The Evolutionary History of PEDV Lineages
In the assay of PEDV lineage formation, the putative lineages of PEDV were found when we constructed the haplotype networks using the PEDV S gene (297 representative strains out of 647 strains). The results showed that the strains were obviously clustered geographically in the lineages ( Figure 4A) corresponding to genotype GI and GII, which was consistent with the results of the phylogenetic analysis. We performed alignment analysis on representative strains of the main haplotypes ( Figure 4B), and three distinguishable nucleic acid sites among those strains were found, including, a 12 nt insertion of the GII (AACCAGGGTGTC/T), a 3 nt insertion of the GII (G/AAT), and a 6 nt deletion of the GII (GGAAAA, or AATAGA), respectively. Interestingly, the linked strain TW/Yunlin550/2018 (MK673545.1) between GI and GII showed a completely different pattern at the 6 nt site (AATAGA, aa of Asn-Arg), which is in a highly mutated region of the S1 domain as shown in Figure 3 (site III).

Discussion
The S protein of coronavirus directly interacts with cellular receptors; the diversity of the S gene is often used as the phylogenetic marker in evolutionary analysis. The complete genomes of diverse strains from the global database promotes a better understanding of evolutionary and phylogenetic relationships [31]. In the study of virus evolution, one method of testing for selection is to compute the ratio of nonsynonymous to synonymous substitution rates (ω); ω is expected to have a value of 1 under the assumption of neutral evolution. Positive and negative selection are indicated when ω > 1 and ω < 1, respectively [21]. The M8-M7 comparison model offers a very stringent test of positive selection [32]. In terms of the S gene of different coronaviruses, the M8 model was a better fit than the M7 model; the favor of positive selection here and the high mutation rate of the S gene [14] make it the best target for evolutionary assays, which also was confirmed in the evolutionary comparison of the whole genome sequence with the S gene and other structural protein genes.
Previous studies showed that natural selection was the main force influencing the codon usage pattern of PEDV, while mutation pressure played a minor role [33][34][35]. Here, if positive selection is the driving force for the higher synonymous substation rate seen in spike, we would expect the FOP of spike to be different from that of the genome. The elevated synonymous substitution rate measured in the S gene might be more likely caused by higher mutation rates, but the FOP of the S gene showed no obvious difference to that of the genomic average; the underlying molecular mechanism remains unclear, and deserves further study. Synonymous substitutions may serve as another layer of genetic regulation, guiding the efficiency of mRNA translation by changing codon usage.
By the application of NEB analysis, we found that S and N proteins were more favored for mutation than other proteins in this study, which was consistent with antigenic study of possible antigenic differences that emerged in both the spike and nucleocapsid proteins between different genogroups [36]. As we showed here, the S gene was found to provide the maximal interpretative power in PEDV evolution because of its high phylogenetic signal with substitution rate, and a phylogenetic topology similar to that obtained from the complete genome [37,38], which would facilitate data analysis.
Variations in the S protein are important for revealing the genetic evolution and the pathogenicity of PEDV strains [39][40][41]. Structurally, we found that the highly mutated residues of the S protein were S1 domain-dominated, indicating a strong S1-related evolutionary pattern of PEDV, which might be induced by the antigenic drift as amino acid positions with significant variation among isolates from different regions and subgroups were found [14]. In other coronaviruses, the naturally occurring spike mutations in NTD influence the infectivity and immunogenicity, and NTD has been identified as one of the mutation hotspots of SARS-CoV-2 [42,43]. PEDV S protein may undergo a conformational change after receptor binding and cleavage by exogenous trypsin, which induces membrane fusion [44].
Selective pressures drive adaptive changes in the coronavirus S proteins directing virus-cell entry. The high hypervariability in the SARS-CoV-2 S protein appears to be driven by counterbalancing pressures of effective virus-cell entry and durable extracellular virus infectivity [45], which could be caused by the variation in amino acid positions. The binding domain of the PEDV cellular receptor APN was shown to reside within a domain in the C-terminal of the S1 domain (residues 477-629), which is close to one of the sites we found in haplotype networks for the potential differentiation of the PEDV genotype. The strain of TW/Yunlin550/2018 (MK673545.1) that is located on the edge of the genotypes of GI and GII showed a completely different pattern at the 6 nt site (AATAGA, aa of Asn-Arg, NR) compared with the GI strains (GGAAAA, aa of Gly-Lys, GK), which might result in better host adaption during virus evolution when facing the counterbalancing pressures; further study on the evolution pattern is needed.
Geographically, the GI strains in Europe were evolutionarily separated from the GII strains in other global regions. In the genotype GII, the American strains were clustered with most of the strains from Japan and South Korea, and strains in China were shown in a more compact branch. The increasingly international pig industry involves the trade of various breeding materials and animals, which may increase the risk of disease transmission. The global exchange of ingredients has created demand for products that prevent disease transmission from the feed, such as the use of the monoglyceride blend, which could mitigate and prevent PED transmission to piglets from contaminated feed [46].
Overall, we found that S protein showed strong signals of positive selection, and the evolutionary pattern of S protein was highly S1 domain-related, which also represents the marker for clustering lineages corresponding to genotypes GI and GII geographically. These findings provide several fundamental insights into the evolution of PEDV and guidance for developing effective prevention countermeasures against PEDV.

Conclusions
PEDV S proteins showed strong signals of positive selection, and seven of them were located in the surface of the S protein (S1 domain), suggesting the high selection pressure of S protein. The S gene is more representative at the genome level and the evolutionary pattern is highly S1 domain-related. The three distinguishable nucleic acid sites in the haplotype assay suggested a putative evolutionary mechanism of PEDV. These findings provide several new fundamental insights into the evolution of PEDV and guidance for developing effective prevention countermeasures against PEDV.
Supplementary Materials: The following supporting information can be downloaded at https:// www.mdpi.com/article/10.3390/ani12233388/s1. Figure S1: Location and number of the isolated PEDV strains (with qualified sequence data). Figure S2: Positive selection assay of S protein by NEB analysis. Positive selection assay of S protein using the NEB method, the positive selected sites are labeled in red. Figure S3: Amino acid alignment of PEDV S and N proteins. The alignment of the complete amino acid of PEDV S and N proteins was carried out using ESPript 3.0. Table S1: Detailed information of the PEDV complete genomes used in this study. Table S2