Genomic Analysis of G2P[4] Group A Rotaviruses in Zambia Reveals Positive Selection in Amino Acid Site 7 of Viral Protein 3

The G2P[4] genotype is among the rotavirus strains that circulate commonly in humans. Several countries have reported its immediate upsurge after the introduction of rotavirus vaccination, raising concern about sub-optimal vaccine effectiveness against this genotype in the long term. This study aimed to gain insight into the evolution of post-vaccine Zambian G2P[4] group A rotavirus (RVA) strains and their overall genetic make-up by analysis of sequence alignments at the amino acid (AA) level. Twenty-nine Zambian G2P[4] rotavirus strains were subjected to whole-genome sequencing using the Illumina MiSeq® platform. All the strains exhibited the typical DS-1-like genotype constellation, and the nucleotide sequences of the 11 genome segments showed high nucleotide similarities (>97%). Phylogenetic analyses together with representative global G2P[4] RVA showed that Zambian strains clustered into human lineages IV (for VP2, VP4, VP7, NSP1, and NSP5), V (for VP1, VP3, VP6, NSP2, and NSP3), and XXIII (for NSP4). The AA differences between the lineages where the study strains clustered and lineages of global reference strains were identified and analyzed. Selection pressure analysis revealed that AA site seven in the Viral Protein 3 (VP3) genome segment was under positive selection. This site occurs in the region of intrinsic disorder in the VP3 protein, and Zambian G2P[4] strains could potentially be utilizing this intrinsically disordered region to survive immune pressure. The Zambian G2P[4] strains from 2012 to 2016 comprised the G2P[4] strains that have been circulating globally since the early 2000s, highlighting the epidemiological fitness of these contemporary G2P[4] strains. Continuous whole-genome surveillance of G2P[4] strains remains imperative to understand their evolution during the post-vaccination period.


Introduction
Group A rotavirus (RVA)-induced acute gastroenteritis persists among the leading causes of mortality in children under five years, particularly in sub-Saharan Africa and southeast Asia [1]. To alleviate RVA disease burden, the World Health Organization (WHO) prequalified four rotavirus vaccines, namely Rotarix ® (GlaxoSmithKline, Rixenstart, Belgium), RotaTeq ® (Merck & Co., Whitehouse Station, NJ, USA), Rotavac ® (Bharat Biotech, India), and Rotasiil ® (Serum Institute, India), after extensive efficacy and safety studies (https://www.who.int/publications/i/item/WHO-IVB-2021.03 (accessed on 10 January four intergenogroup reassortant strains [17,19], there is generally a paucity of published Zambian whole-genome G2P [4] RVA sequence data. The purpose of the present study was to perform whole-genome characterization of Zambian G2P [4] strains collected during routine surveillance to gain insight into their overall genetic makeup and evolution.

Ethics Statement
The Health Sciences Research Ethics Committee (HSREC) at the University of the Free State in Bloemfontein, South Africa, granted ethical permission for this study under ethics number (UFS-HSD2016/1082).

Stool Specimen and Strain Description
RVA-positive fecal specimens (n = 133) were obtained from children under the age of five years presenting with acute gastroenteritis as part of the ongoing World Health Organization Regional Office for Africa (WHO-AFRO) RVA surveillance program in Zambia. The stool samples were collected between 2011-2016 and were characterized into G and P types by conventional genotyping methods at the Diarrheal Pathogens Research Unit (DPRU), a WHO Rotavirus Reference Laboratory in South Africa (WHO RRL SA). Briefly, viral dsRNA was extracted from 10% fecal suspension using the QIAmp viral RNA extraction method (Qiagen, Hilden, Germany). The extracted RNA was reverse transcribed and amplified using consensus primer pairs Con2/Con3 and sBeg/End9 as described previously [37,38]. The resulting cDNA template was used for G and P typing using semi-nested RT-PCR amplification of the genes targeting VP7 and VP4. Out of the 133 samples, 29 were G2P [4] strains that were analyzed in this study.

Data Collection from GenBank
Datasets from different global geographical regions retrieved from GenBank were included for analysis (The accession numbers for the reference sequences used in this study are included as Supplementary Data S1).

Double-Stranded RNA Extraction
Rotavirus dsRNA was extracted as previously described [20]. Briefly,~100 mg of fecal specimen was added to 200 µL of phosphate-buffered saline (PBS) solution, 0.01 M, pH (Sigma-Aldrich ® , St Louis, MO, USA) to generate a fecal suspension. The fecal suspension was vortexed for 10 s and then left to stand for ten minutes at room temperature, after which a 300 µL of fecal suspension was added to 900 µL of TRI-Reagent ® -LS (Molecular Research Center, Cincinnati, OH, USA). The fecal-TRI-Reagent solution was vortexed for 10 s and left to stand at room temperature for 10 min. A 300 µL volume of chloroform (Sigma-Aldrich ® , St Louis, MO, USA) was added to the fecal-Tri-Reagent solution. After vortexing the solution for 10 s and letting it stand for 5 min at room temperature, the solution was centrifuged for 18,000× g for 20 min at 4 • C. The aqueous supernatant was removed and precipitated with 700 µL of ice-cold isopropanol (Sigma-Aldrich ® , St Louis, MO, USA). Extracted RNA was incubated in 8 M Lithium chloride (Sigma-Aldrich ® , St Louis, MO, USA) for 16 h at 4 • C to enrich for rotavirus dsRNA and then subsequently purified using the MinElute PCR purification kit (Qiagen, Hilden, Germany). Electrophoresis was then performed on a 5 µL aliquot of dsRNA in 1% 0.5 TBE agarose (Bioline, UK) gel stained with Pronasafe (Condalab, UK) for 1 h at 95 volts to check the integrity of the extracted and purified rotavirus dsRNA.

cDNA Synthesis
The complementary DNA (cDNA) was synthesized using the Maxima H Minus Double-Stranded cDNA Synthesis Kit (Thermo Fischer Scientific, Waltham, MA, USA), with modifications. Briefly, 13 µL of extracted viral RNA was denatured at 95 • C for 5 min, and then, 1 µL of random hexamer primer was added. Annealing was performed at 65 • C for 5 min. A 5 µL volume of First-Strand Reaction Mix was added to the solution, followed by 1 µL of First-Strand Enzyme. The mixture was incubated for 10 min at 25 • C, followed by 2 h at 50 • C. Afterward, the following components were added: 55 µL of nuclease-free water, 20 µL of Second-Strand Reaction Mix, and 5 µL of Second-Strand Enzyme, and the solution was incubated at 16 • C for 60 min. A 6 µL volume of 0.5 M EDTA, pH 8.0 was added to stop the reaction, and residual RNA was removed by adding 10 µL of RNase I.

DNA Library Preparations and Whole-Genome Sequencing
DNA libraries were generated using the Nextera XT DNA Library Preparation Kit (Illumina, San Diego, CA, USA) as per the manufacturer's instructions. Briefly, the genomic cDNA was tagmented, indexed, and then amplified. Afterward, the amplified, tagmented, and indexed DNA was cleaned up using Ampure XP beads (Beckman Coulter, Pasadena, CA, USA). The library quality and fragment sizes were then validated using an Agilent 2100 Bioanalyzer (Agilent Technologies, Waldbronn, Germany). The DNA libraries were then normalized to 4 nM, and then, 5 µL of normalized libraries were then pooled together into a single tube. A 5 µL pool of the normalized libraries was denatured using 5 µL of 0.2 N sodium hydroxide. Afterward, 990 µL of hybridization buffer was added to the 10 µL of the denatured 4 nM DNA library to dilute to 20 pM with a further dilution to achieve a final loading concentration of 8 pM with a 20% Phix control. The combined library and Phix were loaded into the Illumina V3 reagent cartridge and sequencing performed for 600 cycles (301 bp X 2 paired-end) on a MiSeq ® Illumina platform (Illumina, San Diego, CA, USA).

Genome Assembly
Quality control analysis of the sequenced raw data was performed using FASTQC v.0.11.9 [39]. Adapter sequences were trimmed from the raw FASTQ sequence data using BBDuk trimmer (https://sourceforge.net/projects/bbmap/ (accessed on 27 December 2022)). Reference-based mapping using the prototype DS-1-like reference strain (accession numbers HQ650116-HQ650126) was performed using Geneious Read Mapper 6.0.3 [40]. The Geneious Consensus Tool was used for consensus calling [40]. The Annotate and Predict Tool in Geneious Prime ® version 2020.1.1 was used to annotate low coverage (<200) regions, whereby a coverage-annotated track indicating the annotated regions was generated to aid in consensus calling.

Whole-Genome Genotyping
The genome segments were genotyped using the web-based Virus Pathogen Database and Analysis Resource (ViPR) to generate full-genome constellations [41].

Phylogenetic Analysis
Basic Local Alignment Search Tool (BLAST) and the National Center for Biotechnology Information's (NCBI) Virus Variation Resource were used to compile reference sequences [42,43]. The phylogenetic trees for the genome segments encoding VP4 and VP7 were created using previously described lineage designations [35,44]. A previously proposed lineage framework was used to assign lineages to non-G and non-P genome segments [45].
The open reading frame (ORF) sequences for each genome segment were aligned using the Multiple Sequence Comparison by Log-Expectation (MUSCLE) tool [46] in Molecular Evolutionary Genetic Analysis (MEGA) version X [47]. The Model Test in MEGA X was used to estimate the best evolutionary model. The evolutionary models used were Generalized Time Reversible + Gamma + Inversions (GTR + G + I) for VP1, VP2, and VP3; Tamura-3-parameter + Gamma + Inversions (T92 + G + I) for NSP2, NSP3, and NSP4; Tamura-3-parameter + Gamma (T92 + G) for VP4 and VP6; and Tamura-3 parameter + Inversions (T92 + I) for VP7, NSP1, and NSP5. For model selection analysis, maximum likelihood was chosen as the statistical method, and gaps/missing data were handled using partial deletion with an 80% site coverage cut-off and moderate-branch swap filtering. The MEGA X software was used to generate maximum likelihood phylogenetic trees for each genome segment with a 1000 bootstrap support.

Inference of Selective Pressures
In order to enhance the estimation of sites in the genome of G2P [4] RVA strains undergoing negative and positive selection, we employed three evolutionary analyses method available from the DataMonkey webserver [48]. These are fixed-effects likelihood (FEL) [49], Fast Unbiased Bayesian AppRoximation (FUBAR) [50], and Mixed Effects Model of Evolution (MEME) [51]. The basis of these three selective pressure analyses methods are the estimation of non-synonymous (dN) and synonymous (dS) substitution rates on a per-site basis for the coding alignment of the G2P [4] RVA genome segments. The FEL and MEME methods utilize a maximum likelihood approach, while FUBAR employs a Bayesian approach [48]. The FEL and MEME analysis was performed with p-value threshold of 0.1, while FUBAR was performed with a posterior probability of 0.9.

Inference of Selective Pressures
In order to enhance the estimation of sites in the genome of G2P [4] RVA strains undergoing negative and positive selection, we employed three evolutionary analyses method available from the DataMonkey webserver [48]. These are fixed-effects likelihood (FEL) [49], Fast Unbiased Bayesian AppRoximation (FUBAR) [50]), and Mixed Effects Model of Evolution (MEME) [51]. The basis of these three selective pressure analyses methods are the estimation of non-synonymous (dN) and synonymous (dS) substitution rates on a per-site basis for the coding alignment of the G2P [4] RVA genome segments. The FEL and MEME methods utilize a maximum likelihood approach, while FUBAR employs a Bayesian approach [48]. The FEL and MEME analysis was performed with p-value threshold of 0.1, while FUBAR was performed with a posterior probability of 0.9.

Lineages and Sub-Lineages
The table summarizes the amino acid differences observed between defined G2 lineages in respect to lineage IV where the study strains clustered, their region of occurrence in VP7, and the properties of the AA residues. Table 3. Amino acid (AA) differences observed across the P [4] lineages.   The table summarizes the amino acid differences observed between defined P [4] lineages in respect to lineage IV where the study strains clustered, their region of occurrence in VP4, and the properties of the AA residues. The asterisk after "VP5" in indicates that it is a specific protein generated after trypsin cleavage of VP4.

Lineages Representative P[4] Sequences
Amino acid position 159 of lineage V of R2 sequences where Zambian strains clustered featured an arginine, while other R2 lineages had a lysine (Tables S1 and S2). Three AA differences (N/D28I, I70V, and V538I) were observed between C2 sequences in lineages I-III and V-XIV in respect to lineage IV, where the study strains clustered (Tables S3 and  S4). We observed three AA differences (S49N, V209I, and M348V) for the VP3 sequences between lineages I-V and VI-XIV in respect to the lineage V, where Zambian sequences grouped (Tables S5 and S6).
In the genome segment encoding NSP1, three AA differences were observed between the lineages I-II, V, and the lineage IV of the study strains. These differences are represented by l273F, H319R, and K429R, as shown in Tables S7 and S8. Similarly, in NSP2, NSP3, NSP4, and NSP5, there were conservative AA differences between the reference lineages and the lineage of the study strains. These differences are represented by L273F in NSP2, M61I in NSP3, R137Q in NSP4, and V122M in NSP5, as shown in Tables S9-S16.

Selection Pressure Analysis
The AA sites in the genome segments of Zambian G2P [4] strains were undergoing purifying selection with the exception of the AA site seven in the VP3 gene, which was under diversifying/positive selection ( Figure 12 and Table S17).

Selection Pressure Analysis
The AA sites in the genome segments of Zambian G2P [4] strains were under purifying selection with the exception of the AA site seven in the VP3 gene, which under diversifying/positive selection ( Figure 12 and Table S17). This site was found to have two AA: isoleucine occurring in ~40% of the strain arginine in ~60% of the strains (Table S18). This site was found to have two AA: isoleucine occurring in~40% of the strains and arginine in~60% of the strains (Table S18).
The Zambian G2 sequences, which were all post-vaccine, clustered into sub-lineage IVa_3, which is relatable to a South African study wherein post-vaccine G2 strains clustered in sub-lineage IVa_3, while pre-vaccine strains clustered in sub-lineage IVa_1 [35]. In that study, G2 sequences in sub-lineage IVa_3 were found to have a serine at position 15, while sequences in sub-lineage IVa_1 featured phenyalanine, indicating that the AA substitution could be of epidemiological relevance. The aspartate to asparagine substitution at escape mutation site 96, which hallmarks the G2 sequences from the year 2000 [44], could be a selective advantage to enable antibody-dependent enhancement, potentially favoring the production of non-neutralizing antibodies to this protein region [54]. Therefore, this AA site possibly plays a significant role in epidemiological fitness and consequent transmissibility of lineage IV G2 strains. Methionine identification at the Cytotoxic T lymphocyte (CTL) region [55] of lineage IV G2 sequences could have far-reaching biological implications. Apart from protein initiation, methionine's unbranched side chain provides energetic stability, leading to different conformations [56]. Additionally, the presence of the sulfur atom makes methionine an optimal AA for transient protein-protein interactions [57] that could interfere with the clearance of rotavirus infection.
According to the literature, the isoleucine substitution for valine residue is suggested to shift the equilibrium to a more compact conformation [58]. Therefore, the occurrence of this AA substitution at the hemagglutination site 120 [59] of Zambian P [4] lineage IV strains could be posing possible functional changes such as improving interaction with the sialic-acid-containing structure during cell attachment. Notably, the P[4] study sequences were found to cluster in sub-lineage IVb. This is similar to what was observed in a South African study of post-vaccine P [4] strains [35]. The South African study reported that postvaccine P [4] sequences in sub-lineage IVb had an arginine at position 162, while pre-vaccine P [4] sequences had a glycine. The epidemiological relevance of this radical AA substitution warrants more exploration as to whether it contributes to epidemiological fitness of the contemporary circulating G2P [4] strains. The radical AA substitution of serine for leucine from polar to nonpolar residue [60] at position 598 of the VP5 portion could be enhancing the permeabilization effect of membranes to enable VP4 entry into the host cell [61], as the presence of the bulkier hydrophobic leucine side chain induces local rearrangements that may likely induce small but significant differences in function [60]. Although we noted that most of the AA differences between the lineages in the rest of the genome segments were conservative, we posit that the intrinsic link between the AA ambiguities and rotavirus protein multifunctionality requires in-depth exploration to comprehensively understand the epidemiological fitness of the lineages in the post-vaccination context era.
Most AA sites were undergoing purifying selection, probably as a strategy to purge deleterious polymorphisms that arise due to the error-prone nature of the RNA polymerase enzyme [7]. However, AA site seven in the VP3 was under positive selection. Arginine and isoleucine that were observed in this site have extreme physicochemical disparities, whereby arginine is a polar positively charged residue, while isoleucine is a non-polar neutrally charged residue [62]. The AA residues under strong selective pressure are suggested to have essential catalytic functions [63]. Therefore, this AA site is likely selected to enhance the methylase capping activity of the VP3. Additionally, apart from lying in the signal sequence domain region, site seven was also found among the disordered protein regions of VP3 [64]. Viruses utilize intrinsically disordered regions to survive in harsh environments to evade the host immune system and to hijack and manipulate host cellular proteins [65]. Therefore, selection for this effect could be at play in this region.
Our analysis would have been more enriched with analysis performed in the context of pre-vaccination Zambian G2P [4] samples. However, natural fluctuations are known to impact the prevalence of circulating rotavirus strains [53], and sampling for the current study was performed based on availability. Regardless, our study findings provide valuable whole-genome insights regarding the evolution of G2P [4] strains in Zambia.

Conclusions
The G2P [4] strains circulating in Zambia from 2012 to 2016 belong to the lineages of G2P [4] strains that have been circulating globally since the early 2000s, highlighting the epidemiological fitness of these contemporary strains. The analysis of AA substitutions defining the lineages of the circulating G2P [4] strains provides an insightful perspective to understanding their evolution.