Target Enrichment Metagenomics Reveals Human Pegivirus-1 in Pediatric Hematopoietic Stem Cell Transplantation Recipients

Human pegivirus-1 (HPgV-1) is a lymphotropic human virus, typically considered nonpathogenic, but its infection can sometimes cause persistent viremia both in immunocompetent and immunosuppressed individuals. In a viral discovery research program in hematopoietic stem cell transplant (HSCT) pediatric patients, HPgV-1 was detected in 3 out of 14 patients (21.4%) using a target enrichment next-generation sequencing method, and the presence of the viruses was confirmed by agent-specific qRT-PCR assays. For the first time in this patient cohort, complete genomes of HPgV-1 were acquired and characterized. Phylogenetic analyses indicated that two patients had HPgV-1 genotype 2 and one had HPgV-1 genotype 3. Intra-host genomic variations were described and discussed. Our results highlight the necessity to screen HSCT patients and blood and stem cell donors to reduce the potential risk of HPgV-1 transmission.


Introduction
Two strains of pegiviruses are known to infect humans, namely the human pegivirus-1 (HPgV-1) and the human pegivirus-2 (HPgV-2; also referred to as Human Hepegivirus 1) [1]. Out of the two, HPgV-1 is more prevalent and was initially named GB virus type C or hepatitis G virus. It is a persistent lymphotropic positive-sense RNA virus, within the genus Pegivirus of the family Flaviviridae [2][3][4], and it is known to transmit by blood transfusions, parenterally, sexually, and from infected mothers to their newborn infants [3]. It can induce persistent viremia in humans, but its biology and clinical significance is poorly known [5]. It is often reported to be associated with liver and kidney transplantations and blood transfusions [4,[6][7][8][9] and has even been shown to be present in the brain tissue of a patient with encephalitis [10].
In Thailand, several studies conducted before the year 2000 reported the prevalence of HPgV-1 among kidney transplant patients, thalassaemic children, patients with chronic

Nucleic Acid Extraction
Cultured Newcastle disease virus (NDV), an RNA virus of non-human origin, was added to the samples at concentrations ranging between 1 × 10 3 and 1 × 10 4 particles per 1 mL of whole blood as a positive internal control (Table S1). NDV preparation protocol can be found in Supplementary Methods S2. Plasma and cells were separated using centrifugation, and red blood cells (RBC) were removed using RBC lysis buffer (Geneaid, Taiwan). Total nucleic acid extraction was extracted from 1 mL of plasma and white blood cells (WBC) using QIAamp ccfDNA/RNA kit (Qiagen, Hilden, Germany) and AllPrep DNA/RNA Mini kit (Qiagen, Germany) according to the manufacturer's recommendations, respectively. Extracted RNA from both kits was reverse transcribed using SuperScript III kit (Invitrogen Life Technologies, USA). cDNA and extracted DNA was pooled prior to second-strand DNA synthesis using Klenow fragment (3 -5 exonuclease negative) (New England Biolabs, Ipswich, MA, USA). DNA samples were tested to confirm the presence of the internal positive control NDV, using HotStarTaq ® PCR kit (Qiagen, Germany) with SYBR-Green [14] (Supplementary Methods S3).

Library Preparation
DNA sequencing libraries were prepared following the SeqCap EZ HyperCap workflow user's guide (Version 2.3, Roche, Switzerland). Up to 12 samples were pooled and hybridized to the VirCapSeq-VERT probe pool for 16 h (SeqCap EZ VirCapSeq-VERT Panel). The system consisted of~2 million oligonucleotide biotinylated probes, designed specifically to capture and enrich genomic coding sequences of more than 200 vertebrate viruses [15].

Sequencing
The DNA library was denatured and diluted using the MiSeq reagent kit v3 (MS-102-3003, Illumina, San Diego, CA, USA) following the MiSeq system, denature and dilute libraries

Nucleic Acid Extraction
Cultured Newcastle disease virus (NDV), an RNA virus of non-human origin, was added to the samples at concentrations ranging between 1 × 10 3 and 1 × 10 4 particles per 1 mL of whole blood as a positive internal control (Table S1). NDV preparation protocol can be found in Supplementary Methods S2. Plasma and cells were separated using centrifugation, and red blood cells (RBC) were removed using RBC lysis buffer (Geneaid, Taiwan). Total nucleic acid extraction was extracted from 1 mL of plasma and white blood cells (WBC) using QIAamp ccfDNA/RNA kit (Qiagen, Hilden, Germany) and AllPrep DNA/RNA Mini kit (Qiagen, Germany) according to the manufacturer's recommendations, respectively. Extracted RNA from both kits was reverse transcribed using Super-Script III kit (Invitrogen Life Technologies, USA). cDNA and extracted DNA was pooled prior to second-strand DNA synthesis using Klenow fragment (3′-5′exonuclease negative) (New England Biolabs, Ipswich, MA, USA). DNA samples were tested to confirm the

Virus Identification
Sequencing data were demultiplexed and Q30 filtered by Illumina software to generate FastQ files for each sample. For virus identification, Virus Identification Pipeline (VIP) v.0.1.0 [16] was used with the default settings under the 'sense' mode. A virus detection was considered positive if its read count relative to the total read number was >0.01% and the coverage was >65% with respect to the reference genome in the VIP database.
HPgV-1's were identified in 3 patients (see Results) and the detections were validated using a previously described PCR protocol [17] with some modifications (Supplementary Methods S3). The negative results, i.e., the absence of HPgV-1, were also confirmed using the same protocol. Total nucleic acids from the plasma of the DA, D0, and D3 samples (200 µL each) were extracted using GENTi Viral DNA/RNA Extraction Kit (GeneAll, South Korea), and SYBR-Green-based qRT-PCR was performed using One-Step RT-PCR Kit (Qiagen, Germany). Little is currently known about the prevalence of HPgV-1 in Thai children (see Discussion). Therefore, HPgV-1 was also screened for in 112 whole blood samples collected from healthy adolescents (HA) using the above-mentioned PCR protocol. These samples were leftover blood from a health check-up collected from students aged 12-18 years (grade 7-12) in a school, Ayutthaya province, Thailand.

Genome Assembly
Metagenomic short reads were first assembled into contigs by using metaSPAdes v.3.15.1 [18] and IDBA-UD v.1.1.3 [19] with 11 k-mer values, starting from 21 and increasing to 121 with an increment of 10 mers each time. The assembled contigs were further assembled into scaffolds by using Quickmerge v.0.3 [20] with the default settings. Scaffolds were orientated and ordered by comparing them to viral genome sequences available in the NCBI RefSeq database using BLASTn [21] with the maximum e-value of 1e-10, the blast word size of 32, and the maximum target sequences of 10. Short reads were then mapped back to the assembly using bwa-mem2 v.2.1 [22] with the default settings for the purpose of quality checking, variant calling, and consensus sequence reconstruction. LoFreq [23] was used to call variants with the following options: -q 20 -Q 20 -m 60 -C 5 -D -d 1000000 -a 0.1, and the consensus function in the bcftools v.1.10.2 program package [24] was used to reconstruct consensus sequences, using the preliminary assembled and ordered scaffolds as references. Polymorphic sites with aggregated minor allele frequencies of more than 20% and with at least 20× read-mapping depths were characterized.

Phylogenetic Reconstruction
Phylogenetic analysis was performed to genotype the detected viruses. A multiple sequence alignment of the viruses detected and reference sequences obtained from the NCBI nucleotide database was made by using MAFFT v.7.453 [25]. Potential recombination regions within the alignment were checked by using RDP, GENECONV, Chimera, MaxChi, BootScan, SiScan, and 3Seq with their default settings, all implemented in RDP4 v.4.100 [26]. Regions detected by more than 4 programs were considered significant and were removed from all of the sequences in the alignment manually to make a recombination-free alignment. A maximum likelihood phylogenetic tree of the virus was reconstructed by using IQ-TREE 1.6.12 [27] with the best-fit nucleotide substitution model (GTR+F+I+γ4) as determined by ModelFinder [28] under the Bayesian information criterion. SH-aLRT and ultrafast bootstrap branch support was computed based on 10,000 bootstrapped trees.

Enrolled Patients
In total, 14 HSCT pediatric patients were enrolled for this study, and 13 of them developed febrile neutropenia (FN) post-HSCT. Whole blood samples were collected from each of these patients on the day of admission to the hospital (DA). For those that developed post-HSCT FN (i.e., all the patients except Patient 014), two additional whole blood samples were collected, one on the first day of the fever onset (D0) and 3 days after (D3). In one case (Patient 005), only DA and D3 samples were available. Demographic data of the patients, including age, gender, underlying disease, type of HSCT, and the given conditioning regiments can be found in Table 1. The majority of patients had low WBC counts, both on D0 and D3 ( Figure S1).

Virus Detection by Routine Tests
Out of the 14 patients, 2 patients (Patient 004 and 008) were tested positive for cytomegalovirus (CMV) in both of their D0 and D3 samples by routinely performed multiplex probe-based qRT-PCR assay, which screened for human adenovirus, human cytomegalovirus, Epstein-Barr virus, herpes simplex virus 1, herpes simplex virus 2, varicellazoster virus, enterovirus, human parechovirus, human herpesvirus 6, human herpesvirus

Virus Detection by TE-NGS Technology
To detect viruses in the blood samples that were not covered by the routine qRT-PCR assay, TE-NGS was applied, and VIP [16] was used to detect viral sequences within the NGS data produced. qRT-PCR was used to validate the bioinformatics results.  (Table S1), demonstrating that this approach was sensitive. Averaged across all samples, 45.1% (1.7-88.7%), 35.6% (2.5-91.8%), and 19.3% (6.5-60.2%) reads were identified as human, virus, and unidentifiable sequences, respectively, by VIP ( Figure S2 and Table S2).
CMV was detected in Patient 004's D3 sample (Ct of qRT-PCR = 31.18) by TE-NGS, but not in Patient 004's D0 sample or Patient 008's samples, in which the virus concentrations were much lower (Ct of qRT-PCR = 34.07-36.89).
In addition to CMV, HPgV-1 sequences were detected as positive in three patients' D0 and D3 samples, including Patient 002, 011, and 015. The analysis also detected the virus in Patient 002 and 011's DA samples (94.2% and 100% genome coverage, respectively, Table 2), consistent with that the patients likely acquired the virus before the hospitalization. The detections of HPgV-1 by bioinformatic analyses were confirmed by subjecting plasma samples to qRT-PCR using HPgV-1 specific primers [29], yielding Ct values ranging between 25.17 and 30.65 ( Table 2). The fact that HPgV-1 could be detected in the plasma is suggestive of an active infection. On the other hand, HPgV-1 could not be detected in Patient 015's DA sample by both NGS data analysis and qRT-PCR analyses, raising a possibility that the patient could have acquired the virus during hospitalization or the course of HSCT. None of the samples from Patient 014, who did not develop FN, and from 112 healthy adolescents were detected positive for HPgV-1 by qRT-PCR.
Among the three HPgV-1 positive patients, Patient 002 developed engraftment syndrome and had prolonged fever from day 1-16 post-HSCT, while none of them developed veno-occlusive disease or GVHD (Table 1). When it comes to long-term clinical effects, Patient 002 and 011 recovered normally, while Patient 015 experienced neuroblastoma relapse 5 months post-HSCT.

HPgV-1 Genome Assembly
Whole genomes of HPgV-1 detected in the eight samples (three from Patient 002, three from Patient 011, and two from Patient 015) were de novo assembled; seven of which had a full-length coding sequence, while one lacked a short sequence on the 5' end of the coding region ( Figure 2). The assembly sizes ranged between 8714 bases and 9336 bases, with average depths of at least 154× ( Figure 2 and Table 3).
Polymorphic sites with aggregated minor allele frequencies of more than 20% and with at least 20× read-mapping depths were determined by mapping NGS reads back to the consensus sequences. The average number of polymorphic sites per kilobase for Patient 002, 011, and 015's HPgV-1s was 9.27, 3.86, and 0.80, respectively (Figure 2, Tables S3 and S4). Of all the polymorphic sites identified, 94.86% were either C/T variant sites (56.76-67.44%) or A/G variant sites (27.91-40%).

Genotyping
The genotypes of the detected HPgV-1s were determined by using phylogenetic analysis. Based on their phylogenetic placements, the results suggested that those obtained from Patient 002 and 015 were genotype 2, while those in Patient 011 were genotype 3 ( Figure 3). Although the viruses from Patient 002 and 015 were found to be sister taxa, the branches separating the two were long and the two patients were admitted to the hospital at different times (1 year apart), thus they were likely distinct viruses, and it was unlikely that they were a result of direct virus transmission between the two patients.
ysis. Based on their phylogenetic placements, the results suggested that those obtained from Patient 002 and 015 were genotype 2, while those in Patient 011 were genotype 3 ( Figure 3). Although the viruses from Patient 002 and 015 were found to be sister taxa, the branches separating the two were long and the two patients were admitted to the hospital at different times (1 year apart), thus they were likely distinct viruses, and it was unlikely that they were a result of direct virus transmission between the two patients. The phylogenetic tree was reconstructed by using IQ-TREE 1.6.12 [27] with the best-fit nucleotide substitution model (GTR + F + I + γ4) under the Bayesian information criterion. SH-aLRT and ultrafast bootstrap branch support was computed based on 10,000 bootstrapped trees. Genotypes of the The phylogenetic tree was reconstructed by using IQ-TREE 1.6.12 [27] with the best-fit nucleotide substitution model (GTR + F + I + γ4) under the Bayesian information criterion. SH-aLRT and ultrafast bootstrap branch support was computed based on 10,000 bootstrapped trees. Genotypes of the viruses are indicated by Roman numerals. Asterisks (*) signify high (≥90%) bootstrap supported nodes. The scale bar is in the units of nucleotide substitutions per site.

Discussion
Viral infections are common complications following HSCT, and 42.9-68.2% of the total documented post-HSCT infections are viral infections [30][31][32][33][34]. Viruses most commonly found in HSCT patients include Epstein-Barr virus, cytomegalovirus, human herpesvirus 6, human adenoviruses, and polyomavirus BK [33][34][35][36][37][38]. A multitude of clinical tests is used for virus detection. Among them, polymerase chain reaction (PCR)-based techniques are typically considered the gold standard due to their ability to rapidly and precisely amplify small amounts of viral sequences that might be present in a clinical sample [39]. In the current clinical settings in Thailand, PCR-based approaches are routinely used to detect viruses in HSCT patients, and in our case, detected CMV in two patients, namely Patient 004 and 008.
One of the main limitations of PCR-based methods is that they require target-specific primers for pathogen detection, making their ability to detect novel or uncommon pathogens very limited. Because of this, a significant portion of viral infections might be overlooked, and thus, a sensitive diagnostic method that can detect a broader range of viruses, or ideally, any viruses that might be present in clinical samples, is a timely need. Viral metagenomics is a hypothesis-free approach that has shown promise for comprehensive characterization of human virome with no prior clinical knowledge required [40,41]. Due to typically low fractions of viral genetic materials in clinical samples, direct application of the NGS technology to clinical diagnosis has been challenging; however, studies have shown that an addition of a viral genetic material enrichment step to the protocol could offer a solution to this problem [42][43][44][45]. The VirCapSeq-VERT is a probe panel comprised of~2 million probes that cover all the viruses known to infect vertebrates. Its potential to enrich viral sequences of known viruses up to 10,000-fold when compared to conventional high throughput sequencing, as well as to detect novel viruses with an overall nucleotide divergence in the range of 40% has been demonstrated [15]. This makes this probe panel suitable for detecting divergent and potentially unknown viruses in clinical samples, which is essential when it comes to analyzing clinical samples from patients with unexplained illnesses, and for outbreak investigation of both known and unknown infectious etiologies. Subsequent other studies have also utilized this panel to explore the virome of a range of clinical and environmental samples and have obtained compelling results [46][47][48][49][50].
Therefore, in the present study, we utilized an unbiased VirCapSeq-VERT incorporated TE-NGS approach to screen for viruses that might be present in the whole blood samples of HSCT patients. The NGS-based test detected CMV only in Patient 004, specifically in their D3 sample, which had a moderate amount of target sequences (Ct = 31.18), but not in Patient 004's D0 sample or Patient 008's samples, which had relatively much lower viral concentrations (Ct = 34.07-36.89). When detecting viruses with a low concentration, it has been discussed that the sensitivity of agent-specific qRT-PCR could be higher than that of a TE-NGS test [51]. In any case, since the Ct values were very high, alternatively, it was thus possible that the qRT-PCR results could be false positive. Further investigations are needed to examine the relative sensitivity and specificity of the two assays when viral concentrations are very low.
In addition to CMVs, HPgV-1 was detected in three patients by TE-NGS, and was validated by qRT-PCR, including Patient 002, 011, and 015. The virus was not covered by the routine qRT-PCR tests in our clinical setting. Our results suggested that Patient 002 and 011 likely acquired the virus before the hospitalization, while Patient 015 may have acquired it during the hospitalization, but without any noticeable clinical symptoms that might be indicative of the virus acquisition. Due to the patients' multiple blood transfusion histories, and the lack of data from their stem cells donors, the precise source of infection of the three patients cannot be concluded.
All of the three HPgV-1 positive patients developed FN after the HSCT procedure. The average time taken to develop FN in HPgV-1 positive patients was slightly lower than that of HPgV-1 negative patients, estimated to be 3.3 days and 4.7 days, respectively. We, however, feel that the number of the patients in our study was probably too small to make any conclusive statements regarding (apparently positive) association between HPgV-1 and the early onset of FN. Regarding other post-HSCT complications, Patient 002 developed engraftment syndrome and prolonged fever up to 16 days post-procedure, while none of the three developed veno-occlusive disease or GVHD. It is nonetheless worth noting that such post-HSCT complications could be observed in several HPgV-1 negative patients as well ( Table 1). As for the long-term complications, we observed one of the three patients detected positive for HPgV-1 (33%), namely Patient 15, to experience a disease relapse 5 months after the transplantation. In the HPgV-1 negative patient group, 3 out of 11 patients (27%) were found to experience long-term clinical complications (Patient 008: chronic GVHD; Patient 013: died from GVHD and bleeding, and Patient 014: died from respiratory distress syndrome 39 days post-operation; Table 1). Despite this observation, again, we feel the sample size herein was probably too small to establish a link between HPgV-1 and long-term clinical complications.
We estimated the prevalence of HPgV-1 in this patient cohort to be 3/14 = 21.4%, comparable to those reported by other studies done in adult HSCT patients, ranging between 18% and 42% [52][53][54][55]. However, none of the studies reported the whole genomes of the identified viruses. Despite the reported high prevalence of HPgV-1, it is not listed among the viruses currently regarded as most relevant in the HSCT setting [33,36,37]. In contrast, we could not detect HPgV-1 in any of the 112 healthy adolescents samples. This suggested that the prevalence of HPgV-1 in our HSCT pediatric cohort could be much higher than that of the young healthy Thai population. However, since all 112 samples were from students of a single age group from a single school, the results should thus be interpreted with care and should not be extrapolated to general Thai children. To the best of our knowledge, there have not been reports on the prevalence of HPgV-1 in the Thai general population in the past 15 years. However, the prevalence of HPgV-1 in the Asian general population has been reported to be typically <10% [56], indeed, lower than those reported for HSCT patients [52][53][54][55], consistent with our speculation.
In general, the risk of HPgV-1 infection has shown to be high in those who are exposed to blood and blood products, those on hemodialysis, and patients with chronic hepatitis C or human immunodeficiency virus-1 (HIV-1) infection [1]. Interestingly, multiple studies have reported that persistent HPgV-1 infection may have an immunomodulatory effect in patients co-infected with HIV-1, slowing the disease progression of acquired immunodeficiency syndrome (AIDS) and prolonging the survival [reviewed in [57]]. Therefore, it is not only important to screen the general Thai population, but also patients with other diseases, particularly HIV-1 patients, who are relatively prevalent in Thailand [58].
Up to now, seven HPgV-1 genotypes have been described [59,60]. Our phylogenetic analyses suggested that the genotypes of the HPgV-1s detected in the three patients were different; the ones infecting Patient 002 and 015 likely belonged to genotype 2, while that in Patient 011 likely belonged to genotype 3 ( Figure 3). Two previous studies have also reported the detection of genotype 2 and 3 in Thailand [13,61], in addition to that, one of those studies has reported the detection of genotype 4 [13]. On a global scale, genotype 2 is most prevalent in Europe and North America but also can be seen worldwide, while genotype 3 is endemic in Asian countries and Amerindian populations [9].
Within the host, viruses typically experience very strong immune selection pressure, which drives them to diversify resulting in a cloud of genetic variants [62]. Our analyses showed that Patient 002's HPgV-1 had the highest number of polymorphic sites, while Patient 015's HPgV-1 had the lowest number of polymorphic sites (9.27, 3.86, and 0.80 polymorphic sites per kilobase on average for Patient 002, 011, and 015, respectively) ( Figure 2, Tables S3 and S4). The low number of polymorphic sites in Patient 015's viruses are suggestive of a relatively recent acquisition of the virus, consistent with the fact that Patient 015's DA sample was HPgV-1-negative.
Of all the polymorphic sites identified, 94.86% were either C/T or A/G variant sites, consistent with the fact that transition mutations are generally more likely to occur than transversions chemically. Besides polymerase and sequencing errors, host RNA editing mediated by ADARs (adenosine deaminases acting on RNA) may play some roles in the observed high numbers of transition changes as well. ADARs catalyze deamination of As to Is on double-stranded RNA substrates, which subsequently results in base changes from As to Gs or Us to Cs on the single-stranded RNA virus genome [63,64]. Many studies have shown that ADARs can induce hyper A-to-G and U-to-C mutations in a wide range of RNA viruses [64][65][66][67][68][69].
Interestingly, there were substantially more C/T variant sites (56.76-67.44% of the total number of all polymorphic sites characterized) than A/G variant sites (27.91-40%), and the ratios of C/T variant sites to A/G variant sites were similar across all samples (average ratios of normalized C/T variant sites to A/G variant sites: Patient 002: 1:0.57; Patient 011: 1:0.51; and Patient 015: 1:0.56) (Table S5). This observation could potentially be explained by C-to-U hypermutation mediated by members of the apolipoprotein B mRNA-editing enzyme, catalytic polypeptide-like (APOBEC) family, which, unlike ADARs, catalyze deamination of Cs to Us directly on single-stranded nucleic acid molecules. Antiviral activities of APOBEC enzymes have been well demonstrated in retroviruses, retroelements, and some DNA viruses [70], but recent studies have suggested that it may restrict RNA viruses as well, in particular positive-strand RNA viruses, including rubella virus [71] and coronaviruses [72,73]. Indeed, C-to-U hypermutation has been observed and welldocumented in both positive-strand RNA viruses [67,71,74], supporting our hypothesis.
We are the first group to report the prevalence of HPgV-1 in HSCT pediatric patients, as well as the virus whole genomes and their intra-host diversity. The pathogenicity of HPgV-1 remains controversial, but a positive association of HPgV-1 viremia with the increased risk of lymphoma in immunocompromised individuals has been observed [75,76]. Although we could not establish a link between clinical symptoms and the detection of HPgV-1, further research should be undertaken to investigate the clinical impact of HPgV-1 on HSCT patients. Till then, the results of this study suggest that HSCT patients and their blood and stem cell donors should be routinely screened for HPgV-1 to reduce the risk of the virus transmission.
Supplementary Materials: The following are available online at https://www.mdpi.com/article/10 .3390/v14040796/s1, Figure S1: White blood cell counts on Day 0 (D0) and Day 3 (D3). Figure S2: Frequencies of viral (red), human (blue), and unidentifiable reads (grey) as determined by Virus Identification Pipeline, Table S1: qRT-PCR results and the Virus Identification Pipeline analysis output for the positive internal control, Newcastle Disease Virus (NDV). Table S2: Sequencing read distribution as determined by Virus Identification Pipeline. Table S3: Polymorphic sites in the recovered human pegivirus-1 genome sequences. Table S4  Data Availability Statement: NGS sequencing data generated by this study are available from the Sequence Read Archive (SRA) repository under the project accession number PRJNA705169. All consensus HPgV-1 genome sequences generated by this study are available in GenBank with the following accession numbers: MZ099565 to MZ099572.