HCV Genetic Diversity Can Be Used to Infer Infection Recency and Time since Infection

HIV-1 genetic diversity can be used to infer time since infection (TSI) and infection recency. We adapted this approach for HCV and identified genomic regions with informative diversity. We included 72 HCV/HIV-1 coinfected participants of the Swiss HIV Cohort Study, for whom reliable estimates of infection date and viral sequences were available. Average pairwise diversity (APD) was calculated over each codon position for the entire open reading frame of HCV. Utilizing cross validation, we evaluated the correlation of APD with TSI, and its ability to infer TSI via a linear model. We additionally studied the ability of diversity to classify infections as recent (infected for <1 year) or chronic, using receiver-operator-characteristic area under the curve (ROC-AUC) in 50 patients whose infection could be unambiguously classified as either recent or chronic. Measuring HCV diversity over third or all codon positions gave similar performances, and notable improvement over first or second codon positions. APD calculated over the entire genome enabled classification of infection recency (ROC-AUC = 0.76). Additionally, APD correlated with TSI (R2 = 0.33) and could predict TSI (mean absolute error = 1.67 years). Restricting the region over which APD was calculated to E2-NS2 further improved accuracy (ROC-AUC = 0.85, R2 = 0.54, mean absolute error = 1.38 years). Genetic diversity in HCV correlates with TSI and is a proxy for infection recency and TSI, even several years post-infection.


Introduction
Inferring the duration of infection is of key importance for understanding both the epidemiology and pathogenesis of hepatitis C virus (HCV) infections. From an epidemiological perspective, the time of infection can inform incidence assays, phylogenetic studies, and prediction of future chronic liver disease burdens. In particular, it could be vital for monitoring public health progress in the context of elimination [1], as it enables the identification of ongoing transmission. From an individual-patient perspective, this information could contribute to knowledge of disease progression and assessment.
The nature of HCV transmission and its mostly asymptomatic acute infection means that the date of infection is rarely known, and there is a lack of known biomarkers available from which this information could be estimated. Accordingly, studies typically have to rely on some combination of cohort data and mathematical modelling to infer infection dates [2][3][4][5][6][7][8][9][10][11][12][13], which remain highly uncertain for most HCV-infected individuals. In the present study, we took advantage of the unique opportunity of a cohort with annual HCV screening, detailed clinical characteristics, and sampling.
A similar problem exists for human immunodeficiency virus-1 (HIV-1), which is also an RNA virus of comparable genome size that chronically infects patients. For HIV-1, it has been shown that viral diversity can be used to infer infection recency [14][15][16] and time since infection [17], and that diversity derived from next-generation sequencing (NGS) sequences is more accurate than diversity derived as the fraction of ambiguous nucleotides from Sanger sequences [18].
In this study, we aimed to investigate whether the same NGS-derived-diversity method can be applied to HCV, and to identify the region of the genome over which it is most informative to measure diversity.

Patients
We included 72 HCV-HIV coinfected patients from the Swiss HIV Cohort Study (SHCS), for all of whom an NGS-sequenced HCV sample was available. For patients enrolled in the SHCS, HCV-serologies are determined since 2000 every 12-24 months. Therefore, the patients considered in our study had a date of HCV infection known to within 24 months. This date of infection was calculated as the midpoint between the date of the most recent negative (RNA or antibody) test prior to the sample date, and the earliest date of either a positive HCV (RNA or antibody) test result, or the date at which the sample was taken. Patient characteristics are summarized in Table 1.

Sequencing
Samples were sequenced in the context of two different projects, resulting in differing, although similar, protocols. Briefly, the following sequencing protocols were used: Project 1, as part of the Swiss-HCVree-trial (NCT02785666) [19] within the SHCS, 53 samples (see Supplementary Material S1 for the description of near full-length genome sequencing): cDNA was synthesized and amplified using a one-step reverse transcription and PCR kit, or in two individual steps. Some samples then underwent a second, nested-PCR (Supplementary Table S1). Samples were pooled for sequencing, and libraries were run using MiSeq (Illumina, San Diego, CA, USA) 1 × 150 cycles.
Project 2, within the SHCS, 19 samples: Viral RNA was extracted from plasma stored in the SHCS biobank by the Nucleospin RNA Virus Kit (Macherey-Nagel, Düren, Germany). RNA was then amplified by RT-PCR in a two-step process, and some samples underwent a second, nested-PCR if necessary. HCV RNA genome sequences were generated by amplification of almost full-length HCV RNA followed by massive parallel sequencing. A MiSeq (Illumina, San Diego, CA, USA) instrument was used for sequencing with 2 × 250 bp.
Sequences were processed using MinVar version 2.2.1 (https://github.com/medvir/MinVar) [20], which filters and aligns reads before returning sequence variants. A slightly modified script was used in order to output sequence position information at the nucleotide level.

Diversity Score Calculation
Average pairwise diversity (APD) was calculated from nucleotide minority variant frequencies using Equation (1) [17], explanation as in [18]. This first determines whether a position has minor variants above a threshold, and subsequently sums the diversity contribution of all variants at that position. Finally, diversity across all positions is averaged. This value is functionally equivalent to the average proportion of positions at which two randomly selected sequences differ.
where L = length of analysed sequence. i = sequence position.
x i m = frequency of major variant m at position i.
x iα = frequency of variant α at position i. x c is a frequency cut-off, below which variants are considered to be indistinguishable from those generated by PCR or sequencing errors. It was set to 1% as this is the approximate detection limit of a high-throughput sequencing workflow (Illumina) [21,22].
We calculated average pairwise diversity over various subsections of the HCV genome open reading frame: − All codon positions, and only the first/second/third codon positions in turn. − Whole open reading frame, individual genes, and 11 overlapping equal regions that the genome was split into (length = 502 or 501 amino acid codons).
We examined how well average pairwise diversity correlated with time since infection (TSI) using linear regression. This linear model (Equation (2)) was then validated using leave-one-out cross-validation, with mean absolute error as the primary outcome. This was conducted by calculating the model coefficients using all but one sample, and then applying this model to the remaining "test" sample. The absolute value of the difference between the estimated and actual time since infection was then calculated for this test sample. This procedure was repeated for all samples in turn, and the mean value of all the absolute errors was taken, providing a simple measure by which to compare average pairwise diversity calculated across different regions.
We studied how well average pairwise diversity could be used to infer infection recency using receiver operator characteristics (ROC) analyses, with recent infection (<1 year post-infection) defined as the positive outcome. We restricted these recency analyses to the 50 patients who could be unambiguously classified as recent (21 patients) or chronic (29 patients) due to the sample having been collected less than 12 months after the last negative HCV test or more than 12 months after the first positive HCV test.
A sensitivity analysis was conducted to assess variability between viral subtypes, as levels of diversity have been seen to vary between subtypes [28]. We therefore repeated our analyses using just those patients infected with subtype 1A, and 4D; insufficient number of samples were available from other viral subtypes for this analysis to be meaningful for them.

Ethics Approval and Consent to Participate
This analysis was conducted in the context of the SHCS which was approved by the ethics committees of the participating institutions (

Codon Position Makes a Notable Difference in Average Pairwise Diversity Scores
We found for the first and second codon positions a substantial correlation with time since infection but also that diversity remains low even several years post-infection ( Figure 1). Conversely, average pairwise diversity over the third codon position shows a large and steady increase with time since infection, and should therefore be more informative when inferring time since infection and infection recency. However, calculating average pairwise diversity over all three positions yields similar results to considering only third-codon positions ( Figure 1). This suggests that average pairwise diversity calculated over either third or all codon positions may be capable of inferring time since infection with reasonable precision.

Codon Position Makes a Notable Difference in Average Pairwise Diversity Scores
We found for the first and second codon positions a substantial correlation with time since infection but also that diversity remains low even several years post-infection ( Figure 1). Conversely, average pairwise diversity over the third codon position shows a large and steady increase with time since infection, and should therefore be more informative when inferring time since infection and infection recency. However, calculating average pairwise diversity over all three positions yields similar results to considering only third-codon positions ( Figure 1). This suggests that average pairwise diversity calculated over either third or all codon positions may be capable of inferring time since infection with reasonable precision. We compared how well average pairwise diversity over individual and all codon positions could infer infection recency using ROC analyses ( Figure 2). This analysis, was restricted to the 50 patients whose infections could be unambiguously classified as either recent (time since infection <1 year) or chronic (time since infection >1 year). In line with Figure 1, using third and all codon positions outperforms using first and second codon positions, supporting the notion that average pairwise diversity calculated over these positions is more informative. Due to the similarity between using the third codon and all codon positions, we continued our analyses using average pairwise diversity We compared how well average pairwise diversity over individual and all codon positions could infer infection recency using ROC analyses ( Figure 2). This analysis, was restricted to the 50 patients whose infections could be unambiguously classified as either recent (time since infection <1 year) or chronic (time since infection >1 year). In line with Figure 1, using third and all codon positions outperforms using first and second codon positions, supporting the notion that average pairwise diversity calculated over these positions is more informative. Due to the similarity between using the third codon and all codon positions, we continued our analyses using average pairwise diversity calculated over all codon positions, this being the easier measure to calculate as it does not require identification of the reading frame (results over the third codon position are provided in Supplementary Figure S1 for comparison and give very similar outcomes).

Restricting Average Pairwise Diversity to Certain Regions of the Genome Improves Inference of Time since Infection and Infection Recency
We calculated average pairwise diversity over the entire open reading frame of HCV, each gene individually, and for regions derived by dividing the open reading frame into eleven equal overlapping regions. We then assessed how well each average pairwise diversity scores could be used to infer infection recency (via ROC analyses) and time since infection (via linear regression and cross-validation). When comparing the three outcomes (area under the ROC curve, R 2 , and mean absolute error) for all regions across the open reading frame (Figure 3), we find that the region covering amino acid codons 503-1004-provides the best score for all three outcomes. This region spans the major 3′ part of E2, the entirety of p7, and the majority of the NS2 genes. The regions covering the last 1000 amino acid codons of the open reading frame also perform well, but less consistently across outcomes (Figure 3). The figure also shows the HCV genes colour-coded by a composite score of all three outcomes, from which a similar pattern emerges, with E2 and NS2 individually performing especially well. We therefore additionally calculated average pairwise diversity over the three genes E2, p7 and NS2 combined and found similarly strong results. Supplementary Table S2 provides the exact scores for all the regions tested. Supplementary Figure  S2 shows the same analysis repeated for larger regions, which provides a similar outcome.  (Figure 3), we find that the region covering amino acid codons 503-1004-provides the best score for all three outcomes. This region spans the major 3 part of E2, the entirety of p7, and the majority of the NS2 genes. The regions covering the last 1000 amino acid codons of the open reading frame also perform well, but less consistently across outcomes (Figure 3). The figure also shows the HCV genes colour-coded by a composite score of all three outcomes, from which a similar pattern emerges, with E2 and NS2 individually performing especially well. We therefore additionally calculated average pairwise diversity over the three genes E2, p7 and NS2 combined and found similarly strong results. Supplementary Table S2 provides the exact scores for all the regions tested. Supplementary Figure S2 shows the same analysis repeated for larger regions, which provides a similar outcome. Based on these results, we recommend calculating average pairwise diversity using all codon positions, over the region of amino acid codons 503-1004-of the open reading frame, or using the entirety of genes E2, p7 and NS2 together. Time since infection can be estimated from these scores using Equation (2), with recommended coefficients alpha = 0.42, beta = 353.70 for average pairwise diversity over the region 503-1004-, or alpha = 0.39, beta = 320.38 for average pairwise diversity over E2, p7 and NS2. A full list of suggested coefficients for all regions tested is available in Supplementary Table S2. Figure 4 shows the time since infection against estimated time since infection for the recommended regions. Based on these results, we recommend calculating average pairwise diversity using all codon positions, over the region of amino acid codons 503-1004-of the open reading frame, or using the entirety of genes E2, p7 and NS2 together. Time since infection can be estimated from these scores using Equation (2), with recommended coefficients alpha = 0.42, beta = 353.70 for average pairwise diversity over the region 503-1004-, or alpha = 0.39, beta = 320.38 for average pairwise diversity over E2, p7 and NS2. A full list of suggested coefficients for all regions tested is available in Supplementary Table S2. Figure 4 shows the time since infection against estimated time since infection for the recommended regions.

Sensitivity Analysis Shows Minor Variation between Viral Subtypes
We conducted a sub-analysis including only patients infected with viral subtype 1A (50 patients, of which 33 could be included in the ROC analysis), and 4D (14 patients, of which 10 could be included in the ROC analysis), using average pairwise diversity calculated over our recommended regions. For diversity calculated over amino acid codons 503-1004, the area under the ROC curve was the same as the overall score of 0.85 for genotype 1A (0.85) and slightly lower for genotype 4D (0.67). Conversely, R 2 was slightly higher than the overall score of 0.48 for genotype 4D (0.56), and lower for genotype 1A (0.43), whilst mean absolute error was worse than the overall 1.44 years for both genotype 1A (1.64 years) and 4D (1.52 years). A similar pattern was seen for average pairwise diversity over the genes E2, p7, and NS2 (overall: area under the ROC curve = 0.85, R 2 = 0.54, mean absolute error = 1.38 years. 1A: area under the ROC curve = 0.86, R 2 = 0.51, mean absolute error = 1.51 years. 4D: area under the ROC curve = 0.62, R 2 = 0.56, mean absolute error = 1.46 years).

Discussion
Overall, this study shows that HCV genetic diversity as calculated from nucleotide frequencies derived from NGS deep sequencing correlates with, and can be used to infer, the time since infection and infection recency. A wide range of samples of well-defined infection timepoints with differing infection durations was included, showing that this method could be employed for samples up to 16 years post-infection. With a mean absolute error of less than 1.5 years, this method can provide insight which may be particularly helpful for public health monitoring in the context of HCV elimination strategies, and the identification of recent versus chronic infections.
We additionally conducted analyses focusing on individual regions across the entire open reading frame, and have shown that the pattern of diversity differs across the genome, with some regions and codons experiencing differing levels of diversification over time. The contrast in diversity calculated over codons 1 or 2 compared to codon 3 highlights the strong effect of selection against non-synonymous mutations. It should be noted, however, that even at third-codon positions substitutions will not be completely neutral, as they can for example affect RNA secondary structure. Even though such effects may restrict the accumulation of viral diversity, they do not seem to have had a major effect on the association between time since infection and diversity as indicated by similarly strong correlations observed over the entire HCV genome (Figure 3). For example, we observed one of the strongest associations for the NS5B gene, which (in its second half) exhibits a

Sensitivity Analysis Shows Minor Variation between Viral Subtypes
We conducted a sub-analysis including only patients infected with viral subtype 1A (50 patients, of which 33 could be included in the ROC analysis), and 4D (14 patients, of which 10 could be included in the ROC analysis), using average pairwise diversity calculated over our recommended regions. For diversity calculated over amino acid codons 503-1004, the area under the ROC curve was the same as the overall score of 0.85 for genotype 1A (0.85) and slightly lower for genotype 4D (0.67). Conversely, R 2 was slightly higher than the overall score of 0.48 for genotype 4D (0.56), and lower for genotype 1A (0.43), whilst mean absolute error was worse than the overall 1.44 years for both genotype 1A (1.64 years) and 4D (1.52 years). A similar pattern was seen for average pairwise diversity over the genes E2, p7, and NS2 (overall: area under the ROC curve = 0.85, R 2 = 0.54, mean absolute error = 1.38 years. 1A: area under the ROC curve = 0.86, R 2 = 0.51, mean absolute error = 1.51 years. 4D: area under the ROC curve = 0.62, R 2 = 0.56, mean absolute error = 1.46 years).

Discussion
Overall, this study shows that HCV genetic diversity as calculated from nucleotide frequencies derived from NGS deep sequencing correlates with, and can be used to infer, the time since infection and infection recency. A wide range of samples of well-defined infection timepoints with differing infection durations was included, showing that this method could be employed for samples up to 16 years post-infection. With a mean absolute error of less than 1.5 years, this method can provide insight which may be particularly helpful for public health monitoring in the context of HCV elimination strategies, and the identification of recent versus chronic infections.
We additionally conducted analyses focusing on individual regions across the entire open reading frame, and have shown that the pattern of diversity differs across the genome, with some regions and codons experiencing differing levels of diversification over time. The contrast in diversity calculated over codons 1 or 2 compared to codon 3 highlights the strong effect of selection against non-synonymous mutations. It should be noted, however, that even at third-codon positions substitutions will not be completely neutral, as they can for example affect RNA secondary structure. Even though such effects may restrict the accumulation of viral diversity, they do not seem to have had a major effect on the association between time since infection and diversity as indicated by similarly strong correlations observed over the entire HCV genome (Figure 3). For example, we observed one of the strongest associations for the NS5B gene, which (in its second half) exhibits a particularly high density of regions encoding for RNA secondary structure [29,30]. Surprisingly, the most informative region was found to include the E2 gene. As this encodes one of the two viral envelope proteins, a generally high level of diversification is unsurprising, but it was expected that selective sweeps due to immune system pressure would result in an unsteady increase in diversity unsuitable for inferring time since infection, as has been seen for the HIV envelope gene [17]. This finding defies that and suggests that other factors may be additionally influencing the accumulation of diversity within the HCV genome, causing a steadier increase.
As some variation between diversity within different viral subtypes has been reported [28], we conducted a sensitivity analysis with individual subtypes to evaluate the impact of viral subtype on our analyses. Whilst some variation was observed, overall patterns were robust across subtypes Furthermore, the largest deviation from the overall results was found for the subtype 4D ROC analyses which were conducted with only 10 samples. Our results, hence suggest that this method is applicable across different viral subtypes; however, further analysis with more samples of different subtypes is required to confirm this.
This study was limited by the number of samples available, which were all from patients already infected with HIV-1 at the time of HCV sampling, and primarily from men who have sex with men (MSM) with no recorded history of intravenous drug use (60/72 patients). The small sample size (and especially the small number of patients with large infection times) prevented the examination of non-linear effects, such as a potential saturation of diversity at large infection times, which has been observed for HIV-1 [14,18]. Furthermore, the presence of HIV-1 may affect the diversification of HCV over time, given the known impact of HIV-1 infection on HCV infection [31][32][33], and the increased virus load of HCV in the presence of HIV-1 [34][35][36]. The HCV epidemic typically consists primarily of people who inject drugs, but there is growing incidence of HCV infections among HIV-infected MSM [11,12,[37][38][39][40][41]. As our study population consists almost exclusively of MSM, the validity of our results in people who inject drugs needs to be examined in future studies. Furthermore, our samples were derived from two different projects and therefore prepared and sequenced by slightly differing protocols. Whilst this variation in preparation and sequencing methods may remove some uniformity from our sample set, it more accurately reflects the scenario should the method be applied to HCV samples that have been collected and sequenced for a variety of purposes and can therefore be seen as a strength of this study.
As drug resistance testing is not routinely performed for HCV-infected patients, the availability of NGS sequences for this technique may be limited. However, as the price of NGS sequencing methods is decreasing, they are likely to be increasingly employed for surveillance purposes and gaining epidemiological understanding. Particularly in the context of elimination strategies, sequence-based information such as viral phylogenies and origins may be very desirable. Our study found a substantial correlation between time since infection and average pairwise diversity. Based on this, we could show that infection recency and time since infection can be accurately predicted by using average pairwise diversity. Thus, in the cases where samples from HCV-infected persons are sequenced, this method provides recency and date of infection information as free by-products of standard NGS sequencing, which can add to such monitoring studies.
Supplementary Materials: The following are available online at http://www.mdpi.com/1999-4915/12/11/1241/s1. Supplementary material S1: Full protocol for preparation and sequencing of samples for study 1; Table S1: Summary of preparations methods for sequencing for study 1; Table S2: Scores for average pairwise diversity calculated over all codon positions, and over various regions of the open reading frame, with associated coefficients from linear regression (AUC = area under the ROC curve, adj. R 2 = adjusted R 2 , MAE = mean absolute error); Figure S1: Area under the ROC curve, R 2 , and mean absolute error across the HCV open reading frame, third codon positions. The HCV open reading frame was split into 11 overlapping regions of approximately 500 amino acid codons, and average pairwise diversity (APD) was calculated over individual regions, using the third codon positions. Regions were tested for their ability to categorise infection as recent (<1 year) or chronic (top), their correlation with time since infection (middle), and their ability to infer time since infection (bottom). Black dashed lines show the respective values for APD calculated over the whole open reading frame. A similar analysis was performed with diversity calculated over each gene in turn. The HCV genome is shown along the x-axis, with genes colour-coded for a composite (z-score sum) of all three outcome scores. Darker red indicates a better overall performance. Numbers along the x-axis refer to amino acid positions of the H77 reference genome; Figure S2: Area under the ROC curve, R 2 , and mean absolute error across the HCV open reading frame, all codon positions. The HCV open reading frame was split into 5 overlapping regions of approximately 1000 amino acid codons, and average pairwise diversity (APD) was calculated over individual regions, using all codon positions. Regions were tested for their ability to categorise infection as recent (<1 year) or chronic (top), their correlation with time since infection (middle), and their ability to infer time since infection (bottom). Black dashed lines show the respective values for APD calculated over the whole open reading frame. A similar analysis was performed with diversity calculated over each gene in turn. The HCV genome is shown along the x-axis, with genes colour-coded for a composite (z-score sum) of all three outcome scores. Darker red indicates a better overall performance. Numbers along the x-axis refer to amino acid positions of the H77 reference genome; Equation S1: Z-score for converting outcome scores to be summed into a combined single score.