For some virus species, their highly error-prone replication mechanism produces a population of related genomic variants of the virus within a single infected host individual [1
]. Sequencing systems such as Illumina’s platform produce short, relatively accurate nucleotide sections of viral genome, often generating thousands of reads for a given genomic location from a single sample [2
]. Such deep sequencing technologies therefore offer methods for understanding the nature of viral intra-host diversity. Whole genome and deep sequencing of virus genomes has been widely applied in basic virology research but has also found applications in clinical contexts such as the detection of drug resistance [3
Various bioinformatics stages must be applied in the interpretation of viral deep sequencing data. Reads unrelated to the virus genome are removed and low-quality reads removed or trimmed. Following this, we must then construct an alignment: how the reads are arranged relative to each other within the virus genome, accounting for sequence homology. Reference-based alignment or mapping methods such as Bowtie 2 [4
], BWA [5
], MOSAIK [6
], Stampy [7
] or Tanoti [8
] use one or more reference sequences to guide the alignment of reads. In contrast, de novo assembly approaches such as SPAdes [9
] and VICUNA [10
] use associations derived purely from the read data itself to propose large genome fragments, avoiding the biases arising from the choice of reference sequence. A drawback of de novo methods is that they may not accurately capture the full genomic structure or diversity, thus, for well-known viruses with high levels of genomic diversity, combinations of de novo assembly and reference-based alignment methods, such as shiver [11
], are often used. One aspect common to almost all recent methods in this area is that they output their results in the form of a sequence alignment mapping (SAM) file. The SAM format [12
] integrates nucleotide, read quality and alignment data in a single file. It was standardised at an early point in the adoption of deep sequencing, allowing diverse methods to be compared with each other and integrated into processing pipelines for a broad set of applications.
A range of variant-calling methods have been developed to analyse genomic heterogeneity within deep sequencing data. Error rates in short read technologies such as Illumina are low but it can be challenging to distinguish errors from real single nucleotide variants (SNVs) occurring at frequencies comparable to the error rate. Therefore, variant-calling methods such as LoFreq [13
] and V-Phaser [14
] apply statistical techniques to the aligned read data to identify probable SNVs occurring even at very low frequency in the presence of sequencing errors. The focus on low frequency SNVs is critical in applications such as cancer genomics where somatic deviations from the consensus are both rare and of high consequence [16
]. However, virus bioinformatics has distinct priorities from fields focused on eukaryote or bacterial organisms with higher replication fidelity [17
]. Since viruses typically have a low replication fidelity, there is a higher level of diversity within an infected host and the viral population can be expected to contain many variants.
For reasons of clarity within the research community, virus genome locations can be defined in terms of a standard virus strain with a well-established “master reference” genome. In hepatitis C virus (HCV) for example, codons within viral proteins are numbered relative to the H77 strain [18
]. Polymorphisms at these standardised locations are reported with phenotypic associations established experimentally or in clinical trials. The advent of deep sequencing data prompts questions such as: What are the relative proportions of different amino acid residues at a given genome location? What proportion of reads support the presence of a certain sequence motif? What proportion of reads indicate a deletion? However, it is challenging to answer these questions within deep sequencing data, since read alignments do not in general use the reference coordinate space, and a mapping between the two spaces must be established and applied.
The genomes of the virus population may contain multiple nucleotide base variants at different positions within a single codon location. Both Verbist et al. [19
] and Döring et al. [20
] pointed out that linkage between nucleotide positions is lost when variants are called as SNVs. This linkage must be retained within datasets in order to accurately predict the amino acid residues arising from protein translation. Suppose for example we observe significant levels of both adenine (A) and thymine (T) at the first position of a particular codon location. At the second position, we observe cytosine (C) and guanine (G). With cytosine at the third position, what amino acids are the genomes in the virus population coding for? Without retaining linkage, these observations are consistent with a mix of Threonine (codon ACC) and Cysteine (TGC), or alternatively with purely Serine (AGC/TCC), or with any combination of these amino acids. By retaining linkage, we can accurately select between these interpretations. Haplotype reconstruction methods aim to determine linkage by associating sub-populations of reads as haplotypes. Schirmer et al. [21
] found that this was exceptionally unreliable for viral deep sequencing data. However, full haplotype reconstruction may not be necessary for practical applications where the variants of interest are linked within the span of a single read or read pair.
We present a subsystem of the GLUE software package [22
] called samReporter, focused on the analysis of aligned deep sequencing viral genome data. It directly processes the SAM file format produced by most methods, and can also process the more compact Binary Alignment Map (BAM) format. The samReporter can be instantiated within an existing GLUE project containing reference sequences and alignments for a given virus. This allows the software to establish a reading frame for coding region reads within a SAM file, and map between the read alignment coordinate space and standardised genome locations. In turn, this facilitates the scanning of reads directly for different classes of sequence pattern such as codons, amino acid residues, indels and motifs. This approach of scanning reads directly has the advantage of retaining linkage and we can report how often combinations of variations appear together on the same viral RNA, certainly within a codon location but also further, at least as far as paired-end read data allows.
We demonstrate the benefits of applying the GLUE samReporter to hepatitis C virus (HCV) deep sequencing data. HCV is a positive-sense single-stranded RNA virus of the family Flaviviridae. Its genome of about 9000 bases codes for a single polyprotein that produces 10 mature viral proteins. HCV affects over 100 million people worldwide and can cause liver disease and cirrhosis. The infection can be treated with a range of direct-acting antiviral (DAA) drugs which inhibit three of the mature proteins: NS3, NS5A and NS5B. Such therapies produce a sustained virological response (SVR) in the vast majority of patients, clearing the virus in around 95% of cases [23
]. Notwithstanding this therapeutic success, HCV is proving difficult to treat in certain categories of patients, including “retreatment” patients: those for whom prior DAA treatment has failed. It has been shown both in vitro and in vivo that certain resistance-associated substitutions (RASs) in the viral genome confer resistance to DAA drugs [24
Vermehren et al. suggested that retreatment patients have RASs in multiple drug target genes and that therefore “genomic resistance testing may be useful to select the optimal combination and treatment duration” for subsequent rounds of drug therapy [23
]. Recent guidance suggests that, if deep sequencing is used, observing a RAS in 15% of the virus population may be clinically relevant [25
]. Tools aimed at HCV resistance testing such as geno2pheno[ngs-freq] [20
] suggest frequencies of 2%, 10% and 15%. Thus, while HCV RAS testing can benefit from deep sequencing methods, moderately-low rather than ultra-low frequencies are of most interest. RAS testing for retreatment patients therefore provides a good case study for a deep sequencing data interpretation system.
We analysed viral genome diversity within a group of 241 HCV retreatment patients sampled within the United Kingdom. A range of genotypes were represented: Gt1 n = 115, Gt2 n = 5, Gt3 n = 104, Gt4 n = 14 and Gt6 n = 3. The five most frequent subtypes: were 1a n = 98, 1b n = 13, 3a n = 95, 3b n = 6 and 4r n = 6. Fourteen other subtypes were represented each by three or fewer patients. In three cases, a subtype could not be assigned.
The samReporter scans aligned reads directly, retaining linkage within reads with the intention of more accurate detection of specific variants. To test the benefits of this approach, we also contrived an alternative method that attempts to capture within-host diversity without retaining linkage.
Besides the four concrete bases A, C, G and T/U, IUPAC notation for nucleic acids, used in the FASTA sequence file format, contains 11 ambiguity codes, covering all possible combinations of more than one base [26
]. For example, code S can represent a combination of C and G. Any method which calls SNVs can encode these variants within a FASTA file. This aspect of the encoding is used to capture minor nucleotide variants. Web-based HCV drug resistance interpretation systems such as HCV-GLUE [27
] and geno2pheno[hcv] [28
] do attempt to interpret ambiguity codes if they appear in the input data. We produced FASTA files with ambiguity codes for each sample using the samReporter nucleotide-consensus
command, which produces one IUPAC code for each nucleotide position in the SAM reference coordinate space. Read bases with a Phred quality score of less than 25 were excluded. A “concrete” base (A, C, G or T) was encoded at a given position if it appeared both in at least five individual reads and in 5% of the quality-filtered reads at that location. IUPAC ambiguity codes are then used if multiple concrete bases are to be encoded. We found that FASTA files for all but two samples contained at least one ambiguity code. On average, files contained ambiguity codes which represent two bases at 1.02% of nucleotide positions (std. dev.
) and codes which represent three bases at 0.0161% of nucleotide positions (std. dev.
A triplet of concrete bases, i.e., a codon, specifies a single amino acid. If ambiguity codes occur within nucleotide data for a given codon location, multiple distinct codons are present in the underlying data at that location; the precise composition is unknown. For a given ambiguous triplet (possibly containing ambiguity codes), there is set of “possible” amino acids comprising any residues coded by one or more of the possible codons. For example, for the ambiguous triplet YTM, the set of possible amino acids is Leucine (L) and Phenylalanine (F) because the set of codons and their corresponding amino acids are CTA (L), CTC (L), TTA (L) and TTC (F). Additionally, there is a (possibly empty) subset of “definite” amino acid residues, i.e., those that must be coded by at least some of the underlying codons, whatever the composition, under reasonable assumptions. For the ambiguous triplet YTM, every combination of codons which produces the ambiguity codes contains at least some codons for Leucine (L); this is the single definite amino acid. In general, if there is a single ambiguity code encoding two bases within the triplet, there will be one or two definite amino acids, and these will also be the only possible amino acids.
The FASTA files were analysed for “ambiguous” codon locations where the definite and possible amino acid sets were different. This typically occurs when there are two ambiguity codes within a triplet. Such locations present a challenge for drug resistance interpretation systems based on FASTA file inputs. Whereas amino acid residues in the definite set can be inferred to be present in the virus population, the status of amino acids in the possible set but not the definite set cannot be established clearly from FASTA data. We excluded from the analysis degenerate locations i.e., those where the possible set, excluding stops, contained more than five amino acids or where the read depth for the whole codon location was less than 10.
In total, 435 ambiguous locations were found in patients, within all ten viral proteins: Core n
= 62, E1 n
= 24, E2 n
= 205, p7 n
= 5, NS2 n
= 14, NS3 n
= 30, NS4A n
= 1, NS4B n
= 9, NS5A n
= 38 and NS5B n
= 47. The full set of ambiguous locations is given in the Supplementary Materials
. Scaling by the length of each region, this implies that the E2 and Core proteins had a higher rate of ambiguous locations. The drug target proteins NS3, NS5A and NS5B have rates in the lower part of the range.
The current version of HCV-GLUE [27
] documents 44 locations associated in the literature with resistance to six DAA drugs in current use: 18 in NS3, 15 in NS5A and 11 in NS5B. These are listed in the Supplementary Materials
. Within the FASTA data, we found 10 ambiguous resistance-associated locations in nine patients, six in NS5A and four in NS5B. We resolved these locations using the GLUE samReporter, calculating the frequencies of codons and amino acids by directly analysing reads. Codons were excluded if any Phred base quality was below 25. Amino acid residues were deemed to be present if 5% or more of filtered read codons at the location coded for the residue. The 10 locations are shown in Table 1
. HCV-GLUE classifies an amino acid as typical at a location for a given subtype if 10% or more of the GenBank sequences of that subtype contain the residue, these are also shown in the table. In all cases except one (R25, NS5B position 159) the definite residues set (not shown) was empty.
In the cases shown in Table 1
, samReporter was able to eliminate many possible residues. Whereas the possible set contained up to four residues, samReporter confirmed that only one or two were actually present at the 5% level. Two resistance locations occur three times each and merit a discussion. For NS5A position 30, in subtype 1a the typical residue is Glutamine (Q). In sample R67, samReporter found Tyrosine (Y) at around 74%, which has not been documented as a RAS but is atypical for the subtype. In samples HCV300 and R164 (subtype 3a) the typical residue is Alanine (A); samReporter found Lysine (K), a well-documented RAS, at levels of 92% and 82% respectively. Thus, in these three cases samReporter confirmed the presence of a RAS or atypical substitution. Substitutions at NS5B position 282 have been strongly associated with resistance to sofosbuvir, particularly the substitution of the typical Serine (S) with Threonine (T). In contrast with NS5A position 30, samReporter was able to eliminate this resistant residue and the other possible atypical residue Cysteine (C); in these three ambiguous cases, only Serine is present, but is actually coded by significantly distinct codons in each case. For sample HCV294, the codons were TCT at 54%, AGT at 25% and AGC at 20%, for sample R25, AGC at 85% and TCC at 15%, and, for sample R36, TCC at 92% and AGC at 7%. The effect of the presence of these diverse codons is to create ambiguity at the nucleotide level. One possible explanation for the codon diversity is that Threonine codons became frequent in the viral population during sofosbuvir treatment, and that following the end of the treatment course the descendants of these virions reverted to Serine, but now coded using diverse alternative codons.
The HCV296 sample is typical in the sense that the size of the BAM file (24.6 MB) was closest to the mean for this dataset, it contains ≈282,000 paired-end reads with an average depth across the HCV polyprotein of ≈3700. To evaluate performance, we ran some samReporter commands on this file using a 2014 MacBook Pro with a 2.5 GHz Intel Core i7 processor and 16 GB of RAM. The samReporter was configured to use up to four CPU cores. The amino-acid
command was run to translate reads for the whole polyprotein, producing amino acid residue frequencies at each codon location, without any read filters. Using the auto-align feature with a known target reference sequence, this command took 7.5 s. Using the max-likelihood-placer feature, the command took 29.2 s, with most of the extra time spent in the RAxML-EPA step. See Section 3.2
for details of the samReporter design.
The results show that, within virus genomes of HCV retreatment patients, linkage between nucleotide variants within a codon location is not a purely theoretical issue. In a small number of cases among UK retreatment patients, such linkage did occur at sites critical for drug resistance. Approaches that do not preserve linkage, such as those encoding variants as ambiguity codes, cannot correctly resolve these cases. How would such a system deal with amino acid residues in the possible set but not in the definite set? If the system is configured not to report such residues, the result is false negative detection of a substitution at NS5A position 30 in three patients, obscuring A30K RAS in two cases. Conversely, if the system reports these residues, the result is false positive detection of the NS5B RAS S282T in three other patients. As far as we are aware, the current study is novel in terms of quantifying the effect of such linkage on resistance detection in real HCV patient data.
The current HCV-GLUE database [27
] documents many RASs that combine substitutions at locations within the span of a typical Illumina read. The samReporter can report the presence or absence of these on any read (or read pair for paired-end data) that covers the relevant locations. Future work might consider whether detection of these “combination” RASs at a minority level are of clinical relevance. It would also be of interest to incorporate existing low frequency variant-calling mechanisms from the literature into GLUE. The samReporter and GLUE generally are intended to be useful in both research and clinical contexts. However, even once a SAM file has been generated from a sample, the samReporter only represents one part of the process in terms of drug resistance analysis. The HCV-GLUE system is currently being developed to provide a comprehensive drug resistance report, using samReporter to interpret a SAM file.
Other software, for example the VirVarSeq system [19
], calls variants at the codon level but is focused on very low frequency variants. DiversiTools [46
] provides frequencies of amino acids on a per read basis but does not link to a standardised coordinate system as is available in the GLUE framework. The geno2pheno[ngs-freq] system [20
] directly interprets drug resistance in deep sequencing data. Users must transform their data into a table of nucleotide or codon frequencies and a web-based system then performs interpretation on this table, using a user-defined frequency threshold. In comparison with samReporter, this design facilitates fast transfer over a network since the frequency table is much more compact than a typical SAM or BAM file. However, some information is necessarily lost in the processing, for example the codon frequency table cannot encode linkage beyond a codon location which would be required for example to detect combination RASs.
While the current study applied samReporter to HCV, it can also be used to analyse deep sequencing data for other viruses. In many simple cases, the prerequisites would simply be a nucleotide alignment of alternative target reference sequences and a master reference sequence with coding region annotations. In more complex cases, for example where virus genomes contain ambisense genomes or RNA editing, GLUE and samReporter would need to take account of this. The GLUE samReporter shows that a simple, pragmatic software design can conveniently answer some common questions concerning within-host variation in viral deep sequencing data.