2.1. Overview
A standard workflow in our laboratory [4] includes DNA extraction, PCR amplification, direct DNA sequencing by a third-party service, viewing and checking of chromatograms, preparation of curated sequences, multiple sequence alignment, sequence analysis, serotyping, genotyping, phylogenetic analysis and preparation of sequences for submission to GenBank. During these processes, challenges were encountered in the analysis of HBV sequence data, and bioinformatic tools were developed to address these (Table 1). Stand-alone, web-based tools allow users on any operating system to access the tools they require from any location with an Internet connection, without needing to learn a new bioinformatics software suite or program and without having to install any software on their computer. The appropriate tool is simply used as and when required.
Table 1.
Online tools developed and the workflow process for which each would be used.
Workflow | Tool Name | Description | Source | Input | Performance |
---|
Chromatograms | Quality Score Analyzer | Plots Chromatogram Quality Scores | Sanger | Chromatogram | 0.4 s for 1200 bases |
Chromatograms | Automatic ContigGenerator Tool (ACGT) | Generates a contig from a forward and reverse chromatogram | Sanger | Chromatogram | 0.5 s for two chromatograms of 300 bases each |
Alignment | Automatic Alignment Clean-up Tool (AACT) | Eliminates “gap-columns” and disambiguates ambiguous bases | Sanger NGS * | FASTA | 0.2 s for 3800 sequences of 3221 bases in length (12-MB file) |
Alignment | Mind the Gap | Splits FASTA file based on gap threshold per column | Sanger NGS | FASTA | 0.5 s for 3800 sequences of 3221 bases in length (12-MB file) |
Analysis | Babylon Translator | Extracts HBV protein sequences (ORFs) | Sanger NGS | FASTA | 0.9 s for 3800 sequences of 3221 bases in length (12-MB file) |
Analysis | Wild-type 2 × 2 | Calculates 2 × 2 wild-type/mutant contingency tables | Sanger NGS | FASTA | 0.1 s for two groups of 50 sequences each of 3221 bases in length |
Serotyping | HBV Serotype Tool | Determines HBV Serotype | Sanger NGS | FASTA | 0.6 s for 225 sequences of 3221 bases in length |
Phylogenetics | Pipeline: TreeMail | Generates a phylogenetic tree | Sanger | Phylip | 1000 bootstraps of 41 sequences of 1000 bases required 15 min to process and email |
GenBank Preparation | PadSeq Tool | Places two HBV sequence fragments on a template | Sanger | FASTA | 0.6 s to place 3800 sequences from each of two input files |
2.2. Quality Score Analyzer
Direct DNA sequencing of PCR amplicons is a routine laboratory procedure. The result of this sequencing reaction is a chromatogram file, which is also known as a trace file or an electrophoretogram (visualized in Figure 1). A common file format for chromatograms is the Applied Biosystems format (ABIF; the ABIF file format specifications are available online at http://www.appliedbiosystems.com/support/software_community/ABIF_File_Format.pdf), which has a file name extension of “ab1”. In a process known as “base calling”, which is typically part of the DNA sequencing service, each nucleotide in the sequence is automatically identified by a software program. A “quality score”, implemented originally by the “phred” base-calling program, is assigned to each base call [5,6]. This score, which is logarithmic, indicates the reliability or confidence of the base call. A value of 10 indicates a one-in-10 probability that the base call is incorrect (90% accuracy). A value of 20 indicates a one-in-100 probability of an incorrect base call (99% accuracy). Generally, quality scores greater than or equal to 20 are considered reliable. Due to the nature of sequencing reactions, quality scores at the beginning and the end of chromatograms are generally too low to be considered reliable, and these regions are therefore routinely removed before any downstream processing is done.
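The logarithmic relationship between a quality score and the probability of an incorrect base call can be sketched as follows. The function names and the default threshold argument are illustrative only and are not part of the tool.

```python
import math


def error_probability(quality_score):
    """Probability that a base call is wrong, given its phred-style score."""
    return 10 ** (-quality_score / 10)


def quality_score(error_prob):
    """Phred-style score corresponding to a given error probability."""
    return -10 * math.log10(error_prob)


def is_reliable(score, threshold=20):
    """Apply the commonly used cut-off of 20 (99% base-call accuracy)."""
    return score >= threshold
```

Under this relationship, each increase of 10 in the score corresponds to a ten-fold decrease in the probability of an incorrect call.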
Figure 1.
Example chromatograms of the basic core promoter/precore (BCP/PC) region of the hepatitis B virus (HBV) genome. Panel (A) shows the “Kozak” region (TCAT) of subgenotype A1, followed by the “ATG” pre-core start codon; Panel (B) shows ambiguous (wobble) bases, which result from double peaks. These indicate a mixed population or the presence of quasispecies. Quality scores are indicated by the grey bars above each base call. The quality score associated with the ambiguous “W” base in panel (B) is 15, compared with scores above 50 for each base of the “TCAT” motif in panel (A).
The quality scores of the base calls in a chromatogram are important. In some cases, the overall quality of an entire chromatogram is so poor that it should not be used. In other cases, only some regions of the chromatogram are of poor quality. Using a poor-quality chromatogram in an application or online tool will typically produce poor-quality results or no results. An online tool was developed to help users determine the overall quality of a chromatogram file.
The online “Quality Score Analyzer” requires an “ab1” chromatogram file as input and displays a box-and-whisker plot (not shown) and density plot of the quality scores and a “heat map” (Figure 2). The “heat map” provides a visual representation of the overall quality of an entire chromatogram. Areas of interest, such as regions of low quality, can be examined in more detail. No trimming of the input chromatogram is performed.
Figure 2.
The output of the “Quality Score Analyzer” tool, showing the density plot on the left and a section of the “heat map” on the right. Each entry in this map is in the format “XXXX:YYZ”, where “XXXX” is the base position number in the sequence, increasing from “0001” for the first position in the file, “YY” is the quality score from the chromatogram and “Z” is the base called at the position. The color of each entry represents the quality score. Values in the range zero to nine (considered very poor) are shown in red, between 10 and 19 (poor) in yellow, between 20 and 29 (acceptable) in green, between 30 and 39 (good) in blue, between 40 and 49 (very good) in magenta and between 50 and 59 (excellent) in cyan. Quality scores greater than or equal to 60, which are theoretical only, are shown in white. Ambiguous bases are shown in reverse colors (black text on a colored background).
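The entry format and color bands described in the Figure 2 caption can be illustrated with a minimal sketch. The names `BANDS`, `band_for` and `heat_map_entry` are hypothetical, and the code is not taken from the tool itself.

```python
# Color bands from the Figure 2 caption, indexed by (quality score // 10).
BANDS = [
    ("very poor", "red"),       # 0-9
    ("poor", "yellow"),         # 10-19
    ("acceptable", "green"),    # 20-29
    ("good", "blue"),           # 30-39
    ("very good", "magenta"),   # 40-49
    ("excellent", "cyan"),      # 50-59
    ("theoretical", "white"),   # 60 and above
]


def band_for(score):
    """Return the (label, color) pair for a non-negative quality score."""
    return BANDS[min(score // 10, len(BANDS) - 1)]


def heat_map_entry(position, score, base):
    """Format one entry as XXXX:YYZ (position, quality score, base call)."""
    return f"{position:04d}:{score:02d}{base}"
```

For example, the ambiguous “W” base in Figure 1, with a quality score of 15, would fall in the “poor” (yellow) band.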
2.6. Babylon Translator
The HBV genome codes for seven proteins, in four overlapping open reading frames (ORFs) [9]. The “Babylon” tool extracts (splits) HBV sequence data, from a single input file, into multiple files, with each output file containing either nucleotide or translated amino acid data for one HBV protein, as specified on the input page. The tool does not require full-length sequences, and if required, the co-ordinates used to extract the protein(s) can be specified manually. The tool processes a FASTA file, which should contain aligned nucleotide sequence data from samples belonging to a single HBV (sub)genotype. The input page of the “Babylon” tool is shown in Figure 5.
Figure 5.
Part of the input page of the “Babylon” tool. Selecting a genotype from the list on the left will populate the nucleotide positions for each protein with default values from [10]. However, each of these positions can be edited, as necessary. The number of amino acids for each protein is determined automatically from the nucleotide values. An “Include” field for each protein specifies if it should be included in the output. The amino acid output is obtained by selecting the appropriate check-box on the input page. The “-” and “?” characters, which may be present in input sequence data, will be processed by the tool as an “N” character if the appropriate check-box is selected. It may be possible to translate nucleotides to amino acids when “-” and “?” characters are replaced with “N” characters.
The tool extracts sequence data for each of the selected proteins from the FASTA file, optionally translating the data into amino acids, if specified. A separate output file (in FASTA format) is created for each selected protein, containing the nucleotide or amino acid data for all samples, for that protein only. The files can be downloaded individually or all together in one compressed archive (“ZIP”) file.
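The extraction and optional translation step can be sketched as below. This is an illustration only, under several assumptions not stated in the text: co-ordinates are 1-based and inclusive, the region does not wrap around the end of the circular genome, the standard genetic code is used, and a codon that cannot be resolved to a single amino acid is reported as “X”. The function names are hypothetical.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built from the conventional TCAG ordering.
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def extract_region(sequence, start, end):
    """Return positions start..end (1-based, inclusive) of an aligned sequence."""
    return sequence[start - 1:end]


def translate_codon(codon):
    """Translate one codon; 'N' is resolved only if every choice agrees."""
    choices = [BASES if base == "N" else base for base in codon]
    residues = {CODON_TABLE.get("".join(c), "X") for c in product(*choices)}
    return residues.pop() if len(residues) == 1 else "X"


def translate(nucleotides, gaps_as_n=True):
    """Translate a nucleotide string, optionally treating '-' and '?' as 'N'."""
    seq = nucleotides.upper()
    if gaps_as_n:
        seq = seq.replace("-", "N").replace("?", "N")
    usable = len(seq) - len(seq) % 3  # ignore an incomplete trailing codon
    return "".join(translate_codon(seq[i:i + 3]) for i in range(0, usable, 3))
```

The sketch shows why replacing “-” and “?” with “N” can still allow translation: a codon such as “CTN” encodes leucine whichever base occupies the third position, whereas “NAC” remains unresolved.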
2.7. Wild-Type 2 × 2
When analyzing a set of HBV sequences, it is often desirable to compare the number of wild-type residues at a locus with the number of mutant (non-wild-type) residues at the same locus. In this case, “wild-type” refers to the residue that occurs in the majority of the sequences. The “Wild-type 2 × 2” tool requires a FASTA file of (aligned) nucleotide or amino acid data as input. It calculates wild-type/mutant 2 × 2 contingency tables for sequences in the two specified groups, for all loci. Detailed output is provided for loci that are statistically significant at the specified threshold.
The input sequence data must be allocated into two groups using the number (numerical position) of sequences in the FASTA file. For example, if a file contains 20 sequences, with the first five representing “Group 1” (for example, sequences from males) and the remaining 15 representing “Group 2” (for example, sequences from females), this would be specified as “1–5” and “6–20”, without the quotation marks. Groups may also be specified as individual numbers, such as “1,3,6,7,10”, or as a mixture of both notations, such as “2,5,6–12”. No spaces or other characters are permitted. If one of the groups is omitted entirely (left blank), all sequences that are not allocated to the other group will automatically be allocated.
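The group notation described above can be parsed with a short routine such as the following sketch. The function names are hypothetical, and accepting both the hyphen and the en dash as the range separator is an assumption made here for convenience.

```python
import re

# One or more items separated by commas; each item is "n" or "n-m".
GROUP_PATTERN = re.compile(r"\d+([-–]\d+)?(,\d+([-–]\d+)?)*")


def parse_group(spec):
    """Convert a specification such as '2,5,6-12' into sorted sequence numbers."""
    if not GROUP_PATTERN.fullmatch(spec):
        raise ValueError(f"invalid group specification: {spec!r}")
    numbers = set()
    for item in spec.split(","):
        first, _, last = item.replace("–", "-").partition("-")
        numbers.update(range(int(first), int(last or first) + 1))
    return sorted(numbers)


def remaining_group(other_group, total_sequences):
    """Sequences not allocated to the other group (used when a group is blank)."""
    taken = set(other_group)
    return [n for n in range(1, total_sequences + 1) if n not in taken]
```

With 20 sequences and “Group 1” given as “1-5”, `remaining_group(parse_group("1-5"), 20)` yields sequences 6 to 20 for “Group 2”.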
For each position/locus in the sequence data, the majority residue (nucleotide or amino acid) is determined, and this is considered the “wild-type” residue for that locus. The number of mutant residues at each position is then determined. A 2 × 2 contingency table is constructed for each position, using the wild-type and mutant counts for each of the two groups. If at least one cell in the table contains a value less than or equal to five, a Fisher’s exact test is performed; otherwise, a chi-squared test is performed on the table data. If the resulting p-value is less than or equal to the threshold value specified on the input page, that position is considered statistically significant, and the details of that position are included on the output page. The value of the optional “offset”, as entered on the input page, is added to the position in the output. This can be used to obtain output positions that correspond exactly with genome co-ordinates. The two groups can be allocated names by entering text into the appropriate box on the input page. Example output is shown in Figure 6.
2.9. Pipeline: TreeMail
Phylogenetic analyses undertaken by members of our research group typically involve several programs from the “Phylip” suite [17]. These command-line tools are interactive and menu-driven, requiring the user to undertake several steps to complete an analysis. An output data file from one component of the suite must be renamed manually to prepare it for use as an input file for another component of the suite. This process is repetitive and time consuming, especially when running several analyses.
The “Pipeline: TreeMail” tool runs the “Phylip” dnadist and neighbor programs on the input file, with parameters automatically set as required by the research group (see below). The input page of the tool is shown in Figure 8. The input file must be in Phylip (“.phy”) format. The Kimura and lower-triangular settings are specified for dnadist, and the lower-triangular setting is specified for neighbor. The pipeline tool emails the resulting tree file (“.tre”) to the email address provided. The tool reports progress once the input file has been uploaded. A description of each of the “Phylip” programs used by the “Pipeline: TreeMail” tool is provided in Table 3.
If the “Bootstrap” mode is specified on the input page, the tool runs the Phylip seqboot program before dnadist and consense after neighbor, as described in the online documentation for seqboot (http://evolution.genetics.washington.edu/phylip/doc/seqboot.html). When in “Bootstrap” mode, the pipeline tool will create 1000 datasets with seqboot and will email the final consensus tree to the email address provided.
The tool also emails the final Phylip “outfile” from either neighbor (normal mode) or consense (bootstrap mode) as a second attachment, called “result.txt”. When running in bootstrap mode, the actual bootstrap values will appear on the tree in this file (“result.txt”) and are present in the consensus tree file. These are shown as values out of 1000, not percentages.
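The order of the programs in each mode, and the renaming that links one step to the next, can be summarized in a short sketch. It does not run any Phylip program. It assumes the default Phylip file names (“infile”, “outfile”, “outtree” and “intree”), and the function names are hypothetical.

```python
def pipeline_steps(bootstrap=False):
    """List (program, file carried forward, name it is renamed to) per step."""
    if bootstrap:
        programs = ["seqboot", "dnadist", "neighbor", "consense"]
    else:
        programs = ["dnadist", "neighbor"]
    steps = []
    for program, following in zip(programs, programs[1:] + [None]):
        # Tree-building programs pass on "outtree"; the others pass on "outfile".
        produced = "outtree" if program in ("neighbor", "consense") else "outfile"
        if following is None:
            rename_to = None          # final result, attached to the email
        elif following == "consense":
            rename_to = "intree"      # consense reads its trees from "intree"
        else:
            rename_to = "infile"
        steps.append((program, produced, rename_to))
    return steps


def support_percent(count, replicates=1000):
    """Convert a bootstrap count (out of 1000 by default) to a percentage."""
    return 100 * count / replicates
```

For example, a branch labelled 953 in the consensus tree corresponds to a bootstrap support of 95.3%.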
Figure 8.
The “Pipeline: TreeMail” input page. An input file in “Phylip” format is required. The user may select either “Neighbor” mode, which does not include bootstrap replicates, or “Bootstrap” mode, which includes 1000 bootstrap replicates. Results of the analysis, including the tree file, are emailed to the address provided.
Table 3.
Programs from the “Phylip” suite, which are used by the “Pipeline: TreeMail” tool.
Program | Description |
---|
consense | Computes consensus trees using the majority-rule method |
dnadist | Computes distances between samples from sequence data |
neighbor | Computes an unrooted tree by neighbor-joining or UPGMA |
seqboot | Generates multiple datasets by bootstrap resampling |
Running in bootstrap mode may increase the time required for the tool to complete. A bootstrap test run of 41 sequences of approximately 1000 nucleotides each took 15 min to process and email. The web page will time out if the analysis takes longer than 60 min.