3.1. CODEHOP Integration
Even though vast amounts of genomic sequence have been obtained recently, it is unlikely that complete genome sequences will be determined for all living species that might provide valuable scientific and medical insights. To obtain sequence information for specific genes in unsequenced organisms or pathogens, a primer design strategy for PCR amplification of novel genes using “consensus-degenerate hybrid oligonucleotide primers” (CODEHOPs) was previously developed [
4]. CODEHOPs are designed from amino acid sequence motifs that are highly conserved within a gene family, and are used in PCR amplification to identify unknown related family members. Each CODEHOP consists of a pool of primers containing all possible nucleotide sequences within a 3′ degenerate core encoding a stretch of 3–4 highly conserved amino acids (
Figure 1). A longer 5′ nondegenerate clamp region in the primers contains the most probable nucleotide predicted for each flanking codon. The degenerate core allows primer binding to all existing target variations in the initial PCR cycles, while the clamp region, once integrated into early PCR products, leads to efficient amplification of the PCR products in later PCR cycles. CODEHOPs designed from two adjacent conserved motifs are used to amplify the gene sequences between these motifs.
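The size of the primer pool implied by a degenerate core can be estimated by counting codons. The following is a minimal Python sketch, not the j-CODEHOP algorithm: it assumes the standard genetic code, enumerates every codon for every core residue, and ignores refinements such as codon usage tables or trimming the final codon position. The function names `core_degeneracy` and `rank_core_candidates` are hypothetical.

```python
from math import prod

# Number of codons per amino acid in the standard genetic code (sums to 61).
CODONS_PER_AA = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4,
    "H": 2, "I": 3, "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6,
    "T": 4, "W": 1, "Y": 2, "V": 4,
}


def core_degeneracy(motif: str) -> int:
    """Upper bound on the number of nucleotide sequences encoding `motif`."""
    return prod(CODONS_PER_AA[aa] for aa in motif.upper())


def rank_core_candidates(conserved: str, core_len: int = 4):
    """Slide a window over a conserved block; least degenerate cores first."""
    windows = [
        conserved[i:i + core_len]
        for i in range(len(conserved) - core_len + 1)
    ]
    return sorted((core_degeneracy(w), w) for w in windows)
```

Under these assumptions, a core rich in Trp, Met, Asp, or Phe yields a much smaller pool than one containing Leu, Ser, or Arg, which is why such motifs are attractive targets for the 3′ degenerate core.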
Other methods to identify unknown genes have used degenerate primers, containing most or all of the possible nucleotide sequences encoding amino acid motifs, or a consensus primer containing the most common nucleotide at each codon position within the motifs. However, unlike strictly degenerate or consensus approaches, the CODEHOP PCR approach has proven to be highly successful in amplifying distantly related genes containing significant sequence variations at low copy numbers. The primer design software and the CODEHOP PCR strategy have been utilized for the identification and characterization of new gene orthologs and paralogs in different plant, animal, and bacterial species, as well as for virus typing (e.g., enteroviruses [
5]); consequently, the original publication has been cited in more than 800 subsequent publications. In addition, this approach has been successful in identifying new pathogen species and genes, as we have previously published [
6,
7,
8,
9,
10,
11,
12].
A computational strategy to predict CODEHOP PCR primers from multiply aligned sets of related protein sequences was previously developed and has been continuously accessible over the internet since 1998 as an integral part of the BLOCKS database, which was developed by Steven Henikoff and is hosted by the Fred Hutchinson Cancer Research Center [
4]. A description of the CODEHOP program and its uses was published in the 2003 NAR Web services edition [
13]. Subsequently, we developed iCODEHOP, an interactive web application independent of the BLOCKS database, to simplify and automate the process of designing CODEHOP PCR primers. The iCODEHOP program added new features, including interactive visualization of predicted CODEHOPs, phylogenetic plots for multiple aligned sequences, and user sessions on the server that allowed data to be stored during the design process [
14]. However, due to advances in web browser technology, problems with the stored server sessions, and resource limitations, the iCODEHOP web application could no longer be supported. We have therefore developed a Java-based version, j-CODEHOP, for integration into BBB (menu:
Advanced).
j-CODEHOP guides users through the CODEHOP PCR primer design process, including uploading sequences, creating a multiple alignment, and identifying and visualizing primer pools that match the specified design criteria. A linked tutorial provides a step-by-step guide to demonstrate how to create CODEHOPs, using a sample FASTA file containing related sequences within the uracil DNA glycosylase family. The input to j-CODEHOP can be a set of nonaligned protein sequences or a set of aligned protein sequences. Protein sequence files may be formatted as GenBank (*.gb, *.gbk), EMBL (*.embl), BBB (*.bbb), FASTA (*.fasta, *.fas, *.fa) or CLUSTAL (*.clustal, *.clustalw). The program’s output includes a graphic showing predicted CODEHOP primers at their locations along a consensus protein sequence, a graphical representation of the region of the multiple alignment from which they are derived, and a set of metadata about each primer pool (length, degeneracy, and annealing temperature range). j-CODEHOP enables the user to visually scan the entire set of predicted CODEHOP primers to assess their relative positions and orientations within the consensus protein sequence and select individual CODEHOP primers for further analysis.
For the aligned protein sequences and chosen criteria, j-CODEHOP computes all primer possibilities. The initial output shows the consensus amino acid sequence for conserved blocks of the multiple protein alignment (
Figure 2A). This sequence is numbered according to the positions in the multiple alignment, with capital letters for amino acids matching the minimum conservation criteria. A second window lists the possible primers to export. The user can view the consensus sequence to visualize the positions of the predicted primers, which are shown as arrows, forward or reverse. The amino acid motif targeted by the 3′ degenerate core of the primer is aligned with the primer arrow, as are the flanking amino acids specifying the 5′ non-degenerate clamp. A specific primer can be selected, which will open a third window to show the CODEHOP sequence, the block of aligned sequences used for primer design, and the primer design criteria (
Figure 2B). Both forward and reverse CODEHOPs in the correct orientation need to be identified. If too few CODEHOPs are predicted, the program can be rerun using more relaxed design criteria, or the distance among the sequences in the group can be reduced. Detailed methodologies describing the design of CODEHOP PCR primers and their use in identifying novel sequences have been published previously [
4,
11,
13,
14].
3.2. Sequence Characteristics
As the -omics revolution progresses, more and more researchers make use of sequence data from the various databases and, increasingly, the data behind publication claims are not presented. For example, a phylogenetic tree may be published without the MSA that was used to generate it. Given that errors in sequence naming and annotation are common in the databases, it is important that researchers check results if they are going to rely on them.
Figure 3 shows a BBB
Visual Summary of two virus genomes (menu:
Reports), which are almost identical except for four large indels and a block of very poorly matching sequence. This type of visualization is a powerful tool for highlighting inconsistencies in alignments. When we further investigated these sequences (BLAST [
15] searches and dotplots [
16]), we found that the differences between the two genomes were entirely the result of genome assembly errors.
Additionally, under the BBB Reports menu, the ability to display Sequence Similarity and Sequence Difference graphs (useful for detecting recombination; not shown) has been supplemented by the plotting of a Nucleotide Content Graph. The user has control over which nucleotides are included in the analysis, as well as the size of the sliding window of nucleotides and the number of nucleotides that is used to “step” across the sequence. The tool also allows the user to choose which sequences from an MSA are included in these analyses. Importantly, the option to ignore gapped columns in an MSA has been included.
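The window and step parameters described above can be illustrated with a short Python sketch. It is an approximation under stated assumptions, not the BBB implementation: only full windows are reported, content is expressed as a fraction of the window, and gapped columns are dropped from the whole MSA before plotting. The names `nucleotide_content` and `drop_gapped_columns` are hypothetical.

```python
def nucleotide_content(seq, nucleotides="GC", window=100, step=10):
    """Return (start, fraction) pairs for each full window along `seq`.

    `start` is a 0-based offset; `fraction` is the share of the window
    made up of the selected nucleotides.
    """
    seq = seq.upper()
    wanted = set(nucleotides.upper())
    points = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        hits = sum(1 for base in chunk if base in wanted)
        points.append((start, hits / window))
    return points


def drop_gapped_columns(aligned_seqs, gap="-"):
    """Remove every alignment column in which any sequence has a gap."""
    keep = [i for i, col in enumerate(zip(*aligned_seqs)) if gap not in col]
    return ["".join(seq[i] for i in keep) for seq in aligned_seqs]
```

Removing gapped columns first keeps the window coordinates of all sequences in register, so that curves for different sequences in the MSA can be overlaid directly.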
Additional new Reports features that summarize characteristics of an MSA include: (1) Get Counts, which counts the number of columns in the MSA with particular features, reporting the number that have a gap, a single nucleotide, two nucleotides (consensus and second type), three nucleotides, and four nucleotides; (2) Get Unique Positions, which lists the number of unique positions that are not gaps for each sequence; and (3) Get SNP Counts, which examines the top two sequences (sequences can be moved up or down within the MSA to enable sequence selection) and reports the total number of SNPs and the number of each possible substitution.
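The column and substitution tallies described above can be sketched in a few lines of Python. This is an illustrative approximation, not the BBB code: it assumes upper-case aligned sequences of equal length, classifies any column containing a gap as a gap column, and skips gapped positions when counting substitutions. The names `get_counts` and `get_snp_counts` merely echo the menu items.

```python
from collections import Counter


def get_counts(aligned_seqs, gap="-"):
    """Tally columns: key 'gap' if any gap, else the number of distinct bases."""
    counts = Counter()
    for col in zip(*aligned_seqs):
        if gap in col:
            counts["gap"] += 1
        else:
            counts[len(set(col))] += 1
    return counts


def get_snp_counts(seq_a, seq_b, gap="-"):
    """Tally each substitution type (e.g. 'C>T') between two aligned sequences."""
    counts = Counter()
    for a, b in zip(seq_a, seq_b):
        if a != b and gap not in (a, b):
            counts[f"{a}>{b}"] += 1
    return counts
```

The total number of SNPs between the two sequences is then `sum(get_snp_counts(a, b).values())`.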
3.3. Counting Nucleotides Associated with Specific Sequences in MSAs
As noted above, the data supporting a phylogenetic tree are often not provided in manuscripts. It would frequently be useful to know the percent identity between sequences and the number of SNPs that distinguish one branch of a tree from another. The ability to generate a nucleotide identity matrix from an MSA is an older feature of BBB. Now, however, from within the
Advanced/Experimental Tools menu, BBB also allows a researcher to query the MSA data that support (or don’t support) a phylogenetic tree. The
Find Differences tool can be used to count the number of SNPs that support a particular branch; e.g., “find nucleotides that are identical in sequences A, B, and C, but different in all other sequences”.
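For readers who wish to check percent identity values independently, a nucleotide identity matrix can be computed from an MSA as in the following Python sketch. How BBB treats gaps is not stated here, so the sketch assumes that columns in which either sequence has a gap are excluded from the comparison; the function names are hypothetical.

```python
def percent_identity(seq_a, seq_b, gap="-"):
    """Percent identity over columns where neither sequence has a gap."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if gap not in (a, b)]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


def identity_matrix(msa):
    """Map every ordered pair of sequence names to its percent identity.

    `msa` is a dict of name -> aligned sequence (all the same length).
    """
    names = list(msa)
    return {(p, q): percent_identity(msa[p], msa[q]) for p in names for q in names}
```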
Figure 4 shows the phylogenetic tree for the central, relatively conserved core (60 kb) of 10 cowpox viruses. For these viruses, the DNA sequence identity ranges from 98.2% to 99.4%. Counting the number of SNPs unique to each sequence (red numbers in
Figure 4) shows that, for these cowpox sequences, the branch lengths may not truly reflect the evolutionary distances. Instead, the lengths were likely compressed by the recombination evidenced in
Table 1, which artificially reduced the distances between distant strains.
An important feature of the Find Differences tool (menu: Advanced/Experimental Tools) is that it allows fuzzy matching. We have termed this feature “tolerance”; it can be viewed as the search “tolerating” one or more sequences (the number is specified by the user) that do not fulfill the query. For example, the query “find nucleotides that are identical in sequences A and B but different in all other sequences, with tolerance = 1” allows any one of the sequences that should be different from A+B to be the same; different sequences may be “tolerated” at different positions in the alignment. The software also (1) creates a list of all the positions in the alignment that satisfy the query, displaying the name of the “tolerated sequence” if there is one, and (2) displays the distribution of SNPs in the MSA.
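The query logic, including tolerance, can be expressed compactly. The Python sketch below is one interpretation of the description above and not the BBB source: it assumes the MSA is a dict of name to aligned sequence, reports 1-based positions, skips columns where the query group shares a gap, and treats a gap in a non-group sequence as a difference. The name `find_differences` is hypothetical.

```python
def find_differences(msa, group, tolerance=0, gap="-"):
    """Find columns identical within `group` but different in other sequences.

    Up to `tolerance` non-group sequences may carry the group's nucleotide.
    Returns a list of (position, tolerated_names) with 1-based positions.
    """
    others = [name for name in msa if name not in group]
    length = len(next(iter(msa.values())))
    hits = []
    for pos in range(length):
        group_bases = {msa[name][pos] for name in group}
        if len(group_bases) != 1:
            continue  # the query group itself is not identical here
        base = group_bases.pop()
        if base == gap:
            continue  # a shared gap is not counted as a shared SNP
        tolerated = [name for name in others if msa[name][pos] == base]
        if len(tolerated) <= tolerance:
            hits.append((pos + 1, tolerated))
    return hits
```

With `tolerance=0` this reduces to the strict query “identical in the group, different in all other sequences”; raising the tolerance adds positions at which a limited number of other sequences share the group's nucleotide, and the returned names show which sequence was tolerated at each position.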
These BBB features were created to characterize recombination events among the poxviruses by highlighting the positions of shared SNPs. In any MSA, there will always be coincident SNPs arising from random events. However, for these cowpox sequences, when “nucleotides that are identical in sequences A, B, and C, but different in all other sequences” are located, some are, as expected, associated with the most closely related sequence, but others are shared with more distant relatives. In addition, many of these coincident SNPs are found in nonrandom blocks, suggesting that the arrangements result from recombination among the genomes.
Table 1 shows SNPs present only in CPXV-BR and CPXV-Nor1994MAN and one other sequence taken from the tree shown in
Figure 4. In several instances, the common SNPs are unexpectedly clustered (
Table 1) and likely result from recombination events. The results with CPXV-Ge1980EP4 and CPXV-Ge2002MKY (which are very similar (
Figure 4)) as the extra sequence are dramatic; despite their similarity, CPXV-Ge1980EP4 has many more SNPs in common with the other two sequences (
Table 1; 33 SNPs) than with CPXV-Ge2002MKY (
Figure 4; 3 SNPs).
3.4. Manipulation of Sequences
As previously reported, BBB allows sequences to be added to or removed from an alignment, and allows the removal of MSA columns that contain only gap characters, which are often generated when sequences are removed from an alignment. However, when visually inspecting the relationships between the sequences, it can also be useful to simplify the variation by removing any column that contains a gap character (menu: Tools/Delete Columns Containing Gap(s) and Export). Since this action modifies the sequences in use by deleting residues from some of them, the program exports the resulting sequences into a new BBB window and prompts the user to enter a new filename. If the sequences in an MSA are highly divergent, each will have a relatively large number of unique SNPs. Since these can obscure patterns present among the SNPs shared by subsets of sequences, we also created the SNIP feature (menu: Advanced/Experimental Tools), which modifies the sequences such that SNPs present in only a single sequence are changed to the consensus nucleotide. Again, because this procedure modifies the actual sequences, users are asked to save the result in a new alignment file.
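The effect of SNIP can be illustrated with a brief Python sketch. It is an approximation under explicit assumptions rather than the BBB implementation: the consensus is taken as the most common non-gap nucleotide in a column, a nucleotide is replaced only if exactly one sequence carries it and the consensus is carried by at least two sequences, and gap characters are left untouched. The function name `snip` is hypothetical.

```python
from collections import Counter


def snip(aligned_seqs, gap="-"):
    """Replace nucleotides unique to one sequence with the column consensus."""
    columns = []
    for col in zip(*aligned_seqs):
        counts = Counter(base for base in col if base != gap)
        if not counts:
            columns.append(col)  # all-gap column: nothing to change
            continue
        consensus, top = counts.most_common(1)[0]
        if top < 2:
            columns.append(col)  # no nucleotide is shared: no usable consensus
            continue
        columns.append(tuple(
            consensus if base != gap and counts[base] == 1 else base
            for base in col
        ))
    # Transpose the edited columns back into sequences.
    return ["".join(row) for row in zip(*columns)]
```

Because a new list is returned, the input alignment is left unchanged, mirroring the behavior of saving the result to a new alignment file.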
When working with large genomes of closely related viruses, SNPs may be relatively infrequent. Therefore, we incorporated a feature into BBB that allows the user to remove any specified column of nucleotides within an MSA. By removing the columns that contain only a single nucleotide, the variation is compressed into a much smaller sequence space and is more easily visualized by the user. First, the Find Differences tool (menu: Advanced/Experimental Tools) is used to find the columns that are identical (i.e., have no SNPs); then, the “Search Log” is used to “List SNP Positions” only. These position values can then be used to delete the specified columns (menu: Tools/Delete Specified Columns and Export).
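The compression step can be mimicked outside BBB as follows. This Python sketch assumes equal-length aligned sequences, uses 1-based column positions, and treats a gap as just another character when deciding whether a column is invariant; the function names are hypothetical.

```python
def variable_columns(aligned_seqs):
    """1-based positions of columns holding more than one distinct character."""
    return [
        i + 1
        for i, col in enumerate(zip(*aligned_seqs))
        if len(set(col)) > 1
    ]


def delete_columns(aligned_seqs, positions):
    """Return new sequences with the given 1-based columns removed."""
    drop = {p - 1 for p in positions}
    return [
        "".join(base for i, base in enumerate(seq) if i not in drop)
        for seq in aligned_seqs
    ]


def compress_to_snps(aligned_seqs):
    """Delete every invariant column, keeping only the variable ones."""
    length = len(aligned_seqs[0])
    variable = set(variable_columns(aligned_seqs))
    invariant = [p for p in range(1, length + 1) if p not in variable]
    return delete_columns(aligned_seqs, invariant)
```

Keeping the list returned by `variable_columns` alongside the compressed alignment makes it possible to map each remaining column back to its position in the original MSA.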
3.5. Alignment of Sequences
The options for aligning complete or selected regions of sequences have been updated. Clustal Omega [
17] has replaced the option to use ClustalW. Clustal Omega and MUSCLE [
18] serve as options to align protein and gene length nucleotide sequences. For the alignment of large viral genomes, MAFFT [
19] is the tool of choice. However, the growing number of complete genomes sequenced has translated into a more frequent need to generate larger MSAs, often to update phylogenetic trees. Although MAFFT is available at various web resources and can be easily installed on desktop computers, most users prefer to use MAFFT within BBB (menu:
Tools/Align Selection). Therefore, we have incorporated the
MAFFT-add option into BBB (menu:
Advanced/Experimental Tools) [
20]. This feature allows users to align one or more new sequences to an existing alignment, which significantly reduces the compute time. For example, the alignment of 10 cowpox virus genomes takes approximately 8 min, whereas aligning one new sequence to an alignment of 9 takes a little over 1 min. The
MAFFT-add function is also useful for scaffolding new contig sequences against a close reference sequence in the process of genome assembly.
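For users working outside BBB, the same operation is available from the MAFFT command line through its `--add` option, which takes a file of new sequences followed by the existing alignment. The Python wrapper below is a sketch under stated assumptions: the `mafft` executable is assumed to be on the PATH, the file names are placeholders, and the way BBB itself invokes MAFFT is not described here. The function names are hypothetical.

```python
import subprocess


def mafft_add_command(new_seqs_fasta, existing_alignment_fasta, executable="mafft"):
    """Build the argument list for adding sequences to an existing alignment."""
    return [executable, "--add", new_seqs_fasta, existing_alignment_fasta]


def run_mafft_add(new_seqs_fasta, existing_alignment_fasta, out_path):
    """Run MAFFT and write the updated alignment (FASTA on stdout) to out_path."""
    cmd = mafft_add_command(new_seqs_fasta, existing_alignment_fasta)
    with open(out_path, "w") as out:
        # check=True raises CalledProcessError if MAFFT exits with an error.
        subprocess.run(cmd, stdout=out, check=True)
```

A call such as `run_mafft_add("new_genome.fasta", "cowpox_9.aln.fasta", "cowpox_10.aln.fasta")` (placeholder file names) would correspond to the one-sequence-into-nine case described above.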