Filovirus RefSeq Entries: Evaluation and Selection of Filovirus Type Variants, Type Sequences, and Names

Abstract: Sequence determination of complete or coding-complete genomes of viruses is becoming common practice for supporting the work of epidemiologists, ecologists, virologists, and taxonomists. Sequencing duration and costs are rapidly decreasing, sequencing hardware is under modification for use by non-experts, and software is constantly being improved to simplify sequence data management and analysis. Thus, analysis of virus disease outbreaks on the molecular level is now feasible, including characterization of the evolution of individual virus populations in single patients over time. The increasing accumulation of sequencing data creates a management problem for the curators of commonly used sequence databases and an entry retrieval problem for end users. Therefore, utilizing the data to their fullest potential will require setting nomenclature and annotation standards for virus isolates and associated genomic sequences. The National Center for Biotechnology Information's (NCBI's) RefSeq is a non-redundant, curated database for reference (or type) nucleotide sequence records that supplies source data to numerous other databases. Building on recently proposed templates for filovirus variant naming [<virus name> (<strain>)/<isolation host-suffix>/<country of sampling>/<year of sampling>/<genetic variant designation>-<isolate designation>], we report consensus decisions from a majority of past and currently active filovirus experts on the eight filovirus type variants and isolates to be represented in RefSeq, their final designations, and their associated sequences.

Introduction
The National Center for Biotechnology Information (NCBI) RefSeq project was initiated to create a nonredundant and curated set of genomic, transcript, and protein sequence records [1]. Genomic RefSeq records provide a reference nucleotide sequence wherein individual protein coding regions and other sequence features are annotated, using the best available experimental data as a guide. Akin to the labeling of reference specimens as type specimens in other taxonomic schemes, RefSeq reference sequences can be considered type sequences for type viruses.
In the case of virological RefSeq records, each viral species was initially represented by only one genome sequence record, and all other genome records for members of the same species, or for different strains, variants, and isolates of the same member of this species, were linked to this record as "genome neighbors" [2]. The rationale behind choosing a particular virus isolate sequence as the reference sequence is unclear in most cases and has almost never been published. Annotation of individual RefSeq entries was performed using PubMed-indexed experimental data through NCBI in-house and individual expert curation, since subspecialty-wide committees or expert groups had not been established.
The process of curating genome sequence data must now be fundamentally reformed, since the number of sequenced viral genomes has increased exponentially over the past decade [3]. Little to no experimental data are available for most new virus genomes, and annotation is often computationally transferred from related genomes or predicted de novo [4]. Moreover, the utility of reference genomes has expanded to include use in sequence assembly and pathogen detection pipelines [5][6][7][8][9]. With these changes, the data model has adapted, and multiple RefSeq records can now be maintained for several members of a particular virus species. This approach offers representation of the extant sequence diversity (or genotypes) within a particular species. The approach also provides a mechanism to maintain well-annotated records from experimentally important laboratory isolates and from less studied isolates from the wild.

Current Filovirus RefSeq Entries
The mononegaviral family Filoviridae includes three genera, Cuevavirus, Ebolavirus, and Marburgvirus. Eight distinct filoviruses are recognized as members of a total of seven species distributed among these three genera (Table 1) [10][11][12][13][14]. These eight viruses are differentiated from each other by biological characteristics [12] and genomic sequence divergence [11,12,15,16]. This divergence is determined based on sequences of well-characterized variants of root viruses (from here on called type variants of type viruses) for each taxon [12]. These sequences, therefore, become de facto type sequences. Using type sequences allows algorithmic representation of filovirus relationships, and newly isolated filoviruses can theoretically be automatically pre-assigned to existing or novel taxa (Figure 1). Temporary type filovirus variants were established by the 2010-2011 ICTV Filoviridae Study Group [12]. These temporary type variants were largely consistent with those chosen for RefSeq (Table 2) by the NCBI, which automatically chose the first sequence available for a new virus. These variants and sequences therefore needed to be re-evaluated by filovirus experts. To achieve uniformity and consistency, the current RefSeq entries have to be relabeled to conform to current ICTV taxonomy. In addition, type filovirus variant designations have to be chosen and the individual isolate names have to be adjusted to the filovirus strain/variant/isolate schemes that were recently established [19].
Figure 1. Viruses are classified in the family Filoviridae (order Mononegavirales) based on a list of biophysical criteria, genomic organization, the type of disease the viruses cause in primates, geographic distribution, and morphology of their virions (outlined in [11,12]). Once a novel virus isolate clearly belongs to this family, genomic sequence comparison can help classification into lower taxa. Novel isolates are classified by comparing the genomic sequence of the new isolate first to the genomic sequences of the type viruses of the ICTV-accepted genera, and then to those of the type viruses of the ICTV-accepted species. Using taxon-specific genomic sequence divergence cut-offs, the novel isolate can then be automatically classified into an existing taxon, unless it requires the establishment of a novel taxon through existing ICTV mechanisms.
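The two-step comparison described above (first against genus type sequences, then against species type sequences) can be sketched as follows. This is a minimal illustration, not the ICTV procedure: the divergence measure (fraction of mismatched positions in pre-aligned sequences of equal length), the function names, the toy sequences, and the cut-off values are all assumptions chosen for the example, since the text does not state the actual taxon-specific cut-offs.

```python
def divergence(a: str, b: str) -> float:
    """Fraction of differing positions between two pre-aligned, equal-length sequences."""
    if not a or len(a) != len(b):
        raise ValueError("sequences must be non-empty, aligned, and of equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def pre_assign(query: str, type_sequences: dict, cutoff: float):
    """Return the taxon with the closest type sequence, or None if none is below the cut-off."""
    taxon, div = min(
        ((name, divergence(query, seq)) for name, seq in type_sequences.items()),
        key=lambda pair: pair[1],
    )
    return taxon if div < cutoff else None


def classify(query, genus_types, species_types, genus_cutoff, species_cutoff):
    """Compare to genus type sequences first, then to species type sequences of that genus.

    A None entry in the result means no existing taxon fits at that rank,
    i.e., a novel taxon would have to be proposed.
    """
    genus = pre_assign(query, genus_types, genus_cutoff)
    if genus is None:
        return (None, None)
    species = pre_assign(query, species_types[genus], species_cutoff)
    return (genus, species)


# Toy data: hypothetical taxa and placeholder cut-offs, for illustration only.
GENUS_TYPES = {"GenusA": "AAAAAAAAAA", "GenusB": "CCCCCCCCCC"}
SPECIES_TYPES = {
    "GenusA": {"species1": "AAAAAAAAAA", "species2": "AAAAAGGGGG"},
    "GenusB": {"species3": "CCCCCCCCCC"},
}
GENUS_CUTOFF = 0.5    # placeholder, not an ICTV value
SPECIES_CUTOFF = 0.3  # placeholder, not an ICTV value

print(classify("AAAAAAAAAT", GENUS_TYPES, SPECIES_TYPES, GENUS_CUTOFF, SPECIES_CUTOFF))
```

With real data, the sequences would first have to be aligned and the divergence measure and cut-offs replaced by those accepted for each taxon.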

RefSeq Entry Reevaluation
The "gold standard" filovirus type RefSeq entry should be selected on the basis of experimental importance and accessibility and represent a repository of functional information about a particular filovirus. It is of crucial importance that any functional annotation of a RefSeq entry (e.g., functions of particular genome parts or of genome-encoded proteins) is linked to the actual sequence associated with these experiments. The RefSeq entry should contain the most characterized virus/variant/isolate/sequence, independent of whether this virus, variant, or isolate was the first one discovered or the most widely used experimentally. Importantly, decisions on RefSeq entries do not entail a mandate that future experiments should necessarily be performed with the viruses associated with these entries. However, direct comparisons with RefSeq-associated viruses are highly recommended to further increase the detail associated with the RefSeq entries. These entries should be updated and, if necessary, corrected on a continuous basis by a filovirus RefSeq subcommittee composed of filovirus experts, whose composition is currently under consideration.
The authors of this article confirmed or replaced the current taxonomic type virus variants and isolates and the current filovirus RefSeq entries based on the availability of scientific information characterizing a particular virus. If scientific information was scarce for all members of an entire taxon, other criteria such as availability, passaging history, or medical importance were used in decision making. Decisions were reached by consensus or simple majority voting, with the understanding that all authors will apply the final decisions reached by the entire group and enforce them in their functions as authors, peer-reviewers, and/or editors.

Cuevavirus RefSeq Entries
Only one cuevavirus, Lloviu virus (LLOV), has been described [18]. At the time of writing, LLOV had not been isolated in culture, and the sequence diversity of LLOV had only been defined in a single study using deep sequencing techniques on samples from deceased Schreibers's long-fingered bats (Miniopterus schreibersii) [18]. Only one additional study has been published on this virus, characterizing molecular-biological characteristics of the LLOV glycoprotein [20]. The coding-complete genome of one LLOV has been determined (GenBank #JF828358), which therefore automatically became the current RefSeq sequence (#NC_016144) (see [21] for sequencing nomenclature used in this article). In the absence of additional deposited LLOV sequences and characterization data, this RefSeq entry should therefore be upheld but be considered temporary until a complete genome, including all non-coding sequences, is determined.
In line with filovirus strain/variant/isolate definitions outlined previously [19], we propose the variant designation "Asturias" (after the Principality of Asturias in Spain, the location of Cueva del Lloviu, in which LLOV was discovered [18]) and the "isolate" name "Bat86" (instead of "MS-Liver-86/2003") for this virus: [Note here and below that the International Nucleotide Sequence Database Collaboration (INSDC) standard currently does not offer options other than "complete" or "partial," and, in particular, does not provide a possibility for the designation "coding-complete." Also note here and below that neither RefSeq nor GenBank currently can handle italics or extended Latin characters, which is why the species names are not italicized in the entry's definition line and <organism> fields and why letters with diacritics revert to their basic Latin letter counterpart].

Ebolavirus RefSeq Entries
The genus Ebolavirus includes five species, each of which is represented by one virus.

Bundibugyo Virus
Bundibugyo virus (BDBV) is the second least characterized ebolavirus. Although at least eight isolates of this virus are available [17,22], all experiments reported to date have been performed with one particular isolate, "811250" (often wrongly referred to as "200706291"). The complete sequence of this isolate is the one found in the current RefSeq entry (NC_014373). This isolate, obtained after two passages of clinical material in Vero E6 cells, came from a male patient who died in 2007 in Uganda [17]. We propose the variant designation "Butalya" (after Butalya Parish, Kikyo Subcounty in Uganda's Bundibugyo district where BDBV was discovered) and the isolate name "811250" for this virus:

Ebola Virus
Ebola virus (EBOV) is the most thoroughly characterized ebolavirus. Dozens of EBOV isolates are available, but the vast majority of published experiments have been performed with isolates "Mayinga" and "Kikwit" (reviewed in [23]). The "Mayinga" isolate, the first EBOV isolate obtained in 1976, has been used extensively for molecular-biological characterizations. The "Kikwit" variant, obtained during an Ebola virus disease outbreak in 1995, has been used almost exclusively for pathogenesis studies in nonhuman primates in the US (the "Mayinga" isolate is used almost everywhere else) [23]. All available EBOV cDNA clone systems are based on the "Mayinga" isolate (see [24]). The only available mouse- and guinea pig-adapted EBOV strains are derived from the "Mayinga" isolate (see [25]), and all available EBOV protein crystal structures are derived from the "Mayinga" isolate [26][27][28][29][30][31][32][33][34]. The "Mayinga" isolate was therefore chosen as the prototype EBOV for RefSeq (#NC_002549), which lists a complete genome obtained after 3-4 passages in Vero E6 cells. We uphold this decision and propose the variant designation "Yambuku" (after the village in which EBOV first emerged [35,36]) and retain the isolate designation "Mayinga" (the last name of a nurse who succumbed to infection [36]) for this virus:

Reston Virus
Reston virus (RESTV) has caused multiple epizootics among captive macaques (1989-1990, 1992, 1996) and domestic pigs in 2008 (reviewed in [37]). At least 10 isolates were obtained during these outbreaks, and eight complete or coding-complete genomic sequences have been deposited. However, the vast majority of RESTV experiments, in particular those regarding molecular characterization, have been performed with "Pennsylvania" (reviewed in [23]). "Pennsylvania" is the only RESTV variant for which there is a reverse genetics system [38]. In addition, "Pennsylvania" sequences served as the basis for the available RESTV protein crystal structures [39][40][41][42][43]. "Pennsylvania" (NC_004161) was chosen for the current RESTV RefSeq entry, which we propose to maintain. We propose the variant designation "Philippines89" (a reference to the time and place from which this virus was exported to the US in 1989) and the isolate name "Pennsylvania" for this virus: Accordingly, in RefSeq #NC_004161, the definition line "Reston ebolavirus, complete genome" was changed to "Reston ebolavirus isolate Reston virus M.fascicularis-tc/USA/1989/Philippines89-Pennsylvania, complete genome". The RefSeq <strain> field was cleared, and the RefSeq <isolate> field was filled with "Reston virus M.fascicularis-tc/USA/1989/Philippines89-Pennsylvania". The same changes should be applied to GenBank #AF522874.
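How the <isolate> field and the definition line are assembled from the individual designations can be sketched in a few lines of Python. The class and field names below are hypothetical and chosen only for illustration; the string layout mirrors the RESTV entry quoted above and leaves out the optional strain component of the naming template.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FilovirusIsolate:
    """Components of a filovirus designation (field names are illustrative)."""
    species: str   # species name as written in the definition line
    virus: str     # virus name, e.g. "Reston virus"
    host: str      # abbreviated isolation host, e.g. "M.fascicularis"
    suffix: str    # isolation host suffix, e.g. "tc"
    country: str   # country of sampling, e.g. "USA"
    year: int      # year of sampling
    variant: str   # genetic variant designation
    isolate: str   # isolate designation

    def isolate_field(self) -> str:
        """Text for the <isolate> field, following the pattern of the entry above."""
        return (
            f"{self.virus} {self.host}-{self.suffix}/{self.country}/"
            f"{self.year}/{self.variant}-{self.isolate}"
        )

    def definition_line(self, completeness: str = "complete genome") -> str:
        """Definition line: species name, the word 'isolate', the isolate field, completeness."""
        return f"{self.species} isolate {self.isolate_field()}, {completeness}"


# Values taken from the RESTV entry described in the text.
restv = FilovirusIsolate(
    species="Reston ebolavirus",
    virus="Reston virus",
    host="M.fascicularis",
    suffix="tc",
    country="USA",
    year=1989,
    variant="Philippines89",
    isolate="Pennsylvania",
)
print(restv.definition_line())
```

Keeping the components separate and deriving both strings from them avoids the <isolate> field and the definition line drifting apart when an entry is updated.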

Sudan Virus
Sudan virus (SUDV) is the second-best characterized ebolavirus. Approximately 15 SUDV isolates have been described, but very few experiments have been performed with any of these isolates. Early experiments focused on isolate "Boneface" (often misspelled "Boniface"). Recently, variant "Gulu" isolate "808892" has become a more popular choice, and data from experiments with this virus continue to accumulate (reviewed in [23]). Crystal structures for GP1,2 were determined for both viruses [33,42,44,45]. However, the passaging history of the "Boneface" isolate has not been thoroughly documented and includes passaging in guinea pigs and culturing in various cell types. The "Gulu-808892" isolate, on the other hand, is completely sequenced and is the current virus of choice for nonhuman primate experiments in the US. While the "Boneface" isolate was chosen by the 2010-2011 ICTV Filoviridae Study Group as the type SUDV [12], the "Gulu-808892" isolate was chosen as the prototype SUDV for RefSeq (#NC_006432). We propose to support the RefSeq decision and to change the SUDV type virus variant to "Gulu." As several "Gulu" isolates are available, we propose the variant designation "Gulu" for the virus variant that caused the disease outbreak that started in Gulu District, Uganda, in 2000, and the isolate designation "808892" for the RefSeq entry of this particular virus ("808892" was obtained after three Vero E6 cell passages of clinical material coming from an infected male who died): Accordingly, in RefSeq #NC_006432, the definition line "Sudan ebolavirus, complete genome" was changed to "Sudan ebolavirus isolate Sudan virus H.sapiens-tc/UGA/2000/Gulu-808892, complete genome". The RefSeq <strain> field was cleared, and the RefSeq <isolate> field was filled with "Sudan virus H.sapiens-tc/UGA/2000/Gulu-808892". The same changes should be applied to GenBank #AY729654.
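For end users retrieving entries, the <isolate> field can be split back into its components. The following is a sketch only: the regular expression and the function name are assumptions inferred from the two entries quoted in this article (RESTV and SUDV), not a published grammar, and designations that deviate from this layout (for example, ones that include a strain component) would need a more permissive pattern.

```python
import re

# Pattern inferred from the entries quoted in the text (an assumption):
# "<virus name> <host>-<suffix>/<country>/<year>/<variant>-<isolate>"
ISOLATE_PATTERN = re.compile(
    r"^(?P<virus>.+?) "
    r"(?P<host>[^\s/]+)-(?P<suffix>[^\s/-]+)/"
    r"(?P<country>[^/]+)/"
    r"(?P<year>\d{4})/"
    r"(?P<variant>[^/]+)-(?P<isolate>[^/-]+)$"
)


def parse_isolate_field(text: str) -> dict:
    """Split an <isolate> field into its named components; raise ValueError if it does not fit."""
    match = ISOLATE_PATTERN.match(text)
    if match is None:
        raise ValueError(f"not a recognized filovirus isolate designation: {text!r}")
    parts = match.groupdict()
    parts["year"] = int(parts["year"])  # year of sampling as a number
    return parts


sudv = parse_isolate_field("Sudan virus H.sapiens-tc/UGA/2000/Gulu-808892")
print(sudv["variant"], sudv["isolate"])
```

A designation that lacks the slash-separated components, such as the old definition line "Sudan ebolavirus, complete genome", is rejected with a ValueError rather than parsed partially.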

Taï Forest Virus
Taï Forest virus (TAFV) is the least characterized ebolavirus. Only one isolate ("807212" = "CI") was obtained from a female survivor [46] after seven passages in Vero E6 cells, and the coding-complete genome of this isolate is the only genomic TAFV sequence available [17]. Therefore, this sequence automatically became the current RefSeq sequence (#NC_014372). In the absence of additional deposited TAFV sequences and characterization data, this RefSeq entry should therefore be upheld but be considered temporary.

Marburgvirus RefSeq Entries
The genus Marburgvirus includes a single species, which is represented by two divergent viruses.

Marburg Virus
Marburg virus (MARV) is the most thoroughly characterized marburgvirus. Some 70 MARV isolates are available, but the majority of published experiments have been performed with isolate "Musoke" (reviewed in [23]). However, experiments not characterizing MARV but rather the disease it causes are increasingly performed with an "Angola" isolate in the US and continue to be performed with "Popp" or "Voege" isolates in Russia. The only available MARV cDNA clone systems are based on the "Musoke" isolate (see [24]) and on a nonhuman (bat) isolate [47]. The "Musoke" isolate has therefore been chosen as the prototype MARV for RefSeq (#NC_001608). We uphold this decision and propose the variant designation "Mt. Elgon" (after Mount Elgon, Kenya, where this variant is thought to have originated [48]) and the isolate designation "Musoke" (after a Nairobi doctor who got infected [49]

Ravn Virus
Ravn virus (RAVV) is a largely uncharacterized marburgvirus that belongs to the same species as MARV. At least three human ("Ravn" = "810040," "09DCR," "02Uga") and four Egyptian rousette isolates ("44Bat," "188Bat," "982Bat," "1304Bat") have been obtained. Virtually all RAVV characterization experiments have been performed with "Ravn" = "810040," which was obtained after at least two passages in SW-13 cells and four passages in Vero E6 cells. Since RAVV is a phylogenetically distinct marburgvirus, we created a RefSeq entry for the "Ravn" isolate, for which we propose the variant designation "Kitum Cave" (after Kenya's Kitum Cave on Mount Elgon where RAVV first emerged) and the isolate designation "810040":
A summary of the proposed designations and RefSeq accession numbers can be found in Table 3.