1. Introduction
Over the last decade, tremendous progress has been made in the field of DNA sequencing [1]. With the introduction of modern next-generation sequencing (NGS) platforms (e.g., Roche 454, Illumina, Ion Torrent and PacBio), the amount of generated sequences has dramatically increased while the costs have rapidly decreased [1,2]. Especially in the field of emerging infectious diseases, whole-genome sequencing using NGS has become an essential and widespread tool in understanding molecular epidemiology and pathogen evolution [3,4].
One emerging animal pathogen for which the need for whole-genome sequences has massively increased in recent years is African swine fever virus (ASFV) [5].
ASFV was first described and is endemic in sub-Saharan Africa [6], where it can be detected in wild African suids (e.g., forest hog, red river hog and bushpigs) and is transmitted in a sylvatic cycle between warthogs and Ornithodoros soft ticks [7]. Since only distantly related giant viruses with double-stranded DNA genomes are known so far, ASFV is the only member of its genus Asfivirus and family Asfarviridae [5]. ASFV causes a disease in domestic pigs and European wild boar called African swine fever (ASF), which is characterised by high fever, haemorrhages, respiratory, gastro-intestinal and neurological signs and cyanosis [8,9]. The animals show a lethality of up to 100% depending on the strain-specific virulence and usually succumb within 7 to 10 days after infection [9]. Neither vaccines nor treatments are available [10]; thus, stopping ASFV spread relies solely on strict quarantine measures including the culling of infected herds, establishment of restriction zones and movement control, leading to enormous socio-economic consequences in affected countries [11,12].
From Africa, ASFV was introduced into Europe (Portugal) twice and subsequently spread to other European countries, e.g., Spain, France, Italy, Belgium and the Netherlands [13,14]. After the virus was eradicated from most affected countries (with the exception of Sardinia, where ASFV is still endemic) [11], ASFV was introduced into Georgia in 2007 [15]. Since then, the virus has spread through wild boar and domestic pig populations of Eastern Europe (Georgia, Armenia, Russia, Belarus, Moldova, Ukraine) [13,16,17,18] and the European Union (Bulgaria, Hungary, Czech Republic, Slovakia, Romania, Estonia, Latvia, Lithuania and Poland) [18]. More recently, the virus spread to Western Europe (Belgium) [19] and reached Asia, causing devastating outbreaks (China, Cambodia, Mongolia, Vietnam, North Korea, Laos and Myanmar) [13,20].
The sequencing of partial genomes by Sanger sequencing has been used for decades to identify viral genotypes and trace virus introductions through molecular epidemiology, especially in Africa where ASFV is widely distributed and different genotypes occur [21,22,23,24,25,26,27]. The viral genome consists of one linear molecule of double-stranded DNA with a length of 170–194 kbp [28]. It shows a low mutation rate typical for DNA viruses that leads to a very low genetic variability, especially in close geographic regions [23,29,30]. However, the identification of attenuated viruses with large deletions, inversions or duplications as well as novel genotypes in Africa might be explained by mechanisms leading to genetic reorganisations such as homologous recombination, as was suggested and described for other large dsDNA viruses [31]. Therefore, the analysis of partial sequences quickly becomes ineffective for detailed analyses of phylogeny, molecular epidemiology, or virus evolution, as observed especially in Eastern Europe [19].
The first ASFV whole-genome sequence was published in 1995 and generated from the Spanish cell-culture-adapted ASFV strain BA71V using Sanger sequencing [32,33]. Since then, only a few additional genomes have been published using the same technique [34,35] (Table 1). In 2009, the first ASFV genome sequence was published using the Roche 454 GS FLX NGS technique in combination with Sanger sequencing [36]. Subsequently, different platforms have been used to generate ASFV whole-genome sequences including Illumina HiScanSQ, MiSeq, HiSeq, NextSeq500 and PacBio [37,38,39,40,41,42,43,44,45,46] (Table 1).
Since the ASFV genome includes extensive homopolymer and repeat regions including inverted terminal repeats (ITR) of variable length [28], platform-specific limitations especially in homopolymer sequencing need to be considered [47,48]. In particular, the short-read data generated by second-generation sequencing platforms (e.g., Roche 454, Illumina and Ion Torrent) need to be handled with care. When such reads are mapped against a reference sequence, genomic reorganisations such as inversions or duplications might be missed, and the quality of the consensus sequence strongly depends on the sequence used as reference. Furthermore, when used for assembly, the short reads (200–300 bp) might have only small overlapping regions and can therefore be misassembled; especially in homopolymer and repeat regions, the quality of the consensus sequence might be low.
Therefore, data on sequence quality such as coverage, mappings and alignments are crucial to assess the results of whole-genome sequencing and the analysis of genetic variations including single nucleotide variants (SNV) analyses and molecular epidemiology experiments [3].
In particular, reference sequences frequently used as a basis for experiments and analyses [45,49,50] need careful validation. Especially for ASFV, where only little data on gene expression and protein translation are available and most of the 150 to 167 ORFs have only been predicted [28,37], reliable whole-genome sequences are the basis needed for transcriptome and proteome analyses and up-to-date annotations.
Here, we describe a sequencing protocol combining ASFV-specific target enrichment, Illumina sequencing and long-read Nanopore sequencing to generate high-quality complete ASFV genome sequences. We used the described protocol for resequencing of the ASFV Georgia 2007/1 genome, leading to the correction of 71 homopolymer errors and other sequencing artefacts, a longer ITR region and ultimately updated annotations [37] of the corrected genome sequence. Using this high-quality sequence as a reference, we combined target enrichment with Illumina sequencing to sequence ASFV directly from organ tissue, thereby providing an ASFV whole-genome sequence from an outbreak in Moldova.
2. Materials and Methods
2.1. Virus Cultivation
The original ASFV Georgia 2007/1 isolate from Georgia reported in 2011 [37] was passaged once on porcine bone marrow cells (PBMs) prior to sequencing.
To generate high molecular weight (HMW)-DNA to facilitate long reads through Nanopore sequencing, ASFV Georgia 2007/1 was passaged a second time on porcine primary peripheral blood monocytes (PBMCs) as described elsewhere [54].
2.2. DNA Extraction
For Illumina sequencing, DNA was extracted from PBMC cell culture supernatant using the NucleoMag® VET Kit (Macherey Nagel, Düren, Germany) according to the manufacturer’s instructions.
The Moldovan organ sample (spleen), collected from infected domestic pigs, was homogenised in a 2 mL reaction tube containing one 5 mm steel bead and 200 µL phosphate buffered saline (PBS) in a TissueLyser II (Qiagen, Hilden, Germany) for 3 min at 30 Hz. After centrifugation at 10,000× g for 3 min, DNA was extracted from the supernatant as described above.
For Nanopore sequencing, HMW-DNA was extracted using a customised protocol. PBMCs [54] were infected with ASFV Georgia 2007/1 at an MOI of 0.1 and harvested after 24 h. After one freeze-thaw cycle of infected PBMCs, cell debris was removed by centrifugation at 1000× g for 10 min at 4 °C and the virus particles were concentrated by ultracentrifugation through a sucrose cushion. Briefly, 6 mL 40% sucrose (in PBS) were carefully overlaid with the virus-containing supernatant (30 mL) and centrifuged for one hour at 50,000× g at 4 °C. The resulting pellet was resuspended in 100 µL PBS and heat-inactivated for 30 min at 75 °C. Subsequently, the virus suspension was mixed with 3 mL TEN buffer (40 mM Tris-HCl, 1 mM EDTA pH 8.0, 150 mM NaCl) and 1.5 mL sarcosyl buffer (3% sodium lauroyl sarcosinate, 75 mM Tris-HCl pH 8.0, 25 mM EDTA) and incubated at 65 °C for 15 min. The DNA was extracted using equal volumes of phenol (Roth) in a first step, phenol/chloroform/isoamyl alcohol (Roth) in a second step and chloroform (Roth)/isoamyl alcohol (Merck) (24:1) in a final step. Precipitation of the DNA was carried out using 0.1 volumes of sodium acetate buffer (3 M, pH 4.8) and 2.5 volumes of ethanol at −20 °C for 16 h. The precipitated DNA was pelleted, washed with 70% ethanol and air-dried before final resuspension in 70 µL EB buffer (Qiagen).
2.3. Illumina Sequencing
Samples were prepared for and analysed on an Illumina MiSeq platform in the 300 bp paired-end mode, as previously described [55].
2.4. Nanopore Sequencing
Library preparation was carried out using the Rapid Barcoding Sequencing Kit (SQK-RBK004, Oxford Nanopore Technologies, Oxford, UK) according to the manufacturer’s instructions. The finished library was loaded onto a R9.5.1 MinION flow cell (FLO-MIN107, Oxford Nanopore Technologies). Sequencing was conducted under the standard settings using a MinION (Mk1B, Oxford Nanopore Technologies) in combination with a MinIT (MNT-001, Oxford Nanopore Technologies). Signal information (fast5) was basecalled with MinIT and Guppy v2.1.3 (Oxford Nanopore Technologies).
2.5. Target Enrichment
For the enrichment of ASFV-specific sequences prior to Illumina MiSeq sequencing, the myBaits® kit with 1–20K unique baits (Arbor Bioscience, Ann Arbor, MI, USA) was used according to the manufacturer’s instructions (Figure 1). Briefly, RNA baits (11,958 unique baits) were designed in silico by Arbor Bioscience to cover the ASFV Georgia 2007/1 sequence (FR682468.1) and the ASFV Ken06.bus sequence (KM111295) with a three-fold coverage. After DNA extraction and standard library preparation [55], 100–500 ng of indexed libraries were heat-denatured at 95 °C and, after cooling to hybridisation temperature (65 °C), adapter sequences were blocked using Illumina-specific blocking oligos supplied by the kit (Arbor Bioscience). Denatured and blocked libraries were combined with the pre-heated hybridisation mix containing the ASFV-specific biotinylated RNA-baits and incubated at the pre-set hybridisation temperature (65 °C) for 16 h for bait to target hybridisation. Subsequently, the hybridised bait–target mix was mixed with MyOne™ Streptavidin C1 beads (Thermo Fisher Scientific, Waltham, MA, USA). After immobilisation on a magnetic rack and washing using pre-heated washing buffer (65 °C), target sequences were eluted from the beads by adding 10 mM Tris-HCl, 0.05% TWEEN®-20 solution (pH 8.0–8.5) and incubating for 5 min at 95 °C. Enriched libraries were amplified using the AccuPrime™ Taq DNA Polymerase System (Thermo Fisher Scientific) according to the manufacturer’s instructions for 14 cycles prior to Illumina MiSeq sequencing [55]. Sequencing was performed using a MiSeq v3 600 cycle kit in 2 × 300 bp mode.
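The in silico tiling of baits at a given fold coverage can be illustrated with a short Python sketch. It is not the design procedure used by the manufacturer; the bait length of 80 nt, the function names and the even spacing of baits are assumptions made for illustration only.

```python
from typing import List


def tile_bait_starts(target_len: int, bait_len: int = 80, fold: int = 3) -> List[int]:
    """Return 0-based start positions of baits tiled along a target.

    Baits are spaced bait_len // fold apart, so that most positions are
    covered by about `fold` baits. bait_len = 80 is an assumed value.
    """
    if bait_len <= 0 or fold <= 0:
        raise ValueError("bait_len and fold must be positive")
    if target_len < bait_len:
        return []  # target too short to hold a single bait
    step = max(1, bait_len // fold)
    last_start = target_len - bait_len
    starts = list(range(0, last_start + 1, step))
    # Add a final bait so that the 3' end of the target is covered as well.
    if starts[-1] != last_start:
        starts.append(last_start)
    return starts


def mean_bait_coverage(target_len: int, bait_len: int, starts: List[int]) -> float:
    """Average number of baits covering a target position."""
    return len(starts) * bait_len / target_len
```

With the assumed 80 nt baits and three-fold tiling, the step is 26 nt, so neighbouring baits overlap by 54 nt. Removing redundant baits, as implied by the unique bait set described above, would be a separate step that is not shown here.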
2.6. Data Analysis
Sequence data from the Illumina libraries lib02645 and lib02679 were mapped against the ASFV Georgia 2007/1 genome sequence (FR682468.1) using the Newbler 3.0 Software (Roche, Basel, Switzerland), and mapped reads were assembled using SPAdes 3.11.1 [53] in the mode of read error correction prior to assembly.
Basecalled fastq data from Nanopore sequencing were assembled in a hybrid assembly with the Illumina reads mapped to FR682468.1 (see above) using SPAdes v3.10.0. Contigs were used as a further reference for an iterative mapping approach using the generic mapper in Geneious v11.1.5 (Biomatters, Auckland, New Zealand). Regions with a coverage more than 2.5 standard deviations below the mean were excluded from consensus generation and further analysis. Nanopore fastq reads were used in a mapping approach in parallel using KMA [56], applying pre-sets optimised for Nanopore reads sensitive for indels. Coding sequences (CDS) were predicted with Glimmer3 [52] using FR682468.1 (ASFV Georgia 2007/1) for model generation. CDS were manually annotated and used to update INSDC entry FR682468.1, now available under FR682468.2.
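The coverage criterion can be sketched in a few lines of Python. This is not the Geneious implementation; the sketch assumes that the criterion refers to a per-position depth below the mean minus 2.5 standard deviations, that the population standard deviation is used, and that the depth values are supplied as a plain list. All function names are hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def low_coverage_positions(depth: Sequence[float], n_sd: float = 2.5) -> List[int]:
    """Return 0-based positions whose depth is below mean - n_sd * SD."""
    mu = mean(depth)
    sd = pstdev(depth)  # population SD; using the sample SD is equally plausible
    cutoff = mu - n_sd * sd
    return [i for i, d in enumerate(depth) if d < cutoff]


def to_intervals(positions: Sequence[int]) -> List[Tuple[int, int]]:
    """Merge sorted positions into inclusive (start, end) intervals."""
    intervals: List[Tuple[int, int]] = []
    for p in positions:
        if intervals and p == intervals[-1][1] + 1:
            # Position directly extends the previous interval.
            intervals[-1] = (intervals[-1][0], p)
        else:
            intervals.append((p, p))
    return intervals
```

The resulting intervals could then be masked before a consensus is called. With a perfectly uniform depth, no position falls below the cut-off and nothing is excluded.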
Single nucleotide variants (SNVs) were determined using the generic SNP finder of the Geneious software suite, applying a maximum p-value of 10⁻⁵ and a filter for strand bias. The threshold for SNV identification was set at 10% and a minimum coverage of 100 was defined. Variants were checked manually for accuracy. Due to the low read quality at the genome ends, variants within the ITR regions were not included in the analysis. Read duplicates from all data used for SNV analysis and coverage estimation were removed using Samtools v. 1.9 [57] prior to analysis.
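The filter criteria listed above can be expressed as a small Python sketch. It does not reproduce the Geneious SNP finder; it assumes that the 10% threshold refers to the variant frequency at a site and that the p-value and a strand-bias flag have already been computed by the variant caller. All class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    position: int          # 1-based genome position
    coverage: int          # total reads covering the site
    variant_reads: int     # reads supporting the variant
    p_value: float         # p-value reported by the variant caller
    strand_biased: bool = False  # assumed to be flagged upstream

    @property
    def frequency(self) -> float:
        """Fraction of reads supporting the variant."""
        return self.variant_reads / self.coverage if self.coverage else 0.0


def passes_filters(v: Variant,
                   min_freq: float = 0.10,
                   min_cov: int = 100,
                   max_p: float = 1e-5) -> bool:
    """Apply the frequency, coverage, p-value and strand-bias criteria."""
    return (v.coverage >= min_cov
            and v.frequency >= min_freq
            and v.p_value <= max_p
            and not v.strand_biased)
```

A site covered by 250 reads with 30 variant reads (12%) and a p-value of 10⁻⁸ would pass under these assumptions, whereas the same variant at a coverage of 80 would be rejected. Exclusion of the ITR regions and the manual check are not modelled.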
2.7. PCR and Sanger Sequencing
For conventional PCR, Phire Green Hot Start II PCR Master Mix (Invitrogen, Carlsbad, CA, USA) and the primers 457F CGGGCCAGACAAAATTGACC and 1223R AACAGGAAATACAAGGCGGC were used in a peqSTAR Thermocycler (VWR, Darmstadt, Germany) with the following cycling conditions: 30 s at 98 °C, 35 cycles each of 10 s at 98 °C, 30 s at 64 °C, 2 min at 72 °C and final elongation 10 min at 72 °C. Corresponding bands (751 bp) were purified from a 1.5% agarose gel with the QIAquick Gel Extraction Kit (Qiagen) and used for sequencing reactions (BigDye Terminator v3.1 Cycle Sequencing Kit, Applied Biosystems, Darmstadt, Germany). The reaction products were purified using NucleoSEQ columns (Macherey–Nagel) and sequenced on an ABI PRISM® 3100 Genetic Analyzer (Life Technologies, Darmstadt, Germany). Obtained sequences were assembled using the Geneious software, v11.1.5.
2.8. Data Availability
Whole-genome sequences are available from the International Nucleotide Sequence Database Collaboration (INSDC) databases under the study accession number PRJEB33279, and under the GenBank accession number FR682468.2.
4. Discussion
In this study, we used a combined workflow including target enrichment, Illumina and Nanopore sequencing (Figure 4) for the resequencing of the ASFV Georgia 2007/1 isolate representing the first ASFV introduced into Eastern Europe in 2007.
Since the ASFV strains circulating in Eastern Europe and Asia are highly similar (Figure 3), molecular epidemiology analyses will rely on small changes and even single nucleotide variations. Therefore, the availability of high-quality sequences and the sharing of raw data as well as data on alignments and mappings is highly desirable to evaluate the quality of a sequence prior to analysis and search for a distinct pattern that could help to draw spatial and temporal conclusions.
With the use of ASFV target-specific enrichment prior to sequencing, we were able to increase the amount of target-specific reads allowing for whole-genome assembly while reducing the total number of reads considerably. As shown by our sequencing results, target-specific enrichment is especially useful when sequencing samples with a low virus to host ratio, e.g., organ tissue (Figure 4), while samples with higher virus to host ratio, e.g., cell culture supernatant, do not necessarily call for enrichment (Figure 4).
However, since target-specific enrichment is performed using a bait set designed on known sequences, the enrichment and therefore the sequencing is limited to these. Although a hybridisation of baits to target sequences showing up to 90% identity can be expected [58], regions with extensive genetic variations or longer insertions might be missed. Therefore, bait sets should be designed carefully for the corresponding scientific question; e.g., specific sets for the whole-genome assembly of known viral genomes or sets including different genome variants for a broader enrichment. Furthermore, hybridisation parameters (e.g., temperature) can be optimised to include sequence variants [60]. In addition, the bias introduced by the amplification needed to generate the size of the library required for sequencing must be considered, especially when working on samples with a very low amount of target-specific sequences, and corresponding datasets need to be de-duplicated prior to analysis.
Since the ASFV genome includes extensive repeat regions, ITRs of unknown length and homopolymer regions with up to 17 C and 16 G, even data from target enrichment approaches have proven difficult for the de novo assembly of a high-quality reference sequence (Figure 2).
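Homopolymer stretches of this kind can be located in an assembled sequence with a simple scan, sketched here in Python. The function name, the default minimum run length of 8 nt and the restriction to the four unambiguous bases are assumptions for illustration and not part of the workflow described in this study.

```python
import re
from typing import List, Tuple


def homopolymer_runs(seq: str, min_len: int = 8) -> List[Tuple[int, str, int]]:
    """Return (0-based start, base, length) for each homopolymer run.

    Only runs of A, C, G or T with at least `min_len` identical bases
    are reported; ambiguous bases such as N are ignored.
    """
    if min_len < 1:
        raise ValueError("min_len must be at least 1")
    # One base followed by at least (min_len - 1) copies of the same base.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_len - 1))
    return [(m.start(), m.group(1), m.end() - m.start())
            for m in pattern.finditer(seq.upper())]
```

Listing the runs of a draft consensus in this way indicates which positions deserve a closer look in the read alignments, since the length of such stretches is where platform-specific errors are most likely.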
Through the employment of single molecule Nanopore sequencing (Figure 4), we were able to generate long viral reads of up to 43.5 kb. However, the low accuracy of the Nanopore reads (see above) hampers precise consensus sequence generation. While Nanopore reads reach an accuracy of around 90%, Illumina data sets usually have accuracies around 99.9%. Hence, the higher accuracy of the Illumina data sets can be used for better precision, whereas the Nanopore reads can be taken for correct assembly. Therefore, using these in a hybrid assembly, we were able to further improve the sequence quality, especially in the ITR regions. Although we were able to enlarge the ITR regions considerably, because of the decreasing sequence coverage at the ends, we cannot exclude the possibility that they are even longer. Due to their proposed involvement in ASFV replication involving head to head concatemers [61] (similar to poxviruses [62]), a complete sequence of the ITR regions including the terminal hairpin loops [63] would be especially interesting.
In our opinion, the presented workflows offer the best possible quality of ASFV sequences using the technology available to date. The sequences presented here are therefore of a reference nature and substantially improve the analysis of additional ASFV sequences. Nonetheless, even with the combination of Nanopore and Illumina reads in a hybrid assembly, the sequencing of the poly G/C homopolymer regions has proven to be difficult, and we can completely exclude neither the existence of viral variants nor sequencing or bioinformatics artefacts. Therefore, it remains speculative whether these homopolymer stretches might be involved in differential gene expression (maybe in the arthropod vector) or other unknown mechanisms, as might be suggested by the observed fusion or split of ORFs depending on the length of the homopolymer regions.
For all sequencing platforms, their specific sequencing limitations always need to be considered. For the 454 technology used for the generation of the original ASFV Georgia 2007/1 sequence (FR682468.1), limitations especially in sequencing homopolymer regions are well known [64]. Therefore, it is not surprising that we discovered numerous differences between the original and the improved sequence in A/T homopolymer regions, as also identified previously [41]. Since even the most recently developed sequencing technologies have their limitations, as shown by the evaluation of the Ion Torrent and Illumina systems especially for homopolymer sequencing [64,65,66], selecting the adequate sequencing platform is crucial for each scientific question and should be done carefully. Furthermore, sequence information from the database should be evaluated thoroughly using the available information on the employed sequencing platform and data analysis prior to their use in experimental designs or bioinformatic analyses.
Using the ASFV-specific target enrichment prior to sequencing and the now available improved ASFV Georgia 2007/1 sequence as a reference, we could easily provide a reliable whole-genome sequence for ASFV Moldova 2017/1 from organ material without prior need for virus cultivation (Figure 4B). The overall nucleotide sequence identity to the circulating ASFV strains from Eastern Europe and Asia is more than 99.9%; only a few single-nucleotide differences as well as a previously described tandem repeat insertion were observed. However, the observed nucleotide differences, together with the available whole-genome sequence information, unfortunately do not allow for any further conclusions regarding phylogenetic or geographic relationships.
Although some of the single nucleotide changes in ASFV Georgia 2007/1 and ASFV Moldova 2017/1 reported in this study affect annotated ORFs, due to the lack of expression data for most of the ASFV ORFs and the lack of observations of virus attenuation from the field, no conclusion regarding their influence on virus replication or pathogenicity can be given. Therefore, transcriptome and proteome studies and infection experiments are needed to validate and update annotations, evaluate the influence of these changes on the viral phenotype and identify essential viral genes that might serve as targets for vaccine development (e.g., epitopes for T- or B-cells).
However, analysing SNVs in the viral genomes, we detected a clear difference in the number of SNVs. While we identified 49 SNVs in ASFV Georgia 2007/1, only 12 were detected in ASFV Moldova 2017/1. Since ASFV Georgia 2007/1 was passaged and sequenced from cell culture while ASFV Moldova 2017/1 was sequenced directly from organ material, it might be hypothesised that SNVs accumulate in cell culture-raised ASFV, as was suggested for other viruses [67,68,69]. However, this observation might be influenced by the differences in coverage at the SNV sites and by the low sample size. Although ASFV Moldova 2017/1 has a higher coverage, which should lead to a higher number of detected SNVs, further experiments with cell culture and field samples will be essential to confirm this hypothesis. In addition, more data on SNVs in ASFV populations might be used in fine-tuning the intra-genotype phylogeny [70].
In conclusion, we used a workflow including target enrichment, Illumina and Nanopore sequencing to provide an improved genomic sequence of ASFV Georgia 2007/1 (FR682468.2). Using this sequence as a reference, we subsequently generated the first whole-genome sequence from ASFV Moldova 2017/1 and showed that it is highly similar to the known circulating strains, with only a few single-nucleotide differences, of which only three seem to be specific for ASFV Moldova 2017/1.
We believe that our deep-sequencing workflow for ASFV and the provided reference sequences will be highly valuable for generating further high-quality ASFV whole-genome sequences and can serve as a basis for variant, transcriptome and proteome analyses. We are convinced that these sequences, along with coordinated efforts to improve data sharing and harmonised protocols, might pave the way for the identification of new genetic markers that are urgently needed to understand virus evolution and trace routes of ASFV, and ultimately to reduce the burden of this devastating animal disease.