Genome-Wide Diversity Analysis of African Swine Fever Virus Based on a Curated Dataset

Simple Summary African swine fever (ASF) is one of the most important animal diseases affecting the domestic swine population globally. Whole-genome sequence analysis on the circulating African swine fever virus (ASFV) strains would provide valuable information in tracking the outbreaks of the disease. The aim of this study was to prepare a curated dataset of ASFV genome sequences and investigate genome-wide diversity of circulating ASFV strains. We prepared a curated dataset containing 123 high-quality ASFV genome sequences representing 10 genotypes collected from 28 countries between 1949 and 2020. Phylogenetic analysis based on whole-genome sequences provided high-resolution topology in genotyping ASFV isolates, which was supported by pairwise genome sequence similarity comparison. Wide distribution and high variation of tandem repeat sequences were found in ASFV genomes. Structural variation and highly variable poly G or poly C tracts were also identified. This study improved our understanding on the patterns of genetic variation of ASFV and facilitated future studies on ASFV molecular epidemiology. Abstract African swine fever (ASF) is a lethal contagious viral disease of domestic pigs and wild boars caused by the African swine fever virus (ASFV). The pandemic spread of ASF has had serious effects on the global pig industry. Virus genome sequencing and comparison play an important role in tracking the outbreaks of the disease and tracing the transmission of the virus. Although more than 140 ASFV genome sequences have been deposited in the public databases, the genome-wide diversity of ASFV remains unclear. Here we prepared a curated dataset of ASFV genome sequences by filtering genomes with sequencing errors as well as duplicated genomes. A total of 123 ASFV genome sequences were included in the dataset, representing 10 genotypes collected between 1949 and 2020. Phylogenetic analysis based on whole-genome sequences provided high-resolution topology in differentiating closely related ASFV isolates, and drew new clues in the classification of some ASFV isolates. Genome-wide diversity of ASFV genomes was explored by pairwise sequence similarity comparison and ORF distribution comparison. Tandem repeat sequences were found widely distributed and highly varied in ASFV genomes. Structural variation and highly variable poly G or poly C tracts also contributed to the genome diversity. This study expanded our knowledge on the patterns of genetic diversity and evolution of ASFV, and provided valuable information for diagnosis improvement and vaccine development.


Maximum Likelihood Phylogenetic Analysis
Maximum likelihood (ML) phylogenetic trees were estimated by RAxML (version v8.2.12) [28] using the GTR-GAMMA nucleotide substitution model. ML bootstrapping was performed with 1000 replicates in order to assess the robustness of tree topologies. The final tree was midpoint rooted by FigTree v1.4.2.

Sequence Diversity Analysis
Pairwise sequence distance was inferred by the maximum-likelihood method from the nucleotide alignment of genome sequences using MEGA version 10.2.6 software, assuming a TN93 model of base substitution (equal substitution rates among sites and between transitional and transversional substitutions) [29]. A heatmap was drawn on the genome sequence identity using pheatmap packages (version 1.0.12).

ORF Analysis
ORF in each genome sequence was identified using GATU, setting the minimal threshold for the length of ORF as 30 codons [30]. Using Georgia 2007/1 as the comparator, the sequence identity of each putative ORF of each strain was compared to its Georgia 2007/1 ortholog. A custom script was used to show the distribution of ORF along the genome.

Tandem Repeat Sequences Analysis
Tandem repeat sequences (TRS) across the genome were identified using the TRF program (version 4.09) [31]. The match, mismatch, and delta parameters were selected as 2, 7, and 7, and the minimum score was set to 50.
All of the genome sequences were aligned by MAFFT software. Mutations in each genome sequence were checked manually. Artificial modifications of the codons in the open reading frame in the reverse strand of the genome were found in 6 genomes collected from South Africa (MN394630.3, MN641876.2, MN641877.2, and MN336500.3), Zambia (MN318203. 3), and Zaire (MN630494.2), as partly shown in Figure S1. A genome from Armenia (LR881473.1) had extremely low sequence homology to other genotype II genomes and was found to be a contamination of a genotype I strain [32]. Low genome coverage or sequence similarity was found in 4 ASFV genotype II genomes collected from Viet Nam (MW465755.1, MT180393.1, MT166692.1) and China (MW361944.1) (Table S1). Annexation bases or expand unknown bases were found in 2 sequences collected in Georgia (MH910495.1 and MH910496.1) and 6 genomes collected in Italy (MW788405.1, MW788407.1-MW788411.1). Unconfirmed large-fragment deletion was found in a sequence collected in China (MH766894.2). The origin of sequences MZ945537 and MZ945536 was unclear. Sequence MK333181.1 was completely identical to sequence MK333180.1. Sequence MN393477.1 was completely identical to sequence MN393476.1. In order to obtain a precise overview of the genome diversity of ASFV strains, these 24 genomes were not included in the dataset (Table S2).
MK333180.1. Sequence MN393477.1 was completely identical to sequence MN393476.1. In order to obtain a precise overview of the genome diversity of ASFV strains, these 24 genomes were not included in the dataset (Table S2).

Phylogenetic Reconstruction
The maximum likelihood phylogenetic dendrogram of ASFV was constructed from the alignment of genome sequences. Two primary clades were found with 100% bootstrap support ( Figure 2). Primary clade I included two subclades, comprising previously identified genotype X and genotype IX isolates, respectively, which were collected from Kenya, Congo, and Uganda. Primary clade II included three subclades and one singleton (Lil-20 Malawi 1983). One subclade comprised distinctive isolates of genotype V, III, IV, and XX collected from Malawi, Zambia, Namibia, and South Africa. The second subclade was composed of all of the genotype II strains collected in Africa, Europe, and Asia. The third subclade was comprised of a cluster of Mkuzi 1979 and Liv13/33 (OmLF2), a singleton of K49, and a cluster of all of the other genotype I isolates. It is of note that Liv13/33 (OmLF2), a strain previously classified into genotype I by p72 genotyping, was found to

Phylogenetic Reconstruction
The maximum likelihood phylogenetic dendrogram of ASFV was constructed from the alignment of genome sequences. Two primary clades were found with 100% bootstrap support ( Figure 2). Primary clade I included two subclades, comprising previously identified genotype X and genotype IX isolates, respectively, which were collected from Kenya, Congo, and Uganda. Primary clade II included three subclades and one singleton (Lil-20 Malawi 1983). One subclade comprised distinctive isolates of genotype V, III, IV, and XX collected from Malawi, Zambia, Namibia, and South Africa. The second subclade was composed of all of the genotype II strains collected in Africa, Europe, and Asia. The third subclade was comprised of a cluster of Mkuzi 1979 and Liv13/33 (OmLF2), a singleton of K49, and a cluster of all of the other genotype I isolates. It is of note that Liv13/33 (OmLF2), a strain previously classified into genotype I by p72 genotyping, was found to be closely related to Mkuzi 1979, a genotype VII strain. K49, which was previously identified as a genotype I isolate, was found remotely related to all of the other genotype I strains. It has been revealed that phylogenetic analysis based on high-quality whole-genome sequences could provide high-resolution genotyping for ASFV.

Genome-Wide Sequence Similarity
Pairwise sequence similarity was calculated to investigate the genome-wide diversity of ASFV. The overall whole-genome sequence similarity varied between 75.4% and 99.9% ( Figure 3). Inter-genotype similarity ranged between 75.4% and 94.7%. The lowest intergenotype pairwise similarity was found between genotype I and genotype X (Kenya 1950), between 75.4-83.4%. The highest inter-genotype pairwise similarity reached 82.4-94.7%, between genotype I and genotype II. Within genotype II, the sequence divergence ranged between 92.4% and 99.9%. Within genotype I, the pairwise sequence similarity diverged from 85.4% to 99.9%. It is of note that a drop in sequence similarity (85.4-89.9%) was found between NH/P68-like strains and other genotype I strains, which was caused by large fragment sequence deletion in these strains. Mkuzi 1979, Liv13/33 (OmLF2), and K49 shared 85.4-88.1% with the NH/P68-like strains and 89.7-95.4% sequence similarity with other genotype I strains. All of the other genotype I strains shared a high similarity, between 94.8% and 99.9%. It is suggested that the relationship between Mkuzi 1979, Liv13/33 (OmLF2), and K49 and genotype I strains should be reconsidered from the perspective of whole-genome sequence similarity.

Genome-Wide Sequence Similarity
Pairwise sequence similarity was calculated to investigate the genome-wide diversity of ASFV. The overall whole-genome sequence similarity varied between 75.4% and 99.9% ( Figure 3). Inter-genotype similarity ranged between 75.4% and 94.7%. The lowest intergenotype pairwise similarity was found between genotype I and genotype X (Kenya 1950), between 75.4-83.4%. The highest inter-genotype pairwise similarity reached 82.4-94.7%, between genotype I and genotype II. Within genotype II, the sequence divergence ranged between 92.4% and 99.9%. Within genotype I, the pairwise sequence similarity diverged from 85.4% to 99.9%. It is of note that a drop in sequence similarity (85.4-89.9%) was found between NH/P68-like strains and other genotype I strains, which was caused by large fragment sequence deletion in these strains. Mkuzi 1979, Liv13/33 (OmLF2), and K49 shared 85.4-88.1% with the NH/P68-like strains and 89.7-95.4% sequence similarity with other genotype I strains. All of the other genotype I strains shared a high similarity, between 94.8% and 99.9%. It is suggested that the relationship between Mkuzi 1979, Liv13/33 (OmLF2), and K49 and genotype I strains should be reconsidered from the perspective of whole-genome sequence similarity.

Genome Annotation
In order to investigate the open reading frame (ORF) distribution in ASFV genomes, gene annotation information was obtained by GenBank records. For genomes with no gene annotation information available, ORFs were identified using GATU software. The length of the genome, coding region sequence (CRS), and CCR are listed in Table S4. The length of the genomes ranged from 171,235 bp to 193,886 bp. The length of the CRS of each genome ranged from 171,046 bp to 192,664 bp. The length of CRS differed between different genotypes, which ranged from 171,046 bp to 186,915 bp in genotype I and from 181,232 bp to 189,797 bp in genotype II. The length of the CCR of each genome was identical, with minor diversity from 12,9288 bp to 13,2794 bp. No significant difference was observed in the length of CCR in different genotypes. . Sequence similarity matrix plot of ASFV genome sequences in the curated dataset. The level of identity of pairwise genome sequences is indicated by different colors. Dark red represents 100% identity, and blue represents lower identity. The maximum likelihood phylogenetic tree of ASFV genome sequences is shown on the right.

Genome Annotation
In order to investigate the open reading frame (ORF) distribution in ASFV genomes, gene annotation information was obtained by GenBank records. For genomes with no gene annotation information available, ORFs were identified using GATU software. The length of the genome, coding region sequence (CRS), and CCR are listed in Table S4. The length of the genomes ranged from 171,235 bp to 193,886 bp. The length of the CRS of each genome ranged from 171,046 bp to 192,664 bp. The length of CRS differed between different genotypes, which ranged from 171,046 bp to 186,915 bp in genotype I and from 181,232 bp to 189,797 bp in genotype II. The length of the CCR of each genome was identical, with minor diversity from 129,288 bp to 132,794 bp. No significant difference was observed in the length of CCR in different genotypes.
The ORF distribution in the representative strains of 10 different ASFV genotypes was compared ( Figure S2). Gain or loss of predicted ORF was observed mainly in the LVR and RVR regions, especially in members of MGF. As shown in Table 1, the number of MGF110 members ranged between 6 and 11 in different genotypes. The number of MGF360 members varied between 10 and 14 in LVR, and between 3 and 5 in RVR. The number of MGF505 members ranged between 8 and 9 in LVR. The sequence identity of each ORF to that of the reference strain Georgia 2007/1 is also shown in Figure S2. It is of interest that several genes located in the CCR showed a considerable sequence variation, including A118R, A238L, EP153R, EP402R, and E66L.

Structural Variation
Structural variations include deletion and insertion of sequence spanning at least 50 base pairs. Structural variations occurred during the circulation of ASFV in the host or passage of the virus in cell cultures. With the curated dataset, we investigated the structural variations of ASFV genomes in genotype I and genotype II. Structural variations were found in five ASFV genotype I genomes, including BA71 (one deletion/insertion), E75 (one deletion), L60 (one deletion), NH/P68 (three deletions and one insertion), and OURT 88/3 (same as NH/P68). The detail is shown in Figure 4A. Structural variations were also found in five ASFV genotype II genomes, including Tanzania/Rukwa/2017/1 (three deletions), MAL/19/Karonga (three deletions), Estonia 2014 (one deletion/insertion), Pig/Heilongjiang/HRB1/2020 (one recombination), and HuB20 (one deletion). The distribution of the structural variations in these genomes is shown in Figure 4B.

Tandem Repeat Sequences Variation
Tandem repeat sequences (TRS) across the genomes of the ten representative stra of each genotype were identified using the TRF program. In each genome, approximat 30 TRS regions were identified. The distribution of TRSs in the ASFV genome is shown Figure S2. Fourteen TRSs shared by all the ten representative strains were further inve

Tandem Repeat Sequences Variation
Tandem repeat sequences (TRS) across the genomes of the ten representative strains of each genotype were identified using the TRF program. In each genome, approximately 30 TRS regions were identified. The distribution of TRSs in the ASFV genome is shown in Figure S2. Fourteen TRSs shared by all the ten representative strains were further investigated (Table 2). Seven TRSs were located in the coding region of EP402R, C84L, B475L, B602L, B407L, B183L and I196L. The other seven TRSs were located in the non-coding region between two ORFs. Five TRSs were found to be compound TRSs, which were composed of two or three single TRSs. The length of repeat units ranged from 3 to 66 bp, the majority ranging from 10 to 20 bp. The copy number of repeat units in each TRS varied considerably among different genotypes. For instance, for the TRS located in the non-coding region between MGF505-9R and MGF505-10R, the 17-bp repeat unit (GTTCAGTTAAGACAGTA or GTTAAGACAATAGTTTT) had 7 copies in Ken06 (KM111295.1) but 32 copies in Malawi Lil-20/1 (AY261361.1). In the coding region of the B602L gene, a TRS with a 12-bp repeat unit (GTGCTTGTACAA) was identified in the 487-nucleotide site at the 5' end of the 1593-bp ORF, which is also known as a central variable region (CVR). In the CVR, the copy number of the repeat unit ranged variably from 7 in the strain Tengani 62 (genotype V) to 31 in the strain Malawi Lil-20/1 (genotype VIII). Mutations in each repeat unit were also observed, including single nucleotide polymorphisms and indels. In the coding region of different ASFV genotype I strains, variation in the number of repeat units was found in 11 TRS. Six TRS variations were located in the non-coding region, whereas five TRSs were located in the coding region of EP402R, B169L, B407L, and B602L ( Figure 5A,B). The most variable TRS was found in CVR (B602L), where variation in the number and sequence of the 12-bp repeat unit was identified. In CVR, according to the nucleotide sequence in each repeat unit, 10   The arrangement of the TRS in each genome was listed according to the nucleotide sequence. Each arrow of a different color represents a type of repeat unit with a specific nucleotide sequence. The nucleotide sequence of each type of repeat unit was also shown. The amino acid sequence of each type of repeat unit in the TRS in the coding region of B602L gene was also shown.
TRS variation was further investigated in ASFV genotype II genomes. Variation in the number of repeat units was found in seven TRSs in the coding region of different genotype II ASFV strains ( Figure 5C). The most variable TRS was found in I73R/I329L IGR. Based on the number of repeat units in I73R/I329L IGR-TRS, the genome-sequenced genotype II ASFV strains in the dataset could be classified into 2 types.

Discussion
Genome characterization of circulating ASFV strains could expand our knowledge on the genetic diversity and evolution of ASFV, providing valuable information for diagnosis improvement and vaccine development. The value of this kind of study depends on the quality of the genome sequences. Although more than 140 genome sequences have been published, approximately 10% of the genomes are not precisely determined. It is highly suggested that precautions should be taken to produce high-quality ASFV genome sequences [26]. Several deep-sequencing workflows have been developed for the fast and efficient generation of high-quality ASFV whole-genome sequences by using either nextgeneration sequencing alone or a combination of next-generation sequencing and thirdgeneration sequencing [16,19,[33][34][35]. No matter which workflow is taken, it is important that the genome assembly generated from the workflow should be checked manually. Sequence variations that are different from the reference sequence should be supported by

Discussion
Genome characterization of circulating ASFV strains could expand our knowledge on the genetic diversity and evolution of ASFV, providing valuable information for diagnosis improvement and vaccine development. The value of this kind of study depends on the quality of the genome sequences. Although more than 140 genome sequences have been published, approximately 10% of the genomes are not precisely determined. It is highly suggested that precautions should be taken to produce high-quality ASFV genome sequences [26]. Several deep-sequencing workflows have been developed for the fast and efficient generation of high-quality ASFV whole-genome sequences by using either next-generation sequencing alone or a combination of next-generation sequencing and thirdgeneration sequencing [16,19,[33][34][35]. No matter which workflow is taken, it is important that the genome assembly generated from the workflow should be checked manually. Sequence variations that are different from the reference sequence should be supported by a high depth of reads. Annexation bases or unknown bases in the sequences should be checked, corrected, or confirmed by Sanger sequencing, as well as fragment deletion. These steps would significantly improve the quality of the genome sequences and make the best use of the great efforts which have been made to obtain the genome sequences.
The curated dataset was used to reconstruct the phylogenetic tree of ASFV genome sequences. A robust phylogenetic tree was obtained. All of the major branches were wellsupported. Our study showed new insight into the genotyping of Mkuzi 1979, Liv13/33, and K49. By p72 genotyping, Mkuzi 1979 was classified with RSA/1/98 (AF302818) into genotype VII [36,37]. Liv13/33 was classified into genotype I by p72 genotyping [38]. K49 was classified into genotype I according to the description of the sequence submitter. In this study, it is supported by strong evidence that Liv13/33 and Mkuzi 1979 grouped into a distinct cluster, separating from the cluster of genotype I strains. It is therefore suggested that Liv13/33 and Mkuzi 1979 should be classified into genotype VII. In our study, K49 formed a singleton which was closely related to but separate from the cluster of genotype I or genotype VII strains. Achieving more genome sequences of genotype VII and genotype I strains collected in Africa would help to settle this problem.
To date, a total of 24 genotypes of ASFV have been identified worldwide. However, genome sequences of only 10 genotypes were available, with genomes of genotype I and genotype II collected in Eurasia overrepresented. The lack of ASFV genome sequences of the remaining 14 genotypes dramatically limited the breadth and depth of phylogenetic analysis of ASFV currently [39]. More resources should be allocated to generate more whole-genome sequences for historical and contemporary ASFV strains collected in Africa, belonging to all 24 genotypes.
Variation of tandem repeat sequences, especially CVR region and IGR I73R/I329R, has been reported and used to enhance discrimination of ASFV isolates [13,[40][41][42]. However, the genome-wide distribution of TRSs in ASFV has never been reported. According to our result, TRSs were widely distributed in the ASFV genomes. Some TRSs identified in this study have not been previously reported. The high quality of genome sequences in genotype I and genotype II enabled the in-depth investigation into the TRS variation during the circulation of ASFV. Although several TRS variations were identified in genotype I and genotype II, respectively, only one TRS in the NCR of MGF 360-9L and 10L was found simultaneously varied in the genome-sequence-available genotype I and II strains. Although dramatic variation in CVR was identified in different genotype I strains, no variation was found in this region in genotype II strains included in this study. However, a CVR variant was reported in ASFV strains in wild boar from a limited area in southern Estonia in 2015 and 2016 [43]. The role of TRS variation in discriminating closely related ASFV strains within a genotype needs to be further investigated.
Variation in stretches of poly G or poly C region in ASFV have been previously reported [18]. In this study, for the first time, we listed all of the highly variable poly G or poly C tracts observed in the genomes of ASFV genotype I and genotype II strains, respectively. It has been revealed that these regions are not only highly variable, but also irregular. One explanation is that this is caused by sequencing errors because the current sequencing methods reach their limit in determining the long homopolymer G or C stretches [26]. The other possibility is that it represents the intrastrain sequence variation due to replication slippage on the homopolymeric tracts [44].

Conclusions
In conclusion, we prepared a curated dataset of ASFV full-length genome sequences for further studies on genome characterization for new outbreaks and genomic epidemiology analysis for ASFV. Our whole-genome-wide diversity analysis based on the curated dataset improved our understanding of the evolution of ASFV during circulation, and thus might help control the spread of this important animal disease. The curated dataset described here will be updated regularly to include the newly published ASFV full genome sequences.  Figure S2: The distribution of ORF and TRS in the representative strains of 10 different ASFV genotypes. The sequence identity of each ORF to that of the reference strain Georgia 2007/1 was also shown. Table S1: Low genome coverage or sequence similarity in 4 ASFV genotype II genomes. Table S2: ASFV genome sequences not used in the curated dataset. Table S3: A curated dataset of ASFV genome sequences. Table S4: The length of genome, coding region sequence (CRS) and conserved central region (CCR) of the ASFV genome sequences in the curated dataset. References  are cited in the supplementary materials.