Reclassification of ASFV into 7 Biotypes Using Unsupervised Machine Learning

In 2007, an outbreak of African swine fever (ASF), a deadly disease of domestic swine and wild boar caused by the African swine fever virus (ASFV), occurred in Georgia and has since spread globally. Historically, ASFV was classified into 25 different genotypes. A newly proposed system, however, recategorized all ASFV isolates into 6 genotypes exclusively using the predicted protein sequences of p72. Because ASFV has a large genome that encodes between 150 and 200 genes, classifications based on a single gene are insufficient and can be misleading, as strains encoding an identical p72 often have significant mutations in other areas of the genome. We present here a new classification of ASFV based on comparisons that consider the entire encoded proteome. A curated database consisting of the protein sequences predicted to be encoded by 220 reannotated ASFV genomes was analyzed for similarity between homologous protein sequences. Weights were applied to the protein identity matrices, which were then averaged to generate a genome-genome identity matrix. This matrix was analyzed by an unsupervised machine learning algorithm, DBSCAN, to separate the genomes into distinct clusters. We conclude that all available ASFV genomes can be classified into 7 distinct biotypes.


Introduction
The only member of the Asfarviridae family is the African swine fever virus (ASFV), which contains a large double-stranded DNA genome encoding 150–200 genes. ASFV causes a severe disease in domestic swine and wild boar, African swine fever (ASF), resulting in economic losses in areas where ASF remains endemic or is causing outbreaks. Historically, ASFV has been characterized into 25 genotypes based on the partial sequencing of the B646L gene that encodes the major capsid protein p72. Although ASFV has been around for over 100 years [1], before 2007 the disease only sporadically left Africa. Currently, however, ASFV is causing a global pandemic that started after ASFV was discovered in the Republic of Georgia in 2007 [2]. This outbreak has persistently spread across Europe and Asia and, in 2021, reached the island of Hispaniola (Dominican Republic and Haiti) [3]. Most recently, in 2023, ASF made its first appearance in Sweden [4]. In the current pandemic, only genotype 2 has been detected outside of Africa, with the exception of China, which has also detected a low-virulence genotype 1 (closely matching a historical vaccine strain) and a hybrid virus of these two strains [5]. Numerous reports have documented variants stemming from the genotype 2 strain whose origin can be traced back to Georgia. Genotype 2 variants have been identified in various regions across the globe, including Europe, Asia, Hispaniola, and Africa [3,6–10]. Indeed, some of these strains have mutations across the genome, genetic rearrangements, and deletions. To further complicate research efforts, recent reports indicate that, in Africa, ASFV isolates in domestic or smallholder farms have been restricted to only p72 genotypes 1, 2, 9, and 23 [2,7,11–14].
Consequently, the significance and accuracy of the delimitations of genotypes have become a concern for the ASFV research community. The significance of p72 genotyping was discussed at the Global African Swine Fever Research Alliance (GARA) meeting held in the Dominican Republic in May of 2022 and again at the GARA Gap analysis held in Uganda in February of 2023. ASFV genotyping based on the sequence of p72 was created for the epidemiological tracking of the appearance of different field isolates, but it has been erroneously applied to other purposes, including the prediction of cross-protection. Recently, we analyzed all publicly available sequences for p72 and established criteria for genotyping, as this methodology is still in use in endemic and outbreak areas where the technologies for whole genome sequencing are likely to be unavailable [14]. We discovered that there were not 25 genotypes as previously reported, and after correcting some sequence analysis errors, we established a new criterion for p72 genotyping, demonstrating the existence of only 6 genotypes.
With vaccines against the current genotype 2 field strain recently approved in Vietnam, the question arises of how many distinct ASFV genomes exist and how many different vaccines will be needed to cover all current and future emerging strains of ASFV. With limited information about the cross-protection of ASF vaccines, and since the only certain way to test cross-protection is experimental evaluation in animals, it is important as a starting point to group field isolates into distinct groups. This would facilitate the design of cross-protection studies and give a clear understanding of the landscape of circulating and historical ASFV strains. This methodology must also be applied as evolutionary changes create potential new field strains of ASFV, which would need to be examined to determine whether they are truly new emerging strains or fall within a cluster of previously identified ASFV strains.
As reviewed by Qu et al. [15], efforts have been made to enhance the resolution of the p72 categorization of ASFV through the utilization of other genes, specifically p54 (E183L) and the central variable region (CVR) of B602L. Still, due to the complexities inherent to the ASFV genome (large size, gain and loss of genes, and hundreds of open reading frames (ORFs)), the classification of ASFV based on small subsets of genes is inadequate.
Attempts to classify ASFV through whole genome analysis began in the 1980s utilizing restriction fragment length polymorphism [15–17]. This method facilitated the categorization of 23 isolates collected from Africa, Europe, and the Americas into five groups [15,17]. The recent advent of next generation sequencing (NGS) technologies has led to the assembly of whole ASFV genomes, facilitating the phylogenetic analyses of ASFV by several research groups [18–22]. These analyses have resulted in the proposition that ASFV can be categorized into five distinct clades: alpha, beta, gamma, delta, and epsilon [20]. Still, categorization by whole genome analysis has its disadvantages: (1) mutations in untranslated regions (UTR) are weighted the same as mutations that occur in ORFs, (2) synonymous mutations are weighted the same as nonsynonymous and nonsense mutations, and (3) mutations and deletions that occur in the highly variable genes within the MGF families are weighted the same as mutations that occur in conserved "core" proteins. Theoretically, one could partition and align each ORF, UTR, and the highly variable region (HVR) of the ASFV genome using a different evolutionary model for each region; however, this approach necessitates significant computational resources [15]. More recently, 41 ASFV genomes were analyzed using the chewBBACA pipeline [23,24]. In short, coding sequences (CDSs) over 250 nts long were extracted and binned by allele designation. Nucleotide sequences of alleles were then aligned, and a phylogenetic analysis was performed. While this methodology only examines mutations that occur in ORFs, the other disadvantages of whole genome nucleotide analysis remain. Finally, it should be noted that although phylogenetic analysis yields a visual representation of differences, the criteria employed to define clades may still be ambiguous.
In this study, we collected 220 non-duplicate whole genomes of ASFV isolates from NCBI. Genomes were annotated using ASFV Georgia 2007/1 as a reference, and unidentified ORFs were annotated using CLC Genomics Workbench 23.0.2 (QIAGEN, Aarhus, Denmark). Annotations were manually curated and translated to their predicted amino acid sequences, and homologous amino acid sequences encoded by the different genomes were aligned using MUSCLE [25] to create gene-level percent identity matrices via Biopython [26]. Gene-level percent identity matrices were weighted using an algorithm developed in-house and averaged to create a genome-genome percent identity matrix, which was analyzed using the density-based clustering algorithm DBSCAN [27,28] to cluster the genomes based on similarity. Based on this research, we propose that ASFV can be classified into seven biotypes.

Materials and Methods
The overall methodology used in this paper is documented in Figure 1 and further described in the sections that follow.

Annotation of the Dataset
Using the isolate ASFV Georgia 2007/1 (GenBank accession: FR682468.2) as a reference, all genomes were annotated using the default parameters of Genome Annotation Transfer Utility (GATU) [29]. Overlapping annotations were manually corrected. In addition, the "Find Open Reading Frames" function of CLC Genomics Workbench 23.0 (Qiagen, Aarhus, Denmark) was used to detect ORFs that were not identified by GATU (minimum length = 110 codons). The identified ORFs were subsequently translated, compared to the NCBI database using the default parameters of BLASTP, and assigned names based on their best match [30–33]. Any ORFs that matched a hypothetical protein that was not annotated in the Georgia 2007/1 genome were excluded from further analyses. In some instances, Georgia 2007/1 encodes a shortened version of a gene within one of the five multigene families (MGF) (MGF-100, MGF-110, MGF-300, MGF-360, and MGF-505). Accordingly, MGF annotations were manually extended to an earlier start codon if the start codon was over 100 nucleotides upstream and in-frame. Letters, starting with "a", were added to gene names if a gene was split into multiple ORFs. Duplicate genes were indicated by the suffix "_1".

Protein Alignment
The 242 homologues were aligned using MUSCLE v3.8.31 [35] with a gap extension penalty of −1.0 and a gap opening penalty of −10.0. The protein alignments were then converted to gene-level percent identity distance matrices using Biopython [26].
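To make the conversion from an alignment to a percent identity matrix concrete, the following is a minimal, dependency-free sketch. It is not the Biopython routine used in the study; the function names, the toy sequences, and the gap convention (columns that are gaps in both rows are skipped, and a gap opposite a residue counts as a mismatch) are assumptions made for illustration only.

```python
from itertools import combinations


def percent_identity(seq_a, seq_b):
    """Fraction of identical residues between two rows of one alignment.

    ASSUMPTION: columns that are gaps in both rows are ignored, and a gap
    opposite a residue counts as a mismatch. Other conventions exist.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    matches = compared = 0
    for res_a, res_b in zip(seq_a, seq_b):
        if res_a == "-" and res_b == "-":
            continue  # shared gap column carries no information
        compared += 1
        matches += res_a == res_b
    return matches / compared if compared else 0.0


def identity_matrix(aligned):
    """Return (names, matrix) for a dict of genome name -> aligned sequence."""
    names = sorted(aligned)
    index = {name: i for i, name in enumerate(names)}
    matrix = [[1.0] * len(names) for _ in names]
    for a, b in combinations(names, 2):
        value = percent_identity(aligned[a], aligned[b])
        # The matrix is symmetric, so fill both triangles at once.
        matrix[index[a]][index[b]] = matrix[index[b]][index[a]] = value
    return names, matrix


# Toy alignment of one hypothetical protein from three hypothetical genomes.
toy_alignment = {"genome_1": "MKT-A", "genome_2": "MKTLA", "genome_3": "MRT-A"}
names, matrix = identity_matrix(toy_alignment)
```

One such matrix would be produced per homologue, with identities expressed as fractions between 0 and 1, matching the scale of the values reported in the Results.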

Genome Level Analysis and Clustering
Weights were designed to favor "core" genes (non-MGF, non-ACD, and non-hypothetical genes) (Table S4) as well as genes present in more genomes. Genes were assigned a weight using the following equation: It should be noted that genomes that did not encode a gene would not have their average impacted by the absence of said gene. Using the weighted average, gene-level percent identity distance matrices were then averaged into a single cumulative genome-genome percent identity matrix. This final percent identity matrix was run through the spatial clustering machine learning algorithm DBSCAN from scikit-learn 1.3.0 [27,36] 100 times, with the epsilon (eps) value ranging from 0.01 to 1 in increments of 0.01 and the following constant parameters: min_samples = 1, metric = 'euclidean', metric_params = None, algorithm = 'auto', leaf_size = 30, and p = None. The process is summarized in Figure 2.
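The averaging step can be sketched as follows. The weighting function here is a placeholder and is not the equation used in the study: it simply doubles the weight of core genes and scales by the fraction of genomes that encode the gene, which follows the stated intent but not necessarily the actual formula. All names and numbers are hypothetical. The part that follows the text directly is the handling of absent genes: a gene contributes to a genome pair only when both genomes encode it.

```python
from itertools import combinations


def hypothetical_weight(is_core, n_encoding, n_genomes):
    """ASSUMPTION: placeholder weighting, not the published equation.

    Core genes count double, and every gene is scaled by its prevalence.
    """
    return (2.0 if is_core else 1.0) * (n_encoding / n_genomes)


def weighted_genome_identity(genomes, gene_identities, weights):
    """Average gene-level identities into one value per genome pair.

    gene_identities maps gene -> {frozenset({genome_a, genome_b}): identity}.
    A pair missing from a gene's table (one genome lacks the gene) is
    skipped, so the absence does not pull the average down.
    """
    result = {}
    for a, b in combinations(genomes, 2):
        pair = frozenset((a, b))
        numerator = denominator = 0.0
        for gene, table in gene_identities.items():
            identity = table.get(pair)
            if identity is None:
                continue
            numerator += weights[gene] * identity
            denominator += weights[gene]
        result[pair] = numerator / denominator if denominator else float("nan")
    return result


genomes = ["A", "B", "C"]
gene_identities = {
    # Hypothetical core gene encoded by all three genomes.
    "core_gene": {
        frozenset("AB"): 0.9, frozenset("AC"): 0.8, frozenset("BC"): 0.7,
    },
    # Hypothetical variable gene that genome C does not encode.
    "variable_gene": {frozenset("AB"): 0.5},
}
weights = {
    "core_gene": hypothetical_weight(True, 3, 3),
    "variable_gene": hypothetical_weight(False, 2, 3),
}
genome_matrix = weighted_genome_identity(genomes, gene_identities, weights)
```

In this toy case, the A-B value is pulled below 0.9 by the divergent variable gene, while the A-C and B-C values depend on the core gene alone because genome C lacks the variable gene.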

Full-Length Genomes on NCBI
All 261 full-length ASFV genome sequences were downloaded from NCBI and processed as described in the Materials and Methods, Section 2 (Tables S1 and S3), resulting in a database of 220 curated genomes. ASFV annotation and gene nomenclature have not been standardized, which has resulted in some genes having multiple alternative names [37], reviewed on https://asfvgenomics.com/proteindatabase (accessed on 1 December 2023). To avoid potential conflicts in nomenclature, all genomes were re-annotated using Georgia 2007/1 (GenBank accession: FR682468.2) as a reference, resulting in 220 non-redundant genomes encoding an average of 175.9 genes (±15.1).


Clustering of ASFV
Pairwise percent identities were calculated for each gene. Among the virus proteins encoded in the central region of the genome, the following had the lowest percent identities (considering proteins present in at least 100 genomes): A118R (0.73 ± 0.292), EP153R (0.77 ± 0.22), EP402R (0.79 ± 0.212), A238L (0.83 ± 0.155), L60L (0.84 ± 0.171), DP71L (0.89 ± 0.264), A240L (0.91 ± 0.1), I10L (0.92 ± 0.08), I196L (0.92 ± 0.093), and L11L (0.93 ± 0.094) (Table S5). Following their calculation, all pairwise percent identity matrices were averaged into a single identity matrix using an algorithm developed in-house, as described in the Materials and Methods section. Our weighting matrix increased the value of (1) genes that were consistently encoded and (2) 123 conserved "core" proteins (Table S4). Genes annotated as "ACD ####" were given less weight because they are hypothetical proteins with no known function. Genes within the five MGF families (MGF100, MGF110, MGF300, MGF360, and MGF505) were given less weight since they are highly variable and often contain homopolymer stretches of G/C or A/T that can result in indels that lead to deletions, truncations, and fusions ([10,22]; reviewed in [37]). The weighted gene-level percent identity distance matrices were then averaged into a single genome-genome percent identity matrix and clustered based on similarity using DBSCAN. Clustering refers to the process of dividing a dataset into subsets of points. The aim is to group similar points together in the same cluster while separating dissimilar points into different clusters. DBSCAN offers an extra level of dependability as a clustering technique, as it identifies clusters of arbitrary shape and does not require the number of clusters to be specified [36]. Accordingly, it is more powerful than other clustering methods such as k-means, which place data in clusters based on their proximity to a central point and require the number of clusters to be specified before analysis.
The DBSCAN algorithm produces an output whereby each genome is assigned a numerical value ranging from 0 to n, where n + 1 = the total number of clusters at a given epsilon (Ɛ) value. At each Ɛ level, genomes that clustered based on similarity were assigned an identical number. DBSCAN was executed for 100 iterations of the parameter Ɛ, ranging from 0.01 to 1, with an increment of 0.01 for each iteration (Table S6). Ɛ denotes the radius of a neighborhood centered at a point x. In simpler terms, as the value of Ɛ increases, the maximum dissimilarity permitted within a cluster also increases, leading to a reduction in the overall number of unique clusters.
Figure 3 shows the results of DBSCAN, highlighting the number of clusters and how they merge at Ɛ = 0.05, 0.06, 0.07, 0.09, 0.10, 0.12, 0.13, 0.14, 0.21, and 0.32 until total convergence occurs at Ɛ = 0.53. At the lowest of these values, Ɛ = 0.05, there were 18 groups, 10 of which (TAN/08/Mazimbu (ON409981), Malawi Lil-20/1 (AY261361), Ken05/Tk1 (KM111294), Uvira B53 (MT956648), BUR/18/Rutana (MW856067), Uganda (Unpublished), RSA/2/2004 (MN641877), Tengani 62 (AY261364), Mkuzi 1979 (AY261362), and LIV 5/40 (MN318203)) were composed of only a single genome. The remaining eight groups were named after a prototypical genome (Ken06, Kenya 1950, Warmbaths, Warthog, Benin, K, Recombinant, and Georgia) and were composed of 8, 2, 3, 4, 66, 2, 3, and 122 genomes, respectively (Table S7). The approximate percent similarity and amino acid changes that constitute each cluster at each of the given Ɛ values were estimated by comparing Benin 97/1 (AM712239) to a newly clustered isolate (Table 1). As expected, as Ɛ increases, the weighted percent similarity within each group decreases and the total number of amino acid differences increases. Compared to the previous standard of ASFV classification, genotyping by p72 (B646L), the weighted similarity metric exhibits significantly lower values than p72 similarity at each Ɛ value. This suggests that the accumulated differences that contribute to the weighted similarity metric are more responsive and discerning for classification than p72 similarity alone.
[Figure 3: number of clusters at the indicated epsilon value.]
Viruses 2024, 16, 67
of arbitrary shape and does not require a specific number of clusters to be specified [36].Accordingly, it is more powerful than other clustering methods such as k-means which place data in clusters based on their proximity to a central point and require a specific number of clusters to be identified before analysis.The DBSCAN algorithm produces an output whereby each genome is assigned a numerical value ranging from 0 to n, where n + 1 = the total number of clusters at a given epsilon (Ɛ) value.At each Ɛ level, genomes that clustered based on similarity were assigned an identical number.DBSCAN was executed for 100 iterations of the parameter Ɛ, ranging from 0.01 to 1, with an increment of 0.01 for each iteration (Table S6).Ɛ denotes the radius of a neighborhood centered at a point x.In simpler terms, as the value of Ɛ increases, the maximum number of dissimilarities permitted within a cluster also increases, leading to a reduction in the overall number of unique clusters.
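The epsilon sweep described above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the five-genome identity matrix is an invented toy example, the function names `dbscan_precomputed` and `sweep` are hypothetical, and `min_samples=1` is an assumption chosen because the text states that every genome receives a label from 0 to n (that is, no genome is treated as noise).

```python
from collections import deque


def dbscan_precomputed(dist, eps, min_samples=1):
    """Cluster points from a square distance matrix; -1 marks noise."""
    n = len(dist)
    # Neighborhood of i: every point within eps of i (i itself included).
    neighbors = [[j for j in range(n) if dist[i][j] <= eps] for i in range(n)]
    is_core = [len(nb) >= min_samples for nb in neighbors]
    labels = [-1] * n
    cluster = 0
    for start in range(n):
        if labels[start] != -1 or not is_core[start]:
            continue
        labels[start] = cluster
        queue = deque([start])
        while queue:
            p = queue.popleft()
            if not is_core[p]:
                continue  # border points join a cluster but do not extend it
            for q in neighbors[p]:
                if labels[q] == -1:
                    labels[q] = cluster
                    queue.append(q)
        cluster += 1
    return labels


def sweep(dist, steps=100, step_size=0.01):
    """Run DBSCAN for eps = 0.01, 0.02, ..., returning (eps, n_clusters, labels)."""
    results = []
    for k in range(1, steps + 1):
        eps = round(k * step_size, 2)
        labels = dbscan_precomputed(dist, eps)
        # With min_samples=1 there is no noise, so distinct labels = clusters.
        results.append((eps, len(set(labels)), labels))
    return results


# Toy genome-genome identity matrix (fractions); values are invented.
identity = [
    [1.00, 0.97, 0.90, 0.60, 0.58],
    [0.97, 1.00, 0.91, 0.61, 0.59],
    [0.90, 0.91, 1.00, 0.62, 0.60],
    [0.60, 0.61, 0.62, 1.00, 0.96],
    [0.58, 0.59, 0.60, 0.96, 1.00],
]
# Distance = 1 - identity, rounded to avoid floating-point edge effects.
distance = [[round(1.0 - v, 6) for v in row] for row in identity]

previous = None
for eps, n_clusters, _ in sweep(distance):
    if n_clusters != previous:
        print(f"eps={eps:.2f}: {n_clusters} clusters")
        previous = n_clusters
```

With `min_samples=1`, every point is a core point, so the result is equivalent to finding connected components of the graph that links genomes whose distance is at most Ɛ. For real data, `sklearn.cluster.DBSCAN` with `metric="precomputed"` accepts the same kind of distance matrix.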

ASFV Can Be Classified as 7 Biotypes
At three epsilon values between 0.05 and 0.10 (0.06, 0.07, and 0.09), the 18 clusters merge based on similarity until forming 7 clusters at epsilon = 0.10 (Figure 3). Larger epsilon values (0.12, 0.13, 0.14, 0.21, and 0.32) continued to group the isolates, decreasing the total number of clusters (6, 5, 4, 3, and 2, respectively) until all genomes converged into a single cluster at an epsilon value of 0.53. Accordingly, for our new classification, we chose a cutoff value of epsilon = 0.10, as it was large enough to combine all historic genotype I isolates, yet small enough that genotype I isolates remained separated from the recently described recombinants, which are composed of genetic sequences derived from NHV (genotype 1) and ASFV-G (genotype 2) [5]. We propose that this new grouping be referred to as biotypes.
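The cutoff criterion stated above (large enough to merge one set of isolates, small enough to keep another set separate) can be expressed as a simple filter over the sweep output. The sketch below is only an illustration under stated assumptions: the isolate names, the epsilon values, and the cluster labels are invented toy data, and the helper names `satisfies` and `candidate_cutoffs` are hypothetical.

```python
def satisfies(labels, names, together, apart):
    """True if all `together` isolates share one cluster that excludes `apart`."""
    label_of = dict(zip(names, labels))
    shared = {label_of[name] for name in together}
    if len(shared) != 1:
        return False  # the isolates that should merge are still split
    (cluster,) = shared
    return all(label_of[name] != cluster for name in apart)


def candidate_cutoffs(labels_by_eps, names, together, apart):
    """Return the epsilon values, in ascending order, that meet the criterion."""
    return [
        eps
        for eps, labels in sorted(labels_by_eps.items())
        if satisfies(labels, names, together, apart)
    ]


# Toy sweep output: one label per isolate at each epsilon (invented values).
names = ["g1_africa", "g1_europe", "recombinant", "g2"]
labels_by_eps = {
    0.1: [0, 1, 2, 3],  # everything separate
    0.2: [0, 0, 1, 2],  # genotype 1 isolates merged, recombinant separate
    0.3: [0, 0, 0, 1],  # recombinant absorbed into the genotype 1 cluster
    0.4: [0, 0, 0, 0],  # total convergence
}

cutoffs = candidate_cutoffs(
    labels_by_eps,
    names,
    together=["g1_africa", "g1_europe"],
    apart=["recombinant"],
)
print(cutoffs)
```

In this toy case only one epsilon value satisfies both conditions; on real data the filter may return a range of values, and the final choice remains a judgment call.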

The 7 biotypes present when epsilon = 0.10 have some similarities to the traditional classification of ASFV based on p72 sequencing (Table 2 and Table S7) [38]. Biotype 1 is composed of 69 isolates that would traditionally be considered part of genotype I (Group Benin, Group K, and isolate LIV 5/40). However, Mkuzi 1979 (AY261362), whose genotype has been ambiguous and which has historically been classified as genotype I, VII, or XII, was also included in this biotype [14,18,39]. Interestingly, isolates within biotype 1 have been collected from outbreaks that occurred in many different regions, including eastern Africa (Group K), southern Africa (LIV 5/40 and Mkuzi 1979), western Africa (Benin 97/1 and Ghana2021-95), Europe (NHV, L60, E75, OURT 88/3, and the Sardinia strains), Asia (Pig/HeN/ZZ-P1/2021 and Pig/SD/DY-I/2021), and the island of Hispaniola (DR-1980), and that have spanned multiple decades (1949 to 2021) (Table 3).
[...] Further, large epsilon values (Ɛ = 0.14) cluster biotype 3 with biotype 1 before merging with biotype 2 (Ɛ = 0.21). Interestingly, four isolates (Warmbaths (AY261365), RSA/2/2008 (MN336500), SPEC 57 (MN394630), and Pretorisuskop/96/4 (AY261363)) were collected from ticks, three isolates were collected from a warthog or wild boar (RSA_2_2004 (MN641877), Warthog (AY261366), and RSA/W1/1999 (MN641876)), while only two isolates (Zaire (MN630494) and Tengani 62 (AY261364)) were collected from domesticated pigs. Taken together, the metadata suggest genetic adaptation of ASFV to ticks or different host species [40,41]. Still, care must be taken not to overanalyze the results, as other strains isolated from ticks in southern Africa (LIV 5/40 (MN318203), Mkuzi (AY261362), and Malawi Lil-20/1 (AY261361)) did not group into biotype 3.
Biotype 1-2 Recombinant consists of 3 recently isolated genomes from China. The genomes consist of biotype 1 and biotype 2 sequences and are believed to be the result of a recombination between NHV (also known as ASFV/NH/P68) (KM262845) and Georgia 2007/1 [5,39].

Web Portal for Automatic ASFV Biotyping and Genotyping
A tool has been provided on the Center of Excellence for Swine Fever Genomics website (https://asfvgenomics.com/upload) (accessed on 1 December 2023) that analyzes a novel ASFV genome uploaded by the user and returns the most likely biotype (manuscript in submission). Moreover, because genotyping is still widely recognized and provides a way to connect contemporary variants with past samples, the closest p72 match will also be predicted. Additionally, the tool will issue warnings for highly unlikely genomes (indicating a potential need for reassembly). In the future, a function will be implemented that will detect new potential biotypes or groups; if one is detected, an email address will be given so that we can correspond about the results and aid in the classification of the submitted genome.

Discussion
Historically, p72 (B646L) genotyping was conducted for the purpose of disease tracking and, because of the lack of next-generation sequencing technologies in many regions impacted by ASF [14], continues to be used to classify ASFV isolates. While a novel p72 classification scheme has reduced the number of genotypes from 25 to 6 and set clear parameters to define groupings within a genotype [14], it is clear that classification based on a single gene is insufficient for categorizing a virus as complex as ASFV. Any classification that is exclusively based on p72 ultimately depends on a handful of amino acid changes and is accordingly prone to error, as even a single sequencing error can result in a severe misclassification or the generation of a new genotype. As indicated in Table 1, with a method that considers more than just p72, the classification of ASFV can instead be based on hundreds or thousands of amino acid changes, avoiding the problems associated with genotyping by p72 alone. For example, while Warmbaths, Tengani 62, Pretorisuskop/96/4, and ASFV Georgia 2007/1 encode an identical p72 protein sequence, examination of the rest of the proteome reveals that ASFV Georgia 2007/1 is the outlier and far less similar. Conversely, while Ken05/Tk1 and Kenya 1950 encode a different p72, examination of the rest of the proteome reveals that they are more similar.
Other groups have attempted to classify 60 full-length ASFV genomes into clades based on the longest common sequence (LCS) methodology [20]. While this method is better than classifying ASFV solely based on the partial sequencing of p72, we believe our strategy is superior, as it instead relies on a gene-level comparison that ignores non-coding regions, examines amino acid changes rather than nucleotide changes, weights the results based on the number of times a gene is represented and whether the gene is a "core" protein, and was able to analyze the genomes of 220 ASFV isolates.
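The effect of weighting gene-level comparisons, rather than relying on p72 alone, can be illustrated with a small sketch. It does not reproduce the weight equation of Section 2.4: the weights, the gene label `other_gene`, the identity values, and the function name `weighted_genome_identity` are all hypothetical, and normalizing by the sum of the weights is an assumption made here for illustration.

```python
def weighted_genome_identity(gene_matrices, weights):
    """Combine per-gene identity matrices into one genome-genome matrix.

    gene_matrices: dict mapping gene name -> square identity matrix (fractions)
    weights:       dict mapping gene name -> non-negative weight (hypothetical)
    """
    genes = list(gene_matrices)
    n = len(gene_matrices[genes[0]])
    total_weight = sum(weights[g] for g in genes)
    combined = [[0.0] * n for _ in range(n)]
    for g in genes:
        for i in range(n):
            for j in range(n):
                combined[i][j] += weights[g] * gene_matrices[g][i][j]
    # Assumption: divide by the summed weights so values stay between 0 and 1.
    return [[value / total_weight for value in row] for row in combined]


# Two toy genomes that encode an identical p72 but differ in another gene.
gene_matrices = {
    "B646L": [[1.0, 1.0], [1.0, 1.0]],       # identical p72
    "other_gene": [[1.0, 0.8], [0.8, 1.0]],  # 80% identity elsewhere
}

equal = weighted_genome_identity(gene_matrices, {"B646L": 1.0, "other_gene": 1.0})
skewed = weighted_genome_identity(gene_matrices, {"B646L": 1.0, "other_gene": 3.0})
print(equal[0][1], skewed[0][1])
```

Even in this two-gene toy case, the combined identity falls below the p72 identity of 1.0, which is the behavior the weighted metric is meant to capture across the whole proteome.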
A comparison of results based on historical genotyping, our corrected p72 genotyping [14], clades [20], and our biotyping is shown in Table 2. Note that, as a result of the incomplete sequencing of their genomes, it was not possible to conduct biotyping analysis on the genotype 23 isolates. The latter three methods all reduce the number of groups from over twenty to fewer than eight. Further, both biotypes and clades suggest that typing by p72 alone is insufficient: both historical and corrected p72 genotypes contain multiple biotypes/clades. For example, genotype 2 includes biotypes 1-2R, 2, and 3; it also includes clades Beta, Epsilon, and Delta. Lastly, biotypes and clades were also similar in that genotype 1 isolates collected from outbreaks in Europe and Africa grouped together (biotype 1 and alpha clade) and all genotype 2 isolates grouped within the same biotype and clade (biotype 2 and beta clade).
Differences between groups created from the biotype and clade methodologies were observed for certain isolates: (1) Tengani 62: our biotyping method groups it with the other strains originating from southern Africa (Warmbaths and Warthog), while it is the sole member of the delta clade. (2) The Eastern African groups (biotypes 4-6) were grouped into a single clade (gamma). (3) LIV 5/40 is classified along with Warmbaths and Warthog in the clades manuscript; however, for this analysis LIV 5/40 was reassembled and grouped with Mkuzi 1979 within the biotype 1 group. Of course, at the time of that publication, certain genomes, such as those within biotype 1-2 Recombinant and TAN/08/Mazimbu, had yet to be sequenced and/or isolated and could not be analyzed.
Care should be taken when interpreting the biotypes, since it was not the intention of this manuscript to separate isolates based on virulence. Previous studies have analyzed ASFV virulence by examining the presence or absence of functional domains encoded by attenuated and virulent strains [42]. In its current iteration, the method can cluster both an attenuated strain and its virulent parental strain in the same biotype, as was observed for BA71 and BA71V, K49 and KK262, L60 and OURT 88/3, and L60 and NHV.
In the future, as the assessment of potential coverage induced by various vaccines would necessitate cross-protection studies, we propose utilizing biotypes as a means to determine which strains are more closely related, since biotyping is based on the analysis of the entire genome rather than on a single ASFV protein. As these studies are performed, there will be a better understanding of the serotype or biotype that specifically correlates with protection against ASFV. Accordingly, the utilization of a clustering algorithm-based methodology, as opposed to a phylogenetic tree, for the classification of ASFV enables the biotyping classification to be modified in response to future data. In addition, as ASF continues to have prolonged outbreaks and with the increasing number of ASFV isolates being fully sequenced by next-generation sequencing, it is possible that novel and highly heterogeneous variants of ASFV could be found, in which case it may be necessary to adjust the epsilon value to classify ASFV into a greater or lesser number of biotypes. Further, as many ASFV strains, such as the isolates that make up genotype 23, ETH/AA (KT795353), ETH/017 (KT795355), ETH/1 (KT795354), ETH/004 (KT795356), ETH/2a (KT795358), and ETH/1a (KT795359), have only been partially sequenced and could not be analyzed, the number of biotypes may expand as more historic isolates are fully sequenced. Moreover, with our current knowledge, along with the development of a web-based tool to easily identify the biotype of ASFV, we believe standardization of ASFV isolates by biotypes is possible and constitutes the most accurate classification for ASFV.

Figure 1. Summary of the analysis pipeline described in Materials and Methods.

Figure 2. Process for going from pairwise gene-gene percent identity matrices to DBSCAN. (A) Gene-level identity matrices are multiplied by their weights and averaged, resulting in the (B) genome-genome identity matrix. The average genome-genome identity matrix is then analyzed via DBSCAN to identify clusters based on similarity. The weight equation is described in Section 2.4.

Figure 3. Results of DBSCAN demonstrate the number of clusters at the indicated epsilon value. Brackets indicate the convergence of clusters. Dark blue boxes indicate a sequence from a single isolate, and medium blue boxes indicate a group of isolates with similar sequences.

Table 1. Example Differences Between Benin 97/1 and the Indicated Isolate at the Given Epsilon (Ɛ) Level.

Table 3.
[...] [14]. Recent reanalysis of the p72 genotypes grouped these isolates into genotype 2 with the derivatives of Georgia 2007/1 [14]. However, full genome analysis clearly indicates that the genomes are different when compared to Georgia 2007/1. Further, large epsilon values (