The Use and Limitations of the 16S rRNA Sequence for Species Classification of Anaplasma Samples

With the advent of cheaper, high-throughput sequencing technologies, the ability to survey biodiversity in previously unexplored niches and geographies has expanded massively. Within Anaplasma, a genus containing several intra-hematopoietic pathogens of medical and economic importance, at least 25 new species have been proposed since the last formal taxonomic organization. Given the obligate intracellular nature of these bacteria, none of these proposed species have been able to attain formal standing in the nomenclature per the International Code of Nomenclature of Prokaryotes rules. Many novel species’ proposals use sequence data obtained from targeted or metagenomic PCR studies of only a few genes, most commonly the 16S rRNA gene. We examined the utility of the 16S rRNA gene sequence for discriminating Anaplasma samples to the species level. We find that while the genetic diversity of the genus Anaplasma appears greater than appreciated in the last organization of the genus, caution must be used when attempting to resolve to a species descriptor from the 16S rRNA gene alone. Specifically, genomically distinct species have similar 16S rRNA gene sequences, especially when only partial amplicons of the 16S rRNA are used. Furthermore, we provide key bases that allow classification of the formally named species of Anaplasma.


Introduction
The genus Anaplasma contains obligate intracellular bacteria capable of colonizing both mammalian and arthropod cells. Currently, the genus consists of seven formally named species: A. marginale, A. centrale, A. ovis, A. bovis, A. phagocytophilum, A. platys and A. caudatum [1,2]; however, since the last taxonomic organization of the genus [1] at least 25 new species have been proposed in the literature. These proposed species range from clinical isolates of infected animals to unique sequence variants detected in broad multispecies-based 16S rRNA metagenomic studies. Some, such as the proposed "A. capra", represent an important source of human and animal disease [3][4][5], while others appear as single, orphan isolates from unique animal hosts.
The nature and variety of evidence for the claim of a novel species vary greatly in the reports, but a unique 16S rRNA sequence identity appears to be the most prevalent evidence. The intracellular nature of these bacteria and the consequent inability to produce pure cultures make taxonomic studies and new species descriptions in line with the International Code of Nomenclature of Prokaryotes (ICNP) difficult to impossible [6]. Historically, molecular study of the genus has been limited by technical challenges posed by the intracellular lifestyle despite several species being of either public health or economic importance [7][8][9]. The expanded availability of high-throughput sequencing has provided the ability to acquire new genetic samples without the need for associated mammalian or tick-cell culture. While this has provided a boon for the ability to conduct new analyses, the scattershot application of gene sequencing studies lacking a standardized methodology or reference point has led to a proliferation of proposed novel species without the ability to rigorously and systematically analyze these claims en masse.
Clear species definitions in bacteria are fraught but at the onset of the genomic era of bacterial taxonomy, it was suggested and broadly accepted that bacteria defined as the same species should share at least 70% DNA cross-hybridization (DNA-DNA hybridization or DDH) [10]. DDH assays proved difficult to implement and as genome sequences became available, it was shown that an average nucleotide identity (ANI) of~95-96% could serve as a proxy for 70% DDH [11]. Following the work of Carl Woese and others, variation in sequence identity of the 16S rRNA gene has been broadly correlated with differences in genomic ANI in prokaryotes, and thus the percentage of DNA-DNA crosshybridization [12]. While 16S rRNA is a common sequence for metagenomic survey and taxonomic comparison, its utility in resolving isolates to the species level can vary greatly across various groups of bacteria, especially when only partial regions of the genes are used as the point of comparison [13]. In this paper, we evaluate the utility of near-complete and partial 16S rRNA sequences to resolve Anaplasma isolates to the species level and test the relationship between 16S rRNA and average nucleotide identity (ANI) for the genus Anaplasma.
We demonstrate that 16S rRNA sequences must be used with caution when attempting to define species within the genus Anaplasma, as bona fide species as confirmed by whole genome analysis possess 16S rRNA gene identities suggestive of same species status. Nevertheless, the prevalence of unique 16S rRNA sequences across the literature represents a large measure of uncaptured genetic diversity within the current taxonomy and suggests numerous potential novel species within the genus that deserve deeper geno-and phenotypic characterization. Finally, we present an analysis of the 16S rRNA sequences derived from the complete genomes of the genus to determine key bases that may aid in distinguishing the known, formally named species in the absence of a complete genome.

Sequences and Alignments
All 16S rRNA gene sequences were obtained from NCBI Genbank. Initial searches were conducted using several key phrases including "16S", "Anaplasma", as well as earlier names such as "Ehrlichia equi" that were utilized prior to the reorganization of the order in 2001 [1]. The 737 sequences obtained were compiled and their accession numbers are listed in Supplementary Table S1. A few instances of identical entries were removed, and only sequences over 1300 bp in length were deemed sufficient for inclusion. Additional sequences not captured in the above searches were added following review of the literature for proposed novel Anaplasma species. Sequences were aligned utilizing ClustalW within the Bioedit platform (Version 7.0.5.3) [14] according to species designation. Identical sequences were removed to leave a core set of unique 16S rRNA gene sequence variants for each species (see Supplementary Table S2) to construct the similarity tables. Consensus sequences were generated for each species using Bioedit. Ambiguous nucleotides assigned by Bioedit to consensus sequences were examined and kept if the single-nucleotide polymorphism (SNP) was observed in the locus three or more times. A SNP frequency below this threshold was considered within the range of sequence error. A few sequences (AB211164, JQ839010, KP062964-KP062966, AB588977, KX817983, and AF283007) that were classified in Genbank as A. centrale were reassigned to A. capra (for an explanation, see Khumalo et al., 2018 [15]).

ANI and Sequence Identity Matrix
OrthoANI values were calculated in a pairwise fashion from whole genome sequences using the EZBio platform (https://www.ezbiocloud.net/tools/ani; accessed on 7 March 2022) [16]. The 16S rRNA gene alignments were used to generate sequence identity matrices using the Bioedit platform (Version 7.0.5.3).

Phylogenetic Comparison
The phylogenetic tree arising from the consensus 16S rRNA gene sequences was constructed using the Phylogeny.fr platform (www.phylogeny.fr; accessed 7 March 2022) [17][18][19][20][21][22][23]. This platform aligns sequences with MUSCLE (v3.8.31), and removes ambiguous regions with Gblocks (v.091b) using these parameters: block after gap cleaning has minimum length of ten, segments with contiguous non-conserved positions larger than eight are rejected, and flank positions require 85% minimum sequences. The phylogenetic tree was constructed using a maximum likelihood method using PhyML (v3.1/3.0 aLRT) with a HKY85 substitution model with an estimated proportion of invariant sites (of 0.704) and four gamma-distributed rate categories. The gamma shape parameter was estimated directly from the data (gamma = 0.546). Reliability for the internal branch was assessed using the aLRT test (SH-Like). Graphical representation of the phylogenetic tree was performed with TreeDyn (v198.3).

Correlation of 16S rRNA with ANI within the Genus Anaplasma
An initial analysis was conducted to assess the relationship between 16S rRNA gene sequence identity and genome nucleotide identity within the genus Anaplasma. Across the Anaplasma species for which genomes are available, sequence similarity in 16S rRNA below 98.7% corresponded to an ANI percentage difference (i.e., < 96%) indicating a distinct species (Table 1). This preliminarily analysis confirms the correlation between 16S rRNA identities and ANI for the genus Anaplasma. Interestingly, A. marginale, A. centrale, and A. ovis share 16S rRNA sequence identity above 98.7% but nevertheless have distinctive ANI percentages. This indicates that high 16S rRNA sequence similarity may not indicate species status within the genus Anaplasma. As expected, within A. phagocytophilum and A. marginale, the two species for which multiple genomes have been sequenced, there is high intraspecies correlation of both the 16S rRNA identity and the ANI. Table 2 lists twenty-five proposed species documented in the literature, with the associated NCBI accession numbers of the 16S rRNA sequences. Seventeen of these species had sequences of sufficient length to be used in this study. Of note, none of these proposed species have available whole genome sequences. Additionally, few other genes have been sequenced from these organisms, and a consistent set of genes is not available from each organism to be able to study the utility of a larger gene set for species classification.  * These are partial or fragmented sequences that were not included in the analyses. "Accession #" refers to the Genbank sequence accession number.

16S rRNA Sequence Similarity across Anaplasma Sequences
Comparison of the 16S rRNA gene sequences from the proposed and validated Anaplasma organisms shows that several proposed species have identities greater than the 98.7% cutoff that usually would indicate that two organisms belong to the same species ( Table 3). Thirteen of the named species, "Candidatus Anaplasma boleense", "Candidatus Anaplasma pangolinii", "Candidatus Anaplasma testudinis", Anaplasma sp. Saso, Anaplasma sp. Hadesa, Anaplasma sp. Dedessa, A. capra, A. odocoilei, Anaplasma sp. SA dog, Anaplasma sp. ZAM dog, Anaplasma sp. Ar. walkerae, Anaplasma sp. O. moubata and Anaplasma sp. Shizhu, resolve below a 98.7% identity threshold from a formally named species of Anaplasma. "Candidatus Anaplasma testudinis", "Candidatus Anaplasma pangolinii", Anaplasma sp. Ar. walkerae, Anaplasma sp. O. moubata, Anaplasma sp. Shizhu and A. capra each form independent, distinct, profiles while Anaplasma sp. SA dog, and Anaplasma sp. ZAM dog group together. Anaplasma sp. Saso and Anaplasma sp. Hadesa group together as do Anaplasma sp. Dedessa and "Candidatus Anaplasma boleense". Interestingly, the putative A. odocoilei has less than 98.7% sequence identity with A. platys, but greater than 98.7% to other proposed species that group with A. platys (i.e., Anaplasma sp. Mymensingh, "Candidatus Anaplasma camelii", and Anaplasma sp. Omatijenne). In the absence of more sequence data, or a complete genome, it is unknown whether the sequences from these organisms represent truly separate species or are perhaps strain variants of A. platys. Notably, two sequences that have been identified as A. platys (KU585989 and KU586001) are distinct from the other A. platys sequences and may represent yet an additional species. The relationships among these organisms can be seen in the phylogenetic tree based on 16S rRNA gene sequences (Figure 1). The phylogenetic tree forms two clades, both anchored by three validly named species. The clade anchored by A. marginale, A. centrale, and A. ovis primarily infects erythrocytes. While the other, anchored by A. bovis, A. phagocytophilum and A. platys, primarily infects white blood cells and platelets. To visualize the totality of the sequences used in this study, non-redundant sequences were put into an identity matrix with identities ≥98.7 highlighted in dark blue (Figure 2), with those below 95% in white and intermediate identities in light blue. With this coloration as a guide, several samples appear to not be placed correctly with their most similar counterparts, or they do not appear to have a best fit position in the matrix. In other words, this analysis highlights misplaced organisms/species, such as some A. marginale better aligning with A. capra. It also shows that some of the putative species, such as "Candidatus Anaplasma boleense" and "Candidatus Anaplasma testudinis" appear to be unique. The relationships among these organisms can be seen in the phylogenetic tree based on 16S rRNA gene sequences (Figure 1). The phylogenetic tree forms two clades, both anchored by three validly named species. The clade anchored by A. marginale, A. centrale, and A. ovis primarily infects erythrocytes. While the other, anchored by A. bovis, A. phagocytophilum and A. platys, primarily infects white blood cells and platelets. To visualize the totality of the sequences used in this study, non-redundant sequences were put into an identity matrix with identities ≥98.7 highlighted in dark blue (Figure 2), with those below 95% in white and intermediate identities in light blue. With this coloration as a guide, several samples appear to not be placed correctly with their most similar counterparts, or they do not appear to have a best fit position in the matrix. In other words, this analysis highlights misplaced organisms/species, such as some A. marginale better aligning with A. capra. It also shows that some of the putative species, such as "Candidatus Anaplasma boleense" and "Candidatus Anaplasma testudinis" appear to be unique.

16S rRNA Variable Region Sequence Analysis across Anaplasma Sequences
Several studies have relied on partial sequences composed of only a few hypervariable regions of the 16S rRNA (see, for instance, [36]) ( Figure 2A). As such, we tested the degree to which hypervariable regions corresponded to results arising from the nearcomplete 16S rRNA sequence. In analyzing these partial sequences, we determined that hypervariable regions 2 and 6 (V2 and V6) were the most variable for Anaplasma genotypes. We examined the identity matrices of the near-full-length 16S rRNA sequences ( Figure 2B), V3-V4 ( Figure 2C) and concatenated V2 and V6 ( Figure 2D), to determine the precision and accuracy of specific hypervariable region sequences to classify Anaplasma sequences. Using concatenated sequences of V2 and V6 usually resulted in the highest correlation with near-complete sequences (i.e., highest accuracy); however, in a few instances, samples were misclassified into an incorrect species (relatively low precision).

16S rRNA Single-Nucleotide Polymorphisms across Closely Related Anaplasma Species
Since several distinct species share greater than 98.7% 16S rRNA sequence identity, we also examined the distinct single-nucleotide polymorphisms that might allow classification of a limited sample. In examining the closely related ruminant-infecting Anaplasma clade (A. marginale, A. centrale, A. ovis and the proposed species Anaplasma sp. Mongolia), only six nucleotides discriminate these species (Table 4). Similarly, when examining the 16S rRNA sequences for the organisms that appear to be closely related to A. platys, there are up to thirteen nucleotides that are differentiating, with a smaller number depending on the exact pair/set of sequences being examined (Table 5). Given the lack of genome sequences, an analysis of the ANI relationships could not be performed.
* Numbering based on Anaplasma platys S3 strain genome sequence. The consensus sequence of A. platys contains a "C" between bases at position 555-556 and a "T" between bases at position 1030-1031. These insertions are not present in all A. platys strains and are absent in the S3 strain.

Discussion
A clear species definition for bacteria has proved a vexing and, indeed, philosophical problem [65,66]. Nevertheless, as a practical matter, the consensus opinion in bacterial taxonomy has been that species should be delineated based on the degree of genetic overlap. The technical test for this genetic overlap has progressed from the degree of DNA-DNA hybridization to average nucleotide identity (ANI), to the percentage of 16S rRNA sequence identity. From the analysis of available sequence samples for the genus Anaplasma, we show that classification of samples to the species level remains difficult based on 16S rRNA sequences alone and that ancillary data are likely required to clearly define separate species.
Among the formally named Anaplasma, the 16S rRNA sequences from A. phagocytophilum are easily distinguishable from those of A. marginale, A. centrale and A. ovis; however, single-nucleotide polymorphisms must be used to distinguish this latter group from each other (Table 4), despite significant whole genome ANI and syntenty differences between these species that confirm their designation as unique entities (Table 1). Conversely, while 16S rRNA sequence comparisons showing identities below a 98.7% threshold are broadly used to suggest that two bacteria are separate species, as shown in Figure 2B for the formally named Anaplasma there are numerous sequences in which the samples have been classified as a given species of Anaplasma despite the 16S rRNA sequence having less than 98.7% shared identity to a plurality of sequences within that clade. It is possible that these sequences which diverge from the consensus could be novel species erroneously classified as a known species or, alternatively, it may represent a high degree of intraspecies population variance in the 16S rRNA sequence across the Anaplasma. Interestingly, the variance appears more prominent in the clade of Anaplasma infecting white cells and platelets than the clade infecting erythrocytes. Given the inconsistent sample size of sequence deposits across the various species of Anaplasma, and the lack of corroborating information for many deposited sequences, the intraspecies population heterogeneity for the 16S rRNA sequence for each of the Anaplasma remains impossible to determine in a rigorous manner, but these tentative findings deserve further investigation.
When the sequence identity of near-complete 16S rRNA sequences from the putative species in the literature are examined, eleven resolve below 98.7% identity with samples from validated species indicating greater diversity within the genus Anaplasma than appreciated in the currently accepted taxonomy ( Table 3). Several of these also show a low degree of intraspecies variation, though this may result from low sample size and sampling bias in the field study methodology. "Candidatus Anaplasma boleense","Candidatus Anaplasma pangolinii", "Candidatus Anaplasma testudinis", the two species from argasid ticks and A. capra all group as unique entities ( Figure 1) and represent likely candidates for independent species. The status and relationships of A. odocoilei, Anaplasma sp. SA dog, and Anaplasma sp. ZAM dog are more complicated. The similarity of 16S rRNA sequences between A. odocoilei and several putative species, which themselves have high similarity with A. platys may indicate one or even several new species, since 16S rRNA sequence identities above 98.7% can occur among unique Anaplasma species. The same is true for Anaplasma sp. SA dog and ZAM dog, which group together but are distinct from other Anaplasma spp. These may represent variants of a single species (as suggested by Kolo et al., 2020), or two unique entities; additional sequence data or whole genome sequences are needed to resolve this question. Anaplasma sp. Dedessa groups with "Candidatus Anaplasma boleense" and may represent a variant strain. The same may be true for Anaplasma sp. Hadesa and Anaplasma sp. Saso, which group together but apart from other Anaplasma spp. Collectively, they may represent variance within an independent species.
The other proposed species for which 16S rRNA sequences are available but do not resolve at a species level in our analysis are Anaplasma sp. Mymensingh, "Candidatus Anaplasma camelii", Anaplasma sp. Omatjenne and Anaplasma sp. Mongolia. Anaplasma sp. Mymensingh, "Candidatus Anaplasma camelii", Anaplasma sp. Omatjenne all group with A. platys, while Anaplasma sp. Mongolia groups closely with A. ovis. For Anaplasma sp. Mymensingh, "Candidatus Anaplasma camelii" and Anaplasma sp. Mongolia the similarity in 16S rRNA sequence identity with a validly named species were noted in their descriptions and differences in the groEL gene sequence was used to justify the creation of a new taxon. For Anaplasma sp. Omatjenne [58], the reasoning for novel nomenclature is not clear. The authors noted that Anaplasma sp. Omatjenne has 99.5% sequence similarity with A. platys but did not elaborate on why they assigned a new name. Presumably it is due to the non-standard host species-Anaplasma sp. Omatjenne was found in sheep, while A. platys is traditionally found in dogs, but this remains speculative on our part.
Our analysis revealed that sequences covering only a few hypervariable regions of the 16S rRNA gene can lead to misclassification of species ( Figure 2) and should be avoided in favor of complete or near-complete 16S rRNA sequences. As such, 16S rRNA sequences alone remain poorly suited for species assignment for the genus Anaplasma. This should particularly be noted by those attempting to discriminate Anaplasma species when performing microbiome analysis of ticks or other arthropods. Our findings largely comport to recent studies of the use of groEL for taxonomic placement and strongly indicate that gene specific knowledge is required for the creation of accurate phylogenies [67]. While groEL and gltA, along with various Anaplasma specific msp genes, were additionally considered for our analysis, ultimately too few sequences were available, especially when attempting to match species or strain specific sequences (e.g., pairing a 16S rRNA sequence and groEL sequence arising from the same bacterial isolate). There is, at the present time, no rigorous correlation between differences in sequence identity ratios of 16S rRNA and other housekeeping genes within an Anaplasma species or among the validly named Anaplasma species.
It bears emphasizing that Anaplasma species remain difficult to culture in laboratory settings as they require intracellular replication in either mammalian or arthropod cells. This dramatically limits the ability to generate high-quality genetic material suitable for whole genome sequences, particularly for the clinically observed isolates or those arising from metagenomic analyses. While some third-generation sequencing technologies may aid in relieving this barrier, the inability to generate whole genome sequences remains the greatest bottleneck in mapping the diversity of the genus Anaplasma.
From our analyses it is clear that genetic diversity in Anaplasma is greater than currently captured by the formally named species in the genus Anaplasma, but a more definitive description of the exact number and relationship of species requires further genetic and genome sequencing. We would furthermore urge caution in denoting a new species on the basis of only two genes as in the case of Anaplasma sp. Mymensingh, "Candidatus Anaplasma camelii", and Anaplasma sp. Mongolia, or particularly for naming putative species (i.e., "Candidatus Anaplasma corsicanum", "Candidatus Anaplasma ivorensis", and "Candidatus Anaplasma mediterraneum") on the basis of 23S rRNA sequences, for which there are only a handful of sequences deposited even for validly named Anaplasma species. Multilocus sequencing may represent a valuable way forward to overcome some of the current issues in Anaplasma taxonomy, as well as a larger number of whole genome sequences for all of the known species of Anaplasma. Accessing samples that are known not to be mixed (i.e., containing more than one species of Anaplasma) with sufficiently high concentrations of target DNA remains challenging.
Supplementary Materials: The following supporting information can be downloaded at: https: //www.mdpi.com/article/10.3390/microorganisms10030605/s1, Table S1: All 16S rRNA sequence accessions gathered for Anaplasma. Table S2: List of the representative sequences used in analysis, and identical sequences that were removed for Figure 2. Table S3: Identity matrices of the 16S rRNA gene regions.
Author Contributions: M.T.C.: data curation, formal analysis, investigation, methodology, visualization, and writing-original draft, review and editing. K.A.B.: conceptualization, methodology, funding acquisition, project administration, supervision, validation, visualization, and writingreview and editing. All authors have read and agreed to the published version of the manuscript.
Funding: This study was funded in part by NIH NIAID grant R01AI136832. M.T.C. was supported by an Achievement Rewards for College Scientists (ARCS) Fellowship from the ARCS Foundation.
Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.
Data Availability Statement: Publicly available data were analyzed in this study. All data were downloaded from NCBI (www.ncbi.nlm.nih.gov/nuccore/) on 17 September 2021. Specific accession numbers are listed in Supplementary Table S1.