Mexican Strains of Anaplasma marginale: A First Comparative Genomics and Phylogeographic Analysis

The One Health approach looks after animal welfare and demands constant monitoring of the strains that circulate globally to prevent outbreaks. Anaplasma marginale is the etiologic agent of bovine anaplasmosis and is endemic worldwide. This study aimed to analyze, for the first time, the genetic diversity of seven Mexican strains of A. marginale and their relationship with other strains reported. The main features of A. marginale were obtained by characterizing all 24 genomes reported so far. Genetic diversity and phylogeography were analyzed by characterizing the msp1a gene and 5′-UTR microsatellite sequences and constructing a phylogenetic tree with 540 concatenated genes of the core genome. The Mexican strains show 15 different repeat sequences in six MSP1a structures and have phylogeographic relationships with strains from North America, South America, and Asia, which confirms they are highly variable. Based on our results, we encourage the performance of genome sequencing of A. marginale strains to obtain a high assembly level of molecular markers and the performance of extensive phylogeographic analysis. Undoubtedly, genomic surveillance helps build a picture of how a pathogen changes and evolves in geographical regions. However, we cannot discard the study of relationships pathogens establish with ticks and how they have co-evolved to establish themselves as a successful transmission system.


Introduction
Anaplasma marginale is the causal agent of infectious and non-contagious anaplasmosis, characterized by jaundice, fever, anemia, abortion in pregnant cows, decreased milk production, weight loss, and, in severe cases, even death [1][2][3]. Bovine anaplasmosis has been reported in endemic regions in North, Central, and South America, Africa, Asia, and Australia and is considered to be an impediment to animal health and production [4,5]. Consequently, the productivity losses and economic costs associated with anaplasmosis are significant and have been estimated at USD 300-800 million [6]. Additionally, the genetic variability of different geographical A. marginale strains that have been reported worldwide creates difficulties in relation to controlling the disease [7]. The development of vaccines as a prevention strategy remains elusive, as only in some countries is the use of live vaccines of Anaplasma centrale allowed [8]. Veterinary genomics studies are a relatively recent field of interest with many techniques and approaches used to decipher the genomic content of pathogens of veterinary interest [9][10][11]. In Mexico, as in some other countries, knowledge of the genomic and genetic diversity of geographically different strains of A. marginale has been studied [12,13]. Whole-genome sequencing of Anaplasma strains using next-generation sequencing (NGS) technologies provides significant data collection with potential applications [14,15]. In addition, comparative genomics has emerged as an invaluable approach for illuminating evolutionary mechanisms and forces, as well as discovering the similarities and differences between genomes that may contribute to the development of a vaccine against the different strains that cause animal diseases [16,17].
With the idea of preserving animal health, the study of pathogens should not be studied in isolation, but rather as part of a combination where animal, human, and environmental health converge. Studying animal diseases, such as bovine anaplasmosis, provides an opportunity for enhanced understanding of the pathogens (genetic variants, genomic modifications, pathogenesis, virulence factors, etc.) concerning outbreak prevention by proposing alternatives to control diseases. Therefore, the study of pathogens that affect animal health should have a perspective based on a combination of genomics approaches and epidemiological and virulence studies that leads to proposed prevention and control strategies [18].
In this work, we characterized 24 genomes of A. marginale reported in the GenBank database and performed a genetic diversity and phylogeographic analysis which allowed for the characterization of msp1a gene and 5 -UTR microsatellite sequence construction of a phylogenetic tree with 540 concatenated genes belonging to the core genome. Our results suggest that the A. marginale genomes are highly conserved but differ in some regions. We also identified that the seven draft genomes of Mexican strains of A. marginale have a high assembly level with 32 to 46 contigs, a genome size of~1. 17 Mb, and G + C content of 49.79%. Two draft genomes of Mexican strains have the same MSP1a repeats structure as previously reported sequences, but five draft genomes differ in the MSP1a repeats structure compared to previously reported sequences. The seven Mexican strains show 15 different sequence repeats in six different MSP1a structures and have phylogeographic relationships with strains from North America, South America, and Asia.
This work is the first phylogenetic analysis performed with the core genome of A. marginale, which represents a significant opportunity to elucidate their genetic diversity and the identification of proteins with potential applications in the control of bovine anaplasmosis.

General Features of Anaplasma marginale Genomes
The significant genomic features of A. marginale are shown in Table 1. The GenBank database has reported 24 genomes of A. marginale: 2 complete genomes assembled in a single chromosome, 4 closed genomes assembled in a single incomplete chromosome (that contains 0.10-7.27% of unidentified nucleotides), and 18 draft genomes assembled in contigs or scaffolds. Table 1 shows that the three draft genomes of A. marginale strains Okeechobee, South Idaho, and Washington Okanogan have a low assembly level, with more than 300 contigs, larger genome size, and lower G + C content. In comparison, the seven draft genomes of Mexican strains were assembled from 32 to 46 contigs with a genome size of~1.17 Mb and G + C content of~49.79% (Table 1).
The automatic annotation of rRNA genes showed that all 24 genomes of A. marginale contain one copy of 23S, 5S, and 16S genes with lengths of~2784, 114, and~1491 bp, respectively (Table S3). The mapping of rRNA genes shows that all 24 genomes contain an internal transcribed spacer (ITS) 80 bp distant from 23S and 5S rRNA genes (Table S3 and Figure S1). However, only two complete and four closed genomes contain a distance from 162,640 to 163,401 bp between the 23S-5S and 16S rRNA genes (Table S3 and Figure S1). The automatic annotation of tRNA genes showed that 19 genomes of A. marginale, including the seven Mexican strains, contain 37 tRNA genes ( Table 1). The 19 genomes contain the same 37 tRNA genes representing 35 different codons of all 20 amino acids (Table S4). We detected the number of copies of the msp1a gene to determine if the seven draft genomes of Mexican strains contained genomic information of two or more different strains of A. marginale (possible superinfection) in the same genome. As a result, it was discovered that the seven draft genomes contain the complete sequence of a copy of the msp1a gene in longer length contig 1 (Table S5). Thus, the seven draft genomes do not show a superinfection. Additionally, we obtained the tandem repeat sequences of MSP1a for seven Mexican strains using the Repeat Analyzer program ( Table 2). Only two Mexican strains (MEX-01-001-01 and MEX-30-193-01) have the same repeats structure as previously reported sequences. Four strains (MEX-15-099-01, MEX-17-017-01, MEX-30-184-02, and MEX-31-096-01) differ in terms of the length and type of repeats structure in comparison to previously reported sequences ( Table 2). The repeats structure of these four Mexican strains was verified to discard errors in the assembly of the msp1a genes ( Figure S2). As observed, the strain MEX-01-001-01 contains repeat structure 9 that has only been reported in Mexican strains (geographic regions of North America). The strain MEX-14-010-01 contains the repeat structure τ 22-2 13 18, which is identical to structures reported in Argentine strains (geographic regions of South America). The strains MEX-15-099-01 and MEX-30-193-01 contain repeats a, b, and G, which have been reported in strains from Argentina, Brazil, and Venezuela (geographic regions of South America). The strain MEX-17-017-01 contains repeat structure 14 that has also been reported in strains from China and the Philippines (geographic regions of Asia). Ultimately, the strains MEX-30-184-02 and MEX-31-096-01 contain repeats T and C, which have been reported in strains from the United States and Mexico (geographic regions of North America).

Phylogenetic and Pan-Genomic Analysis
The phylogenetic tree is based on 540 concatenated genes that belong to the core genome of 24 genomes of A. marginale   Pan-genomic analysis among the seven Mexican strains of A. marginale shows that the core genome comprises 883 CDS and 212 CDS for a single genome ( Figure 2). Moreover, pan-genomic analysis among the 24 A. marginale genomes shows that the core genome comprises 534 CDS and 1155 CDS for a single genome (Figure 3).
The pan-genomic analysis among the 24 genomes of A. marginale shows that the single CDS and the core genome decrease significantly with respect to the pan-genomic analysis among the 7 genomes of Mexican strains. The Mexican strains contain 349 core genome CDS not shared with American, Australian, Brazilian, and Puerto Rican strains of A. marginale. Furthermore, the data suggest that some CDS are conserved in the Mexican strains while not present in the others, despite the close phylogenetic relationships.
Pan-genomic analysis among the seven Mexican strains of A. marginale shows that the core genome comprises 883 CDS and 212 CDS for a single genome (Figure 2). Moreover, pan-genomic analysis among the 24 A. marginale genomes shows that the core genome comprises 534 CDS and 1155 CDS for a single genome (Figure 3).

Figure 2.
Pan-genome distribution in four categories (Cloud, Shell, Soft Core, and Core) of Mexican strains of A. marginale. The core genome contains 883 coding sequences (CDS) shared by the seven genomes. Cloud genome (red) containing between 0 and less than 15% of the total CDS; Shell (blue) containing between 15 and less than 95% of the total CDS; Soft Core (purple) containing between 95 and less than 100% of the total CDS; and Core (green) containing 100% of the total CDS. Cloud genome (red) containing between 0 and less than 15% of the total CDS; Shell (blue) containing between 15 and less than 95% of the total CDS; Soft Core (purple) containing between 95 and less than 100% of the total CDS; and Core (green) containing 100% of the total CDS.  The core genome contains 534 CDS shared by the seven genomes. Cloud (red) containing between 0 and less than 15% of the total CDS; Shell (blue) containing between 15 and less than 95% of the total CDS; Soft Core (purple) containing between 95 and less than 100% of the total CDS; and Core (green) containing 100% of the total CDS.
The pan-genomic analysis among the 24 genomes of A. marginale shows that the single CDS and the core genome decrease significantly with respect to the pan-genomic analysis among the 7 genomes of Mexican strains. The Mexican strains contain 349 core genome CDS not shared with American, Australian, Brazilian, and Puerto Rican strains of A. marginale. Furthermore, the data suggest that some CDS are conserved in the Mexican strains while not present in the others, despite the close phylogenetic relationships.

Genome Comparison
A visual evaluation of the level of conserved genomic sequences showed that the draft genomes of seven Mexican strains of A. marginale are highly conserved, with almost 100% identity and almost complete coverage (Figure 4). The genes of rRNA, tRNA, and . Pan-genome distribution in four categories (Cloud, Shell, Soft core, and Core) of coding sequences (CDS) shared by 24 A. marginale genomes. The core genome contains 534 CDS shared by the seven genomes. Cloud (red) containing between 0 and less than 15% of the total CDS; Shell (blue) containing between 15 and less than 95% of the total CDS; Soft Core (purple) containing between 95 and less than 100% of the total CDS; and Core (green) containing 100% of the total CDS.

Genome Comparison
A visual evaluation of the level of conserved genomic sequences showed that the draft genomes of seven Mexican strains of A. marginale are highly conserved, with almost 100% identity and almost complete coverage (Figure 4). The genes of rRNA, tRNA, and the type IV secretion system are conserved and show no significant differences. However, the family of MSP genes shows high diversity and/or variability (faded and faint colors) in all genomes. As additional information, the four closed genomes assembled in a single incomplete chromosome of strains Jaboticabal, Palmeira, Dawn, and Gypsy Plains show sequences gaps of unidentified nucleotides (Table S2 and Figure 4). the type IV secretion system are conserved and show no significant differences. However, the family of MSP genes shows high diversity and/or variability (faded and faint colors) in all genomes. As additional information, the four closed genomes assembled in a single incomplete chromosome of strains Jaboticabal, Palmeira, Dawn, and Gypsy Plains show sequences gaps of unidentified nucleotides (Table S2 and Figure 4). The outermost ring highlights the rRNA (blue arrows), tRNA (black lines), msp (purple arrows), and type IV secretion system (red arrows) genes. The color intensity in each ring represents the BLAST match identity: solid color shows 100% identity, faded color shows 70% identity, and faint color shows 50% identity according to BRIG output.

Discussion
Cattle health is the outcome of the balance between the microorganisms that inhabit them, either beneficial or pathogenic [20]. In this regard, how do we face the fact that animals for commercial purposes are mobilized while also considering the potential pathogens they may have? A starting point could be knowing the genetic diversity of A. marginale, with the idea of identifying which strains circulate in the cattle.
Mexico presents a wide genetic diversity of A. marginale strains related to strains from other continents [7]; this is probably due to the wide variety of ecological regions and the strategic commercial location that has allowed the movement of cattle and ticks (vectors) from different countries [21]. Ticks are part of their microbiota and possess a wide variety of microorganisms with different functions, many of them being potential pathogens to animals, including A. marginale, Borrelia spp., Mycoplasma spp., and Coxiella spp. among others [22]. The outermost ring highlights the rRNA (blue arrows), tRNA (black lines), msp (purple arrows), and type IV secretion system (red arrows) genes. The color intensity in each ring represents the BLAST match identity: solid color shows 100% identity, faded color shows 70% identity, and faint color shows 50% identity according to BRIG output.

Discussion
Cattle health is the outcome of the balance between the microorganisms that inhabit them, either beneficial or pathogenic [20]. In this regard, how do we face the fact that animals for commercial purposes are mobilized while also considering the potential pathogens they may have? A starting point could be knowing the genetic diversity of A. marginale, with the idea of identifying which strains circulate in the cattle.
Mexico presents a wide genetic diversity of A. marginale strains related to strains from other continents [7]; this is probably due to the wide variety of ecological regions and the strategic commercial location that has allowed the movement of cattle and ticks (vectors) from different countries [21]. Ticks are part of their microbiota and possess a wide variety of microorganisms with different functions, many of them being potential pathogens to animals, including A. marginale, Borrelia spp., Mycoplasma spp., and Coxiella spp. among others [22].
The genomic sequencing of A. marginale strains during recent years has reported 24 genomes (2 complete genomes, 4 closed genomes, and 18 draft genomes) with 1.13 to 1.40 Mb of total length and G + C content from 46.73 to 49.84%. Specifically, the comparison of the 18 draft genomes of A. marginale strains from the US, Puerto Rico, and Mexico suggests that the 7 draft genomes from Mexico have a high assembly level with a lower number of contigs than the 11 draft genomes of strains from Puerto Rico (59 contigs) and the United States (57-403 contigs and 44 scaffolds). Moreover, the 7 draft genomes of Mexican strains contain from 1178 to 1218 genes and from 1138 to 1178 CDS (Table 1), a similar number to 14 genomes of A. marginale from other countries and a smaller number compared to the 3 draft genomes with a low assembly level. These data suggest that a low rate of gain/loss of genes occurred through evolution in the A. marginale genomes.
Additionally, the annotation of rRNA genes suggested that all 24 genomes of A. marginale contain one copy of 23S, 5S, and 16S rRNA separated by a distance greater than 162 kb, and an ITS of 80 bp between 23S and 5S genes ( Table S3). As for the annotation of tRNA, the results showed that 19 genomes of A. marginale contain 37 tRNA genes for 35 different codons of all 20 aa, one copy for each tRNA gene, except the tRNA-Met gene which contained three copies, maybe due to the importance of initiating protein synthesis.
On the other hand, in Mexico and worldwide, identification and molecular characterization of the genetic diversity of A. marginale isolates are based on markers such as the variable region of MSP1a. The results obtained in this work from MSP1a suggest that the Mexican strains MEX-30-184-02 and MEX-31-096-01 are associated with ecoregion 1, while strain MEX-17-017-01 was not associated with any of the four ecoregions reported by [19]. Additionally, all Mexican strains, except MEX-17-017-01, are transmitted by ticks associated with ecoregion 1. Ecoregion 1 mainly extends over large areas of central Africa and central South America, primarily Argentina and southern Brazil. It is involved in a region with medium to high Normalized Difference Vegetation Index (NDVI) values, i.e., shrub and grassland to temperate and tropical rainforests. This region has the highest recorded temperature and 1000 mm of annual rainfall [19]. Thus, the environmental factors are dynamic and could change daily. It is essential to highlight that obtention of the complete sequences of msp1a genes by several molecular methods increases the error rate in the consensus sequences of these genes.
These results suggest that a percentage of msp1a gene sequences, obtained by several steps of molecular methods and reported in databases, may contain errors affecting the analyses of genetic diversity or geographic distribution. It could not be excluded that, during massive sequencing, errors might occur, especially in assemblies with low quality or that are still drafts. Therefore, the variable region sequence of msp1a genes could be confirmed or corrected by genome sequencing of A. marginale strains.
Moreover, genome sequencing provides complete sequences with a higher assembly level of msp1a genes. Surprisingly, we detected some draft genomes of Mexican strains of A. marginale that contain MSP1a repeat structures that differs from previously reported sequences. This fact highlights the importance of genome sequencing [7]. Currently, there are few reports of msp1a gene sequences by genome sequencing, while the msp1a sequences amplified by PCR are more abundant and include end-point, nested, and semi-nested PCR [23][24][25] In this regard, genome sequencing is proposed as an alternative to PCR amplification for studying the genetic diversity of A. marginale. At this point, we must emphasize that short-read sequencing methods, such as the one used in this work, allow high-quality reads assembled into genomes to be obtained. However, the obtained genomes are still imperfect and may contain assembly errors, which can mask interesting signals and propagate into false identification of candidate genes and inaccurate gene annotation [26]. Therefore, long-read sequencing (LRS) technologies emerge to improve genome assembly quality, and may offer refinements in the characterization of genetic variation and regions that are difficult to assess with prevailing NGS approaches [27].
This work is the first phylogenetic analysis performed with the core genome of A. marginale (540 concatenated genes). The phylogenetic reconstruction showed that Mexican A. marginale are related to strains from the geographic region of North America (MEX-01-001-01, MEX-30-184-02, and MEX-31-096-01), South America (MEX-14-010-01, MEX-15-099-01, and MEX-30-193-01), and Asia (MEX-17-017-01). However, it would be necessary to obtain A. marginale genome sequences from countries in Africa, Asia, and Europe to acquire a more defined tree topology (Figure 1). Additionally, the pan-genomic analysis showed that Mexican strains share specific local genomic and global information with strains from other continents.
Finally, comparative genomics revealed that A. marginale genomes are highly conserved but not identical, since the ANIm algorithm calculated alignment coverage and identity >93 and 98%, respectively.
As was observed in this study, the geographic strains of A. marginale are highly variable. This genetic variability is most likely due to cattle movement for commercial purposes and the presence of ticks that are transmitters of this pathogen. Additionally, the livestock trade has led to the emergence of strains of A. marginale from other countries, which has enriched the diversity that already existed. However, an essential factor to consider is the relationships pathogens establish with ticks and how they have co-evolved to establish themselves as a successful transmission system. Derived from our study, we propose a deep investigation of the interactions between ticks and A. marginale from the perspective of how populations of ticks impact A. marginale and vice-versa.

Analysis of MSP1a
We used the MSP1a protein of A. marginale strain St. Maries (GenBank accession number AAV86554.1) as a query sequence to identify the number of copies and location of msp1a genes in the seven draft genomes of Mexican strains of A. marginale using the tBlastn (https://blast.ncbi.nlm.nih.gov/Blast.cgi, accessed on 5 March 2022) program [33]. The msp1a gene sequences that had an alignment coverage and identity greater than 50 and 80%, respectively, were selected.
The tandem repeat sequences of the variable region of MSP1a proteins were obtained using the Repeat Analyzer (v2.8) program [34] with default settings. Additionally, the tandem repeat sequences were compared against the A. marginale database (updated December 13, 2017, with 412 strains and 274 repeat sequences) of the Repeat Analyzer (v2.8) [34] program to obtain the possible geographical distribution worldwide.

Analysis of MSP1a
We used the MSP1a protein of A. marginale strain St. Maries (GenBank accession number AAV86554.1) as a query sequence to identify the number of copies and location of msp1a genes in the seven draft genomes of Mexican strains of A. marginale using the tBlastn (https://blast.ncbi.nlm.nih.gov/Blast.cgi, accessed on 5 March 2022) program [33]. The msp1a gene sequences that had an alignment coverage and identity greater than 50 and 80%, respectively, were selected.
The tandem repeat sequences of the variable region of MSP1a proteins were obtained using the Repeat Analyzer (v2.8) program [34] with default settings. Additionally, the tandem repeat sequences were compared against the A. marginale database (updated 13 December 2017, with 412 strains and 274 repeat sequences) of the Repeat Analyzer (v2.8) [34] program to obtain the possible geographical distribution worldwide.

Phylogenetic and Pan-Genomic Analysis
For the phylogenetic reconstruction, the 24 genomes of A. marginale were annotated automatically using the Prokka (rapid prokaryotic genome annotation) (v1.12) program [36] with default settings. The GFF (General Feature Format) annotation files were used as input files by the Roary (the pan-genome pipeline) (v3.12.0) program [37], which made the multiple alignments between the core genomes using the option to create a multiFASTA alignment of core genes (-e). The jModelTest (v2.1.10) program [38] was used to select the best model of nucleotide substitution with the Akaike information criterion (AIC). The phylogenetic tree was estimated under the Maximum-Likelihood (ML) method using the PhyML (v3.1) program [39] with 1000 bootstrap replicates. The phylogenetic tree was visualized and edited using the FigTree (v1.4.4) (http://tree.bio.ed.ac.uk/software/figtree/, accessed on 6 March 2022) program.
Two pan-genomic analyses were performed using the GET_HOMOLOGUES (v11042019) software package [40] with the following options: (i) among the 7 genomes of Mexican strains; and (ii) among the 24 genomes of A. marginale. Briefly, the FAA (Fasta Amino Acid) annotation files of A. marginale strains that were obtained with the RAST server (see Section 4.2) were used as input files by the GET_HOMOLOGUES (v11042019) software package [40]. The get_homologues.pl and compare_clusters.pl Perl scripts were used to compute a consensus pan-genome, which resulted from the clustering of the all-against-all Blastp results with the COGtriangles and OMCL algorithms. The pan-genomic analyses were performed using the binary (presence-absence) matrix.

Comparative Genomics
The level of conserved genomic sequences was visualized by alignment of the seven genomes of Mexican strains against the reference genome of A. marginale strain St. Maries (GenBank accession number CP000030.1) and five genomes of A. marginale strains from Australia (Dawn and Gypsy Plains), Brazil (Jaboticabal and Palmeira), and the United States (complete genome of strain Florida) using the BRIG (BLAST Ring Image Generator) (v0.95) [41] software package with default settings. The circular comparative genomic map was constructed using GenBank files (gbk format) and the NCBI local blast-2.9.0 + suite. Ultimately, the average nucleotide identity (ANI) values were calculated by comparing the 13 genomes of A. marginale (above in this section) using the calculate_ani.py Python script (https://github.com/ctSkennerton/scriptShed/blob/master/calculate_ani.py, accessed on 10 March 2022) with the ANIm (by MUMmer) algorithm.

Conclusions
Currently, the potential represented by the massive sequencing of genomes is undeniable. Pathogens of veterinary importance, such as A. marginale, represent key targets. In this work, we focus on analyzing the phylogeographic relationships that exist in the strains of A. marginale reported worldwide and those that we have identified in Mexico.
We found that the structure of the MSP1a repeats of five Mexican strains did not coincide with those reported in the literature. Thus, we propose that these errors can decrease considerably through massive sequencing of the genomes and not only of the msp1a gene. On the other hand, the phylogenetic analysis of the core genome of 24 genomes of A. marginale showed that the Mexican strains are related to strains from regions of North America, South America, and Asia. Additionally, pan-genomic analysis between the seven Mexican strains revealed that the core genome contains 883 genes, while this value decreases to 534 genes when using the 24 genomes of A. marginale, showing that there are genes that only present in the national strains. It is worth mentioning that it is precisely these unique genes which are of interest because they represent potential targets for vaccine