Assembly and Comparison of Ca. Neoehrlichia mikurensis Genomes

Ca. Neoehrlichia mikurensis is widely prevalent in I. ricinus across Europe and has been associated with human disease. However, diagnostic modalities are limited, and much is still unknown about its biology. Here, we present the first complete Ca. Neoehrlichia mikurensis genomes directly derived from wildlife reservoir host tissues, using both long- and short-read sequencing technologies. This pragmatic approach provides an alternative to obtaining sufficient material from clinical cases, a difficult task for emerging infectious diseases, and to expensive and challenging bacterial isolation and culture methods. Both genomes exhibit a larger chromosome than the currently available Ca. Neoehrlichia mikurensis genomes and expand the ability to find new targets for the development of supportive laboratory diagnostics in the future. Moreover, this method could be utilized for other tick-borne pathogens that are difficult to culture.


Introduction
Ixodes ricinus is the most abundant and widespread tick species in Europe [1,2] and transmits multiple pathogens of medical and veterinary concern [3]. Two well-established tick-borne diseases, Lyme borreliosis and tick-borne encephalitis, are frequently reported in Europe, and several studies have indicated a rise in their incidences and spread over the last decades [4,5]. Ixodes ricinus also transmits Ca. Neoehrlichia mikurensis, which has been found all across Europe except for the United Kingdom [6][7][8]. The number of studies describing human infections involving Ca. N. mikurensis is accumulating, but whether these infections result in human disease has not been fully demonstrated [9,10].
One of the major obstacles to investigating how often and under what conditions Ca. N. mikurensis causes an infectious disease in humans, is its unequivocal detection in larger cohorts of patients with noncharacteristic disease symptoms [11,12] or in persons with a (recent) tick bite [13][14][15]. In other words, supportive laboratory diagnostics to detect and identify Ca. N. mikurensis infections are currently limited or reserved to research laboratories. Most importantly, although cultivation of Ca. N. mikurensis has been described in the literature recently [16], it turns out to be quite difficult, even in dedicated laboratories [17,18]. Once one or more Ca. N. mikurensis cultures are generally available, it will become possible to develop more specific and sensitive diagnostic modalities, for example serological tests, which will undoubtedly improve the abilities to detect (endured) infection with Ca. N. mikurensis in clinical practices as well as in epidemiological studies.

Genome Sequencing (Oxford Nanopore Technologies and Illumina)
The DNA from sample 18-2804 was used to prepare a 1D ligation library using the Ligation Sequencing Kit SQK-LSK110 according to the manufacturer's instructions (Oxford Nanopore Technologies, Oxford, UK). ONT libraries were run on a PromethION flowcell (FLO-PRO002) at Future Genomics Technologies BV (Leiden, The Netherlands) using the following settings: basecall model: high-accuracy; basecaller version: Guppy v4. 3.4. Parallel aliquots of both DNA samples (18-2804 and 18-2837) were used to prepare Illumina libraries using the Nextera DNA Flex Library Prep Kit according to the manufacturer's instructions (Illumina Inc. San Diego, CA, USA). Library quality was measured via electrophoresis in D1000 ScreenTape on an Agilent 4200 TapeStation System (Agilent Technologies Netherlands BV, Amstelveen, The Netherlands). The genomic paired-end (PE) libraries were sequenced with a read length of 2 × 150 nt using the Illumina NovaSeq 6000 system. Image analysis and basecalling were performed by the Illumina pipeline.

Genome Sequencing (Oxford Nanopore Technologies and Illumina)
The DNA from sample 18-2804 was used to prepare a 1D ligation library using Ligation Sequencing Kit SQK-LSK110 according to the manufacturer's instructions ( ford Nanopore Technologies, Oxford, UK). ONT libraries were run on a PromethI flowcell (FLO-PRO002) at Future Genomics Technologies BV (Leiden, The Netherlan using the following settings: basecall model: high-accuracy; basecaller version: Gup v4.3.4. Parallel aliquots of both DNA samples (18-2804 and 18-2837) were used to prep Illumina libraries using the Nextera DNA Flex Library Prep Kit according to the ma facturer's instructions (Illumina Inc. San Diego, CA, USA). Library quality was measu via electrophoresis in D1000 ScreenTape on an Agilent 4200 TapeStation System (Agil Technologies Netherlands BV, Amstelveen, The Netherlands). The genomic paired-e (PE) libraries were sequenced with a read length of 2 × 150 nt using the Illumina Nova 6000 system. Image analysis and basecalling were performed by the Illumina pipeline

Genome Assembly and Annotation
Three reference genomes were used for the removal of host-derived ONT reads: A cola amphibious (GCA_903992535.1), Apodemus sylvaticus (GCA_001305905.1), and Myo glareolus (GCA_004368595.1). Contigs were de novo assembled from the unaligned re using Flye v2.8.3-b1695 in standard mode and in metagenome mode [30,31]. The "m genome" contigs were further polished using Medaka v1.4.3 [32]. The filtered reads fr the ONT data set were aligned against the Medaka-polished assembly, which were t subsequently used for a de novo assembly using Flye with polishing using Medaka (F ure 1).  Illumina reads of samples 18-2837 and 18-2804 were aligned against the ONT-based consensus sequence of Ca. N. mikurensis (1,236,636 bp; PromethION derived from spleen 18-2804) using minimap2 v 2.17. Pilon vs. 1.23 [33] was then used to polish the ONT-based consensus sequence of Ca. Neoehrlichia mikurensis (18-2804_Ehrlichia_flye_medaka_prokka.fna) independently with the 18-2837 and 18-2804 set of aligned Illumina reads. Prokka v1.14.6 [34] was used to annotate the polished genome sequences. BUSCO v5.2.2 [35] was used for QC of the annotated genome sequences based on the rickettsiales_odb10 lineage dataset. The sorted Illumina reads of 18-2837 and 18-2804 and the sorted nanopore reads of 18-2804 were aligned back against both polished genome sequences to allow visualization of the Bam/Bai files in IGV v2.12.2 [36]. The presence or absence of prophages was determined using the online tool Phaster [37,38]. Furthermore, SAPP [39] was used for the functional annotation of protein coding genes using InterProScan [40] with PFAM [41]. The web version of eggNOG-Mapper v2 [42,43] was used to determine the Clusters of Orthologous Group (COG) categories for protein encoding regions.
The Neoehrlichia pangenome analysis was performed by following the anvi'o pangenomic workflow, and the mcl inflation was set to 2, using anvi'o v7 [49]. For the analyses, the genomes of Ca. N. mikurensis SE24, SE20, SE 26, Ehrlichia chaffeensis Arkansas, Ehrlichia ruminantium Welgevonden, strain Anaplasma phagocytophilum HZ, and Ca. Neoehrlichia lotoris RAC-413 were downloaded from NCBI (accessions numbers are listed in Supplementary Table S3). The Anvi'o genome databases were annotated using the NCBI COG function. A presence-absence Table was used to generate UpSet plots using the R package UpSetR [50] to visualize unique and shared gene clusters at both the intra-and interspecies levels.

Results
From the 76 M. glareolus captured, 24 spleen samples tested positive for Ca. N. mikurensis DNA in the qPCR analysis (Supplementary Table S1). Four samples with the lowest Ct-values were selected for genomic DNA extraction and microbial enrichment using the NEBNext Microbiome DNA Enrichment Kit (New England BioLabs, Ipswich, MA, USA). The two samples with the highest DNA yield were subjected to genomic sequencing (sample 18-2804 and 18-2837).

Genomes Generated in This Study
Two complete and circular Ca. N. mikurensis genomes derived from mice spleen samples were assembled in this study. The reference genome derived from PromethION (sample 18-2804) was polished with Illumina data from the same sample, resulting in a circular genome referred to as NL07 (GenBank accession no. CP089285). In addition, the PromethION data derived from sample 18-2804 were polished with Illumina data from sample 18-2837, resulting in a second circular genome referred to as NL06 (Gen-Bank accession no. CP089286). Both assemblies presented a complete genome with high BUSCO scores, which increased from 77.1% to >97% after the short reads were used to polish the long-read assembly and were accurately correct for sequence errors (Table 1, Supplementary Files S1 and S2). No prophages were identified in either genome.
The two genome assemblies generated in this study were compared for strain variations (including both indels and nucleotide substitutions), and in total, 250 variants were found (0.02% difference between genomes). Of these, 153 single nucleotide polymorphisms (SNPs), 34 insertions, 47 deletions, and 16 complex variants were detected. Out of these, two missense variants and three deletions were found in regions pertaining to the P44/Msp2 outer membrane protein (Supplementary Table S4). As the consensus sequence NL07 presented a higher level of completeness (BUSCO = 99.2%), and both short and long reads were obtained from the same sample, it was used as our reference genome for all downstream analyses. This reference genome (NL07) has 949 coding sequences (CDS) out of 990 genes as well as 3 rRNAs and 37 tRNAs (Table 1). Coding proteins were classified into functional Clusters of Orthologous Group (COG) categories (Supplementary Table S5), and the output was summarized into the number of coding proteins belonging to each COG category (Supplementary Table S6).

Intraspecies Comparisons
NL07 was compared to three previously published genomes of Ca. N. mikurensis, namely strains SE20, SE24, and SE26, all of which were obtained directly from patient materials in Sweden [18]. Of notable importance, when compared to NL07, the published genomes were on average 124,574 bp smaller. The comparison showed SE26 is the most similar to NL07 (0.02% difference and 165 SNPs), and SE24 is the most distant (0.028% and 257 SNPs) ( Table 2). Moreover, when compared to the published strains, NL07 shows mutations that could translate into phenotypic antigenic differences in the coding region of four P44/Msp2 outer membrane proteins (Table 3, Supplementary Tables S7 and S8). In terms of completeness, the BUSCO scores of SE20, SE24, and SE26 are lower than NL07 (93.7%, 94%, 94.3%, and 99.2%, respectively, Supplementary Files S2-S5), which suggests missing conserved genes in the previously published assemblies. Table 2. Genetic variation between NL07 and SE20 and SE24 and SE26. The Table shows the total amount of variants between NL07 and a given strain (Variant total), the number of multiple nucleotide polymorphisms (Complex), the number of deletions (Deletions), the number of insertions (Insertions), the number of single nucleotide polymorphisms (SNPs), the assembly size (Genome size), and the percentage in difference between NL07 and a given strain in the aligned regions (% difference). When investigating the 124,574 bp that were absent in the published genomes from patient samples, we found three main expansions that contain 31 genes belonging to 26 known protein domains as well as repeats of the outer membrane protein domain PF01617 (Supplementary Tables S10-S12, Supplementary Figure S1). All but one of the missing genes were most closely related to Ca. Neoehrlichia lotoris. The remaining gene was most similar to a domain participating in biotin metabolism found in Ehrlichia chaffeensis (Supplementary Tables S10 and S11). The Clusters of Orthologous Group (COG) categories assigned to these protein-coding genes are related to various essential processes needed for bacterial survival (Table 4), with the most abundant involved in replication, recombination, and repair (seven protein-coding domains) as well as translation, ribosomal structure, and biogenesis (five protein-coding domains). Table 3. P44/Msp2 family outer membrane protein variants between NL07 and Ca. N. mikurensis SE20, SE24, and SE20 assemblies. The effects of the SNPs are presented as synonymous (functionally silent) or nonsynonymous. Nonsynonymous variants, which lead to either a stop codon or a change in protein sequence, are in bold. The absent genes appear in three main gaps. Upon closer inspection, two of these gaps contain repeats of an outer membrane protein belonging to the Pfam PF01617 domain (Figure 2).

Strain
The absent genes appear in three main gaps. Upon closer inspection, two of these gaps contain repeats of an outer membrane protein belonging to the Pfam PF01617 domain ( Figure 2). assembly that align to NL07 and the gaps that SE20 does not encompass, and (C) in blue, the location of repeats of the outer membrane protein repeats belonging to the PF01617 domain in NL07 and SE20. Note that most repeats are present only in NL07.
Mapping the Illumina reads of NL07 to the assembly of SE20 shows that many copies of this surface antigen are stacked on top of a site (positioned around 743,000-755,000 bp in SE20) (Supplementary Figure S3). This may be indicative of a collapsed repeat of the surface antigen explaining part of the discrepancy between genome sizes. Moreover, the repeats of the PF01617 domain represent 25 different e-values ranging from 1.8 × 10 −6 to 3 × 10 −74 that may point to antigenic variation (Supplementary Table S12).

Pangenome Analysis
NL07 was compared to select genomes of the Anaplasmataceae family as well as the Ca. N. mikurensis strains from Sweden ( Table 5). The GC content of our reference genome (26.85%) is comparable to that of the published strains (26.84%) and close to that of E. ruminantium and Ca. N. lotoris (27.48% and 27.75, respectively) that shares a similar genome size (Table 5, Figure 3). Four hundred and sixty-three gene clusters are present across all genomes (Figure 4). Anaplasma phagocytophilum has the largest genome, and 523 unique gene clusters (Figures 3 and 4). In contrast, NL07 has 13 unique gene clusters and 13 that it only shares with the Ca. N. lotoris genome, the only other genome with which it solely shares gene clusters (Figures 3 and 4). Among the shared clusters are one gene cluster connected to cell motility, two related to cell wall/membrane/envelope biogenesis of which one is an outer membrane protein, one connected to translation, ribosomal structure, and biogenesis, and one connected to inorganic ion transport and metabolism and a TPR-like repeat domain (Supplementary Table S13).  assembly that align to NL07 and the gaps that SE20 does not encompass, and (C) in blue, the location of repeats of the outer membrane protein repeats belonging to the PF01617 domain in NL07 and SE20.
Note that most repeats are present only in NL07.
Mapping the Illumina reads of NL07 to the assembly of SE20 shows that many copies of this surface antigen are stacked on top of a site (positioned around 743,000-755,000 bp in SE20) (Supplementary Figure S3). This may be indicative of a collapsed repeat of the surface antigen explaining part of the discrepancy between genome sizes. Moreover, the repeats of the PF01617 domain represent 25 different e-values ranging from 1.8 × 10 −6 to 3 × 10 −74 that may point to antigenic variation (Supplementary Table S12).

Pangenome Analysis
NL07 was compared to select genomes of the Anaplasmataceae family as well as the Ca. N. mikurensis strains from Sweden ( Table 5). The GC content of our reference genome (26.85%) is comparable to that of the published strains (26.84%) and close to that of E. ruminantium and Ca. N. lotoris (27.48% and 27.75, respectively) that shares a similar genome size (Table 5, Figure 3). Four hundred and sixty-three gene clusters are present across all genomes (Figure 4). Anaplasma phagocytophilum has the largest genome, and 523 unique gene clusters (Figures 3 and 4). In contrast, NL07 has 13 unique gene clusters and 13 that it only shares with the Ca. N. lotoris genome, the only other genome with which it solely shares gene clusters (Figures 3 and 4). Among the shared clusters are one gene cluster connected to cell motility, two related to cell wall/membrane/envelope biogenesis of which one is an outer membrane protein, one connected to translation, ribosomal structure, and biogenesis, and one connected to inorganic ion transport and metabolism and a TPR-like repeat domain (Supplementary Table S13).

Discussion
Two novel and complete Ca. N. mikurensis genomes have been generated in this study using a reproducible approach for high-quality whole genome assembly directly from rodent spleens collected in the wild. These genomes expand on our ability to identify potential targets for the development of reliable diagnostic tools for neoehrlichiosis, which are currently lacking for this and some other tick-borne bacteria.
The genomes presented in this study are approximately 10% larger than the existing Ca. N. mikurensis assemblies recently published, which were derived from clinical samples [18]. The discrepancy in chromosome size might be related to genetic divergence rooted in the provenance of the samples or to the difference in technological platforms and assembly approaches employed.
Although the genes missing in the variants from Sweden are all involved in essential processes, one could argue the gene loss is related to pathogenicity gain as has been shown for other intracellular bacteria [52][53][54][55]. In order to investigate this hypothesis, a larger comparison of host-versus patient-derived genomes must be performed.
The genomes in this study are products of a hybrid assembly approach combining long and short reads, while the genomes from Sweden are based on short reads alone. While short reads are highly accurate at the nucleotide level, they lack the ability to reliably elucidate genome structure [23]. When mapped to the assembly of the Swedish variant SE20, our short-read data of NL07 revealed a large spike containing a repeat of an outer membrane protein domain, which has proven to be highly immunogenic in patients infected with other members of the family Anaplasmataceae [56]. Given that the Swedish variants were assembled based on short reads alone, it is possible that this domain, which

Discussion
Two novel and complete Ca. N. mikurensis genomes have been generated in this study using a reproducible approach for high-quality whole genome assembly directly from rodent spleens collected in the wild. These genomes expand on our ability to identify potential targets for the development of reliable diagnostic tools for neoehrlichiosis, which are currently lacking for this and some other tick-borne bacteria.
The genomes presented in this study are approximately 10% larger than the existing Ca. N. mikurensis assemblies recently published, which were derived from clinical samples [18]. The discrepancy in chromosome size might be related to genetic divergence rooted in the provenance of the samples or to the difference in technological platforms and assembly approaches employed.
Although the genes missing in the variants from Sweden are all involved in essential processes, one could argue the gene loss is related to pathogenicity gain as has been shown for other intracellular bacteria [52][53][54][55]. In order to investigate this hypothesis, a larger comparison of host-versus patient-derived genomes must be performed.
The genomes in this study are products of a hybrid assembly approach combining long and short reads, while the genomes from Sweden are based on short reads alone. While short reads are highly accurate at the nucleotide level, they lack the ability to reliably elucidate genome structure [23]. When mapped to the assembly of the Swedish variant SE20, our short-read data of NL07 revealed a large spike containing a repeat of an outer membrane protein domain, which has proven to be highly immunogenic in patients infected with other members of the family Anaplasmataceae [56]. Given that the Swedish variants were assembled based on short reads alone, it is possible that this domain, which appears throughout the genome, collapsed into one locus in said assemblies, explaining part of the discrepancy in genome sizes (Supplementary Figure S2).
The relatively high copy number of this domain could be related to the adaptive immunogenic capabilities of Ca. N. mikurensis [57]. In A. phagocytophilum, this domain has over 113 copies, which has been associated with an increased adaptability to the environment during infection [58], a phenomenon that has been described in the surface protein superfamily (Pfam01617) for A. marginale, E. canis, E. chaffeensis, and E. ruminantium [59]. In A. phagocytophilum, p44/msp2 proteins present strain variability, which could explain why our analyses show SNPs in this domain, between the genomes generated in this study as well as between NL07 and the publicly available Ca. N. mikurensis genomes. Thus, we believe this surface protein family should be studied in depth in order to understand the evolutionary processes involved and how they affect antigenic variation for this potentially emerging pathogen.
The genomes presented in this study provide a foundation for future studies that could explore the antigenic variation of Ca. N. mikurensis. Moreover, we believe that this approach, in which wildlife reservoir host derived tissues are directly used to obtain high-quality whole genomes based on hybrid sequencing, should be employed for other emerging tick-borne pathogens and symbionts.
Supplementary Materials: The following supporting information can be downloaded at: https:// www.mdpi.com/article/10.3390/microorganisms10061134/s1, Supplementary Tables S1-S13: 1_qPCR results of spleen samples, 2_Primers, 3_External genomes, 4_SNPs_NL07-NL06, 5_Eggnog_output_NL07, 6_Eggnog_COGs_NL07, 7_SNPs_NL07-SE20, 8_SNPs_NL07-SE24, 9_SNPs_NL07-SE26, 10_Genes NL07 only, 11_Eggnog_ouput_GAP, 12_PF01617 repeats, 13_Anvio output; Supplementary Files S1-S5: BUSCO_NL06_CP089286, BUSCO_NL07_CP089285, BUSCO_SE20_CP054597, BUSCO_SE24_CP066557, BUSCO_SE26_CP060793; Supplementary Figure  Funding: This study is funded by The Netherlands Organization for Health Research and Development (ZonMw, project number 52200-30-07), which has a peer-reviewed grant application, and by the Dutch Ministry of Health, Welfare, and Sports. H.S. is also supported by the NorthTick, European Union, European Regional Development Fund, in the North Sea Region Program. None of the funding organizations had or will have any role in the design or the data analysis and interpretation of the study.