A Targeted “Next-Generation” Sequencing-Informatic Approach to Define Genetic Diversity in Theileria orientalis Populations within Individual Cattle: Proof-of-Principle

Oriental theileriosis is an economically important tickborne disease of bovines, caused by some members of the Theileria orientalis complex. Currently, 11 distinct operational taxonomic units (OTUs), or genotypes, are recognized based on their major piroplasm surface protein (MPSP) gene sequences. Two of these genotypes (i.e., chitose and ikeda) are recognized as pathogenic in cattle, causing significant disease in countries of the Asia-Pacific region. However, the true extent of genetic variation and associated virulence/pathogenicity within this complex is unknown. Here, we undertook a proof-of-principle study of a small panel of genomic DNAs (n = 13) from blood samples originating from individual cattle known to harbor T. orientalis, in order to assess the performance of a targeted “next-generation” sequencing-informatic approach to identify genotypes. Five genotypes (chitose, ikeda, buffeli, type 4, and type 5) were defined; multiple genotypes were found within individual samples, with dominant and minor sequence types representing most genotypes. This study indicates that this sequencing-informatic workflow could be useful to assess the nature and extent of genetic variation within and among populations of T. orientalis on a large scale, and to potentially employ panels of distinct gene markers for expanded molecular epidemiological investigations of socioeconomically important protistan pathogens more generally.


Introduction
Tickborne diseases (TBDs) substantially affect livestock in many parts of the world and have a major negative impact, particularly on resource-poor farming communities [1]. Key TBDs are caused by pathogens including Anaplasma marginale and A. centrale (anaplasmosis), Babesia bovis, B. bigemina and B. divergens (babesiosis), Ehrlichia ruminantium (cowdriosis/heartwater), as well as Theileria annulata, T. parva, and members of the T. orientalis complex (theileriosis), particularly in the tropics and subtropics [2].
Outbreaks of oriental theileriosis have had a significant, adverse economic impact on cattle farms in Australasia and parts of Asia [3][4][5][6][7][8]. The use of DNA-based diagnostic and analytical tools to establish the distribution and prevalence of distinct genetic variants (genotypes) within the T. orientalis complex has been central to the tracking and monitoring of oriental theileriosis outbreaks. Currently, 11 distinct T. orientalis genotypes, or operational taxonomic units (OTUs), including chitose (= type 1),

Characteristics of Sequence Datasets, and Clustering of Reads to Define Genotypes
A total of 219,238 "short reads" were generated from all MPSP amplicons. After processing with the SeekDeep algorithm, 19,439 reads were excluded, as they represented either singletons, chimeras, or were of poor quality, leaving 199,799 reads, and a mean count of 15,329 reads per sample (Table 1). Of the 13 samples, all but two (A7 and A17) contained mixed genotypes, and one sample (A11) contained all five genotypes identified in this study. The numbers of reads that clustered into individual genotypes were as follows: 76,955 (chitose), 52,053 (ikeda), 52,236 (buffeli), 1905 (type 4), and 16,650 (type 5) ( Table 1 and Table S1); chitose was the commonest genotype, being identified in 11 of 13 samples; ikeda was recorded in nine samples; type 5 in four samples; and type 4 in two of the 13 samples (Table 1, Figure 1). Within genotype chitose, subgenotypes A and B (cf. [22]) were identified in six and eight samples, respectively.

Extensive MPSP Sequence Diversity within and among Samples
In total, 110 distinct MPSP sequences were identified in all 13 samples as follows: 37 for chitose, 58 for ikeda, 12 for buffeli, one for type 4 and two for type 5 (Tables 1 and S1; Figures 1 and S1). Of these 110 unique sequences, 72 (65.5%) were detected in a single sample (A17), whereas 38 (34.5%) were found in multiple samples. The number of different sequences, referred to as the "multiplicity of infection" (MOI) within a sample, ranged from four to 39 (Table 1 and Figure 1). Some samples  Table 1).

Extensive MPSP Sequence Diversity within and among Samples
In total, 110 distinct MPSP sequences were identified in all 13 samples as follows: 37 for chitose, 58 for ikeda, 12 for buffeli, one for type 4 and two for type 5 (Table 1 and Table S1; Figure 1 and Figure S1). Of these 110 unique sequences, 72 (65.5%) were detected in a single sample (A17), whereas 38 (34.5%) were found in multiple samples. The number of different sequences, referred to as the "multiplicity of infection" (MOI) within a sample, ranged from four to 39 (Table 1 and Figure 1). Some samples contained one genotype, yet many sequences representing the same genotype reflected a higher MOI (e.g., with A7 containing as many as 38 sequence variants representing ikeda). Similarly, samples A9 Pathogens 2020, 9, 448 4 of 8 and A17 had high MOI values (39 and 25, respectively, Figure 1), with little or no genotypic diversity (A9 contained ikeda and type5, whereas A17 contained only chitose), contrasting sample A2 which had the lowest MOI (= 4) ( Table 1 and Figure 1).
When conceptually-translated amino acid sequences were compared with the dominant (reference) sequence representing each of the five genotypes, 85 of the 110 amino acid sequences were unique. Therefore, most mutations were non-synonymous and were single point mutations (SNPs). Sequences representing each buffeli and chitose B differed at multiple nucleotide positions, whereas those representing each ikeda and chitose A varied only at single positions. There was only one sequence variant (representing ikeda) whose reading frame was disrupted by a premature stop codon (cf. Figure S1).

Genotypes within Samples Usually Represented by a Unique, Dominant Nucleotide Sequence
Individual genotypes within individual samples were mostly represented by a dominant "majority" sequence and, typically, sequences differed from this dominant sequence by only one or two nucleotides ( Figure S1). For example, 38 sequence variants were identified in Sample A7, with a dominant sequence equating to 57% of 8167 reads, while the next commonest sequence was represented by 3.7% of all reads ( Figure 1). For this sample, which contained solely the genotype ikeda, there was only a point mutation difference between the dominant sequence and all other sequence types. Likewise, sequences representing chitose A differed by single point mutations, whereas some representing chitose B and buffeli varied at multiple positions ( Figure S1). For the 11 samples containing chitose, coinfections of subgenotypes A and B were detected in three samples, type A was found exclusively in three samples, and type B was found exclusively in five samples (Table 1). Of all 110 sequence variants, most (n = 97) were novel, meaning that they did not have an identical match to any published sequence available in the GenBank database.

Discussion
Here, we established proof-of-principle for a targeted NGS, i.e., a bioinformatic approach to reliably and directly discern genotypes within the T. orientalis complex using a portion of the MPSP gene as the marker. Using this approach, we simultaneously detected five genotypes of the T. orientalis complex and, surprisingly, a total of 110 distinct sequence variants in all samples analyzed. As SeekDeep has been shown to eliminate artefactual sequences and retain sequence variants with very high confidence [21], we were able to infer the multiplicity of infection (MOI) of particular genotypes. The findings revealed that mixed-genotypic infections were common, in spite of the small sample size tested in this study. The ability to reliably discern sequence types, and attribute them to particular populations should prove useful for comprehensive investigations of genetic variation within species of Theileria. The ability to associate sequence types and genotypes with clinical and asymptomatic cases would facilitate the identification and tracking of sources of infection, and the monitoring of outbreaks (cf. [9]). NGS has been applied previously to Theileria, but in a manner distinct from this proof-of-principle study. Recently, a whole genome sequencing (WGS) method has been used to explore the T. orientalis complex in cattle [23], but such an approach is not presently scalable for routine monitoring or diagnostic purposes. Another WGS method was employed to investigate hemoparasites in ticks [24], but the amount of data produced was excessive and the procedure was costly as compared with targeted NGS which used one or a few loci to detect infections [25]. Targeted NGS approaches have been applied to selected SSU markers, resulting in the detection of 14 taxa of Babesia and Theileria in ruminants [14], Theileria species in African antelopes [26] and buffaloes [16,17], and Hepatozoon, Babesia, and Trypanosoma species in dogs [15]. However, the SSU region does not provide the resolution required for assessing genetic variation within taxa. An NGS approach with some similarity to ours has been applied previously to Plasmodium falciparum using regions of genes that encode the merozoite surface protein (MSP) [19,20], the P. falciparum apical membrane antigen 1 (pfama1 [18,27]) and genome-wide SNPs (reviewed by [28]). In one study [21], the authors employed targeted NGS of multiple proposed vaccine candidate-and drug resistance-genes, along with mock microbiome datasets.
The gene encoding MPSP, originally described as a 32 kDa immunogenic protein with a promise to be a vaccine candidate [29], is orthologous to the p32 gene of T. parva and the Tams1 gene of T. annulata, which are both known for having a mosaic pattern of antigenic diversity [9,30]. Accordingly, the high levels of sequence (genetic) variation found within and among some samples in this study (Table 1 and Figure 1) had been assumed, but its extent was not appreciated based on published information for T. orientalis. Interestingly, Hemmink et al. [13] examined allelic variation for six antigen-encoding genes of T. parva using a targeted NGS method and found a substantial number of gene variants within blood samples from some cattle investigated, indicating multiple infection event(s) in these animals. Similarly, in our study, the detection of up to five distinct genotypes in some samples studied here (e.g., A11, Figure 1) suggests that multiple infection events had occurred via suitable tick vectors or a single transmission of multiple genotypes by a tick [13]. We hypothesize that this marked variation reflects antigenic drift or shift in the pathogen(s) to evade or subvert immunological attack by the host animal. This hypothesis warrants testing. Ectopic recombination, as reported for the "var" genes of P. falciparum (see [31][32][33]) could be an alternative possibility, although such recombination could result in marked chromosomal rearrangement and organismal destruction [34].

Materials and Methods
Blood samples (n = 13) from cattle (Bos taurus or Bos indicus) were available from previously published studies, approved by the Research and Ethics Committee of the College of Veterinary and Animal Sciences, University of Bahawalpur, Pakistan [35] and the Ethiopian National Research Ethics Review Committee [36]. These blood samples were known to contain T. orientalis DNA based on results of conventional PCR testing [35,36].
Genomic DNAs were extracted from 200 µL of each of the whole-blood samples by sodium-dodecyl sulphate (SDS)-proteinase K digestion and mini-column purification using a commercial kit (DNeasy ® Blood & Tissue, Qiagen, Hilden, Germany), following the manufacturer's instructions, and samples were stored at −20 • C until further analysis. Then, a 344 bp region of the MPSP gene was amplified from individual samples using primers MPSP-AJ-F (5-TTC ACT CCA ACA GTC GCC CAC A-3) and MPSP-AJ-R1 (5-ACG TAA ACT TTG ACT GCG GTG-3) [37] in 50 µL containing 10 mM Tris-HCl (pH 8.4), 50 mM KCl (Promega, Madison, WI, USA), 3.5 mM MgCl 2 , 6.25 M of each deoxynucleotide triphosphate (dNTP), 100 pmol of each primer, and 1 U of GoTaq polymerase (Promega, Madison, WI, USA). The PCR cycling conditions were as follows: an initial denaturation at 95 • C for 5 min, followed by 35 cycles at 95 • C for 30 s, 58 • C for 30 s, 72 • C for 1 min, followed by 72 • C for 5 min. No-DNA (negative) and known positive (T. orientalis) control samples were included in each PCR run. PCR products were examined following agarose (1.5%) gel electrophoresis, and then subjected to Illumina sequencing on a MiSeq platform at Novogene (Beijing, China; (en.novogene.com) using the established protocol from Illumina. Paired-end sequence data (consensus length 301 bp) from which adaptors were trimmed were provided in Fastq files. Using SeekDeep v2.6.0 software [21], with settings designed specifically for Illumina data, the "clusters" algorithm was used to remove singletons, low quality reads, and chimeras. Then, sequences were clustered using the default cut-off (0.5%) for a fraction of all reads within a cluster to minimize errors. The "process clusters" algorithm was used to combine the reads from all amplicons. The, unique sequences were converted into the FASTA file format. Values for the multiplicity of infection (MOI, the number of distinct sequences within an individual sample) and nucleotide diversity among all samples were calculated using DnaSP v5 [38]. All sequence data were deposited in the GenBank database (accession nos. MT428433 to MT428542).

Conclusions
In this study, first, we elected to critically assess the sequencing and informatic approach to convince ourselves of the reliability of the datasets and the ability of the SeekDeep algorithm to eliminate PCR-induced artefacts from the datasets. Our results indicate that this approach is useful and has the potential to be applied to a large panel of distinct genetic markers, preferably derived from well-assembled genomes. We believe that a focus on using multiple divergent single-copy genes would be advantageous, circumventing problems associated with in vitro recombination of paralogous sequences (representing multigene families) during PCR. We envisage that the use of a multiplex PCR-based NGS-informatic workflow, employing replicates to ensure repeatability and reproducibility and to gain further confidence in MOI data, should provide a unique platform for future molecular epidemiological surveys and population genetic analyses of protists of veterinary and medical importance.
Supplementary Materials: The following are available online at http://www.mdpi.com/2076-0817/9/6/448/s1, Figure S1: Nucleotide alignments of the sequence variants for the Theileria orientalis complex (including genotypes buffeli, chitose (A & B), and ikeda), produced by targeted NGS using a region of the major piroplasm surface protein gene (MPSP). Alignments are compared to the most frequent sequence variant for each genotype. Synonymous nucleotide changes are in bold, and nonsynonymous amino acid changes are highlighted in colour. Genotypes type 4 and type 5 were omitted here, as only three sequences were recorded, Table S1: Codes and GenBank accession numbers of all 110 (partial) MPSP sequences obtained from 13 bovine blood DNA samples tested for Theileria orientalis using next-generation sequencing-based bioinformatic approach. Total number of the sequence reads for individual genotypes in samples are listed.