Mind the Gap: New Full-Length Sequences of Blastocystis Subtypes Generated via Oxford Nanopore Minion Sequencing Allow for Comparisons between Full-Length and Partial Sequences of the Small Subunit of the Ribosomal RNA Gene

Blastocystis is a common food- and water-borne intestinal protist parasite of humans and many other animals. Blastocystis comprises multiple subtypes (STs) based on variability within the small subunit ribosomal (SSU rRNA) RNA gene. Though full-length reference sequences of the SSU rRNA gene are a current requirement to name a novel Blastocystis subtype, full-length reference sequences are not currently available for all subtypes. In the present study, Oxford Nanopore MinION long-read sequencing was employed to generate full-length SSU rRNA sequences for seven new Blastocystis subtypes for which no full-length references currently exist: ST21, ST23, ST24, ST25, ST26, ST27, and ST28. Phylogenetic analyses and pairwise distance matrixes were used to compare full-length and partial sequences of the two regions that are most commonly used for subtyping. Analyses included Blastocystis nucleotide sequences obtained in this study (ST21 and ST23–ST28) and existing subtypes for which full-length reference sequences were available (ST1–ST17 and ST29). The relationships and sequence variance between new and existing subtypes observed in analyses of different portions of the SSU rRNA gene are discussed. The full-length SSU rRNA reference sequences generated in this study provide essential new data to study and understand the relationships between the genetic complexity of Blastocystis and its host specificity, pathogenicity, and epidemiology.


Introduction
Blastocystis sp. is one of the most common intestinal parasites observed in humans and has a global distribution [1,2]. It is spread via the fecal-oral route, with contaminated food and water being likely means of transmission [3][4][5]. While human infection has been associated with both gastrointestinal and extraintestinal symptoms, asymptomatic carriage is quite common, thus making the relationship between infection and disease unclear [6][7][8]. Blastocystis is also commonly observed in animal hosts, and while studies on symptomatic infection in animals are lacking, there is some evidence to support the zoonotic transmission of Blastocystis between humans and animals [2].
One factor thought to be key in explaining the pathogenicity and zoonotic potential of Blastocystis is the existence of a remarkable degree of genetic variability within the genus. This genetic variability has thus far been largely described using variability within the small subunit ribosomal RNA gene (SSU rRNA gene). Based on differences within the SSU rRNA gene, Blastocystis is divided into genetic groupings called subtypes. Currently, subtype designations are assigned using a numbering scheme that is chronological and based on publication order [9].
To be considered a candidate for a novel subtype designation, it is suggested in current proposed guidelines that a sequence should meet several inclusion criteria including:

Source of Blastocystis Isolates
Five DNA samples containing Blastocystis obtained from animal fecal isolates were used in this study (Table 1). All isolates used in this study had been previously typed using Sanger and/or next generation sequencing using previously reported protocols [12,13]. We selected samples that had been identified as containing Blastocystis subtypes ST21 and ST23-ST28 for which no published full-length sequences are currently available.

PCR Amplification of the Full-Length SSU rRNA Gene
The approximately 1800 base pair SSU rRNA gene was amplified by PCR using a previously described Nanopore sequencing strategy [15]. Briefly, a PCR using SSU-F1 (5'-AAC CTG GTT GAT CCT GCC AGT AGT C-3') and SSU-R1 (5'-TGA TCC TTC TGC AGG TTC ACC TAC G-3') that amplifies the SSU rRNA gene of most eukaryotic organisms was

Source of Blastocystis Isolates
Five DNA samples containing Blastocystis obtained from animal fecal isolates were used in this study (Table 1). All isolates used in this study had been previously typed using Sanger and/or next generation sequencing using previously reported protocols [12,13]. We selected samples that had been identified as containing Blastocystis subtypes ST21 and ST23-ST28 for which no published full-length sequences are currently available.

PCR Amplification of the Full-Length SSU rRNA Gene
The approximately 1800 base pair SSU rRNA gene was amplified by PCR using a previously described Nanopore sequencing strategy [15]. Briefly, a PCR using SSU-F1 (5'-AAC CTG GTT GAT CCT GCC AGT AGT C-3') and SSU-R1 (5'-TGA TCC TTC TGC AGG TTC ACC TAC G-3') that amplifies the SSU rRNA gene of most eukaryotic organisms was performed [17]. Each reaction used 1 µM forward and reverse primers, as well as 12.5 µL of KAPA HiFi HotStart ReadyMix (KAPABioSystems, Cape Town, South Africa), in a 25 µL reaction volume. Initial denaturation was performed at 98 • C for 5 min, followed by 35 cycles of amplification (20 sec at 98 • C, 45 sec at 60 • C, and 90 sec at 72 • C) and final extension for 5 min at 72 • C. Following amplification, amplicons were visualized using a QIAxcel (Qiagen, Valencia, CA, USA) and quantified using a Qubit fluorometer (ThermoFisher Scientific, Waltham, MA, USA).
The Nanopore sequencing library was prepared using the Oxford Nanopore Technologies (ONT) SQK-LSK109 Ligation Sequencing Kit (ONT, Oxford, UK) following the manufacturer's protocol for Amplicons by Ligation (ACDE_9064_v109_revQ_14Aug2019). Amplicons were quantified and diluted to ensure that 150 fmol of DNA were used as input into library prep, as recommended by the protocol. The nanopore library was run on an R9.4 flow cell (FLO-MIN106) using an ONT MinION Mk1B and MinKNOW v20.06.15 software. Basecalling was performed using ONT Guppy v4.0.11 (gpu) using a minimum quality score cut off of 7 for filtering low-quality reads. FASTQ reads were also length-filtered to only include reads between 1700 and 2000 nucleotides. Reads were then corrected using canu v2.1, and consensus sequences were generated by clustering reads using the vsearch-cluster_fast command (vsearch v2.14.1) with a 98% identity threshold, checked for chimeras, and polished as previously described [16].
For comparison purposes, for each same sample, full-length sequences and partial sequences obtained with MinION and MiSeq, respectively, were aligned using ClustalW in MegAlign 15 (DNASTAR Lasergene 15, Madison, WI, USA), and pairwise distances between consensus sequences were calculated. The nucleotide sequences generated in this study were deposited in GenBank under the accession numbers MW887928-MW887935.

Phylogenetic and Pairwise Distance Analyses
The full-length SSU rRNA gene nucleotide sequences obtained in this study, appropriate full-length Blastocystis reference nucleotide sequences obtained from the reference database found at http://entamoeba.lshtm.ac.uk/blastorefseqs.htm (accessed on 17 March 2021), and other full-length sequences available in GenBank were included to generate a phylogenetic tree rooted using Proteromonas lacertae, a Stramenopile (which is closely related to Blastocystis) as an outgroup (Table 2). Nucleotide sequences were aligned with the Clustal W algorithm using MEGA X, and phylogenetic analyses were performed using neighbor-joining (NJ) and maximum-likelihood (ML) methods and genetic distances calculated with the Kimura 2-parameter model using MEGA X [18,19]. Because the 64 nucleotide sequences used in this study differ in length, ends of sequences were trimmed. There were a total of 1953 positions in the final dataset. Bootstrapping with 1000 replicates was used to determine support for the generated clades. Evolutionary analysis was conducted to establish divergence between sequences (pairwise distance) using the Kimura 2-parameter model in MEGA X [18,19].
For the purpose of comparison, analyses employing the same 64 nucleotide sequences utilized for full-length analyses were conducted for those regions of the SSU rRNA gene amplified by the two most common standard primers sets used for amplifying and sequencing Blastocystis in survey studies [13,14] (Figure 1). There were a total of 587 and 568 positions in the final datasets for the barcoding and Santin regions, respectively.

Results
3.1. Full-Length SSU rRNA Gene for ST21, ST23, ST24, ST25, ST26, ST27, and ST28 The five samples included in this study were selected based on the results of screening via MiSeq sequencing of an approximately 500 bp region of the SSU rRNA gene and were found to contain ST21, ST23, ST24, ST25, ST26, ST27, and/or ST28. Nanopore sequencing was performed to produce full-length SSU rRNA gene sequences for these seven Blastocystis subtypes because no full-length references were available. Full length-sequences were successfully obtained for all seven subtypes for which no full-length reference sequences currently exist. Multiple subtypes were obtained from three of the five samples with ST21 and ST24, ST23 and ST26, and ST27 and ST28 being present in the same sample (Table 1). Agreement between the MiSeq sequence and the MinION sequence for each subtype was high, ranging from 99.6 to 100% for all sequences (Table 1).

Phylogenetic Analyses
Phylogenetic analyses were performed to generate trees using full-length SSU rRNA gene sequences from this study and references sequences for all other accepted Blastocystis subtypes (ST1-ST17, ST21, and ST23-ST29) with Proteromonas lacertae included as an outgroup (Table 2). Both the ML and NJ phylogenetic trees generated using the full-length MinION sequences reported in this study (ST21, ST23-ST27) showed similar topologies (Figures 2 and 3). The subtype clustering observed here was similar to prior ML analyses performed for subtypes ST1-ST10 and ST13-ST17 using full-length sequences (no fulllength references were available for ST11 and ST12 at that time) [15]. However, in the present analysis, ST16 no longer formed an external branch and instead nested within a clade formed with ST3 (Figures 2 and 3). Clades formed by ST23, ST27, and ST28 were supported by a bootstrap proportion of 100 by both NJ and ML. Clades formed by ST21 and ST26 were supported by a bootstrap proportion of 99 and 100 for ML and NJ, respectively. ST25 formed a clade with ST14 with bootstrap support of 97 and 99 for ML and NJ, respectively. The bootstrap support values for clades ST14 and ST25 were 45 and 93 for ML and NJ, respectively.  (Table 2). Proteromonas lacertae was used as outgroup taxon to root the tree. Analysis was conducted by a maximum likelihood method. Genetic distances were calculated using the Kimura two-parameter model. This analysis involved 64 nucleotide sequences, and there were a total of 1953 positions in the final dataset. Bootstrap values lower than 75% are not displayed.  (Table 2). Proteromonas lacertae was used as outgroup taxon to root the tree. Analysis was conducted by a maximum likelihood method. Genetic distances were calculated using the Kimura two-parameter model. This analysis involved 64 nucleotide sequences, and there were a total of 1953 positions in the final dataset. Bootstrap values lower than 75% are not displayed.  (Table 2). Proteromonas lacertae was used as outgroup taxon to root the tree. Analysis was conducted by a neighbor-joining method. Genetic distances were calculated using the Kimura two-parameter model. This analysis involved 64 nucleotide sequences, and there were a total of 1953 positions in the final dataset. Bootstrap values lower than 75% are not displayed.
Phylogenetic trees using ML and NJ were also constructed using partial SSU rRNA gene sequences of the barcoding region (Figures 4 and 5) and the Santin region, which was used in the initial designation of ST23-ST28 (Figures 6 and 7). The topology of the ML and NJ trees generated using partial gene regions showed similar relationships between ST21, ST23, ST24, ST25, ST26, ST28, and the rest of the subtypes when compared to full-length sequences (Figures 4-7). Clades formed by ST23, ST27, and ST28 using the barcoding region in ML tree were supported by bootstrap proportions of 99, 92, and 99, respectively ( Figure 4). By NJ, bootstrap proportions for ST23, ST27, and ST28 clades were 100, 64, and 100, respectively ( Figure 5). Similarly, clades formed by ST23, ST27, and ST28 using the Santin region were all supported by a bootstrap proportion of 100 by ML, respectively ( Figure 6) and a bootstrap proportion of 100, 100, and 99 by NJ, respectively ( Figure 7). However, ST27 formed a clade with ST9 in the tree for the barcoding region for ML ( Figure 4) but basally branched in NJ ( Figure 5) and basally branched for both ML and NJ in the Santin region ( Figures 6 and 7). Clades formed by ST21 and ST26 were supported by a bootstrap proportion of 80 and 94 using barcoding for ML and NJ, respectively (Figures 4 and 5) and 99 and 88 using the Santin region for ML and NJ, respectively. As observed in full-length analyses, ST25 formed a clade with ST14 using the Santin region with a bootstrap support of 97 for ML and NJ (Figures 6 and 7). For the barcoding region, ST24 and ST25 formed a clade together with bootstrap support of 85 for NJ ( Figure 4), but this clade was not supported for ML ( Figure 5).

Pairwise Distance
Comparison of pairwise distances for all full-length sequences was also performed to determine average sequence distances between new full-length references and currently accepted subtypes ( Table 2). The percentage of average sequence similarity among all subtypes ranged from 78 to 98%. For pairs of subtypes including the new full-length references ST24/ST25, ST14/ST24, ST14/ST25, and ST10/ST23 were all found to share 98% of sequence identity across the full-length SSU rRNA gene (Table 3). ST21 and ST26 were found to share 96% of sequence identity across the full-length SSU rRNA gene ( Table 3). Comparisons of pairwise distances of partial sequences for the barcoding region (Table 4) and Santin region (Table 5) were also performed. The percentage of sequence similarity among all sequences for the barcoding region ranged from 80 to 99%. The barcoding region showed a sequence similarity of ≥96% between several of the new reference sequences and existing subtypes (Table 3). Within the barcoding region, a sequence similarity of 99% was observed between pairs ST24/ST25, ST14/ST24, ST14/ST25, ST10/ST23, and ST21/ST26. The percentage of sequence similarity among all sequences for the Santin region ranged from 66 to 96% (Table 4). Furthermore, pairwise distance comparisons for subtypes at the Santin region exhibited greater degrees of divergence between sequences, with sequence similarities between subtypes observed at 97% only for pair ST25/ST14 and 96% for ST14/ST24 and ST24/ST25. The rest of the pairwise distances were lower than 96% (Table 5).  (Table 2). Proteromonas lacertae was used as outgroup taxon to root the tree. Analysis was conducted by a maximum likelihood method. Genetic distances were calculated using the Kimura two-parameter model. This analysis involved 64 nucleotide sequences, and there were a total of 587 positions in the final dataset. Bootstrap values lower than 75% are not displayed.  (Table 2). Proteromonas lacertae was used as outgroup taxon to root the tree. Analysis was conducted by a maximum likelihood method. Genetic distances were calculated using the Kimura two-parameter model. This analysis involved 64 nucleotide sequences, and there were a total of 587 positions in the final dataset. Bootstrap values lower than 75% are not displayed.  (Table 2). Proteromonas lacertae was used as outgroup taxon to root the tree. Analysis was conducted by a neighbor-joining method. Genetic distances were calculated using the Kimura two-parameter model. This analysis involved 64 nucleotide sequences, and there were a total of 587 positions in the final dataset. Bootstrap values lower than 75% are not displayed.  (Table 2). Proteromonas lacertae was used as outgroup taxon to root the tree. Analysis was conducted by a neighbor-joining method. Genetic distances were calculated using the Kimura two-parameter model. This analysis involved 64 nucleotide sequences, and there were a total of 587 positions in the final dataset. Bootstrap values lower than 75% are not displayed. Figure 6. Phylogenetic relationships among Blastocystis partial SSU rRNA gene sequences of the Santin region generated in the present study (represented with a black filled circle) and representative reference sequences of the accepted subtypes (Table 2). Proteromonas lacertae was used as outgroup taxon to root the tree. Analysis was conducted by a maximum likelihood method. Genetic distances were calculated using the Kimura two-parameter model. This analysis involved 64 nucleotide sequences, and there were a total of 568 positions in the final dataset. Bootstrap values lower than 75% are not displayed.  (Table 2). Proteromonas lacertae was used as outgroup taxon to root the tree. Analysis was conducted by a maximum likelihood method. Genetic distances were calculated using the Kimura two-parameter model. This analysis involved 64 nucleotide sequences, and there were a total of 568 positions in the final dataset. Bootstrap values lower than 75% are not displayed.  Table 2). Proteromonas lacertae was used as outgroup taxon to root the tree. Analysis was conducted by a neighbor-joining method. Genetic distances were calculated using the Kimura two-parameter model. This analysis involved 64 nucleotide sequences, and there were a total of 568 positions in the final dataset. Bootstrap values lower than 75% are not displayed.  (Table 2). There were a total of 1953 positions in the final dataset. Red background indicates pairwise distance lower than 0.04.  Table 4. Pairwise distances between Blastocystis subtypes barcoding region sequences showing the average number of base substitutions per site. Analyses were conducted using the Kimura 2-parameter model and included 64 nucleotide sequences (Table 2). There were a total of 587 positions in the final dataset. Red background indicates pairwise distance lower than 0.04.   (Table 2). There were a total of 568 positions in the final dataset. Red background indicates pairwise distance lower than 0.04.

Discussion
Current recommendations indicate that new subtype designations should only be given if >80% of the approximately 1800 bp SSU rRNA gene has been sequenced and demonstrated to vary from any known subtype by at least 4% across the full-length of the potentially novel sequence [9]. With these proposed guidelines, obtaining fulllength reference sequences of the SSU rRNA gene of Blastocystis is now essential for accurate subtype identification and for determining novel subtypes. Additionally, fulllength sequences of the SSU rRNA gene allow for the construction of phylogenies that can aide in more accurately describing the relationship between Blastocystis subtypes from humans and other animal hosts. However, this requirement has proven difficult to achieve, and several of the up-to 29 proposed subtypes have been named using only partial gene sequences. This issue is highlighted by the fact that subtypes described after 2013, ST18-ST28, were all named using a fragment of the SSU rRNA gene [10,20,21]. A recent review of new subtypes ST18-ST26 suggested that official acceptance be withheld for ST21 and ST23-26 until more data including full-length SSU rRNA reference sequences can be analyzed, while it was recommended that ST18-ST20 and ST22 be rejected on the basis of appearing to be artifactual sequences [9].
It was recently demonstrated that MinION long read sequencing could be used to successfully produce accurate full-length reference sequences of the Blastocystis SSU rRNA gene [15]. This method was validated using both cultured and fecal isolates of Blastocystis, and its suitability for the detection of sequences from samples containing both multiple subtypes and multiple variants of the same subtype was demonstrated. MinION sequencing was also successfully employed to produce the first full-length reference sequence for ST11 [16]. In the current study, MinION sequencing was employed to effectively produce full-length SSU rRNA reference sequences for the seven Blastocystis subtypes for which there are no known published full-length reference sequences (ST21, ST23, ST24, ST25, ST26, ST27, and ST28). The availability of full-length sequences of the SSU rRNA gene for all Blastocystis subtypes is essential to establish relationships among subtypes.
All subtype sequences generated in this study were produced using samples that were pre-screened using Illumina MiSeq sequencing of the 500 bp region of the SSU rRNA gene, which is sometimes referred to as the Santin region [12,13]. The pre-screening of samples via MiSeq was employed to determine both the intra and potential inter-subtype diversity of Blastocystis present in samples, as well as to provide comparison sequences for assessment of accuracy. All MinION-generated sequences were found to match their MiSeq counterpart with 99.6-100% similarity observed between sequences generated using the two methods (Table 1). This high degree of similarity between the two sequencing methods is in agreement with previously published reports that compared the two methods and further supports the use of MinION sequencing for reference sequence generation [11,16].
The current recommendation for a new subtype designation states that a phylogenetic analysis should be performed to ensure that new Blastocystis subtypes do not nest within any previously known subtypes [9]. This requirement is unclear because all phylogenies are composed of clades nested within larger clades, but we presume to interpret this recommendation to mean strong support for branching demonstrated through high bootstrap support (BS). Thus, phylogenetic analyses were performed using both full-length reference sequences generated in this study via MinION sequencing and reference sequences of established Blastocystis subtypes ( Table 2). All new subtype sequences reported here did, in fact, form branches that were supported by high bootstrap support (≥96 BS), with the exception of ST25 (45 BS in the ML tree and 93 BS in the NJ tree) (Figures 2 and 3). Low BS for the branching of ST5 from the clade containing ST12 and ST13 (19 BS), as well as ST12 and ST13 (24 BS) in the ML tree, was also observed. Thus, the phylogenetic analysis of full-length SSU rRNA sequences strongly supported the assignment of subtype designations for ST21, ST23, ST24, ST25, ST26, ST27, and ST28, when using NJ. There was weaker support for ST25 using ML, but this was also observed for previously accepted subtypes using ML.
The addition of new full-length SSU rRNA gene reference sequences to the phylogenetic analysis of the Blastocystis SSU rRNA gene had mostly subtle effects on the overall topology of the tree when compared to previous full-length analyses (Figures 2 and 3) [15]. However, a notable shift in the branching pattern for ST16 occurred following the inclusion of full-length references for ST21-ST29. Previous phylogenetic analyses that included ST1-ST17 have concluded that ST16 lacks a specific related mammalian lineage [15,22]. However, in the present analyses, ST16 formed a clade with ST3, thus indicating that it may, in fact, share common ancestry with other subtypes of mammalian lineage (Figures 2 and 3). This finding also supports the importance of full-length reference sequences in informing our understanding of the relationships among Blastocystis subtypes.
Interestingly, in the full-length SSU rRNA gene trees, branching patterns for the new subtypes that were originally identified in ruminants (ST21, ST23, ST24, ST25, and ST26) indicated common ancestry for ST21 and ST26, ST23 and ST10, and ST14, ST24, and ST25 (Figures 2 and 3). ST21, which was originally reported in a waterbuck in China, forms a clade with ST26 that was originally named in cattle in the United States [20,21]. Both ST21 and ST26 have been most frequently identified in cattle, so their shared ancestry may support their host specificity in ruminants [2]. A similar relationship between ST14, ST24, and ST25 is also documented, with ruminants being the most common hosts of these subtypes [2]. ST10 and ST23 branch together and form a clade with ST4 and ST8 (Figures 2 and 3). Both ST10 and ST23 are most commonly reported in cattle, and it has been suggested that ST10 is cattle-adapted subtype [2,20,23]. The data presented here support a shared ancestry for ST10 and ST23 but indicate a more distant relationship to other subtypes common to ruminants. Though ST21, ST23, ST24, ST25, and ST26 form a clade shared with other subtypes commonly reported in mammals, ST27 and ST28 (which were originally reported in birds) do not (Figures 2 and 3) [10]. ST27 forms a clade that includes ST6, ST7 and ST9, and two of those subtypes-ST6 and ST7-are the most common subtypes reported in birds and have been suggested to be host-adapted to birds [2,24]. The full-length SSU rRNA tree presented here supports the common ancestry of these potentially avian-adapted subtypes. ST28 forms a clade with ST15 (Figures 2 and 3). ST15, while commonly reported in mammals, appears to be distantly related to other mammalian subtypes [2]. ST28 joins ST15 in representing the two most basally branching subtypes of Blastocystis.
Phylogenies were also generated from two SSU rRNA gene regions commonly used to subtype Blastocystis when studying subtype distribution and frequency in different hosts [13,14]. While it was observed that the topology of the two regions for trees generated using ML and NJ were similar for the relationships of ST21, ST23, ST24, ST25, ST26, and ST28, there was a differences in branching for ST27, which formed an external branch in the barcoding region NJ tree ( Figure 5) but branched with ST9 in the barcoding ML tree ( Figure 4) and in both NJ and ML for the Santin region tree (Figures 6 and 7). The differences between the two trees could likely be at least in part explained by the levels of variability contained within the two regions, with the Santin region containing more overall variability than the barcoding region (Tables 3-5). We wish to avoid the over-analysis of partial sequence trees because it was previously demonstrated that partial SSU rRNA gene sequences may be insufficient for the phylogenetic analysis of Blastocystis subtypes [15]. However, similar branching patterns between the full-length SSU rRNA gene and the Santin region, which was used in the original designation of ST23-ST28, were observed, thus supporting their designation as new subtypes of Blastocystis.
As sequence distance is currently used to define subtypes of Blastocystis, a comparison of the pairwise distances of all full-length SSU rRNA gene sequences was also performed (Tables 3-5). In pairwise distance comparisons, ST21 and ST26 were found to have the most similarity with respect to the rest of the subtypes, with 96% of sequence identity across the full-length SSU rRNA gene (Table 3). Thus, ST21 and ST26 meet all current recommendations to be named as new subtypes. ST27 shared the most sequence similarity with ST6 and ST9 (95%), and ST28 shared the most sequence similarity with ST15 (93%) ( Table 3). Both ST27 and ST28 were found to well-exceed the 4% divergence recommended to name them as new subtypes. However, ST24 and ST25, ST14 and ST24, ST14 and ST25, and ST10 and ST23 were all found to share 98% of sequence identity across the full-length SSU rRNA gene (Table 3).
Though ST23, ST24, and ST25 were not found to meet the recommended minimum divergence of 4% across the full-length SSU rRNA gene, we suggest that these subtypes still be considered as valid. These subtype designations are already in use and have been documented in studies involving different hosts and/or countries [9,10]. Observations of these sequences in multiple studies supports their validity as novel subtypes as opposed to experimental artifacts, and discontinuing their use would only add further confusion to assigning new subtypes in the future. Furthermore, these subtypes are frequently identified in both domestic and wild ruminant species, and differentiating between these sequence variants could provide important information on the role of cross-species transmission. These subtypes are easily differentiated using the Santin region of the SSU rRNA gene (Table 5). However, it should be noted that ST23, ST24, and ST25 may not be easily distinguished using more conserved regions of the SSU rRNA gene such as the barcoding region (Table 4). This is an issue that extends beyond the identification of ST23, ST24, and ST25. For example, the already accepted subtypes ST5, ST12, and ST13 share 96-97% sequence identity with ST14 within the barcoding region, which may make their accurate identification difficult (Table 5). This may contribute to the increased incidence of sequences in GenBank sharing a high degree of identity but being named as any one of these subtypes, e.g., GenBank accession MN526814, which is recorded as ST14 returns BLAST matches for ST5 (99% identity, MK937752) and ST13 (97% identity, MF186700). For the Santin region, only ST14/ST25 were found to exceed the 96% cut-off, sharing a 97% similarity, while the similarity between other pairs was found to be ≤96% (Table 5), thus indicating this region could be better suited for subtyping Blastocystis isolates.
The full-length SSU rRNA gene reference sequence for ST27 allowed for a comparison of its barcoding region to other barcoding sequences available on GenBank. A Blast search of the first 600 bp of the full-length sequence returned 13 accessions with 98-99% of sequence identity and 95-97% of sequence coverage to ST27, which are listed here in order of shared identity: MK861944, MK861940, MK861941, MK861943, MK861939, MK861937, MK861936, MK861942, MK930361, MK357782, MT661535, MK861938, and MK861946. As all of these sequences also came from peafowl but from China (unpublished), it is likely that they represent other examples of ST27 and expand the range of ST27 to another continent. The MK861944 sequence is listed as a new subtype and shares 99% of identity with ST27. However, some sequences share a 98% identity with ST27 but are given new or already used subtype designations. For example, MT661535 shares 98% of identity with ST27 but is recorded as ST9, and MK930361 shares 98% of sequence identity with ST27 but is recorded as ST18. Because these sequences have no corresponding publication at this time, it is hard to know how the authors arrived at these subtype designations. However, these findings further highlight the need for quality full-length SSU rRNA gene references for describing Blastocystis diversity and host distribution.

Conclusions
Historically, new subtypes of Blastocystis have been named using both full-length and partial gene sequences of the SSU rRNA gene. However, the current recommendation suggests that a nearly full-length sequence be available before a new subtype designation can be given. While the data presented here highlight that different regions of the gene yield different relationships between subtypes based on sequence identity, it is also important to note that the requirement of a full-length sequence may not be an achievable goal for all researchers working in this field. In fact, attempting to procure the longest possible sequence may be contributing to the generation and publication of chimeric sequences, as some of the primers used to generate longer sequences of the gene may lack the specificity of shorter fragment primers. Mixed subtypes within a sample are common but can be difficult to detect using traditional Sanger sequencing [12]. Sanger sequencing produces a consensus of all present sequences, so either poor quality sequences generated from mixed infections or the propensity of forward and reverse primers to amplify different subtypes that are then stitched together into a single chimeric sequence are not easily identified. By comparison, the high sequence depth and stringent bioinformatic processing of MinION-generated sequences can better handle these issues, but this technology may not be widely available to all researchers. As the world of Blastocystis sequencing continues to expand to new hosts and regions of the world, these are outstanding issues that will need to be addressed so that all researchers can easily and accurately subtype their isolates. As additional sequences of both known and new subtypes are compiled, our understanding of the relationships within this diverse species complex will likely change and bring with it an enhanced understanding of host specificity, pathogenic potential, and the complex epidemiology of Blastocystis.