Long-Read Sequencing and Hybrid Assembly for Genomic Analysis of Clinical Brucella melitensis Isolates

Brucella melitensis is a key etiological agent of brucellosis and has been increasingly subject to characterization using sequencing methodologies. This study aimed to investigate and compare short-read, long-read, and hybrid assemblies of B. melitensis. Eighteen B. melitensis isolates from Southern Israel were sequenced using Illumina and the Oxford Nanopore (ONP) MinION, and hybrid assemblies were generated with ONP long reads scaffolded on Illumina short reads. Short reads were assembled with INNUca with SPADes, long reads and hybrid with dragonflye. Abricate with the virulence factor database (VFDB) and in silico PCR (for the genes BetB, BPE275, BSPB, manA, mviN, omp19, perA, PrpA, VceC, and ureI) were used for identifying virulence genes, and a total of 61 virulence genes were identified in short-read, long-read, and hybrid assemblies of all 18 isolates. The phylogenetic analysis using long-read assemblies revealed several inconsistencies in cluster assignment as compared to using hybrid and short-read assemblies. Overall, hybrid assembly provided the most comprehensive data, and stand-alone short-read sequencing provided comparable data to stand-alone long-read sequencing regarding virulence genes. For genomic epidemiology studies, stand-alone ONP sequencing may require further refinement in order to be useful in endemic settings.


Introduction
Microbial genomics analysis is widely being recognized as a potentially useful method to diagnose difficult-to-detect organisms and provide real-time surveillance for outbreaks [1,2]. However, traditional short-read sequencing methodologies have drawbacks in terms of contig length and turnaround time. Furthermore, short-read sequencing inevitably results in gaps in the assembly, and the gaps present in short read-only assemblies are a concern as genes present within that gap may be missed, and assemblies on the edge of a contig (next to the gap) may be of lower quality than assemblies in the middle of the contig.
The recent development of long-read sequencing technologies such as the Oxford Nanopore (ONP) MinION can potentially provide stand-alone long-read sequencing data with a rapid turnaround time. However, these technologies are still considered error prone despite continuous improvement. That being said, these technologies can bolster shortread analysis with long reads for hybrid analysis [3][4][5][6]. The portability, small footprint, and real-time sequencing capacity of the ONP MinION platforms makes them attractive for clinical use; however, as an emerging technology, work still needs to be performed establishing its usability in a clinical environment [6,7]. Some recent studies have undertaken comparisons of long-and short-read sequencing. Long-read assemblies were generally found to provide more complete assemblies and Microorganisms 2022, 10, 619 2 of 12 longer contigs than short-read assemblies, short-read assemblies were more precise than long-read assemblies, and hybrid assemblies were the most complete and accurate of all assemblies overall. In a study investigating the presence of antimicrobial resistance genes (ARGs), it was noted that stand-alone long-read sequencing resulted in occasional false negatives regarding the presence of certain ARGs [8].
Brucella melitensis is a key zoonotic bacterial species that is a driver of brucellosis infections, including in the Middle East [9][10][11][12][13][14][15][16]. Brucellosis is an under-diagnosed systemic infection; it is estimated that approximately 90% of human cases go undiagnosed [17,18]. Diagnosis is notably difficult due to lack of specific symptoms, and common testing methodologies vary in sensitivity [17]. If inadequately treated, the infection can progress to longterm, debilitating disease [19]. The gold standard of diagnosis is based on blood culture, but further handling the organism requires strict safety conditions and thus characterization of isolates (e.g., for identifying virulence genes or for performing epidemiological typing) is rarely performed outside reference laboratories. Moreover, the organism is commonly isolated from affected animals during field sampling. As such, Brucella spp. are an ideal target for diagnostic genomic sequencing, especially field-deployable long-read sequencing methodologies, to speed diagnosis as well as infer transmission pathways. For example, a case study was reported where a patient with neurobrucellosis was diagnosed by wholegenome metagenomic sequencing (Illumina HiSeq platform) after testing negative on a Brucella ELISA IgM [15]; and in another case study, brucellosis was rapidly identified via ONP long-read sequencing. Illumina sequencing has also been utilized in brucellosis outbreak investigations in Israel [11] as well as genomic epidemiology studies [14]. While the MinION has been used to investigate the presence of viral diseases including ebola, rabies, and dengue [20,21], research into sequencing of Brucella spp. including with the MinION is limited [22,23] and the application of long-read sequencing on human B. melitensis isolates powered by hybrid assemblies has not been attempted. This study aims to investigate longread sequencing of clinical B. melitensis. isolates, and in particular, to compare short-read, hybrid assembly, and long-read sequencing in order to recommend practicable workflow for future use.

Isolate Collection, DNA Extraction, and Sequencing
Eighteen B. melitensis isolates from brucellosis cases recovered in blood culture from patients treated at the Soroka University Medical Center, Beer Sheva, Southern Israel were retrieved from the National Brucellosis Reference Laboratory (Kimron Veterinary Institute, Beit Dagan, Israel). Isolates were a convenience sample sub-selected from a larger pool of clinical brucellosis isolates based on available DNA of sufficient quantity (500-1000 ng of total DNA) and quality (A260/A280 ratio of approximately 1.8) for Oxford Nanopore (ONP, Oxford, UK) sequencing. Isolates were extracted using the Qiagen Blood and Tissue kit (Qiagen, Hilden, Germany). Extracted DNA was measured with the QuBit (ThermoFisher, Waltham, MA, USA) and the NanoDrop (ThermoFisher, Waltham, MA, USA) devices to quantify DNA quantity and quality. Fragment length was assessed via BioAnalyzer (Agilent technologies, Santa Clara, CA, USA). DNA was sequenced using Illumina MiSeq platforms (Illumina, San Diego, CA, USA) and the ONP MinION (Oxford Nanopore, Oxford, UK). Culturing, DNA extraction, and Illumina sequencing are described in further detail in a previous publication [16]. For Illumina sequencing, DNA was sequenced using a Miseq V2-500 cycle kit to generate 2 × 250 paired-end reads. For ONP sequencing, a R9.4.1 Flow cell (FLO-MIN106) was used and Ligation Sequencing Kit (SQK-LSK109) was used with some modifications to the ONP protocol. Briefly, during library preparation, the AMPure beads (Beckman Coulter, Pasadena, CA, USA) were washed with 75% ethanol rather than 70% and incubated for 15 min on a rotational mixer during elution of DNA at the end of library preparation as well as the end of adapter ligation and clean-up. Short-and long-read genome assemblies described below are deposited under BioProject number PRJEB50430.

Read and Assembly Statistics
Regarding read statistics, the mean read length of long-read sequences ranged from 3393.2 to 8420.8 bp, and mean read length of short-read sequences ranged from 112.7 to177 bp. Q20% (percent of reads with a quality score above 20) and Q30% (percent of reads with a quality score above 30) were higher for short-read sequences than long-read sequences. Regarding Q20%, the median for long reads was 60.65% (range: 48.3-63.75%) and the median for short reads was 93.7% (range: 84.4-96.4%). Regarding Q30%, the median for long reads was 15.8% (range: 9.8-17.7%) and the median for short reads was 90.9% (78.9-94.6%).
Regarding assembly statistics, the median number of contigs was 2 contigs for longread and hybrid assemblies (range: 2-4 contigs for long-read assemblies, and all hybrid assemblies had 2 contigs) and 38.5 for short-read assemblies (range: 30-60 contigs). NG50 and NG75 values were higher for long-read and hybrid assemblies than short-read assemblies ( Figure 1, Table 1). After completion of BUSCO analysis, short-read assemblies had a slightly higher completeness than long-read assemblies. Hybrid assemblies had the same BUSCO values as short-read assemblies. The median largest contig was shorter for short-read assemblies than long-read and hybrid assemblies. Regarding total length and depth, this was fairly similar among assembly types. All descriptive statistics are detailed in Table 1. Regarding scaffolding and the utility of hybrid assembly, Figure 2 represents how a long-read assembly with two contigs can scaffold a fragmented short-read assembly.

Virulence Gene Identification
In total, 51 virulence genes from the VFDB were identified in all assemblies. Results from the in silico PCR for 10 additional virulence genes (BetB, BPE275, BSPB, manA, mviN, omp19, perA, PrpA, VceC, and ureI) were also concurrent; all ten virulence genes were identified in short-read, long-read and hybrid assemblies of all isolates. Sequencing errors in long-read assemblies (deletions, substitutions, etc.) were noted for one isolate in perA, omp19, and VceC and for four isolates in BPE275. Of the 8 total errors, 5 (62.5%) of the errors were substitutions, 3 (37.5%) were insertions, and none were deletions. In all of these instances, the error was corrected upon hybrid assembly and did not interfere with the identification of the virulence gene in long read-only assemblies. One true variant (present in short-read, long-read, and hybrid assemblies) was noted in one isolate for the gene BPE275 (C208T). Of note, in one isolate, a virulence gene was initially missed upon hybrid assembly as the full assembly is circular and initially the virulence gene searching tool missed it due to its limitations. Upon closer inspection, the gene was identified in the hybrid assembly. Figure 2. Assembly (bandage) plots of short-read assembly (A), long-read assembly (B), and hybrid assembly (C) for isolate B4 as an exemplar. This figure represents how a complete long-read assembly, visualized as two complete circles with two contigs, can scaffold a fragmented short-read assembly, visualized as multiple fragmented contigs.

Virulence Gene Identification
In total, 51 virulence genes from the VFDB were identified in all assemblies. Results from the in silico PCR for 10 additional virulence genes (BetB, BPE275, BSPB, manA, mviN, omp19, perA, PrpA, VceC, and ureI) were also concurrent; all ten virulence genes were identified in short-read, long-read and hybrid assemblies of all isolates. Sequencing errors in long-read assemblies (deletions, substitutions, etc.) were noted for one isolate in perA, omp19, and VceC and for four isolates in BPE275. Of the 8 total errors, 5 (62.5%) of the errors were substitutions, 3 (37.5%) were insertions, and none were deletions. In all of these instances, the error was corrected upon hybrid assembly and did not interfere with the identification of the virulence gene in long read-only assemblies. One true variant (present in short-read, long-read, and hybrid assemblies) was noted in one isolate for the gene BPE275 (C208T). Of note, in one isolate, a virulence gene was initially missed upon hybrid assembly as the full assembly is circular and initially the virulence gene searching tool missed it due to its limitations. Upon closer inspection, the gene was identified in the hybrid assembly.

Phylogeny Comparison
Prior work investigating outbreaks and regional clustering of B. melitensis has noted that up to six allelic differences would be considered an acceptable threshold to consider clustered isolates on a gene-by-gene phylogenetic analysis (cgMLST) as epidemiologically related [56]. Studies in our region [13,14] and our cumulative experience in local investigations of brucellosis (unpublished data) suggest up to 10-15 differing alleles or SNPs may still constitute a practicable threshold for relatedness. As seen in Figure 3, most sequenced isolates tend to cluster together according to short-read, long-read, and hybrid assemblies. However, in almost all cases, the long-read assemblies exhibited a much higher allelic difference from the hybrid assembly than the short-read assembly (Figure 3 Panel A). A similar finding was noted when the number of differing single-nucleotide polymorphisms (SNPs) was compared between long-read assemblies and hybrid assemblies and between short-read assemblies and hybrid assemblies (Figure 3 Panel B). For all clusters, the number of differing alleles between long-read and hybrid assemblies ranged between 13 and 203 allelic differences and 17 and 285 differing SNPs (median: 35.5 allelic differences, 42 differing SNPs). In comparison, the number of allelic differences between short-read and hybrid assemblies ranged between 1 and 3 allelic differences and 0 and 3 differing SNPs (median: 1 allelic difference, 0 differing SNPs). In 17 out of 18 isolates (94.4%), the number of differing alleles between long-read and hybrid assemblies was above the relaxed threshold (15 alleles) that would define epidemiological relatedness. Only one long-read/short-read/hybrid cluster (B10) fit within the parameters regarding less than 15 allelic differences (Figure 3, Panel A). Of note, the long-read assemblies of isolate B17 did not cluster with the short-read and hybrid counterparts at all; this mis-assignment is likely due to overall poor quality of long-read sequence in that sample (See: Q20% and Q30% results).
Microorganisms 2022, 10, x FOR PEER REVIEW 7 of 13 above the relaxed threshold (15 alleles) that would define epidemiological relatedness.
Only one long-read/short-read/hybrid cluster (B10) fit within the parameters regarding less than 15 allelic differences (Figure 3, Panel A). Of note, the long-read assemblies of isolate B17 did not cluster with the short-read and hybrid counterparts at all; this misassignment is likely due to overall poor quality of long-read sequence in that sample (See: Q20% and Q30% results). When a phylogenetic tree was constructed, only with long-read assemblies ( Figure  4), all isolates exhibited allelic differences that far exceed the epidemiological relatedness threshold and in a scenario using only ONP sequencing, no clear chains of transmission or relatedness between cases would have been evident. Moreover, short-read and hybrid assemblies showed several clear case clusters such as B6-B16 and B4-B12-B13, but these clusters were not evident in the long-read assemblies.
(A)  When a phylogenetic tree was constructed, only with long-read assemblies (Figure 4), all isolates exhibited allelic differences that far exceed the epidemiological relatedness threshold and in a scenario using only ONP sequencing, no clear chains of transmission or relatedness between cases would have been evident. Moreover, short-read and hybrid assemblies showed several clear case clusters such as B6-B16 and B4-B12-B13, but these clusters were not evident in the long-read assemblies.

What Is in the Gaps?
Analysis of the gaps present in short-read assemblies was conducted to assess what would have been missed if short-read assembly was used alone vs. in hybrid assembly. Genes were identified in the gaps of all 18 short-read assemblies (median: 10 genes, range: 9-17 genes). The majority of the genes missed by short read-only assembly are transposable elements (e.g., transposases of the IS3, IS5, and IS6 families) and do not, to the author's knowledge, have clinical relevance. Furthermore, repetitive regions such as these are known to be poorly sequenced by short-read sequencing methodologies, so missing these genes in short-read gaps is expected [57].
technology, isolates sequenced via short-read (SR) Illumina technology, and hybrid assembly of SRs scaffolded onto LRs. Numbers in grey denote number of allelic differences (Panel A) or number of differing single-nucleotide polymorphisms (SNPs) (Panel B) between assemblies or isolates, and the nodes are colored according to the isolate number. Nodes with more than one color denote two assemblies of two apparently related isolates that did not have any allelic differences (Panel A) or differing SNPs (Panel B). In (Panel B), iolates with no differences between the HY and SR assemblies are the same color node with a black line down the middle.

Study Summary
This study aimed to investigate long-read sequencing of clinical B. melitensis isolates, and in particular, to compare short read-based, long read-based and hybrid assemblies in order to recommend practicable workflow for future use. Overall, virulence genes were consistently identified in short-read, hybrid, and long-read assemblies. While there were instances of gaps in the short-read assembly and errors in long-read assemblies, this did not ultimately affect identification of virulence genes. These findings are similar to work investigating Illumina vs. ONP sequencing with Escherichia coli surrogate strain isolates [58]. In general, long-read sequencing had notable limitations in regard to phylogenetic analysis, and long-read assemblies generally failed to cluster closely enough with their short-read and hybrid counterparts to be considered as epidemiologically related based on a previously established threshold. One long-read assembly clustered with an entirely different isolate cluster; however, this has been observed in previous studies focusing on other bacterial genera, and as also seen in the research, this was reported to be corrected upon hybrid assembly [4,8,59]. Furthermore, when a tree was generated with long-read assemblies alone, related isolates that clustered together from short-read or hybrid assemblies no longer clustered together. The superiority of short-read assemblies to long-read assemblies in regard to phylogenetic resolution has also been observed in a benchmarking study of Salmonella isolates [8]. Ultimately, complete genomes resulting from hybrid assembly will allow for more confident analysis of genomes using a gene-by-gene approach as well as SNP-level analysis. Depth and BUSCO completeness percentages for short-read and long-read assemblies were similar, which disagrees with other studies [60], but it again bears mentioning that B. melitensis may be less complex to analyze than other studied organisms.

Implications for Field-Deployed or Low-Resource Settings
The ONP platform is popular due to its utility as a field-deployable sequencing platform; furthermore, in low-resource settings, it can provide whole genomes faster and with fewer resources. The MinION has been noted in multiple papers to be highly useful in backcountry or low-resource settings, including tent-based or car-based research efforts [20,61,62]. For infectious diseases such as brucellosis, epidemiological trace-back is often critical in the face of outbreaks, cases associated with international travel or even cases in non-endemic regions. Whole-genome sequencing can also be used in this regard; for example, a study utilizing Illumina sequencing in Germany noted that a large number of brucellosis cases were of Middle Eastern origin [63]. Previous work has noted that the ONP platform can provide speedy diagnosis of brucellosis; for example, when Gündogdu et al.'s (2019) ONP usage for clinical diagnosis identified the first read for B. melitensis in 30 min [23].
Given the virulence gene findings of this study, it is apparent that long read-only assemblies can provide actionable data regarding Brucella spp. virulence gene presence. However, the findings of the phylogenetic comparisons warrant further study; while the long-read sequencing produced good resolution in this regard, differences were observed among the assembly types as far as number of allelic differences or SNPs was concerned. It is possible typing using long reads could be improved with longer sequencing times (which necessitate faster consumption of costly flow cells). As the genome of B. melitensis is highly conserved and isolates from the same region, including in the Israeli Negev desert, tightly cluster together, high resolution for accurate phylogenetic analysis is very important for this particular organism in regard to epidemiological or outbreak investigations [13,14]. For the investigation of specific genes in other clinically-relevant organisms, more research is needed, as while this study found that the ONP platform was consistent in identifying virulence genes, recent studies have noted that ONP technology had inconsistent performance regarding the detection of antimicrobial resistance genes in Gram-negative bacteria when compared to Illumina technology [4,9,59]. Overall, the findings of this study suggest that in operational settings where information is needed on the virulence genes of B. melitensis, stand-alone long-read sequencing provides comparable data to short-read sequencing in a shorter amount of time. Regarding the utility for epidemiological and outbreak settings, there is a need for further refinement and validation of the method, perhaps via longer sequencing time.

Limitations and Future Research
The primary limitations of this study were the small number of isolates and lower quality of long-read sequences and lower depth for some of the samples. Future research optimizing ONP sequencing for the epidemiological investigation of Brucella spp. is also needed, especially regarding field-based sequencing in endemic areas. Other research investigating ONP sequencing has noted this need for future research before stand-alone long-read sequencing is utilized in the clinical environment [4]. Concordantly, future research should be undertaken to "downsample" long-read platforms to determine at what point depth suffers and falls significantly below short-read methodologies. Furthermore, future research should be undertaken using concordant blood, cerebral spinal fluid, or other relevant clinical materials to investigate the utility of ONP and hybrid sequencing for culture-independent B. melitensis diagnosis.

Conclusions
This study aimed to investigate long-read sequencing and hybrid sequencing of clinical B. melitensis isolates, with the specific intention to compare short read-based, long read-based and hybrid assemblies in order to recommend practicable workflow for future clinical use. Overall, it is key to note that all virulence genes were identified in all isolates using all sequencing and assembly methodologies; however, caution is warranted upon hybrid assembly. For phylogeny, some differences and inconsistencies in clustering were observed for the short-read and long-read assemblies; therefore, further research is needed regarding these technologies for phylogenomic research of Brucella spp.