1. Introduction
Microbial forensics has evolved as a specialized field dedicated to collecting and analyzing evidence involving microbes or their toxins used in acts of biological crime [1]. It provides a scientific approach to securing evidence from bioterrorism, biological attacks, biological crimes, the intentional manipulation of biological agents and toxins, and the accidental release of such materials [2]. The advent of next-generation sequencing (NGS) has significantly advanced microbial forensics by drastically reducing the time needed for whole-genome analyses of microbial pathogens [3,4]. Consequently, traditional microbial typing methods, such as DNA fingerprinting, multilocus variable number tandem repeat analysis (MLVA), and multilocus sequence typing (MLST), can now be performed in silico, while high-precision approaches based on whole-genome sequencing (WGS) enable the identification of single-nucleotide polymorphisms (SNPs) [5,6].
Forensic samples are often limited in DNA quantity, contaminated with exogenous microbial or host DNA present in the environment, or severely degraded [7]. Hybridization-based target enrichment methods have recently been used to analyze the specific genomic sequences of these samples [8,9]. These approaches have been used to capture and sequence ancient DNA from human remains, particularly for challenging samples in which the target DNA may constitute only 1–5% of the total extracted DNA. Target enrichment can increase target DNA yields to over 70% [10,11]. Furthermore, recent studies have successfully used target enrichment to conduct culture-free genomic analyses of hard-to-culture bacteria, achieving high-quality whole-genome sequences from clinical specimens [12]. These findings indicate that NGS with target enrichment holds significant promise for comprehensive analysis of evidence DNA exposed to diverse environmental factors in microbial forensic investigations.
WGS enables high-resolution typing of bacterial pathogens through analyses such as whole-genome SNP and whole-genome MLST (wgMLST), serving as powerful tools in microbial genomics [13,14,15]. However, obtaining WGS-level information typically requires pathogen cultivation or highly purified nucleic acids at high concentrations, and NGS may fail to detect or reconstruct genomes when cultivation is unsuccessful or pathogens are present in low copy numbers within the sample [16]. While target enrichment in NGS offers high-resolution bacterial pathogen analysis, its elevated cost and technical complexity may limit the feasibility of targeting multiple pathogens in a single reaction [17]. MLVA measures variation in the number of tandem repeats at multiple variable number tandem repeat (VNTR) loci to determine the genetic relatedness of bacterial strains, offering a relatively straightforward method that provides sufficient resolution for outbreak investigation and epidemiological studies. Despite examining less than 1% of the genome and thus providing lower resolution than WGS-based approaches, MLVA remains valuable for comparing microbial sources or assessing equivalence between distinct samples, especially in challenging cases where obtaining reliable WGS data is impractical [18].
Until now, there have been few instances of directly applying target enrichment technology to MLVA profiling. MLVA generally relies on conventional polymerase chain reaction (PCR)-based methods to amplify and analyze specific genetic loci. However, with the advent of NGS, target enrichment approaches such as hybrid capture have been developed, enabling more in-depth analysis of diverse genomic regions. By selectively amplifying or capturing specific genetic targets, these technologies can enhance both the sensitivity and efficiency of the analysis. Consequently, incorporating target enrichment methods into MLVA profiling holds the potential to further improve analytical efficiency and accuracy.
Here, we designed and synthesized target capture probes for the MLVA regions of Yersinia pestis to perform typing from trace samples. These samples were provided by the 2024 United Nations Secretary-General’s Mechanism External Quality Assurance Exercise for Y. pestis detection and included plasma, tomato juice, grape juice, and a surgical mask containing unknown live bacteria; the other samples were inactivated. Although real-time PCR and NGS revealed the presence of Y. pestis in all samples, the chromosome coverage ranged from 0.46% to 97.1% depending on the sample, making strain identification challenging. Using our custom target capture probes, we successfully analyzed the MLVA loci in all samples and confirmed that the Y. pestis detected in these four samples belonged to the same strain.
2. Materials and Methods
2.1. Sample Preparation, Live Sample Handling, and Culture
The test samples were obtained from the Robert Koch Institute and included K2EDTA blood diluted 1:10 in PBS (#24-2), tomato juice (#24-5), grape juice (#24-8), and a punch of a grey surgical mask immersed in 0.8% NaCl, each provided in 0.5 mL volumes. For the commercial juice products, the tomato juice was composed of 99.2% tomato juice from concentrate, 0.5% salt, and lemon juice from concentrate. The grape juice was confirmed to be 100% grape juice. The live sample (#24-10) was provided in a 0.5 mL volume, and 10 μL was inoculated into 10 mL of tryptic soy agar (TSA) broth, followed by incubation with constant shaking at 28 °C for 24 h. All samples were subjected to nucleic acid extraction using the DNeasy Blood & Tissue Kit (QIAGEN) according to the manufacturer’s instructions. For each extraction, 100 μL of the sample was used, and the nucleic acids were eluted in 100 μL of elution buffer. For samples #24-10 and #24-12, 10 μL was spread onto both TSA and cefsulodin–irgasan–novobiocin (CIN) agar plates and incubated at 28 °C for 48 h. The opening and culturing of samples #24-10 were conducted in a biosafety level 3 facility.
2.2. Real-Time PCR
For each 20 μL reaction, 2 μL of extracted nucleic acid was combined with 10 μL of 2× TaqMan Gene Expression Master Mix (Applied Biosystems, Foster City, CA, USA), 0.5 μL each of forward and reverse primers (36 μM), 0.5 μL of fluorescent probe (10 μM), and 6.5 μL of double-deionized water, following the TaqPath Master Mix (Thermo Fisher Scientific, Waltham, MA, USA) instructions. Primers and probes were synthesized by Bioneer (Daejeon, Republic of Korea). The probes were labeled with Texas Red (TEX), 6-carboxyfluorescein (FAM), or Cyanine 5 (CY5) and incorporated an internal Bioneer Quencher (i-EBQ) with a phosphate-blocked 3′ end. To detect the Y. pestis chromosome, the yihN gene was targeted with forward primer 5′-GCT TTA CCT TCA CCA AAC TG-3′, reverse primer 5′-GAA CCA AAG AAC AAG GA-3′, and probe 5′-[TEX]ATA AGT ACA[i-EBQ] TCA ATC ACA CCG CGA C[Phosphate]-3′. To detect pMT1, the caf1 gene was targeted using primers 5′-GTT GGT ACG CTT ACT CTT G-3′ and 5′-GTG GTT ATT TCC ATC CTG AG-3′, and probe 5′-[FAM]AAA ACA GGA[i-EBQ] ACC ACT AGC ACA TCT G[Phosphate]-3′. For pPCP1 detection, the pla gene was targeted using primers 5′-CTG GTT ACT CCA GGA TGA GA-3′ and 5′-TTC CGG TAT AAG CTC CAT TA-3′, and probe 5′-[CY5]TTG GAC AGC[i-EBQ] TAC AGG TGG TTC ATA T[Phosphate]-3′. All sequences were validated in silico using CLC Genomics Workbench 24. Amplification was performed at 90 °C for 10 min, followed by 40 cycles at 95 °C for 15 s and 60 °C for 1 min on a QuantStudio 6 Flex Real-Time PCR system (Thermo Fisher Scientific).
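As a quick consistency check on the reaction setup above, the following minimal sketch sums the listed volumes and derives the final primer and probe concentrations by simple dilution. The dictionary keys and the helper name `final_concentration_um` are illustrative and not part of the original protocol.

```python
def final_concentration_um(stock_um: float, volume_ul: float, reaction_ul: float) -> float:
    """Final concentration (µM) after dilution, from C1 * V1 = C2 * V2."""
    return stock_um * volume_ul / reaction_ul


# Volumes (µL) per reaction, as listed in the text.
components_ul = {
    "template": 2.0,
    "master_mix_2x": 10.0,
    "forward_primer": 0.5,
    "reverse_primer": 0.5,
    "probe": 0.5,
    "water": 6.5,
}

total_ul = sum(components_ul.values())

# Stock concentrations from the text: primers 36 µM, probe 10 µM.
primer_final_um = final_concentration_um(36.0, components_ul["forward_primer"], total_ul)
probe_final_um = final_concentration_um(10.0, components_ul["probe"], total_ul)

print(f"total volume: {total_ul} µL")
print(f"each primer: {primer_final_um} µM, probe: {probe_final_um} µM")
```

Under these values the components add up to the stated 20 μL, with each primer at 0.9 μM and the probe at 0.25 μM in the final reaction.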
2.3. Whole-Genome Amplification (WGA)
DNA amplification was carried out using the 4BB™ TruePrime® WGA Kit (4basebio, Madrid, Spain) in a reaction volume of 50 μL. The reaction mixture consisted of 2.5 μL of DNA, 2.5 μL of Buffer D, 2.5 μL of Buffer N, 26.8 μL of nuclease-free water, 5 μL of Reaction Buffer, 5 μL of dNTPs, 5 μL of Enzyme 1, and 0.7 μL of Enzyme 2. The thermal cycling conditions were programmed as follows: incubation at 30 °C for 3 h and inactivation at 65 °C for 10 min using a ProFlex thermal cycler (Life Technologies, Carlsbad, CA, USA). The amplified DNA product was then purified using the MinElute PCR Purification Kit (Qiagen, Hilden, Germany).
2.4. NGS for Illumina NextSeq
Library preparation was performed using the TruSeq Nano DNA LT Sample Preparation Kit (Illumina, San Diego, CA, USA), following the protocol provided by the manufacturer. DNA samples were fragmented using an M220 Focused-ultrasonicator (Covaris, Woburn, MA, USA). The resulting DNA fragments were size-selected, A-tailed, and ligated to adaptors and indexed primers, followed by enrichment. Sequencing was conducted on the NextSeq benchtop sequencer using the 500/550 Mid Output Kit v2.5 (Illumina).
2.5. Construction of the All Living Organisms (ALO) Database and Taxonomic Profiling
To build a metagenome database using all published RefSeq sequences (Archaea, Eukaryota, and Viruses) or bacterial reference genomes available at the National Center for Biotechnology Information (NCBI, https://www.ncbi.nlm.nih.gov/datasets/genome/, accessed on 4 November 2024), we performed domain-based classification and filtering, then downloaded the FASTA files via the FTP server. For each domain, an index for taxonomy profiling was created using CLC Genomics Workbench 24.0. NGS reads were subjected to adapter and quality trimming (>0.05) before taxonomic profiling under default settings. We combined only those families that accounted for ≥1% of reads in each category, designating the rest as Etc. For further NGS read analysis, we retrieved mitochondrial sequences by filtering only the Mitochondrion category from the NCBI organelle database (https://www.ncbi.nlm.nih.gov/datasets/organelle/, accessed on 15 January 2025) and performed taxonomic profiling again.
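The family-level summarization rule described above (keep families at ≥1% of reads, pool the remainder as Etc) can be sketched as follows. This is an illustration only: the profiling itself was done in CLC Genomics Workbench, and the function name and the example read counts are hypothetical.

```python
def summarize_families(read_counts: dict[str, int], min_fraction: float = 0.01) -> dict[str, float]:
    """Return percent of reads per family, pooling families below min_fraction into 'Etc'."""
    total = sum(read_counts.values())
    if total == 0:
        return {}
    summary: dict[str, float] = {}
    etc_percent = 0.0
    for family, count in read_counts.items():
        percent = 100.0 * count / total
        if count / total >= min_fraction:
            summary[family] = percent
        else:
            etc_percent += percent
    if etc_percent > 0:
        summary["Etc"] = etc_percent
    return summary


# Hypothetical read counts for one category (not data from this study).
example_counts = {
    "Yersiniaceae": 850,
    "Enterobacteriaceae": 140,
    "Bacillaceae": 6,
    "Pseudomonadaceae": 4,
}
example_summary = summarize_families(example_counts)
print(example_summary)
```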
2.6. Target Capture-Based Enrichment for Y. pestis MLVA
Target capture-based enrichment was used for MLVA library preparation of Y. pestis. The probe sequences were designed to hybridize specifically to the target bacterial genome. This design involved creating overlapping 120 bp fragments tiled across the MLVA loci, with a 60 bp overlap between consecutive fragments to ensure accurate and efficient target detection. A total of 455 biotinylated probes were developed (Celemics, Seoul, Republic of Korea). Library preparation was carried out using the TruSeq RNA Library Prep for Enrichment kit (Illumina). During the process, DNA was processed, and dual-index adapters (Illumina) were ligated to the fragment ends. Adapter-ligated and amplified libraries were subsequently purified using AMPure XP beads (Beckman Coulter, Brea, CA, USA). The quality and concentration of the libraries were assessed using the TapeStation 4200 system and D1000 ScreenTape (Agilent Technologies, Santa Clara, CA, USA). Final library quantification was performed with the KAPA Library Quantification Kit (KAPA Biosystems, Wilmington, MA, USA) on a QuantStudio 6 Flex Real-Time PCR System (Thermo Fisher Scientific).
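The tiling scheme described above (120 bp probes with a 60 bp overlap) can be illustrated with a minimal sketch. The function `tile_probes`, the 500 bp example region, and the handling of the final probe (placed flush with the region end when the tail would otherwise be uncovered) are assumptions for illustration; the vendor's actual design rules are not given in the text.

```python
def tile_probes(region_length: int, probe_length: int = 120, overlap: int = 60) -> list[tuple[int, int]]:
    """Return 0-based, half-open (start, end) probe intervals tiled across a region."""
    if region_length < probe_length:
        raise ValueError("region is shorter than one probe")
    step = probe_length - overlap
    starts = list(range(0, region_length - probe_length + 1, step))
    # Assumption: add one last probe ending at the region end if the tail is uncovered.
    if starts[-1] + probe_length < region_length:
        starts.append(region_length - probe_length)
    return [(start, start + probe_length) for start in starts]


# Hypothetical 500 bp target region (not a locus from this study).
example_probes = tile_probes(500)
print(len(example_probes), example_probes[-1])

# Total probe footprint implied by the text: 455 probes of 120 bp each.
total_probe_bases = 455 * 120
print(total_probe_bases / 1e6, "Mb")
```

The total of 54,600 bases is consistent with the roughly 0.055 Mb probe size quoted in the Discussion.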
2.7. Library Preparation and Nanopore Sequencing
The target-enriched library was constructed using the Ligation Sequencing Kit (Oxford Nanopore Technologies, Oxford, UK) per the manufacturer’s protocol. The process, completed within an hour, involved end preparation of the DNA, adapter ligation, and loading onto a FLO-MIN106 (R9.4) flow cell (Oxford Nanopore Technologies, Oxford, UK). Sequencing was performed on the portable MK1C device (Oxford Nanopore Technologies).
2.8. MLVA Depth and Analysis
For MLVA, DNA extracted from sample #24-5 was used to prepare a mixture in accordance with the manufacturer’s protocol for nPfu-Special (Enzynomics). PCR conditions included an initial 2 min at 95 °C, followed by 40 cycles at 95 °C for 15 s, 56 °C for 15 s, and 72 °C for 2 min, and a final extension at 72 °C for 2 min, then held at 4 °C. The primers for each locus were chosen based on reference data [19]. PCR products underwent agarose gel electrophoresis, were extracted, and then were subjected to Sanger sequencing using the same primers. Sanger sequencing was conducted on an Applied Biosystems (Life Technologies, Carlsbad, CA, USA) 3500 Genetic Analyzer with the BigDye Terminator v3.1 Cycle Sequencing Kit (Applied Biosystems) and the BigDye XTerminator Purification Kit (Applied Biosystems) per the manufacturer’s instructions. For the 25 MLVA loci, sequences obtained from Sanger sequencing were used as a reference in CLC Genomics Workbench 24.0 to perform read mapping, determine depth, and generate consensus sequences for subsequent MLVA analysis. Short-read reference mapping was performed using the Map Reads to Reference tool with default settings, specifically employing linear gap cost, a match score of 1, a mismatch cost of 2, a length fraction of 0.5, and a similarity fraction of 0.8. Long-read mapping used the Map Long Reads to Reference tool with the default Automatic parameter, and no additional parameters were set. Coverage depth was calculated based on the read mapping results using the Quality Control for Targeted Sequencing tool. Consensus sequences for MLVA profiling were then generated using the Extract Consensus Sequence tool based on the read mapping results, setting the low coverage definition threshold at 5 and the low coverage handling method as split into separate sequences to secure the final consensus.
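The effect of splitting a consensus at low-coverage positions can be sketched as below. This is not the CLC implementation: the function name is hypothetical, and treating positions with depth strictly below the threshold as low coverage is an assumption, since the text does not state whether the comparison is inclusive.

```python
def split_consensus(consensus: str, depth: list[int], threshold: int = 5) -> list[str]:
    """Split a consensus into separate sequences wherever depth falls below threshold."""
    if len(consensus) != len(depth):
        raise ValueError("consensus and depth must have the same length")
    segments: list[str] = []
    current: list[str] = []
    for base, base_depth in zip(consensus, depth):
        if base_depth >= threshold:
            current.append(base)
        elif current:
            # A low-coverage position closes the current segment.
            segments.append("".join(current))
            current = []
    if current:
        segments.append("".join(current))
    return segments


# Toy consensus and per-base depth (illustrative values only).
example_segments = split_consensus("ACGTACGTAC", [9, 8, 7, 2, 1, 6, 6, 6, 0, 5])
print(example_segments)
```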
To determine the MLVA profile for each VNTR locus, we first estimated the flanking region size by subtracting the total length of the repeat region (i.e., repeat size multiplied by the number of repeats) from the amplicon size of the Y. pestis CO92 MLVA reference. Using this flanking region size, the number of tandem repeats in each sample was calculated by subtracting the flanking region size from the amplicon size inferred from NGS read mapping and then dividing the result by the repeat unit size. The final value was rounded to the nearest whole number and recorded as the repeat copy number for each locus. The MLVA profile of each sample was constructed by compiling the repeat copy numbers across all 25 VNTR loci. The accuracy of the calculated MLVA profiles was assessed through comparisons with validated reference profiles.
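The repeat copy number calculation described above can be written out as a short sketch. The function names and the example locus values are hypothetical and are not taken from the CO92 reference table. Ties are rounded half up here, which is an assumption; Python's built-in `round` rounds ties to the nearest even integer instead.

```python
import math


def flanking_size(ref_amplicon_bp: int, repeat_unit_bp: int, ref_repeat_count: int) -> int:
    """Flanking size = reference amplicon size minus the total repeat region length."""
    return ref_amplicon_bp - repeat_unit_bp * ref_repeat_count


def repeat_copy_number(sample_amplicon_bp: int, flank_bp: int, repeat_unit_bp: int) -> int:
    """Repeat copies = (sample amplicon size - flanking size) / repeat unit, rounded."""
    copies = (sample_amplicon_bp - flank_bp) / repeat_unit_bp
    return math.floor(copies + 0.5)  # round half up (assumption on tie handling)


# Hypothetical locus: 18 bp repeat unit, 6 copies in a 200 bp reference amplicon.
flank = flanking_size(ref_amplicon_bp=200, repeat_unit_bp=18, ref_repeat_count=6)

# Amplicon sizes inferred from read mapping for two hypothetical samples.
copies_a = repeat_copy_number(182, flank, 18)
copies_b = repeat_copy_number(221, flank, 18)
print(flank, copies_a, copies_b)
```

Repeating this per locus and collecting the values in locus order gives the MLVA profile for a sample.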
4. Discussion
In sample #24-5, Y. pestis DNA accounted for 85.0% of the total reads, resulting in 97.1% chromosome coverage at a depth of 260× (Table 3). Sample #24-8 contained 3.47% Y. pestis DNA, with 97.0% coverage and 11× depth. Despite the difference in read abundance, both samples contained Y. pestis genomic DNA at a concentration of 10⁶ genome copies/mL. Notably, both lacked the 102 kb pgm locus, which may account for their reduced mapping coverage compared to the Y. pestis CO92 reference [20]. Along with the absence of pPCP1, these observations suggest that the isolates are live attenuated Y. pestis strains, potentially intended for vaccine use [21].
The Ct values of sample #24-2 were lower than those of #24-5 and #24-8, indicating a higher amplification signal. Consistent with this finding, sample #24-2 was confirmed to contain Y. pestis genomic DNA at a concentration of 10⁷ genome copies/mL. Interestingly, despite the higher genome copy number indicated by real-time PCR, only 0.48% of total reads matched Y. pestis (Table 1 and Table S3). This discrepancy appears to be due to the large amount of human genomic DNA in the sample, a challenge noted in previous studies [22,23,24]. Such issues often necessitate host genome depletion or target enrichment to improve detection sensitivity. Similarly, sample #24-10 exemplifies the diagnostic difficulties posed by trace amounts of non-culturable microbes; the bacterium successfully cultured from this sample was K. oxytoca rather than Y. pestis.
When we initially received sample #24-10, it was described as a living infectious sample without clear information on whether Y. pestis was present. However, we were later informed, after the forensic procedures, that the sample contained live K. oxytoca and was spiked with Y. pestis genomic DNA at a concentration of 10⁷ copies/mL. The LOD of our Y. pestis real-time PCR primers and probes was 10 copies per reaction; the final reaction after TSA enrichment contained approximately 20 genome copies. Although the Ct values indicated a weak positive signal, taxonomic profiling did not yield sufficient reads to conclusively identify Y. pestis, and reference mapping to the Y. pestis CO92 genome showed that only 0.005% of total reads (367 reads) matched (Table S3). As noted in a previous study, interpreting borderline Ct values near the LOD in real-time PCR remains a long-standing challenge, and in our case, it was difficult to make a definitive positive or negative call based on Ct values alone [25]. Nonetheless, target-enriched sequencing revealed a clear MLVA profile identical to that of sample #24-5, leading us to conclude that Y. pestis of the same strain was indeed present in sample #24-10.
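The estimate of roughly 20 genome copies per reaction can be reproduced from the volumes given in the Methods. The sketch below assumes that the spiked genomic DNA neither replicates nor degrades during culture and that extraction recovery is 100%; neither assumption is stated in the text, and the function name is illustrative.

```python
def copies_per_reaction(
    spike_copies_per_ml: float,
    inoculum_ul: float,
    broth_ml: float,
    extracted_ul: float,
    elution_ul: float,
    template_ul: float,
) -> float:
    """Estimate target copies entering one PCR reaction after dilution into broth."""
    inoculum_ml = inoculum_ul / 1000.0
    culture_copies_per_ml = spike_copies_per_ml * inoculum_ml / (broth_ml + inoculum_ml)
    copies_extracted = culture_copies_per_ml * extracted_ul / 1000.0
    # Assumption: all extracted copies are recovered in the eluate.
    copies_per_ul_eluate = copies_extracted / elution_ul
    return copies_per_ul_eluate * template_ul


estimated_copies = copies_per_reaction(
    spike_copies_per_ml=1e7,  # spiked Y. pestis genomic DNA
    inoculum_ul=10,           # inoculated into broth
    broth_ml=10,
    extracted_ul=100,         # sample volume used for extraction
    elution_ul=100,
    template_ul=2,            # template per 20 µL reaction
)
lod_copies = 10
print(round(estimated_copies), estimated_copies / lod_copies)
```

Under these assumptions the input is only about twice the LOD, which fits the weak positive Ct signal described above.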
Our study and others have shown that short-read NGS data often produce assembly errors in the VNTR regions used for MLVA [26]. Therefore, we used long-read sequencing on the MinION following MLVA-locus target enrichment. Although indel errors can occur with long reads, they can generally be corrected by increasing the read count [27]. Additionally, to reliably define the true MLVA profile of each sample, we used Sanger sequencing on #24-5, which had the highest Y. pestis DNA concentration, as a reference standard. As in earlier reports, we observed a correlation between depth of coverage and MLVA accuracy in the WGA-only short-read data (Table 5, Figure 2). Although the sample quantity was insufficient for a direct comparison of WGA versus target enrichment in short-read NGS, the MLVA-locus depth and resultant MLVA profiles clearly showed that our custom-designed probes worked effectively.
Sample #24-8 was the most intriguing: 91% of its reads were unmatched by our ALO database (Table S3). De novo assembly of the unmapped reads produced a 23,810 bp contig, which accounted for more than 76% of the unmapped reads. BLAST (version 2.16.0) analysis identified this contig as Cladosporium spp. mtDNA. After adding the NCBI organelle database and re-running the taxonomic profiling, we observed that 69.3% of the total reads from #24-8 were classified as mtDNA, most of which were Cladosporium species. Since Cladosporium is a common fungal pathogen in grapes (Vitis vinifera), we inferred that #24-8 likely originated from grape juice containing Cladosporium mtDNA [28,29,30]. However, no reads were initially assigned to Cladosporium in our ALO-based metagenomic analysis.
In general, bacterial target enrichment requires probes covering two to three times the size of the whole genome [31]. Additionally, in a previous study targeting ancient Y. pestis DNA, a 120 Mb probe set was used to enable WGS [32]. In contrast, our study demonstrated that strain-level identification of Y. pestis is feasible using only 0.055 Mb of probes by targeting MLVA loci specifically. Targeting MLVA loci for enrichment offers a cost-effective and less complex alternative to whole-genome approaches. As the total probe size is minimal, this method is highly scalable: in principle, hundreds or even thousands of pathogenic species could be analyzed simultaneously in a single reaction by simply incorporating additional MLVA loci-specific probes. Since MLVA-based enrichment targets <1% of the genome, its resolution is inherently lower than that of WGS, and the possibility of misidentification cannot be completely ruled out [18]. Therefore, for detailed characterization or confirmation, whole-genome target enrichment is still necessary. Rather than replacing such high-resolution methods, the approach we propose serves as a rapid screening tool to provide initial strain-level identification, which is especially useful when sample quality or quantity is limited or when quick decision-making is required.
This study presents, to our knowledge, the first strain identification method utilizing target enrichment of MLVA regions for samples containing ultra-low amounts of target DNA. This approach follows the conceptual path of similar strategies in existing literature, where NGS-based detection of biothreat agents has been explored using target amplification of SNP regions [33], and NGS has been successfully performed with target capture of SNP regions on forensically challenging samples [7]. Our ultimate goal is ambitious: to identify hundreds of bacterial pathogens at the strain level in challenging forensic samples, in addition to obtaining whole-genome sequences of specific highly pathogenic viruses via target capture. Given that DNA fragmentation is a likely issue in environmental and forensic samples [7,34], we anticipated that relying on the amplification of numerous amplicons to cover both extensive bacterial targets and complete viral whole-genome sequences would significantly reduce efficiency. Therefore, we chose the target capture approach. While SNP detection undeniably offers higher resolution for detailed strain identification, we intentionally adopted a strategy that sacrifices some of this resolution for compact probe usage to enable efficient, rough strain identification. For example, we used only 455 probes for Y. pestis strain identification. We acknowledge, however, that this initial screening approach may necessitate additional sequencing for detailed characterization or confirmation. Consequently, although we considered target amplification of SNP regions for bacteria, as demonstrated in prior literature, the final decision on the optimal workflow must be a comprehensive one, balancing the total number of target bacteria and viruses, the required probe count, and the resulting overall efficiency.
Recent studies have attempted to resolve strain-level variation within specific bacterial species using metagenomic sequencing of clinical and environmental samples [35,36]. Advances in long-read sequencing technologies have also enabled accurate genome assembly from complex microbiomes, facilitating the separation of closely related strains [37]. However, these approaches typically require deep sequencing coverage to be effective. The MLVA loci-targeted enrichment strategy presented in this study lacks sufficient resolution to distinguish coexisting mixed strains within a single sample. Therefore, while it may serve as a rapid screening tool for strain-level identification, whole-genome target capture and sequencing may still be necessary for precise strain characterization. Future work should also investigate whether this approach can be expanded to simultaneously detect multiple bacterial species using a single probe set.