Metagenomic Sequencing for the Diagnosis of Plasmodium spp. with Different Levels of Parasitemia in EDTA Blood of Malaria Patients—A Proof-of-Principle Assessment

Molecular diagnostic approaches are increasingly included in the diagnostic workup and even in the primary diagnosis of malaria in non-endemic settings, where it is difficult to maintain skillful microscopic malaria detection due to the rarity of the disease. Pathogen-specific nucleic acid amplification, however, bears the risk of overlooking other pathogens associated with febrile illness in returnees from the tropics. Here, we assessed the discriminatory potential of metagenomic sequencing for the identification of different Plasmodium species with various parasitemia in EDTA blood of malaria patients. Overall, the proportion of Plasmodium spp.-specific sequence reads in the assessed samples showed a robust positive correlation with parasitemia (Spearman r = 0.7307, p = 0.0001) and a robust negative correlation with cycle threshold (Ct) values of genus-specific real-time PCR (Spearman r = −0.8626, p ≤ 0.0001). Depending on the applied bioinformatic algorithm, discrimination on species level was successful in 50% (11/22) to 63.6% (14/22) instances. Limiting factors for the discrimination on species level were very low parasitemia, species-depending lacking availability of reliable reference genomes, and mixed infections with high variance of the proportion of the infecting species. In summary, metagenomic sequencing as performed in this study is suitable for the detection of malaria in human blood samples, but the diagnostic detection limit for a reliable discrimination on species level remains higher than for competing diagnostic approaches like microscopy and PCR.


Introduction
Falciparum malaria is the quantitatively most important differential diagnosis in febrile travelers returning from Sub-Saharan Africa [1,2]. However, overall rare occurrence in non-endemic settings makes it difficult to maintain the training of laboratory personnel in skillful malaria microscopy as suggested by Giemsa more than 100 years ago [3] with positive slides apart from reference centers, resulting in a need for less investigator-dependent, standardizable diagnostic approaches. Molecular diagnostic techniques like real-time PCR (polymerase chain reaction) and LAMP (loop-mediated isothermal amplification) are associated with a higher sensitivity compared to microscopy even if performed in a reference center [4]. If performed from capillary blood, acceptable correlation of real-time PCR-based Int. J. Mol. Sci. 2022, 23, 11150 2 of 12 semi-quantification and microscopic quantification in line with the World Health Organization (WHO) standards can be achieved as well [5]. Further, species-specific real-time PCR outperforms traditional microscopy regarding the identification of mixed infections [6,7] as well as regarding the discrimination of morphologically similar parasites like Plasmodium vivax and Plasmodium ovale complex [8,9] or even of Plasmodium ovale curtisi and Plasmodium ovale wallikeri within the P. ovale complex [10][11][12][13]. Further, due to the superior sensitivity of the molecular test assays [14][15][16], they are of importance for the detection of reservoirs with low parasite density [17][18][19] or even submicroscopic infections [20] in the course of eradication programs [18,21,22].
In spite of such advantages of molecular malaria diagnosis, this diagnostic approach has a number of disadvantages as well. The main disadvantage of the diagnostic application of oligonucleotide-primed nucleic acid amplification techniques like real-time PCR or LAMP is the fact that they allow a targeted assessment only. So, only pathogens covered by the oligonucleotides of the assays can be detected, while other pathogens, which might have been also detected in case of microscopic assessment of stained slides, necessarily go undetected [23,24]. Although multiplexing partly compensates for these disadvantages, intrinsic technical features of the amplification techniques limit the quantitative dimension of multiplexing options. For a less specific and thus broader diagnostic screening, alternative molecular diagnostic strategies are therefore desirable.
Adding to previously published work on NGS-based diagnosis of malaria [34][35][36][37], the aim of the study is a differentiated assessment of metagenomic detection of Plasmodium spp. in human EDTA (ethylenediaminetetraacetic acid) blood for diagnostic purposes in a standardized way. To do so, nucleic acid extractions from well characterized samples of defined malaria patients infected with different Plasmodium spp. with and without mixed infection with different levels of parasitemia next to a negative control obtained from a patient with a non-Plasmodium bloodborne parasitic infection were subjected to metagenomic NGS analysis. By doing so, the diagnostic accuracy of the approach for the diagnosis of malaria was studied in a proof-of-principle assessment.

NGS Results and Correlation of Reads Assigned to Plasmodium spp. with Microscopically Assessed Parasitemia and Plasmodium Genus-Specific Real-Time PCR-Based Semi-Quantification
Successful NGS runs with raw read numbers between 13.5 and 32 million were recorded for 21 samples positive for Plasmodium spp. DNA and a negative control from a Schistosoma mansoni-infected patient ( Table 1). As shown in columns 2 and 3 of Table 1, a broad spectrum of parasitemia as confirmed by microscopic assessment (column two) and semi-quantification based on cycle threshold (Ct) values of real-time PCR (column three) was assessed. In the latter case, high parasitemia is indicated by low Ct values and vice versa. The proportion of reads assigned to human DNA ranged from 95.9% to 98.9%, and the proportions of reads assigned to Plasmodium spp. from 0.2% to 2.9%. The proportion of reads assigned to Plasmodium spp. was correlated with microscopically observed parasitemia as well as with measured genus-specific Ct values obtained with the RealStar Malaria PCR Kit 1.0 (altonaDiagnostics, Hamburg, Germany). A Spearman r of 0.7307 (95%-confidence interval (CI): 0.4360, 0.8839) was calculated indicating a positive correlation of this proportion with parasitemia (p = 0.0001), while a negative correlation (p ≤ 0.0001) with a Spearman r of −0.8626 (95%-CI −0.9461, −0.6717) was calculated for the correlation with the Ct values. Scatter diagrams showing the distribution of the recorded values are provided in Figure 1. Even in the negative control sample, however, a low proportion of 0.2% reads was assigned to Plasmodium spp., defining an expected range of unreliable assignments. Similar proportions were observed in various samples with low parasitemia close to the microscopic detection threshold of <50 parasites/µL. Malaria PCR Kit 1.0 (altonaDiagnostics, Hamburg, Germany). A Spearman r of 0.7307 (95%-confidence interval (CI): 0.4360, 0.8839) was calculated indicating a positive correlation of this proportion with parasitemia (p = 0.0001), while a negative correlation (p ≤ 0.0001) with a Spearman r of −0.8626 (95%-CI −0.9461, −0.6717) was calculated for the correlation with the Ct values. Scatter diagrams showing the distribution of the recorded values are provided in Figure 1. Even in the negative control sample, however, a low proportion of 0.2% reads was assigned to Plasmodium spp., defining an expected range of unreliable assignments. Similar proportions were observed in various samples with low parasitemia close to the microscopic detection threshold of <50 parasites/µ L.

Matching of the Diagnostic Results on Species Level Based on the Diagnostic Reference Approach and Bioinformatic Assessments Based on Kraken, Bracken and Pavian
Concerning the identification of Plasmodium spp. on species level in the sample materials, the Kraken and the Bracken algorithm showed similar performance with 50% (11/22) correct identifications each as defined by the highest number of reads assigned to the respective species (Table 2). For both algorithms, the correctly and falsely assigned samples were identical. In detail, the algorithms correctly identified 2/4 Plasmodium falciparum samples (parasitemia range: 50.000-175.000/µL), 4/5 P. vivax samples (parasitemia range: 2.400-50.000/µL), 5/5 P. ovale curtisi/complex samples (parasitemia range: 50-20.920/µL), 0/4 Plasmodium malariae samples and 0/3 samples with mixed infections comprising P. falciparum and P. malariae. All 11 falsely assigned samples including the negative control samples were assigned to P. ovale wallikeri. Parasitemia of the falsely assigned samples of malaria patients were 50/µL as well as below the microscopic detection threshold for the two missed P. falciparum samples, 272/µL for the single missed P. vivax sample, <50-122 µL for the four missed P. malariae samples and 50-5240/µL for the three misidentified cases with mixed infections. Erroneous assignments were associated with assigned read numbers < 50.000, while read numbers > 50.000 were associated with assignments matching the result of the diagnostic reference approach in all observed instances. Then applying the Pavian approach and using maximum z-scores as identifiers, the matching of the results of the diagnostic reference assessments and the NGS approach could be increased to 63.6% (14/22). In detail, the Pavian approach led to additional correct assignments of the two P. falciparum samples with parasitemia close to the microscopic detection limit, the P. vivax sample with 272 parasites/µL blood and the P. malariae sample with the highest parasitemia of 122/µL compared to the Kraken/Bracken approach. In contrast, one P. ovale complex sample with parasitemia at the microscopic detection limit of 50/µL, which had been correctly identified as P. ovale wallikeri/complex applying the Kraken/Bracken approach, was misidentified as P. vivax by the Pavian algorithm. Another misidentification with a species of etiological relevance for human patients comprised a sample containing 104/µL P. malariae, which reads were falsely assigned to P. falciparum. In the three samples from patients with mixed infections, at least P. falciparum as the quantitatively dominant species could be correctly identified, while the co-occurring P. malariae DNA went undetected in all three instances. The reads within the negative control sample were assigned to Plasmodium gallinaceum, a species without etiological relevance in human patients, thus indicating a false assignment. However, false assignments to Plasmodium species without etiological relevance in humans were also observed in two samples containing DNA of P. malariae with parasitemia below the microscopic detection threshold, leading to a false exclusion of human malaria in these two cases.

Discussion
The study was performed to assess the diagnostic reliability of a sequence-based metagenomic approach for the diagnosis of malaria from human blood samples. As suggested by previous studies [25][26][27][28][29][30][31][32][33][34][35][36][37], sequence-based malaria diagnosis from human primary sample materials is basically feasible. To evaluate the techniques' diagnostic threshold and detection limits, difficult-to-diagnose reference materials comprising submicroscopic parasitemia, parasitemia close to the microscopic detection limit and mixed plasmodial infections were included into the assessment.
As shown by the 11 to 14 perfectly matching results of reference testing and the diagnostic NGS approach as well as by the highly significant positive correlation of the Plasmodium spp.-specific NGS read proportion with parasitemia and the similarly good negative correlation with genus-specific Ct-values, hypothesis-free metagenomic diagnosis of malaria was confirmed to be feasible. This was true for several levels of parasitemia including a P. falciparum-positive sample below the microscopic detection threshold. Within the range of parasitemia close to the microscopic detection threshold, however, discrimination on species level becomes non-reliable. Varying proportions of the different species in mixed infections make NGS less suitable for their reliable identification. Further and discussed in more detail below, insufficient availability of quality-controlled reference genomes also limits the discriminatory potential on the species level.
In addition, some known general weaknesses of metagenomic diagnostic assays have been confirmed in the assessment as well. First, as known from the literature, the approach is quite laborious, still cost-intensive and time consuming [32], which limits its use for diagnostic purposes both in cases of medical emergencies and in the diagnostic routine. At least the diagnostic NGS algorithm assessed in this study still demands 2-3 days, which is considerably too long in comparison to the applied competitor approaches microscopy and real time PCR. Also, it requires several hours of hands-on-time and costs more than USD 1000 per sample (Table 3). Table 3. Time-to-result, hands-on-time and material costs of the diagnostic approaches for the diagnosis of malaria which were compared in this study. Second and also well-known from previous studies, the sensitivity of the assessment will depend on sequence depth [23], its reliability on the quality of the underlying databases and the applied diagnostic pipelines [23,37]. Third, the technique is volatile to contamination, which makes the discrimination of etiological relevant results from sample contaminations difficult to impossible if the target sequences occur in a similar or even lower frequency than co-occurring contaminating sequences [28,39]. Fourth, although NGS gets increasingly affordable [23,38] and in spite of modern USB-stick-shaped technical solutions, it is still technically demanding, which makes it less suitable for the point of care diagnosis of malaria at its present stage of technical development. Malaria diagnosis based on microscopy of thin and thick blood smear, in contrast, is still commonly performed in laboratories globally. This is in line with the fact that malaria microscopy is the diagnostic reference standard as defined by the World Health Organization (WHO), although microscopic skills are increasingly challenging to maintain even in endemic countries, because rapid diagnostic tests (RDT) for malaria are more frequently applied.

Microscopy
In the study provided here, the importance of the quality of the underlying database [23,37] was confirmed by the fact that reads were assigned to Plasmodium spp. even in a sample from a patient without malaria and that various misidentifications on species level occurred. Further, the depth of sequencing remains an issue [23], resulting in cases of failed or nonunambiguous species discrimination in case of parasitemia below as well as close to the diagnostic detection threshold of microscopy. The same will generally apply if the number of generated sequence reads is too low for technical reasons or due to reduced sample quality. Unfortunately, the discriminatory potential of metagenomic NGS for the diagnosis and differentiation of plasmodiae on species level was found to be inferior to the microscopic and PCR-based detection threshold for the diagnosis of malaria. On genus level, detection of reads assigned to Plasmodium spp. was difficult to interpret in samples with low parasitemia, because the number of recorded plasmodial reads was similar like observed in a negative control sample in this study. Of note, high variations in the proportions of different plasmodial species in case of plasmodial mixed infections made NGS poorly suitable for the identification of such mixed infections, while the z-score based approach at least allowed the identification of P. falciparum sequences in these cases.
Regarding the applied assessment algorithms, and since certain Plasmodium genomes contain sequences from host and other genomes (e.g., up to 60% of the Plasmodium yoelli genome identified as contaminants [40]), we first removed all contaminant sequences from the Plasmodium genomes present in the Ensembl database. For both the removal of contaminants and later for the taxonomic assignment of the metagenomic reads, we employed Kraken because this program is able to work in prokaryotic and eukaryotic genomes simultaneously, and it has shown an overall better performance than existing tools built for the same purposes [41,42]. Kraken aligns k-mers (sequence fragments of length k) from sequence reads to databases containing genomes with their respective taxonomic information. Reads that map to common regions in different genomes are assigned by Kraken to their lowest common ancestor (LCA). As a result, many reads are not attributed at the species level and hence Kraken assignments cannot be directly interpreted as species estimates [43]. Still, Kraken reports much more accurate results than similar tools upon normalization [44]. Therefore, we further processed the results from Kraken for abundances separately with Bracken [43] and Pavian [45]. Bracken calculates species abundances in metagenomic samples by probabilistically re-distributing reads in the taxonomic tree, i.e., reads are assigned above the species level are incorporated into the species level, while reads allocated at the strain level are re-distributed upwards to its corresponding parent species level [43]. Similarly, we used Pavian, an online tool that allows the visual inspection of Kraken results, as well as providing different transformations of the Kraken output data, such as the z-score of the assigned reads [45]. Both Bracken and Pavian have been employed in the diagnostics of infectious diseases through metagenomic assignment, and the results from Kraken were particularly improved upon using the zscore, as implemented in Pavian [46]. In this regard, since both Bracken and Pavian work on Kraken results, and these in turn are affected by the sequencing quality and coverage, we expect that the results from the metagenomic methods presented here will be greatly improved whenever the genomes from the target databases are less prone to contaminant sequences, and sequencing replicates and higher sequencing coverages are available for the clinical samples.
The study has a number of limitations. First, only a small number of samples could be included due to funding restrictions. Second, no patient-specific clinical details could be provided in line with the ethical demands allowing the use of anonymized residual sample materials for the diagnostic evaluation only. Third, it remains unclear if better matching might have resulted from the use of a more comprehensive database containing quality-controlled genomes of organisms other than just Homo sapiens and Plasmodium spp. However, such a comprehensive approach would have been beyond the scope of this proof-of-principle assessment with limited available resources.

Study Design and Sample Materials
The comparative study evaluating the metagenomic diagnosis of malaria in human EDTA blood was conducted in a single-blinded way, meaning that the study partner performing the NGS analyses did not know the microscopic and molecular diagnostic results when the assessments were performed. The workflow of the assessment is provided below in Table 4. Table 4. Summarized workflow of the assessment.

•
Inclusion of residual sample materials of 21 serum samples found to be positive for malaria in previous test evaluation approaches [4] comprising microscopic assessment and species-specific real-time PCR covering various plasmodial species and a broad range of parasitemia • Inclusion of a single residual sample material of a serum from a patient suffering from an infection with the non-malaria blood parasite Schistosoma mansoni as a negative control In detail, a total of 22 samples was assessed, one sample per NGS run was investigated. Among the samples, 21 out of 22 had been found to be positive for malaria due to P. falciparum, P. malariae, P. vivax or P. ovale complex comprising P. ovale curtisi and P. ovale wallikeri in mono-infections or mixed infections as confirmed by microscopy as recommended by WHO as well as molecular diagnostic approaches as described previously [4]. P. knowlesi-positive samples could not be included, because such residual materials were not available. In detail, the molecular diagnoses comprised genus-specific malaria detection applying commercial LAMP (Alethia Malaria, Meridian Bioscience Inc., Cincinnati, OH, USA) [47] and real-time PCR (RealStar Malaria PCR Kit 1.0, altona Diagnostics, Hamburg, Germany) [8] as well as species-specific SybrGreen-based real-time in-house PCR [48,49] and commercial species-specific real-time PCR (FTD Malaria differentiation, Fast Track Diagnostics, Sliema, Malta; RealStar Malaria S&T Kit 1.0, altona Diagnostics, Hamburg, Germany) [8]. Samples positive for P. ovale complex were differentiated into P. ovale curtisi and P. ovale wallikeri by in-house real-time PCR as described elsewhere [10][11][12][13]. Of note, one of the reference samples had been repeatedly misidentified as P. vivax by microscopy while only PCR revealed the identity as P. ovale complex [8]. Microscopic quantification as well as semi-quantification based on recorded cycle-threshold (Ct) values of real-time PCR had been conducted for the characterization of the reference materials. The samples were chosen by ensuring that specimens were included with a spectrum of the parasitemia ranging from submicroscopic infections (<50 parasites/µL) till one-digit percentages of infected erythrocytes. This was done in order to define the detection thresholds of the metagenomic NGS approach for identifications on genus level, on species level as well as for the discrimination of mono-infections and mixed infections. Finally, an EDTA blood sample from a patient with early schistosomiasis (Katayama fever) caused by S. masoni as confirmed by in-house serum duplex real-time PCR targeting Schistosoma haematobium complex and S. mansoni complex [50] was included as a negative control. Patient specific data could not be included in the assessment, because complete anonymization of the investigated sample materials was an ethical requirement as demanded by the ethics committee for the described diagnostic test assessment. This is an admitted deviation from the STARD (Standards for Reporting Diagnostic Accuracy) criteria [51].

Metagenomic Next Generation Sequencing
Nucleic acid sequences used for metagenomic NGS had been extracted applying the EZ1 DNA Blood 200 µL Kit using automated EZ1 nucleic acid extractors (Qiagen, Hilden, Germany) as described by the manufacturer and detailed elsewhere [52]. Prior to the assessments, the samples had been stored frozen at −80 • C. Metagenomic unbiased NGS sequencing of the DNA elements in the samples was conducted by an experienced medicallaboratory assistant using a MiSeq system (Illumina, San Diego, CA, USA) applying the protocols provided by the manufacturer. No target DNA enrichment or human DNA depletion was done. In short, DNA libraries were prepared with TruSeq ® Nano DNA Sample Preparation kits (Illumina) using the low sample (LS) protocol. Thereby, 100 ng of each genomic DNA from the assessed samples was fragmented using Adaptive Focused Acoustics TM Technology (Covaris, Inc., Woburn, MA, USA) with a Covaris M220 applying settings for fragment sizes in the 350 bp range (duty factor 20%, peak incident power 50 W, cycle per burst 200, duration 65 s, at a temperature of 65 • C). The chromosomal DNA fragments were cleaned up with bead technology. End repair was done in line with the TruSeq protocols. Additional clean-up and size selection was conducted with bead technology. 3'-ends were subsequently adenylated, Illumina adapters were ligated and the DNA fragments were enriched. An Agilent DNA 7500 kit (Agilent Technologies, Inc., Santa Clara, CA, USA) was applied for quality control and for the confirmation of the intended fragment size after the application of the Covaris M220 fragmentation protocol and after Illumina adapter ligation. Visualization of clearly defined peaks in the expected size range was considered as proof of successful DNA fragmentation and adapter ligation. No concentration calculation by integrating the area under the peak was conducted, because this was considered as insufficiently sensitive for sequencing purposes. Instead of this, library DNA concentration measurements were performed using Qubit dsDNA BR assay kits (Thermo Fisher Scientific, Waltham, MA, USA), before the sequencing cells were loaded. After this assessment, each individual library was adjusted to a 4 nmol/L stock solution and from these stocks, 6 pmol was used for each individual sequence run. Sequencing was conducted as Reagent Kit MiSeq ® v3 (600 cycle) runs (Illumina). Each sample was assessed with a complete v3 run.

Bioinformatic Sequence Analysis and Statistics
The bioinformatic analysis consisted of three steps: First, cleaning the Plasmodium genomes from contaminants; then building a custom database consisting of Plasmodium and human genomes; and finally assigning the clinical metagenomic sequencing reads to the corresponding Plasmodium genomes present in the custom database. For the first part, and as certain Plasmodium genomes are contaminated with host and microbial sequence data [40], Plasmodium spp. genomes were retrieved from Ensembl [53], and cleaned from contaminants with the following protocol: (i) slicing the genome sequences into pseudoreads (length 100 bp, overlap 50 bp) with pyfasta version 0.5.2 [54] and BBtools version 38.95 [55]; and (ii) assigning these pseudoreads to non-Plasmodium genomes, as present in the Standard-16 Kraken database, with Kraken version 2.1.2 [56]. The Standard-16 database consists of sequences that can be considered common contaminants of human samples, such as genomes from archaea, bacterial, viral, plasmid, Univec_Core and unmasked human sequences present in the RefSeq database (retrieved from https://benlangmead.github.io/aws-indexes/k2, accessed on 19 September 2022). Pseudoreads matching these contaminant genomes from this database were discarded from the genome assemblies, as well as any fragments smaller than 100 bp [40]. Afterwards, a custom Kraken database was built with these "clean" Plasmodium genome assemblies together with the unmasked human genome, which was then used to assign the metagenomic reads from the EDTA blood samples with Kraken to the taxonomic levels present in this custom database. These taxonomic assignments provided by Kraken were then further processed for abundance separately with Bracken version 2.7 [43] and Pavian (package version 1.2.0 [44]). In the case of our clinical metagenomic samples, Bracken was run at the species level and using a read size of 100 bp, while for Pavian we employed the z-score over the Kraken-assigned reads. All other parameters were kept with default values for both programs. The metagenomic sequencing runs were deposited on the European Nucleotide Archive (ENA) under the accession number PRJEB55471.
Correlations between the proportion of Plasmodium spp.-specific reads and microscopically observed parasitemia as well as measured genus-specific cycle threshold (Ct) values obtained with the RealStar Malaria PCR Kit 1.0 (altonaDiagnostics, Hamburg, Germany) were calculated applying Spearman rank correlation testing, for which no Gaussian distribution needs to be assumed. The calculation was conducted applying the software GraphPad Instat, version 3.06 (GraphPad Software Inc., La Jolla, CA, USA).

Ethical Clearance
Ethical clearance for the anonymized use of residual sample materials for test evaluation purposes without requirement of informed consent was provided by the medical association of Hamburg, Germany (reference number: WF-011/19, obtained on 11 March 2019) in line with National German laws. The study was performed according to the Declaration of Helsinki and its amendments.

Conclusions
The assessment proved the general suitability of the applied metagenomic diagnostic approach for the diagnosis of malaria from human blood. The limitations as detailed in the discussion, however, still leave considerable room for improvement of the technique before application in the diagnostic routine or even in case of emergency situations can be recommended.  Institutional Review Board Statement: Ethical clearance for the anonymized use of residual sample materials for test evaluation purposes without requirement of informed consent was provided by the medical association of Hamburg, Germany (reference number: WF-011/19, obtained on 11th March 2019) in line with National German laws. The study was performed according to the Declaration of Helsinki and its amendments.

Informed Consent Statement: Not applicable.
Data Availability Statement: All relevant data are provided within the manuscript and its appendix or are accessible via the links provided in the text. The Illumina sequencing data was deposited on the European Nucleotide Archive (ENA) under the project accession number PRJEB55471.