PathoLive—Real-Time Pathogen Identification from Metagenomic Illumina Datasets

Over the past years, NGS has become a crucial workhorse for open-view pathogen diagnostics. Yet, long turnaround times result from using massively parallel high-throughput technologies as the analysis can only be performed after sequencing has finished. The interpretation of results can further be challenged by contaminations, clinically irrelevant sequences, and the sheer amount and complexity of the data. We implemented PathoLive, a real-time diagnostics pipeline for the detection of pathogens from clinical samples hours before sequencing has finished. Based on real-time alignment with HiLive2, mappings are scored with respect to common contaminations, low-entropy areas, and sequences of widespread, non-pathogenic organisms. The results are visualized using an interactive taxonomic tree that provides an easily interpretable overview of the relevance of hits. For a human plasma sample that was spiked in vitro with six pathogenic viruses, all agents were clearly detected after only 40 of 200 sequencing cycles. For a real-world sample from Sudan, the results correctly indicated the presence of Crimean-Congo hemorrhagic fever virus. In a second real-world dataset from the 2019 SARS-CoV-2 outbreak in Wuhan, we found the presence of a SARS coronavirus as the most relevant hit without the novel virus reference genome being included in the database. For all samples, clinically irrelevant hits were correctly de-emphasized. Our approach is valuable to obtain fast and accurate NGS-based pathogen identifications and correctly prioritize and visualize them based on their clinical significance. PathoLive is open source and available on GitLab and BioConda.


Introduction
The identification of pathogens directly from patient samples is a major clinical need. While highly accurate pathogen detection methods such as polymerase chain reaction (PCR), cell culture, or amplicon sequencing exist, such routine procedures often fail to identify the underlying cause of a patient's symptoms due to their targeted behavior [1][2][3][4]. As a complementary approach, metagenomics next-generation sequencing (NGS) has been proposed as a valuable technique for clinical application. NGS facilitates the detection and characterization of pathogens without a priori knowledge about candidate species. Further, it generates a sufficient amount of data to detect even lowly abundant pathogens without targeted amplification of specified sequences allowing for hypothesis-free diagnostic analysis. Current tools to address NGS-based pathogen identification can be divided into two major categories, either aiming to discover yet unknown genomes [5][6][7][8][9][10][11][12][13][14][15] or to detect known organisms in a sample [16][17][18][19][20][21][22][23][24][25][26][27][28][29][30][31][32]. From an algorithmic perspective, a further distinction can be made between alignment-based methods, alignment-free methods, or combinations of both. While alignment-free methods usually deliver faster results, alignment-based methods potentially allow for a more extensive characterization of the sample.
Regardless of the algorithmic approach, existing methods based on unbiased metagenomics NGS face various obstacles, especially concerning the ranking of the results according to their clinical relevance and the long overall turnaround time [33][34][35][36][37][38][39][40]. The lack of good ranking methods is based on the fact that the distinction between clinically relevant and irrelevant data is not trivial. First, the dominating part of the sequences in a patient sample usually originates from the host genome. Second, there are nucleic acids of various species that are usually of low clinical relevance such as endogenous retroviruses (ERV) or non-pathogenic bacteria which commonly colonize a person. For these reasons, the number of reads hinting towards a relevant pathogen can be as low as a handful of individual reads. To put it more generally, it is a widespread misconception to rely only on quantitative measures when ranking the importance of candidate hits as not the amount but the uncommonness of a species in a given sample may give critical indications on its relevance. Based on the premise that a large proportion of the produced reads may stem from the host genome, species irrelevant for diagnosis, or common contaminations, even highly accurate methods struggle with false-positive hits potentially concealing the relevant results. This central problem is getting worse when considering that even microbial databases are contaminated with human sequences [41]. Existing pipelines tackle this problem in different ways. One common strategy is to ignore sequences that occur in a reference database of host and contaminating sequences [9,10,18,26,28,30]. 
While facilitating cleaner results, this approach may lead to a premature rejection of relevant sequences and does not solve the problem of human contaminations in reference databases as those "derive primarily from high-copy human repeat regions, which themselves are not adequately represented in the current human reference genome" [41]. Further, the definition of precise contamination databases proves rather difficult and has not yet been adequately solved. Thus, deleting any results to gain a better overview comes at great risk of overlooking the true cause of an infection. A different strategy is intensity filters, as implemented, e.g., in SLIMM [19], that disregard sequences with low genome coverages. As the author states, this step eliminates many genomes which introduces the risk of losing information that might be relevant in the following diagnostic process. This problem even intensifies for marker-gene-based methods such as MetaPhlAn2 [23], as large parts of the sequenced reads cannot be assigned due to the miniaturized reference database. While this may lead to a better ratio of seemingly relevant assigned reads to those from the background, it comes with the risk of disregarding relevant candidates.
Another fundamental problem of NGS-based pathogen identification approaches is the fact that sequencing and analysis are very time consuming. Even when considering the reduction in sequencing time in the last years, current mid- and high-throughput devices still have maximum runtimes of more than a day (NextSeq 550) and up to two (NovaSeq 6000) or three days (HiSeq X), respectively. The resulting turnaround times of two to four days including data processing and analysis are not short enough for many critical scenarios such as sepsis and infectious disease outbreaks. To obtain actionable results within an appropriate time frame it is crucial to reduce the time span from sample receipt to diagnosis. However, existing approaches to speed up NGS-based diagnostics come with significant disadvantages such as a highly reduced throughput and data quality [42], massive reduction of analyzed reads or targets [43] or the need for specialized hardware that involves additional costs and relatively low flexibility to adapt the workflow to a given scenario [44]. One approach for taxonomic classification of NGS data during runtime of sequencing is implemented in LiveKraken, a real-time version of the well-known Kraken software [45]. However, as LiveKraken does not provide positional information in its results, a sequence-based ranking to determine the relevance of hits is not possible with this approach.
As a general complement to real-time analysis of short-read sequencing data, there are several promising studies for pathogen detection using the MinION handheld device which is particularly useful for field studies and produces longer reads of up to several hundred kilobase pairs. While allowing very fast throughput times, these devices yield only approximately a million reads with comparably low per-base qualities, limiting their areas of application to targeted sequencing so far [5,42,46,47,48].
The currently high turnaround times from sample arrival to final diagnosis make it necessary to develop efficient methods to generate, analyze, and understand large metagenomics datasets in an accurate and quick manner to pave the way for NGS as a standard tool for clinical diagnostics. This enforces NGS-based diagnostics workflows to generate and evaluate large numbers of reads to facilitate adequate sequencing depths while reducing the time span between sample receipt and diagnosis. To overcome the named obstacles, we present PathoLive, an NGS-based real-time pathogen detection tool. We present an innovative approach to handle the occurrence of common contaminations, background data, and irrelevant species in a single step. To tackle the problem of long overall turnaround times, we based our novel approach on the real-time read mapper HiLive2 which enables the analysis of sequencing data while an Illumina sequencer is still running [49]. This enables PathoLive to perform nucleotide-level analysis based on NGS providing an open view and high accuracy in short turnaround times while generating an intuitive and interactive visualization of results that highlights organisms of high clinical significance.

Implementation
Our workflow follows a different paradigm than other frameworks to tackle the existing problems, as shown in Figure 1: (i) prepare informative, well-defined reference databases, (ii) automatically define contaminating or non-pathogenic sequences beforehand, (iii) use HiLive2 for accurate real-time alignment of Illumina sequencing data, (iv) visualize the potential risk of candidate pathogens and present results in an intuitive, comprehensible manner. The details on the modules for each of these steps are provided in the following paragraphs: (i) Preparation of reference databases: In order to save computational effort during the analysis, reference databases including the full taxonomic lineage of organisms are prepared before the first execution of PathoLive. For this purpose user selectable databases, for example, the RefSeq Genomic Database [50], are downloaded from the File Transfer Protocol (FTP) servers of the National Center for Biotechnology Information (NCBI) and annotated accordingly with taxonomic information from the NCBI Taxonomy Database. While preserving the original NCBI annotation of each sequence, additional information is appended to the sequence header. This information consists of each taxonomic identifier (TaxID), rank, and name of each taxon in the lineage of an organism. Afterwards, user-definable sub-databases of taxonomic clades relevant for a distinct pathogen search are automatically created. For the experiments in this manuscript, we focused on viruses. The database updater used for this purpose is available at https://gitlab.com/rki_bioinformatics/database-updater (accessed on 23 August 2022). The viral database used in this manuscript can be downloaded as a single compressed FASTA file from Zenodo (https://doi.org/10.5281/zenodo.2536788, accessed on 23 August 2022) and is ready to use for viral diagnostics with PathoLive. 
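The header annotation described in step (i) can be illustrated with a short sketch. This is not the code of the database updater; the in-memory taxonomy dictionary, the function names, and the `|lineage=` header layout are assumptions chosen only to show how a TaxID, rank, and name for each taxon in a lineage could be appended while the original header is preserved.

```python
from typing import Dict, List, Tuple

# Hypothetical in-memory taxonomy: taxid -> (parent_taxid, rank, name).
# A real setup would parse this from the NCBI Taxonomy dump files.
Taxonomy = Dict[int, Tuple[int, str, str]]


def lineage(taxid: int, taxonomy: Taxonomy) -> List[Tuple[int, str, str]]:
    """Return (taxid, rank, name) triples from the root down to `taxid`."""
    path: List[Tuple[int, str, str]] = []
    seen = set()
    while taxid in taxonomy and taxid not in seen:
        seen.add(taxid)  # guards against cycles in malformed input
        parent, rank, name = taxonomy[taxid]
        path.append((taxid, rank, name))
        if parent == taxid:  # the root is its own parent
            break
        taxid = parent
    return list(reversed(path))


def annotate_header(header: str, taxid: int, taxonomy: Taxonomy) -> str:
    """Keep the original header and append the lineage (illustrative format)."""
    fields = ";".join(
        f"{t}:{rank}:{name}" for t, rank, name in lineage(taxid, taxonomy)
    )
    return f"{header} |lineage={fields}"


if __name__ == "__main__":
    # Toy taxonomy with made-up identifiers.
    toy = {
        1: (1, "no rank", "root"),
        2: (1, "family", "ExampleFamily"),
        3: (2, "species", "Example virus"),
    }
    print(annotate_header(">seq1 Example virus", 3, toy))
```

Storing the full lineage in the header trades a larger FASTA file for not having to query the taxonomy again during the time-critical analysis.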
(ii) Identification and labeling of clinically irrelevant hits: A main obstacle in NGS-based diagnostics is the large amount of background noise contained in the data. This includes various sources of contamination such as sequencing artifacts, ambiguous references, and clinically irrelevant species, which hinder a quick evaluation of a dataset. Defining an exhaustive set of possible contaminations is a yet unachieved goal. Furthermore, deleting such sequences carries the risk of losing relevant results. Since in this step raw sequencing data from a human host is examined, the logical conclusion is to contrast it to comparable raw datasets instead of processed genomes. Instead of deleting the background and risking the loss of relevant information, we implemented a method to define and mark all kinds of undesired signals on the basis of comparable datasets from freely available resources. For this purpose, raw data from 236 randomly selected datasets from the 1000 Genomes Project Phase 3 [51] were downloaded, assuming that a large majority of the participants in the 1000 Genomes Project were not acutely ill with an infectious disease. The full list of selected datasets is provided in the supplementary material (Section S3.3). The reads are quality trimmed using Trimmomatic [52] and mapped to the selected pathogen reference database using Bowtie2 [53]. Whenever a stretch of a sequence is covered once or more in a dataset from the 1000 Genomes Project, the overall background coverage of these bases is increased by one. Coverage maps of all references from the pathogen database are stored in the serialized pickle file format. Stretches of DNA found in this data are marked as of lower clinical significance and visualized as such in later steps of the workflow. The coverage maps of the background abundances are plotted in red color against the coverage maps of the reads from the patient dataset in green color on the same reference (Figure 2). This enables highlighting presumably relevant results without discarding other candidate pathogens, giving the researcher the best options to interpret the results in depth but still in an efficient manner. The code for the generation of these databases is part of PathoLive.
Based on these illustrations, human mastadenovirus B can be considered a relevant hit while HERV K113 is rightly found in the dataset, but not considered a clinically relevant candidate due to its common prevalence in healthy human individuals.
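The counting rule of step (ii), in which each background dataset raises the coverage of a base by at most one, can be sketched as follows. This is a minimal illustration rather than the PathoLive implementation: the interval representation, the function names, and the dictionary layout of the coverage maps are assumptions, and the alignments are taken as already reduced to half-open intervals per reference.

```python
import pickle
from typing import Dict, Iterable, List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on one reference


def add_dataset(background: List[int], intervals: Iterable[Interval]) -> None:
    """Add one dataset: every base covered at least once counts exactly +1."""
    covered = [False] * len(background)
    for start, end in intervals:
        for pos in range(max(start, 0), min(end, len(background))):
            covered[pos] = True
    for pos, hit in enumerate(covered):
        if hit:
            background[pos] += 1


def build_background(
    ref_lengths: Dict[str, int],
    datasets: Iterable[Dict[str, List[Interval]]],
) -> Dict[str, List[int]]:
    """Build per-reference background coverage maps over all datasets."""
    maps = {ref: [0] * length for ref, length in ref_lengths.items()}
    for dataset in datasets:
        for ref, intervals in dataset.items():
            if ref in maps:
                add_dataset(maps[ref], intervals)
    return maps


if __name__ == "__main__":
    maps = build_background(
        {"refA": 10},
        [
            {"refA": [(0, 4), (2, 6)]},  # overlapping reads count once
            {"refA": [(4, 8)]},
        ],
    )
    # Serialize the maps with pickle, the format named in the text.
    restored = pickle.loads(pickle.dumps(maps))
    print(restored["refA"])
```

With this rule, the value at each position is the number of background datasets in which that base was seen, so it is bounded by the number of datasets regardless of sequencing depth.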
(iii) Using HiLive2 for real-time alignment of reads: We used HiLive2 (version 2.1) to produce real-time alignments of intermediate sequencing results. Thereby, the raw sequencing data is directly loaded in raw BCL file format without the need to perform a file conversion step. Alignments are updated with each new sequencing cycle and output in BAM format can be created for any sequencing cycle. As changes in the mapping positions mainly occur in early sequencing cycles, we recommend creating output in shorter intervals at the beginning of sequencing. Options for integrated demultiplexing and adapter trimming are available. For algorithmic details of HiLive2, we refer to Loka, Tausch and Renard [49].
(iv) Visualization and hazardousness classification: A key hurdle in a rapid diagnostics workflow, which is often underestimated, is the presentation of results in an intuitive way. Many promising efforts have been made by different tools, e.g., providing coverage plots [30,54] or interactive taxonomy explorers [12,28]. While being hard to measure and thus often ignored, the time it takes for groups of experts to assess the results and come to a correct conclusion should be considered. Our browser-based, interactive visualization is implemented in JavaScript using the data visualization library D3 [55]. For an example of the visualization, see Figure 3. While providing all available information on demand, the structure of a taxonomic tree allows an intuitive overview. Detailed measures are available on the genus, family, species, and sequence level. For the calculation of scores for a given node $n$, we define $t(n)$ as the total number of read alignments to an underlying species of $n$. $b(n)$ is the total number of bases being covered by all reads with respect to $n$. Accordingly, $b_{bg}(n)$ describes the number of bases being covered by the background database and $b_{fg \setminus bg}(n)$ is the number of bases being covered by the foreground but not by the background data. In total, we provide three different scores for each node $n$ of the tree:
(a) Total hits $T_n$, representing the total number of hits to all underlying sequences in this branch and thus the total abundance of a clade: $T_n := t(n)$.
(b) Unambiguous bases $U_n$, representing the total number of bases covered in the foreground data but not in any background dataset: $U_n := b_{fg \setminus bg}(n)$.
(c) Weighted score $W_n$, being the ratio of unambiguous bases for the foreground data to the number of bases covered by the background database and logarithmically weighted by the total number of alignments.
While the total hits $T_n$ can be useful to get a general impression of the abundance of sequences in the sample, the unambiguous bases $U_n$ provide a first comparison to the background dataset. The weighted score $W_n$ introduces an intensified metric of how often a sequence is found in a healthy individual, and thereby allows drawing stricter conclusions from the background data. Not only exactly overlapping mappings of fore- and background are regarded, but also the overall abundance of a sequence within the background data is considered.
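To make the three scores concrete, the sketch below computes them from per-base coverage vectors of a single reference. The total hits and unambiguous bases follow the definitions in the text. The exact formula of the weighted score is not reproduced in this text, so the expression used here, the ratio of unambiguous bases to background-covered bases multiplied by log10 of the alignment count plus one, with the denominator floored at 1, is only one assumed reading of the verbal description and should not be taken as the published definition.

```python
import math
from typing import Sequence, Tuple


def node_scores(
    foreground: Sequence[int],
    background: Sequence[int],
    alignments: int,
) -> Tuple[int, int, float]:
    """Return (total hits, unambiguous bases, weighted score) for one reference.

    foreground / background are per-base coverage values on the same reference.
    """
    if len(foreground) != len(background):
        raise ValueError("coverage vectors must have the same length")

    # Bases covered by the background database: b_bg(n).
    b_bg = sum(1 for g in background if g > 0)
    # Bases covered by the foreground but not by the background: b_{fg\bg}(n).
    unambiguous = sum(
        1 for f, g in zip(foreground, background) if f > 0 and g == 0
    )
    total_hits = alignments  # T_n := t(n)

    # ASSUMPTION: illustrative form of the weighted score only.
    # max(b_bg, 1) avoids division by zero when nothing is in the background.
    weighted = unambiguous / max(b_bg, 1) * math.log10(alignments + 1)
    return total_hits, unambiguous, weighted


if __name__ == "__main__":
    print(node_scores([1, 1, 0, 2], [0, 3, 0, 0], alignments=9))
```

Under this assumed form, a reference whose foreground coverage lies entirely within background-covered regions receives a weighted score of 0, however many reads map to it.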
The values of the selected scoring scheme are reflected in the thickness of the branches, which draws the visual focus to higher-rated branches. Users can switch between the three scores via the respective buttons in the interactive visualization. In order to enable users to make early decisions regarding the handling of a sample as well as to further enhance the intuitive understanding of the results, the hazardousness of detected pathogens is color-coded based on a biosafety level (BSL) score list [56]. Minor manual changes were applied to this list to improve matches to the organism names in the reference database. The BSL score gives information on the biological risk emanating from an organism. Therefore, it qualifies as a measure of hazardousness in this use case. The BSL score is color-coded in green (no information/BSL1), blue (BSL2), yellow (BSL3), or red (BSL4), and the maximum hazardousness level of a branch is propagated to the parent nodes. While the BSL may be non-informative in some cases, it can still help non-expert users who are not familiar with certain pathogens to gain an overview. Phages are displayed in grey, as they cannot infect humans directly, but may imply information on the presence of bacteria.
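The propagation of the maximum hazardousness level to parent nodes can be sketched as a post-order traversal of the taxonomic tree. The actual visualization is written in JavaScript with D3; the Python class and function names below are hypothetical, and encoding "no information" as level 0 is an assumption made for this example. The color mapping follows the text.

```python
from dataclasses import dataclass, field
from typing import List

# Color coding as described in the text; 0 stands for "no information".
BSL_COLORS = {0: "green", 1: "green", 2: "blue", 3: "yellow", 4: "red"}


@dataclass
class Node:
    name: str
    bsl: int = 0
    is_phage: bool = False
    children: List["Node"] = field(default_factory=list)


def propagate_bsl(node: Node) -> int:
    """Set each node's BSL to the maximum found in its subtree (post-order)."""
    for child in node.children:
        node.bsl = max(node.bsl, propagate_bsl(child))
    return node.bsl


def color(node: Node) -> str:
    """Phages are drawn in grey; all other nodes by their BSL."""
    return "grey" if node.is_phage else BSL_COLORS[node.bsl]


if __name__ == "__main__":
    family = Node("ExampleFamily", children=[
        Node("species A", bsl=2),
        Node("species B", bsl=4),
    ])
    root = Node("root", children=[family, Node("example phage", is_phage=True)])
    propagate_bsl(root)
    print(color(family), color(root), color(root.children[1]))
```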
Details about the sums of all three available scores of all underlying species are provided on mouse-over (Figure 3 in the results section). When expanding a branch to sequence level, additional plots of the foreground coverage calculated in step (iii) as well as the abundance of bases in the background datasets calculated in step (ii) are shown when hovering the mouse over the node (Figure 2). These plots provide a visualization of the significance of a hit. The hits of a species in the patient dataset are shown in green, while background hits are drawn in red on a coverage plot. This way, it is easy to evaluate if a sequence is commonly found in non-ill humans and therefore can be considered less relevant, or if a detected sequence is unique and could lead to more certain conclusions. The source code is available at https://gitlab.com/rki_bioinformatics/PathoLive, accessed on 23 August 2022.

Validation
We compared the results of PathoLive to two existing solutions, Clinical Pathoscope [26] and Bracken [57]. We selected Clinical Pathoscope for its very sophisticated read reassignment method, which promises a highly reliable rating of candidate hits. It is also tailored to this use case. Other promising pipelines such as SURPI [30] were not locally installable and had to be disregarded. Bracken, a method based on metagenomics classification with Kraken [22], was included in the benchmark as one of the fastest and best-known classification tools, which makes it one of the primary go-to methods for many users. The experiment is based on a real sequencing run on an Illumina HiSeq 1500 in High Output Mode. We designed an in-house generated sample in order to have a solid ground truth. We ran all tools using 40 threads, starting each at the earliest possible time point when the data was available from the sequencer in the expected input format. For the non-real-time tools, the base calling was executed via Illumina's standard tool bcl2fastq and its runtime was included in the overall turnaround time. Clinical Pathoscope and Bracken were both run with default parameters, apart from the multithreading. We built the databases for PathoLive and Bracken using the viral part of the NCBI RefSeq [58]. For Clinical Pathoscope, we downloaded the associated database from http://www.bu.edu/jlab/wp-assets/databases.tar.gz (accessed on 23 August 2022) using the provided viral database as foreground and the human database as background. Details of the database construction are given in the Supplementary Methods (Section S3.1). Please note that, in contrast to all other results shown in this manuscript, the live analysis of the in-house sample was performed using read mapping results of HiLive, the predecessor of HiLive2. However, we repeated the analysis using HiLive2 and obtained similar results with respect to accuracy (cf. Supplementary Figure S1 and Supplementary Table S1).
To validate PathoLive on real data, we applied it to a previously described diagnostic human serum sample from an outbreak of hemorrhagic fever virus in Sudan [59,60] and a dataset from an outbreak of Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) in Wuhan, 2019 [61]. As the data were only available in FASTQ format, they were converted to BCL file format following the procedure described in the supplementary methods (Section S3.2). The total read length was 2 × 301 bp for the CCHFV dataset from Sudan and 2 × 151 bp for the SARS-CoV-2 dataset from Wuhan. The diagnostic sample from Sudan was prepared according to [59,60], including inactivation of the human serum in Qiagen Buffer AVL, extraction with Qiagen QIAamp Viral RNA Mini Kit, and DNA digestion using the Thermo Fisher TURBO DNA-free Kit. A sequencing library was created using the Illumina Nextera XT DNA Library Preparation Kit. The sample was sequenced on an Illumina MiSeq.

Sample Preparation
The dataset from the outbreak of SARS-CoV-2 in Wuhan in 2019 was sequenced on an Illumina MiniSeq sequencing device and is publicly available at the NCBI Sequence Read Archive (SRA) under accession number SRR10971381 [61].

Pathogen Detection in a Spiked Viral Mixture
The human plasma sample spiked with a viral mixture was sequenced on an Illumina HiSeq 1500 in high output mode on one lane. PathoLive was executed from the beginning of the sequencing run using 40 threads. Intermediary results were taken after 40, 60, 80, and 100 cycles or after 36, 55, 74, and 93 h, respectively. The time needed to produce results from the intermediary sequencing data was lower than 25 min for all output cycles. Raw reads usable for the testing of other tools were available only after 95 h as they had to be translated into the human-readable fastq-format first. As a ground truth, we selected all sequences associated with the species described as abundant above.
The area under the curve (AUC) of the receiver operating characteristic (ROC) was calculated using the 14 highest ranking species, as given by the tested tools. The top 14 of the identified species are considered because hits appearing after twice the number of true positives cannot be expected to be regarded by a user in this experiment. Furthermore, none of the tested tools found more true positives within the next 50 hits. The ROC plot (Figure 4) denotes the true-positive rate and false-positive rate for each threshold n ≤ 14, whereby a threshold n means that the best n hits are taken into account. This means that only the rank of the hits was considered while disregarding the actual score. For PathoLive, the ranks were determined by the weighted score $W_n$, for Clinical Pathoscope we used the "final guess" metric, and for Bracken, the species with the most estimated reads were ranked highest.
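A rank-based ROC of this kind can be sketched as follows. The text does not state how the rates were normalized, so this example makes two assumptions: the true-positive rate is taken relative to the size of the ground-truth set, and the false-positive rate relative to the number of non-ground-truth species within the top k. The function names and the trapezoidal integration are likewise illustrative.

```python
from typing import List, Sequence, Set, Tuple


def roc_points(
    ranked: Sequence[str], truth: Set[str], k: int = 14
) -> List[Tuple[float, float]]:
    """Return (FPR, TPR) for each threshold n <= k over a ranked species list."""
    top = list(ranked[:k])
    positives = len(truth)
    # ASSUMPTION: negatives are the non-truth species among the top k.
    # If there are none, FPR stays 0 and the AUC below degenerates to 0.
    negatives = sum(1 for species in top if species not in truth)
    tp = fp = 0
    points = [(0.0, 0.0)]
    for species in top:
        if species in truth:
            tp += 1
        else:
            fp += 1
        fpr = fp / negatives if negatives else 0.0
        tpr = tp / positives if positives else 0.0
        points.append((fpr, tpr))
    return points


def auc(points: Sequence[Tuple[float, float]]) -> float:
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


if __name__ == "__main__":
    ranking = ["a", "x", "b", "y"]  # toy ranking, best hit first
    print(auc(roc_points(ranking, {"a", "b"}, k=4)))
```

Because only ranks enter the calculation, tools with incomparable score scales can be evaluated on the same axis.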
We were able to detect all abundant spiked species in the library after only 40 cycles of the sequencing run using PathoLive. While the overall number of false positive hits decreases with the sequencing time, the weighted score and the number of unambiguous bases yield accurate results throughout all reports. Reported phages are included in these numbers, although they are visually grayed out in the visualization, as they cannot infect vertebrates directly. As an example report, a screenshot of the resulting interactive tree of results after 80 cycles is shown in Figure 3.

Identification of Crimean-Congo Hemorrhagic Fever Virus in a Real Sample from Sudan
A central issue in pathogen identification, especially for viruses, is the potentially low number of pathogenic reads in the sample. Therefore, we evaluated the performance of PathoLive on real data known to contain a low number of reads of interest. We analyzed a human serum sample from Sudan that was confirmed via PCR to contain Crimean-Congo hemorrhagic fever virus (CCHFV) but shows only a small number of related reads in the corresponding Illumina sequencing data (Andrusch et al. reported in 2018 [59] that 45 out of 1,178,054 reads unambiguously belong to CCHFV). When running PathoLive with default parameters and adapter trimming activated, Bunyaviridae was the family with the highest weighted score throughout the complete sequencing procedure when phages and the "unassigned family" branch are not considered. The score of Bunyaviridae was consistently equal to the score of the underlying species CCHFV, while no other species of the family contributed to its overall score. Figure 5 shows the development over time for all families that reach a score of 500 in at least one output cycle. The weighted score of CCHFV (represented by the family Bunyaviridae) is among the top three of all identified families after only 30 sequencing cycles, which corresponds to 5% of the sequencing procedure. At this time point, only 16 reads were aligned to CCHFV. Thus, the correct finding is already indicated early in the run on the basis of only a few reads, and the signal becomes increasingly pronounced as sequencing proceeds. The only other family that reached a score higher than 500 and does not exclusively contain phages was Retroviridae, mainly driven by the species HIV1. However, a more detailed view at the sequence level shows that all mappings to HIV1 cluster in a small region of approximately 1000 bp (Figure 6d), while the alignments to CCHFV are distributed over the complete genome (Figure 6b,c). This strongly indicates that CCHFV is more likely to be a true positive. Figure 6 further shows the family-level visualization of the PathoLive tree structure (Figure 6a) and an example for Granulovirus of the Baculoviridae family, which shows a high total number of mappings that are all located in regions covered by the background database, leading to a weighted score of 0 (Figure 6e). The overall results for this sample show the strength of PathoLive in highlighting interesting findings at first glance while still allowing for a more detailed perspective that is often important for interpretation.
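The contrast between clustered and genome-wide mappings can be expressed with a simple occupancy measure. The following Python sketch is not part of PathoLive; the function name `window_occupancy`, the 1000 bp window size, and the input format (a list of alignment start positions) are assumptions chosen for illustration. It reports the fraction of fixed-size reference windows that contain at least one alignment start.

```python
import math
from typing import Iterable


def window_occupancy(starts: Iterable[int], genome_length: int, window: int = 1000) -> float:
    """Fraction of reference windows that contain at least one alignment start.

    starts: 0-based alignment start positions on the reference (hypothetical input format).
    genome_length: length of the reference sequence in bp.
    window: window size in bp (assumed default, not a PathoLive parameter).
    """
    if genome_length <= 0 or window <= 0:
        raise ValueError("genome_length and window must be positive")
    n_windows = math.ceil(genome_length / window)
    # Collect the index of every window hit by a start inside the reference.
    hit_windows = {s // window for s in starts if 0 <= s < genome_length}
    return len(hit_windows) / n_windows


if __name__ == "__main__":
    # Toy positions, not taken from the paper's data.
    clustered = [5000, 5100, 5400, 5900]
    spread = [200, 2600, 5100, 7800, 9900]
    print(window_occupancy(clustered, 10_000))
    print(window_occupancy(spread, 10_000))
```

A low occupancy despite many mappings would suggest that the hits pile up in a single region, as described for HIV1, whereas a true positive such as CCHFV would be expected to occupy many windows.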
Figure 5. Development of the weighted score calculated by PathoLive on the real-world CCHFV sample from Sudan over the sequencing procedure for all families reaching a score higher than 500 in at least one output cycle. Colors of the plots correspond to the underlying biosafety level in the last cycle, i.e., green for BSL-1, blue for BSL-2, and yellow for BSL-3. Phages are displayed in gray color. The dotted section of each line indicates the shift from the first to the second read of the 2 × 301 bp data.
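The proportions quoted for this sample can be verified with a short calculation. The Python sketch below uses only the numbers stated in the text (30 cycles, 2 × 301 bp reads, 45 of 1,178,054 reads); the helper name `fraction_of_run` is hypothetical, and the calculation assumes that index cycles are not counted towards the length of the run.

```python
from typing import Sequence


def fraction_of_run(cycle: int, read_lengths: Sequence[int]) -> float:
    """Share of the sequencing run completed after `cycle` cycles.

    read_lengths: length of each read in cycles, e.g. (301, 301) for 2 x 301 bp.
    Index cycles are ignored here, which is an assumption of this sketch.
    """
    total = sum(read_lengths)
    if total <= 0:
        raise ValueError("the run must contain at least one cycle")
    return cycle / total


if __name__ == "__main__":
    # 30 of 602 cycles, i.e. roughly 5% of the run.
    print(f"{fraction_of_run(30, (301, 301)):.1%}")
    # Share of reads reported as unambiguous CCHFV reads.
    print(f"{45 / 1_178_054:.4%}")
```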

Detection of a Coronavirus in a Real Sample from the 2019 SARS-CoV-2 Outbreak in Wuhan
For a dataset from the outbreak of SARS-CoV-2 in Wuhan, 2019, we could also identify a coronavirus as the most probable causative virus. This example clearly demonstrates the strength of our scoring approach.
When using the pure quantity of alignments for the visualization of results, the Coronaviridae family branch is not among the most prominently visualized hits. In contrast, when activating the weighted score, a clear indication was already available after only 30 sequencing cycles, corresponding to 10% of the complete sequencing run (Figure 7). The ranks and underlying scores of the visualization shown in Table 1 further support the strength of the weighted score approach. A more detailed analysis of the underlying tree of the Coronaviridae family shows that there are different coronavirus species with high scores, mainly dominated by a SARS-related coronavirus and a bat coronavirus. This further indicates that a clear assignment to one of the underlying species was not possible. These results are as expected, since the correct species, SARS-CoV-2, was not present in the reference database.

Another branch that is clearly highlighted by its red color belongs to the Poxviridae family. However, a closer look at the results shows that the BSL-4 classification originates from a single sub-branch in which all mapping positions cluster in only two peaks (not shown). This pattern is similar to the one shown for HIV1 in the previous section (cf. Figure 6d), so the hit is most probably not of biological interest.

Discussion
NGS has been shown to be the current state-of-the-art DNA sequencing technology for pathogen detection. Although it is still a niche technology in clinical settings due to the described shortcomings, it has an increasing impact on the diagnosis of infectious diseases. Third-generation sequencing approaches are also becoming more and more influential, but the discovery of lowly abundant pathogens with these technologies is still problematic due to the relatively low number of reads. Additionally, the comparably low coverage and high error rates still hamper certain types of complex follow-up analyses, such as the detection of antimicrobial resistances or of the geographical origin of a pathogen. On the other hand, long-read sequencing technology shows immense potential for real-time diagnostics in the future, especially when considering the continuously decreasing error rates, shorter sample preparation times, emerging higher-throughput devices such as the PromethION, and valuable technology-specific features such as the read-until functionality, with which first attempts have been made to separate microbial reads from host DNA during the sequencing procedure [47,63]. Considering all these aspects, we expect long-read sequencing technology to become a valuable complement to NGS-based diagnostics, with distinct properties and therefore potentially different application areas.
The high turnaround time of NGS-based diagnostics is a major drawback compared to targeted molecular methods. Past efforts to speed up NGS-based diagnostics have been made but often come with significant disadvantages. Quick et al. [42] introduced a fast-sequencing protocol for Illumina sequencers that allows obtaining results after as little as 6 h. This speedup is accompanied by lower throughput and lower data quality, making it less suitable for whole genome shotgun sequencing approaches without a priori knowledge. Other approaches that analyze intermediate sequencing data require either a massive reduction of the number of analyzed reads and/or targets [43] or the application of specialized hardware such as field-programmable gate array (FPGA) technology, which is, for example, used for the DRAGEN system [44]. Such specialized hardware approaches come with additional costs, either for the purchase and infrastructure of local solutions or for the use of a cloud system. At the same time, such approaches provide a low level of flexibility in the analysis and are not algorithmically optimized for working with incomplete data. PathoLive does not require specialized hardware and provides accurate diagnostic results in real time, illustrated with an easily understandable and interactive visualization. This strongly facilitates gaining insights into a clinical sample before the sequencer has finished. However, real-time output produced before the first read has been completely sequenced lacks information about multiplex indices. Therefore, early results of multiplexed sequencing runs can only be assigned to a specific sample after the multiplex indices have been sequenced. For paired-end sequencing runs, this means that analyses are still possible long before the sequencer has finished, and single-end sequencing runs can produce results at the very moment the indices have been sequenced. A possible solution for this problem is to sequence the indices before the first read, which can pose challenges for cluster identification, although these are addressable. As a working solution, many sequencing devices allow paired-end sequencing with different lengths for the first and second reads. It is thereby possible to sequence only a short fragment of the first read to get early access to the multiplex indices. Thus, this approach can be used to obtain de facto single-end reads (i.e., the full second read) while having the multiplex information available from the beginning of the read.
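The effect of an asymmetric read layout on the availability of multiplex information can be sketched as follows. This Python example is an illustration only: it assumes the common Illumina cycle order of read 1, index read(s), read 2, and the class name `RunLayout`, the 8 bp index lengths, and the 20-cycle first read are hypothetical values that do not come from the paper.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class RunLayout:
    """Cycle layout of a run, assuming the order read 1, index reads, read 2."""

    read1: int
    indices: Tuple[int, ...]
    read2: int = 0  # 0 for single-end runs

    def total_cycles(self) -> int:
        return self.read1 + sum(self.indices) + self.read2

    def demux_ready_cycle(self) -> int:
        """Cycle of the last index base; later output can be assigned to samples."""
        return self.read1 + sum(self.indices)


if __name__ == "__main__":
    symmetric = RunLayout(read1=301, indices=(8, 8), read2=301)
    asymmetric = RunLayout(read1=20, indices=(8, 8), read2=301)  # hypothetical short first read
    for name, layout in (("symmetric", symmetric), ("asymmetric", asymmetric)):
        print(name, layout.demux_ready_cycle(), "of", layout.total_cycles())
```

Under these assumptions, shortening the first read moves the point at which results can be assigned to a sample from cycle 317 to cycle 36, while the full second read remains available for analysis.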
For pathogen identification, we changed the basis for the selection of clinically relevant hits from pure abundance or coverage-based measures towards a metric that takes information on the singularity of a detected pathogen into account. Still, we decided not to rely on the algorithmic evaluation alone but to provide all available information to the user in an intuitive interactive taxonomic tree. While we assume that this form of presentation allows users to come to the right conclusions very quickly, more sophisticated methods for abundance estimation exist, especially on strain level. Implementing an additional abundance estimation approach comparable to the read reassignment of Clinical Pathoscope [26] or the abundance estimation of Bracken [57] could enable more accurate results, although this would not be trivially applicable to the overall design of PathoLive.
The sensitivity and specificity of PathoLive vary over the course of a sequencing run. In the beginning, when only little sequence information is available, only a small number of nucleotides specify a candidate hit, leading to comparably high false positive rates. At the end of a sequencing run, the number of sequence mismatches in the longer alignments may lead to the erroneous exclusion of hits, especially when sequencing quality decreases. However, this behavior is implicitly considered by the HiLive2 algorithm, which allows for an increasing number of mismatching nucleotides with increasing read length. Still, the results can vary over the runtime, with the optimal outcome being obtained at intermediate cycles, if the selected parameters are not well suited for the specific sample or if the sequencing quality decreases more strongly than usual.
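The idea of a mismatch tolerance that grows with read length can be illustrated with a minimal Python sketch. The linear rule used here, one additional allowed mismatch per 25 sequenced bases, is an assumption made for illustration and is not HiLive2's actual formula; the function names and the parameter `bases_per_mismatch` are likewise hypothetical.

```python
def allowed_mismatches(cycle: int, bases_per_mismatch: int = 25) -> int:
    """Number of mismatches tolerated after `cycle` sequenced bases.

    Assumed linear rule for illustration only; not taken from HiLive2.
    """
    if cycle < 0 or bases_per_mismatch <= 0:
        raise ValueError("cycle must be >= 0 and bases_per_mismatch > 0")
    return cycle // bases_per_mismatch


def keep_alignment(mismatches: int, cycle: int, bases_per_mismatch: int = 25) -> bool:
    """Keep an alignment if its mismatch count is within the current tolerance."""
    return mismatches <= allowed_mismatches(cycle, bases_per_mismatch)


if __name__ == "__main__":
    for cycle in (40, 100, 200):
        print(cycle, allowed_mismatches(cycle))
```

With such a rule, an alignment with three mismatches would be rejected at cycle 40 but accepted at cycle 100, which shows how a poorly chosen tolerance could exclude true hits late in a run when sequencing quality drops.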
While PathoLive currently focuses on the detection of pathogenic viruses due to the design of the reference database, its concept is transferable to bacteria or other pathogens by using the database updater and rerunning the background definition.
Besides these challenges, which are unique to PathoLive, it shares several problems with conventional approaches. First, the definition of meaningful reference databases is difficult. No reference database can ever be exhaustive since not all existing organisms have been sequenced yet. In addition, reference databases may contain erroneous information due to sequencing artifacts, contaminations, or false taxonomic assignment. The definition of hazardousness was especially complicated, as to our knowledge no well-established solution for the automated assignment of this information exists. Therefore, the basis for our BSL classification might not be exhaustive, leading to underestimated danger levels for pathogens that are missing in the underlying BSL list. Furthermore, in-house contaminations, some of which are known to be carried over from run to run on the sequencer while others may come from the lab, could interfere with the interpretation of a sequencing run. Especially since no indices are sequenced for the first results of PathoLive, comparably large numbers of carry-over contaminations might lead to false conclusions. Candidate contaminations should therefore be kept in mind when interpreting results.
We believe that PathoLive combines the qualities of different existing tools while adding several new, helpful features, such as live analyses and user-friendly visualizations. In this way, it delivers results earlier than existing methods and shortens the overall turnaround time from sample receipt to workable results. For retrospective analysis of previously sequenced samples, other workflows may offer a different scope or more detailed insights. However, to gain a quick overview of a sample, the tailored methods included in PathoLive offer an ideal solution.
Using in-house generated spiked human plasma samples, we were able to show the advantages of PathoLive not only concerning its unprecedented runtime but also the selection of relevant pathogens. We further showed the high sensitivity of our approach by identifying CCHFV in a real sample from Sudan based on a few dozen reads. While PathoLive is very fast and accurate, one of its limitations lies in the discovery of yet unknown pathogens. This is due to the limited sensitivity of alignment-based methods in general, which hampers the correct assignment of highly deviant sequences. However, the analysis of a dataset from the SARS-CoV-2 outbreak in 2019 clearly shows that the detection of novel species that are related to known pathogens is still possible. In conclusion, PathoLive is a helpful tool for accurate yet rapid detection of pathogens in clinical NGS datasets. Its key advantages are the real-time availability of analysis results and the intuitive, interactive visualization with down-prioritization of likely irrelevant candidates.