MIRACUM-Pipe: An Adaptable Pipeline for Next-Generation Sequencing Analysis, Reporting, and Visualization for Clinical Decision Making

Simple Summary Next-generation sequencing (NGS) is a cutting-edge technology that enables rapid, high-throughput sequencing of DNA and RNA. Researchers and clinicians can identify genetic mutations, gene fusions, and other alterations that may drive cancer growth. This is particularly important in precision oncology as it is applied in the context of Molecular Tumor Boards (MTBs). The latter are multidisciplinary teams of experts who use NGS and bioinformatics tools to analyze patients’ genetic profiles and develop personalized treatment recommendations for cancer patients. Thus, a crucial process for MTB decision-making is the analysis, compilation, and presentation of high-dimensional sequencing data, which are used for both preparation of and case presentation to all stakeholders. MIRACUM-Pipe precisely addresses these requirements and offers an easy-to-use, one-prompt standardized solution to analyze NGS data, including quality control, variant calling, copy number estimation, annotation, visualization, and report generation. Abstract (1) Background: Next-generation sequencing (NGS) of patients with advanced tumors is becoming an established method in Molecular Tumor Boards. However, somatic variant detection, interpretation, and report generation, require in-depth knowledge of both bioinformatics and oncology. (2) Methods: MIRACUM-Pipe combines many individual tools into a seamless workflow for comprehensive analyses and annotation of NGS data including quality control, alignment, variant calling, copy number variation estimation, evaluation of complex biomarkers, and RNA fusion detection. (3) Results: MIRACUM-Pipe offers an easy-to-use, one-prompt standardized solution to analyze NGS data, including quality control, variant calling, copy number estimation, annotation, visualization, and report generation. (4) Conclusions: MIRACUM-Pipe, a versatile pipeline for NGS, can be customized according to bioinformatics and clinical needs and to support clinical decision-making with visual processing and interactive reporting.


Introduction
Molecular precision oncology aims to manipulate or influence a defined, direct, or indirect tumor-specific molecular target. The principle and clinical efficacy of highly effective "molecular targeting" has meanwhile been demonstrated for numerous compounds in large phase III trials [1][2][3][4]. However, the use of approved targeted agents currently remains almost exclusively within histologically defined entities and is often dependent on the evidence of a predictive biomarker. In the concept of molecular oncology, a tumor is not defined exclusively by its histological features but primarily by its molecular (genetic) tumor profile. According to molecular oncology, this profile and its associated biomarkers can be similar across entities and require the same molecular treatment. For example, many targeted compounds with promising results are currently in biomarker-stratified, tumor-agnostic phase I/II trials, providing a glimpse of the future molecular approval landscape. Unfortunately, only a few molecularly driven phase I/II trials are currently available (especially in Europe), so molecular drugs outside of regulatory approval are frequently used "off-label" in individual therapeutic trials. These molecularly driven "off-label" therapy trials are within the scope of Molecular Tumor Boards (MTBs). The main aim of MTBs is to provide therapeutic or diagnostic indications, usually based on genomic analysis, for cancer patients. To this end, data on specific and recurrent molecular mechanisms are collected from many individual patient cases to provide scientific and clinical evidence for the efficacy of therapeutic approaches targeting these mechanisms [5][6][7][8][9]. Eventually, this will bring highly evident molecular therapies into routine clinical care. MTBs consist of a multidisciplinary team that combines medical and scientific expertise with translational oncology, molecular biology, and bioinformatics [10,11]. There are several approaches to expanded molecular genetic analysis, each identifying a different spectrum of genetic alterations. In the context of MTBs, either targeted combinations of genes are studied using targeted NGS (tNGS) approaches to identify specific targets for which there are approved drugs or extended genetic diagnostics are performed using whole exome sequencing (WES), which covers all protein-coding genes, accounting for approximately 1% of the total genome. With the increasing complexity of genetic data available for individual patients, molecular alterations need to be interpreted according to predefined standards and ultimately reviewed in a multidisciplinary MTB. The therapeutic interpretation of molecular data as well as the assessment of pathogenic mutations require not only a high level of scientific interdisciplinarity but also a continuous literature search and poses a great challenge to the MTB team. Due to the complexity of the results, the need for software or supporting tools to interpret and present the results is tremendously high [12][13][14].
To support transparent data integration and decision-making across the MTBs we have developed MIRACUM-Pipe, an automated analysis workflow for NGS, that produces reliable and reproducible results across different facilities.

Materials and Methods
MIRACUM-Pipe is a workflow that combines many individual tools to create a seamless sequence for comprehensive analysis, annotation, and reporting of NGS data. The workflow mainly consists of several parts: (i) quality control, (ii) alignment, (iii) variant calling, (iv) copy number variation estimation, (v) RNA fusion detection, (vi) annotation, and (vii) reporting and visualization. MIRACUM-Pipe currently supports three different run-modes tailored to the following: 1.
tNGS analysis, i.e., tumor-only sequencing on a hybrid capture-based gene panel (DNA and optional RNA), 3.
Tumor-only analysis, i.e., a single tumor sample from WES (DNA only).
Depending on the selected mode different tools, workflows, and parameter settings are used, and the differences are highlighted in the following.

Alignment
The preprocessed sequencing samples are aligned using the Burrows-Wheeler aligner (BWA-MEM) [20] to the reference genome hg19 from UCSC. To obtain high confidence alignments we applied the Genome Analysis Toolkit (GATK) [21] indel realignment, and conducted base quality score recalibration, and duplication removal following the GATK Best Practice recommendations [22,23].

Variant Calling
Somatic variants for WES tumor-normal pairs are identified and filtered for false positives with VarScan2 to obtain high-confidence variant calls. VarScan2 separates the identified variants into somatic, germline, and loss of heterozygosity (LOH). The latter means that a variant is heterozygous in the germline but evolved to a homozygous variant in the tumor.
VarScan2 is similarly used for tumor-only WES samples. The only difference is that no somatic, germline, nor LOH can be distinguished due to the lack of an appropriate control sample. Therefore, the mpileup routines from VarScan2 are used followed by the false positive filter algorithm.
In the case of tNGS, Mutect2 from GATK is used to identify variants followed by the implemented filtering routine to filter and remove false positives. Mutect2 offers the possibility to supply known platform specific sequencing artifacts and a known germline resource to further limit false positive variant calls and help identify potential germline variants. This is particularly useful when no control sample has been analyzed to identify variants that are unique to the tumor.

Copy Number Variation Calling
Copy number variations (CNVs) are identified with Control-FREEC [24,25]. A matched control sample together with GC-content is used for normalization. If no control sample is available, only GC-content is used. Additionally, the tool Sequenza [26] is used to bioinformatically infer tumor purity and ploidy, and with the help of the scarHRD R package [27,28] the HRD score of the sample is inferred. Tumor purity and ploidy results are further used as input parameters for Control-FREEC.

RNA Fusion Calling
RNA fusions are called with the tool FusionCatcher [29] in case RNA sequencing samples are available in tNGS mode.
Furthermore, the following sample-specific complex biomarkers are calculated.
For TMB calculation, all somatic and protein-coding variants are used. HRD score, consisting of the sum of large-scale transitions (LST) [42], number of telomeric allelic imbalances (TAI) [43], and loss of heterozygosity (HRD-LOH) [44], is inferred, as described above with the R package scarHRD [27] and the Sequenza [26] output as well as ploidy and tumor purity values are also inferred, as described above. Microsatellite status is identified with MSIsensor-pro [45] for tumor-normal pairs and MSIsensor2 [46] for tumor-only cases.

Functional Enrichment Analysis
Functional enrichment analysis with Fisher's exact test is performed based on either all identified variants, or all genes affected by CNVs to obtain insights into altered pathways. As a signaling pathway source, the Molecular Signatures Database (MSigDB) [47] with the Hallmark gene sets [48] is used.

Reporting and Visualization
MIRACUM-Pipe reports all protein-coding and protein-altering variants fulfilling all quality metrics, the variant allele frequency (VAF) and population frequency cutoffs. As default cutoffs, a VAF above 5% and a population frequency below 0.1% are set. However, not only simple variants, such as SNVs and InDels, are reported but also all findings regarding to quality metrics, complex biomarkers, CNVs, and RNA fusions, if applicable. This information is presented in the form of an interactive PDF report that includes hyperlinks to curated online data sources, such as Genome Nexus [53], MetaKB [54] and VarSome [55], which provide background information on variants. Alternatively, the results are written in a data format that can be directly imported into cBioPortal [56,57].

Results
The results, as well as the appropriate selection and presentation of the MIRACUM-Pipe tools, were carefully chosen and adapted through close collaboration with clinicians and members of the MTBs, based on a comprehensive stakeholder analysis [13].

MIRACUM-Pipe
MIRACUM-Pipe incorporates tools for detecting SNVs, InDels, LOH variants, CNVs, and RNA fusions as well as for determining quality and statistics. Various functional prediction and annotation databases are integrated to annotate the identified variants automatically. The workflow is designed as a fully automated one-prompt solution from the raw sequencing files to the interactive PDF report containing quality assessments, the identified and annotated genetic variations as well as a gene set enrichment analysis of the SNVs and CNVs, respectively. MIRACUM-Pipe consists of bash and R scripts to perform NGS data processing, basic annotation, complex functional annotations, and downstream analysis of the results. The pipeline is divided into three main parts, as shown in Figure 1, namely (1) preprocessing, quality control, and alignment; (2) analysis, annotation, and interpretation, including variant calling and copy number calling; and (3) assembly of results into a PDF report and an input format for visualization in cBioPortal. In addition to identifying genetic alterations, another focus is annotating identified variants and other results from various available database sources to facilitate their interpretation.

Performance, Usability, and Configuration
The pipeline is intended to operate on a high-performance computing (HPC) cluster with a minimum number of eight cores, 150 GB of RAM, and 500 GB of hard drive space. After alignment, variant calling and annotation and copy number analysis is run in parallel by default (see Figure 1), saving processing time and distributing resources evenly. As a benchmark example, a WES dataset, consisting of a tumor and a matched germline sample with 100 million paired-end reads each, was analyzed within 12 h on a computer cluster with two 18-Core Intel Xeon E5-2697v4 processors (2.3-3.6 GHz) having 1 TB of RAM. We provide a Docker container to make the installation flawless and error-free, while allowing MIRACUM-Pipe to be distributed to other sites. This container includes a shell script that implements the pipeline processes, certain tools, R libraries, and several databases (Supplementary Material Table S1). Since MIRACUM-Pipe requires additional user-specific databases and files (Supplementary Material Table S2), we designed an environment wrapped around the Docker container (MIRACUM-Pipe-docker). The environment projects the structure inside the Docker container onto the host system, providing an easy solution to add additional databases and tools. It also serves as an interface for data input and output. In addition, the wrapper takes care of the correct Docker syntax when starting the pipeline, and the user can run it as a simple command line tool. This simplifies the application and setup of MIRACUM-Pipe. Due to existing license restrictions, some tools cannot be delivered within the Docker container. To address these issues, our software is split into two GitHub repositories; one for the pipeline itself, which is intended to be used as a Docker container: MIRACUM-Pipe, and another one for its application and setup: MIRACUM-Pipe-docker. The implementation scheme is shown in Figure 2.

MIRACUM-Pipe Results
Results are provided both as an interactive PDF report and as a machine-readable csv-based file that can be seamlessly imported into cBioPortal.

Interactive PDF Report
All identified and annotated variants and copy number variations are automatically compiled into a PDF report. All key results are presented on the first page of the report. These include the TMB, HRD score, and BRCAness ( Figure 3A). In addition, the mutations, ranked according to the ACMG classification (InterVar/ClinVar [34,35]), are presented in a tabular form. Further information about a gene can be obtained from hyperlinks to the Genome Nexus database [53], and information about an amino acid exchange can be obtained from the Variant Interpretation for Cancer Consortium Meta-Knowledgebase (VICC) [54], or from the VarSome [55] database ( Figure 3B). Quality anomalies are also mentioned on the first page of the report to better assess the results. A detailed presentation of the assessed quality metrics is shown in the report in tabular form with corresponding reference ranges ( Figure 4A). ranked according to the ACMG classification (InterVar/ClinVar [34,35]), are presented in a tabular form. Further information about a gene can be obtained from hyperlinks to the Genome Nexus database [53], and information about an amino acid exchange can be obtained from the Variant Interpretation for Cancer Consortium Meta-Knowledgebase (VICC) [54], or from the VarSome [55] database ( Figure 3B). Quality anomalies are also mentioned on the first page of the report to better assess the results. A detailed presentation of the assessed quality metrics is shown in the report in tabular form with corresponding reference ranges ( Figure 4A).   [38], the information on the variant allele frequency (VAF), and the assignment to cancer genes (tumor suppressor genes (TSGs) or oncogenes (OGs)).
In addition, the report includes the total number of SNVs and InDels, which are further categorized as homozygous or heterozygous variants, and LOH. Mutations are labeled as tumor suppressor genes (TSGs), oncogenes (OGs), or cancer hotspots. They are only considered if they occur in a protein-coding sequence with a population frequency (MAF) below 0.1% and a sample-specific VAF above 5%. However, these thresholds can be adjusted within the pipeline. The identified variants are displayed in a Circos plot (Figure 4B). To better understand the biological processes affected by the variants, a functional enrichment is calculated on hallmark gene sets [48], where terms with the highest significance are reported. In addition to the classical gene set enrichment, the variants are checked against five cancer-associated signaling pathways, namely PI3K-AKT-mTOR, RAF-MEK-ERK, DNA Damage Response, Cell Cycle and Tyrosine Kinases, that play a role in known cancer processes. The lists of genes involved in these pathways were obtained from Qiagen (https://geneglobe.qiagen.com/us/knowledge/pathways; accession: 26 May 2023). The COSMIC mutational signatures, e.g., the BRCAness signature (AC3, DNA damage), which according to Alexandrov [38], the information on the variant allele frequency (VAF), and the assignment to cancer genes (tumor suppressor genes (TSGs) or oncogenes (OGs)).
In addition, the report includes the total number of SNVs and InDels, which are further categorized as homozygous or heterozygous variants, and LOH. Mutations are labeled as tumor suppressor genes (TSGs), oncogenes (OGs), or cancer hotspots. They are only considered if they occur in a protein-coding sequence with a population frequency (MAF) below 0.1% and a sample-specific VAF above 5%. However, these thresholds can be adjusted within the pipeline. The identified variants are displayed in a Circos plot ( Figure 4B). To better understand the biological processes affected by the variants, a functional enrichment is calculated on hallmark gene sets [48], where terms with the highest significance are reported. In addition to the classical gene set enrichment, the variants are checked against five cancer-associated signaling pathways, namely PI3K-AKT-mTOR, RAF-MEK-ERK, DNA Damage Response, Cell Cycle and Tyrosine Kinases, that play a role in known cancer processes. The lists of genes involved in these pathways were obtained from Qiagen (https://geneglobe.qiagen.com/us/knowledge/pathways; accession: 26 May 2023). The COSMIC mutational signatures, e.g., the BRCAness signature (AC3, DNA damage), which according to Alexandrov et al. [51] provide insight into the selection of therapeutic options, such as PARP inhibitors, calculated using the R package YAPSA [49]. Copy number variations are visualized as an ideogram highlighting the altered regions above the chromosomes ( Figure 4C) and explicitly reported for TSGs and Ogs. For a better insight into the processes altered by the chromosomal instabilities, a functional enrichment based on the hallmark gene sets is performed. Furthermore, the quality criteria and coverage are reported. The reports' appendix lists all detectable somatic variants, LOH, and germline variants that meet all the criteria. Finally, all tools and databases used are listed, including version information.  [59] showing copy number variations (CNVs) in the first ten chromosomes of a patient's tumor genome. The figure briefly assesses which regions are likely to be amplified (red-shaded regions) or lost (blue-shaded regions), indicating increased or decreased activity of genes within that region. A complete table of genes with CNVs is provided in the full report.
All provided information related to complex biomarkers, genetic alterations, as well as quality abnormalities, provide clinicians with all the facts necessary to make a comprehensive assessment of the results and prepare a treatment recommendation for discussion within the MTB.

Visualization in cBioPortal
MIRACUM-Pipe reports all findings not only as an interactive PDF report but also in a format for seamless import into cBioPortal. cBioPortal is an open platform for exploring, visualizing, and analyzing multidimensional cancer genomics data and thus supports the translation of large data sets into biological insights and clinical applications [56,57]. For this reason, we have extended cBioPortal according to the needs of the MTB stakeholders [13] within the context of the MIRACUM consortium [60][61][62]. The upload is performed as follows: variants are exported in the mutation annotation format, the copy number variations as discrete copy number values and segmented data, and RNA fusions as structural variants. Additionally, all complex biomarkers and the results of the mutational signature analysis are stored as patient-specific clinical attributes. With all this information at hand, clinicians make extensive use of the features provided by cBioPortal to visualize the mutational landscape ( Figure 5) as well as specific variants, including related information on therapeutic options from, e.g., OncoKB [39] ( Figure S1). The latter was extended after information on the approval status of a given drug based on the European Medicines Agency (EMA). The extended cBioPortal even offers the possibility to search directly for clinical trials involving the patient's genetic alterations through an extension of ClincalTrials.gov ( Figure S2) [60]. Furthermore, the clinician can use cBioPortal's powerful virtual trial options to identify similar patients who have already been discussed with the option of documenting therapy recommendations based on the genetic findings generated by MIRACUM-Pipe, thus creating evidence for further similar individual therapeutic trials.  Figure S2) [60]. Furthermore, the clinician can use cBioPortal's powerful virtual trial options to identify similar patients who have already been discussed with the option of documenting therapy recommendations based on the genetic findings generated by MIRACUM-Pipe, thus creating evidence for further similar individual therapeutic trials.

Discussion
WES analysis is a high-throughput technology that allows the simultaneous sequencing of thousands of genes in a single experiment. In the context of an MTB, WES analysis can provide clinicians with a wealth of information about a patient's cancer, including potential driver mutations, actionable targets, and drug resistance mechanisms. Therefore, standardized analysis pipelines and visualization strategies are needed to assist all stakeholders involved in an MTB in handling the complex data that will ultimately lead to therapy recommendations for personalized oncology. In this study, we present the design and use of MIRACUM-Pipe for the analysis of NGS and tNGS data and the associated visualization capabilities of these data.

Use and Insights of MIRACUM-Pipe in the Context of an MTB
In general, WES provides a comprehensive tumor genome analysis, identifying known and novel mutations in cancer cells. This information can help clinicians understand the molecular drivers of the tumor and select appropriate treatment options. A WES analysis generates a large amount of data and interpreting the results can be challenging.

Discussion
WES analysis is a high-throughput technology that allows the simultaneous sequencing of thousands of genes in a single experiment. In the context of an MTB, WES analysis can provide clinicians with a wealth of information about a patient's cancer, including potential driver mutations, actionable targets, and drug resistance mechanisms. Therefore, standardized analysis pipelines and visualization strategies are needed to assist all stakeholders involved in an MTB in handling the complex data that will ultimately lead to therapy recommendations for personalized oncology. In this study, we present the design and use of MIRACUM-Pipe for the analysis of NGS and tNGS data and the associated visualization capabilities of these data.

Use and Insights of MIRACUM-Pipe in the Context of an MTB
In general, WES provides a comprehensive tumor genome analysis, identifying known and novel mutations in cancer cells. This information can help clinicians understand the molecular drivers of the tumor and select appropriate treatment options. A WES analysis generates a large amount of data and interpreting the results can be challenging. Clinicians may need specialized training to interpret the data and use it effectively to guide treatment decisions. MIRACUM-Pipe has been developed to support this in a mostly automated way. It combines a thorough selection of useful tools, some with appropriate adaptations, and a dedicated annotation process with various databases. However, the greatest advantage of MIRACUM-Pipe is that all codes and analysis processes are presented transparently and can be adapted flexibly if required. Another advantage of MIRACUM-Pipe is the compilation of the results into a PDF report and an input format for the visualization in cBioPortal. The PDF report can document the analysis and be added to the corresponding medical records. Visualization tools, such as cBioPortal, can significantly impact the MTB by providing a clear and concise representation of complex genomic data. cBioPortal can help to identify patterns and relationships within the data that may not be apparent in a textbased format. Visualization tools help the MTB to communicate findings more effectively to other team members and the patients. The interactive PDF report and cBioPortal facilitate discussion and decision-making processes by presenting data in a clear and concise manner. Other existing pipelines, e.g., [63][64][65][66], do not yet provide a comparatively in-depth and clearly structured presentation of genetic variants, quality metrics, complex biomarkers, CNVs, and RNA fusions in single outputs.

Limitations and Future Directions
The overall flexibility and scalability of the current implementation as a single Docker container is not ideal. Not all features of a current HPC cluster can be used, such as automated queuing and resource distribution. Therefore, we will implement the pipeline in Nextflow's workflow language [67] in the future. Using a Nextflow workflow offers the possibility of greater flexibility, scalability, and reproducibility, and with Nextflow's software containers, it becomes easier to exchange tools and incorporate new tools. This modular approach also facilitates the implementation of multiple tools for the same task, such as streamlining the integration of new methodologies. Another technical challenge in using WES data is the quality of the sequencing data. WES analysis requires high-quality DNA samples, and the quality of sequencing data can be affected by several technical factors, such as sequencing depth and coverage. This has to be taken into account in the analysis and has to be manually adjusted in MIRACUM-Pipe accordingly. Similarly, the cost of WES and its analysis must be considered and may currently be a barrier to its widespread use in clinical practice. However, even these technological challenges will dissolve or adapt, so the use of analysis pipelines and visualization tools, such as MIRACUM-Pipe and cBioPortal, will continue to be beneficial.
Although some of the tools used in MIRACUM-Pipe have been developed using machine learning and artificial intelligence (AI) methods, MTB therapy recommendation currently relies on expert knowledge due to the lack of large enough and well-stratified patient cohorts. One approach to compensate for this lack is few-shot learning, which combines individual patient information with additional data from in vitro screening [68]. This situation will change in the future. Current initiatives, such as PM 4 Onco (Personalized Medicine for Oncology, Medical Informatics Initiative: https://www.medizininformatikinitiative.de/en/node/801, accession: 26 May 2023) aim to harmonize MTB reporting across hospitals to generate a multi-center cohort of genomic and clinical information of MTB cases on which AI methods can be applied for therapy recommendations and response or resistance. Standardized reporting tools, such as MIRACUM-Pipe, will be instrumental in this as they can provide standardized, machine-readable output reports.

Conclusions
Next-generation sequencing, particularly WES, is increasingly being used to identify new therapeutic options. However, no common standards for analysis strategies, depth, or medical implementation have been established so far. To overcome these problems and to enable easy-to-use analysis, data interoperability and reuse, we have developed MIRACUM-Pipe. This pipeline can be easily adapted to integrate or merge future databases, analysis tools, and workflows. Furthermore, its visualization capabilities as an interactive PDF report and its integration with cBioPortal facilitate the understanding of the complex data for clinicians and all stakeholders in the MTB.
In this way, MIRACUM-Pipe can support physicians in making personalized therapy recommendations in oncology and unify the standardization of personalized oncology efforts in the German healthcare system.

Supplementary Materials:
The following supporting information can be downloaded at: https: //www.mdpi.com/article/10.3390/cancers15133456/s1, Figure S1: Illustration of cBioPortal and OncoKB. Figure S2: Integration of ClinicalTrials.gov to simplify the search study search. Supplementary Table S1: Tools, R libraries, and databases included in the MIRACUM-Pipe Docker container. Supplementary Table S2