Genomic Insights into Tumorigenesis in Newly Diagnosed Multiple Myeloma

Kyriakou, Marina; Papaloukas, Costas

doi:10.3390/diagnostics15172130

by Marina Kyriakou and Costas Papaloukas *

Department of Biological Applications and Technology, University of Ioannina, GR45110 Ioannina, Greece

* Author to whom correspondence should be addressed.

Diagnostics 2025, 15(17), 2130; https://doi.org/10.3390/diagnostics15172130

Submission received: 23 July 2025 / Revised: 15 August 2025 / Accepted: 21 August 2025 / Published: 23 August 2025

(This article belongs to the Section Pathology and Molecular Diagnostics)

Abstract

Background: Multiple Myeloma (MM) is a malignant plasma cell dyscrasia that progresses through the consecutive asymptomatic, often undiagnosed, precancerous stages of Monoclonal Gammopathy of Undetermined Significance (MGUS) and Asymptomatic/Smoldering Multiple Myeloma (SMM). MM is characterized by low survival rates, severe complications and drug resistance; therefore, understanding the molecular mechanisms of progression is crucial. This study aims to detect genetic mutations, both germline and somatic, that contribute to disease progression and drive tumorigenesis at the final stage of MM, using samples from patients presenting MGUS or SMM, and newly diagnosed MM patients. Methods: Mutations were identified through a fully computational pipeline, implemented in a Linux and RStudio environment and applied separately to each patient's sequence data, obtained through single-cell RNA-sequencing (scRNA-seq). Structural and functional mutation types were identified by stage, along with the affected genes. The analysis included quality control, removal of the Unique Molecular Identifiers (UMIs), trimming, genome mapping and result visualization. Results: The findings revealed frequent germline and somatic mutations, with distinct structural and functional patterns across disease stages. Mutations in key genes were identified, pointing to molecules that may play a central role in carcinogenesis and disease progression. Notable examples include the HLA-A, HLA-B and HLA-C genes, as well as the KIF, EP400 and KDM gene families, with the first four already confirmed. Comparative analysis between the stages highlighted molecular transition events from one stage to another. Emphasis was given to novel genes discovered in newly diagnosed MM patients that might contribute to tumorigenesis.
Conclusions: This study contributes to the understanding of the genetic basis of plasma cell dyscrasias and the transition events between the stages, offering insights that could aid in early detection and diagnosis, guide the development of personalized therapeutic strategies, and improve the understanding of mechanisms responsible for resistance to existing therapies.

Keywords:

plasma cell dyscrasias; multiple myeloma; single-cell RNA-sequencing; bioinformatics; germline mutations; somatic mutations; carcinogenesis; disease progression

1. Introduction

Plasma cell dyscrasias are pathological hematologic conditions comprising a heterogeneous group of diseases characterized by clonal proliferation of bone marrow plasma cells and production of a monoclonal immunoglobulin, known as the M-protein, detected as a paraprotein in serum and/or urine [1]. These disorders encompass a broad variety of conditions, ranging from asymptomatic precancerous stages, such as Monoclonal Gammopathy of Undetermined Significance (MGUS) and Asymptomatic/Smoldering Multiple Myeloma (SMM), to malignant diseases, including Multiple Myeloma (MM) and plasma cell leukemia [2]. MGUS is found in approximately 5% of the population over the age of 50, and although it may remain stable for an extended period of time, it has the potential to progress to MM at an annual rate of 1% [3]. SMM represents an intermediate asymptomatic condition in the progression from MGUS to MM, with all MM cases arising from SMM. This stage differs from MGUS by its higher risk of progression to MM within the first five years post diagnosis [3]. As the disease advances, each heterogeneous stage is characterized not only by increasing amounts of clonal bone marrow plasma cells and higher levels of M-protein, but also by differences in the type of M-protein. The stages also display variability in genetic changes, including cytogenetic abnormalities such as translocations and trisomies, which may contribute to disease progression [4]. MM, however, is distinguished from its precursor stages by severe bone destruction, which appears as osteolytic lesions, compression fractures, and skeletal weakening [2]. Both pre-malignant conditions are difficult to diagnose [5].

Although the pathogenesis of MM is complex and not yet fully elucidated, it is thought to result from a combination of alterations in the genetic material and changes in the bone marrow microenvironment [2]. Despite significant novel therapeutic advancements that can induce remission, MM remains incurable [6], with most patients experiencing relapse and eventually developing drug resistance [1]. Genetic mutations, both germline and somatic, are key drivers of the development and progression of malignant conditions, including MM and its precursor stages [6]. Germline mutations originate in reproductive cells during gametogenesis [7], are inherited, and can therefore increase an individual’s predisposition to malignancy without directly causing it. Somatic mutations, on the other hand, arise in non-reproductive cells during the individual’s life, due to either random errors or environmental factors [7], and drive malignant transformation and disease advancement [8]. These mutations can be classified structurally, as single nucleotide polymorphisms (SNPs) and insertions/deletions (INDELs) [9], and functionally, into categories that include, but are not limited to, missense, frameshift, start-loss, stop-loss and splice variants [10]. Variants in either classification may lead to the development of disease by altering the function of proteins and thereby impairing essential cellular processes [9].

Next-generation sequencing (NGS) technology has revolutionized the detection and characterization of such mutations, enabling high-throughput, accurate analysis of large amounts of genetic information [11] and providing deep insights into the structure and complexity of the genome [9]. Among NGS applications, one of the most used is RNA-sequencing (RNA-seq), as it has proven particularly useful in analyzing gene expression [12] and detecting low-frequency mutations, which may not be identified using the conventional method of DNA-sequencing [12]. Single-cell RNA-sequencing (scRNA-seq), as an emerging sequencing technology [13], has in turn become a powerful, yet still developing, tool that allows for the comparative analysis of individual cell transcriptomes, revealing transcriptional differences and similarities within a specific cell population, under both physiological and pathological conditions [14]. This method stands out due to its ability to detect rare cell subpopulations, such as tumor cells [14], which may be masked by standard RNA-seq methods [15]. Single-cell studies have further revealed the heterogeneity present among different subpopulations of cancer cells within the same tumor, as well as the long-term evolution of tumors in MM [16]. Additionally, both approaches support the use of Unique Molecular Identifiers (UMIs), which allow for distinguishing original molecules from duplicates that may arise during Polymerase Chain Reaction (PCR) amplification, thereby enhancing the quantitative accuracy of both RNA-seq and scRNA-seq, especially in samples with low RNA content [17]. UMI integration reduces false positive results due to duplicate reads and enhances the accuracy of variant detection [18].

Given the challenges in early diagnosis and the limited preventability of MM, estimated at just 14% of cases [19], the identification of molecular mechanisms driving progression from MGUS and SMM to MM remains critically important [5]. To further highlight the clinical significance of MM, in 2022, Western Europe accounted for 9% of all MM cases, with Northern Europe showing some of the highest mortality rates, namely 1.8 per 100,000, as per GLOBOCAN 2022. Globally, if current rates remain unchanged, the incidence and mortality of MM are projected to increase by 71% and 79%, respectively, by 2045 [20]. This study aims to identify and compare germline and somatic genetic variants associated with MM, by analyzing scRNA-seq data from patients across the plasma cell dyscrasias spectrum, namely MGUS, SMM and newly diagnosed MM. The analysis focused on detecting and comparing SNPs and INDELs, aiming to elucidate early molecular events potentially contributing to MM pathogenesis. To achieve this, a fully computational pipeline was implemented, integrating quality control, sequence trimming and removal of UMIs, genome mapping, germline and somatic variant detection and annotation, as well as comparative evaluation of the results within each disease stage and across the transitions between the stages, providing insights into the molecular changes underlying progression and carcinogenesis in the MM stage. This bioinformatics approach revealed distinct mutational patterns and identified several candidate genes that, to the best of our knowledge, have not previously been implicated in MM, suggesting a possible role in carcinogenesis and disease evolution and pointing to potential therapeutic targets in the early phases of MM.

These findings provide valuable information on the plasma cell dyscrasias, specifically the malignant disease of MM, and may play a central role in the development of predictive markers, both at the precancerous stages and at the stage of tumorigenesis. They may also aid in creating personalized treatment strategies, as well as in early prognosis and identification of the mechanisms responsible for drug resistance.

2. Materials and Methods

2.1. Data Acquisition

To identify mutations potentially responsible for the progression of Multiple Myeloma (MM) and its precursor stages, a comparative analysis was performed between individuals at each premalignant stage, namely Monoclonal Gammopathy of Undetermined Significance (MGUS) and Smoldering Multiple Myeloma (SMM), newly diagnosed patients with Multiple Myeloma (MM) and healthy individuals without diagnosis of any of the diseases mentioned in this study. The latter served as control samples, providing a reference point for identifying mutations present exclusively in pathological samples.

All datasets analyzed in this study were obtained from the publicly available Gene Expression Omnibus (GEO) database, specifically from the Series GSE271107, which includes all sequenced data used herein. The datasets included four healthy controls (GSM8369864, GSM8369865, GSM8369866, GSM8369867), four patients with MGUS diagnosis (GSM8369868, GSM8369869, GSM8369870, GSM8369871), four SMM patients (GSM8369874, GSM8369875, GSM8369876, GSM8369877) and four newly diagnosed MM patients (GSM8369878, GSM8369879, GSM8369880, GSM8369881).

All samples had been previously sequenced using the Illumina NovaSeq 6000 (GPL24676) platform, a high-throughput sequencing system with low error rates and high yield [21]. This platform follows the standard Illumina workflow [22], which includes reverse transcription of RNA into complementary DNA (cDNA), library preparation, cDNA amplification via Polymerase Chain Reaction (PCR), and sequencing by synthesis. For this study, library preparations were carried out using the Chromium Controller (10x Genomics, Pleasanton, CA, USA), which integrates Unique Molecular Identifiers (UMIs) following the cell-specific barcodes during library construction. Each sample was also registered in the Sequence Read Archive (SRA) database under a unique Experiment Accession (SRX), with four independent sequencing runs, identified by Run Accessions (SRR), per SRX. The data used in this study are publicly available under the Study Accession PRJNA1129864. For each sample, the SRR with the highest number of bases was selected, as higher coverage is indicative of better data quality. Because the size of each SRR file exceeds the 5 GB limit imposed by the SRA database, the FASTQ files were retrieved using the SRA Toolkit on Ubuntu Linux version 22.04.5 LTS (Jammy Jellyfish).

The rest of the preprocessing was also carried out on Ubuntu Linux, selected for its widespread utilization in bioinformatics, ease of use, as well as compatibility with popular bioinformatics tools [23]. Notably, when any errors or warnings arose during the next steps that could not be resolved conventionally, generative artificial intelligence (GenAI) was employed to assist in troubleshooting.

Figure 1 illustrates the computational pipeline implemented in this study for the preprocessing of raw sequence data, as well as for the variant calling processes. This workflow generated the necessary data for the analysis of mutations in cancer patients and their precursor stages.

2.2. File Preparation

Each SRR download contains four different FASTQ files. The two smaller files, called Index 1 (I1) and Index 2 (I2), store only index sequences and are used exclusively during demultiplexing to separate sample reads when converting raw BCL files into usable FASTQ files. These were not relevant to the computational pipeline of this study and thus were excluded. The remaining two files are named Read 1 (R1) and Read 2 (R2) and were utilized in this analysis. R1 is 28 base pairs (bp) long and contains the cell barcode, which takes up 16 bp, followed by a 12 bp Unique Molecular Identifier (UMI) [24]. R2 is the largest file, as it contains the 90 bp cDNA sequence; each R2 read is paired with the barcode and UMI information of the corresponding R1 read.

For this analysis, only R1 and R2 were used, with particular emphasis on R2, as it contains the cDNA sequence required for the study objectives. Although FASTQ file compression does not affect the process or the analysis results, the files were compressed solely to save storage space, since each file exceeded 60 GB.

A quality control assessment was performed on the R2 files of each sample, using the FastQC tool version 0.12.1 [25], and the results were summarized with MultiQC version 1.28. These reports were used only to compare the quality of raw versus trimmed sequences and were not directly included in subsequent analysis.

2.3. Unique Molecular Identifiers Tagging and Processing

The short UMI sequences identified in R1 were removed from R2 prior to further pre-processing. The identification and removal of these sequences was performed using the UMI-tools repository version 1.1.6 [26], specifically the extract command. When the whitelist command was additionally used to enable further filtering and removal of barcodes, the new FASTQ files produced contained only a few kilobytes of information, indicating excessive sequence loss. To preserve data integrity, only the extract command was retained for downstream analysis. This command detects the UMIs in R1, removes them from the R2 sequence, and incorporates them into the read name in the corresponding FASTQ header. The resulting R2 file contains the cDNA sequence, while UMI information is recorded in the header; R1 was therefore no longer required for subsequent analysis steps, as it was used exclusively for R2 processing [26].
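The effect of this step on a single read pair can be sketched as follows. This is a minimal illustration and not UMI-tools' own code: the function name `tag_read` and the example read are hypothetical, and the `_BARCODE_UMI` suffix on the read name is assumed here as the header convention.

```python
# Illustrative sketch only: mimics the effect of UMI extraction on one read pair.
# Layout taken from the text: R1 = 16 bp cell barcode followed by a 12 bp UMI.
BARCODE_LEN = 16
UMI_LEN = 12


def tag_read(r1_seq: str, r2_name: str) -> str:
    """Return the R2 read name with the barcode and UMI from R1 appended."""
    if len(r1_seq) < BARCODE_LEN + UMI_LEN:
        raise ValueError("R1 is shorter than barcode + UMI")
    barcode = r1_seq[:BARCODE_LEN]
    umi = r1_seq[BARCODE_LEN:BARCODE_LEN + UMI_LEN]
    # Assumed naming convention: <read name>_<barcode>_<UMI>
    return f"{r2_name}_{barcode}_{umi}"


# Hypothetical read: a 16 bp barcode followed by a 12 bp UMI.
r1 = "AAACCCAAGAAACACT" + "GGTTCCAATTGG"
tagged = tag_read(r1, "@read1")
```

After tagging, the R2 record carries everything needed downstream, which is why R1 can be set aside.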

2.4. Read Trimming

The new R2 sequence, with UMI information already removed, was further processed to improve its quality using the tool TrimGalore! version 0.6.1028, with default parameters. This Perl-based wrapper integrates Cutadapt for trimming and FastQC for quality assessment, leveraging the quality indicators of the latter for targeted and automated trimming without the need for additional parameters.

To evaluate whether an alternative trimming approach could yield better results, Cutadapt version 5.0 was also tested separately, targeting only overrepresented sequences. This approach was applied both to the R2 file already trimmed with TrimGalore! and to the R2 file immediately after UMI-tools processing. After trimming, the quality of each R2 file was reassessed using the same two tools mentioned above. Based on this comparison, TrimGalore! alone was selected for the final preprocessing workflow, out of the three tested strategies.

2.5. Mapping to the Human Reference Genome

After trimming, the R2 reads were mapped to the GRCh38 reference genome, also known as hg38, as specified by the GEO dataset characteristics. The HISAT2 tool, version 2.2.1, was used for this step, as it provides fast and accurate alignment of RNA-seq and scRNA-seq data [27]. Each output file, initially in SAM format, was then converted to its binary form, BAM, which contains the same information in compressed format to optimize downstream analysis performance.

Subsequently, each BAM file was sorted, and an index was generated for each file; both were carried out using the samtools software version 1.19.2, to ensure compatibility with subsequent analysis steps [28].

The quality of the mapped reads in the sorted BAM files was assessed using Qualimap tool version 2.3 [29]. The resulting metrics included alignment percentages, duplication rates and GC content, providing insight into the data quality and the effectiveness of the alignment process with respect to the reference genome [25].

Typically, a deduplication step follows mapping; however, initial attempts to deduplicate BAM files, to identify and remove potential duplicate reads while taking UMIs into account, using UMI-tools’ command dedup [26], resulted in empty outputs, likely due to the nature of the data. Therefore, this step was omitted to avoid loss of biologically meaningful repetitive regions, which cannot be easily distinguished from technical duplicates in RNA-sequencing, and specifically single-cell RNA-sequencing data [30].

In order to commence the variant detection process, it was necessary to add read group tags to each BAM file, defining the identity of each patient and control sample, to enable proper variant detection [31]. This step was completed using the GATK tools version 4.2.6.1 [31], followed by indexing the updated BAM files [28,31,32].

These pre-processing steps ensured that the data were of sufficient quality and well-prepared for the mutation detection stage.

2.6. Variant Calling

Germline variant detection, namely the detection of mutations conferring inherited predisposition, was performed using the HaplotypeCaller tool from GATK, since it is widely regarded as the most accurate for detecting single-nucleotide polymorphisms (SNPs) and small insertions and deletions (INDELs) of germline origin [32]. The tool was run against the reference genome to identify regions in which the samples vary relative to the reference [8], and a confidence threshold of 30.0 was applied during execution. This ensured that only variants with an estimated error probability of at most 0.1% were recorded, so that the detected variants met a minimum confidence level for being true positives and biologically present [33].

Somatic variant detection was subsequently performed on the same files using the Mutect2 tool, also from GATK. These mutations, acquired during the patient’s lifetime, are more frequent in individuals diagnosed with cancer, making this tool suitable for analyzing the samples used in this study [32,34]. Like HaplotypeCaller, Mutect2 also uses the reference genome GRCh38 but does not require specifying a confidence threshold.

2.7. Variant Filtering

The previous step produced VCF files for each sample, containing the detected genomic variants, either germline or somatic, identified by HaplotypeCaller or Mutect2, respectively. Each recorded variant, in the VCF files, includes information such as chromosome, exact genomic position, nucleotide change, and confidence score. Filtering was then performed to remove low-confidence variants, optimizing the results.

A common filter in such analyses excludes variants with a Phred Quality Score (QUAL) below 30, corresponding to an error probability of 0.1% and an accuracy of 99.9%. The QUAL, computed by GATK, is inversely related to the likelihood of error [33].
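The relation between a Phred-scaled score and the error probability is Q = -10 log10(p). A minimal sketch of the conversion is given below; the function names are illustrative and not part of GATK.

```python
import math


def phred_to_error(q: float) -> float:
    """Error probability corresponding to a Phred-scaled quality score."""
    return 10 ** (-q / 10)


def error_to_phred(p: float) -> float:
    """Phred-scaled quality score corresponding to an error probability."""
    return -10 * math.log10(p)


# A score of 30 corresponds to an error probability of 0.001 (0.1%),
# i.e. an accuracy of 99.9%.
p_at_30 = phred_to_error(30)
```

Each increase of 10 in the score reduces the error probability tenfold, so a score of 20 corresponds to 1% and a score of 40 to 0.01%.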

For the VCF files generated by HaplotypeCaller, the GATK VariantFiltration command was applied, marking variants with QUAL < 30.0 as low quality in the FILTER field of the VCF file, without removing them at this stage. This approach, as opposed to earlier threshold-based filtering, allowed flagged variants to be excluded in later stages of analysis [35].
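The idea of flagging rather than removing records can be sketched on a single VCF line. This is a simplified illustration, not the GATK implementation: the filter name `LowQual` and the example records are hypothetical, and records with a missing QUAL are treated as passing here as a simplifying assumption.

```python
def soft_filter(vcf_line: str, min_qual: float = 30.0, name: str = "LowQual") -> str:
    """Flag a VCF record in the FILTER column instead of dropping it."""
    # VCF columns: CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, ...
    fields = vcf_line.rstrip("\n").split("\t")
    qual = fields[5]
    if qual != "." and float(qual) < min_qual:
        fields[6] = name    # flagged, but kept in the file
    else:
        fields[6] = "PASS"  # simplification: missing QUAL is not flagged
    return "\t".join(fields)


# Hypothetical records: one below and one above the threshold.
low = soft_filter("chr6\t100\t.\tC\tT\t12.5\t.\tDP=8")
high = soft_filter("chr6\t250\t.\tG\tA\t45.0\t.\tDP=31")
```

Because every record is retained, the decision to exclude flagged variants can be deferred to the selection step.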

In contrast, the VCF files generated by Mutect2 were filtered using the specialized GATK command FilterMutectCalls, which evaluates each variant based on predefined automated quality criteria.

2.8. Variant Selection

Next, SNPs and INDELs were extracted separately from each sample to distinguish between these two major types of genetic variants, using bcftools version 1.19 [36].

The PASS filter was applied to retain only variants that met the previously defined quality criteria. It is important to analyze SNPs and INDELs separately, as they can lead to different types of genetic alterations.
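The selection logic can be sketched as follows. This is a simplified stand-in for the bcftools selection, assuming bi-allelic records represented as dictionaries; the helper names and example records are hypothetical, and the length-based type rule is a simplification of how variant types are defined in practice.

```python
def variant_type(ref: str, alt: str) -> str:
    """Classify a bi-allelic variant from the lengths of its alleles."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(ref) != len(alt):
        return "INDEL"
    return "OTHER"  # e.g. multi-nucleotide substitutions


def select(records: list, wanted: str) -> list:
    """Keep PASS records of the requested structural type."""
    return [
        r for r in records
        if r["filter"] == "PASS" and variant_type(r["ref"], r["alt"]) == wanted
    ]


# Hypothetical records: a passing SNP, a passing deletion, a flagged SNP.
records = [
    {"ref": "C", "alt": "T", "filter": "PASS"},
    {"ref": "CA", "alt": "C", "filter": "PASS"},
    {"ref": "G", "alt": "A", "filter": "LowQual"},
]
snps = select(records, "SNP")
indels = select(records, "INDEL")
```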

2.9. Comparison of Patients and Controls Variants

Comparisons of the detected variants from each patient to each control sample were performed to identify unique variants potentially associated with MM or its precancerous stages. This comparison was performed using the isec command of bcftools [36].

A total of 64 comparisons were performed for each plasma cell disorder analyzed in this study, including 16 for SNPs from the filtered VCFs of HaplotypeCaller, 16 for SNPs from Mutect2, and 32 for INDELs from the VCFs of both tools. For each comparison, four VCF files were generated: 0000.vcf contained variants unique to the patient sample, 0001.vcf contained variants unique to the control sample, and 0002.vcf and 0003.vcf contained the variants shared by both samples, as recorded in the patient and the control file, respectively.
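The count of 64 follows from pairing each of the four patients with each of the four controls (16 pairs) for two callers and two variant types. The sketch below reproduces this arithmetic and illustrates the patient-private set as a set difference; the sample labels and variant keys are hypothetical, and representing a variant as a (chromosome, position, REF, ALT) tuple is a simplifying assumption.

```python
from itertools import product

patients = ["P1", "P2", "P3", "P4"]   # hypothetical labels
controls = ["C1", "C2", "C3", "C4"]
callers = ["HaplotypeCaller", "Mutect2"]
variant_types = ["SNP", "INDEL"]

pairs = list(product(patients, controls))                    # 4 x 4 = 16
n_comparisons = len(pairs) * len(callers) * len(variant_types)

# Variants keyed by (chromosome, position, REF, ALT); values are made up.
patient = {("chr6", 100, "C", "T"), ("chr17", 200, "G", "A")}
control = {("chr17", 200, "G", "A")}

patient_only = patient - control   # what 0000.vcf holds for this pair
shared = patient & control         # variants present in both samples
```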

2.10. Annotation of Detected Mutations

After comparing patients and controls, the patient-unique variants from each 0000.vcf file were annotated to determine their functional and biological significance. This step identified the gene in which the variant was located, the potentially affected protein, and the predicted functional impact [37]. Cataloging such pathogenic variants in the human genome is crucial, as it enhances our understanding of disease progression and can aid in detecting the relevant mutations for the study objectives [38].

Annotation in this study was performed using the SnpEff tool, version 5.2f (build on 7 February 2025) [39], which supports both SNPs and INDELs. The tool classifies variants according to their function and assigns predicted impact categories as HIGH, MODERATE, LOW, or MODIFIER [40]. These categories were then used to prioritize biologically meaningful variants in subsequent filtering and analysis.

2.11. Organization and Visualization

Upon completion of the annotation of the detected variants, the resulting VCF files were converted to CSV format to facilitate further data analysis and processing, using the awk command. During this conversion, it was observed that in some cases the ALT field contained multi-base insertions or multiple alternative alleles, causing column misalignment. To address this, the CSV files were reformatted to ensure that each variant corresponded to a single row, maintaining consistency of values across columns. For each patient–control comparison, the relevant CSVs containing the annotated SNPs and INDELs from both HaplotypeCaller and Mutect2 tools were then merged, resulting in a single consolidated CSV file per comparison that included all annotated variants.
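The reformatting described above can be sketched in Python, although the study itself used awk. The record, column names and helper are hypothetical, and the sketch does not reassign per-allele annotations, which a full conversion would also need to handle.

```python
import csv
import io


def split_multiallelic(record: dict) -> list:
    """Expand a record whose ALT lists several alleles into one row per allele."""
    return [{**record, "alt": alt} for alt in record["alt"].split(",")]


# Hypothetical record with two alternative alleles at one site.
record = {"chrom": "chr6", "pos": 29942000, "ref": "C", "alt": "T,CTTA"}
rows = split_multiallelic(record)

# csv.DictWriter quotes fields when needed, so columns stay aligned.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["chrom", "pos", "ref", "alt"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
```

Writing the unsplit record through a plain comma join would place `T` and `CTTA` in separate columns, which is the kind of misalignment the reformatting avoids.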

Variants were then filtered by predicted impact, retaining only those classified as HIGH or MODERATE impact, therefore focusing the analysis on the most biologically significant variants.
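A minimal sketch of this impact filter is shown below, assuming the SnpEff `ANN` INFO entry with pipe-separated subfields in the order allele, annotation, impact, gene name. The example INFO strings and the `GENE_ID` placeholders are hypothetical.

```python
KEEP = {"HIGH", "MODERATE"}


def parse_ann(info: str) -> list:
    """Return (allele, effect, impact, gene) tuples from an ANN INFO entry."""
    for entry in info.split(";"):
        if entry.startswith("ANN="):
            parsed = []
            for ann in entry[4:].split(","):   # one annotation per comma
                f = ann.split("|")
                if len(f) >= 4:
                    parsed.append((f[0], f[1], f[2], f[3]))
            return parsed
    return []


def keep_variant(info: str) -> bool:
    """True if any annotation of the variant is HIGH or MODERATE impact."""
    return any(impact in KEEP for _, _, impact, _ in parse_ann(info))


# Hypothetical INFO strings.
info_kept = ("DP=20;ANN=T|missense_variant|MODERATE|HLA-A|GENE_ID|transcript,"
             "T|upstream_gene_variant|MODIFIER|HLA-A|GENE_ID|transcript")
info_dropped = "DP=5;ANN=A|intron_variant|MODIFIER|MAP4|GENE_ID|transcript"
```

Keeping a variant when any of its annotations qualifies is one possible rule; restricting the check to the first annotation would be a stricter alternative.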

To identify meaningful patterns, additional classification was performed to highlight mutations consistently present across all comparisons within a stage, as well as those uniquely emerging at each disease transition, namely from MGUS to SMM and from SMM to MM. Additionally, mutations specific to the final MM stage, detected in at least 12 patient–control comparisons, were prioritized for visualization.
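The recurrence criterion can be sketched as a count over comparisons. The function name and the toy variant labels are hypothetical; each comparison is assumed to be a collection of variant identifiers.

```python
from collections import Counter


def recurrent(comparisons: list, min_count: int) -> set:
    """Variants found in at least `min_count` of the given comparisons."""
    counts = Counter()
    for variants in comparisons:
        counts.update(set(variants))   # count each variant once per comparison
    return {v for v, n in counts.items() if n >= min_count}


# Toy data for 16 comparisons: one variant in all 16, one in 12, one in 11.
comparisons = [
    {"v_all"}
    | ({"v_12"} if i < 12 else set())
    | ({"v_11"} if i < 11 else set())
    for i in range(16)
]
in_all = recurrent(comparisons, 16)
in_at_least_12 = recurrent(comparisons, 12)
```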

To assess whether specific substitution types were statistically enriched in each stage, Fisher’s exact tests were performed for each substitution-stage combination. This test evaluates whether the frequency of a given substitution is significantly associated with a specific disease stage when compared to its distribution in the other stages. Fisher’s statistical test was selected due to its suitability for categorical data across all sample sizes [41], as well as its tendency to yield more conservative probability values (p-values), thereby reducing the risk of false positives [42]. Substitutions with a p-value of less than 0.05 were considered significantly enriched.
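A self-contained sketch of a two-sided Fisher's exact test on a 2x2 table is given below. The layout of the table (a given substitution versus all other substitutions, in one stage versus the remaining stages) is an assumption about how the test was set up, and the study's own analysis was carried out in R rather than with this code.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    return sum(
        p for p in (prob(x) for x in range(lo, hi + 1))
        if p <= p_obs * (1 + 1e-7)
    )


# Assumed layout:   this substitution | other substitutions
#   this stage             a          |          b
#   other stages           c          |          d
p_value = fisher_exact_two_sided(3, 1, 1, 3)
```

With the small illustrative table above the p-value is 34/70, roughly 0.486, so no enrichment would be called at the 0.05 level.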

All further analyses and visualizations were carried out in R programming language (version 4.4.3) within the RStudio environment (version 2025.5.0.496). These tools were selected for their ability to efficiently handle and analyze large-scale data using specialized libraries and packages for statistical and bioinformatic analyses [43]. Summary plots included barplots of functional and structural mutation types, SNP substitution patterns, and OncoPrint-style heatmaps [44].

3. Results

3.1. Quality Control of Raw and Trimmed Reads

The initial quality assessment of the raw Read 2 (R2) reads indicated that none of the samples were flagged as low quality. However, all samples failed the duplication level metric and displayed warnings for the presence of overrepresented small sequences. In addition, most samples showed warnings for per-base sequence content, and only a few failed the Guanine-Cytosine (GC) content metric.

During the trimming process, multiple strategies were evaluated to optimize the quality of the R2 sequences. Among all methods applied, TrimGalore! alone consistently yielded the highest overall quality compared to the alternative strategies, as reflected in the updated FastQC reports.

Consequently, TrimGalore! was retained as the final trimming method utilized in this pipeline. Nevertheless, all samples failed both the duplication levels and the per-base sequence content metric, while warnings were reported for the presence of overrepresented sequences and for sequence length distribution. In some samples, the GC content metric also remained flagged as failed.

Although the post-trimming FastQC reports initially showed more warnings compared to the raw data, this was expected and does not indicate an actual decrease in quality, as noted in the official FastQC documentation [45]. Rather, these changes reflect the removal of adapters and overrepresented small sequences, which influence certain quality metrics.

3.2. BAM Quality Assessment and Alignment Rates

Qualimap analysis of the BAM files confirmed that all samples achieved acceptable alignment rates, duplication levels and average coverage [29], supporting the reliability and suitability of the data for variant detection.

The alignment of the processed reads to the reference genome GRCh38 resulted in overall alignment rates ranging between 83% and 94%, which meet or exceed the expected range for high-quality single-cell RNA-sequencing (scRNA-seq) data, namely 70–90% [25]. The Monoclonal Gammopathy of Undetermined Significance (MGUS) samples exhibited the highest mean overall alignment rates, while Smoldering Multiple Myeloma (SMM) and Multiple Myeloma (MM) samples showed a gradual decrease, consistent with the increased mutational burden and genomic instability associated with disease progression [46]. In contrast, the control samples displayed alignment rates comparable to those of MM, despite no evidence of technical errors, as indicated by the consistent quality checks of both raw and processed reads. The alignment rates for each group are summarized in Table 1.

3.3. Mutation Profiles by Type

3.3.1. Functional Mutation Types

The analysis of functional mutation effects, shown in Figure 2, revealed that missense mutations were the most prevalent type across all disease stages, suggesting potential impacts on protein structure and function. Other functional categories, such as frameshift mutations, stop-gained mutations, as well as intron and splice site mutations, occurred at much lower frequencies throughout all stages. Rare functional classes, including start and stop loss mutations, as well as mutations in untranslated regions (UTRs) and in-frame deletions and insertions, appeared only sporadically but were observed in all three stages (full results appear in Appendix A Figure A1).

However, the highest mutation frequency was observed in the SMM stage, in contrast to the existing literature, which generally reports a higher mutational load in MM [47,48]. Nonetheless, MGUS exhibits lower mutation counts compared to the more advanced stages, and the overall mutation landscape reflects a progressive increase in diversity across disease evolution, in line with previous findings in the literature [6,47].

3.3.2. Structural Mutation Types

The structural classification of mutations is depicted in Figure 3. A clear predominance of single-nucleotide polymorphisms (SNPs) is detected in all disease stages, while insertions (INS) and deletions (DEL) are also present but at much lower frequencies, both patterns aligning with previous studies [6].

Interestingly, the overall mutation burden appears slightly elevated in the SMM stage, compared to the other two stages, as reported previously in functional classifications (Figure 2). Again, this observation diverges from the existing literature, which typically describes a higher mutational load in MM [47,48].

3.3.3. Nucleotide Substitution Patterns

SNPs comprise a variety of base substitutions, which may arise through different mutagenic mechanisms [49]. To characterize mutational patterns across disease progression stages (MGUS, SMM and MM), the substitutions were normalized to reflect their proportion relative to the total number of SNPs observed per stage. According to Figure 4, Cytosine (C) to Thymine (T) transitions were the most frequent type of substitution in MM, with a proportion equal to 0.202, a pattern well documented in multiple cancer types [6,50,51]. This stage-consistent substitution profile supports the hypothesis that such mutational signatures may represent a molecular hallmark of MM pathogenesis. Notably, Guanine (G) to Adenine (A) substitutions were also common across the precancerous stages MGUS and SMM, consistent with the pre-existing literature [50,51].
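The normalization described above can be sketched as follows. The function name and the toy SNP list are hypothetical; each SNP is assumed to be given as a (REF, ALT) pair for one stage.

```python
from collections import Counter


def substitution_proportions(snps: list) -> dict:
    """Proportion of each REF>ALT substitution among all SNPs of one stage."""
    counts = Counter(f"{ref}>{alt}" for ref, alt in snps)
    total = sum(counts.values())
    return {sub: n / total for sub, n in counts.items()}


# Toy SNP list for a single stage (values are made up).
stage_snps = [("C", "T"), ("C", "T"), ("G", "A"), ("A", "G"), ("A", "G")]
proportions = substitution_proportions(stage_snps)
```

Dividing by the per-stage total makes the profiles comparable between stages with different overall SNP counts.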

While some substitution types were among the most frequently observed across all disease stages, statistical significance (p-value < 0.05) was only assigned to those that were disproportionately enriched in a specific stage when compared to the others. These are marked with an asterisk in the resulting barplots, as can be seen in Figure 4, which displays normalized substitution proportions per stage with annotated counts.

Across the three conditions, distinct patterns of statistically significant substitution types were observed. In MGUS, both frequent substitutions, like C > T and A > G, and lower-frequency events, namely A > T, T > A, C > G and G > C were significantly enriched, suggesting that hallmark mutations associated with methylation and early mutational processes may emerge at the earliest stages of plasma cell dyscrasias [50]. In contrast, SMM exhibited enrichment primarily in low-frequency nucleotide substitutions, including T > C, G > T, A > T, T > A and G > C, implying a more heterogeneous or transitional mutational profile. Moreover, in MM, significant enrichment was limited to A > G and C > G, potentially reflecting novel mutagenic mechanisms possibly responsible for carcinogenesis. Together, these findings reveal a shift in substitution specificity over the course of disease progression, indicating possible stage-specific mutational mechanisms.

3.4. Most Frequently Mutated Genes

To identify genes potentially involved in the progression of the plasma cell dyscrasias examined in this study, the variants that were consistently present across all three disease stages, regardless of the number of patient–control comparisons, were evaluated. The top ten genes identified are shown in Figure 5.

The most recurrent mutations found in all 16 comparisons involved the MHC class I genes, namely HLA-A, HLA-B and HLA-C, which are key components of the immune response and may facilitate immune evasion when mutated [52].

Furthermore, the RNF213 protein is implicated in carcinogenesis, as it has been reported in multiple studies associated with MM and other cancers [53], while AKAP13, a member of the A-kinase anchor protein family, is expressed in a variety of cell types, including bone marrow, and is associated with cellular proliferation and oncogenic transformation [54].

Other notable genes include MKI67, a well-established marker of cell proliferation that has been linked to several cancers, including MM and its precursor stage, MGUS [54]. SYNE2 is a protein-coding gene associated with various neoplasms, while another protein-coding gene, MAP4, is involved in microtubule dynamics, and its disruption may lead to mitotic abnormalities [54].

Additionally, the ZMYM5 gene likely contributes to DNA repair processes, and BDP1 is implicated in the transcription of small RNAs, which are involved in regulatory and potentially tumorigenic functions [54].

The consistent detection of mutations in these genes across many comparisons suggests a potential role in both the commencement and the progression of MM.

3.5. Functional Mutations Emerging During Disease Progression

To highlight the mutations associated with disease progression, only those consistently present across all 16 patient–control comparisons at each transitional phase were selected. This approach ensures that the mutations observed represent stable molecular features of disease advancement rather than sporadic or incidental findings.
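The selection criterion described above amounts to a set intersection over the comparisons. The sketch below is a hedged illustration, not the authors' implementation: the patient and control labels and the gene names are placeholders, and it assumes that the 16 comparisons arise from pairing each of four patients of a stage with each of four controls, which is consistent with the sample counts in the Data Availability Statement but is not stated explicitly.

```python
from itertools import product


def genes_in_all_comparisons(comparisons):
    """Return the genes that carry a variant in every comparison."""
    gene_sets = [set(genes) for genes in comparisons.values()]
    return set.intersection(*gene_sets) if gene_sets else set()


# Hypothetical labels: four patients of one stage paired with four controls.
patients = ["P1", "P2", "P3", "P4"]
controls = ["C1", "C2", "C3", "C4"]
pairs = list(product(patients, controls))  # 4 x 4 = 16 comparisons

# Placeholder gene names; GENE_C appears only in comparisons involving P1.
comparisons = {
    pair: {"GENE_A", "GENE_B"} | ({"GENE_C"} if pair[0] == "P1" else set())
    for pair in pairs
}

shared = genes_in_all_comparisons(comparisons)
print(len(pairs), sorted(shared))
```

Requiring presence in every comparison is deliberately strict: a gene such as the placeholder GENE_C, seen in only a subset of comparisons, is dropped even though it recurs.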

As shown in Figure 6, missense variants constitute the predominant functional class during both MGUS to SMM and SMM to MM transitions, which emphasizes their central role in cancer development. Frameshift mutations also appear to contribute to disease progression, further supporting the significance of protein-altering mutations.

In contrast, splice acceptor site and intron variants are detected exclusively during the transition from the precancerous SMM stage to the malignant MM stage, suggesting a possible intensification of transcriptional or epigenetic dysregulation in the later stages of disease evolution [55].

3.6. Mutations in Newly Diagnosed Multiple Myeloma Patients

To determine the genes affected among newly diagnosed MM patients, the mutation landscapes of all MM comparisons were analyzed (refer to Supplementary Figure S1 for germline mutations and Figure S2 for somatic mutations). The shared germline and somatic genetic alterations, occurring in all 16 patient–control comparisons, are depicted in Figure 7 and Appendix A Figure A2, respectively. The germline analysis revealed only 97 genes consistently altered, whereas in the somatic one, a total of 548 genes were found to be recurrently mutated across all comparisons (refer to Supplementary Tables S1 and S2). Figure 8 depicts the distribution of mutation types among the 548 genes that were consistently identified across all 16 patient–control comparisons.

Germline missense variants were the most prominent mutation class, frequently representing the sole alteration type within individual genes. This prevalence suggests a possible role in inherited susceptibility or modulation of disease risk [54]. Frameshift variants, which introduce significant disruptions to protein-coding sequences, were also observed recurrently. Additional, though less frequent, alterations included stop-gained mutations, disruptive and conservative in-frame deletions, and multi-hit events, which indicate multiple mutations within the same gene in a given comparison. The recurrence and consistency of these germline variants across comparisons suggest potential heritable contributors to MM pathogenesis.

In contrast, the somatic mutations observed in all 16 comparisons showed greater diversity in both type and distribution. Missense variants remained the most common class, highlighting their relevance in tumor evolution. Frameshift mutations also featured prominently, reflecting possible loss-of-function effects. Other detected mutation types included stop and start lost, as well as disruptive in-frame insertions. Notably, multi-hit patterns were more frequent in the somatic context, potentially indicating convergent mutational processes acting on the same gene. Importantly, most genes harbored the same mutation type across all comparisons, suggesting mutation-type exclusivity and possibly selective pressure during disease progression.

To the best of our knowledge, many of the genes identified in both the somatic and germline analyses diverge from previously reported mutation profiles of MM patients [6], suggesting novel or underexplored mechanisms that may be specific to newly diagnosed patients. In certain instances, genes classified in this study as somatic have previously been reported as germline, or may belong to the same gene family as previously reported genes, highlighting possible functional redundancy. Conversely, we also identified molecules consistently reported across studies, such as the KIF gene family and the HLA-A, HLA-B and HLA-C antigens [52], reinforcing their established relevance to MM pathogenesis. These molecules are involved in processes such as cell proliferation, immune recognition and DNA repair, further supporting the link between genetic instability and tumorigenesis [52].

Together, these findings underscore a distinction between germline and somatic mutation landscapes. Germline mutations exhibited greater uniformity and potential relevance to inherited predisposition, while somatic mutations reflected broader heterogeneity of mutation types, in line with their dynamic role in clonal evolution and disease advancement [56].

It is also noteworthy that the comparison between germline and somatic mutations revealed both common patterns and substantial differences. While 71 genes were shared between both datasets (e.g., CPEB4, UBE4A, ZNF273, BIRC6, SETBP1), 26 genes were exclusively mutated in the germline context, and 477 genes were unique to the somatic dataset.
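The reported counts can be checked with simple set arithmetic. The sketch below uses synthetic identifiers rather than the gene lists in Supplementary Tables S1 and S2; only the three sizes (97 germline genes, 548 somatic genes, 71 shared) are taken from the text, and the function name is hypothetical.

```python
def overlap_summary(germline, somatic):
    """Count genes shared by both sets and genes unique to each."""
    germline, somatic = set(germline), set(somatic)
    return {
        "shared": len(germline & somatic),
        "germline_only": len(germline - somatic),
        "somatic_only": len(somatic - germline),
    }


# Synthetic identifiers sized from the reported totals.
N_GERMLINE, N_SOMATIC, N_SHARED = 97, 548, 71

germline = {f"gene_{i}" for i in range(N_GERMLINE)}
# The somatic set reuses the first 71 identifiers and adds new ones for the rest.
somatic = {f"gene_{i}" for i in range(N_SHARED)} | {
    f"gene_{i}" for i in range(N_GERMLINE, N_GERMLINE + N_SOMATIC - N_SHARED)
}

summary = overlap_summary(germline, somatic)
print(len(germline), len(somatic), summary)
```

The exclusive counts follow directly from the totals: 97 − 71 = 26 germline-only genes and 548 − 71 = 477 somatic-only genes, matching the figures given above.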

Moreover, some of the genes highlighted in this study diverge slightly from previously published mutation profiles [6]. These discrepancies may reflect multiple factors, including differences in patient cohorts, disease subtypes, population-specific genomic backgrounds, or the specific context of newly diagnosed MM patients, as analyzed in this dataset. The early-stage mutational landscape may differ from that of relapsed or refractory cases, potentially capturing distinct biological pathways active at disease onset [57].

4. Discussion

In this study, we analyzed single-cell RNA-sequencing (scRNA-seq) data from patients diagnosed with Monoclonal Gammopathy of Undetermined Significance (MGUS) and Smoldering Multiple Myeloma (SMM), as well as newly diagnosed patients with Multiple Myeloma (MM), to investigate the mutational landscape of plasma cell dyscrasias, with a focus on the later stage of the disease. To this end, we used a comprehensive bioinformatics pipeline tailored for scRNA-seq data. High-quality preprocessing, followed by variant calling and the application of strict filtering thresholds, enabled the identification of genetic alterations across disease stages. Our findings confirmed that missense variants were the most prevalent functional mutation type, both in early precursor stages and in MM, with additional contributions from intron and splice-site mutations emerging during the tumorigenic stage. Notably, somatic variants found in MM datasets exhibited greater functional mutation-type diversity, while germline variants were more uniform, often dominated by missense mutations alone. Importantly, mutations present across all 16 patient–control comparisons likely represent stable molecular hallmarks of disease and point to underlying mechanisms of pathogenesis. Although germline mutations, which are inherited, play a role in certain forms of cancer, the majority of cancer cases are attributed to somatic mutations that accumulate over time due to endogenous processes or environmental factors [56]. These findings support the theory of a synergistic effect between predisposing and acquired genetic factors in the development of MM and the oncogenesis that characterizes this stage. Overall, the complexity of each patient’s molecular profile demonstrates that different combinations of gene mutations drive the progression of plasma cell dyscrasias and cancers in general.

In contrast to prior reports describing a higher mutational load in MM, our analysis revealed an unexpectedly elevated mutation burden in the intermediate SMM stage [47,48]. While previous studies often rely on bulk DNA-seq or exome-level profiling, our use of scRNA-seq data captured different molecular signals, including low-frequency subclonal events [58]. Additionally, the identification of 548 somatic and 97 germline genes mutated in all MM comparisons reflects a robust core of consistently altered genes, while the 71 genes shared between the germline and somatic datasets further underscore a degree of mutation-type specificity. Furthermore, the large number of newly identified genes in newly diagnosed MM patients suggests underexplored mechanisms that may be specific to this group and may contribute to disease progression [6]. Even so, the identified key mutated genes, such as the HLA genes and members of the KIF, EP400 and KDM families, may not only serve as potential biomarkers but also provide insights into the molecular mechanisms driving disease progression and highlight targets for future therapies.

Our results reveal mechanistic insights that expand upon prior studies. The emergence of splice-site and intron variants specifically during the SMM to MM transition suggests that transcriptional dysregulation becomes increasingly pronounced in later disease stages [55]. Unlike the more heterogeneous mutation profiles reported in relapsed MM cases, the relative uniformity seen here may reflect a mutational foundation established early in disease. Moreover, the analytical pipeline utilized in this study is designed to be accessible and may be readily adopted without extensive bioinformatics expertise, while remaining fully functional for retrieving and interpreting variant information.

Several methodological limitations should be considered when interpreting our findings. A major constraint was the lack of detailed clinical metadata, particularly for control samples [59], and more generally for all samples utilized in this study. This limited our ability to account for confounding factors, such as other medical conditions or medication, which may have influenced the variant calling process. In addition, the relatively small number of samples per stage limits the extent to which the findings can be extrapolated to the wider population. Equally important is the fact that all MM cases analyzed were from newly diagnosed patients, who may exhibit a lower mutational burden than relapsed cases. Moreover, the novel mutations identified in our study differ from previously published mutation profiles, which may reflect multiple factors, including the absence of clinical metadata and the omission of a deduplication step, which was skipped to avoid excluding biologically important repetitive regions. Because such regions are difficult to differentiate from potential technical artifacts in scRNA-seq data, artifacts cannot be entirely ruled out when interpreting our results. While the pipeline functioned effectively for mutation identification in this dataset, it may not be directly transferable to other sequencing platforms or methods. Nevertheless, rigorous quality control and filtering steps were applied throughout the workflow to ensure the reliability and validity of the results.

While the results of this study are promising, further work is needed to build upon them. Although these findings underscore critical points in disease evolution that could be leveraged for early treatment or disruption of malignant progression, future studies should involve larger and more diverse patient cohorts, including relapsed cases, to explore resistance mechanisms and validate key alterations. Furthermore, additional analyses could enhance the functional and biological interpretation of the identified variants in our dataset, including the assessment of their biological impact through oncogenic signaling pathway analysis or Gene Set Enrichment Analysis, to identify enriched biological pathways or functional categories. The analysis of copy number variation (CNV) could further uncover important genomic insights that may prove valuable for early MM detection. In addition, future work could expand on our current variant annotation approach by incorporating alternative tools, such as ANNOVAR 2019 (https://annovar.openbioinformatics.org/en/latest/, accessed on 8 May 2025), alongside SnpEff, enabling a comparative evaluation of their outputs and potentially refining variant interpretation in the context of MM. Finally, mutational signature analyses and the identification of potential driver genes would strengthen both the clinical and biological significance of the findings.

5. Conclusions

The single-cell RNA-sequencing-based mutation analysis revealed meaningful molecular alterations in early stages of Multiple Myeloma, offering insights into early disease mechanisms and the transition from precursor stages. By distinguishing between germline and somatic mutations and identifying stage-specific events, we provide insight into the genetic basis of disease progression. These findings contribute to a deeper understanding of Multiple Myeloma evolution and may inform future strategies for early detection and targeted intervention.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/diagnostics15172130/s1, Supplementary Figure S1: Oncoprint of germline mutations identified in every Multiple Myeloma patient–control comparison; Supplementary Figure S2: Oncoprint of somatic mutations identified in every Multiple Myeloma patient–control comparison; Supplementary Table S1: Germline mutations identified across all patient–control comparisons; Supplementary Table S2: Somatic mutations identified across all patient–control comparisons.

Author Contributions

Conceptualization, M.K. and C.P.; methodology, M.K.; validation, C.P.; formal analysis, M.K.; resources, M.K. and C.P.; data curation, M.K.; writing—original draft preparation, M.K.; writing—review and editing, C.P.; visualization, M.K.; supervision, C.P.; project administration, C.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets analyzed in this study are available in GEO under the following accession numbers: (a) MGUS patients: GSM8369868, GSM8369869, GSM8369870, GSM8369871; (b) SMM patients: GSM8369874, GSM8369875, GSM8369876, GSM8369877; (c) newly diagnosed MM patients: GSM8369878, GSM8369879, GSM8369880, GSM8369881; (d) controls: GSM8369864, GSM8369865, GSM8369866, GSM8369867. The code implemented in this study is available upon request.

Acknowledgments

The authors would like to thank Konstantina D. Kourou and all members of the Bioinformatics Laboratory of the School of Biological Applications and Technology at the University of Ioannina for their valuable insights.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

MM	Multiple Myeloma
MGUS	Monoclonal Gammopathy of Undetermined Significance
SMM	Smoldering Multiple Myeloma
scRNA-seq	Single-cell RNA-sequencing
UMIs	Unique Molecular Identifiers
SNPs	Single Nucleotide Polymorphisms
INDELs	Insertions/Deletions
NGS	Next-Generation Sequencing
RNA-seq	RNA-sequencing
GEO	Gene Expression Omnibus
cDNA	Complementary DNA
PCR	Polymerase Chain Reaction
SRA	Sequence Read Archive
R	Read
I	Index
UTRs	Untranslated regions
A	Adenine
C	Cytosine
T	Thymine
G	Guanine
CNV	Copy Number Variation

Appendix A

Figure A1. Extended functional classification of variants in all disease stages. This figure includes all detected functional categories, including less frequent types.

Figure A2. Oncoprint of somatic mutations shared across all Multiple Myeloma patient–control comparisons, (a) includes 137 genes from A–D, (b) 137 genes from E–M, (c) 137 genes from M–S and (d) 137 genes from S–Z. Mutation types are color-coded; each box from left to right represents each comparison. The bar plot above indicates the number of mutations per comparison, while the percentage on the right shows the frequency of each gene mutation across the 16 comparisons.

References

1. Streetly, M.J. Myeloma and MGUS. Medicine 2025, 53, 325–330.
2. Bernstein, Z.S.; Kim, E.B.; Raje, N. Bone Disease in Multiple Myeloma: Biologic and Clinical Implications. Cells 2022, 11, 2308.
3. Rajkumar, S.V.; Kumar, S.; Lonial, S.; Mateos, M.V. Smoldering Multiple Myeloma Current Treatment Algorithms. Blood Cancer J. 2022, 12, 129.
4. Abdallah, N.; Rajkumar, S.V.; Greipp, P.; Kapoor, P.; Gertz, M.A.; Dispenzieri, A.; Baughn, L.B.; Lacy, M.Q.; Hayman, S.R.; Buadi, F.K.; et al. Cytogenetic Abnormalities in Multiple Myeloma: Association with Disease Characteristics and Treatment Response. Blood Cancer J. 2020, 10, 82.
5. Rajkumar, S.V. MGUS and Smoldering Multiple Myeloma: Update on Pathogenesis, Natural History, and Management. Hematology 2005, 2005, 340–345.
6. Xie, C.; Zhong, L.; Luo, J.; Luo, J.; Wu, Y.; Zheng, S.; Jiang, L.; Zhang, J.; Shi, Y. Identification of Mutation Gene Prognostic Biomarker in Multiple Myeloma through Gene Panel Exome Sequencing and Transcriptome Analysis in Chinese Population. Comput. Biol. Med. 2023, 163, 107224.
7. Gong, Y.; Deng, J.; Wu, X. Germline Mutations and Blood Malignancy (Review). Oncol. Rep. 2020, 45, 49–57.
8. Zhang, L.; Lee, M.; Maslov, A.Y.; Montagna, C.; Vijg, J.; Dong, X. Analyzing Somatic Mutations by Single-Cell Whole-Genome Sequencing. Nat. Protoc. 2024, 19, 487–516.
9. Ratan, A.; Olson, T.L.; Loughran, T.P.; Miller, W. Identification of Indels in Next-Generation Sequencing Data. BMC Bioinform. 2015, 16, 42.
10. López-Bigas, N.; Blencowe, B.J.; Ouzounis, C.A. Highly Consistent Patterns for Inherited Human Diseases at the Molecular Level. Bioinformatics 2006, 22, 269–277.
11. Marschall, T.; Costa, I.G.; Canzar, S.; Bauer, M.; Klau, G.W.; Schliep, A.; Schönhuth, A. CLEVER: Clique-Enumerating Variant Finder. Bioinformatics 2012, 28, 2875–2882.
12. Tang, G.; Liu, X.; Cho, M.; Li, Y.; Tran, D.-H.; Wang, X. Pan-Cancer Discovery of Somatic Mutations from RNA Sequencing Data. Commun. Biol. 2024, 7, 619.
13. Wang, S.; Sun, S.-T.; Zhang, X.-Y.; Ding, H.-R.; Yuan, Y.; He, J.-J.; Wang, M.-S.; Yang, B.; Li, Y.-B. The Evolution of Single-Cell RNA Sequencing Technology and Application: Progress and Perspectives. Int. J. Mol. Sci. 2023, 24, 2943.
14. Haque, A.; Engel, J.; Teichmann, S.A.; Lönnberg, T. A Practical Guide to Single-Cell RNA-Sequencing for Biomedical Research and Clinical Applications. Genome Med. 2017, 9, 75.
15. Poirion, O.B.; Zhu, X.; Ching, T.; Garmire, L. Single-Cell Transcriptomics Bioinformatics and Computational Challenges. Front. Genet. 2016, 7, 163.
16. Liu, R.; Gao, Q.; Foltz, S.M.; Fowles, J.S.; Yao, L.; Wang, J.T.; Cao, S.; Sun, H.; Wendl, M.C.; Sethuraman, S.; et al. Co-Evolution of Tumor and Immune Cells during Progression of Multiple Myeloma. Nat. Commun. 2021, 12, 2559.
17. Ziegenhain, C.; Vieth, B.; Parekh, S.; Reinius, B.; Guillaumet-Adkins, A.; Smets, M.; Leonhardt, H.; Heyn, H.; Hellmann, I.; Enard, W. Comparative Analysis of Single-Cell RNA Sequencing Methods. Mol. Cell 2017, 65, 631–643.e4.
18. Fu, Y.; Wu, P.-H.; Beane, T.; Zamore, P.D.; Weng, Z. Elimination of PCR Duplicates in RNA-Seq and Small RNA-Seq Using Unique Molecular Identifiers. BMC Genom. 2018, 19, 531.
19. Myeloma Statistics|Cancer Research UK. Available online: https://www.cancerresearchuk.org/health-professional/cancer-statistics/statistics-by-cancer-type/myeloma (accessed on 12 May 2025).
20. Mafra, A.; Laversanne, M.; Marcos-Gragera, R.; Chaves, H.V.S.; Mcshane, C.; Bray, F.; Znaor, A. The Global Multiple Myeloma Incidence and Mortality Burden in 2022 and Predictions for 2045. JNCI J. Natl. Cancer Inst. 2025, 117, 907–914.
21. Modi, A.; Vai, S.; Caramelli, D.; Lari, M. The Illumina Sequencing Protocol and the NovaSeq 6000 System. In Bacterial Pangenomics: Methods and Protocols; Mengoni, A., Bacci, G., Fondi, M., Eds.; Springer US: New York, NY, USA, 2021; pp. 15–42. ISBN 978-1-0716-1099-2.
22. NGS Workflow Steps. Available online: https://www.illumina.com/science/technology/next-generation-sequencing/beginners/ngs-workflow.html (accessed on 8 May 2025).
23. Möller, S.; Krabbenhöft, H.N.; Tille, A.; Paleino, D.; Williams, A.; Wolstencroft, K.; Goble, C.; Holland, R.; Belhachemi, D.; Plessy, C. Community-Driven Computational Biology with Debian Linux. BMC Bioinform. 2010, 11, S5.
24. Ebrahimi, G.; Orabi, B.; Robinson, M.; Chauve, C.; Flannigan, R.; Hach, F. Fast and Accurate Matching of Cellular Barcodes across Short-Reads and Long-Reads of Single-Cell RNA-Seq Experiments. iScience 2022, 25, 104530.
25. Conesa, A.; Madrigal, P.; Tarazona, S.; Gomez-Cabrero, D.; Cervera, A.; McPherson, A.; Szcześniak, M.W.; Gaffney, D.J.; Elo, L.L.; Zhang, X.; et al. A Survey of Best Practices for RNA-Seq Data Analysis. Genome Biol. 2016, 17, 13.
26. UMI-Tools: Modelling Sequencing Errors in Unique Molecular Identifiers to Improve Quantification Accuracy. Available online: https://genome.cshlp.org/content/early/2017/01/18/gr.209601.116.abstract (accessed on 8 May 2025).
27. HISAT2. Available online: https://daehwankimlab.github.io/hisat2/ (accessed on 19 May 2025).
28. Li, H.; Handsaker, B.; Wysoker, A.; Fennell, T.; Ruan, J.; Homer, N.; Marth, G.; Abecasis, G.; Durbin, R.; 1000 Genome Project Data Processing Subgroup. The Sequence Alignment/Map Format and SAMtools. Bioinformatics 2009, 25, 2078–2079.
29. Qualimap: Evaluating next Generation Sequencing Alignment Data. Available online: http://qualimap.conesalab.org/ (accessed on 8 May 2025).
30. Parekh, S.; Ziegenhain, C.; Vieth, B.; Enard, W.; Hellmann, I. The Impact of Amplification on Differential Expression Analyses by RNA-Seq. Sci. Rep. 2016, 6, 25533.
31. Read Groups–GATK. Available online: https://gatk.broadinstitute.org/hc/en-us/articles/360035890671-Read-groups (accessed on 24 May 2025).
32. Lee, J.H.; Kweon, S.; Park, Y.R. Sharing Genetic Variants with the NGS Pipeline Is Essential for Effective Genomic Data Sharing and Reproducibility in Health Information Exchange. Sci. Rep. 2021, 11, 2268.
33. Phred-Scaled Quality Scores–GATK. Available online: https://gatk.broadinstitute.org/hc/en-us/articles/360035531872-Phred-scaled-quality-scores (accessed on 8 May 2025).
34. Lin, Y.; Rasmussen, M.H.; Christensen, M.H.; Frydendahl, A.; Maretty, L.; Andersen, C.L.; Besenbacher, S. Evaluating Bioinformatics Processing of Somatic Variant Detection in cfDNA Using Targeted Sequencing with UMIs. Int. J. Mol. Sci. 2024, 25, 11439.
35. Van Der Auwera, G.A.; Carneiro, M.O.; Hartl, C.; Poplin, R.; Del Angel, G.; Levy-Moonshine, A.; Jordan, T.; Shakir, K.; Roazen, D.; Thibault, J.; et al. From FastQ Data to High-Confidence Variant Calls: The Genome Analysis Toolkit Best Practices Pipeline. Curr. Protoc. Bioinforma. 2013, 43, 11.10.1–11.10.33.
36. Bcftools. Available online: https://samtools.github.io/bcftools/bcftools.html#view (accessed on 8 May 2025).
37. Kartti, S.; Bouricha, E.M.; Zarrik, O.; Aghlallou, Y.; Mounjid, C.; ELJaoudi, R.; Belyamani, L.; Ibrahimi, A.; El Khannoussi, B. Targeted Gene Panel Sequencing Unveiled New Pathogenic Mutations in Patients With Breast Cancer. Bioinforma. Biol. Insights 2023, 17, 11779322231182054.
38. Alsulami, A.F. Comprehensive Annotation of Mutations in Hallmark Genes Insights into Structural and Functional Implications. Comput. Biol. Med. 2025, 185, 109588.
39. Variant Annotation (SNPs/INDELs Effects)|UC Davis Bioinformatics Core July Alliance for Global Health Makerere University Workshop—Variant Analysis. Available online: https://ucdavis-bioinformatics-training.github.io/2019-Alliance-for-Global-Health-and-Science-Makerere-University_Variants/variant_analysis/variant_annotation.html (accessed on 8 May 2025).
40. McLaren, W.; Gil, L.; Hunt, S.E.; Riat, H.S.; Ritchie, G.R.S.; Thormann, A.; Flicek, P.; Cunningham, F. The Ensembl Variant Effect Predictor. Genome Biol. 2016, 17, 122.
41. Kim, H.-Y. Statistical Notes for Clinical Researchers: Chi-Squared Test and Fisher’s Exact Test. Restor. Dent. Endod. 2017, 42, 152.
42. Bewick, V.; Cheek, L.; Ball, J. Statistics Review 8: Qualitative Data—Tests of Association. Crit. Care 2004, 8, 46.
43. Savita; Verma, N. A Review Study on Big Data Analysis Using R Studio. Int. J. Eng. Technol. Manag. Res. 2020, 6, 129–136.
44. Gu, Z.; Eils, R.; Schlesner, M. Complex Heatmaps Reveal Patterns and Correlations in Multidimensional Genomic Data. Bioinformatics 2016, 32, 2847–2849.
45. Babraham Bioinformatics-FastQC A Quality Control Tool for High Throughput Sequence Data. Available online: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/ (accessed on 8 May 2025).
46. Alberge, J.-B.; Dutta, A.K.; Poletti, A.; Coorens, T.H.H.; Lightbody, E.D.; Toenges, R.; Loinaz, X.; Wallin, S.; Dunford, A.; Priebe, O.; et al. Genomic Landscape of Multiple Myeloma and Its Precursor Conditions. Nat. Genet. 2025, 57, 1493–1503.
47. Mikulasova, A.; Wardell, C.P.; Murison, A.; Boyle, E.M.; Jackson, G.H.; Smetana, J.; Kufova, Z.; Pour, L.; Sandecka, V.; Almasi, M.; et al. The Spectrum of Somatic Mutations in Monoclonal Gammopathy of Undetermined Significance Indicates a Less Complex Genomic Landscape than That in Multiple Myeloma. Haematologica 2017, 102, 1617–1625.
48. Musto, P.; Engelhardt, M.; Caers, J.; Bolli, N.; Kaiser, M.; Van De Donk, N.; Terpos, E.; Broijl, A.; De Larrea, C.F.; Gay, F.; et al. 2021 European Myeloma Network Review and Consensus Statement on Smoldering Multiple Myeloma: How to Distinguish (and Manage) Dr. Jekyll and Mr. Hyde. Haematologica 2021, 106, 2799–2812.
49. Bacolla, A.; Cooper, D.; Vasquez, K. Mechanisms of Base Substitution Mutagenesis in Cancer Genomes. Genes 2014, 5, 108–146.
50. Australian Pancreatic Cancer Genome Initiative; ICGC Breast Cancer Consortium; ICGC MMML-Seq Consortium; ICGC PedBrain; Alexandrov, L.B.; Nik-Zainal, S.; Wedge, D.C.; Aparicio, S.A.J.R.; Behjati, S.; Biankin, A.V.; et al. Signatures of Mutational Processes in Human Cancer. Nature 2013, 500, 415–421.
51. Bin, Y.; Wang, X.; Zhao, L.; Wen, P.; Xia, J. An Analysis of Mutational Signatures of Synonymous Mutations across 15 Cancer Types. BMC Med. Genet. 2019, 20, 190.
52. Smith, A.G.; Fan, W.; Regen, L.; Warnock, S.; Sprague, M.; Williams, R.; Nisperos, B.; Zhao, L.P.; Loken, M.R.; Hansen, J.A.; et al. Somatic Mutations in the HLA Genes of Patients with Hematological Malignancy. Tissue Antigens 2012, 79, 359–366.
53. Perumal, D.; Imai, N.; Laganà, A.; Finnigan, J.; Melnekoff, D.; Leshchenko, V.V.; Solovyov, A.; Madduri, D.; Chari, A.; Cho, H.J.; et al. Mutation-Derived Neoantigen-Specific T-Cell Responses in Multiple Myeloma. Clin. Cancer Res. 2020, 26, 450–464.
54. Home-Gene–NCBI. Available online: https://www.ncbi.nlm.nih.gov/gene/ (accessed on 3 June 2025).
55. Dutta, A.K.; Fink, J.L.; Grady, J.P.; Morgan, G.J.; Mullighan, C.G.; To, L.B.; Hewett, D.R.; Zannettino, A.C.W. Subclonal Evolution in Disease Progression from MGUS/SMM to Multiple Myeloma Is Characterised by Clonal Stability. Leukemia 2019, 33, 457–468.
56. Alexandrov, L.B.; Kim, J.; Haradhvala, N.J.; Huang, M.N.; Tian Ng, A.W.; Wu, Y.; Boot, A.; Covington, K.R.; Gordenin, D.A.; Bergstrom, E.N.; et al. The Repertoire of Mutational Signatures in Human Cancer. Nature 2020, 578, 94–101.
57. Giesen, N.; Paramasivam, N.; Toprak, U.H.; Huebschmann, D.; Xu, J.; Uhrig, S.; Samur, M.; Bähr, S.; Fröhlich, M.; Mughal, S.S.; et al. Comprehensive Genomic Analysis of Refractory Multiple Myeloma Reveals a Complex Mutational Landscape Associated with Drug Resistance and Novel Therapeutic Vulnerabilities. Haematologica 2022, 107, 1891–1901.
58. Martins Rodrigues, F.; Jasielec, J.; Perpich, M.; Kim, A.; Moma, L.; Li, Y.; Storrs, E.; Wendl, M.C.; Jayasinghe, R.G.; Fiala, M.; et al. Germline Predisposition in Multiple Myeloma. iScience 2025, 28, 111620.
59. Wojcik, G.L.; Murphy, J.; Edelson, J.L.; Gignoux, C.R.; Ioannidis, A.G.; Manning, A.; Rivas, M.A.; Buyske, S.; Hendricks, A.E. Opportunities and Challenges for the Use of Common Controls in Sequencing Studies. Nat. Rev. Genet. 2022, 23, 665–679.

Figure 1. Workflow pipeline for preprocessing (blue arrows) and variant detection (pink arrows), indicating the purpose of each step, the tools used, and the corresponding files. Black arrows within the boxes represent the conversion of input files into corresponding output files at each stage.

Figure 2. Functional classification of variants detected at each disease stage. Only the main functional mutation types are displayed; the full classification is provided in Appendix A, Figure A1.

Figure 3. Structural classification of mutations detected at each disease stage.

Figure 4. Normalized proportions of SNP substitution types across disease stages, (a) MGUS, (b) SMM and (c) MM. Each panel displays the relative frequency of single-nucleotide substitution types normalized to the total number of SNPs in that stage. Numeric labels represent raw substitution counts. Substitution types marked with an asterisk (*) were found to be significantly enriched in that stage (p < 0.05, Fisher’s exact test), indicating that their frequency was statistically higher compared to their frequency in the other two stages.

Figure 5. The top ten most frequently mutated genes, common to all disease stages.

Figure 6. Functional mutation types consistently present during stage-to-stage transitions in disease progression, with arrows indicating stage transitions.

Figure 7. Oncoprint of germline mutations shared across all Multiple Myeloma patient–control comparisons. Mutation types are color-coded; each box, from left to right, represents one comparison. The bar plot above indicates the number of mutations per comparison, while the percentage on the right shows the frequency of each gene mutation across the 16 comparisons.

Figure 8. Bar plot showing the distribution of mutation types among the 548 somatically mutated genes that were consistently identified across all 16 MM patient–control comparisons. Each bar represents one comparison, and the segments are color-coded according to mutation type. The height of each bar reflects the number of shared mutated genes per comparison, while the color composition illustrates the relative contribution of each mutation category.

Table 1. Overall alignment rates (%) for each individual sample, grouped by sample category, together with the mean alignment rate of each group. Values represent the percentage of reads successfully aligned to the GRCh38 reference genome.

Sample	MGUS (%)	SMM (%)	MM (%)	Controls (%)
1	92.51	84.71	90.27	91.42
2	91.84	94.25	89.72	92.72
3	89.05	83.43	83.63	85.01
4	87.65	88.24	88.39	83.11
Mean	90.26	87.66	88.00	88.07
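
The group means in Table 1 can be recomputed from the per-sample values. The sketch below is a minimal check, assuming an arithmetic mean rounded half-up to two decimals; the article does not state the rounding rule, and the function name is hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP

# Per-sample alignment rates (%) as listed in Table 1, kept as strings so
# that Decimal arithmetic is exact.
ALIGNMENT_RATES = {
    "MGUS": ["92.51", "91.84", "89.05", "87.65"],
    "SMM": ["84.71", "94.25", "83.43", "88.24"],
    "MM": ["90.27", "89.72", "83.63", "88.39"],
    "Controls": ["91.42", "92.72", "85.01", "83.11"],
}


def group_mean(values):
    """Arithmetic mean of percentage strings, rounded half-up to 2 decimals."""
    total = sum(Decimal(v) for v in values)
    mean = total / len(values)
    return mean.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


if __name__ == "__main__":
    for group, values in ALIGNMENT_RATES.items():
        print(f"{group}: {group_mean(values)}")
```

Decimal is used instead of binary floats because a mean such as 88.065 sits exactly on a rounding boundary, where float representation error could tip the second decimal either way.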

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

