1. Introduction
Liquid biopsy is increasingly recognized for its many applications in cancer management, including early detection, diagnostics, surveillance, and prognosis [
1]. The most researched biomarker in liquid biopsies is circulating cell-free DNA (cfDNA), which is released from cells through apoptosis, necrosis, and active secretion. Due to their high turnover rate, tumor cells release more DNA, referred to as circulating tumor DNA (ctDNA), compared to normal cells [
2]. Both cfDNA and ctDNA carry complete genetic and epigenetic information from their cell of origin, including mutations, larger genetic abnormalities such as copy number variants, and methylation patterns [
3].
Epigenetics refer to heritable changes in gene expression that occur without alterations to the underlying DNA sequence. Key epigenetic mechanisms include DNA methylation, histone modifications, and chromatin remodeling, all of which regulate gene activity and maintain cellular identity. In malignancy, aberrant methylation patterns disrupt normal gene regulation, contributing to tumor initiation and progression. Since these epigenetic changes occur early in cancer development and are relatively stable, they are particularly attractive as biomarkers for early detection [
4].
Mutational profiling of cancer via liquid biopsy has been the most investigated biomarker approach and was the first to receive FDA approval for detecting resistance mutations in lung cancer [
5]. However, point mutations in ctDNA are rare and are often found at low allele frequencies, which decrease further as the cancer evolves and tumor heterogeneity increases. As a result, these analyses are less well-suited for reliable cancer detection [
6,
7]. On the other hand, epigenetic patterns such as methylation are established early in cancer development, and analyzing multiple loci enhances the likelihood of accurate detection [
8,
9,
10]. The most common methylation pattern in mammals involves the addition of a methyl group to the 5-carbon position of cytosines, usually in the context of CpG dinucleotides [
11]. In cancer, methylation alterations can be categorized into focal hypermethylation and global hypomethylation [
12,
13,
14]. Methylation patterns include differentially methylated regions (DMRs), which are defined as sets of CpGs in close proximity. Due to the harsh chemical treatment usually performed before methylation analysis, methylation-based cfDNA analyses have not been feasible, and few studies have investigated their use in a pan-cancer context [
15,
16].
In this study, we employed targeted next-generation sequencing (NGS)-based analysis combined with a novel cytosine conversion method, using an enzymatic methodology that is less damaging to DNA than conventional bisulfite treatment and consequently allows for low-input DNA. Using this method, we investigated methylation patterns in cfDNA as a diagnostic aid in a pan-cancer setting, focusing on patients presenting with severe, nonspecific symptoms suggestive of cancer.
The aim of the study was to investigate the feasibility of enzymatic conversion methylation sequencing for liquid biopsy. The aim was also to investigate the discriminative power of a targeted methylation-based NGS panel for cancer detection in a clinical pan-cancer setting.
2. Results
2.1. Patient Characteristics
The complete cohort consisted of 229 patients with an even sex distribution and a median age of 72 years (
Table 1). The cohort was divided into three groups: patients with nonmalignant conditions (
n = 183), patients with a cancer diagnosis within 12 months (
n = 37) and patients that received a cancer diagnosis later than 12 months from start of investigation (
n = 9). The cancer group included a slightly higher proportion of females (not statistically significant (n.s.)) and was significantly older than the nonmalignant control group (
p = 0.001,
Table 1). The sub-cohort used for DMR analysis consisted of all cancer cases with age- and sex-matched controls from the control group; hence the baseline characteristics are not reported separately for these controls. The clinical characteristics of the patients belonging to the Late cancer group can be seen in
Supplemental Table S2.
The cancer group primarily consisted of carcinomas (67.6%), with adenocarcinomas being the largest subtype (60%), originating from various anatomical sites. Haematolymphoid cancers accounted for 24.3% of the cancer cases. Among patients with solid tumors, 37.9% had metastatic disease at the time of diagnosis. Details are provided in
Supplemental Table S1.
The control group encompassed a broad spectrum of diagnoses (
Table 1), with further details available in
Supplemental Table S1. The largest subgroup consisted of patients diagnosed with various autoimmune diseases (13.5%), of which polymyalgia rheumatica accounted for the majority (50.0%). Infectious diseases accounted for another significant subgroup (11.5%), predominantly bacterial in origin (80.0%), with further details available in
Supplemental Table S1.
2.2. Feasibility
The cfDNA extraction yield had a median concentration of 8.3 ng/mL plasma (IQR: 5.0–13.1 ng/mL, range: 0.0–76.7 ng/mL), with two samples from the control group and one sample from Later cancer having undetectable concentrations. Compared with controls, cancer patients had a significantly higher median cfDNA concentration of 11.4 ng/mL (IQR: 8.7–22.0 ng/mL, range: 2.6–76.7 ng/mL), whereas controls had a median of 7.7 ng/mL (IQR: 4.8–12.2 ng/mL, range: 0.0–53.3 ng/mL,
p = 0.0008) (see
Figure 1A). The Later cancer group had a median cfDNA concentration of 8.8 ng/mL (range 0–48.6 ng/mL).
When cancer cases were stratified by disease type, metastatic (median 16.5 ng/mL, IQR: 9.8–35.5 ng/mL, range: 5.8–76.7 ng/mL) and haematolymphoid malignancies (median 13.8 ng/mL, IQR: 9.3–22.0 ng/mL, range 4.6–59.3 ng/mL) showed higher median cfDNA concentrations and greater variability compared to localized cases (median 9.6 ng/mL, IQR: 6.3–12.8 ng/mL, range 2.6–27.5 ng/mL), although these differences did not reach statistical significance (
p = 0.06) (see
Figure 1B).
When controls were stratified by diagnosis, autoimmune cases (median 10.5 ng/mL, IQR 7.6–14.4 ng/mL, range 4.2–34.0 ng/mL) exhibited statistically significant higher cfDNA concentration compared to controls with no diagnosis (median 6.2, IQR 4.1–9.9 ng/mL, range 1.4–53.3 ng/mL,
p.adj = 0.03). Infectious (median 12.1 mg/mL) and Inflammatory conditions (median 10.0 ng/mL) also displayed higher cfDNA levels, although these differences did not reach statistical significance. The heterogenous “Other” control group had a low cfDNA concentration (median 7.4 ng/mL) (see
Figure 1C).
No significant correlation was observed between BMI and cfDNA concentration in the overall cohort (ρ = 0.13, p = 0.06). In controls, a weak positive correlation was detected (ρ = 0.19, p = 0.01), while no correlations were found in cancer or later cancer. Age was positively correlated with cfDNA concentration in the full cohort (ρ = 0.28, p < 0.00001), primarily driven by controls (ρ = 0.26, p = 0.00045). No sex differences were observed overall (p = 0.68; men: median 8.3 ng/mL, IQR 7.4; women: median 9.1 ng/mL, IQR 7.8). Within the cancer group, women had higher concentrations than men (13.5 ng/mL, IQR 18.9 vs. 8.9 ng/mL, IQR 7.0; p = 0.045). No differences were observed in controls or Later cancer.
All samples with measurable cfDNA concentrations were processed for library preparation, with total input amounts ranging from 2.6 to 146.5 ng and a median input of 16.9 ng. All those samples successfully generated high-quality libraries and were sequenced, generating a median of 12.6 M reads (IQR: 10.1–15.9 M, range: 5.1–30.0 M reads). The median coverage was 659× (IQR: 511–844×, range: 208–1534×), with an on-target rate of 73.6% (IQR: 70.1–76.9%, range: 60.0–83.0%). Cancer samples had significantly higher median coverage than controls (median 857× vs. 652×, p = 0.006); however, this difference was no longer significant after adjusting for input cfDNA (p = 0.75). The median GC content was 33% (range: 31–37%) and showed a moderate positive correlation with input cfDNA amount, even after adjusting for cancer status (β: 0.63 CI: 0.024–0.033, p < 0.001). Input amount was positively correlated with coverage (rho = 0.76, p < 0.001) and negatively correlated with duplication rate (rho = −0.74, p < 0.001).
2.3. Overall CpG Methylation
To assess the conversion rate, commercial controls containing Lambda and pUC19 DNA were analyzed. All sequenced control libraries met the expected methylation levels, with unmethylated Lambda showing 0.25–0.26% methylation and methylated pUC19 showing 95.82–96.02% methylation.
Compared to controls, cancer samples presented statistically significant higher overall CpG methylation across the entire panel, with a median of 1.82% (IQR 1.38–2.93%, range: 1.21–16.94%), versus 1.34% (IQR 1.22–1.52%, range: 0.93–6.63,
p = 0.00252) (see
Figure 2A). This difference remained significant after adjusting for sample age and input amount (OR: 2.37, 95% CI: 1.36–4.12,
p = 0.002).
When cancer cases were stratified by metastatic status at the time of diagnosis, patients with metastases had significantly higher overall CpG methylation (median 2.93%, IQR 1.86–4.05%, range: 1.22–16.94%,
n = 11) compared to those with localized disease (median 1.15%, IQR 1.36–1.67%, range: 1.22–2.93%,
n = 17,
p = 0.027). Cancers of haematolymphoid origin displayed methylation levels comparable to those of patients with metastatic disease (median 2.40%, IQR 2.01–6.66%, range: 1.65–13.25%,
n = 9,
p = 1.0) but had significantly higher overall methylation levels than patients with localized disease (
p = 0.006); see
Figure 2B.
When controls were stratified by diagnosis, there was no statistically significant differences in overall CpG methylation levels between the groups (
p = 0.49). Median methylation levels were low and comparable between all groups: Autoimmune—median 1.35%, IQR 1.26–1.55%, range: 1.11–2.31%,
n = 23; Infectious—median 1.26%, IQR 1.17–1.41%, range: 0.96–1.94%,
n = 21; Inflammatory—median 1.38%, IQR: 1.21–2.05%, range 1.08–2.05%,
n = 15; Other—median 1.37%, IQR 1.24–1.48%, range 0.98–3.85%,
n = 57; and No diagnosis—median 1.35%, IQR 1.20–1.54%, range: 0.93–2.47%,
n = 67) (see
Figure 2C).
The Later cancers presented with median overall CpG methylation of 1.35% (range 1.09–6.63%). One of the cases had nearly twice the overall CpG methylation level (6.63%) compared to the highest control sample (3.89%). This patient had primary biliary cholangitis at the time of liquid biopsy, a condition associated with a significantly increased risk of hepatocellular cancer compared to matched individuals from the general population [
17]. This patient was later diagnosed with metastatic lung cancer.
No significant correlation was found between BMI and methylation in the overall cohort (ρ = −0.02, p = 0.82). Age was positively correlated with methylation (ρ = 0.17, p < 0.01), explained by controls (ρ = 0.15, p = 0.047), with no significant findings in cancer or later cancer. No sex differences were seen overall (p = 0.40; men: median 1.4, IQR 0.4; women: median 1.4, IQR 0.4). In the cancer group, methylation was higher in women compared with men (2.5, IQR 2.2 vs. 1.6, IQR 0.8; p = 0.015). No sex-related differences were observed in controls or later cancer.
2.4. Detection of DMRs
For detection of DMRs, 37 cancer cases (diagnosed within 12 months) and sex- and age-matched controls (
n = 37) were selected. A total of 162 significant DMRs were identified between cancer patients and controls (
Supplemental Table S3). These DMRs had a median length of 431 bp (IQR 203–798 bp, range: 79–2431 bp) and contained a median of 24 CpGs (IQR 8–40 bp, range: 5–133 CpGs). The vast majority of DMRs were hypermethylated in cancer, with 155 regions (95.7%) showing hypermethylation, while only 7 regions (4.3%) were hypomethylated. The mean difference in methylation levels across the DMRs was 6% (IQR 5–7%, range: −4–10%) (
Figure 3A). Regarding gene locations, 58.6% of the DMRs (
n = 95) were located in promoter regions, while 31.5% (
n = 51) were located within gene bodies (including exons, introns, 5′ UTR and 3′ UTR,
Figure 3B). When categorized by the CpG context, the majority of DMRs were found within CpG islands (
n = 103, 63.6%,
Figure 3C). Due to the relatively small number of cases within each tumor type, we were unable to perform reliable DMR analyses stratified by cancer subtype; therefore, all DMR results are presented at the global level (cancer vs. control).
Enrichment analysis compared the significant hypermethylated DMRs against background regions across different CpG contexts, including CpG islands, shores, shelves, and the open sea, as well as within gene regions annotated as promotors, untranslated regions, exons, introns, and downstream or intergenic regions. The significantly hypermethylated DMRs showed enrichment in the CpG shores compared to background regions (OR = 2.17, 95% CI = 1.53–3.08, p < 0.0001). The limited number of hypomethylated DMRs precluded meaningful enrichment analysis.
The DMRs were dispersed across the genome (see
Figure 4A). Ontology analysis revealed that these regions were associated with biological processes related to transcription regulation, organ development, and morphogenesis. Cellular component localization was predominantly associated with the transcription regulator complex, centrosome and nuclear origin, while molecular functions were primarily associated with transcription and sequence-specific DNA binding (see
Figure 4B). A multidimensional scaling plot further revealed moderate separation between cancer and control groups (see
Figure 4C).
Clustering the samples based on the significant DMRs showed moderate separation between cancers and controls (
Supplemental Figure S1).
2.5. Machine Learning and Model Building
Using the sigFeature machine learning approach, 20 DMRs were identified (see
Supplemental Table S3 for feature data and
Supplemental Figure S2 for an exemplary plot) and subsequently used as input features for constructing a classification model to distinguish cancer patients from controls. The top 20 DMRs’ discriminatory capacity is illustrated in a hierarchical clustering heatmap (see
Figure 5). Visual inspection of the dendrogram clearly separates the samples into two biologically meaningful clusters, one large mixed cluster (cancer = 30, controls = 37) and one smaller cluster with only cancer samples (
n = 7). All localized cancers were clustered in the mixed large cluster, while the cancer cluster predominately contained metastatic cancers. This cluster is characterized by high methylation levels in the selected DMRs.
The final model achieved a sensitivity of 83.8% and a specificity of 83.8% (AUC: 0.88), indicating a moderate discriminative ability to correctly identify cancer cases from methylation patterns. To further assess robustness, a five-fold cross-validation was performed; across the folds the mean sensitivity and specificity were 57.1% and 77.5%, respectively, with a mean AUC of 0.73, suggesting that while the model performs well on the training split, its performance decreases under cross-validation, reflecting variability across subsets of the data (see
Figure 6A). Applied to the training dataset, the model accurately classified 31 of 37 cancers as cancers, where all metastatic and haematolymphoid cancer cases were correctly predicted (see
Figure 6B). Investigating the different control groups further, all groups had a specificity higher than 80% except for the autoimmune disease group (57.1%) (see
Table 2).
The model was then applied to the remaining controls in the cohort receiving an overall specificity of 79.2%. The model had a relatively even performance among the different groups of diagnoses, with autoimmune diseases having the highest specificity (87.5%) and the no diagnosis group having the lowest (75.0%) (see
Table 3).
Among the Later cancers, eight had measurable cfDNA concentration and were used to test the classifiers’ ability to detect cancer in patients who received their cancer diagnosis after 12 months from start of clinical investigation (see
Table 4). Interestingly, the model correctly classified the two metastatic cases with the longest time until cancer diagnosis (see
Figure 6C).
3. Discussion
In this study, we employed an NGS-based targeted methylation panel for pan-cancer analysis to explore the feasibility of noninvasive cancer detection using plasma cfDNA. We successfully demonstrated that this approach is applicable in a low-input setting, showing robust performance even when only limited cfDNA amounts were available. The analysis revealed distinct methylation patterns that distinguished cancer patients from individuals with severe, nonspecific symptoms, both at a global methylation level and across specific DMRs throughout the genome. A classification model based on the 20 most informative DMRs achieved a sensitivity and specificity of 83.8%, and when applied to the remaining controls, the classifier correctly predicted 79.2% of cases.
In the past, methylation analysis traditionally relied on either affinity-based enrichment methods (~100 bp resolution) or sequencing with single-base resolution [
18,
19]. Additionally, sequencing used to depend on bisulfite conversion of unmethylated cytosines to uracil, which substantially damages DNA and necessitates large input amounts [
20]. This made them poorly suited for cfDNA, particularly in patients with compromised health and limited sample availability. The enzymatic conversion method used in our study, in contrast, is highly compatible with low-input cfDNA and enabled successful library preparation from as little as 2.6 ng of cfDNA. While similar enzymatic approaches have been evaluated previously [
21,
22], they have not been specifically tested in cfDNA. Our results revealed a strong correlation between input amount, coverage, and duplication rate, yet high-quality libraries were consistently obtained. This indicates that strict input thresholds are unnecessary—a crucial advantage since patients with nonmalignant diseases often have lower cfDNA concentrations [
23]. Excessive input requirements could otherwise exclude a substantial number of patients and thereby limit clinical applicability.
All study participants were recruited at their first visit within the cancer pathway for severe, nonspecific symptoms, when their cancer status was unknown. Consequently, our cohort was imbalanced, comprising 37 cancer cases, 183 non-healthy controls, and nine individuals who later developed cancer. This imbalance mirrors the clinical reality of the pathway, where only a minority of patients ultimately receive a cancer diagnosis. The cancer detection rate in our cohort (~17%) aligns with previously reported data for this diagnostic pathway [
24,
25].
Despite the heterogeneity in our cohort, we observed differences in overall methylation levels between cancer patients and controls, with cancer patients exhibiting higher methylation levels. Further analysis revealed that patients with metastatic disease displayed statistically significant higher methylation levels than those with localized tumors. This is likely due to a higher proportion of ctDNA relative to cfDNA in the bloodstream, consistent with established findings that ctDNA concentrations increase with advancing disease stage [
7,
26,
27]. Similarly, haematolymphoid cancers—which have direct access to the circulation—showed elevated methylation levels comparable to metastatic cancers [
28]. Since a large fraction of cfDNA (>70%) originates from leukocytes [
29], such cancers may be more readily detectable in plasma. These findings raise an important question: do methylation-based biomarkers primarily reflect tumor-specific processes, or are they also influenced by systemic effects associated with advanced disease? This remains a critical challenge for cfDNA-based biomarker interpretation.
The classifier demonstrated moderate performance on the training data (mean sensitivity 83.8%, specificity 83.8%, AUC 0.88) but decreased accuracy during cross-validation (mean sensitivity 57.1%, specificity 77.5%, AUC 0.73), indicating a potential risk of overfitting. This discrepancy likely stems from the limited cohort size and sample heterogeneity. However, the specificity remained relatively stable (79.2%) when applied to the remaining controls (n = 144), suggesting some generalizability with respect to specificity. Importantly, the classifier showed similar performance across all control subgroups, indicating consistent behavior regardless of underlying nonmalignant conditions.
The results of our methylation-based classifier are consistent with previous studies demonstrating the potential of methylation biomarkers for cancer detection across multiple tumor types [
30,
31,
32,
33]. In pan-cancer settings, the GRAIL test achieved an overall sensitivity of 76.4% and specificity of 99.3% for 12 common cancers, though sensitivity decreased to 54.9% when expanded to 50 tumor types [
16]. Other large studies, such as the Thunder trial (NCT04820868), reported a sensitivity of 69% and specificity of 99% [
34], while the PanSeer test demonstrated 88% sensitivity and 96% specificity when comparing cancer cases to healthy control and was able to detect cancers up to four years before clinical diagnosis [
15]. These results collectively illustrate the promise of methylation-based assays for early cancer detection and support ongoing clinical trials (NCT03934866, NCT04241796, NCT03085888).
One of the most prominent DMRs in the final model was located near the
WNT7B gene, which was hypermethylated in cancer patients (6% difference compared to controls).
WNT7B encodes a protein involved in Wnt signaling and has been implicated in several cancer types, including hepatocellular carcinoma [
35], colorectal cancer [
36], and prostate cancer [
37], as well as in metastasis of breast and pancreatic adenocarcinomas [
38]. Aberrant methylation of this locus has also been linked to gastric cancer [
39] and was recently evaluated as a pan-cancer biomarker [
40]. These findings underscore the biological relevance of the DMRs identified and support their potential significance in both general and cancer-type-specific contexts.
By including patients with severe, nonspecific symptoms as controls, our study extends previous work by evaluating a clinically relevant and heterogeneous cohort. While this heterogeneity may have reduced specificity compared to studies using healthy controls, it enhances the translational value of our findings. It demonstrates that methylation-based biomarkers retain discriminatory power even amid substantial biological variability and background noise from other severe, nonmalignant conditions. The inclusion of clinically relevant controls is therefore a major strength of this study.
The inclusion criteria for cancer patients, diagnosed within 12 months of sampling, were designed to associate symptoms directly with cancer presence. Since methylation changes can precede diagnosis by several years, cancer patients were matched with controls who remained cancer-free throughout follow-up to minimize bias. Nonetheless, some controls may have developed cancer later, potentially reducing observed group differences. Although this matching strategy mitigated computational bias in the machine learning model, it also reduced the sample size and, consequently, the statistical power.
Although our model was not specifically trained to predict later cancer occurrence, we wanted to explore its performance in this context. The classifier was applied to eight individuals who were diagnosed with cancer more than 12 months after their initial investigation, where no malignancy was initially detected. Interestingly, the two metastatic cases with the longest time to diagnosis were correctly classified, while the remaining six were not detected. This variability likely reflects differences in tumor growth dynamics and the possibility that cancer was not yet present at the time of sampling.
One limitation of this study is the imbalance between the control group and the cancer group. This discrepancy reflects the restricted availability of cancer samples, particularly across tumor subtypes, during the study period. Furthermore, no formal sample size calculation was conducted before study initiation, and the limited number of cases within individual tumor types precluded subtype-specific DMR analyses. Therefore, the identified DMRs reflect general differences between all cancer patients and controls. Future studies with larger, balanced cohorts and predefined sample size calculations are needed to validate these findings, investigate lineage-specific methylation patterns, and strengthen the generalizability of results.
In conclusion, this study demonstrates the clinical feasibility of enzymatic conversion-based library preparation for cfDNA methylation analysis in a low-input setting. We identified significant differences in methylation patterns and DMRs between cancer patients and controls, even within a heterogeneous and clinically relevant cohort. These findings support the potential of methylation as a robust and informative biomarker for cancer detection, maintaining strong discriminatory performance despite substantial biological and clinical variability.
4. Materials and Methods
4.1. Study Participants
Study participants were recruited between 2018 and 2022 at Örebro University Hospital through the fast-track diagnostic pathway for severe, nonspecific symptoms of cancer employed in Sweden. To enter this diagnostic pathway, patients had to present with one or more of the following symptoms: general malaise, extreme fatigue, reduced appetite, unintentional weight loss >5 kg, prolonged fever, unexplained pain, abnormal laboratory findings (e.g., anemia, elevated alkaline phosphatase, erythrocyte sedimentation rate, or calcium levels), increased health care utilization, increased medication use or a clinician’s intuitive sense (‘gut feeling’) that the patient was seriously ill. All patients underwent a standardized diagnostic work-up, including an expanded panel of biochemical analyses, physical examination, and imaging, such as computed tomography (CT), magnetic resonance imaging (MRI) or 18F fluorodeoxyglucose positron emission tomography/CT (18F-FDG PET/CT). Diagnostic tissue biopsies were performed when clinically indicated.
All patients entering the cancer pathway were asked to participate. At the time of inclusion, cancer status was unknown. The inclusion criteria for the study were: (1) age 18 years or older (n = 236), and (2) availability of EDTA plasma samples (n = 232). The exclusion criteria were: (1) a known, ongoing cancer diagnosis at the time of inclusion (n = 2), and (2) loss to follow-up, defined as death with a remaining suspicion of cancer without a confirmed diagnosis (n = 1). This yielded a final study cohort of 229 participants.
Based on clinical data from medical records, participants were classified into three main groups: (1) those diagnosed with cancer within twelve months of inclusion; (2) those without a cancer diagnosis, who were instead diagnosed with immunologic, infectious, or other nonmalignant conditions, hereafter referred to as controls; and (3) those not diagnosed with cancer within 12 months but later received a cancer diagnosis. These are referred to as Later Cancers. The larger number of controls reflects their greater availability during the study period, whereas cancer patient samples, particularly across the different tumor types represented, were more limited. No formal a priori sample size calculation was performed; instead, we included all eligible samples available during the study period to maximize the robustness of the analysis.
4.2. Sample Collection and cfDNA Extraction
Blood samples were collected in EDTA tubes, and plasma was isolated through a double centrifugation protocol: initially at 2000× g for 7 min, followed by 16,000× g for 10 min. The resulting cell-free plasma was stored at −80 °C until further processing. All samples were frozen within three hours of collection.
cfDNA was isolated from 1.5 to 4 mL of cell-free plasma using the QIAamp Circulating Nucleic Acid Kit (Qiagen, Hilden, Germany). The concentration of cfDNA was quantified using Qubit dsDNA HS Assay Kit (Invitrogen, Waltham, MA, USA) on a Qubit 2.0.
4.3. Library Preparation and Sequencing
All samples with measurable cfDNA concentrations after isolation were included in the analysis. Sample processing order was randomized based on cancer status and type, date of cfDNA isolation, and date of sampling.
The cfDNA samples underwent library preparation using the NEBNext® EM-seqTM Kit (New England Biolabs, Ipswich, MA, USA) and the NEBNext Enzymatic Methyl-seq Library Preparation Protocol (Twist Bioscience, San Francisco, CA, USA), with minor modifications.
Mechanical fragmentation was performed only on the control material, which consisted of a pool of hypomethylated lambda DNA (expected methylation <0.5%) and hypermethylated pUC19 DNA (expected methylation 95–98%). Fragmentation was performed using a Bioruptor® Pico sonication device (Diagenode, Seraing, Belgium), with 13 sonication cycles. Each cycle consisted of 30 s on and 30 s off, resulting in final fragment lengths of 240–290 bp.
The library concentration was measured using the same method as for the cfDNA quantification, and fragment length was assessed with a TapeStation 4200 (Agilent Technologies, Santa Clara, CA, USA) using D1000 ScreenTape and reagents.
The Twist Alliance Pancancer Methylation Panel (Twist Bioscience) was used for target enrichment following the Twist Targeted Methylation Sequencing Protocol, with minor modifications. The panel is 1.5 MB in size, covers 126 k CpGs (12 k DMRs), and is designed to cover methylation profiles of 31 cancer types and 47 disease entities based on data from the TCGA database (
https://www.cancer.gov/tcga, assessed on 15 October 2025). The hybridization was extended to 16 h, followed by 13 cycles of PCR amplification of the hybridized pools. The concentration and fragment length were assessed using the same methods as described for the libraries.
Sequencing was performed on a NextSeq2000 system (Illumina, San Diego, CA, USA) with 2 × 151 cycles. The sequencing pool consisted of 95% target-enriched libraries, 1% control material library, and 4% PhiX Control v3 Library.
4.4. Bioinformatics and Statistics
FASTQ files were generated using BCLConvert (v3.10.12, DRAGEN on-instrument solution, Illumina) and processed through the nf-core/methylseq pipeline (version 2.4.0 and Nextflow version 22.10.6) [
41] employing the Bismark Bisulfite Read Mapper and Methylation Caller (v0.24.0) [
42]. For methylation conversion controls, the mean CpG methylation was calculated after filtering on 30× depth including their respective full genome, as per instructions from the manufacturer. For samples, a coverage of 100× was used as threshold for calculating the mean CpG methylation level across target regions included in the pan-cancer panel bed file. This threshold led to a mean number of 87 751 covered CpGs in the cohort (SD 17 198 CpGs).
4.5. Machine Learning
To reduce statistical errors due to the class imbalance between the cancer group and controls, a subset of samples was analyzed for DMRs. For each cancer patient (n = 37), an age- (±2 years) and sex-matched control (n = 37) was selected, ensuring that controls remained cancer-free throughout the follow-up period (24–60 months). When multiple perfect matches were available, the control sample closest in sampling date to the corresponding cancer sample was selected. Control samples were sampled blindly for their diagnostic outcome.
Significant DMRs between cancer cases and controls were identified using the wrapped pipeline of DMRichR (v1.7.8) [
43], which utilizes dmrseq (v1.24.0) [
44] and bbseq (v1.40.0) [
45]. For inclusion in the DMR analysis, CpG sites were required to have a minimum coverage of 100× in at least 75% of samples. DMRs were defined as regions containing at least five CpG sites, with all other parameters set to default. Candidate regions were identified based on an adjusted
p-value < 0.05 for individual CpGs, determined from 74 permutations (one per sample included in the run) and adjusted for sex as a covariate.
Within the DMRichR package, Bsseq was used to generate individual smoothed methylation values, and DMR heatmaps were generated using pheatmap (v1.0.12). DMR annotations and enrichment analyses were obtained and visualized using DMRichCpG and DMRichGenic to determine the CpG locations (islands, shores, shelves, and the open sea) and gene regions (promoters, untranslated regions, exons, introns, downstream, or intergenic regions) for all DMRs, as well as separately for hypo- and hypermethylated DMRs. Fisher’s exact test was used to assess enrichment in DMR locations (CpG and genic context) compared to the background. Additionally, rGREAT (v2.6.0) [
46,
47,
48] was employed for gene ontology analysis based on the genomic coordinate of all DMRs.
Using the average CpG methylation levels in the DMRs identified by DMRichR, a support vector machine (SVM) classifier with a radial basis function kernel was developed, incorporating recursive feature elimination (RFE) for feature selection to distinguish cancer from control samples. Model performance was estimated using five-fold cross-validation. In each fold, the top 20 features were selected via SVM-RFE, and an SVM model was trained and evaluated on the held-out test set.
A final SVM model was trained on the full dataset (n = 74) using the top 20 features identified globally. Model performance was evaluated on the same dataset using receiver operating characteristics (ROC) analysis and confusion matrix metrics. Performance was assessed based on area under the curve (AUC), sensitivity, and specificity, both within the internal cross-validation folds and for the final model. A heatmap was generated to visualize methylation patterns of the selected features across all samples.
The final classifier was applied to an external validation set consisting of 144 controls previously unseen by the model. Furthermore, the classifiers’ ability to predict cancer in patients diagnosed after 12 months (Later cancer, n = 8) was assessed.
All machine learning analyses were performed in R version 4.4.1 or 4.4.2 using the following packages: sigFeature (v1.22.0) [
49] for feature ranking via SVM-RFE, caret (v7.0-1) [
50] for model training, cross-validation, and prediction, and pROC (v1.18.5) [
51] for ROC analysis and AUC computation.
Additional packages for data handling and visualization included: readr v2.1.5 [
52], dplyr v1.1.4 [
53], readxl v1.4.3 [
54], ggplot2 v3.5.1 [
55], and pheatmap v1.0.12 [
56].
4.6. Statistics
Descriptive statistics for the cohort were performed using IBM SPSS Statistics Viewer (v 29.0) [
57]. Shapiro-Wilk tests were conducted to assess the normality of the data. The Mann-Whitney U test and the Kruskal-Wallis test with Bonferroni correction for multiple comparisons were applied to continuous variables. Spearman’s rank correlation and linear regression analyses were used to evaluate correlation and trends.
Figures were generated using R (v4.4.1), or R (v4.4.2, in Rstudio v2024.12.1), and BioRender.com (assessed on 15 October 2025).