Article

SCGclust: Single-Cell Graph Clustering Using Graph Autoencoders That Integrate SNVs and CNAs

1 Department of Computer Science, Florida State University, 222 S. Copeland St., Tallahassee, FL 32306, USA
2 Department of Computer Science, Vanderbilt University, 2201 West End Ave, Nashville, TN 37235, USA
3 Department of Computer Science, Florida Agricultural and Mechanical University, 1601 S Martin Luther King Jr Blvd, Tallahassee, FL 32307, USA
4 Department of Computer Science, University of Houston, 4302 University Dr, Houston, TX 77004, USA
5 Department of Computer Science, University of West Florida, 11000 University Parkway, Pensacola, FL 32514, USA
6 Department of Biomedical Engineering, Vanderbilt University, 2201 West End Ave, Nashville, TN 37235, USA
* Authors to whom correspondence should be addressed.
† These authors contributed equally to this work.
Mathematics 2026, 14(1), 46; https://doi.org/10.3390/math14010046
Submission received: 10 July 2025 / Revised: 17 November 2025 / Accepted: 10 December 2025 / Published: 23 December 2025
(This article belongs to the Special Issue Emerging Trends in Computational Biology and Bioinformatics)

Abstract

Intra-tumor heterogeneity (ITH) is a confounding factor for cancer prognosis and treatment. Single-cell DNA sequencing (scDNA-seq) resolves genomic variations at the level of individual cells and has been widely used to study cancer progression and the responses to drugs and treatments. Low-coverage scDNA-seq technologies typically provide a large number of cells, and accurate cell clustering is essential for effectively characterizing the ITH. Existing cell clustering methods are typically based on either single-nucleotide variations (SNVs) or copy number alterations (CNAs), without leveraging both signals together. Since both SNVs and CNAs are indicative of cell subclonality, in this paper, we designed a robust cell-clustering tool that integrates both signals using a graph autoencoder. Our model co-trains the graph autoencoder and a graph convolutional network (GCN) to encourage meaningful clustering results and to prevent all cells from collapsing into a single cluster. Given the low-dimensional embedding generated by the autoencoder, we adopted a Gaussian mixture model (GMM) to further cluster the cells. We evaluated our method on eight simulated datasets and a real cancer sample. Our results demonstrate that our method consistently achieved higher V-measure scores than SBMClone, an SNV-based method, and a K-means method that relies solely on CNA signals. These findings highlight the advantage of integrating both SNV and CNA signals within a graph autoencoder framework for accurate cell clustering.

1. Introduction

Cancer cells evolve by acquiring new mutations. Two important types of mutations are single-nucleotide variations (SNVs) and copy number alterations (CNAs). A cancer sample often contains multiple subclones, each distinguished by a unique set of SNVs and CNAs. This diversity within a tumor is referred to as intra-tumor heterogeneity (ITH).
ITH has confounded cancer treatments, prognoses, and metastasis prevention [1,2,3,4]. Specifically, ITH has been known to lead to heterogeneous cellular phenotypes that exhibit a differential response to therapies, including the emergence of drug-resistant cancer cells [5,6].
Over the last decade, the advent of single-cell DNA sequencing (scDNA-seq) has revolutionized the study of ITH. This technology provides an unparalleled ability to characterize ITH at the level of individual cells, enabling a deeper understanding of tumor complexity.
Given that each cell belongs to a subclone, with each subclone characterized by a unique set of SNVs and CNAs, it is natural to cluster cells using SNV and/or CNA profiles, where each cluster corresponds to a subclone. However, this task is complicated by the technical limitations of scDNA-seq. The current cost-effective scDNA-seq technologies, such as degenerate oligonucleotide-primed PCR (DOP-PCR) [7,8,9] and direct library preparation (DLP and DLP+) [10,11], are designed to sequence hundreds or thousands of cells, each at a low cost. However, these methods typically produce a very shallow coverage, often around 0.02×. This shallow coverage makes it challenging to cluster the cells, since approximately 98% of the genomic region remains uncovered by a single read.
On the other hand, although multiple displacement amplification (MDA) [12,13] can produce high-coverage scDNA-seq data, it is much more costly per cell than DOP-PCR or DLP/DLP+. Since the goal of cell clustering is to comprehensively characterize ITH, sequencing a larger number of cells increases the likelihood of capturing all subclones, thereby providing a more complete understanding of ITH. Therefore, in this paper, we focused on cell clustering for the scDNA-seq technologies capable of generating hundreds to thousands of cells, even at an extremely low coverage.
Despite the fact that most SNVs lack informative signals in such data due to insufficient coverage, SNVs have still been used for cell clustering in shallow scDNA-seq. Specifically, SBMClone [35] is an SNV-based method that employs a stochastic block model (SBM) for accurate cell clustering. Nevertheless, SBMClone does not incorporate CNA signals into the cell clustering. To date, no method considers both SNV and CNA signals together for cell clustering of shallow scDNA-seq data. We reasoned that, since both SNVs and CNAs arise during cancer cell evolution, combining the two signals for cell clustering would produce better results.
A rich literature integrates multiple omics layers to cluster samples (patients) using bulk data. The iCluster family jointly models heterogeneous data types via shared latent variables to uncover unsupervised subtypes [14]. Similarity network fusion (SNF) fuses sample–sample similarities across omics before clustering [15], while factor models such as MOFA/MOFA+ learn shared low-dimensional factors [16,17]. More recently, iCluF iteratively refines and fuses omic-specific neighborhoods to improve subtype discovery [18]. These methods assume dense per-sample measurements and target patient-level integration; they are not designed for sparse cell-level clustering from shallow scDNA-seq. In bulk sequencing, many approaches jointly consider SNVs and CNAs to infer subclonal structure and evolution, including PhyloWGS [19], PyClone-VI [20], and SciClone [21]. While powerful, bulk-based methods operate on sample-level aggregation and do not yield per-cell clusters. In single-cell sequencing, a few tools jointly use SNVs and CNAs, but they focus on phylogeny reconstruction under targeted or higher-depth designs, such as SCARLET, BiTSC2, and SCsnvcna, rather than clustering under genome-wide, ultra-shallow coverage [22,23,24]. Moreover, these methods either require multi-modality data [24] or assume that SNV detection has a reasonably low missing rate [22]. SCGclust, the method proposed in this paper, is single-modality: it jointly models the SNV and CNA signals measured from the same ultra-shallow scDNA-seq assay per cell, removing the need for additional omics to achieve accurate cell-level clustering in ultra-shallow settings.
Recently, graph autoencoders have been widely used in transcriptomic sequencing analyses [25,26]. For example, ADEPT [27] leveraged a graph autoencoder to learn a low-dimensional latent embedding of each spot in spatial transcriptomics data, which was then used to cluster the spatial spots. ADEPT modeled each spot as a node in the graph, with its gene expression data serving as the node feature. The spatial context of the spots was represented by the presence or absence of edges in the graph. By utilizing both the node features and edge weights, ADEPT effectively integrated gene expression signals with the spatial context. More broadly, deep-learning clustering methods for single-cell data have shown that end-to-end representation learning improves clustering in noisy, sparse settings, including variational and denoising autoencoders [28,29], graph neural networks on cell–cell graphs [30], and contrastive learning frameworks [31]. DNA-focused deep models are fewer (e.g., Dhaka for CNAs, bmVAE for SNVs, and CoT for clonal CNAs) [32,33,34], leaving open the question of whether jointly modeling scDNA-seq features and the graph structure can improve single-cell clustering. We hypothesized that learning graph-aware, low-dimensional embeddings directly from scDNA-seq data, integrating CNA and SNV signals through a cell-to-cell graph structure, can yield more accurate and robust clonal clustering than feature-only methods, such as K-means, or topology-only baselines, such as SBM-style block models. Inspired by ADEPT's design, we developed SCGclust and compared it against SBMClone and a K-means-based method that uses only CNA signals across eight simulated datasets and a real breast cancer dataset. We found that SCGclust consistently outperformed SBMClone and the CNA-based method, producing more accurate and robust clustering results. Additionally, SCGclust demonstrated greater resilience in extreme cases and was less sensitive to missing or erroneous signals.
SCGclust fills the gap of using both the SNV and CNA signals for cell clustering of shallow scDNA-seq data. Although SCGclust was inspired by ADEPT, it is fundamentally different in design. It jointly trains a graph attention network (GAT)-based autoencoder and a graph convolutional network (GCN) to minimize an objective composed of three error terms (see Equation (11)). Rather than using the GAT's low-dimensional embedding directly for clustering, SCGclust integrates clustering awareness into the GAT's dimension reduction, guiding the embeddings to be more cluster-informative.

2. Methods

The proposed SCGclust framework integrates graph-based learning with clustering methods to analyze genetic variation data at a single-cell resolution. It employs a graph neural network (GNN) architecture to model relationships between cells based on genomic features and utilizes advanced clustering techniques to partition cells into distinct subgroups.
Figure 1 shows an overview of the SCGclust framework. SCGclust takes as input a cell-by-SNV matrix and a read count matrix containing the CNA signals (Figure 1A). Using these inputs, it builds a graph where each node represents a cell, the node feature corresponds to the CNA signals, and the edge weight between two nodes reflects the similarity between the SNV profiles of the two cells (Figure 1B). Given this graph, SCGclust constructs a graph autoencoder consisting of two encoder layers (layers 1 and 2) followed by two decoder layers (layers 3 and 4). This architecture (Figure 1C) reduces the dimensionality of the node features while accounting for the node features of neighboring nodes in the graph, weighted by the edge weights. SCGclust co-trains the parameters of the graph autoencoder and a two-layer graph convolutional network (GCN), shown in Figure 1D, the latter of which takes the low-dimensional embedding from the graph autoencoder as the node feature. The objective function is composed of three terms: the reconstruction mean squared error (MSE) term from the graph autoencoder, and the modularity and collapse regularization terms from the GCN. Finally, SCGclust clusters the cells based on the latent features extracted from the low-dimensional representation of the last encoder layer, applying an existing clustering method, the Gaussian mixture model (GMM) (Figure 1E). In the following, we describe in more detail the graph construction, the graph autoencoder architecture, the co-training of the graph autoencoder and GCN, and the clustering step.

2.1. Graph Construction

To represent the genetic variation data as a graph, we constructed an adjacency matrix and defined node features, enabling GNN-based learning.

2.1.1. Edge Weight Adjacency Matrix

The edge weight adjacency matrix was computed from the Euclidean distances between the SNV profiles, which served as the edge weights in the graph. This matrix quantifies pairwise similarities between all cells, with each entry ranging from 0 to 1, where 0 indicates no similarity and 1 indicates identical profiles. Because the majority of entries in the SNV profiles are missing (denoted "3"), we replaced all missing entries with "0"s before calculating the Euclidean distance between every pair of SNV profiles, following the same practice as SBMClone [35]. In this way, missing entries were treated the same as the absence of the SNV. Notice that, although most of the SNVs were missing, the edge weight was nonzero among cells of the same cluster, but zero or much smaller among cells of different clusters, because such cells do not share the same set of SNVs. In addition to the Euclidean distance, we also tried other distance metrics for the edge weight adjacency matrix, including the dot product, cosine distance, and Pearson correlation. A systematic comparison among these four metrics indicated that the Euclidean distance consistently rendered the highest accuracy (Supplemental Figure S1). Therefore, we used the Euclidean distance by default for the rest of the experiments.
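To make the construction concrete, the following minimal Python sketch (ours, not the released SCGclust code) builds such an edge weight matrix from a cell-by-SNV matrix. The distance-to-similarity mapping 1 - d/d_max is an assumption on our part, since only the Euclidean distance and the [0, 1] range are stated above.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def snv_edge_weights(snv_matrix, missing_value=3):
    """Build the cell-by-cell edge weight matrix from SNV profiles.

    snv_matrix: (n_cells, n_snvs) array with entries 1 (SNV present),
    0 (SNV absent), and missing_value (site not covered by any read).
    """
    # Treat missing entries the same as absence of the SNV.
    profiles = np.where(snv_matrix == missing_value, 0, snv_matrix).astype(float)
    dist = squareform(pdist(profiles, metric="euclidean"))
    # Map distances to similarities in [0, 1]: identical profiles -> 1.
    # NOTE: 1 - d/d_max is one plausible normalization; the exact mapping
    # used by SCGclust is not specified in the text above.
    d_max = dist.max()
    sim = 1.0 - dist / d_max if d_max > 0 else np.ones_like(dist)
    np.fill_diagonal(sim, 0.0)  # no self-loops
    return sim
```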

2.1.2. Node Features

Node features are derived from the raw read counts of each cell, reflecting the CNA signals. However, instead of directly using the raw read counts as the node features, we calculated the normalized cosine similarity between the raw read count vector of a cell and those of all other cells, resulting in a vector of normalized cosine similarities for each cell. We used the cosine similarity as the measure of the difference between a pair of cells' CNA signals because it measures the angle between two vectors and is thus invariant to sequencing depth variations while preserving relative offsets. This is advantageous for scDNA-seq data because each cell may be sequenced at a different depth, and what really matters is the angle between the two read count vectors. The resulting vector is used as the node feature. This approach avoids directly relying on raw read counts, which are prone to fluctuations caused by sequencing artifacts. Additionally, it eliminates dependence on the absolute values of the raw read counts, which vary with the sliding window size and genome coverage. In contrast, the cosine similarity is invariant to absolute values and more robust to fluctuations in read counts unrelated to CNAs. While it is possible to use inferred CNAs instead of raw read counts, we opted not to do so in this study to avoid errors introduced during CNA inference. Nevertheless, using inferred CNAs as the node feature has its own advantages, and these trade-offs are discussed further in the conclusion.
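A minimal sketch of this featurization, treating "normalized" as scaling each read count vector to unit length before taking dot products (the exact normalization used by SCGclust may differ):

```python
import numpy as np

def node_features_from_read_counts(read_counts):
    """Node features: each cell's vector of cosine similarities to all cells.

    read_counts: (n_cells, n_bins) raw read counts per genomic bin.
    Returns an (n_cells, n_cells) matrix; row i is the feature of cell i.
    """
    X = read_counts.astype(float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0   # guard against cells with zero reads
    unit = X / norms          # scale out sequencing depth
    return unit @ unit.T      # cosine similarity matrix, invariant to depth
```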

2.2. Graph Attention Autoencoder (GAT) Architecture

The SCGclust framework employs a graph attention autoencoder to learn the latent representations of node features. The architecture consists of two main components, an encoder and a decoder, both built from graph attention convolution layers.

2.2.1. Encoder and Decoder

Both the encoder and decoder comprise two graph attention convolution layers. The encoder transforms the input features (denoted X) and the adjacency matrix (denoted A) into latent representations (denoted $h^{(2)}$), while the decoder reconstructs the original input from these latent features. This process is shown in Equation (1).
$\hat{X} = G(A, F(A, X)),$  (1)
where F and G denote the encoder and decoder functions, respectively.
To enhance the consistency between the encoder and decoder, weight sharing is employed. Let $W^{(l)}$ denote the weights of the l-th layer; the weights of the corresponding encoder and decoder layers are tied via transposition, as shown in Equation (2).
$W^{(3)} = W^{(2)\top}, \qquad W^{(4)} = W^{(1)\top}$  (2)
We trained the encoder and decoder layers and learned the weights such that an objective function is minimized. One of the terms in our objective function is the reconstruction MSE between the input feature X and the feature $\hat{X}$ recovered by the decoder, as shown in Equation (3). This term, combined with the modularity and collapse regularization terms shown in Equation (11), constrains the weights in the graph autoencoder.
$\mathcal{L}_{\mathrm{MSE}} = \| X - \hat{X} \|^2 = \sum_{i=1}^{n} (X_i - \hat{X}_i)^2,$  (3)
in which n is the number of nodes in the graph.
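As a toy illustration of Equations (1)-(3), the following NumPy sketch runs the four-layer tied-weight forward pass and computes the reconstruction MSE. It deliberately omits the attention mechanism and the adjacency matrix (introduced in Section 2.2.2); the ReLU activations between layers are carried over from the configuration in Section 3.1.4, and all dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, h1, h2 = 6, 10, 8, 4            # cells, input dim, hidden dims (arbitrary)

X = rng.normal(size=(n, d))           # node features
W1 = 0.1 * rng.normal(size=(d, h1))   # encoder layer 1 weights
W2 = 0.1 * rng.normal(size=(h1, h2))  # encoder layer 2 weights
W3, W4 = W2.T, W1.T                   # Eq. (2): decoder weights tied by transpose

relu = lambda z: np.maximum(z, 0.0)
H2 = relu(relu(X @ W1) @ W2)          # encoder F: latent embedding h(2)
X_hat = relu(H2 @ W3) @ W4            # decoder G: reconstruction, Eq. (1)
mse = np.sum((X - X_hat) ** 2)        # reconstruction loss, Eq. (3)
```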

2.2.2. Attention Mechanism

In this section, we describe how we obtained the embedding $h^{(2)}$ while incorporating the edge weight adjacency matrix. Our GAT operates by allowing each node to attend to its neighbors through learned attention coefficients. In more detail, each node i's feature vector $x_i$ is first projected into a new representation $z_i$ via $W^{(1)}$:
$z_i = W^{(1)} x_i.$  (4)
For every pair of nodes i and j, we computed a raw attention score $\tilde{e}_{ij}$ by applying a learnable vector $a$ to the concatenated transformed features $[z_i \,\|\, z_j]$:
$\tilde{e}_{ij} = a^{\top} [z_i \,\|\, z_j].$  (5)
To respect the graph structure, which encodes the edge weights, we multiplied each raw score by the corresponding adjacency entry $A_{ij}$ to create a weighted attention score:
$\tilde{e}_{ij}^{A} = \tilde{e}_{ij} A_{ij}.$  (6)
We then applied a LeakyReLU activation to obtain the final attention logits:
$e_{ij} = \mathrm{LeakyReLU}\left(\tilde{e}_{ij}^{A}\right).$  (7)
Next, these logits were normalized across nodes via a softmax, ensuring that the attention coefficients sum to 1 for each node i:
$\alpha_{ij} = \dfrac{\exp(e_{ij})}{\sum_{k} \exp(e_{ik})}.$  (8)
Finally, each node aggregated its neighbors’ features, weighted by these learned attention coefficients:
$h_i^{(1)} = \sum_{j} \alpha_{ij} z_j.$  (9)
Here, $h_i^{(1)}$ represents the hidden feature of node i from layer 1, which incorporates node i's original feature $x_i$ and the edge weight adjacency matrix A. This process generates a context-aware feature update by calculating the attention coefficient $\alpha_{ij}$ for each node j, and it allows the graph autoencoder layers to learn which neighbors are most important for each node's representation.
The second layer's output, $h^{(2)} = W^{(2)} h^{(1)}$, is computed without applying the attention weights. $h^{(2)}$ is the latent embedding that is fed to the GCN in the next step.
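Because the cell graph is fully connected, the attention computation in Equations (4)-(9) can be vectorized over all node pairs. The following NumPy sketch is our illustration of these equations, not the released SCGclust code; it exploits the fact that $a^{\top}[z_i \,\|\, z_j]$ splits into two per-node terms.

```python
import numpy as np

def gat_attention_layer(X, A, W1, a, slope=0.2):
    """One attention pass following Equations (4)-(9).

    X: (n, d) node features; A: (n, n) edge weight adjacency matrix;
    W1: (d, h) projection weights; a: (2h,) learnable attention vector.
    """
    Z = X @ W1                                  # Eq. (4): z_i = W(1) x_i
    h = Z.shape[1]
    # Eq. (5): a^T [z_i || z_j] decomposes into two per-node terms.
    e_raw = (Z @ a[:h])[:, None] + (Z @ a[h:])[None, :]
    e_weighted = e_raw * A                      # Eq. (6): scale by A_ij
    e = np.where(e_weighted > 0, e_weighted, slope * e_weighted)  # Eq. (7)
    e = e - e.max(axis=1, keepdims=True)        # numerical stabilization
    alpha = np.exp(e) / np.exp(e).sum(axis=1, keepdims=True)      # Eq. (8)
    return alpha @ Z                            # Eq. (9): h_i(1) per row
```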

2.3. Co-Training GAT and GCN

Ref. [36] introduced a GCN designed specifically for clustering, whose objective function contains a modularity term, which encourages meaningful clustering, and a collapse regularization term, which discourages all cells from collapsing into one cluster. We adopted this model and combined it with our GAT, such that the objective function includes not only the reconstruction MSE, but also a spectral modularity term and a regularization term. In more detail, we first normalized the adjacency matrix A as $\tilde{A} = D^{-1/2} A D^{-1/2}$, and we used $\tilde{A}$ as the edge weights in a two-layer GCN operating on the embedding $h^{(2)}$ from the GAT (Figure 1D). We then applied a softmax to the output of the GCN to obtain the soft cluster assignment C:
$C = \mathrm{softmax}\left(\mathrm{GCN}(\tilde{A}, h^{(2)})\right)$  (10)
Given the soft assignment C from the GCN, we then aimed to minimize the following objective function:
$\mathcal{L} = \mathcal{L}_{\mathrm{modularity}} + \mathcal{L}_{\mathrm{regularization}} + \mathcal{L}_{\mathrm{MSE}},$  (11)
in which $\mathcal{L}_{\mathrm{modularity}} = -\frac{1}{2m} \mathrm{Tr}(C^{\top} B C)$ and $\mathcal{L}_{\mathrm{regularization}} = \frac{\sqrt{k}}{n} \left\| \sum_i C_i^{\top} \right\|_F - 1$. Here, $C \in \mathbb{R}^{n \times k}$ is the soft cluster assignment matrix, $B = A - \frac{d d^{\top}}{2m}$ (with A being the adjacency matrix and d the degree vector), k is the number of clusters, $m = \sum_i d_i / 2$ is the number of edges, and n is the number of nodes. $\| \cdot \|_F$ denotes the Frobenius norm. $\mathcal{L}_{\mathrm{MSE}}$ is shown in Equation (3). The first term (modularity) encourages meaningful clusters in the graph; the second term (collapse regularization) discourages all nodes from collapsing into a single trivial cluster; and the third term ensures that the reconstructed feature is as close to the input feature as possible in the graph autoencoder.
The weights and parameters in both the GAT and GCN are trained together to minimize L in Equation (11). This combined objective function enforces high-quality clusters while preserving relevant node feature reconstruction.
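For reference, the adjacency normalization and the two graph-clustering terms of Equation (11) can be written compactly as below. This sketch follows the DMoN formulation of [36]; the negation of the modularity (so that the total loss is minimized) follows that paper, and the function names are ours.

```python
import numpy as np

def normalize_adjacency(A):
    """A_tilde = D^(-1/2) A D^(-1/2), the edge weights fed to the GCN."""
    d = A.sum(axis=1).astype(float)
    d_inv_sqrt = np.zeros_like(d)
    d_inv_sqrt[d > 0] = d[d > 0] ** -0.5
    return A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def clustering_loss_terms(A, C):
    """Modularity and collapse regularization terms of Equation (11).

    A: (n, n) adjacency matrix; C: (n, k) soft assignment (rows sum to 1,
    e.g., the softmax output of Equation (10)).
    """
    d = A.sum(axis=1).astype(float)        # degree vector
    two_m = d.sum()                        # 2m, twice the edge count
    B = A - np.outer(d, d) / two_m         # modularity matrix B = A - d d^T / 2m
    n, k = C.shape
    L_mod = -np.trace(C.T @ B @ C) / two_m                       # modularity term
    L_reg = np.sqrt(k) / n * np.linalg.norm(C.sum(axis=0)) - 1.0  # collapse term
    return L_mod, L_reg      # add the MSE of Eq. (3) for the total loss
```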

2.4. Gaussian Mixture Model for Clustering

The final cluster assignments were obtained by applying Gaussian mixture model (GMM) clustering to the learned embeddings $h^{(2)}$ from the graph autoencoder. The GMM models the latent space as a mixture of k Gaussian distributions, assigning each node to a cluster based on its learned representation. This two-step process leverages graph neural networks for high-level feature extraction and the GMM for refined clustering, ensuring a robust and interpretable partitioning of the graph-structured data. Here, k is given by the user; the implications of selecting k and automating its selection are discussed in Section 3.2 and in the conclusion.
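A minimal sketch of this final clustering step with scikit-learn's GaussianMixture; the full covariance type and fixed seed are our assumptions.

```python
from sklearn.mixture import GaussianMixture

def cluster_cells(h2, k, seed=0):
    """Fit a k-component GMM on the latent embeddings h(2) and return
    one hard cluster label per cell; k is supplied by the user."""
    gmm = GaussianMixture(n_components=k, covariance_type="full",
                          random_state=seed)
    return gmm.fit_predict(h2)
```

fit_predict returns hard labels; if soft assignments are preferred, the posterior responsibilities are available via predict_proba after fitting.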

3. Results

3.1. Simulated Dataset

We systematically benchmarked SCGclust's performance by varying eight key variables: the number of subclones, the number of cells, the number of SNVs, the number of CNAs, the false positive (FP) rate of the SNVs, the false negative (FN) rate of the SNVs, the missing rate of the SNVs, and the noise level on the CNA signal. The eight variables, along with their values and default values, are listed in Table 1. For each experiment, we varied one variable while keeping all others at their default values. To ensure robust results and minimize the impact of random fluctuations, each setting was repeated five times. Among these variables, the missing rate of the SNVs was the most critical to our study, taking the values 0.95, 0.98, and 0.99, with 0.98 set as the default. These missing rates correspond to coverages of 0.05×, 0.02×, and 0.01×, respectively.
In the following, we describe our simulator in detail, followed by a benchmark of SCGclust on the simulated dataset.

3.1.1. Simulator

To mimic the whole process of cancer evolution, we employed the beta-splitting model to generate a phylogenetic tree, following the practice of SCsnvcna [24]. The process begins with the generation of a clonal tree, where each leaf node represents a subclone of cells. The number of subclones is predetermined, and the tree structure is built using a beta-splitting model that stochastically determines the percentage of each subclone.
In more detail, the simulation starts with a normal root node. Each node that has not split yet has a chance to split into two subclones. This process goes on recursively until the desired number of leaf nodes is reached.
CNAs are then placed on the edges based on the branch lengths, with random intervals selected across chromosomes. The branch length of each edge is determined according to an exponential distribution, following the practice of [37]. Each CNA corresponds to a randomly selected genomic region that is either amplified or deleted. After determining the copy number states, we simulate the read counts following the practice of [38]. In more detail, we use normal cells from the T10 dataset [8] to generate corresponding read counts. Noise is introduced into the copy number reads to simulate real-world biological variability. The possible sources of noise in simulated single-cell copy number reads include stochastic fluctuations in the read counts from low-coverage sampling and whole-genome amplification bias, mapping errors from ambiguous or repetitive regions, sequencing errors, GC-content-dependent coverage bias, and systematic dropouts or outliers from uneven amplification or low-mappability regions. These factors collectively create variability that must be reflected in realistic simulations. Specifically, 50% of the data rows for each set of leaf cells are randomly selected using a binary mask. Noise is then added to the selected copy numbers using a normal distribution with a mean of zero and a specified standard deviation, reflecting measurement fluctuations. Negative copy numbers are constrained to zero to maintain biological plausibility. The read counts for each cell are then adjusted based on the noisy copy number values to ensure consistency with the original dataset while capturing the introduced variability.
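A minimal sketch of the noise-injection step described above (the function name and defaults are ours for illustration); the final adjustment of read counts to match the noisy copy numbers is omitted for brevity.

```python
import numpy as np

def add_cna_noise(copy_numbers, noise_std, row_fraction=0.5, seed=0):
    """Inject noise into simulated copy numbers: ~50% of the rows are
    selected by a binary mask, zero-mean Gaussian noise with the given
    standard deviation is added, and negatives are clipped to zero.

    copy_numbers: (n_cells, n_bins) array of simulated copy number states.
    """
    rng = np.random.default_rng(seed)
    noisy = copy_numbers.astype(float).copy()
    mask = rng.random(noisy.shape[0]) < row_fraction   # binary row mask
    noisy[mask] += rng.normal(0.0, noise_std, size=noisy[mask].shape)
    return np.clip(noisy, 0.0, None)                   # biological plausibility
```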
SNVs are also placed on the edges according to the branch lengths, with their positions randomly sampled to ensure no overlap with the CNAs. Cells are assigned to the leaves based on their cell proportions. We then walk the path from the root to each leaf, and the SNVs and CNAs placed on the edges of that path are assigned to the cells of the corresponding leaf. In this way, we generate a complete genotype matrix. To simulate real-world conditions, we introduced false positives (FPs), false negatives (FNs), and missing data into the genotype matrix at the rates specified in each simulation setting. It is worth noting that we intentionally set the FN rate above 0.5 because, when a site is covered by reads, it is most often covered by only one read; for a heterozygous SNV that falls in a copy-number-neutral area, there is a half-chance that the read is a reference-supporting read. Thus, the FN rate reflects both the low coverage and the sequencing artifacts in scDNA-seq. We set the FN rate to range between 0.55 and 0.65 in our simulator. For the missing rate, we used 0.95, 0.98, and 0.99, reflecting the low coverage of scDNA-seq at 0.05×, 0.02×, and 0.01×, respectively.

3.1.2. Benchmarking Results

We evaluated SCGclust, SBMClone [35], and a K-means-based method that uses only the read count on simulated datasets. Since this K-means-based method uses the CNA signals, we refer to it as “Kmeans-CNA”. SCGclust was tested in two modes: one mode where the silhouette score was used to select the best epoch without knowing the ground truth, and another as a reference where the ground truth was assumed to be known for selecting the best epoch. While the ground truth is not available in real-world applications, this reference mode helps determine whether any epoch achieves a good clustering result, providing a benchmark for assessing the effectiveness of SCGclust and SBMClone.
We compared four methods on the simulated datasets: SCGclust, using the silhouette score to select the optimal epoch (referred to as “SCGclust-silhouette”); SCGclust, using the ground truth to select the optimal epoch (referred to as “SCGclust-GT”); SBMClone, which uses only SNV signals; and Kmeans-CNA, which relies solely on CNA signals. In particular, we compared SCGclust with SBMClone and Kmeans-CNA to investigate whether SCGclust, which integrates both SNV and CNA signals, outperformed methods that rely solely on either SNVs or CNA signals, but not both. Given its ability to combine both signals, SCGclust was expected to have a distinct advantage.
We plotted the V-measures of these four methods, as shown in Figure 2. The V-measure is a metric that quantitatively evaluates a clustering result. It assesses the balance between homogeneity and completeness, in which homogeneity measures how much each inferred cluster contains only cells from a single true subclone, and completeness measures how well all members of a true subclone are assigned to the same inferred cluster. Mathematically, let C be the vector of true class labels for the cells, and K be the cluster assignment produced by the algorithm. The homogeneity is $h = 1 - \frac{H(C|K)}{H(C)}$, in which $H(C|K)$ is the conditional entropy of the class distribution given the cluster assignment, and $H(C)$ is the entropy of the class distribution. The completeness is $c = 1 - \frac{H(K|C)}{H(K)}$, in which $H(K|C)$ is the conditional entropy of the cluster assignment given the true class labels, and $H(K)$ is the entropy of the cluster assignment. The V-measure is the harmonic mean of h and c. We chose the V-measure as the primary metric of clustering accuracy because it balances homogeneity and completeness: a high V-measure is achieved only if both are high. This makes it less biased toward over-segmentation (which can inflate homogeneity) or over-merging (which can inflate completeness), and it is symmetric, meaning that swapping the cluster and class labels does not change the score. This symmetry and balance make it especially useful when the true labels are known and a fair assessment of the clustering quality is desired. The V-measure ranges from 0 to 1, where 1 indicates perfect clustering and 0 indicates poor clustering. As a reference, we also tested two other widely used metrics, the adjusted Rand index (ARI) and normalized mutual information (NMI), the results of which are shown in Supplemental Figures S2 and S3. All three metrics showed very similar trends for all the methods tested. Because the V-measure balances both homogeneity and completeness, we used it as the default metric of clustering accuracy whenever ground truth labels were available.
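For concreteness, the homogeneity, completeness, and V-measure of a toy clustering can be computed with scikit-learn, which implements the definitions above; the labels here are made up.

```python
from sklearn.metrics import homogeneity_completeness_v_measure

# Toy example: 6 cells in 2 true subclones vs. an inferred 2-cluster labeling.
true_labels     = [0, 0, 0, 1, 1, 1]
inferred_labels = [0, 0, 1, 1, 1, 1]

h, c, v = homogeneity_completeness_v_measure(true_labels, inferred_labels)
print(f"homogeneity={h:.3f}, completeness={c:.3f}, V-measure={v:.3f}")
```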
It was observed that SCGclust's V-measures were predominantly around or above 0.6 across all eight variables, whereas SBMClone and Kmeans-CNA generally had V-measures around or below 0.6. The only instance where SBMClone achieved a V-measure well above 0.6 was when the missing rate was as low as 0.95 (Figure 2G), with the V-measures ranging between 0.7 and 0.8. When looking into the details of the performance of these methods, we found that all methods except Kmeans-CNA showed a trend of performing better as the number of SNVs increased (Figure 2C). This result was expected, as more SNVs provide greater signal strength and clustering power.
To assess whether SCGclust can function effectively when only CNA signals are provided as the node features and the edge weight does not reflect the SNV signal, we conducted an additional test. Specifically, as shown in Figure 2C, we set the SNV number to zero, assigned a very small constant weight (0.001) to all edges, and tested SCGclust’s performance under this condition. We found that, under this setting, SCGclust failed to render meaningful clustering results, indicating that SCGclust’s results benefit from SNV signals. Thus, although weak, these SNV signals provide valuable information for clustering. It is also noticeable in Figure 2C that, when the SNV count was as low as 1000, SBMClone’s V-measure dropped to near zero, showing that SBMClone requires a large number of SNVs for effective clustering. In contrast, SCGclust showed no significant decline in the V-measure as the number of SNVs decreased; instead, it maintained a median V-measure of around 0.6 with only 1000 SNVs.
Furthermore, in addition to the SNV number, we observed that SBMClone is much more sensitive to variations in the number of cells (Figure 2A) and the number of clones (Figure 2B) than SCGclust. Specifically, when the number of cells was as low as 100 or the number of clones was as high as 6, SBMClone’s V-measure dropped to around zero. In contrast, SCGclust maintained a V-measure above 0.5 in both scenarios. Additionally, we noticed that all methods except Kmeans-CNA performed worse as the SNV missing rate increased (Figure 2G). This outcome was expected because a higher SNV missing rate renders less signal for clustering. Notably, SBMClone’s V-measure dropped to near zero when the SNV missing rate was at 0.99, showing that SBMClone is very sensitive to the SNV missing rate.
Since SBMClone is a method based on SNV only, it is expected that the absence of a strong SNV signal would lead to a poor performance. In contrast, although SCGclust’s V-measure decreased with an increase in the SNV missing rate, it remained above 0.5 even at the highest SNV missing rate of 0.99. This highlights the advantage of SCGclust, which integrates both CNAs and SNVs. A similar pattern was observed with varying SNV FP and FN rates (Figure 2E,F). While Kmeans-CNA’s performance was stable, SCGclust and SBMClone exhibited declining V-measures as the SNV FP and FN rates increased. However, the decrease in SCGclust’s V-measure was less pronounced compared to that of SBMClone, the latter of which dropped to near zero across all five repetitions when the SNV FP rate was as high as 0.1 or the SNV FN rate reached 0.65. This again shows that SBMClone is more sensitive to the SNV FP and FN rates than SCGclust. Furthermore, we expected an increase in the V-measure for SCGclust and Kmeans-CNA with an increased number of CNAs, as both methods leverage the CNA signal. However, such a trend is not obvious in the plot (Figure 2D), suggesting that even a low CNA number of 100 is sufficient for cell clustering.
Lastly, we investigated the effectiveness of the silhouette score in selecting the optimal epoch by comparing SCGclust-silhouette and SCGclust-GT. Our results showed that SCGclust-silhouette (in blue bars) closely aligned with SCGclust-GT (in orange bars) in most cases, demonstrating that the silhouette score is a reliable indicator of the clustering performance and it effectively guides the selection of the best epoch.
In summary, SCGclust outperformed both SBMClone and Kmeans-CNA, and SCGclust was much less sensitive to the SNV signal than SBMClone, showing the advantage of integrating both SNV and CNA signals. SCGclust's V-measure can exceed 0.9, even at the missing rate of 0.95, which corresponds to a coverage of 0.05×.

3.1.3. Ablation Test: Comparison Among SCGclust, GCN, and GAT

To investigate whether the co-training strategy with the attention mechanism adopted by SCGclust has an advantage over the GAT without co-training and the GCN without the attention mechanism, we further compared these three methods (ours, GAT, and GCN), as shown in Figure 3. Using the ground truth heuristic, i.e., using knowledge of the ground truth to select the best epoch, ours achieved a mean V-measure of 0.843 (std = 0.186), GAT achieved 0.619 (std = 0.24), and GCN achieved 0.804 (std = 0.215). Pairwise statistical comparisons showed a significant difference against GAT, but not against GCN (ours vs. GAT: p = 0.000; ours vs. GCN: p = 0.414), indicating that GCN performed better than GAT, although its V-measure was still lower than ours. We observed a similar trend with the silhouette heuristic, where the ground truth is unknown: our mean was 0.774 (std = 0.215), compared to 0.595 (std = 0.245) for GAT and 0.753 (std = 0.235) for GCN. The improvement was statistically significant over GAT (p = 0.000), whereas it was not significant over GCN (p = 0.651).

3.1.4. Research Design and Choice of Hyperparameters

We used a graph neural network architecture based on a graph attention autoencoder followed by a GMM clustering module. The graph attention autoencoder was configured with multiple layers, where the input layer dimension corresponded to the input data and the hidden dimensions included two layers of 99 and a user-specified final latent dimension (e.g., 16). Each layer used one attention head. We used the ReLU activation function between layers. The model was optimized using the AdamW optimizer with a learning_rate = 0.001, weight_decay = 1 × 10−6, beta_1 = 0.9, beta_2 = 0.999, and epsilon = 1 × 10−7. We trained the model for 1000 epochs with a batch size equal to the number of nodes, as the entire graph was processed in one pass due to the graph nature of the data.
To improve the training stability, we employed gradient clipping with a clipnorm = 5.0 and clipvalue = 0.5. The dropout rate for the GNN layers was set via a tunable parameter, commonly 0 or 0.2 depending on the dataset. The DMoN pooling layer used a collapse regularization coefficient of 1.0 and operated on the normalized adjacency matrix. For cluster assignment, we used the GMM on the pooled embeddings, with the number of clusters (n_clusters) set to four unless otherwise specified.
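The hyperparameter names above (learning_rate, clipnorm, clipvalue) follow the Keras convention; a sketch of an equivalent training-loop configuration in PyTorch is shown below, with a placeholder parameter and loss standing in for the actual model and for $\mathcal{L}$ of Equation (11).

```python
import torch

# Stand-in parameter; in practice, pass model.parameters() of the GAT + GCN.
params = [torch.nn.Parameter(torch.randn(4, 4))]

# AdamW with the stated hyperparameters.
optimizer = torch.optim.AdamW(params, lr=1e-3, weight_decay=1e-6,
                              betas=(0.9, 0.999), eps=1e-7)

for epoch in range(1000):                 # whole graph per pass: one step/epoch
    optimizer.zero_grad()
    loss = (params[0] ** 2).sum()         # placeholder for L in Equation (11)
    loss.backward()
    torch.nn.utils.clip_grad_value_(params, 0.5)  # clipvalue = 0.5
    torch.nn.utils.clip_grad_norm_(params, 5.0)   # clipnorm = 5.0
    optimizer.step()
```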
The number of clusters was set to match the categories in the ground truth to ensure the best match after clustering for the simulated data. SCGclust was not sensitive to the model's latent dimensions, as shown in the hyperparameter grid search experiments in Figure 4. Specifically, we evaluated the performance of SCGclust across nine neural network architectures with different hidden layer sizes and latent dimensions: 64, 64, 8; 64, 64, 32; 128, 128, 8; 128, 128, 32; 99, 99, 8; 99, 99, 32; 64, 64, 16; 128, 128, 16; and 99, 99, 16 (our current default). For each architecture, we recorded the V-measure of SCGclust using both the ground truth labels (SCGclust-GT) and the silhouette-based heuristic (SCGclust-silhouette). Across all architectures, the mean V-measures were extremely similar: SCGclust-GT ranged from approximately 0.758 to 0.843, and SCGclust-silhouette ranged from approximately 0.684 to 0.774, with minimal variation among runs. A one-way ANOVA confirmed that these small differences were not significant (SCGclust-GT: F = 2.32, p = 0.183, η² = 0.0240; SCGclust-silhouette: F = 2.09, p = 0.347, η² = 0.0216), indicating that the choice of architecture dimension does not materially affect SCGclust's performance.

3.1.5. Runtime and Memory Consumption

The runtime and memory consumption of SCGclust were both remarkably low, making it well suited for high-throughput and large-cohort applications. Specifically, we tested SCGclust on a simulation study in which we increased the number of cells from 100 to 10,000 to examine its scalability. The computer that we used was a MacBook Pro, 14-inch, equipped with an Apple M3 Pro SoC featuring an 11-core CPU (5 performance + 6 efficiency cores) and a 14-core integrated GPU, paired with 18 GB of unified memory and a 512 GB SSD. Table 2 shows the model size, memory, and runtime for the range of cells between 100 and 10,000. We observed that the memory consumption increased linearly with the number of cells, whereas the runtime grew at a faster-than-linear rate. This pattern was expected, because the number of edges increases quadratically as the number of cells increases. Nevertheless, even with 10,000 cells, a scale rarely produced by current technologies, the runtime remained reasonable at approximately 74 min.

3.2. Real Dataset

We further tested SCGclust on a real breast cancer sample, T10, which has 100 cells and fluorescence-activated cell sorting (FACS) results indicating the presence of four distinct clusters: D, H, A1, and A2 [39]. The FACS results served as the ground truth and were used to evaluate SCGclust's performance. The average coverage of T10's data was about 1.81×. We used SCcaller (V2.0.0), a single-cell mutation caller for scDNA-seq data, to detect SNVs in each cell. In total, 48,898 SNVs were detected across all cells. We then applied a filter for a variant allele fraction > 0.03, selected only nonsynonymous SNVs in exonic regions, and obtained a total of 3105 SNVs. These 3105 SNVs were the input to SCGclust and SBMClone. For the CNA signals, we used read counts in a sliding window of 500 kbp for each cell. This CNA signal was the input to both SCGclust and Kmeans-CNA. We compared the performance of SCGclust-silhouette, SCGclust-GT, SBMClone, and Kmeans-CNA, whose V-measures using the FACS results as the ground truth were 0.97, 0.98, 0.037, and 0.22, respectively. SCGclust performed much better than SBMClone and Kmeans-CNA. We then investigated whether SBMClone's performance could improve by using all 48,898 SNVs. Our results show that, while SBMClone's performance did improve slightly with more SNVs, the increase was limited, with the V-measure rising to 0.077. We reasoned that, since SBMClone's V-measure for the simulated data was near zero when the cell number was as low as 100, it was expected that SBMClone would exhibit a similarly low V-measure for the T10 dataset with the same cell count. Thus, the large number of SNVs cannot compensate for the low number of cells for SBMClone. In contrast, SCGclust is robust to variations in the cell number, maintaining a V-measure above 0.97 even with only 100 cells.
We further looked into the t-distributed stochastic neighbor embedding (t-SNE) plot of the latent space from SCGclust for this dataset. Figure 5 shows that the latent space clearly separates four clusters. Notably, the last cluster, A2 (label 3), contains only four cells, and its latent representation is far from the other three clusters. Distinguishing A2 from A1 (label 1) is challenging given that A2 has only four cells. This shows that SCGclust is robust to small subclones as well.
Our future work will further consider strategies to improve the resolution of small subclones, for example, by leveraging higher-coverage scDNA-seq data or incorporating statistical priors that explicitly account for the subclone size. Such approaches may enable us to more reliably distinguish small subclones from larger ones, particularly when they are embedded within highly similar genetic backgrounds.
It is worth noting that, for this real dataset, the silhouette score can be used as a metric to select the best epoch, since the V-measures between SCGclust-silhouette and SCGclust-GT are close to each other. This indicates that, even when SCGclust is applied to a real dataset without ground truth clustering, the silhouette score should consistently be able to select the optimal epoch. Since, in practice, we are not given the true cluster number, we then investigated the performance of SCGclust when the given cluster number was slightly off. When setting the cluster number to 3, SCGclust’s V-measure was 0.6097 for SCGclust-silhouette and 0.5157 for SCGclust-GT. When setting the cluster number to 5, SCGclust’s V-measure was 0.5404 for both SCGclust-silhouette and SCGclust-GT, showing a drop in the V-measure when the cluster number is one off. However, the resulting V-measure was still significantly higher than those of SBMClone and Kmeans-CNA.

Self-Supervised Selection of the Optimal Number of Clusters in Real Data

We hypothesized that, since the silhouette score was used to select the best epoch, it might also serve as a self-supervised metric to select the optimal number of clusters. We tested this hypothesis on the T10 dataset by varying the cluster number from 3 to 6. As shown in Table 3, the silhouette score was highest when the cluster number was 4. Given the prior knowledge that T10 has four clusters, the silhouette score successfully selected the correct number of clusters.
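A minimal sketch of this self-supervised selection loop, reusing the GMM clustering of Section 2.4 together with scikit-learn's silhouette_score; the function name select_n_clusters is ours.

```python
import numpy as np
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture

def select_n_clusters(h2, k_range=range(3, 7), seed=0):
    """Pick the cluster number whose GMM labeling of the latent embeddings
    h(2) maximizes the silhouette score (no ground truth required)."""
    best_k, best_s = None, -np.inf
    for k in k_range:
        labels = GaussianMixture(n_components=k,
                                 random_state=seed).fit_predict(h2)
        s = silhouette_score(h2, labels)
        if s > best_s:
            best_k, best_s = k, s
    return best_k, best_s
```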

4. Conclusions

ITH has been known to confound cancer treatments. To characterize ITH, cell clustering is the first and foremost step in analyzing scDNA-seq data. In this paper, we present SCGclust, a cell-clustering method applicable to shallow scDNA-seq data that integrates both the SNV and CNA signals. SCGclust adopts a graph autoencoder framework in which each node represents a cell, the node feature is derived from the read counts reflecting the CNA signal, and the edge weight is the SNV similarity between two cells. SCGclust co-trains the graph autoencoder and a GCN with an objective function that encourages meaningful clusters and prevents all cells from collapsing into a single cluster. By integrating both SNVs and CNAs, SCGclust outperforms methods that rely on either signal but not both, including SBMClone and Kmeans-CNA, a K-means algorithm based on the read count only. We tested SCGclust, SBMClone, and Kmeans-CNA on eight simulated datasets and one real dataset. We found that SCGclust was robust in extreme cases, such as a very low coverage reflected by a high missing rate, a low number of cells, and high SNV FP and FN rates. In particular, SCGclust's V-measure was consistently above 0.6 in most settings, whereas SBMClone's V-measure dropped to zero in several extreme cases. SCGclust also achieved a higher V-measure than SBMClone and Kmeans-CNA on the real dataset T10. These results show that SCGclust is a robust tool for clustering low-coverage scDNA-seq data. We reasoned that, because SCGclust leverages both SNV and CNA signals, and CNA detection is less sensitive to low coverage since it relies on read counts over large genomic bins rather than individual nucleotides, the CNA signal can effectively compensate for the extremely high missing rate in the SNV data.
Regarding potential improvements to SCGclust, since SCGclust directly uses the read counts rather than inferred copy numbers and does not rely on the copy numbers being integers, a future direction would be to either incorporate the inferred absolute copy numbers instead of the read counts or to combine CNA inference and cell clustering in a more complex model. Nevertheless, the current model, which uses the read counts directly, remains advantageous in that it avoids errors passed along by a CNA caller, as CNA inference is still challenging for scDNA-seq data [38].
Identifying the correct cluster number is non-trivial for all clustering algorithms. In our real dataset experiment, we showed that the silhouette score is indicative of the optimal number of clusters. Future work will include further testing of this hypothesis on both simulated and real datasets.
SCGclust was not intentionally designed to be sensitive to rare subclones. Consider a rare subclone that has a few cells whose mutation signature differs significantly from other subclones; it is possible that SCGclust could detect such a rare subclone. Nevertheless, sensitively detecting rare subclones, especially at a low coverage, is a nontrivial task and is beyond the scope of this paper.
SCGclust is scalable to experiments with up to 10,000 cells. Our simulation showed that its memory consumption is linear in the number of cells, while the runtime grows roughly quadratically. Nevertheless, even when the cell number reached 10,000, the runtime was just a little over one hour. Future work will include testing SCGclust on a real dataset with a large number of cells.
Last, but not least, although SCGclust was primarily designed for scDNA-seq data, the framework could be extended to other single-cell genomic contexts, such as inferring genomic alterations from transcriptome data. In such applications, additional sources of variation could be incorporated into the model. For example, batch effects could be mitigated by adjusting node features or edge weights using batch-aware normalization techniques, or by including batch identifiers as additional node features to guide the graph attention mechanism. Similarly, spatial information in spatial transcriptome data could be integrated by augmenting the graph with edges reflecting the physical proximity between cells, allowing the method to jointly leverage genomic similarity and tissue architecture. These extensions would broaden the applicability of SCGclust and enable it to capture both the biological and technical heterogeneity in diverse single-cell datasets.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/math14010046/s1, Figure S1: Comparison of four metrics to calculate cell pairwise distance for the SNV profile: Euclidean (blue), dot product (orange), cosine similarity (green) and Pearson correlation (red) for the simulated dataset; Figure S2: Using Adjusted Random Index (ARI) to measure clustering accuracy for SCGclust, SBMClone and Kmeans for the simulated dataset; Figure S3: Using Normalized Mutual Information (NMI) to measure clustering accuracy for SCGclust, SBMClone and Kmeans for the simulated dataset.

Author Contributions

X.M., X.M.Z. and L.Z. conceived and supervised the project; T.P., Y.H., J.N., L.Z., X.M.Z. and X.M. designed the model; T.P., Y.H., H.C., J.W., S.D., L.Z. and R.K. conducted the experiments; and T.P., L.Z., Y.H., X.M.Z. and X.M. wrote the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by NIH NIGMS Maximizing Investigators’ Research Award (MIRA) grant number R35 GM146960 to Xin Maizie Zhou, NSF CCF grant number 2523717 to Xian Mallory, and NSF CCF grant number 2523716 to Hongmei Chi.

Data Availability Statement

T10 breast cancer scDNA-seq data were obtained from the Sequence Read Archive (SRA) under accession number SRX021401. SCGclust is publicly available at https://github.com/compbio-mallory/cellClustering_GNN (accessed on 9 December 2025).

Acknowledgments

The authors thank the anonymous reviewers for their valuable suggestions.

Conflicts of Interest

The authors declare that they have no competing interests.

References

1. Lawrence, M.S.; Stojanov, P.; Polak, P.; Kryukov, G.V.; Cibulskis, K.; Sivachenko, A.; Carter, S.L.; Stewart, C.; Mermel, C.H.; Roberts, S.A. Mutational heterogeneity in cancer and the search for new cancer-associated genes. Nature 2013, 499, 214–218.
2. Burrell, R.A.; McGranahan, N.; Bartek, J.; Swanton, C. The causes and consequences of genetic heterogeneity in cancer evolution. Nature 2013, 501, 338–345.
3. Turajlic, S.; Sottoriva, A.; Graham, T.; Swanton, C. Resolving genetic heterogeneity in cancer. Nat. Rev. Genet. 2019, 20, 404–416.
4. Lawson, D.A.; Kessenbrock, K.; Davis, R.T.; Pervolarakis, N.; Werb, Z. Tumour heterogeneity and metastasis at single-cell resolution. Nat. Cell Biol. 2018, 20, 1349–1360.
5. Dagogo-Jack, I.; Shaw, A.T. Tumour heterogeneity and resistance to cancer therapies. Nat. Rev. Clin. Oncol. 2018, 15, 81–94.
6. Marusyk, A.; Janiszewska, M.; Polyak, K. Intratumor heterogeneity: The Rosetta stone of therapy resistance. Cancer Cell 2020, 37, 471–484.
7. Carter, N.P.; Bebb, C.E.; Nordenskjöld, M.; Ponder, B.A.; Tunnacliffe, A. Degenerate oligonucleotide-primed PCR: General amplification of target DNA by a single degenerate primer. Genomics 1992, 13, 718–725.
8. Navin, N.; Kendall, J.; Troge, J.; Andrews, P.; Rodgers, L.; McIndoo, J.; Cook, K.; Stepansky, A.; Levy, D.; Esposito, D.; et al. Tumour evolution inferred by single-cell sequencing. Nature 2011, 472, 90–94.
9. Baslan, T.; Kendall, J.; Rodgers, L.; Cox, H.; Riggs, M.; Stepansky, A.; Troge, J.; Ravi, K.; Esposito, D.; Lakshmi, B. Genome-wide copy number analysis of single cells. Nat. Protoc. 2012, 7, 1024–1041.
10. Zahn, H.; Steif, A.; Laks, E.; Eirew, P.; VanInsberghe, M.; Shah, S.P.; Aparicio, S.; Hansen, C.L. Scalable whole-genome single-cell library preparation without preamplification. Nat. Methods 2017, 14, 167–173.
11. Laks, E.; McPherson, A.; Zahn, H.; Lai, D.; Steif, A.; Brimhall, J.; Biele, J.; Wang, B.; Masud, T.; Ting, J.; et al. Clonal decomposition and DNA replication states defined by scaled single-cell genome sequencing. Cell 2019, 179, 1207–1221.
12. Hou, Y.; Song, L.; Zhu, P.; Zhang, B.; Tao, Y.; Xu, X.; Li, F.; Wu, K.; Liang, J.; Shao, D. Single-cell exome sequencing and monoclonal evolution of a JAK2-negative myeloproliferative neoplasm. Cell 2012, 148, 873–885.
13. Wang, Y.; Waters, J.; Leung, M.L.; Unruh, A.; Roh, W.; Shi, X.; Chen, K.; Scheet, P.; Vattathil, S.; Liang, H. Clonal evolution in breast cancer revealed by single nucleus genome sequencing. Nature 2014, 512, 155–160.
14. Shen, R.; Olshen, A.B.; Ladanyi, M. Integrative clustering of multiple genomic data types using a joint latent variable model with application to breast and lung cancer subtype analysis. Bioinformatics 2009, 25, 2906–2912.
15. Wang, B.; Mezlini, A.M.; Demir, F.; Fiume, M.; Tu, Z.; Brudno, M.; Haibe-Kains, B.; Goldenberg, A. Similarity network fusion for aggregating data types on a genomic scale. Nat. Methods 2014, 11, 333–337.
16. Argelaguet, R.; Velten, B.; Arnol, D.; Dietrich, S.; Zenz, T.; Marioni, J.C.; Buettner, F.; Huber, W.; Stegle, O. Multi-Omics Factor Analysis—a framework for unsupervised integration of multi-omics data sets. Mol. Syst. Biol. 2018, 14, e8124.
17. Argelaguet, R.; Arnol, D.; Bredikhin, D.; Deloro, Y.; Velten, B.; Marioni, J.C.; Stegle, O. MOFA+: A statistical framework for comprehensive integration of multi-modal single-cell data. Genome Biol. 2020, 21, 111.
18. Shakyawar, S.K.; Sajja, B.R.; Patel, J.C.; Guda, C. iCluF: An unsupervised iterative cluster-fusion method for patient stratification using multiomics data. Bioinform. Adv. 2024, 4, vbae015.
19. Deshwar, A.G.; Vembu, S.; Yung, C.K.; Jang, G.H.; Stein, L.; Morris, Q. PhyloWGS: Reconstructing subclonal composition and evolution from whole-genome sequencing of tumors. Genome Biol. 2015, 16, 35.
20. Gillis, S.; Roth, A. PyClone-VI: Scalable inference of clonal population structures using whole genome data. BMC Bioinform. 2020, 21, 571.
21. Miller, C.A.; White, B.S.; Dees, N.D.; Griffith, M.; Welch, J.S.; Griffith, O.L.; Vij, R.; Tomasson, M.H.; Graubert, T.A.; Walter, M.J.; et al. SciClone: Inferring clonal architecture and tracking the spatial and temporal patterns of tumor evolution. PLoS Comput. Biol. 2014, 10, e1003665.
22. Satas, G.; Zaccaria, S.; Mon, G.; Raphael, B.J. SCARLET: Single-cell tumor phylogeny inference with copy-number constrained mutation losses. Cell Syst. 2020, 10, 323–332.
23. Chen, Z.; Gong, F.; Wan, L.; Ma, L. BiTSC2: Bayesian inference of tumor clonal tree by joint analysis of single-cell SNV and CNA data. Briefings Bioinform. 2022, 23, bbac092.
24. Zhang, L.; Bass, H.W.; Irianto, J.; Mallory, X. Integrating SNVs and CNAs on a phylogenetic tree from single-cell DNA sequencing data. Genome Res. 2023, 33, 2002–2017.
25. Hu, Y.; Xie, M.; Li, Y.; Rao, M.; Shen, W.; Luo, C.; Qin, H.; Baek, J.; Zhou, X.M. Benchmarking clustering, alignment, and integration methods for spatial transcriptomics. Genome Biol. 2024, 25, 212.
26. Hu, Y.; Lin, Z.; Xie, M.; Yuan, W.; Li, Y.; Rao, M.; Liu, Y.H.; Shen, W.; Zhang, L.; Zhou, X.M. MaskGraphene: An advanced framework for interpretable joint representation for multi-slice, multi-condition spatial transcriptomics. Genome Biol. 2025, 26, 380.
27. Hu, Y.; Zhao, Y.; Schunk, C.T.; Ma, Y.; Derr, T.; Zhou, X.M. ADEPT: Autoencoder with differentially expressed genes and imputation for robust spatial transcriptomics clustering. iScience 2023, 26, 106792.
28. Lopez, R.; Regier, J.; Cole, M.B.; Jordan, M.I.; Yosef, N. Deep generative modeling for single-cell transcriptomics. Nat. Methods 2018, 15, 1053–1058.
29. Tian, T.; Wan, J.; Song, Q.; Wei, Z. Clustering single-cell RNA-seq data with a model-based deep learning approach. Nat. Mach. Intell. 2019, 1, 191–198.
30. Wang, J.; Ma, A.; Chang, Y.; Gong, J.; Jiang, Y.; Qi, R.; Wang, C.; Fu, H.; Ma, Q.; Xu, D. scGNN is a novel graph neural network framework for single-cell RNA-Seq analyses. Nat. Commun. 2021, 12, 1882.
31. Weinberger, E.; Lin, C.; Lee, S.I. Isolating salient variations of interest in single-cell data with contrastiveVI. Nat. Methods 2023, 20, 1336–1345.
32. Rashid, S.; Shah, S.; Bar-Joseph, Z.; Pandya, R. Dhaka: Variational autoencoder for unmasking tumor heterogeneity from single cell genomic data. Bioinformatics 2021, 37, 1535–1543.
33. Yan, J.; Ma, M.; Yu, Z. bmVAE: A variational autoencoder method for clustering single-cell mutation data. Bioinformatics 2023, 39, btac790.
34. Liu, F.; Shi, F.; Du, F.; Cao, X.; Yu, Z. CoT: A transformer-based method for inferring tumor clonal copy number substructure from scDNA-seq data. Briefings Bioinform. 2024, 25, bbae187.
35. Myers, M.A.; Zaccaria, S.; Raphael, B.J. Identifying tumor clones in sparse single-cell mutation data. Bioinformatics 2020, 36, i186–i193.
36. Tsitsulin, A.; Palowitch, J.; Perozzi, B.; Müller, E. Graph clustering with graph neural networks. J. Mach. Learn. Res. 2023, 24, 1–21.
37. Mallory, X.F.; Edrisi, M.; Navin, N.; Nakhleh, L. Assessing the performance of methods for copy number aberration detection from single-cell DNA sequencing data. PLoS Comput. Biol. 2020, 16, e1008012.
38. Zhang, L.; Zhou, X.M.; Mallory, X. SCCNAInfer: A robust and accurate tool to infer the absolute copy number on scDNA-seq data. Bioinformatics 2024, 40, btae454.
39. Navin, N.; Krasnitz, A.; Rodgers, L.; Cook, K.; Meth, J.; Kendall, J.; Riggs, M.; Eberling, Y.; Troge, J.; Grubor, V.; et al. Inferring tumor progression from genomic heterogeneity. Genome Res. 2010, 20, 68–80.
Figure 1. Overview of SCGclust. (A). SCGclust takes two inputs: the cell-by-SNV matrix (top) and the cell-by-genomic-region matrix (bottom). The cell-by-SNV matrix has entries “1”, “0”, and “3”: “1” and “0” indicate that the SNV is present or absent in the cell, respectively, and “3” indicates that no read covers the site, so the signal is missing. The cell-by-genomic-region matrix holds the read count of each genomic region in each cell. Six cells (C1–C6) are shown as an illustration; C1–C3 and C4–C6 have relatively similar SNV and CNA profiles, respectively. (B). The two matrices provide the edge weights and the node features of the graph. Each of the six nodes represents a cell, and its node feature is the cell’s cosine-similarity vector of the read counts, which reflects the CNA signal. Every pair of nodes is connected by an edge whose weight is derived from the Euclidean distance between the two cells’ SNV profiles, such that closer SNV profiles yield larger weights. Here, C1, C2, and C3 have larger edge weights (thicker edges) because their SNV profiles are closer to each other, and similarly for C4, C5, and C6. (C). The built graph is the input to the graph autoencoder, which reduces the dimension of the node features in the encoder and recovers the original node features in the decoder. The dimension reduction also considers the edge weights, so that two cells with similar SNV profiles obtain more similar low-dimensional embeddings. The graph autoencoder has four layers in total: layers 1 and 2 form the encoder, and layers 3 and 4 form the decoder. (D). A graph convolutional network (GCN) is co-trained with the graph autoencoder, with an objective function composed of three terms: the reconstruction mean squared error (MSE) term, the modularity term, and the collapse regularization term. (E). Finally, the cells are clustered with a Gaussian mixture model based on each cell’s low-dimensional embedding from the graph autoencoder.
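To make the graph construction in (A) and (B) concrete, below is a minimal Python sketch rather than the authors’ implementation. It assumes a NumPy array snv (cells × SNVs, coded 1/0/3 as above) and reads (cells × genomic regions); the imputation of missing entries with 0.5 and the Gaussian kernel that turns Euclidean distances into similarities are illustrative assumptions, chosen so that closer SNV profiles receive larger edge weights.

import numpy as np
from scipy.spatial.distance import cdist
from sklearn.metrics.pairwise import cosine_similarity

def build_graph(snv, reads, sigma=1.0):
    # Node features: each cell's cosine-similarity vector against all cells,
    # computed from the read counts that carry the CNA signal.
    node_features = cosine_similarity(reads)             # (n_cells, n_cells)
    # Edge weights: treat entries coded "3" as missing and impute them
    # (assumption), then map pairwise Euclidean distances of the SNV
    # profiles to similarities so closer profiles get larger weights.
    x = np.where(snv == 3, 0.5, snv).astype(float)
    d = cdist(x, x)                                      # Euclidean distances
    edge_weights = np.exp(-(d ** 2) / (2 * sigma ** 2))  # Gaussian kernel
    np.fill_diagonal(edge_weights, 0.0)                  # no self-loops
    return node_features, edge_weights

# Toy usage on random placeholder data for six cells:
snv = np.random.choice([0, 1, 3], size=(6, 20))
reads = np.random.poisson(30, size=(6, 50))
features, weights = build_graph(snv, reads)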
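The three-term objective in (D) can be assembled as in the PyTorch sketch below. This is not the authors’ code: the modularity and collapse terms are written in the style of the deep modularity network (DMoN) of Tsitsulin et al., which the caption’s wording suggests, and the weighting coefficients alpha and beta are illustrative assumptions.

import torch
import torch.nn.functional as F

def scgclust_style_loss(x, x_hat, assign, adj, alpha=1.0, beta=1.0):
    # x, x_hat: original and reconstructed node features (n, f);
    # assign:   soft cluster assignments from the GCN head (n, k);
    # adj:      weighted adjacency matrix of the cell graph (n, n).
    n, k = assign.shape
    # 1) Reconstruction term: mean squared error of the graph autoencoder.
    recon = F.mse_loss(x_hat, x)
    # 2) Modularity term: reward grouping heavily connected cells together;
    #    it enters negatively so minimizing the loss maximizes modularity.
    deg = adj.sum(dim=1, keepdim=True)    # weighted node degrees
    two_m = adj.sum()                     # twice the total edge weight
    modularity = torch.trace(
        assign.t() @ (adj - deg @ deg.t() / two_m) @ assign
    ) / two_m
    # 3) Collapse regularization: penalize pushing all cells into one cluster.
    collapse = torch.norm(assign.sum(dim=0)) / n * (k ** 0.5) - 1.0
    return recon - alpha * modularity + beta * collapse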
Figure 2. Boxplots of the V-measure for SCGclust using the silhouette score to select the best epoch (SCGclust-silhouette, in blue), SCGclust using the ground truth clustering to select the best epoch as a reference (SCGclust-GT, in orange), SBMClone (in green), and a K-means method that clusters the cells using only the read count, a CNA signal (Kmeans-CNA, in red). Eight variables were varied to study the performance of the four methods: (A). the number of cells, (B). the number of clones, (C). the number of SNVs, and (D). the number of CNAs on the top panel; (E). the FP rate, (F). the FN rate, (G). the missing rate, and (H). the CNA noise on the bottom panel. The median of each boxplot is highlighted with a black horizontal line. Each setting was repeated five times to avoid extreme cases.
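For reference, the V-measure reported in Figures 2–4 is the harmonic mean of homogeneity and completeness and can be computed with scikit-learn; the two label vectors below are toy examples, not data from the paper.

from sklearn.metrics import v_measure_score

truth = [0, 0, 1, 1, 2, 2]        # ground truth subclones (toy example)
predicted = [0, 0, 1, 1, 1, 2]    # predicted clusters (toy example)
print(v_measure_score(truth, predicted))  # in [0, 1]; 1 = perfect clustering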
Figure 3. Boxplots of the V-measure comparing SCGclust’s co-training with attention (ours) against a GAT without co-training and a GCN without an attention mechanism, using the ground truth (left panel) and the silhouette score (right panel) to select the best epoch. The p-values between ours and both GAT and GCN indicate statistical significance under the silhouette setting.
Figure 4. Boxplots of the V-measure for the nine neural network architectures, using both the ground truth and the silhouette score to select the best epoch, together with SBMClone and Kmeans-CNA, across all simulated datasets: (A). the number of cells, (B). the number of clones, (C). the number of SNVs, and (D). the number of CNAs on the top panel; (E). the FP rate, (F). the FN rate, (G). the missing rate, and (H). the CNA noise on the bottom panel. The median of each boxplot is highlighted with a black horizontal line. Each setting was repeated five times to avoid extreme cases.
Figure 5. t-SNE plots of the latent space from SCGclust on dataset T10, showing predicted subclones (left) and ground truth subclones (right).
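A plot like Figure 5 can be reproduced from any latent embedding with a few lines of Python; the embedding and labels below are random placeholders standing in for SCGclust’s output on T10.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

embedding = np.random.rand(300, 32)            # placeholder latent space
labels = np.random.randint(0, 4, size=300)     # placeholder subclone labels
coords = TSNE(n_components=2, random_state=0).fit_transform(embedding)
plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=8, cmap="tab10")
plt.xlabel("t-SNE 1")
plt.ylabel("t-SNE 2")
plt.show()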
Table 1. Summary of the simulated datasets. Each row is indexed in the first column; the varying variable is named in the second column and its values are listed in the third. The default value is marked with “(d)”.
Index    Varying variable        Values
1        Number of subclones     2, 4 (d), 6
2        Number of cells         100, 500 (d), 1000
3        Number of SNVs          0, 1000, 5000 (d), 10,000
4        Number of CNAs          100, 200 (d), 300
5        False positive rate     0.01, 0.05 (d), 0.1
6        False negative rate     0.55, 0.6 (d), 0.65
7        Missing rate            0.95, 0.98 (d), 0.99
8        CNA noise               0.3, 0.5 (d), 1
Table 2. Model size, memory consumption, and runtime as the number of cells increases.
# Cells    Model Size    Memory (MB)    Runtime (s)
100        41,284        0.16127        12.34
1000       219,484       0.85736        49.61
2000       417,484       1.59           147.49
5000       1,011,484    3.86           928.88
10,000     2,001,484    7.64           4455.22
Table 3. Using the silhouette score to automatically select the optimal cluster number.
# Clusters    SCGclust-Silhouette    SCGclust-GT    Silhouette Score
3             0.6097                 0.5157         0.1277
4             0.9715                 0.9870         0.1578
5             0.5404                 0.5404         0.0983
6             0.8859                 0.9229         0.1381
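The selection rule behind Table 3 can be sketched as follows: fit a Gaussian mixture model on the latent embedding for each candidate cluster number and keep the number with the highest silhouette score. The embedding below is a random placeholder and the candidate range is an assumption; on the values in Table 3, this rule selects four clusters (silhouette score 0.1578).

import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score

embedding = np.random.rand(500, 16)     # placeholder latent space
scores = {}
for k in range(3, 7):                   # candidate cluster numbers
    labels = GaussianMixture(n_components=k, random_state=0).fit_predict(embedding)
    scores[k] = silhouette_score(embedding, labels)
best_k = max(scores, key=scores.get)    # report the highest-silhouette k
print(best_k, scores[best_k])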