ST-Community Detection Methods for Spatial Transcriptomics Data Analysis

Zhao, Charles; Ren, Jian-Jian

doi:10.3390/stats9010004


by Charles Zhao ¹ and Jian-Jian Ren ²,*

¹ Department of Statistics & Operations Research, University of North Carolina, Chapel Hill, NC 27599, USA

² Statistics Program, Department of Mathematics, University of Maryland, College Park, MD 20742, USA

* Author to whom correspondence should be addressed.

Stats 2026, 9(1), 4; https://doi.org/10.3390/stats9010004 (registering DOI)

Submission received: 6 November 2025 / Revised: 2 December 2025 / Accepted: 9 December 2025 / Published: 1 January 2026

(This article belongs to the Section Computational Statistics)


Abstract

Single-cell spatial transcriptomics (ST) data with cell type and spatial location, i.e., $(C, x, y)$ with $C$ as cell type and $(x, y)$ as its spatial location, are produced by recent biotechnologies such as CosMx and Xenium. They contain a huge amount of information about cancer tissue samples and thus have great potential for cancer research via detection of the ST-Community, which is defined as a collection of cells with distinct cell-type composition and similar neighboring patterns based on nearby cell-percentages. For huge CosMx single-cell ST data, however, the existing clustering methods do not work well for st-community detection, and the commonly used kNN compositional data method lacks informative neighboring cell patterns. In this article, we propose a novel and more informative disk compositional data (DCD) method for single-cell ST data, which identifies the neighboring patterns of each cell by taking into account the features of ST data from recent technologies. After initial processing of single-cell ST data into the DCD matrix, an innovative DCD-TMHC computation method for st-community detection is proposed. Extensive simulation studies and the analysis of CosMx breast cancer data, an example of a single-cell ST dataset, clearly show that our proposed DCD-TMHC computation method is superior to other existing methods. Based on the st-communities detected for CosMx breast cancer data, the logistic regression analysis results demonstrate that the proposed DCD-TMHC computation method produces more interpretable and superior outcomes, especially in terms of assessment for different cancer categories. These results suggest that the proposed DCD-TMHC computation method will be helpful for, and have an impact on, future cancer research based on single-cell ST data, which can improve cancer diagnosis and the monitoring of cancer treatment progress.

Keywords:

cluster analysis; compositional data matrix; hierarchical clustering; k-means clustering; logistic regression model; SigClust; spatial transcriptomics data

1. Introduction

The recent development of spatial transcriptomics (ST) biotechnologies, such as the CosMx and Xenium machines released in December 2022, has shown great potential for the improvement of cancer diagnosis and treatment; see [1] and the review article [2]. The CosMx machine, i.e., the CosMx Spatial Molecular Imager, produces data consisting of images of tissues and cells, as well as gene expression at the single-cell level with detailed spatial location information, which can be processed into single-cell ST data $(C, x, y)$ with $C$ as cell type and $(x, y)$ as its spatial location [1]. Earlier generations of single-cell ST data, or similar data, such as CODEX [3], Visium [4], GeoMx (started in 2019; see [5]), etc., do not offer the same data information as CosMx. For instance, CODEX only focuses on protein markers, which gives data with many fewer analytes than ST data produced by the CosMx machine, though these data can also be processed into the same data form $(C, x, y)$; see [6,7], etc. Visium and GeoMx data provide the average information of many cells in a given area, but not at the single-cell level; see the review papers [6,8,9,10], among others.

For the general version of ST data, i.e., data with gene expression (not single-cell type) and spatial location, numerous computational and statistical methods have been developed to detect the “Spatial Domain”, which contains cells with similar gene expression profiles; see papers [11,12,13], among others. For data with cell type and spatial location, i.e., data $(C, x, y)$ with $C$ as cell type and $(x, y)$ as its spatial location, various “Cellular Neighborhood” (CN) methods have been developed to identify clusters, each having a unique composition of cell types; these methods are based on the kNN or modified kNN compositional matrix and clustering methods; see [11,14,15,16], among others. Due to data structure differences, Bhate et al. [14] pointed out that CNs are not the same as spatial domains. Methods for network-based “Community” detection have been developed for matrices obtained from symmetric networks; see [17,18], among others. However, none of these works is applicable or suitable for the huge datasets of the above-mentioned very recent single-cell ST data produced by CosMx and Xenium.

In cancer research, studies have shown that data with the same form as single-cell ST data can help identify groups of cell types with special characteristics that are associated with survival rate and different treatment outcomes. In a colorectal cancer study, Schurch et al. [15] used CODEX data to identify nine distinct cellular neighborhoods with different characteristics of the immune tumor microenvironment, which led to the discovery of the enrichment of PD-1⁺CD4⁺ T cells in a specific granulocyte cellular neighborhood that was associated with better survival for high-risk patients. In a triple negative breast cancer study, Shiao et al. [19] also used CODEX data to identify 12 distinct spatial districts, which helped detect and analyze different immune responses to treatment therapy for two groups of patients. In this article, for single-cell ST data we define ST-Community as a collection of cells with distinct cell-type composition and similar neighboring patterns based on nearby cell-percentages. This means that each cell in the st-community has nearby cell-percentages similar to those of neighboring cells in the same community, which characterizes the tumor microenvironment, and this concept of st-community includes the “cellular neighborhoods” (CNs) detected in [15] and the “spatial districts” detected in [19] as special cases. Note that an st-community contains multiple types of cells and may be located at different parts of the data, not just concentrated in one spatial area.

In this context, st-community detection for single-cell ST data obtained by new single-cell biotechnologies, such as CosMx and Xenium, is of great importance in cancer research. To take the information of neighboring cell types into account, the discussion given in Section 2 shows that the initial step should process single-cell ST data into compositional data [20], and an appropriate clustering method should then be used for st-community detection. The clustering method used in [19] was the Leiden method [21] based on vectors formed by a graph neural network, but the Leiden method is not applicable to compositional data. The clustering method used in [15] was the k-means method [22,23,24] for compositional data.

Other existing clustering methods which are applicable to compositional data are as follows: the Elbow k-means method [25], the Gap k-means method [26], the Mclust algorithm [27,28], the DBSCAN algorithm [29], and the HDBSCAN algorithm [30,31]. However, the single-cell ST data produced by CosMx are usually too large for these existing clustering methods to handle, and some of the methods rely on assumptions that are difficult to justify.

For these reasons, this article first proposes a novel compositional data method for single-cell ST data, called disk compositional data (DCD). It specifically considers the features of ST data produced by recent biotechnologies and serves as the initial step, providing each cell with nearby cell-percentages in the neighborhood of a disk with a chosen radius. The article then proposes an innovative computation method for st-community detection, called the DCD-TMHC method, where TMHC is constructed based on the 2-means (TM) method and the hierarchical clustering (HC) method [32,33]. The proposed DCD-TMHC computation method is generally applicable to single-cell ST data or to any data type with the same data form $(C, x, y)$. The simulation studies presented in Section 3 show that the DCD-TMHC method consistently performs better than the clustering methods listed above.

Further, we apply the proposed DCD-TMHC method and other clustering methods, including CN methods, to analyze a CosMx breast cancer dataset, which is a single-cell ST dataset with 9 different cell types and was produced by the NanoString Company. Based on the data analysis results, we compare the performance of these methods via logistic regression model, which shows that our proposed DCD-TMHC method is superior to other methods, especially in terms of assessment for different cancer categories. Thus, the DCD-TMHC method proposed in this article will be helpful and have an impact on future cancer research based on single-cell ST data for the improvement of cancer diagnosis and treatments.

The rest of this article is organized as follows. Section 2 reviews single-cell ST data and existing methods, and proposes the novel st-community detection method DCD-TMHC; Section 3 presents simulation study results; Section 4 conducts CosMx breast cancer data analysis using the proposed DCD-TMHC method and alternative methods, then makes comparison via logistic regression analysis; and Section 5 gives discussion and conclusion.

2. ST-Community Detection of Single-Cell ST Data

Spatial Transcriptomics (ST) techniques have existed for almost a decade, but were available only at the institutions where they were developed. Recently, new techniques, such as CosMx and Xenium, have been commercialized and have made ST technology more accessible [7]. ST quantifies messenger RNA transcripts for gene expression at the single-cell level with spatial context. In cancer research, it is often very difficult to extract RNA from formalin-fixed paraffin-embedded tissues because they were obtained previously during a patient’s treatment and are thus not “fresh” samples; see review paper [8]. Now, ST technologies by CosMx and Xenium can handle and process such tissues. CosMx allows the quantification of 1000+ RNA and 64+ protein targets, and Xenium can be run on a panel with a maximum of 480 gene markers.

In particular, the above-mentioned CosMx breast cancer data, with 9 different cell types, provide 19–25 fields of view (FOVs) from each breast cancer tissue sample, which are small selected rectangular regions, and the location $(x, y)$ of a cell $C$ in all FOVs from the same tissue sample is determined based on a common origin. Within each FOV, there are thousands of cells and their spatial information. Figure 1 shows an example of 25 FOVs produced by CosMx from one patient’s primary breast cancer sample, where each FOV is a small grid portion shown as a colored rectangle. Notice that some FOVs are adjacent to each other, while others are more distant from each other.

As mentioned in Section 1, st-community detection for single-cell ST data obtained by new and recent technologies is of great importance in cancer research. To identify the neighboring pattern of each cell based on nearby cell-percentages, the initial step of st-community detection is to process single-cell ST data into compositional data [20], which assign to each cell the proportion (cell-percentage) of every cell type under consideration in its neighborhood. The studies in [15] for colorectal cancer and in [34] for lung cancer both had the same data form $(C, x, y)$ as single-cell ST data, and both first processed their data into compositional data and then used the k-means method [22,23,24] to obtain st-communities.

2.1. Disk Compositional Data Matrix

For single-cell ST data, this subsection reviews the kNN compositional data matrix, proposes our disk compositional data (DCD) matrix, and compares the two, concluding that the DCD matrix is far more informative and accurate.

$k$NN Compositional Data

For data $(C, x, y)$ with cell type and spatial location, both [15] and [34] used the k-nearest neighbor (kNN) method with $k = 10$ to obtain compositional data, which is the most commonly used method in the computational biology literature. Below, we describe the kNN compositional data matrix obtained from one sample.

Suppose that sample $S_1$ contains $N_1$ cells with a total of $m$ different cell types. For cell $C_i$ in sample $S_1$, let $k_{ij}$ be the total number of cells of type $j$ among the $k$ nearest cells in sample $S_1$ around $C_i$. Then the kNN compositional data matrix for sample $S_1$ is given by:

$$C_1 = (v_1, \dots, v_{N_1})^{\top} \qquad (1)$$

where $C_1$ is an $N_1 \times m$ matrix whose $i$th row is the compositional vector given by:

$$v_i^{\top} = \left( \frac{k_{i1}}{k}, \dots, \frac{k_{im}}{k} \right), \quad i = 1, \dots, N_1. \qquad (2)$$

If single-cell ST data $D$ contain $q$ samples $S_1, \dots, S_q$ with total numbers of cells $N_1, \dots, N_q$, respectively, and each sample $S_j$ contains the same $m$ different cell types, then the kNN compositional data matrix for $D$ is given by:

$$C = (C_1^{\top}, \dots, C_q^{\top})^{\top} \qquad (3)$$

where $C$ is an $N \times m$ matrix with $N = N_1 + \dots + N_q$, and each $N_j \times m$ matrix $C_j$ is obtained in the same way as shown in Equation (1). Note that matrix $C$ reflects the neighboring cell patterns and restricts overly large values by Equation (2).
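As an illustration of Equations (1) and (2), the following minimal Python sketch builds the kNN compositional rows for one sample. It is not the implementation used in the article: the function name `knn_composition`, the integer coding of cell types as `0..m-1`, and the choice to exclude the center cell from its own $k$ nearest cells are all assumptions made here, and the brute-force search is only practical for small examples.

```python
import math

def knn_composition(cells, k, m):
    """cells: list of (cell_type, x, y) with cell_type in 0..m-1 (hypothetical coding).
    Returns one row per cell: the fraction of each type among its k nearest cells."""
    rows = []
    for i, (_, xi, yi) in enumerate(cells):
        # Distances from cell i to every other cell (brute force, O(N^2) overall).
        # Assumption: the center cell itself is not counted among its neighbours.
        dists = sorted(
            (math.hypot(xj - xi, yj - yi), tj)
            for j, (tj, xj, yj) in enumerate(cells)
            if j != i
        )
        counts = [0] * m
        for _, tj in dists[:k]:
            counts[tj] += 1  # k_ij in the notation of Equation (2)
        rows.append([c / k for c in counts])
    return rows

cells = [(0, 0.0, 0.0), (0, 1.0, 0.0), (1, 2.0, 0.0), (1, 10.0, 0.0)]
rows = knn_composition(cells, k=2, m=2)
```

Every row sums to 1 as long as the sample has more than $k$ cells. At the scale of the CosMx data a spatial index would be needed in place of the double loop.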

In Schurch et al. [15], $q = q_s = 35$ cancer samples with a total cell number $N = 132{,}437$ give a sample cell average of 3784. In Enfield et al. [34], $q = q_s = 198$ cancer samples with a total cell number $N = 2.3$ million give a sample cell average of 11,616. In comparison, the above-mentioned CosMx breast cancer data contain $q_s = 8$ cancer tissue samples with a total cell number $N = 601{,}634$, which gives a much bigger sample cell average of 75,204, and a total of $q = 192$ FOVs. The kNN compositional data restrict the cell number in the neighborhood of any cell $C$ to always be the chosen constant $k$, which can cause loss of information in regions with very high cell density, and can also give misleading information because exactly the $k$ nearest cells are always counted, however far away they are. For single-cell ST data, the distance between cell $C$ and its $k$th nearest cell can be very small, and can also be very large. These observations motivate the construction of a novel, more flexible, more adaptive, and more informative approach for processing single-cell ST data into compositional data, as follows.

Disk Compositional Data

Choosing a suitable radius $r$, for any cell $C_i$ in sample $S_1$ with distance to the boundary of $S_1$ (if an obvious boundary exists) no less than $r/2$, consider the following disk:

$$B_r(C_i) = \{ C \mid \| C - C_i \| \leq r \} \qquad (4)$$

which contains all cells in sample $S_1$ within radius $r$ of cell $C_i$, and let $n_{ij}$ be the total number of cells of type $j$ included in disk $B_r(C_i)$. Then the disk compositional data (DCD) matrix for sample $S_1$ is given by the $\tilde{N}_1 \times m$ matrix $D_1 = (u_1, \dots, u_{\tilde{N}_1})^{\top}$, where

$$u_i^{\top} = \left( \frac{n_{i1}}{n_i}, \dots, \frac{n_{im}}{n_i} \right), \quad i = 1, \dots, \tilde{N}_1 \qquad (5)$$

with $n_i = \sum_{j=1}^{m} n_{ij}$ as the total cell number in disk $B_r(C_i)$ and $\tilde{N}_1$ as the total number of cells in sample $S_1$ with distance to the boundary no less than $r/2$. For single-cell ST data $D$ containing $q$ samples $S_1, \dots, S_q$, with each sample $S_j$ containing the same $m$ different cell types, the disk compositional data (DCD) matrix for $D$ is given by

$$C_D = (D_1^{\top}, \dots, D_q^{\top})^{\top} \qquad (6)$$

where $C_D$ is an $N_D \times m$ matrix with $N_D = \tilde{N}_1 + \dots + \tilde{N}_q$, and each $\tilde{N}_j \times m$ matrix $D_j$ is obtained in the same way as $D_1$ above.

Cells too close to the sample boundary, such as the boundary of an FOV in CosMx data, are excluded in the above process of obtaining the DCD matrix $C_D$ because their corresponding disks (4) often contain too few cells, which gives unreliable st-community detection. In cases without an obvious boundary like that of an FOV, such as the simulation studies in Section 3 of this article, no cells in data $D$ are excluded when obtaining the DCD matrix $C_D$ given by (6).
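The construction in Equations (4) and (5), including the boundary exclusion, can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `disk_composition`, the integer coding of cell types, and the representation of the FOV as an axis-aligned rectangle `bounds` are assumptions made here, and the brute-force scan stands in for whatever spatial search the authors actually used.

```python
import math

def disk_composition(cells, r, m, bounds=None):
    """cells: list of (cell_type, x, y) with cell_type in 0..m-1 (hypothetical coding).
    bounds: optional (xmin, xmax, ymin, ymax) rectangle of the FOV (assumption).
    Returns (kept_indices, rows), one DCD row u_i per retained cell."""
    kept, rows = [], []
    for i, (_, xi, yi) in enumerate(cells):
        if bounds is not None:
            xmin, xmax, ymin, ymax = bounds
            edge = min(xi - xmin, xmax - xi, yi - ymin, ymax - yi)
            if edge < r / 2:
                continue  # too close to the boundary: cell is excluded
        counts = [0] * m
        for tj, xj, yj in cells:
            # Disk B_r(C_i) of Equation (4); it always contains C_i itself.
            if math.hypot(xj - xi, yj - yi) <= r:
                counts[tj] += 1  # n_ij
        n_i = sum(counts)
        kept.append(i)
        rows.append([c / n_i for c in counts])
    return kept, rows

cells = [(0, 0.0, 0.0), (1, 1.0, 0.0), (1, 5.0, 0.0)]
kept_all, rows_all = disk_composition(cells, r=1.5, m=2)
kept_fov, rows_fov = disk_composition(cells, r=1.5, m=2, bounds=(0.0, 6.0, -3.0, 3.0))
```

Unlike the kNN rows, the denominator $n_i$ varies from cell to cell, so an isolated cell (the third one above) gets a row based on very few cells; this is the situation that the choice of radius and the boundary exclusion are meant to limit.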

Comparison of $k$NN and Disk Compositional Data

In Figure 2a below, one FOV from the above-mentioned CosMx breast cancer data is displayed, where the use of different radii $r = 125, 250, 500$, respectively, is indicated. Note that near the boundary of the FOV, there are very few cells, and that for a smaller radius, disk $B_r(C_i)$ at times contains too few cells, while for a bigger radius, disk $B_r(C_i)$ contains too many cells. Thus, it is important to choose a suitable radius.

In Figure 2b, for the above-mentioned CosMx breast cancer data with $N = 601{,}634$ cells, we use $r = 250$ to obtain the proposed DCD matrix $C_D$ with $N_D = 524{,}366$, then display the histogram of the total cell numbers $n_i$ in disks $B_r(C_i)$ for all $q = 192$ FOVs.

In Figure 2c, for the above-mentioned CosMx breast cancer data with $N = 601{,}634$ cells, since to the best of our knowledge $k = 10$ is the most commonly used value, we use $k = 10$ to obtain the kNN compositional data matrix $C$, then display the histogram of the distances $d_{i(10)}$ between cell $C_i$ and its 10th nearest cell for all $q = 192$ FOVs.

Comparing the histograms in Figure 2, we see that Figure 2b shows that the overwhelming majority of disks $B_r(C_i)$ have total cell numbers $n_i$ in the range of 20–60, while Figure 2c shows that if radius $r = d_{i(10)}$ is used for cell $C_i$, more than 400,000 disks $B_r(C_i)$ with $r \approx 150$ contain only 10 cells. Moreover, Figure 2c, with the red line indicating the maximum value of the $d_{i(10)}$, shows that even with radius $r = d_{i(10)} > 1700$, some disks $B_r(C_i)$ contain only 10 cells. Thus, the DCD matrix is far more informative and accurate than the kNN compositional data matrix for huge single-cell ST data.
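The two quantities compared in Figure 2b,c can be computed on a toy example to see the contrast: with a fixed radius the cell count $n_i$ adapts to local density, whereas with a fixed $k$ the implied radius $d_{i(k)}$ does. The sketch below is illustrative only; the helper names and the toy coordinates are assumptions made here, not data from the article.

```python
import math

def disk_counts(points, r):
    """n_i: number of points (including the center) within distance r of each point."""
    return [sum(1 for q in points if math.dist(p, q) <= r) for p in points]

def kth_nn_distance(points, k):
    """d_i(k): distance from each point to its k-th nearest other point."""
    out = []
    for i, p in enumerate(points):
        d = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        out.append(d[k - 1])
    return out

# Toy layout: a dense clump (spacing 1) and a sparse stretch (spacing 100).
dense = [(float(i), 0.0) for i in range(5)]
sparse = [(1000.0 + 100.0 * i, 0.0) for i in range(5)]
points = dense + sparse

counts = disk_counts(points, r=2.0)          # varies with local density
radii = kth_nn_distance(points, k=2)         # varies by a factor of 50 or more
```

In the dense clump the 2nd-nearest cell is at distance 2, while in the sparse stretch it is at distance 100 or more, even though both neighborhoods are summarized by the same number of cells under kNN.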

2.2. Existing Clustering Methods

After processing single-cell ST data into compositional data matrix, the next step is to use an appropriate clustering method for st-community detection. Below is a brief review of several existing and commonly used clustering methods in the literature.

Hierarchical Clustering Method: Suppose that $G_1, \dots, G_n, G_{n+1}$ is a partition of dataset $D$. We obtain a further partition $\hat{G}_1, \dots, \hat{G}_n$ of $D$ via the following equation:

$$\{ \hat{G}_1, \dots, \hat{G}_n \} = \underset{E_1, \dots, E_n}{\arg\min} \sum_{i=1}^{n} \sum_{v_j \in E_i} \| v_j - \bar{v}_i \|^2 \qquad (7)$$

where $E_1, \dots, E_n$ is any partition of $D$ with one of them being the union of two of the $G_i$ and the rest remaining the same, and $\bar{v}_i$ is the average of all points in $E_i$. If $N$ is the total number of points in $D$, then for a selected $k$, the hierarchical clustering (HC) method [24,32,33] obtains $k$ clusters by starting with a partition of $D$ into $N$ clusters, repeatedly using (7), and stopping at $n = k$. This method is very time-consuming and thus cannot handle huge single-cell ST data.
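As a worked note on criterion (7): merging two clusters $A$ and $B$ of a partition while leaving all others unchanged raises the total within-cluster sum of squares by an amount that depends only on the two cluster sizes and the distance between their means. The identity below is a standard result, stated here as background rather than taken from the article:

```latex
\sum_{v \in A \cup B} \| v - \bar{v}_{A \cup B} \|^2
  - \sum_{v \in A} \| v - \bar{v}_{A} \|^2
  - \sum_{v \in B} \| v - \bar{v}_{B} \|^2
  = \frac{|A|\,|B|}{|A| + |B|} \, \| \bar{v}_{A} - \bar{v}_{B} \|^2 .
```

Each application of (7) therefore amounts to merging the pair of current clusters with the smallest right-hand side, which is the rule usually called Ward's minimum-variance linkage. Since every merge step must compare pairs of clusters and the procedure starts from $N$ singleton clusters, the cost grows quickly with $N$, consistent with the remark that HC cannot be run directly on the full data.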

$k$-Means Method: For a selected $k$, the k-means method [22,23,24] obtains $k$ clusters $G_1, \dots, G_k$ of dataset $D$ by solving the following equation:

$$\{ G_1, \dots, G_k \} = \underset{E_1, \dots, E_k}{\arg\min} \sum_{i=1}^{k} \sum_{v_j \in E_i} \| v_j - \bar{v}_i \|^2 \qquad (8)$$

where $E_1, \dots, E_k$ is any partition of $D$ and $\bar{v}_i$ is the average of all points in $E_i$.
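To make the objective in Equation (8) concrete, the short Python sketch below evaluates the within-cluster sum of squares for a given partition. The function name `within_ss` and the toy points are assumptions made for illustration; the sketch only evaluates the criterion and does not search for the minimizing partition.

```python
def within_ss(clusters):
    """clusters: list of clusters, each a non-empty list of equal-length vectors.
    Returns the double sum in Equation (8) for this partition."""
    total = 0.0
    for pts in clusters:
        dim = len(pts[0])
        # Cluster mean, i.e. v-bar_i in Equation (8).
        mean = [sum(p[d] for p in pts) / len(pts) for d in range(dim)]
        total += sum((p[d] - mean[d]) ** 2 for p in pts for d in range(dim))
    return total

# Two partitions of the same four points into k = 2 clusters.
good = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
bad = [[(0.0, 0.0), (10.0, 0.0)], [(0.0, 2.0), (10.0, 2.0)]]

good_value = within_ss(good)
bad_value = within_ss(bad)
```

The partition that groups nearby points has the smaller objective value, which is the sense in which the k-means solution is "best" among all partitions into $k$ clusters.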

Elbow $k$-Means Method: The Elbow method uses a graphical approach to choose the optimal $k$ for the k-means method given by Equation (8); see [25].

Gap $k$-Means Method: The Gap statistic method uses a statistical approach to choose the optimal $k$ for the k-means method given by Equation (8); see [26].

Mclust Algorithm: The Mclust algorithm is to fit the data by different Gaussian mixture models, then use the Bayesian Information Criterion (BIC) to identify the optimal number of clusters by choosing the best Gaussian mixture model; see [27,28].

DBSCAN Algorithm: The density-based spatial clustering of applications with noise (DBSCAN) algorithm uses a density-based method to identify clusters, a process in which two parameters eps and minPts are involved; see [29].

HDBSCAN Algorithm: The hierarchical density-based spatial clustering of applications with noise (HDBSCAN) algorithm uses a density-based method together with a minimum spanning tree to identify clusters, a process in which one parameter minPts is involved; see [30,31].

Both the DBSCAN and HDBSCAN algorithms are very time-consuming. Other existing methods include the Seurat method [35], the UMAP-Mclust method [1], etc., which are incompatible with compositional data, and our simulation studies and data analysis show that these methods perform poorly for large single-cell ST data.

2.3. Data Transformation and SigClust

Before proposing our novel and innovative DCD-TMHC computation method in this article for st-community detection of single-cell ST data, we first need to describe Aitchison log-ratio transformation and SigClust method, which are used in the process of our newly developed DCD-TMHC method.

Aitchison Log-Ratio Transformation: The Aitchison log-ratio transformation maps compositional data from a simplex to real Euclidean space, which in many situations gives better results for clustering; see [36,37]. However, it has also been pointed out that in certain situations, such a transformation may not be necessary or appropriate; see [20]. In this article, we consider such a transformation inappropriate if there are too many zeros in the DCD matrix, because each zero indicates that the neighborhood of a cell contains no cells of some other type, and the log-ratio is undefined at zero.
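As an illustration, the centered log-ratio (clr) is one common member of the Aitchison log-ratio family; the article does not state which variant was used, so the choice of clr here is an assumption. The zero handling in the sketch (replacing zeros by a small `eps` and re-normalizing) is likewise a hypothetical convenience, whereas the article's stated practice is to skip the transformation when zeros are numerous.

```python
import math

def clr(row, eps=1e-6):
    """Centered log-ratio of one compositional row (sketch, not the authors' code).
    Zeros are replaced by eps and the row is re-closed to sum to 1 (assumption)."""
    shifted = [max(v, eps) for v in row]
    total = sum(shifted)
    closed = [v / total for v in shifted]
    logs = [math.log(v) for v in closed]
    mean_log = sum(logs) / len(logs)  # log of the geometric mean
    return [value - mean_log for value in logs]

uniform = clr([1 / 3, 1 / 3, 1 / 3])
with_zero = clr([0.5, 0.5, 0.0])
```

A uniform composition maps to the origin and every transformed row sums to zero. The row containing a zero shows why many zeros are problematic: its last coordinate is driven entirely by the arbitrary choice of `eps`.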

SigClust: The statistical significance of clustering (SigClust) method is the hypothesis test:

$$H_0: \text{the data are from a single Gaussian distribution} \quad \text{vs.} \quad H_1: H_0 \text{ does not hold} \qquad (9)$$

which can be used to determine whether two clusters are significantly different or whether they should be put together as one cluster; see [38,39].

2.4. Proposed DCD-TMHC Computation Method

The HC method is well known for being very computationally intensive and thus cannot handle our huge single-cell ST data, even using the supercomputer resources at the University of North Carolina at Chapel Hill. On the other hand, the 2-means method can run on the huge single-cell ST data with good computing speed. Thus, the basic idea of our proposed DCD-TMHC method is to apply the 2-means method to the huge single-cell ST data consecutively until the clusters obtained are small enough for the HC method to handle, then apply the HC method to each cluster. This allows us to use the HC method on subsets of the huge single-cell ST data, which are identified by the 2-means method and verified by SigClust for their distinctiveness.

In practice, researchers with greater computation power can apply the 2-means method fewer times to quickly obtain clusters that are “not too big”, while researchers with less computation power need to apply the 2-means method more times. Our proposed DCD-TMHC computation method adapts to data of any large size, and it is described below as the DCD-TMHC Computation Method:

Step 1: Process the single-cell ST data into the DCD matrix, then, if appropriate, transform it using the Aitchison log-ratio transformation;
Step 2: Use the 2-means method consecutively to obtain clusters of not too big size as shown in Figure 3, where SigClust is used to make sure that the split clusters are distinct, and any two non-distinct clusters are considered as one st-community;
Step 3: For distinct clusters obtained in Step 2, apply the HC method as shown in Figure 4, where SigClust is used on each large enough node of the dendrogram to determine whether the split should be kept: (a) if not, or if the node is not large enough, stop at the node and treat it as one st-community; (b) if yes, continue SigClust down to the next node.

Figure 3. Successive 2-Means Method.

Figure 4. Step 3 of DCD-TMHC Method.

Remark 1.

In real data analysis or in simulation studies, the “not too big” in Step 2 is determined by computation power, and “not large enough” in Step 3 uses a chosen number $K_1$ to stop at Step 3 (a). The studies in this article all used the supercomputer at the University of North Carolina at Chapel Hill. Moreover, Step 2 and the related Figure 3 suggest an alternative clustering method, called the successive 2-means (STM) method, which keeps running Step 2 without ever going to Step 3 to run the HC method. The STM computation method stops at a node if it is “not large enough” for the chosen $K_1$ or if one of its two split clusters has size less than a chosen number $K_2$. In practice, the choice of radius is based on the discussion of Figure 2a,b at the end of Section 2.1, which ensures that the overwhelming majority of disks contain a total number of cells in the range of 20–60. For the choice of $K_1$, we first visualize the data, such as the real data example shown in Figure 2a or the simulation data shown in Figure 5, then choose $K_1$ to be a bigger number if the communities appear likely to be of bigger size with the same spatial and neighborhood pattern, and choose $K_1$ to be a smaller number if certain communities appear to be quite small. Moreover, the number $K_2$ is chosen based on how much smaller one of the two split clusters is after 2-means clustering.
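The successive 2-means (STM) idea described in Step 2 and Remark 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the deterministic farthest-point initialization of 2-means is chosen here for reproducibility, and the SigClust check is replaced by a hypothetical callable `is_distinct` that accepts every split by default. The parameters `k1` and `k2` play the roles of $K_1$ and $K_2$.

```python
def two_means(points, max_iter=100):
    """Lloyd's algorithm with k = 2 on a list of equal-length vectors.
    Returns two lists of indices; the second may be empty for degenerate input."""
    dim = len(points[0])

    def sqdist(p, q):
        return sum((p[d] - q[d]) ** 2 for d in range(dim))

    def mean(idx):
        return [sum(points[i][d] for i in idx) / len(idx) for d in range(dim)]

    # Deterministic initialization (an assumption of this sketch).
    centre = mean(range(len(points)))
    a = max(points, key=lambda p: sqdist(p, centre))
    b = max(points, key=lambda p: sqdist(p, a))
    centres = [list(a), list(b)]
    assign = None
    for _ in range(max_iter):
        new = [0 if sqdist(p, centres[0]) <= sqdist(p, centres[1]) else 1
               for p in points]
        if new == assign:
            break  # assignments are stable
        assign = new
        for c in (0, 1):
            idx = [i for i, lab in enumerate(assign) if lab == c]
            if idx:
                centres[c] = mean(idx)
    left = [i for i, lab in enumerate(assign) if lab == 0]
    right = [i for i, lab in enumerate(assign) if lab == 1]
    return left, right


def successive_two_means(points, k1, k2, is_distinct=lambda left, right: True):
    """Split repeatedly by 2-means; a node becomes a final cluster when it has
    fewer than k1 points, when a child would have fewer than k2 points, or when
    is_distinct (a stand-in for SigClust) rejects the split."""
    leaves = []
    stack = [list(range(len(points)))]
    while stack:
        node = stack.pop()
        if len(node) < max(k1, 2):
            leaves.append(node)
            continue
        l, r = two_means([points[i] for i in node])
        left = [node[i] for i in l]
        right = [node[i] for i in r]
        if min(len(left), len(right)) < max(k2, 1) or not is_distinct(left, right):
            leaves.append(node)
            continue
        stack.extend([left, right])
    return leaves


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0)]
clusters = successive_two_means(points, k1=5, k2=2)
merged = successive_two_means(points, k1=5, k2=2, is_distinct=lambda l, r: False)
```

In the full DCD-TMHC method the recursion would instead stop once nodes are small enough for HC, and the leaves would be passed on to Step 3; the sketch covers only the STM variant.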

3. Simulation Studies

This section presents some simulation results on st-community detection using various existing methods listed in Section 2 and our proposed DCD-TMHC computation method, plus the alternative STM method. There are 5 different simulation study settings which are summarized in Table 1. Figure 5 displays the visualization of these 5 simulation settings, and Table 2a–e present the summary of simulation results.

In the setting of Simulation 1, there are 4 intended st-communities, each of which is in its own distinct spatial region. Each of three spatial regions has only one unique cell type present, while the 4th intended st-community has both cell type 1 and cell type 2 mixed together. The coordinates of location $(x, y)$ for each cell are generated by normal distributions. These st-communities are designed to represent different cell neighboring patterns found in single-cell ST data. The settings of Simulations 2–5 are similarly designed.

In Table 1, we denote $U\{m_1, m_2\}$ as the discrete uniform distribution between positive integers $m_1$ and $m_2$, $U(a, b)$ as the continuous uniform distribution with support interval $(a, b)$, and $N(\mu, \sigma^2)$ as the normal distribution with mean $\mu$ and variance $\sigma^2$, where, for simplicity, the unit $1 = 1000$ is used for the parameters in the distribution notation. For each simulation setting, the number of cells is generated using the discrete uniform distribution listed at the top, except that different discrete uniform distributions are used in ST-Community 4 of Simulation 1 and in ST-Community 7 of Simulation 5, respectively. For instance, the notation “Cell 1 (12,462)” in ST-Community 1 of Simulation 1 means that the total number of Cell Type 1 cells, 12,462, is generated from $U\{10, 25\}$, where 10 and 25 represent 10,000 and 25,000, respectively. Also, under Simulation 2, the notations “Cell 2 (50% = 7370)” in ST-Community 2 and “Cell 2 (50% = 7404)” in ST-Community 3 mean that 50% of the number generated from $U\{10, 15\}$ for Cell Type 2 is used in ST-Communities 2 and 3, respectively.
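The generating mechanism described above (cell count from a discrete uniform distribution, coordinates from normal distributions) can be sketched in Python as follows. The function name `simulate_community` and all numerical parameters in the example call are hypothetical placeholders, not the values of Table 1, and the uniform random mixing of cell types within a community is an assumption of this sketch.

```python
import random

def simulate_community(rng, cell_types, n_range, centre, sd):
    """Draw one intended st-community as a list of (cell_type, x, y).
    n_range: (m1, m2) for the discrete uniform U{m1, m2} on the cell count.
    centre, sd: mean and standard deviation (not variance) of the normal
    distributions used for the x and y coordinates."""
    n = rng.randint(n_range[0], n_range[1])  # inclusive on both ends
    return [
        (rng.choice(cell_types), rng.gauss(centre[0], sd), rng.gauss(centre[1], sd))
        for _ in range(n)
    ]

rng = random.Random(1)  # fixed seed so the sketch is reproducible
single_type = simulate_community(rng, [1], (100, 250), (0.0, 0.0), 5.0)
mixed = simulate_community(rng, [1, 2], (100, 250), (30.0, 30.0), 5.0)
cells = single_type + mixed
```

The resulting list has the same $(C, x, y)$ form as single-cell ST data and could be passed to a compositional-data step; matching the article's settings would require the counts and distribution parameters from Table 1.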

For the 5 simulation settings in Table 1, we first process the data into the DCD matrix (6), but the Aitchison data transformation is not used here because the data contain too many zeros. Table 2a–e only include simulation results for some existing clustering methods, our proposed DCD-TMHC computation method, and the alternative STM method, because the HC method is too time-consuming, and methods such as Seurat, UMAP-Mclust, etc., are incompatible with compositional data. The parameters listed in Table 2a–e were chosen based on experimental tests carried out during the simulation studies.

Note that Table 1 gives intended st-community number k for each simulation setting; thus k is used for the k-means and Mclust methods, while all other methods determine the number of st-communities on their own, and “NA” indicates no results available for certain methods. Figure 5 displays the visualization of 5 simulation settings, which agrees with the actual intended st-community numbers of Simulations 1, 2 and 4. For Simulation 3, 5 is the intended st-community number, but Figure 5c clearly shows that the actual st-community number is 7–8, because the region with overlapping colors of orange and red has some parts that are more dense than other parts, and the boundary between colors orange and red has some parts with solid orange surrounding red, while other parts have sparse orange surrounding red. Similarly, Simulation 5 has 7 as the intended st-community number, but Figure 5e suggests that the actual st-community number is 9–10.

To check the accuracy of st-community detection, a good numerical measure is the Adjusted Rand Index (ARI) [40], which takes values in the interval $[-1, 1]$, with a value close to 1 indicating high accuracy of st-community detection. In Table 2a–e, the ARIs are computed for all listed methods assuming that the intended st-community numbers given in Table 1 are true.
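For reference, the ARI can be computed from the contingency table of two labelings using pair counts. The sketch below follows the standard formula; the function name and the handling of the degenerate case (returning 1.0 when both labelings are trivial) are choices made here, and at least two labeled cells are assumed.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(truth, pred):
    """ARI between two labelings of the same n >= 2 items (standard pair-count form)."""
    n = len(truth)
    pairs = Counter(zip(truth, pred))   # contingency table n_ij
    rows = Counter(truth)               # row sums a_i
    cols = Counter(pred)                # column sums b_j
    sum_ij = sum(comb(c, 2) for c in pairs.values())
    sum_a = sum(comb(c, 2) for c in rows.values())
    sum_b = sum(comb(c, 2) for c in cols.values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings put everything together
    return (sum_ij - expected) / (max_index - expected)

same_up_to_renaming = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

The index is invariant to renaming of cluster labels, equals 1 for identical partitions, and can be negative when agreement is worse than expected by chance. It is computed against the intended labels, so it is only meaningful as an accuracy measure when those intended labels match the actual structure of the data.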

Remark 2.

In Table 2a–e, our proposed DCD-TMHC method is clearly superior to all other methods because it consistently and correctly detects the actual st-community numbers, and it has the highest ARI for Simulations 1, 2 and 4, for which the intended st-communities are the actual ones. The ARI values in Table 2c,e for Simulations 3 and 5, respectively, do not represent the accuracy of st-community detection because the intended st-communities given in Table 1 are not the actual st-communities in these two cases. From the simulation results in Table 2a–e, it is easy to see that the Gap k-means and HDBSCAN methods do not work well. Moreover, the Mclust method does not work well either because, without the given st-community number $k$, this method has 1–9 as the default st-community numbers, and the finally determined st-community number is based on the highest BIC out of 14 different Gaussian mixture models; see Figure 6 for the BIC plot under the Simulation 1 setting, which gives 1 as the detected st-community number. For the settings of Simulations 2–5, Mclust performs similarly to what is shown in Figure 6 for the setting of Simulation 1. In summary, the DCD data generated in the simulation studies are very dense, and the simulation results thus indicate that the existing clustering methods do not handle very dense compositional data well, such as single-cell ST data produced by CosMx, as is also shown in the real data analysis below.

4. CosMx Data Analysis

This section applies our proposed DCD-TMHC computation method, a few other methods and some CN methods to analyze a CosMx breast cancer dataset, which is a single-cell ST dataset and was produced by the CosMx machine at the NanoString Company (Bothell, WA, USA). The goal is to achieve st-community detection, then analyze its impact on cancer research.

The CosMx data under consideration here consist of eight samples from four people; each of them had one primary breast cancer sample and one metastasis sample. The total cell number from the 8 samples is $N = 601{,}634$, and the $m = 9$ cell types under consideration, all highly relevant to cancer research, are B-plasma, endothelial, fibroblasts, hepatocytes, macrophages, mast, normal-BEC, T, and tumor cells.

For each sample, cell types were annotated manually using the Leiden method together with the expression of cell marker genes, and 19–25 FOVs were manually chosen so as to contain many tumor cells while also reflecting other cell types present in the data, such as normal breast epithelial (normal-BEC) cells, hepatocytes, etc.; see Figure 1 for the description of FOVs, and see Table 3 below for the data summary.

As the initial step of st-community detection, we use radius $r = 250$ to process the CosMx data $D$ into the DCD matrix $C_{D}$ given by (6) with $N_{D} = 524{,}366$ and $m = 9$; here the disk given by (4) is not applied to cells $C_{i}$ that are too close to the boundary of an FOV or too isolated. After applying the Aitchison log-ratio transformation to matrix $C_{D}$, we use our proposed DCD-TMHC method, the STM method, the 10-means method, and the Elbow k-means method to detect st-communities; see the results displayed in Figure 7a–d, which give the bar charts of the $m = 9$ cell types for each detected st-community. Each bar in a detected st-community represents the percentage of one cell type within that st-community.
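The preprocessing described above (disk-based cell-type proportions followed by a log-ratio transformation) can be sketched as follows. This is only an illustrative sketch under several assumptions that are not taken from the paper: the function names `disk_composition` and `clr` are hypothetical, the center cell is counted inside its own disk, the centered log-ratio is used as the Aitchison transform, zeros are handled with a small pseudo-count `eps`, and the exclusion of boundary or isolated cells defined through (4) and (6) is not modeled.

```python
import math


def disk_composition(cells, r, n_types):
    """Return, for each cell, the cell-type proportions inside its disk.

    cells: list of (type_index, x, y). The center cell is counted too,
    which is an assumption of this sketch.
    """
    rows = []
    for _, xi, yi in cells:
        counts = [0] * n_types
        for c, x, y in cells:
            if (x - xi) ** 2 + (y - yi) ** 2 <= r * r:
                counts[c] += 1
        total = sum(counts)  # at least 1 because the center cell is included
        rows.append([k / total for k in counts])
    return rows


def clr(row, eps=1e-6):
    """Centered log-ratio transform with a small pseudo-count for zeros."""
    shifted = [p + eps for p in row]
    s = sum(shifted)
    logs = [math.log(p / s) for p in shifted]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


# Toy data: (cell type, x, y) with two cell types.
cells = [(0, 0.0, 0.0), (1, 100.0, 0.0), (0, 1000.0, 0.0)]
dcd = disk_composition(cells, r=250, n_types=2)
transformed = [clr(row) for row in dcd]
print(dcd)
```

The double loop is quadratic in the number of cells, so a dataset of the size analyzed here would need a spatial index or a per-FOV grid instead; the sketch is only meant to make the row-wise composition and transformation explicit.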

In Figure 7, the Elbow k-means method determines $k = 20$, the DCD-TMHC computation method uses $K_{1} = 1000$, and the STM method uses $K_{1} = 5000$, $K_{2} = 100$. In our analysis, we also consider the DBSCAN method, which detects only two st-communities; thus its result is not included in Figure 7. Other clustering methods, such as Gap k-means, HDBSCAN, Mclust, etc., are also not included here, both because of the discussion of the simulation results in Remark 2 and because of their poor performance on this CosMx data $D$ in our analysis.

In addition, some CN methods mentioned in Section 1 are also considered for analyzing the CosMx breast cancer data here, such as CNE [16], CF-IDF [41], SpatialLDA [42], SPIAT [43], and CytoCommunity [44]. Except for the CNE and CF-IDF methods, however, they cannot handle a dataset this large, and both CNE and CF-IDF require a pre-chosen st-community number. It is interesting to notice that our proposed DCD-TMHC method detects 19 st-communities for the CosMx data, while the Elbow k-means method determines $k = 20$ st-communities. Thus, we use $k = 20$ as the chosen st-community number for CNE and CF-IDF. Due to the very poor performance of the CF-IDF method, we only include the results of the CNE-20 method in Figure 7e.

To the best of our knowledge, among all existing spatial domain methods, only the SpaDo method in [45] can directly handle single-cell ST data. We applied the SpaDo method to the CosMx breast cancer data, but it also cannot handle a dataset this large; thus no results from the SpaDo method are included in Figure 7.

Note that the detected st-communities among 5 methods displayed in Figure 7 have some similarities and also some quite noticeable differences. Our DCD-TMHC method detects ST-Community 5 as having the highest percentage of tumor cells, 99.7%, in the st-community, while STM, 10-means, Elbow k-means and CNE-20 methods detect ST-Communities 3, 5, 3 and 1 as having the highest tumor cell percentages 98.7%, 97.3%, 98.8% and 98.5%, respectively; thus they are quite similar; see Table 4. On the other hand, for immune cell as the sum of B-plasma and T cells, the DCD-TMHC method detects ST-Community 19 as having the highest percentage of immune cells, 62.2%, in the st-community, while STM, 10-means, Elbow k-means and CNE-20 methods detect ST-Communities 32, 8, 12 and 6 as having the highest immune cell percentages 48.0%, 27.4%, 50.6% and 87.0%, respectively, thus they are quite different; see Table 5.

Among the many other differences between the st-communities detected by different methods, a particularly noticeable one is that the DCD-TMHC method detects two ST-Communities, 8 and 14, as having the highest percentage of normal cells; the STM method detects two such ST-Communities, 26 and 40; and the CNE-20 method also detects two such ST-Communities, 8 and 12. In contrast, the 10-means method and the Elbow k-means method each identify only one such st-community: ST-Community 1 for the 10-means method and ST-Community 13 for the Elbow k-means method. Detecting two st-communities with the highest percentages of normal cells better reflects the existing heterogeneity of neighboring patterns of normal cells.

As mentioned in Sections 1 and 2, both [15] for colorectal cancer and [34] for lung cancer used the 10-means method to detect st-communities in their data. Schürch et al. [15] associated certain st-communities with better survival for high-risk patients, while Enfield et al. [34] identified a certain st-community with a high percentage of current smokers, which is highly associated with lung cancer. Thus, based on our detected st-communities in Figure 7, we are interested in their relevance to, and association with, cancer research, as follows.

In Table 4, for ST-Community 5 detected by our proposed DCD-TMHC method, each of the 8 samples listed in Table 3 has the following variables:

$$x_{i} = \frac{k_{i}}{N_{i}} \quad \text{and} \quad y_{i} = \begin{cases} 1, & \text{if } S_{i} \text{ is Primary} \\ 0, & \text{if } S_{i} \text{ is Metastasis} \end{cases}, \qquad i = 1, \dots, 8$$

(10)

where $k_{i}$ is the total number of cells in sample $S_{i}$ that belong to ST-Community 5 detected by the DCD-TMHC method, $N_{i}$ is the total cell number in sample $S_{i}$, and the values of the $x_{i}$'s in Table 4 are $x_{1} = 0.40\%$, $x_{2} = 39.30\%$, etc. Thus, this is a dataset observed for a binary response variable $Y$ and an explanatory variable $X$, which is naturally analyzed using the following logistic regression model [46,47]:

$$\pi(x) = P\{Y = 1 \mid X = x\} = \frac{\exp(\alpha + \beta x)}{1 + \exp(\alpha + \beta x)},$$

(11)

and the logistic regression (LR) estimators $\hat{\alpha} = 1.375$ and $\hat{\beta} = -0.219$ are computed based on data (10) and given in Table 4. In turn, the estimated logistic regression curve is

$$\hat{\pi}(x) = \frac{\exp(\hat{\alpha} + \hat{\beta} x)}{1 + \exp(\hat{\alpha} + \hat{\beta} x)}$$

(12)

which is displayed as a red curve in Figure 8. Note that $\pi(x)$ in Equation (11) is the conditional probability of being primary breast cancer for a given value $x$, and $\hat{\pi}(x)$ above is the estimate of $\pi(x)$.
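To make the fitting step in Equations (10)–(12) concrete, here is a minimal sketch that fits the two-parameter logistic model to the DCD-TMHC rows of Table 4 by Newton-Raphson, using only the Python standard library. The function name `fit_logistic`, the zero starting values and the stopping rule are assumptions of this sketch, and the paper does not state which software produced its estimates. The $x_i$ are entered in percent, as tabulated.

```python
import math


def fit_logistic(x, y, iters=50, tol=1e-10):
    """Maximum-likelihood fit of P(Y=1|x) = exp(a+b*x) / (1+exp(a+b*x))."""
    a, b = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(a + b * xi)))
            w = p * (1.0 - p)
            g0 += yi - p            # score for the intercept
            g1 += (yi - p) * xi     # score for the slope
            h00 += w                # entries of the information matrix
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (information) * step = score.
        da = (h11 * g0 - h01 * g1) / det
        db = (h00 * g1 - h01 * g0) / det
        a += da
        b += db
        if abs(da) + abs(db) < tol:
            break
    return a, b


# DCD-TMHC rows of Table 4: x in percent, y = 1 for Primary, 0 for Metastasis.
x = [0.40, 39.30, 3.26, 12.58, 7.28, 6.95, 2.98, 0.59]
y = [1, 0, 1, 0, 1, 0, 1, 0]
alpha_hat, beta_hat = fit_logistic(x, y)
print(alpha_hat, beta_hat)
```

With the rounded $x_i$ from Table 4, the fit should land close to the reported $\hat{\alpha} = 1.375$ and $\hat{\beta} = -0.219$; small differences can arise from rounding of the tabulated percentages.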

For the remaining data in Table 4, produced by the four methods STM, 10-means, Elbow k-means and CNE-20, we obtain four estimated logistic regression curves using Equations (10)–(12) and present them in Figure 8 as well for comparison.

Table 5 and Table 6 are presented in the same way as Table 4 and are based on the detected st-communities with the highest percentages of immune cells and normal cells, respectively. Following Equations (10)–(12), we obtain estimated logistic regression curves for Table 5 and Table 6, and present them in Figure 9 and Figure 10, respectively.

Remark 3.

Logistic Regression Curves in Figure 8. From Equations (10)–(12), we see that in Figure 8, the five estimated logistic regression curves based on the data in Table 4 are all decreasing functions, and the red curve, based on the DCD-TMHC method's detection of ST-Community 5 as having the highest percentage of tumor cells, is located below the other four curves. The decreasing red curve means that for a sample $S_{i}$ with a larger percentage $x_{i}$ of cells located in ST-Community 5, the patient with cancer tissue sample $S_{i}$ has less chance of being primary breast cancer, by the meaning of $\pi(x)$ and $\hat{\pi}(x)$ given in Equations (11) and (12). Since $P\{Y = 0 \mid X = x\} = 1 - \pi(x)$, the decreasing red curve also means that for a sample $S_{i}$ with a larger percentage $x_{i}$ of cells located in ST-Community 5, the patient with cancer tissue sample $S_{i}$ has a greater chance of being at metastasis of cancer. Our proposed DCD-TMHC method makes this assessment more sharply than the other four methods: the red curve $\hat{\pi}_{R}(x)$ by the DCD-TMHC method being located below the other four curves in Figure 8 means that the curve $1 - \hat{\pi}_{R}(x)$ is located above the other four curves $1 - \hat{\pi}(x)$; that is, for the same large value of $x$, the curve $1 - \hat{\pi}_{R}(x)$ by the DCD-TMHC method predicts a greater chance of being at metastasis of cancer than each of the other four methods.
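The ordering of the curves discussed in Remark 3 can be checked directly from the estimators reported in Table 4. The sketch below evaluates Equation (12) for each method and computes where each curve meets the DCD-TMHC curve; the dictionary `COEF` and the helper names are hypothetical, and the inputs are the rounded table values, with $x$ in percent.

```python
import math

# (alpha_hat, beta_hat) as reported in Table 4; x is in percent.
COEF = {
    "DCD-TMHC": (1.375, -0.219),
    "STM": (1.344, -0.199),
    "10-means": (1.354, -0.071),
    "Elbow k-means": (1.245, -0.119),
    "CNE-20": (3.922, -0.076),
}


def pi_hat(method, x):
    """Estimated probability of Primary at x, Equation (12)."""
    a, b = COEF[method]
    return 1.0 / (1.0 + math.exp(-(a + b * x)))


def crossing_with_reference(method, reference="DCD-TMHC"):
    """x at which two fitted curves meet (equal linear predictors)."""
    a0, b0 = COEF[reference]
    a1, b1 = COEF[method]
    return (a1 - a0) / (b0 - b1)


others = [m for m in COEF if m != "DCD-TMHC"]
crossings = {m: crossing_with_reference(m) for m in others}
for x in (5.0, 10.0, 20.0):
    print(x, {m: round(pi_hat(m, x), 3) for m in COEF})
print(crossings)
```

With these rounded coefficients, the largest crossing point is with the STM curve, at about $x = 1.55$ (percent), so the DCD-TMHC curve lies below the other four for larger $x$; this is consistent with the statement in Remark 3 about large values of $x$.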

Remark 4.

Logistic Regression Curves in Figure 9 and Figure 10. From Remark 3 on the interpretation of Figure 8, we see clearly that in Figure 9, the three estimated logistic regression curves, red $\hat{\pi}_{R}(x)$, green $\hat{\pi}_{G}(x)$ and orange $\hat{\pi}_{O}(x)$, based on the data in Table 5 via ST-Communities 19, 32 and 12 as having the highest percentages of immune cells detected by the DCD-TMHC, STM and Elbow k-means methods, respectively, are all increasing functions, while the red curve $\hat{\pi}_{R}(x)$ by the DCD-TMHC method is located well above the two curves $\hat{\pi}_{G}(x)$ and $\hat{\pi}_{O}(x)$. This means that for a sample $S_{i}$ with a large percentage $x_{i}$ of cells located in ST-Community 19 detected by the DCD-TMHC method, the patient with cancer tissue sample $S_{i}$ has a large chance of being primary breast cancer, and that for the same large value of $x$, the red curve $\hat{\pi}_{R}(x)$ by the DCD-TMHC method predicts a greater chance of being primary breast cancer than both the STM method and the Elbow k-means method. However, it should be noticed that the blue estimated logistic regression curve $\hat{\pi}_{B}(x)$ by the 10-means method is a slightly decreasing function and the brown estimated logistic regression curve $\hat{\pi}_{BR}(x)$ by the CNE-20 method is nearly flat (Table 5 gives $\hat{\beta} = -0.058$ and $\hat{\beta} = 0.008$, respectively), which means that the 10-means method and the CNE-20 method are not properly predictive. Thus, our proposed DCD-TMHC method has superior performance compared to all other methods for detecting immune cell related st-communities.

In Figure 10, we see that all five estimated logistic regression curves based on the data in Table 6 are increasing functions; they are based on the st-communities with the highest percentages of normal cells detected by the 5 different methods. The curves $\hat{\pi}_{R}(x)$ and $\hat{\pi}_{G}(x)$ by the DCD-TMHC and STM methods, respectively, are located quite close to each other, and the curves $\hat{\pi}_{B}(x)$ and $\hat{\pi}_{O}(x)$ by the 10-means and Elbow k-means methods, respectively, are also located quite close to each other. The brown curve $\hat{\pi}_{BR}(x)$ by the CNE-20 method is located roughly in the middle of the two groups. Since the curves $\hat{\pi}_{R}(x)$ and $\hat{\pi}_{G}(x)$ are located well above the curves $\hat{\pi}_{B}(x)$ and $\hat{\pi}_{O}(x)$, and notably above the curve $\hat{\pi}_{BR}(x)$, we conclude that the DCD-TMHC and STM methods perform better for the detected normal cell related st-communities.
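As a reading aid for Table 6, the following short worked check looks at the common intercept. It assumes that the tabulated values $x_{i} = 0.00$ are exactly zero, which the table itself does not confirm. In every panel of Table 6, the four metastasis samples and primary sample 7 have $x_{i} = 0$, while the other three primary samples have $x_{i} > 0$. If the fitted curve reproduces the observed fraction of primary samples among those with $x_{i} = 0$, then

$$\hat{\pi}(0) = \frac{\exp(\hat{\alpha})}{1 + \exp(\hat{\alpha})} = \frac{1}{5} \quad \Longrightarrow \quad \hat{\alpha} = \log\frac{1/5}{4/5} = \log\frac{1}{4} \approx -1.386,$$

which agrees with the value $\hat{\alpha} = -1.386$ reported for all five methods. With a common intercept, the five curves in Figure 10 differ only through $\hat{\beta}$.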

5. Discussion and Conclusions

The single-cell ST data produced by recent biotechnologies, such as CosMx and Xenium, contain a huge amount of information about cancer tissue samples and have great potential for improving cancer diagnosis and treatment. This article shows that many existing clustering methods perform poorly for st-community detection on single-cell ST data produced by CosMx, and that the commonly used kNN compositional data method lacks informative neighboring cell patterns for huge CosMx data. We therefore propose a novel and more informative disk compositional data (DCD) method, which identifies the neighboring pattern of each cell based on nearby cell-percentages by taking into account the features of single-cell ST data produced by recent technologies.

After initial processing of the single-cell ST data into the DCD matrix, an innovative and interpretable DCD-TMHC st-community detection method is proposed in this paper. Extensive simulation studies and the analysis of a CosMx breast cancer dataset, using various existing methods as well as our DCD-TMHC computation method, show that our proposed DCD-TMHC method performs better than all other methods, including the existing and applicable CN methods and spatial domain methods.

Based on the st-communities detected by our proposed DCD-TMHC computation method for the CosMx breast cancer data, we use the logistic regression model to analyze the association between the identified st-communities of the CosMx data and cancer phenotypes. The results demonstrate that our proposed DCD-TMHC method is more interpretable than, and superior to, the other existing methods, especially in terms of assessment for primary cancer and metastasis cancer.

In future research, the novel and interpretable DCD-TMHC computation method proposed in this article will be helpful for, and have an impact on, cancer research based on single-cell ST data. In particular, if the new ST technologies are used to obtain more tissue samples from different types of cancer at different stages, then different types of st-communities can be detected and identified by our DCD-TMHC method for a specific type of cancer at a particular stage. These st-communities can then be analyzed with the logistic regression model, appropriate generalized linear models, or other statistical models, such as the weighted empirical likelihood method [48,49,50] and the generalized latent proportional hazards model [51], to improve cancer diagnosis and monitor cancer treatment progress.

Author Contributions

Conceptualization, C.Z.; Methodology, C.Z. and J.-J.R.; Formal analysis, C.Z. and J.-J.R.; Investigation, C.Z.; Writing—original draft, C.Z.; Writing—review & editing, C.Z. and J.-J.R. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by funds from the NCI Breast SPORE Program, Breast Cancer Research Foundation, and The UNC Lineberger Triple Negative Breast Cancer Center. Charles Zhao was also partially supported by NSF research grant DMS-2113404 for some period of time.

Data Availability Statement

Information was provided at the initial submission of this article.

Acknowledgments

The authors are grateful for the CosMx breast cancer data produced by the NanoString Company, and would like to thank Susana Garcia-Recio for processing and providing the data used as the data analysis example in this article. In addition, Charles Zhao is very grateful for the inspiring, informative and encouraging discussions with Mingyao Li of the University of Pennsylvania after her invited lecture at the 2025 JSM in Nashville, TN, as well as for later communications with her related to the current work. Moreover, the authors would like to thank the three reviewers for their comments and suggestions on the earlier draft of this article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. He, S.; Bhatt, R.; Brown, C.; Brown, E.A.; Buhr, D.L.; Chantranuvatana, K.; Danaher, P.; Dunaway, D.; Garrison, R.G.; Geiss, G.; et al. High-plex imaging of RNA and proteins at subcellular resolution in fixed tissue by spatial molecular imaging. Nat. Biotechnol. 2022, 40, 1794–1806.
2. Jin, Y.; Zuo, Y.; Li, G.; Liu, W.; Pan, Y.; Fan, T.; Fu, X.; Yao, X.; Peng, Y. Advances in spatial transcriptomics and its applications in cancer research. Mol. Cancer 2024, 23, 129.
3. Goltsev, Y.; Samusik, N.; Kennedy-Darling, J.; Bhate, S.; Hale, M.; Vazquez, G.; Black, S.; Nolan, G.P. Deep profiling of mouse splenic architecture with CODEX multiplexed imaging. Cell 2018, 174, 968–981.
4. Ståhl, P.L.; Salmén, F.; Vickovic, S.; Lundmark, A.; Navarro, J.F.; Magnusson, J.; Giacomello, S.; Asp, M.; Westholm, J.O.; Huss, M.; et al. Visualization and analysis of gene expression in tissue sections by spatial transcriptomics. Science 2016, 353, 78–82.
5. Smith, K.D.; Prince, D.K.; MacDonald, J.W.; Bammler, T.K.; Akilesh, S. Challenges and opportunities for the clinical translation of spatial transcriptomics technologies. Glomerular Dis. 2024, 4, 49–63.
6. Chen, T.Y.; You, L.; Hardillo, J.A.U.; Chien, M.P. Spatial transcriptomic technologies. Cells 2023, 12, 2042.
7. Williams, C.G.; Lee, H.J.; Asatsuma, T.; Vento-Tormo, R.; Haque, A. An introduction to spatial transcriptomics for biomedical research. Genome Med. 2022, 14, 68.
8. Cilento, M.A.; Sweeney, C.J.; Butler, L.M. Spatial transcriptomics in cancer research and potential clinical impact: A narrative review. J. Cancer Res. Clin. Oncol. 2024, 150, 296.
9. Maciejewski, K.; Czerwinska, P. Scoping Review: Methods and Applications of Spatial Transcriptomics in Tumor Research. Cancers 2024, 16, 3100.
10. Saqib, J.; Park, B.; Jin, Y.; Seo, J.; Mo, J.; Kim, J. Identification of Niche-Specific Gene Signatures between Malignant Tumor Microenvironments by Integrating Single Cell and Spatial Transcriptomics Data. Genes 2023, 14, 2033.
11. Hu, J.; Li, X.; Coleman, K.; Schroeder, A.; Ma, N.; Irwin, D.J.; Lee, E.B.; Shinohara, R.T.; Li, M. SpaGCN: Integrating gene expression, spatial location and histology to identify spatial domains and spatially variable genes by graph convolutional network. Nat. Methods 2021, 18, 1342–1351.
12. Singhal, V.; Chou, N.; Lee, J.; Yue, Y.; Liu, J.; Chock, W.K.; Lin, L.; Chang, Y.C.; Teo, E.M.L.; Aow, J.; et al. BANKSY unifies cell typing and tissue domain segmentation for scalable spatial omics data analysis. Nat. Genet. 2024, 56, 431–441.
13. Yan, G.; Hua, S.H.; Li, J.J. Categorization of 34 computational methods to detect spatially variable genes from spatially resolved transcriptomics data. Nat. Commun. 2025, 16, 1141.
14. Bhate, S.S.; Barlow, G.L.; Schurch, C.M.; Nolan, G.P. Tissue schematics map the specialization of immune tissue motifs and their appropriation by tumors. Cell Syst. 2022, 13, 109–130.
15. Schürch, C.M.; Bhate, S.S.; Barlow, G.L.; Phillips, D.J.; Noti, L.; Zlobec, I.; Chu, P.; Black, S.; Demeter, J.; McIlwain, D.R.; et al. Coordinated cellular neighborhoods orchestrate antitumoral immunity at the colorectal cancer invasive front. Cell 2020, 182, 1341–1359, Correction in Cell 2020, 183, 838.
16. Tao, Y.; Feng, F.; Luo, X.; Reihsmann, C.V.; Hopkirk, A.L.; Cartailler, J.P.; Brissova, M.; Parker, S.C.; Saunders, D.C.; Liu, J. CNTools: A computational toolbox for cellular neighborhood analysis from multiplexed images. PLoS Comput. Biol. 2024, 20, 8.
17. Jiang, Y.; Ke, Z.T. Semi-Supervised Community Detection via Structural Similarity Metrics. In Proceedings of the ICLR, Kigali, Rwanda, 1–5 May 2023.
18. Jin, J.; Ke, Z.T.; Luo, S.; Wang, M. Optimal Estimation of the Number of Network Communities. J. Am. Stat. Assoc. 2023, 118, 2101–2116.
19. Shiao, S.L.; Gouin, K.H.; Ing, N.; Ho, A.; Basho, R.; Shah, A.; Mebane, R.H.; Zitser, D.; Martinez, A.; Mevises, N.Y.; et al. Single-cell and spatial profiling identify three response trajectories to pembrolizumab and radiation therapy in triple negative breast cancer. Cancer Cell 2024, 42, 70–84.
20. Quinn, T.P.; Erb, I.; Richardson, M.F.; Crowley, T.M. Understanding sequencing data as compositions: An outlook and review. Bioinformatics 2018, 34, 2870–2878.
21. Traag, V.A.; Waltman, L.; van Eck, N.J. From Louvain to Leiden: Guaranteeing well-connected communities. Sci. Rep. 2019, 9, 5233.
22. Hartigan, J.A.; Wong, M.A. A k-means clustering algorithm. J. R. Stat. Soc. Ser. C Appl. Stat. 1979, 28, 100–108.
23. Lloyd, S. Least squares quantization in PCM. IEEE Trans. Inf. Theory 1982, 28, 129–137.
24. Marron, J.S.; Dryden, I.L. Object Oriented Data Analysis, 1st ed.; Chapman and Hall/CRC: Boca Raton, FL, USA, 2021.
25. Schubert, E. Stop using the elbow criterion for k-means and how to choose the number of clusters instead. SIGKDD Explor. 2023, 25, 36–42.
26. Tibshirani, R.; Walther, G.; Hastie, T. Estimating the number of clusters in a data set via the Gap statistic. J. R. Stat. Soc. B 2001, 63, 411–423.
27. Fraley, C.; Raftery, A.E. Model-based clustering, discriminant analysis, and density estimation. J. Am. Stat. Assoc. 2002, 97, 611–631.
28. Scrucca, L.; Fraley, C.; Murphy, T.B.; Raftery, A.E. Model-Based Clustering, Classification, and Density Estimation Using Mclust in R, 1st ed.; Chapman and Hall/CRC: Boca Raton, FL, USA, 2023.
29. Hahsler, M.; Piekenbrock, M.; Doran, D. DBSCAN: Fast density-based clustering with R. J. Stat. Softw. 2019, 91, 1–30.
30. Campello, R.J.G.B.; Moulavi, D.; Sander, J. Density-based clustering based on hierarchical density estimates. Adv. Knowl. Discov. Data Min. 2013, 7819, 160–172.
31. Campello, R.J.G.B.; Moulavi, D.; Zimek, A.; Sander, J. Hierarchical density estimates for data clustering, visualization, and outlier detection. ACM Trans. Knowl. Discov. Data 2015, 10, 1–51.
32. Murtagh, F.; Contreras, P. Algorithms for hierarchical clustering: An overview. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2012, 2, 86–97.
33. Ward, J.H. Hierarchical grouping to optimize an objective function. J. Am. Stat. Assoc. 1963, 58, 236–244.
34. Enfield, K.S.; Colliver, E.; Lee, C.; Magness, A.; Moore, D.A.; Sivakumar, M.; Grigoriadis, K.; Pich, O.; Karasaki, T.; Hobson, P.S.; et al. Spatial architecture of myeloid and T cells orchestrates immune evasion and clinical outcome in lung cancer. Cancer Discov. 2024, 14, 1018–1047.
35. Hao, Y.; Hao, S.; Andersen-Nissen, E.; Mauck, W.M.; Zheng, S.; Butler, A.; Lee, M.J.; Wilk, A.J.; Darby, C.; Zager, M.; et al. Integrated analysis of multimodal single-cell data. Cell 2021, 184, 3573–3587.
36. Aitchison, J. The statistical analysis of compositional data. J. R. Stat. Soc. Ser. B 1982, 44, 139–177.
37. Aitchison, J. The Statistical Analysis of Compositional Data; Chapman and Hall, Ltd.: London, UK, 1986.
38. Huang, H.; Liu, Y.; Yuan, M.; Marron, J.S. Statistical significance of clustering using soft thresholding. J. Comput. Graph. Stat. 2015, 24, 975–993.
39. Liu, Y.; Hayes, D.N.; Nobel, A.; Marron, J.S. Statistical significance of clustering for high-dimension, low–sample size data. J. Am. Stat. Assoc. 2008, 103, 1281–1293.
40. Hubert, L.; Arabie, P. Comparing partitions. J. Classif. 1985, 2, 193–218.
41. Walker, J.T.; Saunders, D.C.; Rai, V.; Chen, H.H.; Orchard, P.; Dai, C.; Pettway, Y.D.; Hopkirk, A.L.; Reihsmann, C.V.; Tao, Y.; et al. Genetic risk converges on regulatory networks mediating early type 2 diabetes. Nature 2023, 624, 621–629.
42. Chen, Z.; Soifer, I.; Hilton, H.; Keren, L.; Jojic, V. Modeling Multiplexed Images with Spatial-LDA Reveals Novel Tissue Microenvironments. J. Comput. Biol. 2020, 27, 1204–1218.
43. Feng, Y.; Yang, T.; Zhu, J.; Li, M.; Doyle, M.; Ozcoban, V.; Bass, G.T.; Pizzolla, A.; Cain, L.; Weng, S.; et al. Spatial analysis with SPIAT and spaSim to characterize and simulate tissue microenvironments. Nat. Commun. 2023, 14, 2697.
44. Hu, Y.; Rong, J.; Xu, Y.; Xie, R.; Peng, J.; Gao, L.; Tan, K. Unsupervised and supervised discovery of tissue cellular neighborhoods from cell phenotypes. Nat. Methods 2024, 21, 267–278.
45. Duan, B.; Chen, S.; Cheng, X.; Liu, Q. Multi-slice spatial transcriptome domain analysis with SpaDo. Genome Biol. 2024, 25, 73.
46. Agresti, A. Categorical Data Analysis; Wiley: Hoboken, NJ, USA, 2002.
47. McCullagh, P.; Nelder, J.A. Generalized Linear Models, 2nd ed.; Chapman and Hall/CRC: Boca Raton, FL, USA, 1989.
48. Ren, J. Weighted empirical likelihood in some two-sample semiparametric models with various types of censored data. Ann. Stat. 2008, 36, 147–166.
49. Ren, J.; Lyu, Y. Multivariate weighted empirical likelihood for accelerated life model with various types of censored data. Stats 2024, 7, 944–954.
50. Ren, J.; Wang, Y. Multivariate weighted empirical likelihood MLE for the Cox model with various types of censored data. Electron. J. Stat. 2025, 19, 2033–2051.
51. Ren, J.; Zhao, C. Asymptotic properties of empirical likelihood MLE for joint modeling right censored survival data and intensive longitudinal covariates. Ann. Inst. Stat. Math. 2025; to appear.

Figure 1. Example of FOVs from Cancer Sample 1.

Figure 2. FOV Example and Histograms of Disk and $k$NN Methods.

Figure 5. Dot Plots of 5 Simulation Settings.

Figure 6. BIC Plot of Mclust in Simulation 1.

Figure 7. Bar Charts of Detected ST-Communities.

Figure 8. Logistic Regression Curves for Highest Tumor ST-Community.

Figure 9. Logistic Regression Curves for Highest Immune ST-Community.

Figure 10. Logistic Regression Curves for Highest Normal ST-Community.

Table 1. Five Simulation Study Settings.

Intended Community	Simulation 1	Simulation 2	Simulation 3	Simulation 4	Simulation 5
	$U\{10, 25\}$	$U\{10, 15\}$	$U\{100, 150\}$	$U\{90, 120\}$	$U\{150, 200\}$
1	Cell 1 (12,462): $X, Y \sim N(0, 250)$	Cell 1 (11,016): $X, Y \sim U(-0.25, 0.25)$	Cell 1 (102,985): $X, Y \sim N(2, 100)$	Cell 1 (45% = 48,330): $X \sim U(-0.55, -0.35)$, $Y \sim U(-0.25, 0.25)$	Cell 1 (152,985): $X, Y \sim N(2, 100)$
2	Cell 2 (12,510): $X, Y \sim N(4, 250)$	Cell 2 (50% = 7370): $X \sim U(-0.25, 0.25)$, $Y \sim U(0.5, 1)$	Cell 2 (98,264): $X, Y \sim N(4, 100) \cap R^{c}$	Cell 1 (55% = 59,070): Cell 4 (45% = 52,538): $X, Y \sim U(-0.25, 0.25)$	Cell 2 (136,445): $X, Y \sim N(4, 100) \cap R^{c}$
3	Cell 3 (20,418): $X, Y \sim N(10, 250)$	Cell 2 (50% = 7404): Cell 3 (50% = 6066): $X \sim U(0.35, 0.85)$, $Y \sim U(0.5, 1)$	Cell 3 (129,709): $X, Y \sim N(8, 100)$	Cell 2 (45% = 51,474): $X \sim U(-0.25, 0.25)$, $Y \sim U(0.5, 1)$	Cell 3 (179,709): $X, Y \sim N(8, 100)$
4	$U\{6, 8\}$; Cell 1 (6525): Cell 2 (6525): $X \sim N(8, 50)$, $Y \sim N(0, 50)$	Cell 3 (50% = 6110): $X \sim U(1.15, 1.65)$, $Y \sim U(0.5, 1)$	Cell 2 (31,660): $X, Y \sim N(4, 100) \cap R$; Cell 4 (44% = 60,512): $X, Y \sim U(4, 4.7)$	Cell 2 (55% = 62,913): Cell 3 (45% = 42,648): $X \sim U(0.35, 0.85)$, $Y \sim U(0.5, 1)$	Cell 2 (43,479): $X, Y \sim N(4, 100) \cap R$; Cell 4 (58% = 108,766): $X, Y \sim U(4, 4.7)$
5		Cell 4 (50% = 5773): $X \sim U(0.7, 1)$, $Y \sim U(-0.15, 0.15)$; Cell 4 (50% = 5759): $\theta \sim U(0, 2\pi)$, $X = 0.4 \cos\theta + 0.8$, $Y = 0.25 \sin\theta$	Cell 4 (56% = 77,016): $X \sim U(5, 6)$, $Y \sim U(8.5, 9.2)$	Cell 3 (55% = 52,126): $X \sim U(1.15, 1.65)$, $Y \sim U(0.5, 1)$	Cell 4 (42% = 78,762): Cell 5 (58% = 88,598): $X \sim U(5.5, 6.5)$, $Y \sim U(8.5, 9.2)$
6				Cell 4 (55% = 64,214): $X \sim U(0.35, 0.85)$, $Y \sim U(1.1, 1.35)$	Cell 5 (42% = 64,158): $X \sim U(2.5, 4.5)$, $Y \sim U(8.5, 9.2)$
7				Cell 5 (45% = 46,447): $X \sim U(0.7, 1)$, $Y \sim U(-0.15, 0.15)$; Cell 5 (55% = 56,770): $\theta \sim U(0, 2\pi)$, $X = 0.4 \cos\theta + 0.8$, $Y = 0.25 \sin\theta$	$U\{30, 50\}$; Cell 1 (42,635): Cell 2 (39,208): $X \sim N(8, 90)$, $Y \sim N(2, 90)$
Notations	$1 = 1000$ for distribution notations; $R = [3995, 4705] \times [3995, 4705]$

Table 2. Summary of Simulation Results.

(a) Simulation 1 with N = 58,440 Cells
Method	Detected ST-Community No.	Chosen Parameters	ARI
DCD-TMHC	4	$K_{1} = 0$	0.9815
STM	5	$K_{1} = 583, K_{2} = 150$	0.9712
k-Means	4		0.9593
Elbow k-Means	4		0.9593
Gap k-Means	10		0.8783
Mclust	4		0.9593
HDBSCAN	2	$\mathit{minPts} = 1001$	0.4590
DBSCAN	5	$\mathit{eps} = 0.025, \mathit{minPts} = 1001$	0.9337
(b) Simulation 2 with N = 49,498 Cells
Method	Detected ST-Community No.	Chosen Parameters	ARI
DCD-TMHC	5	$K_{1} = 0$	0.9997
STM	5	$K_{1} = 12{,}000, K_{2} = 100$	0.9104
k-Means	5		0.6178
Elbow k-Means	7		0.8418
Gap k-Means	4		0.7081
Mclust	5		0.9655
HDBSCAN	2	$\mathit{minPts} = 1001$	0.2647
DBSCAN	9	$\mathit{eps} = 0.02, \mathit{minPts} = 1001$	0.8348
(c) Simulation 3 with N = 500,146 Cells
Method	Detected ST-Community No.	Chosen Parameters	ARI
DCD-TMHC	7	$K_{1} = 60{,}000$	0.9206
STM	6	$K_{1} = 60{,}000, K_{2} = 10{,}000$	0.8930
k-Means	5		0.6267
Elbow k-Means	9		0.6234
Gap k-Means	NA		NA
Mclust	5		0.8791
HDBSCAN	NA		NA
DBSCAN	16	$\mathit{eps} = 0.01, \mathit{minPts} = 1001$	0.8870
(d) Simulation 4 with N = 536,530 Cells
Method	Detected ST-Community No.	Chosen Parameters	ARI
DCD-TMHC	7	$K_{1} = 0$	0.9998
STM	9	$K_{1} = 60{,}000, K_{2} = 10{,}000$	0.7982
k-Means	7		0.9758
Elbow k-Means	7		0.9758
Gap k-Means	NA		NA
Mclust	7		0.9765
HDBSCAN	NA		NA
DBSCAN	9	$\mathit{eps} = 0.02, \mathit{minPts} = 2001$	0.9727
(e) Simulation 5 with N = 934,745 Cells
Method	Detected ST-Community No.	Chosen Parameters	ARI
DCD-TMHC	10	$K_{1} = 80{,}000$	0.8337
STM	9	$K_{1} = 125{,}000, K_{2} = 5000$	0.7954
k-Means	7		0.8902
Elbow k-Means	7		0.8902
Gap k-Means	NA		NA
Mclust	7		0.8894
HDBSCAN	NA		NA
DBSCAN	9	$\mathit{eps} = 0.015, \mathit{minPts} = 2001$	0.7782

Table 3. CosMx Breast Cancer Data Summary.

Sample Name	Sample No.	Sample Size	Patient No.	Sample Type	Tissue Type
AER8-TTP1	1	59,556	1	Primary	Breast
AER8-TTM2	2	57,045	1	Metastasis	Liver
AFE4-TTP1	3	20,495	2	Primary	Breast
AFE4-TTM6	4	84,168	2	Metastasis	Liver
RA11-044-PRIM	5	48,092	4	Primary	Breast
RA11-044-MET	6	97,895	4	Metastasis	Lung
RA11-049-PRIM	7	113,317	3	Primary	Breast
RA11-049-MET	8	121,066	3	Metastasis	Liver

Table 4. Sample Cell % in Detected ST-Community with Highest Tumor Cell.

Method: DCD-TMHC; ST-Community 5 with 99.7% as Highest Tumor Cell Percentage
Sample	1	2	3	4	5	6	7	8	LR Estimator
x	0.40	39.30	3.26	12.58	7.28	6.95	2.98	0.59	$\hat{α} = 1.375$
y	1	0	1	0	1	0	1	0	$\hat{β} = - 0.219$
Method: STM; ST-Community 3 with 98.7% as Highest Tumor Cell Percentage
Sample	1	2	3	4	5	6	7	8	LR Estimator
x	0.87	42.30	3.74	14.62	7.43	6.78	2.93	0.64	$\hat{α} = 1.344$
y	1	0	1	0	1	0	1	0	$\hat{β} = - 0.199$
Method: 10-Means; ST-Community 5 with 97.3% as Highest Tumor Cell Percentage
Sample	1	2	3	4	5	6	7	8	LR Estimator
x	3.06	49.58	7.28	46.69	24.86	17.54	11.85	2.69	$\hat{α} = 1.354$
y	1	0	1	0	1	0	1	0	$\hat{β} = - 0.071$
Method: Elbow k-Means; ST-Community 3 with 98.8% as Highest Tumor Cell Percentage
Sample	1	2	3	4	5	6	7	8	LR Estimator
x	1.17	44.69	4.66	29.73	10.86	7.87	4.82	1.36	$\hat{α} = 1.245$
y	1	0	1	0	1	0	1	0	$\hat{β} = - 0.119$
Method: CNE-20; ST-Community 1 with 98.5% as Highest Tumor Cell Percentage
Sample	1	2	3	4	5	6	7	8	LR Estimator
x	32.16	67.35	10.35	76.20	56.83	59.74	56.88	39.36	$\hat{α} = 3.922$
y	1	0	1	0	1	0	1	0	$\hat{β} = - 0.076$

Table 5. Sample Cell % in Detected ST-Community with Highest Immune Cell.

Method: DCD-TMHC; ST-Community 19 with 62.2% as Highest Immune B+T Cell Percentage
Sample	1	2	3	4	5	6	7	8	LR Estimator
x	0.18	0.05	0.47	0.04	0.78	0.07	0.00	0.00	$\hat{\alpha} = -1.566$
y	1	0	1	0	1	0	1	0	$\hat{\beta} = 14.073$
Method: STM; ST-Community 32 with 48.0% as Highest Immune B+T Cell Percentage
Sample	1	2	3	4	5	6	7	8	LR Estimator
x	1.07	0.57	5.27	0.18	1.36	0.30	0.00	0.00	$\hat{\alpha} = -1.640$
y	1	0	1	0	1	0	1	0	$\hat{\beta} = 2.691$
Method: 10-Means; ST-Community 8 with 27.4% as Highest Immune B+T Cell Percentage
Sample	1	2	3	4	5	6	7	8	LR Estimator
x	0.63	0.42	4.32	0.31	3.34	10.37	0.25	0.02	$\hat{\alpha} = 0.142$
y	1	0	1	0	1	0	1	0	$\hat{\beta} = -0.058$
Method: Elbow k-Means; ST-Community 12 with 50.6% as Highest Immune B+T Cell Percentage
Sample	1	2	3	4	5	6	7	8	LR Estimator
x	1.06	0.25	5.77	0.25	1.91	1.29	0.01	0.00	$\hat{\alpha} = -1.163$
y	1	0	1	0	1	0	1	0	$\hat{\beta} = 1.251$
Method: CNE-20; ST-Community 6 with 87.0% as Highest Immune B+T Cell Percentage
Sample	1	2	3	4	5	6	7	8	LR Estimator
x	0.26	0.01	0.89	0.05	0.53	1.60	0.00	0.00	$\hat{\alpha} = -0.003$
y	1	0	1	0	1	0	1	0	$\hat{\beta} = 0.008$

Table 6. Sample Cell % in Detected ST-Community with Highest Normal Cell Percentage.

Method: DCD-TMHC; ST-Community 14 with 61.4% as Highest Normal-BEC Cell Percentage
Sample	1	2	3	4	5	6	7	8	LR Estimator
x	1.36	0.00	1.43	0.00	0.07	0.00	0.00	0.00	$\hat{\alpha} = -1.386$
y	1	0	1	0	1	0	1	0	$\hat{\beta} = 273.171$
Method: STM; ST-Community 40 with 76.8% as Highest Normal-BEC Cell Percentage
Sample	1	2	3	4	5	6	7	8	LR Estimator
x	0.59	0.00	0.09	0.00	0.01	0.00	0.00	0.00	$\hat{\alpha} = -1.386$
y	1	0	1	0	1	0	1	0	$\hat{\beta} = 2660.943$
Method: 10-Means; ST-Community 1 with 47.8% as Highest Normal-BEC Cell Percentage
Sample	1	2	3	4	5	6	7	8	LR Estimator
x	4.41	0.00	13.04	0.00	0.76	0.00	0.00	0.00	$\hat{\alpha} = -1.386$
y	1	0	1	0	1	0	1	0	$\hat{\beta} = 26.804$
Method: Elbow k-Means; ST-Community 13 with 51.0% as Highest Normal-BEC Cell Percentage
Sample	1	2	3	4	5	6	7	8	LR Estimator
x	4.28	0.00	11.52	0.00	0.60	0.00	0.00	0.00	$\hat{\alpha} = -1.386$
y	1	0	1	0	1	0	1	0	$\hat{\beta} = 33.942$
Method: CNE-20; ST-Community 8 with 87.6% as Highest Normal-BEC Cell Percentage
Sample	1	2	3	4	5	6	7	8	LR Estimator
x	2.20	0.00	1.37	0.00	0.23	0.00	0.00	0.00	$\hat{\alpha} = -1.386$
y	1	0	1	0	1	0	1	0	$\hat{\beta} = 88.321$
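
Every method block in Table 6 reports the same intercept, $-1.386$, alongside a very large slope. The short check below offers one possible reading, assuming the estimators are plain maximum-likelihood logistic regression fits (an assumption not stated in the table): in each block every sample with x > 0 has y = 1, while the samples with x = 0 carry both labels. This is quasi-complete separation, under which the likelihood keeps increasing as the slope grows and the intercept tends to the empirical logit of y among the x = 0 samples. The variable names are illustrative only, and the DCD-TMHC rows are used as the example.

```python
import math

# Table 6, DCD-TMHC rows (ST-Community 14).
x = [1.36, 0.00, 1.43, 0.00, 0.07, 0.00, 0.00, 0.00]
y = [1, 0, 1, 0, 1, 0, 1, 0]

# Quasi-complete separation check: every sample with x > 0 has y = 1,
# while the samples tied at x = 0 carry both labels.
positive_all_one = all(yi == 1 for xi, yi in zip(x, y) if xi > 0)
labels_at_zero = [yi for xi, yi in zip(x, y) if xi == 0]
quasi_separated = positive_all_one and 0 < sum(labels_at_zero) < len(labels_at_zero)

# Under quasi-separation the intercept tends to the empirical logit of y
# among the x = 0 samples (here 1 of 5 samples has y = 1).
p0 = sum(labels_at_zero) / len(labels_at_zero)
alpha_limit = math.log(p0 / (1 - p0))  # log(0.25)

print(quasi_separated, round(alpha_limit, 3))  # prints: True -1.386
```

If this reading is right, the slope values in Table 6 mainly indicate where a numerical optimizer stopped, so their sign is more informative than their magnitude.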

Share and Cite

MDPI and ACS Style

Zhao, C.; Ren, J.-J. ST-Community Detection Methods for Spatial Transcriptomics Data Analysis. Stats 2026, 9, 4. https://doi.org/10.3390/stats9010004
