Simulation and Quantitative Analysis of Spatial Centromere Distribution Patterns

Adib Keikhosravi; Krishnendu Guin; Gianluca Pegoraro; Tom Misteli

doi:10.3390/cells14070491

¹ High Throughput Imaging Facility (HiTIF), National Cancer Institute, NIH, Bethesda, MD 20892, USA

² Cell Biology of Genomes Group, National Cancer Institute, NIH, Bethesda, MD 20892, USA

* Authors to whom correspondence should be addressed.

Cells 2025, 14(7), 491; https://doi.org/10.3390/cells14070491

This article belongs to the Special Issue Imaging Methods in Cell Biology

Abstract

Centromeres are a prominent feature of eukaryotic chromosomes: specialized regions of repetitive DNA required for faithful chromosome segregation during cell division. In interphase cells, centromeres are non-randomly positioned in the three-dimensional space of the nucleus in a cell type-specific manner. The functional relevance and the cellular mechanisms underlying this localization are unknown, and quantitative methods to measure distribution patterns of centromeres in 3D space are needed. Here, we developed an analytical framework that combines sensitive clustering metrics and advanced modeling techniques for the quantitative analysis of centromere distributions at the single-cell level. To identify a robust quantitative measure for centromere clustering, we benchmarked six metrics for their ability to sensitively detect changes in centromere distribution patterns from high-throughput imaging data of human cells, both under normal conditions and upon experimental perturbation of centromere distribution. We found that Ripley’s K function has the highest accuracy with minimal sensitivity to variations in the number of centromeres, making it the most suitable metric for measuring centromere distributions. As a complementary approach, we also developed and validated spatial models to replicate centromere distribution patterns, and we show that a radially shifted Gaussian distribution best represents the centromere patterns seen in human cells. Our approach creates tools for the quantitative characterization of spatial centromere distributions with applications in both targeted studies of centromere organization and unbiased screening approaches.

Keywords: centromere; spatial distribution; clustering metrics; image analysis; Ripley’s K; genome organization

1. Introduction

The spatial distribution of many nuclear compartments, including proteinaceous bodies and genomic domains, is non-random [1]. Gene-rich chromosomes, for example, tend to occupy central nuclear regions, whereas gene-poor chromosomes are typically more peripheral [2]. Similarly, transcription sites and replicating loci, often associated with active chromatin, are generally more centrally localized in the cell nucleus, whereas inactive heterochromatin tends to be positioned at the nuclear periphery or in proximity to nucleoli [3,4,5]. The spatial organization of chromosomes within the nucleus has been suggested to be integral to cellular function, impacting processes such as chromatin accessibility, gene expression, DNA repair, and DNA replication [1].

Centromeres are specialized regions on each chromosome that are crucial for accurate chromosome segregation. During cell division, centromeres serve as the attachment point for the microtubule spindle via the kinetochore protein complex [6,7]. Similar to other nuclear structures, centromeres assume an apparently non-random spatial distribution within the nucleus [8,9,10,11], and in a cell type-specific manner [10,12]. For example, in human stem cells, the majority of centromeres cluster strongly near the nucleoli [13,14], and this association is weakened as the stem cells differentiate [14]. Similarly, the association of centromeres with the nucleolus has also been observed in other non-stem cell types [15]. Although the proximity of rDNA loci to the centromeres on all five rDNA-containing chromosomes may partly account for this behavior [16], the spatial proximity of centromeres to the nucleolus may also have functional consequences, as suggested by the finding that transcription from alpha-satellite repeats of centromeres is limited near nucleoli [17] and by the observation that peripheral chromosomes have a higher rate of mis-segregation compared with centrally positioned ones [18]. Furthermore, variations in centromere clustering have also been implicated in diverse biological phenomena, including development, cancer progression, and response to cellular stress [6,19].

Little is known about the cellular factors and mechanisms that determine spatial centromere distribution. Candidate-based studies in cancer cell lines have demonstrated that alterations in centromere clustering can result from changes in chromatin organization, genomic stability and mitotic fidelity [20]. Similarly, clustering variations have been associated with perturbations of the NCAPH2 subunit of the condensin II complex [21], which regulates chromatin compaction and spatial genome organization [22]. However, the specific molecular mechanisms and processes that determine the spatial organization of centromeres remain unknown.

High-throughput imaging (HTI) assays are a powerful approach to systematically study cellular phenotypes, including centromere distribution, at the single-cell level, and in an unbiased fashion when paired with functional genomics perturbations, such as RNAi or CRISPR-Cas9 gene knockouts [23]. These assays are ideally suited to perform functional genomics screens to identify regulators of centromeric localization, and they rely on robust metrics to quantify centromere clustering patterns across millions of cells and across thousands of biological conditions. Unfortunately, such metrics are currently missing.

To fill this gap, we conducted extended testing to identify novel analytical tools to quantitatively analyze centromere localization patterns in human cells, including several global and local clustering metrics and spatial distribution modeling approaches. First, we benchmarked six clustering metrics and tested multiple spot generation models to evaluate their utility in measuring different clustering patterns on simulated data. Then, we tested these metrics on HTI experimental data from multiple human cell types. We identified a derivative version of Ripley’s K function [24] as the most robust indicator of different clustering patterns in single cells. Finally, to extend the applicability of our framework, we also surveyed several modeling approaches to fit the imaging data and recreate realistic centromere distributions in silico. We tested these models on multiple human cell lines, demonstrating their versatility and accuracy in capturing diverse centromere localization patterns in the presence or absence of experimental perturbation of centromere clustering patterns. Our approach establishes quantitative tools for the study of centromere localization and function, and they have potential for wider application in genome research.

2. Methods

2.1. siRNA Oligos Transfection and Immunofluorescence

The image dataset, which contains four technical replicates of human colon cancer HCT116-Cas9 cells reverse-transfected in 384-well plates with siRNA oligos against the NCAPH2 gene or with a scrambled negative control siRNA, has been described previously [25].

2.2. Cell Growth and Centromere Visualization

All cells were grown in 384-well plates (CellVis, cat. No. P384-1.5H-N). The sources of cell lines, culture conditions, media compositions, and relevant references for respective culture protocols are described in Supplementary Table S1. Cell lines were grown for 72 h after cell seeding before fixation, except for the iPS WTC11 cells that were grown for 3–5 days and fixed before the colony edges merged with each other. All cell lines were fixed with 2% paraformaldehyde (PFA, Electron Microscopy Sciences, Hatfield, PA, USA, cat. No. 15710) solution in media by adding one equal volume of 4% PFA solution in PBS to the cell growth medium in each well for 15 min at room temperature. Fixed cells were then subjected to immunofluorescence (IF) staining using anti-CENP-C antibody (MBL Co., Ltd., Tokyo, Japan, cat. No. PD030) and DAPI (4′,6-diamidino-2-phenylindole) (Thermo Fisher Scientific, Waltham, MA, USA, cat. No. 62248) as previously described [25]. The total number of cells analyzed from this dataset is as follows: A375, n = 11,925; MDA-MB-231, n = 6347; HFF-hTERT, n = 1322; hTERT-RPE1, n = 3641; A549, n = 7786; WTC11, n = 6071; HAP1, n = 18,359; and HCT116, n = 16,116.

2.3. High-Throughput Image Acquisition

For IF experiments, images were acquired using 405 nm (DAPI channel) or 488 nm (CENP-C channel) excitation lasers and a 405/488/561/640 nm excitation dichroic mirror. A 60× water-immersion objective (NA 1.2) was employed, paired with 445/45 nm (DAPI channel) or 525/50 nm (CENP-C channel) bandpass emission filters. A 16-bit sCMOS camera (2048 × 2048 pixels, 1 × 1 binning, pixel size: 0.108 microns) was used for the capture of image Z-stacks spanning 14 microns in depth, collected at 1-micron intervals and maximally projected on the fly. Images were acquired from 22 fields of view (FOV) per well.

2.4. High-Throughput Image Analysis

The analysis of imaging data was carried out using HiTIPS, a high-throughput image analysis software designed to analyze cell-based assays in fixed and live cells, as previously described [25]. Maximally projected DAPI images were used for nucleus segmentation, while CENP-C images were used for CENP-C spot finding and localization. Analysis parameters in HiTIPS were selected to match the average nucleus size and the size and brightness of the observed centromere spots. The GPU-based Cellpose algorithm for nuclear segmentation [26] was used in conjunction with the Laplacian of Gaussian method for spot detection. Spot positions were determined as the center of gravity of the segmented spots.

2.5. Methods for Generating Synthetic Spot Patterns

To evaluate centromere spot pattern characterization metrics, synthetic images containing simulated distributions of spots were generated under controlled spatial arrangements within a circular region representing the cell nucleus. For each spatial pattern, 100,000 images were generated. The synthetic spots were generated on an image patch. A circular region was defined at the center of this patch, with a radius of $r$ pixels and center coordinates $(c_{x}, c_{y})$. All synthetic spots were constrained within this circular region, ensuring consistency across pattern types. The constraint for any spot $(x, y)$ was

(x - c_{x})^{2} + (y - c_{y})^{2} \leq r^{2}

Each distribution was designed with specific statistical properties to mimic potential centromere clustering patterns.

2.5.1. Poisson Process or Complete Spatial Randomness (CSR)

To generate a uniform distribution of spots, random points were sampled according to a uniform distribution within the bounds $[c_{x} - r, c_{x} + r]$ for $x$ and $[c_{y} - r, c_{y} + r]$ for $y$. Each point was retained only if it lay within the circular region as defined above. For each sample, 46 spots were generated. This uniform sampling, which is also referred to as a Poisson process, provided a baseline distribution to assess other pattern types against a spatially random background.
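The rejection step described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `sample_csr`, the center `(128, 128)`, and in particular the radius of 100 pixels are assumptions chosen for the example.

```python
import random

def sample_csr(n_spots=46, center=(128.0, 128.0), radius=100.0, seed=None):
    """Rejection-sample n_spots points uniformly inside a circle.

    center and radius are illustrative assumptions, not values from the text.
    """
    rng = random.Random(seed)
    cx, cy = center
    spots = []
    while len(spots) < n_spots:
        # Candidate drawn uniformly from the bounding square of the circle.
        x = rng.uniform(cx - radius, cx + radius)
        y = rng.uniform(cy - radius, cy + radius)
        # Retain only candidates with (x - cx)^2 + (y - cy)^2 <= r^2.
        if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2:
            spots.append((x, y))
    return spots

spots = sample_csr(seed=0)
```

Because candidates that fall outside the circle are simply discarded, the retained points remain uniformly distributed over the disk.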

2.5.2. Single Two-Dimensional Gaussian Distribution (S2DG)

For clustering patterns, spots were generated based on a 2D Gaussian distribution centered at $(128,128)$ with varying covariance matrices. The covariance matrices $Σ$ systematically varied in size (from 50 to 1000 pixels squared in increments) and orientation (from 0 to $π$ radians).

The covariance matrix for each Gaussian distribution was defined as follows:

Σ = R \cdot \mathrm{diag} (σ_{x}^{2}, σ_{y}^{2}) \cdot R^{T}

where $R$ is the rotation matrix for angle $θ$ and $σ_{x}$ and $σ_{y}$ represent the standard deviations along the rotated principal axes. For each sample, points were drawn from the Gaussian distribution, and only the first 46 spots within the circular boundary were retained. This process provided various levels of spot clustering within the circle.
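As a hedged sketch of this sampling scheme, a point with covariance $R \cdot \mathrm{diag}(σ_{x}^{2}, σ_{y}^{2}) \cdot R^{T}$ can be obtained by drawing two independent normal deviates along the principal axes and rotating them by $θ$. The function name, the radius of 100 pixels, and the example variances are assumptions for illustration.

```python
import math
import random

def sample_rotated_gaussian(n_spots, sigma_x, sigma_y, theta,
                            center=(128.0, 128.0), radius=100.0, seed=None):
    """Draw points from N(center, R diag(sx^2, sy^2) R^T), keeping the
    first n_spots that fall inside the circle (radius is an assumption)."""
    rng = random.Random(seed)
    cx, cy = center
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    spots = []
    while len(spots) < n_spots:
        # Independent deviates along the (unrotated) principal axes.
        u = rng.gauss(0.0, sigma_x)
        v = rng.gauss(0.0, sigma_y)
        # Rotate by theta and shift to the center.
        x = cx + cos_t * u - sin_t * v
        y = cy + sin_t * u + cos_t * v
        if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2:
            spots.append((x, y))
    return spots

# Example: variances of 400 and 100 pixels squared, rotated by 45 degrees.
spots = sample_rotated_gaussian(46, math.sqrt(400.0), math.sqrt(100.0),
                                math.pi / 4, seed=0)
```

Note that the "size" values in the text are variances (pixels squared), so the standard deviations passed to the sampler are their square roots.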

2.5.3. Two (T2DG) and Three (TH2DG) Two-Dimensional Gaussian Distributions

To model multimodal clustering, Gaussian mixture models (GMMs) [27] were used with either two or three Gaussian components (modes) centered within the circular region. For the two-mode GMM, the component centers were positioned symmetrically around $(c_{x}, c_{y})$ at distances of 25 pixels, representing two distinct clusters. For the three-mode GMM, three Gaussian components were similarly spaced 20 pixels apart in a triangular arrangement around $(c_{x}, c_{y})$.

Each component’s covariance matrix systematically varied using the same range of sizes and orientations as described for the single Gaussian distribution. Samples from the GMM were filtered to retain only those within the circle. The GMM used equal weights for each component:

p (x) = \sum_{k = 1}^{K} π_{k} N (x \mid μ_{k}, Σ_{k})

where $K$ is the number of modes (2 or 3) and $π_{k}$ is the component weight (0.5 for the 2-mode GMM, 1/3 for the 3-mode GMM).

Means:
○ 2-mode GMM: $(128,103)$ and $(128,153)$;
○ 3-mode GMM: $(128,103)$, $(128,153)$, and $(103,128)$.
Covariance range: sizes from 50 to 1000; orientations from 0 to $π$.

For each sample, points were drawn from the corresponding distribution, and only the first 46 spots within the circular boundary were retained.
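The mixture sampling can be sketched as follows: pick a component with equal probability, draw from its rotated Gaussian, and keep the first 46 points inside the circle. The component means are those listed above; the function name, the shared covariance across components, and the 100-pixel radius are assumptions for this example.

```python
import math
import random

def sample_gmm(n_spots, means, sigma_x, sigma_y, theta,
               center=(128.0, 128.0), radius=100.0, seed=None):
    """Equal-weight Gaussian mixture, truncated to a circle.

    All components share one covariance here for brevity (an assumption).
    """
    rng = random.Random(seed)
    cx, cy = center
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    spots = []
    while len(spots) < n_spots:
        mx, my = rng.choice(means)          # equal weights pi_k = 1/K
        u = rng.gauss(0.0, sigma_x)
        v = rng.gauss(0.0, sigma_y)
        x = mx + cos_t * u - sin_t * v
        y = my + sin_t * u + cos_t * v
        if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2:
            spots.append((x, y))
    return spots

TWO_MODE = [(128.0, 103.0), (128.0, 153.0)]
THREE_MODE = [(128.0, 103.0), (128.0, 153.0), (103.0, 128.0)]
spots = sample_gmm(46, THREE_MODE, 10.0, 10.0, 0.0, seed=0)
```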

2.5.4. Poisson Disk Sampling (PDS)

A Poisson disk sampling approach [28] was applied to generate dispersed spots, enforcing a minimum inter-spot distance of 10 pixels. This approach simulated highly dispersed patterns, characteristic of spatially non-clustered arrangements. Candidate points were generated uniformly within the circular boundary, but each new point was retained only if it satisfied the minimum distance requirement from all existing spots. This technique was applied to generate 46 spots per sample while maintaining a minimum spacing constraint.
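The acceptance rule described in this paragraph corresponds to simple "dart throwing", sketched below. The 10-pixel minimum distance and the 46 spots come from the text; the function name, the 100-pixel radius, and the `max_tries` safeguard are assumptions. More elaborate Poisson disk algorithms exist, and this sketch does not claim to reproduce the one in [28].

```python
import math
import random

def sample_poisson_disk(n_spots=46, min_dist=10.0, center=(128.0, 128.0),
                        radius=100.0, seed=None, max_tries=100000):
    """Dart-throwing sketch: accept a uniform candidate only if it is at
    least min_dist away from every previously accepted spot."""
    rng = random.Random(seed)
    cx, cy = center
    spots = []
    tries = 0
    while len(spots) < n_spots and tries < max_tries:
        tries += 1
        x = rng.uniform(cx - radius, cx + radius)
        y = rng.uniform(cy - radius, cy + radius)
        if (x - cx) ** 2 + (y - cy) ** 2 > radius ** 2:
            continue  # outside the circular boundary
        if all(math.dist((x, y), p) >= min_dist for p in spots):
            spots.append((x, y))
    return spots

spots = sample_poisson_disk(seed=0)
```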

2.5.5. Uniformly Distributed Centromeres with Two (UTA) and Three (UTHA) Adjacent Spots

To model proximity effects, two and three adjacent spots were generated by perturbing a subset of initial points with a random shift $ϵ$ sampled from $[- 3, 3]$ pixels. For two adjacent spots, the perturbation was applied to 15 spots out of the 46, generating a second, closely placed spot for each. For three adjacent spots, two perturbations were applied to 15 initial spots, creating two adjacent spots for each of the 15. The first 46 points within the circular boundary were retained after applying the perturbation.
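One possible reading of this procedure is sketched below. Several details are assumptions, because the text does not fix them: the shift is applied independently to each coordinate, the perturbed spots are listed directly after their parent so that they survive the truncation to 46 points, and the radius is set to 100 pixels.

```python
import math
import random

def sample_with_adjacent(n_spots=46, n_parents=15, n_children=1,
                         max_shift=3.0, center=(128.0, 128.0),
                         radius=100.0, seed=None):
    """Uniform spots where n_parents of them get n_children nearby copies.

    n_children=1 gives pairs (UTA); n_children=2 gives triplets (UTHA).
    """
    rng = random.Random(seed)
    cx, cy = center

    def inside(p):
        return (p[0] - cx) ** 2 + (p[1] - cy) ** 2 <= radius ** 2

    base = []
    while len(base) < n_spots:              # uniform points in the circle
        p = (rng.uniform(cx - radius, cx + radius),
             rng.uniform(cy - radius, cy + radius))
        if inside(p):
            base.append(p)

    points = []
    for px, py in base[:n_parents]:
        points.append((px, py))
        for _ in range(n_children):         # closely placed copies
            points.append((px + rng.uniform(-max_shift, max_shift),
                           py + rng.uniform(-max_shift, max_shift)))
    points.extend(base[n_parents:])
    # Keep the first n_spots points that lie within the circle.
    return [p for p in points if inside(p)][:n_spots]

pairs = sample_with_adjacent(n_children=1, seed=0)
triplets = sample_with_adjacent(n_children=2, seed=0)
```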

2.6. Spot Clustering Metrics

To analyze centromere spot patterns within synthetic distributions, we employed different metrics that capture clustering, modularity, spatial autocorrelation, and dispersion characteristics. Each metric was computed based on specific parameter settings and methodologies, as outlined below.

2.6.1. Ripley’s K Function

Ripley’s K function $K (r)$ [29] was used to quantify spatial clustering by calculating the difference between the expected number of neighboring points within a radius $r$ of each point and the corresponding expected number for a homogeneous spatial Poisson process, i.e., complete spatial randomness (CSR). Ripley’s K function is maximized when the expected number of spots around any given point is larger than that for the CSR distribution over a large number of radii $r$. This metric results in higher values for spot patterns from one large cluster, such as a single 2D Gaussian distribution with and without nuclear bodies. However, for dispersed small clusters, and even two- and three-mode 2D Gaussian distributions, Ripley’s K function is expected to return significantly lower values. We report the clustering percentage as the fraction of radii where $K (r)$ exceeds the Poisson expectation, indicating clustering. The calculation is as follows:

K (r) = \frac{A}{N^{2}} \sum_{i = 1}^{N} \sum_{j = 1, j \neq i}^{N} I (d_{i j} \leq r)

where $A$ is the area of the region, $N$ is the number of points, $d_{i j}$ is the distance between points $i$ and $j$, and $I$ is an indicator function that is 1 if $d_{i j} \leq r$ and 0 otherwise. We calculated Ripley’s K clustering score, or the clustering percentage, as follows:

\mathrm{Clustering\ Percentage} = \frac{\sum_{r = 0}^{r_{\max}} I (K (r) - K_{Poisson} (r) > 0)}{\mathrm{total\ steps}} \times 100

where $K_{Poisson} (r)$ is the expectation for a spatially random distribution.
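A minimal sketch of the clustering percentage is given below. It uses the standard two-dimensional CSR expectation $K_{Poisson} (r) = π r^{2}$ and applies no edge correction; the radius grid (`r_max`, `n_steps`) and the function names are assumptions for illustration, not the settings used in the study. Self-pairs are excluded from the count.

```python
import math

def ripley_k(points, r, area):
    """Naive Ripley's K estimate at radius r (no edge correction)."""
    n = len(points)
    count = 0
    for i in range(n):
        for j in range(n):
            if i != j and math.dist(points[i], points[j]) <= r:
                count += 1
    return area * count / n ** 2

def clustering_percentage(points, area, r_max, n_steps=50):
    """Percentage of radii at which K(r) exceeds the CSR value pi * r^2."""
    radii = [r_max * (s + 1) / n_steps for s in range(n_steps)]
    above = sum(1 for r in radii
                if ripley_k(points, r, area) - math.pi * r * r > 0)
    return 100.0 * above / n_steps

# A very tight cluster exceeds the CSR expectation at every tested radius.
cluster = [(128 + 0.1 * (i % 7), 128 + 0.1 * (i // 7)) for i in range(46)]
score = clustering_percentage(cluster, area=math.pi * 100.0 ** 2, r_max=50.0)
```

The radius $r = 0$ is skipped in this sketch because both $K (0)$ and $π r^{2}$ vanish there, so it can never count as clustered.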

2.6.2. Assortativity Coefficient

The assortativity coefficient is a measure of the connectedness of nodes with similar degrees in a graph [30]. It was computed for a $k$-nearest neighbor (k-NN) graph ($k = 10$) created from the centromere points, reflecting the tendency of nodes to connect to others with a similar degree. Assortativity is based on degrees and measures the correlation between a node’s degree and the degrees of its neighbors, and it is defined as follows:

r = \frac{\sum_{(i, j) \in E} (k_{i} - \bar{k}) (k_{j} - \bar{k})}{\sum_{(i, j) \in E} (k_{i} - \bar{k})^{2}}

where $k_{i}$ and $k_{j}$ are the degrees of nodes $i$ and $j$ in the graph, $\bar{k}$ is the mean node degree, and $E$ represents the set of edges in the graph.

Assortativity is reported as a single coefficient between $- 1$ (disassortative) and $+ 1$ (assortative).

Since the node degrees are calculated as the inverse of the squared distance of spots from each other, this metric is expected to be maximal when most of the spots surround the nuclear bodies, such as the nucleolus. However, even in this case the values remain very close to zero, and they are closely followed by those of spots generated using a single 2D Gaussian distribution.

2.6.3. Modularity

Modularity $Q$ measures the density of connections within clusters compared with connections between clusters [31], reflecting how well the graph divides into modules. The Louvain algorithm was used to compute modularity as follows:

Q = \frac{1}{2 m} \sum_{i, j} \left( A_{i j} - \frac{k_{i} k_{j}}{2 m} \right) δ (c_{i}, c_{j})

where $A_{i j}$ is the adjacency matrix, $k_{i}$ and $k_{j}$ are the degrees of nodes $i$ and $j$, $m$ is the total number of edges, and $δ (c_{i}, c_{j})$ is 1 if nodes $i$ and $j$ belong to the same community and 0 otherwise. The modularity index $Q$ ranges from $- 1$ to $1$, with values closer to 1 indicating stronger modularity. This metric is expected to be maximal in the presence of small clusters dispersed within nuclei.
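To make the formula concrete, the sketch below evaluates $Q$ for an unweighted, undirected graph and a given community assignment. It does not perform the Louvain community search itself; the function name and the toy graph are assumptions for illustration.

```python
def modularity(edges, community):
    """Evaluate Q for an undirected, unweighted graph.

    edges: list of (u, v) pairs, each undirected edge listed once.
    community: dict mapping every node to its community label.
    """
    m = len(edges)
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    adjacent = set(edges) | {(v, u) for u, v in edges}
    q = 0.0
    for i in community:
        for j in community:
            if community[i] == community[j]:        # delta(c_i, c_j) = 1
                a_ij = 1.0 if (i, j) in adjacent else 0.0
                q += a_ij - degree.get(i, 0) * degree.get(j, 0) / (2.0 * m)
    return q / (2.0 * m)

# Two triangles joined by a single edge, split into their natural communities.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
labels = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
q = modularity(edges, labels)
```

For this toy graph the two-community split gives $Q = 5/14$, whereas placing all nodes in a single community gives $Q = 0$.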

2.6.4. Moran’s I

Moran’s I statistic was used to evaluate spatial autocorrelation of centromere spots [32]. Moran’s I measures whether spots are clustered (positive value), dispersed (negative value), or randomly distributed (near zero). Moran’s I is defined as follows:

I = \frac{N \sum_{i} \sum_{j} w_{i j} (x_{i} - \bar{x}) (x_{j} - \bar{x})}{W \sum_{i} (x_{i} - \bar{x})^{2}}

where $N$ is the number of points, $x_{i}$ and $x_{j}$ are the coordinates of points $i$ and $j$, $\bar{x}$ is the mean coordinate, $w_{i j}$ is the spatial weight (set to $\frac{1}{d_{i j}^{2}}$ for $d_{i j} > 0$, else 0), and $W = \sum_{i} \sum_{j} w_{i j}$ is the sum of all weights. Moran’s I ranges from $- 1$ (complete dispersion) to 1 (perfect clustering), with zero indicating no autocorrelation. Moran’s I is expected to be maximized by the presence of local clusters further from the centroid of the spots, such as uniformly distributed centromeres with two and three adjacent spots, and minimized when all the spots are clustered at the center of the nucleus, as generated by a single 2D Gaussian distribution.
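The formula can be sketched with inverse squared-distance weights as below. The text does not specify how a two-dimensional coordinate enters the scalar terms $x_{i} - \bar{x}$, so this example takes one scalar value per point as an explicit assumption (for instance, a single coordinate); the function name is likewise illustrative.

```python
import math

def morans_i(points, values):
    """Moran's I with weights w_ij = 1 / d_ij^2 (0 when d_ij = 0).

    values holds one scalar per point; which scalar to use is an assumption.
    """
    n = len(points)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    num = 0.0
    w_sum = 0.0
    for i in range(n):
        for j in range(n):
            d = math.dist(points[i], points[j])
            w = 1.0 / d ** 2 if d > 0 else 0.0
            w_sum += w
            num += w * dev[i] * dev[j]
    return n * num / (w_sum * sum(x * x for x in dev))

line = [(float(i), 0.0) for i in range(10)]
smooth = morans_i(line, [float(i) for i in range(10)])            # gradient
alternating = morans_i(line, [float(i % 2) for i in range(10)])   # checker
```

A smooth gradient of values along the line yields a positive statistic, while alternating values yield a negative one.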

2.6.5. Mean Nearest Neighbor Distance (MNND)

The mean nearest neighbor distance (MNND) quantifies the average distance to the nearest neighboring spot [33], providing a measure of the clustering density of the centromeres. It is calculated as:

MNND = \frac{1}{N} \sum_{i = 1}^{N} d_{i, NN}

where $d_{i, NN}$ is the distance between point $i$ and its nearest neighbor and $N$ is the total number of points. A lower MNND indicates tighter clustering, while a higher value suggests more dispersed points. The MNND is expected to be maximized with a hardcore process spot pattern and minimized when each spot has at least one other spot in proximity.
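A direct, brute-force sketch of the MNND is shown below; the function name is an assumption, and for large point sets a spatial index would be preferable to the quadratic loop used here.

```python
import math

def mean_nearest_neighbor_distance(points):
    """Average, over all points, of the distance to the closest other point."""
    total = 0.0
    for i, p in enumerate(points):
        total += min(math.dist(p, q) for j, q in enumerate(points) if j != i)
    return total / len(points)

# Corners of a unit square: every nearest neighbor is at distance 1.
square = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
mnnd = mean_nearest_neighbor_distance(square)
```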

2.6.6. Dispersion Index

The dispersion index quantifies the level of spatial dispersion or clustering by comparing the variance of pairwise distances to their mean distance [34]. The dispersion index $D$ is defined as follows:

D = \frac{σ_{d}^{2}}{μ_{d}}

where $σ_{d}^{2}$ is the variance of the pairwise distances between points and $μ_{d}$ is the mean of those distances. A higher dispersion index indicates greater variability in distances, suggesting spatial clustering, whereas a lower index indicates uniform dispersion. The dispersion index is used to assess the spread of centromeres within the defined circular area, and it is maximized when the average of pairwise distances of the spots is minimum and the variance is high. Both these conditions can occur simultaneously with uniformly distributed centromeres with two and three adjacent spots.
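The dispersion index reduces to a variance-to-mean ratio over all unordered pairwise distances, as in the following sketch. Whether the population or the sample variance was used is not stated in the text; the population variance is assumed here.

```python
import itertools
import math
import statistics

def dispersion_index(points):
    """Variance of pairwise distances divided by their mean.

    Uses the population variance (an assumption; the text does not specify).
    """
    dists = [math.dist(p, q) for p, q in itertools.combinations(points, 2)]
    return statistics.pvariance(dists) / statistics.fmean(dists)

# Collinear points with pairwise distances 1, 2, and 3: D = (2/3) / 2 = 1/3.
d_index = dispersion_index([(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)])
```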

2.7. Centromere Spot Localization Modeling Methods Using Gaussian Distributions

To model the localization of the centromere spots, we used several approaches based on spot location and the distribution of their pairwise distances. Each approach is explained in detail below.

2.7.1. Uniform Distribution of Spots on the Cell Nucleus as a Benchmarking Method

To establish a baseline for centromere spot localization, a uniform distribution of spots, also known as a Poisson process, was generated within the boundaries of the cell nucleus. This method simulates a completely random spatial arrangement, providing a benchmark for comparing clustering metrics and other spatial pattern models.

2.7.2. Modeling of Centromere Localization Using the Cell Shape

To model centromere localization based on nuclear geometry, an ellipse was fitted to the segmented nucleus. The parameters of the ellipse, specifically the major axis $a$ and minor axis $b$, were used to define the spatial boundaries for generating a two-dimensional Gaussian distribution. The Gaussian distribution was centered at the nucleus center $(x_{c}, y_{c})$, and the probability density function at any point $(x, y)$ was as follows:

f (x, y) = \frac{1}{2 π σ_{x} σ_{y}} \exp \left( - \frac{1}{2} \left[ \frac{(x - x_{c})^{2}}{σ_{x}^{2}} + \frac{(y - y_{c})^{2}}{σ_{y}^{2}} \right] \right)

where $σ_{x} = \frac{a}{3}$ and $σ_{y} = \frac{b}{3}$ represent the standard deviations along the major and minor axes, respectively, to ensure realistic clustering within the nuclear boundary. The generated spots follow this elliptical Gaussian distribution, accurately reflecting the centromere patterns under the spatial constraints imposed by the nuclear shape.
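A hedged sketch of this shape-based model follows. It assumes the ellipse is axis-aligned and that `a` and `b` are supplied directly as the axis lengths used in $σ_{x} = a/3$ and $σ_{y} = b/3$; the function names are illustrative, and clipping of samples to the segmented nucleus is omitted.

```python
import math
import random

def elliptical_gaussian_pdf(x, y, center, a, b):
    """Density f(x, y) with sigma_x = a / 3 and sigma_y = b / 3."""
    xc, yc = center
    sx, sy = a / 3.0, b / 3.0
    z = ((x - xc) / sx) ** 2 + ((y - yc) / sy) ** 2
    return math.exp(-0.5 * z) / (2.0 * math.pi * sx * sy)

def sample_elliptical_gaussian(n_spots, center, a, b, seed=None):
    """Draw spots from the axis-aligned elliptical Gaussian (no clipping)."""
    rng = random.Random(seed)
    xc, yc = center
    return [(rng.gauss(xc, a / 3.0), rng.gauss(yc, b / 3.0))
            for _ in range(n_spots)]

spots = sample_elliptical_gaussian(46, (128.0, 128.0), a=30.0, b=15.0, seed=0)
peak = elliptical_gaussian_pdf(128.0, 128.0, (128.0, 128.0), 30.0, 15.0)
```

With standard deviations set to one third of the axis lengths, roughly three standard deviations span each axis, so most sampled spots fall inside the fitted ellipse.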

2.7.3. Radial Distribution and Ripley’s $K (r)$ Function Calculation

For this approach, we calculated Ripley’s $K (r)$ function, which describes the expected number of points within a distance $r$ from a randomly chosen point, normalized by the intensity $λ$ of the point process. The $K (r)$ function for our radially symmetric Gaussian distribution is defined as follows:

K (r) = \frac{C}{λ} \int_{0}^{r} \int_{0}^{2 π} f (ρ) ρ d θ d ρ

where $f (ρ)$ is the radial probability density function (PDF) of a 2D Gaussian distribution centered at the origin as follows:

f (ρ) = \frac{1}{2 π σ^{2}} \exp \left( - \frac{ρ^{2}}{2 σ^{2}} \right)

Converting to polar coordinates allows us to express the expected number of points within distance $r$ by integrating $f (ρ)$ over the radial distance $ρ$ from the origin. The inner integral over $θ$ simplifies due to the independence of $f (ρ)$ from the angular coordinate, reducing $K (r)$ to the following:

K (r) = \frac{C}{λ} \int_{0}^{r} \frac{ρ}{σ^{2}} \exp \left( - \frac{ρ^{2}}{2 σ^{2}} \right) d ρ

To further evaluate this expression, we used the substitution $u = \frac{ρ^{2}}{2 σ^{2}}$, transforming the integral into the following:

K (r) = \frac{C}{λ} \int_{0}^{\frac{r^{2}}{2 σ^{2}}} \exp (- u) d u

which yields the following:

K (r) = \frac{C}{λ} \left( 1 - \exp \left( - \frac{r^{2}}{2 σ^{2}} \right) \right)

This closed-form solution provides a cumulative description of the expected clustering of points up to a radius $r$, based on a Gaussian spatial distribution of centromere locations.
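As a quick numerical sanity check of this derivation, the closed form can be compared with a midpoint-rule evaluation of the radial integral. The function names and the choice $C / λ = 1$ are assumptions made only for the check.

```python
import math

def k_closed_form(r, sigma, c_over_lambda=1.0):
    """K(r) = (C / lambda) * (1 - exp(-r^2 / (2 sigma^2)))."""
    return c_over_lambda * (1.0 - math.exp(-r * r / (2.0 * sigma * sigma)))

def k_numeric(r, sigma, c_over_lambda=1.0, n_steps=10000):
    """Midpoint-rule integral of (rho / sigma^2) exp(-rho^2 / (2 sigma^2))."""
    h = r / n_steps
    total = 0.0
    for k in range(n_steps):
        rho = (k + 0.5) * h
        total += rho / sigma ** 2 * math.exp(-rho ** 2 / (2.0 * sigma ** 2)) * h
    return c_over_lambda * total

difference = abs(k_closed_form(15.0, 10.0) - k_numeric(15.0, 10.0))
```

The two values agree to within the discretization error of the quadrature, and the closed form approaches $C / λ$ as $r$ grows.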

2.7.4. Bayesian Estimation of Radially Shifted Gaussian Distribution Using Spots Coordinates

Given the observed radial distances $r_{1}, r_{2}, \dots, r_{n}$, and taking each $r_{i}$ to be drawn from a Gaussian distribution with mean $r_{0}$ and variance $σ^{2}$, the likelihood function $L (r_{0}, σ)$ can be expressed as follows:

L (r_{0}, σ) = P (r_{1}, r_{2}, \dots, r_{n} \mid r_{0}, σ) = \prod_{i = 1}^{n} P (r_{i} \mid r_{0}, σ)

where each $P (r_{i} \mid r_{0}, σ)$ is calculated as follows:

P (r_{i} \mid r_{0}, σ) = \frac{1}{\sqrt{2 π σ^{2}}} e^{- \frac{(r_{i} - r_{0})^{2}}{2 σ^{2}}}

Thus, the full likelihood function is as follows:

L (r_{0}, σ) = {\left( \frac{1}{\sqrt{2 π σ^{2}}} \right)}^{n} \exp \left( - \frac{1}{2 σ^{2}} \sum_{i = 1}^{n} (r_{i} - r_{0})^{2} \right)

Given that $r_{0}$ and $σ$ are themselves normally distributed, we can define the priors:

Prior on $r_{0}$: $r_{0} \sim N (μ_{r_{0}}, τ_{r_{0}}^{2})$, where $μ_{r_{0}}$ and $τ_{r_{0}}^{2}$ are the mean and the variance of the prior on $r_{0}$.

$P (r_{0}) = \frac{1}{\sqrt{2 π τ_{r_{0}}^{2}}} e^{- \frac{(r_{0} - μ_{r_{0}})^{2}}{2 τ_{r_{0}}^{2}}}$

Prior on $σ$: $σ \sim N (μ_{σ}, τ_{σ}^{2})$, where $μ_{σ}$ and $τ_{σ}^{2}$ are the mean and the variance of the prior on $σ$.

$P (σ) = \frac{1}{\sqrt{2 π τ_{σ}^{2}}} e^{- \frac{(σ - μ_{σ})^{2}}{2 τ_{σ}^{2}}}$

The goal is to find the posterior distribution $P (r_{0}, σ \mid r_{1}, r_{2}, \dots, r_{n})$, which is proportional to the product of the likelihood and the prior:

P (r_{0}, σ \mid r_{1}, r_{2}, \dots, r_{n}) \propto P (r_{1}, r_{2}, \dots, r_{n} \mid r_{0}, σ) P (r_{0}) P (σ)

Substituting the likelihood and the priors,

P (r_{0}, σ \mid r_{1}, r_{2}, \dots, r_{n}) \propto {\left( \frac{1}{\sqrt{2 π σ^{2}}} \right)}^{n} \exp \left( - \frac{1}{2 σ^{2}} \sum_{i = 1}^{n} (r_{i} - r_{0})^{2} \right) \cdot \frac{1}{\sqrt{2 π τ_{r_{0}}^{2}}} e^{- \frac{(r_{0} - μ_{r_{0}})^{2}}{2 τ_{r_{0}}^{2}}} \cdot \frac{1}{\sqrt{2 π τ_{σ}^{2}}} e^{- \frac{(σ - μ_{σ})^{2}}{2 τ_{σ}^{2}}}

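Dropping the factors that do not depend on $r_{0}$ or $σ$, this can be expressed more compactly as follows (the factor $σ^{- n}$ from the likelihood depends on $σ$ and must be retained):

P (r_{0}, σ \mid r_{1}, r_{2}, \dots, r_{n}) \propto σ^{- n} \exp \left( - \frac{1}{2} \left[ \frac{1}{σ^{2}} \sum_{i = 1}^{n} (r_{i} - r_{0})^{2} + \frac{(r_{0} - μ_{r_{0}})^{2}}{τ_{r_{0}}^{2}} + \frac{(σ - μ_{σ})^{2}}{τ_{σ}^{2}} \right] \right)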
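In practice it is convenient to work with the logarithm of this posterior. The sketch below evaluates it up to an additive constant; the function name and the hyperparameter values in the example are assumptions. The term $- n \log σ$ comes from the $(2 π σ^{2})^{- n / 2}$ factor of the likelihood, and non-positive $σ$ is rejected.

```python
import math

def log_posterior(r0, sigma, radii, mu_r0, tau_r0, mu_sigma, tau_sigma):
    """Unnormalized log-posterior of (r0, sigma) given radial distances."""
    if sigma <= 0:
        return -math.inf                     # sigma must be positive
    n = len(radii)
    log_lik = (-n * math.log(sigma)
               - sum((r - r0) ** 2 for r in radii) / (2.0 * sigma ** 2))
    log_prior = (-(r0 - mu_r0) ** 2 / (2.0 * tau_r0 ** 2)
                 - (sigma - mu_sigma) ** 2 / (2.0 * tau_sigma ** 2))
    return log_lik + log_prior

# Hypothetical radii and prior hyperparameters, for illustration only.
radii = [9.0, 10.0, 11.0]
lp = log_posterior(10.0, 1.0, radii, mu_r0=10.0, tau_r0=5.0,
                   mu_sigma=1.0, tau_sigma=1.0)
```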

2.7.5. Bayesian Estimation of the Radially Shifted Gaussian Distribution Using Pairwise Distances

In polar coordinates, a radially shifted Gaussian distribution can be represented as follows:

f (r, θ) = \frac{1}{\sqrt{2 π σ^{2}}} e^{- \frac{(r - r_{0})^{2}}{2 σ^{2}}}

where $r$ is the radial distance from the origin, $θ$ is the angular coordinate, $r_{0}$ is the mean radius of the doughnut shape, and $σ$ is the standard deviation controlling the thickness of the doughnut.

We then convert the polar coordinates $(r_{1}, θ_{1})$ and $(r_{2}, θ_{2})$ into Cartesian coordinates:

x_{1} = r_{1} \cos (θ_{1}), y_{1} = r_{1} \sin (θ_{1})

x_{2} = r_{2} \cos (θ_{2}), y_{2} = r_{2} \sin (θ_{2})

The Euclidean distance $d$ between the two points $(x_{1}, y_{1})$ and $(x_{2}, y_{2})$ is as follows:

d = \sqrt{(x_{2} - x_{1})^{2} + (y_{2} - y_{1})^{2}}

Substituting the expressions for $x_{1}, y_{1}$ and $x_{2}, y_{2}$ provides the following:

d = \sqrt{r_{1}^{2} + r_{2}^{2} - 2 r_{1} r_{2} \cos (θ_{2} - θ_{1})}

To derive the distribution of pairwise distances, we need to consider the probability distributions of $r_{1}$, $r_{2}$, and $θ_{2} - θ_{1}$. Assuming $r_{1}$ and $r_{2}$ are independently drawn from the radially shifted Gaussian distribution and $θ_{2} - θ_{1}$ is uniformly distributed over $[0, 2 π]$, the distribution of $d$ can be obtained by integrating over all possible values of $r_{1}$, $r_{2}$, and $θ_{2} - θ_{1}$:

P (d) = \int_{0}^{2 π} \int_{0}^{\infty} \int_{0}^{\infty} P (r_{1}) P (r_{2}) P (θ_{2} - θ_{1}) δ (d - \sqrt{r_{1}^{2} + r_{2}^{2} - 2 r_{1} r_{2} \cos (θ_{2} - θ_{1})}) d r_{1} d r_{2} d (θ_{2} - θ_{1})

Here, $P (r_{1})$ and $P (r_{2})$ are the radial distributions (Gaussian distributions with mean $r_{0}$ and variance $σ^{2}$):

P (r) = \frac{1}{\sqrt{2 π σ^{2}}} e^{- \frac{(r - r_{0})^{2}}{2 σ^{2}}}

$δ$ is the Dirac delta function, which ensures that the integration considers only valid pairwise distances, and $P (θ) = \frac{1}{2 π}$ is the uniform distribution of the angle difference. Simplifying the angle integral by noting that $P (θ)$ is uniform, the following is obtained:

P (d) = \frac{1}{(2 π)^{2}} \int_{0}^{2 π} \int_{0}^{\infty} \int_{0}^{\infty} P (r_{1}) P (r_{2}) δ (d - \sqrt{r_{1}^{2} + r_{2}^{2} - 2 r_{1} r_{2} \cos (θ)}) d r_{1} d r_{2} d θ

By substituting the variable $u = \cos (θ)$, we have $d u = - \sin (θ) d θ$. The limits of integration for $u$ are from −1 to 1.

The delta function imposes the condition $d^{2} = r_{1}^{2} + r_{2}^{2} - 2 r_{1} r_{2} u$. This, in turn, implies that $u = \frac{r_{1}^{2} + r_{2}^{2} - d^{2}}{2 r_{1} r_{2}}$. Thus, the integral over $u$ is as follows:

P (d) = \frac{1}{(2 π)^{2}} \int_{0}^{\infty} \int_{0}^{\infty} \frac{P (r_{1}) P (r_{2})}{2 r_{1} r_{2}} [\int_{- 1}^{1} δ (u - \frac{r_{1}^{2} + r_{2}^{2} - d^{2}}{2 r_{1} r_{2}}) d u] d r_{1} d r_{2}

The delta function reduces the integral over $u$ as follows:

P (d) = \frac{1}{2 π^{2}} \int_{0}^{\infty} \int_{0}^{\infty} \frac{P (r_{1}) P (r_{2})}{2 r_{1} r_{2}} H (1 - | \frac{r_{1}^{2} + r_{2}^{2} - d^{2}}{2 r_{1} r_{2}} |) d r_{1} d r_{2}

where $H (x)$ is the Heaviside step function ensuring that $\frac{r_{1}^{2} + r_{2}^{2} - d^{2}}{2 r_{1} r_{2}}$ lies within $[- 1, 1]$.

The final form of $P (d)$ involves evaluating the integral:

P (d) = \frac{1}{2 π^{2}} \int_{0}^{\infty} \int_{0}^{\infty} \frac{P (r_{1}) P (r_{2})}{2 r_{1} r_{2}} H (1 - | \frac{r_{1}^{2} + r_{2}^{2} - d^{2}}{2 r_{1} r_{2}} |) d r_{1} d r_{2}

For a set of pairwise distances $D$ from a single realization, the likelihood is as follows:

L (r_{0}, σ \mid D) = \prod_{d \in D} P (d \mid r_{0}, σ)

Then, the log-likelihood is as follows:

\log L (r_{0}, σ \mid D) = \sum_{d \in D} \log P (d \mid r_{0}, σ)

If $P (d \mid r_{0}, σ) = 0$ (e.g., due to insufficient Monte Carlo samples), a large negative value (e.g., $- 10^{10}$) is assigned to $\log P (d \mid r_{0}, σ)$ to avoid numerical issues.

For a cell line with multiple realizations $D_{1}, D_{2}, \dots, D_{M}$, the total log-likelihood is as follows:

\log L (r_{0}, σ \mid \{D_{m}\}) = \sum_{m = 1}^{M} \log L (r_{0}, σ \mid D_{m})
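One way to evaluate this log-likelihood without computing the double integral analytically is a Monte Carlo estimate of $P (d \mid r_{0}, σ)$, sketched below. Pairwise distances are simulated from the radially shifted Gaussian, binned into a histogram, and the histogram density is read off at each observed distance, with the $- 10^{10}$ floor applied to empty bins. The histogram estimator, the bin width, the number of simulated pairs, the redrawing of negative radii, and all function names are assumptions rather than the authors' settings.

```python
import math
import random

def draw_radius(r0, sigma, rng):
    """Draw r ~ N(r0, sigma^2), redrawing negative values so that r >= 0."""
    while True:
        r = rng.gauss(r0, sigma)
        if r >= 0.0:
            return r

def simulate_distances(r0, sigma, n_pairs, rng):
    """Simulate pairwise distances via the law of cosines."""
    out = []
    for _ in range(n_pairs):
        r1 = draw_radius(r0, sigma, rng)
        r2 = draw_radius(r0, sigma, rng)
        dtheta = rng.uniform(0.0, 2.0 * math.pi)
        d2 = r1 * r1 + r2 * r2 - 2.0 * r1 * r2 * math.cos(dtheta)
        out.append(math.sqrt(max(d2, 0.0)))
    return out

def log_likelihood(distances, r0, sigma, n_pairs=20000, bin_width=2.0,
                   seed=0, floor=-1e10):
    """Sum of log P(d | r0, sigma) using a histogram estimate of P(d)."""
    rng = random.Random(seed)
    counts = {}
    for d in simulate_distances(r0, sigma, n_pairs, rng):
        b = int(d // bin_width)
        counts[b] = counts.get(b, 0) + 1
    total = 0.0
    for d in distances:
        c = counts.get(int(d // bin_width), 0)
        # Empty bin: estimated P(d) is 0, so use the large negative floor.
        total += math.log(c / (n_pairs * bin_width)) if c else floor
    return total

observed = simulate_distances(40.0, 5.0, 200, random.Random(1))
ll_true = log_likelihood(observed, 40.0, 5.0)
ll_wrong = log_likelihood(observed, 10.0, 2.0)
```

Parameters far from those that generated the data leave many observed distances in empty bins, so their log-likelihood is dominated by the floor value.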

Given that $r_{0}$ and $σ$ are themselves normally distributed, we can define the priors $P (r_{0})$ and $P (σ)$ the same as those defined for Bayesian estimation of a radially shifted Gaussian distribution using spot coordinates. The posterior distribution combines the likelihood and priors:

P (r_{0}, σ \mid \{D_{m}\}) \propto L (r_{0}, σ \mid \{D_{m}\}) P (r_{0}) P (σ)

The log-posterior is as follows:

\log P (r_{0}, σ \mid \{D_{m}\}) = \log L (r_{0}, σ \mid \{D_{m}\}) + \log P (r_{0}) + \log P (σ) + \mathrm{constant}

2.8. MCMC Framework (Metropolis–Hastings Algorithm)

The Metropolis–Hastings algorithm involves iteratively proposing new values for model parameters based on a predefined distribution and calculating an acceptance ratio that compares the likelihood of the new parameters against the current ones. If the new parameters yield a higher probability or meet a random acceptance criterion, they are accepted; otherwise, the current parameters are retained. This process is repeated for a set number of iterations, with each accepted parameter set stored for subsequent analysis. The method ensures exploration of the parameter space while gradually converging on the most probable values based on the observed data.
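The loop described above can be sketched as a random-walk Metropolis–Hastings sampler. The Gaussian proposal, the step sizes, and the function names are assumptions. This sketch records the current state at every iteration, which is the standard form of the algorithm and may differ slightly from the bookkeeping described in the text.

```python
import math
import random

def metropolis_hastings(log_post, init, step, n_iter, seed=None):
    """Random-walk Metropolis-Hastings over a list of parameters.

    log_post: function returning the unnormalized log-posterior.
    step: proposal standard deviation for each parameter (an assumption).
    """
    rng = random.Random(seed)
    current = list(init)
    current_lp = log_post(*current)
    chain = []
    for _ in range(n_iter):
        proposal = [c + rng.gauss(0.0, s) for c, s in zip(current, step)]
        proposal_lp = log_post(*proposal)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, proposal_lp - current_lp)):
            current, current_lp = proposal, proposal_lp
        chain.append(tuple(current))
    return chain

# Toy target: a unit-variance Gaussian centered at 3.
def log_target(x):
    return -0.5 * (x - 3.0) ** 2

chain = metropolis_hastings(log_target, [0.0], [1.0], 20000, seed=0)
```

In an application, `log_post` would be the log-posterior of $(r_{0}, σ)$, and the early part of the chain would typically be discarded as burn-in before summarizing the parameters.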

2.9. Statistical Analysis

2.9.1. Metrics Comparison to Complete Spatial Randomness for Different Spot Generation Methods

To evaluate differences between groups for each metric, we conducted statistical testing in a two-step process. First, pairwise comparisons were performed using the Mann–Whitney U test [35], a rank-based non-parametric method for comparing two independent groups. This test evaluates whether one group tends to have higher or lower values than the other, without assuming normality or equal variances. We compared all groups to a reference group, complete spatial randomness (CSR), for each metric. To control for multiple comparisons and reduce the likelihood of false positives, we applied the Benjamini–Hochberg false discovery rate (FDR) correction [36] to the raw p-values. For each comparison, the Mann–Whitney U test results included the test statistic, the raw p-value, the corrected p-value, and the significance status (whether the corrected p-value was below 0.05). Significant results were bolded for presentation in the final dataset. The corrected p-values were used to determine statistical significance, with values below 0.05 considered indicative of significant differences.

2.9.2. Pairwise Metrics Comparison for Different Spot Generation Methods

To assess differences in the six clustering metrics (Ripley’s K function, assortativity, modularity, Moran’s I, MNND, and dispersion index) across the nine synthetic spot generation models (CSR, PDS, UTA, UTHA, S2DG, T2DG, TH2DG, 2DGNB, and T2DGTNB) relative to complete spatial randomness (CSR), we performed a robust statistical analysis. Initially, the Kruskal–Wallis test [37] was applied to detect overall differences across all models for each metric, offering a non-parametric assessment of variation without assuming normality. Subsequently, pairwise comparisons were conducted using the Mann–Whitney U test [35], comparing each model to the CSR reference group to pinpoint specific differences. This rank-based method was selected for its ability to handle non-normal data and unequal variances. To manage the risk of false positives from multiple comparisons, we implemented the Benjamini–Hochberg false discovery rate (FDR) correction [36] on the raw p-values. For each pairwise test, we reported the test statistic (U), the raw p-value, the corrected p-value, and the significance status (corrected p-value < 0.05), with significant results bolded in the output. Additionally, the rank-biserial correlation was computed as an effect size to measure the magnitude and direction of differences. Corrected p-values below 0.05 indicated statistically significant deviations from CSR for the given metric and model.
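The pairwise part of this procedure can be sketched in plain Python as follows. This is an illustrative sketch, not the authors' pipeline: the function names are hypothetical, the p-value uses the large-sample normal approximation without tie or continuity correction, and the sign convention of the rank-biserial correlation (positive when the first group tends to be larger) is an assumption. The Kruskal–Wallis omnibus test is not included.

```python
import math


def mann_whitney_u(x, y):
    """U statistic for x (pairs with x > y, ties counted as 0.5) and a
    two-sided p-value from the normal approximation."""
    n1, n2 = len(x), len(y)
    u = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)
    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean_u) / sd_u
    p = math.erfc(abs(z) / math.sqrt(2.0))  # P(|Z| > |z|) for a standard normal
    return u, p


def rank_biserial(u, n1, n2):
    """Effect size in [-1, 1]: (favorable - unfavorable pairs) / all pairs."""
    return 2.0 * u / (n1 * n2) - 1.0


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

In use, each model's metric values would be compared with the CSR values by `mann_whitney_u`, the resulting raw p-values passed together to `benjamini_hochberg`, and a comparison called significant when its adjusted p-value is below 0.05.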

2.9.3. Metrics for Comparing Distributions

We employed three metrics to compare the similarity between real and synthetic data distributions: the Wasserstein distance, the normalized mean squared error (MSE), and the Kolmogorov–Smirnov (KS) statistic. These metrics were calculated for each modeling method under two conditions, scrambled (control) and siNCAPH2 (treated), and for the eight cell line distributions. They assess the level of agreement between two probability density functions (PDFs), enabling us to quantify the fidelity of synthetic data in capturing the characteristics of experimental data.

The Wasserstein distance (WD), also known as the Earth mover’s distance, measures the minimum “cost” of transforming one distribution into another. It is defined for one-dimensional distributions as follows:

$$WD = \int_{-\infty}^{\infty} \left| F_{1}(x) - F_{2}(x) \right| dx$$

where $F_{1}(x)$ and $F_{2}(x)$ are the cumulative distribution functions (CDFs) of the two datasets. Lower Wasserstein distances indicate greater similarity.

The normalized mean squared error (MSE) quantifies the discrepancy between the real and synthetic PDFs by normalizing the MSE by the variance of the real PDF. It is as follows:

$$\text{Normalized MSE} = \frac{\text{MSE}}{\text{Variance of Real PDF}} = \frac{\frac{1}{N} \sum_{i = 1}^{N} \left(\text{Real PDF}[i] - \text{Synthetic PDF}[i]\right)^{2}}{\text{Variance of Real PDF}}$$

A lower normalized MSE indicates a higher similarity, with values close to zero reflecting strong agreement.

Finally, the Kolmogorov–Smirnov (KS) statistic measures the maximum absolute difference between the CDFs of two distributions as follows:

$$KS = \max_{x} \left| F_{1}(x) - F_{2}(x) \right|$$

where $F_{1}(x)$ and $F_{2}(x)$ are the cumulative distribution functions (CDFs) of the two datasets. Smaller KS values indicate closer alignment between the distributions. Together, these metrics provide a comprehensive assessment of the similarity between real and synthetic data across various methods and experimental conditions.
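The three similarity metrics can be sketched for one-dimensional samples as follows. This is a minimal illustration with hypothetical function names. It assumes that the Wasserstein distance and KS statistic are computed from empirical CDFs of the raw samples, that the PDFs passed to `normalized_mse` are sampled on a common grid of N bins, and that the MSE is the mean of the squared differences.

```python
import bisect


def ecdf(sorted_sample, x):
    """Empirical CDF of a sorted sample evaluated at x."""
    return bisect.bisect_right(sorted_sample, x) / len(sorted_sample)


def wasserstein_distance(a, b):
    """Integral of |F1(x) - F2(x)| over x for two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    grid = sorted(set(a) | set(b))
    total = 0.0
    for left, right in zip(grid, grid[1:]):
        # Both step functions are constant on [left, right).
        total += abs(ecdf(a, left) - ecdf(b, left)) * (right - left)
    return total


def ks_statistic(a, b):
    """Maximum absolute difference between two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in set(a) | set(b))


def normalized_mse(real_pdf, synthetic_pdf):
    """MSE between two PDFs on the same bins, divided by the variance of the real PDF."""
    n = len(real_pdf)
    mse = sum((r - s) ** 2 for r, s in zip(real_pdf, synthetic_pdf)) / n
    mean_real = sum(real_pdf) / n
    variance = sum((r - mean_real) ** 2 for r in real_pdf) / n
    return mse / variance
```

For example, shifting a sample by one unit, as in `wasserstein_distance([0, 1], [1, 2])`, gives a Wasserstein distance of 1, the size of the shift, whereas the KS statistic for the same pair is 0.5.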

3. Results

3.1. Simulation of Centromeric Spot Patterns Using Different Spatial Distribution Models

To test and benchmark centromere clustering metrics, we first generated synthetic image datasets representing a range of spatial patterns of centromeres, including uniform distributions, clustered arrangements, and dispersed configurations, to replicate centromere arrangements observed in experimental data [25]. We used nine different spatial distribution models to mimic the centromeric localization observed in HCT116-Cas9 colorectal cancer cells stained for the centromere protein CENP-C (Figure 1), which stably binds to centromeres throughout the cell cycle [38]. No experimental treatments were used in the presented data, and the observed differences in centromere distribution are an intrinsic property of most human cell types, which show cell-to-cell variability in centromere distribution. The models used for synthetic data generation included a complete spatial randomness (CSR) distribution that draws samples from a 2D uniform distribution (Figure 1A) and a Poisson disk sampling (PDS) process that sets a minimum distance threshold between two neighboring spots (Figure 1B). The formation of small clusters was mimicked by uniformly distributed centromeres with two adjacent spots (UTA) (Figure 1C) or three adjacent spots (UTHA) (Figure 1D). Clustering in large nuclear regions was represented by a single two-dimensional Gaussian distribution (S2DG) (Figure 1E), two two-dimensional Gaussian distributions (T2DG) (Figure 1F), or three two-dimensional Gaussian distributions (TH2DG) (Figure 1G). Centromeres localized in proximity to large nuclear bodies, such as nucleoli, were mirrored by either a two-dimensional Gaussian distribution with a nuclear body (2DGNB) (Figure 1H) or two two-dimensional Gaussian distributions with two nuclear bodies (T2DGTNB) (Figure 1I), both with central exclusion. For each of these spot patterns, we generated 100,000 synthetic images of nuclei, assuming the nucleus shape to be circular with a diameter of 10 μm, based on measurements of the mean nucleus area obtained from experimental image datasets of HCT116-Cas9 cells [25].
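Three of these generators (CSR, PDS, and S2DG) can be sketched as follows for a circular nucleus of 10 μm diameter. This is an illustrative sketch only: the function names, the rejection-sampling and dart-throwing strategies, and parameters such as the minimum distance and the Gaussian standard deviation are assumptions, not the settings used to produce the published datasets.

```python
import math
import random

RADIUS_UM = 5.0  # circular nucleus with a 10 um diameter, as stated in the text
N_SPOTS = 46     # spots per synthetic nucleus (assumed here)


def csr_in_disk(n, radius, rng):
    """Uniform points in a disk by rejection sampling from the bounding square."""
    points = []
    while len(points) < n:
        x, y = rng.uniform(-radius, radius), rng.uniform(-radius, radius)
        if x * x + y * y <= radius * radius:
            points.append((x, y))
    return points


def poisson_disk(n, radius, min_dist, rng, max_tries=100000):
    """Dart throwing: keep a uniform candidate only if it lies at least
    min_dist from every accepted point. May return fewer than n points
    if max_tries is exhausted."""
    points = []
    for _ in range(max_tries):
        if len(points) == n:
            break
        candidate = csr_in_disk(1, radius, rng)[0]
        if all(math.dist(candidate, p) >= min_dist for p in points):
            points.append(candidate)
    return points


def gaussian_cluster(n, radius, center, sd, rng):
    """Isotropic 2D Gaussian around center, rejecting points outside the nucleus."""
    points = []
    while len(points) < n:
        x, y = rng.gauss(center[0], sd), rng.gauss(center[1], sd)
        if x * x + y * y <= radius * radius:
            points.append((x, y))
    return points
```

A call such as `poisson_disk(N_SPOTS, RADIUS_UM, 0.5, random.Random(0))` yields a dispersed pattern, whereas `gaussian_cluster` with a small `sd` yields a single tight cluster. The multi-Gaussian and nuclear-body models could be composed from the same building blocks.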

Figure 1. Simulated spatial distribution patterns of centromeres modeled to reflect various clustering and dispersion behaviors observed in HCT116-Cas9 colorectal cancer cells. In each row, the two leftmost images show the fluorescence image of the centromere protein CENP-C in an HCT116 cell (green) and the extracted spot locations (yellow circles) within the nucleus outline (red). Each row represents a distinct spatial distribution model: (A) complete spatial randomness (CSR), where centromeres are uniformly and randomly distributed; (B) Poisson disk sampling (PDS), enforcing a minimum distance between spots to prevent overlap; (C,D) uniformly distributed centromeres with two (UTA) or three (UTHA) adjacent spots mimicking small clusters; (E–G) single (S2DG), two (T2DG), or three (TH2DG) two-dimensional Gaussian distributions, representing varying levels of clustering within larger nuclear regions; (H) two-dimensional Gaussian distribution with a nuclear body (2DGNB), excluding the nuclear body area; and (I) two two-dimensional Gaussian distributions with two nuclear bodies (T2DGTNB). For simulated distributions, the Gaussian variance increases from left to right as indicated.

3.2. Benchmarking of Clustering Metrics on Synthetic Patterns

These synthetic spot distribution datasets offer a controlled environment for assessing how well different clustering metrics can detect specific clustering patterns. Accordingly, we used them to benchmark six clustering metrics (Figure 2) to measure local and global differences in point clustering patterns [24,30,39,40]. In particular, we focused on how sensitively each metric detected differences between the synthetic data simulated by complete spatial randomness (CSR), which draws samples from a 2D uniform distribution, and the data simulated by each of the other distributions, which represent either increased degrees or distinct patterns of spot clustering or, in the case of Poisson disk sampling (PDS), centromere dispersion (Figure 2; see Section 2 for details). When applied to synthetic centromere spot patterns, the clustering metrics showed clear differences in their sensitivity to different centromere clustering patterns (Figure 2A).

Figure 2. Benchmarking clustering metrics for detecting and characterizing synthetic centromere clustering patterns. (A) The sensitivity of the six clustering metrics was tested across synthetic spot distributions. Each boxplot displays the variability of metric values for the nine spatial distribution models, with statistical testing confirming significant differences between CSR and other distributions for most metrics. The box represents the interquartile range (IQR), showing the middle 50% of the data, with the line inside the box indicating the median value. The whiskers extend from the box to the smallest and largest values within 1.5 times the IQR from the lower and upper quartiles. (B) The robustness of clustering metrics to changes in the spot number was evaluated by progressively removing a random number of spots from synthetic nuclei.

Ripley’s K function [24,29] had a value of zero for PDS and very small values for uniformly distributed spots, most likely due to the limited area and spot sampling size. Pairwise Mann–Whitney U tests, corrected using the Benjamini–Hochberg false discovery rate (BH-FDR) method, showed a statistically significant difference when using Ripley’s K function between CSR and all other spot generation methods (p < 0.05) (Figure 2A and Supplementary Table S2), making it a robust metric for measuring centromere clustering. Cohen’s D values also indicated a substantial difference between the same comparison groups (Supplementary Figure S1). The assortativity index is a measure of the connectedness of nodes with similar degrees in a graph [30]. For this metric, statistical testing showed a significant difference between CSR and all other spot generation methods (p < 0.05), except for T2DG (Figure 2A). However, the absolute Cohen’s D values for this metric showed only small (D < 0.2) to medium (0.5 < D < 0.8) differences between clustering patterns and the CSR negative control (Supplementary Figure S1), limiting its use for the detection of changes in centromere clustering patterns. The modularity index, a measure of connection density within clusters compared with connections between clusters in a graph, reflects how well the graph divides into modules. The modularity index analysis showed a statistically significant difference between CSR and all other spot generation methods (p < 0.05) (Figure 2A and Supplementary Table S2). However, this metric only showed a substantial difference between CSR and three other spot generation methods (PDS, UTHA, and S2DG), while the differences for the rest were either modest or small (Supplementary Figure S1). Moran’s I is a measure of spatial autocorrelation [32]. Statistical testing yielded a significant difference between CSR and all other spot generation methods (p < 0.05), except for 2DGNB (Supplementary Table S2). This metric, however, only showed a substantial difference between CSR and the PDS, UTA, UTHA, and S2DG spot generation methods, while the differences for the rest were either modest or small (Supplementary Figure S1). The mean nearest neighbor distance (MNND) [33], which measures the average distance to the nearest neighboring spot, showed a significant difference between CSR and all other spot generation methods (p << 0.05) (Supplementary Table S2). Cohen’s D values above 0.8 also confirmed a substantial difference between the same comparison groups, except for TH2DG, for which the difference was only modest (0.5 < D < 0.8) (Supplementary Figure S1). The dispersion index [34], calculated as the variance over the mean of the pairwise distance distribution, showed a significant difference between CSR and all other spot generation methods (p << 0.05) (Supplementary Table S2). Cohen’s D values above 0.8 also confirmed a substantial difference between the same comparison groups, except for T2DG and T2DGTNB, for which the differences were small (0.2 < D < 0.5) (Supplementary Figure S1).
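Two of the distance-based metrics and the effect size used in this comparison can be sketched directly from their verbal definitions. The function names are hypothetical, and two details are assumptions: the dispersion index uses the population variance of the pairwise distances, and Cohen’s D uses the pooled sample standard deviation.

```python
import math
from itertools import combinations
from statistics import mean, pvariance, variance


def mnnd(points):
    """Mean nearest neighbor distance: average distance from each spot to its closest neighbor."""
    nearest = [
        min(math.dist(p, q) for j, q in enumerate(points) if j != i)
        for i, p in enumerate(points)
    ]
    return mean(nearest)


def dispersion_index(points):
    """Variance over mean of the pairwise distance distribution."""
    distances = [math.dist(p, q) for p, q in combinations(points, 2)]
    return pvariance(distances) / mean(distances)


def cohens_d(a, b):
    """Cohen's D: difference of means divided by the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)
```

Applied per nucleus, `mnnd` and `dispersion_index` each give one value per cell. `cohens_d` then compares the per-cell values of a model with those of the CSR reference, with absolute values above 0.8 read as a substantial difference, following the thresholds quoted above.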

These results suggest that while some metrics, such as the MNND and the dispersion index, perform well in separating CSR from other distributions, only Ripley’s K function showed robust and consistent differences between CSR and both clustering and dispersion patterns. We conclude that Ripley’s K function is the most suitable clustering metric to detect a wide range of clustering patterns as would, for example, be seen in unbiased screening approaches (Supplementary Figure S1 and Table S2).

To assess the discriminative power of clustering metrics across various synthetic spot distributions, we performed pairwise Mann–Whitney U tests, corrected with the Benjamini–Hochberg false discovery rate (BH-FDR) method, evaluating six metrics across nine spot generation methods. Ripley’s K function stood out as the most robust, detecting statistically significant differences (p < 0.05) in all pairwise comparisons with large effect sizes (e.g., r ranging from −0.995 to 1), making it exceptional at distinguishing all spot generation methods. The MNND closely followed, missing significance in only one comparison (T2DGTNB vs. S2DG) and showing large effect sizes (e.g., r = 1 for several pairs), proving its strength in most cases. The dispersion index identified significant differences in most pairs, though with variable effect sizes (e.g., r = 0.046 for CSR vs. T2DG), suggesting utility for specific distributions. Modularity and Moran’s I also performed well, detecting significant differences in nearly all comparisons (except for Moran’s I for 2DGNB vs. CSR), with effect sizes varying (e.g., modularity r = −0.152 for CSR vs. T2DGTNB), indicating effectiveness for certain pairs. Assortativity detected differences in most cases but had two non-significant comparisons (PDS vs. UTA and CSR vs. T2DG) and smaller effect sizes (e.g., r = 0.005 for PDS vs. UTA), implying less consistency yet some situational value. Detailed results are available in Supplementary Table S3, highlighting that while Ripley’s K function excels overall, the other metrics provide complementary insights for distinguishing specific spot generation methods.

3.3. Measuring Clustering Metrics’ Robustness to Synthetic Changes in the Spot Number per Nucleus

The human HCT116 colon cancer cell line is pseudodiploid [41] and the expected number of detected centromeres in these cells is 46. However, we noticed that the median number of detected CENP-C-labelled centromere spots per nucleus in HCT116-Cas9 cells was only 32 [25]. This discrepancy is possibly due to the limited optical resolution of our diffraction-limited microscopes, which cannot resolve multiple centromeres located very close to each other in maximal projections of image z-stacks (see Section 2). Given the lower number of detected centromere signals in experimental data compared with simulated spot distributions, we sought to determine how sensitive the clustering metrics were to changes in the CENP-C spot number. To address this question, we retested all clustering metrics on the synthetic spot image datasets after removing a variable number of randomly chosen spots (from 1 to 30) from each nucleus (Figure 2B). The average value for each metric was then determined for each number of spots removed from the initial 46 spots, and the percent change relative to the value with 46 spots was calculated to quantify the relative variability of each metric across all distributions (Figure 2B; Supplementary Table S4). Once again, Ripley’s K function proved to be the most robust metric. Both for Ripley’s K function and for the dispersion index, most percent change values were below 10%, indicating the robustness of these metrics to the number of detected spots. In contrast, Moran’s I and modularity showed moderate variation with the number of spots (below 60%), while assortativity and the MNND showed high sensitivity to the spot number for most spot generation models (Supplementary Table S4).
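The spot-removal test can be sketched as follows. This is an illustration under stated assumptions, with hypothetical function names: `ripley_k` is the textbook estimator at a single radius without edge correction, which may differ from the Ripley’s K clustering score defined in Section 2, and the percent change is computed on the mean metric value over all nuclei.

```python
import math
import random


def ripley_k(points, r, area):
    """Naive Ripley's K estimate at radius r (no edge correction):
    area times the fraction of ordered spot pairs separated by at most r."""
    n = len(points)
    pairs = sum(
        1
        for i in range(n)
        for j in range(n)
        if i != j and math.dist(points[i], points[j]) <= r
    )
    return area * pairs / (n * (n - 1))


def percent_change_after_removal(nuclei, n_removed, metric, rng):
    """Percent change of the mean metric after removing n_removed random
    spots per nucleus, relative to the mean metric with all spots."""
    full = sum(metric(points) for points in nuclei) / len(nuclei)
    reduced = sum(
        metric(rng.sample(points, len(points) - n_removed)) for points in nuclei
    ) / len(nuclei)
    return 100.0 * (reduced - full) / full
```

Because this estimate is normalized by the number of spot pairs, its expected value does not depend on how many spots are retained, which is consistent with the small percent changes reported for Ripley’s K function. A metric such as the MNND, by contrast, necessarily grows as spots are removed.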

Altogether, the results of these benchmark tests on synthetic images indicate that Ripley’s K function and, to a lesser extent, the dispersion index, which are both directly calculated from the statistical properties of spot distance distributions in a cell, are the most robust metrics to spot number variability (Figure 2B). Ripley’s K function also showed the best performance in separating different distributions based on Cohen’s D values (Supplementary Figure S1). Other metrics, however, may still be used for finding specific patterns (such as modularity and the MNND for adjacent spots), albeit with less reliability.

3.4. Validation of Clustering Metrics Using Experimental Data

Having benchmarked the different clustering metrics on synthetic images, we next tested all metrics on HTI images of HCT116-Cas9 cells stained with an antibody against the centromere protein CENP-C (Figure 3A). One of the major applications of a quantitative clustering metric would be the detection of changes in clustering patterns, for example, in imaging-based optical screens. We therefore also sought to determine which clustering metric produced the largest difference between the normal distribution of centromeres in HCT116-Cas9 cells and the distribution in HCT116-Cas9 cells transfected with an siRNA oligo targeting the NCAPH2 gene, which encodes a subunit of the condensin II complex and whose knockdown is known to lead to increased centromeric clustering (Figure 3A) [21,25]. The results of this experiment revealed varying abilities of the different clustering metrics to capture this known biological effect (Figure 3B).

Figure 3. Analysis of centromere clustering in the control and NCAPH2-depleted cells. (A) Representative immunofluorescence images showing CENP-C spot detection (green) in HCT116-Cas9 cells transfected with siScramble (control) or siNCAPH2. The nuclear outline is indicated in red. Scale bar: 5 μm. (B) Comparison of different clustering metrics to quantify differences between the control and NCAPH2-depleted cells. The box represents the interquartile range (IQR), showing the middle 50% of the data, with the line inside the box indicating the median value. The whiskers extend from the box to the smallest and largest values within 1.5 times the IQR from the lower and upper quartiles. (C) Cohen’s D analysis showing the relative sensitivity of various clustering metrics for separating siScramble (control) and siNCAPH2 in HCT116 cells, with Ripley’s K function demonstrating the highest absolute value, followed by spot density, spot number, modularity, and assortativity.

While each of the clustering metrics could detect differences in centromere distribution at the single-cell level (p < 0.05), Ripley’s K clustering score produced the largest difference in centromere clustering between control and NCAPH2 knockdown cells, followed by modularity and assortativity (Figure 3B). Similarly, when we calculated and plotted values for the CENP-C spot number and density on a per-cell basis, we observed large differences upon NCAPH2 knockdown, likely due to increased clustering of individual centromeres into larger spot aggregates that cannot be resolved with diffraction-limited microscopy. Accordingly, Cohen’s D values for all these metrics indicated that Ripley’s K clustering score was, again, the most sensitive metric, followed by spot density, spot number, modularity, and assortativity (Figure 3C). Using the dispersion index, Moran’s I, and the MNND for the same image set led to much lower Cohen’s D values (Figure 3C). The results of these experiments indicate that Ripley’s K function is the most sensitive metric to detect changes in overall clustering of centromeres in cells.

3.5. Modeling the Spatial Distribution of Centromeres in Cells

As a complementary approach to our identification of the most sensitive metrics for the analysis of centromere distributions, we asked whether we could predict the observed distribution of centromere patterns using a modeling approach. To this end, we first sought to establish the overall spatial distribution of centromeres in HCT116-Cas9 cells. Accordingly, we overlaid all the spots detected in HCT116-Cas9 cells by shifting all spots within each cell so that the nuclear centers were located at the (0,0) position (Figure 4). The histogram of the standardized distribution of the spots in 2D space revealed a doughnut-like shape with the center at (0,0) (Figure 4A). A line plot of the 2D histogram of standardized spot locations confirmed low spot density near the center of the nucleus and a higher density in the mid-region between the center and the edge of the nucleus (Figure 4B). It is likely that the low density of spots at the center was caused by the well-established presence of major nuclear compartments such as the nucleolus in the interior of the nucleus [14,15,20]. In siNCAPH2-treated cells, the doughnut-like structure showed a lower density at the center than in the control, possibly because tight clustering of spots at the center of the nucleus lowered the number of individual spots detected in this region.

Figure 4. Spatial distribution of standardized centromere locations in the control and NCAPH2-depleted cells. (A) 2D histogram showing the distribution of CENP-C spots relative to the nuclear center (0,0), revealing a doughnut-shaped pattern in both conditions. (B) Line plot analysis at X = 0 and Y = 0 demonstrating lower spot density at the nuclear center and higher density between the center and the nuclear edge. (C,D) Gaussian mixture modeling of spot distribution along the X = 0 and Y = 0 axes, indicating a radially symmetric centromere organization with potential alignment artifacts affecting the observed central low-density region.

The distribution in both the X and Y directions could be accurately modeled using a two-mode one-dimensional Gaussian mixture (Figure 4C,D). The similarity of the Gaussian parameters for the fitted line plots in both directions (Pearson correlation coefficient > 0.99) suggested that a radial Gaussian distribution that is uniform in all directions accurately reflects the cellular distribution of centromeres. The same properties of a doughnut-like distribution and close fit to a radial Gaussian distribution were found upon knockdown of NCAPH2.

3.6. Centromere Localization Modeling Using Parametric Distributions

To assess centromere localization in 2D using parametric distributions based on the observations in Figure 4, we fitted several spatial distribution models to the spot locations extracted from the imaging data. To measure the accuracy of the models, we simulated spot positions within the nuclear masks obtained from the DAPI-channel segmentation, using the parameters of the models fitted to the microscopy images. We then compared the simulated pairwise spot-to-spot distance distributions and the spot radial distance distributions with the ones obtained from the microscopy images [25].

We used five spot distribution models for the experimental imaging data: (1) complete spatial randomness, that is, uniformly distributed spots in 2D with no preferential localization (model 1; M1); (2) a nucleus-shaped Gaussian distribution, which represents a radially distributed pattern with parameters constrained by the nucleus shape (M2); (3) a radial Gaussian distribution in 2D space with uniformly distributed spots for all angles, for which the model parameters were extracted by fitting the analytically calculated CDF of pairwise distances to the corresponding CDF from real data (M3; see Section 2 for details); (4) a radially shifted Gaussian distribution, assuming uniform distribution across 360 degrees, for which the model parameters were calculated iteratively using a Bayesian framework applied to the spot coordinates from real data (~46 spot coordinates per cell) (M4; see Section 2 for details); and (5) a radially shifted Gaussian distribution using spot distances, assuming uniform distribution across 360 degrees, for which the model parameters were calculated iteratively using a Bayesian framework applied to the distribution of pairwise distances of spots from real data (~1000 pairwise distances per cell) (M5; see Section 2 for details).
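Sampling from a radially shifted Gaussian model of the kind used for M4 and M5 can be sketched as follows. This is a hypothetical illustration: the function names, the truncation of radii to the interval from 0 to the nuclear radius, and the parameter values in the usage note are assumptions, not the fitted values from the paper.

```python
import math
import random
from itertools import combinations


def sample_radially_shifted_gaussian(n, r0, sigma, radius, rng):
    """Draw n spots with radial distance ~ Normal(r0, sigma), truncated to
    [0, radius], and an angle that is uniform over 360 degrees."""
    points = []
    while len(points) < n:
        r = rng.gauss(r0, sigma)
        if 0.0 <= r <= radius:
            theta = rng.uniform(0.0, 2.0 * math.pi)
            points.append((r * math.cos(theta), r * math.sin(theta)))
    return points


def pairwise_distances(points):
    """All unordered spot-to-spot distances within one nucleus."""
    return [math.dist(p, q) for p, q in combinations(points, 2)]
```

For instance, `sample_radially_shifted_gaussian(46, 3.0, 1.0, 5.0, random.Random(0))` produces one synthetic nucleus. With 46 spots, `pairwise_distances` returns 46 × 45 / 2 = 1035 values, in line with the roughly 1000 pairwise distances per cell used as input for M5.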

By comparing the simulated (M1–M5) spot localizations with the observed CENP-C spot positions measured in images of HCT116-Cas9 cells (M0), we observed varying degrees of similarity for both pairwise and radial distance distributions (Figure 5A). The pairwise distance distribution generated by the M4 model most closely fitted the experimental data, with its density curve aligning well at the peak and along the tail regions, indicating its high accuracy in capturing the underlying spatial organization of centromeres (Figure 5A). M3 and M5 also performed significantly better than M1 and M2, but their peaks were slightly less aligned with the experimental data compared with M4. These observations were quantified using three established distribution similarity metrics: the Wasserstein distance, the normalized mean squared error (MSE), and the KS statistic, which quantify different properties of a distribution (see Section 2 for details) (Figure 5B). These analyses indicate that the M4 model, with most of its distribution similarity metrics below 10%, provides the best fit to the observed distributions, followed by M3 and M5 (Figure 5B).

Figure 5. Comparative analysis of centromere spatial distribution models. (A) Pairwise and radial distance distributions comparing the experimental data (M0) with five different synthetic models (M1–M5) in HCT116-Cas9 cells. (B) Quantitative assessment of model performance using the Wasserstein distance, the normalized MSE, and the KS statistic, showing superior performance of M4. (C) Heatmap analysis of clustering metrics across different models, demonstrating that Bayesian-based models (M4 and M5) achieve the lowest normalized MSE values for most metrics. The ** indicates the minimum value for each row.

3.7. Spot Clustering Metrics Comparison for Generated Spots

We also compared the spot clustering metrics for spots generated using various spot modeling methods (Figure 5C). The Bayesian-based models (M4 and M5) consistently achieved the lowest normalized MSE values across most metrics. Metrics such as Ripley’s K function, modularity, and the dispersion index showed significant improvement with M4 and M5 compared with simpler models such as M1 and M2, highlighting their ability to replicate global spatial clustering behavior. Although M4 and M5 did not outperform other methods for the MNND, the differences between these models and the best-performing methods were minimal. The values for the assortativity index in most cases were below 0.005 and displayed as zero. This is in line with our previous observation when comparing different metrics for a varying number of spots (Figure 2B), indicating that assortativity is less effective in discriminating different spot distributions.

We next asked whether spot localizations simulated using models M1–M5 can approximate the localization of CENP-C spots in cells when centromeric spatial patterns are perturbed experimentally, such as when NCAPH2 is knocked down (siNCAPH2) in HCT116-Cas9 cells, which leads to increased clustering of centromeres [42,43,44]. Upon analysis of the siNCAPH2 dataset, M4 remained the best model, followed by M3 and M5 (Figure 6A). Quantification of these observations using the Wasserstein distance, the normalized MSE, and the KS statistic further confirmed M4’s superior performance, followed by M3 and M5 (Figure 6B). However, the similarity between the modeled and real distributions of pairwise and radial distances was lower than in the control HCT116-Cas9 cells (Figure 5B). This could be due in part to the smaller number of spots detected in siNCAPH2-treated cells, which lowers the accuracy of model fitting, and in part to changes in cellular architecture caused by NCAPH2 loss.

Figure 6. Model performance analysis in NCAPH2-depleted cells. (A) Pairwise and radial distance distribution comparisons between experimental data and synthetic models (M1–M5) in the siNCAPH2-treated cells. (B) Distribution similarity metrics showing that the M4 model maintains the best performance but with reduced accuracy compared with the control cells. (C) Clustering metric analysis demonstrating that the Bayesian models (M4, M5) achieve the lowest similarity values for most metrics, with slight variations in the MNND performance. The ** specifies the minimum value for each row.

Comparing spot clustering metrics for spots generated using the various spot modeling methods with the same metrics calculated using the experimental data for siNCAPH2 cells (Figure 6C) showed a similar trend as in the control cells (Figure 5C). Similar to the control cells, in the siNCAPH2 datasets, M4 and M5 achieved the lowest values for similarity metrics such as Ripley’s K function, modularity, Moran’s I, and the dispersion index, confirming their robustness in capturing centromere clustering patterns.

3.8. Evaluation of Generative Models for CENP-C Spot Localization Patterns in Multiple Human Cell Lines

Finally, to ask whether these quantification methods are generally applicable to multiple cell types, we tested them on imaging datasets from a diverse set of eight human cell types and tissues, ranging from colon cancer cells and skin cells to induced pluripotent stem cells, which show an exceptionally high level of clustering [14]. As in HCT116-Cas9 cells, we observed a doughnut-like localization pattern when overlaying the nucleus-centered spot locations in all eight cell lines (Figure S2A). All cell lines showed a similar Gaussian-shaped distribution of spot densities (Figure S2B). Visual inspection of the pairwise distance distributions revealed that model-based distributions reproduce the distribution of pairwise distances with high accuracy (Figure 7A). Although the models were not fitted based on radial distances, they followed the downward parabolic trend in the underlying radial positioning (Figure 7B).

Figure 7. Centromere spatial organization analysis across cell lines, including HCT116 (colon), A375 (melanoma), MDA-MB-231 (breast), HFF-hTERT (fibroblast), hTERT-RPE1 (retinal), A549 (lung), HAP1 (myeloid), and WTC11 (embryonic stem cells). (A) Pairwise distance distributions comparing the experimental data with model predictions. (B) Radial distance distributions showing the models’ ability to capture centromere positioning. (C) Quantitative comparison of model performance using the Wasserstein distance, the normalized MSE, and the KS statistic, demonstrating M3 as the best-performing model across most cell lines, with M4 as the second-best. The ** indicates the minimum value for each row.

We did notice clear differences in how well the computational models reproduced centromere distributions across cell types (Figure 7C). For pairwise distances, the M3 model achieved the lowest Wasserstein distance, normalized MSE, and KS statistic in most cell lines, with the WTC11 induced pluripotent stem cell line as an exception, indicating that M3 is the most effective at predicting the observed pairwise distance distributions of centromeres. M3 also outperformed the other models when the radial distribution was analyzed. M4 followed closely as the second-best performing model, followed by M5, M1, and M2. This trend was observed across most cell lines. M5 exhibited a noticeably weaker predictive performance compared with M1 and M2, possibly because the small number of cells available is not sufficient for the more complex M5 model, which uses a larger input set (>1000 pairwise distances) to fully capture the spatial organization of centromeres.
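To make the three comparison scores concrete, the following is a minimal Python sketch of how an experimental and a simulated distance sample could be compared. It is not the authors' implementation: the function name `compare_distance_distributions`, the number of bins, and in particular the normalization chosen for the MSE (histogram MSE divided by the mean squared value of the observed histogram) are assumptions made for illustration only.

```python
import numpy as np
from scipy.stats import ks_2samp, wasserstein_distance


def compare_distance_distributions(observed, simulated, bins=50):
    """Return three dissimilarity scores between two 1D samples of distances.

    Hypothetical helper; lower values mean the two samples agree better.
    """
    observed = np.asarray(observed, dtype=float)
    simulated = np.asarray(simulated, dtype=float)

    # Wasserstein (earth mover's) distance between the two empirical samples.
    wd = wasserstein_distance(observed, simulated)

    # Two-sample Kolmogorov-Smirnov statistic: largest gap between the ECDFs.
    ks = ks_2samp(observed, simulated).statistic

    # ASSUMED normalization: MSE between density histograms on shared bins,
    # divided by the mean squared value of the observed histogram.
    lo = min(observed.min(), simulated.min())
    hi = max(observed.max(), simulated.max())
    edges = np.linspace(lo, hi, bins + 1)
    h_obs, _ = np.histogram(observed, bins=edges, density=True)
    h_sim, _ = np.histogram(simulated, bins=edges, density=True)
    nmse = np.mean((h_obs - h_sim) ** 2) / np.mean(h_obs ** 2)

    return {"wasserstein": wd, "nmse": nmse, "ks": ks}


# Usage with synthetic stand-in data (not experimental values).
rng = np.random.default_rng(0)
experimental = rng.normal(5.0, 1.0, 1000)
model_output = rng.normal(5.2, 1.1, 1000)
scores = compare_distance_distributions(experimental, model_output)
print(scores)
```

In a comparison such as Figure 7C, one such set of scores would be computed per model and per cell line, and the model with the smallest value in a row would be marked as the best match.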

4. Discussion

Understanding the spatial organization of cellular structures is a fundamental question in modern cell biology. Centromeres are prominent cellular features of each chromosome, and the elucidation of their distribution in the human cell nucleus is crucial for deciphering chromosomal behavior and nuclear architecture in both normal and diseased cells. To this end, we generated a systematic framework and specific measurement tools to quantitatively assess the cellular distribution of human centromeres. These new methods will be useful in the investigation of the mechanisms involved in establishing and maintaining centromere localization patterns and their functional implications.

We used this framework to assess multiple centromere spatial distribution types, clustering metrics, and spot generation models. By simulating diverse spatial patterns—including uniform distributions, single Gaussian clusters, multi-modal Gaussian mixtures, and perturbed adjacent spots—these models allow precise comparison and identification of the metrics most suited for specific biological questions. This benchmarking process ensures that the chosen metric aligns with the spatial characteristics of interest, enhancing the rigor and relevance of quantitative assessments.

Based on our analysis, Ripley’s K function emerged as the most sensitive and versatile metric for the measurement of global centromere clustering, with minimal variation across different numbers of detected spots. This property makes it a suitable metric for studying changes in centromere clustering upon experimental perturbation, for example, after the elimination of the condensin protein NCAPH2. Both the sensitivity and the robustness of this metric stem mainly from the fact that it is calculated from density-normalized CDFs of all spot distances, irrespective of local features of their distribution. Although the dispersion index also uses the first- and second-order statistics of the same distribution, compressing the distribution into only two parameters reduces its effectiveness in separating various spot distributions compared with Ripley’s K function. One limitation of Ripley’s K function may be its inability to detect local clusters dispersed within the nucleus, because it is calculated using the overall distance distribution of spots.
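The idea of a density-normalized cumulative distribution of spot distances can be illustrated with a naive Ripley's K estimator. This is a sketch under stated assumptions, not the implementation used in this study: it works in 2D, applies no edge correction, and the function name `ripley_k` and its `area` argument (the area of the observation window, for example the segmented nucleus) are hypothetical.

```python
import numpy as np
from scipy.spatial.distance import pdist


def ripley_k(points, radii, area):
    """Naive Ripley's K estimate (no edge correction) for 2D points.

    K(r) = area / (n * (n - 1)) * number of ordered pairs (i, j), i != j,
    whose distance is <= r.
    """
    points = np.asarray(points, dtype=float)
    radii = np.asarray(radii, dtype=float)
    n = len(points)

    # pdist returns the n * (n - 1) / 2 unordered pairwise distances.
    d = pdist(points)

    # Count unordered pairs within each radius; ordered pairs are twice that.
    counts = (d[None, :] <= radii[:, None]).sum(axis=1)
    return area * 2.0 * counts / (n * (n - 1))


# Toy example: the four corners of a unit square.
corners = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
print(ripley_k(corners, radii=[0.5, 1.0, 2.0], area=1.0))
```

Under complete spatial randomness in 2D, the expected value is K(r) = πr², so estimates above that curve indicate clustering at scale r and estimates below it indicate dispersion. Because the estimate pools all pairwise distances, it does not indicate where in the nucleus a cluster sits, which is the limitation noted above.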

To better understand the nature of cellular centromere distribution, we used computational modeling to replicate the localization and clustering behavior of centromeres. Specifically, we applied Bayesian estimation methods to fit a radially shifted Gaussian distribution using three different approaches. The first approach directly fitted the cumulative distribution function (CDF) of pairwise distances between centromeres. The second approach used spot coordinates (approximately 46 spots per cell), while the third relied on pairwise distances (around 1035 values per cell). Although all these models are based on radial Gaussian distributions, the third approach, with its larger number of input values, requires a higher sample size to achieve robust and accurate parameter fitting. This highlights the trade-off between input complexity and the need for larger datasets when modeling spatial distributions.
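The difference in input size between the coordinate-based and distance-based fits follows from simple counting: 46 spots yield 46 × 45 / 2 = 1035 unordered pairwise distances. The sketch below illustrates this with spots drawn from one possible reading of a radially shifted Gaussian, in which the radius follows a normal distribution centered away from the origin and the angle is uniform. The function name, the parameter values (`mu=0.6`, `sigma=0.15`, in units of a normalized nuclear radius), and the folding of negative radii with `abs` are assumptions for illustration and are not taken from the fitted models.

```python
import numpy as np
from scipy.spatial.distance import pdist


def sample_radially_shifted_gaussian(n_spots, mu, sigma, rng):
    """Draw 2D spots with radius ~ N(mu, sigma) and a uniform angle.

    Hypothetical parameterization; negative radii are folded back with abs(),
    which is a simplification that only matters when mu is close to zero.
    """
    radius = np.abs(rng.normal(mu, sigma, size=n_spots))
    theta = rng.uniform(0.0, 2.0 * np.pi, size=n_spots)
    return np.column_stack((radius * np.cos(theta), radius * np.sin(theta)))


rng = np.random.default_rng(0)
spots = sample_radially_shifted_gaussian(46, mu=0.6, sigma=0.15, rng=rng)

# All unordered pairwise distances between the 46 simulated spots.
pairwise = pdist(spots)
print(spots.shape, pairwise.shape)  # (46, 2) (1035,)
```

Sampling the radius around a nonzero mean leaves the center sparsely populated, which produces a ring-shaped point cloud of the kind described as doughnut-like in this study.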

One of our findings, based on mapping the location of several thousand centromeres by imaging, is that centromeres are arranged in a doughnut-like pattern within the cell nucleus. This pattern was evident when mapping individual centromeres across cell populations and was reproduced in our modeling approaches. Despite the noted discrepancies in fitting radial distance distributions, which arise from differences in calculation methods and from spot localization variability, our models effectively captured the essential centromere-to-centromere distance patterns critical for biological interpretation. This finding aligns well with previous studies on the radial organization of nuclear structures, such as the observation of a lower density of chromatin and chromosomes in the nuclear interior [2,11]. These studies highlight the influence of biological constraints, such as gene density and chromatin architecture, as the key drivers of spatial nuclear organization.

The analytical tools we generated have practical applications. These analysis methods can be used to quantitatively describe changes in centromere distribution in response to a specific experimental perturbation, for example, loss of the condensin component NCAPH2 as shown here, or during physiological or pathological processes such as differentiation, development, and diseases such as cancer. We also anticipate that the methods are applicable to the analysis of tissue samples, provided nuclei can be segmented accurately. Perhaps more importantly, our analysis approaches will now enable large-scale functional genomic screens, such as CRISPR screens, which will allow the identification of entirely novel modulators of centromere localization in an unbiased fashion.

The combination of advanced imaging techniques with robust computational analysis, as demonstrated here, is not limited to centromeres, but can also be extended to study the spatial organization of other cellular structures. For instance, nuclear bodies such as nucleoli, Cajal bodies, and PML nuclear bodies exhibit distinct spatial patterns influenced by chromatin interactions and nuclear architecture [45,46,47]. Probabilistic and spatial modeling approaches, such as those described here, have already been applied to chromosomal territories and transcription factories, revealing the roles of gene density and transcriptional activity in nuclear organization [2,11,48]. Additionally, approaches using spatial modeling or clustering metrics can be tailored to study other biological structures, such as cytoplasmic structures, for example, stress granules [49], focal adhesions [50] or mitochondrial networks [51], to better understand their distribution and functional implications. By adopting and adapting these methods, the principles underlying spatial organization and its impact on cellular processes across a wide range of subcellular structures can be explored using quantitative measures.

Taken together, this study presents a framework for characterizing the cellular distribution of centromeres using a combination of spatial metrics and computational modeling. By systematically benchmarking metrics with synthetic spot generation models, we demonstrate the capability of combining data-driven analysis with computational tools to capture centromere clustering patterns with potential application to other cellular structures.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/cells14070491/s1. Figure S1: Heatmap displaying Cohen’s D values for clustering metrics across synthetic spatial distribution models. Figure S2: Spatial distribution of standardized centromere locations across eight wild-type cell lines. Figure S3: Spatial distribution of standardized spot locations generated using M1–M5 methods. Table S1: Sources of cell lines, culture conditions, media compositions and relevant references of respective culture protocols for the eight cell lines. Table S2: Statistical testing results for pairwise Mann-Whitney U tests, corrected using the Benjamini-Hochberg False Discovery Rate (BH-FDR) method, comparing clustering metrics calculated for CSR and all other spot generation methods. Table S3: Statistical testing results for all pairwise Mann-Whitney U tests, corrected using the Benjamini-Hochberg False Discovery Rate (BH-FDR) method, comparing clustering metrics across all spot generation methods. Table S4: The percent change calculated as average value for each metric for up to 30 spots removed from the initial 46 spots compared to the value with 46 spots.

Author Contributions

Conceptualization, A.K., G.P. and T.M.; methodology, A.K. and K.G.; software, A.K.; formal analysis, A.K.; writing—original draft preparation, A.K.; writing—review and editing, A.K., K.G., G.P. and T.M.; supervision, G.P. and T.M.; project administration, G.P. and T.M.; funding acquisition, G.P. and T.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Intramural Research Program of the NIH, NCI, Center for Cancer Research through grant 1-ZIA-BC010309-25 to T.M. and grant 1-ZIC-BC-011567 to HiTIF.

Informed Consent Statement

Not applicable.

Data Availability Statement

The HCT116-Cas9 cell imaging data presented in this study were previously deposited in the BioImage Archive [52] under accession number S-BIAD1043 and are available at https://www.ebi.ac.uk/biostudies/BioImages/studies/S-BIAD1043 (accessed on 15 March 2025). The data from the eight-cell-line experiment were deposited in the BioImage Archive [52] under accession number S-BIAD1602 and are available at https://www.ebi.ac.uk/biostudies/bioimages/studies/S-BIAD1602 (accessed on 15 March 2025).

Acknowledgments

We would like to thank Thomas Gonatopoulos-Pournatzis for sharing several of the cell lines used in this study. This work utilized the computational resources of the NIH HPC Biowulf cluster (https://hpc.nih.gov, accessed on 15 March 2025).

Conflicts of Interest

The authors declare no conflict of interest.

Code Availability Statement

All the code used for modeling and generating the results and plots in this manuscript can be found here: https://github.com/CBIIT/centromere_clustering_analysis (accessed on 15 March 2025).

Abbreviations

Spot generation model abbreviations.

Name	Abbreviation
Complete spatial randomness	CSR
Poisson disk sampling	PDS
Uniformly distributed centromeres with two adjacent spots	UTA
Uniformly distributed centromeres with three adjacent spots	UTHA
Single two-dimensional Gaussian distribution	S2DG
Two two-dimensional Gaussian distributions	T2DG
Three two-dimensional Gaussian distributions	TH2DG
Two-dimensional Gaussian distribution with a nuclear body	2DGNB
Two two-dimensional Gaussian distributions with two nuclear bodies	T2DGTNB
Real data	M0
Uniformly distributed spots	M1
Cell-shaped 2D Gaussian distribution	M2
Radial Gaussian distribution using Ripley’s K function	M3
Bayesian estimation of a radially shifted Gaussian distribution using spot coordinates	M4
Bayesian estimation of a radially shifted Gaussian distribution using spot distances	M5

References

1. Misteli, T. The Self-Organizing Genome: Principles of Genome Architecture and Function. Cell 2020, 183, 28–45.
2. Kreth, G.; Finsterle, J.; von Hase, J.; Cremer, M.; Cremer, C. Radial Arrangement of Chromosome Territories in Human Cell Nuclei: A Computer Model Approach Based on Gene Density Indicates a Probabilistic Global Positioning Code. Biophys. J. 2004, 86, 2803–2812.
3. Hu, Q.; Kwon, Y.-S.; Nunez, E.; Cardamone, M.D.; Hutt, K.R.; Ohgi, K.A.; Garcia-Bassets, I.; Rose, D.W.; Glass, C.K.; Rosenfeld, M.G.; et al. Enhancing Nuclear Receptor-Induced Transcription Requires Nuclear Motor and LSD1-Dependent Gene Networking in Interchromatin Granules. Proc. Natl. Acad. Sci. USA 2008, 105, 19199–19204.
4. Faro-Trindade, I.; Cook, P.R. Transcription Factories: Structures Conserved during Differentiation and Evolution. Biochem. Soc. Trans. 2006, 34, 1133–1137.
5. Osborne, C.S.; Chakalova, L.; Mitchell, J.A.; Horton, A.; Wood, A.L.; Bolland, D.J.; Corcoran, A.E.; Fraser, P. Myc Dynamically and Preferentially Relocates to a Transcription Factory Occupied by Igh. PLoS Biol. 2007, 5, e192.
6. McKinley, K.L.; Cheeseman, I.M. The Molecular Basis for Centromere Identity and Function. Nat. Rev. Mol. Cell Biol. 2016, 17, 16–29.
7. Fukagawa, T.; Earnshaw, W.C. The Centromere: Chromatin Foundation for the Kinetochore Machinery. Dev. Cell 2014, 30, 496–508.
8. Sullivan, L.L.; Sullivan, B.A. Genomic and Functional Variation of Human Centromeres. Exp. Cell Res. 2020, 389, 111896.
9. Sullivan, L.L.; Boivin, C.D.; Mravinac, B.; Song, I.Y.; Sullivan, B.A. Genomic Size of CENP-A Domain Is Proportional to Total Alpha Satellite Array Size at Human Centromeres and Expands in Cancer Cells. Chromosome Res. 2011, 19, 457–470.
10. Solovei, I.; Schermelleh, L.; Düring, K.; Engelhardt, A.; Stein, S.; Cremer, C.; Cremer, T. Differences in Centromere Positioning of Cycling and Postmitotic Human Cell Types. Chromosoma 2004, 112, 410–423.
11. Cremer, M.; von Hase, J.; Volm, T.; Brero, A.; Kreth, G.; Walter, J.; Fischer, C.; Solovei, I.; Cremer, C.; Cremer, T. Non-Random Radial Higher-Order Chromatin Arrangements in Nuclei of Diploid Human Cells. Chromosome Res. Int. J. Mol. Supramol. Evol. Asp. Chromosome Biol. 2001, 9, 541–567.
12. Alcobia, I.; Dilão, R.; Parreira, L. Spatial Associations of Centromeres in the Nuclei of Hematopoietic Cells: Evidence for Cell-Type-Specific Organizational Patterns. Blood 2000, 95, 1608–1615.
13. Bersaglieri, C.; Kresoja-Rakic, J.; Gupta, S.; Bär, D.; Kuzyakiv, R.; Panatta, M.; Santoro, R. Genome-Wide Maps of Nucleolus Interactions Reveal Distinct Layers of Repressive Chromatin Domains. Nat. Commun. 2022, 13, 1483.
14. Rodrigues, A.; MacQuarrie, K.L.; Freeman, E.; Lin, A.; Willis, A.B.; Xu, Z.; Alvarez, A.A.; Ma, Y.; White, B.E.P.; Foltz, D.R.; et al. Nucleoli and the Nucleoli–Centromere Association Are Dynamic during Normal Development and in Cancer. Mol. Biol. Cell 2023, 34, br5.
15. Kumar, P.; Gholamalamdari, O.; Zhang, Y.; Zhang, L.; Vertii, A.; van Schaik, T.; Peric-Hupkes, D.; Sasaki, T.; Gilbert, D.M.; van Steensel, B.; et al. Nucleolus and Centromere Tyramide Signal Amplification-Seq Reveals Variable Localization of Heterochromatin in Different Cell Types. Commun. Biol. 2024, 7, 1135.
16. Altemose, N.; Logsdon, G.A.; Bzikadze, A.V.; Sidhwani, P.; Langley, S.A.; Caldas, G.V.; Hoyt, S.J.; Uralsky, L.; Ryabov, F.D.; Shew, C.J.; et al. Complete Genomic and Epigenetic Maps of Human Centromeres. Science 2022, 376, eabl4178.
17. Bury, L.; Moodie, B.; Ly, J.; McKay, L.S.; Miga, K.H.; Cheeseman, I.M. Alpha-Satellite RNA Transcripts Are Repressed by Centromere–Nucleolus Associations. eLife 2020, 9, e59770.
18. Klaasen, S.J.; Truong, M.A.; van Jaarsveld, R.H.; Koprivec, I.; Štimac, V.; de Vries, S.G.; Risteski, P.; Kodba, S.; Vukušić, K.; de Luca, K.L.; et al. Nuclear Chromosome Locations Dictate Segregation Error Frequencies. Nature 2022, 607, 604–609.
19. Krämer, A.; Maier, B.; Bartek, J. Centrosome Clustering and Chromosomal (in)Stability: A Matter of Life and Death. Mol. Oncol. 2011, 5, 324–335.
20. Naughton, C.; Huidobro, C.; Catacchio, C.R.; Buckle, A.; Grimes, G.R.; Nozawa, R.-S.; Purgato, S.; Rocchi, M.; Gilbert, N. Human Centromere Repositioning Activates Transcription and Opens Chromatin Fibre Structure. Nat. Commun. 2022, 13, 5609.
21. Hoencamp, C.; Dudchenko, O.; Elbatsh, A.M.O.; Brahmachari, S.; Raaijmakers, J.A.; van Schaik, T.; Sedeño Cacciatore, Á.; Contessoto, V.G.; van Heesbeen, R.G.H.P.; van den Broek, B.; et al. 3D Genomics across the Tree of Life Reveals Condensin II as a Determinant of Architecture Type. Science 2021, 372, 984–989.
22. Stephens, A.D.; Quammen, C.W.; Chang, B.; Haase, J.; Taylor, R.M.; Bloom, K. The Spatial Segregation of Pericentric Cohesin and Condensin in the Mitotic Spindle. Mol. Biol. Cell 2013, 24, 3909–3919.
23. Pegoraro, G.; Misteli, T. High-Throughput Imaging for the Discovery of Cellular Mechanisms of Disease. Trends Genet. 2017, 33, 604–615.
24. Kiskowski, M.A.; Hancock, J.F.; Kenworthy, A.K. On the Use of Ripley’s K-Function and Its Derivatives to Analyze Domain Size. Biophys. J. 2009, 97, 1095–1103.
25. Keikhosravi, A.; Almansour, F.; Bohrer, C.H.; Fursova, N.A.; Guin, K.; Sood, V.; Misteli, T.; Larson, D.R.; Pegoraro, G. High-Throughput Image Processing Software for the Study of Nuclear Architecture and Gene Expression. Sci. Rep. 2024, 14, 18426.
26. Pachitariu, M.; Stringer, C. Cellpose 2.0: How to Train Your Own Model. Nat. Methods 2022, 19, 1634–1641.
27. Reynolds, D. Gaussian Mixture Models. In Encyclopedia of Biometrics; Li, S.Z., Jain, A., Eds.; Springer: Boston, MA, USA, 2009; pp. 659–663. ISBN 978-0-387-73003-5.
28. Wang, T. Poisson-Disk Sampling: Theory and Applications. In Encyclopedia of Computer Graphics and Games; Lee, N., Ed.; Springer International Publishing: Cham, Switzerland, 2024; pp. 1424–1431. ISBN 978-3-031-23161-2.
29. Ripley, B.D. The Second-Order Analysis of Stationary Point Processes. J. Appl. Probab. 1976, 13, 255–266.
30. Shizuka, D.; Farine, D.R. Measuring the Robustness of Network Community Structure Using Assortativity. Anim. Behav. 2016, 112, 237–246.
31. Brandes, U.; Delling, D.; Gaertler, M.; Gorke, R.; Hoefer, M.; Nikoloski, Z.; Wagner, D. On Modularity Clustering. IEEE Trans. Knowl. Data Eng. 2008, 20, 172–188.
32. Tiefelsdorf, M.; Boots, B. The Exact Distribution of Moran’s I. Environ. Plan. Econ. Space 1995, 27, 985–999.
33. Clark, P.J.; Evans, F.C. Distance to Nearest Neighbor as a Measure of Spatial Relationships in Populations. Ecology 1954, 35, 445–453.
34. Myers, J.H. Selecting a Measure of Dispersion. Environ. Entomol. 1978, 7, 619–621.
35. Mann, H.B.; Whitney, D.R. On a Test of Whether One of Two Random Variables Is Stochastically Larger than the Other. Ann. Math. Stat. 1947, 18, 50–60.
36. Benjamini, Y.; Hochberg, Y. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. J. R. Stat. Soc. Ser. B Methodol. 1995, 57, 289–300.
37. Kruskal, W.H.; Wallis, W.A. Use of Ranks in One-Criterion Variance Analysis. J. Am. Stat. Assoc. 1952, 47, 583–621.
38. Hori, T.; Amano, M.; Suzuki, A.; Backer, C.B.; Welburn, J.P.; Dong, Y.; McEwan, B.F.; Shang, W.-H.; Suzuki, E.; Okawa, E.; et al. CCAN Makes Multiple Contacts with Centromeric DNA to Provide Distinct Pathways to the Outer Kinetochore. Cell 2008, 135, 1039–1052.
39. Savulescu, A.F.; Brackin, R.; Bouilhol, E.; Dartigues, B.; Warrell, J.H.; Pimentel, M.R.; Beaume, N.; Fortunato, I.C.; Dallongeville, S.; Boulle, M.; et al. Interrogating RNA and Protein Spatial Subcellular Distribution in smFISH Data with DypFISH. Cell Rep. Methods 2021, 1, 100068.
40. Norton, H.K.; Emerson, D.J.; Huang, H.; Kim, J.; Titus, K.R.; Gu, S.; Bassett, D.S.; Phillips-Cremins, J.E. Detecting Hierarchical Genome Folding with Network Modularity. Nat. Methods 2018, 15, 119–122.
41. Rajput, A.; Dominguez San Martin, I.; Rose, R.; Beko, A.; LeVea, C.; Sharratt, E.; Mazurchuk, R.; Hoffman, R.M.; Brattain, M.G.; Wang, J. Characterization of HCT116 Human Colon Cancer Cells in an Orthotopic Model. J. Surg. Res. 2008, 147, 276–281.
42. Ono, T.; Sakamoto, C.; Nakao, M.; Saitoh, N.; Hirano, T. Condensin II Plays an Essential Role in Reversible Assembly of Mitotic Chromosomes in Situ. Mol. Biol. Cell 2017, 28, 2875–2886.
43. Wallace, H.A.; Rana, V.; Nguyen, H.Q.; Bosco, G. Condensin II Subunit NCAPH2 Associates with Shelterin Protein TRF1 and Is Required for Telomere Stability. J. Cell. Physiol. 2019, 234, 20755–20768.
44. Martin, C.-A.; Murray, J.E.; Carroll, P.; Leitch, A.; Mackenzie, K.J.; Halachev, M.; Fetit, A.E.; Keith, C.; Bicknell, L.S.; Fluteau, A.; et al. Mutations in Genes Encoding Condensin Complex Proteins Cause Microcephaly through Decatenation Failure at Mitosis. Genes Dev. 2016, 30, 2158–2172.
45. Cremer, T.; Cremer, M.; Hübner, B.; Strickfaden, H.; Smeets, D.; Popken, J.; Sterr, M.; Markaki, Y.; Rippe, K.; Cremer, C. The 4D Nucleome: Evidence for a Dynamic Nuclear Landscape Based on Co-Aligned Active and Inactive Nuclear Compartments. FEBS Lett. 2015, 589, 2931–2943.
46. Quinodoz, S.A.; Ollikainen, N.; Tabak, B.; Palla, A.; Schmidt, J.M.; Detmar, E.; Lai, M.M.; Shishkin, A.A.; Bhat, P.; Takei, Y.; et al. Higher-Order Inter-Chromosomal Hubs Shape 3D Genome Organization in the Nucleus. Cell 2018, 174, 744–757.e24.
47. Dundr, M.; Misteli, T. Biogenesis of Nuclear Bodies. Cold Spring Harb. Perspect. Biol. 2010, 2, a000711.
48. Fraser, P.; Bickmore, W. Nuclear Organization of the Genome and the Potential for Gene Regulation. Nature 2007, 447, 413–417.
49. Wheeler, J.R.; Matheny, T.; Jain, S.; Abrisch, R.; Parker, R. Distinct Stages in Stress Granule Assembly and Disassembly. eLife 2016, 5, e18413.
50. Kanchanawong, P.; Shtengel, G.; Pasapera, A.M.; Ramko, E.B.; Davidson, M.W.; Hess, H.F.; Waterman, C.M. Nanoscale Architecture of Integrin-Based Cell Adhesions. Nature 2010, 468, 580–584.
51. Picard, M.; Shirihai, O.S.; Gentil, B.J.; Burelle, Y. Mitochondrial Morphology Transitions and Functions: Implications for Retrograde Signaling? Am. J. Physiol. Regul. Integr. Comp. Physiol. 2013, 304, R393–R406.
52. Hartley, M.; Kleywegt, G.J.; Patwardhan, A.; Sarkans, U.; Swedlow, J.R.; Brazma, A. The BioImage Archive—Building a Home for Life-Sciences Microscopy Data. J. Mol. Biol. 2022, 434, 167505.

Figure 1. Simulated spatial distribution patterns of centromeres modeled to reflect various clustering and dispersion behaviors observed in HCT116-Cas9 colorectal cancer cells. For each row, the two leftmost images display the fluorescence image of centromere protein CENP-C in a HCT116 cell (green), followed by extracted spot locations (yellow circles) outlined within the boundary to indicate the nucleus outline (red). Each row represents a distinct spatial distribution model: (A) complete spatial randomness (CSR), where centromeres are uniformly and randomly distributed; (B) Poisson disk sampling (PDS), enforcing a minimum distance between spots to prevent overlap; (C,D) uniformly distributed centromeres with two (UTA) or three (UTHA) adjacent spots mimicking small clusters; (E–G) single (S2DG), two (T2DG), or three (TH2DG) two-dimensional Gaussian distributions, representing varying levels of clustering within larger nuclear regions; (H) two-dimensional Gaussian distribution with a nuclear body (2DGNB), excluding the nuclear body area; and (I) two two-dimensional Gaussian distributions with two nuclear bodies (T2DGTNB). For simulated distributions, a gradient of Gaussian variance increases from left to right as indicated.

Figure 2. Benchmarking clustering metrics for detecting and characterizing synthetic centromere clustering patterns. (A) The sensitivity of the six clustering metrics was tested across synthetic spot distributions. Each boxplot displays the variability of metric values for the nine spatial distribution models, with statistical testing confirming significant differences between CSR and other distributions for most metrics. The box represents the interquartile range (IQR), showing the middle 50% of the data, with the line inside the box indicating the median value. The whiskers extend from the box to the smallest and largest values within 1.5 times the IQR from the lower and upper quartiles. (B) The robustness of clustering metrics to changes in the spot number was evaluated by progressively removing a random number of spots from synthetic nuclei.

Figure 3. Analysis of centromere clustering in the control and NCAPH2-depleted cells. (A) Representative immunofluorescence images showing CENP-C spot detection (green) in HCT116-Cas9 cells transfected with siScramble (control) or siNCAPH2. The nuclear outline is indicated in red. Scale bar: 5 μm. (B) Comparison of different clustering metrics to quantify differences between the control and NCAPH2-depleted cells. The box represents the interquartile range (IQR), showing the middle 50% of the data, with the line inside the box indicating the median value. The whiskers extend from the box to the smallest and largest values within 1.5 times the IQR from the lower and upper quartiles. (C) Cohen’s D analysis showing the relative sensitivity of various clustering metrics for separating siScramble (control) and siNCAPH2 in HCT116 cells, with Ripley’s K function demonstrating the highest absolute value, followed by spot density, spot number, modularity, and assortativity.

Figure 4. Spatial distribution of standardized centromere locations in the control and NCAPH2-depleted cells. (A) 2D histogram showing the distribution of CENP-C spots relative to the nuclear center (0,0), revealing a doughnut-shaped pattern in both conditions. (B) Line plot analysis at X = 0 and Y = 0 demonstrating lower spot density at the nuclear center and higher density between the center and the nuclear edge. (C,D) Gaussian mixture modeling of spot distribution along the X = 0 and Y = 0 axes, indicating a radially symmetric centromere organization with potential alignment artifacts affecting the observed central low-density region.

Figure 5. Comparative analysis of centromere spatial distribution models. (A) Pairwise and radial distance distributions comparing the experimental data (M0) with five different synthetic models (M1–M5) in HCT116-Cas9 cells. (B) Quantitative assessment of model performance using the Wasserstein distance, the normalized MSE, and the KS statistic, showing superior performance of M4. (C) Heatmap analysis of clustering metrics across different models, demonstrating that Bayesian-based models (M4 and M5) achieve the lowest normalized MSE values for most metrics. The ** indicates the minimum value for each row.

Figure 6. Model performance analysis in NCAPH2-depleted cells. (A) Pairwise and radial distance distribution comparisons between experimental data and synthetic models (M1–M5) in the siNCAPH2-treated cells. (B) Distribution similarity metrics showing that the M4 model maintains the best performance but with reduced accuracy compared with the control cells. (C) Clustering metric analysis demonstrating that the Bayesian models (M4, M5) achieve the lowest similarity values for most metrics, with slight variations in the MNND performance. The ** specifies the minimum value for each row.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
