1. Introduction
In unsupervised learning, clustering algorithms aim to identify hidden structure in data by grouping samples with similar characteristics without relying on labeled information [1]. Among the most influential techniques, the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm stands out for its ability to detect dense regions of arbitrary shape while separating sparse or isolated observations as noise. In this sense, DBSCAN overcomes important limitations of partition-based methods such as K-means, which assume approximately spherical clusters and are sensitive to outliers [2], as well as some hierarchical approaches whose behavior is strongly determined by the geometry induced by the data representation [3].
Despite these advantages, DBSCAN presents a persistent practical difficulty: its sensitivity to the hyperparameters ε (neighborhood radius) and minPts (minimum number of points required to define a dense neighborhood). While minPts is often selected using relatively stable practical rules, ε plays a much more delicate role because it determines when dense components emerge, merge, or disappear. As a result, small perturbations in ε may induce substantial changes in neighborhood connectivity and therefore in the final clustering structure [4,5]. This has motivated numerous attempts to automate ε selection, ranging from heuristic use of the k-distance plot [6] to hierarchical or adaptive strategies [7]. However, most of these approaches remain predominantly empirical and do not formally explain why a local modification in ε can trigger a global topological change in the density structure [6,8].
In this context, density-based extensions such as HDBSCAN and OPTICS partially address sensitivity to the ε parameter by using hierarchical or ordering-based density schemes. These allow capturing distributions with variable density without requiring a fixed global threshold. However, the added complexity makes it difficult, in certain scenarios, to understand explicitly the role of ε in the final cluster formation and structure. Ultimately, the challenge of estimating ε within the classical DBSCAN framework remains significant from both practical and analytical viewpoints [9,10].
Traditional heuristics such as Kneedle often rely heavily on local curvature and geometric inflection criteria [11,12]. In practice, this may lead to what we refer to as geometric myopia: the estimator becomes highly sensitive to noise, weak transitions, and locally ambiguous changes in the ordered signal. This issue becomes more severe in higher-dimensional settings, where distance measures progressively lose discriminative power due to the concentration-of-measure phenomenon [13,14]. Under these conditions, spurious structural boundaries and oversegmentation may arise, making purely geometric rules less reliable for identifying an informative threshold.
These limitations motivate the search for a more structured interpretation of ε selection. In parallel, recent work has explored ways of bringing ideas from causal discovery into unsupervised learning. Historically, causal discovery methods such as the PC algorithm [15] and structural approaches based on interventions and conditional responses [16,17] were developed to identify directed dependencies among observed variables. More recently, some of these ideas have been adapted to unsupervised settings in order to obtain more structured interpretations of variability and organization in complex data environments [18]. Rather than treating threshold selection as a purely geometric problem, this perspective opens the possibility of analyzing it as an organized estimation mechanism with explicitly defined internal dependencies.
In this work, the problem of selecting ε in DBSCAN is articulated through three distinct but related levels. First, at the algorithmic level, ε controls neighborhood formation and, therefore, affects connectivity and the global clustering structure. Second, at the modeling level, this structural sensitivity is represented through a surrogate dynamical estimator defined over the ordered k-distance signal. Third, at the causal level, the resulting estimator is interpreted through interventions on its internal threshold-selection mechanism. These three levels are related, but they are not equivalent and should not be interpreted as automatic consequences of one another.
Under this view, small perturbations in ε induce observable changes in neighborhood connectivity and cluster configuration at the algorithmic level. These changes motivate the introduction of a surrogate dynamical layer, but do not by themselves imply a causal interpretation. To formalize the modeling level of the proposal, a mass–spring–damper-inspired formulation is introduced over the ordered k-distance curve, in which derivatives and state transitions characterize structural changes in density. The resulting estimator is then analyzed through interventions on its internal modules, so that the causal reading applies to the threshold-selection mechanism rather than to the physical data-generating process itself.
Thus, the present work goes beyond a traditional geometric heuristic by articulating an algorithmic description of how ε modifies connectivity, a surrogate dynamical model for the ordered k-distance signal, and an intervention-based interpretation of the resulting estimator. The aim is to improve the precision of ε selection while distinguishing clearly between the algorithmic, modeling, and causal roles of the proposal.
The main contributions of this work can be summarized as follows:
A three-level formulation of the ε-selection problem in DBSCAN, distinguishing explicitly between the algorithmic role of neighborhood connectivity, the surrogate dynamical modeling of the ordered k-distance signal, and the intervention-based interpretation of the resulting estimator.
A mass–spring–damper-inspired dynamical estimator defined over the ordered k-distance curve, designed to identify structurally relevant transition regions for ε selection beyond purely geometric elbow heuristics.
A Structural Causal Model (SCM)-based interpretation of the internal estimation mechanism, in which the do-operator and the average causal effect (ACE) are applied at the level of the threshold-selection process rather than at the level of the physical data-generating process.
An empirical evaluation on synthetic and real-world datasets, including clustering-quality comparisons with the Kneedle baseline, an ablation analysis of the dynamic layer, and a practical runtime characterization of the proposed pipeline.
The proposal is evaluated in three scenarios: two synthetic datasets with between 1300 and 5000 samples, and one real-world multivariate dataset with 581,013 samples. The synthetic datasets were designed to represent varying-density distributions and severe topological overlaps, where traditional hyperparameter selection often fails to isolate meaningful structures. The real-world dataset evaluates the method in a higher-dimensional multivariate setting. For each scenario, the proposed estimator is compared against the Kneedle heuristic using clustering-quality indices including Davies–Bouldin, Silhouette, and Calinski–Harabasz.
Hierarchical approaches such as HDBSCAN mitigate the varying-density problem by avoiding a single global threshold [19]. However, the objective of this work is not to replace the density-thresholding paradigm but to provide an interpretable mechanism for estimating ε within standard DBSCAN. For this reason, the most appropriate baseline is not HDBSCAN, but the geometric heuristic commonly used for the same estimation task, namely Kneedle.
The remainder of the article is organized as follows.
Section 2 presents the theoretical foundations of the proposal.
Section 3 describes the methodology.
Section 4 reports the experimental results and discussion. Finally,
Section 5 presents the conclusions and future work.
2. Theoretical Foundations
2.1. DBSCAN and the Role of the ε Parameter
The DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm separates low-density regions from high-density regions in order to identify clusters and noise [20]. This process is controlled by two parameters: ε, the neighborhood radius that defines the local adjacency structure, and minPts, which specifies the minimum neighborhood cardinality required for a point to be considered part of a dense core. While minPts is often chosen using simple practical rules, ε plays a more delicate role because it determines when dense components appear, merge, or vanish.
At the algorithmic level, ε controls neighborhood formation and therefore directly affects local connectivity in DBSCAN. Since cluster expansion is based on density reachability, even small perturbations in ε may alter adjacency relations between points and induce structural changes in the final partition. In this sense, ε acts first as the control variable of the clustering mechanism itself, before any modeling or intervention-based interpretation is introduced.
This algorithmic sensitivity motivates the modeling step adopted in the present work. Rather than treating the ordered k-distance curve solely as a geometric heuristic, the proposal approximates it with a surrogate dynamical layer to identify structurally relevant transitions. At this stage, however, the role of the dynamic formulation is representational: it provides a structured approximation of the ordered signal and should not yet be interpreted as a causal claim about the data’s physical origin.
Although DBSCAN presents inherent limitations due to its reliance on a global density threshold, the proposed dynamical framework does not eliminate these constraints. Instead, it provides a structured way to analyze and mitigate their effects through a transition-sensitive interpretation of ε.
For this dynamic formulation to be informative, the data must exhibit specific topological properties. The framework does not require symmetry or adherence to a strict parametric distribution, but it does require the following conditions:
The transitions between dense cluster cores and sparse ambient noise cannot be entirely discrete. The density must decay at a measurable rate so that the k-distance graph remains smooth and continuously differentiable.
Data points must form continuous local manifolds. In scenarios with purely uniform, completely random noise, the dynamic layer lacks a structural anchor, preventing the system from identifying a meaningful equilibrium.
Although the framework is designed to handle varying densities, the dataset must contain discernible density drops where cohesive structures separate from the background noise. These transitions provide the structural contrast required for the dynamical analogy to be informative.
Formally, the model aims to locate the critical transition region in the k-distance curve, that is, the point at which a small local variation in ε is associated with a global structural change in density connectivity. This transition provides the basis for selecting an ε value that is structurally informative for clustering.
2.2. K-Distance Curve Extraction
The K-distance curve characterizes the local density distribution of the data points by quantifying the distance from each data point to its k-th nearest neighbor [8]. In this context, the K-distance plot is obtained by calculating the k nearest neighbors of each point using Euclidean distance and the Nearest Neighbors algorithm.
First, Euclidean distance is defined as a metric that estimates the distance between two points within a system in R^n [21], represented in Equation (1):

d(p, q) = sqrt( Σ_{i=1}^{n} (p_i − q_i)² )    (1)
Secondly, the Nearest Neighbors algorithm can be defined as follows: given a finite dataset S in a Euclidean space R^n and a query q, Nearest Neighbors obtains the k nearest neighbors R ⊆ S of q by evaluating the distance d(q, s) for every s ∈ S and retaining the k points with the smallest distances [22]. The set R is defined by Equation (2):

R = argmin_{R ⊆ S, |R| = k} Σ_{s ∈ R} d(q, s)    (2)
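As an illustration of this extraction step, the ordered k-distance signal can be computed with scikit-learn's NearestNeighbors. This is a minimal sketch: the helper name `k_distance_curve` and the choice k = 4 are illustrative, not the paper's exact implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def k_distance_curve(X, k=4):
    """Return the sorted distances from each point to its k-th nearest neighbor."""
    # k + 1 neighbors because each point is its own nearest neighbor at distance 0.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    distances, _ = nn.kneighbors(X)
    # Column k holds the distance to the k-th true neighbor; sort ascending.
    return np.sort(distances[:, k])

# Two well-separated blobs: the curve stays flat over the dense cores and
# rises where sparse boundary points begin.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(5, 0.1, (50, 2))])
curve = k_distance_curve(X, k=4)
```

Plotting `curve` against the point index yields the k-distance graph discussed above.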
Analogously, in clustering, neighborhoods define which points causally influence the local density and the subsequent formation of clusters.
2.3. Curve Smoothing
One of the most important prerequisites for working with signals is a clean input signal. When working with a k-distance curve, the signal is implicitly noisy, which makes numerical analysis difficult. In this study, a Savitzky–Golay filter was employed. This digital smoothing filter is based on least-squares polynomial approximation and is widely used to reduce noise in data while effectively preserving the signal's essential characteristics, such as peak heights and widths [23]; its mathematical expression is represented by Equation (3):

y_i^(m) = (1/Δt^m) Σ_{j=−w}^{w} c_j^(m) y_{i+j}    (3)

where y_i^(m) is the m-th smoothed derivative at point i, y_{i+j} are the signal values, c_j^(m) are the Savitzky–Golay filter coefficients obtained by fitting a polynomial of degree p over a window of 2w + 1 points, Δt is the time step of the samples, and m is the order of the derivative to be calculated. To ensure reproducibility and algorithmic adaptability across varying dataset sizes, the filters were systematically constrained rather than manually fine-tuned. First, a cubic polynomial was selected because it is the lowest-order polynomial capable of modeling inflection points; this allows the algorithm to accurately capture structural transitions in the density space without introducing Runge's phenomenon (overfitting oscillations). Second, instead of a static value, the window size is defined dynamically as a proportion of the total dataset size N, ensuring that the resulting integer is odd and that the baseline is at least 5 points. This adaptive, scale-invariant windowing guarantees a receptive field broad enough to capture macro-structural density drops while mitigating micro-noise.
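The adaptive windowing described above can be sketched with scipy.signal.savgol_filter. The fraction `frac=0.05` and the helper name `adaptive_smooth` are illustrative assumptions, not the paper's exact constants.

```python
import numpy as np
from scipy.signal import savgol_filter

def adaptive_smooth(signal, frac=0.05, polyorder=3):
    """Savitzky-Golay smoothing with a scale-adaptive window: a fraction of N,
    forced odd and at least 5 points (always larger than the cubic order)."""
    n = len(signal)
    window = max(5, int(frac * n))
    if window % 2 == 0:
        window += 1  # savgol_filter requires an odd window length
    return savgol_filter(signal, window_length=window, polyorder=polyorder)

# A noisy monotone curve as a stand-in for an ordered k-distance signal.
rng = np.random.default_rng(1)
noisy = np.linspace(0.0, 1.0, 200) ** 2 + rng.normal(0.0, 0.01, 200)
smooth = adaptive_smooth(noisy)
```

The cubic fit preserves inflection points while the odd, N-proportional window keeps the behavior consistent across dataset sizes.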
2.4. Mass–Spring–Damper System
A mass–spring–damper system is a dynamic system consisting of a mass m attached to a spring with a stiffness constant k and a damper with coefficient b, subjected to a force F [24]. First, the stiffness constant k is a magnitude that quantifies how rigid a spring is; it relates the force applied to the spring with the displacement produced by it, following Hooke's Law [25]. This law is detailed in Equation (4):

F = −k x    (4)

where F is the force exerted, k is the stiffness constant whose unit is N/m, and x is the displacement generated by the spring. Secondly, the damping coefficient b quantifies a system's ability to dissipate kinetic energy through frictional forces [26]; its value depends on the nature of the restrictive medium.
Figure 1 depicts a classic model of the mass–spring–damper system; in this representation, the interaction between the mass (m), the spring stiffness constant (k), the damping coefficient (b), and the applied force (F) is illustrated.
Mathematically, the dynamics of the system are described by the second-order differential equation represented in Equation (5):

m ẍ(t) + b ẋ(t) + k x(t) = F(t)    (5)
At the modeling level, ε is introduced as the control input of the surrogate dynamic layer. Its causal interpretation is addressed later at the level of the internal estimation mechanism. That is, a change in ε leads to a perceptible alteration in the system's acceleration and, consequently, in the cluster configuration. To solve the differential equation, the fourth-order Runge–Kutta method was used. It is a numerical method for solving differential equations that approximates the solution values y_i at a finite set of points t_i, with t_{i+1} = t_i + h [28]. The discrete numerical integration process is expressed in Equation (6):

k₁ = f(t_i, y_i),  k₂ = f(t_i + h/2, y_i + (h/2)k₁),  k₃ = f(t_i + h/2, y_i + (h/2)k₂),  k₄ = f(t_i + h, y_i + h k₃),
y_{i+1} = y_i + (h/6)(k₁ + 2k₂ + 2k₃ + k₄)    (6)
This numerical integration formalizes the mass–spring–damper-inspired layer as a discrete state-space model for the ordered signal. At this point, the formulation is modeling-oriented: it defines how the surrogate dynamic state evolves before any intervention-based interpretation is introduced. Consequently, manipulating the density threshold produces an observable and mathematically traceable effect within the surrogate dynamic layer, making the internal decision mechanism explicit. It is important to mention that the system parameters (m, b, k) are not treated as new manual hyperparameters; instead, they are autonomously estimated from the ordered k-distance representation through a fitting procedure based on system identification, ensuring that the model remains objective and fully independent of user intervention.
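The RK4 integration of the second-order equation can be sketched compactly by rewriting it as a first-order system in (x, v). The parameter values below are illustrative, and a constant driving force is assumed for simplicity.

```python
import numpy as np

def msd_rk4(m, b, k, F, x0=0.0, v0=0.0, dt=0.01, steps=1000):
    """Integrate m*x'' + b*x' + k*x = F with classical fourth-order Runge-Kutta."""
    def deriv(state):
        x, v = state
        # State derivatives: x' = v, v' = (F - b*v - k*x) / m.
        return np.array([v, (F - b * v - k * x) / m])
    state = np.array([x0, v0])
    traj = [state.copy()]
    for _ in range(steps):
        k1 = deriv(state)
        k2 = deriv(state + 0.5 * dt * k1)
        k3 = deriv(state + 0.5 * dt * k2)
        k4 = deriv(state + dt * k3)
        state = state + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
        traj.append(state.copy())
    return np.array(traj)

# An underdamped system driven by a constant force settles near x = F/k = 0.5.
traj = msd_rk4(m=1.0, b=2.0, k=4.0, F=2.0, dt=0.01, steps=2000)
```

The final state approaching the static equilibrium F/k is a quick sanity check that the integrator is implemented correctly.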
2.5. System Parameter Identification
The proposed algorithm is framed within the context of system identification, a field dedicated to constructing mathematical models of dynamic systems from experimental input–output data and stochastic disturbances [29]. In this project, the identification process follows a series of structured steps to estimate the dynamic-layer parameters that constitute the system, which arise directly from the data.
Because DBSCAN operates on a finite collection of elements, the continuous theoretical model must be bridged with the dataset's discrete nature. The k-distance curve is treated as a discrete ordered signal x[i], where the independent variable is an index over sorted points rather than physical time. From this ordered signal, we compute first- and second-order numerical derivatives (though not limited to these), which are used as descriptors of local structural variation along the sorted k-distance curve.
In terms of implementation, the continuous derivatives—velocity ẋ(t) and acceleration ẍ(t)—are numerically approximated using central finite differences over the smoothed k-distance array. The discrete velocity v[i] and acceleration a[i] at index i are computed following Equation (7):

v[i] = (x[i+1] − x[i−1]) / 2,    a[i] = x[i+1] − 2x[i] + x[i−1]    (7)
These discrete arrays represent the observed dynamics of the data density, formally denoted as x[i], v[i], and a[i] moving forward. On the other hand, the objective of system identification is to find the optimal values of m (mass), b (damping), and k (stiffness) described in Equation (5). To achieve this, a least-squares optimization is performed; by rearranging Equation (5), the acceleration (the term to be predicted) can be isolated (see Equation (8)):

â[i] = (F[i] − b v[i] − k x[i]) / m    (8)
The objective function to be minimized is the mean squared error between the observed acceleration a[i] and the predicted acceleration â[i]. The process is summarized in Equation (9):

J(m, b, k) = (1/N) Σ_{i=1}^{N} ( a[i] − â[i] )²    (9)
For the force F, we assume a simple Newtonian force that acts as an initial excitation to set the system in motion.
Finally, a parameter estimation is performed to find the parameter vector θ = (m, b, k). This vector is detailed in Equation (10):

θ* = argmin_θ J(θ)    (10)
More specifically, the cost function J(θ) is understood as a function J: R³ → R that measures how costly a set of parameters is. To perform this type of optimization, a quasi-Newton method called Broyden–Fletcher–Goldfarb–Shanno (BFGS) was used. This method iteratively updates an approximation H_k to the inverse Hessian matrix. This update follows Equation (11):

H_{k+1} = (I − ρ_k s_k y_kᵀ) H_k (I − ρ_k y_k s_kᵀ) + ρ_k s_k s_kᵀ,  with ρ_k = 1/(y_kᵀ s_k)    (11)

where s_k is θ_{k+1} − θ_k, and y_k is ∇J(θ_{k+1}) − ∇J(θ_k). On the other hand, the update step is described in Equation (12):

θ_{k+1} = θ_k − α_k H_k ∇J(θ_k)    (12)
In this way, we understand that the parameters m, b, and k are estimated, and that the model itself selects them based on the k-distance curve; this assertion is corroborated in Section 4.1.
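The identification loop above can be sketched as follows, assuming scipy.optimize.minimize with method='BFGS', unit index spacing for the finite differences, and a constant force; the helper name `fit_msd` and the synthetic test signal are illustrative, not the paper's code.

```python
import numpy as np
from scipy.optimize import minimize

def fit_msd(x, F=1.0):
    """Estimate theta = (m, b, k) by minimizing the MSE between the observed
    acceleration and the isolated-acceleration prediction (Eqs. (8)-(9))."""
    v = np.gradient(x)   # discrete velocity (central differences)
    a = np.gradient(v)   # discrete acceleration
    def cost(theta):
        m, b, k = theta
        if abs(m) < 1e-8:
            return 1e12  # guard the division by m during the line search
        a_pred = (F - b * v - k * x) / m
        return float(np.mean((a - a_pred) ** 2))
    res = minimize(cost, x0=np.array([1.0, 0.1, 0.1]), method="BFGS")
    return res.x

# A smooth sigmoid-like stand-in for a smoothed, ordered k-distance curve.
t = np.linspace(0.0, 6.0, 300)
signal = 0.05 + 0.02 * t + 0.5 / (1.0 + np.exp(-3.0 * (t - 4.0)))
m_hat, b_hat, k_hat = fit_msd(signal)
```

BFGS builds its own inverse-Hessian approximation, so no analytic gradient of J is needed here.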
2.6. Elbow Point Estimation
The elbow point of a function marks an abrupt transition on the curve that signals the end of one state and the beginning of another [6]. In DBSCAN, the elbow point represents the change from a dense region to a sparse region (noise); the value of this point on the y-axis represents the value of the neighborhood radius ε.
In the proposed system, assuming a particle moves along the k-distance curve under the action of stiffness and damping forces, the calculation of the particle's jerk at each iteration is of vital importance; its formula is represented in Equation (13):

j(t) = da(t)/dt = d³x(t)/dt³    (13)
The location where j(t) reaches its local maximum indicates a sharp change in the system's acceleration, which in turn marks the transition between dense and sparse areas. The ε value related to this point is therefore considered optimal.
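The jerk-based rule can be sketched directly on an ordered k-distance array by differentiating three times; this simplified version corresponds to direct numerical jerk estimation rather than the full fitted dynamic layer, and the helper name `epsilon_from_jerk` is illustrative.

```python
import numpy as np

def epsilon_from_jerk(k_dist_sorted):
    """Pick eps at the index where the third numerical derivative (jerk) of the
    ordered k-distance curve has the largest magnitude, i.e. the sharpest
    change in acceleration."""
    velocity = np.gradient(k_dist_sorted)
    acceleration = np.gradient(velocity)
    jerk = np.gradient(acceleration)
    idx = int(np.argmax(np.abs(jerk)))
    return k_dist_sorted[idx], idx

# A curve that is flat over the dense core and then turns sharply upward:
# the jerk peaks at the junction between the two regimes.
curve = np.concatenate([np.full(80, 0.1), 0.1 + 0.05 * np.arange(20) ** 2])
eps, idx = epsilon_from_jerk(curve)
```

The returned index sits at the flat-to-steep junction, and the corresponding y-value is taken as the ε candidate.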
2.7. Structural Causal Model
To avoid conflating distinct levels of interpretation, the proposed framework should be read in three layers. First, at the algorithmic level, ε controls neighborhood formation and, in turn, affects the connectivity and global clustering structure of DBSCAN. Second, at the modeling level, the ordered k-distance signal is approximated using surrogate mass–spring–damper-inspired dynamics to identify structurally relevant transitions. Third, at the causal level, the resulting estimator is interpreted through interventions on its internal modules, so that the do-operator and ACE are understood as acting on the threshold-selection mechanism rather than on the physical data-generating process. These three levels are related, but they are not equivalent and should not be interpreted as automatic consequences of one another.
At the causal level, the proposed estimator is interpreted through interventions on its internal threshold-selection mechanism. In this context, the do-operator does not denote an intervention on the physical data-generating process, but rather on the computational mechanism used to estimate ε. Under this reading, the role of the SCM is to organize the estimator's internal dependencies and to provide an intervention-based interpretation of how changes to the threshold-selection mechanism affect clustering quality.
This section presents the SCM used to formalize the parameter estimation process, integrating directions of dependency, functional mechanisms, and controlled interventions in the sense of Pearl's causal theory [16]. Unlike standard causal discovery, which seeks physical relationships among observed variables, the present approach models the algorithmic pipeline itself as a causal system. Under this framework, the computational steps are treated as modular mechanisms that propagate information toward the final estimate of ε.
To compare the proposed method against standard heuristics, we introduce the do-operator applied to mechanism selection. In this setting, M denotes the internal mechanism used to estimate ε, and the intervention is written do(M = m).
Conceptually, this operation corresponds to an algorithmic intervention in which the threshold-selection mechanism is replaced while the rest of the pipeline is kept fixed. This allows us to evaluate the effect of the estimation mechanism on the final clustering quality Q.
The total process unfolds through a set of internal variables organized hierarchically:
U: exogenous variables associated with intrinsic variability and measurement noise.
D: k-NN distances and local density features after standardization.
S: smoothed signal obtained through the Savitzky–Golay filter.
T: dynamic trajectory generated from the equation of motion and integrated using the fourth-order Runge–Kutta method.
J: jerk, obtained as the third numerical derivative of the trajectory.
ε: final value of the parameter used in DBSCAN.
These variables define a unidirectional dependency structure that can be represented by the DAG shown in Figure 2. The dataset affects the smoothed signal, the smoothed signal induces the dynamic trajectory, the trajectory determines the internal transition descriptor, and this descriptor is used to produce the final estimate of ε.
The causal relationships specified in the DAG are formalized through the structural equations in Figure 2, in which each node is generated by a deterministic mechanism of its causal parents together with an exogenous term, so that interventions on internal nodes propagate through the surrogate dynamic layer. The SCM formulation assumes that each node depends uniquely on its causal parents, preserving the factorization imposed by the DAG.
After defining the structural equations, controlled interventions can be expressed using Pearl's notation, in which the structural equation for a node is replaced externally to study how the change propagates to the output variable. In this proposal, the parameter selection is expressed as a function of the jerk descriptor, so that interventions on internal nodes generate observable changes in the final estimate. Formally, these interventions take the form do(S = s), do(θ = θ′), or do(T = t), leading to interventional distributions such as P(ε | do(S = s)).
Finally, the model's internal coherence is examined through controlled interventions that probe the consistency of the dependency sequence. The manipulations considered are: (i) adjusting the smoothing window through do(S = s); (ii) modifying the dynamic parameters through do(θ = θ′); and (iii) imposing artificial trajectories through do(T = t). In each case, the following dependency chain is expected:

D → S → T → J → ε
This behavior is consistent with the interpretation that the parameter estimate is produced by a modular intervention-sensitive mechanism within the estimator, rather than by isolated local correlations alone.
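As a toy illustration of the mechanism intervention do(M = m), the sketch below holds a fixed pipeline and swaps only the ε-selection rule, then scores the resulting partition. Both rules (a quantile rule and a max-curvature rule) are hypothetical stand-ins for illustration, not the paper's dynamic estimator or the Kneedle baseline.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.neighbors import NearestNeighbors

def run_pipeline(X, mechanism, k=4):
    """Keep the pipeline fixed and intervene only on the eps-selection
    mechanism M, mimicking do(M = m)."""
    dists, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    d = np.sort(dists[:, k])                 # ordered k-distance signal
    eps = mechanism(d)                       # the intervened mechanism
    labels = DBSCAN(eps=eps, min_samples=k).fit_predict(X)
    mask = labels != -1                      # drop noise before scoring
    if mask.sum() < 2 or len(set(labels[mask])) < 2:
        return labels, float("nan")          # quality Q undefined
    return labels, silhouette_score(X[mask], labels[mask])

# Two illustrative mechanisms (hypothetical stand-ins):
quantile_rule = lambda d: float(np.quantile(d, 0.90))
curvature_rule = lambda d: float(d[int(np.argmax(np.abs(np.gradient(np.gradient(d)))))])

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)
labels_q, q_q = run_pipeline(X, quantile_rule)
labels_c, q_c = run_pipeline(X, curvature_rule)
```

Comparing the two quality scores is exactly the kind of mechanism-level contrast the do-operator formalizes here: everything upstream and downstream of M is held fixed.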
2.8. Evaluation Metrics
To provide an objective description of the quality of the generated clusters, three metric categories were employed. In the first place, it is evaluated whether the data tend to cluster. For this purpose, the Hopkins statistic is used as a quantitative tool to determine whether a dataset exhibits a naturally clusterable structure [30]. In this context, the hypothesis shown in Equation (14) is applied, contrasting the null hypothesis of a uniform spatial distribution against the alternative of a clustered structure. The value of the Hopkins statistic is obtained from Equation (15):

H = Σ_{i=1}^{n} w_i / ( Σ_{i=1}^{n} u_i + Σ_{i=1}^{n} w_i )    (15)

where w_i are the nearest-neighbor distances of the n sampled real points and u_i are the nearest-neighbor distances of the n random points.
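A minimal sketch of this computation, assuming the inverse convention described in this work (values near 0 indicate clusterability); the helper name `hopkins_inverse` and its sampling choices are illustrative.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def hopkins_inverse(X, n=50, seed=0):
    """Hopkins statistic in the inverse convention: values close to 0
    indicate a naturally clusterable structure."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=2).fit(X)
    # w_i: distance from each sampled real point to its nearest *other* point.
    idx = rng.choice(len(X), size=n, replace=False)
    w = nn.kneighbors(X[idx])[0][:, 1]
    # u_i: distance from uniform random probes (in the bounding box) to the data.
    probes = rng.uniform(X.min(axis=0), X.max(axis=0), size=(n, X.shape[1]))
    u = nn.kneighbors(probes, n_neighbors=1)[0][:, 0]
    return w.sum() / (u.sum() + w.sum())

rng = np.random.default_rng(1)
clustered = np.vstack([rng.normal(0, 0.05, (100, 2)),
                       rng.normal(3, 0.05, (100, 2))])
H = hopkins_inverse(clustered)
```

For tightly clustered data the real-point neighbor distances w are much smaller than the probe distances u, driving H toward 0; for uniform data H tends toward 0.5.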
Next, cohesion metrics are evaluated; these assess the internal compactness of the formed clusters. In this context, we speak of intra-cluster variance (IV), the cluster's mean diameter (CMD), and the cluster's mean distance (DMD). These metrics are described in Equations (16)–(18), respectively:
Note that when more than one cluster is found, these metrics are averaged across all
K clusters.
Finally, separation metrics or internal validation indices were evaluated to assess the quality of the cluster partition by measuring how well the clusters are separated from one another. In this context, we find the Silhouette coefficient (which measures how similar a point i is to its cluster; this metric is known to favor convex clusters), the Davies–Bouldin index (which measures the worst-case similarity ratio between clusters), and the Calinski–Harabasz index or variance ratio (which measures the relationship between the between-cluster variance and the within-cluster variance). Their calculations are described in Equations (19)–(21), respectively:
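All three separation indices are available directly in scikit-learn; the snippet below evaluates them on a labeled toy partition. Any DBSCAN labeling, with noise points removed, could be scored the same way.

```python
from sklearn.datasets import make_blobs
from sklearn.metrics import (calinski_harabasz_score, davies_bouldin_score,
                             silhouette_score)

# A labeled partition of three well-separated blobs serves as the test case.
X, labels = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

sil = silhouette_score(X, labels)         # in [-1, 1]; higher = better separation
dbi = davies_bouldin_score(X, labels)     # >= 0; lower = better
chi = calinski_harabasz_score(X, labels)  # > 0; higher = better
```

Note the opposing directions: Silhouette and Calinski–Harabasz improve as they increase, while Davies–Bouldin improves as it decreases.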
4. Results
4.1. Experimental Design and Data Configuration
The experimental design aims to evaluate the proposed explicit-causal estimator, which formulates the search for the optimal DBSCAN parameter as a system identification problem within a mass–spring–damper-inspired framework. Specifically, the experiment evaluates how the proposed dynamical estimator responds to different topological structures in order to determine the optimal density threshold through its internal response profile.
First, the
k-distance graph was calculated for all datasets using Nearest Neighbors, showing that the proposed dynamical formulation identifies a candidate transition region for threshold selection. Next,
Figure 4 presents these graphs for each dataset, illustrating how variations in the ordered signal are used to detect structurally relevant changes for DBSCAN.
To reinforce the algorithm’s analysis, estimates of the mass, damping, and stiffness parameters were obtained, testing the model’s adaptability across the datasets (see
Table 3).
An interesting point noticed in Table 3 is the near-zero damping parameter (b ≈ 0), which is better interpreted as an empirical property of the fitted surrogate dynamics under the current preprocessing and optimization setup. Under the present fit, low damping values suggest that the dynamic layer preserves sensitivity to abrupt structural transitions instead of attenuating them through strong dissipative terms.
In typical dynamic formulations, damping is used to suppress high-frequency oscillations induced by noise. However, in the present framework, the Savitzky–Golay filter already removes a substantial portion of the high-frequency variability while preserving the structural shape of the signal. From an estimation standpoint, this indicates that the velocity-related contribution became less influential under the fitted objective function. Therefore, the observed low-damping regime should be interpreted cautiously as a result of the current fitting configuration rather than as direct evidence of an intrinsic physical or causal property of the data.
4.2. Ablation Analysis of the Dynamic Layer
To assess the specific contribution of the full dynamic formulation, an ablation analysis was conducted using progressively simplified variants of the proposed method. The comparison included: (i) the full dynamic model, consisting of Savitzky–Golay smoothing, parameter fitting of the surrogate dynamic layer, RK4-based state evolution, and jerk-based detection; (ii) Savitzky–Golay smoothing followed by direct numerical jerk estimation without dynamic fitting; (iii) Savitzky–Golay smoothing followed by a simpler change-point detector based on the second derivative; and (iv) the Kneedle baseline. This experiment was designed to determine whether the observed clustering behavior is primarily attributable to the full structured dynamic layer or to the smoothing and transition-detection stages alone. For each variant, the estimated value of ε was used to run DBSCAN, and the resulting partitions were evaluated in terms of the number of clusters, percentage of noise, Silhouette score, Davies–Bouldin index (DBI), and Calinski–Harabasz index (CHI).
As shown in
Table 4, the ablation results do not indicate a uniform advantage of the full dynamic model across all evaluated datasets. In the Covtype dataset, the variants based on Savitzky–Golay smoothing combined with direct jerk estimation or second-derivative detection produced more coherent clustering solutions than both the full model and the Kneedle baseline. A similar tendency was observed in Synthetic 1, where the simplified variants yielded substantially lower noise levels and more favorable internal clustering metrics than the full dynamic formulation. In Synthetic 2, the results were again mixed: although the full model and Kneedle produced highly fragmented solutions with large noise percentages, the simplified variants led to more compact partitions, with the second-derivative alternative achieving the most interpretable multi-cluster solution.
Overall, these results suggest that a substantial part of the practical effectiveness of the proposed pipeline is associated with the smoothing stage and the detection of structural transitions in the ordered k-distance signal. Under the present experimental configuration, the full surrogate dynamic layer does not consistently outperform its simplified counterparts. Therefore, the contribution of the proposed formulation should be interpreted more cautiously: not as definitive evidence that the full dynamic model is always necessary, but rather as evidence that the ordered-signal representation, together with transition-sensitive criteria such as jerk or curvature-related changes, plays a central role in ε selection.
4.3. Runtime and Computational Cost
To complement the asymptotic analysis, the practical computational cost of the proposed pipeline was documented using the same implementation employed in the experiments. All measurements were obtained in Google Colab under a CPU-only configuration (x86_64), with 12.7 GB of RAM and no GPU/CUDA acceleration. In all cases, nearest-neighbor retrieval was performed using the kd-tree strategy. Warm-up runs were executed before measurement, and the complete pipeline was then repeated 30 times for each dataset. Reported times correspond to mean and standard deviation in seconds.
Table 5 summarizes the observed runtime behavior. The results show that the nearest-neighbor stage dominates the total cost in the largest dataset, whereas in the synthetic datasets the optimization stage becomes more visible due to the smaller scale of the search problem. This supports the view that the practical cost of the method depends on both the neighbor-search structure and the geometry of the data.
4.4. Efficiency Metrics
To objectively evaluate the quality, robustness, and significance of the clusterings generated by the estimated ε value, an empirical expectation analysis was performed over 50 iterations for each dataset. For the real-world Covtype dataset, we applied a bootstrap resampling technique to capture the variance of the data distribution. For the synthetic data, the random seeds that govern the structural generation and background noise were dynamically varied. This ensures that the reported metrics represent the expected behavior of the algorithms under topological uncertainty. Two sets of metrics were applied: (i) cohesion metrics and (ii) inter-cluster separation metrics. For cohesion, we measure the Intracluster Variance (IV), the Cluster's Mean Diameter (CMD), and the Cluster's Mean Distance (DMD). In
Table 6, we can find the results of these cohesion metrics. On the other hand,
Table 7 presents the results of the inter-cluster separation metrics. The results, reported as the mean ± standard deviation across the 50 iterations, are presented in
Table 6 and
Table 7.
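The cohesion metrics can be sketched as follows under assumed operationalizations (the paper's exact formulas are defined elsewhere in the manuscript): IV as the mean squared distance to the cluster centroid, CMD as the mean cluster diameter, and DMD as the mean pairwise intra-cluster distance.

```python
import numpy as np
from scipy.spatial.distance import pdist

def cohesion_metrics(X, labels):
    """Assumed operationalization of the cohesion metrics:
    IV  - mean squared distance of points to their cluster centroid,
    CMD - mean cluster diameter (largest pairwise distance),
    DMD - mean pairwise distance within a cluster."""
    iv, cmd, dmd = [], [], []
    for c in np.unique(labels):
        if c == -1:                               # skip DBSCAN noise points
            continue
        P = X[labels == c]
        iv.append(((P - P.mean(axis=0)) ** 2).sum(axis=1).mean())
        d = pdist(P) if len(P) > 1 else np.zeros(1)
        cmd.append(d.max())
        dmd.append(d.mean())
    return float(np.mean(iv)), float(np.mean(cmd)), float(np.mean(dmd))
```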
Before discussing the quality of the formed clusters, it is necessary to assess the data’s clustering tendency. For this, the Hopkins statistic was computed for the Covtype, Synthetic 1, and Synthetic 2 datasets. The low magnitude of the statistic in all three cases indicates a tendency to cluster, consistent with the inverse formulation, in which values close to zero indicate clusterability.
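A minimal sketch of the Hopkins statistic in its inverse formulation is given below; values near 0 suggest a clustering tendency, while values near 0.5 suggest spatial randomness. The sample size `m` and the bounding-box sampling of uniform points are common choices assumed here, not necessarily those of the original study.

```python
import numpy as np
from scipy.spatial import cKDTree

def hopkins_inverse(X, m=25, seed=0):
    """Hopkins statistic, inverse formulation: near 0 -> clustered,
    near 0.5 -> spatially random (assumed standard construction)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    tree = cKDTree(X)
    # nearest-neighbor distances of m sampled real points (k=2 skips self)
    idx = rng.choice(n, size=m, replace=False)
    w = tree.query(X[idx], k=2)[0][:, 1]
    # nearest-real-neighbor distances of m uniform points in the bounding box
    U = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))
    u = tree.query(U, k=1)[0]
    return float(w.sum() / (u.sum() + w.sum()))
```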
In this context, across the empirical expectation evaluations, Kneedle reports extremely low diameters, indicating that this method fragments natural structures into very small components (averaging 100 clusters in Covtype, 7 clusters in Synthetic 1, and 67 in Synthetic 2), which suggests oversegmentation. On the other hand, the proposed algorithm reports clusters of greater extent and variance; given that the Hopkins statistic confirms the data’s clustering tendency, the method’s greater variance indicates a more faithful representation of the clusters’ real extent (core and periphery).
These three metrics allow evaluation of the separation quality of the detected groupings. On the one hand, the Silhouette coefficient shows that, across the three datasets presented, Kneedle consistently yields negative values, providing mathematical evidence that the partitions are artificial. In contrast, the proposed method maintains positive values for the Covtype, Synthetic 1, and Synthetic 2 datasets, indicating that the data’s natural structure is preserved. This disparity highlights the Geometric Myopia of the Kneedle algorithm mentioned before: by failing to account for the dynamic inertia of the density curve, Kneedle reacts to local geometric irregularities (noise), selecting premature radii that fracture unitary structures.
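Because DBSCAN's noise label (-1) is not a cluster, the Silhouette coefficient should be computed on non-noise points only; otherwise noise would distort the score. The following is a self-contained sketch of that convention (the function name and the exclusion policy are assumptions for illustration):

```python
import numpy as np
from scipy.spatial.distance import cdist

def silhouette_excluding_noise(X, labels, noise_label=-1):
    """Mean Silhouette over non-noise points only; DBSCAN's noise
    label is excluded because it does not form a cluster."""
    mask = labels != noise_label
    X, labels = X[mask], labels[mask]
    clusters = np.unique(labels)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    D = cdist(X, X)
    scores = np.empty(len(X))
    for i, li in enumerate(labels):
        same = labels == li
        same[i] = False                           # exclude the point itself
        a = D[i, same].mean() if same.any() else 0.0   # intra-cluster cohesion
        b = min(D[i, labels == lj].mean()              # nearest-cluster separation
                for lj in clusters if lj != li)
        scores[i] = (b - a) / max(a, b)
    return float(scores.mean())
```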
Figure 5 illustrates a visual comparison of the clustering results from a single representative iteration, encoding each class with a different color. The first row displays the application of both methods to the Covtype dataset, while the second and third rows show the results for the Synthetic 1 and Synthetic 2 datasets, respectively. Note that for the Covtype dataset, Principal Component Analysis (PCA) was applied to project the 10-dimensional feature space onto a two-dimensional plane for visualization. While Table 6 and Table 7 report expectations over 50 runs, this visualization exemplifies the typical topological behavior previously discussed.
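The PCA projection used for the Covtype visualization can be sketched directly from the SVD of the centered data; this is a generic sketch, not the exact preprocessing pipeline of the experiments.

```python
import numpy as np

def pca_2d(X):
    """Project features onto the first two principal components
    for 2D visualization (e.g., for 10-D Covtype features)."""
    Xc = X - X.mean(axis=0)                      # center the data
    # right singular vectors are the principal axes, ordered by variance
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T
```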
To quantify the benefit of the dynamic intervention over the geometric heuristic, we calculate the Average Causal Effect (ACE), understood here as the expected change in clustering quality when the ε-selection mechanism is switched from the geometric baseline to the proposed dynamic estimator (see Equation (22)). In this context, the do-operation should be interpreted as an intervention on the internal estimation mechanism rather than on the physical data-generating process.
Applying this to the experimental results:
Covtype Dataset: As shown in
Table 7, the dynamic intervention consistently corrects the massive fragmentation produced by Kneedle across 50 bootstrap samples, shifting the expected quality from negative to positive.
Synthetic Dataset 1: Similarly, the intervention prevents the geometric algorithm from overfitting to uniform background noise across varying random seeds.
Synthetic Dataset 2: For this dataset, Kneedle consistently fractured the structure due to ambiguous density gradients, yielding a negative expected score. The proposed dynamical decision mechanism successfully captured the dominant structural transition in the dataset.
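Under the interpretation above, the ACE can be estimated as the mean paired difference in a quality score (e.g., the Silhouette coefficient) between the two selection mechanisms across the 50 iterations. The following sketch, with an added bootstrap confidence interval, is an assumed estimator consistent with that reading; Equation (22) itself is not reproduced here.

```python
import numpy as np

def average_causal_effect(scores_dynamic, scores_baseline, n_boot=1000, seed=0):
    """ACE as the mean paired difference in clustering quality when
    the eps-selection mechanism switches from the geometric baseline
    to the dynamic estimator, with a bootstrap 95% CI (assumed form)."""
    diff = np.asarray(scores_dynamic, float) - np.asarray(scores_baseline, float)
    rng = np.random.default_rng(seed)
    boot = rng.choice(diff, size=(n_boot, diff.size), replace=True).mean(axis=1)
    lo, hi = np.percentile(boot, [2.5, 97.5])
    return float(diff.mean()), float(lo), float(hi)
```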
4.5. Analysis and Causal Discussion
It is important to distinguish the proposed SCM from a physical model of the data generation process. The DAG in
Figure 2 does not imply that the dataset itself was generated by a mass–spring–damper system. Instead, it assumes that the ordered
k-distance curve can be embedded into a surrogate dynamical representation whose internal transitions help identify the decision point for ε. By intervening on the decision mechanism (do(M)), we show that the proposed dynamical decision mechanism isolates the intervention-sensitive cut-off point more effectively than the geometric heuristic. Thus, the causality considered here refers to the flow of information within the estimator, supporting the use of the ACE metric to quantify the algorithmic improvement produced by intervening on the threshold-selection mechanism.
However, evaluating computational pipelines through a causal lens poses a challenge: distinguishing true structural causality from mere functional dependence. In deterministic algorithms, changing an intermediate stage necessarily modifies downstream outputs. To assess whether the proposed DAG can be interpreted as an SCM with modular, intervention-sensitive components rather than a simple sequence of operations, we must show that these mechanisms preserve their internal structural integrity across varying input conditions and respond predictably to targeted interventions. To address this, we operationalized Pearl’s do-calculus through control tests and modular forward-dependence tests across the three scenarios.
4.5.1. Control Tests
To verify that the internal modules act autonomously, we performed a control test using an intervention on the spatial orientation that should not modify the topological structure. Let U be the exogenous, unobserved spatial orientation of the dataset. We intervened in the data generation mechanism by applying a rotation to the data, where the rotation is the 45-degree rotation matrix for the 2D synthetic datasets and a random N-dimensional orthogonal matrix, obtained via QR decomposition, for the 10-dimensional Covtype dataset. Because the distance-computation module computes isotropic distances, the downstream mechanisms should be entirely invariant to this rotation. The maximum absolute difference in the smoothed signal before and after the intervention was recorded as follows:
Covtype Dataset: .
Synthetic Dataset 1: .
Synthetic Dataset 2: .
These differences are at machine precision, effectively zero. This control test shows that the modules are stable and structurally invariant to exogenous spatial confounders, isolating the topological mechanism from coordinate variance.
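The rotation-invariance control test can be sketched as follows: draw a random orthogonal matrix via QR decomposition, rotate the data, and compare the ordered k-distance signals before and after. The function name and the choice of `k` are illustrative assumptions.

```python
import numpy as np
from scipy.spatial import cKDTree

def rotation_invariance_gap(X, k=4, seed=0):
    """Control test: the ordered k-distance signal should be invariant
    to an orthogonal rotation of the data (isotropic distances)."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.normal(size=(X.shape[1],) * 2))  # random orthogonal matrix
    def kdist(Z):
        d, _ = cKDTree(Z).query(Z, k=k + 1)   # k+1: first neighbor is the point itself
        return np.sort(d[:, -1])              # ordered k-distance signal
    return float(np.max(np.abs(kdist(X) - kdist(X @ Q.T))))
```

Up to floating-point error, the gap is a machine-level zero, mirroring the reported control-test outcome.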
4.5.2. Modular Intervention
To examine whether the internal dynamic layer behaves as an intervention-sensitive component of the estimator, we applied a modular intervention imposing an artificial high-damping regime on the fitted response, without modifying the preceding curve-extraction or jerk-computation stages. Under this intervention, the expected value of ε shifted relative to the unperturbed configuration, indicating that the estimator’s internal response is sensitive to targeted changes in the dynamic layer. The results were conclusive across all topologies:
Covtype: Observational vs. Intervened .
Synthetic 1: Observational vs. Intervened .
Synthetic 2: Observational vs. Intervened .
These intervention results should be interpreted at the level of the computational mechanism rather than as evidence of a physical law in the data-generating process. In this sense, the proposed SCM provides a structured way to analyze how changes in the internal estimation modules propagate toward the final threshold selection.
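The effect of the do(b := b_high) intervention on a second-order layer can be illustrated with a toy mass–spring–damper step response; the parameter values below are purely illustrative stand-ins, not the fitted (m, b, k) of the experiments.

```python
import numpy as np
from scipy.integrate import solve_ivp

def step_response(m, b, k, t_end=20.0):
    """Unit-step response of m*x'' + b*x' + k*x = 1, a toy stand-in
    for the surrogate dynamic layer (illustrative parameters)."""
    rhs = lambda t, y: [y[1], (1.0 - b * y[1] - k * y[0]) / m]
    t = np.linspace(0.0, t_end, 2000)
    return t, solve_ivp(rhs, (0.0, t_end), [0.0, 0.0], t_eval=t).y[0]

# observational low-damping regime vs. intervened high-damping regime
_, x_low = step_response(m=1.0, b=0.2, k=1.0)    # underdamped: overshoots
_, x_high = step_response(m=1.0, b=6.0, k=1.0)   # overdamped: monotone
```

The qualitative change in the response shape under the damping intervention is what makes the downstream ε decision shift, which is the mechanism-level sensitivity the test probes.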
4.5.3. Final Remarks
Overall, the contribution of the proposed framework should be understood through three related but distinct layers: the algorithmic sensitivity of DBSCAN to ε, the surrogate dynamical modeling of the ordered k-distance signal, and an intervention-based interpretation of the resulting estimator. This separation is important because the causal reading is not presented here as an automatic consequence of the physical analogy, but as an additional interpretive layer defined over the estimator itself.
The dynamic analysis of the mass–spring–damper-inspired layer suggested that jerk peaks tend to coincide with structural transitions in the ordered density signal. This supports the use of jerk as a transition-sensitive descriptor within the proposed estimator and provides a mechanistic interpretation of the internal decision process. In contrast to Kneedle, which relies on geometric detection of the inflection point, the proposed framework uses the structured dynamical response of the signal to identify transition regions that are informative for ε selection.
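The coincidence of jerk extrema with structural transitions can be illustrated on a synthetic ordered k-distance signal with a known transition point; the signal values, window length, and polynomial order below are illustrative assumptions.

```python
import numpy as np
from scipy.signal import savgol_filter

# Synthetic ordered k-distance signal with a structural transition at i = 300:
# a dense-core plateau followed by a steep noise tail (illustrative values).
signal = np.concatenate([np.linspace(0.10, 0.40, 300),
                         np.linspace(0.45, 6.00, 100)])
# jerk = third derivative of the smoothed signal (polyorder must be >= 3)
jerk = savgol_filter(signal, window_length=41, polyorder=4, deriv=3)
peak = int(np.argmax(np.abs(jerk)))   # jerk extremum lands near the transition
```

On the linear segments the jerk is essentially zero, so its extremum concentrates around the slope break, which is the behavior the estimator exploits.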
Finally, the analysis of empirical expectations supports the applicability of the proposed intervention-based mechanism. In scenarios characterized by spatial overlap, dimensionality up to 10, background noise, and continuous density gradients, the geometric heuristic showed greater sensitivity to local irregularities. In contrast, the proposed SCM more consistently preserved the dominant structural organization of the data. The Average Causal Effect scores across the evaluated challenges are consistent with the view that intervening on the estimation mechanism can improve threshold selection relative to the geometric baseline, although the ablation analysis also indicates that an important part of this behavior is associated with the smoothing stage and the detection of structural transitions in the ordered k-distance signal.
4.6. Scope, Validity Region, and Expected Failure Cases
The proposed method should not be interpreted as a universal solution for ε estimation. Its applicability depends on whether the ordered k-distance signal preserves a structurally meaningful transition after smoothing. In practical terms, the method is most advisable when the boundary between dense regions and noise is smooth enough to generate a detectable transition, when the data form locally continuous manifolds, and when the resulting signal admits a stable transition-sensitive criterion such as jerk or curvature change.
Conversely, the method becomes less advisable in scenarios where the ordered k-distance signal does not provide a stable structural anchor. This includes cases dominated by nearly uniform noise, highly irregular signals without a coherent transition region, or geometries in which no stable and interpretable ε emerges after smoothing. Under such conditions, the dynamic layer may lose discriminative power, and the resulting threshold should be interpreted with caution.
At the same time, the method should be interpreted within a restricted validity region. Its usefulness depends on the presence of a structurally meaningful transition in the smoothed ordered k-distance signal, and its practical cost is determined by the nearest-neighbor search strategy and the effective dimensionality of the data. Therefore, the proposed framework is better understood as an intervention-sensitive estimator for ε selection under suitable geometric conditions, rather than as a universally transferable causal model for all density configurations.
4.7. Future Work
Future work that complements this proposal includes the following. Developing an adaptive multi-scale strategy to enable local estimation of ε in regions with heterogeneous densities remains a promising direction. Another important line of research involves extending the proposed SCM-based interpretation of DBSCAN to other clustering paradigms, including graph-based and feature-learning approaches, to broaden its applicability to non-Euclidean spaces. Further investigation could focus on combining the proposed SCM with structural inference techniques, such as the PC algorithm, to identify relationships between clustering parameters and latent factors in complex datasets. Incorporating additional internal validation measures, such as DBCV, would allow for a more comprehensive assessment of connectivity and density-based separation in non-convex cluster structures. Moreover, contrasting the estimated partitions with the original generative structure in synthetic datasets using external validation indices such as ARI, NMI, and AMI would provide deeper insight into clustering performance. Expanding the comparative analysis to variable-density clustering algorithms, including HDBSCAN and OPTICS, exploring ε-selection strategies beyond Kneedle, and evaluating the framework on higher-dimensional datasets, including an assessment of the modularity and invariance properties of each structural component to ensure that each function behaves as an autonomous causal mechanism under interventions, also represents a valuable extension. Finally, examining the stability of the fitted parameters (m, b, k) across multiple initializations would help determine whether the observed low-damping regime is robust or specific to certain configurations.
5. Conclusions
This article presents a three-level interpretation of ε selection in DBSCAN. At the algorithmic level, ε governs neighborhood connectivity and structural transitions in clustering. At the modeling level, the ordered distance signal is approximated through a structured dynamical representation inspired by a mass–spring–damper system. At the causal level, the resulting estimator is interpreted through interventions on its internal threshold-selection mechanism.
Under this view, the contribution of the paper should not be understood as an automatic passage from topological sensitivity to physical dynamics and then to causality. Instead, the proposal is organized as a layered framework in which each level plays a distinct role: algorithmic sensitivity motivates the modeling step, the modeling step provides a structured representation of the ordered signal, and the causal step offers an intervention-based interpretation of the internal estimation mechanism.
This article presents the view that the tuning of the hyperparameter ε in the DBSCAN algorithm can be interpreted as an explicitly causal estimation process supported by a structured dynamical representation of the distance distribution. By modeling the ordered k-distance curve using differential equations, we establish a link between local variations in its slope and structural changes in data density. This reinterpretation converts the estimation of the neighborhood radius into the identification of an intervention-sensitive transition point in the ordered density signal: a slight modification of the neighborhood radius triggers a significant change in the cluster configuration.
The experimental results tested the method in complex scenarios, known in the literature, where DBSCAN struggles by design: high-dimensional data (Covtype) and data with smooth density transitions (Synthetic 2). Both scenarios motivated the development of variable-density algorithms such as OPTICS and HDBSCAN.
Despite these limitations, the empirical comparison with Kneedle shows that the proposed intervention-based estimator can produce more favorable threshold selections under the evaluated scenarios. In particular, Kneedle frequently yielded fragmented or unstable partitions, whereas the proposed framework more often preserved the dominant structural organization of the data. The ACE values support the usefulness of intervening on the estimation mechanism relative to the geometric baseline. At the same time, the ablation analysis indicates that an important part of this behavior is also associated with the smoothing stage and with the detection of structural transitions in the ordered k-distance signal.
This separation improves the precision of the central argument and helps delimit the scope of the claims made in the manuscript. In particular, the causal language adopted here refers to the modular computational structure of the estimator, while the dynamical layer remains a surrogate modeling device for transition detection in the ordered k-distance signal.
An important line of research involves extending the proposed SCM-based interpretation of DBSCAN to other clustering paradigms, including graph-based and feature-learning approaches, with the aim of broadening its applicability to non-Euclidean spaces. Subsequent work could focus on combining the proposed SCM with structural inference techniques, such as the PC algorithm, to identify relationships between clustering parameters and latent factors in complex datasets.
Adding further internal validation measures, such as DBCV, would make it possible to evaluate connectivity and density-based separation in non-convex cluster structures more accurately. Furthermore, comparing the estimated partitions to the original generative structure in synthetic datasets using external validation indices such as ARI, NMI, and AMI would provide deeper insight into clustering performance. Expanding the comparative analysis to variable-density clustering algorithms, including HDBSCAN and OPTICS, as well as exploring ε-selection strategies beyond Kneedle and evaluating the framework on higher-dimensional datasets, also represents a valuable extension. Finally, examining the stability of the fitted parameters under multiple initializations would help determine whether the observed low-damping regime is robust or dependent on specific configurations. This comprehensive validation strategy would strengthen the generalizability of the results as intrinsic properties of the system and consolidate a more accurate interpretation of the damping regimes across operating scenarios.