DBSCAN is widely used to identify structured regions in unlabeled data, but its performance depends critically on the selection of the neighborhood parameter ε. Traditional heuristics for estimating ε often become unreliable in high-dimensional or varying-density settings because they rely heavily on local geometric criteria and may fail under smooth transitions or topological ambiguity. This work presents a three-level perspective on DBSCAN hyperparameter selection. At the algorithmic level, ε controls neighborhood connectivity and structural transitions in clustering. At the modeling level, the ordered k-distance signal is approximated through a surrogate dynamical estimation framework inspired by a mass–spring–damper system. At the causal level, the resulting estimator is interpreted through interventions on its internal threshold-selection mechanism.
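The ordered k-distance signal that the surrogate model operates on is the classical DBSCAN diagnostic: each point's distance to its k-th nearest neighbor, sorted in ascending order. A minimal sketch, assuming scikit-learn is available and taking k in the role of DBSCAN's min_samples (the function name is illustrative, not from the paper):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def ordered_k_distance(X, k=4):
    """Return the sorted k-distance signal used for epsilon selection.

    For each point, compute the distance to its k-th nearest neighbor,
    then sort ascending; abrupt slope changes ("knees") in this curve
    are the classical cue for choosing DBSCAN's epsilon.
    """
    # k + 1 neighbors because each query point is its own nearest neighbor
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    distances, _ = nn.kneighbors(X)
    return np.sort(distances[:, -1])
```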
The proposed method models the variation of ε using ordinary differential equations defined on the ordered k-distance signal, enabling analysis of structural transitions in density organization via a surrogate dynamical representation. System identification is performed using L-BFGS-B optimization on the smoothed k-distance curve, while the system dynamics are solved with the fourth-order Runge–Kutta method. The resulting estimator identifies transition regions that are structurally informative for ε selection in DBSCAN.
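The abstract names the ingredients of this fit (a mass–spring–damper surrogate, RK4 integration, L-BFGS-B identification) without its exact parameterization; the sketch below is one plausible assembly of those pieces, not the paper's specification. The second-order form m·x″ + c·x′ + k_s·x = u, the constant forcing u, the state layout, and the least-squares objective are all assumptions:

```python
import numpy as np
from scipy.optimize import minimize

def rk4(f, y0, t):
    """Classical fourth-order Runge-Kutta integration of y' = f(t, y)."""
    y = np.zeros((len(t), len(y0)))
    y[0] = y0
    for i in range(len(t) - 1):
        h = t[i + 1] - t[i]
        k1 = f(t[i], y[i])
        k2 = f(t[i] + h / 2, y[i] + h / 2 * k1)
        k3 = f(t[i] + h / 2, y[i] + h / 2 * k2)
        k4 = f(t[i] + h, y[i] + h * k3)
        y[i + 1] = y[i] + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
    return y

def fit_surrogate(t, signal):
    """Fit a mass-spring-damper response to a smoothed k-distance signal.

    t, signal: equal-length 1-D arrays. State y = [x, v]; the system
    m*x'' + c*x' + k_s*x = u drives x toward the curve's terminal level u.
    Parameters (m, c, k_s) are identified with L-BFGS-B.
    """
    u = signal[-1]  # assumed constant forcing toward the final value

    def loss(params):
        m, c, k_s = params
        f = lambda _t, y: np.array([y[1], (u - c * y[1] - k_s * y[0]) / m])
        x = rk4(f, np.array([signal[0], 0.0]), t)[:, 0]
        return np.mean((x - signal) ** 2)

    res = minimize(loss, x0=[1.0, 1.0, 1.0], method="L-BFGS-B",
                   bounds=[(1e-3, None)] * 3)
    return res.x
```

Since the ordered k-distance curve has no physical time axis, t can simply be a normalized index, e.g. np.linspace(0, 1, len(signal)).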
To analyze the estimator at the intervention level, Pearl’s do-calculus is used to compute the Average Causal Effect (ACE). The method was evaluated on synthetic benchmarks and on the Covtype dataset, including scenarios with multi-density overlap and high dimensionality. The resulting ACE values indicate that the proposed estimator improves intervention-based ε selection relative to the geometric baseline across the evaluated datasets.
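The interventional estimand itself is not spelled out in the abstract. Under the usual reading, ACE = E[Y | do(T=1)] − E[Y | do(T=0)], where the intervention toggles the estimator's internal threshold-selection mechanism and Y is some clustering-quality outcome. The sketch below assumes a hypothetical outcome function and a hypothetical use_dynamic_threshold switch; both are illustrative, not the paper's API:

```python
import numpy as np

def average_causal_effect(outcome_fn, datasets):
    """Estimate ACE = E[Y | do(T=1)] - E[Y | do(T=0)].

    The intervention do(T=t) is simulated by forcing the estimator's
    threshold-selection mechanism on (t=1, proposed mechanism) or off
    (t=0, geometric baseline) for every dataset, then averaging the
    outcome. outcome_fn(dataset, use_dynamic_threshold) is a
    hypothetical clustering-quality score.
    """
    y_do1 = np.mean([outcome_fn(d, use_dynamic_threshold=True) for d in datasets])
    y_do0 = np.mean([outcome_fn(d, use_dynamic_threshold=False) for d in datasets])
    return y_do1 - y_do0
```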
Its practical computational cost is dominated by nearest-neighbor search, behaving approximately as O(n log n) under favorable indexing conditions and degrading toward O(n²) in high-dimensional or weak-pruning regimes.
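For completeness, a self-contained usage sketch that wires an estimated ε into scikit-learn's DBSCAN; the discrete-curvature "knee" used here is a generic stand-in for the paper's dynamical estimator, and the synthetic data is a placeholder for the evaluated benchmarks:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

# Synthetic stand-in data; the paper evaluates on benchmarks and Covtype.
X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

# Ordered k-distance curve (k = 4; one extra neighbor for the self-distance).
curve = np.sort(NearestNeighbors(n_neighbors=5).fit(X).kneighbors(X)[0][:, -1])

# eps_hat would come from the surrogate's detected transition region;
# the maximum of the discrete second derivative is a generic stand-in.
eps_hat = float(curve[np.argmax(np.gradient(np.gradient(curve)))])
labels = DBSCAN(eps=eps_hat, min_samples=4).fit_predict(X)
```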