2.1. Time Series Change Point Detection
Change point detection (CPD) aims to identify time instants at which the statistical properties of a time series undergo significant changes. It is a classical problem in statistics and machine learning, with applications ranging from process monitoring and anomaly detection to finance, biomedical signal analysis, and large-scale sensor networks. Depending on whether data are processed retrospectively or sequentially, CPD methods can be broadly categorized into offline and online settings. We refer interested readers to the comprehensive survey by Truong et al. [3].
From a distributional perspective, many CPD methods formulate the problem as detecting discrepancies between probability distributions across two temporal segments. Given a multivariate time series $\{\mathbf{x}_t\}_{t=1}^{T}$ with $\mathbf{x}_t \in \mathbb{R}^d$, a change point at time $t$ is declared if the divergence between the distributions of observations from two adjacent windows exceeds a predefined threshold:
$$ D\big(P_{[t-W+1,\,t]},\; P_{[t+1,\,t+W]}\big) > \tau, $$
where $W$ denotes the window size, $\tau$ is the detection threshold, and $D(\cdot,\cdot)$ is a suitable measure of distributional discrepancy. This formulation is flexible and model-agnostic, and has become a dominant paradigm for nonparametric change detection in high-dimensional time series.
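To make this window-based formulation concrete, the following minimal Python sketch scores every admissible time index by comparing two adjacent windows; the function names and the generic divergence callable are illustrative placeholders rather than part of any specific method cited in this section.

```python
import numpy as np

def sliding_window_scores(X, W, divergence):
    """Score every admissible time index by comparing two adjacent windows.

    X          : (T, d) array holding the multivariate time series.
    W          : window size (number of samples per segment).
    divergence : callable mapping two (W, d) arrays to a scalar discrepancy.
    """
    T = X.shape[0]
    scores = np.full(T, np.nan)
    for t in range(W, T - W + 1):
        past = X[t - W:t]        # observations immediately before time t
        future = X[t:t + W]      # observations starting at time t
        scores[t] = divergence(past, future)
    return scores

def detect_change_points(scores, threshold):
    """Declare a change point wherever the score exceeds the threshold."""
    return np.where(scores > threshold)[0]
```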
Early distributional approaches estimate divergences directly via nonparametric density estimation, such as kernel density estimators (KDEs) [14]. However, KDE-based divergence estimation becomes unreliable in high-dimensional settings due to the curse of dimensionality. To address this issue, density-ratio estimation methods have been proposed, which estimate the ratio of two densities without explicitly estimating the individual densities. Representative examples include the Kullback–Leibler importance estimation procedure (KLIEP) [15] and relative unconstrained least-squares importance fitting (RuLSIF) [6]. These methods are computationally efficient and have been widely adopted in practical change detection systems, particularly in online and streaming scenarios.
Another prominent line of work formulates CPD as a two-sample hypothesis testing problem using kernel methods. The maximum mean discrepancy (MMD) [16] compares distributions via their mean embeddings in a reproducing kernel Hilbert space, thereby avoiding explicit density estimation. MMD-based detectors are flexible, nonparametric, and applicable to a wide range of data types, and have been successfully applied to change detection in time series [7,8]. More recently, kernel change-point detection has been integrated with deep models to enhance detection power in complex and highly structured data distributions [9,17].
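As a rough illustration of the kernel two-sample idea, a biased (V-statistic) estimate of the squared MMD with a Gaussian kernel can be written in a few lines of Python; the helper names and the fixed bandwidth are our own choices and do not correspond to any particular detector cited above.

```python
import numpy as np

def gaussian_gram(A, B, sigma):
    """Gaussian kernel Gram matrix between the rows of A and the rows of B."""
    sq_dists = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-sq_dists / (2 * sigma**2))

def mmd2_biased(X, Y, sigma=1.0):
    """Biased (V-statistic) estimate of the squared MMD between samples X ~ p and Y ~ q."""
    Kxx = gaussian_gram(X, X, sigma)
    Kyy = gaussian_gram(Y, Y, sigma)
    Kxy = gaussian_gram(X, Y, sigma)
    return Kxx.mean() + Kyy.mean() - 2.0 * Kxy.mean()
```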
Recent efforts have begun to address the interpretability limitations of conventional CPD methods by explicitly localizing which features or dependencies are responsible for detected distributional shifts. A representative line of work formulates interpretability in terms of conditional distribution changes. Kulinski et al. [10] formalize feature shift detection via conditional distribution tests, framing interpretability as identifying the dimensions whose conditional distributions change over time. Nie and Nicolae [18] study the detection and localization of conditional distribution shifts from a hypothesis testing perspective. Related perspectives on conditional dependence and invariance have also been explored in the context of causal discovery [19,20].
In parallel, several approaches rely on classifier-based two-sample tests to detect distributional changes [21,22]. While often powerful in high-dimensional settings, such methods typically require post hoc attribution techniques to infer which features drive the detected shift, and their explanations may depend strongly on model inductive biases. Another complementary line of research characterizes changes through variations in dependency or graph structures, such as mutual information matrices or dynamic networks [23,24].
2.2. The Cauchy–Schwarz Divergence and Its Conditional Extension
Before introducing our change detection framework, we briefly review the CS divergence and its conditional extension, which provide fundamental information-theoretic tools for measuring distributional discrepancies and serve as building blocks of the proposed method.
A line of work initiated by Principe and colleagues in the early 2000s proposed divergence measures constructed from classical inequalities in functional analysis [11,12]. A representative example is the CS divergence, which is rooted in the Cauchy–Schwarz inequality applied to probability density functions. Given two densities $p(\mathbf{x})$ and $q(\mathbf{x})$ defined on a common domain, their $L^2$ inner product satisfies
$$ \left(\int p(\mathbf{x})\, q(\mathbf{x})\, d\mathbf{x}\right)^{2} \;\le\; \int p^{2}(\mathbf{x})\, d\mathbf{x} \int q^{2}(\mathbf{x})\, d\mathbf{x}, $$
where equality holds if and only if $p$ and $q$ are linearly dependent in $L^2$. This inequality motivates measuring distributional discrepancy by comparing the cross-information term $\int p(\mathbf{x})\, q(\mathbf{x})\, d\mathbf{x}$ against the self-information terms $\int p^{2}(\mathbf{x})\, d\mathbf{x}$ and $\int q^{2}(\mathbf{x})\, d\mathbf{x}$. Accordingly, the CS divergence is defined as
$$ D_{\mathrm{CS}}(p;q) \;=\; -\log \frac{\left(\int p(\mathbf{x})\, q(\mathbf{x})\, d\mathbf{x}\right)^{2}}{\int p^{2}(\mathbf{x})\, d\mathbf{x}\, \int q^{2}(\mathbf{x})\, d\mathbf{x}}. \tag{3} $$
An equivalent view is that $D_{\mathrm{CS}}(p;q)$ is the negative logarithm of a normalized squared correlation (a cosine similarity) between $p$ and $q$ in the Hilbert space $L^2$, and thus it decomposes naturally into quadratic self terms and a cross term involving $p$ and $q$.
The CS divergence enjoys several properties that make it attractive for statistical signal processing and change detection. For certain distribution families, such as mixtures of Gaussians, the integrals in Equation (3) admit closed-form expressions [25]. More generally, it can be estimated efficiently using nonparametric density estimators. In contrast to the Kullback–Leibler (KL) divergence, the formulation in Equation (3) avoids explicit density-ratio estimation and logarithms of densities, which often improves numerical stability in practice [26].
Given i.i.d. samples $\{\mathbf{x}_i\}_{i=1}^{n}$ and $\{\mathbf{y}_j\}_{j=1}^{m}$ drawn from $p$ and $q$, respectively, the integrals in the CS divergence can be approximated using kernel density estimation (KDE). Employing a Gaussian kernel $\kappa_\sigma(\cdot,\cdot)$ with kernel width $\sigma$, the CS divergence admits an empirical estimator of the form:
$$ \widehat{D}_{\mathrm{CS}}(p;q) \;=\; \log\!\left(\frac{1}{n^{2}}\sum_{i=1}^{n}\sum_{j=1}^{n}\kappa_\sigma(\mathbf{x}_i,\mathbf{x}_j)\right) + \log\!\left(\frac{1}{m^{2}}\sum_{i=1}^{m}\sum_{j=1}^{m}\kappa_\sigma(\mathbf{y}_i,\mathbf{y}_j)\right) - 2\log\!\left(\frac{1}{nm}\sum_{i=1}^{n}\sum_{j=1}^{m}\kappa_\sigma(\mathbf{x}_i,\mathbf{y}_j)\right). $$
Importantly, the estimator is nonparametric, meaning that it does not impose any predefined parametric assumptions (e.g., Gaussianity) on the underlying data distributions. This property is particularly relevant in our setting. In multivariate sensor monitoring and industrial process analysis, the underlying data-generating mechanisms are often nonlinear, partially unknown, and difficult to model accurately using parametric distributions (e.g., Gaussian or linear models). A nonparametric formulation enables direct comparison of empirical window-based distributions without risking model misspecification, thereby improving robustness in complex real-world systems.
Note that the consistency of the underlying KDE-based estimator follows from standard KDE theory (see Appendix B.4 of [27] for a detailed proof), and therefore the CS divergence estimator inherits these guarantees.
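For illustration, a direct plug-in implementation of the empirical estimator above might look as follows in Python; the helper names and the fixed, shared kernel width are simplifying assumptions made only for this sketch.

```python
import numpy as np

def gaussian_gram(A, B, sigma):
    """Kernel matrix kappa_sigma(a_i, b_j) between the rows of A and the rows of B."""
    sq_dists = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-sq_dists / (2 * sigma**2))

def cs_divergence(X, Y, sigma=1.0):
    """Plug-in estimate of the CS divergence between samples X ~ p and Y ~ q.

    Computes log(self term of p) + log(self term of q) - 2*log(cross term),
    with each integral approximated by an average of pairwise kernel evaluations.
    """
    self_p = gaussian_gram(X, X, sigma).mean()   # approximates the integral of p^2
    self_q = gaussian_gram(Y, Y, sigma).mean()   # approximates the integral of q^2
    cross = gaussian_gram(X, Y, sigma).mean()    # approximates the integral of p*q
    return np.log(self_p) + np.log(self_q) - 2.0 * np.log(cross)
```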
The CS divergence can be naturally extended to conditional distributions. Let $\mathbf{x}$ and $\mathbf{y}$ denote two sets of random variables, and consider two conditional distributions $p(\mathbf{y}\mid\mathbf{x})$ and $q(\mathbf{y}\mid\mathbf{x})$. The conditional CS divergence is defined as [13]:
$$ D_{\mathrm{CS}}\big(p(\mathbf{y}\mid\mathbf{x});\,q(\mathbf{y}\mid\mathbf{x})\big) \;=\; \log\!\left(\iint \frac{p^{2}(\mathbf{x},\mathbf{y})}{p^{2}(\mathbf{x})}\, d\mathbf{x}\, d\mathbf{y}\right) + \log\!\left(\iint \frac{q^{2}(\mathbf{x},\mathbf{y})}{q^{2}(\mathbf{x})}\, d\mathbf{x}\, d\mathbf{y}\right) - 2\log\!\left(\iint \frac{p(\mathbf{x},\mathbf{y})\, q(\mathbf{x},\mathbf{y})}{p(\mathbf{x})\, q(\mathbf{x})}\, d\mathbf{x}\, d\mathbf{y}\right). $$
The conditional CS divergence also admits an efficient empirical approximation based on KDEs, analogous to the unconditional case. This estimator enables practical evaluation of discrepancies between conditional distributions directly from samples, which is essential for interpretable analysis of multivariate distributional changes.
Specifically, let us consider two collections of paired observations $\{(\mathbf{x}_i^{p}, \mathbf{y}_i^{p})\}_{i=1}^{n}$ and $\{(\mathbf{x}_j^{q}, \mathbf{y}_j^{q})\}_{j=1}^{m}$, drawn i.i.d. from the joint distributions $p(\mathbf{x},\mathbf{y})$ and $q(\mathbf{x},\mathbf{y})$, respectively. Let $K^{p}$ and $L^{p}$ denote the Gram matrices associated with the variables $\mathbf{x}$ and $\mathbf{y}$ under distribution $p$, and similarly let $K^{q}$ and $L^{q}$ denote the corresponding Gram matrices under distribution $q$. Cross-distribution Gram matrices are defined as $K^{pq}$ and $L^{pq}$ (with $K^{qp} = (K^{pq})^{\top}$ and $L^{qp} = (L^{pq})^{\top}$), with entries computed using the same kernel functions.
Under this construction, the conditional CS divergence $D_{\mathrm{CS}}\big(p(\mathbf{y}\mid\mathbf{x});\,q(\mathbf{y}\mid\mathbf{x})\big)$ can be approximated empirically by
$$ \widehat{D}_{\mathrm{CS}} \;=\; \log\!\left(\sum_{j=1}^{n}\frac{\sum_{i=1}^{n} K^{p}_{ji} L^{p}_{ji}}{\big(\sum_{i=1}^{n} K^{p}_{ji}\big)^{2}}\right) + \log\!\left(\sum_{j=1}^{m}\frac{\sum_{i=1}^{m} K^{q}_{ji} L^{q}_{ji}}{\big(\sum_{i=1}^{m} K^{q}_{ji}\big)^{2}}\right) - \log\!\left(\sum_{j=1}^{n}\frac{\sum_{i=1}^{m} K^{pq}_{ji} L^{pq}_{ji}}{\big(\sum_{i=1}^{n} K^{p}_{ji}\big)\big(\sum_{i=1}^{m} K^{pq}_{ji}\big)}\right) - \log\!\left(\sum_{j=1}^{m}\frac{\sum_{i=1}^{n} K^{qp}_{ji} L^{qp}_{ji}}{\big(\sum_{i=1}^{m} K^{q}_{ji}\big)\big(\sum_{i=1}^{n} K^{qp}_{ji}\big)}\right). $$
This estimator avoids matrix inversion or explicit density ratio estimation. As a result, it scales favorably to high-dimensional settings and serves as a practical building block for conditional distribution-based change detection.
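A compact Python sketch of this Gram-matrix construction is given below; the variable names, the shared Gaussian kernel, and the fixed bandwidth are our own simplifications, and the code mirrors the four-term expression above rather than any particular reference implementation.

```python
import numpy as np

def gaussian_gram(A, B, sigma):
    """Gaussian kernel Gram matrix between the rows of A and the rows of B."""
    sq_dists = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-sq_dists / (2 * sigma**2))

def conditional_cs_divergence(Xp, Yp, Xq, Yq, sigma=1.0):
    """Empirical conditional CS divergence D_CS(p(y|x); q(y|x)) from paired samples.

    (Xp, Yp) are the n paired observations drawn from p(x, y); (Xq, Yq) are the
    m paired observations drawn from q(x, y). The estimate combines the
    within-distribution Gram matrices (Kp, Lp), (Kq, Lq) and the
    cross-distribution Gram matrices Kpq, Lpq.
    """
    Kp, Lp = gaussian_gram(Xp, Xp, sigma), gaussian_gram(Yp, Yp, sigma)
    Kq, Lq = gaussian_gram(Xq, Xq, sigma), gaussian_gram(Yq, Yq, sigma)
    Kpq, Lpq = gaussian_gram(Xp, Xq, sigma), gaussian_gram(Yp, Yq, sigma)

    # Self terms: row-normalised quadratic terms under p and under q.
    term_p = np.log(np.sum((Kp * Lp).sum(axis=1) / Kp.sum(axis=1) ** 2))
    term_q = np.log(np.sum((Kq * Lq).sum(axis=1) / Kq.sum(axis=1) ** 2))

    # Cross terms: mix samples from p and q through the cross Gram matrices.
    cross_pq = np.log(np.sum((Kpq * Lpq).sum(axis=1) / (Kp.sum(axis=1) * Kpq.sum(axis=1))))
    cross_qp = np.log(np.sum((Kpq.T * Lpq.T).sum(axis=1) / (Kq.sum(axis=1) * Kpq.T.sum(axis=1))))

    return term_p + term_q - cross_pq - cross_qp
```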
Although the CS divergence has been explored in reinforcement learning [13], domain adaptation [26], and shape matching [28], its conditional formulation has not, to our knowledge, been leveraged for interpretable change-point detection.