2.1. Time Series Change Point Detection
Change point detection (CPD) aims to identify time instants at which the statistical properties of a time series undergo significant changes. It is a classical problem in statistics and machine learning, with applications ranging from process monitoring and anomaly detection to finance, biomedical signal analysis, and large-scale sensor networks. Depending on whether data are processed retrospectively or sequentially, CPD methods can be broadly categorized into offline and online settings. We refer interested readers to the comprehensive survey by Truong et al. [3].
From a distributional perspective, many CPD methods formulate the problem as detecting discrepancies between probability distributions across two temporal segments. Given a multivariate time series $\{\mathbf{x}_t\}_{t=1}^{T}$ with $\mathbf{x}_t \in \mathbb{R}^d$, a change point at time $t$ is declared if the divergence between the distributions of observations from two adjacent windows exceeds a predefined threshold:
$$ D\big(P_{[t-W+1,\,t]},\; P_{[t+1,\,t+W]}\big) > \tau, $$
where $W$ denotes the window size, $\tau$ is the detection threshold, and $D(\cdot,\cdot)$ is a suitable measure of distributional discrepancy. This formulation is flexible and model-agnostic, and has become a dominant paradigm for nonparametric change detection in high-dimensional time series.
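To make this window-based formulation concrete, the following minimal Python sketch scores every admissible time index by comparing two adjacent windows; the function names and the generic divergence callable are illustrative placeholders rather than part of any specific method cited in this section.

```python
import numpy as np

def sliding_window_scores(X, W, divergence):
    """Score every admissible time index by comparing two adjacent windows.

    X          : (T, d) array holding the multivariate time series.
    W          : window size (number of samples per segment).
    divergence : callable mapping two (W, d) arrays to a scalar discrepancy.
    """
    T = X.shape[0]
    scores = np.full(T, np.nan)
    for t in range(W, T - W + 1):
        past = X[t - W:t]        # observations immediately before time t
        future = X[t:t + W]      # observations starting at time t
        scores[t] = divergence(past, future)
    return scores

def detect_change_points(scores, threshold):
    """Declare a change point wherever the score exceeds the threshold."""
    return np.where(scores > threshold)[0]
```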
Early distributional approaches estimate divergences directly via nonparametric density estimation, such as kernel density estimators (KDEs) [14]. However, KDE-based divergence estimation becomes unreliable in high-dimensional settings due to the curse of dimensionality. To address this issue, density-ratio estimation methods have been proposed, which estimate the ratio of two densities without explicitly estimating the individual densities. Representative examples include the Kullback–Leibler importance estimation procedure (KLIEP) [15] and relative unconstrained least-squares importance fitting (RuLSIF) [6]. These methods are computationally efficient and have been widely adopted in practical change detection systems, particularly in online and streaming scenarios.
Another prominent line of work formulates CPD as a two-sample hypothesis testing problem using kernel methods. The maximum mean discrepancy (MMD) [16] compares distributions via their mean embeddings in a reproducing kernel Hilbert space, thereby avoiding explicit density estimation. MMD-based detectors are flexible, nonparametric, and applicable to a wide range of data types, and have been successfully applied to change detection in time series [7,8]. More recently, kernel change-point detection has been integrated with deep models to enhance detection power in complex and highly structured data distributions [9,17].
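As a rough illustration of the kernel two-sample idea, a biased (V-statistic) estimate of the squared MMD with a Gaussian kernel can be written in a few lines of Python; the helper names and the fixed bandwidth are our own choices and do not correspond to any particular detector cited above.

```python
import numpy as np

def gaussian_gram(A, B, sigma):
    """Gaussian kernel Gram matrix between the rows of A and the rows of B."""
    sq_dists = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-sq_dists / (2 * sigma**2))

def mmd2_biased(X, Y, sigma=1.0):
    """Biased (V-statistic) estimate of the squared MMD between samples X ~ p and Y ~ q."""
    Kxx = gaussian_gram(X, X, sigma)
    Kyy = gaussian_gram(Y, Y, sigma)
    Kxy = gaussian_gram(X, Y, sigma)
    return Kxx.mean() + Kyy.mean() - 2.0 * Kxy.mean()
```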
Recent efforts have begun to address the interpretability limitations of conventional CPD methods by explicitly localizing which features or dependencies are responsible for detected distributional shifts. A representative line of work formulates interpretability in terms of conditional distribution changes. Kulinski et al. [10] formalize feature shift detection via conditional distribution tests, framing interpretability as identifying the dimensions whose conditional distributions change over time. Nie and Nicolae [18] study the detection and localization of conditional distribution shifts from a hypothesis testing perspective. Related perspectives on conditional dependence and invariance have also been explored in the context of causal discovery [19,20].
In parallel, several approaches rely on classifier-based two-sample tests to detect distributional changes [21,22]. While often powerful in high-dimensional settings, such methods typically require post hoc attribution techniques to infer which features drive the detected shift, and their explanations may depend strongly on model inductive biases. Another complementary line of research characterizes changes through variations in dependency or graph structures, such as mutual information matrices or dynamic networks [23,24].
2.2. The Cauchy–Schwarz Divergence and Its Conditional Extension
Before introducing our change detection framework, we briefly review the CS divergence and its conditional extension, which provide fundamental information-theoretic tools for measuring distributional discrepancies and serve as building blocks of the proposed method.
A line of work initiated by Principe and colleagues in the early 2000s proposed divergence measures constructed from classical inequalities in functional analysis [11,12]. A representative example is the CS divergence, which is rooted in the Cauchy–Schwarz inequality applied to probability density functions. Given two densities $p(\mathbf{x})$ and $q(\mathbf{x})$ defined on a common domain, their $L^2$ inner product satisfies
$$ \left(\int p(\mathbf{x})\, q(\mathbf{x})\, d\mathbf{x}\right)^{2} \;\le\; \int p^{2}(\mathbf{x})\, d\mathbf{x} \int q^{2}(\mathbf{x})\, d\mathbf{x}, $$
where equality holds if and only if $p$ and $q$ are linearly dependent in $L^2$. This inequality motivates measuring distributional discrepancy by comparing the cross-information term $\int p(\mathbf{x})\, q(\mathbf{x})\, d\mathbf{x}$ against the self-information terms $\int p^{2}(\mathbf{x})\, d\mathbf{x}$ and $\int q^{2}(\mathbf{x})\, d\mathbf{x}$. Accordingly, the CS divergence is defined as
$$ D_{\mathrm{CS}}(p;q) \;=\; -\log \frac{\left(\int p(\mathbf{x})\, q(\mathbf{x})\, d\mathbf{x}\right)^{2}}{\int p^{2}(\mathbf{x})\, d\mathbf{x}\, \int q^{2}(\mathbf{x})\, d\mathbf{x}}. \tag{3} $$
An equivalent view is that $D_{\mathrm{CS}}(p;q)$ is the negative logarithm of a normalized squared correlation (a cosine similarity) between $p$ and $q$ in the Hilbert space $L^2$, and thus it decomposes naturally into quadratic self terms and a cross term involving $p$ and $q$.
The CS divergence enjoys several properties that make it attractive for statistical signal processing and change detection. For certain distribution families, such as mixtures of Gaussians, the integrals in Equation (3) admit closed-form expressions [25]. More generally, it can be estimated efficiently using nonparametric density estimators. In contrast to the Kullback–Leibler (KL) divergence, the formulation in Equation (3) avoids explicit density-ratio estimation and logarithms of densities, which often improves numerical stability in practice [26].
Given i.i.d. samples $\{\mathbf{x}_i\}_{i=1}^{n}$ and $\{\mathbf{y}_j\}_{j=1}^{m}$ drawn from $p$ and $q$, respectively, the integrals in the CS divergence can be approximated using kernel density estimation (KDE). Employing a Gaussian kernel $\kappa_\sigma(\cdot,\cdot)$ with kernel width $\sigma$, the CS divergence admits an empirical estimator of the form:
$$ \widehat{D}_{\mathrm{CS}}(p;q) \;=\; \log\!\left(\frac{1}{n^{2}}\sum_{i=1}^{n}\sum_{j=1}^{n}\kappa_\sigma(\mathbf{x}_i,\mathbf{x}_j)\right) + \log\!\left(\frac{1}{m^{2}}\sum_{i=1}^{m}\sum_{j=1}^{m}\kappa_\sigma(\mathbf{y}_i,\mathbf{y}_j)\right) - 2\log\!\left(\frac{1}{nm}\sum_{i=1}^{n}\sum_{j=1}^{m}\kappa_\sigma(\mathbf{x}_i,\mathbf{y}_j)\right). $$
Importantly, the estimator is nonparametric, meaning that it does not impose any predefined parametric assumptions (e.g., Gaussianity) on the underlying data distributions. This property is particularly relevant in our setting. In multivariate sensor monitoring and industrial process analysis, the underlying data-generating mechanisms are often nonlinear, partially unknown, and difficult to model accurately using parametric distributions (e.g., Gaussian or linear models). A nonparametric formulation enables direct comparison of empirical window-based distributions without risking model misspecification, thereby improving robustness in complex real-world systems.
Note that the consistency of the underlying KDE-based estimator follows from standard KDE theory (see Appendix B.4 of [27] for a detailed proof), and therefore the CS divergence estimator inherits these guarantees.
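For illustration, a direct plug-in implementation of the empirical estimator above might look as follows in Python; the helper names and the fixed, shared kernel width are simplifying assumptions made only for this sketch.

```python
import numpy as np

def gaussian_gram(A, B, sigma):
    """Kernel matrix kappa_sigma(a_i, b_j) between the rows of A and the rows of B."""
    sq_dists = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-sq_dists / (2 * sigma**2))

def cs_divergence(X, Y, sigma=1.0):
    """Plug-in estimate of the CS divergence between samples X ~ p and Y ~ q.

    Computes log(self term of p) + log(self term of q) - 2*log(cross term),
    with each integral approximated by an average of pairwise kernel evaluations.
    """
    self_p = gaussian_gram(X, X, sigma).mean()   # approximates the integral of p^2
    self_q = gaussian_gram(Y, Y, sigma).mean()   # approximates the integral of q^2
    cross = gaussian_gram(X, Y, sigma).mean()    # approximates the integral of p*q
    return np.log(self_p) + np.log(self_q) - 2.0 * np.log(cross)
```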
The CS divergence can be naturally extended to conditional distributions. Let $\mathbf{x}$ and $\mathbf{y}$ denote two sets of random variables, and consider two conditional distributions $p(\mathbf{y}\mid\mathbf{x})$ and $q(\mathbf{y}\mid\mathbf{x})$. The conditional CS divergence is defined as [13]:
$$ D_{\mathrm{CS}}\big(p(\mathbf{y}\mid\mathbf{x});\,q(\mathbf{y}\mid\mathbf{x})\big) \;=\; \log\!\left(\iint \frac{p^{2}(\mathbf{x},\mathbf{y})}{p^{2}(\mathbf{x})}\, d\mathbf{x}\, d\mathbf{y}\right) + \log\!\left(\iint \frac{q^{2}(\mathbf{x},\mathbf{y})}{q^{2}(\mathbf{x})}\, d\mathbf{x}\, d\mathbf{y}\right) - 2\log\!\left(\iint \frac{p(\mathbf{x},\mathbf{y})\, q(\mathbf{x},\mathbf{y})}{p(\mathbf{x})\, q(\mathbf{x})}\, d\mathbf{x}\, d\mathbf{y}\right). $$
The conditional CS divergence also admits an efficient empirical approximation based on KDEs, analogous to the unconditional case. This estimator enables practical evaluation of discrepancies between conditional distributions directly from samples, which is essential for interpretable analysis of multivariate distributional changes.
Specifically, let us consider two collections of paired observations $\{(\mathbf{x}_i^{p}, \mathbf{y}_i^{p})\}_{i=1}^{n}$ and $\{(\mathbf{x}_j^{q}, \mathbf{y}_j^{q})\}_{j=1}^{m}$, drawn i.i.d. from the joint distributions $p(\mathbf{x},\mathbf{y})$ and $q(\mathbf{x},\mathbf{y})$, respectively. Let $K^{p}$ and $L^{p}$ denote the Gram matrices associated with the variables $\mathbf{x}$ and $\mathbf{y}$ under distribution $p$, and similarly let $K^{q}$ and $L^{q}$ denote the corresponding Gram matrices under distribution $q$. Cross-distribution Gram matrices are defined as $K^{pq}$ and $L^{pq}$ (with $K^{qp} = (K^{pq})^{\top}$ and $L^{qp} = (L^{pq})^{\top}$), with entries computed using the same kernel functions.
Under this construction, the conditional CS divergence $D_{\mathrm{CS}}\big(p(\mathbf{y}\mid\mathbf{x});\,q(\mathbf{y}\mid\mathbf{x})\big)$ can be approximated empirically by
$$ \widehat{D}_{\mathrm{CS}} \;=\; \log\!\left(\sum_{j=1}^{n}\frac{\sum_{i=1}^{n} K^{p}_{ji} L^{p}_{ji}}{\big(\sum_{i=1}^{n} K^{p}_{ji}\big)^{2}}\right) + \log\!\left(\sum_{j=1}^{m}\frac{\sum_{i=1}^{m} K^{q}_{ji} L^{q}_{ji}}{\big(\sum_{i=1}^{m} K^{q}_{ji}\big)^{2}}\right) - \log\!\left(\sum_{j=1}^{n}\frac{\sum_{i=1}^{m} K^{pq}_{ji} L^{pq}_{ji}}{\big(\sum_{i=1}^{n} K^{p}_{ji}\big)\big(\sum_{i=1}^{m} K^{pq}_{ji}\big)}\right) - \log\!\left(\sum_{j=1}^{m}\frac{\sum_{i=1}^{n} K^{qp}_{ji} L^{qp}_{ji}}{\big(\sum_{i=1}^{m} K^{q}_{ji}\big)\big(\sum_{i=1}^{n} K^{qp}_{ji}\big)}\right). $$
This estimator avoids matrix inversion or explicit density ratio estimation. As a result, it scales favorably to high-dimensional settings and serves as a practical building block for conditional distribution-based change detection.
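A compact Python sketch of this Gram-matrix construction is given below; the variable names, the shared Gaussian kernel, and the fixed bandwidth are our own simplifications, and the code mirrors the four-term expression above rather than any particular reference implementation.

```python
import numpy as np

def gaussian_gram(A, B, sigma):
    """Gaussian kernel Gram matrix between the rows of A and the rows of B."""
    sq_dists = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-sq_dists / (2 * sigma**2))

def conditional_cs_divergence(Xp, Yp, Xq, Yq, sigma=1.0):
    """Empirical conditional CS divergence D_CS(p(y|x); q(y|x)) from paired samples.

    (Xp, Yp) are the n paired observations drawn from p(x, y); (Xq, Yq) are the
    m paired observations drawn from q(x, y). The estimate combines the
    within-distribution Gram matrices (Kp, Lp), (Kq, Lq) and the
    cross-distribution Gram matrices Kpq, Lpq.
    """
    Kp, Lp = gaussian_gram(Xp, Xp, sigma), gaussian_gram(Yp, Yp, sigma)
    Kq, Lq = gaussian_gram(Xq, Xq, sigma), gaussian_gram(Yq, Yq, sigma)
    Kpq, Lpq = gaussian_gram(Xp, Xq, sigma), gaussian_gram(Yp, Yq, sigma)

    # Self terms: row-normalised quadratic terms under p and under q.
    term_p = np.log(np.sum((Kp * Lp).sum(axis=1) / Kp.sum(axis=1) ** 2))
    term_q = np.log(np.sum((Kq * Lq).sum(axis=1) / Kq.sum(axis=1) ** 2))

    # Cross terms: mix samples from p and q through the cross Gram matrices.
    cross_pq = np.log(np.sum((Kpq * Lpq).sum(axis=1) / (Kp.sum(axis=1) * Kpq.sum(axis=1))))
    cross_qp = np.log(np.sum((Kpq.T * Lpq.T).sum(axis=1) / (Kq.sum(axis=1) * Kpq.T.sum(axis=1))))

    return term_p + term_q - cross_pq - cross_qp
```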
Although the CS divergence has been explored in reinforcement learning [13], domain adaptation [26], and shape matching [28], its conditional formulation has not, to our knowledge, been leveraged for interpretable change-point detection.