Spectral Ranking of Causal Influence in Complex Systems

Similar to natural complex systems, such as the Earth's climate or a living cell, semiconductor lithography systems are characterized by nonlinear dynamics across more than a dozen orders of magnitude in space and time. Thousands of sensors measure relevant process variables at appropriate sampling rates, providing time series as primary sources for system diagnostics. However, the high dimensionality, nonlinearity and non-stationarity of the data make it challenging to diagnose rare or new system issues efficiently yet accurately using model-based approaches alone. To reliably narrow down the causal search space, we validate a ranking algorithm that applies transfer entropy for bivariate interaction analysis of a system's multivariate time series to obtain a weighted directed graph, and graph eigenvector centrality to identify the system's most important sources of original information or causal influence. The results suggest that this approach robustly identifies the true drivers or causes of a complex system's deviant behavior, even when its reconstructed information transfer network includes redundant edges.

Semiconductor lithography systems are extremely complicated electromechanical systems, capable of subnanometer positioning and sub-milliKelvin temperature control, while generating extreme ultraviolet light from laser-pulsed tin plasma. It is notoriously difficult to fully understand the complex interactions among the thousands of observed variables affecting the output of such systems. Model-based approaches alone are inadequate to effectively diagnose rare or new system issues, as they inherently do not model abnormal behavior. To efficiently, yet reliably, reduce the search space of potential causes, we consider a system's causal network and the centrality of the network nodes. Therefore, we use Schreiber's transfer entropy [1], which quantifies predictive information transfer or influence between two time series, providing direction, strength and delay of both linear and nonlinear interactions. For time series with a Gaussian distribution, transfer entropy reduces to Granger causality [2]. However, being a bivariate measure, transfer entropy naturally disregards the multivariate nature of interactions in complex systems. An exhaustive multivariate decomposition into redundant, unique and synergistic information would become computationally intractable, due to combinatorial scaling with the number of involved variables. We mitigate this curse of dimensionality by combining transfer entropy, as a measure of a node's direct influence on its neighbors, with eigenvector centrality [3], as a measure of a node's global influence in the network, to identify the system's most important propagation sources of original information or influence.
In this paper, we introduce the measures of transfer entropy and eigenvector centrality and describe the two-step algorithm used. In a rolling window analysis of a simulated time series representing two bidirectionally coupled Lorenz systems, we compare its causal inference results to those of a constraint-based algorithm for multivariate causal analysis. The rolling window approach allows additional analysis of the time-evolving global influence of the Lorenz system state variables. Finally, we assess the algorithm in diagnosing a real-world industrial system issue, using high-dimensional time series.
Transfer entropy (TE) is an information-theoretic implementation of Wiener's notion of causality applied to time series [4], whereby the cause precedes, and contains unique information about, the effect. Consider two stationary ergodic Markov processes X and Y (or X^(i) and X^(j) as below) and their corresponding time series {x_1, x_2, ..., x_M} and {y_1, y_2, ..., y_M} of M samples. Transfer entropy quantifies the reduction in uncertainty about future states of a target process Y, when past states of a source process X are observed in addition to past states of Y itself. As an asymmetric measure based on transition probabilities, transfer entropy naturally incorporates directional and dynamic information, which may imply causation between X and Y:

TE_{X→Y}^{(k,l)}(t, τ) = Σ p(y_t, y_{t−1}^{(l)}, x_{t−τ}^{(k)}) log_b [ p(y_t | y_{t−1}^{(l)}, x_{t−τ}^{(k)}) / p(y_t | y_{t−1}^{(l)}) ],    (1)

where x_t and y_t represent states of X and Y at time t, while TE_{X→Y}^{(k,l)}(t, τ) indicates maximized information transfer from X to Y, computed across a range D = {0, 1, 2, ..., τ_max} of embedding delays τ, such that TE_{X→Y}^{(k,l)}(t) = max_{τ ∈ D} TE_{X→Y}^{(k,l)}(t, τ). The embedding dimensions k and l denote the number of past states of X and Y used to condition the probabilities of transition to the next state of Y (or X), represented by y_{t−1}^{(l)} = {y_{t−l}, ..., y_{t−2}, y_{t−1}} and, analogously, x_{t−τ}^{(k)}. The logarithm base b = 2 defines the informational unit of transfer entropy in bits, i.e. the reduction in average code length required to optimally encode the target variable (effect), given past states of the source variable (cause) and the target variable itself. Herein, we keep the embedding dimensions at the commonly used k = l = 1, mostly for computational reasons. For every pair (X^(i), X^(j)) from N multivariate time series X^(1), X^(2), ..., X^(N), we apply Eq.
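For k = l = 1 and base b = 2, Eq. (1) can be estimated from data with a simple histogram (plug-in) estimator. The sketch below is illustrative only; the function name, bin count and lag handling are our own choices, and a production analysis would use a bias-corrected estimator such as KSG instead.

```python
import numpy as np

def transfer_entropy(x, y, bins=8, delay=1):
    """Plug-in estimate of TE_{X->Y}(delay) in bits, with k = l = 1.

    Discretizes both series on a uniform grid, counts joint occurrences of
    (y_t, y_{t-1}, x_{t-delay}) and sums
    p * log2( p(y_t | y_{t-1}, x_{t-delay}) / p(y_t | y_{t-1}) ).
    """
    edges = lambda v: np.linspace(v.min(), v.max(), bins + 1)[1:-1]
    xd, yd = np.digitize(x, edges(x)), np.digitize(y, edges(y))
    s = max(delay, 1)
    yt, yp, xp = yd[s:], yd[s - 1:-1], xd[s - delay:len(xd) - delay]
    counts = np.zeros((bins, bins, bins))
    np.add.at(counts, (yt, yp, xp), 1.0)
    p3 = counts / counts.sum()           # p(y_t, y_{t-1}, x_{t-delay})
    p_yp_xp = p3.sum(axis=0)             # p(y_{t-1}, x_{t-delay})
    p_yt_yp = p3.sum(axis=2)             # p(y_t, y_{t-1})
    p_yp = p3.sum(axis=(0, 2))           # p(y_{t-1})
    num = p3 * p_yp[None, :, None]
    den = p_yt_yp[:, :, None] * p_yp_xp[None, :, :]
    m = p3 > 0
    return float(np.sum(p3[m] * np.log2(num[m] / den[m])))
```

With y a noisy lagged copy of x, the estimate should come out strongly asymmetric: TE(x → y) large, TE(y → x) near zero (up to the estimator's small positive bias).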
(1) to estimate information transfer TE, and thereby determine the candidate source time series X^(d) and target time series X^(r). We then generate 1/(α = 0.05) − 1 = 19 surrogates {X̃_1^(d), ..., X̃_19^(d)} which share their amplitude distribution and power spectrum with the original time series X^(d), using the iterative amplitude adjusted Fourier transform (iAAFT) proposed by Schreiber et al. [5]. Following this study, we estimate information transfer from each surrogate to the (original) target time series X^(r) and obtain {TE_1, ..., TE_19}. If TE > max{TE_1, ..., TE_19}, information transfer TE is considered significant, and non-significant otherwise. The resultant information transfer network is given by a directed weighted graph G = (V, E). Each edge e_ij connects a source node v_i to a target node v_j with strength w_ij, i.e. information transfer TE.
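A minimal iAAFT sketch, assuming the usual two-step iteration (impose the target power spectrum, then restore the exact amplitude distribution by rank ordering); the function name, seed handling and iteration count are our own choices:

```python
import numpy as np

def iaaft_surrogate(x, n_iter=100, seed=0):
    """One iAAFT surrogate of x: identical amplitude distribution,
    approximately identical power spectrum, randomized temporal order."""
    rng = np.random.default_rng(seed)
    target_amp = np.abs(np.fft.rfft(x))   # Fourier amplitudes to impose
    sorted_x = np.sort(x)                 # amplitude distribution to impose
    s = rng.permutation(x)                # start from a random shuffle
    for _ in range(n_iter):
        # Step 1: keep current phases, impose the target power spectrum.
        phases = np.angle(np.fft.rfft(s))
        s = np.fft.irfft(target_amp * np.exp(1j * phases), n=len(x))
        # Step 2: restore the exact amplitude distribution by rank ordering.
        s = sorted_x[s.argsort().argsort()]
    return s
```

For the significance test, 19 such surrogates (e.g. seeds 0..18) yield the α = 0.05 null distribution {TE_1, ..., TE_19} against which the original TE is compared.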
To diagnose performance issues or even failures within technological complex systems, we wish to locate where perturbations enter the system and propagate, causing downstream effects throughout the system. Therefore, we consider an information transfer network wherein the (main) sources of original information can be identified by measuring and ranking each node's global network influence, termed centrality or importance. Centrality, the basic principle of Google's search engine [6], has proven useful to measure and rank a page's relevance based on its inbound links. Here, we use out-degree eigenvector centrality, which defines the centrality c(v_i) of a node v_i ∈ V = {v_1, ..., v_N} as proportional to the summed centralities of its outbound neighbors:

c(v_i) = λ^{−1} Σ_{j=1}^{N} w_ij c(v_j),    (2)

where λ^{−1} is a proportionality factor and c is the eigenvector of centralities associated with eigenvalue λ of adjacency matrix W, whose entries w_ij denote information transfer from node v_i to node v_j in graph G. If W is nonnegative and irreducible, the Perron-Frobenius theorem [7,8] ensures there is a unique vector c_1 of N centralities c_1(v_i) > 0, ∀ v_i, associated with the largest positive eigenvalue λ_1 = ρ(W), the spectral radius of W, satisfying Eq. (2). Usually c_1 is normalized, such that each entry indicates the centrality or importance of a node v_i in graph G on a relative scale from 0 to 1. Alternatively, matrix W is normalized to a transition matrix P, whose entries p_ij denote probabilities of transition from node v_i to node v_j in a random walk on graph G, or Markov chain, with Σ_{j=1}^{N} p_ij = 1, such that:

p_ij = w_ij / Σ_{j=1}^{N} w_ij.    (3)

Following Google's PageRank approach [9], we modify matrix P by adding a teleportation probability γ ∈ (0, 1) and an all-ones matrix J to obtain an N × N, irreducible, positive matrix P_γ:

P_γ = γ P + ((1 − γ)/N) J,    (4)

where a random walker at node v_i follows an edge with probability γ or jumps to any other node in the N-node network with probability 1 − γ. Herein, we use PageRank's typical value of γ = 0.85. Considering the
Markov chain associated with matrix P_γ, the Perron-Frobenius theorem ensures a unique stationary probability distribution that matches the eigenvector π_1 of P_γ associated with eigenvalue λ_1(P_γ) = 1, such that π_1 = P_γ π_1. Eigenvector π_1 is usually computed via power iteration [10] in k steps:

π^(k) = P_γ^k π^(0),

where π^(0) denotes an initial distribution. Empirical studies by Kalavri et al. [11] revealed that PageRank (or eigenvector centrality) yields similar ranking results when computed with, or without, a graph's (first-order) semi-metric edges, which we consider sufficient to robustly and reproducibly identify a system's key propagation sources of influence.
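The power iteration for π_1 can be sketched in a few lines. Since out-degree centrality rewards nodes with strong outbound transfer, ranking propagation sources corresponds to a teleporting random walk on the graph with all edges reversed (a reverse-PageRank reading, which is our assumption here); the function and variable names are our own.

```python
import numpy as np

def source_rank(W, gamma=0.85, tol=1e-12):
    """Rank propagation sources of a weighted digraph W, where W[i, j] is
    the information transfer from node i to node j, via the stationary
    distribution of a teleporting random walk on the reversed graph."""
    W = np.asarray(W, dtype=float)
    n = W.shape[0]
    R = W.T                                   # reverse all edges
    rowsum = R.sum(axis=1, keepdims=True)
    # Row-normalize to a transition matrix; dangling rows become uniform.
    P = np.divide(R, rowsum, out=np.full_like(R, 1.0 / n), where=rowsum > 0)
    Pg = gamma * P + (1.0 - gamma) / n        # teleportation-adjusted P_gamma
    pi = np.full(n, 1.0 / n)                  # initial distribution pi^(0)
    while True:
        new = pi @ Pg                         # one power-iteration step
        if np.abs(new - pi).sum() < tol:
            return new / new.sum()
        pi = new
```

On a three-node chain 0 → 1 → 2 (with an extra edge 0 → 2), the pure source, node 0, receives the highest stationary mass.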
In what follows, we assess the method's accuracy in causal inference and node ranking using simulated time series, followed by root cause analysis of real-world diagnostic data. Firstly, we consider a system S(L1 ↔ L2) of two bidirectionally coupled Lorenz systems L1 and L2, investigated by Wibral et al. [12] and given by Eqs. (5.1)-(5.6), in which the bidirectional coupling Y1 ↔ Y2 is governed by time-delayed quadratic terms. We used Pydelay [13] to generate a multivariate time series {X1, X2, Y1, Y2, Z1, Z2} of 150 K samples by numerically integrating Eqs. (5.1)-(5.6) with step size dt = 0.01 and initial conditions X1(0) = X2(0) = 1.0, Y1(0) = Y2(0) = 0.97 and Z1(0) = Z2(0) = 0.99. We use the FaultMap algorithm [14] which, to our knowledge, is the only publicly available implementation of the described approach. To assess its accuracy in causal network reconstruction against other model-free approaches, we use PCMCI (v4.0) [15] as a distinct, constraint-based, multivariate alternative. FaultMap estimates edge weight w_ij as ΔTE(τ) = TE_{X^(i)→X^(j)}(τ) − TE_{X^(j)→X^(i)}(τ), using the Kraskov-Stögbauer-Grassberger estimator for transfer entropy in the Java Information Dynamics Toolkit [16]. PCMCI performs condition selection at an optimized significance level α using Akaike's information criterion, followed by a conditional independence test, for which we use the linear ParCorr test (at α = 0.05) instead of the computationally expensive nonlinear CMI test. Following the recommended sample size of Bauer [17], we use a rolling analysis window W_s of 2 K samples to define 50 adjacent time series slices for causal network reconstruction of the coupled Lorenz systems by both algorithms, and node importance ranking by FaultMap only. Figure 1 depicts the rolling window analysis approach, the butterfly-shaped attractors of the coupled Lorenz systems in phase space (x, y, z), and cause-effect detection count heatmaps for both algorithms. FaultMap detected the bidirectional coupling Y1 ↔ Y2 at a detection rate of 89 % vs 85 % by PCMCI, while the latter
achieved a notably higher overall detection rate (95 %) of causal links within the Lorenz subsystems than FaultMap (83 %). This difference is mainly related to PCMCI's apparently higher sensitivity to autocorrelative effects encoded in the time series. PCMCI also detected 23 % more transitive indirect links than FaultMap, although we would expect only direct relations from its multivariate causal analysis.
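For reference, a delay-coupled Lorenz pair of the kind analyzed here can be generated with a plain Euler scheme and a history buffer standing in for Pydelay. The exact form of Eqs. (5.1)-(5.6) is assumed below: standard Lorenz parameters (σ = 10, ρ = 28, β = 8/3) with a quadratic delay coupling added to each Y-equation, using the coupling strengths (0.1, 0.05) and delays (3 and 5 time units) quoted in the text.

```python
import numpy as np

def coupled_lorenz(n_steps=2000, dt=0.01, k1=0.1, k2=0.05, tau1=3.0, tau2=5.0):
    """Euler integration of two Lorenz systems with assumed delayed quadratic
    couplings k1 * Y2(t - tau1)^2 -> dY1/dt and k2 * Y1(t - tau2)^2 -> dY2/dt.
    Returns an (n_steps, 6) array with columns [X1, X2, Y1, Y2, Z1, Z2]."""
    sigma, rho, beta = 10.0, 28.0, 8.0 / 3.0
    d1, d2 = int(tau1 / dt), int(tau2 / dt)      # coupling delays in steps
    out = np.zeros((n_steps, 6))
    out[0] = [1.0, 1.0, 0.97, 0.97, 0.99, 0.99]  # initial conditions from text
    for t in range(1, n_steps):
        x1, x2, y1, y2, z1, z2 = out[t - 1]
        # Constant-history approximation before the delay buffer fills.
        y2d = out[t - 1 - d1, 3] if t - 1 >= d1 else out[0, 3]  # Y2(t - tau1)
        y1d = out[t - 1 - d2, 2] if t - 1 >= d2 else out[0, 2]  # Y1(t - tau2)
        out[t, 0] = x1 + dt * sigma * (y1 - x1)
        out[t, 2] = y1 + dt * (x1 * (rho - z1) - y1 + k1 * y2d ** 2)
        out[t, 4] = z1 + dt * (x1 * y1 - beta * z1)
        out[t, 1] = x2 + dt * sigma * (y2 - x2)
        out[t, 3] = y2 + dt * (x2 * (rho - z2) - y2 + k2 * y1d ** 2)
        out[t, 5] = z2 + dt * (x2 * y2 - beta * z2)
    return out
```

A fixed-step Euler sketch like this is only a stand-in for a proper delay-differential solver; it serves to illustrate how the delayed quadratic coupling enters each Y-equation.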
Due to the transitivity of bivariate information transfer, most networks inferred by FaultMap include direct and indirect connections, all of which passed strict significance tests, as exemplified in Figure 2. The coupled Lorenz systems are easily discernible as two subnetworks (X1, Y1, Z1) and (X2, Y2, Z2) that exhibit distinct levels of information transfer with similar time delays in the range of milliseconds. At a rate of at least 98 %, FaultMap detected the same edges X → Y, Y → X, X → Z and Y → Z indicating (normalized) net influence within both Lorenz subsystems, as Gencaga et al. [18] found for a single Lorenz system. PCMCI shows similar detection results for these edges (see heatmaps). Of note, the rolling window approach took PCMCI 495 min of runtime on a 24-core HPC node with 192 GB of memory for causal analysis, vs 58 min by FaultMap for causal analysis and node ranking.
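The full two-step, rolling-window procedure can be sketched end to end. Here a directed lagged-correlation score stands in for the KSG transfer-entropy estimator (an assumption made for brevity), edge weights follow the ΔTE difference with negative weights clipped to zero, and source ranking uses a teleporting random walk on the reversed graph; all names are our own.

```python
import numpy as np

def lagged_score(a, b, max_lag=3):
    """Directed association score a -> b: best |corr(a[t - lag], b[t])|.
    A simple stand-in for the transfer-entropy estimate TE_{a->b}(tau)."""
    return max(abs(np.corrcoef(a[:-lag], b[lag:])[0, 1])
               for lag in range(1, max_lag + 1))

def rank_sources_rolling(X, window, gamma=0.85):
    """For each adjacent slice of the (samples x variables) array X, build a
    Delta-TE-style weighted digraph and rank nodes by reverse-PageRank."""
    n_vars = X.shape[1]
    rankings = []
    for start in range(0, X.shape[0] - window + 1, window):
        S = X[start:start + window]
        W = np.zeros((n_vars, n_vars))
        for i in range(n_vars):
            for j in range(n_vars):
                if i != j:
                    W[i, j] = lagged_score(S[:, i], S[:, j])
        W = np.clip(W - W.T, 0.0, None)        # net influence, Delta-TE style
        R = W.T                                 # reversed edges -> walk to sources
        rowsum = R.sum(axis=1, keepdims=True)
        P = np.divide(R, rowsum, out=np.full_like(R, 1.0 / n_vars),
                      where=rowsum > 0)
        Pg = gamma * P + (1.0 - gamma) / n_vars # teleportation-adjusted chain
        pi = np.full(n_vars, 1.0 / n_vars)
        for _ in range(200):                    # power iteration
            pi = pi @ Pg
        rankings.append(pi / pi.sum())
    return np.array(rankings)
```

On a synthetic chain x → y → z, the upstream driver x should top the ranking in every window.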
In Figures 3a-b, we focus on the reconstruction of time delays in coupling Y1 ↔ Y2, given the modeled delays in Eqs. (5.2) and (5.5). Figure 3a reveals the dynamic nature of bidirectional information transfer in terms of strength and delay. The median difference (≈ 0.05) of the information transfer distributions in Figure 3b suggests that Lorenz system L2 predominantly drives L1, which is reasonable to expect since the coupling strength (0.1) in Eq. (5.2) is twice the coupling strength (0.05) in Eq. (5.5). Given coupling delays of 3 and 5 sec, the distribution of reconstructed interaction delays seems realistic but may be impacted by lag synchronisation of the Lorenz systems, as suggested by Coufal et al. [19]. Figure 3c shows highly dynamic global influence of particularly the X- and Y-variables, while the Y-variables remain the driving force within their respective Lorenz subsystems at all times. Grey-colored bars highlight time windows in which L1 is identified as the driving and L2 as the driven subsystem. To our knowledge, this is the first rolling window analysis to date capturing the dynamics of time-varying information transfer or global influence (importance) of state variables within interacting Lorenz systems. Our findings may enable automated identification of monitoring observables for performance (anomaly) diagnostics or predictive maintenance within technological complex systems. The ability to capture the time-varying importance of a complex system's state variables is also relevant in time series analysis of natural complex systems, including the Earth's climate. Nanolithography systems are among the most complex technological systems today, capable of sub-nanometer positioning and sub-milliKelvin temperature control, even as system modules accelerate at up to 15 g. Such systems are particularly challenging for model-based diagnosis of rare or new issues, due to nonlinear interactions across multiple time and spatial scales. To assess FaultMap's potential in the diagnosis of issues within such systems,
we investigate a temperature, flow and pressure instability within an ASML subsystem. Therefore, we use a multivariate time series of 315 binary samples from 366 parameters related to the problem. As shown in Figure 4, FaultMap identified parameter P0 as the primary source of original information, i.e. the most probable cause leading to event P27, through a network of collateral effects {P1, ..., P26}. The indicated root cause is confirmed to be correct by a series of automatically logged system events as well as service actions. This preliminary result is a promising step in the automation of reliable data-driven diagnostics for technological complex systems. To fully understand a complex system's (deviant) behaviour, it is essential to identify its main propagation sources of influence affecting downstream elements throughout the system. We empirically show that spectral centrality analysis of the related information transfer network allows us to robustly and reproducibly identify a complex system's main sources of original information or influence. Our algorithm of choice compares favorably against a sufficiently distinct alternative algorithm in the causal analysis of two nonlinearly coupled Lorenz systems. It also proves to be accurate, robust and efficient, identifying the (alternately) driving and driven Lorenz subsystems as well as the driving force within either subsystem. Finally, the algorithm correctly locates the original disturbance within a technological complex system. We conclude that the inherent robustness of spectral centrality ranking to graph semi-metricity allows us to reliably and efficiently identify a complex system's most important propagation sources of influence, without the need to differentiate between direct and indirect links. This finding is relevant for any time series analysis of natural and artificial complex systems.
Figure 2 exemplifies FaultMap's rolling window analysis results, of which Figure 3 specifically reports information transfer (delay) via coupling Y1 ↔ Y2 and the global influence of the X-, Y- and Z-variables on the coupled Lorenz systems. The heatmaps in Figure 1b show the total count of each cause-effect relation detected by FaultMap or PCMCI over 50 adjacent time series slices.
Figure 1. (a) Time series of system S(L1 ↔ L2) composed of bidirectionally delay-coupled Lorenz systems L1 and L2, generated from Eqs. (5.1)-(5.6). (b) Heatmaps of detection count per cause-effect relation, for FaultMap (left) and PCMCI (right). Direct cause-effect relations are denoted by (*).

Since the coupling strengths are constant, Y1 → Y2 is likely to dominate Y2 → Y1 in strength when the Y2 state reaches vanishingly low values relative to the Y1 state. The median difference across all node importance distributions in Figure 3d reflects the aforementioned drive-response relations in, and between, the interacting Lorenz systems. It might explain FaultMap's 100 % detection rate of Y-variable self-loops vs the lower rates of all other self-loops. The outcome of both Y-variables as Lorenz system key drivers complies with the Lorenz model of Rayleigh-Bénard convection [20], where the temperature difference drives convective heat transfer in addition to conduction at Rayleigh number Ra ≥ 25 (see Eqs. (5.2) and (5.5)).

Figure 2. Information transfer network of two bidirectionally delay-coupled Lorenz systems. Edges indicate direct (−→) or transitive indirect information transfer. Edge annotations denote information transfer delay (sec). Node importance indicates a node's global network influence. Edge weight represents the level of information transfer (w_ij).

Figure 3. Dynamic information transfer via bidirectional delay-coupling Y1 ↔ Y2 of Lorenz systems L1 and L2, and dynamic importance of the Lorenz system state variables.

Figure 4. Top-ranked node P0 (root cause) transfers original information towards event P27 via a network of collateral effects {P1, ..., P26} within an ASML subsystem. The legend of Figure 2 applies here.