An Improved Indicator for Causal Interaction in Non-Linear Systems †

: Utilizing an extension of Pearson’s correlation in the case of random vectors, we improve the empirical dynamic modeling causal analysis of non-linear systems. To prove the effectiveness of the use of such an extension we analyze two real-world examples, the paramecium-didinium protozoan system and the inﬂuence of environmental variables on mosquito abundance in northern Greece. In both examples it is shown that the causal analysis based on the extended metric outperforms the usual method of measuring the correlation between observed and predicted values of a single vector component.


Introduction
Empirical dynamic modeling (EDM) is a useful technique for the study of causal interactions of non-linear systems. The basic idea involves creating time-lagged histories of an observed time series from which (due to a theorem by Takens [1]) it is possible to construct an embedding of the attractor manifold that describes the evolution of the underlying dynamical system. Cross-mapping the time-lagged manifolds of two time series, belonging to the same dynamical system, forwards and backwards in time one can infer whether a causal relationship exists between the two [2][3][4]. This method has often been employed in the study of causal interactions of various systems [5][6][7][8].
Although the reconstructed points are vectors in an E dimensional state space, the quality of cross mapping in the literature is determined by the fidelity between observed and predicted values of a single component. Since two embeddings of the same manifold are, by definition, homeomorphic to each other and share the same topology, it makes more sense to enquire to what degree the full vector of observations (consisting of the current value of the timeseries along with its E − 1 previous values) is correlated with its prediction under cross mapping. To achieve this, it is necessary to extend the notion of correlation in the case of vectors [9,10].
As an example, we apply the techniques to the causal analysis of a protozoan predatorprey system and on the study of environmental effects on mosquito abundance. In the first case it is shown that the vector correlation is more robust and results in a consistent causal network under changes of the embedding dimension, while in the second case it is able of detecting more causal factors compared to its scalar counterpart.

Methods
EDM is concerned with the reconstruction of the state space of a system from observations of a single time series X. As was already mentioned, Takens' theorem [1] ensures that for an appropriate embedding dimension E and time lag τ, the manifold of time-lagged vectors is an embedding of the attractor of the system. For a fixed τ, the best choice for the value of the embedding dimension can be determined by requiring the correlation between observed and predicted values (first component of Equation (1)) of the forward trajectory of the manifold, one time step into the future, to be maximized. Predictions are made by use of Simplex projection [11]. This is a k-nearest neighbors regression algorithm which uses information about the forward trajectory of dynamics similar to that of the predictee. For any point X(t) in the reconstructed state space, the prediction T time steps into the future is given by a weighted average of its E + 1 closest neighbors, forward in timeX with exponential weights equal to where d i is the Euclidean distance of X(t) from its ith closest neighbor X(t i ) and d is the mean distance. Causality between two time series X and Y can be deduced by cross mapping their lagged embeddings. Once again one employs Simplex projection to predict points of one manifold, e.g., Y(t), but in this case the appropriate weighting is determined by the structure of the other manifold (the distances of X(t) from its closest neighbors as before, note that in this case T can take both positive as well as negative values). The reasoning behind this thinking is that if the two variables are part of the same dynamical system, then their embeddings are homeomorphic to each other, so in principle it should be possible to use information about the neighborhoods of one variable to make predictions about the other. Since both the observed and predicted points are part of an E + 1 dimensional manifold, the quality of predictions in this case is given by the vector correlation [10] ρ Y,Ŷ = tr ∑ YŶ where ∑ AB denotes the cross-correlation matrix of random vectors A and B. If the maximum value of the vector correlation for T ≥ 0 is greater than the one for T ≤ 0 then past information about the structure of the manifold of X can be used to predict Y forward in time. This suggests that X is a causal connected to Y. Conversely it the future topological structure of X results in a higher vector correlation for the predictions of Y, i.e., the maximum cross mapping for predicting Y measured with the respect to the vector correlation is better for T ≤ 0 than for T ≥ 0, then the reverse is true and suggests that Y is causing X.

Results
We will now analyze two real-world examples and compare their causal structures when both the scalar and vector correlation metrics are employed.

The Paramecium-Didinium Protozoan System
The paramecium-didinium protozoan system by Veilleux [12] consists of two single celled organisms acting as predator (Didinium nasutum) and prey (Paramecium aurelia). Every twelve hours cultivations of the bacteria in petri dishes were sampled non-destructively and their counts recorded. The evolution of the system depends on the concentration of Cenophyl medium (CC) on which the initial population of paramecium was grown [12,13]. Here we analyze two instances which allow for coexistence between the two species, one for which CC = 0.375 g/L and one for which CC = 0.5 g/L. It is known that for this system embedding dimensions E ≥ 3 can be used for the analysis [2,3].
In Tables 1 and 2 we present the optimum time-lag T that maximizes cross mapping for embedding dimensions 3 ≤ E ≤ 10. As was mentioned above in the Methods Section, information about the topology of the lagged manifold of one species is used to make predictions about the other. Due to the limited number of available data a leave-one out cross validation was performed in which the model is trained on every point except for the one in question. The maximum time-lag chosen for the analysis was equal to T max = ±72 h (extending the maximum had no effect on the analysis). In the table X ⊗ Y stands for "variable X is used to cross map (or predict) variable Y, T time steps forwards of backwards". Table 1. Optimum cross mapping time-lag T (in hours) between paramecium and didinium for an initial Cerophyl concentration CC = 0.375 g/L (values in parentheses).

Scalar Correlation
Vector Correlation We observe that the causal analysis based on the vector correlation metric performs better than the scalar case. When CC = 0.375 g/L vector correlation consistently implies a top-down control by the predator in which didinium causally affects paramecium while the scalar case switches between a top-down and a two-way causation structure where both species affect each other. For CC = 0.5 g/L even though both metrics imply the same causal structure for every embedding dimension (a bottom-up control by the prey in the case of vector correlation and two-way causality in the case of scalar correlation) the optimum time-lag when scalar correlation is employed decreases with increasing dimension while in the case of vector correlation remains relatively constant.

Environmental Effects on Mosquito Abundance
We will now study the causal effect of environmental variables on daily mosquito abundance (of the culex genus). The data was collected every fortnight between the 21st and the 39th week of 2012 in the regional district of Thessaloniki (see Figure 1). Mosquito abundances were sampled by trap placement every two weeks and the following environmental variables on the day of capture were recorded: vegetation density and distribution (NDVI), changes in water content (NDWI), vegetation water content (NDMI), day mean of land surface temperature (LST), accumulated precipitation two weeks before of date of placement (RAIN) and mean hourly magnitude of wind (WIND). Due to the very short length of each individual time series (only 10 available values per station), we performed a spatial extension of EDM on the data where lagged vectors from multiple spatial replicates are combined into a reconstruction of the state space [4,14]. To ensure a high density of points in embedded space and a good fit for the model, we consider only those replicates with a high degree of correlation between abundances (Table 3) and perform a min-max regularization on the time series. By predicting the forward trajectory of the lagged manifold with the help of Simplex projection two weeks into the future, as described in the Methods Section, we find that the best embedding dimension in this case, presented in Figure 1, is equal to E = 6. In order to increase the number of available points and max time-lag for the causal analysis we instead choose E = 5. This allows us to check for time-lags up to T max = ±4 weeks (To avoid under fitting the data, we require that the total number of available points for training the model with N replicates of length L, equal to N(L − E − T + 1), to be greater than four times the number of required neighbors). In Table 4 we present the results of the optimum time-lag (in weeks) for cross-mapping between each environmental variable and culex abundance for both correlation metrics. Table 4. Optimum cross-mapping time-lag (in weeks) between culex abundance and environmental variables in the regional district of Thessaloniki (values in parentheses).

Scalar Correlation
Vector Correlation By using the vector correlation as the metric, all the environmental variables except for NDMI are found to be causal factors of culex abundance with both cross mappings displaying approximately the same correlation (we only consider cases for which both cross mappings have a positive value). In contrast using the scalar correlation reveals only three environmental variables as causal factors (NDMI, LST, RAIN) (see Figure 2) with the differences in correlation for the mean land surface temperature being more pronounced.

Conclusions
It was demonstrated that, in the study of causal interactions with EDM techniques, performing cross-mapping with a vector-based measure of fidelity is preferable to measuring the fidelity of only a single component as is usually done in the literature. Specifically, the analysis with the extended metric was shown to be robust under changes of the embedded dimension, this is important in situations where the dimension is hard to determine due to either noise or a limited number of available data. This also seems to agree well with the idea of nearest-false-neighbor techniques that for embedding dimensions above the correct value the manifold is essentially 'unfolded' so no qualitative changes should be expected [15].
Vector correlation also enhanced the study of the effects of the environment on the number of culex mosquitoes in northern Greece and was able to detect a larger number of causal factors. Since mosquitoes are known to be vectors of diseases such as West Nile virus, these applications could be of particular interest especially in guiding vector control strategies and health policy assessments. Funding: This research has been co-financed by the European Regional Development Fund of the European Union and Greek national funds through the Operational Program Competitiveness, Entrepreneurship and Innovation, under the call RESEARCH-CREATE-INNOVATE (project code: T2E∆K-02070). This work was also partially supported from the EIC Horizon Prize "Early Warning for Epidemics".

Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.

Data Availability Statement:
Restrictions apply to the availability of these data. Culex data was obtained from Ecodevelopment S. A. and are available from the authors with the permission of Ecodevelopment S. A.