Causal Discovery in High-Dimensional Point Process Networks with Hidden Nodes

Thanks to technological advances leading to near-continuous-time observations, emerging multivariate point process data offer new opportunities for causal discovery. However, a key obstacle to achieving this goal is that many relevant processes may not be observed in practice. Naïve estimation approaches that ignore these hidden variables can generate misleading results because of unadjusted confounding. To close this gap, we propose a deconfounding procedure for estimating high-dimensional point process networks when only a subset of the nodes is observed. Our method allows flexible connections between the observed and unobserved processes. It also allows the number of unobserved processes to be unknown and potentially larger than the number of observed nodes. Theoretical analyses and numerical studies highlight the advantages of the proposed method in identifying causal interactions among the observed processes.


Introduction
Learning causal interactions from observational multivariate time series is generally impossible [1,2]. Among many challenges, two of the most important are that (i) the data acquisition rate may be much slower than the underlying rate of changes; and (ii) there may be unmeasured confounders [1,3]. First, due to cost or technological constraints, the data acquisition rate may be much slower than the underlying rate of changes. In such settings, the most commonly used procedure for inferring interactions among time series, Granger causality, may both miss true interactions and identify spurious ones [4-6]. Second, the available data may only include a small fraction of potentially relevant variables, leading to unmeasured confounders. Naïve connectivity estimators that ignore these confounding effects can produce highly biased results [7]. Therefore, reliably distinguishing causal connections between pairs of observed processes from correlations induced by common inputs from unobserved confounders remains a key challenge.
Learning causal interactions between neurons is critical to understanding the neural basis of cognitive functions [8,9]. Many existing neuroscience data, such as data collected using functional magnetic resonance imaging (fMRI), have relatively low temporal resolution and are thus of limited utility for causal discovery [10]. This is because many important neuronal processes and interactions happen at finer time scales [11]. New technologies, such as calcium fluorescence imaging, which generates spike train data, make it possible to collect 'live' data at high temporal resolutions [12]. Spike train data, which are multivariate point processes containing the spiking times of a collection of neurons, are increasingly used to learn latent brain connectivity networks and to glean insight into how neurons respond to external stimuli [13]. For example, Bolding and Franks [14] collected spike train data on neurons in the mouse olfactory bulb region at 30 kHz under multiple laser intensity levels to study the odor identification mechanism. Despite progress in recording the activity of massive populations of neurons [15], simultaneously monitoring a complete network of spiking neurons at high temporal resolutions is still beyond the reach of current technology. In fact, most experiments only collect data on a small fraction of neurons. Our numerical studies show that the proposed method offers superior finite-sample performance compared to the corresponding generalization of HIVE for point processes and/or the naïve approach that ignores the unobserved confounders.

The Hawkes Process
Let {t k } k∈Z be a sequence of real-valued random variables, taking values in [0, T], with t k+1 > t k and t 1 ≥ 0 almost surely. Here, time t = 0 is a reference point in time, e.g., the start of an experiment, and T is the duration of the experiment. A simple point process N on R is defined as a family {N(A)} A∈B(R) , where B(R) denotes the Borel σ-field of the real line and N(A) = ∑ k 1 {t k ∈A} . The process N is essentially a simple counting process with isolated jumps of unit height that occur at {t k } k∈Z . We write N([t, t + dt)) as dN(t), where dt denotes an arbitrarily small increment of t.
Let N be a p-variate counting process N ≡ {N_i}_{i∈{1,...,p}}, where, as above, N_i satisfies N_i(A) = ∑_k 1{t_ik ∈ A} for A ∈ B(R), with {t_i1, t_i2, ...} denoting the event times of N_i. Let H_t be the history of N prior to time t. The intensity process {λ_1(t), ..., λ_p(t)} is a p-variate H_t-predictable process, defined as

$\lambda_i(t)\,dt = \mathbb{P}(dN_i(t) = 1 \mid \mathcal{H}_t).$ (1)

Hawkes [23] proposed a class of point process models in which past events can affect the probability of future events. The process N is a linear Hawkes process if the intensity function for each unit i ∈ {1, ..., p} takes the form

$\lambda_i(t) = \mu_i + \sum_{j=1}^{p} (\omega_{ij} * dN_j)(t),$ (2)

where

$(\omega_{ij} * dN_j)(t) = \int_0^t \omega_{ij}(t-s)\, dN_j(s) = \sum_{k:\, t_{jk} < t} \omega_{ij}(t - t_{jk}).$

Here, µ_i is the background intensity of unit i and ω_ij(·) : R_+ → R is the transfer function. In particular, ω_ij(t − t_jk) represents the influence of the kth event of unit j on the intensity of unit i at time t.
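As a concrete illustration, the dynamics above can be simulated with Ogata's thinning algorithm. The sketch below assumes exponential transition kernels κ_j(t) = e^{−t} (a common choice discussed below) and nonnegative connectivity, under which the current total intensity is a valid thinning bound between events; the function name and interface are illustrative, not from the paper.

```python
import numpy as np

def simulate_hawkes(mu, beta, T, rng=None):
    """Simulate a multivariate linear Hawkes process with exponential
    kernels kappa_j(t) = exp(-t) via Ogata's thinning method.

    mu   : (p,) background intensities
    beta : (p, p) connectivity; beta[i, j] is the effect of unit j on unit i
           (assumed nonnegative here so the thinning bound is valid)
    Returns a list of per-unit event-time arrays.
    """
    rng = np.random.default_rng(rng)
    p = len(mu)
    events = [[] for _ in range(p)]
    x = np.zeros(p)          # x[j] tracks the integrated process x_j(t)
    t = 0.0
    while t < T:
        lam = np.maximum(mu + beta @ x, 0.0)   # intensities at current time
        lam_bar = lam.sum() + 1e-12            # bound: intensity decays between events
        t += rng.exponential(1.0 / lam_bar)
        if t >= T:
            break
        x = x * np.exp(-(t - (t - rng.standard_normal(0).sum())))  # placeholder, replaced below
        break
    # --- corrected loop (kept in one pass for clarity) ---
    events = [[] for _ in range(p)]
    x = np.zeros(p)
    t = 0.0
    while t < T:
        lam = np.maximum(mu + beta @ x, 0.0)
        lam_bar = lam.sum() + 1e-12
        dt = rng.exponential(1.0 / lam_bar)
        x = x * np.exp(-dt)                    # exponential decay of history
        t += dt
        if t >= T:
            break
        lam_new = np.maximum(mu + beta @ x, 0.0)
        u = rng.uniform(0.0, lam_bar)
        if u < lam_new.sum():                  # accept; pick the jumping unit
            i = np.searchsorted(np.cumsum(lam_new), u)
            events[i].append(t)
            x[i] += 1.0                        # unit jump in x_i at an event
    return [np.array(e) for e in events]
```

The first (discarded) loop fragment above is removed in practice; only the corrected loop matters. A cleaner version would contain the second loop alone.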
Motivated by neuroscience applications [38,39], we consider a parametric transfer function ω_ij(·) of the form

$\omega_{ij}(t) = \beta_{ij}\, \kappa_j(t),$

with a transition kernel κ_j(·) : R_+ → R that captures the decay of the dependence on past events. This leads to $(\omega_{ij} * dN_j)(t) = \beta_{ij} x_j(t)$, where the integrated stochastic process

$x_j(t) = \int_0^t \kappa_j(t-s)\, dN_j(s) = \sum_{k:\, t_{jk} < t} \kappa_j(t - t_{jk})$ (5)

summarizes the entire history of unit j of the multivariate Hawkes process, so that, writing x(t) = (x_1(t), ..., x_p(t))ᵀ and β_i = (β_i1, ..., β_ip)ᵀ,

$\lambda_i(t) = \mu_i + \sum_{j=1}^{p} \beta_{ij}\, x_j(t) = \mu_i + x(t)^\top \beta_i.$ (7)

A commonly used example is the exponential transition kernel, κ_j(t) = e^{−t} [40]. Assuming that the model holds and all relevant processes are observed, it follows from [40] that the connectivity coefficient β_ij represents the strength of the causal dependence of unit i's intensity on unit j's past events. A positive β_ij implies that past events of unit j excite future events of unit i and is the case often considered in the literature (see, e.g., [40,41]). However, we might also wish to allow negative β_ij values to represent inhibitory effects [34,42], which are expected in neuroscience applications [43].
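For the exponential kernel, x_j(t) satisfies the decay recursion x_j(t + d) = e^{−d} x_j(t) between events, which makes it cheap to evaluate on a time grid. A minimal sketch (function name illustrative) assuming sorted event times and an increasing grid:

```python
import numpy as np

def integrated_process(event_times, grid):
    """Evaluate x_j(t) = sum_{t_jk < t} exp(-(t - t_jk)) on a time grid,
    i.e. the integrated process for the exponential kernel kappa_j(t) = exp(-t).
    Uses the recursion x(t + d) = exp(-d) * x(t), adding 1 at each event."""
    x = np.zeros(len(grid))
    val, last_t, k = 0.0, 0.0, 0
    for g, t in enumerate(grid):
        # absorb events in (last_t, t), decaying the running sum as we go
        while k < len(event_times) and event_times[k] < t:
            val = val * np.exp(-(event_times[k] - last_t)) + 1.0
            last_t = event_times[k]
            k += 1
        val *= np.exp(-(t - last_t))   # decay from last event up to grid point t
        last_t = t
        x[g] = val
    return x
```

This O(#events + #grid) evaluation is what makes the design matrix X of integrated processes inexpensive to build in the estimation procedure later.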

The Confounded Hawkes Process
Because of technological constraints, neuroscience experiments usually collect data from only a small portion of neurons. As a result, many other neurons that potentially interact with the observed neurons are unobserved. Consider a network of p + q counting processes, of which we only observe the first p components. The number of unobserved neurons, q, is usually unknown and likely much greater than p. Extending (7) to include the unobserved components, we obtain the confounded Hawkes model

$\lambda_i(t) = \mu_i + x(t)^\top \beta_i + z(t)^\top \delta_i,$ (8)

in which z(t) = (x_{p+1}(t), ..., x_{p+q}(t))ᵀ ∈ R^q denotes the integrated processes of the hidden components, and δ_i ∈ R^q denotes the connectivity coefficients from the unobserved components to unit i. Unless the observed and unobserved processes are independent, the naïve estimator that ignores the unobserved components will produce misleading conclusions about the causal relationships among the observed components. This is illustrated by the simple linear vector autoregressive process of Figure 1. This example includes three continuous random variables generated according to a set of linear autoregressive equations with mean-zero innovation, or error, terms ε_i. The Granger causal network corresponding to this process is shown in Figure 1A. Figure 1B shows that if Y_3 is not observed, the conditional means of the observed variables Y_1 and Y_2 lead to incorrect Granger causal conclusions, in this case, a spurious autoregressive effect from the past values of Y_2. The same phenomenon occurs in Hawkes processes with unobserved components.
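The phenomenon of Figure 1 is easy to reproduce numerically. The coefficients below are illustrative and are not the values used in the paper's figure: a hidden AR(1) series Y_3 drives both observed series, and a naïve least-squares Granger regression of Y_2 on the lagged observed variables finds spurious effects, including one from Y_2's own past.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20000
# Illustrative confounded VAR: hidden Y3 is AR(1) and drives both Y1 and Y2.
y1, y2, y3 = np.zeros(n), np.zeros(n), np.zeros(n)
for t in range(1, n):
    y3[t] = 0.8 * y3[t - 1] + rng.normal()
    y1[t] = 0.5 * y3[t - 1] + rng.normal()
    y2[t] = 0.5 * y3[t - 1] + rng.normal()

# Naive Granger regression of Y2 on lagged *observed* variables only.
X = np.column_stack([y1[:-1], y2[:-1]])
b = np.linalg.lstsq(X, y2[1:], rcond=None)[0]
# Both fitted coefficients are clearly nonzero even though, given Y3,
# neither Y1 nor Y2 has any causal effect on Y2.
print(b)
```

Both lagged predictors pick up the hidden common driver, exactly the spurious-autoregression effect described in the text.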
Throughout this paper, we assume that the confounded linear Hawkes model in (8) is stationary, meaning that for all units i = 1, . . . , p, the spontaneous rates µ i and strengths of transition (β i , δ i ) are constant over the time range [0, T] [44,45].

Extending Trim Regression to Hawkes Processes
We can write the confounded linear Hawkes model in (8) in the form of a perturbed linear model [37]. Define

$b_i = \{\operatorname{Cov}(x(t))\}^{-1}\, \operatorname{Cov}(x(t), z(t))\, \delta_i.$ (9)

Then

$Y_i(t) = \mu_i + x(t)^\top (\beta_i + b_i) + \nu_i(t),$ (10)

where $\nu_i(t) = z(t)^\top \delta_i - x(t)^\top b_i + \epsilon_i(t)$. By the construction of b_i in (9), ν_i(t) is uncorrelated with the observed processes x(t), and b_i represents the bias, or the perturbation, due to the confounding from z(t)ᵀδ_i. In general, b_i ≠ 0 unless Cov(x(t), z(t)) = 0. The perturbed model in (10) is generally unidentifiable because we can only estimate β_i + b_i from the observed data, e.g., by regressing Y_i(t) on x(t).

Trim regression [37] is a two-step deconfounding procedure for estimating β_i from independent, Gaussian-distributed data. The method first applies a simple spectral transformation, called the trim transformation (described below), to the observed data. It then estimates β_i using penalized regression. When b_i is sufficiently small, the method consistently estimates β_i. Although this condition need not hold for general Gaussian-distributed data, previous work on Hawkes processes [34] implies that the confounding magnitude cannot be large when the underlying network is stable, particularly when the confounders affect many observed components (see the discussion following Corollary 1 in Section 4). This allows us to generalize trim regression to learn the network of multivariate Hawkes processes.

Assume, without loss of generality, that the first p components are observed at times indexed from 1 to T. Let X ∈ R^{T×p} be the design matrix of the observed integrated process and Y_i = (Y_i(1), ..., Y_i(T))ᵀ ∈ R^T the vector of observed outcomes. Further, let X = UDVᵀ be the singular value decomposition of X, where U ∈ R^{T×r}, D ∈ R^{r×r} and V ∈ R^{p×r}; here, r = min(T, p) is the rank of X. Denoting the non-zero diagonal entries of D by d_1, ..., d_r, the spectral transformation F : R^{T×p} → R^{T×p} is given by

$F(X) = U \tilde{D} V^\top,$

where D̃ is a diagonal matrix with entries d̃_1, ..., d̃_r satisfying d̃_k ≤ d_k. The first step of hp-trim applies the spectral transformation to the observed data to obtain

$\tilde{X} = F(X), \qquad \tilde{Y}_i = U \tilde{D} D^{-1} U^\top Y_i.$

The spectral transformation is designed to reduce the magnitude of the confounding. In particular, when b_i aligns with the top right singular vectors of X, for an appropriate F, e.g., d̃_k = min(τ, d_k) as used in previous work [37], the magnitude of X̃b_i is small compared with that of Xb_i. Here, τ is a threshold parameter, and the trim transformation is the special case of the spectral transformation with τ = median(d_1, ..., d_r). See Ćevid et al. [37] for additional details.
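The transformation can be implemented directly from the SVD. The sketch below shrinks each singular value to min(d_k, τ), with τ defaulting to the median singular value as in the trim transform; the function name is illustrative.

```python
import numpy as np

def trim_transform(X, tau=None):
    """Spectral (trim) transformation: replace each singular value d_k of X
    by min(d_k, tau). tau defaults to the median singular value, which is
    the trim transform of Cevid et al. Returns the transformed matrix and
    the T x T map F (so the same map can be applied to the outcome vector)."""
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    if tau is None:
        tau = np.median(d)
    d_tilde = np.minimum(d, tau)
    F = U @ np.diag(d_tilde / np.maximum(d, 1e-12)) @ U.T  # F @ X = U D~ V'
    return F @ X, F
```

For large T one would apply U diag(d̃/d) Uᵀ to a vector directly rather than forming the dense T × T matrix F; it is materialized here only for clarity.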
In the second step, we estimate the network connectivities from the transformed data by solving the optimization problem

$(\hat{\mu}_i, \hat{\beta}_i) = \arg\min_{\mu,\, \beta}\; \frac{1}{T}\, \|\tilde{Y}_i - \mu \mathbf{1} - \tilde{X} \beta\|_2^2 + \lambda \|\beta\|_1,$ (14)

which is an instance of lasso regression [46] and can be solved separately for each i ∈ {1, ..., p}.
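The second step is a standard lasso and can be handed to any solver; below is a self-contained proximal-gradient (ISTA) sketch in which the intercept is omitted for brevity (center the data first). Names and defaults are illustrative, not from the paper.

```python
import numpy as np

def hp_trim_lasso(X_tilde, y_tilde, lam, n_iter=5000):
    """Lasso on the trim-transformed data via proximal gradient (ISTA).
    Objective: (1/T) * ||y - X b||_2^2 + lam * ||b||_1."""
    T, p = X_tilde.shape
    # step size = 1 / Lipschitz constant of the smooth part's gradient
    step = 1.0 / (2.0 * np.linalg.norm(X_tilde, 2) ** 2 / T)
    b = np.zeros(p)
    for _ in range(n_iter):
        grad = (2.0 / T) * X_tilde.T @ (X_tilde @ b - y_tilde)
        z = b - step * grad
        b = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft threshold
    return b
```

In practice one would apply this after `trim_transform`, once per observed unit i, and tune λ by cross-validation or an information criterion.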

An Alternative Approach
HIdden Variable adjustment Estimation (HIVE) [36] is an alternative method for estimating the coefficients of a linear model with independent, Gaussian-distributed data in the presence of latent variables. Adapted to a network of multivariate point processes, HIVE first estimates the latent column space of the unobserved connectivity matrix, ∆ = (δ_1, ..., δ_p)ᵀ ∈ R^{p×q}, with δ_i defined in (8). It then projects the outcome vector, Y(t) = (Y_1(t), ..., Y_p(t))ᵀ, onto the space orthogonal to the column space of ∆. Assuming that the column space of the observed connectivity matrix, Θ = (β_1, ..., β_p)ᵀ ∈ R^{p×p}, is orthogonal to that of ∆, HIVE consistently estimates Θ using the transformed data. While the orthogonality assumption might be satisfied when the hidden processes are external, such as experimental perturbations in genetic studies [47], it might be too stringent in a network setting. When the orthogonality assumption fails, HIVE may deliver poor edge selection performance, potentially worse than the naïve method that ignores the hidden processes. HIVE also requires the number of hidden variables to be known. Although methods for selecting the number of hidden variables have been proposed, the resulting theoretical guarantees are only asymptotic, and an over- or under-estimated number can either miss true edges or generate false ones. Given these limitations, we outline the extension of HIVE to Hawkes processes in Appendix A and refer the interested reader to Bing et al. [36] for details.

Theoretical Properties
In this section we establish the recovery of the network connectivity in the presence of hidden processes. Technical proofs for the results in this section are given in Appendix B.
We start by stating our assumptions. For a square matrix A, let Λ max (A) and Λ min (A) be its maximum and minimum eigenvalues, respectively.
Assumption 1. The matrix Ω ∈ R^{(p+q)×(p+q)} of integrated transfer function magnitudes, with entries Ω_ij = ∫_0^∞ |ω_ij(t)| dt, satisfies Λ_max(Ω) ≤ γ_Ω < 1.

Assumption 1 is necessary for the stationarity of a Hawkes process [34]. The constant γ_Ω does not depend on the dimension p + q. For any fixed dimension, Brémaud and Massoulié [44] show that, under this assumption, the intensity process of the form (6) is stable in distribution and, thus, a stationary process exists. Since our connectivity coefficients of interest are ill-defined without stationarity, this assumption provides the necessary context for our estimation framework.

Assumption 2.
There exist constants λ_min, λ_max > 0 such that λ_min ≤ λ_i(t) ≤ λ_max for all i = 1, ..., p + q and all t ∈ [0, T].

Assumption 2 requires that the intensity rates are strictly bounded away from zero and infinity, which prevents degenerate processes among the components of the multivariate Hawkes process. This assumption has been considered in previous analyses of Hawkes processes [33-35,42,48].
Assumption 3 implies that the integrated process x_j(t) in (5) is bounded. Assumption 4 requires the maximum in- and out-intensity flows to be bounded, which provides a sufficient condition for bounding the eigenvalues of the cross-covariance of x(t) [35]. A similar assumption is considered by Basu and Michailidis [49] in the context of VAR models. Together, Assumptions 3 and 4 imply that the model parameters are bounded, which is often required in time-series analysis [50]. In particular, these assumptions prevent the influence of the hidden processes from being too large.
Define the set of active indices among the observed components, S_i = {j : β_ij ≠ 0, 1 ≤ j ≤ p}, and let γ_min ≡ Λ_min(Q) and γ_max ≡ Λ_max(Q), where Q denotes the cross-covariance matrix of the integrated process x(t). Our first result provides a fixed-sample bound on the error of estimating the connectivity coefficients. Compared to the case of independent, Gaussian-distributed data ([37], Theorem 2), we obtain a slower convergence rate because of the complex dependency structure of the Hawkes process. Our rate takes into account the network sparsity among the observed components. It also does not depend on the number of unobserved components, q, which is critical in neuroscience experiments because q is often unknown and potentially very large.

Theorem 1. Suppose each of the p-variate Hawkes processes with intensity function defined in
The result in Theorem 1 differs from the corresponding result obtained when all processes are observed ([35], Lemma 10). More specifically, our result includes an extra error term, $\frac{1}{T}\|\tilde{X} b_i\|_2^2$, which captures the effect of the unobserved processes. Next, we show that when $\|b_i\|_2^2$ is sufficiently small, we obtain a rate of convergence similar to the one obtained when all processes are observed.

Corollary 1.
Under the same assumptions as in Theorem 1, suppose, in addition, where c_1, c_2 > 0 are constants depending on the model parameters and the transition kernel.
The spectral transformation empirically reduces the magnitude of $\frac{1}{T}\|\tilde{X} b_i\|_2^2$, especially when the confounding vector, b_i, lies in the subspace spanned by the top right singular vectors of X; however, this is not guaranteed to hold for arbitrary b_i. Corollary 1 specifies a condition on b_i that leads to consistent estimation of β_i, regardless of the empirical performance of the spectral transformation. While the condition does not always hold for arbitrary stochastic processes, it is satisfied for a stable network of high-dimensional multivariate Hawkes processes when the confounding is dense; this follows from the construction of b_i in (9).

Next, we introduce an additional assumption to establish edge selection consistency. To this end, we consider the thresholded connectivity estimator

$\hat{\beta}_{ij} = \tilde{\beta}_{ij}\, \mathbf{1}\{|\tilde{\beta}_{ij}| > \tau\},$

where β̃_ij denotes the estimate from (14) and τ > 0 is a threshold. Thresholded estimators are used for variable selection in high-dimensional network estimation [51], as they alleviate the need for restrictive irrepresentability assumptions [52].
Assumption 5 is called the β-min condition [53] and requires sufficient signal strength for the true edges in order to distinguish them from zero, namely $\min_{j \in S_i} |\beta_{ij}| \ge 2\tau$ for each i. Let the estimated edge set be $\hat{S}_i = \{j : \hat{\beta}_{ij} \neq 0,\ 1 \le j \le p\}$. The next result shows that the estimated edge set consistently recovers the true edge set.
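The thresholding step is a one-liner in practice; the sketch below also reads off the implied edge set (names illustrative).

```python
import numpy as np

def threshold_edges(beta_tilde, tau):
    """Post-selection thresholding: keep beta~_ij only when |beta~_ij| > tau,
    and return the estimated edge set S^ = {(i, j) : |beta~_ij| > tau}."""
    beta_hat = beta_tilde * (np.abs(beta_tilde) > tau)
    edges = list(zip(*np.nonzero(beta_hat)))
    return beta_hat, edges
```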

Theorem 2.
Under the same conditions as in Theorem 1, assume that Assumption 5 is satisfied with where c_1, c_2 > 0 are constants depending on the model parameters and the transition kernel.
Theorem 2 guarantees the recovery of causal interactions among the observed components. As before, the result is valid irrespective of the number of unobserved components, which is important in neuroscience applications.

Simulation Studies
We compare our proposed method, hp-trim, with two alternatives, HIVE and the naïve approach that ignores the unobserved nodes. To this end, we compare the methods in terms of their abilities to identify the correct causal interactions among the observed components.
We consider a point process network consisting of 200 nodes with half of the nodes observed; that is, p = q = 100. The observed nodes are connected in blocks of five nodes, and half of the blocks are connected with the unobserved nodes (see Figure 2a). This setting exemplifies neuroscience applications, where the orthogonality assumption of HIVE is violated. As a sensitivity analysis, we also consider a second setting similar to the first, in which we remove the connections of the blocks that are not connected with the unobserved nodes. This setting, shown in Figure 3a, satisfies HIVE's orthogonality assumption.
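A sketch of the connectivity matrices for the first setting is below. The exact within-block wiring and the hidden-to-observed wiring are not fully specified in the text, so fully connected blocks (no self edges) and one hidden parent per confounded block are illustrative assumptions.

```python
import numpy as np

def block_network(p=100, q=100, block=5, beta=0.12, delta=0.10):
    """Illustrative construction of the first simulation setting: observed
    nodes connected within blocks of `block` nodes; the first half of the
    blocks also receive input from an unobserved node (assumed wiring)."""
    Theta = np.zeros((p, p))   # observed -> observed connectivity
    Delta = np.zeros((p, q))   # unobserved -> observed connectivity
    n_blocks = p // block
    for b in range(n_blocks):
        idx = np.arange(b * block, (b + 1) * block)
        Theta[np.ix_(idx, idx)] = beta
        Theta[idx, idx] = 0.0              # remove self-edges
        if b < n_blocks // 2:              # half the blocks are confounded
            Delta[idx, b % q] = delta
    return Theta, Delta
```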

To generate point process data, we set β_ij = 0.12 and δ_ij = 0.10 in the setting of Figure 2a, and β_ij = 0.2 and δ_ij = 0.18 in the setting of Figure 3a. The background intensity, µ_i, is set to 0.05 in both settings. The transition kernel is κ_j(t) = exp(−t). These settings satisfy the assumptions of stationary Hawkes processes. In both settings, we set the length of the time series to T ∈ {1000, 5000}.

The results in Figure 2b show that hp-trim offers superior performance for both small and large sample sizes in the first setting. For example, with the large sample size, T = 5000, hp-trim detects almost all 200 true edges at the expense of about 50 falsely detected edges; this is almost twice as many true edges as detected by HIVE and the naïve method, which only detect half of the true edges at the same level of falsely detected edges. The naïve method eventually detects all true edges, but at a much larger cost of about 400 falsely detected edges. In this setting, HIVE performs poorly and detects at most half of the true edges, regardless of the tolerated number of falsely detected edges, because its stringent orthogonality condition is violated. When the orthogonality condition is satisfied (Figure 3a), HIVE shows the best performance. Specifically, with the large sample size, T = 5000, HIVE detects all true edges almost without any falsely detected edges (the red solid line in Figure 3b). However, this advantage requires knowledge of the correct number of latent features.
When the number of latent features is unknown and estimated from data, HIVE's performance deteriorates, especially with an insufficient sample size. For example, HIVE with an empirically estimated number of latent features only detects about 40 true edges (out of a total of 100) at the expense of 100 falsely detected edges (pink lines in Figure 3b). In contrast, hp-trim's performance with both moderate and large sample sizes is close to that of the oracle version of HIVE (HIVE-oracle). Specifically, with the large sample size, T = 5000, hp-trim captures all 100 true edges at the expense of 50 falsely detected edges, again more than twice as many true edges as HIVE-empirical.
Although our main focus is on the edge selection relevant for causal discovery, in Appendix C we also examine the estimation performance of our algorithm on the connectivity coefficients associated with the observed processes. Not surprisingly, the results indicate that hp-trim can also offer advantages in estimating the parameters, especially in settings where it offers improved edge selection.

Analysis of Mouse Spike Train Data
We consider the task of learning causal interactions among an observed population of neurons, using the spike train data from Bolding and Franks [14]. In this experiment, spike times are recorded at 30 kHz in a region of the mouse olfactory bulb (OB), while a laser pulse is applied directly to the OB cells of the subject mouse. The laser pulse is applied at increasing intensities from 0 to 50 mW/mm². The laser pulse at each intensity level lasts 10 s and is repeated 10 times on the same set of neurons of the subject mouse.
The experiment consists of spike train data from multiple mice; we consider data from the subject mouse with the most detected neurons (25) under the laser (20 mW/mm²) and no-laser conditions. In particular, we use the spike train data from one laser pulse at each intensity level. Since one laser pulse spans 10 s and the spike train data are recorded at 30 kHz, there are 300,000 time points per experimental replicate.
The population of observed neurons is a small subset of all the neurons in the mouse's brain. Therefore, to discover causal interactions among the p = 25 observed neurons, we apply our estimation procedure, hp-trim, along with the HIVE and naïve approaches, separately for each intensity level, and obtain estimated connectivity coefficients for the observed neurons. For ease of comparison, the tuning parameters for all methods are chosen to yield about 30 estimated edges; moreover, for HIVE, q is estimated following the procedure in Bing et al. [36], which is based on the maximum decrease in the eigenvalues of the covariance matrix of the errors, E(t) in (A1). Figure 4 shows the estimated connectivity coefficients specific to each laser condition in a graph representation. In this representation, each node represents a neuron, and a directed edge indicates a non-zero estimated connectivity coefficient. We see different network connectivity structures when the laser stimulus is applied, which agrees with the observation by neuroscientists that the OB response is sensitive to external stimuli [14].
The naïve approach generates a network more similar to that of our proposed method than HIVE does, under both the laser and no-laser conditions; the differences that remain are likely an indication that the naïve estimate is biased by unadjusted confounding in this application.
As discussed in Section 4, our inference procedure is asymptotically valid; in other words, with a large enough sample size, if the other assumptions in Section 4 are satisfied, the estimated edges should represent the true edges. Assessing the validity of the assumptions and selecting the true edges in real data applications is challenging. However, we can assess the sample size requirement and the validity of the assumptions by estimating the edges over a subset of neurons, treating the removed neurons as unobserved. If the sample size is sufficient and the other assumptions are satisfied, we should obtain similar connectivities among the observed subset of neurons, even when some neurons are hidden. Figure 5 shows the result of such a stability analysis for the laser condition using hp-trim. Comparing the connectivities in this graph with those in Figure 4 indicates that the edges estimated using subsets of neurons are all consistent with those estimated using all neurons. Thus, the assumptions are likely satisfied in this application.

Figure 4. Estimated functional connectivities among neurons using mouse spike train data from the laser and no-laser conditions [14]. Common edges estimated by the three methods are in red and method-specific edges are in blue. Thicker edges indicate estimated connectivity coefficients of larger magnitudes.

Figure 5. Estimated functional connectivities using hp-trim among multiple subsets of neurons. The data are the same as those used in Figure 4 under the laser condition, except that 5, 10 and 15 neurons (shown in gray) are considered hidden. Thicker edges indicate estimated connectivity coefficients of larger magnitudes. All edges estimated using the subsets of neurons are also found in the network estimated using all neurons (a-c).

Conclusions and Future Work
We proposed a causal-estimation procedure with theoretical guarantees for a high-dimensional network of multivariate Hawkes processes in the presence of hidden confounders. Our method extends trim regression [37] to the setting of point process data. The choice of trim regression as the starting point was motivated by the fact that its assumptions are less stringent than the conditions required for the alternative HIVE procedure, especially for a stable point process network with dense confounding effects. Empirically, our procedure, hp-trim, shows superior performance in identifying edges in the causal network compared with HIVE and a naïve method that ignores the unobserved nodes.
Causal discovery from observational time series is a challenging problem, and the success of our method is not without limitations. First, the theoretical guarantees for hp-trim require the magnitude of the hidden confounding to be bounded. As discussed in the paper, this condition is likely met for a stable network of high-dimensional multivariate Hawkes processes when the confounding is dense. Nonetheless, a careful examination of this condition is required when applying the method in other settings. When certain structure exists between the observed and hidden network connectivities, more structure-specific methods, such as HIVE, may be able to better utilize the structural properties of the network for improved performance in identifying the causal effects. Second, our estimates assume a linear Hawkes process with a particular parametric form of the transition function. We also assume the underlying Hawkes process is stationary, where certain structural requirements of the process (specified as assumptions in Section 4) must be satisfied. The proposed method is guaranteed to identify causal effects only if these modeling assumptions are valid. When the modeling assumptions are violated, the estimated effects may not be causal. In other words, the method is primarily designed to generate causal hypotheses, that is, to facilitate causal discovery, and the results should be interpreted with caution. Extending the proposed approach to model the transition function nonparametrically, learning its form adaptively from data, and capturing time-varying processes would be important future research directions. Finally, given that non-linear link functions are often used when analyzing spike train data [54,55], it would also be of interest to develop causal-estimation procedures for non-linear Hawkes processes.
Data Availability Statement: Data collected by [14] have been deposited at the CRCNS (https:// crcns.org, accessed on 25 November 2021) and can be accessed at https://doi.org/10.6080/K00C4SZB, accessed on 25 November 2021.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Additional Details on HIVE
We introduce additional notation before describing the method.
To illustrate the confounding induced by the hidden processes, we project Z(t) onto the space spanned by X(t):

$Z(t) = A X(t) + W(t),$

where A ∈ R^{q×p} is the projection matrix, representing the cross-sectional correlation between Z and X, and W(t) is uncorrelated with X(t). Then, (A1) becomes

$Y(t) = \mu + (\Theta + \Delta A) X(t) + \tilde{E}(t),$

where $\tilde{E}(t) = \Delta W(t) + E(t)$. From the above, it is easy to see that the correlations between the observed and unobserved processes determine the strength of the confounding. Specifically, unless A = 0, i.e., unless the observed and unobserved processes are uncorrelated, directly regressing Y(t) on X(t) produces biased estimates of Θ. Under the condition that Θ ⊥ ∆, i.e., that the column space of Θ is orthogonal to the column space of ∆, HIVE gets around this issue by finding a projection matrix, P_{∆⊥}, that projects ∆ onto its orthogonal space, i.e., P_{∆⊥}∆ = 0. Moreover, because of the orthogonality assumption, P_{∆⊥}Θ = Θ. Therefore, multiplying both sides of (A1) by P_{∆⊥} makes the unobserved term disappear. Specifically, letting Ỹ(t) = P_{∆⊥}Y(t), we have

$\tilde{Y}(t) = P_{\Delta^\perp}\mu + \Theta X(t) + P_{\Delta^\perp} E(t).$

Consequently, regressing Ỹ(t) on X(t) produces unbiased estimates of Θ (using penalized regression with an ℓ1-penalty on Θ in the high-dimensional setting, where p is allowed to grow with the sample size T). To obtain P_{∆⊥}, HIVE first calculates E(t) in (A3) and then implements the heteroPCA algorithm [56] to estimate the latent column space of ∆ and thus obtain P_∆. The method then obtains the corresponding orthogonal projection as P_{∆⊥} = I − P_∆. We refer interested readers to Bing et al. [36] for details about the method.
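The projection step of HIVE can be sketched as follows, assuming an estimate of the latent column space of ∆ is already available (e.g., from heteroPCA); the function name is illustrative.

```python
import numpy as np

def hive_projection(Delta_hat):
    """Given a (p x q) estimate of the latent column space of Delta, form the
    orthogonal projection P_{Delta-perp} = I - P_Delta. Multiplying outcomes
    by it removes the hidden-node term whenever col(Theta) is orthogonal
    to col(Delta)."""
    Q, _ = np.linalg.qr(Delta_hat)          # orthonormal basis of col(Delta)
    P_Delta = Q @ Q.T
    return np.eye(Delta_hat.shape[0]) - P_Delta
```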

Appendix B. Proof of Main Results
Since our focus is on the estimation error for β i , we consider the perturbation model in (10) in the following.
Let θ_i = (µ_i, β_iᵀ)ᵀ be the true model parameter and θ̂_i = (µ̂_i, β̂_iᵀ)ᵀ the optimizer of (14). Recall the set of active indices, S_i = {j : β_ij ≠ 0, 1 ≤ j ≤ p}, with s_i = |S_i| and s* ≡ max_{1≤i≤p} s_i. Because the optimization problem (14) can be solved separately for each component process, in what follows we focus on the estimation consistency for one component process and, for ease of notation, drop the subscript i. Next, we state two lemmas that will be used in the proofs of the main results.
Lemma A1 (van de Geer [57]). Suppose there exists λ_max such that λ(t) ≤ λ_max, where λ(t) is the intensity function of the Hawkes process defined in (2). Let H(t) be a bounded, H_t-predictable function. Then, for any ε > 0, the following holds with probability at least 1 − C exp(−εT), for some constant C.
Proof of Theorem 1. While the skeleton of the proof follows (Ćevid et al. [37], Theorem 2), the following two conditions are needed because of the unique dependency structure of Hawkes process data.

Condition 1.
There exist constants γ_min, c, C > 0 such that

Condition 1 is referred to as the restricted strong convexity (RSC) condition [58]. Lemma A2 of Wang et al. [35] shows that Condition 1 holds when X̃ = X under Assumptions 1-4. Since the minimum singular value of X is unchanged by our choice of F, which only shrinks the largest singular values, Condition 1 also holds for X̃ = FX.

Condition 2.
There exist c, C > 0 such that where ν is defined in (10).
Condition 2 holds as a result of Lemma A1 of van de Geer [57]. Under the two conditions, we reach the conclusion as follows. Because θ̂ is the optimizer of (14), the basic inequality gives

$\frac{1}{T}\, \|\tilde{Y} - \tilde{X}\hat{\theta}\|_2^2 + \lambda \|\hat{\theta}\|_1 \le \frac{1}{T}\, \|\tilde{Y} - \tilde{X}\theta\|_2^2 + \lambda \|\theta\|_1.$

Next, we consider two cases: (i) $\frac{1}{T}\|\tilde{X} b\|_2^2 \le \lambda \|\hat{\theta}_S - \theta_S\|_1$ and (ii) $\frac{1}{T}\|\tilde{X} b\|_2^2 > \lambda \|\hat{\theta}_S - \theta_S\|_1$.

Proof of Theorem 2. It suffices to show that our estimates of the true zero and non-zero coefficients can be separated with high probability; that is, there exists some constant ∆ > 0 such that, for j ∈ S and j' ∈ S^C, $|\tilde{\beta}_{ij} - \tilde{\beta}_{ij'}| \ge \Delta$ with high probability. By the β-min condition specified in Assumption 5, $|\beta_{ij}| \ge 2\tau$ for j ∈ S_i. Theorem 1 shows that, for 1 ≤ i, j ≤ p, $|\tilde{\beta}_{ij} - \beta_{ij}| \le \tau$ with probability at least $1 - c_1 p^2 T \exp(-c_2 T^{1/5})$. Then, for any j ∈ S and j' ∈ S^C, $|\tilde{\beta}_{ij}| \ge 2\tau - \tau = \tau$, while $|\tilde{\beta}_{ij'}| \le \tau$. This means the estimates of the zero and non-zero coefficients can be separated with high probability.
Next, we show that there exists a post-selection threshold that allows us to correctly identify S and S^C based on the estimates. In fact, the post-selection estimator is $\hat{\beta} = \tilde{\beta}\, \mathbf{1}(|\tilde{\beta}| > \tau)$.

Appendix C. Parameter Estimation Performance
In this section, we examine the estimation performance of our algorithm on the connectivity coefficients associated with the observed processes. To this end, we compare the optimal root-mean-squared error (RMSE) of the various methods (hp-trim, HIVE and naïve) over all connectivity coefficients for the observed processes. Here, the optimal RMSE is the minimum RMSE for each estimation method over the range of tuning parameters in each simulation run.
We find that in the case where hp-trim performs best in terms of edge selection (i.e., under the setting of Figure 2a), it also gives the lowest RMSE (see Unorthogonal (T = 5000 and T = 1000) in Figure A1). In contrast, when the orthogonality condition is met for HIVE (i.e., under the setting of Figure 3a), HIVE-oracle gives the best RMSE (see Orthogonal (T = 5000 and T = 1000) in Figure A1). However, HIVE-oracle is not available in practice, and even when the orthogonality assumption is satisfied, the empirical version of HIVE (HIVE-empirical) performs worse than hp-trim. The RMSE over all connectivity coefficients in the kth simulation run is calculated as

$\mathrm{RMSE}^{(k)} = \Big\{\frac{1}{p^2} \sum_{i=1}^{p} \sum_{j=1}^{p} \big(\hat{\beta}^{(k)}_{ij} - \beta_{ij}\big)^2\Big\}^{1/2},$

where $\hat{\beta}^{(k)}_{ij}$ is the estimate of the true parameter value, β_ij, from the kth simulation run (k = 1, ..., 100), and p = 100 observed processes are considered, as in the simulation study in the main text.