Nonlinear Hawkes Process with Gaussian Process Self Effects

Traditionally, Hawkes processes are used to model continuous-time point processes with history dependence. Here we propose an extended model where the self-effects are of both excitatory and inhibitory type and follow a Gaussian Process. Whereas previous work either relies on a less flexible parameterization of the model or requires a large amount of data, our formulation allows for both a flexible model and learning when data are scarce. We continue the line of work on Bayesian inference for Hawkes processes, and our approach dispenses with the need to estimate a branching structure for the posterior, as we perform inference on an aggregated sum of Gaussian Processes. Efficient approximate Bayesian inference is achieved via data augmentation, and we describe a mean-field variational inference approach to learn the model parameters. To demonstrate the flexibility of the model we apply our methodology to data from three different domains and compare it to previously reported results.


Introduction
Sequences of self-exciting, or inhibiting, temporal events are frequent hallmarks of natural phenomena: earthquakes are known to be temporally clustered, as aftershocks are commonly triggered following the occurrence of a main event [Ogata, 1988]; in social networks, the propagation of news can be modeled in terms of information cascades over the edges of a graph [Zhao et al., 2015]; and in neuronal activity, the occurrence of one spike may increase or decrease the probability of the occurrence of the next spike over some time period [Dayan and Abbott, 2001].
Traditionally, sequences of events in continuous time are modeled by point processes. Among these, Cox processes [Cox, 1955], or doubly stochastic processes, use a stochastic process for the intensity function, which depends only on time and is not affected by the occurrences of the events. The Hawkes process [Hawkes and Oakes, 1974] extends the Cox process to capture phenomena in which past events affect future arrivals, by introducing a memory dependence via a memory kernel. When the process depends on its own history, the superposition theorem of point processes implies that new events stem either from an exogenous rate, which is independent of the history, or from an endogenous rate driven by past arrivals. This results in a branching structure, where new events that originate from previous events can be seen as "children" of those past events.
Originally, the dependence on the history in the Hawkes process is assumed to be self-excitatory, and the memory kernel is parameterized by an exponential or power-law decay, which results in a model with low flexibility. Furthermore, assuming only excitatory relations between events does not hold for many phenomena we wish to model. For example, inhibitory effects between neurons [Maffei et al., 2004], and even self-inhibition [Smith and Jahr, 2002], are crucial for regulating neuronal activity. Thus, the memory kernel should also include inhibitory relations between events, but by doing so the intensity may become negative. To ensure that the intensity function is non-negative, a nonlinear link function is applied to the memory kernel, and the resulting model is often referred to as a Nonlinear Hawkes process [Brémaud and Massoulié, 1996, Zhu, 2013, Truccolo, 2016].
In this work we present a Nonlinear Hawkes process with Gaussian Process Self-effects (NH-GPS), which extends the class of Nonlinear Hawkes processes. We choose a non-parametric approach which avoids the limiting parameterization of the memory kernel and the background rate. We assume a Gaussian Process (GP) prior on the intensity of exogenous events and on the memory kernel, which also allows for inhibitory effects between events. To ensure that the intensity function is non-negative we use the sigmoid link function. This modeling approach is not only descriptive, but also allows us to obtain a fast inference procedure. The history of self-effects defines an aggregated Gaussian process, and we perform the inference directly on this aggregation rather than obtaining a posterior over each self-effect.
While highly flexible approaches to modeling the intensity function of nonlinear Hawkes processes have been presented before, they mainly rely on deep neural network solutions [Jia and Benson, 2019, Xiao et al., 2017]. These approaches are hindered by the necessity of large datasets. Our methodology, however, retains the modeling flexibility due to the non-parametric nature of Gaussian processes, while being able to perform well when data are scarce.
Our main contributions are the following:
• We present a new flexible non-parametric approach to the nonlinear Hawkes process, which allows us to perform inference when data are scarce.
• We derive an efficient variational inference scheme to the model.
Outline In Section 2 we discuss related work and describe how it relates to ours. In Section 3 we describe the NH-GPS model and emphasize how its structure allows for efficient Bayesian inference. In Section 4 we describe the augmentation scheme and derive the mean-field variational inference algorithm. In Section 5 we present the results of our experiments both on synthetic data and different real world examples, and compare to existing work when possible. In Section 6 we conclude by discussing our work and future research directions.

Related Work
Bayesian approaches to Cox processes model the intensity with a Gaussian process prior, which is then passed through a link function to ensure its positivity. A common choice of link function is the exponential or the quadratic function [Hensman et al., 2015, Lloyd et al., 2015]. Another choice, which is more relevant to our work, is the sigmoid link function, resulting in the sigmoidal Gaussian Cox process. Inference in this model was first done via Markov Chain Monte Carlo [Adams et al., 2009] and later via variational inference [Donner and Opper, 2018a].
As for the Hawkes process, first attempts to perform Bayesian inference relied on the definition in terms of a marked Poisson cluster process and on identifying the branching structure of the self-excitation [Rasmussen, 2013]. Alternative approaches to inference in temporal point processes and Hawkes processes include directly maximizing the likelihood function via stochastic gradient descent, or re-modeling the intensity function as a Recurrent Neural Network (RNN) [Mei and Eisner, 2017, Jing and Smola, 2017, Zuo et al., 2020]. The latter often require training over very large datasets to produce reliable results.
Beyond RNNs, another highly flexible approach to estimating the intensity function of the Hawkes process relies on GP priors [Zhang et al., 2018, Zhou et al., 2019, Zhang et al., 2020]. A recent adaptation of this approach is the model described in Zhou et al. [2020]. Similarly to the model described in our work, the authors avoid the limiting parameterization of the memory kernel by using GPs. In contrast to our work, Zhou et al. [2020] remain in the linear Hawkes process regime and assume that the effects of past events are only excitatory, whereas our approach allows both excitatory and inhibitory effects.
Noticeable variations of the nonlinear Hawkes process are the Isotonic Hawkes Process [Wang et al., 2016] and the Mutually Regressive Point Process [Apostolopoulou et al., 2019]. In the first variation, although the model allows for both inhibitory and excitatory self-effects, there is an either-or relation between them: the entire history may be either inhibiting or exciting. In contrast, the GP prior in our approach allows the self-effects to change over time. Hence, one particular event may have an inhibiting effect on the next one, but an exciting effect on the one after that.
The second variation is more closely related to our work. The Mutually Regressive Point Process is designed to model neuronal spike trains. In this work, the classical self-excitatory Hawkes process intensity function is augmented by a probability term. This term induces inhibition when it is close to zero. In a sense, this model includes two memory kernels: one excitatory only, which appears in the intensity function, and another which can also induce inhibition in the augmenting probability term. In the current work, we achieve such flexibility of the self-effects in a simpler fashion by assuming a GP prior on the self-effects. As mentioned before, this also allows the type of effect to change over time, which does not appear in the work of Apostolopoulou et al. [2019].

Classical Hawkes Process
Let $\mathcal{T}_t = [0, t] \subset \mathbb{R}$. We define the counting measure $N(\mathcal{T}_t)$ as the number of arrivals in the sequence $\mathcal{H}_t = \{T_1, \ldots, T_{N(\mathcal{T}_t)} : T_i \in \mathcal{T}_t \wedge T_{i-1} < T_i\}$, where $\mathcal{H}_t$ defines the history of the process until time $t$, and $T_i$ corresponds to the time of arrival $i$. For a temporal point process, the counting measure $N(\cdot)$ has an associated intensity defined as
$$\Lambda(t) = \lim_{\Delta t \to 0} \frac{\mathbb{E}\left[N\left([t, t + \Delta t)\right) \mid \mathcal{H}_t\right]}{\Delta t}.$$
The intensity function may depend on the history of the process. An example of such a process is the Hawkes process, or self-exciting point process [Kingman, 1993], which defines self-excitations [Daley and Vere-Jones, 2007] around exogenous events.
Following Hawkes and Oakes [1974], the intensity of the Hawkes process is defined by
$$\Lambda(t) = s(t) + \sum_{t_n < t} g(t - t_n),$$
where $s(t)$ is the base intensity of exogenous arrivals and $g(t - t_n)$ is the memory kernel defining the change in the excitation value for each arrival. In the classical Hawkes process, only excitations are allowed and the memory kernel is usually of the form $g(t - t_n) = \beta e^{-\alpha (t - t_n)}$ for an exponentially decaying memory.
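As a concrete illustration, a minimal sketch of this intensity with the exponential memory kernel might look as follows; the constant base rate `s0` and the parameter values are illustrative assumptions, not taken from the text:

```python
import numpy as np

def hawkes_intensity(t, events, s0=0.5, beta=1.0, alpha=2.0):
    """Classical Hawkes intensity: constant base rate plus a sum of
    exponentially decaying excitations g(dt) = beta * exp(-alpha * dt)
    over all past events."""
    past = np.asarray([tn for tn in events if tn < t])
    return s0 + np.sum(beta * np.exp(-alpha * (t - past)))

events = [0.2, 0.5, 0.9]
# The intensity jumps up right after each event and decays back toward s0.
assert hawkes_intensity(0.51, events) > hawkes_intensity(0.49, events)
```

With no history the intensity reduces to the base rate, so every event can only increase the future rate, which is exactly the restriction the nonlinear model below removes.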

Nonlinear Hawkes process with Gaussian Process Self-Effects
In the classical Hawkes process, the memory kernel $g$ in Equation 1 must be non-negative to prevent the intensity function from becoming negative. As a result, the history of the model has only an excitatory effect on future events. We are interested in a model that includes inhibition between events, so we relax the constraint on $g$ such that it can be negative, and define the following nonlinear intensity function
$$\Lambda(t) = \lambda\, \sigma\left(\phi(t)\right), \qquad \phi(t) = s(t) + \sum_{t_n < t} g(t - t_n)\, e^{-\alpha (t - t_n)}.$$
Here, we choose the sigmoid function $\sigma(x) = (1 + e^{-x})^{-1}$ to ensure that the intensity function $\Lambda(\cdot)$ is non-negative, $\lambda$ is the intensity bound, and we refer to $\phi(\cdot)$ as the linear intensity function.
We explicitly add the exponential decay to enforce the forgetting constraint which is essential for most realistic processes. Although we choose here a specific parameterization of the memory decay, one can choose other forms of memory decay with minimal adaptation to the learning procedure of the model parameters.
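The nonlinear intensity above can be sketched in a few lines; the specific functions `s` and `g` and all parameter values below are illustrative assumptions (in the model they are GP draws), and the sketch only shows how a negative memory kernel produces inhibition while the sigmoid keeps the intensity non-negative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def linear_intensity(t, events, s, g, alpha=2.0):
    """phi(t) = s(t) + sum_{t_n < t} g(t - t_n) * exp(-alpha * (t - t_n)).
    Unlike the classical model, g may take negative values."""
    past = np.asarray([tn for tn in events if tn < t])
    return s(t) + np.sum(g(t - past) * np.exp(-alpha * (t - past)))

def intensity(t, events, s, g, lam=10.0, alpha=2.0):
    """Lambda(t) = lam * sigmoid(phi(t)), bounded in [0, lam]."""
    return lam * sigmoid(linear_intensity(t, events, s, g, alpha))

# Toy inhibitory self-effect: a negative g lowers the rate after an event.
s = lambda t: 0.0
g_inhib = lambda dt: -3.0 * np.ones_like(np.asarray(dt, dtype=float))
lam_no_history = intensity(1.0, [], s, g_inhib)
lam_after_event = intensity(1.0, [0.9], s, g_inhib)
assert 0.0 <= lam_after_event < lam_no_history <= 10.0
```

The explicit exponential factor implements the forgetting constraint: the influence of an event at $t_n$ vanishes as $t - t_n$ grows, regardless of the sign of $g$.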
To maximize the flexibility of the model we avoid any specific parameterization of the background rate or the memory kernel. Thus, rather than specifying a functional form for $s(t)$ and $g(t)$, we assume the following GP priors
$$s(\cdot) \sim \mathcal{GP}\left(0, k_s(\cdot, \cdot)\right), \qquad g(\cdot) \sim \mathcal{GP}\left(0, k_g(\cdot, \cdot)\right).$$
In this work we use the Radial Basis Function (RBF) kernel for the GP priors. This choice is not a constraint of the model: one can choose any other kernel, and it will not affect the augmentation and inference processes described below.
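For concreteness, a prior draw of the memory function under an RBF kernel can be sketched as follows; the hyperparameter names mirror the roles of $\sigma_g$ and $a_g$, and all values are illustrative:

```python
import numpy as np

def rbf_kernel(x1, x2, sigma=1.0, a=1.0):
    """RBF kernel k(x, x') = sigma^2 * exp(-(x - x')^2 / (2 a^2));
    sigma and a play the roles of the hyperparameters (sigma_g, a_g)."""
    d = x1[:, None] - x2[None, :]
    return sigma**2 * np.exp(-d**2 / (2.0 * a**2))

rng = np.random.default_rng(0)
ts = np.linspace(0.0, 1.0, 50)
K = rbf_kernel(ts, ts) + 1e-6 * np.eye(50)       # jitter for numerical stability
g_sample = rng.multivariate_normal(np.zeros(50), K)  # one draw of g ~ GP(0, k_g)
assert g_sample.shape == (50,)
```

Such a draw is a smooth function that takes both positive and negative values, which is precisely what lets the self-effects switch between excitation and inhibition over time.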
Finally, we assume a prior distribution also on the upper intensity bound, and we identify the hyperparameters of the model as $\{\sigma_g, a_g, \alpha, \sigma_s, a_s\}$.
In this work we propose Bayesian inference for fitting the model to data. Due to the nonlinearity over $\phi(\cdot)$ we are no longer able to easily utilize the branching structure of the Hawkes process which allowed for the estimation of $s(\cdot)$ and $g(\cdot)$ [Rasmussen, 2013, Zhou et al., 2020]. Thus, a natural solution is to perform the inference directly on $\phi(\cdot)$.
Next, we identify the prior over the entire linear intensity, $p(\phi)$. From Equation 4 we see that the linear intensity function $\phi$ is nothing but a sum of GPs, and as such it is also a GP.
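Conditioned on the event times, this implied prior can be written out explicitly; the following form is our reconstruction from Equation 4 under the independence of $s$ and $g$, not quoted from the text:

```latex
% Assuming s ~ GP(0, k_s) and g ~ GP(0, k_g) independent, and
% phi(t) = s(t) + sum_{t_n < t} g(t - t_n) e^{-alpha (t - t_n)}:
\phi(\cdot) \sim \mathcal{GP}\left(0, k_\phi\right), \qquad
k_\phi(t, t') = k_s(t, t')
  + \sum_{t_n < t} \sum_{t_m < t'} e^{-\alpha (t - t_n)}\, e^{-\alpha (t' - t_m)}\,
    k_g\!\left(t - t_n,\, t' - t_m\right),
```

since a linear combination of jointly Gaussian variables is Gaussian, with the covariance obtained by bilinearity.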

Multivariate Model
We propose an extension of the model to multiple dimensions. This is useful in applications where different types of events are observed, or where the events originate from different processes that affect each other. We define an $R$-dimensional point process with intensity in dimension $k$
$$\Lambda_k(t) = \lambda_k\, \sigma\left( s_k(t) + \sum_{l=1}^{R} \sum_{t_n^l < t} g_{k,l}(t - t_n^l)\, e^{-\alpha (t - t_n^l)} \right),$$
where $t_n^l$ is the time of event number $n$ of type $l$. We assume that every dimension has its own intensity bound $\lambda_k$ and background rate $s_k(\cdot)$. The different dimensions interact with each other via the self-effects term: $g_{k,l}(\cdot)$ defines the effect of events of type $l$ on events of type $k$. As in the univariate case, this effect may change over time.
Given the observations, the different dimensions are independent of each other, and we can learn their parameters separately. Thus, in the following section we present the inference for the univariate model; the extension to the multivariate case is straightforward.

Inference
Looking at the likelihood defined above (Equations 4 and 8), implementing Bayesian inference for the model is not straightforward, due to the non-conjugate structure of the likelihood and prior. Similarly to previous work on Cox and Hawkes processes [Donner and Opper, 2018a, Apostolopoulou et al., 2019, Zhou et al., 2020], we augment the model with auxiliary variables, which leads to a conditionally conjugate model with closed-form solutions for a Gibbs sampler and variational inference.

Model Augmentation
The first step we take in treating the likelihood function is using the Pólya-Gamma (PG) augmentation scheme. Following Theorem 1 in Polson et al. [2013], we can rewrite the sigmoid in the nonlinear intensity function as
$$\sigma(z) = \int_0^\infty e^{f(w, z)}\, p_{PG}(w \mid 1, 0)\, \mathrm{d}w, \qquad f(w, z) = \frac{z}{2} - \frac{z^2 w}{2} - \log 2,$$
where $p_{PG}(w \mid 1, 0)$ denotes the PG density. As we augment each observation with a variable $w_n$ from a PG distribution, the joint likelihood of the observed events $\{t_n\}_{n=1}^N$ and PG variables $\{w_n\}_{n=1}^N$ is
$$p\left(\{t_n\}, \{w_n\} \mid \phi, \lambda\right) = \exp\left(-\int_{\mathcal{T}} \lambda\, \sigma(\phi(t))\, \mathrm{d}t\right) \prod_{n=1}^{N} \lambda\, e^{f(w_n, \phi(t_n))}\, p_{PG}(w_n \mid 1, 0),$$
where we used $\sigma(t) = 1 - \sigma(-t)$.
Next, we utilize Campbell's theorem [Kingman, 1993], which states that for a Poisson process $\Pi$ with intensity $\varphi$,
$$\mathbb{E}_{\Pi}\left[e^{\sum_{x \in \Pi} h(x)}\right] = \exp\left(\int \left(e^{h(x)} - 1\right) \varphi(x)\, \mathrm{d}x\right).$$
Looking at Equation 11, we identify $x = (t, w)$, and $\varphi(t, w) = \lambda\, p_{PG}(w \mid 1, 0)$ is the intensity of a marked Poisson process on $\mathcal{T}$ with marks $w \sim PG(1, 0)$. Further, we determine $h(x) = f(w, -\phi(t))$. We can now rewrite the exponential in Equation 10 as
$$\exp\left(-\int_{\mathcal{T}} \lambda\, \sigma(\phi(t))\, \mathrm{d}t\right) = \mathbb{E}_{\Pi}\left[\prod_{m=1}^{M} e^{f(\hat w_m, -\phi(\hat t_m))}\right]$$
for realizations $\{\hat t_m, \hat w_m\}_{m=1}^{M}$ of $\Pi$. We substitute Equation 12 into Equation 10, which results in the full augmented likelihood. Given the prior distributions over $\phi$ and $\lambda$, we can now write the model's posterior distribution as
$$p\left(\phi, \lambda, \{w_n\}, \Pi \mid \{t_n\}\right) \propto p\left(\{t_n\}, \{w_n\}, \Pi \mid \phi, \lambda\right)\, p(\phi)\, p(\lambda).$$
To summarize, we augment the model with two sets of variables: the PG variables $\{w_n\}$, which augment the actual realizations, and the tuples $\{\hat t_m, \hat w_m\}$, which are the realizations and marks of the auxiliary marked Poisson process.
As mentioned above, we intend to learn the linear intensity function $\phi(\cdot)$ directly. This allows us to utilize the efficient mean-field variational inference previously introduced in Donner and Opper [2018a] and Donner and Opper [2018b]. Next, we go through the steps of the algorithm, and we refer the reader to the two papers mentioned above for further details. As a baseline, we compare the performance of the variational inference algorithm to a Gibbs sampler. The details of the Gibbs sampler can be found in the Supplementary Material.

Variational Inference
In variational inference [Jordan et al., 1999, Bishop, 2006] we define a tractable family of distributions and adapt it to approximate the posterior by maximizing the lower bound $\mathcal{L}(Q)$ defined below. This procedure minimizes the Kullback-Leibler divergence between the unknown posterior and the proposed approximating distribution. The posterior density is approximated by the factorized distribution $q = q_1(\lambda, \phi)\, q_2(\{w_n\}, \Pi)$. This leads to the following lower bound on the evidence
$$\mathcal{L}(Q) = \mathbb{E}_{Q}\left[\log \frac{p\left(\{t_n\}, \{w_n\}, \Pi, \phi, \lambda\right)}{q\left(\{w_n\}, \Pi, \phi, \lambda\right)}\right] \le \log p\left(\{t_n\}\right).$$
Here $Q$ refers to the probability measure of the variational posterior. We can maximize the bound by alternating the maximization over each of the factors [Bishop, 2006]. The optimal solution for each factor is
$$\log q_i^{\mathrm{opt}} = \mathbb{E}_{q_{j \neq i}}\left[\log p\left(\{t_n\}, \{w_n\}, \Pi, \phi, \lambda\right)\right] + \mathrm{const}.$$
Thus, to obtain the optimal distribution of one of the factors, one must calculate expectations of the logarithm of the joint distribution over the remaining factors in the approximation, resulting in an iterative algorithm.
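The alternating scheme can be sketched generically; the update functions and the toy objective below are illustrative stand-ins for the closed-form factor updates derived in the following subsections:

```python
import numpy as np

def cavi(update_q1, update_q2, elbo, init1, init2, max_iter=100, tol=1e-6):
    """Generic coordinate-ascent variational inference: update each factor
    holding the other fixed, stopping when the lower bound stops improving."""
    q1, q2 = init1, init2
    prev = -np.inf
    for _ in range(max_iter):
        q1 = update_q1(q2)
        q2 = update_q2(q1)
        cur = elbo(q1, q2)
        if cur - prev < tol:
            break
        prev = cur
    return q1, q2

# Toy example: coordinate-wise maximization of a concave quadratic.
f = lambda a, b: -(a - 1)**2 - (b - 2)**2 - (a - b)**2
upd_a = lambda b: (1.0 + b) / 2.0     # argmax over a with b fixed
upd_b = lambda a: (2.0 + a) / 2.0     # argmax over b with a fixed
a, b = cavi(upd_a, upd_b, f, 0.0, 0.0)
```

Each update is guaranteed not to decrease the bound, which is why monitoring the ELBO (as in Figure 1) serves as a convergence diagnostic.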
In the following subsections, we explicitly express the functional form of the optimal distributions, and obtain the corresponding expectations required for updating the factors. The hyperparameters ($\{\sigma_g, a_g, \alpha, \sigma_s, a_s\}$) are learned via gradient updates of the lower bound; we present details in the Supplementary Material.

Optimal q 1
We find that the optimal $q_1$ factorizes as $q_1(\lambda, \phi) = q_1(\lambda)\, q_1(\phi)$. The first factor is identified as a Gamma distribution. The optimal distribution for the second factor involves integrals over the mark and realization spaces that generally cannot be evaluated analytically. Thus, we resort to another variational approximation, where we approximate the likelihood term by a distribution that depends only on a finite set of inducing points $\{c\}$, and we use the notation $p^q = \mathbb{E}_q(p)$. The optimal $\tilde q(\phi_c)$ over the function values at the inducing points is Gaussian. From here, using known results on conditional GPs and sparse variational GPs [Csató et al., 2001, Titsias, 2009], we obtain $\tilde q(\phi(\cdot))$, with $K_c$ the covariance kernel between the inducing points, $\kappa(t) = k_c(t)\, K_c^{-1}$, and $k_c(t)$ the kernel between the inducing points and another set of points (either the real data or the integration points). The mean and the variance of the sparse approximated GP follow from these quantities.
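The sparse predictive equations can be sketched as follows, assuming a Gaussian variational factor $q(\phi_c) = \mathcal{N}(\mu_c, \Sigma_c)$ over the inducing values; the function names and all parameter values are illustrative:

```python
import numpy as np

def sparse_gp_predict(t, t_c, mu_c, Sigma_c, kernel):
    """Predictive mean/variance of phi at points t, given a Gaussian
    q(phi_c) = N(mu_c, Sigma_c) at inducing points t_c, using
    kappa(t) = k_c(t) K_c^{-1} as in the sparse approximation."""
    K_c = kernel(t_c, t_c) + 1e-6 * np.eye(len(t_c))
    k_tc = kernel(t, t_c)                 # cross-covariance k_c(t)
    kappa = k_tc @ np.linalg.inv(K_c)
    mean = kappa @ mu_c
    var = np.diag(kernel(t, t) - kappa @ k_tc.T + kappa @ Sigma_c @ kappa.T)
    return mean, var

def rbf(x1, x2, sigma=1.0, a=0.3):
    d = x1[:, None] - x2[None, :]
    return sigma**2 * np.exp(-d**2 / (2 * a**2))

t_c = np.linspace(0.0, 1.0, 5)
mean, var = sparse_gp_predict(np.array([0.5]), t_c, np.zeros(5), rbf(t_c, t_c), rbf)
assert var[0] > 0.0   # predictive variance stays (numerically) positive
```

Because the posterior over $\phi$ is summarized by a small set of inducing values, all expectations needed for the factor updates reduce to Gaussian moments at those points.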

Optimal q 2
Similarly to the previous section, we find that the optimal $q_2$ factorizes over the PG variables and the auxiliary Poisson process. The first factor corresponds to a tilted PG distribution, $q_2(w_n) = PG\left(w_n \mid 1, \sqrt{\langle \phi_n^2 \rangle_{q_1}}\right)$.
The second factor corresponds to a marked Poisson process whose intensity function is available in closed form, where to simplify the notation we write $\phi$ instead of $\phi_t$.
We summarize the algorithm and discuss the hyperparameter learning in the Supplementary Material.

Experiments
All the algorithms and experiments for this work are implemented in Python and are available online. For further implementation details please see the Supplementary Material.

Synthetic Data
To assess the performance of the inference algorithms presented in Section 4, we learn the parameters of data generated by the model, and compare the learned parameters to the ground truth. To generate data we start by sampling the memory GP and the background GP, based on Equations 6 and 7. We generate events from the model using Poisson thinning [Lewis and Shedler, 1979]. First we sample the number of candidates $J \sim Poisson(\lambda T)$, and sample candidate events $\{t_1, \ldots, t_J\}$ uniformly. Next we chronologically iterate through the candidates and accept them with probability $\Lambda(t_j)/\lambda = \sigma(\phi(t_j))$.
The results for the synthetic data are included in Figure 1. The time window used was one second, and the dataset includes 18 events. In panel (a), the comparison between the ground truth predictive intensity and the ones inferred by the learning algorithms demonstrates the accuracy of the inference methods. We compare the ground truth to the mean of the Gibbs samples and the mean of the approximating distribution of the VI. Due to space limitations we do not include the standard deviation of the inferred predictive intensity, and we refer the reader to the Supplementary Material for further results.
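The thinning procedure described above can be sketched as follows; here `s` and `g` stand in for the sampled GP paths (we use illustrative fixed functions and parameter values instead):

```python
import numpy as np

def sample_nh_gps(T, lam, s, g, alpha, rng):
    """Generate events by Poisson thinning: propose candidates from a
    homogeneous process with rate lam, then accept each candidate t_j
    with probability sigma(phi(t_j)) = Lambda(t_j) / lam."""
    J = rng.poisson(lam * T)
    candidates = np.sort(rng.uniform(0.0, T, size=J))
    events = []
    for t in candidates:                   # chronological order matters:
        past = np.asarray(events)          # phi depends on the accepted history
        phi = s(t) + np.sum(g(t - past) * np.exp(-alpha * (t - past)))
        if rng.uniform() < 1.0 / (1.0 + np.exp(-phi)):
            events.append(t)
    return np.asarray(events)

rng = np.random.default_rng(1)
ev = sample_nh_gps(1.0, 50.0, lambda t: 0.0,
                   lambda dt: -2.0 * np.ones_like(dt), 5.0, rng)
assert np.all((0.0 <= ev) & (ev <= 1.0))
```

Iterating chronologically is essential: acceptance at $t_j$ depends on the already accepted events, so candidates cannot be thinned independently.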
Panel (b) compares the ground truth value of the linear intensity with the one inferred by the two learning algorithms. Unlike for the predictive intensity, the Gibbs sampler is more accurate than the VI. Panel (c) presents the inference results for the intensity bound. As expected, the distribution approximated by the VI is much narrower than the distribution of the Gibbs samples.

Panels (d) and (e) show the autocorrelation of the intensity bound and the ELBO over the Gibbs samples and VI iterations, respectively. In this example the convergence of the ELBO is very fast, whereas the autocorrelation of the Gibbs samples vanishes only after a few thousand iterations.
We use the test log-likelihood per data point, averaged over ten datasets, to quantify the performance of the two inference algorithms. The Gibbs sampler and the VI achieve very similar results. Thus, in the next section we present the results only from the VI.

Crime Data
Our model assumes both inhibitory and excitatory self-effects, but it should also be able to capture phenomena where only one of the two types of effects exists. To test this, we fit our model to crime report data, where it is assumed that past events have an excitatory effect on future events [Mohler et al., 2011]. We use the same two datasets described in Zhou et al. [2020], and follow their data processing procedure. Each dataset contains one type of crime, so we use the univariate version of the model. The work of Zhou et al. [2020] includes several inference methods, and we compare our results to those of their reported mean-field variational inference approach, as it is the closest to our inference procedure. Table 1 compares the test log-likelihood of our NH-GPS model to the one reported in Zhou et al. [2020]. We perform the experiment five times and report the mean and variance of the test log-likelihood. As expected, our model performs similarly to the non-parametric Hawkes process presented by Zhou et al. [2020].

Neuronal Activity Data
One of the motivating real world phenomena behind our work is the spiking activity of neurons, where it is known that the process has both self-excitatory and self-inhibitory effects. As an example of our model's ability to capture neuronal activity we use the datasets that were first presented in Gerhard et al. One dataset includes ten recordings from a single neuron in a monkey cortex, each with a duration of one second, and the other includes ten recordings from a single neuron in a human cortex, each with a duration of ten seconds.
In Figure 2 we assess the ability of the model to capture the data for the recordings from the monkey cortex. The results for the human cortex are included in the Supplementary Material. Panels (a) and (b) present the raster plot of the real data and the raster plot generated from the fitted model, respectively. Similarly to the real data, the generated data display both excitation, in the form of clustered events, and inhibition.
To quantify how suitable the model is for the data, we apply the random time change theorem [Daley and Vere-Jones, 2003] to the inferred intensity and the experimental data. The theorem states that realizations from a general point process can be transformed to realizations from a homogeneous Poisson process with unit rate. Similarly to the work of Apostolopoulou et al. [2019], we further transform the exponential realizations to those from a uniform distribution, following Brown et al. [2002]. We then use the Kolmogorov-Smirnov test to compare the quantiles of the distribution of the transformed realizations to the quantiles of the uniform distribution. The results of this test are displayed in Figure 2c. The comparison lies within the 95% confidence bounds, which are indicated by the dashed lines. The model passes the goodness-of-fit test (p-value > 0.05). We compare our p-values with those reported for the MR-PP model for the two datasets in Table 2. For both datasets our model achieves a higher p-value than the MR-PP.
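The time-rescaling check can be sketched as follows; the trapezoidal approximation of the compensator and all names below are illustrative choices, not the paper's exact implementation:

```python
import numpy as np

def ks_uniform_statistic(u):
    """Kolmogorov-Smirnov distance between a sample u and Uniform(0, 1)."""
    u = np.sort(u)
    n = len(u)
    i = np.arange(1, n + 1)
    return np.max(np.maximum(i / n - u, u - (i - 1) / n))

def rescaled_interevent_uniforms(events, intensity, T, n_grid=10_000):
    """Random time change: compensator increments z_i between consecutive
    events are Exp(1) under the model, so u_i = 1 - exp(-z_i) is
    Uniform(0, 1) [Brown et al., 2002]. The compensator is approximated
    by trapezoidal integration of the intensity on a grid."""
    grid = np.linspace(0.0, T, n_grid)
    lam = intensity(grid)
    cum = np.concatenate([[0.0],
                          np.cumsum(0.5 * (lam[1:] + lam[:-1]) * np.diff(grid))])
    Lam = np.interp(events, grid, cum)       # compensator at the event times
    z = np.diff(np.concatenate([[0.0], Lam]))
    return 1.0 - np.exp(-z)

# Sanity check on a unit-rate Poisson process, which should pass by construction.
rng = np.random.default_rng(0)
ev = np.sort(rng.uniform(0.0, 200.0, rng.poisson(200)))
u = rescaled_interevent_uniforms(ev, lambda t: np.ones_like(t), 200.0)
D = ks_uniform_statistic(u)
```

If the fitted intensity is correct, `D` stays below the 95% KS critical value, which corresponds to the QQ-plot remaining inside the dashed confidence bounds in Figure 2c.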

Retweets Data
To demonstrate the necessity of the multivariate variation of the model we use a subset of a retweets dataset. We used the processed data that was published in Zuo et al. [2020]. Each time series in the dataset contains the cascade of retweets of a specific tweet, and the event time is relative to the time of the initial tweet. Each tweet is tagged according to the number of followers the user has, and we use the same grouping into three groups as in Zuo et al. [2020], which results in three event types.
It can be shown that the time change theorem can be applied separately to every dimension of the multivariate intensity function [Daley and Vere-Jones, 2003], which in our case corresponds to each of the event types. As our model is best suited to use cases where the data are scarce, we fit a model to each retweet cascade separately. Figure 3 presents the QQ-plot for each type of event for one retweet cascade. All types pass the goodness-of-fit test.
As a baseline we fit the univariate version of the model to the same data, without differentiating between event types. Table 3 presents the comparison between the fit of the univariate and multivariate versions of the model, averaged over five retweet cascades. It includes the p-values as calculated by the Kolmogorov-Smirnov test, and the test log-likelihood per data point. To calculate the test log-likelihood we trained the model on an hour of retweets and tested it on the following half hour. Both the univariate and the multivariate versions of the model pass the goodness-of-fit test, but the multivariate model achieves a higher p-value and a higher test log-likelihood.

Conclusion
In this work we presented the nonlinear Hawkes model with Gaussian process self-effects (NH-GPS). We motivated the development of the new model by the need for a flexible model that can capture both exciting and inhibiting interactions between events, while maintaining the ability to learn even when data are scarce.
Due to the structure of the model, we dispense with the branching structure that is commonly used for Bayesian inference in Hawkes processes. We propose an efficient mean-field variational inference algorithm which relies on a data augmenting scheme. We show that the results of the variational inference are comparable with those of a Gibbs sampler.
We demonstrate the performance of our model in three different real world applications. Due to the flexibility of our model, it achieves good results on data where events have only excitatory effects and on data where events have both excitatory and inhibitory effects.
Last, we present the multivariate variation of the model. Here, we use it to fit data where events are of different types. This direction can be extended further. The multivariate version of the model can be used to describe events in a network, and even learn the connections in the network. Another future research direction is spatio-temporal processes.
The model can be directly extended to capture events in time and space, as the memory kernel reduces to a GP kernel function, which can also be applied to multi-dimensional data.

Conditional Distribution of the augmenting events
The conditional posterior of the augmenting events is recognized as the unnormalized density of a marked inhomogeneous Poisson process with intensity
$$\Lambda_{\mathrm{aug}}(t, w) = \lambda\, \sigma(-\phi(t))\, p_{PG}(w \mid 1, \phi(t)).$$
The underlying density of the inhomogeneous Poisson process in the realization space $\mathcal{T}$ is given by $\lambda\, \sigma(-\phi(t))$. To sample the augmenting realizations we use the thinning algorithm [Ogata, 1981]. First, we sample the number of candidate events $J \sim Poisson(\lambda T)$.
Next, $J$ candidates are sampled uniformly over the realization space $\mathcal{T}$. To evaluate the intensity at the candidate points we need to evaluate the linear intensity $\phi$ at these points, given its values at the real events and the previously sampled augmenting events. This can be done using standard results from GP regression [Rasmussen and Williams, 2006]. Once we have $\phi_J$ we perform thinning: for each candidate $\hat t_j$ we generate a random number $r_j$ between 0 and 1. If $r_j < \sigma(-\hat\phi_j)$ we accept the candidate $\hat t_j$; otherwise we discard it.
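The GP regression step used to evaluate $\phi$ at the candidate points can be sketched as follows; the kernel and all parameter values are illustrative:

```python
import numpy as np

def gp_conditional(t_new, t_obs, phi_obs, kernel, jitter=1e-6):
    """Conditional GP mean/covariance of phi at new points, given its
    values at observed points (standard GP regression formulas)."""
    K = kernel(t_obs, t_obs) + jitter * np.eye(len(t_obs))
    K_star = kernel(t_new, t_obs)
    A = K_star @ np.linalg.inv(K)
    mean = A @ phi_obs
    cov = kernel(t_new, t_new) - A @ K_star.T
    return mean, cov

def rbf(x1, x2, sigma=1.0, a=0.2):
    d = x1[:, None] - x2[None, :]
    return sigma**2 * np.exp(-d**2 / (2 * a**2))

t_obs = np.array([0.1, 0.4, 0.8])        # real + previously sampled points
phi_obs = np.array([1.0, -0.5, 0.3])     # known values of phi at those points
mean, cov = gp_conditional(np.array([0.1]), t_obs, phi_obs, rbf)
assert abs(mean[0] - 1.0) < 1e-3         # conditioning reproduces known values
```

A draw from this conditional Gaussian provides the $\hat\phi_j$ values needed for the acceptance step $r_j < \sigma(-\hat\phi_j)$.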

Hyper parameters Learning
The augmented model is not conditionally conjugate with respect to the kernel hyperparameters. This is usually addressed with an MCMC-within-Gibbs approach [Gilks and Wild, 1992, Martino et al., 2015]. This method applies a sampler such as Metropolis-Hastings (MH) [Hastings, 1970] or Hamiltonian Monte Carlo (HMC) [Duane et al., 1987] to sample the hyperparameters, and relies heavily on design choices: a wrong choice of the proposal distribution (for MH) or the mass matrix (for HMC) may result in very slow convergence, or prevent the sampler from converging at all.
We choose the less traditional approach of taking a gradient step within the Gibbs sampler. This is implemented in the following way: after sampling all the model parameters from the conditional posterior distributions described above, we differentiate the negative model log-posterior with respect to the hyperparameters and take a step in the direction of the negative gradient. This approach can be developed further in the spirit of Stochastic Gradient Descent (SGD). That is, rather than updating the hyperparameters after each iteration of the Gibbs sampler, we perform several steps of sampling, take the gradient of the averaged posterior, and update the hyperparameters following the averaged gradient.
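The averaged-gradient variant can be sketched as follows; plain SGD is used here for simplicity (the paper uses ADAM), and the toy objective is an illustrative stand-in for the negative log-posterior:

```python
import numpy as np

def averaged_gradient_step(theta, grad_fn, samples, lr=1e-2):
    """Average the hyperparameter gradient over several Gibbs samples of
    the model parameters, then take one descent step."""
    grads = np.mean([grad_fn(theta, s) for s in samples], axis=0)
    return theta - lr * grads

# Toy example: minimize the average of (theta - x)^2 over samples x,
# whose minimizer is the sample mean.
grad = lambda th, x: 2.0 * (th - x)
theta = 0.0
for _ in range(200):
    theta = averaged_gradient_step(theta, grad, [1.0, 3.0], lr=0.1)
assert abs(theta - 2.0) < 1e-3
```

Averaging over several sampling steps reduces the variance of the gradient estimate, at the cost of less frequent hyperparameter updates.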
Below we include the derivatives with respect to the hyperparameters of the model, $\theta = \{\sigma_g, \alpha, a_g, \sigma_s\}$.
To learn the hyperparameters we differentiate the posterior of the model, which appears in Equation 13 in the main text. First, we notice that all of the hyperparameters appear in the prior over the linear intensity, namely in its kernel.
We next differentiate an entry of the kernel with respect to each of the hyperparameters, and plug these results into the chain rule to obtain the gradients.

Implementation Details
All the experiments were implemented in Python. To parallelize the computation over the available computing resources we used the JAX package [Bradbury et al., 2018]. In the Gibbs sampler, the sampling of the PG variables was done using the PyPólyaGamma package [Linderman, 2017].

Parameter   Value
Time limit  1
λ           320
σ_g         0.07
a_g         1
α           20
σ_s         0.5
a_s         1

Gradient step within the Gibbs sampler
We update the hyperparameters of the model using SGD. As an optimizer we use the ADAM algorithm [Kingma and Ba, 2014]. We found that updating the hyperparameters every second iteration yields the best results in terms of running time and convergence time. The ADAM optimizer was also used for the gradient step in the Variational Inference algorithm. We used the recommended default parameters of the ADAM optimizer.

Hyper parameters values
The parameter values that were used to generate the synthetic dataset presented in Figure 1 are given in Table 4. The hyperparameters for the variational inference, for all the real-world datasets, are presented in Table 5. The noise level refers to the noise added to the diagonal of the relevant covariance matrices. Figure 4 includes three examples of a generated ground truth intensity function and the inference results of the Gibbs sampler and variational inference. The upper panel includes the same data as Figure 1 in the main text, but also includes the standard deviation of the inference results. The NH-GPS generates data that resemble the real data and pass the goodness-of-fit test.