Online Topology Inference from Streaming Stationary Graph Signals with Partial Connectivity Information

We develop online graph learning algorithms from streaming network data. Our goal is to track the (possibly) time-varying network topology, and effect memory and computational savings by processing the data on-the-fly as they are acquired. The setup entails observations modeled as stationary graph signals generated by local diffusion dynamics on the unknown network. Moreover, we may have a priori information on the presence or absence of a few edges as in the link prediction problem. The stationarity assumption implies that the observations' covariance matrix and the so-called graph shift operator (GSO -- a matrix encoding the graph topology) commute under mild requirements. This motivates formulating the topology inference task as an inverse problem, whereby one searches for a sparse GSO that is structurally admissible and approximately commutes with the observations' empirical covariance matrix. For streaming data said covariance can be updated recursively, and we show online proximal gradient iterations can be brought to bear to efficiently track the time-varying solution of the inverse problem with quantifiable guarantees. Specifically, we derive conditions under which the GSO recovery cost is strongly convex and use this property to prove that the online algorithm converges to within a neighborhood of the optimal time-varying batch solution. Numerical tests illustrate the effectiveness of the proposed graph learning approach in adapting to streaming information and tracking changes in the sought dynamic network.


Introduction
Network data supported on the vertices of a graph G representing pairwise interactions among entities are nowadays ubiquitous across disciplines spanning engineering as well as social and the bio-behavioral sciences; see e.g., [1,Ch. 1].Such data can be conceptualized as graph signals, namely high-dimensional vectors with correlated entries indexed by the nodes of G. Graph-supported signals abound in real-world applications of complex systems, including vehicle congestion levels over road networks, neurological activity signals supported on brain connectivity networks, COVID-19 incidence in different geographical regions linked via mobility or social graphs, and fake news that spread on online social networks.Efficiently mining information from unprecedented volumes of network data promises to prevent or limit the spread of epidemics and diseases, identifying trends in financial markets, learning the dynamics of emergent social-computational systems, and also protect critical infrastructure including the smart grid and the Internet's backbone network [2].In this context, the goal of graph signal processing (GSP) is to develop information processing algorithms that fruitfully exploit the relational structure of said network data [3].However, oftentimes G is not readily available and a first key step is to use nodal observations (i.e., measurements of graph signals) to identify the underlying network structure, or, a useful graph model that facilitates signal representations and downstream learning tasks; see [4,5] for recent tutorials on graph learning with a signal processing flavor and [1,Ch. 7] for a statistical treatment.Recognizing that many of these networks are not only unknown but also dynamic, there is a growing need to develop online graph learning algorithms that can process network information streams in an efficient fashion.

Identifying the structure of network diffusion processes
Consider a weighted, undirected graph G consisting of a node set N of cardinality N, and symmetric adjacency matrix A ∈ R N×N + with entry A ij = A ji ≠0 denoting the edge weight between node i and node j.We assume that G contains no self-loops; i.e., A ii =0.The adjacency matrix encodes the topology of G.One could generically define a graph-shift operator (GSO) S ∈ R N×N as any matrix capturing the same sparsity pattern as A on its off-diagonal entries.Beyond A, common choices for S are the combinatorial Laplacian L ∶= diag(A1) − A as well as their normalized counterparts [3].Henceforth we focus on S = A and aim to recover the sparse adjacency matrix of the unknown graph G.Other GSOs can be accommodated in a similar fashion.The graph can be dynamic with a slowly time-varying GSO S t , t = 1, 2, . . .(see also Section 3), but for now we omit any form of temporal dependency to simplify exposition.
Our goal is to develop an online framework to estimate sparse graphs that explain the structure of a class of streaming random signals.At some time instant, let y = [y 1 , . . ., y N ] T ∈ R N be a zero-mean graph signal in which the ith element y i denotes the signal value at node i of the unknown graph G with shift operator S. Further consider a zero-mean white signal x.We state that the graph S represents the structure of the signal y ∈ R N if there exists a diffusion process in the GSO S that produces the signal y from the input signal x [6], that is Under the assumption that C x = E xx T = I N (identity matrix), ( 1) is equivalent to the stationarity of y in S; see e.g., [7,Def. 1], [8], [9].The product and sum representations in (1) are common (and equivalent) models for the generation of random network processes.Indeed, any process that can be understood as the linear propagation of a white input through G can be written in the form in (1), and subsumes heat diffusion, consensus and the classic DeGroot model of opinion dynamics as special cases.The justification to say that S represents the structure of y is that we can think of the edges of G, i.e., those few non-zero entries in S, as direct (one-hop) relations between the elements of the signal.The diffusion in (1) modifies the original correlation by inducing indirect (multi-hop) relations in C y = E yy T .At a high level, the problem studied in this paper is the following [6,10].
Problem.Recover the direct relations described by a sparse GSO S from a set {y t } T t=1 of T independent samples of the random signal y adhering to (1).
We consider the challenging setup when we have no knowledge of the diffusion coefficients {α l } (or {β l }), nor do we get to observe the specific realizations of the driving inputs {x t } T t=1 .The stated inverse problem is severely underdetermined and nonconvex.It is underdetermined because for every observation y we have the same number of unknowns in the input x on top of the unknown diffusion coefficients and the shift S, the latter being the quantity of interest.The problem is nonconvex because the observations depend on the product of our unknowns and, notably, on powers of S [cf.(1)].Our main contribution is to develop novel algorithms to address these technical challenges in several previously unexplored settings.

Technical approach and paper outline
We tackle the network topology inference problem of the previous section in two fundamental settings.We start in Section 2 with the batch (offline) case where observations {y t } T t=1 are all available for joint processing as in [6,10].Here though we exploit the implications of stationarity from a different angle which leads to a formulation amenable to efficient solutions via proximal gradient (PG) algorithms; see e.g., [11,12].Specifically, as we show in Section 2.1 stationarity implies that the covariance matrix C y of the observations commutes with S under mild requirements; see also [7].This motivates formulating the topology inference task as an inverse problem, whereby one searches for a sparse S that is structurally admissible and approximately commutes with the observations' empirical covariance matrix Ĉy .In [6,10], the algorithmic issues were not thoroughly studied and relied mostly on off-the-shelf convex solvers whose scalability could be challenged.Unlike [13] but similar to link prediction problems [1, Ch. 7.2], [14], here we rely on a priori knowledge about the presence (or absence) of a few edges; conceivably leading to simpler algorithmic updates and better recovery performance.We may learn about edge status via limited questionnaires and experiments, or, we could perform edge screening prior to topology inference [15].The batch PG algorithm developed in Section 2.3 to recover the network topology is computationally more efficient than existing methods for this (and related) problem(s).It also serves as a fundamental building block to construct online iterations, which constitutes the second main contribution of this paper.
Focus in Section 3 shifts to online recovery of S from streaming signals {y 1 , . . ., y t , y t+1 , . ..}, each of them adhering to the generative model in (1).For streaming data the empirical covariance estimate Ĉy,t can be updated recursively, and in Section 3.1 we show online PG iterations can be brought to bear to efficiently track the time-varying solution of the inverse problem with quantifiable (non-asymptotic) guarantees.Different from [13], the updates are simple and devoid of multiple inner loops to compute expensive projections.Moreover, we establish convergence to within a neighborhood of the optimal time-varying batch solution as well as dynamic regret bounds (Section 3.2) by adapting results in [16].The algorithm and analyses of this paper are valid even for dynamic networks, i.e., if the GSO S t in (1) is (slowly) time-varying.Indeed, we examine how the variability and eigenvectors of the underlying graph as well as the diffusion filters' frequency response influence the size of the convergence radius (or misadjustment in the adaptive filtering parlance).Numerical tests in Section 4 corroborate the efficiency and effectiveness of the proposed algorithm in adapting to streaming information and tracking changes in the sought dynamic network.Concluding remarks are given in Section 5.

Contributions in context of prior related work
Early topology inference approaches in the statistics literature can be traced back to the problem of (undirected) graphical model selection [1,Ch. 7], [4,5].Under Gaussianity assumptions, this line of work has well-documented connections with covariance selection [17] and sparse precision matrix estimation [18][19][20], as well as neighborhood-based sparse linear regression [21].Recent GSP-based network inference frameworks postulate that the network exists as a latent underlying structure, and that observations are generated as a result of a network process defined in such a graph [6,10,[22][23][24][25].
Different from [22,24,26,27] that infer structure from signals assumed to be smooth over the sought undirected graph, here the measurements are assumed related to the graph via filtering [cf.(1) and the opening discussion in Section 2].Few works have recently built on this rationale to identify a symmetric GSO given its eigenvectors, either assuming that the input is white [6,10] -equivalently implying y is graph stationary [7][8][9]; or, colored [28,29].Unlike prior online algorithms developed based on the aforementioned graph spectral domain design [13,14], here we estimate the (possibly) time-varying GSO directly (without tracking its eigenvectors) and derive quantifiable recovery guarantees; see Remark 2. Recent algorithms for identifying topologies of time-varying graphs [30,31] operate in batch mode, they are non-recursive and hence their computational complexity grows linearly with time.While we assume that the graph signals are stationary, the online graph learning scheme in [32] uses observations from a Laplacian-based, continuous-time graph process.Unlike [33] that relies on a single-pole graph filter [34], the filter structure underlying (1) can be arbitrary, but the focus here is on learning undirected graphs.Online PG methods were adopted for directed graph inference under dynamic structural equation models [35], but lacking a formal performance analysis.The interested reader is referred to [36] for a comprehensive survey about topology identification of dynamic graphs with directional links.The recovery guarantees in Section 3.2 are adapted from the results in [16], obtained therein in the context of online sparse subspace clustering.Convergence and dynamic regret analysis techniques have been recently developed to study solutions of time-varying convex optimization problems; see [37] for a timely survey of this body of work.The impact of these optimization advances to dynamic network topology identification is yet to fully materialize, and this paper offers a first exploration in this direction.
In closing, we note that information processing from streams of network data has been considered in recent work that falls under the umbrella of adaptive GSP [38,39].Most existing results pertain to online reconstruction of partially-observed streaming graph signals, which are assumed to be bandlimited with respect to a known graph.An exception is [40], which puts forth an online algorithm for joint inference of the dynamic network topology and processes from partial nodal observations.

Notational conventions
The entries of a matrix X and a (column) vector x are denoted by X ij and x i , respectively.Sets are represented by calligraphic capital letters and R + denotes the non-negative real numbers.The notation T stands for matrix or vector transposition; 0 and 1 refer to the all-zero and all-one vectors; while I N denotes the N × N identity matrix.For a vector x, diag(x) is a diagonal matrix whose ith diagonal entry is x i .The operators ⊗, ⊙, ○, and vec(⋅) stand for Kronecker product, Khatri-Rao (columnwise Kronecker) product, Hadamard (entrywise) product and matrix vectorization, respectively.The spectral radius of matrix X is denoted by λ max (X) and X p stands for the p norm of vec(X).Lastly, for a function f , denote f ∞ =sup X f (X) .

Identifying graph topologies from observations of stationary graph signals
We start with batch (offline) estimation of a time-invariant network topology as considered in [6].This context is useful to better appreciate the alternative formulation in Section 2.1 and the proposed algorithm in Section 2.3.Together, these two elements will be instrumental in transitioning to the online setting studied in Section 3.
To state the problem, recall the symmetric GSO S ∈ R N×N associated with the undirected graph 3,34], the Cayley-Hamilton theorem asserts that the model in ( 1) can be equivalently reparameterized as for some particular h and filter order L ≤ N. Since S is a local (one-hop) diffusion operator, L specifies the (multi-hop) range of vertex interactions when generating y from x.It should now be clear that we are constraining ourselves to observations of signals generated via graph filtering.As we explain next, this assumption leads to a tight coupling between S and the second-order statistics of y.
The covariance matrix of y = Hx is given by [recall (2) and We relied on the symmetry of H to obtain the third equality, as H is a polynomial in the symmetric GSO S. Using the spectral decomposition of S = VΛV T to express the filter as )V T , we can readily diagonalize the covariance matrix in (3) as Such a covariance expression is the requirement for a graph signal to be stationary in S [7, Def.2.b].
In other words, if y is graph stationary (equivalently if C x = I N ), (4) shows that the eigenvectors of the shift S, the filter H, and the covariance C y are all the same.Thus given a batch of observations {y t } T t=1 comprising independent realizations of ( 4), the approach in [6] advocates: (i) forming the sample covariance Ĉy = 1 T ∑ T t=1 y t y T t and extracting its eigenvectors V as spectral templates of G; then (ii) recover S that is optimal in some sense by estimating its eigenvalues Λ = diag(λ 1 , . . ., λ N ).Namely, one solves the inverse problem min which is convex for appropriate choices of the function f (S), the constraint set S and the matrix distance The tuning parameter > 0 accounts for the (finite) sample size-induced errors in estimating V, and could be set to zero if the eigenvectors were perfectly known.The formulation (5) entails a general class of network topology inference problems parameterized by the choices of f , d and S; see [6] for additional details on various application-dependent instances.Particular cases of ( 5) with = 0 were independently studied in [10].
In this paper we consider a different formulation from (5), which is amenable to efficient algorithms.For concreteness, we also make specific choices of f , d and S as described next.

Revisiting stationarity for graph learning
As a different take on the shared eigenspace property, observe that stationarity of y also implies C y S = SC y , thus forcing the covariance C y to be a polynomial in S [cf.(3)].This commutation identity holds under the pragmatic assumption that all the eigenvalues of S are simple and ∑ L−1 l=0 h l λ l i ≠ 0, for i = 1, . . ., N. It is also a natural characterization of (second-order) graph stationarity, requiring that second-order statistical information is shift invariant [7].Since one can only estimate Ĉy from data {y t } T t=1 , our idea to recover a sparse GSO is to solve [cf.(5)] The inverse problem ( 6) is intuitive.The constraints are such that one searches for an admissible S ∈ S that approximately commutes with the observations' empirical covariance matrix Ĉy .To recover a sparse graph we minimize the 1 -norm of S, the workhorse convex proxy to the intractable cardinality function S 0 counting the number of non-zero entries (edges) in the GSO.Different from (5) in [6], the formulation (6) circumvents computation of eigenvectors (an additional source of errors beyond covariance estimation), and reduces the number of optimization variables.More importantly, as we show in Section 2.3 it offers favorable structure to invoke a proximal gradient solver with convergence guarantees, even in an online setting where the signals y t arrive in a streaming fashion.
In closing, we elaborate on S as the means to enforce the GSO is structurally admissible as well as to incorporate a priori knowledge about S. Henceforth we let S = A represent the adjacency matrix of an undirected graph with non-negative weights (S ij = S ji ≥ 0) and no self-loops (S ii = 0).Unlike [6,10], we investigate the effect of having additional prior information about the presence (or absence) of a few edges, or even their corresponding weights.This is well motivated in studies where one can afford to explicitly measure a few of the pairwise relationships among the objects in N .Examples include social network studies involving questionnaire-based random sampling designs [1, Ch. 5.3], or experimental testing of suspected regulatory interactions among selected pairs of genes [1,Ch. 7.3.4].Since ( 6) is a sparse linear regression problem, one could resort to so-termed variable screening techniques to drop edges prior to solving the optimization; see e.g., [15].Either way, one ends up with extra constraints S ij = s ij , for a few vertex pairs (i, j) in the set Ω ⊂ N × N of observed edge weights s ij .Accordingly, we can write the convex set of admissible adjacency matrices as Other GSOs can be accommodated using appropriate modifications.

Size of the feasible set
Here we examine the feasibility set and the degrees of freedom of the GSO S under the assumption that the perfect output covariance C y is available and S ∈ S [cf.(7)].This would shed light on the dependency of the feasibility set's structure and dimensionality (hence the difficulty of recovering S) on the number Ω of observed edges.This dependency can serve as a guideline to the number of edges we may seek to observe prior to graph inference.As we show next, the feasibility set may potentially reduce to a singleton (the graph S ∈ S is completely specified by C y ), or more generally to a low-dimensional subspace.In the latter (more interesting) case, or more pragmatically when we approximate Ĉy with the observations' sample covariance, we solve the convex optimization problem as in (6) to search for a sparse and structurally admissible GSO.
Given the (ensemble) output covariance C y , consider the mapping SC y = C y S which stems from the stationarity of y (cf. the opening discussion in Section 2.1).This identity implies that C y and S are simultaneously diagonalizable.Thus the eigenvectors V of S are known and coincide with those of C y .Given the GSO eigenvectors V, consider the mapping S = VΛV T between S and Λ.This can precisely be rewritten as vec(S) = (V ⊙ V)λ = Wλ, where ⊙ denotes the Khatri-Rao (column-wise Kronecker) product, λ ∈R N collects the diagonal entries of Λ, and W ∶= V⊙V ∈R N 2 ×N .Recall that S ∈ S and accordingly the entries of vec(S) corresponding to the diagonal of S are zero.Upon defining the set D ∶= N(i − 1)+i i ∈{1, . . ., N} , we have the mapping W D λ = 0 to the null diagonal entries of S, where W D ∈ R N×N is a submatrix of W that contains rows indexed by the set D. Thus, it follows W D is rank-deficient and λ ∈ ker(W D ), where ker(.)denotes the null-space of its matrix argument.In particular, assume that rank(W D )= N−k, 1≤ k < N, which implies λ lives in a k-dimensional subspace.As some of the entries in S are known according to S, intuitively we expect that by observing k = Ω "sufficiently different" edges, the feasible set may boil down to a singleton resulting in a unique feasible S ∈ S. To quantify the effect of the partial connectivity constraints on the size of the feasible set, let M ∶= {N(j − 1)+i (i, j) ∈ Ω} correspond to the known entries of vec(S).Then upon defining U ∈ R N×k comprising the basis vectors of ker(W D ), the condition rank(W M U) = k would be sufficient to determine S uniquely in the k-dimensional null space of W D as stated in the following proposition.Proof.Since λ ∈ ker(W D ), there exists an α ∈ R k such that λ = Uα.From the known entries of vec(S) denoted by w ∶=[vec(S)] M we have W M λ = W M Uα = w.Thus, to uniquely identify α and equivalently λ (and S), it is sufficient to have rank(W M U)= k.
Proposition 1 further implies that rank(W M ) ≥ k under the assumption that (W M U) = k.This is due to the inequality rank(W M U) ≤ min{rank(W M ), rank(U)}.Observing k "sufficiently different" edges for unique recovery of S is the intuition behind the rank constraint on W M .In real-world graphs, we have observed that k is typically much smaller than N; see also Section 4 and [6,Section 3].This would make it feasible to uniquely identify the graph, given only its eigenvectors and the status of k edges.However, in practice we may not know about the status of those many edges, or, the output covariance C y may only be imperfectly estimated via sample covariance matrix Ĉy .This motivates searching for an optimal graph while accounting for the (finite sample size) approximation errors and the prescribed structural constraints, the subject dealt with next.

Proximal gradient algorithm for batch topology identification
Exploiting the problem structure in (6), a batch proximal gradient (PG) algorithm is developed in this section to recover the network topology; see [11] for a comprehensive tutorial treatment on proximal methods.Based on this module, an online algorithm for tracking the (possibly dynamically-evolving) GSO from streaming graph signals is obtained in Section 3. PG methods have been popularized for 1 -norm regularized linear regression problems, through the class of iterative shrinkage-thresholding algorithms (ISTA); see e.g., [41][42][43].The main advantage of ISTA over off-the-shelf interior point methods is its computational simplicity.We show this desirable feature can permeate naturally to the topology identification context of this paper, addressing the open algorithmic questions in [6, Sec.IV].
To make the graph learning problem amenable to this optimization method, we dualize the constraint S Ĉy − Ĉy S 2 F ≤ in (6) and write the composite, non-smooth optimization .
The quadratic function g(⋅) is convex and M-smooth [i.e., ∇g(⋅) is M-Lipschitz continuous] and µ > 0 is a tuning parameter.Notice that ( 8) and ( 6) are equivalent optimization problems, since for each there exists a value of µ such that the respective minimizers coincide.
To derive the PG iterations, first notice that the gradient of g(S) in ( 8) has the simple form which is Lipschitz continuous with constant M =4µλ 2 max ( Ĉy ).With α > 0 and S a convex set, introduce the proximal operator of a function α f (⋅) ∶ R N×N → R evaluated at matrix M ∈ R N×N as With these definitions, the PG updates with fixed step size γ < 2 M to solve the batch problem (8) are given by (k = 1, 2, . . .denote iterations) It follows that GSO refinements are obtained through the composition of a gradient-descent step and a proximal operator.Instead of directly optimizing the composite cost in (8), the PG update rule in (11) can be interpreted as the result of minimizing a quadratic overestimator of F(S) at judiciously chosen points (here the current iterate S k ); see [12] for a detailed justification.
Evaluating the proximal operator efficiently is key to the success of PG methods.For our specific case of sparse graph learning with partial connectivity information, i.e., S as defined in and f (S) = S 1 , the proximal operator Z in (10) has entries given by The resulting entry-wise separable nonlinear map nulls the diagonal entries of S k+1 , sets the edge weights corresponding to Ω to the observed values s ij , and applies a non-negative soft-thresholding operator to update the remaining entries.Note that for symmetric S, the gradient (9) will also be a symmetric matrix.So if the algorithm is initialized with, say, a very sparse random symmetric matrix, then all subsequent iterates S k , k ≥ 1, will be symmetric without requiring extra projections.The resulting iterations are tabulated under Algorithm 1, which will serve as the basis for the online algorithm in the next section.
The computational complexity is dominated by the gradient evaluation in (9), incurring a cost of O( S k 0 N 2 ) per iteration k.The iterates S k tend to become (and remain) quite sparse at early stages of the algorithm by virtue of the soft-thresholding operations (a sparse initialization is useful to this end), hence reducing the complexity of the matrix products in (9).As k → ∞ the sequence of iterates (11) provably converges to a minimizer S ⋆ [cf.(8)]; see e.g., [44].Moreover, F(S k ) − F(S ⋆ ) → 0 due to the continuity of the composite objective function F(⋅) -a remark on the convergence rate is now in order.

Remark 1.
Results in [45] assert that convergence speedups can be obtained through the so-termed accelerated (A)PG algorithm; see [12] for specifics in the context of ISTA that are also applicable here.Without increasing the computational complexity of Algorithm 1, one can readily devise an accelerated variant with a (worst-case) convergence rate guarantee of O(1 √ ε) iterations to return an ε-optimal solution measured by the objective value F(⋅) [cf.Algorithm 1 instead offering a O(1 ε) rate].

Online network topology inference
Additional challenges arise with real-time network data collection, where analytics must often be performed "on-the-fly" and without the opportunity to revisit past graph signal observations due to e.g., staleness or storage constraints [2].Consider now online estimation of S (or even tracking S t in a dynamic setting) from streaming data {y 1 , . . ., y t , y t+1 , . ..}.To this end, a viable approach is to solve at each time instant t = 1, 2, . . ., the composite, time-varying optimization problem [cf.(8)] In writing Ĉy,t we make explicit that the covariance matrix is estimated with all signals acquired by time t.
As data come in, the covariance estimate will fluctuate and hence the time dependence of the objective function F t (S) through its smooth component g t .Notice that even as t becomes large, the squared residuals in g t remain roughly of the same order due to data averaging (rather than accumulation) in Ĉy,t .Thus a constant regularization parameter µ > 0 tends to suffice because g t (S) will not dwarf the 1 -norm term in (13).
The solution S ⋆ t of ( 13) is the batch network estimate at time t.Accordingly, a naive sequential estimation approach consists of solving (13) using Algorithm 1 repeatedly.However online operation in delay-sensitive applications may not tolerate running multiple inner PG iterations per time interval, so that convergence to S ⋆ t is attained for each t as required by Algorithm 1.If G is dynamic it may not be even prudent to obtain S ⋆ t with high precision (hence incurring high delay and unnecessary computational cost), since at time t + 1 a new datum arrives and the solution S ⋆ t+1 may deviate significantly from the prior estimate.These reasons motivate devising an efficient online and recursive algorithm to solve the time-varying optimization problem (13).We are faced with an interesting trade-off that emerges with time-constrained data-intensive problems, where a high-quality answer that is obtained slowly can be less useful than a medium-quality answer that is obtained quickly.

Algorithm construction
Our algorithm construction approach entails two steps per time instant t = 1, 2, . .., where we: (i) recursively update the observations' covariance matrix Ĉy,t in O(N 2 ) complexity; and then (ii) run a single iteration of the batch graph learning algorithm developed in Section 2.3 to solve (13) efficiently.Different from recent approaches that learn dynamic graphs from the observation of smooth signals [30,31], the resulting algorithm's memory storage requirement and computational cost per data sample y t does not grow with t.A similar idea to the one outlined in (ii) was advocated in [16] to develop an online sparse subspace clustering algorithm; see also [35] for dynamic SEM estimation from traces of information cascades.
Step (i) is straightforward, and the sample covariance Ĉy,t−1 is recursively updated once y t becomes available through a rank-one correction as follows This so-termed infinite memory sample estimate is appropriate for time-invariant settings, i.e., when the graph topology remains fixed for all t.To track dynamic graphs, it is prudent to eventually forget about past observations [cf.(14) incorporates all signals {y τ } t τ=1 ].This can be accomplished via exponentially-weighted empirical covariance estimators or by using a (fixed-length) sliding window of data, both of which can be updated recursively via minor modifications to (14).Initialization of the covariance estimate Ĉy,0 can be performed in practice via sample averaging of a few signals collected before the algorithm runs, or simply as a scaled identity matrix.
To solve (13) online by building on the insights gained from the batch solver [cf.step (ii)], we let iterations k = 1, 2, . . . in Algorithm 1 coincide with the instants t of data acquisition.In other words, at time Ψ t,D c ∈ R N 2 ×N(N−1) (the submatrix of Ψ t that contains columns indexed by the set D c ) is full column rank, then g t (S) in ( 13) is strongly convex with constant m t > 0 being the smallest (nonzero) singular value of Ψ t,D c .
Consider the eigendecomposition Ĉy,t = Vt Λt VT t of the sample covariance matrix at time t, with Vt denoting the matrix of orthogonal eigenvectors and Λt the diagonal matrix of non-negative eigenvalues.Exploiting the structure of Ψ t it follows that the strong convexity condition stated in Proposition 2 will be satisfied if (i) all eigenvalues of Ĉy,t are distinct; and (ii) the matrix Vt ○ Vt is non-singular.Recalling the covariance expression in (4), the aforementioned conditions (i)-(ii) immediately translate to properties of the diffusion filter's (squared) frequency response and the GSO eigenvectors.In extensive simulations involving several synthetic and real-world graphs, we have indeed observed that Ψ t,D c is typically full column rank and thus g t is strongly convex.
Under the strong convexity assumption, we have the following (non-asymptotic) performance guarantee for Algorithm 2. The result is adapted from [16, Theorem 1]; see Appendix B for a proof.
t+1 −S ⋆ t F capture the variability of the optimal solution of (13).If g t in (13) is strongly convex with constant m t , then for all t ≥ 1 the iterates S t generated by Algorithm 2 satisfy where In terms of the objective values, it can be shown that F ; see [43,Theorem 10.29].
As expected, Theorem 1 asserts that the higher the variability in the underlying graph, the higher the recovery performance penalty.Even if the graph G (and hence the GSO) is time invariant, then ν t will be non-zero especially for small t since the solution S * t may fluctuate due to insufficient data.Much alike classic performance results in adaptive signal processing [46], here there is misadjustment due to the "dynamics-induced noise" ν t and hence the online algorithm will only converge to within a neighborhood of S ⋆ t .
Then it follows that Lt ≤ (M − m) (M + m) < 1.The misadjustment νt (1 − Lt−1 ) in ( 17) grows with νt as expected, and will also increase when the problem is badly conditioned (M → ∞ or m → 0) because Lt → 1.If one could afford taking i t PG iterations (instead of a single one as in Algorithm 2) per time step t, the performance gains can be readily evaluated by substituting Lt = ∏ t τ=0 L i τ τ in Theorem 1 to use the bound (16).
In closing, we state dynamic regret bounds which are weaker than the performance guarantees in ( 16)-( 17), but they do not require strong convexity assumptions.The proof of the following result can be found in [16] and is omitted here in the interest of brevity.
τ=0 S ⋆ τ be the average trajectory of the optimal solutions of (13), and introduce ρ t (τ) ∶= Ŝ⋆ t − S ⋆ τ to capture the deviation of instantaneous solutions from this average trajectory.Define δ t ∶= g t+1 − g t ∞ as a measure of the variability in F t .Suppose M τ ≤ M and set the step-size γ τ = 1 M , for all τ = 0, . . ., t.Then for all t ≥ 1 the iterates S t generated by Algorithm 2 satisfy To interpret the result of Theorem 2, define ρt ∶= max τ=0,...,t−1 ρ t (τ) and δt ∶= max τ=0,...,t−2 δ τ .Using these definitions, the following simplified regret bound is obtained immediately from (18) If δt and ρt are well-behaved (i.e., the cost function changes slowly and so does its optimal solution S * t ) then, on average, F t (S t ) hovers within a constant term of F t (S * t ).The network size N and the problem conditioning (as measured by the Lipschitz constant upper bound M) also play a natural role.

Numerical Tests
Here we assess the performance of the proposed algorithms in recovering sparse real-world graphs.
To that end, we: (i) illustrate the scalability of Algorithm 1 using a relatively large-scale Facebook graph with thousands of nodes in a batch setup; (ii) evaluate the performance of the proposed online Algorithm 2 in settings with streaming signals; and (iii) demonstrate the effectiveness of Algorithm 2 in adapting to dynamical behavior of the network.
Throughout this section, we infer unweighted real-world networks from the observation of diffusion processes that are synthetically generated via graph filtering as in (2).For the graph shift S = A, the adjacency matrix of the sought network, we consider a second-order filter H = ∑ 2 l=0 h l S l , where the coefficients {h l } are drawn uniformly from [0, 1].To measure the edge-support recovery, we compute the F-measure defined as the harmonic mean of edge precision and recall (precision is the percentage of correct edges in Ŝ, and recall is the fraction of edges in S that are retrieved).

Facebook friendship graph: Offline
To evaluate the scalability of Algorithm 1, consider a directed network of N =2888 Facebook users, where the 2981 edges represent friendships among the users [47,48].More precisely, an edge from node i to node j exists if user i is a friend of the user j.To make the graph amenable to our framework, we assume that the friendships are bilateral and ignore the directions.First, we notice that rank(W D ) = 2882 (cf.Proposition 1).This means that knowing the GSO's eigenvectors and k = 6 suitably chosen edges as a priori information would lead to a singleton feasibility set.To test Algorithm 1 in recovering this reasonably large-scale graph, the performance is averaged over 10 experiments wherein we assume that we know the existence of Ω = 5 randomly sampled edges in each experiment.We then generate T = 10 3 , 10 4 , 10 5 , or 10 6 synthetic random signals {y t } T t=1 adhering to generative diffusion process in (2), where the entries of the inputs {x t } T t=1 are drawn independently from the standard Gaussian distribution yo yield stationary observations.We obtain Ĉy via a sample covariance estimate.For the aforementioned different values of T, the recovered Ŝ obtained after running Algorithm 1 results in the following F-measures.As the number of observations increase, the estimate Ĉy becomes more reliable which leads to a better performance in recovering the underlying GSO; recall that perfect support recovery corresponds to an F-measure of 1.Moreover, the reported results corroborate the effectiveness of Algorithm 1 in recovering large-scale graphs.Off-the-shelf algorithms utilized to solve related topology inference problems in [6] are not effective for graphs with more than a few hundred nodes.

Zachary's karate club: Online
Next, we consider the social network of Zachary's karate club [1, Ch. 1.2.2] represented by a graph G consisting of N = 34 nodes or members of the club and 78 undirected edges symbolizing friendships among them.We seek to infer this graph from the observation of diffusion processes that are synthetically generated via graph filtering as in (2).The rank of W D (cf.Proposition 1) for this graph is 32.This implies that the knowledge of the perfect output covariance C y leaves the GSO S in a 2-dimensional subspace which can lead to a singleton feasibility set by observing only 2 different edges.However, here we work with a noisy sample covariance Ĉy .We observe one of the 78 edges (chosen uniformly at random) and aim to infer the rest of the edges.At each time step, 10 synthetic signals {y p } are generated through diffusion process H where the entries of the inputs {x p } are drawn independently from the standard Gaussian distribution.In the online case, upon sensing 10 signals at each time step, we first update the sample covariance Ĉy,t and then carry out i t = 10 PG iterations as Algorithm 2. Also, to examine the tracking capability of the online estimator, after 5000 time steps, we remove 10% of the existing edges and add the same number of edges elsewhere.This would affect the graph filter H accordingly.
To corroborate the assumption in Theorem 1, it is worth mentioning that throughout the process we observed that Ψ t,D c was full column rank and thus the cost in (13) was strongly convex; see Proposition 2. Fig. 1-(a) depicts the running objective value F t (S t ) [cf. (13)] averaged over 10 experiments as a function of the time steps and the a priori knowledge -3 randomly chosen edges.We also superimpose Fig. 1-(a) with the optimal objective value F t (S ⋆ t ) at each time step.First, we notice that the objective value trajectory converges to a region above the optimal trajectory.Also, we observe that after 5000 iterations, the performance deteriorates at first due to the sudden change of the network structure, but after observing enough new samples Algorithm 2 can adapt and track the batch estimator as well.This demonstrates the effectiveness of the developed online algorithm when it comes to adapting to network perturbations.
Finally, we study the quality of the online learned graph S t at iteration 5000.Fig. 1-(b) depicts the heat maps of the ground-truth and inferred adjacency matrices for different a priori information.Although the procedure results in a slight gap between F t (S ⋆ t ) and F t (S t ), it still reveals the underlying support of A with reasonable accuracy.Interestingly, we notice that an edge with lower betweenness centrality [e.g., (6,17) and (15, 34) compared to (2,4)] is more informative to identify the network topology; see also [14].

Conclusions
We studied the problem of identifying the topology of an undirected network from streaming observations of stationary signals diffused on the unknown graph.This is an important problem, because as a general principle network structural information can be used to understand, predict, and control the behavior of complex systems.The stationarity assumption implies that the observations' covariance matrix and the GSO commute under mild requirements.This motivates formulating the topology inference task as an inverse problem, whereby one searches for a (e.g., sparse) GSO that is structurally admissible and approximately commutes with the observations' empirical covariance matrix.For streaming data said covariance can be updated recursively, and we show online proximal gradient iterations can be brought to bear to efficiently track the time-varying solution of the inverse problem with quantifiable recovery guarantees.Ongoing work includes extensions of the online graph learning framework to observations of streaming signals that are smooth on the sought network.where the first inequality is due to the Lipschitz continuity of g t (⋅) with constant M t =4µλ 2 max ( Ĉy,t ) along with the strong convexity condition (A1).The second inequality holds since M t ≥ m t .
Noting that S * t is a fixed point of (15), we have

Proposition 1 .
Suppose the GSO eigenvectors V are given.If rank(W D )= N−k and rank(W M U)= k, then S is a singleton.

Figure 1 .
Figure 1.Recovery of Zachary's karate club graph with N = 34 nodes.(a) Evolution of the objective values for the online and batch estimators in inferring a karate club.(b) Ground truth adjacency matrix and corresponding estimates with different a priori information on the connectivities attained after 5000 time steps.
S − γ t ∇g t (S) − S ′ − γ t ∇g t S ′ 2 F ≤ 1 − 2γ t m t + γ 2 t M 2 t+1 − S * t F = prox γ t ⋅ 1 ,S S t − γ t ∇g t (S t ) − prox γ t ⋅ 1 ,S S * t − γ t ∇g t (S * t ) F ≤ S t − γ t ∇g t (S t ) − [S * t − γ t ∇g t (S * t )] F ≤ L t S t − S * t F ,where the first inequality is due to the nonexpansiveness of the proximal operator and the second one follows from (A2).Leveraging the triangle inequality and the definition ν t = S * t+1 −S (16)F results in S t+1 − S * t+1 F ≤ S t+1 − S * t F + S * t+1 − S * t F ≤ L t S t − S * t F + ν t ,which can be applied recursively to obtain(16).