Spectrum-Adapted Polynomial Approximation for Matrix Functions with Applications in Graph Signal Processing

We propose and investigate two new methods to approximate f (A)b for large, sparse, Hermitian matrices A. Computations of this form play an important role in numerous signal processing and machine learning tasks. The main idea behind both methods is to first estimate the spectral density of A, and then find polynomials of a fixed order that better approximate the function f on areas of the spectrum with a higher density of eigenvalues. Compared to state-of-the-art methods such as the Lanczos method and truncated Chebyshev expansion, the proposed methods tend to provide more accurate approximations of f (A)b at lower polynomial orders, and for matrices A with a large number of distinct interior eigenvalues and a small spectral width. We also explore the application of these techniques to (i) fast estimation of the norms of localized graph spectral filter dictionary atoms, and (ii) fast filtering of time-vertex signals.


Introduction
Efficiently computing f (A)b, a function of a large, sparse Hermitian matrix times a vector, is an important component in numerous signal processing, machine learning, applied mathematics, and computer science tasks. Application examples include graph-based semi-supervised learning methods [1][2][3]; graph spectral filtering in graph signal processing [4]; convolutional neural networks/deep learning [5,6]; clustering [7,8]; approximating the spectral density of a large matrix [9]; estimating the numerical rank of a matrix [10,11]; approximating spectral sums such as the log-determinant of a matrix [12] or the trace of a matrix inverse for applications in physics, biology, information theory, and other disciplines [13]; solving semidefinite programs [14]; simulating random walks [15] (Chapter 8); and solving ordinary and partial differential equations [16][17][18].
References [19] (Chapter 13), [20][21][22] survey different approaches to this well-studied problem of efficiently computing f(A)b. Throughout, we write the spectral decomposition of the Hermitian matrix A ∈ R^{N×N} as A = VΛV*, where the columns of V are the eigenvectors of A, and Λ is a diagonal matrix whose diagonal elements are the corresponding eigenvalues of A, which we denote by λ_1 ≤ λ_2 ≤ ⋯ ≤ λ_N. In Section 4, we perform numerical experiments, approximating f(A)b for different matrices A and functions f, and discuss the situations in which the proposed methods work better than the state-of-the-art methods. In Sections 5 and 6, we explore the application of the proposed technique to fast estimation of the norms of localized graph spectral filter dictionary atoms and fast filtering of time-vertex signals, respectively.

Spectral Density Estimation
The cumulative spectral density function or empirical spectral cumulative distribution of the matrix A is defined as

P_λ(z) := (1/N) ∑_{n=1}^{N} 1{λ_n ≤ z},   (5)

where 1{C} = 1 if statement C is true and 1{C} = 0 otherwise. The spectral density function [36] (Chapter 6) (also called the Density of States or empirical spectral distribution [37] (Chapter 2.4)) of A is the probability measure defined as

p_λ(z) := (1/N) ∑_{n=1}^{N} δ(z − λ_n),

where δ is the Dirac delta. Lin et al. [9] provide an overview of methods to approximate these functions. In this work, we use a variant of the Kernel Polynomial Method (KPM) [38][39][40] described in [9,41] to estimate the cumulative spectral density function P_λ(z) of A. Namely, for each of S linearly spaced points {ξ_i}_{i=1}^{S} between λ_min and λ_max, we estimate the number of eigenvalues less than or equal to ξ_i via stochastic trace estimation [42,43]. Let x denote a Gaussian random vector with distribution N(0, I), let {x^{(j)}}_{j=1}^{J} denote a sample of size J from this distribution, and let Θ̃_{ξ_i} denote a Jackson-Chebyshev polynomial approximation to Θ_{ξ_i}(λ) := 1{λ ≤ ξ_i} [44,45]. The stochastic trace estimate of the number of eigenvalues less than or equal to ξ_i is then given by

η_i := (1/J) ∑_{j=1}^{J} (x^{(j)})⊤ Θ̃_{ξ_i}(A) x^{(j)}.   (6)

As in [46], we then form an approximation P̂_λ(z) to P_λ(z) by performing monotonic piecewise cubic interpolation [47] on the series of points {(ξ_i, η_i/N)}_{i=1,2,...,S}. Analytically differentiating P̂_λ(z) yields an approximation p̂_λ(z) to the spectral density function p_λ(z). Since P̂_λ(z) is a monotonic cubic spline, we can also analytically compute its inverse function P̂_λ^{−1}(y). The spectrum-adapted methods we propose in Section 3 utilize both p̂_λ(z) and P̂_λ^{−1}(y) to focus on regions of the spectrum with higher eigenvalue density when generating polynomial approximations.
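As a concrete illustration, the stochastic eigenvalue-counting step can be sketched in Python/NumPy as follows. This is a toy sketch rather than the authors' implementation: the function names are ours, and the Jackson damping factors follow one common form from the KPM literature.

```python
import numpy as np

def jackson_cheb_step_coeffs(xi, K):
    """Chebyshev coefficients of the step function 1{x <= xi} on [-1, 1],
    multiplied by Jackson damping factors (one common form from the KPM
    literature) to suppress Gibbs oscillations."""
    phi = np.arccos(xi)
    k = np.arange(K + 1)
    c = np.empty(K + 1)
    c[0] = 1.0 - phi / np.pi                   # the c_0/2 term, folded in
    c[1:] = -2.0 * np.sin(k[1:] * phi) / (np.pi * k[1:])
    q = np.pi / (K + 2)                        # Jackson damping factors g_k
    g = ((K + 2 - k) * np.sin(q) * np.cos(k * q)
         + np.cos(q) * np.sin(k * q)) / ((K + 2) * np.sin(q))
    return c * g

def estimate_eig_count(A, lmin, lmax, xi, K=50, J=50, rng=None):
    """Hutchinson-style estimate of #{eigenvalues of A <= xi}: apply the
    damped Chebyshev approximation of the step function to random Gaussian
    probe vectors (Hermitian A with spectrum inside [lmin, lmax])."""
    rng = np.random.default_rng(rng)
    N = A.shape[0]
    a, h = (lmax + lmin) / 2.0, (lmax - lmin) / 2.0   # map spectrum to [-1, 1]
    c = jackson_cheb_step_coeffs((xi - a) / h, K)
    total = 0.0
    for _ in range(J):
        x = rng.standard_normal(N)
        t_prev, t_cur = x, (A @ x - a * x) / h        # T_0 x and T_1 x
        acc = c[0] * x + c[1] * t_cur
        for k in range(2, K + 1):                     # three-term recurrence
            t_prev, t_cur = t_cur, 2.0 * (A @ t_cur - a * t_cur) / h - t_prev
            acc += c[k] * t_cur
        total += x @ acc                              # x^T p(A) x
    return total / J
```

For a diagonal test matrix with 100 equispaced eigenvalues on [0, 2], the estimate at ξ = 1 lands near the true count of 50.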
Figure 2 shows examples of the estimated cumulative spectral density functions for eight real, symmetric matrices A: the graph Laplacians of the Erdös-Renyi graph (gnp) from Figure 1, the Minnesota traffic network [48] (N = 2642), and the Stanford bunny graph [49] (N = 2503); the normalized graph Laplacian of a random sensor network (N = 5000) from [50]; and the net25 (N = 9520), si2 (N = 769), cage9 (N = 3534), and saylr4 (N = 3564) matrices from the SuiteSparse Matrix Collection [51]. (We use (A + A⊤)/2 for cage9, and for net25 and saylr4, we generate graph Laplacians based on the off-diagonal elements of A. For saylr4, we scale the entire Laplacian by a factor of 1/2000.) The computational complexity of forming the estimate P̂_λ(z) is O(MJK_Θ), where M is the number of nonzero entries in A, J is the number of random vectors in (6) (in our experiments, J = 10 suffices), and K_Θ is the degree of the Jackson-Chebyshev polynomial approximations Θ̃_{ξ_i} [41]. While this cost is non-negligible if computing f(A)b for a single f and a single b, it only needs to be computed once for each A if repeating this calculation for multiple functions f or multiple vectors b, as is often the case in the applications mentioned above.

Figure 2. Estimated and actual cumulative spectral density functions for eight real, symmetric matrices A. We estimate the eigenvalue counts for S = 10 linearly spaced points on [λ_min, λ_max] via (6), with degree K_Θ = 30 polynomials and J = 10 random vectors x^{(j)}.

Spectrum-Adapted Methods
In this section, we introduce two new classes of degree K polynomial approximations p_K(A)b to f(A)b, both of which leverage the estimated cumulative spectral density function P̂_λ(z).

Spectrum-Adapted Polynomial Interpolation
In the first method, we take y_k := (cos(kπ/K) + 1)/2, for k = 0, 1, …, K, which are the K + 1 extrema of the degree K Chebyshev polynomial, shifted to the interval [0, 1]. We then warp these points via the inverse of the estimated cumulative spectral density function by setting x_k = P̂_λ^{−1}(y_k), before finding the unique degree K polynomial interpolation through the points {(x_k, f(x_k))}_{k=0,1,...,K}. The intuition behind warping the interpolation points is that (i) a better approximation is attained in areas of the spectrum (domain) with more interpolation points, (ii) the error in (3) only depends on the errors at the eigenvalues of A, so we would like the approximation to be best in regions with many eigenvalues, and (iii) as shown in Figure 3, using the inverse of the estimated cumulative spectral density function as the warping function leads to a higher density of the warped points {x_k} falling in higher density regions of the spectrum of A. Thus, the warping should ideally lead to more interpolation points in high density regions of the spectrum, better approximation of the target function in these regions, and, in turn, a reduction in the error (3), as compared to interpolations generated from points spread more evenly across the spectrum.
To find the (unique) degree K polynomial interpolation, our numerical implementation uses MATLAB's polyfit function, which centers and scales the data and then solves the resulting system of equations via a QR decomposition. Once the interpolating polynomial coefficients are attained, p K (A)b can be computed, e.g., via (2) or by representing the interpolating polynomial as a linear combination of Chebyshev polynomials and using Chebyshev coefficients in the associated three-term recurrence [23,32]. This entire procedure is detailed in Algorithm 1.

Algorithm 1 Spectrum-adapted polynomial interpolation.
Input Hermitian matrix A ∈ R^{N×N}, vector b ∈ R^N, function f, polynomial degree K, and P̂_λ^{−1}, the inverse of an estimate of the cumulative spectral density of A
Output p_K(A)b ≈ f(A)b
1: for k in 0 : K do
2:   y_k ← (cos(kπ/K) + 1)/2
3:   x_k ← P̂_λ^{−1}(y_k)
4: end for
5: Find the (unique) degree K polynomial interpolation p_K through the points {(x_k, f(x_k))}_{k=0,1,...,K}
6: Compute p_K(A)b via (2)
7: return p_K(A)b
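Assuming the estimated CDF is available as sampled values on a grid (so that its inverse can be applied by monotone linear interpolation, a simplification of the cubic spline above), the warping and interpolation steps of Algorithm 1 can be sketched as follows; the function name is ours, and NumPy's polyfit stands in for MATLAB's.

```python
import numpy as np

def spectrum_adapted_interp(f, z_grid, cdf_vals, K):
    """Warp the K+1 Chebyshev extrema on [0, 1] through the inverse of a
    sampled cumulative spectral density estimate (cdf_vals, increasing,
    sampled at z_grid), then interpolate f at the warped points.
    Returns the nodes x_k and polynomial coefficients (highest degree first)."""
    y = (np.cos(np.arange(K + 1) * np.pi / K) + 1.0) / 2.0  # Chebyshev extrema
    x = np.interp(y, cdf_vals, z_grid)                      # x_k = P^{-1}(y_k)
    coeffs = np.polyfit(x, f(x), K)                         # unique interpolant
    return x, coeffs
```

With eigenvalues clustered near the bottom of the spectrum, most of the warped nodes x_k fall inside the cluster, as intended.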

Spectrum-Adapted Polynomial Regression/Orthogonal Polynomial Expansion
A second approach is to solve the weighted least squares polynomial regression problem

p*_K = argmin_{p_K} ∑_{m=1}^{M} w_m [f(x_m) − p_K(x_m)]²,   (7)

where the abscissae {x_m}_{m=1,2,...,M} and weights {w_m}_{m=1,2,...,M} are chosen to capture the estimated spectral density function. We investigated several methods to set the points (e.g., linearly spaced points; Chebyshev points on the interval [λ_min, λ_max]; Chebyshev points on each subinterval [ξ_i, ξ_{i+1}]; and warped points via the inverse of the estimated cumulative spectral density function, as in Section 3.1) and the weights (e.g., the analytically computed estimate p̂_λ of the spectral density function; a discrete estimate of the spectral density function based on the eigenvalue counts in (6); the original KPM density of states method based on a truncated Chebyshev expansion [9] (Equation 3.11); or equal weights for warped points). Without going into extensive detail about these various options, we remark that choosing abscissae {x_m} that are very close to each other, which may occur when using points warped by the inverse of the estimated density function, may lead to numerical instabilities when solving the weighted least squares problem. In the numerical experiments, we use M evenly spaced points on the interval [λ_min, λ_max] (i.e., x_m = ((m−1)/(M−1))(λ_max − λ_min) + λ_min), and set the weights to be w_m = p̂_λ(x_m). To solve (7), we use a customized variant of MATLAB's polyfit function that solves, again via QR decomposition, the normal equations of the weighted least squares problem:

Ψ_x⊤ W Ψ_x c = Ψ_x⊤ W y,

where Ψ_x is the Vandermonde matrix associated with the points {x_m}, W is a diagonal matrix with diagonal elements equal to the weights {w_m}, y is a column vector with entries equal to {f(x_m)}, and c is the vector of unknown polynomial coefficients. Once the coefficients c of the optimal polynomial p*_K are attained, p*_K(A)b can once again be computed, e.g., via (2).
A summary of this method is detailed in Algorithm 2. (As pointed out by an anonymous reviewer, since our estimate p̂_λ of the spectral density function is a piecewise quadratic function, the discrete optimization problem (7) could be replaced by its continuous analogue, min_{p_K} ∫ [f(x) − p_K(x)]² p̂_λ(x) dx, which can be solved analytically for many functions f(·).)

Algorithm 2 Spectrum-adapted polynomial regression.

Input Hermitian matrix A ∈ R^{N×N}, vector b ∈ R^N, function f, polynomial degree K, number of abscissae M, and an estimate p̂_λ of the spectral density of A
Output p*_K(A)b ≈ f(A)b
1: for m in 1 : M do
2:   x_m ← ((m−1)/(M−1))(λ_max − λ_min) + λ_min
3:   w_m ← p̂_λ(x_m)
4: end for
5: Solve (7) for the degree K polynomial p*_K, e.g., via the normal equations
6: Compute p*_K(A)b via (2)
7: return p*_K(A)b
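A minimal sketch of the weighted fit, with the square-root-weighted Vandermonde system solved by a QR/SVD-based least squares routine (the function name is ours; NumPy stands in for MATLAB's polyfit):

```python
import numpy as np

def weighted_polyfit(x, y, w, K):
    """Weighted least squares polynomial fit:
    minimize sum_m w_m (y_m - p(x_m))^2 over degree-K polynomials p.
    Rows of the Vandermonde system are scaled by sqrt(w_m), then the
    scaled system is solved by a least squares routine."""
    sw = np.sqrt(np.asarray(w, dtype=float))
    V = np.vander(np.asarray(x, dtype=float), K + 1)  # columns: x^K, ..., x, 1
    c, *_ = np.linalg.lstsq(V * sw[:, None], np.asarray(y, dtype=float) * sw,
                            rcond=None)
    return c  # highest degree first, compatible with np.polyval
```

When the weights are all equal, this reduces to an ordinary polynomial least squares fit.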
An alternative way to view this weighted least squares method [52] is as a truncated expansion in polynomials orthogonal with respect to the discrete measure dλ_M with finite support at the points {x_m}, and the associated inner product [53] (Section 1.1)

⟨f, g⟩ := ∑_{m=1}^{M} w_m f(x_m) g(x_m).

The M discrete monic orthogonal polynomials {π_{k,M}}_{k=0,1,...,M−1} satisfy the three-term recurrence relation [53,54]

π_{k+1,M}(x) = (x − α_k) π_{k,M}(x) − β_k π_{k−1,M}(x),   (8)

with π_{0,M}(x) = 1, π_{−1,M}(x) = 0, α_k = ⟨x π_{k,M}, π_{k,M}⟩ / ⟨π_{k,M}, π_{k,M}⟩, and β_k = ⟨π_{k,M}, π_{k,M}⟩ / ⟨π_{k−1,M}, π_{k−1,M}⟩. In matrix-vector notation, the vectors π_{k,M} ∈ R^M, which are the discrete orthogonal polynomials evaluated at the M abscissae, can be computed iteratively by the same relation. Figure 4 shows an example of these discrete orthogonal polynomials.

Figure 4. Comparison of (top) the first six discrete orthogonal polynomials defined in (8), adapted to the estimated cumulative spectral density of the random Erdös-Renyi graph from Figures 1-3, to (bottom) the first six standard shifted Chebyshev polynomials with degree k = 0 to 5. In the region of high spectral density, shown in the zoomed boxes on the right, the discrete orthogonal polynomials feature more oscillation while maintaining small amplitudes, enabling better approximation of smooth functions in this region.

Finally, the degree K polynomial approximation to f(A)b is computed as

p_K(A)b = ∑_{k=0}^{K} ( ⟨f, π_{k,M}⟩ / ⟨π_{k,M}, π_{k,M}⟩ ) π_{k,M}(A) b.

Before proceeding to numerical experiments, we briefly comment on the relationship between the spectrum-adapted approximation proposed in this section and the Lanczos approximation to f(A)b, which is given by [19] (Section 13.2), [23]

f(A)b ≈ ||b||_2 Q_K f(T_K) e_1,   (9)

where Q_K is an N × (K + 1) matrix whose columns form an orthonormal basis for the Krylov subspace span{b, Ab, …, A^K b}, and T_K = Q_K* A Q_K. The approximation (9) can also be written as q_K(A)b, where q_K is the degree K polynomial that interpolates the function f at the K + 1 eigenvalues of T_K [19] (Theorem 13.5), [55]. Thus, unlike classical polynomial approximation methods, such as the truncated Chebyshev expansion, the Lanczos method is indirectly adapted to the spectrum of A.
The Lanczos method differs from the proposed method in that T_K and the Lanczos approximating polynomial q_K depend on the initial vector b. Specifically, the polynomials {π_k} generated from a three-term recurrence of the form (8), with the {α_k}_{k=0,1,...,K} and {γ_k}_{k=1,2,...,K} coefficients taken from the diagonal and superdiagonal entries of T_K, respectively, are orthogonal with respect to the piecewise-constant measure

µ(x) = ∑_{n: λ_n ≤ x} b̂(n)²,

where b̂ := V*b/||b||_2. If b̂ happens to be a constant vector, then µ(x) = P_λ(x) from (5). If A is a graph Laplacian, b̂ is the graph Fourier transform [4] of b, normalized to have unit energy.
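The three-term recurrence of the form (8) can be realized numerically by the Stieltjes procedure, which computes the recurrence coefficients from the discrete inner product as it goes; a toy sketch (names ours):

```python
import numpy as np

def discrete_orthogonal_polys(x, w, K):
    """Evaluate the first K+1 monic polynomials orthogonal w.r.t. the
    discrete measure with atoms x_m and weights w_m, via the Stieltjes
    procedure and the three-term recurrence
        pi_{k+1}(x) = (x - alpha_k) pi_k(x) - beta_k pi_{k-1}(x).
    Returns a (K+1, M) array: row k holds pi_k evaluated at x."""
    P = np.zeros((K + 1, x.size))
    P[0] = 1.0
    if K >= 1:
        alpha0 = (w * x).sum() / w.sum()         # <x pi_0, pi_0> / <pi_0, pi_0>
        P[1] = x - alpha0
    for k in range(1, K):
        nk = (w * P[k] ** 2).sum()               # <pi_k, pi_k>
        nkm1 = (w * P[k - 1] ** 2).sum()         # <pi_{k-1}, pi_{k-1}>
        alpha = (w * x * P[k] ** 2).sum() / nk
        beta = nk / nkm1
        P[k + 1] = (x - alpha) * P[k] - beta * P[k - 1]
    return P
```

A quick sanity check is that the Gram matrix of the returned rows under the discrete inner product is (numerically) diagonal.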

Numerical Examples and Discussion
In Figure 5, for different functions f(λ) and matrices A, we approximate f(A)b with b = V1 and polynomial approximation orders ranging from K = 3 to 25. We use the KPM, as shown in Figure 2, to estimate the cumulative spectral density function P̂_λ(z), with parameters S = 10, J = 10, and K_Θ = 30. Based on the analytical derivative and inverse function of P̂_λ(z), we obtain the two proposed spectrum-adapted polynomial approximations for f(λ), before computing each p_K(A)b via the corresponding three-term recursion. We compare the proposed methods to the truncated Chebyshev expansion and the Lanczos method with the same polynomial order. Note that when b is a constant vector in the spectral domain of A, the relative error ||f(A)b − p_K(A)b||_2 / ||f(A)b||_2 reduces to [∑_n (f(λ_n) − p_K(λ_n))²]^{1/2} / [∑_n f(λ_n)²]^{1/2}, the numerator of which is the discrete least squares objective mentioned in Section 1.
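The identity between the vector-domain relative error and this spectral-domain expression is easy to check numerically on a small random symmetric matrix (a toy check; the degree-8 fit below is arbitrary, since the identity holds for any polynomial):

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((60, 60))
A = (B + B.T) / 2.0                      # random symmetric test matrix
lam, V = np.linalg.eigh(A)
b = V @ np.ones(60)                      # b = V 1: constant in the spectral domain

f = np.exp
grid = np.linspace(lam[0], lam[-1], 40)
c = np.polyfit(grid, f(grid), 8)         # some degree-8 polynomial approximation

f_Ab = V @ (f(lam) * (V.T @ b))          # f(A) b via the eigendecomposition
p_Ab = V @ (np.polyval(c, lam) * (V.T @ b))
lhs = np.linalg.norm(f_Ab - p_Ab) / np.linalg.norm(f_Ab)
rhs = np.sqrt(((f(lam) - np.polyval(c, lam)) ** 2).sum() / (f(lam) ** 2).sum())
```

Up to rounding, `lhs` and `rhs` agree, regardless of how good or bad the polynomial fit is.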
The first column of Figure 5 displays the errors at all eigenvalues of A for each order 10 polynomial approximation of f(λ) = e^{−λ}. The second column examines the convergence of relative errors in approximating e^{−A}b for matrices with various spectral distributions, for each of the four methods.
In the field of graph signal processing [4], it is common to analyze or modify a graph signal b ∈ R^N, where b(i) is the value of the graph signal at vertex i of a weighted, connected graph G with N vertices, by applying a graph spectral filter f_l. The filtered signal is exactly the product (1) of a function of the graph Laplacian (or some other symmetric matrix) and a vector, the graph signal b. In Figure 6, we show examples of collections of such functions, commonly referred to as graph spectral filter banks. In the right three columns of Figure 5, we examine the relative errors incurred by approximating f_l(A)b for the lowpass, bandpass, and highpass graph spectral filters f_1, f_3, and f_5 shown in Figure 6a. We expand on the applications of such graph spectral filters in Section 5.
We make two observations based on the numerical examples:

1. The spectrum-adapted interpolation method often works well for low degree approximations (K ≤ 10), but is not very stable at higher orders due to overfitting of the polynomial interpolant to the specific K + 1 interpolation points (i.e., the interpolant is highly oscillatory).

2. The proposed spectrum-adapted weighted least squares method tends to outperform the Lanczos method for matrices such as si2 and cage9 that have a large number of distinct interior eigenvalues.

Figure 6. (a) A set of five graph spectral filters [46,57]. (b) A set of four log-warped translates (also called octave-band or wavelet filters) [46]. We plot both sets of spectral graph filters on the spectrum of the Stanford bunny graph with N = 2503 vertices; however, the design only depends on the spectral range [0, λ_max], so the filters will look the same (except stretched) for all graphs considered, e.g., in Figure 5.

Application I: Estimation of the Norms of Localized Graph Spectral Filter Dictionary Atoms
A common method to extract information from data residing on a weighted, undirected graph is to represent the graph signal as a linear combination of building block graph signals called atoms, a collection of which is called a dictionary. In this section, we consider localized spectral graph filter dictionaries of the form D = {ϕ_{i,l}}_{i=1,2,...,N; l=1,2,...,L}, where each atom is defined as ϕ_{i,l} := f_l(L)δ_i, with L being the graph Laplacian and δ_i having a value of 1 at vertex i and 0 elsewhere. Each atom can be interpreted as the result of localizing a spectral pattern characterized by the filter function f_l so that it is centered at vertex i in the graph. See [58] for more details about localized spectral graph filter dictionaries and their applications as transforms and regularizers in myriad signal processing and machine learning tasks.
For large, sparse graphs, the dictionary atoms are never explicitly computed; rather, their inner products with the graph signal are approximated by ⟨b, ϕ_{i,l}⟩ = δ_i⊤ f_l(L)b ≈ δ_i⊤ p_{l,K}(L)b, using polynomial approximation methods such as those described in Section 1 or those proposed in this work. However, in graph signal processing applications such as thresholding for denoising or compression [58][59][60] or non-uniform random sampling and interpolation of graph signals [41,45,58,61], it is often important to form a fast estimate of the norms of the dictionary atoms, {||ϕ_{i,l}||_2}_{i,l}. Since ||ϕ_{i,l}||_2² = ⟨ϕ_{i,l}, ϕ_{i,l}⟩ = δ_i⊤ f_l²(L)δ_i is a bilinear form of the type u⊤f(A)u, the norm of a single atom can be estimated via quadrature methods such as Lanczos quadrature [53] (Ch. 3.1.7), [56] (Ch. 7), [62]; however, doing this for all NL atoms is not computationally tractable, since each requires a different combination of function and starting vector. Other alternatives include the methods discussed in [63] for estimating diagonal elements of a matrix that is not explicitly available but for which matrix-vector products are easy to evaluate, as ||ϕ_{i,l}||_2² = (f_l²(L))_{i,i}. In this application example, we estimate the norms of the dictionary atoms through the products of matrix functions with random vectors, as follows. Let x be a random vector with each component having an independent and identical standard normal distribution (in fact, we only utilize the property that the random components have unit variance). Then we have

sd(⟨ϕ_{i,l}, x⟩) = ||ϕ_{i,l}||_2,

since ⟨ϕ_{i,l}, x⟩ has mean zero and variance ∑_n ϕ_{i,l}(n)². Thus, to estimate ||ϕ_{i,l}||_2, it suffices to estimate sd(⟨ϕ_{i,l}, x⟩) = sd(δ_i⊤ f_l(L)x) ≈ sd(δ_i⊤ p_{l,K}(L)x) for a degree K polynomial approximation p_{l,K} to f_l. We therefore define each atom norm estimate as the sample standard deviation

ν_{i,l} := √( (1/(J−1)) ∑_{j=1}^{J} [ δ_i⊤ p_{l,K}(L)x^{(j)} − (1/J) ∑_{j'=1}^{J} δ_i⊤ p_{l,K}(L)x^{(j')} ]² ),   (10)

where each x^{(j)} is a realization of the random vector x.
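The estimator (10) can be sketched end-to-end on a small toy matrix, where f_l(L) can still be formed exactly by eigendecomposition for comparison (in practice one only forms the matrix-vector products p_{l,K}(L)x^{(j)}; all names below are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
N, J = 80, 400
B = rng.standard_normal((N, N))
L = B @ B.T / N                          # toy symmetric PSD stand-in for a Laplacian
lam, V = np.linalg.eigh(L)

f = lambda lm: np.exp(-lm)               # a toy lowpass filter
fL = (V * f(lam)) @ V.T                  # exact f(L); feasible only for small N
exact_norms = np.linalg.norm(fL, axis=1) # ||f(L) delta_i||_2 (fL is symmetric)

# nu_i: sample standard deviation of <phi_i, x^(j)> = (f(L) x^(j))_i over j
X = rng.standard_normal((N, J))          # J random probe vectors
est_norms = (fL @ X).std(axis=1, ddof=1)
```

With J = 400 probes, the estimates concentrate around the exact norms with a mean relative error of a few percent.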
In the numerical experiments, we compare the estimates resulting from Chebyshev polynomial approximation to those resulting from spectrum-adapted weighted least squares polynomial approximation. As a computational aside, in the process of estimating the spectral density via the KPM in (6), we have already computed T̃_k(L)x^{(j)} for each k = 0, 1, …, K and each j, where the T̃_k are the Chebyshev polynomials shifted to the interval [0, λ_max]. From these quantities, we can easily compute the p_{l,K}(L)x^{(j)} vectors in (10) for different values of l (different filters). See [41] (Sec. III.B.1) for details.
In the top row of Figure 7, we demonstrate the estimation of the norms of the atoms of a spectral graph wavelet dictionary generated by localizing the four filters in Figure 6b to each of the vertices of the bunny graph. Figure 7a shows the exact norms of the NL = 2503 · 4 = 10012 dictionary atoms. In Figure 7b, we plot the estimated norms of the atoms, generated via degree K = 8 spectrum-adapted weighted least squares polynomial approximation with J = 50 random vectors in (10), against the actual atom norms. The ratios of the estimated atom norms to the actual norms are shown in Figure 7c.
We show the mean of the relative error |ν_{i,l}/||ϕ_{i,l}||_2 − 1| across all of these atoms as a single point in Figure 7d, and also repeat this experiment with different values of K and J and different classes of approximating polynomials, as well as for different graphs, in Figure 7e-f. On all three graphs and at both values of J, for low degrees K, the estimates generated from the spectrum-adapted polynomial least squares method have lower mean relative error than those based on Chebyshev polynomial approximation. While these examples are on small to medium graphs for comparison to the exact atom norms, the method scales efficiently to dictionaries designed for large, sparse graphs.

Figure 7. (d-f) The mean relative errors across all atoms for dictionaries generated by the same set of filters on three different graphs, with varying polynomial approximation methods, polynomial degrees K, and numbers of random vectors J.

Application II: Fast Filtering of Time-Vertex Signals
In this section, we demonstrate the use of the spectrum-adapted approximation methods of Section 3 to accelerate the joint filtering of time-vertex signals in both the time and graph spectral domains [50,64,65].

Time-Vertex Signals
We consider a weighted, undirected graph G = {V, E , W G } with N vertices, where V is the set of vertices, E is the set of edges, and W G is the weighted adjacency matrix. The combinatorial graph Laplacian is defined as L G := D G − W G , where D G is diagonal with D G (i, i) equal to the degree of the ith vertex. At each vertex, we observe a time series of T observations. Thus, the time-varying graph signal can be represented as a matrix X ∈ R N×T , where X i,j is the value on the ith vertex at the jth time. Figure 8 shows an example of a time-vertex signal.

Time-Vertex Filtering
Fixing a point in time, each column of X is a graph signal defined on G. Let x_{t_j} ∈ R^{N×1} denote the jth column of X, j = 1, ⋯, T. Based on the graph structure of G, we can perform high-dimensional graph signal processing tasks on x_{t_j}, such as filtering, denoising, inpainting, and compression [4]. In particular, for a filter g : σ(L_G) → C defined on the eigenvalues of L_G, the graph spectral filtering of x_{t_j} can be computed as

g(L_G) x_{t_j} = U_G g(Λ_G) U_G* x_{t_j},

where L_G = U_G Λ_G U_G* is the spectral decomposition of L_G. Conversely, focusing on one vertex of G, the ith row of X is a discrete time signal x_{v_i} ∈ R^{1×T}, which indicates how the signal value changes over time on the ith vertex. We can compute the one-dimensional discrete Fourier transform (DFT) of x_{v_i} as x_{v_i} Ū_R, where U_R is the normalized DFT matrix of size T, and Ū_R is its complex conjugate. The DFT converts a signal from the time domain to the frequency domain, and allows for the amplification or attenuation of different frequency components of the signal. This process is referred to as frequency filtering in classical signal processing [4]. Frequency filtering of classical one-dimensional signals is equivalent to graph spectral filtering of graph signals on a ring graph [66] (Theorem 5.1). Let L_R denote the graph Laplacian of a ring graph with T vertices. Its spectral decomposition gives L_R = U_R Λ_R U_R*, where U_R comprises the DFT basis vectors (normalized to have length 1), and the kth eigenvalue is given by Λ_R(k, k) = 2 − 2cos(2π(k−1)/T), for k = 1, ⋯, T.
The joint time-vertex Fourier transform of a time-varying graph signal X is defined as the combination of a graph Fourier transform and a DFT [50]:

X̂ := U_G* X Ū_R.

A joint time-vertex filter h : σ(L_G) × σ(L_R) → C is defined for all combinations of (λ_G, λ_R), where λ_G is an eigenvalue of L_G and λ_R is an eigenvalue of L_R. The joint time-vertex filtering is defined accordingly as

h(L_G, L_R) X := U_G (H ∘ X̂) U_R⊤,   (11)

where H ∈ R^{N×T} has entries H_{i,j} = h(λ_{G_i}, λ_{R_j}), and ∘ denotes the entry-wise product of two matrices. Figure 9 shows two examples of time-vertex filters: an ideal lowpass filter (12) and a wave filter (13), where λ_{Gmax} is the largest eigenvalue of L_G. The eigenvalues of L_R range from 0 to 4 regardless of the size of T, so (1/2)λ_{Rmax} = 2. As a two-dimensional extension of spectral filtering, the time-vertex filtering decomposes the input signal into NT orthogonal components, where each component corresponds to an outer product of a graph Laplacian eigenvector and a DFT basis vector. Then, the components are amplified or attenuated by the corresponding scalars h(λ_G, λ_R). Finally, the scaled components are added together to obtain the filtered signal.
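For a small graph, the joint filtering can be computed exactly, with an eigendecomposition in the vertex domain and the FFT in the time domain; a sketch (the function name and the particular unitary DFT normalization are ours):

```python
import numpy as np

def joint_filter(X, LG, h):
    """Exact joint time-vertex filtering: take the joint Fourier transform,
    multiply entrywise by H(i, j) = h(lamG_i, lamR_j), and transform back.
    Feasible only for small graphs, since it eigendecomposes LG."""
    N, T = X.shape
    lamG, UG = np.linalg.eigh(LG)
    lamR = 2.0 - 2.0 * np.cos(2.0 * np.pi * np.arange(T) / T)  # ring eigenvalues
    Xhat = np.fft.fft(UG.T @ X, axis=1) / np.sqrt(T)           # joint transform
    H = h(lamG[:, None], lamR[None, :])                        # filter on joint spectrum
    return (UG @ (np.fft.ifft(H * Xhat, axis=1) * np.sqrt(T))).real
```

An all-pass filter returns X unchanged, and h(λ_G, λ_R) = λ_G reproduces L_G X, which makes for easy sanity checks.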

Spectrum-Adapted Approximation of Time-Vertex Filtering
Due to the high complexity of the spectral decomposition required to compute U G , approximation methods have been developed to accelerate joint time-vertex filtering, such as Chebyshev2D [64], ARMA2D [67], and the Fast Fourier Chebyshev algorithm [50].
As outlined in Algorithm 3, we can use the methods described in Section 3 to efficiently approximate the filtering of time-vertex signals. The overall complexity of our method is O(NT log T + TKM), where M is the number of nonzero entries in L G . The FFT of N discrete-time signals of length T takes O(NT log T). The loop computes T spectrum-adapted approximations to matrix functions with polynomials of degree K, and thus has a complexity of O(TKM) for sparse L G .

Algorithm 3 Spectrum-adapted approximate time-vertex filtering.
Input weighted undirected graph G with N vertices, time-vertex signal X ∈ R^{N×T}, filter h
Output time-vertex filtered signal Y = h(L_G, L_R)X ∈ R^{N×T}
1: X̂ ← FFT of X, where the nth row of X̂ is the 1D FFT of the nth row of X, for n = 1, ⋯, N
2: Estimate the spectral density of G
3: for t in 1 : T do
4:   f_t(λ_G) ← h(λ_G, Λ_R(t, t))
5:   Find the best degree K polynomial approximation p_K(λ_G) to f_t(λ_G) via interpolation on the warped Chebyshev points (described in Section 3.1), or weighted least squares regression with evenly spaced abscissae and weights from the estimated spectral PDF (described in Section 3.2)
6:   Compute p_K(L_G)x̂_t, where x̂_t is the tth column of X̂
7:   tth column of Ỹ ← p_K(L_G)x̂_t
8: end for
9: Y ← inverse FFT of Ỹ, where the nth row of Y is the 1D inverse FFT of the nth row of Ỹ
10: return Y
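The loop structure of Algorithm 3 can be sketched as follows, with plain (unweighted) least squares on evenly spaced points standing in for the spectrum-adapted fit, and a matrix-vector Horner scheme applying each fitted polynomial (all names ours):

```python
import numpy as np

def fast_time_vertex_filter(X, LG, h, lmin, lmax, K):
    """Approximate h(LG, LR) X: FFT each vertex's time series, then for each
    time frequency t fit a degree-K polynomial to f_t(l) = h(l, lamR_t) and
    apply it to the t-th column of the transformed signal."""
    N, T = X.shape
    Xhat = np.fft.fft(X, axis=1)
    lamR = 2.0 - 2.0 * np.cos(2.0 * np.pi * np.arange(T) / T)
    grid = np.linspace(lmin, lmax, 4 * (K + 1))     # fitting abscissae
    Yhat = np.empty_like(Xhat)
    for t in range(T):
        c = np.polyfit(grid, h(grid, lamR[t]), K)   # highest degree first
        v = Xhat[:, t]
        y = c[0] * v                                # Horner: y <- LG y + c_k v
        for ck in c[1:]:
            y = LG @ y + ck * v
        Yhat[:, t] = y
    return np.fft.ifft(Yhat, axis=1).real
```

For the linear filter h(λ_G, λ_R) = λ_G, any fit with K ≥ 1 is exact, so the routine reproduces L_G X.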

Numerical Experiments
We consider the ideal lowpass filter (12) and the wave filter (13). We approximate h(L_G, L_R)X for both filter functions, with T = 1000 observations, and for different graphs G: gnp (N = 500), saylr4 (N = 3564), and a random sensor network (N = 5000), the cumulative spectral densities of which are shown in Figure 2. In each case, we choose X = (1/√(NT)) ∑_{i=1,⋯,N; j=1,⋯,T} v_{G_i} v_{R_j}⊤, where v_{G_i} and v_{R_j} denote eigenvectors of L_G and L_R, respectively; i.e., X is a constant vector in the joint spectral domain of L_G and L_R, in order to test the average performance of the approximation methods over different combinations of eigenvalue pairs. With polynomial approximation orders ranging from K = 1 to 30, we follow the procedure described in Algorithm 3 to approximate h(L_G, L_R)X. We estimate the cumulative spectral density functions P̂_λ(z) with parameters S = 10, J = 10, and K_Θ = 30. We use M = 100 in the spectrum-adapted polynomial regression/orthogonal polynomial expansion when finding the best polynomial approximation. We compare the proposed methods to the truncated Chebyshev expansion and the Lanczos method with the same polynomial order. For each method, we examine the convergence of the relative errors in Frobenius norm as a function of K for graphs with various structures (and thus various distributions of Laplacian eigenvalues). The results are summarized in Figure 10. Similar to our observation in Figure 5, we see that the spectrum-adapted interpolation method performs well for lower polynomial orders (K ≤ 5), but tends to be unstable at higher polynomial orders.

Figure 10. Convergence of the relative errors of the four approximation methods for h(L_G, L_R)X when h is an ideal lowpass filter (12) and a wave filter (13).

Dynamic Mesh Denoising
Finally, we replicate a dynamic mesh denoising example presented in [50] (Sec. VI.B). The original time-varying sequence of 3D meshes of a walking dog features meshes from T = 59 time instances, each with N = 2502 points in 3D space (x-y-z coordinates). This sequence is denoted by the 2502 × 59 × 3 matrix X. The original 3D mesh from time t = 5 is shown in Figure 11a. The dynamic mesh sequence is corrupted by adding Gaussian noise (mean 0, standard deviation equal to 0.2||X||_F / √(2502 · 59 · 3)) to each mesh point coordinate. We denote the noisy 3D mesh sequence, one element of which is shown in Figure 11b, by Y. A single graph is created by subtracting the mean x, y, and z coordinates of each noisy mesh from that mesh, averaging the centered noisy mesh coordinates across all 59 time instances, and then building a 5-nearest neighbor graph on the average centered noisy mesh coordinates. The resulting graph and its spectral density function are shown in Figure 11e-f. The dynamic mesh sequence is denoised by solving the Tikhonov regularization problem (14) of [50] (Equation (30)). The solution to (14) is given in closed form by (15); i.e., the joint time-vertex filtering operation defined in (11) is applied to each of the noisy x, y, and z coordinates, using the same joint non-separable lowpass filter h_tik (16), which is defined in the joint spectral domain following [50] (Equation (31)). We perform a grid search to find the values of the parameters τ_1 and τ_2 that minimize the relative error between the denoised and original sequences, averaged over multiple realizations of the noise. In Figure 11g, we show the resulting filter h_tik(λ_G, λ_R) with τ_1 = 7.20 and τ_2 = 0.45 on the joint spectrum of the graph shown in Figure 11e. The dashed black line in Figure 11h shows the relative error between the denoised sequence computed exactly via (15) and the original dynamic 3D mesh sequence, averaged over 20 realizations of the additive Gaussian noise.
The other three curves in the same image show the average relative error when the computation in (15) is approximated by the fast Fourier Chebyshev method of [50], the Chebyshev2D method of [64], and the spectrum-adapted approximate time-vertex filtering (fast Fourier weighted least squares) of Algorithm 3, for different values of the polynomial degree K. All three approximations converge to the exact solution and corresponding error, but at low degrees (K ≤ 10), the spectrum-adapted fast Fourier weighted least squares method yields a better approximation. Figure 11c-d display examples of the denoised mesh at a single time (t = 5) resulting from two of these approximations, with K = 6. The difference between the two is subtle, but can be seen, e.g., by scanning the top of the dog, where the mesh points form a slightly smoother surface.

Figure 11. Denoising of a time-varying sequence of 3D meshes of a walking dog. Top row: one element each of the dynamic sequences of (a) original, (b) noisy, and (c,d) denoised (using two different approximation methods) meshes. Bottom row: (e,f) a 5-nearest neighbor graph (and its spectral CDF) constructed from the entire sequence of noisy meshes, which is used to denoise the meshes at all time instances; (g) the joint time-vertex lowpass filter (16); and (h) the average relative errors between the original sequence of meshes and the denoised versions attained by exactly or approximately computing (15) with different polynomial approximation methods, for a range of polynomial degrees.

Conclusions
We presented two novel spectrum-adapted polynomial methods for approximating f(A)b for large sparse matrices: spectrum-adapted interpolation and spectrum-adapted weighted least squares/orthogonal polynomial expansion. These methods leverage an estimate of the cumulative spectral density of the matrix to build polynomials of a fixed order K that yield better approximations to f in the higher density regions of the matrix spectrum. In terms of approximation accuracy, numerical experiments showed that, relative to the state-of-the-art Lanczos and Chebyshev polynomial approximation techniques, the proposed methods often yield more accurate approximations at lower polynomial orders; however, the proposed spectrum-adapted interpolation method is not very stable at higher degrees (K > 10) due to overfitting. The proposed spectrum-adapted weighted least squares method performs particularly well in terms of accuracy for matrices with many distinct interior eigenvalues, whereas the Lanczos method, e.g., is often more accurate when K is higher and/or the matrix A has many extreme eigenvalues. One potential extension would be to investigate a hybrid method that combines the Lanczos and spectrum-adapted weighted least squares approaches. We did not notice consistent trends regarding relative approximation accuracy with respect to the shape of the function f.
In terms of computational complexity, the cost of our methods, like Chebyshev polynomial approximation, is O(MK), where M = nnz(A). For very sparse N × N matrices with M = O(N) nonzero entries, this complexity reduces to O(NK). The Lanczos method, on the other hand, incurs an additional O(NK²) cost due to the orthogonalization step, making it more expensive for large enough K. Finally, the proposed spectrum-adapted methods, like the Chebyshev approximation, are amenable to efficient parallel and distributed computation via communication between neighboring nodes [33]. The inner products of the Lanczos method, on the other hand, may lead to additional communication expense or severe loss of efficiency in certain distributed computation environments (e.g., GPUs).