Bandwidth Detection of Graph Signals with a Small Sample Size

Bandwidth is the crucial knowledge to sampling, reconstruction or estimation of the graph signal (GS). However, it is typically unknown in practice. In this paper, we focus on detecting the bandwidth of bandlimited GS with a small sample size, where the number of spectral components of GS to be tested may greatly exceed the sample size. To control the significance of the result, the detection procedure is implemented by multi-stage testing. In each stage, a Bayesian score test, which introduces a prior to the spectral components, is adopted to face the high dimensional challenge. By setting different priors in each stage, we make the test more powerful against alternatives that have similar bandwidth to the null hypothesis. We prove that the Bayesian score test is locally most powerful in expectation against the alternatives following the given prior. Finally, numerical analysis shows that our method has a good performance in bandwidth detection and is robust to the noise.


Introduction
Graph signal (GS) is a versatile model for describing information in irregular domains, which has been widely used in sensor networks [1,2], biological networks [3,4] and image and 3-D point cloud processing [5,6]. Graph signal processing (GSP) theory generalizes the classic discrete signal processing theory to irregular domains by introducing graph Fourier transform (GFT) [7,8]. In most real world networks, adjacent vertices usually have similar signals, which leads to bandlimitedness or approximate bandlimitedness in graph spectral domain. For example, the differences of temperature measured by sensors within short distances may not vary a lot; people tend to have similar interests or views with their friends in the social network.
Bandwidth is a widely used prior of GS in selecting the best sensor placements to monitor spatial phenomena [9][10][11], low-pass filter designed for denoising [7,8,12] and estimating the signals on all the sensors from partial observations [13,14]. However, the bandwidth of GS is typically unknown in practice. There are only a few works considering the problem of bandwidth detection. The bandwidth estimation in Reference [9] relies on noise-free samples. The scenario of noisy observations is considered in Reference [15], but it only applies to the spectral sparse case. Meanwhile, the estimation accuracy of the bandwidth in Reference [15] highly depends on the parameter controlling the sparsity of the GS, and the process of finding an optimal parameter is time-consuming.
The detection of bandwidth for a GS aims to detect the number of spectral components in a GS, which can be seen as a model selection problem in linear regression. The wellknown model selection procedures like Akaike information criterion (AIC) and Bayesian information criterion (BIC) [16] pick the most likely model under some criteria. But they do not consider the significance of the chosen model. In contrast, hypothesis testing, such as likelihood ratio (LR) test and F-test can provide the significance of a model. If the bandwidth of a GS is k, the largest index of non-zero elements in frequency coefficients should be k. Let the vertex number of a GS be N, the regression coefficients of the last N − k frequency components are denoted byf −k . To test whether the bandwidth of the GS is k, the hypothesis can be expressed as (1) The null hypothesis will be rejected if the model with the assumed bandwidth k is not significant enough. Therefore, the bandwidth of GS can be detected by applying tests over all the assumed bandwidths to see whether there is a model achieving the given significance.
For large-scale GS, such as social networks, it is impossible to sample the signal on every vertex due to huge data collection cost, which means we can only have the sample size M < N − k sometimes. In this high dimensional situation, the conventional test for testing regression coefficients such as LR test and F-test is infeasible, which will be explained in Section 3.1. There are plenty of literature on testing the coefficients of high dimensional linear regression. For example, Reference [17,18] propose new test statistics by modifying the F-statistic, and Reference [19] introduces Bayesian priors to the parameters being tested. When k N, the spectrum of the GS can be seen as sparse. Several approaches reconstruct the signal in compressive sensing which estimate the bandwidth as a by-product [20,21]. The frequency components detected by compressive sensing methods can help estimate the spectrum sparse signal well. However, the bandwidth obtained by the compressive sensing methods may be inaccurate since the choose frequency components in the whole space without considering the bandlimitedness. More accurate bandwidth is need in some applications like filter design and sampling set design of which the performance highly depend on the bandwidth prior.
In this letter, we try to detect the bandwidth of bandlimited GS with small sample size, which is not much larger than the bandwidth and there is no sparsity constraint to the spectrum. The bandwidth is detected by a forward multi-stage test. The bandwidth of model being tested is increasing over stages and the bandwidth is obtained when the null hypothesis is accepted. Since the samples are not adequate, we try to use the limited samples to better distinguish the assumed bandwidth from those close to it in each stage for a small detection bias. Therefore, we customize priors forf −k in each stage to describe our attention on testing whether some elements inf −k are non-zero. By doing so, we only use the limited samples to focus on testing a few alternatives in each stage. Bayesian score test [19] is adopted in this paper, but we do not give uniform prior to the parameters being tested as Reference [19] does. Since the bandwidth being tested is increasing over stages, our attentions on coefficients of different frequencies are also updated in each stage which makes our method able to locate the true bandwidth with a small bias even though the samples are insufficient.
The widely used multi-stage test for model selection, stepwise regression ( [16] Chapter 10) does poorly with a small sample size since the estimation of the regression coefficients in it is biased [22]. In contrast, our method is proved to be locally most powerful (LMP) in expectation against alternatives following the given prior, which means it performs well in distinguishing among models with neighbouring bandwidth even using a small amount of samples in each stage.

Problem Formulation
Consider an N-vertex undirected connected graph G = (V, E , W ), where V is the vertex set, E is the edge set and W is the N × N weighted adjacency matrix with W (i, j) = w ij . A GS defined on V can be represented as a vector f ∈ R N , and its element f i represents the signal value at the i-th vertex in V. Laplacian matrix of the undirected graph is given by L = D − W, where D is the diagonal degree matrix diag{d 1 , · · · , d N } with d i = ∑ j w ij . Since w ij = w ji for undirected graphs, Laplacian matrix is symmetric. Therefore, it has real non-negative eigenvalues 0 = λ 1 ≤ λ 2 ≤ · · · ≤ λ N and an orthogonal set of eigenvectors V . As a result, the spectral decomposition of graph Laplacian is given as L = V ΛV T , where Λ is a diagonal matrix of eigenvalues. The columns of V denoted by {v i } 1≤i≤N are regarded as the graph Fourier bases and the eigenvalues are regarded as frequencies [7]. The expansion coefficients of f corresponding to eigenvectors are defined asf .
A GS is called bandlimited when there exists a K ∈ {0, 1, · · · , N − 1} such that its GFT f satisfiesf i = 0 for all i ≥ K [10]. The smallest such K is called the bandwidth of f . If all the frequency coefficients are non-zero, the GS is not bandlimited. If f is a signal with bandwidth K, then it satisfies f = V KfK , where V K denotes the first K columns of V andf K denotes the first K coefficients off .
Suppose that M(M < N) noisy observations y ∈ R M are sampled from f ∈ R N to detect the bandwidth, the observation model can be summarized as where w is an N × 1 vector with ith element representing the observation noise on the ith vertex and Ψ : R N → R M is the sampling matrix, of which the element in the ith row and jth column is defined as We use a simple example to illustrate the role of the sampling matrix. For a graph with 5 vertices and signal f = [ f 1 , · · · , f 5 ] T , if the 1st, 2nd and 4th vertices are sampled, then the sampling matrix is a matrix with 3 rows and 5 columns as follow The sampled graph signal is In this paper, we assume that the observation noise on each sample follows the Gaussian distribution N (0, σ 2 ) and the noise on all the samples are independently and identically distributed (i.i.d.), so the observation follows w ∼ N (0, σ 2 I N×N ), where I N×N is an N × N identity matrix.
By dividing the columns of V into two parts V k and V −k for any assumed bandwidth k ∈ {1, 2, · · · , N − 1}, the observation model (2) can be rewritten as wheref k andf −k denote the first k and the last N − k elements off , respectively, and If the actual bandwidth K = k, H 0 in (1) will be accepted. Otherwise, H 0 will be rejected. For a GS with actual bandwidth K, the hypothesis with assumed bandwidth k < K should all be rejected. Therefore, we detect the bandwidth of GS in a forward multi-stage way with assumed bandwidth increasing over stages. The bandwidth is obtained when the null hypothesis is accepted. By doing so, we reduce the number of models to be tested from N to K.

High Dimensional Challenge
F-test and LR test are common approaches to test the hypothesis (1) in linear regression model. Let the maximum likelihood estimation (MLE) off under H 0 and H 1 bef R andf U , respectively. The corresponding sum of squared residuals (SSR) are denoted by SSR(f R ) and SSR(f U ). Then, LR test statistic ([23] Chapter 10) equals M log(SSR(f U )/SSR(f R )), which is asymptotically distributed as χ 2 (N − k) as the sample size approaches to infinity. When the sample size is small, the test statistic of LR test has no specific distribution which makes it hard to implement. F-test statistic ([23] Chapter 10) equals where , which means the sample size needs to be bigger than the assumed bandwidth, as well as the dimension off −k . For large-scale GS, the sampling cost for bandwidth detection using F-test or LR test is too high. Therefore, a test with small sample size is preferred. The sample size only needs to be larger than the assumed bandwidth, that is, M > k, in score test ([23] Chapter 10). However, when N − k > M, there may exist some alternatives which havef −k = 0 but U −kf−k = 0. We can never hope to have any power against these alternatives. Since the GS is known to be bandlimited, we should pay more attention on testing whether the low frequency coefficients inf −k are non-zero than the high frequency coefficients. Thus, we adopt Bayesian score test [19] to allocate our attention among the alternatives by designing a prior forf −k . Different from Reference [19], which pays equal attention to all the alternatives, we customize a prior forf −k to fit our bandwidth detection purpose.

Design of the Prior
In the test for each bandwidth, the alternative thatf −k = α andf −k = −α for every α = 0 contributes equally in rejecting H 0 , so we assign E(f −k ) = 0 to make the prior unbiased, where E(·) denotes the expectation of the given stochastic variable. The covariance matrix off −k can be assigned as where Σ −k is a positive semidefinite k × k matrix to be designed and τ 2 is an unknown parameter. In this paper, we let Σ −k be a diagonal matrix, then each of its diagonal element is the variance of the corresponding element inf −k . Since E(f −k ) = 0, a larger variance indicates the corresponding element inf −k is more likely to have a larger amplitude. To decide whether H 0 should be accepted, more attention should be payed to the element in f −k with larger variance. Thus, the attention among elements inf −k in bandwidth detection is linked to their variances in the prior. The attention is allocated by designing a Σ −k . In our forward multi-stage test, the bandwidth being tested updates over stages, thus Σ −k also needs to be updated. To avoid the the multi-stage test accepting the null hypothesis too early, the test should be more distinguishable from the model with bandwidth close to the null hypothesis in each stage. Therefore, we use the limited samples to make the test concentrate on distinguishing bandwidth close to k. The design of Σ −k follows the guideline that as the frequency increases, element inf −k with higher frequency is paid less attention. Since we cannot determine a bandwidth larger than M − 1 with sample size M, we set the attention onf −k with frequency larger than M to a small constant δ. For example, the ith diagonal element of Σ −k can be designed to be where Z = exp −1/(2σ 2 ) is the normalization factor. According to the three-standarddeviations property of normal distribution, σ is set to be (M − k)/3 to make the attention on coefficients with frequencies larger than M equals to δ with high probability. Equation (8) is not the only choice of Σ −k , but this form is applicable. The details about how it affects the power of the test is analyzed later in Section 3.3.
As the multi-stage test moves forward, the attentions among frequency coefficients update, and in each stage, the test is distinguishable among bandwidth close to k, which makes the small-bias bandwidth detection with small sample size possible. An illustration of the attention update is shown in Figure 1. We can complete the specification of the distribution off −k by choosing a value for τ 2 and a distribution shape. Let L k (f −k ; y) be the likelihood off −k , for a specific prior distribution off −k , the likelihood of τ 2 is where π(f −k |τ 2 ) is the distribution off −k for a given τ 2 . GivenL k (τ 2 ; y), we convert hypothesis (1) tō τ 2 = 0 impliesf −k = 0, since they lead to the same distribution of y. Thus H 0 :f −k = 0 will be rejected ifH 0 is rejected. The score test ofH 0 is named the Bayesian score test of H 0 with the given prior off −k .
The score test ofH 0 in stage k is a one-sided test for one parameter, the test statistic is where tr(Σ −k F k ) denotes the trace of Σ −k F k .H 0 will be rejected when S Σ −k ≥ t for some threshold t. From (11), we can find that the test statistic of τ 2 can be seen as the slope of log-likelihood function of τ 2 underH 0 . The slope will equal to 0 when τ 2 equals its MLEτ 2 . The more closerτ 2 to 0, the more closer S Σ −k to 0. If S Σ −k is larger than a given t, which meansτ 2 is much larger than 0, thenH 0 will be rejected. Equation (12) is given by Reference [19], where g k = σ −2 U −k M k y and F k = σ −2 U T U are the gradient and Fisher information matrix of log L k (0; y), respectively.

Method
Considering that the second part of (12) is not related to the observations, it is more convenient to work with the equivalent test statistic σ −2 y T M k U −k Σ −k U T −k M k y. Because σ 2 is not known, we plug in the MLE resultσ 2 ∝ y T M k y under the null hypothesis, the resulting test statistic is • Interpretation of the test statistic: M k y indicates the part of y out of the range of U k . The numerator of (13) can be rewritten as Larger Q i indicates a larger energy of M k y lying in the range of U k+i , which links to a larger amplitude of the ith element inf −k . S Σ −k equals Q, which is a weighted sum of Q i , normalized by the energy of M k y. When the weight (Σ −k ) i,i decreases with i, S Σ −k will be larger if Q i is larger for smaller i. Therefore, the test is more powerful against H A that has non-zero frequency coefficients in low frequency than that has non-zero frequency coefficients in high frequency. This is in accordance with the purpose of our design that the test should be more distinguishable from the model with bandwidth close to k.
Finally, we decide whether the null hypothesis is accepted by a p-value threshold. For a given significance level α, if p ≤ α, the null hypothesis is rejected. Let S −k be the observation value of S Σ −k , then the p-value of the test is Since M k y = M k (y − U kfk ) = M k w under null hypothesis, it follows a multivariate normal distribution. Thus, p is the probability that the quadratic form of normal variables is non-negative. The distribution of the quadratic forms in normal variables is approximated by a χ 2 distribution in Reference [24] and p-value can be calculated approximately as, where 2 , and h k = tr(X 2 k )tr(X k )/ tr(X 3 k ).
Our algorithm for the multi-stage bandwidth detection is shown in Algorithm 1. If p ≤ α for all the stages, the output of Algorithm 1 will be 'None', which means the bandwidth cannot be determined with the given sample size. Then we can say that the bandwidth is larger than M − 1.

Power Analysis
In the situation N − k > M, there may exist some alternatives which havef −k = 0 but U −kf−k = 0. It is impossible to find a test which is optimal against all the alternatives, that is, uniformly most powerful. If the truef −k has large deviations from H 0 , it is very easy to detect. However, if the deviations are small, the detection becomes harder. In the Bayesian score test of H 0 , the deviations of the truef −k from H 0 is denoted by τξ, where ξ is assumed to follow a prior distribution with E(ξ) = 0 and E(ξξ T ) ∝ Σ −k and τ 2 indicates how large the deviation is. Next, we will analyze the power of the Bayesian score test of H 0 when τ 2 is small. Definition 1 (Locally most powerful (LMP) [25]). Consider the problem of testing the simple null hypothesis H : θ = θ 0 against the one-sided alternative K : An LMP test is one of the best tests for detecting small deviations from null hypothesis, though it is not good at detecting all kinds of alternatives. In each stage of Algorithm 1, we aim at deciding whether the bandwidth is k or not, so small deviations in the coefficients of frequencies around k should be detected as possible as we can. Therefore, a LMP test is preferred in each stage. In Theorem 1, we will show that the Bayesian score test in each stage of Algorithm 1 is LMP in expectation.
Proof of Theorem 1. Letβ(τ 2 ) be the power function of the level α score test ofH 0 . Sincê f −k = 0 and τ 2 = 0 lead to the same distribution of y, the level α tests of H 0 andH 0 lead to the same critical region. The same conclusion holds for any other level α tests of H 0 and H 0 with power function w(f −k ) and β(τ 2 ). It has been proved in Reference [25] that the one-side score test for one dimensional parameter is LMP. Therefore, for the given level α, there exists a ∆ > 0 such that for all τ 2 ∈H A with 0 ≤ τ 2 ≤ ∆, the power function of score test for (10) is larger than that of any other test for (10) β(τ 2 ) ≤β(τ 2 ). (16) Let the critical region of the level α Bayesian score test be A, we have where µ is a dominating measure. According to (9),L k (τ 2 ; y) is the marginalized distribution of L k (f −k ; y). According to (12), every distribution shape off −k = τξ with E(ξ) = 0 and E(ξξ T ) ∝ Σ −k leads to the same S Σ −k , and therefore the same power function. So (17) can be written as . (18) According to (16) and (18), To make Theorem 1 more intuitive, we give the following example. Example: Suppose there are two alternatives to be tested in step k, H l A =f −k = [0.1, 0, · · · , 0], which has a small deviation 0.1 at a low frequency and H h A =f −k = [0, · · · , 0.1, · · · , 0], which has the same deviation at high frequency. The prior off −k is given as (8), which means the test pays more attention to the low frequency components. Then, the power of testing H 0 against H l A will be larger than the power of testing H 0 against H h A at the same significance level. As a result, the average power of the Bayesian score test is increased since there are more alternatives having small deviations at the low frequencies in a bandlimited GS.
Furthermore, the proof of Theorem 1 does not rely on the specific form of Σ −k , which means that it will hold for different designs of Σ −k . We can design different Σ −k to allocate attentions among frequency components to achieve various testing purposes with small sample size, for example, spectrum anomaly detection.

Bandwidth Detection
An accurate bandwidth is helpful in various applications. For example, an accurate bandwidth in low-pass filter design for graph signals can help remove the noise and keep the original signal well; An accurate bandwidth can also help in choosing the minimal sample size in the sampling set design for graph signals. So in this section, we first validate the performance of bandwidth detection on an Erdős-Rènyi random graph with N = 250 and the probability of edge presence being 0.25. The frequency coefficients of bandlimited GS are independently generated from a uniform distribution over the interval [0, 1]. Gaussian noise is added to the GS to produce the observations, the signal to noise ratio (SNR) is calculated by SNR = 10 log f 2 2 /(Nσ 2 ) . The performance of bandwidth detection is shown in bias of 1000 simulations, the SNR is set to 20 dB and vertices are sampled randomly in each simulation. In this simulation, we design two priors for the alternatives in our method follows the guideline in Section 3.2 and the significance level of the test is set to α = 0.05. Prior 1 is given by (8) with δ = 0.01. Prior 2 is given by with δ = 0.01. To show the effectiveness of our prior designed for bandwidth detection, we compare the mean bandwidth obtained by givingf −k our designed priors and uniform prior Σ −k = I (N−k)×(N−k) , as shown in Figure 2. The test with uniform prior fails to detect the bandwidth when M = 80, while the test with designed prior 1 and prior 2 performs well under the same sample size. When the sample size increases to 2N, the test with uniform prior turns out to have an acceptable performance, but still worse than the test with designed priors with M = 80. This implies that the prior we designed for bandwidth detection is very helpful for saving the sample size. We can also find that the bandwidth detection performance of our method with prior 1 and prior 2 are comparable, which indicates the guideline of prior design in our algorithm improves the detection performance instead of a specific form of prior. The performance of our method is also compared with the bandwidth estimation method in Reference [15], in which a dictionary of 80 bandlimited kernels is constructed with β = 10 3 and the regularization parameter µ in it is set to 0.01 by cross validation. We first simulate the performance of bandwidth detection at different noise levels for bandwidth K = {10, 30, 50}, the sample size is set to M = 80. The results are shown in Figure 3. We can find that the bias and standard deviation (SD) of Algorithm 1 and sparsity based method [15] are similar for small bandwidth at low noise level. For large bandwidth, the bias and SD of our method are significantly smaller than those of Reference [15]. This is because the GS is no longer sparse for large bandwidth, which is out of the scope of sparsity based method [15]. At high noise level, Algorithm 1 is much more robust than Reference [15] especially when the bandwidth is close to the sample size.  Bias and SD Method in [14] with K = 50 Method in [14] with K = 30 Method in [14] with K = 10 Proposed method with K = 50 Proposed method with K = 30 Proposed method with K = 10 In Figure 4, we show that the increase of sample size can help decrease the bias and SD of Algorithm 1. However, when the sample size is abundant, the performance improvement caused by the increase of sample size can be ignored.

Signal Estimation
Bandwidth is an useful prior when estimating signals from partial observations. In this section, we use the bandwidth detected by Algorithm 1 as a prior in signal estimation and a least squares (LS) estimator is used to estimate the signal with the form The estimation performance is evaluated by the normalized mean square error (NMSE), which is We first show the estimation performance on the same graph signal with that in Section 4.1. The bandwidth of signal varies from 10 to 60, the sample size is set to M = 80 and the SNR is set to 20 dB. Our method is compared with two methods based on sparsity, namely BP [26] and BCS [21], which also select frequency components to estimate the signal. The result is show in Figure 5, we can find that as the bandwidth increases, the performance of the sparsity base methods degrade rapidly. This is because the sparsity based methods select frequency components in the whole space and the sparsity constraint makes the amount of frequency components they selected as small as possible. Therefore, they are not fit for the application when the frequency spectrum is not sparse.
We also compare the performance of the signal estimation in real-data set with BP and BCS. The data set comprises 22 signals corresponding to the average temperature on January 1st in the intervals 1998-2019 measured by 119 stations in China [27]. Each station is identified with a vertex and the distance between 2 vertices is calculated by the haversine formula. A 3-NN unweighted graph is constructed. The GS is shown in Figure 6a and the corresponding spectrum is shown in Figure 6b. i.i.d. Gaussian noise with SNR = 20 dB is added to the signal to generate observations. 200 simulations are implemented for each signal, the vertices are sampled randomly in each simulation. We can find that the temperatures distribute over the graph smoothly, so there are only a few low frequency components, which is sparse in spectrum. Since the GS is approximately bandlimited in real-data, the approximate bandwidth obtained by Algorithm 1 is used in (20). The average NMSE of all the simulations under different sample sizes is shown in Figure 7. We can find that the estimation performance of our method is also better than the sparsity based methods in real-data. Since BP ensures the sparsity of the signal by l 1 -norm, it results in obtaining the same signal values on each vertex in this experiment. BCS selects 14 nonadjacent frequency components in all the frequency components, including high frequencies and low frequencies. The approximate bandwidth obtained by Algorithm 1 is 12, which means we estimate the GS using the first 12 frequency components. We can find that if the GS is known to be bandlimited, the signal estimation performance can be improved by detecting the bandwidth accurately first and using the bandwidth as a prior in estimation. If the knowledge that the GS is bandlimited is not take advantage of, more samples are needed to achieve the same estimation performance.

Conclusions
In this letter, we proposed a multi-stage Bayesian score test approach for the bandlimited GS bandwidth detection with a small sample size. By customizing a prior for frequency coefficients being tested, we made the test more distinguishable from the models with bandwidths close to the assumed one. In practice, our method may obtain a bandwidth smaller than the true bandwidth K if there is a frequency band of which the coefficients are zeros in band, since we focus on testing the frequency coefficients close to the assumed one but ignore the others in the small sample size situation. We will try to improve the performance under this situation in the future. In addition, we would like to analyze how the different sampling sets affect the performance of bandwidth detection and design a sampling set to make the test more powerful.