Multiscale Methods for Signal Selection in Single-Cell Data

Analysis of single-cell transcriptomics often relies on clustering cells and then performing differential gene expression (DGE) to identify genes that vary between these clusters. These discrete analyses successfully determine cell types and markers; however, continuous variation within and between cell types may not be detected. We propose three topologically motivated mathematical methods for unsupervised feature selection that consider discrete and continuous transcriptional patterns on an equal footing across multiple scales simultaneously. Eigenscores (eigi) rank signals or genes based on their correspondence to low-frequency intrinsic patterning in the data using the spectral decomposition of the Laplacian graph. The multiscale Laplacian score (MLS) is an unsupervised method for locating relevant scales in data and selecting the genes that are coherently expressed at these respective scales. The persistent Rayleigh quotient (PRQ) takes data equipped with a filtration, allowing the separation of genes with different roles in a bifurcation process (e.g., pseudo-time). We demonstrate the utility of these techniques by applying them to published single-cell transcriptomics data sets. The methods validate previously identified genes and detect additional biologically meaningful genes with coherent expression patterns. By studying the interaction between gene signals and the geometry of the underlying space, the three methods give multidimensional rankings of the genes and visualisation of relationships between them.


Introduction
Cells, the building blocks of life, are often classified into discrete cell types (e.g., liver, neuron, immune, or blood cells). In modern experiments, cell type identification commonly relies on partitioning single cell RNA sequencing (scRNA-seq) data. Differential gene expression (DGE) algorithms use statistical tests to determine genes that significantly differ between predefined groups of cells. However, cellular biology is more nuanced: there are multiple scales of cell classification (e.g., Treg cells are T cells which are a type of immune cell), continuous transitions into cell types (e.g., embryonic development starts from stem cells that differentiate into broad cell types that further specialise), or natural variations within cell types. The rich repertoire of gene expression patterns and cellular subphenotypes offers an opportunity to study continuity of gene expression.
Mathematically, single-cell data are given as raw counts of RNA transcripts that represent the expression of more than 20,000 genes in the human genome. A cell-by-gene matrix of counts is then preprocessed to reduce noise, variance due to technical effects, and the number of genes to form a smaller normalised gene expression matrix Y ∈ R m×n , where m ∼ 10 3 genes and n ∼ 10 3 -10 6 cells [1,2]. Due to the high-dimensional nature of these data, along with sparsity and noise, standard data science methods are out of reach.
The field of topological data analysis (TDA) studies the shape and connectivity of data at multiple scales of resolution. TDA methods require a metric and approximate the shape of the data by building covers or sequences of higher order networks (i.e., filtrations) on the data. TDA methods have successfully analysed and visualised single-cell data (e.g., UMAP, which relies on fuzzy simplicial sets; or Mapper, which visualises data using covers and filters) [3][4][5][6][7][8]. In this paper, instead of studying the shape of data, we focus on the related task of quantifying how well signals on a given data set align with the topology of the data.
The multiscale nature of topological data analysis and filtrations leads us to combine these ideas with graph signal processing [9] and spectral graph theory [10] to study the continuous variation of gene features across cells. The analysis in this paper starts with a preprocessed single-cell data matrix Y, as computed in the standard software Seurat [1], and then, it uses UMAP to construct an undirected weighted k-nearest neighbour cell similarity graph G. The nodes, which represent cells, are connected by edges and weighted by the similarity of gene expression. On this graph, the expression of a single gene is now a real-valued function g : V → R on the vertices of the graph, which is known as a graph signal in graph signal processing. Viewing a gene as a function on vertices of a graph, or as a graph signal, is implicit when applying DGE to a clustering of graph vertices.
Spectral graph theory, graph signal processing and the emerging field of topological signal processing [11][12][13] offer a setting to study and compare continuous patterns of functions across nodes, edges, and higher-order data structures. He et al. introduced the Laplacian score in [14] as a method for feature selection on point cloud data. Feature selection in machine learning is a process of dimensionality reduction of data before other analyses are applied. Govek et al. [15] applied and extended the Laplacian score to singlecell data as an improvement on using the variance of gene expressions to rank genes by importance. By studying the spectral properties of the Laplacian graph, each gene is given a score according to its consistency with the local geometric structure of the graph [15]. The Laplacian score is small if the gene signal roughly correlates with the graph structure (i.e., it is locally approximately constant but has global variation) and is large if the expression of a gene varies wildly on local neighbourhoods. This feature selection approach ranks the best features (e.g., genes) from the input data to form a compact and informative data representation. A score for each gene can be calculated, providing an overall ranking of features or gene signals [14,15].
In this work, we analyse gene signals on a cell similarity graph, while taking into account multiple scales of the single-cell data. We propose three computationally tractable methods for finding gene expression patterns that drive continuous variation in the data set. Similar to Govek et al. [15], the proposed methods do not require clustering cells or predefined cell assignments. Therefore, the selected gene signals are agnostic to cell types or clusters, gene selection is continuous across the cell network and some genes may link across multiple 'cell types'. In this sense, DGE based on standard clustering approaches will only identify genes that are specifically enriched in a pre-specified cell population, whereas the proposed methods can identify genes common to disparate cell types. Instead of one ranking, we propose multiple rankings of gene expression patterns at different scales of the data. Briefly, eigenscores restrict the signal to the eigenspaces corresponding to the smallest eigenvalues. We score each gene by its alignment to each of the eigenvectors with the smallest eigenvalues and then visualise the signals in gene space ( Figure 1). Our proposed multiscale Laplacian score (MLS) pipeline uses the theory of continuous-time random walks and Markov stability [16,17] to rank genes according to their consistency with features that range from local to global geometric structures. The persistent Rayleigh quotient (PRQ) takes in a filtration on the data (e.g., time) to study bifurcation patterns in gene expression data. The PRQ is based on the Kron reduced (persistent) Laplacian [18][19][20], which considers subgraphs inside a larger graph. It then applies the Rayleigh quotient associated with this operator, resulting in the identification of genes that drive bifurcation processes. To probe the discrete cell type paradigm, we apply the methods to synthetic and experimental data sets, which select subsets of genes that span known cell types and provide possible pathway transitions between them.
The article is organised as follows. In Section 2.1, we present mathematical preliminaries. We then introduce the three proposed scores (Sections 2.2-2.4) and data sets (Section 2.5). In Section 3, we present and discuss the computational results, highlighting the potential of each method for application on single-cell data sets, and then, we conclude in the final section.

Preliminaries
Let G = (V, E) be an undirected graph where V = {1, . . . , n} are nodes (representing the set of n cells) and E ⊆ V × V are edges that are weighted by gene correlation or similarity. The weight between cells u and v is recorded in the a uv entry of the weighted adjacency matrix A. Let d v denote the degree of node v, and let D denote the diagonal matrix D vv whose entries have value d v . Definition 1. The combinatorial Laplacian L, the symmetrically normalised Laplacian L and the random walk Laplacian L rw of the graph G are: Definition 2. The Rayleigh quotient for a non-zero graph signal g : V → R on the nodes of G is where u ∼ v indicates that u and v are adjacent nodes in G and the inner product is defined as If g is constant, then R L (g) is zero. Substituting the normalised Laplacian into Equation (4), we have the following equation: When normalising signals by D 1/2 [10] so that R L (D 1/2 1) = 0, where 1 ∈ R n is the vector of ones, we obtain The graph mean µ G (g) of a signal g is defined as and the graph variance of g is Var [14].

Definition 3.
If we re-centre the graph signal g by settingg(v) = g(v) − µ G (g), then the Laplacian score of g (in the sense of [15]) is defined as The Rayleigh quotient and Laplacian score measure consistency of the graph signal with the underlying graph structure. Small scores correspond to signals which exhibit variation consistent with the local graph structures; larger scores correspond to signals inconsistent with the local graph structures. While the Rayleigh quotient is zero for constant signals (i.e., a perfect score), the Laplacian score is undefined for constant signals and is high for near-constant signals [10,14].
Using the Laplacian matrix as a measure of consistency of a graph signal is directly related to Laplacian eigenmaps [21]. Laplacian eigenmaps construct an optimal (cf. [21], Section 3.1) embedding of graph signals for dimensionality reduction by finding the smallest eigenvalues of the generalised eigenvalue problem Lg = λDg. A signal f solves this problem with eigenvalue λ if D 1/2 is an eigenvector of L with eigenvalue λ. As L permits an orthonormal eigendecomposition, we can write i is an eigenvector of L with eigenvalue λ i . Then This score is small wheng aligns well with the Laplacian eigenmap embedding, i.e., when the squared coefficients a 2 i corresponding to large eigenvalues are small.

Eigenscores
The Rayleigh quotient and Laplacian score order graph signals by coherence with the underlying graph. We remark that this ordering only considers consistency at a single scale. In order to obtain a finer-grained, multiscale understanding of graph signals, we consider their alignment with different coherent structures on multiple scales in the graph. To explain how we can do this, we first recall the spectrum of the Laplacian.
Given a graph signal g, D 1/2 g = ∑ n−1 i=0 g i e i where g i = D 1/2 g, e i . Writing Equation (6) in this eigenbasis gives: Given the expression of the eigenbasis in Equation (9), we now consider individual contributions to the Rayleigh quotient separately, proposing the following definition.
Definition 4 (Eigenscore). Given a graph signal g : G → R and e i as the ith eigenvector of the normalised Laplacian, we define the ith eigenscore eig i by Given that g i = D 1/2 g, e i , it follows that We can view the ith eigenscore of a graph signal as the contribution from the ith eigenvector direction to its Rayleigh quotient. It can also be viewed as the cosine of the angle between D 1/2 g and the ith eigenvector. Thus, a large positive value for eig i (g) indicates strong alignment of the graph signal with the ith eigenvector of L, and a large negative value indicates strong anti-alignment (i.e., alignment with minus the eigenvector).
The ordering of the eigenvalues by magnitude explains the multiscale nature of the eigenscore. Expressing a graph signal in terms of Laplacian eigenvector contributions can be viewed as expanding in a frequency basis. Here, ordering the eigenvectors according to increasing eigenvalue corresponds to considering waves of increasing frequency. Expressing a signal in this basis can be viewed as the graph analogue of a Fourier transform. In general, computing the full eigendecomposition is expensive; however, algorithms exist for computing the first few dominant eigenvectors of a symmetric sparse matrix [22].

The 0th Eigenscore
Set D 1/2 1 to be the 0th eigenvector in our eigenbasis. Then where µ G is the graph mean, as defined in Equation (7).

Eigenscores to Visualise Graph Signals
Projecting gene signals onto the eigenspace spanned by low-frequency eigenscores allows us to visualise gene space and identify meaningful signals (see Figure 1). Noisy signals are mapped close to zero, and interesting signals lie on the periphery in such an eigenspace plot. Constructing such an embedding using Laplacian eigenvectors is reminiscent of Laplacian eigenmaps [21]. However, in [21], Laplacian eigenvectors are used to construct an embedding of the nodes of the graph, whereas we embed signals on the graph.

Multiscale Laplacian Score
The Laplacian score (Equation (8)) considers the change in signal along single edges in the graph. We propose the multiscale Laplacian score (MLS), which relies on random walks to measure the consistency of a signal across local graph neighbourhoods of continuously increasing size. This unsupervised approach provides a multiscale ranking of signal coherence with the graph. We can determine a finite number of scales at which the random walker admits a Markov stable partition [23,24], and we pair this pipeline with the MLS.

Random Walks on Graphs
Random walks on graphs are stochastic processes that can model a range of phenomena, including diffusion on graphs [25]. For any graph G with adjacency matrix A, the evolution of a continuous-time Markov process is governed by the Kolmogorov differential equation: where p is a time-dependent node vector and p v (t) gives the probability of a random walker being on node v at time t. In this Markov process, a random walker jumps to adjacent nodes (with probability proportional to the respective edge weight) after a period of time drawn from an Exp(1) random variable. The stationary distribution π ∈ R n is the unique left eigenvector of L rw with eigenvalue 0 whose entries sum to 1. The solution to Equation (11) is p(t) = p(0) exp(−tL rw ) and π = lim t→∞ p(t).

Remark 1.
The heat kernel can be used in conjunction with the graph Laplacian to model heat flow on a graph, which is viewed as a discrete approximation to a Riemannian manifold (see [21] Section 3.3 and references therein). From this viewpoint, with signals representing heat, the Rayleigh quotient quantifies relative differences in heat across the manifold for an infinitesimal time step.

Community Detection
Community detection in networks is concerned with finding groups of nodes that are more tightly connected to each other than to the rest of the network. Some of the best known community detection algorithms, such as modularity optimisation [26], exploit combinatorial properties of the graph. The communities found by optimising modularity are dense; i.e., there are many more edges between nodes in the group than with the rest of the network.
Community detection has extended from the notion of dense connections defining a community to also include connectivity via random walks. Markov stability, a dynamical approach for community detection, relies on random walks to detect stable graph partitions V = C 1 C 2 · · · C k at multiple resolutions [23,24]. We call each C i a community and assume that it is non-empty. Moreover, we denote the community to which node v belongs as c v and assume that all subgraphs induced by C i are connected. Two partitions are considered identical if one can be obtained from the other by permuting the labels 1, ..., k.
..,k be a partition of the graph G into communities. If M = D −1 A is the random walk transition matrix with stationary distribution π, then the continuous Markov stability of the partition at time is where δ is the Kronecker delta and P(t) := exp(−tL rw ) is the continuous time transition matrix [16].
The Markov stability of a graph partition at time t is the probability of a random walker remaining in its initial community after walking for time t minus the probability that two independent random walkers are in the same community at time t. All walkers are assumed to be in the stationary distribution. The Markov stability of a partition {C i } at time t takes values in the range (−1/2, 1]. High values indicate that a random walker tends to be trapped in one of the groups, which is what we expect in the presence of communities. For each value of t, coherent community structures on a graph can be found by maximising Markov stability using the Louvain method [27], which is a successful algorithm for finding community structures at different scales in applications [28][29][30]. A bottleneck (for graphs larger than those considered in this study) is the computation of the matrix exponential for the Markov stability, which requires an eigenvalue decomposition of L rw . To speed up computations in such regimes, one can use a linear approximation of r cont for small t (see [16] The choice of values of t to use for finding community structures via the maximisation of Markov stability depends on the graph G. For example, a complete graph will only have one sensible community structure (containing a single community) which will be detected at a relatively large t, while many real-world networks exhibit community structures at a variety of scales t. The partitions at different t obtained from this optimisation are assessed using the mean pairwise variation of information (VI), which tests the consistency and robustness of partitions [31]: be two partitions of a graph with N nodes. Then, their variation of information is defined to be The VI provides a measure of similarity between partitions, which can be viewed as a function of t. At resolutions of t for which there is an obvious community structure, the VI is relatively small and takes a local minimum-a plateau of such a minimum suggests the stable partition of communities. This behaviour is explained in [17] and illustrated in Figure 2. In this work, we chose values of t for which VI attains a local minimum. Each graph may have different stable communities and, therefore, the selected value of t would be chosen based on the minima of that graph.

Signal Scores at Multiple Resolutions
We can reinterpret the Laplacian score (Equation (8)) in terms of the random walk Laplacian (Equation (3)): Thus, the Laplacian score of a signal g is the expected squared difference in the signal g that is observed when a random walker at stationary distribution takes exactly one step following transition matrix D −1 A, which is divided by twice the graph variance. By extending the Laplacian score from a single random step to a random walk for time t, we arrive at our definition for the multiscale Laplacian score: be a graph with adjacancy matrix A, g : V → R be a signal on G and t ∈ R ≥0 . The multiscale Laplacian score of g at resolution t is defined as where we use the identity d u P(t) uv = d v P(t) vu for all t ∈ R ≥0 and all u, v ∈ V.
If the expected change in a signal g, which a continuous-time random walker is exposed to, after time t is small, then the MLS(g, t) is small. In such a case, we say that the signal g is consistent with the graph structure of G at resolution t. The MLS extends the Laplacian score (Equation (8)) [15] by performing a consistency analysis at multiple resolutions. Analysing multiple resolutions, ranging from local to global structures, is useful for studying graphs G paired with signals that are consistent at multiple resolutions.

MLS Analysis Pipeline
In the MLS analysis pipeline, we partition a given graph G into communities at 100 Markov times using the Louvain algorithm [27]. We then select a small set of Markov times at which the VI attains local minima. Next, we calculate the MLS at each of these resolutions and for each signal on the graph. We can then compare the MLS at different Markov times to identify gene signals particularly consistent with a given topological structure at a given resolution. For example, a small MLS at an earlier Markov time (compared to the mean behaviour of all signals) is more consistent with structures at that resolution (see Figure 3). The VI (on y-axis, VI is 0 except for a brief spike around t = 3.35) identifies resolutions t 1 , at which all three communities are identified, and t 2 , at which two communities are identified (note that due to the simplicity of the graph, there are intervals of local minima instead of points; we pick t 1 before the spike and t 2 after). In (B), we calculate the MLS at t 1 and t 2 (given by black circles) of three signals that are equal to 1 on one of the t 1 -communities (constant part of the signal is highlighted by arrows) and uniformly random elsewhere, and one completely random signal. The signal that is constant on the largest cluster (bottom left) is identified as highly consistent at both times. The random signal (top right) is identified as inconsistent at both times. Conversely, the signal constant on the smallest community (top left) has a high MLS at t 2 relative to the MLS at t 1 , separating it from the signal constant on the community of intermediate size (centre).

Persistent Rayleigh Quotient
Given a graph G and signals g:V → R, we may have additional information associated to each node of G that we would like to use to further inform our analysis. In single-cell data, for example, this could be the developmental time of each observed cell, which associates a real value to each node of the graph G.

Definition 8 (Filtered graph).
A filtration of a graph G is a integer-valued function f : V → Z on the nodes of G. For i ∈ Z the sub-level set α(i) of f at i is the set the nodes of G have a filtration value not greater than i. The induced subgraph G[α(t)] is the subgraph of G with nodes α(i) and every edge in G that has both endpoints in α(i). Then, the filtration f gives a sequence of induced subgraphs of G Topological data analysis studies the evolution of topological invariants across filtered graphs G[α(i)]. The most common tool is persistent homology [33], which computes how invariants, such as connected components in G[α(i)], persist in the larger graph G[α(j)]. Persistent homology is limited to studying the structure of the filtered graph itself. To analyse the signals on the sequence of subgraphs, we first recall the persistent Laplacian and then introduce the persistent Rayleigh quotient.

Persistent Laplacian
Given a subset α ⊆ V, one can reduce the Laplacian of G to a Laplacian on the nodes α by a method known as Kron reduction [18]. Briefly, Kron reduction removes the nodes in V \ α and adds weighted edges that preserve the geometric structure between the nodes α in G. For example, in network circuit theory [18], Kron reduction creates a simpler representation of a circuit whilst preserving resistances. Memoli, Wang and collaborators extended this method to higher-order graphs (i.e., simplicial complexes) [19,20] and showed that the Kron reduced Laplacian is the 0-degree persistent Laplacian. There is a direct relationship between persistent homology and the Kron reduction/persistent Laplacian: the nullity of the reduced Laplacian is exactly the persistent Betti number of G[α] ⊆ G [20]. For graphs, this persistent Betti number is the number of connected components of G[α] that remain disconnected in G.
For subsets α, β ⊆ V let L[α, β] be the submatrix of L with rows indexed by α and columns indexed by β. Under an appropriate reordering of the node labels, the Laplacian L has block form Definition 9 (Kron reduction [18]/Persistent Laplacian [20]). The Kron reduction (or 0-degree persistent Laplacian) of L with respect to α is the matrix which is also known as the Schur complement L/L[α c , α c ]. We analogously define L α for the normalised Laplacian L.
The Kron reduction L α of L arises from performing Gaussian elimination on L to remove blocks L[α, α c ] and L[α c , α]: Lemma 1 (Lemma 2.6 in [18]). In Definition 9, the following hold: L α is symmetric. 3.
L α 1 = 0, where 1 is the column vector of ones.
Hence, L α is a Laplacian matrix in the sense that there exists a weighted graph with nodes α and Laplacian equal to L α .
Suppose we have a filtration f on the nodes of the graph G.
is defined analogously.
Definition 10 (Persistent Rayleigh quotient). For a graph G with filtration f and i, j ∈ Z with i ≤ j, the persistent Rayleigh quotient of a signal g : G → R is which is the Rayleigh quotient (as in Equation (4)) using the (i, j)-persistent Laplacian. We further define the normalised persistent Rayleigh quotient to be which is the Rayleigh quotient (as in Equation (6)) using the normalised version of the (i, j)persistent Laplacian on the normalisation of the signal. Here, D j i is the degree matrix of the graph corresponding to L j i . When applying L j i and D j i to g, we implicitly restrict g to the nodes α(i).
Suppose that this graph represents a biological system: node c represents a parent cell type at developmental time t 0 and nodes a and b are daughter cell types at developmental time t 1 . If we filter the graph G by time, then we only ever have one connected component. Thus, we filter in 'reverse time' by setting the filtration to be t max − t. Explicitly, we define a filtration f by f (a) = f (b) = 0 and f (c) = t 1 − t 0 . Now, G[α(0)] has two connected components which merge into a single component in When we perform the Kron reduction of L = L t 1 −t 0 with respect to α(0) to obtain the Laplacian L t 1 −t 0 0 , the graph associated to this Laplacian has just a and b as nodes and a single 1/2-weight edge between them ( Figure 4B). This graph still has one connected component, but the connection is weaker. In the language of persistent homology, this corresponds to two H 0 -bars: one is born at filtration value 0 and dies before value t 1 − t 0 , and the other is born at value 0 and persists infinitely.
Comparing the usual normalised Rayleigh quotient, corresponding to PRQ(t 1 − t 0 , t 1 − t 0 ), to the normalised persistent Rayleigh quotient PRQ(0, t 1 − t 0 ) separates the binary graph functions g i on G ( Figure 4C). In the context of single-cell differentiation data, graph signals correspond to genes. Gene g 2 lies above the diagonal in Figure 4C and is highly expressed in the parent cell type and only one of the daughter cell types. Such behaviour indicates that a gene, or its transcriptional regulators, may be involved in cell differentiation towards a particular lineage or fate. Similarly, gene g 3 , which lies below the diagonal, is expressed in both daughter cell types but not the parent cell type, and, as such, it may correspond to a class of genes which represents markers of differentiation shared across cell fates. Genes corresponding to g 1 are expressed at a constant rate, representing possible 'house-keeper' genes and, thus, have zero (or close to zero) persistent Rayleigh quotients. Finally, gene g 4 is only expressed in a single daughter cell type and, as such, represents markers of differentiation to a particular cell fate. Gene g 4 lies along the diagonal. for binary functions on the graph representing high and low gene expression of a particular gene. The persistent Rayleigh quotient separates these genes based on relevance to the bifurcation: g 1 is expressed in all cell types, g 2 is expressed in the parent and one daughter cell type, g 3 is expressed only in both daughter cell types, g 4 is expressed only in one daughter cell type.

Data Sets
We apply the proposed methods on three different experimental scRNA-seq data sets. The first is an scRNA-seq data set of 2700 human peripheral blood mononuclear cells (PBMC) [34], which is used in Seurat tutorials [35,36]. The cell types found in the PBMC data set are lymphocytes (T cells, NK cells, B cells), monocytes, and dendritic cells. This data set also contains a small group of PPBP+ cells, which have been variously identified as platelets or their progenitors, megakaryocytes [35][36][37]. For simplicity, we will refer to them as platelets in our discussion, following the Seurat tutorial nomenclature.
The second data set is scRNA-seq of 24,911 human T cells infiltrating lung tumours and adjacent normal tissue [38] (previously analysed with the Laplacian score [15]). While the majority of the cells are CD4+ and CD8+ T cells, some NK cell clusters are also present in this data set. The aim of the T cell experimental data set was to determine how the cellular composition of the stromal region surrounding lung tumours differs from the stromal region of healthy lung tissue. The authors analysed scRNA-seq data to identify different stromal cell compositions (or signatures) that correlate with survival.
The third data set is scRNA-seq of 447 mouse foetal liver cells at different stages of development [39]. The aim of the mouse liver experimental data was to identify divergence in gene expression (and associated time-course) as progenitor cells in the liver (i.e., hepatoblasts) specialise or differentiate into two different types of mature cells, hepatocytes and cholangiocytes. Cells in the mouse foetal liver data set were sampled on embryonic days 10, 11, 12, 13, 14, 15, and 17.

Preprocessing of PBMC and T Cell Data Sets
We normalised the PBMC and T cell scRNA data sets using the variance stabilizing transform (VST) [40] as implemented by the function SCTransform from the R-library Seurat [1]. The VST returns the 3000 genes with the highest dispersion in each data set, and it is then reduced to its 30 principal components with the highest variance, following the recommendation given in the manual of Seurat. We then construct a k-nearest neighbour (k-nn) graph on cells for both data sets, using k = 15, and weight the edges of these graphs according to the weights given by the dimension reduction algorithm UMAP [3]. We note that the method of eigenscores is particularly robust to changes in the value of k as well as leaving out preprocessing with PCA (results not shown). We use cosine-dissimilarity for the PBMC data (following the Seurat tutorials) and Pearson correlation-distance for the T cell data (following [15]). We sample 3000 cells at random between the PCA and k-nn graph steps in the T cell data set (following [15]).

Remark 2.
The UMAP visualisation approach uses a locally scaled Laplacian kernel to weight its edges. A Laplacian kernel is similar, but not identical, to a heat kernel (also known as Gaussian kernel) in its definition. We use the weighted graph generated by UMAP in all of our analyses.

Preprocessing of Mouse Foetal Liver Cell Data Set
As the mouse data set is substantially smaller than the other data sets, we used a simpler preprocessing strategy. The public data from [39] were provided in transcripts-permillion (TPM), and we further applied a log e (x + 1) transform. The 10,000 most highly varying genes were retained. We tailored the value of k according to the number of cells in the experiment. For the larger experiments, we used the default k = 15 whereas here, the UMAP-weighted k-nn graph was built using k = 3 with the Euclidean metric. As in the original paper, we plot the resulting graph on the first two principal components ( Figure A4).

Previous Results on PBMC Data
In the UMAP plot generated from the variance stabilised PBMC data (see Figure A1), five large populations (clusters) and two smaller ones are visible. By colouring the UMAP plot by marker gene expression, the VST vignette in [36] identifies that the large populations roughly correspond to CD4 T cells, CD8 T cells, NK cells, B cells and monocytes, while the small components contain platelets and dendritic cells. The clustering algorithm applied to the data further subdivides these populations. By performing differential gene expression (DGE) analysis by use of a non-parametric Wilcoxon rank sum test [41], the VST vignette suggests that the CD4 and CD8 T cells can be split into three subpopulations. The B cells and NK cells are decomposed into two subpopulations, each closely linked to a high expression of known marker genes. The top ten differentially expressed genes in the twelve resulting clusters are given in Table A1.

Previous Results on T Cell Data
Lambrechts et al. [38] identify several subclusters of T and natural killer (NK) cells based on clustering in t-SNE (nine clusters) and marker genes (six subgroups, see Figure 5).
There is only one visibly connected component in the original t-SNE plot, while our UMAP plot shows more structure. Govek et al. [15] applied the combinatorial Laplacian score to identify a variety of genes, including HAVCR2, RSAD2 and GZMK, that are consistent with the topological structure of the data set but do not correspond to the clusters defined in [38]. Govek et al. also demonstrated on the T cell data that the discriminating power of the combinatorial Laplacian score (measured by the area under the receiver-operating characteristic curve) is comparable to that of conventional DGE methods and superior to feature variance without taking topology into account [15].

Previous Results on Mouse Foetal Liver Cell Data Set
Yang et al. [39] selected 1761 heterogeneously expressed genes that correlate with the first two principal components of the data which were then clustered. In a separate study on different data, Mu et al. [42] ranked genes that were differentially expressed in hepatocyte development.

Code Availability
The code for computing eigenscores, the multiscale Laplacian score, and the persistent Rayleigh quotient for signals on a graph is available here: https://github.com/osumray/multiscale-signal-selection-single-cell.git (accessed on 11 August 2022).

Results
In this section, we apply the three multiscale methods outlined above to three singlecell data sets. Eigenscores rank genes at different frequencies: high eigenscores identify dominant genes that align with the underlying cell similarity graph as the frequency increases. The multiscale Laplacian score (MLS) identifies coherent genes as the distance traversed by a random walker on the cell similarity graph is scaled. Third, the persistent Rayleigh quotient identifies genes involved in bifurcation processes when additional temporal meta data are available.

Eigenscores
Eigenscores test whether features such as genes, viewed as functions on nodes representing cells of a graph, align or anti-align with the Laplacian eigenvectors on the graph. They can be used to rank genes similarly to DGE but on different scales in the data, according to their coherence with the topology of the cell similarity graph. They can also be applied to explore gene expression by visualising genes in a gene space. By scoring genes for alignment with individual eigenvectors, as well as selecting relevant genes, we can often shed light on the biological processes in which these genes are involved. Eigenscores are meaningful in data sets with clear community structures, e.g., the PBMC data set. Eigenscores are also meaningful in data sets that cannot be decomposed into distinct clusters but rather have a continuous structure, e.g., the T cell data set. In either case, the eigenscores do not rely on predefined clusters; they scan through the data in an unsupervised way.

The Geometry of PBMC Genes via Eigenscores
We compute eigenscores of the top genes in the PBMC data set [34] (see Figure A2). While many selected genes overlap with cluster marker genes identified by differential gene expression (DGE) via Seurat's default implementation of the non-parametric Wilcoxon rank sum test [41], validating eigenscores, this method identifies 26 additional genes (see Table A1). To interpret the genes, we compare to 12 finer cell subtypes or seven previously determined broader cell subtypes [35] (see Figures 6A and A1). Eigenscores select genes that are enriched in broader cell types (e.g., MALAT1 for all lymphocytes or FTL and FTH1 for all monocytes). Due to their expression in multiple cell clusters, differential gene expression (DGE) does not identify these as highly ranked significant genes for one cell cluster. MALAT1 is a highly-expressed non-coding RNA, whose enrichment may be a feature of damaged or low-quality cells in some scRNA-seq data sets [43,44]. However, the biological differential expression of MALAT1 between immune cell types has also been previously reported, including increased expression in lymphocytes compared with monocytes [45]. In any case, the increased expression of MALAT1 is certainly a feature shared by multiple lymphocyte clusters in the PBMC data set. Genes FTL and FTH1 encode, respectively, the light and heavy chain of ferritin, which is a major protein involved in iron storage and homeostasis. It is highly expressed within myeloid (monocyte) cells, which can be further modulated by inflammatory stimuli [46][47][48].
While traditional DGE identifies markers unique to a given cluster or grouping of cells, eigenscores can also identify genes that exhibit biological variation within multiple disparate cell types. For example, the gene FCGR3A can distinguish between major subpopulations of monocyte cells [49][50][51], major subpopulations of NK cells [52,53] and major subpopulations of T cells [49]. Eigenscore analysis ranks FCGR3A highly ( Figure A2, e 4 and e 5 ), whereas with traditional DGE, it is not ranked in the top 10 genes for any cluster, which is perhaps due to its high level of expression across clusters in subtypes from disparate cell lineages. Moreover, PPBP, a highly differentially expressed marker gene on platelets, is ranked highly by eigenscores and not DGE. For completeness, we include a qualitative and quantitative comparison between eigenscore and DGE ranking by non-parametric Wilcoxon rank sum test (Figure 7) and present the set complement of the top scores.
We can further explore the relationships between genes by projecting the gene space of eigenscores via UMAP ( Figure 6B). The visualisation of 16 dimensional low-frequency eigenscores (eigenscore 1-16 corresponding to 0 < λ i < 0.1) emphasises gene signals that are most coherent with the cell similarity graph structure. We interpret this visualisation of gene space in Figure 6B. Genes plotting in the centre (blue) have low eigenscores, where the gene expression is incoherent with the graph topology (see, for example, gene BAG4). Genes with similar expression patterns in the data set plot together, and a sequence of continuous transitions of gene signals plots continuously in eigenscore space (as illustrated in Figure 1). The flares in the gene space with high eigenscores correspond to groups of genes strongly expressed on clusters corresponding to broad cell types or the cell cycle. We explore and interpret the continuous signal of gene space in Figure 6B [34]. (B) UMAP of genes set in eigenscore space for eigenvectors 1-16. Genes (dots) are colour-coded for the logarithm of the norm of the vector in 16-dimensional eigenscore space. Genes with similar expression patterns in the PBMC single-cell data [34] plot close together in eigenscore space, and expression patterns vary continuously as we move through this space. The outward branches I-VI correspond to genes that are expressed highly on specific groups of cells. Figure 7C gives a quantitative comparison of the amount of overlap in the eigenscore and DGE rankings, showing that there is consistent overlap while we also find additional genes. We highlight that two flares of the gene space geometry are missed by DGE via non-parametric Wilcoxon rank sum test (see Figure 7B), particularly region VI. This region corresponds to genes involved with cell cycle progression, which are highest in a highly proliferative subpopulation of B cells in the PBMC data set. The genes in this region show strong overlap with gene signatures used by Seurat to identify the S phase (PCNA, TYMS, RRM2) and G2/M phase (TOP2A, BIRC5, UBE2C, HMGB2, SMC4, CKS1B). Notably, our eigenscore method is able to identify the cell cycle as an important source of biological variation in this data set that is missed by traditional cluster-based DGE without requiring the need to run an additional signature scoring method relying on the expression of a predefined set of markers for this biological phenomenon. Figure 7. Eigenscores compared to differential gene expression (DGE) on PBMC data set [34]. (A) Comparative study of DGE ranking using Seurat clustering and a non-parametric Wilcoxon rank sum test (log of rank computed from adjusted p-value on x-axis) versus ranking by norm in eigenscore space (log of eigenscore rank of 16 lowest frequencies on y-axis). Example genes in top 100 for one ranking but not the other shown on the sides. (B) Top 100 genes ranked by adjusted p-value in DGE marked on the eigenscore UMAP plot of genes from Figure 6. Two regions in the UMAP not found in the top of DGE are branch V from Figure 6B (T cell and lymphocyte genes that are expressed in larger groups of cells); branch VI (genes expressed in RRM2+ cluster that is not found by DGE). (C) Quantitative comparison of gene ranks given by adjusted p-value in DGE versus norm in 16-dimensional eigenscore space.

Eigenscores for Analysing Data with Continuous Structure: T Cells
We next compute eigenscores of the T cell data [38] (corresponding to 0 < λ < 0.1). As before, we can visualise the gene space in Figure 8A where genes are near to another gene if they have similar expression patterns. For example, a large number of mitochondrial genes, such as MT-CO3, that are lowly expressed in cells across the data set are grouped together as a distinct gene cluster. The detection of cells expressing high levels of such mitochondrial genes can be an important step in filtering poor quality cells from scRNA-seq data sets [54].
The continuous nature of the T cell data set is reflected in the many intermediate to high eigenscore genes that show coherent regions in the data set on multiple scales. While we can identify groups of cells that have coherent gene expression behaviour, such as the clusters formed by EEF1A1+ cells, HBB+ cells, ANXA1+ cells and HSPA1A+ cells, we also find genes that have unique expressions that are unlike any other gene signals (e.g., GNLY and GZMB, which are both known secreted cytotoxic effector genes found across T and NK cell subsets). To explore the geometry of genes further, we analyse the top 20 genes ranked by eigenscore norm in 1-19 dimensional eigenscore space. We find AREG (9th in eigenscore rank) does not show up on the DGE ranking. As shown in the UMAP cell subfigure, AREG is expressed in between clusters of cell type, connecting NK cells and a cluster of CD8 T cells ( Figure 5, see Section 3.2.2 for biological implications). In this way, eigenscores provide insight into both the continuous and discrete nature of T cell behaviour.

Multiscale Laplacian Score
The multiscale Laplacian score (MLS), similar to the 0-dimensional combinatorial Laplacian score [15] and gene connectivity score [6], extends DGE to settings in which a stable partition of cells into groups is not feasible; therefore, no assignment of cells into groups is required. The MLS ranks genes by their consistency with the topological structure of the data set and performs such topological consistency analyses at multiple resolutions.
The resolutions are determined by finding scales in the data that provide stable community structures. We reiterate that the MLS calculation does not use the obtained communities; rather, we use the resolution that provides a stable communities via local minima in variation of information (VI). We highlight that the communities we find in both the PBMC and T cell data increase in size with increasing Markov time (see Figures 9 and 10, panels A). This size increase illustrates that the MLS at the selected Markov times tests for consistency at different scales in gene space.

Multiscale Laplacian Score of PBMC Data
In Figure 9, we apply MLS to the PBMC data set [34]. These data permit a stable clustering into five larger groups of cells. However, these clusters contain non-stable substructures (see Figure 9A). The substructures largely align with the clusters found in the Seurat VST vignette [36] (Figure A1). We find that the genes GZMK and CD8B exhibit a higher consistency with the structures at the first resolution (t 1 ) than at later resolutions ( Figure 9B). The gene GZMK is highly expressed on the intersection of two communities at t 1 which correspond to naive and memory CD8 T cells, but it is not highly expressed on the union of these two communities (the two communities merge at resolution t 2 ). GZMK seems to mark a transition between these two clusters. Similarly, CD8B expression is detected within the left-most community assigned to the broader CD4 T cell cluster of the UMAP plots ( Figure A1), which is merged into a larger community at t 2 . The genes GZMB and XCL2 are examples of features with low MLS at t 2 but relatively high MLS at t 3 . The former is highly expressed on a community corresponding to effector CD8 T cells (cluster 6 in Figure A1), the latter on the intersection of the effector and the naive/memory T cell communities (clusters 6 and 5 and 7 in Figure A1). At resolution t 3 , clusters 5 and 7 are merged with cluster 6. The communities at resolution t 3 correspond to the seven populations describing the different cell types with the NK and CD8 T cells merged.
Examples of genes with low MLS at t 3 , relative to the standard LS as in [15] and MLS at other resolutions, include AIF1 and CTSS. Both genes are highly and consistently expressed on the community consisting of CD14+ and FCGR3A+ monocytes ( Figure A1) [35,36]. Resolution t 3 is the first resolution at which CD14+ and FCGR3A+ monocytes form a single community.

Multiscale Laplacian Score of T Cell Data
We compute the MLS to a human T cell data set from Labrechts et al. [38] (Figure 10). As remarked by [15], these data do not allow for any partitioning into stable clusters (see high VI values in Figure 10A). Next, we determine three resolutions of interest based on the VI. Genes with a relatively low MLS at the finest resolution, t 1 , include IGKC and IFI27. Both are highly expressed on a small group of cells (in the center of left hand side and top right of UMAP plot respectively; see Figure 10B).
The gene IGKC is an immunoglobulin gene, an antibody component found in B cell subsets, particularly plasma cells [55]. Cells expressing IGKC are also JCHAIN+ and positive for antibody subtypes suggestive of class switching (e.g., IGHG1 and IGHA1). Since this is a T cell data set, this almost certainly indicates that the cells in question are doublets (two cells in the same experimental droplet), specifically T cells binding B cells. While not representing single cell states, it is important that these readings are picked up in the analysis.
The gene IFI27 is part of an antiviral/interferon-induced (IFI) response signature. It is particularly interesting that MLS can detect a specific transcriptional programme shared across multiple cell types (CD4+ and CD8+ T cells). This could represent a shared T cell programme directed against viruses or induced during stress responses (e.g., for scRNA-seq processing) [56].
As a particular example of informative gene prioritisation, we highlight AREG at resolution t 2 , where it is expressed highly on a group of cells bridging nearby clusters identified as NK cells and CD8 T cells based on overall marker expression of each cluster ( Figure 5). Within the immune system, AREG is expressed by subsets of NK cells and other types of innate lymphoid cells (ILCs) [53,57], where it plays an important role in mediating type 2 immunity [57,58]. Despite bridging different clusters in our global clustering and that of the original authors [38], this population likely represents cells in various states of transition between two previously described AREG+ NK cell phenotypes: one with high levels of secreted molecules associated with effector functions (CCL3, CCL4) and the other expressing homing receptors associated with a more circulatory phenotype (CD44, SELL) [53] (see Figure A3). Therefore, while the original authors identified these two subsets as discrete NK and type 1 ILC-like cell types, respectively ( [38] Figure S13), both our eigenscore and MLS-informed approaches highlight AREG as a shared feature, supporting the notion that these two populations may be consistent with a more continuous transition between CCL3+ and SELL+ states within the NK cell population. This interpretation is further supported by the preserved expression of NK cell markers (e.g., CD94/KLRD1, NKG2A/KLRC1) in the SELL+AREG+ cells, which is often considered to be a feature of NK cells that is not shared by otherwise closely related type 1 ILCs [59,60].
Similarly, GZMB is highly expressed on the intersection of exhausted and proliferating T cells (see Figure 5), two clusters of which are visible in the community structure at t 2 but merge at t 3 . Finally, at Markov time t 3 , FGFBP2 and NKG7 are examples of genes with relatively low expression that are highly and consistently expressed on the cluster of NK cells.

Persistent Rayleigh Quotient
Cell bifurcation methods, such as trajectory inference algorithms, seek to assign a pseudotime to each cell by fitting a tree onto the data set [8,61] or fit a statistical model to each gene and then test against a null model [62][63][64][65][66]. The Rayleigh quotient and Laplacian score have proven useful in selecting genes [15] but are agnostic to any prior cell knowledge or meta data. Here, we use additional time information to filter the graph. This time information could be real developmental time or an inferred pseudotime derived from trajectory analysis. Using the persistent Rayleigh quotient (PRQ), we can then separate genes that have different roles in this differentiation process.
We apply the PRQ to a bifurcation describing the differentiation of mouse hepatic cells by Yang et al. [39] (see Figure 11). Hepatoblasts, hepatocytes, and cholangiocytes were sampled from mouse embryos at seven time points, from embryonic day 10 to day 17.
Hepatoblasts are a parental cell type whose daughter cells differentiate into hepatocyte and cholangiocyte cell types. As the topologically interesting direction is in 'reverse time', we assign a filtration to the graph by assigning a node from day t the filtration value 17 − t. For each pair of steps i and j with i ≤ j in the filtration of the graph, we apply the normalised persistent Rayleigh quotient to a particular gene. We represent this as a two-dimensional score for each gene in Figure 11. This is reminiscent of the birth-death persistence diagram in TDA where the x-axis records the filtration step where certain topological features first appear, known as the birth time, while the y-axis records where these features finally vanish, known as the death time. In order to show the largest differences between parts of the PRQ score for a gene, we compare the normalised persistent Rayleigh quotients PRQ(2, 7) with PRQ(7, 7) in Figure 11C.
We have highlighted genes that were found to be differentially expressed during hepatoblast differentiation in [42] (Figure 11A,B,D,E). As expected, genes such as Tubb5, Mdk, and Igfbp1, which are expressed in hepatoblasts and only one of hepatocytes or cholangiocytes, have a higher value for the persistent Rayleigh quotient than the full Rayleigh quotient ( Figure 11A,B). The gene Mdk is known to show a decrease during hepatoblast maturation towards hepatocytes in utero, corresponding with the upregulation of genes involved in hepatocyte function including Aldob [42,67]; however, this gene is preserved in cholangiocyte populations [68], agreeing with our observations using the PRQ. Igfbp1 is not known to play a role in differentation into hepatocytes; however, the expression of Igfbp1 has been previously shown in hepatocytes, where it has a prosurvival role that can be enhanced by p53 activity [69]. Genes Aldob and Mt2 are expressed in cholangiocytes and hepatocytes but not hepatoblasts. Hence, their persistent Rayleigh quotient has a lower value than the full Rayleigh quotient ( Figure 11D). Finally, the persistent Rayleigh quotient and full Rayleigh quotient for Fabp1 and Ahsg are almost the same. This corresponds to the fact that Fabp1 and Ahsg are highly expressed in only one of the daughter cell types ( Figure 11E). The full Rayleigh quotient PRQ(7, 7) can sort genes based on how coherently expressed they are with respect to the underlying graph, but it does not distinguish between different expression patterns relevant to development. In contrast, the persistent Rayleigh quotient can differentiate genes whose expression pattern is relevant to bifurcation. Figure 11. The persistent Rayleigh quotient separates genes by their role in a cell differentiation process. The PRQ is parameterised by birth (i) and death (j), each pair (i, j) assigning a non-negative number to every gene. We plot these values for each gene for (i = 7, j = 7) on the x-axis and (i = 2, j = 7) on the y-axis on subfigure (C). Selected for display (A,B,D,E) are top differentially expressed genes from [42] on the data from [39] (see Figure A4). Genes Tubb5, Mdk, and Igfbp1 are expressed in parent and one daughter cell lineage, hepatoblast to (A) cholangiocyte or (B) hepatocyte and lie above the diagonal. Genes Aldob and Mt2 are expressed in both daughter cell types but not in the parent cell type (D), and they lie below the diagonal. Genes Ahsg and Fabp1 are only expressed in one daughter cell type (E) and lie on the diagonal (compare with Figure 4).

Conclusions
Inspired by the multiscale nature of topological data analysis, we proposed three multiscale methods relying on spectral graph theory and signal processing, which complement standard differential gene expression. We showcased the versatility of eigenscores and multiscale Laplacian scores (MLS) on different data sets. These methods select genes in an unsupervised and continuous manner without requiring a clustering of cells and therefore can identify genes with important biological variation within and across disparate clusters, which may be more difficult to identify using traditional DGE. The persistent Rayleigh quotient (PRQ) was applied to a cell differentiation data set, which validated a known cellular bifurcation and separated genes based on their role in the differentiation process. These methods proposed provide multiple different rankings of genes. Future directions include the systematic comparison of multiple rankings (e.g., using Hodge theory) and summary statistics to compare with methods that output one-dimensionally ranked genes. The PRQ also gives a rich representation for each gene, and future work will explore statistical integration with other pipelines. We provide available code, and a future goal is to create a topological genomics signaling package to increase accessibility and adoption.
While we focused on the geometry of gene space with a specific k-nn graph constructed using scRNA-seq, the proposed methods are flexible for other graphs, such as Mapper graphs [6,70], but the resulting analysis would change if the underlying cell graph changes. The choice of resolution(s) for the MLS is not limited to Markov stability times (e.g., graph wavelets [71]). Future directions include extending these signal selection approaches to other signals more generally (e.g., epigenetic factors), other complex single-cell network structures [5] or other higher-order networks [12,72], with a view towards data integration [73].
The methods we present in this paper provide an exciting foundation to rethink the de novo identification of important genes in scRNA-seq data sets. For example, by assessing the geometry of gene space using eigenscores, we are able to implement a new variant of gene set enrichment analysis, where the identification of meaningful groups of genes is driven by the expression patterns of the genes themselves at any scale, instead of a test statistic averaged across a predefined comparison of clusters. This approach allowed us to not only identify broad markers of lineages and cell types in our example data sets but also important smaller-scale biological phenomena such as the cell cycle and preserved AREG expression bridging NK cell subtypes without relying on the need for optimal cluster assignment used by traditional DGE or biological priors required by signature scoring methods. In this way, our methods allow the user to build understanding of their data sets at multiple levels in a single analysis by identifying the key genes that drive these multiscale sources of biological variation. Data Availability Statement: Publicly available data sets were analysed in this study. The PBMC data set is available from the 10X website under the link https://cf.10xgenomics.com/samples/cell/ pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz (accessed 2 August 2022). The T Cell data set is available in ArrayExpress under accessions E-MTAB-6149 (https://www.ebi.ac.uk/arrayexpress/ experiments/E-MTAB-6149/, accessed 2 August 2022) and E-MTAB-6653 (https://www.ebi.ac.uk/ arrayexpress/experiments/E-MTAB-6653/, accessed 2 August 2022). The Mouse foetal liver cell data set is available in NCBI GEO under accession GSE80732 (https://www.ncbi.nlm.nih.gov/geo/ query/acc.cgi?acc=GSE80732, accessed 2 August 2022).
in part by EPSRC EP/R018472/1. LM, OS, TMC, XL and HMB are funded by the Ludwig Institute for Cancer Research Ltd. TMC gratefully acknowledges scholarship support from the Rhodes Trust. For the purpose of Open Access, the authors have applied a CC BY public copyright licence to any Author Accepted Manuscript (AAM) version arising from this submission.