Refined Mode-Clustering via the Gradient of Slope

In this paper, we propose a new clustering method inspired by mode-clustering that not only finds clusters, but also assigns each cluster an attribute label. Clusters obtained from our method show the connectivity of the underlying distribution. We also design a local two-sample test based on the clustering result that has more power than a conventional method. We apply our method to astronomy and GvHD data and show that our method finds meaningful clusters. We also derive the statistical and computational theory of our method.


Introduction
Mode-clustering is a clustering method that partitions the data into groups according to the local modes of the underlying density function [1,2,3,4]. A local mode of the density is often the signature of a cluster, so mode-clustering leads to clusters that are easy to interpret. In practice, we estimate the density function from the data and perform mode-clustering via the density estimator. When we use a kernel density estimator (KDE), there is a simple and elegant algorithm called the mean-shift algorithm [5,6,7] that allows us to compute clusters easily. The mean-shift algorithm has made mode-clustering a numerically friendly problem.
When applied to a scientific problem, a clustering method is often used to gain insight from the data [8,9]. Sometimes, finding clusters is not the ultimate goal; the connectivity among clusters may yield valuable information for scientists. To see this, consider the galaxy sample from the Sloan Digital Sky Survey [10] in Figure 1. While the original data are 3D, here we use a 2D slice of the original data to illustrate the idea. Each black dot indicates the location of a galaxy at a particular position in the sky. Astronomers seek to find clusters of galaxies and their connectivity, since these quantities (clusters and their connections) are associated with the large-scale structure of the universe. Our method finds the underlying connectivity structures without assuming any parametric form of the underlying distribution. The middle panel displays the result of the usual mode-clustering method, which shows only clusters, not how they connect to each other. In contrast, the result of our proposed method is given in the right panel, which finds a set of dense clusters (purple regions), some regions serving as bridges connecting clusters (green areas), and a set of low-density regions (yellow regions). Thus, our clustering method allows us to better identify the structures of galaxies.
We improve the usual mode-clustering method by (1) adding additional clusters that further partition the entire sample space, and (2) assigning an attribute label to each cluster. The attribute label indicates whether a cluster is a 'robust cluster' (a cluster around a local mode; purple regions in Figure 1), a 'boundary cluster' (a cluster bridging two or more robust clusters; green regions in Figure 1), or an 'outlier cluster' (a cluster representing low-density regions; yellow regions in Figure 1). With this refined clustering result, we gain further insights into the underlying density function and are able to infer the intricate structure behind the data. Furthermore, we can apply our improved clustering method to the two-sample test. In this case, we can identify the local differences between the two populations and provide a more sensitive result. Note that in the usual case of cluster analysis, adding more clusters is not a preferred idea. However, if our goal is to detect the underlying structures (such as finding the connectivity of high-density regions in the galaxy data in Figure 1), using more clusters as an intermediate step to find connectivity is a plausible approach.
To summarize, our main contributions are as follows:
• We propose a new clustering method based on the slope function that assigns an additional attribute label to each cluster (Section 3).
• We propose a new two-sample test using the clustering result (Section 4).
• We introduce a visualization method using the detected clusters (Algorithm 3).
• We derive both statistical and computational guarantees of the proposed method (Section 7).
Related work. The idea of using local modes to cluster observations dates back to [5], where the authors used local modes of the KDE to cluster observations and proposed the mean-shift algorithm for this purpose [5,11]. Mode-clustering has been widely studied in the statistics and machine-learning communities [3,4,7,12,13,14]. However, the KDE is not the only option for mode-clustering: [1,15] proposed a Gaussian mixture model method, [16] used a fuzzy clustering algorithm, and [17] introduced a nearest-neighbor density method.
Outline. The paper is organized as follows. We start with a brief review of mode-clustering in Section 2 and formally introduce our method in Section 3. In Section 4, we combine the two-sample test and our approach to create a local two-sample test. We use simulations to illustrate our method on simple examples in Section 5. We show the applicability of our approach on three real datasets in Section 6. Finally, we study both the statistical and computational theory of our method in Section 7.

Review of Mode-Clustering
We start with a review of mode-clustering [2,4,12,18]. The concept of mode-clustering is based on the rationale of associating clusters with the regions around the modes of the density. When the density function is estimated by the kernel density estimator, there is an elegant algorithm called the mean-shift algorithm [5] that can easily perform the clustering.
In more detail, let p be a probability density function with a compact support K ⊂ R^d. Starting at any point x, mode-clustering creates a gradient ascent flow γ_x(t) such that

γ_x(0) = x,  γ_x'(t) = ∇p(γ_x(t)).

Namely, the flow γ_x(t) starts at point x and moves according to the gradient at the present location. Let γ_x(∞) = lim_{t→∞} γ_x(t) be the destination of the flow γ_x(t). According to Morse theory [19,20], when the function is smooth (being a Morse function), such a flow converges to a local maximum of p except for starting points in a set of Lebesgue measure 0. Mode-clustering partitions the space according to the destination of the gradient flow; that is, two points x, y are assigned to the same cluster if γ_x(∞) = γ_y(∞). For a local mode η, we define its basin of attraction as D(η) = {x : γ_x(∞) = η}. The basin of attraction describes the set of points that belong to the same cluster.
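To make the ascent flow γ_x(t) concrete, here is a minimal numerical sketch: a simple Euler discretization of the flow for a user-supplied gradient ∇p. The toy density, step size, and stopping rule below are our own illustrative choices, not part of the paper.

```python
import numpy as np

def gradient_ascent_flow(x0, grad_p, step=0.05, tol=1e-8, max_iter=5000):
    """Euler discretization of the ascent flow gamma_x(t): follow the
    density gradient from x0 until it (approximately) vanishes."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_p(x)
        if np.linalg.norm(g) < tol:
            break
        x = x + step * g
    return x  # approximate destination gamma_x(infinity)

# Toy density: an isotropic Gaussian bump with a single mode at the origin,
# so the flow from any start point should converge to 0.
grad = lambda x: -x * np.exp(-0.5 * np.sum(x**2))
dest = gradient_ascent_flow(np.array([1.0, -0.5]), grad)
```

In practice the gradient would come from a density estimator rather than a closed-form density.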
In practice, we do not know p, so we replace it by a density estimator p̂_n. A common choice is the kernel density estimator, in which

p̂_n(x) = (1/(n h^d)) Σ_{i=1}^n K((x − X_i)/h),

where K is a smooth function (also known as the kernel function), such as a Gaussian kernel, and h > 0 is the smoothing bandwidth that determines the amount of smoothness. Since we use a nonparametric density estimator, we do not need any parametric assumptions on the shape of the distribution. With this choice, we define a sample analogue of the flow γ_x(t) as

γ̂_x(0) = x,  γ̂_x'(t) = ∇p̂_n(γ̂_x(t)),

and partition the space according to the destination of γ̂_x.
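As a small illustration of the KDE and the estimated gradient that drives the sample flow γ̂_x, the following sketch evaluates a Gaussian-kernel p̂_n and ∇p̂_n at a query point. The function names and the tiny dataset are ours.

```python
import numpy as np

def kde(x, data, h):
    """Gaussian-kernel density estimate p_hat_n(x) at a single query point."""
    n, d = data.shape
    u = (x - data) / h                                   # (n, d) scaled differences
    k = np.exp(-0.5 * np.sum(u**2, axis=1)) / (2 * np.pi) ** (d / 2)
    return k.mean() / h**d

def kde_gradient(x, data, h):
    """Gradient of the Gaussian KDE at x (the direction of the ascent flow)."""
    n, d = data.shape
    u = (x - data) / h
    k = np.exp(-0.5 * np.sum(u**2, axis=1)) / (2 * np.pi) ** (d / 2)
    return -(k[:, None] * u).mean(axis=0) / h ** (d + 1)

# Four points placed symmetrically around the origin: the estimated
# gradient at 0 vanishes by symmetry, while the density there is positive.
data = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
p0 = kde(np.zeros(2), data, h=1.0)
g0 = kde_gradient(np.zeros(2), data, h=1.0)
```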

Refining the Clusters by the Gradient of Slope
As mentioned previously, mode-clustering has the limitation that the resulting clusters do not provide enough information on the finer structure of the density. To resolve this problem, we introduce a new clustering method based on gradient descent flows of the 'slope' function. Let ∇p(x) be the gradient of p. Define the slope function of p as s(x) = ‖∇p(x)‖². Namely, the slope function is the squared amplitude of the density gradient.
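Since s(x) = ‖∇p(x)‖², a one-line chain-rule computation (a standard identity, written out here for clarity) expresses the gradient of the slope function through the density's gradient and Hessian:

```latex
\nabla s(x) \;=\; \nabla \big\| \nabla p(x) \big\|^{2} \;=\; 2\, \nabla^{2} p(x)\, \nabla p(x).
```

Thus evaluating ∇s only requires the first two derivatives of p.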
An interesting property of the slope function is that its minimal points {x : s(x) = 0} = {x : ∇p(x) = 0} = C form the collection of critical points of p, so C contains the local modes of p as well as other critical points, such as saddle points and local minima. According to Morse theory [21,22], there is a saddle point between two nearby local modes when the function is a Morse function. A Morse function is a smooth function f such that all eigenvalues of the Hessian matrix of f at every critical point are away from 0. This implies that saddle points may be used to bridge the regions around two local modes.
With this insight, we propose to create clusters using the gradient 'descent' flow of s(x). Let ∇s(x) be the gradient of the slope function. Given a starting point x ∈ R^d, we construct a gradient descent flow as follows:

π_x(0) = x,  π_x'(t) = −∇s(π_x(t)).   (1)

That is, π_x is a flow starting from x and moving along the direction of −∇s. Similar to mode-clustering, we use the destinations of the gradient flows to cluster the entire sample space.
Note that if the slope function s is a Morse function, the corresponding PDF p is also a Morse function, as described in the following lemma.

Lemma 1. If s(x) is a Morse function, then p(x) is a Morse function.
Throughout this paper, we will assume that the slope function is Morse. Thus, the corresponding PDF will also be a Morse function and all critical points of the PDF will be well-separated.

Type of Clusters
Recall that C is the collection of critical points of density p. Let S be the collection of local minima of the slope function s(x). It is easy to see C ⊂ S, since any critical point of p has gradient 0, so it is also a local minimum of s.
Thus, the gradient flow in Equation (1) leads to a partition of the sample space. Specifically, let π x (∞) be the destination of the gradient flow π x (t). For an element m ∈ C, let S(m) = {x : π x (∞) = m} be its basin of attraction.
We use the signs of the eigenvalues of ∇²p(x) to assign an additional attribute to each basin, so the set {S(m) : m ∈ C} forms a collection of meaningful disjoint regions. In more detail, for a critical point m ∈ C such that p(m) > δ for a small threshold δ, its basin S(m) is classified as

a robust cluster if all eigenvalues of ∇²p(m) are negative,
a boundary cluster if ∇²p(m) has both negative and positive eigenvalues,
an outlier cluster if all eigenvalues of ∇²p(m) are positive.   (2)

In the case of p(m) ≤ δ, we always assign it as an outlier cluster. Note that the threshold δ is added to stabilize the numerical calculation. In other words, we refer to the basin of attraction S(m) as a robust cluster if m ∈ C is a local mode of p. If m is a local minimum of p, then we call its basin of attraction an outlier cluster. The remaining clusters, which are regions connecting robust clusters, are called boundary clusters. Note that the regions outside the support are, by definition, a set of local minima; we assign the same cluster label, an outlier cluster, to those x whose destination π_x(∞) is outside the support. Our classification of S(m) is based on the following observations. Regions around local modes of p are where we have strong confidence that the points belong to the cluster represented by their nearby local mode. Regions around local minima of p are low-density areas whose points we should treat as anomalies/outliers. Figure 1 provides a concrete example of how our clustering method can lead to more scientific insight: the connectivity among robust clusters may reveal the intricate structure of the underlying distribution.
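The eigenvalue-sign rule above can be sketched in a few lines. This is an illustrative implementation assuming callables for p and its Hessian at a critical point; the function names and the toy Gaussian check are ours.

```python
import numpy as np

def classify_basin(m, density, hessian, delta=1e-3):
    """Attribute label for the basin S(m) of a critical point m of p,
    based on the sign pattern of the eigenvalues of the Hessian at m."""
    if density(m) <= delta:
        return "outlier"                 # low-density region: always an outlier cluster
    eig = np.linalg.eigvalsh(hessian(m))
    if np.all(eig < 0):
        return "robust"                  # local mode of p
    if np.all(eig > 0):
        return "outlier"                 # local minimum of p
    return "boundary"                    # saddle point: bridges robust clusters

# Toy check: standard Gaussian bump; the critical point at the origin is a mode,
# where the Hessian of p = exp(-|x|^2/2) equals (x x^T - I) p(x) = -I.
p = lambda x: float(np.exp(-0.5 * np.dot(x, x)))
H = lambda x: (np.outer(x, x) - np.eye(len(x))) * p(x)
label = classify_basin(np.zeros(2), p, H)
```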
Defining different types of clusters allows us to partition the whole space into meaningful subregions. Given a random sample, to assign a cluster label to each point, we simply examine which basin of attraction each data point falls in and pass the cluster label from the region to the data point. After assigning cluster labels to data points, the cluster categories in Equation (2) provide additional information about the characteristics of each data point. Data points in robust clusters are highly clustered together; points in outlier clusters lie in low-density regions and can be viewed as anomalies; the remaining points are in boundary clusters, where they are not well-clustered and lie in the connection regions among different robust clusters.

Estimators
The above procedure is defined when we have access to the true PDF p. In practice, we do not know p, but we have an IID random sample X_1, ..., X_n from p with a compact support K. We therefore estimate p from X_1, ..., X_n and use the estimated PDF to perform the above clustering task.
While there are many choices of density estimators, we consider the kernel density estimator (KDE) in this paper, since it has a nice form and its derivatives are well-established [14,23,24,25]. In more detail, the KDE is

p̂_n(x) = (1/(n h^d)) Σ_{i=1}^n K((x − X_i)/h),

where K is a smooth function (also known as the kernel function), such as a Gaussian kernel, and h > 0 is the smoothing bandwidth that determines the amount of smoothness. Note that the bandwidth h in the KDE could be replaced by an h_i that depends on each observation; this is called the variable-bandwidth KDE in Breiman et al. [26]. However, since the choice of how h_i depends on each observation is a non-trivial problem, to simplify the problem we set all bandwidths to be the same. Based on ŝ_n(x), we first construct a corresponding estimated flow using ∇ŝ_n(x):

π̂_x(0) = x,  π̂_x'(t) = −∇ŝ_n(π̂_x(t)).

An appealing feature is that ∇ŝ_n(x) has an explicit form:

∇ŝ_n(x) = 2 ∇²p̂_n(x) ∇p̂_n(x),

where ∇p̂_n(x) and ∇²p̂_n(x) are the estimated density gradient and Hessian matrix of p. Thus, to numerically construct the gradient flow π̂_x(t), we update x by

x ← x − γ ∇ŝ_n(x),

where γ > 0 is the learning rate parameter. Algorithm 1 summarizes the gradient descent approach.
1. Input: the estimated density p̂_n(x) and a point x.
2. Initialize x_0 = x and iterate the following equation until convergence (γ is a step size that can be set to a constant):

x_{t+1} = x_t − γ ∇ŝ_n(x_t).

With an output from Algorithm 1, we can group observations into different clusters, with each cluster labeled by a local minimum of ŝ_n. We assign an attribute to each cluster via the rule in Equation (2). Note that the smoothing bias could cause some bias around the boundaries of clusters. However, when h → 0, this bias is asymptotically negligible.
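Algorithm 1 can be sketched as follows, assuming callables for the estimated gradient and Hessian and using the explicit form ∇ŝ_n = 2∇²p̂_n∇p̂_n. The toy Gaussian density and the step-size value are our own illustrative choices.

```python
import numpy as np

def slope_descent(x0, grad_p, hess_p, gamma=0.05, tol=1e-10, max_iter=10000):
    """Algorithm 1 sketch: gradient descent on s(x) = ||grad p(x)||^2,
    using grad s(x) = 2 * hess_p(x) @ grad_p(x)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        gs = 2.0 * hess_p(x) @ grad_p(x)
        if np.linalg.norm(gs) < tol:
            break
        x = x - gamma * gs
    return x  # approximate local minimum of s, i.e., a critical point of p

# Toy check: for p = exp(-|x|^2/2), the only critical point is the mode at 0,
# so a descent started nearby should land at the origin.
grad_p = lambda x: -x * np.exp(-0.5 * np.dot(x, x))
hess_p = lambda x: (np.outer(x, x) - np.eye(len(x))) * np.exp(-0.5 * np.dot(x, x))
dest = slope_descent(np.array([0.3, -0.2]), grad_p, hess_p)
```

With an estimated density, `grad_p` and `hess_p` would be the derivatives of the KDE rather than closed-form expressions.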

Enhancements in Two-Sample Tests
Our clustering method can be used as a localized two-sample test. An overview of the idea is as follows. Given two random samples, we first merge them and use our clustering method to form a partition of the sample space. Under the null hypothesis, the two samples are from the same distribution, so the proportion of each sample within each cluster should be similar. By comparing the differences in proportions, we obtain a localized two-sample test. Algorithm 2 summarizes the procedure.
In more detail, suppose we want to compare two samples G_1 = {X_1, X_2, ..., X_N} and G_2 = {Y_1, Y_2, ..., Y_M}. Under H_0, the two samples are from the same distribution, so they have the same PDF q. We first pool both samples together to form a joint dataset G_all = G_1 ∪ G_2. We then compute the KDE p̂_n using G_all, compute the corresponding estimated slope function ŝ_n, and apply Algorithm 1 to form clusters. Thus, we obtain a partition of G_all. Under H_0, the proportion of Sample 1 in each cluster should be roughly the same as the global proportion N/(N+M). Therefore, we can apply a simple test of proportions within each cluster to obtain a p-value. In practice, we often focus only on the robust and boundary clusters and ignore the outlier clusters because of sample size considerations. Let D_1, ..., D_J ⊂ G_all be the robust and boundary clusters, let

r_0 = N/(N + M)    (6)

be the global proportion, and let

r_j = |D_j ∩ G_1| / |D_j|    (7)

be the observed proportion of Sample 1 in cluster D_j. We use the test statistic

Z_j = (r_j − r_0) / sqrt(r_0(1 − r_0)/n_j),

where n_j = |D_j| is the total number of pooled sample points within cluster D_j. When H_0 is true, the test statistic Z_j asymptotically follows a standard normal distribution. Note that since we are conducting multiple tests, we reject the null hypothesis after applying the Bonferroni correction.
1. Combine the two samples (G_1 and G_2) into one, called G_all, and compute r_0 = N/(N+M) from Equation (6).
2. Construct a kernel density estimator using G_all and its slope function, and apply Algorithm 1 to form clusters based on the convergent points.
3. Assign an attribute to each cluster according to Equation (2).
4. Let the robust clusters and boundary clusters be D_1, D_2, ..., D_J, where D_j ⊂ G_all for each j.
5. For each cluster D_j, compute r_j from Equation (7), construct the Z statistic

Z_j = (r_j − r_0) / sqrt(r_0(1 − r_0)/n_j),

and find the corresponding p-value p_j.
6. Reject H_0 if p_j < α/J for some j under the significance level α.
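The per-cluster comparison in Algorithm 2 is a standard one-sample proportion z-test against the pooled proportion; a minimal sketch (the function name and the example counts are illustrative):

```python
from math import erf, sqrt

def cluster_proportion_test(n1_in_cluster, n_cluster, N, M):
    """Compare a cluster's Sample-1 proportion r_j with the global
    proportion r_0 = N/(N+M); return the z statistic and a two-sided
    p-value from the standard normal approximation."""
    r0 = N / (N + M)
    rj = n1_in_cluster / n_cluster
    z = (rj - r0) / sqrt(r0 * (1 - r0) / n_cluster)
    # two-sided p-value via the standard normal CDF, Phi(t) = (1 + erf(t/sqrt(2)))/2
    p = 2.0 * (1.0 - 0.5 * (1.0 + erf(abs(z) / sqrt(2.0))))
    return z, p

# Example: equal pooled samples (N = M = 500); a cluster of 100 points,
# 60 of which come from Sample 1.
z, p = cluster_proportion_test(60, 100, 500, 500)
```

With J tested clusters, the Bonferroni rule rejects H_0 whenever some p_j < α/J.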
We can apply this idea to other clustering algorithms. However, we need to be careful when implementing it because we use the data twice: first to form clusters, then again to perform two-sample tests. This could inflate the Type 1 error. Our approach is asymptotically valid because the clusters from the estimated slope converge to the clusters of the population slope (see Section 7). Although our method may not control the Type 1 error in finite samples, our simulation results in Section 5.2 show that the procedure still controls the Type 1 error; this might be due to the conservativeness of the Bonferroni correction.
The advantage of this new two-sample test is that it uses local information, so if the two distributions differ only in a small region, the method is more powerful than a conventional two-sample test. In particular, the robust clusters are often the ones with more power because they have a larger sample size, and a bump in the pooled sample's density could be created by a density bump of one sample but not the other, leading to a region with high testing power. In Section 5, we demonstrate this through numerical simulations.

An Approximation Method
The major computational burden of Algorithm 2 comes from Step 2, where we apply Algorithm 1 to 'every observation'. This may be computationally heavy if the sample size is large. Here we propose a quick approximation to the clustering result.
Instead of applying Algorithm 1 to every observation, we randomly subsample the original data (in high dimensions) or create a grid of points (in low dimensions) and apply Algorithm 1 only to this smaller set of points. This gives us an approximate set of local minima of the slope function. We then assign a cluster label to each observation according to the nearest local minimum.
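The nearest-minimum assignment step can be sketched as follows; this is a brute-force distance computation with names of our choosing (a KD-tree would be preferable for large n):

```python
import numpy as np

def assign_by_nearest_minimum(data, minima, attrs):
    """Label each observation by its nearest detected local minimum of the
    slope, inheriting that minimum's attribute ('robust'/'boundary'/'outlier')."""
    # (n, k) pairwise distances between observations and detected minima
    d = np.linalg.norm(data[:, None, :] - minima[None, :, :], axis=2)
    idx = d.argmin(axis=1)
    return idx, [attrs[i] for i in idx]

# Tiny example: two observations, two detected minima with known attributes.
data = np.array([[0.0, 0.1], [5.0, 5.0]])
minima = np.array([[0.0, 0.0], [5.0, 5.1]])
idx, labels = assign_by_nearest_minimum(data, minima, ["robust", "outlier"])
```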

Simulations
In this section, we demonstrate the applicability of our method by applying it to some simulation setups. Note that in practice, we need to choose the smoothing bandwidth h in the KDE. Silverman's rule [27] is one of the most popular methods for bandwidth selection. The idea is to find the optimal bandwidth by minimizing the mean integrated squared error of the estimated density. Silverman [27] proposed using the normal density to approximate the second derivative of the true density, and using the interquartile range to provide a robust estimate of the sample standard deviation. For the univariate case, it is defined as

h = 0.9 min(σ̂, IQR/1.34) n^{−1/5},

where σ̂ is the sample standard deviation and IQR is the interquartile range. As discussed earlier, we choose h = C (log n / n)^{1/(d+8)}, where C is a constant. This choice is motivated by the theoretical analysis in Section 7 (Theorem 1). In practice, we do not know C, so we apply a modification of Silverman's rule [27]:

h_k = 0.9 min(σ̂_k, IQR_k/1.34) n^{−1/(d+8)},    (8)

where σ̂_k is the standard deviation of the sample in the kth dimension, IQR_k is the interquartile range in the kth dimension, and k = 1, 2, ..., d. Note that our procedure involves estimating both the gradient and the Hessian of the PDF. The optimal bandwidths for the two quantities are different, so one may apply two separate bandwidths for gradient and Hessian estimation. However, our empirical studies show that a single bandwidth (optimal for Hessian estimation) still leads to reliable results. Note that this bandwidth selector tends to oversmooth the data in the sense that some density peaks in Figure 6b were not detected (not in purple color).
Note that e_i is the ith standard basis vector, and I_2 is the 2 × 2 identity matrix. For each scenario, we apply the gradient flow method and draw the contours. If points are outliers, their destinations go to infinity; thus, we set a threshold to stop them from moving and assign them to outlier clusters. Figure 2 demonstrates that we identify both the two clusters and the boundary between them. Each colored region in panels (a-c) is the basin of attraction of a local minimum of s(x). Panels (d-f) provide examples of clustering data points. Given the setting of a two-component, equal-weight Gaussian mixture, it is straightforward to verify that the gradient flow algorithm can successfully distinguish points according to their destinations. The purple points represent points that belong to the corresponding clusters with strong confidence, while the green points represent points in low-density areas that belong to the connection regions among clusters. The yellow points represent points that are not important to any of the clusters. In summary, our proposed method performs well and is not affected by changes of covariance or by outliers.
Four-Gaussian mixture. To show how boundary clusters can serve as bridges among robust clusters, we consider a four-Gaussian mixture. We sample n = 800 points from a mixture of four two-dimensional normals N(0, 0.1I_2), N(0.5e_1, 0.1I_2), N(0.5e_2, 0.1I_2), and N(0.5e_1 + 0.5e_2, 0.1I_2) with equal proportions. We then apply our method and display the result in Figure 3. Each colored region is the basin of attraction of a local minimum of s(x). The red '+'s are the local minima corresponding to each basin of attraction. Clearly, we see how the robust clusters are connected by the boundary clusters, so the additional attributes provide useful information on the connectivity among density modes.
Comparison. To better illustrate the strength of our proposed method, we generate an unbalanced four-Gaussian mixture. We sample n = 2400 points from a mixture of four two-dimensional normals N(0, 0.5I_2), N(2e_1, 0.5I_2), N(5e_2, 0.5I_2), and N(2e_1 + 5e_2, 0.5I_2) with proportions 5/12, 5/12, 1/12, and 1/12, respectively. We then apply our method and compare it with density-based spatial clustering of applications with noise (DBSCAN) [28] in Figure 4. DBSCAN is a classical nonparametric, density-based clustering method that estimates the density around each data point by counting the number of points in a certain neighborhood and applies a threshold minPts to identify core, border, and noise points. DBSCAN requires two parameters: the minimum number of nearby points required to form a core point (minPts) and the radius of a neighborhood with respect to a certain point (eps). Two points are connected if they are within distance eps of each other. Clusters are the connected components of connected core points.
Border points are points connected to a core point that do not have enough neighbors to be core points themselves. Here, we investigate the feasibility of using border points to detect the connectivity of clusters. The two parameters minPts and eps are very hard to choose. In the top two rows of Figure 4, we set minPts to 5 and 10 and change the value of eps to see if we can find the connectivity of core points using border points (gray points). Our results show that it is not possible to use border points to find the connectivity of the top two clusters and the bottom two clusters at the same time. When we are able to detect the connectivity of the bottom two clusters (panel (f)), we are not able to find the top two clusters. On the other hand, when we can find the connectivity of the top two clusters (panels (c,h)), the bottom two clusters have already merged into a single cluster. The limitation of DBSCAN is that it is based on the density level set, so when the structures involve different density values, DBSCAN is not applicable. In contrast, our method requires only one parameter, the bandwidth, and it performs well in this case. As shown in Figure 4i-l, our method detects the four robust clusters and their boundaries correctly. In addition, this result also shows that our method is robust to the bandwidth selection.
Figure 4. Panels (a-h) display the simulations using DBSCAN with different parameter settings, where minPts is the minimum number of points required to form a dense region and eps is the radius of a neighborhood with respect to a certain point; each colored region is a cluster detected by DBSCAN, while the gray and black points are border points and outliers, respectively. Panels (i-l) display the simulations using our proposed method with different bandwidths, where h is the bandwidth selected according to Equation (8); points labeled blue, orange, and green are assigned to robust, boundary, and outlier clusters, respectively.

Two-Sample Test
In this section, we carry out simulation studies to evaluate the performance of the two-sample test in Section 4. We compare our method with three other popular approaches: the energy test [29], the kernel test [30], and Kolmogorov-Smirnov (KS) tests [31] based on each of the two variables.
Our simulation is designed as follows. We draw random samples from the two-Gaussian mixture model

F(x) = a φ(x; μ_1, Σ_1) + (1 − a) φ(x; μ_2, Σ_2),    (9)

where φ(·; μ, Σ) is the cumulative distribution function of a normal distribution with mean μ and covariance Σ. For the first group, we choose the parameters a = 0.7, μ_1 = (−1, 0), μ_2 = (0, 1), Σ_1 = diag(0.3, 0.3), and Σ_2 = diag(0.3, 0.3). In our first experiment (left panel of Figure 5), we generate the second sample from a Gaussian mixture with an identical setup, except that the second covariance matrix is Σ_2 = diag(σ², 0.3), and we gradually increase σ² from 0.3 (H_0 is correct) to 0.8 to see how the power of the test changes. We generate n_1 = n_2 = 500 observations in both samples and repeat the process 500 times to compute the power of the test. This experiment investigates the power as a function of signal strength.
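To generate data from such a mixture, one can draw a Bernoulli(a) component indicator for each observation; a sketch using the first group's parameters (the generator seed and function name are ours):

```python
import numpy as np

def sample_mixture(n, a, mu1, mu2, var1, var2, rng):
    """Draw n points from a*N(mu1, diag(var1)) + (1-a)*N(mu2, diag(var2)),
    with diagonal covariances given as variance vectors."""
    z = rng.random(n) < a                     # component indicators
    x = np.where(z[:, None],
                 rng.normal(mu1, np.sqrt(var1), size=(n, 2)),
                 rng.normal(mu2, np.sqrt(var2), size=(n, 2)))
    return x

rng = np.random.default_rng(0)
# First group: a = 0.7, mu1 = (-1, 0), mu2 = (0, 1), Sigma1 = Sigma2 = diag(0.3, 0.3)
sample = sample_mixture(500, 0.7, [-1.0, 0.0], [0.0, 1.0], [0.3, 0.3], [0.3, 0.3], rng)
```

The second group is drawn the same way with the widened Σ_2 to create a local difference.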
In the second experiment (right panel of Figure 5), we consider a similar setup except that we fix Σ_2 = diag(0.35, 0.3) and vary the sample size from n_1 = n_2 = 500 to n_1 = n_2 = 4000 to examine how the power changes with the sample size. This experiment examines the power as a function of sample size.
Figure 5. Power analysis of the proposed method. We compare the power of our two-sample test with the energy test, the kernel test, the KS test on only the first variable, and the KS test on only the second variable. In the left panel, we vary the variance of the second Gaussian. In the right panel, we fix the two distributions and increase the sample size. In both cases, our method has higher power than the alternative approaches.
In both experiments, all methods control the Type 1 error. However, our method has better power in both experiments than the alternatives. Our method is more powerful because it utilizes the local information from clustering. In this simulation setup, the difference between the two distributions is the width of the second Gaussian component. Our method is capable of capturing this local difference and using it as evidence in the hypothesis test.
Finally, we would like to emphasize again that two-sample test after clustering has to be used with caution; we are using data twice, so we may not be able to control the Type 1 error. One needs to theoretically justify that the resulting clusters converge to a population limit and apply numerical analysis to investigate the finite-sample coverage.

Applications to Astronomy
We apply our method to detect the Cosmic Web [32] from the galaxy sample of the Sloan Digital Sky Survey [10]. It is known that galaxies in our universe are not uniformly distributed: there are low-dimensional structures where matter aggregates. Roughly speaking, there are four types of structures in the Cosmic Web: galaxy clusters, filaments, sheets, and voids [32]. Galaxy clusters are small regions with lots of matter. Filaments are regions of moderate matter density that connect galaxy clusters. Sheets are weakly dense regions where clusters and filaments are distributed. Voids are vast regions with very low matter density. Because of their properties, galaxy clusters are like zero-dimensional objects (points), filaments are one-dimensional curve-like structures, sheets are two-dimensional surface-like structures, and voids are three-dimensional regions. Figure 6 displays our result. Note that it is the same data as in Section 1. Panel (a) of Figure 6 shows the scatter plot of galaxies in a thin slice of the universe. In panel (b), we color galaxies according to the types of clusters they belong to; purple, green, and orange regions are the robust, boundary, and outlier clusters, respectively. We mark the locations of known galaxy clusters with blue '×'s [33]. These galaxy clusters were obtained using imaging analysis [34], which is a completely different approach. As can easily be seen, there is a strong agreement between the galaxy clusters and the robust regions. Of the 21 galaxy clusters, 85.71% fall into the robust clusters and 14.29% fall into the boundary clusters. Moreover, the boundary clusters (green), connecting the robust clusters (purple), behave like the filaments in the Cosmic Web, and the low-density outlier clusters are similar to the void structures.
As a comparison, we display the results from k-means (Figure 6c), traditional mode-clustering (Figure 6d), and a Gaussian mixture model (Figure 6e), none of which is structurally correlated with the locations of the blue "×"s. Figure 6. We show that the gradient flow method is better at detecting the 'Cosmic Web' [32] in our universe. For comparison, we perform k-means clustering with 20 centers and traditional mode-clustering; the blue "×"s are the galaxy clusters from imaging analysis, and these comparison results do not structurally correlate with their locations. Thus, this analysis reveals the potential of our approach as a good method for detecting the Cosmic Web with less information. Note that, since our dataset is two-dimensional, we cannot define the cosmic sheet structures.

Application to GvHD Data
We also apply our method to the GvHD (Graft-versus-Host Disease) data from [35]. The GvHD data are a well-known example of a two-sample testing problem, containing a positive/disease sample and a control/normal sample. There are 9083 observations in the positive sample and 6809 observations in the control sample. Each observation consists of four biomarkers: CD4, CD8b, CD3, and CD8. Our goal is to test whether the positive and control samples come from the same distribution.
Since the sample size is non-trivial and the dimension is 4, naively applying Algorithm 2 would be computationally heavy, so we apply the approximation method of Section 4.1. We first randomly select 5% of the whole dataset, including both the positive and control samples, as initial points in Algorithm 2. The algorithm then finds the local minima and adds the attribute labels based on Equation (2). Finally, we assign a cluster label and attribute to each observation according to its nearest detected local minimum of the slope.
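The subsampled descent just described can be sketched in a few lines. The following Python sketch is not the paper's implementation of Algorithm 2; it is an illustration under assumed choices: a Gaussian KDE (whose gradient and Hessian have closed forms), a toy 2D sample standing in for the data, and an illustrative bandwidth and step size. It descends the slope s(x) = ‖∇p̂(x)‖² via ∇s = 2∇²p̂ ∇p̂, starting from a 5% subsample.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))   # toy 2D sample (stands in for the real data)
h = 0.5                         # illustrative bandwidth

def kde_pdf(x, X, h):
    """Gaussian KDE density at a point x."""
    n, d = X.shape
    u = (X - x) / h
    w = np.exp(-0.5 * (u ** 2).sum(axis=1))
    return w.sum() / (n * (2 * np.pi) ** (d / 2) * h ** d)

def kde_grad_hess(x, X, h):
    """Gradient and Hessian of the Gaussian KDE at x (closed form)."""
    n, d = X.shape
    u = (X - x) / h
    w = np.exp(-0.5 * (u ** 2).sum(axis=1))
    c = 1.0 / (n * (2 * np.pi) ** (d / 2) * h ** d)
    grad = c * (u * w[:, None]).sum(axis=0) / h
    hess = c * ((u[:, :, None] * u[:, None, :] * w[:, None, None]).sum(axis=0)
                - w.sum() * np.eye(d)) / h ** 2
    return grad, hess

def slope(x, X, h):
    """Estimated slope s(x) = ||grad p_hat(x)||^2."""
    g, _ = kde_grad_hess(x, X, h)
    return g @ g

def slope_descent(x0, X, h, step=1.0, iters=100):
    """Gradient descent on the slope, using grad s = 2 * Hessian(p_hat) @ grad(p_hat)."""
    x = x0.copy()
    for _ in range(iters):
        g, H = kde_grad_hess(x, X, h)
        x = x - step * 2.0 * H @ g
    return x

# As in Section 4.1: use a 5% random subsample as the initial points.
init = X[rng.choice(len(X), size=len(X) // 20, replace=False)]
ends = np.array([slope_descent(x0, X, h) for x0 in init])
```

Cluster and attribute labels would then be assigned by matching each observation to its nearest descent endpoint, mirroring the last step described above.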
Having identified clusters, we perform the two-sample test; the result is summarized in Table 1. According to Table 1, all groups are significantly different, so we conclude that the positive sample is from a different distribution than the control sample. The clustering result can also be used to visualize the data, since the robust and boundary clusters characterize regions with non-trivial probability mass and each cluster is represented by a minimum of the slope function; the slope minimum within each cluster serves as the center of that cluster. Algorithm 3 summarizes the visualization procedure. In more detail, we first compute the minimal distance between every pair of clusters to decide whether two clusters (robust or boundary) are connected: if this distance is less than 4√(h²d) = 4h√d, where d is the number of dimensions and h is the smoothing bandwidth, the two clusters are connected (neighbors of each other). We then apply multi-dimensional scaling to the centers of the robust and boundary clusters to reduce the dimension to 2; each resulting point represents a particular cluster, and if two clusters are connected, we add an edge between them on the graph. Finally, we place a pie chart at each cluster's center with radius proportional to the total number of observations in that cluster and partition the pie chart according to the composition from the two samples. Figure 7 shows the 2D visualization of the GvHD data, along with the composition of the two samples in each cluster. The final steps of Algorithm 3 read as follows: 7. Apply multi-dimensional scaling to the local minima corresponding to the robust clusters R_1, ..., R_{J_1} and boundary clusters B_1, ..., B_{J_2}; let their 2-dimensional representation points be s*_1, ..., s*_{J_1+J_2}. 8. For each cluster D_j in {R_1, ..., R_{J_1}, B_1, ..., B_{J_2}}, plot a pie chart centered at the corresponding s*_j with radius proportional to |D_j|, partitioned by the ratio of the two samples within D_j. 9. Label the robust and boundary clusters, and add an edge between a pair of clusters whose minimal distance is below the threshold 4h√d, where d is the number of dimensions.
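The connectivity rule and the embedding step of the visualization can be sketched as follows. This is an illustrative reimplementation, not the paper's code: it uses classical MDS via double-centering and eigendecomposition, and the toy cluster coordinates and bandwidth are assumptions.

```python
import numpy as np

def min_intercluster_dist(C1, C2):
    """Smallest pairwise distance between two clusters (n×d and m×d arrays)."""
    D = np.linalg.norm(C1[:, None, :] - C2[None, :, :], axis=2)
    return D.min()

def connected(C1, C2, h):
    """Two clusters are neighbors when their minimal distance is below 4*h*sqrt(d)."""
    d = C1.shape[1]
    return min_intercluster_dist(C1, C2) < 4.0 * h * np.sqrt(d)

def classical_mds(D, k=2):
    """Embed a distance matrix D into R^k via double-centering + eigendecomposition."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    G = -0.5 * J @ (D ** 2) @ J          # Gram matrix of the centered configuration
    vals, vecs = np.linalg.eigh(G)
    idx = np.argsort(vals)[::-1][:k]     # top-k eigenpairs
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))

# Toy cluster centers (e.g., slope minima of robust/boundary clusters).
centers = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]])
D = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2)
embed = classical_mds(D, k=2)           # 2D positions at which to draw the pie charts
```

Pie charts with radii proportional to the cluster sizes would then be drawn at the rows of `embed`, with edges between pairs of clusters for which `connected` returns true.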

Theory
In this section, we study both the statistical and algorithmic convergence of our method. We start with the convergence of the estimated minima Ŝ to the population minima S, along with the convergence of the gradient flow. We then discuss the algorithmic convergence of Algorithm 1.
For a set D, we denote its cardinality by |D|. For a function f, we define ‖f‖_∞ = sup_x |f(x)| to be the L_∞-norm. Let ∇f and ∇²f be the gradient and Hessian matrix of f, respectively. We define ‖f‖_{l,∞} as the element-wise L_∞-norm of the l-th order derivatives of f; specifically, ‖f‖_{0,∞} = ‖f‖_∞ and, for l ≥ 1, ‖f‖_{l,∞} = max_{|α|=l} ‖D^α f‖_∞, where the maximum is over all multi-indices α of order l with components indexed by k = 1, 2, ..., d. A twice-differentiable function f is called Morse [19][20][21] if all eigenvalues of its Hessian matrix at the critical points are bounded away from 0.
Recall that our data are a random sample X_1, ..., X_n from a PDF p(x), and s(x) = ‖∇p(x)‖²₂. Additionally, p̂_n, ∇p̂_n, and ∇²p̂_n are the estimated PDF, gradient, and Hessian matrix, respectively. In our analysis, we consider the following assumptions.
(P) The density function p(x) is four-times continuously differentiable with bounded derivatives. (L) s(x) is a Morse function.
(K) The kernel K is four-times continuously differentiable with bounded derivatives. Moreover, the collection of kernel functions and their partial derivatives up to the fourth order satisfies the VC-type conditions of Giné and Guillou [36]; see Appendix A for more details.
Assumption (P) is slightly stronger than the conventional assumption for density estimation: we need p to be four-times differentiable. This is because we work with the gradient of the slope, which already involves second derivatives; to control the bias, we need two additional derivatives, leading to a requirement on the fourth-order derivatives. Assumption (L) is slightly stronger than the conventional Morse assumption on p(x): we need the slope function itself to be Morse so that the gradient system is well-behaved. In fact, Assumption (L) implies that p(x) is a Morse function, by Lemma 1. Assumption (K) is a common assumption to ensure uniform convergence of kernel-type estimators; see, for example, [37,38].

Estimation Consistency
With the above assumptions, we can show that the local minima of ŝ_n converge to the local minima of s. Theorem 1 shows two results. First, asymptotically, there is a one-to-one correspondence between each population local minimum and an estimated local minimum. Second, it gives the rate of convergence, which is the rate of estimating second derivatives. This is reasonable, since the local minima of s are defined through the gradient of s(x) = ‖∇p(x)‖², which requires second derivatives of p.
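The content of Theorem 1 can be illustrated numerically. The sketch below is our own illustration, not the paper's experiment: for a 1D Gaussian sample, the minimizer of the estimated slope ŝ_n = (p̂′_n)² over a grid lands near the population minimum, namely the mode at 0. The sample size, bandwidth, and grid are assumed choices.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=2000)    # sample from N(0, 1); the mode of p is 0
h = 0.3                                # illustrative bandwidth

def kde_deriv(x, X, h):
    """First derivative of a 1D Gaussian KDE at x."""
    u = (X - x) / h
    return (np.exp(-0.5 * u ** 2) * u).sum() / (len(X) * np.sqrt(2 * np.pi) * h ** 2)

# The estimated slope s_hat(x) = (p_hat'(x))^2; minimize it over a grid.
grid = np.linspace(-1.0, 1.0, 401)
s_hat = np.array([kde_deriv(g, X, h) ** 2 for g in grid])
m_hat = grid[np.argmin(s_hat)]         # estimated local minimum of the slope
```

As n grows and h shrinks appropriately, `m_hat` approaches the population minimum at the mode, at the rate dictated by second-derivative estimation.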
Note that the fourth-order derivative assumption (P) can be relaxed to a smooth third-order derivative condition; we use the stronger condition to simplify the derivation. Since the global minima of s are the critical points of p, the consistency of estimating a global minimum only requires a third-order derivative (or a smooth second-order derivative) assumption; see, for example, [39,40].
Theorem 1 also implies a rate for the set estimator Ŝ in terms of the Hausdorff distance. For two given sets A, B, their Hausdorff distance is Haus(A, B) = max{ sup_{x∈A} d(x, B), sup_{x∈B} d(x, A) }, where d(x, A) = inf_{y∈A} ‖x − y‖. The above results describe the statistical consistency of the convergent points (local minima) of a gradient flow system. In what follows, we show that the gradient flows themselves also converge under the same set of assumptions.
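For finite point sets such as the estimated and population minima, the Hausdorff distance above is straightforward to compute; a minimal sketch:

```python
import numpy as np

def hausdorff(A, B):
    """Hausdorff distance between finite point sets A (n×d) and B (m×d)."""
    D = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)  # all pairwise distances
    return max(D.min(axis=1).max(),   # sup over x in A of d(x, B)
               D.min(axis=0).max())   # sup over y in B of d(y, A)
```

For example, `hausdorff(S_hat, S)` would measure how far the estimated minima are from the population minima as a set.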
Theorem 2 (Consistency of gradient flows). Assume (K), (P), and (L). Then, for a fixed point x, the estimated gradient flow converges to the population gradient flow at a rate involving the exponent α = μ_min(x)/(μ_min(x) + μ_max(x)), where μ_min(x) and μ_max(x) are the minimal and maximal eigenvalues of the Hessian matrix of s evaluated at the destination π_x(∞).
Theorem 2 is mainly inspired by Theorem 2 of Arias-Castro et al. [3]. It shows that, starting at a given point x, the estimated gradient flow π̂_x(t) is a consistent estimator of the population gradient flow π_x(t). One may notice that the convergence rate is slowed down by the factor α, which comes from the curvature of s around the local minimum. This is because, as a flow approaches its convergent point (a local minimum), its speed decreases to 0 (when it arrives at the minimum), and the eigenvalues determine how fast the speed decreases along each direction. When the eigengap (the difference between μ_min(x) and μ_max(x)) is large, even a small perturbation can change the orientation of the flow drastically, leading to a slower convergence rate.
Remark. It is possible to obtain clustering consistency in the sense that the clusterings based on s and ŝ_n are asymptotically the same [41]. In [41], the authors placed conditions on the density function and showed that mode-clustering with p̂ leads to a partition of the data consistent with mode-clustering with p. If we generalize their conditions to the slope s, we obtain a similar clustering consistency result.

Algorithmic Consistency
In this section, we study the algorithmic convergence of Algorithm 1. For simplicity, we consider the case where the gradient descent algorithm is applied to s itself. The convergence of gradient descent has been well studied in the literature [42,43] under convex/concave setups. Our algorithm is gradient descent applied to a non-convex scenario; fortunately, within a small ball around each local minimum, s is still convex, so the conventional techniques apply.
Specifically, we need an additional assumption that is slightly stronger than (L).
(A2) There are positive numbers R_0, η_1, λ_0 > 0 such that, for all x ∈ B(m, R_0) with m ∈ S, where B(m, R_0) is the ball with center m and radius R_0, all eigenvalues of the Hessian matrix ∇²s(x) are above λ_0 and ‖∇s(x)‖ ≤ η_1.
Assumption (A2) is a local strong convexity condition.
Theorem 3 (Convergence of Algorithm 1). Assume conditions (P), (K), (A1) and (A2). Let the step size in Algorithm 1 be γ, let x_t be the point at iteration t, and let x_0 be the initial point. Assume that γ < 1/L, where L is the Lipschitz constant of ∇s. Then, for any initial point x_0 within the ball B(m, R_0), there exists a constant C_0 = 1 − γλ_0²/L < 1 such that s(x_t) ≤ C_0^t s(x_0). Note that λ_0 is the constant in assumption (A2) and satisfies λ_0 ≤ L; see the proof of this theorem.
Theorem 3 shows that when the initial point is sufficiently close to a local minimum, the algorithm converges linearly [42,43] to that local minimum. This also implies that the ball B(m, R_0) always lies in the basin of attraction of m; note, however, that the actual basin could be much larger than B(m, R_0).
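The linear convergence described here is easy to observe on a toy locally strongly convex objective. The sketch below is our illustration with assumed constants (λ_0 = 1, L = 4): it runs gradient descent with γ < 1/L on a quadratic standing in for s near a minimum, and records the geometric decay of the objective, of the form s(x_t) ≤ C_0^t s(x_0) with C_0 = 1 − γλ_0²/L.

```python
import numpy as np

lam0, L = 1.0, 4.0                     # assumed strong-convexity and Lipschitz constants
A = np.diag([lam0, L])                 # quadratic with Hessian eigenvalues in [lam0, L]

def s(x):
    return 0.5 * x @ A @ x             # stand-in for the slope near a minimum m = 0

def grad_s(x):
    return A @ x

gamma = 0.2                            # step size, gamma < 1/L = 0.25
C0 = 1.0 - gamma * lam0 ** 2 / L       # contraction constant

x = np.array([1.0, 1.0])
values = [s(x)]
for _ in range(20):
    x = x - gamma * grad_s(x)          # one gradient-descent step
    values.append(s(x))
values = np.array(values)
```

For this quadratic, the per-iteration contraction is in fact faster than C_0, so the geometric bound holds with room to spare; the bound is what survives in the general locally convex case.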

Conclusions
In this paper, we introduced a novel clustering approach based on the gradient of the slope function. The resulting clusters are each assigned an attribute label, which provides additional information about the cluster. Building on this clustering method, we proposed a two-sample test that uses local information within each cluster, improving the testing power. Finally, we developed an informative visualization tool that displays the structure of multi-dimensional data.
We studied our method's performance empirically and theoretically. Simulation studies show that our refined clustering method is capable of capturing fine structures within the data. Furthermore, as a two-sample test procedure, our clustering method has better power than conventional approaches. The analyses of the Astronomy and GvHD data show that our method finds meaningful clusters. Finally, we studied both the statistical and computational theory of our proposed method, which demonstrated good empirical performance as well as sound statistical and numerical properties. We would like to note that while our method works well for the GvHD data (d = 4), it may not be applicable to much higher-dimensional data: our method is a nonparametric procedure involving derivative estimation, and the curse of dimensionality prevents us from applying it to data with many more dimensions. There are no human subjects in this project, so IRB review was not required.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Proofs
To explicitly describe the kernel assumption (K), we first define some notation. A vector α = (α_1, α_2, ..., α_d) of non-negative integers is called a multi-index, with |α| = α_1 + α_2 + ... + α_d, and the corresponding derivative operator is D^α = ∂^{α_1}_{x_1} ∂^{α_2}_{x_2} ... ∂^{α_d}_{x_d}, where D^α f is often written as f^{(α)}. Assumption (K) requires the following. Let K_l = { K^{(α)} : |α| = l }, where K^{(α)} is the partial derivative of K along the multi-index α = (α_1, ..., α_d), and let K*_r = ∪_{l=0}^r K_l, so that K*_4 is the collection of partial derivatives of the kernel function up to the fourth order. We assume that K*_4 is a VC-type class; that is, there exist constants A, v, and a constant envelope b_0 such that sup_Q N(K*_4, L_2(Q), ε b_0) ≤ (A/ε)^v, where N(T, d_T, ε) is the ε-covering number of a semi-metric space T with metric d_T, and L_2(Q) is the L_2 norm with respect to the probability measure Q. While this condition looks complicated, the Gaussian kernel and any smooth, compactly supported kernel satisfy it; see [36]. For simplicity, we collect some notation used across all proofs. We denote g_s(x) = ∇s(x), the gradient of s(x), and H_s(x) = ∇²s(x), its Hessian matrix; likewise ĝ_s(x) = ∇ŝ_n(x) and Ĥ_s(x) = ∇²ŝ_n(x), where ŝ_n is the estimator of s. Let g(x) = ∇p(x) be the gradient of p(x) and H(x) = ∇²p(x) its Hessian matrix; likewise ĝ_n(x) = ∇p̂_n(x) and Ĥ_n(x) = ∇²p̂_n(x), where p̂_n is the estimator of p. For a smooth function f, recall that ‖f‖_{l,∞} is the L_∞-norm of its l-th order derivatives.
For instance, ‖p‖_{2,∞} bounds all second-order partial derivatives of p. Proof of Lemma 1: Recall that s(x) = ‖g(x)‖², so ∇s(x) = 2H(x)g(x). Thus C ⊂ S, where C is the collection of critical points of the density p and S is the collection of critical points of s(x). In addition, the Hessian matrix of s(x) is T(x) = ∇²s(x) = 2H²(x) + 2 Σ_{l=1}^d g_l(x) ∂H(x)/∂x_l, where g_l(x) is the l-th component of g(x). For any m ∈ C, since C is the collection of critical points of the density p, we have g(m) = 0, so the Hessian of the slope function is T(m) = 2H²(m). Since we assume s is a Morse function, the eigenvalues of T(m) are non-zero, which implies the eigenvalues of H(m) are non-zero. This completes the proof.
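The chain-rule identities used in this proof can be checked numerically on a simple analytic density. The sketch below is our own illustration, using an unnormalized Gaussian for p: it verifies ∇s = 2∇²p ∇p by finite differences, and that at the critical point 0 the slope Hessian equals 2H²(0). (The constant 2 does not affect the locations of critical points.)

```python
import numpy as np

def p(x):
    """Unnormalized Gaussian density."""
    return np.exp(-0.5 * x @ x)

def grad_p(x):
    return -x * p(x)

def hess_p(x):
    return (np.outer(x, x) - np.eye(x.size)) * p(x)

def s(x):
    """Slope function s(x) = ||grad p(x)||^2."""
    g = grad_p(x)
    return g @ g

def fd_grad(f, x, e=1e-5):
    """Central finite-difference gradient of f at x."""
    g = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = e
        g[i] = (f(x + step) - f(x - step)) / (2 * e)
    return g

x = np.array([0.3, -0.2])
analytic = 2.0 * hess_p(x) @ grad_p(x)   # grad s = 2 H g
numeric = fd_grad(s, x)
```

At the critical point m = 0 of this density, H(0) = −I, so T(0) = 2H²(0) = 2I has strictly positive eigenvalues, matching the Morse argument above.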
Proof of Theorem 1: We prove the one-to-one correspondence and the convergence rate. The first assertion (that the number of estimated local minima equals the number of population local minima) follows from the one-to-one correspondence.
Our proof consists of two steps. First, we show that there is a one-to-one mapping between each estimated local minimum and the corresponding true local minimum. Then we obtain the rate for the distance by using derivative estimation under assumption (K).
The one-to-one mapping for local minima follows by modifying the result of Theorem 1 in [4]. Let m be a local minimum of s and let m̂_n be a local minimum of ŝ_n. From the first two steps of the proof of Theorem 1 in [4], we obtain min_{m∈S} ‖m̂_n − m‖ ≤ λ_0/(2dc_1) when ‖p̂_n − p‖_{4,∞} is sufficiently small. Such a local minimum m̂_n of ŝ_n is unique, meaning there cannot be another critical point within this distance of the given local minimum of s. In other words, each m corresponds to exactly one m̂_n and vice versa. This completes the proof of the one-to-one mapping for local minima.
To derive the rate for the distance ‖m̂_n − m‖, note that ĝ_s(m̂_n) = g_s(m) = 0. By Taylor's theorem, expanding ĝ_s around m and using the invertibility of Ĥ_s(m) yields ‖m̂_n − m‖ = O(‖ĝ_s(m) − g_s(m)‖), and the rate follows from derivative estimation. Proof of Lemma A1: Property 1 follows directly from the definition of L-Lipschitz continuity. For Property 2, f(x) is strongly convex, so ‖∇f(x)‖ ≥ C_m ‖x − x*‖, where C_m is at most the minimum eigenvalue of the Hessian matrix of f(x). For Property 3, f has L-Lipschitz gradient, so f(x) ≤ (L/2)‖x − x*‖² + f*; combining this with f(x) ≥ f* = 0 gives the result. For Property 4, the C_m-strong convexity implies ‖∇f(x)‖ ≥ C_m ‖x − x*‖ and the L-Lipschitz gradient implies f(x) − f* ≤ (L/2)‖x − x*‖², so Property 4 holds.

Proof of Theorem 3:
From assumptions (A1) and (A2), there exists a ball of radius R_0 around each minimum of s such that, at every point within the ball, all eigenvalues of the Hessian matrix are positive. Let x_0 be a starting point within such a ball. Note that within the ball, s(x) is λ_0-strongly convex, since the Hessian matrix has all of its eigenvalues bounded below by λ_0 [42]; the constant λ_0 is from assumption (A2).
According to assumptions (P) and (L), s is a continuously differentiable function with Lipschitz continuous gradient and Lipschitz constant L. Consider a minimum m_j ∈ S and let s* = s(m_j) = 0. According to Properties 3 and 4 in Lemma A1, we have ‖∇s(x_t)‖² ≥ 2 s(x_t) λ_0² / L. Applying the L-Lipschitz property of ∇s again, together with Property 4 of Lemma A1, we have s(x_{t+1}) ≤ s(x_t) − γ(1 − γL/2)‖∇s(x_t)‖². By rearrangement and the bound above, s(x_{t+1}) ≤ (1 − γλ_0²/L) s(x_t). Recall that x_0 is the initial point. By telescoping, we obtain s(x_t) ≤ (1 − γλ_0²/L)^t s(x_0) = C_0^t s(x_0), since 0 < γ ≤ 1/L implies γ(1 − γL/2) ≥ γ/2 and hence C_0 = 1 − γλ_0²/L < 1.