A Spectral Clustering Algorithm for Non-Linear Graph Embedding in Information Networks

: With the development of network technology, information networks have become one of the most important means for people to understand society. As the scale of information networks expands, the construction of network graphs and high-dimensional feature representation will become major factors affecting the performance of spectral clustering algorithms. To address this issue, in this paper, we propose a spectral clustering algorithm based on similarity graphs and non-linear deep embedding, named SEG _ SC .This algorithm introduces a new spectral clustering model that explores the underlying structure of graphs through sparse similarity graphs and deep graph representation learning, thereby enhancing graph clustering performance. Experimental analysis with multiple types of real datasets shows that the performance of this model surpasses several advanced benchmark algorithms and performs well in clustering on medium-to large-scale information networks.


Introduction
Information networks, an essential form for expressing complex relationships between objects, are ubiquitous in our daily lives and work.They include social information networks formed by social media platforms such as WeChat and semantic information networks constituted by webpages on the Internet.
Spectral clustering, detailed in numerous studies [1][2][3][4], comprises three main steps: constructing similarity graphs, transforming graphs into low-dimensional vectors, and utilizing the K-means clustering algorithm.For example, Cai et al. [1,2] proved that a similarity graph based on the representative point strategy could be successfully applied to spectral clustering for information networks.Luo et al. [3] used K-means to select representative points from original data and constructed a multi-layer kernel-free similarity graph based on representative points.Shanham et al. [4] used the deep learning network approach to overcome the scalability and generalization issues of low-dimensional embedding in spectral clustering.In summary, at present, the optimization of spectral clustering algorithms mainly consists of the construction of similarity graphs and low-dimensional embedding representations.
With the development of information networks, designing effective algorithms to construct information network graphs to represent their semantic information has become a focal research issue.Traditional methods [5,6] generally adopt high-dimensional sparse vectors, representing the structure and semantic information of information networks through adjacency matrices and involving the design of various application algorithms based on the above.However, high-dimensional sparse representation requires more computational space and time as the network scale increases, leading to poor scalability and accuracy in a graph structure.
The k-nearest neighbor (KNN) graph is a manifold graph construction learning method, which includes methods such as Isomap [7] and local linear embedding [8].This particular model has good generalization capabilities and can map topological manifolds.If highdimensional manifolds are embedded in low-dimensional manifold spaces, the properties of Euclidean space still exist locally.The KNN algorithm is known for its stability, simplicity, and efficiency.However, it presents limitations when handling large datasets or high-dimensional data, which can result in increased computational complexity and limited clustering results.In recent years, considering nearest neighbor search methods, D. Cai et al. [1,3,9] improved the KNN graph by introducing a KNN graph clustering algorithm that utilizes representative points.These representative points represent each cluster during computation, and noise has minimal influence on this approach.Moreover, for KNN graph construction, the representative point graph has been applied in spectral clustering.By computing a low-rank approximation of the affinity tensor graph, the initial unified affinity matrix is refined with reliable information from the confidence affinity matrix.
For low-dimensional vector representation, there are many effective methods available.Most models [10,11] that learn graph construction first use linear encoding and then apply singular value decomposition (SVD) [12] for linear dimension reduction.Linear encoding [6,13] produces sparse representation, where each data point is represented as a linear representation of a small number of base vectors.Mathematically, linear encoding is defined as finding two matrices U and Z, whose product can best approximate the data X ≈ UZ.Large-scale sparse coding clustering (LSC) [9] combines linear encoding, SVD, and K-means into a spectral clustering method.It has a solid theoretical basis and has contributed to significant advancements in clustering research on large-scale data.GraRep [14] is a novel model designed for learning weighted graph vertex representations by generating low-dimensional vectors to represent graphs.However, these methods all involve the use of SVD for linear dimension reduction, and more effective non-linear dimension reduction techniques have yet to be explored.
In recent years, deep learning [4,15] has been widely applied in graph representation learning.Implementing a greedy layer training strategy can improve computational performance.Stacked autoencoders [16] are considered an effective method for mapping from the original representation space to low-dimensional spaces, with the capability to produce highly non-linear projections.
More recently, sparse autoencoders have been used to replace the SVD step in spectral clustering.Shao et al. [17] proposed a deep architecture called DLC for fast graph clustering with linear encoders.Their method combines feature embedding and linear encoding and was implemented through a deep hierarchical architecture.Cao et al. [18] proposed the DNGR model for learning graph representations.The model utilizes pointwise mutual information (PMI) to represent the original graph.It then projects this low-dimensional representation using stacked denoising autoencoders.The authors theoretically and empirically demonstrate the effectiveness of the model in clustering and classification processes.Tian et al. [13] indicated in their study that the computational complexity of autoencoders, O(n), is considerably lower than the characteristic decomposition time, O(n 3 ), of spectral clustering.They proposed a non-linear spectral clustering algorithm based on deep learning.In this method, the graph is embedded in deep learning through autoencoders, followed by clustering using the K-means algorithm.This study on non-linear mapping spectral clustering algorithms made significant contributions to the field; however, the study presented the following issues: (1) for large-scale data, directly representing graph structures with high-dimensional similarity matrices does not reflect the essential features of graph structures; and (2) the authors simply propose a framework for non-linear spectral clustering algorithms without providing theoretical proof.
In our study, we utilized sparse representation for the graph similarity matrix, which typically enhances clustering accuracy by eliminating noise information that can impact clustering results.In representation learning, deep structures are used to construct features.Deep structures can benefit from hierarchical training for fine-tuning, which helps in finding improved local solutions to meet the expectation maximization objectives.The method proposed in this paper is called sparse encoder graph spectral clustering (SEG_SC).
The SEG_SC algorithm proposed in this paper makes the following contributions: 1.
In this study, we study the manifold learning similarity matrix and linear lowdimensional mapping of semantic information networks.We define a model structure using the K-means clustering algorithm, similar to spectral algorithms.

2.
Theoretically, the results presented in this paper prove that the model adopts a clustering method consistent with the spectral clustering objective function and captures the non-linear features of semantic information networks.

3.
Empirically, our results demonstrate that the model can enhance the clustering performance of semantic information networks by capturing meaningful semantic, relational, and structural information.Our experimental results show that the model can effectively complete clustering tasks.
The remainder of this paper is arranged as follows: in Section 2, we review related works; in Section 3, we describe the proposed framework, its theoretical analysis, computation complexity analysis, and the optimization algorithm; in Section 4, we report the experimental results; and finally, we conclude the paper in Section 5.

Related Work 2.1. Spectral Clustering
Spectral clustering is an algorithm derived from graph theory and has been widely utilized in clustering.The primary aim is to represent all nodes using a Laplacian matrix.In this matrix, the edge weights between nodes that are far apart are lower; in contrast, those that are closer are higher.The Laplacian matrix is then subjected to eigenvalue decomposition to identify its k-smallest eigenvalues, which are utilized to derive a reduceddimension representation of the network.Finally, the K-means algorithm is employed for clustering.The process is illustrated in Figure 1.There are three methods for constructing the similarity graph in the Laplacian matrix: the adjacency method, the KNN method, and the fully connected method.The adjacency method sets a distance threshold and then measures the similarity between any two nodes using the Euclidean distance.The KNN algorithm traverses all sample points, selecting the nearest k points as neighbors for each sample in order to calculate their similarity.The fully connected method calculates the similarity values between all points.
The objective of spectral clustering is to partition the graph into k-disconnected subgraphs while minimizing the graph cut.The normalized cut criterion [19] is defined as follows: where vol represents the number of connections between the two subgraphs C i and C i .
Solving the normalized cut set is an NP-hard problem; according to Normalized- , its approximation can be described as a trace function minimization formula (2).arg min where , and properties of the Laplacian matrix L are as follows: • L is a symmetric and positive semi-definite matrix; • The number of graph connections equals the number of eigenvalues equal to zero, i.e., Rank(L) = n-k; This objective function calculates the eigenvectors of the matrix L. According to the matrix computation theorem 1, H is composed of the eigenvectors corresponding to the k-smallest non-negative singular values of the normalized Laplacian matrix.The final clustering result is obtained by running the K-means algorithm on the graph's embedding matrix H.

Theorem 1. (Eckart-Young-Mirsky Theorem)
Given an r-rank matrix P, its SVD: P = VΣU T , where V T V = I, U T U = I and then arg min where − → Σ ; besides only containing the k largest singular values and replacing other singular values with 0, the rest remains the same as the matrix Σ.This theorem indicates that reconstructing the matrix using the k largest singular vectors from singular value decomposition provides the best rank-k approximation of the original matrix.
The effectiveness of spectral clustering depends on the similarity matrix, making it very effective for clustering sparse data, a task that is challenging for the K-means algorithm.Moreover, due to the utilization of dimensionality reduction techniques, spectral clustering handles the complexity of high-dimensional clustering more effectively than traditional clustering methods such as K-means.

Linear Coding
Linear coding (LC) [9] is a matrix decomposition technique that represents the original data as a combination of a set of base vectors and their corresponding "compressed" representations.Its data representation form is where X = (x 1 , x 2 , ..., x n ) ∈ R m×n , U ∈ R m×p , and the column vectors of U are the basis vectors.Z ∈ R p×n and the column vectors of Z represent p-dimensional vectors with respect to the base vectors of U. Therefore, the matrix decomposition of LC is defined as the following optimization function problem: LC adds sparse constraints to each column of Z in the optimization problem described in Formula (5).LC implements a data-sparse representation based on matrix decomposition, where each data point is a linear combination of a few base vectors.This approach allows for a more accurate interpretation of each data point.

Low-Dimensional Embedding Learning of Information Networks
Dimension reduction is the process of transforming the original high-dimensional attribute space into a lower-dimensional subspace data representation through mathematical transformations.At present, dimension reduction techniques are mainly divided into two categories: linear and non-linear methods.
Classic linear dimension reduction methods include principal component analysis (PCA), multi-dimensional scaling (MDS), linear discriminant analysis (LDA), etc. PCA [20] projects data into a low-dimensional space representation through linear transformation.Its optimal objective function is to minimize Formula (6): The goal of MDS is to construct a low-dimensional space where the similarity relationships between nodes can be represented.In this space, the distances are consistent with those in the original data, taking into account the global Euclidean distance of samples during sample dimension reduction.LDA minimizes differences between nodes in the same cluster and maximizes differences between nodes in different clusters through linear projection.
Recently, neural information network learning has achieved significant success in various applications, including image classification, speech recognition, and natural language processing.Collobert et al.'s research [21][22][23][24][25] made training in deep architectural structures possible and effective.One of the strategies employed in their research was the greedy layered unsupervised pre-training strategy, whose aim was to learn a useful representation layer by layer and then set the output features as inputs for the next layer.Each layer in this process involves some form of non-linearity, such as non-linear activation functions (e.g., sigmoid or tanh) or regularization of features (e.g., sparse constraints).By stacking these non-linear layers together, deep learning can be used to generate improved representations.The process from the input layer to the hidden layer is called encoding, and the process from the hidden layer to the output layer is called decoding, as illustrated in the model process in Figure 2.

Model Framework
The model proposed in this paper is driven by a three-step scheme of spectral clustering, which includes the construction of similarity graphs, low-dimensional graph embedding, and the K-means clustering algorithm.In our study, we first implemented the reconstruction of the similarity matrix for spectral clustering, followed by creating a low-dimensional embedded graph.Our spectral clustering model was utilized for large-scale clustering.The basis of this model is to design an effective method for constructing a similarity graph and performing Laplacian matrix eigen decomposition.Specifically, in the similarity graph S, we attempted to design it as symmetric and positive semi-definite, satisfying the following form: where Z ∈ R p×n represents the p-dimensional representation of the original graph, p ≪ n.
The purpose of the SEG_SC spectral clustering model is to first represent the information network as a graph G = (V, S), where S is the node similarity graph.Next, it embeds S into a low-dimensional graph embedding matrix R and then identifies a disjoint partition C = {C 1 , C 2 , ...., C k } using the K-means algorithm, where k represents the number of clusters.

Similarity Matrix
KNN [26] is a commonly used manifold learning method.By utilizing a test sample, we integrated the concept of k-nearest neighbor information and distance measurement with the normalized Laplacian matrix of spectral clustering to derive a suitable weighted information network similarity matrix.This matrix can be represented by the following Formula (8): where v t is the k-nearest neighbor node of v i .S defines the similarity relationship between each node and all its adjacent nodes; however, this representation is very time-consuming, especially for large-scale data.According to the SC method, for any data point v i , our approximation can be represented as shown in Formula ( 9): where the value of U is randomly selected from the information network graph to create a set of p landmark nodes, generating the landmark node set U = {u 1 , u 2 , ..., u p }. Calculating the relationship values of U with the k-nearest neighbor nodes, we obtained the similarity graph.According to the Nadaraya-Watson method [27], the sparse representation Z is defined as shown in Formula (10): where K(:) is the Gaussian kernel function.Therefore, the definition of the similarity matrix S is as shown in Formula ( 11):

Non-Linear Low-Dimensional Graph Embedding
First, in the present paper, we analyze the non-linear spectral clustering properties of the single-layer network.Second, the model representation was studied, and we describe the deep embedding graph of the multi-layer network.Third, we analyze the computation process of the multi-layer network based on the objection function.Finally, we analyze its time performance and the algorithm.

Spectral Analysis on Single-Layer Graph
Using the method of graph dimension reduction based on non-linear projection, the projection of sample points in the low-dimensional coordinate system is as follows: where S i is the i-th row vector of the similarity matrix S, W j is the j-th row vector of parameter W, and R i = (R i1 , ..., R id ) is the low-dimensional embedding landmark of S i .R i approximates the reconstruction of X i , which can be represented as The objective formula for the difference in distance between the original sample S i and the projected reconstructed sample Ŝi is as follows: where W = (w 1 , w 2 , ..., w d ); hence, the objective formula solving process can be approximated as Therefore, the model ensures that the distance to the sample points on the hyperplane is sufficiently close, and the projections of the sample points on the hyperplane are separated as far as possible.
Notably, when the function g() is a linear function, this objective model transforms into the objective function of regular spectral clustering, as shown in formula (16): Therefore, Formula ( 15) is defined as the non-linear objective model of spectral clustering.

Deep Graph Embedding on a Multi-Layer Graph
As previously stated, we can consider S as a training set containing n nodes.We consider multi-layer information networks in each layer.The objective formula is as follows: arg max Assuming the multi-layer information network deep architecture as shown in Figure 2, the input of the first layer is denoted as X (1) = S.More hidden layers need to be added to the multi-layer architecture.
It is widely known that multi-layer information networks are divided into an input layer, hidden layers, and an output layer; the structure of the i-th neuron in layer (l + 1) is shown in Figure 3.The formula for h (l+1) where f is the neuron activation function, which is generally a sigmoid or tanh function.X (l) represents the data of the layer l, assuming that the number of neural units in layer l is q + 1.
Assuming there are r training samples X 1 , ..., X r through the deep neural information networks, if the input sample is X i , the output of the output layer is h W,b (X i ), and the actual output should be Y i .Therefore, a loss function with regularization terms is defined as where is the sum of squared errors of the output layer.Here, n l denotes the number of layers in the entire information network, and s l indicates the number of units in layer l, excluding bias units.λ represents the regularization term parameter.

Computation Analysis Based on Deep Graph Embedding
The regularization term stabilizes the algorithm and suppresses overfitting.J(W, b) is a non-convex function; thus, using the gradient descent method can easily lead to local optimization results.In addition, the initialization of each parameter w i must be set randomly and not all set to zero.If this is not performed, the output will be the same for any input.
The step formula for parameters w where λ is the step learning parameter.
, where ); hence, the output layer e ); thus, the entire output vector } and ⊙ represents the Hadamard product.Therefore, the middle layer is where ) ).Thus, the parameters were trained using the gradient descent method.
where r represents the number of samples.

Computation Complexity Analysis of an Algorithm on a Multi-Layer Graph
The results presented in this paper refer to the deep model as Sparse Encoding Graph Spectral Clustering (SEG_SC), and its implementation process is shown in Algorithm 1.
Suppose we possess the original data X ∈ R n×d and use p landmark points; we need O(pdn) to construct the similarity matrix and O(n) to complete the multi-layer graph embedding.K-means requires O(nkt), where t is the number of iterations.In Algorithm 1, the computational complexity is O(pdn + n + nkt).

Benchmark Algorithms
To demonstrate the effectiveness of SEG_SC, the following methods were used as benchmark algorithms.A brief description of each method is found below as follows: • PCA [20]: A common matrix decomposition method used for dimensionality reduction or feature extraction.PCA defines a loss function based on the distance between the original sample and the reconstructed sample after projection.PCA is a linear graph embedding method.The clustering result is then obtained using the K-means algorithm.• SC [2]: Spectral clustering consists of an anchor graph based on set-to-set distances derived from local covariance matrix representation.• GraphEncoder [13]: A method for learning graph representations.This method explores non-linear graph embedding using deep learning and obtains clustering results by running the K-means algorithm on the embedding.• DCN (deep clustering network) [28]: The DCN performs joint dimensionality reduction and clustering.Reduction is achieved by training a deep neural information network to approximate any non-linear function.

Evaluation Criteria
In this paper, we evaluate the performance of the aforementioned clustering algorithms using the accuracy (ACC) and normalized mutual information (NMI) of the clustering results.ACC evaluates the performance of the algorithm by comparing the clustering labels it produces with the actual clustering labels, examining the clustering status of all pairs of nodes to determine if they belong to the same cluster.ACC is defined as follows: where n represents the number of nodes, cluster_label i represents the clustering label result of the i-th node by the algorithm, and truth_label i represents the real clustering label of the i-th node.NMI is the mutual information entropy between the algorithm's clustering labels and the actual clustering labels.Mutual information entropy considers partitions as the probability distribution of nodes falling into a cluster.It is defined as follows: where π a and π b represent two types of clustering partitioning, n i,j represents the number of nodes belonging to both cluster C a i and cluster C b j , n a i indicates the number of nodes belonging to the i-th cluster C a i , and n b j indicates the number of nodes belonging to the j-th cluster C b j .NMI's value ranges from 0 to 1, with a value of 1 being given when the results of a and b are the same.

Parameter Settings
In our experiments, each sample from all datasets was normalized to possess the same unit length.We set the neighborhood number in the sparse KNN graph search to five.We detected the relationship between parameters p (the number of landmark points) and m (depth level lay) and performance indicators ACC and NMI on real small-and medium-sized data from various sources.
Experiments were conducted on the SEG_SC model, setting the landmark number parameter p from 200 to 1200, increasing in intervals of 100, with the derived ACC results as shown in Figure 4. From Figure 4, it can be seen that the ACC value increases with the change in landmark value, which is attributed to the random selection method.Evidently, having more landmarks can enhance the final ACC performance; however, the change was not found to be significant in our study.Figure 5 illustrates the relationship between the number of landmarks and NMI.The NMI value generally increases as the number of landmarks increases but with limited magnitude.Examining the settings of the depth structure layer (lay), the ACC value of the first layer of SGE_SC on the dataset in Figure 6 was obtained by directly applying K-means to the input normalized sparse similarity graph.It can be observed from Figure 6 that as the layers become deeper, the ACC value increases.This suggests that deep structures play a crucial role in generating high-quality clustering results.In Figure 7, as the number of layers increases, the NMI values for large-scale datasets such as blogdata and 20 newgroup increase.In summary, the results presented in this paper suggest using five layers for mediumto large-scale datasets; small-scale datasets, in contrast, do not require more than three layers.The information presented in Table 2 shows the number of nodes in each layer.Moreover, without the loss of generality, in this study we set the number of landmarks to 1000.All neural networks use the sigmoid function as the activation function for each layer, and the sparse target (the target value for hidden layer activation) is 0.05.Specific parameter information is shown in Table 3.In terms of cluster performance, Table 5 summarizes the clustering results of seven datasets, indicating that SEG_SC offers relatively higher ACC.Table 6 displays the comparison values of various algorithms on large-scale data, demonstrating stronger NMI performance advantages as the data scale increases.From the results shown in Tables 5 and 6, it can be seen that the model proposed in this paper exhibits superior performance compared to the other models.As shown in Table 5, SEG_SC achieved higher ACC.This improvement was significant compared to the other algorithms, attributed to the optimization of graph similarity.In addition, as the graph uses a non-linear projection, it can be observed from Table 6 that in the various types of large-scale datasets, the NMI values tend to be optimal.
The advantage of this model lies in using sparse similarity graphs to replace the original graph's similarity matrix and adopting non-linear encoding schemes instead of feature decomposition.During the final step in particular, the low-dimensional features inputted into K-means are complemented by spectral clustering.Furthermore, the nonlinear encoding scheme reduces the dimensionality of the data.

Conclusions and Future Research
In the present paper, a spectral clustering model based on deep neural networks named SEG_SC is proposed.First, the model utilizes a sparse similarity graph based on landmark points and the KNN method, introducing a deep graph representation that encodes each vertex into a low-dimensional vector representation.Secondly, the model theoretically demonstrates the effectiveness of SEG_SC.Experiments involving the use of real data demonstrate that spectral clustering using deep learning can enhance clustering performance and reduce time consumption in various types of large-scale datasets of information networks.
In future research, we plan to extend this model from general information networks to multi-dimensional information networks, enhancing its scalability on large-scale datasets.In addition, we plan to strengthen its impact by including case studies or real-world applications where the proposed method has been implemented.Conducting such studies will demonstrate its practical viability.

Figure 1 .
Figure 1.Structural diagram of the spectral clustering algorithm.

Figure 4 .
Figure 4. Relationship graph between ACC and the number of landmarks (Parameter p) on the dataset.

Figure 5 .
Figure 5. Relationship graph between NMI and the number of landmarks (parameter p) on the dataset.

Figure 6 .
Figure 6.Relationship graph between ACC and depth level lay (parameter m) on the dataset.

Figure 7 .
Figure 7. Relationship graph between NMI and depth level lay (parameter m) on the dataset.

Table 2 .
Neural information network layer structure.

Table 4
presents the average running time of the SEG_SC algorithm in comparison with similar deep neural network models such as Graphencoder, indicating that their running times are approximately equal.This result meets the purpose of sparse embedding representation time optimization.

Table 5 .
Cluster comparison of ACC results.

Table 6 .
Cluster comparison of NMI results in large-scale data.