Improving Graph Convolutional Network with Learnable Edge Weights and Edge-Node Co-Embedding for Graph Anomaly Detection

The era of Industry 4.0 is gradually transforming our society into a data-driven one, which can help us uncover valuable information from accumulated data, thereby improving the level of social governance. The detection of anomalies, is crucial for maintaining societal trust and fairness, yet it poses significant challenges due to the ubiquity of anomalies and the difficulty in identifying them accurately. This paper aims to enhance the performance of the current Graph Convolutional Network (GCN)-based Graph Anomaly Detection (GAD) algorithm on datasets with extremely low proportions of anomalous labels. This goal is achieved through modifying the GCN network structure and conducting feature extraction, thus fully utilizing three types of information in the graph: node label information, node feature information, and edge information. Firstly, we theoretically demonstrate the relationship between label propagation and feature convolution, indicating that the Label Propagation Algorithm (LPA) can serve as a regularization penalty term for GCN, aiding in training and enabling learnable edge weights, providing a basis for incorporating node label information into GCN networks. Secondly, we introduce a method to aggregate node and edge features, thereby incorporating edge information into GCN networks. Finally, we design different GCN trainable weights for node features and co-embedding features. This design allows different features to be projected into different spaces, greatly enhancing model expressiveness. Experimental results on the DGraph dataset demonstrate superior AUC performance compared to baseline models, highlighting the feasibility and efficacy of the proposed approach in addressing GAD tasks in the scene with extremely low proportions of anomalous data.


Introduction
In real-world settings, anomalies are both ubiquitous and potentially detrimental, yet challenging to identify.Fraudulent activities pose significant threats to the robust functioning of society, not only in economic terms but also by adversely impacting societal trust and fairness.Anomaly detection, also referred to as fraud detection, is assuming an increasingly crucial role in contemporary society.Its applications range from detecting fake news [1] to identifying fraudulent reviews [2].
In academia, "anomaly" is commonly abstracted as a set of data that deviates from the majority [3].Despite the development of numerous techniques in recent years for uncovering outliers and anomalies in unstructured sets of multi-dimensional points, the scarcity of anomalous behavior, high labeling costs, and privacy protection policies result in a paucity of data samples labeled as anomalous [3].
We observe that graphs in the current graph anomaly detection domain mainly exhibit the following three characteristics.

1.
Nodes are Labeled as well as Featured.Nodes in graphs often possess both label and feature properties.Node labels can propagate and aggregate along the edges of the graph, while node features can propagate along the edges and undergo transformations through neural network layers.Both label and feature propagation are beneficial for predicting node behavior.The effective combination of these two propagation aspects warrants attention.

2.
Edges Contain Rich Information.In the field of GAD, the topology and structure of the graph contain a wealth of information.The attributes and directions of edges imply that even in isomorphic graphs, there are multiple types of connections (i.e., edges) between different nodes.For instance, relationships between users may include transfer and recipient relationships, loan and borrower relationships, as well as familial relationships.This information can assist in better classification, thereby identifying anomalous nodes.However, this information is often overlooked by other methods.

3.
The Proportion of Anomalous Data is Extremely Low.While fraud incidents do occur, the majority of users in reality are normal users.The proportion of fraudulent users is significantly smaller than that of normal users, making fraud detection a classic example of imbalanced data.In machine learning model training, achieving a relatively balanced distribution of various sample classes is crucial for optimal performance.When there is a substantial difference in the proportions of different classes, the trained model tends to exhibit characteristics more aligned with the majority class, thereby neglecting features associated with minority class samples.
The inherent suitability of graph structures for analyzing relationships and connectivity patterns in networks has propelled GAD to the forefront.GAD abstracts traditional streaming data into graph structures, where nodes represent entities and edges signify relationships between entities.The interconnected nodes form edges that convey structural information about the graph, information that can be stored using an adjacency matrix [4].GAD aims to detect anomalies by leveraging the structural information inherent in the network, combining it with classic anomaly detection approaches.Considering fraud scenarios, fraudulent activities often exhibit associations and collaboration.The connections between fraud groups are typically closer than those between normal users, and the behaviors demonstrate a certain level of similarity.This also explains why graph analysis can be employed to detect anomalous behaviors in practical scenarios.According to the different basic graph algorithms, GAD algorithms can be classified into several types, including community detection, feature discovery, pattern mining, and GCN.This paper focuses on the GCN-based GAD algorithm.
Due to divergent definitions of "anomaly" in academic and industrial contexts, coupled with the variability in identifying anomalous data across different scenarios, the precise characterization of what constitutes an "anomaly" proves challenging [3].This paper abstains from delving into the root cause behind the classification of specific data as anomalous or enumerating the features that render data anomalous.Instead, our focus remains exclusively on the handling of datasets with anomaly ground-truth.Leveraging information from node features, edge features, and graph structures, we seek to infer potential label values for unlabeled nodes based on existing positive and negative labels.The objective is to simplify the model into a semi-supervised learning classification task rooted in graph neural networks, and the main task is to determine whether a node on the graph is an anomaly or not.This classification approach is also often employed in GAD tasks [5].
We propose a unified model capable of addressing the aforementioned three challenges in graphs to make full use of the information in the nodes and edges to improve the performance of the algorithm.The primary contributions of this paper include the following.

1.
We analyzed the theoretical relationship between the feature convolution process of GCN and the label propagation process of LPA, demonstrating that using LPA as an auxiliary regularization term in the GCN loss function can train edge weights and thereby improve GCN performance.Building on this theoretical analysis, we introduced a learnable edge weight into GCN and trained it using LPA.This method modifies the convolution process of GCN, making full use of node labels.

2.
Considering the heterogeneity of edges, we analyzed the impact of edge information on GAD tasks.Firstly, we embedded edge direction and attribute information to generate edge features.Then, we aggregated these edge features with node features to generate co-embedding.This method conducted feature extraction, fully utilizing the information on edges.

3.
Based on the co-embedding, we designed separate trainable GCN weights for node features and co-embedding features, projecting them into different feature spaces, greatly enhancing the model's generalization ability.
We conducted experiments on the publicly available DGraph [6] dataset, selecting MLP, Node2Vec [7], LPA [8], GCN [9], GAT [10], GATv2 [11], GraphSAGE [12], and UniMP [13] as baseline models, with AUC serving as the evaluation metric, which refers to the area under the ROC curve.The ROC (Receiver Operating Characteristic) curve is a graphical visualization that provides an intuitive representation of a model's performance.As a numeric value, the AUC can directly and intuitively exhibit the goodness or poorness of a model's performance.The results indicate that our proposed method outperforms the baselines in terms of AUC.Our model proves feasible in addressing GAD tasks, providing valuable insights for applications in scenarios with scarce data.
The remaining sections of this paper are organized as follows.Section 2 discusses related work in the field of anomaly detection.Section 3 defines the mathematical model of the problem, analyzing the theoretical connection between GCN's convolution on node features and LPA's propagation on node labels, providing a mathematical basis for incorporating node labels into the GCN model.Section 4 introduces our GCN-based GAD model, which is achieved through three enhancements to the GCN network.Section 5 presents experimental results on public datasets, comparing the performance with baseline models, indicating that our method exhibits the best AUC performance on datasets with extremely low proportions of anomalous nodes.Finally, in Section 6, we summarize the key contributions of this paper, as well as directions for future improvement.

Related Work
Anomaly detection algorithms typically view the task as a node classification problem, where data are categorized into normal and abnormal classes [3][4][5].Our algorithm conceptually relates to semi-supervised methods in graph-based learning and aligns with recent advancements in applying convolutional neural networks to graph-structured data.

Limitations Encountered by Current Algorithms
Traditional anomaly detection methods can be divided into rule-based methods and probability-based methods.Rule-based methods aim to formalize expert knowledge rules and replicate them through computational means, while probability-based methods rely on statistical knowledge and assume that data distribution conforms to certain statistical distributions.Traditional anomaly detection methods are highly interpretable, but they need to design different detection rules for different scenarios, which requires the introduction of a large amount of expert knowledge.This method has high time and labor costs, and the generalization of the method is weak.

Anomaly Detection Based on Semi-Supervised Methods
Machine learning anomaly detection methods can be categorized into supervised, semisupervised, and unsupervised learning.In cases where the number of negative samples is small, resulting in a large disparity between positive and negative sample proportions, and the patterns of negative samples are difficult to learn, supervised learning methods are not applicable.Unsupervised learning does not require considering prior knowledge and is suitable for most cases where most objects are normal.However, due to insufficient input information, unsupervised learning is prone to the curse of dimensionality, leading to low accuracy and high false-positive rates in existing patterns.
Semi-supervised learning does not rely on external interaction and automatically leverages unlabeled samples to enhance learning based on a training set composed entirely of normal data [14,15].It allows for joint learning between a small number of labeled samples and a large number of unlabeled samples [16], leading to improved learning performance and generalization ability [17,18].By better capturing data distribution characteristics, it is particularly suitable for scenarios with limited labeled data and expensive manual annotation.Specifically, semi-supervised learning can be categorized into inductive learning (or pure semi-supervised learning) and transductive learning [14].In the former, the assumption is that the unlabeled samples in the training data do not represent the data to be tested, while in the latter, the unlabeled samples considered during the learning process are precisely the ones to be predicted [17].
In the mentioned methods, LPA and GNN methods are the focal points of our attention, they can be summarized in Table 1.From the table, we observe that LPA is typically associated with transductive learning, implying its inability to generalize to new nodes; instead, it solely classifies the remaining nodes in the original graph.This is because, during the model learning process, nodes with unknown labels in the original graph may also be utilized for training.In other words, certain unlabeled data obtain their label information from other unlabeled data.GNN is associated with inductive learning [19].Additionally, GNNs can leverage node features and support graph embedding, whereas LPA cannot.LPA, however, can make full use of node label information, which GNNs may not fully exploit.Consequently, both algorithms alone may not effectively utilize the information present in the nodes of the graph.Our primary research goal is to cleverly combine these two algorithms.LPA was initially proposed by Zhu and Ghahramani [8], and implemented using bipartite graphs.Several anomaly detection algorithms have been proposed based on LPA.Li et al. [20] used adaptive label propagation for group anomaly detection in networks.Wang et al. [21] introduced a collaborative multi-label propagation strategy, leveraging a user-item interaction graph for acceleration.Wang et al. [22] improves graph-based label propagation algorithm with group partition for fraud detection.
With GNN algorithms, numerous effective anomaly detection algorithms have also been proposed.Building upon the dynamic heterogeneous graph, Zhang et al. [23] introduced the hierarchical multi-modal fusion GNN (HMF-GNN), which was applied to the detection of health insurance claim records.This model addresses the challenge of lost neighbor information.DAGAD [24] provides a promising solution to alleviating the problem of GAD with anomalous sample scarcity and class imbalance.
However, the aforementioned methods have not been able to fully integrate the information provided by node features and labels.

Anomaly Detection Based on Graph Convolutional Networks
According to whether the node attributes in the graph are multivariate, graphs can be classified into homogeneous and heterogeneous graphs [25], this paper exclusively discusses homogeneous graphs.For graph data, anomalies can be considered as elements such as nodes, edges, subgraphs, etc., that differ from the majority of objects in the graph.We can categorize GAD into different types based on the properties of the graph (static graph, static attribute graph, and dynamic graph), and the nature of anomalies (nodebased, edge-based, and graph-based) [4,26].The focus of this paper is on the detection of anomalous nodes in static attribute graphs.In attribute graphs, besides the structural information between nodes, each node also encompasses attribute information [27].These attribute details aid in uncovering hidden features of nodes, thereby enhancing the accuracy of node anomaly detection.
Our approach is primarily inspired by GCN-based algorithms.Kipf and Welling [9] introduced the concept of convolution to graph structures, proposing GCN.Building on the work of GCN, Hamilton et al. [12] improved the mean aggregation method and introduced GraphSAGE, a method for inductive representation learning, enabling algorithmic application to large-scale graphs through minibatch sampling.However, GAD algorithms constructed based on the above methods, such as GEM [28] and DualFraud [29], have not taken into consideration the benefits brought by node labels.
Other methods, like GCN-LPA [30], and UniMP [13], cleverly combine GCN with LPA.Huang et al. [19] demonstrated that this combination can avoid excessive smoothing, thereby enhancing the performance of GCN.However, the aforementioned methods have not fully utilized the information embedded in edges.
Research on combining GCN with LPA for recommender systems (RS) has proven the feasibility of this approach in graph node classification [30][31][32].Li et al. [33] combined LPA with GNN and proposed a GAD method applicable to e-commerce shopping networks.

Problem Formulation
We start with some notation.Assume that there is a graph, directed or undirected, G = (V, E), in which V = {v 1 , . . . ,v I } denote the sets of subjects to be detected.Each node has a P-dimensional feature, the i-th node's feature is represented by x i .The arrangement of all features in a row by node orders the results in the feature matrix X ∈ R I×P .A ∈ {0, 1} I×I is the binary adjacency matrix, the ij-th entry a ij ∈ {0, 1} of A means whether there is an edge from v i to v j , and vice versa.Number the elements of A from the first row to the n-th row as 1, 2, . . ., n + 1, n + 2, . . ., nn − 1, nn; therefore, A can be flattened into a twodimensional array A ∈ N 2×c , called the edge index matrix, where the first row represents the source node and the second row represents the target node.c is the number of non-zero elements in the matrix A, as well as the number of edges in the graph.Partition A into two triangular matrices, A upper and A lower .For an undirected graph, the first c/2 elements of A, correspond to the ith non-zero element in A upper ; the last c/2 elements of A correspond to the i-th non-zero element in A lower .D is the diagonal degree matrix, and S = D −1/2 AD −1/2 is the normalized adjacency matrix.
For graph G, each edge comprises two elements: edge direction and attribution.Matrices E and Z, are organized based on the corresponding order of edges.That is to say, E and Z shares the same topology as A. The matrix E ∈ {0, 1} 1×c stores the directions of the edges, where 0 and 1 represent incoming edges and outgoing edges.The matrix Z ∈ {1, . . . ,n} 1×c stores the attributes of the edges, where different values of element Z i represent different attributes of the edges.Figure 1 shows the process of mapping the adjacency matrix A to the edge index matrix A, as well as the construction of the edge direction matrix E and the edge attribution matrix Z.In (a), the first non-zero element in A is read, which is located in the first row and third column.Therefore, it represents that the corresponding edge has a source node of index 1 and a target node of index 3.Consequently, the first column of the edge index matrix A, denoted as A 1 , has 1 in the row above (representing the source node) and 3 in the row below (representing the target node).For the edge direction matrix, since this element is located in the upper triangular matrix A upper , the first element of the edge direction matrix E 1 is 0, indicating an incoming edge.The first element of the edge attribution matrix Z 1 stores attribute 3 of this edge.Following this procedure, we iterate through the other non-zero elements in A to construct A, E, and Z.We can transform the directed graph G into an undirected graph UG through a simple symmetric operation.This involves processing the elements in matrices A with the transformation a ij or a ji = a ji or a ij , if a ij or a ji ̸ = 0 0, otherwise.Consequently, A becomes symmetric matrices.In GAD tasks, the node set V is divided into a disjoint set of unlabeled nodes and labeled nodes, while only the first m nodes (m ≪ n) have labels from a label set L = {1, 2}.Here, 1 represents normal nodes, and 2 represents anomalous nodes.The label is stored by a one-hot-encoding matrix Y = [y 1 , . . . ,y n ] T .
Finally, V is divided into three sets in proportion: training set V train , validation set V valid , and test set V test .Our objective is to learn the features of nodes with labels, propagate their labels, and then infer the labels of the remaining nodes, given G, X, Y. Training is performed on V train , validation is conducted on V valid , and testing is carried out on V test .
Table 2 summarizes the symbols used in this paper along with their corresponding meanings.

Label Propagation Algorithm
LPA is based on the assumption that two connected nodes are correlated because of the homophily and influence.LPA utilizes node labels as its input.
We construct a one-hot-encoding label matrix Y (k) = y In the label propagation process during the k-th iteration, all nodes propagate their labels to neighboring nodes with Equation (1).From the perspective of nodes, node v i will choose the community of its neighbor with the highest membership count as its new community: Labels are propagated from each other nodes through a normalized adjacency matrix D −1 A. Subsequently, all nodes are reset back to their initial labels y (0) i , before proceeding to the (k + 1)-th iteration: y LPA is an iterative computation process and does not guarantee convergence.Approximately five iterations are often sufficient to achieve convergence [8], striking a balance between accuracy and performance.This will be the chosen number of iterations in our study.Consequently, for node v, the probability of category l is calculated as: where e(v, u) ∈ E represents the edge from v to its neighbor u. Figure 2 illustrates an LPA process executed on an unweighted, undirected graph.2. Label propagation process in three iterations: a demo.In the first iteration, the algorithm starts computation from the first unlabeled node (i.e., node 2).For node 2, N 2 = {1, 3}.Here, N i represents the neighboring nodes of node i.The next unlabeled node is then computed (i.e., node 3), using the updated value of node 2. After the first iteration, all nodes are reset back to their initial labels according to Equation (2).After multiple iterations (three iterations are shown here), some node values will approach convergence, representing the final label probabilities for nodes.In case of an undirected graph, the final representation of node i is the weighted average of its neighboring nodes j ∈ N(i), represented by

Graph Convolutional Neural Networks
GCN [9] is one of the most popular GAD models based on the Laplacian smoothing assumption.GCN utilizes node features as its input.In general, GCN employs a sum aggregator as an aggregation function, the discussions in this paper are based on it.We begin our derivation from the spatial representation of GCN.(The connection between LPA and GCN is observed in their spatial representations, both of which have the form of AHΘ.The original derivation of GCN starts from the spectral domain of graph convolution, and the spatial representation of GCN is an approximation using Weisfeiler-Lehman on the spectral domain of graph convolution.However, for LPA, the original paper directly provides a spatial representation without a spectral domain expression.Reference [34] introduces energy to represent information propagation and then considers energy Laplacian, but its form differs significantly from the spectral domain expression of GCN.) The feature propagation scheme of GCN in layer l can be formulaed as: where σ is an non-linear activation function, Θ (l) denotes the GCN trainable weight in the lth layer, and H (l) is the l-th layer representations of nodes with H (l) = X for the input layer.
To avoid neglecting the node's own features, self-loops are added when updating the node itself.This involves combining the node's own features with the features of its neighbors during the node update process.The normalized adjacency matrix will be S = D− 1 2 Ã D− 1  2 , where Ã = A + I, and D is the diagonal degree matrix for Ã with entries Dii = ∑ j Ãij .Of particular note is the adjacency matrix S, which here remains fixed, and depends on the topological structure of the constructed graph.Meanwhile, the weight Θ mentioned is the trainable weight of GCN, which is different from our learnable edge weight.
By stacking GCN, features for nodes in an l-dimensional space can be obtained.In GAD tasks, the final layer of GCN is set to an output dimension of 2, allowing for the classification of all nodes into normal/abnormal binary categories.Generally, stacking two layers of GCN already yields satisfactory results [9], and this will be the number of GCN layers used in this paper.At this point, the features of nodes after two layers of GCN are represented as: Similar to LPA, in the case of an undirected graph and sum aggregator, the final representation of a node i is the weighted sum of its neighboring nodes j ∈ N(i), represented by

Relationship between LPA and GCN
According to analysis before, on the label side L, LPA can propagate and aggregate node labels along edges; along feature side V, GCN can convolute and transform node features along edges.We can also observe that Equations (1) and 4 share similarities in their forms.In this section, we attempt to answer two questions: What is the relationship between the label propagation of LPA and the feature convolution of GCN?Why can LPA serve as edge weights for training GCN?
Consider two nodes v 1 and v 2 in a graph; v 1 is unlabeled while v 2 is labeled.We study the relationship between LPA and GCN from the perspective of label influence and feature influence.Especially, if the initial label (feature) of v 2 changes, how does the output label (feature) of v 1 change.The label (feature) influence can be measured by the gradient (Jacobian) of the output label (feature) of v 1 with respect to the initial label (feature) of v 2 [35,36].The label influence of v 2 on v 1 after k iterations of LPA is: Due to the iterative nature of the LPA algorithm, the influence of x 2 on x 1 accumulates, and can eventually be expressed as: On the other hand, denote x (k) 1 as the k-th layer embedding of v 1 in GCN, and x 2 represents the initial embedding of v 2 .The feature influence of v 2 on v 1 after k-layers of GCN is the L1-norm of the expected Jacobian matrix [30], as: where the expectation is taken under the transformation matrix Θ (•) .The feature influence should be normalized as: Assume that the ReLU function serves as the activation function σ of GCN, and β denotes the fraction of unlabeled nodes.Consequently, the relationship between the label propagation of LPA and the feature convolution of GCN can be concluded, which can be expressed as: The details of the proof for Equation ( 10) can be found in Reference [30].With this, the first question posed in this section is answered: the relationship between label propagation and feature convolution is that if the label of node v 2 has a sufficiently large influence on the label of node v 1 , then the initial features of node v 2 will also have an equally large impact on the output features of node v 1 .In other words, the influence of all nodes labeled as c on the label of node v 1 , i.e., is proportional to the probability of node v 1 being classified as c by LPA as Equation (3), i.e., From the above derivation, we can conclude that label influence can serve as a substitute for feature influence, which is the answer to the second question.This provides a theoretical basis for us to use LPA to assist in training edge weights for GCN and to establish a GCN model where edge weights are learnable.

Our Unified GCN-Based GAD Model
This section will introduce our unified GCN-based GAD model, primarily consisting of three clever enhancements built upon the GCN network: learnable edge weights, edge-node co-embedding, and separate trainable GCN weight.Learnable edge weights introduce node label information through LPA, edge-node co-embedding incorporates information from edges, and separate trainable GCN weights enhance the model's expressive ability.We aim to leverage the improvements mentioned above to fully utilize node feature information, node label information, and edge information in the GAD graph, thereby enhancing the algorithm's performance in scenarios with extremely low proportions of anomalous data.Figure 3 illustrates the overall architecture of our unified GCN-based GAD model.In contrast, Figure 4 illustrates the structure of a classical GCN network.

Learnable Edge Weight
To ensure narrative coherence, we first introduce the concept of learnable edge weights.Building upon traditional GCN networks, as in Equation ( 4), a learnable edge weight parameter W (l) is introduced.With this addition, the GCN network can be rewritten as: where W (l) represents the edge weight parameter for layer l, • is the symbol of the Hadamard product.It is important to note that the matrix Ã in the equation is constant and solely dependent on the adjacency relations of the input nodes, with its values remaining unchanged throughout the training process.Θ represents the weights of the graph neural network (not our learnable edge weight).Figure 5 illustrates our first improvement, which is introducing node label information by adding learnable edge weights.The number of columns in the GCN trainable weight layer, denoted as C, is variable.It is set to 2 in the final GCN layer (meeting the binary classification requirements of our GAD task), while in other layers, it is set to the number of hidden units.The next question becomes how to train the weight matrix W • Ã to optimize its performance.Combining Equations ( 11) and ( 12), we can conclude that when the loss function of LPA L LPA (W • Ã) is minimized, the optimal edge weight Ŵ • Ã is achieved, as: Cross entropy can measure the distance between predicted values ŷlpa 1 and ground truth y 1 .Therefore, the above expression can be rewritten as: where J is the cross entropy loss.As a GAD task, as analyzed in Section 1, fraudulent activities exhibit correlations and organizational characteristics.Fraudulent groups often have close connections, and their behaviors show certain similarities.This implies that neighboring nodes in the graph often have the same labels, and their feature vectors are also similar (as the node feature vectors in the GAD task are derived from real user behavior).Given that neighboring nodes have similar labels and feature vectors, cross entropy can effectively encourage adjacent nodes to be similar, learning an effective mapping from feature vectors to labels.Therefore, the loss function for traditional GCN is defined as follows: cross entropy is also used to compute the distance, and the loss of predicted labels by GCN J ŷgcn 1 , y 1 minimized to learn the optimal transformation matrix Θ * in the GCN: The final question is how to determine the loss function for the unified model.For a graph semi-supervised learning task, we can utilize both node features and graph structure information.Therefore, we can optimize both loss functions together and incorporate L2 regularization [9]: where L labeled represents the loss function based on labeled data, L reg represents the loss function based on graph structure information, ∥ω∥ 2 2 is the L2 regularizer, and λ and γ are hyperparameters that respectively adjust the relative importance of L reg and ∥ω∥ 2 2 .Data loss L labeled considers the distance between the predicted values and the ground truth, which is the supervised loss.As a regularization penalty term, L reg only considers the loss induced by the weight coefficients Θ.
As per Section 3.3, it is evident that LPA has the ability to replace GCN in training edge weights.Therefore, the loss of LPA (Equation ( 15)) can replace the regularization penalty term L reg in Equation (17) for training, thereby updating W (Equation ( 13)), which is the proposed learnable edge weight.The data loss in Equation ( 17) is composed of the loss of traditional GCN (Equation ( 16)).Ultimately, the loss function can be written as: where m is the number of labeled nodes in the graph.The first term corresponds to the segment of GCN that learns the edge weights W and the trainable matrix Θ.Meanwhile, the second term corresponds to the label propagation segment, which can be seen as imposing constraints on the edge weights W. The second term can be viewed as a regularization on W to assist GCN in learning edge weights.The above process is implemented using Algorithm 1. First, we initialize a trainable edge weight matrix W (0) as a matrix of ones (line 2).Then, we utilize GCN to perform feature convolution, setting the output dimension of the last GCN layer to 2 for binary classification, obtaining the GCN's label predictions ŷgcn (lines 4-8).The computation of LPA is similar to GCN, resulting in the LPA's label predictions ŷlpa (lines 9-12).Subsequently, the cross-entropy losses between ŷgcn and ground truth y are calculated, as well as ŷgcn and y, to obtain the total loss L (line 13).Finally, the GCN training weights Θ and the learnable edge weight W are updated using gradient descent.

Algorithm 1 Learnable edge weight.
Input: Graph G(V, E); adjacency matrix A; label matrix Y; degree matrix D; number of GCN layers L; number of LPA iterations K; square matrix W, with the size equal to the number of edges in E, where all elements are ones.Output: Predicted abnormal nodes out.

Co-Embedding of Edge and Node Features
This section will focus on constructing edge embeddings and aggregating them with node embeddings.The amount of information contained in H directly determines the feature influence of a node on its neighboring nodes (as described in Equation ( 9)).We start by considering increasing the information contained in H.By adding the embeddings of edge direction and properties and generating edge features, we introduce a method to aggregate node features and edge features into a co-embedding feature.
As discussed in Section 3, each edge consists of two elements: edge direction and attribution.Reiterating the definition provided, matrix E ∈ {0, 1} 1×c stores the direction of the edges, and matrix Z ∈ {1, . . . ,n} 1×c stores the attribution of the edges.
Once the edge direction and edge attributes are defined as matrices, they can be aggregated using an additive approach for propagation [37].Although this simple aggregation calculation is computationally efficient, it fails to capture dependencies between different relationships and interactions between node and edge features [38].
To ensure uniform dimensions between node feature matrices and edge feature matrices, while also reflecting the relationship between node and edge features, incorporating edge features into node features is needed.This involves performing aggregation operations on both node features and edge features.
First, we map the high-dimensional discrete features E and Z to continuous lowdimensional vector spaces, obtaining dense vectors, which enables the model to better handle information from E and Z.The embedding layer maps each node with values (both e and z) to a d-dimensional dense vector representation, denoted as E emb and Z emb .The calculation process of the embedding layer can be expressed as: where E ∈ R v×d is the weight matrix of the embedding layer.For the edge direction matrix and edge attribute matrix, v is 2 and n, respectively, representing the dimensionality of the input space for the embedding layer.Now, we have obtained matrices E emb ∈ R c×d and Z emb ∈ R c×d after the embedding operation, and perform matrix addition to obtain the edge features: where F ∈ R c×d .Secondly, we expand the node feature matrix X to X.We index the node feature vector based on the elements stored in the first row A 1 (source node) of the edge index matrix A. The elements in A correspond to the row numbers in X.Since the same element value may appear multiple times in A (because the out-degree of a specific node may be greater than one, i.e., it may act as a source node multiple times), there will be duplicate rows in the extended node feature matrix X.Each row in the extended node feature matrix X is a copy of the corresponding row in the node feature matrix X.For example, if element 2 appears twice in A, then the row corresponding to X 2 will also appear twice in X.Ultimately, X ∈ R I×P is expanded to X ∈ R c×P , where c is the number of edges in the graph and P is the dimensionality of the node features.
Thirdly, concatenate F with X to obtain the aggregated feature matrix F, where c is the number of edges in the graph, and P + d is the sum of the dimensions of the edge embedding and the extended node feature.
Finally, F is scattered to transform edge features into node features.F is set as the input tensor, and scatter indices are specified at target node positions from the second row of A. We specify the output tensor size as the number of nodes (the dimension of adjacency matrix A, denoted by I), and aggregate the data using the sum method.Ultimately, we obtain the co-embedding matrix concat(H node , H edge ) ∈ R I×(P+d) of node features and edge features.Figure 6 illustrates the process of generating co-embeddings by embedding edge direction and edge attributes, combining them with extended node features to generate aggregated features, and then scattering to obtain edge-node co-embeddings.

Separate Trainable GCN Weights
In this section, we design separate trainable GCN weights for node features and coembedding features.The term "separate trainable GCN weights" refers to splitting the original trainable weight Θ in GCN into two weights, Θ 1 and Θ 2 .These weights are used to transform node embeddings and co-embeddings in their respective feature spaces before combining them through weighted summation.
In theory, our method extracts high-dimensional information from the neighborhood of graph nodes by dimensionality reduction into dense vector embeddings.These embeddings can be provided to downstream machine learning systems to accomplish the GAD task.In graph analysis tasks, edge embeddings have been proven to be highly useful as input features for GAD [39].
Our discussion still starts with the traditional GCN.For the sake of simplicity in exposition, we rewrite Equation (4) in a more concise form: In the above equation, the activation function σ and normalized adjacency matrix S are omitted, and only one layer of GCN is considered, which does not affect the construction of our aggregation function.The equation can be seen as the transformation of node embeddings H to obtain new embeddings H ′ through trainable weight Θ, where H (0) node = X.To cleverly combine node embedding H node and co-embedding concat(H node , H edge ), we designed different Θ for H node and concat(H node , H edge ), projecting them into different feature spaces.This helps enhance the model's expressive ability, which can be written as follows: here, W is the learnable edge weight discussed in Section 4.1, which is incorporated into the edge embedding network, ϕ is a hyperparameter that adjusts the relative importance.By substituting the co-embeddings into Equation ( 23), Θ 1 and Θ 2 can be trained separately.Figure 7 illustrates our third improvement, which is designing separate GCN trainable weights to enhance the model's expressive power, and Algorithm 2 outlines the process of our co-embedding and separate training algorithm.First, we perform embedding operations separately on the directional and attribute information of the edges (lines 1-2).Then, we add the obtained embedding matrices to obtain the edge feature matrix (line 3).We redistribute the node feature matrix to obtain the extended node feature matrix (line 4).Next, we aggregate the edge feature matrix and the node feature matrix (line 5), and obtain the edge-node co-embedding through scatter operations (line 6).Finally, we project node features and edge-node co-embedding into different spaces using different GCN trainable weights (line 7).We selected the publicly available dataset, DGraph [6], a real-world GAD graph in the finance domain, as our test sample.This dataset comprises a directed, unweighted graph that represents a social network among users.In this graph, each node represents a user, and the node features, derived from basic personal profiles, consist of a 17-dimensional vector.Each dimension corresponds to a different element of the individual's profile, primarily including the following aspects: (1) basic personal information, such as age, gender, and location; (2) past borrowing behavior, including agreed-upon repayment due dates and actual repayment dates; (3) emergency contact-related information, such as the existence of emergency contacts.An edge from one user to another signifies a different type of connection, with a total of 12 distinct types of connections.It is noteworthy that DGraph is highly sparse, containing approximately 3 million nodes and 4 million edges.The number of nodes in DGraph is 17.1 times that of Elliptic [40], making it the largest publicly available dataset in the GAD domain to date.
As a financial dataset, DGraph originates from real-world borrowing and lending behaviors.The determination of whether a node is classified as normal or anomalous depends on its borrowing and lending activities, summarized briefly as follows.All labeled nodes in DGraph have engaged in borrowing or lending at least once.Nodes that are registered but have not engaged in borrowing or lending are classified as unlabeled nodes.Among the labeled nodes, those classified as anomalous are users who have failed to repay their loans for a considerable period after the loan's due date, disregarding repeated reminders from the platform.According to the aforementioned criteria, 15,509 users are classified as anomalous.Among these nodes, normal users account for 22.07% (nodes labeled as 0), while abnormal users account for 0.26%.The ratio of positive to negative samples is approximately 85:1, indicating a significant class imbalance in the dataset.The nodes are randomly split into training/validation/test sets with a ratio of 70:15:15.Table 3 lists the key parameters of DGraph.Table 4 summarizes the comparison of parameters between DGraph and mainstream anomaly detection datasets.Compared to other datasets, DGraph is closer to real-world applications because it has a minimum percentage of anomalous nodes and a diluted network, which is a common phenomenon in real GAD task.At the same time, the label data proportion in DGraph is also very low.Compared to DGraph, YelpChi is constructed using a Yelp Review Filter [41], making it not the real ground truth.The ratio of edges to nodes (in other words, the degree of the nodes) in the Amazon dataset is approximately 368:1, creating a graph that is too dense to be common in the GAD field.Elliptic is the closest to our scenario, but it treats edges as homogeneous, meaning that edges do not have attributes.In real GAD scenarios, however, the connections between users are diverse.Excessive features also reduce the usability of Elliptic.This is the reason why DGraph was chosen as the dataset for this study.It is important to note that, for privacy protection purposes, the features of the nodes in DGraph are anonymously provided and not ordered by importance.This anonymity of features prevents this paper from visualizing the dataset in the same manner as other datasets, where the first two features are typically chosen for the x-axis and y-axis in plots.The anonymous provision of features also limits our ability to further explain which features contribute more significantly to node anomalies.(The DGraph provider anonymizes users by removing personal identifiers and randomizing transaction details, making it impossible to trace back to user IDs.Furthermore, since user features are not unique, users cannot be identified from the dataset, ensuring DGraph's anonymity.)Figure 8 lists the distribution of node out-degrees and in-degrees.There is a significant difference in the distributions of out-degrees and in-degrees for the two types of nodes.According to the definition of DGraph, each node represents a user.The out-degree represents the completeness of information for a user, where the higher out-degree indicates higher information completeness.The in-degree represents the trust relationship of a user, where a higher in-degree indicates higher trust from other users.The out-degree of a node is determined by the user, while the in-degree is not.The significant difference between node out-degrees and in-degrees provides a basis for our edge direction embedding.The edge direction treats the out-degree and in-degree of nodes differently, accurately distinguishing between the two types of nodes.Background nodes are an important attribute of the DGraph dataset.[6] These nodes refer to entities that lack borrowing and lending activities, are not the target of anomaly detection, but play a crucial role in maintaining the connectivity of DGraph.Throughout our experiments, we adhere to the following assumption: retain all background nodes but only perform anomaly detection on labeled nodes (normal or anomalous).

Baselines
We have selected eight methods as our baseline approaches, comprising one baseline method (MLP), two generic graph methods (Node2Vec and LPA), and five GCN-based methods (GCN, GAT, GATv2, GraphSAGE, and UniMP).We adopt AUC as the evaluation metric, which is extremely useful in the scenario of data imbalance where abnormal data sets are scarce while normal data sets are extremely abundant.The higher the AUC value, the better the model's performance.For example, in DGraph, abnormal users only account for 1.16% of the dataset.Therefore, even if the model predicts all samples as normal, the accuracy would still be as high as 99%.However, it is evident that the model's predictive performance is poor.Moreover, in GAD scenarios, the losses incurred by predicting fraudulent users as normal would be far greater than predicting normal users as fraudulent.Hence, for our GAD task, AUC is more crucial.

Experimental Setup
The experimental environment for all experiments remained consistent.The GPU utilized was an NVIDIA V100 (32 GB NVLink memory, single GPU with 5120 CUDA Cores), the CPU consisted of 12 Intel Xeon processors (Platinum 8163 (Skylake) @2.50 GHz), and the system had 92 GB of RAM.For each method, five runs were executed, with each run comprising 500 epochs, and the best-performing value was selected.Unless otherwise specified, all experiments are conducted on undirected graphs.
In implementing all algorithms, we utilized PyTorch Geometric.During the implementation process, we employed one trick: we replaced the storage of the matrix containing the source and target nodes of edges with a sparse adjacency matrix.This significantly accelerated computation speed, except for GAT, GATv2, and co-embedding, as they require explicit edge propagation.
All the hyper-parameters are listed below:

Comparison with Baselines
Table 5 presents the results of the comparison between our unified model and the baseline algorithms in terms of the AUC metric.The hyperparameters used here are: LPA weight λ = 1 and co-embedding weight ϕ = 1.In the table, "WL" refers to the first improvement module, which is the edge weight learnable module in our model; "CE" refers to the second improvement module, which is the edge-node co-embedding module.50.89% 50.95%N/A Node2Vec [7] 62.61% 62.92% N/A GCN-based Methods GCN [9] 70.78% 72.32% 0.608 s GAT [10] 72.33% 73.31% 0.798 s GATv2 [11] (ICLR 2022) 75.26% 76.24% 0.901 s GraphSAGE [12] 75.48% 76.21% 0.671 s UniMP [13]  From Table 5, we observe that the MLP baseline, which does not utilize any graph information, achieved good results.Among the generic graph methods, Node2Vec, which only utilizes graph structure, and LPA, which only utilizes node labels, did not achieve high AUC scores.This suggests that in the GAD task on DGraph, the primary information lies within the node features.Among the GCN-based models, classical GCN and GAT performed similarly, while GraphSAGE showed significant improvement in performance.This is because GraphSAGE designs different trainable features for neighboring nodes.GATv2 improves upon GAT and exhibits noticeable performance gains.UniMP, incorporating attention mechanisms into GCN models, achieved very significant improvements.
Our edge-weight-learnable method, which is an improvement of GCN utilizing LPA to incorporate node labels, achieved higher AUC compared to traditional GCN.This demonstrates that both node features and node labels contain rich information, and LPA can act as a regularization penalty to aid training.
In terms of model training time, the general graph methods LPA and Node2Vec do not involve explicit "training" and therefore do not have training time.The MLP model, being the smallest in scale and simplest in architecture, naturally has the shortest training time.Among the methods employing the sparse matrix trick, our edge weight learnable method has the second shortest training time, only slightly longer than the classical GCN method.This is because our edge weight learnable method does not increase the model's dimensionality; it only introduces node labels (which are also used in traditional GCN-

Figure 1 .
The process of mapping the adjacency matrix A to the edge index matrix A, as well as the construction of E and Z.(a) The first non-zero element in A is mapped to A. (b) The second non-zero element in A is mapped to A. (c) The remaining non-zero elements in A are mapped to A.
(k) 1 , . . ., y (k) n T to store labels.The i-th row y (k)T i stores the predicted label of node y i after k times label propagation.The label matrix Y (0) is initialized with the rule y (a) Initial state.(b) Update for node 2 in the 1st iteration.(c) Update for node 3 in the 1st iteration.(d) After a round of updates for all unlabeled nodes (iteration 1).(e) After iteration 2. (f) After iteration 3.

Figure 5 .
Figure 5. First improvement: Introducing node label information by adding learnable edge weights (only one layer of Graph Convolutional Network (GCN) is displayed).

Figure 6 .
Figure 6.Second improvement: Introducing information from edges through edge-node co-embedding.

Figure 7 .Algorithm 2
Figure 7. Third improvement: Designing separate Graph Convolutional Network (GCN) trainable weights to enhance the model's expressive power (only one layer of GCN is displayed).

Figure 8 .
Distribution of node out-degrees and in-degrees.(a) Out-degree distribution of nodes.(b) In-degree distribution of nodes.

Table 1 .
Comparison of semi-supervised anomaly detection methods.

Table 2 .
Symbols used in this paper.

Table 4 .
Comparison of parameters between DGraph and mainstream Graph Anomaly Detection (GAD) datasets.