RLC-GNN: An Improved Deep Architecture for Spatial-Based Graph Neural Network with Application to Fraud Detection

Abstract: Graph neural networks (GNNs) have been very successful at solving fraud detection tasks. GNN-based detection algorithms learn node embeddings by aggregating neighboring information. Recently, the CAmouflage-REsistant GNN (CARE-GNN) was proposed, and it achieves state-of-the-art results on fraud detection tasks by dealing with relation camouflages and feature camouflages. However, stacking multiple layers in the traditional way, defined by hop, leads to a rapid performance drop. As the single-layer CARE-GNN cannot extract more information to fix potential mistakes, its performance relies heavily on that one layer. To avoid single-layer learning, in this paper we consider a multi-layer architecture that forms a complementary relationship with a residual structure. We propose an improved algorithm named Residual Layered CARE-GNN (RLC-GNN). The new algorithm learns progressively, layer by layer, and corrects mistakes continuously. We choose three metrics (recall, AUC, and F1-score) to evaluate the proposed algorithm in numerical experiments. We obtain up to 5.66%, 7.72%, and 9.09% improvements in recall, AUC, and F1-score, respectively, on the Yelp dataset, and up to 3.66%, 4.27%, and 3.25% improvements in the same three metrics on the Amazon dataset.


Introduction
In the current information age, things are moving towards cyberspace. The Internet has become a significant part of modern society, and fraud activities have subsequently expanded from offline to online. Fraudsters post fake content anonymously, which disrupts the normal order of cyberspace, and benign users may suffer losses as a result. It is inefficient and costly to manually review tens or even hundreds of millions of complicated online items to determine whether the information is fraudulent. Performing fraud detection efficiently therefore has important research significance.
Considering a social networking service or a shopping site, we abstract its users, reviews, or other information into nodes and regard the relationships between nodes as edges. We can then build a graph and apply graph neural networks (GNNs) to detect the suspicious characteristics of nodes so as to find the fraudsters. GNNs extend the application field of classic deep learning to tasks with non-Euclidean data [1]. By dealing with relation camouflages and feature camouflages, the recently proposed CAmouflage-REsistant GNN (CARE-GNN) achieves state-of-the-art results [2].
One of the primary reasons for the success of neural network models in computer vision is the ability to train deep network architectures (e.g., 19 layers for VGG [3], 22 layers for GoogLeNet [4], and 152 layers for ResNet [5]). However, GNNs suffer from an intrinsic limit that makes them difficult to train with deep architectures. As a consequence, most state-of-the-art GNNs are restricted to no more than four layers [6]. Experimental results for CARE-GNN show that it performs best with a single-layer architecture and that its performance drops rapidly as the number of layers increases. In this case, the single-layer CARE-GNN aggregates neighboring information for every node only once, so the model has no chance to fix its potential mistakes or revise its inferences. We call this problem single-layer learning. Briefly, single-layer learning has two meanings: (1) the intrinsic shallow limit of GNNs and (2) no chance to correct mistakes made by the previous layer or to revise inferences.
Based on the CARE-GNN algorithm and inspired by the work in [7], we first use a different method to stack multiple layers. The classical multi-layer architecture for GNNs is defined by hop. Take the calculation for a node v as an example. If we first take v's neighbors as central nodes and aggregate information from the neighbors' neighbors for each one, and then aggregate the neighbors' information into node v, we say that the model has two hops, which also means that the model has two layers. Thus, as we increase the number of layers, information from more distant nodes can be aggregated. However, the most valuable information always exists in neighbors that are directly connected to the central node. The greater the distance (i.e., the hop), the less relevant the information will be. Therefore, we consider how to make the most effective use of the information of directly connected nodes.
Based on the above considerations, we propose a new improved algorithm named Residual Layered CARE-GNN (RLC-GNN). We use a different method to define the concept of multi-layer and utilize a layered structure to collect the most valuable information from neighbors. The architecture and working mechanism of RLC-GNN are similar to those of models for classic deep learning tasks (e.g., image classification). In an iteration, every layer processes the same batch of nodes, and we do not update the original dataset with each layer's output features. Furthermore, we introduce the residual structure into our algorithm to compensate for the loss of information during propagation. Thus, RLC-GNN has the ability to correct its own mistakes and "think" more deeply and comprehensively. The two structures form a complementary relationship and make the most effective use of neighboring information. Experiments are carried out on the Yelp dataset and the Amazon dataset provided in [2]. Numerical results show that the new algorithm obtains a significant improvement in performance over the CARE-GNN algorithm. The contributions of this work include the following: (1) We adapt the idea of the new definition of multi-layer to a spatial-based GNN. (2) We present an improved algorithm named RLC-GNN to successfully train a deep spatial-based GNN. (3) We introduce the residual structure into our multi-layer case to form a complementary relationship and present an empirical analysis of how the combination of the two structures, layered architecture and residual structure, improves the performance. (4) We conduct experiments on a Yelp dataset and an Amazon dataset, in which the proposed RLC-GNN achieves significant improvements when applied to fraud detection.
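The hop-based notion of depth can be illustrated with a small sketch: a model with k hop-defined layers can draw on every node within k hops of the central node. The function name `k_hop_neighborhood` and the toy path graph are illustrative assumptions, not part of any referenced implementation.

```python
def k_hop_neighborhood(adj, v, k):
    """Return the set of nodes reachable from v within k hops (v excluded)."""
    frontier, seen = {v}, {v}
    for _ in range(k):
        # Expand one hop: neighbors of the current frontier not seen before.
        frontier = {u for w in frontier for u in adj[w]} - seen
        seen |= frontier
    return seen - {v}


# Toy path graph 0 - 1 - 2 - 3 (hypothetical data).
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
one_hop = k_hop_neighborhood(adj, 0, 1)   # {1}
two_hop = k_hop_neighborhood(adj, 0, 2)   # {1, 2}
```

Each extra hop enlarges the set of contributing nodes, but the added nodes are farther from the central node, which is the trade-off discussed above.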

Related Works
GNNs were originally designed to deal with complicated non-Euclidean data [8], which classic deep learning algorithms are generally unable to process systematically and reliably. There are two main types of GNNs [6]: spectral-based GNNs and spatial-based GNNs. The first proposed spectral-based GNN algorithm implements the convolutional operation on topological graphs using spectral graph theory [9] (i.e., using the eigenvalues and eigenvectors of the Laplacian matrix of the graph to study the properties of graphs). Since then, many improved algorithms based on the graph convolutional network (GCN) have appeared one after another. The first spatial-based GNN algorithm [10] was actually proposed much earlier; it iteratively aggregates neighborhood information and updates the node embeddings with a globally shared local transition function and local output function. As spatial-based GNNs offer more flexibility in algorithm design, most proposed GNN variants belong to this type, and so does our proposed RLC-GNN.
Fraud detection is essentially a semi-supervised node classification problem. Two reasons make the fraud detection task special: the sample distribution of the datasets is extremely unbalanced, and fraudsters actively conceal their features to avoid being detected [2,11]. These characteristics make it difficult for classic deep learning algorithms to learn the implicit rules in the data. Even if a classification model is obtained, it tends to have serious bias, which results in poor generalization. Various methods have been presented recently to solve the problem. The Graph Embeddings for Malicious accounts (GEM) algorithm [12] is the first heterogeneous graph neural network approach for detecting malicious accounts; it establishes multiple homogeneous subgraphs based on the types of nodes and aggregates neighborhood information under each subgraph. Moreover, the GEM algorithm uses an attention mechanism to learn the importance of each type. The GCN-based Anti-Spam algorithm [13] is the first GCN-based spam detector that operates on a heterogeneous graph and directly aggregates information from three types of nodes at the same time. The Adversary Situation Awareness algorithm [14] then addressed the adversarial behaviors of fraudsters, e.g., adding special symbols to texts so that detectors cannot recognize the original semantic information, or avoiding detection by switching devices and networks.
Current GNN algorithms are limited to shallow depth in the conventional sense, which is defined by k-hop. Some approaches have been proposed to help train deeper GNNs. In the traditional CNN area, ResNet [5] presents a residual learning framework to ease the training of very deep neural networks (as many as 152 layers). The works in [15,16] apply the residual structure to break the shallow limit of GCNs. The work in [17] proposes a Node Normalization technique to reduce feature correlation and increase the smoothness of models, which helps train deeper GCNs. H-GCN [18] proposes a coarsening procedure and correspondingly makes the GCN deeper to enlarge the receptive field of each node.
However, many of these algorithms and methods are designed for spectral-based GNNs, whose calculation theories are not similar to those of classic deep learning models (e.g., convolutional neural networks (CNNs)). They cannot utilize well-developed techniques to assist training, nor make good use of the flexibility of structure design of spatial-based GNNs.

Graph Representation Learning
Graphs are a general language for describing and analyzing entities with interactions. Many types of real-world data are graphs and form complex systems, e.g., computer networks, social networks, economic networks, code graphs, and molecules. The relational structure of these data contains valuable information that we can take advantage of for better prediction. A graph is represented as $G = \{V, X, E\}$, where $V$ is the set of nodes. Each node $v_i$ is represented by a $d$-dimensional feature $x_i \in \mathbb{R}^d$ in $X$. Each edge $e_{i,j} \in E$ indicates that node $v_i$ and node $v_j$ are connected due to a certain relationship between them. For example, in Figure 1, we show the famous Zachary Karate Club Network [19], which uses nodes to represent individuals and edges to represent friendship between two individuals. During Zachary's research, the club was divided into two communities, which were led by the instructor and the club's chairman, namely, node 1 and node 34, respectively. Moreover, Zachary correctly predicted which individuals would join each community based on this graph structure.
An image is composed of many regularly placed pixels. Therefore, a unified and regular operation can be designed to process images. We show a typical processing method of a single CNN layer on an image in Figure 2. However, real-world graph data have some inconvenient characteristics, namely arbitrary size, a complex topological structure, and no fixed node ordering, which make it hard to apply classic deep learning models to graph learning tasks (Figure 3).
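As a concrete illustration of the representation G = {V, X, E}, the following minimal sketch stores node features and undirected edges. The class name `Graph`, its methods, and the toy values are assumptions made for this example only.

```python
from dataclasses import dataclass, field


@dataclass
class Graph:
    """Toy container for G = {V, X, E}; all names here are illustrative."""
    features: dict                            # X: node id -> d-dimensional feature list
    edges: set = field(default_factory=set)   # E: undirected pairs (i, j)

    def add_edge(self, i, j):
        # Store each undirected edge once, in sorted order.
        self.edges.add((min(i, j), max(i, j)))

    def neighbors(self, v):
        # All nodes that share an edge with v.
        return sorted({j if i == v else i for i, j in self.edges if v in (i, j)})


g = Graph(features={1: [0.1, 0.9], 2: [0.4, 0.2], 3: [0.7, 0.7]})
g.add_edge(1, 2)
g.add_edge(3, 1)
```

Here the node set V is given implicitly by the keys of `features`, and `g.neighbors(1)` returns the nodes directly connected to node 1.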

Figure 3. Image versus graph: is a sliding window appropriate? In the practice of classic deep learning, it is reasonable and effective to apply a sliding window on images. However, there is no fixed notion of locality on a graph, and we cannot apply a sliding window on graph data.
A disadvantage of traditional machine learning approaches on graphs is that feature engineering is an inescapable procedure, which makes these approaches less flexible. Node embedding, a graph representation learning method, alleviates the need to do feature engineering every single time. This method is an encoder-decoder framework [20], as shown in Figure 4. There are two key components in encoder-decoder frameworks, and the main problem in these networks is how to learn the mapping function (i.e., the encoder). For example, a graph encoder can use the node features around each node to generate an embedding. The corresponding decoder extracts information from the embedding, which might be information about the node's classification label. The generalized encoder architecture is often called a GNN.

General GNN Framework
The key idea of GNNs is to transform the information at the neighbors and combine it. In Figure 5, we show a typical architecture and the working mechanism of a GNN. As can be seen, the aggregate function consists of two steps. The first step is message passing, and the second step is the formal aggregate operation (e.g., mean, pool, and long short-term memory). For the simplest GNN, the message passing function could be a constant and the aggregate operation could be a sum (i.e., we do nothing but simply add the neighboring information together). It is not hard to see that we can improve a GNN's performance in at least three aspects: the selecting strategy, the aggregate function, and stacking multiple layers. For instance, if the importance of the neighbors to the central node is inconsistent, we can apply an attention mechanism at the first step of the aggregate function. In the general formulation of GNNs, $h_{in}^{(k)}$ and $h_{out}^{(k)}$ denote the input and updated node embeddings at the $k$-th layer, respectively, and update usually denotes the $k$-th layer's trainable neural network parameters applied after the aggregate function. We show the general architecture of a GNN in Figure 5.
Figure 5. A typical and basic architecture and processing procedure of a GNN. First, the GNN selects neighbors with a certain strategy. Then, an aggregate function is applied to extract information around the central node. Finally, the aggregated information passes through a neural network that performs a nonlinear transformation. The output is the updated representation of the central node.
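The select, aggregate, and update steps described above can be sketched for the simplest case: all neighbors are selected, messages are passed unchanged, and the aggregate is a sum. This is a toy illustration in plain Python. The function name `gnn_layer`, the choice to include the node's own feature in the sum, and the weight matrix are assumptions rather than a specific published layer.

```python
def gnn_layer(features, neighbors, weight):
    """One toy GNN layer: sum aggregation followed by a linear map and ReLU."""
    updated = {}
    for v, h_v in features.items():
        # Message passing with a constant message function: features pass unchanged.
        messages = [features[u] for u in neighbors[v]]
        # Aggregate: element-wise sum of the node's own feature and its neighbors'.
        agg = list(h_v)
        for m in messages:
            agg = [a + b for a, b in zip(agg, m)]
        # Update: out[j] = ReLU(sum_i agg[i] * weight[i][j]).
        updated[v] = [
            max(0.0, sum(a * row[j] for a, row in zip(agg, weight)))
            for j in range(len(weight[0]))
        ]
    return updated


features = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
neighbors = {0: [1, 2], 1: [0], 2: [0]}
weight = [[1.0, -1.0], [0.5, 0.5]]  # hypothetical 2x2 trainable matrix
out = gnn_layer(features, neighbors, weight)  # out[0] == [3.0, 0.0]
```

Replacing the sum with a mean, or weighting each message by an attention score, changes only the aggregate step and leaves the rest of the layer untouched.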
GraphSAGE [21] proposed a general framework for inductive node embedding, in which the representation $h_v^{(k)}$ of node $v$ at the $k$-th layer is generated by the following steps. First, each node aggregates features from its local neighborhood with the AGG aggregator into a single vector $h_{N(v)}^{(k)}$. After aggregating the neighbors' features, GraphSAGE concatenates the node's previous representation with its neighborhood feature vector. Then, the concatenated vector is fed through a multi-layer perceptron (MLP) with a ReLU activation function, and the output is the new representation of node $v$, which will be used as the input at the next layer.
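The three GraphSAGE steps (aggregate, concatenate, transform) can be sketched as follows. This is a simplified illustration: a mean is used for AGG, a single linear layer with ReLU stands in for the MLP, and the name `sage_step` and the weight values are assumptions. The neighbor list is assumed to be non-empty.

```python
def sage_step(h_v, neighbor_feats, weight):
    """One GraphSAGE-style update for a single node (toy version)."""
    # AGG: element-wise mean over the neighborhood features.
    n = len(neighbor_feats)
    h_nv = [sum(col) / n for col in zip(*neighbor_feats)]
    # CONCAT: previous representation followed by the neighborhood vector.
    z = h_v + h_nv
    # Linear layer plus ReLU; each row of weight produces one output unit.
    return [max(0.0, sum(w * x for w, x in zip(row, z))) for row in weight]


h_new = sage_step(
    [1.0, 2.0],
    [[0.0, 2.0], [2.0, 4.0]],
    [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, -1.0, 0.0], [0.5, 0.5, 0.5, 0.5]],
)  # [1.0, 0.0, 3.5]
```

Because the weight matrix has three rows, the output dimension (3) differs from the input dimension (2), which shows how the per-layer transformation can change the embedding size.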
As in classic deep learning, for supervised learning (i.e., we are given input $x$, and the goal is to predict label $y$), a graph learning task is also formulated as an optimization problem:
$$\min_{\Theta} \mathcal{L}(y, f(x))$$
where $\Theta$ is the set of parameters we optimize and $\mathcal{L}$ is a loss function. $f$ denotes the graph neural network function, which can be very complex. The output of $f$ is the predictions for the nodes. Our goal is to make the loss, which measures the gap between predictions and actual labels, as low as possible.

CARE-GNN
The key idea of CARE-GNN is to convert a heterogeneous graph into homogeneous graphs (i.e., aggregating features separately under each relation) [2]. At each layer, the input node embeddings (i.e., a batch of the dataset) are fed through an MLP. Then, the label-aware similarity measure is performed to calculate the $l_1$-distance between each central node in the minibatch and its neighbors, where $D^{(k)}(v, u)$ denotes the $l_1$-distance between central node $v$ and one of its neighbors $u$ at the $k$-th layer. The similarity between two nodes is defined from this distance, and each layer has its own similarity measure module. A reinforcement learning (RL) module is designed to dynamically learn a threshold, $p_r \in [0, 1]$, which is used by the similarity-aware neighbor selector to perform top-p sampling of neighbors for feature aggregation (see [2] for more details on the threshold and top-p sampling). The ability of the similarity measure module to discriminate whether the categories of neighbors are the same as that of the central node changes dynamically during training. If the algorithm chose the same nodes in every epoch, it would be difficult to adapt effectively to the state of the network. If the average distance between a central node and its neighbors is decreasing, the similarity between the central node and its neighbors is increasing, and there may be more homogeneous nodes. In this case, more neighbors can be selected to extract richer information and help the model become more confident of the identity of the central node (i.e., we increase the corresponding $p_r^{(k)}$). Conversely, if the average distance is increasing, too many dissimilar nodes have been selected for aggregation, and the aggregate operation will cover up the true characteristics of the central node. Therefore, we need to decrease the corresponding $p_r^{(k)}$. This process is done automatically by the RL module. As CARE-GNN applies multi-relation aggregation, the node embedding at each layer is computed in two steps. First, the intra-relation aggregate is performed under each relation, where $h_{v,r}^{(k)}$ denotes the embedding of node $v$ after the intra-relation aggregate under relation $r$ and a nonlinear transformation at the $k$-th layer. We take the mean aggregator as the intra-relation aggregator. When the first step is completed, we obtain the embeddings under all the relations, $\{h_{v,r}^{(k)}\}$. Second, we use the threshold $p_r^{(k)}$ as the weight of the embedding under relation $r$ to perform the inter-relation aggregate, where $h_v^{(k-1)}$ denotes the embedding of node $v$ at the previous layer.
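The neighbor selection and intra-relation aggregate under a single relation can be sketched as follows. This is a hedged illustration: it assumes each node has a scalar score in [0, 1] (a hypothetical output of the per-layer MLP), takes similarity as one minus the absolute score difference, and keeps the ceil(p * n) most similar neighbors. The function names are invented for this example, and the actual selector in [2] may differ in detail.

```python
import math


def top_p_neighbors(score_v, neighbor_scores, p):
    """Keep the ceil(p * n) neighbors whose scores are closest to score_v."""
    # Similarity taken as 1 - l1-distance between scalar scores (assumption).
    sims = {u: 1.0 - abs(score_v - s) for u, s in neighbor_scores.items()}
    keep = math.ceil(p * len(sims))
    ranked = sorted(sims, key=lambda u: sims[u], reverse=True)
    return ranked[:keep]


def mean_aggregate(feats):
    # Intra-relation mean aggregator over the selected neighbors' features.
    return [sum(col) / len(feats) for col in zip(*feats)]


scores = {"a": 0.8, "b": 0.1, "c": 0.95, "d": 0.5}          # hypothetical scores
feats = {"a": [1.0, 0.0], "b": [9.0, 9.0], "c": [3.0, 2.0], "d": [5.0, 5.0]}
selected = top_p_neighbors(0.9, scores, p=0.5)               # ["c", "a"]
h_vr = mean_aggregate([feats[u] for u in selected])          # [2.0, 1.0]
```

Raising p admits more neighbors into the aggregate and lowering it filters more aggressively, which is the quantity the RL module adjusts during training.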

Methodology
In the field of graph learning, designing a deep structure is always difficult. As shown in Figure 6, CARE-GNN performs best with a single-layer architecture and suffers from performance degradation as the number of layers increases. Based on CARE-GNN, we employ the layered graph neural network architecture [7] and the residual structure [5] to make the following improvements. First, we utilize the layered architecture in Figure 7 to expand the original model to a deep structure. The input to the model is a batch of central nodes and the original dataset. Each layer is an independent GNN, whose input includes the original dataset and the node embeddings calculated by the previous layer. The model is trained layer by layer, and each layer corrects mistakes made by previous layers. Intuitively, each GNN layer can focus on solving a simpler sub-problem, which is caused by the invalid or incorrect features extracted by previous layers. Therefore, the layer-by-layer progressive learning process makes the model less reliant on any particular layer than the single-layer model, which allows every layer to make mistakes to some extent, because the mistakes will be continuously corrected by subsequent layers. Moreover, because each layer inherits the results calculated by the previous layer and its input includes the original graph, each layer can make better choices when sampling neighbors. In other words, each layer is able to select a different and better set of neighbors based on the results calculated by previous layers. As a result, more nodes can be taken into account, and the neighborhood can be expanded. The nodes' information becomes richer and more accurate with the layer-by-layer training. As the depth increases, the model is able to "think" more times, and each time it can "think" more deeply and comprehensively based on the achievements of the previous layer, thereby making better inferences.
We also take advantage of the residual structure [5] to help the training process. That work points out that when multiple nonlinear layers are added to a shallow model, if the added layers can learn an identity mapping, meaning that the output equals the input, then the performance of the model will at least not deteriorate. In practice, however, it is difficult for complicated neural networks to fit a potential identity mapping. To reduce the learning difficulty, we let the model optimize the residual mapping instead of the original mapping. The added networks learn a function $F(G, \{W_i\})$ to fit a target mapping $H(G)$, where $G$ and $\{W_i\}$ are the input graph and a set of learnable parameters, respectively. As it is difficult for the networks to fit the identity mapping $H(G) = G$ directly, we let them fit the residual function $H(G) - G$ instead. Now, the function to be learned is $F(G, \{W_i\}) = H(G) - G$, and the target mapping is $F(G, \{W_i\}) + G$. Therefore, the added nonlinear layers with the residual structure still fit the desired underlying mapping. According to the form of the target mapping, we add a shortcut connection leading from the input directly to the output. The work in [16] shows that adding a shortcut connection for every GNN layer gives the best result (see Figure 8). Combining the layered structure and the residual structure, we show the details of a single layer of RLC-GNN in Figure 9. For each layer, the input of the similarity measure module consists of two parts: the original dataset and, except at the first layer, the updated node embeddings from the previous layer. More precisely, at the first layer, the similarity measure module directly uses a batch of the dataset and its neighbors' node embeddings from the dataset to do the measurement. After the first layer, the input is the node embeddings calculated by the previous layer (i.e., the output of the previous layer).
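The residual mapping F(G, {W_i}) + G can be sketched for a single embedding vector. This is a minimal illustration: `residual_layer` is a hypothetical helper, `transform` stands in for the learned function F, and the optional `projection` is a placeholder for a trainable linear map used when the input and output dimensions differ, as in [5].

```python
def residual_layer(x, transform, projection=None):
    """Return transform(x) + shortcut(x) for one embedding vector."""
    fx = transform(x)
    # Identity shortcut by default; project only when dimensions differ.
    shortcut = x if projection is None else projection(x)
    if len(fx) != len(shortcut):
        raise ValueError("shortcut and transform outputs must have equal length")
    return [a + b for a, b in zip(fx, shortcut)]


# If the learned residual is zero, the layer reduces to the identity mapping.
same = residual_layer([1.0, -2.0], lambda x: [0.0] * len(x))   # [1.0, -2.0]

# Dimension change 3 -> 2: slicing is only a stand-in for a trained projection.
mixed = residual_layer(
    [1.0, 2.0, 3.0],
    transform=lambda x: [x[0] + x[1], x[2]],
    projection=lambda x: x[:2],
)  # [4.0, 5.0]
```

The first call shows why the residual form is easier to fit: driving the transform's output to zero is enough to recover the identity mapping.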
For a specific node $v$, based on the discussion above, we now perform the similarity measure against a neighbor node $u$, where $o_v^{(k-1)}$ is the updated embedding of node $v$ at the previous layer $(k-1)$ and $h_u^{(G)}$ is a neighbor node embedding taken directly from the dataset. Moreover, due to the MLP, the dimension of the output embeddings may change. The new node embedding at the $k$-th layer (for $k > 1$) is computed from $h_{v,r}^{(k)}$, the node embedding after the intra-relation aggregate under relation $r$, together with the shortcut connection. In particular, we do not adopt a shortcut connection at the first layer, so the node embedding at the first layer is calculated without it; here $o^{(k)}$ denotes the new node embeddings of the batch. For the shortcut addition, the dimension of the input must equal the dimension of the output. If this is not the case, we perform a linear projection with a trainable weight matrix $W_s$ [5]. The overall structure of RLC-GNN is shown in Figure 10. At each iteration, the input of the model is a batch of nodes with features and the entire graph. The input of each layer includes the output from the previous layer and the entire graph. The output of each layer is the sum of the new features extracted by the current layer from the neighbors in the original graph and the input from the shortcut connection. In the training process, the final output loss $L_\Sigma$ is obtained by adding up two parts. $L_{Simi}^{(k)}$ is the loss of the scores used by the similarity measure module at the $k$-th layer to do the label-aware similarity measure, and $L_{GNN}$ is the loss of the classifier that predicts the labels of nodes. Both losses are calculated using cross-entropy, where $y_v$ is the actual label of node $v$.
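The structure of the final loss can be sketched as a classifier loss plus one similarity-module loss per layer, each computed with cross-entropy. The sketch assumes binary labels, a mean over nodes, and an unweighted sum of the terms. The text does not specify any weighting coefficients, so these choices and the function names are assumptions.

```python
import math


def cross_entropy(probs, labels):
    """Mean binary cross-entropy; probs are predicted fraud probabilities."""
    eps = 1e-12
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


def total_loss(gnn_probs, simi_probs_per_layer, labels):
    # Classifier loss plus the similarity-module loss of every layer
    # (unweighted sum: an assumption of this sketch).
    loss = cross_entropy(gnn_probs, labels)
    for probs in simi_probs_per_layer:
        loss += cross_entropy(probs, labels)
    return loss


labels = [1, 0]
loss = total_loss([0.5, 0.5], [[0.5, 0.5], [0.5, 0.5]], labels)  # 3 * ln 2
```

With two layers the total contains three terms, so a deeper model sums over more similarity-module losses than a shallow one.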
We show the pseudocode of the proposed RLC-GNN in Algorithm 1. Given a heterogeneous graph, at each iteration, we initialize the thresholds of the neighbor selectors with manually specified values and randomly select a batch of nodes with features as input for the first layer. For all subsequent layers, the input of each layer is the updated embeddings generated by the previous layer, which is the result of the layered structure. We first do the similarity measure, top-p sampling, and intra-relation aggregate under each relation, as shown in Equations (5) and (6). This step determines the output dimension of the current layer. Then, we apply the inter-relation aggregate (Equation (7)) and add up the result and the input (Equation (11) or Equation (12), depending on whether the input and output dimensions match). This gives the updated node embeddings. We then calculate the losses of the similarity measure modules and of RLC-GNN (Equations (13) and (14)) and do backpropagation to update the trainable parameters. Finally, we use the reinforcement learning modules to update the selector thresholds. When the node embeddings are input to a layer, the current layer may not yet have been properly trained, so that benign samples cannot be effectively filtered out during the similarity measure, which results in the features of fraudsters being covered up by benign samples' features. After the introduction of the residual structure, if the current layer has an adverse effect on the classification, the input embeddings passing through the shortcut connection reduce the loss of information during propagation. Therefore, RLC-GNN has the ability to skip layers that have not been trained well and to roll back the embeddings (i.e., we avoid "bad" layers blocking the normal training of subsequent layers). The model has opportunities to further learn the characteristics of fraudsters based on the knowledge acquired by previous layers. Thus, as the learning process progresses layer by layer, more information is taken into account and better selections are made.
In Figure 11, we show our implementation of RLC-GNN with various numbers of layers. If the dimensions of input and output do not match, we adopt a linear projection $W_s$ to match the dimensions. Moreover, we show details of the architecture of RLC-GNN with various numbers of layers in Table 1.
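The layer-by-layer flow described above can be sketched as a forward pass for a single embedding vector. This is a simplified illustration, not Algorithm 1 itself: each layer is modeled as a callable that receives the previous output and the untouched original features, the first layer has no shortcut, and `rlc_forward` and the toy layers are hypothetical names.

```python
def rlc_forward(batch, original, layers):
    """Pass one embedding through stacked layers with shortcut connections."""
    out = layers[0](batch, original)      # first layer: no shortcut
    per_layer = [out]
    for layer in layers[1:]:
        new = layer(out, original)        # sees previous output and original data
        out = [a + b for a, b in zip(new, out)]  # shortcut connection
        per_layer.append(out)
    return per_layer


original = [1.0, 2.0]
first = lambda prev, orig: [0.5 * p for p in prev]
untrained = lambda prev, orig: [0.0 for _ in prev]   # a "bad" layer adding nothing
later = lambda prev, orig: [0.1 * o for o in orig]   # draws on the original data
outs = rlc_forward([1.0, 2.0], original, [first, untrained, later])
```

The second entry of `outs` equals the first, so a layer that contributes nothing simply passes its input along and the following layer still receives usable embeddings.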

Experiments
Datasets
To ensure the comparability of the experimental results, we conduct experiments on the Yelp dataset and the Amazon dataset provided in [2]. The Yelp dataset is the publicly released internal dataset of Yelp, the largest review site in the USA, and covers businesses, reviews, user information, and so on. In our experiment, we use the reviews to build the graph, which includes 45,954 nodes (14.5% are fraud reviews) and 3,846,979 edges. The Amazon dataset is an open-source dataset built from the Amazon platform; it includes more than 140 million reviews and product metadata under 24 product categories. We use the reviews under musical instruments and take the users as the nodes of the graph, which includes 11,944 nodes (9.5% are fraudsters) and 4,398,392 edges.
There are three types of relationships between the nodes of each dataset. Yelp dataset: (1) R-U-R: two reviews are posted by the same user; (2) R-S-R: two reviews are under the same product with the same star rating; (3) R-T-R: two reviews under the same product are posted in the same month. Amazon dataset: (1) U-P-U: connects two users who have reviewed at least one common product; (2) U-S-U: connects two users who have given at least one identical star rating within a week; (3) U-V-U: connects users with the top 5% mutual review text similarities among all users [2].
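A relation such as R-U-R (two reviews posted by the same user) can be materialized as an edge set by grouping items on a shared attribute. The following sketch illustrates the idea. The helper `same_key_edges`, the field name `"user"`, and the toy reviews are hypothetical and not taken from the datasets in [2].

```python
from collections import defaultdict
from itertools import combinations


def same_key_edges(items, key):
    """Connect every pair of items that share the same value for `key`."""
    groups = defaultdict(list)
    for item_id, attrs in items.items():
        groups[attrs[key]].append(item_id)
    edges = set()
    for members in groups.values():
        # All unordered pairs within one group become edges.
        edges.update(combinations(sorted(members), 2))
    return edges


reviews = {
    "r1": {"user": "u1"},
    "r2": {"user": "u1"},
    "r3": {"user": "u2"},
}
rur_edges = same_key_edges(reviews, "user")  # {("r1", "r2")}
```

Building one such edge set per relation gives the multi-relation graph on which aggregation is performed separately under each relation.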

Implementation
In the experiment, we use PyTorch 1.7.0 to implement RLC-GNN and use cross-entropy as the loss function. We choose the Adam optimization algorithm and set the learning rate to 0.01. Each dataset is divided into two parts: 40% as the training set and 60% as the test set. We utilize mini-batch training to improve the training efficiency [22]. We verify the performance of RLC-GNN with 4, 6, 11, 19, and 27 layers on both datasets. The final dimension of the node embeddings is 16. All experiments are run on Python 3.7.6, Windows 10, an AMD Ryzen 7 4800H CPU, 16 GB RAM, and an Nvidia RTX 2060 GPU.

Evaluation Metrics
For the fraud detection task, we are concerned with the model's capability to correctly identify the fraud samples. Therefore, we use recall as one of our metrics. However, if the model is not learning effectively and simply predicts all samples as frauds, we also get a high recall. To avoid such misleading results, we also consider the ratio of correct predictions among all samples predicted as frauds (i.e., precision), and thus use F1-score as one of the metrics. Furthermore, due to the extreme imbalance of the sample distribution (the ratio of fraud samples to benign samples is about 1 to 9), we use AUC, which is insensitive to the distribution of samples, as the third metric to evaluate our model more fairly [23].
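The behavior of the three metrics on imbalanced data can be checked with a small sketch. It computes recall and F1-score for the fraud class and a pairwise-ranking form of AUC. Treating fraud as the positive class and using binary rather than averaged F1 are assumptions of this example, and the function names are illustrative.

```python
def recall_f1(y_true, y_pred):
    """Recall and F1-score for the positive (fraud = 1) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return recall, f1


def auc(y_true, scores):
    """Probability that a fraud sample outscores a benign one (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# One fraud sample among ten (roughly the 1-to-9 ratio mentioned above).
y_true = [1] + [0] * 9
recall, f1 = recall_f1(y_true, [1] * 10)   # predict everything as fraud
chance = auc(y_true, [0.5] * 10)           # uninformative scores
```

Predicting every sample as fraud yields a recall of 1.0 but an F1-score of only 2/11, and constant scores give an AUC of 0.5, which is why recall alone would be misleading here.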

Results
First, we show the normalized training loss for RLC-GNN with 4, 6, 11, 19, and 27 layers in Figure 12.As the number of layers increases, the overall normalized training loss goes down.As we have discussed above, the final loss, Loss Σ , consists of the loss of similarity measure modules from every layer.In other words, Loss Σ contains more items as the number of layer increases.However, we note that more layers leads to greater loss reduction ratio on the contrary.Tendency of training loss of each layer in Figure 13 shows that latter layer can make better inferences based on inherited knowledge of previous layers, which shows the effectiveness of proposed method on dealing with the singlelayer learning problem.Moreover, we notice that the second layer makes the greatest improvement.Although the latter layers do not obtain as much improvement as the second layer, they indeed perform better layer by layer.Here we give the empirical analysis.As the number of layers increases, the problems to be solved for each layer become simpler.For instance, the single-layer model has to solve the entire problem alone.There is no other layers to share its learning pressure and to correct its mistakes.When we add more layers, each layer only needs to deal with partial problems (i.e., problems become easier for every layer).The remaining problems and the mistakes made by previous layers will be solved and corrected by subsequent layers.More precisely, because of the input of each layer consisting of original dataset, aggregated information and progress already made, each layer has sufficient information to judge the correctness of inferences (i.e., giving higher confidence to correct inferences, and correcting the wrongs).Furthermore, if a layer has not yet been trained well, input features can skip the layer by passing through shortcut connection.The training will not be interrupted, and the problem will be directly handed over to the next layer for processing.With such cooperation 
mechanism, the whole problem can be solved more smoothly.As it can be seen, an obvious characteristic on both dataset is that the overall training loss is lower with the increasing of layers.
We show the performance of RLC-GNN and various GNNs on the fraud detection tasks on Yelp and Amazon datasets in Table 2. GCN, GAT, GraphSAGE, and GeniePath are designed to run on homogeneous graphs.Multiple relations are merged into a single relation (i.e., heterogeneous to homogeneous) in the experiments of these GNNs [2].Compared with the single-relation GNNs, we can see that multi-relation GNNs have great advantages on tasks based on heterogeneous graphs.Furthermore, it might point out a direction for future development of GNNs.Moreover, based on this superiority, proposed RLC-GNN introduces the mechanism of progressive-learning and self-correcting, which makes the use of neighboring information as more effective as possible and once again achieves significant improvements.
Table 3 shows the experiment results of RLC-GNN with various depth.According to the experimental results, RLC-GNN outperforms CARE-GNN significantly, especially on the more complex Yelp dataset.On the Yelp dataset, the model with 27 layers achieves the best performance in our experiments, and recall, AUC, and F1-score increase by 5.66%, 7.72%, and 8.90%, respectively.On the Amazon dataset, when RLC-GNN has 11 layers, overall results outperform other settings, which Recall, AUC and F1 increase by 3.22%, 4.05%, and 3.25%, respectively.Training loss of each layer of RLC-GNN-6 on Amazon dataset.We note that training loss of later layer is lower, which means that the later layers can make inferences with higher confidence.The original feature dimension size of Yelp dataset and Amazon dataset are 32 and 25, respectively.Generally, the more complex the features are, the more complex model we will need to fit data.Under current hyperparameter settings, the experiment results show that recall and AUC nearly grow all the time on both datasets, which means the model can identify more fraudsters with the number of layer increasing.We also notice different degrees of decline in F1-score.Usually, large networks have better ability of generalization than small networks' [24].However, when a model is too complicated relative to the dataset, it will suffer from the overfitting, which reduces the generalization of the model [25].In this case, the model needs more data to be trained sufficiently.Moreover, we argue that our deep models are facing this problem.As a consequence, there is slightly decrement in the performance of F1-score when the depth increases to certain extent.
We note that our focus is the problem faced by classic deep learning models rather than the GNN-specific over-smoothing problem that leads to the shallow limit. In other words, we successfully deal with the shallow-structure limit of GNNs in the fraud detection application. Furthermore, the same hyperparameters (e.g., learning rate and weight decay for the optimizer) are used to train all the models, and no tuning is performed for models of different depths or for different datasets. Carefully adjusting the hyperparameters may yield better results.

Conclusions
This work proposes RLC-GNN, an improved spatial-based GNN algorithm that can be trained with a deep architecture. We use a layered structure to deal with the single-layer learning problem and introduce the residual-network concept to complement the layered structure and assist training; the two cooperate and enable the model to be much deeper. We can therefore enjoy the benefits of depth without being trapped by the intrinsic shallow limit of graph neural networks. Experiments on fraud detection with the Yelp and Amazon datasets show that the proposed RLC-GNN algorithm obtains significant improvements under three metrics (recall, AUC, and F1-score), which can be improved further, to some extent, as the number of layers increases.
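The cooperation between the layered structure and the residual shortcut can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the RLC-GNN implementation: the layer body `relu(W h)` is a hypothetical placeholder for the actual CARE-GNN layer, the function names are invented for this example, and `W_s` plays the dimension-matching role described in the Figure 9 caption.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def relu(x):
    """Element-wise ReLU."""
    return [max(0.0, v) for v in x]


def residual_layer(h_in, W, W_s=None):
    """Return relu(W h_in) + shortcut(h_in).

    relu(W h_in) is a placeholder layer body (assumption).
    W_s projects the input when input and output dimensions differ;
    with W_s=None the shortcut is the identity.
    """
    body = relu(matvec(W, h_in))
    shortcut = list(h_in) if W_s is None else matvec(W_s, h_in)
    if len(body) != len(shortcut):
        raise ValueError("shortcut and body dimensions differ; supply W_s")
    return [b + s for b, s in zip(body, shortcut)]


def forward(h, layers):
    """Apply (W, W_s) layers in order and keep every intermediate output."""
    outputs = []
    for W, W_s in layers:
        h = residual_layer(h, W, W_s)
        outputs.append(h)
    return outputs


h0 = [1.0, -2.0, 0.5]
zero3 = [[0.0] * 3 for _ in range(3)]
half3 = [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0], [0.0, 0.0, 0.5]]
# Layers 1 and 2 keep dimension 3; layer 3 reduces 3 -> 2, so it needs W_s.
layers = [
    (zero3, None),
    (half3, None),
    ([[1.0, 1.0, 1.0], [0.0, 0.0, 1.0]], [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]),
]
outs = forward(h0, layers)
for depth, h in enumerate(outs, start=1):
    print(f"layer {depth}: {h}")
```

The first toy layer has an all-zero body, so its output equals its input: a residual layer that learns nothing passes the previous embedding through unchanged, which is the usual intuition for why residual shortcuts make deeper stacks easier to train.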
We verified the effectiveness of the proposed algorithm in the fraud detection application. In future research, we will extend the experiments to more application domains of graph neural networks and further explore widely applied techniques for dealing with the overfitting problem faced by deep RLC-GNN. Moreover, we have indicated some intuitive reasons for the combined effect of the layered structure and the residual structure; the theoretical analysis will be explored in future research.

Figure 1. Zachary's social network of friendships between 34 members of a karate club at a US university in the 1970s. An edge connects two individuals if they socialized outside of the club.

Figure 2. Single CNN layer with a 3 × 3 filter. The filter slides over the feature maps of the image with a fixed step size. At each step, a 3 × 3 block of the feature map is taken, and the filter uses the block to calculate new features in a specific way (e.g., mean).
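The sliding-window computation described in the Figure 2 caption can be sketched as follows. This is an illustrative sketch only: the function name `slide_filter`, the use of nested Python lists as the feature map, and the absence of padding are assumptions made for this example.

```python
def slide_filter(feature_map, size=3, stride=1, reduce=None):
    """Slide a size x size window over a 2-D list and reduce each block.

    By default each block is reduced to its mean, as in the caption's example.
    No padding is applied, so the output is smaller than the input.
    """
    if reduce is None:
        reduce = lambda block: sum(block) / len(block)
    rows, cols = len(feature_map), len(feature_map[0])
    output = []
    for r in range(0, rows - size + 1, stride):
        out_row = []
        for c in range(0, cols - size + 1, stride):
            # Collect the size x size block whose top-left corner is (r, c).
            block = [feature_map[r + dr][c + dc]
                     for dr in range(size) for dc in range(size)]
            out_row.append(reduce(block))
        output.append(out_row)
    return output


# Toy 4 x 4 feature map holding the values 0..15 row by row.
feature_map = [[4 * r + c for c in range(4)] for r in range(4)]
mean_map = slide_filter(feature_map)           # 3 x 3 mean filter, step 1
max_map = slide_filter(feature_map, reduce=max)  # same window, max instead
print(mean_map)
print(max_map)
```

The fixed window shape is what makes this operation straightforward on images; a node in a graph has no fixed number or ordering of neighbors, which is why GNNs rely on an aggregation function instead.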

Figure 4. The encoder-decoder framework. The encoder is a function that maps nodes to low-dimensional vectors (i.e., node embeddings). The decoder reconstructs certain graph statistics from the node embeddings generated by the encoder, e.g., predicting u's category in a node classification task.
matrices in the aggregate function and update function at the k-th layer, respectively. Moreover, $W^{(k)}$

Figure 5. A typical, basic GNN architecture and its processing procedure. First, the GNN selects neighbors with a certain strategy. Then, an aggregate function is applied to extract information around the central node. Finally, the aggregated information passes through a neural network that applies a nonlinear transformation. The output is the updated representation of the central node.
denote the representation of node v at the current layer k and the previous layer (k − 1), respectively. Moreover, $h_u^{(k-1)}$ indicates the node embeddings of the neighbors from the previous layer. $A_k$ and $B_k$ are the trainable weight matrices. σ is a nonlinear activation function. AGG is the aggregator. $h^{(k)}$
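The roles of these symbols can be made concrete with a short sketch of one aggregate-and-update step. The exact update equation is not reproduced in this section, so the code assumes the common form $h_v^{(k)} = \sigma(A_k \cdot \mathrm{AGG}(\{h_u^{(k-1)}\}) + B_k h_v^{(k-1)})$, with a mean aggregator for AGG and ReLU for σ; the function names and the toy graph are hypothetical.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def mean_agg(vectors, dim):
    """AGG: element-wise mean of neighbor embeddings (zeros if no neighbors)."""
    if not vectors:
        return [0.0] * dim
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(dim)]


def gnn_layer(h_prev, neighbors, A, B):
    """One layer under the assumed update rule.

    h_prev:    dict node -> embedding at layer k-1
    neighbors: dict node -> list of neighbor nodes
    A, B:      weight matrices for the aggregated and the self embedding
    """
    h_next = {}
    for v, h_v in h_prev.items():
        agg = mean_agg([h_prev[u] for u in neighbors.get(v, [])], len(h_v))
        pre = [a + b for a, b in zip(matvec(A, agg), matvec(B, h_v))]
        h_next[v] = [max(0.0, x) for x in pre]  # sigma = ReLU (assumption)
    return h_next


# Toy path graph 0 - 1 - 2 with 2-dimensional embeddings.
h0 = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
neighbors = {0: [1], 1: [0, 2], 2: [1]}
identity = [[1.0, 0.0], [0.0, 1.0]]
h1 = gnn_layer(h0, neighbors, identity, identity)
print(h1)
```

With identity weights, each node's new embedding is simply its own embedding plus the mean of its neighbors' embeddings, which shows how information from one hop away enters the representation at each layer.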

Figure 6. Recall of CARE-GNN with different numbers of layers.

Figure 9. Details of a layer of RLC-GNN. $W_s$ is a trainable weight matrix that matches the embedding dimensions between the input nodes and output nodes of the current layer. $S_i$ denotes the similarity measure module for the i-th relation. Now we give the expression of the output at the k-th layer:

Figure 11. Our implementation of the RLC-GNN architecture in the experiments. $d_i$ and $d_o$ denote the dimensions of the input and output, respectively. See Table 1 for details of the architectures.

Figure 12. The normalized training loss for RLC-GNN with varying depth on the Amazon dataset (a) and the Yelp dataset (b). On both datasets, the overall training loss decreases as the number of layers increases.

Figure 13. Training loss of each layer of RLC-GNN-6 on the Amazon dataset. The training loss of later layers is lower, which means that the later layers can make inferences with higher confidence.

Table 1. Implementations of RLC-GNN with various numbers of layers. $N_d$ denotes the number of layers whose input node embeddings have dimension d.

Table 2. Performance of RLC-GNN-27 and various GNNs on the Yelp and Amazon datasets.

Table 3. Results of RLC-GNN with various numbers of layers and the baseline. We report results after running 100 epochs.