1. Introduction
Digital finance is booming, driven jointly by financial data and digital technology, and it provides new ways to serve the real economy and resolve financial risks. Alongside this growth, the “black industry” has taken root in the shadows, penetrating personal credit, e-commerce, and various types of insurance. A report released by the Security Attack and Defense Laboratory of the Industrial and Commercial Bank of China (ICBC) showed that fraud against digital finance caused losses of USD 174 billion in 2021 and has developed into a well-organized black industrial chain with a clear division of labor. In financial marketing scenarios, for example, platforms and merchants offer red packets, cashback, spend-threshold discounts (“full reduction”), and other promotions to attract users and enhance their competitiveness, and this has given rise to marketing fraud gangs that target these promotions for economic gain [1]. In recent years, this kind of fraud has shown a trend of “professionalization, specialization, teamwork, and transnationalization”, which has caused serious losses to both customers and financial institutions. Therefore, exploring financial anti-fraud methods in new scenarios is of great theoretical value and practical significance.
Mainstream fraud detection methods can be categorized into three types: rule-based, machine learning-based, and graph representation learning-based. Rule-based strategies rely on expert experience to design strong rules for risk scoring, but they are poorly suited to current fraud detection scenarios because fraudsters have become more covert, for example by tampering with IP addresses [2]. In contrast, data-driven machine learning methods perform better on complex tasks, but they face challenges in learning user feature representations. The most immediate problem is feature ambiguity: fraudulent users can hide their identities by deliberately mimicking normal users and disguising their activity as normal transactions, which substantially reduces recognition accuracy. The deeper problem is that, as fraud becomes increasingly gang-based, the traditional feature mining perspective of machine learning ignores the interaction network of the fraud gang and the association information between entities [3]. To address this, and thanks to the natural ability of graph structures to represent such information, graph representation learning has gradually been applied to fraud detection research [4]. Traditional graph representation learning focuses on quantifying the degree of association between nodes, and fraud detection is commonly achieved through subgraph partitioning (for example, Fraudar [5]) and community discovery algorithms [6,7]. However, traditional graph representation learning models only node relationships, whereas node features can make up for this lack of information. Graph neural network algorithms integrate both types of information, node relationships and node features, and therefore achieve better performance in financial fraud detection [8]. Yu et al. proposed the message-passing-based graph convolution network (MP-GCN) to detect phishing fraud; it uses a graph convolutional neural network to embed the transaction network graph, compressing the node network into a low-dimensional representation while learning to reconstruct the topological information of the network and the features of the nodes [9]. Jing et al. proposed learning a parametric adjacency matrix that passes messages according to feature similarity, which improves the GCN layer's extraction of node features and effectively improves credit card fraud detection with a small number of samples [10]. Beyond the use of node information, Zhu et al. considered the influence of external factors and designed an Attribute Enhanced Spatio-Temporal Graph Convolutional Network (AST-GCN) that encodes these factors and integrates them into the spatio-temporal graph convolutional model to improve detection accuracy [11].
All of the above studies address homogeneous graphs, but in real environments the graphs formed by relationships between entities are often heterogeneous. In this regard, Kanezashi et al. evaluated representative homogeneous and heterogeneous GNN models and emphasized that the diversity of semantic relationships among nodes in heterogeneous graphs can effectively improve fraud detection performance [12]. Currently, the main approach in heterogeneous information network fraud detection is to obtain latent node representations by combining node information with network topology. However, few studies examine the impact of different association relationships on heterogeneous graph fraud detection, and few focus on different association behavior patterns to help explain fraudulent behaviors.
Combining the above considerations, this paper proposes Tri-RGCN-XGBoost, a multi-relational heterogeneous graph neural network fraud detection algorithm, to capture the intrinsic patterns of fraudulent users and their interaction behaviors. Specifically, the approach transforms user behavior data into a heterogeneous network, abstracts the user–device, user–merchant, and user–address relationship graphs, and uses a graph neural network model to aggregate the interaction patterns of fraudulent users along the three kinds of meta-paths. An XGBoost tree model then fuses the decisions made under the three relationship graphs to detect fraudulent users. Experimental comparison with the baseline model on four real enterprise datasets shows a significant improvement in evaluation metrics such as recall, demonstrating the accuracy and effectiveness of the model. In addition, the key association behaviors are further analyzed through feature importance ranking, which highlights, in terms of relationship graph importance, how targeted mining of association behaviors enhances fraud detection.
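To make the three-relation idea concrete, the following is a minimal sketch and not the Tri-RGCN-XGBoost implementation. It assumes hypothetical behavior records of the form (user, device, merchant, address) and hypothetical two-dimensional user features. Users are linked whenever they share an entity under a relation, one untrained mean-aggregation step stands in for the graph convolution, and simple concatenation stands in for the XGBoost fusion stage. All names in the block are illustrative assumptions.

```python
from collections import defaultdict

# Hypothetical behavior records: (user, device, merchant, address).
RECORDS = [
    ("u1", "d1", "m1", "a1"),
    ("u2", "d1", "m2", "a2"),
    ("u3", "d2", "m2", "a2"),
    ("u4", "d3", "m3", "a3"),
]

# Hypothetical per-user feature vectors.
FEATURES = {
    "u1": [1.0, 0.0],
    "u2": [0.0, 1.0],
    "u3": [1.0, 1.0],
    "u4": [0.0, 0.0],
}

# Relation name -> column index of the entity in each record.
RELATIONS = {"device": 1, "merchant": 2, "address": 3}


def build_relation_graph(records, column):
    """Map each user to the users sharing an entity in `column` (self included)."""
    users_by_entity = defaultdict(set)
    for record in records:
        users_by_entity[record[column]].add(record[0])
    neighbors = defaultdict(set)
    for users in users_by_entity.values():
        for user in users:
            neighbors[user] |= users
    return neighbors


def aggregate(neighbors, features):
    """One untrained mean-aggregation step over each user's neighborhood."""
    aggregated = {}
    for user, feature in features.items():
        group = neighbors.get(user, {user})
        aggregated[user] = [
            sum(features[v][i] for v in group) / len(group)
            for i in range(len(feature))
        ]
    return aggregated


def fuse(records, features):
    """Concatenate per-relation embeddings (a stand-in for the fusion stage)."""
    per_relation = [
        aggregate(build_relation_graph(records, column), features)
        for column in RELATIONS.values()
    ]
    return {
        user: [x for embedding in per_relation for x in embedding[user]]
        for user in features
    }


fused = fuse(RECORDS, FEATURES)
```

In this toy data, u1 and u2 share a device while u2 and u3 share a merchant and an address, so each relation contributes a different neighborhood to the fused vector. In a full pipeline, the per-relation outputs would come from trained graph convolution layers and would be passed to a tree model rather than concatenated directly.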
5. Conclusions
In this paper, Tri-RGCN-XGBoost, a multi-relational graph fraud detection model that integrates multiple behavioral patterns, was designed. The main idea is to mine various types of user associations in the fraud scenario, using a graph convolutional neural network as the initial framework and aggregating the information of each user's neighboring nodes under multiple behavioral associations. The final fraud prediction is then obtained by using XGBoost to fuse the classification results under the three relational graphs. In the experimental validation and comparative analysis on real datasets, the model outperformed the baseline model on every evaluation metric, demonstrating its feasibility and effectiveness. Combining the single bipartite graph convolution iteration with the feature importance ranking of the XGBoost layer verified the improvement in fraud detection on the experimental data and further explains the integrated model.
The proposed method performs well in fraud detection, but it is limited in that the data and the network experiments cover only three behavioral patterns. In addition, the process does not consider the edge attributes of the heterogeneous graph, which simplifies the node information aggregation function; future work should therefore focus on mining and applying edge attributes appropriately to improve heterogeneous-graph-based fraud detection.