Abnormal Alliance Detection Method Based on a Dynamic Community Identification and Tracking Method for Time-Varying Bipartite Networks

Beibei Zhang; Fan Gao; Shaoxuan Li; Xiaoyan Xu; Yichuan Wang

doi:10.3390/ai6120328

¹ Shaanxi Key Laboratory for Network Computing and Security Technology, School of Computer Science and Engineering, Xi’an University of Technology, Xi’an 710048, China

² Department of Mathematics, Xi’an Shiyou University, Xi’an 710065, China

* Author to whom correspondence should be addressed.

AI 2025, 6(12), 328; https://doi.org/10.3390/ai6120328

Abstract

Identifying abnormal group behavior formed by multi-type participants from large-scale historical industry and tax data is important for regulators to prevent potential criminal activity. We propose an Abnormal Alliance detection framework comprising two methods. For detecting joint behavior among multi-type participants, we present DyCIAComDet, a dynamic community identification and tracking method for large-scale, time-varying bipartite multi-type participant networks, and introduce three community-splitting measurement indicators—cohesion, integration, and leadership—to improve community division. To verify whether joint behavior is abnormal, termed an Abnormal Alliance, we propose BMPS, a frequent-sequence identification algorithm that mines key features along community evolution paths based on bitmap matrices, sequence matrices, prefix-projection matrices, and repeated-projection matrices. The framework is designed to address sampling limitations, temporal issues, and subjectivity that hinder traditional analyses and to remain scalable to large datasets. Experiments on the Southern Women benchmark and a real tax dataset show DyCIAComDet yields a mean modularity Q improvement of 24.6% over traditional community detection algorithms. Compared with PrefixSpan, BMPS improves mean time and space efficiency by up to 34.8% and 35.3%, respectively. Together, DyCIAComDet and BMPS constitute an effective, scalable detection pipeline for identifying abnormal alliances in tax datasets and supporting regulatory analysis.

Keywords: community evolution; bipartite networks; sequence pattern mining; bitmap matrix; abnormal group behavior

1. Introduction

In tax administration, abnormal group behaviors such as false invoicing, income concealment, tax evasion, and tax fraud are prevalent and significantly impair tax collection and the wider economic ecosystem [1]. Efficiently and effectively identifying such behavior from large-scale historical tax data involving taxpayers and tax officials is an urgent need for tax regulatory departments and the focus of this paper [2].

As far as group behavior analysis is concerned, a variety of group behavior systems can be transformed into complex networks, such as economic systems [1], biological systems [3], community ecosystems [4], and other complex systems [5]. Theoretically, many complex systems can be described by using complex networks and graphs [6]. Most real-world networks possess community structures, meaning that a large network can be divided into several sub-communities, which are tightly connected internally but sparsely connected externally [7]. Community detection algorithms are of great significance for understanding network topology, mining implicit patterns, and predicting network behavior, and are applied in multiple fields.

Bipartite networks are an important special type of complex network and a structural manifestation of group behavior. Their main characteristics are as follows: (1) the network consists of two different types of nodes; (2) the nodes are divided into two disjoint sets; (3) nodes within each set are of the same type, while nodes from different sets are of different types; (4) connections exist only between nodes of different types; that is, there are no connections between nodes of the same type.

In the bipartite network shown in Figure 1, ellipses represent node type 1, called U-type, and rectangles represent node type 2, called V-type; edges exist between nodes of different types, and there are no edges between nodes of the same type. In special cases, nodes of the same type may be connected by edges, but such edges are generally disregarded, and the network is still treated as a bipartite network.

Figure 1. Illustration of a bipartite network.

Many real-world networks exhibit bipartite structures, such as taxpayer–tax officer networks [2], author–paper networks [8,9], investor–shareholder networks [10,11], viewer–movie networks [12], customer–product networks [13], and disease–gene networks [14]. Tax supervision faces a huge amount of data generated by illegal activities such as issuing false value-added tax invoices. Traditional methods identify the characteristics of such activities but cannot analyze how they evolve over time, and overcoming this limitation is essential.

Thus, we propose an anomaly detection method that mines the temporal evolution characteristics of illegal activities among multi-type participants. It comprises two successive modules. The first, the tax multi-type-participant anomaly detection module, detects joint behavior among multi-type participants using the proposed DyCIAComDet (Dynamic CIA Community Detection Method) algorithm. The second, the key feature identification module for tax anomalies, uses the proposed BMPS (Bitmap-based PrefixSpan) frequent sequence-mining algorithm.

In DyCIAComDet, this paper proposes an algorithm for recognizing the evolution of community structures in time-varying bipartite networks. The method includes three core steps: first, we introduce the CIA algorithm for identifying community structures in static bipartite networks; second, we propose cohesion and integration indicators, together with a “leadership” metric, to measure the closeness between communities at adjacent timestamps; third, to verify whether the cooperative behavior has the characteristics of illegal behavior, we propose a recursive formula that incorporates the cohesion and integration indicators into the calculation of the modularity Q of community partitions over time. Furthermore, because large-scale tax data are time-stamped, the taxpayer–tax officer network is in fact a time-varying bipartite network. This paper therefore proposes a dynamic bipartite network community evolution algorithm based on the static CIA algorithm, namely DyCIAComDet. The algorithm slices the dataset by timestamp. At the initial time epoch $T_0$, the static CIA algorithm is applied to the corresponding bipartite network, denoted Network $T_0$, to determine the community structure and identify the leadership nodes within these communities. For the new nodes arriving between time epochs $T_0$ and $T_1$, the algorithm quantifies their connections to the leadership nodes of the historical communities at $T_0$ (i.e., historical features). If the historical features are satisfied, these nodes are assigned to the historical communities. For the nodes at time epoch $T_1$ that do not belong to any historical community, the static CIA algorithm is applied for community detection, and the leadership nodes of the detected communities are calculated for the next iteration. These steps are iterated over the timestamps until the final time epoch $T_f$ is reached or all nodes have community labels, at which point the algorithm stops.

Finally, based on the statistical characteristics of the communities discovered by DyCIAComDet, communities are classified into two types, ComType1 (“scalper”) and ComType2 (“Abnormal Alliance”), which correspond to two types of actual abnormal group behavior. Key feature extraction is performed on ComType2 to verify the abnormal group features: the evolutionary paths of ComType2 communities in the time-varying bipartite network are extracted, along with the key features of these temporal paths, namely the frequently appearing node sequences. For this purpose, this paper proposes BMPS, a frequent sequence-mining algorithm based on PrefixSpan, bitmap matrices, and the theorem of repeated projection matrices. By avoiding repeated scanning of the database and of the projection matrices, BMPS addresses the bottleneck of high-frequency repetitive database scans on large-scale datasets, thereby greatly improving the time and space efficiency of frequent sequence identification on large-scale sequence sets.

2. Literature Review

2.1. Community Evolutionary Algorithms for Bipartite Network

Bipartite networks are ubiquitous in nature and society, such as scientist–paper collaboration networks, person–location networks, actor–movie collaboration networks, and so on. There are typically two approaches to community detection in bipartite networks: the mapping approach and the direct approach.

The mapping method involves transforming a bipartite network into a unipartite network based on the shared nodes between the two types of nodes, and then employing established community detection algorithms designed for unipartite networks to identify community structures. The principle behind the mapping method is relatively straightforward, but it leads to a loss of network structural information, which can potentially cause significant errors in the detection of the entire community structure. Guimera et al. [15] have demonstrated that using the mapping method for community detection can yield erroneous results and even affect the community structure of the entire network. Barber [16] presented different outcomes for community detection in real bipartite networks and their corresponding projected unipartite networks, thereby confirming that the mapping method for bipartite networks is not advisable.

Guimera et al. [15] proposed a new community detection algorithm for bipartite networks based on bipartite modularity extended from unipartite modularity. Barber [16] extended Newman’s modularity from unipartite networks to bipartite networks and introduced the adaptive BRIM algorithm for community detection, which is, however, limited to small-scale bipartite networks. Murata [17] also proposed a bipartite network community detection method based on bipartite modularity and demonstrated that its performance is comparable to that of unipartite network modularity. Furthermore, Liu et al. [18] presented a community detection and analysis algorithm based on label propagation for large-scale bipartite networks.

Raghavan et al. [19] proposed the Label Propagation Algorithm (LPA), which is widely used for community detection, including in bipartite networks. Subsequently, Li et al. [20] proposed a quantitative community detection method based on bipartite partition density, which is reported to outperform Barber’s bipartite modularity and other measures. Chang et al. [21] introduced an overlapping community detection approach based on Bi-EgoNet in bipartite networks, and Wang Yang et al. [22] proposed a method for bipartite network community detection based on comparative definitions and community force.

Although community detection methods are crucial for uncovering complex network structures, the heterogeneous nature of nodes in bipartite networks presents significant challenges in accurately clustering networks with community centrality features. Furthermore, existing approaches often fail to fully capture the dynamic structural evolution that occurs during the evolution of bipartite networks.

2.2. Sequence Pattern Mining

Sequence pattern mining holds significant practical value in various application domains, such as personalized recommendations and customer behavior analysis. For instance, it can be used to deliver personalized web page recommendations based on the sequence in which users visit websites, or to optimize product promotion strategies based on patterns in customers’ daily shopping habits.

Classic sequence pattern mining algorithms are usually divided into two main categories: the first category is based on the candidate generation–test idea, utilizing the Apriori property, such as the AprioriAll [23] algorithm and the AprioriSome [23] algorithm. The logic of these two algorithms is equivalent to a breadth-first search strategy on the concept lattice constructed by the data items of the sequence database. One downside of the first category is the large scale of candidate sequences and multiple scans of the database, which leads to low efficiency in both time and space, thus reducing its feasibility in big data scenarios.

Srikant and Agrawal proposed a constraint-based sequence pattern mining algorithm based on the Apriori property, known as GSP (Generalized Sequential Pattern) [24], which can effectively discover frequent patterns in small-scale sequence datasets. However, its efficiency decreases significantly on large-scale datasets. In addition, long sequences generate a large number of candidate sequences, leading to a substantial increase in time and space costs.

To address these issues, Ayres et al. proposed the SPAM algorithm [25], which employs a vertical bitmap representation and bitwise operations (e.g., AND and bit-count) to compute supports of candidate sequences efficiently; however, SPAM’s bitmap structures can become memory-intensive for large-scale or long-sequence datasets.

The second category of algorithms is based on ideas of divide-and-conquer and pattern growth, such as FP-Growth [26], FreeSpan [27], and PrefixSpan [28]. Among these, PrefixSpan is a prefix-projection sequence-growth algorithm: for each frequent prefix, it recursively projects the database and grows patterns within the projected database, avoiding exhaustive candidate generation and repeated full-database scans.
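The prefix-projection idea can be illustrated with a minimal sketch. It assumes the simplest setting, in which every sequence element is a single item rather than an itemset; the function name `prefixspan` and the toy database are illustrative and not taken from [28].

```python
def prefixspan(db, min_sup):
    """Return {pattern: support} for sequences whose elements are single items."""
    results = {}

    def grow(prefix, projected):
        # Count, once per suffix, every item that can extend the prefix.
        counts = {}
        for suffix in projected:
            for item in set(suffix):
                counts[item] = counts.get(item, 0) + 1
        for item in sorted(counts):
            support = counts[item]
            if support < min_sup:
                continue
            pattern = prefix + (item,)
            results[pattern] = support
            # Project: keep what follows the first occurrence of the item.
            next_projected = [s[s.index(item) + 1:] for s in projected if item in s]
            grow(pattern, next_projected)

    grow((), db)
    return results


toy_db = [["a", "b", "c"], ["a", "c"], ["b", "c"]]
patterns = prefixspan(toy_db, min_sup=2)
```

Each recursive call only looks at the suffixes that survive the projection, which is why no candidate set is generated and the full database is not rescanned for every pattern.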

In summary, these growth-based methods significantly reduce candidate generation and database scans, improving scalability. Nevertheless, bitmap-based approaches (such as SPAM) trade reduced I/O for increased memory footprint; practical approaches must balance preprocessing/memory costs and subsequent mining efficiency depending on data scale and sequence length.

3. Abnormal Alliance Detection Framework

Our Abnormal Alliance detection framework includes three parts. First, the proposed static CIA algorithm framework detects communities in bipartite networks that exhibit network centrality. Second, building on CIA, the Dynamic Community Structure Discovery Algorithm on Bipartite Networks (DyCIAComDet) detects the evolution of community structures in large-scale time-varying bipartite networks. Finally, a frequent sequence identification algorithm (BMPS) identifies tax-related anomalous behavior.

3.1. Static CIA Algorithm Framework

Based on community cohesion and integration force, this paper proposes a static bipartite network community detection method called CIA. The framework of this algorithm is shown in Figure 2.

Figure 2. Flowchart of the Static CIA Algorithm.

Step 1: Denote the nodes of the bipartite network as U-type and V-type. At time epoch $T_0$, each U-type node is treated as an individual community; these communities are denoted $C_1, C_2, \dots, C_g$, where $g$ is the total number of U-type nodes. In a bipartite network $G = (U, V, E)$, $U$ represents the node set of the first type, $V$ represents the node set of the second type, and $E$ represents the edge set connecting nodes of these two types. For any node $u_i$ in $U$, the list of all nodes in $V$ that are connected to $u_i$ is called the neighborhood list of node $u_i$. It is defined by Formula (1).

$$N(u_i) = \{ v_j w_{ij} \mid v_j \in V,\ (u_i, v_j) \in E \} \tag{1}$$

where $w_{ij}$ is the weight of the edge connecting node $u_i$ and node $v_j$. The neighbor list of $u_i$ may therefore contain duplicate nodes, and the weight $w_{ij}$ reflects how closely each neighbor is connected to $u_i$: the higher the weight, the closer the connection.
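As a rough illustration of Formula (1), the sketch below builds weighted neighborhood lists from an edge list. It assumes that repeated $(u_i, v_j)$ records are what the weight $w_{ij}$ accumulates; the function name `neighborhood_lists` and the sample edges are hypothetical.

```python
from collections import defaultdict


def neighborhood_lists(edges):
    """Map each U-type node to {v_j: w_ij}, built from (u, v, w) triples."""
    nbrs = defaultdict(dict)
    for u, v, w in edges:
        # Assumption: duplicate (u, v) records add up to a single weight w_ij.
        nbrs[u][v] = nbrs[u].get(v, 0) + w
    return dict(nbrs)


sample_edges = [("u1", "v1", 1), ("u1", "v1", 1), ("u1", "v2", 1), ("u2", "v2", 3)]
N = neighborhood_lists(sample_edges)
```

Here `N["u1"]` records `v1` with weight 2 because the pair appears twice in the input.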

Step 2: Calculate the cohesion of each community according to Formula (2) in order to discover key nodes in the community or network. Community cohesion is analogous to network centrality: the nodes with the maximum degree within a network or community are usually regarded as its core nodes. Moreover, nodes with higher degrees are more likely to be connected to each other and are therefore more likely to belong to the same community.

$$C(C^U) = \frac{|N(C^U)|}{\sum_{u_i \in C^U} \sum_{v_j \in N(u_i)} w_{ij}} \tag{2}$$

Here, $D(u_i)$ represents the degree of node $u_i$; $C^U$ denotes a community consisting solely of U-type nodes; $|C^U|$ denotes the number of nodes in community $C^U$; and $|N(C^U)|$ represents the number of common neighbors among all U-type nodes in community $C^U$. The denominator is the sum of the edge weights of all U-type nodes in community $C^U$.
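A minimal sketch of Formula (2), assuming neighborhood lists are stored as a dictionary `nbrs` mapping each U-type node to `{v_j: w_ij}` and that "common neighbors" means the intersection of the members' neighbor sets. The names `cohesion` and `nbrs` are illustrative.

```python
def cohesion(community, nbrs):
    """Cohesion of a U-type community: common neighbors / total edge weight."""
    if not community:
        return 0.0
    # |N(C^U)|: V-type nodes adjacent to every member of the community.
    common = set.intersection(*(set(nbrs[u]) for u in community))
    # Denominator: sum of the edge weights of all members.
    total_weight = sum(w for u in community for w in nbrs[u].values())
    return len(common) / total_weight if total_weight else 0.0


nbrs = {"u1": {"v1": 2, "v2": 1}, "u2": {"v2": 3}}
single = cohesion(["u1"], nbrs)        # 2 neighbors / weight 3
pair = cohesion(["u1", "u2"], nbrs)    # 1 common neighbor / weight 6
```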

Step 3: The community integration force is a measurable criterion for determining whether two small communities can merge into a larger one. The more common neighbors two communities share, the higher the similarity between them. Calculate the integration force between every pair of communities obtained in Step 1 according to Formula (3).

$$IN(C_i^U, C_j^U) = \frac{\sum_{v_j \in N(C_i^U, C_j^U)} \sum_{u_i \in C_i^U \cup C_j^U} w_{u_i v_j}}{|C_i^U| + |C_j^U|} \tag{3}$$

where $N(C_i^U, C_j^U)$ denotes the set of common neighbors of communities $C_i^U$ and $C_j^U$, while $|C_i^U|$ and $|C_j^U|$ represent the number of nodes in $C_i^U$ and $C_j^U$, respectively.
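Formula (3) can be sketched as follows. The text does not spell out how the neighbors of a multi-node community are formed, so this sketch assumes a V-type node counts as a common neighbor when it is adjacent to at least one member of each community; for the single-node communities of Step 1 this coincides with the ordinary shared-neighbor set. The names `community_neighbors` and `integration_force` are illustrative.

```python
def community_neighbors(community, nbrs):
    """V-type nodes adjacent to at least one member (an assumption of this sketch)."""
    return set().union(*(nbrs[u].keys() for u in community))


def integration_force(ci, cj, nbrs):
    """Total weight on shared neighbors, divided by the combined community size."""
    shared = community_neighbors(ci, nbrs) & community_neighbors(cj, nbrs)
    members = set(ci) | set(cj)
    numerator = sum(nbrs[u].get(v, 0) for v in shared for u in members)
    return numerator / (len(ci) + len(cj))


nbrs = {"u1": {"v1": 2, "v2": 1}, "u2": {"v2": 3}, "u3": {"v3": 1}}
in_12 = integration_force(["u1"], ["u2"], nbrs)  # shared v2: (1 + 3) / 2
in_13 = integration_force(["u1"], ["u3"], nbrs)  # no shared neighbor
```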

Step 4: For every pair of communities, calculate $IN_{Minus}(C_i, C_j) = 1.87 \cdot IN(C_i^U, C_j^U) - C(C_i^U) - C(C_j^U)$, and merge the pair with the largest value.
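The merge selection in Step 4 might look like the sketch below. It re-declares compact versions of the cohesion and integration force under the same assumptions as before (intersection of neighbor sets for cohesion, union-based community neighbors for integration); `best_merge` and `LAMBDA` are hypothetical names.

```python
from itertools import combinations

LAMBDA = 1.87  # coefficient reported in the text


def cohesion(c, nbrs):
    common = set.intersection(*(set(nbrs[u]) for u in c))
    weight = sum(w for u in c for w in nbrs[u].values())
    return len(common) / weight if weight else 0.0


def integration_force(ci, cj, nbrs):
    ni = set().union(*(nbrs[u].keys() for u in ci))
    nj = set().union(*(nbrs[u].keys() for u in cj))
    members = set(ci) | set(cj)
    total = sum(nbrs[u].get(v, 0) for v in ni & nj for u in members)
    return total / (len(ci) + len(cj))


def best_merge(communities, nbrs):
    """Return (IN_Minus, a, b) for the pair of community indices to merge."""
    scored = []
    for a, b in combinations(range(len(communities)), 2):
        ci, cj = communities[a], communities[b]
        score = (LAMBDA * integration_force(ci, cj, nbrs)
                 - cohesion(ci, nbrs) - cohesion(cj, nbrs))
        scored.append((score, a, b))
    return max(scored) if scored else None


nbrs = {"u1": {"v1": 2, "v2": 1}, "u2": {"v2": 3}, "u3": {"v3": 1}}
communities = [["u1"], ["u2"], ["u3"]]
result = best_merge(communities, nbrs)
```

In this toy input only `u1` and `u2` share a neighbor, so they form the only pair with a positive score.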

Step 5: For each $v_j \in V$, merge $v_j$ into community $C_b$, where $b = \arg\max_k \left( \frac{|N(v_j) \cap C_k|}{|C_k|} \right)$. If the maximum is attained by two or more communities, merge $v_j$ into the candidate community that yields the largest NMI, calculated according to Formula (4).

$$NMI_{CD} = \frac{-2 \sum_{i=1}^{N^C} \sum_{j=1}^{N^D} n_{ij}^{CD} \log_{10}\left(\frac{n_{ij}^{CD} M}{n_i^C n_j^D}\right)}{\sum_{i=1}^{N^C} n_i^C \log_{10}\left(\frac{n_i^C}{M}\right) + \sum_{j=1}^{N^D} n_j^D \log_{10}\left(\frac{n_j^D}{M}\right)} \tag{4}$$

where $M$ represents the total number of nodes in the network, and $N^C$ and $N^D$ represent the number of communities obtained by community partitioning algorithms C and D, respectively. $n_i^C$ denotes the total number of nodes in the $i$th community obtained by algorithm C, and similarly, $n_j^D$ denotes the total number of nodes in the $j$th community obtained by algorithm D. $n_{ij}^{CD}$ represents the number of common nodes between the $i$th community obtained by algorithm C and the $j$th community obtained by algorithm D [29]. Formula (4) shows that the closer the partitions produced by the two algorithms are, the higher the NMI value. If the two partitions are completely different, the Normalized Mutual Information (NMI) value is 0. NMI values lie between 0 and 1.
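Formula (4) translates fairly directly into code. The sketch below takes two label lists of equal length (one label per node) and is only an illustration; the function name `nmi` and the convention of returning 1.0 when both partitions consist of a single community are assumptions.

```python
import math
from collections import Counter


def nmi(labels_c, labels_d):
    """Normalized mutual information of two partitions, following Formula (4)."""
    m = len(labels_c)
    n_c = Counter(labels_c)                 # community sizes under C
    n_d = Counter(labels_d)                 # community sizes under D
    n_cd = Counter(zip(labels_c, labels_d))  # overlap sizes
    numerator = -2.0 * sum(
        n * math.log10(n * m / (n_c[i] * n_d[j])) for (i, j), n in n_cd.items()
    )
    denominator = (
        sum(n * math.log10(n / m) for n in n_c.values())
        + sum(n * math.log10(n / m) for n in n_d.values())
    )
    # Assumption: two trivial one-community partitions are treated as identical.
    return numerator / denominator if denominator else 1.0


same = nmi([0, 0, 1, 1], [1, 1, 0, 0])         # same grouping, renamed labels
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])  # groupings share no information
```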

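Step 6: Calculate the modularity Q according to Formula (5) to measure the level of node aggregation in the network.

$$Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(C_i, C_j) \tag{5}$$

where $m$ represents the total number of edges in the network, $A_{ij}$ is the adjacency matrix entry for nodes $i$ and $j$, $k_i$ is the degree of node $i$, and $\delta(C_i, C_j)$ equals 1 if nodes $i$ and $j$ belong to the same community and 0 otherwise.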
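A small sketch of Formula (5) for an unweighted, undirected edge list. It evaluates the double sum community by community instead of over all node pairs, which gives the same value; the function name `modularity` and the toy graph are illustrative.

```python
def modularity(edges, community):
    """Modularity Q of Formula (5); edges are (x, y) pairs, community maps node -> label."""
    m = len(edges)
    degree = {}
    for x, y in edges:
        degree[x] = degree.get(x, 0) + 1
        degree[y] = degree.get(y, 0) + 1
    # Sum of A_ij over same-community ordered pairs: each internal edge counts twice.
    internal = sum(2 for x, y in edges if community[x] == community[y])
    # Sum of k_i * k_j over same-community pairs = sum of (community degree total)^2.
    degree_total = {}
    for node, k in degree.items():
        label = community[node]
        degree_total[label] = degree_total.get(label, 0) + k
    expected = sum(d * d for d in degree_total.values()) / (2 * m)
    return (internal - expected) / (2 * m)


toy_edges = [("a", "b"), ("c", "d")]
q_split = modularity(toy_edges, {"a": 0, "b": 0, "c": 1, "d": 1})
q_single = modularity(toy_edges, {"a": 0, "b": 0, "c": 0, "d": 0})
```

Two disconnected edges placed in separate communities give Q = 0.5, while putting every node in one community gives Q = 0.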

Step 7: Repeat Steps 2 to 6 until no communities need to be merged or the Q value stops increasing, at which point the algorithm stops. Extensive experiments have demonstrated that multiplying the integration force by 1.87 yields the community partition with the optimal Q value. This coefficient, denoted $\lambda$, is derived from gradient optimization experiments on two representative bipartite network datasets: an artificially generated bipartite network (with four preset communities and $D_{out} \in \{0, 1, 2, 3, 4\}$) and the public Southern Women benchmark dataset. With the maximization of modularity Q as the objective function, we tested $\lambda$ values ranging from 1.0 to 2.5 and found that $\lambda = 1.87$ yielded the peak Q values (0.5328 for the artificial network and 0.5384 for the Southern Women dataset), with the community division results showing the highest consistency with the true network structure (as measured by NMI). Hence, the community merging condition is formalized as $IN_{Minus}(C_i, C_j) = 1.87 \cdot IN(C_i^U, C_j^U) - C(C_i^U) - C(C_j^U)$ and applied as follows: (1) calculate the integration force of each pair of communities and multiply it by 1.87; (2) subtract the cohesion values $C(C_i^U)$ and $C(C_j^U)$; (3) merge the communities $C_i$ and $C_j$ with the highest $IN_{Minus}$.

3.2. Dynamic Community Structure Discovery Algorithm on Bipartite Network DyCIAComDet

Traditional community detection algorithms typically rely on full datasets to directly partition the entire network. However, in the big data context, such algorithms suffer from low time and space efficiency. To address this, this paper proposes the DyCIAComDet algorithm, designed for community structure evolution in large-scale time-varying bipartite networks.

DyCIAComDet first segments the bipartite network into time-tagged sub-networks according to a given time interval. For the static bipartite network corresponding to time $T_0$, the CIA algorithm is applied to determine the community division of the network at $T_0$. For nodes labeled between $T_0$ and $T_1$, if their closeness to the existing communities at time $T_0$ exceeds the threshold, they are considered to have historical community characteristics and are assigned to the corresponding historical community. Otherwise, the CIA algorithm is applied to the nodes that do not satisfy the threshold, and the characteristics of the resulting communities are recorded. This process repeats until the final time epoch $T_f$ is reached or every node in the network has been assigned to a community, at which point the algorithm terminates. The flowchart of the algorithm is depicted in Figure 3.

Figure 3. Community structure evolution flowchart of time-varying bipartite network.

The algorithm steps of DyCIAComDet are presented as follows:

Step 1: Segment the entire network $G$ into time-tagged sub-networks $G_{T_0}, G_{T_1}, \dots, G_{T_n}$.

Step 2: Set $i = 0$, draw the nodes with time tag $T_0$, and construct the corresponding bipartite network $G_{T_0}$, which is used as the initial bipartite network.

Step 3: Apply the CIA algorithm to sub-network $G_{T_0}$ to obtain the community partition, denoted $C_1^{T_0}, C_2^{T_0}, \dots, C_l^{T_0}$, and record the community leadership as $ComLD_i^{T_0}$, $i = 1, 2, \dots, l$.

Community leadership is defined as the node with the highest weighted degree within the community, as given in Formula (6).

$$LD = \underset{x \in C}{\arg\max} \left( \sum_{y \in N(x)} w_{xy} \right) \tag{6}$$

In Formula (6), $x$ denotes any node (either U-type or V-type) in community $C$; $N(x)$ represents the neighbor node list of $x$ (consistent with Formula (1)); $\sum_{y \in N(x)} w_{xy}$ is the weighted degree of node $x$; and $\arg\max$ selects the node $x$ with the maximum weighted degree, which is defined as the community leadership node.
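Formula (6) can be sketched in a few lines. The function name `leadership_node` is hypothetical, and ties are broken by node name purely so that the result is deterministic; the text does not specify a tie-breaking rule.

```python
from collections import defaultdict


def leadership_node(community, edges):
    """Return the community member with the largest weighted degree."""
    weighted_degree = defaultdict(float)
    for u, v, w in edges:
        # Each edge contributes its weight to both endpoints.
        weighted_degree[u] += w
        weighted_degree[v] += w
    # Assumption: on ties, the alphabetically first node wins.
    return max(sorted(community), key=lambda x: weighted_degree[x])


sample_edges = [("u1", "v1", 2), ("u1", "v2", 1), ("u2", "v2", 3)]
leader = leadership_node({"u1", "u2", "v1", "v2"}, sample_edges)
```

In this toy community `v2` has weighted degree 4, the largest of the four nodes.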

The metric Q is traditionally used to evaluate static network community detection results, but it fails to characterize the dynamic evolution of networks effectively. Thus, this paper proposes a novel evaluation metric for dynamic network community evolution. Specifically, the complex network is first sliced into dynamic sub-networks by time. At time $T_{i-1}$, the static network community detection method determines the community result for the network at that epoch. For the interval from $T_{i-1}$ to $T_i$, the newly added nodes are first divided into two parts. The first part consists of nodes that are closely connected to the historical communities at $T_{i-1}$, and their community results are denoted $\Omega_{T_i - T_{i-1}}^{history}$. The remaining nodes form the second part, and their results are denoted $\Omega_{T_i - T_{i-1}}^{new}$. Thus, the community results from $T_{i-1}$ to $T_i$ can be denoted $Part_{T_i - T_{i-1}} = \Omega_{T_i - T_{i-1}}^{history} + \Omega_{T_i - T_{i-1}}^{new}$.

Step 4: For the nodes arriving between time $T_0$ and $T_1$, those that have edges to communities found at time $T_0$ form the set $\Omega_{T_1 - T_0}^{history}$; they are assigned to the corresponding communities at $T_0$, and the community leadership is updated. The remaining nodes form a new sub-bipartite network, namely $\Omega_{T_1 - T_0}^{new}$.

Step 5: Repeat Steps 1–4 for the interval between $T_{i-1}$ and $T_i$, and apply the CIA algorithm to the sub-bipartite networks $\Omega_{T_i - T_{i-1}}^{history}$ and $\Omega_{T_i - T_{i-1}}^{new}$. This yields the complete community results of $\Omega_{T_i - T_{i-1}} = \Omega_{T_i - T_{i-1}}^{history} + \Omega_{T_i - T_{i-1}}^{new}$, denoted $C_1^{T_i - T_{i-1}}, C_2^{T_i - T_{i-1}}, \dots, C_l^{T_i - T_{i-1}}$. Record the leadership of each community, namely $ComLD_i^{T_i - T_{i-1}}$, $i = 1, 2, \dots, l$.

The index $Q_{\leq T_i}$ for $Part_{\leq T_i}$, which covers the community results up to $T_i$, is calculated using Formula (7).

$$Q_{\leq T_i} = \frac{1}{2m} \sum_{i=1}^{p} \sum_{j=1}^{q} \left[ \bar{a}_{ij} - \frac{k_i k_j}{2m} \right] \left( \eta_{u_i, v_j}^{1} + \eta_{u_i, v_j}^{2} + \eta_{u_i, v_j}^{3} \right) \delta(C_{u_i}, C_{v_j}) \tag{7}$$

where $\eta_{u_i, v_j}^{1}$ equals 1 if node $u_i$ and node $v_j$ belong to the same community; otherwise, it equals 0. $\eta_{u_i, v_j}^{2}$ equals 1 if $u_i$ belongs to a community at time epoch $T_{i-1}$ and $v_j$ belongs to a community from $T_{i-1}$ to $T_i$, or if $u_i$ belongs to a community from $T_{i-1}$ to $T_i$ and $v_j$ belongs to a community at time epoch $T_{i-1}$; otherwise, it equals 0. $\eta_{u_i, v_j}^{3}$ equals 1 if both $u_i$ and $v_j$ belong to communities from $T_{i-1}$ to $T_i$; otherwise, it equals 0. Here, $p$ denotes the total number of U-type nodes ($u_i$) and $q$ the total number of V-type nodes ($v_j$) in the interval $T_{i-1}$ to $T_i$, matching the summation limits in Formulas (7) and (9). $Q_{T_i}$ is calculated using Formula (8), while $Q_{T_{i+1} - T_i}$ is calculated using Formula (9).

$$Q_{T_i} = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(C_i, C_j) \tag{8}$$

$$Q_{T_{i+1} - T_i} = \frac{1}{2m} \sum_{i=1}^{p} \sum_{j=1}^{q} \left[ \bar{a}_{ij} - \frac{k_i k_j}{2m} \right] \left( \eta_{u_i, v_j}^{2} + \eta_{u_i, v_j}^{3} \right) \delta(C_{u_i}, C_{v_j}) \tag{9}$$

Step 6: Merge the existing communities $C_1^{T_i}, C_2^{T_i}, \dots, C_l^{T_i}$ with the newly found communities $C_1^{T_{i+1} - T_i}, C_2^{T_{i+1} - T_i}, \dots, C_l^{T_{i+1} - T_i}$ and their corresponding leadership $ComLD_i^{T_{i+1} - T_i}$, $i = 1, 2, \dots, l$, to obtain the communities $C_1^{T_{i+1}}, C_2^{T_{i+1}}, \dots, C_l^{T_{i+1}}$ and the corresponding leadership $ComLD_i^{T_{i+1}}$, $i = 1, 2, \dots, l$.

Step 7: Update $i = i + 1$ and repeat Steps 2–5. Continue this process until one of the following conditions is met: all nodes of the entire network belong to an existing community, or the final time point $T_n$ is reached. The algorithm then terminates.
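The overall control flow of Steps 1–7 can be sketched as below. This is a simplified illustration and not the authors' implementation: the static CIA step is replaced by a plain connected-components stand-in, closeness is taken as the total edge weight from a new node to any member of a historical community (rather than to its leadership nodes), and the names `track_communities`, `connected_components`, and `threshold` are all hypothetical.

```python
from collections import defaultdict


def connected_components(edges):
    """Stand-in for the static CIA step: groups nodes by connectivity only."""
    adj = defaultdict(set)
    for u, v, _ in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, components = set(), []
    for start in sorted(adj):
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            x = stack.pop()
            if x in seen:
                continue
            seen.add(x)
            comp.add(x)
            stack.extend(adj[x] - seen)
        components.append(comp)
    return components


def track_communities(timed_edges, threshold=1.0, static_detect=connected_components):
    """timed_edges: (t, u, v, w) tuples. Returns a node -> community id mapping."""
    slices = defaultdict(list)
    for t, u, v, w in timed_edges:          # Step 1: slice by time tag
        slices[t].append((u, v, w))
    label, next_id = {}, 0
    for t in sorted(slices):
        edges = slices[t]
        new_nodes = {x for u, v, _ in edges for x in (u, v)} - set(label)
        # Closeness of each new node to each historical community.
        closeness = defaultdict(lambda: defaultdict(float))
        for u, v, w in edges:
            for x, y in ((u, v), (v, u)):
                if x in new_nodes and y in label:
                    closeness[x][label[y]] += w
        for x, scores in closeness.items():  # "history" part
            best = max(sorted(scores), key=scores.get)
            if scores[best] >= threshold:
                label[x] = best
        leftover = new_nodes - set(label)    # "new" part
        rest = [(u, v, w) for u, v, w in edges if u in leftover and v in leftover]
        for comp in static_detect(rest):
            for x in comp:
                label[x] = next_id
            next_id += 1
        for x in sorted(leftover - set(label)):  # isolated leftovers
            label[x] = next_id
            next_id += 1
    return label


timed_edges = [
    (0, "u1", "v1", 1.0), (0, "u2", "v1", 1.0), (0, "u3", "v2", 1.0),
    (1, "u4", "v1", 2.0), (1, "u5", "v3", 1.0),
]
labels = track_communities(timed_edges)
```

In the toy input, `u4` arrives at the second time slice with a strong link to `v1` and is absorbed into the historical community of `u1`, whereas `u5` and `v3` have no link to any historical community and form a new one.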

3.3. BMPS Algorithm for Detecting “Abnormal Alliances”

Sections 3.1 and 3.2 address the identification of tax-related anomalous behavior by modeling it as a community evolution problem in dynamic bipartite networks. This section further addresses key feature verification for “Abnormal Alliances”, which we model as a frequent pattern mining problem on the sequence database composed of nodes along the evolution paths of “Abnormal Alliances”. The paper proposes the BMPS algorithm for frequent sequence pattern mining, based on bitmaps, sequence matrices, sequence prefixes and suffixes, projection matrices, repeated projection matrices, and pattern growth.

BMPS (Bitmap-based PrefixSpan) is an improved PrefixSpan sequence pattern mining algorithm based on bitmap matrices. In this context, a bitmap [25] is a compact array of bits that records, for each item and position, whether the item is present. Bitmaps are frequently employed for the efficient processing of large-scale binary data because they support efficient bitwise operations such as AND, OR, XOR, and NOT, which is why they are often used in data compression, sorting, deduplication, and set operations. In contrast to PrefixSpan, BMPS stores both the sequence data of the sequence database and the projected sequences of the projection database in two-dimensional sequence matrices.

The flowchart of BMPS is shown in Figure 4.

Figure 4. Flowchart of BMPS algorithm.

Step 1: Instead of scanning the original sequence database, the algorithm first scans the sequence matrix $E$ to identify frequent 1-sequence patterns.

For a given sequence database $D$, if the occurrence frequency of a sequence pattern $\alpha$ is at least the minimum support threshold min_sup, i.e., $Sup(\alpha) \geq min\_sup$, then $\alpha$ is called a frequent sequence. The sequence database $D$ is shown in Table 1, and its corresponding bitmap matrix is denoted $E$. Each sequence $d_i$ corresponds to a bitmap matrix denoted $E_{(d_i)}$, as shown in Figure 5.

Table 1. Sequence database.

Figure 5. Sequence bitmap matrix.

Taking the sequence database $D$ in Table 1 as an example, its corresponding sequence matrix is shown in Figure 5. $D$ contains four sequences in total, corresponding to the four sequence matrices in Figure 5, and six data items, a, b, c, d, e, and f, corresponding to the six rows in Figure 5. When calculating the support of a data item, the algorithm only needs to traverse the item's row in each matrix: as soon as a ‘1’ appears in the row, the support value of that item is incremented by 1, and the algorithm moves on to the next sequence matrix. Assuming $min\_sup = 2$, the supports shown in Figure 5 are {a:4, b:4, c:4, d:3, e:2, f:1}, so all items except f are frequent 1-sequence items. By traversing each column of a matrix, we can obtain the item set at the corresponding position of the sequence. For example, in the second column, only items a, b, and c have the value ‘1’; thus, its item set is (abc).
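The row-wise support count can be sketched as follows. Because the contents of Table 1 are not reproduced here, the database `toy_db` below is a hypothetical stand-in and does not match the supports quoted above; the function names `bitmap_matrix` and `item_support` are likewise illustrative.

```python
def bitmap_matrix(sequence, items):
    """One row per item, one column per position; 1 if the item occurs there."""
    return [[1 if item in itemset else 0 for itemset in sequence] for item in items]


def item_support(matrices, items):
    """Count, for each item, the sequence matrices whose row contains a 1."""
    support = {item: 0 for item in items}
    for matrix in matrices:
        for item, row in zip(items, matrix):
            if any(row):  # the first 1 in the row is enough for this sequence
                support[item] += 1
    return support


# Hypothetical database: each sequence is a list of itemsets.
toy_db = [
    [{"a"}, {"a", "b", "c"}, {"d"}],
    [{"a", "d"}, {"c"}, {"b"}],
    [{"e"}, {"a", "b"}],
]
items = ["a", "b", "c", "d", "e"]
matrices = [bitmap_matrix(seq, items) for seq in toy_db]
support = item_support(matrices, items)
min_sup = 2
frequent = [item for item in items if support[item] >= min_sup]
# Reading a column back gives the itemset at that position.
second_column = [item for item, row in zip(items, matrices[0]) if row[1]]
```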

Step 2: The projection is based on the position value of $L_{1}$ in matrix $E$, where the position refers to the location at which an element or item appears in the given sequence. Specifically, suppose there is a sequence $α = ⟨α_{1}, α_{2}, \dots, α_{n}⟩$ and an item $b \in ⋃_{j=1}^{n} α_{j}$; then $α_{j}(b, j)$ represents the position value of item b in the sequence $α$. Matrix $E$ is projected and partitioned to obtain the projection matrix $E_{(L_{1})}$, which is proposed in place of the projection database and is defined as follows. Given a sequence prefix $α$ and a matrix $E$, the projection matrix of $α$ relative to $E$, denoted as $E_{(α)}$, is the set of position values in $E$ that are greater than the position value of the sequence prefix $α$. Similarly, $E_{(d_{i}, α)}$ is the set of position values that are greater than the position value of $α$ in sequence $d_{i}$.

For example, let $α = ⟨a(b)⟩$. The position values of the sequence prefix $α$ in $E$ are $d_{1}(α, 2)$, $d_{2}(α, 2)$, and $d_{4}(α, 2)$, respectively. Therefore, scanning all columns after the second column of the matrices corresponding to $d_{1}$, $d_{2}$, and $d_{4}$ in $E$ yields the projection matrix, as shown in Figure 6. Taking $d_{1}$ as an example, the rules for generating the projection matrix are as follows. The sequence prefix $α$ has a position value of 2 in $d_{1}$, and the last item of $α$ is b. Traversing the second column of $E_{(d_{1})}$ vertically, item b is in the second row; thus, all position values up to and including $E_{(d_{1}, α)}[2][2]$ are 0, which means that the first column of $E_{(d_{1}, α)}$ is all zeros and that $E_{(d_{1}, α)}[1][2] = 0$ and $E_{(d_{1}, α)}[2][2] = 0$. The values after the second row of this column are the same as those in $E_{(d_{1})}$. Apart from these zeroed entries, the values of the second column and beyond in $E_{(d_{1}, α)}$ are the same as the corresponding values in $E_{(d_{1})}$.

Figure 6. Projection matrix of sequence $α$.

When performing sequence mining, the rules for calculating the item support are as follows. In Figure 6, traversing the row of item a in the three matrices finds a ‘1’ only in $E_{(d_{4}, α)}$, so its support is 1, which is less than the minimum support; traversing the row of item b finds ‘1’s in both $E_{(d_{1}, α)}$ and $E_{(d_{2}, α)}$, so its support is 2. Similarly, the support for all items is obtained as $\{⟨(a)⟩ : 1, ⟨(b)⟩ : 2, ⟨(c)⟩ : 3, ⟨(d)⟩ : 2, ⟨(e)⟩ : 1, ⟨(\_c)⟩ : 2, ⟨(\_e)⟩ : 1\}$. Items with support less than $min\_sup = 2$ are dropped; thus, the extensible items are $\{⟨(b)⟩, ⟨(c)⟩, ⟨(d)⟩, ⟨(\_c)⟩\}$.

When calculating the item support, it is important to notice the position value of the last item in the prefix sequence, and the items in front of the columns corresponding to the same position value should be prefixed with an underscore “_”, indicating that they belong to the same item set.
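The projection rule and the underscore convention can be illustrated with a short Python sketch. It rests on several assumptions that are not taken from the paper: the two sequences are hypothetical (they are not rows of Table 1), indices are 0-based rather than the 1-based position values used in the text, and `project` and `extension_support` are illustrative helper names.

```python
from typing import Dict, List

Bitmap = List[List[int]]  # rows = items, columns = positions in one sequence


def project(bitmap: Bitmap, pos: int, last_row: int) -> Bitmap:
    """Clear everything the prefix has already consumed.

    pos is the 0-based column of the prefix's last item set and last_row the
    0-based row of its last item. Columns before pos are zeroed, and in
    column pos the rows up to and including last_row are zeroed.
    """
    return [
        [0 if (j < pos or (j == pos and r <= last_row)) else bit
         for j, bit in enumerate(row)]
        for r, row in enumerate(bitmap)
    ]


def extension_support(projected: List[Bitmap], positions: List[int],
                      items: List[str]) -> Dict[str, int]:
    """Count each candidate once per sequence; "_x" marks an item that
    shares the prefix's last item set, "x" an item in a later item set."""
    support: Dict[str, int] = {}
    for bitmap, pos in zip(projected, positions):
        found = set()
        for r, item in enumerate(items):
            if bitmap[r][pos]:
                found.add("_" + item)
            if any(bitmap[r][pos + 1:]):
                found.add(item)
        for key in found:
            support[key] = support.get(key, 0) + 1
    return support


ITEMS = list("abcde")
# Hypothetical sequences <a(abc)(bc)(de)> and <a(bc)d>; prefix <a(b)> ends in column 1.
E_D1 = [[1, 1, 0, 0], [0, 1, 1, 0], [0, 1, 1, 0], [0, 0, 0, 1], [0, 0, 0, 1]]
E_D2 = [[1, 0, 0], [0, 1, 0], [0, 1, 0], [0, 0, 1], [0, 0, 0]]
PROJECTED = [project(E_D1, pos=1, last_row=1), project(E_D2, pos=1, last_row=1)]
SUPPORT = extension_support(PROJECTED, [1, 1], ITEMS)
EXTENSIBLE = sorted(k for k, v in SUPPORT.items() if v >= 2)
```

In this toy case only `_c` and `d` reach a support of 2, so they are the only extensions kept for the next round.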

Step 3: In the projection matrix $E_{(L_{1})}$, we first mine the frequent subsequences $α$ that satisfy the minimum support. By connecting the frequent 1-sequence patterns $L_{1}$ with $α$, we obtain the frequent 2-sequence patterns $L_{2}$. The sequences $L_{1}$ and $L_{2}$ satisfy the frequent-sequence relationship because $CID(L_{1}, L_{2})|_{E} = CID(L_{2})|_{E}$.

The frequent-sequence relationship is defined as follows. Assume there exist two frequent subsequences $α$ and $β$. $CID(α)|_{D}$ represents the set of data records in database D that contain subsequence $α$, and $CID(β)|_{D}$ represents the set of data records in database D that contain subsequence $β$. Let $CID(α, β)|_{D}$ denote the set of data records in the database that contain both subsequences $α$ and $β$. Therefore, $CID(α, β)|_{D} = CID(α)|_{D} ⋂ CID(β)|_{D}$, and the size of this set determines whether sequences $α$ and $β$ have the frequent-sequence relationship.

If $|CID(α, β)|_{D} < min\_sup$, then $α$ and $β$ do not satisfy the frequent-sequence relationship in database D, because the number of records containing both $α$ and $β$ is less than $min\_sup$. If $min\_sup \leq |CID(α, β)|_{D} \leq \min\{|CID(α)|_{D}, |CID(β)|_{D}\}$, then $α$ and $β$ satisfy the frequent-sequence relationship in database D.

For example, consider the sequence database D shown in Table 1 and suppose min_sup = 2. The data-record set containing $⟨(a)⟩$ is $CID(⟨(a)⟩)|_{D} = \{1, 2, 3, 4\}$. The data-record set containing $⟨(b)⟩$ is $CID(⟨(b)⟩)|_{D} = \{1, 2, 3, 4\}$. The data-record set containing both $⟨(a)⟩$ and $⟨(b)⟩$ is $CID(⟨(a)⟩, ⟨(b)⟩)|_{D} = CID(⟨(a)⟩)|_{D} ⋂ CID(⟨(b)⟩)|_{D} = \{1, 2, 3, 4\}$. The number of records containing both $⟨(a)⟩$ and $⟨(b)⟩$ is 4, which is not less than 2; thus, subsequences $⟨(a)⟩$ and $⟨(b)⟩$ have a frequent-sequence relationship. In contrast, the data-record set containing the sequence $⟨(d)⟩$ is $CID(⟨(d)⟩)|_{D} = \{1, 2, 3\}$, whereas item f has a support of only 1 in Figure 5, so $CID(⟨(f)⟩)|_{D}$ contains a single record. The intersection $CID(⟨(d)⟩, ⟨(f)⟩)|_{D} = CID(⟨(d)⟩)|_{D} ⋂ CID(⟨(f)⟩)|_{D}$ therefore contains at most one record. The number of records containing both $⟨(d)⟩$ and $⟨(f)⟩$ is less than 2; thus, the subsequences $⟨(d)⟩$ and $⟨(f)⟩$ do not have a frequent-sequence relationship.
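Because the frequent-sequence relationship reduces to a set intersection, it maps naturally onto bitwise operations. The following Python sketch assumes that each CID set is encoded as an integer bitmask, which is an illustrative encoding and not necessarily the one used by the authors. The record sets for a, b, and d come from the example; the single record chosen for f is hypothetical, since only its support of 1 is stated.

```python
from typing import Iterable


def cid_mask(record_ids: Iterable[int]) -> int:
    """Encode a set of record IDs as an integer bitmask (bit i = record i)."""
    mask = 0
    for rid in record_ids:
        mask |= 1 << rid
    return mask


def joint_support(mask_a: int, mask_b: int) -> int:
    """|CID(a, b)| is the number of set bits in the AND of the two masks."""
    return bin(mask_a & mask_b).count("1")


def has_frequent_relationship(mask_a: int, mask_b: int, min_sup: int) -> bool:
    """True when the joint support reaches the minimum support threshold."""
    return joint_support(mask_a, mask_b) >= min_sup


MIN_SUP = 2
CID_A = cid_mask([1, 2, 3, 4])  # records containing <(a)>, from the example
CID_B = cid_mask([1, 2, 3, 4])  # records containing <(b)>, from the example
CID_D = cid_mask([1, 2, 3])     # records containing <(d)>, from the example
CID_F = cid_mask([3])           # hypothetical: f occurs in exactly one record
```

Here `has_frequent_relationship(CID_A, CID_B, MIN_SUP)` holds, while the pair d and f fails the test because at most one record can contain both.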

Step 4: According to the repeated projection matrix, there may be three situations for the sequence $L_{2}$ and a frequent 1-sequence $β$.

The repeated projection matrix is defined as follows. Given two frequent subsequences $α$ and $β$, assume that the position values of $α$ and $β$ are the same within the same sequence, and that the number of sequences with equal position values is not less than min_sup. Then, the projection matrices constructed with $α$ and $β$ as prefixes duplicate each other.

Assuming $α = ⟨a(b)⟩$, its corresponding projection matrix is shown in Figure 6. Assuming $β = ⟨(b)⟩$, its corresponding projection matrix is shown in Figure 7. Because $d_{1}(α, 2) = d_{1}(β, 2)$, $d_{2}(α, 2) = d_{2}(β, 2)$, and $d_{4}(α, 2) = d_{4}(β, 2)$, the position values of $α$ and $β$ in matrices $d_{1}$, $d_{2}$, and $d_{4}$ are equal. Since the number of equal position values, 3, is not less than min_sup, which is 2, it can be observed from Figure 6 and Figure 7 that there are duplicate parts in the projection matrices of $α$ and $β$.

Figure 7. Projection matrix of sequence $β$.
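The duplicate test for two prefixes can be sketched as a comparison of their position values per sequence. In the Python sketch below, the position values for $⟨a(b)⟩$ and $⟨(b)⟩$ follow the example; the dictionary representation, the third prefix, and the helper names `shared_sequences`, `is_repeated_projection`, and `projection_key` are assumptions made for illustration.

```python
from typing import Dict, FrozenSet, Set, Tuple

Positions = Dict[str, int]  # sequence ID -> position value of the prefix


def shared_sequences(pos_a: Positions, pos_b: Positions) -> Set[str]:
    """Sequences in which both prefixes end at the same position value."""
    return {sid for sid, p in pos_a.items() if pos_b.get(sid) == p}


def is_repeated_projection(pos_a: Positions, pos_b: Positions,
                           min_sup: int) -> bool:
    """Projections repeat when enough sequences share the position value."""
    return len(shared_sequences(pos_a, pos_b)) >= min_sup


def projection_key(pos: Positions) -> FrozenSet[Tuple[str, int]]:
    """Hashable key so an already-built projection can be looked up again."""
    return frozenset(pos.items())


POS_ALPHA = {"d1": 2, "d2": 2, "d4": 2}  # prefix <a(b)>, values from the example
POS_BETA = {"d1": 2, "d2": 2, "d4": 2}   # prefix <(b)>, values from the example
POS_OTHER = {"d1": 3, "d2": 1, "d4": 2}  # hypothetical third prefix

CACHE = {projection_key(POS_ALPHA): "projection built for <a(b)>"}
REUSED = CACHE.get(projection_key(POS_BETA))  # hit: no second projection needed
```

Keying a cache on the position values is one simple way to make sure a duplicate projection matrix is mined only once.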

The sequence $L_{2}$ includes the following three situations:

Situation (1): The set of position values of $L_{2}$ in matrix E is the same as the set of position values of $β$ in matrix E. In this scenario, there is no need to generate a projection matrix for $L_{2}$; instead, we can directly mine within the projection matrix of $β$. By connecting the subsequences obtained from $β$ with $L_{2}$, we can generate higher-order frequent sequences.

Situation (2): The set of position values of $L_{2}$ in E is a proper subset of the set of position values of $β$ in E. Assuming the set of position values for $L_{2}$ in E is $\{j_{1}, j_{2}, j_{4}\}$, and the set of position values for $β$ in E is $\{j_{1}, j_{2}, j_{3}, j_{4}\}$, it can be seen from the position value sets that sequence $d_{3}$ contains sequence $β$ but not $L_{2}$.

We first perform frequent-sequence mining in the projection matrix of $L_{2}$, and assume that the mined local frequent-sequence items are $α' = \{α_{1}, α_{2}, α_{3}\}$. Second, we perform mining in the projection matrix with $β$ as the prefix, and assume that the mined local frequent sequences are $α'' = \{α_{1}, α_{2}, α_{3}, α_{4}, α_{5}\}$. We then remove from $α''$ the items contained in sequence $d_{3}$: assuming that the items appearing in $d_{3}$ are $\{α_{3}, α_{4}\}$, the result after selection is $α''' = \{α_{1}, α_{2}, α_{5}\}$.

A sequence prefix is essentially a subsequence of a sequence. Specifically, suppose we have sequence $α = ⟨α_{1}, α_{2}, \dots, α_{m}⟩$ and sequence $β = ⟨β_{1}, β_{2}, \dots, β_{n}⟩$, $1 \leq m \leq n$. Sequence $α$ is referred to as a sequence prefix of sequence $β$ when the following three conditions are satisfied: (1) $α_{i} = β_{i}$ for $i < p$, where $α_{p}$ is the last item set of $α$; (2) $α_{p} \subset β_{p}$; and (3) all items in $α_{p}$ appear in the earlier part of $β_{p}$.

Example. Suppose $α = ⟨a(abc)(bc)(de)⟩$; then $⟨a(a)⟩$, $⟨a(abc)⟩$, and $⟨(bc)(b)⟩$ are all sequence prefixes of $α$.

The projection matrix for the intersection of $α'$ and $α'''$, specifically $α_{1}$ and $α_{2}$, needs to be generated only once, with $α_{1}$ and $α_{2}$ serving as the prefixes. More specifically, the projection matrix for $α_{1}$ and $α_{2}$ is generated during the mining process of $L_{2}$, and there is no need to regenerate it during the mining process of $β$.

Situation (3): The position-value sets of $L_{2}$ and $β$ do not meet situations (1) and (2) in E.

Generate the projection matrix separately and perform the conventional sequence-mining process. The algorithm terminates when all values in the projection matrix are 0 or the support of all items is less than the minimum support threshold.
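The three situations amount to a comparison between two sets of position values, and the reuse step in Situation (2) is a pair of set operations. The Python sketch below mirrors the symbolic example in the text; the string labels such as "j1" and "a1" stand in for $j_{1}$ and $α_{1}$, and the function names are hypothetical.

```python
from typing import FrozenSet


def classify_situation(pos_l2: FrozenSet[str], pos_beta: FrozenSet[str]) -> int:
    """Return 1 (identical sets), 2 (L2 is a proper subset of beta), or 3."""
    if pos_l2 == pos_beta:
        return 1
    if pos_l2 < pos_beta:  # proper-subset test on frozensets
        return 2
    return 3


def reusable_prefixes(local_l2: FrozenSet[str], local_beta: FrozenSet[str],
                      in_extra_sequence: FrozenSet[str]) -> FrozenSet[str]:
    """Prefixes whose projection is built while mining L2 and reused for beta.

    local_l2 plays the role of alpha', local_beta of alpha'', and the
    filtered set below of alpha''' (items of the extra sequence removed).
    """
    filtered = local_beta - in_extra_sequence
    return local_l2 & filtered


POS_L2 = frozenset({"j1", "j2", "j4"})
POS_BETA = frozenset({"j1", "j2", "j3", "j4"})
SITUATION = classify_situation(POS_L2, POS_BETA)

ALPHA_1 = frozenset({"a1", "a2", "a3"})              # mined under L2
ALPHA_2 = frozenset({"a1", "a2", "a3", "a4", "a5"})  # mined under beta
IN_D3 = frozenset({"a3", "a4"})                      # items appearing in d3
SHARED = reusable_prefixes(ALPHA_1, ALPHA_2, IN_D3)
```

For the sets in the example, the classification is Situation (2) and the shared prefixes are $α_{1}$ and $α_{2}$, whose projection matrix is generated only once.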

4. Numerical Experiment

Our numerical experiment is divided into two stages. The first stage involves the identification of community evolution based on a large-scale bipartite network, which is to identify anomalous behavior within large-scale data. The second stage is to search for evidence of anomalous behavior, namely, Abnormal Alliances.

4.1. Comparison of the CIA Algorithm with Other Algorithms

To validate the effectiveness and reliability of the proposed CIA algorithm, we first construct artificial bipartite networks through computer simulation and compare the CIA algorithm with the BRIM algorithm, an algorithm based on spectral clustering, and an algorithm based on edge clustering coefficients. The CIA algorithm is then applied to the Southern Women dataset and to actual tax-related data.

The artificially generated bipartite network in this paper consists of four communities, each containing 10 U-class nodes and 10 V-class nodes. Each node has a degree of 10, i.e., $d_{in} + d_{out} = 10$, where $d_{in}$ represents the number of edges between a node and other nodes within its community, and $d_{out}$ represents the number of edges between a node and nodes outside its community. Meanwhile, the weights of the edges within a community are assumed to follow a certain random distribution. The constructed artificial network is shown in Figure 8. When $d_{out} = 0$ and $d_{in} = 10$, the community structure is very clear: Figure 8a shows the structure of Community 1; Figure 8b shows the structure of Community 2; Figure 8c shows the structure of Community 3; and Figure 8d shows the structure of Community 4. For $d_{out}$ values of 0, 1, 2, 3, and 4, the Normalized Mutual Information (NMI) is calculated for the community partitions obtained by the static CIA algorithm, using the real network community structure as a reference. These results are compared with those from the BRIM algorithm, the spectral clustering algorithm, and the edge clustering coefficient community detection algorithm. The comparison results are shown in Figure 9. For every value of $d_{out}$, the NMI values of the CIA algorithm are higher than those of the other three algorithms, indicating that the community partitions obtained by the CIA algorithm are closer to the true community structure of the artificial bipartite network.

Figure 8. Artificial bipartite network.

Figure 9. Comparison results of NMI under four algorithms.
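For readers who wish to reproduce this kind of comparison, NMI between a reference partition and a detected partition can be computed as in the Python sketch below. The normalization used in the paper is not stated in this section; the sketch assumes the common form $2I(X;Y)/(H(X)+H(Y))$, and the label lists in the usage lines are hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def nmi(labels_true: Sequence[Hashable], labels_pred: Sequence[Hashable]) -> float:
    """Normalized mutual information, 2*I(X;Y) / (H(X) + H(Y))."""
    n = len(labels_true)
    if n == 0 or n != len(labels_pred):
        raise ValueError("label sequences must be non-empty and equally long")
    count_true = Counter(labels_true)
    count_pred = Counter(labels_pred)
    joint = Counter(zip(labels_true, labels_pred))

    h_true = -sum(c / n * math.log(c / n) for c in count_true.values())
    h_pred = -sum(c / n * math.log(c / n) for c in count_pred.values())
    mutual = sum(
        c / n * math.log(c * n / (count_true[a] * count_pred[b]))
        for (a, b), c in joint.items()
    )
    if h_true + h_pred == 0.0:
        return 1.0  # both partitions place every node in one community
    return 2.0 * mutual / (h_true + h_pred)


# Hypothetical partitions of four nodes.
SAME_UP_TO_RELABELLING = nmi([0, 0, 1, 1], [1, 1, 0, 0])
INDEPENDENT = nmi([0, 0, 1, 1], [0, 1, 0, 1])
```

A value of 1 means the detected partition matches the reference up to relabelling, and a value of 0 means the two partitions share no information.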

In the following section, this paper will validate the effectiveness of the CIA algorithm on real bipartite networks using the publicly available real-world network (Southern Women Dataset), which is a well-known standard bipartite network. It includes 18 woman nodes and 14 activity nodes, with edges representing a woman’s participation in a particular activity, shown in Figure 10.

Figure 10. Network of Southern Women.

The community structure partitioned by the CIA algorithm is shown in Figure 11. It can be seen that the CIA algorithm divides the Southern Women into three communities. If the women nodes are sequentially numbered, it can be seen that Community 1 includes six women nodes and activity nodes, including E8, E10, E12, E13, and E14. Community 2 includes five women nodes and activity nodes, including E11 and E9, and Community 3 includes seven women nodes and activity nodes, including E1 to E7.

Figure 11. Community structure of Southern Women by CIA algorithm.

From the modularity comparison chart shown in Figure 12, it can be seen that the Q value of the CIA algorithm is significantly higher than the other three algorithms. Comparatively, the Q value of the CIA algorithm has increased by 12.8%, 19.8%, and 8.4% over that of algorithms such as BRIM, spectral clustering, and the edge clustering coefficient, respectively. The community detection algorithm based on spectral clustering requires the number of communities to be known in advance; the parameter is set to 3 due to the three communities identified by the CIA algorithm.

Figure 12. Comparison of the Q value of four community detection algorithms on the Southern Women dataset.
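As a point of reference for the Q values reported here, the sketch below computes a bipartite modularity in Python. It assumes Barber's unweighted bipartite modularity, $Q = \frac{1}{m}\sum_{u,v}(A_{uv} - k_{u}d_{v}/m)\,δ(c_{u}, c_{v})$; whether the paper's Q uses exactly this definition or a weighted variant is not stated in this section. The toy graph and the function name are hypothetical.

```python
from typing import Dict, Hashable, Iterable, Tuple

Edge = Tuple[Hashable, Hashable]  # (U-class node, V-class node)


def bipartite_modularity(edges: Iterable[Edge],
                         community: Dict[Hashable, Hashable]) -> float:
    """Barber-style modularity for an unweighted bipartite graph.

    Assumes no duplicate edges and disjoint names for U and V nodes.
    """
    edge_list = list(edges)
    m = len(edge_list)
    if m == 0:
        return 0.0
    deg_u: Dict[Hashable, int] = {}
    deg_v: Dict[Hashable, int] = {}
    for u, v in edge_list:
        deg_u[u] = deg_u.get(u, 0) + 1
        deg_v[v] = deg_v.get(v, 0) + 1
    # Edges that fall inside a community.
    observed = sum(1 for u, v in edge_list if community[u] == community[v])
    # Expected number of such edges under the degree-preserving null model.
    expected = sum(
        ku * kv
        for u, ku in deg_u.items()
        for v, kv in deg_v.items()
        if community[u] == community[v]
    ) / m
    return (observed - expected) / m


# Hypothetical graph: two disjoint complete bipartite blocks.
EDGES = [(u, v) for u in ("u1", "u2") for v in ("v1", "v2")] + \
        [(u, v) for u in ("u3", "u4") for v in ("v3", "v4")]
SPLIT = {"u1": 0, "u2": 0, "v1": 0, "v2": 0, "u3": 1, "u4": 1, "v3": 1, "v4": 1}
MERGED = {node: 0 for node in SPLIT}
Q_SPLIT = bipartite_modularity(EDGES, SPLIT)
Q_MERGED = bipartite_modularity(EDGES, MERGED)
```

Separating the two blocks yields Q = 0.5, while merging all nodes into one community yields Q = 0, which is the behavior a modularity comparison such as Figure 12 relies on.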

Therefore, the CIA algorithm proposed in this paper outperforms other algorithms in both artificial and real-world network datasets, which shows that the CIA algorithm can identify the true community structure of bipartite networks.

4.2. DyCIAComDet and Anomaly Behavior Detection

DyCIAComDet is applied to the real large-scale bipartite network composed of taxpayers and tax service providers to perform dynamic community structure detection and evolution trajectory tracking. By identifying the community structures and their evolutionary patterns, potential abnormal tax behaviors can be detected. The dataset includes 280,000 real-name tax transaction records.

Implementation of DyCIAComDet in the taxpayer–tax official network: The CIA algorithm is applied to the sliced taxpayer–tax official network at time $T_{0}$ for community detection. The resulting community structure, with 18 small communities in total, is shown in Figure 13 and is compared with the community-partitioning results of the BRIM algorithm, algorithms based on spectral clustering, and algorithms based on edge-clustering coefficients. The comparison of community partitioning performance, based on Q values, is shown in Figure 14. The CIA algorithm achieves a modularity Q value of 0.6134, which represents improvements of 22.7%, 9.6%, and 14.3% compared to the BRIM algorithm, spectral clustering algorithm, and edge clustering coefficient algorithm, respectively. This indicates that, in terms of modularity, the CIA algorithm detects a better community structure for the given network.

Figure 13. Community structure in the sliced network at $T_{0}$.

Figure 14. Comparison of Q values of community detection for the sliced network at time $T_{0}$.

After obtaining the community partitioning of the network at time $T_{0}$, the community evolution algorithm records the characteristics of each community at $T_{0}$ to facilitate community detection in the networks of subsequent time slices. For instance, Table 2 presents the feature sets of the five communities with the most nodes at time $T_{0}$. Then, the data at time $T_{1}$ are classified according to whether their edges are related to a historical community feature set. If a data node has an edge connected to a historical community feature set, the data node is directly assigned to that historical community, and the historical community feature set is updated. Otherwise, a new bipartite network structure at time $T_{1}$ is constructed, and the CIA algorithm is applied for community detection in the new network at time $T_{1}$.

Table 2. Characteristics of the community of the network at $T_{0}$.

In the taxpayer–tax-officer network, the data nodes at time $T_{1}$ are classified according to the following rules: (1) if there is an edge between the taxpayer and the tax officer in the historical community, and either the taxpayer or the tax officer belongs to the feature set of the historical community, then all nodes belong to the historical community; (2) if either the taxpayer or the tax officer belongs to the feature set of the historical community, then the other node is directly assigned to the historical community; (3) data nodes that do not possess the above two characteristics belong to the new network at time $T_{1}$. The algorithm iterates until $T_{n}$, when all data belong to a certain community and the evolution of the entire network stops.
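The incremental assignment at $T_{1}$ can be approximated by the Python sketch below. It is a simplification under stated assumptions: rules (1) and (2) are merged into a single membership test, an edge that touches several historical communities is given to the first match (tie-breaking is not specified in the text), and all node and community identifiers are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]  # (taxpayer ID, tax officer ID)


def assign_edges(edges: List[Edge],
                 feature_sets: Dict[str, Set[str]]) -> List[Edge]:
    """Attach new edges to historical communities where possible.

    feature_sets maps a community ID to its feature set of node IDs and is
    updated in place. The returned edges match no historical community and
    would form the new network on which community detection is rerun.
    """
    residual: List[Edge] = []
    for taxpayer, officer in edges:
        hit = next(
            (cid for cid, nodes in feature_sets.items()
             if taxpayer in nodes or officer in nodes),
            None,
        )
        if hit is None:
            residual.append((taxpayer, officer))        # rule (3)
        else:
            feature_sets[hit].update((taxpayer, officer))  # rules (1) and (2)
    return residual


# Hypothetical feature sets recorded at T0 and edges observed at T1.
FEATURES = {"C1": {"t1", "o1"}, "C2": {"t2", "o2"}}
NEW_EDGES = [("t1", "o9"), ("t7", "o2"), ("t8", "o8")]
RESIDUAL = assign_edges(NEW_EDGES, FEATURES)
```

Only the edge between `t8` and `o8` remains for fresh community detection; the other two edges extend the feature sets of C1 and C2.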

Based on the rules for determining tax-related anomalies, this study proposes two analysis strategies. The first strategy is to identify taxpayers who belong to more than eight different communities at different times and have low transaction weights; these taxpayers can be identified as scalpers. The second strategy is to identify taxpayers who consistently belong to fewer than three fixed communities with high connection weights; these taxpayers can be identified as Abnormal Alliance with tax officers. These transaction data will be further refined and judged in the sequential pattern mining section.
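The two strategies can be expressed as a simple rule over each taxpayer's community history. In the Python sketch below, the community-count thresholds (more than eight, fewer than three) come from the text, whereas the weight thresholds `low_weight` and `high_weight`, the use of a mean transaction weight, and the label strings are hypothetical choices made for illustration.

```python
from typing import Hashable, Sequence


def classify_taxpayer(communities_by_epoch: Sequence[Hashable],
                      mean_weight: float,
                      low_weight: float,
                      high_weight: float) -> str:
    """Apply the two screening strategies to one taxpayer.

    communities_by_epoch lists the community ID of the taxpayer in each
    time slice; low_weight and high_weight are assumed cut-offs.
    """
    distinct = len(set(communities_by_epoch))
    if distinct > 8 and mean_weight <= low_weight:
        return "scalper"
    if distinct < 3 and mean_weight >= high_weight:
        return "abnormal_alliance_candidate"
    return "unflagged"


# Hypothetical histories over 14 time slices.
MANY = list(range(11)) + [0, 1, 2]  # 11 distinct communities
ONE = ["c7"] * 14                   # always the same community
FEW = ["c1", "c2", "c3", "c4", "c5"] * 2 + ["c1"] * 4

LABEL_MANY = classify_taxpayer(MANY, mean_weight=1.0, low_weight=2.0, high_weight=5.0)
LABEL_ONE = classify_taxpayer(ONE, mean_weight=9.0, low_weight=2.0, high_weight=5.0)
LABEL_FEW = classify_taxpayer(FEW, mean_weight=3.0, low_weight=2.0, high_weight=5.0)
```

A flag produced this way is only a screening signal; the flagged records are then passed to sequential pattern mining for verification.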

While certain taxpayers (e.g., large state-owned enterprises or corporate headquarters) may naturally exhibit high-frequency interactions with the tax authority as a whole due to operational scale or designated compliance procedures, this study specifically restricts the context to transactions at on-site tax service hall windows rather than the institutional level. Under this constrained setting, if a taxpayer consistently interacts with the same tax officer at the same service window over time, such a pattern constitutes valid grounds for suspicion of anomalous behavior.

Figure 15 shows the communities to which a subset of taxpayer nodes belong at various time points. The time range covers from November 2020 to December 2021, totaling 14 time epochs. Vertically, individual nodes belong to different communities at different times, with multiple grids showing different colors. For instance, the taxpayer node 10116101000051727085 belongs to 11 distinct communities over 14 time epochs, and the taxpayer node 10116101000051620205 also belongs to 11 distinct communities in the same period. A large number of similar nodes exist, and these taxpayers can be identified as scalpers. In contrast, other taxpayers consistently belong to only a small number of communities: for example, the taxpayer node 10216101000000248479 always belongs to only one community in all data, and the taxpayer node 10116101000051455599 belongs to only two different communities. These taxpayers exhibit Abnormal Alliance behavior with tax officers and require further tracking and handling in subsequent steps. A qualitative review by tax experts found the groups identified in this study to be highly suspicious, and some results are highly consistent with historically known violation cases.

Figure 15. Community evolution.

This paper detects Abnormal Alliances between taxpayers and tax officers by applying the dynamic community detection algorithm, DyCIAComDet, to the real taxpayer–tax officer bipartite network. Furthermore, this paper seeks to verify the Abnormal Alliance behavior between taxpayers and tax officers. The data is then converted into a sequence database and a sequence matrix. The BMPS algorithm is applied for sequence pattern mining to identify frequent sequence abnormal business chains. Subsequently, the taxpayer and tax officer numbers that include these frequent sequences are obtained, thereby verifying the Abnormal Alliance behavior. The transaction dataset used in this study covers a one-year period from November 2020 to November 2021. All records are historical transaction logs collected from the on-site tax service halls; the dataset contains N transactions and was analyzed via monthly time slices to obtain community evolution. Although tax officers may rotate between windows in practice, our analysis does not assume permanent stationarity of taxpayer–tax officer links. Instead, we treat repeated, concentrated interactions over multiple time windows as signals for potential abnormal behavior.

We first analyze the tax business type for each transaction record and then apply the BMPS algorithm to obtain frequent sequence patterns within each business type. The IDs are represented by short characters, as shown in Table 3.

Table 3. Correspondence of ID and code.

5. BMPS Performance Analysis

Based on Table 3, due to the vast amount of data, a transaction database corresponding to the business types in Table 3 has been established, along with a transaction database of Abnormal Alliance, as shown in Table 4. We first convert the transaction database of Abnormal Alliance into a sequence database using the following rules. First, the taxpayer ID and the tax officer ID are combined to form the CID of the sequence database. Second, the business item sets are arranged in the order of occurrence time to form an ordered item set. The resulting sequence database is shown in Table 5. We then convert it into the sequence matrix; the Abnormal Alliance sequence matrix is depicted in Figure 16. Assuming $min\_sup = 4$, by scanning the sequence matrix in Figure 16, the supports of the 1-sequences are $\{⟨(a)⟩ : 5, ⟨(b)⟩ : 6, ⟨(c)⟩ : 4, ⟨(d)⟩ : 6, ⟨(e)⟩ : 7, ⟨(f)⟩ : 3, ⟨(g)⟩ : 5, ⟨(h)⟩ : 4\}$. All sequences except $⟨(f)⟩$ satisfy the minimum support threshold. After filtering out the sequence $⟨(f)⟩$, the remaining sequences are denoted as $L_{1}$. The projection matrix of $L_{1}$ is then generated by calculating the position value of $L_{1}$.

Table 4. Transaction database of Abnormal Alliance.

Table 5. Sequence database of Abnormal Alliance.

Figure 16. Sequence matrix of Abnormal Alliance.
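The two conversion rules (CID formed from the taxpayer and tax officer IDs, item sets ordered by occurrence time) can be sketched in Python as follows. The record layout, the timestamps, and the business codes are hypothetical and do not reproduce Table 4; business codes that share a timestamp are assumed to belong to the same item set.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# (taxpayer ID, tax officer ID, ISO timestamp, business code) -- assumed layout
Transaction = Tuple[str, str, str, str]
CID = Tuple[str, str]


def to_sequence_db(transactions: List[Transaction]) -> Dict[CID, List[Tuple[str, ...]]]:
    """Group transactions by (taxpayer, officer) and order item sets by time."""
    grouped: Dict[CID, Dict[str, Set[str]]] = defaultdict(lambda: defaultdict(set))
    for taxpayer, officer, timestamp, code in transactions:
        grouped[(taxpayer, officer)][timestamp].add(code)
    return {
        cid: [tuple(sorted(by_time[t])) for t in sorted(by_time)]
        for cid, by_time in grouped.items()
    }


# Hypothetical transaction log.
LOG: List[Transaction] = [
    ("4", "AA", "2021-02-01", "e"),
    ("4", "AA", "2021-01-05", "b"),
    ("4", "AA", "2021-01-05", "c"),
    ("3", "CC", "2021-01-09", "a"),
]
SEQUENCE_DB = to_sequence_db(LOG)
```

For this log, the pair (4, AA) yields the sequence ⟨(bc)(e)⟩ regardless of the order in which the records arrive, because ISO-formatted timestamps sort chronologically.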

According to the algorithm in Section 4.2, we first scan the projection matrix. Applying the rules for identifying duplicate projection matrices, we find that the projection matrices obtained with $⟨(c)⟩$ and with $⟨(bc)⟩$ as prefixes duplicate each other; the projection matrices obtained with $⟨(h)⟩$ and with $⟨(gh)⟩$ as prefixes duplicate each other; and the projection matrices obtained with $⟨(e)(e)(h)⟩$ and with $⟨(e)(h)⟩$ as prefixes duplicate each other. Therefore, each group of duplicate projection matrices is mined only once to prevent redundant data mining.

The final frequent sequences obtained with $⟨(a)⟩$ as prefix are $⟨(a)(d)⟩$ and $⟨(a)(e)⟩$; frequent sequences obtained with $⟨(b)⟩$ as prefix are $⟨(b)(d)⟩$, $⟨(b)(d)(e)⟩$, $⟨(b)(e)⟩$, $⟨(b)(g)⟩$, $⟨(bc)(e)⟩$, and $⟨(b)(c)⟩$; the frequent sequence obtained with $⟨(c)⟩$ as prefix is $⟨(c)(e)⟩$; frequent sequences obtained with $⟨(d)⟩$ as prefix are $⟨(d)(d)⟩$, $⟨(d)(e)⟩$, and $⟨(de)⟩$; frequent sequences obtained with $⟨(e)⟩$ as prefix are $⟨(e)(e)⟩$, $⟨(e)(d)⟩$, $⟨(e)(h)⟩$, $⟨(e)(g)⟩$, and $⟨(e)(gh)⟩$; the frequent sequence obtained with $⟨(g)⟩$ as prefix is $⟨(gh)⟩$; and the frequent sequence obtained with $⟨(h)⟩$ as prefix is $⟨(h)⟩$.

After obtaining the abnormal frequent sequences, the number of abnormal frequent sequences contained in each sequence formed by a taxpayer and a tax officer is counted, which is significant for further tax investigation. For example, sequence $d_{3}$ contains 9 abnormal frequent sequences, while sequence $d_{4}$ contains 17 abnormal frequent sequences. This implies that the suspicion level between taxpayer 3 and tax officer CC is lower than that between taxpayer 4 and tax officer AA. This suspicion level is crucial for tax inspection authorities to prioritize investigations of relevant taxpayers and officials.

In order to assess the performance of the BMPS algorithm in terms of time and space efficiency, we present comparisons of the spatiotemporal efficiency between BMPS and PrefixSpan at different minimum support levels. From Figure 17a, it is evident that the BMPS algorithm outperforms PrefixSpan in terms of time efficiency at different support levels. Especially when min_sup is small, there inevitably exists a huge amount of duplicate projection matrices in the sequence database, leading to a significant difference in time efficiency between BMPS and PrefixSpan. As min_sup increases, the time efficiency gap between BMPS and PrefixSpan narrows, because there are relatively fewer sequences that satisfy the min_sup criterion.

Figure 17. Time and space efficiency comparison between BMPS and PrefixSpan.

From Figure 17b, it is evident that BMPS also surpasses PrefixSpan in terms of space efficiency. In fact, when storing sequences, if the items within the sequence are of character type, at least 1 byte is required. However, if a bitmap matrix is used, only one bit is needed. When the item set contains a large number of items, the space efficiency advantage of BMPS over PrefixSpan becomes more apparent. The reason is that when storing items using a bitmap, there will be fewer 0 values in the columns of the bitmap matrix, preventing it from becoming a sparse matrix; thus, space can be more fully utilized compared to the PrefixSpan algorithm.
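The byte-versus-bit argument can be made concrete with a back-of-the-envelope calculation. The Python sketch below assumes one byte per stored item character and an uncompressed bitmap with one bit per (item, position) cell; both are simplifying assumptions, the sequences are hypothetical, and the sketch ignores delimiters and container overhead.

```python
import math
from typing import List, Set


def char_bytes(sequence: List[Set[str]]) -> int:
    """Bytes needed when every item occurrence is stored as one character."""
    return sum(len(itemset) for itemset in sequence)


def bitmap_bytes(n_items: int, n_positions: int) -> int:
    """Bytes needed for an n_items x n_positions bitmap, one bit per cell."""
    return math.ceil(n_items * n_positions / 8)


# Dense case: 8 distinct items, each present at all 4 positions.
DENSE = [set("abcdefgh") for _ in range(4)]
DENSE_CHAR = char_bytes(DENSE)
DENSE_BITMAP = bitmap_bytes(n_items=8, n_positions=4)

# Sparse case: 16 possible items, but a single item per position.
SPARSE = [{"a"}, {"b"}, {"c"}, {"d"}]
SPARSE_CHAR = char_bytes(SPARSE)
SPARSE_BITMAP = bitmap_bytes(n_items=16, n_positions=4)
```

Under these assumptions the dense sequence takes 32 bytes as characters but 4 bytes as a bitmap, whereas the sparse sequence takes 4 bytes as characters and 8 bytes as a bitmap, which is consistent with the observation that the bitmap pays off when item sets contain many items.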

To verify the effectiveness of the BMPS algorithm in big data scenarios, all the abnormal alliance data were converted into sequence matrices, and efficiency comparisons with PrefixSpan are shown in Figure 18. It can be found that the BMPS algorithm has significant advantages over the PrefixSpan algorithm in big data scenarios. The reason is that when each item set in the sequences contains large items, using bitmap matrices for storage can save a considerable amount of storage space. Furthermore, avoiding extensive scanning of duplicate projection databases leads to a noticeable improvement in time efficiency.

Figure 18. Time and space efficiency comparison on Abnormal Alliance.

From Figure 18a, it can be seen that when min_sup is 2 and 3, the BMPS algorithm improves time efficiency by 34.8% and 28.3%, respectively, compared with the PrefixSpan algorithm, which demonstrates that using bitmap storage and position values to reduce large scanning of duplicate projection matrices can greatly enhance the algorithm’s time efficiency. When min_sup is 4, 5, 6, and 7, the time efficiency improvement of BMPS over PrefixSpan decreases, suggesting that there are fewer sequences satisfying the min_sup threshold, and the depth of sequences obtained under BMPS and PrefixSpan is essentially the same. From Figure 18b, we can see that under different min_sup thresholds, BMPS saves more space than PrefixSpan. When min_sup is 4, BMPS saves the most space, which amounts to 35.3%; when min_sup is 5, it saves the least, which is 25.2%. This demonstrates the high applicability of bitmap storage in scenarios where item set sequences contain a huge number of items.

6. Conclusions and Discussion

Targeting the difficult problem of mining and verifying potential abnormal group behavior in a large-scale tax-related bipartite network composed of taxpayers and tax officials, this paper proposes a framework with two modules: abnormal group behavior identification and key feature verification.

Module one proposes DyCIAComDet, a dynamic community structure identification algorithm for large-scale time-varying bipartite networks, based on three indicators: cohesion, integration, and leadership. Performance comparisons were conducted on artificially generated data, the small-scale public Southern Women dataset, and large-scale real tax-related data, which validate the efficiency and superiority of DyCIAComDet.

Module two focuses on the verification of key features under Abnormal Alliance’s behavior. It first proposes the frequent sequence mining algorithm BMPS. This algorithm is based on the concepts of bitmap matrices, bit sequence matrices, projection matrices, and duplicate projection matrices. BMPS addresses the problem of low spatiotemporal efficiency in big data scenarios caused by repeated scanning of duplicate projection matrices in the PrefixSpan algorithm.

It should be noted that the coefficient 1.87 is optimized based on specific datasets and may be overfit to them, and its effectiveness in extremely sparse or scale-free networks requires further verification. In addition, the algorithm may carry an inherent bias of equating high-frequency interaction with fraudulent collusion. The fixed contacts required in legitimate business scenarios could thus be mislabeled, constituting an ethical risk that should be prioritized in practical applications.

Author Contributions

Methodology: B.Z.; Writing—original draft: B.Z. and Y.W.; Writing—review and editing: B.Z., Y.W., F.G., S.L., and X.X.; Formal analysis: F.G.; Validation: S.L. and F.G.; Visualization: X.X.; Supervision: Y.W.; Investigation: S.L.; Software: X.X. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the following funding sources: 1. Shaanxi Provincial Key R&D Program (Project No. 2018ZDXM-GY-036); 2. Natural Science Foundation of Shaanxi Province (Project No. 2021JM-344); 3. Independent Research Project of Shaanxi Provincial Key Laboratory of Network Computing and Security Technology (Project No. NCST2021YB-05).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy restrictions.

Acknowledgments

The authors acknowledge technical support from the High-Performance Computing Center of Xi’an University of Posts and Telecommunications. Special thanks to Zhang Wei for algorithm optimization guidance.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Figure 1. Illustration of a bipartite network.

Figure 2. Flowchart of the Static CIA Algorithm.

Figure 3. Community structure evolution flowchart of time-varying bipartite network.

Figure 4. Flowchart of BMPS algorithm.

Figure 5. Sequence bitmap matrix.

Figure 6. Projection matrix of sequence α.

Figure 7. Projection matrix of sequence β.

Figure 8. Artificial bipartite network.

Figure 9. Comparison results of NMI under four algorithms.

Figure 10. Network of Southern Women.

Figure 11. Community structure of Southern Women by CIA algorithm.

Figure 12. Comparison of the Q value of four community detection algorithms on the Southern Women dataset.

Figure 13. Community structure in the sliced network at $T_{0}$.

Figure 14. Comparison of Q values of community detection for the sliced network at time $T_{0}$.

Figure 15. Community evolution.

Figure 16. Sequence matrix of Abnormal Alliance.

Figure 17. Time and space efficiency comparison between BMPS and PrefixSpan.

Figure 18. Time and space efficiency comparison on Abnormal Alliance.

Table 1. Sequence database.

Sid	Sequence
1	<a(abc)(bc)b(cd)>
2	<(ad)(abc)(bcd)d>
3	<(be)(ae)(c)(d)f>
4	<a(be)(ce)(ac)>
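
To make the notation in Table 1 concrete, the following minimal Python sketch counts the support of a pattern, that is, the number of sequences that contain it as a subsequence. It assumes the standard sequential-pattern definition (each pattern element must be a subset of a distinct, later itemset); the names `SEQUENCE_DB`, `contains`, and `support` are hypothetical and are not taken from the BMPS implementation.

```python
from typing import Dict, List, Set

Sequence = List[Set[str]]

# Table 1 written out as ordered lists of itemsets (one list per Sid).
SEQUENCE_DB: Dict[int, Sequence] = {
    1: [{"a"}, {"a", "b", "c"}, {"b", "c"}, {"b"}, {"c", "d"}],
    2: [{"a", "d"}, {"a", "b", "c"}, {"b", "c", "d"}, {"d"}],
    3: [{"b", "e"}, {"a", "e"}, {"c"}, {"d"}, {"f"}],
    4: [{"a"}, {"b", "e"}, {"c", "e"}, {"a", "c"}],
}


def contains(sequence: Sequence, pattern: Sequence) -> bool:
    """Return True if `pattern` is a subsequence of `sequence`.

    Greedy matching: each pattern element is matched to the earliest
    remaining itemset that is a superset of it.
    """
    pos = 0
    for element in pattern:
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # the next element must match a strictly later itemset
    return True


def support(db: Dict[int, Sequence], pattern: Sequence) -> int:
    """Count the sequences in `db` that contain `pattern`."""
    return sum(contains(seq, pattern) for seq in db.values())


print(support(SEQUENCE_DB, [{"a"}, {"c"}]))       # pattern <ac>    -> 4
print(support(SEQUENCE_DB, [{"b", "c"}, {"d"}]))  # pattern <(bc)d> -> 2
```

Under this definition, <ac> occurs in all four sequences, whereas <(bc)d> occurs only in sequences 1 and 2, because no itemset in sequences 3 or 4 contains both b and c.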

Table 2. Characteristics of the community of the network at $T_{0}$.

Community Name	Feature Set
T0C01	{16101999829, 16101990334}
T0C02	{16101998453, 26100019940}
T0C03	{56101902020}
T0C04	{16101999724}
T0C05	{16101991292}

Table 3. Correspondence of ID and code.

ID	Code
10216101000000248479	1
10116101010000225684	2
10216101000000116589	3
10216101000000229519	4
10216101000000135342	5
10116101000051770601	6
10116101010000363030	7
16101991299	AA
16101990314	BB
16101998664	CC
16101991332	DD
A00000010200003	a
A00000010200006	b
A00000010200016	c
A00000010200020	d
A00000010200021	e
A00000010200022	f
A00000010200025	g
A00000010200026	h

Table 4. Transaction database of Abnormal Alliance.

Taxpayer id	Tax Officer id	Item Set	Date
1	AA	(a)	2020-11
2	BB	(bc)	2020-11
3	CC	(abc)	2020-11
5	DD	(bc)	2020-11
3	CC	(ab)	2020-12
4	AA	(abe)	2020-12
6	CC	(b)	2020-12
1	AA	(bce)	2021-01
7	DD	(de)	2021-01
5	DD	(cef)	2021-01
2	BB	(g)	2021-02
4	AA	(e)	2021-02
6	CC	(ef)	2021-02
1	AA	(de)	2021-03
2	BB	(abc)	2021-03
4	AA	(gh)	2021-04
2	BB	(bcd)	2021-04
6	CC	(degh)	2021-05
5	DD	(fgh)	2021-05
3	CC	(c)	2021-06
7	DD	(de)	2021-06
1	AA	(f)	2021-07
3	CC	(bcd)	2021-07
5	DD	(h)	2021-08
6	CC	(bde)	2021-08
7	DD	(gh)	2021-09
4	AA	(dh)	2021-09
2	BB	(de)	2021-10
7	DD	(a)	2021-10
3	CC	(be)	2021-11
1	AA	(ad)	2021-11
4	AA	(e)	2021-12

Table 5. Sequence database of Abnormal Alliance.

Combined Code (CID)	Seq. Num.	Sequence Database
(1, AA)	$d_{1}$	<a(bce)(de)f(ad)>
(2, BB)	$d_{2}$	<(bc)g(abc)(bcd)(de)>
(3, CC)	$d_{3}$	<(abc)(ab)c(bcd)(be)>
(4, AA)	$d_{4}$	<(abe)e(gh)(dh)e>
(5, DD)	$d_{5}$	<(bc)(cef)(fgh)h>
(6, CC)	$d_{6}$	<b(ef)(degh)(bde)>
(7, DD)	$d_{7}$	<(de)(de)(gh)a>
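
The relationship between Table 4 and Table 5 can be illustrated with a short Python sketch: transactions are grouped by the combined code (taxpayer, tax officer) and their item sets are concatenated in date order. This is only an illustration of the table-to-table mapping, assuming the grouping key shown in the "Combined Code (CID)" column; the names `TRANSACTIONS`, `build_sequence_db`, and `format_sequence` are hypothetical and do not come from the authors' code.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Rows of Table 4: (taxpayer code, tax officer code, item set, year-month).
TRANSACTIONS: List[Tuple[int, str, str, str]] = [
    (1, "AA", "a", "2020-11"), (2, "BB", "bc", "2020-11"),
    (3, "CC", "abc", "2020-11"), (5, "DD", "bc", "2020-11"),
    (3, "CC", "ab", "2020-12"), (4, "AA", "abe", "2020-12"),
    (6, "CC", "b", "2020-12"), (1, "AA", "bce", "2021-01"),
    (7, "DD", "de", "2021-01"), (5, "DD", "cef", "2021-01"),
    (2, "BB", "g", "2021-02"), (4, "AA", "e", "2021-02"),
    (6, "CC", "ef", "2021-02"), (1, "AA", "de", "2021-03"),
    (2, "BB", "abc", "2021-03"), (4, "AA", "gh", "2021-04"),
    (2, "BB", "bcd", "2021-04"), (6, "CC", "degh", "2021-05"),
    (5, "DD", "fgh", "2021-05"), (3, "CC", "c", "2021-06"),
    (7, "DD", "de", "2021-06"), (1, "AA", "f", "2021-07"),
    (3, "CC", "bcd", "2021-07"), (5, "DD", "h", "2021-08"),
    (6, "CC", "bde", "2021-08"), (7, "DD", "gh", "2021-09"),
    (4, "AA", "dh", "2021-09"), (2, "BB", "de", "2021-10"),
    (7, "DD", "a", "2021-10"), (3, "CC", "be", "2021-11"),
    (1, "AA", "ad", "2021-11"), (4, "AA", "e", "2021-12"),
]


def build_sequence_db(rows: List[Tuple[int, str, str, str]]) -> Dict[Tuple[int, str], List[str]]:
    """Group item sets by (taxpayer, officer), ordered by date."""
    grouped: Dict[Tuple[int, str], List[str]] = defaultdict(list)
    # "YYYY-MM" strings sort chronologically as plain text.
    for taxpayer, officer, items, _month in sorted(rows, key=lambda r: r[3]):
        grouped[(taxpayer, officer)].append(items)
    return dict(grouped)


def format_sequence(itemsets: List[str]) -> str:
    """Render a sequence in the notation of Table 5, e.g. <a(bce)(de)f(ad)>."""
    parts = [s if len(s) == 1 else f"({s})" for s in itemsets]
    return "<" + "".join(parts) + ">"


sequence_db = build_sequence_db(TRANSACTIONS)
for cid in sorted(sequence_db):
    print(cid, format_sequence(sequence_db[cid]))
```

Run on the rows of Table 4, the sketch yields seven sequences, one per combined code, matching the entries $d_{1}$ to $d_{7}$ in Table 5.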

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
