Dynamic Weights Based Risk Rule Generation Algorithm for Incremental Data of Customs Declarations

Aimed at shortcomings such as the small number of risk rules available to assist decision-making in customs entry inspection scenarios and their reliance on expert experience, a dynamic weight assignment method based on the attributes of customs declaration data and an improved dynamic-weight Can-Tree incremental mining algorithm are proposed. In this paper, we first discretize the customs declaration data, and then form composite attributes by combining and expanding the attributes, which is conducive to generating rules with risk judgment significance. Then, weights are determined according to the characteristics and freshness of the customs declaration data, and the weighting method is applied to the Can-Tree algorithm for incremental association rule mining to automatically and efficiently generate risk rules. Experimental comparison with the FP-Growth and traditional Can-Tree algorithms shows that the improved dynamic-weight Can-Tree incremental mining algorithm occupies less memory space and is more time efficient. The introduction of dynamic weights makes it possible to visually distinguish the importance level of customs declaration data and to mine more representative rules. The dynamic weights combine confidence and lift to further improve the accuracy and positive correlation of the generated rules.


Introduction
Since China became a member of the World Trade Organization, the volume of import and export trade has grown significantly and the task of customs supervision has become increasingly heavy. Existing customs supervision relies heavily on manual work, and the current risk decision system can hardly support the security access checks required in the new era. Studying customs risk data and mining the potential correlations therein to form a risk rule base can significantly reduce the pressure on customs supervision.
Commodities need to be declared before customs entry, and upon entry customs operations personnel can make preliminary judgments on suspected risky items based on a library of expert rules. Customs' existing risk rules are defined by experts based on observations of the original business data and on their own work experience. As a result, potential common relationships among business data features are difficult to discover, the scalability of rules proposed by experts in different fields is limited, and the application effect of the same rule varies across scenarios, making it difficult to extend a rule to all customs ports. To overcome the limited number of expert rules, their insufficient scalability, and the limits of manual effort, risk rules can be generated automatically using methods such as machine learning and used to improve the intelligence of customs' risk screening systems.
Customs operational data (e.g., customs declarations) contain information on all three stages of goods, namely departure, transportation, and entry, and this type of information is not yet fully used. If the risk information contained in historical inspection results can be mined to obtain the implied risk association rules, business personnel can automatically determine whether goods need to be opened for inspection by matching these risk association rules against subsequent customs declarations. Currently, data mining is an important method for transforming information into value; it mainly refers to the process of discovering knowledge from data, i.e., mining potentially valuable information from massive data [1]. Among such methods, association rule mining is a rule-based unsupervised machine learning method that can discover potential relationships among items in massive data. Unlike other machine learning methods, such as decision trees and random forests, association rules focus on the relationships among data features, while the former focus more on how to make decisions, so association rules are more suitable for rule generation from customs business data [2][3][4]. In 1993, Agrawal and Srikant proposed the Apriori algorithm, which uses a bottom-up, iterative, layer-by-layer search for frequent item sets [5]. To solve the problems that the Apriori method requires several scans of the database and generates a large number of candidate item sets, Han et al. proposed the FP-Growth algorithm based on frequent pattern trees in 2000. The algorithm scans the database twice and stores the frequent items of each sample in the FP-Tree in descending order of support [6]. Although this algorithm reduces the mining time by not generating candidate items during the discovery process, the FP-Tree requires a large amount of memory because of the large number of frequent patterns to be stored. Moreover, when mining dynamic data, the FP-Growth algorithm needs to rebuild the tree several times as the data volume changes, which affects the total time efficiency of rule mining [7]. To address this problem, Leung et al. proposed the Can-Tree (Canonical-order tree) algorithm, an improvement based on the tree structure [8]. It arranges the sample data items of the transaction database in a user-specified canonical order and constructs a Can-Tree from the ordered items, i.e., merges new items into a prefix tree, which solves the problem of FP-Growth having to rebuild the tree when faced with incremental data and achieves "build once, mine many times". Although Can-Tree avoids reconstructing the tree structure when nodes change during data increments, it occupies more memory space than the reconstructed FP-Tree. Therefore, when dealing with a large amount of data, the time efficiency of both tree building and mining decreases because of the increased memory footprint of the tree. Chen Gang et al. reduced the time for pruning and conditional pattern base generation by replacing the parent-child node pointers [9]. Hu Jun et al. used data item sorting instead of the original ordering to reduce the size of the Can-Tree and improve the mining efficiency for incremental data [10]. Hong Yan et al. used clustering to divide the original dataset into multiple blocks according to item support and cut the blocks that do not meet the minimum support to reduce the data size, thus improving the mining efficiency of the algorithm [11]. This method reduces the size of the tree to a certain extent, but cutting by support size tends to lose part of the incremental data. Therefore, instead of using support-count filtering to prune sample data items when the frequent pattern tree is first constructed, the stored data records are compressed by improving the data item ordering, and data items are then pruned by the set support threshold in the subsequent incremental mining stage [12,13].
Research on rule mining based on customs business data is currently relatively scarce; it has mainly drawn on the similarity-mining characteristics of association rules, which can generate new risk rules automatically by analyzing the inspection data of historical customs declarations. In this paper, considering that customs declaration data grow dynamically with customs operations, the Apriori and FP-Growth algorithms are no longer suitable for dynamic data mining, so the Can-Tree algorithm is adopted for risk rule mining to cope with the data newly added by customs in real time. Customs declaration attributes include date, port, importer, trade country, quantity, weight, FOB (Free On Board) price, CIF (Cost Insurance Freight) price, tariff, etc. The degree of risk implied by different data attributes varies. Therefore, when mining, it is necessary to integrate the importance of the different attributes to assign weights, and then rank the data items of each sample by weight from largest to smallest. This ranking reasonably reflects the degree of association of each attribute in the topological network and thus efficiently compresses the storage space of the frequent pattern trees [14,15]. Meanwhile, the introduction of composite attributes makes the rule mining results more consistent with the characteristics of customs risk checking. Based on the existing incremental data mining strategy [16], newly generated incremental data are inserted into the Can-Tree structure built from the initial transaction database, laying the foundation for generating rules with new risk characteristics. In the incremental data mining process, the current steps can be simplified by reusing the results of the previous round of mining, which ultimately improves the time efficiency of the incremental mining model. Two additional problems need to be addressed: how to mine risk rules from incremental data (the dynamically increasing business data of customs declarations) more efficiently, and how to improve the validity and relevance of the generated rules. In this paper, we propose a risk-attribute combination expansion method and an improved Can-Tree incremental mining algorithm (dynamic-weight Can-Tree) based on the sequential compressed storage of the dynamic weights of data items, which can highlight the contribution of different attributes to risk and improve the mining efficiency of association rules. In this method, we consider that different attributes of customs declaration data have different degrees of importance in the risk rules and lie on different branches of the constructed association network. Therefore, before rule mining, reasonable weights are assigned by combining the importance of attribute values and sample freshness, making the results of association rule mining more consistent with the requirements of customs risk checking. Meanwhile, the customs declaration contains some numerical attributes (such as weight, price, etc.) from which it is difficult to directly generate risk rules with high evaluability. Therefore, the evaluability of the rules is appropriately improved by means of data intervalization and attribute combination expansion.
Finally, in the face of changes in customs inspection risk and the arrival of new data, the incremental mining strategy is used to solve the problem of the high time cost arising from the reconstruction and mining of association networks. Meanwhile, the dynamic weighted-support-level, dynamic weighted-confidence-level, and dynamic weighted-lift-level are introduced to improve the effectiveness and relevance of the rules. This paper is organized into five sections. Section 1 introduces the background and significance of this work, as well as the state of research on association rule algorithms. Section 2 gives the method for combining and expanding the numerical data of the customs declaration and the data discretization scheme. Section 3 proposes the dynamic-weight incremental Can-Tree mining algorithm: first, the traditional Can-Tree algorithm is introduced; then the definitions related to the dynamic weight of data items are described; finally, the specific improvements and processes of the dynamic-weight Can-Tree algorithm are described. Section 4 compares the proposed algorithm with other algorithms experimentally and analyzes its efficiency in time and space. Section 5 summarizes the research and points out the shortcomings of the current work.

Outlier-Optimized Bi-KMeans
In association rule mining, the discretization of numerical data is an important processing step. For association rule mining on data with quantitative value changes, generalized fuzzy intervals are first used in place of the numerical objects, and the mining tasks are then executed on them to obtain more effective association rules. The customs declaration data samples contain a large number of numerical attributes, such as quantity, weight, FOB price, CIF price, and tariff. Such data are discontinuously distributed and take many distinct values. Left unprocessed, these numerical attributes occupy a large amount of memory during the construction of the frequent pattern trees, which ultimately affects the efficiency of rule mining. At the same time, in order to generate rules with risk screening meaning, the numerical values need to be converted into intervals with common judgmental metrics, such as weight-low (indicating a weight between 0 and z_1). Therefore, a clustering algorithm is used to divide each numerical attribute into several feature intervals according to the distribution of its values and to construct fuzzy values for the numerical attributes of the customs declaration.
In this paper, based on the obtained customs declaration inspection data, we screen out the attributes with high risk measurement significance: quantity, weight, FOB price, CIF price, and tariff. These data are widely and unevenly distributed and cannot be used directly as attribute indicators for rule mining. The four attributes of quantity, weight, CIF, and FOB are clustered and analyzed in turn to form multiple clusters of neighboring values. Instead of discrete values, the intervals in which the clustering results fall are used to describe the numerical data of the customs declaration, which in turn provides usable input data for subsequent association rule mining [17]. The traditional KMeans algorithm is limited by the initial clustering centers and by outliers when dealing with inhomogeneous data; the stability of the clustering obtained during data discretization is therefore limited, and it is difficult to discover the interval change points of the numerical data [18,19]. For this reason, a Bi-KMeans algorithm with outlier optimization is proposed; its steps are shown below.
(1) Optimize outliers. Calculate the minimum distribution density (minP) of the samples and remove outlier sample points whose density is smaller than minP. (2) Select the starting centroids and divide the sample data. Randomly select N initial clustering centers from the numerical attributes of the customs declaration data, calculate the distances from all data sample points to the N cluster centers, and assign each point to the cluster with the nearest center. The stability of the clustering algorithms is evaluated by the clustering criterion function (SSE, sum of squared errors); a comparison of the outlier-optimized Bi-KMeans algorithm with other clustering algorithms is shown in Figure 1. The outlier-optimized Bi-KMeans algorithm is used to discretize the customs declaration data attributes quantity, weight, FOB, CIF, and tariff, respectively; the clustering results are shown in Figure 2.
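As an illustration of the two steps above, the following is a minimal sketch of the outlier-optimized discretization, assuming a density threshold (min_density) and a target number of intervals; the density estimate and parameter names are illustrative assumptions, not the paper's exact implementation.

```python
# Hedged sketch of outlier-optimized bisecting k-means discretization of one numerical attribute.
import numpy as np

def filter_outliers(values, min_density):
    """Drop points whose local density (neighbors within one standard deviation) is below min_density."""
    values = np.asarray(values, dtype=float)
    radius = values.std() if values.std() > 0 else 1.0
    density = np.array([(np.abs(values - v) <= radius).sum() for v in values])
    return values[density >= min_density]

def bisecting_kmeans_1d(values, n_intervals, n_iter=20):
    """Repeatedly split the cluster with the largest SSE until n_intervals clusters exist."""
    clusters = [np.sort(values)]
    while len(clusters) < n_intervals:
        sse = [((c - c.mean()) ** 2).sum() for c in clusters]
        target = clusters.pop(int(np.argmax(sse)))
        centers = np.array([target.min(), target.max()], dtype=float)
        for _ in range(n_iter):                      # plain 2-means on the chosen cluster
            labels = np.abs(target[:, None] - centers[None, :]).argmin(axis=1)
            centers = np.array([target[labels == k].mean() if (labels == k).any()
                                else centers[k] for k in (0, 1)])
        clusters += [target[labels == 0], target[labels == 1]]
    clusters.sort(key=lambda c: c.mean())
    # interval boundaries are the midpoints between adjacent cluster edges
    return [(lo.max() + hi.min()) / 2 for lo, hi in zip(clusters[:-1], clusters[1:])]

# Example: discretize a synthetic "weight" attribute into low/medium/high intervals.
weights = np.random.lognormal(mean=3.0, sigma=1.0, size=1000)
kept = filter_outliers(weights, min_density=5)
print(bisecting_kmeans_1d(kept, n_intervals=3))
```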

Property Combination Extensions
Attributes such as quantity, weight, FOB, CIF, and tariff are quantitative descriptions of inbound and outbound goods, which reflect the objective characteristics of the goods. Because numerical attributes are interrelated and not independently distributed, the contribution of a single attribute to the risk of a good is low [20]. Therefore, there is a need to discover the intrinsic relationships between attributes, i.e., to extend the attributes in the form of data combinations to form composite attributes that relate the relevant characteristics of the goods together. Compared with single attributes, composite attributes can describe the implied associated risks in customs declarations, discover the risk rules embedded in them, and improve the efficiency of fast customs clearance of goods.

Definition 1.
Let the ratio of price (FOB/CIF) to weight w be the price-to-weight ratio, representing the matching relationship between price and weight in the transaction record (quan_wt).

Definition 2.
Let the ratio of price (FOB/CIF) to quantity c be the price to quantity ratio, representing the matching relationship between price and quantity in the transaction record (price_quan).

Definition 3.
Let the ratio of quantity to weight be the weight-to-quantity ratio, representing the matching relationship between weight and quantity in the transaction record (wt_quan).

Definition 4.
Let the ratio of tariff to quantity be the tariff-quantity ratio, representing the matching relationship between tax and quantity in the transaction record (tax_quan).

Definition 5.
Let the ratio of tariff t to weight w be the tariff-to-weight ratio, which represents the matching relationship between tax and weight in the transaction record (tax_wt).

Definition 6.
Let the difference between FOB and CIF be the price difference, representing the relationship between the difference in bilateral trade prices in the transaction records (fob_cif).
Based on the defined composite attributes, the Bi-KMeans algorithm for outlier optimization is used to discretize the data separately to construct the fuzzy values of the clustering space for the new attributes.
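For clarity, the six composite attributes can be computed from a single declaration record as in the sketch below; the field names (fob, cif, weight, quantity, tariff) are assumptions about the record layout, and non-zero denominators are taken for granted.

```python
# Illustrative computation of the composite attributes of Definitions 1-6 for one record.
def composite_attributes(rec):
    price = rec["fob"]  # CIF could be substituted per Definitions 1-2
    return {
        "quan_wt":    price / rec["weight"],            # Definition 1: price-to-weight ratio
        "price_quan": price / rec["quantity"],          # Definition 2: price-to-quantity ratio
        "wt_quan":    rec["quantity"] / rec["weight"],  # Definition 3: quantity/weight ratio
        "tax_quan":   rec["tariff"] / rec["quantity"],  # Definition 4: tariff-to-quantity ratio
        "tax_wt":     rec["tariff"] / rec["weight"],    # Definition 5: tariff-to-weight ratio
        "fob_cif":    rec["fob"] - rec["cif"],          # Definition 6: FOB-CIF price difference
    }

print(composite_attributes({"fob": 1200.0, "cif": 1350.0, "weight": 500.0,
                            "quantity": 100, "tariff": 96.0}))
```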

Traditional Can-Tree Algorithm
In order to solve the problem that the FP-Growth algorithm must rebuild the tree several times during incremental data mining, the Can-Tree algorithm uses a prefix tree with an ordered structure, compresses the stored data nodes, and merges identical transaction data items to reduce the size of the data structure. The algorithm first scans the initial transaction data set and sorts the data items, and then constructs the frequent pattern tree. When incremental data arrive, the transaction data items are simply sorted in alphabetical order and inserted one by one into the existing frequent pattern tree. This approach avoids repeatedly scanning previous data items and eliminates the need to build a new frequent pattern tree. It therefore greatly reduces the time spent scanning the initial database several times and improves the efficiency of incremental association rule mining.
The Can-Tree is constructed as follows.
(1) First, scan the transaction database (DB), then sort the data items of each record in alphabetical or dictionary order; set the initial Can-Tree root node to T, a null value, as the starting node of the tree. (2) Insert the sorted data items of each sample from DB into the Can-Tree in order, and find whether the same node already exists. If the same node exists, the count value of the same node will be added by one, and no node insertion operation will be performed; if the node does not exist, the node creation, insertion, and update count operations are performed. (3) Repeat the steps in (2) until all data items are inserted.
The Can-Tree mining process is as follows: (1) Mine the constructed Can-Tree path nodes top-down, find all leaf nodes of the Can-Tree, construct a conditional pattern base (CPB) for each leaf node item, i.e., record the prefix path nodes of the leaf node item, and calculate the frequency sum of all data items on the path of the CPB. (2) Discover all conditional pattern bases and build the CPB set; accumulate the frequency of the data items on each CPB, filter out the data items below the Minimum Support (minSup), build a new header table (used to record the frequencies of the 1-frequent items), and build a conditional FP-Tree based on it. (3) Finally, recursively mine each conditional FP-Tree, collecting and merging the postfix frequent item sets, until the conditional FP-Tree has a single path or is empty; a single path indicates that the items on that path all form frequent item sets.
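A minimal sketch of the Can-Tree construction and conditional-pattern-base retrieval described above follows; it uses plain lexicographic order as the canonical order and omits the recursive conditional-tree mining and minSup filtering, so it is illustrative only.

```python
# Sketch of canonical-order Can-Tree construction and CPB extraction.
class CanNode:
    def __init__(self, item, parent):
        self.item, self.parent, self.count, self.children = item, parent, 1, {}

def insert_transaction(root, items):
    """Insert one transaction whose items are already in canonical (e.g., dictionary) order."""
    node = root
    for it in items:
        if it in node.children:
            node.children[it].count += 1
        else:
            node.children[it] = CanNode(it, node)
        node = node.children[it]

def build_can_tree(transactions):
    root = CanNode(None, None)                       # root T holds a null value
    for tx in transactions:
        insert_transaction(root, sorted(tx))         # canonical order: lexicographic sort
    return root

def conditional_pattern_base(leaf):
    """Walk parent pointers upward to record the leaf's prefix path and its count."""
    path, node = [], leaf.parent
    while node is not None and node.item is not None:
        path.append(node.item)
        node = node.parent
    return list(reversed(path)), leaf.count

tree = build_can_tree([["port_A", "country_X", "weight_high"],
                       ["port_A", "country_Y", "weight_low"],
                       ["port_A", "country_X", "weight_high"]])
leaf = tree.children["country_X"].children["port_A"].children["weight_high"]
print(conditional_pattern_base(leaf))                # (['country_X', 'port_A'], 2)
```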

Dynamic-Weight for Customs Declaration Data Items
In order to efficiently implement the task of customs risk rule generation, data pre-processing, data discretization, and attribute combination expansion are first performed on the customs declaration data (divided into two parts: original data and incremental data). Then, weights are assigned to the attributes of the customs declaration data by combining the distribution characteristics of the attributes with the complexity of the nodes in the constructed mining network, so as to rank the data items of each sample by importance. Finally, all the records in the customs declaration transaction database (DB) are compressed into the Can-Tree, on the basis of which strong association rules carrying customs risk information are generated.
In actual customs risk screening operations, the reliability of the old data gradually decreases as time increases, and the risk information contained in the new data is more in line with the changing policies and practical needs. Therefore, the freshness of new incremental data is higher, and the weight of incremental data needs to be increased appropriately in the actual model calculation. In order to improve the accuracy and freshness of the generated risk rules, this paper considers the allocation of risk weights from three perspectives: customs ports, trade countries, and the timeliness of customs declaration data.
(1) Customs port importance weights. According to the ranking of national port cargo throughput released by the Ministry of Transport of China in 2021, the carrying capacity and scale of different customs ports differ, and so does their contribution to the rules. The greater the carrying capacity of a port, the more frequently it appears in the customs declaration data, the more branches its nodes have in the association network, and the higher the importance of the attribute. In this paper, customs ports are divided into four levels according to the throughput of port business (from small to large), and the weights w_P (0 ≤ w_P ≤ 1) are set to 0.25, 0.50, 0.75, and 1.00.
(2) Trading country importance weights. Unlike customs ports, which are graded by throughput, the trade country attribute needs to be considered in terms of trade relationships. Countries with more trade transactions have a higher data volume and relatively reliable cooperation relationships, so more reliable association rules can be mined from them. Conversely, countries with fewer trade transactions have less importance in the association network due to their smaller data volume. Therefore, the importance weights of trade countries are graded into high-frequency trade countries and general trade countries, and the corresponding weights w_c (0 ≤ w_c ≤ 1) are set to 0.4 and 0.6.
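A small sketch of how these static importance weights could be looked up is shown below; the level cut-offs and the mapping of the two trade-country values onto the two grades simply follow the order of the wording above and are otherwise assumptions.

```python
# Hedged sketch of the static importance weight lookup for ports and trade countries.
PORT_LEVEL_WEIGHT = {1: 0.25, 2: 0.50, 3: 0.75, 4: 1.00}   # four levels ordered by throughput (small to large)
COUNTRY_WEIGHT = {"high_frequency": 0.4, "general": 0.6}    # values listed in the text, in that order

def port_weight(throughput_level):
    """Return w_P for a port graded into one of four throughput levels."""
    return PORT_LEVEL_WEIGHT[throughput_level]

def country_weight(grade):
    """Return w_c for a trade country graded as 'high_frequency' or 'general'."""
    return COUNTRY_WEIGHT[grade]

print(port_weight(3), country_weight("general"))  # 0.75 0.6
```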

(3) Data sample timeliness weights
The risks faced by customs inspection are often fluid and change along with changes in policies, bills, and unexpected events. In the mining of customs declaration historical business data, data closer to the current point in time are more important than those further away. New incremental data are given more weight in the model mining process to improve the real-time availability of the mined rules. The commonly used item set weighting methods mainly use the average function method and the weighting method based on the base of data items, ignoring the influence of the timeliness of the incremental data in the transaction database on the weights, and using only the frequency of data items and fixed weights for evaluation. In this paper, based on the characteristics of the timeliness of customs declaration data and its risk research needs, we propose an item set evaluation method based on dynamic timeliness weights. The method takes into account both the calculation of data item bases and their weights, and includes the importance differences between different incremental data in the transaction database. It significantly improves the efficiency of model mining and can mine high-quality and highly available risk rules.
Based on Zhao et al.'s study [21], the average function method is introduced to evaluate incremental data weights on the basis of a time-based weight calculation. The data samples in the DB are divided into N parts according to their chronological order, and each time period of the transaction database is assigned a weight w_t (0 ≤ w_t ≤ 1). Due to the introduction of temporal weights, the same data item in the DB may carry several different weights because its occurrences fall into different time intervals. Therefore, the weighted temporal support (Sup_wt) of a sample data item must be recalculated from the temporal weights of the samples and the item counts; the dynamic temporal weight w_θ and the weighted temporal support Sup_wt of the data items are calculated accordingly. The introduction of dynamic risk weights requires the calculation of weighted support, weighted confidence, and weighted lift for the data items in the customs declaration transaction database, according to the set weight importance level and the frequency count.
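The original equations for w_t and Sup_wt are not reproduced here; a plausible form consistent with the description (N chronological partitions, newer partitions weighted more heavily, weights bounded by 1) is sketched below as an assumption rather than the paper's verbatim formulas.

```latex
% Hedged reconstruction: the t-th of N chronological partitions of DB receives the weight
w_t = \frac{t}{N}, \qquad t = 1, \dots, N, \qquad 0 < w_t \le 1,
% so newer partitions weigh more, and the weighted temporal support of item-set X
% accumulates its per-partition occurrence counts f_t(X) under those weights:
Sup_{wt}(X) = \sum_{t=1}^{N} w_t \, f_t(X).
```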

Definition 7.
Weighted support of an item-set X:

$$Sup(X) = \frac{w_P\,x_p + w_c\,|x_c| + Sup_{wt}(X)}{|DB|} \qquad (6)$$

where x_p and |x_c| denote the support counts when the data attributes are ports and trade countries, respectively; w_P and w_c are the importance weights of ports and trade countries; Sup_{wt}(X) denotes the time-weighted support frequency of the data item; and |DB| is the total sample size of the DB.

Definition 8.
The weighted support for the simultaneous occurrence of item-set X and item-set Y in the transaction database is:

$$Sup(X \cup Y) = \frac{w_P\,y_p + w_c\,|y_c| + Sup_{wt}(XY)}{|DB|}$$

where y_p represents the support count when the data attribute is a port; |y_c| represents the support count when the data item value is a trade country; and Sup_{wt}(XY) represents the time-weighted support frequency of item-set X and item-set Y occurring together.
Definition 9. The degree of correlation between item-set X and item-set Y is called the weighted-lift-level.
Definition 10. The probability that item-set Y occurs given the occurrence of item-set X is called the weighted-confidence-level.
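Definitions 9 and 10 follow the standard lift and confidence forms evaluated with the weighted supports of Definitions 7 and 8; one way to write them out (an interpretation, since the original equations are not reproduced here) is:

```latex
% Weighted-confidence-level (Definition 10) and weighted-lift-level (Definition 9),
% expressed with the weighted supports Sup(X), Sup(Y), Sup(X \cup Y):
Conf_w(X \Rightarrow Y) = \frac{Sup(X \cup Y)}{Sup(X)}, \qquad
Lift_w(X \Rightarrow Y) = \frac{Sup(X \cup Y)}{Sup(X)\,Sup(Y)}.
```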

Dynamic-Weight Can-Tree Incremental Mining Algorithm
As the volume of data continues to increase, the traditional Can-Tree algorithm suffers from several problems. The size of the frequent pattern tree built in alphabetical or dictionary order gradually increases, putting heavy pressure on memory. Although a count-based sorting algorithm can significantly reduce the number of tree nodes and the tree size, it takes more time to recalculate and re-sort the data item support counts each time new incremental data arrive. The traditional sorting method based on alphabetical or dictionary order has two drawbacks when applied to customs declaration data mining: (1) it cannot reflect the importance of different attribute values to customs risk; (2) it requires recalculating the support count of each item whenever new incremental data are added, so the mining results of the previous round cannot be fully reused, which ultimately affects the mining efficiency of the model. To solve these problems, this paper proposes the dynamic-weight Can-Tree incremental mining algorithm. The algorithm builds on the child-parent node pointer introduced by Chen Gang and Hu Jun et al. and combines the risk-weight importance model of customs data with a new incremental mining strategy. It is used to solve the problems of the large memory footprint of the traditional Can-Tree algorithm and the insufficient reuse of information between rounds of incremental mining, and finally to improve the mining efficiency of the association rule algorithm. The specific improvements are as follows.
(1) To reduce the size of the Can-Tree, the Can-Tree construction is pre-pruned during incremental mining by pruning the item sets that do not satisfy the minimum support (reducing the number of data records from the DB that enter the tree). (2) To address the risk rule mining problems of customs declaration data, the original dictionary-order sorting of the Can-Tree is replaced by a count-based sorting that incorporates the dynamic risk weights of the data items, while data items not combined with weights are sorted in alphabetical order. Since some customs declaration data items, such as the port and trade country attributes, occupy more branches in the constructed frequent pattern tree, such items are sorted first (in insertion order) to help reduce the size of the tree; accordingly, their weights have a higher degree of importance. Although prefix nodes can also be shared using count-based sorting alone, each change in data volume during the iterative construction of the Can-Tree on incremental data affects the item counts and thus the sorting result. Therefore, some data items are sorted by their calculated weighted support, and general data items with no risk weight set, or with the same weighted support, are sorted alphabetically. This change makes the sorting order of the data items in each inserted sample follow a pattern of decreasing importance, enables more data items to share the same prefix, and thus reduces the size of the Can-Tree. (3) Based on the improved FP-Growth algorithm proposed by He et al. [22], a child-parent node design is introduced into the Can-Tree to form a Can-Tree structure containing bidirectional pointers (prev, next). The prev pointer enables fast bottom-up iteration over the CPBs of the Can-Tree child nodes. This paper additionally adds a hash table structure that stores the positions and counts of the leaf nodes of the constructed Can-Tree (leafNode_hashTbl). Using leafNode_hashTbl, the CPB of each leaf node can be obtained without re-traversing the whole Can-Tree. Compared with the O(log N) time complexity of a binary search for child nodes, the hash structure requires only O(1) time. Therefore, in the process of constructing the Can-Tree, the hash structure reduces the time spent scanning Can-Tree leaf nodes and improves the time efficiency of the algorithm. In the mining phase, a composite data structure called mining_history_res{can_tree, header_table, leafNode_hashTbl} is introduced to cache the constructed Can-Tree, leafNode_hashTbl, and their weighted supports for each round of incremental mining, which serve as the data support for the next round of incremental mining.
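The improved node layout and leaf hash table can be sketched as follows; the class and field names (prev, next, leaf_hash) mirror the text, while the remaining details are illustrative assumptions rather than the paper's implementation.

```python
# Sketch of a Can-Tree node with bidirectional pointers plus a leaf-node hash table.
class DWCanNode:
    def __init__(self, item, prev=None):
        self.item = item
        self.prev = prev          # parent pointer: enables bottom-up CPB walks
        self.next = {}            # child pointers keyed by item name
        self.count = 0

class DWCanTree:
    def __init__(self):
        self.root = DWCanNode(None)
        self.leaf_hash = {}       # leafNode_hashTbl: item -> list of leaf nodes, O(1) lookup

    def insert(self, sorted_items):
        """Insert one transaction whose items are already sorted by weighted importance."""
        node = self.root
        for it in sorted_items:
            node = node.next.setdefault(it, DWCanNode(it, prev=node))
            node.count += 1
        leaves = self.leaf_hash.setdefault(node.item, [])
        if node not in leaves:    # register the transaction's leaf node in the hash table
            leaves.append(node)

    def cpb(self, leaf):
        """Bottom-up conditional pattern base via prev pointers (no full-tree traversal)."""
        path, node = [], leaf.prev
        while node is not None and node.item is not None:
            path.append(node.item)
            node = node.prev
        return list(reversed(path)), leaf.count
```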

Frequent Pattern Tree Construction and Mining Based on Dynamic-Weight Can-Tree Method
The improved Can-Tree incremental mining algorithm based on dynamic weights for frequent pattern tree construction is divided into two parts: the can-tree construction of the initial transaction database (DB_ORIGIN) and the incremental database (DB_INCR). The construction processes and pseudo-code are as follows.
Initial transactional data frequent pattern tree construction. Model input: Initial transaction database (DB_ORIGIN). Model output: Can-Tree data structure (can_tree_origin).
(1) Perform data pre-processing on the DB_ORIGIN samples, and then discretize numerical attributes such as weight, price, FOB, CIF, the wt_quan ratio, and the fob_cif difference. (2) Sort the data items of each record in descending order of the dynamically weighted counts, using alphabetical order for items without weights or with equal weights, and calculate the weighted support Sup_wt. (3) Insert each data sample item into the Can-Tree according to the sorted result, and update the node count information. For a single item, if a node with the same name already exists in the Can-Tree, simply add one to that node's count value; otherwise, create a new node N with a count value of 1 and insert it after the node bearing the same prefix item name. Repeat until all data records in DB_ORIGIN have been inserted. (4) Record the positions of all leaf nodes in a hash table (leafNode_hashTbl) while building the Can-Tree data structure. (5) After the construction of DB_ORIGIN's frequent pattern tree is completed, store the current Can-Tree information and leafNode_hashTbl (together called mining_history_res) for the subsequent incremental transaction data mining.
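As a small illustration of step (2), the per-record ordering can be expressed as a sort key of descending weighted support with an alphabetical tie-break; the names here are hypothetical.

```python
# Hypothetical ordering used before insertion: weighted items first (descending weighted
# support), unweighted items and ties fall back to alphabetical order.
def order_items(items, weighted_sup):
    return sorted(items, key=lambda it: (-weighted_sup.get(it, 0.0), it))

print(order_items(["weight_high", "port_A", "country_X"],
                  {"port_A": 0.9, "country_X": 0.6}))
# -> ['port_A', 'country_X', 'weight_high']
```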
Incremental transaction data frequent pattern tree construction. Model input: incremental transaction database (DB_INCR) and mining_history_res. Model output: Can-Tree data structure (can_tree_incr). (1) After data preprocessing and discretization, sort the data items of each incremental record in descending order of the weighted support count and insert them one by one into the constructed can_tree_origin to form a new incremental frequent pattern tree, can_tree_incr. (2) Update the leaf node positions of can_tree_incr in leafNode_hashTbl.
The pseudo-code for building the frequent pattern tree is shown in Figure 3. Frequent pattern tree mining process: (1) Based on the obtained leafNode_hashTbl, each leaf node of the Can-Tree is traversed in turn, and the dynamically weighted CPBs of all leaf nodes and their weighted-support-levels are retrieved from the bottom up through the prev pointer. The frequent itemsets are then pruned by filtering against the minimum weighted-support-level to generate the final CPB set. (2) Construct a dynamically weighted conditional pattern tree (called condition_can_tree) from the CPB set. (3) Recursively mine the dynamically weighted conditional pattern trees to obtain weighted frequent item sets; strong association rules related to customs risk screening are generated using the introduced weighted-support-level, weighted-confidence-level, and weighted-lift-level.
Incremental frequent pattern tree mining: (1) Read mining_history_res to obtain the minimum weighted-support-level and minimum weighted-confidence-level of the previous round of mining, and determine whether they are the same as those of the current round. If they are the same, only the data items x_i in the current leafNode_hashTbl whose count values have changed are mined: the CPB, CPB set, and frequent item sets of x_i are reconstructed and then filtered by the weighted-confidence-level and weighted-lift-level to generate strong association rules. If they are not the same, the whole Can-Tree needs to be mined again.
(2) Save the Can-Tree information, item header table information, and leafNode_hashTbl generated from this round of mining into mining_history_res, to provide data support for the next round of incremental mining.
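A hedged sketch of this incremental mining decision is shown below: when the thresholds match the cached previous round, only the changed leaf items are re-mined, otherwise the full tree is mined again. The function and key names are illustrative assumptions.

```python
# Sketch of the incremental mining decision using the cached mining_history_res.
def incremental_mine(mining_history_res, cur_min_wsup, cur_min_wconf,
                     changed_leaf_items, mine_item, mine_full_tree):
    prev = mining_history_res
    if prev["min_wsup"] == cur_min_wsup and prev["min_wconf"] == cur_min_wconf:
        # thresholds unchanged: rebuild CPBs / frequent item sets only for changed leaves
        rules = [r for item in changed_leaf_items for r in mine_item(item)]
    else:
        # thresholds changed: the cached results cannot be reused, mine the whole tree
        rules = mine_full_tree()
    # cache this round's thresholds (the tree, header table, and leafNode_hashTbl
    # would be stored alongside them in the same structure)
    prev.update(min_wsup=cur_min_wsup, min_wconf=cur_min_wconf)
    return rules
```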

Main Process of Dynamic-Weight Can-Tree
The flow of the dynamic-weight Can-Tree algorithm is shown in Figure 4. It effectively alleviates the problem of the large memory occupation of the frequent pattern tree during incremental data iteration, and avoids recalculating the leaf node CPBs when mining the frequent pattern tree. In addition, the introduction of the weighted-support-level, weighted-confidence-level, and weighted-lift-level helps to mine rules with stronger correlations from the customs declaration data.

Experimental Data
The experimental data in this paper are provided by a customs information center in China, with a total of 100,000 samples of inspected customs declarations. The data attributes include time, port, trade country, importer, commodity model, country of origin, production unit, sales unit, cargo attributes, trade mode, packaging type, transaction mode, supervision mode, freight, premium, quantity, weight, price, and tariff. In this paper, based on the overall customs risk situation, the relationship between the different data attributes of the customs declaration and risk is considered [23], and the customs declaration attributes are filtered from the perspective of the practicality of the generated risk rules [24]. Table 1 shows the main customs declaration data attributes after data desensitization and attribute screening.

Experimental Analysis
The experimental environment is an Intel Xeon Bronze 3204 @ 1.9 GHz hexa-core CPU, 256 GB RAM, the Ubuntu 20.04 operating system, and Python 3.9. FP-Growth, Can-Tree based on alphabetical sorting, Can-Tree based on data-itemset-volume sorting, and the dynamic-weight Can-Tree algorithm are used to conduct customs declaration data mining experiments. This paper measures the advantages of the improved Can-Tree algorithm in three aspects: space efficiency, time efficiency, and effectiveness analysis.
(I) Comparative analysis of the spatial efficiency of constructing frequent pattern trees.
Figure 5 depicts the number of tree nodes when different algorithms construct frequent pattern trees on DB_ORIGIN. The experimental results indicate that the dynamic-weight Can-Tree algorithm generates frequent pattern trees at a scale smaller than the alphabetical-order Can-Tree and the data-volume-sorted Can-Tree, and slightly larger than the FP-Growth algorithm. This is mainly determined by the incremental data mining strategy: on DB_ORIGIN, this algorithm needs to construct the full Can-Tree without pre-pruning of nodes, while the FP-Growth algorithm prunes nodes when constructing the FP-Tree. It can be seen that the method can effectively reduce the scale of the frequent pattern trees and improve the space utilization of the algorithm.
Figure 6 depicts the change in the number of nodes when different algorithms construct the frequent pattern tree as the incremental data increase. Compared with the FP-Growth and alphabetical-order Can-Tree methods, the method used in this paper significantly reduces the number of tree nodes in the construction stage of the frequent pattern trees, which ultimately reduces the spatial size of the trees.
In summary, the algorithm in this paper is suitable for incremental data mining, and there are still some shortcomings when it is applied to a single-pass mining scenario. In the case of non-incremental mining, the algorithm occupies slightly more space than the traditional FP-Growth algorithm because the full frequent pattern tree needs to be constructed the first time.
(II) Comparative analysis of the time efficiency of constructing frequent pattern trees.
In the construction stage of the frequent pattern tree, construction time efficiency comparison experiments are conducted on DB_ORIGIN and DB_INCR, respectively, under the same support and data volume. Figure 7 shows that in the DB_ORIGIN tree-building phase, the Can-Tree method in this paper outperforms the Can-Tree methods based on alphabetical order and data-volume sorting, but takes more time than FP-Growth. This is because the Can-Tree algorithm is designed for incremental data mining, which requires constructing a full frequent pattern tree the first time. The DB_INCR tree-building times are shown in Figure 8. As the incremental data increase, the time consumption of the dynamic-weight Can-Tree algorithm in this paper is much less than that of FP-Growth, and it also outperforms the other Can-Tree algorithms in time efficiency. In the incremental mining phase, the introduced weighted-support-level pre-pruning strategy filters out data items below the threshold so that they do not participate in the tree construction process, reducing the size of the final Can-Tree and in turn improving the efficiency of the mining phase.
From the experimental results on time efficiency, we can see that the dynamic-weight Can-Tree algorithm is better at handling incremental data mining, and its time-efficiency advantage grows as the incremental data increase. It still has shortcomings in single-pass mining and is slightly inferior to the FP-Growth algorithm.
(III) Comparative analysis of the time efficiency of mining frequent pattern trees.
Setting the initial data volume to 10,000 and the minimum weighted support to 50, the mining experiments on the DB_ORIGIN sample data are performed first, and then incremental data are gradually overlaid to perform the mining of the DB_INCR sample data. Comparing the time consumed by the dynamic-weight Can-Tree and the other algorithms in the frequent pattern tree mining stage, the experimental results are shown in Figures 9 and 10.
As can be seen from Figure 9, the mining time of the improved Can-Tree algorithm in this paper is significantly smaller than that of the other algorithms for both DB_ORIGIN and DB_INCR. On the one hand, the improved Can-Tree algorithm uses the pre-pruning strategy based on weighted support to reduce the size of the frequent pattern trees when constructing the Can-Tree on incremental data. On the other hand, the introduced leaf node hash table (leafNode_hashTbl) can quickly locate the leaf node positions, so the algorithm does not need to traverse the whole Can-Tree, which shortens the discovery time of the frequent item sets and improves the time efficiency of frequent pattern tree mining.
The above experimental results are all obtained under a single minimum weighted support count, so the mining time efficiency of the various algorithms under different support thresholds is added as a comparison. The experimental results are shown in Figure 11: as the minimum weighted support changes, the mining time of the improved Can-Tree algorithm in this paper remains lower than that of the other algorithms.
(IV) Analysis of the validity of incremental mining results.
The set of items in the association rule algorithm that satisfies the minimum support-level threshold is called the frequent itemset, which is then filtered by the minimum confidence-level to obtain strong association rules [25]. The confidence level indicates how likely event Y is to occur when event X occurs, but it ignores the correlation between event X and event Y in the rule [26]. The dynamic weighting calculation method introduced in this paper affects the calculation of the data item frequencies. Therefore, the concepts of dynamic weighted-confidence-level and dynamic weighted-lift-level need to be introduced to measure the correlation between different events. This allows the dynamic-weight Can-Tree to generate strong association rules with positive correlation without reducing the accuracy of the rules. Table 2 describes the generated strong correlation rules and their threshold attributes.
Although the introduction of dynamic weights improves the practicality of incremental association rule mining, it also makes the validation of the rules more difficult: the change of weights affects the weighted-support-level and weighted-confidence-level, which increases the extra time overhead of the model computation and has a certain impact on performance.

Summary
In order to improve the efficiency of customs port inspection and further address the practical problems of over-reliance on expert experience, high labor costs, and low screening efficiency in the customs risk screening business, this paper proposes a data discretization algorithm based on outlier optimization and a dynamic-weight Can-Tree incremental mining algorithm, using customs declaration data as risk mining samples. Based on the traditional Can-Tree algorithm, alphabetical ordering is replaced by a dynamically weighted counting method for sorting data items, which optimizes the data storage by combining the importance of the nodes in the association network with the attributes of the customs declaration data. Compared with the traditional Can-Tree algorithm, it saves about 40% of memory space consumption. The introduced leaf node hash table and incremental mining strategy further reduce the mining time and improve the time efficiency of the model. In terms of improving the validity of the generated rules, the introduction of the weighted-confidence-level and weighted-lift-level improves the quality of the rules. There is still much room for improvement in data discretization and rule validation. In subsequent research, the uncertainty of discretization can be incorporated into the association rule mining stage to improve the stability and validity of the rules. In addition, the assignment of risk weights can cover more attributes of customs declarations, thereby constructing a more complete node network importance system. From the perspective of constructing a customs risk rule base, the method adopted in this paper is essentially unsupervised learning; the generated rules still need to be deduplicated, updated, and accurately verified before the resulting rule base can be broadly effective in actual use. From the perspective of customs declaration inspection data, there is a certain ambiguity in the inspection results (the credibility of declaration inspection results), which has not been fully taken into account by the method in this paper. More research is still needed in future work.