An Efficient and Scalable Algorithm to Mine Functional Dependencies from Distributed Big Data

A crucial step in improving data quality is to discover semantic relationships between data. Functional dependencies are rules that describe semantic relationships between data in relational databases and have recently been applied to improve data quality. However, traditional functional dependency discovery algorithms, when applied to distributed data, may produce erroneous results and fail to scale to large data sets. To solve these problems, we propose a novel distributed functional dependency discovery algorithm based on Apache Spark, which can effectively discover functional dependencies in large-scale data. The basic idea is to use data redistribution to discover functional dependencies in parallel on multiple nodes. The algorithm uses sampling to quickly remove invalid functional dependencies and a greedy task assignment strategy to balance the load. In addition, a prefix tree stores intermediate computation results during validation to avoid repeated computation of equivalence classes. Experimental results on real and synthetic datasets show that the proposed algorithm is more efficient than existing methods while ensuring accuracy.


Introduction
In the information age, data has become the most important asset of a company, and data-driven decisions can bring good results to every organization and company [1]. However, with the explosive growth of data volume and the variety of data sources, low-quality data inevitably appears. Specifically, the collected data may contain missing values, redundancy, and semantic contradictions. For example, a sensor network that interacts with its environment is easily damaged by natural influences such as sunlight and rain, which causes equipment failures that prevent data from being returned or lead to incorrect data being returned [2]. Business decisions made with low-quality data can lead to huge financial losses and irreversible consequences [3,4].
Therefore, data cleaning has become a necessary prerequisite for designing and completing system engineering and has received extensive attention from researchers and practitioners. According to statistics, in applications such as machine learning and data mining, researchers spend more than 60% of their time and energy on data preprocessing [5]. Theories and methods for improving data quality therefore have significant research value.
Avoiding the generation of low-quality data at the data source is usually not achievable, so the main method to improve data quality is to perform error detection and repair on the dataset [6]. Many scholars have studied automatic data detection and repair, including outlier detection [7][8][9], dependency conflict detection [10][11][12], and duplicate value detection [13,14]. Dependency-based methods [10][11][12] detect errors and repair data through semantic relations between data, which are represented by various integrity constraints [15], such as functional dependencies [16], conditional functional dependencies [17], and denial constraints [18]. Detecting and repairing data through dependency-based methods [19,20] still requires the end user to supply integrity constraints, which the system uses to detect cells that conflict with the dependencies. However, manually writing integrity constraints is inefficient and requires sufficient domain knowledge, so it is usually necessary to mine the dependencies from the dataset with the help of automatic discovery algorithms.
Functional dependency [15] is one of the most basic and important integrity constraints. When functional dependencies are used to improve data quality, the primary problem is how to discover them efficiently and automatically from tabular data. The study in [21] shows that seven important existing functional dependency discovery algorithms are only suitable for small-scale centralized data sets and cannot be extended to tables with hundreds of columns or millions of rows; moreover, when the data is stored in a distributed manner, these algorithms can produce erroneous results. Thus, the dramatic growth of data volume in the era of big data and the wide application of distributed databases have brought new problems and challenges to functional dependency discovery.
Example 1. Given a relation R, as shown in Figure 1a, R is horizontally divided into two parts. However, functional dependencies that pass local validation do not necessarily hold on global data. According to Figure 1a, it can be concluded that the functional dependency A → B does not hold on R. Therefore, the existing centralized functional dependency discovery algorithms cannot be directly applied to the distributed environment.
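The pitfall described in Example 1 can be reproduced in a few lines. The following is a minimal sketch on hypothetical data (not the relation of Figure 1a); the helper name `holds` and the two fragments `part1` and `part2` are assumptions introduced only for illustration.

```python
def holds(rows, lhs, rhs):
    """Return True if the functional dependency lhs -> rhs holds on rows."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        value = row[rhs]
        # A second tuple with the same LHS value but a different RHS value
        # violates the dependency.
        if seen.setdefault(key, value) != value:
            return False
    return True


# Hypothetical horizontal fragments stored on two different nodes.
part1 = [{"A": 1, "B": "x"}, {"A": 2, "B": "y"}]
part2 = [{"A": 1, "B": "z"}, {"A": 3, "B": "y"}]

# A -> B passes local validation on each fragment ...
local_ok = holds(part1, ["A"], "B") and holds(part2, ["A"], "B")
# ... but fails on the union, because A = 1 maps to both "x" and "z".
global_ok = holds(part1 + part2, ["A"], "B")
print(local_ok, global_ok)
```

The conflicting pair of tuples lives on different fragments, so no node can detect the violation from its local data alone.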
In distributed scenarios, functional dependency discovery for large-scale data has gradually become a research hotspot. Among distributed functional dependency discovery algorithms, FDcent_discover [22] presents a functional dependency discovery framework for distributed databases: functional dependency discovery is first performed at each node, and then the data of each node is sent to the master node, where a centralized discovery algorithm is applied. The HFDD [23] and FDPar_Discover [24] algorithms adopt a data redistribution scheme: they group candidate functional dependencies by their left-hand-side attributes, send the tuples with the same values of the common attributes to the same node, and run the functional dependency discovery algorithm in parallel at each node. However, the following problems remain. First, the characteristics of the data set itself are not considered, which leads to the verification of many invalid functional dependencies and increases the computational cost. Second, in distributed scenarios, when the distribution of attribute values is uneven, the resulting load imbalance wastes computing resources. Third, the repeated computation of equivalence classes while verifying candidate functional dependencies leads to inefficiency.
The contributions of this paper are as follows:

1. A Spark-based distributed functional dependency discovery algorithm is proposed.
2. To address the load imbalance caused by the uneven distribution of attribute values, a greedy task assignment strategy is proposed to balance the computing tasks of the nodes and avoid excessively long computing times.
3. A dynamic memory management strategy is proposed that stores calculated equivalence classes in memory and periodically clears equivalence classes that have not been accessed for a long time, so as to make the best use of memory space.
4. The proposed distributed functional dependency discovery algorithm is verified experimentally on real and synthetic data sets.
This paper is organized as follows. In Section 2, the definitions and related work are introduced. Section 3 presents the algorithm structure and implementation process of the algorithm DisTFD. Section 4 presents the experimental results and the comparison of existing methods. Section 5 is the conclusion and outlook for future work.

Definition
This section introduces definitions related to FD discovery. Let R be a relational schema and r be an instance of R. t[X] denotes the projection of a tuple t ∈ r onto the attribute subset X ⊆ R.
Definition 1. Functional dependency. A functional dependency X → A, where X ⊆ R and A ∈ R, specifies that the value of X functionally determines the value of A. If every pair of tuples $t_1, t_2 \in r$ with $t_1[X] = t_2[X]$ also satisfies $t_1[A] = t_2[A]$, then X → A holds on the instance r of R. X is called the left-hand side (LHS) of the FD and A the right-hand side (RHS).
Definition 2. Non-trivial functional dependency. If a functional dependency X → A holds and A ∉ X, then X → A is said to be a non-trivial functional dependency.
Definition 3. Minimum functional dependency. If a functional dependency X → A holds and no proper subset of X determines the value of attribute A, that is, X′ → A does not hold for any X′ ⊂ X, then X → A is called a minimum functional dependency.

Definition 4. Equivalence class. The equivalence class of a tuple t ∈ r with respect to an attribute set X ⊆ R is expressed as $[t]_X = \{u \in r \mid u[X] = t[X]\}$. Taking the relation R in Example 1 as an example, the equivalence class of the tuple $t_1$ on the attribute C is {1, 4}.

Definition 5. Partition. All tuples in r are divided into equivalence classes based on the attribute set X ⊆ R. The partition $\Pi_X = \{[t]_X \mid t \in r\}$ of relation r on attribute set X is the set of all equivalence classes, and $|\Pi_X|$ denotes the number of equivalence classes in $\Pi_X$. In Example 1, the relation R can be divided into multiple equivalence classes on the attribute set {C}, one of which is the class {1, 4} given in Definition 4.
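Definitions 4 and 5 can be made concrete with a short sketch. The relation below is hypothetical (it is not the relation of Figure 1a) and is chosen only so that tuples 1 and 4 agree on C; the function name `partition` is likewise an assumption.

```python
from collections import defaultdict


def partition(rows, attrs):
    """Group 1-based tuple ids into equivalence classes on attrs."""
    classes = defaultdict(set)
    for tid, row in enumerate(rows, start=1):
        # Tuples with the same projection onto attrs share a class.
        classes[tuple(row[a] for a in attrs)].add(tid)
    return list(classes.values())


# Hypothetical relation with attributes A, B, C.
rows = [
    {"A": 1, "B": "x", "C": "p"},
    {"A": 1, "B": "x", "C": "q"},
    {"A": 2, "B": "y", "C": "q"},
    {"A": 3, "B": "y", "C": "p"},
]

pi_c = partition(rows, ["C"])   # two classes: {1, 4} and {2, 3}
pi_a = partition(rows, ["A"])   # three classes: {1, 2}, {3}, {4}
print(pi_c, len(pi_a))
```

Each set in the result is one equivalence class, and the length of the list is the partition size used later for validation.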

Related Work
Functional dependency discovery. Existing functional dependency discovery algorithms are mainly used in centralized environments and can be divided into three categories: lattice search algorithms, difference and consensus set algorithms, and hybrid algorithms.
Lattice search algorithm: the typical representatives are TANE [25], FUN [26], and FD_Mine [27] algorithms. The search space is modeled as the lattice of attribute combinations to represent all candidate functional dependencies, and a bottom-up search strategy is adopted to verify the candidate functional dependencies at each layer. The time complexity of the lattice search algorithm mainly depends on the size of the lattice, and the size of the lattice depends on the number of attributes of the dataset. Therefore, the lattice search algorithm has better row scalability and is suitable for large-scale datasets with fewer columns.
Difference set and consistent set algorithm: the typical representatives are Dep-Miner [28] and FastFDs [29] algorithms. Based on the comparison between tuples, the consistent set and the difference set are obtained, and finally the candidate functional dependency is verified according to the difference set. The time complexity of difference and consistent set algorithms depends on the number of tuples. Therefore, the difference set and consistent set algorithms have better column scalability and are suitable for small-scale datasets with many columns.
Hybrid algorithm: HyFD [30] uses a hybrid discovery strategy that combines the advantages of the lattice search algorithms and of the difference set and consistent set algorithms, and it scales better in both rows and columns. HyFD first generates a consistent set from the sampled data, identifies candidate functional dependencies from it, and uses an FDTree to represent the corresponding attribute sets. Then, HyFD switches to the lattice search algorithm and verifies the candidate functional dependencies by traversing the FDTree.
Approximate functional dependency discovery. In 1992, Kivinen and Mannila [31] first proposed an error metric for approximate functional dependencies. Subsequently, CORDS [32] automatically discovered unary approximate functional dependencies from relational data. To further speed up the discovery of approximate functional dependencies, the authors of [33] used heuristics to prune the candidate space of approximate functional dependencies. Mandros and Boley [34] represented the approximation of functional dependencies more precisely by scores.
The authors of [35] use a machine learning approach to infer approximate functional dependencies by comparing tuples with each other. The method finds all conflicting functional dependencies by tuple pair comparison, applies an error threshold to remove infrequent conflicting tuple pairs, and finally infers approximate functional dependencies from the remaining conflicting tuple pairs.
In recent work, Caruccio and Deufemia [36] proposed a new verification method for candidate approximate functional dependencies that discovers multiple types of approximate functional dependencies by constructing a difference matrix of attributes. AFDDPar [37] is a parallel approach for discovering approximate functional dependencies in a distributed environment; it balances the load of the individual nodes before data redistribution and quickly prunes candidate approximate functional dependencies after data redistribution.

The Distributed Algorithm for Mining Functional Dependency
This section describes the distributed functional dependency discovery problem and gives a general overview of the algorithm DisTFD. Functional dependency discovery is carried out in a distributed big data environment, a distributed processing method is designed, and intermediate results are stored for reuse. While ensuring correctness, the load of each computing node is balanced as much as possible to reduce the running time of the algorithm.

Algorithm Architecture Overview
The algorithm DisTFD consists of multiple components, which are divided into different logical modules. The framework of the algorithm DisTFD is shown in Figure 2. The master module mainly performs data input and output, controls the sampling ratio, and preprocesses the data. It can only be located on the master node. The worker module comprises several worker nodes, which are mainly responsible for storing data, generating and verifying candidate functional dependencies, and sending the verification results to the master node. The partition management module merges the results calculated by multiple worker nodes and stores the partitions used for verifying candidate functional dependencies.
The specific functions of the components in the three modules are as follows:
ResultSet. The ResultSet component stores the invalid functional dependencies and the valid functional dependencies as two separate sets.
CandidateGen. The CandidateGen component generates candidate functional dependencies in the form of a lattice and sends them to the worker nodes. After each validation, CandidateGen prunes the candidate functional dependencies according to the validation results in the ResultSet.
Sampler. The Sampler component samples the data according to the ratio set on the master node and verifies each received candidate FD on the sampled data set D′. If the verification result is true, the candidate functional dependency is sent to the worker nodes for further verification. If the verification result is false, the candidate functional dependency is sent to the ResultSet.
PartitionMgr. The PartitionMgr component accepts partition requests from the worker nodes. If the requested partition is in the PartitionCache, it is sent to the corresponding worker node. If the requested partition is not in the PartitionCache, the workers calculate the partition and PartitionMgr merges their results. PartitionMgr then stores the calculated partition in the PartitionCache and periodically clears the partitions that have not been accessed for a long time.
Worker. The Worker component verifies candidate functional dependencies, sends the results to the ResultSet, and requests a new verification job from CandidateGen.
WorkerMgr. The WorkerMgr component records the load of each node after data redistribution. When the node load is unbalanced, the task assignment algorithm is called to assign tasks so that the load is balanced.
This paper proposes a distributed functional dependency discovery algorithm, DisTFD, based on attribute space traversal, as shown in Algorithm 1.
Algorithm 1. DisTFD (listing only partially recoverable): sampling validation of each candidate with SamplingValidate(ϕ, D′); task assignment to balance the load of the worker nodes; verification of each functional dependency in the candidate space.
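The caching behaviour attributed to PartitionMgr above (serve a cached partition if present, otherwise store the newly computed one, and periodically drop partitions that have not been accessed for a long time) can be sketched as follows. This is an illustrative sketch only: the class API, the `max_idle_seconds` parameter, and the injectable `clock` are assumptions, not the paper's implementation.

```python
import time


class PartitionCache:
    """Cache of partitions keyed by attribute tuple, with idle-time eviction."""

    def __init__(self, max_idle_seconds, clock=time.monotonic):
        self.max_idle = max_idle_seconds
        self.clock = clock        # injectable so the policy can be tested
        self._entries = {}        # key -> (partition, last access time)

    def get(self, attrs):
        key = tuple(attrs)
        entry = self._entries.get(key)
        if entry is None:
            return None           # cache miss: a worker must compute it
        partition, _ = entry
        # Refresh the access time so frequently used partitions survive.
        self._entries[key] = (partition, self.clock())
        return partition

    def put(self, attrs, partition):
        self._entries[tuple(attrs)] = (partition, self.clock())

    def evict_idle(self):
        """Drop partitions not accessed within max_idle; return their keys."""
        now = self.clock()
        stale = [k for k, (_, t) in self._entries.items()
                 if now - t > self.max_idle]
        for key in stale:
            del self._entries[key]
        return stale
```

A periodic task on the master node could call `evict_idle()`; how often to do so, and what idle threshold to use, is a tuning decision that the text above does not specify.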

Data Preprocessing
The preprocessor preprocesses the data by collecting statistics on attribute cardinality and attribute value frequency. When the data is large and stored in a distributed manner, computing these statistics over all the data requires aggregating partial results multiple times, which makes the cost extremely high. Therefore, this paper computes attribute-related statistics on the sampled data set; the sampling method is introduced in Section 3.3.
The number of distinct values of an attribute is called the cardinality of the attribute, and the number of tuples corresponding to each attribute value is called the frequency of that value. Based on the cardinality and frequency statistics, the skewness of each attribute is then calculated. Given an attribute A, let c be the cardinality of A, V be the set of all values of A, and $frequency(V_k)$ be the frequency of the k-th value of A, where k ∈ [1, c] and n is the total number of tuples in the dataset; the skewness of attribute A is computed from these quantities. The data preprocessing process is shown in Algorithm 2.

Algorithm 2. Pre_processing
Input: sample data set D′, attribute set X
Output: sorted attribute set SortedAttribute
1. Set the Skew threshold t
2. For ...
Return SortedAttribute
After calculating the skewness of each attribute, the attributes are divided into Skew attributes and non-Skew attributes according to the given threshold. All attributes are then sorted so that the Skew attributes are ranked after the non-Skew attributes.
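The statistics and the ordering step described above can be sketched as follows. Because the skewness formula is not reproduced here, the sketch uses a stand-in measure (the share of the most frequent value); this measure, like the names `attribute_stats`, `stand_in_skew`, and `sort_attributes`, is an assumption and should not be read as the paper's definition.

```python
from collections import Counter


def attribute_stats(rows, attr):
    """Return (cardinality, value frequencies) of attr on the sample."""
    freq = Counter(row[attr] for row in rows)
    return len(freq), freq


def stand_in_skew(freq, n):
    # ASSUMPTION: placeholder measure (share of the most frequent value);
    # this is NOT the skewness formula of the paper.
    return max(freq.values()) / n


def sort_attributes(rows, attrs, threshold, skew=stand_in_skew):
    """Order attributes so that skewed ones come after non-skewed ones."""
    n = len(rows)
    scored = [(skew(attribute_stats(rows, a)[1], n), a) for a in attrs]
    non_skewed = [a for s, a in scored if s <= threshold]
    skewed = [a for s, a in scored if s > threshold]
    return non_skewed + skewed


# Hypothetical sample: A is uniform, B is dominated by the value "x".
sample = [
    {"A": 1, "B": "x"},
    {"A": 2, "B": "x"},
    {"A": 3, "B": "x"},
    {"A": 4, "B": "y"},
]
print(sort_attributes(sample, ["B", "A"], threshold=0.5))
```

Any other skewness measure can be plugged in through the `skew` parameter without changing the ordering logic.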

Sampling Validation Framework
Sampling refers to taking a part of the population of the research objects for investigation or statistics according to a certain procedure, so as to make inferences about the population of the research objects. In this paper, the statistical attribute information of the sampling data set reflects the situation of the attribute in the overall data set.
Sampler uses systematic sampling [38] to sample the population data. Given the preset sample size n and the population size N, the integer k closest to N/n is determined, an integer r is randomly selected from the range [1, k] as the starting unit of the sample, and then every k-th unit is selected as a sample unit until n samples are drawn.
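The systematic sampling procedure can be sketched in plain Python as follows. The function name `systematic_sample` and the `rng` parameter are assumptions for illustration; in a Spark setting the same index arithmetic would have to be applied to row positions rather than to an in-memory list.

```python
import random


def systematic_sample(rows, n, rng=random):
    """Draw up to n rows: random start r in [1, k], then every k-th row."""
    N = len(rows)
    if not 0 < n <= N:
        raise ValueError("sample size must be between 1 and len(rows)")
    k = max(1, round(N / n))      # integer closest to N/n
    r = rng.randint(1, k)         # 1-based starting unit
    # Positions r, r + k, r + 2k, ... converted to 0-based indexes.
    return [rows[i] for i in range(r - 1, N, k)][:n]


population = list(range(100))
sample = systematic_sample(population, 10, random.Random(0))
print(sample)
```

When N/n is not an integer, rounding k can yield slightly fewer than n units, which is why the docstring says "up to n".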
The sampled data set D′ is much smaller than the overall data set D and is stored only on the master node. Therefore, the cost of functional dependency discovery on D′ is small. The functional dependencies found on D′ and on D have the following two properties:

1. Completeness: a functional dependency ϕ that holds on D also holds on D′.
2. Minimality: if a minimum functional dependency ϕ on D′ also holds on D, then ϕ is also a minimum functional dependency on D.
According to these two properties, invalid or non-minimum functional dependencies can be quickly identified on the sampled data set, which saves distributed verification time and improves the efficiency of the algorithm.
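The completeness property implies that any candidate violated on the sample is also violated on the full data and can be discarded without distributed verification. A minimal sketch of this filter follows; the data and the names `holds` and `prune_on_sample` are hypothetical.

```python
def holds(rows, lhs, rhs):
    """Return True if lhs -> rhs holds on rows."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        if seen.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True


def prune_on_sample(candidates, sample):
    """Split candidates into those that survive the sample and those that fail."""
    survivors, invalid = [], []
    for lhs, rhs in candidates:
        target = survivors if holds(sample, lhs, rhs) else invalid
        target.append((lhs, rhs))
    return survivors, invalid


# Hypothetical full data set and a sample made of every second row.
full = [
    {"A": 1, "B": "x"},
    {"A": 1, "B": "x"},
    {"A": 2, "B": "x"},
    {"A": 2, "B": "y"},
]
sample = full[::2]
candidates = [(("A",), "B"), (("B",), "A")]
survivors, invalid = prune_on_sample(candidates, sample)
print(survivors, invalid)
```

In this example B → A is rejected on the sample alone, whereas A → B survives the sample but still fails on the full data, which is why survivors need global validation.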

Search and Prune
Row-efficient functional dependency discovery algorithms are appropriate for large-scale data sets with many tuples. Therefore, this paper uses the lattice of TANE, FUN, and similar algorithms to generate the search space of candidate functional dependencies. Given a relational schema R = {A, B, C, D}, all of its candidate functional dependencies are shown in Figure 3. The LHS of the candidate FDs ranges over all possible attribute combinations in R: the connection between the first node of Level-5 and the first node of Level-4 represents the candidate functional dependency ABC → D, the connection between the first node of Level-3 and the first node of Level-2 represents the candidate functional dependency A → B, and so on.
Lemma 1. Given the attribute set $\{A_1, \ldots, A_n\}$ defined on the relational schema R, the number of all non-trivial minimum functional dependencies is $n \cdot 2^{n-1} - n$.
Proof of Lemma 1: Consider only the non-trivial minimum functional dependencies whose RHS has one attribute. For all candidate functional dependencies on the relation R, the number of attributes of the LHS takes values in [1, n − 1]. The number of candidate functional dependencies whose LHS has one attribute is $C_n^1 \cdot C_{n-1}^1$, the number whose LHS has two attributes is $C_n^2 \cdot C_{n-2}^1$, the number whose LHS has three attributes is $C_n^3 \cdot C_{n-3}^1$, and so on, up to $C_n^{n-1} \cdot C_1^1$ for an LHS with n − 1 attributes. Since $\sum_{k=0}^{n} C_n^k \cdot (n-k) = n \cdot 2^{n-1}$, and the terms for k = 0 and k = n equal n and 0, respectively, the total number of non-trivial minimum functional dependencies whose RHS has one attribute is:
$$\sum_{k=1}^{n-1} C_n^k \cdot C_{n-k}^1 = n \cdot 2^{n-1} - n.$$
When verifying candidate functional dependencies, most existing lattice search algorithms verify them one by one in bottom-up or top-down order, and the set of candidate functional dependencies is pruned using the following lemmas:
Lemma 2. Let X, Y, Z be three attribute sets of the relation R. If Y ⊂ X and X ↛ Z, then Y ↛ Z.
Lemma 3. Let X, Y, Z be three attribute sets of the relation R. If Y ⊂ X and Y → Z holds, then X → Z holds.
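The count stated in Lemma 1 can be checked by brute-force enumeration of the candidates with a non-empty LHS of at most n − 1 attributes and a single RHS attribute outside the LHS. The generator name `candidate_fds` is an assumption used only for this check.

```python
from itertools import combinations


def candidate_fds(attrs):
    """Yield (lhs, rhs) with 1 <= |lhs| <= n - 1 and rhs not in lhs."""
    n = len(attrs)
    for size in range(1, n):
        for lhs in combinations(attrs, size):
            for rhs in attrs:
                if rhs not in lhs:
                    yield lhs, rhs


# Compare the enumeration with the closed form n * 2^(n-1) - n.
for n in range(2, 8):
    attrs = [f"A{i}" for i in range(1, n + 1)]
    count = sum(1 for _ in candidate_fds(attrs))
    assert count == n * 2 ** (n - 1) - n

print(len(list(candidate_fds("ABCD"))))
```

For the four-attribute schema this gives 12 candidates with one LHS attribute, 12 with two, and 4 with three.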
According to Lemma 2, the top-down search strategy can be used to prune functional dependencies that do not hold at lower levels. For example, once it has been verified that ABC → D does not hold, AB → D and AC → D cannot hold either. However, if most of the functional dependencies at the upper levels are valid and those at the lower levels are not, the top-down strategy verifies many useless functional dependencies, which reduces the verification efficiency.
According to Lemma 3, the bottom-up search strategy can be used to prune functional dependencies at higher levels. For example, once it has been verified that AB → D holds, ABC → D must hold as well, so the bottom-up search strategy avoids verifying non-minimal functional dependencies. However, when many lower-level functional dependencies do not hold, the search space cannot be pruned effectively.
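The two pruning rules can be expressed as one filter over the remaining candidates. The sketch below is illustrative: the function name `prune_after_result` and the representation of candidates as (LHS, RHS) pairs are assumptions.

```python
def prune_after_result(candidates, lhs, rhs, valid):
    """Remove candidates whose outcome follows from the verified lhs -> rhs."""
    lhs = frozenset(lhs)
    kept = []
    for c_lhs, c_rhs in candidates:
        c_lhs = frozenset(c_lhs)
        if c_rhs == rhs:
            # Lemma 3: if lhs -> rhs holds, every proper superset LHS
            # also determines rhs (and is not minimal).
            if valid and lhs < c_lhs:
                continue
            # Lemma 2: if lhs -> rhs fails, every proper subset LHS fails too.
            if not valid and c_lhs < lhs:
                continue
        kept.append((c_lhs, c_rhs))
    return kept


candidates = [("AB", "D"), ("AC", "D"), ("ABC", "D"), ("A", "D")]
after_invalid = prune_after_result(candidates, "ABC", "D", valid=False)
after_valid = prune_after_result(candidates, "AB", "D", valid=True)
print(after_invalid, after_valid)
```

The verified candidate itself is left in the list; a caller would typically move it to the valid or invalid result set separately.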
In this paper, we adopt the validation method in [39] and use a two-way alternating search strategy in the sampling validation process. The validation alternates between the two ends of the search space. Assume that there are n levels of candidate functional dependencies. DisTFD first verifies Level-i (i ≤ n/2); if the verification result is true, Lemma 3 is used to prune the functional dependencies at levels greater than Level-i. Then it verifies Level-j (j = n + 1 − i); if the verification result is false, Lemma 2 is used to prune the functional dependencies at levels smaller than Level-j. It then verifies Level-(i+1), and so on until all candidate functional dependencies are verified. For example, in the 4-attribute search space shown in Figure 3, the verification order is Level-2: ∅ → A; Level-4: ABC → D; Level-2: ∅ → B; . . . ; Level-3: CD → B.
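At the granularity of whole levels, the alternation between the two ends of the search space can be sketched with a small generator. This is a simplification: the example above interleaves individual candidates rather than whole levels, and the function name `alternating_levels` is an assumption.

```python
def alternating_levels(n):
    """Yield level numbers in the order 1, n, 2, n - 1, ... without repeats."""
    low, high = 1, n
    while low <= high:
        yield low                 # Level-i from the lower end
        if high != low:
            yield high            # Level-j with j = n + 1 - i
        low, high = low + 1, high - 1


print(list(alternating_levels(5)))
```

Between two consecutive visits, a real implementation would apply the pruning lemmas so that later levels contain fewer candidates.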

Global Validation
Candidate functional dependencies that pass sampling validation are further verified using data redistribution.

Partition Caching
Candidate functional dependencies are verified by counting the equivalence classes in partitions. For example, verifying X → Y requires checking whether $|\Pi_X| = |\Pi_{XY}|$.
Theorem 1. A functional dependency X → Y holds if and only if $|\Pi_X| = |\Pi_{XY}|$.
Proof of Theorem 1: By Definitions 4 and 5, every equivalence class of $\Pi_{XY}$ is contained in an equivalence class of $\Pi_X$, because tuples that agree on XY also agree on X. If $|\Pi_X| = |\Pi_{XY}|$, no equivalence class of $\Pi_X$ is split when Y is added, so any two tuples $t_i$ and $t_j$ in the same equivalence class of $\Pi_X$ are also in the same equivalence class of $\Pi_{XY}$. That is, $t_i[X] = t_j[X]$ implies $t_i[Y] = t_j[Y]$, which is the definition of a functional dependency, so X → Y holds. Conversely, if X → Y holds, tuples that agree on X also agree on Y, so the two partitions coincide and have the same number of equivalence classes.
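Theorem 1 turns validation into a comparison of two counts. A minimal single-node sketch on hypothetical data follows; the names `num_classes` and `fd_holds` are assumptions, and a distributed implementation would compute the counts on redistributed data rather than on a local list.

```python
def num_classes(rows, attrs):
    """Number of equivalence classes of the partition of rows on attrs."""
    return len({tuple(row[a] for a in attrs) for row in rows})


def fd_holds(rows, lhs, rhs):
    """Check lhs -> rhs by comparing |PI_lhs| with |PI_(lhs + rhs)|."""
    return num_classes(rows, lhs) == num_classes(rows, list(lhs) + list(rhs))


# Hypothetical relation with attributes A, B, C.
rows = [
    {"A": 1, "B": "x", "C": "p"},
    {"A": 1, "B": "x", "C": "q"},
    {"A": 2, "B": "y", "C": "q"},
    {"A": 3, "B": "y", "C": "p"},
]
print(fd_holds(rows, ["A"], ["B"]), fd_holds(rows, ["C"], ["A"]))
```

Here A → B holds because both partitions have three classes, while C → A fails because adding A splits the two classes on C into four.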
The partition Π_XY can be derived as Π_X ∩ Π_Y, a process called computing the intersection of partitions. As shown in Figure 1, DisTFD stores the calculated partitions in the prefix tree [40] shown in Figure 4 for easy querying. Each node stores the partition corresponding to its path, and the number on the node indicates the size of the partition. In the above example, to calculate Π_AC, the attribute set {A, C} is converted into an attribute list (A, C) according to the attribute order in the relational schema R, and (A, C) is then used as the key to query the prefix tree. When using the partitions in the cache, the following two rules should be followed:

1. When calculating the result, the number of partition intersections should be as small as possible.
2. In each calculation of Π_X ∩ Π_Y, |Π_X| and |Π_Y| should be as small as possible.
Algorithm 3 gives the execution process of using partition to cache the calculation results under the two rules above.

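Algorithm 3
Φ ← query the partitions of all subsets of X in the cache
Π_Y ← the smallest stripped partition in Φ
…
for C ⊂ X do
    Π_C ← the partition in Φ with the most newly added attributes
    …
end for
sort the partitions in L in ascending order
…
for Π_C ∈ L do
    Π_(Y∪C) ← Π_Y ∩ Π_C; Y ← Y ∪ C
return Π_X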
In Algorithm 3, the partitions of all subsets of the attribute set X are first queried in the cache and stored in the query result Φ, and the partition with the smallest number of equivalence classes is chosen as the starting unit of the intersection calculation. Next, according to Rule 1, the partition in Φ with the most newly added attributes is selected for intersection, until every attribute in X appears at least once in the selected partitions. Finally, the order of the intersection calculations is determined according to Rule 2: partitions with fewer equivalence classes are intersected first.
For example, assume that Π_ABCDE is to be calculated using the partition cache shown in Figure 4. The query of the prefix tree returns Φ = {Π_A, Π_B, Π_C, Π_D, Π_E, Π_AB, Π_AD, Π_CE}, and the smallest partition, Π_CE, is selected as the starting unit. Then Π_AB, which has the most newly added attributes, is selected, followed by Π_D. After the partitions have been selected, the order of intersection is determined from small to large, and finally Π_ABCDE = Π_CE ∩ Π_AB ∩ Π_D is obtained.
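The selection procedure described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the DisTFD implementation: the cache is a plain dictionary instead of a prefix tree, partitions are full rather than stripped, ties between candidates are broken by preferring the partition with fewer equivalence classes, and all function names and the toy data are hypothetical.

```python
from collections import defaultdict

def partition(rows, attrs):
    """Full partition of tuple ids by their values on attrs."""
    classes = defaultdict(set)
    for tid, row in enumerate(rows):
        classes[tuple(row[a] for a in sorted(attrs))].add(tid)
    return list(classes.values())

def intersect(p1, p2):
    """Partition product: tuples stay together only if both inputs agree."""
    owner = {t: cid for cid, cls in enumerate(p1) for t in cls}
    groups = defaultdict(set)
    for cid2, cls in enumerate(p2):
        for t in cls:
            groups[(owner[t], cid2)].add(t)
    return list(groups.values())

def select_partitions(target, cache):
    """Pick cached partitions covering target, then order them small to large."""
    target = frozenset(target)
    usable = {k: p for k, p in cache.items() if k <= target}
    start = min(usable, key=lambda k: len(usable[k]))  # smallest partition first
    chosen, covered = [start], set(start)
    while covered != target:
        # Rule 1: most newly added attributes; tie-break on fewer classes.
        best = max(usable, key=lambda k: (len(k - covered), -len(usable[k])))
        if not best - covered:
            raise ValueError("cached partitions cannot cover the target")
        chosen.append(best)
        covered |= best
    # Rule 2: intersect partitions with fewer equivalence classes first.
    return sorted(chosen, key=lambda k: len(usable[k]))

def cached_partition(target, cache):
    keys = select_partitions(target, cache)
    result = cache[keys[0]]
    for k in keys[1:]:
        result = intersect(result, cache[k])
    return result

# Hypothetical five-attribute relation and a cache with the subsets from the text.
rows = [dict(zip("ABCDE", v)) for v in [
    (1, 1, 1, 1, 1), (1, 1, 2, 1, 1), (1, 2, 1, 2, 1),
    (2, 2, 1, 2, 2), (2, 2, 1, 2, 2),
]]
cache = {frozenset(k): partition(rows, k)
         for k in ["A", "B", "C", "D", "E", "AB", "AD", "CE"]}
from_cache = cached_partition("ABCDE", cache)
```

Because the selected attribute sets jointly cover the target, the product of their partitions equals the partition computed directly on all five attributes, while requiring fewer intersections than combining the single-attribute partitions one by one.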
When caching partitions, memory resources are usually limited, and caching all partitions may occupy excessive memory. Because most partitions are only used for a period of time, DisTFD saves memory by clearing partitions that are no longer used [41]. Each time a cached partition is returned, PartitionMgr records the access time of that partition, and it periodically clears the partitions that have not been used recently.
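One possible shape for such a cleanup policy is sketched below in Python. The class name `PartitionCache`, the logical clock, and the `max_idle` threshold are assumptions chosen for illustration; the paper does not specify how PartitionMgr measures idleness or how often it sweeps.

```python
class PartitionCache:
    """Cache that forgets partitions not accessed within max_idle operations."""

    def __init__(self, max_idle):
        self.max_idle = max_idle
        self.clock = 0        # logical time, advanced on every put/get
        self.store = {}       # attribute key -> partition
        self.last_used = {}   # attribute key -> logical time of last access

    def put(self, key, part):
        self.clock += 1
        self.store[key] = part
        self.last_used[key] = self.clock

    def get(self, key):
        self.clock += 1
        if key not in self.store:
            return None
        self.last_used[key] = self.clock  # record the access time
        return self.store[key]

    def sweep(self):
        """Remove and return the keys that have been idle for too long."""
        stale = [k for k, t in self.last_used.items()
                 if self.clock - t > self.max_idle]
        for k in stale:
            del self.store[k]
            del self.last_used[k]
        return stale

cache = PartitionCache(max_idle=2)
cache.put("A", [[0, 1], [2]])
cache.put("B", [[0], [1, 2]])
cache.get("A")
cache.get("A")
first_sweep = cache.sweep()   # "B" has been idle for exactly max_idle ticks
cache.get("A")
second_sweep = cache.sweep()  # "B" is now idle for longer than max_idle
```

A logical clock is used instead of wall-clock time so that the example is deterministic; a periodic background task calling `sweep` would play the role of the periodic clearing described above.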

Task Assignment and Validation
The sorted attributes are selected in turn as public (common) attributes for data redistribution. During data redistribution, tuples with the same value on the public attribute are sent to the same node by computing the hash of that value. When a non-skewed attribute is used as the public attribute, the candidates are verified directly after data redistribution; when a skewed attribute is selected as the public attribute, DisTFD assigns tasks with a greedy strategy to achieve load balancing [42].
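Hash-based redistribution on a public attribute can be sketched in plain Python as below. DisTFD itself runs on Apache Spark; this stand-alone version only illustrates the idea, and the choice of CRC32 as the hash, the function names, and the sample rows are assumptions.

```python
import zlib
from collections import defaultdict

def node_for(value, n_nodes):
    """Map an attribute value to a node id with a stable hash (CRC32 here)."""
    return zlib.crc32(repr(value).encode("utf-8")) % n_nodes

def redistribute(rows, attr, n_nodes):
    """Send every tuple to the node chosen by the hash of its value on attr."""
    nodes = defaultdict(list)
    for row in rows:
        nodes[node_for(row[attr], n_nodes)].append(row)
    return nodes

# Hypothetical rows; "city" plays the role of the public attribute.
rows = [
    {"city": "Oslo", "zip": "0150"},
    {"city": "Bergen", "zip": "5003"},
    {"city": "Oslo", "zip": "0151"},
]
placement = redistribute(rows, "city", n_nodes=4)
```

Since the node is a function of the attribute value alone, all tuples that agree on the public attribute end up on the same node, which is what allows each node to build its local equivalence classes independently. A heavily skewed value would overload one node, which motivates the greedy assignment.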
Each attribute value of the public attribute is represented by key_i (1 ≤ i ≤ m), and the process of task assignment shown in Figure 5 is as follows:
1. Sort the keys from small to large according to the frequency of each attribute value counted in data preprocessing.
2. Add up all key frequencies and compute the mean load Avg over the number of nodes.
3. Traverse the keys. If the frequency of a key is greater than Avg, split it and assign the split part to a node with a load of 0, record the correspondence between the key and the node, and subtract Avg from the frequency of the key. Repeat this until the frequency of the key is less than Avg. If the remaining frequency of the key is not 0, re-insert the key into the queue.
4. Repeat step 3 until all keys with a frequency greater than Avg are processed.
5. Select the remaining nodes that are not involved in step 3, traverse the key queue, and compute the sum Sum of the node load and the key frequency. If Sum is less than Avg, assign the key to the current node, use Sum as the new load of the current node, and delete the key from the queue. Repeat the above steps until all keys in the queue are processed.
6. Repeat step 5 to balance the load of the remaining nodes, and record the correspondence between keys and node assignments.

Algorithm 4 describes the process of task assignment. Lines 1-3 calculate the sum of the key frequencies and the average load Avg on m nodes. Lines 4-11 split the partitions with a load greater than the average and record the assignment relationship between keys and nodes. Lines 12-20 traverse the key queue, merge the partitions with a load less than the average, and record the relationship between keys and node assignments.
Algorithm 4
1. Read the frequency of Key in preprocessing and record it to Kfreq
…
5. if (Key.size > Avg)
6. for Key.size > Avg do /* Key is assigned to the node with a load of 0 */
…
8. end for /* Re-insert the split Key into the queue */
…
10. else break
11. end for
12. for each node do
…
17. end for
18. end for
19. return true
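The split-and-merge assignment described above can be sketched in Python as follows. Several details are assumptions made to keep the example well defined: Avg is rounded up to an integer number of tuples, a key is accepted by a node when the resulting load is at most Avg, and any key that fits on no remaining node is given to the currently least loaded node. The function name `assign_tasks` and the sample frequencies are hypothetical.

```python
def assign_tasks(freq, n_nodes):
    """Greedy assignment of key frequencies to nodes.

    freq maps each key to its tuple count. Returns (assignment, loads),
    where assignment maps a node id to a list of (key, count) shares.
    """
    total = sum(freq.values())
    avg = -(-total // n_nodes)  # assumption: ceiling of the mean load
    loads = [0] * n_nodes
    assignment = {node: [] for node in range(n_nodes)}
    queue = []
    next_empty = 0

    # Split phase: carve Avg-sized shares of heavy keys onto empty nodes.
    for key, count in sorted(freq.items(), key=lambda kv: kv[1]):
        while count > avg and next_empty < n_nodes:
            assignment[next_empty].append((key, avg))
            loads[next_empty] = avg
            next_empty += 1
            count -= avg
        if count > 0:
            queue.append((key, count))  # remainder goes back to the queue

    # Merge phase: pack the queued keys onto the nodes not used above.
    for node in range(next_empty, n_nodes):
        waiting = []
        for key, count in queue:
            if loads[node] + count <= avg:
                assignment[node].append((key, count))
                loads[node] += count
            else:
                waiting.append((key, count))
        queue = waiting

    # Assumed fallback: leftovers go to the least loaded node, largest first.
    for key, count in sorted(queue, key=lambda kc: -kc[1]):
        node = min(range(n_nodes), key=lambda n: loads[n])
        assignment[node].append((key, count))
        loads[node] += count
    return assignment, loads

# Hypothetical skewed attribute: key "a" covers 60 of 100 tuples.
freq = {"a": 60, "b": 10, "c": 10, "d": 10, "e": 10}
assignment, loads = assign_tasks(freq, n_nodes=4)
```

With plain hashing the node holding key "a" would receive at least 60 tuples, whereas here the heavy key is spread over several nodes and every tuple is still assigned exactly once.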
After load balance is achieved, the local equivalence classes are obtained by computing the partitions in parallel at each node, and the local equivalence classes with the same value are merged. Finally, the partitions of the LHS and of LHS ∪ RHS of the candidate functional dependency are obtained. The process of merging local equivalence classes is shown in Figure 6.
Figure 6. The process of merging the local equivalence classes of each node.
Let the candidate functional dependencies share a set of common attributes, and let D = (D_1, …, D_{m−1}, D_m) be the data after redistribution on the attribute value of A; then … where m is the number of keys and j ∈ [1, m]. Algorithm 5 shows the process of verifying the selected functional dependencies in parallel at each node.
Algorithm 5
…
for each Π_(X/Z_ij) in node i
…
end for
end for
…
return false
…
return false
The input of Algorithm 5 is the redistributed data set D and the candidate functional dependency ϕ: X → Y, and the output is the verification result. The algorithm first sends a request to the partition cache to obtain partial results and then computes the partition of the remaining attributes in X. It then verifies the candidate functional dependency on each single node; if the functional dependency holds on every node, the results with the same key are merged. Before the partition is stored in the cache, the merged result is verified again to avoid storing invalid partitions, and finally the verification result of the candidate functional dependency is output.
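The verify-locally-then-merge flow described above can be sketched in Python as follows. This is a simplified single-process illustration, not Algorithm 5 itself: nodes are modeled as lists of (tuple id, row) pairs, the partition cache is omitted, and all function names and data are assumptions.

```python
from collections import defaultdict

def local_classes(rows, attrs):
    """Local equivalence classes on one node, keyed by the values on attrs."""
    classes = defaultdict(set)
    for tid, row in rows:
        classes[tuple(row[a] for a in attrs)].add(tid)
    return classes

def merge_classes(per_node):
    """Merge local classes that share the same value across nodes."""
    merged = defaultdict(set)
    for classes in per_node:
        for value, tids in classes.items():
            merged[value] |= tids
    return merged

def verify(nodes, lhs, rhs):
    """Check lhs -> rhs: locally on each node first, then on merged classes."""
    for rows in nodes:
        if len(local_classes(rows, lhs)) != len(local_classes(rows, lhs + rhs)):
            return False  # a violation on one node already refutes the candidate
    merged_x = merge_classes(local_classes(rows, lhs) for rows in nodes)
    merged_xy = merge_classes(local_classes(rows, lhs + rhs) for rows in nodes)
    return len(merged_x) == len(merged_xy)

# Hypothetical layout in which the skewed value A = 1 was split across two nodes.
nodes = [
    [(0, {"A": 1, "B": "x", "C": 5}), (1, {"A": 1, "B": "x", "C": 5})],
    [(2, {"A": 1, "B": "x", "C": 6}), (3, {"A": 2, "B": "y", "C": 7})],
]
a_to_b = verify(nodes, ["A"], ["B"])
a_to_c = verify(nodes, ["A"], ["C"])
merged_a = merge_classes(local_classes(rows, ["A"]) for rows in nodes)
```

The candidate A → C passes the local check on both nodes but fails after merging, because the tuples with A = 1 disagree on C only across nodes. This is why the merged result has to be verified again when a key has been split.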

Experiment
In this section, experiments are performed on real and synthetic datasets, and DisTFD is compared with existing algorithms to verify the efficiency, scalability, and accuracy of the proposed algorithm.

Experimental Setup
In this experiment, a cluster of 8 servers connected through a local area network is used. Each server has an Intel Xeon2 processor and 32 GB of memory and runs Ubuntu 10.4. The algorithm is written in Java and runs on Apache Spark with the HDFS distributed file system. A summary of the experimental datasets is shown in Table 1.

Scalability
In this section, the scalability of DisTFD (Node scalability and Data scale scalability) is evaluated and compared with other algorithms.
Node scalability. With the dataset scale fixed, the number of nodes |V| is varied over 3 ≤ |V| ≤ 8 to evaluate the scalability of the algorithm with respect to the number of nodes. As shown in Figure 7a,b, as the number of nodes increases, the response times of HFDD and DisTFD decrease significantly, while the response time of Cet increases slowly. Cet verifies the candidate functional dependencies by concentrating the data on the master node. When the number of nodes increases, the data of each node must migrate to the master node, so the amount of data migration grows and the load becomes unbalanced, which increases the response time. HFDD and DisTFD verify candidate functional dependencies in parallel, so their response times decrease significantly as the number of nodes increases; for the same number of nodes, DisTFD is more efficient than HFDD. The results show that DisTFD has better node scalability.
Data scale scalability. With the number of nodes fixed at |V| = 4, the scale of the dataset |D| is varied from 20% to 100% to evaluate the scalability of the algorithm with respect to the data scale. Figure 8a,b show the response times of algorithms Cet, HFDD, and DisTFD under different data scales, respectively. From Figure 8a,b, it can be concluded that the response times of Cet, HFDD, and DisTFD all increase as the data scale grows. Under the same conditions, the distributed discovery algorithms HFDD and DisTFD have shorter response times than the centralized discovery algorithm Cet, and DisTFD is significantly more efficient than HFDD. It can therefore be concluded that the DisTFD algorithm proposed in this paper has better scalability in terms of data scale.

Evaluation of Accuracy
In this section, we evaluate the accuracy of the algorithm by comparing the results of algorithms Cet, HFDD, and DisTFD with those of the TANE [25] algorithm, respectively, using the method in the literature [45].
We consider Precision, Recall, and F1measure as the metrics of algorithm accuracy. The confusion matrix for classification results is shown in Table 2. As shown in Table 2, the results can be divided into four categories: true positive (TP), false positive (FP), true negative (TN), and false negative (FN). Precision, Recall, and F1measure can be calculated as:
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1measure = 2 × Precision × Recall / (Precision + Recall)
We take the results of algorithm TANE as the truth and the results of algorithms Cet, HFDD, and DisTFD as the predicted values, and we calculate and compare the Precision, Recall, and F1measure of the three algorithms. The comparison results are shown in Table 3. As shown in Table 3, Cet, HFDD, and DisTFD differ little in Precision, Recall, and F1measure, and the F1measure of DisTFD is slightly higher, indicating that all of the above algorithms have high accuracy and that DisTFD is more efficient at similar accuracy.
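As a small worked illustration of these metrics, the following Python sketch compares a set of discovered functional dependencies against a reference set. The function name `evaluate` and the example dependencies are hypothetical; they are not taken from the experiments in this paper.

```python
def evaluate(truth, predicted):
    """Precision, recall, and F1 of predicted FDs against reference FDs."""
    tp = len(truth & predicted)   # found and present in the reference
    fp = len(predicted - truth)   # found but absent from the reference
    fn = len(truth - predicted)   # present in the reference but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Each FD is written as (left-hand side, right-hand side).
truth = {("A", "B"), ("AB", "C"), ("C", "D")}
predicted = {("A", "B"), ("AB", "C"), ("B", "D")}
precision, recall, f1 = evaluate(truth, predicted)
```

In this example two of the three predicted dependencies are correct and two of the three reference dependencies are found, so all three metrics equal 2/3.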

Evaluation of Performance
In this section, the effectiveness of the proposed method is evaluated by two sets of experiments, respectively.
Evaluation of partition cache. By changing the number of columns in the ONTS and Airline datasets, we evaluate the effect of turning the partition cache on and off on the response time of the algorithm. The number of nodes is fixed at |V| = 4, and the number of data columns ranges from 10% to 60%. Figure 9a,b show how the response time of DisTFD changes with the partition cache turned on or off as the number of columns increases. As shown in Figure 9, when the number of columns is small, turning the partition cache on or off does not change the response time significantly. When the number of columns is large, partition caching significantly reduces the response time, so the benefit of the partition cache grows with the number of columns.
Evaluation of load balancing. By injecting attribute values with different skewness into the synthetic dataset Stud, we evaluate the performance of the algorithm under different degrees of uniformity of attribute values. The number of nodes is fixed at |V| = 4. Skewness is measured as the ratio of the number of tuples carrying the most frequent value of the attribute on the left-hand side of the functional dependency to the total number of tuples in the dataset, and it is varied from 10% to 60%. Figure 10 shows the response times of algorithms Cet, HFDD, and DisTFD under different skewness of attribute values. As shown in Figure 10, the response time of Cet increases slightly as skewness increases, and the response time of HFDD increases significantly at larger skewness. In contrast, the response time of DisTFD does not change significantly as skewness increases. Therefore, DisTFD performs better when attribute values are unevenly distributed.

Conclusions and Future Work
To address the problems of centralized functional dependency discovery, this paper proposes a functional dependency discovery algorithm suitable for distributed data, focusing on reducing the response time of distributed functional dependency discovery. To improve the efficiency of functional dependency discovery in a distributed environment, the intermediate results of the calculation are stored in a cache to reduce the repeated calculation of equivalence classes, and the load is balanced during verification to avoid the inefficiency caused by unbalanced load. The proposed algorithm is validated on real and synthetic datasets. The results show that the algorithm scales well with the number of nodes and the data scale and significantly improves execution efficiency compared with existing methods. In future work, we will consider discovering approximate functional dependencies and discovering functional dependencies from incomplete data. In addition, how to improve the column scalability of the algorithm is also a problem that needs to be considered.