An Improved Similarity-Based Clustering Algorithm for Multi-Database Mining

Clustering algorithms for multi-database mining (MDM) rely on computing the (n² − n)/2 pairwise similarities between n databases to generate and evaluate m ∈ [1, (n² − n)/2] candidate clusterings, from which the partitioning that optimizes a predefined goodness measure is selected. However, when these pairwise similarities are distributed around the mean value, the clustering algorithm becomes indecisive when choosing which database pairs are eligible to be grouped together. Consequently, a trivial result is produced: either all n databases are put into one cluster, or n singleton clusters are returned. To tackle this problem, we propose a learning algorithm that reduces the fuzziness of the similarity matrix by minimizing a weighted binary entropy loss function via gradient descent and back-propagation. As a result, the learned model improves the certainty of the clustering algorithm, allowing it to correctly identify the optimal database clusters. Additionally, in contrast to gradient-based clustering algorithms, which are sensitive to the choice of the learning rate and require more iterations to converge, we propose a learning-rate-free algorithm that assesses the candidate clusterings generated on the fly within an upper-bounded number of iterations. To achieve this goal, we use coordinate descent (CD) and back-propagation to search for the optimal clustering of the n databases in a way that minimizes a convex clustering quality measure L(θ) in fewer than (n² − n)/2 iterations. By using a max-heap data structure within our CD algorithm, we optimally choose the largest weight variable θ_p,q(i) at each iteration i, such that taking the partial derivative of L(θ) with respect to θ_p,q(i) allows us to attain the next steepest descent minimizing L(θ) without using a learning rate. Through a series of experiments on multiple database samples, we show that our algorithm outperforms the existing clustering algorithms for MDM.


Introduction
Large multi-branch companies need to analyze multiple databases to discover useful patterns for the decision-making process. To make global decisions for the entire company, the traditional approach suggests merging and integrating the local branch databases into a huge data warehouse, and then applying data mining algorithms [1] to the accumulated dataset to mine the global patterns useful for all the branches of the company. However, there are some limitations associated with this approach. For instance, the cost of moving the data over the network, and of integrating and storing potentially heterogeneous databases, could be high. Moreover, some branches may not accept sharing their raw data due to the underlying privacy issues. More crucially, integrating a large amount of irrelevant data can easily disguise essential patterns hidden in the multiple databases. To tackle these problems, it is suggested to keep the transactional data stored locally and only forward the local patterns mined at each branch database to a central site, where they are clustered into disjoint, cohesive pattern-base groups for knowledge discovery. In fact, analyzing the local patterns present in each individual cluster of the multiple databases (MDB) enhances the quality of aggregating novel relevant patterns, and also facilitates the parallel maintenance of the obtained database clusters. Various clustering algorithms and models have been introduced in the literature, namely spectral-based models [2], hierarchical [3], partitioning [4], competitive learning-based models [5][6][7] and artificial neural network (ANN) based clustering [8][9][10]. Additionally, clustering can be applied in many domains [11,12], including community discovery in social networks [13,14], image segmentation [15,16] and recommendation systems [17][18][19].
In this article, we focus on exploring similarity-based clustering models for multi-database mining [20][21][22][23], due to their stability, simplicity [24] and robustness in partitioning graphs of n multiple databases into k connected components consisting of similar database objects. Nevertheless, the existing clustering quality measures in [20][21][22][23] are non-convex objectives suffering from the existence of local optima. Consequently, identifying the optimal clustering may be a difficult task, as it requires evaluating all the candidate clusterings generated at all the local optima in order to find the ideal clustering.
To address the issues associated with clustroid initialization, the preselection of a suitable number of clusters and the non-convexity of the clustering quality objectives, we proposed in [25,26] an algorithm named GDMDBClustering, which minimizes a quasi-convex loss function quantifying the quality of the multi-database clustering, without a priori assumptions about the number of clusters to choose. Therefore, in contrast to the clustering models proposed in [20][21][22][23], GDMDBClustering [25] does not require producing and assessing all the possible candidate clusterings in order to find the optimal partitioning. Instead, each partitioning is assessed on the fly as it is generated, and the clustering algorithm terminates right after attaining the global minimum of the objective function. However, the existing gradient-based clustering algorithms [25,26] depend strongly on the choice of the learning rate η, which influences the number of learning cycles required to find the optimal partitioning. In fact, selecting a larger η value may cause overshooting of the global minimum, and setting a smaller η value may necessitate many learning iterations for the algorithm to converge.
In this paper, we improve upon previous work [25,26] and propose a learning-rate-free (i.e., independent of the learning rate η) algorithm requiring fewer, upper-bounded iterations (i.e., at most (n² − n)/2 iterations) to minimize a convex clustering loss function L(θ) using coordinate descent (CD) and back-propagation. Precisely, our proposed algorithm minimizes a quadratic hinge-based loss L(θ) over the largest coordinate variable θ_p,q while keeping the remaining (n² − n)/2 − 1 variables fixed. Then, it minimizes L(θ) over the second largest coordinate variable while keeping the remaining (n² − n)/2 − 1 variables fixed, and so on until convergence or until cycling through all (n² − n)/2 coordinate variables. Consequently, our algorithm is faster than GDMDBClustering [25], which depends on a learning rate and also requires minimizing the cost over a large set of variables at each iteration. The latter can be a very challenging problem, in contrast to minimizing the loss over one single variable at a time while keeping all the other dimensions fixed.
On the other hand, existing clustering algorithms for multi-database mining (MDM) [20][21][22][23][25,26] proceed by computing the (n² − n)/2 pairwise similarities sim(D_p, D_q) ∈ [0, 1] between n databases, and then use these values to generate and evaluate m ∈ [1, (n² − n)/2] candidate clusterings in order to select the partitioning optimizing a given goodness measure. However, when the similarities sim(D_p, D_q) (p = 0, …, n − 2; q = p + 1, …, n − 1) are distributed around the mean value µ = 0.5, the fuzziness index of the similarity matrix increases and the clustering algorithm becomes uncertain when choosing which database pairs are considered similar, and hence eligible to be put into the same cluster. Consequently, a trivial result is produced, i.e., all n databases are put into one cluster, or n singleton clusters are returned. To tackle this problem, we propose a learning algorithm that reduces the fuzziness in the pairwise similarities by minimizing a weighted binary entropy loss function H(·) via gradient descent and back-propagation. Precisely, the learned model forces the similarity values above 0.5 to move closer to their maximum value (≈1), and those below 0.5 to move closer to their minimum value (≈0), in a way that minimizes H(·). This significantly reduces the associated fuzziness and improves the certainty with which the clustering algorithm identifies the optimal database clusters. The main contributions of this article are listed as follows:
• Unlike the existing algorithms proposed in [20][21][22][23][25,26], where one-class trivial clusterings are produced when the similarity values are centered around the mean value, we add a preprocessing layer prior to clustering in which the pairwise similarities are adjusted to reduce the associated fuzziness and hence improve the quality of the produced clustering.
Our experimental results show that reducing the fuzziness of the similarity matrix helps generate meaningful, relevant clusters that differ from the one-class trivial clusterings.
• Unlike the multi-database clustering algorithms proposed in [20][21][22][23], our approach uses a convex objective function L(θ) to assess the quality of the produced clustering. This allows our algorithm to terminate just after attaining the global minimum of the objective function (i.e., after exploring fewer similarity levels). Consequently, this avoids generating unnecessary candidate clusterings, and hence reduces the CPU overhead. On the other hand, the clustering algorithms in [20][21][22][23] use non-convex objectives (i.e., they suffer from the existence of local optima due to the use of more than two monotonic functions), and therefore require generating and evaluating all (n² − n)/2 local candidate clustering solutions in order to find the clustering located at the global optimum.
• Furthermore, unlike the previous gradient-based clustering algorithms [25,26], our proposed algorithm is learning-rate-free (i.e., independent of the learning rate), and needs at most (in the worst case) (n² − n)/2 iterations to converge. That is why our proposed algorithm is faster than GDMDBClustering [25], which strongly depends on the learning step size η and its decay rate.
• Additionally, unlike the similarity measure proposed in [20], which assumes that the same threshold was used to mine the local patterns from the n transactional databases, our proposed similarity measure takes into account the existence of n different local thresholds, which are combined to calculate a new threshold for each cluster. Afterward, using the new thresholds, our similarity measure accurately estimates the valid patterns post-mined from each cluster in order to compute the (n² − n)/2 pairwise similarities.
• The experiments carried out on real, synthetic and randomly generated datasets show that the proposed clustering algorithm outperforms the compared clustering models in [20][21][22][23][25,26], as it has the shortest average running time and the lowest average clustering error.
The remainder of this paper is organized as follows: Section 2 presents an example motivating the importance of clustering for multi-database mining (MDM) and also reviews traditional clustering algorithms for MDM. Section 3 defines the main concepts related to similarity-based clustering and then introduces the proposed approach and its main components. Section 4 presents and analyzes the experimental results. Finally, Section 5 draws conclusions and highlights potential future work.

Motivating Example
Prior to mining the multiple databases (MDB) of a multi-branch enterprise, it is essential to cluster these MDB into disjoint, cohesive pattern-base groups sharing an important number of local patterns in common. Then, using local pattern analysis and pattern synthesizing techniques [27][28][29][30], one can examine the local patterns in each individual cluster to discover novel patterns, including the exceptional patterns [31] and the high-vote patterns [32], which are extremely useful when it comes to making special targeted decisions regarding each cluster of branches of the same corporation. In the following example, we show the impact of clustering the multi-databases of a multi-branch corporation prior to multi-database mining. Consider the six transactional databases D = ∪⁶_{p=1} {D_p} shown in Table 1, where each database D_p records a set of transactions enclosed in parentheses and each transaction contains a set of items separated by commas. Consider a minimum support threshold α = 0.5. The local frequent itemsets discovered from each database D_p, denoted by FIS(D_p, α), are shown in Table 2, such that I_k in each tuple ⟨I_k, supp(I_k, D_p)⟩ of FIS(D_p, α) is the frequent itemset name and supp(I_k, D_p), named the support, is the ratio of the number of transactions in D_p containing I_k to the total number of transactions in D_p. Now, the global support of each itemset I_k ∈ ∪⁶_{p=1} {FIS(D_p, 0.5)} is calculated via the synthesizing equation [33], defined as follows:

supp(I_k, D) = (∑ⁿ_{p=1} |D_p| × supp(I_k, D_p)) / ∑ⁿ_{p=1} |D_p|,   (1)

where n = 6 is the total number of databases in D and |D_p| is the number of transactions in D_p. For instance, the global support of the itemset A is obtained by substituting its local supports into (1). After computing the global supports of the rest of the itemsets using (1), not a single novel pattern is found, i.e., ∀ I_k ∈ ∪⁶_{p=1} {FIS(D_p, 0.5)}, supp(I_k, D) < 0.5. The reason is that irrelevant patterns were involved in the synthesizing procedure.
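As a small illustration, the synthesizing step in Equation (1) can be sketched in Python. The database sizes and local supports below are hypothetical placeholders, since the contents of Tables 1 and 2 are not reproduced here:

```python
def synthesize_support(sizes, local_supports):
    """Weighted-average synthesizing of a global support, as in Equation (1).

    sizes[p] is |D_p|; local_supports[p] is supp(I_k, D_p), taken as 0
    when I_k is not frequent in D_p.
    """
    total = sum(sizes)
    return sum(n * s for n, s in zip(sizes, local_supports)) / total

# Hypothetical example: an itemset frequent in three of six databases.
sizes = [4, 5, 4, 3, 4, 4]
supports_A = [0.75, 0.6, 0.75, 0.0, 0.0, 0.0]
print(synthesize_support(sizes, supports_A))  # → 0.375
```

With these illustrative values, the synthesized global support (0.375) stays below α = 0.5, mirroring how mixing dissimilar databases can suppress every candidate pattern.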
Now, if we examine the frequent itemsets in Table 2, we observe that some databases share many patterns in common. Precisely, the six databases seem to form two clusters, C_1 = {D_1, D_2, D_3} and C_2 = {D_4, D_5, D_6}, where each cluster of databases tends to share similar frequent itemsets.
Next, let us apply the synthesizing Equation (1) to the frequent itemsets coming from every single cluster C_i, i.e., with p ∈ {1, 2, 3} for cluster C_1 and p ∈ {4, 5, 6} for cluster C_2. This time, new valid frequent itemsets with a support value above the minimum threshold α are discovered in the two clusters. In fact, FIS(C_1, 0.5) = {⟨C, 0.769⟩, ⟨B, 0.769⟩, ⟨A, 0.692⟩} and FIS(C_2, 0.5) = {⟨FH, 0.727⟩, ⟨F, 0.727⟩, ⟨H, 0.818⟩}. The obtained patterns show that more than 69% of the total transactions in cluster C_1 include the itemsets C, B and A, and more than 72% of the total transactions in cluster C_2 include FH, F and H. Moreover, some associations between itemsets can be derived as well; for instance, the itemset ⟨FH, 0.727⟩ ∈ FIS(C_2, 0.5) suggests that, on average, if a customer collects the item H at one of the branches in C_2, they are likely to also buy the item F with a confidence of supp(FH, C_2)/supp(H, C_2) ≈ 88.87%. The above example demonstrates the importance of clustering the multi-databases into disjoint, cohesive clusters before synthesizing the global patterns. In fact, when the local patterns mined from the six databases were analyzed all together, no global pattern could be synthesized. On the other hand, when the six databases were divided into two different clusters and each cluster was analyzed individually, useful and novel patterns (knowledge) were discovered. From the discovered knowledge, decision makers and stakeholders gain a clear vision of the branches that exhibit similar purchasing behaviors, and can hence make useful decisions accordingly. In fact, appropriate business decisions may be taken regarding each group of similar branches in order to predict potential purchasing patterns, increase the customer retention rate and convince customers to purchase more services in the future.
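The confidence figure quoted above is simply the ratio of the two synthesized supports; a one-line check (using the rounded supports reported for C_2, so the last decimal may differ slightly from the exact value):

```python
# Confidence of the rule H -> F in cluster C_2, derived from the
# synthesized supports supp(FH, C_2) = 0.727 and supp(H, C_2) = 0.818.
supp_FH, supp_H = 0.727, 0.818
confidence = supp_FH / supp_H
print(f"confidence(H -> F) = {confidence:.2%}")
```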
Consequently, exploring and examining individual clusters of similar local patterns is going to help the discovery of new and relevant patterns capable of improving the decision-making quality.

Prior Work
The authors in [34] adopted a divide-and-conquer mono-database mining approach to accelerate mining global frequent itemsets (FIs) in large transactional databases. In [35,36], the authors proposed similar work, in which big transactional databases are divided into k disjoint transaction partitions whose sizes are small enough to be read and loaded into random access memory. Then, the frequent itemsets mined from all the k partitions are synthesized into global FIs using an aggregation function, such as the one suggested by the authors in [33]. It is worth noting that for mono-database mining applications, we usually have direct access to the raw data stored in big transactional databases. On the other hand, for multi-database mining (MDM) applications, it is suggested to keep the transactional data stored locally and only forward the local patterns mined at each branch database to a central site, where they are clustered into disjoint, cohesive pattern-base groups for knowledge discovery. As a result, the confidential raw data are kept safe, and the cost associated with transmitting a large amount of data over the network is eliminated. Hence, in contrast to clustering the transactional data stored in a single data warehouse, our approach consists of clustering the local patterns mined and forwarded from multi-databases, without requiring the number of clusters to be set a priori. Our purpose is to identify the groups of databases that share similar patterns, such as the high-vote patterns [32] and the exceptional patterns [31,37,38], which can be used to make specific decisions regarding their corresponding branches. In the traditional clustering approach [34][35][36] applied to mono-database mining, we can only mine the global patterns that are supported by the whole multi-branch company.
The existing clustering algorithms for multi-database mining [20,21,23,39,40] are based on an agglomerative process that generates hierarchical partitionings at different similarity levels, where each cluster in a given candidate partitioning is included in another cluster of the partitioning produced at the next similarity level. Despite this nesting, each candidate partitioning is produced without reusing the clusters generated at the previous similarity levels. As a result, the clustering algorithms in [20,21,23,39,40] unnecessarily reconstruct clusters that were already built at previous similarity levels. This limitation inspired the authors in [22] to design a graph-based algorithm that maintains the classes produced at prior similarity levels in order to produce new subsequent classes from them. Although the experiments in [22] showed promising results against the prior work [20,21,23,39,40], these algorithms are based on non-convex functions for evaluating the quality of the produced candidate clusterings. Consequently, finding the ideal clustering for which a non-convex function is optimal may be a difficult problem to solve in a short time.
To face this problem, the authors in [26] transformed the clustering problem into a quasi-convex optimization problem solvable via gradient descent and back-propagation. Consequently, an early stopping of the clustering process occurs right after converging to the global minimum. Hence, by avoiding the generation and evaluation of unnecessary candidate clusterings, the CPU execution time is significantly reduced. Even though traditional clustering algorithms such as k-means [4,41] are intuitive, popular and not hard to implement, they remain sensitive to clustroid initialization, to the preselection of a suitable number of clusters and to the non-convexity of the clustering quality objective [42]. The silhouette plot [43] can be used to find an appropriate number of clusters, but this requires executing k-means multiple times with different numbers of clusters in order to find the partitioning maximizing the silhouette objective. As a result, the time performance suffers in the case of clustering big high-dimensional datasets. Slightly differently, hierarchical clustering algorithms [3] build nested hierarchical levels to visualize the relationships between different objects in the form of dendrograms. Then, it is up to the domain expert, or to some non-convex metrics, to determine at which level the tree diagram should be cut.
Conversely, the optimization problem formulated in [25,26] is quasi-convex. Therefore, convergence to the global optimum is independent of the initial settings. Furthermore, the proposed gradient-based clustering GDMDBClustering [25] does not need the number of clusters as a parameter; instead, the number of clusters becomes a parametric function in the main objective. However, GDMDBClustering relies on the choice of a suitable learning rate: choosing a small learning rate η may increase the number of iterations and slow down learning the optimal weights, whereas a large η may let the algorithm overshoot the global minimum. To overcome this limitation, we propose in this paper a learning-rate-free clustering algorithm, named CDClustering, which minimizes a convex objective function quantifying the clustering quality. For this purpose, we use coordinate descent (CD) and back-propagation to search for the optimal clustering of the n databases in fewer than (n² − n)/2 iterations and without using a learning rate. This makes our algorithm faster than the previous gradient-based clustering algorithms [25,26], which remain dependent on a learning rate defined from prior knowledge of the properties of the loss function. On the other hand, due to the fuzziness of the similarity matrix, which increases when the pairwise similarities are distributed around the mean value, the clustering algorithm becomes indecisive when grouping similar databases together. To face this problem, we design a learning algorithm that adjusts the pairwise similarities between the n databases in a way that minimizes a binary entropy loss function quantifying the fuzziness associated with the similarity matrix. Thus, the proposed algorithm becomes crisp in discriminating between the different database clusters.

Materials and Methods
In this section, we present our fuzziness reduction model applied to the pairwise similarities between n databases, and describe in detail our coordinate descent-based clustering approach. Some definitions and notions relevant to this work are presented first.

Background and Relevant Concepts
In this subsection, we define the similarity measure between two transaction databases and present the process of generating and evaluating a given candidate clustering. We also define four clustering validity functions used to evaluate the clustering quality.

Similarity Measure
Each transactional database D_p is encoded as a hash table of its local frequent itemsets together with its minimum support threshold, where p = 0, …, n − 1, n is the number of transactional databases, m is the number of frequent itemsets in D_p, I_k is the name of the k-th frequent itemset, supp(I_k, D_p) ∈ [0, 1] is the support of I_k (the ratio of the number of rows in D_p containing I_k to the total number of rows in D_p), and α_p ∈ [0, 1] is the minimum support threshold corresponding to D_p, such that supp(I_k, D_p) ≥ α_p. In this paper, the FP-Growth algorithm [1] is used to mine the frequent itemsets in each database D_p, as it only requires two passes over the whole database. Our proposed similarity measure is based on maximizing the number of global frequent itemsets (FIs) synthesized from the local FIs in each cluster. Precisely, to measure the similarity between two transactional databases D_p and D_q, for p = 0, …, n − 2, q = p + 1, …, n − 1, we define the similarity function sim(D_p, D_q) given in Formula (3), together with its auxiliary quantities. We note that the operator |·| denotes the cardinality of the set passed in as an argument. Multiplying α_p by |D_p| returns the minimum number of transactions in which a frequent itemset I_k should occur in D_p. Therefore, α_p,q is the minimum percentage of transactions from the cluster C_p,q = {D_p, D_q} containing the itemset I_k, i.e., supp(I_k, C_p,q) ≥ α_p,q. In fact, the similarity measure sim in Formula (3) takes into account the local minimum support threshold at each database to calculate a new threshold for each cluster. In this paper, instead of writing 'the similarity measure sim in Formula (3)', we often write sim (3).
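The cluster-level threshold α_p,q can be read as a transaction-weighted combination of the two local thresholds: α_p|D_p| and α_q|D_q| are the minimum occurrence counts in D_p and D_q, and pooling these counts over the combined transactions yields the cluster threshold. A minimal sketch under that reading (the paper's exact formula is not reproduced here, so this form is an assumption):

```python
def cluster_threshold(alpha_p, size_p, alpha_q, size_q):
    """Minimum support threshold for the two-database cluster C_{p,q}.

    alpha_p * size_p is the minimum transaction count an itemset needs in
    D_p (and likewise for D_q); pooling the two counts over the combined
    transactions gives a cluster-level threshold. This is one plausible
    reading of the text, not the paper's verbatim formula.
    """
    return (alpha_p * size_p + alpha_q * size_q) / (size_p + size_q)

print(cluster_threshold(0.5, 100, 0.4, 300))  # → 0.425
```

Itemsets whose pooled support in C_p,q reaches this threshold would then count as valid post-mined patterns when computing sim(D_p, D_q).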

Clustering Generation and Evaluation
Let C(D, δ_i) = {C_1, C_2, …, C_k} be a candidate clustering of D = {D_0, D_1, …, D_{n−1}} produced at a given similarity level δ_i. From a graph-theoretic perspective, each cluster C_j represents a connected component in a similarity graph G = (D, E), and an edge (D_p, D_q) is added to the list of edges E if and only if sim(D_p, D_q) ≥ δ_i, where p = 0, …, n − 2, q = p + 1, …, n − 1.
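This graph construction can be sketched directly: add the edges whose similarity reaches the level δ, then read off the connected components as clusters. The union-find sketch below uses toy similarity values and is not the paper's implementation:

```python
from itertools import combinations

def candidate_clustering(databases, sim, delta):
    """Connected components of the similarity graph G = (D, E), where an
    edge (D_p, D_q) is added iff sim(D_p, D_q) >= delta."""
    parent = {d: d for d in databases}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for p, q in combinations(databases, 2):
        if sim[frozenset((p, q))] >= delta:
            parent[find(p)] = find(q)

    clusters = {}
    for d in databases:
        clusters.setdefault(find(d), set()).add(d)
    return sorted(clusters.values(), key=lambda c: sorted(c))

# Toy similarities (illustrative values only).
dbs = ["D0", "D1", "D2", "D3"]
sim = {frozenset(("D0", "D1")): 0.9, frozenset(("D0", "D2")): 0.2,
       frozenset(("D0", "D3")): 0.1, frozenset(("D1", "D2")): 0.3,
       frozenset(("D1", "D3")): 0.2, frozenset(("D2", "D3")): 0.8}
print(candidate_clustering(dbs, sim, 0.7))
```

At δ = 0.7, only the edges (D0, D1) and (D2, D3) survive, so two components (clusters) are returned.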
Initially, G = (D, E) has no edges, i.e., E = ∅. Then, at a given similarity level δ_i ∈ [0, 1], the edges (D_p, D_q) satisfying sim(D_p, D_q) ≥ δ_i are added to E. The similarity level δ_i (i = 1, …, m) is chosen from the list of the m unique sorted pairwise similarities sim(D_p, D_q) computed between the n transaction databases, such that δ_1 > δ_2 > ⋯ > δ_{i−1} > δ_i > δ_{i+1} > ⋯ > δ_m and m ≤ (n² − n)/2. After adding all the edges (D_p, D_q) at δ_i, each graph component C_j (j = 1, …, k) represents one database cluster in the candidate partitioning C(D, δ_i). One can then use one of the clustering goodness measures shown in Table 3 to assess the quality of C(D, δ_i). Table 3. A summary of the clustering quality measures mentioned in this paper.

[Table 3 lists, for each clustering quality measure, its reference, its defining function (equation) and its optimal value (maximum or minimum); the listed measures include the goodness measure of [20] (maximized), the measure of [23] with f(D) > 1 (maximized), and the silhouette coefficient of [43,44].]
Once we generate and evaluate all the m ≤ (n² − n)/2 candidate clusterings, we report the global optimum (minimum or maximum) of the goodness measure and compare its corresponding clustering with the ground truth when the latter is known, or with the clustering generated at the maximum point of the silhouette coefficient when the ground truth is unknown. In fact, the silhouette coefficient SC(D) ∈ [−1, 1] proposed in [43,44] (see the last row of Table 3) can be used to verify the correctness of the cluster labels assigned to the n transactional databases. Precisely, a value SC(D) ≈ 1 suggests that the n transactional databases are highly matched to their own clusters and loosely matched to their neighboring clusters.
We should note that each clustering goodness measure in Table 3 depends on more than two monotonic functions. For instance, the quality measure goodness (see the first row of Table 3) proposed in [20] is based on maximizing both the intra-cluster similarity W(D) (a non-decreasing function on the interval [0,1]) and the inter-cluster distance B(D) (a non-increasing function on [0,1]), while minimizing the number of clusters f(D) (a non-increasing function on [0,1]). Consequently, as shown by the experiments in [25,26], the graphs of the objective functions in Table 3 mostly exhibit non-convex behavior, which makes identifying the ideal partitioning a hard problem to solve without generating and evaluating all the candidate clusterings produced at the local optima.

Similarity Matrix Fuzziness Reduction
In this subsection, we present our fuzziness reduction model applied to the pairwise similarities between n databases. Let z_p,q = θ_p,q × x_p,q be a weighted similarity, such that x_p,q = sim(D_p, D_q) is the similarity value between D_p and D_q computed using Formula (3), and θ_p,q is the weight associated with x_p,q, where p = 0, …, n − 2, q = p + 1, …, n − 1. Let g : R → ]0, 1[ be a continuous piecewise linear activation function and ∂g/∂z_p,q its partial derivative. The graph plots of g(z_p,q, ε) and ∂g(z_p,q, ε)/∂z_p,q with respect to z_p,q are depicted in Figure 1a.
The parameter ε ensures that each value z_p,q is mapped into the range [ε, 1 − ε], where ε is a very small number (e.g., ε = 10⁻⁷) forcing g(z_p,q, ε) to be always above 0 and below 1, so that it can be plugged into our log-based loss function defined in (10).
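In code, such a clamped piecewise linear activation and its (sub)derivative might look as follows. This is a sketch consistent with the description above; the paper's exact piecewise definition is the one plotted in Figure 1a:

```python
def g(z, eps=1e-7):
    """Continuous piecewise-linear activation clamping z into [eps, 1 - eps],
    so that log(g) and log(1 - g) in the entropy loss stay finite."""
    return min(max(z, eps), 1.0 - eps)

def dg_dz(z, eps=1e-7):
    """(Sub)derivative of g: 1 on the linear segment, 0 on the flat parts."""
    return 1.0 if eps < z < 1.0 - eps else 0.0
```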

Fuzziness Index
The fuzziness index of the pairwise similarity vector X, also known as the entropy of the fuzzy set X [45], is a mapping from R^((n² − n)/2) to [0, 1] defined as the normalized sum of the binary entropies of the pairwise similarities:

Fuzziness(X) = −(2/(n² − n)) ∑_{p<q} [x_p,q log₂(x_p,q) + (1 − x_p,q) log₂(1 − x_p,q)].

The smaller the value of Fuzziness(X), the better the clustering performance, and vice versa. In fact, reducing the fuzziness of the pairwise similarities leads to crisper decision making when it comes to finding the optimal partitioning of the n databases. In particular, the fuzziness of the similarity matrix increases when the pairwise values are centered around 0.5, resulting in more confusion when deciding whether two databases should be in the same cluster.
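A direct implementation of this normalized binary-entropy index, with the usual convention 0·log(0) = 0, behaves exactly as described: it peaks when every similarity sits at 0.5 and vanishes for crisp (0/1) similarities:

```python
import math

def fuzziness(X):
    """Fuzziness index: mean binary entropy (base 2) of the pairwise
    similarities, mapping into [0, 1]. 0 * log(0) is taken as 0."""
    def h(x):
        if x in (0.0, 1.0):
            return 0.0
        return -(x * math.log2(x) + (1 - x) * math.log2(1 - x))
    return sum(h(x) for x in X) / len(X)

print(fuzziness([0.5, 0.5, 0.5]))   # maximal fuzziness
print(fuzziness([0.0, 1.0, 1.0]))   # crisp similarities
```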

Proposed Model and Algorithm
To reduce the fuzziness associated with the (n² − n)/2 pairwise similarities between the n transaction databases D = {D_0, D_1, …, D_{n−1}}, we need to make the similarity values above the mean value µ = 0.5 move closer to 1, and the similarity values below µ = 0.5 move closer to 0. To do so, we consider the minimization of the sum of the binary entropy loss functions over the (n² − n)/2 weighted similarity values z_p,q = θ_p,q × x_p,q:

H(θ, ε) = −∑_{p=0}^{n−2} ∑_{q=p+1}^{n−1} [g(z_p,q, ε) log₂ g(z_p,q, ε) + (1 − g(z_p,q, ε)) log₂(1 − g(z_p,q, ε))],   (10)

such that n is the number of databases, θᵀ = [θ_0,1, θ_0,2, …, θ_{n−2,n−1}] is the model weight vector, z_p,q is the weighted similarity θ_p,q × sim(D_p, D_q) and g(z_p,q, ε) is the activation function defined in (7). The graph plots of H(g(z_p,q, ε)) and ∂g(z_p,q, ε) with respect to g(z_p,q, ε) are depicted in Figure 1b. Since the fuzziness of the similarity matrix is influenced by the weights associated with the pairwise similarities, the degree to which a pair of databases (D_p, D_q) belongs to the same cluster can be changed by adjusting the corresponding weight θ_p,q, which is learned by minimizing (10) via gradient descent and back-propagation; the training equations follow by applying the chain rule to (10). Let η_0 and epochs be the initial learning rate and the maximum number of learning iterations, respectively. At each epoch i, the current learning rate η decreases according to the decay schedule in (13). We note that selecting a large learning rate value may cause overshooting of the global minimum, whereas choosing a small learning rate may necessitate many iterations for the algorithm to converge. Hence, it is reasonable to let the learning rate decrease over time as the algorithm converges to the global minimum. In Figure 2 and Algorithm 1, we present in detail the framework and the algorithm of the proposed fuzziness reduction model.
The proposed learning algorithm (Algorithm 1: SimFuzzinessReduction) keeps adjusting the weight vector θ by moving in the opposite direction to the gradient of the loss function H(θ, ε), until it reaches the maximum number of iterations (epochs) or until the magnitude of the gradient vector falls below a minimum tolerance. After convergence, we can feed the new similarity values [g(θ_0,1 × sim(D_0, D_1), ε), g(θ_0,2 × sim(D_0, D_2), ε), …, g(θ_{n−2,n−1} × sim(D_{n−2}, D_{n−1}), ε)] to any similarity-based clustering algorithm in order to improve the quality of the produced clustering when the latter is trivial or irrelevant.
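A compact sketch of this training loop is given below. The 1/(1 + i) decay schedule, the unweighted entropy, and the interior-only gradient are simplifying assumptions for illustration, not Algorithm 1's exact update rules; they nevertheless show the intended behavior of pushing similarities away from 0.5:

```python
import math

def sim_fuzziness_reduction(x, eta0=1.0, epochs=500, eps=1e-7):
    """Gradient-descent sketch of the fuzziness-reduction step: pushes each
    weighted similarity g(theta * x) away from 0.5 towards 0 or 1 by
    descending the binary entropy loss. The decay schedule
    eta = eta0 / (1 + i) is an assumption made for this sketch."""
    theta = [1.0] * len(x)                         # weights initialised to 1
    clamp = lambda z: min(max(z, eps), 1.0 - eps)  # activation g
    for i in range(epochs):
        eta = eta0 / (1 + i)                       # decaying learning rate
        for j, xj in enumerate(x):
            gz = clamp(theta[j] * xj)
            # dH/dtheta_j = log((1 - g)/g) * x_j on the linear segment of g
            grad = math.log((1.0 - gz) / gz) * xj
            theta[j] -= eta * grad
    return [clamp(t * xj) for t, xj in zip(theta, x)]

out = sim_fuzziness_reduction([0.6, 0.4, 0.55])
print([round(v, 3) for v in out])
```

Similarities starting above 0.5 are driven towards 1 and those below 0.5 towards 0, so a downstream similarity-based clustering algorithm receives near-crisp inputs.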

Proposed Coordinate Descent-Based Clustering
In this subsection, we present and discuss our proposed loss function and our coordinate descent-based clustering approach in detail. Unlike the gradient-based clustering in [25,26], our algorithm is learning-rate-free and needs at most (n² − n)/2 learning cycles to converge to the global minimum, where n is the number of transaction databases. In fact, at each iteration, the largest coordinate variable θ_p,q is selected and popped from a max-heap data structure (initially built by pushing the (n² − n)/2 pairwise similarities onto the heap). Then, we minimize our quadratic convex hinge-based loss L(θ) over θ_p,q, which is adjusted by moving in the opposite direction to the gradient of L(θ). This process continues until a convergence test, defined later in this subsection, is satisfied. Each block of selected coordinate variables θ_p,q sharing the same value forms a set of edges to be added to our graph G = (D, E). Determining the disjoint connected components of G after convergence allows us to discover the optimal database clusters maximizing the intra-cluster similarity and the inter-cluster distance.

Proposed Loss Function and Algorithm
In order to implement our coordinate descent-based clustering, we propose a quadratic version of the hinge loss L(θ): ℝ^{(n²−n)/2} → [0, (n² − n)/4], which is a convex function (see proof of Theorem 1) whose minimization problem arg min_θ L(θ) is formulated in (14). A simplified 3D graph plot of L(θ) (14) is depicted in Figure 3, where θ = [θ_1, θ_2] for visualization purposes, and P_1, P_2, P_3, P_4, A, B, C are selected 3D points at which L(θ) is evaluated. From P_1 all the way down to P_4, we can clearly see that L(θ) decreases monotonically as the coordinate variables θ_1 and θ_2 increase their values. In what follows, i is an integer representing the current iteration of our algorithm. Initially, the weight vector θ^T is set to the (n² − n)/2 pairwise similarities X^T = [sim(D_0, D_1), sim(D_0, D_2), ..., sim(D_{n−2}, D_{n−1})], and each weight component of θ^T is pushed onto a max-heap data structure. At each iteration i = 1, ..., (n² − n)/2, the weight θ^(i)_{p,q} (p = 0, ..., n − 2, q = p + 1, ..., n − 1) associated with the current largest similarity value sim(D_p, D_q) is popped from the max-heap and updated by a gradient step, where g: ℝ → [0, 1] is a differentiable activation function whose partial derivative with respect to the weight θ_{p,q} involves the signum function sgn: ℝ → {−1, 1}. The usage of g(·) ensures that each weight θ_{p,q} stays within the range [0, 1]. As there is no learning rate or schedule to choose for our coordinate descent-based algorithm, we set η to 1. Theorem 1 states that L(θ) (14) is convex, satisfying the Jensen inequality for convex functions [46]. Proof. To prove the convexity of L(θ), we show that its Hessian matrix H_L is positive semi-definite. Since H_L is positive semi-definite, satisfying x^T H_L x ≥ 0 for all x ∈ ℝ^{(n²−n)/2}, L(θ) is convex, and therefore convergence to the global minimum is guaranteed.
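Python's heapq module only provides a min-heap, so one common way to realize the max-heap of coordinate variables described above is to push negated similarities; the snippet below is a sketch of that bookkeeping only, not of the full clustering algorithm.

```python
import heapq

def build_max_heap(pair_sims):
    """pair_sims: {(p, q): sim(D_p, D_q)}. heapq is a min-heap, so similarities
    are negated so that the largest coordinate theta_{p,q} is popped first."""
    heap = [(-s, p, q) for (p, q), s in pair_sims.items()]
    heapq.heapify(heap)                  # O(s) for s = (n^2 - n)/2 entries
    return heap

def pop_largest(heap):
    """O(log s) pop of the coordinate theta_{p,q} with the current largest value."""
    neg_s, p, q = heapq.heappop(heap)
    return (p, q), -neg_s
```

Successive pops then return the pairwise similarities in non-increasing order, which is exactly the coordinate-selection order the algorithm relies on.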
In order to reach the global minimum of L(θ) (i.e., min L(θ) = 0), our learning algorithm needs to set the weight vector θ to 1⃗ (i.e., the all-ones vector). Consequently, the intra-cluster similarity will reach its maximum value and all the n databases will be put into the same cluster, resulting in a meaningless partitioning. Therefore, in order to prevent this scenario from occurring, we need to assess the clustering quality after popping all the coordinate variables that have the same weight θ_{p,q} (i.e., a block of weights having the same value) from the max-heap. This corresponds to generating one candidate clustering by adding the list of edges (D_p, D_q) satisfying sim(D_p, D_q) ≥ θ_{p,q} to the graph G = (D, E). Afterward, we need a stopping condition to terminate our algorithm if the current candidate clustering quality is judged to be the optimal one in terms of the intra-cluster similarity W_{θ^(i)}(D) and the number of clusters f_{θ^(i)}(D). For this purpose, we define the quasi-convex loss function L(θ^(i)) (19), evaluated at the i-th iteration in terms of W_{θ^(i)}(D) and f_{θ^(i)}(D). Our algorithm terminates right after it reaches the global minimum of L(·). In other words, if L(θ^(i)) ≤ L(θ^(i−1)), then we continue updating the weight vector and the clustering labels, and we save the optimal partitioning found so far. Otherwise, the algorithm terminates, as it has reached the global minimum L(θ^(i−1)), and the optimal partitioning saved so far is returned as the ideal clustering of the n transactional databases. This stopping condition is only possible due to the quasi-convexity of L(·).
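The stopping rule above amounts to a scan over the candidate-clustering losses L(θ^(1)), L(θ^(2)), ...: keep going while the loss does not increase, and stop at the first increase. A minimal sketch, assuming the per-candidate losses are already available as a list:

```python
def first_quasi_convex_minimum(losses):
    """Return the index of the last non-increasing loss value. For a quasi-convex
    sequence, the first increase signals that the global minimum has been passed,
    so scanning can terminate without evaluating the remaining candidates."""
    best_i = 0
    for i in range(1, len(losses)):
        if losses[i] <= losses[best_i]:
            best_i = i                   # still descending: save this candidate
        else:
            break                        # loss increased: global minimum reached
    return best_i
```

On a quasi-convex sequence such as [0.8, 0.5, 0.3, 0.4, 0.9], the scan stops after the fourth value and returns index 2, without touching the tail of the sequence.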
Theorem 2 states that L(θ) (19) is quasi-convex, satisfying inequality (20) [46]. Proof. To prove the quasi-convexity of L(θ), we need to demonstrate the validity of (20). First, since f_θ(D) is a decreasing function on the range [0, 1], it is both quasi-concave and quasi-convex. Similarly, since W_θ(D) is an increasing function on the range [0, 1], it is also both quasi-concave and quasi-convex. By subtracting the two corresponding inequalities, the right side of the resulting inequality equals f(θ^(i)) − W(θ^(i)), which can be set equal to max{f(θ^(i+1)) − W(θ^(i+1)), f(θ^(i)) − W(θ^(i))}. Finally, by squaring and dividing both sides of the inequality by 2, we obtain a variation of the Jensen inequality for quasi-convex functions [46], as defined in (20). Hence, L(θ) is quasi-convex.

Time Complexity Analysis
In this subsection, we analyze the time complexity of our coordinate descent-based clustering algorithm presented in Algorithm 2, named CDClustering, which depends on the two subroutines presented in Algorithm 3: union and Algorithm 4: cluster. We note that the superscript i enclosed in round brackets, i.e., θ^(i)_{p,q}, indicates the iteration number at which a given variable θ_{p,q} was assigned a value. The proposed algorithm takes as argument the (n² − n)/2 pairwise similarities X^T = [sim(D_0, D_1), sim(D_0, D_2), ..., sim(D_{n−2}, D_{n−1})] and outputs the optimal clustering minimizing our proposed loss function L(θ) (14). First, the weight vector θ^T is initialized to X^T. Afterward, coordinate descent and back-propagation are used to search for the optimal weight vector θ^T minimizing our hinge-based objective L(θ). During each learning cycle i, one coordinate variable θ_{p,q} is popped from the max-heap. Then, θ_{p,q} is updated by making the optimal step in the opposite direction to the gradient of L(θ). The weights θ_{p,q} (p = 0, ..., n − 2, q = p + 1, ..., n − 1) attaining the maximum value of 1 will have their corresponding database pairs (D_p, D_q) put into the same cluster. By using a max-heap data structure within our coordinate descent algorithm, we optimally choose the current largest variable θ^(i)_{p,q} at each iteration i such that taking the partial derivative of our loss L(θ) with respect to θ_{p,q} allows us to attain the next steepest descent minimizing L(θ) without using a learning rate. This way, the number of iterations required for our algorithm to converge is at most (n² − n)/2, i.e., the number of pairwise similarities. Initially, the number of clusters f_θ(D) is set equal to the number of transactional databases n.
Then, in order to keep track of the database clusters, their number f_θ(D) and their sizes, we implement a disjoint-set data structure [47], which consists of an array A[0, ..., n − 1] of n integers managed by two main operations: cluster and union. Each cluster C_p is represented by a tree whose root index p satisfies A[p] = −1, and a database D_q belonging to the cluster C_p satisfies A[q] = p. Therefore, the cluster function is called recursively to find the label assigned to the database index p (passed in as argument) by moving up the tree towards the root (i.e., A[p] = −1). On the other hand, the union procedure links two disjoint clusters C_p and C_q by making the root of the smaller tree point to the root of the larger one in A[0, ..., n − 1]. The algorithms corresponding to union and cluster are presented in Algorithm 3 and Algorithm 4, respectively. Let s = (n² − n)/2 be the size of the weight vector θ^T. The time complexity of building the max-heap is O(s) and the time complexity of the proposed Algorithm 2: CDClustering is O(s + h log₂(n)), such that h ∈ [1, s] is the number of learning cycles run until convergence to the global minimum, and O(log₂(n)) is the time complexity of one pop operation from the heap. The proposed model is also illustrated in Figure 4. Since it is meaningless to return a single cluster consisting of all the n databases, if the clustering obtained at step (10a) is trivial (i.e., all the n databases are put together in one class or each database stands alone in its own cluster), then we first need to run the model proposed in Figure 2 on the pairwise similarities to reduce the associated intrinsic fuzziness measured in (9). Afterward, we can apply the proposed model in Figure 4 on the new adjusted similarity values to obtain more relevant results.
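A minimal sketch of the cluster/union pair follows. Note one deliberate deviation from the paper: instead of marking every root r with A[r] = −1 and tracking cluster sizes separately, this variant stores the negated cluster size at the root, which folds the size bookkeeping into the same array while preserving the union-by-size behavior described above.

```python
def cluster(A, p):
    """Find the label (root index) of database D_p by walking up the tree.
    A root r is marked by a negative entry A[r] (here: minus the cluster size)."""
    while A[p] >= 0:
        p = A[p]
    return p

def union(A, p, q):
    """Link the clusters of D_p and D_q, putting the smaller tree under the larger one."""
    rp, rq = cluster(A, p), cluster(A, q)
    if rp == rq:
        return                           # already in the same cluster
    if A[rp] > A[rq]:                    # A[r] = -size, so "greater" means smaller tree
        rp, rq = rq, rp
    A[rp] += A[rq]                       # sum up the two cluster sizes
    A[rq] = rp                           # link root of smaller cluster to larger one
```

Starting from A = [−1, ..., −1] (n singleton clusters), a sequence of union calls merges clusters, and cluster(A, i) returns a common label for all members of the same connected component.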
Excerpt from Algorithm 3 (union): A[C_p] ← A[C_p] + A[C_q] (sum up the two cluster sizes); A[C_q] ← C_p (establish a link to the root of the larger cluster).

Performance Evaluation
To assess the performance of the proposed clustering algorithm, we carried out numerous experiments on real and synthetic datasets, including Zoo [48], Iris [48], Mushroom [48] and T10I4D100K [49]. To simulate a multi-database environment, we partitioned each dataset horizontally into n partitions D_1, D_2, ..., D_n, such that n ∈ {12, 10, 6, 4}. Afterward, given a minimum support threshold α ∈ {0.5, 0.2, 0.03}, we ran FP-Growth [1] on each partition D_i (i = 1, ..., n) to discover the local frequent itemsets (FIs) of each partition. All the details related to the partition sizes and their corresponding FIs are shown in Table A1. We note that the fifth column of Table A1 reports the number of FIs discovered in the entire dataset, whereas the rightmost column of the same table reports the number of FIs aggregated from the local FIs mined from the partitions in each cluster.
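Horizontally partitioning a dataset to simulate a multi-database environment can be done in a few lines; the round-robin split below is only an illustrative assumption, since the actual partition sizes used in the experiments are those reported in Table A1.

```python
def horizontal_partition(transactions, n):
    """Split a transaction dataset row-wise into n partitions D_1..D_n (round-robin).
    Each partition then plays the role of one local transaction database."""
    parts = [[] for _ in range(n)]
    for i, t in enumerate(transactions):
        parts[i % n].append(t)
    return parts
```

Each resulting partition would then be mined independently (e.g., with FP-Growth) to obtain its local frequent itemsets.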
The proposed similarity measure sim (3) is applied to the (n² − n)/2 pairs of FIs to compute the n × n similarity matrices shown in Figures A1a-A7a. Next, using the obtained pairwise similarities, candidate clusterings are produced via the process described in Section 3.1.2, and then evaluated using the clustering quality measures defined in Table 3, including SC(D) [43], goodness_3(D) [21], goodness_2(D) [23], goodness(D) [20] and our proposed loss function L(θ) (14). The graphs corresponding to the studied goodness measures are shown in Figures A1b-A7b, where the optimal point (maximum or minimum) of each objective function is depicted as a black dot on its corresponding graph, except for the graph of our loss function L(θ), where a red dot represents the value L(arg min_θ L(θ)) (i.e., the optimal point at which our algorithm terminates). It is worth mentioning that, due to scale differences, we sometimes multiply or divide our loss function L(θ), goodness_3(D) [21] and goodness_2(D) [23] by a scaling factor to stretch or shrink their graphs along the y-axis. The experimental results depicted in Figures A1-A7 are summarized in Table A2, such that δ ∈ [0, 1] is the ideal similarity threshold at which a goodness measure attains its optimal point. Python version 3.9.2 was used to implement all the algorithms, and the code was run on an Ubuntu 20.04 server equipped with an Intel(R) Xeon(R) CPU clocked at 2.30 GHz, 50 GB of available disk capacity and 12 GB of available RAM.

Similarity Accuracy Analysis
To demonstrate the efficiency of our similarity measure sim (3), we compare it with the similarity measure sim_i [20]; the results are reported in Table 4. We note that goodness [20] is a clustering quality measure, such that the higher the value of goodness for a given candidate clustering C, the better the quality of C. From Table 4, we notice that using our similarity measure sim (3), we obtained a larger intra-cluster similarity, a larger inter-cluster distance and a larger goodness [20]. Now, let us synthesize the global frequent itemsets from the clusters containing more than one database, i.e., C_{2,3} = {D_2, D_3} and C_{1,2} = {D_1, D_2}. The obtained results are shown in Table 5, together with the minimum support thresholds corresponding to C_{2,3} and C_{1,2}, respectively (the latter being 0.17). As we can see, the similarity measure sim_i [20] captures only high-frequency itemsets (supp ≈ 1), such as E, and neglects low-support frequent itemsets (i.e., those whose supports are immediately above the minimum threshold α, with supp ∈ [α, α + ε] for a very small ε), such as A, B and C. This characteristic assigns a high similarity value to database pairs sharing only one or very few high-frequency itemsets, whereas database pairs sharing many frequent itemsets with low support will be assigned a lower similarity. However, once the clustering is done, we will be interested in the patterns discovered from each cluster individually, such as the high-vote patterns [32] and the exceptional patterns [31]. That is why our similarity measure estimates the patterns post-mined from each cluster C_{p,q} = {D_p, D_q} in order to compute sim(D_p, D_q). Since our similarity measure focuses on maximizing the number of frequent itemsets synthesized from each cluster C_{p,q} ⊆ D, only relevant clusters will be assigned a large similarity value.

Fuzziness Reduction Analysis
To demonstrate the importance of reducing the fuzziness associated with a similarity matrix, we run the clustering algorithm BestDatabaseClustering [22] on the two similarity matrices in Figures 5a and 6a. The obtained results in terms of the optimal clustering, max goodness(D) [20], the optimal similarity level δ_opt (i.e., the similarity level at max goodness(D)) and the silhouette coefficient SC(D) [43] at δ_opt are shown in Figures 5b,c and 6b,c, corresponding to rows 1 and 2 of Table 6, respectively. From the obtained results, we can clearly see that when the similarity matrices are centered around the mean value 0.5, the fuzziness index becomes larger and closer to 1, and BestDatabaseClustering [22] could not return a meaningful clustering, since it put all the n databases into the same cluster. Now, let us run our fuzziness reduction model on the previous similarity matrices; the adjusted similarity matrices are depicted in Figures 7a and 8a, respectively. Afterward, we run BestDatabaseClustering [22] on the new similarity matrices and show the clustering results in Figures 7b,c and 8b,c, corresponding to rows 3 and 4 of Table 6, respectively. As we can see, after reducing the fuzziness index associated with the previous similarity matrices in Figures 5a and 6a, BestDatabaseClustering [22] was able to produce meaningful non-trivial clusterings, with an increase in the silhouette coefficient SC(D) [43] for both similarity matrices in Figures 7a and 8a. Table 6. A summary of the results obtained in Figures 5-8. We note that δ_opt is the optimal similarity level at which goodness(D) [20] attains its maximum value, and θ^T is the optimal weight vector learned after a number of epochs.

Convexity and Clustering Analysis
In this part of our experiments, we analyze the convex behavior of the proposed clustering quality functions L(θ) (19) and L(θ) (14), and we also examine the non-convexity of the existing goodness measures in [20,21,23,43]. Additionally, we compare the clustering produced by our algorithm and the ones generated at the optimal points of the previous compared goodness measures (i.e., at max goodness(D) [20], min goodness 2 (D) [23] and max goodness 3 (D) [21]) with the underlying ground-truth cluster labels. When the actual clustering is unknown, we replace it with the partitioning obtained at the maximum value of the silhouette coefficient [43], that is, at max SC(D). All the graphs corresponding to our loss functions and the compared goodness measures in Table 3 are plotted in Figures A1b-A7b, where the x-axis represents the similarity levels δ at which multiple candidate clusterings are generated and evaluated.
Consider the 7 × 7 similarity matrix shown in Figure A1a. From the graphs plotted in Figure A1b and according to the results shown in the first row of Table A2, we can see that using our loss function L(θ) and goodness(D) [20], we were able to find the optimal clustering {C_1 = {D_3, D_2, D_1}, C_2 = {D_4}, C_3 = {D_7, D_6, D_5}} at a similarity level δ = 0.44, where the silhouette coefficient reaches its maximum value SC(D) = 0.46. On the other hand, goodness_3(D) [21] and goodness_2(D) [23] did not successfully discover the partitioning maximizing the silhouette coefficient. Additionally, we observe that the proposed convergence test function L(θ) has a quasi-convex behavior (see proof of Theorem 2). This allows us to terminate the clustering process right after reaching the global minimum. Conversely, the graphs corresponding to goodness_2(D) [23] and goodness(D) [20] have local optima. Consequently, it is required to explore about (n² − n)/2 similarity levels in order to generate and evaluate all the possible candidate clusterings. Now, let us examine the results of some experiments conducted on the synthetic and real-world datasets shown in Table A1. From Figures A2b and A7b (the last and second rows of Table A2), we observe that goodness_3(D) [21] and goodness_2(D) [23] attain their optimal values when all the partition databases are clustered together in one class. The same phenomenon is observed in Figures A3b, A6b and A7b (the last, sixth and third rows of Table A2), where both goodness_2(D) [23] and goodness(D) [20] put all the databases into one cluster.
In contrast, the proposed loss function L(θ) successfully identified the clustering for which the silhouette coefficient SC is maximum. Precisely, in Figure A7b, which corresponds to the last row of Table A2, L(θ) was the only clustering quality measure that properly identified the ideal 7-class clustering at δ = 0.846.
From the obtained graphs in Figures A1-A7, we notice that goodness_3(D) [21], goodness_2(D) [23] and goodness(D) [20] are neither quasi-concave nor quasi-convex on the domain [0, 1]. As a result, local optimum points exist on their corresponding graphs, which makes the search for the global optimum difficult without exploring all the local solutions.
Conversely, we observe that our loss function L(θ) (14) is monotonically decreasing at all times, with L(θ) = 0 at θ̂ = arg min_θ L(θ) = 1⃗ (the all-ones vector). This corresponds to the similarity level δ = 0, where all the n databases are put into the same single cluster. To prevent this case from occurring, we used the quasi-convex function L(θ) (19) as a convergence test function to terminate our algorithm at the point L(arg min_θ L(θ)), corresponding to the red dot on the graph of our loss function L(θ). Moreover, it is worth noting that for every two real (n² − n)/2-dimensional vectors θ^(i) and θ^(i+1) with L(θ^(i+1)) ≤ L(θ^(i)), the line that joins the points (θ^(i+1), L(θ^(i))) and (θ^(i), L(θ^(i))) remains above the graph of L(θ) (19), as observed in Figures A1-A7. Therefore, using the proposed loss function L(θ) (14) along with L(θ) (19) guarantees convergence to the global minimum.
In the fifth and rightmost columns of Table A1, we compare the number of frequent itemsets (FIs) mined from all the partitions of a given dataset D with the FIs mined from each single cluster C_j consisting of similar partitions from the same dataset, where ∩_{j=1}^{k} C_j = ∅ and ∪_{j=1}^{k} C_j = D. We notice that mining all the partitions from the datasets Iris [48] and Zoo [48] did not result in discovering any valid frequent itemset, whereas mining each individual cluster of partitions from Iris and Zoo led to the discovery of new patterns in each cluster C_j.
In Table A3, we report the similarity levels δ_opt at which the clustering evaluation measures goodness(D) [20], goodness_2(D) [23], goodness_3(D) [21], the silhouette coefficient SC [43] and our proposed loss function L(θ) attain their optimal values in Figures A1-A7. We note that the fraction |{δ_1, ..., δ_stop}| / |{δ_1, ..., δ_m}| in Table A3 represents the number of similarity levels required to test the convergence and terminate, divided by the number m of all similarity levels, and that opt is the index of the optimal similarity level according to a given clustering quality measure. Since our proposed algorithm is based on a convex loss function, we notice that stop = opt < m. On the other hand, for the compared algorithms, which are based on non-convex objectives, we notice that stop = m. Therefore, our algorithm requires the smallest number of similarity levels (opt out of m) to converge and terminate, which makes it faster than the compared algorithms in [21][22][23], which must generate and evaluate all the m candidate clusterings in order to return the optimal one. All the previous results confirm that using our loss function L(θ) (14) along with L(θ) (19), we identified the ideal clustering for which the silhouette coefficient SC [43] is maximum, and we also improved the quality of the frequent itemsets (FIs) mined from the multiple databases partitioned from the datasets in Table A1.

Clustering Error and Running Time Analysis
In this experimental part, we compare the running time of the proposed clustering algorithm with the execution times of two clustering algorithms for multi-database mining (MDM), namely GDMDBClustering [25] and BestDatabaseClustering [22], all run on the same random data samples. We also quantify how much the clusterings produced by our algorithm and the compared models differ from the ground-truth clustering. For this purpose, we propose an error function (21), which measures the difference between two given clusterings P and Q.
First, we generated n = 30 to N = 120 isotropic Gaussian blobs using the scikit-learn generator [50], such that the number of features for each set of n blobs is set to random.randint(2, 10), while the number of clusters is set to n/2. In Table 7, we present a brief summary of the random blobs generated via scikit-learn [50].

Afterward, we use min-max scaling [51] to normalize each feature (out of the m features) into the interval [0, 1]. Then, for each set of n blobs, every pair of m-dimensional blobs is passed as arguments to the function sim (3) in order to compute the (n² − n)/2 pairwise similarities between the n blobs. We then run the proposed algorithm, GDMDBClustering [25] (with three different learning rate values) and BestDatabaseClustering [22] on each set of (n² − n)/2 pairwise similarities (n = 30, ..., 120), plot their running time graphs in Figures A8a-A10a, and plot the clustering error graphs in Figures A8b-A10b.
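The preprocessing described above (min-max scaling followed by evaluating a similarity on all (n² − n)/2 pairs) can be sketched as follows; the sim callable is a placeholder for the paper's itemset-based measure sim (3).

```python
def min_max_scale(rows):
    """Normalize each feature column into [0, 1] (min-max scaling).
    Constant columns are mapped to 0.0 to avoid division by zero."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [[(v - l) / (h - l) if h > l else 0.0
             for v, l, h in zip(r, lo, hi)] for r in rows]

def pairwise(blobs, sim):
    """Evaluate sim on all (n^2 - n)/2 unordered blob pairs (p < q)."""
    n = len(blobs)
    return {(p, q): sim(blobs[p], blobs[q])
            for p in range(n) for q in range(p + 1, n)}
```

The resulting dictionary of pairwise similarities is exactly the input each compared clustering algorithm consumes.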

Without loss of generality, assume Q is the ground-truth clustering (i.e., the actual clusters) of the current n blobs D = {D_1, D_2, ..., D_n} generated via scikit-learn [50], and assume P is the partitioning of D produced by a given clustering algorithm. To measure how far P is from Q, we propose the error function E_n(P, Q) ∈ [0, 1], defined as follows: E_n(P, Q) = (|Pairs_Q \ Pairs_P| + |Pairs_P \ Pairs_Q|) / (|Pairs_Q| + |Pairs_P|) (21), where |Pairs_P| is the number of all the database pairs obtained from every cluster in P, and |Pairs_P \ Pairs_Q| is the number of all the database pairs that only exist in Pairs_P and cannot be found in Pairs_Q. We note that E_n(P, Q) approaches the maximum value of 1 (i.e., E_n(P, Q) ≈ 1) when P and Q are very different and share few database pairs in common (i.e., |Pairs_P ∩ Pairs_Q| ≈ 0). Conversely, E_n(P, Q) ≈ 0 when the clusterings P and Q are very similar, i.e., they share the maximum number of pairs (D_p, D_q). We also define the average of the N − n + 1 clustering errors, denoted Ē(P, Q) (22), which can be seen as the mean absolute clustering error. From the obtained results in Figures A8a-A10a, we observe a rapid increase in the running time of BestDatabaseClustering [22] as the number of generated blobs n increases linearly. This is due to the fact that BestDatabaseClustering needs to generate and evaluate approximately (n² − n)/2 candidate clusterings in order to find the optimal clustering for which the non-convex function goodness(D) [20] is maximum. In fact, goodness(D) suffers from the existence of local maxima, which requires exploring all the local candidate solutions in order to find the global maximum. On the other hand, using the proposed convex loss function L(θ) and the quasi-convex convergence test function L(θ) (19) allows us to stop the clustering process at L(arg min_θ L(θ)). Consequently, this avoids generating unnecessary candidate clusterings, and hence reduces the CPU overhead.
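The pair-based error in (21) reduces to a symmetric set difference over within-cluster pairs; a direct sketch:

```python
def pairs(clustering):
    """All unordered within-cluster pairs of a clustering (a list of clusters)."""
    out = set()
    for c in clustering:
        c = sorted(c)
        for i in range(len(c)):
            for j in range(i + 1, len(c)):
                out.add((c[i], c[j]))
    return out

def clustering_error(P, Q):
    """E_n(P, Q) from (21): size of the symmetric difference of within-cluster
    pairs, normalized by |Pairs_Q| + |Pairs_P| (0 when both pair sets are empty)."""
    pp, pq = pairs(P), pairs(Q)
    denom = len(pp) + len(pq)
    return len(pp ^ pq) / denom if denom else 0.0
```

Identical clusterings yield an error of 0, while a one-cluster partitioning compared against all-singletons yields the maximum error of 1.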
Since our algorithm is independent of the learning rate η, its running time is the same in all of Figures A8a-A10a. In contrast, the running time of GDMDBClustering [25] increases for smaller learning rate values (e.g., Figure A10) and decreases for larger learning rate values (e.g., Figure A9), though the latter comes at the cost of an increased clustering error.
Next, by examining the three clustering error graphs in Figures A8b-A10b, we observe that BestDatabaseClustering [22] has the largest clustering error among the three algorithms, with an average clustering error Ē(P, Q) = 0.936. In fact, on average, BestDatabaseClustering [22] tends to group all the current n blobs (n = 30, ..., 120) into one single cluster. On the other hand, our proposed algorithm and GDMDBClustering [25] produce clusterings that are close to the ground-truth clustering predetermined by the scikit-learn generator [50]. In fact, the average clustering error of our algorithm is Ē(P, Q) = 0.285. For GDMDBClustering [25], we get Ē(P, Q) = 0.285 when the learning rate η = 0.0005 or η = 0.001, and the error increases to Ē(P, Q) = 0.29 when η = 0.002. The average running times and clustering errors of our algorithm, GDMDBClustering [25] and BestDatabaseClustering [22] are summarized in Table A4.
Our algorithm and GDMDBClustering [25] terminate once the global minimum of the convergence test function L(θ) (19) is reached. Consequently, the running times of our algorithm and GDMDBClustering [25] are most of the time shorter than that of BestDatabaseClustering [22]. Overall, the running time of GDMDBClustering [25] stays relatively steady with respect to n. However, GDMDBClustering depends strongly on the learning step size η and its decay rate. On the other hand, our algorithm is learning-rate-free and needs at most (n² − n)/2 iterations to converge in the worst case. Consequently, our proposed algorithm is faster than both BestDatabaseClustering [22] and GDMDBClustering [25].
To illustrate the statistically significant superiority of the proposed clustering model in terms of running time and clustering accuracy, we have applied the Friedman test [52] (under a significance level α = 0.05) on the measurements (execution times and clustering errors depicted in Figures A8-A10) obtained by our algorithm, BestDatabaseClustering [22] and GDMDBClustering [25] (with three different values for the learning rate η) considering all the random samples in Table 7.
After conducting the Friedman test [52], we obtained the results shown in Tables A5-A7, namely the average running time, the average clustering error Ē(P, Q) (22), the standard deviation (SD), the variance (Var), the critical value (stat) and its p-value for all the tested clustering algorithms, considering all 91 random samples generated via scikit-learn [50].
We notice that all the results in Tables A5-A7 show p-values below the significance level α = 0.05. Consequently, the test rejects the null hypothesis stating that the compared clustering models perform similarly. In fact, the proposed clustering algorithm significantly outperforms the other compared models, as it has the shortest average running time (6.367 milliseconds) and the lowest average clustering error (Ē(P, Q) = 0.285) among all the compared models.

Clustering Comparison and Assessment
In the third part of our experiments, we are interested in using some information retrieval measures to compare the clusterings produced by our algorithm and some other clustering algorithms with the ground-truth data.
Let D = {D_1, D_2, ..., D_n} be n transactional databases, let P = {P_1, P_2, ..., P_k} be a k-class clustering of D produced by any given clustering algorithm, and let Q = {Q_1, Q_2, ..., Q_l} be the ground-truth clustering of the databases in D. Let us define Pairs_P and Pairs_Q as the sets of database pairs obtained from each cluster of the corresponding clustering. That is, Pairs_P = ∪_{P_t ∈ P} ∪_{D_r, D_s ∈ P_t; r<s} {(D_r, D_s)} and Pairs_Q = ∪_{Q_t ∈ Q} ∪_{D_r, D_s ∈ Q_t; r<s} {(D_r, D_s)}. To compare the two clusterings P and Q, several methods [53][54][55] could be used. In this paper, we use a pair-counting approach [56][57][58][59] to calculate some information retrieval measures [60,61], including precision, recall, F-measure (i.e., the harmonic mean of recall and precision), the Rand index [62] and the Jaccard index [63] over pairs of databases being clustered together in P and/or Q. This allows us to assess whether the predicted database pairs from P cluster together in Q, i.e., whether the discovered database pairs in Pairs_P are correct with respect to the underlying true pairs in Pairs_Q from the ground-truth clustering Q.
In Table A9, we show the categories of database pairs, which represent the working set of all the pair-counting measures cited in Table A10. Precisely, a represents the number of pairs that exist in both clusterings Q and P, d represents the number of pairs that exist in neither clustering, b is the number of pairs present only in clustering Q, and c is the number of pairs present only in clustering P. By counting the pairs in each category, we get an indicator of the agreement and disagreement of the two clusterings being compared. The following example illustrates how to compute the measures defined in Table A10 for two given clusterings P and Q with a = |Pairs_Q ∩ Pairs_P| = 6, b = |Pairs_Q \ Pairs_P| = 0, c = |Pairs_P \ Pairs_Q| = 3 and d = |Pairs_D \ (Pairs_Q ∪ Pairs_P)| = 12. We get the following measures: F-measure = 0.8, precision = 0.66, recall = 1.0, Rand = 0.857, Jaccard = 0.66. We note that the higher the values of the evaluation measures given in Table A10, the better the matching of the clustering P to its corresponding ground-truth clustering Q.
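Given the four pair counts a, b, c and d, all the measures of Table A10 follow from simple ratios; run on the example above (a = 6, b = 0, c = 3, d = 12), the sketch below reproduces F-measure = 0.8, precision ≈ 0.66, recall = 1.0, Rand ≈ 0.857 and Jaccard ≈ 0.66.

```python
def pair_counting_measures(a, b, c, d):
    """Pair-counting measures from the four categories of Table A9:
    a = true positives, b = false negatives, c = false positives, d = true negatives."""
    precision = a / (a + c)
    recall = a / (a + b)
    f_measure = 2 * precision * recall / (precision + recall)
    rand = (a + d) / (a + b + c + d)        # Rand index [62]
    jaccard = a / (a + b + c)               # Jaccard index [63]
    return {"precision": precision, "recall": recall,
            "f_measure": f_measure, "rand": rand, "jaccard": jaccard}
```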

Conclusions
An improved similarity-based clustering algorithm for multi-database mining was proposed in this paper. Unlike previous works, our algorithm requires fewer upper-bounded iterations to minimize a convex clustering quality measure. In addition, we have proposed a preprocessing layer prior to clustering, where the pairwise similarities between multiple databases are first adjusted to reduce their fuzziness. This helps the clustering process to be more precise and less indecisive in discriminating between the different database clusters. To assess the performance of our algorithm, we conducted several experiments on real and synthetic datasets. Compared to the existing clustering algorithms for multi-database mining, our algorithm achieved the best performance in terms of accuracy and running time. In this paper, we used the most frequent itemsets mined from each transaction database as feature sets to compute the pairwise similarities between the multiple databases. However, when the sizes of these input vectors become large, building the similarity matrix increases the CPU overhead drastically. Moreover, the existence of some noisy frequent itemsets (FIs) may largely influence how databases are clustered together. In future work, we will investigate the impact of compressing the FIs into a latent variable represented in a lower-dimensional space with discriminative features. Practically, reconstituting the input vectors from the embedding space using deep auto-encoders and non-linear dimensionality reduction techniques, such as t-SNE (t-distributed stochastic neighbor embedding) and UMAP (uniform manifold approximation and projection), will force the removal of the noisy features present in the input data while keeping only the meaningful discriminative ones. Consequently, this may help improve the accuracy and running time of the clustering algorithm.
Additionally, we are interested in exploring new ways to reduce the computational time needed to calculate the similarity matrix via locality-sensitive hashing (LSH) techniques, such as BagMinHash for weighted sets. These methods encode the feature-set vectors into hash-code signatures in order to efficiently estimate the Jaccard similarity between the local transactional databases. Last but not least, in order to design a parallel version of the proposed algorithm, we will study and explore some high-performance computing tools, such as MapReduce and Spark, in an attempt to improve the clustering performance for multi-database mining.

Acknowledgments:
We would like to thank the anonymous reviewers for their time and their valuable comments.

Conflicts of Interest:
The authors declare no conflict of interest.

Table A9. The categories of database pairs (actual clusters vs. predicted clusters).

                 Pairs in P                                  Pairs not in P
Pairs in Q       a := |Pairs_Q ∩ Pairs_P| (True Positive)    b := |Pairs_Q \ Pairs_P| (False Negative)
Pairs not in Q   c := |Pairs_P \ Pairs_Q| (False Positive)   d := pairs in neither (True Negative)

Table A10. The precision, recall, F-measure, Rand [62] and Jaccard [63] values reached by the clustering algorithms in [21][22][23] and our proposed algorithm on the datasets shown in Figures A1-A7. Notice that our algorithm gets the best scores for all the datasets.