An Improved Similarity-Based Clustering Algorithm for Multi-Database Mining
Abstract
1. Introduction
 Unlike the existing algorithms proposed in [20,21,22,23,25,26], which produce one-class trivial clusterings when the similarity values are centered around the mean, we add a preprocessing layer prior to clustering in which the pairwise similarities are adjusted to reduce their fuzziness and hence improve the quality of the resulting clustering. Our experimental results show that reducing the fuzziness of the similarity matrix helps generate meaningful, relevant clusters that differ from the one-class trivial clusterings.
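As a hedged illustration of this preprocessing idea (not the paper's Algorithm 1), the sketch below computes the De Luca-Termini fuzziness index of a set of pairwise similarities and sharpens them with Zadeh's contrast-intensification operator; the operator choice and function names are assumptions for illustration only.

```python
import math

def fuzziness_index(sims):
    """De Luca-Termini fuzziness of similarities in [0, 1].

    Values near 0.5 are maximally fuzzy; values near 0 or 1
    contribute almost nothing to the index.
    """
    h = 0.0
    for s in sims:
        for p in (s, 1.0 - s):
            if 0.0 < p < 1.0:
                h -= p * math.log2(p)
    return h / len(sims)

def intensify(sims):
    """Push each similarity away from 0.5 (contrast intensification)."""
    return [2 * s * s if s <= 0.5 else 1 - 2 * (1 - s) ** 2 for s in sims]

sims = [0.45, 0.5, 0.55, 0.9, 0.1]
assert fuzziness_index(intensify(sims)) < fuzziness_index(sims)
```

Sharpening the matrix this way lowers its overall fuzziness, which is the property the preprocessing layer exploits before clustering.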
 Unlike the multi-database clustering algorithms proposed in [20,21,22,23], our approach uses a convex objective function $\mathcal{L}\left(\theta \right)$ to assess the quality of the produced clustering. This allows our algorithm to terminate as soon as the global minimum of the objective function is attained (i.e., after exploring fewer similarity levels), which avoids generating unnecessary candidate clusterings and hence reduces the CPU overhead. In contrast, the clustering algorithms in [20,21,22,23] use non-convex objectives (i.e., they suffer from local optima due to the use of more than two monotonic functions) and therefore must generate and evaluate all $({n}^{2}-n)/2$ candidate clustering solutions in order to find the clustering located at the global optimum.
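The computational benefit of convexity can be sketched as follows: when the loss evaluated over the sorted candidate similarity levels is convex, a scan can stop at the first level where the loss rises, instead of evaluating every level. The loss used below is a stand-in for illustration, not the paper's $\mathcal{L}(\theta)$.

```python
def first_local_min(levels, loss):
    """Scan candidate similarity levels in order and stop as soon as
    the loss rises; for a convex loss this first local minimum is
    also the global one, so the remaining levels are never explored."""
    best_level, best_val = levels[0], loss(levels[0])
    for lv in levels[1:]:
        v = loss(lv)
        if v > best_val:   # convexity: once the loss rises, it never falls
            break
        best_level, best_val = lv, v
    return best_level, best_val

# stand-in convex loss over candidate thresholds
levels = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
level, val = first_local_min(levels, lambda d: (d - 0.32) ** 2)
# stops after evaluating 0.4, returning level 0.3
```

With a non-convex loss, the same scan could stop at a spurious local minimum, which is why the algorithms in [20,21,22,23] must evaluate all candidate solutions.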
 Furthermore, unlike the previous gradient-based clustering algorithms [25,26], our proposed algorithm is learning-rate-free (i.e., independent of the learning rate) and needs at most $({n}^{2}-n)/2$ iterations to converge in the worst case. This is why our proposed algorithm is faster than GDMDBClustering [25], which depends strongly on the learning step size $\eta $ and its decay rate.
 Additionally, unlike the similarity measure proposed in [20], which assumes that the same threshold was used to mine the local patterns from all n transactional databases, our proposed similarity measure accounts for n different local thresholds, which are combined to compute a new threshold for each cluster. Using these new thresholds, our similarity measure accurately estimates the valid patterns post-mined from each cluster in order to compute the $({n}^{2}-n)/2$ pairwise similarities.
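A hedged sketch of a threshold-aware pairwise similarity: treat the frequent itemsets mined from each database at its own local threshold, re-validate them at a combined threshold, and compare the surviving sets with a Jaccard-style ratio. The combination rule (taking the maximum of the two local thresholds) and all names here are assumptions for illustration, not the paper's exact measure.

```python
def pairwise_similarity(fis_i, fis_j, alpha_i, alpha_j):
    """fis_*: dict mapping itemset (frozenset) -> support, mined at
    local threshold alpha_*. Patterns are re-validated at a combined
    threshold before the Jaccard-style overlap is computed."""
    alpha = max(alpha_i, alpha_j)  # assumed combination rule
    valid_i = {p for p, s in fis_i.items() if s >= alpha}
    valid_j = {p for p, s in fis_j.items() if s >= alpha}
    union = valid_i | valid_j
    if not union:
        return 0.0
    return len(valid_i & valid_j) / len(union)

a = {frozenset({"x"}): 0.6, frozenset({"y"}): 0.4}
b = {frozenset({"x"}): 0.7, frozenset({"z"}): 0.5}
print(pairwise_similarity(a, b, 0.4, 0.5))  # → 0.5 (one shared of two valid patterns)
```

Computing this for every pair of databases yields the $(n^2-n)/2$ entries of the similarity matrix.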
2. Motivation and Related Work
2.1. Motivating Example
2.2. Prior Work
3. Materials and Methods
3.1. Background and Relevant Concepts
3.1.1. Similarity Measure
3.1.2. Clustering Generation and Evaluation
3.2. Similarity Matrix Fuzziness Reduction
3.2.1. Fuzziness Index
3.2.2. Proposed Model and Algorithm
Algorithm 1: SimFuzzinessReduction 
3.3. Proposed Coordinate Descent-Based Clustering
3.3.1. Proposed Loss Function and Algorithm
3.3.2. Time Complexity Analysis
Algorithm 2: CDClustering 
Algorithm 3: union 
Algorithm 4: cluster 
4. Performance Evaluation
4.1. Similarity Accuracy Analysis
4.2. Fuzziness Reduction Analysis
4.3. Convexity and Clustering Analysis
4.4. Clustering Error and Running Time Analysis
4.5. Clustering Comparison and Assessment
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
FIs  Frequent Itemsets 
FIM  Frequent Itemset Mining 
MDB  Multiple Databases 
MDM  Multi-database Mining 
CD  Coordinate Descent 
CL  Competitive Learning 
BMU  Best Matching Unit 
t-SNE  t-Distributed Stochastic Neighbor Embedding 
UMAP  Uniform Manifold Approximation and Projection 
LSH  Locality-Sensitive Hashing 
Appendix A
Dataset Name/Ref | Number of Rows | Number of Rows in Partition $\mathcal{D}_i$ | Number of $FIS(\mathcal{D}_i,\alpha)$ from Partition $\mathcal{D}_i$ | Number of $FIS(\mathcal{D},\alpha)$ from Dataset $\mathcal{D}$ | Ground Truth Clustering | Number of $FIS(C_j,\alpha)$ from Cluster $C_j$

Mushroom [48] (2 classes) | 8124 | $\mathcal{D}_1=3916$ ($C_1$); $\mathcal{D}_2=1402$ ($C_2$); $\mathcal{D}_3=1402$ ($C_2$); $\mathcal{D}_4=1404$ ($C_2$) | $|FIS(\mathcal{D}_1,0.5)|=375$; $|FIS(\mathcal{D}_2,0.5)|=2063$; $|FIS(\mathcal{D}_3,0.5)|=\mathrm{32,911}$; $|FIS(\mathcal{D}_4,0.5)|=807$ | $|FIS(\mathcal{D},0.5)|=151$ | $C_1=\{\mathcal{D}_1\}$, $C_2=\{\mathcal{D}_4,\mathcal{D}_3,\mathcal{D}_2\}$ | $|FIS(C_1,0.5)|=375$; $|FIS(C_2,0.5)|=1441$
Zoo [48] (7 classes) | 101 | $\mathcal{D}_1=20$ ($C_1$); $\mathcal{D}_2=21$ ($C_1$); $\mathcal{D}_3=10$ ($C_2$); $\mathcal{D}_4=10$ ($C_2$); $\mathcal{D}_5=5$ ($C_3$); $\mathcal{D}_6=6$ ($C_4$); $\mathcal{D}_7=7$ ($C_4$); $\mathcal{D}_8=2$ ($C_5$); $\mathcal{D}_9=2$ ($C_5$); $\mathcal{D}_{10}=4$ ($C_6$); $\mathcal{D}_{11}=4$ ($C_6$); $\mathcal{D}_{12}=10$ ($C_7$) | $|FIS(\mathcal{D}_1,0.5)|=\mathrm{24,383}$; $|FIS(\mathcal{D}_2,0.5)|=\mathrm{30,975}$; $|FIS(\mathcal{D}_3,0.5)|=\mathrm{30,719}$; $|FIS(\mathcal{D}_4,0.5)|=\mathrm{32,767}$; $|FIS(\mathcal{D}_5,0.5)|=\mathrm{20,479}$; $|FIS(\mathcal{D}_6,0.5)|=\mathrm{65,535}$; $|FIS(\mathcal{D}_7,0.5)|=\mathrm{65,535}$; $|FIS(\mathcal{D}_8,0.5)|=\mathrm{114,687}$; $|FIS(\mathcal{D}_9,0.5)|=\mathrm{98,303}$; $|FIS(\mathcal{D}_{10},0.5)|=\mathrm{53,247}$; $|FIS(\mathcal{D}_{11},0.5)|=\mathrm{57,343}$; $|FIS(\mathcal{D}_{12},0.5)|=\mathrm{28,671}$ | $|FIS(\mathcal{D},0.5)|=0$ | $C_1=\{\mathcal{D}_2,\mathcal{D}_1\}$, $C_2=\{\mathcal{D}_4,\mathcal{D}_3\}$, $C_3=\{\mathcal{D}_5\}$, $C_4=\{\mathcal{D}_7,\mathcal{D}_6\}$, $C_5=\{\mathcal{D}_9,\mathcal{D}_8\}$, $C_6=\{\mathcal{D}_{11},\mathcal{D}_{10}\}$, $C_7=\{\mathcal{D}_{12}\}$ | $|FIS(C_1,0.5)|=\mathrm{25,087}$; $|FIS(C_2,0.5)|=\mathrm{28,671}$; $|FIS(C_3,0.5)|=2479$; $|FIS(C_4,0.5)|=\mathrm{49,151}$; $|FIS(C_5,0.5)|=\mathrm{57,343}$; $|FIS(C_6,0.5)|=\mathrm{45,055}$; $|FIS(C_7,0.5)|=\mathrm{28,671}$
Iris [48] (3 classes) | 150 | $\mathcal{D}_1=25$ ($C_1$); $\mathcal{D}_2=25$ ($C_1$); $\mathcal{D}_3=25$ ($C_2$); $\mathcal{D}_4=25$ ($C_2$); $\mathcal{D}_5=25$ ($C_3$); $\mathcal{D}_6=25$ ($C_3$) | $|FIS(\mathcal{D}_1,0.2)|=5$; $|FIS(\mathcal{D}_2,0.2)|=6$; $|FIS(\mathcal{D}_3,0.2)|=2$; $|FIS(\mathcal{D}_4,0.2)|=2$; $|FIS(\mathcal{D}_5,0.2)|=2$; $|FIS(\mathcal{D}_6,0.2)|=5$ | $|FIS(\mathcal{D},0.2)|=0$ | $C_1=\{\mathcal{D}_2,\mathcal{D}_1\}$, $C_2=\{\mathcal{D}_4,\mathcal{D}_3\}$, $C_3=\{\mathcal{D}_6,\mathcal{D}_5\}$ | $|FIS(C_1,0.2)|=3$; $|FIS(C_2,0.2)|=1$; $|FIS(C_3,0.2)|=2$
T10I4D100K [49] (unknown classes) | 100,000 | $\mathcal{D}_i=\mathrm{10,000}$ rows, $i=1\dots 10$ | $|FIS(\mathcal{D}_1,0.03)|=58$; $|FIS(\mathcal{D}_2,0.03)|=58$; $|FIS(\mathcal{D}_3,0.03)|=62$; $|FIS(\mathcal{D}_4,0.03)|=57$; $|FIS(\mathcal{D}_5,0.03)|=62$; $|FIS(\mathcal{D}_6,0.03)|=63$; $|FIS(\mathcal{D}_7,0.03)|=63$; $|FIS(\mathcal{D}_8,0.03)|=59$; $|FIS(\mathcal{D}_9,0.03)|=61$; $|FIS(\mathcal{D}_{10},0.03)|=62$ | $|FIS(\mathcal{D},0.03)|=50$ | Seven clusters found via the silhouette coefficient [43]: $C_1=\{\mathcal{D}_1\}$, $C_2=\{\mathcal{D}_2\}$, $C_3=\{\mathcal{D}_3\}$, $C_4=\{\mathcal{D}_5,\mathcal{D}_4\}$, $C_5=\{\mathcal{D}_6\}$, $C_6=\{\mathcal{D}_7\}$, $C_7=\{\mathcal{D}_{10},\mathcal{D}_9,\mathcal{D}_8\}$ | $|FIS(C_1,0.03)|=58$; $|FIS(C_2,0.03)|=58$; $|FIS(C_3,0.03)|=62$; $|FIS(C_4,0.03)|=59$; $|FIS(C_5,0.03)|=63$; $|FIS(C_6,0.03)|=59$; $|FIS(C_7,0.03)|=61$
Figure A1 [20] (unknown classes) | 24 | $\mathcal{D}_1=3$; $\mathcal{D}_2=3$; $\mathcal{D}_3=3$; $\mathcal{D}_4=4$; $\mathcal{D}_5=4$; $\mathcal{D}_6=3$; $\mathcal{D}_7=4$ | $|FIS(\mathcal{D}_1,0.42)|=3$; $|FIS(\mathcal{D}_2,0.42)|=3$; $|FIS(\mathcal{D}_3,0.42)|=5$; $|FIS(\mathcal{D}_4,0.42)|=7$; $|FIS(\mathcal{D}_5,0.42)|=7$; $|FIS(\mathcal{D}_6,0.42)|=5$; $|FIS(\mathcal{D}_7,0.42)|=3$ | $|FIS(\mathcal{D},0.42)|=0$ | Three clusters found via the silhouette coefficient [43]: $C_1=\{\mathcal{D}_3,\mathcal{D}_2,\mathcal{D}_1\}$, $C_2=\{\mathcal{D}_4\}$, $C_3=\{\mathcal{D}_7,\mathcal{D}_6,\mathcal{D}_5\}$ | $|FIS(C_1,0.42)|=3$; $|FIS(C_2,0.42)|=7$; $|FIS(C_3,0.42)|=3$
Dataset Name/Ref | Silhouette Coefficient $\max SC(\mathcal{D})$ | Clusters (Proposed Objective) | $\mathcal{L}(\arg\min_{\theta}\mathcal{L}(\theta))$ | Clusters ($goodness$) | $\max goodness(\mathcal{D})$ | Clusters ($goodness^{2}$) | $\min goodness^{2}(\mathcal{D})$ | Clusters ($goodness^{3}$) | $\max goodness^{3}(\mathcal{D})$

Figure A1 $7\times 7$ [20] | 0.46 at $\delta=0.444$ | $C_1=\{\mathcal{D}_3,\mathcal{D}_2,\mathcal{D}_1\}$, $C_2=\{\mathcal{D}_4\}$, $C_3=\{\mathcal{D}_7,\mathcal{D}_6,\mathcal{D}_5\}$ | 2.004 at $\delta=0.444$ | $C_1=\{\mathcal{D}_3,\mathcal{D}_2,\mathcal{D}_1\}$, $C_2=\{\mathcal{D}_4\}$, $C_3=\{\mathcal{D}_7,\mathcal{D}_6,\mathcal{D}_5\}$ | 15.407 at $\delta=0.444$ | $C_1=\{\mathcal{D}_7,\dots,\mathcal{D}_1\}$ | 0.259 at $\delta=0.065$ | $C_1=\{\mathcal{D}_3,\mathcal{D}_2,\mathcal{D}_1\}$, $C_2=\{\mathcal{D}_7,\dots,\mathcal{D}_4\}$ | 0.728 at $\delta=0.086$
Figure A2 $12\times 12$ $Zoo$ [48] | 0.41 at $\delta=0.559$ | $C_1=\{\mathcal{D}_2,\mathcal{D}_1\}$, $C_2=\{\mathcal{D}_4,\mathcal{D}_3\}$, $C_3=\{\mathcal{D}_5\}$, $C_4=\{\mathcal{D}_7,\mathcal{D}_6\}$, $C_5=\{\mathcal{D}_9,\mathcal{D}_8\}$, $C_6=\{\mathcal{D}_{11},\mathcal{D}_{10}\}$, $C_7=\{\mathcal{D}_{12}\}$ | 7.71 at $\delta=0.559$ | $C_1=\{\mathcal{D}_2,\mathcal{D}_1\}$, $C_2=\{\mathcal{D}_4,\mathcal{D}_3\}$, $C_3=\{\mathcal{D}_5\}$, $C_4=\{\mathcal{D}_7,\mathcal{D}_6\}$, $C_5=\{\mathcal{D}_9,\mathcal{D}_8\}$, $C_6=\{\mathcal{D}_{11},\mathcal{D}_{10}\}$, $C_7=\{\mathcal{D}_{12}\}$ | 32.98 at $\delta=0.559$ | $C_1=\{\mathcal{D}_{12},\dots,\mathcal{D}_1\}$ | 0.57 at $\delta=0.348$ | $C_1=\{\mathcal{D}_{12},\dots,\mathcal{D}_1\}$ | 0.42 at $\delta=0.348$
Figure A3 $4\times 4$ $Mushroom$ [48] | 0.08 at $\delta=0.41$ | $C_1=\{\mathcal{D}_1\}$, $C_2=\{\mathcal{D}_4,\mathcal{D}_3,\mathcal{D}_2\}$ | 0.43 at $\delta=0.41$ | $C_1=\{\mathcal{D}_4,\dots,\mathcal{D}_1\}$ | 1.672 at $\delta=0.365$ | $C_1=\{\mathcal{D}_4,\dots,\mathcal{D}_1\}$ | 0.55 at $\delta=0.365$ | $C_1=\{\mathcal{D}_1\}$, $C_2=\{\mathcal{D}_4,\mathcal{D}_3,\mathcal{D}_2\}$ | 0.68 at $\delta=0.41$
Figure A4 $6\times 6$ $Iris$ [48] | 0.304 at $\delta=0.3$ | $C_1=\{\mathcal{D}_2,\mathcal{D}_1\}$, $C_2=\{\mathcal{D}_4,\mathcal{D}_3\}$, $C_3=\{\mathcal{D}_6,\mathcal{D}_5\}$ | 1.10 at $\delta=0.3$ | $C_1=\{\mathcal{D}_2,\mathcal{D}_1\}$, $C_2=\{\mathcal{D}_4,\mathcal{D}_3\}$, $C_3=\{\mathcal{D}_6,\mathcal{D}_5\}$ | 9.64 at $\delta=0.3$ | $C_1=\{\mathcal{D}_2,\mathcal{D}_1\}$, $C_2=\{\mathcal{D}_4,\mathcal{D}_3\}$, $C_3=\{\mathcal{D}_6,\mathcal{D}_5\}$ | 0.55 at $\delta=0.3$ | $C_1=\{\mathcal{D}_2,\mathcal{D}_1\}$, $C_2=\{\mathcal{D}_4,\mathcal{D}_3\}$, $C_3=\{\mathcal{D}_6,\mathcal{D}_5\}$ | 0.44 at $\delta=0.3$
Figure A5 $6\times 6$ $Zoo$ & $Mushroom$ [48] | 0.63 at $\delta=0.384$ | $C_1=\{\mathcal{D}_2,\mathcal{D}_1\}$, $C_2=\{\mathcal{D}_6,\dots,\mathcal{D}_3\}$ | 0.5 at $\delta=0.384$ | $C_1=\{\mathcal{D}_2,\mathcal{D}_1\}$, $C_2=\{\mathcal{D}_6,\dots,\mathcal{D}_3\}$ | 9.96 at $\delta=0.384$ | $C_1=\{\mathcal{D}_2,\mathcal{D}_1\}$, $C_2=\{\mathcal{D}_6,\dots,\mathcal{D}_3\}$ | 0.40 at $\delta=0.384$ | $C_1=\{\mathcal{D}_2,\mathcal{D}_1\}$, $C_2=\{\mathcal{D}_6,\dots,\mathcal{D}_3\}$ | 0.85 at $\delta=0.384$
Figure A6 $4\times 4$ [39] | 0.34 at $\delta=0.429$ | $C_1=\{\mathcal{D}_3,\mathcal{D}_2,\mathcal{D}_1\}$, $C_2=\{\mathcal{D}_4\}$ | 1.12 at $\delta=0.429$ | $C_1=\{\mathcal{D}_4,\dots,\mathcal{D}_1\}$ | 2.708 at $\delta=0.25$ | $C_1=\{\mathcal{D}_4,\dots,\mathcal{D}_1\}$ | 0.38 at $\delta=0.25$ | $C_1=\{\mathcal{D}_3,\mathcal{D}_2,\mathcal{D}_1\}$, $C_2=\{\mathcal{D}_4\}$ | 0.81 at $\delta=0.429$
Figure A7 $10\times 10$ T10I4D100K [49] | 0.115 at $\delta=0.846$ | $C_1=\{\mathcal{D}_1\}$, $C_2=\{\mathcal{D}_2\}$, $C_3=\{\mathcal{D}_3\}$, $C_4=\{\mathcal{D}_5,\mathcal{D}_4\}$, $C_5=\{\mathcal{D}_6\}$, $C_6=\{\mathcal{D}_7\}$, $C_7=\{\mathcal{D}_{10},\mathcal{D}_9,\mathcal{D}_8\}$ | 0.71 at $\delta=0.846$ | $C_1=\{\mathcal{D}_{10},\dots,\mathcal{D}_1\}$ | 35.275 at $\delta=0.737$ | $C_1=\{\mathcal{D}_{10},\dots,\mathcal{D}_1\}$ | 0.193 at $\delta=0.737$ | $C_1=\{\mathcal{D}_{10},\dots,\mathcal{D}_1\}$ | 0.806 at $\delta=0.737$
Dataset $\mathcal{D}$ | Silhouette Coefficient $\max_{\theta}SC(\mathcal{D})$ | Proposed Loss Function $\mathcal{L}(\arg\min_{\theta}\mathcal{L}(\theta))$ | Goodness Measure $\max_{\theta}goodness(\mathcal{D})$ | Goodness Measure $\min_{\theta}goodness^{2}(\mathcal{D})$ | Goodness Measure $\max_{\theta}goodness^{3}(\mathcal{D})$

For each objective, the two entries give ${\delta}_{opt}$ and the fraction of explored similarity levels $|\{\delta_1,\dots,\delta_{stop}\}|/|\{\delta_1,\dots,\delta_m\}|$.

Figure A1 $7\times 7$ [20] | 0.444, $\frac{10}{10}$ | 0.444, $\frac{5}{10}$ | 0.444, $\frac{10}{10}$ | 0.065, $\frac{10}{10}$ | 0.086, $\frac{10}{10}$
Figure A2 $12\times 12$ $Zoo$ [48] | 0.559, $\frac{48}{48}$ | 0.559, $\frac{5}{48}$ | 0.559, $\frac{48}{48}$ | 0.348, $\frac{48}{48}$ | 0.348, $\frac{48}{48}$
Figure A3 $4\times 4$ $Mushroom$ [48] | 0.41, $\frac{4}{4}$ | 0.41, $\frac{2}{4}$ | 0.365, $\frac{4}{4}$ | 0.365, $\frac{4}{4}$ | 0.41, $\frac{4}{4}$
Figure A4 $6\times 6$ $Iris$ [48] | 0.3, $\frac{6}{6}$ | 0.3, $\frac{3}{6}$ | 0.3, $\frac{6}{6}$ | 0.3, $\frac{6}{6}$ | 0.3, $\frac{6}{6}$
Figure A5 $6\times 6$ $Zoo$ & $Mushroom$ [48] | 0.384, $\frac{8}{8}$ | 0.384, $\frac{7}{8}$ | 0.384, $\frac{8}{8}$ | 0.384, $\frac{8}{8}$ | 0.384, $\frac{8}{8}$
Figure A6 $4\times 4$ [39] | 0.429, $\frac{6}{6}$ | 0.429, $\frac{3}{6}$ | 0.25, $\frac{6}{6}$ | 0.25, $\frac{6}{6}$ | 0.429, $\frac{6}{6}$
Figure A7 $10\times 10$ T10I4D100K [49] | 0.846, $\frac{31}{31}$ | 0.846, $\frac{4}{31}$ | 0.737, $\frac{31}{31}$ | 0.737, $\frac{31}{31}$ | 0.737, $\frac{31}{31}$
Experiment (Figure) | Proposed Algo: Avg. Running Time, Avg. Clustering Error | BestDatabaseClustering [22]: Avg. Running Time, Avg. Clustering Error | GDMDBClustering [25]: Avg. Running Time, Avg. Clustering Error

Figure A8 ($\eta=0.001$) | 6.367, 0.285 | 47.208, 0.936 | 14.825, 0.285
Figure A9 ($\eta=0.002$) | 6.367, 0.285 | 47.208, 0.936 | 7.305, 0.290
Figure A10 ($\eta=0.0005$) | 6.367, 0.285 | 47.208, 0.936 | 28.479, 0.285
Algorithm | Running Time: Average, SD, Var, Stat, p-Value | Clustering Error: Average, SD, Var, Stat, p-Value

Proposed Algo | 6.367, 3.018, 9.107, –, – | 0.285, 0.080, 0.006, –, –
BestDatabaseClustering [22] | 47.208, 27.537, 758.313, 135.707, $3.40\times 10^{-30}$ | 0.936, 0.066, 0.004, 150, $2.67\times 10^{-33}$
GDMDBClustering [25] ($\eta =0.001$) | 14.825, 1.743, 3.037, –, – | 0.285, 0.080, 0.006, –, –
Algorithm | Running Time: Average, SD, Var, Stat, p-Value | Clustering Error: Average, SD, Var, Stat, p-Value

Proposed Algo | 6.367, 3.018, 9.107, –, – | 0.285, 0.080, 0.006, –, –
BestDatabaseClustering [22] | 47.208, 27.537, 758.313, 121.62, $3.88\times 10^{-27}$ | 0.936, 0.066, 0.004, 131, $3.56\times 10^{-29}$
GDMDBClustering [25] ($\eta =0.002$) | 7.305, 1.766, 3.118, –, – | 0.290, 0.086, 0.007, –, –
Algorithm | Running Time: Average, SD, Var, Stat, p-Value | Clustering Error: Average, SD, Var, Stat, p-Value

Proposed Algo | 6.367, 3.018, 9.107, –, – | 0.285, 0.080, 0.006, –, –
BestDatabaseClustering [22] | 47.208, 27.537, 758.313, 118.90, $1.51\times 10^{-26}$ | 0.936, 0.066, 0.004, 150, $2.67\times 10^{-33}$
GDMDBClustering [25] ($\eta =0.0005$) | 28.479, 4.655, 21.669, –, – | 0.285, 0.080, 0.006, –, –
Dataset | F-Measure [60,61]: Proposed Algo, Algo [22], Algo [23], Algo [21] | Precision [60,61]: Proposed Algo, Algo [22], Algo [23], Algo [21] | Recall [60,61]: Proposed Algo, Algo [22], Algo [23], Algo [21]

Figure A1 $7\times 7$ [20] | 1, 1, 0.44, 0.8 | 1, 1, 0.28, 0.66 | 1, 1, 1, 1
Figure A2 $12\times 12$ $Zoo$ [48] | 1, 1, 0.14, 0.14 | 1, 1, 0.075, 0.075 | 1, 1, 1, 1
Figure A3 $4\times 4$ $Mushroom$ [48] | 1, 0.66, 0.66, 1 | 1, 0.5, 0.5, 1 | 1, 1, 1, 1
Figure A4 $6\times 6$ $Iris$ [48] | 1, 1, 1, 1 | 1, 1, 1, 1 | 1, 1, 1, 1
Figure A5 $6\times 6$ $Zoo$ & $Mushroom$ [48] | 1, 1, 1, 1 | 1, 1, 1, 1 | 1, 1, 1, 1
Figure A6 $4\times 4$ [39] | 1, 0.66, 0.66, 1 | 1, 0.5, 0.5, 1 | 1, 1, 1, 1
Figure A7 $10\times 10$ T10I4D100K [49] | 1, 0.16, 0.16, 0.16 | 1, 0.088, 0.088, 0.088 | 1, 1, 1, 1
Clustering | Predicted: Pairs in $\mathcal{P}$ | Predicted: Pairs Not in $\mathcal{P}$

Actual: Pairs in $\mathcal{Q}$ | $a:=Pair{s}_{\mathcal{Q}}\cap Pair{s}_{\mathcal{P}}$ (True Positive) | $b:=Pair{s}_{\mathcal{Q}}\backslash Pair{s}_{\mathcal{P}}$ (False Negative)
Actual: Pairs not in $\mathcal{Q}$ | $c:=Pair{s}_{\mathcal{P}}\backslash Pair{s}_{\mathcal{Q}}$ (False Positive) | $d:=$ pairs in neither (True Negative)

Precision [60,61] | Recall [60,61] | F-Measure [60,61] | Rand [62] | Jaccard [63]
$\frac{a}{a+c}$ | $\frac{a}{a+b}$ | $\frac{2a}{2a+b+c}$ | $\frac{a+d}{a+b+c+d}$ | $\frac{a}{a+b+c}$
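The pair-counting validity measures above can be computed directly from the four counts $a$, $b$, $c$, $d$, as in this short sketch:

```python
def pair_counting_scores(a, b, c, d):
    """External clustering validity from pair counts:
    a = true positives, b = false negatives, c = false positives,
    d = true negatives (pairs separated in both clusterings)."""
    precision = a / (a + c)
    recall = a / (a + b)
    f_measure = 2 * a / (2 * a + b + c)
    rand = (a + d) / (a + b + c + d)
    jaccard = a / (a + b + c)
    return precision, recall, f_measure, rand, jaccard

p, r, f, rand, jac = pair_counting_scores(a=4, b=1, c=1, d=4)
# precision = recall = F-measure = Rand = 0.8, Jaccard = 4/6
```

Note that, unlike the Rand index, the Jaccard index ignores the true-negative count $d$, which makes it stricter when most pairs are separated in both clusterings.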
Dataset | Rand [62]: Proposed Algo, Algo [22], Algo [23], Algo [21] | Jaccard [63]: Proposed Algo, Algo [22], Algo [23], Algo [21]

Figure A1 $7\times 7$ [20] | 1, 1, 0.28, 0.85 | 1, 1, 0.28, 0.66
Figure A2 $12\times 12$ $Zoo$ [48] | 1, 1, 0.075, 0.075 | 1, 1, 0.075, 0.075
Figure A3 $4\times 4$ $Mushroom$ [48] | 1, 0.5, 0.5, 1 | 1, 0.5, 0.5, 1
Figure A4 $6\times 6$ $Iris$ [48] | 1, 1, 1, 1 | 1, 1, 1, 1
Figure A5 $6\times 6$ $Zoo$ & $Mushroom$ [48] | 1, 1, 1, 1 | 1, 1, 1, 1
Figure A6 $4\times 4$ [39] | 1, 0.5, 0.5, 1 | 1, 0.5, 0.5, 1
Figure A7 $10\times 10$ T10I4D100K [49] | 1, 0.088, 0.088, 0.088 | 1, 0.088, 0.088, 0.088
References
 Han, J.; Pei, J.; Yin, Y.; Mao, R. Mining frequent patterns without candidate generation: A frequentpattern tree approach. Data Min. Knowl. Discov. 2004, 8, 53–87. [Google Scholar] [CrossRef]
 Ng, A.Y.; Jordan, M.I.; Weiss, Y. On spectral clustering: Analysis and an algorithm. Adv. Neural Inf. Process. Syst. 2001, 849–856. [Google Scholar] [CrossRef]
 Johnson, S.C. Hierarchical clustering schemes. Psychometrika 1967, 32, 241–254. [Google Scholar] [CrossRef] [PubMed]
 MacQueen, J. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Oakland, CA, USA, 27 December 1965–7 January 1966; Volume 1, pp. 281–297. [Google Scholar]
 Zhang, Y.J.; Liu, Z.Q. Self-splitting competitive learning: A new on-line clustering paradigm. IEEE Trans. Neural Netw. 2002, 13, 369–380. [Google Scholar] [CrossRef]
 Yair, E.; Zeger, K.; Gersho, A. Competitive learning and soft competition for vector quantizer design. IEEE Trans. Signal Process. 1992, 40, 294–309. [Google Scholar] [CrossRef]
 Hofmann, T.; Buhmann, J.M. Competitive learning algorithms for robust vector quantization. IEEE Trans. Signal Process. 1998, 46, 1665–1675. [Google Scholar] [CrossRef] [Green Version]
 Kohonen, T. Self-Organizing Maps; Springer Science & Business Media: Berlin/Heidelberg, Germany; New York, NY, USA, 2012; Volume 30. [Google Scholar]
 Pal, N.R.; Bezdek, J.C.; Tsao, E.K. Generalized clustering networks and Kohonen’s self-organizing scheme. IEEE Trans. Neural Netw. 1993, 4, 549–557. [Google Scholar] [CrossRef]
 Mao, J.; Jain, A.K. A self-organizing network for hyperellipsoidal clustering (HEC). IEEE Trans. Neural Netw. 1996, 7, 16–29. [Google Scholar]
 Anderberg, M.R. Cluster Analysis for Applications: Probability and Mathematical Statistics: A Series of Monographs and Textbooks; Academic Press: Cambridge, MA, USA, 2014; Volume 19. [Google Scholar]
 Aggarwal, C.C.; Reddy, C.K. Data clustering. Algorithms and Application; CRC Press: Boca Raton, FL, USA, 2014. [Google Scholar]
 Wang, C.D.; Lai, J.H.; Philip, S.Y. NEIWalk: Community discovery in dynamic contentbased networks. IEEE Trans. Knowl. Data Eng. 2013, 26, 1734–1748. [Google Scholar] [CrossRef]
 Wang, Z.; Zhang, D.; Zhou, X.; Yang, D.; Yu, Z.; Yu, Z. Discovering and profiling overlapping communities in location-based social networks. IEEE Trans. Syst. Man Cybern. Syst. 2013, 44, 499–509. [Google Scholar] [CrossRef] [Green Version]
 Huang, D.; Lai, J.H.; Wang, C.D.; Yuen, P.C. Ensembling over-segmentations: From weak evidence to strong segmentation. Neurocomputing 2016, 207, 416–427. [Google Scholar] [CrossRef]
 Shi, J.; Malik, J. Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2000, 22, 888–905. [Google Scholar]
 Zhao, Q.; Wang, C.; Wang, P.; Zhou, M.; Jiang, C. A novel method on information recommendation via hybrid similarity. IEEE Trans. Syst. Man Cybern. Syst. 2016, 48, 448–459. [Google Scholar] [CrossRef]
 Symeonidis, P. ClustHOSVD: Item recommendation by combining semantically enhanced tag clustering with tensor HOSVD. IEEE Trans. Syst. Man Cybern. Syst. 2015, 46, 1240–1251. [Google Scholar] [CrossRef]
 Rafailidis, D.; Daras, P. The TFC model: Tensor factorization and tag clustering for item recommendation in social tagging systems. IEEE Trans. Syst. Man Cybern. Syst. 2012, 43, 673–688. [Google Scholar] [CrossRef]
 Adhikari, A.; Adhikari, J. Clustering Multiple Databases Induced by Local Patterns. In Advances in Knowledge Discovery in Databases; Springer: Cham, Switzerland, 2015; pp. 305–332. [Google Scholar]
 Liu, Y.; Yuan, D.; Cuan, Y. Completely Clustering for Multidatabases Mining. J. Comput. Inf. Syst. 2013, 9, 6595–6602. [Google Scholar]
 Miloudi, S.; Hebri, S.A.R.; Khiat, S. Contribution to Improve Database Classification Algorithms for Multi-Database Mining. J. Inf. Process. Syst. 2018, 14, 709–726. [Google Scholar]
 Tang, H.; Mei, Z. A Simple Methodology for Database Clustering. In Proceedings of the 5th International Conference on Computer Engineering and Networks, SISSA Medialab, Shanghai, China, 12–13 September 2015; Volume 259, p. 19. [Google Scholar]
 Wang, R.; Ji, W.; Liu, M.; Wang, X.; Weng, J.; Deng, S.; Gao, S.; Yuan, C.A. Review on mining data from multiple data sources. Pattern Recognit. Lett. 2018, 109, 120–128. [Google Scholar] [CrossRef]
 Miloudi, S.; Wang, Y.; Ding, W. A Gradient-Based Clustering for Multi-Database Mining. IEEE Access 2021, 9, 11144–11172. [Google Scholar] [CrossRef]
 Miloudi, S.; Wang, Y.; Ding, W. An Optimized Graph-Based Clustering for Multi-Database Mining. In Proceedings of the 2020 IEEE 32nd International Conference on Tools with Artificial Intelligence (ICTAI), Baltimore, MD, USA, 9–11 November 2020; pp. 807–812. [Google Scholar] [CrossRef]
 Zhang, S.; Zaki, M.J. Mining Multiple Data Sources: Local Pattern Analysis. Data Min. Knowl. Discov. 2006, 12, 121–125. [Google Scholar] [CrossRef] [Green Version]
 Adhikari, A.; Rao, P.R. Synthesizing heavy association rules from different real data sources. Pattern Recognit. Lett. 2008, 29, 59–71. [Google Scholar] [CrossRef]
 Adhikari, A.; Adhikari, J. Advances in Knowledge Discovery in Databases; Springer: Cham, Switzerland, 2015. [Google Scholar]
 Adhikari, A.; Jain, L.C.; Prasad, B. A State-of-the-Art Review of Knowledge Discovery in Multiple Databases. J. Intell. Syst. 2017, 26, 23–34. [Google Scholar] [CrossRef] [Green Version]
 Zhang, S.; Zhang, C.; Wu, X. Identifying Exceptional Patterns. Knowl. Discov. Multiple Datab. 2004, 185–195. [Google Scholar]
 Zhang, S.; Zhang, C.; Wu, X. Identifying High-vote Patterns. Knowl. Discov. Multiple Datab. 2004, 157–183. [Google Scholar]
 Ramkumar, T.; Srinivasan, R. Modified algorithms for synthesizing high-frequency rules from different data sources. Knowl. Inf. Syst. 2008, 17, 313–334. [Google Scholar] [CrossRef]
 Djenouri, Y.; Lin, J.C.W.; Nørvåg, K.; Ramampiaro, H. Highly efficient pattern mining based on transaction decomposition. In Proceedings of the 2019 IEEE 35th International Conference on Data Engineering (ICDE), Macao, China, 8–11 April 2019; pp. 1646–1649. [Google Scholar]
 Savasere, A.; Omiecinski, E.R.; Navathe, S.B. An Efficient Algorithm for Mining Association Rules in Large Databases; Technical Report GIT-CC-95-04; Georgia Institute of Technology: Atlanta, GA, USA, 1995. [Google Scholar]
 Zhang, S.; Wu, X. Large scale data mining based on data partitioning. Appl. Artif. Intel. 2001, 15, 129–139. [Google Scholar] [CrossRef]
 Zhang, C.; Liu, M.; Nie, W.; Zhang, S. Identifying Global Exceptional Patterns in Multidatabase Mining. IEEE Intell. Inform. Bull. 2004, 3, 19–24. [Google Scholar]
 Zhang, S.; Zhang, C.; Yu, J.X. An efficient strategy for mining exceptions in multidatabases. Inf. Sci. 2004, 165, 1–20. [Google Scholar] [CrossRef]
 Wu, X.; Zhang, C.; Zhang, S. Database classification for multidatabase mining. Inf. Syst. 2005, 30, 71–88. [Google Scholar] [CrossRef]
 Li, H.; Hu, X.; Zhang, Y. An Improved Database Classification Algorithm for Multidatabase Mining. In Frontiers in Algorithmics; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2009; pp. 346–357. [Google Scholar]
 Na, S.; Xumin, L.; Yong, G. Research on kmeans clustering algorithm: An improved kmeans clustering algorithm. In Proceedings of the 2010 Third International Symposium on Intelligent Information Technology and Security Informatics, Jian, China, 2–4 April 2010; pp. 63–67. [Google Scholar]
 Selim, S.Z.; Ismail, M.A. Kmeanstype algorithms: A generalized convergence theorem and characterization of local optimality. IEEE Trans. Pattern Anal. Mach. Intel. 1984, 81–87. [Google Scholar] [CrossRef] [PubMed]
 Rousseeuw, P.J. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 1987, 20, 53–65.
 Kaufman, L.; Rousseeuw, P.J. Finding Groups in Data: An Introduction to Cluster Analysis; John Wiley & Sons: Hoboken, NJ, USA, 2009; Volume 344.
 De Luca, A.; Termini, S. A Definition of a Nonprobabilistic Entropy in the Setting of Fuzzy Sets Theory. In Readings in Fuzzy Sets for Intelligent Systems; Dubois, D., Prade, H., Yager, R.R., Eds.; Morgan Kaufmann: San Francisco, CA, USA, 1993; pp. 197–202.
 Boyd, S.; Vandenberghe, L. Convex Optimization; Cambridge University Press: Cambridge, UK, 2004.
 Cormen, T.H.; Leiserson, C.E.; Rivest, R.L.; Stein, C. Data structures for disjoint sets. In Introduction to Algorithms; MIT Press: Cambridge, MA, USA, 2009; pp. 498–524.
 Center for Machine Learning and Intelligent Systems. UCI Machine Learning Repository. Available online: https://archive.ics.uci.edu/ml/datasets/ (accessed on 10 October 2020).
 IBM Almaden Quest Research Group. Frequent Itemset Mining Dataset Repository. Available online: http://fimi.ua.ac.be/data/ (accessed on 10 October 2020).
 Thirion, B.; Varoquaux, G.; Gramfort, A.; Michel, V.; Grisel, O.; Louppe, G.; Nothman, J. Scikit-learn: sklearn.datasets.make_blobs. Available online: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_blobs.html (accessed on 10 October 2020).
 Gramfort, A.; Blondel, M.; Grisel, O.; Mueller, A.; Martin, E.; Patrini, G.; Chang, E. Scikit-learn: sklearn.preprocessing.MinMaxScaler. Available online: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html (accessed on 10 October 2020).
 Friedman, M. A comparison of alternative tests of significance for the problem of m rankings. Ann. Math. Stat. 1940, 11, 86–92.
 Meilă, M. Comparing clusterings: An axiomatic view. In Proceedings of the 22nd International Conference on Machine Learning, Bonn, Germany, 7–11 August 2005; pp. 577–584.
 Vinh, N.X.; Epps, J.; Bailey, J. Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. J. Mach. Learn. Res. 2010, 11, 2837–2854.
 Günnemann, S.; Färber, I.; Müller, E.; Assent, I.; Seidl, T. External evaluation measures for subspace clustering. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, Scotland, UK, 24–28 October 2011; pp. 1363–1372.
 Banerjee, A.; Krumpelman, C.; Ghosh, J.; Basu, S.; Mooney, R.J. Model-based overlapping clustering. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, Chicago, IL, USA, 21–24 August 2005; pp. 532–537.
 Pfitzner, D.; Leibbrandt, R.; Powers, D. Characterization and evaluation of similarity measures for pairs of clusterings. Knowl. Inf. Syst. 2009, 19, 361–394.
 Achtert, E.; Goldhofer, S.; Kriegel, H.P.; Schubert, E.; Zimek, A. Evaluation of clusterings: Metrics and visual support. In Proceedings of the 2012 IEEE 28th International Conference on Data Engineering, Arlington, VA, USA, 1–5 April 2012; pp. 1285–1288.
 Shafiei, M.; Milios, E. Model-based overlapping co-clustering. In Proceedings of the SIAM Conference on Data Mining, Bethesda, MD, USA, 20–22 April 2006.
 Chinchor, N. MUC-4 evaluation metrics. In Proceedings of the Fourth Message Understanding Conference, McLean, VA, USA, 16–18 June 1992.
 Mei, Q.; Radev, D. Information retrieval. In The Oxford Handbook of Computational Linguistics, 2nd ed.; Oxford University Press: New York, NY, USA, 1979.
 Rand, W.M. Objective criteria for the evaluation of clustering methods. J. Am. Stat. Assoc. 1971, 66, 846–850.
 Jaccard, P. Distribution de la flore alpine dans le bassin des Dranses et dans quelques régions voisines. Bull. Soc. Vaudoise Sci. Nat. 1901, 37, 241–272.
Transactional Database $\left({\mathcal{D}}_{\mathit{p}}\right)$  Transactions/Rows 

${\mathcal{D}}_{1}$  $(A,C),(A,B,C),(B,C),(A,B,C,D)$ 
${\mathcal{D}}_{2}$  $(A,B,C),(B,C),(A,B),(A,C),(A,B,D)$ 
${\mathcal{D}}_{3}$  $(B,C),(A,D),(B,C,D),(A,B,C)$ 
${\mathcal{D}}_{4}$  $(E,F,H),(F,H),(F,G,H,I,J)$ 
${\mathcal{D}}_{5}$  $(E,J),(F,H,J),(E,F,H,J),(F,H)$ 
${\mathcal{D}}_{6}$  $(E,I),(E,F,H),(F,H,I,J),(E,H,J)$ 
Transactional Database $\left({\mathcal{D}}_{\mathit{p}}\right)$  Frequent Itemsets $\mathit{FIS}({\mathcal{D}}_{\mathit{p}},\mathit{\alpha})$ 

${\mathcal{D}}_{1}$  $\langle AC,0.75\rangle ,\langle AB,0.5\rangle ,\langle ABC,0.5\rangle ,\langle BC,0.75\rangle ,\langle C,1.0\rangle ,\langle B,0.75\rangle ,\langle A,0.75\rangle $ 
${\mathcal{D}}_{2}$  $\langle AB,0.6\rangle ,\langle C,0.6\rangle ,\langle B,0.8\rangle ,\langle A,0.8\rangle $ 
${\mathcal{D}}_{3}$  $\langle BC,0.75\rangle ,\langle D,0.5\rangle ,\langle C,0.75\rangle ,\langle B,0.75\rangle ,\langle A,0.5\rangle $ 
${\mathcal{D}}_{4}$  $\langle H,1.0\rangle ,\langle F,1.0\rangle ,\langle FH,1.0\rangle $ 
${\mathcal{D}}_{5}$  $\langle E,0.5\rangle ,\langle EJ,0.5\rangle ,\langle J,0.75\rangle ,\langle HJ,0.5\rangle ,\langle FHJ,0.5\rangle ,\langle FJ,0.5\rangle ,\langle H,0.75\rangle ,\langle FH,0.75\rangle ,\langle F,0.75\rangle $ 
${\mathcal{D}}_{6}$  $\langle I,0.5\rangle ,\langle J,0.5\rangle ,\langle HJ,0.5\rangle ,\langle F,0.5\rangle ,\langle FH,0.5\rangle ,\langle E,0.75\rangle ,\langle EH,0.5\rangle ,\langle H,0.75\rangle $ 
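The frequent itemsets tabulated above can be reproduced with a brute-force pass over each local database. The sketch below is illustrative only (the paper does not prescribe a particular mining algorithm): `frequent_itemsets` is a hypothetical helper, and a local threshold of $\alpha = 0.5$ is assumed for ${\mathcal{D}}_{1}$, which is consistent with the table (itemset $D$, at support 0.25, is excluded).

```python
from itertools import combinations

def frequent_itemsets(transactions, alpha):
    """Brute-force miner: return every itemset whose support >= alpha.
    Fine for the toy databases above; not the miner used in the paper."""
    items = sorted({i for t in transactions for i in t})
    n = len(transactions)
    fis = {}
    for k in range(1, len(items) + 1):
        for cand in combinations(items, k):
            # support = fraction of transactions containing the candidate
            supp = sum(1 for t in transactions if set(cand) <= set(t)) / n
            if supp >= alpha:
                fis["".join(cand)] = supp
    return fis

# D1 from the transactions table, mined with local threshold alpha = 0.5
D1 = [("A", "C"), ("A", "B", "C"), ("B", "C"), ("A", "B", "C", "D")]
print(frequent_itemsets(D1, 0.5))
# {'A': 0.75, 'B': 0.75, 'C': 1.0, 'AB': 0.5, 'AC': 0.75, 'BC': 0.75, 'ABC': 0.5}
```

The result matches the $FIS({\mathcal{D}}_{1}, \alpha)$ row of the table exactly.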
Clustering Quality [Reference]  Function (Equation)  Optimal Value 

[20]  $goodness(\mathcal{D}) = B(\mathcal{D}) + W(\mathcal{D}) - f(\mathcal{D})$  $\max\; goodness(\mathcal{D})$ 
[23]  $goodness^{2}(\mathcal{D}) = \frac{sum\text{-}dist(\mathcal{D})}{(n^{2}-n)/2} + \frac{coupling(\mathcal{D})}{(n^{2}-n)/2} + \frac{f(\mathcal{D})-1}{n-1}$  $\min\; goodness^{2}(\mathcal{D})$ 
[21]  $goodness^{3}(\mathcal{D}) = \frac{intra\text{-}sim(\mathcal{D}) + inter\text{-}dist(\mathcal{D})}{f(\mathcal{D})}$  $\max\; goodness^{3}(\mathcal{D})$ 
[43,44]  $SC(\mathcal{D}) = \frac{1}{n}\sum_{p=0}^{n-1} s(\mathcal{D}_{p})$  $\max\; SC(\mathcal{D})$ 

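Of these objectives, the silhouette coefficient $SC(\mathcal{D})$ [43,44] is defined independently of the paper's similarity measures, so it can be sketched directly from a precomputed pairwise distance matrix. The sketch below assumes the usual convention $s(i) = 0$ for singleton clusters; the example matrix is illustrative, not from the paper.

```python
import numpy as np

def silhouette(dist, labels):
    """SC(D) = (1/n) * sum_p s(D_p) from a precomputed distance matrix,
    per Rousseeuw [43,44]; s(i) = 0 for singleton clusters by convention."""
    n = len(labels)
    scores = []
    for i in range(n):
        same = [j for j in range(n) if labels[j] == labels[i] and j != i]
        if not same:                       # singleton cluster
            scores.append(0.0)
            continue
        a = np.mean([dist[i][j] for j in same])   # mean distance within own cluster
        b = min(np.mean([dist[i][j] for j in range(n) if labels[j] == lab])
                for lab in set(labels) if lab != labels[i])  # nearest other cluster
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Two tight pairs of databases, far apart -> silhouette close to 1
dist = np.array([[0.0, 0.1, 0.9, 0.9],
                 [0.1, 0.0, 0.9, 0.9],
                 [0.9, 0.9, 0.0, 0.1],
                 [0.9, 0.9, 0.1, 0.0]])
print(round(silhouette(dist, [0, 0, 1, 1]), 3))  # 0.889
```

A value near 1 indicates compact, well-separated clusters; negative values flag poor assignments, which is why $SC(\mathcal{D})$ serves as an external check on the clusterings produced at each similarity level.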
Output  Clustering 1 under $sim_1$ [20]  Clustering 2 under $sim$ (3) 

Clusters  $\{\mathcal{D}_{1}\}, \{\mathcal{D}_{2}, \mathcal{D}_{3}\}$  $\{\mathcal{D}_{1}, \mathcal{D}_{2}\}, \{\mathcal{D}_{3}\}$ 
Intra-cluster similarity  0.6  0.75 
Inter-cluster distance  1.6  1.75 
Goodness measure [20]  0.2  0.5 
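As a sanity check, the goodness values in the last row follow from the objective of [20], $goodness(\mathcal{D}) = B(\mathcal{D}) + W(\mathcal{D}) - f(\mathcal{D})$, with intra-cluster similarity $W$, inter-cluster distance $B$, and $f(\mathcal{D}) = 2$ clusters in both cases:

```python
def goodness(intra_sim, inter_dist, num_clusters):
    # goodness(D) = B(D) + W(D) - f(D): reward within-cluster similarity and
    # between-cluster distance, penalise the number of clusters f(D)
    return intra_sim + inter_dist - num_clusters

print(round(goodness(0.60, 1.60, 2), 2))  # 0.2 -> clustering 1
print(round(goodness(0.75, 1.75, 2), 2))  # 0.5 -> clustering 2
```

Clustering 2 under the proposed measure $sim$ (3) scores strictly higher, matching the table.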
Synthesized Itemsets ${\mathit{I}}_{\mathit{k}}$  $\mathit{supp}({\mathit{I}}_{\mathit{k}},{\mathit{C}}_{2,3})$ under $sim_1$ [20]  $\mathit{supp}({\mathit{I}}_{\mathit{k}},{\mathit{C}}_{1,2})$ under $sim$ (3) 

A  $0.12<{\alpha}_{2,3}=0.19$  $0.2>{\alpha}_{1,2}=0.17$ 
B  $0.12<{\alpha}_{2,3}=0.19$  $0.2>{\alpha}_{1,2}=0.17$ 
C  $0.12<{\alpha}_{2,3}=0.19$  $0.2>{\alpha}_{1,2}=0.17$ 
E  $0.9>{\alpha}_{2,3}=0.19$  $0.54>{\alpha}_{1,2}=0.17$ 
Similarity Matrix  Fuzziness Index (9)  ${\mathit{\theta}}^{\mathit{T}},\mathit{epochs},\mathit{\eta}$  max $\mathit{goodness}\left(\mathcal{D}\right)$ [20]  ${\mathit{\delta}}_{\mathit{opt}}$  $\mathit{SC}\left(\mathcal{D}\right)$ [43,44] at ${\mathit{\delta}}_{\mathit{opt}}$  Optimal Clustering at ${\mathit{\delta}}_{\mathit{opt}}$ 

Figure 5  0.97  ${\theta}^{T}=[1,1,\dots ,1,1]$ (Without fuzziness reduction)  4.19  0.46  −1  $\{{\mathcal{D}}_{1},{\mathcal{D}}_{2},{\mathcal{D}}_{3},{\mathcal{D}}_{4},{\mathcal{D}}_{5}\}$ 
Figure 5  0.95  ${\theta}^{T}=[1,1,\dots ,1,1]$ (Without fuzziness reduction)  1.29  0.313  −1  $\{{\mathcal{D}}_{1},{\mathcal{D}}_{2},{\mathcal{D}}_{3},{\mathcal{D}}_{4}\}$ 
Figure 6  0.74  ${\theta}^{T}$ = [1.30,0.52,0.71,0.71,0.52, 0.71,0.71,0.52,0.52,1.44], $epochs=300$, $\eta =0.1$  4.54  0.95  0.73  $\{{\mathcal{D}}_{1},{\mathcal{D}}_{2}\},\left\{{\mathcal{D}}_{3}\right\},\{{\mathcal{D}}_{4},{\mathcal{D}}_{5}\}$ 
Figure 6  0.81  ${\theta}^{T}=[0.63,0.638,0.591,0.712,0.712,0.77]$, $epochs=100$, $\eta =0.1$  1.27  0.292  0.08  $\{{\mathcal{D}}_{4},{\mathcal{D}}_{3},{\mathcal{D}}_{2}\},\left\{{\mathcal{D}}_{1}\right\}$ 
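The fuzziness-index column above (Equation (9), not reproduced in this excerpt) quantifies how close the pairwise similarities sit to 0.5. A plausible stand-in, following De Luca and Termini's nonprobabilistic entropy cited in the reference list, is the normalised entropy of the $(n^{2}-n)/2$ off-diagonal similarity values; the function name below is an illustrative assumption:

```python
import math

def fuzziness_index(sims):
    """Normalised De Luca-Termini entropy of the pairwise similarities:
    1.0 when every value is 0.5 (maximally fuzzy), 0.0 when every value
    is crisp (0 or 1). A hedged stand-in for Equation (9)."""
    def h(s):
        if s in (0.0, 1.0):        # crisp value carries no fuzziness
            return 0.0
        return -(s * math.log2(s) + (1 - s) * math.log2(1 - s))
    return sum(h(s) for s in sims) / len(sims)

print(fuzziness_index([0.5, 0.5, 0.5]))  # 1.0 (maximally fuzzy)
print(fuzziness_index([0.0, 1.0, 1.0]))  # 0.0 (crisp)
```

This matches the table's trend: similarity matrices with indices near 1 (0.97, 0.95) yield trivial one-cluster outputs, while the adjusted matrices with lower indices (0.74, 0.81) yield meaningful clusterings.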
Number of Random Blobs $\mathbf{\left(}\mathit{n}\mathbf{\right)}$  Number of Centers $\mathbf{\lfloor}\frac{\mathit{n}}{\mathbf{2}}\mathbf{\rfloor}$  Number of Attributes $\mathbf{\left(}\mathit{m}\mathbf{\right)}$ 

30  15  random.randint(2, 10) 
⋮  ⋮  ⋮ 
60  30  random.randint(2, 10) 
⋮  ⋮  ⋮ 
120  60  random.randint(2, 10) 
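The generation protocol in the table can be sketched with NumPy alone, mimicking what `sklearn.datasets.make_blobs` followed by `sklearn.preprocessing.MinMaxScaler` (both cited above) would produce. The function name, seed handling, and center spread below are illustrative assumptions, not the paper's exact setup:

```python
import random
import numpy as np

def make_random_blob_datasets(n, seed=0):
    """n rows drawn around floor(n/2) Gaussian centers, with
    m = random.randint(2, 10) attributes, min-max scaled to [0, 1]."""
    random.seed(seed)
    rng = np.random.default_rng(seed)
    m = random.randint(2, 10)                      # number of attributes
    centers = rng.uniform(-10, 10, size=(n // 2, m))
    idx = rng.integers(0, n // 2, size=n)          # one center per row
    X = centers[idx] + rng.normal(0.0, 1.0, size=(n, m))
    # per-attribute min-max scaling, as MinMaxScaler would do
    return (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

X = make_random_blob_datasets(30)
print(X.shape[0], bool(X.min() >= 0.0), bool(X.max() <= 1.0))  # 30 True True
```

Scaling every attribute to [0, 1] keeps the pairwise similarities comparable across synthetic databases of different attribute ranges.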
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Miloudi, S.; Wang, Y.; Ding, W. An Improved SimilarityBased Clustering Algorithm for MultiDatabase Mining. Entropy 2021, 23, 553. https://doi.org/10.3390/e23050553