Document Clustering Using K-Means with Term Weighting as Similarity-Based Constraints
Abstract
1. Introduction
2. Related Works
2.1. Comparative Analysis on Four Learning/Mining Schemes
2.2. Similarity-Based Constrained Clustering
3. Constrained Document Clustering with Distribution-Based Term Weighting
3.1. Distribution-Based Term Weighting Scheme
 Term frequency (TF):$$t{f}_{ij}=N({d}_{i},{t}_{j})$$
 Inverse document frequency (IDF):$$idf_{j}=\log_{10}\left(1+\frac{\left|D\right|}{df_{j}}\right)$$
 Standard deviation (SD):$$sd_{j}=\sqrt{\frac{\sum_{k}\sum_{d_{i}\in C_{k}}\left(tf_{ij}-\frac{\sum_{k}\sum_{d_{i}\in C_{k}}tf_{ij}}{\sum_{k}\left|C_{k}\right|}\right)^{2}}{\sum_{k}\left|C_{k}\right|}}$$
 Average class standard deviation (ACSD):$$acsd_{j}=\frac{1}{\left|C\right|}\sum_{k}\sqrt{\frac{1}{\left|C_{k}\right|}\sum_{d_{i}\in C_{k}}\left(tf_{ij}-\overline{tf}_{jk}\right)^{2}}$$
 Inter-class standard deviation (ICSD):$$icsd_{j}=\sqrt{\frac{1}{\left|C\right|}\sum_{k}\left(\overline{tf}_{jk}-\frac{1}{\left|C\right|}\sum_{k}\overline{tf}_{jk}\right)^{2}}$$$$\overline{tf}_{jk}=\frac{\sum_{d_{i}\in C_{k}}tf_{ij}}{\left|C_{k}\right|}$$
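The five statistics defined in this list can be computed directly from a term-frequency matrix and the class labels of the labeled documents. The following is a minimal NumPy sketch rather than the authors' implementation. The function names, the `eps` guard against zero deviations, the clamp on zero document frequencies, and the product form FW × SD^p × ACSD^q × ICSD^r (inferred from the "Power of DW" columns of the result tables) are assumptions made for illustration.

```python
import numpy as np


def distribution_weights(tf, labels):
    """Return (idf, sd, acsd, icsd), one value per term.

    tf     : (n_docs, n_terms) term-frequency matrix, tf[i, j] = N(d_i, t_j)
    labels : (n_docs,) class id of each labeled document
    """
    tf = np.asarray(tf, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    n_docs = tf.shape[0]

    # idf_j = log10(1 + |D| / df_j); df_j is clamped to 1 (assumption)
    # so that a term absent from every document does not divide by zero.
    df = np.count_nonzero(tf, axis=0)
    idf = np.log10(1.0 + n_docs / np.maximum(df, 1))

    # sd_j: population standard deviation of tf over all labeled documents.
    sd = tf.std(axis=0)

    # Per-class mean and population standard deviation of each term.
    class_means = np.vstack([tf[labels == c].mean(axis=0) for c in classes])
    class_sds = np.vstack([tf[labels == c].std(axis=0) for c in classes])

    acsd = class_sds.mean(axis=0)   # average of the within-class deviations
    icsd = class_means.std(axis=0)  # deviation of the class means
    return idf, sd, acsd, icsd


def combine_weights(tf, idf, sd, acsd, icsd, p, q, r, eps=1e-12):
    """Assumed combination: TF x IDF x SD^p x ACSD^q x ICSD^r.

    eps (hypothetical) keeps negative powers finite when a deviation is 0.
    """
    dw = (sd + eps) ** p * (acsd + eps) ** q * (icsd + eps) ** r
    return np.asarray(tf, dtype=float) * idf * dw
```

With `p = q = r = 0` the distribution factor reduces to 1 and the result is plain TF × IDF, which corresponds to the baseline rows (powers 0, 0, 0) in the multi-factor tables. Negative powers of SD and ACSD down-weight terms whose frequency varies widely, while a positive power of ICSD up-weights terms whose class means differ.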
3.2. Constrained Document Clustering with Term Weighting
3.3. The Framework of Clustering with Term Weighting
Algorithm 1 Pseudocode of main procedure of the constrained k-means clustering (semi-unsupervised learning) by distribution-based term weighting 

Algorithm 2 Pseudocode of subfunctions for constrained k-means clustering 
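Since the pseudocode bodies are not reproduced here, the following Python sketch illustrates only the general idea of a seeded k-means over weighted document vectors. It is an assumption-laden illustration, not the procedure of Algorithm 1 or 2: the labeled seed documents are used solely to initialize the centroids, assignment uses cosine similarity, and the names `seeded_kmeans`, `seed_idx`, `seed_labels`, and `n_iter` are hypothetical.

```python
import numpy as np


def _unit_rows(x):
    """Scale each row to unit length; all-zero rows are left as zeros."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, 1e-12)


def seeded_kmeans(x, seed_idx, seed_labels, n_iter=20):
    """Cluster rows of x, starting from centroids built from labeled seeds.

    x           : (n_docs, n_terms) weighted document vectors
    seed_idx    : row indices of the labeled seed documents
    seed_labels : class id of each seed, aligned with seed_idx
    """
    x = _unit_rows(np.asarray(x, dtype=float))
    seed_idx = np.asarray(seed_idx)
    seed_labels = np.asarray(seed_labels)
    classes = np.unique(seed_labels)

    # Initial centroid of each class = mean of its seed documents.
    centroids = np.vstack(
        [x[seed_idx[seed_labels == c]].mean(axis=0) for c in classes]
    )

    assign = None
    for _ in range(n_iter):
        # Cosine similarity between every document and every centroid.
        sims = x @ _unit_rows(centroids).T
        new_assign = sims.argmax(axis=1)
        if assign is not None and np.array_equal(new_assign, assign):
            break  # assignments are stable, so the centroids are too
        assign = new_assign
        # Recompute each centroid; an empty cluster keeps its old centroid.
        for k in range(len(classes)):
            members = x[assign == k]
            if len(members):
                centroids[k] = members.mean(axis=0)
    return classes[assign], centroids
```

In this sketch the number of clusters equals the number of distinct seed classes, and the seeds may be reassigned in later iterations. A variant that pins the seeds to their labels would be a stricter form of constraint.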

4. Experiment Settings and Metrics
4.1. Data Sets and Preprocessing
4.2. Experiment Settings
4.3. Evaluation Measures
5. Experimental Results
5.1. Cluster Quality of Single Factor
5.2. Cluster Quality of Multiple Factors
5.3. Term Weighting as Expression of User Intention
5.4. Investigation of Various Training Set Sizes
5.5. Effect of Cluster Number on Cluster Quality
6. Discussion and Related Works
7. Conclusions
Scheme Property  SL  SSL  SUSL  USL 

Predefined classes  ◯  ◯  ×  × 
Model learning  ◯  ◯  △  × 
Availability of labeled examples  ◯  △  △  × 
Availability of unlabeled examples  ×  △  ◯  ◯ 
Term Weighting Scheme  WebKB1 (4 Classes)  WebKB2 (5 Classes)  

SL  SSL  USL  SL  SSL  USL  
TF  73.79  63.76  52.63  71.52  32.46  30.62 
(74.71, 73.18)  (64.94, 62.60)  (53.94, 51.36)  (74.92, 68.27)  (40.19, 26.21)  (39.35, 23.83)  
nTF  75.12  48.67  47.24  76.39  40.08  28.03 
(74.96, 75.29)  (47.78, 49.58)  (48.14, 46.36)  (80.27, 72.70)  (46.49, 34.56)  (31.68, 24.80)  
TF × IDF  79.66  70.55  55.03  86.43  61.68  30.25 
(80.10, 79.22)  (71.01, 70.10)  (58.12, 52.11)  (90.03, 82.97)  (63.95, 59.48)  (39.06, 30.25)  
nTF × IDF  81.82  78.68  56.00  93.32  86.24  38.14 
(82.03, 81.60)  (78.42, 78.93)  (57.73, 54.32)  (95.34, 91.34)  (88.20, 84.32)  (43.45, 33.47) 
Dataset  Amazon  Drug Info.  WebKB1  WebKB2  20Newsgroups  ThaiReform 

General Characteristics  
Abbreviation  AM  DI  KB1  KB2  20N  TR 
Language  English  English  English  English  English  Thai 
Genre  Product  Medicine  Education  Education  News  Politics 
# classes  3  7  4  5  20  3 
# doc./class  2000 each  640 each  501/922/1118/1620  221/237/249/304/3150  various (628-999)  1000 each 
Total terms  387,493  1,243,566  572,949  572,949  1,896,335  131,717 
Distinct terms  7614  7768  6527  6527  8286  3549 
Document Size (total terms)  
Avg.  64.58  277.58  137.70  137.70  100.76  43.91 
Max.  1654  4063  17,719  17,719  5366  1114 
Min.  1  2  4  4  1  2 
SD.  73.26  323.92  315.70  315.70  210.35  53.64 
Document Size (distinct terms)  
Avg.  51.52  136.60  79.64  79.64  64.43  31.98 
Max.  743  846  2505  2505  1288  357 
Min.  1  2  2  2  1  2 
SD.  49.38  117.16  74.29  74.29  75.71  28.79 
Class Size (total terms)  
Avg.  129,164.33  177,652.29  143,237.25  114,589.80  94,816.75  43,905.67 
Max.  148,115  309,812  181,757  430,950  173,234  56,608 
Min.  94,784  59,112  86,085  28,499  52,972  21,459 
SD.  24,353.03  97,841.22  35,145.48  158,348.95  20,107.90  15,918.06 
Class Size (distinct terms)  
Avg.  6041.67  5005.71  5446.75  4072.20  4891.55  2545.67 
Max.  6933  6029  6008  6435  5613  2835 
Min.  4375  3520  4839  3204  4136  2001 
SD.  1179.46  977.75  515.30  1167.89  482.08  385.39 
Inter/Intra size of TF by cosine similarity  
Inter-similarity  0.0291  0.0487  0.1252  0.1314  0.0204  0.2063 
Intra-similarity  0.0429  0.1444  0.1659  0.1408  0.0575  0.2802 
Inter/Intra  0.6784  0.3373  0.7547  0.9332  0.3548  0.7363 
Method  FW = TF × IDF  Avg.  FW = NTF × IDF  Avg.  

FW  ⊙  DW  AM  DI  KB1  KB2  20N  TR  AM  DI  KB1  KB2  20N  TR  
Panel I: Centroid-based method (Classification)  
FW  /  ${\mathrm{SD}}^{\mathrm{T}}$  92.13  92.21  89.69  90.74  87.56  94.76  91.18 ${}^{\dagger\dagger\dagger}$  92.25  91.96  89.61  91.20  87.80  94.92  91.29 ${}^{\dagger\dagger\dagger}$  
FW  ×  ${\mathrm{SD}}^{\mathrm{T}}$  84.72  69.05  59.75  44.91  57.21  75.35  65.17  84.90  68.77  59.86  46.03  57.19  75.84  65.43  
FW  /  ${\mathrm{ACSD}}^{\mathrm{T}}$  92.83  92.06  89.78  92.00  86.26  95.20  91.36 ${}^{\dagger\dagger\dagger}$  92.96  91.91  89.06  92.19  86.48  95.19  91.30 ${}^{\dagger\dagger\dagger}$  
FW  ×  ${\mathrm{ACSD}}^{\mathrm{T}}$  83.15  70.48  54.67  41.50  54.01  66.07  61.65  83.37  70.14  52.59  42.12  53.88  66.33  61.41  
FW  /  ${\mathrm{ICSD}}^{\mathrm{T}}$  77.44  69.18  78.17  79.43  81.94  79.90  77.68  78.57  68.79  77.74  80.42  82.02  79.84  77.90  
FW  ×  ${\mathrm{ICSD}}^{\mathrm{T}}$  81.22  78.72  60.56  73.72  57.70  85.14  72.84  81.25  78.69  60.88  74.05  57.06  85.22  72.86  
FW  /  ${\mathrm{SD}}^{\mathrm{N}}$  94.25  96.55  89.45  93.25  91.90  94.24  93.27 ${}^{\dagger\dagger\dagger}$  94.23  95.99  89.11  94.24  91.86  93.58  93.17 ${}^{\dagger\dagger\dagger}$  
FW  ×  ${\mathrm{SD}}^{\mathrm{N}}$  79.77  81.01  62.79  65.49  66.27  85.22  73.43  79.80  81.24  60.30  65.83  66.16  85.54  73.15  
FW  /  ${\mathrm{ACSD}}^{\mathrm{N}}$  94.60  97.02  90.59  93.83  90.08  94.76  93.48 ${}^{\dagger\dagger\dagger}$  94.63  96.53  90.30  94.48  90.03  94.01  93.33 ${}^{\dagger\dagger\dagger}$  
FW  ×  ${\mathrm{ACSD}}^{\mathrm{N}}$  78.36  82.45  56.99  59.16  61.30  81.82  70.01  78.79  82.31  57.05  59.29  61.38  81.72  70.09  
FW  /  ${\mathrm{ICSD}}^{\mathrm{N}}$  84.43  86.41  83.78  83.18  88.19  83.48  84.91 ${}^{\dagger\dagger}$  84.48  86.86  83.27  83.10  88.17  83.38  84.88 ${}^{\dagger\dagger}$  
FW  ×  ${\mathrm{ICSD}}^{\mathrm{N}}$  77.66  80.16  65.54  77.64  59.69  85.57  74.38  77.89  79.98  65.55  77.99  59.81  85.61  74.47  
FW  /  ${\mathrm{SD}}^{\mathrm{TI}}$  88.71  91.71  83.12  88.04  84.56  92.05  88.03 ${}^{\dagger\dagger\dagger}$  92.40  93.04  90.27  94.12  88.48  95.37  92.28 ${}^{\dagger\dagger\dagger}$  
FW  ×  ${\mathrm{SD}}^{\mathrm{TI}}$  88.19  56.44  61.67  53.10  63.59  90.47  68.91  89.25  59.62  61.88  54.05  64.33  91.07  70.03  
FW  /  ${\mathrm{ACSD}}^{\mathrm{TI}}$  88.94  91.55  82.43  86.69  84.08  91.38  87.51 ${}^{\dagger\dagger\dagger}$  93.31  93.00  90.86  93.68  87.18  95.76  92.3 ${}^{\dagger\dagger\dagger}$  
FW  ×  ${\mathrm{ACSD}}^{\mathrm{TI}}$  88.37  53.48  60.81  65.91  63.40  89.96  70.32  89.33  56.83  62.14  68.43  64.43  90.84  72.00  
FW  /  ${\mathrm{ICSD}}^{\mathrm{TI}}$  61.93  46.61  63.27  59.26  76.50  58.62  61.03  69.47  63.98  79.34  78.40  83.80  67.74  73.79  
FW  ×  ${\mathrm{ICSD}}^{\mathrm{TI}}$  86.14  71.32  67.24  86.28  64.17  88.23  77.23 ${}^{\dagger\dagger\dagger}$  86.71  71.75  68.40  86.33  64.54  88.24  77.66  
FW  /  ${\mathrm{SD}}^{\mathrm{NI}}$  94.68  96.77  89.42  97.83  91.75  96.46  94.49 ${}^{\dagger\dagger\dagger}$  87.57  93.37  80.12  97.82  86.42  94.93  90.04 ${}^{\dagger\dagger\dagger}$  
FW  ×  ${\mathrm{SD}}^{\mathrm{NI}}$  86.38  80.42  77.82  88.82  75.59  90.74  83.30  85.07  77.41  76.04  88.8  73.73  90.74  81.97  
FW  /  ${\mathrm{ACSD}}^{\mathrm{NI}}$  94.95  97.31  90.21  98.03  90.45  97.45  94.73 ${}^{\dagger\dagger\dagger}$  86.66  94.47  80.07  98.02  85.54  97.45  90.37 ${}^{\dagger\dagger\dagger}$  
FW  ×  ${\mathrm{ACSD}}^{\mathrm{NI}}$  85.69  80.93  76.81  86.24  73.19  90.21  82.18  84.31  77.13  75.42  86.24  71.06  90.22  80.73  
FW  /  ${\mathrm{ICSD}}^{\mathrm{NI}}$  83.72  90.17  84.00  86.52  89.38  86.51  86.72${}^{\dagger}$  70.04  61.99  61.23  86.54  76.88  86.51  73.87  
FW  ×  ${\mathrm{ICSD}}^{\mathrm{NI}}$  82.12  81.88  80.55  87.89  68.21  89.44  81.68  81.70  82.00  79.74  87.88  67.21  89.44  81.33 ${}^{\dagger\dagger}$  
Panel II: Seeded k-means method (Clustering)  
FW  /  ${\mathrm{SD}}^{\mathrm{T}}$  90.92  89.58  81.79  86.79  83.65  92.30  87.51 ${}^{\dagger\dagger\dagger}$  91.44  89.93  87.18  87.30  84.60  93.16  88.94 ${}^{\dagger\dagger\dagger}$  
FW  ×  ${\mathrm{SD}}^{\mathrm{T}}$  75.31  43.83  53.74  31.78  40.05  46.29  48.50  72.66  44.15  53.79  31.41  41.25  54.04  49.55  
FW  /  ${\mathrm{ACSD}}^{\mathrm{T}}$  91.73  90.37  81.90  80.76  82.47  93.39  86.77 ${}^{\dagger\dagger\dagger}$  92.31  90.81  85.73  83.77  83.32  94.05  88.33 ${}^{\dagger\dagger\dagger}$  
FW  ×  ${\mathrm{ACSD}}^{\mathrm{T}}$  73.68  36.57  46.96  31.39  33.69  43.31  44.27  71.31  43.30  49.62  31.42  29.27  50.11  45.84  
FW  /  ${\mathrm{ICSD}}^{\mathrm{T}}$  60.59  59.84  66.36  61.65  74.15  71.82  65.74  65.75  46.77  69.72  63.15  71.47  67.91  64.13  
FW  ×  ${\mathrm{ICSD}}^{\mathrm{T}}$  75.60  65.81  46.28  66.19  46.26  79.94  63.35  75.39  58.75  45.87  67.98  45.12  78.05  61.86  
FW  /  ${\mathrm{SD}}^{\mathrm{N}}$  88.17  84.98  70.73  77.13  82.76  79.33  80.52 ${}^{\dagger\dagger\dagger}$  88.74  95.48  87.5  90.58  90.69  88.64  90.27 ${}^{\dagger\dagger\dagger}$  
FW  ×  ${\mathrm{SD}}^{\mathrm{N}}$  70.36  63.76  51.67  32.43  47.19  82.10  57.92  67.35  62.10  45.13  42.38  44.97  78.06  56.67  
FW  /  ${\mathrm{ACSD}}^{\mathrm{N}}$  89.19  86.83  72.18  65.51  82.61  80.76  79.51 ${}^{\dagger\dagger\dagger}$  89.21  96.00  88.76  78.71  88.54  89.44  88.44 ${}^{\dagger\dagger\dagger}$  
FW  ×  ${\mathrm{ACSD}}^{\mathrm{N}}$  69.29  60.77  49.95  31.53  35.41  58.86  50.97  64.58  58.17  46.12  32.47  30.33  59.67  48.56  
FW  /  ${\mathrm{ICSD}}^{\mathrm{N}}$  66.67  64.45  62.12  51.55  69.91  62.44  62.86  73.78  79.75  77.94  62.87  82.28  75.04  75.28${}^{\dagger}$  
FW  ×  ${\mathrm{ICSD}}^{\mathrm{N}}$  74.74  71.26  51.73  68.33  47.55  85.71  66.55  73.90  70.90  50.14  69.62  45.32  85.21  65.85  
FW  /  ${\mathrm{SD}}^{\mathrm{TI}}$  87.38  84.73  75.46  75.17  82.24  91.12  82.68 ${}^{\dagger\dagger\dagger}$  91.79  82.30  83.32  89.74  86.35  94.59  88.02 ${}^{\dagger\dagger\dagger}$  
FW  ×  ${\mathrm{SD}}^{\mathrm{TI}}$  83.94  32.91  49.40  34.95  81.66  76.06  59.82  76.63  43.42  53.48  40.57  48.74  87.20  58.34  
FW  /  ${\mathrm{ACSD}}^{\mathrm{TI}}$  87.75  86.25  74.57  67.88  69.33  91.01  79.47 ${}^{\dagger\dagger\dagger}$  92.86  92.60  83.14  88.76  84.92  95.17  89.58 ${}^{\dagger\dagger\dagger}$  
FW  ×  ${\mathrm{ACSD}}^{\mathrm{TI}}$  86.59  30.20  48.31  32.68  76.66  81.06  59.25  76.95  38.86  51.61  42.85  42.61  85.08  56.33  
FW  /  ${\mathrm{ICSD}}^{\mathrm{TI}}$  35.04  33.41  52.56  38.62  49.46  44.95  42.34  36.55  37.58  68.12  50.41  76.25  40.47  51.56  
FW  ×  ${\mathrm{ICSD}}^{\mathrm{TI}}$  79.05  67.52  53.56  85.25  43.65  86.12  69.19 ${}^{\dagger\dagger\dagger}$  77.57  62.16  48.29  85.50  51.84  83.64  68.17 ${}^{\dagger\dagger}$  
FW  /  ${\mathrm{SD}}^{\mathrm{NI}}$  89.45  86.65  75.24  82.47  82.97  83.66  83.41 ${}^{\dagger\dagger\dagger}$  85.23  92.39  67.76  97.59  85.18  93.16  86.88 ${}^{\dagger\dagger\dagger}$  
FW  ×  ${\mathrm{SD}}^{\mathrm{NI}}$  75.39  52.92  69.04  66.62  60.54  89.70  69.04  73.20  60.07  52.36  81.89  56.87  89.61  69.00  
FW  /  ${\mathrm{ACSD}}^{\mathrm{NI}}$  90.95  88.18  77.36  76.50  84.61  85.46  83.84 ${}^{\dagger\dagger\dagger}$  84.45  93.00  68.78  96.60  83.98  97.38  87.37 ${}^{\dagger\dagger\dagger}$  
FW  ×  ${\mathrm{ACSD}}^{\mathrm{NI}}$  74.80  49.19  65.39  49.29  51.33  88.99  63.17  70.83  51.91  57.47  67.54  46.55  87.41  63.62  
FW  /  ${\mathrm{ICSD}}^{\mathrm{NI}}$  52.87  67.61  58.01  54.51  73.05  65.46  61.92  55.64  53.72  49.68  76.26  70.14  74.82  63.38  
FW  ×  ${\mathrm{ICSD}}^{\mathrm{NI}}$  77.42  69.29  65.71  85.86  54.10  88.56  73.49 ${}^{\dagger\dagger}$  75.80  76.30  63.60  85.93  56.22  88.14  74.33 ${}^{\dagger\dagger}$ 
Method  Power of DW (p)  Total  

−1  −0.5  0  0.5  1  
Panel I: Centroid-based algorithm  
Panel A (Best):  
SD  5(3)  7(5)  5(2)  2(0)  1(0)  20(10) 
ACSD  8(3)  7(5)  4(2)  1(0)  0(0)  20(10) 
ICSD  0(0)  0(0)  11(6)  7(4)  2(0)  20(10) 
Panel B (Worst):  
SD  5(4)  4(2)  2(0)  3(1)  6(3)  20(10) 
ACSD  5(3)  3(2)  1(1)  2(1)  9(3)  20(10) 
ICSD  12(8)  5(2)  1(0)  1(0)  1(0)  20(10) 
Panel II: Seeded k-means algorithm  
Panel A (Best):  
SD  6(3)  7(4)  5(2)  1(1)  1(0)  20(10) 
ACSD  7(3)  6(4)  5(2)  2(1)  0(0)  20(10) 
ICSD  0(0)  0(0)  10(7)  7(3)  3(0)  20(10) 
Panel B (Worst):  
SD  3(0)  3(0)  2(1)  4(3)  8(6)  20(10) 
ACSD  3(0)  2(0)  1(1)  4(3)  10(6)  20(10) 
ICSD  11(6)  5(3)  3(1)  0(0)  1(0)  20(10) 
Panel III: Conventional k-means algorithm  
Panel A (Best):  
SD  6(4)  6(4)  5(2)  3(0)  0(0)  20(10) 
ACSD  7(3)  6(4)  5(3)  2(0)  0(0)  20(10) 
ICSD  0(0)  0(0)  0(0)  10(4)  10(6)  20(10) 
Panel B (Worst):  
SD  2(0)  3(1)  6(3)  6(3)  3(3)  20(10) 
ACSD  3(0)  2(0)  4(3)  5(3)  6(4)  20(10) 
ICSD  16(9)  4(1)  0(0)  0(0)  0(0)  20(10) 
Method  Power of DW  AM  DI  KB1  KB2  20N  TR  Avg.  Panel Ranking  

SD  ACSD  ICSD  I  II  III  
Panel I: Centroid-based algorithm  
SC1  −0.5  −1  0.5  91.15  95.99  84.85  95.80  84.93  96.10  91.47  1  1  18 
SC2  −1  −0.5  0.5  91.46  95.77  84.43  94.91  85.51  95.93  91.34  2  2  21 
SC3  −0.5  −1  0  91.04  95.38  84.48  95.45  86.20  95.23  91.30  3  3  45 
SC4  0  −0.5  0  91.23  94.54  84.08  95.58  86.76  95.19  91.23  4  4  50 
SC5  −0.5  0  0  91.68  94.38  83.23  96.03  86.31  95.63  91.21  5  5  54 
SC6  0  −1  0.5  91.93  94.47  86.35  93.13  83.24  93.91  90.51  6  11  12 
SC7  −1  −0.5  0  91.20  93.71  82.35  95.14  84.67  95.22  90.38  7  9  44 
SC8  −0.5  −0.5  0  87.18  94.20  80.32  98.53  86.47  95.23  90.32  8  7  72 
SC9  −0.5  −0.5  0.5  91.77  94.35  85.41  92.40  83.21  94.14  90.21  9  6  3 
SC10  −1  0  0  87.57  93.37  80.12  97.82  86.42  94.93  90.04  10  10  71 
BSC  0  0  0  91.25  90.78  81.82  93.32  83.06  93.74  89.00  16  12  65 
Panel II: Seeded k-means algorithm  
SK1  −0.5  −1  0.5  91.87  95.75  80.52  95.37  83.65  96.04  90.53  1  1  18 
SK2  −1  −0.5  0.5  91.79  93.88  79.39  92.68  83.98  95.11  89.47  2  2  21 
SK3  −0.5  −1  0  91.33  92.95  76.93  95.40  84.95  94.56  89.35  3  3  45 
SK4  0  −0.5  0  89.93  94.80  72.39  95.06  84.12  95.56  88.64  4  4  50 
SK5  −0.5  0  0  90.33  95.71  72.14  94.49  83.65  95.40  88.62  5  5  54 
SK6  −0.5  −0.5  0.5  91.51  90.21  78.03  91.68  80.21  93.88  87.59  9  6  3 
SK7  −0.5  −0.5  0  90.74  89.76  79.25  86.21  84.98  93.03  87.33  8  7  72 
SK8  0  −1  0  84.46  93.00  68.78  96.60  85.61  94.05  87.08  11  8  70 
SK9  −1  −0.5  0  90.92  91.30  76.05  91.76  78.23  93.86  87.02  7  9  44 
SK10  −1  0  0  85.23  92.39  67.76  97.59  85.18  93.16  86.88  10  10  72 
BSK  0  0  0  90.17  89.28  78.68  86.24  80.12  93.01  86.25  16  12  65 
Panel III: Conventional k-means algorithm  
UK1  −1  −0.5  1  80.25  85.08  66.91  86.63  71.81  89.57  80.04  21  19  1 
UK2  −0.5  −1  1  79.74  82.96  67.09  86.97  72.70  89.46  79.82  19  17  2 
UK3  −0.5  −0.5  0.5  80.65  83.49  64.10  72.88  76.12  90.74  78.00  9  6  3 
UK4  −1  −1  1  74.94  78.51  58.68  87.34  74.38  92.38  77.71  12  20  4 
UK5  −1  0  0.5  76.03  83.94  64.27  73.82  75.40  91.78  77.54  14  14  5 
UK6  0  −1  1  73.05  79.31  66.41  86.28  68.42  88.30  76.96  30  27  6 
UK7  0  −0.5  0.5  76.04  77.99  62.72  84.10  71.26  89.07  76.86  20  13  7 
UK8  −1  0  1  77.02  80.97  64.55  85.97  63.24  88.62  76.73  35  29  8 
UK9  −0.5  0  0.5  76.66  80.23  62.31  82.65  69.67  88.61  76.69  25  16  9 
UK10  −0.5  −0.5  1  74.70  79.49  66.30  85.96  65.16  88.36  76.66  32  21  10 
BUK  0  0  0  59.85  51.21  51.80  33.15  44.04  68.13  51.36  16  12  65 
Methods  Power of DW  User Dimension  Difference  

SD  ACSD  ICSD  Dim.1 no. class = 4  Dim.2 no. class = 5  Dim.1 − Dim.2  
Panel I: Distribution-based term weighting from KB1 (K = 4)  
UKKB11  −0.5  −1  1  67.09 (70.46, 63.88)  29.80 (34.08, 26.05)  37.29 (36.38, 37.83) 
UKKB12  −1  −0.5  1  66.91 (68.11, 65.72)  30.30 (34.15, 27.05)  36.61 (33.96, 38.67) 
UKKB13  0  −1  1  66.41 (70.35, 62.69)  29.95 (33.63, 26.67)  36.46 (36.72, 36.02) 
UKKB14  −0.5  −0.5  1  66.30 (69.00, 63.71)  31.14 (34.90, 27.78)  35.16 (34.10, 35.93) 
UKKB15  −1  0  1  64.55 (67.68, 61.56)  30.39 (34.15, 27.05)  34.16 (33.53, 34.51) 
Panel II: Distribution-based term weighting from KB2 (K = 5)  
UKKB21  −1  −1  1  35.29 (48.32, 25.78)  87.34 (90.73, 84.07)  52.05 (42.41, 58.29) 
UKKB22  −0.5  −1  1  31.72 (39.96, 25.18)  86.97 (90.18, 83.87)  55.25 (50.22, 58.69) 
UKKB23  −1  −0.5  1  30.10 (35.69, 25.38)  86.63 (89.93, 83.45)  56.53 (54.24, 58.07) 
UKKB24  0  −1  1  35.72 (46.66, 27.35)  86.28 (89.59, 83.09)  50.56 (42.93, 61.74) 
UKKB25  −1  0  1  33.55 (44.68, 25.19)  85.97 (89.19, 82.86)  52.42 (44.51, 57.67) 
