# Revisiting Sequential Information Bottleneck: New Implementation and Evaluation

^{*}

## Abstract

**:**

## 1. Introduction

## 2. Algorithm Overview

#### 2.1. Theoretical Foundation

- X—the list of vectors to cluster;
- Y—the vocabulary used for representing the texts;
- $p(X,Y)$—the estimated joint distribution between X and Y;
- K—the number of clusters to produce;
- T—a partition of X into K clusters.

“Intuitively, in this procedure the information contained in X about Y is ‘squeezed’ through a compact ‘bottleneck’ of clusters T, that is forced to represent the ‘relevant’ part in X with respect to Y”(p. 2)

#### 2.2. Divergence Function

#### 2.3. Pseudo-Code

Algorithm 1 Algorithm pseudo-code. |

Input: $\left|X\right|$ objects to be clustered Parameters: $K,\phantom{\rule{0.277778em}{0ex}}n,\phantom{\rule{0.277778em}{0ex}}maxL,\phantom{\rule{0.277778em}{0ex}}\u03f5$ Output: A partition T of X into K clusters Main Loop: For $i=1,\dots ,n$ ${T}_{i}\leftarrow $ random partition of X. $c\leftarrow 0,\phantom{\rule{4pt}{0ex}}C\leftarrow 0,\phantom{\rule{4pt}{0ex}}done=FALSE$ While not $done$ For x in shuffle(X) draw x out of $t\left(x\right)$ ${t}^{new}\left(x\right)=arg{min}_{{t}^{\prime}}{d}_{F}(x,{t}^{\prime})$ If ${t}^{new}\left(x\right)\ne t\left(x\right)$ then $c\leftarrow c+1$ Merge x into ${t}^{new}\left(x\right)$ $C\leftarrow C+1$ if $C\ge maxL$ or $c\le \u03f5\xb7\left|X\right|$ then $done\leftarrow TRUE$ $T\leftarrow arg{max}_{{T}_{i}}f\left({T}_{i}\right)$ |

#### 2.4. Sequential Clustering Algorithm

#### 2.5. Vector Representation

#### 2.6. Focus of This Work

## 3. Methods

#### 3.1. Computation of ${t}^{new}\left(x\right)$ and Associated Intuition

- $\widehat{x}$ = $p\left(y\right|x)$—the TF vector representing the sample x normalized by the ${L}^{1}$-norm. Let $u\in {\mathbb{R}}^{n}$, the ${L}^{1}$-norm ${\left|u\right|}_{1}$ of u is defined by: ${\left|u\right|}_{1}={\sum}_{i=1}^{n}\left|{u}_{i}\right|$;
- $\widehat{t}$ = $p\left(y\right|t)$—the vector representing the centroid of cluster t, normalized by ${L}^{1}$;
- m = ${\pi}_{1}\xb7\widehat{x}+{\pi}_{2}\xb7\widehat{t}$—the average of $\widehat{x}$ and $\widehat{t}$ weighted by ${\pi}_{1}$ and ${\pi}_{2}$, respectively.

#### 3.2. Optimization for Sparse Vector Representation

#### 3.3. Caching $Log$ Computations

#### 3.4. Implementation

## 4. Experimental Setup

#### 4.1. Materials

- BBC News [34] consists of 2225 articles from the BBC news website. The articles are from 2004–2005 and cover stories in five topical areas: business, entertainment, politics, sport, and tech.
- 20 News Groups [23] consists of $\mathrm{18,846}$ emails sent through 20 news groups. The topics are diverse and cover tech, religion, and politics, among others.
- AG News [35] consists of $\mathrm{127,600}$ pairs of titles and snippets of news articles from the AG corpus, covering four topical areas: World, Sports, Business, and Sci/Tech. The title and snippet of each article are concatenated when the data is clustered.
- DBPedia [35] consists of $\mathrm{630,000}$ pairs of titles and abstracts of documents from 14 non-overlapping ontology classes such as Artist, Film, and Company. The title and abstract are concatenated when the data are clustered.
- Yahoo! Answers [35] consists of $\mathrm{1,460,000}$ triplets of question title, question content, and best answer from the Yahoo! Answers Comprehensive Questions and Answers version $1.0$ dataset. The data covers the 10 largest topical categories, such as Society & Culture, Computers & Internet, and Health. The question title, content, and best answer are concatenated when the data are clustered.

#### 4.2. Clustering Metrics

#### 4.3. Clustering Setups

#### 4.4. Robustness Considerations

#### 4.5. Hardware

#### 4.6. Code

## 5. Results

#### Discussion

## 6. Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Data Availability Statement

## Conflicts of Interest

## Appendix A. Package Distribution

## Appendix B. Illustrations of Micro-F1, Macro-F1, V-Measure and Total Run-Time

**Figure A1.**Micro-F1 (

**top-left**), Macro-F1 (

**top-right**) V-Measure (

**bottom-left**) scores, and Total Run-Time (

**bottom-right**) measurements for the clustering setups over all benchmark datasets. The Total Run-Time chart is presented on a $lo{g}_{2}$ scale for the y-axis. The scores and measurements are means of 10 samples per metric. Error bars indicate the 95% confidence interval obtained by bias-corrected and accelerated bootstrapping.

## Appendix C. Comparing sIB and K-Means Parallelism Models

## References

- Aggarwal, C.C.; Zhai, C. A Survey of Text Clustering Algorithms. In Mining Text Data; Aggarwal, C.C., Zhai, C., Eds.; Springer US: Boston, MA, USA, 2012; pp. 77–128. [Google Scholar] [CrossRef]
- Huang, A. Similarity measures for text document clustering. In Proceedings of the Sixth New Zealand Computer Science Research Student Conference (NZCSRSC2008), Christchurch, New Zealand, 6–9 April 2008; Volume 4, pp. 9–56. [Google Scholar]
- Abualigah, L.M.Q. Feature Selection and Enhanced Krill Herd Algorithm for Text Document Clustering, 1st ed.; Springer Publishing Company, Incorporated: Berlin/Heidelberg, Germany, 2018. [Google Scholar]
- Roy, S.; Muni, D.P.; Tack Yan, J.J.Y.; Budhiraja, N.; Ceiler, F. Clustering and Labeling IT Maintenance Tickets. In Proceedings of the Service-Oriented Computing, Banff, AB, Canada, 10–13 October 2016; Sheng, Q.Z., Stroulia, E., Tata, S., Bhiri, S., Eds.; pp. 829–845. [Google Scholar]
- Compton, J.E.; Adams, M.C. Clustering Support Tickets with Natural Language Processing: K-Means Applied to Transformer Embeddings; Sandia National Lab. (SNL-NM): Albuquerque, NM, USA, 2020. [Google Scholar]
- Poomagal, S.; Visalakshi, P.; Hamsapriya, T. A novel method for clustering tweets in Twitter. Int. J. Web Based Commun.
**2015**, 11, 170–187. [Google Scholar] [CrossRef] - Rosa, K.D.; Shah, R.; Lin, B.; Gershman, A.; Frederking, R.E. Topical Clustering of Tweets. In Proceedings of the ACM SIGIR: SWSM, Beijing, China, 28 July 2011. [Google Scholar]
- Curiskis, S.A.; Drake, B.; Osborn, T.R.; Kennedy, P.J. An evaluation of document clustering and topic modelling in two online social networks: Twitter and Reddit. Inf. Process. Manag.
**2020**, 57, 102034. [Google Scholar] [CrossRef] - Lloyd, S.P. Least Squares Quantization in PCM. IEEE Trans. Inf. Theor.
**1982**, 28, 129–137. [Google Scholar] [CrossRef] - Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res.
**2011**, 12, 2825–2830. [Google Scholar] - Salton, G.; McGill, M.J. Introduction to Modern Information Retrieval; McGraw-Hill, Inc.: New York, NY, USA, 1986. [Google Scholar]
- Bellman, R. Dynamic programming. Science
**1966**, 153, 34–37. [Google Scholar] [CrossRef] [PubMed] - Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient Estimation of Word Representations in Vector Space. arXiv
**2013**, arXiv:1301.3781. [Google Scholar] [CrossRef] - Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.; Dean, J. Distributed Representations of Words and Phrases and their Compositionality. Adv. Neural Inf. Process. Syst.
**2013**, 26, 1421. [Google Scholar] [CrossRef] - Pennington, J.; Socher, R.; Manning, C.D. GloVe: Global Vectors for Word Representation. In Proceedings of the Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1532–1543. [Google Scholar]
- Bengio, Y.; Ducharme, R.; Vincent, P.; Janvin, C. A Neural Probabilistic Language Model. J. Mach. Learn. Res.
**2003**, 3, 1137–1155. [Google Scholar] - Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. CoRR
**2017**, 30, 3058. [Google Scholar] - Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186. [Google Scholar] [CrossRef]
- Wang, A.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; Bowman, S.R. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. arXiv
**2018**, arXiv:1804.07461. [Google Scholar] [CrossRef] - Wang, A.; Pruksachatkun, Y.; Nangia, N.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; Bowman, S.R. SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems. Adv. Neural Inf. Process. Syst.
**2019**, 32, 1828. [Google Scholar] [CrossRef] - Reimers, N.; Gurevych, I. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. arXiv
**2019**, arXiv:1908.10084. [Google Scholar] [CrossRef] - Slonim, N.; Friedman, N.; Tishby, N. Unsupervised Document Classification Using Sequential Information Maximization. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR’02, Tampere, Finland, 11–15 August 2002; pp. 129–136. [Google Scholar] [CrossRef]
- Lang, K. NewsWeeder: Learning to Filter Netnews. In Proceedings of the 12th International Machine Learning Conference (ML95), Tahoe City, CA, USA, 9–12 July 1995. [Google Scholar]
- Connor, R.C.H.; Cardillo, F.A.; Moss, R.; Rabitti, F. Evaluation of Jensen-Shannon Distance over Sparse Data. In Proceedings of the SISAP, A Coruna, Spain, 2–4 October 2013. [Google Scholar]
- Tishby, N.; Pereira, F.C.; Bialek, W. The information bottleneck method. arXiv
**2000**, arXiv:physics/0004057. [Google Scholar] [CrossRef] - Slonim, N. The Information Bottleneck: Theory and Applications. Ph.D. Thesis, Hebrew University of Jerusalem, Jerusalem, Israel, 2002. [Google Scholar]
- Cover, T.M.; Thomas, J.A. Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing); Wiley-Interscience: Hoboken, NJ, USA, 2006. [Google Scholar]
- Slonim, N.; Tishby, N. Agglomerative Information Bottleneck. In Proceedings of the Advances in Neural Information Processing Systems, Denver, CO, USA, 29 November–4 December 1999; Solla, S., Leen, T., Müller, K., Eds.; Volume 12. [Google Scholar]
- Zhang, J.A.; Kurkoski, B.M. Low-complexity quantization of discrete memoryless channels. In Proceedings of the 2016 International Symposium on Information Theory and Its Applications (ISITA), Monterey, CA, USA, 3 October–2 November 2016; pp. 448–452. [Google Scholar]
- Chou, P. Optimal partitioning for classification and regression trees. IEEE Trans. Pattern Anal. Mach. Intell.
**1991**, 13, 340–354. [Google Scholar] [CrossRef] - Banerjee, A.; Merugu, S.; Dhillon, I.S.; Ghosh, J. Clustering with Bregman Divergences. J. Mach. Learn. Res.
**2005**, 6, 1705–1749. [Google Scholar] - Kurkoski, B.M. On the relationship between the KL means algorithm and the information bottleneck method. In Proceedings of the SCC 2017, 11th International ITG Conference on Systems, Communications and Coding, Hamburg, Germany, 6–9 February 2017; pp. 1–6. [Google Scholar]
- Slonim, N.; Aharoni, E.; Crammer, K. Hartigan’s K-Means versus Lloyd’s K-Means: Is It Time for a Change? In Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, IJCAI’13, Beijing, China, 3–9 August 2013; pp. 1677–1684. [Google Scholar]
- Greene, D.; Cunningham, P. Practical Solutions to the Problem of Diagonal Dominance in Kernel Document Clustering. In Proceedings of the 23rd International Conference on Machine Learning (ICML’06), New York, NY, USA, 25–29 June 2006; pp. 377–384. [Google Scholar]
- Zhang, X.; Zhao, J.; LeCun, Y. Character-level Convolutional Networks for Text Classification. Adv. Neural Inf. Process. Syst.
**2015**, 28, 456. [Google Scholar] [CrossRef] - Vinh, N.X.; Epps, J.; Bailey, J. Information Theoretic Measures for Clusterings Comparison: Variants, Properties, Normalization and Correction for Chance. J. Mach. Learn. Res.
**2010**, 11, 2837–2854. [Google Scholar] - Meilă, M. Comparing clusterings—An information based distance. J. Multivar. Anal.
**2007**, 98, 873–895. [Google Scholar] [CrossRef] - Rosenberg, A.; Hirschberg, J. V-Measure: A Conditional Entropy-Based External Cluster Evaluation Measure. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), Prague, Czech Republic, 28–30 June 2007; pp. 410–420. [Google Scholar]
- Arthur, D.; Vassilvitskii, S. K-Means++: The Advantages of Careful Seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, New Orleans, Louisiana, 7–9 January 2007; pp. 1027–1035. [Google Scholar]
- Spencer, N. Essentials of Multivariate Data Analysis; Taylor & Francis: Abingdon, UK, 2013; p. 95. [Google Scholar]

**Figure 1.**Adjusted Mutual-Information (

**left**) and Adjusted Rand-Index (

**right**) scores for the clustering setups over all benchmark datasets. The scores are means of 10 samples per metric. Error bars indicate the 95% confidence interval obtained by bias-corrected and accelerated bootstrapping.

**Table 1.**Benchmark datasets for evaluation. The column #Texts indicates the number of texts in the dataset. The column #Words shows the average text length in terms of word count in the dataset, and #Classes shows the number of classes in the dataset.

Dataset | #Texts | #Words | #Classes |
---|---|---|---|

BBC News | 2225 | 390 | 5 |

20 News Groups | 18,846 | 284 | 20 |

AG NEWS | 127,600 | 38 | 4 |

DBPedia | 630,000 | 46 | 14 |

Yahoo! Answers | 1,460,000 | 92 | 10 |

**Table 2.**Assessment of clustering quality using the metrics: AMI, ARI, V-Measure (VM), Micro-F1 (Mic-F1) and Macro-F1 (Mac-F1), and of clustering speed based on measurements of the vectorization time (Vector), clustering time (Cluster), and their sum (Total).

Dataset | Algorithm | Embed | AMI | ARI | VM | Mic-F1 | Mac-F1 | Vector | Cluster | Total |
---|---|---|---|---|---|---|---|---|---|---|

20 News Groups | K-Means | TF | 0.01 | 0.00 | 0.01 | 0.06 | 0.01 | 00:03 | 00:07 | 00:10 |

K-Means | TF/IDF | 0.36 | 0.14 | 0.36 | 0.35 | 0.32 | 00:03 | 00:10 | 00:14 | |

K-Means | GloVe | 0.36 | 0.17 | 0.36 | 0.34 | 0.31 | 00:19 | 00:09 | 00:28 | |

K-Means | S-Bert | 0.59 | 0.44 | 0.59 | 0.61 | 0.58 | 22:27 | 00:08 | 22:35 | |

sIB | TF | 0.65 | 0.53 | 0.65 | 0.66 | 0.61 | 00:03 | 00:11 | 00:14 | |

AG NEWS | K-Means | TF | 0.03 | 0.00 | 0.03 | 0.29 | 0.20 | 00:03 | 00:02 | 00:05 |

K-Means | TF/IDF | 0.04 | 0.01 | 0.04 | 0.31 | 0.24 | 00:03 | 00:03 | 00:06 | |

K-Means | GloVe | 0.53 | 0.55 | 0.53 | 0.80 | 0.80 | 00:19 | 00:07 | 00:26 | |

K-Means | S-Bert | 0.60 | 0.63 | 0.60 | 0.84 | 0.84 | 38:18 | 00:11 | 38:29 | |

sIB | TF | 0.66 | 0.70 | 0.66 | 0.87 | 0.87 | 00:03 | 00:03 | 00:06 | |

BBC News | K-Means | TF | 0.24 | 0.11 | 0.24 | 0.41 | 0.32 | 00:01 | 00:00 | 00:01 |

K-Means | TF/IDF | 0.70 | 0.62 | 0.70 | 0.83 | 0.83 | 00:00 | 00:00 | 00:01 | |

K-Means | GloVe | 0.75 | 0.76 | 0.75 | 0.90 | 0.90 | 00:05 | 00:00 | 00:06 | |

K-Means | S-Bert | 0.87 | 0.90 | 0.87 | 0.96 | 0.96 | 02:55 | 00:00 | 02:56 | |

sIB | TF | 0.88 | 0.90 | 0.88 | 0.96 | 0.96 | 00:01 | 00:01 | 00:01 | |

DBPedia | K-Means | TF | 0.56 | 0.21 | 0.56 | 0.50 | 0.47 | 00:20 | 00:43 | 01:04 |

K-Means | TF/IDF | 0.61 | 0.24 | 0.61 | 0.56 | 0.55 | 00:20 | 00:45 | 01:06 | |

K-Means | GloVe | 0.73 | 0.63 | 0.73 | 0.76 | 0.72 | 01:28 | 02:08 | 03:37 | |

K-Means | S-Bert | 0.79 | 0.71 | 0.79 | 0.82 | 0.79 | 03:38:31 | 02:00 | 03:40:31 | |

sIB | TF | 0.79 | 0.68 | 0.79 | 0.78 | 0.74 | 00:20 | 00:44 | 01:05 | |

Yahoo! Answers | K-Means | TF | 0.03 | 0.01 | 0.03 | 0.15 | 0.08 | 01:15 | 04:22 | 05:37 |

K-Means | TF/IDF | 0.16 | 0.05 | 0.16 | 0.29 | 0.25 | 01:16 | 03:44 | 05:01 | |

K-Means | GloVe | 0.32 | 0.23 | 0.32 | 0.49 | 0.44 | 06:21 | 06:42 | 13:03 | |

K-Means | S-Bert | 0.41 | 0.33 | 0.41 | 0.59 | 0.56 | 16:20:10 | 06:18 | 16:26:28 | |

sIB | TF | 0.39 | 0.32 | 0.39 | 0.57 | 0.54 | 01:15 | 02:20 | 03:35 |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations. |

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Toledo, A.; Venezian, E.; Slonim, N.
Revisiting Sequential Information Bottleneck: New Implementation and Evaluation. *Entropy* **2022**, *24*, 1132.
https://doi.org/10.3390/e24081132

**AMA Style**

Toledo A, Venezian E, Slonim N.
Revisiting Sequential Information Bottleneck: New Implementation and Evaluation. *Entropy*. 2022; 24(8):1132.
https://doi.org/10.3390/e24081132

**Chicago/Turabian Style**

Toledo, Assaf, Elad Venezian, and Noam Slonim.
2022. "Revisiting Sequential Information Bottleneck: New Implementation and Evaluation" *Entropy* 24, no. 8: 1132.
https://doi.org/10.3390/e24081132