LLM-Enhanced Semantic Text Segmentation
Abstract
1. Introduction
2. LLM Embedding Models
3. Segmentation Algorithms
3.1. Basic Segmentation Algorithms (Magnetic Clustering)
- $m_i$ and $m_j$ have the same sign and $x_i$ and $x_j$ belong to the same cluster;
- $m_i$ and $m_j$ have different signs and $x_i$ and $x_j$ belong to different clusters;
- For every $k$ with $i < k < j$, $x_k$ belongs to the same cluster as $x_i$ or $x_j$.
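One plausible reading of the conditions above can be sketched in a few lines of Python, assuming the signed value attached to each adjacent sentence pair is the cosine similarity of their embeddings shifted by a threshold; the names (`magnetic_segments`, `threshold`) are illustrative, not the paper's implementation.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def magnetic_segments(embeddings, threshold=0.5):
    """Greedy sign-based segmentation sketch.

    Each adjacent pair gets a signed "attraction" value (similarity
    minus threshold); a new segment starts wherever the sign flips
    to negative, so same-sign runs end up in the same cluster.
    """
    boundaries = [0]  # index of the first sentence of each segment
    for i in range(1, len(embeddings)):
        attraction = cosine(embeddings[i - 1], embeddings[i]) - threshold
        if attraction < 0:  # repulsion => segment boundary before sentence i
            boundaries.append(i)
    # convert boundary indices to segment masses (segment lengths)
    ends = boundaries[1:] + [len(embeddings)]
    return [e - s for s, e in zip(boundaries, ends)]

# Demo on random "sentence embeddings" (near-orthogonal, so many boundaries)
rng = np.random.default_rng(0)
print(magnetic_segments(rng.normal(size=(5, 8)), threshold=0.2))
```

A single pass over adjacent pairs makes the cost linear in the number of sentences, which is what qualifies this family as "computationally efficient" in Section 7.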
3.2. Adaptation of Clustering Algorithms for Segmentation
3.3. Graph-Based Algorithms for Segmentation
4. Boundary Segmentation Metric
5. Description of Datasets
5.1. Benchmarking Datasets in Text Segmentation Field
5.2. Datasets Generated in This Study
6. Experiments
7. Optimizable Parameters and Result Overview
- Magnetic clustering resolves segmentation tasks well on the simpler datasets (Choi and Abstracts) but performs poorly on more challenging cases where coherent blocks depend on human perception.
- Magnetic clustering results on PhilPapersAI show systematic errors, which admit several interpretations.
- In the majority of cases, the nomic-embed-text model outperforms the others.
- Context size strongly affects most evaluations; LLM-based embeddings computed over two consecutive sentences generally yield higher scores (see the sketch after this list).
- Our hypothesis on the applicability of basic algorithms is partially confirmed: simple, computationally efficient algorithms can achieve results comparable to those of computationally expensive ones, such as graph-based methods.
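To illustrate the context-size setting, the sketch below embeds overlapping windows of two consecutive sentences through a locally served Ollama model. The `/api/embeddings` endpoint and the `nomic-embed-text` model name follow the public Ollama documentation; the helper names are ours, and the exact request shape should be checked against the installed version.

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/embeddings"  # default local Ollama endpoint

def embed(text, model="nomic-embed-text"):
    """Request one embedding vector from a local Ollama server."""
    resp = requests.post(OLLAMA_URL, json={"model": model, "prompt": text}, timeout=60)
    resp.raise_for_status()
    return resp.json()["embedding"]

def windowed_embeddings(sentences, context=2):
    """Embed overlapping windows of `context` consecutive sentences.

    With context=2, sentence i is represented by the joint embedding of
    sentences i and i+1, the setting that scored best in most of our runs.
    """
    return [embed(" ".join(sentences[i:i + context])) for i in range(len(sentences))]
```

The resulting vectors can be fed directly to any of the clustering algorithms of Section 3; only the windowing step changes between context sizes.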
8. Concluding Remarks
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Nguyen, D.Q. A survey of embedding models of entities and relationships for knowledge graph completion. In Proceedings of the Graph-Based Methods for Natural Language Processing (TextGraphs), Barcelona, Spain, 13 December 2020; Ustalov, D., Somasundaran, S., Panchenko, A., Malliaros, F.D., Hulpuș, I., Jansen, P., Jana, A., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 1–14. [Google Scholar] [CrossRef]
- Liu, Q.; Kusner, M.J.; Blunsom, P. A Survey on Contextual Embeddings. arXiv 2020, arXiv:2003.07278. [Google Scholar] [CrossRef]
- Ghinassi, I.; Wang, L.; Newell, C.; Purver, M. Recent Trends in Linear Text Segmentation: A Survey. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2024, Miami, FL, USA, 12–16 November 2024; Al-Onaizan, Y., Bansal, M., Chen, Y.N., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 3084–3095. [Google Scholar] [CrossRef]
- Toleu, A.; Tolegen, G.; Makazhanov, A. Character-based Deep Learning Models for Token and Sentence Segmentation. In Proceedings of the 5th International Conference on Turkic Languages Processing (TurkLang 2017), Kazan, Russia, 18–21 October 2017. [Google Scholar]
- Conneau, A.; Kiela, D.; Schwenk, H.; Barrault, L.; Bordes, A. Supervised Learning of Universal Sentence Representations from Natural Language Inference Data. arXiv 2018, arXiv:1705.02364. [Google Scholar] [CrossRef]
- Fournier, C. Evaluating Text Segmentation. Master’s Thesis, University of Ottawa, Ottawa, ON, Canada, 2013. [Google Scholar]
- Galley, M.; McKeown, K.R.; Fosler-Lussier, E.; Jing, H. Discourse Segmentation of Multi-Party Conversation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, Sapporo, Japan, 7–12 July 2003; pp. 562–569. [Google Scholar] [CrossRef]
- Choi, F.Y.Y. Advances in domain independent linear text segmentation. In Proceedings of the 1st Meeting of the North American Chapter of the Association for Computational Linguistics, Seattle, WA, USA, 29 April–4 May 2000. [Google Scholar]
- Solbiati, A.; Heffernan, K.; Damaskinos, G.; Poddar, S.; Modi, S.; Calì, J. Unsupervised Topic Segmentation of Meetings with BERT Embeddings. arXiv 2021, arXiv:2106.12978. [Google Scholar] [CrossRef]
- Koshorek, O.; Cohen, A.; Mor, N.; Rotman, M.; Berant, J. Text Segmentation as a Supervised Learning Task. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), New Orleans, LA, USA, 1–6 June 2018; Walker, M., Ji, H., Stent, A., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2018; pp. 469–473. [Google Scholar] [CrossRef]
- Glavaš, G.; Nanni, F.; Ponzetto, S.P. Unsupervised Text Segmentation Using Semantic Relatedness Graphs. In Proceedings of the Fifth Joint Conference on Lexical and Computational Semantics, Berlin, Germany, 11–12 August 2016; Gardent, C., Bernardi, R., Titov, I., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2016; pp. 125–130. [Google Scholar] [CrossRef]
- Fan, W.; Ding, Y.; Ning, L.; Wang, S.; Li, H.; Yin, D.; Chua, T.S.; Li, Q. A Survey on RAG Meeting LLMs: Towards Retrieval-Augmented Large Language Models. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD ’24, Barcelona, Spain, 25–29 August 2024; Association for Computing Machinery: New York, NY, USA, 2024; pp. 6491–6501. [Google Scholar] [CrossRef]
- Gao, Y.; Xiong, Y.; Gao, X.; Jia, K.; Pan, J.; Bi, Y.; Dai, Y.; Sun, J.; Wang, M.; Wang, H. Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv 2024, arXiv:2312.10997. [Google Scholar] [CrossRef]
- Leng, Q.; Portes, J.; Havens, S.; Zaharia, M.; Carbin, M. Long Context RAG Performance of Large Language Models. In Proceedings of the Adaptive Foundation Models: Evolving AI for Personalized and Efficient Learning, Vancouver, BC, Canada, 14 December 2024. [Google Scholar]
- Tao, C.; Shen, T.; Gao, S.; Zhang, J.; Li, Z.; Hua, K.; Hu, W.; Tao, Z.; Ma, S. LLMs are Also Effective Embedding Models: An In-depth Overview. arXiv 2025, arXiv:2412.12591. [Google Scholar]
- Oro, E.; Granata, F.M.; Ruffolo, M. A Comprehensive Evaluation of Embedding Models and LLMs for IR and QA Across English and Italian. Big Data Cogn. Comput. 2025, 9, 141. [Google Scholar] [CrossRef]
- Ollama developers. Ollama Project. Available online: https://ollama.com/search?c=embedding (accessed on 1 August 2025).
- Reimers, N.; Gurevych, I. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 3982–3992. [Google Scholar]
- Nussbaum, Z.; Morris, J.X.; Duderstadt, B.; Mulyar, A. Nomic Embed: Training a Reproducible Long Context Text Embedder. arXiv 2024, arXiv:2402.01613. [Google Scholar] [CrossRef]
- Li, X.; Li, J. AnglE-optimized Text Embeddings. arXiv 2023, arXiv:2309.12871. [Google Scholar]
- Lee, S.; Shakir, A.; Koenig, D.; Lipp, J. Open Source Strikes Bread - New Fluffy Embeddings Model. 2024. Available online: https://www.mixedbread.com/blog/mxbai-embed-large-v1 (accessed on 7 October 2025).
- Chen, J.; Xiao, S.; Zhang, P.; Luo, K.; Lian, D.; Liu, Z. BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation. arXiv 2024, arXiv:2402.03216. [Google Scholar]
- Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830. [Google Scholar]
- Yu, S.X.; Shi, J. Multiclass Spectral Clustering. In Proceedings of the 9th IEEE International Conference on Computer Vision (ICCV 2003), Nice, France, 14–17 October 2003; IEEE Computer Society: Los Alamitos, CA, USA, 2003; pp. 313–319. [Google Scholar] [CrossRef]
- Monath, N.; Dubey, K.A.; Guruganesh, G.; Zaheer, M.; Ahmed, A.; McCallum, A.; Mergen, G.; Najork, M.; Terzihan, M.; Tjanaka, B.; et al. Scalable Hierarchical Agglomerative Clustering. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD ’21, Singapore, 14–18 August 2021; Association for Computing Machinery: New York, NY, USA, 2021; pp. 1245–1255. [Google Scholar] [CrossRef]
- Furtlehner, C.; Sebag, M.; Zhang, X. Scaling analysis of affinity propagation. Phys. Rev. E 2010, 81, 066102. [Google Scholar] [CrossRef]
- Hartigan, J.A.; Wong, M.A. A k-means clustering algorithm. Appl. Stat. 1979, 28, 100–108. [Google Scholar] [CrossRef]
- Mussabayev, R.; Mladenovic, N.; Jarboui, B.; Mussabayev, R. How to Use K-means for Big Data Clustering? Pattern Recognit. 2023, 137, 109269. [Google Scholar] [CrossRef]
- Akiba, T.; Sano, S.; Yanase, T.; Ohta, T.; Koyama, M. Optuna: A Next-Generation Hyperparameter Optimization Framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Anchorage, AK, USA, 4–8 August 2019; pp. 2623–2631. [Google Scholar]
- Fournier, C. Evaluating Text Segmentation using Boundary Edit Distance. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Sofia, Bulgaria, 4–9 August 2013; Schuetze, H., Fung, P., Poesio, M., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2013; pp. 1702–1712. [Google Scholar]
- Beeferman, D.; Berger, A.; Lafferty, J. Statistical Models for Text Segmentation. Mach. Learn. 1999, 34, 177–210. [Google Scholar] [CrossRef]
- Pevzner, L.; Hearst, M.A. A Critique and Improvement of an Evaluation Metric for Text Segmentation. Comput. Linguist. 2002, 28, 19–36. [Google Scholar] [CrossRef]
- Morris, J.; Hirst, G. Lexical cohesion computed by thesaural relations as an indicator of the structure of text. Comput. Linguist. 1991, 17, 21–48. [Google Scholar]
- Chen, H.; Branavan, S.; Barzilay, R.; Karger, D.R. Global Models of Document Structure using Latent Permutations. In Human Language Technologies: Proceedings of the 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Boulder, CO, USA, 31 May–5 June 2009; Ostendorf, M., Collins, M., Narayanan, S., Oard, D.W., Vanderwende, L., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2009; pp. 371–379. [Google Scholar]
- Hearst, M.A. TextTiling: Segmenting Text into Multi-paragraph Subtopic Passages. Comput. Linguist. 1997, 23, 33–64. [Google Scholar]
- Budge, I.; Klingemann, H.D.; Volkens, A.; Bara, J.; Tanenbaum, E. Mapping Policy Preferences. Estimates for Parties, Electors, and Governments 1945–1998; Oxford University Press: Oxford, UK, 2001. [Google Scholar]
- Lehmann, P.; Franzmann, S.; Al-Gaddooa, D.; Burst, T.; Ivanusch, C.; Regel, S.; Riethmüller, F.; Volkens, A.; Weßels, B.; Zehnter, L. The Manifesto Data Collection. Manifesto Project (MRG/CMP/MARPOR). Version 2024a. 2024. Available online: https://manifesto-project.wzb.eu/datasets/MPDS2024a (accessed on 7 October 2025).
- Riedl, M.; Biemann, C. TopicTiling: A Text Segmentation Algorithm based on LDA. In Proceedings of the ACL 2012 Student Research Workshop, Jeju Island, Republic of Korea, 8–14 July 2012; Cheung, J.C.K., Hatori, J., Henriquez, C., Irvine, A., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2012; pp. 37–42. [Google Scholar]
- Beeferman, D.; Berger, A.; Lafferty, J. Text Segmentation Using Exponential Models. In Proceedings of the Second Conference on Empirical Methods in Natural Language Processing, Providence, RI, USA, 1–2 August 1997. [Google Scholar]
- Ostendorff, M. Dataset philpapers-2023-10-28. 2023. Available online: https://huggingface.co/datasets/malteos/philpapers-2023-10-28/tree/main/data (accessed on 1 August 2025).
- Ostendorff, M.; Rehm, G. Efficient Language Model Training through Cross-Lingual and Progressive Transfer Learning. arXiv 2023, arXiv:2301.09626. [Google Scholar] [CrossRef]
- Krassovitskiy, A. Six Small Datasets for Text Segmentation, 2025. Available online: https://data.mendeley.com/datasets/cj22rpfdbb/1 (accessed on 7 October 2025).
Embedding models used in the experiments:

| ID | Name | Size | Ref. | Notes |
|---|---|---|---|---|
| a | all-minilm | 23 M | [9,18] | An embedding model trained on very large sentence-level datasets. |
| b | nomic-embed-text | 137 M | [19] | A Nomic AI model that generates high-quality dense vector representations optimized for semantic search, clustering, and retrieval; a high-performing open embedding model with a large token context window. |
| c | mxbai-embed-large | 334 M | [20,21] | A state-of-the-art large embedding model from mixedbread.ai. |
| d | bge-m3 | 1.2 G | [22] | A very large embedding model from BAAI, distinguished by its versatility across multiple functions, languages, and granularities. |
Dataset statistics:

| | Choi | Manifesto | Wiki-1024 | Abstracts | SMan | PhilPapersAI |
|---|---|---|---|---|---|---|
| Documents | 922 | 6 | 1024 | 300 | 300 | 336 |
| Real-world | ✗ | ✓ | ✓ | ✗ | ✗ | ✓ |
| Topic variety per document | High | Average | Low | Low | Low | Average |
| Topic variety per dataset | High | Low | High | Low | High | High |
| Segment length | 7.36 ± 2.98 | 3.08 ± 5.42 | 28.47 ± 33.02 | 8.09 ± 2.92 | 10.70 ± 10.66 | 5.73 ± 6.19 |
| Segments per document | 9.98 ± 0.13 | 478.50 ± 250.07 | 7.33 ± 2.61 | 24.83 ± 8.57 | 30.00 ± 0.00 | 7.85 ± 2.94 |
| Document length | 73.44 ± 21.36 | 1474.33 ± 390.63 | 208.55 ± 105.05 | 200.96 ± 71.37 | 321.09 ± 56.57 | 44.92 ± 30.64 |
| Boundary Similarity (higher is better) | Choi | Manifesto | Wiki-1024 | Abstracts | SMan | PhilPapersAI |
|---|---|---|---|---|---|---|
| Magnetic | 0.72 | 0.14 | 0.14 | 0.73 | 0.10 | 0.05 |
| Spectral | 0.71 | 0.36 | 0.08 | 0.70 | 0.25 | 0.37 |
| Agglomerative | 0.77 | 0.35 | 0.12 | 0.79 | 0.20 | 0.32 |
| Affinity | 0.61 | 0.27 | 0.06 | 0.49 | 0.30 | 0.22 |
| KMeans++ | 0.65 | 0.33 | 0.11 | 0.68 | 0.34 | 0.28 |
| GraphSeg [11] | 0.49 | – 1 | 0.08 | 0.71 | – 1 | 0.33 |
| GraphSegSM | 0.68 | 0.33 | 0.13 | 0.73 | 0.54 | 0.38 |
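For reproducibility, boundary similarity can be computed with Fournier's segeval package; the snippet below is a minimal sketch assuming segeval's mass-tuple interface (segment lengths in sentences), not the exact evaluation code used here.

```python
# pip install segeval   (Fournier's reference implementation)
import segeval

# Segmentations are given as segment masses: lengths in sentences.
reference = (5, 3, 7)    # gold standard: segments of 5, 3, and 7 sentences
hypothesis = (5, 4, 6)   # predicted segmentation of the same 15 sentences

score = segeval.boundary_similarity(hypothesis, reference)
print(float(score))      # 1.0 = boundaries match exactly, 0.0 = no match
```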
| Pk [31] (lower is better) | Choi | Manifesto | Wiki-1024 | Abstracts | SMan | PhilPapersAI |
|---|---|---|---|---|---|---|
| Magnetic | 0.14 | 0.50 | 0.40 | 0.13 | 0.49 | 0.76 |
| Spectral | 0.10 | 0.44 | 0.46 | 0.13 | 0.40 | 0.40 |
| Agglomerative | 0.08 | 0.44 | 0.44 | 0.09 | 0.45 | 0.43 |
| Affinity | 0.14 | 0.46 | 0.58 | 0.24 | 0.34 | 0.56 |
| KMeans++ | 0.18 | 0.44 | 0.41 | 0.14 | 0.34 | 0.43 |
| GraphSeg | 0.32 | – | 0.61 | 0.13 | – | 0.45 |
| GraphSegSM | 0.18 | 0.45 | 0.58 | 0.13 | 0.29 | 0.42 |
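Pk [31] is the probability that two units k sentences apart are inconsistently classified as belonging to the same or to different segments. A minimal from-scratch sketch of the standard definition, with k set to half the mean reference segment length:

```python
def pk(reference, hypothesis, k=None):
    """Pk metric of Beeferman et al. (lower is better).

    `reference` and `hypothesis` are segment masses, e.g. (5, 3, 7).
    """
    def labels(masses):
        # expand segment masses into one segment label per unit
        out = []
        for seg_id, m in enumerate(masses):
            out.extend([seg_id] * m)
        return out

    ref, hyp = labels(reference), labels(hypothesis)
    assert len(ref) == len(hyp), "segmentations must cover the same units"
    if k is None:  # conventional choice: half the mean reference segment size
        k = max(1, round(len(ref) / len(reference) / 2))

    errors, total = 0, len(ref) - k
    for i in range(total):
        same_ref = ref[i] == ref[i + k]
        same_hyp = hyp[i] == hyp[i + k]
        errors += same_ref != same_hyp  # inconsistent same/different judgment
    return errors / total

print(pk((5, 3, 7), (5, 4, 6)))  # small penalty for the shifted boundary
```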
Best-performing configuration per algorithm and dataset, encoded as "context size-model": the digit is the number of consecutive sentences embedded together, and the letter refers to embedding models a-d in the table above:

| | Choi | Manifesto | Wiki-1024 | Abstracts | SMan | PhilPapersAI |
|---|---|---|---|---|---|---|
| Magnetic | 2-c | 2-a | 3-b | 2-b | 2-b | 3-a |
| Spectral | 1-c | 2-d | 2-b | 2-b | 2-b | 3-d |
| Agglomerative | 1-c | 1-b | 2-b | 2-b | 1-b | 3-b |
| Affinity | 1-b | 1-c | 1-b | 3-b | 3-d | 1-d |
| KMeans++ | 2-c | 3-d | 3-b | 2-b | 3-b | 3-d |