An Evaluative Baseline for Sentence-Level Semantic Division
Abstract
1. Introduction
2. Related Work
3. Sentence-Level Semantic Division Baseline
3.1. General Idea
3.2. Survey Preparation
- (a) Topics must be distinguishable from each other; pairs with highly related or inclusive relationships (e.g., theft and burglary) must be removed or fused.
- (b) A topic cannot be merely a word in M&C, which would weaken the associative power of the grid. For example, car cannot be a topic because it is a word in M&C, but car accident can be a topic in the grid.
- (c) To be valid, the semantics of a candidate topic must be contained in at least two sentences.
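Criterion (c) lends itself to a mechanical check. The sketch below is our own illustration, not code from the paper (the function name `valid_topics` and the input format, a list of per-sentence topic labels, are assumptions): it keeps only candidate topics carried by at least two sentences.

```python
from collections import Counter

def valid_topics(sentence_topics, min_support=2):
    """Keep candidate topics that appear in at least `min_support` sentences.

    sentence_topics: list of per-sentence topic-label lists (hypothetical format).
    """
    # Count each topic at most once per sentence via set().
    counts = Counter(t for topics in sentence_topics for t in set(topics))
    return {t for t, c in counts.items() if c >= min_support}
```

For example, a topic labeled on only one sentence would be dropped as a candidate, while topics supported by two or more sentences survive.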
3.3. Survey Process
- (a) Select only the single most relevant topic for each sentence.
- (b) Avoid excessive inference.
- (c) If a sentence contained multiple components with semantic differences, the topic of its most important part was selected.
- (d) When filling out the questionnaire, annotators did not need to consider the impact of differently worded statements.
- (e) If no option matches the requirements, choose option F, which means that none of the candidate topics matches.
4. Evaluation
4.1. Screening Feedback
4.2. Generating SSDB-100
4.3. Data Division
5. Conclusions and Future Work
- (a) The 100 grid topics have different semantic granularities and can be distinguished from each other, which facilitates expanding the corpus within a single grid.
- (b) The corpus covers the vocabulary of the lexical–semantic evaluation benchmark M&C, which humans divided into the best-matching semantic topic grids according to semantic information.
- (c) The dataset is scalable, and we open-sourced it so that scholars can expand the thematic grid and corpus based on it.
- (d) From the semantic grids in SSDB-100, the corresponding SDR encodings can be obtained via SFT, which facilitates future encoding studies.
- (e) SSDB-100 is applicable not only to semantic folding and theoretical calculations at sentence granularity, but also to text clustering tasks.
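In semantic folding, the similarity of two texts is typically measured by the overlap of the active bits of their SDRs. The minimal sketch below is an illustration of that comparison only, not the paper's pipeline; the function names are ours, and SDRs are assumed to be represented as Python sets of active-bit indices.

```python
def sdr_overlap(a, b):
    """Raw overlap: number of active bits shared by two SDRs (sets of indices)."""
    return len(a & b)

def sdr_similarity(a, b):
    """Overlap normalized by the smaller SDR, giving a score in [0, 1]."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))
```

Because SDRs are sparse, a high overlap between two encodings is unlikely to occur by chance, which is what makes overlap a usable similarity signal.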
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Fitch, W.T. Unity and diversity in human language. Philos. Trans. R. Soc. Lond. B Biol. Sci. 2011, 366, 376–388. [Google Scholar] [CrossRef] [PubMed]
- Bhatnagar, S.C.; Mandybur, G.T.; Buckingham, H.W.; Andy, O.J. Language Representation in the Human Brain: Evidence from Cortical Mapping. Brain Lang. 2000, 74, 238–259. [Google Scholar] [CrossRef] [PubMed]
- Hagoort, P. The neurobiology of language beyond single-word processing. Science 2019, 366, 55–58. [Google Scholar] [CrossRef] [PubMed]
- Hawkins, J. Why Can’t a Computer be more Like a Brain? IEEE Spectr. 2007, 44, 21–26. [Google Scholar] [CrossRef]
- Hawkins, J.; Ahmad, S.; Purdy, S.; Lavin, A. Biological and Machine Intelligence. Release 0.4. 2016–2020. Available online: https://numenta.com/resources/biological-and-machine-intelligence/ (accessed on 8 November 2023).
- Ahmad, S.; Hawkins, J. Properties of Sparse Distributed Representations and their Application to Hierarchical Temporal Memory. arXiv 2015, arXiv:1503.07469. [Google Scholar] [CrossRef]
- Purdy, S. Encoding Data for HTM Systems. arXiv 2016, arXiv:1602.05925. [Google Scholar] [CrossRef]
- Ahmad, S.; Hawkins, J. How do neurons operate on sparse distributed representations? A mathematical theory of sparsity, neurons and active dendrites. arXiv 2016, arXiv:1601.00720. [Google Scholar] [CrossRef]
- Cui, Y.; Ahmad, S.; Hawkins, J. The HTM Spatial Pooler-A Neocortical Algorithm for Online Sparse Distributed Coding. Front. Comput. Neurosci. 2017, 11, 111. [Google Scholar] [CrossRef] [PubMed]
- Webber, F.D.S. Semantic Folding Theory—White Paper; Cortical.io: Vienna, Austria, 2015. [Google Scholar]
- Khan, H.M.; Khan, F.M.; Khan, A.; Ashgar, A.Z.; Alghazzawi, D.M. Anomalous Behavior Detection Framework Using HTM-Based Semantic Folding Technique. Comput. Math. Methods Med. 2021, 2021, 5585238. [Google Scholar] [CrossRef] [PubMed]
- Minaee, S.; Kalchbrenner, N.; Cambria, E.; Khasmakhi, N.N.; Asgari-Chenaghlu, M.; Gao, J. Deep Learning-based Text Classification: A Comprehensive Review. Acm Comput. Surv. 2022, 54, 1–40. [Google Scholar] [CrossRef]
- Irfan, R.; King, C.K.; Grages, D.; Ewen, S.; Khan, S.U.; Madani, S.A.; Kolodziej, J.; Wang, L.; Chen, D.; Rayes, A.; et al. A survey on text mining in social networks. Knowl. Eng. Rev. 2015, 30, 157–170. [Google Scholar] [CrossRef]
- Saxena, A.; Prasad, M.; Gupta, A.; Bharill, N.; Patel, O.P.; Tiwari, A.; Er, M.J.; Ding, W.; Lin, C.-T. A review of clustering techniques and developments. Neurocomputing 2017, 267, 664–681. [Google Scholar] [CrossRef]
- Wang, D.; Li, T.; Zhu, S.; Ding, C. Multi-document summarization via sentence-level semantic analysis and symmetric matrix factorization. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Singapore, 20–24 July 2008; Association for Computing Machinery: New York, NY, USA, 2008; pp. 307–314. [Google Scholar]
- Zha, H. Generic summarization and keyphrase extraction using mutual reinforcement principle and sentence clustering. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Tampere, Finland, 11–15 August 2002; Association for Computing Machinery: New York, NY, USA, 2002; pp. 113–120. [Google Scholar]
- Geiss, J. Creating a Gold Standard for Sentence Clustering in Multi-Document Summarization. In Proceedings of the ACL-IJCNLP 2009 Student Research Workshop, Suntec, Singapore, 4 August 2009; Association for Computational Linguistics: Stroudsburg, PA, USA, 2009; pp. 96–104. [Google Scholar]
- Yelp Dataset. Available online: https://www.yelp.com/dataset (accessed on 8 November 2023).
- Large Movie Review Dataset. Available online: http://ai.stanford.edu/~amaas/data/sentiment/ (accessed on 8 November 2023).
- Richard, S.; Perelygin, A.; Wu, J.; Chuang, J.; Manning, C.D.; Ng, A.; Potts, C. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, WA, USA, 18–21 October 2013; pp. 1631–1642. [Google Scholar]
- Kaggle. Consumer Reviews of Amazon Products. Available online: https://www.kaggle.com/datasets/datafiniti/consumer-reviews-of-amazon-products (accessed on 8 November 2023).
- 20 Newsgroups. Available online: http://qwone.com/~jason/20Newsgroups/ (accessed on 8 November 2023).
- Reuters. Available online: https://martin-thoma.com/nlp-reuters (accessed on 8 November 2023).
- Seno, E.R.; Nunes, M.D. Some Experiments on Clustering Similar Sentences of Texts in Portuguese. In Proceedings of the 8th International Conference on Computational Processing of the Portuguese Language, Aveiro, Portugal, 8–10 September 2008; pp. 133–142. [Google Scholar]
- Wikipedia Dataset. Available online: https://dumps.wikimedia.org/ (accessed on 8 November 2023).
- English-Corpora. Available online: https://www.english-corpora.org/ (accessed on 8 November 2023).
- Miller, G.A.; Charles, W.G. Contextual correlates of semantic similarity. Lang. Cogn. Process. 1991, 6, 1–28. [Google Scholar] [CrossRef]
- Toral, A.; Muñoz, R.; Monachini, M. Named Entity WordNet. In Proceedings of the International Conference on Language Resources and Evaluation, Marrakech, Morocco, 28–30 May 2008. [Google Scholar]
Sentence | Label 1 | Label 2 |
---|---|---|
It was badly damaged during the Peninsular War of the Napoleonic era, and the monks were dispersed. | Buddhism | War |
“The Inmates” (1952) is set in a mental hospital and explores Powys’s interest in mental illness, but it is a work on which Powys failed to bestow sufficient “time and care”. | Mental sickness | Films and television |
The CFP limits how long a boat can be at sea and sets quotas for how much it can catch and of what. | Ships | Fishing |
Bladezz attempts to perform a magic trick involving fire, which ends up torching the restaurant (costing Codex and Bladezz their jobs). | Incineration | Magic show |
To begin with, classifying rubbish enables us to reduce the amount of rubbish and environmental pollution. | Ecological environment | Garbage |
Tyler’s skills as a gifted street dancer draw Nora’s attention. | Talent | Dance |
Forest therapy has been linked to some physiological benefits, as indicated by neuroimaging, and the profile of mood states in psychological test. | Forest | Mental illness |
His wrecked and burned car had been found later, with a body inside, which had been burned beyond recognition. | Traffic accident | Incineration |
But the fact that the Anglo-Australian miner is weighing a bid suggests it has emerged from its post-Alcan funk. | Financial crisis | Metal refining |
An armed Mexican schooner is attempting to smuggle slaves into the United States. | Smuggle | Ships |
Id | Accuracy | Kappa | Macro-F1 | Weighted-F1 | RS | RS(10) |
---|---|---|---|---|---|---|
1 | 0.867 | 0.872 | 0.872 | 0.867 | 0.868 | 0.882 |
2 | 0.88 | 0.885 | 0.885 | 0.88 | 0.881 | 0.894 |
3 | 0.837 | 0.85 | 0.85 | 0.837 | 0.84 | 0.849 |
4 | 0.86 | 0.864 | 0.864 | 0.86 | 0.861 | 0.874 |
5 | 0.87 | 0.873 | 0.873 | 0.87 | 0.87 | 0.884 |
6 | 0.844 | 0.858 | 0.858 | 0.844 | 0.847 | 0.857 |
7 | 0.808 | 0.83 | 0.83 | 0.808 | 0.813 | 0.818 |
8 | 0.859 | 0.868 | 0.868 | 0.859 | 0.86 | 0.872 |
9 | 0.797 | 0.818 | 0.818 | 0.797 | 0.802 | 0.808 |
10 | 0.751 | 0.758 | 0.758 | 0.751 | 0.752 | - |
11 | 0.808 | 0.823 | 0.823 | 0.808 | 0.811 | 0.819 |
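Per-annotator figures such as those above can be reproduced from raw labels with standard definitions. The stdlib-only sketch below (the function name and the assumption that each annotator is scored against consensus labels are ours) computes accuracy and Cohen's kappa; macro- and weighted-F1 follow analogously from per-class precision and recall.

```python
from collections import Counter

def agreement_metrics(gold, pred):
    """Accuracy and Cohen's kappa between consensus labels and one annotator."""
    n = len(gold)
    acc = sum(g == p for g, p in zip(gold, pred)) / n   # observed agreement p_o
    gc, pc = Counter(gold), Counter(pred)
    labels = set(gold) | set(pred)
    pe = sum(gc[l] * pc[l] for l in labels) / (n * n)   # chance agreement p_e
    kappa = (acc - pe) / (1 - pe)                        # (p_o - p_e) / (1 - p_e)
    return acc, kappa
```

Kappa corrects raw agreement for the agreement expected by chance given each side's label distribution, which is why it can exceed accuracy when labels are spread across many topics.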
Total Corpus | Undisputed | Two Options | Three Options | Four Options |
---|---|---|---|---|
3220 | 1826 | 1120 | 254 | 20 |
Method | Homogeneity | Completeness | NMI |
---|---|---|---|
DBscan | 0.41 | 0.362 | 0.385 |
Hierarchical | 0.439 | 0.55 | 0.49 |
Single-Pass | 0.445 | 0.472 | 0.458 |
K-Means | 0.36 | 0.37 | 0.363 |
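The clustering scores above follow standard information-theoretic definitions: homogeneity is mutual information normalized by the class entropy, completeness by the cluster entropy, and NMI here is assumed to use the arithmetic-mean normalization common in toolkits. A stdlib-only sketch (function names are ours):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def mutual_info(a, b):
    n = len(a)
    joint, ca, cb = Counter(zip(a, b)), Counter(a), Counter(b)
    return sum((c / n) * math.log(c * n / (ca[x] * cb[y]))
               for (x, y), c in joint.items())

def cluster_scores(classes, clusters):
    """Homogeneity, completeness, and NMI (arithmetic-mean normalization)."""
    hc, hk = entropy(classes), entropy(clusters)
    mi = mutual_info(classes, clusters)
    homogeneity = mi / hc if hc else 1.0    # 1 - H(C|K)/H(C)
    completeness = mi / hk if hk else 1.0   # 1 - H(K|C)/H(K)
    nmi = mi / ((hc + hk) / 2) if (hc + hk) else 1.0
    return homogeneity, completeness, nmi
```

A perfect clustering scores 1.0 on all three; putting everything in one cluster yields completeness 1.0 by convention but homogeneity 0, which is the trade-off the table's methods balance differently.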
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Cai, K.; Chen, Z.; Guo, H.; Wang, S.; Li, G.; Li, J.; Chen, F.; Feng, H. An Evaluative Baseline for Sentence-Level Semantic Division. Mach. Learn. Knowl. Extr. 2024, 6, 41-52. https://doi.org/10.3390/make6010003