Hybrid Neighborhood-Based Similarity Measure for Text Classification
Abstract
1. Introduction
2. Background Concepts in Document Similarity and Topology
2.1. Document Representation and Similarity Measures
2.2. Limitations of Global Similarity Models
2.3. Neighborhood-Based Similarity Concepts
2.4. Fundamentals of Mathematical Topology
2.5. Near-Open Sets and Topological Approximation
2.6. Topology as a Bridge Between Structure and Semantics
3. Previous Work
1. Vector Space Modeling: This approach transforms documents into numerical vectors within a high-dimensional feature space, where dimensions typically correspond to linguistic features or terms. Similarity computation relies on geometric relationships between these vectors, with common implementations including TF-IDF weighted representations, distributed word embeddings (Word2Vec, GloVe), and document-level embedding techniques (Doc2Vec). The spatial relationships between vectors, whether measured through angular separation (cosine similarity) or Euclidean distance, serve as the similarity metric [2,3,5,6,17]. Although classical methods continue to provide strong baselines in many classification tasks, their limitations become more pronounced in semantically diverse or multilingual corpora.
2. Statistical Analysis Methods: These techniques quantify document relationships through probabilistic and frequency-based features, examining patterns in term distributions, n-gram occurrences, and syntactic structures. Similarity computation employs statistical measures such as cosine similarity for vector alignment, Jaccard index for set-based comparisons, or Pearson correlation for covariance analysis of feature distributions [18,19].
3. Semantic Similarity Techniques: Moving beyond surface-level features, these methods analyze conceptual meaning through lexical databases (WordNet), dimensional reduction (Latent Semantic Analysis), or probabilistic topic modeling (Latent Dirichlet Allocation). They capture document relationships through shared conceptual spaces rather than direct term matching, enabling more nuanced similarity detection of semantically equivalent but lexically distinct content [1,20,21,22].
4. Deep Learning Architectures: Advanced neural network models including Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Transformer-based architectures automatically learn hierarchical document representations. These models excel at capturing both local syntactic patterns and global semantic relationships, often achieving state-of-the-art performance through their ability to model complex linguistic dependencies [7,12,23,24]. Later transformer architectures such as BERT significantly improved semantic modeling through bidirectional contextual encoding.
5. Hybrid Ensemble Methods: Recognizing the complementary strengths of different approaches, ensemble techniques strategically combine multiple similarity measures through meta-learning strategies such as weighted averaging, stacked generalization, or majority voting. This synthesis often yields more robust performance by balancing the strengths of vector-based, statistical, semantic, and neural approaches while mitigating their individual limitations.
4. Topological Document Similarity Using Near-Open Sets
4.1. Core Definitions
- Predecessor set: Rp(d) = {x: (x,d) ∈ R} represents documents similar to d.
- Successor set: Rs(d) = {x: (d,x) ∈ R} represents documents to which d is similar.
- Nearest open sets: Op(d) = ∩{Rp(d)} and Os(d) = ∩{Rs(d)} form topological bases.
4.2. Topology Construction
4.3. Similarity Measurement
- Near set: Nβ(d) = {x: ρ(x,d) ≥ β}.
- Near set family: NSβ(d) = {Nβ(d): β ∈ [0, 1]} with Nα(d) ⊆ Nβ(d) when α ≥ β.
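The two definitions above can be illustrated with a minimal Python sketch. The similarity scores are hypothetical values chosen for illustration, and `near_set` is a name introduced here, not one taken from the paper. The sketch builds the near sets of one reference document for several thresholds and checks the nesting property: a larger threshold gives a smaller near set.

```python
def near_set(d, docs, rho, beta):
    """N_beta(d): every document x whose similarity rho(x, d) is at least beta."""
    return {x for x in docs if rho(x, d) >= beta}

# Hypothetical similarity of each document to the reference d1 (illustration only).
scores_to_d1 = {"d1": 1.0, "d2": 0.9, "d3": 0.75, "d4": 0.6, "d5": 0.3}

def rho(x, d):
    # Toy relation: only similarities to d1 are defined in this example.
    return scores_to_d1[x]

docs = list(scores_to_d1)
thresholds = (0.25, 0.5, 0.75, 1.0)

# Near set family NS(d1): one near set per threshold.
family = {beta: near_set("d1", docs, rho, beta) for beta in thresholds}

# Nesting: N_alpha(d1) is a subset of N_beta(d1) whenever alpha >= beta.
nested = all(
    family[a] <= family[b] for a in thresholds for b in thresholds if a >= b
)
```

In this toy setting the strictest threshold keeps only d1 itself, and each relaxation of β adds documents without ever removing any.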
4.4. Document Ordering
- Asymmetry: d1 < d2 ⇒ ¬(d2 < d1)
- Transitivity: d1 < d2 ∧ d2 < d3 ⇒ d1 < d3
- Example 1. Consider DS = {d1,…,d7} with near sets: NSβ(d1) = {{d1,d5}, {d1,d2,d3}, {d1,d2,d4,d6}}
- {d1} < {d2} < {d3,d4,d6} < {d7}
- {d1} < {d5} < {d7}
- d5 is incomparable to {d2,d3,d4,d6} regarding similarity to d1.
- d3 is incomparable to d4 and d6.
5. Comparative Analysis of Document Similarity Approaches
5.1. Metric-Based Similarity Models
5.2. Latent Semantic and Probabilistic Models
5.3. Neural Embedding-Based Similarity
5.4. Graph-Based Similarity Representations
5.5. Neighborhood-Based Topological Similarity
5.6. Hybrid Topological–Neural Framework
5.7. Structural and Computational Comparison
6. Hybrid Topological–Neural Similarity Framework
6.1. Neighborhood-Induced Topological Space
6.2. Near-Open Sets and Semantic Continuity
6.3. Asymmetric Similarity and Neighborhood Inclusion
6.4. (2,3)-Fuzzy Topological Extension
6.5. Topological Similarity Measure
6.6. Theoretical Properties
6.7. Algorithmic Framework
1. Embed each document using BERT to obtain ψ(d).
2. Compute the similarity relation ρemb(di, dj) as the cosine similarity of ψ(di) and ψ(dj).
3. Construct β-neighborhoods and induced (fuzzy) topology.
4. Identify near-open sets and neighborhood inclusion relations.
5. Compute similarity using neighborhood overlap.
6. Generate explanations via shared neighborhoods and topological paths.
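The steps listed above can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation. The 2-D vectors are hypothetical stand-ins for BERT embeddings, and the function names (`cosine`, `beta_neighborhoods`, `topological_similarity`, `shared_neighbors`) are introduced here for illustration. Neighborhood overlap is taken to be the Jaccard ratio used in Algorithm 1.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def beta_neighborhoods(emb, beta):
    """N_beta(d) = {x : cosine(psi(x), psi(d)) >= beta} for every document d."""
    return {d: {x for x in emb if cosine(emb[x], emb[d]) >= beta} for d in emb}

def topological_similarity(nbrs, di, dj):
    """Neighborhood overlap |N(di) & N(dj)| / |N(di) | N(dj)|."""
    union = nbrs[di] | nbrs[dj]
    return len(nbrs[di] & nbrs[dj]) / len(union) if union else 0.0

def shared_neighbors(nbrs, di, dj):
    """Documents in both neighborhoods; usable as a simple explanation."""
    return sorted(nbrs[di] & nbrs[dj])

# Hypothetical 2-D stand-ins for the embeddings psi(d).
emb = {
    "a": [1.0, 0.0],
    "b": [0.8, 0.6],
    "c": [0.6, 0.8],
    "d": [0.0, 1.0],
}

nbrs = beta_neighborhoods(emb, beta=0.75)
s_ac = topological_similarity(nbrs, "a", "c")
why_ac = shared_neighbors(nbrs, "a", "c")
```

Here documents a and c are not in each other's neighborhood (their cosine similarity is 0.6), yet they still receive a nonzero topological similarity because both neighborhoods contain b.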
Algorithm 1: Hybrid Topological–Neural Document Similarity

Input:
    DS = {d1, d2, …, dn}    // document collection
    β ∈ (0, 1]              // neighborhood threshold
    BERT                    // pretrained language model
Output:
    S_top(di, dj)           // topological similarity matrix
Begin
    // Step 1: Semantic Embedding
    for each document d ∈ DS do
        ψ(d) ← BERT_Embed(d)
    end for
    // Step 2: Similarity Relation Induction
    for each pair (di, dj) ∈ DS × DS do
        ρemb(di, dj) ← CosineSimilarity(ψ(di), ψ(dj))
    end for
    // Step 3: Neighborhood System Construction
    for each document d ∈ DS do
        Nβ(d) ← {x ∈ DS | ρemb(x, d) ≥ β}
    end for
    // Step 4: Induced Topological Space
    T ← {U ⊆ DS | ∀ d ∈ U, ∃ β such that Nβ(d) ⊆ U}
    // Step 5: Topological Similarity Computation
    for each pair (di, dj) ∈ DS × DS do
        S_top(di, dj) ← |Nβ(di) ∩ Nβ(dj)| / |Nβ(di) ∪ Nβ(dj)|
    end for
    return S_top
End
Algorithm 2: Hybrid (2,3)-Fuzzy Topological Similarity

This version allows partial neighborhood membership, suitable for ambiguous or large corpora.

Algorithm Fuzzy_Topological_Neural_Similarity
Input:
    DS = {d1, d2, …, dn}
    β ∈ (0, 1]
    μ: DS → [0, 1]          // fuzzy membership function
    BERT
Output:
    S_fuzzy(di, dj)
Begin
    // Step 1: Semantic Embedding
    for each document d ∈ DS do
        ψ(d) ← BERT_Embed(d)
    end for
    // Step 2: Similarity Relation
    for each pair (di, dj) ∈ DS × DS do
        ρemb(di, dj) ← CosineSimilarity(ψ(di), ψ(dj))
    end for
    // Step 3: Fuzzy Neighborhood Construction
    for each document d ∈ DS do
        Ñβ(d) ← {(x, μ(x)) | ρemb(x, d) ≥ β}
    end for
    // Step 4: Fuzzy Topological Similarity
    for each pair (di, dj) ∈ DS × DS do
        numerator ← Σ min(μi(x), μj(x)) for x ∈ Ñβ(di) ∩ Ñβ(dj)
        denominator ← Σ max(μi(x), μj(x)) for x ∈ Ñβ(di) ∪ Ñβ(dj)
        S_fuzzy(di, dj) ← numerator / denominator
    end for
    return S_fuzzy
End
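The fuzzy similarity step of Algorithm 2 can be sketched as below. This is a hedged illustration under one explicit assumption: the membership μi(x) of a document x in the neighborhood of di is taken to be the similarity of x to di. The algorithm itself leaves μ unspecified, so this choice, the function names, and the toy similarity values are all hypothetical. Documents outside a neighborhood are given membership 0, which makes the sum of minima over the union equal to the sum over the intersection.

```python
def fuzzy_neighborhood(sim_to_d, beta):
    """Fuzzy beta-neighborhood of one document as {x: membership}.

    sim_to_d maps each document x to its similarity to the reference document.
    Assumption: membership of x equals that similarity (not specified in the source).
    """
    return {x: s for x, s in sim_to_d.items() if s >= beta}

def fuzzy_similarity(n_i, n_j):
    """Sum of min memberships divided by sum of max memberships."""
    support = set(n_i) | set(n_j)
    numerator = sum(min(n_i.get(x, 0.0), n_j.get(x, 0.0)) for x in support)
    denominator = sum(max(n_i.get(x, 0.0), n_j.get(x, 0.0)) for x in support)
    return numerator / denominator if denominator else 0.0

# Hypothetical similarities of four documents to d1 and to d2.
sim_to_d1 = {"d1": 1.0, "d2": 0.75, "d3": 0.5, "d4": 0.25}
sim_to_d2 = {"d1": 0.75, "d2": 1.0, "d3": 0.75, "d4": 0.5}

n1 = fuzzy_neighborhood(sim_to_d1, beta=0.5)
n2 = fuzzy_neighborhood(sim_to_d2, beta=0.5)
s_fuzzy = fuzzy_similarity(n1, n2)

# With all memberships equal to 1, the measure reduces to the crisp Jaccard ratio.
crisp = fuzzy_similarity({"a": 1.0, "b": 1.0}, {"b": 1.0, "c": 1.0})
```

In the toy example d4 belongs only to the neighborhood of d2, so it adds to the denominator but not to the numerator, exactly as a non-shared element does in the crisp case.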
- Step 1 (Preprocessing and Embedding): Each document d in the training set (N = 2400) and test set (N = 600) was preprocessed following the protocol in Section 7.1.2. AraBERT v02 generated a 768-dimensional embedding ψ(d) for each document.
- Step 2 (Similarity Relation Computation): For all pairs in the training set, we computed ρ_emb(d_i, d_j) = cosine_similarity(ψ(d_i), ψ(d_j)). This produced a 2400 × 2400 similarity matrix.
- Step 3 (Neighborhood Construction—Training): For each training document d and for β = 0.75, we constructed N_β(d) = {x ∈ training_set | ρ_emb(x, d) ≥ β}. These neighborhoods were stored as sparse lists (average size ≈ 180 documents per neighborhood).
- Step 4 (Topological and Hybrid Similarity Computation): For each pair of training documents, we computed the topological similarity S_top and the hybrid similarity S_hybrid.
- Step 5 (Test Document Classification): For a test document d_test:
- Compute its embedding ψ(d_test);
- Compute ρ_emb(d_test, d_train) for all d_train in training set;
- Construct N_β(d_test) = {x ∈ training_set | ρ_emb(x, d_test) ≥ β};
- Compute S_top(d_test, d_train) and S_hybrid(d_test, d_train) for all training documents;
- Identify the k = 5 training documents with highest similarity according to each measure;
- Assign the majority class among these 5 neighbors to d_test.
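The classification of a test document described in Step 5 can be sketched as follows. Two points are assumptions rather than statements from the text: the hybrid score is written as α·ρ_emb + (1 − α)·S_top, a form suggested by the reported weighting but not given explicitly here, and only the hybrid ranking is shown, whereas the procedure ranks by each measure. The function names and toy data are hypothetical.

```python
from collections import Counter

def jaccard(a, b):
    """Overlap of two neighborhoods, |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def classify(sim_test, train_nbrs, labels, beta=0.75, alpha=0.6, k=5):
    """Assign the majority label among the k most similar training documents.

    sim_test:   {train_id: rho_emb(d_test, train_id)}
    train_nbrs: {train_id: precomputed N_beta(train_id) as a set of train ids}
    labels:     {train_id: class label}
    """
    # N_beta(d_test), restricted to the training set.
    n_test = {t for t, s in sim_test.items() if s >= beta}
    scores = {}
    for t, s in sim_test.items():
        s_top = jaccard(n_test, train_nbrs[t])
        # ASSUMED hybrid form: convex combination of embedding and topological scores.
        scores[t] = alpha * s + (1 - alpha) * s_top
    top_k = sorted(scores, key=scores.get, reverse=True)[:k]
    # Ties in the vote go to the label that appears first among the top k.
    return Counter(labels[t] for t in top_k).most_common(1)[0][0]

# Hypothetical toy data: six training documents in two classes.
sports, economy = {"t1", "t2", "t3"}, {"t4", "t5", "t6"}
train_nbrs = {**{t: sports for t in sports}, **{t: economy for t in economy}}
labels = {**{t: "sports" for t in sports}, **{t: "economy" for t in economy}}
sim_test = {"t1": 0.9, "t2": 0.8, "t3": 0.78, "t4": 0.4, "t5": 0.3, "t6": 0.2}

predicted = classify(sim_test, train_nbrs, labels, k=3)
```

Because the training neighborhoods are precomputed, the only work at query time is one similarity pass over the training embeddings plus set operations on stored neighborhoods.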
7. Experimental Results and Analysis
7.1. Experimental Setup
7.1.1. Dataset Description
- Al-Jazeera Arabic News Channel (news articles);
- Al-Ahram Newspaper (Egyptian daily newspaper);
- Al-Watan Newspaper (Saudi daily newspaper);
- Al-Akhbar Newspaper (Lebanese daily newspaper);
- Al-Arabiya News Channel (pan-Arab news);
- Al-Hayah Newspaper (Egyptian daily newspaper);
- Wikipedia Arabic (encyclopedic content).
7.1.2. Preprocessing and Representation
- Normalization: Unicode normalization (NFKC) was applied to standardize character representations, including normalization of Alef variants (أ, إ, آ → ا), Yeh (ي, ى → ي), and removal of diacritics (tashkeel) and tatweel (kashida).
- Tokenization: Documents were segmented into tokens using whitespace and punctuation boundaries, with special handling for Arabic-specific constructs.
- Stop word removal: A standard Arabic stop word list was applied to filter out frequent function words with limited discriminative power.
- Stemming: Light stemming was performed using the Arabic stemmer from the NLTK library to reduce morphological variants to common roots.
1. TF-IDF Representation: Term Frequency–Inverse Document Frequency vectors were computed for all documents, providing a lexical-semantic baseline. Term weighting followed the standard TF-IDF formulation.
2. BERT Embeddings: Contextual semantic representations were generated using AraBERT v02 [7,36], a pretrained transformer model specifically optimized for Arabic text. Each document was passed through the model, and the [CLS] token embedding from the final hidden layer was extracted as the document-level representation, yielding 768-dimensional vectors. Documents longer than the maximum sequence length (512 tokens) were truncated, with the first 512 tokens retained.
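The character-level normalization described in the preprocessing steps above can be sketched with the Python standard library alone. This is an illustrative sketch, not the authors' code. The function name `normalize_arabic` is hypothetical, and the diacritic range U+064B to U+0652 is an assumption about which tashkeel marks were removed.

```python
import re
import unicodedata

# Tashkeel marks from fathatan (U+064B) to sukun (U+0652); assumed range.
_DIACRITICS = re.compile("[\u064B-\u0652]")
_TATWEEL = "\u0640"

# Alef with hamza above/below and alef with madda map to bare alef;
# alef maksura maps to yeh.
_LETTER_MAP = str.maketrans({
    "\u0623": "\u0627",
    "\u0625": "\u0627",
    "\u0622": "\u0627",
    "\u0649": "\u064A",
})

def normalize_arabic(text: str) -> str:
    """Apply NFKC, strip diacritics and tatweel, and unify Alef and Yeh variants."""
    text = unicodedata.normalize("NFKC", text)
    text = _DIACRITICS.sub("", text)
    text = text.replace(_TATWEEL, "")
    return text.translate(_LETTER_MAP)

# A vocalized name starting with alef-hamza becomes bare alef + unvocalized letters.
example = normalize_arabic("\u0623\u064E\u062D\u0652\u0645\u064E\u062F")
```

Running NFKC first means that decomposed sequences such as alef followed by a combining hamza are composed before the letter mapping is applied, so both encodings of the same letter end up as bare alef.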
7.1.3. Similarity Measures and Baselines
7.1.4. Classification Framework
- Training set: 80% of documents (approximately 2400 documents).
- Test set: 20% of documents (approximately 600 documents).
7.1.5. Evaluation Metrics
- Macro Precision:
- Macro Recall:
- Macro F1-score:
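The three macro-averaged metrics listed above can be computed as in the sketch below. It assumes the common convention in which each metric is computed per class and then averaged with equal class weights, and in which macro F1 is the mean of per-class F1 scores; the text does not say which macro F1 variant was used, so that choice is an assumption. The function name and the toy labels are hypothetical.

```python
def macro_scores(y_true, y_pred):
    """Return (macro precision, macro recall, macro F1) over all observed classes."""
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(classes)
    # Assumption: macro F1 is the unweighted mean of per-class F1 scores.
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

# Toy example with two classes and one misclassified document.
y_true = ["a", "a", "b", "b"]
y_pred = ["a", "b", "b", "b"]
macro_p, macro_r, macro_f1 = macro_scores(y_true, y_pred)
```

Equal class weighting means that a small class influences the macro scores as much as a large one, which is why macro averages are often preferred when class sizes differ.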
7.2. Results and Analysis
7.2.1. Classification Performance
7.2.2. Validation of Topological Properties
7.2.3. Parameter Sensitivity Analysis
- Low β (0.5–0.6): Neighborhoods are large and inclusive, containing many irrelevant documents, leading to decreased precision.
- Optimal β (0.7–0.8): Neighborhoods balance inclusiveness and precision, yielding maximum F1-score.
- High β (>0.85): Neighborhoods become too sparse, causing loss of recall as relevant documents are excluded.
- Topological structure contributes substantially (40%) to optimal similarity judgments.
- Performance degrades more sharply when moving toward pure topological (α < 0.4) than toward pure BERT (α > 0.8), suggesting that semantic proximity provides a necessary foundation upon which topological structure builds.
7.2.4. Computational Efficiency
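- Neighborhood precomputation: β-neighborhoods can be precomputed offline and stored in O(n·k) space, where k is the average neighborhood size, rather than the O(n²) space of a dense pairwise similarity matrix. Building them still requires the offline O(n²) similarity computation.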
- Query-time complexity: Similarity computation for a new document requires only neighborhood lookups and set operations, scaling with neighborhood size rather than corpus size.
- Sparse representations: As β increases, neighborhoods become sparse, enabling efficient storage and computation.
7.3. Discussion
- Superior Classification Performance: The hybrid model achieves the highest F1-score among the evaluated methods (F1 = 0.93), outperforming both the TF-IDF baseline (F1 = 0.80) and the pure neural AraBERT baseline (F1 = 0.86) by large effect sizes.
- Validation of Theoretical Properties: Experiments confirm neighborhood stability (Proposition 1) and the emergence of asymmetric similarity relations (Proposition 2), grounding the theoretical framework in empirical observation.
- Enhanced Explainability: The topological framework enables transparent similarity decisions through explicit neighborhood relations and topological paths, addressing a critical limitation of black-box neural models.
- Complementary Semantic and Structural Information: The optimal hybrid weighting (α = 0.6) demonstrates that semantic proximity and structural consistency encode different yet complementary aspects of document relationships. Neural embeddings capture contextual meaning, while topological modeling reinforces decisions by validating similarity through shared neighborhood structure.
- Practical Advantages: The framework offers query-time computational efficiency on a 3000-document corpus through neighborhood-based computation and robustness to parameter variations, as demonstrated by sensitivity analysis. Large-scale deployment would require approximate nearest-neighbor indexing to address the offline O(n²) similarity matrix construction.
- Limitations of the Current Study: Despite the promising results, this study has several limitations. First, the primary empirical evaluation is conducted on a single Arabic text corpus of approximately 3000 documents. Although validation on the English 20 Newsgroups dataset suggests that the approach is not tied to one language, its scalability and generalizability to very large corpora (e.g., millions of documents) or to structurally different domains (e.g., legal or biomedical texts) remain to be validated. Second, the fuzzy membership function μ, while theoretically grounded, was defined in a relatively simple way based on similarity to a reference document; more sophisticated, learnable membership functions may yield further improvements. Third, the current implementation stores neighborhoods for all documents, which is efficient for 3000 documents but may need optimization for significantly larger datasets. Fourth, the topology T is implemented operationally as a neighborhood-based structure (β-neighborhoods with sparse list storage) rather than as a full power-set enumeration, which keeps the construction implementable while preserving the theoretical properties. Constructing neighborhoods costs O(n²) for the similarity matrix computation and O(n·k) for neighborhood storage, where k is the average neighborhood size (approximately 180 in our dataset); for retrieval tasks, this is substantially better than exhaustive pairwise comparison. Fifth, β = 0.75 was selected by grid search over β ∈ {0.5, 0.6, 0.7, 0.75, 0.8, 0.85, 0.9} on a validation split of 20% of the training data; the sensitivity analysis reported in Figure 1 confirms that the optimal range is 0.7–0.8. Similarly, α = 0.6 was selected via grid search over {0.1, 0.2, …, 0.9} on the same validation set. While these values were optimal for the Arabic dataset evaluated, their transferability to other domains would benefit from further systematic analysis. Sixth, the claim of scalability refers specifically to query-time complexity (O(k) neighborhood lookup versus O(n) exhaustive comparison); the offline precomputation cost remains O(n²) for the similarity matrix, and future work should address approximate neighborhood construction to reduce this cost for very large corpora. Seventh, the 20 Newsgroups results provide only preliminary evidence that the method generalizes to other languages and domains, and this should be validated further; Arabic was chosen as the primary evaluation language because of the relative scarcity of topological NLP studies for Arabic and the availability of high-quality pretrained models (AraBERT, CAMeLBERT, AraELECTRA).
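The grid search over β and α mentioned in the limitations can be sketched as below. The search grids are the ones stated in the text. Everything else is hypothetical: `grid_search` is a name introduced here, and `toy_validation_f1` is a made-up stand-in for "train with these parameters and return the validation F1", chosen only so that the example runs. It does not reproduce the reported experiments.

```python
from itertools import product

def grid_search(betas, alphas, evaluate):
    """Return (best_score, beta, alpha) maximizing evaluate(beta, alpha)."""
    return max((evaluate(b, a), b, a) for b, a in product(betas, alphas))

# Grids as stated in the text.
betas = [0.5, 0.6, 0.7, 0.75, 0.8, 0.85, 0.9]
alphas = [round(0.1 * i, 1) for i in range(1, 10)]

def toy_validation_f1(beta, alpha):
    # HYPOTHETICAL score surface with a single peak, used only for illustration.
    # In practice this would run the classifier on the validation split.
    return 1.0 - abs(beta - 0.75) - abs(alpha - 0.6)

best_score, best_beta, best_alpha = grid_search(betas, alphas, toy_validation_f1)
```

With 7 values of β and 9 values of α the search evaluates 63 configurations, so the cost of tuning is dominated by how expensive a single validation run is.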
8. Conclusions and Future Work
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
1. Deerwester, S.; Dumais, S.T.; Furnas, G.W.; Landauer, T.K.; Harshman, R. Indexing by latent semantic analysis. J. Am. Soc. Inf. Sci. 1990, 41, 391–407.
2. Levy, O.; Goldberg, Y.; Dagan, I. Improving distributional similarity with lessons learned from word embeddings. Trans. Assoc. Comput. Linguist. 2015, 3, 211–225.
3. Gomaa, W.H.; Fahmy, A.A. A survey of text similarity approaches. Int. J. Comput. Appl. 2013, 68, 13–18.
4. Le, Q.V.; Mikolov, T. Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning (ICML-14); JMLR: Cambridge, MA, USA, 2014; pp. 1188–1196.
5. Pennington, J.; Socher, R.; Manning, C. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP); Association for Computational Linguistics: Stroudsburg, PA, USA, 2014; pp. 1532–1543.
6. Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient estimation of word representations in vector space. arXiv 2013, arXiv:1301.3781.
7. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 4171–4186.
8. Shi, L.; Cao, L.; Ye, Y.; Zhao, Y.; Chen, B. Tensor-based Graph Learning with Consistency and Specificity for Multi-view Clustering. In IEEE Transactions on Multimedia; IEEE: New York, NY, USA, 2026.
9. Chen, Z.; Liu, Y.; Shi, L.; Chen, X.; Zhao, Y.; Ren, F. MDEval: Evaluating and Enhancing Markdown Awareness in Large Language Models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 1–15.
10. Salama, A.S.; El-Barbary, O.G. Document classification in information retrieval system based on neutrosophic sets. Filomat 2020, 34, 1591–1602.
11. Jarvis, R.A.; Patrick, E.A. Clustering using a similarity measure based on shared near neighbors. IEEE Trans. Comput. 1973, C-22, 1025–1034.
12. He, Z. Text similarity based on two independent channels: Siamese Convolutional Neural Networks and Siamese Recurrent Neural Networks. Neurocomputing 2025, 643, 130355.
13. Ibrahim, H.Z.; Al-Shami, T.M.; Elbarbary, O.G. (3, 2)-Fuzzy Sets and Their Applications to Topology and Optimal Choices. Comput. Intell. Neurosci. 2021, 2021, 1272266.
14. Pawlak, Z. Rough sets. Int. J. Comput. Inf. Sci. 1982, 11, 341–356.
15. El-Barbary, O.G.; Abu Shaheen, F.A.; Al-Shami, T.M.; Arar, M. Supra finite soft-open sets and applications to operators and continuity. J. Math. Comput. Sci. 2024, 35, 120–135.
16. El-Barbary, O.G.; Salama, A.S. Topological approach to retrieve missing values in incomplete information systems. J. Egypt. Math. Soc. 2017, 25, 419–423.
17. Lau, J.H.; Baldwin, T. An empirical evaluation of doc2vec with practical insights into document embedding generation. arXiv 2016, arXiv:1607.05368.
18. Pearson, K. Notes on regression and inheritance in the case of two parents. Proc. R. Soc. Lond. 1895, 58, 240–242.
19. Jaccard, P. Étude comparative de la distribution florale dans une portion des Alpes et du Jura. Bull. Soc. Vaudoise Sci. Nat. 1901, 37, 547–579.
20. Blei, D.M.; Ng, A.Y.; Jordan, M.I. Latent Dirichlet allocation. In Advances in Neural Information Processing Systems; Dietterich, T.G., Becker, S., Ghahramani, Z., Eds.; MIT Press: Cambridge, MA, USA, 2002; pp. 601–608.
21. Blei, D.M.; Ng, A.Y.; Jordan, M.I. Latent Dirichlet allocation. J. Mach. Learn. Res. 2003, 3, 993–1022.
22. Martinez-Gil, J. Automatic design of semantic similarity ensembles using grammatical evolution. arXiv 2024, arXiv:2307.00925.
23. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2017; pp. 5998–6008.
24. Kim, Y. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP); Association for Computational Linguistics: Stroudsburg, PA, USA, 2014; pp. 1746–1751.
25. Perin, E.L.S.; Souza, M.C.D.; Silva, J.D.A.; Matsubara, E.T. DynGraph-BERT: Combining BERT and GNN Using Dynamic Graphs for Inductive Semi-Supervised Text Classification. Informatics 2025, 12, 20.
26. Abdelali, A.; Darwish, K.; Mubarak, H. Transparent, Low Resource, and Context-Aware Information Retrieval From a Closed Domain Knowledge Base. IEEE Access 2024, 12, 44233–44243.
27. Kalogeropoulos, N.-R.; Ioannou, D.; Stathopoulos, D.; Makris, C. On Embedding Implementations in Text Ranking and Classification Employing Graphs. Electronics 2024, 13, 1897.
28. Hendry, H.; Tukino, T.; Sediyono, E.; Fauzi, A.; Huda, B. HyEWCos: A Comparative Study of Hybrid Embedding and Weighting Techniques for Text Similarity in Short Subjective Educational Text. Information 2025, 16, 995.
29. Shen, Z.; Xiao, Z. A Chinese Short Text Similarity Method Integrating Sentence-Level and Phrase-Level Semantics. Electronics 2024, 13, 4868.
30. Alammar, M.; El Hindi, K.; Al-Khalifa, H. English-Arabic Hybrid Semantic Text Chunking Based on Fine-Tuning BERT. Computation 2025, 13, 151.
31. Kim, H.; Gim, G. Enhancing Patent Document Similarity Evaluation and Classification Precision Through a Multimodal AI Approach. Appl. Sci. 2025, 15, 9254.
32. Ostendorff, M.; Rethmeier, N.; Augenstein, I.; Gipp, B.; Rehm, G. Neighborhood Contrastive Learning for Scientific Document Representations with Citation Embeddings. arXiv 2022, arXiv:2202.06671.
33. Zadeh, L.A. Fuzzy sets. Inf. Control 1965, 8, 338–353.
34. Inoue, G.; Alhafni, B.; Baimukan, N.; Bouamor, H.; Habash, N. The Interplay of Variant, Size, and Task Type in Arabic Pre-trained Language Models. In Proceedings of the Sixth Arabic Natural Language Processing Workshop; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 92–104.
35. Hofmann, T. Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval; Association for Computing Machinery (ACM): New York, NY, USA, 1999; pp. 50–57.
36. Antoun, A.; Baly, F.; Hajj, H. AraELECTRA: Pre-training text encoders for Arabic language understanding. In Proceedings of the Sixth Arabic Natural Language Processing Workshop; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 13–24. Available online: https://aclanthology.org/2021.wanlp-1.20/ (accessed on 1 May 2026).



| Method | Precision | Recall | F1-Score | Std. Dev. (F1) | 95% CI (F1) [±1.96·σ/√10] | Cohen’s d |
|---|---|---|---|---|---|---|
| TF-IDF | 0.80 | 0.79 | 0.80 | ±0.023 | ±0.014 | (reference) |
| BERT (AraBERT) | 0.88 | 0.86 | 0.86 | ±0.018 | ±0.011 | 2.93 (large) |
| Topological (β = 0.75) | 0.90 | 0.89 | 0.89 | ±0.015 | ±0.009 | 4.74 (large) |
| Hybrid (α = 0.6) ★ | 0.93 | 0.94 | 0.93 | ±0.012 | ±0.007 | 7.22 (large) |

| Method | Precision | Recall | F1-Score | Std. Dev. (F1) | 95% CI (F1) [±1.96·σ/√10] | Cohen’s d vs. TF-IDF |
|---|---|---|---|---|---|---|
| TF-IDF | 0.80 | 0.79 | 0.80 | ±0.023 | ±0.014 | (reference) |
| LDA (Topic Model) | 0.74 | 0.76 | 0.75 | ±0.031 | ±0.019 | −1.52 (large) |
| Doc2Vec | 0.77 | 0.79 | 0.78 | ±0.028 | ±0.017 | −0.72 (medium) |
| BERT (AraBERT) | 0.88 | 0.86 | 0.86 | ±0.018 | ±0.011 | 2.93 (large) |
| CAMeLBERT | 0.88 | 0.86 | 0.87 | ±0.016 | ±0.010 | 3.82 (large) |
| AraELECTRA | 0.86 | 0.84 | 0.85 | ±0.019 | ±0.012 | 2.27 (large) |
| Topological (β = 0.75) | 0.90 | 0.89 | 0.89 | ±0.015 | ±0.009 | 4.74 (large) |
| Hybrid (α = 0.6) ★ | 0.93 | 0.94 | 0.93 | ±0.012 | ±0.007 | 7.22 (large) |

| Model Configuration | F1-Score | Precision | Recall | Δ from TF-IDF | 95% CI (F1) [±1.96·σ/√5] |
|---|---|---|---|---|---|
| TF-IDF k-NN (Baseline) | 0.68 | 0.67 | 0.69 | (ref) | ±0.022 |
| BERT-only (Baseline) | 0.72 | 0.71 | 0.73 | +0.04 | ±0.020 |
| BERT + Topological (Jaccard) | 0.75 | 0.74 | 0.76 | +0.07 | ±0.018 |
| BERT + Topology + Asymmetry | 0.76 | 0.75 | 0.77 | +0.08 | ±0.017 |
| Full Hybrid—Proposed ★ | 0.77 | 0.76 | 0.78 | +0.09 | ±0.016 |
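
The 95% CI columns in the tables above are defined in their headers as ±1.96·σ/√n, with n = 10 or n = 5 runs. A minimal sketch of that calculation, using the standard deviations reported for four of the methods, is given below; the function name `ci_half_width` is introduced here for illustration. The rounded results agree with the half-widths listed in the tables.

```python
import math

def ci_half_width(std_dev, n_runs, z=1.96):
    """Half-width of a normal-approximation confidence interval: z * sigma / sqrt(n)."""
    return z * std_dev / math.sqrt(n_runs)

# Standard deviations of F1 as reported in the tables (10 runs each).
reported_std = {
    "TF-IDF": 0.023,
    "BERT (AraBERT)": 0.018,
    "Topological": 0.015,
    "Hybrid": 0.012,
}

half_widths = {m: round(ci_half_width(s, 10), 3) for m, s in reported_std.items()}
# half_widths == {"TF-IDF": 0.014, "BERT (AraBERT)": 0.011,
#                 "Topological": 0.009, "Hybrid": 0.007}
```

Because the half-width shrinks with the square root of the number of runs, the same standard deviation gives a wider interval in the five-run table than in the ten-run tables.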
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
El Barbary, O.G.; Hagras, S.; Allam, T.M. Hybrid Neighborhood-Based Similarity Measure for Text Classification. Information 2026, 17, 560. https://doi.org/10.3390/info17060560

