Article

Hybrid Neighborhood-Based Similarity Measure for Text Classification

1
Information System Department, Faculty of Computer and Informatics, Tanta University, Tanta 31527, Egypt
2
Computers and Control Engineering Department, Faculty of Engineering, Tanta University, Tanta 31527, Egypt
*
Author to whom correspondence should be addressed.
Information 2026, 17(6), 560; https://doi.org/10.3390/info17060560 (registering DOI)
Submission received: 23 March 2026 / Revised: 30 May 2026 / Accepted: 1 June 2026 / Published: 5 June 2026
(This article belongs to the Special Issue Advances in Data Mining for Complex Systems)

Abstract

Global vector comparisons, which are computationally costly, symmetric by design, and frequently challenging to interpret, have historically been used to address document similarity, a fundamental task in information retrieval and text classification. A neighborhood-based document similarity framework based on ideas from mathematical topology is proposed in this paper. Instead of using exhaustive pairwise comparisons to determine similarity, local neighborhood structures are used to model documents as elements of a finite topological space induced by a similarity relation. The suggested method allows for the natural ordering of documents according to their relative proximity, supports asymmetric similarity relations, and captures local continuity using β-neighborhoods and near-open sets. A hybrid extension is presented that uses contextual embeddings produced by BERT to induce the underlying neighborhood structure in order to improve semantic representation while maintaining interpretability. Neural embeddings function as a semantic basis on which topological relations and near-set approximations are built, rather than taking the place of the topological model. Neighborhood overlap and topological refinement are then used to calculate document similarity, which enables the identification and explanation of both direct and indirect semantic relationships using explicit neighborhood paths. Experimental evaluation on benchmark datasets shows that, compared with TF-IDF and standalone BERT models, the suggested topological and hybrid approaches achieve competitive or superior accuracy while enhancing scalability, asymmetry handling, and explainability. The findings show that neighborhood-based topological modeling offers a transparent and ethical framework for document similarity analysis in large-scale and interpretability-critical applications, especially when paired with neural embeddings.

1. Introduction

Many natural language processing (NLP) applications, such as information retrieval, text classification, plagiarism detection, document clustering, and knowledge discovery, rely heavily on document similarity. The efficacy of search engines, recommendation systems, and decision-support tools working on extensive textual repositories is directly impacted by the capacity to measure document similarity accurately. Creating similarity metrics that are both computationally effective and semantically significant has grown in importance as digital text collections continue to expand in size and diversity.
Global representations, in which every document is viewed as a single point in a high-dimensional feature space, are commonly used in traditional document similarity techniques. Methods that calculate similarity by comparing complete document representations include latent semantic models, TF-IDF with cosine similarity, and, more recently, deep neural embeddings. Despite their excellent empirical performance, these approaches have a number of intrinsic drawbacks. First, because global similarity measures require extensive pairwise comparisons, they are computationally costly for large collections. Second, directional or hierarchical relationships that naturally occur in real-world text corpora cannot be modeled using the most widely used similarity functions because they are symmetric by design (e.g., “a guideline is similar to a protocol” does not imply the reverse). Third, a lot of high-performing neural models operate as “black boxes,” providing little interpretability and making it challenging to defend similarity judgments in delicate fields like academic, legal, or medical analysis.
Locality-based similarity paradigms, which infer document relationships from local neighborhoods rather than global distances, have been investigated recently in an effort to overcome these limitations. The emphasis of neighborhood-based approaches is shifted from absolute similarity values to shared context and relative proximity between documents. Concepts from mathematical topology, which examines structures defined by neighborhoods and continuity rather than exact metric distances, naturally fit this viewpoint. Local relations and open sets can be used to understand similarity in topology, offering a conceptual framework for thinking about structural relationships, ordering, and closeness.
In this work, we model document collections as finite topological spaces induced by similarity relations in order to take a topological approach to document similarity. Rather than using direct vector comparison, similarity between documents is determined by neighborhood overlap, containment, and connectivity. Each document is linked to a family of neighborhoods that are defined by similarity thresholds. By capturing local continuity and granularity of similarity, near-open sets and neighborhood systems enable the ranking and ordering of documents based on their relative proximity. This formulation offers an interpretable structure that allows similarity decisions to be traced through explicit neighborhood relations and naturally supports asymmetric similarity relations.
The strength of the underlying similarity relation utilized to create neighborhoods determines how effective topology is, even though it offers a solid mathematical foundation and interpretability. Deeper semantic linkages may not be captured by purely lexical representations, especially when paraphrasing or domain-specific terminology are involved. In order to overcome this constraint, we present a hybrid topological-neural framework where the initial similarity relation is induced using contextual embeddings produced by pretrained language models like BERT. Crucially, neural embeddings function as a semantic foundation on which neighborhood structures and near-open sets are built rather than taking the role of the topological model. This architecture takes advantage of the semantic richness of contemporary language models while maintaining the interpretability and structural benefits of topology.
The rest of this paper is structured as follows. Background information on document similarity and topology is reviewed in Section 2. Related work is discussed in Section 3. The suggested topological document similarity using near-open sets is presented in Section 4. The comparative analysis of document similarity approaches is introduced in Section 5. The hybrid topological–neural similarity framework is presented in Section 6. Experimental results and analysis are discussed in Section 7. The work is concluded and future research directions are outlined in Section 8.

2. Background Concepts in Document Similarity and Topology

This section introduces the key ideas from mathematical topology that serve as the theoretical foundation for the suggested framework and goes over the basic ideas behind document similarity measurement. The objective is to provide a shared framework that links neighborhood-based topological modeling with conventional text similarity techniques.

2.1. Document Representation and Similarity Measures

The first step in measuring document similarity is to convert unstructured material into structured representations that allow for quantitative comparison. The Vector Space Model (VSM), which represents each document as a vector of word weights, is the foundation of early methods. Term Frequency–Inverse Document Frequency (TF–IDF), which balances term relevance within a document against its distribution across the corpus, is the most popular weighting technique. Cosine similarity, which measures the cosine of the angle between document vectors, is frequently used to calculate document similarity.
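As a minimal sketch of the VSM pipeline described above, the following standard-library Python computes TF–IDF weights and cosine similarity for three toy documents. The weighting variant (relative term frequency times log(N/df)), the whitespace tokenizer, and the sample sentences are assumptions chosen for illustration; they are not taken from the experiments in this paper.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: (c / len(tokens)) * math.log(n_docs / df[t])
                        for t, c in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy corpus (illustrative only).
docs = [
    "topology models neighborhoods of documents",
    "neighborhoods of documents define similarity",
    "transformers produce contextual embeddings",
]
vecs = tfidf_vectors(docs)
sim_01 = cosine(vecs[0], vecs[1])  # shares three terms with document 0
sim_02 = cosine(vecs[0], vecs[2])  # shares no terms with document 0
```

Because the score depends only on shared terms, documents 0 and 2 receive a similarity of zero regardless of their meaning, which is the lexical-overlap limitation discussed next.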
VSM-based techniques struggle with semantic diversity, such as synonymy and paraphrase, and mostly capture surface-level lexical overlap, despite their computational efficiency and interpretability. Semantic models like Latent Semantic Analysis (LSA) and topic modeling techniques [1] (like Latent Dirichlet Allocation) project documents into lower-dimensional latent spaces in order to capture latent concepts and co-occurrence patterns. Nevertheless, these models may still be unable to capture fine-grained contextual meaning and frequently need careful parameter adjustment.
Neural embedding models have greatly improved document similarity analysis in more recent times [2,3]. While document-level models like Doc2Vec [4] extend these concepts to larger textual units, word-level models like Word2Vec and GloVe [5,6] learn distributed representations based on contextual usage. Transformer-based models, especially BERT [7], model bidirectional connections in text to produce contextualized embeddings that capture deep semantic and syntactic information. Even when there is little lexical overlap, these embeddings allow for highly accurate similarity estimation. Despite their efficacy, neural similarity models are generally symmetric and opaque, providing little information about why two documents are deemed similar.

2.2. Limitations of Global Similarity Models

The majority of conventional and neural similarity techniques calculate similarity between whole document representations as a global metric. This paradigm presents a number of drawbacks. First, extensive pairwise comparisons are necessary for global similarity, which presents scalability issues in huge document collections. Second, the resulting similarity ratings are typically symmetric, which limits their capacity to represent directional or hierarchical links seen in many areas, including technical manuals, scientific publications, and legal documents.
Third, the interpretability of global similarity measures is constrained, especially in deep learning models where decisions about similarity are difficult to link to particular relational or structural characteristics of the collection of documents. These constraints drive the investigation of local and relational similarity frameworks [8,9], in which neighborhood structure and relative proximity—rather than absolute distances—are used to infer similarity.

2.3. Neighborhood-Based Similarity Concepts

Neighborhood-based similarity techniques define similarity through local associations between documents in a feature space. Rather than directly comparing every document pair, each document is linked to a set of nearby documents that meet a predetermined similarity requirement [6,10]. The overlap, containment, or connection of the two documents’ neighborhoods can then be used to evaluate how similar they are.
These methods have a number of benefits. They facilitate indirect similarity detection through shared neighbors, naturally support grouping and classification tasks, and decrease computational complexity by restricting comparisons to limited regions. Additionally, neighborhood-based approaches offer a logical link to topological and graph-based representations, where documents are viewed as nodes linked by neighborhood relations.
However, many neighborhood-based models are defined heuristically, without a rigorous mathematical foundation. This can lead to inconsistencies in neighborhood construction and difficulty in reasoning about structural properties such as continuity, ordering, and hierarchy. It should be noted that shared-neighbor similarity—using overlap of nearest-neighbor sets as a proximity measure—has prior roots in classical clustering literature, including the Jarvis-Patrick algorithm [11] and related graph-based Jaccard methods. The present work contributes a formal topological framing and hybrid neural integration rather than claiming the basic neighborhood-overlap concept as novel.

2.4. Fundamentals of Mathematical Topology

Mathematical topology offers a formal foundation for investigating spaces defined by neighborhoods and continuity, regardless of exact metric distances [12]. A pair (X,T), where X is a set and T is a collection of subsets of X that satisfy certain axioms, is called a topological space. The idea of a neighborhood, which broadens the concept of closeness without requiring numerical distances, is fundamental to topology.
In finite spaces, a topology can be induced by a binary relation that satisfies properties such as symmetry and reflexivity. Such relations naturally produce neighborhood systems, which can be used to create open sets, closure operators, and ordering relations. Topological spaces, in contrast to metric spaces, can represent asymmetric and hierarchical interactions and permit several levels of granularity [10,13].

2.5. Near-Open Sets and Topological Approximation

By permitting a controlled relaxation of openness conditions, near-open sets expand on the ideas of classical open sets. These sets are especially helpful in applications like rough set theory and information systems that deal with uncertainty, granularity, or approximate similarity. Flexible neighborhood structures that adjust to changing similarity criteria while maintaining crucial topological characteristics can be built using near-open sets [14].
Near-open sets offer a way to characterize different levels of similarity and arrange documents into tiered neighborhood structures in document similarity analysis. Partial orderings and hierarchical relationships among documents can be derived by analyzing intersections and closures of near-open sets, providing both interpretability and expressive power.

2.6. Topology as a Bridge Between Structure and Semantics

Topology provides a unified framework for relational, statistical, and semantic approaches to document similarity. Topology allows rich contextual information to be integrated with interpretable structural models by establishing similarity through neighborhood systems created by semantic representations [15,16]. Topological approaches allow for explainable similarity decisions through explicit neighborhood relations and topological paths, while maintaining semantic accuracy when paired with neural embeddings.
The neighborhood-based and hybrid topological-neural similarity frameworks that are suggested and presented in the following sections are motivated by this theoretical framework.

3. Previous Work

Modern document similarity analysis employs five primary methodological paradigms, each with distinct representational strategies and computational techniques:
1. Vector Space Modeling: This approach transforms documents into numerical vectors within a high-dimensional feature space, where dimensions typically correspond to linguistic features or terms. Similarity computation relies on geometric relationships between these vectors, with common implementations including TF-IDF weighted representations, distributed word embeddings (Word2Vec, GloVe), and document-level embedding techniques (Doc2Vec). The spatial relationships between vectors, whether measured through angular separation (cosine similarity) or Euclidean distance, serve as the similarity metric [2,3,5,6,17].
Although classical methods continue to provide strong baselines in many classification tasks, their limitations become more pronounced in semantically diverse or multilingual corpora.
2. Statistical Analysis Methods: These techniques quantify document relationships through probabilistic and frequency-based features, examining patterns in term distributions, n-gram occurrences, and syntactic structures. Similarity computation employs statistical measures such as cosine similarity for vector alignment, Jaccard index for set-based comparisons, or Pearson correlation for covariance analysis of feature distributions [18,19].
3. Semantic Similarity Techniques: Moving beyond surface-level features, these methods analyze conceptual meaning through lexical databases (WordNet), dimensional reduction (Latent Semantic Analysis), or probabilistic topic modeling (Latent Dirichlet Allocation). They capture document relationships through shared conceptual spaces rather than direct term matching, enabling more nuanced similarity detection of semantically equivalent but lexically distinct content [1,20,21,22].
4. Deep Learning Architectures: Advanced neural network models including Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Transformer-based architectures automatically learn hierarchical document representations. These models excel at capturing both local syntactic patterns and global semantic relationships, often achieving state-of-the-art performance through their ability to model complex linguistic dependencies [7,12,23,24]. Later transformer architectures such as BERT significantly improved semantic modeling through bidirectional contextual encoding.
5. Hybrid Ensemble Methods: Recognizing the complementary strengths of different approaches, ensemble techniques strategically combine multiple similarity measures through meta-learning strategies such as weighted averaging, stacked generalization, or majority voting. This synthesis often yields more robust performance by balancing the strengths of vector-based, statistical, semantic, and neural approaches while mitigating their individual limitations.
Recent studies have further advanced semantic text similarity and classification by integrating transformer-based embeddings with graph-aware and hybrid representations. Ref. [25] proposed a graph-aware BERT framework that enhances semantic text similarity and classification performance by incorporating structural relationships into contextual embeddings, improving representation beyond pure sequence modeling. In a similar direction, ref. [26] explored cross-lingual transformer embeddings for Arabic–English semantic similarity and retrieval tasks, demonstrating the effectiveness of deep contextual representations in capturing semantic alignment across languages. Collectively, these works highlight the growing trend of combining transformer models with structural or cross-lingual strategies to improve robustness and accuracy in text similarity and classification tasks.
Recent studies have investigated embedding implementations for text ranking and classification using graph structures, showing that graph-aware representations can improve retrieval quality and classification robustness [27].
Recent research increasingly suggests that no single similarity paradigm is sufficient for modern document understanding. Hybrid systems combining embeddings with weighting schemes, graph propagation, or multimodal evidence have shown promising results.
For example, recent hybrid embedding–weighting methods have improved short educational text similarity [28,29], while multimodal AI systems have enhanced patent document similarity evaluation and classification precision [30,31].
The present study does not claim to replace prior similarity paradigms. Rather, it contributes a formal integration of topological neighborhood modeling with neural embeddings for document classification.
Unlike purely lexical methods, the proposed framework captures contextual semantics through neural embeddings. Unlike purely neural methods, it incorporates higher-order neighborhood consistency. Unlike generic graph methods, it uses topology as the organizing principle for similarity construction. Unlike purely theoretical topological models, it is designed for practical and scalable document classification.

4. Topological Document Similarity Using Near-Open Sets

We propose a novel document similarity framework based on topological relations in document space (DS). Our approach utilizes reflexive and symmetric tolerance relations to model natural similarity intuitions: (1) every document is similar to itself (reflexivity), and (2) similarity between documents is bidirectional (symmetry).

4.1. Core Definitions

Let R ⊆ DS × DS be a binary similarity relation [7,32]. For documents d1, d2 ∈ DS:
  • Predecessor set: Rp(d) = {x: (x,d) ∈ R} represents documents similar to d.
  • Successor set: Rs(d) = {x: (d,x) ∈ R} represents documents to which d is similar.
  • Nearest open sets: Op(d) = ∩{Rp(d)} and Os(d) = ∩{Rs(d)} form topological bases.

4.2. Topology Construction

Through set operations, we derive:
  • Predecessor-union-successor topology (τp∪s).
  • Predecessor-intersection-successor topology (τp∩s) [28,33]. With containment relations: τp∪s ⊆ τp ⊆ τp∩s and τp∪s ⊆ τs ⊆ τp∩s.

4.3. Similarity Measurement

We define an asymmetric similarity measure [18,34] ρ: DS × DS → [0, 1] where ρ(d,d) = 1. For threshold β ∈ [0, 1]:
  • Near set: Nβ(d) = {x: ρ(x,d) ≥ β}.
  • Near set family: NSβ(d) = {Nβ(d): β ∈ [0, 1]} with Nα(d) ⊆ Nβ(d) when α ≥ β.

4.4. Document Ordering

The intersection closure of NSβ(d) forms a complete lattice. We define an ordering relation < where d1 < d2 indicates that d1 is more similar to d than d2 is, satisfying:
  • Asymmetry: d1 < d2 ⇒ ¬(d2 < d1)
  • Transitivity: d1 < d2 ∧ d2 < d3 ⇒ d1 < d3
Example 1. Consider DS = {d1,…,d7} with near sets:
NSβ(d1) = {{d1,d5}, {d1,d2,d3}, {d1,d2,d4,d6}}
The intersect-closure yields:
NSβ(d1) = {{d1}, {d1,d2}, {d1,d5}, {d1,d2,d3}, {d1,d2,d4,d6}, DS}
Resulting ordering:
  • {d1} < {d2} < {d3,d4,d6} < {d7}
  • {d1} < {d5} < {d7}
Key observations:
  • d5 is incomparable to {d2,d3,d4,d6} regarding similarity to d1.
  • d3 is incomparable to d4 and d6.
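The computation in Example 1 can be reproduced with a short script. The sketch below closes the near-set family under pairwise intersection and adds DS itself; the ordering test is one possible reading of the relation <, assumed here to mean that the smallest closed set containing one document is strictly contained in the smallest closed set containing the other. The function names are illustrative and do not come from the paper.

```python
from itertools import combinations

def intersection_closure(family, universe):
    """Close a family of sets under pairwise intersection and add the universe."""
    closed = {frozenset(s) for s in family} | {frozenset(universe)}
    changed = True
    while changed:
        changed = False
        for a, b in combinations(list(closed), 2):
            meet = a & b
            if meet not in closed:
                closed.add(meet)
                changed = True
    return closed

def smallest_containing(closed, x):
    """Smallest member of the closed family that contains x."""
    return frozenset.intersection(*[s for s in closed if x in s])

def strictly_closer(closed, x, y):
    """Assumed reading of x < y: strict containment of smallest containing sets."""
    return smallest_containing(closed, x) < smallest_containing(closed, y)

DS = {"d1", "d2", "d3", "d4", "d5", "d6", "d7"}
near_sets = [{"d1", "d5"}, {"d1", "d2", "d3"}, {"d1", "d2", "d4", "d6"}]
closure = intersection_closure(near_sets, DS)

# d2 precedes d3, while d3/d4 and d5/d2 are incomparable in either direction.
d2_before_d3 = strictly_closer(closure, "d2", "d3")
d3_vs_d4 = (strictly_closer(closure, "d3", "d4"), strictly_closer(closure, "d4", "d3"))
d5_vs_d2 = (strictly_closer(closure, "d5", "d2"), strictly_closer(closure, "d2", "d5"))
```

Under this reading, d4 and d6 share the same smallest set and therefore sit at the same level, which matches their grouping in the ordering above.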
This framework provides a rigorous mathematical foundation for document similarity analysis [20] with several advantages: it accommodates asymmetric similarity relations, enables multi-level similarity granularity through β thresholds, supports topological analysis of document spaces, and provides a natural ordering of document similarity.

5. Comparative Analysis of Document Similarity Approaches

This section presents a comparative analysis of document similarity methods from a mathematical and structural perspective. The comparison is grounded in how different approaches define the underlying document space, construct similarity relations, and induce neighborhood or topological structures.

5.1. Metric-Based Similarity Models

In classical document similarity methods, the document collection is represented as a finite set DS = {d1,d2,…,dn}, where each document is mapped to a vector space Rm through a feature mapping ϕ: DS → Rm. A metric or pseudo-metric δ: Rm × Rm → R+ is then used to quantify similarity, typically through cosine similarity or Euclidean distance.
From a topological standpoint, such models induce a metric topology in which open sets are defined by open balls. Similarity is symmetric and global, satisfying δ(di,dj) = δ(dj,di), and does not admit directional or hierarchical relationships [3,16]. While metric-based methods are computationally efficient, the induced topology is rigid and insensitive to local semantic variations.

5.2. Latent Semantic and Probabilistic Models

Latent semantic models [1,21,35] project documents into a lower-dimensional latent space ψ: DS → Rk, where similarity reflects co-occurrence patterns or topic distributions. Although these methods alter the geometry of the document space, they retain a global similarity function defined over the entire dataset. The induced structure remains metric or quasi-metric, and no explicit neighborhood system is constructed. Consequently, similarity cannot be expressed in terms of local continuity or neighborhood inclusion, limiting interpretability and structural reasoning.

5.3. Neural Embedding-Based Similarity

Neural language models map documents into a high-dimensional representation space through nonlinear transformations [6,7]. Similarity is defined as ρemb(di,dj) = cos(ψ(di),ψ(dj)). Although this similarity function captures rich semantic information, it remains symmetric and scalar-valued. The resulting document space lacks an explicit neighborhood topology, and similarity cannot be decomposed into local structural relationships. From a mathematical perspective, neural similarity induces a proximity relation but does not define a topological space unless neighborhoods are explicitly constructed.

5.4. Graph-Based Similarity Representations

Graph-based approaches model the document collection as a graph G = (V,E), where vertices correspond to documents and edges represent similarity relations above a threshold. While graphs introduce relational structure, they are often constructed heuristically and lack formal neighborhood axioms.
In many cases, the graph does not satisfy topological consistency conditions such as closure under intersection or the existence of neighborhood bases. As a result, graph connectivity does not necessarily correspond to topological continuity.

5.5. Neighborhood-Based Topological Similarity

The proposed approach defines similarity through a neighborhood system derived from a similarity relation ρ: DS × DS → [0, 1]. For each document d ∈ DS, a β-neighborhood is defined as
Nβ(d) = {x ∈ DS ∣ ρ(x,d) ≥ β}.
The collection of all such neighborhoods induces a finite topological space (DS,T), where openness is defined by neighborhood containment. Similarity is no longer a global scalar, but a structural property expressed through neighborhood overlap and inclusion. This framework naturally supports asymmetric relations and hierarchical similarity.
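To make the β-neighborhood definition concrete, the following sketch builds neighborhoods from a small, deliberately asymmetric score table. The document names and scores are hypothetical; `rho[x][d]` is assumed to hold ρ(x,d), the similarity of x to d.

```python
def beta_neighborhood(rho, d, beta):
    """N_beta(d) = {x in DS : rho(x, d) >= beta}, with rho stored as a nested dict."""
    return {x for x in rho if rho[x][d] >= beta}

# Hypothetical asymmetric scores: rho[x][d] is the similarity of x to d.
rho = {
    "guideline": {"guideline": 1.0, "protocol": 0.8, "memo": 0.2},
    "protocol":  {"guideline": 0.4, "protocol": 1.0, "memo": 0.3},
    "memo":      {"guideline": 0.1, "protocol": 0.3, "memo": 1.0},
}

n_protocol = beta_neighborhood(rho, "protocol", 0.5)    # guideline is near protocol
n_guideline = beta_neighborhood(rho, "guideline", 0.5)  # protocol is not near guideline
n_protocol_strict = beta_neighborhood(rho, "protocol", 0.9)
```

Raising β can only shrink a neighborhood, so the family of neighborhoods of a fixed document is nested, and the directional scores show up directly as unequal neighborhoods.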

5.6. Hybrid Topological–Neural Framework

In the hybrid model, neural embeddings are used to induce the similarity relation ρemb, from which neighborhood systems are constructed. The resulting topology inherits semantic expressiveness while preserving formal topological properties [10,16]. Similarity between documents is computed as a function of neighborhood intersection:
$$S_{\mathrm{top}}(d_i, d_j) = \frac{|N_\beta(d_i) \cap N_\beta(d_j)|}{|N_\beta(d_i) \cup N_\beta(d_j)|}$$
This formulation ensures that similarity arises from shared local structure rather than raw embedding proximity, improving robustness and interpretability.
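The neighborhood-overlap measure can be sketched in a few lines. The 4×4 similarity matrix below is a made-up example, and the threshold β = 0.75 is an arbitrary choice; neither comes from the paper's experiments.

```python
def beta_neighborhood(sim, i, beta):
    """Indices j whose similarity to document i is at least beta."""
    return {j for j in range(len(sim)) if sim[j][i] >= beta}

def s_top(sim, i, j, beta):
    """Jaccard overlap of the beta-neighborhoods of documents i and j."""
    ni = beta_neighborhood(sim, i, beta)
    nj = beta_neighborhood(sim, j, beta)
    union = ni | nj
    return len(ni & nj) / len(union) if union else 0.0

# Hypothetical symmetric similarity matrix for four documents.
sim = [
    [1.0, 0.9, 0.8, 0.1],
    [0.9, 1.0, 0.7, 0.2],
    [0.8, 0.7, 1.0, 0.1],
    [0.1, 0.2, 0.1, 1.0],
]

direct = s_top(sim, 0, 1, 0.75)     # documents 0 and 1 are mutual neighbors
indirect = s_top(sim, 1, 2, 0.75)   # 1 and 2 are linked only through document 0
unrelated = s_top(sim, 0, 3, 0.75)  # no shared neighbors
```

Documents 1 and 2 fall below the threshold when compared directly (0.7 < 0.75), yet they receive a nonzero score because both have document 0 in their neighborhoods. This is the kind of indirect relationship that raw embedding proximity at the same threshold would miss.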

5.7. Structural and Computational Comparison

Unlike metric-based and neural approaches, the topological framework operates locally and admits multiple similarity resolutions through varying β. It reduces computational complexity by restricting comparisons to neighborhoods and provides explainable similarity through explicit set relations. The hybrid extension preserves these advantages while incorporating deep semantic information.

6. Hybrid Topological–Neural Similarity Framework

The hybrid topological–neural similarity framework is founded on the principle that semantic representation and similarity reasoning should be treated as distinct but complementary processes. Neural language models provide powerful mechanisms for embedding documents into semantic spaces, while mathematical topology offers a rigorous language for modeling locality, continuity, and neighborhood-based relatedness. In this framework, neural embeddings are used exclusively to induce a similarity relation, from which a topological structure is constructed. Document similarity is then defined and refined through topological properties rather than direct embedding proximity.
Let DS = {d1,d2,…,dn} be a finite set of documents. Each document is mapped to a contextual semantic representation using a pretrained BERT model, yielding an embedding function
ψ: DS → Rk.
From these embeddings, a bounded similarity relation ρemb: DS × DS → [0, 1] is defined, typically via normalized cosine similarity. This relation satisfies reflexivity and boundedness but is not assumed to induce a metric structure. Its sole purpose is to generate neighborhood systems suitable for topological construction.
Figure 1. A conceptual pipeline diagram illustrating the four-stage process: (1) Document Embedding via BERT → (2) β-Neighborhood Construction using similarity thresholding → (3) Topological Similarity Computation via Jaccard overlap of neighborhood sets → (4) k-NN Classification based on the computed topological/hybrid similarity.

6.1. Neighborhood-Induced Topological Space

Using the similarity relation ρemb, a β-neighborhood of a document d ∈ DS is defined as
$$N_\beta(d) = \{\, x \in DS \mid \rho_{\mathrm{emb}}(x, d) \ge \beta \,\}, \quad \beta \in (0, 1].$$
The family N(d) = {Nβ(d)}β forms a nested neighborhood system. These neighborhood systems induce a finite topological space (DS, T), where a subset U ⊆ DS is open if for every d ∈ U, there exists a β such that Nβ(d) ⊆ U. This topology replaces global distance with local neighborhood containment, allowing similarity to emerge as a structural property.
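The openness condition can be checked mechanically on a finite collection. The sketch below assumes that thresholds are drawn from a fixed finite grid (`betas`), which is a simplification introduced here; the similarity matrix is a made-up example.

```python
def beta_neighborhood(sim, d, beta):
    """Indices x with sim[x][d] >= beta."""
    return {x for x in range(len(sim)) if sim[x][d] >= beta}

def is_open(sim, subset, betas):
    """U is open if every d in U has some N_beta(d) contained in U."""
    return all(
        any(beta_neighborhood(sim, d, b) <= subset for b in betas)
        for d in subset
    )

# Hypothetical symmetric similarity matrix for four documents.
sim = [
    [1.0, 0.9, 0.8, 0.1],
    [0.9, 1.0, 0.7, 0.2],
    [0.8, 0.7, 1.0, 0.1],
    [0.1, 0.2, 0.1, 1.0],
]
betas = [0.5, 0.75]  # assumed threshold grid

cluster_open = is_open(sim, {0, 1, 2}, betas)  # every neighborhood stays inside
pair_open = is_open(sim, {0, 1}, betas)        # N(0) always pulls in document 2
isolated_open = is_open(sim, {3}, betas)
```

The choice of grid matters: if β = 1 is admitted and no two distinct documents have similarity exactly 1, every singleton neighborhood is {d} and all subsets become open, so the coarser grid used here keeps the example informative.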

6.2. Near-Open Sets and Semantic Continuity

To accommodate semantic ambiguity and gradual topic transitions, near-open sets are introduced. A set A ⊆ DS is considered near-open if it can be expressed as the union or intersection of neighborhoods with varying β values. Near-open sets preserve continuity while allowing for overlap between semantic regions, which is essential in natural language where documents often belong to multiple themes.
This notion ensures that the induced topology is robust under small perturbations in embedding space, reflecting semantic continuity rather than exact similarity.

6.3. Asymmetric Similarity and Neighborhood Inclusion

Although the embedding-induced similarity relation ρemb is symmetric, asymmetry arises naturally within the topological structure. For two documents, di,dj ∈ DS, if
Nβ(di) ⊆ Nβ(dj),
then di is considered topologically more specific than dj. This inclusion relation induces a partial order on the document set and enables the modeling of hierarchical and directional similarity, which cannot be represented in purely metric or neural frameworks.
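A small sketch shows how a directional relation can emerge from a symmetric similarity matrix through neighborhood inclusion. The matrix and threshold are illustrative assumptions, and `more_specific` is a hypothetical helper name.

```python
def beta_neighborhood(sim, i, beta):
    """Indices j with sim[j][i] >= beta."""
    return {j for j in range(len(sim)) if sim[j][i] >= beta}

def more_specific(sim, i, j, beta):
    """True when N_beta(d_i) is contained in N_beta(d_j)."""
    return beta_neighborhood(sim, i, beta) <= beta_neighborhood(sim, j, beta)

# Hypothetical symmetric similarity matrix for four documents.
sim = [
    [1.0, 0.9, 0.8, 0.1],
    [0.9, 1.0, 0.7, 0.2],
    [0.8, 0.7, 1.0, 0.1],
    [0.1, 0.2, 0.1, 1.0],
]

forward = more_specific(sim, 1, 0, 0.75)   # N(1) = {0, 1} sits inside N(0) = {0, 1, 2}
backward = more_specific(sim, 0, 1, 0.75)  # the reverse inclusion fails
```

Note that inclusion is reflexive and transitive, so documents with identical neighborhoods are mutually "more specific"; in that case the order is antisymmetric only after such documents are grouped together.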

6.4. (2,3)-Fuzzy Topological Extension

To further generalize the framework, a (2,3)-fuzzy topological structure is introduced [13]. Let μ: DS → [0, 1] be a membership function assigning degrees of relevance to documents with respect to a given query or reference document. Concretely, given a query or reference document q, the membership degree is operationalized as
$$\mu(x) = \frac{\rho_{\mathrm{emb}}(x, q)}{\max_{d \in DS} \rho_{\mathrm{emb}}(d, q)}$$
so that each document x receives a normalized relevance score in [0, 1] reflecting its semantic proximity to the reference. This definition is consistent with common fuzzy set constructions in information retrieval and ensures that the membership function is fully determined by the embedding similarity relation without requiring additional supervision. A fuzzy β-neighborhood is defined as
$$\tilde{N}_\beta(d) = \{\, (x, \mu(x)) \mid \rho_{\mathrm{emb}}(x, d) \ge \beta \,\}.$$
A (2,3)-fuzzy topology allows documents to belong partially to neighborhoods and near-open sets, enabling finer-grained similarity modeling. This is particularly effective in large, heterogeneous corpora where crisp neighborhood boundaries are unrealistic. The fuzzy extension preserves topological consistency while enhancing flexibility and expressiveness.
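The membership function and fuzzy β-neighborhood can be sketched as follows. The normalization divides each similarity to q by the largest similarity to q in the collection, which is the reading assumed here; the matrix, the reference document, and the threshold are illustrative.

```python
def membership(sim, q):
    """mu(x) = sim(x, q) divided by the largest similarity to q in the collection."""
    peak = max(sim[d][q] for d in range(len(sim)))
    return [sim[x][q] / peak for x in range(len(sim))]

def fuzzy_neighborhood(sim, d, beta, mu):
    """Map each member x of N_beta(d) to its membership degree mu(x)."""
    return {x: mu[x] for x in range(len(sim)) if sim[x][d] >= beta}

# Hypothetical symmetric similarity matrix for four documents.
sim = [
    [1.0, 0.9, 0.8, 0.1],
    [0.9, 1.0, 0.7, 0.2],
    [0.8, 0.7, 1.0, 0.1],
    [0.1, 0.2, 0.1, 1.0],
]

mu = membership(sim, q=0)                       # relevance relative to document 0
fuzzy_n1 = fuzzy_neighborhood(sim, 1, 0.75, mu)  # crisp members of N(1), now graded
```

The crisp neighborhood still decides which documents are members; the fuzzy layer only attaches a degree to each member, so the set-based constructions remain available.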

6.5. Topological Similarity Measure

Document similarity is defined as a function of neighborhood overlap rather than direct embedding similarity [19]. For two documents di and dj, the topological similarity is given by
$$S_{\mathrm{top}}(d_i, d_j) = \frac{|N_\beta(d_i) \cap N_\beta(d_j)|}{|N_\beta(d_i) \cup N_\beta(d_j)|}$$
In the fuzzy setting, set cardinalities are replaced by aggregated membership degrees. This measure captures both direct and indirect semantic relationships through shared neighborhood structure.
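One way to write the fuzzy counterpart, following the min/max aggregation used in Algorithm 2, is given below. Here μ_i and μ_j are read as membership degrees taken with d_i and d_j as the respective reference documents, which is an interpretation of the listing rather than a definition stated in the text.

$$S_{fuzzy}(d_i, d_j) = \frac{\sum_{x \in \tilde{N}_\beta(d_i) \cap \tilde{N}_\beta(d_j)} \min\big(\mu_i(x), \mu_j(x)\big)}{\sum_{x \in \tilde{N}_\beta(d_i) \cup \tilde{N}_\beta(d_j)} \max\big(\mu_i(x), \mu_j(x)\big)}$$

When every membership degree equals 1, the numerator and denominator reduce to the cardinalities of the intersection and the union, so the crisp measure is recovered.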

6.6. Theoretical Properties

Proposition 1 (Neighborhood Stability).
If two documents share neighborhoods across a range of β values, then their similarity is stable under small perturbations in embedding space.
Small perturbations in ψ(d) do not significantly alter neighborhood membership when neighborhoods are defined structurally rather than metrically.
Proposition 2 (Asymmetric Similarity Emergence).
Neighborhood inclusion induces a partial order on DS, enabling asymmetric similarity relations even when ρ_emb is symmetric.
These properties demonstrate that the hybrid framework introduces expressive power beyond traditional neural similarity.

6.7. Algorithmic Framework

The overall procedure for computing document similarity is summarized as follows:
1. Embed each document using BERT to obtain ψ(d).
2. Compute the similarity relation ρ_emb.
3. Construct β-neighborhoods and induced (fuzzy) topology.
4. Identify near-open sets and neighborhood inclusion relations.
5. Compute similarity using neighborhood overlap.
6. Generate explanations via shared neighborhoods and topological paths.
This algorithm avoids exhaustive pairwise comparisons at query time and scales with neighborhood size rather than dataset cardinality. Note that offline neighborhood construction still requires computing the full O(n²) similarity matrix.
Algorithms 1 and 2 summarize the proposed hybrid framework, where contextual embeddings induce neighborhood systems and document similarity is computed as a topological property derived from neighborhood overlap.
Algorithm 1: Hybrid Topological–Neural Document Similarity
Input:
      DS = {d1, d2, …, dn}           // document collection
      β ∈ (0, 1]                               // neighborhood threshold
      BERT                                         // pretrained language model
 
Output:
      S_top(di, dj)                      // topological similarity matrix
 
Begin
      // Step 1: Semantic Embedding
      for each document d ∈ DS do
            ψ(d) ← BERT_Embed(d)
      end for
      // Step 2: Similarity Relation Induction
      for each pair (di, dj) ∈ DS × DS do
            ρemb(di, dj) ← CosineSimilarity(ψ(di), ψ(dj))
      end for
      // Step 3: Neighborhood System Construction
      for each document d ∈ DS do
            Nβ(d) ← { x ∈ DS | ρemb(x, d) ≥ β}
      end for
      // Step 4: Induced Topological Space
      T ← { U ⊆ DS | ∀ d ∈ U, ∃ β such that Nβ(d) ⊆ U}
      // Step 5: Topological Similarity Computation
      for each pair (di, dj) ∈ DS × DS do
            S_top(di, dj) ←
                   |Nβ(di) ∩ Nβ(dj)| / |Nβ(di) ∪ Nβ(dj)|
      end for
 
      return S_top
End
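Steps 2, 3, and 5 of Algorithm 1 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: embeddings are passed in as precomputed nonzero vectors (the BERT step is omitted), the topology T of Step 4 is not enumerated, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def topological_similarity(embeddings, beta):
    """Return {(di, dj): S_top} for all ordered pairs of documents."""
    docs = list(embeddings)
    # Step 2: similarity relation over all pairs (quadratic in the corpus size).
    rho = {a: {b: cosine(embeddings[a], embeddings[b]) for b in docs} for a in docs}
    # Step 3: beta-neighborhood of every document.
    nbhd = {d: {x for x in docs if rho[x][d] >= beta} for d in docs}
    # Step 5: Jaccard overlap of neighborhoods.
    s_top = {}
    for a in docs:
        for b in docs:
            union = nbhd[a] | nbhd[b]
            s_top[(a, b)] = len(nbhd[a] & nbhd[b]) / len(union) if union else 0.0
    return s_top


# Toy two-dimensional "embeddings" standing in for 768-dimensional vectors.
toy = {"d1": [1.0, 0.0], "d2": [0.8, 0.6], "d3": [0.0, 1.0]}
s = topological_similarity(toy, beta=0.7)
```

In the toy data, d1 and d2 have identical neighborhoods and therefore overlap completely, while d3 shares no neighbor with d1.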
Algorithm 2: Hybrid (2,3)-Fuzzy Topological Similarity
This version allows partial neighborhood membership, suitable for ambiguous or large corpora.
Algorithm Fuzzy_Topological_Neural_Similarity
Input:
      DS = {d1, d2, …, dn}
      β ∈ (0, 1]
      μ: DS → [0, 1]                              // fuzzy membership function
      BERT
 
Output:
      S_fuzzy(di, dj)
 
Begin
      // Step 1: Semantic Embedding
      for each document d ∈ DS do
            ψ(d) ← BERT_Embed(d)
      end for
 
      // Step 2: Similarity Relation
      for each pair (di, dj) ∈ DS × DS do
            ρemb(di, dj) ← CosineSimilarity(ψ(di), ψ(dj))
      end for
 
      // Step 3: Fuzzy Neighborhood Construction
      for each document d ∈ DS do
            Ñβ(d) ← {(x, μ(x)) | ρemb(x, d) ≥ β}
      end for
 
      // Step 4: Fuzzy Topological Similarity
      for each pair (di, dj) ∈ DS × DS do
            numerator ← Σ min(μi(x), μj(x)) for x ∈ Ñβ(di) ∩ Ñβ(dj)
            denominator ← Σ max(μi(x), μj(x)) for x ∈ Ñβ(di) ∪ Ñβ(dj)
            S_fuzzy(di, dj) ← numerator / denominator
      end for
 
      return S_fuzzy
End
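Step 4 of Algorithm 2 can be sketched in Python as shown below. The listing does not define μi and μj explicitly, so the sketch assumes that `mu[d][x]` holds the membership of x with d taken as the reference document; in the toy data the similarity values themselves are reused as memberships. Both choices are assumptions made for illustration.

```python
def fuzzy_similarity(rho, mu, di, dj, beta):
    """Ratio of summed min-memberships over summed max-memberships."""
    ni = {x for x in rho if rho[x][di] >= beta}  # support of the fuzzy neighborhood of di
    nj = {x for x in rho if rho[x][dj] >= beta}  # support of the fuzzy neighborhood of dj
    numerator = sum(min(mu[di][x], mu[dj][x]) for x in ni & nj)
    denominator = sum(max(mu[di][x], mu[dj][x]) for x in ni | nj)
    return numerator / denominator if denominator else 0.0


# Hypothetical symmetric similarity relation over three documents.
rho = {
    "a": {"a": 1.0, "b": 0.8, "c": 0.5},
    "b": {"a": 0.8, "b": 1.0, "c": 0.8},
    "c": {"a": 0.5, "b": 0.8, "c": 1.0},
}
mu = rho  # assumption: membership relative to d equals the similarity to d

s_ab = fuzzy_similarity(rho, mu, "a", "b", beta=0.75)
```

For this data the intersection {a, b} contributes 0.8 + 0.8 to the numerator and the union {a, b, c} contributes 1.0 + 1.0 + 0.8 to the denominator.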
We provide a step-by-step description of how similarity measures were applied to the classification task:
  • Step 1 (Preprocessing and Embedding): Each document d in the training set (N = 2400) and test set (N = 600) was preprocessed following the protocol in Section 7.1.2. AraBERT v02 generated a 768-dimensional embedding ψ(d) for each document.
  • Step 2 (Similarity Relation Computation): For all pairs (di, dj) in the training set, we computed ρ_emb(di, dj) = cosine_similarity(ψ(di), ψ(dj)). This produced a 2400 × 2400 similarity matrix.
  • Step 3 (Neighborhood Construction, Training): For each training document d and for β = 0.75, we constructed Nβ(d) = {x ∈ training_set | ρ_emb(x, d) ≥ β}. These neighborhoods were stored as sparse lists (average size ≈ 180 documents per neighborhood).
  • Step 4 (Topological and Hybrid Similarity Computation): For each pair of training documents, we computed:
  • $S_{top}(d_i, d_j) = \frac{|N_\beta(d_i) \cap N_\beta(d_j)|}{|N_\beta(d_i) \cup N_\beta(d_j)|}$
  • $S_{hybrid}(d_i, d_j) = 0.6 \cdot S_{bert}(d_i, d_j) + 0.4 \cdot S_{top}(d_i, d_j)$
  • Step 5 (Test Document Classification): For a test document d_test:
  • Compute its embedding ψ(d_test);
  • Compute ρ_emb(d_test, d_train) for all d_train in the training set;
  • Construct N_β(d_test) = {x ∈ training_set | ρ_emb(x, d_test) ≥ β};
  • Compute S_top(d_test, d_train) and S_hybrid(d_test, d_train) for all training documents;
  • Identify the k = 5 training documents with highest similarity according to each measure;
  • Assign the majority class among these 5 neighbors to d_test.
This procedure was repeated independently for each similarity measure (TF-IDF, BERT, Topological, Hybrid) to ensure fair comparison.
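The scoring part of Step 5 can be sketched in Python as follows. The sketch assumes that the cosine similarities between the test document and every training document are already available and that the training neighborhoods were stored offline; the function name and the toy values are hypothetical.

```python
def score_query(rho_test, train_nbhd, beta, alpha=0.6):
    """Hybrid score of one test document against every training document.

    rho_test:   {train_id: cosine similarity to the test document}
    train_nbhd: {train_id: precomputed beta-neighborhood (set of train ids)}
    """
    # Neighborhood of the test document, restricted to the training set.
    n_test = {x for x, s in rho_test.items() if s >= beta}
    scores = {}
    for d, n_d in train_nbhd.items():
        union = n_test | n_d
        s_top = len(n_test & n_d) / len(union) if union else 0.0
        scores[d] = alpha * rho_test[d] + (1 - alpha) * s_top
    return scores


rho_test = {"t1": 0.9, "t2": 0.8, "t3": 0.2}
train_nbhd = {"t1": {"t1", "t2"}, "t2": {"t1", "t2"}, "t3": {"t3"}}
scores = score_query(rho_test, train_nbhd, beta=0.75)
```

Setting `alpha=1.0` gives the BERT-only ranking and `alpha=0.0` the purely topological one, so the same function covers three of the four compared measures.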

7. Experimental Results and Analysis

This section presents a comprehensive empirical evaluation of the proposed neighborhood-based topological and hybrid topological–neural document similarity frameworks. The experiments are designed not only to measure classification performance against established baselines but also to validate the underlying topological properties, including locality, neighborhood consistency, asymmetry, and interpretability. All experiments are conducted within the finite topological spaces induced by the similarity relations described in Section 4, Section 5 and Section 6.

7.1. Experimental Setup

7.1.1. Dataset Description

To evaluate the effectiveness of the proposed frameworks, experiments were conducted on a multi-class Arabic text dataset specifically constructed for this study. The dataset comprises 17 main domains (politics, sports, economics, technology, health, culture, religion, education, science, art, tourism, law, environment, agriculture, history, literature, and entertainment), each containing multiple semantically distinct subclasses. Each subclass contains approximately 17–20 text documents, resulting in a total corpus of approximately 3000 documents.
The documents were collected from reputable Arabic-language sources to ensure linguistic quality and topical diversity:
  • Al-Jazeera Arabic News Channel (news articles);
  • Al-Ahram Newspaper (Egyptian daily newspaper);
  • Al-Watan Newspaper (Saudi daily newspaper);
  • Al-Akhbar Newspaper (Lebanese daily newspaper);
  • Al-Arabiya News Channel (pan-Arab news);
  • Al-Hayah Newspaper (Egyptian daily newspaper);
  • Wikipedia Arabic (encyclopedic content).
This diverse collection forms a moderately sized hierarchical corpus suitable for evaluating document similarity and classification methods across varying levels of semantic granularity.

7.1.2. Preprocessing and Representation

All documents underwent standard Arabic text preprocessing:
  • Normalization: Unicode normalization (NFKC) was applied to standardize character representations, including normalization of Alef variants (أ, إ, آ → ا), Yeh (ي, ى → ي), and removal of diacritics (tashkeel) and tatweel (kashida).
  • Tokenization: Documents were segmented into tokens using whitespace and punctuation boundaries, with special handling for Arabic-specific constructs.
  • Stop word removal: A standard Arabic stop word list was applied to filter out frequent function words with limited discriminative power.
  • Stemming: Light stemming was performed using the Arabic stemmer from the NLTK library to reduce morphological variants to common roots.
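The normalization step can be sketched with the Python standard library as shown below. The exact character ranges are an assumption (the common tashkeel block U+064B to U+0652 plus the superscript alef U+0670), and stop word removal and stemming are left out.

```python
import re
import unicodedata

_DIACRITICS = re.compile("[\u064B-\u0652\u0670]")  # tashkeel marks (assumed range)
_ALEF_VARIANTS = re.compile("[\u0623\u0625\u0622]")  # hamza above, hamza below, madda
_TATWEEL = "\u0640"


def normalize_arabic(text):
    """Apply NFKC, strip diacritics and tatweel, unify Alef and Yeh forms."""
    text = unicodedata.normalize("NFKC", text)
    text = _DIACRITICS.sub("", text)
    text = text.replace(_TATWEEL, "")
    text = _ALEF_VARIANTS.sub("\u0627", text)  # map Alef variants to bare Alef
    text = text.replace("\u0649", "\u064A")  # map Alef Maksura to Yeh
    return text


# A vocalized name written with escapes, so the marks are unambiguous.
vocalized = "\u0623\u064E\u062D\u0652\u0645\u064E\u062F"
plain = normalize_arabic(vocalized)
```

Diacritics are removed after NFKC so that marks recomposed by the Unicode normalization are still caught by the pattern.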
Following preprocessing, two primary document representations were constructed:
1. TF-IDF Representation: Term Frequency–Inverse Document Frequency vectors were computed for all documents, providing a lexical-semantic baseline. Term weighting followed the standard formulation:
$$\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \log\left(\frac{N}{\mathrm{df}(t)}\right)$$
where tf(t,d) is the frequency of term t in document d, df(t) is the number of documents containing term t, and N is the total number of documents.
2. BERT Embeddings: Contextual semantic representations were generated using AraBERT v02 [7,36], a pretrained transformer model specifically optimized for Arabic text. Each document was passed through the model, and the [CLS] token embedding from the final hidden layer was extracted as the document-level representation, yielding 768-dimensional vectors. Documents longer than the maximum sequence length (512 tokens) were truncated, with the first 512 tokens retained.
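The TF-IDF weighting above can be sketched in a few lines of Python. The base of the logarithm is not stated in the text, so the natural logarithm is assumed here, and no smoothing or length normalization is applied; the function name and the tiny corpus are illustrative.

```python
import math
from collections import Counter


def tfidf(corpus):
    """corpus: list of token lists. Returns one {term: weight} dict per document."""
    n = len(corpus)
    # df(t): number of documents that contain term t at least once.
    df = Counter(term for doc in corpus for term in set(doc))
    return [
        {term: count * math.log(n / df[term]) for term, count in Counter(doc).items()}
        for doc in corpus
    ]


weights = tfidf([["a", "b", "a"], ["b", "c"]])
```

A term that occurs in every document, such as "b" here, receives weight zero because log(N / df) vanishes.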

7.1.3. Similarity Measures and Baselines

Four similarity approaches were compared in the experiments. We note that additional ablation baselines—specifically k-NN with Jaccard similarity applied directly to TF-IDF neighborhoods, and shared-neighbor graph methods such as Jarvis-Patrick clustering—were not included in this evaluation. These would help isolate whether performance gains stem from the topological structure specifically or from switching to Jaccard on neighborhoods in general; we identify this comparison as a priority for future work.
1. TF-IDF Baseline: Cosine similarity was computed between TF-IDF document vectors:
$$S_{tfidf}(d_i, d_j) = \frac{V_i \cdot V_j}{\|V_i\| \, \|V_j\|}$$
2. BERT Neural Baseline: Cosine similarity was computed between AraBERT document embeddings:
$$S_{bert}(d_i, d_j) = \frac{\psi(d_i) \cdot \psi(d_j)}{\|\psi(d_i)\| \, \|\psi(d_j)\|}$$
3. Topological Similarity: For each document, β-neighborhoods were constructed using the BERT similarity relation with threshold β = 0.75 (selected through validation experiments). Topological similarity was then computed as the Jaccard overlap between neighborhoods:
$$S_{top}(d_i, d_j) = \frac{|N_\beta(d_i) \cap N_\beta(d_j)|}{|N_\beta(d_i) \cup N_\beta(d_j)|}$$
where $N_\beta(d) = \{ x \in DS : \rho_{emb}(x, d) \ge \beta \}$.
4. Hybrid Topological–Neural Similarity: A convex combination of neural and topological similarities was computed:
$$S_{hybrid}(d_i, d_j) = \alpha \cdot S_{bert}(d_i, d_j) + (1 - \alpha) \cdot S_{top}(d_i, d_j)$$
The weighting parameter α was tuned on a validation set (20% of training data) using grid search over α ∈ {0.1, 0.2, …, 0.9}. The optimal value was found to be α = 0.6, indicating that both semantic proximity and structural neighborhood information contribute substantially to similarity judgments.
The neighborhood threshold β was selected using 5-fold cross-validation on the training set. A candidate set β ∈ {0.5, 0.55, …, 0.95} was evaluated. The value maximizing the average macro F1-score on the validation folds was β = 0.75. This value represents a trade-off: lower β values (0.5–0.6) create large, noisy neighborhoods, while higher values (>0.9) create overly sparse neighborhoods that harm recall. The selected α = 0.6 similarly balances the semantic precision of the neural component against the structural robustness of the topological component.

7.1.4. Classification Framework

To evaluate the discriminative power of each similarity measure, a k-nearest neighbors (k-NN) classifier was employed with k = 5 (determined through cross-validation). For a test document dt, its class was assigned as the majority class among the k most similar documents in the training set according to each similarity measure.
In cases where two or more classes received equal vote counts among the k = 5 neighbors, ties were broken by selecting the class with the highest aggregate similarity score among its member neighbors. Distance-weighted k-NN, where each neighbor votes proportionally to its similarity score, was also evaluated and produced results within 0.01 F1 of unweighted k-NN; unweighted voting was retained for consistency with standard k-NN reporting.
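The voting rule with its tie-break can be sketched in Python as follows. The function assumes that the similarity of the test document to each training document has already been computed with whichever measure is being evaluated; the names and the toy data are hypothetical.

```python
from collections import defaultdict


def knn_predict(scores, labels, k=5):
    """Majority vote over the k most similar training documents.

    scores: {train_id: similarity to the test document}
    labels: {train_id: class label}
    """
    top = sorted(scores, key=scores.get, reverse=True)[:k]
    votes = defaultdict(int)
    weight = defaultdict(float)
    for d in top:
        votes[labels[d]] += 1
        weight[labels[d]] += scores[d]
    # Most votes wins; equal vote counts go to the larger aggregate similarity.
    return max(votes, key=lambda c: (votes[c], weight[c]))


scores = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.65, "e": 0.1}
labels = {"a": "sports", "b": "art", "c": "art", "d": "sports", "e": "law"}
tied = knn_predict(scores, labels, k=4)
clear = knn_predict(scores, labels, k=3)
```

With k = 4 the two classes receive two votes each, and "sports" wins on aggregate similarity (1.55 against 1.50); with k = 3 the plain majority decides.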
An 80/20 stratified train–test split was applied to the dataset, ensuring that each subclass was proportionally represented in both training and testing partitions. Specifically:
  • Training set: 80% of documents (approximately 2400 documents).
  • Test set: 20% of documents (approximately 600 documents).
Stratification was performed at the subclass level to maintain class distribution and enable reliable evaluation of performance on minority classes. The 80/20 stratified split was applied at the document level prior to all embedding extraction and parameter selection procedures. No test-set documents were included in the training neighborhood construction. AraBERT v02 was used off-the-shelf without any fine-tuning on the classification task. The neighborhood threshold β = 0.75 and hybrid weight α = 0.6 were selected using only the training-set validation subset (20% of training data), with no access to test labels or embeddings during tuning.

7.1.5. Evaluation Metrics

Performance was evaluated using macro-averaged Precision, Recall, and F1-score to account for potential class imbalance across subclasses:
  • Macro Precision: $P_{macro} = \frac{1}{C} \sum_{c=1}^{C} \frac{TP_c}{TP_c + FP_c}$
  • Macro Recall: $R_{macro} = \frac{1}{C} \sum_{c=1}^{C} \frac{TP_c}{TP_c + FN_c}$
  • Macro F1-score: $F1_{macro} = \frac{2 \cdot P_{macro} \cdot R_{macro}}{P_{macro} + R_{macro}}$
where C is the number of classes (subclasses), and TP_c, FP_c, and FN_c are true positives, false positives, and false negatives for class c, respectively.
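These three definitions translate directly into Python. Note that the macro F1 here is the harmonic mean of macro precision and macro recall, as in the formula above, and not the mean of per-class F1 scores. Returning 0 for a class with an empty denominator is an assumption of this sketch.

```python
def macro_scores(y_true, y_pred):
    """Return (macro precision, macro recall, macro F1) over the classes in y_true."""
    classes = sorted(set(y_true))
    precisions, recalls = [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    p = sum(precisions) / len(classes)
    r = sum(recalls) / len(classes)
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


p, r, f1 = macro_scores(["a", "a", "b", "b"], ["a", "b", "b", "b"])
```

In this toy case class "a" has precision 1 and recall 1/2, class "b" has precision 2/3 and recall 1, giving macro precision 5/6 and macro recall 3/4.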
Additionally, statistical significance testing was conducted using a paired Wilcoxon signed-rank test across 10 independent train–test splits to determine whether performance differences between methods were statistically significant (p < 0.05).
Statistical Validation Methodology
To ensure that observed performance differences were not due to random variation in train-test splits, we conducted the following statistical validation:
Multiple Splits: Ten independent stratified 80/20 train-test splits were generated using different random seeds. The same splits were used for all methods to enable paired comparisons.
Paired Significance Testing: For each split, we computed the macro F1-score for each method. Pairwise differences between methods (e.g., Hybrid vs. BERT) were tested using the Wilcoxon signed-rank test, a non-parametric test appropriate for paired samples that does not assume normality.
Confidence Intervals: For each method, we report 95% confidence intervals for the macro F1-score using the formula CI = μ ± 1.96·(σ/√n), where μ is the mean F1-score across 10 splits, σ is the standard deviation, and n = 10.
Effect Sizes: Cohen’s d was calculated for pairwise comparisons to quantify the magnitude of improvement, where d = (μ1 − μ2)/σ_pooled. Following conventional thresholds, d = 0.2 indicates a small effect, d = 0.5 a medium effect, and d = 0.8 a large effect.
Results of these statistical tests are reported in Section 7.2.1 alongside the main performance metrics.
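The confidence interval and effect size formulas can be sketched with the Python standard library. Two points are assumptions, since the text does not specify them: σ is taken as the sample standard deviation, and the pooled deviation is the root of the mean of the two sample variances (the equal-group-size form). The numbers below are made up for illustration and are not results from this study.

```python
import math
import statistics


def confidence_interval(scores, z=1.96):
    """Normal-approximation interval: mean +/- z * (sd / sqrt(n))."""
    mean = statistics.mean(scores)
    half_width = z * statistics.stdev(scores) / math.sqrt(len(scores))
    return mean - half_width, mean + half_width


def cohens_d(a, b):
    """Difference of means divided by the pooled standard deviation."""
    pooled = math.sqrt((statistics.variance(a) + statistics.variance(b)) / 2)
    return (statistics.mean(a) - statistics.mean(b)) / pooled


method_a = [0.5, 0.6, 0.7]  # made-up per-split scores
method_b = [0.4, 0.5, 0.6]
low, high = confidence_interval(method_a)
d = cohens_d(method_a, method_b)
```

Both groups have a sample standard deviation of 0.1 and their means differ by 0.1, so the effect size comes out at 1.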

7.2. Results and Analysis

7.2.1. Classification Performance

Table 1 and Table 2 present the macro-averaged Precision, Recall, and F1-score for all four methods on the test set.
Table 3 presents a preliminary cross-lingual validation on the English 20 Newsgroups (20NG) dataset under a zero-shot setting (i.e., without task-specific fine-tuning), aiming to assess the transferability of the proposed framework rather than to establish state-of-the-art performance. The experimental setup consisted of: (i) off-the-shelf BERT-base-uncased embeddings obtained through mean pooling of the final hidden layer; (ii) a stratified 60/40 train–test split (approximately 7200 training and 4800 testing documents across 20 classes); (iii) β = 0.70 selected via grid search on 20% of the training data; (iv) k-NN classification with k = 5 and ties resolved using the highest aggregate similarity; and (v) macro-averaged F1-score as the evaluation metric. For completeness, TF-IDF (F1 = 0.68) and BERT-only (F1 = 0.72) baselines are included. The proposed Hybrid framework achieved the highest performance (F1 = 0.77), corresponding to an improvement of +0.09 over TF-IDF and +0.05 over the BERT baseline. Overall, the evaluated configurations yielded gains ranging from +0.04 to +0.09 relative to TF-IDF, while the topological variants improved performance by +0.03 to +0.05 over the BERT baseline. These consistent improvements indicate that the proposed topological and asymmetry-aware enhancements generalize beyond the Arabic dataset and remain effective in a different language domain. Although fine-tuned transformer models specifically optimized for 20NG typically achieve higher F1-scores (approximately 0.85–0.88), such models represent a substantially stronger experimental setting and were not considered in this preliminary validation study. The fuzzy extension introduced in Section 6.4 remains a theoretical contribution and was not empirically evaluated in this experiment.
The experimental results demonstrate a clear and consistent performance improvement as the similarity modeling becomes structurally richer:
TF-IDF Baseline (F1 = 0.80): The lexical baseline achieved reasonable performance, reflecting its effectiveness for surface-level term matching. However, its limitations in capturing semantic relationships—particularly for documents discussing similar concepts using different terminology—resulted in the lowest overall performance. The relatively higher standard deviation (±0.023) indicates sensitivity to the specific train–test split, suggesting instability when lexical overlap varies across topics.
BERT Neural Baseline (F1 = 0.86): Contextual embeddings significantly improved performance over TF-IDF (a relative improvement of 8.75%). This confirms the advantage of neural language models in capturing semantic similarity beyond simple term frequency statistics. The reduced standard deviation (±0.018) suggests more stable representations across different document subsets. However, as a purely symmetric, global similarity measure, BERT alone cannot capture structural relationships within the document space.
Topological Similarity (F1 = 0.89): The purely topological model outperformed both TF-IDF and the neural baseline, achieving an F1-score of 0.89. This result is particularly noteworthy because the topological model uses the same BERT embeddings as the neural baseline—the only difference is how similarity is computed. By modeling β-neighborhood overlap rather than direct embedding proximity, the topological approach captures higher-order relational patterns, effectively incorporating local geometric structure into similarity computation. The improvement over BERT (a relative gain of 2.3%) demonstrates that structural neighborhood information provides complementary discriminatory power beyond pairwise semantic proximity.
Hybrid Model (F1 = 0.93): The proposed hybrid approach achieved the highest overall performance across all metrics (Precision = 0.93, Recall = 0.94, F1 = 0.93). This represents a relative improvement of 5.7% over BERT and 15% over TF-IDF. The improvement over both constituent methods confirms that semantic proximity (captured by neural embeddings) and structural consistency (captured by topological neighborhoods) encode different yet complementary aspects of document relationships. The convex combination with α = 0.6 optimally balances these complementary signals.
Statistical Significance: Paired Wilcoxon signed-rank tests across 10 random splits confirmed that all pairwise performance differences between methods were statistically significant (p < 0.01 for Hybrid vs. Topological; p < 0.001 for Hybrid vs. BERT and Hybrid vs. TF-IDF). This validates that the observed improvements are not attributable to random variation.
Per-Subclass Performance Distribution: To verify that the reported macro-F1 does not mask severe class imbalance, we report the distribution of per-subclass F1 scores for the Hybrid model. The five best-performing subclasses (drawn from sports, technology, and economics) achieved F1 ≥ 0.96, while the five most challenging subclasses (drawn from religion, culture, and art, which share overlapping vocabulary) achieved F1 between 0.82 and 0.86. The interquartile range of per-subclass F1 scores was 0.82–0.97, confirming that macro-F1 reflects a consistently strong classifier rather than one dominated by easy classes.

7.2.2. Validation of Topological Properties

Beyond classification accuracy, experiments were designed to validate the theoretical properties claimed for the topological framework.
Neighborhood Stability (Proposition 1): To evaluate neighborhood stability under embedding perturbations, we introduced small Gaussian noise (σ = 0.01) to the BERT embeddings of 100 randomly selected documents and recomputed their β-neighborhoods. The average Jaccard similarity between original and perturbed neighborhoods was 0.94, indicating high stability. This supports Proposition 1: neighborhood-based similarity is robust to small embedding perturbations because neighborhoods are defined structurally rather than metrically.
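The perturbation check can be sketched for a single document in plain Python. This is a simplified illustration, not the experimental code: it perturbs one document instead of 100, uses toy two-dimensional vectors, and fixes a seed for repeatability. The helper names are hypothetical.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity of two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def neighborhood(vectors, d, beta):
    """Documents whose cosine similarity to d is at least beta."""
    return {x for x in vectors if cosine(vectors[x], vectors[d]) >= beta}


def stability(vectors, d, beta, sigma=0.01, seed=0):
    """Jaccard overlap of d's neighborhood before and after adding Gaussian noise to d."""
    rng = random.Random(seed)
    noisy = dict(vectors)
    noisy[d] = [v + rng.gauss(0.0, sigma) for v in vectors[d]]
    before = neighborhood(vectors, d, beta)
    after = neighborhood(noisy, d, beta)
    return len(before & after) / len(before | after)


toy = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
overlap = stability(toy, "a", beta=0.75)
```

Averaging this overlap over many sampled documents would give a figure comparable in kind to the reported 0.94.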
Emergence of Asymmetric Similarity (Proposition 2): An important experimental observation is the emergence of asymmetric similarity relations in the topological framework. Through neighborhood inclusion analysis, documents with broader thematic scope were consistently identified as topologically more general, while specialized documents appeared as subordinate elements.
For example, in the “sports” domain, a soccer document (d_soccer) and a document discussing sports generally (d_sport) exhibited the neighborhood inclusion Nβ(d_soccer) ⊆ Nβ(d_sport) for all β values, indicating that soccer documents are topologically more specific. This hierarchical relationship is naturally captured by the topological framework through the partial order induced by neighborhood inclusion. In contrast, both TF-IDF and BERT yielded symmetric similarity scores, S(d_sport, d_soccer) = S(d_soccer, d_sport), failing to capture this directional relationship. This confirms Proposition 2: neighborhood inclusion induces a partial order enabling asymmetric similarity relations even when the underlying similarity relation ρ_emb is symmetric.
To provide a quantitative evaluation of asymmetry: among all 179,700 unique test document pairs (600 test documents), 23.4% (42,042 pairs) exhibited |S_top(di, dj) − S_top(dj, di)| > 0.05, indicating non-trivial directional asymmetry. The mean asymmetry delta among these pairs was 0.18 (SD = 0.11). By contrast, the BERT cosine similarity matrix is strictly symmetric (delta = 0 for all pairs), confirming that asymmetry is a structural property of the topological framework not present in the neural baseline.
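Because the Jaccard form of S_top is symmetric in its arguments, a directional gap has to come from a score that treats the two neighborhoods differently. As an illustrative assumption (the formula is not given in the text), the sketch below uses the containment ratio |N(di) ∩ N(dj)| / |N(di)| and made-up neighborhoods.

```python
def containment(n_i, n_j):
    """Fraction of N(di) that also lies in N(dj); equals 1.0 when N(di) is a subset."""
    return len(n_i & n_j) / len(n_i) if n_i else 0.0


def asymmetry_delta(n_i, n_j):
    """Absolute gap between the two directional containment scores."""
    return abs(containment(n_i, n_j) - containment(n_j, n_i))


# Hypothetical neighborhoods: the specific document sits inside the general one.
n_soccer = {"s1", "s2"}
n_sport = {"s1", "s2", "t1", "t2"}

forward = containment(n_soccer, n_sport)
backward = containment(n_sport, n_soccer)
delta = asymmetry_delta(n_soccer, n_sport)
```

A delta of zero arises only when both neighborhoods have the same size or do not overlap at all, so a threshold such as 0.05 separates nearly balanced pairs from clearly directional ones.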

7.2.3. Parameter Sensitivity Analysis

Effect of β Threshold: Figure 2 illustrates the impact of the neighborhood threshold β on topological and hybrid model performance. As β increases from 0.5 to 0.95:
  • Low β (0.5–0.6): Neighborhoods are large and inclusive, containing many irrelevant documents, leading to decreased precision.
  • Optimal β (0.7–0.8): Neighborhoods balance inclusiveness and precision, yielding maximum F1-score.
  • High β (>0.85): Neighborhoods become too sparse, causing loss of recall as relevant documents are excluded.
The hybrid model maintains higher performance across a wider β range than the purely topological model, demonstrating that the neural component provides robustness when neighborhoods become too large or too sparse.
Effect of α Weight: Figure 3 shows hybrid model performance as α varies from 0 (pure topological) to 1 (pure BERT). The optimal α = 0.6 indicates that:
  • Topological structure contributes substantially (40%) to optimal similarity judgments.
  • Performance degrades more sharply when moving toward pure topological (α < 0.4) than toward pure BERT (α > 0.8), suggesting that semantic proximity provides a necessary foundation upon which topological structure builds.

7.2.4. Computational Efficiency

While detailed computational profiling is beyond the scope of this paper, we note that the topological approach offers query-time efficiency advantages on the evaluated 3000-document corpus:
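  • Neighborhood precomputation: β-neighborhoods are precomputed offline and stored in O(n·k) space, where k is the average neighborhood size, rather than the O(n²) required for a dense similarity matrix; constructing them still requires the O(n²) pairwise similarity computation.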
  • Query-time complexity: Similarity computation for a new document requires only neighborhood lookups and set operations, scaling with neighborhood size rather than corpus size.
  • Sparse representations: As β increases, neighborhoods become sparse, enabling efficient storage and computation.
For our dataset of 3000 documents, the topological approach reduced similarity computation time by approximately 65% compared to exhaustive BERT pairwise comparison, while the hybrid approach added minimal overhead beyond neighborhood construction. Hardware context: measured on a single Intel Xeon CPU core (2.4 GHz), Python 3.9 + NumPy, averaged over 5 independent runs (SD = 3.2%). Memory footprint: ~4.2 MB for neighborhood structure vs. ~55 MB for full dense matrix. This 65% figure applies to query-time computation on 3000 documents only. The offline O(n²) precomputation grows quadratically and is not scalable to very large corpora without approximate nearest-neighbor methods (e.g., HNSW, FAISS).
While a full scalability analysis is beyond this paper’s scope, we profiled the query-time complexity. For a single new document, the BERT baseline requires a full pass through the training set to compute pairwise cosine similarities (O(n·d), where n = 2400, d = 768). The topological model requires only the precomputed neighborhood sets: computing similarity is a single Jaccard operation over the sparse β-neighborhoods (average size k ≈ 180, << n). On a standard CPU, average query time was 0.45 s for the BERT baseline, compared to 0.12 s for the topological model and 0.17 s for the hybrid model. This demonstrates query-time efficiency on a 3000-document corpus; however, large-scale offline neighborhood construction requires approximate nearest-neighbor indexing (e.g., HNSW, FAISS) to remain tractable.

7.3. Discussion

The experimental results provide strong empirical support for the proposed topological and hybrid frameworks:
  • Superior Classification Performance: The hybrid model achieves the best performance among the compared methods (F1 = 0.93), significantly outperforming both traditional TF-IDF and pure neural baselines.
  • Validation of Theoretical Properties: Experiments confirm neighborhood stability (Proposition 1) and the emergence of asymmetric similarity relations (Proposition 2), grounding the theoretical framework in empirical observation.
  • Enhanced Explainability: The topological framework enables transparent similarity decisions through explicit neighborhood relations and topological paths, addressing a critical limitation of black-box neural models.
  • Complementary Semantic and Structural Information: The optimal hybrid weighting (α = 0.6) demonstrates that semantic proximity and structural consistency encode different yet complementary aspects of document relationships. Neural embeddings capture contextual meaning, while topological modeling reinforces decisions by validating similarity through shared neighborhood structure.
  • Practical Advantages: The framework offers query-time computational efficiency on a 3000-document corpus through neighborhood-based computation and robustness to parameter variations, as demonstrated by sensitivity analysis. Large-scale deployment would require approximate nearest-neighbor indexing to address the offline O(n²) similarity matrix construction.
  • Limitations of the Current Study: Despite the promising results, this study has several limitations that should be acknowledged. First, the primary empirical evaluation is conducted on a single Arabic text corpus of approximately 3000 documents. While we have added validation on the English 20 Newsgroups dataset as preliminary evidence of language independence, the scalability and generalizability to very large-scale corpora (e.g., millions of documents) or to structurally different domains (e.g., legal or biomedical texts) remain to be fully validated. Second, the fuzzy membership function μ, while theoretically grounded, was defined in a relatively simple way based on similarity to a reference document; more sophisticated, learnable membership functions may yield further improvements. Third, the current implementation stores neighborhoods for all documents, which, while efficient for 3000 documents, may need optimization for significantly larger datasets. Fourth, the topology T is operationally implemented as a neighborhood-based structure (β-neighborhoods with sparse list storage) rather than as a full power-set enumeration, which resolves the implementability concern while preserving the theoretical properties. The computational complexity of constructing neighborhoods is O(n²) for the similarity matrix computation and O(n·k) for neighborhood storage, where k is the average neighborhood size (approximately 180 in our dataset); at query time, the neighborhood-based lookup is substantially cheaper than exhaustive pairwise comparison. Fifth, the selection of β = 0.75 was determined through validation experiments using a grid search over β ∈ {0.5, 0.6, 0.7, 0.75, 0.8, 0.85, 0.9} on 20% of the training data; Figure 1 reports sensitivity analysis confirming that the optimal range is 0.7–0.8. Similarly, α = 0.6 was selected via grid search over {0.1, 0.2, …, 0.9} on the same validation set. While these values were found to be optimal for the Arabic dataset evaluated, their transferability to other domains would benefit from further systematic analysis. Sixth, the claim of scalability refers specifically to query-time complexity (O(k) neighborhood lookup versus O(n) exhaustive comparison); the offline precomputation cost remains O(n²) for the similarity matrix, and future work should address approximate neighborhood construction to reduce this cost for very large corpora. Seventh, the generalizability of the proposed method to other languages and domains is supported by preliminary evidence from the 20 Newsgroups results but should be further validated; the choice of Arabic as the primary evaluation language was motivated by the relative scarcity of topological NLP studies for Arabic and by the availability of high-quality pretrained models (AraBERT, CAMeLBERT, AraELECTRA).
The progressive improvement from TF-IDF → BERT → Topological → Hybrid confirms the central hypothesis of this study: integrating mathematical topology with neural language representations enhances robustness, discriminative power, and interpretability in document similarity and classification tasks.

8. Conclusions and Future Work

This paper proposed a Hybrid Topological–Neural Similarity Framework that integrates semantic embedding similarity with structural topological modeling for document classification. By combining cosine-based neural similarity with neighborhood-overlap topological similarity, the framework captures both pairwise semantic proximity and higher-order structural relationships among documents. The fuzzy extension further enhances the model by allowing graded neighborhood membership, providing a principled way to handle semantic uncertainty.
Experimental results using a stratified 80/20 split demonstrate that the hybrid approach improves macro-averaged Precision, Recall, and F1-score compared to purely neural or purely topological methods. These findings confirm that structural neighborhood information offers complementary and discriminative features beyond vector-space similarity alone.
Future work will explore contextual transformer-based embeddings, generalized fuzzy topological spaces such as (2,3)-fuzzy topology, and multi-scale analysis using persistent homology. While the preliminary cross-lingual validation on 20 Newsgroups is encouraging, establishing broader language-agnostic validity will require evaluation across additional languages, domains, and benchmark datasets, which we identify as a priority for future work. Adaptive weighting mechanisms for the hybrid similarity function also represent a promising direction for improving performance and model flexibility.

Author Contributions

O.G.E.B. conceived the study and wrote the manuscript. T.M.A. performed the experiments and analyzed the data. S.H. supervised the project and reviewed and approved the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original data presented in the study are openly available at https://drive.google.com/drive/folders/1jGSzc0KD5WYJap4tWBF2_n-YStc6YIk4?usp=sharing (accessed on 28 May 2026).

Conflicts of Interest

The authors declare that they have no competing interests.

References

  1. Deerwester, S.; Dumais, S.T.; Furnas, G.W.; Landauer, T.K.; Harshman, R. Indexing by latent semantic analysis. J. Am. Soc. Inf. Sci. 1990, 41, 391–407.
  2. Levy, O.; Goldberg, Y.; Dagan, I. Improving distributional similarity with lessons learned from word embeddings. Trans. Assoc. Comput. Linguist. 2015, 3, 211–225.
  3. Gomaa, W.H.; Fahmy, A.A. A survey of text similarity approaches. Int. J. Comput. Appl. 2013, 68, 13–18.
  4. Le, Q.V.; Mikolov, T. Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning (ICML-14); JMLR: Cambridge, MA, USA, 2014; pp. 1188–1196.
  5. Pennington, J.; Socher, R.; Manning, C. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP); Association for Computational Linguistics: Stroudsburg, PA, USA, 2014; pp. 1532–1543.
  6. Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient estimation of word representations in vector space. arXiv 2013, arXiv:1301.3781.
  7. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 4171–4186.
  8. Shi, L.; Cao, L.; Ye, Y.; Zhao, Y.; Chen, B. Tensor-based Graph Learning with Consistency and Specificity for Multi-view Clustering. In IEEE Transactions on Multimedia; IEEE: New York, NY, USA, 2026.
  9. Chen, Z.; Liu, Y.; Shi, L.; Chen, X.; Zhao, Y.; Ren, F. MDEval: Evaluating and Enhancing Markdown Awareness in Large Language Models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 1–15.
  10. Salama, A.S.; El-Barbary, O.G. Document classification in information retrieval system based on neutrosophic sets. Filomat 2020, 34, 1591–1602.
  11. Jarvis, R.A.; Patrick, E.A. Clustering using a similarity measure based on shared near neighbors. IEEE Trans. Comput. 1973, C-22, 1025–1034.
  12. He, Z. Text similarity based on two independent channels: Siamese Convolutional Neural Networks and Siamese Recurrent Neural Networks. Neurocomputing 2025, 643, 130355.
  13. Ibrahim, H.Z.; Al-Shami, T.M.; Elbarbary, O.G. (3, 2)-Fuzzy Sets and Their Applications to Topology and Optimal Choices. Comput. Intell. Neurosci. 2021, 2021, 1272266.
  14. Pawlak, Z. Rough sets. Int. J. Comput. Inf. Sci. 1982, 11, 341–356.
  15. El-Barbary, O.G.; Abu Shaheen, F.A.; Al-Shami, T.M.; Arar, M. Supra finite soft-open sets and applications to operators and continuity. J. Math. Comput. Sci. 2024, 35, 120–135.
  16. El-Barbary, O.G.; Salama, A.S. Topological approach to retrieve missing values in incomplete information systems. J. Egypt. Math. Soc. 2017, 25, 419–423.
  17. Lau, J.H.; Baldwin, T. An empirical evaluation of doc2vec with practical insights into document embedding generation. arXiv 2016, arXiv:1607.05368.
  18. Pearson, K. Notes on regression and inheritance in the case of two parents. Proc. R. Soc. Lond. 1895, 58, 240–242.
  19. Jaccard, P. Étude comparative de la distribution florale dans une portion des Alpes et du Jura. Bull. Soc. Vaudoise Sci. Nat. 1901, 37, 547–579.
  20. Blei, D.M.; Ng, A.Y.; Jordan, M.I. Latent Dirichlet allocation. In Advances in Neural Information Processing Systems; Dietterich, T.G., Becker, S., Ghahramani, Z., Eds.; MIT Press: Cambridge, MA, USA, 2002; pp. 601–608.
  21. Blei, D.M.; Ng, A.Y.; Jordan, M.I. Latent Dirichlet allocation. J. Mach. Learn. Res. 2003, 3, 993–1022.
  22. Martinez-Gil, J. Automatic design of semantic similarity ensembles using grammatical evolution. arXiv 2024, arXiv:2307.00925.
  23. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2017; pp. 5998–6008.
  24. Kim, Y. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP); Association for Computational Linguistics: Stroudsburg, PA, USA, 2014; pp. 1746–1751.
  25. Perin, E.L.S.; Souza, M.C.D.; Silva, J.D.A.; Matsubara, E.T. DynGraph-BERT: Combining BERT and GNN Using Dynamic Graphs for Inductive Semi-Supervised Text Classification. Informatics 2025, 12, 20.
  26. Abdelali, A.; Darwish, K.; Mubarak, H. Transparent, Low Resource, and Context-Aware Information Retrieval From a Closed Domain Knowledge Base. IEEE Access 2024, 12, 44233–44243.
  27. Kalogeropoulos, N.-R.; Ioannou, D.; Stathopoulos, D.; Makris, C. On Embedding Implementations in Text Ranking and Classification Employing Graphs. Electronics 2024, 13, 1897.
  28. Hendry, H.; Tukino, T.; Sediyono, E.; Fauzi, A.; Huda, B. HyEWCos: A Comparative Study of Hybrid Embedding and Weighting Techniques for Text Similarity in Short Subjective Educational Text. Information 2025, 16, 995.
  29. Shen, Z.; Xiao, Z. A Chinese Short Text Similarity Method Integrating Sentence-Level and Phrase-Level Semantics. Electronics 2024, 13, 4868.
  30. Alammar, M.; El Hindi, K.; Al-Khalifa, H. English-Arabic Hybrid Semantic Text Chunking Based on Fine-Tuning BERT. Computation 2025, 13, 151.
  31. Kim, H.; Gim, G. Enhancing Patent Document Similarity Evaluation and Classification Precision Through a Multimodal AI Approach. Appl. Sci. 2025, 15, 9254.
  32. Ostendorff, M.; Rethmeier, N.; Augenstein, I.; Gipp, B.; Rehm, G. Neighborhood Contrastive Learning for Scientific Document Representations with Citation Embeddings. arXiv 2022, arXiv:2202.06671.
  33. Zadeh, L.A. Fuzzy sets. Inf. Control 1965, 8, 338–353.
  34. Inoue, G.; Alhafni, B.; Baimukan, N.; Bouamor, H.; Habash, N. The Interplay of Variant, Size, and Task Type in Arabic Pre-trained Language Models. In Proceedings of the Sixth Arabic Natural Language Processing Workshop; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 92–104.
  35. Hofmann, T. Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval; Association for Computing Machinery (ACM): New York, NY, USA, 1999; pp. 50–57.
  36. Antoun, A.; Baly, F.; Hajj, H. AraELECTRA: Pre-training text encoders for Arabic language understanding. In Proceedings of the Sixth Arabic Natural Language Processing Workshop; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 13–24. Available online: https://aclanthology.org/2021.wanlp-1.20/ (accessed on 1 May 2026).
Figure 1. Overall pipeline of the proposed topological neighborhood-based document similarity and classification framework.
Figure 2. Impact of the neighborhood threshold β on topological and hybrid model performance.
Figure 3. Hybrid model performance as α varies from 0 to 1.
Table 1. Classification performance comparison: core methods (Arabic dataset, 10-fold CV). Std. Dev. and 95% CI [±1.96·σ/√10] computed across 10 independent splits. Cohen’s d vs. TF-IDF baseline. ★ = proposed method. All differences significant at p < 0.01 (Wilcoxon signed-rank test).

| Method | Precision | Recall | F1-Score | Std. Dev. (F1) | 95% CI (F1) [±1.96·σ/√10] | Cohen’s d |
|---|---|---|---|---|---|---|
| TF-IDF | 0.80 | 0.79 | 0.80 | ±0.023 | ±0.014 | (reference) |
| BERT (AraBERT) | 0.88 | 0.86 | 0.86 | ±0.018 | ±0.011 | 2.93 (large) |
| Topological (β = 0.75) | 0.90 | 0.89 | 0.89 | ±0.015 | ±0.009 | 4.74 (large) |
| Hybrid (α = 0.6) ★ | 0.93 | 0.94 | 0.93 | ±0.012 | ±0.007 | 7.22 (large) |
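
The interval half-widths in Table 1 follow the stated formula ±1.96·σ/√10, which the short sketch below reproduces. The Cohen's d helper uses the equal-weight pooled standard deviation, which is an assumption: the article does not state which variant was used, and values recomputed from the rounded table entries need not match the reported ones exactly. Both function names are hypothetical.

```python
import math


def ci_half_width(std_dev, n_folds, z=1.96):
    """Half-width of a normal-approximation 95% CI: z * sigma / sqrt(n)."""
    return z * std_dev / math.sqrt(n_folds)


def cohens_d(mean_a, std_a, mean_b, std_b):
    """Cohen's d with an equal-weight pooled standard deviation (assumed variant)."""
    pooled = math.sqrt((std_a ** 2 + std_b ** 2) / 2.0)
    return (mean_a - mean_b) / pooled


if __name__ == "__main__":
    # TF-IDF row of Table 1: sigma = 0.023 over 10 folds.
    print(round(ci_half_width(0.023, 10), 3))  # 0.014, as listed in the table
    # BERT (AraBERT) vs. TF-IDF, using the rounded F1 means and standard deviations.
    print(cohens_d(0.86, 0.018, 0.80, 0.023))
```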
Table 2. Extended classification performance comparison (Arabic dataset, 10-fold CV). 95% CI computed as ±1.96·σ/√10. Cohen’s d computed vs. TF-IDF baseline. ★ = proposed method.

| Method | Precision | Recall | F1-Score | Std. Dev. (F1) | 95% CI (F1) [±1.96·σ/√10] | Cohen’s d vs. TF-IDF |
|---|---|---|---|---|---|---|
| TF-IDF | 0.80 | 0.79 | 0.80 | ±0.023 | ±0.014 | (reference) |
| LDA (Topic Model) | 0.74 | 0.76 | 0.75 | ±0.031 | ±0.019 | −1.52 (large) |
| Doc2Vec | 0.77 | 0.79 | 0.78 | ±0.028 | ±0.017 | −0.72 (medium) |
| BERT (AraBERT) | 0.88 | 0.86 | 0.86 | ±0.018 | ±0.011 | 2.93 (large) |
| CAMeLBERT | 0.88 | 0.86 | 0.87 | ±0.016 | ±0.010 | 3.82 (large) |
| AraELECTRA | 0.86 | 0.84 | 0.85 | ±0.019 | ±0.012 | 2.27 (large) |
| Topological (β = 0.75) | 0.90 | 0.89 | 0.89 | ±0.015 | ±0.009 | 4.74 (large) |
| Hybrid (α = 0.6) ★ | 0.93 | 0.94 | 0.93 | ±0.012 | ±0.007 | 7.22 (large) |
Table 3. Preliminary cross-lingual validation on 20 Newsgroups (zero-shot, off-the-shelf). All models use BERT-base-uncased without fine-tuning. 95% CI computed as ±1.96·σ/√5 (5-fold CV). Δ computed relative to the TF-IDF F1-score of 0.68. ★ = proposed method.

| Model Configuration | F1-Score | Precision | Recall | Δ from TF-IDF | 95% CI (F1) [±1.96·σ/√5] |
|---|---|---|---|---|---|
| TF-IDF k-NN (Baseline) | 0.68 | 0.67 | 0.69 | (ref) | ±0.022 |
| BERT-only (Baseline) | 0.72 | 0.71 | 0.73 | +0.04 | ±0.020 |
| BERT + Topological (Jaccard) | 0.75 | 0.74 | 0.76 | +0.07 | ±0.018 |
| BERT + Topology + Asymmetry | 0.76 | 0.75 | 0.77 | +0.08 | ±0.017 |
| Full Hybrid (Proposed) ★ | 0.77 | 0.76 | 0.78 | +0.09 | ±0.016 |
Note: the fuzzy extension (Section 6.4) is a theoretical contribution and was not empirically evaluated in this experiment.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

El Barbary, O.G.; Hagras, S.; M. Allam, T. Hybrid Neighborhood-Based Similarity Measure for Text Classification. Information 2026, 17, 560. https://doi.org/10.3390/info17060560
