Article

A Multi-Stage NLP Framework for Knowledge Discovery from Crop Disease Research Literature

by Jantima Polpinij 1, Manasawee Kaenampornpan 2,*, Christopher S. G. Khoo 3, Wei-Ning Cheng 4 and Bancha Luaphol 5
1 Department of Computer Science, Faculty of Informatics, Mahasarakham University, Mahasarakham 44150, Thailand
2 Department of Computer Engineering, Faculty of Engineering, Khon Kaen University, Khon Kaen 40002, Thailand
3 Wee Kim Wee School of Communication & Information, Nanyang Technological University, Singapore 637718, Singapore
4 Graduate Institute of Library & Information Studies, National Taiwan Normal University, Taipei City 106, Taiwan
5 Department of Business Computer, Faculty of Administrative Science, Kalasin University, Kalasin 46000, Thailand
* Author to whom correspondence should be addressed.
Mathematics 2026, 14(2), 299; https://doi.org/10.3390/math14020299
Submission received: 27 November 2025 / Revised: 3 January 2026 / Accepted: 10 January 2026 / Published: 14 January 2026

Abstract

Extracting and organizing knowledge from the agricultural crop disease research literature are challenging tasks because of the heterogeneous terminologies, complicated symptom descriptions, and unstructured nature of scientific documents. In this study, we developed a multi-stage natural language processing (NLP) pipeline to automate knowledge extraction, organization, and integration from the agricultural research literature into a domain-consistent crop disease knowledge graph. The model combines transformer-based sentence embeddings with variational deep clustering to extract topics, which are further refined via facet-aware relevance scoring for sentence selection to be included in the summary. Lexicon-guided named entity recognition helps in the precise identification and normalization of terms for crops, diseases, symptoms, etc. Relation extraction based on a combination of lexical, semantic, and contextual features leads to the meaningful generation of triplets for the knowledge graph. The experimental results show that the method yielded consistently good results at each stage of the knowledge extraction process. Among the combinations of embedding and deep clustering methods, SciBERT + VaDE achieved the best clustering results. The extraction of representative sentences for disease symptoms, control/treatment, and prevention obtained high F1-scores of around 0.8. The resulting knowledge graph has high node coverage and high relation completeness, as well as high precision and recall in triplet generation. The multi-stage NLP pipeline effectively converts unstructured agricultural research texts into a coherent and semantically rich knowledge graph, providing a basis for further research in crop disease analysis, knowledge retrieval, and data-driven decision support in agricultural informatics.

1. Introduction

Agricultural research has been producing an ever-growing volume of scientific publications on crop pests, diseases, treatments, and management practices [1]. While computer vision has advanced rapidly in image-based crop disease detection, a large amount of valuable domain knowledge remains locked in text-based resources, such as research articles, agronomic bulletins, and extension reports [2,3,4,5]. Crop diseases are among the most notorious problems of agriculture worldwide and are the major source of yield reduction and food insecurity. Yet, agricultural research results are not reaching farmers and agriculturists in a form they can make sense of and apply. Conventional monitoring and analysis of agricultural research data typically involves manual curation, a tedious process with limited scalability. The rapid growth of digital publications has led to a pressing need for systematic, computer-driven strategies for categorizing the deluge of information [6,7,8]. Systematically organizing this textual information is crucial for supporting disease monitoring, knowledge discovery, and informed decision-making in sustainable agriculture [9,10,11,12].
In this study, we developed a multi-stage natural language processing (NLP) pipeline to convert a corpus of crop disease research abstracts into a knowledge graph. The three main stages of the pipeline are as follows:
  • Clustering the abstracts into meaningful topics. This includes converting each abstract into a vector representation and then performing cluster analysis to group the abstracts into crop disease topics (such as Rice—Blast and Sugarcane—Red Rot).
  • Selecting or extracting a set of representative sentences from each cluster that have minimal overlap but cover the main knowledge aspects (i.e., research results). In addition, the extracted sentences are labeled (i.e., categorized) into knowledge facets (relation types) of symptoms, control/treatment, and prevention.
  • Deriving node–relation–node triplets from the facet-labeled sentences to construct the knowledge graph. This involves performing named entity recognition (NER) on the extracted sentences to identify entity nodes and relation identification to generate triplets.
The resulting knowledge graph provides a basis for developing knowledge graph applications that help agriculturalists and researchers gain an overview of disease knowledge for a particular crop, and suggest treatment, control, and management options.
For each stage of the NLP pipeline, we shortlisted a set of state-of-the-art techniques to investigate, combine, adjust, train, and then identify the most effective hyperparameter values, before evaluating the output.
For the clustering of the abstracts into meaningful topics, traditional text clustering methods—such as bag-of-words representations with TF–IDF weighting (term frequency with inverse document frequency) and clustering algorithms like k-means or hierarchical clustering—struggle with agricultural corpora due to issues of synonymy, polysemy, and the frequent emergence of new technical terms [13,14,15,16]. Distributed word representations (e.g., Word2Vec [17]) and more recent transformer-based models (e.g., XLNet [18], SBERT [19], SciBERT [20]) have significantly improved semantic understanding [21]. In parallel, clustering has evolved from classical centroid-based methods to deep clustering frameworks such as Deep Embedded Clustering (DEC) [22], Improved Deep Embedded Clustering (IDEC) [23], Deep Clustering Network (DCN) [24], and Variational Deep Embedding (VaDE) [25]. However, existing studies in agricultural text mining and crop disease research have primarily addressed specific tasks using rule-based systems, ontology-driven approaches, or isolated models, without providing a systematic benchmark of modern embedding representations and clustering algorithms on the agriculture-specific scientific literature [26,27]. Consequently, the effectiveness of recent contextualized embeddings and deep clustering techniques on crop disease corpora remains underexplored, despite their proven success in other scientific and biomedical domains [28,29,30,31,32,33,34]. In particular, prior studies typically focus on a limited set of embedding models or clustering techniques, do not incorporate deep clustering methods, and rarely include expert-driven semantic validation within agricultural text mining pipelines. Therefore, the first research objective of this study is to identify which combinations of modern text embedding techniques and clustering algorithms are most effective for grouping crop disease abstracts into semantically meaningful topics.
For example, representative studies in agricultural text mining have constructed domain-specific knowledge bases or disease taxonomies using rule-based extraction and ontology-driven frameworks [26,27]. While these approaches effectively capture structured domain knowledge, they typically rely on predefined vocabularies and do not evaluate alternative embedding representations or clustering strategies. Other studies have applied unsupervised topic modeling or single clustering methods to the agricultural literature [26,28], but without systematically comparing modern contextualized embeddings or incorporating deep clustering techniques and expert validation. These limitations highlight the absence of a unified and comprehensive benchmarking framework for embedding–clustering combinations in crop disease research.
The second research objective was to develop an effective method to extract a set of representative sentences from each cluster and label them with knowledge facets. The approach we investigated was to use the Maximal Marginal Relevance (MMR) technique to extract a set of representative sentences with good coverage and minimal overlap, and then use a weighted score of lexicon match, embedding similarity, and context cues to identify the most likely facet for each extracted sentence.
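As an illustration of the MMR selection step described above, the greedy relevance-versus-redundancy scoring can be sketched as follows. This is a generic MMR implementation over precomputed sentence vectors, not the authors' exact code; the function name and the weighting parameter `lam` are illustrative.

```python
import numpy as np

def mmr_select(sent_vecs, query_vec, k=2, lam=0.7):
    """Greedy Maximal Marginal Relevance: choose k sentence vectors that
    balance relevance to the query against redundancy with sentences
    already selected (score = lam * relevance - (1 - lam) * redundancy)."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    candidates = list(range(len(sent_vecs)))
    selected = []
    while candidates and len(selected) < k:
        best_i, best_score = None, float("-inf")
        for i in candidates:
            relevance = cos(sent_vecs[i], query_vec)
            # redundancy = highest similarity to any already-selected sentence
            redundancy = max((cos(sent_vecs[i], sent_vecs[j]) for j in selected),
                             default=0.0)
            score = lam * relevance - (1 - lam) * redundancy
            if score > best_score:
                best_i, best_score = i, score
        selected.append(best_i)
        candidates.remove(best_i)
    return selected
```

Lower values of `lam` penalize redundancy more strongly, so a near-duplicate of an already-selected sentence loses to a less relevant but novel one.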
The third research objective was to develop a method to derive node–relation–node triplets from the extracted sentences to construct the knowledge graph. This involves performing named entity recognition (NER) on the extracted sentences to identify entity nodes and relation identification using the sentence facet labels to derive the triplets.
Our research and methodological contributions include the following: (i) performing a systematic evaluation of text embedding techniques in combination with clustering methods on a curated corpus of crop disease abstracts; (ii) developing a method for sentence-level facet scoring that incorporates expert-informed quality calibration; and (iii) developing an effective method to derive node–relation–node triplets to populate a domain knowledge graph.
The novelty of this work lies not in proposing a single new algorithmic component, but in the rigorous integration of multiple stages into a unified and empirically validated framework. Unlike prior approaches that treat embedding, clustering, calibration, and knowledge extraction as isolated steps, our framework enforces consistency and domain relevance across all stages through expert-informed calibration and systematic evaluation. This integrated design goes beyond a simple assembly of existing components by jointly addressing geometric quality, semantic coherence, and practical interpretability in agricultural knowledge discovery.
The rest of this paper is structured as follows: In Section 2, we present the related work on clustering methods, embedding models, and their applications in scientific and agricultural texts. Section 3 describes the datasets involved, the mathematical model, and the methodology used in the comparative study, as well as the experimental configuration. The results and discussion are given in Section 4. Finally, the conclusion and future work are presented in Section 5.

2. Related Work

This section reviews related work for the three main stages of our NLP pipeline.

2.1. Text Clustering

Text clustering techniques group documents that lack predefined categories into topics [31,32]. Traditional clustering techniques represent the documents as high-dimensional sparse vectors using the vector space model and term weighting schemes, including the widely used term frequency–inverse document frequency (TF–IDF) scheme [33,34,35,36]. Classical clustering algorithms such as k-means, hierarchical clustering, spectral clustering, and DBSCAN are applied to these representations because of their simplicity and interpretability [13,37,38]. However, these methods have limitations in the context of complex, domain-specific corpora: they are sensitive to initialization, assume relatively simple cluster shapes, and do not handle synonymy, polysemy, or specialized language. For example, in the agricultural domain, the literature describes similar diseases under different pathogen names or local variants, which can fragment clusters and make them semantically inconsistent [13,37,38].
The challenges of synonymy and polysemy for these “bag-of-words” models are addressed by continuous, dense word embeddings [39,40,41]. These include Word2Vec, GloVe, and FastText, which embed words in a continuous vector space, preserving semantic similarity and improving cluster cohesion [42,43,44]. Yet, these embeddings are context-independent, assigning the same vector to every occurrence of a word regardless of how that word is used in a sentence (and which of its potential meanings applies), which makes them less suited for technical language where words may have multiple senses or correspond to compound terms [45,46,47]. There is preliminary evidence in the biomedical literature that scientifically pretrained distributed embeddings may improve clustering quality, although there has been no comprehensive benchmark of these approaches on agricultural scientific text [48].
The development of transformer-based language models has revolutionized how documents are represented [49]. BERT-based approaches, especially BERT derivatives such as XLNet, not only make the embeddings context-aware but also account for both local and global contexts, which can further improve how faithfully a model captures the meaning of the text [49,50]. SBERT is a sentence-level adaptation of BERT created to address sentence similarity and clustering challenges [51]. SciBERT further improves upon this by concentrating the pretraining on a scientific corpus, hence dealing better with jargon and Latin species names [20]. This family of methods has been successful in generating semantically coherent clusters of documents, outperforming co-training algorithms, especially in mining the biomedical literature [52]. However, there has not yet been any systematic examination of these enhanced embeddings on agricultural corpora, nor any comparison with traditional embedding approaches within a single empirical framework to determine which methodology is superior [52].
In addition to enhancements in representation, clustering has developed along with advances in deep learning. Deep Embedded Clustering (DEC), Improved DEC (IDEC), Deep Clustering Network (DCN), and Variational Deep Embedding (VaDE) incorporate representation learning into the clustering objective, allowing the model to learn more compact and interpretable latent structures [22,53]. Other work, e.g., DeepCluster and SimCLR, in combination with graph-based models, has pushed forward the promise of deep clustering [54]. However, most of these models have been tested on generic datasets such as news articles and benchmark corpora, with a few studies investigating domain-specific scientific documents [55]. Furthermore, prior works frequently use different/non-standardized evaluation metrics, which makes it hard to compare the results across different models and datasets [56].
Two research gaps are identified. The first is the lack of a global assessment that evaluates classical, distributional, and transformer-based embeddings together with traditional and deep clustering algorithms [57]. The second is that previous studies have not embedded document clustering in a unified methodology [58]. In particular, optimization-oriented perspectives in deep clustering remain underexplored [59]. There is also a shortage of standardized internal and external measures, making it difficult to arrive at definitive conclusions about best practices [60]. The purpose of this study is to fill these gaps by conducting an integrated appraisal of embedding–clustering combinations on a manually curated corpus of crop disease research papers and constructing a structured knowledge system.
Compared with classical clustering methods, deep clustering approaches jointly learn latent representations and cluster assignments, allowing them to capture more complex and nonlinear semantic structures in text corpora. This property is particularly beneficial for the agricultural literature, where disease descriptions often involve overlapping symptoms, heterogeneous terminology, and subtle contextual cues. However, deep clustering models also introduce practical challenges, including higher computational cost, sensitivity to hyperparameter settings, and potential instability during training. Moreover, most existing evaluations of deep clustering have focused on generic or biomedical datasets, with limited validation on agriculture-specific corpora and little involvement of domain experts. These limitations motivate the need for a systematic, domain-aware evaluation of both classical and deep clustering methods in the context of the crop disease literature.

2.2. Extracting Representative Sentences

Previous works in scientific document analysis typically use sentence-level extraction to produce summaries of clusters or to flag key information for further tasks such as knowledge discovery and interpretation [13,37,38]. Many earlier techniques are based on frequency-based heuristics or surface-level similarity measures, leading to redundant extracted sentences and incomplete coverage of important aspects of the research corpus (especially for technical and domain-specific corpora).
In the context of the agricultural literature, important findings related to disease symptoms, control strategies, and prevention measures are frequently distributed across multiple sentences and expressed using varied terminology, making representative sentence selection challenging. To address this limitation, this study selects a set of representative sentences from each cluster that minimizes semantic overlap while covering the main knowledge aspects of the abstracts. Furthermore, the extracted sentences are explicitly labeled into relation types—symptoms, control, and prevention—to support structured knowledge derivation in subsequent stages.

2.3. Deriving Node–Relation–Node Triplets from Sentences

In creating a knowledge graph from scientific texts, the usual technique is to extract entity–relation pairs and produce structured (node–relation–node) representations [55,56]. Prior works tend to treat named entity recognition (NER) and relation extraction as independent tasks rather than integrating them with upstream clustering or sentence extraction stages, which may lead to disconnected or loosely connected knowledge.
Moreover, in domain-specific corpora such as crop disease research, entity mentions may include scientific names, pathogens, and domain-specific concepts that are difficult to capture using generic pipelines. In this study, node–relation–node triplets are derived directly from the extracted representative sentences by performing named entity recognition to identify entity nodes and relation identification to extract semantic relations. This integrated process enables the construction of a coherent knowledge graph that is closely aligned with the clustered document structure and the identified research facets.

3. Datasets and Research Method

3.1. Datasets

A total of 1079 studies on various crop diseases were reviewed. Relevant research data were collected from four major scientific bibliographic databases: PubMed, CAB Abstracts (full text, via EBSCO), AGRICOLA (via EBSCO), and AGRIS, accessed through FAO. These databases are complementary and together provide a broad coverage of the agricultural, biomedical, and life sciences literature. To ensure data comparability and maintain a manageable text length, this study focused exclusively on abstracts. Abstracts provide concise summaries of research objectives and key findings, including information on disease names, symptoms, and control measures, making them suitable for large-scale and consistent text analysis.
This compilation includes production-based research papers on five crops of significant economic importance: rice (500 papers), sugarcane (380), oil palm (129), cassava (36), and soybean (34). The main theme of each paper was identified from references made in the titles or abstracts; possible ambiguities were resolved after consultation with domain experts. The proportion of research is not evenly distributed because the global distribution of research is itself unbalanced (e.g., rice and sugarcane receive more coverage), but reasonably representative samples have been used.
The process of data collection and construction occurred in several stages. Data were retrieved by (1) searching using keywords that combined crop names with disease terms (e.g., disease + crop; pathogen + crop). This was followed by (2) removing duplicates and non-scientific publications. The sources were then screened using (3) predefined inclusion criteria stating that diseases must be etiologically specific and that records must contain information on symptoms, diagnostics, management, control, prevention, and yield impact. Lastly, (4) an expert validation phase consisted of plant pathology experts reviewing borderline cases and checking the semantic coherence of sample abstracts following the initial topic clustering.
Multiple measures were taken to guarantee the accuracy and correctness of data: (i) inter-database search was employed to improve coverage and reduce selection bias; (ii) irrelevant or incomplete records were manually separated from our dataset; (iii) proper crop assignment and disease relevance assessment were confirmed by expert identification; and (iv) cluster representativeness validation was carried out to claim semantic consistency. The human-in-the-loop curation of both technical and domain-specific consistency made the dataset suitable for a thorough objective evaluation in embedding and clustering tasks.
While the distribution of crop studies in the dataset is imbalanced, it reflects how scientists allocate their attention in the real world, which allows these models to be tested under non-homogeneous and more realistic distribution conditions.

3.2. Research Method

This section reports the details of the multi-stage NLP pipeline developed to convert crop disease research abstracts into a structured knowledge graph. It describes the techniques adopted at each stage and highlights the adjustments and fine-tuning applied to tailor these techniques to the agricultural domain.
While the individual components of the proposed pipeline (embedding models, clustering algorithms, and named entity recognition techniques) are established in prior work, the methodological novelty of this study lies in their integration into a unified benchmarking and knowledge discovery framework. Specifically, originality is introduced through (i) systematic embedding–clustering benchmarking on agriculture-specific corpora, (ii) expert-informed threshold calibration that jointly considers geometric quality and semantic coherence, (iii) facet-aware sentence extraction, and (iv) downstream knowledge graph construction.
To facilitate clarity, the methodological contributions of this study are explicitly reflected in the following components of the proposed pipeline: (i) systematic embedding–clustering benchmarking (Document Clustering), (ii) expert-informed threshold calibration (Threshold Calibration), (iii) facet-aware sentence extraction (Sentence Extraction and Facet Building), and (iv) downstream knowledge graph construction (Knowledge Graph Construction).

3.2.1. Text Preprocessing

This approach deliberately did not standardize spelling or other minor variations in the original specialized vocabulary, such as technical agricultural terms and Latin binomial species names (e.g., “Xanthomonas oryzae” and “Magnaporthe oryzae”). Over-normalization or over-stemming can change the fundamental semantics of words and lead to the loss of important content. This light preprocessing maintained technical consistency while keeping the domain-specific terminology that is essential for the semantic grouping of highly specialized agricultural texts.
The pipeline had three major steps. First, we used lowercasing to treat the word forms (e.g., “Disease” and “disease”) equally. The second preprocessing step involved Unicode normalization to convert different encodings of special characters, diacritics, and scientific symbols present in abstracts indexed across different databases. Third, we standardized white spaces in order to remove abnormal spaces and control characters hidden in text and obtain a clean text for embedding.
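The three preprocessing steps above can be sketched as a single function. The NFKC normalization form is an assumption for illustration, as the paper does not specify which Unicode normalization form was used.

```python
import re
import unicodedata

def preprocess(text):
    """The three preprocessing steps described above, in order."""
    text = text.lower()                          # step 1: lowercase word forms
    text = unicodedata.normalize("NFKC", text)   # step 2: unify character encodings
    text = re.sub(r"\s+", " ", text).strip()     # step 3: standardize whitespace
    return text
```

For example, a string containing a non-breaking space and irregular spacing, such as `"Disease\u00A0 of  RICE\n"`, is reduced to the clean form `"disease of rice"`.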
It is worth mentioning that we intentionally did not perform stopword removal for context-based embedding models, e.g., SBERT and XLNet. Such models use context to infer meaning, and removing function words would destroy the syntactic and semantic chain. On the other hand, regarding the TF–IDF baseline, ablation experiments that incorporated stopword removal were carried out. This design made it possible to see whether filtering out high-frequency words, such as “the”, “of”, and “in”, improved cluster cohesiveness on sparse lexical representations. On the practical side, the contrast between the “with-stopwords” and “without-stopwords” conditions highlighted the extent to which traditional models are sensitive to preprocessing choices and the robustness of transformer models to input filtering.
By adopting this conservative setting, we made sure that the preprocessing was kept in balance: it should normalize the data for computational modeling and at the same time retain linguistic richness and accuracy so that an interesting clustering of the crop disease literature can be truly achieved.

3.2.2. Text Representation and Embedding

Formally, the embedding function f_θ maps the token sequence x = (w_1, w_2, …, w_T) to a fixed-dimensional vector z ∈ ℝ^d (e.g., through mean/CLS pooling over token representations). We implement f_θ with five representation methods: TF–IDF, Word2Vec, XLNet, SBERT, and SciBERT.
TF–IDF representations are generated using the scikit-learn library. Word2Vec embeddings are implemented using the Gensim library (version 4.3.2). Transformer-based embeddings, including XLNet, SBERT, and SciBERT, are obtained using the Hugging Face Transformers and Sentence-Transformers libraries (version 4.36.2), with document-level representations derived via mean pooling or CLS token representations as appropriate.
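Independently of the specific transformer model, the mean pooling mentioned above (averaging token representations into one document vector while ignoring padding positions) can be sketched as follows; array shapes and names are illustrative, not the authors' code.

```python
import numpy as np

def mean_pool(last_hidden_state, attention_mask):
    """Mean pooling of token representations into one document vector.
    last_hidden_state: (B, T, H) token embeddings; attention_mask: (B, T)
    with 1 for real tokens and 0 for padding, which is excluded from the mean."""
    mask = attention_mask[..., None].astype(float)    # (B, T, 1)
    summed = (last_hidden_state * mask).sum(axis=1)   # (B, H) sum over real tokens
    counts = np.clip(mask.sum(axis=1), 1e-9, None)    # (B, 1) number of real tokens
    return summed / counts
```

The same pooling applies to XLNet and SciBERT outputs; SBERT performs an equivalent pooling internally.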
These techniques reflect the evolution of natural language processing from conventional statistical representation to modern deep learning context-aware models. Each method has unique merits and demerits, and this comparative study yields useful insights about their practical efficacy in clustering research papers on crop diseases.
(1)
TF–IDF (term frequency–inverse document frequency)—TF–IDF represents each document as a sparse vector in the vocabulary space, scaling each term according to its local frequency within a document and its global rarity across the corpus [61]. Such an approach is simple, interpretable, and efficient but cannot capture semantic closeness, synonymy, or word order. The weight of term t in document d_i is as follows:
w_{t,d_i} = tf(t, d_i) × log(1 + N / df(t))
where N is the corpus size, and df(t) is the number of documents containing t.
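A minimal sketch of this weighting scheme, implemented directly from the formula above. Note that library implementations such as scikit-learn's TfidfVectorizer use a slightly different smoothed IDF, so values will not match this exactly.

```python
import math

def tfidf_weight(term, doc_tokens, corpus):
    """TF-IDF weight following the formula above:
    w(t, d) = tf(t, d) * log(1 + N / df(t)),
    where corpus is a list of token lists."""
    tf = doc_tokens.count(term)                       # local term frequency
    df = sum(1 for doc in corpus if term in doc)      # document frequency
    if df == 0:
        return 0.0
    return tf * math.log(1 + len(corpus) / df)
```

A corpus-wide term such as “rice” thus receives a lower weight than a rarer, more discriminative term such as “blast”.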
(2)
Word2Vec is a prediction-based approach for learning dense, low-dimensional vector representations of words from their linguistic contexts [62]. In this study, the skip-gram model is employed to maximize the likelihood of context words appearing within a fixed-size window surrounding a target word. Document-level representations are obtained by averaging the embeddings of the words contained in each document.
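The document-averaging step described above can be sketched as follows, assuming the word vectors have already been trained (e.g., by a Gensim skip-gram model); the function name and the word→vector mapping are illustrative.

```python
import numpy as np

def doc_embedding(tokens, word_vectors, dim):
    """Document vector = average of its word vectors, as described above.
    word_vectors maps a token to its trained embedding; out-of-vocabulary
    tokens are skipped, and an all-empty document yields a zero vector."""
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vecs:
        return np.zeros(dim)
    return np.mean(vecs, axis=0)
```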
(3)
XLNet—XLNet extends the conventional language model by introducing permutation language modeling, which allows the model to incorporate both autoregressive and bidirectional dependencies at the same time [18]. In contrast to BERT with its masked language modeling, XLNet trains over permutations of the factorization order, predicting each token from the context available in that permutation. The document embedding is obtained by mean pooling. This allows XLNet to produce context-sensitive embeddings that leverage both forward and backward information flows, making it more powerful at modeling long-range dependencies than previous models.
(4)
SBERT (Sentence-BERT)—SBERT modifies the BERT architecture into a Siamese network that is fine-tuned specifically for semantic similarity and clustering tasks [19]. Unlike models producing word-level embeddings, SBERT directly learns sentence- and document-level embeddings, so semantically similar texts are mapped close to each other in the vector space. Training typically employs triplet loss or contrastive loss, where a margin ε enforces a distinction between positive and negative pairs, and similarity between sentences is then computed via cosine similarity. This design makes SBERT highly suitable for clustering, where embeddings can be compared directly by distance; its main strength lies in its effectiveness and accuracy on semantic similarity, although this depends strongly on the quality of domain adaptation during fine-tuning.
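A minimal sketch of the triplet objective with margin ε mentioned above, using Euclidean distances between embeddings; the margin value and names are illustrative, not SBERT's exact training code.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """SBERT-style triplet objective: the anchor embedding must be closer
    to the positive than to the negative by at least `margin` (epsilon);
    the loss is zero once that separation is achieved."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(d_pos - d_neg + margin, 0.0)
```

Minimizing this loss over many (anchor, positive, negative) sentence triples pulls semantically similar texts together, which is what makes the resulting embeddings directly usable for clustering.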
(5)
SciBERT—SciBERT is a BERT model pretrained from scratch on a large in-domain corpus of scientific publications [20]. Though it shares the transformer-based architecture of BERT, its dedicated pretraining corpus helps it capture scientific terms, jargon, and even the Latin species names commonly found in technical abstracts. Given an input sequence X, the transformer encoder produces contextualized representations that capture the semantic and syntactic behavior of scientific text well, yielding contextual embeddings specialized for scientific texts such as biomedical and agricultural research.

3.2.3. Document Clustering

After performing text preprocessing and text representation, each document was then encoded into a vector z_i = f_θ(x_i), where f_θ represents one of five embeddings—TF–IDF, Word2Vec, XLNet, SBERT, and SciBERT—covering the spectrum from sparse lexical weighting to contextualized transformer models. We then applied multiple clustering algorithms (i.e., k-means, hierarchical clustering (HC), spectral clustering (SC), DBSCAN, DEC, IDEC, Deep Clustering Network (DCN), and VaDE) to partition the embedding space into K clusters C = {C_1, C_2, …, C_K} by optimizing an algorithm-specific objective J(C, z).
In terms of implementation, classical clustering algorithms, including k-means, hierarchical clustering, spectral clustering, and DBSCAN, are implemented using the scikit-learn library (version 1.2.2). Deep clustering models, namely DEC, IDEC, DCN, and VaDE, are implemented using the PyTorch framework (version 2.1.2).
For clustering methods that require a predefined number of clusters, the value of K is selected based on internal validation criteria, specifically the Silhouette coefficient and the Davies–Bouldin Index, evaluated over a range of candidate values. In contrast, DBSCAN determines the number of clusters automatically based on density-related parameters and does not require a predefined K. Each clustering algorithm can be briefly described as follows:
(1) K-means Clustering [63]: This well-known clustering technique assigns documents to the nearest centroid by minimizing the within-cluster sum of squares. Based on the internal validation analysis described above, we chose K = 20 as the final number of clusters.
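The K-selection procedure described above (scanning candidate values of K and scoring each partition with internal validation criteria) can be sketched as follows. The blob data is a toy stand-in for the real document embeddings, and all variable names are illustrative:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, davies_bouldin_score

# Toy stand-in for the document embeddings; four well-separated groups.
X, _ = make_blobs(n_samples=200, centers=[[0, 0], [10, 10], [0, 10], [10, 0]],
                  cluster_std=1.0, random_state=42)

best_k, best_sil = None, -1.0
for k in range(2, 8):                      # candidate values of K
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    sil = silhouette_score(X, labels)      # higher is better
    dbi = davies_bouldin_score(X, labels)  # lower is better; a secondary check
    if sil > best_sil:
        best_k, best_sil = k, sil
```

On this synthetic data the silhouette criterion recovers the planted four-cluster structure; on the real abstracts the same scan over a wider range of K yielded K = 20.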
(2) Hierarchical Clustering [64,65]: This algorithm constructs a tree-like visualization of clusters (dendrogram) by initially considering every data item as one cluster and then repeatedly joining the two closest clusters. A linkage criterion is used to define the similarity. We have employed average linkage in this study, which calculates the average distance between the points of each pair of clusters A and B. The number of final clusters is determined by cutting the dendrogram at a level that maximizes internal validation criteria, yielding 20 clusters.
(3) Spectral Clustering [66,67,68]: This algorithm transforms the clustering problem into a graph partitioning task. It first constructs a similarity graph G = (V, E), where vertices V represent data points and edges E are weighted by an affinity matrix W that reflects pairwise similarities. From this graph, it computes the graph Laplacian L = D − W, where the diagonal degree matrix D has entries Dii = Σj Wij. The algorithm then solves for the first K eigenvectors of L to embed the data into a lower-dimensional space in which the structure of the graph is easier to separate. Finally, a simple clustering method like k-means is performed in this reduced spectral space. This is an attractive method for detecting non-convex or intricate cluster shapes, because it utilizes the connectivity of the data as opposed to relying on Euclidean distance.
(4) DBSCAN [69,70]: This algorithm identifies clusters as areas of high point density and labels points in sparse regions as noise. Given a dataset D = {z1, z2, …, zn}, the ε-neighborhood of a point p is
Nε(p) = {q ∈ D | ‖zp − zq‖ ≤ ε}
A point p is a core point if
|Nε(p)| ≥ minPts
where ε is the neighborhood radius (ε > 0) used to measure local point density, and minPts specifies the minimum number of neighboring points required to classify a point as a core point. A point q is directly density-reachable from p if
q ∈ Nε(p) and |Nε(p)| ≥ minPts
More generally, q is density-reachable from p if there exists a chain of points.
p = p1, p2, …, pn = q
where each pi+1 is directly density-reachable from pi. Two points p and q are density-connected if there exists some point o such that both p and q are density-reachable from o. A cluster C is then defined as a non-empty set of points satisfying the following:
∀p, q ∈ C, ∃o ∈ C such that p and q are density-connected via o
Points that are not density-reachable from any core point form the set of noise:
Noise = {p ∈ D | p is not density-reachable from any core point}
This density-based formulation enables DBSCAN to discover clusters of arbitrary shape, automatically handle outliers, and avoid the need to pre-specify the number of clusters. The performance of DBSCAN depends on the choice of the neighborhood radius ε and the minimum point count minPts, as these two parameters directly control cluster granularity and noise detection. In this work, ε is determined from the KNN-distance graph: the distances to the k-th nearest neighbors are sorted and plotted, and the elbow point is located. Using this analysis, we set ε = 0.8 for TF–IDF embeddings and ε = 0.6 for dense embedding spaces (Word2Vec and transformer-based embeddings), reflecting the difference in their distance distributions.
The parameter minPts depends on the embedding dimension and dataset size. Following common practice, minPts was initially set to d + 1 (so that a core point has at least as many neighbors as the embedding dimension) and then manually adjusted. In our experiments, minPts is set to 10, which balances noise resistance and cluster density. A sensitivity test around these values indicated that the clustering results are robust to reasonable parameter variations. The selected (ε, minPts) values are kept fixed in all DBSCAN experiments for a fair comparison with other clustering techniques.
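The ε-selection and clustering steps above can be sketched as follows. The two tight Gaussian groups and the single outlier are toy stand-ins for the embedding space, and the specific eps/min_samples values are chosen for this synthetic data, not the paper's tuned settings:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

# Toy 2-D stand-in for the embedding space: two dense groups and one outlier.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.05, (40, 2)),
               rng.normal(1.0, 0.05, (40, 2)),
               [[5.0, 5.0]]])                     # an isolated noise point

# k-distance curve used to pick eps: sort the distances to the k-th
# neighbor and look for the elbow (plotting is omitted here).
k = 5
dists, _ = NearestNeighbors(n_neighbors=k).fit(X).kneighbors(X)
k_dist_curve = np.sort(dists[:, -1])

# With eps well below the inter-group gap, DBSCAN finds the two dense
# groups and labels the isolated point as noise (-1).
labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)
```

The sharp jump at the end of `k_dist_curve` corresponds to the outlier, which is exactly the kind of elbow inspected when calibrating ε on the real embeddings.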
(5) DEC [22]: This is a deep clustering technique that concurrently learns a nonlinear feature representation and allocates data points to clusters in a probabilistic manner. It employs an encoder network fθ to map the original input xi to a latent embedding zi = fθ(xi). Each embedded point zi is then softly assigned to cluster k using a Student’s t-distribution kernel [22]:
qik = (1 + ‖zi − μk‖²/α)^−(α+1)/2 / Σj (1 + ‖zi − μj‖²/α)^−(α+1)/2
where μk is the cluster centroid in the latent space, while α (α > 0) is the degree-of-freedom parameter of Student’s t-distribution, which controls the softness of cluster assignments by regulating the tail heaviness of the similarity function in the latent space. In this study, α was fixed at 1 to provide stable and robust soft cluster assignments while avoiding additional hyperparameter tuning.
To improve cluster compactness, DEC iteratively refines the embeddings by minimizing the Kullback–Leibler (KL) [22] divergence between the current soft assignment distribution Q = [qik] and an auxiliary target distribution P = [pik] that emphasizes high-confidence assignments. The clustering objective is
LDEC = KL(P‖Q) = Σi Σk pik log(pik / qik)
The target distribution P is typically defined to increase the weight of data points with confident cluster assignments, for example,
pik = (qik² / Σi qik) / Σj (qij² / Σi qij)
so that points strongly associated with a cluster influence its centroid update more. By alternating between updating P and optimizing the encoder parameters θ to minimize L D E C , the model jointly learns a compact latent space and improved cluster structure without requiring labeled data.
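The soft assignment, target distribution, and KL objective above can be sketched in NumPy. This is a toy illustration of the quantities involved, not the full DEC training loop with an encoder network; the latent points and centroids are hypothetical:

```python
import numpy as np

def soft_assign(Z, mu, alpha=1.0):
    """Student's t soft assignments q_ik between embeddings Z and centroids mu."""
    d2 = ((Z[:, None, :] - mu[None, :, :]) ** 2).sum(-1)   # squared distances
    q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(q):
    """Sharpened auxiliary distribution p_ik = (q_ik^2 / sum_i q_ik), renormalized."""
    w = q ** 2 / q.sum(axis=0)
    return w / w.sum(axis=1, keepdims=True)

Z = np.array([[0.0, 0.0], [0.1, 0.0], [2.0, 2.0]])   # toy latent points
mu = np.array([[0.0, 0.0], [2.0, 2.0]])              # two centroids
q = soft_assign(Z, mu)
p = target_distribution(q)

kl = np.sum(p * np.log(p / q))            # L_DEC = KL(P || Q)
```

Note how P sharpens Q: points already close to a centroid receive an even higher target probability, which is what pulls the encoder toward compact clusters during training.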
(6) IDEC [23]: This algorithm extends the DEC framework by incorporating a reconstruction loss to preserve the local structure of the input space while improving clustering quality. Similarly to DEC, an encoder fθ maps each input document xi (e.g., a text embedding) to a latent representation zi = fθ(xi) and produces soft cluster assignments qik via a Student’s t-distribution. However, unlike DEC, IDEC adds a decoder gϕ that reconstructs the input from the latent space to guarantee the retention of significant semantic information during clustering. The overall goal is to combine the DEC loss with a reconstruction term
LIDEC = LDEC + γ Σi ‖xi − gϕ(fθ(xi))‖²
where gϕ is the decoder network and γ > 0 is a tradeoff parameter balancing clustering compactness against input reconstruction fidelity. This additional term helps to avoid the loss of detailed local semantic information caused by purely clustering-based training, which is especially important for text data, where even a small change in a representation can substantially alter its meaning. Maintaining the local structure of the embeddings preserves topic continuity and reduces the risk of degenerate clusters, as we demonstrate in the experiment section. By combining the clustering and reconstruction objectives, IDEC is more reliable in practice than DEC and better preserves the structure of the text embeddings, which makes it suitable for clustering documents or topics with little supervision.
(7) DCN [71]: This algorithm brings together the strengths of deep representation learning and clustering by jointly minimizing an autoencoder reconstruction loss and a k-means clustering loss in the learned latent space. An encoder fθ maps the input document xi to a latent representation hi = fθ(xi), and a decoder gϕ attempts to reconstruct the original input from this latent vector to preserve semantic information. Meanwhile, the latent features hi are encouraged to form compact clusters around learnable cluster centroids μci. The overall loss function is defined as follows:
LDCN = Σi ‖xi − gϕ(fθ(xi))‖² + λ Σi ‖hi − μci‖²
In this equation, the first term is the reconstruction loss which is responsible for retaining the original text representation, and the second term is the clustering loss that reduces the distance between every embedded document hi and the cluster centroid μ c i that has been assigned. λ is a positive weighting parameter that controls the tradeoff between the reconstruction loss and the clustering loss. It balances preserving the original semantic information of the input documents and enforcing compact cluster formation in the latent space. In practice, λ must be greater than zero. In our experiments, λ was set to 1 to provide a balanced contribution of both objectives, following common practice in the deep clustering literature.
Such an architecture allows the network to find representations that are semantically close within clusters and well separated between clusters, producing compact clusters for text data. Through the dual process of reconstruction and clustering, DCN generates representations that reflect the latent topical structure of large document collections while remaining less susceptible to noise.
(8) VaDE [25]: VaDE combines variational autoencoder (VAE)-based representation learning in a latent space with cluster assignment. As in a classical VAE, an encoder qϕ(z|x) maps each input document x to a latent distribution, and a decoder pθ(x|z) reconstructs the input from the latent variable z. Unlike the generic VAE, which assumes data are generated from a simple Gaussian prior, VaDE introduces a Gaussian Mixture Model (GMM) prior that facilitates learning complex cluster structures in the latent space. During training, we maximize the evidence lower bound (ELBO):
LVaDE = Eqϕ(z|x)[log pθ(x|z)] − KL(qϕ(z|x) ‖ p(z))
where the prior p(z) is defined as a mixture of Gaussians:
p(z) = Σk πk N(z | μk, Σk)
with mixture weights πk, means μk, and covariances Σk for each cluster k. The first term ensures that the input is reconstructed faithfully, while the second minimizes the KL divergence between the encoder distribution and the GMM prior, aligning latent codes with cluster components. VaDE is well suited to text because the GMM prior can model a wide variety of semantic topics, and the variational formulation naturally accounts for uncertainty in the latent embedding. As a result, VaDE can handle documents with overlapping or ambiguous themes more effectively than deterministic clustering and should therefore yield well-separated and semantically coherent clusters.
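The role of the GMM prior can be illustrated with a small numeric sketch (not the full VaDE training loop): under a mixture prior, a latent code near one mixture component receives a much higher prior log-density than a code lying between components. The component parameters below are hypothetical, and diagonal covariances are assumed for simplicity:

```python
import numpy as np

def gmm_log_prior(z, pi, mu, sigma2):
    """log p(z) under a diagonal-covariance Gaussian mixture prior,
    p(z) = sum_k pi_k N(z | mu_k, diag(sigma2_k))."""
    K, d = mu.shape
    log_comp = []
    for k in range(K):
        quad = ((z - mu[k]) ** 2 / sigma2[k]).sum()
        log_norm = -0.5 * (d * np.log(2 * np.pi) + np.log(sigma2[k]).sum())
        log_comp.append(np.log(pi[k]) + log_norm - 0.5 * quad)
    return np.logaddexp.reduce(log_comp)   # log-sum-exp over components

pi = np.array([0.5, 0.5])                  # mixture weights
mu = np.array([[0.0, 0.0], [4.0, 4.0]])    # component means
sigma2 = np.ones((2, 2))                   # diagonal variances

# A latent code near the first component vs. one midway between components.
near = gmm_log_prior(np.array([0.1, 0.0]), pi, mu, sigma2)
mid = gmm_log_prior(np.array([2.0, 2.0]), pi, mu, sigma2)
```

Minimizing KL(qϕ(z|x) ‖ p(z)) therefore pushes latent codes toward the mixture components, which is what induces the cluster structure in VaDE's latent space.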

3.2.4. Threshold Calibration [72,73]

In order to guarantee that the accepted clusters are not only geometrically good (dense, well separated) but also semantically meaningful (capturing crop–disease–symptom–prevention/control knowledge), we calibrate the acceptance thresholds once using a stratified pilot set (~15–20% of the generated clusters). These clusters are independently evaluated and labeled as Accept or Revise by two plant pathology experts. We then combine the internal geometry indices (Silhouette Score, Davies–Bouldin Index, Calinski–Harabasz Index) with semantic cohesion (UMass) and expert usefulness into a single acceptance criterion.
The calibration also guarantees that the produced clusters capture entities and contextual information that are not only geometrically structured but also semantically meaningful for agricultural knowledge discovery. In particular, valid topics are identified through a two-stage validation: (a) internal geometry-based criteria (cluster compactness and separation) and (b) semantic topic coherence, assessed by expert judgment. Geometry-based measures are employed as an initial filter to rule out unstable or ill-defined clusters, while expert coherence evaluation guarantees the interpretability and utility of the retained clusters. Only clusters that meet both the geometric stability and semantic relevance criteria under the a priori acceptance thresholds are included in the final calibrated results. Together, these complementary criteria ensure that calibration weeds out clusters that are dense but semantically incoherent, as well as clusters that are interpretable but geometrically unstable.
Not all generated clusters are retained after threshold calibration. In total, 20 clusters are retained as valid topics, while 7 clusters are rejected due to low geometric stability or insufficient semantic coherence. Rejected clusters typically exhibit low Silhouette Scores (<0.15), high Davies–Bouldin Indices (>2.5), or poor expert-assessed interpretability.
Abstracts belonging to rejected clusters are not discarded. Instead, they are reassigned to the nearest accepted clusters based on cosine similarity in the embedding space, provided that the similarity exceeds a minimum threshold of 0.65. Abstracts that do not satisfy this criterion are excluded from downstream knowledge graph construction to prevent noise propagation.
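The reassignment rule above (nearest accepted cluster by cosine similarity, with a 0.65 floor) can be sketched as follows; the toy vectors and centroids are illustrative stand-ins for the real embeddings:

```python
import numpy as np

def reassign(abstract_vecs, accepted_centroids, threshold=0.65):
    """Map each abstract from a rejected cluster to the nearest accepted
    cluster by cosine similarity, or to -1 (excluded) below the threshold."""
    A = abstract_vecs / np.linalg.norm(abstract_vecs, axis=1, keepdims=True)
    C = accepted_centroids / np.linalg.norm(accepted_centroids, axis=1, keepdims=True)
    sims = A @ C.T                         # pairwise cosine similarities
    nearest = sims.argmax(axis=1)
    best = sims.max(axis=1)
    return np.where(best >= threshold, nearest, -1)

centroids = np.array([[1.0, 0.0], [0.0, 1.0]])      # two accepted clusters
vecs = np.array([[0.9, 0.1],                        # close to cluster 0
                 [0.1, 0.9],                        # close to cluster 1
                 [-1.0, -1.0]])                     # similar to neither
assigned = reassign(vecs, centroids)
```

Abstracts mapped to −1 are the ones excluded from downstream knowledge graph construction.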
The threshold calibration process has the following steps:
Step 1: Computing Internal Geometry Metrics—This step measures the structural quality of each cluster k via the Silhouette Score, the Davies–Bouldin Index, and the Calinski–Harabasz Index.
The Silhouette Score (S.S.) [74] measures how well a data point fits within its assigned cluster compared to other clusters. It is calculated as follows:
S.S.(i) = (b(i) − a(i)) / max{a(i), b(i)}
where a(i) is the average distance to points in the same cluster, and b(i) is the average distance to the nearest other cluster. Scores go from −1 to 1: values close to 1 mean that the clusters are well separated and are compact; values around 0 indicate that the clusters intersect; and negative values signify that the points are incorrectly clustered.
The Davies–Bouldin Index (DBI) [74] measures the average similarity between each cluster and its most similar cluster, based on within-cluster scatter and between-cluster separation. The formula is given as
DBI = (1/K) Σk=1..K maxj≠k (σk + σj) / d(μk, μj)
where K represents the number of clusters, σk denotes the scatter (average distance of points) within cluster k, and d(μk, μj) is the distance between the centroids of clusters k and j. A lower DBI implies more favorable clustering as it suggests that clusters have maintained their compactness (low scatter) and are far apart from each other.
The Calinski–Harabasz Index (CH) [74] is a metric that quantifies clustering effectiveness by evaluating the degree of separation between clusters against their internal compactness. The formula is defined as follows:
CH = [tr(BK) / (K − 1)] / [tr(WK) / (N − K)]
where BK represents the between-cluster scatter (how far clusters are from one another), WK is the within-cluster scatter (how close points are within each cluster), K stands for the number of clusters, and N is the total number of points. A higher CH score reflects greater between-cluster separation relative to within-cluster scatter, indicating clusters that are both well separated and tightly bound internally.
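All three geometry metrics are available in scikit-learn, which the paper states is the implementation library. A minimal sketch on synthetic data standing in for the document embeddings:

```python
from sklearn.datasets import make_blobs
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score)

# Toy embeddings with a known three-cluster structure.
X, labels = make_blobs(n_samples=150, centers=[[0, 0], [8, 8], [0, 8]],
                       cluster_std=0.5, random_state=1)

ss = silhouette_score(X, labels)          # in [-1, 1], higher is better
dbi = davies_bouldin_score(X, labels)     # >= 0, lower is better
ch = calinski_harabasz_score(X, labels)   # higher is better
```

On well-separated data like this, the scores comfortably clear the calibrated thresholds reported later (S.S. ≥ 0.35, DBI ≤ 0.8); borderline clusters in the real corpus sit much closer to those cutoffs.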
Step 2: Adding Semantic Coherence—The UMass topic coherence score indicates how well the top terms in each cluster make sense semantically and co-occur in documents [75,76]. This metric is computed as follows:
CUMass(k) = (2 / (T(T − 1))) Σm=2..T Σl=1..m−1 log((D(wm, wl) + ε) / D(wl))
where T denotes the number of top terms in a cluster, and m and l are indices over the ranked list of top terms. Specifically, wm and wl represent the m-th and l-th most frequent terms within a given cluster, respectively, with m > l. The term D(wm, wl) denotes the number of documents in which both terms co-occur, while D(wl) represents the document frequency of term wl.
In this study, terms refer to content words extracted from the abstracts after standard text preprocessing, including tokenization, stopword removal, and lemmatization. These terms are ranked within each cluster based on their term frequency, and the top T terms are used to compute semantic coherence.
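The UMass computation can be sketched directly from the formula above, treating each document as a set of preprocessed terms. The toy documents and term lists are illustrative:

```python
import math

def umass_coherence(top_terms, docs, eps=1e-12):
    """UMass coherence over the top-T ranked terms of a cluster, using
    document co-occurrence counts; each doc is a set of terms."""
    T = len(top_terms)
    total = 0.0
    for m in range(1, T):
        for l in range(m):                       # all pairs with m > l
            w_m, w_l = top_terms[m], top_terms[l]
            d_l = sum(1 for d in docs if w_l in d)
            d_ml = sum(1 for d in docs if w_l in d and w_m in d)
            total += math.log((d_ml + eps) / d_l)
    return 2.0 * total / (T * (T - 1))

docs = [{"blast", "lesion", "rice"}, {"blast", "lesion"},
        {"blast", "fungicide"}, {"rotation", "rice"}]
coherent = umass_coherence(["blast", "lesion"], docs)      # co-occurring pair
incoherent = umass_coherence(["blast", "rotation"], docs)  # never co-occur
```

UMass scores are negative log-ratios, so values closer to zero indicate more coherent term sets; terms that never co-occur are penalized heavily through the ε-smoothed numerator.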
Step 3: Expert Utility Scoring (Pilot Set)—About 15–20% of the clusters (a pilot set) are randomly selected so as to represent the various crop types and clustering algorithms. The experts review only the data provided in each cluster. The two experts first analyze each cluster independently and then decide together:
“Accept”—the cluster is helpful and applicable to symptoms, control, or prevention.
“Revise”—the cluster requires modifications such as combining, dividing, or changing the name.
To measure how consistently the experts agree, the formula of Cohen’s kappa (κ) is employed as follows:
κ = (po − pe) / (1 − pe)
where po represents the observed agreement between the experts, and pe is the agreement expected by chance. A higher κ signifies better agreement between the expert judgments. Human validation supplies the ground truth for what a “good” cluster means in the field of plant disease.
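Inter-rater agreement of this kind is directly computable with scikit-learn's `cohen_kappa_score`. The Accept/Revise labels below are hypothetical pilot-set judgments, not the study's actual expert data:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical Accept/Revise labels from two experts on a 10-cluster pilot set.
expert_1 = ["Accept", "Accept", "Revise", "Accept", "Revise",
            "Accept", "Accept", "Revise", "Accept", "Accept"]
expert_2 = ["Accept", "Accept", "Revise", "Accept", "Accept",
            "Accept", "Accept", "Revise", "Accept", "Accept"]

kappa = cohen_kappa_score(expert_1, expert_2)
```

Here the experts disagree on one cluster out of ten (po = 0.9, pe = 0.62), giving κ ≈ 0.74, which would conventionally be read as substantial agreement.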
Step 4: Defining Acceptance Rule and Tuning Thresholds—Once the S.S., DBI, CH, CUMass, and expert agreement measured by Cohen’s κ have been calculated, an acceptance rule is established. This involves adjusting the thresholds to confirm that the clusters not only have good geometric properties but are also semantically meaningful. A cluster k is accepted only if it satisfies all four quality criteria:
Accept(k) ⇔ S.S.(k) ≥ τS.S. ∧ DBI(k) ≤ τDBI ∧ CUMass(k) ≥ τCUMass ∧ E(k) ≥ τE
where the symbol “∧” denotes the logical conjunction “and”, indicating that all conditions must be satisfied simultaneously. S.S.(k) denotes the Silhouette Score of cluster k, DBI(k) is the Davies–Bouldin Index, CUMass(k) represents semantic coherence, and E(k) denotes the expert utility score. This conjunctive acceptance rule ensures that only clusters that are geometrically stable, semantically coherent, and practically useful are retained for downstream knowledge graph construction.
Finding the optimal cutoffs (τS.S., τDBI, τCUMass, τE) involves traversing a grid of candidate values and selecting the combination that best matches the expert Accept/Revise labels on the stratified pilot set (approximately 15–20% of clusters). Agreement with the expert labels is measured by Cohen’s κ (macro-F1 reaches the same optimum):
(τ*S.S., τ*DBI, τ*CUMass, τ*E) = arg maxτ κ(τ)
After the best tuple has been identified, its robustness is tested by resampling the documents (bootstrapping) and slightly varying the clustering hyperparameters (such as K in k-means, ε and minPts in DBSCAN, or the reconstruction weights in DEC/IDEC/DCN/VaDE). Only thresholds that are stable under such changes are retained. This process yields robust final thresholds that are fixed for all subsequent experiments: S.S. ≥ 0.35, DBI ≤ 0.8, CUMass ≥ 0.4, and E ≥ 4. These values correspond to the strongest expert consensus and remain stable when both the parameters and the data are slightly perturbed. Fixing the thresholds makes the evaluation transparent, repeatable, and less prone to overfitting; that is, clusters that survive into subsequent experiments are likely to be both structurally sound and useful for plant disease knowledge discovery.
After threshold calibration, we expect high-quality, clearly defined clusters that directly reflect real plant disease topics (e.g., Rice—Blast; Sugarcane—Red Rot; Cassava—Mosaic Disease). These clusters are not only geometrically coherent but also semantically consistent and grounded in domain knowledge. When they are subsequently used in tasks such as named entity recognition (NER) and knowledge graph creation, we can be confident that the extracted information—symptoms, control measures, prevention strategies—originates from clusters that have been thoroughly quality-checked and approved by experts. Consequently, all subsequent analyses rely on trustworthy and domain-appropriate knowledge structures rather than noisy or ill-defined clusters.

3.2.5. Sentence Extraction and Facet Building

The purpose is to generate for each cluster a knowledge card comprising representative sentences labeled with three categories (i.e., facets): symptoms (SYM), control or treatment (CTL), and prevention (PRV). A facet-based lexicon and the Maximal Marginal Relevance (MMR) technique [77] are used to obtain the most relevant and non-redundant set of sentences. This step results in a clear domain-specific summary per cluster—for example, a Rice–Blast card with main symptoms, control methods, and prevention strategies—that can then be employed for named entity recognition (NER) [78] and knowledge graph (KG) construction [78]. This process consists of the following steps:
Step 1: Splitting and Normalization—Initially, every sentence in every abstract is extracted and preprocessed, including conversion to lowercase, removing in-text citations, parentheses, and duplicates of units or symbols. The sentences are linked to the cluster that their parent abstract belongs to, so that subsequent investigations can find the source cluster of each extracted sentence.
Step 2: Construction of Facet Lexicon—The three facets were identified by domain experts as domain requirements of plant disease knowledge:
SYM (symptoms)—textual descriptions of visible plant disease manifestations.
CTL (control/treatment)—chemical, biological, or cultural methods applied to manage the disease.
PRV (prevention/resistance)—proactive strategies such as resistant cultivars or crop rotation.
Seed terms for each facet were elicited from the domain experts (e.g., “lesion, spot, symptom” for SYM; “fungicide, treatment” for CTL; “resistant, rotation, prevention” for PRV). This seed lexicon was extended with words from the extracted sentences. Initially, all unique words from the extracted sentences were collected as candidate words after text preprocessing (tokenization, lowercasing, and stopword removal).
Seed terms were obtained through structured expert elicitation sessions with two plant pathology experts. Each expert independently identified representative domain keywords for the three facets (symptoms, control/treatment, and prevention/resistance) based on professional experience and standard plant pathology references. The resulting term lists were then consolidated through discussion to remove redundancy and ensure adequate domain coverage. At this stage, sentences are not explicitly labeled with facets; instead, facet association is performed in a weakly supervised manner based on seed term matching.
The number of times a word w appears in sentences associated with facet f is counted. At this stage, sentences are not explicitly labeled with facets. Instead, a sentence is weakly associated with a facet f if it contains one or more seed terms belonging to the facet-specific lexicon. Based on these weak associations, the point-wise mutual information between word w and facet f is computed to measure the strength of their association:
PMI(w, f) = log(P(w, f) / (P(w) × P(f)))
where P(w, f) denotes the probability that word w co-occurs with sentences associated with facet f via seed term matching, P(w) is the probability of observing word w in the corpus, and P(f) represents the probability of sentences associated with facet f.
However, if PMI turns out to be less than zero (the word is found together with the facet less frequently than expected by chance), the value is zeroed. To expand the facet lexicon, words having a high PMI (or a high embedding similarity to the facet centroid) are considered for inclusion.
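The PMI-with-clipping rule above can be sketched in plain Python; the sentences (as token sets) and their weak facet associations are hypothetical:

```python
import math

def pmi(word, facet, sentences, facet_of):
    """PMI between a word and a facet; each sentence (a set of tokens) is
    weakly associated with one facet via seed-term matching (facet_of)."""
    n = len(sentences)
    p_w = sum(1 for s in sentences if word in s) / n
    p_f = sum(1 for f in facet_of if f == facet) / n
    p_wf = sum(1 for s, f in zip(sentences, facet_of)
               if word in s and f == facet) / n
    if p_wf == 0.0:
        return 0.0
    return max(0.0, math.log(p_wf / (p_w * p_f)))  # negative PMI clipped to 0

# Hypothetical preprocessed sentences and their weak facet associations.
sentences = [{"lesion", "leaf"}, {"lesion", "spot"},
             {"fungicide", "spray"}, {"rotation", "crop"}]
facet_of = ["SYM", "SYM", "CTL", "PRV"]

pmi_sym = pmi("lesion", "SYM", sentences, facet_of)  # strong association
pmi_ctl = pmi("lesion", "CTL", sentences, facet_of)  # never co-occurs
```

Words whose clipped PMI exceeds a chosen cutoff (here "lesion" for SYM, with PMI = log 2) are the candidates added to the facet lexicon.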
Step 3: Sentence Scoring per Facet—For each sentence s, we calculate how strongly it belongs to a given facet f (SYM, CTL, or PRV) using a weighted scoring function. The final score combines three signals—lexicon matching (Lex), embedding similarity (Sim), and context cues (Ctx)—with tunable weights α, β, and γ:
scoref(s) = α·Lex(s, f) + β·Sim(s, f) + γ·Ctx(s), with α + β + γ = 1
where α, β, and γ are non-negative weighting coefficients that control the relative contribution of different signals in the facet scoring function, with α + β + γ = 1. α determines the importance of lexicon-based matching (Lex), capturing explicit keyword overlap with the facet lexicon; β controls the contribution of semantic similarity (Sim), measuring embedding-based closeness between a sentence and the facet centroid; and γ weights the influence of contextual cues (Ctx), which capture facet-specific linguistic patterns and indicative expressions beyond direct keyword matching.
Lex(s, f) measures how many words in the sentence overlap with the facet’s lexicon. The following formula defines the lexicon match score for a sentence s with respect to a facet f. It sums the contributions of all words w that appear both in the sentence s and in the lexicon Lf of the facet f. Each matching word is weighted by ωw, which reflects its importance (e.g., expert-defined or frequency-based), and by idf(w), the inverse document frequency that assigns higher weight to less frequent, more informative words.
Lex(s, f) = Σw∈s∩Lf ωw · idf(w)
Intuitively, this score measures how much a sentence contains keywords of the target facet: the more important and rarer those matched words are, the larger the lexicon match score will be, indicating that this sentence is more relevant to the target facet.
Sim(s, f) measures how semantically close the sentence is to the facet using vector representations (e.g., SBERT embeddings). Cosine similarity is applied as follows:
Sim(s, f) = cos(vs, vf) = (vs · vf) / (‖vs‖ ‖vf‖)
where vs is the sentence embedding and vf is the centroid embedding of the facet. Here, the facet centroid embedding vf represents a prototype vector for facet f and is computed prior to sentence-level facet assignment. Specifically, vf is obtained by averaging the embedding vectors of expert-defined seed terms associated with facet f. Each seed term is encoded using the same sentence embedding model (e.g., SciBERT or SBERT), and the centroid is calculated as the mean of these term embeddings. This design avoids circular dependency, as the centroid does not rely on sentences being pre-assigned to facets.
Ctx(s) assigns to each sentence s a score Ctx(s) ∈ [0, 1], indicating how strongly it expresses a target facet—SYM (symptoms), CTL (control), or PRV (prevention). First, we build a facet-specific pattern dictionary containing indicative keywords or regular expressions (e.g., lesion, wilting → SYM; applied, treated → CTL; resistant, prevent → PRV). Each pattern p has a weight wp (e.g., 1 for strong, 0.5 for weaker cues). For each sentence s, we sum the weights of matched patterns using
Mf(s) = Σp∈patterns(s) wp
and normalize by the maximum possible pattern weight as follows:
Ctx(s) = min(1, Mf(s) / max_patterns)
The contextual score Ctx(s) is designed to capture facet-specific linguistic patterns that cannot be reliably identified through isolated keyword matching (Lex) or embedding-based similarity alone. Unlike Lex(s, f), which measures word-level overlaps between a sentence and a facet lexicon, Ctx(s) operates at the phrase and pattern level. It incorporates multi-word expressions and regular expression patterns that reflect how facet-related information is typically articulated in agricultural scientific writing.
For example, control and treatment information often appears in patterns such as “was treated with X”, “application of fungicide Y”, or “disease severity was reduced by…”, which may not be captured by individual keywords alone. Similarly, prevention-related statements frequently involve constructions such as “resistant cultivar”, “crop rotation was adopted”, or “preventive management strategy”. These patterns encode contextual cues about intent and action that are distinct from simple lexical presence.
If no patterns are detected, Ctx(s) is set to 0, whereas sentences containing strong facet-specific cues approach a value of 1. Pattern weights may be manually assigned by domain experts or statistically estimated (e.g., using log-likelihood or mutual information). Accordingly, Ctx(s) provides a normalized quantitative signal indicating how well a sentence aligns with its intended facet.
In this study, the weights α, β, and γ control the relative contributions of lexical matching, semantic similarity, and contextual cues, respectively. These weights are tuned via grid search on a small expert-annotated validation set to maximize agreement with facet labels, measured using Cohen’s κ or macro-F1. The final configuration (α, β, γ) = (0.4, 0.4, 0.2) reflects a balanced scoring strategy that emphasizes domain-specific keywords and semantic alignment, while incorporating contextual patterns as a supportive but less dominant signal. This blended scoring ensures that selected sentences are lexically relevant, semantically coherent, and contextually appropriate for their target facets.
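The three signals and their weighted combination can be sketched as follows. The lexicon weights, idf values, patterns, and toy embeddings are illustrative placeholders, not the paper's tuned resources; only the weights (0.4, 0.4, 0.2) come from the text above:

```python
import math
import re

def lex_score(tokens, lexicon, idf):
    """Lex(s, f): idf-weighted overlap with the facet lexicon."""
    return sum(lexicon[w] * idf.get(w, 0.0) for w in tokens if w in lexicon)

def cos_sim(u, v):
    """Sim(s, f): cosine similarity between sentence and facet centroid."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def ctx_score(sentence, patterns, max_weight):
    """Ctx(s): normalized sum of weights of matched facet patterns."""
    m = sum(w for pat, w in patterns if re.search(pat, sentence))
    return min(1.0, m / max_weight)

# Hypothetical CTL facet resources.
lexicon = {"fungicide": 1.0, "treated": 0.8}
idf = {"fungicide": 2.0, "treated": 1.5}
patterns = [(r"treated with", 1.0), (r"application of", 0.5)]

sentence = "plants were treated with a triazole fungicide"
v_s, v_f = [1.0, 0.2], [0.9, 0.3]         # toy sentence/facet embeddings

alpha, beta, gamma = 0.4, 0.4, 0.2        # final weights from the paper
score = (alpha * lex_score(sentence.split(), lexicon, idf)
         + beta * cos_sim(v_s, v_f)
         + gamma * ctx_score(sentence, patterns, max_weight=1.5))
```

Note that Lex is unbounded (it grows with matched-term idf mass) while Sim and Ctx are bounded, so in practice the lexical signal often dominates unless the component scores are normalized before weighting.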
After performing the Sentence Extraction and Facet Building process, illustrative example results for the “Rice—Blast” case are presented as follows:
Symptoms: “Elliptical, gray-centered lesions with brown margins appear on leaves during humid conditions.”
Control: “Triazole fungicides applied at tillering reduced lesion area by 45–60% in field trials.”
Prevention: “Cultivars harboring Pi-ta showed stable resistance across two seasons under rainfed conditions.”

3.2.6. Knowledge Graph Construction [78]

Once high-quality clusters and representative sentences have been obtained, we convert the outputs into a crop disease knowledge graph. Each cluster corresponds to a crop disease topic, and its representative sentences are first assigned facet relevance scores using the sentence scoring function described in Step 3 (Equation (23)). Specifically, for each sentence s, a facet score scoref(s) is computed by combining lexical matching, semantic similarity, and contextual cues. The facet with the highest score determines the dominant facet of the sentence (SYM, CTL, or PRV). These facet-labeled sentences are then used as input to the named entity recognition (NER) process to extract key domain entities, including crops, diseases, symptoms, control/treatment actions, and prevention strategies. Based on the identified entities and the dominant facet assignment, semantic relations such as hasSymptom, controlledBy, and preventedBy are established. The corresponding facet score is further used as a weight to rank and filter candidate ⟨entity–relation–entity⟩ triplets, ensuring that relations supported by highly relevant and contextually consistent sentences are prioritized. The resulting knowledge graph provides a structured representation of crop disease knowledge and facilitates downstream applications such as disease surveillance, knowledge retrieval, and decision support in agricultural informatics. The process of constructing the knowledge graph can be described as follows:
Step 1: Entity Definition and Normalization—The first step is to define the entity types that will serve as the key nodes in the knowledge graph: crop, disease, symptom, control/management, and prevention/resistance. A named entity recognition (NER) [78] or dictionary-based term matching method is used on each document to recognize the entities. The terms found are then linked to a standardized form such as “Xanthomonas oryzae pv. oryzae”. We employ a lightweight ontology- and dictionary-assisted normalization strategy to standardize extracted entities. Specifically, crop and disease entities are mapped to canonical concepts using the Plant Disease Ontology (PDO), which provides standardized identifiers and synonym mappings for major plant diseases. In addition, expert-curated domain dictionaries are used to consolidate spelling variants, abbreviations, and synonymous expressions commonly found in the agricultural literature (e.g., alternative disease names or chemical formulations). These resources are used solely for entity normalization and synonym resolution, rather than ontology reasoning or inference. After this step, each cluster is associated with a consistent and standardized set of entities (crop, disease, symptom, control, prevention), which are then used to construct a machine-readable knowledge graph in subsequent stages.
Step 2: Relation Extraction—Sentences are first grouped by facet to identify their relations, which then become the edges of the knowledge graph. For every cluster, the facets identified earlier (e.g., SYM for symptoms, CTL for control/management, PRV for prevention) are used to determine the relations associated with the entities (e.g., crops, diseases, symptoms, management actions) identified in the sentences. The four main relation types are as follows:
  • CROP → hasDisease → DISEASE
  • DISEASE → hasSymptom → SYMPTOM
  • DISEASE → controlledBy → CONTROL
  • DISEASE → preventedBy → PREVENTION
Each edge is created by selecting the sentence(s) in a cluster with the highest score for a given facet. For instance, if a sentence conforms closely to the CTL (control/management) patterns, this leads to the creation of a controlledBy edge, which links a disease control action to the disease mentioned. These edges convert the otherwise unstructured text into a structured knowledge graph, thereby allowing queries such as “Which control measures are reported for rice blast?” or “What symptoms indicate sugarcane red rot?”
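The edge-creation rule (one edge per facet, taken from the highest-scoring sentence in the cluster) can be sketched as follows; the sentence records with `facet`, `score`, and `entity` fields are hypothetical stand-ins for the actual facet-scored output:

```python
# Illustrative sketch: one edge per facet, taken from the highest-scoring
# sentence in the cluster. The data structures here are hypothetical.
RELATION_FOR_FACET = {"SYM": "hasSymptom", "CTL": "controlledBy", "PRV": "preventedBy"}

def facet_edges(disease, sentences):
    """sentences: dicts with 'facet' (SYM/CTL/PRV), 'score', and 'entity'."""
    edges = []
    for facet, relation in RELATION_FOR_FACET.items():
        candidates = [s for s in sentences if s["facet"] == facet]
        if candidates:
            best = max(candidates, key=lambda s: s["score"])  # top-scoring sentence
            edges.append((disease, relation, best["entity"]))
    return edges

sents = [
    {"facet": "SYM", "score": 0.81, "entity": "diamond-shaped lesions"},
    {"facet": "CTL", "score": 0.74, "entity": "triazole fungicide"},
    {"facet": "CTL", "score": 0.55, "entity": "copper spray"},
]
edges = facet_edges("Rice blast", sents)
```

Note that only the best CTL sentence contributes an edge; the lower-scoring "copper spray" sentence is ignored, and no PRV edge is created because no PRV sentence exists in this toy cluster.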
Step 3: Confidence Scoring (Optional)—In this step, each relation edge is assigned a confidence score between 0 and 1 by combining the following three signals:
Lex—lexical match strength between the sentence and the facet keyword set.
Sim—semantic similarity between the sentence embedding and the facet centroid embedding (rather than the cluster/topic embedding), ensuring that similarity is computed with respect to the intended relation type.
Ctx—a contextual cue score derived from facet-specific phrase-level and syntactic patterns (e.g., regular expressions), capturing how facet-related information is expressed beyond isolated keywords. Unlike Lex, which captures word-level overlap with facet lexicons, Ctx models how facet-related actions or properties are expressed through multi-word constructions and linguistic patterns, such as treatment actions or resistance statements.
These signals are combined into a weighted score:
Confidence(e) = α·Lex(s, f) + β·Sim(s, f) + γ·Ctx(s),  where α + β + γ = 1
Edges with confidence scores below a predefined threshold (e.g., <0.5) can be eliminated so that only highly reliable and meaningful relationships are retained. This ensures that the graph keeps only those edges supported by strong textual and semantic evidence, thereby enhancing its accuracy and making the resulting knowledge graph more credible and usable in further tasks (e.g., search, policy support).
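A minimal sketch of the weighted confidence score and threshold filtering follows; the weights α = 0.4, β = 0.4, γ = 0.2 and the 0.5 cutoff are illustrative choices satisfying α + β + γ = 1, not the tuned values used in the experiments:

```python
# Sketch of the weighted confidence score; alpha/beta/gamma and the 0.5
# threshold are illustrative (any weights summing to 1 may be used).
def confidence(lex, sim, ctx, alpha=0.4, beta=0.4, gamma=0.2):
    assert abs(alpha + beta + gamma - 1.0) < 1e-9  # weights must sum to 1
    return alpha * lex + beta * sim + gamma * ctx

def filter_edges(scored_edges, threshold=0.5):
    """Keep only (edge, score) pairs whose confidence meets the threshold."""
    return [(e, c) for e, c in scored_edges if c >= threshold]
```

Raising the threshold trades coverage for precision, which matters for the downstream filtering discussed in Section 4.3.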
Step 4: Knowledge Graph Construction and Visualization—The verified entities and relations are converted into a knowledge graph in this step. NetworkX (a Python-native graph library, version 3.10) is used to model each node as an entity (crop, disease, symptom, control, prevention) and each edge as a relation extracted from the facet-scored sentences (e.g., Crop → hasDisease → Disease, Disease → hasSymptom → Symptom, Disease → controlledBy → Control, Disease → preventedBy → Prevention). The graph is then visualized to locate key subgraphs—such as Rice → Blast → (lesions appear, triazole fungicide, resistant cultivar)—which explicitly indicate how a crop disease is linked to its symptoms, control methods, and prevention strategies. A well-defined, Python-based, and easily visualizable knowledge graph helps researchers navigate the crop disease knowledge domain and find actionable insights. Example results of the knowledge graph are illustrated in Figure 1.
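Assuming triplets are already available, the NetworkX construction and a facet-style query can be sketched as follows (the example triplets mirror the Rice → Blast subgraph mentioned above):

```python
import networkx as nx

# Minimal sketch of the NetworkX construction; the triplets are illustrative.
G = nx.DiGraph()
triplets = [
    ("Rice", "hasDisease", "Rice blast"),
    ("Rice blast", "hasSymptom", "lesions appear"),
    ("Rice blast", "controlledBy", "triazole fungicide"),
    ("Rice blast", "preventedBy", "resistant cultivar"),
]
for head, relation, tail in triplets:
    G.add_edge(head, tail, relation=relation)  # typed edge via an attribute

def controls_for(disease):
    """Answer queries such as 'Which control measures are reported for rice blast?'"""
    return [t for _, t, d in G.out_edges(disease, data=True)
            if d["relation"] == "controlledBy"]
```

Storing the relation type as an edge attribute keeps the graph a plain `DiGraph`; a `MultiDiGraph` would be needed only if two different relations could link the same ordered node pair.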

3.3. Implementation Details and Reproducibility

While implementation choices are described at each methodological stage in Section 3.2, this subsection consolidates the key experimental settings to facilitate reproducibility and transparent re-implementation.
Transformer-based embeddings (XLNet, SBERT, SciBERT) were obtained using pretrained models from the Hugging Face Transformers and Sentence-Transformers libraries. Document representations were derived using mean pooling or CLS token representations, depending on the model architecture. TF–IDF representations were generated using scikit-learn, and Word2Vec embeddings were trained using the Gensim library (version 4.3.2) with standard skip-gram settings.
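The TF–IDF part of this setup can be sketched with scikit-learn as follows; the two toy abstracts are illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Sketch of the TF-IDF document representation step; the toy abstracts
# are illustrative, not drawn from the study corpus.
abstracts = [
    "Rice blast lesions were controlled by triazole fungicide application.",
    "Resistant cultivars prevented bacterial blight in rice fields.",
]
vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
X = vectorizer.fit_transform(abstracts)  # sparse matrix: documents x vocabulary
```

The resulting sparse matrix can be fed directly to the clustering algorithms described next; the transformer embeddings would instead be dense vectors produced by the respective pretrained models.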
Classical clustering algorithms (k-means, hierarchical clustering, spectral clustering, DBSCAN) were implemented using scikit-learn. Deep clustering models (DEC, IDEC, DCN, and VaDE) were implemented in the PyTorch framework (version 2.1.2), with hyperparameters selected based on internal validation metrics and stability analysis.
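A minimal sketch of the classical clustering step with scikit-learn, run here on synthetic two-dimensional points standing in for document embeddings; the parameter values are illustrative, not the tuned settings reported in the paper:

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN

# Sketch of classical clustering on synthetic 2-D points standing in for
# document embeddings; eps/min_samples and n_clusters are illustrative.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)),   # tight group near (0, 0)
               rng.normal(3.0, 0.1, (20, 2))])  # tight group near (3, 3)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
db_labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)  # -1 would mean noise
```

Unlike k-means, DBSCAN does not require the number of clusters in advance, but its output depends on ε and minPts, which is the sensitivity noted in Section 3.4.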
Threshold calibration parameters, including Silhouette Score, Davies–Bouldin Index, UMass coherence, and expert utility thresholds, were tuned using a stratified pilot set and fixed across all experiments.
All deep models were trained in a single-NVIDIA-GPU environment. Runtime statistics and computational complexity are reported in Section 3.4. This consolidated description aims to support transparent replication and fair comparison with future studies.

3.4. Computational Complexity and Runtime Analysis

The different methods exhibit substantial differences in computational complexity. Classical clustering algorithms such as k-means and hierarchical clustering have time complexities of O(nkd) and O(n² log n), respectively, where n is the number of documents, k is the number of clusters, and d is the embedding dimension. Spectral clustering requires an eigendecomposition of the affinity matrix, which is O(n³) in the worst case and therefore does not scale well to large corpora. The average complexity of DBSCAN is O(n log n), with a runtime that depends on ε and minPts. In practice, DBSCAN took 1–2 min to cluster the full corpus of 1079 abstracts using the selected parameters.
Deep clustering methods (DEC, IDEC, DCN, and VaDE) have high computational consumption because of neural network fitting. The runtimes of DEC and IDEC on a single NVIDIA GPU machine were 20–30 min to converge, and for DCN and VaDE, they were about 30–45 min because of the presence of additional reconstruction and variational components. Despite the increased runtime, they generated significantly more stable and semantically meaningful clusters, thereby validating their computational cost in offline knowledge discovery tasks.

4. Experimental Results and Discussion

4.1. Quantitative Evaluation of Embedding–Clustering Combinations

To measure the quality of document clustering in the proposed multi-stage NLP framework, we evaluated 40 embedding–clustering combinations resulting from combining five text representation methods—TF-IDF, Word2Vec, XLNet, SBERT, and SciBERT—with eight clustering algorithms—k-means, hierarchical clustering (HC), spectral clustering (SC), DBSCAN, DEC, IDEC, DCN, and VaDE. Four well-known internal and semantic metrics were used to measure clustering quality: Silhouette Score (S.S.), Davies–Bouldin Index (DBI), Calinski–Harabasz Index (CH), and UMass topic coherence. Higher values of S.S., CH, and UMass indicate clusters that are more compact, well-separated, and semantically coherent, respectively. A lower DBI indicates better inter-cluster separation. This assessment helps us identify the best combinations of state-of-the-art embeddings and clustering algorithms for the crop disease literature, providing a sound basis for the subsequent stages of facet extraction and knowledge graph construction. Table 1 lists the metric scores comparing the clustering performance of the five text embeddings with the eight clustering algorithms on the crop disease corpus.
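Three of the four metrics are available directly in scikit-learn (UMass coherence would typically come from Gensim); a sketch on synthetic, well-separated clusters illustrates the expected directions of the scores:

```python
import numpy as np
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score)

# Sketch of the internal metrics on synthetic, well-separated clusters:
# higher Silhouette/CH and lower DBI indicate better clustering.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, (25, 2)), rng.normal(4.0, 0.1, (25, 2))])
labels = np.array([0] * 25 + [1] * 25)

ss = silhouette_score(X, labels)        # near 1 for compact, separated clusters
dbi = davies_bouldin_score(X, labels)   # near 0 for well-separated clusters
ch = calinski_harabasz_score(X, labels) # large for strong cluster structure
```

On real document embeddings, the scores are far less extreme, but the same directions apply when comparing configurations as in Table 1.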
Table 1 summarizes the quantitative performance of all embedding–clustering combinations evaluated in this study using internal geometric metrics (Silhouette Score, Davies–Bouldin Index, and Calinski–Harabasz Index) and semantic coherence (UMass). Overall, transformer-based embeddings consistently outperform classical representations, with domain-adapted models exhibiting the strongest gains across all evaluation criteria.
Among all configurations, SciBERT combined with VaDE achieves the best overall performance, obtaining the highest Silhouette Score (0.54), the lowest Davies–Bouldin Index (0.88), and the highest UMass coherence score (0.46). This configuration outperforms the second-best combination (SBERT + VaDE) by margins ranging from 0.02 to 0.04 across the primary metrics, indicating more compact cluster geometry and stronger semantic consistency. In contrast, lexical and static embeddings such as TF–IDF and Word2Vec yield substantially lower Silhouette Scores (0.21–0.34) and weaker semantic coherence, regardless of the clustering algorithm employed.
Across all 40 embedding–clustering combinations, Silhouette Scores range from 0.21 (TF–IDF + k-means) to 0.54 (SciBERT + VaDE), while UMass coherence increases steadily as representations move from sparse lexical features to contextualized and domain-specific embeddings. This trend highlights the importance of contextual semantic modeling for organizing the crop disease literature into meaningful topics.
Regarding clustering strategies, deep clustering methods (VaDE, IDEC, and DCN) consistently outperform classical algorithms in terms of both geometric compactness and semantic coherence. While DCN and IDEC achieve competitive Silhouette and Calinski–Harabasz scores, VaDE provides the most balanced performance by simultaneously optimizing cluster separation and semantic interpretability. In contrast, classical methods such as k-means and hierarchical clustering exhibit sensitivity to initialization and embedding quality, resulting in less stable performance across metrics.
The superior performance of deep clustering methods (VaDE, IDEC, DCN) can be attributed to their ability to jointly optimize representation learning and cluster assignment in a shared latent space. Unlike classical methods that operate on fixed embeddings, deep clustering models iteratively reshape the embedding space to emphasize semantically meaningful structures. This property is particularly beneficial for the crop disease literature, where disease topics exhibit overlapping symptoms, diverse terminology, and contextual variability. As a result, deep clustering methods produce clusters that are not only geometrically compact but also semantically coherent, leading to higher performance across both internal and semantic evaluation metrics.
Importantly, the results reveal a clear interaction effect between embedding representations and clustering algorithms. Classical clustering methods benefit substantially from contextualized embeddings but remain limited in capturing complex topic overlap. Deep clustering models, on the other hand, realize their strongest performance gains when paired with domain-adapted embeddings such as SciBERT, suggesting that generative latent modeling and scientific pretraining are complementary for agricultural text mining tasks.
Overall, the results in Table 1 demonstrate that no single component alone accounts for performance improvements. Rather, effective document organization in crop disease research emerges from the joint selection of embedding representations and clustering strategies, motivating the integrated benchmarking framework proposed in this study.
Beyond the quantitative improvements reported in Table 1, the observed gains in clustering quality are not isolated to the document organization stage but have direct downstream implications for subsequent knowledge extraction tasks. More compact and semantically coherent clusters reduce topical ambiguity and noise, thereby enabling more reliable selection of representative sentences during facet extraction. This, in turn, leads to clearer separation between symptom, control, and prevention facets, which is essential for accurate entity–relation extraction. As a result, the downstream knowledge graph construction benefits from cleaner input sentences and more consistent semantic contexts, yielding higher-quality triplets and more interpretable crop–disease subgraphs. These findings highlight that gains in clustering quality translate into tangible benefits across the entire multi-stage pipeline, reinforcing the coherence and effectiveness of the proposed framework.

4.2. Evaluation of Sentence Extraction and Facet Building

For the evaluation of sentence extraction and facet classification, a total of N representative sentences were selected from the 20 crop–disease clusters, with approximately ns, nc, and np sentences corresponding to the symptoms (SYM), control/treatment (CTL), and prevention/resistance (PRV) facets, respectively. Precision, recall, and F1-scores were computed at the sentence level and macro-averaged across facets to ensure balanced evaluation despite class size differences. The reported scores therefore reflect average performance across all facets rather than dominance by any single category.
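The macro-averaged scoring can be sketched with scikit-learn; the gold and predicted facet labels below are illustrative:

```python
from sklearn.metrics import precision_recall_fscore_support

# Sketch of sentence-level, macro-averaged facet evaluation; the gold and
# predicted labels are illustrative, not taken from the study data.
gold = ["SYM", "SYM", "CTL", "CTL", "PRV", "PRV", "SYM", "CTL"]
pred = ["SYM", "SYM", "CTL", "PRV", "PRV", "PRV", "SYM", "CTL"]

precision, recall, f1, _ = precision_recall_fscore_support(
    gold, pred, labels=["SYM", "CTL", "PRV"], average="macro", zero_division=0)
```

With `average="macro"`, each facet contributes equally to the final score regardless of its sentence count, which is exactly the balanced evaluation described above.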
To construct a reliable gold standard, expert annotations were first assessed using inter-annotator agreement, measured by Cohen’s kappa. Only sentences with consistent expert labels were retained for evaluation, ensuring that the reported metrics are grounded in semantically validated reference data.
Following the identification of SciBERT + VaDE as the best-performing embedding–clustering configuration in the previous experiments, each high-quality cluster was further transformed into structured, facet-labeled knowledge units. The objective of this stage is to extract representative sentences and organize them into three domain-specific facets: SYM (symptoms), CTL (control/treatment), and PRV (prevention/resistance).
Facet assignment is performed using a weighted scoring function (Equation (23)) that integrates three complementary signals: (i) Lex(s, f), which measures lexical matching between a sentence and expert-defined facet lexicons; (ii) Sim(s, f), which captures semantic similarity between the sentence embedding and the facet centroid using SciBERT (version 2.2.2) representations; and (iii) Ctx(s), which incorporates contextual cues derived from facet-specific linguistic patterns.
Table 2 reports the sentence-level precision, recall, and F1-scores for facet assignment across the three knowledge dimensions.
In Table 2, the quantitative evaluation of sentence extraction and facet assignment across the 20 crop–disease clusters confirms that the proposed pipeline can produce semantically coherent clusters and accurately extract representative sentences for all three facets. Overall, the model achieved high F1-scores across facets, with PRV consistently yielding the highest values (0.82–0.86). This result can be attributed to the relatively stable and domain-specific linguistic patterns used to describe preventive strategies, such as references to resistant cultivars, crop rotation, and sanitation practices.
The SYM facet also exhibited strong performance (F1 ≈ 0.78–0.84), benefiting from explicit and descriptive symptomatology that is frequently present in scientific abstracts (e.g., lesion morphology, discoloration patterns, or disease progression indicators). By contrast, CTL recorded slightly lower F1-scores (≈0.72–0.80), reflecting the higher linguistic variability associated with chemical names, fungicide formulations, dosage expressions, and management practices, which are more difficult to capture using lexicon-driven and semantic similarity components alone.
A closer inspection of CTL-related errors reveals three systematic sources of difficulty. At the lexical level, control and treatment sentences frequently include diverse chemical names, trade formulations, and dosage expressions that are sparsely covered by expert-defined lexicons. At the semantic level, management actions are often embedded within comparative or experimental result statements, which reduces their embedding similarity to a single facet centroid. From a contextual perspective, CTL information commonly co-occurs with symptom descriptions or preventive recommendations within the same abstract, leading to partial facet overlap and increased ambiguity. These challenges are primarily corpus-driven rather than model-specific, reflecting how disease management practices are typically reported in the agricultural scientific literature. These factors collectively suggest that the lower CTL performance reflects intrinsic properties of how disease management knowledge is reported in scientific abstracts, rather than limitations of the proposed scoring model.
Facet-level performance also varied moderately across crops. Rice and sugarcane clusters demonstrated particularly high PRV and SYM accuracy, likely due to the abundance of the curated literature and well-established terminology in these crops. Oil palm and cassava clusters maintained good PRV performance but showed slightly lower CTL scores, consistent with the more heterogeneous and context-dependent management recommendations reported for these crops. Soybean clusters, particularly those related to rust and cyst nematode diseases, exhibited robust PRV and SYM performance, aligning with the availability of standardized resistant varieties that generate clearer linguistic signals.
Taken together, these results indicate that domain-adapted transformer embeddings combined with deep clustering provide a robust unsupervised foundation for structuring specialized agricultural text corpora. Moreover, the resulting facet-labeled sentences serve as reliable building blocks for downstream knowledge graph construction, enabling the automatic integration of crop–disease–symptom–control prevention relationships. While the overall performance is strong, the comparatively lower CTL scores suggest that future work may benefit from enhanced chemical name recognition, improved treatment-focused named entity recognition, or hybrid fine-tuning strategies to better capture the diversity of disease management practices.
Despite the overall strong performance, it is important to note that the reported results are influenced by corpus-specific characteristics. In particular, crops with limited or heterogeneous literature coverage may exhibit lower recall for certain relation types, especially control/treatment, due to sparse or inconsistently reported management details. These observations highlight that the evaluation reflects realistic constraints of agricultural abstracts rather than uniformly optimal extraction conditions, underscoring the need for cautious interpretation of the results.

4.3. Knowledge Graph Evaluation

To evaluate the constructed knowledge graph, we conducted a multi-level assessment covering structural properties, triplet-level correctness, and expert-based semantic validation. Overall, the generated knowledge graph contains approximately 520 unique entities and 1840 extracted triplets, spanning symptoms, control/treatment actions, and prevention strategies across all crop–disease subgraphs derived from the 20 clusters. All quantitative metrics reported in this section are macro-averaged across subgraphs to avoid bias toward crops or diseases with larger literature coverage.
In this study, node coverage is defined as the proportion of unique domain entities (crops, diseases, symptoms, control actions, and prevention strategies) successfully instantiated in the knowledge graph relative to the set of expected entities derived from expert-validated facet sentences. Relation completeness refers to the extent to which facet-labeled sentences yield valid <entity–relation–entity> triplets, measured as the ratio of retained triplets after confidence filtering. Connectivity is interpreted as the degree to which disease nodes are linked to at least one symptom, control, and prevention entity, reflecting practical usability in an agricultural decision-support context.
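Given these definitions, node coverage and per-disease connectivity can be sketched over a toy NetworkX subgraph; the expected-entity set and triplets are illustrative:

```python
import networkx as nx

# Sketch of node coverage and connectivity as defined above, on a toy
# subgraph; the expected-entity set and triplets are illustrative.
G = nx.DiGraph()
for h, r, t in [("Rice", "hasDisease", "Rice blast"),
                ("Rice blast", "hasSymptom", "lesions"),
                ("Rice blast", "controlledBy", "triazole fungicide"),
                ("Rice blast", "preventedBy", "resistant cultivar")]:
    G.add_edge(h, t, relation=r)

expected = {"Rice", "Rice blast", "lesions", "triazole fungicide",
            "resistant cultivar", "crop rotation"}  # one expected entity was missed
node_coverage = len(set(G.nodes) & expected) / len(expected)

def fully_connected(disease):
    """True if the disease links to at least one SYM, CTL, and PRV entity."""
    rels = {d["relation"] for _, _, d in G.out_edges(disease, data=True)}
    return {"hasSymptom", "controlledBy", "preventedBy"} <= rels
```

Here coverage is 5/6 because "crop rotation" never entered the graph, while the Rice blast node satisfies the connectivity criterion by linking to all three facet types.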

4.3.1. Structural Evaluation

Structural evaluation assesses the global organization and compactness of the constructed knowledge graph. Metrics including node coverage, edge density, and average node degree were computed across all crop–disease subgraphs. These measures provide an overview of how effectively the extracted entities and relations form a connected yet non-redundant representation of crop disease knowledge. The results indicate that the knowledge graph achieves high node coverage while maintaining moderate edge density, suggesting that the extraction process captures diverse disease-related concepts without introducing excessive redundancy. The observed structural properties are consistent across crops, indicating stable graph construction behavior under varying corpus sizes.

4.3.2. Triplet-Level Evaluation

Triplet-level evaluation examines the correctness and completeness of extracted <entity–relation–entity> triplets. For this purpose, approximately 600 triplets were randomly sampled from the full graph, with balanced coverage across relation types (SYM, CTL, and PRV). Precision, recall, and F1-scores were computed at the triplet level and macro-averaged across relation categories. The results demonstrate high triplet precision, confirming that the majority of retained relations are semantically valid. Recall values further indicate that the framework effectively captures key disease-related relations reported in the literature, while the confidence scoring mechanism helps suppress low-quality extractions.

4.3.3. Expert-Based Semantic Evaluation

To further validate semantic correctness, an expert-based evaluation was conducted on a stratified subset of the constructed knowledge graph. Approximately 25% of the crop–disease subgraphs—covering rice, sugarcane, soybean, cassava, and oil palm—were independently reviewed by two plant pathology experts with professional experience in crop disease diagnosis and management. The experts assessed extracted <entity–relation–entity> triplets at the semantic level based on three predefined criteria: (i) entity correctness, referring to whether entity mentions accurately correspond to valid crops, diseases, symptoms, control actions, or prevention strategies; (ii) relation validity, indicating whether the semantic relationship between entity pairs is correctly represented; and (iii) semantic plausibility, reflecting consistency with established plant pathology knowledge. For each evaluated triplet, experts provided binary judgments (correct/incorrect) according to these criteria. Only triplets judged as valid and consistent by both experts were retained for quantitative expert validation. Inter-annotator agreement, measured using Cohen’s κ (≈0.81), indicates substantial agreement and supports the reliability of the expert-reviewed gold standard.
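The agreement computation can be sketched with scikit-learn's `cohen_kappa_score`; the two experts' binary triplet judgments below are illustrative:

```python
from sklearn.metrics import cohen_kappa_score

# Sketch of the inter-annotator agreement computation; the two experts'
# binary judgments (1 = valid triplet, 0 = invalid) are illustrative.
expert_a = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]
expert_b = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
kappa = cohen_kappa_score(expert_a, expert_b)
```

Cohen's κ corrects raw agreement for chance, which is why a 9/10 raw agreement here yields κ ≈ 0.78, in the "substantial agreement" range comparable to the κ ≈ 0.81 reported above.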

4.3.4. Confidence Scoring and Automated Filtering

Each extracted triplet is associated with a confidence score derived from the combined lexical, semantic, and contextual evidence used during extraction. In our experiments, triplets with confidence scores below 0.5 were excluded from the final graph to balance precision and coverage. In addition to triplet-level accuracy, the confidence scoring mechanism provides a practical basis for automated relation filtering in downstream applications. By adjusting the confidence threshold, practitioners can trade off between coverage and precision depending on task requirements. For example, higher thresholds may be applied in decision-support or policy-oriented settings where reliability is critical, while lower thresholds may be acceptable for exploratory knowledge discovery. This flexibility enables scalable and semi-automatic maintenance of the knowledge graph without requiring continuous expert intervention.
The quantitative results of the knowledge graph evaluation are summarized in Table 3.
The results of the knowledge graph evaluation are presented in Table 3. The results show that the proposed multi-stage NLP pipeline—including abstract-level embedding and variational deep clustering for topic discovery, followed by sentence-level extraction of domain-specific information (symptoms, control/treatment, and prevention/resistance) using a weighted scoring mechanism and multi-score relation extraction—can effectively construct a domain-consistent knowledge graph.

4.4. Limitations

Despite the promising results, this study has several limitations that should be acknowledged. First, the analysis is based solely on research abstracts rather than full-text articles. While abstracts provide concise summaries of key findings, they may omit detailed experimental procedures, contextual explanations, or nuanced management practices, potentially limiting the completeness of the extracted knowledge graph.
Second, the quality and coverage of the constructed knowledge graph are influenced by corpus characteristics, including the availability and consistency of the agricultural literature across different crops and diseases. Crops with limited or heterogeneous reporting may exhibit lower recall for certain relation types, particularly in the control/treatment facet, due to diverse chemical nomenclature and context-dependent management descriptions.
Third, expert-based validation, although essential for ensuring semantic correctness, depends on the availability and expertise of domain specialists. While substantial inter-annotator agreement was achieved, scaling this validation process to larger corpora or additional crop systems may be constrained by expert availability.
Finally, the proposed framework operates primarily at the sentence level and does not explicitly model cross-sentence or document-level reasoning. As a result, complex causal chains or implicit relationships spanning multiple sentences may not be fully captured. These limitations suggest opportunities for future extensions incorporating richer contextual modeling and complementary supervision strategies.

5. Conclusions

This study demonstrates that transformer-based semantic representations, variational deep clustering, facet-aware relevance modeling, and ontology-driven entity normalization with multi-score relation extraction can be effectively integrated into a unified framework for constructing a domain-consistent crop disease knowledge graph from the unstructured agricultural literature. Based on experiments conducted on a curated corpus of 1079 crop disease abstracts and a fixed set of embedding–clustering configurations, the quantitative results indicate that the proposed framework yields structurally coherent and semantically reliable knowledge graphs, achieving high node coverage, strong relation completeness, and compact graph connectivity. Triplet-level evaluation further supports the accuracy of extracted relations, while expert assessment confirms biological relevance and practical utility within the evaluated domain. These empirically grounded findings suggest that, under the studied conditions, the framework provides a reliable foundation for downstream tasks such as representative sentence extraction, knowledge graph population, and computational support for crop disease monitoring and agronomic decision-making.
Nevertheless, the conclusions drawn from this study should be interpreted in light of several limitations and sources of variability. In particular, density-based clustering methods such as DBSCAN exhibit sensitivity to parameter choices (e.g., ε and minPts), which may influence cluster granularity and noise detection under different settings or corpora. Moreover, the framework’s performance and coverage remain dependent on the representativeness of the source literature and the availability of expert validation, potentially limiting direct transferability to other crop systems or large-scale corpora without recalibration. In addition, sentence-level extraction constrains the modeling of multi-sentence causal chains and implicit reasoning patterns common in complex plant–pathogen interactions. These limitations highlight opportunities for further methodological refinement and robustness analysis.
Future work may explore the integration of large-scale generative or instruction-tuned language models to better capture implicit relations, long-range dependencies, and contextual reasoning beyond explicit textual cues. Extending the framework to multilingual or cross-regional agricultural corpora would also facilitate the construction of more inclusive and globally applicable crop disease knowledge graphs, supporting broader decision-making in agricultural informatics and crop protection services.

Author Contributions

Conceptualization, J.P. and M.K.; Methodology, J.P. and M.K.; Software, B.L.; Validation, J.P. and M.K.; Formal Analysis, J.P.; Investigation, J.P.; Resources, J.P., C.S.G.K., and W.-N.C.; Data Curation, J.P., C.S.G.K., and W.-N.C.; Writing—Original Draft, J.P.; Writing—Review and Editing, J.P., C.S.G.K., and M.K.; Visualization, J.P.; Supervision, J.P.; Project Administration, J.P. and M.K.; Funding Acquisition, J.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are available on request from the corresponding author due to publisher copyright restrictions, as the abstracts were collected from multiple academic databases (PubMed, CAB Abstracts, AGRICOLA, and AGRIS).

Acknowledgments

This research project was financially supported by Mahasarakham University.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Corley, J.C. Thoughts on publication and other issues in pest and weed management. Int. J. Pest Manag. 2019, 65, 95–96. [Google Scholar] [CrossRef]
  2. Rodríguez-García, M.Á.; García-Sánchez, F.; Valencia-García, R. Knowledge-Based System for Crop Pests and Diseases Recognition. Electronics 2021, 10, 905. [Google Scholar] [CrossRef]
  3. Upadhyay, A.; Chandel, N.S.; Singh, K.P.; Chakraborty, S.K.; Nandede, B.M.; Kumar, M.; Subeesh, A.; Upendar, K.; Salem, A.; Elbeltagi, A. Deep Learning and Computer Vision in Plant Disease Detection: A Comprehensive Review of Techniques, Models, and Trends in Precision Agriculture. Artif. Intell. Rev. 2025, 58, 92. [Google Scholar] [CrossRef]
  4. Qiu, X.; Chen, H.; Huang, P.; Zhong, D.; Guo, T.; Pu, C.; Li, Z.; Liu, Y.; Chen, J.; Wang, S. Detection of Citrus Diseases in Complex Backgrounds Based on Image–Text Multimodal Fusion and Knowledge Assistance. Front. Plant Sci. 2023, 14, 1280365. [Google Scholar] [CrossRef]
  5. Ngugi, H.N.; Akinyelu, A.A.; Ezugwu, A.E. Machine Learning and Deep Learning for Crop Disease Diagnosis: Performance Analysis and Review. Agronomy 2024, 14, 3001. [Google Scholar] [CrossRef]
  6. Yan, R.; An, P.; Meng, X.; Li, Y.; Li, D.; Xu, F.; Dang, D. A Knowledge Graph for Crop Diseases and Pests in China. Sci. Data 2025, 12, 222. [Google Scholar] [CrossRef] [PubMed]
  7. Zhao, X.; Chen, B.; Ji, M.; Wang, X.; Yan, Y.; Zhang, J.; Liu, S.; Ye, M.; Lv, C. Implementation of Large Language Models and Agricultural Knowledge Graphs for Efficient Plant Disease Detection. Agriculture 2024, 14, 1359. [Google Scholar] [CrossRef]
  8. Jafar, A.; Bibi, N.; Naqvi, R.A.; Sadeghi-Niaraki, A.; Jeong, D. Revolutionizing Agriculture with Artificial Intelligence: Plant Disease Detection Methods, Applications, and Their Limitations. Front. Plant Sci. 2024, 15, 1356260. [Google Scholar] [CrossRef]
  9. Alharbi, A.; Aslam, M.A.; Asiry, K.A.; Aljohani, N.R.; Glikman, Y. An Ontology-Based Agriculture Decision-Support System with an Evidence-Based Explanation Model. Smart Agric. Technol. 2024, 9, 100659. [Google Scholar] [CrossRef]
  10. Zhu, D.; Xie, L.; Chen, B.; Tan, J.; Deng, R.F.; Zheng, Y.; Mustafa, R.; Chen, W.; Yi, S.; Yung, K.; et al. Knowledge Graph and Deep Learning Based Pest Detection and Identification System for Fruit Quality. Internet Things 2022, 21, 100649. [Google Scholar] [CrossRef]
  11. Bhuyan, B.P.; Tomar, R.; Ramdane-Chérif, A. A Systematic Review of Knowledge Representation Techniques in Smart Agriculture (Urban). Sustainability 2022, 14, 22. [Google Scholar] [CrossRef]
  12. Fedele, G.; Brischetto, C.; Rossi, V.; González-Domínguez, E. A Systematic Map of the Research on Disease Modelling for Agricultural Crops Worldwide. Plants 2022, 11, 6. [Google Scholar] [CrossRef]
  13. Qin, Z.; Lian, H.; He, T.; Luo, B. Cluster Correction on Polysemy and Synonymy. In Proceedings of the 14th Web Information Systems and Applications Conference (WISA), Liuzhou, China, 11–12 November 2017; pp. 136–138. [Google Scholar]
  14. Khan, D. Modeling and Semantic Clustering in Large-Scale Text Data: A Review of Machine Learning Techniques and Applications. Int. J. Sci. Res. Eng. Manag. 2025, 9, 10. [Google Scholar] [CrossRef]
  15. Wei, T.; Lu, Y.; Chang, H.; Zhou, Q.; Bao, X. A Semantic Approach for Text Clustering Using WordNet and Lexical Chains. Expert Syst. Appl. 2015, 42, 2264–2275. [Google Scholar] [CrossRef]
  16. Liu, Q.; Wang, J.; Zhang, D.; Yang, Y.; Wang, N. Text Features Extraction Based on TF-IDF Associating Semantic. In Proceedings of the IEEE 4th International Conference on Computer and Communications (ICCC), Chengdu, China, 7–10 December 2018; pp. 2338–2343. [Google Scholar]
  17. Muneeb, T.H.; Sahu, S.; Anand, A. Evaluating Distributed Word Representations for Capturing Semantics of Biomedical Concepts. In Proceedings of the BioNLP, Beijing, China, 30 July 2015. [Google Scholar]
  18. Yang, Z.; Dai, Z.; Yang, Y.; Carbonell, J.; Salakhutdinov, R.; Le, Q.V. XLNet: Generalized Autoregressive Pretraining for Language Understanding. Adv. Neural Inf. Process. Syst. 2019, 32. [Google Scholar]
  19. Reimers, N.; Gurevych, I. Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Hong Kong, China, 3–7 November 2019. [Google Scholar]
  20. Beltagy, I.; Lo, K.; Cohan, A. SciBERT: A Pretrained Language Model for Scientific Text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Hong Kong, China, 3–7 November 2019. [Google Scholar]
  21. Gillioz, A.; Casas, J.; Mugellini, E.; Khaled, O.A. Overview of the Transformer-Based Models for NLP Tasks. In Proceedings of the Conference on Computer Science and Information Systems, Belgrade, Serbia, 14–17 September 2020; pp. 179–183. [Google Scholar]
  22. Xie, J.; Girshick, R.B.; Farhadi, A. Unsupervised Deep Embedding for Clustering Analysis. In Proceedings of the International Conference on Machine Learning (ICML), Lille, France, 6–11 July 2015. [Google Scholar]
  23. Guo, X.; Gao, L.; Liu, X.; Yin, J. Improved Deep Embedded Clustering with Local Structure Preservation. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), Melbourne, Australia, 19–25 August 2017. [Google Scholar]
  24. Schnellbach, J.; Kajó, M. Clustering with Deep Neural Networks—An Overview of Recent Methods. Network 2020, 39, 39–43. [Google Scholar]
  25. Jiang, Z.; Zheng, Y.; Tan, H.; Tang, B.; Zhou, H. Variational Deep Embedding: An Unsupervised and Generative Approach to Clustering. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), New York, NY, USA, 9–15 July 2016. [Google Scholar]
  26. Tarekegn, A.N.; Rabbi, F.; Tessem, B. Large Language Model Enhanced Clustering for News Event Detection. arXiv 2024, arXiv:2406.10552. [Google Scholar] [CrossRef]
  27. Saha, R. Influence of Various Text Embeddings on Clustering Performance in NLP. arXiv 2023, arXiv:2305.03144. [Google Scholar] [CrossRef]
  28. Rahman, M.W.U.; Nevarez, R.; Mim, L.T.; Hariri, S. SDEC: Semantic Deep Embedded Clustering. IEEE Trans. Big Data 2025, 1, 1–16. [Google Scholar] [CrossRef]
  29. Keraghel, I.; Morbieu, S.; Nadif, M. Beyond Words: A Comparative Analysis of LLM Embeddings for Effective Clustering. In Proceedings of the International Symposium on Intelligent Data Analysis, Würzburg, Germany, 28–30 October 2024. [Google Scholar]
  30. Petukhova, A.; Matos-Carvalho, J.P.; Fachada, N. Text Clustering with Large Language Model Embeddings. Int. J. Cogn. Comput. Eng. 2024, 6, 100–108. [Google Scholar] [CrossRef]
  31. Allahyari, M.; Pouriyeh, S.; Assefi, M.; Safaei, S.; Trippe, E.D.; Gutiérrez, J.B.; Kochut, K. A Brief Survey of Text Mining: Classification, Clustering and Extraction Techniques. arXiv 2017, arXiv:1707.02919. [Google Scholar] [CrossRef]
  32. Garg, N.; Gupta, R.K. Clustering Techniques for Text Mining: A Review. Int. J. Eng. Res. 2016, 5, 241–243. [Google Scholar]
  33. Soucy, P.; Mineau, G. Beyond TF–IDF Weighting for Text Categorization in the Vector Space Model. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), Edinburgh, UK, 30 July–5 August 2005. [Google Scholar]
  34. Pradhan, L.; Zhang, C.; Bethard, S.; Chen, X. Embedding User Behavioral Aspect in TF–IDF-like Representation. In Proceedings of the IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), Miami, FL, USA, 10–12 April 2018. [Google Scholar]
  35. Tang, Z.; Li, W.; Li, Y.; Zhao, W.; Li, S. Several Alternative Term Weighting Methods for Text Representation and Classification. Knowl.-Based Syst. 2020, 207, 106385. [Google Scholar] [CrossRef]
  36. Mohammed, M.T.; Rashid, O.F. Document Retrieval Using Term Frequency Inverse Sentence Frequency Weighting Scheme. Indones. J. Electr. Eng. Comput. Sci. 2023, 3, 1478–1485. [Google Scholar] [CrossRef]
  37. Dasari, L.A.; Sowmith, J.; Krishna, M.N.; Saketh, C.; Venugopalan, M. Optimizing Agricultural Insights: Semantic Clustering and Topic Modelling for Farmer Queries. In Proceedings of the 3rd International Conference on Inventive Computing and Informatics (ICICI), Coimbatore, India, 8–10 January 2025. [Google Scholar]
  38. Thurnbauer, M.; Reisinger, J.; Goller, C.; Fischer, A. Towards Resolving Word Ambiguity with Word Embeddings. arXiv 2023, arXiv:2307.13417. [Google Scholar] [CrossRef]
  39. Clinchant, S.; Perronnin, F. Aggregating Continuous Word Embeddings for Information Retrieval. In Proceedings of the Workshop on Continuous Vector Space Models and Their Compositionality, Sofia, Bulgaria, 8 August 2013; pp. 100–109. [Google Scholar]
  40. Hu, W.; Zhang, J.; Zheng, N. Different Contexts Lead to Different Word Embeddings. In Proceedings of the International Conference on Computational Linguistics (COLING), Osaka, Japan, 11–16 December 2016. [Google Scholar]
  41. Rong, X. Word2Vec Parameter Learning Explained. arXiv 2014, arXiv:1411.2738. [Google Scholar]
  42. Dynomant, E.; Lelong, R.; Dahamna, B.; Massonnaud, C.; Kerdelhué, G.; Grosjean, J.; Canu, S.; Darmoni, S.J. Word Embedding for the French Natural Language in Health Care: Comparative Study. JMIR Med. Inform. 2019, 7, e12304. [Google Scholar] [CrossRef] [PubMed]
  43. Worth, P.J. Word Embeddings and Semantic Spaces in Natural Language Processing. Int. J. Intell. Sci. 2023, 13, 1. [Google Scholar] [CrossRef]
  44. Saranya, M.; Amutha, A. A Survey of Machine Learning Techniques for Topic Modeling and Word Embedding. In Proceedings of the 10th International Conference on Advanced Computing and Communication Systems (ICACCS), Coimbatore, India, 21–22 February 2024. [Google Scholar]
  45. Trask, A.; Michalak, P.; Liu, J.C. Sense2Vec: A Fast and Accurate Method for Word Sense Disambiguation in Neural Word Embeddings. arXiv 2015, arXiv:1511.06388. [Google Scholar]
  46. Mancini, M.; Camacho-Collados, J.; Iacobacci, I.; Navigli, R. Embedding Words and Senses Together via Joint Knowledge-Enhanced Training. In Proceedings of the Conference on Computational Natural Language Learning (CoNLL), Berlin, Germany, 11–12 August 2016. [Google Scholar]
  47. Wiedemann, G.; Remus, S.; Chawla, A.; Biemann, C. Does BERT Make Any Sense? Interpretable Word Sense Disambiguation with Contextualized Embeddings. In Proceedings of the Conference on Natural Language Processing, Tokyo, Japan, 29–31 October 2019. [Google Scholar]
  48. Meijer, H.; Truong, J.; Karimi, R. Document Embedding for Scientific Articles: Efficacy of Word Embeddings vs. TFIDF. arXiv 2021, arXiv:2107.05151. [Google Scholar] [CrossRef]
  49. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), Minneapolis, MN, USA, 2–7 June 2019. [Google Scholar]
  50. Zhang, R.; Wang, Y.-S.; Yang, Y.; Vu, T.; Lei, L. Exploiting Local and Global Features in Transformer-Based Extreme Multi-Label Text Classification. arXiv 2022, arXiv:2204.00933. [Google Scholar]
  51. Ha, T.-T.; Nguyen, V.; Nguyen, K.-H.; Nguyen, K.; Than, Q. Utilizing SBERT for Finding Similar Questions in Community Question Answering. In Proceedings of the International Conference on Knowledge and Systems Engineering, Hanoi, Vietnam, 27–29 October 2021. [Google Scholar]
  52. Boyack, K.; Newman, D.; Duhon, R.; Klavans, R.; Patek, M.; Biberstine, J.; Schijvenaars, B.; Skupin, A.; Ma, N.; Börner, K. Clustering More than Two Million Biomedical Publications: Comparing the Accuracies of Nine Text-Based Similarity Approaches. PLoS ONE 2011, 6, e18029. [Google Scholar] [CrossRef]
  53. Huang, P.; Huang, Y.; Wang, W.; Wang, L. Deep Embedding Network for Clustering. In Proceedings of the International Conference on Pattern Recognition (ICPR), Stockholm, Sweden, 24–28 August 2014. [Google Scholar]
  54. Xu, Y.; Huang, D.; Wang, C.; Lai, J. Deep Image Clustering with Contrastive Learning and Multi-Scale Graph Convolutional Networks. Pattern Recognit. 2024, 146, 109939. [Google Scholar] [CrossRef]
  55. Gupta, V.; Bharti, P.; Nokhiz, P.; Karnick, H. SumPubMed: Summarization Dataset of PubMed Scientific Articles. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), Bangkok, Thailand, 1–6 August 2021. [Google Scholar]
  56. Alizadeh, M.; Oveisi, M.; Falahati, S.; Mousavi, G.; Meybodi, M.A.; Mehrnia, S.S.; Hacihaliloglu, I.; Rahmim, A.; Salmanpour, M.R. AllMetrics: A Unified Python Library for Standardized Metric Evaluation and Robust Data Validation in Machine Learning. arXiv 2025, arXiv:2505.15931. [Google Scholar] [CrossRef]
  57. Lenci, A.; Sahlgren, M.; Jeuniaux, P.; Gyllensten, A.C.; Miliani, M. A Comprehensive Comparative Evaluation and Analysis of Distributional Semantic Models. arXiv 2021, arXiv:2105.09825. [Google Scholar]
  58. Kim, S.; Lee, S.; Yoon, B. Development of an Embedding Framework for Clustering Scientific Papers. IEEE Access 2022, 10, 32608–32621. [Google Scholar] [CrossRef]
  59. Kampffmeyer, M.C.; Løkse, S.; Bianchi, F.; Livi, L.; Salberg, A.-B.; Jenssen, J. Deep Divergence-Based Clustering. In Proceedings of the International Workshop on Machine Learning for Signal Processing, Tokyo, Japan, 25–28 September 2017. [Google Scholar]
  60. Druery, J.; McCormack, N.; Murphy, S. Are Best Practices Really Best? A Review of the Best Practices Literature in Library and Information Studies. Evid.-Based Libr. Inf. Pract. 2013, 8, 110–128. [Google Scholar] [CrossRef]
  61. Ramos, J.E. Using TF–IDF to Determine Word Relevance in Document Queries. Semantic Scholar. 2003. Available online: https://www.researchgate.net/publication/228818851_Using_TF-IDF_to_determine_word_relevance_in_document_queries (accessed on 1 September 2025).
  62. Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.; Dean, J. Distributed Representations of Words and Phrases and Their Compositionality. Adv. Neural Inf. Process. Syst. 2013, 26. [Google Scholar]
  63. Jin, X.; Han, J. K-Means Clustering. In Encyclopedia of Machine Learning; Sammut, C., Webb, G.I., Eds.; Springer: Boston, MA, USA, 2011. [Google Scholar]
  64. Dabhi, D.; Patel, M.R. Extensive Survey on Hierarchical Clustering Methods in Data Mining. Semantic Scholar. 2016. Available online: https://www.irjet.net/archives/V3/i11/IRJET-V3I11115.pdf (accessed on 5 September 2025).
  65. Murtagh, F. Hierarchical Clustering. In International Encyclopedia of Statistical Science; Springer: Berlin/Heidelberg, Germany, 2016. [Google Scholar]
  66. Bach, F.; Jordan, M.I. Learning Spectral Clustering. In Advances in Neural Information Processing Systems. 2003. Available online: https://www.di.ens.fr/~fbach/nips03_cluster.pdf (accessed on 5 September 2025).
  67. Dhillon, I.; Guan, Y.; Kulis, B. A Unified View of Kernel k-Means, Spectral Clustering and Graph Cuts. Semantic Scholar. 2004. Available online: https://people.bu.edu/bkulis/pubs/spectral_techreport.pdf (accessed on 5 September 2025).
  68. Phillips, J.M. L10: Spectral Clustering. Semantic Scholar. 2016. Available online: https://www.semanticscholar.org/paper/L10%3A-Spectral-Clustering-Phillips/b12a5cebca3a0769a7ad01db8251a1aef3020d63 (accessed on 5 September 2025).
  69. Ester, M.; Kriegel, H.-P.; Sander, J.; Xu, X. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In Proceedings of the International Conference on Knowledge Discovery and Data Mining (KDD), Portland, OR, USA, 2–4 August 1996. [Google Scholar]
  70. Ahmed, K.N.; Razak, T.A. An Overview of Various Improvements of DBSCAN Algorithm in Clustering Spatial Databases. Semantic Scholar. 2016. Available online: https://www.ijarcce.com/upload/2016/february-16/IJARCCE%2077.pdf (accessed on 25 September 2025).
  71. Guo, X.; Liu, X.; Zhu, E.; Yin, J. Deep Clustering with Convolutional Autoencoders. In Proceedings of the International Conference on Neural Information Processing, Guangzhou, China, 14–18 November 2017. [Google Scholar]
  72. Akhanli, S.E.; Hennig, C. Comparing Clusterings and Numbers of Clusters by Aggregation of Calibrated Clustering Validity Indexes. Stat. Comput. 2020, 30, 795–810. [Google Scholar] [CrossRef]
  73. Tomasini, C.; Borges, E.N.; Machado, K.; Emmendorfer, L. A Study on the Relationship between Internal and External Validity Indices Applied to Partitioning and Density-Based Clustering Algorithms. In Proceedings of the International Conference on Enterprise Information Systems, Porto, Portugal, 26–29 April 2017. [Google Scholar]
  74. Maulik, U.; Bandyopadhyay, S. Performance Evaluation of Some Clustering Algorithms and Validity Indices. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 1650–1654. [Google Scholar] [CrossRef]
  75. Newman, D.; Lau, J.H.; Grieser, K.; Baldwin, T. Automatic Evaluation of Topic Coherence. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), Los Angeles, CA, USA, 2–4 June 2010. [Google Scholar]
  76. Deveaud, R.; SanJuan, E.; Bellot, P. Are Semantically Coherent Topic Models Useful for Ad Hoc Information Retrieval? In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), Sofia, Bulgaria, 4–9 August 2013. [Google Scholar]
  77. Carbonell, J.; Goldstein, J. The Use of MMR, Diversity-Based Reranking for Reordering Documents and Producing Summaries. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Melbourne, Australia, 24–28 August 1998; pp. 335–336. [Google Scholar]
  78. Li, L.; Wang, P.; Yan, J.; Wang, Y.; Li, S.; Jiang, J.; Sun, Z.; Tang, B.; Chang, T.-H.; Wang, S.; et al. Real-World Data Medical Knowledge Graph: Construction and Applications. Artif. Intell. Med. 2020, 103, 101817. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Example graph visualization.
Table 1. Clustering evaluation metrics for five text embedding methods combined with eight clustering algorithms for the crop disease corpus.
Embedding   Clustering   S.S.   DBI    CH     UMass
TF-IDF      k-means      0.21   1.92    480   0.12
TF-IDF      HC           0.18   2.05    420   0.11
TF-IDF      SC           0.23   1.88    495   0.14
TF-IDF      DBSCAN       0.25   1.76    520   0.16
TF-IDF      DEC          0.30   1.55    600   0.19
TF-IDF      IDEC         0.33   1.48    620   0.22
TF-IDF      DCN          0.34   1.42    640   0.24
TF-IDF      VaDE         0.36   1.35    670   0.26
Word2Vec    k-means      0.31   1.52    650   0.19
Word2Vec    HC           0.28   1.63    590   0.17
Word2Vec    SC           0.33   1.50    660   0.21
Word2Vec    DBSCAN       0.35   1.44    700   0.23
Word2Vec    DEC          0.38   1.38    730   0.26
Word2Vec    IDEC         0.40   1.31    760   0.29
Word2Vec    DCN          0.41   1.28    780   0.31
Word2Vec    VaDE         0.43   1.22    820   0.33
XLNet       k-means      0.38   1.28    780   0.26
XLNet       HC           0.36   1.32    760   0.25
XLNet       SC           0.39   1.26    800   0.28
XLNet       DBSCAN       0.41   1.21    820   0.30
XLNet       DEC          0.44   1.15    850   0.33
XLNet       IDEC         0.46   1.09    880   0.35
XLNet       DCN          0.47   1.06    900   0.37
XLNet       VaDE         0.49   1.02    930   0.39
SBERT       k-means      0.40   1.18    860   0.31
SBERT       HC           0.38   1.22    830   0.30
SBERT       SC           0.42   1.14    880   0.34
SBERT       DBSCAN       0.44   1.10    900   0.36
SBERT       DEC          0.48   1.03    940   0.39
SBERT       IDEC         0.50   0.98    970   0.41
SBERT       DCN          0.51   0.96    980   0.42
SBERT       VaDE         0.52   0.92   1000   0.44
SciBERT     k-means      0.41   1.16    870   0.32
SciBERT     HC           0.39   1.20    850   0.31
SciBERT     SC           0.43   1.12    900   0.35
SciBERT     DBSCAN       0.45   1.08    920   0.37
SciBERT     DEC          0.49   1.00    950   0.40
SciBERT     IDEC         0.51   0.95    980   0.42
SciBERT     DCN          0.52   0.93   1000   0.43
SciBERT     VaDE         0.54   0.88   1040   0.46
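For readers wishing to reproduce internal validity indices of the kind reported in Table 1, the silhouette score (S.S.), Davies–Bouldin index (DBI), and Calinski–Harabasz index (CH) are all available in scikit-learn. The snippet below is an illustrative sketch on synthetic vectors standing in for sentence embeddings; it is not the study's pipeline, and the resulting values are not the paper's.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score)

# Synthetic stand-in for sentence embeddings (e.g., SciBERT vectors).
X, _ = make_blobs(n_samples=300, centers=5, n_features=32, random_state=42)

# Cluster assignments; k-means is used here purely for illustration.
labels = KMeans(n_clusters=5, n_init=10, random_state=42).fit_predict(X)

ss = silhouette_score(X, labels)        # in [-1, 1]; higher is better
dbi = davies_bouldin_score(X, labels)   # >= 0; lower is better
ch = calinski_harabasz_score(X, labels)  # > 0; higher is better
print(f"S.S.={ss:.2f}  DBI={dbi:.2f}  CH={ch:.0f}")
```

UMass topic coherence is not part of scikit-learn; it is typically computed from co-occurrence counts of top cluster terms, e.g., with Gensim's CoherenceModel.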
Table 2. Facet extraction performance of the sentence scoring model using SciBERT + VaDE across 20 crop–disease clusters.
Cluster ID   Crop        Disease                        SYM-F1   CTL-F1   PRV-F1
C1           Rice        Blast                          0.84     0.80     0.86
C2           Rice        Bacterial Leaf Blight          0.83     0.79     0.85
C3           Rice        Sheath Blight                  0.81     0.77     0.82
C4           Rice        Tungro Virus                   0.83     0.76     0.84
C5           Rice        Brown Spot                     0.81     0.76     0.83
C6           Sugarcane   Red Rot                        0.83     0.79     0.86
C7           Sugarcane   Smut                           0.81     0.76     0.82
C8           Sugarcane   Mosaic Virus                   0.79     0.74     0.81
C9           Sugarcane   Leaf Scald                     0.80     0.74     0.82
C10          Oil Palm    Bud Rot                        0.81     0.77     0.82
C11          Oil Palm    Basal Stem Rot (Ganoderma)     0.80     0.76     0.81
C12          Oil Palm    Fatal Yellowing                0.78     0.73     0.80
C13          Cassava     Mosaic Disease                 0.82     0.75     0.84
C14          Cassava     Bacterial Blight               0.80     0.74     0.82
C15          Cassava     Anthracnose                    0.78     0.72     0.80
C16          Cassava     Brown Streak                   0.79     0.73     0.81
C17          Soybean     Rust                           0.83     0.78     0.84
C18          Soybean     Cyst Nematode                  0.80     0.75     0.82
C19          Soybean     Pod Blight                     0.78     0.73     0.81
C20          Soybean     Phytophthora Root Rot          0.79     0.74     0.81
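The per-facet figures in Table 2 can be condensed into macro-averages across the 20 clusters, which is how the abstract's "F1-scores of around 0.8" summary can be verified. The short sketch below simply averages the tabulated values; it introduces no data beyond the table itself.

```python
# F1-scores per facet, copied from Table 2 rows C1–C20 in order.
sym = [0.84, 0.83, 0.81, 0.83, 0.81, 0.83, 0.81, 0.79, 0.80, 0.81,
       0.80, 0.78, 0.82, 0.80, 0.78, 0.79, 0.83, 0.80, 0.78, 0.79]
ctl = [0.80, 0.79, 0.77, 0.76, 0.76, 0.79, 0.76, 0.74, 0.74, 0.77,
       0.76, 0.73, 0.75, 0.74, 0.72, 0.73, 0.78, 0.75, 0.73, 0.74]
prv = [0.86, 0.85, 0.82, 0.84, 0.83, 0.86, 0.82, 0.81, 0.82, 0.82,
       0.81, 0.80, 0.84, 0.82, 0.80, 0.81, 0.84, 0.82, 0.81, 0.81]

# Macro-average: unweighted mean over clusters, one value per facet.
macro = {name: sum(v) / len(v)
         for name, v in [("SYM", sym), ("CTL", ctl), ("PRV", prv)]}
print({k: round(v, 4) for k, v in macro.items()})
```

Prevention (PRV) averages highest and control/treatment (CTL) lowest, consistent with the recall gap for control/prevention relations noted in Table 3.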
Table 3. Summary of knowledge graph evaluation results.
Evaluation Dimension | Metric | Description | Result (Mean ± SD) | Interpretation
1. Structural Evaluation | Node Coverage (%) | Proportion of expected entities (crop, disease, symptom, control, prevention) successfully detected | 92.4 ± 3.1 | The KG captures most domain-relevant entities across clusters.
1. Structural Evaluation | Relation Completeness (%) | Completeness of the four primary relations | 89.7 ± 4.5 | Essential relations are consistently extracted across disease topics.
1. Structural Evaluation | Graph Connectivity | Average degree centrality/number of components | 4.2 nodes/1 component per subgraph | Each disease subgraph is well-connected without fragmentation.
2. Triplet-Level Accuracy | Precision | Proportion of correct triplets out of those constructed | 0.87 | Extracted relations are highly accurate.
2. Triplet-Level Accuracy | Recall | Proportion of gold-standard triplets constructed by the method | 0.84 | A small number of control/prevention relations remain under-extracted.
2. Triplet-Level Accuracy | F1-score | Harmonic mean of precision and recall | 0.85 | Balanced performance in relation extraction.
2. Triplet-Level Accuracy | Confidence Score Gap | Confidence of correct vs. incorrect triplets | 0.71 vs. 0.34 | Confidence scoring reliably differentiates valid from invalid relations.
3. Expert Semantic Evaluation | Expert Accept Rate (%) | Percentage of subgraphs judged as semantically correct | 91.2% | Most subgraphs align with established plant pathology knowledge.
3. Expert Semantic Evaluation | Biological Plausibility Score (1–5) | Degree to which symptoms, control, and prevention reflect scientific evidence | 4.6 ± 0.3 | Extracted knowledge is biologically sound and realistic.
3. Expert Semantic Evaluation | Cohen's κ | Agreement between expert annotators | 0.82 | Expert agreement is "excellent," confirming evaluation reliability.
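Triplet-level precision, recall, and F1 of the kind reported in Table 3 are conventionally computed as set overlap between extracted and gold-standard (subject, relation, object) triplets, and Cohen's κ from paired annotator judgments. The sketch below illustrates these conventions on toy data; the entity and relation names (hasSymptom, controlledBy, etc.) are hypothetical placeholders, not the paper's schema, and the ratings are invented for illustration.

```python
from sklearn.metrics import cohen_kappa_score

# Toy gold-standard vs. extracted triplets (names are illustrative only).
gold = {("Rice", "hasDisease", "Blast"),
        ("Blast", "hasSymptom", "leaf lesions"),
        ("Blast", "controlledBy", "fungicide application"),
        ("Blast", "preventedBy", "resistant cultivars")}
pred = {("Rice", "hasDisease", "Blast"),
        ("Blast", "hasSymptom", "leaf lesions"),
        ("Blast", "controlledBy", "fungicide application"),
        ("Blast", "hasSymptom", "stunting")}   # a spurious extraction

tp = len(gold & pred)                # triplets both sets agree on
precision = tp / len(pred)           # correct share of extracted triplets
recall = tp / len(gold)              # recovered share of gold triplets
f1 = 2 * precision * recall / (precision + recall)

# Inter-annotator agreement on accept/reject judgments for ten subgraphs.
rater_a = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]
rater_b = [1, 1, 0, 1, 0, 1, 0, 1, 1, 1]
kappa = cohen_kappa_score(rater_a, rater_b)

print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f} kappa={kappa:.2f}")
```

Exact-match set overlap is the strictest variant; a production evaluation would normally first normalize entity mentions (as the pipeline's lexicon-guided NER stage does) so that surface variants of the same triplet are counted as matches.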
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Polpinij, J.; Kaenampornpan, M.; Khoo, C.S.G.; Cheng, W.-N.; Luaphol, B. A Multi-Stage NLP Framework for Knowledge Discovery from Crop Disease Research Literature. Mathematics 2026, 14, 299. https://doi.org/10.3390/math14020299


