A Multi-Stage NLP Framework for Knowledge Discovery from Crop Disease Research Literature
Abstract
1. Introduction
- Clustering the abstracts into meaningful topics. This includes converting each abstract into a vector representation and then performing cluster analysis to group the abstracts into crop disease topics (such as Rice—Blast and Sugarcane—Red Rot).
- Selecting or extracting a set of representative sentences from each cluster that have minimal overlap but cover the main knowledge aspects (i.e., research results). In addition, the extracted sentences are labeled (i.e., categorized) into knowledge facets (relation types) of symptoms, control/treatment, and prevention.
- Deriving node–relation–node triplets from the facet-labeled sentences to construct the knowledge graph. This involves performing named entity recognition (NER) on the extracted sentences to identify entity nodes and relation identification to generate triplets.
2. Related Work
2.1. Text Clustering
2.2. Extracting Representative Sentences
2.3. Deriving Node–Relation–Node Triplets from Sentences
3. Datasets and Research Method
3.1. Datasets
3.2. Research Method
3.2.1. Text Preprocessing
3.2.2. Text Representation and Embedding
- (1)
- TF-IDF (term frequency–inverse document frequency)—TF-IDF represents each document as a sparse vector in the vocabulary space, scaling each term according to its local frequency within a document and its global rarity across the corpus [61]. The approach is simple, interpretable, and efficient, but it cannot capture semantic closeness, synonymy, or word order. The weight of term t in document di is w(t, di) = tf(t, di) × log(N/df(t)), where N is the corpus size and df(t) is the number of documents containing t.
- (2)
- Word2Vec is a prediction-based approach for learning dense, low-dimensional vector representations of words from their linguistic contexts [62]. In this study, the skip-gram model is employed to maximize the likelihood of context words appearing within a fixed-size window surrounding a target word. Document-level representations are obtained by averaging the embeddings of the words contained in each document.
- (3)
- XLNet—XLNet extends conventional autoregressive language modeling with permutation language modeling, which allows the model to capture bidirectional dependencies while remaining autoregressive [18]. In contrast to BERT's masked language modeling, XLNet trains over permutations of the factorization order, so each token is predicted from whatever context precedes it in the sampled permutation. The document embedding is obtained by mean pooling the token representations. This allows XLNet to produce context-sensitive embeddings that leverage both forward and backward information flows, making it stronger at modeling long-range dependencies than earlier models.
- (4)
- SBERT (Sentence-BERT)—SBERT modifies the BERT architecture into a Siamese network fine-tuned for semantic similarity and clustering tasks [19]. Unlike models that produce word-level embeddings, SBERT directly learns sentence- and document-level embeddings, so semantically similar texts are mapped close to each other in the vector space. Training typically employs a triplet or contrastive loss, where the margin ε enforces a separation between positive and negative pairs; the similarity between sentences is then computed via cosine similarity. This design makes SBERT highly suitable for clustering, since embeddings can be compared directly by their distances. Its main strength is its effectiveness and accuracy on semantic similarity, although this depends strongly on the quality of domain adaptation during fine-tuning.
- (5)
- SciBERT—SciBERT is a BERT model pretrained on a large in-domain corpus of scientific publications [20]. Though it shares BERT's transformer architecture, its dedicated pretraining corpus helps it capture scientific terms, jargon, and even the Latin species names commonly found in technical abstracts. Given an input sequence X, the transformer encoder produces contextualized representations that capture the semantic and syntactic behavior of scientific text well, yielding contextual embeddings specialized for scientific domains such as biomedical and agricultural research.
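As a minimal sketch of the TF-IDF weighting described above (assuming raw term counts and a natural-log IDF; production implementations such as scikit-learn's differ in smoothing and normalization):

```python
import math

def tfidf_weight(term, doc, corpus):
    """w(t, d) = tf(t, d) * log(N / df(t)), with raw counts and natural log."""
    tf = doc.count(term)                          # local frequency within d
    N = len(corpus)                               # corpus size
    df = sum(1 for d in corpus if term in d)      # documents containing t
    return tf * math.log(N / df)

corpus = [["rice", "blast", "lesion"],
          ["rice", "tungro", "virus"],
          ["sugarcane", "red", "rot"]]

# "blast" appears once in doc 0 and in 1 of 3 documents: weight = log(3)
print(round(tfidf_weight("blast", corpus[0], corpus), 3))  # 1.099
```

Note that a term appearing in every document gets weight log(1) = 0, which is why corpus-wide stopwords vanish under this scheme.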
3.2.3. Document Clustering
3.2.4. Threshold Calibration [72,73]
- ▪
- “Accept”—the cluster is helpful and applicable to symptoms, control, or prevention.
- ▪
- “Revise”—the cluster requires modifications such as combining, dividing, or changing the name.
3.2.5. Sentence Extraction and Facet Building
- ▪
- SYM (symptoms)—textual descriptions of visible plant disease manifestations.
- ▪
- CTL (control/treatment)—chemical, biological, or cultural methods applied to manage the disease.
- ▪
- PRV (prevention/resistance)—proactive strategies such as resistant cultivars or crop rotation.
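A keyword-lexicon baseline for these three facet labels can be sketched as follows; the lexicons and the tie-breaking are illustrative assumptions, since the paper's facet labeling combines lexical, semantic, and contextual evidence:

```python
# Illustrative facet lexicons; the actual labeler combines lexical,
# semantic, and contextual signals rather than keywords alone.
FACET_LEXICONS = {
    "SYM": {"lesion", "spot", "wilt", "chlorosis", "necrosis"},
    "CTL": {"fungicide", "spray", "biocontrol", "treatment"},
    "PRV": {"resistant", "rotation", "certified", "sanitation"},
}

def label_facet(sentence):
    """Return the facet whose lexicon overlaps the sentence most, or None."""
    tokens = set(sentence.lower().replace(",", " ").split())
    scores = {f: len(tokens & lex) for f, lex in FACET_LEXICONS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(label_facet("Grayish lesions and leaf spot appear early."))  # SYM
```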
3.2.6. Knowledge Graph Construction [78]
- CROP → hasDisease → DISEASE
- DISEASE → hasSymptom → SYMPTOM
- DISEASE → controlledBy → CONTROL
- DISEASE → preventedBy → PREVENTION
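Stored as plain (head, relation, tail) triplets, this four-relation schema can be enforced with a small validity check; the helper and example entities below are a sketch, not the authors' implementation:

```python
# Sketch: assembling schema-conformant triplets for one disease subgraph.
# Relation names follow the schema above; the entities are illustrative.
SCHEMA = {
    "hasDisease": ("CROP", "DISEASE"),
    "hasSymptom": ("DISEASE", "SYMPTOM"),
    "controlledBy": ("DISEASE", "CONTROL"),
    "preventedBy": ("DISEASE", "PREVENTION"),
}

def add_triplet(kg, head, relation, tail):
    """Append a (head, relation, tail) triplet if the relation is in the schema."""
    if relation not in SCHEMA:
        raise ValueError(f"unknown relation: {relation}")
    kg.append((head, relation, tail))

kg = []
add_triplet(kg, "Rice", "hasDisease", "Blast")
add_triplet(kg, "Blast", "hasSymptom", "gray spindle-shaped lesions")
add_triplet(kg, "Blast", "controlledBy", "tricyclazole application")
print(len(kg))  # 3
```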
- ▪
- Lex—lexical match strength between the sentence and the facet keyword set.
- ▪
- Sim—semantic similarity between the sentence embedding and the facet centroid embedding (rather than the cluster/topic embedding), ensuring that similarity is computed with respect to the intended relation type.
- ▪
- Ctx—a contextual cue score derived from facet-specific phrase-level and syntactic patterns (e.g., regular expressions). Unlike Lex, which captures word-level overlap with facet lexicons, Ctx models how facet-related actions or properties are expressed through multi-word constructions and linguistic patterns, such as treatment actions or resistance statements.
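One plausible way to fuse the three signals into a single facet confidence is a convex weighted sum; the weights below are illustrative assumptions, not values reported in the paper:

```python
def facet_confidence(lex, sim, ctx, w=(0.3, 0.5, 0.2)):
    """Weighted combination of lexical, semantic, and contextual scores.

    All inputs are assumed normalized to [0, 1]; the default weights
    are illustrative, not the paper's reported values.
    """
    assert abs(sum(w) - 1.0) < 1e-9  # keep the combination convex
    return w[0] * lex + w[1] * sim + w[2] * ctx

# A sentence with strong semantic similarity but weak lexical overlap:
print(round(facet_confidence(lex=0.2, sim=0.9, ctx=0.5), 2))  # 0.61
```

Because the combination is convex, the output stays in [0, 1] and can be thresholded directly when deciding whether to keep a facet assignment.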
3.3. Implementation Details and Reproducibility
3.4. Computational Complexity and Runtime Analysis
4. Experimental Results and Discussion
4.1. Quantitative Evaluation of Embedding–Clustering Combinations
4.2. Evaluation of Sentence Extraction and Facet Building
4.3. Knowledge Graph Evaluation
4.3.1. Structural Evaluation
4.3.2. Triplet-Level Evaluation
4.3.3. Expert-Based Semantic Evaluation
4.3.4. Confidence Scoring and Automated Filtering
4.4. Limitations
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Corley, J.C. Thoughts on publication and other issues in pest and weed management. Int. J. Pest Manag. 2019, 65, 95–96. [Google Scholar] [CrossRef]
- Rodríguez-García, M.Á.; García-Sánchez, F.; Valencia-García, R. Knowledge-Based System for Crop Pests and Diseases Recognition. Electronics 2021, 10, 905. [Google Scholar] [CrossRef]
- Upadhyay, A.; Chandel, N.S.; Singh, K.P.; Chakraborty, S.K.; Nandede, B.M.; Kumar, M.; Subeesh, A.; Upendar, K.; Salem, A.; Elbeltagi, A. Deep Learning and Computer Vision in Plant Disease Detection: A Comprehensive Review of Techniques, Models, and Trends in Precision Agriculture. Artif. Intell. Rev. 2025, 58, 92. [Google Scholar] [CrossRef]
- Qiu, X.; Chen, H.; Huang, P.; Zhong, D.; Guo, T.; Pu, C.; Li, Z.; Liu, Y.; Chen, J.; Wang, S. Detection of Citrus Diseases in Complex Backgrounds Based on Image–Text Multimodal Fusion and Knowledge Assistance. Front. Plant Sci. 2023, 14, 1280365. [Google Scholar] [CrossRef]
- Ngugi, H.N.; Akinyelu, A.A.; Ezugwu, A.E. Machine Learning and Deep Learning for Crop Disease Diagnosis: Performance Analysis and Review. Agronomy 2024, 14, 3001. [Google Scholar] [CrossRef]
- Yan, R.; An, P.; Meng, X.; Li, Y.; Li, D.; Xu, F.; Dang, D. A Knowledge Graph for Crop Diseases and Pests in China. Sci. Data 2025, 12, 222. [Google Scholar] [CrossRef] [PubMed]
- Zhao, X.; Chen, B.; Ji, M.; Wang, X.; Yan, Y.; Zhang, J.; Liu, S.; Ye, M.; Lv, C. Implementation of Large Language Models and Agricultural Knowledge Graphs for Efficient Plant Disease Detection. Agriculture 2024, 14, 1359. [Google Scholar] [CrossRef]
- Jafar, A.; Bibi, N.; Naqvi, R.A.; Sadeghi-Niaraki, A.; Jeong, D. Revolutionizing Agriculture with Artificial Intelligence: Plant Disease Detection Methods, Applications, and Their Limitations. Front. Plant Sci. 2024, 15, 1356260. [Google Scholar] [CrossRef]
- Alharbi, A.; Aslam, M.A.; Asiry, K.A.; Aljohani, N.R.; Glikman, Y. An Ontology-Based Agriculture Decision-Support System with an Evidence-Based Explanation Model. Smart Agric. Technol. 2024, 9, 100659. [Google Scholar] [CrossRef]
- Zhu, D.; Xie, L.; Chen, B.; Tan, J.; Deng, R.F.; Zheng, Y.; Mustafa, R.; Chen, W.; Yi, S.; Yung, K.; et al. Knowledge Graph and Deep Learning Based Pest Detection and Identification System for Fruit Quality. Internet Things 2022, 21, 100649. [Google Scholar] [CrossRef]
- Bhuyan, B.P.; Tomar, R.; Ramdane-Chérif, A. A Systematic Review of Knowledge Representation Techniques in Smart Agriculture (Urban). Sustainability 2022, 14, 22. [Google Scholar] [CrossRef]
- Fedele, G.; Brischetto, C.; Rossi, V.; González-Domínguez, E. A Systematic Map of the Research on Disease Modelling for Agricultural Crops Worldwide. Plants 2022, 11, 6. [Google Scholar] [CrossRef]
- Qin, Z.; Lian, H.; He, T.; Luo, B. Cluster Correction on Polysemy and Synonymy. In Proceedings of the 14th Web Information Systems and Applications Conference (WISA), Liuzhou, China, 11–12 November 2017; pp. 136–138. [Google Scholar]
- Khan, D. Modeling and Semantic Clustering in Large-Scale Text Data: A Review of Machine Learning Techniques and Applications. Int. J. Sci. Res. Eng. Manag. 2025, 9, 10. [Google Scholar] [CrossRef]
- Wei, T.; Lu, Y.; Chang, H.; Zhou, Q.; Bao, X. A Semantic Approach for Text Clustering Using WordNet and Lexical Chains. Expert Syst. Appl. 2015, 42, 2264–2275. [Google Scholar] [CrossRef]
- Liu, Q.; Wang, J.; Zhang, D.; Yang, Y.; Wang, N. Text Features Extraction Based on TF-IDF Associating Semantic. In Proceedings of the IEEE 4th International Conference on Computer and Communications (ICCC), Chengdu, China, 7–10 December 2018; pp. 2338–2343. [Google Scholar]
- Muneeb, T.H.; Sahu, S.; Anand, A. Evaluating Distributed Word Representations for Capturing Semantics of Biomedical Concepts. In Proceedings of the BioNLP, Beijing, China, 30 July 2015. [Google Scholar]
- Yang, Z.; Dai, Z.; Yang, Y.; Carbonell, J.; Salakhutdinov, R.; Le, Q.V. XLNet: Generalized Autoregressive Pretraining for Language Understanding. Adv. Neural Inf. Process. Syst. 2019, 32. [Google Scholar]
- Reimers, N.; Gurevych, I. Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Hong Kong, China, 3–7 November 2019. [Google Scholar]
- Beltagy, I.; Lo, K.; Cohan, A. SciBERT: A Pretrained Language Model for Scientific Text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Hong Kong, China, 3–7 November 2019. [Google Scholar]
- Gillioz, A.; Casas, J.; Mugellini, E.; Khaled, O.A. Overview of the Transformer-Based Models for NLP Tasks. In Proceedings of the Conference on Computer Science and Information Systems, Belgrade, Serbia, 14–17 September 2020; pp. 179–183. [Google Scholar]
- Xie, J.; Girshick, R.B.; Farhadi, A. Unsupervised Deep Embedding for Clustering Analysis. In Proceedings of the International Conference on Machine Learning (ICML), Lille, France, 6–11 July 2015. [Google Scholar]
- Guo, X.; Gao, L.; Liu, X.; Yin, J. Improved Deep Embedded Clustering with Local Structure Preservation. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), Melbourne, Australia, 19–25 August 2017. [Google Scholar]
- Schnellbach, J.; Kajó, M. Clustering with Deep Neural Networks—An Overview of Recent Methods. Network 2020, 39, 39–43. [Google Scholar]
- Jiang, Z.; Zheng, Y.; Tan, H.; Tang, B.; Zhou, H. Variational Deep Embedding: An Unsupervised and Generative Approach to Clustering. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), New York, NY, USA, 9–15 July 2016. [Google Scholar]
- Tarekegn, A.N.; Rabbi, F.; Tessem, B. Large Language Model Enhanced Clustering for News Event Detection. arXiv 2024, arXiv:2406.10552. [Google Scholar] [CrossRef]
- Saha, R. Influence of Various Text Embeddings on Clustering Performance in NLP. arXiv 2023, arXiv:2305.03144. [Google Scholar] [CrossRef]
- Rahman, M.W.U.; Nevarez, R.; Mim, L.T.; Hariri, S. SDEC: Semantic Deep Embedded Clustering. IEEE Trans. Big Data 2025, 1, 1–16. [Google Scholar] [CrossRef]
- Keraghel, I.; Morbieu, S.; Nadif, M. Beyond Words: A Comparative Analysis of LLM Embeddings for Effective Clustering. In Proceedings of the International Symposium on Intelligent Data Analysis, Würzburg, Germany, 28–30 October 2024. [Google Scholar]
- Petukhova, A.; Matos-Carvalho, J.P.; Fachada, N. Text Clustering with Large Language Model Embeddings. Int. J. Cogn. Comput. Eng. 2024, 6, 100–108. [Google Scholar] [CrossRef]
- Allahyari, M.; Pouriyeh, S.; Assefi, M.; Safaei, S.; Trippe, E.D.; Gutiérrez, J.B.; Kochut, K. A Brief Survey of Text Mining: Classification, Clustering and Extraction Techniques. arXiv 2017, arXiv:1707.02919. [Google Scholar] [CrossRef]
- Garg, N.; Gupta, R.K. Clustering Techniques for Text Mining: A Review. Int. J. Eng. Res. 2016, 5, 241–243. [Google Scholar]
- Soucy, P.; Mineau, G. Beyond TF–IDF Weighting for Text Categorization in the Vector Space Model. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), Edinburgh, UK, 30 July–5 August 2005. [Google Scholar]
- Pradhan, L.; Zhang, C.; Bethard, S.; Chen, X. Embedding User Behavioral Aspect in TF–IDF-like Representation. In Proceedings of the IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), Miami, FL, USA, 10–12 April 2018. [Google Scholar]
- Tang, Z.; Li, W.; Li, Y.; Zhao, W.; Li, S. Several Alternative Term Weighting Methods for Text Representation and Classification. Knowl.-Based Syst. 2020, 207, 106385. [Google Scholar] [CrossRef]
- Mohammed, M.T.; Rashid, O.F. Document Retrieval Using Term Frequency Inverse Sentence Frequency Weighting Scheme. Indones. J. Electr. Eng. Comput. Sci. 2023, 3, 1478–1485. [Google Scholar] [CrossRef]
- Dasari, L.A.; Sowmith, J.; Krishna, M.N.; Saketh, C.; Venugopalan, M. Optimizing Agricultural Insights: Semantic Clustering and Topic Modelling for Farmer Queries. In Proceedings of the 3rd International Conference on Inventive Computing and Informatics (ICICI), Coimbatore, India, 8–10 January 2025. [Google Scholar]
- Thurnbauer, M.; Reisinger, J.; Goller, C.; Fischer, A. Towards Resolving Word Ambiguity with Word Embeddings. arXiv 2023, arXiv:2307.13417. [Google Scholar] [CrossRef]
- Clinchant, S.; Perronnin, F. Aggregating Continuous Word Embeddings for Information Retrieval. In Proceedings of the Workshop on Continuous Vector Space Models and Their Compositionality, Sofia, Bulgaria, 8 August 2013; pp. 100–109. [Google Scholar]
- Hu, W.; Zhang, J.; Zheng, N. Different Contexts Lead to Different Word Embeddings. In Proceedings of the International Conference on Computational Linguistics (COLING), Osaka, Japan, 11–16 December 2016. [Google Scholar]
- Rong, X. Word2Vec Parameter Learning Explained. arXiv 2014, arXiv:1411.2738. [Google Scholar]
- Dynomant, E.; Lelong, R.; Dahamna, B.; Massonnaud, C.; Kerdelhué, G.; Grosjean, J.; Canu, S.; Darmoni, S.J. Word Embedding for the French Natural Language in Health Care: Comparative Study. JMIR Med. Inform. 2019, 7, e12304. [Google Scholar] [CrossRef] [PubMed]
- Worth, P.J. Word Embeddings and Semantic Spaces in Natural Language Processing. Int. J. Intell. Sci. 2023, 13, 1. [Google Scholar] [CrossRef]
- Saranya, M.; Amutha, A. A Survey of Machine Learning Techniques for Topic Modeling and Word Embedding. In Proceedings of the 10th International Conference on Advanced Computing and Communication Systems (ICACCS), Coimbatore, India, 21–22 February 2024. [Google Scholar]
- Trask, A.; Michalak, P.; Liu, J.C. Sense2Vec: A Fast and Accurate Method for Word Sense Disambiguation in Neural Word Embeddings. arXiv 2015, arXiv:1511.06388. [Google Scholar]
- Mancini, M.; Camacho-Collados, J.; Iacobacci, I.; Navigli, R. Embedding Words and Senses Together via Joint Knowledge-Enhanced Training. In Proceedings of the Conference on Computational Natural Language Learning (CoNLL), Berlin, Germany, 11–12 August 2016. [Google Scholar]
- Wiedemann, G.; Remus, S.; Chawla, A.; Biemann, C. Does BERT Make Any Sense? Interpretable Word Sense Disambiguation with Contextualized Embeddings. In Proceedings of the Conference on Natural Language Processing, Tokyo, Japan, 29–31 October 2019. [Google Scholar]
- Meijer, H.; Truong, J.; Karimi, R. Document Embedding for Scientific Articles: Efficacy of Word Embeddings vs. TFIDF. arXiv 2021, arXiv:2107.05151. [Google Scholar] [CrossRef]
- Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), Minneapolis, MN, USA, 2–7 June 2019. [Google Scholar]
- Zhang, R.; Wang, Y.-S.; Yang, Y.; Vu, T.; Lei, L. Exploiting Local and Global Features in Transformer-Based Extreme Multi-Label Text Classification. arXiv 2022, arXiv:2204.00933. [Google Scholar]
- Ha, T.-T.; Nguyen, V.; Nguyen, K.-H.; Nguyen, K.; Than, Q. Utilizing SBERT for Finding Similar Questions in Community Question Answering. In Proceedings of the International Conference on Knowledge and Systems Engineering, Hanoi, Vietnam, 27–29 October 2021. [Google Scholar]
- Boyack, K.; Newman, D.; Duhon, R.; Klavans, R.; Patek, M.; Biberstine, J.; Schijvenaars, B.; Skupin, A.; Ma, N.; Börner, K. Clustering More than Two Million Biomedical Publications: Comparing the Accuracies of Nine Text-Based Similarity Approaches. PLoS ONE 2011, 6, e18029. [Google Scholar] [CrossRef]
- Huang, P.; Huang, Y.; Wang, W.; Wang, L. Deep Embedding Network for Clustering. In Proceedings of the International Conference on Pattern Recognition (ICPR), Stockholm, Sweden, 24–28 August 2014. [Google Scholar]
- Xu, Y.; Huang, D.; Wang, C.; Lai, J. Deep Image Clustering with Contrastive Learning and Multi-Scale Graph Convolutional Networks. Pattern Recognit. 2024, 146, 109939. [Google Scholar] [CrossRef]
- Gupta, V.; Bharti, P.; Nokhiz, P.; Karnick, H. SumPubMed: Summarization Dataset of PubMed Scientific Articles. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), Bangkok, Thailand, 1–6 August 2021. [Google Scholar]
- Alizadeh, M.; Oveisi, M.; Falahati, S.; Mousavi, G.; Meybodi, M.A.; Mehrnia, S.S.; Hacihaliloglu, I.; Rahmim, A.; Salmanpour, M.R. AllMetrics: A Unified Python Library for Standardized Metric Evaluation and Robust Data Validation in Machine Learning. arXiv 2025, arXiv:2505.15931. [Google Scholar] [CrossRef]
- Lenci, A.; Sahlgren, M.; Jeuniaux, P.; Gyllensten, A.C.; Miliani, M. A Comprehensive Comparative Evaluation and Analysis of Distributional Semantic Models. arXiv 2021, arXiv:2105.09825. [Google Scholar]
- Kim, S.; Lee, S.; Yoon, B. Development of an Embedding Framework for Clustering Scientific Papers. IEEE Access 2022, 10, 32608–32621. [Google Scholar] [CrossRef]
- Kampffmeyer, M.C.; Løkse, S.; Bianchi, F.; Livi, L.; Salberg, A.-B.; Jenssen, J. Deep Divergence-Based Clustering. In Proceedings of the International Workshop on Machine Learning for Signal Processing, Tokyo, Japan, 25–28 September 2017. [Google Scholar]
- Druery, J.; McCormack, N.; Murphy, S. Are Best Practices Really Best? A Review of the Best Practices Literature in Library and Information Studies. Evid.-Based Libr. Inf. Pract. 2013, 8, 110–128. [Google Scholar] [CrossRef]
- Ramos, J.E. Using TF–IDF to Determine Word Relevance in Document Queries. Semantic Scholar. 2003. Available online: https://www.researchgate.net/publication/228818851_Using_TF-IDF_to_determine_word_relevance_in_document_queries (accessed on 1 September 2025).
- Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.; Dean, J. Distributed Representations of Words and Phrases and Their Compositionality. Adv. Neural Inform. Proc. Syst. 2013, 26. [Google Scholar]
- Jin, X.; Han, J. K-Means Clustering. In Encyclopedia of Environmental Change; Matthews, J.A., Ed.; Sage: Thousand Oaks, CA, USA, 2021. [Google Scholar]
- Dabhi, D.; Patel, M.R. Extensive Survey on Hierarchical Clustering Methods in Data Mining. Semantic Scholar. 2016. Available online: https://www.irjet.net/archives/V3/i11/IRJET-V3I11115.pdf (accessed on 5 September 2025).
- Murtagh, F. Hierarchical Clustering. In International Encyclopedia of Statistical Science; Springer: Berlin/Heidelberg, Germany, 2016. [Google Scholar]
- Bach, F.; Jordan, M.I. Learning Spectral Clustering. In Advances in Neural Information Processing Systems. 2003. Available online: https://www.di.ens.fr/~fbach/nips03_cluster.pdf (accessed on 5 September 2025).
- Dhillon, I.; Guan, Y.; Kulis, B. A Unified View of Kernel k-Means, Spectral Clustering and Graph Cuts. Semantic Scholar. 2004. Available online: https://people.bu.edu/bkulis/pubs/spectral_techreport.pdf (accessed on 5 September 2025).
- Phillips, J.M. L10: Spectral Clustering. Semantic Scholar. 2016. Available online: https://www.semanticscholar.org/paper/L10%3A-Spectral-Clustering-Phillips/b12a5cebca3a0769a7ad01db8251a1aef3020d63 (accessed on 5 September 2025).
- Ester, M.; Kriegel, H.-P.; Sander, J.; Xu, X. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In Proceedings of the International Conference on Knowledge Discovery and Data Mining (KDD), Portland, OR, USA, 2–4 August 1996. [Google Scholar]
- Ahmed, K.N.; Razak, T.A. An Overview of Various Improvements of DBSCAN Algorithm in Clustering Spatial Databases. Semantic Scholar. 2016. Available online: https://www.ijarcce.com/upload/2016/february-16/IJARCCE%2077.pdf (accessed on 25 September 2025).
- Guo, X.; Liu, X.; Zhu, E.; Yin, J. Deep Clustering with Convolutional Autoencoders. In Proceedings of the International Conference on Neural Information Processing, Guangzhou, China, 14–18 November 2017. [Google Scholar]
- Akhanli, S.E.; Hennig, C. Comparing Clusterings and Numbers of Clusters by Aggregation of Calibrated Clustering Validity Indexes. Stat. Comput. 2020, 30, 795–810. [Google Scholar] [CrossRef]
- Tomasini, C.; Borges, E.N.; Machado, K.; Emmendorfer, L. A Study on the Relationship between Internal and External Validity Indices Applied to Partitioning and Density-Based Clustering Algorithms. In Proceedings of the International Conference on Enterprise Information Systems, Porto, Portugal, 26–29 April 2017. [Google Scholar]
- Maulik, U.; Bandyopadhyay, S. Performance Evaluation of Some Clustering Algorithms and Validity Indices. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 1650–1654. [Google Scholar] [CrossRef]
- Newman, D.; Lau, J.H.; Grieser, K.; Baldwin, T. Automatic Evaluation of Topic Coherence. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), Los Angeles, CA, USA, 2–4 June 2010. [Google Scholar]
- Deveaud, R.; SanJuan, E.; Bellot, P. Are Semantically Coherent Topic Models Useful for Ad Hoc Information Retrieval? In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), Sofia, Bulgaria, 4–9 August 2013. [Google Scholar]
- Carbonell, J.; Goldstein, J. The Use of MMR, Diversity-Based Reranking for Reordering Documents and Producing Summaries. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Melbourne, Australia, 24–28 August 1998; pp. 335–336. [Google Scholar]
- Li, L.; Wang, P.; Yan, J.; Wang, Y.; Li, S.; Jiang, J.; Sun, Z.; Tang, B.; Chang, T.-H.; Wang, S.; et al. Real-World Data Medical Knowledge Graph: Construction and Applications. Artif. Intell. Med. 2020, 103, 101817. [Google Scholar] [CrossRef] [PubMed]

| Embedding | Clustering | Silhouette (S.S.) | DBI | CH | UMass |
|---|---|---|---|---|---|
| TF-IDF | k-means | 0.21 | 1.92 | 480 | 0.12 |
| TF-IDF | HC | 0.18 | 2.05 | 420 | 0.11 |
| TF-IDF | SC | 0.23 | 1.88 | 495 | 0.14 |
| TF-IDF | DBSCAN | 0.25 | 1.76 | 520 | 0.16 |
| TF-IDF | DEC | 0.30 | 1.55 | 600 | 0.19 |
| TF-IDF | IDEC | 0.33 | 1.48 | 620 | 0.22 |
| TF-IDF | DCN | 0.34 | 1.42 | 640 | 0.24 |
| TF-IDF | VaDE | 0.36 | 1.35 | 670 | 0.26 |
| Word2Vec | k-means | 0.31 | 1.52 | 650 | 0.19 |
| Word2Vec | HC | 0.28 | 1.63 | 590 | 0.17 |
| Word2Vec | SC | 0.33 | 1.50 | 660 | 0.21 |
| Word2Vec | DBSCAN | 0.35 | 1.44 | 700 | 0.23 |
| Word2Vec | DEC | 0.38 | 1.38 | 730 | 0.26 |
| Word2Vec | IDEC | 0.40 | 1.31 | 760 | 0.29 |
| Word2Vec | DCN | 0.41 | 1.28 | 780 | 0.31 |
| Word2Vec | VaDE | 0.43 | 1.22 | 820 | 0.33 |
| XLNet | k-means | 0.38 | 1.28 | 780 | 0.26 |
| XLNet | HC | 0.36 | 1.32 | 760 | 0.25 |
| XLNet | SC | 0.39 | 1.26 | 800 | 0.28 |
| XLNet | DBSCAN | 0.41 | 1.21 | 820 | 0.30 |
| XLNet | DEC | 0.44 | 1.15 | 850 | 0.33 |
| XLNet | IDEC | 0.46 | 1.09 | 880 | 0.35 |
| XLNet | DCN | 0.47 | 1.06 | 900 | 0.37 |
| XLNet | VaDE | 0.49 | 1.02 | 930 | 0.39 |
| SBERT | k-means | 0.40 | 1.18 | 860 | 0.31 |
| SBERT | HC | 0.38 | 1.22 | 830 | 0.30 |
| SBERT | SC | 0.42 | 1.14 | 880 | 0.34 |
| SBERT | DBSCAN | 0.44 | 1.10 | 900 | 0.36 |
| SBERT | DEC | 0.48 | 1.03 | 940 | 0.39 |
| SBERT | IDEC | 0.50 | 0.98 | 970 | 0.41 |
| SBERT | DCN | 0.51 | 0.96 | 980 | 0.42 |
| SBERT | VaDE | 0.52 | 0.92 | 1000 | 0.44 |
| SciBERT | k-means | 0.41 | 1.16 | 870 | 0.32 |
| SciBERT | HC | 0.39 | 1.20 | 850 | 0.31 |
| SciBERT | SC | 0.43 | 1.12 | 900 | 0.35 |
| SciBERT | DBSCAN | 0.45 | 1.08 | 920 | 0.37 |
| SciBERT | DEC | 0.49 | 1.00 | 950 | 0.40 |
| SciBERT | IDEC | 0.51 | 0.95 | 980 | 0.42 |
| SciBERT | DCN | 0.52 | 0.93 | 1000 | 0.43 |
| SciBERT | VaDE | 0.54 | 0.88 | 1040 | 0.46 |
| Cluster ID | Crop | Disease | SYM-F1 | CTL-F1 | PRV-F1 |
|---|---|---|---|---|---|
| C1 | Rice | Blast | 0.84 | 0.80 | 0.86 |
| C2 | Rice | Bacterial Leaf Blight | 0.83 | 0.79 | 0.85 |
| C3 | Rice | Sheath Blight | 0.81 | 0.77 | 0.82 |
| C4 | Rice | Tungro Virus | 0.83 | 0.76 | 0.84 |
| C5 | Rice | Brown Spot | 0.81 | 0.76 | 0.83 |
| C6 | Sugarcane | Red Rot | 0.83 | 0.79 | 0.86 |
| C7 | Sugarcane | Smut | 0.81 | 0.76 | 0.82 |
| C8 | Sugarcane | Mosaic Virus | 0.79 | 0.74 | 0.81 |
| C9 | Sugarcane | Leaf Scald | 0.80 | 0.74 | 0.82 |
| C10 | Oil Palm | Bud Rot | 0.81 | 0.77 | 0.82 |
| C11 | Oil Palm | Basal Stem Rot (Ganoderma) | 0.80 | 0.76 | 0.81 |
| C12 | Oil Palm | Fatal Yellowing | 0.78 | 0.73 | 0.80 |
| C13 | Cassava | Mosaic Disease | 0.82 | 0.75 | 0.84 |
| C14 | Cassava | Bacterial Blight | 0.80 | 0.74 | 0.82 |
| C15 | Cassava | Anthracnose | 0.78 | 0.72 | 0.80 |
| C16 | Cassava | Brown Streak | 0.79 | 0.73 | 0.81 |
| C17 | Soybean | Rust | 0.83 | 0.78 | 0.84 |
| C18 | Soybean | Cyst Nematode | 0.80 | 0.75 | 0.82 |
| C19 | Soybean | Pod Blight | 0.78 | 0.73 | 0.81 |
| C20 | Soybean | Phytophthora Root Rot | 0.79 | 0.74 | 0.81 |
| Evaluation Dimension | Metric | Description | Result (Mean ± SD) | Interpretation |
|---|---|---|---|---|
| 1. Structural Evaluation | Node Coverage (%) | Proportion of expected entities (crop, disease, symptom, control, prevention) successfully detected | 92.4 ± 3.1 | The KG captures most domain-relevant entities across clusters. |
| 1. Structural Evaluation | Relation Completeness (%) | Completeness of the four primary relations | 89.7 ± 4.5 | Essential relations are consistently extracted across disease topics. |
| 1. Structural Evaluation | Graph Connectivity | Average degree centrality/number of components | 4.2 nodes/1 component per subgraph | Each disease subgraph is well connected without fragmentation. |
| 2. Triplet-Level Accuracy | Precision | Proportion of correct triplets out of those constructed | 0.87 | Extracted relations are highly accurate. |
| 2. Triplet-Level Accuracy | Recall | Proportion of gold-standard triplets constructed by the method | 0.84 | A small number of control/prevention relations remain under-extracted. |
| 2. Triplet-Level Accuracy | F1-score | Harmonic mean of precision and recall | 0.85 | Balanced performance in relation extraction. |
| 2. Triplet-Level Accuracy | Confidence Score Gap | Confidence of correct vs. incorrect triplets | 0.71 vs. 0.34 | Confidence scoring reliably differentiates valid from invalid relations. |
| 3. Expert Semantic Evaluation | Expert Accept Rate (%) | Percentage of subgraphs judged as semantically correct | 91.2% | Most subgraphs align with established plant pathology knowledge. |
| 3. Expert Semantic Evaluation | Biological Plausibility Score (1–5) | Degree to which symptoms, control, and prevention reflect scientific evidence | 4.6 ± 0.3 | Extracted knowledge is biologically sound and realistic. |
| 3. Expert Semantic Evaluation | Cohen’s κ | Agreement between expert annotators | 0.82 | Expert agreement is “excellent,” confirming evaluation reliability. |
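As a quick consistency check, the reported triplet-level F1 agrees with the harmonic mean of the reported precision and recall:

```python
# Harmonic mean of the reported precision (0.87) and recall (0.84).
precision, recall = 0.87, 0.84
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))  # 0.85, matching the reported F1-score
```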
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Polpinij, J.; Kaenampornpan, M.; Khoo, C.S.G.; Cheng, W.-N.; Luaphol, B. A Multi-Stage NLP Framework for Knowledge Discovery from Crop Disease Research Literature. Mathematics 2026, 14, 299. https://doi.org/10.3390/math14020299

