Article

A Multi-Stage NLP Framework for Knowledge Discovery from Crop Disease Research Literature

by Jantima Polpinij 1, Manasawee Kaenampornpan 2,*, Christopher S. G. Khoo 3, Wei-Ning Cheng 4 and Bancha Luaphol 5

1 Department of Computer Science, Faculty of Informatics, Mahasarakham University, Mahasarakham 44150, Thailand
2 Department of Computer Engineering, Faculty of Engineering, Khon Kaen University, Khon Kaen 40002, Thailand
3 Wee Kim Wee School of Communication & Information, Nanyang Technological University, Singapore 637718, Singapore
4 Graduate Institute of Library & Information Studies, National Taiwan Normal University, Taipei City 106, Taiwan
5 Department of Business Computer, Faculty of Administrative Science, Kalasin University, Kalasin 46000, Thailand
* Author to whom correspondence should be addressed.
Mathematics 2026, 14(2), 299; https://doi.org/10.3390/math14020299
Submission received: 27 November 2025 / Revised: 3 January 2026 / Accepted: 10 January 2026 / Published: 14 January 2026

Abstract

Extracting and organizing knowledge from the agricultural crop disease research literature are challenging tasks because of heterogeneous terminology, complicated symptom descriptions, and the unstructured nature of scientific documents. In this study, we developed a multi-stage natural language processing (NLP) pipeline to automate knowledge extraction, organization, and integration from the agricultural research literature into a domain-consistent crop disease knowledge graph. The pipeline combines transformer-based sentence embeddings with variational deep clustering to extract topics; facet-aware relevance scoring then selects the sentences to include in the summary. Lexicon-guided named entity recognition supports the precise identification and normalization of terms for crops, diseases, symptoms, and related entity types. Relation extraction based on a combination of lexical, semantic, and contextual features generates meaningful triplets for the knowledge graph. The experimental results show that the method performed consistently well at each stage of the knowledge extraction process. Among the combinations of embedding and deep clustering methods, SciBERT + VaDE achieved the best clustering results. The extraction of representative sentences for disease symptoms, control/treatment, and prevention obtained F1-scores of around 0.8. The resulting knowledge graph has high node coverage and high relation completeness, as well as high precision and recall in triplet generation. The multi-stage NLP pipeline effectively converts unstructured agricultural research texts into a coherent and semantically rich knowledge graph, providing a basis for further research in crop disease analysis, knowledge retrieval, and data-driven decision support in agricultural informatics.
Keywords: multi-stage NLP framework; transformer-based embeddings; clustering; entity recognition; crop disease knowledge graph
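
To make the last two stages described in the abstract more concrete, the following is a minimal sketch of lexicon-guided entity recognition followed by triplet generation. It is not the authors' implementation: the toy lexicon, the relation names (`affects`, `has_symptom`), and the functions `find_entities` and `make_triplets` are all hypothetical, and a simple type-pair lookup stands in for the feature-based relation extraction that the abstract describes.

```python
import re
from itertools import product

# Hypothetical toy lexicon: surface form -> (canonical name, entity type).
# Several surface forms may normalize to the same canonical name.
LEXICON = {
    "rice": ("rice", "Crop"),
    "rice blast": ("rice blast", "Disease"),
    "blast disease": ("rice blast", "Disease"),
    "leaf lesions": ("leaf lesion", "Symptom"),
    "leaf lesion": ("leaf lesion", "Symptom"),
}

# Hypothetical relation schema keyed by (head type, tail type).
# This lookup is a stand-in for a learned relation extractor.
RELATIONS = {
    ("Disease", "Crop"): "affects",
    ("Disease", "Symptom"): "has_symptom",
}

# Longest surface forms come first so "rice blast" wins over "rice".
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(t) for t in sorted(LEXICON, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)


def find_entities(sentence):
    """Return normalized (canonical name, type) pairs in order of appearance."""
    found = []
    for match in _PATTERN.finditer(sentence):
        entity = LEXICON[match.group(1).lower()]
        if entity not in found:  # keep each normalized entity once
            found.append(entity)
    return found


def make_triplets(sentence):
    """Pair co-occurring entities whose types appear in the relation schema."""
    entities = find_entities(sentence)
    triplets = []
    for (head, head_type), (tail, tail_type) in product(entities, repeat=2):
        relation = RELATIONS.get((head_type, tail_type))
        if relation:
            triplets.append((head, relation, tail))
    return triplets


if __name__ == "__main__":
    print(make_triplets("Rice blast causes leaf lesions on rice."))
```

For the example sentence, the sketch yields the triplets `("rice blast", "has_symptom", "leaf lesion")` and `("rice blast", "affects", "rice")`. Matching the longest lexicon entry first is the design choice that keeps the disease name "rice blast" from being split into the crop "rice" plus an unmatched word.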

