Article

Ontology-Enhanced Deep Learning for Early Detection of Date Palm Diseases in Smart Farming Systems

by Naglaa E. Ghannam 1,*, H. Mancy 2, Asmaa Mohamed Fathy 3 and Esraa A. Mahareek 3

1 Department of Computer Engineering and Information, College of Engineering, Wadi Ad Dwaser, Prince Sattam Bin Abdulaziz University, Al-Kharj 16273, Saudi Arabia
2 Department of Computer Science, College of Engineering and Computer Sciences, Prince Sattam Bin Abdulaziz University, Al-Kharj 11942, Saudi Arabia
3 Faculty of Science, Al-Azhar University, Cairo 11754, Egypt
* Author to whom correspondence should be addressed.
AgriEngineering 2026, 8(1), 29; https://doi.org/10.3390/agriengineering8010029
Submission received: 12 September 2025 / Revised: 30 December 2025 / Accepted: 9 January 2026 / Published: 13 January 2026

Abstract

Early and accurate date palm disease detection is key to the sustainability of smart farming ecosystems. In this paper, we introduce DoST-DPD, a new Dual-Stream Transformer architecture for multimodal disease diagnosis using RGB, thermal, and NIR imaging. In contrast to standard deep learning approaches, our model receives ontology-based semantic supervision (via per-dataset OWL ontologies), enabling knowledge injection through SPARQL-driven reasoning during training. This structured knowledge layer not only improves multimodal feature correspondence but also enforces label consistency, improving generalization, particularly in early disease diagnosis. We tested the proposed method on a comprehensive set of five benchmarks (PlantVillage, PlantDoc, Figshare, Mendeley, and Kaggle Date Palm) together with domain-specific ontologies. An ablation study validates the effectiveness of ontology supervision, which consistently improves performance across Accuracy, Precision, Recall, F1-Score, and AUC. We achieve state-of-the-art performance over widely recognized baselines (PlantXViT, Multi-ViT, ERCP-Net, and ResNet), with DoST-DPD reaching the highest Accuracy of 99.3% and AUC of 98.2% on the PlantVillage dataset. In addition, ontology-driven attention maps and semantic consistency yield high interpretability and robustness across multiple crops and imaging modalities. This work presents a scalable roadmap for ontology-integrated AI systems in agriculture and illustrates how structured semantic reasoning can directly benefit multimodal plant disease detection. The proposed model demonstrates competitive performance across multiple datasets and highlights the unique advantage of integrating ontology-guided supervision in multimodal crop disease detection.

1. Introduction

The date palm is a truly majestic tree of major socio-economic and ecological importance, especially in arid and semi-arid regions. Date palms are cultivated in more than 37 countries worldwide, sustaining oasis agriculture and providing millions of people with nutrition and livelihoods [1]. Egypt ranks among the world’s top producers of dates, with approximately seven million productive date palm trees [2]. These trees also contribute to carbon sequestration and food security through the fruit they produce [3].
However, date palm cultivation is severely affected by a wide range of diseases and pests, leading to significant yield losses and, in some cases, tree mortality. The rapid spread of fungal infections, scale infestations, and blights across orchards underscores the critical need for early detection and timely intervention. Historically, date palm disease diagnosis has relied on expert visual observation, which is time-consuming and laborious. Modern smart farming uses remote sensing and AI technologies to assist with crop monitoring. Deep learning in computer vision can automatically detect plant diseases from images, interpreting complex disease patterns. CNNs have been used for plant disease classification with great accuracy, outperforming classical machine learning approaches. Attention mechanisms are incorporated into CNN architectures to focus on relevant regions and identify small or inconspicuous lesions [4]. Such attention-enhanced CNN variants have outperformed earlier models in pest and disease recognition because they learn to focus on key features. Meanwhile, ViTs and other attention-based frameworks have recently been explored for plant disease identification tasks [5,6].
ViT models split the image into patches and use self-attention to learn long-range contextual relations, which can help with complex backgrounds or overlapping symptoms. Early reports indicated that transformer-based approaches can match or exceed CNN performance on leaf disease datasets [5]. For example, PlantXViT, an explainable ViT-CNN hybrid model, has been proposed for efficient identification of multiple crop diseases, attaining over 93% accuracy on several benchmark datasets [7]. Similarly, an attention score-based Multi-ViT ensemble has been shown to outperform conventional models by aggregating features from multiple Vision Transformers and views of the plant [8]. These advances illustrate the rapid progress in deep learning techniques, from plain CNNs to attention-augmented networks and transformers, for plant disease detection. Unlike traditional deep learning systems, our framework incorporates OWL-based ontologies that inject semantic structure directly into the learning process. This is achieved via a dedicated supervision layer that enables reasoning over disease symptoms, environmental conditions, and crop-specific traits using SPARQL-based rules.
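The patch-tokenization step described above, in which a ViT converts an image into a sequence of patch vectors before self-attention, can be sketched as follows. This is a minimal pure-Python illustration; the function name and list-based image representation are ours, not from any ViT library.

```python
# Sketch: how a Vision Transformer tokenizes an image into non-overlapping
# patches before self-attention. A real ViT would then linearly project each
# patch vector and add positional embeddings.

def image_to_patches(image, patch_size):
    """Split an H x W single-channel image (list of row lists) into
    flattened non-overlapping patch vectors, in row-major order."""
    h, w = len(image), len(image[0])
    assert h % patch_size == 0 and w % patch_size == 0
    patches = []
    for top in range(0, h, patch_size):
        for left in range(0, w, patch_size):
            # Flatten one patch_size x patch_size window into a token vector.
            patch = [image[top + i][left + j]
                     for i in range(patch_size)
                     for j in range(patch_size)]
            patches.append(patch)
    return patches

# A 4x4 image split into 2x2 patches yields 4 tokens of length 4.
img = [[r * 4 + c for c in range(4)] for r in range(4)]
tokens = image_to_patches(img, 2)
```

Self-attention then operates over these tokens, letting any patch attend to any other, which is what gives ViTs their long-range context compared with the local receptive fields of CNNs.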
Despite this progress, significant gaps remain in current deep learning approaches for plant disease detection, especially for date palms in real-field conditions. First, most models focus on classifying clearly symptomatic images and struggle to detect diseases at an early stage when visual signs are minimal. In practice, early infections often produce only subtle color changes or small lesions that are frequently overlooked by both humans and standard CNNs. Indeed, it has been noted that obtaining an early diagnosis of plant diseases is very difficult: even high-resolution RGB images may not reveal symptoms in the initial stages, and few studies report successful early detection. The downsampling operations in deep CNNs can further obscure the features of tiny lesions [9], causing models to miss early-stage disease signals. Second, existing frameworks make limited use of multimodal imagery. Most plant disease datasets comprise only visible-light (RGB) images. While RGB images are easy to capture, physiological stress responses often occur beyond the visible spectrum. Other modalities, such as thermal infrared or multispectral imaging, may be needed to observe plants under early-stage stress (e.g., through changes in leaf temperature or reflectance). For example, thermal imaging can detect crop infections about 3–6 days before macroscopic symptoms appear by measuring temperature variations induced by disease-related changes in transpiration [10]. Thermal imagery enhances early disease detection by highlighting pre-symptomatic heat stress patterns, while NIR imaging captures subtle changes in leaf structure and moisture content. These modalities, when fused with RGB, offer complementary insights into the physiological state of date palms.
Deep learning models for plant disease detection often lack rich domain knowledge in plant pathology, which could enhance both model reasoning and interpretability. Although this knowledge is frequently encoded in ontologies or expert systems, most deep learning approaches fail to incorporate such semantic domain information. To address this limitation, recent studies have begun integrating deep learning with semantic web technologies, such as ontologies and knowledge graphs, to improve the diagnosis and explanation of plant diseases. For example, augmenting deep models with an ontology of disease symptoms has been shown to improve prediction accuracy and produce more meaningful results [11]. However, this line of work is still in its infancy, and ontology-enhanced deep learning remains underexplored for crop disease detection. Recent research has explored knowledge-graph-guided or neuro-symbolic learning frameworks in agriculture, but these approaches often operate as post-processing modules or label filters. In contrast, our model injects semantic knowledge into the core learning process by aligning transformer attention with ontological reasoning, leading to improved multimodal discrimination and explainability.
We define a dual-stream Vision Transformer as an architecture with two independent ViT branches that process distinct modalities (e.g., RGB and Thermal/NIR) in parallel before semantic fusion. This design enables targeted attention mechanisms per modality and joint feature learning across spectral domains. In this paper, we propose a novel approach, called Dual-Stream Ontology-Supervised Transformer for Date Palm Disease Detection (DoST-DPD), to address these limitations for date palm disease detection. DoST-DPD is a deep learning framework for date palm diseases that uses an ontology-enhanced model and multimodal sensing to guide feature extraction and decision-making. It combines an expert-curated ontology with a state-of-the-art deep learning model to focus on relevant symptom features and provide human-understandable reasoning for predictions. The method uses multimodal image inputs, including thermal and RGB images: thermal imagery reveals temperature anomalies in date palm fronds before visible symptoms appear, while RGB provides color and texture information for later-stage symptoms. To ensure dataset-specific semantic supervision, we developed or adapted OWL ontologies for each dataset used in the study. AgriOnt-DP was applied to date-palm datasets, while simplified ontologies were used for more generic datasets such as PlantVillage and PlantDoc, ensuring contextual relevance.
Figure 1 illustrates the principal differences between traditional CNN-based plant disease detection and the proposed DoST-DPD framework. Conventional AI systems tend to rely solely on RGB images and often fail to detect early-stage symptoms because of single-modality limitations and a lack of domain knowledge. In contrast, DoST-DPD combines RGB and thermal imagery using a dual-stream Vision Transformer with attention-based fusion, while semantic supervision is introduced via the AgriOnt-DP ontology. This enables the system to deliver early, interpretable predictions, facilitating smart date palm farming.
We design a hybrid CNN–Transformer network that processes these multimodal inputs, using attention mechanisms to align and fuse features across the different modalities. The network is trained not only to classify diseases but also to localize early symptoms, assisted by the ontology, which penalizes implausible predictions (e.g., a symptom that does not match the diagnosed disease in the ontology). To our knowledge, DoST-DPD is the first framework to incorporate a domain ontology with deep multimodal learning for plant disease detection. This ontology-enhanced approach is expected to detect infections at earlier stages than conventional RGB-only models and to yield more interpretable results, which is crucial for decision support in smart farming systems. Figure 2 presents a conceptual excerpt from the Agricultural Ontology for Date Palm Diseases (AgriOnt-DP), illustrating how semantic relationships support intelligent disease reasoning. The disease Graphiola Leaf Spot is linked to its symptom (Black Dots on Leaf), the affected plant part (Frond), its infection stage (Early Stage), and its cause (Fungal Infection). These structured relationships form the backbone of ontology-guided supervision in the DoST-DPD model, enabling the system to validate predictions against expert-defined biological logic and improve both accuracy and interpretability. The ontological supervision enforces multi-label consistency using SPARQL queries derived from OWL class restrictions, enabling rule-based reasoning during training. For example, a SPARQL rule may infer the presence of pest infestation if both leaf wilt and elevated leaf temperature are detected.
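For illustration, the kind of symptom-to-condition rule just described (inferring pest infestation from leaf wilt plus elevated leaf temperature, or Graphiola Leaf Spot from black dots on the leaf) can be sketched in plain Python. In DoST-DPD such rules are expressed as SPARQL queries over the OWL ontology; the symptom and class names below are illustrative placeholders, not the actual AgriOnt-DP identifiers.

```python
# Minimal sketch of ontology-style forward rules over detected symptom
# labels. Each rule fires when all of its antecedent symptoms are present,
# mirroring the ASK/CONSTRUCT-style reasoning the text attributes to SPARQL.

RULES = [
    # (required symptoms, inferred condition)
    ({"LeafWilt", "ElevatedLeafTemperature"}, "PestInfestation"),
    ({"BlackDotsOnLeaf"}, "GraphiolaLeafSpot"),
]

def infer_conditions(detected_symptoms):
    """Return every condition whose required symptoms are all detected."""
    detected = set(detected_symptoms)
    return sorted({cond for required, cond in RULES if required <= detected})

inferred = infer_conditions({"LeafWilt", "ElevatedLeafTemperature"})
```

During training, the same machinery can be run over the model's predicted labels to check whether a prediction set is consistent with the ontology's rules before computing any consistency penalty.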
We evaluated the proposed DoST-DPD method on two real-world date palm disease datasets collected from smart farm deployments. The datasets include RGB and thermal images of date palm trees infected by common diseases and pests in actual field conditions (e.g., White Scale, Fusarium wilt), with expert annotations for validation. The performance of DoST-DPD is compared against four recent algorithms representative of the state of the art. PlantXViT [7] is a Vision Transformer-based CNN model designed for plant disease identification with a lightweight architecture and interpretability. Multi-ViT [8] is an attention score-based multi-Vision Transformer ensemble that aggregates features from multiple ViT models and views, shown to improve accuracy on complex disease datasets. We also include a conventional deep CNN baseline, ResNet-50 with Grad-CAM, which uses a ResNet classifier [12] for disease recognition and applies the Grad-CAM technique for a visual explanation of the results. Finally, to evaluate the proposed DoST-DPD framework, we utilize five publicly available datasets that encompass a diverse range of modalities and environmental conditions. These include PlantDoc [13], which offers RGB field images with complex backgrounds; the Figshare dataset [14], providing paired RGB and thermal imagery labeled at the tree level; the Mendeley dataset [15], containing RGB and NIR images annotated across nine disease categories; the Kaggle Date Palm dataset [16], offering RGB images under lab conditions across three disease classes; and the PlantVillage dataset [17], which comprises over 54,000 lab-captured RGB images spanning 38 crop–disease combinations. Some samples from this dataset are plotted in Figure 3. The incorporation of such datasets allows for a comprehensive assessment across single- and multimodal conditions as well as benchmarking in real-world agricultural environments.
We conduct extensive experiments against these peer methods and show that our ontology-enhanced approach achieves better detection effectiveness, especially for early-stage disease, as well as improved interpretability. The following sections outline the system design, ontology integration, experimental results, and implications for the deployment of smart farming.
The rest of this paper proceeds as follows: In Section 2, we present related works in plant disease detection, multimodal deep learning methods, and ontology integration. Section 3 describes the DoST-DPD framework for collaborative multimodal disease detection, including architecture, ontology schema, and loss functions. Section 4 describes experimental setup details, including datasets and implementation. Section 5 presents results and performs comparisons of our method against baseline methods. Section 6 concludes this study and points to avenues of future work.

2. Related Works

2.1. Recent AI Techniques for Plant Disease Detection

Deep learning has revolutionized plant disease detection in recent years. Convolutional Neural Networks (CNNs) like ResNet and Inception have been widely applied, often via transfer learning on large pretrained models, to classify leaf images with high accuracy. However, standard CNNs have limitations in capturing long-range dependencies in complex foliage backgrounds [18]. To address this, researchers have integrated attention mechanisms into CNNs. For example, the Squeeze-and-Excitation (SE-Net) block and Convolutional Block Attention Module (CBAM) add channel/spatial attention to emphasize disease-relevant features, improving performance. An improved channel attention module based on a CBAM was shown to boost accuracy in leaf disease classification [17].
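The channel-attention idea behind SE-Net mentioned above (squeeze each channel to a descriptor, gate it, and rescale the channel) can be sketched as follows. This is a didactic pure-Python version under stated assumptions: a real SE block learns two small fully connected layers for the excitation step, whereas here the gate is a bare sigmoid on the pooled descriptor.

```python
import math

# Sketch of Squeeze-and-Excitation channel attention: global average pooling
# per channel ("squeeze"), a gating nonlinearity ("excitation"), then
# channel-wise rescaling ("scale"). The learned FC layers are omitted.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def se_attention(feature_maps):
    """feature_maps: list of channels, each a 2D list. Returns rescaled maps."""
    # Squeeze: per-channel global average pooling.
    descriptors = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
                   for ch in feature_maps]
    # Excitation: gate each descriptor into a (0, 1) channel weight.
    weights = [sigmoid(d) for d in descriptors]
    # Scale: reweight every value in a channel by that channel's weight.
    return [[[w * v for v in row] for row in ch]
            for ch, w in zip(feature_maps, weights)]

# Two 2x2 channels: the positive-response channel keeps a larger share
# of its activation than the negative-response channel.
maps = [[[1.0, 1.0], [1.0, 1.0]], [[-1.0, -1.0], [-1.0, -1.0]]]
scaled = se_attention(maps)
```

The same squeeze/excite/scale pattern underlies CBAM's channel branch; CBAM additionally applies a spatial attention map, which is what lets such modules emphasize small lesion regions.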
Vision Transformers (ViTs) have emerged as an alternative, using self-attention to capture global context. Pure ViTs require large datasets, but hybrid CNN-Transformer models leverage the strengths of both. Thakur et al. (2023) introduced PlantXViT, which combines a CNN backbone (with Inception-style layers) and a Transformer encoder [19]. This hybrid achieved state-of-the-art results across multiple crops—e.g., 93.5% on apple and 98.3% on rice—even under challenging field backgrounds [20]. PlantXViT also incorporated explainability via Grad-CAM, highlighting leaf regions that influenced the classification [18]. Similarly, Zhu et al. (2023) proposed a Multiscale Convolutional ViT (MSCVT) that fuses multi-scale CNN features with a Transformer [21]. This model captured fine-grained texture and global context, outperforming single-scale CNNs and standalone ViTs in crop disease identification [21]. Another lightweight hybrid, ConvViT, preserved edge details via convolutional patch embedding and achieved ~96.8% accuracy on a challenging apple disease dataset with far fewer parameters [22].
Attention-based CNN enhancements remain popular too. A recent example is ERCP-Net, a deep CNN with a Channel Extension Residual block and an Adaptive Channel Attention module. ERCP-Net achieved 99.82% accuracy on the PlantVillage dataset (lab images) and even 86.21% on a more complex field dataset, surpassing other CNNs. This underscores how modern architectures (residual connections, multi-scale pooling, attention modules) can attain extremely high accuracy on benchmark datasets. Indeed, many studies report above 98–99% accuracy on PlantVillage [23]. However, simpler models often falter when tested on images beyond the conditions of training, motivating the community to explore transformers and hybrid models for greater robustness.
Recently, researchers have also developed ensemble and multi-branch Transformers. For instance, Baek et al. (2023) introduced an Attention Score-Based Multi-ViT model that ensembles multiple ViT branches [8]. Their framework uses a novel attention-based feature fusion to dynamically prioritize information from multiple leaves, addressing the limitation of single-image diagnosis. By aggregating outputs from several ViT instances (each capturing different visual patterns), the Multi-ViT can holistically analyze distributed symptoms on a plant. This model achieved over 99% accuracy on datasets of apple, grape, and tomato diseases, outperforming single ViTs and CNNs in handling subtle or diverse symptoms. Such results confirm that attention-based models (CBAM, SE-Net) and transformer architectures (ViTs, hybrid CNN-Transformer) are at the forefront of plant disease AI, offering improved feature representation and accuracy.

2.2. Multimodal Deep Learning in Agriculture

Most vision-based disease detection has relied on RGB imagery, but multimodal approaches are gaining traction for early detection. Different imaging modalities (visible, thermal, near-infrared, hyperspectral) provide complementary information about plant health [23]. Researchers have designed dual-stream architectures to fuse these modalities. In a typical dual-stream CNN, separate convolutional networks process (for example) RGB and NIR inputs, and their feature maps are fused via concatenation or attention. This allows the model to learn joint features: RGB captures color/texture symptoms while NIR or thermal can reveal water stress or temperature anomalies before visible symptoms emerge. For instance, Ahamed et al. used parallel CNNs on visible vs. NIR grape leaf images, showing that NIR images accentuated early mildew spots invisible in RGB [24]. Attention-based fusion can further improve learning from multiple inputs. Some studies introduce transformer fusion modules that treat multimodal feature vectors as a sequence and apply self-attention to weigh the contributions of each modality. Li et al. (2022) developed a transformer-based fusion network for grape diseases that integrates RGB images, hyperspectral data, and even environmental sensors [22]. Their model’s attention mechanism learns the relevance of spectral cues and environmental factors, significantly improving detection robustness under field conditions. Notably, hyperspectral imagery (with dozens of narrow bands) allows finer spectral signatures of infection. By fusing it with RGB, models can distinguish stress-induced spectral changes that RGB alone might miss. Thermal imaging has also been combined with RGB for early stress detection: temperature anomalies in leaves (captured via thermal cameras) can indicate disease or pest infestation inside a plant before outward signs appear. 
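The attention-based fusion described above, where modality feature vectors are combined according to learned relevance, can be sketched as a softmax-weighted sum. This is a minimal sketch under stated assumptions: relevance scores are supplied directly here, whereas in a trained dual-stream network they would come from a small scoring head or cross-attention layer; all names are illustrative.

```python
import math

# Sketch of attention-weighted fusion of per-modality feature vectors,
# in the spirit of the dual-stream designs discussed above.

def softmax(scores):
    m = max(scores)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_modalities(features, relevance_scores):
    """features: dict modality -> feature vector (all the same length).
    Returns the attention-weighted sum of the modality vectors."""
    names = sorted(features)
    weights = softmax([relevance_scores[n] for n in names])
    dim = len(features[names[0]])
    fused = [0.0] * dim
    for name, w in zip(names, weights):
        for i, v in enumerate(features[name]):
            fused[i] += w * v
    return fused

fused = fuse_modalities(
    {"rgb": [1.0, 0.0], "thermal": [0.0, 1.0]},
    {"rgb": 0.0, "thermal": 0.0},  # equal relevance -> equal weights
)
```

Raising the thermal relevance score would shift the fused vector toward the thermal features, which is how such a mechanism can up-weight pre-symptomatic heat cues when RGB evidence is weak.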
Overall, multimodal deep learning—using dual or multi-stream networks with modality-specific encoders and attention-based fusion—has demonstrated superior performance in early disease detection. By leveraging cross-modal signals (e.g., heat, reflectance, moisture), these approaches can catch subtle physiological changes, enhancing early warning capabilities over RGB-only models.
Recent reviews on Agriculture 5.0 emphasize the growing integration of AI, multimodal sensing, and semantic technologies in precision farming. For instance, ref. [25] provides a comprehensive overview of embedded sensor technologies for plant stress monitoring, underscoring the importance of fusing RGB, thermal, and hyperspectral modalities—an approach aligned with the multimodal philosophy of DoST-DPD. Beyond image data, some works incorporate non-visual modalities (e.g., soil moisture, weather) alongside images to contextualize disease predictions. This multi-modal sensor fusion (vision + IoT data) can improve generalization, but it remains an emerging area. Challenges persist in aligning different data types and scaling dual networks. Nevertheless, studies consistently show that combining modalities (RGB with NIR, thermal, or hyperspectral) yields more reliable and earlier detections than single-spectrum analysis.

2.3. Ontology-Guided or Knowledge-Driven AI in Agriculture

Integrating domain knowledge through ontologies and knowledge graphs is a promising avenue to enhance AI models in agriculture. Plant and Crop Ontologies provide a structured vocabulary of plant parts, phenotypes, and disease names that can be leveraged for better classification and explanation [26]. For example, the Plant Ontology (PO) standardizes terms for plant anatomy (leaf, stem, fruit, etc.), while the Plant Disease Ontology (PDO) defines disease names and categories (e.g., fungal, bacterial diseases of certain crops). By linking model outputs to such ontologies, one can ensure consistent naming and even enforce biologically plausible predictions. A model could be designed to predict a disease label that is ontology-consistent with the host plant (avoiding impossible host–pathogen combinations). Knowledge-driven classification has been explored in expert systems. Jearanaiwongkul et al. (2021) [26] built an ontology-based rice disease diagnosis system that utilizes the Rice Disease Ontology (RiceDO) to match observed symptoms with probable diseases. While their approach was rule-based, the ontology ensured semantic clarity (e.g., distinguishing diseases by pathogen type and plant stage).
In deep learning, ontologies have been used to improve explainability and semantic labeling. One novel approach is to employ ontologies of visual symptoms for concept-based explanations. For instance, researchers created an ontology of plant disease symptoms (color change, lesion shape, wilting) and used it to guide a concept-based interpretability method (TCAV) for a CNN classifier [27]. By mapping neuron activations to high-level concepts defined in the ontology, they could explain a model’s prediction in terms of human-understandable traits (e.g., “brown spots” concept was highly influential in classifying Alternaria leaf blight). This neuro-symbolic integration helps bridge the gap between raw pixel features and expert knowledge, making AI decisions more transparent to agronomists. Ontologies such as Plant Ontology and Crop Ontology have also been utilized to create knowledge graphs that relate crops, diseases, pests, and symptoms. Zhou et al. (2022) constructed a Plant Pest and Disease knowledge graph (KG) to support an explainable diagnosis model: the KG captures relationships (e.g., DatePalm → FungalDisease → White Rot) so that a neural network’s predictions can be cross-checked against known associations [28]. Such knowledge-enhanced models can enhance semantic consistency (predicting the correct disease for a given crop species) and provide justifications (referencing known biological facts) for their outputs. In summary, ontology-guided AI in agriculture enhances classification by embedding expert knowledge of plant biology, and it improves explainability by linking model decisions to human-understandable concepts and relationships.
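The knowledge-graph cross-check described above (e.g., DatePalm → FungalDisease → White Rot) amounts to verifying that a predicted disease is a known association for the host crop. A minimal sketch, with illustrative triples rather than any real knowledge graph:

```python
# Sketch: validating a classifier's prediction against a knowledge graph of
# crop-disease associations, so impossible host-pathogen combinations can be
# flagged. All entries are illustrative placeholders.

KG = {
    ("DatePalm", "susceptibleTo"): {"WhiteRot", "GraphiolaLeafSpot",
                                    "FusariumWilt"},
    ("Tomato", "susceptibleTo"): {"EarlyBlight", "LateBlight"},
}

def is_plausible(host, predicted_disease):
    """True iff the KG records the disease as affecting this host."""
    return predicted_disease in KG.get((host, "susceptibleTo"), set())

# A prediction of LateBlight on a date palm would be rejected as implausible.
```

A knowledge-enhanced classifier can use such a check either as a hard filter on outputs or, as in the neuro-symbolic approaches discussed above, as a soft signal fed back into training.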
Recent studies have demonstrated the growing utility of Swin Transformer-based architectures in plant disease detection, owing to their ability to model both local and global dependencies effectively. Zhang et al. (2025) [29] introduce a novel Swin-Axial transformer architecture, which fuses axial attention into hierarchical Swin blocks for improved detection of tomato leaf diseases under varied illumination and complex backgrounds. Their model achieved an accuracy of just over 96%, exceeding CNN and Vision Transformer baselines, thanks to better capture of fine-grained disease symptoms and greater robustness under field conditions [29]. Complementing this, Song and Gao [30] presented an approach using the Swin Transformer to recognize plant diseases in digital images drawn from the PlantVillage dataset, reporting over 95% classification accuracy. They demonstrated that Swin-T's shifted-window operation facilitates feature extraction on localized disease regions while preserving global context, which benefits fine-grained classification. Together, these works demonstrate that Swin Transformer models serve as competitive baselines for plant disease recognition, supporting the need to include them in comparative studies of ontology-supported and multimodal detection methods.

2.4. Ontology-Integrated Approaches in Agricultural Disease Detection

Researchers have explored incorporating domain ontologies and knowledge graphs into agricultural AI systems to improve decision support and reasoning. Ontology-driven expert systems for crop protection have a long history in agriculture. For example, Alharbi et al. developed AgrODSS, a semantic decision-support system that diagnoses plant pests and diseases using a dedicated Plant Disease and Pest Ontology (PDP-O) and provides evidence-based explanations via SPARQL queries over a knowledge base [31]. Likewise, Jearanaiwongkul et al. (2021) built an expert system called RiceMan that leverages a Rice Disease Ontology and rule-based inference (using SWRL rules linking observed symptoms to probable diseases) to deliver explainable diagnoses [26]. These systems rely on semantic web technologies and rule-based reasoning to ensure that diagnoses align with agronomic domain knowledge, improving transparency and consistency in recommendations.
More recently, hybrid neuro-symbolic models have emerged, integrating semantic knowledge directly with deep learning for plant stress and disease detection. Ahmed and Yadav [32], for instance, proposed an ontology-based classification framework that models plant disease knowledge in OWL and combines it with deep neural networks, showing that augmenting a CNN with an ontology of disease symptoms can improve prediction accuracy. Amara et al. (2024) incorporated a tomato disease ontology into a concept-based explanation method (TCAV) to guide a vision model's focus toward expert-defined symptom concepts, thereby enhancing interpretability without sacrificing performance [27]. Nagarathna et al. (2025) introduced a Hybrid Vision Graph Neural Network (HV-GNN) that fuses image features with an ontology-driven knowledge graph embedding of plant species, symptoms, and diseases [33]. This approach enriches the classifier with domain context by integrating semantic relationships among diseases and symptoms into the learning process. Similarly, Kurhe and Dashore (2025) proposed a multimodal neuro-symbolic architecture (NeuroCausal-FusionNet) in which CNN/Transformer visual features are merged with a plant phenotype knowledge graph via a dedicated knowledge-graph attention mechanism, yielding biologically informed latent representations for disease diagnosis [34]. Such approaches demonstrate that embedding expert knowledge into deep models, through ontologies, knowledge graphs, or rules, can improve the robustness and explainability of plant disease detection systems.
DoST-DPD distinguishes itself from these prior works by tightly coupling domain knowledge with a dual-stream transformer architecture for multimodal date palm disease detection. Whereas earlier ontology-guided methods often act as post-processing filters or separate modules, DoST-DPD injects semantic supervision directly into the model’s training loop. The ontology (AgriOnt-DP) is used to align and constrain the transformer’s predictions, effectively guiding the model’s attention toward symptom-relevant features and enforcing logical consistency among predicted labels. In practice, this means the network is penalized for outputs that violate known disease–symptom relationships defined in the ontology. For example, if an image’s thermal signature indicates leaf wilt but the RGB features suggest a fungal Leaf Spot, ontology-derived SPARQL rules can flag the combination as implausible and adjust the model’s learning accordingly. By incorporating ontological constraints in this way, DoST-DPD ensures its multimodal predictions remain coherent with expert knowledge (e.g., no impossible disease-symptom pairs), which enhances both accuracy and interpretability. This ontology-integrated strategy goes beyond what previous neuro-symbolic models achieved, effectively bridging deep visual learners with symbolic reasoning within a unified framework.
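The training-loop penalty described above, where outputs violating known disease-symptom relationships are penalized, can be sketched as an extra loss term. This is a minimal sketch under stated assumptions, not the paper's actual loss: the incompatible pair list, label names, and quadratic weighting are all illustrative.

```python
# Sketch: folding an ontology-derived consistency penalty into a multi-label
# training loss. Each ontology-incompatible label pair contributes a penalty
# proportional to the product of the two predicted probabilities, so the
# model is pushed away from confidently asserting both at once.

INCOMPATIBLE = [("FungalLeafSpot", "LeafWilt")]  # hypothetical ontology rule

def consistency_penalty(probs, lam=1.0):
    """probs: dict label -> predicted probability in [0, 1]."""
    return lam * sum(probs.get(a, 0.0) * probs.get(b, 0.0)
                     for a, b in INCOMPATIBLE)

def total_loss(classification_loss, probs, lam=1.0):
    """Classification loss (e.g., BCE) plus the ontology penalty."""
    return classification_loss + consistency_penalty(probs, lam)

# Confidently predicting both incompatible labels raises the loss;
# predicting only one leaves it nearly unchanged.
loss_bad = total_loss(0.1, {"FungalLeafSpot": 0.9, "LeafWilt": 0.9})
loss_ok = total_loss(0.1, {"FungalLeafSpot": 0.9, "LeafWilt": 0.05})
```

Because the penalty is differentiable in the predicted probabilities, it can be backpropagated alongside the classification loss, which is one plausible way to realize the "semantic supervision in the training loop" the text describes.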

2.5. Limitations in Existing Models

Despite the progress in plant disease AI, current models face several limitations. A major gap is early detection capability: most models are trained on images where symptoms are visually pronounced (e.g., clear lesions or discoloration) [35]. Early-stage disease detection, when symptoms are mild, barely visible, or internal, remains challenging for RGB-based models. Multimodal strategies involving thermal/NIR imaging show potential here, but further studies and data are needed. Another issue is the lack of interpretability: state-of-the-art CNN or transformer models risk acting as "black boxes" whose predictions agronomists cannot trust. Post hoc explainers such as saliency maps are commonly used, but these sometimes only pinpoint a large region of the input with no semantic explanation (e.g., they may indicate focus on a discolored leaf region without explaining why it constitutes a spot or blight). This accentuates the importance of integrated, ontology-based explainability for building user trust.
Furthermore, many models still rely solely on RGB inputs, which limits their robustness. Visible-spectrum images can be unreliable under variable lighting conditions and may fail to capture physiological changes that are detectable in other spectral bands. This modality limitation contributes to poor generalization: models trained on ideal or controlled datasets often perform poorly when applied to different environments. For example, a classifier trained on one crop or on a single dataset (such as laboratory images) may fail when tested on another crop or on field images with complex backgrounds [36]. There is also a tendency in the literature to report high accuracy on narrow test sets, while cross-dataset or cross-season performance is seldom reported. In practice, models can suffer from dataset bias and may not sustain accuracy when exposed to new conditions (different soil background, camera angles, or unseen disease variants). Limited dataset diversity (few samples of certain diseases or missing early-stage examples) exacerbates this issue. Additionally, current deep models have little integration of agricultural domain knowledge. They might misclassify diseases that have similar visual symptoms but occur on different hosts, because the model lacks the contextual knowledge a human expert uses. Ontology integration is one approach to tackle this, but it is not yet mainstream in most plant disease classifiers. In summary, the literature suggests that early-stage detection, interpretability, modality constraints, and generalization across conditions remain key challenges to be addressed in smart farming disease detection systems.
Table 1 compares deep learning models for plant disease detection in terms of input types, use of attention mechanisms, explainability features, and ability to work with different datasets. ResNet and other classic CNNs only work with RGB images and lack built-in attention modules or ways to incorporate semantic knowledge. They do well on their training datasets, but they often have trouble generalizing without careful tuning.
More advanced architectures introduce meaningful improvements. ERCP-Net includes an Adaptive Channel Attention module and multi-scale feature extraction (✓ for attention), improving sensitivity to disease-specific textures. Transformer-based models such as PlantXViT and Multi-ViT incorporate self-attention (✓), capturing long-range dependencies and achieving superior robustness in cross-dataset evaluations. PlantXViT, for instance, was benchmarked on five crop disease datasets with strong results.
Swin Transformer generalizes this trend by incorporating a hierarchical, shifted window self-attention mechanism (✓), which efficiently captures both local fine-grained details and global information. Recent results indicate that Swin Transformer exhibits strong generalization on challenging field datasets and has outperformed baseline CNNs and popular ViTs in robustness and interpretability. Its architecture also improves feature localization and scales across resolutions, making it particularly suited to practical disease monitoring tasks.
None of these models natively use ontologies or explicit knowledge graphs (– in Knowledge Integration), which reflects a critical gap this paper aims to address. While post hoc explainability methods like Grad-CAM remain common (marked ✓ where used), few models incorporate interpretability into their architecture by design. Generalization is noted based on whether models were evaluated across diverse crops or environmental conditions. Multi-ViT, PlantXViT, and Swin Transformer have all demonstrated cross-dataset robustness (✓), unlike earlier CNN-based baselines.
Each of the above models has advanced the field in distinct ways. ResNet and similar early CNN architectures laid a firm foundation for deep learning in plant pathology. ERCP-Net showed the advantages of custom CNN modules with adaptive attention in improving accuracy. More recently, transformer-based approaches such as PlantXViT and Multi-ViT model global relationships in leaf images to provide better cross-domain generalization. Swin Transformers extend this pathway further, marrying hierarchical representation with efficient local–global self-attention to handle field variability and high-resolution disease cues. However, none of these models leverage ontologies for domain knowledge; this is the gap this paper intends to bridge. Our ontology-enhanced deep learning framework integrates expert knowledge on date palm diseases directly into model supervision to enhance early detection, semantic consistency, and interpretability for real-world smart farming applications.

3. Proposed Methodology: DoST-DPD Framework

The proposed framework, DoST-DPD (Dual-Stream Ontology-Supervised Transformer for Date Palm Disease Detection), is designed to leverage both multimodal imagery and semantic supervision for accurate early-stage detection of diseases in date palm trees. It integrates RGB, thermal, and NIR data with a domain-specific ontology to enhance classification accuracy, interpretability, and generalizability across diverse environmental settings. DoST-DPD is structured into three functional modules: (1) a dual-stream transformer for extracting features from multiple image modalities, (2) an ontology-guided supervision mechanism that semantically aligns model predictions with expert knowledge, and (3) a classification head that synthesizes these representations for final decision-making. Each input sample includes an RGB image, $x_i^{rgb} \in \mathbb{R}^{H \times W \times 3}$, and optionally a thermal or NIR image, $x_i^{th}$ or $x_i^{nir} \in \mathbb{R}^{H \times W \times 1}$. The DoST-DPD model accepts two input modalities per sample: a ground-level or drone-acquired RGB image and a corresponding thermal or multispectral image. Each input image is resized to a fixed spatial resolution of 224 × 224 pixels and normalized across channels. Standard augmentation techniques—such as random rotation, horizontal flipping, and spectral band stretching—are applied to enhance robustness and reduce overfitting. Thermal inputs provide early temperature-based stress indicators, NIR highlights early chlorophyll degradation, and RGB captures structural and color changes, making the multimodal combination beneficial for early-stage detection.

3.1. Modality-Specific Feature Extraction

To preserve modality-specific characteristics, DoST-DPD utilizes two independent Vision Transformers (ViTs). The RGB image is processed as follows:
$z_i^{rgb} = \mathrm{ViT}_{rgb}(x_i^{rgb}) \in \mathbb{R}^{L \times D}$ (1)
Here, L is the number of tokens (patches), and D is the embedding dimension. The RGB encoder follows a ViT-B/16 configuration with a patch size of 16 × 16, embedding dimension of 768, 12 transformer layers, 12 attention heads, MLP ratio of 4, and ImageNet-21k pretraining. The second ViT processes the thermal or NIR image:
$z_i^{mod} = \mathrm{ViT}_{mod}(x_i^{mod}) \in \mathbb{R}^{L \times D}$ (2)
(where $x_i^{mod} \in \{x_i^{th}, x_i^{nir}\}$). This stream uses a lighter ViT-Tiny configuration (patch size 16 × 16, embedding dimension 384, 12 layers, and 6 attention heads) with learnable positional encodings to better capture modality-specific feature distributions. RGB–thermal/NIR pairs are geometrically aligned using calibration-based homography mapping and normalized independently to reduce cross-modal intensity variation.
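As an illustrative sketch of the independent per-modality normalization described above (the function and toy values are ours, not the authors' code), min–max scaling per modality can be expressed as:

```python
# Hedged sketch: each modality is normalized independently so that RGB
# intensities and thermal readings share a common [0, 1] range before fusion.

def minmax_normalize(pixels):
    """Scale a flat list of pixel intensities to [0, 1] for one modality."""
    lo, hi = min(pixels), max(pixels)
    if hi == lo:  # constant image: avoid division by zero
        return [0.0 for _ in pixels]
    return [(p - lo) / (hi - lo) for p in pixels]

rgb = [10, 128, 255]          # toy RGB intensities (0-255 range)
thermal = [21.5, 24.0, 29.5]  # toy thermal readings in degrees Celsius

rgb_n = minmax_normalize(rgb)
th_n = minmax_normalize(thermal)
# Both modalities now occupy [0, 1] regardless of their raw units.
```

Because each modality is scaled against its own range, cross-modal intensity variation (e.g., 0–255 counts versus degrees Celsius) no longer dominates the fused representation.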

3.2. Multimodal Feature Fusion

The outputs of the RGB and auxiliary ViTs are aligned through a cross-attention module that learns how features from each modality relate to one another. This produces a unified representation:
$z_i^{fuse} = \mathrm{CrossAttn}(z_i^{rgb}, z_i^{mod})$ (3)
This fusion enables the model to integrate visual patterns and thermal/NIR cues that may signify stress or disease not visible in RGB alone. The fusion module performs mid-level feature fusion after the sixth transformer block of each stream, uses eight attention heads, and projects features to a shared 512-dimensional space before cross-attention to ensure coherent multimodal interaction.
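The cross-attention step can be illustrated with a minimal single-head sketch in pure Python (toy two-dimensional tokens; the actual module uses eight heads and 512-dimensional shared projections, which we omit for readability):

```python
import math

# Hedged sketch of single-head cross-attention: RGB tokens act as queries
# and thermal/NIR tokens supply keys and values, mirroring Eq. (3).

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cross_attention(queries, keys, values):
    """queries: RGB tokens; keys/values: auxiliary-modality tokens."""
    d = len(keys[0])
    fused = []
    for q in queries:
        # Scaled dot-product scores between one RGB token and all aux tokens.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        # Fused token = attention-weighted mixture of auxiliary values.
        fused.append([sum(wi * v[j] for wi, v in zip(w, values))
                      for j in range(len(values[0]))])
    return fused

rgb_tokens = [[1.0, 0.0], [0.0, 1.0]]
mod_tokens = [[0.5, 0.5], [1.0, -1.0]]
z_fuse = cross_attention(rgb_tokens, mod_tokens, mod_tokens)
```

Each fused token is a convex combination of auxiliary-modality tokens, which is how thermal or NIR cues are injected into positions where RGB evidence alone is ambiguous.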

3.3. Ontology-Guided Semantic Supervision

To incorporate domain knowledge, we introduce AgriOnt-DP, an ontology extended from AgriOnt that includes taxonomy, symptoms, and causal relationships specific to date palm pathology. For each input image, the ontology provides a multi-label semantic annotation, $y_i^{onto} \in \mathbb{R}^{C_{onto}}$, where $C_{onto}$ is the number of semantic classes. An ontology supervision head attached to the fused representation produces $\hat{y}_i^{onto}$, and a supervised loss aligns predictions with these concepts:
$\mathcal{L}_{onto} = -\sum_{c=1}^{C_{onto}} y_i^{onto}[c] \log(\hat{y}_i^{onto}[c])$ (4)
This step encourages the network to develop features that are interpretable and consistent with agronomic semantics. Ontology-based supervision is generated using SPARQL queries to retrieve symptom–disease relationships; for example,
SELECT ?disease WHERE {
 ?symptom rdf:type :BrownSpots .
 ?symptom :indicates ?disease .
}
The retrieved concepts are encoded as per-image multi-hot targets $y_i^{onto}$ for training.
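A minimal sketch of this supervision path, assuming an illustrative concept vocabulary rather than the actual AgriOnt-DP schema, encodes the retrieved concepts as a multi-hot target and scores predictions with the multi-label cross-entropy of Eq. (4):

```python
import math

# Hedged sketch: CONCEPT_INDEX is a toy stand-in for the ontology's
# semantic class vocabulary, not the paper's real schema.
CONCEPT_INDEX = {"BrownSpots": 0, "LeafWilt": 1, "GraphiolaLeafSpot": 2, "Healthy": 3}

def to_multi_hot(concepts, index=CONCEPT_INDEX):
    """Encode SPARQL-retrieved concepts as the per-image target y_i^onto."""
    target = [0.0] * len(index)
    for c in concepts:
        if c in index:  # ignore concepts outside the supervision vocabulary
            target[index[c]] = 1.0
    return target

def onto_loss(y_true, y_pred, eps=1e-12):
    """Multi-label cross-entropy against ontology-derived multi-hot targets."""
    return -sum(t * math.log(p + eps) for t, p in zip(y_true, y_pred))

y_onto = to_multi_hot(["BrownSpots", "GraphiolaLeafSpot"])  # [1.0, 0.0, 1.0, 0.0]
good = onto_loss(y_onto, [0.9, 0.1, 0.8, 0.1])
bad = onto_loss(y_onto, [0.2, 0.1, 0.1, 0.1])
# Predictions that agree with the ontology targets incur a lower loss.
```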

3.4. Final Classification Head

The fused representation is passed to a dense classification layer to predict disease labels:
$\hat{y}_i = \mathrm{Softmax}(W_{cls} z_i^{fuse} + b_{cls})$ (5)
The model is trained with standard categorical cross-entropy loss:
$\mathcal{L}_{cls} = -\sum_{c=1}^{C} y_i[c] \log(\hat{y}_i[c])$ (6)
where $y_i$ is a one-hot encoded vector of disease class labels.

3.5. Joint Optimization

The final loss function, used throughout all experiments, balances prediction accuracy and semantic alignment and is defined as
$\mathcal{L}_{total} = \mathcal{L}_{cls} + \lambda \mathcal{L}_{onto}$ (7)
where λ is a hyperparameter regulating the impact of the ontology-guided loss. For clarity, $\mathcal{L}_{total}$ is defined once in Equation (7), used consistently across all experiments, and referenced in the training protocol described in Section 4.2. Table 2 shows the definitions of symbols.
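As a trivial but concrete illustration of the joint objective (with λ = 0.3 as used in the experiments; setting λ = 0 recovers the no-ontology ablation variant):

```python
def total_loss(l_cls, l_onto, lam=0.3):
    """Joint objective of Eq. (7): classification loss plus weighted ontology loss.
    lam = 0.3 matches the value reported in the training protocol; lam = 0
    disables ontology supervision, as in the ablation study."""
    return l_cls + lam * l_onto

# Example: a classification loss of 1.0 and ontology loss of 2.0
# combine into 1.0 + 0.3 * 2.0 = 1.6.
loss = total_loss(1.0, 2.0)
```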
Figure 4 illustrates the proposed DoST-DPD architecture, a multimodal deep learning framework designed to enhance the accuracy and interpretability of date palm disease detection in smart farming systems. The model processes RGB and thermal or multispectral images independently through patch embedding and dedicated Vision Transformer (ViT) encoders. The outputs from both streams are then fused using a cross-attention module, enabling the model to capture spatial and spectral correlations. A stage-aware attention module raises the awareness of disease progression and yields a unified representation that is applicable to multi-task predictions. Lastly, supervision under the ontology-guided AgriOnt-DP schema drives semantic constraints to make predictions coherent with agricultural knowledge, enhancing the reliability and interpretability of model outputs. This architecture allows the early detection of diseases and provides explainable and knowledge-consistent decision support for farmers and agronomists.
Figure 5 outlines the practical deployment of the DoST-DPD system in a real-world agricultural context. Multimodal data are collected via UAVs equipped with RGB and multispectral cameras and complemented by ground-based environmental sensors. This data is first preprocessed at the edge—close to the data source—to reduce latency and bandwidth. It is then transmitted securely to a centralized cloud server where the DoST-DPD deep learning model is hosted alongside the AgriOnt-DP ontology reasoning module. Model inference generates high-level outputs such as disease type, affected regions, and disease stage. These outputs are visualized in a farmer-friendly dashboard and optionally trigger automated alerts for early intervention. This architecture ensures scalable, interpretable, and actionable disease monitoring within precision agriculture systems.
Therefore, the DoST-DPD pipeline brings in the paradigm of transformer-based multimodal learning, ontology-driven supervision, and explainable AI to arrive at a powerful yet interpretable solution for the early and accurate detection of date palm diseases in smart farming ecosystems.

4. Experimental Setup

A comprehensive experimental setup was implemented for model evaluation, underpinned by various real-world datasets, training protocols, and baselines; this section describes the dataset configurations, data preprocessing, and training environment.

4.1. Datasets and Modalities

To evaluate the DoST-DPD framework at full scale, five publicly available datasets were carefully selected. These datasets were chosen for their relevance to date palm disease detection, their variety in image modality (e.g., RGB, NIR, or thermal) and field conditions, and the quality of their disease annotations.
A robust evaluation of plant disease models requires diverse datasets. Numerous open access image datasets have been used from controlled lab environments to real-world field conditions. The PlantVillage dataset (2015, Penn State) is one of the most widely used benchmarks [16]. It contains ~54,300 images of healthy and diseased leaves from 14 crop species (38 diseases) captured under uniform backgrounds [16]. PlantVillage’s curated, high-quality images enabled early deep learning breakthroughs, but it represents lab conditions (single leaves on plain backgrounds) rather than farm fields. As a result, models trained solely on PlantVillage often fail to generalize to field imagery—for example, a PlantVillage-trained CNN that achieved 99% accuracy in lab tests dropped to ~31% accuracy when tested on field images with complex backgrounds. This revealed the need for more realistic datasets. For all datasets, a unified 70% training, 15% validation, and 15% testing split was applied when official splits were not provided, ensuring consistent comparison across modalities and conditions. RGB–thermal/NIR pairs were aligned using geometric homography calibration and normalized per modality to reduce distribution shifts.
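The unified 70/15/15 split can be sketched as follows (a seeded shuffle is our assumption for reproducibility; the paper specifies the ratios but not the exact splitting procedure):

```python
import random

def split_dataset(samples, seed=42, ratios=(0.70, 0.15, 0.15)):
    """Reproducible train/val/test split applied when no official split exists.
    The seed of 42 matches the fixed seed reported in the training protocol."""
    rng = random.Random(seed)
    idx = list(range(len(samples)))
    rng.shuffle(idx)
    n_train = int(ratios[0] * len(samples))
    n_val = int(ratios[1] * len(samples))
    train = [samples[i] for i in idx[:n_train]]
    val = [samples[i] for i in idx[n_train:n_train + n_val]]
    test = [samples[i] for i in idx[n_train + n_val:]]
    return train, val, test

train, val, test = split_dataset(list(range(100)))
```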
To address the gap, PlantDoc (2020) was released as a field image dataset for plant disease detection as shown in Figure 6 [13]. It comprises ~2598 images across 13 species (17 disease classes) captured in natural scenes (leaves on the plant, varying backgrounds and lighting). PlantDoc is an object detection dataset (diseased leaves annotated with bounding boxes) reflecting in situ conditions. Models tested on PlantDoc have significantly lower accuracy than on PlantVillage, highlighting the challenge of field variability [13]. Some studies (e.g., Lu et al., 2021) use PlantVillage for training and PlantDoc for testing to evaluate cross-domain performance—often needing domain adaptation techniques to handle the distribution shift [28]. Beyond these, crop-specific datasets are widely used.
For date palm, the central focus of this work, recent efforts have produced dedicated datasets. Namoun et al. (2024) introduced the Date Palm Leaf Disease dataset, encompassing eight disease classes along with healthy leaves [16]. It includes images of date palm leaflets exhibiting symptoms of Black scorch, Fusarium wilt, Graphiola Leaf Spot, nutrient deficiencies (potassium, magnesium, and manganese), and Parlatoria scale (pest damage). The dataset comprises 3089 processed images (from 608 raw captured images) of affected date palm leaflets [16,37]. Images were obtained from 10 real farms in Saudi Arabia using smartphone cameras in natural outdoor environments. This open access palm dataset thus offers real-world symptoms on palm fronds at early and late stages to assist researchers in training and testing models for date palm pathology.
The Figshare dataset contains 832 RGB and thermal image pairs captured from 200 date palm trees in Sindh, Pakistan. Trees are labeled into four health categories: Healthy, Infected, Severely Infested, and Dead. Thermal images were captured with Seek IR devices, calibrated per session, and normalized using min–max temperature scaling to ensure cross-tree consistency [14].
The Mendeley Dubas Insect Leaf Dataset [15] consists of 3000 RGB images of date palm leaves at various stages of Dubas infestation, spanning four classes: Bug only, Honeydew only, Bug + Honeydew, and Healthy. The dataset additionally includes NIR reflectance as an auxiliary band; all images were aligned using UAV pose estimation metadata and normalized using per-band reflectance correction. The data were collected through aerial drone photography over farms in Iraq and thus reflect natural infestation patterns. This dataset introduces a unique pest-specific domain, allowing the model to learn subtle visual patterns that are typically difficult to capture with ground-based imaging; its inclusion fosters a complete evaluation of aerial monitoring-based pest detection capability. Across all datasets, class imbalance was addressed using weighted sampling during training, and summary statistics—including per-class counts—are reported to ensure transparent evaluation.
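Weighted sampling of the kind mentioned above typically derives per-class weights from inverse class frequency; a minimal sketch follows (this standard formula is our assumption, not necessarily the authors' exact scheme):

```python
from collections import Counter

# Hedged sketch: inverse-frequency class weights for weighted sampling,
# so that rare disease classes are drawn more often during training.

def class_weights(labels):
    """Return per-class sampling weights proportional to inverse frequency."""
    counts = Counter(labels)
    n = len(labels)
    return {c: n / (len(counts) * k) for c, k in counts.items()}

# Toy imbalanced label list: 90 healthy vs 10 infected samples.
w = class_weights(["Healthy"] * 90 + ["Infected"] * 10)
# The minority class receives a proportionally larger weight.
```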
The Kaggle Date Palm Dataset [16] provides 2631 RGB images collected under controlled laboratory conditions with uniform backgrounds, covering three classes: healthy, brown spot, and white scale. The dataset’s structured annotation and clear image quality make it a good choice for testing classification accuracy in constrained settings, providing a straightforward, regulated contrast to the more intricate field-acquired datasets. Overall, the five datasets were selected considering image modality, disease type, annotation style, and real capture conditions; together, they ensure that the proposed ontology-enhanced framework is thoroughly evaluated for robustness, generalization, and early detection capability. Table 3 presents an overview of the five datasets for DoST-DPD evaluation.

4.2. Training Protocol and Environment

The model was implemented using PyTorch 2.0 and trained on an NVIDIA RTX A6000 (32 GB RAM). Optimization was performed using the AdamW optimizer with an initial learning rate of 1 × 10⁻⁴, batch size of 32, and weight decay of 5 × 10⁻⁴. The learning rate followed a cosine decay schedule over 100 epochs. Data augmentation included random horizontal and vertical flipping, color jittering, and random spectral perturbation for multispectral images.
During training, the model was optimized using the joint loss function defined in Equation (7), which combines the categorical classification loss and the ontology-guided semantic loss. The loss weighting coefficient, λ, was empirically set to 0.3 across all experiments to balance predictive accuracy and semantic consistency. No additional loss terms were introduced beyond those defined in Equation (7), ensuring a unified and consistent optimization objective throughout training.
$\mathcal{L}_{total} = \mathcal{L}_{cls} + \lambda \mathcal{L}_{onto}$
where $\mathcal{L}_{cls}$ denotes the categorical cross-entropy loss for disease classification and $\mathcal{L}_{onto}$ denotes the ontology-guided semantic supervision loss, with λ = 0.3 as stated above.
To ensure reproducibility, all experiments were run with fixed random seeds (42), mixed-precision FP16 training, and gradient clipping with a maximum norm of 1.0. Each experiment was repeated three times, and the mean ± standard deviation of the evaluation metrics was reported.
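Global-norm gradient clipping with a maximum norm of 1.0, as stated in the protocol, can be sketched as follows (a flattened-gradient toy version for illustration, not the framework call actually used in training):

```python
import math

def clip_grad_norm(grads, max_norm=1.0):
    """Global-norm gradient clipping: rescale the gradient vector if its
    L2 norm exceeds max_norm (1.0 in the training protocol)."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return grads  # already within the allowed norm: leave unchanged
    scale = max_norm / norm
    return [g * scale for g in grads]

clipped = clip_grad_norm([3.0, 4.0])  # norm 5.0, rescaled down to norm 1.0
```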
The RGB stream employed a ViT-B/16 backbone pre-trained on ImageNet-21k, while the thermal/NIR stream used a ViT-Tiny configuration initialized using Xavier uniform initialization. All models were trained using early stopping based on validation loss with a patience of 12 epochs.

4.3. Ablation Study Protocol

For the ablation study, we trained the proposed DoST-DPD architecture with and without the ontology supervision module to isolate the effect of ontology-guided semantic supervision. In the without-ontology variant, we discarded the ontology head and set the ontology loss weight to zero (λ = 0), leaving all remaining modules unchanged (dual-stream transformer backbone, cross-attention fusion, stage-aware attention) and keeping the data splits, augmentation strategy, optimizer, learning rate schedule, and number of training epochs identical. This controlled comparison ensures that any performance gap is attributable to the ontology-guided supervision.

4.4. Ontology-Driven Semantic Integration and Supervision

To address the limitations of traditional deep learning systems in agricultural disease diagnosis—particularly their inability to incorporate expert knowledge or reason semantically—the DoST-DPD framework integrates dedicated ontologies aligned with each dataset. Each ontology encodes structured agricultural knowledge as classes, relationships, constraints, and taxonomies that describe disease symptoms, causal agents, imaging modalities, and phenological stages. Because the integration operates on a per-dataset basis, it allows symbolic supervision and biologically constrained learning, improving performance and interpretability beyond label-based training. All ontologies are interfaced with the training pipeline through a semantic reasoner module that interprets OWL files and processes SPARQL queries through an RDF4J backend.
Ontology supervision was applied only during training, with ontology-derived multi-label vectors computed offline and cached to ensure consistent supervision across epochs. The ontology loss weight, λ, was set to 0.3, and supervision was attached after multimodal fusion to align the shared embedding with semantic constraints.
For the PlantVillage dataset, which provides a clean laboratory benchmark of 38 crop disease classes, we employ the plantvillage_ontology.owl to inject semantic differentiation into diseases with similar visual features. The ontology’s structure highlights classes such as TomatoMosaicVirus, AppleScab, and LeafCurling, along with properties like hasSymptom and affectsCrop, as shown in Figure 7. For example, both TomatoMosaicVirus and TomatoYellowLeafCurlVirus present overlapping discoloration symptoms. RGB alone may confound them, but the ontology disambiguates them based on symptom location, pattern, and cause. The following SPARQL query uses nested constraints and filters to select only tomato viral diseases whose leaf symptoms involve a MosaicPattern but exclude Necrosis:
SELECT ?disease ?symptomPattern
WHERE {
 ?disease pv:affectsCrop pv:Tomato ;
    pv:causedBy pv:Virus ;
    pv:hasSymptom ?symptom .
 ?symptom pv:hasLocation pv:Leaf ;
    pv:hasPattern ?symptomPattern .
 FILTER(?symptomPattern = pv:MosaicPattern && ?symptom != pv:Necrosis)
}
This semantic filtering is converted into an auxiliary training signal that reduces false positives between visually similar viral classes. Without ontology supervision, the model relies solely on pixel intensities, which often fail to capture subtle visual context.
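One simple way such semantic filtering can act as a training-time signal is to down-weight ontology-excluded classes before renormalizing the prediction; the following is a hedged sketch with hypothetical class names and an assumed penalty factor, not the paper's implementation:

```python
def apply_semantic_mask(probs, allowed, penalty=0.1):
    """Down-weight class probabilities that the ontology rules exclude,
    then renormalize to a valid distribution. 'penalty' is illustrative."""
    masked = {c: (p if c in allowed else p * penalty) for c, p in probs.items()}
    z = sum(masked.values())
    return {c: v / z for c, v in masked.items()}

# Toy prediction over visually similar classes; the SPARQL result is assumed
# to have selected TomatoMosaicVirus as the only ontology-consistent class.
probs = {"TomatoMosaicVirus": 0.48, "TomatoYellowLeafCurlVirus": 0.47, "AppleScab": 0.05}
refined = apply_semantic_mask(probs, allowed={"TomatoMosaicVirus"})
# The ontology-consistent class now dominates the distribution.
```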
For the PlantDoc dataset, captured in complex field conditions, the plantdoc_full.owl ontology models environmental context and multi-class symptom overlap. The ontology defines classes such as LateBlight and PowderyMildew and environmental modifiers like LowLight and OverlappingLeaves. These are shown in Figure 8, which visualizes the contextual enrichment paths and how multi-symptom co-occurrence is modeled. A major challenge in field-based detection is co-infection and overlapping disease traits. To address this, we execute transitive multi-hop reasoning using property chains. The following advanced SPARQL query identifies all diseases that co-occur with LateBlight and share at least one common symptom property chain with contextual overlap:
SELECT DISTINCT ?coOccurringDisease ?commonSymptom
WHERE {
 pd:LateBlight pd:hasSymptom ?sym .
 ?coOccurringDisease pd:hasSymptom ?sym ;
      pd:hasEnvironment pd:FieldCondition .
 ?sym pd:hasContextualModifier ?mod .
 FILTER EXISTS {
   ?coOccurringDisease pd:hasSymptom ?s2 .
   ?s2 pd:hasContextualModifier ?mod .
 }
 FILTER(?coOccurringDisease != pd:LateBlight)
}
This query supports the dynamic expansion of symptom attention in ViT layers, ensuring the model attends to multiple symptoms that co-occur in natural scenes. Without ontology guidance, traditional networks trained on lab datasets often fail in such complex multi-label scenarios.
The Figshare dataset provides RGB and thermal images of pest-infested palm trees, including captures taken before infestation damage became visually apparent. The figshare.owl pest ontology supports enhanced pest diagnosis at stages where temperature changes are subtly manifested or delayed: spectral patterns are modeled as concepts such as EarlyThermalRise, and pest concepts such as RedPalmWeevil and DeadTree are related through properties like hasThermalPattern and hasSeverityStage. The structure of this ontology is visualized in Figure 9, which showcases how temperature-based symptom progression is hierarchically represented and semantically queried to enhance early detection in the thermal stream of DoST-DPD. Unlike RGB-only networks, DoST-DPD uses SPARQL-based thermal pattern logic to prioritize early-stage cues. The following query identifies pest classes that exhibit a thermal anomaly before frond damage becomes visible, using a conditional threshold over the hasThermalDelta datatype property:
SELECT ?pest ?symptom ?thermalDelta
WHERE {
 ?pest a fig:Pest ;
    fig:hasSymptom ?symptom ;
    fig:hasThermalPattern fig:EarlyThermalRise .
 ?symptom fig:hasAffectedPart fig:Frond ;
     fig:hasThermalDelta ?thermalDelta .
 FILTER(?thermalDelta > "2.5"^^xsd:float)
}
This enables the model to adaptively boost attention over regions showing >2.5 °C deviation, something a standard deep learning model cannot infer. Moreover, early-stage classification accuracy is significantly improved by conditioning transformer tokens on thermal priorities defined semantically.
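The threshold-driven attention boost can be sketched as follows (only the 2.5 °C threshold comes from the ontology rule above; the boost factor of 2.0 and the renormalization step are our illustrative assumptions):

```python
def boost_thermal_attention(base_attn, token_deltas, threshold=2.5, boost=2.0):
    """Up-weight attention for tokens whose thermal deviation exceeds the
    ontology-defined threshold (2.5 degrees C), then renormalize so the
    weights remain a valid attention distribution."""
    scaled = [a * boost if d > threshold else a
              for a, d in zip(base_attn, token_deltas)]
    z = sum(scaled)
    return [s / z for s in scaled]

# Four toy tokens with uniform attention; only the second exceeds 2.5 C.
attn = boost_thermal_attention([0.25, 0.25, 0.25, 0.25], [1.0, 3.1, 2.4, 0.5])
# The anomalous token now receives twice the relative attention mass.
```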
For the Mendeley RGB + NIR dataset, used for the drone-based detection of Dubas insect infestations, we utilized the mendely_ontology.owl, which captures aerial perspectives of pest infestations and spectral cues such as HighNIRAbsorption, Honeydew, and BugCluster. The ontology also links image context (e.g., capturedBy Drone) with reflectance-based symptoms. Its semantic schema is displayed in Figure 10, demonstrating how spectral anomalies are linked to insect life stages and anatomical zones, enabling spectral-aware training using ontology-derived queries. Ontology-free models lack a contextual understanding of where NIR anomalies originate. By contrast, the following graph pattern query associates anatomical location, pest species, and specific reflectance drop signatures:
SELECT ?insect ?symptom ?nirDropValue
WHERE {
 ?insect md:targetsSpecies md:DatePalm ;
    md:triggersSymptom ?symptom .
 ?symptom md:hasSpectralAnomaly md:NIRReflectanceDrop ;
     md:hasAnatomicalRegion md:UpperLeafSurface ;
     md:hasNIRDropValue ?nirDropValue .
 FILTER(?nirDropValue > "0.25"^^xsd:float)
}
These outputs refine NIR token embeddings by focusing ViT attention on upper leaf reflectance cues tied to Dubas infestation. Without this ontology-level mapping, drone-acquired NIR noise overwhelms visual learning.
Finally, for the Kaggle Date Palm dataset, the datepalm_ontology.owl models visual and stage-based variations in common diseases such as BrownSpot and WhiteScale. The ontology defines WhiteScale, BrownSpot, and related symptom morphology such as SurfacePustules and SpotShape, along with infection stages. Its visual structure is shown in Figure 11, highlighting class hierarchies and diagnostic rules that improve fine-grained classification when RGB symptoms are subtle or overlapping. While both diseases exhibit brown lesion-like structures in RGB, their biological causes differ. The following inferential SPARQL query retrieves only fungal diseases where the lesions appear in mature-stage samples and do not affect the leaf tip:
SELECT ?disease ?lesionShape
WHERE {
 ?disease dp:hasCause dp:Fungus ;
     dp:hasSymptom ?sym .
 ?sym dp:hasStage dp:MatureStage ;
    dp:hasLocation ?loc ;
    dp:hasShape ?lesionShape .
 FILTER(?loc != dp:LeafTip)
}
This high-resolution supervision prevents the model from confusing visually similar but etiologically distinct conditions, something RGB-only or class-label-based models routinely fail at.
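A minimal sketch of the plausibility check that this kind of supervision enables (the relation table is a toy stand-in for SPARQL query results over the ontology, and the disease and symptom names are illustrative):

```python
# Hedged sketch of the ontology consistency filter: predicted (disease,
# symptom) pairs are checked against allowed indicates-relations retrieved
# from the knowledge graph, so implausible combinations can be flagged.

INDICATES = {
    "GraphiolaLeafSpot": {"BrownSpots", "SurfacePustules"},
    "FusariumWilt": {"LeafWilt", "FrondYellowing"},
}

def is_plausible(disease, observed_symptoms):
    """True when at least one observed symptom is ontology-consistent
    with the predicted disease; unknown diseases have no allowed symptoms."""
    allowed = INDICATES.get(disease, set())
    return any(s in allowed for s in observed_symptoms)

ok = is_plausible("FusariumWilt", ["LeafWilt"])            # consistent pair
flagged = is_plausible("GraphiolaLeafSpot", ["LeafWilt"])  # implausible pair
```

Pairs failing this check can be penalized during training, which is how the model is discouraged from emitting disease-symptom combinations that contradict expert knowledge.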
In each of these cases, the resulting knowledge graphs are parsed into semantic feature vectors and encoded into the learning objective via an ontology-guided loss, $\mathcal{L}_{onto}$. The total loss for the DoST-DPD model is defined as
$\mathcal{L}_{total} = \mathcal{L}_{cls} + \beta_1 \mathcal{L}_{onto} + \beta_2 \mathcal{L}_{att}$
where $\mathcal{L}_{cls}$ is the categorical loss on ground truth labels, $\mathcal{L}_{onto}$ penalizes semantic inconsistency against ontology rules, and $\mathcal{L}_{att}$ guides attention maps using concept saliency derived from ontology reasoning. Without ontological integration, the model operates as a black box relying solely on pixel-level features and class supervision. With it, DoST-DPD can reason over multi-modal features, filter implausible outcomes, and offer biologically grounded interpretations, yielding a smarter, more generalizable plant disease diagnostic tool for real-world farming environments.

5. Experimental Results and Discussion

This section compares the proposed DoST-DPD framework with five state-of-the-art baselines: ResNet, ERCP-Net, PlantXViT, Multi-ViT, and Grad-CAM. The benchmarking rests on five datasets spanning RGB, thermal, and NIR modalities. The evaluation metrics include Accuracy, Precision, Recall, F1-Score, and AUC. All results are reported as the mean of three independent runs to ensure statistical reliability, and weighted sampling was applied during training to mitigate class imbalance across datasets.
Table 4 and Figure 12a summarize the performance of the models on the PlantVillage dataset, a controlled benchmark of RGB images acquired under ideal lab conditions. The clean backgrounds and clearly visible symptoms allow every model to achieve high accuracy. The proposed DoST-DPD with the ontology framework still surpasses all others by a clear margin, with Accuracy of 99.3%, Precision of 96.6%, Recall of 96.9%, F1-Score of 97.7%, and AUC of 98.2%. To explicitly assess the contribution of ontology-guided semantic supervision, an ablation variant of DoST-DPD without ontology integration was evaluated on the same dataset. This variant achieves lower performance, with an Accuracy of 95.5%, Precision of 94.3%, Recall of 95.6%, F1-Score of 95.1%, and AUC of 97.1%, confirming a consistent performance gap when semantic supervision is removed. Since both variants share identical architectures, training protocols, and optimization settings, this performance difference can be directly attributed to the ontology-guided loss term $\lambda \mathcal{L}_{onto}$ in the joint objective defined in Equation (7).
These results demonstrate that ontology integration contributes measurable performance gains even in noise-free laboratory settings, where purely visual cues are already highly discriminative. Notably, the gains persist despite the strong baseline performance, indicating that semantic constraints act as a regularizer rather than a dataset-specific heuristic. The observed improvements indicate that semantic constraints help regularize the learning process, reducing overfitting to visually dominant disease patterns and enhancing robustness in cases of highly similar leaf textures.
Among the baselines, PlantXViT is the second-best performing model, achieving strong Recall (96.8%) and AUC (96.7%), reinforcing the effectiveness of Vision Transformer architectures for high-resolution agricultural imagery. Multi-ViT also exhibits robust performance with balanced metrics and an F1-Score of 94.6%, highlighting the benefit of multi-view attention mechanisms in structured datasets. ERCP-Net achieves strong Precision and Accuracy but shows reduced Recall (92.4%), suggesting limitations in capturing faint or rare disease manifestations. ResNet and Grad-CAM remain moderately successful for conventional classification tasks, with F1-Scores of 90.9% and 91.1%, respectively. The radar chart in Figure 12a further accentuates these contrasts, with DoST-DPD (with ontology) nearly filling the radar space across all axes, while the ablation variant and baseline models exhibit smaller and more irregular profiles. This comparative analysis confirms that the superior performance of DoST-DPD is not solely attributable to the transformer backbone but is significantly enhanced by ontology-guided semantic supervision.
A performance comparison on the PlantDoc dataset is presented in Table 5 and Figure 12b, where the six models are evaluated under challenging real-world conditions characterized by occlusion, background clutter, and diverse crop diseases captured under uncontrolled lighting. Under these conditions, the proposed DoST-DPD framework achieves the highest performance across all evaluation metrics, with an Accuracy of 94.4%, Precision of 93.2%, Recall of 94.7%, F1-Score of 93.9%, and AUC of 95.6%. These results demonstrate the strong adaptability and generalizability of the ontology-enhanced dual-stream architecture in processing noisy field data and visually ambiguous symptoms. The improvements are particularly evident in Recall, indicating that DoST-DPD effectively identifies minority and visually subtle disease classes that are commonly misclassified in field imagery.
To quantify the impact of ontology-guided semantic supervision, an ablation variant of DoST-DPD without ontology integration was also evaluated under the same experimental protocol. This variant attains an Accuracy of 93.2%, Precision of 91.9%, Recall of 93.0%, F1-Score of 92.9%, and AUC of 94.0%, consistently underperforming the full model across all metrics. The performance gap, most visible in Recall and AUC, directly reflects the contribution of the ontology-based loss term L_onto in Equation (7), which enforces semantic consistency beyond visual similarity: semantic supervision improves sensitivity to subtle disease patterns and sharpens discrimination in visually complex field environments. This effect is amplified in field datasets, where background clutter and co-occurring symptoms weaken purely appearance-based learning.
Among the baseline models, Multi-ViT achieves the strongest performance, with an F1-Score of 92.2% and an AUC of 93.8%, owing to its multi-view transformer aggregation strategy. PlantXViT also performs competitively with an F1-Score of 90.9%, demonstrating the effectiveness of lightweight transformer-based architectures. In contrast, ERCP-Net and Grad-CAM yield only moderate results, with F1-Scores of 88.8% and 86.9%, respectively, as they rely primarily on convolutional feature extraction without semantic constraints. ResNet, as the simplest architecture in the comparison, achieves a reasonable F1-Score of 86.1% but struggles to distinguish diseases with similar visual textures. The radar chart in Figure 12b visually reinforces these observations: DoST-DPD with ontology exhibits a large and symmetrical footprint across all five metrics, whereas the ablation variant and baseline models display smaller and more irregular profiles, particularly for Recall and AUC. This clearly indicates that ontology-guided supervision, rather than multimodal fusion alone, is the primary driver of robustness in real-world conditions.
Table 6 and Figure 12c compare the six deep learning models on the Figshare RGB + Thermal dataset, highlighting the importance of multimodal integration and ontology-guided supervision for real-world pest detection. The proposed DoST-DPD framework achieves the highest performance across all five key metrics (90.5% Accuracy, 89.6% Precision, 90.7% Recall, 90.1% F1-Score, and 92.2% AUC), demonstrating its superior ability to leverage both visual and thermal cues. These improvements indicate that the thermal stream, when guided by ontology-derived cues about temperature-related symptom stages, plays a critical role in early detection before RGB features become visually discernible. Multi-ViT and PlantXViT deliver competitive F1-Scores of 87.7% and 86.5%, respectively, but are limited by the absence of semantic temperature priors. In this multimodal context, ResNet and Grad-CAM perform poorly (F1-Scores of 82.0% and 82.8%) due to their inability to capture thermal anomalies. ERCP-Net achieves moderate success with an F1-Score of 84.7% but remains outperformed by the transformer-based approaches. The radar chart in Figure 12c corroborates these observations, showing the superiority of DoST-DPD across all metrics. The consistency across Accuracy, Recall, and AUC suggests the model effectively captures the early thermal deviations crucial for timely pest intervention.
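The cross-modal fusion behind these gains can be illustrated with a stripped-down cross-attention step in which RGB patch tokens attend over thermal tokens. This is a minimal single-head sketch with random toy tokens; the learned query/key/value projections and the multi-head structure of the actual dual-stream model are omitted for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q_tokens, kv_tokens):
    """Queries (RGB tokens) attend over keys/values (thermal tokens).
    Shapes: q_tokens [n_q, d], kv_tokens [n_kv, d]."""
    d = q_tokens.shape[-1]
    scores = q_tokens @ kv_tokens.T / np.sqrt(d)   # [n_q, n_kv] similarity
    attn = softmax(scores, axis=-1)                # each query row sums to 1
    return attn @ kv_tokens                        # fused embedding, [n_q, d]

rng = np.random.default_rng(0)
rgb = rng.normal(size=(4, 8))      # 4 RGB patch tokens, dim 8
thermal = rng.normal(size=(6, 8))  # 6 thermal patch tokens, dim 8
fused = cross_attention(rgb, thermal)  # shape (4, 8)
```

Each fused RGB token is thus a convex combination of thermal tokens, which is how a thermal anomaly can influence the representation of a visually unremarkable RGB patch.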
Table 7 and Figure 12d highlight the comparative performance of deep learning models on the Mendeley RGB + NIR dataset, which incorporates both visual and near-infrared channels to detect spectral indicators of pest activity and physiological stress in date palms. The proposed DoST-DPD framework leads by a substantial margin across all evaluation criteria, attaining an Accuracy of 92.0%, Precision of 91.1%, Recall of 91.7%, F1-Score of 91.4%, and AUC of 93.0%. These results confirm the advantage of combining NIR spectral cues with semantic supervision, enabling the model to detect chlorophyll-level variations associated with early-stage infestations.
Multi-ViT is the next best performing model, achieving an F1-Score of 88.8% and an AUC of 90.4%, aided by its ensemble transformer design but lacking the symbolic reasoning present in DoST-DPD. PlantXViT also performs well (F1-Score of 87.5%), suggesting that transformer-based models generalize effectively to structured spectral features. Meanwhile, ERCP-Net maintains mid-level performance with balanced but lower scores, such as an F1-Score of 85.5%, constrained by its single-modal convolutional design. DoST-DPD offers the most uniformly distributed metric values, indicating a robust fusion of spectral and spatial information under ontology constraints.
Conventional approaches such as ResNet and Grad-CAM achieve lower performance, with F1-Scores of 82.8% and 83.8% and AUCs of 85.0% and 86.0%, respectively, notably below the proposed method, indicating their limited capability to exploit spectral information or interpret spatially complex NIR variations. The radar chart in Figure 12d visually reinforces this performance hierarchy: DoST-DPD forms a well-rounded, dominant profile, while the traditional models occupy more compressed and uneven radar regions.
Table 8 and Figure 12e summarize the performance of various models on the Kaggle Date Palm Dataset. The proposed DoST-DPD framework clearly outperforms all competing methods, achieving 96.2% Accuracy, 96.1% Precision, 95.7% Recall, 93.4% F1-Score and 95.9% AUC. These results demonstrate DoST-DPD’s ability to perform fine-grained disease differentiation even when symptoms are visually similar, benefiting from ontology-based constraints that reduce label confusion.
Multi-ViT follows with an F1-Score of 89.5% and AUC of 91.0%. PlantXViT performs competitively with an F1-Score of 88.2%. ERCP-Net and Grad-CAM deliver moderate performance (86.3% and 85.0%). ResNet yields the lowest results (84.5% F1-Score). The radar chart clearly shows DoST-DPD occupying the broadest region, indicating the most balanced trade-off between sensitivity and specificity.
Figure 13a depicts a confusion matrix showing the classification performance of DoST-DPD on the PlantDoc dataset. High diagonal values indicate strong accuracy despite noisy backgrounds and occlusions. Figure 13b presents the performance on the PlantVillage dataset, where most diseases exhibit high separability. Some visually similar classes (e.g., Early Blight vs. Septoria Leaf Spot) show mild confusion, indicating subtle intra-class overlaps. Overall, confusion matrix interpretation confirms that DoST-DPD not only improves average metrics but also reduces systematic errors across challenging disease categories.
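The normalized matrices in Figure 13 can be reproduced structurally. The sketch below row-normalizes raw counts so that the diagonal reads as per-class recall; the toy labels are illustrative, not taken from the PlantDoc or PlantVillage evaluations.

```python
import numpy as np

def normalized_confusion_matrix(y_true, y_pred, n_classes):
    """Row-normalized confusion matrix: entry [i, j] is the fraction of
    true-class-i samples predicted as class j (diagonal = per-class recall)."""
    cm = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    row_sums = cm.sum(axis=1, keepdims=True)
    return cm / np.maximum(row_sums, 1)  # avoid division by zero for empty classes

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm = normalized_confusion_matrix(y_true, y_pred, 3)
# diagonal holds per-class recall: [0.5, 1.0, 0.5]
```

Off-diagonal mass between visually similar classes (such as Early Blight vs. Septoria Leaf Spot) is exactly the "mild confusion" the figure discussion refers to.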
The experimental results across five heterogeneous datasets—covering RGB, thermal, and NIR modalities—consistently validate the superiority of the proposed DoST-DPD framework. By integrating ontology-guided semantic supervision with a dual-stream transformer architecture, DoST-DPD achieves significant gains across all metrics. These improvements are pronounced in complex multimodal datasets such as Figshare and Mendeley, where semantic alignment and spectral reasoning are crucial for accurate early disease detection. The balanced radar plots demonstrate that DoST-DPD achieves strong performance across all criteria without suffering from the precision–recall trade-offs seen in baseline models. Overall, these findings highlight the importance of combining domain knowledge, symbolic reasoning, and multimodal fusion in smart agricultural diagnostics.

6. Conclusions

This paper proposed DoST-DPD, an ontology-enhanced dual-stream transformer for early and accurate plant disease detection in smart farming. Unlike approaches that rely on visual features alone, DoST-DPD combines semantic reasoning with multimodal visual analysis by integrating dataset-specific OWL ontologies into the training and inference pipeline. These ontologies provide a symbolic representation of domain-specific attributes, such as disease symptoms, environmental stressors, and phenotypic variations, creating a semantic layer that guides attention and improves model generalization. DoST-DPD outperformed competitive baselines, including ResNet, Grad-CAM, ERCP-Net, PlantXViT, and Multi-ViT, on five benchmark datasets spanning RGB, thermal, and NIR modalities, achieving state-of-the-art performance in both controlled and real-world scenarios, with Accuracy as high as 99.3% and an AUC of 98.2%. Radar and bar chart comparisons further confirmed its balanced superiority across all evaluation metrics. Beyond raw performance, a major contribution of this work lies in its ontology-driven modularity: each dataset is paired with a bespoke OWL ontology and enriched semantically via SPARQL queries, dispensing with the need for a central knowledge graph while providing dataset-specific interpretability. This makes the framework highly explainable, scalable, and adaptable to new crops, modalities, or diseases. In conclusion, DoST-DPD marks an important step toward trustworthy, intelligent, and scalable plant disease diagnosis systems.
It bridges deep learning and symbolic knowledge representation, paving the way for future semantic-aware, multimodal AI frameworks for precision agriculture and autonomous crop health monitoring in smart farming ecosystems.

Author Contributions

Conceptualization, N.E.G. and H.M.; methodology, A.M.F.; software, N.E.G.; validation, H.M.; formal analysis, H.M.; investigation, N.E.G.; resources, A.M.F.; data curation, N.E.G.; writing—original draft preparation, H.M.; writing—review and editing, N.E.G. and E.A.M.; visualization, A.M.F. and E.A.M.; supervision, N.E.G.; project administration, H.M.; funding acquisition, N.E.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Prince Sattam Bin Abdulaziz University through project number (2024/01/31489).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this research are available on request from the corresponding author. The PlantDoc dataset is publicly available at https://doi.org/10.1145/3371158.3371196.

Acknowledgments

The authors extend their appreciation to Prince Sattam bin Abdulaziz University for funding this research work through project number (2024/01/31489).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. El-Mously, H.I. Date Palm Plantations: A Future Sustainable Support to Forests. Int. J. Fam. Stud. Food Sci. Nutr. Health 2022, 3, 153–164.
  2. Ahmed, M.; Elkaoud, N.; Arif, E.; ElSharabasy, S. Study the Physical and Mechanical Properties Affecting on Date Palm Tree Mechanical Serves. J. Soil Sci. Agric. Eng. 2021, 12, 357–362.
  3. Magsi, A.; Mahar, J.A.; Maitlo, A.; Ahmad, M.; Razzaq, M.A.; Bhuiyan, M.A.S.; Yew, T.J. A New Hybrid Algorithm for Intelligent Detection of Sudden Decline Syndrome of Date Palm Disease. Sci. Rep. 2023, 13, 15381.
  4. Liu, J.; Wang, X. Plant Diseases and Pests Detection Based on Deep Learning: A Review. Plant Methods 2021, 17, 22.
  5. Li, G.; Wang, Y.; Zhao, Q.; Yuan, P.; Chang, B. PMVT: A Lightweight Vision Transformer for Plant Disease Identification on Mobile Devices. Front. Plant Sci. 2023, 14, 1256773.
  6. Mahareek, E.A.; Cifci, M.A.; Desuky, A.S. Integrating Convolutional, Transformer, and Graph Neural Networks for Precision Agriculture and Food Security. AgriEngineering 2025, 7, 353.
  7. Thakur, P.S.; Khanna, P.; Sheorey, T.; Ojha, A. Explainable Vision Transformer Enabled Convolutional Neural Network for Plant Disease Identification: PlantXViT. arXiv 2022, arXiv:2207.07919.
  8. Baek, E.-T. Attention Score-Based Multi-Vision Transformer Technique for Plant Disease Classification. Sensors 2025, 25, 270.
  9. Thangaraj, R.; Anandamurugan, S.; Kaliappan, V.K. Automated Tomato Leaf Disease Classification Using Transfer Learning-Based Deep Convolution Neural Network. J. Plant Dis. Prot. 2021, 128, 73–86.
  10. Rippa, M.; Pasqualini, A.; Curcio, R.; Mormile, P.; Pane, C. Active vs. Passive Thermal Imaging for Helping the Early Detection of Soil-Borne Rot Diseases on Wild Rocket [Diplotaxis tenuifolia (L.) D.C.]. Plants 2023, 12, 1615.
  11. Chhetri, T.R.; Hohenegger, A.; Fensel, A.; Kasali, M.A.; Adekunle, A.A. Towards Improving Prediction Accuracy and User-Level Explainability Using Deep Learning and Knowledge Graphs: A Study on Cassava Disease. Expert Syst. Appl. 2023, 233, 120955.
  12. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26–30 June 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 770–778.
  13. Singh, D.; Jain, N.; Jain, P.; Kayal, P.; Kumawat, S.; Batra, N. PlantDoc: A Dataset for Visual Plant Disease Detection. In Proceedings of the 7th ACM IKDD CoDS and 25th COMAD, Hyderabad, India, 5–7 January 2020; ACM: New York, NY, USA, 2020; pp. 249–253.
  14. Mehmood, A.; Nadeem, A.; Ashraf, M.; Rizwan, K. A Dataset of Date Palm Trees’ Thermal and RGB Images for Pest Management; figshare Dataset; Figshare: London, UK, 2024.
  15. Al-Mahmood, A.M.; Shahadi, H.I.; Hasoon Khayeat, A.R. Image Dataset of Infected Date Palm Leaves by Dubas Insects. Data Brief 2023, 49, 109371.
  16. Namoun, A.; Alkhodre, A.B.; Sen, A.A.A.; Alsaawy, Y.; Almoamari, H. Dataset of Infected Date Palm Leaves for Palm Tree Disease Detection and Classification. Data Brief 2024, 57, 110933.
  17. Ma, X.; Chen, W.; Xu, Y. ERCP-Net: A Channel Extension Residual Structure and Adaptive Channel Attention Mechanism for Plant Leaf Disease Classification Network. Sci. Rep. 2024, 14, 4221.
  18. Hemalatha, S.; Jayachandran, J.J.B. A Multitask Learning-Based Vision Transformer for Plant Disease Localization and Classification. Int. J. Comput. Intell. Syst. 2024, 17, 188.
  19. Thakur, P.S.; Chaturvedi, S.; Khanna, P.; Sheorey, T.; Ojha, A. Vision Transformer Meets Convolutional Neural Network for Plant Disease Classification. Ecol. Inform. 2023, 77, 102245.
  20. Thai, H.-T.; Le, K.-H.; Nguyen, N.L.-T. FormerLeaf: An Efficient Vision Transformer for Cassava Leaf Disease Detection. Comput. Electron. Agric. 2023, 204, 107518.
  21. Zhu, D.; Tan, J.; Wu, C.; Yung, K.; Ip, A.W.H. Crop Disease Identification by Fusing Multiscale Convolution and Vision Transformer. Sensors 2023, 23, 6015.
  22. Li, X.; Li, S. Transformer Help CNN See Better: A Lightweight Hybrid Apple Disease Identification Model Based on Transformers. Agriculture 2022, 12, 884.
  23. Ferentinos, K.P. Deep Learning Models for Plant Disease Detection and Diagnosis. Comput. Electron. Agric. 2018, 145, 311–318.
  24. De Silva, M.; Brown, D. Multispectral Plant Disease Detection with Vision Transformer–Convolutional Neural Network Hybrid Approaches. Sensors 2023, 23, 8531.
  25. Dey, B.; Ahmed, R. A Comprehensive Review of AI-Driven Plant Stress Monitoring and Embedded Sensor Technology: Agriculture 5.0. J. Ind. Inf. Integr. 2025, 47, 100931.
  26. Jearanaiwongkul, W.; Anutariya, C.; Racharak, T.; Andres, F. An Ontology-Based Expert System for Rice Disease Identification and Control Recommendation. Appl. Sci. 2021, 11, 10450.
  27. Amara, J.; Samuel, S.; König-Ries, B. Integrating Domain Knowledge for Enhanced Concept Model Explainability in Plant Disease Classification. In Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2024; pp. 289–306.
  28. Liu, Y. DKG-PIPD: A Novel Method About Building Deep Knowledge Graph. IEEE Access 2021, 9, 137295–137308.
  29. Zhang, A.; Liu, W. Plant Disease Detection Using an Innovative Swin-Axial Transformer. IEEE Access 2025, 13, 111938–111952.
  30. Song, H.; Gao, Y. Plant Diseases Recognition on Digital Images Using Swin Transformer. In Proceedings of the 2022 11th International Conference on Computing and Pattern Recognition, New York, NY, USA, 4–6 November 2022; ACM: New York, NY, USA, 2022; pp. 219–223.
  31. Alharbi, A.F.; Aslam, M.A.; Asiry, K.A.; Aljohani, N.R.; Glikman, Y. An Ontology-Based Agriculture Decision-Support System with an Evidence-Based Explanation Model. Smart Agric. Technol. 2024, 9, 100659.
  32. Ahmed, I.; Yadav, P.K. Ontology-Based Classification Method Using Statistical and Symbolic Approaches for Plant Diseases Detection in Agriculture; SSRN: Rochester, NY, USA, 2023.
  33. Nagarathna, P.; Lateef Haroon, P.S.A.; Srinivasalu, G.; Venkata, A.K.P. Intelligent Plant Disease Detection and Classification Using HV-GNN with Ontology-Driven Knowledge Embedding. ITM Web Conf. 2025, 79, 01047.
  34. Kurhe, P.; Dashore, P. Design of an Improved Model Using Neuro-Symbolic Encoding and Federated Meta-Adaptation for Plant Disease Detection and Explanation Process. EPJ Web Conf. 2025, 328, 01073.
  35. Hernández, I.; Gutiérrez, S.; Tardaguila, J. Image Analysis with Deep Learning for Early Detection of Downy Mildew in Grapevine. Sci. Hortic. 2024, 331, 113155.
  36. Moupojou, E.; Tagne, A.; Retraint, F.; Tadonkemwa, A.; Wilfried, D.; Tapamo, H.; Nkenlifack, M. FieldPlant: A Dataset of Field Plant Images for Plant Disease Detection and Classification with Deep Learning. IEEE Access 2023, 11, 35398–35410.
  37. Al Hassan, A.N.; Ashraf, M.; Mehmood, A.; Rizwan, K.; Siddiqui, M.S. Dataset of Date Palm Tree Thermal Images and Their Classification Based on Red Palm Weevil Infestation. Front. Agron. 2025, 7, 1604188.
Figure 1. Motivation diagram for DoST-DPD: traditional AI vs. ontology-enhanced multimodal detection.
Figure 2. Ontology snapshot for Graphiola Leaf Spot in AgriOnt-DP.
Figure 3. Sample images from plant disease datasets. These eight examples (a–h) are leaves from the PlantVillage dataset, each showing a distinct disease: (a) apple scab, (b) grape black rot, (c) peach bacterial spot, (d) potato early blight, (e) squash powdery mildew, (f) strawberry leaf scorch, (g) tomato leaf mold, and (h) tomato mosaic virus. PlantVillage images are captured in lab-like conditions with clear views of single leaves.
Figure 4. Flowchart of the proposed DoST-DPD framework for early detection of date palm diseases.
Figure 5. System architecture for smart farming deployment of DoST-DPD.
Figure 6. PlantDoc dataset samples.
Figure 7. PlantVillage dataset ontology visualization.
Figure 8. PlantDoc dataset ontology visualization.
Figure 9. Figshare dataset ontology visualization.
Figure 10. Mendeley RGB + NIR dataset ontology visualization.
Figure 11. Kaggle Date Palm dataset ontology visualization.
Figure 12. Radar chart comparison of model performance on all datasets. (a) PlantVillage; (b) PlantDoc; (c) Figshare RGB + Thermal; (d) Mendeley RGB + NIR; (e) Kaggle Date Palm.
Figure 13. Normalized confusion matrices for DoST-DPD on the PlantDoc and PlantVillage datasets. (a) PlantDoc dataset; (b) PlantVillage dataset.
Table 1. Comparison of benchmark models for plant disease detection (✓ denotes the presence of a feature or capability).

Model | Modality Used | Attention Mechanism | Explainability | Generalization (Cross-Dataset)
ResNet | RGB images | – (none) | – (post hoc only) | –
ERCP-Net | RGB images | ✓ Channel attention (adaptive) | – | – (tested on single benchmark)
PlantXViT | RGB images | ✓ Hybrid CNN–ViT (self-attention) | ✓ (Grad-CAM used) | ✓ (evaluated on multiple datasets)
Multi-ViT | RGB images (multiple views) | ✓ Transformer ensemble | – | ✓ (robust across different crops)
Swin Transformer | RGB images, optionally multimodal | ✓ Hierarchical shifted-window self-attention | ✓ (attention maps available) | ✓ (tested across diverse field conditions)
Table 2. Symbol definitions.

Symbol | Description
x_i^rgb, x_i^th, x_i^nir | Input RGB, thermal, and NIR images
z_i^rgb, z_i^mod | Feature embeddings from ViTs
z_i^fuse | Cross-attention-based fused embedding
y_i, ŷ_i | True and predicted disease class labels
y_i^onto, ŷ_i^onto | Ontology labels and predictions
L_cls, L_onto | Classification and ontology loss
λ | Loss balancing hyperparameter
ŷ_i^onto | Predicted ontology concept vector
Table 3. Overview of selected datasets for DoST-DPD evaluation.

Dataset | #Images | #Classes | Modalities | Capture Environment | Annotation Type | Source
PlantVillage | 54,305 | 38 (14 crops × 26 diseases + healthy classes) | RGB | Lab | Image-level (per crop/disease) | Kaggle
PlantDoc | 2598 | 29 (Healthy + 28 crop diseases) | RGB | Field | Image-level + some bounding boxes | GitHub
Figshare RGB + Thermal | 832 | 4 (Healthy, Infected, Severely Infested, Dead) | RGB + Thermal | Field | Tree-level categories | Figshare
Mendeley RGB + NIR | 3089 | 9 (8 diseases + Healthy) | RGB + NIR | Field (augmented) | Image-level labels | Mendeley
Kaggle Date Palm | 2631 | 3 (Healthy, Brown Spot, White Scale) | RGB | Lab | Folder-based image-level | Kaggle
Table 4. Performance comparison on the PlantVillage dataset.

Model | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) | AUC (%)
ResNet | 91.4 | 90.8 | 91.0 | 90.9 | 92.6
ERCP-Net | 94.6 | 94.7 | 92.4 | 92.1 | 93.4
PlantXViT | 96.0 | 95.5 | 96.8 | 93.6 | 96.7
Multi-ViT | 95.2 | 94.4 | 94.9 | 94.6 | 95.9
Grad-CAM | 91.8 | 91.0 | 91.2 | 91.1 | 92.9
DoST-DPD (without ontology) | 95.5 | 94.3 | 95.6 | 95.1 | 97.1
DoST-DPD (with ontology) | 99.3 | 96.6 | 96.9 | 97.7 | 98.2
Table 5. Performance comparison on the PlantDoc dataset.

Model | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) | AUC (%)
ResNet | 87.2 | 85.9 | 86.3 | 86.1 | 89.4
ERCP-Net | 89.5 | 88.7 | 88.9 | 88.8 | 90.6
PlantXViT | 91.3 | 90.8 | 91.1 | 90.9 | 92.5
Multi-ViT | 92.7 | 91.5 | 93.0 | 92.2 | 93.8
Grad-CAM | 88.1 | 86.5 | 87.4 | 86.9 | 89.1
DoST-DPD (without ontology) | 93.2 | 91.9 | 93.0 | 92.9 | 94.0
DoST-DPD (with ontology) | 94.4 | 93.2 | 94.7 | 93.9 | 95.6
Table 6. Performance comparison on the Figshare RGB + Thermal dataset.

Model | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) | AUC (%)
ResNet | 83.4 | 82.2 | 81.9 | 82.0 | 85.1
PlantXViT | 87.2 | 86.4 | 86.6 | 86.5 | 88.3
Grad-CAM | 84.0 | 82.5 | 83.2 | 82.8 | 85.4
Multi-ViT | 88.3 | 87.5 | 87.9 | 87.7 | 89.6
ERCP-Net | 85.9 | 84.3 | 85.1 | 84.7 | 86.7
DoST-DPD (with ontology) | 90.5 | 89.6 | 90.7 | 90.1 | 92.2
Table 7. Performance comparison on the Mendeley RGB + NIR dataset.

Model | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) | AUC (%)
ResNet | 84.1 | 83.0 | 82.7 | 82.8 | 85.0
ERCP-Net | 86.5 | 85.3 | 85.7 | 85.5 | 87.1
PlantXViT | 88.3 | 87.0 | 88.1 | 87.5 | 89.3
Multi-ViT | 89.5 | 88.4 | 89.2 | 88.8 | 90.4
Grad-CAM | 85.0 | 83.6 | 84.1 | 83.8 | 86.0
DoST-DPD (with ontology) | 92.0 | 91.1 | 91.7 | 91.4 | 93.0
Table 8. Performance comparison on the Kaggle Date Palm dataset.

Model | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) | AUC (%)
ResNet | 85.6 | 84.3 | 84.7 | 84.5 | 86.8
ERCP-Net | 87.2 | 86.1 | 86.5 | 86.3 | 88.0
PlantXViT | 89.0 | 88.0 | 88.5 | 88.2 | 89.9
Multi-ViT | 90.2 | 89.3 | 89.7 | 89.5 | 91.0
Grad-CAM | 86.0 | 84.8 | 85.2 | 85.0 | 87.1
DoST-DPD (with ontology) | 96.2 | 96.1 | 95.7 | 93.4 | 95.9