Review

Challenges and Opportunities in Tomato Leaf Disease Detection with Limited and Multimodal Data: A Review

1 Faculty of Applied Sciences, Macao Polytechnic University, Macau SAR 999078, China
2 School of Computer Science and Engineering, Faculty of Innovation Engineering, Macau University of Science and Technology, Macau SAR 999078, China
3 College of Artificial Intelligence, Zhongkai University of Agriculture and Engineering, Guangzhou 510225, China
* Author to whom correspondence should be addressed.
Mathematics 2026, 14(3), 422; https://doi.org/10.3390/math14030422
Submission received: 11 December 2025 / Revised: 16 January 2026 / Accepted: 20 January 2026 / Published: 26 January 2026
(This article belongs to the Special Issue Structural Networks for Image Application)

Abstract

Tomato leaf diseases cause substantial yield and quality losses worldwide, yet reliable detection in real fields remains challenging. Two practical bottlenecks dominate current research: (i) limited data, including small samples for rare diseases, class imbalance, and noisy field images, and (ii) multimodal heterogeneity, where RGB images, textual symptom descriptions, spectral cues, and optional molecular assays provide complementary but hard-to-align evidence. This review summarizes recent advances in tomato leaf disease detection under these constraints. We first formalize the problem settings of limited and multimodal data and analyze their impacts on model generalization. We then survey representative solutions for limited data (transfer learning, data augmentation, few-/zero-shot learning, self-supervised learning, and knowledge distillation) and multimodal fusion (feature-, decision-, and hybrid-level strategies, with attention-based alignment). Typical model–dataset pairs are compared, with emphasis on cross-domain robustness and deployment cost. Finally, we outline open challenges—including weak generalization in complex field environments, limited interpretability of multimodal models, and the absence of unified multimodal benchmarks—and discuss future opportunities toward lightweight, edge-ready, and scalable multimodal systems for precision agriculture.

1. Introduction

1.1. Background and Significance

Tomatoes (Solanum lycopersicum L.) are globally recognized as one of the most economically vital vegetable crops, with cultivation spanning temperate, subtropical, and tropical regions. As a staple component of human diets, tomatoes provide essential nutrients—including vitamins C, A, and K, dietary fiber, and antioxidants like lycopene—making them indispensable for addressing global food security and public health challenges [1,2,3]. From an agricultural economic perspective, global tomato production reached approximately 186.82 million tons in recent years [4], with Asia contributing 61.1% of total output [5]; China, as the world’s largest tomato producer, accounts for one-third of global production, underscoring the crop’s critical role in national agricultural economies [6].
However, as shown in Table 1, tomato production faces persistent threats from leaf diseases, which cause severe yield and quality losses annually. Typical tomato leaf diseases include bacterial spot (caused by Xanthomonas euvesicatoria [7]), early blight (caused by Alternaria solani [8]), late blight (caused by Phytophthora infestans [9]), tomato yellow leaf curl virus (TYLCV) [10], leaf mold (caused by Passalora fulva [11]), and Septoria leaf spot (caused by Septoria lycopersici [12]). Infected plants typically show chlorosis, upward leaf curling, stunting, and reduced fruit set, and severe outbreaks may lead to near-complete crop failure [13,14]. The persistent spread of TYLCV across tropical, subtropical, and increasingly temperate regions highlights the need for integrated management combining resistant cultivars, vector control, and early monitoring [15,16].
Typical visual symptoms of major tomato leaf diseases are summarized in Table 2. Traditional tomato leaf disease detection relies on manual visual inspection by farmers or agronomists, a process constrained by inherent limitations: it is labor-intensive, time-consuming, and prone to subjective errors—especially for early-stage diseases with subtle symptoms or for visually similar diseases, such as early blight and late blight, which must be distinguished by lesion morphology [17]. With the expansion of large-scale, intensive tomato cultivation, this manual approach has become increasingly incompatible with the demands of real-time, large-area disease monitoring [18]. In contrast, artificial intelligence (AI)-driven detection methods—particularly deep learning (DL)-based approaches—offer transformative solutions: they enable high-precision [19], real-time disease identification, support targeted interventions such as site-specific pesticide application [20], and facilitate the integration of disease detection into precision agriculture frameworks [21,22,23].
The practical deployment of AI-based tomato leaf disease detection, however, is hindered by two critical bottlenecks in agricultural scenarios: limited data and complex multimodal information [24,25]. Limited data arises from scarce samples of rare diseases such as tomato leaf miner in the Tomato-Village dataset [26], high annotation costs (requiring agronomists to label disease types, lesion locations, and severity) [27], and low-quality field samples (blurred images, weed occlusion, or inconsistent lighting) [28,29,30]. Multimodal information—encompassing images (RGB, infrared), text (disease symptom descriptions), and viral molecular data (genome sequences, betasatellite proteins)—offers complementary insights for detection but introduces challenges of data heterogeneity and cross-modal alignment [31,32,33,34]. Addressing these bottlenecks is essential to unlock the full potential of AI in protecting tomato yields and advancing sustainable agriculture. Beyond their direct economic value, tomatoes also serve as a model crop for studying plant–pathogen interactions and evaluating digital agriculture technologies. Many tomato-growing regions are experiencing rapid structural changes, including the expansion of protected cultivation (greenhouses, plastic tunnels), the adoption of fertigation and hydroponic systems, and the intensification of production in peri-urban areas. These transitions can amplify the impact of foliar diseases: high planting density and elevated humidity inside greenhouses, for example, create favorable microclimates for fungal pathogens, while continuous cropping increases the inoculum load in soil and crop residues [35,36]. At the same time, labor shortages and aging farmer populations in many countries reduce the feasibility of relying solely on expert visual inspection for disease management.
Climate variability adds further complexity. Changes in temperature and precipitation regimes can shift the geographical distribution and seasonality of tomato pathogens and their insect vectors. For instance, warmer winters in temperate regions facilitate overwintering of whitefly populations, thereby increasing TYLCV pressure in subsequent seasons [1,16]. Extreme weather events such as heat waves, heavy rainfall, and storms may directly damage plants, indirectly alter disease susceptibility, and complicate image-based diagnosis (e.g., leaf scorching or mechanical injury mimicking disease symptoms). From a systems perspective, tomato disease detection must therefore be robust not only to static spatial heterogeneity but also to dynamic temporal and climatic shifts.
In parallel, there is growing interest in integrating tomato disease detection with broader precision agriculture workflows. Modern production systems increasingly rely on Internet-of-Things (IoT) devices (soil and canopy sensors, weather stations), edge computing platforms, and farm-management information systems. Disease detection models are expected to interact with these components by, for example, providing early warnings that trigger targeted scouting, adjusting irrigation or ventilation schedules to reduce disease risk, or informing variable-rate pesticide applications. Such integration imposes additional constraints on model design: detectors must be lightweight enough to run on embedded hardware, interoperable with existing data infrastructure, and interpretable to agronomists and growers who must ultimately trust and act on the recommendations [37].
It is also important to recognize the diversity of tomato production contexts. Large commercial farms may possess high-resolution cameras, drones, and greenhouse monitoring systems, whereas smallholder farmers often depend on low-cost smartphones and basic extension services. Consequently, the available data modalities, image quality, and annotation resources vary widely across regions and production scales. A review focusing solely on high-end sensing platforms risks overlooking the constraints faced by smallholder-dominated systems, where limited data and noisy images captured in suboptimal conditions are the norm rather than the exception. In this review, we therefore pay particular attention to methods that can operate effectively with modest computational resources and imperfect data.
Finally, tomato leaf diseases constitute a natural testbed for broader methodological questions in agricultural AI. Many of the challenges encountered here—class imbalance between frequent and rare diseases, domain shift between laboratory and field images, multimodal data integration, and the need for trustworthy, human-understandable predictions—also arise in other crops and stressors (diseases, pests, nutrient deficiencies, abiotic stresses). Insights gained from tomato-specific studies are thus likely to generalize to wider plant health monitoring scenarios. Conversely, advances in generic computer vision, multimodal learning, and limited-data learning can often be adapted to tomato disease detection with relatively minor modifications. One goal of this review is to bridge these two perspectives by situating tomato-specific contributions within the broader methodological landscape.

1.2. Definition of Core Concepts

To establish a consistent analytical framework, this section defines key concepts related to limited and multimodal data in tomato leaf disease detection, drawing on definitions and classifications from the referenced literature.

1.2.1. Limited Data in Tomato Leaf Disease Detection

Limited data refers to scenarios where the quantity, quality, or distribution of disease-related data fails to meet the requirements for training robust DL models. It encompasses three interrelated challenges [38]:
Small sample size: Scarcity of labeled samples for emerging, rare, or region-specific tomato leaf diseases. For instance, the Tomato-Village dataset—used for testing model generalization—contains only 517 samples of tomato spotted wilt and 1024 samples of leaf miner, reflecting the rarity of these diseases in conventional cultivation regions [26]. Additionally, professional annotation of disease samples is resource-intensive: annotating a single dataset of 6000 tomato leaf images (covering 10 disease categories) requires approximately 100 person-days of work by agricultural experts, who must distinguish subtle symptom differences such as TYLCV-induced yellowing vs. nutrient deficiency [6,39]. This high annotation cost further exacerbates the small sample problem for understudied diseases.
Class imbalance: Uneven distribution of samples across disease categories in datasets [40,41]. A typical example is the PlantVillage dataset—one of the most widely used benchmarks for tomato leaf disease detection—where common diseases like TYLCV have 5357 images, while rare diseases like early blight and downy mildew have only 1000 images each [42]. Class imbalance biases DL models toward majority classes, leading to low recall for minority diseases such as 82% recall for TYLCV vs. 65% for downy mildew in unbalanced training [43,44]. This issue is particularly critical for diseases with high economic impact but low sample representation, such as tomato mosaic virus.
Low-quality and noisy samples: Degradation of data quality due to field environmental interference [45,46]. Natural background datasets—such as the Dataset of Tomato Leaves and PlantDoc—contain complex artifacts, including weed occlusion (30% of images in PlantDoc), soil backgrounds, motion blur (from wind), and inconsistent lighting (overexposure/underexposure) [47]. These artifacts obscure disease features such as small lesions on leaves and reduce model accuracy by 10–15% compared to controlled-environment datasets like PlantVillage with standardized backgrounds [48].

1.2.2. Multimodal Data in Tomato Leaf Disease Detection

Multimodal data encompasses heterogeneous information sources that collectively describe tomato leaf diseases, each capturing unique aspects of pathogen infection or plant response. Key modalities relevant to tomato leaf disease detection include image, textual, and viral molecular information.
Image data. The most widely used modality for disease detection, consisting of RGB (visible light) images [49] that capture visual symptoms such as lesion color, shape, and distribution. Benchmark datasets like PlantVillage (controlled laboratory backgrounds) and PlantDoc (natural field backgrounds) provide RGB images of 10–15 tomato disease categories, enabling models to learn symptom patterns under varying environmental conditions [50].
Text data. Descriptive information about diseases, including structured symptom annotations and unstructured expert descriptions. Structured annotations specify attributes like lesion color, shape, and location—for example, TYLCV is annotated as yellowed and curly leaves, dwarfed plants with shrunken foliage [31]. Free-text descriptions from agronomists or extension reports further provide contextual information about cultivar [51], growth stage [52], and management practices.
Viral molecular data. Genomic and proteomic information specific to viral diseases such as TYLCV. The TYLCV genome is a single-stranded circular DNA encoding six open reading frames (ORFs), such as C1 (Rep), C2 (transcriptional activator), and C4 (pathogenicity determinant) [53]. Associated betasatellites encode proteins such as βC1 that further modulate host responses and symptom expression [54]. Although molecular assays are costlier than imaging, they provide direct evidence of viral infection and strain composition.

1.2.3. Relationship Between Limited Data and Multimodal Data

Limited data and multimodal data exhibit a complementary yet conflicting relationship that shapes the design of AI-based detection systems:
Complementarity: Multimodal data alleviates data limitations by enriching information density. For example, image–text pairs in TLDITRD provide semantic context (text) to compensate for small image sample sizes—models trained on 1000 image–text pairs of a rare disease such as leaf miner achieve comparable accuracy to those trained on 6000 samples [48].
Conflict: Limited data increases the difficulty of multimodal fusion. Cross-modal alignment—matching features across modalities such as linking text descriptions of symptoms to image regions of interest—requires sufficient paired samples, which are often lacking for rare diseases [48].
This review (i) formalizes a taxonomy of “limited data” specific to tomato leaf diseases (small-sample, imbalance, and field noise) and relates it to multimodal fusion challenges, (ii) synthesizes recent techniques (transfer/self-distillation/self-supervised/few-shot) with a deployment-oriented lens, (iii) benchmarks model–dataset pairs with a focus on cross-domain generalization, and (iv) identifies open issues (interpretability, unified multimodal benchmarks) and proposes a practical roadmap for edge-ready fusion systems.

1.2.4. Mathematical Problem Formulation

From a learning-theoretic perspective, tomato leaf disease detection under limited and multimodal data can be cast as a supervised or weakly supervised classification problem with structured inputs. Let $\mathcal{X}_{\mathrm{img}}$, $\mathcal{X}_{\mathrm{text}}$, and $\mathcal{X}_{\mathrm{mol}}$ denote the spaces of image, textual, and molecular features, respectively, and let $\mathcal{Y} = \{1, \dots, K\}$ be the set of tomato leaf disease classes (including healthy leaves). A generic multimodal detector learns a mapping

$$f_\theta : \mathcal{X}_{\mathrm{img}} \times \mathcal{X}_{\mathrm{text}} \times \mathcal{X}_{\mathrm{mol}} \to \Delta^{K-1}, \qquad (1)$$

where $\theta$ denotes the trainable parameters and $\Delta^{K-1}$ is the $(K-1)$-dimensional probability simplex, i.e., the set of vectors $p \in \mathbb{R}^{K}$ with $p_k \ge 0$ and $\sum_{k=1}^{K} p_k = 1$.
Equation (1) defines a generic multimodal tomato disease classifier. For instance, given a leaf RGB image and an optional symptom description, the model outputs a probability vector over the $K$ disease categories, e.g., $p(\mathrm{TYLCV}) = 0.72, \dots$
Given a training set

$$\mathcal{D} = \{(x_i^{(I)}, x_i^{(T)}, x_i^{(M)}, y_i)\}_{i=1}^{n}, \qquad (2)$$

with samples drawn i.i.d. from an unknown data-generating distribution $P$, the standard supervised learning objective is to minimize the empirical risk

$$\hat{R}(\theta) = \frac{1}{n}\sum_{i=1}^{n} \ell\!\left(f_\theta\!\left(x_i^{(I)}, x_i^{(T)}, x_i^{(M)}\right),\, y_i\right), \qquad (3)$$

where $\ell$ is typically the cross-entropy loss, specified in Equation (6) below. The true (population) risk under $P$ is

$$R(\theta) = \mathbb{E}_{(X, Y)\sim P}\!\left[\ell\!\left(f_\theta(X), Y\right)\right], \qquad (4)$$

of which Equation (3) is the sample-average approximation, and the central goal is to control the generalization gap $R(\theta) - \hat{R}(\theta)$ under various data constraints.
Under limited data, $n$ is small and the class prior $\pi_k = P(Y = k)$ is highly imbalanced, i.e.,

$$\pi_k \ll \pi_{k'} \quad \text{for some minority class } k \text{ and majority class } k', \qquad (5)$$

which leads to high variance of $\hat{R}(\theta)$ and biased optimization toward majority classes. We interpret $\pi_k \ll \pi_{k'}$ as $\pi_k / \pi_{k'} < \varepsilon$ for a small threshold $\varepsilon$ (e.g., $\varepsilon = 0.1$), meaning that minority diseases are underrepresented by at least roughly one order of magnitude. Throughout, we adopt the cross-entropy loss

$$\ell(p, y) = -\log p_y, \qquad (6)$$

where $p_y$ is the predicted probability of the true class $y$; under i.i.d. sampling, $\hat{R}(\theta) \to R(\theta)$ as $n$ grows.
A common remedy is to adopt class-weighted or focal losses, e.g.,

$$\hat{R}_w(\theta) = \frac{1}{n}\sum_{i=1}^{n} \alpha_{y_i}\, \ell\!\left(f_\theta(x_i), y_i\right), \qquad (7)$$

where the weight $\alpha_{y_i}$ is larger for minority diseases.
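To make Equation (7) concrete, the following minimal PyTorch-style sketch implements a class-weighted focal loss; the function name, the focusing parameter, and the example class counts are illustrative assumptions rather than settings taken from the reviewed studies (setting gamma to zero recovers the plain weighted cross-entropy of Equation (7)).

```python
import torch
import torch.nn.functional as F

def weighted_focal_loss(logits, targets, class_weights, gamma=2.0):
    """Class-weighted focal loss for imbalanced tomato disease classes.

    logits:        (batch, K) raw model outputs
    targets:       (batch,) integer disease labels in {0, ..., K-1}
    class_weights: (K,) weights alpha_k, larger for minority diseases
    gamma:         focusing parameter; gamma = 0 gives weighted cross-entropy
    """
    log_probs = F.log_softmax(logits, dim=1)
    log_p_y = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)   # log p_y
    p_y = log_p_y.exp()
    alpha = class_weights[targets]                                   # alpha_{y_i}
    return (-alpha * (1.0 - p_y) ** gamma * log_p_y).mean()

# Hypothetical class counts (e.g., abundant TYLCV vs. a rare disease):
counts = torch.tensor([5357.0, 1000.0, 1000.0, 373.0])
class_weights = counts.sum() / (len(counts) * counts)                # inverse-frequency weights
```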
In the multimodal setting, each training sample may only contain a subset of modalities. Let $M_i \subseteq \{\mathrm{img}, \mathrm{text}, \mathrm{mol}\}$ denote the available modalities for sample $i$. The model must be robust to missing modalities and heterogeneity in $M_i$, while still learning a joint representation that supports cross-modal retrieval and fusion. Techniques such as modality-specific encoders, shared latent spaces, and cross-attention mechanisms are therefore central to the design of tomato leaf disease detection systems under limited and multimodal data.
From a statistical learning viewpoint, the limited-data and multimodal constraints can be linked to classical notions of sample complexity and hypothesis class capacity. A standard generalization bound is given in Equation (8); a proof sketch is provided in Appendix A. Let $\mathcal{H}$ denote the hypothesis class underlying $f_\theta$. Under standard assumptions (i.i.d. sampling, bounded loss), the expected generalization gap can be controlled in terms of complexity measures such as the empirical Rademacher complexity $\hat{\mathfrak{R}}_n(\mathcal{H})$, yielding bounds of the form

$$R(\theta) \le \hat{R}(\theta) + 2\hat{\mathfrak{R}}_n(\mathcal{H}) + c\sqrt{\frac{\log(1/\delta)}{2n}}, \qquad (8)$$

which holds with probability at least $1-\delta$ for some constant $c > 0$. In the tomato disease setting, deep multimodal architectures typically induce a large $\hat{\mathfrak{R}}_n(\mathcal{H})$, while $n$ is relatively small and class-imbalanced, which makes tight generalization control difficult. This motivates architectural choices and regularization strategies that effectively reduce the capacity of the model in directions that are not disease-discriminative (e.g., invariance to background clutter and illumination changes), while preserving sensitivity to subtle lesion characteristics.
Label noise is another important factor in practical datasets. Let $\tilde{Y}$ denote the observed (possibly noisy) label and assume a class-conditional noise model with transition matrix $T \in [0, 1]^{K \times K}$, where

$$T_{ij} = P(\tilde{Y} = j \mid Y = i), \qquad i, j \in \{1, \dots, K\}. \qquad (9)$$

Under this model, minimizing the empirical risk with respect to $\tilde{Y}$ generally leads to a biased estimator of the optimal classifier for $Y$, unless $T$ is known and explicitly corrected for. In tomato disease datasets, label noise arises from ambiguous symptoms, overlapping infections, and inter-annotator disagreement, especially for visually similar diseases such as early and late blight or for mixed nutrient/disease stress. Robust loss functions, noise-tolerant training schemes, and multi-expert annotation protocols can be interpreted as practical attempts to mitigate the impact of an unknown $T$ on the learned classifier.
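If an estimate of the transition matrix in Equation (9) is available, one standard noise-tolerant strategy is forward loss correction, in which the predicted clean posterior is mapped through $T$ before computing the loss against the noisy labels. The sketch below is a minimal illustration of this idea and is not tied to any specific tomato disease study.

```python
import torch
import torch.nn.functional as F

def forward_corrected_loss(logits, noisy_targets, T):
    """Forward loss correction under a (known or estimated) transition matrix T.

    logits:        (batch, K) raw model outputs
    noisy_targets: (batch,) observed, possibly mislabeled disease classes
    T:             (K, K) tensor with T[i, j] = P(noisy label j | true label i)
    """
    clean_probs = F.softmax(logits, dim=1)     # model's estimate of P(Y | x)
    noisy_probs = clean_probs @ T              # implied distribution of the noisy label
    return F.nll_loss(torch.log(noisy_probs + 1e-12), noisy_targets)
```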
Domain shift can be formalized by considering a collection of domains $\{D_m\}_{m=1}^{M}$, each associated with a distribution $P_m$ over $(X, Y)$. Let $P_{\mathrm{train}}$ be the mixture of source domains used for training and $P_{\mathrm{test}}$ be the target domain encountered at deployment. Domain generalization aims to learn $\theta$ such that $R_{P_{\mathrm{test}}}(\theta)$ remains small for a wide range of plausible $P_{\mathrm{test}}$ differing from $P_{\mathrm{train}}$ in covariate, label, or conditional distributions. In tomato disease detection, $P_m$ may correspond to different geographic regions, cultivars, or imaging setups (e.g., smartphone versus DSLR), and the discrepancy between $P_{\mathrm{train}}$ and $P_{\mathrm{test}}$ may be quantified using divergence measures such as the $\mathcal{H}$-divergence or Wasserstein distance. Many of the methods reviewed in later sections can be understood as implicitly reducing these divergences by aligning feature distributions across domains or by enforcing invariances that are believed to be disease-specific rather than domain-specific.
Finally, in the multimodal case, the input space decomposes as $\mathcal{X} = \mathcal{X}_{\mathrm{img}} \times \mathcal{X}_{\mathrm{text}} \times \mathcal{X}_{\mathrm{mol}}$, but, in practice, we often observe only partial inputs. Let $M \subseteq \{\mathrm{img}, \mathrm{text}, \mathrm{mol}\}$ denote the subset of modalities available for a given sample and let $\mathcal{X}_M$ denote the restriction of $\mathcal{X}$ to the modalities in $M$. A practically relevant objective is then to learn a family of predictors $\{f_{\theta, M}\}_M$ that share parameters where appropriate but can operate on any observed subset, i.e.,

$$f_{\theta, M} : \prod_{m \in M} \mathcal{X}_m \to \Delta^{K-1}, \qquad M \subseteq \{\mathrm{img}, \mathrm{text}, \mathrm{mol}\}. \qquad (10)$$

Each $f_{\theta, M}$ is evaluated only on the modalities present in $M$ while sharing parameters with the other members of the family. Designing such flexible architectures—which gracefully degrade when some modalities are missing and exploit complementarities when multiple modalities are present—remains an open and important research question for tomato leaf disease detection in heterogeneous field environments.
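The following minimal PyTorch-style sketch illustrates one simple realization of the predictor family in Equation (10): each observed modality is projected into a shared latent space, absent modalities are skipped, and the pooled representation is classified onto the simplex. The architecture, feature dimensions, and pooling rule are illustrative assumptions rather than a reference implementation from the reviewed literature.

```python
import torch
import torch.nn as nn

class MultimodalDiseaseClassifier(nn.Module):
    """Sketch of a fusion head that tolerates missing modalities.

    Each available modality is embedded into a shared d-dimensional space;
    absent modalities are skipped and the remaining embeddings are mean-pooled
    before classification, giving one simple instance of Equation (10).
    """

    def __init__(self, dims, d=256, num_classes=10):
        super().__init__()
        # dims: dict such as {"img": 768, "text": 384, "mol": 128} (illustrative sizes)
        self.encoders = nn.ModuleDict({m: nn.Linear(dim, d) for m, dim in dims.items()})
        self.classifier = nn.Linear(d, num_classes)

    def forward(self, inputs):
        # inputs: dict mapping modality name -> feature tensor; missing keys are allowed
        embeddings = [torch.relu(self.encoders[m](x)) for m, x in inputs.items()]
        fused = torch.stack(embeddings, dim=0).mean(dim=0)   # shared latent space
        return self.classifier(fused).softmax(dim=-1)        # point on the simplex

model = MultimodalDiseaseClassifier({"img": 768, "text": 384, "mol": 128})
probs = model({"img": torch.randn(4, 768)})                  # image-only sample
```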
The main notation used in the formulation is summarized in Table 3.

1.3. Review Methodology

We conducted a structured literature search in Web of Science, Scopus, and Google Scholar covering publications from 2015 to 2025. Keywords included “tomato leaf disease detection”, “limited data”, “few-shot learning”, “self-supervised learning”, “domain generalization”, and “multimodal fusion”. Studies were included if they (i) targeted tomato leaf diseases, (ii) proposed or evaluated AI/ML/DL detection methods under limited or multimodal data settings, and (iii) reported quantitative results on public or field datasets. Papers focusing solely on non-tomato crops or without methodological or quantitative evidence were excluded.
In addition, we performed backward snowballing on the reference lists of highly cited tomato disease detection papers and relevant surveys in agricultural AI to identify missed works. The final corpus covers more than one hundred articles, with a particular emphasis on studies published between 2023 and 2025 to capture the latest advances in deep learning, multimodal fusion, and deployment-oriented designs.
Beyond the keyword-based search described above, we also carried out several rounds of manual screening to reduce topic drift and ensure that the included works are truly relevant to tomato leaf disease detection. In a first pass, we removed papers that only used tomato as a toy example in generic computer vision benchmarks, or that focused on unrelated tasks such as fruit grading, yield prediction, or greenhouse climate control. In a second pass, we excluded works that reported qualitative demonstrations without quantitative evaluation, or that used proprietary datasets without sufficient description of the acquisition conditions, class labels, and train/test splits. These exclusion steps are similar in spirit to those adopted in broader reviews of plant disease detection and agricultural deep learning [55,56,57,58,59], but tailored here to tomato-specific scenarios.
To organize the final corpus, we adopted a two-stage coding procedure. First, each paper was assigned to one or more of three high-level categories: (i) limited-data learning for tomato leaf diseases (including transfer learning, self-distillation, self-/semi-supervised learning, few-/zero-shot learning, domain adaptation/generalization, and active learning); (ii) multimodal and cross-modal methods (including image–text, image–sensor, and image–molecular fusion); and (iii) deployment-oriented designs (including lightweight architectures, edge computing, and integration with IoT or farm management systems). Second, within each category, we annotated the specific datasets used (PlantVillage, PlantDoc, Tomato-Village, TLDITRD, and private field datasets), the backbone architectures (CNNs, Transformers, hybrid models), and the evaluation protocols. This coding scheme was inspired by prior survey work in plant disease detection [58] but extended to capture multimodal and limited-data aspects that have become prominent only in the last few years [57].
The temporal distribution of the selected papers also revealed an interesting pattern. Early deep learning-based tomato disease studies around 2015–2018 primarily focused on demonstrating that convolutional networks substantially outperform classical hand-crafted feature pipelines [57]. Between 2019 and 2021, the focus shifted toward improving accuracy and robustness on more challenging field datasets and exploring lightweight backbones for mobile deployment [60]. From 2022 onward, we observed a rapid increase in works that explicitly address limited data (few-shot, self-supervised, semi-supervised learning) and multimodal fusion (image–text, image–sensor, and image–genomics) [61,62,63]. This temporal evolution supported our choice to frame this review around limited and multimodal data.
It is important to acknowledge several sources of potential bias in our methodology. First, although we queried both English and non-English databases, the vast majority of included papers were written in English and indexed in major citation databases, possibly under-representing regional works published in local journals or conference proceedings. Second, keyword-based searches inevitably miss studies that use atypical terminology for tomato diseases or deep learning methods. Third, publication bias may favor positive results (e.g., models that show clear improvements over baselines), whereas unsuccessful attempts at certain techniques (e.g., specific domain adaptation or few-shot schemes) remain under-reported [56]. To partially mitigate these issues, we complemented database searches with backward and forward citation chasing from several highly cited surveys and benchmark papers on plant disease detection [55,56,57,58] and manually inspected recent special issues on agricultural AI and plant phenotyping.
Finally, we emphasize that the present review is intentionally tomato-centric but methodologically outward-looking. Whenever appropriate, we include and discuss generic plant disease or agricultural computer vision papers that introduce techniques of direct relevance to tomato leaf diseases, such as contrastive self-supervised learning on leaf images [64], unsupervised domain adaptation from laboratory to field imagery [61], and few-shot meta-learning for cross-crop disease recognition [65]. In each case, we explicitly indicate whether the original study included tomato in its experiments or whether we consider it as a transferable methodological contribution. This hybrid strategy allowed us to maintain a clear focus on tomato while still leveraging the broader methodological landscape in plant disease detection and agricultural AI.
To make the procedure more transparent, the main steps of the review methodology are summarized in Table 4.
In this review, we formalize a tomato-specific taxonomy of limited data and connect it to multimodal fusion challenges. We then synthesize representative methods (transfer learning, self-distillation, self-/semi-supervised learning, and few-shot learning) with an explicit deployment-oriented lens. Finally, we compare typical model–dataset pairs in terms of cross-domain robustness and practical cost and highlight open issues such as interpretability and the lack of unified multimodal benchmarks.

2. Challenges in Tomato Leaf Disease Detection: Limited and Multimodal Data

Tomato leaf disease detection in realistic agricultural environments is fundamentally constrained by data issues. On the one hand, models must be trained from limited, imbalanced, and noisy samples. On the other hand, they increasingly need to exploit heterogeneous multimodal information, including images, text, and molecular assays. This section analyzes these challenges and their interplay from a modeling and deployment perspective.

2.1. Challenges from Limited Data

2.1.1. Small Sample Size

For several tomato diseases, especially emerging or region-specific ones, the number of labeled samples is intrinsically small. For example, in the Tomato-Village dataset, tomato spotted wilt and leaf miner are represented by only a few hundred images, while more common diseases have thousands of samples [26]. Similar scarcity arises for newly characterized viral complexes such as tomato yellow leaf curl Guangdong virus (TYLCGdV) and its associated betasatellite TYLCGdB [54].
In addition to epidemiological factors, annotation cost is a major bottleneck. Constructing a disease dataset with several thousand images typically requires weeks of work from trained agronomists, who must distinguish visually similar diseases (e.g., early blight [66] vs. late blight [67]) and annotate lesion location, shape, and severity [37]. As a result, many datasets are effectively low-shot for some classes, even if the total number of images appears large.

2.1.2. Class Imbalance

Class imbalance is another pervasive phenomenon in tomato disease datasets. In PlantVillage, common diseases such as Tomato yellow leaf curl virus (TYLCV) are represented by thousands of images, whereas mosaic virus and other rare diseases have only a few hundred [6]. Under standard empirical risk minimization, deep models tend to optimize global accuracy by focusing on majority classes, which yields high performance on frequent diseases but poor recall on rare but economically important ones. For example, when trained on imbalanced data, models may achieve high accuracy on TYLCV while substantially under-detecting Septoria leaf spot and mosaic virus [68].
From a learning–theoretic perspective, class imbalance effectively distorts the empirical class prior, amplifying the contribution of majority classes to the loss and widening the gap between the empirical and true risks. Correcting this distortion while avoiding overfitting on scarce classes remains a nontrivial design problem.

2.1.3. Low-Quality and Noisy Samples

Field deployment introduces a third form of data limitation: label-preserving but feature-degrading noise. Compared with laboratory-style PlantVillage images, field datasets such as the Dataset of Tomato Leaves and PlantDoc contain complex backgrounds, overlapping leaves, weeds, soil, and supporting structures [48]. Many images suffer from motion blur due to wind, nonuniform illumination in greenhouses, partial occlusion of lesions, and camera compression artifacts [68].
These factors distort the visual manifestation of disease symptoms, making lesion boundaries less clear and color cues less reliable. Empirically, models trained only on clean laboratory data experience a substantial performance drop when evaluated on such noisy field images, even if the disease categories are nominally the same [6].
These three types of limitations jointly define a “limited-data regime” for tomato leaf disease detection, as summarized in Figure 1.

2.1.4. Annotation Noise and Inter-Annotator Variability

In addition to sample scarcity and image-level noise, annotation noise constitutes a fourth, often underappreciated, challenge in tomato leaf disease datasets. In practice, labels are provided by human experts (agronomists, plant pathologists, or trained technicians) who must distinguish between visually similar diseases, assign severity scores, and sometimes identify mixed infections. Even under controlled conditions, inter-annotator agreement can be imperfect, especially for early or mild symptoms where visual cues are subtle. In field images, confounding factors such as nutrient deficiencies, abiotic stress (e.g., heat, drought, salinity), mechanical damage, and herbicide injury can further blur the boundaries between categories.
Several sources of annotation noise can be informally distinguished:
  • Class noise, where the primary disease label is incorrect (e.g., early blight labeled as late blight or TYLCV confusion with other yellowing disorders).
  • Boundary noise, where lesion regions in detection or segmentation tasks are imprecisely delineated, often due to occlusions or low resolution.
  • Severity noise, where ordinal severity scores (e.g., mild, moderate, severe) exhibit high variability between annotators or annotation sessions.
In small datasets, even a modest proportion of noisy labels can significantly distort the empirical risk landscape, causing models to overfit spurious patterns. This issue is exacerbated by class imbalance: rare diseases may have very few examples, and a handful of mis-labeled images can dramatically bias the estimated decision boundary for these classes. In multimodal settings, inconsistencies between textual descriptions and image labels (e.g., a text mentioning “severe chlorosis” attached to a mildly affected leaf) can also hamper cross-modal alignment [48].
Mitigation strategies include multi-expert annotation with consensus or adjudication protocols, active re-annotation of samples with high disagreement, and algorithmic techniques such as robust loss functions, label smoothing, noise-transition modeling, and co-teaching schemes that down-weight suspected noisy examples during training. Semi-automated tools that highlight image regions most responsible for the model’s prediction (e.g., Grad-CAM or attention maps) can further facilitate expert review by directing human attention to potentially mis-labeled or ambiguous samples. In the longer term, integrating annotation interfaces directly with farm management apps could enable continuous refinement of labels as farmers and agronomists correct or confirm model predictions in real time.

2.2. Challenges from Multimodal Data

2.2.1. Heterogeneity of Multimodal Data

Multimodal disease detection involves combining heterogeneous information sources that differ in structure, scale, and acquisition pipeline. Image data provide spatially structured visual patterns (lesion color, shape, and distribution), text describes symptoms or management history in natural language, and viral molecular data encode sequence-level signatures of infection [54].
These modalities reside in distinct feature spaces and exhibit different noise characteristics. Text annotations may be ambiguous or inconsistent across annotators (e.g., the same phrase “brown spots” [69] used for both early blight and bacterial spot [70]), whereas molecular measurements can be affected by assay-specific biases or detection limits [54]. In practice, viral or molecular assays are only available for a subset of virus-related diseases such as TYLCV, and are usually treated as auxiliary rather than primary inputs.

2.2.2. Difficulty of Cross-Modal Alignment

Effective multimodal fusion requires aligning information across modalities in a semantically coherent way. In tomato disease detection, this often amounts to matching textual symptom descriptions (e.g., “systemic yellowing” [71], “upward leaf curling” [72]) with corresponding image regions or temporal progression. However, such alignment is inherently noisy: early infection stages may show only subtle or localized visual changes while text descriptions reflect fully developed symptoms [48].
Moreover, contextual modalities such as environmental or management data can introduce apparent contradictions. For instance, high humidity may suggest a high risk of downy mildew, yet the current images show no lesions, or genomic assays may detect TYLCV DNA before visible symptoms emerge [54]. Robust cross-modal alignment must therefore account for temporal lags, partial observability, and the fact that not all modalities are equally informative at all times.

2.2.3. Model Complexity and Resource Constraints

Multimodal fusion architectures tend to be more complex than unimodal ones, as they require separate encoders, fusion modules, and, sometimes, modality-specific decision heads. Typical designs in the tomato domain combine Transformer-based image backbones (e.g., ViT [73]) with large language models (e.g., BERT [74]) and attention-based fusion modules, as in LAFANet [48]. When such models are further combined with limited-data techniques (e.g., self-distillation [75], contrastive pretraining [76]), the parameter count and training cost grow substantially.
This complexity increases the risk of overfitting under limited paired data. Image–text pairs are often scarce for rare diseases, making it difficult to reliably learn cross-modal correspondences. At the same time, computational and energy budgets on edge devices (smartphones, embedded boards in greenhouses) are tight, limiting the feasible model size and fusion depth [68].

2.3. Interaction Between Limited and Multimodal Data Challenges

The limitations discussed above do not occur in isolation. In practice, limited data and multimodal heterogeneity interact in ways that compound the difficulty of model design.
From the data side, small sample sizes and class imbalance directly reduce the number of reliable multimodal pairs. For some rare diseases, only a handful of images may be accompanied by high-quality symptom descriptions or molecular assays. This makes it challenging to estimate cross-modal similarity distributions and to train attention-based fusion modules without overfitting [6]. Conventional image-only data augmentation cannot fully compensate for missing or sparse text and molecular signals.
From the system side, multimodal architectures are often harder to deploy in resource-constrained agricultural settings. Smallholder farms and small greenhouses may lack the infrastructure to routinely collect and synchronize all modalities (e.g., hyperspectral data or viral sequencing), and may instead rely primarily on RGB images and occasional expert notes [54,68]. This leads to highly heterogeneous and incomplete modality availability at inference time, which most current models are not explicitly designed to handle.
Understanding and modeling these coupled challenges is essential for developing tomato disease detection systems that are both scientifically sound and practically deployable. Table 5 summarizes the main limited-data challenges and typical countermeasures.
From a more systemic perspective, the interaction between limited and multimodal data can be illustrated along three practical axes: (i) the stage of the disease epidemic (early warning versus severe outbreak), (ii) the technological level of the farm (smallholder versus high-tech greenhouse), and (iii) the available sensing modalities (image-only, image plus text, or image plus high-cost assays). In early-warning scenarios, visual symptoms on tomato leaves may be extremely subtle or even absent, rendering RGB images alone insufficient for reliable detection. In such cases, molecular assays or high-resolution spectral imaging can provide early signals, but only for a small subset of plants due to cost and logistical constraints [1,53,63]. Conversely, during severe outbreaks, visual symptoms become prominent and images are abundant, but textual and molecular annotations may lag behind, leading to multimodal misalignment and label noise.
The farm technology level further modulates this interplay. Large commercial greenhouses may operate fixed camera networks, environmental sensor arrays, and occasional molecular diagnostics, giving rise to rich but heterogeneous multimodal streams. However, these farms often collect data at different temporal resolutions (e.g., hourly sensor readings versus weekly molecular sampling), and aligning these streams to individual leaves or plants is non-trivial [58]. By contrast, smallholder farmers primarily rely on ad-hoc smartphone photos and brief verbal or textual descriptions, which yield extremely sparse and noisy multimodal pairs [55]. Designing multimodal tomato disease detectors that remain useful in both extremes—data-rich but complex greenhouses and data-poor but widespread smallholder systems—is thus a central challenge.
A third axis concerns the granularity at which modalities provide information. Image data are typically leaf- or plant-level, whereas many textual descriptions are plot- or field-level (e.g., “yellowing observed in the southern part of the greenhouse”). Molecular data may further aggregate information from pools of leaves or composite samples [1,54]. This granularity mismatch leads to weak supervision: an image of a single leaf may be labeled as TYLCV-positive simply because it comes from a TYLCV-positive field sample, even if the leaf itself does not yet show symptoms. Similar issues arise in multimodal plant disease classification beyond tomato [77]. From a methodological standpoint, this suggests that techniques from weakly supervised and multiple-instance learning could be fruitfully adapted to the tomato disease setting.
In summary, limited and multimodal data should not be treated as independent complications that are solved in isolation. Instead, the coupled nature of these challenges calls for jointly designed data-collection and modeling strategies. On the data side, this includes coordinated acquisition protocols where a small number of plants are sampled with high-cost modalities (e.g., genomics, hyperspectral imaging) to anchor the semantics of textual and image features [62]. On the modeling side, it motivates architectures that can gracefully interpolate between regimes: operating as strong image-only classifiers when auxiliary modalities are unavailable, upgrading to multimodal fusion when paired data are present, and exploiting weak field-level labels via multiple-instance or semi-supervised objectives [78]. We return to these design principles when discussing multimodal fusion patterns and deployment scenarios in Section 4 and Section 6.

3. Technical Solutions for Limited Data in Tomato Leaf Disease Detection

Tomato leaf disease models must operate under small, imbalanced, and noisy datasets. In this section, we review limited-data learning strategies, including transfer learning, self-distillation, data augmentation, self-supervised learning, few- and zero-shot learning, and domain generalization.
A compact comparison of limited-data learning strategies is provided in Table 6.
The overall taxonomy of these strategies is shown in Figure 2.

3.1. Transfer Learning

3.1.1. Principle and Implementation

Transfer learning [79,80,81] mitigates data scarcity by reusing representations learned from large-scale source datasets, typically ImageNet, and adapting them to tomato leaf disease detection. Instead of training a deep model from scratch on a few thousand tomato images, a pre-trained backbone (e.g., DeiT [6], ShuffleNetV2 [37], or ResNet [82] variants) is fine-tuned on disease-specific datasets. The lower layers retain generic edge, texture, and shape features, while upper layers are adjusted to capture disease-specific patterns such as lesion morphology and leaf deformation.
In practice, two main strategies are followed. The first is full fine-tuning, where all layers of the pre-trained network are updated with a smaller learning rate on tomato images, sometimes preceded by a brief warm-up stage. The second is partial fine-tuning or feature extraction, where most backbone layers are frozen and only a task-specific head (e.g., a lightweight classifier [83] or detection head [84]) is trained. LAFANet adopts the latter strategy by using a ViT backbone pre-trained on ImageNet as a generic image encoder and then learning multimodal fusion modules on top [48].
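As a minimal illustration of the second strategy, the sketch below freezes an ImageNet pre-trained ResNet-50 backbone and trains only a new tomato-specific classification head; full fine-tuning would instead unfreeze all layers with a smaller learning rate. The hyperparameter values are illustrative assumptions, not settings reported by the cited studies.

```python
import torch
import torch.nn as nn
from torchvision import models

num_classes = 10  # e.g., 10 tomato disease categories in PlantVillage

# Partial fine-tuning: freeze the pre-trained backbone, train only the new head.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
for p in backbone.parameters():
    p.requires_grad = False                                    # keep generic ImageNet features
backbone.fc = nn.Linear(backbone.fc.in_features, num_classes)  # task-specific head (trainable)

optimizer = torch.optim.SGD(backbone.fc.parameters(), lr=1e-2, momentum=0.9)

# Full fine-tuning would instead update all parameters with a smaller learning rate,
# e.g., torch.optim.SGD(backbone.parameters(), lr=1e-4, momentum=0.9) after a warm-up.
```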

3.1.2. Application Cases and Performance

EMA-DeiT is a representative example of transfer learning in tomato disease classification. By initializing from DeiT pre-trained on ImageNet and incorporating an Exponential Moving Average (EMA) mechanism, EMA-DeiT achieves 99.6% accuracy on PlantVillage (10 disease types) and 98.2% on the Dataset of Tomato Leaves (6 disease types), clearly surpassing training the same architecture from scratch (93.6% accuracy on PlantVillage) [6]. Similarly, KD-ShuffleNetV2 leverages pre-trained ShuffleNetV2 weights as a lightweight backbone and further refines performance via ensemble self-distillation, reaching 95.08% accuracy while maintaining a very small parameter count [37].
These results indicate that even for domain-specific tasks such as tomato disease detection, generic visual features learned from natural images provide a strong initialization, especially when per-class sample sizes are below one thousand.

3.1.3. Advantages and Limitations

The main advantages of transfer learning are reduced training time, improved data efficiency, and better convergence stability. EMA-DeiT, for instance, converges within a few epochs on tomato datasets, whereas training a similar Transformer from scratch requires substantially more iterations and may still underfit minority diseases [6]. Transfer learning is therefore particularly attractive for practitioners who wish to build competitive models from modest datasets without extensive hyperparameter tuning.
However, transfer learning alone does not fully resolve domain shift. Pre-trained models are typically optimized on generic web images that differ markedly from agricultural scenes in background statistics, illumination patterns, and object scales. As a result, even strong transfer-learning baselines exhibit notable performance degradation when moving from laboratory-style datasets (PlantVillage) to field datasets (PlantDoc), as observed for EMA-DeiT (from 99.6% to 97.1% accuracy) [6]. Addressing such cross-domain discrepancies requires combining transfer learning with domain adaptation, self-supervised pretraining on in-domain images, or data augmentation strategies targeted at field-specific variations.

3.2. Self-Distillation and Ensemble Learning

3.2.1. Self-Distillation for Data Efficiency

Knowledge distillation traditionally transfers information from a large “teacher” model to a smaller “student”, with the goal of compressing capacity while preserving accuracy. Self-distillation adapts this idea by letting a model teach itself across training stages or architectural branches, thereby enhancing performance without relying on an external teacher [37]. This is particularly relevant in limited-data regimes, where additional large teachers may be unavailable or too costly to train.
EMA-DeiT integrates self-distillation implicitly: the EMA-updated parameters act as a slowly evolving teacher that provides smoothed predictions for the current student. A KL divergence term encourages the student to match these soft labels, stabilizing training and improving generalization on small tomato datasets [6]. KD-ShuffleNetV2, in contrast, explicitly constructs an ensemble of shallow ShuffleNetV2 branches and uses their aggregated predictions as teacher signals for each individual branch, effectively recycling the model’s own intermediate knowledge [37].
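The following sketch illustrates the EMA-teacher style of self-distillation described above: the teacher is a slowly updated copy of the student, and a temperature-scaled KL term encourages the student to match the teacher's soft labels. The momentum, temperature, and mixing weight are illustrative assumptions rather than the exact settings of EMA-DeiT.

```python
import copy
import torch
import torch.nn.functional as F

def ema_update(teacher, student, momentum=0.999):
    """Update the EMA teacher as a slowly moving average of the student weights."""
    with torch.no_grad():
        for pt, ps in zip(teacher.parameters(), student.parameters()):
            pt.mul_(momentum).add_(ps, alpha=1.0 - momentum)

def self_distillation_loss(student_logits, teacher_logits, targets, T=2.0, lam=0.5):
    """Cross-entropy on hard labels plus KL divergence toward the EMA teacher's soft labels."""
    ce = F.cross_entropy(student_logits, targets)
    kl = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                  F.softmax(teacher_logits.detach() / T, dim=1),
                  reduction="batchmean") * (T * T)
    return (1.0 - lam) * ce + lam * kl

# Typical usage: teacher = copy.deepcopy(student); after each optimizer step,
# call ema_update(teacher, student) and recompute teacher_logits without gradients.
```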

3.2.2. Ensembles Under Class Imbalance

Ensemble learning offers a complementary mechanism to mitigate class imbalance and label noise. By combining multiple models or branches that focus on different aspects of the data, ensembles can reduce variance and alleviate overfitting to majority classes. In KD-ShuffleNetV2, three shallow subnetworks are trained jointly on an aggregated tomato disease dataset, and their outputs are fused via a weighted scheme. This ensemble yields 95.15% accuracy, outperforming each constituent model (e.g., the best single shallow model achieves 93.21%) and improving recall for minority classes such as mosaic virus [37].
From an empirical risk perspective, ensembles effectively average multiple hypotheses drawn from different local minima, smoothing the loss landscape and reducing sensitivity to sample-specific noise. When combined with self-distillation, they provide a structured way to inject additional supervision signals without acquiring more labeled data.
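For completeness, a weighted fusion of branch predictions of the kind used in KD-ShuffleNetV2 can be sketched as follows; the weights shown in the usage comment are placeholders, not the values reported in the original work.

```python
import torch

def ensemble_predict(branch_logits, weights):
    """Weighted fusion of softmax outputs from several shallow branches."""
    probs = [w * torch.softmax(logits, dim=1) for logits, w in zip(branch_logits, weights)]
    return torch.stack(probs, dim=0).sum(dim=0)   # (batch, K) fused class probabilities

# Example with three branches and placeholder weights:
# fused = ensemble_predict([z1, z2, z3], weights=[0.4, 0.3, 0.3])
```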

3.2.3. Benefits and Trade-Offs

Self-distillation and ensembles are attractive because they primarily reuse existing models and training runs, making better use of limited data. They can be integrated with transfer learning without modifying the underlying backbone, and they often yield consistent gains on both clean and noisy datasets.
The main trade-off is computational. Ensembles increase inference cost unless additional compression is performed, and self-distillation introduces auxiliary losses and additional forward passes during training. For edge deployment, it is therefore important to either distill ensembles back into a single compact model or to design architecture-level sharing (e.g., multi-branch backbones with shared early layers) so that the increase in FLOPs and memory remains acceptable for greenhouse or smartphone hardware.

3.3. Data Augmentation

3.3.1. Classical and Generative Augmentation Strategies

Data augmentation expands the effective size and diversity of training sets by applying label-preserving transformations. Classical augmentation strategies include geometric operations (rotation, flipping, scaling, random cropping) and photometric changes (brightness, contrast, blur, color jitter). In the tomato context, such transforms simulate variations in camera viewpoint, leaf orientation, and illumination commonly encountered in greenhouse and field conditions [6]. Augmenting PlantVillage with rotation and flipping, for example, can roughly double the number of training samples without additional annotation effort.
Beyond these classical operations, generative augmentation methods employ models such as GANs or diffusion models to synthesize plausible disease images. For rare diseases (e.g., Septoria leaf spot or Tomato spotted wilt), synthetic images can be generated to approximate the appearance of lesions on diverse backgrounds, helping to rebalance class distributions [37]. When carefully controlled, these synthetic samples enrich the support of the minority-class distribution and reduce overfitting.
Augmentation is also tightly coupled with self-supervised and contrastive learning. In such frameworks, stochastic augmentations (e.g., random cropping, color jitter, patch masking) are used to create multiple views of the same leaf, and the model is trained to produce consistent representations across these views. This encourages invariance to nuisance factors such as viewpoint and lighting while retaining sensitivity to disease-discriminative patterns [48].
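A typical augmentation policy combining the geometric and photometric operations discussed above can be sketched with torchvision as follows; the specific parameter ranges are illustrative assumptions rather than a validated policy for tomato leaves.

```python
from torchvision import transforms

# Illustrative augmentation policy for tomato leaf images (parameter values are placeholders).
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.7, 1.0)),                   # viewpoint / leaf scale
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(30),                                         # leaf and camera orientation
    transforms.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.2),  # illumination variation
    transforms.GaussianBlur(kernel_size=5, sigma=(0.1, 1.5)),              # approximate blur from wind/defocus
    transforms.ToTensor(),
])

def two_views(img):
    """Two independent augmented views of the same leaf, as used by contrastive SSL (Section 3.4)."""
    return train_transform(img), train_transform(img)
```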

3.3.2. Effectiveness and Limitations Under Limited Data

Empirical studies show that well-designed augmentation improves robustness to field noise and mitigates overfitting, especially when combined with transfer learning and self-distillation. For example, augmenting PlantDoc with strong geometric and photometric transforms increases the effective training set size and narrows the performance gap between laboratory and field datasets [6].
However, augmentation is not a panacea. Excessively aggressive transforms may distort disease-discriminative cues, such as small necrotic lesions or subtle color changes, effectively altering the label semantics. Generative models, if trained on biased data, may amplify existing class imbalance or introduce unrealistic artifacts that mislead the classifier. Designing augmentation policies that reflect realistic agronomic variability—rather than blindly applying generic image transforms—remains an important open problem for tomato disease detection.

3.4. Self-Supervised Representation Learning

3.4.1. Core Idea and Objectives

Self-supervised learning (SSL) aims to learn transferable representations from unlabeled data by solving pretext tasks, such as image reconstruction, contrastive instance discrimination, or masked patch prediction. This paradigm is particularly attractive for tomato leaf disease detection because large collections of unlabeled field images can be cheaply acquired, whereas expert annotation is costly and time-consuming.
Formally, SSL introduces an auxiliary loss $\mathcal{L}_{\mathrm{SSL}}$ defined on augmented views $\tilde{x}_i^{(1)}, \tilde{x}_i^{(2)} \sim \mathcal{T}(x_i)$ of unlabeled images, where $\mathcal{T}$ is a set of stochastic augmentations. The encoder $\phi_\theta$ is trained to minimize

$$\hat{R}_{\mathrm{SSL}}(\theta) = \frac{1}{n_u} \sum_{i=1}^{n_u} \mathcal{L}_{\mathrm{SSL}}\!\left( \phi_\theta(\tilde{x}_i^{(1)}),\, \phi_\theta(\tilde{x}_i^{(2)}) \right), \qquad (11)$$
and the resulting features are later fine-tuned on the labeled tomato leaf dataset.
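As one concrete instance of $\mathcal{L}_{\mathrm{SSL}}$ in Equation (11), the sketch below implements the SimCLR-style normalized-temperature cross-entropy (NT-Xent) loss between two augmented views of the same leaves; the temperature value is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    """SimCLR-style contrastive loss between two views of the same leaves.

    z1, z2: (batch, d) embeddings phi_theta(x_tilde^(1)), phi_theta(x_tilde^(2))
    """
    batch = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)        # 2N unit-norm embeddings
    sim = z @ z.t() / temperature                              # scaled cosine similarities
    sim.fill_diagonal_(float("-inf"))                          # exclude self-pairs
    # The positive for sample i is its other view: i <-> i + batch.
    targets = torch.cat([torch.arange(batch) + batch, torch.arange(batch)])
    return F.cross_entropy(sim, targets)
```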

3.4.2. Representative Paradigms

Typical SSL choices relevant to tomato leaf images include
  • Contrastive SSL (SimCLR/MoCo-style): pull together two augmented views of the same leaf and push apart different leaves; improves invariance to illumination/background.
  • Masked image modeling (ViT-style): randomly masks patches and reconstructs them; captures global leaf structure and context under occlusions.
  • Multimodal SSL (image–text alignment): contrastive objectives on paired image–description data, enabling CLIP-like retrieval and zero-shot diagnosis.

3.4.3. Benefits and Limitations Under Limited Data

Empirically, SSL pretraining on in-domain unlabeled tomato images narrows the performance gap between models trained with and without large-scale ImageNet pretraining, particularly when labeled samples per class are below one thousand. SSL also improves robustness to field noise (blur, occlusion, and complex backgrounds) by enforcing invariance to data augmentations.
However, SSL brings additional computational cost in the pretraining stage and its benefit depends critically on the diversity and quality of unlabeled data. If unlabeled images are heavily biased toward specific cultivars or environments, the learned representations may still exhibit domain shift when transferred to new regions or disease spectra. Designing lightweight SSL objectives tailored to resource-constrained agricultural devices remains an open direction.

3.5. Few- and Zero-Shot Learning for Rare Diseases

3.5.1. Problem Setting

Few-shot learning (FSL) targets the recognition of novel classes with only a handful of labeled examples per class. In tomato leaf disease detection, FSL is directly motivated by emerging or region-specific diseases for which only 1–5 expert-annotated images are available. Zero-shot learning (ZSL) further assumes that no labeled images are observed for some diseases during training and leverages auxiliary semantic information such as textual symptom descriptions or taxonomic relations.
Let $\mathcal{Y}_{\text{base}}$ and $\mathcal{Y}_{\text{novel}}$ denote the base and novel disease sets, with $\mathcal{Y}_{\text{base}} \cap \mathcal{Y}_{\text{novel}} = \emptyset$. FSL assumes abundant labeled data for the base classes and at most $K$ labeled examples per novel class (a $K$-shot regime). During evaluation, the model must classify queries drawn from $\mathcal{Y}_{\text{novel}}$, relying on knowledge transferred from $\mathcal{Y}_{\text{base}}$.

3.5.2. Metric-Based and Meta-Learning Approaches

Most FSL algorithms can be grouped into metric-based and meta-learning-based approaches. Metric-based FSL learns an embedding function $\phi_\theta$ such that samples of the same disease cluster together and classifies a query image $x$ by its nearest prototype,
$\hat{y} = \arg\min_{k \in \mathcal{Y}_{\text{novel}}} d\!\left( \phi_\theta(x), c_k \right),$
where $c_k$ is the prototype of class $k$ computed from its few support examples and $d(\cdot,\cdot)$ is a distance metric.
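A minimal sketch of this nearest-prototype rule is given below, assuming that support and query embeddings have already been computed by the encoder $\phi_\theta$; variable names are illustrative.

```python
import torch

def nearest_prototype_predict(query_feats: torch.Tensor,
                              support_feats: torch.Tensor,
                              support_labels: torch.Tensor,
                              n_way: int) -> torch.Tensor:
    """Classify query embeddings by their nearest class prototype (Euclidean distance).

    query_feats:    (q, d) embeddings of query leaf images
    support_feats:  (n_way * k_shot, d) embeddings of the few labeled support images
    support_labels: (n_way * k_shot,) integer class indices in [0, n_way)
    """
    prototypes = torch.stack([
        support_feats[support_labels == k].mean(dim=0) for k in range(n_way)
    ])                                               # (n_way, d) class prototypes c_k
    dists = torch.cdist(query_feats, prototypes)     # (q, n_way) distances d(phi(x), c_k)
    return dists.argmin(dim=1)                       # predicted novel-class indices
```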
Meta-learning instead trains the model to rapidly adapt to new diseases by simulating few-shot episodes on base classes. Each episode mimics a small N-way–K-shot classification task. Gradient-based meta-learners or metric-based meta-learners can be used on top of CNN or Transformer backbones, enabling fast adaptation to novel tomato diseases with only a few annotated examples.

3.5.3. Zero-Shot Learning with Symptom Semantics

ZSL for tomato diseases can exploit textual symptom descriptions and structured attributes as semantic class embeddings. Let $a_k$ denote the semantic vector for disease $k$, derived from expert-written descriptions or curated ontologies. A common strategy is to learn a compatibility function $F(x, k)$ such that
$\hat{y} = \arg\max_{k \in \mathcal{Y}_{\text{all}}} F\!\left( \phi_\theta(x), a_k \right),$
where $\mathcal{Y}_{\text{all}} = \mathcal{Y}_{\text{base}} \cup \mathcal{Y}_{\text{novel}}$. In multimodal frameworks, image and text encoders can be jointly trained so that both images and symptom phrases lie in a shared embedding space, naturally supporting ZSL and open-set disease retrieval.
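A simple instantiation of the compatibility function uses cosine similarity between image embeddings and semantic class vectors. The sketch below assumes both have already been projected into a shared space of equal dimension; names are illustrative.

```python
import torch
import torch.nn.functional as F

def zero_shot_predict(image_feat: torch.Tensor, class_semantics: torch.Tensor) -> int:
    """Zero-shot prediction with F(phi(x), a_k) implemented as cosine similarity.

    image_feat:      (d,) image embedding phi_theta(x) in the shared space
    class_semantics: (|Y_all|, d) semantic vectors a_k for all base and novel diseases
    """
    scores = F.cosine_similarity(image_feat.unsqueeze(0), class_semantics, dim=1)
    return int(scores.argmax())                      # index of the predicted disease in Y_all
```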
For real-world tomato disease surveillance, combining FSL/ZSL with continual learning and active learning (querying experts for the most informative samples) is a promising direction to maintain detectors as disease spectra evolve.

3.6. Domain Generalization and Domain Adaptation

3.6.1. Domain Shift in Tomato Leaf Datasets

A recurring issue in tomato leaf disease detection is the domain shift between training data (often captured under controlled laboratory conditions) and deployment environments (open fields, greenhouses, diverse camera devices). Let $P_{\text{source}}$ and $P_{\text{target}}$ denote the joint distributions of images and labels in the source and target domains, respectively. Standard supervised learning implicitly assumes $P_{\text{source}} = P_{\text{target}}$, which is violated in practice.
The performance drop of EMA-DeiT from PlantVillage to PlantDoc is a typical example of domain shift. Here, background clutter, illumination changes, cultivar variability, and different disease prevalence patterns all contribute to discrepancies between $P_{\text{source}}$ and $P_{\text{target}}$.

3.6.2. Domain Adaptation (DA)

Domain adaptation methods aim to leverage labeled source data and unlabeled (or sparsely labeled) target data to improve target performance. In the tomato context, this often corresponds to adapting a model trained on PlantVillage or controlled datasets to new field datasets. Common DA strategies include
  • Feature alignment: learning domain-invariant features by minimizing discrepancies between feature distributions of source and target leaves (a minimal sketch follows this list).
  • Style transfer and image translation: using GAN-based models to translate source images into target-like appearances (field style), thereby augmenting training data for robust disease recognition under target conditions.
  • Self-training: iteratively assigning pseudo-labels to confident target samples and retraining the model to better fit the target domain.
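As a minimal sketch of the feature-alignment strategy listed above, one can penalize a simple discrepancy measure, such as the linear-kernel maximum mean discrepancy (MMD), between batches of source and target leaf features; this is only one of many possible alignment objectives, and the names below are illustrative.

```python
import torch

def linear_mmd(source_feats: torch.Tensor, target_feats: torch.Tensor) -> torch.Tensor:
    """Linear-kernel MMD between source (e.g., laboratory) and target (field) feature batches.

    source_feats: (n_s, d) encoder features of labeled source images
    target_feats: (n_t, d) encoder features of unlabeled target images
    """
    delta = source_feats.mean(dim=0) - target_feats.mean(dim=0)
    return (delta * delta).sum()                     # squared distance between domain means

# Illustrative training objective: total = ce_loss_on_source + lambda_mmd * linear_mmd(f_src, f_tgt)
```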

3.6.3. Domain Generalization (DG)

Domain generalization seeks to learn models that perform well on unseen domains without access to target data during training. In tomato disease detection, this corresponds to training on multiple known environments (e.g., different regions or greenhouse conditions) and expecting the model to generalize to new farms, cultivars, and climates.
DG approaches relevant to tomato leaf disease detection include
  • Data-level diversification: constructing composite training sets from multiple datasets (PlantVillage, PlantDoc, Tomato-Village, etc.) and applying strong augmentations to simulate wider domain variability.
  • Representation learning with invariance: enforcing invariance across source domains via regularization or contrastive objectives, so that disease-discriminative features are insensitive to domain-specific factors such as background and imaging device.
  • Meta-learning for domains: treating each source domain as a meta-task and training models that can quickly adapt to novel domains using a small calibration set, which is particularly suitable for deployment on new farms with a limited number of annotated images.
For tomato leaf disease detection under limited data, DA and DG methods are complementary to transfer learning and SSL. They explicitly target the discrepancy between $P_{\text{source}}$ and $P_{\text{target}}$ and are crucial for achieving reliable performance in real-world, cross-region deployments.

3.7. Active Learning and Human-in-the-Loop Annotation

While most limited-data strategies assume a fixed labeled dataset, active learning explicitly exploits the presence of a large pool of unlabeled images and a limited annotation budget. The core idea is to iteratively select the most informative samples for expert labeling, thereby maximizing model performance per unit of annotation effort. This paradigm fits well with tomato disease monitoring, where large numbers of leaf images can be automatically captured (e.g., by smartphones, greenhouse cameras, or drones), but expert time for careful labeling is scarce.
Formally, let $\mathcal{U} = \{x_j\}_{j=1}^{n_u}$ denote a pool of unlabeled images and $\mathcal{L}$ the current labeled set. At each iteration, an acquisition function $a_\theta(x)$ ranks unlabeled samples according to their expected utility. The top-$B$ samples are selected for labeling by experts and moved from $\mathcal{U}$ to $\mathcal{L}$; the model is then retrained or fine-tuned on the expanded labeled dataset. Typical acquisition strategies include
  • Uncertainty-based sampling, which prioritizes samples with high predictive entropy or low margin between the top predicted classes, under the intuition that the model is currently unsure about these images (a minimal sketch follows this list).
  • Diversity-based sampling, which selects images that are not only uncertain but also diverse in feature space, reducing redundancy and covering a broader range of lesion appearances and environmental conditions.
  • Hybrid and task-aware strategies, which incorporate class imbalance, expected error reduction, or domain coverage (e.g., preferring images from underrepresented farms or cultivars).
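A minimal sketch of entropy-based acquisition is given below, assuming the current model outputs class probabilities for the unlabeled pool; names are illustrative.

```python
import torch

def select_most_uncertain(probs: torch.Tensor, budget: int) -> torch.Tensor:
    """Rank unlabeled images by predictive entropy and return indices of the top-B samples.

    probs:  (n_u, num_classes) softmax outputs of the current model on the unlabeled pool
    budget: number of images B to send to experts for annotation in this round
    """
    entropy = -(probs * torch.log(probs.clamp_min(1e-12))).sum(dim=1)  # per-image entropy
    return torch.topk(entropy, k=budget).indices                       # indices into the pool U
```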
In tomato leaf disease detection, active learning can be combined with mobile apps or web platforms where farmers periodically submit images; the system then selectively queries human experts for labels on a small subset of strategically chosen images. Over time, this leads to a curated dataset emphasizing informative and difficult cases, including rare diseases, atypical symptoms, and domain-shifted conditions. In multimodal settings, active learning can also guide the collection of complementary modalities (e.g., requesting additional textual descriptions or molecular assays for particularly ambiguous images), thus tightly coupling data acquisition with model training.
Despite its promise, active learning introduces practical challenges, such as the need for responsive annotation workflows, the computational cost of retraining models after each annotation batch, and the risk of acquisition bias (e.g., oversampling unusual but clinically less important cases). Careful design of acquisition strategies, batch sizes, and human–AI interaction protocols is therefore crucial to fully realize the benefits of active learning in resource-constrained agricultural environments.

3.8. Semi-Supervised Learning and Pseudo-Labeling

Semi-supervised learning (SSL in the narrow sense, distinct from self-supervised pretraining) addresses the common situation where a small labeled dataset is accompanied by a much larger unlabeled pool. In tomato leaf disease detection, unlabeled images can be collected at scale from farms and greenhouses, while only a fraction can be feasibly annotated by experts. Semi-supervised methods aim to leverage this unlabeled data to improve classification or detection performance.
A widely used family of approaches is pseudo-labeling. Given a model $f_\theta$ trained on the labeled set $\mathcal{L}$, pseudo-labels $\hat{y}_j = \arg\max_k f_\theta(x_j)_k$ are assigned to unlabeled samples $x_j \in \mathcal{U}$ whose prediction confidence exceeds a threshold $\tau$. These pseudo-labeled samples are then merged with $\mathcal{L}$ and used for further training, typically with lower loss weights to reflect their noisier nature. Iterating this process effectively enlarges the labeled dataset and encourages the model to learn decision boundaries aligned with the structure of the unlabeled data distribution.
More advanced semi-supervised techniques enforce consistency regularization, which encourages the model to produce similar predictions for different perturbations of the same image. Let $\tilde{x}_j^{(1)}$ and $\tilde{x}_j^{(2)}$ be two augmented versions of an unlabeled leaf image $x_j$. A consistency loss of the form
$\mathcal{L}_{\mathrm{cons}} = \frac{1}{n_u} \sum_{j=1}^{n_u} \left\| f_\theta(\tilde{x}_j^{(1)}) - f_\theta(\tilde{x}_j^{(2)}) \right\|^2$
encourages invariance to these augmentations, which can be designed to mimic realistic variations in tomato leaves (e.g., rotations, brightness changes, small occlusions). When combined with cross-entropy loss on labeled samples, such regularization helps the model to exploit the geometry of the unlabeled data manifold while remaining anchored to high-quality labels.
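Pseudo-labeling and consistency regularization are often combined in a single training step, as in FixMatch-style methods. The sketch below illustrates one such step under assumed weak and strong augmentations; all names are illustrative.

```python
import torch
import torch.nn.functional as F

def semi_supervised_step(model, x_lab, y_lab, x_unlab_weak, x_unlab_strong,
                         tau: float = 0.95, lambda_u: float = 1.0) -> torch.Tensor:
    """Combined loss on a labeled batch and two augmented views of an unlabeled batch."""
    # supervised cross-entropy on the small labeled set
    sup_loss = F.cross_entropy(model(x_lab), y_lab)

    # pseudo-labels from the weakly augmented view, kept only above the confidence threshold tau
    with torch.no_grad():
        probs = F.softmax(model(x_unlab_weak), dim=1)
        confidence, pseudo_labels = probs.max(dim=1)
        mask = (confidence >= tau).float()

    # consistency: the strongly augmented view should agree with the retained pseudo-labels
    unsup = F.cross_entropy(model(x_unlab_strong), pseudo_labels, reduction="none")
    unsup_loss = (unsup * mask).mean()

    return sup_loss + lambda_u * unsup_loss
```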
In practice, semi-supervised learning can be integrated with transfer learning, self-distillation, and domain adaptation. For instance, an EMA-stabilized teacher model can generate pseudo-labels for unlabeled field images, which are then used to fine-tune a lightweight student model for deployment on edge devices [37]. Potential pitfalls include confirmation bias (the model reinforcing its own mistakes through incorrect pseudo-labels) and degradation under severe class imbalance or high label noise. Combining semi-supervised techniques with active learning—e.g., selectively querying experts for labels on samples with conflicting pseudo-labels or low consistency—offers a promising hybrid strategy to improve both data efficiency and robustness.

3.9. Practical Guidelines for Combining Limited-Data Strategies in Tomato Applications

The techniques surveyed in this section—transfer learning, self-distillation and ensembles, augmentation, self-/semi-supervised learning, few-/zero-shot methods, domain adaptation/generalization, and active learning—are most effective when used in combination rather than in isolation. For practitioners aiming to build tomato leaf disease detectors under realistic constraints, it is therefore useful to distill a set of practical guidelines that link method choices to data regimes, computational budgets, and deployment requirements [60].
A first rule of thumb concerns the size and diversity of the labeled dataset. When only a few thousand labeled tomato images are available, and they come from a relatively homogeneous environment (e.g., a single greenhouse complex), transfer learning from ImageNet or crop-generic plant datasets should be considered mandatory [57]. In such cases, full fine-tuning of a moderately sized backbone (e.g., ResNet-50, DeiT-small) with strong but task-aware augmentation is typically preferable to training from scratch. When the labeled dataset grows beyond roughly $10^4$ samples and covers multiple farms or cultivars, it becomes more attractive to explore self-supervised pretraining on in-domain images (with or without ImageNet initialization), followed by supervised fine-tuning [64].
A second guideline addresses class imbalance. If some tomato diseases have fewer than a few hundred examples while others have thousands, it is rarely sufficient to rely on standard cross-entropy loss. Instead, practitioners should combine at least three complementary tools: (i) loss reweighting (class-balanced or focal losses); (ii) targeted augmentation and/or generative oversampling of minority classes; and (iii) architectural or training tricks that emphasize rare classes, such as ensemble self-distillation or curriculum learning [85]. Empirical evidence from both tomato and other crops indicates that such combinations can substantially improve macro-F1 and minority-class recall without sacrificing overall accuracy [60].
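As an illustration of loss reweighting, the focal loss down-weights easy, well-classified examples while per-class weights counteract skewed disease frequencies. The following is a minimal sketch; names are illustrative.

```python
import torch
import torch.nn.functional as F

def weighted_focal_loss(logits: torch.Tensor, targets: torch.Tensor,
                        class_weights: torch.Tensor, gamma: float = 2.0) -> torch.Tensor:
    """Focal loss with per-class weights for imbalanced tomato disease categories.

    logits:        (batch, num_classes) raw model outputs
    targets:       (batch,) ground-truth class indices
    class_weights: (num_classes,) weights, e.g., inversely proportional to class frequency
    """
    log_probs = F.log_softmax(logits, dim=1)
    ce = F.nll_loss(log_probs, targets, weight=class_weights, reduction="none")
    p_true = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1).exp()  # probability of the true class
    return ((1.0 - p_true) ** gamma * ce).mean()
```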
A third consideration is cross-domain robustness. When the final model is expected to operate on farms that differ substantially from the training environment (different regions, cultivars, imaging devices), domain adaptation and domain generalization become critical [86]. For example, one can pretrain or fine-tune on a composite dataset that merges PlantVillage, PlantDoc, Tomato-Village, and private field images, while using domain-adversarial or contrastive objectives to encourage invariant representations [63]. In deployment, it is advisable to monitor performance on a small annotated calibration set from each new farm and, if necessary, perform lightweight adaptation (e.g., updating only batch-normalization statistics or the final classifier layer) [56].
From the perspective of annotation cost, semi-supervised and active learning methods are most effective when integrated into a continuous data pipeline rather than applied as one-off stages [60]. A practical pattern is to deploy an initial model trained with transfer learning on an existing dataset, use it to screen incoming images on-farm, and then interleave two kinds of feedback loops: automatic pseudo-labeling of high-confidence samples and expert annotation of a small batch of low-confidence or highly uncertain samples identified by an acquisition function [64]. Over time, this can lead to a curated dataset that is both larger and more informative than the original seed set.
Hardware constraints impose another layer of design choices. On resource-rich servers or cloud platforms, one can afford heavier backbones, complex ensembles, and iterative SSL or DA schemes. On edge devices such as smartphones or embedded greenhouse controllers, lightweight networks like KD-ShuffleNetV2 or YOLOv11n are more appropriate [68]. In the latter case, a reasonable strategy is to first train a high-capacity teacher model using the full arsenal of limited-data techniques (SSL, DA, semi-supervised learning) and then distill its knowledge into a compact student network suitable for on-device inference [63]. This “train big, deploy small” paradigm has been successfully applied in other agricultural vision tasks and fits well with the tomato disease detection problem [58].
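The “train big, deploy small” step can be sketched as standard knowledge distillation, in which a compact student is trained to match the temperature-softened predictions of a high-capacity teacher. The code below is a minimal sketch under assumed teacher and student models; names are illustrative.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor,
                      targets: torch.Tensor, T: float = 4.0, alpha: float = 0.7) -> torch.Tensor:
    """Combine hard-label cross-entropy with soft-label matching to a frozen teacher.

    T:     temperature that softens both output distributions
    alpha: weight of the distillation term relative to the hard-label term
    """
    soft_teacher = F.softmax(teacher_logits / T, dim=1)
    log_soft_student = F.log_softmax(student_logits / T, dim=1)
    kd = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, targets)
    return alpha * kd + (1.0 - alpha) * ce

# Usage (illustrative): the teacher is frozen; only the lightweight student is updated.
```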
Finally, we stress the importance of transparent reporting when combining limited-data strategies. For each new tomato leaf disease model, authors should specify the exact datasets and splits used; the augmentation policies; whether and how transfer learning, SSL, DA, or semi-/few-shot learning were applied; and the computational budget (GPU hours, memory) required to reproduce the results [77]. Clear documentation not only facilitates fair comparison but also helps practitioners decide which subset of techniques is feasible in their own context.

4. Technical Solutions for Multimodal Data Fusion in Tomato Leaf Disease Detection

Tomato disease diagnosis increasingly relies on heterogeneous modalities, such as RGB images, textual descriptions of symptoms, and, for viral diseases, molecular assays. An overview of key modalities and common fusion strategies is provided in Table 7.
Figure 3 provides an overview of a generic multimodal fusion pipeline for tomato leaf disease diagnosis and image–text retrieval.

4.1. Taxonomy of Multimodal Fusion Strategies

Let $x_{\text{img}} \in \mathbb{R}^{d_{\text{img}}}$, $x_{\text{text}} \in \mathbb{R}^{d_{\text{text}}}$, and $x_{\text{mol}} \in \mathbb{R}^{d_{\text{mol}}}$ denote feature vectors extracted from the image, text, and molecular modalities, respectively. Existing tomato leaf disease studies adopt three main fusion paradigms.
In feature-level (early) fusion, modality-specific encoders produce features that are concatenated or otherwise combined before classification,
$z = g\!\left( x_{\text{img}}, x_{\text{text}}, x_{\text{mol}} \right), \qquad p(y \mid x) = h(z),$
where $g$ may be simple concatenation, gated fusion, or an attention-based module. Early fusion is expressive but more prone to overfitting under limited data, given the increased feature dimensionality and parameter count.
Decision-level (late) fusion processes each modality with an independent classifier and aggregates predictions at the decision level, for example,
$p_k = \alpha\, p_k^{\text{img}} + \beta\, p_k^{\text{text}} + \gamma\, p_k^{\text{mol}},$
where $p_k^{\text{img}}$ is the probability for class $k$ based on images and $(\alpha, \beta, \gamma)$ are learned or manually tuned fusion weights. Late fusion is robust to missing modalities and heterogeneous data acquisition, but may underutilize fine-grained cross-modal interactions.
Hybrid fusion combines the two ideas, often using cross-attention so that one modality queries another at the feature level while still retaining modality-specific decision heads. LAFANet is a representative example that introduces learnable fusion attention between image and text features and refines similarity scores at the retrieval stage.
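To make the contrast between the first two paradigms concrete, the sketch below implements feature-level concatenation fusion and decision-level weighted averaging side by side; dimensions and names are illustrative, and the hybrid (cross-attention) variant is omitted for brevity.

```python
import torch
import torch.nn as nn

class EarlyFusionClassifier(nn.Module):
    """Feature-level fusion: concatenate modality features, then classify (z = g(x_img, x_text, x_mol))."""
    def __init__(self, d_img: int, d_text: int, d_mol: int, num_classes: int):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(d_img + d_text + d_mol, 256), nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, x_img, x_text, x_mol):
        z = torch.cat([x_img, x_text, x_mol], dim=1)   # simple concatenation as g(.)
        return self.head(z)                            # logits for p(y | x) = h(z)

def late_fusion(p_img, p_text, p_mol, alpha=0.5, beta=0.3, gamma=0.2):
    """Decision-level fusion: weighted average of per-modality class probabilities."""
    return alpha * p_img + beta * p_text + gamma * p_mol
```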

Critical Remarks

The trade-offs and a detailed comparison of fusion strategies are summarized in Table 8 and Table 9, respectively. Early fusion can achieve strong accuracy when paired multimodal samples are sufficient, but it is prone to overfitting and is fragile under modality missingness. Late fusion is computationally cheaper and naturally robust to missing modalities, yet it may under-exploit fine-grained symptom–lesion correspondences. Hybrid fusion (e.g., cross-attention or mixture-of-experts) often offers the best balance between accuracy and interpretability, but requires careful regularization and ablations to justify the extra complexity under limited paired data.
In tomato leaf disease detection, where multimodal data (image–text, image–molecular, or image–environmental) are often partially missing and highly imbalanced, the choice of fusion strategy must carefully trade off expressiveness, robustness, and computational cost.

4.2. Image-Text Fusion for Retrieval and Recognition

A representative line of work on multimodal fusion for tomato diseases leverages paired image–text data to support cross-modal retrieval and assistive diagnosis. In this setting, each tomato leaf image is accompanied by a structured or free-text symptom description, and the model learns to align visual and textual representations in a joint embedding space. This enables practical workflows where farmers provide an image and receive ranked symptom descriptions or, conversely, enter a textual query and retrieve similar disease images.
LAFANet exemplifies this paradigm on the TLDITRD dataset [48]. It employs a ViT backbone as an image encoder and a BERT-like network as a text encoder. The two encoders produce high-dimensional feature vectors, which are then fused via a Learnable Fusion Attention (LFA) module. LFA allows the model to emphasize cross-modal correspondences: image regions that are strongly related to particular textual tokens receive higher attention weights, and vice versa. To further refine the training signal, LAFANet introduces a False Negative Elimination–Adversarial Negative Selection (FNE–ANS) mechanism that reduces the impact of hard but semantically unrelated negative pairs in the triplet loss.
On TLDITRD, LAFANet achieves Recall@1 of 81.7% for image-to-text retrieval and 80.3% for text-to-image retrieval, outperforming prior methods such as SCAN and FNE (e.g., SCAN attains 72.1% R@1 in image-to-text retrieval) [48]. Qualitative analyses show that LAFANet tends to focus on lesion regions consistent with symptom phrases like “yellow curled leaves” or “necrotic spots along the margin”, which improves interpretability compared with unimodal baselines.
Despite these advances, current image–text fusion models are trained on relatively small and domain-specific corpora. Their performance is sensitive to annotation quality, and they may struggle to generalize to free-form farmer descriptions that deviate from curated expert terminology. Adapting larger, generic vision–language models to the tomato domain while preserving efficiency and robustness is therefore an important direction, as discussed in Section 4.5.
From a data perspective, constructing high-quality image–text datasets for tomato diseases poses its own challenges. Expert-written descriptions often use technical terminology and assume background knowledge (e.g., names of cultivars, growth stages, or pathogen species), whereas farmers’ descriptions collected via mobile apps tend to be short, noisy, and context-dependent (e.g., “leaves turning yellow after rain”). Bridging this linguistic gap may require the design of controlled vocabularies or symptom ontologies that map colloquial expressions to standardized descriptors. Moreover, many tomato-growing regions are multilingual; extending image–text models to handle multiple languages, dialects, and writing systems would increase their accessibility but also complicate data collection and model training. Adapting pre-trained multilingual language models, combined with modest amounts of tomato-specific parallel text (e.g., expert descriptions translated into local languages), is a promising direction to make multimodal tomato disease tools usable beyond research settings.
In real-world advisory scenarios, image–text fusion can support interactive diagnostic workflows rather than one-shot predictions. For example, a system might first retrieve several candidate diseases with associated symptom descriptions and ask the user follow-up questions (e.g., “Do you observe lesions mainly on lower leaves?”, “Have symptoms spread to fruits?”) to refine the diagnosis. Implementing such conversational loops requires integrating multimodal retrieval models with dialogue management and uncertainty estimation, ensuring that the system knows when to defer to human experts. While such interactive systems are still emergent in the tomato domain, they represent a natural evolution from static image classifiers toward collaborative human–AI decision support tools.

4.3. Viral and Image Data Fusion for TYLCV Detection

Viral molecular data offer a complementary modality for diseases caused by specific pathogens such as TYLCV. While image-based models capture visible symptoms (chlorosis, leaf curling, stunting), molecular assays directly quantify viral load and characterize genomic variants. Fusing these modalities can in principle improve early detection, disentangle co-infections, and provide mechanistic insights into symptom development [54].
Recent virology studies on TYLCV and its associated betasatellites illustrate the potential of such fusion. The TYLCV genome encodes several open reading frames, including C4, which is implicated in symptom induction and modulation of host defenses. Betasatellite-encoded proteins such as βC1 further contribute to pathogenicity and can alter host methylation patterns [54]. Experimental analyses have linked specific mutations in C4 and βC1 to variations in symptom severity, including differences in leaf curling and yellowing intensity.
From a modeling perspective, one can treat molecular descriptors (e.g., sequence-derived features, methylation levels, or protein expression indicators) as an auxiliary feature vector x mol that complements image features x img . Fusion can then be performed via early or hybrid strategies, as described in Section 4.1, with the aim of learning joint representations that correlate molecular signatures with visual symptom patterns. In practical deployment, molecular assays may only be available for a subset of plants (e.g., sampled plots in large fields), so models must also handle missing molecular inputs gracefully and degrade to image-only operation when necessary.
At present, most TYLCV-related works still analyze molecular and image data in parallel rather than within a single unified learning framework. Developing truly multimodal models that integrate viral genomics with leaf imagery, while respecting the cost and sparsity of molecular measurements, remains an open opportunity at the intersection of plant pathology, virology, and machine learning.

4.4. Integration with IoT Sensors and Remote Sensing

Beyond RGB images, tomato production systems increasingly generate heterogeneous data streams from IoT devices and remote sensing platforms. Greenhouse deployments may include sensors measuring air temperature, relative humidity, CO2 concentration, soil moisture, and leaf wetness, while open-field systems can be monitored using UAV or satellite imagery capturing spectral indices related to canopy vigor and stress. These modalities can provide early, indirect indicators of disease risk (e.g., prolonged leaf wetness conducive to fungal infection) or spatial patterns of canopy decline that are not yet visible at the single-leaf level.
From a modeling standpoint, environmental and remote-sensing data can be treated as additional modalities x env and x rs that complement leaf-scale image features. Fusion strategies range from simple feature concatenation (e.g., combining summary statistics of sensor readings with CNN features) to more structured architectures such as temporal convolutional networks or recurrent models that explicitly capture the dynamics of environmental conditions. For UAV imagery, multi-scale approaches that jointly analyze plot-level vegetation indices and leaf-level images may help to link field-scale patterns with individual plant symptoms, thereby improving the targeting of scouting and intervention.
However, integrating IoT and remote-sensing modalities also introduces practical challenges. Sensor coverage may be sparse or uneven, data streams can be incomplete or noisy (e.g., sensor failures, communication dropouts), and time alignment between modalities must be carefully handled. Moreover, the added complexity of collecting and maintaining such infrastructure may be prohibitive for smallholder farmers. As a result, there is a trade-off between sophisticated multimodal systems that leverage rich contextual information and simpler image-only solutions that are easier to deploy at scale. Future research should therefore not only explore advanced fusion architectures but also systematically evaluate their cost–benefit balance under realistic farm conditions, including hardware costs, maintenance requirements, and user training needs.

4.5. Emerging Multimodal Paradigms: CLIP-Style and Cross-Attention Models

Beyond task-specific multimodal architectures, recent advances in vision–language pretraining have introduced CLIP-style models that jointly learn image and text encoders on large-scale web data. Their core objective is to maximize the similarity between matched image–text pairs and minimize it for mismatched pairs, using a contrastive loss. Adapting such paradigms to tomato leaf disease detection offers two main opportunities: (i) a unified embedding space for images and symptom descriptions, so that disease images and expert-written symptom phrases can be mapped together to support zero-shot classification, cross-modal retrieval, and human-in-the-loop diagnosis without per-disease retraining; and (ii) transformer-based cross-attention mechanisms, which can highlight image regions corresponding to specific textual tokens (e.g., “leaf margin necrosis”), thereby improving interpretability and offering visual explanations that are more acceptable to agronomists and farmers.
However, directly training CLIP-style models from scratch in the tomato domain is unrealistic given the scale of data typically available. A more feasible path is to start from generic vision–language models and perform domain-specific adaptation using curated tomato image–text pairs, while constraining model size for edge deployment. This line of research tightly couples multimodal fusion with limited-data techniques such as transfer learning, SSL, and few-shot adaptation.

4.6. Design Patterns and Ablation Studies for Multimodal Tomato Systems

Given the diversity of tomato production systems and sensing infrastructures, it is unlikely that a single multimodal architecture will be optimal for all scenarios. Instead, several recurring design patterns have emerged in the broader vision–language and multimodal learning literature that can be adapted to tomato leaf disease detection [62]. Here we summarize three such patterns and discuss how they interact with limited data.
The first pattern is the two-tower contrastive encoder, exemplified by CLIP-style models [87]. Image and text encoders are trained separately but jointly, using a contrastive loss that pushes matched image–text pairs together and unmatched pairs apart in a shared embedding space. In the tomato domain, this pattern underlies LAFANet and related image–text retrieval frameworks [78]. Its main advantages are architectural simplicity, flexibility in handling missing modalities at inference time, and natural support for zero-shot classification via text prompts describing diseases. However, two-tower models rely heavily on large numbers of image–text pairs to avoid representation collapse, which is a challenge for tomato-specific datasets. A promising compromise is to initialize encoders from generic CLIP models and perform lightweight adaptation on curated tomato image–text pairs [77].
The second pattern is single-stream cross-attention, where image patches and text tokens are concatenated and processed jointly by a Transformer with cross-modal attention layers [88,89]. This design often yields stronger fine-grained alignment between lesion regions and symptom phrases, improving interpretability and performance on tasks such as visual question answering or interactive diagnosis. For example, cross-attention can highlight leaf margins when the text mentions “marginal necrosis” or focus on interveinal areas when the description refers to “interveinal chlorosis”. In tomato disease detection, such models could support dialog-style interfaces where farmers ask follow-up questions about likely diseases [78]. The trade-off is higher computational cost and a stronger dependence on high-quality multimodal supervision, which again motivates the use of transfer learning and careful data curation.
A third pattern is mixture-of-expert fusion, where separate experts are trained for different modality subsets (e.g., image-only, image+text, image+molecular, image+sensor), and a gating mechanism selects or weights experts at inference time based on available inputs [77]. This is particularly suitable for tomato production environments with heterogeneous and intermittent modalities: for instance, a model could fall back to an image-only expert for everyday smartphone photos, switch to an image+text expert when agronomists provide detailed symptom descriptions, and leverage an image+molecular expert when TYLCV qPCR results are available [54]. Designing and training such mixtures under limited data is non-trivial, but early experiments in other agricultural domains suggest that parameter sharing across experts and multi-task learning can alleviate data fragmentation [63].
To make multimodal models scientifically and practically credible, rigorous ablation studies are essential. At a minimum, we recommend that new tomato multimodal systems report: (i) performance of each modality in isolation (image-only, text-only, etc.); (ii) incremental gains from adding each additional modality or fusion component; and (iii) robustness under synthetic modality dropout, where some modalities are randomly masked at inference time [77]. Such ablations help disentangle how much of the performance improvement comes from multimodal synergy versus simply using a stronger backbone or more data. They also reveal failure modes where models over-rely on spurious correlations in one modality (e.g., background color in images or certain keywords in text).
Finally, multimodal tomato systems should be evaluated not only on standard retrieval and classification metrics but also on human-centered criteria. These include the interpretability of cross-modal attention maps to agronomists, the consistency of textual explanations with established symptom descriptions, and the ease with which farmers can provide the necessary inputs (e.g., taking photos, answering symptom-related questions) [62]. Incorporating user studies—even small-scale ones—into the evaluation protocol will be crucial for ensuring that multimodal models transition from research prototypes to trusted tools in precision tomato production.

5. Case Studies: Typical Models and Benchmark Datasets

To concretize the above discussions, we next analyze representative tomato leaf disease models under different capacity and deployment constraints. Figure 4 positions typical architectures along two axes: model capacity/modality and deployment requirements.

5.1. Benchmark Datasets for Tomato Leaf Disease Detection

Public benchmark datasets play a central role in evaluating and comparing tomato leaf disease detection methods. Table 10 summarizes representative datasets frequently used in the literature, including their modalities, number of images, disease categories, and acquisition conditions.
  • PlantVillage: a large-scale laboratory-style dataset with uniform backgrounds and controlled lighting, widely adopted as a starting point for tomato disease classification. It contains multiple tomato diseases (e.g., early blight, late blight, leaf mold, Septoria leaf spot, Tomato yellow leaf curl virus), with substantial class imbalance between common and rare diseases.
  • PlantDoc: a field-style dataset with natural backgrounds, occlusions, and illumination variations. Compared to PlantVillage, PlantDoc better reflects real-world complexity but provides fewer samples per class.
  • Dataset of Tomato Leaves: a mid-sized dataset with natural backgrounds, focusing on a subset of common tomato leaf diseases and healthy leaves, often used to evaluate the robustness of models fine-tuned from PlantVillage.
  • Tomato-Village and related field datasets: newer datasets designed to stress-test generalization to rare diseases (e.g., tomato leaf miner, spotted wilt) and diverse cultivation systems. They typically contain fewer images for rare classes, exacerbating limited-data and class-imbalance issues.
  • Multimodal datasets (e.g., TLDITRD): paired image–text datasets where each tomato leaf image is accompanied by structured symptom descriptions or free-text annotations. These datasets enable the study of image–text retrieval and multimodal fusion.
In addition, several works construct private field datasets collected from commercial greenhouses or experimental farms, often combining RGB images with environmental sensor readings or molecular assays. While these datasets are crucial for evaluating practical deployment scenarios, their restricted availability hampers reproducibility and cross-method comparison, highlighting the need for more open, multimodal, and cross-region benchmark datasets.
For object detection tasks, recent tomato works typically report mAP@0.5 and occasionally mAP@0.5:0.95 to reflect performance across a range of IoU thresholds.
For image–text retrieval tasks, Recall@K (R@K) is widely used. Given a query (image or text), R@K is defined as the probability that at least one ground-truth match appears in the top-K retrieved items. Higher R@1, R@5, and R@10 indicate better cross-modal alignment.
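Recall@K can be computed directly from a cross-modal similarity matrix. The sketch below assumes that row $i$ holds the similarities between query $i$ and all candidates, with the ground-truth match located at index $i$; names are illustrative.

```python
import torch

def recall_at_k(similarity: torch.Tensor, k: int) -> float:
    """Fraction of queries whose ground-truth match is ranked within the top-k candidates.

    similarity: (num_queries, num_candidates) cross-modal similarity scores,
                where the correct candidate for query i has index i.
    """
    topk = similarity.topk(k, dim=1).indices                        # (num_queries, k)
    ground_truth = torch.arange(similarity.size(0), device=topk.device).unsqueeze(1)
    hits = (topk == ground_truth).any(dim=1).float()
    return hits.mean().item()

# e.g., R@1, R@5, R@10: [recall_at_k(sim, k) for k in (1, 5, 10)]
```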
Besides these primary metrics, some studies also analyze model complexity (number of parameters, FLOPs), inference latency on specific hardware (GPU, edge devices, smartphones), and robustness indicators (performance under synthetic corruptions or cross-dataset evaluation), which are crucial for practical deployment in precision agriculture.

5.2. Model Performance on Limited Data

This subsection summarizes representative tomato leaf disease models designed with limited data in mind and analyzes their behavior across different datasets.

5.2.1. EMA-DeiT

EMA-DeiT combines a DeiT-based Transformer backbone with an Exponential Moving Average mechanism and self-distillation to enhance generalization under limited data [6]. Evaluated on four datasets—PlantVillage (10 diseases), Dataset of Tomato Leaves (6 diseases), PlantDoc (8 diseases), and Tomato-Village (8 diseases)—the model achieves 99.6%, 98.2%, 97.1%, and 97.6% accuracy, respectively. Compared with ResNet50 and a DeiT-small variant trained under similar conditions, EMA-DeiT consistently yields 1–3 percentage points higher accuracy on controlled datasets and a smaller, though still notable, improvement on field datasets.
The discrepancy between PlantVillage and PlantDoc performances highlights the residual domain shift: despite aggressive augmentation and transfer learning, accuracy on field images remains a few points lower. Cross-dataset experiments further show that models trained solely on PlantVillage tend to overfit to homogeneous backgrounds and lighting, underscoring the need to explicitly incorporate domain adaptation or domain generalization techniques.

5.2.2. KD-ShuffleNetV2

KD-ShuffleNetV2 targets the complementary goal of maintaining high accuracy while drastically reducing model size for edge deployment [37]. Built upon a ShuffleNetV2 backbone, it introduces an ensemble self-distillation framework and is trained on an aggregated dataset combining PlantVillage, AI Challenger 2018, PlantDoc, and a Taiwan tomato disease dataset. The resulting model attains 95.08% accuracy, with 94.58% precision and 94.55% recall, using only 1.27 M parameters—substantially fewer than MobileNetV2 (2.26 M parameters).
In addition to overall accuracy, KD-ShuffleNetV2 improves per-class recall for several minority diseases compared with vanilla ShuffleNetV2, reflecting the benefits of ensemble self-distillation under imbalanced data. These results suggest that carefully designed lightweight backbones can serve as strong baselines when combined with limited-data techniques, especially for deployment on smartphones or embedded devices in greenhouses.

5.2.3. YOLOv11m with Hyperparameter Optimization

For object detection tasks, YOLOv11m represents a modern, high-capacity detector adapted to tomato leaf disease localization [68]. Trained on an improved dataset containing 22,000 images from 11 classes, and optimized via a combination of one-factor-at-a-time (OFAT) analysis and random search, the model achieves a fitness score of 0.99268, precision of 0.99190, recall of 0.99348, and mAP@0.5 of 0.99262. These numbers outperform a lighter YOLOv11n baseline (fitness 0.98395) and illustrate the impact of systematic hyperparameter tuning.
However, such high performance is obtained on a relatively large and curated dataset; it may not directly translate to low-data or cross-domain scenarios. Integrating YOLOv11m with limited-data strategies such as transfer learning from multispecies plant datasets, self-supervised pretraining on unlabeled greenhouse footage, or domain adaptation to new farms is a promising avenue for future work.

5.3. Multimodal Model Performance

LAFANet provides a concrete example of how multimodal fusion can improve performance on image–text retrieval tasks relevant to tomato disease diagnosis [48]. On TLDITRD, which contains 6000 image–text pairs for six diseases, LAFANet reaches 81.7% Recall@1 for image-to-text retrieval and 80.3% Recall@1 for text-to-image retrieval. Compared with SCAN and FNE baselines, which achieve lower R@1 scores, the gains can be attributed to the Learnable Fusion Attention module and the FNE–ANS strategy that filters out misleading negatives during contrastive training.
From a practical viewpoint, these retrieval metrics translate into more reliable cross-modal search: given a diseased leaf image, the correct textual description is likely to appear among the top suggestions, and vice versa. Nevertheless, evaluation is still conducted on a single, relatively small multimodal dataset, and robustness to noisy, free-form farmer descriptions remains largely unexplored. Representative model–dataset pairs and reported results are summarized in Table 11. Systematic benchmarking of multimodal models across multiple datasets and languages is needed to fully assess their potential for real-world agricultural advisory systems.

5.4. Comparative Summary of Model Families

To synthesize the discussion of representative models, Table 12 compares typical tomato leaf disease detection architectures along several axes: backbone family, parameter scale, primary modality, and qualitative strengths and limitations. The numbers for parameter counts and metrics are indicative rather than exhaustive, as implementations and training protocols differ across studies.
Overall, three broad trends can be observed. First, lightweight CNNs and compact detectors are increasingly favored for deployment on mobile and edge devices, especially in smallholder contexts where internet connectivity and computational resources are limited. Techniques such as self-distillation, knowledge distillation from larger teachers, and careful hyperparameter tuning have narrowed the performance gap between these models and heavier backbones [68]. Second, Transformer-based architectures and hybrid CNN–Transformer networks offer strong representational power and are particularly attractive when combined with transfer learning and in-domain self-supervised pretraining, but their computational cost may constrain their use to server-side processing or high-end devices [6]. Third, multimodal fusion models, while still relatively rare in the tomato literature, point toward a future in which disease diagnosis integrates images, expert text, and possibly molecular or sensor data into a coherent decision-support framework [54].
For practitioners, the choice of model family should be informed not only by benchmark accuracy but also by the specific deployment scenario: on-device versus cloud inference, expected image quality, availability of multimodal data, and required level of interpretability. In research settings, systematic cross-family comparisons under standardized protocols and datasets would help clarify these trade-offs and identify promising directions for next-generation tomato disease detection systems.

5.5. Cross-Dataset Evaluation and Ablation Practices

As tomato leaf disease research matures, evaluation protocols must evolve beyond single-dataset accuracy reporting. Cross-dataset and cross-domain evaluations are particularly important in this field because models trained on one dataset (e.g., PlantVillage) are often deployed in very different environments (e.g., smallholder fields or commercial greenhouses) [6,26,57,91]. Without such evaluations, it is difficult to assess whether a method truly improves generalization or merely overfits a specific benchmark.
Figure 5 summarizes a practical cross-dataset protocol, including train-on-A test-on-B evaluation and leave-one-domain-out validation across domains (e.g., farms, devices, illumination). This workflow makes the domain shift explicit and prevents overly optimistic in-domain reporting.
For multimodal systems, Figure 6 highlights a recommended ablation routine with a clear main line: (1) modality-specific ablation (image-only vs. text-only vs. fused), followed by (2) gradual fusion gains (late fusion → cross-attention fusion → fully joint tuning), so that each added component has an attributable improvement rather than a bundled, non-diagnosable gain.
A simple yet informative protocol is train-on-A, test-on-B. For instance, models can be trained on PlantVillage and tested on PlantDoc or Tomato-Village without any adaptation [26]. The resulting accuracy or macro-F1 provides a direct measure of domain shift. Conversely, training on field datasets and testing on laboratory-style images can reveal whether the model has learned disease-discriminative features that transfer across background and illumination changes. When combined with domain adaptation methods, one can further compare “source-only”, “adapted”, and “oracle” (joint training) baselines to quantify the benefits and limitations of adaptation schemes [86].
Another useful practice is leave-one-domain-out evaluation, where models are trained on multiple domains (e.g., PlantVillage + Tomato-Village) and tested on the held-out domain (e.g., PlantDoc) [58]. This protocol is naturally aligned with domain generalization and meta-learning approaches that explicitly aim to learn representations robust to unseen environments. Reporting performance across multiple such splits provides a more balanced picture of model robustness than a single train/test partition.
Within each dataset, ablation studies should go beyond architecture comparisons to include the limited-data techniques discussed in Section 3. For example, when proposing a new Transformer-based backbone, authors should systematically examine the impact of (i) transfer learning versus training from scratch, (ii) with or without self-distillation, (iii) standard versus class-balanced or focal loss, and (iv) weak versus strong augmentation [60]. Such ablations are especially important in tomato disease detection because different techniques may benefit different disease categories: minority classes often gain more from loss reweighting and data augmentation, while majority classes may benefit more from self-distillation and SSL.
For multimodal models, cross-dataset evaluation should ideally consider both intra-dataset and cross-dataset generalization of multimodal alignment. For example, image–text retrieval performance can be measured not only on TLDITRD but also on held-out subsets that emulate new symptom vocabularies or new environmental conditions [78]. When possible, models should also be evaluated on cross-lingual or cross-annotation scenarios, where text descriptions come from different sources (research articles versus extension bulletins versus farmer reports) [77]. These experiments would shed light on whether multimodal tomato systems are robust to linguistic and cultural variability.

Actionable Ablations

We recommend reporting modality-specific baselines and gradual fusion gains:
  • Unimodal baselines: image-only, text-only, molecular-only, and environmental-only performance.
  • Incremental fusion: image+text, image+sensor, image+molecular, and full fusion; report the gain Δ over image-only.
  • Modality dropout: randomly mask one modality at inference to test robustness under missing inputs.
In addition to these quantitative practices, we advocate for the release of code, trained weights, and detailed training logs (including hyperparameters, learning schedules, and data splits) for tomato leaf disease models whenever possible [60]. Publicly available baselines—for example, reference implementations of EMA-DeiT, KD-ShuffleNetV2, YOLOv11m, and LAFANet—would greatly facilitate fair comparison and accelerate progress. Establishing community benchmarks with standardized train/validation/test splits across multiple datasets, similar to those emerging in other plant phenotyping and agricultural vision tasks [61], is a natural next step for the tomato disease detection community.

6. Current Challenges and Future Opportunities

Building practical tomato leaf disease detection systems requires not only strong models but also an end-to-end pipeline that spans data acquisition, model training, deployment, and feedback. Figure 7 sketches such a pipeline under limited-data constraints.

6.1. Technical Bottlenecks

Although recent models such as EMA-DeiT, KD-ShuffleNetV2, YOLOv11m, and LAFANet achieve impressive results on benchmark datasets, several technical limitations remain. First, generalization to complex field environments is still fragile. Performance drops between PlantVillage and PlantDoc or Tomato-Village—often on the order of several percentage points—indicate that current models do not fully capture the variability in background, illumination, cultivar, and disease spectrum [68]. Existing domain adaptation and generalization strategies are only beginning to be applied systematically in this context.
Second, multimodal fusion architectures are largely treated as black boxes. Attention maps and intermediate fusion features are rarely analyzed in relation to agronomic knowledge, such as which symptom attributes or genomic markers drive specific decisions [48,54]. Without clearer interpretability, it is difficult for farmers and plant pathologists to trust model outputs, especially when recommendations conflict with field experience.

6.2. Data and Standardization Issues

On the data side, the ecosystem of tomato disease datasets is fragmented. Image-only datasets like PlantVillage and PlantDoc, image–text datasets like TLDITRD, and virus-focused datasets centered on TYLCV genomics all use different label taxonomies, acquisition protocols, and annotation formats [54]. This heterogeneity complicates cross-method comparison and makes it hard to assess true progress across studies.
Furthermore, most public datasets are geographically limited. Many focus on specific regions or production systems, leading to models that overfit to local cultivars and climate patterns. Cross-region generalization—for example, from East Asia to the Mediterranean basin or the Middle East—is rarely evaluated, even though TYLCV and other diseases are global threats.
Another cross-cutting challenge concerns model interpretability and user trust. Most state-of-the-art deep models for tomato disease detection operate as high-dimensional function approximators with limited direct biological or agronomic interpretation. Although saliency maps, class activation maps, and attention visualizations can highlight regions deemed relevant by the model, these explanations are often coarse and sensitive to implementation details. For multimodal models, understanding how textual and molecular signals influence decisions is even more complex. Developing explanation techniques that align with domain experts’ mental models—for example, decomposing predictions into contributions from specific symptom attributes, lesion types, or environmental risk factors—could greatly enhance the acceptability and diagnostic value of AI tools in plant health surveillance.
Reproducibility and benchmarking also remain open issues. Differences in dataset splits, preprocessing pipelines, augmentation policies, and evaluation metrics can lead to substantial variability in reported performance across studies, even when using similar architectures. Few works release complete code, trained weights, and detailed training recipes, which hampers fair comparison and practical adoption. Establishing publicly available, versioned benchmarks with clearly defined training and test splits, accompanied by baseline implementations of representative models (e.g., CNN, Transformer, detector, and multimodal baselines), would provide a more solid foundation for progress. For multimodal data, common protocols for constructing and evaluating image–text or image–molecular pairs are particularly needed.
From a broader systems perspective, future tomato disease detection research should increasingly move beyond isolated model accuracy toward end-to-end evaluation of decision-support workflows. This includes assessing how AI-assisted diagnosis influences farmer behavior (e.g., timing and intensity of pesticide applications), yield, input use efficiency, and environmental outcomes. For instance, highly sensitive detectors that over-predict disease presence may lead to unnecessary chemical applications, whereas overly conservative models may fail to trigger timely interventions. Coupling disease detection models with economic and environmental impact assessments, as well as with integrated pest management (IPM) guidelines, could help ensure that AI tools contribute to both profitability and sustainability in tomato production systems.
Finally, many methodological advances developed for tomato leaf disease detection are likely transferable to other crops and stressors. Active or semi-supervised learning pipelines designed to handle limited, imbalanced, and noisy tomato datasets can be adapted to grapevine, wheat, or rice diseases, while multimodal fusion frameworks can generalize to scenarios involving pest damage, nutrient deficiencies, or abiotic stress. Conversely, insights from cross-crop studies—such as meta-learning across plant species, universal leaf encoders, or crop-agnostic symptom ontologies—could further improve tomato disease models and reduce their dependence on crop-specific labeled data. Exploring such cross-domain transfer and generalization is a promising direction for building more scalable, adaptable plant health monitoring systems.
From a human-centered perspective, it is also crucial to design interfaces that integrate seamlessly into existing workflows. Tomato growers and extension agents typically make disease management decisions under time pressure and with limited connectivity; they may prefer concise, actionable recommendations (e.g., “high risk of early blight; inspect lower leaves and consider fungicide X if confirmed”) over raw probability scores or complex visualizations [57]. Furthermore, different stakeholders—smallholder farmers, greenhouse technicians, plant pathologists, and policymakers—have distinct information needs and levels of technical expertise. Future tomato disease systems should therefore be co-designed with end-users, incorporating participatory design and user-testing methodologies borrowed from human–computer interaction and agricultural extension research [77]. Such efforts will be essential to convert algorithmic advances into real-world impact.

6.3. Edge-Ready Deployment: Example Frameworks and Hardware Requirements

Example edge-hardware considerations are summarized in Table 13. We outline practical deployment frameworks for real farms:
  • Smartphone diagnosis: image-only baseline with optional symptom text; lightweight CNN/ViT + on-device quantization.
  • Greenhouse edge box: camera + environment sensors; periodic inference + risk alerts; robust to missing sensors.
  • Hybrid cloud–edge: edge performs screening; uncertain cases uploaded for multimodal fusion and expert review.

6.4. Future Opportunities

From a methodological perspective, integrating lightweight architectures with advanced limited-data learning is an attractive research direction. Combining compact backbones such as ShuffleNetV2 or YOLOv11n with self-distillation, domain generalization, and multimodal fusion could yield models that are both accurate and deployable on edge devices [68]. Exploring CLIP-style vision–language pretraining and cross-attention mechanisms in a tomato-specific setting, while controlling model size, may further improve robustness and open up zero-shot or few-shot capabilities.
On the data and platform side, there is a clear need for large-scale, cross-scenario multimodal datasets that cover multiple regions, cultivars, management practices, and disease spectra, with standardized label taxonomies and metadata schemas [48]. Such datasets should ideally include RGB images, text, and selected molecular or sensor modalities, enabling systematic study of fusion strategies. Open-source toolchains for data alignment, augmentation, and model training would lower the barrier to entry for researchers and practitioners, facilitating reproducible benchmarks and more rapid iteration [37].
Beyond the tomato domain, many of the methods and insights discussed in this review can inform plant health monitoring for other crops. For example, the combination of transfer learning, self-supervised pretraining, and class-imbalance handling that underpins state-of-the-art tomato detectors is directly applicable to grape, citrus, wheat, and rice diseases, where labeled data are similarly limited and field conditions are highly variable [60]. Likewise, multimodal fusion patterns that integrate images with textual symptom descriptions or sensor data can be generalized to pest detection, nutrient deficiency diagnosis, and abiotic stress monitoring across diverse cropping systems [77]. Systematic cross-crop studies—for instance, meta-learning frameworks trained on multiple species and then adapted to tomato—may further reduce the per-crop data requirements and accelerate the development of robust disease detectors [92].
At the same time, transferring models and datasets across crops raises new scientific and ethical questions. Phenotypic manifestations of disease and stress can differ substantially between species, and there is a risk that models trained predominantly on well-studied crops (such as tomato and grapevine) may perform poorly on under-represented crops grown by smallholder farmers [56]. Data governance and fairness considerations—including who owns and controls leaf images, sensor data, and genomic information, and how benefits are shared among technology providers, farmers, and public institutions—will therefore become increasingly important [77]. Addressing these issues will require collaboration between plant pathologists, AI researchers, social scientists, and policymakers, and represents a fertile area for future interdisciplinary work.

6.5. Under-Studied Areas and Open Problems

Despite rapid progress, several directions remain insufficiently studied in tomato leaf disease detection:
  • Semi-supervised multimodal learning: how to reliably exploit large unlabeled image pools together with sparse paired text/molecular/sensor signals without confirmation bias (a minimal pseudo-labeling sketch follows this list).
  • Explainable multimodal diagnosis: explanations that align with agronomic concepts (lesion type, symptom attributes, growth stage) rather than generic saliency maps.
  • Unified multimodal benchmarks: standardized datasets and protocols covering multiple regions, devices, and modalities (image/text/sensors/molecular), enabling fair cross-paper comparison.
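The sketch below illustrates the image-only core of the first open problem: confidence-thresholded pseudo-labeling in the style of FixMatch. The threshold, the weak/strong augmentation split, and the model interface are assumptions; in practice, mitigating confirmation bias requires additional safeguards such as class-balanced thresholds or teacher–student averaging, and extending the idea to sparse paired modalities remains open.

```python
import torch
import torch.nn.functional as F

def pseudo_label_loss(model: torch.nn.Module,
                      weak_images: torch.Tensor,
                      strong_images: torch.Tensor,
                      threshold: float = 0.95) -> torch.Tensor:
    """FixMatch-style unlabeled loss for a batch of tomato leaf images.

    weak_images / strong_images: the same unlabeled batch under weak and
    strong augmentation. Only predictions whose confidence exceeds
    `threshold` contribute, which limits (but does not eliminate)
    confirmation bias.
    """
    with torch.no_grad():
        probs = F.softmax(model(weak_images), dim=-1)
        confidence, pseudo_labels = probs.max(dim=-1)
        mask = (confidence >= threshold).float()

    logits_strong = model(strong_images)
    per_sample = F.cross_entropy(logits_strong, pseudo_labels, reduction="none")
    return (per_sample * mask).mean()
```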

7. Conclusions

This review has examined tomato leaf disease detection from the joint perspectives of limited data and multimodal fusion. We first characterized the main data-related challenges—small sample sizes for rare and emerging diseases, severe class imbalance, and noisy field images with complex backgrounds—and related them to the heterogeneity and alignment issues inherent in combining images, text, and viral molecular data. We then surveyed technical solutions for limited data, including transfer learning, self-distillation and ensemble methods, data augmentation, self-supervised learning, few- and zero-shot learning, and domain adaptation/generalization, with emphasis on their applicability to tomato-specific datasets.
On the multimodal side, we discussed fusion strategies at the feature, decision, and hybrid levels and highlighted case studies such as LAFANet for image–text retrieval and emerging efforts toward viral–image fusion for TYLCV. We further analyzed representative model–dataset pairs (EMA-DeiT, KD-ShuffleNetV2, YOLOv11m, LAFANet) and their performance on laboratory and field benchmarks, drawing attention to the persistent gap between curated datasets and real-world deployment scenarios.
Looking forward, we argue that truly practical tomato disease detection systems will require a combination of lightweight architectures, principled limited-data learning, interpretable multimodal fusion, and well-designed cross-region datasets. Progress along these directions will not only improve the accuracy and robustness of AI-based diagnosis but also enable their integration into precision agriculture workflows, ultimately contributing to more resilient tomato production and sustainable food systems.

Author Contributions

Conceptualization, Y.H., W.K., C.Y., N.C. and Z.P.; literature search and analysis, Y.H. and H.L.; writing—original draft preparation, Y.H.; writing—review and editing, W.K., H.L., C.Y., N.C. and Z.P.; supervision, W.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Acknowledgments

This work is supported by Macao Polytechnic University, Macao SAR, under submission code fca.f7f2.06fd.0.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
TYLCV: Tomato yellow leaf curl virus
DL: Deep learning
CNN: Convolutional neural network
ViT: Vision Transformer
FSL: Few-shot learning
SSL: Self-supervised learning
GAN: Generative adversarial network
mAP: mean Average Precision

Appendix A. Proof Sketch of the Generalization Bound

Equation (8) follows a standard Rademacher-complexity argument: (i) apply symmetrization to relate the gap $R(\theta) - \hat{R}(\theta)$ to the supremum of an empirical process; (ii) bound this supremum by the empirical Rademacher complexity $\hat{\mathcal{R}}_n(\mathcal{H})$; (iii) use concentration (e.g., McDiarmid's inequality) to obtain the high-probability term $c\sqrt{\log(1/\delta)/(2n)}$. We refer readers to standard learning theory texts for the full proof.
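For completeness, one standard form of such a bound is reproduced below; this is a sketch consistent with the notation of Table 3, assuming a loss bounded in $[0,1]$, and the exact constants in Equation (8) may differ.

```latex
% With probability at least 1 - \delta over an i.i.d. sample of size n,
% uniformly over all hypotheses f_\theta in the class H:
\begin{equation*}
  R(\theta) \;\le\; \hat{R}(\theta)
  \;+\; 2\,\hat{\mathcal{R}}_n(\mathcal{H})
  \;+\; c\sqrt{\frac{\log(1/\delta)}{2n}},
\end{equation*}
% where \hat{\mathcal{R}}_n(\mathcal{H}) is the empirical Rademacher
% complexity of the loss-composed hypothesis class and c is a constant
% absorbing the range of the loss.
```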

References

  1. Cao, X.; Huang, M.N.; Wang, S.M.; Li, T.; Huang, Y. Tomato yellow leaf curl virus: Characteristics, influence, and regulation mechanism. Plant Physiol. Biochem. 2024, 213, 108812. [Google Scholar] [CrossRef]
  2. Nwakoby, I.; Iheukwumere, I.; Iheukwumere, C.; Nwakoby, N.; Idigo, M.; Ike, V. Food safety and law: The role of microbiology in ensuring safe food products. IPS J. Nutr. Food Sci. 2025, 4, 601–607. [Google Scholar] [CrossRef]
  3. Ahmed, M.; Babayola, M.; Bake, I. Role of Horticultural Crops in Food and Nutritional Security: A Review. J. Nutr. Food Process. 2024, 7, 1–6. [Google Scholar] [CrossRef]
  4. Lata, S.; Hussain, Z.; Yadav, R.; Jat, G.S.; Kumar, P.; Tomar, B. Insights into the genetic improvement of tomato. In Genetic Engineering of Crop Plants for Food and Health Security: Volume 2; Springer: Berlin/Heidelberg, Germany, 2024; pp. 165–184. [Google Scholar]
  5. Arain, S.M.; Sajjad, M.; Faheem, M.; Ullah, G.; Laghari, K.A.; Sial, M.A. Confronting Abiotic Stresses: Molecular Strategies for Improving Tomato Stress Tolerance. In Omics Approaches for Tomato Yield and Quality Trait Improvement; Springer: Berlin/Heidelberg, Germany, 2025; pp. 55–94. [Google Scholar]
  6. Sun, C.; Li, Y.; Song, Z.D.; Liu, Q.; Si, H.P.; Yang, Y.J.; Cao, Q. Research on tomato disease image recognition method based on DeiT. Eur. J. Agron. 2025, 162, 127400. [Google Scholar] [CrossRef]
  7. Gašić, K.; Ivanović, M.M.; Ignjatov, M.; Calić, A.; Obradović, A. Isolation and characterization of Xanthomonas euvesicatoria bacteriophages. J. Plant Pathol. 2011, 2, 415–423. [Google Scholar]
  8. Chaerani, R.; Voorrips, R.E. Tomato early blight (Alternaria solani): The pathogen, genetics, and breeding for resistance. J. Gen. Plant Pathol. 2006, 72, 335–347. [Google Scholar] [CrossRef]
  9. Legard, D.; Lee, T.; Fry, W. Pathogenic specialization in Phytophthora infestans: Aggressiveness on tomato. Phytopathology 1995, 85, 1356–1361. [Google Scholar] [CrossRef]
  10. Moriones, E.; Navas-Castillo, J. Tomato yellow leaf curl virus, an emerging virus complex causing epidemics worldwide. Virus Res. 2000, 71, 123–134. [Google Scholar] [CrossRef]
  11. Watanabe, H.; Horinouchi, H.; Muramoto, Y.; Ishii, H. Occurrence of azoxystrobin-resistant isolates in Passalora fulva, the pathogen of tomato leaf mould disease. Plant Pathol. 2017, 66, 1472–1479. [Google Scholar] [CrossRef]
  12. Pritchard, F.J.; Porte, W. The relation of temperature and humidity to tomato leaf spot (Septoria lycopersici Speg.). Phytopathology 1924, 14, 156–169. [Google Scholar]
  13. Guo, Q.; Sun, Y.; Ji, C.; Kong, Z.; Liu, Z.; Li, Y.; Li, Y.; Lai, H. Plant resistance to tomato yellow leaf curl virus is enhanced by Bacillus amyloliquefaciens Ba13 through modulation of RNA interference. Front. Microbiol. 2023, 14, 1251698. [Google Scholar] [CrossRef]
  14. Sánchez, M.S.; Hernández, E.A.; Quintana-Obregón, E.A.; Arispuro, I.V.; Téllez, M.Á.M. Estimating tomato production losses due to plant viruses, a look at the past and new challenges. Comun. Sci. 2024, 15, 71. [Google Scholar] [CrossRef]
  15. Akbar, A.; Al Hashash, H.; Al-Ali, E. Tomato yellow leaf curl virus (TYLCV) in Kuwait and global analysis of the population structure and evolutionary pattern of TYLCV. Virol. J. 2024, 21, 308. [Google Scholar] [CrossRef]
  16. Kumar, M.; Bag, S.; McAvoy, T.; Torrance, T.; Cloud, C.; Simmons, A.M. A shift in begomovirus Coheni populations associated with tomato yellow leaf curl disease infecting tomato cultivars in the southeastern united States. Plant Pathol. 2025, 74, 1277–1289. [Google Scholar] [CrossRef]
  17. Moldvai, L.; Nyéki, A. Innovative computer vision methods for tomato (Solanum Lycopersicon) detection and cultivation: A review. Discov. Appl. Sci. 2025, 7, 975. [Google Scholar] [CrossRef]
  18. Deng, S.; Zhu, J.; Hu, Y.; He, M.; Xia, Y. Tomato Leaf Disease Identification Framework FCMNet Based on Multimodal Fusion. Plants 2025, 14, 2329. [Google Scholar] [CrossRef]
  19. Upadhyay, A.; Patel, A.; Patel, A.; Chandel, N.S.; Chakraborty, S.K.; Bhalekar, D.G. Leveraging AI and ML in Precision Farming for Pest and Disease Management: Benefits, Challenges, and Future Prospects. In Ecologically Mediated Development: Promoting Biodiversity Conservation and Food Security; Springer: Singapore, 2025; pp. 511–528. [Google Scholar]
  20. Castillo-Girones, S.; Munera, S.; Martínez-Sober, M.; Blasco, J.; Cubero, S.; Gómez-Sanchis, J. Artificial Neural Networks in Agriculture, the core of artificial intelligence: What, When, and Why. Comput. Electron. Agric. 2025, 230, 109938. [Google Scholar] [CrossRef]
  21. Kumari, S.; Venkatesh, V.; Tan, F.T.C.; Bharathi, S.V.; Ramasubramanian, M.; Shi, Y. Application of machine learning and artificial intelligence on agriculture supply chain: A comprehensive review and future research directions. Ann. Oper. Res. 2025, 348, 1573–1617. [Google Scholar] [CrossRef]
  22. Ali, Z.; Muhammad, A.; Lee, N.; Waqar, M.; Lee, S.W. Artificial Intelligence for sustainable agriculture: A comprehensive review of AI-driven technologies in crop production. Sustainability 2025, 17, 2281. [Google Scholar] [CrossRef]
  23. Aijaz, N.; Lan, H.; Raza, T.; Yaqub, M.; Iqbal, R.; Pathan, M.S. Artificial intelligence in agriculture: Advancing crop productivity and sustainability. J. Agric. Food Res. 2025, 20, 101762. [Google Scholar] [CrossRef]
  24. Khan, R.; Ud Din, N.; Zaman, A.; Huang, B. Automated Tomato Leaf Disease Detection Using Image Processing: An SVM-Based Approach with GLCM and SIFT Features. J. Eng. 2024, 2024, 9918296. [Google Scholar] [CrossRef]
  25. Shanthi, D.; Vinutha, K.; Ashwini, N.; Vashistha, S. Tomato leaf disease detection using CNN. Procedia Comput. Sci. 2024, 235, 2975–2984. [Google Scholar] [CrossRef]
  26. Gehlot, M.; Saxena, R.K.; Gandhi, G.C. “Tomato-Village”: A dataset for end-to-end tomato disease detection in a real-world environment. Multimed. Syst. 2023, 29, 3305–3328. [Google Scholar] [CrossRef]
  27. Wang, X.; Liu, J. An efficient deep learning model for tomato disease detection. Plant Methods 2024, 20, 61. [Google Scholar] [CrossRef]
  28. Sun, W.; Xu, Z.; Xu, K.; Ru, L.; Yang, R.; Wang, R.; Xing, J. Ultra-lightweight tomatoes disease recognition method based on efficient attention mechanism in complex environment. Front. Plant Sci. 2025, 15, 1491593. [Google Scholar] [CrossRef]
  29. Ajith, S.; Vijayakumar, S.; Elakkiya, N. Yield prediction, pest and disease diagnosis, soil fertility mapping, precision irrigation scheduling, and food quality assessment using machine learning and deep learning algorithms. Discov. Food 2025, 5, 67. [Google Scholar] [CrossRef]
  30. Jonak, M.; Mucha, J.; Jezek, S.; Kovac, D.; Cziria, K. SPAGRI-AI: Smart precision agriculture dataset of aerial images at different heights for crop and weed detection using super-resolution. Agric. Syst. 2024, 216, 103876. [Google Scholar] [CrossRef]
  31. Li, H.; Chen, B.; Chen, J.; Li, S.; He, F.; Hu, Y. ITIMCA: Image-text information and cross-attention for multi-modal cassava leaf disease classification based on a novel multi-modal dataset in natural environments. Crop Prot. 2025, 189, 106981. [Google Scholar] [CrossRef]
  32. El Sakka, M.; Ivanovici, M.; Chaari, L.; Mothe, J. A review of CNN applications in smart agriculture using multimodal data. Sensors 2025, 25, 472. [Google Scholar] [CrossRef] [PubMed]
  33. Sapkota, R.; Qureshi, R.; Hadi, M.U.; Hassan, S.Z.; Sadak, F.; Shoman, M.; Sajjad, M.; Dharejo, F.A.; Paudel, A.; Li, J.; et al. Multi-modal LLMs in agriculture: A comprehensive review. IEEE Trans. Autom. Sci. Eng. 2025, 22, 22510–22540. [Google Scholar] [CrossRef]
  34. Li, Q.; Zhang, Y.; Mai, Z.; Chen, Y.; Lou, S.; Huang, H.; Zhang, J.; Zhang, Z.; Wen, Y.; Li, W.; et al. Can Large Multimodal Models Understand Agricultural Scenes? Benchmarking with AgroMind. arXiv 2025, arXiv:2505.12207. [Google Scholar] [CrossRef]
  35. Hussain, I.; Farooq, T.; Khan, S.A.; Ali, N.; Waris, M.; Jalal, A.; Nielsen, S.L.; Ali, S. Variability in indigenous Pakistani tomato lines and worldwide reference collection for Tomato Mosaic Virus (ToMV) and Tomato Yellow Leaf Curl Virus (TYLCV) infection. Braz. J. Biol. 2022, 84, e253605. [Google Scholar] [CrossRef]
  36. Li, F.; Qiao, R.; Yang, X.; Gong, P.; Zhou, X. Occurrence, distribution, and management of tomato yellow leaf curl virus in China. Phytopathol. Res. 2022, 4, 28. [Google Scholar] [CrossRef]
  37. Ni, S.; Jia, Y.; Zhu, M.F.; Zhang, Y.Z.; Wang, W.D.; Liu, S.X.; Chen, Y.W. An improved ShuffleNetV2 method based on ensemble self-distillation for tomato leaf diseases recognition. Front. Plant Sci. 2025, 15, 1521008. [Google Scholar] [CrossRef] [PubMed]
  38. Gupta, S.; Tripathi, A.K.; Lewis, N. Pre-trained noise based unsupervised GAN for fruit disease classification in imbalanced datasets. Pattern Anal. Appl. 2025, 28, 39. [Google Scholar] [CrossRef]
  39. Shoaib, M.; Hussain, T.; Shah, B.; Ullah, I.; Shah, S.M.; Ali, F.; Park, S.H. Deep learning-based segmentation and classification of leaf images for detection of tomato plant disease. Front. Plant Sci. 2022, 13, 1031748. [Google Scholar] [CrossRef]
  40. Ma, Y.; Tian, Y.; Moniz, N.; Chawla, N.V. Class-imbalanced learning on graphs: A survey. ACM Comput. Surv. 2025, 57, 207. [Google Scholar] [CrossRef]
  41. Vinothini, A.; Aswiga, R. Transfer learning based deep learning model for classifying tomato plant leaf diseases. Eng. Res. Express 2025, 7, 025250. [Google Scholar] [CrossRef]
  42. Pazou, M.G.A.; Sobabe, A.A.; Kouhoundji, N.; Dovonou, C. Detection of bacterial spot and yellow leaf curl virus in tomato leaves images using deep learning. In Proceedings of the 2021 International Conference on Electrical, Computer and Energy Technologies (ICECET), Cape Town, South Africa, 9–10 December 2021; pp. 1–5. [Google Scholar]
  43. Alzahrani, M. Automated Tomato Defect Detection Using CNN Feature Fusion for Enhanced Classification. Processes 2025, 13, 115. [Google Scholar] [CrossRef]
  44. Nishankar, S.; Mithuran, T.; Thuseethan, S.; Sebastian, Y.; Yeo, K.C.; Shanmugam, B. TOM-SSL: Tomato Disease Recognition Using Pseudo-Labelling-Based Semi-Supervised Learning. AgriEngineering 2025, 7, 248. [Google Scholar] [CrossRef]
  45. Dhiab, Y.B.; Aoueileyine, M.O.E.; Namoun, A.; Bouallegue, R. TomDetLeaf: A Realistic Multi-Source Dataset for Real-Time Tomato Leaf Detection. Int. J. Adv. Comput. Sci. Appl. 2025, 16. [Google Scholar] [CrossRef]
  46. Tang, X.; Sun, Z.; Yang, L.; Chen, Q.; Liu, Z.; Wang, P.; Zhang, Y. YOLOv11-AIU: A lightweight detection model for the grading detection of early blight disease in tomatoes. Plant Methods 2025, 21, 118. [Google Scholar] [CrossRef]
  47. Singh, D.; Jain, N.; Jain, P.; Kayal, P.; Kumawat, S.; Batra, N. PlantDoc: A dataset for visual plant disease detection. In Proceedings of the 7th ACM IKDD CoDS and 25th COMAD, Hyderabad, India, 5–7 January 2020; pp. 249–253. [Google Scholar]
  48. Xu, J.X.; Zhou, H.L.; Hu, Y.F.; Xue, Y.F.; Zhou, G.X.; Li, L.J.; Dai, W.S.; Li, J.Y. High-Accuracy Tomato Leaf Disease Image-Text Retrieval Method Utilizing LAFANet. Plants 2024, 13, 1176. [Google Scholar] [CrossRef] [PubMed]
  49. Nakagawa, Y.; Sano, H.; Takata, T. Classification of Tomato Growth Degree Adopting Machine-Learning to Photomorphogenesis Information in the Visible Light Region. In Proceedings of the 2025 International Conference on Artificial Intelligence in Information and Communication (ICAIIC), Fukuoka, Japan, 18–21 February 2025; pp. 82–86. [Google Scholar]
  50. Zhang, K.; Chai, Q.; Qian, X.; Gao, R.; Liu, X.; Yang, L.; Pang, G.; Wang, Y.; Sun, J. Potential of machine learning in leaf-based multi-source data driven tomato growth monitoring. Smart Agric. Technol. 2025, 10, 100854. [Google Scholar] [CrossRef]
  51. Huo, Y.; Liu, Y.; He, P.; Hu, L.; Gao, W.; Gu, L. Identifying Tomato Growth Stages in Protected Agriculture with StyleGAN3–Synthetic Images and Vision Transformer. Agriculture 2025, 15, 120. [Google Scholar] [CrossRef]
  52. Oni, M.K.; Prama, T.T. A comprehensive dataset of tomato leaf images for disease analysis in Bangladesh. Data Brief 2025, 59, 111327. [Google Scholar] [CrossRef]
  53. Skoric, D.; Zindovic, J.; Grbin, D.; Pul, P.; Božović, V.; Margaria, P.; Mehle, N.; Pecman, A.; Kogej Zwitter, Z.; Kutnjak, D.; et al. Tomato spotted wilt virus in tomato from Croatia, Montenegro and Slovenia: Genetic diversity and evolution. Front. Microbiol. 2025, 16, 1618327. [Google Scholar] [CrossRef]
  54. Li, Z.G.; Tang, Y.F.; She, X.M.; Yu, L.; Lan, G.B.; Ding, S.W.; He, Z.F. Characterisation of a Betasatellite Associated with Tomato Yellow Leaf Curl Guangdong Virus and Discovery of an Unusual Modulation of Virus Infection Associated with C4 Protein. Mol. Plant Pathol. 2025, 26, e70051. [Google Scholar] [CrossRef]
  55. Arnal Barbedo, J.G. Digital image processing techniques for detecting, quantifying and classifying plant diseases. SpringerPlus 2013, 2, 660. [Google Scholar] [CrossRef]
  56. Kamilaris, A.; Prenafeta-Boldú, F.X. Deep learning in agriculture: A survey. Comput. Electron. Agric. 2018, 147, 70–90. [Google Scholar] [CrossRef]
  57. Ferentinos, K.P. Deep learning models for plant disease detection and diagnosis. Comput. Electron. Agric. 2018, 145, 311–318. [Google Scholar] [CrossRef]
  58. Ma, X.; Zhang, X.; Guan, H.; Wang, L. Recognition Method of Crop Disease Based on Image Fusion and Deep Learning Model. Agronomy 2024, 14, 1518. [Google Scholar] [CrossRef]
  59. Attri, I.; Awasthi, L.K.; Sharma, T.P. Machine learning in agriculture: A review of crop management applications. Multimed. Tools Appl. 2024, 83, 12875–12915. [Google Scholar] [CrossRef]
  60. Pacal, I.; Kunduracioglu, I.; Alma, M.H.; Deveci, M.; Kadry, S.; Nedoma, J.; Slany, V.; Martinek, R. A systematic review of deep learning techniques for plant diseases. Artif. Intell. Rev. 2024, 57, 304. [Google Scholar] [CrossRef]
  61. Guo, R.; Li, B.; Zhao, Y.; Tang, C.; Klosterman, S.J.; Wang, Y. Rhizobacterial Bacillus enrichment in soil enhances smoke tree resistance to Verticillium wilt. Plant Cell Environ. 2024, 47, 4086–4100. [Google Scholar] [CrossRef] [PubMed]
  62. Xiong, S.; Wang, L.; Zhang, Y.; Dong, P.; Wang, B.; Che, Y.; Shi, L.; Si, H. Boosting crop disease recognition via automated image description generation and multimodal fusion. Comput. Electron. Agric. 2025, 239, 111082. [Google Scholar] [CrossRef]
  63. Ogidi, F.C.; Eramian, M.G.; Stavness, I. Benchmarking self-supervised contrastive learning methods for image-based plant phenotyping. Plant Phenomics 2023, 5, 37. [Google Scholar] [CrossRef]
  64. Xin, Y.; Liu, L.; Yang, X.R.; Yang, L.Y.; Guang, S.B.; Zheng, Y.M.; Zhao, Q.B. Adaptive shifts in plant traits associated with nitrogen removal driven by phytoremediation strategies in subtropical river restoration. Water Res. 2024, 249, 121008. [Google Scholar] [CrossRef] [PubMed]
  65. Yang, S.; Feng, Q.; Guo, F.; Zhou, W. Estimation of Potato Growth Parameters Under Limited Field Data Availability by Integrating Few-Shot Learning and Multi-Task Learning. Agriculture 2025, 15, 1638. [Google Scholar] [CrossRef]
  66. Adhikari, P.; Oh, Y.; Panthee, D.R. Current status of early blight resistance in tomato: An update. Int. J. Mol. Sci. 2017, 18, 2019. [Google Scholar] [CrossRef]
  67. Nowicki, M.; Kozik, E.U.; Foolad, M.R. Late blight of tomato. Transl. Genom. Crop Breed. Biot. Stress 2013, 1, 241–265. [Google Scholar]
  68. Lee, Y.S.; Patil, M.P.; Kim, J.G.; Seo, Y.B.; Ahn, D.H.; Kim, G.D. Hyperparameter Optimization for Tomato Leaf Disease Recognition Based on YOLOv11m. Plants 2025, 14, 653. [Google Scholar] [CrossRef]
  69. Abd-Alla, M.H.; Bashandy, S.R.; Schnell, S.; Ratering, S. Isolation and characterization of Serratia rubidaea from dark brown spots of tomato fruits. Phytoparasitica 2011, 39, 175–183. [Google Scholar] [CrossRef]
  70. Sharma, S.; Bhattarai, K. Progress in developing bacterial spot resistance in tomato. Agronomy 2019, 9, 26. [Google Scholar] [CrossRef]
  71. Dovas, C.; Katis, N.; Avgelis, A. Multiplex detection of criniviruses associated with epidemics of a yellowing disease of tomato in Greece. Plant Dis. 2002, 86, 1345–1349. [Google Scholar] [CrossRef]
  72. Liu, X.; Lin, Y.; Wu, C.; Yang, Y.; Su, D.; Xian, Z.; Zhu, Y.; Yu, C.; Hu, G.; Deng, W.; et al. The SlARF4-SlHB8 regulatory module mediates leaf rolling in tomato. Plant Sci. 2023, 335, 111790. [Google Scholar] [CrossRef]
  73. Han, K.; Wang, Y.; Chen, H.; Chen, X.; Guo, J.; Liu, Z.; Tang, Y.; Xiao, A.; Xu, C.; Xu, Y.; et al. A survey on vision transformer. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 87–110. [Google Scholar] [CrossRef]
  74. Koroteev, M.V. BERT: A review of applications in natural language processing and understanding. arXiv 2021, arXiv:2103.11943. [Google Scholar] [CrossRef]
  75. Zhang, L.; Bao, C.; Ma, K. Self-distillation: Towards efficient and compact neural networks. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 4388–4403. [Google Scholar] [CrossRef] [PubMed]
  76. Wang, X.; Zhang, R.; Shen, C.; Kong, T.; Li, L. Dense contrastive learning for self-supervised visual pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 3024–3033. [Google Scholar]
  77. Zhang, R.; Liu, C.; Su, Y.; Li, R.; Huang, X.; Li, X.; Yu, P.S. A Comprehensive Survey on Multimodal RAG: All Combinations of Modalities as Input and Output. TechRxiv 2025. [Google Scholar] [CrossRef] [PubMed]
  78. Zhao, K.; Wu, X.; Xiao, Y.; Jiang, S.; Yu, P.; Wang, Y.; Wang, Q. PlanText: Gradually Masked Guidance to Align Image Phenotypes with Trait Descriptions for Plant Disease Texts. Plant Phenomics 2024, 6, 272. [Google Scholar] [CrossRef]
  79. Pan, S.J.; Yang, Q. A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 2009, 22, 1345–1359. [Google Scholar] [CrossRef]
  80. Gholizade, M.; Soltanizadeh, H.; Rahmanimanesh, M.; Sana, S.S. A review of recent advances and strategies in transfer learning. Int. J. Syst. Assur. Eng. Manag. 2025, 16, 1123–1162. [Google Scholar] [CrossRef]
  81. Hossen, M.I.; Awrangjeb, M.; Pan, S.; Mamun, A.A. Transfer learning in agriculture: A review. Artif. Intell. Rev. 2025, 58, 97. [Google Scholar] [CrossRef]
  82. Targ, S.; Almeida, D.; Lyman, K. Resnet in resnet: Generalizing residual architectures. arXiv 2016, arXiv:1603.08029. [Google Scholar] [CrossRef]
  83. Sun, W.; Zhang, X.; He, X. Lightweight image classifier using dilated and depthwise separable convolutions. J. Cloud Comput. 2020, 9, 55. [Google Scholar] [CrossRef]
  84. Dai, X.; Chen, Y.; Xiao, B.; Chen, D.; Liu, M.; Yuan, L.; Zhang, L. Dynamic head: Unifying object detection heads with attentions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 7373–7382. [Google Scholar]
  85. Chen, W.; Yang, K.; Yu, Z.; Shi, Y.; Chen, C.P. A survey on imbalanced learning: Latest research, applications and future directions. Artif. Intell. Rev. 2024, 57, 137. [Google Scholar] [CrossRef]
  86. Wu, X.; Fan, X.; Luo, P.; Choudhury, S.D.; Tjahjadi, T.; Hu, C. From laboratory to field: Unsupervised domain adaptation for plant disease recognition in the wild. Plant Phenomics 2023, 5, 38. [Google Scholar] [CrossRef]
  87. Agarwal, S.; Krueger, G.; Clark, J.; Radford, A.; Kim, J.W.; Brundage, M. Evaluating clip: Towards characterization of broader capabilities and downstream implications. arXiv 2021, arXiv:2108.02818. [Google Scholar] [CrossRef]
  88. Li, J.; Li, D.; Xiong, C.; Hoi, S. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In Proceedings of the International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022; pp. 12888–12900. [Google Scholar]
  89. Liu, H.; Li, C.; Wu, Q.; Lee, Y.J. Llava: Large language and vision assistant. In Proceedings of the International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  90. Noyan, M.A. Uncovering bias in the PlantVillage dataset. arXiv 2022, arXiv:2206.04374. [Google Scholar] [CrossRef]
  91. Mohanty, S.P.; Hughes, D.P.; Salathé, M. Using deep learning for image-based plant disease detection. Front. Plant Sci. 2016, 7, 1419. [Google Scholar] [CrossRef] [PubMed]
  92. Li, D.; Chen, S. Fine-grained Image Classification Based on MogaNet Network and Multi-level Gating Mechanism. Front. Neurorobotics 2025, 19, 1630281. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Data challenges in tomato leaf disease detection.
Figure 2. Framework of limited-data learning strategies for tomato leaf disease detection.
Figure 3. Multimodal fusion framework for tomato leaf disease diagnosis and image–text retrieval.
Figure 4. Representative tomato leaf disease models positioned by model capacity/modality and deployment constraints.
Figure 5. Workflow of cross-dataset evaluation (train-on-A test-on-B, leave-one-domain-out) for tomato leaf disease detection.
Figure 6. Recommended ablation setups for multimodal systems: modality-specific ablations and gradual fusion gains.
Figure 7. End-to-end pipeline for tomato leaf disease detection under limited-data learning.
Table 1. Quick rating of key challenges in tomato leaf disease detection (higher = more challenging).
Aspect | Data | Compute | Deploy | Notes
Small sample size | 5 | 2 | 3 | Rare diseases and expensive expert labeling.
Class imbalance | 4 | 2 | 3 | Minority recall is critical in practice.
Field noise/domain shift | 5 | 3 | 5 | Largest gap between lab benchmarks and farms.
Multimodal alignment | 4 | 4 | 4 | Sparse paired data and missing modalities.
Interpretability | 3 | 2 | 5 | Needed for farmer/agronomist trust.
Table 2. Examples of common tomato leaf diseases and typical visual symptoms (indicative).
Disease | Pathogen (Type) | Typical Leaf Symptoms
Bacterial spot | Xanthomonas (bacteria) | Small, water-soaked spots turning dark; may coalesce under humid conditions.
Early blight | Alternaria solani (fungus) | Concentric rings (“target spots”), often starting on older leaves.
Late blight | Phytophthora infestans (oomycete) | Irregular lesions with pale edges; fast spread under cool/wet conditions.
TYLCV | Begomovirus (virus) | Yellowing, upward leaf curling, stunting; reduced fruit set.
Leaf mold | Passalora fulva (fungus) | Yellow spots on the upper leaf surface; olive-green mold on the underside in high humidity.
Septoria leaf spot | Septoria lycopersici (fungus) | Numerous small grayish spots with dark margins; defoliation in severe cases.
Table 3. Notation used in the mathematical formulation.
Symbol | Meaning
$\mathcal{X}_{\text{img}}, \mathcal{X}_{\text{text}}, \mathcal{X}_{\text{mol}}$ | Image/text/molecular input spaces
$\mathcal{Y} = \{1, \dots, K\}$ | Disease label set (including healthy)
$f_\theta(\cdot)$ | Multimodal classifier with parameters $\theta$ outputting class probabilities
$\Delta^{K-1}$ | Probability simplex over $K$ classes
$\ell(p, y)$ | Loss function (cross-entropy: $\ell(p, y) = -\log p_y$)
$\hat{R}(\theta), R(\theta)$ | Empirical risk vs. expected (true) risk
$\pi_k$ | Class prior $P(Y = k)$; imbalance when $\pi_k / \pi_{k'} < \varepsilon$ for some classes $k, k'$
Table 4. Overview of the review methodology.
Aspect | Summary
Databases and period | Web of Science, Scopus, Google Scholar; years 2015–2025 (focus on 2023–2025).
Search strategy | Keyword search on tomato leaf disease detection, limited data, few-shot/self-supervised learning, domain generalization, multimodal fusion.
Inclusion/exclusion | Include: tomato leaf diseases + AI/ML/DL methods + quantitative results; exclude: non-tomato crops/tasks or studies without clear methodology/metrics.
Study coding | Two-stage coding into limited-data, multimodal, and deployment categories; record datasets, backbones, and evaluation protocols.
Bias and mitigation | Potential language, indexing, and publication bias mitigated via multi-database search, citation chasing, and manual screening to reduce topic drift.
Table 5. Limited-data challenges in tomato leaf disease detection and typical countermeasures.
Challenge | Effect on Models | Typical Strategies
Small sample size | Overfitting to few labeled images; poor generalization to new fields or cultivars. | Transfer learning from large datasets; self-supervised pretraining on unlabeled field images; few-shot/meta-learning; active learning for expert labeling.
Class imbalance | Bias toward frequent diseases; low recall/F1 for rare but important classes. | Class-weighted or focal loss; over-/under-sampling; synthetic minority generation (GANs, diffusion); ensemble and self-distillation focusing on rare classes.
Low-quality/noisy images | Lesion cues obscured by clutter, blur, or lighting; large gap between lab and field domains. | Task-aware augmentation; domain adaptation/generalization; robust architectures and regularization; training on mixed lab–field datasets.
Table 6. Summary of limited-data learning strategies for tomato leaf disease detection.
Strategy | Principle | Representative Methods | Advantages | Limitations
Transfer learning | Initialize from large-scale pretraining and fine-tune on tomato data | ImageNet-pretrained CNN/ViT; DeiT-style fine-tuning; domain-specific pretraining | Strong baseline; fast convergence | Residual domain shift; sensitive to fine-tuning recipe
Self-/ensemble distillation | Use soft targets from EMA/ensemble to regularize learning | EMA teacher; multi-branch self-distillation; snapshot ensemble distillation | Improves generalization under small data | Training/inference overhead; teacher quality matters
Data augmentation and regularization | Expand effective sample diversity and reduce overfitting | RandAugment/AutoAugment; MixUp/CutMix; label smoothing; stochastic depth | Cheap; plug-and-play; boosts robustness | May distort symptoms; tuning cost; gains saturate
Self-supervised pretraining | Learn transferable representations without labels, then fine-tune | SimCLR/MoCo/DINO; MAE-style masked pretraining on plant images | Better feature reuse; label-efficient | Extra pretraining compute; mismatch if pretrain domain differs
Semi-supervised learning | Leverage unlabeled tomato images via consistency/pseudo-labels | FixMatch/Mean Teacher; pseudo-labeling with confidence threshold | Reduces label demand; strong under scarce labels | Error amplification; sensitive to threshold/imbalance
Few-shot/metric learning | Classify by learned embedding distances with few labeled examples | Prototypical Networks; Matching Networks; cosine classifier; episodic training | Good for new diseases/rare classes | Episode design complexity; unstable if intra-class variance is high
Synthetic data/generative augmentation | Generate or translate images to enlarge target distribution | GAN-based synthesis; diffusion-based generation; style transfer (lab→field) | Covers rare cases; enriches backgrounds | Quality/label fidelity risk; may introduce artifacts/bias
Domain generalization/adaptation | Reduce domain shift between lab and field settings | Domain adversarial training; style normalization; test-time adaptation (TTA) | Improves cross-dataset robustness | May require target data; stability/reproducibility issues
Active learning | Query the most informative samples to label first | Uncertainty sampling; diversity sampling; core-set selection | Maximizes annotation efficiency | Needs iterative labeling loop; selection bias risk
Table 7. Overview of modalities and fusion strategies in tomato disease diagnosis.
Modality | What It Provides | Common Fusion Strategies
Image (RGB/IR) | Lesion color/shape/distribution; visual symptoms | Early fusion, late fusion, cross-attention
Text (symptom descriptions) | Semantic symptom attributes; context (stage/management) | CLIP-style alignment, cross-attention, late fusion
Molecular (virus/genomics) | Direct evidence of infection/strain; early signal | Hybrid fusion, MoE experts, late fusion when sparse
Sensors/environment | Risk factors (humidity/temperature/leaf wetness) | Temporal fusion (RNN/TCN), hybrid fusion, MoE
Table 8. Critical comparison of multimodal fusion strategies for tomato disease detection.
Fusion Type | How It Works | Acc. Pot. | Interpretability | Comp. Cost | Missing-Mod. Robustness
Feature-level (early) | Fuse embeddings before the classifier (concat/gating/attention) | High | Medium | High | Low
Decision-level (late) | Separate unimodal models; fuse scores/probabilities | Medium | Medium–High | Low–Medium | High
Hybrid (cross-attn/MoE) | Cross-modal interaction + modality-specific heads/experts | High | High | Medium–High | Medium–High
Table 9. Comparison of multimodal fusion strategies for tomato leaf disease detection.
Fusion Type | Mechanism | Main Strengths | Main Weaknesses
Feature-level (early) | Concatenate or transform modality features before classification. | Rich joint representation; captures fine-grained cross-modal interactions. | High dimensionality; prone to overfitting under few paired samples; sensitive to missing modalities.
Decision-level (late) | Separate classifiers; fuse probabilities or scores. | Simple; robust when some modalities are absent; flexible with heterogeneous data. | Limited cross-modal interaction; fusion weights often hand-tuned; may ignore subtle complementarities.
Hybrid fusion | Attention-based feature interaction plus modality-specific heads. | Balances expressiveness and robustness; provides interpretable cross-modal attention. | More parameters and training complexity; unstable when multimodal pairs are very sparse or noisy.
Table 10. Summary and comparison of representative tomato leaf disease datasets.
Dataset | #Imgs | #Cls | Disease Categories (Examples) | Imaging Conditions | Modalities | Paired?
PlantVillage [90] | ∼18 k | 10–15 | TYLCV, early/late blight, leaf mold, Septoria, healthy (tomato subset) | Lab-like, controlled background/lighting | RGB | No
PlantDoc [47] | ∼2.5 k | 8–10 | Field diseases with cluttered backgrounds (tomato subset) | In-the-wild, occlusion, illumination variation | RGB | No
Tomato-Village [26] | ∼7 k | 8 | Includes rare classes (e.g., leaf miner, spotted wilt) | Multi-region field captures | RGB | No
Dataset of Tomato Leaves [6] | ∼6 k | 6 | Common diseases + healthy | Field/greenhouse, natural background | RGB | No
TLDITRD [48] | ∼6 k pairs | 6 | Six tomato disease classes (paired descriptions) | Field settings, paired annotations | RGB + Text | Yes
Table 11. Summary of representative models, datasets, and reported performance (indicative).
Model | Task | Dataset(s) | Key Reported Results/Notes
EMA-DeiT [6] | Classification | PlantVillage, PlantDoc, Tomato-Village, Tomato Leaves | Accuracy: 99.6% (PV), 97.1% (PD); strong baseline but residual lab-to-field gap.
KD-ShuffleNetV2 [37] | Classification | Aggregated multi-source tomato datasets | ∼95% accuracy; ∼1.27 M params; edge-friendly with self-distillation gains.
YOLOv11m [68] | Detection | Curated detection dataset | High mAP on curated data; needs cross-dataset evaluation and modality ablations.
LAFANet [48] | Image–text retrieval | TLDITRD | R@1 ≈ 81.7% (I→T), 80.3% (T→I); sensitive to text noise and pair scarcity.
Table 12. Representative model families for tomato leaf disease detection and their qualitative characteristics.
Family | Representative Models | Backbone | Params | Modality | Key Strengths/Limitations
Convolutional | ResNet variants; EfficientNet | ResNet-50/101; EffNet-B0/B3 | 20–40 M | RGB | Mature and stable; strong on PlantVillage; may be heavy on edge; needs adaptation for field images.
Lightweight CNN | ShuffleNetV2; MobileNetV2; KD-ShuffleNetV2 [37] | Depthwise/channel-shuffle CNNs | 1–3 M | RGB | Mobile-friendly; benefits from transfer + self-distillation; limited capacity for complex multimodal tasks.
Transformer-based | DeiT; EMA-DeiT [6] | ViT/DeiT | 20–30 M | RGB | Strong with pretraining; flexible; needs strong regularization/augmentation; higher memory footprint.
Object detectors | YOLOv5/YOLOv8/YOLOv11m [45,46,68] | One-stage detectors | 10–30 M | RGB (bbox) | Localize lesions/leaves; sensitive to annotation quality; heavier than classifiers.
Multimodal fusion | LAFANet [48]; image–molecular prototypes | ViT + BERT-like | 40 M+ | Image + text (+mol.) | Richer decision support; interpretable fusion; requires paired data and more complex training.
Table 13. Example hardware considerations for edge-ready tomato disease detection (indicative).
Platform | Typical Constraints | Recommended Model Traits
Smartphone | Limited battery; variable camera quality | ≤5 M params; fast inference; strong augmentation/DG
Embedded board | Limited RAM/compute; continuous operation | Lightweight backbone; pruning/quantization; robust to noise
Greenhouse gateway | Multi-sensor sync; intermittent missing data | MoE/hybrid fusion; modality dropout robustness