1. Introduction
Precision agriculture (PA) increasingly depends on unmanned aerial vehicles (UAVs) and data-driven modeling systems to enable high-resolution crop monitoring, stress detection, and yield forecasting. Advances in sensing platforms, including RGB, multispectral, hyperspectral, thermal, and LiDAR (Light Detection and Ranging) sensors, combined with machine learning and deep learning techniques, have transformed how agricultural data are collected, processed, and interpreted. These technologies support scalable data-driven decision making, leading to improved productivity, resource efficiency, and environmental sustainability.
Beyond serving as an application domain, PA presents a set of characteristics that actively shape the development of artificial intelligence (AI) methods. Agricultural environments expose AI systems to strong non-stationarity across seasons, cultivars, and management practices; extreme intra-class variability caused by growth stages and phenology; and severe data imbalance between healthy and stressed conditions. These properties challenge assumptions commonly made in benchmark-driven computer vision and time-series modeling, where data distributions are often static and labels are abundant.
Existing AI approaches in PA are often limited in scope, focusing on specific sensing modalities, individual crops, or isolated analytical tasks such as segmentation, pest and disease detection, fruit and bloom counting, and yield prediction. Many surveys emphasize either model architectures or sensing technologies but do not systematically connect data characteristics, model design, and downstream agricultural applications. As a result, it remains difficult to compare approaches across studies or to understand how methodological choices interact with real-world deployment constraints in UAV-enabled precision agriculture. In view of these limitations, the following research questions were formulated:
How do different AI model families (e.g., convolutional neural networks (CNNs), transformers, recurrent neural networks (RNNs), and classical models) perform across UAV-based sensing modalities and agricultural tasks?
What factors limit the reliability of AI-based predictions under varying real-world agricultural conditions?
How can UAV-based AI systems be translated into actionable insights for crop monitoring and interventional management?
Unlike previous surveys that primarily organize the literature around individual sensing platforms, crop types, or model architectures, this review introduces a unified four-dimensional taxonomy that jointly connects sensing modality, data type, model family, and analytical task within a single analytical framework. This structure enables systematic comparison across studies that would otherwise remain disconnected, including segmentation, detection, counting, and yield prediction pipelines operating under different sensing and deployment conditions. The proposed taxonomy also provides practical guidance for designing scalable and operationally feasible UAV-enabled agricultural analytics systems.
The key contributions of this survey are as follows:
We propose a unified taxonomy for AI-driven precision agriculture that integrates four complementary dimensions: sensing modality, data type, model family, and analytical task, enabling systematic cross-study comparison.
We identify structural challenges including data scarcity, annotation cost, domain shift, limited cross-modal integration, and deployment constraints that limit robustness and scalability.
We outline emerging research directions, including data-efficient learning, domain adaptation, representation-level multimodal fusion, synthetic data generation, and lightweight architectures for real-time deployment.
The literature reviewed in this survey was collected primarily from IEEE Xplore, ACM Digital Library, Scopus, Web of Science, ScienceDirect, and MDPI databases using combinations of keywords related to UAVs, precision agriculture, machine learning, deep learning, crop monitoring, disease detection, segmentation, fruit counting, and yield prediction. The survey focuses primarily on peer-reviewed journal and conference papers published between 2015 and 2025, with emphasis on studies involving UAV-enabled sensing and AI-driven analytical pipelines. Papers were selected based on their relevance to at least one of the four taxonomy dimensions introduced in this survey: sensing modality, data type, model family, or analytical task. More than 100 studies were analyzed across segmentation, pest and disease detection, bloom and fruit counting, and yield prediction tasks, spanning RGB, multispectral, hyperspectral, thermal, LiDAR, satellite, and IoT-based sensing modalities.
The remainder of this paper is organized as follows.
Section 2 provides background on precision agriculture, discusses key challenges, and introduces our taxonomy.
Section 3 reviews sensing modalities, data-acquisition strategies, and data-preparation techniques.
Section 4 summarizes segmentation methods.
Section 5 reviews pest and disease detection.
Section 6 covers bloom detection, fruit counting, and yield prediction.
Section 8 synthesizes cross-cutting challenges and emerging research directions, and
Section 9 concludes the survey.
2. Background and Taxonomy
2.1. Background and Research Gap
The evolution of information and communication technologies, remote sensing, and the Internet of Things (IoT) has dramatically expanded the range of data available for precision agriculture [
1,
2]. UAVs and ground-based sensors provide high-resolution measurements of crop and soil conditions, while satellite platforms offer broad spatial and temporal coverage. Although these technologies enable data-driven decision making, they also introduce several challenges that directly impact the design and deployment of deep learning models.
Data acquisition is central to both deep learning (DL) models and IoT systems in agriculture. Sensors, UAVs, and other devices collect diverse data on pests, soil properties, canopy structure, crop yields, and environmental variables such as temperature, light, and humidity [
2,
3]. Studies on smart agriculture systems [
4,
5,
6] emphasize the importance of reliable sensing infrastructures and communication networks for real-time monitoring and control.
High-quality annotation is equally critical. Accurate labels enable supervised DL models to learn robust representations for tasks such as pest detection, disease diagnosis, and crop health monitoring [
7,
8]. For UAV imagery, the annotation process is often labor intensive and requires domain expertise, particularly when dealing with subtle symptoms or occluded plant parts.
Several works highlight the difficulties of vegetation stress detection using UAVs, where models must capture both intra-field variability and temporal dynamics [
9,
10]. Large-scale, well-annotated datasets are rarely available, and models trained on small datasets tend to overfit or fail to generalize to new fields. To mitigate these issues, researchers employ data augmentation and regularization strategies such as geometric transformations and dropout layers [
11]. The use of high-resolution imagery and appropriate preprocessing steps is particularly important for building reliable prediction models [
12].
Data quality issues remain a recurring challenge. Agricultural data vary widely across crop types, regions, and seasons, leading to covariate shifts that degrade model performance. Incremental or phased adoption of advanced sensing and annotation technologies has been proposed as a practical strategy to improve data quality over time and support the sustainable deployment of PA systems [
13].
In summary, the following research gaps continue to shape the development of AI-based solutions in UAV-enabled precision agriculture:
Data variability and generalization: Although UAV sensing enables high-resolution monitoring, significant variability across environments, crop varieties, and management practices limits the generalizability of machine learning (ML) models.
Data quality and limited annotated datasets: Many deep learning approaches rely on large, consistently annotated datasets, which remain scarce and costly to obtain, particularly for fine-grained agricultural tasks.
Operational deployment and scalability: Practical deployment requires the robust integration of sensing platforms, data infrastructures, and analytical models, which is still evolving in many precision agriculture systems.
These research gaps motivate the methodological choices discussed in subsequent sections and underscore the need for a structured taxonomy that connects sensing modalities, data types, model families, and analytical tasks.
2.2. Taxonomy of UAV-Based Sensing and Machine Learning in Precision Agriculture
The diversity of sensing platforms, data types, model architectures, and agricultural applications makes it challenging to compare existing approaches in a systematic manner. To provide a unifying perspective, we introduce a taxonomy that organizes UAV-based sensing and machine learning methods along four complementary dimensions: (1) sensing modality, (2) data type, (3) model family, and (4) analytical task. This taxonomy links upstream sensing and data preparation steps in
Section 3 with downstream segmentation, detection, counting, and prediction methods discussed in later sections. It also provides the structural foundation for the survey, with each subsequent section mapped explicitly to one or more of these four dimensions.
Several recent surveys review deep learning applications in agriculture or remote sensing independently; however, they typically organize the literature along a single dimension, such as model architecture, crop type, or sensing platform. In contrast, the taxonomy proposed in this survey integrates four complementary dimensions (sensing modality, data type, model family, and analytical task) into a unified framework. This multidimensional structure enables systematic comparison across studies that would otherwise appear disconnected, such as UAV-based object detection and satellite-driven yield forecasting.
Unlike model-centric surveys that emphasize architectural trends, or remote sensing reviews that focus primarily on sensor characteristics, our taxonomy explicitly links upstream sensing infrastructure and data preparation choices to downstream learning models and decision support objectives. This integrated perspective reveals underexplored combinations of sensing modalities and model families and provides a structured foundation for analyzing and designing scalable, end-to-end data-driven systems in precision agriculture.
Compared to previous reviews that focus primarily on algorithms or crop-specific case studies, this survey integrates sensing infrastructure, data structure, modeling approaches, and analytical objectives within a unified engineering framework. By bridging sensing technologies and machine learning methodologies, the proposed taxonomy supports the design and evaluation of deployable, robust, and scalable agricultural systems.
1. Sensing modality: AI-enabled precision agriculture relies on heterogeneous sensing platforms that vary widely in spatial, spectral, and temporal resolution. UAV-mounted RGB, multispectral, hyperspectral, thermal, and LiDAR sensors provide high spatial detail for canopy structure, plant health assessment, and fine-scale monitoring [
3,
14]. Satellite platforms extend temporal and regional coverage, offering multi-temporal vegetation indices and spectral diagnostics used extensively for crop health and yield modeling [
2]. Ground-based IoT sensors, including soil moisture probes, weather stations, and in-field cameras provide micro-environmental and crop-level observations that complement airborne and satellite sensing [
15]. Together, these modalities form the multi-resolution sensing backbone of modern agricultural analytics.
2. Data type: The raw data produced by these sensing modalities differ in structure, dimensionality, and temporal properties. We categorize data types into five groups: (i) single-frame 2D imagery, (ii) multi-temporal image sequences used for phenological monitoring and yield prediction [16], (iii) 3D point clouds derived from LiDAR or structure-from-motion photogrammetry [
14], (iv) tabular sensor and environmental streams from IoT systems, and (v) fused multimodal datasets combining imagery, spectral indices, and ground measurements [
17]. Each data type imposes different requirements on preprocessing pipelines, feature extraction, and model selection.
3. Model family: A wide range of machine learning and deep learning models have been applied to agricultural imagery and sensor data. Classical methods such as support vector machines (SVMs), random forests, and fuzzy rule-based systems remain competitive when datasets are small or interpretability is required. Deep learning models dominate imagery-driven tasks, including convolutional neural networks (CNNs) [
18], residual networks (ResNet) [
19], encoder–decoder architectures such as U-Net [
20], region-based detectors including Faster Region-based CNN (Faster R-CNN) and Mask Region-based CNN (Mask R-CNN) [
21,
22], and one-stage detectors such as You Only Look Once (YOLO) variants [
23,
24]. Sequential models (long short-term memory (LSTM) [
25], Convolutional LSTM (ConvLSTM) [
26]) and hybrid architectures (CNN–SVM, CNN–LSTM, and stacking ensembles) further extend these capabilities. Recent advances in attention-based and transformer architectures broaden this model family dimension, offering alternatives to convolutional and recurrent networks for modeling long-range spatial and temporal dependencies in agricultural data.
4. Analytical task: The final dimension of the taxonomy organizes methods by their analytical objective: segmentation (e.g., canopy, leaf, or disease region delineation), object detection (e.g., pests, fruits, and flowers), classification (e.g., disease categories, maturity stages), counting (e.g., bloom or fruit load), and regression-based yield prediction. These task categories correspond directly to the main sections of the paper, where each analytical task is reviewed within the context of its typical sensing modalities, data types, and model families.
The four-dimensional taxonomy of UAV-enabled machine learning in precision agriculture is illustrated in
Figure 1. Each reviewed study can be systematically mapped onto this framework, enabling a structured comparison across sensing modalities, data types, model families, and analytical tasks. A clear pattern emerges from the surveyed literature: CNN- and YOLO-based architectures are applied predominantly to UAV-based RGB imagery for real-time detection and segmentation tasks, as reported in studies on pest and disease monitoring [
7,
27,
28]. In contrast, transformer-based and hybrid architectures are more frequently explored in multispectral and hyperspectral settings, where global contextual modeling is beneficial but still limited by data availability and computational cost [
29].
Across sensing modalities, RGB UAV imagery remains the most widely adopted due to its high spatial resolution and ease of acquisition, particularly for detection and counting tasks, whereas multispectral and hyperspectral data are more strongly associated with stress analysis and yield-related prediction tasks due to their spectral richness [
30,
31]. Similarly, lightweight CNN and YOLO variants are consistently preferred for edge and UAV deployment scenarios, while computationally intensive transformer models are typically evaluated in offline or cloud-based settings.
This comparative mapping also highlights a recurring limitation in the literature: most studies rely on single-modality pipelines, whereas few explore multimodal integration across UAV, satellite, and IoT data streams. Existing multimodal approaches are often restricted to late fusion or feature concatenation strategies, indicating a clear gap in representation-level fusion frameworks capable of cross-scale learning [
32,
33]. These evidence-based patterns support the taxonomy structure and point to key research gaps for future investigation.
Table 1 translates the four-dimensional taxonomy into practical model selection guidance. It shows that analytical task, sensing modality, and deployment constraints jointly determine appropriate model families. Lightweight models and classical approaches remain attractive when onboard computation, power, and bandwidth are limited, whereas transformer-based, multimodal, and temporal models are more suitable when richer data and greater computational resources are available. This matrix also reveals underexplored combinations, including lightweight transformer models for hyperspectral UAV imagery and multimodal fusion models that jointly integrate UAV, satellite, and IoT observations.
2.3. End-to-End Framework of UAV-Based Machine Learning in Precision Agriculture
While the taxonomy introduced in
Section 2.2 organizes existing research along four complementary dimensions, practical precision agriculture systems operate as integrated, end-to-end pipelines. To complement the taxonomy, we present a conceptual framework that illustrates how the sensing infrastructure, data-processing stages, and analytical models interact within a unified operational workflow.
Figure 2 depicts the overall analytical pipeline from sensing to task-specific modeling of UAV-based machine learning systems in precision agriculture. The framework begins with heterogeneous sensing modalities, including UAV-mounted RGB, multispectral, hyperspectral, thermal, and LiDAR sensors, as well as satellite imagery and ground-based IoT measurements. These sensing systems generate heterogeneous raw data streams that require preprocessing, normalization, feature extraction, dimensionality reduction, and multimodal integration.
Segmentation often serves as an intermediate representation stage, partitioning imagery into canopy, leaf, fruit, or lesion regions that facilitate downstream reasoning. Task-specific models are subsequently applied, including pest and disease detection, bloom identification, fruit counting, and yield prediction. Depending on the task, models range from classical machine learning approaches to deep convolutional, recurrent, and attention-based architectures.
Finally, model outputs feed into decision support systems that enable real-time UAV deployment, edge-based inference, autonomous spraying, robotic harvesting, or farm-level yield forecasting. Across all stages, system-level constraints such as limited labeled data, domain shift across seasons and regions, computational limits on UAV platforms, and multimodal data heterogeneity influence model design and deployment feasibility.
This end-to-end perspective highlights the interdependencies among sensing infrastructure, data characteristics, modeling choices, and operational objectives, reinforcing the need for integrated system design rather than isolated algorithmic improvements.
3. Data Acquisition and Preprocessing
Robust AI pipelines in precision agriculture depend heavily on the quality, diversity, and structure of the input data. As outlined in
Section 2.2, sensing modalities and data types directly influence downstream model design, feature extraction strategies, and analytical capabilities. This section reviews the major sensing platforms used in AI-driven agriculture and synthesizes common preprocessing and data preparation techniques that support segmentation, detection, counting, and prediction tasks. The sensing and data preparation components of the taxonomy and the overall pipeline are illustrated in
Figure 1 and in
Figure 2, respectively.
3.1. Data Acquisition
3.1.1. UAV-Based Sensing
UAVs have become central to precision agriculture due to their ability to capture high-resolution, flexible, and timely imagery. Modern platforms support RGB, multispectral, hyperspectral, thermal, and LiDAR payloads, enabling detailed characterization of canopy structure, crop vigor, and micro-environmental variability [
1,
3]. High-resolution RGB imagery is widely used for individual tree crown detection and orchard mapping, often combined with semi-supervised or weakly supervised models to leverage limited ground truth [
34].
LiDAR and structure-from-motion (SfM) photogrammetry provide 3D reconstructions of tree height, canopy volume, and stand density. For example, Ref. [
14] fused LiDAR point clouds with multispectral imagery using a PointNet++ architecture for tree species and health classification. UAVs have also been deployed as mobile data collectors within wireless sensor networks (WSNs), extending the spatial reach of ground sensors and supporting integrated monitoring frameworks [
35].
3.1.2. Satellite-Based Sensing
Satellite imagery complements UAV data by providing broader spatial coverage and longer temporal continuity. Multispectral and hyperspectral platforms support computation of vegetation indices such as the Normalized Difference Vegetation Index (NDVI) and the Enhanced Vegetation Index (EVI), which are used for crop health monitoring, stress detection, and yield estimation. Multi-temporal imagery enables the modeling of phenological trends. For instance, Ref. [
36] applied deep learning to WorldView-3 and PlanetScope data for field-scale yield prediction, while Ref. [
16] used a hybrid LSTM-1D CNN model to estimate rice yields from satellite-derived time series.
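As a brief illustration of the vegetation indices mentioned above, the following sketch computes NDVI and EVI from per-band reflectance arrays. It is a minimal example rather than the pipeline of any cited study: the function names, the toy reflectance values, and the use of NumPy are assumptions made here, and the EVI coefficients are the commonly used MODIS defaults.

```python
import numpy as np

def ndvi(nir, red, eps=1e-8):
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red)."""
    nir = np.asarray(nir, dtype=np.float64)
    red = np.asarray(red, dtype=np.float64)
    return (nir - red) / (nir + red + eps)  # eps guards against division by zero

def evi(nir, red, blue, gain=2.5, c1=6.0, c2=7.5, soil_l=1.0):
    """Enhanced Vegetation Index with the commonly used MODIS coefficients."""
    nir, red, blue = (np.asarray(b, dtype=np.float64) for b in (nir, red, blue))
    return gain * (nir - red) / (nir + c1 * red - c2 * blue + soil_l)

# Toy surface-reflectance values in [0, 1] (hypothetical):
# first entry is a vegetated pixel, second entry is a bare-soil pixel.
nir = np.array([0.50, 0.30])
red = np.array([0.08, 0.25])
blue = np.array([0.04, 0.20])

ndvi_values = ndvi(nir, red)
evi_values = evi(nir, red, blue)
```

Applied per acquisition date, such index maps form the multi-temporal series on which phenological and yield models are typically trained.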
3.1.3. Ground-Based Sensors and IoT
Ground-based IoT sensors provide high-frequency, field-level measurements of soil moisture, temperature, humidity, and radiation variables that are often not captured directly by UAV or satellite platforms. These measurements offer localized context for interpreting imagery and support applications such as stress detection, irrigation scheduling, and microclimate assessment. IoT deployments range from low-cost infrared and near-infrared probes [
37] to multilayer systems featuring wireless sensor networks, edge devices, and cloud-connected infrastructures for real-time monitoring and actuation [
15]. When integrated with aerial and satellite imagery, IoT measurements improve the completeness and robustness of datasets used in downstream ML and DL models [
38,
39].
Together, UAV, satellite, and IoT sensing systems provide complementary spatial, temporal, and spectral information, enabling multiscale, multimodal datasets that support the full range of analytical tasks reviewed in this survey. These raw data streams require substantial preprocessing, normalization, and feature engineering before they can be effectively used by ML and DL models as described in the next section.
Different sensing modalities provide complementary information and are suited to specific agricultural tasks. RGB imagery is widely used for detection and counting tasks due to its high spatial resolution and availability, while multispectral and hyperspectral data are more effective for stress detection and disease analysis by capturing spectral signatures beyond the visible range. LiDAR data, in contrast, provide structural information that is particularly useful for canopy modeling. In practice, combining these modalities can improve performance but requires careful alignment and fusion strategies to ensure consistency across spatial and temporal scales.
3.2. Data Preparation and Preprocessing
Data preparation is a critical stage (shown as the third layer of the framework in
Figure 2) that transforms heterogeneous raw inputs into formats suitable for learning algorithms. Preprocessing workflows typically include normalization, feature extraction, noise reduction, augmentation, dimensionality reduction, and multimodal integration.
Table 2 summarizes the representative techniques and associated studies.
3.2.1. Normalization and Feature Extraction
Normalization reduces variability introduced by illumination changes, sensor characteristics, and flight configurations. UAV spectral data normalization has been used for land-cover and crop discrimination [
40], while pixel-level normalization improves crop–weed classification in row crops [
41]. Applications in orchards and vineyards benefit from color-based normalization techniques that correct shadows and phenological variation [
42,
43]. In deep neural networks, batch normalization [
50] is routinely applied to stabilize training and accelerate convergence.
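To make the idea of input normalization concrete, the sketch below rescales each spectral band of an image tile independently, using either min-max scaling or standardization. It is an illustrative assumption rather than the procedure of the cited works; the function names and the synthetic four-band tile are hypothetical. It also differs from batch normalization, which is a trainable layer that uses mini-batch statistics inside the network.

```python
import numpy as np

def minmax_per_band(cube, eps=1e-8):
    """Rescale each band of an (H, W, B) cube to [0, 1] independently."""
    cube = np.asarray(cube, dtype=np.float64)
    lo = cube.min(axis=(0, 1), keepdims=True)
    hi = cube.max(axis=(0, 1), keepdims=True)
    return (cube - lo) / (hi - lo + eps)

def standardize_per_band(cube, eps=1e-8):
    """Zero-mean, unit-variance scaling per band (statistics taken over pixels)."""
    cube = np.asarray(cube, dtype=np.float64)
    mean = cube.mean(axis=(0, 1), keepdims=True)
    std = cube.std(axis=(0, 1), keepdims=True)
    return (cube - mean) / (std + eps)

rng = np.random.default_rng(0)
# Hypothetical 4-band tile whose bands have very different gains and offsets,
# mimicking differences between sensors or flight configurations.
gains = np.array([1.0, 50.0, 255.0, 4000.0])
offsets = np.array([0.0, 10.0, 0.0, 100.0])
tile = rng.random((32, 32, 4)) * gains + offsets

scaled = minmax_per_band(tile)
standardized = standardize_per_band(tile)
```

In practice the statistics would usually be estimated on the training set and reused at inference time, so that test tiles are mapped with the same transform.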
Feature extraction transforms raw imagery or sensor data into representations that emphasize relevant biological or structural cues. Vegetation indices (e.g., NDVI and ExG) are widely used to distinguish vegetation from soil and to characterize canopy vigor [
40,
41]. Geometric and texture features capture canopy shape, fruit morphology, and spatial patterns in orchards and vineyards [
42,
44]. Deep models, including CNNs, ResNets, and region proposal networks, automatically learn hierarchical feature representations and have demonstrated strong performance in fruit detection, weed mapping, and disease diagnosis [
10,
19,
21].
3.2.2. Data Cleaning, Augmentation, and Dimensionality Reduction
Data cleaning mitigates sensor noise, shadows, occlusion, and background clutter. Common approaches include segmentation-based masking, Gaussian and median filtering, and thresholding [
40,
42,
45,
46]. Augmentation strategies such as geometric transformations, photometric adjustments, and synthetic sample generation are essential for counteracting small or imbalanced datasets and improving generalization. Regularization techniques like dropout [
11] further reduce overfitting in deep architectures.
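The cleaning and augmentation steps described above can be sketched as follows. This is a minimal NumPy example under stated assumptions: the 3x3 window, the eight rotation and flip variants, and the brightness range are illustrative choices made here, not settings reported by the cited studies.

```python
import numpy as np

def median_filter3(image):
    """3x3 median filter for a 2D array, using edge padding at the borders."""
    padded = np.pad(image, 1, mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, (3, 3))
    return np.median(windows, axis=(-2, -1))

def dihedral_augment(image):
    """Geometric augmentation: the 8 rotations/flips of an (H, W[, C]) image."""
    variants = []
    for k in range(4):
        rotated = np.rot90(image, k, axes=(0, 1))
        variants.append(rotated)
        variants.append(np.flip(rotated, axis=1))
    return variants

def brightness_jitter(image, rng, max_delta=0.2):
    """Photometric augmentation: random intensity scaling, clipped to [0, 1]."""
    factor = 1.0 + rng.uniform(-max_delta, max_delta)
    return np.clip(image * factor, 0.0, 1.0)

rng = np.random.default_rng(0)

# A flat patch corrupted by a single salt-noise pixel.
noisy = np.full((5, 5), 0.5)
noisy[2, 2] = 1.0
denoised = median_filter3(noisy)

# Augment a small synthetic RGB patch.
patch = rng.random((4, 6, 3))
augmented = dihedral_augment(patch)
jittered = brightness_jitter(patch, rng)
```

Rotations and flips are usually safe for nadir UAV imagery, where there is no canonical "up" direction; photometric ranges should stay within the illumination variation actually expected in the field.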
High-dimensional data sources, especially multispectral and hyperspectral imagery, often require dimensionality reduction. Principal component analysis (PCA) and related methods reduce computational cost and highlight the most informative spectral features [
47,
48], improving both efficiency and downstream model accuracy.
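As an illustration of PCA-based spectral reduction, the sketch below projects a hyperspectral cube onto its leading principal components using a singular value decomposition. The function name and the synthetic 50-band cube are assumptions made for this example; real pipelines may instead use band selection or a library implementation.

```python
import numpy as np

def pca_reduce(cube, n_components):
    """Project an (H, W, B) cube onto its top principal components.

    Returns the reduced (H, W, n_components) cube and the explained-variance
    ratio of the retained components.
    """
    h, w, b = cube.shape
    pixels = cube.reshape(-1, b).astype(np.float64)
    centered = pixels - pixels.mean(axis=0)
    # SVD of the centered data: rows of vt are the principal axes,
    # ordered by decreasing singular value.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    variance = singular_values ** 2
    ratio = variance[:n_components] / variance.sum()
    scores = centered @ vt[:n_components].T
    return scores.reshape(h, w, n_components), ratio

rng = np.random.default_rng(0)
# Synthetic 50-band cube driven by 3 latent spectral factors plus small noise,
# mimicking the strong redundancy between neighboring hyperspectral bands.
latent = rng.normal(size=(20 * 20, 3))
mixing = rng.normal(size=(3, 50))
noise = 0.01 * rng.normal(size=(20 * 20, 50))
cube = (latent @ mixing + noise).reshape(20, 20, 50)

reduced, explained = pca_reduce(cube, n_components=3)
```

In this synthetic case three components capture almost all of the variance, so a downstream model would see a 3-channel input instead of 50 bands.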
3.2.3. Data Integration
Multimodal integration combines UAV imagery, satellite observations, and ground-based sensor streams to provide a holistic view of crop and environmental conditions. Fusion frameworks support applications such as stress detection, irrigation management, and yield estimation. For instance, Refs. [
17,
49] demonstrate the benefits of combining imagery with IoT or field-level sensor data for robust decision making in smart agriculture systems. Integrated datasets are particularly valuable for temporal modeling tasks and for capturing interactions between environmental conditions and crop responses.
Recent approaches focus on semi-supervised and data-efficient learning, particularly for hyperspectral UAV data, where labeled samples are limited but large volumes of unlabeled data are available. A transformer-adapted approach called Low-rank adaptation Local Attention Spectral Vision Transformer was proposed in [
51] for low-data regimes; it combines a three-dimensional convolutional spectral front end with a local window-based self-attention mechanism. The study reported 99% accuracy, demonstrating that the approach remains effective in low-label regimes while using substantially fewer parameters.
The effectiveness of machine learning models is strongly influenced by the characteristics of the sensing data. Hyperspectral imagery, for example, contains high-dimensional spectral information, which often requires dimensionality reduction or band selection to mitigate redundancy and improve computational efficiency, followed by models capable of learning joint spectral–spatial features [
51,
52]. In contrast, LiDAR data provide structural information in the form of point clouds [
14], which must be aligned and fused with optical imagery to enable meaningful interpretation of crop geometry and canopy structure.
Multimodal data integration further introduces challenges related to spatial and temporal alignment, where simple feature-level fusion is often insufficient. In such cases, attention-based and transformer-driven models offer advantages by learning relationships across heterogeneous data sources. Fusion may occur at the following three levels:
Early (input-level) fusion combines data sources after preprocessing (e.g., stacking UAV imagery with IoT measurements). It is sensitive to spatial resolution mismatch and temporal misalignment between sensing modalities.
Representation-level fusion learns joint feature spaces across modalities. Recent approaches use attention-based mechanisms, such as cross-attention, to align UAV imagery with satellite time series or IoT signals, improving robustness to scale and temporal differences.
Late (decision) fusion combines outputs from independent models trained on different modalities. This approach is less affected by registration errors but does not capture deep cross-modal interactions.
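The three fusion levels listed above can be contrasted in a few lines of code. The following is a minimal sketch under simplifying assumptions: the token shapes and random features are hypothetical, and the cross-attention step omits the learned query, key, and value projections that a trained model would include.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def early_fusion(uav_features, iot_features):
    """Input-level fusion: concatenate co-registered per-sample features."""
    return np.concatenate([uav_features, iot_features], axis=-1)

def cross_attention(queries, keys, values):
    """Representation-level fusion: each query token attends over the other modality."""
    scale = np.sqrt(queries.shape[-1])
    weights = softmax(queries @ keys.T / scale, axis=-1)
    return weights @ values, weights

def late_fusion(prob_a, prob_b, weight_a=0.5):
    """Decision-level fusion: weighted average of per-model class probabilities."""
    return weight_a * prob_a + (1.0 - weight_a) * prob_b

rng = np.random.default_rng(0)
uav_tokens = rng.normal(size=(16, 8))   # e.g., 16 UAV image-patch embeddings
sat_tokens = rng.normal(size=(5, 8))    # e.g., 5 time steps of a satellite series
iot_vector = rng.normal(size=(16, 3))   # e.g., 3 sensor readings per patch

stacked = early_fusion(uav_tokens, iot_vector)
attended, attn = cross_attention(uav_tokens, sat_tokens, sat_tokens)
fused_probs = late_fusion(softmax(rng.normal(size=(16, 2))),
                          softmax(rng.normal(size=(16, 2))))
```

Note that early fusion requires the two inputs to share the same sample grid (16 rows here), whereas cross-attention accepts different token counts per modality, which is one reason it tolerates scale and temporal mismatch better.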
Fusion pipelines highlight how these strategies are applied in practice. For example, cross-attention-based architectures align UAV imagery with satellite data at the feature level to address resolution mismatch. Hierarchical spatiotemporal pipelines integrate UAV observations with IoT time series by first aligning temporal signals and then refining spatial features. Graph-based fusion models represent fields, sensors, and observations as nodes, enabling flexible integration across heterogeneous data sources while handling missing or misaligned inputs.
Overall, the data-acquisition (Level 2 of
Figure 2) and preprocessing (Level 3 of
Figure 2) steps, reviewed here, form the foundation for the segmentation, detection, counting, and yield prediction methods examined in the following sections. They also directly support the four dimensions of the taxonomy introduced in
Section 2.2, linking sensing modalities, data types, and model selection to analytical objectives.
4. Segmentation Methods and Architectures
Segmentation is a foundational analytical task in our taxonomy and plays a critical role in many downstream applications, including disease detection, fruit and bloom counting, canopy characterization, and yield modeling. By partitioning imagery into meaningful regions, segmentation produces structured spatial representations that enable object-level and pixel-level reasoning.
In UAV-based precision agriculture, segmentation approaches can be examined at two complementary levels: (i) methodological paradigms that define how image regions are delineated based on spectral, spatial, or learned features, and (ii) architectural implementations that operationalize these paradigms through specific neural or classical model designs.
Within the taxonomy in
Figure 1, segmentation methods represent a key component linking sensing data to downstream analytical tasks. We first review representative methodological categories, followed by representative architectural instantiations widely adopted in agricultural applications.
4.1. Segmentation Methodological Paradigms
Segmentation methodologies in agricultural imagery span a spectrum from classical rule-based approaches to modern deep neural frameworks. For clarity, we organize the literature into five representative categories: threshold-based, color-based, texture- and shape-based, deep learning-based semantic and instance segmentation, and transformer-based segmentation models. This categorization highlights the methodological evolution from handcrafted feature extraction to learned hierarchical and attention-based representations.
4.1.1. Threshold-Based Segmentation
Threshold-based segmentation represents one of the earliest and most computationally efficient approaches in agricultural image analysis. These methods separate foreground objects from background regions using global or adaptive thresholds derived from pixel-intensity histograms or spectral index distributions. Otsu’s method [
53] and related histogram-driven threshold selection strategies have been applied to fruit detection, canopy extraction, and soil vegetation separation tasks [
8,
48], where Ref. [48] reported 92.5% accuracy for deep learning-oriented techniques.
Vegetation index thresholding using Normalized Difference Vegetation Index (NDVI), Excess Green (ExG), or related spectral metrics is particularly common in UAV-based crop background segmentation and early canopy mapping [
2,
3]. In orchard environments, adaptive thresholding with automatic parameter tuning has been proposed to improve fruit detection robustness under varying illumination and background conditions [
54], where the authors achieved final F1 scores of 93.1% and 99.3% for apple and pepper detection, respectively.
Although threshold-based approaches are attractive due to low computational cost and ease of deployment on embedded systems, their performance is highly sensitive to illumination variability, shadowing, soil reflectance, and canopy heterogeneity. Adaptive and locally optimized binarization techniques partially mitigate these issues [
54,
55], yet robustness across seasons and sensing modalities remains limited. Consequently, threshold-based segmentation is increasingly used as a preprocessing step rather than as a standalone solution in modern UAV-driven pipelines.
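The index-then-threshold idea described above can be sketched as follows. This is a minimal illustration in plain Python, assuming synthetic RGB values; the helper names `excess_green` and `otsu_threshold` and the 64-bin histogram are illustrative assumptions and are not taken from the cited studies.

```python
def excess_green(r, g, b):
    """ExG = 2g - r - b computed on chromatic coordinates (channel / (r + g + b))."""
    total = r + g + b
    if total == 0:
        return 0.0
    return (2 * g - r - b) / total


def otsu_threshold(values, bins=64):
    """Return the threshold that maximises between-class variance (Otsu's criterion)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return lo
    width = (hi - lo) / bins
    hist = [0] * bins
    for v in values:
        hist[min(int((v - lo) / width), bins - 1)] += 1

    total = len(values)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_bin, best_var = 0, -1.0
    w0, sum0 = 0, 0.0
    for i in range(bins - 1):
        w0 += hist[i]
        sum0 += i * hist[i]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_bin = var_between, i
    # The threshold is the upper edge of the last background bin.
    return lo + (best_bin + 1) * width


# Synthetic pixels (assumed values): three soil-like, three vegetation-like.
pixels = [(120, 100, 80), (130, 110, 90), (110, 95, 85),
          (60, 140, 50), (50, 150, 60), (70, 130, 40)]
exg = [excess_green(*p) for p in pixels]
threshold = otsu_threshold(exg)
mask = [v >= threshold for v in exg]  # True marks vegetation
```

On these synthetic pixels the threshold falls between the soil and vegetation clusters. On real UAV imagery the same procedure inherits the sensitivity to illumination and soil reflectance noted above.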
4.1.2. Color-Based Segmentation
Color-based segmentation leverages differences in Red–Green–Blue (RGB), Hue–Saturation–Intensity (HSI), and Hue–Saturation–Value (HSV) color spaces to distinguish vegetation, fruiting bodies, flowers, or water surfaces from surrounding backgrounds. Unlike intensity-only thresholding, chromatic transformations isolate vegetation-specific spectral responses and reduce sensitivity to grayscale illumination changes.
Hue-histogram-based threshold detection has been applied to UAV-captured imagery of cropped fields to improve vegetation–soil separation under varying lighting conditions [56], yielding a mean accuracy of 87.29% with a standard deviation of 12.5%. Similarly, color index-based thresholding methods using Excess Green (ExG), Excess Red (ExR), and normalized RGB ratios have demonstrated effectiveness for the background–foreground segmentation of plant imagery [57], with a reported segmentation error of 6.62 ± 5.85% and a classification ratio of 1.93 ± 0.05. These approaches are particularly useful in crop–weed discrimination and early-stage canopy extraction.
Compared to global intensity thresholding, color space transformations improve discrimination under moderate lighting variation. However, chromatic distributions shift substantially with time of day, cloud cover, sensor calibration, and soil background variability. Consequently, color-based segmentation often requires normalization, radiometric calibration, or adaptive histogram equalization to maintain consistency across flights and growing conditions. While computationally efficient, purely color-driven approaches remain sensitive to environmental variability and are increasingly complemented by learned feature representations that capture structural and contextual cues.
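To make the contrast with intensity-only thresholding concrete, the following sketch classifies pixels by hue using Python's standard `colorsys` module. The hue window of 60 to 180 degrees and the minimum saturation of 0.2 are assumed values chosen for illustration, not parameters reported in the cited work.

```python
import colorsys


def is_vegetation(r, g, b, hue_range=(60.0, 180.0), min_saturation=0.2):
    """Classify an 8-bit RGB pixel as vegetation from its hue and saturation."""
    h, s, _v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    hue_deg = h * 360.0
    return hue_range[0] <= hue_deg <= hue_range[1] and s >= min_saturation


bright_leaf = (60, 140, 50)
shaded_leaf = (30, 70, 25)    # same chromaticity at half the brightness
soil = (120, 100, 80)

labels = [is_vegetation(*p) for p in (bright_leaf, shaded_leaf, soil)]
```

The bright and shaded leaf pixels share the same hue, so both pass the test even though their intensities differ by a factor of two. A shift in color balance (for example, from sensor calibration or time of day) would move the hue itself, which is why the normalization steps mentioned above remain necessary.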
4.1.3. Texture- and Shape-Based Segmentation
Texture- and shape-based segmentation methods extend beyond simple spectral cues by incorporating spatial patterns and geometric priors. Classical texture descriptors such as Local Binary Patterns (LBPs), Gabor filters, Gray-Level Co-occurrence Matrices (GLCMs), and Haralick features have been widely applied to plant extraction and vegetation segmentation in field imagery [
58]. These descriptors capture micro-patterns and repetitive structures that distinguish foliage from soil, weeds, or diseased regions, particularly when color contrast alone is insufficient.
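As a minimal sketch of one of these descriptors, the basic 8-neighbour Local Binary Pattern can be computed as below. The clockwise bit ordering starting at the top-left neighbour is one common convention and is an assumption here; the cited studies may use other variants (for example, rotation-invariant or uniform patterns).

```python
def lbp_code(img, y, x):
    """Basic 8-neighbour LBP code for the interior pixel at (y, x)."""
    center = img[y][x]
    # Clockwise from the top-left neighbour.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dy, dx) in enumerate(offsets):
        if img[y + dy][x + dx] >= center:
            code |= 1 << bit
    return code


def lbp_histogram(img):
    """256-bin histogram of LBP codes over all interior pixels."""
    hist = [0] * 256
    for y in range(1, len(img) - 1):
        for x in range(1, len(img[0]) - 1):
            hist[lbp_code(img, y, x)] += 1
    return hist


gradient_patch = [[1, 2, 3],
                  [4, 5, 6],
                  [7, 8, 9]]
flat_patch = [[5, 5, 5],
              [5, 5, 5],
              [5, 5, 5]]
codes = (lbp_code(gradient_patch, 1, 1), lbp_code(flat_patch, 1, 1))
```

The histogram of these codes over a region, rather than any single code, is what typically serves as the texture feature for distinguishing foliage from soil.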
More recent surveys on aerial vegetation and microplot segmentation highlight the continued relevance of texture-driven representations in structured agricultural layouts [
59]. In UAV imagery, texture cues can help delineate crop rows, canopy gaps, and stress patterns that exhibit consistent spatial repetition across plots.
Shape-based approaches, including Circular Hough Transform, contour-based filtering, watershed segmentation, and morphological operations, incorporate geometric constraints to improve the detection of approximately circular fruits such as apples, citrus, and tomatoes. By leveraging structural priors, these methods reduce false positives in moderately cluttered scenes and improve boundary delineation.
However, handcrafted texture and geometric descriptors are sensitive to scale variation, occlusion, and irregular canopy geometry. Performance degrades in highly heterogeneous field environments, motivating the transition toward deep neural architectures capable of learning hierarchical spatial features directly from data.
4.1.4. Deep Learning-Based Semantic and Instance Segmentation
Deep learning architectures have become the dominant paradigm for segmentation in precision agriculture due to their ability to learn multiscale spatial and contextual representations directly from raw imagery. Semantic segmentation models assign pixel-level class labels, whereas instance segmentation frameworks additionally distinguish individual plant objects within a scene. Encoder–decoder networks such as U-Net and its variants are widely adopted for leaf delineation, weed mapping, disease region segmentation, and crop row detection [
7,33], with average reported accuracies of 90–97%. Through hierarchical feature extraction and skip connections, these architectures preserve fine boundary detail while capturing broader contextual information.
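The difference between the two output types can be illustrated with a small sketch. Here a binary semantic mask (plant versus background) is converted into instance labels with classical 4-connected component labelling. This is only a stand-in to show what an instance map looks like; it is not how deep instance segmentation networks operate, and it cannot separate touching objects.

```python
from collections import deque


def label_instances(mask):
    """Assign a distinct positive label to each 4-connected foreground region."""
    height, width = len(mask), len(mask[0])
    labels = [[0] * width for _ in range(height)]
    count = 0
    for y in range(height):
        for x in range(width):
            if not mask[y][x] or labels[y][x]:
                continue
            count += 1
            labels[y][x] = count
            queue = deque([(y, x)])
            while queue:
                cy, cx = queue.popleft()
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = cy + dy, cx + dx
                    if (0 <= ny < height and 0 <= nx < width
                            and mask[ny][nx] and not labels[ny][nx]):
                        labels[ny][nx] = count
                        queue.append((ny, nx))
    return labels, count


# Semantic output: 1 = plant, 0 = background (synthetic example).
semantic_mask = [[1, 1, 0, 0],
                 [1, 0, 0, 1],
                 [0, 0, 1, 1]]
instance_map, num_plants = label_instances(semantic_mask)
```

The semantic mask only states which pixels are plant; the instance map additionally states which plant each pixel belongs to, which is what counting tasks require.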
Lightweight and task-specific adaptations improve suitability for UAV and edge deployment, emphasizing computational efficiency while maintaining competitive segmentation performance [
60,61]. The reported results include a 4.6% improvement in mean average precision and an average precision of 31.5. Multi-task and feature fusion frameworks, including UniSteamNet, jointly optimize segmentation and recognition objectives to enhance structural coherence and reduce redundant computation [
62]. Transfer learning pipelines and region proposal mechanisms further improve robustness under limited labeled data by leveraging pretrained visual backbones and hierarchical feature reuse [
63], achieving a mean average precision of 0.94 and an F1 score of 0.89.
Compared with handcrafted texture- and shape-based approaches, deep models demonstrate stronger resilience to heterogeneous backgrounds, illumination variability, and canopy complexity. Nevertheless, performance remains contingent on dataset scale, annotation fidelity, and domain alignment across seasons, cultivars, and sensing configurations. Computational cost and deployment constraints also pose practical challenges in real-time UAV and edge scenarios.
4.1.5. Transformer-Based Architecture and Segmentation
Recent advances in transformer-based architectures introduce new opportunities for segmentation in agricultural imagery by modeling long-range spatial dependencies and global contextual relationships. Unlike convolutional neural networks, which rely primarily on local receptive fields, transformers employ self-attention mechanisms that allow each image region to interact with all others [
64]. This capability is particularly relevant for agricultural imagery, where canopy structures, disease patterns, and crop rows often exhibit spatial relationships that extend beyond local neighborhoods and vary substantially across scales.
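The all-to-all interaction described above can be sketched with single-head scaled dot-product self-attention. For brevity the sketch assumes identity query, key, and value projections (no learned weights) and treats each token as a small patch embedding; real transformer layers add learned projections, multiple heads, and positional information.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Return (outputs, weights) where weights[i][j] is how much token i attends to token j."""
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    outputs, weights = [], []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in tokens]
        attn = softmax(scores)
        weights.append(attn)
        outputs.append([sum(a * value[d] for a, value in zip(attn, tokens))
                        for d in range(dim)])
    return outputs, weights


# Three synthetic patch embeddings: patches 0 and 2 are similar but not adjacent.
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
outputs, weights = self_attention(patches)
```

Patch 0 attends to the distant but similar patch 2 exactly as strongly as to itself, and less to the dissimilar neighbour. This position-independent weighting is what allows repeated structures such as crop rows to inform each other across the image.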
Vision Transformers (ViT) [65], with a reported accuracy of 97%, and hierarchical variants such as the Swin Transformer [66], with 87.3% accuracy and 53.5 mean Intersection over Union (mIoU), have been explored as backbones for semantic and instance segmentation in remote sensing and agricultural contexts. Window-based self-attention and multiscale feature hierarchies enable these models to process high-resolution UAV imagery more efficiently than naïve global attention mechanisms. Transformer-based segmentation frameworks, including hybrid CNN–Transformer architectures such as SegFormer [29], which achieved 50.3% mIoU, have demonstrated strong performance in complex outdoor scenes characterized by heterogeneous backgrounds, occlusion, and variable illumination, conditions commonly encountered in orchards and field environments [67]. Recent work, such as the Convolutional Meets Transformer Network (CMTNet) [52], demonstrates the effectiveness of hybrid CNN–Transformer architectures for UAV-based hyperspectral crop classification, enabling improved spectral–spatial feature representation across three datasets: WHU-Hi-LongKou, WHU-Hi-HanChuan, and WHU-Hi-HongHu. The proposed CMTNet model achieved an accuracy of 99.58%, surpassing state-of-the-art methods such as CMTixer. Likewise, the work in [68] integrated UAV-based semantic segmentation and super-resolution reconstruction for tobacco fields, aiming to evaluate recent architectures, including Mamba-based models and transformers. The ensemble approach combining the transformer and Mamba architectures achieved the highest mean IoU of 90.7%.
Despite their promise, transformer-based segmentation models remain comparatively underexplored in precision agriculture. High data requirements, computational cost, and the limited availability of large-scale labeled agricultural datasets pose practical challenges for widespread adoption. Hybrid architectures that combine convolutional feature extraction with transformer-based attention mechanisms represent a promising compromise, particularly in scenarios where global context is beneficial but computational resources are limited. Future research is likely to focus on data-efficient training strategies, multimodal fusion, and lightweight transformer variants suitable for deployment on UAVs and edge devices, aligning with the operational constraints of real-world agricultural systems.
While the preceding subsections categorize segmentation approaches according to methodological principles, practical performance in UAV-based precision agriculture ultimately depends on architectural design choices. Different network backbones, efficiency-oriented variants, feature fusion mechanisms, and transfer learning strategies operationalize these methodological paradigms in distinct ways. We therefore next examine representative segmentation architectures that instantiate these paradigms across diverse agricultural applications.
4.2. Representative Segmentation Architectures in UAV-Based Precision Agriculture
Architectural design choices matter most under conditions of occlusion, variable illumination, heterogeneous backgrounds, and limited annotated data.
Table 3 summarizes representative deep learning architectures and their agricultural applications across crop monitoring, row recognition, fruit segmentation, and disease severity estimation.
4.2.1. Encoder–Decoder Architectures
Encoder–decoder networks remain the dominant backbone for agricultural segmentation tasks. U-Net and its variants have demonstrated strong boundary delineation capabilities across diverse applications, including cotton inter-row navigation [
70] with reported mIoU of 96.85%, wheat farmland segmentation [
71] with reported mIoU of 89.80%, fruit segmentation in orchard environments [
69] with mIoU of 90.03%, and leaf disease delineation [
72] with reported mIoU between 94 and 96%.
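Because mIoU is the figure of merit quoted for most of these models, a short sketch of how it is computed may help when comparing numbers. The version below averages per-class IoU over the classes that appear in either the prediction or the reference; individual studies may differ in how they treat absent classes or background, so this is one common convention rather than the protocol of any cited paper.

```python
def mean_iou(pred, target, num_classes):
    """Mean Intersection over Union across classes present in pred or target."""
    ious = []
    for cls in range(num_classes):
        intersection = 0
        union = 0
        for pred_row, target_row in zip(pred, target):
            for p, t in zip(pred_row, target_row):
                in_pred, in_target = p == cls, t == cls
                intersection += in_pred and in_target
                union += in_pred or in_target
        if union:
            ious.append(intersection / union)
    if not ious:
        raise ValueError("no class from range(num_classes) occurs in the inputs")
    return sum(ious) / len(ious)


# Synthetic 2x2 label maps: 0 = soil, 1 = crop.
predicted = [[0, 0],
             [1, 1]]
reference = [[0, 1],
             [1, 1]]
score = mean_iou(predicted, reference, num_classes=2)
```

Here a single mislabelled pixel out of four gives class IoUs of 1/2 and 2/3, so the mIoU is 7/12 (about 0.58) even though pixel accuracy is 0.75. This illustrates why accuracy and mIoU figures from different studies are not directly interchangeable.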
Multi-task extensions, such as the unified crop recognition and stem localization framework in [
62], illustrate how segmentation can be jointly optimized with recognition objectives to enhance structural coherence and downstream localization accuracy. Specialized architectural modifications further improve robustness in disease monitoring scenarios. DF-U-Net integrates dynamic feature fusion and multispectral inputs to enhance wheat yellow rust severity segmentation in UAV imagery [
73]. Such adaptations illustrate how architectural innovations are increasingly tailored to crop-specific spectral and structural characteristics.
4.2.2. Transformer and CNN Architecture
In comparison with CNN-based segmentation models, transformer architectures offer advantages in capturing global context and long-range dependencies, which are beneficial in complex field conditions with irregular crop patterns and background variability [
29,
82]. However, these improvements generally depend on the availability of large training datasets and significant computational resources. CNNs, by comparison, are inherently limited by local receptive fields and may struggle to capture long-range semantic relationships. Even so, in many agricultural applications where labeled data are limited, CNN-based models remain more practical due to their lower training cost and stable performance in data-constrained settings.
The attention mechanisms in transformers, including self-attention and multi-head attention, enable the model to learn relationships between distant regions in an image more effectively. As highlighted in [
83], transformer-based encoder–decoder structures can better model global feature interactions compared to purely convolutional designs. In practice, transformer-based models are most beneficial in scenarios involving large-scale datasets, high-resolution imagery, or complex spatial patterns, while hybrid CNN–Transformer architectures provide a more feasible solution for typical agricultural settings with limited data and computational resources.
4.2.3. Lightweight and Efficiency-Oriented Architectures
To address UAV and edge deployment constraints, lightweight convolutional variants have been introduced to balance segmentation accuracy with computational efficiency. Efficient Dense modules of Asymmetric Convolution (EDANet) and related dynamic alignment strategies have been applied to crop lodging recognition and small-target detection scenarios [
60,
74], where the authors reported 85–90% mIoU and a 4.6% improvement in mean average precision, respectively.
Similarly, ERFNet (Efficient Residual Factorized ConvNet) models have been adapted for crop row and instance-level field segmentation tasks, supporting real-time phenotyping and structural analysis in UAV imagery [
61,
77], with a reported mIoU of around 90%. These approaches emphasize architectural efficiency while preserving spatial fidelity. Recent studies have explored lightweight and hybrid architectures for UAV deployment, focusing on reducing computational overhead while maintaining accuracy, particularly for real-time hyperspectral and disease detection tasks. Ciem C. et al. [84] proposed the Online Hyperspectral Simple Linear Iterative Clustering (OHSLIC) framework, a lightweight architecture that achieves a Dice score of 0.72 and processes 82 frames per second. Likewise, Zhang T. et al. [85] proposed a novel architecture based on Multiscale CNN State with feature fusion and Visual State Space that extracts and integrates features hierarchically and at multiple levels, achieving a pixel-level accuracy of 94.21% and a mean IoU of 91.52%.
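One source of the efficiency gains in asymmetric and factorized convolution designs is simple parameter arithmetic: a k × k kernel is replaced by a k × 1 kernel followed by a 1 × k kernel. The sketch below counts weights for both cases, assuming equal input and output channel counts and no bias terms; the 64-channel setting is illustrative and not taken from the cited architectures.

```python
def conv_params(kernel_h, kernel_w, c_in, c_out):
    """Number of weights in a standard 2D convolution without bias."""
    return kernel_h * kernel_w * c_in * c_out


def factorized_params(k, c_in, c_out):
    """Weights in a k x 1 convolution followed by a 1 x k convolution."""
    return conv_params(k, 1, c_in, c_out) + conv_params(1, k, c_out, c_out)


channels = 64  # assumed layer width
full = conv_params(3, 3, channels, channels)
factorized = factorized_params(3, channels, channels)
saving = 1 - factorized / full
```

For 3 × 3 kernels the factorized pair needs two thirds of the weights of the full convolution, and the saving grows with kernel size, since the cost scales with 2k rather than k².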
4.2.4. Region Proposal and Transfer Learning Frameworks
Region Proposal Network (RPN) mechanisms and transfer learning strategies further enhance segmentation performance under limited agricultural datasets. Three-stage RPN-based frameworks and Mask R-CNN adaptations have been applied to crop and fruit segmentation, leveraging pretrained visual backbones to improve feature robustness and localization accuracy [
63,
78,
79], with a reported F1 score of 89% and a mean average precision of 96–97.5%. Feature reuse and fine-tuning enable improved generalization across varying canopy structures and orchard layouts.
4.2.5. Hybrid and Alternative Architectures
Beyond canonical convolutional models, alternative architectures such as multilayer perceptron (MLP)-based segmentation and multi-sensor fusion frameworks have been explored for structured field environments [
80,
81], where the authors reported 86.2% mIoU and 92% accuracy, respectively. These models demonstrate that carefully engineered feature representations can remain competitive in constrained or semi-structured agricultural contexts.
Collectively, these representative architectures illustrate how model family selection interacts with sensing modality, data characteristics, and analytical objectives, reinforcing the multidimensional taxonomy introduced in
Section 2.2. While convolutional encoder–decoder models remain dominant, emerging transformer-based segmentation frameworks (
Section 4.1.5) introduce global attention mechanisms that may further enhance cross-scale contextual modeling in high-resolution UAV imagery.
Taken together, the methodological categories reviewed in
Section 4.1 reveal a clear progression in segmentation strategies for UAV-based precision agriculture. Threshold- and color-based approaches emphasize computational simplicity and remain suitable for controlled or high-contrast environments. Texture- and shape-based methods introduce structural priors that improve robustness under moderate variability but remain constrained by handcrafted feature design. Deep learning paradigms substantially enhance resilience to heterogeneous backgrounds and complex canopy geometries, while transformer-based models extend contextual reasoning across broader spatial scales through attention mechanisms.
At the architectural level (
Section 4.2), these paradigms are instantiated through encoder–decoder networks, lightweight efficiency-oriented variants, region proposal frameworks, and hybrid fusion models. Architectural selection determines how effectively methodological principles translate into operational performance under real-world UAV constraints, including limited onboard computation, variable illumination, and sparse annotations. The selection of segmentation strategy therefore reflects an integrated trade-off among computational efficiency, data availability, sensing modality, model complexity, and deployment constraints in UAV-enabled agricultural systems.
Overall, segmentation serves as a crucial intermediate representation within the broader analytics pipeline. It provides a structured representation of agricultural scenes by isolating crops, leaves, and regions of interest from complex backgrounds. These segmented outputs reduce noise and enable more precise localization of relevant features. Building on this representation, detection models can more effectively identify pests and disease symptoms within the extracted regions. This transition reflects the progression from pixel-level understanding to object-level analysis in UAV-based agricultural workflows. By transforming raw UAV imagery into structured spatial units, segmentation enables downstream tasks such as pest and disease detection, bloom and fruit counting, canopy characterization, and yield prediction, which are examined in the following sections.
5. Pest and Disease Detection Models
Pest and disease detection represents a core analytical task within our taxonomy, relying primarily on single-frame RGB imagery from UAVs, ground-mounted cameras, or IoT-enabled imaging systems. These tasks are typically formulated as object detection, pixel-level lesion segmentation, or image-level classification problems and are dominated by deep learning model families such as region-based detectors, one-stage detectors, encoder–decoder architectures, and fine-tuned convolutional networks. The detection tasks correspond to the analytical task dimension in the taxonomy shown in Figure 1. The following subsections review representative approaches for pest detection and disease detection separately, illustrating how model families align with specific sensing modalities, data types, and application objectives.
5.1. Pest Detection
Deep learning-based object detectors have become the standard for automated pest monitoring, driven by their ability to localize small objects under challenging outdoor conditions. UAV imagery, in-field cameras, and low-power embedded systems supply the visual data, while model families such as Faster R-CNN, Mask R-CNN, and YOLO variants form the dominant detection backbone. These approaches illustrate the interplay between sensing modality (high-resolution UAV images), data type (object-centric RGB frames), and analytical task (object detection) in the taxonomy. Ching-Ju et al. [
27] reported an accuracy of 90% using Faster/Mask R-CNN models. Similarly, Ref. [
28] achieved a mean average precision of 0.93 with YOLO variants. An F1 score of 0.92 was reported by Ref. [86], while Ref. [30] obtained an F1 score of 0.81. In addition, Saranya T. et al. [7] reported an accuracy of 96.58% using fine-tuned models.
Table 4 summarizes representative pest detection models and their corresponding agricultural use cases.
5.1.1. Faster R-CNN and Mask R-CNN
Region-based detectors remain strong performers for small object detection due to their explicit region proposal mechanism. Faster R-CNN integrates a Region Proposal Network (RPN) with classification and regression heads, enabling the precise localization of small pests in complex orchard environments. Refs. [
33,
87] demonstrated its effectiveness on the Pest24 dataset, achieving an AP of 98.6%. Mask R-CNN extends this architecture with pixel-level segmentation branches, supporting tasks where both detection and lesion delineation are needed. For example, Ref. [
27] used Mask R-CNN within an Artificial Intelligence of Things (AIoT) pipeline to detect and segment lesions on coffee leaves, enabling fine-grained health monitoring.
5.1.2. YOLO-Based Detectors
YOLO-based one-stage detectors prioritize speed and are well suited for UAVs and edge devices. YOLOv3 has been deployed for real-time pest identification in integrated AIoT systems [
27] with 90% accuracy, while Tiny YOLOv3 was demonstrated on embedded drone platforms for fruit tree pest monitoring [
28], with a reported mean average precision of 0.93. Lightweight variants such as Ag-YOLO combine ShuffleNet-v2 backbones with YOLO heads to achieve a high F1 score (92.05%) for precision spraying in field conditions [
86]. Additional enhancements, such as DenseNet backbones [
30] and YOLOv5 architectures [
88] with mean average precision of 0.92, further improve robustness under occlusion and variable lighting.
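The F1 scores quoted for these detectors depend on how predicted boxes are matched to annotated pests. A simplified version of that evaluation is sketched below, assuming boxes in `(x1, y1, x2, y2)` format, greedy matching in order of confidence, and an IoU threshold of 0.5. These are common conventions chosen for illustration; the cited studies may use different thresholds or matching rules.

```python
def box_iou(a, b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def detection_scores(predictions, truths, iou_threshold=0.5):
    """Return (precision, recall, f1) for predictions given as (score, box) pairs."""
    matched = set()
    true_pos = 0
    for _score, box in sorted(predictions, key=lambda p: p[0], reverse=True):
        best_index, best_iou = -1, iou_threshold
        for index, truth in enumerate(truths):
            if index in matched:
                continue
            iou = box_iou(box, truth)
            if iou >= best_iou:
                best_index, best_iou = index, iou
        if best_index >= 0:
            matched.add(best_index)
            true_pos += 1
    false_pos = len(predictions) - true_pos
    false_neg = len(truths) - true_pos
    precision = true_pos / (true_pos + false_pos) if predictions else 0.0
    recall = true_pos / (true_pos + false_neg) if truths else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Synthetic example: two annotated pests, three detections (one spurious).
ground_truth = [(0, 0, 10, 10), (20, 20, 30, 30)]
detections = [(0.9, (1, 1, 10, 10)), (0.8, (50, 50, 60, 60)), (0.7, (20, 20, 30, 31))]
precision, recall, f1 = detection_scores(detections, ground_truth)
```

In this example both pests are found and one detection is spurious, giving a precision of 2/3, a recall of 1, and an F1 score of 0.8.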
5.1.3. VGG, ResNet, and Fine-Tuned CNNs
Convolutional backbones also remain widely used for pest classification when bounding boxes are not required. Fine-tuned VGG16 architectures have achieved strong performance for multiclass pest categorization, reaching 96.58% accuracy in [
90]. ResNet-based classifiers have been incorporated as the final stage of multi-step detection pipelines [
31], and hyperparameter optimized VGG variants have shown strong generalization across multiple pest classes [
7] with the highest reported accuracy of 96.58%. These methods highlight how classical CNN families continue to complement object detection pipelines, particularly under limited training data.
Table 5 summarizes representative disease detection models, linking model families to common agricultural use cases.
Model selection in UAV-based precision agriculture is closely tied to the nature of the analytical task and deployment constraints. Encoder–decoder architectures such as U-Net are particularly effective for crop and canopy segmentation due to their ability to preserve spatial resolution and capture fine-grained pixel-level details, which are essential for delineating plant structures. For example, U-Net-based architectures achieved a performance of 90–95% in [71,72]. In contrast, one-stage detectors such as the YOLO series (including YOLOv3 and Tiny YOLO) are better suited for real-time pest and disease detection, as they provide a favorable trade-off between detection accuracy (90–93%) and inference speed (35–40 frames per second (FPS)) [28,86], making them practical for onboard UAV deployment. Two-stage detectors such as Faster R-CNN generally achieve higher localization accuracy but require greater computational resources (e.g., an inference speed of 5 FPS for the method presented in [21]), which limits their use in real-time applications and makes them more suitable for offline analysis. Model selection is therefore driven not solely by accuracy but by the balance between precision, speed, and operational constraints.
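The practical meaning of these frame rates can be made concrete with a short calculation. The sketch below converts FPS into per-frame latency and into the distance a UAV travels between two processed frames. The 5 m/s ground speed is a hypothetical value chosen for illustration; the frame rates are those quoted above.

```python
def frame_latency_ms(fps):
    """Time available to process one frame, in milliseconds."""
    return 1000.0 / fps


def ground_gap_m(fps, ground_speed_mps):
    """Distance flown between two consecutive processed frames, in metres."""
    return ground_speed_mps / fps


ASSUMED_GROUND_SPEED = 5.0  # m/s, hypothetical survey speed

summary = {
    name: (frame_latency_ms(fps), ground_gap_m(fps, ASSUMED_GROUND_SPEED))
    for name, fps in {"one-stage, 35 FPS": 35, "two-stage, 5 FPS": 5}.items()
}
```

Under this assumption a detector running at 5 FPS sees the ground only once per metre of flight, whereas one running at 35 FPS samples roughly every 14 cm, which is one way to read the preference for one-stage models in onboard use.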
5.2. Disease Detection
Disease detection encompasses both pixel-level lesion segmentation and image-level disease classification. The choice of model family often depends on the sensing modality: leaf-level imagery from handheld or in-field cameras favors encoder–decoder architectures, whereas canopy-scale UAV imagery motivates hybrid CNNs or transformer-based models.
Table 5 summarizes representative disease detection models across segmentation and classification tasks, linking model families to common agricultural use cases. As in pest detection, these methods map directly onto the “model family” and “analytical task” dimensions of the taxonomy, with data types ranging from high-resolution RGB leaf images to multispectral UAV frames.
5.2.1. Inception ResNet-v2 and Hybrid Architectures
Hybrid deep networks combining inception modules and residual connections capture multiscale and hierarchical lesion patterns. In [
89], an Inception ResNet-v2 architecture achieved 86.1% accuracy for coconut tree disease detection, demonstrating robustness to complex backgrounds and heterogeneous lighting. Such hybrid architectures are especially suitable for canopy-level monitoring where lesions appear at varying spatial scales.
5.2.2. U-Net and Encoder–Decoder Models
Encoder–decoder networks remain the dominant approach for precise lesion segmentation. U-Net and its variants have been widely applied to leaf-level disease mapping, spike or panicle segmentation, and early anomaly detection [
7,
33], reporting 96.58% accuracy with 0.5% loss. These models isolate diseased regions for downstream classification and quantification. For example, Ref. [
91] employed U-Net for sorghum panicle segmentation, while U-Net variants achieved precision above 94% across diverse disease datasets [
92,
93]. The strong performance of encoder–decoder architectures reinforces their alignment with the segmentation-focused analytical tasks identified in our taxonomy.
5.2.3. 2D CNNs and VGG-Based Feature Extractors
2D CNNs have been used for disease classification and lesion localization in crops such as coconut and soybean [
94], with accuracy reaching 93.82%. VGG-19-based models, often combined with ensemble classifiers or partial least squares (PLS) regression, provide strong baselines for small datasets or variable imaging conditions [
95,
96]. Mobile-ready implementations, such as those in [
97,
98], further demonstrate the practicality of CNN-based disease detection for real-time field diagnostics, with accuracy of 99.5% and 91.5%, respectively.
5.2.4. Classical Machine Learning Models
Although deep learning dominates contemporary work, classical ML models remain useful where data scarcity or interpretability is a priority. Support vector machines (SVMs) trained on histogram of oriented gradients (HOG) features have shown competitive performance for tomato and papaya leaf disease classification [99], with an F1 score of 92.15%. These approaches highlight that model families beyond deep learning still play a role, particularly in low-resource agricultural environments.
Pest and disease detection illustrates how different model families, sensing modalities, and data types align with the analytical tasks defined in our taxonomy. Identifying affected regions and plant conditions supports subsequent analyses such as bloom detection, fruit counting, and yield estimation, which rely not only on accurate detection but also on consistent spatial and temporal interpretation of field conditions. This progression highlights the shift from detection to quantitative assessment in precision agriculture; these downstream tasks are discussed in the next section.
6. Bloom Detection, Fruit Counting, and Yield Prediction
Bloom detection, fruit counting, and yield prediction form a sequential analytical pipeline in precision agriculture, with flowering intensity and fruit load serving as intermediate indicators of eventual yield [
32,
100]. These tasks map directly onto the taxonomy introduced in
Section 2.2: they rely primarily on image-based and multi-temporal data types (similar to
Figure 2) and draw on model families ranging from classical machine learning to deep convolutional and hybrid sequential networks that currently dominate operational implementations. The following applications represent downstream analytical tasks given in the typical framework illustrated in
Figure 2.
6.1. Bloom Detection
Bloom detection supports phenological monitoring and early season yield forecasting. Deep learning models have significantly improved robustness under heterogeneous orchard conditions involving occlusion, clutter, and variable illumination. For instance, Ref. [
101] applied DeepLab-ResNet with atrous convolutions and spatial pyramid pooling for multispecies bloom segmentation, followed by a region growing refinement (RGR) step to improve boundary localization.
Hybrid pipelines that pair CNN-based feature extraction with classical ML remain effective when annotated data are limited. Ref. [
102] demonstrated that a fine-tuned CNN combined with an SVM classifier achieved an F1 score of 93.4% for apple bloom detection, outperforming HSV+SVM baselines. Consistent patterns appear across crops: CNN-based bloom stage classifiers achieved over 95% accuracy in lettuce fields [
103], while SVMs trained on handcrafted features remain competitive in low-data scenarios [
104].
6.2. Fruit Counting
Fruit counting supports in-season yield estimation, inventory planning, and thinning decisions. Earlier approaches relied on handcrafted color and shape cues. Ref. [
105] used RGB/HSI segmentation with connected component analysis for apple counting, achieving
and root mean squared error (RMSE) of 20 fruits per tree. Ref. [
106] integrated SVMs, the Hough Transform, and spatial enhancement to detect green oranges with 97% accuracy.
Modern deep detectors now dominate orchard scale counting. YOLOv5 paired with Deep SORT achieved 99% accuracy for green tomatoes and 85% for red tomatoes in UAV imagery [
107]. In mango orchards, MangoYOLO augmented with Kalman filtering and Hungarian matching addressed occlusions by tracking fruits across frames, yielding 62% agreement with harvest counts and outperforming dual view baselines [
108]. CNN-based counting pipelines for lettuce and tomato routinely exceed 98% accuracy [
103,
109].
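The frame-to-frame association step used by such tracking-based counters can be sketched as an optimal assignment between existing tracks and new detections. The version below finds the minimum-cost assignment by brute force over permutations, which yields the same optimum as the Hungarian algorithm but is only feasible for a handful of objects. It also omits Kalman prediction and track persistence through occlusion, and the 30-pixel gating distance is an assumed value.

```python
import math
from itertools import permutations


def match_detections(tracks, detections, max_dist=30.0):
    """Return (track_index, detection_index) pairs of the minimum-cost assignment."""
    if not tracks or not detections:
        return []
    swap = len(tracks) > len(detections)
    small, large = (detections, tracks) if swap else (tracks, detections)
    best_cost, best_perm = math.inf, None
    for perm in permutations(range(len(large)), len(small)):
        cost = sum(math.dist(small[i], large[j]) for i, j in enumerate(perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    pairs = []
    for i, j in enumerate(best_perm):
        if math.dist(small[i], large[j]) <= max_dist:  # reject implausible jumps
            pairs.append((j, i) if swap else (i, j))
    return pairs


def count_unique(frames, max_dist=30.0):
    """Count objects across frames, adding one for every unmatched detection."""
    if not frames:
        return 0
    tracks = list(frames[0])
    total = len(tracks)
    for detections in frames[1:]:
        matched = {det for _trk, det in match_detections(tracks, detections, max_dist)}
        total += len(detections) - len(matched)
        tracks = list(detections)
    return total


# Synthetic fruit centroids (pixels) in three consecutive frames.
frames = [
    [(10, 10), (50, 50)],
    [(12, 11), (52, 49), (100, 100)],
    [(14, 12), (102, 101)],
]
fruit_count = count_unique(frames)
```

Naively summing detections per frame would give seven fruits here, whereas association yields three. In practice a polynomial-time solver such as `scipy.optimize.linear_sum_assignment` would replace the permutation search.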
Recent work integrates detection with geometric modeling. Ref. [
110] introduced a UAV-based workflow using HSV filtering, Density-Based Spatial Clustering of Applications with Noise (DBSCAN), and sphere fitting to infer counts from 3D structure; large clusters were refined via a secondary K-means step. Although this approach highlights the potential of geometric cues, performance remains highly sensitive to hyperparameters, illumination, and canopy density, factors that are less limiting for modern deep detectors.
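The clustering stage of such a workflow, and its dependence on hyperparameters, can be illustrated with a compact DBSCAN implementation. This is a generic textbook version in plain Python, not the implementation used in the cited study; the point coordinates, `eps`, and `min_pts` are assumed values.

```python
import math


def dbscan(points, eps, min_pts):
    """Return one cluster label per point; -1 marks noise."""
    noise = -1
    labels = [None] * len(points)  # None = not yet visited

    def neighbours(index):
        return [j for j, q in enumerate(points)
                if math.dist(points[index], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == noise:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:  # core point: expand the cluster
                queue.extend(reachable)
    return labels


# Synthetic 3D points: two compact fruit-like groups and one stray point.
cloud = [(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 0),
         (5, 5, 0),
         (10, 10, 0), (10, 11, 0), (11, 10, 0), (11, 11, 0)]
labels = dbscan(cloud, eps=1.5, min_pts=3)
estimated_count = max(labels) + 1
```

With `eps=1.5` the two groups are recovered and the stray point is discarded as noise. Setting `eps` large enough to bridge the groups would merge them into one cluster, which is the kind of hyperparameter sensitivity noted above.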
Unsupervised clustering methods such as K-means offer lightweight solutions when no labels are available, as shown for maize in [
111], but they struggle with occlusion and overlapping fruit, restricting their scalability to orchard environments.
6.3. Fruit Yield Prediction
Yield prediction integrates spatial, temporal, and environmental information and remains one of the most challenging tasks in agricultural analytics. Hybrid temporal models such as LSTM-1D CNN architectures have demonstrated strong performance; for example, Ref. [
16] achieved
for rice yield prediction using multi-temporal satellite indices and temperature data. Classical ML approaches remain competitive when labeled data are scarce: fuzzy rule-based classification systems (FRBCSs) reached 94.29% accuracy for tomato ripeness estimation [
112], and SVMs using canopy and fruit features produced
for apple yield prediction [
113].
Tree-based models are broadly used for their ability to capture nonlinear interactions and spatial heterogeneity. Ref. [
114] reported that Random Forest achieved
for apple yield prediction, outperforming the mechanistic Carnegie–Ames–Stanford Approach (CASA) model, while XGBoost reached
for wild blueberry yield prediction [
115]. Linear models remain effective for small datasets, as shown by [
116], who achieved
for olive yield using UAV-derived NDVI, slope, and canopy features.
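As a minimal illustration of why linear models remain practical for small datasets, the sketch below fits an ordinary least squares line to a single predictor. It assumes one feature (NDVI) for brevity, whereas the cited study combined several; the function name and all values are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical plot-level NDVI and yield values (illustrative only).
ndvi = [0.4, 0.5, 0.6, 0.7, 0.8]
yield_kg_per_tree = [10.0, 12.0, 14.0, 16.0, 18.0]

slope, intercept = fit_line(ndvi, yield_kg_per_tree)
predicted = slope * 0.65 + intercept  # prediction for an unseen NDVI value
```

With only two parameters to estimate, such a model can be fitted from a handful of plots, although it cannot represent the nonlinear interactions that tree-based models capture.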
More recent studies explore multimodal and ensemble learning. A stacking ensemble combining ConvLSTM and SVR achieved
for strawberry yield and price forecasting [
117], while BPNNs integrating vegetation indices, texture metrics, and 3D canopy morphology achieved
–
[
118]. Deep architectures also scale effectively to large datasets: for corn, ERT, RF, and deep networks reached RMSE values of 0.75–0.85 t/ha [
87].
Table 6 summarizes representative modeling approaches for yield-related tasks, linking methodological categories to typical agricultural applications.
Table 7 synthesizes the alignment between major yield prediction challenges and effective model families reported in the literature. These tables underscore the importance of aligning model choice with data characteristics and task requirements.
Table 8 and
Table 9 synthesize representative approaches across a broad set of crops and analytical tasks, linking model families to reported performance outcomes in bloom detection, fruit counting, and yield prediction. This cross-crop perspective illustrates how the taxonomy’s four dimensions (sensing modality, data type, model family, and analytical task) interact in practical deployments. Across crops, deep detection architectures consistently dominate fruit counting tasks due to their robustness to occlusion and complex canopy structure, whereas yield prediction exhibits greater methodological diversity, reflecting its stronger dependence on temporal dynamics, environmental variability, and multimodal integration. These patterns suggest that task complexity and data structure, not merely model innovation, drive methodological selection in precision agriculture.
Taken together, bloom detection, fruit counting, and yield prediction demonstrate how sensing choices, data structures, and model architectures interact across the agricultural analytics pipeline. They also highlight recurring challenges, including occlusion, spectral variability, limited labeled data, and domain shift across orchards and seasons, that motivate the research directions discussed in
Section 8.
Although convolutional and recurrent architectures currently dominate these tasks in operational settings, emerging attention-based and transformer models may offer new opportunities for modeling long-range spatial and temporal dependencies as larger multimodal agricultural datasets become available.
Performance improvements across agricultural tasks are governed by a combination of data quality, task complexity, and model-data compatibility. In segmentation tasks (
Section 4), accuracy is largely dependent on the availability of high-quality annotations and the spatial resolution of UAV imagery, as precise boundary delineation is required. Detection tasks (
Section 5.1 and
Section 5.2) are more sensitive to object scale, inference speed, occlusion, and background variability, where models must balance localization accuracy with computational efficiency, particularly for real-time applications. In contrast, yield prediction (
Section 6.3) relies on temporal consistency and the integration of multiple data sources, including environmental and phenological information, making it more dependent on sequential modeling (e.g., LSTM) and multimodal fusion [
16]. These differences indicate that no single model is universally optimal and that performance gains come from aligning model capabilities with task-specific requirements and data characteristics.
7. System-Level Evaluation and Deployment Considerations
While
Section 4,
Section 5 and
Section 6 reviewed machine learning approaches across individual analytical tasks including segmentation, pest and disease detection, and yield-related prediction, these methods are often evaluated in isolation. However, practical precision agriculture systems operate as integrated pipelines, where model performance must be considered alongside sensing conditions, data characteristics, and deployment feasibility.
To bridge this gap, this section provides a system-level synthesis of model families across tasks and examines their readiness for real-world deployment.
Table 9 summarizes how major model families discussed throughout
Section 4,
Section 5 and
Section 6 align with analytical tasks, sensing modalities, and representative architectures. Rather than focusing on individual studies, this synthesis highlights recurring design patterns and trade-offs across model classes, providing a unified view of the methodological landscape. Building on this synthesis, we then assess the extent to which current studies report deployment-relevant metrics and identify limitations in translating algorithmic performance into operational systems.
Table 10 provides a model-centric synthesis of machine learning approaches across segmentation, detection, and prediction tasks. Several key observations emerge.
First, convolutional neural networks and their variants remain the dominant model family across most tasks, particularly for image-driven applications such as segmentation and disease detection. Lightweight CNN variants are commonly used when real-time or edge deployment is required, while transformer-based and hybrid architectures are increasingly explored for capturing complex spatial and spectral dependencies.
Second, model selection is closely tied to sensing modality; RGB-based UAV imagery dominates detection and segmentation tasks, whereas multispectral, hyperspectral, and multimodal data are more frequently associated with prediction and stress analysis. This reflects the trade-off between spatial resolution and spectral richness in agricultural sensing systems.
Third, each model family exhibits distinct strengths and limitations in terms of accuracy, computational complexity, data requirements, and deployment suitability. These trade-offs highlight the need to evaluate models beyond predictive performance, motivating the deployment-focused analysis presented in the following subsection.
Deployment Assessment and Reporting Gaps
To assess the practical readiness of existing approaches, we reviewed about 15 representative studies discussed in this survey that show promising research directions with respect to performance and methodology, covering tasks such as crop disease detection, UAV-based monitoring, and segmentation. The aim was to examine how deployment-related metrics are reported, including inference speed, model complexity, memory footprint, power consumption, and bandwidth, along with system-level performance. The analysis shows that only a small fraction of these studies give any indication of inference speed (typically FPS), while key indicators such as model size, floating-point operations (FLOPs), memory usage, and energy consumption are almost entirely absent. System-level metrics, including throughput and end-to-end latency, are reported in even fewer studies, primarily in IoT-based implementations. This highlights a clear gap between algorithmic development and real-world deployment considerations. Based on this assessment, we propose a three-tier reporting standard for AI-based agricultural systems:
1. Tier 1 (Minimum)
   - Model footprint: parameter count, model file size, input resolution used during inference, and FLOPs or multiply–accumulate operations (MACs) computed at that resolution.
   - Inference speed: latency per image (reported as mean ± standard deviation), FPS derived from latency, and the batch size and framework used for evaluation.
2. Tier 2 (Recommended)
   - Memory and deployment: peak GPU video memory (VRAM) usage during inference, CPU RAM usage for edge- or CPU-based deployment, and any quantization or optimization applied.
   - Real-world throughput: processing rate (images per hour), field coverage rate (e.g., hectares per hour for UAV/robot systems), platform specifications such as UAV speed and altitude, and end-to-end pipeline latency (from data capture to final action).
3. Tier 3 (Aspirational)
   - Energy and power: GPU or device power consumption during inference, energy usage per frame, and estimated battery life for mobile or field platforms.
   - Reproducibility: availability of public code and trained model weights, profiling tools used (e.g., ptflops), and fair comparison with baselines under identical hardware conditions.
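Several of the quantities listed in the tiers above are simple to derive once raw timings are available. The following Python sketch shows one possible way to report latency as mean ± standard deviation, derive FPS from it, and estimate a field coverage rate. The function names, the dummy inference callable, and the flight parameters are hypothetical, and the coverage estimate assumes straight passes with no turns or image overlap.

```python
import statistics
import time

def fps_from_latency(mean_latency_ms):
    """Frames per second implied by a mean per-image latency in milliseconds."""
    return 1000.0 / mean_latency_ms

def measure_latency(infer, inputs, warmup=2):
    """Return (mean_ms, std_ms, fps) for single-image calls to `infer`."""
    for x in inputs[:warmup]:
        infer(x)  # warm-up calls are excluded from the timing
    timings_ms = []
    for x in inputs:
        start = time.perf_counter()
        infer(x)
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    mean_ms = statistics.mean(timings_ms)
    std_ms = statistics.stdev(timings_ms) if len(timings_ms) > 1 else 0.0
    return mean_ms, std_ms, fps_from_latency(mean_ms)

def coverage_ha_per_hour(speed_m_s, swath_width_m):
    """Idealized area swept per hour in hectares (1 ha = 10,000 m^2)."""
    return speed_m_s * swath_width_m * 3600.0 / 10_000.0

# Hypothetical stand-in for a model's inference call (illustrative only).
def dummy_infer(image):
    return sum(range(20000))

mean_ms, std_ms, fps = measure_latency(dummy_infer, list(range(10)))
coverage = coverage_ha_per_hour(speed_m_s=5.0, swath_width_m=20.0)
```

Reporting the batch size, input resolution, and hardware alongside these numbers is what makes them comparable across studies; the timing loop alone does not.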
8. Discussion, Challenges and Future Directions
The surveyed literature demonstrates substantial progress in AI and UAV-enabled precision agriculture across the full analytics pipeline, from sensing and preprocessing to segmentation, detection, counting, and yield prediction. However, beyond incremental performance gains, the field faces broader challenges related to robustness, scalability, and system integration. This section synthesizes methodological patterns across tasks, identifies systemic bottlenecks that limit generalization and deployment, and outlines research directions toward resilient operational systems.
8.1. Synthesis Across the Analytics Pipeline
A clear methodological stratification has emerged across analytical tasks, largely driven by differences in data structure and task requirements. Convolutional architectures and region-based detectors dominate image-intensive operations such as segmentation, pest detection, bloom identification, and fruit counting. CNN backbones (e.g., VGG and ResNet), encoder–decoder models (e.g., U-Net variants), and one-stage detectors (e.g., YOLO families) provide a practical balance between accuracy and computational efficiency, particularly for UAV-based deployment. In contrast, yield prediction exhibits greater architectural diversity, frequently combining tree-based models, boosting methods, hybrid LSTM-CNN architectures, and ensemble strategies to capture nonlinear interactions and temporal dynamics [
16,
114,
115].
In response to Research Question 1, CNN-based models remain effective for spatial tasks such as segmentation and detection, whereas transformers show advantages in modeling complex patterns in high-resolution and multimodal data. RNN-based architectures are primarily suited for temporal prediction tasks. Consequently, architectural selection is increasingly driven by task-specific data characteristics and deployment constraints rather than by generic model superiority.
Segmentation functions as a structural bridge within the pipeline. Reliable delineation of canopies, leaves, fruits, and lesions improves downstream detection, counting, and yield estimation [
32,
33]. However, segmentation models are frequently trained under limited environmental variability, raising questions about their robustness under domain shift across seasons, lighting conditions, and crop phenology.
An emerging shift in modeling philosophy involves the gradual adoption of attention-based and transformer architectures. Transformers enable broader spatial and temporal relational modeling, which is particularly relevant for agricultural imagery exhibiting long-range dependencies such as canopy structure and disease spread. Although these models remain comparatively underexplored due to data and computational requirements, hybrid CNN–Transformer architectures represent a promising direction for integrating local feature extraction with contextual reasoning.
8.2. Structural Challenges to Robust Deployment
Despite strong reported performance, several structural limitations constrain generalization and operational scalability. These challenges emerge across multiple layers of the pipeline shown in
Figure 1.
Data limitations, annotation costs, and domain generalization remain major barriers to reliable deployment. Pixel-level annotation for segmentation and lesion mapping is especially resource-intensive and dependent on expert knowledge [
7,
9]. In addition, small and imbalanced datasets increase the risk of overfitting, particularly for rare crop conditions and early stress detection tasks. Consequently, models trained on specific orchards, cultivars, sensor configurations, or seasonal conditions often fail to generalize across new environments. Environmental variability across climate, soil conditions, and management practices further amplifies these limitations.
Although domain adaptation techniques have been explored [
32,
33], systematic cross-region or cross-season evaluation remains limited, creating a persistent gap between experimental validation and field-level reliability. These findings indicate that model reliability is strongly influenced by environmental variability, limited and imbalanced datasets, and differences in sensing configurations across regions and seasons.
Fragmented multimodal fusion also limits system-level coherence. Although UAV imagery is often combined with vegetation indices or environmental variables, unified architectures integrating UAV, satellite, IoT, and management data remain uncommon. Existing fusion strategies are typically feature level or post hoc rather than representation level, limiting the ability to learn shared cross-modal abstractions across spatial and temporal scales.
Deployment constraints represent a critical barrier to practical adoption. Many models are evaluated offline on high-performance hardware, whereas agricultural operations require real-time inference on UAVs, robots, or edge devices with limited power and memory. Efficient architecture design, model compression, and hardware-aware optimization therefore remain essential yet comparatively underexplored relative to accuracy improvements [
27,
28,
86].
Several studies have demonstrated field-level implementations using UAVs for crop monitoring [
1,
35,
120], pest detection [
28,
82,
92,
94], and yield estimation [
87,
113,
121,
122], often relying on onboard or edge-based inference. However, balancing predictive accuracy with computational efficiency remains challenging due to battery, memory, and bandwidth limitations [
1,
35,
120]. Consequently, many systems rely on lightweight models [
29,
99,
113,
123], partial onboard processing, and hybrid edge–cloud frameworks to maintain operational feasibility.
Evaluation and reproducibility limitations hinder comparative progress. Inconsistent metrics, heterogeneous experimental protocols, and limited public benchmarks restrict cross-study synthesis. Reported results are largely based on average accuracy or mAP, which may not reflect performance under the class imbalance or rare events common in agricultural data. The lack of standardized, large-scale, and publicly accessible benchmarks further limits reproducibility and direct comparison across studies. Many existing datasets remain geographically localized, crop-specific, or privately collected, with inconsistent annotation protocols and limited support for cross-season or cross-field evaluation. Moreover, most studies do not report uncertainty estimates (e.g., confidence or prediction intervals) needed for risk-aware decision making, which limits understanding of out-of-distribution generalization and operational reliability.
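One lightweight way to attach an interval to a reported metric is the percentile bootstrap, sketched below in Python. This is an illustrative assumption about how such intervals could be produced, not a procedure taken from the surveyed studies; the function name, the resample count, and the per-image outcomes are hypothetical.

```python
import random

def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        sum(rng.choice(values) for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper

# Hypothetical per-image outcomes: 1 = correct prediction, 0 = error.
outcomes = [1] * 90 + [0] * 10
ci_lower, ci_upper = bootstrap_ci(outcomes)
```

Reporting the interval alongside the point estimate makes it easier to judge whether a difference between two models on a small test set is meaningful.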
8.3. Emerging Research Directions
Addressing these structural limitations requires methodological advances aligned with practical deployment constraints.
Data-efficient learning represents a central priority. Self-supervised, semi-supervised, and contrastive pretraining strategies can leverage abundant unlabeled UAV and satellite imagery, reducing dependence on expensive annotations while improving robustness across crops and environments.
Domain adaptation and continual learning offer mechanisms for mitigating covariate shift. Approaches such as adversarial feature alignment, meta learning, and sensor-aware normalization may improve cross-region generalization. Continual learning frameworks are particularly relevant for agriculture, where environmental conditions evolve seasonally and interannually.
Representation-level multimodal fusion constitutes another critical frontier. Integrating UAV imagery, satellite time series, IoT measurements, and management metadata within unified architectures, potentially combining convolutional, recurrent, and transformer-based modules, may enable hierarchical modeling across spatial, temporal, and spectral scales.
Transformer-based architectures are likely to play an expanding role as larger and more diverse datasets become available. Their capacity for modeling long-range spatial dependencies and cross-modal attention may be particularly beneficial for hyperspectral imagery, multi-temporal forecasting, and sensor integration. However, advances in efficient training and lightweight attention mechanisms will be necessary for practical deployment.
Edge-aware modeling and compression must also become first-class design considerations. Techniques such as pruning, quantization, knowledge distillation, and neural architecture search tailored to agricultural workloads can facilitate real-time inference under power and bandwidth limitations. Recent studies (2024–2025) have demonstrated the practical value of model compression for agricultural deployment while balancing accuracy and computational efficiency. For example, pruning and quantization have been applied to UAV-based weed detection systems [
124], reducing model size by approximately 70% while maintaining a detection accuracy of about 90%. Similarly, knowledge distillation has been used for lightweight pest and disease identification [
125], enabling deployment on resource-constrained devices with reported accuracies between 94% and 96%. More recently, Yu Haiefang et al. [
126] combined pruning and knowledge distillation for rapeseed pest detection, reducing the model size from 11.2 MB to 4.4 MB and floating-point operations from 28.3 G to 10.01 G on a Jetson Nano edge device, while achieving 93.2% accuracy and 92.7% recall. These findings suggest that compression strategies can substantially improve deployment efficiency without a proportional loss in predictive performance, making them promising for future UAV and edge-based agricultural systems.
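The size and compute figures quoted above for the rapeseed pest detector can be converted into relative reductions with a short calculation, shown here as a Python sketch. The helper name is hypothetical; the input values are those reported in the text.

```python
def relative_reduction(before, after):
    """Fraction by which a quantity shrinks, e.g., 0.5 for a halving."""
    return 1.0 - after / before

size_reduction = relative_reduction(11.2, 4.4)     # model size in MB
flops_reduction = relative_reduction(28.3, 10.01)  # operations in G

summary = (
    f"size -{size_reduction:.1%}, "
    f"operations -{flops_reduction:.1%}"
)
```

Both quantities fall by roughly 61% and 65%, respectively, which puts this result in the same range as the approximately 70% size reduction reported for the weed detection system.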
Lastly, uncertainty-aware and decision-centric evaluation is essential for practical adoption. Metrics such as macro-F1, recall for minority classes, and PR-AUC provide a more informative assessment than overall accuracy in long-tailed scenarios. Evaluations should also extend beyond predictive accuracy to include operational outcomes, such as reduction in input usage, yield improvement, and the cost of false alarms. These measures are critical for translating model performance into practical agricultural value.
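The difference between accuracy and macro-F1 under class imbalance can be made concrete with a small Python sketch. The labels and the degenerate "always healthy" classifier below are hypothetical and chosen only to expose the effect.

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_for_label(y_true, y_pred, label):
    """F1 score of one class, treating it as the positive class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count equally."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_label(y_true, y_pred, l) for l in labels) / len(labels)

# Hypothetical imbalanced test set: 95 healthy plants, 5 stressed plants.
y_true = ["healthy"] * 95 + ["stressed"] * 5
# A degenerate classifier that never predicts the minority class.
y_pred = ["healthy"] * 100

acc = accuracy(y_true, y_pred)
mf1 = macro_f1(y_true, y_pred)
stressed_f1 = f1_for_label(y_true, y_pred, "stressed")
```

Here accuracy is 0.95 although no stressed plant is detected, while macro-F1 drops to about 0.49 because the stressed class contributes an F1 of zero.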
8.4. Implications for Practice and Data Infrastructure
Advancing AI-driven precision agriculture requires coordinated progress in algorithms, sensing infrastructure, data governance, and interdisciplinary collaboration. Standardized data schemas, shared benchmarks, and open datasets would enable reproducible comparison and cross-regional studies [
13,
33]. The incremental adoption of sensing technologies, combined with robust data management pipelines, can facilitate sustainable integration into agricultural workflows.
A practical direction for future research is the integration of complementary learning strategies into a unified framework. For instance, self-supervised pretraining on large-scale unlabeled UAV data can be combined with semi-supervised fine-tuning to address limited annotations. Domain adaptation techniques, such as feature alignment across seasons and sensing conditions, can further improve model robustness. Likewise, for deployment, lightweight optimization strategies including pruning, quantization, and knowledge distillation can be incorporated to enable efficient inference on UAV platforms. Such a hybrid framework provides a feasible pathway toward robust and scalable AI systems in precision agriculture.
Overall, the field is transitioning from isolated proof-of-concept studies toward integrated, operational systems. Sustained progress will depend not only on architectural innovation but also on principled system design that accounts for multimodal data integration, domain variability, and deployment feasibility.
9. Conclusions
This survey reviewed more than one hundred studies on AI and UAV-enabled precision agriculture, synthesizing advances in sensing modalities, data types, model families, and analytical tasks through a unified taxonomy. While deep learning has achieved strong performance in segmentation, pest and disease detection, fruit counting, and yield prediction, many approaches remain constrained by small, localized datasets and limited cross-season or cross-region generalization. Fragmented multimodal fusion strategies and deployment constraints on UAV and edge platforms further hinder large-scale operational adoption.
Emerging methodological directions including self-supervised learning, domain adaptation, representation-level multimodal fusion, and lightweight architecture design offer promising pathways toward more robust and deployable systems. Real-world impact will depend not only on predictive accuracy but also on uncertainty awareness, interpretability, computational efficiency, and integration with decision support workflows. Progress therefore requires coordinated advances in sensing infrastructure, modeling frameworks, and interdisciplinary collaboration among AI researchers, agronomists, and practitioners.
From a broader AI perspective, precision agriculture exposes the structural limitations of current learning paradigms when confronted with non-stationary environments, sparse supervision, multimodal heterogeneity, and stringent deployment constraints. Addressing these challenges demands advances in continual adaptation, data-efficient training, multimodal representation learning, and model compression. Agriculture should thus be regarded not merely as an application domain for AI but as a catalyst for methodological innovation relevant to other dynamic, resource-constrained, and safety-critical real-world systems.
Future research should prioritize data-efficient and generalizable models, particularly by combining CNN-based and transformer-based architectures to capture complex spatial dependencies. RNN-based hybrid approaches remain suitable for temporal tasks such as yield prediction, hybrid CNN–Transformer models offer a practical balance for multimodal agricultural data, and YOLOv5 paired with Deep SORT produced some of the strongest fruit counting results among the surveyed studies. Progress depends on the development of large-scale, diverse benchmark datasets and standardized evaluation protocols, especially across geographic regions and growing conditions. Emerging directions such as multimodal transformers, semi-supervised learning, and edge AI architectures are expected to play a key role in enabling scalable and deployable precision agriculture systems.