Review

Object Detection in Agriculture: A Comprehensive Review of Methods, Applications, Challenges, and Future Directions

by Zohaib Khan, Yue Shen * and Hui Liu
School of Electrical and Information Engineering, Jiangsu University, Zhenjiang 212013, China
* Author to whom correspondence should be addressed.
Agriculture 2025, 15(13), 1351; https://doi.org/10.3390/agriculture15131351
Submission received: 7 May 2025 / Revised: 14 June 2025 / Accepted: 17 June 2025 / Published: 24 June 2025
(This article belongs to the Section Digital Agriculture)

Abstract

Object detection is revolutionizing precision agriculture by enabling advanced crop monitoring, weed management, pest detection, and autonomous field operations. This comprehensive review synthesizes object detection methodologies, tracing their evolution from traditional feature-based approaches to cutting-edge deep learning architectures. We analyze key agricultural applications, leveraging datasets like PlantVillage, DeepWeeds, and AgriNet, and introduce a novel framework for evaluating algorithm performance based on mean Average Precision (mAP), inference speed, and computational efficiency. Through a comparative analysis of leading algorithms, including Faster R-CNN, YOLO, and SSD, we identify critical trade-offs and highlight advancements in real-time detection for resource-constrained environments. Persistent challenges, such as environmental variability, limited labeled data, and model generalization, are critically examined, with proposed solutions including multi-modal data fusion and lightweight models for edge deployment. By integrating technical evaluations, meaningful insights, and actionable recommendations, this work bridges technical innovation with practical deployment, paving the way for sustainable, resilient, and productive agricultural systems.

1. Introduction

Over the past two decades, the field of artificial intelligence (AI) has undergone a profound paradigm shift, catalyzed by transformative advancements in machine learning and computer vision that have redefined the capabilities of automated systems [1]. Object detection, encompassing the simultaneous localization and classification of objects in images, has become an essential component of machine vision systems in agriculture, as demonstrated by its successful application in classifying apple color and deformity using convolutional neural network (CNN)-based methods [2,3]. This evolution has been propelled by the transition from manually engineered feature extraction methods to data-driven approaches, culminating in the widespread adoption of Deep Learning (DL) techniques that exploit extensive computational resources and large-scale annotated datasets [4]. At present, object detection underpins a broad spectrum of applications, including autonomous navigation in self-driving vehicles, anomaly detection in medical imaging, quality assurance in industrial manufacturing, and precision agriculture, where its transformative potential is increasingly evident [5]. The initial reliance on conventional signal processing and static feature engineering has been progressively supplanted by DL architectures, particularly CNNs, which enhance accuracy and robustness in complex, real-world perception tasks such as autonomous driving [6].
The development of object detection has progressed through distinct stages, reflecting the overall evolution of artificial intelligence in terms of innovation and adaptation. In its early stages, object detection relied on traditional computer vision methods that emphasized manual feature extraction and object boundary definition through heuristic and statistical techniques, requiring extensive human intervention [7]. Hyperspectral imaging techniques have shown strong potential for non-destructive chemical analysis in agriculture, as demonstrated by the quantitative detection of mixed pesticide residues on lettuce leaves [8,9]. Near-infrared transmission spectroscopy has been effectively applied for the non-destructive identification of pesticide residues in leafy vegetables, as demonstrated by the detection of contaminants in lettuce leaves [10]. However, these methods exhibited substantial limitations when confronted with complex scenes, occlusions, or variability in object appearance, primarily due to their dependence on hand-crafted feature descriptors. The transition toward machine learning methodologies in the late 2000s introduced trainable classifiers coupled with hand-crafted descriptors; however, performance remained constrained by a limited ability to generalize across diverse contexts [11]. The advent of DL, ignited by the success of AlexNet in 2012, marked a pivotal shift. AlexNet, comprising stacked convolutional and fully connected layers and introducing techniques like ReLU and dropout, demonstrated that CNNs could autonomously learn hierarchical feature representations directly from raw pixel data, eliminating the need for manual feature engineering [12]. Subsequent innovations, including Region-based Convolutional Neural Networks (R-CNN), Faster R-CNN, Single Shot MultiBox Detector (SSD), and You Only Look Once (YOLO), further refined this paradigm by improving detection speed, localization precision, and end-to-end learning capabilities. These models integrated region proposal, feature extraction, and classification into end-to-end trainable architectures, achieving remarkable improvements in both speed and accuracy [3,13]. These advancements have substantially enhanced the technical capabilities of object detection and expanded its practical applicability across a wide range of domains, thereby laying the foundation for its widespread adoption in real-world applications [14]. A summary of key milestones in the evolution of object detection algorithms from 1999 to 2025 is illustrated in Figure 1.
In agriculture, the impact of AI-driven object detection systems is particularly pronounced, as they address longstanding inefficiencies inherent in traditional monitoring and management practices. Traditional agricultural tasks relied on labor-intensive manual inspections, resulting in processes that were time-consuming, prone to human error, and poorly suited to the scale and variability of modern farming operations [15]. Visual assessment of crop health across extensive fields or distinguishing weeds from crops under inconsistent lighting conditions often resulted in delayed interventions and suboptimal resource utilization [16]. In contrast, object detection algorithms powered by DL process imagery from drones, satellites, and ground-based sensors to deliver consistent, scalable, and real-time insights, thereby enabling precision agriculture at an unprecedented level [17]. Numerous studies have validated the effectiveness of such systems in agricultural applications, including the high-accuracy identification of plant diseases using CNNs, the spatial mapping of weed distributions for precision herbicide deployment, and yield estimation through automated flower counting [18,19]. Advanced techniques have further tailored object detection to the unique challenges of rural environments by enabling model adaptation to agricultural datasets and improving detection of small or occluded objects [20]. In addition to mainstream DL approaches, several cross-domain methodologies have demonstrated potential contributions to agricultural object detection. For example, deep belief networks have been used to build intelligent agricultural information systems within Internet of Things (IoT) frameworks, enabling robust data collection and management in distributed farm environments [21]. High-resolution salient object detection frameworks developed for embedded platforms [22] offer insights into deploying lightweight, high-accuracy models in real-time field scenarios. In multi-modal contexts, camera–radar fusion techniques with modality interaction and Gaussian expansion have shown improved robustness for 3D object detection [23], which could benefit navigation and perception in autonomous agricultural robots. Furthermore, remote intelligent perception systems for multi-object detection [24] align with the goals of long-range crop monitoring and pest surveillance. Although not originally designed for agriculture, these innovations provide important methodological references for developing advanced, scalable detection systems under field-specific constraints. Beyond agricultural domains, the same foundational principles enhance capabilities in other fields: in robotics [25,26], object detection facilitates grasping and navigation; in surveillance, it enables threat identification; and in healthcare, it supports diagnostic imaging by pinpointing abnormalities [27]. This versatility underscores the role of object detection as a critical discipline within AI, bridging theoretical innovation with tangible societal benefits [28]. Examples of agricultural robots employing advanced detection systems, perception modules, and actuation units are shown in Figure 2.
Despite the transformative potential of AI-powered object detection systems in agriculture, several inherent limitations remain. Many state-of-the-art models demand substantial computational power and large-scale annotated datasets, which are often impractical to obtain or deploy in resource-constrained rural environments. Field conditions introduce additional variability—such as fluctuating lighting, partial occlusion by foliage, and sensor noise—that can significantly degrade detection accuracy. Moreover, the initial costs of hardware acquisition, sensor integration, and customized model training remain a barrier to adoption for small-scale farmers. Real-time inference, particularly on edge devices, also poses significant challenges due to latency and energy constraints. These limitations underscore the need for ongoing research into lightweight architectures, domain adaptation techniques, and data-efficient learning frameworks that are robust to real-world agricultural conditions.
The significance of object detection extends beyond its immediate applications, reflecting broader trends and challenges in AI research. The transition from traditional methods to deep learning has not only enhanced performance but also introduced new complexities, including the need for large, annotated datasets and substantial computational resources, which pose barriers to deployment in resource-constrained settings [33]. In agricultural applications, domain-specific factors, including variable lighting conditions, dense foliage occlusions, and limited labeled data for rare crops or pests, further exacerbate these challenges, necessitating innovative solutions based on synthetic data generation and domain adaptation techniques [34]. Achieving real-time performance on edge devices continues to drive research into lightweight and computationally efficient architectures, particularly in tasks like field navigation where improved YOLOv8 structures have been applied successfully [35]. Across multiple domains, object detection remains intertwined with fundamental AI research questions, including the ability of models to generalize across diverse datasets, the trade-offs between accuracy and computational efficiency, and strategies for ensuring robustness in noisy, unstructured environments.
To address object detection in agriculture, this review conducts a comprehensive and critical synthesis focused on practical applications and contributions to artificial intelligence research.
(1) Object detection methodologies are reviewed, tracing their progression from classical feature-based techniques to DL frameworks, with emphasis on agricultural applications including crop monitoring and weed classification. Technical foundations, including early feature extraction methods and CNN-based models, are summarized, highlighting adaptations for agricultural environments.
(2) Object detection methods are systematically compared across agricultural tasks. Evaluation considers metrics including mean Average Precision (mAP), inference speed, and robustness to environmental noise, with additional comparisons to domains such as robotic navigation and medical diagnostics to analyze trade-offs between accuracy and computational efficiency.
(3) Key deployment challenges are analyzed, covering both general AI issues and agricultural-specific complexities. Agricultural challenges include unstructured scenes, seasonal variability, and integration of multi-modal data sources (RGB, thermal, hyperspectral imagery). Solutions focus on improvements in data preprocessing, model design, and validation strategies.
(4) Future research directions are proposed, focusing on the development of lightweight, energy-efficient models for edge deployment, fusion of multi-modal sensor data to enhance detection robustness, and integration of explainable AI for system transparency [36].
Recent interdisciplinary advancements further reflect the evolving landscape of object detection in agriculture, especially under constraints of generalization, sensing diversity, and precision deployment. The use of UAV-based multimodal data combined with energy balance models has enabled fine-grained environmental perception, such as paddy field evapotranspiration estimation with improved accuracy and scalability [37]. In structured environments like plant factories, the integration of LiDAR, inertial, and ultrasonic sensors within tightly-coupled SLAM systems has demonstrated high-precision localization, addressing the limitations of conventional vision-only navigation pipelines [38]. Meanwhile, improvements in weak target detection through dual-image contrast analysis have enhanced the ability to identify small or low-contrast agricultural objects under occlusions or thermal ambiguity [39]. To address the inherent variability between real and synthetic domains, meta-learning approaches have been employed to encode transferable priors, significantly improving the robustness of object detection models in unseen agricultural scenarios [40]. Collectively, these efforts underscore a shift toward multimodal fusion, spatially aware representations, and adaptive learning frameworks, all of which are central to the next generation of efficient and resilient detection systems in precision agriculture.

Scope and Contributions of This Review

This review aims to provide a comprehensive synthesis of object detection in agriculture, addressing the need for a structured evaluation of methodologies and their practical implications. We introduce a novel framework for comparing algorithms based on mean Average Precision (mAP), inference speed, and computational cost, tailored to agricultural tasks. By integrating quantitative analyses, critical insights into environmental adaptability, and recommendations for lightweight model deployment, this work bridges theoretical advancements with real-world applications. To better illustrate the progression of major object detection frameworks, Table 1 summarizes key models, their release years, types, and distinguishing features. This comparison highlights the evolution from two-stage architectures to faster and more efficient one-stage methods.

2. Object Detection Fundamentals

Object detection constitutes a cornerstone of computer vision, addressing the dual challenge of identifying and localizing objects within an image or video frame by generating bounding boxes around detected entities and assigning corresponding class labels [48]. Unlike image classification, which labels an entire image, or semantic segmentation, which assigns per-pixel classes without bounding discrete objects, object detection provides localized bounding box predictions [49]. The capability to spatially localize objects has rendered object detection indispensable across a wide range of applications, including autonomous systems and precision agriculture, where the identification of crops, weeds, or pests within complex scenes drives actionable insights [1,5]. The historical evolution of object detection methodologies is examined, beginning with traditional approaches based on hand-crafted features and progressing toward the transformative influence of DL, with particular emphasis on their technical foundations and applicability to agricultural domains [4,11].

2.1. Traditional Approaches in Agriculture

Traditional object detection methods in agriculture predominantly relied on classical computer vision and machine learning techniques that utilized hand-crafted features, including Scale-Invariant Feature Transform (SIFT), Histogram of Oriented Gradients (HOG), and color histograms. These approaches typically followed a multi-stage pipeline consisting of feature extraction, classification using algorithms like Support Vector Machines (SVMs), and localization through sliding window techniques or selective search [50]. Although computationally intensive, these methods established the foundation of early agricultural automation by enabling the recognition of visual patterns in plant leaves, fruits, and weeds under controlled conditions [51]. SVMs have been applied to shape- and color-based features for the detection of tea leaf diseases [52,53], while other approaches have utilized Gabor filters and texture analysis for the classification of weed species in crop fields.
Despite inherent limitations, traditional approaches demonstrated effectiveness in scenarios with minimal variability in visual inputs, including indoor farming, greenhouse environments, and early-stage laboratory datasets. Their relatively low data requirements and explainable pipelines rendered them suitable for classification and detection tasks under consistent lighting conditions [51]. Moreover, these techniques contributed to the early development of precision agriculture systems by enabling targeted spraying and automated yield monitoring. Although modern DL methods have largely superseded classical pipelines, the interpretability and lower computational cost associated with traditional techniques continue to offer advantages for low-power and edge computing applications in resource-limited agricultural settings [27]. The overall workflow of traditional object detection methods using SIFT/HOG features and classical classifiers is illustrated in Figure 3.
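To make this classical pipeline concrete, the following minimal sketch combines OpenCV's HOG descriptor, a linear SVM, and a sliding-window search. It is an illustration only, not a reproduction of any cited system: the training patches are randomly generated stand-ins for annotated crop/weed data, and the window size, stride, and score threshold are placeholder values.

```python
# Minimal sketch of the classical detection pipeline: hand-crafted HOG features,
# an SVM classifier, and a sliding-window search. Training data here are random
# placeholders for a real annotated crop/weed patch dataset.
import cv2
import numpy as np
from sklearn.svm import LinearSVC

# HOG descriptor for 64x64 grayscale patches (window, block, stride, cell, bins).
hog = cv2.HOGDescriptor((64, 64), (16, 16), (8, 8), (8, 8), 9)

def extract_features(patches):
    """Compute a flattened HOG feature vector for each 64x64 grayscale patch."""
    return np.array([hog.compute(p).ravel() for p in patches])

# 1) Train a linear SVM on labelled patches (random stand-ins for real data).
train_patches = [np.random.randint(0, 256, (64, 64), dtype=np.uint8) for _ in range(40)]
train_labels = np.random.randint(0, 2, 40)
clf = LinearSVC(C=1.0).fit(extract_features(train_patches), train_labels)

# 2) Localize by scoring every sliding window in a field image.
def detect(image_gray, step=16, threshold=0.5):
    h, w = image_gray.shape
    detections = []
    for y in range(0, h - 64, step):
        for x in range(0, w - 64, step):
            patch = image_gray[y:y + 64, x:x + 64]
            score = clf.decision_function(extract_features([patch]))[0]
            if score > threshold:
                detections.append((x, y, 64, 64, float(score)))
    return detections  # in practice followed by non-maximum suppression

print(len(detect(np.random.randint(0, 256, (240, 320), dtype=np.uint8))))
```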

2.2. Deep Learning-Based Methods in Agriculture

Beyond CNN-based vision systems, alternative sensing modalities such as Near-Infrared Spectroscopy (NIRS) combined with machine learning have shown effectiveness in agricultural tasks like seed age discrimination [54]. End-to-end models now detect fruits, leaves, pests, and weeds under diverse field conditions [55]. YOLO-based approaches achieve real-time inference in grapevine disease detection [32], flower counting [18], and tea leaf classification [56,57], supporting deployment on Unmanned Aerial Vehicles (UAV) and mobile platforms. Transfer learning and data augmentation enhance generalization in low-label scenarios [58], while lightweight architectures like MobileNet and EfficientNet enable DL models to operate on edge devices for timely, in-field decision-making [59]. Together, these advances strengthen precision farming by enabling early disease detection, continuous crop monitoring, and resource-efficient management.

2.2.1. R-CNN and Fast R-CNN

R-CNN introduced the concept of region proposals followed by CNN-based feature extraction and classification, marking a pivotal shift in object detection paradigms [41]. Although R-CNN achieved high accuracy, its multi-stage pipeline resulted in significant computational overhead. Fast R-CNN addressed these limitations by integrating feature extraction and classification within a single network using ROI pooling, thereby reducing both inference time and memory usage [60]. In agricultural applications, both modern region proposal-based frameworks and traditional feature-based methods have demonstrated reliable performance under variable lighting conditions for tasks such as plant disease identification and fruit counting [61].

2.2.2. Faster R-CNN

Faster R-CNN further advanced the R-CNN family by embedding a Region Proposal Network (RPN) directly into the backbone CNN, enabling end-to-end training and significantly improving inference speed without compromising accuracy [43]. Its applications in agriculture include greenhouse detection and fruit recognition in complex orchard environments, where high detection accuracy is often prioritized over latency [62]. The ability to generate high-quality region proposals renders Faster R-CNN particularly effective for detecting dense or overlapping agricultural targets.

2.2.3. YOLO (You Only Look Once)

The YOLO series of models reformulated object detection as a single regression problem, thereby enabling real-time processing capabilities [45]. Due to its efficiency, YOLO has become particularly suitable for embedded agricultural applications, including drone-based weed monitoring and robotic fruit picking in real-time environments [63]. Successive versions, from YOLOv3 to YOLOv8, have introduced architectural enhancements that boost detection accuracy and robustness against occlusions and scale variations. The YOLO detection pipeline can be abstracted as a composite function consisting of a backbone, neck, and detection head, formulated as:
$$\hat{Y} = f_{\text{dec}}\left(f_{\text{neck}}\left(f_{\text{backbone}}(X)\right)\right)$$
where $X$ is the input image, $f_{\text{backbone}}$ denotes the feature extraction network, $f_{\text{neck}}$ represents the intermediate feature aggregation module, and $f_{\text{dec}}$ is the detection head responsible for producing bounding box coordinates and class probabilities.
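As an illustration of this composition, the PyTorch sketch below wires a toy backbone, neck, and detection head in exactly this backbone-neck-head order. The channel counts, grid resolution, and class count are illustrative and do not correspond to any specific YOLO release.

```python
# Illustrative PyTorch sketch of the composition Y_hat = f_dec(f_neck(f_backbone(X))).
# All layer sizes and the number of classes are toy values.
import torch
import torch.nn as nn

class TinyYOLOLike(nn.Module):
    def __init__(self, num_classes=3, num_anchors=3):
        super().__init__()
        # f_backbone: strided convolutions that extract a downsampled feature map
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.SiLU(),
        )
        # f_neck: aggregates and refines features before prediction
        self.neck = nn.Sequential(nn.Conv2d(128, 128, 3, padding=1), nn.SiLU())
        # f_dec: per grid cell and anchor, predicts 4 box offsets,
        # 1 objectness score, and C class scores
        self.head = nn.Conv2d(128, num_anchors * (5 + num_classes), 1)

    def forward(self, x):
        return self.head(self.neck(self.backbone(x)))

model = TinyYOLOLike()
dummy = torch.randn(1, 3, 256, 256)   # one RGB image
print(model(dummy).shape)             # torch.Size([1, 24, 32, 32]): 32x32 grid of predictions
```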

2.2.4. SSD (Single Shot MultiBox Detector)

The Single Shot MultiBox Detector (SSD) architecture achieves a balance between speed and accuracy by simultaneously predicting object classes and bounding boxes in a single forward pass through the network [44]. In contrast to YOLO, SSD utilizes multiple feature maps at different resolutions, making it particularly adept at detecting objects of varying sizes, an essential advantage in agricultural environments where pests or produce may appear at different scales. SSD-based models have been explored for agricultural applications that demand lightweight inference while maintaining acceptable detection accuracy.
The transition to DL methodologies has enhanced object detection capabilities and expanded their application scope. Recent architectures have refined performance by addressing challenges related to class imbalance and contextual reasoning [64]. Advancements in object detection technologies enable precision tasks including real-time identification of diseased leaves and mapping of crop distributions across extensive fields, with underlying principles that also extend to domains including robotics, surveillance, and medical imaging [65].
Figure 4 illustrates the chronological evolution of object detection architectures from R-CNN (2014) to the recent prompt-aware YOLOE (2025). R-CNN introduced region proposal-based detection, achieving high accuracy at the cost of computational complexity. The CNN (2016) in this sequence represents one of the early semantic segmentation networks applied in agriculture, enabling pixel-level classification of crops and weeds in real time. YOLOv3 marked a key advancement with its single-shot architecture and multi-scale detection strategy, significantly improving inference speed. YOLOv7 expanded on this with enhanced feature fusion, re-parameterization blocks, and backbone optimizations for precision and speed. The most recent innovation, YOLOE, integrates semantic and textual prompts via transformer mechanisms, aiming to generalize across unseen objects and enable flexible multi-modal reasoning. This evolution highlights a shift toward architectures that are not only fast and accurate but also adaptive and context-aware—an essential capability for diverse, real-world agricultural applications.
The SSD architecture adopts a multi-scale prediction strategy, where object detection is performed on multiple feature maps derived from different stages of the backbone network. This process can be mathematically expressed as:
$$\hat{Y} = \bigcup_{i=1}^{N} \text{Conv}_i\left(F_i\right)$$
where $F_i$ denotes the $i$-th feature map extracted from distinct layers of the backbone, and $\text{Conv}_i$ represents the convolutional predictors applied to $F_i$ for detecting objects at various scales. The final prediction $\hat{Y}$ is obtained by aggregating outputs from all feature levels.
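The multi-scale scheme can be sketched in a few lines of PyTorch. The two-level toy network below only illustrates how separate predictors operate on feature maps of different resolutions before their outputs are concatenated; all channel counts, anchors, and input sizes are chosen arbitrarily.

```python
# Sketch of SSD-style multi-scale prediction: separate convolutional predictors
# Conv_i applied to feature maps F_i of different resolutions, outputs aggregated.
import torch
import torch.nn as nn

class TinySSDHead(nn.Module):
    def __init__(self, num_classes=3, anchors_per_cell=4):
        super().__init__()
        out_ch = anchors_per_cell * (4 + num_classes)   # box offsets + class scores
        self.stage1 = nn.Sequential(nn.Conv2d(3, 64, 3, stride=4, padding=1), nn.ReLU())
        self.stage2 = nn.Sequential(nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU())
        # one predictor per feature level (the Conv_i terms)
        self.pred1 = nn.Conv2d(64, out_ch, 3, padding=1)    # finer map: smaller objects
        self.pred2 = nn.Conv2d(128, out_ch, 3, padding=1)   # coarser map: larger objects

    def forward(self, x):
        f1 = self.stage1(x)               # F_1, higher resolution
        f2 = self.stage2(f1)              # F_2, lower resolution
        preds = []
        for pred_layer, feat in ((self.pred1, f1), (self.pred2, f2)):
            p = pred_layer(feat)                                # (B, out_ch, H_i, W_i)
            preds.append(p.permute(0, 2, 3, 1).flatten(1, 2))   # (B, H_i*W_i, out_ch)
        return torch.cat(preds, dim=1)    # aggregate predictions from all levels

head = TinySSDHead()
print(head(torch.randn(1, 3, 128, 128)).shape)   # (1, 32*32 + 16*16, 28)
```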

2.3. Agricultural Adaptations

Figure 5 presents a conceptual comparison of key object detection architectures—R-CNN, Fast R-CNN, Faster R-CNN, YOLO, and SSD—highlighting their structural differences and progressive simplification of the detection pipeline. The evolution from multi-stage region proposal methods (R-CNN and its variants) to unified single-stage frameworks (YOLO and SSD) reflects a shift toward real-time performance and reduced computational complexity.
R-CNN initiates detection by generating region proposals, followed by CNN-based feature extraction and classification, making it accurate but computationally expensive. Fast R-CNN improves upon this by integrating feature extraction and classification into a single network using ROI pooling. Faster R-CNN further enhances efficiency by introducing a Region Proposal Network (RPN), streamlining the pipeline into an end-to-end trainable architecture.
In contrast, YOLO reframes object detection as a single regression problem, using a grid-based prediction mechanism that significantly accelerates inference, making it suitable for real-time agricultural tasks such as weed and pest detection. SSD extends this paradigm by incorporating multi-scale feature maps, allowing it to detect objects of varying sizes more effectively, which is particularly useful in dense agricultural environments.
These architectures differ not only in design but also in their applicability: Faster R-CNN is favored for high-precision offline tasks like fruit counting in high-resolution imagery, while YOLO and SSD are commonly deployed in real-time systems embedded in drones or mobile agricultural robots. The diagram serves as a foundation for understanding how algorithmic design influences trade-offs between speed, accuracy, and deployment feasibility in agricultural applications.
In agriculture, object detection models are tailored to address domain-specific challenges, such as occlusions in dense foliage and variability in crop appearance. For instance, transfer learning has been successfully applied to adapt pre-trained YOLOv5 models for weed detection on small datasets such as DeepWeeds, improving generalization and detection accuracy [58]. Techniques like feature pyramid networks (FPN) enhance detection of small objects, such as pests, in complex scenes [70]. These adaptations ensure robustness in unstructured field environments, as demonstrated by YOLOv7’s application to grapevine disease detection [68].
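As a minimal illustration of reusing pre-trained detection weights, the sketch below loads a COCO-pretrained YOLOv5 model through PyTorch Hub and runs it on a placeholder field image. Adapting the output classes to a weed dataset such as DeepWeeds is typically done by fine-tuning with the Ultralytics training scripts and a custom dataset configuration file, which is omitted here.

```python
# Minimal sketch: COCO-pretrained YOLOv5 loaded via PyTorch Hub for inference.
# The image path is a placeholder; class adaptation to a weed dataset requires
# fine-tuning with the Ultralytics training scripts on custom annotations.
import torch

# Downloads the model definition and pretrained weights on first use.
model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)
model.conf = 0.4                       # confidence threshold for reported boxes

results = model("field_image.jpg")     # placeholder path to an RGB field image
results.print()                        # per-class detection summary
boxes = results.xyxy[0]                # tensor rows: [x1, y1, x2, y2, confidence, class]
print(boxes[:5])
```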
Table 2 compares representative object detection models used in agricultural tasks based on their accuracy (mAP), inference speed (FPS), and computational cost (GFLOPs). The architectural differences underlying these models are illustrated in Figure 5. Faster R-CNN offers the highest accuracy among the listed models but requires significant computational resources, making it better suited for offline analysis. YOLOv3 and YOLOv7 demonstrate strong trade-offs between speed and accuracy, with YOLOv7 achieving the highest mAP and faster inference, making it highly effective for real-time pest detection. SSD, while slightly less accurate, remains a practical choice for embedded platforms due to its moderate complexity and efficient runtime. This comparison underscores the importance of selecting models not solely based on accuracy but also on their suitability for specific deployment scenarios, such as field robotics, UAV surveillance, or cloud-based processing.
Figure 6 illustrates the performance trend of various object detection models on the COCO dataset, specifically showing the mean Average Precision at an IoU threshold of 0.5 (mAP@0.5) over time from 2014 to 2022. The models included in the comparison are R-CNN, Fast R-CNN, Faster R-CNN, YOLOv2, YOLOv3, YOLOv4, YOLOv5, and YOLOv8. As shown in the graph, the performance of these models has steadily improved, with each successive version achieving better mAP@0.5 scores. The orange line, representing the YOLO series, clearly demonstrates a continuous enhancement in performance, culminating in YOLOv8, which achieves one of the highest mAP@0.5 scores in 2022. This plot effectively highlights the evolution of object detection models, with the YOLO models showing significant advancements in accuracy over the years.

3. Applications in Agriculture

The transformative potential of semantic segmentation is vividly illustrated in agriculture, where it enables precise identification of crop varieties, as demonstrated in grape classification across multiple vineyard conditions [74]. By leveraging advanced computer vision and DL techniques, these applications enhance efficiency, precision, and scalability, replacing labor-intensive manual methods with automated systems capable of operating under complex, real-world conditions [75]. A structured analysis of key agricultural use cases, encompassing weed detection, fruit counting and ripeness assessment, disease and pest identification, livestock and wildlife monitoring, as well as crop row and canopy detection, is presented, emphasizing the computational methods employed, datasets referenced, and domain-specific challenges addressed [19]. While grounded in agriculture, these applications reflect broader AI principles, offering insights into object detection’s adaptability and limitations across diverse domains [76]. Table 3 summarizes representative agricultural applications of object detection algorithms across various tasks in precision farming.

3.1. Weed Detection

Effective weed management is a cornerstone of crop productivity, requiring precise discrimination between crop and weed species to enable selective herbicide application or autonomous weeding [81]. Object detection has significantly advanced this task by enabling real-time identification of invasive weed species amidst dense vegetation [63]. Datasets such as DeepWeeds, comprising thousands of annotated images from Australian rangelands, have supported the training of DL models including YOLOv5 and SSD, which excel at detecting weeds under varying field conditions [67]. YOLOv5, with its lightweight architecture and multi-scale prediction capabilities, achieves rapid inference speeds, while SSD’s utilization of multi-layer feature maps enhances accuracy in identifying small or partially occluded weeds [63]. Studies have reported mean Average Precision (mAP) scores exceeding 0.85 for weed classification in controlled environments, although performance declines in cluttered or shadowed conditions due to visual similarities between crops and weeds [82]. Techniques such as transfer learning, which adapts pre-trained weights from datasets like ImageNet to weed-specific datasets, have been employed to address data scarcity challenges in agricultural vision applications [83]. These advancements underscore the critical role of object detection in precision agriculture, with parallels observed in fine-grained classification tasks across other fields, such as species identification in ecological studies [84]. Figure 7 showcases representative samples from the DeepWeeds dataset, illustrating the visual diversity and complexity encountered in field-based weed detection tasks.

3.2. Fruit Counting and Ripeness Detection

Fruit detection serves dual purposes in agriculture: estimating yield for harvest planning and assessing ripeness to optimize picking schedules [3]. Object detectors trained on annotated datasets, such as MinneApple for apples, GrapeCS for grapes, and TomatoID for tomatoes, have demonstrated robust performance in localizing and classifying fruits under diverse conditions [85]. Faster R-CNN models have demonstrated strong performance in detecting fruits under occluded conditions, achieving high precision through the extraction of contextual features from deep convolutional layers [86]. EfficientNet, a more recent architecture, balances speed and accuracy through compound scaling, making it suitable for real-time ripeness assessment on mobile platforms [87]. Research has reported detection accuracies above 90% in well-lit orchard environments; however, performance degrades under low-light conditions or heavy occlusion, necessitating preprocessing techniques such as contrast enhancement or the integration of multi-modal data, including thermal imaging [88]. Ripeness detection often incorporates color-based features or temporal tracking, reflecting object detection’s adaptability to task-specific cues, a principle similarly employed in industrial quality control applications [89]. Figure 8 presents detection results of grape clusters across multiple varieties, demonstrating the capability of object detection models to localize fruits under varying occlusion and illumination conditions.

3.3. Disease and Pest Detection

Early detection of plant diseases and pest infestations is critical for crop protection, enabling timely interventions to minimize yield losses [90]. Object detection systems trained on datasets such as PlantVillage, which includes over 50,000 images of diseased and healthy leaves across multiple crops, have proven effective in identifying subtle symptoms of infection and the presence of insect pests [91]. YOLOv3 and RetinaNet leverage high-resolution feature maps and focal loss functions to enhance the detection of small objects, including disease spots and aphids, and to address class imbalance [77]. YOLOv7 has been deployed for the detection of powdery mildew on grapevines, achieving mAP scores exceeding 0.80, while RetinaNet’s emphasis on hard examples improves recall in sparse pest distributions [68]. Key challenges include differentiating disease symptoms from natural leaf variations and managing limited training data for rare conditions, issues often addressed through data augmentation or synthetic image generation [92]. These developments parallel diagnostic imaging applications in healthcare, where object detection is similarly employed to identify anomalies with high precision [93]. Figure 9 illustrates the experimental setup involving grape leaves marked for validating disease detection accuracy and evaluating spray coverage effectiveness.

3.4. Crop Row and Canopy Detection

Autonomous agricultural vehicles, including tractors and harvesters, rely on object detection systems to navigate fields by following crop rows and mapping plant canopies [94,95]. This task integrates geometric reasoning, detecting linear row patterns, with semantic understanding of canopy boundaries, often utilizing RGB or multispectral imagery captured by onboard cameras [69]. Faster R-CNN and SSD models, supplemented by post-processing techniques like Hough transforms for line detection, facilitate precise row alignment, thereby reducing crop damage during mechanical operations [96]. The Sugar Beet Field dataset has facilitated the training of canopy detection models, with reported accuracies exceeding 95% in structured field environments, although performance declines in uneven terrains or irregular planting conditions [97]. The real-time requirements of autonomous navigation have driven the adoption of lightweight architectures such as MobileNet, reflecting a broader trend toward edge-based AI in robotics and autonomous systems [13]. These techniques extend beyond agriculture to domains such as autonomous driving, underscoring shared challenges in spatial reasoning and environmental perception. Figure 10 depicts the key components of an agricultural robot path-tracking system, integrating power, navigation, chassis, and remote control modules for autonomous field operations.

4. Dataset Overview

The rapid advancement of object detection in agriculture has been strongly facilitated by the availability of publicly accessible datasets, which serve as the backbone for training, validating, and benchmarking DL models [98]. These datasets provide annotated visual data, including images or videos with labeled bounding boxes, class labels, or segmentation masks, enabling algorithms to learn complex agricultural patterns under diverse conditions [99]. While object detection as a broader AI discipline benefits from general-purpose datasets such as COCO and ImageNet, agricultural applications require specialized datasets that capture the unique variability of crops, weeds, pests, and environmental factors [98].

4.1. Key Public Datasets

Figure 11 presents a comparative overview of prominent agricultural vision datasets based on image count. Each bar represents a different dataset, highlighting the variation in data scale across domains such as plant disease detection, weed identification, and crop segmentation. PlantVillage, with over 50,000 labeled images, dominates in volume, while specialized datasets like TeaDisease and AppleAphid are relatively small but focused on high-resolution detection tasks. This comparison underscores the importance of dataset scale and diversity in selecting or designing object detection models tailored for specific agricultural applications.
Table 4 provides a structured summary of major agricultural datasets commonly used in object detection tasks, including image volume, target crop or weed types, and relevant application notes. This variation is further illustrated in Figure 11, which compares dataset sizes across several prominent agricultural benchmarks. PlantVillage includes over 50,000 high-quality RGB images collected in controlled environments, covering 38 crop–disease combinations across 14 species [100]. In contrast, DeepWeeds consists of 17,509 field-acquired images annotated for nine weed species and a negative class, offering realistic complexity for weed detection in Australian rangelands [67]. GrapeLeaf Dataset provides over 5000 images labeled for disease segmentation in grapevines, enabling detailed study of foliar symptoms in viticultural settings. DeepFruit contributes more than 35,000 images targeting fruit detection and yield estimation, particularly for crops like apple, mango, and citrus. Complementing these, AppleAphid focuses on pest detection in apples, offering high-resolution images labeled with bounding boxes for aphid identification [101]. AgriNet, a large-scale global dataset exceeding 100,000 images, covers crops, weeds, pests, and diseases with rich annotations across multiple tasks [102]. Finally, Mini-PlantNet emphasizes species-level plant recognition and has been adapted for lightweight detection applications, though it lacks task-specific annotations for broader agricultural contexts [103]. Collectively, these datasets differ in terms of image scale, environmental conditions, and label granularity, forming a robust foundation for developing and benchmarking object detection models in agriculture.

4.2. Dataset Characteristics and Contributions

These datasets exhibit substantial variation in size, annotation type, and acquisition methods, reflecting the multifaceted nature of agricultural object detection [106]. PlantVillage and AppleAphid, with their controlled conditions, provide high-quality annotations suitable for initial model training, whereas DeepWeeds and AgriNet capture real-world complexity, supporting robust testing under natural environmental conditions [19,20]. Annotation types range from bounding boxes (DeepWeeds, AgriNet) to class labels (PlantVillage), and some datasets incorporate metadata such as growth stage or environmental context, thereby enhancing model interpretability [99]. The contributions of these datasets extend beyond agriculture, as PlantVillage’s fine-grained labels are analogous to datasets used in medical imaging, and DeepWeeds’ outdoor imagery aligns with ecological monitoring applications [104]. By providing standardized benchmarks, these datasets have driven algorithmic advancements, including the adoption of transfer learning techniques to adapt general-purpose models pre-trained on datasets like COCO to specialized agricultural tasks, thereby mitigating the reliance on extensive labeled data [11].

4.3. Challenges in Dataset Diversity and Quality

Despite their significant contributions, agricultural datasets face notable challenges that hinder model performance and generalization, reflecting broader issues in AI data curation [20]. Seasonal variation, such as shifts in crop appearance across growth cycles or weather conditions, is poorly represented in datasets like PlantVillage, limiting models’ adaptability to temporal changes [107]. Occlusions, frequently encountered in dense agricultural fields where leaves obscure fruits or pests, complicate bounding box accuracy, as observed in small-object detection tasks with datasets like AppleAphid [108]. Inconsistent annotation standards across datasets, including variations in bounding box tightness and class definitions in AgriNet, introduce noise that undermines cross-dataset compatibility [109]. Moreover, the scarcity of labeled data in underrepresented regions, particularly tropical agriculture, restricts the global applicability of trained models, a problem exacerbated by the high cost and expertise required for manual annotation [110].

4.4. Implications for Object Detection Research

The limitations of current agricultural datasets underscore the need for innovative data strategies, which represent a critical frontier in AI research [109]. Techniques such as domain adaptation, which fine-tunes models on small, region-specific datasets, and multi-modal integration, which combines different sensor modalities, are emerging to address diversity gaps [111]. Furthermore, the push for open-source, standardized datasets aligns with broader efforts to democratize AI development, enabling researchers and practitioners to address real-world deployment challenges more effectively [112].

5. Comparison of Algorithms

Object detection algorithms constitute the foundation of modern computer vision systems, distinguished by trade-offs in accuracy, inference speed, and computational demands, attributes that critically influence their practical utility across diverse applications [71]. In agriculture, these trade-offs are particularly significant due to the need for deployment on edge devices, such as drones, robots, and handheld tools, where processing power is limited and real-time performance is often essential [113]. The comparative evaluation highlights universal design principles that extend beyond agricultural contexts, with applicability to domains including robotics, surveillance, and autonomous systems. Table 5 compares representative object detection models applied to agricultural tasks, summarizing their dataset usage, performance characteristics, and practical deployment considerations.

5.1. Evaluation Metrics

To comprehensively assess the performance of object detection models in agricultural applications, both accuracy-related and efficiency-related metrics are considered. The key evaluation indicators include precision, recall, F1-score, Intersection over Union (IoU), mean Average Precision (mAP), and computational complexity measured in GFLOPs.
  • Detection Accuracy Metrics.
Precision and recall are fundamental metrics for evaluating classification-based detection outcomes. Precision (P) quantifies the proportion of correctly predicted positive samples among all positive predictions, while recall (R) measures the proportion of actual positives that are correctly identified:
$$\text{Precision}\ (P) = \frac{TP}{TP + FP}$$
$$\text{Recall}\ (R) = \frac{TP}{TP + FN}$$
The F1-score provides a harmonic mean of precision and recall, offering a balanced perspective when both false positives and false negatives are of concern:
$$F1\text{-score} = \frac{2 \cdot P \cdot R}{P + R}$$
Intersection over Union (IoU) evaluates the spatial overlap between the predicted bounding box and the ground-truth box, serving as a basis for determining true positives in object detection:
$$\text{IoU} = \frac{\text{Area of Overlap}}{\text{Area of Union}}$$
The mean Average Precision (mAP), as defined in Equation (7), is the average of the average precision (AP) over all categories. AP is computed as the area under the precision-recall (PR) curve for each class:
$$\text{mAP} = \frac{1}{N} \sum_{i=1}^{N} \int_{0}^{1} P_i(R)\, dR$$
Here, $TP$, $FP$, and $FN$ denote true positives, false positives, and false negatives, respectively. $N$ represents the number of object classes, and $P_i(R)$ is the precision as a function of recall for class $i$.
  • Computational Complexity.
In resource-constrained agricultural environments, model efficiency is crucial. The computational cost of a model is typically measured in terms of Giga Floating Point Operations (GFLOPs), which represents the total number of operations normalized by $10^9$:
$$\text{GFLOPs} = \frac{1}{10^{9}} \sum_{l=1}^{L} C_l \cdot K_l^{2} \cdot M_l^{2} \cdot N_l$$
In Equation (8), $L$ is the total number of layers, $C_l$ is the number of input channels, $K_l$ is the kernel size, $M_l \times M_l$ is the spatial size of the output feature map, and $N_l$ is the number of output channels at layer $l$.
Collectively, these metrics (Equations (3)–(8)) establish a standardized evaluation foundation that is adopted throughout this study to ensure consistent and rigorous comparisons among object detection models.
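The following sketch gives a compact numerical illustration of these definitions, assuming boxes in (x1, y1, x2, y2) pixel coordinates; the counts and layer dimensions in the example are arbitrary values chosen only to exercise the formulas.

```python
# Worked illustration of the evaluation metrics in Equations (3)-(8).
# Boxes are assumed to be in (x1, y1, x2, y2) pixel coordinates.
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (Equation (6))."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def precision_recall_f1(tp, fp, fn):
    """Equations (3)-(5) from counts of true/false positives and false negatives."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def conv_layer_gflops(c_in, kernel, out_hw, c_out):
    """Single-layer term of Equation (8): C_l * K_l^2 * M_l^2 * N_l, in GFLOPs."""
    return c_in * kernel**2 * out_hw**2 * c_out / 1e9

# Example: a prediction matched to a ground-truth box at IoU >= 0.5 counts as a TP.
pred, gt = (50, 50, 150, 160), (60, 55, 155, 150)
print(f"IoU = {iou(pred, gt):.2f}")                     # ~0.75
print("P, R, F1 =", precision_recall_f1(tp=8, fp=2, fn=4))
print("GFLOPs (one 3x3 conv, 64->128 channels, 56x56 output):",
      round(conv_layer_gflops(64, 3, 56, 128), 3))
```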

5.2. Algorithmic Foundations and Performance

Each model represents a distinct approach to object detection, balancing competing demands for precision and efficiency through innovative architectural designs.
Figure 12 shows a performance comparison of several object detection models, including various YOLO versions (YOLOv6, YOLOv7, YOLOv8, YOLOv9, and YOLOv10), as well as other models like PP-YOLOE, RTMDet, YOLO-MS, Gold-YOLO, and RT-DETR, using the COCO dataset. The plot compares the COCO Average Precision (AP) (y-axis) against the Number of Parameters (M) (x-axis) for each model. The red line, representing YOLOv10, demonstrates high performance with relatively fewer parameters compared to the other models. This suggests that YOLOv10 offers an efficient trade-off between accuracy and model size. The figure illustrates that YOLOv10 achieves one of the best performance scores on the COCO dataset while maintaining a lower number of parameters than some other models, making it a highly optimized object detection model.

5.2.1. Technical Evaluation and Trade-Offs

The performance of object detection models in agriculture reflects a trade-off between accuracy and efficiency. For instance, Faster R-CNN’s high mAP (0.92) suits offline tasks but its low FPS (5–10) limits real-time use, as shown in Table 5. YOLOv5’s high FPS (50+) and moderate mAP (0.89) make it ideal for drone-based weed detection, though it struggles with small objects [115]. EfficientNet balances both, with scalable variants (D0–D7) optimizing for edge or server deployment [87]. These trade-offs highlight the need for task-specific model selection, with lightweight models like SSD favored for resource-constrained environments.

5.2.2. Faster R-CNN

Introduced as a two-stage detector, Faster R-CNN integrates a Region Proposal Network (RPN) with a CNN backbone (ResNet-50 or ResNet-101) to generate and classify region proposals, followed by bounding box regression and classification [43]. Its strength lies in achieving high accuracy, often exceeding mean Average Precision (mAP) scores of 0.90 on datasets such as COCO, due to its capacity to leverage deep feature hierarchies and contextual reasoning [116]. However, its inference speed remains relatively slow, typically 5–10 frames per second (FPS) on high-end GPUs, owing to the computational overhead of processing multiple stages. In agriculture, Faster R-CNN excels in offline analysis tasks requiring meticulous precision, including the detection of early disease symptoms on leaves and the identification of pests in high-resolution imagery [117]. Nevertheless, its reliance on substantial computational resources limits its applicability in real-time edge scenarios, relegating it to cloud-based or high-performance GPU deployments.
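For reference, a two-stage detector of this family (RPN plus a ResNet-50 FPN backbone) can be instantiated through torchvision's detection module, as sketched below. The random input tensor stands in for a real field image, and the weights argument assumes torchvision 0.13 or later (earlier releases use pretrained=True).

```python
# Sketch: instantiating a two-stage RPN-based detector with torchvision and
# running a forward pass on a stand-in image. Weights argument assumes
# torchvision >= 0.13; older releases use pretrained=True instead.
import torch
import torchvision

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()                                      # inference mode

images = [torch.rand(3, 480, 640)]                # stand-in RGB image, values in [0, 1]
with torch.no_grad():
    outputs = model(images)                       # list of dicts, one per image

det = outputs[0]
keep = det["scores"] > 0.7                        # simple confidence filter
print(det["boxes"][keep].shape, det["labels"][keep])
```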

5.2.3. YOLO

The You Only Look Once (YOLO) family represents a single-stage detection paradigm, reframing object detection as a regression problem by predicting bounding boxes and class probabilities in a single forward pass across a grid of image cells [118]. Optimized for speed and efficiency, YOLOv5 achieves frame rates exceeding 50 FPS on mid-tier GPUs (NVIDIA GTX 1660) while maintaining mAP scores between 0.85 and 0.90, depending on the variant (YOLOv5s for lightweight applications, YOLOv5x for higher accuracy) [119]. Its architecture incorporates enhancements such as anchor box optimization, multi-scale predictions via Feature Pyramid Networks (FPN), and a lightweight backbone (CSPDarknet53), making it highly suitable for real-time agricultural applications [120]. In field settings, YOLOv5 supports tasks like weed detection via drones, fruit counting in orchards, and pest monitoring through mobile devices, where rapid decision-making is essential [115]. In dense and complex field environments, such as those encountered during broccoli head detection, accuracy can be challenged by occlusions and object clustering, necessitating design trade-offs in model structure and resolution settings [121].

5.2.4. SSD

The Single Shot MultiBox Detector (SSD) adopts a single-stage approach, utilizing a VGG-16 backbone to extract features and predict object classes and bounding boxes across multiple scales from different convolutional layers [44]. This design yields a balanced trade-off, with inference speeds ranging from 20 to 40 FPS and mAP scores typically between 0.75 and 0.85, depending on input resolution and hardware capabilities [71,122]. SSD’s capacity to detect objects at varying scales renders it well-suited for embedded systems, such as mobile robots engaged in weed or fruit detection in agricultural fields [123]. In practice, it provides sufficient accuracy for tasks where moderate precision aligns with operational needs, while its lower computational footprint compared to two-stage models supports deployment on resource-constrained devices [124]. Limitations include reduced performance in detecting small objects or handling highly cluttered scenes, which are common challenges in dense crop environments.

5.2.5. EfficientNet

EfficientNet-based detectors (the EfficientDet family) represent a scalable, state-of-the-art approach, employing compound scaling to simultaneously adjust network depth, width, and resolution for optimizing both accuracy and efficiency [87]. Built on an EfficientNet backbone and enhanced with a Bidirectional Feature Pyramid Network (BiFPN), they achieve mAP scores exceeding 0.90 while maintaining inference speeds of 30–50 FPS on edge TPUs or high-end GPUs [125,126]. Their scalability across D0 to D7 variants enables customization for a wide range of hardware, from lightweight edge devices to powerful servers, supporting diverse agricultural tasks such as crop row mapping, livestock monitoring, and multi-object field analysis [127]. In agricultural applications, deep learning and computer vision have shown promising results in quality assessment tasks such as tea grading and matching [128].

5.3. Comparative Analysis in Agricultural Contexts

The suitability of these algorithms for agricultural tasks hinges on their performance metrics and deployment constraints [129]. Faster R-CNN’s high accuracy (mAP approximately 0.92) makes it suitable for offline tasks involving disease detection from high-resolution drone imagery, although its low speed (5–10 FPS) precludes real-time use [71]. YOLOv5’s balanced profile (mAP approximately 0.88, 50+ FPS) dominates real-time applications, including weed detection on drones and pest tracking on edge AI devices [130]. SSD’s moderate accuracy (mAP approximately 0.80) and speed (20–40 FPS) make it viable for embedded systems, particularly for fruit and weed detection where hardware limitations are significant [73]. EfficientNet’s high accuracy (mAP approximately 0.91) and efficiency (30–50 FPS) position it as a versatile general-purpose solution, excelling in tasks requiring both precision and real-time operation. Hardware compatibility further shapes algorithm selection, with edge devices (NVIDIA Jetson) favoring lightweight models, while cloud or server-based setups accommodate more computationally intensive models [131]. Beyond agriculture, these dynamics similarly inform model selection in fields such as robotics and surveillance, where accuracy-speed balances are critical [132].

5.4. Broader Implications and Trends

This comparative analysis highlights a continuum of design philosophies: two-stage models such as Faster R-CNN prioritize accuracy at the expense of speed, whereas single-stage models such as YOLOv5 and SSD emphasize efficiency [87]. EfficientNet bridges this gap through scalable architecture. In agriculture, the need for edge deployment amplifies the demand for lightweight, real-time solutions, driving innovations such as model pruning and quantization to reduce memory footprints without compromising performance [133]. Similar optimization strategies are observed across the broader field of artificial intelligence, particularly in resource-constrained environments found in IoT systems and mobile robotic platforms [134]. Furthermore, the integration of agricultural benchmarks into model evaluation, moving beyond general-purpose datasets like COCO, reflects a growing emphasis on domain-specific metrics in AI research [135].

6. Challenges and Open Problems

Despite the remarkable strides made in object detection, its application in agriculture remains hindered by a constellation of challenges spanning environmental, data-related, computational, and interpretative dimensions [136]. These obstacles reflect not only the unique complexities of agricultural environments but also broader open problems in AI, where robustness, scalability, and usability are perennial concerns [137]. Environmental variability, data scarcity, model generalization, real-time constraints, and explainability represent enduring challenges in agricultural object detection. Their implications, potential mitigation strategies, and broader relevance to artificial intelligence research are examined [138]. These issues underscore the necessity for continued interdisciplinary efforts to advance both precision agriculture and object detection frameworks [139]. Table 6 summarizes key challenges in agricultural object detection along with corresponding technical solutions and remaining research gaps. Tiny object detection—a common issue when identifying small pests or diseases—is often addressed through Focal Loss, yet limitations persist in effectively capturing multi-scale features. Domain shift, resulting from variations across field conditions or geographic regions, has been tackled via domain adaptation, though cross-regional biases still hinder model generalization. Synthetic data generation helps alleviate the scarcity of labeled agricultural images, but concerns remain about the realism and annotation quality of synthetic samples. While explainability methods such as Grad-CAM and SHAP aid model interpretation, real-time explanation tools suited for embedded deployment are lacking. Lighting variations are mitigated through multi-modal sensing strategies, but real-time sensor fusion introduces computational overhead. Lastly, techniques like model pruning improve inference speed for real-time applications, though often at the cost of reduced accuracy. This table highlights the ongoing need for balanced solutions that simultaneously address performance, interpretability, and deployment feasibility in real-world agricultural environments.
Figure 13 illustrates a structured flowchart connecting core challenges in agricultural object detection to corresponding technical solutions and highlighting unresolved research gaps. At the top, primary obstacles such as data scarcity, small object detection, domain variability, and real-time constraints are identified. These challenges are addressed through strategies like data augmentation, domain adaptation, multi-scale feature extraction, and lightweight model design, respectively. However, the flowchart also annotates emerging research needs that remain underexplored—such as few-shot learning for low-data regimes and multi-modal fusion for integrating heterogeneous sensor data. This mapping serves as a conceptual framework for guiding future research directions, emphasizing the importance of aligning practical solutions with context-specific agricultural challenges.

6.1. Environmental Variability

Agricultural scenes are inherently dynamic, and environmental variability poses significant challenges to the robustness of object detection models [136]. Lighting conditions fluctuate dramatically, from harsh midday sunlight to dawn shadows and overcast skies, altering object appearance and reducing model accuracy [141]. YOLOv5’s performance in weed detection has been observed to decline significantly under low-light conditions, with mAP decreasing by up to 20% due to poor contrast [142]. Occlusions, such as overlapping leaves obscuring fruits or pests, complicate bounding box predictions and increase false negatives [143]. Background clutter, including soil textures, plant debris, or mixed vegetation, further confounds detectors by introducing visual noise, as seen in SSD’s difficulties isolating small pests amidst dense foliage [144]. Addressing these challenges requires robust preprocessing strategies, with methods such as histogram equalization for lighting correction and the integration of multi-modal inputs like thermal imaging to mitigate occlusions, although these approaches often result in increased computational overhead [141]. Beyond agriculture, environmental variability parallels challenges faced in outdoor robotics and autonomous driving [145].
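As a concrete example of the lighting-correction preprocessing mentioned above, the sketch below applies OpenCV's CLAHE (contrast-limited adaptive histogram equalization) to the luminance channel of a field image before detection; the file paths and parameter values are placeholders rather than settings reported in the cited studies.

```python
import cv2

def normalize_illumination(bgr_image, clip_limit=2.0, tile_grid=(8, 8)):
    """Apply CLAHE to the luminance channel only, preserving colour information."""
    lab = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=clip_limit, tileGridSize=tile_grid)
    l_eq = clahe.apply(l)
    return cv2.cvtColor(cv2.merge((l_eq, a, b)), cv2.COLOR_LAB2BGR)

# Hypothetical usage on a low-light field image before passing it to a detector
image = cv2.imread("field_sample.jpg")          # placeholder path
if image is not None:
    corrected = normalize_illumination(image)
    cv2.imwrite("field_sample_clahe.jpg", corrected)
```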

6.2. Model Generalization

The ability of object detection models to generalize across diverse agricultural contexts, including different crop types, geographic regions, and seasonal variations, remains a major challenge, often resulting in overfitting to specific training conditions [146]. A Faster R-CNN model trained on apple orchards in temperate climates may fail to detect mangoes in tropical environments due to differences in fruit shape, color, or background. Seasonal variability in crop appearance across growth stages can lead to significant domain shifts, which negatively impact detection performance. This effect has been demonstrated in the case of SSD, where accuracy declines when models are applied across visually distinct phases of the same crop [147]. This lack of generalization stems from dataset biases and insufficient feature invariance, further exacerbated by the limited diversity of agricultural datasets [146]. Transfer learning, wherein models pre-trained on large datasets like ImageNet are fine-tuned on agricultural data, provides partial mitigation; however, significant domain gaps persist. This issue mirrors broader AI challenges related to cross-domain adaptation in fields such as robotics and healthcare, where models must transcend training biases to perform reliably in varied real-world conditions [148].
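A minimal sketch of the transfer-learning strategy described above is given below, fine-tuning a COCO-pretrained Faster R-CNN from torchvision on a hypothetical agricultural label set; the class count and the backbone-freezing policy are illustrative assumptions, not settings from the cited works.

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

NUM_CLASSES = 3  # e.g., background + apple + mango (illustrative)

# Start from COCO-pretrained weights and swap in a task-specific detection head
model = fasterrcnn_resnet50_fpn(weights="DEFAULT")
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, NUM_CLASSES)

# Optionally freeze the backbone so only the head adapts to the new crop domain
for param in model.backbone.parameters():
    param.requires_grad = False

optimizer = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad], lr=5e-3, momentum=0.9
)
```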

6.3. Real-Time Constraints

Real-time performance is a critical requirement for agricultural edge devices, including drones, robots, and handheld tools, where limited processing power presents a formidable constraint [149]. Despite their high accuracy, heavier models with inference times of 100–200 ms per frame remain unsuitable for dynamic agricultural tasks that demand real-time processing, with operations like autonomous weeding requiring frame rates of at least 30 FPS [71]. Lightweight models demonstrate favorable frame rates on edge devices, typically reaching 20–50 FPS, but this often comes at the cost of reduced precision, especially in scenarios involving small object detection [150]. These trade-offs are intensified by the memory and power limitations inherent to edge hardware, prompting research into model compression techniques, including pruning, quantization, and knowledge distillation, to reduce model sizes (e.g., from 50 MB to 10 MB) while preserving inference performance [133]. Real-time constraints constitute a widespread challenge in artificial intelligence and are equally critical in domains demanding high edge efficiency, notably autonomous driving and the Internet of Things (IoT) [131].
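The frame-rate requirements discussed above can be checked empirically. The sketch below times repeated forward passes of a lightweight torchvision detector on synthetic frames; the model choice, input resolution, and resulting FPS are illustrative assumptions and depend heavily on the target edge hardware.

```python
import time
import torch
from torchvision.models.detection import ssdlite320_mobilenet_v3_large

# Randomly initialised weights keep the sketch runnable offline; in practice the
# pretrained checkpoint (weights="DEFAULT") would be loaded before benchmarking.
model = ssdlite320_mobilenet_v3_large(weights=None, weights_backbone=None).eval()
frame = [torch.rand(3, 320, 320)]                 # one synthetic 320x320 frame

with torch.no_grad():
    for _ in range(5):                            # warm-up passes
        model(frame)
    n_frames = 50
    start = time.perf_counter()
    for _ in range(n_frames):
        model(frame)
    elapsed = time.perf_counter() - start

print(f"Approximate CPU inference speed: {n_frames / elapsed:.1f} FPS")
```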

7. Future Directions

Persistent challenges, including environmental variability, data scarcity, limited generalization, real-time constraints, and explainability, underscore the necessity for innovative approaches to enhance the utility of object detection in agriculture [151]. As precision agriculture evolves toward smarter and more sustainable systems, emerging AI paradigms offer promising avenues to bridge these gaps, thereby improving scalability and practical impact [152]. Future advancements are expected to center around explainable AI and the seamless integration of edge AI, multimodal sensing, and data-efficient learning into comprehensive detection frameworks [153]. The integration of these advancements has the potential to overcome existing limitations in object detection, thereby enabling the development of autonomous, adaptable, and reliable systems for agricultural and broader real-world applications [154].

7.1. Explainable AI (XAI)

Enhancing model interpretability is essential for building user trust and supporting decision-making, particularly in agricultural object detection [155]. Gradient-based attention maps such as those produced by Grad-CAM highlight image regions that influence model predictions, while feature attribution methods like SHAP quantify each pixel’s contribution to the output [140]. Traditional neural networks, such as BP-based models, have also shown promise in early-stage agricultural disease identification, for example mildew detection in aeroponically propagated mulberry cuttings [156]. Post-hoc explanation tools, including decision trees that approximate CNN behavior, offer simplified interpretative pathways for non-expert users, although real-time deployment of these explanations remains computationally demanding [157]. The rise of XAI reflects a broader movement within AI toward ethical and accountable systems, paralleling efforts in domains such as finance and medical diagnostics, where interpretability is a critical requirement [158].
To support transparent visual interpretation, Grad-CAM was employed to generate attention heatmaps that highlight the spatial regions most influential to the model’s decision-making process. For a target class $c$, the importance weight of the $k$-th feature map $A^k$ is computed as
$$\alpha_k^c = \frac{1}{Z} \sum_{i} \sum_{j} \frac{\partial y^c}{\partial A_{ij}^k},$$
where $y^c$ is the model output score for class $c$, $A_{ij}^k$ is the activation at location $(i, j)$ in feature map $A^k$, and $Z$ is the total number of spatial positions. The final Grad-CAM heatmap is computed as
$$L_{\mathrm{Grad\text{-}CAM}}^{c} = \mathrm{ReLU}\!\left( \sum_{k} \alpha_k^c A^k \right).$$
These attention maps allow researchers to verify whether the model focuses on biologically meaningful regions when performing pest or disease detection.
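The following sketch implements the two equations above using forward and backward hooks on a small classification backbone; applying it to a full detector would require hooking the corresponding feature-pyramid layer instead. The backbone, input tensor, and layer choice are illustrative assumptions rather than the configuration used in the cited study.

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

# Hooks capture the last convolutional feature map and its gradient; pretrained
# weights (weights="DEFAULT") would be used in practice, random ones keep this offline.
model = resnet18(weights=None).eval()
activations, gradients = {}, {}

model.layer4.register_forward_hook(
    lambda module, inp, out: activations.update(feat=out.detach()))
model.layer4.register_full_backward_hook(
    lambda module, gin, gout: gradients.update(feat=gout[0].detach()))

x = torch.rand(1, 3, 224, 224)                    # placeholder leaf image tensor
scores = model(x)
class_idx = scores.argmax(dim=1).item()           # target class c
scores[0, class_idx].backward()

# alpha_k^c: gradients global-average-pooled over spatial positions (1/Z * sum_ij)
alpha = gradients["feat"].mean(dim=(2, 3), keepdim=True)
# L^c = ReLU(sum_k alpha_k^c * A^k), upsampled and normalised for visualisation
cam = F.relu((alpha * activations["feat"]).sum(dim=1, keepdim=True))
cam = F.interpolate(cam, size=x.shape[-2:], mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
```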
Figure 14 illustrates this concept with an agricultural example. The original image (left) shows grape leaves in a vineyard setting. The middle and right panels display attention heatmaps from improved YOLOv10 and YOLOv10n models, respectively. These highlight the regions most influential in driving the model’s pest or disease detection decisions. Such interpretability tools enable researchers to verify whether models attend to biologically meaningful areas, thereby supporting transparent and accountable agricultural AI systems.
Figure 14. Visual comparison of attention heatmaps generated by improved YOLOv10 and YOLOv10n models on grape leaf imagery [159].

7.2. Few-Shot and Self-Supervised Learning

Addressing the challenge of limited annotated datasets is essential for advancing agricultural computer vision, particularly in scenarios involving rare crop varieties, pest infestations, or plant diseases for which labeled imagery is scarce. Few-shot learning (FSL) presents a viable solution by enabling models to generalize effectively from only a few labeled instances, typically between 5 and 10 samples per class, through techniques such as meta-learning or metric-based methods. Notably, prototypical networks have demonstrated efficacy by comparing query samples to learned class prototypes, facilitating classification with minimal supervision [1,32]. This concept is illustrated in Figure 15, where embedding networks process both support and query sets to compute similarity scores. Within agricultural domains, semantic segmentation models such as U-Net and DeepLabv3+ have proven effective in distinguishing between grape varieties using limited labeled data, offering a scalable approach to reducing manual annotation efforts [74].
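A minimal sketch of the metric-based comparison underlying prototypical networks is given below; the embedding dimensionality and the 5-way, 5-shot episode are illustrative assumptions, and in practice the embeddings would come from a trained backbone rather than random tensors.

```python
import torch
import torch.nn.functional as F

def prototypical_logits(support_emb, support_labels, query_emb, n_classes):
    """Metric-based few-shot classification (a sketch of prototypical networks).

    support_emb:    (n_support, d) embeddings of the few labelled examples
    support_labels: (n_support,) integer class ids in [0, n_classes)
    query_emb:      (n_query, d) embeddings to classify
    """
    # Class prototype = mean embedding of its support samples
    prototypes = torch.stack(
        [support_emb[support_labels == c].mean(dim=0) for c in range(n_classes)]
    )
    # Negative squared Euclidean distance serves as the similarity score
    return -torch.cdist(query_emb, prototypes) ** 2

# Illustrative 5-way, 5-shot episode with random 64-dimensional embeddings
support = torch.randn(25, 64)
labels = torch.arange(5).repeat_interleave(5)
queries = torch.randn(10, 64)
probs = F.softmax(prototypical_logits(support, labels, queries, 5), dim=1)
```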
In parallel, self-supervised learning (SSL) offers an alternative paradigm by harnessing the abundance of unlabeled agricultural imagery, frequently captured via drones or stationary cameras, to learn useful visual representations. Pretext tasks, such as predicting image rotations or spatial relationships between patches, enable robust pre-training before downstream fine-tuning on limited labeled data [5,160]. SSL pre-training on unlabeled video frames from agricultural fields can substantially enhance the performance of object detectors like YOLOv5 in weed identification tasks where annotations are sparse [68,161]. These methods, initially developed for data-constrained fields such as medical imaging and natural language processing, are now increasingly adopted in agricultural applications to improve generalizability and data efficiency [154,162].
Figure 15. Illustration of a metric-based few-shot learning framework using shared embedding networks to compare a query image against a limited labeled support set. The model assigns the class based on similarity scores to prototype representations. This approach is applicable in agriculture for pest or disease classification with limited annotated data [163].
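Complementing the self-supervised strategy described earlier in this subsection, the sketch below implements a rotation-prediction pretext task on unlabeled frames; the backbone and the batch of random tensors stand in for real drone imagery and are assumptions for illustration only.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

# Rotation-prediction pretext task: rotate each unlabelled frame by 0/90/180/270
# degrees and train the backbone to recognise the rotation before fine-tuning.
backbone = resnet18(weights=None)
backbone.fc = nn.Linear(backbone.fc.in_features, 4)   # 4 rotation classes
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(backbone.parameters(), lr=1e-3)

def rotation_batch(images):
    """Return rotated copies of the batch and their rotation labels."""
    rotations, labels = [], []
    for k in range(4):
        rotations.append(torch.rot90(images, k, dims=(2, 3)))
        labels.append(torch.full((images.size(0),), k, dtype=torch.long))
    return torch.cat(rotations), torch.cat(labels)

unlabelled = torch.rand(8, 3, 224, 224)               # placeholder drone frames
inputs, targets = rotation_batch(unlabelled)
loss = criterion(backbone(inputs), targets)
loss.backward()
optimizer.step()
```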

7.3. Multimodal Approaches

Integrating RGB imagery with complementary sensing modalities, such as hyperspectral, thermal, and LiDAR imaging, has emerged as a powerful strategy for mitigating the effects of environmental variability in agricultural object detection [162]. Hyperspectral imaging, by capturing reflectance patterns across a wide range of wavelengths beyond the visible spectrum, facilitates the differentiation of crop health conditions, enabling the early identification of stress symptoms or diseases even under low illumination [5]. Thermal imaging, which detects infrared radiation, provides the advantage of penetrating partial occlusions such as dense foliage and enhances nighttime detection of pests and wildlife by leveraging heat signatures [164]. Meanwhile, LiDAR technology contributes high-resolution 3D structural information that supports precise tasks such as crop row localization and terrain mapping for autonomous machinery [95].
The fusion of such heterogeneous modalities, whether through multi-stream convolutional neural networks or attention-based architectures like Transformers, has been shown to significantly improve detection accuracy. Multimodal frameworks combining RGB and thermal imagery have achieved performance gains of 10–15% in mean Average Precision (mAP) in complex agricultural environments [13,165]. Recent architectures also utilize generative adversarial networks (GANs) to perform high-resolution reconstruction by fusing RGB and thermal inputs, further enhancing the utility of multimodal sensing in complex agricultural scenarios (Figure 16). These multimodal approaches align with broader trends in robotics and autonomous driving, where sensor fusion is leveraged to enhance perception robustness and contextual awareness.
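A minimal sketch of the multi-stream fusion idea follows: two lightweight convolutional encoders process RGB and thermal inputs separately, and their pooled features are concatenated before a shared head. The architecture is an illustrative assumption and is far simpler than the Transformer- or GAN-based fusion models cited above.

```python
import torch
import torch.nn as nn

class TwoStreamFusion(nn.Module):
    """Late-fusion sketch: separate RGB and thermal encoders, concatenated features."""

    def __init__(self, n_classes=2):
        super().__init__()
        def encoder(in_ch):
            return nn.Sequential(
                nn.Conv2d(in_ch, 16, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            )
        self.rgb_stream = encoder(3)        # visible-spectrum branch
        self.thermal_stream = encoder(1)    # single-channel thermal branch
        self.head = nn.Linear(32 + 32, n_classes)

    def forward(self, rgb, thermal):
        fused = torch.cat([self.rgb_stream(rgb), self.thermal_stream(thermal)], dim=1)
        return self.head(fused)

model = TwoStreamFusion()
logits = model(torch.rand(2, 3, 128, 128), torch.rand(2, 1, 128, 128))
```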

7.4. Federated Learning

Federated learning (FL) offers a privacy-preserving framework for collaborative model training across decentralized sources, addressing the dual challenges of data scarcity and confidentiality in agriculture [162]. In FL, local models, typically deployed on edge devices at individual farms, train on private datasets (e.g., drone-captured crop imagery) and communicate only model parameters or gradients to a central aggregator, which synthesizes a global model without accessing raw data [167]. This process is illustrated in Figure 17, where local models update independently and only share model parameters, not raw data, preserving privacy throughout the learning cycle. Models such as attention-enhanced CNNs, which have been proven effective in tasks like tomato leaf disease diagnosis [168,169], can serve as local learners in federated pipelines.
In agricultural object detection, federated versions of deep models like YOLOv5 have demonstrated promising results. Studies show that FL can yield up to a 20% improvement in mean Average Precision (mAP) over isolated, locally-trained models when applied to rare pest or disease detection across farms with limited annotations [170,171]. This framework parallels developments in domains such as healthcare and smart cities, where federated learning supports secure and collaborative AI development under strict data protection requirements [12].
Figure 17. Illustration of federated learning architecture. Local devices train models on private data and share only model parameters with a central server, which aggregates them to update a global model. This decentralized paradigm enables collaborative learning without compromising data privacy [172].
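The parameter-aggregation step at the heart of this workflow can be sketched as a FedAvg-style weighted average of client state dictionaries; the two "farm" models and their dataset sizes below are hypothetical and only illustrate the aggregation logic, not the federated detectors cited above.

```python
import torch

def federated_average(client_states, client_sizes):
    """FedAvg-style aggregation: weight each client's parameters by its data size."""
    total = sum(client_sizes)
    return {
        key: sum(state[key].float() * (n / total)
                 for state, n in zip(client_states, client_sizes))
        for key in client_states[0]
    }

# Illustrative round with two "farms" sharing only their model weights
farm_a = torch.nn.Linear(4, 2)
farm_b = torch.nn.Linear(4, 2)
global_model = torch.nn.Linear(4, 2)
new_state = federated_average(
    [farm_a.state_dict(), farm_b.state_dict()], client_sizes=[120, 80]
)
global_model.load_state_dict(new_state)   # raw data never leaves the farms
```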

7.5. Edge AI Optimization

Real-time agricultural operations, ranging from on-site disease detection to autonomous robot navigation, demand efficient DL models deployable on edge devices with constrained resources. Edge AI optimization techniques, such as model pruning, quantization, and knowledge distillation, are instrumental in reducing model size and computation without sacrificing accuracy [1]. Pruning YOLOv5 can reduce its size from 50 MB to 10 MB, while maintaining accuracy and achieving frame rates exceeding 30 FPS on lightweight platforms like the Jetson Nano, significantly outperforming heavier architectures such as Faster R-CNN [68,161].
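As an illustration of the pruning step, the sketch below applies unstructured L1 magnitude pruning to the convolutional layers of a lightweight backbone and reports the resulting sparsity; real deployments would typically use structured pruning followed by fine-tuning, and the 50% ratio is an assumption rather than the setting used in the cited YOLOv5 work.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune
from torchvision.models import mobilenet_v3_small

# Zero out the 50% smallest-magnitude weights in every convolution, then make
# the pruning permanent so the sparse weights can be exported for deployment.
model = mobilenet_v3_small(weights=None)
for module in model.modules():
    if isinstance(module, nn.Conv2d):
        prune.l1_unstructured(module, name="weight", amount=0.5)
        prune.remove(module, "weight")

zeros = sum((m.weight == 0).sum().item() for m in model.modules()
            if isinstance(m, nn.Conv2d))
total = sum(m.weight.numel() for m in model.modules() if isinstance(m, nn.Conv2d))
print(f"Sparsity of convolutional layers: {zeros / total:.1%}")
```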
Lightweight neural network architectures, including MobileNet, YOLO-NAS, and TinyML frameworks, are increasingly adopted for tasks like fruit counting, weed detection, and livestock tracking on handheld or mobile platforms [173,174]. These strategies support low-latency and energy-efficient inference, which is vital for large-scale deployment in remote or infrastructure-limited agricultural regions. This direction aligns with broader AI and IoT trends that emphasize distributed intelligence and real-time responsiveness in resource-constrained environments [171].

7.6. Strategic Recommendations for the Future

To advance the effectiveness and scalability of object detection in agriculture, it is essential to establish actionable research and deployment priorities. These strategic recommendations are organized into short-term (1–3 years) and long-term (5+ years) targets, reflecting both the technological readiness of current methods and their anticipated impact on practical applications. Emphasis is placed on improving real-time deployment, data efficiency, model interpretability, and multi-modal integration to address persistent challenges encountered in diverse agricultural environments.

7.6.1. Short-Term Recommendations (1–3 Years)

  • Deployment of Lightweight Models: Further optimization of compact object detection models, such as YOLOv8 and its lightweight variants, is essential for enabling real-time inference on edge devices. These models must strike a balance between detection accuracy and computational efficiency to support use cases including in-field pest surveillance and mobile crop monitoring [164].
  • Synthetic Data Augmentation: The use of generative models—particularly diffusion-based architectures—to synthesize realistic training images offers a promising avenue for addressing data scarcity. Augmenting agricultural datasets such as AgriNet with high-quality synthetic samples can reduce the manual annotation burden while maintaining dataset diversity [92].

7.6.2. Long-Term Recommendations (5+ Years)

  • Advanced Multi-Modal Fusion: The integration of heterogeneous data sources, including RGB, thermal, and hyperspectral imagery, represents a robust strategy for mitigating environmental variability and improving model generalization. Developing scalable frameworks capable of real-time, multi-stream fusion remains a long-term research priority [162].
  • Real-Time Explainable AI (XAI): While post-hoc interpretability tools such as SHAP and Grad-CAM have gained popularity, their adaptation for real-time deployment in resource-constrained agricultural settings is still limited. Advancing interpretable AI frameworks tailored for field deployment will be critical for increasing user trust and facilitating adoption in practical applications [140].

8. Conclusions

Object detection has transformed precision agriculture, enabling advancements in crop monitoring, weed management, pest detection, and autonomous operations. This review synthesizes methodologies, from traditional feature-based approaches to DL architectures like YOLO and Faster R-CNN, and introduces a novel framework for evaluating performance based on mAP, FPS, and FLOPs. Leveraging datasets like PlantVillage and DeepWeeds, we identify critical trade-offs and propose solutions like multi-modal fusion and lightweight models. Challenges, including environmental variability and data scarcity, persist, but emerging paradigms such as few-shot learning and explainable AI offer promising avenues. By providing technical evaluations, insights, and actionable recommendations, this work bridges AI innovation with practical deployment, calling for interdisciplinary efforts to enhance agricultural productivity and sustainability.

Author Contributions

Conceptualization, Y.S., Z.K. and H.L.; methodology, Z.K.; validation, Y.S., Z.K. and H.L.; formal analysis, Y.S. and H.L.; investigation, Z.K.; resources, Y.S. and H.L.; writing—original draft preparation, Z.K.; writing—review and editing, Z.K. and H.L.; supervision, Y.S. and H.L.; project administration, Y.S. and H.L.; funding acquisition, H.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant number 32171908, and the Jiangsu Agricultural and Technology Innovation Fund, grant number CX(24)3025.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Alif, M.A.R.; Hussain, M. YOLOv1 to YOLOv10: A comprehensive review of YOLO variants and their application in the agricultural domain. arXiv 2024, arXiv:2406.10139. [Google Scholar]
  2. Qiu, D.; Guo, T.; Yu, S.; Liu, W.; Li, L.; Sun, Z.; Hu, D. Classification of Apple Color and Deformity Using Machine Vision Combined with CNN. Agriculture 2024, 14, 978. [Google Scholar] [CrossRef]
  3. Ji, W.; Zhai, K.; Xu, B.; Wu, J. Green Apple Detection Method Based on Multidimensional Feature Extraction Network Model and Transformer Module. J. Food Prot. 2025, 88, 100397. [Google Scholar] [CrossRef] [PubMed]
  4. Suganthi, S.U.; Prinslin, L.; Selvi, R.; Prabha, R. Generative AI in Agri: Sustainability in Smart Precision Farming Yield Prediction Mapping System Based on GIS Using Deep Learning and GPS. Procedia Comput. Sci. 2025, 252, 365–380. [Google Scholar]
  5. Zhou, Z.; Majeed, Y.; Naranjo, G.D.; Gambacorta, E.M. Assessment for crop water stress with infrared thermal imagery in precision agriculture: A review and future prospects for deep learning applications. Comput. Electron. Agric. 2021, 182, 106019. [Google Scholar] [CrossRef]
  6. Wang, H.; Li, J.; Dong, H. A Review of Vision-Based Multi-Task Perception Research Methods for Autonomous Vehicles. Sensors 2025, 25, 2611. [Google Scholar] [CrossRef]
  7. Wang, H.; Gu, J.; Wang, M. A review on the application of computer vision and machine learning in the tea industry. Front. Sustain. Food Syst. 2023, 7, 1172543. [Google Scholar] [CrossRef]
  8. Sun, J.; Cong, S.; Mao, H.; Wu, X.; Yang, N. Quantitative Detection of Mixed Pesticide Residue of Lettuce Leaves Based on Hyperspectral Technique. J. Food Process Eng. 2018, 41, e12654. [Google Scholar] [CrossRef]
  9. Wu, M.; Sun, J.; Lu, B.; Ge, X.; Zhou, X.; Zou, M. Application of Deep Brief Network in Transmission Spectroscopy Detection of Pesticide Residues in Lettuce Leaves. J. Food Process Eng. 2019, 42, e13005. [Google Scholar] [CrossRef]
  10. Sun, J.; Ge, X.; Wu, X.; Dai, C.; Yang, N. Identification of Pesticide Residues in Lettuce Leaves Based on Near Infrared Transmission Spectroscopy. J. Food Process Eng. 2018, 41, e12816. [Google Scholar] [CrossRef]
  11. Sharma, A.; Jain, A.; Gupta, P.; Chowdary, V. Machine learning applications for precision agriculture: A comprehensive review. IEEE Access 2020, 9, 4843–4873. [Google Scholar] [CrossRef]
  12. Akhter, R.; Sofi, S.A. Precision agriculture using IoT data analytics and machine learning. J. King Saud Univ. Comput. Inf. Sci. 2022, 34, 5602–5618. [Google Scholar] [CrossRef]
  13. Zhao, H.; Tang, Z.; Li, Z.; Dong, Y.; Si, Y.; Lu, M.; Panoutsos, G. Real-time object detection and robotic manipulation for agriculture using a YOLO-based learning approach. In Proceedings of the 2024 IEEE International Conference on Industrial Technology (ICIT), Bristol, UK, 25–27 March 2024; pp. 1–6. [Google Scholar]
  14. Kashyap, P.K.; Kumar, S.; Jaiswal, A.; Prasad, M.; Gandomi, A.H. Towards precision agriculture: IoT-enabled intelligent irrigation systems using deep learning neural network. IEEE Sens. J. 2021, 21, 17479–17491. [Google Scholar] [CrossRef]
  15. Zhang, F.; Chen, Z.; Ali, S.; Yang, N.; Fu, S.; Zhang, Y. Multi-class detection of cherry tomatoes using improved YOLOv4-Tiny. Int. J. Agric. Biol. Eng. 2023, 16, 225–231. [Google Scholar] [CrossRef]
  16. Zhou, X.; Sun, J.; Tian, Y.; Lu, B.; Hang, Y.; Chen, Q. Hyperspectral technique combined with deep learning algorithm for detection of compound heavy metals in lettuce. Food Chem. 2020, 321, 126503. [Google Scholar] [CrossRef]
  17. Sabir, R.M.; Mehmood, K.; Sarwar, A.; Safdar, M.; Muhammad, N.E.; Gul, N.; Akram, H.M.B. Remote Sensing and Precision Agriculture: A Sustainable Future. In Transforming Agricultural Management for a Sustainable Future: Climate Change and Machine Learning Perspectives; Springer Nature: Cham, Switzerland, 2024; pp. 75–103. [Google Scholar] [CrossRef]
  18. Tao, K.; Wang, A.; Shen, Y.; Lu, Z.; Peng, F.; Wei, X. Peach flower density detection based on an improved CNN incorporating attention mechanism and multi-scale feature fusion. Horticulturae 2022, 8, 904. [Google Scholar] [CrossRef]
  19. Zheng, Y.Y.; Kong, J.L.; Jin, X.B.; Wang, X.Y.; Su, T.L.; Zuo, M. CropDeep: The crop vision dataset for deep-learning-based classification and detection in precision agriculture. Sensors 2019, 19, 1058. [Google Scholar] [CrossRef] [PubMed]
  20. Li, Z.; Wang, D.; Zhu, T.; Tao, Y.; Ni, C. Review of deep learning-based methods for non-destructive evaluation of agricultural products. Biosyst. Eng. 2024, 245, 56–83. [Google Scholar] [CrossRef]
  21. Zhang, C.; Liu, Y.; Zhou, L. Using deep belief network to construct the agricultural information system based on Internet of Things. J. Supercomput. 2019, 75, 5171–5184. [Google Scholar] [CrossRef]
  22. Hou, Q.; Cheng, M.M.; Hu, X.; Borji, A.; Tu, Z.; Torr, P.H. A novel embedded cross framework for high-resolution salient object detection. IEEE Trans. Image Process. 2021, 30, 1034–1046. [Google Scholar] [CrossRef]
  23. Xu, Y.; Wang, Y.; Chen, X.; Li, H.; Jia, J. Camera–Radar Fusion with Modality Interaction and Radar Gaussian Expansion for 3D Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 16314–16323. [Google Scholar] [CrossRef]
  24. Liu, H.; Chen, X.; Zhang, M. Remote intelligent perception system for multi-object detection in smart agriculture. Inf. Process. Agric. 2023, 10, 321–334. [Google Scholar] [CrossRef]
  25. Ahmed, S.; Qiu, B.; Ahmad, F.; Kong, C.W.; Xin, H. A State-of-the-Art Analysis of Obstacle Avoidance Methods from the Perspective of an Agricultural Sprayer UAV’s Operation Scenario. Agronomy 2021, 11, 1069. [Google Scholar] [CrossRef]
  26. Liu, H.; Zhu, H. Evaluation of a Laser Scanning Sensor in Detection of Complex-Shaped Targets for Variable-Rate Sprayer Development. Trans. ASABE 2016, 59, 1181–1192. [Google Scholar]
  27. Myers, V.I.; Allen, W.A. Electrooptical remote sensing methods as nondestructive testing and measuring techniques in agriculture. Appl. Opt. 1968, 7, 1819–1838. [Google Scholar] [CrossRef] [PubMed]
  28. Hu, T.; Wang, W.; Gu, J.; Xia, Z.; Zhang, J.; Wang, B. Research on Apple Object Detection and Localization Method Based on Improved YOLOX and RGB-D Images. Agronomy 2023, 13, 1816. [Google Scholar] [CrossRef]
  29. Xie, D.; Chen, L.; Liu, L.; Chen, L.; Wang, H. Actuators and sensors for application in agricultural robots: A review. Machines 2022, 10, 913. [Google Scholar] [CrossRef]
  30. Xiong, Y.; Peng, C.; Grimstad, L.; From, P.J.; Isler, V. Development and field evaluation of a strawberry harvesting robot with a cable-driven gripper. Comput. Electron. Agric. 2019, 157, 392–402. [Google Scholar] [CrossRef]
  31. Khan, Z.; Liu, H.; Shen, Y.; Zeng, X. Deep learning improved YOLOv8 algorithm: Real-time precise instance segmentation of crown region orchard canopies in natural environment. Comput. Electron. Agric. 2024, 224, 109168. [Google Scholar] [CrossRef]
  32. Khan, Z.; Liu, H.; Shen, Y.; Yang, Z.; Zhang, L.; Yang, F. Optimizing precision agriculture: A real-time detection approach for grape vineyard unhealthy leaves using deep learning improved YOLOv7 with feature extraction capabilities. Comput. Electron. Agric. 2025, 231, 109969. [Google Scholar] [CrossRef]
  33. Chen, C.; Zhang, P.; Zhang, H.; Dai, J.; Yi, Y.; Zhang, H.; Zhang, Y. Deep Learning on Computational-Resource-Limited Platforms: A Survey. Mob. Inf. Syst. 2020, 2020, 1–19. [Google Scholar] [CrossRef]
  34. Qin, Y.M.; Tu, Y.H.; Li, T.; Ni, Y.; Wang, R.F.; Wang, H. Deep Learning for Sustainable Agriculture: A Systematic Review on Applications in Lettuce Cultivation. Sustainability 2025, 17, 3190. [Google Scholar] [CrossRef]
  35. Lv, R.; Hu, J.; Zhang, T.; Chen, X.; Liu, W. Crop-Free-Ridge Navigation Line Recognition Based on the Lightweight Structure Improvement of YOLOv8. Agriculture 2025, 15, 942. [Google Scholar] [CrossRef]
  36. Joshi, H. Edge-AI for Agriculture: Lightweight Vision Models for Disease Detection in Resource-Limited Settings. arXiv 2024, arXiv:2412.18635. [Google Scholar]
  37. Wu, T.; Liu, K.; Cheng, M.; Gu, Z.; Guo, W.; Jiao, X. Paddy Field Scale Evapotranspiration Estimation Based on Two-Source Energy Balance Model with Energy Flux Constraints and UAV Multimodal Data. Remote Sens. 2025, 17, 1662. [Google Scholar] [CrossRef]
  38. Gong, L.; Gao, B.; Sun, Y.; Zhang, W.; Lin, G.; Zhang, Z.; Li, Y.; Liu, C. preciseSLAM: Robust, Real-Time, LiDAR–Inertial–Ultrasonic Tightly-Coupled SLAM With Ultraprecise Positioning for Plant Factories. IEEE Trans. Ind. Inform. 2024, 20, 8818–8827. [Google Scholar] [CrossRef]
  39. Zhuang, J.; Chen, W.; Guo, B.; Yan, Y. Infrared Weak Target Detection in Dual Images and Dual Areas. Remote Sens. 2024, 16, 3608. [Google Scholar] [CrossRef]
  40. Liao, H.; Xia, J.; Yang, Z.; Pan, F.; Liu, Z.; Liu, Y. Meta-Learning Based Domain Prior With Application to Optical-ISAR Image Translation. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 7041–7056. [Google Scholar] [CrossRef]
  41. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  42. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar] [CrossRef]
  43. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; Volume 28. [Google Scholar]
  44. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2016; pp. 21–37. [Google Scholar] [CrossRef]
  45. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar] [CrossRef]
  46. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  47. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv 2022, arXiv:2207.02696. [Google Scholar]
  48. Ji, W.; Pan, Y.; Xu, B.; Wang, J. A Real-Time Apple Targets Detection Method for Picking Robot Based on ShufflenetV2-YOLOX. Agriculture 2022, 12, 856. [Google Scholar] [CrossRef]
  49. Liu, X.; Jia, W.; Ruan, C.; Zhao, D.; Gu, Y.; Chen, W. The recognition of apple fruits in plastic bags based on block classification. Precis. Agric. 2018, 19, 735–749. [Google Scholar] [CrossRef]
  50. Uijlings, J.R.; van de Sande, K.E.; Gevers, T.; Smeulders, A.W. Selective Search for Object Recognition. Int. J. Comput. Vis. 2013, 104, 154–171. [Google Scholar] [CrossRef]
  51. Sladojevic, S.; Arsenovic, M.; Anderla, A.; Culibrk, D.; Stefanovic, D. Deep Neural Networks Based Recognition of Plant Diseases by Leaf Image Classification. Comput. Intell. Neurosci. 2016, 2016, 3289801. [Google Scholar] [CrossRef] [PubMed]
  52. Hossain, S.; Mou, R.M.; Hasan, M.M.; Chakraborty, S.; Razzak, M.A. Recognition and detection of tea leaf’s diseases using support vector machine. In Proceedings of the 2018 IEEE 14th International Colloquium on Signal Processing & Its Applications (CSPA), Penang, Malaysia, 9–10 March 2018; pp. 150–154. [Google Scholar] [CrossRef]
  53. Tang, L.; Tian, L.; Steward, B.L. Classification of broadleaf and grass weeds using Gabor wavelets and an artificial neural network. Trans. ASAE 2003, 46, 1247–1254. [Google Scholar] [CrossRef]
  54. Zhu, Y.; Fan, S.; Zuo, M.; Zhang, B.; Zhu, Q.; Kong, J. Discrimination of New and Aged Seeds Based on On-Line Near-Infrared Spectroscopy Technology Combined with Machine Learning. Foods 2024, 13, 1570. [Google Scholar] [CrossRef] [PubMed]
  55. Jin, X. Development status and trend of agricultural robot technology. Int. J. Agric. Biol. Eng. 2021, 14, 1–14. [Google Scholar] [CrossRef]
  56. Zhang, Z.; Lu, Y.; Zhao, Y.; Pan, Q.; Jin, K.; Xu, G.; Hu, Y. TS-YOLO: An all-day and lightweight tea canopy shoots detection model. Agronomy 2023, 13, 1411. [Google Scholar] [CrossRef]
  57. Ge, X.; Sun, J.; Lu, B.; Chen, Q.; Xun, W.; Jin, Y. Classification of Oolong Tea Varieties Based on Hyperspectral Imaging Technology and BOSS-LightGBM Model. J. Food Process Eng. 2019, 42, e13289. [Google Scholar] [CrossRef]
  58. Deng, L.; Miao, Z.; Zhao, X.; Yang, S.; Gao, Y.; Zhai, C.; Zhao, C. HAD-YOLO: An Accurate and Effective Weed Detection Model Based on Improved YOLOV5 Network. Agronomy 2025, 15, 57. [Google Scholar] [CrossRef]
  59. Peng, Y.; Zhao, S.; Liu, J. Fused-Deep-Features Based Grape Leaf Disease Diagnosis. Agronomy 2021, 11, 2234. [Google Scholar] [CrossRef]
  60. Xu, C.; Lu, C.; Piao, J.; Wang, Y.; Zhou, Y.; Li, S. Rice virus release from the planthopper salivary gland is independent of plant tissue recognition by the stylet. Pest Manag. Sci. 2020, 76, 3208–3216. [Google Scholar] [CrossRef] [PubMed]
  61. Yang, N.; Qian, Y.; EL-Mesery, H.S.; Zhang, R.; Wang, A.; Tang, J. Rapid Detection of Rice Disease Using Microscopy Image Identification Based on the Synergistic Judgment of Texture and Shape Features and Decision Tree–Confusion Matrix Method. J. Sci. Food Agric. 2019, 99, 6589–6600. [Google Scholar] [CrossRef] [PubMed]
  62. Viveros Escamilla, L.D.; Gómez-Espinosa, A.; Escobedo Cabello, J.A.; Cantoral-Ceballos, J.A. Maturity recognition and fruit counting for sweet peppers in greenhouses using deep learning neural networks. Agriculture 2024, 14, 331. [Google Scholar] [CrossRef]
  63. Fatima, H.S.; ul Hassan, I.; Hasan, S.; Khurram, M.; Stricker, D.; Afzal, M.Z. Formation of a Lightweight, Deep Learning-Based Weed Detection System for a Commercial Autonomous Laser Weeding Robot. Appl. Sci. 2023, 13, 3997. [Google Scholar] [CrossRef]
  64. Amjoud, A.B.; Amrouch, M. Object Detection Using Deep Learning, CNNs and Vision Transformers: A Review. IEEE Access 2023, 11, 35479–35516. [Google Scholar] [CrossRef]
  65. Sun, T.; Zhang, W.; Miao, Z.; Zhang, Z.; Li, N. Object localization methodology in occluded agricultural environments through deep learning and active sensing. Comput. Electron. Agric. 2023, 212, 108141. [Google Scholar] [CrossRef]
  66. Wang, A.; Liu, L.; Chen, H.; Lin, Z.; Han, J.; Ding, G. YoloE: Real-Time Seeing Anything. arXiv 2025, arXiv:2503.07465. [Google Scholar]
  67. Olsen, A.; Konovalov, D.A.; Philippa, B.; Ridd, P.; Wood, J.C.; Johns, J.; Banks, W.; Girgenti, B.; Kenny, O.; Whinney, J.; et al. DeepWeeds: A Multiclass Weed Species Image Dataset for Deep Learning. Sci. Rep. 2019, 9, 2058. [Google Scholar] [CrossRef]
  68. Sun, F.; Lv, Q.; Bian, Y.; He, R.; Lv, D.; Gao, L.; Li, X. Grape Target Detection Method in Orchard Environment Based on Improved YOLOv7. Agronomy 2025, 15, 42. [Google Scholar] [CrossRef]
  69. Milioto, A.; Lottes, P.; Stachniss, C. Real-time semantic segmentation of crop and weed for precision agriculture robots leveraging background knowledge in CNNs. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, Australia, 21–25 May 2018; pp. 2229–2235. [Google Scholar] [CrossRef]
  70. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  71. Huang, J.; Rathod, V.; Sun, C.; Zhu, M.; Korattikara, A.; Fathi, A.; Fischer, I.; Wojna, Z.; Song, Y.; Guadarrama, S.; et al. Speed/Accuracy Trade-Offs for Modern Convolutional Object Detectors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7310–7311. [Google Scholar] [CrossRef]
  72. Dandekar, Y.; Shinde, K.; Gangan, J.; Firdausi, S.; Bharne, S. Weed Plant Detection from Agricultural Field Images using YOLOv3 Algorithm. In Proceedings of the 2022 6th International Conference on Computing, Communication, Control and Automation (ICCUBEA), Pune, India, 26–27 August 2022; pp. 1–6. [Google Scholar]
  73. Moreira, G.; Magalhães, S.A.; Pinho, T.M.; Cunha, M. Evaluating the Single-Shot MultiBox Detector and YOLO Deep Learning Models for the Detection of Tomatoes in a Greenhouse. Sensors 2021, 21, 3569. [Google Scholar] [CrossRef]
  74. Peng, Y.; Wang, A.; Liu, J.; Faheem, M. A comparative study of semantic segmentation models for identification of grape with different varieties. Agriculture 2021, 11, 997. [Google Scholar] [CrossRef]
  75. Wang, Y.; Han, Y.; Wang, C.; Song, S.; Tian, Q.; Huang, G. Computation-efficient Deep Learning for Computer Vision: A Survey. arXiv 2024, arXiv:2308.13998. [Google Scholar]
  76. Ariza-Sentís, M.; Vélez, S.; Martínez-Peña, R.; Baja, H.; Valente, J. Object detection and tracking in Precision Farming: A systematic review. Comput. Electron. Agric. 2024, 219, 108757. [Google Scholar] [CrossRef]
  77. Duan, Y.; Han, W.; Guo, P.; Wei, X. YOLOv8-GDCI: Research on the Phytophthora Blight Detection Method of Different Parts of Chili Based on Improved YOLOv8 Model. Agronomy 2024, 14, 2734. [Google Scholar] [CrossRef]
  78. Ma, J.; Li, M.; Fan, W.; Liu, J. State-of-the-Art Techniques for Fruit Maturity Detection. Agronomy 2024, 14, 56. [Google Scholar] [CrossRef]
  79. Sa, I.; Ge, Z.; Dayoub, F.; Upcroft, B.; Perez, T.; McCool, C. DeepFruits: A fruit detection system using deep neural networks. Sensors 2016, 16, 1222. [Google Scholar] [CrossRef]
  80. Jia, Z.; Zhang, M.; Yuan, C.; Liu, Q.; Liu, H.; Qiu, X.; Shi, J. ADL-YOLOv8: A Field Crop Weed Detection Model Based on Improved YOLOv8. Agronomy 2024, 14, 2355. [Google Scholar] [CrossRef]
  81. Zoubek, T.; Bumbálek, R.; Ufitikirezi, J.D.D.M.; Strob, M.; Filip, M.; Špalek, F.; Bartoš, P. Advancing precision agriculture with computer vision: A comparative study of YOLO models for weed and crop recognition. Crop Prot. 2025, 190, 107076. [Google Scholar] [CrossRef]
  82. Saleem, M.H.; Velayudhan, K.K.; Potgieter, J.; Arif, K.M. Weed Identification by Single-Stage and Two-Stage Neural Networks: A Study on the Impact of Image Resizers and Weights Optimization Algorithms. Front. Plant Sci. 2022, 13, 850666. [Google Scholar] [CrossRef]
  83. Zhu, W.; Sun, J.; Wang, S.; Shen, J.; Yang, K.; Zhou, X. Identifying Field Crop Diseases Using Transformer-Embedded Convolutional Neural Network. Agriculture 2022, 12, 1083. [Google Scholar] [CrossRef]
  84. Tang, S.; Xia, Z.; Gu, J.; Wang, W.; Huang, Z.; Zhang, W. High-precision apple recognition and localization method based on RGB-D and improved SOLOv2 instance segmentation. Front. Sustain. Food Syst. 2024, 8, 1403872. [Google Scholar] [CrossRef]
  85. Shen, L.; Su, J.; Huang, R.; Quan, W.; Song, Y.; Fang, Y.; Su, B. Fusing attention mechanism with Mask R-CNN for instance segmentation of grape cluster in the field. Front. Plant Sci. 2022, 13, 934450. [Google Scholar] [CrossRef] [PubMed]
  86. Yang, J.; Han, M.; He, J.; Wen, J.; Chen, D.; Wang, Y. Object detection and localization algorithm in agricultural scenes based on YOLOv5. J. Electron. Imaging 2023, 32, 052402. [Google Scholar] [CrossRef]
  87. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and Efficient Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10781–10790. [Google Scholar] [CrossRef]
  88. Chu, P.; Li, Z.; Zhang, K.; Chen, D.; Lammers, K.; Lu, R. O2RNet: Occluder-Occludee Relational Network for Robust Apple Detection in Clustered Orchard Environments. Smart Agric. Technol. 2023, 5, 100284. [Google Scholar] [CrossRef]
  89. Sun, J.; He, X.; Ge, X.; Wu, X.; Shen, J.; Song, Y. Detection of Key Organs in Tomato Based on Deep Migration Learning in a Complex Background. Agriculture 2018, 8, 196. [Google Scholar] [CrossRef]
  90. Wang, A.; Gao, B.; Cao, H.; Wang, P.; Zhang, T.; Wei, X. Early detection of Sclerotinia sclerotiorum on oilseed rape leaves based on optical properties. Biosyst. Eng. 2022, 224, 80–91. [Google Scholar] [CrossRef]
  91. Jasim, M.; Al-Tuwaijari, A. Detection and identification of plant leaf diseases using YOLOv4. PLoS ONE 2023, 18, e0284567. [Google Scholar] [CrossRef]
  92. Muhammad, A.; Salman, Z.; Lee, K.; Han, D. Harnessing the power of diffusion models for plant disease image augmentation. Front. Plant Sci. 2023, 14, 1280496. [Google Scholar] [CrossRef]
  93. Yang, R.; Yu, Y. Artificial convolutional neural network in object detection and semantic segmentation for medical imaging analysis. Front. Oncol. 2021, 11, 638182. [Google Scholar] [CrossRef]
  94. Pei, H.; Sun, Y.; Huang, H.; Zhang, W.; Sheng, J.; Zhang, Z. Weed detection in maize fields by UAV images based on crop row preprocessing and improved YOLOv4. Agriculture 2022, 12, 975. [Google Scholar] [CrossRef]
  95. Ding, H.; Zhang, B.; Zhou, J.; Yan, Y.; Tian, G.; Gu, B. Recent developments and applications of simultaneous localization and mapping in agriculture. J. Field Robot. 2022, 39, 956–983. [Google Scholar] [CrossRef]
  96. Pang, Y.; Shi, Y.; Gao, S.; Jiang, F.; Veeranampalayam-Sivakumar, A.N.; Thompson, L.; Luck, J.; Liu, C. Improved crop row detection with deep neural network for early-season maize stand count in UAV imagery. Comput. Electron. Agric. 2020, 178, 105766. [Google Scholar] [CrossRef]
  97. Milioto, A.; Lottes, P.; Stachniss, C. Real-time blob-wise sugar beets vs. weeds classification for monitoring fields using convolutional neural networks. ISPRS Ann. Photogramm. Remote Sens. Spat. Inf. Sci. 2017, 4, 41–48. [Google Scholar] [CrossRef]
  98. Chiu, M.T.; Xu, X.; Wei, Y.; Huang, Z.; Schwing, A.G.; Brunner, R.; Khachatrian, H.; Karapetyan, H.; Dozier, I.; Rose, G.; et al. Agriculture-Vision: A Large Aerial Image Database for Agricultural Pattern Analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 2828–2838. [Google Scholar] [CrossRef]
  99. Lu, Y.; Young, S. A survey of public datasets for computer vision tasks in precision agriculture. Comput. Electron. Agric. 2020, 178, 105760. [Google Scholar] [CrossRef]
  100. Liu, J.; Abbas, I.; Noor, R.S. Development of Deep Learning-Based Variable Rate Agrochemical Spraying System for Targeted Weeds Control in Strawberry Crop. Agronomy 2021, 11, 1480. [Google Scholar] [CrossRef]
  101. Zhang, X.; Li, H.; Sun, S.; Zhang, W.; Shi, F.; Zhang, R.; Liu, Q. Classification and identification of apple leaf diseases and insect pests based on improved ResNet-50 model. Horticulturae 2023, 9, 1046. [Google Scholar] [CrossRef]
  102. Al Sahili, Z.; Awad, M. The Power of Transfer Learning in Agricultural Applications: AgriNet. Front. Plant Sci. 2022, 13, 992700. [Google Scholar] [CrossRef]
  103. Garcin, C.; Joly, A.; Bonnet, P.; Lombardo, J.C.; Affouard, A.; Chouet, M.; Servajean, M.; Lorieul, T.; Salmon, J. Pl@ntNet-300K: A plant image dataset with high label ambiguity and a long-tailed distribution. In Proceedings of the NeurIPS Datasets and Benchmarks 2021, Online, 6–14 December 2021. [Google Scholar]
  104. Hughes, D.P.; Salathé, M. An open access repository of images on plant health to enable the development of mobile disease diagnostics. arXiv 2015, arXiv:1511.08060. [Google Scholar]
  105. Rahnemoonfar, M.; Sheppard, C. Deep count: Fruit counting based on deep simulated learning. Sensors 2017, 17, 905. [Google Scholar] [CrossRef]
  106. Lu, D.; Wang, Y. MAR-YOLOv9: A Multi-Dataset Object Detection Method for Agricultural Fields Based on YOLOv9. PLoS ONE 2024, 19, e0307643. [Google Scholar] [CrossRef]
  107. Noyan, M.A. Uncovering Bias in the PlantVillage Dataset. arXiv 2022, arXiv:2206.04374. [Google Scholar] [CrossRef]
  108. Li, T.; Feng, Q.; Qiu, Q.; Xie, F.; Zhao, C. Occluded Apple Fruit Detection and Localization with a Frustum-Based Point-Cloud-Processing Approach for Robotic Harvesting. Remote Sens. 2022, 14, 482. [Google Scholar] [CrossRef]
  109. Cravero, A.; Pardo, S.; Sepúlveda, S.; Muñoz, L. Challenges to Use Machine Learning in Agricultural Big Data: A Systematic Literature Review. Agronomy 2022, 12, 748. [Google Scholar] [CrossRef]
  110. Rufin, P.; Wang, S.; Lisboa, S.N.; Hemmerling, J.; Tulbure, M.G.; Meyfroidt, P. Taking it further: Leveraging pseudo-labels for field delineation across label-scarce smallholder regions. Int. J. Appl. Earth Obs. Geoinf. 2024, 134, 104149. [Google Scholar] [CrossRef]
  111. Li, L.; Xie, S.; Ning, J.; Chen, Q.; Zhang, Z. Evaluating green tea quality based on multisensor data fusion combining hyperspectral imaging and olfactory visualization systems. J. Sci. Food Agric. 2019, 99, 1787–1794. [Google Scholar] [CrossRef] [PubMed]
  112. Schuhmann, C.; Beaumont, R.; Vencu, R.; Gordon, C.; Wightman, R.; Cherti, M.; Coombes, T.; Katta, A.; Mullis, C.; Wortsman, M.; et al. LAION-5B: An open large-scale dataset for training next generation image-text models. Adv. Neural Inf. Process. Syst. 2022, 35, 25278–25294. [Google Scholar]
  113. Li, A.; Wang, C.; Ji, T.; Wang, Q.; Zhang, T. D3-YOLOv10: Improved YOLOv10-Based Lightweight Tomato Detection Algorithm Under Facility Scenario. Agriculture 2024, 14, 2268. [Google Scholar] [CrossRef]
  114. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-Time End-to-End Object Detection. arXiv 2024, arXiv:2405.14458. [Google Scholar]
  115. Tao, T.; Wei, X. STBNA-YOLOv5: An Improved YOLOv5 Network for Weed Detection in Rapeseed Field. Agriculture 2025, 15, 22. [Google Scholar] [CrossRef]
  116. Liu, S.; Li, Z.; Sun, J. Self-EMD: Self-Supervised Object Detection without ImageNet. arXiv 2020, arXiv:2011.13677. [Google Scholar]
  117. Gunay, M.; Koseoglu, M. Detection of circuit components on hand-drawn circuit images by using faster R-CNN method. Int. J. Adv. Comput. Sci. Appl. 2021, 12, 1–7. [Google Scholar] [CrossRef]
  118. Štancel, M.; Hulič, M. An Introduction to Image Classification and Object Detection Using YOLO Detector. In Proceedings of the CEUR Workshop Proceedings, Castiglione della Pescaia, Italy, 16–19 June 2019; Volume 2403, pp. 1–8. [Google Scholar]
  119. Roboflow. YOLOv5 Is Here: State-of-the-Art Object Detection at 140 FPS. 2020. Available online: https://blog.roboflow.com/yolov5-is-here/ (accessed on 19 May 2025).
  120. Xu, B.; Cui, X.; Ji, W.; Yuan, H.; Wang, J. Apple Grading Method Design and Implementation for Automatic Grader Based on Improved YOLOv5. Agriculture 2023, 13, 124. [Google Scholar] [CrossRef]
  121. Zuo, Z.; Gao, S.; Peng, H.; Xue, Y.; Han, L.; Ma, G.; Mao, H. Lightweight Detection of Broccoli Heads in Complex Field Environments Based on LBDC-YOLO. Agronomy 2024, 14, 2359. [Google Scholar] [CrossRef]
  122. Kulhandjian, H.; Yang, Y.; Amely, N. Design and Implementation of a Smart Agricultural Robot bullDOG (SARDOG). In Proceedings of the 2024 International Conference on Computing, Networking and Communications (ICNC), Hawaii, HL, USA, 19–22 February 2024; pp. 767–771. [Google Scholar]
  123. Fu, C.Y.; Liu, W.; Ranga, A.; Tyagi, A.; Berg, A.C. DSSD: Deconvolutional Single Shot Detector. arXiv 2017, arXiv:1701.06659. [Google Scholar]
  124. Zhang, H.; Hong, X.; Zhu, L. Detecting Small Objects in Thermal Images Using Single-Shot Detector. arXiv 2021, arXiv:2108.11101. [Google Scholar] [CrossRef]
  125. Tan, M.; Le, Q.V. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. arXiv 2019, arXiv:1905.11946. [Google Scholar]
  126. Ultralytics. EfficientDet vs. RTDETRv2: A Technical Comparison for Object Detection. 2023. Available online: https://docs.ultralytics.com/zh/compare/rtdetr-vs-efficientdet/ (accessed on 10 May 2025).
  127. Wang, Y.; Qin, Y.; Cui, J. Occlusion Robust Wheat Ear Counting Algorithm Based on Deep Learning. Front. Plant Sci. 2021, 12, 645899. [Google Scholar] [CrossRef]
  128. Guo, J.; Zhang, K.; Adade, S.Y.S.S.; Lin, J.; Lin, H.; Chen, Q. Tea grading, blending, and matching based on computer vision and deep learning. J. Sci. Food Agric. 2025, 105, 3239–3251. [Google Scholar] [CrossRef]
  129. Rehman, M.M.U.; Liu, J.; Nijabat, A.; Faheem, M.; Wang, W.; Zhao, S. Leveraging Convolutional Neural Networks for Disease Detection in Vegetables: A Comprehensive Review. Agronomy 2024, 14, 2231. [Google Scholar] [CrossRef]
  130. Wang, A.; Peng, T.; Cao, H.; Xu, Y.; Wei, X.; Cui, B. TIA-YOLOv5: An improved YOLOv5 network for real-time detection of crop and weed in the field. Front. Plant Sci. 2022, 13, 1091655. [Google Scholar] [CrossRef]
  131. Li, E.; Zhou, Z.; Chen, X. Edge intelligence: On-demand deep learning model co-inference with device-edge synergy. In Proceedings of the 2018 Workshop on Mobile Edge Communications, Budapest Hungary, 20 August 2018; pp. 31–36. [Google Scholar] [CrossRef]
  132. Grigorescu, S.; Trasnea, B.; Cocias, T.; Macesanu, G. A survey of deep learning techniques for autonomous driving. J. Field Robot. 2020, 37, 362–386. [Google Scholar] [CrossRef]
  133. Cheng, H.; Zhang, M.; Shi, J.Q. A Survey on Deep Neural Network Pruning: Taxonomy, Comparison, Analysis, and Recommendations. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 10558–10578. [Google Scholar] [CrossRef] [PubMed]
  134. Zhu, M.; Gupta, S. To Prune, or Not to Prune: Exploring the Efficacy of Pruning for Model Compression. arXiv 2017, arXiv:1710.01878. [Google Scholar] [CrossRef]
  135. Wang, R.; Liu, L.; Xie, C.; Yang, P.; Li, R.; Zhou, M. AgriPest: A Large-Scale Domain-Specific Benchmark Dataset for Practical Agricultural Pest Detection in the Wild. Sensors 2021, 21, 1601. [Google Scholar] [CrossRef]
  136. Badgujar, C.M.; Poulose, A.; Gan, H. Agricultural Object Detection with You Look Only Once (YOLO) Algorithm: A Bibliometric and Systematic Literature Review. arXiv 2024, arXiv:2401.10379. [Google Scholar] [CrossRef]
  137. Shi, Y.; Han, L.; Zhang, X.; Sobeih, T.; Gaiser, T.; Thuy, N.H.; Behrend, D.; Srivastava, A.K.; Halder, K.; Ewert, F. Deep Learning Meets Process-Based Models: A Hybrid Approach to Agricultural Challenges. arXiv 2025, arXiv:2504.16141. [Google Scholar]
  138. Dong, M.; Yu, H.; Sun, Z.; Zhang, L.; Sui, Y.; Zhao, R. Research on Agricultural Environmental Monitoring Internet of Things Based on Edge Computing and Deep Learning. J. Intell. Syst. 2024, 33, 20230114. [Google Scholar] [CrossRef]
  139. Memon, M.S.; Chen, S.; Shen, B.; Liang, R.; Tang, Z.; Wang, S.; Memon, N. Automatic visual recognition, detection and classification of weeds in cotton fields based on machine vision. Crop Prot. 2025, 187, 106966. [Google Scholar] [CrossRef]
  140. Bhattacharya, A. Applied Machine Learning Explainability Techniques: Make ML Models Explainable and Trustworthy for Practical Applications Using LIME, SHAP, and More; Packt Publishing Ltd.: Birmingham, UK, 2022. [Google Scholar]
  141. Wang, T.s.; Kim, G.T.; Shin, J.; Jang, S.W. Hierarchical Image Quality Improvement Based on Illumination, Resolution, and Noise Factors for Improving Object Detection. Electronics 2024, 13, 4438. [Google Scholar] [CrossRef]
  142. Li, Z.; Xiang, J.; Duan, J. A low illumination target detection method based on a dynamic gradient gain allocation strategy. Sci. Rep. 2024, 14, 29058. [Google Scholar] [CrossRef]
  143. Beldek, C.; Cunningham, J.; Aydin, M.; Sariyildiz, E.; Phung, S.L.; Alici, G. Sensing-based Robustness Challenges in Agricultural Robotic Harvesting. arXiv 2025, arXiv:2502.12403. [Google Scholar]
  144. Lyu, Z.; Jin, H.; Zhen, T.; Sun, F.; Xu, H. Small object recognition algorithm of grain pests based on SSD feature fusion. IEEE Access 2021, 9, 43202–43213. [Google Scholar] [CrossRef]
  145. Silwal, A.; Parhar, T.; Yandun, F.; Kantor, G. A Robust Illumination-Invariant Camera System for Agricultural Applications. arXiv 2021, arXiv:2101.02190. [Google Scholar]
  146. Bargoti, S.; Underwood, J. Deep Fruit Detection in Orchards. arXiv 2016, arXiv:1610.03677. [Google Scholar]
  147. Liu, S.; Peng, D.; Zhang, B.; Chen, Z.; Yu, L.; Chen, J.; Yang, S. The Accuracy of Winter Wheat Identification at Different Growth Stages Using Remote Sensing. Remote Sens. 2022, 14, 893. [Google Scholar] [CrossRef]
  148. Kamath, U.; Liu, J.; Whitaker, J. Transfer Learning: Domain Adaptation. In Deep Learning for NLP and Speech Recognition; Springer: Berlin/Heidelberg, Germany, 2019; pp. 495–535. [Google Scholar] [CrossRef]
  149. Kamilaris, A.; Prenafeta-Boldú, F.X. Deep learning in agriculture: A survey. Comput. Electron. Agric. 2018, 147, 70–90. [Google Scholar] [CrossRef]
  150. Migneco, P. Traffic Sign Recognition Algorithm: A Deep Comparison Between YOLOv5 and SSD Mobilenet. Doctoral Dissertation, Politecnico di Torino, Torino, Italy, 2024. [Google Scholar]
  151. Albahar, M. A survey on deep learning and its impact on agriculture: Challenges and opportunities. Agriculture 2023, 13, 540. [Google Scholar] [CrossRef]
  152. Vincent, D.R.; Deepa, N.; Elavarasan, D.; Srinivasan, K.; Chauhdary, S.H.; Iwendi, C. Sensors driven AI-based agriculture recommendation model for assessing land suitability. Sensors 2019, 19, 3667. [Google Scholar] [CrossRef]
  153. Yang, J.; Guo, X.; Li, Y.; Marinello, F.; Ercisli, S.; Zhang, Z. A survey of few-shot learning in smart agriculture: Developments, applications, and challenges. Plant Methods 2022, 18, 1–15. [Google Scholar] [CrossRef]
  154. Dhanya, V.; Subeesh, A.; Kushwaha, N.; Vishwakarma, D.; Kumar, T.; Ritika, G.; Singh, A. Deep learning based computer vision approaches for smart agricultural applications. Artif. Intell. Agric. 2022, 6, 211–229. [Google Scholar] [CrossRef]
  155. Hrast Essenfelder, A.; Toreti, A.; Seguini, L. Expert-driven explainable artificial intelligence models can detect multiple climate hazards relevant for agriculture. Commun. Earth Environ. 2025, 6, 207.
  156. Guo, Y.; Gao, J.; Tunio, M.H.; Wang, L. Study on the Identification of Mildew Disease of Cuttings at the Base of Mulberry Cuttings by Aeroponics Rapid Propagation Based on a BP Neural Network. Agronomy 2022, 13, 106.
  157. Kawakura, S.; Hirafuji, M.; Ninomiya, S.; Shibasaki, R. Adaptations of Explainable Artificial Intelligence (XAI) to Agricultural Data Models with ELI5, PDPbox, and Skater using Diverse Agricultural Worker Data. Eur. J. Artif. Intell. 2022, 3, 14.
  158. Dara, R.; Hazrati Fard, S.M.; Kaur, J. Recommendations for ethical and responsible use of artificial intelligence in digital agriculture. Front. Artif. Intell. 2022, 5, 884192.
  159. Shen, Y.; Khan, Z.; Liu, H.; Yang, Z.; Hussain, I. YOLO Optimization for Small Object Detection: DyFAM, EFRAdaptiveBlock, and Bayesian Tuning in Precision Agriculture. SSRN Electron. J. 2025, early stage.
  160. Liu, H.; Zeng, X.; Shen, Y.; Xu, J.; Khan, Z. A Single-Stage Navigation Path Extraction Network for agricultural robots in orchards. Comput. Electron. Agric. 2025, 229, 109687.
  161. Xiang, W.; Wu, D.; Wang, J. Enhancing stem localization in precision agriculture: A Two-Stage approach combining YOLOv5 with EffiStemNet. Comput. Electron. Agric. 2025, 231, 109914.
  162. Coulibaly, S.; Kamsu-Foguem, B.; Kamissoko, D.; Traore, D. Deep learning for precision agriculture: A bibliometric analysis. Intell. Syst. Appl. 2022, 16, 200102.
  163. Sun, X.; Wang, B.; Wang, Z.; Fu, K. Research Progress on Few-Shot Learning for Remote Sensing Image Interpretation. Remote Sens. 2021, 13, 678.
  164. Chen, Z.; Feng, J.; Yang, Z.; Wang, Y.; Ren, M. YOLOv8-ACCW: Lightweight grape leaf disease detection method based on improved YOLOv8. IEEE Access 2024, 12, 123595–123608.
  165. Chen, J.W.; Lin, W.J.; Cheng, H.J.; Hung, C.L.; Lin, C.Y.; Chen, S.P. A smartphone-based application for scale pest detection using multiple-object detection methods. Electronics 2021, 10, 372.
  166. Almasri, F.; Debeir, O. Multimodal Sensor Fusion in Single Thermal Image Super-Resolution. In Proceedings of the Computer Vision—ACCV 2018 Workshops, Perth, Australia, 2–6 December 2018; Springer: Berlin/Heidelberg, Germany, 2019; pp. 418–433.
  167. Padhiary, M.; Hoque, A.; Prasad, G.; Kumar, K.; Sahu, B. Precision Agriculture and AI-Driven Resource Optimization for Sustainable Land and Resource Management. In Smart Water Technology for Sustainable Management in Modern Cities; IGI Global: Hershey, PA, USA, 2025; pp. 197–232.
  168. Zhao, S.; Peng, Y.; Liu, J.; Wu, S. Tomato Leaf Disease Diagnosis Based on Improved Convolution Neural Network by Attention Module. Agriculture 2021, 11, 651.
  169. Shen, Y.; Yang, Z.; Khan, Z.; Liu, H.; Chen, W.; Duan, S. Optimization of Improved YOLOv8 for Precision Tomato Leaf Disease Detection in Sustainable Agriculture. Sensors 2025, 25, 1398.
  170. Dhanya, K.; Gopal, P.; Srinivasan, V. Deep learning in agriculture: Challenges and future directions. Artif. Intell. Agric. 2022, 6, 1–11.
  171. Padhiary, M.; Hoque, A.; Prasad, G.; Kumar, K.; Sahu, B. The Convergence of Deep Learning, IoT, Sensors, and Farm Machinery in Agriculture. In Designing Sustainable Internet of Things Solutions for Smart Industries; IGI Global: Hershey, PA, USA, 2025; pp. 109–142.
  172. Zheng, W.; Cao, Y.; Tan, H. Secure sharing of industrial IoT data based on distributed trust management and trusted execution environments: A federated learning approach. Neural Comput. Appl. 2023, 35, 21499–21509.
  173. Kumar, Y.; Kumar, P. Comparative study of YOLOv8 and YOLO-NAS for agriculture application. In Proceedings of the 2024 11th International Conference on Signal Processing and Integrated Networks (SPIN), Noida, India, 21–22 March 2024; pp. 72–77.
  174. Padhiary, M.; Kumar, R. Enhancing Agriculture Through AI Vision and Machine Learning: The Evolution of Smart Farming. In Advancements in Intelligent Process Automation; IGI Global: Hershey, PA, USA, 2025; pp. 295–324.
Figure 1. Timeline of Advances in Object Detection Algorithms (1999–2025).
Figure 2. Agricultural robots integrated with detection systems, perception modules, and actuation units. (A) Forest mapping robot [29]; (B) Strawberry harvesting robot with multiple sensors [30]; (C) Autonomous orchard spraying robot with flexible mechanism [31]; (D) Variable spray robot for precision agriculture [32].
Figure 3. Object Detection Workflow Using SIFT/HOG Features in Agricultural Applications.
Figure 4. Evolution of object detection architectures from early region-based methods (e.g., R-CNN [41]) to modern transformer-integrated and prompt-enhanced frameworks (e.g., YOLOE [66]), highlighting key milestones in speed, accuracy, and adaptability. Each component represents a distinct architectural paradigm: R-CNN emphasizes detection precision but suffers from high latency; real-time convolutional models (YOLOv3 [67], YOLOv7 [68]) introduce dense prediction heads for faster inference and are widely applied in agricultural monitoring tasks such as weed, fruit, and disease detection. The 2016 CNN-based segmentation model [69] reflects early efforts in pixel-wise classification of crop and weed patterns. The most recent generation, YOLOE [66], incorporates prompt-based modules and transformers to improve detection robustness under field-specific challenges such as occlusion, illumination variability, and visual clutter. Visual components are adapted from their original sources.
Figure 5. Comparison of representative object detection architectures and their functional components, with relevance to agricultural vision tasks. From left to right: R-CNN employs region proposals and separate stages for feature extraction and classification, offering high accuracy but limited speed—applicable to precise tasks such as fruit counting. Fast R-CNN and Faster R-CNN integrate shared convolutional features and region proposal networks (RPN) to improve efficiency while maintaining detection quality. YOLO represents a single-stage regression-based model that directly predicts bounding boxes and class probabilities from grid-based outputs, enabling real-time weed or disease detection. SSD also follows a single-shot design but incorporates multi-scale feature maps to enhance detection of objects at varying sizes. All components shown illustrate distinct architectural paradigms and their trade-offs between accuracy and inference speed. This figure is original and created by the authors.
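To make the contrast in Figure 5 concrete, the following minimal sketch runs a two-stage detector (Faster R-CNN) and a one-stage detector (SSD) on the same image using off-the-shelf torchvision models. The image path, pretrained weights, and 0.5 confidence threshold are illustrative assumptions, not the configurations evaluated in this review, and a recent torchvision release (0.13 or later) is assumed for the weights argument.

```python
# Minimal sketch: two-stage (Faster R-CNN) vs. one-stage (SSD) inference with torchvision.
# Illustrative only; the image file and threshold are assumptions.
import torch
from torchvision.io import read_image
from torchvision.models.detection import fasterrcnn_resnet50_fpn, ssd300_vgg16

def detect(model, image, score_thresh=0.5):
    """Run a detector on one CHW float image tensor and keep confident boxes."""
    model.eval()
    with torch.no_grad():
        out = model([image])[0]          # torchvision detectors take a list of CHW float tensors
    keep = out["scores"] >= score_thresh
    return out["boxes"][keep], out["labels"][keep], out["scores"][keep]

image = read_image("orchard_sample.jpg").float() / 255.0   # hypothetical field image

two_stage = fasterrcnn_resnet50_fpn(weights="DEFAULT")     # region proposals + per-ROI heads
one_stage = ssd300_vgg16(weights="DEFAULT")                # dense predictions on multi-scale maps

for name, model in [("Faster R-CNN", two_stage), ("SSD", one_stage)]:
    boxes, labels, scores = detect(model, image)
    print(f"{name}: {len(boxes)} detections above threshold")
```

Both models follow the same torchvision detection interface (a list of image tensors in, a list of box/label/score dictionaries out), which makes such side-by-side comparisons of accuracy and latency straightforward.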
Figure 6. Performance trend of object detection models on the COCO dataset (mAP@0.5). The plot shows how accuracy has evolved over the years across R-CNN, Fast R-CNN, Faster R-CNN, and the YOLO series, with YOLOv8 reaching the highest mAP among the models shown.
Figure 7. Representative samples from the Weed Species Dataset [67]. Each row (ai) corresponds to a distinct weed species found in the dataset: (a) Chinee apple (Ziziphus mauritiana), (b) Lantana (Lantana camara), (c) Parkinsonia (Parkinsonia aculeata), (d) Parthenium (Parthenium hysterophorus), (e) Prickly acacia (Vachellia nilotica), (f) Rubber vine (Cryptostegia grandiflora), (g) Siam weed (Chromolaena odorata), (h) Snake weed (Stachytarpheta spp.), and (i) Negative (non-target/background vegetation). The images illustrate natural variations in lighting, occlusion, background complexity, and weed morphology, highlighting the dataset’s challenge for visual recognition tasks.
Figure 8. Detection Results of Grape Clusters Across Different Varieties [85].
Figure 9. Setup of Grape Leaves with Markers for Experimental Validation of Detection Accuracy and Spray Coverage.
Figure 10. Agricultural robot path-tracking system [85].
Figure 11. Comparison of Agricultural Datasets by Image Count and Annotation Type (e.g., bounding boxes, class labels), highlighting diversity in tasks like disease detection and weed identification.
Figure 12. Performance comparison of YOLO models (YOLOv6, YOLOv7, YOLOv8, YOLOv9, and YOLOv10) and other object detection models on the COCO dataset. The plot shows COCO AP (%) versus the number of parameters (M); YOLOv10 achieves competitive accuracy with fewer parameters. (Source: [114]).
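Parameter counts such as those on the x-axis of Figure 12 are usually read directly from the model object; the short snippet below shows one common way to obtain them in PyTorch. The SSD model is only a stand-in example, not one of the models plotted in the figure.

```python
# Count trainable parameters (in millions), as plotted on the x-axis of Figure 12.
# Any torch.nn.Module works; the SSD model here is just a stand-in example.
from torchvision.models.detection import ssd300_vgg16

model = ssd300_vgg16(weights=None)
params_m = sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6
print(f"Trainable parameters: {params_m:.1f} M")
```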
Figure 13. Structured overview of key challenges, current solutions, and research gaps in agricultural object detection. The top layer identifies two persistent issues: data scarcity, which limits supervised model training due to lack of labeled agricultural datasets, and small object detection, which affects accurate localization of fine-scale targets such as pests or early-stage leaf lesions. These challenges are commonly addressed through data augmentation techniques and multi-scale feature learning, respectively. However, the bottom layer highlights ongoing research gaps, including the need for few-shot learning to reduce dependence on large datasets, and real-time multi-modal fusion to improve detection robustness in complex farm environments. This figure is original and created by the authors.
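As a concrete illustration of the data augmentation strategy named in Figure 13, the sketch below assembles a typical photometric and geometric pipeline with torchvision transforms; the specific transforms and magnitudes are illustrative assumptions rather than a recommended recipe. For detection (rather than classification) tasks, bounding boxes must be transformed together with the image, which this simple pipeline does not handle; libraries such as Albumentations provide that support.

```python
# Minimal augmentation pipeline for scarce agricultural image data (illustrative values).
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomResizedCrop(640, scale=(0.7, 1.0)),   # simulate varying camera distance
    transforms.RandomHorizontalFlip(p=0.5),                # field scenes are left/right symmetric
    transforms.ColorJitter(brightness=0.3, contrast=0.3,
                           saturation=0.3, hue=0.05),      # illumination variability
    transforms.ToTensor(),
])

# Usage with a PIL image (e.g., loaded via torchvision.datasets.ImageFolder):
# augmented = augment(pil_image)   # -> 3 x 640 x 640 float tensor
```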
Figure 16. Example of a multimodal fusion network using RGB and thermal imagery. The generator fuses both modalities to reconstruct high-resolution thermal outputs, while the discriminator evaluates output quality. Such architectures improve the fidelity and robustness of downstream tasks like detection or mapping [166].
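The generator and discriminator design shown in Figure 16 is specific to [166]. As a simpler illustration of the underlying idea, the sketch below performs early fusion of registered RGB and thermal inputs by channel concatenation followed by a small convolutional encoder; the layer sizes and tensor shapes are assumptions chosen for readability, not the architecture of [166].

```python
# Minimal early-fusion block for RGB + thermal imagery (generic sketch, not the network in [166]).
import torch
import torch.nn as nn

class EarlyFusionEncoder(nn.Module):
    def __init__(self, out_channels=64):
        super().__init__()
        # 3 RGB channels + 1 thermal channel are concatenated before the first convolution.
        self.encoder = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, out_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, rgb, thermal):
        # rgb: (B, 3, H, W), thermal: (B, 1, H, W); both registered to the same resolution.
        fused = torch.cat([rgb, thermal], dim=1)
        return self.encoder(fused)

fused_features = EarlyFusionEncoder()(torch.rand(1, 3, 256, 256), torch.rand(1, 1, 256, 256))
print(fused_features.shape)  # torch.Size([1, 64, 256, 256])
```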
Table 1. Overview of Major Object Detection Frameworks in Agricultural Applications.
Model | Year | Key Features
R-CNN | 2014 | Region proposals + CNN classification [41]
Fast R-CNN | 2015 | ROI pooling, faster training [42]
Faster R-CNN | 2015 | Integrated RPN for proposal generation [43]
SSD | 2016 | Multi-box detection with multiple feature maps [44]
YOLOv1 | 2016 | Unified detection and classification [45]
YOLOv3 | 2018 | Multi-scale prediction, Darknet-53 [46]
YOLOv7 | 2022 | E-ELAN optimization, fast and accurate [47]
Note: R-CNN (Region-based Convolutional Neural Network); SSD (Single Shot MultiBox Detector); YOLO (You Only Look Once).
Table 2. Comparative Performance of Representative Object Detection Models in Agricultural Applications.
Model | Year | Task | mAP (%) | FPS | GFLOPs | Refs.
Faster R-CNN | 2015 | Fruit Counting | 92.0 | 5–10 | 120 | [43,71]
YOLOv3 | 2018 | Weed Detection | 88.7 | 25 | 65 | [67,72]
YOLOv7 | 2022 | Pest Detection | 94.1 | 40 | 35 | [32,68]
SSD | 2016 | Crop Row Detection | 85.4 | 30–40 | 50 | [44,73]
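The mAP, FPS, and GFLOPs figures in Table 2 are taken from the cited studies. When reproducing such numbers, a typical measurement loop resembles the sketch below, which combines a COCO-style mAP metric from torchmetrics with wall-clock timing; dataset loading is omitted, and the single-image batching is an assumption meant to mimic edge deployment. GFLOPs are usually obtained separately with a FLOP-counting utility.

```python
# Sketch of how mAP and FPS figures like those in Table 2 are typically measured.
# torchmetrics provides a COCO-style mAP metric; data loading is omitted here.
import time
import torch
from torchmetrics.detection import MeanAveragePrecision

def benchmark(model, images, targets):
    """images: list of CHW float tensors; targets: list of dicts with 'boxes' and 'labels'."""
    model.eval()
    metric = MeanAveragePrecision()
    start = time.perf_counter()
    with torch.no_grad():
        preds = [model([img])[0] for img in images]   # one image at a time, as on an edge device
    elapsed = time.perf_counter() - start
    metric.update(preds, targets)                      # preds carry 'boxes', 'scores', 'labels'
    results = metric.compute()
    return {"mAP@0.5": float(results["map_50"]), "FPS": len(images) / elapsed}
```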
Table 3. Applications of Object Detection in Agriculture with Algorithmic Contributions.
Task | Application Example | Reference | Algorithmic Contribution
Disease Detection | YOLOv7 for grapevine powdery mildew detection | [68] | Improved YOLOv7 with backbone pruning and feature enhancement for orchard environments
Disease Detection | RetinaNet for multi-crop disease classification | [77] | YOLOv8-GDCI with global detail-context interaction for detecting small objects in plant parts
Fruit Counting | YOLOv5 applied to apple counting | [78] | Reviewed deep learning maturity detection techniques including object-level fruit analysis
Fruit Counting | SSD for citrus fruit detection in orchards | [79] | Developed SSD-based detection with real-time capability using multispectral image fusion
Weed Detection | DeepWeeds dataset classification using YOLOv3 | [67] | Introduced multiclass weed dataset; evaluated YOLOv3 under real-world conditions
Weed Detection | Improved YOLOv8 for weed detection in crop fields | [80] | Enhanced YOLOv8 with attention-guided dual-layer feature fusion for dense weed clusters
Spraying Robotics | Precision pesticide application in vineyards | [32] | YOLOv7 improved with custom feature extractors targeting grape leaf health conditions
Spraying Robotics | Precision pesticide application in orchards | [31] | Real-time instance segmentation of canopies using refined YOLOv8 architecture
Note: YOLO (You Only Look Once), SSD (Single Shot MultiBox Detector), RetinaNet (Retina Network), GDCI (Global Detail-Context Interaction).
Table 4. Overview of Major Agricultural Datasets for Object Detection.
Dataset | Images | Crop/Weed Types | Notes
PlantVillage | 50,000+ | 38 crop-disease pairs | Controlled lab images [104]
DeepWeeds | 17,509 | 9 weed species | Field conditions, weeds in Australia [67]
GrapeLeaf Dataset | 5000+ | Grapevine diseases | Grape disease segmentation [68]
DeepFruit | 35,000+ | Apple, mango, citrus | Fruit detection for yield estimation [105]
Table 5. Comparison of Object Detection Models for Agricultural Tasks.
Model | Architecture | FPS | mAP (%) | GFLOPs | Agricultural Relevance
Faster R-CNN | Two-stage | 5–10 | 90–92 | 120 | High-precision disease detection [43]
YOLOv5 | One-stage | 50+ | 87–89 | 25 | Real-time weed detection [68]
SSD | One-stage | 30–40 | 78–82 | 50 | Lightweight fruit detection [44]
EfficientNet | One-stage | 30–50 | 90–91 | 40 | Versatile crop row mapping [87]
RetinaNet | One-stage | 20–30 | 85–88 | 60 | Rare disease detection [70]
YOLOv8 | One-stage | 25–30 | 93–94 | 35 | Real-time orchard spraying [31]
Note: Faster R-CNN (Region-based Convolutional Neural Network), YOLO (You Only Look Once), SSD (Single Shot MultiBox Detector), EfficientNet (Efficient Network), RetinaNet (Retina Network).
Table 6. Challenges, Solutions, and Gaps in Agricultural Object Detection.
Challenge | Proposed Solution | Research Gap
Tiny Object Detection | Focal Loss [70] | Limited multi-scale feature fusion
Domain Shift | Domain adaptation [92] | Cross-regional dataset biases
Limited Labeled Data | Synthetic data generation [93] | Quality of synthetic annotations
Explainability | Grad-CAM, SHAP [140] | Real-time explanation tools
Lighting Variations | Multi-modal sensing [90] | Real-time fusion overhead
Real-Time Deployment | Model pruning [91] | Accuracy-efficiency trade-offs
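Focal loss, listed in Table 6 as the standard remedy for tiny and class-imbalanced targets, rescales the cross-entropy term as FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t) so that easy, well-classified examples contribute little to the gradient. The minimal implementation below follows that definition with the default alpha = 0.25 and gamma = 2.0 from [70]; torchvision.ops.sigmoid_focal_loss offers an equivalent, tested version.

```python
# Minimal sigmoid focal loss: FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t).
import torch
import torch.nn.functional as F

def sigmoid_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """logits: raw scores; targets: float tensor of 0/1 labels with the same shape."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)            # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    loss = alpha_t * (1 - p_t) ** gamma * ce               # down-weights easy negatives
    return loss.mean()
```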
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
