Review

Image and Point Cloud-Based Neural Network Models and Applications in Agricultural Nursery Plant Protection Tasks

School of Electrical and Information Engineering, Jiangsu University, Zhenjiang 212000, China
* Author to whom correspondence should be addressed.
Agronomy 2025, 15(9), 2147; https://doi.org/10.3390/agronomy15092147
Submission received: 25 July 2025 / Revised: 27 August 2025 / Accepted: 27 August 2025 / Published: 8 September 2025

Abstract

Nurseries represent a fundamental component of modern agricultural systems, specializing in the cultivation and management of diverse seedlings. Scientific cultivation methods significantly enhance seedling survival rates, while intelligent agricultural robots improve operational efficiency through autonomous plant protection. Central to these robotic systems, the perception system utilizes advanced neural networks to process environmental data from both images and point clouds, enabling precise feature extraction. This review systematically explores prevalent image-based models for classification, segmentation, and object detection tasks, alongside point cloud processing techniques employing multi-view, voxel-based, and original data approaches. The discussion extends to practical applications across six critical plant protection areas. Image-based neural network models can fully utilize the color information of objects, making them more suitable for tasks such as leaf disease detection and pest detection. In contrast, point cloud-based neural network models can take full advantage of the spatial information of objects, thus being more applicable to tasks like target information detection. By identifying current challenges and future research priorities, the analysis provides valuable insights for advancing agricultural robotics and precision plant protection technologies.

1. Introduction

Ornamental nurseries hold a significant position in the agricultural system. They are places specifically dedicated to the cultivation, plant protection, and management of various seedlings. These nurseries are mainly used for cultivating seedlings with ornamental value to beautify the urban environment. Through scientific planting and management measures, the survival rate of seedlings in nurseries can be greatly improved, and nurseries thereby play an important role in fields such as economic development and ecological protection [1,2]. In recent years, there has been a growing focus on the efficient management of agricultural nurseries [3]. To quantify the changing research interest in this field, literature surveys were conducted in the Elsevier, Institute of Electrical and Electronics Engineers (IEEE), and Multidisciplinary Digital Publishing Institute (MDPI) databases. Elsevier offers wide disciplinary coverage and high authority; IEEE is highly reputed in technical fields such as electrical and electronic engineering; and MDPI is known for its open-access model and timely updates. Together, these three databases provide comprehensive and up-to-date coverage. In the Elsevier database, the search term “agriculture nursery management seedling tree” was used, while in the IEEE and MDPI databases, “agriculture nursery” was adopted. With these search criteria, relevant literature published from 2015 to August 2025 was screened and statistically analyzed. The statistical results are shown in Figure 1a–c. According to Figure 1, a significant upward trend in the field of nursery management can be observed. By 2024, the number of relevant papers in Elsevier had exceeded 350 and continued to grow steadily, which reflects the increasing academic attention on agricultural nursery management research.
Seedlings are the core component of nursery environments. During the full-life-cycle management of seedlings, plant protection operations must be carried out dynamically and precisely according to their different growth stages to address the threat of pests and diseases and to meet the seedlings' regulation needs. Specifically, various spraying treatments are commonly used. Chemical fungicides (Mancozeb, ADAMA, Israel) are applied to prevent and treat fungal diseases, and chemical herbicides (Atrazine, Syngenta, Switzerland) are used to control weeds. Bio-stimulants (Kelp-based biostimulant, Acadian, Canada) enhance plant growth and stress resistance, while leaf fertilization products (Plantacote foliar fertilizer, COMPO, Germany) directly supply essential nutrients to foliage, facilitating the healthy development of seedlings. Building on the research papers retrieved from the Elsevier platform, the keyword “seedling protection” was further added. The proportion of article categories and the number of relevant papers in recent years are presented in Figure 2a,b. It is evident that this area of plant protection has become a hot topic in recent years. Currently, plant protection tasks in nurseries are still mainly carried out manually. However, as the nursery industry develops towards large-scale production, this traditional mode of operation has exposed many problems. On the one hand, as the planting area continuously expands, labor costs keep rising. On the other hand, the quality of manual operations is easily affected by differences in operators' individual experience, which may lead to problems such as poor uniformity of pesticide application, missed spraying, and repeated spraying, thus negatively affecting the growth quality of seedlings.
To address the issues existing in traditional plant protection tasks, agricultural plant protection spraying robots have emerged. This type of robot integrates modules such as environmental perception, autonomous navigation, and precise operation, effectively promoting the transformation of nursery management towards an intelligent direction [4,5,6,7,8,9,10,11]. During the spraying process, pesticides and fertilizers can easily pose hazards to the human body. For instance, skin contact may lead to allergies, and inhalation through the respiratory tract can damage the respiratory system. Employing robots for operations prevents personnel from being directly exposed to the hazardous environment, significantly reducing safety risks. Moreover, the application of intelligent robots can improve the efficiency of plant protection tasks. While saving labor costs, it can also reduce the amount of pesticide used. This not only alleviates the pressure of labor shortage to a certain extent and reduces production and operation costs, but also enhances the control of seedling quality through standardized operation processes. Currently, many high-performance agricultural robots have been applied in the plant protection field. For example, the FarmWise Titan weeding robot, introduced by FarmWise (San Francisco, CA, USA), is equipped with 12 high-resolution cameras in its perception system, which can obtain environmental information from multiple angles. This robot can accurately identify more than 30 common vegetable crops and over 100 types of weeds. In addition, multiple agricultural spraying robots from DJI Agras (Shenzhen, China), such as the AGRAS T20, MG-1P, and AGRAS T16, use RTK or GNSS for positioning. They are equipped with technologies such as automatic obstacle avoidance and dynamic liquid medicine control, enabling effective plant protection operations.
Agricultural plant protection spraying robots with autonomous operation capabilities generally rely on perception systems to collect environmental information in the nursery [12]. Only after accurately acquiring and effectively processing this environmental information can the agricultural robot generate reliable operation instructions through the control system, and then drive hardware devices such as the mobile chassis and spray system to carry out precise operations [13]. Therefore, designing an efficient perception system is a key research content for promoting the development of agricultural plant protection spray robots.
Traditional approaches to realizing perception tasks are mostly based on machine learning. For example, Ref. [14] proposes the DSWTS algorithm. After converting the input image into a grayscale image, it utilizes a magnitude gradient function and watershed techniques to obtain the edge information of humans, thereby enabling the tractor to recognize humans during operation. Taking the classification task as an example, such methods typically involve two crucial steps: feature extraction and classification. In classical machine learning methods, handcrafted features are extracted according to manually defined rules. Such features are fast to compute and do not require a large amount of data for model training, making these methods suitable for classification tasks with small-sample data. Common classifiers include Support Vector Machine (SVM) [15], Decision Tree [16], Random Forest [17], and so on, which have been widely used in agricultural scene operations [18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34]. Traditional methods have also been applied to agricultural nursery management tasks. In [35], Random Forest and SVM were used to analyze the leaf and canopy reflectance spectra of Scots pine seedlings, achieving non-destructive genetic evaluation and improving nursery management efficiency. In [36], methods such as least squares and the k-means++ clustering algorithm were employed to process morphological, spectral, and spatial information for detecting and locating conifer seedlings. In [37], the hyperspectral information of oil palm seedlings was first preprocessed using standard normal variate and first and second Savitzky–Golay derivatives. Then, principal components were extracted by Principal Component Analysis (PCA). Finally, SVM and other machine learning methods were used to determine the water stress level in the leaves of oil palm seedlings. However, these methods have significant limitations. Due to the limited number of manually extracted features, they are only applicable to simple classification tasks with low requirements for computing resources. When dealing with high-dimensional data, their efficiency decreases significantly, and thus their application scenarios are restricted.
In contrast, implementing perception tasks with neural network models is an end-to-end, autonomous learning approach with significant advantages. This approach enables models to directly extract multi-level features from raw data, thereby endowing them with stronger generalization ability and enabling them to adapt more effectively to the diverse needs of robots. Currently, images and point clouds are the commonly used input data types for neural network models. Neural network models based on images started to develop earlier and have now approached maturity. In recent years, neural network models based on point clouds have gradually become a popular research direction because they can fully explore and utilize spatial information. Relevant models have been widely applied to various tasks in agricultural scenarios, such as crop disease and pest identification [38,39,40,41,42], agricultural product quality grading [43,44,45,46,47,48], and so on.
This paper introduces neural network models based on images and point clouds, as well as their applications in plant protection tasks in agricultural environments. Section 2 systematically reviews image-based neural network models for different tasks, including classification, semantic segmentation, and object detection. Section 3 introduces point cloud-based neural network models, classifying them into multi-view-based methods, voxel- and mesh-based methods, and original point cloud-based methods according to how the point clouds are processed. Section 4 then presents the specific applications of these models in common agricultural robot plant protection tasks, such as pest detection, leaf disease detection, and weed detection. Finally, the challenges faced by research on plant protection tasks and future research directions are discussed.

2. Image-Based Neural Network Models

Images contain rich color information, which can provide crucial visual features for precision tasks. Through continuous training on images, relevant models can extract discriminative feature representations from the color information [49]. Meanwhile, color differences offer intuitive visual cues for recognizing object boundaries, shapes, and structures, helping to enhance the perception of objects. In addition, images have obvious advantages in data acquisition and processing. Data collection can be completed simply with conventional imaging devices, such as mobile phones and ordinary cameras. These characteristics make image-based models easier to train and relatively less demanding of computing resources. Therefore, neural network models based on images started to develop earlier and have been applied in the agricultural field [50,51,52].
Neural network models based on images can be divided into three categories according to task requirements: classification models, segmentation models, and object detection models. Among them, classification models are mainly used to determine the overall category of the input image. Segmentation models assign a category label to each pixel in the image, thereby dividing different objects or regions, accurately identifying the contours of each target in the image, and achieving fine-grained segmentation of the targets. Object detection models use two-dimensional bounding boxes to determine the specific positions of targets and obtain their category information. This section introduces some classic classification, segmentation, and object detection models.

2.1. Classification Models

Image classification models typically take a single image as input and output the category information represented by the entire image or the category of the dominant object in the image. Compared with image segmentation and object detection tasks, the requirements for hardware resources for this kind of model are relatively low, enabling them to operate efficiently on resource-constrained devices such as embedded systems and mobile devices [53]. The classification model plays a crucial role in agricultural nursery plant protection tasks. Take the weeding operation as an example. Precise identification of weed categories and corresponding targeted pesticide spraying can significantly improve the efficiency and effectiveness of agricultural plant protection, reduce unnecessary pesticide use, and minimize potential environmental impacts. Moreover, agricultural robots need to perform different operations based on the identified object categories. When the classification model detects the target, the robot will promptly initiate the spraying operation. When a non-target is detected, the robot will execute corresponding alternative actions. For instance, when a pedestrian is detected entering the operation area, the robot must immediately stop the current operation to ensure their safety. Thus, classification models are often used for quick and simple classification tasks like pest and leaf disease detection.

2.1.1. Visual Geometry Group Network (VGG)

VGG [54], proposed in 2014, is an influential Convolutional Neural Network (CNN) architecture that consistently employs 3 × 3 convolution kernels throughout its layers. Stacks of these small kernels maintain the same receptive field as larger ones while significantly reducing model parameters and complexity. Each convolutional layer is immediately followed by a ReLU activation function, introducing nonlinearity to enhance the network’s ability to learn complex patterns and improve classification performance. Additionally, VGG utilizes 2 × 2 max-pooling layers with a stride of 2 for downsampling feature maps, reducing computational load and improving efficiency.
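For illustration, a single VGG-style stage can be sketched as follows (a minimal PyTorch example; the layer widths follow the first VGG-16 stage but are illustrative only):

```python
import torch
import torch.nn as nn

def vgg_block(in_ch, out_ch, num_convs):
    """Stack of 3x3 convolutions + ReLU, followed by 2x2 max pooling with stride 2."""
    layers = []
    for _ in range(num_convs):
        layers.append(nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1))
        layers.append(nn.ReLU(inplace=True))
        in_ch = out_ch
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

block = vgg_block(3, 64, num_convs=2)      # analogous to the first VGG-16 stage
x = torch.randn(1, 3, 224, 224)
print(block(x).shape)                      # torch.Size([1, 64, 112, 112])
```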

2.1.2. GoogLeNet

GoogLeNet [55] is a classification CNN proposed in 2014. The core highlight of this model lies in the design and application of the Inception module. In the Inception module, parallel paths extract feature information at different scales, and their outputs are concatenated along the channel dimension, thereby capturing more diverse and richer image features. In addition, GoogLeNet replaces the traditional fully-connected layer with a global average pooling layer. This improvement not only significantly reduces the number of model parameters but also enhances the generalization ability of the model, effectively reducing the risk of overfitting.
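A simplified Inception module can be sketched as follows (PyTorch; the branch widths loosely follow GoogLeNet’s first Inception block, while the internal 1 × 1 reduction ratios, ReLU, and normalization layers of the full network are simplified or omitted):

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Simplified Inception block: parallel 1x1, 3x3, 5x5 and pooled branches
    whose outputs are concatenated along the channel dimension."""
    def __init__(self, in_ch, c1, c3, c5, cp):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, c1, kernel_size=1)
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, c3, kernel_size=1),
                                nn.Conv2d(c3, c3, kernel_size=3, padding=1))
        self.b5 = nn.Sequential(nn.Conv2d(in_ch, c5, kernel_size=1),
                                nn.Conv2d(c5, c5, kernel_size=5, padding=2))
        self.bp = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_ch, cp, kernel_size=1))

    def forward(self, x):
        # All branches preserve spatial size, so channel-wise concatenation is valid.
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.bp(x)], dim=1)

x = torch.randn(1, 192, 28, 28)
print(InceptionModule(192, 64, 128, 32, 32)(x).shape)  # torch.Size([1, 256, 28, 28])
```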
Building upon the original Inception module, Inception-v2 and Inception-v3 [56] significantly improve model performance through key innovations: Inception-v2 incorporates Batch Normalization to accelerate network convergence and enhance stability, while Inception-v3 employs convolutional decomposition (e.g., replacing 5 × 5 convolutions with two 3 × 3 convolutions or factorizing n × n into n × 1 and 1 × n) to boost computational efficiency and nonlinear representation. Inception-v4 [57] further introduces residual connections to mitigate gradient vanishing and improve generalization. Xception [58] revolutionizes the approach by separating convolution into depthwise and pointwise operations, dramatically reducing parameters and computational costs while maintaining feature extraction capability and accelerating training.

2.1.3. ResNet

ResNet [59], a groundbreaking deep learning architecture, fundamentally addresses the degradation problem in deep CNNs through its innovative residual learning framework. By introducing residual blocks with shortcut connections that directly add input to output, ResNet enables efficient gradient flow during backpropagation, effectively solving vanishing gradient issues. This design not only allows training of substantially deeper networks but also facilitates faster convergence with larger learning rates, significantly reducing training time and resource requirements. The residual block’s versatility has made it a foundational component in numerous subsequent models, profoundly influencing the development of deep learning across computer vision and beyond.
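The core idea can be sketched as a basic residual block (a minimal PyTorch example using only identity shortcuts; strided and projection variants are omitted):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual block: output = F(x) + x, with an identity shortcut connection."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)   # shortcut: add the input back before activation

x = torch.randn(1, 64, 56, 56)
print(ResidualBlock(64)(x).shape)   # torch.Size([1, 64, 56, 56])
```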

2.1.4. MobileNet Series

The MobileNet models have emerged as a pivotal family of lightweight computer vision models, revolutionizing efficient deep learning through progressive architectural innovations. MobileNet-v1 [60] pioneers depthwise separable convolutions combined with width and resolution multipliers to dynamically control model size and computation. MobileNet-v2 [61] enhances this foundation with linear bottlenecks and inverted residual structures that expand channels before the depthwise convolution and then compress them, optimizing feature representation while minimizing redundancy. The latest MobileNet-v3 [62] integrates Squeeze-and-Excitation (SE) attention modules [63] that adaptively weight channel features through learnable importance scoring, further refining the accuracy–efficiency trade-off. Together, these innovations establish MobileNet as a versatile framework for deploying high-performance vision models across resource-constrained devices.
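The depthwise separable convolution shared by the MobileNet family can be sketched as follows (a minimal PyTorch example; the expansion layer and linear bottleneck of MobileNet-v2 are omitted):

```python
import torch
import torch.nn as nn

def depthwise_separable(in_ch, out_ch, stride=1):
    """Depthwise 3x3 convolution (groups=in_ch) followed by a pointwise 1x1 convolution."""
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride, padding=1,
                  groups=in_ch, bias=False),                 # depthwise: one filter per channel
        nn.BatchNorm2d(in_ch),
        nn.ReLU6(inplace=True),
        nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),  # pointwise: mixes channels
        nn.BatchNorm2d(out_ch),
        nn.ReLU6(inplace=True),
    )

x = torch.randn(1, 32, 112, 112)
print(depthwise_separable(32, 64)(x).shape)   # torch.Size([1, 64, 112, 112])
```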

2.1.5. EfficientNet Series

EfficientNet [64] revolutionized model efficiency in 2019 through its innovative compound scaling method, which simultaneously optimizes network depth, width, and input resolution using a unified coefficient. This holistic approach outperforms traditional single-dimension scaling by capturing interdependencies between network dimensions, achieving optimal performance across varying computational constraints. The architecture further enhances efficiency by integrating MobileNet-v2’s inverted residual blocks and squeeze-excitation modules, delivering superior accuracy with reduced computational costs.
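The compound scaling rule can be illustrated with a short numerical sketch (Python; the coefficients follow those reported for the EfficientNet baseline, while the printed resolutions are rounded illustrations rather than the exact B1–B7 configurations):

```python
# Compound scaling: depth, width and input resolution grow jointly with one coefficient phi,
# subject (approximately) to alpha * beta^2 * gamma^2 ≈ 2.
alpha, beta, gamma = 1.2, 1.1, 1.15

def compound_scale(phi, base_resolution=224):
    depth_mult = alpha ** phi                      # multiplier for the number of layers
    width_mult = beta ** phi                       # multiplier for channel counts
    resolution = round(base_resolution * gamma ** phi)  # illustrative input size
    return depth_mult, width_mult, resolution

for phi in range(4):
    print(f"phi={phi}: depth x{compound_scale(phi)[0]:.2f}, "
          f"width x{compound_scale(phi)[1]:.2f}, resolution {compound_scale(phi)[2]}")
```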
It is worth noting that, in addition to demonstrating their application value in image classification tasks, the aforementioned models are also applicable to the feature extraction tasks in semantic segmentation and object detection models.

2.2. Segmentation Models

The segmentation model has powerful capabilities. It can accurately identify the category of each pixel of the image. This rich and detailed information is of great significance for conducting in-depth specific analyses. Take the leaf disease detection task as an example. Merely knowing the disease category fails to enable the identification of specific diseased regions on leaves, rendering it infeasible to implement precise pesticide application or other treatment measures. The segmentation model can accurately pinpoint the specific diseased areas on leaves [65,66]. Leveraging this information, agricultural practitioners can conduct targeted plant protection operations.

2.2.1. U-Net Series

The U-Net architecture [67] represents a seminal CNN design initially proposed for biomedical image segmentation. Characterized by its symmetric encoder–decoder structure, the model employs the encoder parts to capture contextual information through successive convolutional and downsampling operations, followed by the decoder parts that progressively recover spatial resolution via upsampling and convolution operations. A distinctive feature of U-Net is its skip connections, which concatenate multi-scale features from the encoder to corresponding decoder layers, thereby preserving fine-grained spatial details while integrating high-level semantic information. This hierarchical feature fusion enables precise localization and classification, addressing the challenge of information loss during downsampling. The network terminates with a 1 × 1 convolutional layer that generates pixel-wise predictions, producing segmentation masks with both accurate boundary delineation and contextual awareness.
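A two-level U-Net with a single skip connection can be sketched as follows (a minimal PyTorch example; the real network uses more levels and different channel widths):

```python
import torch
import torch.nn as nn

def double_conv(in_ch, out_ch):
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
                         nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True))

class TinyUNet(nn.Module):
    """Two-level encoder-decoder sketch with one skip connection and a 1x1 prediction head."""
    def __init__(self, in_ch=3, num_classes=2):
        super().__init__()
        self.enc = double_conv(in_ch, 64)
        self.down = nn.MaxPool2d(2)
        self.bottleneck = double_conv(64, 128)
        self.up = nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2)
        self.dec = double_conv(128, 64)                    # 128 = 64 (skip) + 64 (upsampled)
        self.head = nn.Conv2d(64, num_classes, kernel_size=1)  # pixel-wise prediction

    def forward(self, x):
        e = self.enc(x)
        b = self.bottleneck(self.down(e))
        d = self.dec(torch.cat([self.up(b), e], dim=1))    # skip connection: concat encoder features
        return self.head(d)

print(TinyUNet()(torch.randn(1, 3, 128, 128)).shape)       # torch.Size([1, 2, 128, 128])
```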

2.2.2. DeepLab Series

The DeepLab series of models have achieved remarkable results in semantic segmentation tasks. The various versions of models in this series have continuously evolved and innovated, gradually enhancing the performance of semantic segmentation.
The DeepLab series has significantly advanced semantic segmentation through progressive architectural improvements. DeepLab-v1 [68] introduces two fundamental architectural modifications: firstly, the substitution of pooling operations with dilated convolutions in the VGG-16 backbone to simultaneously preserve spatial details and expand receptive fields; secondly, the integration of a Conditional Random Field (CRF) to enhance segmentation boundary precision. DeepLab-v2 [69] enhances this framework by adopting ResNet-101 and developing the Atrous Spatial Pyramid Pooling (ASPP) module, which uses parallel dilated convolutions with different rates to capture multi-scale context. DeepLab-v3 [70] enhances the architecture by incorporating global context through image-level feature pooling. Building on this foundation, DeepLab-v3+ [71] further improves computational efficiency through two key modifications: employing an Xception encoder with depthwise separable convolutions to significantly reduce model parameters, while adopting a lightweight decoder architecture that strategically combines high-level semantic features with selected low-level features to achieve more precise boundary delineation.
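The ASPP idea can be sketched as parallel dilated convolutions fused by a 1 × 1 projection (a simplified PyTorch example; the 1 × 1 branch and image-level pooling branch of DeepLab-v3 are omitted, and the dilation rates are illustrative):

```python
import torch
import torch.nn as nn

class ASPP(nn.Module):
    """Atrous Spatial Pyramid Pooling sketch: parallel 3x3 convolutions with different
    dilation rates capture multi-scale context and are fused by a 1x1 projection."""
    def __init__(self, in_ch, out_ch, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=r, dilation=r)
            for r in rates
        ])
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, kernel_size=1)

    def forward(self, x):
        # Padding equal to the dilation rate keeps all branch outputs the same size.
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))

x = torch.randn(1, 256, 32, 32)
print(ASPP(256, 256)(x).shape)   # torch.Size([1, 256, 32, 32])
```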

2.2.3. SegFormer

SegFormer [72] pioneers the integration of Transformer architecture into semantic segmentation tasks through its innovative design. The encoder part of the model employs four hierarchical Transformer blocks, each featuring three core components: an overlapped patch merging module that enhances local feature preservation through overlapping patches, an efficient self-attention mechanism that optimizes computational efficiency via sequence reduction, and a Mix-FFN module incorporating positional encoding through zero-padded 3 × 3 convolutions. These blocks extract multi-scale features that are subsequently fused in a lightweight decoder. This unique architecture addresses two critical limitations of conventional approaches: the loss of local information in non-overlapping patch processing and the excessive computational demands of standard self-attention mechanisms, while simultaneously enabling effective multi-scale object segmentation.

2.2.4. SegNet

SegNet [73] also adopts an encoder–decoder as its main architecture. Different from U-Net, SegNet stores the indices of the max-pooling operation to preserve the positions of the maximum feature values in the encoder feature maps. During the decoding stage, this position information is utilized to restore features for all pixels of the original image. Since the position information is directly derived from the original input image, it can more accurately reflect the boundaries of objects. In the encoder part, SegNet selects VGG-16 as the main feature extraction module and discards the fully-connected layers, thereby reducing the number of parameters required by the model.
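The index-based unpooling mechanism can be illustrated directly with PyTorch’s pooling operators (a minimal sketch):

```python
import torch
import torch.nn as nn

# SegNet-style use of pooling indices: the encoder's max-pool positions are stored
# and reused by max-unpooling in the decoder to recover boundary-accurate locations.
pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)

features = torch.randn(1, 64, 64, 64)
pooled, indices = pool(features)        # indices record where each max value came from
restored = unpool(pooled, indices)      # max values return to their positions; others are zero
print(pooled.shape, restored.shape)     # [1, 64, 32, 32] [1, 64, 64, 64]
```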

2.3. Object Detection Models

The object detection model plays a unique role in the field of agricultural plant protection. It mainly utilizes bounding boxes to effectively obtain the approximate location information and corresponding categories of objects. In some plant protection tasks, extremely precise positioning and analysis of objects is not needed. Take the pest detection task as an example; its focus is usually on the locations of pests and their categories. Using the bounding-box outputs of an object detection model, the positions of pests in the image and their categories can be quickly and effectively determined [74,75,76]. Compared with performing complex semantic segmentation tasks, this kind of network is more efficient and direct, avoiding the cumbersome process of classifying each pixel in the image and greatly saving time and computational resources.
Object detection models can be broadly classified into two categories: two-stage object detection models and one-stage object detection models. Two-stage object detection models divide the object detection task into two sequential stages. First, these models generate a series of candidate regions, which are image areas where objects may potentially exist. Subsequently, classification and localization operations are performed on these candidate regions to determine the categories and precise positions of the objects. Representative two-stage object detection models include the R-CNN series. In contrast, one-stage object detection models do not need to generate candidate regions. Instead, they directly perform classification and localization predictions on the input image. This approach simplifies the detection process and improves detection speed. Representative one-stage object detection models include the You Only Look Once (YOLO) series and the Single Shot MultiBox Detector (SSD).

2.3.1. R-CNN Series

R-CNN [77] pioneers a novel paradigm in object detection by transforming the task into a region-based classification framework. The innovative pipeline begins with selective search to generate high-quality region proposals, leveraging low-level visual features like color and texture to reduce computational overhead compared to traditional sliding-window methods. These proposals are then processed through four critical stages: region warping to standardized dimensions, CNN-based feature extraction from a pre-trained network, SVM classification for category prediction, and finally bounding-box regression for precise localization. As the first successful integration of CNNs into object detection, R-CNN overcomes limitations of handcrafted-feature approaches by automatically learning discriminative representations, while selective search’s efficient proposal generation maintains computational tractability. This framework establishes foundational concepts that influence subsequent region-based detectors.
Fast R-CNN [78] significantly improves upon R-CNN’s computational inefficiency by implementing two key innovations. First, it adopts a more efficient processing pipeline that extracts features from the entire image before generating region proposals via selective search, eliminating redundant convolutional computations for individual regions. Second, it unifies the classification and regression tasks within a single network, optimizing both through a multi-task loss function that streamlines training and improves computational efficiency. However, the model remains constrained by the inherent limitations of selective search in proposal generation speed.
Faster R-CNN [79] addresses this bottleneck by replacing selective search with a learnable Region Proposal Network (RPN). The dual-branch architecture of the RPN simultaneously classifies object presence and predicts bounding-box offsets, enabling fully integrated end-to-end detection. This advancement not only accelerates proposal generation but also improves accuracy, while maintaining the computational benefits of the unified feature extraction approach. The resulting framework establishes a new standard for efficient, accurate object detection.
Mask R-CNN [80] extends the framework of Faster R-CNN through two significant architectural innovations. First, it introduces a parallel fully convolutional network branch that generates precise binary masks for instance segmentation, enabling detailed shape characterization alongside detection. Second, it replaces RoI Pooling with RoI Align, eliminating quantization artifacts through bilinear interpolation to maintain precise spatial correspondence between features and original image coordinates. These modifications specifically address the critical need for accurate pixel-level localization in instance segmentation tasks.
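In practice, pre-trained detectors of this family are readily available; a minimal inference sketch with torchvision is shown below (the `weights="DEFAULT"` argument assumes torchvision 0.13 or later, while older versions use `pretrained=True`; `maskrcnn_resnet50_fpn` additionally returns instance masks):

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Faster R-CNN with a ResNet-50 FPN backbone, pre-trained on COCO.
model = fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = torch.rand(3, 480, 640)                 # stand-in RGB image with values in [0, 1]
with torch.no_grad():
    predictions = model([image])                # list with one prediction dict per image

# Each dict contains boxes (x1, y1, x2, y2), integer class labels, and confidence scores.
print(predictions[0]["boxes"].shape)
print(predictions[0]["labels"][:5], predictions[0]["scores"][:5])
```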

2.3.2. YOLO Series

While the R-CNN series established the foundation for two-stage object detection, this paradigm suffers from inherent computational inefficiencies due to its sequential region proposal and refinement process. In contrast, single-stage detectors employ a more streamlined architecture that directly predicts both object categories and bounding box coordinates in a single forward pass, significantly improving inference speed and reducing computational overhead while maintaining competitive detection accuracy.
The YOLO series models represent a significant evolution in single-stage object detection architectures. The initial YOLOv1 [81], introduced in 2016, establishes the fundamental grid-based regression approach, processing detection as a unified regression problem through a single network pass. While innovative for its time, this architecture exhibits limitations in handling small or densely clustered objects due to its constrained predictions per grid cell. The subsequent YOLOv2 (YOLO9000) [82] addresses data limitations through its novel WordTree structure, enabling effective multi-category detection by combining ImageNet classification and COCO detection datasets through semantic grouping. YOLOv3 [83] marks a substantial improvement by introducing three key innovations: a multi-scale Feature Pyramid Network (FPN) for detecting objects across various sizes, replacement of softmax with logistic classifiers for multi-label classification, and more efficient backbone networks. These advancements significantly enhance detection performance while maintaining real-time capabilities. YOLOv4 [84] further refines the architecture through its systematic backbone–neck–head design: a CSPDarknet53 backbone with cross-stage partial connections reduces computational overhead while preserving feature quality; the neck component combines Spatial Pyramid Pooling (SPP) for multi-scale context aggregation with PANet’s bidirectional feature fusion; and the detection head maintains high precision through optimized prediction mechanisms.
Recent iterations, including YOLOv10, YOLOv11, and YOLOv12 [85,86,87], have continued this trajectory of innovation, focusing on architectural refinements and implementation optimizations within PyTorch frameworks. These models’ enduring impact lies in their balanced approach to accuracy and speed, with each generation addressing specific limitations while introducing novel solutions that have collectively advanced the state-of-the-art in real-time object detection.
The YOLO series models have become the benchmark in object detection by achieving an optimal balance between speed and accuracy. YOLO’s continuous innovations have significantly influenced both practical applications and research directions, making it one of the most widely adopted detection frameworks in the agricultural domain [88].
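A typical fine-tuning and inference workflow with a recent YOLO release can be sketched as follows (assuming the ultralytics Python package is installed; the dataset file `weeds.yaml` and image `nursery.jpg` are hypothetical placeholders):

```python
# pip install ultralytics
from ultralytics import YOLO

model = YOLO("yolov8n.pt")                             # lightweight pre-trained checkpoint
model.train(data="weeds.yaml", epochs=50, imgsz=640)   # fine-tune on a custom (placeholder) dataset

results = model("nursery.jpg")                         # inference on a single image
for box in results[0].boxes:
    print(box.cls, box.conf, box.xyxy)                 # class id, confidence, bounding box
```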

2.3.3. Single Shot MultiBox Detector (SSD)

SSD employs VGG-16 as its base model [89]. One of the significant innovations of this model is the implementation of multi-scale object detection on feature maps of different scales. Specifically, after the backbone network of the model, multiple convolutional layers of different scales are added to generate feature maps of various sizes, thereby enhancing the ability to detect objects of different sizes. In addition, the SSD model introduces the concept of default boxes. Multiple default boxes with different scales and aspect ratios are preset in each feature map cell, thus simplifying the process of object detection.

3. Point Cloud-Based Neural Network Models

Images contain rich two-dimensional information, mainly manifested as texture and color features. Due to this inherent property, such data are insufficient to directly and effectively represent the three-dimensional structure of objects. In contrast, point clouds are composed of points containing three-dimensional information, which can more accurately and intuitively present the geometric shape of objects and their spatial positional relationships. In addition, the pixel values of images are extremely susceptible to lighting conditions, which limits their performance in practical perceptual applications to a certain extent. On the contrary, point cloud data have strong robustness to lighting changes and can maintain relatively stable data features in different lighting environments. In view of the above advantages, neural network models related to point clouds have become a research hotspot in recent years.
Many point cloud-based neural network models have multiple branches to perform different tasks, such as classification and segmentation. Different from the image-based case, this section organizes the relevant models according to the way point clouds are processed. Point cloud-based neural network models can be divided into three categories according to their input data types: multi-view-based neural network models, voxel and mesh-based neural network models, and original point cloud-based neural network models. Each of these three categories is introduced below.

3.1. Multi-View-Based Neural Network Models

The development of image-based neural network models was initiated at an earlier stage. After an extended period of theoretical exploration and practical application, they have now reached a relatively mature state. If these well-developed models can be migrated to the field of point cloud feature extraction, their existing technical advantages can be fully exploited. Nevertheless, image data are essentially composed of pixels arranged in a specific order, whereas point clouds consist of a set of three-dimensional coordinate points lacking sequential relationships. This substantial difference at the data-structure level makes it difficult to apply conventional image-based neural network models directly to point cloud feature extraction.
To address the challenges posed by the disorderliness of point clouds, early research focused on converting point clouds into an ordered data type. Multi-view-based neural network models are one such solution. Specifically, researchers project point clouds onto multiple images from various angles or positions. They then leverage well-developed image-based neural network models to perform classification or segmentation tasks, after which the results from the multiple views are fused and back-projected onto the original point cloud data. Owing to the relatively mature development of image-based neural network models, the multi-view-based approach can yield relatively accurate results. The Multi-View Convolutional Neural Network (MVCNN) is a typical representative of such methods [90]. This method captures images of the model from multiple rotation angles, obtaining a series of pictures that ensure comprehensive coverage of information from all directions of the object. Subsequently, all the captured images are input into a pre-trained classical neural network model for feature extraction. In the feature fusion stage, MVCNN employs max-pooling fusion: for the feature vectors of all views, the maximum value in each dimension is selected as the value of the fused feature vector in that dimension. This fusion method concisely and efficiently integrates features from multiple perspectives while retaining the most prominent key features.
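The max-pooling view fusion at the heart of MVCNN can be sketched as follows (a minimal PyTorch example using a ResNet-18 backbone as the per-view feature extractor; the views here are random tensors standing in for rendered projections):

```python
import torch
import torchvision.models as models

# Element-wise max pooling across views, the fusion step used by MVCNN.
backbone = models.resnet18(weights="DEFAULT")
backbone.fc = torch.nn.Identity()               # keep the 512-d feature vector per view
backbone.eval()

views = torch.rand(12, 3, 224, 224)             # 12 stand-in views of a single 3D object
with torch.no_grad():
    per_view = backbone(views)                  # (12, 512) per-view descriptors
shape_descriptor, _ = per_view.max(dim=0)       # (512,) fused descriptor: max over views
print(shape_descriptor.shape)
```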
Recent research has significantly advanced beyond the standard MVCNN framework by developing sophisticated mechanisms to address its two key limitations: equal view weighting and simplistic feature fusion. Several innovative approaches have emerged to dynamically assess view importance. SimNet [91] introduces similarity metrics between views combined with an Adaptive Margin-based Triplet-Center Loss (AMTCL) to better evaluate view relationships and optimize feature learning. Attention-based methods like MVCNN-SA and the Capsule Attention Layer (CAL) [92] employ self-attention mechanisms to automatically learn optimal view weights during feature aggregation. Alternative architectures explore view associations through group learning modules (MLVACN [93]) or graph neural networks (View-GCN [94]). These approaches overcome the information loss inherent in traditional max or average pooling operations by implementing more nuanced feature fusion strategies. For instance, MLVACN incorporates a specialized weight fusion layer to combine features based on learned inter-view relationships, while View-GCN employs hierarchical graph learning to derive global shape descriptors. Collectively, these innovations demonstrate that adaptive view weighting and advanced feature fusion significantly enhance the discriminative power of shape descriptors in multi-view 3D recognition tasks.
Neural network models utilizing multi-view imagery exhibit several significant advantages. First, these models can capitalize on well-established image-based neural network architectures. The field of image processing has reached a high level of maturity, with extensive research and practical expertise available for direct application. By leveraging these proven technologies, multi-view point cloud neural networks substantially reduce both development time and computational costs. Second, through the acquisition or projection of images at multiple rotational angles, these models comprehensively capture object features from diverse perspectives. This multi-view approach ensures complete spatial coverage of the target object, enabling more robust and accurate feature representation compared to single-view methods.
Despite their advantages, multi-view-based point cloud neural networks present several inherent limitations. The effectiveness of this kind of model is highly dependent on careful viewpoint selection and projection methodology, as suboptimal configurations can lead to feature loss or geometric distortions that degrade performance. Furthermore, the requirement to process numerous projected views imposes substantial computational burdens during both training and inference, particularly for large-scale point cloud datasets. Most fundamentally, the projection process itself unavoidably discards certain three-dimensional structural information present in the original point cloud data, potentially compromising the ability to fully characterize the geometry of the object. These factors collectively constrain the applicability and accuracy of such approaches in practical 3D vision tasks.

3.2. Voxel and Mesh-Based Neural Network Models

Voxel-based neural network models provide an effective solution for processing irregular 3D point clouds by converting them into regular volumetric grids. This transformation enables direct application of standard 3D CNNs, bypassing the need for specialized architectures to handle unstructured data. The approach involves discretizing space into uniform voxels, where only non-empty voxels containing points are processed. Compared to multi-view methods, voxel representations better preserve 3D spatial relationships while maintaining computational efficiency. Similar mesh-based approaches offer additional advantages through explicit topological connections that can more accurately approximate surface geometries.
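The voxelization step itself can be sketched as follows (a minimal NumPy example that builds a binary occupancy grid; the voxel size and grid dimensions are illustrative):

```python
import numpy as np

def voxelize(points, voxel_size=0.05, grid_dims=(32, 32, 32)):
    """Convert an (N, 3) point cloud into a binary occupancy grid.
    Points are shifted to the grid origin and discretized by voxel_size."""
    grid = np.zeros(grid_dims, dtype=np.float32)
    shifted = points - points.min(axis=0)                  # move the cloud to the positive octant
    idx = np.floor(shifted / voxel_size).astype(int)
    inside = np.all(idx < np.array(grid_dims), axis=1)     # discard points outside the grid
    grid[tuple(idx[inside].T)] = 1.0                       # mark non-empty voxels as occupied
    return grid

points = np.random.rand(2048, 3)                           # synthetic point cloud in a unit cube
occupancy = voxelize(points)
print(occupancy.shape, int(occupancy.sum()), "occupied voxels")
```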
Voxel-based approaches have significantly advanced 3D point cloud processing, with VoxNet [95] pioneering this domain by transforming point clouds into volumetric occupancy grids and applying 3D convolutional operations. This framework introduces critical 3D extensions of 2D techniques, including max and average pooling operations for feature reduction and receptive field expansion, demonstrating efficient classification capabilities. Subsequent developments have expanded voxel-based applications in object detection. VoxelNet [96] establishes a three-component architecture comprising (1) a feature learning network with Voxel Feature Encoding (VFE) layers, (2) convolutional middle layers for feature integration, and (3) a region proposal network for final regression. Alternative approaches include VoxelNeXt’s [97] sparse voxel feature utilization and 3D-FCN’s [98] adaptation of 2D fully convolutional networks to 3D space. Various optimization strategies have emerged, such as 3D ShapeNets’ [99] data representation, octree-based partitioning [100,101] for efficient computation, and K-d tree transformations [102] for parameter sharing. Recent innovations like end-to-end frameworks [103] combining voxelization with 3D CNNs and sparse convolutions [104] have further enhanced performance.
In contrast to voxel methods, mesh-based approaches like MeshCNN [105] focus on topological structures through specialized convolution operations that aggregate vertex adjacency information, pooling layers that preserve mesh topology during downsampling, and fully-connected layers for final task outputs. These methods demonstrate particular strengths in analyzing 3D object structures.
Point cloud data, characterized by its disordered and irregular spatial distribution, can be transformed into more structured representations such as voxels or meshes to facilitate processing with 3D convolutional neural networks. Both voxel and mesh representations effectively preserve partial spatial information from the original point clouds, enabling models to extract features more efficiently. The structured nature of these representations allows for direct application of standard 3D deep learning techniques while maintaining critical geometric relationships within the data.
Despite their advantages, voxel and mesh conversions present several shared limitations. First, both methods require substantial computational resources, particularly when high resolutions or densities are needed to maintain accuracy, resulting in significant memory and processing demands. Second, the conversion processes inevitably lead to information loss: voxelization discretizes continuous space into finite units, while mesh generation often simplifies complex geometries. Finally, these methods involve complex preprocessing pipelines, including parameter tuning for voxelization and multi-step procedures for mesh creation, which may introduce additional errors and require careful optimization for different datasets.

3.3. Original Point Cloud-Based Neural Network Models

The conversion of point cloud data to alternative representations often results in the loss of essential spatial characteristics that fundamentally distinguish 3D point clouds. Direct utilization of original point clouds as input offers significant advantages by preserving the complete spatial information while eliminating preprocessing overhead. This approach not only maintains the intrinsic geometric relationships within the data but also reduces computational resource requirements and processing time. Consequently, original point cloud-based methods have emerged as a prominent research focus in recent 3D computer vision studies.
PointNet [106] pioneers direct point cloud processing by employing dual T-Net modules to address point order invariance through coordinate and feature alignment, followed by MLP-based feature extraction. While effective for global feature learning through max pooling, this architecture exhibits limitations in local feature capture. Subsequent improvements lead to PointNet++ [107], which introduces a hierarchical structure with set abstraction layers using ball query for local region sampling and relative coordinate transformation, enabling progressive multi-scale feature learning from local to global contexts.
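The permutation-invariant core of PointNet can be sketched as follows (a minimal PyTorch example; the T-Net alignment modules and segmentation branch of the full model are omitted):

```python
import torch
import torch.nn as nn

class TinyPointNet(nn.Module):
    """PointNet-style classifier sketch: a shared per-point MLP (1x1 Conv1d layers) followed
    by a symmetric max pooling that makes the global feature invariant to point order."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.ReLU(inplace=True),
            nn.Conv1d(64, 128, 1), nn.ReLU(inplace=True),
            nn.Conv1d(128, 1024, 1),
        )
        self.head = nn.Sequential(nn.Linear(1024, 256), nn.ReLU(inplace=True),
                                  nn.Linear(256, num_classes))

    def forward(self, xyz):                     # xyz: (B, 3, N) point coordinates
        per_point = self.mlp(xyz)               # (B, 1024, N) per-point features
        global_feat = per_point.max(dim=2)[0]   # (B, 1024) order-invariant global feature
        return self.head(global_feat)

print(TinyPointNet()(torch.randn(2, 3, 1024)).shape)   # torch.Size([2, 10])
```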
Alternative approaches have further advanced local feature extraction. DGCNN [108] utilizes dynamic graph construction with edge weights based on spatial or feature distances, enabling adaptive local feature aggregation through graph convolutions. PointWeb [109] enhances this through inter-point feature correlation modeling within local neighborhoods, while PointSIFT [110] incorporates directional encoding from eight orientations to better capture geometric structures. Recent innovations include ASSANet’s [111] efficient separable abstraction with dual attention mechanisms, PointNeXt’s [112] scale-invariant learning through standardized radii and data augmentation, and RandLA-Net’s [113] efficient large-scale processing combining random sampling with attentive feature pooling. These developments collectively address the fundamental challenges of point cloud analysis while maintaining computational efficiency.
The Transformer architecture, originally developed for natural language processing, has been effectively adapted to point cloud analysis through its self-attention mechanism, which captures both global dependencies and local geometric relationships. The Point Transformer series exemplifies this adaptation: Point Transformer-v1 [114] establishes the basic framework with self-attention blocks and residual connections; Point Transformer-v2 [115] enhances spatial awareness through grouped vector attention and advanced position encoding; and Point Transformer-v3 [116] improves efficiency via point cloud serialization for neighborhood queries and optimized attention mechanisms. Together, these developments demonstrate how Transformers can effectively process 3D point clouds while balancing feature learning and computational efficiency.
Methods that directly process original point clouds offer significant advantages by preserving the complete three-dimensional geometric information without undergoing structural transformations or complex preprocessing. These approaches eliminate the need for parameter-dependent conversions such as projection angle selection or voxel size determination, allowing point clouds to serve as direct model inputs while maintaining their intrinsic spatial characteristics.
Nevertheless, these methods face some fundamental challenges. First, the inherent disorder of point clouds fundamentally differs from the regular grid structure of 2D image data, preventing direct application of established computer vision models and necessitating specialized algorithmic designs. Second, the potential sparsity of point distributions in certain regions may adversely affect processing outcomes, presenting additional computational challenges that require specific attention during model development.

4. Image and Point Cloud-Based Neural Network Models for Plant Protection

Neural networks utilizing image and point cloud data have demonstrated considerable potential for plant protection tasks. This section systematically reviews and categorizes six critical plant protection tasks essential for agricultural robots in nurseries and similar environments: (1) leaf disease detection, (2) pest identification, (3) weed recognition, (4) target and non-target object detection, (5) seedling information monitoring, and (6) spray drift assessment. Typically, classification neural network models serve the purpose of categorizing leaf diseases, pests, weeds, and detected objects. For instance, in the context of leaf disease detection, these models are capable of differentiating common plant leaf diseases, including powdery mildew and rust. Segmentation models are utilized to perform a detailed partitioning of the plant protection area for subsequent analysis. For example, in leaf disease detection, these models can accurately delineate the diseased regions on leaves, which facilitates the assessment of disease severity. In spray operations, they can precisely define the target areas, such as the regions where weeds require spraying. Object detection models are applied to acquire the location information necessary for plant protection operations. These models can pinpoint the exact positions of pests on plants, enabling targeted spraying, and can clearly identify the locations of target and non-target objects, thereby realizing precise spraying operations.
Some of the models and related applications mentioned in this section originate from general agricultural scenarios (including farmland fields and orchards). However, they can be effectively applied to agricultural nursery scenarios. This is achieved through reasonable technical means, such as transfer learning, and by leveraging the common characteristics across different scenarios. For example, trees often exhibit common symptoms when diseases and pests occur during their growth cycles. Transfer learning can reuse the model parameters trained in general agricultural scenarios and fine-tune them with a small amount of data collected from specific nursery environments. This enables models to quickly adapt to the nursery environment, saving training costs and time. Therefore, this paper provides corresponding examples to offer relevant guidance to readers.
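A typical transfer-learning recipe of this kind can be sketched as follows (a minimal PyTorch example; the class count, frozen-backbone strategy, and stand-in batch are illustrative assumptions rather than settings from any cited study):

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Reuse pre-trained weights and adapt only the classifier head to a small nursery dataset.
num_nursery_classes = 5
model = models.mobilenet_v3_small(weights="DEFAULT")
for param in model.parameters():
    param.requires_grad = False                                   # freeze the backbone
model.classifier[-1] = nn.Linear(model.classifier[-1].in_features, num_nursery_classes)

optimizer = torch.optim.Adam(model.classifier[-1].parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# Stand-in batch; in practice this comes from a nursery image DataLoader.
images, labels = torch.rand(8, 3, 224, 224), torch.randint(0, num_nursery_classes, (8,))
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
print(float(loss))
```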

4.1. Leaf Disease Detection

The accurate identification of leaf health status and pathological symptoms is essential for modern plant protection systems. This capability enables precision resource management, allowing for optimized irrigation, fertilization, and pesticide application based on specific needs. Such targeted interventions significantly improve resource-use efficiency while minimizing environmental impact. In current agricultural practice, computer vision systems incorporating deep neural network models have emerged as the predominant modality for automated foliar diagnosis, demonstrating both operational reliability and diagnostic accuracy in field deployment scenarios.
Recent research in leaf disease classification has demonstrated significant progress through various deep learning approaches. By accurately identifying the different diseases afflicting leaves, practitioners can tailor suitable pesticides to each specific disease, thereby significantly enhancing the efficiency of plant protection (Figure 3). Numerous studies have explored the potential of VGG architectures, with [117] successfully applying VGG-16 to the PlantVillage dataset, while [118] achieves a 7% accuracy improvement by combining VGG-16 with Inception-v2 for enhanced feature extraction. Further VGG-based innovations include the Hydra model [119], which modifies activation patterns to achieve 95% accuracy, and transfer learning applications of VGG-19 for specific disease detection [120]. Beyond VGG networks, researchers have developed sophisticated hybrid architectures, such as SwinGNet [121], which merges Swin Transformer with GoogLeNet, and a GoogLeNet-ResNet combination [122] for superior feature representation. For practical deployment scenarios, lightweight MobileNet variants have been particularly successful, including SE-enhanced MobileNet-v3 [123] and attention-augmented MobileNet-v2 (achieving >98% accuracy) [124]. Addressing the critical challenge of class imbalance, Ref. [125] combines EfficientNet-B5 with SMOTE sampling to achieve exceptional 99.22% accuracy in citrus disease classification. These collective advances demonstrate both the versatility of deep learning approaches and their continued potential for agricultural applications.
Beyond leaf disease classification, precise disease detection plays a crucial role in enabling targeted treatment and accurate pesticide application through the identification of infected leaf regions, as shown in Figure 4. Several advanced approaches have been developed to address this challenge. Ref. [65] presents an enhanced multi-scale dilated feature fusion segmentation network that demonstrates effective segmentation of diseased leaf areas. Further advancing this field, Ref. [66] introduces ALDNet, a novel two-stage architecture specifically designed for sequential leaf and lesion segmentation. This innovative system employs two dedicated subnetworks: PBGNet for leaf segmentation and PDFNet for subsequent lesion detection. In another significant contribution, Ref. [126] proposes AS-DeepLab-v3+, an improved variant of DeepLab-v3+ optimized for leaf lesion segmentation. This enhanced model incorporates MobileNet-v2 as its backbone network while integrating multiple attention mechanisms (Coordinate, ECA, CBAM, and Triplet attention modules) to boost segmentation accuracy. Additionally, the model features a dynamic ASPP module to effectively handle lesions at varying scales, demonstrating robust performance across different infection patterns.

4.2. Pest Identification

Pest detection plays a critical role in plant protection tasks. For example, in nursery environments, seedlings are particularly vulnerable to infestations due to their underdeveloped defense mechanisms. Timely and accurate pest identification enables the prompt implementation of targeted protection measures, thereby mitigating potential damage and ensuring the healthy development of seedlings.
Accurate localization of pest infestations is critically important for effective plant protection measures (Figure 5). Target detection models have emerged as valuable tools in this context. The ODP-Transformer model proposed in [74] incorporates a backbone network based on the Faster R-CNN framework, utilizing a parts sequence encoder, description decoder, and classification decoder to simultaneously output pest categories, body parts, and the descriptive information. In another development, Ref. [75] enhances the YOLO-v3 model through an Adaptive Energy-based Harris Hawks Optimization (AE-HHO) algorithm, combining ResNet-50 and VGG-16 for feature extraction after target region identification to achieve accurate pest classification. Furthermore, Ref. [76] introduces the YOLOCSP-PEST model, an improved variant of YOLO-v7 that employs a cross stage partial network as its base architecture. When trained on the comprehensive IP102 dataset encompassing 102 pest categories, this modified model achieves a mean Average Precision (mAP) exceeding 88%, demonstrating its robust detection capabilities.
Semantic segmentation enables precise pest localization through pixel-level image analysis, allowing for targeted pesticide application that optimizes spray coverage and dosage based on infestation patterns (Figure 6). This approach minimizes chemical use while improving control efficacy. The models also provide detailed pest morphology and distribution data to inform customized management strategies. Ref. [127] develops TinySegformer, a Transformer-based model addressing computational inefficiencies in traditional attention mechanisms. Conventional approaches require exhaustive pairwise element relationship computation, whereas TinySegformer’s lightweight self-attention module strategically limits interacting elements, substantially reducing computational overhead. Ref. [128] enhances the U-Net framework by incorporating Multi-Scale Feature Fusion (MSFF) and Multi-Scale Dilated Attention (MSDA) mechanisms. These modifications enable more comprehensive extraction of multi-scale semantic features, resulting in a remarkable Dice score improvement from 82.35% (baseline U-Net) to 93.12%. This 10.77 percentage point enhancement conclusively demonstrates the efficacy of these architectural innovations for segmentation tasks.
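For reference, the Dice score used in such comparisons can be computed as follows (a minimal PyTorch sketch for binary masks):

```python
import torch

def dice_score(pred_mask, true_mask, eps=1e-6):
    """Dice coefficient between two binary segmentation masks:
    2 * |intersection| / (|prediction| + |ground truth|)."""
    pred = pred_mask.float().flatten()
    true = true_mask.float().flatten()
    intersection = (pred * true).sum()
    return (2 * intersection + eps) / (pred.sum() + true.sum() + eps)

pred = torch.randint(0, 2, (1, 256, 256))   # stand-in predicted mask
true = torch.randint(0, 2, (1, 256, 256))   # stand-in ground-truth mask
print(float(dice_score(pred, true)))
```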

4.3. Weed Recognition

In agricultural production systems, the proliferation of weeds presents significant challenges, particularly during the seedling stage where they can severely inhibit crop development. Targeted herbicide application has thus become an essential practice to ensure optimal agricultural productivity. Within this framework, the accurate detection and localization of weeds constitutes a critical prerequisite for precision spraying operations.
Precision herbicide selection based on weed species identification plays a crucial role in optimizing agricultural weed control efficiency. Recent studies have demonstrated the effectiveness of deep learning approaches for accurate weed classification. In [129], the researchers develop a specialized weed dataset using drone-captured imagery and evaluate three benchmark models (VGG-16, ResNet-50, and Xception). The results reveal that both ResNet-50 and Xception achieve exceptional classification accuracy (>97%), indicating their strong potential for weed identification tasks. Ref. [130] utilizes the public DeepWeeds dataset to validate Inception-v3 and ResNet-50, with both models consistently exceeding 95% accuracy, further confirming their reliability for weed classification applications. Complementary work in [131] involving 12,443 annotated images demonstrates VGG-16's superior performance for real-time weed detection compared to the AlexNet and GoogLeNet architectures.
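A typical way to apply such pretrained backbones to a new weed dataset is to replace the classification head and fine-tune, as in the minimal transfer learning sketch below; the class count (9, as in DeepWeeds) and the single dummy training step are assumptions for illustration only, not the training protocols of Refs. [129,130,131].

```python
# Hedged sketch: fine-tuning an ImageNet-pretrained ResNet-50 for weed classification.
import torch
import torch.nn as nn
from torchvision import models

NUM_WEED_CLASSES = 9   # assumed; the DeepWeeds dataset defines 9 categories

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
for p in model.parameters():           # freeze the pretrained feature extractor
    p.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, NUM_WEED_CLASSES)   # new classifier head

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One dummy training step with random tensors standing in for weed images/labels.
images = torch.rand(8, 3, 224, 224)
labels = torch.randint(0, NUM_WEED_CLASSES, (8,))
optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
print(f"dummy training loss: {loss.item():.4f}")
```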
In precision plant protection systems, effective weed management requires not only species-specific herbicide selection but also accurate spatial targeting of application sites. The visual similarity between weeds and crop seedlings in color and morphology presents significant challenges for conventional classification approaches, necessitating more sophisticated detection methods. Consequently, research has increasingly focused on general-purpose object detection architectures, with YOLO-series models emerging as the predominant choice for high-precision weed detection tasks [132,133]. These models are particularly effective at distinguishing morphologically similar vegetation while maintaining the real-time operational capabilities essential for field applications. The HAD-YOLO framework [134] builds upon YOLO-v5 by incorporating an HGNetV2 backbone network. This architecture integrates two key components: the SSFF module for multi-scale feature fusion and the TFE module for advanced feature representation. Additionally, it employs a multi-attention detection head to significantly improve localization precision. Subsequent developments include HLBODL-WDSA [135], which modifies the YOLO-v5 architecture, and YOLO-Spot [136], which optimizes YOLO-v7-tiny through streamlined convolutional layers and feature map reduction. YOLO-CWD [137] integrates ECA and spatial attention modules into the C2f block of YOLO-v8n, demonstrating robust performance under varying lighting conditions on the CropAndWeed dataset [138]. Further innovations include PMDNet [139], which features a PKINet backbone with context anchor attention for long-range dependency capture and MSFPN for feature fusion, and WeedsSORT [140], which combines YOLO-v11 with triplet attention and enhances tracking through an optimized SuperPoint detector with multi-layer decoding. Together, these models advance precision agriculture by balancing accuracy and computational efficiency.
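Several of the detectors above insert channel attention into their backbones or necks; the sketch below implements a generic Efficient Channel Attention (ECA) block of the kind used by YOLO-CWD [137]. The fixed kernel size and the placement of the block are assumptions; the original ECA formulation derives the kernel size from the channel count.

```python
# Hedged sketch: a generic ECA channel-attention block (kernel size assumed fixed).
import torch
import torch.nn as nn

class ECA(nn.Module):
    def __init__(self, kernel_size: int = 3):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)            # global context per channel
        self.conv = nn.Conv1d(1, 1, kernel_size, padding=kernel_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature map from a convolutional stage
        y = self.pool(x)                               # (B, C, 1, 1)
        y = y.squeeze(-1).transpose(1, 2)              # (B, 1, C): channels as a 1D signal
        y = self.conv(y)                               # local cross-channel interaction
        y = self.sigmoid(y).transpose(1, 2).unsqueeze(-1)  # (B, C, 1, 1) channel weights
        return x * y                                   # reweight the input features

feat = torch.rand(2, 64, 40, 40)                       # dummy backbone feature map
print(ECA()(feat).shape)                               # torch.Size([2, 64, 40, 40])
```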
Beyond YOLO models, multiple detection architectures effectively address weed identification. Ref. [141] optimizes models for edge devices with NPUs, specifically examining accuracy–speed tradeoffs. The research evaluated various backbones and neck networks, demonstrating that architectural choices significantly impact detection performance.
Semantic segmentation models significantly enhance robotic weed control by precisely delineating infested areas. These models enable robots to optimize their navigation paths and operational boundaries, ensuring complete weed removal while preventing crop damage. This capability is particularly valuable for precision agriculture, where accurate spatial identification directly improves weeding efficiency and crop safety. Several studies have advanced weed segmentation through U-Net architectural modifications. Ref. [142] evaluates three pre-trained backbone networks (VGG-19, MobileNetV2, and InceptionResNetV2) within the U-Net framework; InceptionResNetV2 achieves the best performance at 96.79% accuracy, while all models exceed 90% segmentation precision. Ref. [143] adapts the Random Image Cropping and Patching (RICAP) method, originally developed for classification tasks, to semantic segmentation data augmentation, demonstrating measurable improvements in weed-target differentiation. Ref. [144] develops DWUNet, a U-Net variant incorporating YOLOv8's C2f module with depthwise convolutions in the encoder. This architecture maintains high accuracy while achieving exceptional computational efficiency, enabling real-time weed segmentation.
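The RICAP adaptation of Ref. [143] patches crops from several images into one training sample while patching the masks identically; the sketch below reproduces that idea under simple assumptions (four equal-sized inputs, beta-distributed boundary), and is not the exact recipe of the cited study.

```python
# Hedged sketch: RICAP-style augmentation adapted to segmentation (toy data).
import numpy as np

def ricap_segmentation(images, masks, beta: float = 0.3, rng=None):
    """images: list of 4 HxWx3 arrays; masks: list of 4 HxW arrays of the same size."""
    rng = rng or np.random.default_rng()
    h, w = images[0].shape[:2]
    cut_w = int(w * rng.beta(beta, beta))              # random patch boundary (x)
    cut_h = int(h * rng.beta(beta, beta))              # random patch boundary (y)
    widths = [cut_w, w - cut_w, cut_w, w - cut_w]
    heights = [cut_h, cut_h, h - cut_h, h - cut_h]
    offsets = [(0, 0), (0, cut_w), (cut_h, 0), (cut_h, cut_w)]

    out_img = np.zeros_like(images[0])
    out_mask = np.zeros_like(masks[0])
    for img, msk, ww, hh, (oy, ox) in zip(images, masks, widths, heights, offsets):
        if ww == 0 or hh == 0:
            continue
        x0 = rng.integers(0, w - ww + 1)               # random crop position in the source
        y0 = rng.integers(0, h - hh + 1)
        out_img[oy:oy + hh, ox:ox + ww] = img[y0:y0 + hh, x0:x0 + ww]
        out_mask[oy:oy + hh, ox:ox + ww] = msk[y0:y0 + hh, x0:x0 + ww]  # masks patched identically
    return out_img, out_mask

imgs = [np.random.randint(0, 255, (256, 256, 3), dtype=np.uint8) for _ in range(4)]
msks = [np.random.randint(0, 2, (256, 256), dtype=np.uint8) for _ in range(4)]
patched_img, patched_mask = ricap_segmentation(imgs, msks)
print(patched_img.shape, patched_mask.shape)
```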

4.4. Target and Non-Target Object Detection

Precise object detection is essential for agricultural robots to perform safe and efficient spraying operations in agricultural environments such as nurseries or orchards. By accurately identifying both target objects (such as seedlings) and non-target objects (such as humans, infrastructure, and stones), these systems enable real-time collision avoidance, optimized path planning, and other safety-critical operations (Figure 7). This capability not only prevents accidents and equipment damage but also enhances operational efficiency across varying field conditions.
Recent research has made significant progress in obstacle detection for agricultural robotics through advanced image-based models. Ref. [145] develops UCIW-YOLO, an enhanced YOLOv5-based model incorporating a Universal Inverted Bottleneck (UIB) module and Coordinate Attention (CA) mechanism, demonstrating robust multi-class detection capability for trees, humans, and utility poles. In vineyard environments, Ref. [146] proposes the Parallel RepConv Network (PRCN), which integrates Parallel RepConv (PRC) blocks with residual connections for improved feature extraction, along with a novel TriangleNet architecture for multi-scale feature fusion, achieving effective identification of grapevines, stones, and people. For orchard applications, Ref. [147] optimizes YOLOv3 through MobileNetV2 integration, creating a lightweight solution that surpasses both Faster R-CNN and SSD in both accuracy and inference speed for typical obstacles including workers, concrete posts, and electrical infrastructure.
For deployment on existing hardware platforms, PointNet and PointNet++ have emerged as widely adopted point cloud-based neural networks for target detection due to their high efficiency. Ref. [148] develops a lightweight PointNet variant incorporating residual modules to reduce model fitting complexity and pruning algorithms to eliminate redundant parameters. This optimized architecture demonstrates dual improvements in classification accuracy and inference speed, fulfilling real-time operational requirements. Ref. [149] enhances PointNet++ by integrating a CBAM module to refine features in both channel and spatial dimensions, enabling precise six-category classification of nursery objects, including seedlings and pedestrians. Ref. [150] voxelizes point clouds before processing them through a Focal Voxel R-CNN model. This architecture improves 3D feature extraction through focal sparse convolutional layers, offering another viable solution for complex agricultural environments.
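To illustrate why raw point cloud models suit onboard deployment, the sketch below implements a stripped-down PointNet-style classifier: shared per-point MLPs followed by an order-invariant max pooling and a small head. The T-Net alignment, residual modules, pruning, and CBAM additions of Refs. [148,149] are omitted, and the six-class layout is an assumption echoing the nursery categories mentioned above.

```python
# Hedged sketch: a minimal PointNet-style classifier for nursery point clouds.
import torch
import torch.nn as nn

class TinyPointNet(nn.Module):
    def __init__(self, num_classes: int = 6):          # assumed: seedling, pedestrian, pole, ...
        super().__init__()
        self.point_mlp = nn.Sequential(                 # shared MLP applied to every point
            nn.Conv1d(3, 64, 1), nn.BatchNorm1d(64), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.BatchNorm1d(128), nn.ReLU(),
            nn.Conv1d(128, 256, 1), nn.BatchNorm1d(256), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, xyz: torch.Tensor) -> torch.Tensor:
        # xyz: (B, N, 3) raw point coordinates
        feat = self.point_mlp(xyz.transpose(1, 2))      # (B, 256, N) per-point features
        global_feat = feat.max(dim=2).values            # symmetric (order-invariant) pooling
        return self.head(global_feat)                   # (B, num_classes) logits

cloud = torch.rand(4, 1024, 3)                          # four dummy clouds of 1024 points
print(TinyPointNet()(cloud).shape)                      # torch.Size([4, 6])
```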
Figure 7. Detection results for different kinds of non-target objects under different degrees of occlusion. Despite the occlusion, the model can still accurately identify non-target objects in the orchard scene, such as people and support poles. It not only locates non-target objects with bounding boxes but also provides their category information. Based on the location and category information, agricultural robots can be precisely guided to perform different operations. (Reprinted with permission from Ref. [147]. Copyright 2021 Elsevier.)

4.5. Seedling Information Monitoring

In nurseries or orchards, seedlings are the main focus of plant protection. The accurate acquisition of seedling information plays a pivotal role in enabling intelligent agricultural robotics. For example, precise measurements of canopy positions allow robotics to intelligently determine optimal spray trajectories based on distribution patterns while automatically adjusting dosage according to calculated canopy volumes. This targeted approach not only ensures effective pesticide deposition but also significantly reduces input waste, demonstrating particular importance in nursery plant protection operations.
Accurate classification of seedling species is fundamental for enabling precision plant protection operations. Recent advances in deep learning have demonstrated remarkable success in this domain. Ref. [151] achieves effective classification of five tropical tree species (coconut, coconut intercropping, durian, pomelo, and rambutan) in satellite imagery using optimized convolutional and fully-connected modules. Further progress is shown in [152], where pre-trained architectures (VGG-16, VGG-19, InceptionV3, and Xception) successfully identified Ethiopian indigenous medicinal plants. Notably, the hybrid 3D-2DCNN-CA model proposed in [153] combines three-dimensional spatial feature extraction with two-dimensional spectral analysis, achieving exceptional 98.44% average accuracy in tree species classification.
Beyond seedling classification, canopy characterization holds equal importance for precision spraying operations (Figure 8, Figure 9 and Figure 10), given that plant protection primarily targets foliar surfaces. Recent image-based studies have demonstrated significant progress in this domain. For example, Ref. [154] successfully achieves tobacco plant localization and leaf segmentation using YOLOv5 and YOLOv6 models, respectively. Ref. [155] incorporates dilated convolutions and GELU activation functions into YOLOv8's backbone network, yielding exceptional canopy segmentation performance on custom nursery datasets. The YOLOv9-based approach in [156] integrates FasterNet modules (featuring PConv and PWConv layers) with Transformer-derived iRMB blocks for enhanced nursery canopy detection. Alternative architectures have also proven effective: Ref. [157] utilizes Mask R-CNN for canopy parameter extraction (including diameter measurements), while [158] develops an improved DeepLabv3+ variant combining MobileNet-v2, ASPP modules, and Shuffle Attention mechanisms to achieve precise banana tree canopy segmentation. These diverse methodological innovations collectively advance the technical capacity for canopy-aware precision spraying systems.
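Once a canopy mask is available from any of these segmentation models, simple geometric parameters can be derived for spray dosage planning; the sketch below computes the projected canopy area and an equivalent diameter from a binary mask, with the pixel-to-metre scale an assumed calibration value rather than a figure from the cited works.

```python
# Hedged sketch: canopy area and equivalent diameter from a binary canopy mask.
import numpy as np

GSD_M_PER_PIXEL = 0.005                 # assumed ground-sample distance: 5 mm per pixel

def canopy_parameters(mask: np.ndarray):
    """mask: HxW binary array, 1 = canopy pixel."""
    area_px = int(mask.sum())
    area_m2 = area_px * GSD_M_PER_PIXEL ** 2
    eq_diameter_m = 2.0 * np.sqrt(area_m2 / np.pi)   # circle with the same projected area
    return area_m2, eq_diameter_m

# Synthetic round canopy mask standing in for a segmentation result.
mask = np.zeros((400, 400), dtype=np.uint8)
yy, xx = np.ogrid[:400, :400]
mask[(yy - 200) ** 2 + (xx - 200) ** 2 <= 120 ** 2] = 1
area, diameter = canopy_parameters(mask)
print(f"canopy area = {area:.3f} m^2, equivalent diameter = {diameter:.3f} m")
```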
Point cloud data analysis has emerged as a powerful tool for addressing key agricultural challenges, particularly in seedling species classification and canopy characterization. Ref. [159] enhances the PointNet architecture through the integration of channel and spatial attention mechanisms, achieving 74.69% classification accuracy across four seedling species. Ref. [160] advances the PointNet++ framework by incorporating dense connections and gated feature fusion, enabling both multi-species classification and component segmentation. Further innovation is demonstrated in [161] through the Dynamic Fusion Segmentation Network (DFSNet), which extends DGCNN's capabilities with specialized fusion modules for detailed nursery tree segmentation. Practical applications have also been realized, such as the implementation of PointNet++ in intelligent orchard sprayers for precise spray-target identification [162].
Figure 9. Schematic diagrams of the segmented point clouds of seven seedling species; different colors represent different parts, including the seedling crown, trunk, supporting pole, and pot. The spray volume can be controlled according to the semantic segmentation information of the tree crowns, and autonomous driving can be guided according to the trunk information. (Reprinted with permission from Ref. [160]. Copyright 2024 Elsevier.)
Recent advances in point cloud-based neural networks have significantly enhanced large-scale semantic segmentation capabilities for agricultural applications, enabling precise extraction of target features, such as seedlings, from complex 3D environments. Ref. [163] improves RandLA-Net by incorporating angular information into local feature extraction and dynamically adjusting neighbor point features through multi-relative feature analysis, achieving accurate canopy and trunk segmentation. The Point Transformer architecture in [164] demonstrates exceptional 96% accuracy in tree point cloud segmentation, while [165] develops a hybrid approach combining RandLA-Net for initial tree/non-tree segmentation with YOLOv3 detection in projected 2D space and subsequent 3D point cloud partitioning. Specialized architectures have also emerged, including TreeisoNet [166], with its four dedicated subnetworks for simultaneous trunk and canopy segmentation, a dual-branch framework [167] integrating semantic and label information for single-tree extraction, and the CCD-YOLO system [168] incorporating a CReToNeXt backbone and CBAM modules for enhanced canopy feature detection.
Figure 10. Visualization of segmentation results for different parts of seedlings from a large-scale point cloud in the nursery. (a) depicts the ground truth and the segmentation results of RandLA-Net and the improved model. (b–e) show specific details of the segmentation results, with the white boxed areas highlighting incorrect predictions made by RandLA-Net. More detailed seedling and nursery environment information can be provided for agricultural plant protection operations. (Reprinted with permission from Ref. [163]. Copyright 2024 MDPI. https://doi.org/10.3390/rs16214011. Licensed under CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)).

4.6. Spray Drift Assessment

Accurate detection and tracking of spray droplets are critical for optimizing agricultural spraying operations. By precisely monitoring droplet distribution, managers can adjust spraying parameters in real time to ensure uniform coverage while minimizing pesticide waste, typically reducing pesticide usage and lowering production costs. This technology also prevents over-application, safeguarding crops from phytotoxicity while maintaining effective pest control.
Recent studies have developed various methods for precise spray droplet detection in agricultural spraying. Ref. [169] improves YOLOv5s by adding a GSConv module with depthwise separable convolution, making detection faster while maintaining good accuracy. Ref. [170] uses Faster R-CNN for droplet detection, while [171] enhances U-Net with dense connections and BConvLSTM to better identify spray patterns. Additionally, Ref. [172] applies PointNet++ to spray drones, allowing them to track droplet paths and adjust spraying in real time. Together, these approaches help optimize spraying operations by balancing speed, accuracy, and adaptability.
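The depthwise separable convolution behind the GSConv modification in Ref. [169] (and the MobileNet family discussed earlier) can be written in a few lines; the block below is a generic sketch showing the parameter saving relative to a standard convolution, not the exact GSConv layer of that study.

```python
# Hedged sketch: depthwise separable convolution vs. a standard convolution.
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, kernel_size: int = 3):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   padding=kernel_size // 2, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)   # 1x1 channel mixing
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

x = torch.rand(1, 32, 128, 128)
separable = DepthwiseSeparableConv(32, 64)
standard = nn.Conv2d(32, 64, 3, padding=1, bias=False)
count = lambda m: sum(p.numel() for p in m.parameters())
print(f"separable: {count(separable)} params vs. standard: {count(standard)} params")
print(separable(x).shape)                                          # torch.Size([1, 64, 128, 128])
```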

5. Future Directions and Research Challenges

5.1. Multi-Source Data Fusion and Utilization

In agricultural scenarios, a single type of data usually fails to comprehensively capture the real status of seedlings and the impact of environmental factors on them. For example, to pursue economic benefits, nurseries usually adopt high-density planting patterns in which the canopies of neighboring seedlings overlap, significantly increasing the difficulty of accurately identifying individual seedlings. In such cases, multiple sensors can be used to capture seedling information from different directions (such as the ground and the air) to achieve accurate separation of individual seedlings. Moreover, most nursery scenarios are outdoor environments. If only image-based neural network models are used for perception, the results will be affected by lighting conditions. Point clouds are composed of points containing spatial coordinate information and are not restricted by lighting conditions. Therefore, combining the perception information from image- and point cloud-based neural networks can improve robustness to lighting variations.
The effective fusion of multi-view point cloud data is mainly restricted by registration accuracy. In complex nursery scenarios, the reliability of feature point extraction and matching is relatively low, which can easily lead to the continuous accumulation of registration errors. Simultaneously, the intrinsic disparities between image data and point cloud data pose significant challenges for fusing these different data types. Prior to integration, extensive conversion and registration procedures are required to align them. Variations in acquisition methods and sensor perspectives further introduce potential registration errors, ultimately compromising fusion accuracy. Current fusion algorithms also remain constrained by their computational complexity, and existing solutions still exhibit considerable limitations in accuracy, processing efficiency, and generalizability, all of which currently hinder their practical deployment in real-world agricultural applications.
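As a concrete illustration of the alignment step that precedes any image/point cloud fusion, the sketch below projects LiDAR points into a camera image with a pinhole model; the intrinsic matrix and the extrinsic rotation/translation are assumed calibration values, not parameters from any reviewed system.

```python
# Hedged sketch: projecting LiDAR points into the image plane for fusion.
import numpy as np

K = np.array([[800.0, 0.0, 320.0],       # assumed camera intrinsics
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)                             # assumed LiDAR-to-camera rotation
t = np.array([0.0, -0.1, 0.2])            # assumed LiDAR-to-camera translation (m)

def project_points(points_lidar: np.ndarray):
    """points_lidar: (N, 3) points; returns (N, 2) pixel coordinates and a validity mask."""
    cam = points_lidar @ R.T + t          # transform into the camera frame
    in_front = cam[:, 2] > 0.1            # keep only points in front of the camera
    uvw = cam @ K.T                       # pinhole projection
    uv = uvw[:, :2] / uvw[:, 2:3]         # normalize by depth to get pixel coordinates
    return uv, in_front

pts = np.random.uniform([-2, -1, 1], [2, 1, 8], size=(1000, 3))   # dummy point cloud
uv, valid = project_points(pts)
print(f"{valid.sum()} of {len(pts)} points project in front of the camera")
```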

5.2. Improvement of Perception Model Performances

In nursery scenarios, perception models face unique and severe challenges, including a high similarity between the foreground and the background. For instance, the leaves of various seedlings share similar shapes and colors, and the trunks of seedlings resemble supporting poles in both color and texture. Enhancing the perception accuracy and precision of models is therefore an important direction for future research. Introducing effective attention modules into neural network models is a highly promising strategy: more important features receive larger weight coefficients, while irrelevant information receives smaller ones, and multiplying the generated coefficients with the input features achieves automatic feature reweighting. Moreover, the full integration of high- and low-dimensional features is effective for addressing the specific issues of nursery environments. Low-dimensional features are rich in global information such as position and shape and are more suitable for describing the overall structure of seedlings; this greatly facilitates tasks such as seedling counting and differentiating trunks from supporting poles through coarse segmentation. High-dimensional features contain abundant texture details and excel at capturing local cues such as tiny lesions on leaf margins and subtle color changes in leaf veins, which are key to accurate disease diagnosis. Fusing these two types of features can not only effectively suppress background interference in the nursery but also accurately acquire target information.
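The high/low-dimensional feature fusion described above can be realized with a lateral connection in the style of a feature pyramid: a low-resolution map carrying global information is upsampled and merged with a high-resolution map carrying texture detail. The channel sizes and the element-wise addition in the sketch below are illustrative assumptions, not a specific published design.

```python
# Hedged sketch: FPN-style fusion of a global (low-resolution) and a detailed
# (high-resolution) feature map.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LateralFusion(nn.Module):
    def __init__(self, detail_ch: int = 64, global_ch: int = 256, out_ch: int = 128):
        super().__init__()
        self.reduce_detail = nn.Conv2d(detail_ch, out_ch, 1)    # align detail-branch channels
        self.reduce_global = nn.Conv2d(global_ch, out_ch, 1)    # align global-branch channels
        self.smooth = nn.Conv2d(out_ch, out_ch, 3, padding=1)

    def forward(self, detail_feat, global_feat):
        # detail_feat: (B, 64, H, W) texture-rich map; global_feat: (B, 256, H/4, W/4) global map
        upsampled = F.interpolate(self.reduce_global(global_feat),
                                  size=detail_feat.shape[-2:],
                                  mode="bilinear", align_corners=False)
        fused = self.reduce_detail(detail_feat) + upsampled      # element-wise fusion
        return self.smooth(fused)

detail = torch.rand(1, 64, 160, 160)
global_map = torch.rand(1, 256, 40, 40)
print(LateralFusion()(detail, global_map).shape)                 # torch.Size([1, 128, 160, 160])
```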
While these advanced modules enhance model performance, they inevitably increase architectural complexity, resulting in substantially higher computational demands and storage needs. This escalation requires deployment on more capable hardware platforms to maintain operational feasibility. Additionally, the expanded processing latency may compromise real-time execution capabilities, ultimately constraining the practical deployment potential of these models in field applications.

5.3. Design of Lightweight Models

Given the urgent need for real-time operation in practical applications, designing more lightweight neural network models for perception tasks has become a crucial research direction. In the context of agricultural spraying operations, developing suitable lightweight neural network models holds considerable practical value, as these models can effectively reduce the demand for computing resources. For example, techniques such as pruning and quantization can be employed to compress the model size, enabling the model to run efficiently on hardware devices with limited computing resources.
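The two compression techniques named above can be demonstrated on a toy network with standard PyTorch utilities: unstructured magnitude pruning followed by post-training dynamic quantization. The placeholder model and the 30% pruning ratio are assumptions; a real perception model would be pruned or quantized after training and then re-validated for accuracy.

```python
# Hedged sketch: magnitude pruning and dynamic quantization of a toy classifier.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))

# 1) Remove 30% of the smallest-magnitude weights in each Linear layer.
for module in model:
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")            # make the pruning permanent

linear_layers = [m for m in model if isinstance(m, nn.Linear)]
zeros = sum((m.weight == 0).sum().item() for m in linear_layers)
total = sum(m.weight.numel() for m in linear_layers)
print(f"weight sparsity after pruning: {zeros / total:.2%}")

# 2) Dynamic quantization: store Linear weights as int8, quantize activations at runtime.
quantized = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
print(quantized(torch.rand(1, 512)).shape)        # torch.Size([1, 10])
```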
However, conventional lightweight design often requires replacing or removing modules of the original model, which may lead to problems such as excessive loss of key information and insufficient feature extraction ability, thereby causing a significant decline in perception accuracy. For example, if features that are highly sensitive to small targets are discarded, the model's ability to perceive subtle yet critical pest and disease cues degrades markedly. In addition, lightweight design may raise the requirements on input data quality: low-quality data, such as noisy data, data with missing values, or inaccurately labeled data, have a more pronounced impact on the training process and final performance of a model whose feature extraction capacity is limited.

5.4. Enhancement of Generalization Abilities of Models

In the agricultural sector, settings such as nurseries and orchards display a high degree of diversity. If neural network models are trained solely on data from a particular region, their performance typically drops notably when deployed in other areas because of substantial environmental disparities between regions. For example, owing to the significant differences in species, growth stages, and cultivation methods among different seedlings, seedling canopies exhibit high variability in shape, size, and structure. Therefore, perception tasks designed for one category of seedlings are difficult to apply directly to the identification of other categories. Consequently, boosting the generalization ability of models has emerged as one of the crucial research directions.
Concretely, transfer learning can be utilized. By leveraging a small quantity of data from new regions to fine-tune the already-trained model, the model can rapidly adapt to the new environment. Additionally, data augmentation and simulation techniques can be used to artificially increase the diversity of the training data. For example, to address the challenge of canopy geometric shape variability, transfer learning can be employed to transfer the knowledge from general plant growth models to nursery seedling canopy recognition. Meanwhile, a comprehensive dataset of nursery plant canopy geometric shapes is helpful to enhance the adaptability and recognition performance of the model.
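The augmentation idea above can be prototyped with standard image transforms that mimic lighting, viewpoint, and canopy-shape variability before fine-tuning on a new region; the transform set and magnitudes in the sketch below are assumptions, not a validated recipe.

```python
# Hedged sketch: augmentation pipeline to broaden the apparent diversity of nursery imagery.
import torch
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.6, 1.0)),     # viewpoint / canopy-size variation
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.4, contrast=0.4,
                           saturation=0.3, hue=0.05),         # lighting variation
    transforms.RandomRotation(degrees=15),                    # platform tilt
])

dummy_image = torch.rand(3, 512, 512)     # stands in for a nursery photo tensor
augmented = augment(dummy_image)
print(augmented.shape)                     # torch.Size([3, 224, 224])
```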

6. Conclusions

This paper mainly reviews neural network models based on images and point clouds and their applications in nursery spraying operations. Table 1 summarizes the content of Section 2, Section 3 and Section 4, covering information such as the input types and the advantages and disadvantages of various models. From the above table, the following conclusions can be drawn.
  • Compared with neural network models based on point clouds, neural network models based on images are more widely applied. Image-based neural network models can fully utilize the color information of objects, enabling them to identify features such as diseased parts of leaves according to color differences. In contrast, point cloud-based neural network models focus more on leveraging the spatial information of objects; their applications are relatively limited in tasks that rely heavily on color features, such as leaf disease detection and pest identification. However, because point cloud-based neural network models can acquire spatial information, they can obtain the spatial position of targets, making them suitable for tasks like target detection. Moreover, since the position information in point clouds is not affected by lighting conditions, point cloud-based neural network models are more suitable for application scenarios with significant lighting variations.
  • In real-time application scenarios, such as quickly distinguishing targets from non-targets and promptly spraying once target information is obtained, the hardware deployability and fast inference capability of models are important. The MobileNet series utilizes depthwise separable convolutions, SegNet removes the fully-connected layers, the YOLO series adopts a single-stage object detection architecture, and neural network models based on raw point clouds do not require additional data operations. Relying on these architectural advantages, such models can effectively meet the requirements for fast inference in these scenarios. Conversely, in application scenarios where high-precision perception takes precedence over real-time performance, such as pest recognition and leaf disease diagnosis, more advanced models such as SegFormer and the R-CNN series can be used. These models have strong feature extraction and analytical capabilities, which can provide more accurate detection results.
  • Acquiring more valuable features contributes to enhancing model performance. Take GoogLeNet, the DeepLab series, and the SegFormer series as examples: GoogLeNet incorporates the Inception module and the DeepLab series employs the ASPP module, both of which capture multi-scale features, while the SegFormer series leverages the Transformer architecture to capture global features. When the detection targets are partially occluded, these multi-scale or global features can significantly improve recognition accuracy.
The perception system serves as a critical component for autonomous agricultural robots, functioning to acquire environmental data for decision-making and hardware control operations. In typical agricultural settings such as nurseries, precise acquisition of target object information significantly enhances robotic autonomy during plant protection operations. Owing to their automated feature extraction capabilities, neural network-based approaches have been increasingly adopted for environmental perception. This paper presents a systematic review of neural network models utilizing both image and point cloud data, with particular emphasis on their applications in six key plant protection tasks in nursery environments: leaf disease detection, pest identification, weed recognition, target and non-target object detection, seedling information monitoring, and spray drift assessment. This comprehensive review provides valuable references for developing next-generation perception systems in agricultural robotics and precision plant protection.

Author Contributions

Conceptualization, J.X. and H.L.; methodology, J.X. and Y.S.; software, J.X.; validation, J.X. and Y.S.; formal analysis, J.X., H.L. and Y.S.; investigation, J.X.; resources, J.X., H.L. and Y.S.; data curation, J.X. and H.L.; writing—original draft preparation, J.X.; writing—review and editing, H.L. and Y.S.; visualization, J.X.; supervision, H.L. and Y.S.; project administration, H.L. and Y.S.; funding acquisition, H.L. All authors have read and agreed to the published version of the manuscript.

Funding

This study was supported by the National Natural Science Foundation of China (General Project, Grant No. 32171908).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

The authors would like to thank the editor and reviewers for their valuable suggestions for improving this paper, and the China Scholarship Council for its support.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
SVM: Support Vector Machine
PCA: Principal Component Analysis
VGG: Visual Geometry Group
CNN: Convolutional Neural Network
CRF: Conditional Random Field
ASPP: Atrous Spatial Pyramid Pooling
YOLO: You Only Look Once
RPN: Region Proposal Network
FPN: Feature Pyramid Network
SPP: Spatial Pyramid Pooling
MVCNN: Multi-View Convolutional Neural Network
AMTCL: Adaptive Margin based Triplet-Center Loss
CAL: Capsule Attention Layer
VFE: Voxel Feature Encoding
SE: Squeeze-and-Excitation Attention Module
CA: Coordinate Attention Module
ECA: Efficient Channel Attention
CBAM: Convolutional Block Attention Module
AE-HHO: Adaptive Energy-based Harris Hawks Optimization
mAP: Mean Average Precision
MSFF: Multi-scale Feature Fusion
MSDA: Multi-scale Dilated Attention
RICAP: Random Image Cropping and Patching
UIB: Universal Inverted Bottleneck
PRCN: Parallel RepConv Network
PRC: Parallel RepConv
DFSNet: Dynamic Fusion Segmentation Network

References

  1. Essegbemon, A.; Tjeerd, J.S.; Dansou, K.K.; Alphonse, O.O.; Paul, C.S. Effects of nursery management practices on morphological quality attributes of tree seedlings at planting: The case of oil palm (Elaeis guineensis Jacq.). For. Ecol. Manag. 2014, 324, 28–36. [Google Scholar] [CrossRef]
  2. Amit, K.J.; Ellen, R.G.; Yigal, E.; Omer, F. Biochar as a management tool for soilborne diseases affecting early stage nursery seedling production. For. Ecol. Manag. 2019, 120, 34–42. [Google Scholar]
  3. Victor, M.G.; Cinthia, N.; Nazim, S.G.; Angelo, S.; Jesús, G.; Roberto, R.; Jesús, O.; Catalina, E.; Juan, A.F. An in-depth analysis of sustainable practices in vegetable seedlings nurseries: A review. Sci. Hortic. 2024, 334, 113342. [Google Scholar] [CrossRef]
  4. Li, J.; Wu, Z.; Li, M.; Shang, Z. Dynamic Measurement Method for Steering Wheel Angle of Autonomous Agricultural Vehicles. Agriculture 2024, 14, 1602. [Google Scholar] [CrossRef]
  5. Ahmed, S.; Qiu, B.; Ahmad, F.; Kong, C.-W.; Xin, H. A State-of-the-Art Analysis of Obstacle Avoidance Methods from the Perspective of an Agricultural Sprayer UAV’s Operation Scenario. Agronomy 2021, 11, 1069. [Google Scholar] [CrossRef]
  6. Sun, J.; Wang, Z.; Ding, S.; Xia, J.; Xing, G. Adaptive disturbance observer-based fixed time nonsingular terminal sliding mode control for path-tracking of unmanned agricultural tractors. Biosyst. Eng. 2024, 246, 96–109. [Google Scholar]
  7. Lu, E.; Xue, J.; Chen, T.; Jiang, S. Robust Trajectory Tracking Control of an Autonomous Tractor-Trailer Considering Model Parameter Uncertainties and Disturbances. Agriculture 2023, 13, 869. [Google Scholar] [CrossRef]
  8. Liu, H.; Yan, S.; Shen, Y.; Li, C.; Zhang, Y.; Hussain, F. Model predictive control system based on direct yaw moment control for 4WID self-steering agriculture vehicle. Int. J. Agric. Biol. Eng. 2021, 14, 175–181. [Google Scholar] [CrossRef]
  9. Zhu, Y.; Cui, B.; Yu, Z.; Gao, Y.; Wei, X. Tillage Depth Detection and Control Based on Attitude Estimation and Online Calibration of Model Parameters. Agriculture 2024, 14, 2130. [Google Scholar] [CrossRef]
  10. Dai, D.; Chen, D.; Wang, S.; Li, S.; Mao, X.; Zhang, B.; Wang, Z.; Ma, Z. Compilation and Extrapolation of Load Spectrum of Tractor Ground Vibration Load Based on CEEMDAN-POT Model. Agriculture 2023, 13, 125. [Google Scholar] [CrossRef]
  11. Liao, J.; Luo, X.; Wang, P.; Zhou, Z.; O’Donnell, C.C.; Zang, Y.; Hewitt, A.J. Analysis of the Influence of Different Parameters on Droplet Characteristics and Droplet Size Classification Categories for Air Induction Nozzle. Agronomy 2020, 10, 256. [Google Scholar] [CrossRef]
  12. Li, Y.; Li, Y.; Nie, J.; Li, Z.; Li, J.; Gao, J.; Fang, Z. Navigation of the spraying robot in jujube orchard. Alex. Eng. J. 2025, 126, 320–340. [Google Scholar] [CrossRef]
  13. Prashanta, P.; Ajay, S.; Daniel, F.; Karla, L. Design and systematic evaluation of an under-canopy robotic spray system for row crops. Smart Agric. Technol. 2024, 8, 100510. [Google Scholar] [CrossRef]
  14. Hamed, R.; Hassan, Z.; Hassan, M.; Gholamreza, A. A new DSWTS algorithm for real-time pedestrian detection in autonomous agricultural tractors as a computer vision system. Measurement 2016, 93, 126–134. [Google Scholar] [CrossRef]
  15. Hearst, M.A.; Dumais, S.T.; Osuna, E.; Platt, J.; Scholkopf, B. Support vector machines. IEEE Intell. Syst. Their Appl. 1998, 13, 18–28. [Google Scholar] [CrossRef]
  16. Kotsiantis, S.B. Decision trees: A recent overview. Artif. Intell. Rev. 2013, 39, 261–283. [Google Scholar] [CrossRef]
  17. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
  18. Sun, J.; Yang, F.; Cheng, J.; Wang, S.; Fu, L. Nondestructive identification of soybean protein in minced chicken meat based on hyperspectral imaging and VGG16-SVM. J. Food Compos. Anal. 2024, 125, 105713. [Google Scholar] [CrossRef]
  19. Ding, Y.; Yan, Y.; Li, J.; Chen, X.; Jiang, H. Classification of Tea Quality Levels Using Near-Infrared Spectroscopy Based on CLPSO-SVM. Foods 2022, 11, 1658. [Google Scholar] [CrossRef]
  20. Dai, C.; Sun, J.; Huang, X.; Zhang, X.; Tian, X.; Wang, W.; Sun, J.; Luan, Y. Application of Hyperspectral Imaging as a Nondestructive Technology for Identifying Tomato Maturity and Quantitatively Predicting Lycopene Content. Foods 2023, 12, 2957. [Google Scholar] [CrossRef]
  21. Cheng, J.; Sun, J.; Yao, K.; Xu, M.; Zhou, X. Nondestructive detection and visualization of protein oxidation degree of frozen-thawed pork using fluorescence hyperspectral imaging. Meat Sci. 2022, 194, 108975. [Google Scholar] [CrossRef]
  22. Li, H.; Wu, P.; Dai, J.; Pan, T.; Holmes, M.; Chen, T.; Zou, X. Discriminating compounds identification based on the innovative sparse representation chemometrics to assess the quality of Maofeng tea. J. Food Compos. Anal. 2023, 123, 105590. [Google Scholar] [CrossRef]
  23. Yao, K.; Sun, J.; Zhang, L.; Zhou, X.; Tian, Y.; Tang, N.; Wu, X. Nondestructive detection for egg freshness based on hyperspectral imaging technology combined with harris hawks optimization support vector regression. J. Food Saf. 2021, 41, e12888. [Google Scholar] [CrossRef]
  24. Sun, J.; Liu, Y.; Wu, G.; Zhang, Y.; Zhang, R.; Li, X.J. A Fusion Parameter Method for Classifying Freshness of Fish Based on Electrochemical Impedance Spectroscopy. J. Food Qual. 2021, 2021, 6664291. [Google Scholar] [CrossRef]
  25. Yao, K.; Sun, J.; Zhou, X.; Nirere, A.; Tian, Y.; Wu, X. Nondestructive detection for egg freshness grade based on hyperspectral imaging technology. J. Food Process Eng. 2020, 43, e13422. [Google Scholar] [CrossRef]
  26. Li, Y.; Pan, T.; Li, H.; Chen, S. Non-invasive quality analysis of thawed tuna using near infrared spectroscopy with baseline correction. J. Food Process Eng. 2020, 43, e13445. [Google Scholar] [CrossRef]
  27. Wu, X.; Zhou, H.; Wu, B.; Fu, H. Determination of apple varieties by near infrared reflectance spectroscopy coupled with improved possibilistic Gath–Geva clustering algorithm. J. Food Process Eng. 2020, 44, e14561. [Google Scholar] [CrossRef]
  28. Nirere, A.; Sun, J.; Atindana, V.A.; Hussain, A.; Zhou, X.; Yao, K. A comparative analysis of hybrid SVM and LS-SVM classification algorithms to identify dried wolfberry fruits quality based on hyperspectral imaging technology. J. Food Process. Preserv. 2022, 46, e16320. [Google Scholar] [CrossRef]
  29. Nirere, A.; Sun, J.; Kama, R.; Atindana, V.A.; Nikubwimana, F.D.; Dusabe, K.D.; Zhong, Y. Nondestructive detection of adulterated wolfberry (Lycium Chinense) fruits based on hyperspectral imaging technology. J. Food Process Eng. 2023, 46, e14293. [Google Scholar] [CrossRef]
  30. Wang, S.; Sun, J.; Fu, L.; Xu, M.; Tang, N.; Cao, Y.; Yao, K.; Jing, J. Identification of red jujube varieties based on hyperspectral imaging technology combined with CARS-IRIV and SSA-SVM. J. Food Process Eng. 2022, 45, e14137. [Google Scholar] [CrossRef]
  31. Ahmad, H.; Sun, J.; Nirere, A.; Shaheen, N.; Zhou, X.; Yao, K. Classification of tea varieties based on fluorescence hyperspectral image technology and ABC-SVM algorithm. J. Food Process Eng. 2021, 45, e15241. [Google Scholar] [CrossRef]
  32. Yao, K.; Sun, J.; Tang, N.; Xu, M.; Cao, Y.; Fu, L.; Zhou, X.; Wu, X. Nondestructive detection for Panax notoginseng powder grades based on hyperspectral imaging technology combined with CARS-PCA and MPA-LSSVM. J. Food Process Eng. 2021, 44, e13718. [Google Scholar] [CrossRef]
  33. Fu, L.; Sun, J.; Wang, S.; Xu, M.; Yao, K.; Cao, Y.; Tang, N. Identification of maize seed varieties based on stacked sparse autoencoder and near-infrared hyperspectral imaging technology. J. Food Process Eng. 2022, 45, e14120. [Google Scholar] [CrossRef]
  34. Tang, N.; Sun, J.; Yao, K.; Zhou, X.; Tian, Y.; Cao, Y.; Nirere, A. Identification of varieties based on hyperspectral imaging technique and competitive adaptive reweighted sampling-whale optimization algorithm-support vector machine. J. Food Process Eng. 2021, 44, e13603. [Google Scholar] [CrossRef]
  35. Jan, S.; Jaroslav, Č.; Eva, N.; Olusegun, O.A.; Jiří, C.; Daniel, P.; Markku, K.; Petya, C.; Jana, A.; Milan, L.; et al. Making the Genotypic Variation Visible: Hyperspectral Phenotyping in Scots Pine Seedlings. Plant Phenomics 2023, 5, 0111. [Google Scholar] [CrossRef] [PubMed]
  36. Finn, A.; Kumar, P.; Peters, S.; O’Hehir, J. Unsupervised spectral-spatial processing of drone imagery for identification of pine seedlings. ISPRS J. Photogramm. Remote Sens. 2022, 183, 363–388. [Google Scholar] [CrossRef]
  37. Raypah, M.E.; Nasru, M.I.M.; Nazim, M.H.H.; Omar, A.F.; Zahir, S.A.D.M.; Jamlos, M.F.; Muncan, J. Spectral response to early detection of stressed oil palm seedlings using near-infrared reflectance spectra at region 900–1000 nm. Infrared Phys. Technol. 2023, 135, 104984. [Google Scholar] [CrossRef]
  38. Zuo, X.; Chu, J.; Shen, J.; Sun, J. Multi-Granularity Feature Aggregation with Self-Attention and Spatial Reasoning for Fine-Grained Crop Disease Classification. Agriculture 2022, 12, 1499. [Google Scholar] [CrossRef]
  39. Bing, L.; Sun, J.; Yang, N.; Wu, X.; Zhou, X. Identification of tea white star disease and anthrax based on hyperspectral image information. J. Food Process Eng. 2020, 44, e13584. [Google Scholar]
  40. Wang, Y.; Li, T.; Chen, T.; Zhang, X.; Taha, M.F.; Yang, N.; Mao, H.; Shi, Q. Cucumber Downy Mildew Disease Prediction Using a CNN-LSTM Approach. Agriculture 2024, 14, 1155. [Google Scholar] [CrossRef]
  41. Deng, J.; Ni, L.; Bai, X.; Jiang, H.; Xu, L. Simultaneous analysis of mildew degree and aflatoxin B1 of wheat by a multi-task deep learning strategy based on microwave detection technology. LWT 2023, 184, 115047. [Google Scholar] [CrossRef]
  42. Wang, B.; Deng, J.; Jiang, H. Markov Transition Field Combined with Convolutional Neural Network Improved the Predictive Performance of Near-Infrared Spectroscopy Models for Determination of Aflatoxin B1 in Maize. Foods 2022, 11, 2210. [Google Scholar] [CrossRef]
  43. Cheng, J.; Sun, J.; Shi, L.; Dai, C. An effective method fusing electronic nose and fluorescence hyperspectral imaging for the detection of pork freshness. Food Biosci. 2024, 59, 103880. [Google Scholar] [CrossRef]
  44. Sun, J.; Cheng, J.; Xu, M.; Yao, K. A method for freshness detection of pork using two-dimensional correlation spectroscopy images combined with dual-branch deep learning. J. Food Compos. Anal. 2024, 129, 106144. [Google Scholar] [CrossRef]
  45. Xu, B.; Cui, X.; Ji, W.; Yuan, H.; Wang, J. Apple Grading Method Design and Implementation for Automatic Grader Based on Improved YOLOv5. Agriculture 2023, 13, 124. [Google Scholar] [CrossRef]
  46. Cheng, J.; Sun, J.; Yao, K.; Xu, M.; Dai, C. Multi-task convolutional neural network for simultaneous monitoring of lipid and protein oxidative damage in frozen-thawed pork using hyperspectral imaging. Meat Sci. 2023, 201, 109196. [Google Scholar] [CrossRef] [PubMed]
  47. Cheng, J.; Sun, J.; Yao, K.; Dai, C. Generalized and hetero two-dimensional correlation analysis of hyperspectral imaging combined with three-dimensional convolutional neural network for evaluating lipid oxidation in pork. Food Control 2023, 153, 109940. [Google Scholar] [CrossRef]
  48. Yang, F.; Sun, J.; Cheng, J.; Fu, L.; Wang, S.; Xu, M. Detection of starch in minced chicken meat based on hyperspectral imaging technique and transfer learning. J. Food Process Eng. 2023, 46, e14304. [Google Scholar] [CrossRef]
  49. Geoffrey, C.; Eric, R. Intelligent Imaging in Nuclear Medicine: The Principles of Artificial Intelligence, Machine Learning and Deep Learning. Semin. Nucl. Med. 2021, 51, 102–111. [Google Scholar] [CrossRef]
  50. Usman, K.; Muhammad, K.K.; Muhammad, A.L.; Muhammad, N.; Muhammad, M.A.; Salman, A.K.; Mazliham, M.S. A Systematic Literature Review of Machine Learning and Deep Learning Approaches for Spectral Image Classification in Agricultural Applications Using Aerial Photography. Comput. Mater. Contin. 2024, 78, 2967–3000. [Google Scholar] [CrossRef]
  51. Jayme, G.; Arnal, B. A review on the combination of deep learning techniques with proximal hyperspectral images in agriculture. Comput. Electron. Agric. 2023, 210, 107920. [Google Scholar] [CrossRef]
  52. Atiya, K.; Amol, D.V.; Shankar, M.; Patil, C.H. A systematic review on hyperspectral imaging technology with a machine and deep learning methodology for agricultural applications. Ecol. Inform. 2022, 69, 101678. [Google Scholar]
  53. Larissa, F.R.M.; Rodrigo, M.; Bruno, A.N.T.; André, R.B. Deep learning based image classification for embedded devices: A systematic review. Neurocomputing 2025, 623, 129402. [Google Scholar] [CrossRef]
  54. Simonyan, K.; Andrew, Z. Very Deep Convolutional Networks for Large-Scale Image Recognition. In Proceedings of the 2015 International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
  55. Christian, S.; Liu, W.; Jia, Y.; Pierre, S.; Scott, R.; Dragomir, A.; Dumitru, E.; Vincent, V.; Andrew, R. Going deeper with convolutions. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar]
  56. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the Inception Architecture for Computer Vision. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  57. Szegedy, C.; Ioffe, S.; Vanhoucke, V.; Alemi, A. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. In Proceedings of the AAAI Conference on Artificial Intelligence 2017, San Francisco, CA, USA, 4–9 February 2017. [Google Scholar]
  58. Chollet, F. Xception: Deep Learning with Depthwise Separable Convolutions. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1800–1807. [Google Scholar]
  59. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  60. Andrew, G.H.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Tobias, W.; Marco, A.; Hartwig, A. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  61. Mark, S.; Andrew, H.; Zhu, M.; Zhmoginov, A.; Chen, L. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  62. Howard, A.; Sandler, M.; Chen, B.; Wang, W.; Chen, L.; Tan, M.; Chu, G.; Vasudevan, V.; Zhu, Y.; Pang, R.; et al. Searching for MobileNetV3. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1314–1324. [Google Scholar]
  63. Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  64. Tan, M.; Le, Q. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In Proceedings of the 36th International Conference on Machine Learning, PMLR, Long Beach, CA, USA, 9–15 June 2019; pp. 6105–6114. [Google Scholar]
  65. Adnan, H.; Muhammad, A.; Jin, S.H.; Haseeb, S.; Nadeem, U.; Kang, R.P. Multi-scale and multi-receptive field-based feature fusion for robust segmentation of plant disease and fruit using agricultural images. Appl. Soft Comput. 2024, 167, 112300. [Google Scholar]
  66. Cheng, J.; Song, Z.; Wu, Y.; Xu, J. ALDNet: A two-stage method with deep aggregation and multi-scale fusion for apple leaf disease spot segmentation. Measurement 2025, 253, 117706. [Google Scholar] [CrossRef]
  67. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer International Publishing: Cham, Switzerland, 2015; pp. 234–241. [Google Scholar]
  68. Chen, L.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs. In Proceedings of the 2014 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014. [Google Scholar]
  69. Chen, L.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 834–848. [Google Scholar] [CrossRef]
  70. Chen, L.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking Atrous Convolution for Semantic Image Segmentation. In Proceedings of the 2017 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 22–25 July 2017. [Google Scholar]
  71. Chen, L.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In Proceedings of the 2018 European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 833–851. [Google Scholar]
  72. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Jose, M.A.; Luo, P. SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. In Proceedings of the 2021 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021. [Google Scholar]
  73. Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [Google Scholar] [CrossRef]
  74. Wang, S.; Zeng, Q.; Ni, W.; Cheng, C.; Wang, Y. ODP-Transformer: Interpretation of pest classification results using image caption generation techniques. Comput. Electron. Agric. 2023, 209, 107863. [Google Scholar] [CrossRef]
  75. Prasath, B.; Akila, M. IoT-based pest detection and classification using deep features with enhanced deep learning strategies. Eng. Appl. Artif. Intell. 2023, 121, 105985. [Google Scholar]
  76. Farooq, A.; Huma, Q.; Kashif, S.; Iftikhar, A.; Muhammad, J.I. YOLOCSP-PEST for Crops Pest Localization and Classification. Comput. Mater. Contin. 2023, 82, 2373–2388. [Google Scholar]
  77. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  78. Girshick, R. Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  79. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  80. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  81. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  82. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6517–6525. [Google Scholar]
  83. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  84. Bochkovskiy, A.; Wang, C.; Liao, M.H. YOLOv4: Optimal Speed and Accuracy of Object Detection. In Proceedings of the 2020 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  85. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-Time End-to-End Object Detection. In Proceedings of the 2024 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024. [Google Scholar]
  86. Rahima, K.; Muhammad, H. YOLOv11: An Overview of the Key Architectural Enhancements. In Proceedings of the 2024 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024. [Google Scholar]
  87. Tian, Y.; Ye, Q.; Doermann, D. YOLOv12: Attention-Centric Real-Time Object Detectors. In Proceedings of the 2025 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 11–15 June 2025. [Google Scholar]
  88. Chetan, M.B.; Alwin, P.; Hao, G. Agricultural object detection with You Only Look Once (YOLO) Algorithm: A bibliometric and systematic literature review. Comput. Electron. Agric. 2024, 223, 109090. [Google Scholar] [CrossRef]
  89. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.; Alexander, C.B. SSD: Single Shot MultiBox Detector. In Proceedings of the 2016 European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37. [Google Scholar]
  90. Hang, S.; Subhransu, M.; Evangelos, K.; Erik, L. Multi-view Convolutional Neural Networks for 3D Shape Recognition. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 945–953. [Google Scholar]
  91. He, X.; Bai, S.; Chu, J.; Bai, X. An Improved Multi-View Convolutional Neural Network for 3D Object Retrieval. IEEE Trans. Image Process. 2020, 29, 7917–7930. [Google Scholar] [CrossRef]
  92. Sun, K.; Zhang, J.; Xu, S.; Zhao, Z.; Zhang, C.; Liu, J.; Hu, J. CACNN: Capsule Attention Convolutional Neural Networks for 3D Object Recognition. IEEE Trans. Neural Netw. Learn. Syst. 2025, 36, 4091–4102. [Google Scholar] [CrossRef]
  93. Gao, Z.; Zhang, Y.; Zhang, H.; Guan, W.; Feng, D.; Chen, S. Multi-Level View Associative Convolution Network for View-Based 3D Model Retrieval. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 2264–2278. [Google Scholar]
  94. Wei, X.; Yu, R.; Sun, J. View-GCN: View-Based Graph Convolutional Network for 3D Shape Analysis. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 1847–1856. [Google Scholar]
  95. Maturana, D.; Scherer, S. VoxNet: A 3D Convolutional Neural Network for real-time object recognition. In Proceedings of the 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Hamburg, Germany, 28 September–2 October 2015; pp. 922–928. [Google Scholar]
  96. Zhou, Y.; Tuzel, O. VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 4490–4499. [Google Scholar]
  97. Chen, Y.; Liu, J.; Zhang, X.; Qi, X.; Jia, J. VoxelNeXt: Fully Sparse VoxelNet for 3D Object Detection and Tracking. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 21674–21683. [Google Scholar]
  98. Li, B. 3D fully convolutional network for vehicle detection in point cloud. In Proceedings of the 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vancouver, BC, Canada, 24–28 September 2017; pp. 1513–1518. [Google Scholar]
  99. Wu, Z.; Song, S.; Khosla, A.; Tang, X.; Xiao, J. 3D ShapeNets for 2.5D Object Recognition and Next-Best-View Prediction. In Proceedings of the 2015 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 8–10 June 2015. [Google Scholar]
  100. Riegler, G.; Ulusoy, A.O.; Geiger, A. OctNet: Learning Deep 3D Representations at High Resolutions. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6620–6629. [Google Scholar]
  101. Wang, P.; Liu, Y.; Guo, Y.; Sun, C.; Tong, X. O-CNN: Octree-based convolutional neural networks for 3D shape analysis. Assoc. Comput. Mach. 2017, 36, 11. [Google Scholar]
  102. Klokov, R.; Lempitsky, V. Escape from Cells: Deep Kd-Networks for the Recognition of 3D Point Cloud Models. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 863–872. [Google Scholar]
  103. Tchapmi, L.; Choy, C.; Armeni, I.; Gwak, J.; Savarese, S. SEGCloud: Semantic Segmentation of 3D Point Clouds. In Proceedings of the 2017 International Conference on 3D Vision (3DV), Qingdao, China, 10–12 October 2017; pp. 537–547. [Google Scholar]
  104. Riansyah, M.I.; Putra, O.V.; Priyadi, A.; Sardjono, T.A.; Yuniarno, E.M.; Purnomo, M.H. Modified CNN VoxNet Based Depthwise Separable Convolution for Voxel-Driven Body Orientation Classification. In Proceedings of the 2024 IEEE International Conference on Imaging Systems and Techniques (IST), Tokyo, Japan, 14–16 October 2024; pp. 1–6. [Google Scholar]
  105. Hanocka, R.; Hertz, A.; Fish, N.; Giryes, R.; Fleishman, S.; Cohen-Or, D. MeshCNN: A network with an edge. Assoc. Comput. Mach. 2019, 38, 12. [Google Scholar]
  106. Charles, R.Q.; Su, H.; Kaichun, M.; Guibas, L.J. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 77–85. [Google Scholar]
  107. Charles, R.Q.; Li, Y.; Hao, S.; Leonidas, J.G. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 5105–5114. [Google Scholar]
  108. Wang, Y.; Sun, Y.; Liu, Z.; Sarma, S.E.; Bronstein, M.M.; Solomon, J.M. Dynamic Graph CNN for Learning on Point Clouds. ACM Trans. Graph. 2019, 38, 12. [Google Scholar]
  109. Zhao, H.; Jiang, L.; Fu, C.; Jia, J. PointWeb: Enhancing Local Neighborhood Features for Point Cloud Processing. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 5560–5568. [Google Scholar]
  110. Jiang, M.; Wu, Y.; Zhao, Z.; Lu, C. PointSIFT: A SIFT-like Network Module for 3D Point Cloud Semantic Segmentation. arXiv 2018, arXiv:1807.00652. [Google Scholar]
  111. Qian, G.; Hammoud, H.A.A.K.; Li, G.; Thabet, A.; Ghanem, B. ASSANet: An anisotropic separable set abstraction for efficient point cloud representation learning. In Proceedings of the 35th International Conference on Neural Information Processing Systems, Online, 6–14 December 2021; pp. 28119–28130. [Google Scholar]
  112. Qian, G.; Li, Y.; Peng, H.; Mai, J.; Hammoud, H.; Elhoseiny, M.; Ghanem, B. PointNeXt: Revisiting PointNet++ with Improved Training and Scaling Strategies. Adv. Neural Inf. Process. Syst. 2022, 35, 23192–23204. [Google Scholar]
  113. Hu, Q.; Yang, B.; Xie, L.; Rosa, S.; Guo, Y.; Wang, Z.; Trigoni, N.; Markham, A. RandLA-Net: Efficient Semantic Segmentation of Large-Scale Point Clouds. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 11105–11114. [Google Scholar]
  114. Zhao, H.; Jiang, L.; Jia, J.; Torr, P.; Koltun, V. Point Transformer. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 16239–16248. [Google Scholar]
  115. Wu, X.; Lao, Y.; Jiang, L.; Liu, X.; Zhao, H. Point Transformer V2: Grouped Vector Attention and Partition-based Pooling. Adv. Neural Inf. Process. Syst. 2022, 35, 33330–33342. [Google Scholar]
  116. Wu, X.; Jiang, L.; Wang, P.; Liu, Z.; Liu, X.; Qiao, Y.; Ouyang, W.; He, T.; Zhao, H. Point Transformer V3: Simpler, Faster, Stronger. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 4840–4851. [Google Scholar]
  117. Ananda, S.P.; Vandana, B.M. Transfer Learning for Multi-Crop Leaf Disease Image Classification using Convolutional Neural Network VGG. Artif. Intell. Agric. 2022, 6, 23–33. [Google Scholar] [CrossRef]
  118. Hong, P.; Luo, X.; Bao, L. Crop disease diagnosis and prediction using two-stream hybrid convolutional neural networks. Crop Prot. 2024, 184, 106867. [Google Scholar] [CrossRef]
  119. Pudumalar, S.; Muthuramalingam, S. Hydra: An ensemble deep learning recognition model for plant diseases. J. Eng. Res. 2024, 12, 781–792. [Google Scholar] [CrossRef]
  120. Roopali, D.; Shalli, R.; Aman, S.; Marwan, A.A.; Alina, E.B.; Ahmed, A. Deep learning model for detection of brown spot rice leaf disease with smart agriculture. Comput. Electr. Eng. 2023, 109, 108659. [Google Scholar] [CrossRef]
  121. Suri, B.N.; Midhun, P.M.; Abubeker, K.M.; Shafeena, K.A. SwinGNet: A Hybrid Swin Transform-GoogleNet Framework for Real-Time Grape Leaf Disease Classification. Procedia Comput. Sci. 2025, 258, 1629–1639. [Google Scholar]
  122. Biniyam, M.A.; Abdela, A.M. Coffee disease classification using Convolutional Neural Network based on feature concatenation. Inform. Med. Unlocked 2023, 39, 101245. [Google Scholar]
  123. Liu, Y.; Wang, Z.; Wang, R.; Chen, J.; Gao, H. Flooding-based MobileNet to identify cucumber diseases from leaf images in natural scenes. Comput. Electron. Agric. 2023, 213, 108166. [Google Scholar] [CrossRef]
  124. Chen, J.; Zhang, D.; Suzauddola, M.; Zeb, A. Identifying crop diseases using attention embedded MobileNet-V2 model. Appl. Soft Comput. 2021, 113, 107901. [Google Scholar] [CrossRef]
  125. Rukuna, L.A.; Zambuk, F.U.; Gital, A.Y.; Bello, M.U. Citrus diseases detection and classification based on efficientnet-B5. Syst. Soft Comput. 2025, 7, 200199. [Google Scholar] [CrossRef]
  126. Ding, Y.; Yang, W.; Zhang, J. An improved DeepLabV3+ based approach for disease spot segmentation on apple leaves. Comput. Electron. Agric. 2025, 231, 110041. [Google Scholar] [CrossRef]
  127. Zhang, Y.; Lv, C. TinySegformer: A lightweight visual segmentation model for real-time agricultural pest detection. Comput. Electron. Agric. 2024, 218, 108740. [Google Scholar] [CrossRef]
  128. Zhang, C.; Zhang, Y.; Xu, X. Dilated inception U-Net with attention for crop pest image segmentation in real-field environment. Smart Agric. Technol. 2025, 11, 100917. [Google Scholar] [CrossRef]
  129. Peteinatos, G.G.; Reichel, P.; Karouta, J.; Andújar, D.; Gerhards, R. Weed Identification in Maize, Sunflower, and Potatoes with the Aid of Convolutional Neural Networks. Remote Sens. 2020, 12, 4185. [Google Scholar] [CrossRef]
  130. Alex, O.; Dmitry, A.K.; Bronson, P.; Peter, R.; Jake, C.W.; Jamie, J.; Wesley, B.; Benjamin, G.; Owen, K.; James, W.; et al. DeepWeeds: A Multiclass Weed Species Image Dataset for Deep Learning. Sci. Rep. 2019, 9, 2058. [Google Scholar] [CrossRef]
  131. Liu, J.; Abbas, I.; Noor, R.S. Development of Deep Learning-Based Variable Rate Agrochemical Spraying System for Targeted Weeds Control in Strawberry Crop. Agronomy 2021, 11, 1480. [Google Scholar] [CrossRef]
  132. Akhilesh, S.; Vipan, K.; Louis, L. Comparative performance of YOLOv8, YOLOv9, YOLOv10, YOLOv11 and Faster R-CNN models for detection of multiple weed species. Smart Agric. Technol. 2024, 9, 100648. [Google Scholar] [CrossRef]
  133. García-Navarrete, O.L.; Camacho-Tamayo, J.H.; Bregon, A.B.; Martín-García, J.; Navas-Gracia, L.M. Performance Analysis of Real-Time Detection Transformer and You Only Look Once Models for Weed Detection in Maize Cultivation. Agronomy 2025, 15, 796. [Google Scholar] [CrossRef]
  134. Deng, L.; Miao, Z.; Zhao, X.; Yang, S.; Gao, Y.; Zhai, C.; Zhao, C. HAD-YOLO: An Accurate and Effective Weed Detection Model Based on Improved YOLOV5 Network. Agronomy 2025, 15, 57. [Google Scholar] [CrossRef]
  135. Fadwa, A.; Mashael, M.A.; Rana, A.; Radwa, M.; Anwer, M.H.; Ahmed, A.; Deepak, G. Hybrid leader based optimization with deep learning driven weed detection on internet of things enabled smart agriculture environment. Comput. Electr. Eng. 2022, 104, 108411. [Google Scholar] [CrossRef]
  136. Nitin, R.; Yu, Z.; Maria, V.; Kirk, H.; Michael, O.; Xin, S. Agricultural weed identification in images and videos by integrating optimized deep learning architecture on an edge computing technology. Comput. Electron. Agric. 2024, 216, 108442. [Google Scholar]
  137. Ma, C.; Chi, G.; Ju, X.; Zhang, J.; Yan, C. YOLO-CWD: A novel model for crop and weed detection based on improved YOLOv8. Crop Prot. 2025, 192, 107169. [Google Scholar] [CrossRef]
  138. Steininger, D.; Trondl, A.; Croonen, G.; Simon, J.; Widhalm, V. The CropAndWeed Dataset: A Multi-Modal Learning Approach for Efficient Crop and Weed Manipulation. In Proceedings of the 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 2–7 January 2023; pp. 3718–3727. [Google Scholar]
  139. Qi, Z.; Wang, J. PMDNet: An Improved Object Detection Model for Wheat Field Weed. Agronomy 2025, 15, 55. [Google Scholar] [CrossRef]
  140. Jin, T.; Liang, K.; Lu, M.; Zhao, Y.; Xu, Y. WeedsSORT: A weed tracking-by-detection framework for laser weeding applications within precision agriculture. Smart Agric. Technol. 2025, 11, 100883. [Google Scholar] [CrossRef]
  141. Nils, H.; Kai, L.; Anthony, S. Accelerating weed detection for smart agricultural sprayers using a Neural Processing Unit. Comput. Electron. Agric. 2025, 237, 110608. [Google Scholar] [CrossRef]
  142. Sanjay, K.G.; Shivam, K.Y.; Sanjay, K.S.; Udai, S.; Pradeep, K.S. Multiclass weed identification using semantic segmentation: An automated approach for precision agriculture. Ecol. Inform. 2023, 78, 102366. [Google Scholar]
  143. Su, D.; Kong, H.; Qiao, Y.; Sukkarieh, S. Data augmentation for deep learning based semantic segmentation and crop-weed classification in agricultural robotics. Comput. Electron. Agric. 2021, 190, 106418. [Google Scholar] [CrossRef]
  144. Mohammed, H.; Salma, S.; Adil, T.; Youssef, O. New segmentation approach for effective weed management in agriculture. Smart Agric. Technol. 2024, 8, 100505. [Google Scholar] [CrossRef]
  145. Liu, G.; Jin, C.; Ni, Y.; Yang, T.; Liu, Z. UCIW-YOLO: Multi-category and high-precision obstacle detection model for agricultural machinery in unstructured farmland environments. Expert Syst. Appl. 2025, 294, 128686. [Google Scholar] [CrossRef]
  146. Cui, X.; Zhu, L.; Zhao, B.; Wang, R.; Han, Z.; Zhang, W.; Dong, L. Parallel RepConv network: Efficient vineyard obstacle detection with adaptability to multi-illumination conditions. Comput. Electron. Agric. 2025, 230, 109901. [Google Scholar] [CrossRef]
  147. Li, Y.; Li, M.; Qi, J.; Zhou, D.; Zou, Z.; Liu, K. Detection of typical obstacles in orchards based on deep convolutional neural network. Comput. Electron. Agric. 2021, 181, 105932. [Google Scholar] [CrossRef]
  148. Liu, H.; Du, Z.; Yang, F.; Zhang, Y.; Shen, Y. Real-time recognizing spray target in forest and fruit orchard using lightweight PointNet. Trans. Chin. Soc. Agric. Eng. 2024, 40, 144–151. [Google Scholar]
  149. Liu, H.; Wang, X.; Shen, Y.; Xu, J. Multi-objective classification method of nursery scene based on 3D laser point cloud. J. Zhejiang Univ. (Eng. Sci.) 2023, 57, 2430–2438. [Google Scholar]
  150. Qin, J.; Sun, R.; Zhou, K.; Xu, Y.; Lin, B.; Yang, L.; Chen, Z.; Wen, L.; Wu, C. Lidar-Based 3D Obstacle Detection Using Focal Voxel R-CNN for Farmland Environment. Agronomy 2023, 13, 650. [Google Scholar] [CrossRef]
  151. Can, T.N.; Phan, K.D.; Dang, H.N.; Nguyen, K.D.; Nguyen, T.H.D.; Thanh-Noi, P.; Vo, Q.M.; Nguyen, H.Q. Leveraging convolutional neural networks and textural features for tropical fruit tree species classification. Remote Sens. Appl. Soc. Environ. 2025, 39, 101633. [Google Scholar]
  152. Mulugeta, A.K.; Durga, P.S.; Mesfin, A.H. Deep learning for Ethiopian indigenous medicinal plant species identification and classification. J. Ayurveda Integr. Med. 2024, 15, 100987. [Google Scholar] [CrossRef]
  153. Xu, L.; Lu, C.; Zhou, T.; Wu, J.; Feng, H. A 3D-2DCNN-CA approach for enhanced classification of hickory tree species using UAV-based hyperspectral imaging. Microchem. J. 2024, 199, 109981. [Google Scholar] [CrossRef]
  154. Muhammad, A.; Ahmar, R.; Khurram, K.; Abid, I.; Faheem, K.; Muhammad, A.A.; Hammad, M.C. Real-time precision spraying application for tobacco plants. Smart Agric. Technol. 2024, 8, 100497. [Google Scholar] [CrossRef]
  155. Khan, Z.; Liu, H.; Shen, Y.; Zeng, X. Deep learning improved YOLOv8 algorithm: Real-time precise instance segmentation of crown region orchard canopies in natural environment. Comput. Electron. Agric. 2024, 224, 109168. [Google Scholar] [CrossRef]
  156. Wei, P.; Yan, X.; Yan, W.; Sun, L.; Xu, J.; Yuan, H. Precise extraction of targeted apple tree canopy with YOLO-Fi model for advanced UAV spraying plans. Comput. Electron. Agric. 2024, 226, 109425. [Google Scholar] [CrossRef]
  157. Zhang, J.; Lu, J.; Zhang, Q.; Qi, Q.; Zheng, G.; Chen, F.; Chen, S.; Zhang, F.; Fang, W.; Guan, Z. Estimation of Garden Chrysanthemum Crown Diameter Using Unmanned Aerial Vehicle (UAV)-Based RGB Imagery. Agronomy 2024, 14, 337. [Google Scholar] [CrossRef]
  158. He, J.; Duan, J.; Yang, Z.; Ou, J.; Ou, X.; Yu, S.; Xie, M.; Luo, Y.; Wang, H.; Jiang, Q. Method for Segmentation of Banana Crown Based on Improved DeepLabv3+. Agronomy 2023, 13, 1838. [Google Scholar] [CrossRef]
  159. Huo, Y.; Leng, L.; Wang, M.; Ji, X.; Wang, M. CSA-PointNet: A tree species classification model for coniferous and broad-leaved mixed forests. World Geol. 2024, 43, 551–556. [Google Scholar]
  160. Xu, J.; Liu, H.; Shen, Y.; Zeng, X.; Zheng, X. Individual nursery trees classification and segmentation using a point cloud-based neural network with dense connection pattern. Sci. Hortic. 2024, 328, 112945. [Google Scholar] [CrossRef]
  161. Bu, X.; Liu, C.; Liu, H.; Yang, G.; Shen, Y.; Xu, J. DFSNet: A 3D Point Cloud Segmentation Network toward Trees Detection in an Orchard Scene. Sensors 2024, 24, 2244. [Google Scholar] [CrossRef] [PubMed]
  162. Seol, J.; Kim, J.; Son, H.I. Spray Drift Segmentation for Intelligent Spraying System Using 3D Point Cloud Deep Learning Framework. IEEE Access 2022, 10, 77263–77271. [Google Scholar] [CrossRef]
  163. Liu, H.; Xu, J.; Chen, W.; Shen, Y.; Kai, J. Efficient Semantic Segmentation for Large-Scale Agricultural Nursery Managements via Point Cloud-Based Neural Network. Remote Sens. 2024, 16, 4011. [Google Scholar] [CrossRef]
  164. Yang, J.; Gan, R.; Luo, B.; Wang, A.; Shi, S.; Du, L. An Improved Method for Individual Tree Segmentation in Complex Urban Scenes Based on Using Multispectral LiDAR by Deep Learning. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 6561–6576. [Google Scholar] [CrossRef]
  165. Chang, L.; Fan, H.; Zhu, N.; Dong, Z. A Two-Stage Approach for Individual Tree Segmentation From TLS Point Clouds. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 8682–8693. [Google Scholar] [CrossRef]
  166. Xi, Z.; Degenhardt, D. A new unified framework for supervised 3D crown segmentation (TreeisoNet) using deep neural networks across airborne, UAV-borne, and terrestrial laser scans. ISPRS Open J. Photogramm. Remote Sens. 2025, 15, 100083. [Google Scholar] [CrossRef]
  167. Jiang, T.; Wang, Y.; Liu, S.; Zhang, Q.; Zhao, L.; Sun, J. Instance recognition of street trees from urban point clouds using a three-stage neural network. ISPRS J. Photogramm. Remote Sens. 2023, 199, 305–334. [Google Scholar] [CrossRef]
  168. Liu, Y.; Zhang, A.; Gao, P. From Crown Detection to Boundary Segmentation: Advancing Forest Analytics with Enhanced YOLO Model and Airborne LiDAR Point Clouds. Forests 2025, 16, 248. [Google Scholar] [CrossRef]
  169. Chen, S.; Liu, J.; Xu, X.; Guo, J.; Hu, S.; Zhou, Z.; Lan, Y. Detection and tracking of agricultural spray droplets using GSConv-enhanced YOLOv5s and DeepSORT. Comput. Electron. Agric. 2025, 235, 110353. [Google Scholar]
  170. Praneel, A.; Travis, B.; Kim-Doang, N. AI-enabled droplet detection and tracking for agricultural spraying systems. Comput. Electron. Agric. 2022, 202, 107325. [Google Scholar] [CrossRef]
  171. Kumar, M.S.; Hogan, J.C.; Fredericks, S.A.; Hong, J. Visualization and characterization of agricultural sprays using machine learning based digital inline holography. Comput. Electron. Agric. 2024, 216, 108486. [Google Scholar] [CrossRef]
  172. Seol, J.; Kim, C.; Ju, E.; Son, H.I. STPAS: Spatial-Temporal Filtering-Based Perception and Analysis System for Precision Aerial Spraying. IEEE Access 2024, 12, 145997–146008. [Google Scholar] [CrossRef]
Figure 1. The number of relevant papers about nursery management tasks in Elsevier (a), IEEE (b), and MDPI (c) databases from 2015 to August 2025.
Figure 2. In the Elsevier database, the proportion of article categories related to plant protection (a) and the publication of research papers (b) during the period from 2015 to August 2025.
Figure 3. Samples of diseased and healthy leaves. Compared with healthy leaves, diseased leaves usually exhibit spots, and symptoms such as yellowing and curling can be observed at the leaf margins. Based on the different types of leaf diseases identified, appropriate pesticides are selected for plant protection work, thereby enhancing the efficiency of plant protection. (Reprinted with permission from Ref. [125]. Copyright 2025 Elsevier.)
Figure 4. Some examples from the BRACOL dataset. The green regions are indicative of the normal areas of the leaves, while the red regions correspond to the leaf sections that have been segmented to represent the diseased parts. Semantic segmentation models are mainly used to identify the specific areas of leaf diseases, thereby facilitating targeted analysis and plant protection operations. (Reprinted with permission from Ref. [65]. Copyright 2024 Elsevier.)
Figure 5. Three examples of pest body parts in the pest identification task. The object detection model uses different bounding boxes to locate multiple body parts of pests, such as the head and wings. By accurately acquiring the location information of pests, the plant protection operation area can be located more precisely, effectively enhancing pest control efficacy. (Reprinted with permission from Ref. [74]. Copyright 2023 Elsevier.)
Figure 6. Pest segmentation results by TinySegformer proposed in [127]. TinySegformer can accurately segment pests from the background images, providing detailed information about the exact shape of the pest. (Reprinted with permission from Ref. [127]. Copyright 2024 Elsevier.)
Figure 8. Schematic illustration of multi-component segmentation for seedlings in nursery settings, including canopy, planting containers, and support poles. (A,C,E) Original YOLOv8 segmentation results; (B,D,F) Improved YOLOv8 segmentation results. By accurately obtaining the canopy information of seedlings, the spraying position can be determined, achieving precise spraying operations in plant protection work. (Reprinted with permission from Ref. [155]. Copyright 2024 Elsevier.)
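For readers who want to reproduce a workflow like the one shown in Figure 8, the short sketch below illustrates how instance segmentation inference is typically run with an off-the-shelf YOLOv8 segmentation model from the ultralytics package. The generic "yolov8n-seg.pt" weights and the "nursery_row.jpg" file name are illustrative assumptions; the improved model of Ref. [155] would require its own trained weights and is not reproduced here.

```python
# Minimal sketch (not the authors' improved model): generic YOLOv8
# instance segmentation inference of the kind adapted for canopy regions.
from ultralytics import YOLO

model = YOLO("yolov8n-seg.pt")        # generic pretrained segmentation weights
results = model("nursery_row.jpg")    # inference on a single image (assumed file)

for r in results:
    if r.masks is None:               # no instances found in this image
        continue
    # r.masks.data holds one binary mask per detected instance;
    # the pixel count gives a rough per-instance (e.g., per-canopy) area
    # that a sprayer controller could use for spray planning.
    for i, mask in enumerate(r.masks.data):
        print(f"instance {i}: {int(mask.sum())} mask pixels")
```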
Table 1. Comparison of various image and point cloud-based deep neural network models.

Method | Types of Inputs | Tasks | Advantages | Disadvantages | References
VGG series | Images | Leaf Disease Detection; Pest Identification; Weed Recognition; Seedling Information Monitoring. | Simple structures; Composed of common modules; Easy to understand and implement. | Computational complexity. | [75,117,118,119,120,129,142,152]
GoogLeNet series | Images | Leaf Disease Detection; Weed Recognition. | Obtains multi-scale features; High accuracy performance. | Complex structures; Computational complexity. | [121,122,131]
ResNet series | Images | Leaf Disease Detection; Pest Identification; Weed Recognition. | Alleviates the problems of vanishing and exploding gradients. | Much redundant information. | [75,122,129,130,142]
MobileNet series | Images | Leaf Disease Detection; Weed Recognition; Target and Non-target Object Detection; Seedling Information Monitoring. | Computational efficiency; Low memory utilization; Rapid inference speed. | Limited feature extraction capabilities; Poor performance in complex scenarios. | [123,124,126,142,147,158]
EfficientNet series | Images | Leaf Disease Detection. | Compound parameter scaling yields high performance and efficiency. | Simultaneously adjusting multiple parameters increases complexity. | [125]
U-Net series | Images | Leaf Disease Detection; Pest Identification; Weed Recognition; Spray Drift Assessment. | Preserves multi-level feature information; Good performance in small-sample conditions. | Complex structures; High memory consumption; Long training times for large-scale data. | [65,128,142,144,171]
DeepLab series | Images | Leaf Disease Detection; Seedling Information Monitoring. | Larger receptive field; Obtains multi-scale features. | High memory consumption; Computational complexity; Lacks robustness in complex scenarios. | [65,126,158]
SegFormer | Images | Pest Identification. | Captures global features; High accuracy performance. | Complex structures; High memory consumption; Computational complexity. | [127]
SegNet | Images | Leaf Disease Detection. | Obtains precise boundary information; Low memory requirements; Fewer model parameters. | Low segmentation accuracy for small-sized objects. | [65]
R-CNN series | Images | Pest Identification; Target and Non-target Object Detection; Seedling Information Monitoring; Spray Drift Assessment. | Relatively high detection accuracy; Good performance for small objects. | Involves multi-step processes; Computational complexity. | [74,147,150,157,170]
YOLO series | Images | Pest Identification; Target and Non-target Object Detection; Seedling Information Monitoring; Spray Drift Assessment. | Adopts a one-stage object detection architecture; Simple network structure. | Low accuracy in detecting small objects; Poor localization precision. | [75,76,134,135,136,137,140,144,145,147,154,155,156,165,168,169]
SSD series | Images | Target and Non-target Object Detection. | Obtains multi-scale features; High accuracy performance. | High memory consumption during training and inference; Computational complexity. | [147]
Multi-view-based Neural Network Models | Point clouds | Seedling Information Monitoring. | Utilize high-performance 2D neural network models; High accuracy performance. | Relationships between different views need to be additionally modeled; Loss of original 3D spatial features. | [165]
Voxel and Mesh-based Neural Network Models | Point clouds | Target and Non-target Object Detection. | Utilize high-performance 3D neural network models; High accuracy performance. | Voxelization requires additional memory and time; Loss of original 3D spatial features. | [150]
Original point cloud-based Neural Network Models | Point clouds | Target and Non-target Object Detection; Seedling Information Monitoring; Spray Drift Assessment. | Fully utilize the original spatial features of point clouds. | Problems such as the inherent disorder of point clouds need to be handled. | [148,149,159,160,161,162,163,164,165,172]
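As a complement to Table 1, the minimal sketch below contrasts the two input pipelines compared throughout this review: an image-based classifier (here a torchvision MobileNetV3, chosen only as a representative lightweight CNN) and a PointNet-style network that consumes raw point clouds through a shared per-point MLP followed by symmetric max-pooling. The model choices, class count, and random tensors are assumptions for illustration, not a specific published implementation.

```python
# Minimal sketch, assuming PyTorch/torchvision; contrasts the image-based and
# original point cloud-based pipelines summarized in Table 1.
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 5  # hypothetical number of disease/seedling categories

# Image branch: a lightweight CNN of the kind favored for leaf disease detection.
image_model = models.mobilenet_v3_small(weights=None, num_classes=NUM_CLASSES)
rgb_batch = torch.randn(2, 3, 224, 224)        # B x C x H x W RGB images
image_logits = image_model(rgb_batch)          # -> shape (2, NUM_CLASSES)

# Point cloud branch: shared per-point MLP + symmetric max-pooling,
# the core idea behind PointNet-style models that consume raw points.
class TinyPointNet(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()
        self.mlp = nn.Sequential(              # same MLP applied to every point
            nn.Conv1d(3, 64, 1), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.ReLU(),
        )
        self.head = nn.Linear(128, num_classes)

    def forward(self, pts: torch.Tensor) -> torch.Tensor:
        # pts: B x 3 x N tensor of xyz coordinates (no color channel)
        per_point = self.mlp(pts)              # B x 128 x N per-point features
        global_feat = per_point.max(dim=2).values  # order-invariant pooling
        return self.head(global_feat)          # B x num_classes

cloud_batch = torch.randn(2, 3, 1024)          # B x 3 x N synthetic point clouds
cloud_logits = TinyPointNet(NUM_CLASSES)(cloud_batch)
print(image_logits.shape, cloud_logits.shape)  # torch.Size([2, 5]) for both
```

The max-pooling step is what makes the point cloud branch invariant to point ordering, which is the key property that the original point cloud-based models listed in Table 1 rely on.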