Review

A Review of Computer Vision and Deep Learning Applications in Crop Growth Management

School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang 212013, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(15), 8438; https://doi.org/10.3390/app15158438
Submission received: 1 July 2025 / Revised: 24 July 2025 / Accepted: 28 July 2025 / Published: 30 July 2025
(This article belongs to the Section Agricultural Science and Technology)

Abstract

Agriculture is the foundational industry for human survival, profoundly impacting economic, ecological, and social dimensions. In the face of global challenges such as rapid population growth, resource scarcity, and climate change, achieving technological innovation in agriculture and advancing smart farming have become increasingly critical. In recent years, deep learning and computer vision have developed rapidly. Key areas in computer vision—such as deep learning-based image processing, object detection, and multimodal fusion—are rapidly transforming traditional agricultural practices. Processes in agriculture, including planting planning, growth management, harvesting, and post-harvest handling, are shifting from experience-driven methods to digital and intelligent approaches. This paper systematically reviews applications of deep learning and computer vision in agricultural growth management over the past decade, categorizing them into four key areas: crop identification, grading and classification, disease monitoring, and weed detection. Additionally, we introduce classic methods and models in computer vision and deep learning, discussing approaches that utilize different types of visual information. Finally, we summarize current challenges and limitations of existing methods, providing insights for future research and promoting technological innovation in agriculture.

1. Introduction

The development of agricultural production can be broadly divided into four stages: the era of traditional agriculture, the era of mechanized agriculture, the era of automated agriculture, and the era of smart agriculture—characterized by unmanned operations and deep integration with computer science [1]. Today, the digital revolution in agriculture is gaining momentum. During this structural transformation, the agricultural sector faces severe labor shortages due to a declining workforce and rural-to-urban migration, while societal demand for agricultural output continues to rise. This imbalance between labor scarcity and growing demand threatens agricultural productivity and sustainability [2]. From a global perspective, as challenges related to food security and resource sustainability intensify, the need to reform agricultural practices and enhance productivity has become more urgent than ever. Pressing issues such as population growth, climate change, and urbanization—compounded by dwindling natural resources—cast a shadow over the future [3]. Agricultural progress plays a vital role in a nation’s economy and survival [4].
As noted in [5], achieving sustainable agriculture requires improving production efficiency, adopting precision farming, and reducing waste on the demand side. Therefore, technological innovation and agricultural transformation are imperative.
Traditionally, acquiring agricultural data was labor-intensive, time-consuming, and prone to errors. However, with the integration of computer science and agriculture, technologies such as remote sensing, digital applications, sensors, hyperspectral imaging, and artificial intelligence have ushered agriculture into an era of intelligent transformation [6].
Computer vision technology plays a pivotal role in this agricultural revolution. Thanks to its advancements, it has been widely adopted in smart agriculture [7]. Deep learning, a field closely related to computer vision, enables automated feature extraction and learning from large-scale datasets, achieving unprecedented capabilities in image understanding [8]. It demonstrates state-of-the-art performance in video analysis, facial recognition, image classification, biomedical applications, and health informatics. Moreover, its exceptional data processing and learning capabilities are proving indispensable in agriculture [9].
Agricultural production is a complex and systematic industry involving a complete workflow from pre-planting preparation to post-harvest handling. This paper categorizes the agricultural production process into three key phases as follows. The first phase is preparation, which requires comprehensive land assessment, soil health evaluation, and crop suitability analysis [10,11]. Modern soil testing has evolved into an intelligent system integrating spectroscopy, electrochemistry, molecular biology, and other technologies, significantly improving production efficiency and serving as a critical component in agriculture [12].
The second phase is crop growth management. This stage encompasses seed selection, growth monitoring, grading and classification, disease detection, and weed identification in fields. It directly determines crop yield, quality, and economic returns. The integration of computer vision and deep learning in this phase maximizes the potential of land, seeds, water, fertilizers, and other resources while mitigating risks [13,14,15,16].
The final phase is harvesting and post-harvest handling. Crop growth environments are complex and seasonal, with harvest periods often concentrated within short timeframes. The methods of harvesting and processing directly affect product quality, shelf life, and market value. Against the backdrop of labor shortages and surging demand for crop yields, the transition toward automation and intelligent solutions has become an inevitable trend [17,18,19].
The rapid advancement of computer vision and deep learning has highlighted their pivotal role in agriculture. The Boolean search terms used in the Web of Science database were: (“agriculture”) AND (“computer vision”) AND (“deep learning”), with the time frame limited to articles from the past five years. The study initially screened nearly 3000 articles based on four thematic aspects. Subsequently, the top 10% most relevant articles were selected, and a manual screening was conducted based on their abstracts, resulting in a final selection of 100 articles (Figure 1). Notably, the number of publications on deep learning-based computer vision applications in agriculture has shown a consistent annual increase, as illustrated in Figure 2.
Prior to our work, several studies have reviewed research in this field. Ref. [20] focused specifically on tea crops, summarizing applications of computer vision and machine learning across three key stages: cultivation, harvesting, and processing. Ref. [21] introduced eight types of image segmentation methods and reviewed deep learning applications in agricultural image segmentation. Ref. [16] analyzed deep learning models and discussed their use in plant disease and pest detection. The study by [17] reviewed computer vision-based weed detection methods, with a particular emphasis on traditional image processing techniques and deep learning approaches for agricultural weed identification. Meanwhile, ref. [8] concentrated on deep learning applications in crop perception for harvesting robots. Most of these reviews either focus on a specific crop, summarize the use of particular technologies in agriculture, or examine deep learning solutions for isolated agricultural challenges.
However, our study makes three primary contributions:
First, we categorize agricultural production into three key phases, with particular focus on applications of computer vision and deep learning in the crop growth management stage. We systematically review four critical aspects: crop identification, crop grading, disease monitoring, and weed detection.
Second, the paper organizes relevant computer vision technologies and deep learning-based model algorithms, providing a classified review of methods based on their utilization of visual information.
Third, we examine current research challenges in the field and discuss future development trends.
This literature review is structured into six sections. The remaining five sections are organized as follows: Section 2 introduces computer vision and synthesizes related technologies. Section 3 reviews several categories of widely used deep learning-based models. Section 4 provides a detailed examination of applications of computer vision and deep learning in agricultural growth management. Section 5 discusses current challenges of computer vision in this field. Finally, Section 6 presents conclusions and future research directions.

2. Computer Vision

Computer vision is a crucial branch of artificial intelligence. It involves receiving, understanding, and interpreting visual information by inputting digital images or video data into a computer, and then reasoning appropriate decisions or executing tasks based on this information. Like many technologies, computer vision is an interdisciplinary field that achieves its functions by imitating the human visual system. However, it relies on algorithms and mathematical models to process data rather than human visual mechanisms. Computer vision is widely applied in fields such as agricultural automation, autonomous driving, and medical imaging.
The human visual mechanism works by the eyes receiving light signals, transmitting them to the brain for analysis, and then reacting based on the analysis results. In contrast, computer vision captures image data through a camera, analyzes the data via algorithms, and finally infers and outputs results. The first step in processing tasks for computer vision is to extract meaningful information from images or videos, such as RGB information, point cloud information, depth information, etc. The obtained information is then processed through cropping, scaling, feature extraction, and other operations. Finally, the processed information is used for training and inference. Therefore, acquiring high-quality original visual data is of vital importance for computer vision. Just as the human visual system relies on the eyes to obtain information, computer vision depends on various visual sensors, namely cameras, to acquire visual information [22]. Common cameras used in computer vision include monocular cameras, stereo vision cameras, structured light cameras, etc. Meanwhile, data generated by multispectral and hyperspectral cameras, spectral sensors, and other devices can also be processed and analyzed through computer vision technologies.
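To make the pipeline above concrete, the following minimal sketch shows the acquisition-to-input steps (cropping, scaling, normalization) using OpenCV and NumPy; the file name and the 224 × 224 target size are illustrative assumptions, not tied to any cited system.

```python
# A minimal sketch of the acquisition -> preprocessing -> model-input pipeline
# described above, assuming OpenCV and NumPy. The file name "field_image.jpg"
# and the 224 x 224 target size are illustrative, not tied to any cited system.
import cv2
import numpy as np

def preprocess(path, size=(224, 224)):
    img = cv2.imread(path)                       # BGR image from a camera or file
    if img is None:
        raise FileNotFoundError(path)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)   # convert to RGB channel order
    h, w = img.shape[:2]
    side = min(h, w)                             # central square crop
    top, left = (h - side) // 2, (w - side) // 2
    img = img[top:top + side, left:left + side]
    img = cv2.resize(img, size)                  # scale to the network input size
    img = img.astype(np.float32) / 255.0         # normalize pixel values to [0, 1]
    return np.transpose(img, (2, 0, 1))          # channels-first layout for training

# tensor = preprocess("field_image.jpg")         # shape: (3, 224, 224)
```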

2.1. Monocular Camera

A monocular camera is a visual system that uses a single camera for image acquisition. It can perceive color and texture information in the environment and has the advantages of simple structure and low cost. However, monocular cameras cannot directly calculate depth information using multi-view geometry principles, and their accuracy is limited. As an easy-to-implement and easy-to-deploy visual sensor, monocular cameras play an important role in various stages of agriculture [23,24]. For example, in [25,26], monocular cameras are used to capture target images as initial image data, which are then processed through deep learning models for training and inference to address irrigation and crop grading issues in agriculture. In [27,28], computer vision is combined with electronic nose technology to achieve practical and efficient solutions in crop storage and quality inspection.

2.2. Stereo Vision Camera

A stereo vision camera obtains depth information about a scene by mimicking the principle of human binocular disparity. It uses two or more cameras to capture the same scene from different angles and reconstructs the three-dimensional information of the target by calculating the differences between corresponding points in the images. Stereo vision cameras can output RGB and depth information, and their robustness in occluded scenes can be enhanced by increasing the number of viewpoints. However, they are highly dependent on object texture, and their performance is limited under low-texture conditions [29]. Currently, mainstream cameras on the market include the Intel RealSense D455, ZED 2 (StereoLabs, San Francisco, CA, USA), and Basler blaze cameras (Basler, Ahrensburg, Germany).
In the agricultural field, stereo vision cameras play an important role. Ref. [30] uses the ZED 2 camera to capture color images and depth point cloud information, enabling the development of a real-time detection and localization method for potted flowers, aimed at the automated management of greenhouse flowers. In [31], videos are captured around crops to extract multi-angle frame images. Real-time weed detection is performed using 3D reconstruction technology. Ref. [32] proposes a non-contact measurement method for vegetable size based on stereo vision cameras and keypoint detection, integrating the pixel coordinates and depth values of keypoints to enhance detection accuracy for small targets. Ref. [33] generates depth maps through stereo disparity calculation, thereby reconstructing a 3D model of the soil surface. This non-contact method based on stereo vision can quantify soil surface roughness and assess tillage quality. Additionally, study [34] presents a plant detection method based on RGB-D data that aims to identify crops and weeds in dense scenes, and it summarizes the limitations of current stereo vision depth data in plant detection.
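The binocular-disparity principle mentioned above reduces to the relation Z = fB/d, where f is the focal length in pixels, B the baseline, and d the disparity. The sketch below illustrates this conversion with hypothetical calibration values rather than the parameters of any camera cited here.

```python
# Illustrative sketch of the binocular-disparity principle behind stereo depth:
# Z = focal_length * baseline / disparity. The focal length and baseline below
# are hypothetical; real values come from the camera's calibration.
import numpy as np

def disparity_to_depth(disparity_px, focal_px=700.0, baseline_m=0.12):
    """Convert disparity values (pixels) to metric depth Z = f * B / d."""
    d = np.asarray(disparity_px, dtype=np.float32)
    depth = np.full(d.shape, np.inf, dtype=np.float32)
    valid = d > 0                                   # zero disparity -> depth undefined
    depth[valid] = focal_px * baseline_m / d[valid]
    return depth

# A feature matched 35 px apart with f = 700 px and B = 0.12 m lies ~2.4 m away:
print(disparity_to_depth(np.array([35.0])))         # [2.4]
```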

2.3. Structured Light Camera

A structured light camera is an active 3D imaging device that projects specific light patterns such as dot patterns, stripes, and encoded graphics onto the surface of a target object. It then captures the deformed patterns using a camera and calculates the depth information of the object based on the principle of triangulation to reconstruct its 3D shape. The camera consists of one or more monocular cameras and a projector. It projects a series of known patterns into the scene and uses the correspondence between the projection frames and the captured frames to obtain depth information based on the degree of pattern deformation. Compared with stereo cameras, it achieves good results even for objects with weak textures, and it attains higher depth accuracy by calculating the deformation of encoded light spots or stripes. Structured light cameras also have drawbacks, such as susceptibility to interference from strong light sources, and their cost is generally higher than that of monocular and stereo vision cameras [35,36].
In [37], a structured light projection system was defined, with structured light projected from the top and sides to cover the entire plant. This addresses the issue of insufficient surface texture on plants, improving matching accuracy and enabling effective segmentation of plants from the background even in low-light environments. Ref. [38] utilizes embedded projectors and industrial cameras to build a structured light camera system, dynamically adjusting the width of the stripes to scale each layer proportionally. This ensures point cloud density and improves small hole detection capabilities. Ref. [39] uses low-cost structured light cameras to perform multi-angle scanning, data calibration and performance verification, and parameter extraction to measure the three-dimensional body of cows. This enables low-cost, non-contact monitoring of cow growth.

2.4. Hyperspectral and Multispectral Cameras

Hyperspectral sensors acquire detailed spectral information of targets through continuous narrow bands, forming nearly continuous spectral curves. Multispectral sensors capture spectral information of targets through multiple discrete wide bands (typically 5–10), with wide band intervals (such as visible light, near-infrared, short-wave infrared, etc.). Hyperspectral sensors have been successfully applied in multiple fields, enabling the simultaneous acquisition of spatial and spectral information. Compared to other visual sensors, hyperspectral sensors have more spectral bands, allowing for a more comprehensive capture of spectral information. Hyperspectral imaging modes can be categorized into reflection, transmission, and interaction modes [40]. When there is a need to increase feature quantity and improve processing speed, a multispectral system can be considered. It can acquire data from multiple information wavelengths, capturing their spectral characteristics [41].
Hyperspectral and multispectral sensors have extensive application value in agriculture [42]. By performing multi-scale decomposition on hyperspectral data and extracting deep spectral features through an encoding–decoding structure, efficient and non-destructive detection of composite heavy metals in lettuce can be achieved. In [43], researchers selected five wavelength bands using a multispectral camera and combined multispectral images with a deep learning model. This approach was used to predict wheat biomass and yield, guiding precise operations of combine harvesters. In [44], hyperspectral imaging technology was employed to acquire hyperspectral data cubes via line scanning, extracting average spectra from regions of interest to construct linear and nonlinear classification models. This enabled rapid, low-cost detection of orange rot based on hyperspectral imaging. It can be seen that recent research has focused on combining deep learning with hyperspectral and multispectral sensors. This enables ultra-high-precision analysis and early detection in agriculture, multidimensional data fusion, and intelligent decision-making, driving rapid development in precision agriculture [45,46,47,48,49].
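As a simple illustration of the region-of-interest spectral averaging used in line-scan workflows such as [44], the sketch below extracts a mean spectrum from a synthetic hyperspectral cube; the cube dimensions and mask are placeholders, not data from that study.

```python
# Sketch of extracting an average spectrum from a region of interest (ROI) in a
# hyperspectral cube (height x width x bands); the random cube and mask are
# placeholders for real line-scan data.
import numpy as np

rng = np.random.default_rng(0)
cube = rng.random((120, 160, 224))           # hypothetical H x W x bands data cube

roi_mask = np.zeros(cube.shape[:2], dtype=bool)
roi_mask[40:80, 60:110] = True               # ROI, e.g. from a segmentation step

mean_spectrum = cube[roi_mask].mean(axis=0)  # one averaged reflectance value per band
print(mean_spectrum.shape)                   # (224,)
```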

2.5. Infrared Vision Sensors and Spectrometers

An infrared vision sensor is an imaging or detection device capable of detecting, receiving, and processing infrared radiation. It captures infrared radiation emitted or reflected by objects to form visible or analyzable infrared images. It is widely applied in fields such as night vision, temperature measurement, and target tracking [50]. Near-infrared spectroscopy is a penetrating technique whose non-invasive and accurate characteristics make it widely used in the agricultural sector. For example, it enables non-destructive testing of agricultural products and quality assessment of food interiors [51,52]. Additionally, combining near-infrared spectroscopy with computer vision enables automatic differentiation of crop characteristics such as appearance, shape, color, and texture, meeting the needs of crop grading [53,54]. It also facilitates crop health monitoring, pest and disease early warning, and yield optimization.
Traditional visible light cameras have limitations in image processing under complex lighting conditions. Using infrared vision sensors to generate infrared thermal images, combined with image processing to extract features, can effectively address issues caused by strong light interference [55,56]. Ref. [57] employs a spectrometer to dynamically collect the transmission spectra of pear fruits, enabling real-time acquisition of spectral data. By visualizing spectral features and analyzing their distribution using visual methods, the system achieves online non-destructive detection of pear browning. Additionally, computer vision techniques were used to fuse multi-modal data such as near-infrared spectra and thermal imaging, enabling real-time online detection of crops and rapid detection of leaks in drip irrigation systems for agricultural applications [58,59].

3. Deep Learning

Deep learning (DL) is a subfield of machine learning inspired by the structure of neurons in the human brain. It automatically learns multi-level feature representations of data through artificial neural networks. Deep learning uses multiple layers of nonlinear transformations to progressively abstract raw data into high-level feature representations, enabling it to effectively perform tasks such as classification, prediction, and generation. The ability to automatically extract high-level features from large datasets without relying on manually designed features has led to the increasing adoption of deep learning across various industries [60,61]. DL involves several basic processes. First, data is propagated layer by layer from the input layer to the output layer to compute the prediction results (forward propagation). Then, based on the prediction error, the chain rule is used to adjust the network parameters (weights and biases) from the output layer backward (backpropagation). By iteratively optimizing the loss function, the prediction error is gradually reduced until convergence is achieved. The potential of DL has led to significant achievements in the agricultural field [62]. In [63], recurrent neural networks (RNNs) are used to process time-series data collected by sensors, proposing a deep learning-based IoT module to enable smart agriculture under various environmental conditions. Convolutional neural networks (CNNs) are a crucial component of deep learning, offering enhanced feature extraction capabilities as the number of layers increases, though deeper networks also suffer from vanishing and exploding gradients [64]. In [65], researchers introduced deep residual networks (ResNet) into the field of agricultural pest identification to address the gradient problems that arise in deep layers. This approach is used for identifying agricultural pests in complex farmland environments. Deep learning emphasizes the importance of feature learning, and numerous classic mechanisms, network structures, and models have been proposed to learn features more efficiently and rapidly [66].
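The forward-propagation/backpropagation cycle described above can be summarized in a few lines; the following sketch uses PyTorch on synthetic data, with layer sizes, learning rate, and epoch count chosen only for illustration.

```python
# A compact sketch of the forward-propagation / backpropagation cycle described
# above, using PyTorch on synthetic data; layer sizes, learning rate, and epoch
# count are illustrative only.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(64, 16)               # synthetic input features
y = torch.randint(0, 4, (64,))        # synthetic class labels

for epoch in range(20):
    logits = model(x)                 # forward propagation through the layers
    loss = loss_fn(logits, y)         # prediction error (loss function)
    optimizer.zero_grad()
    loss.backward()                   # backpropagation via the chain rule
    optimizer.step()                  # adjust weights and biases
```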

3.1. Attention Mechanism

Like computer vision, the attention mechanism is inspired by human biological structure and mimics human cognitive attention. It is a classic and practical technique in the field of deep learning. When processing large amounts of information, it focuses on key parts, assigns different weights based on the importance of the information, and determines which parts of the input data are more worthy of attention, thereby improving the model’s efficiency in utilizing key information [29]. The attention mechanism was first introduced into neural networks in [67], followed by the development of classic attention mechanisms such as squeeze-and-excitation attention [68] and multi-head attention [69] (Figure 3).
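As a concrete illustration, the sketch below implements a squeeze-and-excitation channel-attention block of the kind cited above [68] in PyTorch; the channel count and reduction ratio are illustrative defaults rather than settings from any cited study.

```python
# A compact squeeze-and-excitation (SE) channel-attention block of the kind
# cited above [68], written in PyTorch; the channel count and reduction ratio
# are illustrative defaults.
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)            # "squeeze": global spatial context
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                              # per-channel weights in (0, 1)
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                   # "excitation": reweight channels

feat = torch.randn(2, 64, 32, 32)
print(SEBlock(64)(feat).shape)                         # torch.Size([2, 64, 32, 32])
```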
In the implementation of smart agriculture and precision agriculture, the attention mechanism is one of the key technologies. It helps agricultural models analyze and make decisions more efficiently by focusing on key data. To better estimate peach blossom density, the attention module was introduced into the feature extraction and multi-scale feature fusion stages in [70] to enhance the network’s ability to focus on key details of peach blossoms and suppress interference from irrelevant backgrounds. In [71], to precisely identify and distinguish similar leaf diseases, squeeze-and-excitation attention was used to enhance the weights of important disease features, effectively focusing on disease-affected areas and achieving a high-precision disease diagnosis model. Additionally, in [72], pixel-level spatial self-attention (P-FSA) and block-level channel self-attention (B-FSA) are employed. By capturing subtle local features and distinguishing different global features, efficient feature extraction is achieved in fine-grained crop disease classification.

3.2. Transformer-Based Models

Transformer is a deep learning model architecture based on self-attention mechanisms proposed in 2017 [69]. Since its introduction, Transformer has been incorporated into numerous papers, yielding innovative results. In the field of deep learning, many tasks (such as text generation and image segmentation) can be implemented using the Transformer architecture (Figure 4). In 2020, ref. [73] proposed the Vision Transformer (ViT), a visual Transformer designed for image classification. This research significantly advanced the development of Transformers, leading to the emergence of data-efficient image transformers, shunted transformers, and cross-former architectures [74]. These studies have played an important role in the agricultural field.
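The core of the architecture is scaled dot-product self-attention, sketched below as a didactic single-head version in PyTorch; the tensor sizes (196 patch tokens, 64-dimensional embeddings, roughly a 14 × 14 ViT patch grid) are illustrative and not taken from any cited model.

```python
# A didactic, single-head version of the scaled dot-product self-attention at
# the core of the Transformer; tensor sizes (196 patch tokens, 64-dim
# embeddings) are illustrative, not taken from any cited model.
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    q, k, v = x @ w_q, x @ w_k, x @ w_v                   # project tokens to Q, K, V
    scores = q @ k.transpose(-2, -1) / (k.shape[-1] ** 0.5)
    weights = F.softmax(scores, dim=-1)                   # attention over all tokens
    return weights @ v                                    # weighted sum of values

tokens = torch.randn(1, 196, 64)       # e.g. a 14 x 14 grid of image patch embeddings
w = [torch.randn(64, 64) for _ in range(3)]
print(self_attention(tokens, *w).shape)                   # torch.Size([1, 196, 64])
```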
Ref. [75] focused on crop disease issues in agricultural fields, where it is challenging to focus on diseased areas in complex backgrounds. By embedding the Transformer Encoder into an improved lightweight convolutional neural network, long-range feature dependencies were established to extract global features from diseased images. This approach achieved excellent performance across multiple datasets. Similarly, addressing agricultural disease issues, ref. [76] aimed to resolve the instability of existing models across spectral ranges (visible light and near-infrared). The paper employs various Vision Transformer (ViT) architectures and their hybrid variants, combined with multispectral imaging technology, to achieve high-precision early detection of plant diseases in complex environments.

3.3. Image Segmentation and Image Detection Models

Image segmentation is one of the core tasks in computer vision, with the goal of dividing an image into semantically meaningful regions. Depending on the task requirements, image segmentation tasks can be further categorized into three types: semantic segmentation, instance segmentation, and panoptic segmentation. The purpose of semantic segmentation is to assign a category label to each pixel in an image, without distinguishing between different instances. Instance segmentation requires distinguishing between different instances of the same category. Finally, panoptic segmentation combines the characteristics of semantic segmentation and instance segmentation [29,77]. Object detection is another core task in computer vision, aiming to identify objects in images and represent their locations using bounding boxes or masks. Unlike segmentation models, detection models must simultaneously address the questions of “what” and “where”. Detection models can be categorized into two types: two-stage models, which first generate candidate regions and then classify and regress these regions, and single-stage models, which directly predict object categories and locations without requiring candidate region generation [78]. Unlike traditional methods that rely on manually designed features, deep learning provides detection models with powerful feature extraction capabilities and end-to-end training frameworks, enabling them to automatically learn the categories and locations of target objects. Deep learning-based image segmentation and detection are developing rapidly and hold great promise in the agricultural field. In [79], deep learning-based object detection and image segmentation techniques are used to assess tomato ripeness and locate stems. Detection accuracy of tomato stems in complex agricultural environments is improved, enabling efficient robotic autonomous harvesting. Additionally, these techniques can be applied to crop monitoring, pest and disease identification, automated harvesting, and irrigation management in precision agriculture [80,81].
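A basic quantity in training and evaluating such detection models is the intersection-over-union (IoU) overlap between predicted and ground-truth boxes, sketched below with illustrative box coordinates.

```python
# The intersection-over-union (IoU) overlap between a predicted and a
# ground-truth box, a basic quantity in training and evaluating the detection
# models discussed above. Boxes are (x1, y1, x2, y2); values are illustrative.
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)            # intersection rectangle
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    return inter / (area_a + area_b - inter) if inter else 0.0

print(iou((10, 10, 50, 50), (30, 30, 70, 70)))         # ~0.143
```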

4. Agricultural Applications

Computer vision and deep learning are applied to various stages of agricultural processes. This section summarizes their applications in the growth management stage. For this stage, we review four main aspects: crop identification and detection, grading and classification, disease monitoring, and weed detection.

4.1. Crop Identification and Detection

Crop identification and localization play a crucial role in modern agriculture, as it is essential to accurately monitor crop growth conditions and locations throughout the agricultural process. In the past, crop identification and localization primarily relied on manual visual inspections in the field, which provided a direct means of monitoring crop growth status. However, the current labor shortage has led to increased costs associated with this method. With the development of computer technology and the widespread adoption of automation, manual crop identification monitoring is gradually being replaced, which helps alleviate labor shortages and better monitor real-time crop dynamics [29]. Crop identification can estimate crop density and locate crops. Density estimation aids in planning and managing planting, ensuring crops are planted at a reasonable density to increase yield. Location identification of crops can serve as guidance for automated harvesting, improving efficiency and saving human resources, and can also facilitate operations based on crop types [82,83]. However, compared to manual inspection, computer vision faces various challenges during crop identification and detection, such as background interference, changes in lighting, and object occlusion (Figure 5). Therefore, a review of papers on crop identification and detection research is provided (Table 1).
Detecting and identifying crops in complex environments has always been a challenge. As mentioned in [14], apples face issues such as leaf obstruction, dense fruit clusters, and insufficient lighting at night. Additionally, apples require bagging during their growth process, further complicating identification and localization. This study first replaces the native network of YOLOv4 with a lightweight network as the feature extraction network, significantly reducing the number of model parameters (the model size is only 29.8 MB, an 87.8% reduction from the original YOLOv4). Second, data augmentation was applied to increase background diversity and small object samples, enhancing the model’s robustness in complex environments. Finally, spatial pyramid pooling and multi-layer feature fusion mechanisms were combined to enhance detection capabilities for small objects. While reducing model size and improving detection efficiency, the accuracy remained largely unchanged. Regarding improving detection accuracy, ref. [85] also addresses the challenge of identifying apples in complex backgrounds. This study converts RGB images to R–G grayscale to enhance fruit-background contrast. It segments images into irregular patches for classification, improving robustness to lighting variations and reducing bagging-induced misidentification. An optimized watershed algorithm boosts segmentation efficiency. The method achieves significantly higher accuracy with minimal preprocessing overhead—an acceptable trade-off for precision-critical applications.
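The R–G contrast idea used in [85] can be illustrated as follows; the random stand-in image and the clipping step are assumptions for demonstration rather than the exact procedure of that study.

```python
# Sketch of an R-G grayscale conversion for raising fruit-background contrast,
# in the spirit of [85]; the random image and the clipping step are assumptions
# for demonstration, not that study's exact procedure.
import numpy as np

rng = np.random.default_rng(1)
rgb = rng.integers(0, 256, size=(480, 640, 3), dtype=np.uint8)   # stand-in RGB image

r = rgb[..., 0].astype(np.int16)
g = rgb[..., 1].astype(np.int16)
rg = np.clip(r - g, 0, 255).astype(np.uint8)   # red fruit -> bright, green foliage -> dark
print(rg.shape, rg.dtype)                      # (480, 640) uint8
```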
For crop identification and detection, it is necessary not only to detect individual objects but also to estimate their density, which is particularly important. In [83], efforts were made to address the issue of missed detections caused by fruit occlusion and overlap, as well as to quantitatively analyze the spatial distribution of strawberries to guide reasonable planting density. This paper introduces the squeeze-and-excitation attention mechanism and the Efficient Intersection over Union (EIoU) loss function into the YOLO module. The squeeze-and-excitation attention mechanism enhances focus on strawberry features, while the EIoU loss function optimizes bounding box regression. This improves the model’s detection accuracy in scenarios with fruit occlusion and overlap. The model uses Kernel Density Estimation (KDE) and nearest neighbor analysis (NNA) for density estimation, achieving the first end-to-end analysis from object detection to density distribution in strawberry images, providing more detailed spatial distribution information. The model outperforms YOLOv3-tiny in terms of accuracy. Additionally, the model detects high-density regions with clear boundaries that align with actual distributions. However, the morphology of crops changes at different stages, and density detection when fruits are small is not addressed. In [82], the proposed model can handle the diversity of different flowering periods and fruit maturity levels and addresses the detection of small-sized, densely distributed objects. This paper presents a U-Net model with an optimized ConvNeXt-T encoder, reducing parameters while maintaining performance. The model integrates density estimation and segmentation branches, using a Convolutional Block Attention Module (CBAM) to enhance target features and suppress background noise, improving robustness in complex farm environments. Random image flipping during training boosts generalization. However, higher input resolutions trade efficiency for accuracy, and insufficient labeled data may cause overfitting. The model is suitable for resource-limited devices (e.g., drones or smartphones) for large-scale orchard surveys.
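In the spirit of the kernel density estimation step in [83], the following sketch estimates a planting-density map from detected fruit centers; the points, bandwidth, and evaluation grid are synthetic and not drawn from that study.

```python
# Kernel density estimation over detected fruit centers, in the spirit of the
# density analysis in [83]; the points, bandwidth, and grid are synthetic and
# not drawn from that study.
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(2)
centers = rng.uniform(0, 10, size=(200, 2))          # (x, y) of detected fruits, in meters

kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(centers)

# Evaluate the density on a coarse grid to locate high-density planting regions
xs, ys = np.meshgrid(np.linspace(0, 10, 50), np.linspace(0, 10, 50))
grid = np.column_stack([xs.ravel(), ys.ravel()])
density = np.exp(kde.score_samples(grid)).reshape(xs.shape)   # relative fruits per unit area
print(density.max(), density.mean())
```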
Different crop morphologies exhibit distinct characteristics. Ref. [86] achieves a balance between speed and accuracy through lightweight design and feature enhancement modules. Using images captured under various lighting conditions as datasets, it successfully enables all-weather tea bud detection. The model’s nighttime performance is only 0.76% lower than under optimal lighting. However, its effectiveness is limited in occlusion scenarios, as buds with over 75% occlusion were excluded during data annotation. Ref. [87] introduces a deformable large kernel attention mechanism at the network’s output to enhance detection capabilities for occluded and deformed targets. Frequency Domain Feature Distillation (FreeKD) is used to transfer knowledge from large models to lightweight models. In addition, a multi-kernel dynamic convolution module enhances the model’s feature extraction capabilities for tomatoes of different sizes and shapes by adaptively aggregating convolution kernels of varying sizes, thereby achieving high-precision tomato detection. Traditional models often struggle to accurately detect small fruits with variable shapes. To address this, ref. [89] focuses on target detection research for small red pears. The proposed method incorporates a Spatial-Channel Decoupled Downsampling (SCDown) module to optimize the downsampling process while preserving fine details of small targets. The research expanded an original dataset of 395 images through data augmentation, ultimately creating a dataset of 1580 images. However, the dataset has several limitations: it contains few occluded samples; the images were primarily captured during daytime with insufficient representation of extreme lighting conditions; and the experiments were conducted in a single orchard environment without cross-regional or cross-crop validation of model generalizability. Kiwifruit detection also faces challenges such as dense growth, leaf obstruction, and fruit overlap. Therefore, ref. [92] focuses on kiwifruit identification and localization. This study combines the Coordinate Attention Mechanism (CA) with YOLO for feature extraction and optimizes the loss function by integrating Focal Loss (to address class imbalance), EIoU Loss (to improve bounding box localization accuracy), and confidence loss. It uses stereo vision localization and 2D coordinates to output the real-world location of kiwifruit. This achieves real-time recognition and localization of kiwifruit, facilitating subsequent management and harvesting.

4.2. Crop Grading

As living standards improve, people’s demands for crop quality continue to rise. In agricultural processes, scientifically scoring and grading crops is a core means of enhancing industry efficiency [93]. By establishing a standardized evaluation system, it is possible to precisely guide production management, achieve the selection of high-quality varieties, and make intelligent decisions on harvesting timing based on maturity grading during the harvesting phase to reduce losses. In the processing phase, grading based on quality differences enables optimized processing to maximize returns [94]. In crop growth management, classifying crops by grade is a critical step in meeting the needs of consumers and the processing industry for high-quality crops (Figure 6). Manual visual inspection lacks consistency; even with standards in place, errors readily occur, and the process is labor-intensive [95]. Therefore, leveraging deep learning, computer vision, and image processing technologies (Table 2) to enhance the accuracy of grading systems is imperative.
Traditional crop grading relies on manual labor or simple machine vision (such as SVM and KNN), which struggles to handle complex features. Therefore, deep learning was utilized for apple grading in [96]. This study made three improvements to the YOLOv5 model. The Mish activation function replaced the original function to enhance deep feature extraction capabilities. The DIoU_Loss loss function was used to optimize bounding box regression, improving localization accuracy and convergence speed. The squeeze-and-excitation attention module was embedded into the backbone network to enhance attention on the key features of apples. The balance between accuracy and efficiency is achieved through the collaborative optimization of three components. However, due to the imbalance in the dataset setup, with insufficient data on grade-3 apples, the model is prone to overfitting in detecting grade-3 apples. Ref. [104] aims to balance precision with model lightweighting. First, it leverages target information from multiple perspectives to address the limitation of traditional methods, which rely on single-perspective images and cannot fully capture apple surface information. Then, by introducing lightweight convolutions and dynamic convolutions into the detection module, the feature extraction capability is enhanced while reducing computational load, thereby improving inference speed. Additionally, this study embeds the Swin Transformer module into the backbone network, utilizing a window self-attention mechanism to integrate global features and enhance grading accuracy. While achieving model lightweighting, grading accuracy is maintained. However, this study remains focused on apples as a single crop, and its generalization capabilities also have limitations.
Compared to crops like apples, which have large fruits and dispersed targets, crops like tea and tobacco have leaf-based grading targets. Leaves are thin and prone to sticking together, with overlapping leaves and difficulty in determining maturity. Additionally, grading and classification for such crops are strict, with each grade corresponding to a specific price. For tea, traditional manual grading is inefficient and costly, and existing mechanical harvesting often mixes different grades of tea, affecting subsequent processing quality. Ref. [97], based on the YOLOv8n model, uses the SPD-Conv module to enhance feature extraction capabilities for low-resolution and small-target objects like tea leaves. Additionally, the Super-Token Vision Transformer (SViT) is employed to reduce redundant information interference and improve perception of small targets. To better classify tea leaves, data is collected and enhanced to cover various lighting and weather conditions, with different grades of tea leaves annotated. This improved the accuracy of automatic tea grading. Similarly, to address the issues of low efficiency and strong subjectivity in traditional manual grading, Lu et al. [98] proposed a method in which tobacco leaves are classified into different grades based on their local features. To capture the subtle local features of tobacco leaves, a dual-branch ensemble model was designed. The improved A-ResNet-65 network serves as the global branch to scale the entire image proportionally. The local branch uses the ResNet-34 network to crop local image blocks from high-resolution images. The results from the dual branches are integrated using a weighted voting method for feature fusion. In experimental testing on eight different grades of tobacco leaves, the final model achieved a classification accuracy of 91.30%. However, dynamic testing was not addressed. Study [99] also employs a dual-branch approach, embedding the Convolutional Block Attention Module (CBAM) into the ResNet50 network. The designed FPN-CBAM-ResNet50 architecture integrates low-level detail features with high-level semantic features, enhancing the model’s ability to capture subtle differences. VGG16 serves as the global branch network, employing a bilinear pooling strategy for feature fusion. This method was tested on a dynamic production line, achieving dynamic high-speed grading of tobacco leaves. However, the model has a large number of parameters, resulting in high computational complexity to maintain accuracy. Additionally, training on laboratory data may struggle to generalize to tobacco from other regions or years, posing domain adaptability issues.
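The weighted-voting fusion used in such dual-branch graders can be illustrated very simply; the branch outputs and the 0.6/0.4 weights below are invented for the example and are not the values reported in [98] or [99].

```python
# Weighted-voting fusion of a global branch and a local branch over grade
# classes; the softmax outputs and the 0.6/0.4 weights are invented for the
# example and are not the values reported in [98] or [99].
import numpy as np

global_probs = np.array([0.10, 0.70, 0.15, 0.05])   # whole-leaf (global) branch output
local_probs = np.array([0.05, 0.40, 0.50, 0.05])    # cropped-patch (local) branch output

fused = 0.6 * global_probs + 0.4 * local_probs       # weighted vote over grades
print(fused, "-> predicted grade index:", int(np.argmax(fused)))
```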
In addition to the aforementioned studies, ref. [100] proposed a method for evaluating the quality of carrots. In this study, a deep convolutional generative adversarial network (DCGAN) was used to generate high-quality carrot images, and a differential hashing algorithm was employed to remove low-quality or duplicate samples from the generated images. The study utilized a ResNet-18 as the main network, incorporated a squeeze-and-excitation attention module to enhance the weighting of key features, and optimized the classifier module to retain more details. In addition to progress in cross-crop classification, high-precision detection was achieved in carrot quality evaluation. Ref. [101] also utilized the Convolutional Block Attention Module (CBAM) to enhance key feature extraction. The improved network framework enabled multi-scale feature fusion, achieving efficient and precise classification of mangosteens.

4.3. Disease Monitoring

Ensuring healthy crop growth and maintaining yields are fundamental objectives of agricultural processes. Diseases can cause significant reductions in crop yields, resulting in substantial damage to agriculture [105]. In modern society, the continuous increase in agricultural yields and efficiency is largely attributed to the application of deep learning and computer vision in crop disease monitoring [106]. Therefore, integrating computer vision and deep learning for disease detection, classification, and diagnosis is a critical component of crop growth management and a major trend in development. Crop diseases can be categorized into plant diseases and insect pests (Figure 7). Plant diseases are caused by pathogens (such as fungi, bacteria, or viruses) or non-biological factors (such as nutrient deficiencies or pesticide damage), leading to abnormal plant physiological functions and resulting in growth inhibition, yield reduction, or quality deterioration. Insect pests, on the other hand, cause crop damage through direct feeding by harmful insects or indirect transmission of pathogens. A summary of relevant research is presented in Table 3.
To the human eye, obvious shapes such as spots on leaves can be quickly detected. However, some diseases do not have obvious symptoms, and by the time obvious symptoms appear, the plants have already suffered serious damage [106]. Computer vision based on deep learning can efficiently detect and classify plant diseases. In [115], researchers extracted the circumscribed rectangle aspect ratio (CRAR) feature and the Histogram of Oriented Gradients (HOGs) feature to distinguish morphological differences between diseased and healthy rice plants. Subsequently, a convolutional neural network was used for feature fusion to improve recognition accuracy. The paper designed two modes—offline and online—to work in tandem, adapting to poor network environments. This achieved the goal of rapid field detection of rice black stem disease. Similarly, ref. [116] also proposed a rapid detection method for black stem rust. However, neither paper addressed multi-scale feature fusion, which is suboptimal for small object detection. In response, ref. [107] proposed YOLOv8-GDCI to address the real-time precise detection of Phytophthora blight on different parts of chili plants. This study employs the RepGFPN feature fusion network for parallel processing and cross-level feature fusion, enhancing the expression capability of multi-scale features. The dynamic upsampling algorithm expands the receptive field. Coordinate attention (CoordAtt) is introduced into the backbone network, combining channel and spatial position information to enhance the ability to capture key features of diseases. This mechanism effectively captures dense lesions and adapts to dense lesion scenarios. For the dataset used in the experiment, there are far more leaf labels than stem labels, which may result in the model having weaker detection capabilities for stem lesions. Additionally, the experimental data was collected under relatively uniform lighting conditions, and results may vary in real-world environments such as cloudy days or nighttime. Therefore, when deploying in high-resource scenarios, the full model can be used with high-resolution images to balance speed and accuracy. When using edge devices such as handheld terminals, model lightweighting can be performed to sacrifice a small amount of accuracy for real-time performance. For extreme lighting conditions, infrared sensors or similar devices can be used in conjunction.
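As an illustration of the hand-crafted features named in [115], the sketch below computes a HOG descriptor with scikit-image and appends a crude aspect-ratio term; the image, HOG parameters, and ratio definition are assumptions for demonstration only, not that study's settings.

```python
# Hand-crafted features in the spirit of [115]: a HOG descriptor plus a crude
# aspect-ratio term, concatenated into one vector. The random image, HOG
# parameters, and ratio definition are assumptions for demonstration only.
import numpy as np
from skimage.feature import hog

rng = np.random.default_rng(3)
leaf_gray = rng.random((128, 64))                        # stand-in grayscale leaf crop

hog_vec = hog(leaf_gray, orientations=9, pixels_per_cell=(16, 16),
              cells_per_block=(2, 2))
aspect_ratio = leaf_gray.shape[1] / leaf_gray.shape[0]   # crude bounding-rectangle ratio

features = np.concatenate([hog_vec, [aspect_ratio]])     # fused hand-crafted feature vector
print(features.shape)
```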
As plant species and cultivation techniques continue to evolve, the number of plant diseases is also increasing. Different measures should be taken to address different plant diseases [117]. Therefore, classifying and accurately identifying plant diseases can effectively ensure crop health. Ref. [117] combines features extracted using deep learning with traditional features for multi-class classification of plant leaf diseases. A depthwise separable convolutional network with fewer parameters is used to extract high-level hidden features. The Local Binary Pattern (LBP) captures local texture information from leaf images. These two types of features are directly concatenated and classified using a Softmax classifier. Traditional features struggle to accurately represent the target, and manual extraction is time-consuming and labor-intensive. Therefore, in [108], a dual-branch model combining convolutional neural networks and Vision Transformers (ViTs) is proposed, abandoning traditional features for crop disease classification. Crop disease lesions are complex, requiring the integration of local and global details while considering contextual information. This study employs a dual-branch parallel approach to extract local and global features and dynamically fuse the two types of features. The Transformer branch uses Separable Self-Attention (SSA) to reduce computational overhead. This dynamically fused dual-branch structure achieves high-precision, low-cost crop disease classification.
Pest control is another measure to ensure crop yield. Ref. [109] focuses on the detection of flying insects and small target insects. Through a sliding window mechanism, this study avoids the distortion issues caused by traditional models directly scaling images. The backbone network introduces EfficientNetV2-S to enhance feature extraction capabilities. Decoupling classification and localization tasks reduces task conflicts. These customized improvements significantly enhance the accuracy of small target pest detection. However, the model lacks instance segmentation or edge detection capabilities, and densely distributed pests are easily misidentified as single targets. No domain adaptation training was conducted for environments with shadows, reflections, or low light. In practical deployment, it can be used with mobile monitoring devices such as drones. Similarly, for small target pest detection, ref. [118] expanded the experimental scope to complex field environments. The system evaluated the performance of six YOLO variants in small target pest detection, deployed the optimized model on mobile devices through compression, and achieved efficient inference on mobile devices. Ref. [110] proposed a deep learning model named ResNet-SA. This model is based on a residual network (ResNet) and a self-attention mechanism (SA), enhancing the model’s ability to extract key features. However, the dataset used in that study contains only 3150 original images, which were expanded through basic geometric transformations. The model proposed in the paper achieves an accuracy rate of 99.8% on the training set. However, in real-world scenarios, the morphology, posture, and lighting conditions of pests may be more complex. The model’s performance on more complex datasets or in real agricultural environments may decline. Additionally, the risk of overfitting due to small datasets should not be overlooked. Therefore, the high performance (99.80%) on the test set in this paper still requires further validation through more complex real-world scenarios or cross-dataset validation.
Ref. [111] combines the strengths of single-stage and two-stage detection networks to balance accuracy and speed. The single-stage network uses Inception for multi-scale feature extraction, improving detection of occluded targets. The two-stage network is based on Faster-RCNN. Clustering algorithms optimize anchor box sizes to enhance localization accuracy. This multi-scale approach gives the model clear advantages in detecting pests of varying sizes. The dataset uses overlapping annotations to help the model learn overlapping features, reducing missed detections. The method performs well even in complex field conditions with pest clusters or leaf occlusion.

4.4. Weed Detection

In addition to pests and diseases, weeds are also one of the key factors affecting agricultural yields. The widespread use of chemical herbicides places a significant burden on the ecological environment and undermines the sustainability of agricultural development [15]. Precise application of herbicides directly onto weeds can significantly reduce the use of chemicals, weed control costs, and environmental damage [119]. Some weeds resemble crops in appearance, while others overlap and shade crops (Figure 8). Therefore, accurately distinguishing between crops and weeds is crucial for precise herbicide application. A summary of relevant research is presented in Table 4.
Weed target sizes vary widely, and detecting small targets is challenging. Factors such as lighting and occlusion in complex farmland environments also affect detection performance. Therefore, ref. [120] proposed HAD-YOLO based on YOLOv5 to address the challenge of precise weed detection in farmland. This paper adopts the lightweight HGNetV2 architecture, combined with Depthwise Separable Convolution (DWConv) to reduce parameter counts and enhance feature extraction capabilities. The Scale Sequence Feature Fusion (SSFF) module and Triple Feature Encoding (TFE) module enhance the model’s ability to process features of different scales. Finally, the Integrated Dynamic Attention Mechanism (Dyhead) is used to improve classification and localization accuracy. This strategy of multi-scale fusion and lightweight modeling enables the method to achieve good accuracy on both greenhouse and field datasets. However, its performance is average under occlusion conditions. Ref. [121] also builds on the YOLOv5 model for precise weed detection in rapeseed fields. This study faced the challenge of varying weed sizes (especially small targets) while also addressing the difficulty of distinguishing weeds from similar-looking rapeseed seedlings. Unlike [120], ref. [121] incorporates a Swin Transformer encoding module into the model to enhance global feature extraction while reducing computational complexity. The study employs a Bidirectional Feature Pyramid Network (BiFPN) structure for weighted multi-scale feature fusion. During the feature fusion process, the weights of different features are assigned by the Normalization-based Attention Module (NAM) attention mechanism, enabling adaptive fusion of multi-scale features. The introduction of Swin Transformer allows the module to better capture global contextual information, demonstrating certain improvements in addressing occlusion issues. This approach effectively balances computational efficiency with feature representation capability. A comparison of these two methods shows that the model in [120] is more lightweight and suitable for embedded deployment. However, it involves few weed species in its design and has insufficient generalization. It is suitable for real-time single weed detection tasks. Ref. [121] can handle occluded targets well through the Swin Transformer. It uses data augmentation to effectively solve the problem of data imbalance. It can identify multiple weed species and is more suitable for actual farmland. However, it has low real-time performance and high computational complexity. It is suitable for weed detection in complex environments with low real-time requirements.
In [127], a BP neural network model is constructed using multiple features such as color, texture, and shape as inputs for weed detection inference. Color features are extracted using the Color Moments algorithm in combination with the RGB and HSV color spaces. Texture features are extracted using the Gray-Level Co-occurrence Matrix (GLCM) and Local Binary Pattern (LBP). Shape features are extracted using Hu invariant moments and geometric parameters. After the multi-feature complementary fusion, the input is fed into a backpropagation network for weed detection in asparagus fields. For the feature fusion method, ref. [124] employs a multi-polar feature weighted fusion strategy. Features are weighted using Softmax to highlight crop and weed features while suppressing background interference. Cross-layer feature fusion is then performed, combining texture and shape information from different layers. This dynamic strategy significantly improves the accuracy of crop and weed segmentation. However, in the rice seedling dataset, the model misclassifies crop shadows as weeds. This indicates that the model is not robust enough against light changes and shadows. This may be because there are few shadow samples in the training data, or the annotations are insufficient. When tested in actual field environments, its performance may decline due to complex lighting and other conditions.
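As an illustration of one of the feature groups in [127], the sketch below computes per-channel color moments (mean, standard deviation, skewness) with NumPy and SciPy; the random image is a stand-in for a field photo, and the exact moment set used in that study may differ.

```python
# Per-channel color moments (mean, standard deviation, skewness), one of the
# feature groups used in [127]; the random image is a stand-in for a field photo.
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(4)
img = rng.random((240, 320, 3))                    # RGB image with values in [0, 1]

channels = img.reshape(-1, 3)                      # flatten to (pixels, 3)
color_moments = np.concatenate([
    channels.mean(axis=0),                         # first moment per channel
    channels.std(axis=0),                          # second moment per channel
    skew(channels, axis=0),                        # third moment per channel
])
print(color_moments.shape)                         # (9,) feature vector for the classifier
```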

4.5. Actual Case Applications

After a model algorithm is proposed, it should be applied to actual agricultural practices to promote agricultural economic development. The flowchart for deploying computer vision to practical applications is shown in Figure 9. In this process, factors such as initial costs, environmental constraints, and implementation effects need to be comprehensively considered. Taking a smart farm (33 hectares) in Jiangsu as an example, this farm adopts a collaborative scheme of “drone inspection + ground precision spraying” to realize intelligent weed management. The farm collects vegetation spectral features through a multispectral camera and uses a detection model to distinguish crops from 12 common weed species, guiding the variable spray device for targeted application. In the specific deployment process, a Matrice 350 RTK drone and a multispectral camera are used for large-scale mapping of weed distribution. The ground edge computing equipment adopts an industrial computer equipped with an NVIDIA RTX A5000 graphics card to ensure real-time model inference and decision-making. A variable spray device controlled by intelligent valves is used for targeted application. In addition, licenses for the detection model and the data management platform provide algorithm support and data storage. Together, these items constitute the farm’s initial investment of approximately USD 40,000.
After the equipment is deployed, intelligent systems still require farmers to carry out day-to-day operations. Farmers need 1–2 days of training to master drone takeoff and landing, adjustment of spraying parameters (such as spray pressure), and daily maintenance of equipment. They also need basic smartphone operation skills. In addition to operators, there must be technical maintenance personnel. They are responsible for model updates, equipment troubleshooting (such as camera calibration and GPU driver maintenance), and data report interpretation. Farmland environments differ from test scenarios, with complex background conditions. Taking this farm as an example, under strong light, multispectral cameras are susceptible to reflection interference, which reduces detection accuracy. Thus, operations must be limited to before 9 a.m. or after 5 p.m. When air humidity exceeds 85%, drone battery life shortens, and ground equipment sensors are prone to moisture damage, requiring supporting waterproof enclosures. After implementing these measures, pesticide usage decreased from 180 L/ha to 99 L/ha, a 45% reduction compared to full-field spraying. Weeding effectiveness improved, increasing crop yield by 5%. Meanwhile, weed detection efficiency improved significantly, rising from 133 m²/h (manual inspection) to 5300 m²/h (drone inspection). Traditional chemical weed control requires approximately USD 30,000 annually for pesticide and labor costs. In smart agricultural farms, agricultural output increases by USD 5000, with annual equipment maintenance and pesticide expenses around USD 14,000. Compared with traditional methods, the annual gain is approximately USD 20,000. The initial equipment investment can be recovered within 2 to 3 years.
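The payback estimate can be reproduced from the figures above with simple arithmetic; the sketch below uses only the numbers reported for this case.

```python
# Back-of-the-envelope payback estimate using only the figures reported for
# this farm case (USD); a simple arithmetic sketch, not an economic model.
initial_investment = 40_000          # drone, edge computer, sprayer, licenses
traditional_annual_cost = 30_000     # pesticide + labor under full-field spraying
smart_annual_cost = 14_000           # equipment maintenance + reduced pesticide use
extra_output_value = 5_000           # value of the ~5% yield increase

annual_gain = (traditional_annual_cost - smart_annual_cost) + extra_output_value
payback_years = initial_investment / annual_gain
# ~21,000 USD/year and ~1.9 years, consistent with the "approximately USD 20,000"
# gain and "2 to 3 years" recovery period stated above.
print(annual_gain, round(payback_years, 1))
```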
This case shows that implementing smart agriculture can indeed promote the development of the agricultural economy. Small- and medium-sized farms should prioritize lightweight solutions with low initial equipment costs, which already cover basic tasks such as crop identification and disease monitoring. As algorithm models improve further, data acquisition will also become simpler, further reducing operational difficulty and cost. Large-scale farms can justify a higher initial investment, deploying hardware such as soil sensors, weather stations, and high-performance edge devices and, combined with cloud computing platforms, performing data integration and yield prediction. Supporting infrastructure should include high-precision positioning systems to enable autonomous operation of agricultural machinery, as well as reliable 5G coverage at the site to guarantee the transmission of large volumes of data.

5. Challenges and the Way Forward

Traditional computer vision methods based on handcrafted features are costly and capture limited information, making them difficult to apply in complex agricultural environments. Deep learning can automatically learn information-rich features from data and adapt to such environments. As both technologies mature, computer vision and deep learning are playing increasingly important roles in precision and smart agriculture, serving much like the eyes and the brain. This paper focuses on the crop growth management process, reviewing four aspects: crop identification and detection, crop quality grading, disease monitoring, and weed detection. Although deep learning and computer vision have brought revolutionary progress to agricultural intelligence, they still face numerous technical bottlenecks and practical obstacles in real-world applications.
The most urgent challenge is achieving real-time inference in low-resource environments. The ultimate goal of these studies is practical deployment that meets agricultural production needs. In real farmland management scenarios, high-performance hardware is rarely available: most computing devices have limited processing power, making it difficult for models with large parameter counts to maintain efficient inference speeds. Some farmland areas also suffer from poor signal coverage [115], hindering real-time crop disease identification. In addition, equipment such as sprayers and weeders cannot effectively support high-speed inference models, which remains a critical problem. For instance, the model proposed in ref. [122] achieved 85.2% mAP50 during testing, but this metric dropped to 67.4% when deployed on mobile devices. To address this, an increasing number of studies [86,87] adopt lightweight modules to optimize models and improve their deployability in practical applications. Furthermore, field operators are the primary users who must collect and interpret precise farmland data, and excessively complex operations would increase training costs for farmers. Therefore, reducing model parameters and complexity while maintaining accuracy is essential. Ensuring reliable inference in real farmland environments with constrained network and computational resources is a crucial direction for future research.
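As one hedged illustration of how models are commonly slimmed down for such edge hardware (a generic sketch, not a method from the cited studies), post-training dynamic quantization and ONNX export in PyTorch might look as follows; the backbone and input resolution are placeholder assumptions.

```python
# Illustrative sketch (not from the cited works): shrinking a network for
# edge deployment via post-training dynamic quantization and ONNX export.
# The backbone choice and input resolution are placeholder assumptions.
import torch
import torchvision

model = torchvision.models.mobilenet_v3_small(weights=None)  # stand-in for a lightweight backbone
model.eval()

# Dynamic quantization converts Linear layers to int8, cutting model size
# and often speeding up CPU inference on low-power field devices.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8)

dummy = torch.randn(1, 3, 320, 320)  # smaller input than lab-scale testing
with torch.no_grad():
    _ = quantized(dummy)  # int8 Linear layers run on CPU without extra setup

# Exporting to ONNX allows the same network to run through runtimes such as
# ONNX Runtime or TensorRT on embedded GPUs and industrial PCs.
torch.onnx.export(model, dummy, "weed_detector_lite.onnx", opset_version=17)

print("fp32 params:", sum(p.numel() for p in model.parameters()))
```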
After practical deployment, improving accuracy becomes the next major challenge, and designing universally generalizable models for multi-scale dynamic targets is therefore a high-priority technical problem. For crop identification and detection, the varying shapes of crops make multi-scale target detection particularly difficult. The system must simultaneously detect targets ranging from millimeter-scale objects (e.g., pollen) to meter-scale structures (e.g., fruit tree canopies), while accounting for morphological changes throughout the growth cycle [91]. Crops like grapes present additional challenges with severe overlapping and occlusion [18]. In disease monitoring, visible damage on crop surfaces varies significantly in size, and many diseases cause only subtle damage that is difficult to detect in early stages; weak initial symptoms and tiny detection targets are major obstacles in disease surveillance. Studies such as [108,117] focus on extracting more advanced and diverse features; these high-level fused features typically incorporate both global and local crop information, enabling better identification of diseased areas. For pest monitoring, the small size and high mobility of target insects create substantial detection challenges. When processing small insect features, it is crucial to avoid feature loss caused by excessive extraction strides, which may lead to missed detections [109]. Ref. [118] compared multiple models and expanded the number of recognizable pest categories to nine, yet universal generalizability remained unattainable. Weed detection faces similar issues: weeds are typically small, morphologically diverse, and sometimes resemble crops, which significantly increases the difficulty of accurate identification. Studies such as [125] use multi-scale convolutions or attention mechanisms to enhance feature extraction and focus on weeds, thereby improving detection accuracy and generalizability. However, misidentification still occurs in heavily occluded environments, where non-weed plants may be incorrectly classified. Developing models capable of handling multi-scale targets while improving generalizability remains a critical challenge.
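As a generic illustration of the multi-scale and attention ideas mentioned above (a minimal sketch, not code from the cited papers), a squeeze-and-excitation style channel-attention block combined with parallel multi-scale convolutions can be written as:

```python
# Minimal sketch of channel attention (squeeze-and-excitation style, cf.
# Figure 3) combined with parallel multi-scale convolutions. Kernel sizes and
# the reduction ratio are illustrative, not taken from the cited models.
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        w = self.fc(x.mean(dim=(2, 3)))   # squeeze: global average pooling
        return x * w[:, :, None, None]    # excite: reweight channels

class MultiScaleConv(nn.Module):
    """Parallel 1x1, 3x3, and 5x5 branches cover targets of different sizes."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, kernel_size=k, padding=k // 2)
            for k in (1, 3, 5))
        self.fuse = nn.Conv2d(3 * out_ch, out_ch, kernel_size=1)
        self.attn = SEBlock(out_ch)

    def forward(self, x):
        y = torch.cat([b(x) for b in self.branches], dim=1)
        return self.attn(self.fuse(y))

feat = MultiScaleConv(64, 128)(torch.randn(1, 64, 80, 80))  # -> (1, 128, 80, 80)
```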
Agricultural field environments are complex, and dynamic environmental interference affects model accuracy. In real-world scenarios, issues such as target overlap, occlusion, and background interference persist. These factors, along with variations in lighting, temperature, humidity, and background conditions, influence the input data acquisition for models. Consequently, there are discrepancies between practical application results and laboratory test performance, indicating a need for improved method adaptability [15]. Studies such as [14,86] have enhanced detection models to achieve better accuracy in complex environments, though these improvements remain condition-specific. To strengthen model stability in challenging conditions, research can also focus on input data optimization. Acquiring high-quality image data and constructing comprehensive datasets that encompass diverse real-world scenarios represent a growing trend for improving model precision. As discussed in Section 2, integrating technologies like structured-light cameras and infrared spectroscopy can effectively enhance data quality. However, the use of such high-performance equipment increases costs significantly. To address this, some studies [82] employ data augmentation techniques to improve data quality and testing accuracy without requiring expensive hardware. This approach balances performance and cost-effectiveness while maintaining model reliability in variable agricultural environments. Future work should continue exploring robust data acquisition and processing methods to bridge the gap between controlled testing and real-world deployment.
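As a hedged example of the low-cost augmentation route discussed above (a sketch of common transforms, not the pipeline used in [82]), lighting and geometric variability can be simulated at training time:

```python
# Illustrative augmentation pipeline simulating field variability (lighting,
# viewpoint, partial occlusion) at training time; parameter ranges are
# assumptions, not the settings used in the cited study [82].
import torchvision.transforms as T

train_transform = T.Compose([
    T.RandomResizedCrop(320, scale=(0.7, 1.0)),   # scale/viewpoint changes
    T.RandomHorizontalFlip(),
    T.ColorJitter(brightness=0.4, contrast=0.4,
                  saturation=0.3, hue=0.05),      # lighting and color shifts
    T.RandomRotation(degrees=15),                 # camera tilt
    T.ToTensor(),
    T.RandomErasing(p=0.25),                      # crude occlusion simulation
])
# Applied on-the-fly to each training image, this broadens the data
# distribution without requiring additional acquisition hardware.
```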

6. Conclusions

Deep learning and computer vision technologies have brought revolutionary changes to agricultural production. This paper reviews and summarizes relevant articles from the past decade, focusing on four aspects: crop identification, quality grading, disease monitoring, and weed detection. Section 2 categorizes common methods used in computer vision to obtain information, while Section 3 introduces commonly used deep learning technologies in the agricultural field. Through a systematic review of relevant research, we find that these technologies have moved from the laboratory to the field: models based on YOLO and Transformers achieve over 90% accuracy in crop recognition in complex environments. The combination of hyperspectral imaging and 3D-CNN keeps grading errors for agricultural products within 3%. Multimodal data fusion significantly enhances disease diagnosis capabilities, while lightweight detection models provide real-time solutions for precise weed control.
However, the application of deep learning and computer vision in agriculture also faces numerous challenges. Scarce and expensive labeled data constrain model generalization, and the early detection of millimeter-sized pests and diseases remains technically difficult. Algorithms need to be deployed on actual devices to ensure practical implementation, but the required hardware investment can be prohibitively expensive. Additionally, complex real-world environments limit the effectiveness of both algorithms and hardware. Despite these challenges, the field holds significant potential. First, specialized large-scale models for agriculture should be developed, leveraging pre-training on massive datasets to enhance transfer learning and improve generalization. Second, novel sensing technologies such as quantum sensing and nano-imaging, together with more efficient attention mechanisms and other deep learning techniques, should be integrated more thoroughly with agriculture. Finally, experimentation and production should advance together: designing lightweight models lowers application barriers so that research findings can be adopted and applied more widely. It is foreseeable that, when technological maturity aligns with industrial demand, agriculture will enter an era of smart farming in which the entire process from sowing to harvest is driven by autonomous decision-making. This leap in production efficiency is a crucial safeguard for food security and sustainable development, and a top priority for human societal progress.

Author Contributions

Conceptualization, Z.C.; investigation, Z.C.; writing—original draft preparation, Z.C.; writing—review and editing, Z.C. and S.S.; supervision, X.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Project 333 of Jiangsu Province, grant number 501100017534.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Luo, X.; Liao, J.; Zang, Y.; Ou, Y.; Wang, P. Developing from mechanized to smart agricultural production in China. Strateg. Study Chin. Acad. Eng. 2022, 24, 46–54. [Google Scholar] [CrossRef]
  2. Christiaensen, L.; Rutledge, Z.; Taylor, J.E. Viewpoint: The future of work in agri-food. Food Policy 2021, 99, 101963. [Google Scholar] [CrossRef]
  3. Akbar, J.U.M.; Kamarulzaman, S.F.; Muzahid, A.J.M.; Rahman, M.A.; Uddin, M. A comprehensive review on deep learning assisted computer vision techniques for smart greenhouse agriculture. IEEE Access 2024, 12, 4485–4522. [Google Scholar] [CrossRef]
  4. Saleem, M.H.; Potgieter, J.; Arif, K.M. Automation in agriculture by machine and deep learning techniques: A review of recent developments. Precis. Agric. 2021, 22, 2053–2091. [Google Scholar] [CrossRef]
  5. Ranganathan, J.; Waite, R.; Searchinger, T.; Hanson, C. How to Sustainably Feed 10 Billion People by 2050, in 21 Charts; World Resources Institute: Washington, DC, USA, 2018; p. 5. [Google Scholar]
  6. Dhanya, V.G.; Subeesh, A.; Kushwaha, N.L.; Vishwakarma, D.K.; Kumar, T.N.; Ritika, G.; Singh, A.N. Deep learning based computer vision approaches for smart agricultural applications. Artif. Intell. Agric. 2022, 6, 211–229. [Google Scholar] [CrossRef]
  7. Tian, H.; Wang, T.; Liu, Y.; Qiao, X.; Li, Y. Computer vision technology in agricultural automation—A review. Inf. Process. Agric. 2020, 7, 1–19. [Google Scholar] [CrossRef]
  8. Jin, Y.; Xia, X.; Gao, Q.; Yue, Y.; Lim, E.G.; Wong, P.; Ding, W.; Zhu, X. Deep learning in produce perception of harvesting robots: A comprehensive review. Appl. Soft Comput. 2025, 174, 112971. [Google Scholar] [CrossRef]
  9. Ukwuoma, C.C.; Qin, Z.; Bin Heyat, M.B.; Ali, L.; Almaspoor, Z.; Monday, H.N. Recent advancements in fruit detection and classification using deep learning techniques. Math. Probl. Eng. 2022, 2022, 9210947. [Google Scholar] [CrossRef]
  10. Zheng, W.; Lan, R.; Zhangzhong, L.; Yang, L.; Gao, L.; Yu, J. A Hybrid Approach for Soil Total Nitrogen Anomaly Detection Integrating Machine Learning and Spatial Statistics. Agronomy 2023, 13, 2669. [Google Scholar] [CrossRef]
  11. Bünemann, E.K.; Bongiorno, G.; Bai, Z.; Creamer, R.E.; De Deyn, G.; De Goede, R.; Fleskens, L.; Geissen, V.; Kuyper, T.W.; Mäder, P.; et al. Soil quality–A critical review. Soil Biol. Biochem. 2018, 120, 105–125. [Google Scholar] [CrossRef]
  12. Ma, Y.; Woolf, D.; Fan, M.; Qiao, L.; Li, R.; Lehmann, J. Global crop production increase by soil organic carbon. Nat. Geosci. 2023, 16, 1159–1165. [Google Scholar] [CrossRef]
  13. Zhao, L.; Haque, S.M.; Wang, R. Automated seed identification with computer vision: Challenges and opportunities. Seed Sci. Technol. 2022, 50, 75–102. [Google Scholar] [CrossRef]
  14. Ji, W.; Gao, X.; Xu, B.; Pan, Y.; Zhang, Z.; Zhao, D. Apple target recognition method in complex environment based on improved YOLOv4. J. Food Process Eng. 2021, 44, e13866. [Google Scholar] [CrossRef]
  15. Wu, Z.; Chen, Y.; Zhao, B.; Kang, X.; Ding, Y. Review of weed detection methods based on computer vision. Sensors 2021, 21, 3647. [Google Scholar] [CrossRef]
  16. Shoaib, M.; Sadeghi-Niaraki, A.; Ali, F.; Hussain, I.; Khalid, S. Leveraging deep learning for plant disease and pest detection: A comprehensive review and future directions. Front. Plant Sci. 2025, 16, 1538163. [Google Scholar] [CrossRef]
  17. Xie, F.; Guo, Z.; Li, T.; Feng, Q.; Zhao, C. Dynamic Task Planning for Multi-Arm Harvesting Robots Under Multiple Constraints Using Deep Reinforcement Learning. Horticulturae 2025, 11, 88. [Google Scholar] [CrossRef]
  18. Liu, J.; Liang, J.; Zhao, S.; Jiang, Y.; Wang, J.; Jin, Y. Design of a virtual multi-interaction operation system for hand–eye coordination of grape harvesting robots. Agronomy 2023, 13, 829. [Google Scholar] [CrossRef]
  19. Luo, Y.; Wei, L.; Xu, L.; Zhang, Q.; Liu, J.; Cai, Q.; Zhang, W. Stereo-vision-based multi-crop harvesting edge detection for precise automatic steering of combine harvester. Biosyst. Eng. 2022, 215, 115–128. [Google Scholar] [CrossRef]
  20. Wang, H.; Gu, J.; Wang, M. A review on the application of computer vision and machine learning in the tea industry. Front. Sustain. Food Syst. 2023, 7, 1172543. [Google Scholar] [CrossRef]
  21. Kim, W.S.; Lee, D.H.; Kim, Y.J.; Kim, T.; Lee, W.S.; Choi, C.H. Stereo-vision-based crop height estimation for agricultural robots. Comput. Electron. Agric. 2021, 181, 105937. [Google Scholar] [CrossRef]
  22. Chen, J.; Zhang, M.; Xu, B.; Sun, J.; Mujumdar, A.S. Artificial intelligence assisted technologies for controlling the drying of fruits and vegetables using physical fields: A review. Trends Food Sci. Technol. 2020, 105, 251–260. [Google Scholar] [CrossRef]
  23. Xu, Q.; Cai, J.R.; Zhang, W.; Bai, J.W.; Li, Z.Q.; Tan, B.; Sun, L. Detection of citrus Huanglongbing (HLB) based on the HLB-induced leaf starch accumulation using a home-made computer vision system. Biosyst. Eng. 2022, 218, 163–174. [Google Scholar] [CrossRef]
  24. Chen, J.; Lian, Y.; Zou, R.; Zhang, S.; Ning, X.; Han, M. Real-time grain breakage sensing for rice combine harvesters using machine vision technology. Int. J. Agric. Biol. Eng. 2020, 13, 194–199. [Google Scholar] [CrossRef]
  25. Guo, J.; Zhang, K.; Adade, S.Y.S.S.; Lin, J.; Lin, H.; Chen, Q. Tea grading, blending, and matching based on computer vision and deep learning. J. Sci. Food Agric. 2025, 105, 3239–3251. [Google Scholar] [CrossRef]
  26. Zhu, C.; Hao, S.; Liu, C.; Wang, Y.; Jia, X.; Xu, J.; Guo, S.; Huo, J.; Wang, W. An Efficient Computer Vision-Based Dual-Face Target Precision Variable Spraying Robotic System for Foliar Fertilisers. Agronomy 2024, 14, 2770. [Google Scholar] [CrossRef]
  27. Huang, X.Y.; Pan, S.H.; Sun, Z.Y.; Ye, W.T.; Aheto, J.H. Evaluating quality of tomato during storage using fusion information of computer vision and electronic nose. J. Food Process Eng. 2018, 41, e12832. [Google Scholar] [CrossRef]
  28. Tu, H.; Huang, D.; Huang, X.; Aheto, J.H.; Ren, Y.; Wang, Y.; Liu, J.; Niu, S.; Xu, M. Detection of browning of fresh-cut potato chips based on machine vision and electronic nose. J. Food Process Eng. 2021, 44, e13631. [Google Scholar]
  29. Lei, L.; Yang, Q.; Yang, L.; Shen, T.; Wang, R.; Fu, C. Deep learning implementation of image segmentation in agricultural applications: A comprehensive review. Artif. Intell. Rev. 2024, 57, 149. [Google Scholar] [CrossRef]
  30. Wang, J.; Gao, Z.; Zhang, Y.; Zhou, J.; Wu, J.; Li, P. Real-time detection and location of potted flowers based on a ZED camera and a YOLO V4-tiny deep learning algorithm. Horticulturae 2021, 8, 21. [Google Scholar] [CrossRef]
  31. Badhan, S.; Desai, K.; Dsilva, M.; Sonkusare, R.; Weakey, S. Real-time weed detection using machine learning and stereo-vision. In Proceedings of the 2021 6th International Conference for Convergence in Technology (I2CT), Maharashtra, India, 2–4 April 2021; IEEE: New York, NY, USA, 2021; pp. 1–5. [Google Scholar]
  32. Zheng, B.; Sun, G.; Meng, Z.; Nan, R. Vegetable size measurement based on stereo camera and keypoints detection. Sensors 2022, 22, 1617. [Google Scholar] [CrossRef]
  33. Azizi, A.; Abbaspour-Gilandeh, Y.; Mesri-Gundoshmian, T.; Farooque, A.A.; Afzaal, H. Estimation of soil surface roughness using stereo vision approach. Sensors 2021, 21, 4386. [Google Scholar] [CrossRef] [PubMed]
  34. Ruigrok, T.; van Henten, E.J.; Kootstra, G. Stereo Vision for Plant Detection in Dense Scenes. Sensors 2024, 24, 1942. [Google Scholar] [CrossRef] [PubMed]
  35. Wang, T.; Chen, B.; Zhang, Z.; Li, H.; Zhang, M. Applications of machine vision in agricultural robot navigation: A review. Comput. Electron. Agric. 2022, 198, 107085. [Google Scholar] [CrossRef]
  36. Pezzuolo, A.; Guarino, M.; Sartori, L.; Marinello, F. A feasibility study on the use of a structured light depth-camera for three-dimensional body measurements of dairy cows in free-stall barns. Sensors 2018, 18, 673. [Google Scholar] [CrossRef]
  37. Nguyen, T.T.; Slaughter, D.C.; Max, N.; Maloof, J.N.; Sinha, N. Structured light-based 3D reconstruction system for plants. Sensors 2015, 15, 18587–18612. [Google Scholar] [CrossRef]
  38. Atif, M.; Lee, S. Adaptive pattern resolution for structured light 3D camera system. In Proceedings of the 2018 IEEE SENSORS, New Delhi, India, 28–31 October 2018; IEEE: New York, NY, USA, 2018; pp. 1–4. [Google Scholar]
  39. Fu, L.; Gao, F.; Wu, J.; Li, R.; Karkee, M.; Zhang, Q. Application of consumer RGB-D cameras for fruit detection and localization in field: A critical review. Comput. Electron. Agric. 2020, 177, 105687. [Google Scholar] [CrossRef]
  40. Shuai, L.; Li, Z.; Chen, Z.; Luo, D.; Mu, J. A research review on deep learning combined with hyperspectral Imaging in multiscale agricultural sensing. Comput. Electron. Agric. 2024, 217, 108577. [Google Scholar] [CrossRef]
  41. Tamayo-Monsalve, M.A.; Mercado-Ruiz, E.; Villa-Pulgarin, J.P.; Bravo-Ortiz, M.A.; Arteaga-Arteaga, H.B.; Mora-Rubio, A.; Alzate-Grisales, J.A.; Arias-Garzon, D.; Romero-Cano, V.; Orozco-Arias, S.; et al. Coffee maturity classification using convolutional neural networks and transfer learning. IEEE Access 2022, 10, 42971–42982. [Google Scholar] [CrossRef]
  42. Zhou, X.; Sun, J.; Tian, Y.; Lu, B.; Hang, Y.; Chen, Q. Hyperspectral technique combined with deep learning algorithm for detection of compound heavy metals in lettuce. Food Chem. 2020, 321, 126503. [Google Scholar] [CrossRef]
  43. Wei, L.; Yang, H.; Niu, Y.; Zhang, Y.; Xu, L.; Chai, X. Wheat biomass, yield, and straw-grain ratio estimation from multi-temporal UAV-based RGB and multispectral images. Biosyst. Eng. 2023, 234, 187–205. [Google Scholar] [CrossRef]
  44. Li, J.; Luo, W.; Han, L.; Cai, Z.; Guo, Z. Two-wavelength image detection of early decayed oranges by coupling spectral classification with image processing. J. Food Compos. Anal. 2022, 111, 104642. [Google Scholar] [CrossRef]
  45. Petersson, H.; Gustafsson, D.; Bergstrom, D. Hyperspectral image analysis using deep learning—A review. In Proceedings of the 2016 Sixth International Conference on Image Processing Theory, Tools and Applications (IPTA), Oulu, Finland, 12–15 December 2016; IEEE: New York, NY, USA, 2016; pp. 1–6. [Google Scholar]
  46. Shafique, A.; Siraj, M.; Cheng, B.; Alsaif, S.A.; Sadad, T. Hyperspectral Imaging and Advanced Vision Transformers for Identifying Pure and Pesticide-Coated Apples. IEEE Access 2025, 13, 66405–66419. [Google Scholar] [CrossRef]
  47. Feng, H.; Chen, Y.; Song, J.; Lu, B.; Shu, C.; Qiao, J.; Liao, Y.; Yang, W. Maturity classification of rapeseed using hyperspectral image combined with machine learning. Plant Phenomics 2024, 6, 0139. [Google Scholar] [CrossRef]
  48. Guo, Y.; Chen, S.; Li, X.; Cunha, M.; Jayavelu, S.; Cammarano, D.; Fu, Y. Machine learning-based approaches for predicting SPAD values of maize using multi-spectral images. Remote Sens. 2022, 14, 1337. [Google Scholar] [CrossRef]
  49. Tian, S.; Lu, Q.; Wei, L. Multiscale superpixel-based fine classification of crops in the UAV-based hyperspectral imagery. Remote Sens. 2022, 14, 3292. [Google Scholar] [CrossRef]
  50. Shen, T.; Zhou, X.; Shi, J.; Li, Z.; Huang, X.; Xu, Y.; Chen, W. Determination geographical origin and flavonoids content of goji berry using near-infrared spectroscopy and chemometrics. Food Anal. Methods 2016, 9, 68–79. [Google Scholar]
  51. Zhang, H.; Jiang, H.; Liu, G.; Mei, C.; Huang, Y. Identification of Radix puerariae starch from different geographical origins by FT-NIR spectroscopy. Int. J. Food Prop. 2017, 20 (Suppl. 2), 1567–1577. [Google Scholar] [CrossRef]
  52. Li, Y.; Zou, X.; Shen, T.; Shi, J.; Zhao, J.; Holmes, M. Determination of geographical origin and anthocyanin content of black goji berry (Lycium ruthenicum Murr.) using near-infrared spectroscopy and chemometrics. Food Anal. Methods 2017, 10, 1034–1044. [Google Scholar]
  53. Lu, J.; Zhang, M.; Hu, Y.; Ma, W.; Tian, Z.; Liao, H.; Chen, J.; Yang, Y. From Outside to Inside: The Subtle Probing of Globular Fruits and Solanaceous Vegetables Using Machine Vision and Near-Infrared Methods. Agronomy 2024, 14, 2395. [Google Scholar] [CrossRef]
  54. Patel, K.K.; Pathare, P.B. Principle and applications of near-infrared imaging for fruit quality assessment—An overview. Int. J. Food Sci. Technol. 2024, 59, 3436–3450. [Google Scholar] [CrossRef]
  55. Yang, N.; Yuan, M.; Wang, P.; Zhang, R.; Sun, J.; Mao, H. Tea diseases detection based on fast infrared thermal image processing technology. J. Sci. Food Agric. 2019, 99, 3459–3466. [Google Scholar] [CrossRef]
  56. Zhu, Y.; Fan, S.; Zuo, M.; Zhang, B.; Zhu, Q.; Kong, J. Discrimination of new and aged seeds based on on-line near-infrared spectroscopy technology combined with machine learning. Foods 2024, 13, 1570. [Google Scholar] [CrossRef] [PubMed]
  57. Hao, Y.; Li, X.; Zhang, C.; Lei, Z. Online inspection of browning in yali pears using visible-near infrared spectroscopy and interpretable spectrogram-based CNN modeling. Biosensors 2023, 13, 203. [Google Scholar] [CrossRef] [PubMed]
  58. Lu, H.; Wang, F.; Liu, X.; Wu, Y. Rapid assessment of tomato ripeness using visible/near-infrared spectroscopy and machine vision. Food Anal. Methods 2017, 10, 1721–1726. [Google Scholar] [CrossRef]
  59. Türkler, L.; Akkan, T.; Akkan, L.Ö. Detection of Water Leakage in Drip Irrigation Systems Using Infrared Technique in Smart Agricultural Robots. Sensors 2023, 23, 9244. [Google Scholar] [CrossRef]
  60. Zhou, X.; Zhao, C.; Sun, J.; Cao, Y.; Yao, K.; Xu, M. A deep learning method for predicting lead content in oilseed rape leaves using fluorescence hyperspectral imaging. Food Chem. 2023, 409, 135251. [Google Scholar] [CrossRef]
  61. Sun, J.; He, X.; Ge, X.; Wu, X.; Shen, J.; Song, Y. Detection of key organs in tomato based on deep migration learning in a complex background. Agriculture 2018, 8, 196. [Google Scholar] [CrossRef]
  62. You, J.; Li, D.; Wang, Z.; Chen, Q.; Ouyang, Q. Prediction and visualization of moisture content in Tencha drying processes by computer vision and deep learning. J. Sci. Food Agric. 2024, 104, 5486–5494. [Google Scholar] [CrossRef]
  63. Manikandan, R.; Ranganathan, G.; Bindhu, V. Deep learning based IoT module for smart farming in different environmental conditions. Wirel. Pers. Commun. 2023, 128, 1715–1732. [Google Scholar] [CrossRef]
  64. Li, H.; Luo, X.; Haruna, S.A.; Zareef, M.; Chen, Q.; Ding, Z.; Yan, Y. Au-Ag OHCs-based SERS sensor coupled with deep learning CNN algorithm to quantify thiram and pymetrozine in tea. Food Chem. 2023, 428, 136798. [Google Scholar] [CrossRef]
  65. Cheng, X.; Zhang, Y.; Chen, Y.; Wu, Y.; Yue, Y. Pest identification via deep residual learning in complex background. Comput. Electron. Agric. 2017, 141, 351–356. [Google Scholar] [CrossRef]
  66. Zhang, Q.; Liu, Y.; Gong, C.; Chen, Y.; Yu, H. Applications of deep learning for dense scenes analysis in agriculture: A review. Sensors 2020, 20, 1520. [Google Scholar] [CrossRef] [PubMed]
  67. Mnih, V.; Heess, N.; Graves, A.; Kavukcuoglu, K. Recurrent models of visual attention. Adv. Neural Inf. Process. Syst. 2014, 27. [Google Scholar]
  68. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
  69. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar]
  70. Tao, K.; Wang, A.; Shen, Y.; Lu, Z.; Peng, F.; Wei, X. Peach flower density detection based on an improved cnn incorporating attention mechanism and multi-scale feature fusion. Horticulturae 2022, 8, 904. [Google Scholar] [CrossRef]
  71. Zhao, S.; Peng, Y.; Liu, J.; Wu, S. Tomato leaf disease diagnosis based on improved convolution neural network by attention module. Agriculture 2021, 11, 651. [Google Scholar] [CrossRef]
  72. Zuo, X.; Chu, J.; Shen, J.; Sun, J. Multi-granularity feature aggregation with self-attention and spatial reasoning for fine-grained crop disease classification. Agriculture 2022, 12, 1499. [Google Scholar] [CrossRef]
  73. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  74. Bin, P.; Jing, B.; Wenjing, L.; Zheng, H.; Ma, X. Survey on Visual Transformer for Image Classification. J. Front. Comput. Sci. Technol. 2024, 18, 320. [Google Scholar]
  75. Zhu, W.; Sun, J.; Wang, S.; Shen, J.; Yang, K.; Zhou, X. Identifying field crop diseases using transformer-embedded convolutional neural network. Agriculture 2022, 12, 1083. [Google Scholar] [CrossRef]
  76. De Silva, M.; Brown, D. Multispectral plant Disease Detection with Vision transformer–convolutional neural network hybrid approaches. Sensors 2023, 23, 8531. [Google Scholar] [CrossRef]
  77. Luo, Z.; Yang, W.; Yuan, Y.; Gou, R.; Li, X. Semantic segmentation of agricultural images: A survey. Inf. Process. Agric. 2024, 11, 172–186. [Google Scholar] [CrossRef]
  78. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision And Pattern Recognition 2016, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  79. Miao, Z.; Yu, X.; Li, N.; Zhang, Z.; He, C.; Li, Z.; Deng, C.; Sun, T. Efficient tomato harvesting robot based on image processing and deep learning. Precis. Agric. 2023, 24, 254–287. [Google Scholar] [CrossRef]
  80. Zheng, Y.Y.; Kong, J.L.; Jin, X.B.; Wang, X.Y.; Su, T.L.; Zuo, M. CropDeep: The crop vision dataset for deep-learning-based classification and detection in precision agriculture. Sensors 2019, 19, 1058. [Google Scholar] [CrossRef] [PubMed]
  81. Tian, Y.; Sun, J.; Zhou, X.; Yao, K.; Tang, N. Detection of soluble solid content in apples based on hyperspectral technology combined with deep learning algorithm. J. Food Process. Preserv. 2022, 46, e16414. [Google Scholar] [CrossRef]
  82. Bhattarai, U.; Bhusal, S.; Zhang, Q.; Karkee, M. AgRegNet: A deep regression network for flower and fruit density estimation, localization, and counting in orchards. Comput. Electron. Agric. 2024, 227, 109534. [Google Scholar] [CrossRef]
  83. Jiang, L.; Wang, Y.; Wu, C.; Wu, H. Fruit Distribution Density Estimation in YOLO-Detected Strawberry Images: A Kernel Density and Nearest Neighbor Analysis Approach. Agriculture 2024, 14, 1848. [Google Scholar] [CrossRef]
  84. Hu, T.; Wang, W.; Gu, J.; Xia, Z.; Zhang, J.; Wang, B. Research on apple object detection and localization method based on improved YOLOX and RGB-D images. Agronomy 2023, 13, 1816. [Google Scholar] [CrossRef]
  85. Liu, X.; Jia, W.; Ruan, C.; Zhao, D.; Gu, Y.; Chen, W. The recognition of apple fruits in plastic bags based on block classification. Precis. Agric. 2018, 19, 735–749. [Google Scholar] [CrossRef]
  86. Zhang, Z.; Lu, Y.; Zhao, Y.; Pan, Q.; Jin, K.; Xu, G.; Hu, Y. Ts-yolo: An all-day and lightweight tea canopy shoots detection model. Agronomy 2023, 13, 1411. [Google Scholar] [CrossRef]
  87. Li, A.; Wang, C.; Ji, T.; Wang, Q.; Zhang, T. D3-YOLOv10: Improved YOLOv10-Based Lightweight Tomato Detection Algorithm Under Facility Scenario. Agriculture 2024, 14, 2268. [Google Scholar] [CrossRef]
  88. Zhang, F.; Chen, Z.; Ali, S.; Yang, N.; Fu, S.; Zhang, Y. Multi-class detection of cherry tomatoes using improved Yolov4-tiny model. Int. J. Agric. Biol. Eng. 2023, 16, 225–231. [Google Scholar]
  89. Shi, Y.; Duan, Z.; Qing, S.; Zhao, L.; Wang, F.; Yuwen, X. YOLOV9S-Pear: A lightweight YOLOV9S-Based improved model for young Red Pear Small-Target recognition. Agronomy 2024, 14, 2086. [Google Scholar] [CrossRef]
  90. Zhang, B.; Xia, Y.; Wang, R.; Wang, Y.; Yin, C.; Fu, M.; Fu, W. Recognition of mango and location of picking point on stem based on a multi-task CNN model named YOLOMS. Precis. Agric. 2024, 25, 1454–1476. [Google Scholar] [CrossRef]
  91. Zhou, W.; Cui, Y.; Huang, H.; Huang, H.; Wang, C. A fast and data-efficient deep learning framework for multi-class fruit blossom detection. Comput. Electron. Agric. 2024, 217, 108592. [Google Scholar] [CrossRef]
  92. Dai, J.S.; He, Z.Q. Real-Time Recognition and Localization of Kiwifruit Based on Improved YOLOv5s Algorithm. IEEE Access 2024, 12, 156261–156272. [Google Scholar] [CrossRef]
  93. Yao, K.; Sun, J.; Zhou, X.; Nirere, A.; Tian, Y.; Wu, X. Nondestructive detection for egg freshness grade based on hyperspectral imaging technology. J. Food Process Eng. 2020, 43, e13422. [Google Scholar] [CrossRef]
  94. Zhu, J.; Cai, J.; Sun, B.; Xu, Y.; Lu, F.; Ma, H. Inspection and classification of wheat quality using image processing. Qual. Assur. Saf. Crops Foods 2023, 15, 43–54. [Google Scholar] [CrossRef]
  95. Gururaj, N.; Vinod, V.; Vijayakumar, K. Deep grading of mangoes using convolutional neural network and computer vision. Multimed. Tools Appl. 2023, 82, 39525–39550. [Google Scholar] [CrossRef]
  96. Xu, B.; Cui, X.; Ji, W.; Yuan, H.; Wang, J. Apple grading method design and implementation for automatic grader based on improved YOLOv5. Agriculture 2023, 13, 124. [Google Scholar] [CrossRef]
  97. Xia, Y.; Wang, Z.; Cao, Z.; Chen, Y.; Li, L.; Chen, L.; Zhang, S.; Wang, C.; Li, H.; Wang, B. Recognition Model for Tea Grading and Counting Based on the Improved YOLOv8n. Agronomy 2024, 14, 1251. [Google Scholar] [CrossRef]
  98. Lu, M.; Jiang, S.; Wang, C.; Chen, D.; Chen, T.E. Tobacco leaf grading based on deep convolutional neural networks and machine vision. J. ASABE 2022, 65, 11–22. [Google Scholar] [CrossRef]
  99. Lu, M.; Wang, C.; Wu, W.; Zhu, D.; Zhou, Q.; Wang, Z.; Chen, T.; Jiang, S.; Chen, D. Intelligent grading of tobacco leaves using an improved bilinear convolutional neural network. IEEE Access 2023, 11, 68153–68170. [Google Scholar] [CrossRef]
  100. Ni, J.; Liu, B.; Li, J.; Gao, J.; Yang, H.; Han, Z. Detection of carrot quality using DCGAN and deep network with squeeze-and-excitation. Food Anal. Methods 2022, 15, 1432–1444. [Google Scholar] [CrossRef]
  101. Zhang, Y.; Mohd Khairuddin, A.S.; Chuah, J.H.; Zhao, X.; Huang, J. An intelligent mangosteen grading system based on an improved convolutional neural network. Signal Image Video Process. 2024, 18, 8585–8595. [Google Scholar] [CrossRef]
  102. Ismail, N.; Malik, O.A. Real-time visual inspection system for grading fruits using computer vision and deep learning techniques. Inf. Process. Agric. 2022, 9, 24–37. [Google Scholar]
  103. Fan, S.; Li, J.; Zhang, Y.; Tian, X.; Wang, Q.; He, X.; Zhang, C.; Huang, W. On line detection of defective apples using computer vision system combined with deep learning methods. J. Food Eng. 2020, 286, 110102. [Google Scholar] [CrossRef]
  104. Ji, W.; Wang, J.; Xu, B.; Zhang, T. Apple grading based on multi-dimensional view processing and deep learning. Foods 2023, 12, 2117. [Google Scholar] [CrossRef]
  105. Jun, S.; Xin, Z.; Hanping, M.; Xiaohong, W.; Xiaodong, Z.; Hongyan, G. Identification of residue level in lettuce based on hyperspectra and chlorophyll fluorescence spectra. Int. J. Agric. Biol. Eng. 2016, 9, 231–239. [Google Scholar]
  106. Upadhyay, A.; Chandel, N.S.; Singh, K.P.; Chakraborty, S.K.; Nandede, B.M.; Kumar, M.; Subeesh, A.; Upendar, K.; Salem, A.; Elbeltagi, A. Deep learning and computer vision in plant disease detection: A comprehensive review of techniques, models, and trends in precision agriculture. Artif. Intell. Rev. 2025, 58, 92. [Google Scholar] [CrossRef]
  107. Duan, Y.; Han, W.; Guo, P.; Wei, X. YOLOv8-GDCI: Research on the Phytophthora Blight Detection Method of Different Parts of Chili Based on Improved YOLOv8 Model. Agronomy 2024, 14, 2734. [Google Scholar] [CrossRef]
  108. Meng, Q.; Guo, J.; Zhang, H.; Zhou, Y.; Zhang, X. A dual-branch model combining convolution and vision transformer for crop disease classification. PLoS ONE 2025, 20, e0321753. [Google Scholar] [CrossRef] [PubMed]
  109. Guo, Q.; Wang, C.; Xiao, D.; Huang, Q. Automatic monitoring of flying vegetable insect pests using an RGB camera and YOLO-SIP detector. Precis. Agric. 2023, 24, 436–457. [Google Scholar] [CrossRef]
  110. Hassan, S.M.; Maji, A.K. Pest Identification based on fusion of Self-Attention with ResNet. IEEE Access 2024, 12, 6036–6050. [Google Scholar] [CrossRef]
  111. Li, M.; Cheng, S.; Cui, J.; Li, C.; Li, Z.; Zhou, C.; Lv, C. High-performance plant pest and disease detection based on model ensemble with inception module and cluster algorithm. Plants 2023, 12, 200. [Google Scholar] [CrossRef]
  112. Lamba, S.; Baliyan, A.; Kukreja, V. A novel GCL hybrid classification model for paddy diseases. Int. J. Inf. Technol. 2023, 15, 1127–1136. [Google Scholar] [CrossRef]
  113. Yang, Y.; Xiao, Y.; Chen, Z.; Tang, D.; Li, Z.; Li, Z. FCBTYOLO: A lightweight and high-performance fine grain detection strategy for rice pests. IEEE Access 2023, 11, 101286–101295. [Google Scholar] [CrossRef]
  114. Maruthai, S.; Selvanarayanan, R.; Thanarajan, T.; Rajendran, S. Hybrid vision GNNs based early detection and protection against pest diseases in coffee plants. Sci. Rep. 2025, 15, 11778. [Google Scholar] [CrossRef]
  115. Yang, N.; Chang, K.; Dong, S.; Tang, J.; Wang, A.; Huang, R.; Jia, Y. Rapid image detection and recognition of rice false smut based on mobile smart devices with anti-light features from cloud database. Biosyst. Eng. 2022, 218, 229–244. [Google Scholar] [CrossRef]
  116. Yang, N.; Qian, Y.; EL-Mesery, H.S.; Zhang, R.; Wang, A.; Tang, J. Rapid detection of rice disease using microscopy image identification based on the synergistic judgment of texture and shape features and decision tree–confusion matrix method. J. Sci. Food Agric. 2019, 99, 6589–6600. [Google Scholar] [CrossRef]
  117. Hosny, K.M.; El-Hady, W.M.; Samy, F.M.; Vrochidou, E.; Papakostas, G.A. Multi-class classification of plant leaf diseases using feature fusion of deep convolutional neural network and local binary pattern. IEEE Access 2023, 11, 62307–62317. [Google Scholar] [CrossRef]
  118. Khalid, S.; Oqaibi, H.M.; Aqib, M.; Hafeez, Y. Small pests detection in field crops using deep learning object detection. Sustainability 2023, 15, 6815. [Google Scholar] [CrossRef]
  119. Liu, J.; Abbas, I.; Noor, R.S. Development of deep learning-based variable rate agrochemical spraying system for targeted weeds control in strawberry crop. Agronomy 2021, 11, 1480. [Google Scholar] [CrossRef]
  120. Deng, L.; Miao, Z.; Zhao, X.; Yang, S.; Gao, Y.; Zhai, C.; Zhao, C. HAD-YOLO: An Accurate and Effective Weed Detection Model Based on Improved YOLOV5 Network. Agronomy 2025, 15, 57. [Google Scholar] [CrossRef]
  121. Tao, T.; Wei, X. STBNA-YOLOv5: An Improved YOLOv5 Network for Weed Detection in Rapeseed Field. Agriculture 2024, 15, 22. [Google Scholar] [CrossRef]
  122. Pei, H.; Sun, Y.; Huang, H.; Zhang, W.; Sheng, J.; Zhang, Z. Weed detection in maize fields by UAV images based on crop row preprocessing and improved YOLOv4. Agriculture 2022, 12, 975. [Google Scholar] [CrossRef]
  123. Lu, Z.; Zhu, C.; Lu, L.; Yan, Y.; Jun, W.; Wei, X.; Ke, X.; Jun, T. Star-YOLO: A lightweight and efficient model for weed detection in cotton fields using advanced YOLOv8 improvements. Comput. Electron. Agric. 2025, 235, 110306. [Google Scholar] [CrossRef]
  124. Janneh, L.L.; Zhang, Y.; Cui, Z.; Yang, Y. Multi-level feature re-weighted fusion for the semantic segmentation of crops and weeds. J. King Saud Univ. Comput. Inf. Sci. 2023, 35, 101545. [Google Scholar] [CrossRef]
  125. Zhou, Q.; Li, H.; Cai, Z.; Zhong, Y.; Zhong, F.; Lin, X.; Wang, L. YOLO-ACE: Enhancing YOLO with Augmented Contextual Efficiency for Precision Cotton Weed Detection. Sensors 2025, 25, 1635. [Google Scholar] [CrossRef]
  126. Khan, S.D.; Basalamah, S.; Lbath, A. Weed–Crop segmentation in drone images with a novel encoder–decoder framework enhanced via attention modules. Remote Sens. 2023, 15, 5615. [Google Scholar] [CrossRef]
  127. Wang, Y.; Zhang, X.; Ma, G.; Du, X.; Shaheen, N.; Mao, H. Recognition of weeds at asparagus fields using multi-feature fusion and backpropagation neural network. Int. J. Agric. Biol. Eng. 2021, 14, 190–198. [Google Scholar] [CrossRef]
Figure 1. Flowchart of data collection.
Figure 2. The number of published articles in the past ten years.
Figure 3. Squeeze-and-excitation block [68].
Figure 4. Transformer model [69].
Figure 5. Apple inspection under occlusion, in bags, and at night.
Figure 6. Crop quality grading chart.
Figure 7. Some samples selected from public datasets.
Figure 8. Weed images from the CottonWeedDet12 dataset.
Figure 9. System operation flowchart.
Table 1. Summary of the application of computer vision and deep learning in crop identification and detection.

Reference | Target | Approach | Performance | Hardware | Dataset Size
[84] | Apple detection and localization | YOLOX, SPP | F1: 93%; mAP50: 94.09%; Speed: 167.43 FPS | Intel i7 + NVIDIA RTX 2080Ti | 4785
[14] | Apple target recognition | YOLOv4, EfficientNet-B0, PANet | mAP50: 93.42%; Recall: 87.64%; Speed: 63.20 FPS | NVIDIA GTX 1080Ti | 10,385
[83] | Strawberry fruit distribution density | YOLOv8n, Squeeze-and-Excitation, Kernel Density Estimation | mAP50-95: 87.3%; Recall: 90.7%; Speed: 15.95 FPS | NVIDIA GTX 1080Ti | 4500
[85] | Recognition of apple | SVM, BPNN, Watershed Algorithm | FNR: 4.65%; FPR: 3.50% | NVIDIA GTX 1080Ti | —
[86] | Tea detection | YOLOv4, DepC, DCN, Coordinate Attention, MobileNetV3 | Precision: 85.35%; Recall: 78.42%; mAP50: 82.12% | NVIDIA GTX 1080Ti | 4347
[87] | Tomato detection | YOLOv10, DyFasterNet, D-LKA | mAP50: 91.8%; mAP50-95: 63.8%; Speed: 80.1 FPS | NVIDIA GTX | 2000
[88] | Tomato detection | YOLOv4-Tiny, CBAM | mAP50: 90.78%; Speed: 31.04 FPS | NVIDIA GTX | 8112
[89] | Red pear small-target recognition | YOLOv9s, SCDown, C2FUIBELAN | mAP50-95: 84.8%; mAP50: 99.1%; Recall: 97%; Speed: 83.64 FPS | NVIDIA A16 | 1580
[90] | Recognition of mango | YOLOv5s, RepVGG | Precision: 84.81%; Recall: 85.64%; mAP50: 82.42%; Speed: 39.73 FPS | NVIDIA GeForce RTX 3090 | 1760
[91] | Blossom detection | VoVNet, CenterNet2, Location Guidance Module | mAP50: 74.33%; Speed: 47 FPS | NVIDIA RTX 3090 | 2760
Table 2. Summary of the application of computer vision and deep learning in crop grading.

Reference | Target | Approach | Performance | Hardware | Dataset Size
[96] | Apple grading | YOLOv5s, Squeeze-and-Excitation | mAP50: 90.6%; Precision: 95.1%; Recall: 95.2%; Speed: 59.63 FPS | NVIDIA GTX 1660Ti | 6000
[97] | Tea grading | YOLOv8n, SPD-Conv, Super-Token Vision Transformer | mAP50: 89.1%; Precision: 86.9%; Recall: 85.5% | NVIDIA GeForce RTX 3060 | 3612
[98] | Tobacco leaf grading | A-ResNet-65, ResNet-34, BN-PReLU-Conv | Precision: 91.30%; Speed: 82.18 FPS | NVIDIA GeForce GTX 1080Ti | 22,330
[99] | Tobacco leaf grading | VGG16, FPN-CBAM-ResNet50, FPN, CBAM | Precision: 80.65%; Speed: 42.1 FPS | 2 × NVIDIA GeForce GTX 1080 Ti | 22,322
[100] | Detection of carrot quality | ResNet-18, Squeeze-and-Excitation, DCGAN | Precision: 98.36%; F1: 98.41% | NVIDIA GTX 2060 | 6086
[101] | Mangosteen grading | MobileNetV3, InceptionV3, CBAM | Precision: 97.15%; Recall: 97.75% | NVIDIA GTX | 20,000
[102] | Grading fruits | ResNet50, DenseNet121, EfficientNet, MobileNetV2 | Precision (A): 99.2% ± 0.12%; Precision (B): 98.6% ± 0.42% | — | 9091
[103] | Apple grading | CNN, Softmax, Max Pooling | Precision: 92%; Recall: 91%; Speed: 72 FPS | Intel E7400 CPU | 79,200
Table 3. Summary of the application of computer vision and deep learning in disease monitoring.

Reference | Target | Approach | Performance | Hardware | Dataset Size
[107] | Disease detection | YOLOv8n, RepGFPN, Coordinate Attention | Recall: 84.2%; mAP50: 88.9%; Speed: 219.5 FPS | NVIDIA GeForce RTX 4060 | 1083
[108] | Disease classification | CNN, Vision Transformer, Separable Self-Attention | Precision (Dataset 1): 99.71%; Precision (Dataset 2): 98.78% | NVIDIA GeForce RTX 4090 | 58,367
[109] | Detection of pests | YOLOv4, EfficientNetV2-S, Fully CNN | mAP50: 84.22%; Speed: 4.72 FPS | 2 × NVIDIA GeForce GTX 1080 Ti | 3557
[110] | Pest identification | ResNet, Self-Attention | Accuracy: 99.80%; F1: 99.33% | Google Colab Pro platform (https://colab.google/) | 4500
[111] | Plant pest and disease detection | YOLOv3, Faster R-CNN, Inception | mAP50: 85.2%; Speed: 23 FPS | NVIDIA RTX 3080 | 26,106
[112] | Disease classification | CNN, GAN, LSTM | Bacterial blight: Precision 96%; Recall 97%; F1 99% | — | 5120
[113] | Detection of rice pests | YOLOv8n, FastGAN, Fully Connected Bottleneck Transformer, SPPF | mAP50: 93.6%; Speed: 59.52 FPS | NVIDIA Tesla T4 | 13,877
[114] | Detection of pests | CNN, GNN, SPPF | F1: 87.24%; Recall: 81.16%; Precision: 87.40% | NVIDIA GeForce GTX 1050 Ti | 2850
Table 4. Summary of the application of computer vision and deep learning in weed detection.

Reference | Target | Approach | Performance | Hardware | Dataset Size
[119] | Targeted weed control | VGG-16, AlexNet, GoogleNet | Precision: 98%; Recall: 97%; F1: 97% | NVIDIA GeForce GTX 1080 | 12,443
[120] | Weed detection | YOLOv5, HGNetV2, Scale Sequence Feature Fusion Module | mAP50: 94.2%; Speed: 30.6 FPS | NVIDIA GeForce RTX 3090 | 5270
[121] | Weed detection | YOLOv5, BiFPN, Swin Transformer | mAP50: 90.8%; Recall: 88.1%; Precision: 64.4%; Speed: 20.1 FPS | NVIDIA GTX 3080Ti | 5000
[122] | Field weed detection | YOLOv4, CSPDarknet53, CBAM | mAP50: 86.89%; Recall (weed): 78.02%; Recall (maize): 83.55% | NVIDIA Tesla V100 | 3000
[123] | Weed detection | YOLOv8, LSK, DySample | mAP50: 98.0%; mAP50-95: 95.4%; Speed: 118 FPS | NVIDIA GeForce RTX 3080 Ti | 6496
[124] | Semantic segmentation of crops and weeds | ResNet34, CWFDM | Precision: 98.4%; mIoU: 0.9164; F1: 0.9556 | NVIDIA Tesla P40 | 492
[125] | Weed detection | YOLOv5s, CSPDarkNet53, SKAttention | CottonWeedDet12: mAP50: 95.3%; mAP50-95: 89.5%; Speed: 77 FPS | NVIDIA RTX A5000 | 5648
[126] | Weed–crop segmentation | Dense-Inception, ASPP, CnSAU | mIoU: rice 0.81; weeds 0.79; others 0.84 | — | 1092