1. Introduction
Vegetables contain a variety of nutrients essential for human health and are indispensable in daily diets. Plug seedling transplantation, as an emerging vegetable cultivation method, has achieved cost reductions, efficiency improvements, yield increases, and quality enhancements by incorporating advanced technologies such as modern biotechnology and environmental control during the seedling stage [1,2]. Before commercial plug seedlings are sold, they typically undergo sorting to remove missing or weak seedlings and replace them with healthy ones, ensuring uniform and robust seedlings [3,4]. However, sorting in industrialized seedling production is still predominantly manual, which is time-consuming and labor-intensive, making it difficult to meet the demands of large-scale production [5,6]. Research on automated plug seedling sorting technology is therefore of great significance, and machine vision-based seedling grading is a crucial approach to achieving automated sorting.
At present, methods for the detection and grading of tray seedlings fall into two groups, traditional machine learning and deep learning, each of which has demonstrated distinct technical advantages and application potential in this field. Traditional machine learning methods rely on hand-crafted feature engineering, such as feature extraction algorithms based on color, texture, and geometric shape, and build mature detection and classification models with classifiers such as support vector machines (SVMs) and random forests. Their advantages lie in strong model interpretability and relatively low computing resource requirements, making them suitable for scenarios with clear features and limited data. Deep learning methods, relying on architectures such as convolutional neural networks (CNNs) and Transformers, automatically learn complex feature patterns from massive data and demonstrate outstanding performance in object detection and semantic segmentation tasks, especially when dealing with complex backgrounds, multi-scale objects, and occlusion. The following review introduces tray seedling detection and grading methods from these two perspectives: traditional machine learning methods and deep learning methods.
As early as the last century, researchers began to apply traditional machine vision technology to plant classification. In 1994, Tai et al. [7] used cameras and laser emitters to capture images of plug seedlings, applied the area of interest (AOI) technique for empty-cell detection, and achieved a 95% success rate in identifying empty cells. In 1996, Ling et al. [8] successfully measured the canopy area using Otsu’s adaptive thresholding method, providing key features for seedling grading. In 2009, Jiang Huanyu et al. [9] employed the watershed algorithm to extract features such as leaf area and perimeter from tomato seedlings, achieving a 98% classification accuracy. In 2010, Sun Guoxiang et al. [10] proposed a Freeman chain code-based method for leaf area extraction in overlapping tomato seedling leaves, achieving 100% and 96% segmentation success rates for 72-cell and 128-cell plug trays, respectively. In 2013, Hu Fei et al. [11] developed a machine vision-based method to detect empty cells and substandard seedlings using a CK-JV200CI CCD industrial camera and threshold segmentation, achieving over 95.8% accuracy for 13-day-old seedlings in 72-cell trays. The same year, Giselsson et al. [12] introduced two novel shape-feature generation methods based on distance transformation, demonstrating superior performance compared to traditional feature sets. In 2018, Wang Yongwei et al. [13] designed an automatic seedling replenishment system for Arabidopsis seedlings, achieving 100% accuracy in detecting empty and occupied cells by analyzing pixel statistics from grayscale and threshold-segmented images. In 2020, Zhang Guodong et al. [14] proposed a machine vision-based method to identify empty cells and unhealthy seedlings using CKVisionBuilder software, achieving over 95.8% accuracy for lettuce, cabbage, and flowering Chinese cabbage seedlings in 72-cell trays. In 2021, Tong et al. [15] developed a mobile detection system for seedling replenishment, optimizing image stitching algorithms (block matching, Harris corner detection, and SURF feature detection) to achieve 98.7% accuracy in seedling health assessment. Wang Jizhang et al. [16] used a Kinect camera to acquire color and depth images, calculating germination rate, plant height, and leaf area to establish a robust seedling index model with high measurement precision. In 2022, Zhang Lina et al. [17] proposed an RGB-D image-based method for detecting delayed emergence, combining point cloud segmentation (conditional filtering, statistical filtering, and Euclidean clustering) with α-shape-based leaf area and curvature-derived plant height measurements, achieving 95% accuracy. Jin et al. [18] optimized seedling extraction paths using edge recognition and orthogonal experiments to minimize transplant damage. Jin et al. [19] employed an Intel RealSense D415 camera to capture point clouds, planning L-shaped paths to avoid stem and leaf contact, reducing damage rates by 11.11% with only a 0.029 s increase in transplant time. With the advances in computer technology, deep learning has achieved breakthrough progress, with practical applications emerging in the industrial, agricultural, and medical fields through models such as ChatGPT-4.0, ERNIE Bot 3.0, and DeepSeek-R1 [20,21,22]. As technology continues to break new ground, modern agriculture stands at the cusp of a revolutionary transformation. To usher in an efficient and intelligent new era of agriculture, the concept of precision agriculture has been proposed, utilizing various smart sensors combined with big data technology and decision-making algorithms to achieve the digitalization and increased intelligence of agriculture [23,24,25].
The development of deep learning-based image processing technology has further advanced agricultural intelligence. Thanks to their powerful feature extraction capabilities, deep learning algorithms hold significant advantages over traditional methods in processing large and complex datasets. Deep learning models can perform appearance quality inspection of seeds before sowing, assist in monitoring abnormal growth conditions, and conduct quality grading of fruits and vegetables post-harvest. The successful application of deep learning in computer vision has provided new tools for intelligent agricultural and forestry plant information management [26,27,28,29]. In 2019, He Yan et al. [30] developed a closed image acquisition system using an AdaBoost algorithm, multilayer perceptron (MLP), and convolutional neural network (CNN) recognition technology, achieving 97.58% accuracy in identifying tobacco seedling tray categories and robust seedlings. Zhang Yong et al. [31] employed the LeNet deep learning algorithm as the core method to identify empty cells and inferior seedlings in trays, achieving 98.7% recognition accuracy, and subsequently designed a deep learning-based image processing system for seedling grading and transplanting. Xiao et al. [32] proposed a transfer learning-based classification method for plug seedlings by extracting regions of interest from original images and applying grayscale processing, then constructing a classification model using a VGG16 convolutional neural network, ultimately achieving 95.50% classification accuracy. In 2021, Perugachi-Diaz et al. [33] used a dataset containing 13,200 seedling images to predict the growth success rate of cabbage seedlings, comparing traditional logistic regression (LR), a multilayer perceptron (MLP), and four pre-trained CNN architectures (AlexNet, DenseNet, ResNet, and VGG). The results showed AlexNet performed best, achieving 94% accuracy and 0.95 AUC on the test set, demonstrating CNNs’ superiority over traditional methods in processing image data. Kolhar et al. [34] employed spatiotemporal deep neural networks to classify Arabidopsis thaliana strains, comparing 3D CNNs, CNNs with convolutional long short-term memory (ConvLSTM) networks, and the Vision Transformer. These methods utilized temporal and spatial information from time-series RGB images to classify four different Arabidopsis strains. Experimental results showed the Vision Transformer achieved the highest classification accuracy at 98.59%, though with more parameters, while CNN-ConvLSTM achieved 97.97% accuracy with fewer parameters. In 2022, Jin et al. [35] established a healthy lettuce seedling identification model based on the ResNet18 network through deep learning and transfer learning strategies. The model achieved 97.44% detection accuracy with model loss maintained at approximately 0.005, outperforming physical feature-based recognition models.
Most current plug seedling grading methods only detect missing seedlings in trays without assessing seedling quality. The few studies that do perform quality grading rely on extracting basic seedling features, where occlusion and interference between seedlings are inevitable, significantly reducing grading accuracy. High-precision, high-stability plug seedling grading methods therefore still require further research. Meanwhile, to address the problem that features extracted from a single viewing angle during classification are incomplete, this study investigates an intelligent grading method for pepper plug seedlings based on RGB and point cloud images, aiming to achieve quality grading during the transplanting process. By utilizing RGB and point cloud images of pepper plug seedlings for precise grading, this method improves grading accuracy and efficiency while reducing labor costs. Additionally, an intelligent grading system for pepper plug seedlings using RGB and point cloud images is designed to better facilitate grading and transplanting operations.
The main contributions of this paper are as follows:
The design and construction of an image acquisition platform to collect RGB and point cloud images of pepper plug seedlings. The acquired images were preprocessed and annotated to create three distinct datasets: an RGB seedling recognition dataset, an RGB leaf recognition dataset, and a 2D point cloud image segmentation dataset.
The investigation of mainstream object detection algorithms through comparative experiments to establish pepper seedling recognition and leaf recognition models. Based on these experiments, YOLOv11 was selected as the detection network for this method.
The improvement of the U-Net image segmentation network by incorporating C-Res residual modules and ResAG gate attention modules into the skip connections. The enhanced U-Net was experimentally evaluated against mainstream segmentation networks, demonstrating performance improvements across all metrics. Comparative analysis of the segmentation results confirmed the functional effectiveness of the proposed modules, and ablation studies further verified that each modification contributed to performance enhancement, with the final optimized model meeting the practical requirements of this method.
The development of feature extraction methods for plug seedlings, obtaining parameters including normal leaf count, abnormal leaf count, leaf area, and plant height. These features were used to create a seedling grading dataset. After training with various classification algorithms, the random forest algorithm demonstrated superior performance and was selected as the grading method for this approach.
2. Materials and Methods
2.1. Intelligent Grading Method for Pepper Seedling Trays Based on RGB and Point Cloud Images
Currently, algorithms for plug seedling grading primarily use 2D images, following two mainstream methods whose advantages and disadvantages are summarized in Table 1. This paper proposes an intelligent plug seedling grading system based on RGB and point cloud images. The image acquisition device is placed above the seedlings to capture RGB and point cloud images from top to bottom. The RGB images are mainly used to provide the position and leaf count of individual seedlings, while the point cloud images primarily provide plant height and leaf area. Benefiting from the three-dimensional information of point cloud images, they not only provide plant height but also better restore the actual size of the leaves at the time of shooting, thereby yielding a more accurate leaf area.
The core of this paper lies in using deep learning for seedling feature extraction, followed by grading through machine learning. If the extracted features have significant errors, the grading results will be severely affected. Therefore, the accuracy of feature extraction is crucial. To improve accuracy, this paper first uses deep learning algorithms to identify and segment individual seedlings from the tray, reducing interference from other seedlings and facilitating subsequent leaf identification and segmentation. Leaf count is one of the important parameters for evaluating seedling quality, yet it rarely appears in the grading indicators of various methods. The reason is that unlike other parameters, leaves vary in shape and size, making it difficult for traditional methods to capture their features. Therefore, we adopted deep learning methods to identify leaves and then counted them to obtain the leaf number. Plant height and leaf area are also the most intuitive features for evaluating seedling grade. Benefiting from the high-dimensional information of point cloud images, this paper easily obtained the plant height and leaf area information from point clouds. The only challenge is determining the precise position of seedlings in point cloud images. This paper projects point cloud images onto the xy-plane to obtain 2D point cloud images, then uses deep learning segmentation to locate seedlings accurately on the xy-plane. The position information is then used to extract the corresponding 3D point cloud data of each seedling. Finally, plant height and leaf area are calculated based on point cloud height and density.
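Concretely, the xy-plane projection can be a minimal rasterization step. The sketch below assumes the point cloud is an N × 3 NumPy array in camera coordinates with non-negative heights; the function name and grid resolution are illustrative, not details of the original system:

```python
import numpy as np

def project_to_xy_image(points, resolution=1.0):
    """Project an N x 3 point cloud onto the xy-plane as a 2D height image.

    Each pixel keeps the maximum z value of the points that fall into it,
    preserving the top-view silhouette used for segmentation (assumes z >= 0).
    """
    xy, z = points[:, :2], points[:, 2]
    # Map metric xy coordinates to integer pixel indices.
    cols, rows = ((xy - xy.min(axis=0)) / resolution).astype(int).T
    image = np.zeros((rows.max() + 1, cols.max() + 1), dtype=np.float32)
    np.maximum.at(image, (rows, cols), z)  # highest point per pixel (top-down view)
    return image
```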
The specific process of the grading method in this paper is shown in Figure 1. First, a 2D camera and a 3D camera are used to collect RGB and point cloud images of the plug seedlings in the tray. The point cloud images are converted into 2D images through image processing, and the 2D point cloud images are aligned with the RGB images. Then, an object recognition deep learning algorithm is applied to the RGB images to identify the position of each plug seedling in the tray. Using the position information obtained from recognition, the plug seedlings are segmented from the RGB images and the 2D point cloud images. The segmented RGB images of the plug seedlings undergo deep learning-based leaf recognition to obtain the number of leaves, while the segmented 2D point cloud images undergo deep learning-based leaf segmentation to obtain the position information of the leaves. With the position information of the leaves, the point clouds belonging to the leaves in the original point cloud image can be retrieved, yielding the leaf area; meanwhile, the plant height can be obtained from the height information of the point clouds. Finally, a grading model for the plug seedlings is constructed from the extracted characteristic parameters, and the grading results of the plug seedlings in the tray are obtained.
2.2. Data Collection
In order to collect 3D images and 2D images simultaneously, this paper designed and built an image acquisition platform for the grading of pepper plug seedlings. The image acquisition platform mainly consisted of a conveying structure, image acquisition equipment, and image processing equipment. The conveying structure was composed of a conveyor belt and a control module, which was used to simulate the actual working conditions during the operation of the transplanter. The image acquisition equipment was composed of an industrial camera and a laser camera. The image processing equipment was a computer.
The training effect of a deep learning model depends largely on the quality of the dataset; therefore, an industrial camera was selected for dataset collection. The MVCAM MI-230U150C industrial camera (Hangzhou HikRobot Co., Ltd., Hangzhou, China) was used. Owing to its high resolution and excellent imaging quality, this camera is widely used in the field of machine vision, making it an ideal choice, and it was therefore selected as the acquisition device for RGB images in this paper; a photograph of the camera is shown in Figure 2a. The SICK TriSpector 1000 (SICK AG, Waldkirch, Germany) is a high-performance 3D vision sensor that can quickly and accurately obtain the three-dimensional point cloud of an object. Its main features include high measurement accuracy and fast data processing; therefore, it was selected as the acquisition device for point cloud images in this paper. The point cloud acquisition device is shown in Figure 2b.
Figure 3 shows the image acquisition platform built in this paper. The main frame of the platform is made of aluminum profiles and mainly consists of an MI-230U150C industrial camera and lens, a SICK TriSpector 1000, an acquisition box, an LED lamp, a light source controller, a conveyor belt, a conveyor belt controller, a communication cable, and a computer. The specific parameters of the experimental platform were as follows: the maximum height of the platform was 150 cm; the conveyor belt was 70 cm above the ground, 150 cm long, and 30 cm wide; the acquisition box was 80 cm high; the industrial camera was 59 cm above the conveyor belt; the SICK TriSpector 1000 was 47 cm above the conveyor belt; and the light source was 78 cm above the conveyor belt. During image acquisition, the seedling trays were placed in the acquisition box with the industrial camera directly above the tray position to be acquired. After receiving the control command, the camera took photos. Once the computer received the RGB images, it moved the conveyor belt to the right; when the tray reached the point cloud acquisition area, the SICK TriSpector 1000 received the command and began collecting point cloud images. After image acquisition was completed, the conveyor belt stopped moving.
To construct a dataset of pepper plug seedlings, pepper seedlings of the “Xiangla 66” variety were selected for the experiment. The stems of “Xiangla 66” chili seedlings are thick and strong, with good adaptability and disease resistance. Pepper farmers in Hunan Province often adopt modern seedling raising techniques, such as raising seedlings together with their substrate soil; 72-cell trays combined with imported substrate soil have become the mainstream choice, enabling the root system to form a close symbiosis with the soil and effectively reducing root damage during transplanting. Seedlings of this type are transplanted at the 6–8 leaf stage. The cultivation of the plug seedlings was entrusted to Hengyang Vegetable Seeds Co., Ltd. (No. 46, Xianfeng Road, Yanfeng District, Hengyang City, Hunan Province, China). Pepper seedlings with a seedling age of 7–10 days were selected for image acquisition. The industrial camera was first used to photograph whole trays of plug seedlings to obtain RGB images. In total, 20 trays of pepper plug seedlings were collected; each tray had a 5 × 10 layout, giving a total of 1000 pepper seedlings.
Figure 4 and Figure 5 show the RGB images and point cloud images collected in this paper; the point cloud images are colored according to height. First, the three-dimensional point cloud images were converted to two dimensions in preparation for dataset production and model training. Then, to align the RGB images with the two-dimensional point cloud images, both were cropped using region of interest (ROI) technology. After cropping, the size of the two-dimensional point cloud images was adjusted to match that of the RGB images, in preparation for cropping after the recognition of the plug seedlings. Figure 6 shows a comparison of the two-dimensional point cloud image and the RGB image before and after processing.
After image collection, the captured pictures of pepper seedlings were annotated, with each pepper seedling framed individually using the LabelImg 1.8.6 software. After annotation was completed, the 20 RGB pictures of plug seedling trays were divided into a training set and a validation set at a ratio of 7:3 to obtain the RGB dataset for plug seedling recognition.
To extract the leaf count feature of the plug seedlings, single-plant images were cropped out using the annotation frames in the RGB plug seedling recognition dataset. A total of 500 single-plant images were selected, and the leaves of each plug seedling were annotated with the LabelImg software. Two types of leaves were annotated: normal leaves and abnormal leaves, the latter including insect-eaten, curled, and yellow leaves. After annotation was completed, the annotated dataset was again divided into a training set and a validation set at a ratio of 7:3 to obtain the RGB leaf recognition dataset.
To extract the leaf area feature of the plug seedlings, single-plant images were cropped from the two-dimensional point cloud images using the positions of the annotation frames in the RGB plug seedling recognition dataset. Similarly, 500 single-plant images were selected, and the single-plant plug seedlings were segmented and annotated using the Labelme software. After annotation was completed, the annotated dataset was divided into a training set and a validation set at a ratio of 7:3 to obtain the 2D point cloud image segmentation dataset.
This study formed an annotation team consisting of two domain researchers and three annotators who had received standardized training. All annotators passed a pre-annotation assessment (pass score ≥ 90%) and completed training on the unified annotation standard before formal annotation began.
2.3. The YOLO Object Recognition Algorithm
The You Only Look Once (YOLO) series algorithms are advanced one-stage object detection algorithms. Since Joseph Redmon et al. proposed YOLOv1 in 2016, continuous iteration has produced multiple subsequent models, including YOLOv2, YOLOv3, YOLOv4, YOLOv5, YOLOv6, YOLOv7, and YOLOv8. Each version has introduced improvements and innovations in aspects such as the network structure, loss function, and training techniques, and the series has become one of the best-performing families of algorithms for object recognition tasks.
At present, the YOLO series has been iterated to version v11. In this paper, the latest YOLOv11 network was adopted as the main network for the identification of pepper plug seedlings and leaf identification.
Figure 7 is the network structure diagram of YOLOv11 used in this paper.
The YOLOv11 network is mainly composed of three parts: the backbone network (Backbone), the neck network (Neck), and the detection head (Head). The backbone network serves as the foundation of the entire YOLOv11 series of algorithms, primarily responsible for feature extraction of the input image. Through a series of operations such as convolutional layers and pooling layers, it gradually transforms the original image data into feature maps with different abstract features. It adopts the CSPDarknet structure, introducing the concept of cross-stage partial connection (CSP), which reduces the computational load while ensuring the feature extraction ability and improves the operation efficiency of the model. Meanwhile, using Darknet-53 as the backbone network endows it with powerful feature extraction capabilities.
The neck network is located between the backbone network and the detection head, and its main function is to further process and fuse the feature maps extracted by the backbone network. Since the feature maps extracted by the backbone network usually have different scales, and feature maps of different scales contain target information of different sizes, the neck network fuses these multi-scale feature maps so that the model can use feature information at several scales simultaneously, improving its ability to detect objects of different sizes. Compared with other versions, YOLOv11 adopts the path aggregation network (PAN) architecture and incorporates the spatial pyramid pooling-fast (SPPF) module and the C2PSA attention mechanism in the neck network, along with a new C3K2 module. The PAN architecture helps to integrate features from multiple scales and optimize the efficiency of feature transfer, and the C3K2 module, as an evolved version of the C2F module, further enhances the feature processing ability.
The detection head is the part of the YOLO series of algorithms that finally performs object detection. Based on the fused feature maps output by the neck network, it predicts the category and location information of the targets. The detection head usually contains multiple convolutional layers, which process the feature maps through these convolutional layers and output information such as the category probability and bounding box coordinates of each prediction box. YOLOv11 adds two depth-wise separable convolutions (DWConv) in the classification detection head. In this way, while reducing the computational load and the number of parameters, the network can effectively improve the operation efficiency of the model and perform inference and prediction more quickly.
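For illustration, a minimal sketch of loading a YOLOv11 detection model and reading the predicted boxes with the Ultralytics package, which distributes YOLOv11 weights; the weight file and image path are placeholders, not the paper's trained model:

```python
from ultralytics import YOLO

# Load a YOLOv11 detection model (placeholder pretrained weights).
model = YOLO("yolo11n.pt")

# Run inference on a tray image; each box carries a class id, confidence,
# and corner coordinates that locate one seedling.
results = model("tray_seedlings.jpg", conf=0.25)
for box in results[0].boxes:
    print(int(box.cls), float(box.conf), box.xyxy.tolist())
```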
2.4. Improvements Based on the U-Net Image Segmentation Algorithm
In this paper, the U-Net network was used as the basic network for the segmentation of two-dimensional point cloud images, and improvements were made to the original network. The improved network structure is shown in Figure 8.
Improvements were made to the skip connection structure and the decoder of the U-Net. Adding the residual module C-Res to the skip connections deepens the network while allowing the input information to bypass some layers and be added directly to subsequent outputs. This alleviates the gradient vanishing and degradation problems that arise as network depth increases, makes it easier for the network to learn identity mappings, and thus permits the training of very deep networks; it also accelerates training convergence and improves the generalization ability of the model, enabling it to better capture complex features in the data. In addition, a module named residual attention gate (ResAG) was added to the decoder. A residual block is used to extract deep information, and then an attention gate (AG) module focuses on the key areas of the target, dynamically assigning weights to the input feature maps to highlight important information and suppress secondary information. This improves the model’s ability to capture key features and enhances its representational ability and performance.
2.5. The Improved Residual Module C-Res
In traditional neural networks, for example CNN [36] and VGG [37] networks, information may gradually be lost or blurred during transmission as the depth of the network increases. In residual networks, residual connections allow input information to skip some network layers and be added directly to the output after operations such as convolution. In this way, during forward propagation, the input information is transmitted to subsequent layers more directly and completely, avoiding excessive attenuation. During backpropagation, residual networks help solve the gradient vanishing problem: since the input participates directly in the output, gradients flow more smoothly through the residual connection path, making the network easier to train. Even in very deep networks, gradients can be effectively propagated to earlier layers, enabling the network to learn more complex mappings. In this paper, the improved residual module C-Res is added to the network to achieve better results.
Figure 9 shows the structural diagram of the proposed C-Res module in this paper.
This paper enables the U-Net to incorporate extremely deep network structures without encountering problems of gradient vanishing and degradation by introducing the improved residual module C-Res. In this way, it can learn very complex and advanced feature representations, improving the performance of the model in segmentation tasks. The residual block of this residual module is composed of a CBR convolutional block. A CBR convolutional block generally refers to a module composed in sequence of a convolutional layer, a batch normalization layer, and an activation function (such as the ReLU function). This module moves the convolutional kernel in the convolutional layer on the input feature map to perform a convolution operation, enabling it to obtain the key features in the feature map. Convolutional kernels of different sizes can capture different types of features.
In the convolutional module, the batch normalization layer mainly normalizes the feature map output by the convolutional layer, adjusting its distribution to be close to a standard normal distribution with a mean of 0 and a variance of 1. This can accelerate the convergence speed of the network and reduce problems of gradient vanishing or explosion, allowing the network to be trained with a larger learning rate and thus shortening the training time. At the same time, batch normalization also has a certain regularization effect, which can reduce the model’s dependence on initial parameters and improve the generalization ability of the model. The activation function introduces non-linear factors into the network, enabling the network to learn the non-linear relationship between the input and output, allowing the network to fit any complex function and thus handle various complex tasks. Moreover, a 1 × 1 convolution is added to the original residual structure. By using the 1 × 1 convolution, linear combinations can be made of the different feature information contained in each channel of the feature map, realizing cross-channel information interaction. After the features of different channels are convolved, they will be fused together, enabling the network to learn the correlations between channels and excavate more representative features. For example, when processing an image, different channels may represent information such as the color and texture of the image. The 1 × 1 convolution can fuse this information to generate a richer feature representation.
It can be expressed by the following formula:

CBR(x) = ReLU(BN(Conv_{n×n}(x)))
y = CBR(x) ⊕ Conv_{1×1}(x)

where Conv_{n×n}() represents the convolution operation using a convolution kernel of size n × n; ⊕ denotes the addition operation; BN() represents normalization using a batch normalization layer; and ReLU() represents the ReLU activation function.
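A minimal PyTorch sketch of a module matching this description, i.e., a CBR branch added to a 1 × 1-convolved skip path; the channel counts and kernel size are illustrative assumptions, since the paper's exact configuration is given only in Figure 9:

```python
import torch.nn as nn

class CBR(nn.Sequential):
    """Convolution -> BatchNorm -> ReLU block."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__(
            nn.Conv2d(in_ch, out_ch, k, padding=k // 2, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

class CRes(nn.Module):
    """Residual module: a CBR branch added to a 1x1-convolved skip path."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = CBR(in_ch, out_ch)
        self.skip = nn.Conv2d(in_ch, out_ch, 1, bias=False)  # cross-channel fusion

    def forward(self, x):
        # y = CBR(x) + Conv1x1(x), as in the formula above.
        return self.body(x) + self.skip(x)
```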
2.6. The Improved Attention Gate Module ResAG
The attention mechanism is a crucial technology in deep learning. It mimics the human visual system’s ability to selectively focus on different parts of an image, enabling the model to concentrate on important regions and thereby enhancing performance and efficiency. In 2018, Oktay et al. [38] proposed the gated attention mechanism. As a variant of the attention mechanism, the gated attention module is mainly used to regulate the information flow of feature maps. It typically has a control signal (such as the features transmitted through the skip connections in U-Net) and input features (such as the features in the decoder). By performing linear transformations on the input features and the control signal, adding the two results, and passing them through an activation function, a gating signal is obtained. This gating signal is then multiplied element-wise with the input features to achieve the screening and enhancement of features.
To improve the network’s extraction of deep features, this paper first applies a residual operation to the input features of the gated attention before attention processing, obtaining a feature map containing deeper feature information. The residual-processed feature map is then fed into the gated attention for feature screening and adjustment. By introducing the gating mechanism, the flow of features can be controlled more precisely: certain features can be enhanced or suppressed according to task requirements, improving the model’s performance in specific regions or tasks. This effectively suppresses background noise and highlights the features of the target area, making the segmentation results more consistent with the actual situation. The detailed structure of ResAG is shown in Figure 10.
The two inputs of the gated attention are the features passed directly through the skip connection and the decoder features processed by the residual network, respectively. Each is convolved with a 1 × 1 convolution. The resulting feature maps are added together, and the number of channels is then reduced to 1 through a ReLU activation function and another 1 × 1 convolution. A weight coefficient is then obtained using a Sigmoid activation function, and the Resample module restores the map to the size before processing. In this way, an attention coefficient with the same shape as the feature map is obtained, and the feature map is finally weighted by this attention coefficient. Since the gated attention module is mainly based on element-wise operations and simple linear transformations, it does not require the large-scale similarity calculations and matrix operations of ordinary attention mechanisms; it therefore has advantages in computational efficiency and is better suited to tasks with high real-time requirements.
It can be expressed by the formula as follows:

α = Resample(σ(Conv_{1×1}(ReLU(Conv_{1×1}(g) ⊕ Conv_{1×1}(Res(x))))))
ŷ = α ⊗ x

Among them, g denotes the features from the skip connection and x the decoder features; Res() represents the residual operation; σ() represents the Sigmoid activation function; ⊕ denotes the addition operation; and ⊗ denotes the element-wise multiplication that weights the feature map.
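A minimal PyTorch sketch of the gate described above, reusing the CRes block from the previous sketch; the intermediate channel count and the bilinear Resample are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResAG(nn.Module):
    """Attention gate whose decoder input first passes through a residual block."""
    def __init__(self, g_ch, x_ch, inter_ch):
        super().__init__()
        self.res = CRes(x_ch, x_ch)              # deep-feature extraction (C-Res sketch)
        self.w_g = nn.Conv2d(g_ch, inter_ch, 1)  # 1x1 conv on the skip-connection features
        self.w_x = nn.Conv2d(x_ch, inter_ch, 1)  # 1x1 conv on the residual output
        self.psi = nn.Conv2d(inter_ch, 1, 1)     # reduce to a single channel

    def forward(self, g, x):
        xr = self.res(x)
        # Add the two projections, squash to one channel, map to [0, 1].
        alpha = torch.sigmoid(self.psi(F.relu(self.w_g(g) + self.w_x(xr))))
        # Resample the attention map back to the feature-map size
        # (a no-op when g and x already share one spatial size).
        alpha = F.interpolate(alpha, size=xr.shape[2:], mode="bilinear",
                              align_corners=False)
        return xr * alpha  # weight the feature map with the attention coefficient
```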
2.7. Feature Extraction Methods for Plug Seedlings
In this paper, the plug seedlings were graded mainly according to their characteristic parameters; therefore, correctly extracting these characteristics was one of the key steps in grading the pepper plug seedlings. The local standard of Bayingolin Mongol Autonomous Prefecture, “Quality Grading of Plug Seedlings for Processing Peppers” (DB 6528/T 110-2023) [39], evaluates seedlings using plant height, number of leaves, stem base thickness, dry weight root–shoot ratio, and the vigorous seedling index; an evaluation index is computed from weighted combinations of these measures to obtain the final grade. Interviews with growers revealed that manual grading is carried out mainly by visual observation: growers judge the grade of plug seedlings by the number of leaves, plant height, leaf area, and the presence of insect-infested and diseased leaves. Based on this investigation, and in order to grade the plug seedlings quickly and accurately without damaging them, the leaf area, number of leaves, and plant height were finally selected as the grading features in this paper.
2.7.1. Extraction of the Leaf Area Parameter of Plug Seedlings
Since the point cloud generated by the point cloud camera is unordered, with no topological relationship among the points, the three-dimensional point cloud must be projected onto a two-dimensional plane along the normal direction, after which the points in the plane are triangulated to obtain the connectivity between points. Edelsbrunner et al. [40] proposed a point cloud surface-reconstruction method based on the α-shape algorithm. This method first performs Delaunay triangulation on the point cloud and then rolls a sphere of radius α through the point set. For each simplex in the triangulation (tetrahedron, triangular patch, edge, or vertex), the interval of α values belonging to the α-shape is calculated; if α lies within this interval, the simplex is retained, otherwise it is deleted. This algorithm achieves good surface reconstruction, and the constructed surface is essentially free of holes. In the α-shape method, the choice of α affects the result in several ways. From the perspective of shape characteristics, α determines how tightly the reconstructed shape fits the point set. When α is large, the α-shape tends to include scattered points, forming a relatively smooth, broad contour in which local details may be lost. When α is small, the generated shape follows the densely distributed regions of the point set and can capture sharp corners and fine structures, but it is prone to jagged, irregular edges due to overfitting. Either extreme ultimately leads to inaccurate area measurement.
Through preliminary experiments, α = 0.2454 was found to give a good surface reconstruction. The reconstructed point cloud of the seedling leaves is composed of many triangular patches with topological relationships, and the sum of the areas of all triangular patches is the leaf area. The visualization of the leaf area fitted by the α-shape is shown in Figure 11.
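As an illustration, the triangulated area can be computed with Open3D's alpha-shape reconstruction; the library choice is an assumption, as the paper does not name its point cloud toolkit:

```python
import open3d as o3d

def leaf_area_from_points(points, alpha=0.2454):
    """Reconstruct the leaf surface with an alpha shape and sum the patch areas."""
    pcd = o3d.geometry.PointCloud()
    pcd.points = o3d.utility.Vector3dVector(points)  # points: N x 3 array
    mesh = o3d.geometry.TriangleMesh.create_from_point_cloud_alpha_shape(pcd, alpha)
    # The surface area is the sum of the areas of all triangular patches.
    return mesh.get_surface_area()
```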
The leaf area values calculated by the α-shape algorithm were linearly fitted against the true values to obtain the linear regression equation (Equation (5)), where x represents the area calculated by the algorithm and y represents the actual leaf area, in cm². The fitting results are shown in Figure 12.
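The calibration itself is an ordinary least-squares line of the same form as Equation (5); a sketch with NumPy on illustrative placeholder data:

```python
import numpy as np

# Illustrative placeholder data: algorithm-computed vs. measured areas (cm^2).
algo_area = np.array([10.2, 12.5, 8.7, 15.1])
true_area = np.array([11.0, 13.4, 9.5, 16.2])

# Ordinary least-squares line of the form y = a*x + b, as in Equation (5).
a, b = np.polyfit(algo_area, true_area, 1)
fitted = a * algo_area + b
print(f"y = {a:.3f}x + {b:.3f}, mean abs error = "
      f"{np.mean(np.abs(fitted - true_area)):.2f} cm^2")
```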
The leaf area fitted through Equation (5) was compared with the true value. As shown in Figure 13, the average error between the two is 1.12 cm², corresponding to an average relative error of 9.31%, which represents relatively high accuracy. Therefore, this paper directly uses the fitted leaf area in calculating the grading coefficient and the grading threshold for late emergence.
2.7.2. Extraction of Plant Height Parameter for Plug Seedlings
In the document “Quality Grading of Plug Seedlings for Processing Peppers”, plant height is defined as the vertical distance from the highest point of the seedling to the ground in its natural state, and this paper measures plant height in the same way. In Section 4, the positions of the leaves of each plug seedling in the two-dimensional point cloud image were obtained through image segmentation. This position information is converted into polygons, which are then projected onto the point cloud image. The highest point cloud within each polygon is found algorithmically, giving the maximum z-axis height of that region. By converting the point cloud coordinates into real-world coordinates, the plant height of the plug seedlings is obtained. The relationship between actual height and point cloud height was obtained through measurement experiments on the point cloud acquisition device: a cube 120 mm long and wide and 90 mm high was selected as the measurement sample, and its point cloud was collected.
Through the measurement experiment, the point cloud height of the conveyor belt was 59.86 and that of the top of the measured object was 150.2, so the height of the measured object in the point cloud image was 90.34, while its actual height was 90 mm. The conversion coefficient between point cloud coordinates and real coordinates is therefore 90/90.34 ≈ 0.996. The final conversion from point cloud coordinates to real coordinates is

h = 0.996 × (H − G)

where h represents the height of the plug seedling in real coordinates, G represents the height of the plug tray cells in the point cloud image, and H represents the height of the plug seedling in the point cloud image.
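A sketch of this height extraction, assuming the point cloud is an N × 3 NumPy array and the leaf polygon comes from the segmentation step; matplotlib's Path test is one way (an assumption) to select the points inside the polygon:

```python
import numpy as np
from matplotlib.path import Path

def plant_height(points, leaf_polygon, tray_cell_height, scale=0.996):
    """Real-coordinate seedling height following h = scale * (H - G).

    points: N x 3 point cloud; leaf_polygon: M x 2 xy polygon vertices from
    segmentation; tray_cell_height: G, the cell height in point cloud units.
    """
    inside = Path(leaf_polygon).contains_points(points[:, :2])
    H = points[inside, 2].max()  # highest point inside the leaf region
    return scale * (H - tray_cell_height)
```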
The plant height obtained through Equation (6) was compared with the true value, as shown in Figure 14. This paper directly uses the converted plant height in calculating the grading coefficient and the grading threshold for late emergence.
2.7.3. Extraction of the Normal and Abnormal Leaf Count Parameters of Plug Seedlings
The number of leaves of plug seedlings is one of the important indicators in the grading process. During growth, insect-infested, diseased, and damaged leaves have a significant impact on grading; judgment is generally made comprehensively based on the number of such leaves, their severity, and their influence on the overall growth of the seedling. When there are only a few abnormal leaves, their overall impact is small and growth is not significantly disturbed, so the effect on grading is generally minor. When there are many abnormal leaves, photosynthesis and transpiration are affected, growth is inhibited to a certain degree, and the seedling fails to reach acceptable growth.
In this paper, a deep learning algorithm was used to recognize the leaves of the plug seedlings, identifying the normal and abnormal leaves of each seedling and thereby obtaining the counts of normal and abnormal leaves.
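A sketch of the counting step on top of the detector output, assuming Ultralytics-style results; the class ids are assumptions that depend on how the dataset was configured:

```python
def count_leaves(result, normal_id=0, abnormal_id=1):
    """Count normal and abnormal leaves in one detection result.

    Class ids are assumptions; they depend on the dataset configuration.
    """
    classes = [int(c) for c in result.boxes.cls]
    return classes.count(normal_id), classes.count(abnormal_id)

# Illustrative usage with a trained leaf detector (see Section 2.3):
# normal, abnormal = count_leaves(model("seedling_crop.jpg")[0])
```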
2.8. Methods for Grading Plug Seedlings
The quality grading problem of plug seedlings is a typical classification problem based on features. In machine learning, there are many classification algorithms that can solve this kind of problem, such as support vector machines (SVMs), random forests, etc. In this paper, the plug seedlings are divided into three grades, namely first-grade seedlings, second-grade seedlings, and unqualified seedlings. In this paper, the features of 500 seedlings are collected as the dataset for machine learning, including four features: the number of normal leaves, the number of abnormal leaves, the leaf area, and the plant height.
Random forest is an ensemble learning algorithm that combines the results of multiple decision trees to improve the accuracy and stability of the model. A decision tree is a model that makes decisions based on a tree structure: each internal node represents an attribute test, the branches are the test outcomes, and the leaf nodes are the categories. It tests the attributes of a sample and uses the results to divide samples step by step into child nodes until a leaf node is reached, achieving classification or prediction. Random forest completes the classification or regression task by constructing multiple decision trees and synthesizing their results. When constructing each decision tree, it introduces two kinds of randomness. The first is random data sampling: bootstrap sampling draws a certain number of samples with replacement from the original training set to form a new training subset for each tree, so the training data of each tree may contain repetitions and omissions, increasing the differences among the trees. The second is random feature selection: at each node split, only a random subset of the features is considered as candidates, further decorrelating the trees.
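A minimal sketch of such a grading model with scikit-learn; the stand-in data and hyperparameters are illustrative, not values reported in this paper:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stand-in for the 500-seedling dataset with four features per seedling:
# [normal leaf count, abnormal leaf count, leaf area, plant height].
X = np.random.rand(500, 4)
y = np.random.randint(0, 3, 500)  # 0/1/2 = first-grade / second-grade / unqualified

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=100, bootstrap=True, random_state=0)
clf.fit(X_tr, y_tr)
print("validation accuracy:", clf.score(X_te, y_te))
```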
3. Results
3.1. Training Environment and Parameter Setting
The training environment was as follows: Intel(R) Core(TM) i9-10900X CPU @ 3.70 GHz; GPU: RTX 3080 with 10 GB of video memory; Windows 10 operating system; CUDA 11.1. Deep learning framework: PyTorch 1.10.0 with Python 3.9. The YOLOv11 training parameters were set as follows: batch size 2, initial learning rate 0.001, stochastic gradient descent (SGD) optimizer with a momentum factor of 0.937 and a weight decay factor of 0.0005. The model was trained for 300 epochs in total, with weights saved every 10 iterations, and the model with the highest recognition accuracy was finally selected. For the improved U-Net experiments, the batch size was set to 1 and the optimizer was RMSprop, with an initial learning rate of 0.0001 and a weight decay factor of 1 × 10⁻⁸.
3.2. Evaluation Indicators
The evaluation metrics adopted in this paper are precision (P), recall (R), mean average precision (mAP), DICE, and mean intersection over union (MIOU). Precision represents the proportion of samples predicted as positive by the model that are actually positive, i.e., the ratio of correctly predicted positive samples to all samples predicted as positive. The formula is expressed as:

P = TP / (TP + FP)

In the formula, TP (true positive) is the number of samples that are actually positive and correctly predicted as positive by the model; FP (false positive) is the number of samples that are actually negative but wrongly predicted as positive by the model.
Recall refers to the ratio of the number of positive samples the model correctly detects to the actual number of positive samples, reflecting the model’s ability to capture positive examples. The formula is expressed as:

R = TP / (TP + FN)

In the formula, FN (false negative) is the number of samples that are actually positive but wrongly predicted as negative by the model.
mAP is a comprehensive metric calculated from precision and recall, used to measure the overall detection performance of the model across categories. It is obtained by calculating the average precision (AP) for each category and then averaging the AP values over all categories:

mAP = (1/N) Σᵢ APᵢ, i = 1, …, N

In the formula, N represents the number of identified categories.
The DICE index, also known as the DICE coefficient, is calculated from the intersection and union of the predicted results and the true labels, and is mainly used for binary segmentation tasks. The maximum value of DICE is 1: the larger the value, the higher the similarity between the predicted segmentation and the true label, and the better the segmentation. Conversely, the closer the DICE value is to 0, the worse the segmentation. The DICE coefficient is calculated as follows:

DICE = 2|A ∩ B| / (|A| + |B|)

where A and B represent the predicted segmentation result set and the true annotation set, respectively.
The mean intersection over union (MIOU) measures the overlap between the predicted results and the true labels relative to their union. When MIOU is 1, the predicted results of all categories coincide exactly with the true labels, i.e., the segmentation of every category is perfect; this is the ideal state. When MIOU is 0, the predictions have no overlap with the true labels at all, i.e., the segmentation of every category is completely wrong. In general, the closer the MIOU value is to 1, the better the overall segmentation across categories and the higher the performance of the model; the closer it is to 0, the worse the segmentation. MIOU is calculated as follows:

MIOU = (1/n) Σᵢ |Aᵢ ∩ Bᵢ| / |Aᵢ ∪ Bᵢ|, i = 1, …, n

where A is the area predicted by the model for a category, B is the truly labeled area of that category, and n is the number of categories. In this work, n = 1.
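For the single-category case used here (n = 1), both metrics reduce to set operations on binary masks; a NumPy sketch:

```python
import numpy as np

def dice_and_iou(pred, label, eps=1e-8):
    """DICE and IoU for binary segmentation masks (n = 1, as in this work)."""
    pred, label = pred.astype(bool), label.astype(bool)
    inter = np.logical_and(pred, label).sum()
    union = np.logical_or(pred, label).sum()
    dice = 2.0 * inter / (pred.sum() + label.sum() + eps)
    iou = inter / (union + eps)
    return dice, iou
```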
3.3. Experimental Results and Analysis of the Recognition Algorithm for Pepper Plug Seedlings Based on YOLOv11
After conducting experiments on the above experimental platform and in the given environment, we compared the results of training YOLOv11 on the RGB plug seedling recognition dataset with those of other versions of the YOLO algorithm. The experimental results are shown in Table 2.
From the table, we can see that the recognition effect of the YOLO series on the dataset for recognizing plug seedlings in RGB images generally improves with the iteration of the versions. Among them, YOLOv3 has the worst performance, with relatively low precision and recall rates, and its Mean Average Precision (mAP) is the lowest at 70.2%. YOLOv11 has the best performance. Both its precision and recall rates are the highest, reaching 97.8% and 96.3%, respectively, and its mAP reaches 98.5%. Therefore, the model trained by the YOLOv11 network we selected can meet the requirements of practical applications.
3.4. Experimental Results and Analysis of the Pepper Leaf Recognition Algorithm Based on YOLOv11
In order to obtain the number of leaves of pepper seedlings and determine whether problems such as insect-infested, yellow, or curled leaves are present, we used the recognition algorithm to identify the leaves of pepper seedlings, training the network on the RGB leaf recognition dataset. In this dataset, leaves are labeled in two types: normal leaves, labeled 1, and abnormal leaves, labeled 2. The dataset contains more than 1500 normal leaves and more than 300 abnormal leaves. We compared the evaluation indicators and detection effects of the models trained by different networks; the evaluation indicators of each model are shown in Table 3.
In Table 3, the first row of each model represents the detection metrics for normal leaves, and the second row represents those for abnormal leaves. From the table, it can be observed that due to the relative simplicity of the dataset, each method achieves a relatively high recognition rate for normal leaves, meeting the requirements of practical applications. However, abnormal leaves exhibit inconsistent abnormal states. For example, yellow leaves and insect-infested leaves differ significantly not only in color but also in leaf morphology. This makes it challenging for the networks to capture the common features of abnormal leaves, resulting in generally lower metrics for abnormal leaves compared to normal leaves. Therefore, this paper focused on analyzing the recognition effects of each model on abnormal leaves.
Regarding the recognition of abnormal leaves, YOLOv3 still shows relatively low metrics across the board, while YOLOv11 performs excellently in all aspects. In terms of precision, YOLOv11 outperforms YOLOv3 by 19.7%. In terms of recall, YOLOv11 is 4.8% higher than YOLOv3. In terms of mAP, YOLOv11 exceeds YOLOv3 by 13.6%. YOLOv11 is clearly superior to YOLOv3 in every aspect.
3.5. Experimental Results and Analysis of the Improved U-Net Segmentation Algorithm
To verify the effectiveness of the method proposed in this paper, the improved U-Net network model was compared with the original U-Net, DeepLabv3, MBSNet, AttUnet, TransAttUnet, and SegNet networks, all trained on the two-dimensional point cloud image segmentation dataset. The experimental results of the trained models are shown in Table 4.
From the table, it can be seen that the improved U-Net performs excellently in the three indicators of DICE, IOU, and recall, and also has a relatively high precision, indicating that its overall performance in the segmentation task is good. AttUnet and DeepLabv3 also have good performances and have their own advantages in different indicators. MBSNet has a relatively high precision but a low recall, suggesting that there may be cases of missed detections. TransAttUnet and the original U-Net perform relatively poorly in the various indicators.
In addition to comparing the evaluation indicators of each model, this paper also randomly selected five images for segmentation by each model, so as to more intuitively display and analyze the segmentation effects of each model.
As can be seen from Figure 15, the improved U-Net outperforms the other models. Comparing the images segmented by different models shows that the improved U-Net has essentially no missed detections. Among the other networks, SegNet suffers from relatively serious missed detections; the remaining networks avoid missed detections but produce leaves of low completeness. The essence of these phenomena is that the network fails to fully learn the target features, so the trained model cannot completely identify the leaves. Thanks to the residual module C-Res added to the skip connections of the U-Net in this paper, the ability to extract deep features is enhanced, so the network learns the deep information of the leaves better during training, identifies leaves more completely, and misses fewer detections. Therefore, the leaves segmented by the proposed method are the most complete among the compared methods.
In addition, the images segmented by the improved U-Net are cleaner than those of the other models. Noise points appear in the images segmented by other models, especially U-Net, MBSNet, and DeepLabv3, whose outputs contain a large number of noise points that seriously degrade segmentation quality. During training, a network collects both correct target features and incorrect ones, so the trained model may segment non-target areas and introduce noise. In the improved U-Net, features are screened during feature fusion by the gate attention module, reducing the influence of incorrect information and strengthening the contribution of correct information, so the trained model identifies the target area better and produces fewer noise points. This paper further improves on the original gate attention module: the ResAG module uses a residual block to strengthen feature extraction, giving the network more and deeper features, which are then screened and fused by the gate attention. Compared with the original U-Net, the improved network therefore has better feature extraction and learning abilities, and the model it trains achieves the best segmentation results.
To further verify the effectiveness of the proposed modules, an ablation experiment was conducted to measure the contribution of each module to the network. The results are shown in Table 5.
Model 1 in Table 5 denotes the model trained by the original U-Net, Model 2 the U-Net with the C-Res module added, and Model 3 the U-Net with both the C-Res and ResAG modules added, i.e., the improved U-Net proposed in this paper. As the table shows, Model 3, which introduces the C-Res and ResAG modules on top of the base model, has clear advantages over Model 1 and Model 2 across multiple performance indicators. Specifically, the DICE coefficient of Model 3 reached 0.834, an increase of 0.126 and 0.024 over Model 1 (0.708) and Model 2 (0.810), respectively; its IoU was 0.725, an increase of 0.148 and 0.022 over Model 1 (0.577) and Model 2 (0.703). Model 3 also outperforms Models 1 and 2 in precision (0.914) and recall (0.780), indicating that it captures the target area more comprehensively while maintaining high detection accuracy, reducing both missed and false detections.
3.6. Experiments and Result Analysis of the Grading Model
After the dataset was obtained, the samples were manually graded according to the extracted features. The labeled dataset was then fed into different machine learning algorithms for training; the resulting models were analyzed, and the best-performing one was selected as the grading model of this paper.
In this paper, accuracy, F1 score, and AUC–ROC are selected as the evaluation indicators of the model. Accuracy is the proportion of correctly classified samples among all samples and reflects how correctly the model classifies the full sample set. It can be expressed by the formula

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN},$$

where TP is the number of samples that are actually positive and predicted as positive by the model; TN is the number of samples that are actually negative and predicted as negative; FP is the number of samples that are actually negative but predicted as positive; and FN is the number of samples that are actually positive but predicted as negative.
The F1 score is the harmonic mean of precision and recall:

$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.$$

It takes both the accuracy and the completeness of the model's predictions into account and therefore evaluates classification performance on positive and negative samples more comprehensively, which is especially suitable when the positive and negative samples are imbalanced.
The ROC curve plots the true positive rate (TPR) on the ordinate against the false positive rate (FPR) on the abscissa, and the AUC–ROC is the area under this curve. The larger the AUC–ROC, the better the classification performance of the model. It measures the model's ability to distinguish positive from negative examples across different thresholds and evaluates the overall performance of the classifier without being affected by class imbalance. The area can be computed with the trapezoidal rule:

$$\mathrm{AUC} = \sum_{i=1}^{n-1} \frac{(x_{i+1} - x_i)(y_i + y_{i+1})}{2},$$

where $(x_i, y_i)$ and $(x_{i+1}, y_{i+1})$ are the coordinates of two adjacent points on the ROC curve and $n$ is the number of points on the curve.
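For illustration, all three indicators can be obtained with scikit-learn as in the following minimal sketch (toy data only; the variable names are placeholders, not the paper's dataset):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Toy labels and scores purely for illustration, not the paper's data.
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_score = np.array([0.9, 0.2, 0.7, 0.6, 0.4, 0.8, 0.3, 0.1])  # P(positive)
y_pred = (y_score >= 0.5).astype(int)

print("accuracy:", accuracy_score(y_true, y_pred))  # (TP + TN) / N
print("F1:", f1_score(y_true, y_pred))              # 2PR / (P + R)
print("AUC-ROC:", roc_auc_score(y_true, y_score))   # trapezoidal area under ROC
```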
As can be seen from Table 6, logistic regression shows relatively stable performance, with an accuracy of 92.0%, an F1 score of 91.8%, and an AUC–ROC of 96.8%; although it does not match random forest or XGBoost, it has advantages in simplicity and computational efficiency. Random forest performs best, with an accuracy of 97.0%, an F1 score of 96.9%, and an AUC–ROC of 99.1%, indicating a strong ability to handle complex data and nonlinear relationships. XGBoost comes close, with an accuracy of 96.0%, an F1 score of 95.8%, and an AUC–ROC of 99.4%. The support vector machine achieves an accuracy and an F1 score of 96.0% each and an AUC–ROC of 95.5%; its performance is good, but its training time is long. K-Means clustering performs worst, with an accuracy of 71.0%, an F1 score of 70.5%, and an AUC–ROC of 83.4%; as an essentially unsupervised method, it cannot exploit the label information, so the model it produces is the weakest. Based on these experiments, the random forest algorithm performs best on this dataset and is therefore selected as the grading algorithm in this paper.
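As a minimal sketch of the selected approach, a random forest can be trained on the four seedling features used in this paper (normal leaf count, abnormal leaf count, leaf area, and plant height); the data below are random placeholders and the hyperparameters are assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Each row: [normal_leaf_count, abnormal_leaf_count, leaf_area, plant_height].
# Random placeholder data; the real dataset comes from the feature
# extraction stage described in this paper.
rng = np.random.default_rng(0)
X = rng.random((200, 4))
y = rng.integers(0, 2, size=200)        # seedling grade labels (placeholder)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
clf = RandomForestClassifier(n_estimators=100, random_state=0)  # assumed settings
clf.fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```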
4. Discussion
At present, most mainstream methods for grading tray seedlings only detect whether cells are missing seedlings and do not grade seedling quality. The few studies on quality grading rely on extracting basic characteristics of the seedlings, during which occlusion and interference between seedlings are unavoidable and greatly reduce grading accuracy. A high-precision, high-stability grading method for tray seedlings therefore still requires further research. This paper studies an intelligent grading method for pepper tray seedlings based on RGB and point cloud images, intended to grade seedling quality during the grading and transplanting process. The RGB and point cloud images of the pepper tray seedlings are used to grade the seedlings accurately, improving grading accuracy and efficiency and reducing labor costs. An intelligent grading system for pepper tray seedlings based on RGB and point cloud images is also designed so that the method can be better applied in grading and transplanting work.
This paper focuses on improving a deep learning segmentation algorithm. Building on U-Net, the algorithm introduces the innovative C-Res residual module and the ResAG gate attention module, which give it better segmentation accuracy and stability. Experimental comparison of the improved U-Net with mainstream segmentation networks shows significant improvements across the indicators: precision, recall, DICE coefficient, and mean intersection over union (IoU) reach 91.4%, 78.0%, 83.4%, and 72.5%, respectively, the best among the compared algorithms. Compared with the algorithm before improvement, these four indicators increased by 2.6%, 6.4%, 5.5%, and 6.5%, respectively.
The extraction of various features of the plug seedlings was also studied, and the number of normal leaves, the number of abnormal leaves, the leaf area, and the plant height were extracted. These characteristic parameters were used to build a grading dataset for plug seedlings, on which mainstream classification algorithms were trained. The random forest algorithm, which performed best, was finally adopted as the grading method, with an accuracy, F1 score, and AUC–ROC of 97.0%, 96.9%, and 99.1%, respectively.
In terms of real-time performance, the model's computing speed can meet the real-time requirements of seedling production even with large-scale data or complex scenarios, improving the overall efficiency of transplanting. Despite these clear technical advantages, many challenges remain for actual deployment. In terms of cost, the research and development, procurement, and maintenance of the complete system are expensive, covering hardware investments such as high-precision sensors and dedicated computing equipment as well as software costs such as algorithm optimization and system upgrades; this places significant economic pressure on small agricultural enterprises and farmers. Occlusion is also a prominent problem: in real seedling-raising environments, mutual occlusion of tray seedlings is widespread. Although the model in this paper can grade tray seedlings under slight occlusion, under large-scale occlusion it struggles to accurately identify the characteristics of the occluded parts, which can bias the detection results and reduce grading accuracy.