Article

Research on Instance Segmentation Algorithm of Greenhouse Sweet Pepper Detection Based on Improved Mask RCNN

School of Mechanical and Automotive Engineering, Guangxi University of Science and Technology, Liuzhou 545006, China
* Authors to whom correspondence should be addressed.
Agronomy 2023, 13(1), 196; https://doi.org/10.3390/agronomy13010196
Submission received: 25 November 2022 / Revised: 2 January 2023 / Accepted: 5 January 2023 / Published: 7 January 2023
(This article belongs to the Special Issue AI, Sensors and Robotics for Smart Agriculture)

Abstract

The fruit quality and yield of sweet peppers can be effectively improved by accurately and efficiently controlling the growth conditions and taking timely measures to manage the planting process dynamically. Deep-learning-based image recognition technology that accurately segments sweet pepper instances is an important means of achieving these goals. However, the accuracy of existing instance segmentation algorithms is seriously affected by complex scenes such as changes in ambient light and shade, similarity between the pepper color and background, overlap, and leaf occlusion. Therefore, this paper proposes an instance segmentation algorithm that integrates the Swin Transformer attention mechanism into the backbone network of a Mask region-based convolutional neural network (Mask RCNN) to enhance the feature extraction ability of the algorithm. In addition, UNet3+ is used to improve the mask head and the segmentation quality of the mask. The experimental results show that the proposed algorithm can effectively segment different categories of sweet peppers under conditions of extreme light, sweet pepper overlap, and leaf occlusion. The detection AP, AR, segmentation AP, and F1 score were 98.1%, 99.4%, 94.8%, and 98.8%, respectively. The average FPS value was 5, which satisfies the requirement for dynamic monitoring of sweet pepper growth status. These findings provide important theoretical support for the intelligent management of greenhouse crops.

1. Introduction

Greenhouse farming has emerged as a new direction for crop cultivation that is largely free from climatic constraints. Sweet peppers, a typical greenhouse crop, are now grown worldwide. The efficiency and cost of traditional manual cultivation management are unsatisfactory. Therefore, introducing artificial intelligence technology into the sweet pepper planting process can not only improve production efficiency and product quality but also bring considerable economic benefits to practitioners. In particular, using image recognition technology to monitor the growth of sweet peppers and dynamically evaluate the development of and changes in their shape, color, and final yield is a key method of realizing intelligent sweet pepper cultivation [1].
Because the physical characteristics of the sweet pepper surface during growth (color, size, shape, etc.) strongly influence the final quality and yield, image recognition technology must identify this information accurately and in real time. However, greenhouse sweet peppers grow in a very complex environment, with different types of lighting at different angles and times of day. Furthermore, the colors and shapes of sweet peppers differ across growth cycles. When a sweet pepper is in the immature growth stage, its color is quite similar to the background color. The problem is worsened by the lush foliage.
Currently, two types of methods are used for fruit and vegetable image recognition. The first comprises traditional image recognition and segmentation techniques, which are feature extraction methods based on classical machine learning. These can be further divided into classification methods (KNN (K-Nearest Neighbor), SVM (Support Vector Machine), and Random Forest) and regression methods (XGBoost (eXtreme Gradient Boosting), linear regression, and ridge regression). The two tasks differ not in their inputs but in the type of output they produce: classification predicts discrete categories, whereas regression predicts continuous values. These methods can only extract simple geometric and color features of an object surface against a uniform background and require the objects being recognized to have highly similar features. Attia et al. [2] used the DSSAT-CERES-maize model combined with six traditional machine learning methods (linear regression, ridge regression, lasso regression, KNN, random forest, and XGBoost) to predict maize grain yield (GY) and evapotranspiration (ET) under different climates and management practices; the XGBoost model performed best. This work provides a basis for agricultural managers to make decisions on crop planting schemes and improve the efficiency of agricultural production. Kheir et al. [3] combined multiple crop models (CMs) with several machine learning models (Artificial Neural Network (ANN), KNN, Random Forest Regressor (RFR), and Support Vector Regressor (SVR)) and used key variables such as temperature, solar radiation, and precipitation to predict wheat yield in different regions; the ANN and RFR algorithms achieved the highest accuracy and the best results. Such methods help crop growers plan scientific and rational planting schemes. Rocha et al. [4] proposed a method that fuses multiple image features and classifiers to classify all types of fruits and vegetables in images automatically, thus addressing a complex classification problem. To reduce human labor and realize automatic fruit sorting, Kumar et al. [5] designed a technique to identify, localize, and classify citrus fruits based on object color features. Kanade et al. [6] presented a machine vision method for detecting guava maturity: by computing the distribution of the RGB channels in the image and using parameters such as the tristimulus values and mean color coordinates based on the CIE1931 standard, fruits were classified into four categories (green, ripe, overripe, and spoiled). Jiang et al. [7] developed a combined recognition method based on RGB and depth information for classifying multiple categories of fruits, which improved recognition robustness under different viewing angles and lighting conditions. Visa et al. [8] designed a method that combines fruit contour morphology with Bayesian classification to classify different types of tomatoes. Xin et al. [9] used an apple segmentation method based on fractal features to segment fruits in their natural growth environment, improving segmentation speed against complex backgrounds. Efi et al.
[10] presented an adaptive threshold algorithm based on multisensor fusion, which improved the detection accuracy of red sweet peppers under different lighting conditions. Based on the K-means clustering method, Wen et al. [11] used the thermal imaging information of plant leaves captured by an infrared camera to segment cucumber images growing in a greenhouse. Bai et al. [12] developed an improved fuzzy C-means algorithm for segmentation of cucumber leaf spot images in different scenes. However, the above studies only analyzed the surface features (shape, contour, color, etc.) of objects in simple scenes and did not involve the high-dimensional features (details, texture, etc.) of the objects. In an actual environment, the characteristics (size, shape, and color) of fruits and vegetables vary greatly among individuals, even within the same category. If image recognition technology only considers shallow feature extraction, many important influencing factors will be missed, and the robustness of the algorithm will not be satisfactory.
With the emergence of high-performance graphics processors and large training datasets, another type of method for fruit and vegetable target recognition has emerged: image recognition technology based on deep learning. This approach trains network models on large numbers of known samples under a deep learning framework, improves the accuracy and speed of fruit and vegetable target detection, and captures high-dimensional features. Therefore, this type of method has stronger feature extraction ability and greater robustness than traditional methods. A growing number of researchers are exploring ways to combine deep learning techniques with agriculture [13,14,15,16,17,18,19,20,21,22,23].
When the feature differences between different classes of objects in an image are significant and the feature similarity within the same class is high, traditional machine learning methods can train the desired model simply and effectively. However, the unstructured environment studied in this paper contains many distracting factors, individual sweet peppers vary in appearance, and the feature differences between the target objects and the surrounding environment are difficult to distinguish. Traditional machine learning methods are therefore no longer applicable to this study. Instead, a deep-learning-based recognition and segmentation method is adopted: it does not rely on manual feature labeling and coding, offers high flexibility, applicability, and efficiency, and can independently mine deep image features during training, features that are difficult to judge manually. It is thus better suited to scenes with unstructured characteristics.
Ajala et al. [24] propose a system architecture to accurately quantify dielectrophoretic force invoked on microparticles in a textile electrode-based sensing device and compare several mainstream conventional machine learning (KNN, SVM) and deep learning models (AlexNet (Alex Network), ResNet-50 (Residual Neural Network-50)). The results show that the deep learning model works best for efficient feature extraction of complex data, and its accuracy and generalization ability are better than traditional machine learning models, which validates the advantages of deep learning methods. Liu et al. [25] optimized the YOLOv3 (You Only Look Once version 3) network structure and fused the dense network structure into this algorithm to facilitate the full utilization of image features. Additionally, the output detection box was designed to be circular to accommodate the shape of the tomatoes. Kang et al. [26] developed a fruit recognition and positioning system based on deep learning that utilized an automatic annotation unit to improve the efficiency of image annotation. Sa et al. [27] proposed feature fusion of color and infrared images to enrich image features and used a Faster Region-Convolutional Neural Network (Faster RCNN) detector to detect fruits. Gao et al. [28] used the Faster RCNN model to detect physical barriers such as wires or branches under different conditions for various categories of apples (no occlusion, leaf blockage, branch and wire blockage, fruit occlusion, etc.). This approach helped improve the work efficiency of picking robots and reduced obstacle damage, such as wire damage to the manipulator. Wan et al. [29] optimized the convolutional and pooling layers of Faster RCNN for identifying different classes of fruits with higher detection accuracy and shorter running time than other algorithms. Gan et al. [30] used thermal imaging technology for fruit detection. Firstly, a water spraying system was used to generate the temperature difference between branches, leaves, and fruits to make the thermal image characteristics of the two more significant. Then, a thermal camera was employed to detect the fruits, so as to solve the problem that immature citrus fruits are quite similar to green branches and leaves, making it difficult to detect them accurately. Kang et al. [31] developed a multifunctional network that used the Gated Feature Pyramid Network (GFPN) to extract different levels of image feature information. They also adopted a lightweight network structure to reduce model parameters, reduce unnecessary computation, and use a vision system to identify and locate apples. Roy et al. [32] designed an enhanced semantic segmentation network based on an improved UNet (a U-shape network designed based on a fully convolutional network) that incorporated more feature layers into the encoder, obtained higher-level features from images for accurate segmentation of rotten and fresh apples to achieve high-efficiency automatic sorting in fruit processing. Ni et al. [33] used a data processing pipeline to reduce annotation time and used the Mask RCNN instance segmentation algorithm to segment and evaluate four blueberry varieties. Yu et al. [34] processed strawberry images with multi-scale information by building a Mask RCNN model and using a feature pyramid network (FPN). For visual localization of strawberries, the model has higher recognition accuracy and better generalization ability, and its performance is better than the traditional model. Jia et al. 
[35] optimized the Mask RCNN model by combining ResNet with Densely Connected Convolutional Networks (DenseNet) to achieve accurate segmentation of overlapping apples. Compared with traditional image recognition and segmentation technology, the above studies effectively improve the feature extraction ability for fruit and vegetable targets and can detect and segment fruits and vegetables when object features differ greatly and interfering objects are present. However, when the object is in a complex scene (light and dark changes, similarity between target and background colors, overlap, branch and leaf occlusion, etc.), the existing methods have difficulty ensuring detection and segmentation accuracy and are prone to missed and false detections under extreme conditions (backlighting, severe occlusion). Therefore, fruit and vegetable recognition and segmentation accuracy in complex scenes must be improved.
To address these shortcomings, an improved Mask RCNN instance segmentation algorithm was developed in this study to perform accurate instance segmentation on the images of sweet peppers planted in a greenhouse under complex scenes to ensure dynamic monitoring of the growth status of sweet peppers. The main contributions of this study are as follows:
  • Image preprocessing methods such as sharpening and the addition of salt and pepper noise and Gaussian noise are used to augment the limited number of image samples, thereby increasing dataset richness.
  • The Swin Transformer attention mechanism is introduced into the backbone network of the Mask RCNN to enhance the feature extraction ability of the network. The Swin Transformer reduces computational complexity by moving the windows for image feature learning and calculating self-attention in the window.
  • UNet3+ is used to improve the mask branch and replace the fully convolutional network (FCN) of the original algorithm, further improving the segmentation quality of the mask. The UNet3+ network, with its full-scale skip connections, can fully extract shallow and deep feature information.
  • Ablation experiments were designed to explore the effects of the above improvements on the Mask RCNN algorithm for sweet pepper instance segmentation. In addition, the proposed algorithm was compared with different algorithms to evaluate its advantages.
The remainder of this paper is structured as follows: Section 2 introduces the image data acquisition technique, preprocessing methods, image annotation process, improved Mask RCNN instance segmentation algorithm, network training method, and algorithm performance evaluation index. Section 3 presents the experimental results and a corresponding analysis. Finally, Section 4 summarizes the conclusions and prospects.

2. Materials and Methods

2.1. The Overall Sweet Pepper Segmentation Framework

In this study, an improved Mask RCNN instance segmentation algorithm is developed to perform accurate instance segmentation of images of greenhouse sweet peppers grown in complex scenes and thereby enable dynamic monitoring of sweet pepper growth status. The system framework of the proposed greenhouse sweet pepper instance segmentation method is shown in Figure 1. The method consists of two steps. Step 1 acquires the sweet pepper image dataset and applies a series of processing operations to generate the original image data required for model training. Step 2 uses the processed sweet pepper images as the input to the proposed algorithm and carries out feature extraction and model training; the performance of the trained model is then evaluated to measure the strengths and weaknesses of the proposed algorithm. Finally, recognition and segmentation of greenhouse sweet peppers are realized, and the model outputs are analyzed qualitatively and quantitatively.

2.2. Image Acquisition

This study collected images of greenhouse sweet peppers provided by the Kaggle and Veer platforms as the dataset (Figure 2). The images covered different weather and lighting conditions (well-lit, backlit) and were taken from different angles. The dataset contained 2286 images and three categories of sweet peppers: green, red, and yellow. Each category was randomly divided into training and validation sets (Table 1). The training sets provide the input images for model training, and the validation sets are used to evaluate model performance after training.
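For illustration only, the per-category random split described above could be implemented as follows; the directory layout, file extension, and split ratio are assumptions, since the paper reports only the resulting counts in Table 1:

```python
import random
from pathlib import Path

def split_dataset(image_dir, val_fraction=0.2, seed=42):
    """Randomly split the images of each sweet pepper category into training and validation sets."""
    random.seed(seed)
    splits = {"train": [], "val": []}
    for category in ("green", "red", "yellow"):
        images = sorted(Path(image_dir, category).glob("*.jpg"))
        random.shuffle(images)
        n_val = int(len(images) * val_fraction)
        splits["val"].extend(images[:n_val])
        splits["train"].extend(images[n_val:])
    return splits

splits = split_dataset("sweet_pepper_images")  # hypothetical dataset folder
print(len(splits["train"]), len(splits["val"]))
```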

2.3. Image Preprocessing

To improve the richness of the dataset, increase the feature information at different levels in the image, and improve the algorithm’s ability to adapt to an actual scene, data augmentation was adopted to increase the sample size further. The data enhancement methods adopted in this study were Laplacian sharpening and the addition of salt and pepper noise and Gaussian noise [36]. Laplacian sharpening can further improve image sharpness, make the edge details of objects in the image clearer, and compensate for unclear images caused by low resolution. The Laplacian sharpening formula is:
g(x, y) = f(x, y) + c \nabla^{2} f(x, y)   (1)
where x and y are the pixel coordinates, g(x, y) is the sharpened image, f(x, y) is the original image, and \nabla^{2} f(x, y) is the Laplacian of the original image; the Laplacian mask is shown in Figure 3. When the central value of the Laplacian mask is positive, c = 1; otherwise, c = –1. A comparison between the original and sharpened images is provided in Figure 4.
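As a minimal sketch of the sharpening step in Equation (1) (the 4-neighbour Laplacian kernel is assumed here; the mask actually used by the authors is the one shown in Figure 3):

```python
import cv2
import numpy as np

def laplacian_sharpen(image):
    """Apply g(x, y) = f(x, y) + c * Laplacian(f); c = -1 because the kernel centre is negative."""
    kernel = np.array([[0, 1, 0],
                       [1, -4, 1],
                       [0, 1, 0]], dtype=np.float64)
    lap = cv2.filter2D(image.astype(np.float64), -1, kernel)
    sharpened = image.astype(np.float64) - lap  # c = -1
    return np.clip(sharpened, 0, 255).astype(np.uint8)
```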
Salt and pepper noise refers to the addition of black and white noise points at random positions in an image. The name originates from the characteristic that the noise is similar in shape to salt or black pepper grains. Gaussian noise is a type of noise whose probability density distribution follows a normal distribution in the image, and its probability density function is:
p(z) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(z - \mu)^{2}}{2\sigma^{2}}}   (2)
where z represents the gray value of the pixels, μ represents the average or expected value of the pixels, and σ represents the standard deviation of the pixel values. Salt and pepper noise will appear randomly at any position in the image, whereas Gaussian noise will exist in every pixel in the image. The comparison of the two effects is shown in Figure 5.
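The two noise types can be added with a few lines of NumPy; the noise amount and standard deviation below are illustrative values, not the settings used in this study:

```python
import numpy as np

def add_salt_and_pepper(image, amount=0.02):
    """Set a random fraction of pixels to black (pepper) or white (salt)."""
    noisy = image.copy()
    mask = np.random.rand(*image.shape[:2])
    noisy[mask < amount / 2] = 0          # pepper
    noisy[mask > 1 - amount / 2] = 255    # salt
    return noisy

def add_gaussian_noise(image, mu=0.0, sigma=10.0):
    """Add noise drawn from N(mu, sigma^2) to every pixel, as in Equation (2)."""
    noise = np.random.normal(mu, sigma, image.shape)
    return np.clip(image.astype(np.float64) + noise, 0, 255).astype(np.uint8)
```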

2.4. Dataset Annotation

Manual annotation of images is required before model training. The image annotation tool used in this study was EISeg [37], which is based on the Reviving Iterative Training with Mask Guidance for Interactive Segmentation (RITM) and EdgeFlow algorithms. It is an intelligent image segmentation annotation tool developed by PaddlePaddle. As long as a few key points are placed on the object, the software can automatically identify and annotate the outline of the entire object, which is far more efficient than other manual annotation tools. The EISeg annotation files are saved in COCO or JSON format, which satisfies most model training requirements. The specific annotation effects are depicted in Figure 6.
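Because EISeg can export COCO-format annotations, they can be loaded for training with the standard pycocotools API; the file path and category handling below are placeholders for illustration:

```python
from pycocotools.coco import COCO

coco = COCO("annotations/sweet_pepper_train.json")  # hypothetical annotation file
img_ids = coco.getImgIds()
anns = coco.loadAnns(coco.getAnnIds(imgIds=img_ids[0]))
masks = [coco.annToMask(a) for a in anns]   # one binary mask per annotated sweet pepper
labels = [a["category_id"] for a in anns]   # category indices (e.g., green/red/yellow)
```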

2.5. Improved Mask RCNN Instance Segmentation Algorithm

Mask RCNN [38] is currently the most widely used instance segmentation algorithm. It was developed from Faster RCNN [39] by adding a mask branch that outputs segmentation results. Although Mask RCNN can meet the needs of instance segmentation in most scenes, it is poorly sensitive to the features of the target object: it gives equal attention to all image features, cannot screen important information independently, and does not suppress interference factors. Because greenhouse sweet pepper images contain a substantial amount of irrelevant information, processing them indiscriminately consumes considerable computational resources and reduces efficiency. In addition, the mask branch of Mask RCNN adopts an FCN structure, which cannot fully exploit the shallow feature information of sweet pepper images and seriously degrades the segmentation quality of sweet peppers in complex scenes (light and dark changes, similarity between target and background colors, overlap, branch and leaf occlusion). To enhance the feature extraction capability, convergence speed, computational efficiency, and segmentation quality of the Mask RCNN instance segmentation algorithm, the backbone network and mask branch were improved in this study. In the improved algorithm, the Swin Transformer [40] attention mechanism replaces the original ResNet and FPN for feature extraction, and the output feature map is used as the input of the region proposal network (RPN) to generate candidate regions. RoIAlign is then employed to extract features for each candidate region, and bilinear interpolation is used to scale the feature maps to a fixed size. Finally, the position, category, and mask of the object are output by two parallel branches: the position and category are output by the fully connected (FC) layer, and the mask is output by the mask branch, so that the location of each object can be obtained accurately. To increase the accuracy of greenhouse sweet pepper instance segmentation under complex conditions, an improved mask branch is used to achieve high-quality mask output. Specifically, UNet3+ [41] replaces the fully convolutional network, and the feature map output by RoIAlign is introduced into the mask branch. After up- and down-sampling in UNet3+, the final output size of the mask is 14 × 14 × 256 pixels. The network structure of the improved Mask RCNN instance segmentation algorithm designed in this study is shown in Figure 7a. The details are explained in the following sections.
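The overall structure can be sketched with torchvision's generic MaskRCNN class: any backbone that exposes an out_channels attribute (here a stand-in module taking the place of this paper's Swin Transformer) can be combined with the RPN, RoIAlign, and the detection and mask branches. The anchor settings follow Section 2.5.2; everything else (channel counts, the stand-in backbone itself) is an assumption for illustration:

```python
import torch
from torch import nn
from torchvision.models.detection import MaskRCNN
from torchvision.models.detection.rpn import AnchorGenerator
from torchvision.ops import MultiScaleRoIAlign

class FeatureBackbone(nn.Module):
    """Stand-in feature extractor; the paper's Swin Transformer backbone would take this place."""
    def __init__(self, out_channels=256):
        super().__init__()
        self.out_channels = out_channels            # required by torchvision's MaskRCNN
        self.body = nn.Conv2d(3, out_channels, kernel_size=4, stride=4)

    def forward(self, x):
        return self.body(x)                         # single-scale feature map

backbone = FeatureBackbone()
# Anchor scales (32, 64, 128) and aspect ratios (1:2, 1:1, 2:1), as described in Section 2.5.2.
anchor_generator = AnchorGenerator(sizes=((32, 64, 128),), aspect_ratios=((0.5, 1.0, 2.0),))
box_pooler = MultiScaleRoIAlign(featmap_names=["0"], output_size=7, sampling_ratio=2)
mask_pooler = MultiScaleRoIAlign(featmap_names=["0"], output_size=14, sampling_ratio=2)

model = MaskRCNN(backbone,
                 num_classes=4,                     # green, red, yellow sweet peppers + background
                 rpn_anchor_generator=anchor_generator,
                 box_roi_pool=box_pooler,
                 mask_roi_pool=mask_pooler)
model.eval()
with torch.no_grad():
    predictions = model([torch.rand(3, 500, 500)])  # one 500 x 500 RGB image, as in Section 2.6
```

Replacing the default FCN-style mask head with a UNet3+-style branch would additionally require passing a custom mask_head module to MaskRCNN (see Section 2.5.3).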

2.5.1. Feature Extraction

Backbones are typically used as feature extraction networks. Their main role is to extract rich features from the initial images in preparation for the tasks executed later. For a long time, the modeling of image classification, detection, and segmentation in computer vision has depended heavily on convolutional neural networks (CNNs), and related research has been conducted vigorously [42]. Network architectures in natural language processing (NLP) have taken a completely different path from CNNs, and transformers are currently the most widely used among them. Aimed at sequence modeling and various transduction tasks, transformers use the attention mechanism to capture interdependencies within the data, which significantly improves parallel computing ability and efficiency and gives them advantages over CNN models. The remarkable achievements of transformers in NLP have inspired scholars in the machine vision field [43].
Mask RCNN uses ResNet and an FPN as its feature extraction network. The feature map obtained by this network has a low resolution and is less capable of acquiring information on small and multi-scale objects. In addition, the sizes of the convolution kernels and the pooling methods affect computational efficiency and convergence. To solve these problems, we used a Swin Transformer (Figure 7b) to replace ResNet and the FPN as the new feature extraction network of Mask RCNN, removed the FC layer used for classification, and adjusted the output format of the feature map so that it matches the input format of the original image, changing the output feature map from multi-scale to single-scale. Using a Swin Transformer to extract the geometric and color features of sweet peppers can significantly enhance the backbone's ability to acquire information on multi-scale features and small target objects and improve computational efficiency. The introduction of an attention mechanism enables the network to retain useful information and thus improve recognition accuracy.
The Swin Transformer is a new vision transformer that can be used as the backbone network for various computer vision tasks. Because image resolution is far higher than the length of typical text sequences, the amount of information to be processed is also much larger. The Swin Transformer therefore computes self-attention hierarchically within shifted windows: self-attention is restricted to local, non-overlapping windows while connections across windows are still allowed, which improves computational efficiency and preserves a hierarchical structure with the flexibility to model at different scales. The Swin Transformer consists of four stages. In each stage, the resolution of the input sweet pepper feature map is compressed, and the receptive field is expanded. Firstly, the original sweet pepper image is partitioned into multiple patches. Secondly, each stage contains a patch merging module and multiple blocks, where the patch merging module compresses the resolution of the input sweet pepper feature map. Each block is mainly composed of LayerNorm, a multilayer perceptron, window attention, and shifted window attention. After the last stage, the output sweet pepper feature map is upsampled so that its size matches the feature map output by Stage 3, and the two are then concatenated. This processing increases the number of network layers and further improves the feature expression ability of the algorithm. In this process, Stage 3 is performed by the shallower network, which is mainly responsible for extracting sweet pepper surface features, and Stage 4 is performed by the deeper network, which is mainly responsible for extracting abstract, high-dimensional feature information of sweet peppers. Compared with global multihead self-attention (MSA), the Swin Transformer learns image features through moving windows; computing self-attention within each window allows window multihead self-attention (W-MSA) to reduce computational complexity effectively. The Swin Transformer also improves its feature acquisition ability by using a sliding window to obtain patches and computing self-attention within each window. In addition, it establishes connections between adjacent windows through the shifted window operation, ensuring cross-window connections between the upper and lower layers and thereby providing global modeling capability.
The Swin Transformer network structure ensures that computational complexity grows linearly rather than quadratically with image size, which reduces the computational burden of model training and allows models to be trained on higher-resolution images. The computational complexities of MSA and W-MSA are:
\Omega(\mathrm{MSA}) = 4hwc^{2} + 2(hw)^{2}c   (3)
\Omega(\text{W-MSA}) = 4hwc^{2} + 2M^{2}hwc   (4)
where \Omega denotes complexity; h, w, and c denote the height, width, and depth of the feature map, respectively; and M is the side length of each window. Equations (3) and (4) show that global self-attention involves a large amount of computation for high-resolution images, whereas window-based self-attention requires only a small amount of computation.
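As a quick numerical illustration of Equations (3) and (4), the two costs can be compared for an assumed feature-map size (the values below are arbitrary and are not taken from this study):

```python
def msa_cost(h, w, c):
    """Global multi-head self-attention: 4*h*w*c^2 + 2*(h*w)^2*c (Equation (3))."""
    return 4 * h * w * c ** 2 + 2 * (h * w) ** 2 * c

def w_msa_cost(h, w, c, m):
    """Window multi-head self-attention: 4*h*w*c^2 + 2*m^2*h*w*c (Equation (4))."""
    return 4 * h * w * c ** 2 + 2 * m ** 2 * h * w * c

h, w, c, m = 56, 56, 96, 7     # assumed 56 x 56 feature map, 96 channels, 7 x 7 windows
print(msa_cost(h, w, c))       # grows quadratically with h*w
print(w_msa_cost(h, w, c, m))  # grows only linearly with h*w
```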

2.5.2. Generation of Region of Interest (RoI) and RoIAlign

The backbone extracts features from sweet pepper images, transmits the image features to the RPN, and searches for the locations of different sweet peppers. Owing to the significant differences in environmental conditions, shooting angles, and sweet pepper morphology, anchors with three scales (32 × 32, 64 × 64, and 128 × 128) are generated for the RoI regions composed of the pixels corresponding to the sweet peppers, and the aspect ratios of the rectangular boxes surrounding different sweet peppers are 1:2, 1:1, and 2:1. Combining these scales and aspect ratios yields nine anchors at each designated position in the original image, which are used to infer the exact location of the sweet pepper and thereby improve the accuracy of the RoI. The main tasks of the RPN are to perform classification and bounding box regression, output RoI categories and corresponding coordinates, and use the output categories to judge whether the identified objects are foreground or background. The foreground includes the locations of sweet peppers, and the background represents the areas in which no sweet pepper is present. The resulting RoIs and the corresponding feature maps are then passed to RoIAlign, which precisely aligns the extracted features with the input to further improve segmentation accuracy.
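The nine anchors (three scales × three aspect ratios) placed at each position can be sketched as follows; the centre coordinates are illustrative, and r denotes the width-to-height ratio:

```python
import itertools

def make_anchors(cx, cy, scales=(32, 64, 128), ratios=(0.5, 1.0, 2.0)):
    """Return (x1, y1, x2, y2) boxes for every scale/aspect-ratio combination at one centre."""
    anchors = []
    for s, r in itertools.product(scales, ratios):
        w = s * r ** 0.5          # width-to-height ratio r, area kept at s * s
        h = s / r ** 0.5
        anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

print(len(make_anchors(250, 250)))  # 9 anchors per position
```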

2.5.3. Mask Head

To realize fine segmentation of sweet pepper images, the mask branch should fully integrate shallow and deep feature information. However, the mask branch of Mask RCNN adopts an FCN structure, whose network integrates feature maps at different levels by simple superposition. The more layers in the network, the more likely shallow feature information is to be lost, so the feature information of the sweet peppers cannot be fully utilized. To further improve the segmentation quality of different categories of sweet peppers, we used UNet3+ instead of an FCN in the mask branch of Mask RCNN. UNet3+ is an improved version of UNet and UNet++ [44]. Although UNet++ adopts a network structure with dense skip connections, it cannot obtain sufficient information from RGB images at different scales. Essentially, UNet++ uses only short connections: the extracted features are reprocessed at each connection node and fused between different connections, so the original features carrying multi-scale information are not fully utilized. UNet3+ solves these problems. The overall network structure of UNet3+ is shown in Figure 8 and is composed of an encoder (left half) and a decoder (right half). X_{En}^{i} (i = 1, 2, \ldots, 5) represents the feature map generated by the encoder at layer i, and X_{De}^{i} (i = 1, 2, \ldots, 5) represents the feature map generated by the decoder at layer i. The core idea is as follows: full-scale skip connections and a full-scale deep supervision mechanism are adopted. The former closely combines the high- and low-level semantics of image features at different scales. For example, the feature maps X_{En}^{1}, X_{En}^{2}, X_{En}^{3}, X_{En}^{4}, and X_{De}^{5} are combined through skip connections (dashed lines) to generate the decoder feature map X_{De}^{4}, and the calculation formula is as follows:
X_{De}^{i} = \begin{cases} X_{En}^{i}, & i = N \\ H\left(\left[ C\left(D\left(X_{En}^{k}\right)\right)_{k=1}^{i-1},\ C\left(X_{En}^{i}\right),\ C\left(U\left(X_{De}^{k}\right)\right)_{k=i+1}^{N} \right]\right), & i = 1, \ldots, N-1 \end{cases}   (5)
where X_{De}^{i} denotes the decoder feature map at layer i; i indexes the i-th down-sampling layer along the encoding direction; N is the number of encoder layers; C(\cdot) denotes a convolution operation; H(\cdot) denotes the feature aggregation mechanism; D(\cdot) and U(\cdot) denote down-sampling and up-sampling operations, respectively; and [\cdot] denotes channel-dimension concatenation and fusion. In Equation (5), the first two groups of terms aggregate scales 1 through i, and the last group aggregates scales i + 1 through N. The full-scale deep supervision mechanism learns hierarchical representations from the full-scale aggregated feature maps and produces a corresponding output at the side of each decoder layer, which is supervised by the ground truth. For deep supervision, the output of the last layer of each decoder stage is fed into a 3 × 3 convolutional layer, followed by bilinear upsampling and a sigmoid function.
Through the introduction of the UNet3+ network structure, the proposed algorithm has stronger feature fusion ability, can fully extract shallow and deep feature information, captures the details and contour features of sweet peppers more effectively, increases the fineness of sweet pepper mask segmentation, and avoids the loss of important information.
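A simplified sketch of one such full-scale decoder block is shown below, assuming PyTorch; shallower maps are resized down and deeper maps resized up before concatenation and fusion, as in Equation (5). The channel counts are assumptions, and plain interpolation stands in for the max-pooling/bilinear operations of the original UNet3+:

```python
import torch
from torch import nn
import torch.nn.functional as F

class FullScaleDecoderBlock(nn.Module):
    """One UNet3+-style decoder block: every encoder/decoder feature map is resized to the
    target scale, reduced to a common channel width, concatenated, and fused (Equation (5))."""
    def __init__(self, in_channels_list, cat_channels=64):
        super().__init__()
        self.reduce = nn.ModuleList(
            [nn.Conv2d(c, cat_channels, kernel_size=3, padding=1) for c in in_channels_list]
        )
        fused = cat_channels * len(in_channels_list)
        self.fuse = nn.Sequential(
            nn.Conv2d(fused, fused, kernel_size=3, padding=1),
            nn.BatchNorm2d(fused),
            nn.ReLU(inplace=True),
        )

    def forward(self, features, target_size):
        resized = [
            F.interpolate(conv(f), size=target_size, mode="bilinear", align_corners=False)
            for conv, f in zip(self.reduce, features)
        ]
        return self.fuse(torch.cat(resized, dim=1))
```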

2.5.4. Loss Function

The final output of the algorithm includes the sweet pepper position, category, and segmentation mask. The classification and positioning of sweet peppers are realized using an FC layer, and the segmentation mask is produced by up- and down-sampling in UNet3+. The loss function reflects the difference between the predicted and actual values. We used a multi-task loss function to evaluate the accuracy of the results. The multi-task loss function L consists of three parts: the category loss L_{cls}, the bounding box loss L_{bbox}, and the prediction mask loss L_{mask}, as shown in Equations (6)–(9). L_{cls} measures the difference between the predicted and actual sweet pepper categories; L_{bbox} represents the distance between the predicted and actual location parameters (origin, width, and height) of each sweet pepper; and L_{mask} represents the model confidence in the binary classification of each pixel into foreground object (sweet pepper) or background, which is the binary cross-entropy used for pixel classification:
L = L_{cls} + L_{bbox} + L_{mask}   (6)
L_{cls} = -\sum_{i} \log\left[ p_{i}^{*} p_{i} + \left(1 - p_{i}^{*}\right)\left(1 - p_{i}\right) \right]   (7)
L_{bbox} = \frac{1}{N_{reg}} \sum_{i} p_{i}^{*} R\left(t_{i} - t_{i}^{*}\right)   (8)
L_{mask} = -\frac{1}{m^{2}} \sum_{1 \le i, j \le m} \left[ y_{ij}^{*} \log y_{ij} + \left(1 - y_{ij}^{*}\right) \log\left(1 - y_{ij}\right) \right]   (9)
where L_{cls}, L_{bbox}, and L_{mask} denote the category, bounding box, and prediction mask losses, respectively; p_{i} and p_{i}^{*} are the predicted probability and true label of anchor i, respectively; N_{reg} is the number of pixels in the feature map; t_{i} and t_{i}^{*} are the predicted and true coordinates, respectively; and R(\cdot) is the smooth L1 function. For the mask branch, each RoI produces an m^{2}-dimensional output; y_{ij}^{*} denotes the true label at coordinates (i, j) in the m × m region, and y_{ij} denotes the corresponding prediction.
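In PyTorch terms, the three loss terms can be sketched as follows; the tensors are illustrative, and cross-entropy is used here for the multi-class category head, whereas Equations (7)–(9) write the binary forms:

```python
import torch.nn.functional as F

def multitask_loss(cls_logits, cls_targets, box_preds, box_targets, mask_logits, mask_targets):
    """L = L_cls + L_bbox + L_mask (Equation (6)); shapes follow the usual detection conventions."""
    loss_cls = F.cross_entropy(cls_logits, cls_targets)                        # category loss
    loss_bbox = F.smooth_l1_loss(box_preds, box_targets)                       # smooth L1 box loss
    loss_mask = F.binary_cross_entropy_with_logits(mask_logits, mask_targets)  # per-pixel mask loss
    return loss_cls + loss_bbox + loss_mask
```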

2.6. Network Training

Firstly, the algorithms involved in the experiments were built on the Windows 10 platform. Secondly, the labeled sweet pepper data were loaded and converted into the data format required by the algorithm. Finally, the models were trained. Network training in this study was performed using an NVIDIA GeForce RTX 2080Ti graphics card, and the input image was an RGB three-channel color image with dimensions of 500 × 500 × 3. The experimental environments of all the algorithms during training were identical (Table 2).
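A minimal training loop in the torchvision detection style is sketched below; the optimizer, learning rate, and epoch count are assumptions and do not reproduce the settings listed in Table 2:

```python
import torch

def train(model, data_loader, num_epochs=50, lr=0.005, device="cuda"):
    """Torchvision-style detection models return a dict of losses when called with targets."""
    model.to(device).train()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9, weight_decay=1e-4)
    for epoch in range(num_epochs):
        for images, targets in data_loader:
            images = [img.to(device) for img in images]
            targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
            loss_dict = model(images, targets)
            loss = sum(loss_dict.values())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```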

2.7. Network Model Performance Evaluation Index

Four parameters were used to evaluate the instance segmentation performance of the proposed algorithm for sweet peppers: average precision (AP), average recall (AR), F1 score, and FPS. The first three parameters evaluate the accuracy of sweet pepper location detection and segmentation: AP50 is the average precision at an IoU threshold of 0.5, AP75 is the average precision at an IoU threshold of 0.75, and AP is the mean average precision over ten IoU thresholds taken in steps of 0.05. The FPS was used to evaluate the speed of the algorithm (average inference frame rate). The precision and recall involved in the above indicators are, respectively, the ratio of correctly predicted positive samples to all predicted positive samples and the ratio of correctly predicted positive samples to all actual positive samples. The calculation of the above evaluation indices is shown in Equations (10)–(13).
\mathrm{precision} = \frac{TP}{TP + FP} \times 100\%   (10)
\mathrm{recall} = \frac{TP}{TP + FN} \times 100\%   (11)
F1 = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}   (12)
\mathrm{FPS} = \frac{NumFigure}{TotalTime}   (13)
where TP: true positive, FP: false positive, FN: false negative, NumFigure: total number of sweet pepper images used for model inference, and TotalTime: total time required for inference on sweet pepper images.
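For reference, the four indices reduce to a few lines of code; the TP/FP/FN counts below are illustrative only:

```python
def precision(tp, fp):
    return tp / (tp + fp) * 100  # Equation (10), in percent

def recall(tp, fn):
    return tp / (tp + fn) * 100  # Equation (11), in percent

def f1_score(tp, fp, fn):
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)   # Equation (12)

def fps(num_figures, total_time_s):
    return num_figures / total_time_s  # Equation (13)

print(f1_score(tp=95, fp=3, fn=2))  # illustrative counts
print(fps(100, 20.0))               # 100 images in 20 s -> 5 FPS, matching the reported average
```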

3. Results

3.1. Qualitative Analysis of Improved Mask RCNN Model Performance

To test the segmentation capability of the improved algorithm, the images in the validation set were shuffled, and 100 pictures of greenhouse sweet peppers covering different categories and environmental conditions were selected for testing. The results are shown in Figure 9 (the “green,” “red,” and “yellow” labels represent green, red, and yellow sweet peppers, respectively). As Figure 9 shows, the proposed instance segmentation algorithm achieved good results for sweet pepper images of different types and under different environments. The details are as follows. Whether the number of sweet peppers is small or large, the surface colors of semi-mature sweet peppers are uneven, or the fruits overlap or are blocked by branches and leaves, the algorithm achieves accurate instance segmentation of each individual sweet pepper without missed or false detections, as shown in Figure 9a–d. Figure 9e presents the segmentation results for green sweet peppers against a complex background. In this case, the color of the sweet pepper is very similar to that of the surrounding environment, and accurately segmenting sweet peppers under these conditions is very challenging. Although it is difficult for the human eye to identify the specific position of each sweet pepper accurately in a short time, the algorithm developed in this study can do so, which greatly reduces the burden of manual inspection and significantly improves work efficiency. Extreme lighting conditions were also tested in this study. The experimental results show that the proposed algorithm can effectively overcome the effects of different lighting conditions on the instance segmentation results when shadows are present on the sweet pepper surface or the lighting is insufficient or uneven, as shown in Figure 9f.
The above analysis demonstrates that the proposed algorithm achieves accurate target recognition for greenhouse sweet peppers under different lighting conditions and in various complex scenes and that its instance segmentation satisfies the practical requirements for dynamic monitoring of sweet pepper growth status and yield prediction.

3.2. Ablation Experiment

In this study, the Mask RCNN algorithm was improved: the Swin Transformer attention mechanism was added to its backbone network to enhance the feature extraction ability of the algorithm, and UNet3+ was used to replace the fully convolutional network layer in the mask branch to improve the quality of the sweet pepper masks output by the algorithm. Ablation experiments were designed to explore the effects of these improvements on the Mask RCNN algorithm for sweet pepper instance segmentation. Table 3 shows the segmentation results for sweet pepper instances before and after each improvement of the Mask RCNN, where Bbox denotes the detection index for object position and Seg denotes the segmentation index for the object mask. All algorithms were trained with the same training set and parameter settings and verified with the same validation set. ResNet-50, ResNet-101, ResNext-50, and ResNext-101 were used as comparison backbone networks for the original Mask RCNN, and the mask branch of the original Mask RCNN used an FCN. Bold values in Table 3 indicate the best performance for each index. Algorithms 1–6 represent instance segmentation algorithms with different network structures.
After the improvement of the Mask RCNN algorithm, the detection AP, AR, segmentation AP, F1 score, and average inference time were 95.7%, 98.3%, 84.5%, 96.9%, and 0.20 s, respectively. Compared with the unimproved algorithm, the first four evaluation indices were at least 22%, 21.7%, 17.3%, and 21.8% higher, respectively. The average inference time did not change significantly; therefore, this method improves accuracy while the inference speed remains essentially unchanged.
Figure 10 shows a qualitative comparison of segmentation images before and after the improvement of the Mask RCNN algorithm on the test set. The first row depicts the original input images, and the remaining rows show the segmentation results produced by different network structures at the corresponding positions of the Mask RCNN algorithm. In this study, three types of sweet pepper were qualitatively analyzed. The red dashed boxes indicate the shortcomings of each algorithm when segmenting some details of the sweet peppers. Figure 10 demonstrates that the instance segmentation effect after the improvement of the Mask RCNN backbone network is greatly improved and that the false detection rate is further reduced. These improvements are due to the introduction of the attention mechanism, which enables the algorithm to filter irrelevant information more effectively and retain only useful information. In addition, after using UNet3+ to improve the mask branch, the segmentation quality is further improved. This enhancement occurs because UNet3+ uses full-scale skip connections and a full-scale deep supervision mechanism to closely combine the high- and low-level semantics of image features at different scales, thus significantly improving mask quality. The red dashed boxes in Figure 10 show that the segmentation of the sweet pepper edges before the improvement is rough, whereas the mask edges of the improved algorithm are more consistent with the actual object edges. The above analysis demonstrates that the performance of the improved Mask RCNN is greatly enhanced.
To explore the convergence of the loss functions of different network structures involved in the ablation experiment, a set of experiments was designed in the model training stage to compare the loss function curves of the different algorithms. These curves include the overall loss function curve, bounding box regression loss function curve, classification loss function curve, regression loss curve of RPN bounding box, RPN target recognition loss function curve, and mask loss function curve. Figure 11 shows the loss function changes during the training process for the models with different network structures.
According to the loss function curves in Figure 11a, the overall loss function of the proposed algorithm is at least 0.1 lower than those of the other algorithms and eventually converges to 0.1, with faster convergence. Observing the loss function curves before and after the addition of the Swin Transformer attention mechanism (Figure 11b–d) shows that the feature extraction ability of the algorithm is improved after the attention mechanism is introduced, which effectively filters irrelevant information and reduces the computational load, further accelerating convergence. The bounding box regression loss, classification loss, and RPN bounding box regression loss of the proposed algorithm eventually converge to 0.005, which are at least 0.015, 0.01, and 0.005 lower than those of the other algorithms, respectively. In addition, UNet3+ has little effect on the classification loss curve (Figure 11c) and the RPN bounding box regression loss curve (Figure 11d), because UNet3+ is used only to improve the mask branch: it contributes to the mask quality and does not affect the other branches. The loss function curves in Figure 11e indicate that the improvements have little influence on the RPN target recognition loss, whose convergence rate is basically the same as that of the other algorithms, finally converging to 0.0075. In Figure 11f, adding the attention mechanism alone degrades the mask loss; however, after adding the UNet3+ segmentation network, the shallow and deep features can be fully utilized, further enhancing the segmentation performance of the algorithm and compensating for the decrease caused by introducing the attention mechanism. Hence, the mask loss of the proposed algorithm converges better and faster than those of the other algorithms, eventually converging to 0.07, which is at least 0.04 lower. Accordingly, UNet3+ plays a significant role in improving mask accuracy.

3.3. Analysis of Segmentation Results of Different Types of Sweet Pepper

The complete greenhouse sweet pepper growth cycle consists of three stages: immature, semi-mature, and mature. The different growth stages correspond to different colors and morphologies: in the immature stage, the fruits are green and small; in the semi-mature stage, the fruits have uneven color gradients and sizes; and in the mature stage, the fruits are red or yellow and large. Sweet peppers with the colors and shapes corresponding to these three stages affect the instance segmentation results differently. To explore the effects of different categories of sweet peppers on the segmentation results, the improved Mask RCNN was applied to the validation sets of the three categories. The corresponding detection AP, AR, segmentation AP, and F1 score are listed in Table 4, and bold font indicates the best value of each index.
Table 4 shows that the instance segmentation results are worst for the green sweet peppers and best for the yellow sweet peppers; the differences in detection AP, AR, segmentation AP, and F1 score between these two cases are 2.4%, 2.1%, 15.5%, and 0.3%, respectively. The segmentation results for the red sweet peppers are similar to those for the yellow sweet peppers and better than those for the green sweet peppers. The results are worst for the green sweet peppers because their color is quite similar to the background color, and the contrast between the foreground objects and background features is not sufficiently large. The background consists mainly of green branches and leaves, and immature sweet peppers are small. When the green sweet peppers in an image are small, missed detections are likely; when green branches and leaves occlude green sweet peppers to different degrees, false detections are more likely, with green leaves wrongly treated as parts of green sweet peppers. In addition, insufficient illumination is another factor affecting accuracy: when there is a shadow on the surface of a sweet pepper, the edge features of its contour are seriously weakened, the segmentation results are not sufficiently fine, and the output mask does not match the actual object edge. The red dashed boxes in Figure 12 indicate the shortcomings of some detection details when the proposed algorithm segments sweet pepper instances. Missed or false detections may occur in rare cases: false detections are caused by the high similarity between interfering objects and sweet peppers, and missed detections are caused by insufficient feature information due to the small size of the sweet peppers. There are two ways to address these problems. First, image sharpening can improve the sharpness of the sweet pepper contour edges so that the distinction between the sweet pepper and the background becomes more significant. Second, image features can be enriched by expanding the training set to further improve the feature extraction and generalization abilities of the algorithm. Although a certain degree of missed and false detection remains, the proposed algorithm guarantees high detection and segmentation accuracy in most cases and reaches a satisfactory level for greenhouse sweet pepper segmentation.

3.4. Effect of Data Enhancement Technology on Segmentation Results

To enrich the image features in limited samples and to expand the dataset further, we adopted data enhancement technology. Specific measures include Laplacian sharpening and adding noise. To verify the influence of data enhancement technology on the sweet pepper segmentation results, the proposed algorithm was used to train the models with different data enhancement methods, and the parameters of the segmentation results were obtained, as listed in Table 5. The indices corresponding to the bold data indicate the best performance.
Table 5 demonstrates that data enhancement technology significantly affects the segmentation results and that applying all three data enhancement methods simultaneously has the best effect. Specifically, the detection AP, AR, segmentation AP, and F1 score are increased by 2.4%, 1.1%, 10.3%, and 1.8%, respectively. Using all three methods together produces the best results because the different data enhancement methods have complementary characteristics. Sharpening makes object contour edges clearer. Because an actual scene is highly variable, limited sample data cannot represent the entire real environment, and the model training process can only fit parameter values similar to the real environment; adding noise can further increase the differences between samples, simulate actual cases, represent various complex scenes, and enable limited data to be utilized with maximum effectiveness. The above analysis demonstrates that appropriately introducing different data enhancement methods increases the feature information at different levels in the image and improves algorithm performance with limited image samples.

3.5. Comparison of Different Instance Segmentation Algorithms

To verify the superiority of the segmentation performance of the proposed algorithm further, the quantitative indices AP, AR, F1, and FPS were used to evaluate the algorithm quantitatively, and a comparison with three mainstream algorithms, BlendMask [45], BoxInst [46], and CondInst [47], was performed. The segmentation results are presented in Table 6, and bold font is used to indicate the optimal values of the indices.
Table 6 demonstrates that the improved Mask RCNN yields the best performance across all indices, with a detection AP, AR, segmentation AP, and F1 score of 95.7%, 98.3%, 84.5%, and 96.9%, respectively; these are improved by 22%, 21.7%, 17.3%, and 21.8%, respectively, compared with the corresponding values obtained by the other algorithms. The average inference speed of the improved Mask RCNN is 5 FPS (0.20 s per image), the fastest among the compared algorithms. Thus, the proposed algorithm further improves detection and segmentation accuracy while the inference frame rate remains essentially unchanged.
Some sweet pepper images under extreme conditions (complex background, branch and leaf occlusion) were selected from the test set to test the different algorithms and perform a qualitative comparative analysis (Figure 13). The red dashed boxes indicate the main shortcomings of each algorithm in the detailed sweet pepper segmentation results. The qualitative comparison in Figure 13 demonstrates that the improved Mask RCNN outputs the best mask quality, retaining the most complete edge contours of individual sweet peppers and accurately distinguishing foreground from background. When an individual sweet pepper is split into two parts by branches, shade, or obstacles, the improved Mask RCNN still gives correct results, whereas the other algorithms make errors such as identifying leaves or obstacles as parts of a sweet pepper or detecting different sweet pepper individuals as the same object with only one detection box. It is concluded that the improved Mask RCNN algorithm can segment greenhouse sweet pepper images more accurately at a higher frame rate and has advantages over the other algorithms.

4. Conclusions

This paper proposes an instance segmentation algorithm based on an improved Mask RCNN to enable dynamic monitoring of sweet pepper growth status. To enhance the feature extraction ability of the algorithm, the Swin Transformer attention mechanism is introduced into the backbone network, and the mask branch is improved using UNet3+. Compared with other instance segmentation algorithms (BlendMask, BoxInst, and CondInst), the proposed algorithm achieves better instance segmentation in complex scenes (high similarity between the sweet peppers and background, insufficient lighting, overlap, occlusion by branches and leaves, etc.). The detection AP, AR, segmentation AP, and F1 score were 98.1%, 99.4%, 94.8%, and 98.8%, respectively, and the average FPS value was 5. The detection speed of the algorithm satisfies the requirement for dynamic monitoring of sweet pepper growth status. The results of this study can serve as an important theoretical basis for monitoring crop growth in greenhouses.
The main disadvantage of current instance segmentation algorithms for automated fruit-picking robots is their limited real-time performance, which hinders practical application. The algorithm proposed in this paper, as a typical representative of two-stage algorithms, shares this disadvantage and still has room for improvement in inference speed. In addition, the algorithm has a relatively large number of model parameters, which demands high performance from the embedded deployment device, and its network structure needs to be further simplified. The algorithm will be further optimized in the future to improve instance segmentation accuracy and increase its inference speed, thereby enhancing its real-time performance for use in automated fruit-picking techniques in precision agriculture.

Author Contributions

Conceptualization, P.C. and S.L.; methodology, P.C. and S.L.; software, J.Z. and P.C.; validation, P.C., S.L., J.Z. and H.F.; formal analysis, K.L. and H.F.; investigation, S.L. and J.Z.; resources, S.L., J.Z. and K.L.; data curation, S.L.; writing—original draft preparation, P.C. and S.L.; writing—review and editing, J.Z. and K.L.; visualization, P.C. and S.L.; supervision, J.Z. and K.L.; project administration, S.L.; funding acquisition, P.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Central Government Guides Local Science and Technology Development Foundation Projects (Grant No. ZY19183003), Guangxi Key Research and Development Project (Grant No. AB20058001).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Figure 1. Framework of the proposed sweet pepper segmentation method.
Figure 2. Image samples. (a) heavily shaded yellow sweet pepper; (b) green sweet pepper affected by shadow, shading, and uneven illumination; (c) semi-mature red sweet pepper.
Figure 3. Laplace mask.
Figure 4. Image sharpening. (a) original image; (b) after sharpening.
Figure 5. Image noise. (a) original image; (b) salt and pepper noise; (c) Gaussian noise.
Figure 6. Image annotation. (a) original image; (b) annotated image.
Figure 7. Proposed system architecture. (a) the improved Mask RCNN network model structure; (b) Swin Transformer network structure.
Figure 8. UNet3+ network structure.
Figure 9. Sweet pepper instance segmentation results obtained using improved Mask RCNN. (a) small number of individuals; (b) large number of individuals; (c) uneven surface color; (d) severe leaf occlusion; (e) complex background; (f) insufficient or uneven lighting.
Figure 10. Comparison of Mask RCNN model results before and after improvement. The red dotted boxes represent shortcomings of the algorithm when it segments some sweet pepper details.
Figure 11. Loss function curves of Mask RCNN with different network structures in the training stage. (a) overall loss function curve; (b) bounding box regression loss function curve; (c) classification loss function curve; (d) regression loss curve of RPN bounding box; (e) RPN target recognition loss function curve; (f) mask loss function curve.
Figure 12. Example of segmentation errors.
Figure 13. Segmentation results obtained using different instance segmentation algorithms.
Table 1. Types and quantities of images in data sets.

Category | Training Set | Verification Set
Green | 632 | 126
Red | 638 | 128
Yellow | 635 | 127
Table 2. Parameter settings of the sweet pepper instance segmentation method.

Parameter | Value
CPU | Intel Core i9-10900K
Memory | 32 GB
GPU | NVIDIA GeForce RTX 2080Ti
Operating system | Windows 10
Development tool | PyCharm
Network framework | PyTorch 1.10.0
Batch size | 2
Weight decay | 0.0005
Base learning rate | 0.004
Epochs | 10
Momentum | 0.9
Optimizer | SGD
Input image size | 500 × 500 × 3 (RGB)
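For context, the training hyperparameters in Table 2 correspond directly to a standard PyTorch SGD configuration. The snippet below is a hedged sketch, not the authors' training script; the toy module stands in for the improved Mask RCNN so that the example remains self-contained and runnable.

```python
import torch
import torch.nn as nn

# Stand-in module: in the actual experiments this would be the improved Mask RCNN
# (Swin Transformer backbone + UNet3+ mask head); a toy layer keeps the sketch runnable.
model = nn.Linear(8, 8)

# SGD optimizer configured with the hyperparameters listed in Table 2.
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.004,             # base learning rate
    momentum=0.9,         # momentum
    weight_decay=0.0005,  # weight decay
)

# The remaining Table 2 settings (batch size 2, 500 x 500 x 3 RGB inputs, 10 epochs)
# belong to the data pipeline and training loop, which are omitted here.
```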
Table 3. Ablation experiments.

Number | Algorithm | Backbone | Mask Head | (Bbox) AP (%) | (Bbox) AR (%) | (Seg) AP (%) | F1 (%) | Speed (s)
Algorithm 1 | Mask RCNN | ResNet-50 | FCN | 63.6 | 67.6 | 61.8 | 65.5 | 0.24
Algorithm 2 | Mask RCNN | ResNet-101 | FCN | 71.4 | 75.3 | 65.3 | 73.3 | 0.23
Algorithm 3 | Mask RCNN | ResNeXt-50 | FCN | 70.0 | 72.7 | 64.6 | 71.3 | 0.20
Algorithm 4 | Mask RCNN | ResNeXt-101 | FCN | 73.7 | 76.6 | 67.2 | 75.1 | 0.26
Algorithm 5 | Mask RCNN | Swin Transformer | FCN | 94.4 | 97.2 | 61.7 | 95.8 | 0.19
Algorithm 6 | Mask RCNN | Swin Transformer | UNet3+ | 95.7 | 98.3 | 84.5 | 96.9 | 0.20
Table 4. Instance segmentation results for different types of sweet peppers.

Category | (Bbox) AP (%) | (Bbox) AR (%) | (Seg) AP (%) | F1 (%)
Green | 94.1 | 99.8 | 74.3 | 96.8
Red | 96.5 | 97.5 | 89.3 | 97.0
Yellow | 96.5 | 97.7 | 89.8 | 97.1
Table 5. Experimental results obtained using different data enhancement methods.

Method | (Bbox) AP (%) | (Bbox) AR (%) | (Seg) AP (%) | F1 (%)
Data enhancement not used | 95.7 | 98.3 | 84.5 | 97.0
Laplace sharpening | 97.1 | 98.9 | 93.1 | 98.0
Salt and pepper noise | 97.3 | 98.9 | 93.2 | 98.2
Gaussian noise | 96.2 | 98.5 | 92.7 | 97.3
Three data enhancement methods simultaneously | 98.1 | 99.4 | 94.8 | 98.8
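To make the comparison in Table 5 concrete, the sketch below shows one common way to implement the three augmentation operations with OpenCV and NumPy. The specific sharpening kernel, noise fraction, and standard deviation are illustrative assumptions; the paper's own Laplace mask and noise settings are those reflected in Figures 3–5.

```python
import numpy as np
import cv2

def laplace_sharpen(img: np.ndarray) -> np.ndarray:
    """Sharpen with a 3x3 Laplacian-based kernel (one common choice;
    the exact mask used in the paper is shown in Figure 3)."""
    kernel = np.array([[0, -1, 0],
                       [-1, 5, -1],
                       [0, -1, 0]], dtype=np.float32)
    return cv2.filter2D(img, -1, kernel)

def add_salt_pepper(img: np.ndarray, amount: float = 0.01) -> np.ndarray:
    """Flip a random fraction of pixels to black or white (amount is an assumed value)."""
    out = img.copy()
    mask = np.random.rand(*img.shape[:2])
    out[mask < amount / 2] = 0
    out[mask > 1 - amount / 2] = 255
    return out

def add_gaussian_noise(img: np.ndarray, sigma: float = 10.0) -> np.ndarray:
    """Add zero-mean Gaussian noise (sigma is an assumed value) and clip to the 8-bit range."""
    noise = np.random.normal(0.0, sigma, img.shape)
    return np.clip(img.astype(np.float32) + noise, 0, 255).astype(np.uint8)
```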
Table 6. Instance segmentation results of different algorithms.

Algorithm | (Bbox) AP (%) | (Bbox) AR (%) | (Seg) AP (%) | F1 (%) | Speed (s)
BlendMask | 47.5 | 56.9 | 48.1 | 51.8 | 0.36
BoxInst | 49.6 | 65.7 | 41.8 | 56.5 | 0.30
CondInst | 40.1 | 61.3 | 48.2 | 48.5 | 0.31
Mask RCNN | 73.7 | 76.6 | 67.2 | 75.1 | 0.26
Improved Mask RCNN | 95.7 | 98.3 | 84.5 | 96.9 | 0.20