Article

Medicinal Chrysanthemum Detection under Complex Environments Using the MC-LCNN Model

Chao Qi, Jiangxue Chang, Jiayu Zhang, Yi Zuo, Zongyou Ben and Kunjie Chen
1 College of Engineering, Nanjing Agricultural University, Nanjing 210031, China
2 College of Intelligent Engineering and Technology, Jiangsu Vocational Institute of Commerce, Nanjing 211168, China
* Author to whom correspondence should be addressed.
Plants 2022, 11(7), 838; https://doi.org/10.3390/plants11070838
Submission received: 23 February 2022 / Revised: 19 March 2022 / Accepted: 20 March 2022 / Published: 22 March 2022
(This article belongs to the Section Horticultural Science and Ornamental Plants)

Abstract

Medicinal chrysanthemum detection is one of the desirable tasks of selective chrysanthemum harvesting robots. However, it is challenging to achieve accurate detection in real time under complex unstructured field environments. In this context, we propose a novel lightweight convolutional neural network for medicinal chrysanthemum detection (MC-LCNN). First, in the backbone and neck components, we employed the proposed residual structures MC-ResNetv1 and MC-ResNetv2 as the main network and embedded custom feature extraction and feature fusion modules to guide the gradient flow. Moreover, across the network, we used a custom loss function to improve the precision of the proposed model. The results showed that under the NVIDIA Tesla V100 GPU environment, the inference speed could reach 109.28 FPS for a 416 × 416 input, and the detection precision (AP50) could reach 93.06%. In addition, we embedded the MC-LCNN model into the edge computing device NVIDIA Jetson TX2 for real-time object detection, adopting a CPU–GPU multithreaded pipeline design to improve the inference speed by 2 FPS. This model could be further developed into a perception system for selective chrysanthemum harvesting robots in the future.

1. Introduction

Numerous studies have reported that medicinal chrysanthemums have significant commercial value [1]. Furthermore, they have prominent medicinal properties [2], such as heat-clearing, eye-brightening, anti-inflammatory, antihypertensive, and antitumor effects. In the natural environment, a single chrysanthemum plant can present flower heads at different flowering stages, whereas medicinal chrysanthemums are mainly harvested at the bud stage. To illustrate the research objective of this study, the different flowering stages of medicinal chrysanthemums are presented in Figure 1.
At present, the harvesting of medicinal chrysanthemums is labor-intensive and time-consuming. Given the current shortage of skilled labor, it is therefore highly desirable to develop a selective harvesting robot to reduce crop waste. The design of manipulators and the development of visual perception systems are vital for selective harvesting robots, and this study focuses on the development of a visual perception system for medicinal chrysanthemums. Traditional machine learning techniques for computer vision tasks are well developed, relying on shallow learning of image information through manual feature extraction [3]. Convolutional neural networks (CNNs), an important subset of machine learning techniques that learn hierarchical representations and can discover complex patterns in data, have made impressive advances in the computer vision field [4] and have also yielded encouraging results in agriculture [5]. Although approaches based on traditional machine learning and deep learning have achieved significant success in agricultural applications, developing lightweight networks for selective harvesting robots under unstructured environments remains difficult.
We collected the literature on chrysanthemum detection based on traditional machine learning techniques and deep learning techniques throughout the world, and the results are shown in Table 1. Overall, the available literature is relatively scarce. When carefully analyzing Table 1, we found three issues that deserve further exploration.
Issue 1: The current research has not yet achieved high-accuracy, real-time detection of chrysanthemums.
Issue 2: Throughout the literature, the testing environment has mainly been in the laboratory, which cannot guarantee the robustness of the model.
Issue 3: Although there are some differences in the research tasks for chrysanthemum detection, the aim of the research is to achieve commercialization, and this could be effective in helping farmers reduce their workload. Commercial production inevitably requires embedding the models into low-power edge computing devices, but the current test results are laptop-based.
Based on the three issues above, we propose a lightweight convolutional neural model (MC-LCNN). First, MC-LCNN can balance detection precision and inference speed to achieve real-time and efficient detection of medicinal chrysanthemums. Second, MC-LCNN was tested under three complex unstructured environments (illumination variations, overlaps, and occlusions) to ensure the robustness of the proposed model. Finally, for subsequent development of selective harvesting chrysanthemum robots, we chose to test MC-LCNN on a low-power embedded GPU platform, NVIDIA Jetson TX2, and further improved the detection inference speed by designing a CPU–GPU heterogeneous architecture. The contributions of this study are as follows:
  • A lightweight MC-LCNN model was designed to achieve high-accuracy, real-time detection of medicinal chrysanthemums under complex unstructured environments.
  • A series of experiments were designed to validate the superiority of MC-LCNN, including comparisons with different data enhancements, ablation experiments between various network components, and comparisons with state-of-the-art object detection models.
  • The MC-LCNN model was embedded into an edge computing device with a custom pipeline design to achieve accurate real-time medicinal chrysanthemum detection.
The rest of the paper is organized as follows. Section 2 describes the dataset, the hardware parameters of the NVIDIA Jetson TX2, the structure of the proposed model, the improvement approach of multithreading, the evaluation metrics, and the experimental setup. Section 3 presents the experimental results in detail. Section 4 discusses the experimental results, advantages and disadvantages, solutions, and future research perspectives of this work. Section 5 briefly summarizes the contributions of this study.

2. Materials and Methods

2.1. Dataset

The medicinal chrysanthemum dataset used in this study was collected at Yangma Town, China, from October 2019 to October 2021. Due to the short flowering stage of medicinal chrysanthemums, there are only a few days per year to collect suitable samples. The capture device was an Apple iPhone X with a video resolution of 1080 × 1920. The dataset was collected entirely in the field, with backgrounds including illumination variations, occlusions, and overlaps. It is worth mentioning that, to ensure the robustness of the robotic perception system, the collected images were not constrained to any particular natural environmental conditions. The dataset, comprising a total of 4000 chrysanthemum images, was divided into training, validation, and test sets following a ratio of 6:3:1. Some original images are shown in Figure 2.
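The following is a minimal sketch (not the authors' script) of the 6:3:1 split described above; the directory name, file extension, and random seed are illustrative assumptions.

```python
# Minimal sketch of a 6:3:1 train/validation/test split of the image set.
# "chrysanthemum_images", the *.jpg pattern, and the seed are assumptions.
import random
from pathlib import Path

def split_dataset(image_dir: str, ratios=(0.6, 0.3, 0.1), seed: int = 42):
    images = sorted(Path(image_dir).glob("*.jpg"))
    random.Random(seed).shuffle(images)           # reproducible shuffle
    n = len(images)
    n_train = int(ratios[0] * n)
    n_val = int(ratios[1] * n)
    return {
        "train": images[:n_train],
        "val": images[n_train:n_train + n_val],
        "test": images[n_train + n_val:],
    }

splits = split_dataset("chrysanthemum_images")
print({k: len(v) for k, v in splits.items()})     # e.g., 2400 / 1200 / 400 for 4000 images
```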

2.2. NVIDIA Jetson TX2

The NVIDIA Jetson TX2 comprises a 6-core ARMv8 64-bit CPU complex and a 256-core NVIDIA Pascal architecture GPU. The CPU complex combines a dual-core Denver2 processor and a quad-core ARM Cortex-A57, paired with 8 GB of LPDDR4 memory on a 128-bit interface, making the module well suited to applications requiring low power consumption and high computing performance. Therefore, we chose this edge computing device to design and implement a real-time object detection system. The NVIDIA Jetson TX2 is introduced in Figure 3.

2.3. MC-LCNN

The MC-LCNN is a lightweight network (11.3 M) that can achieve real-time detection under complex unstructured environments (illumination variations, occlusions, and overlaps). The network is constructed from a backbone, a neck, and a head, as shown in Figure 4. In the backbone, the main network utilizes the proposed MC-ResNetv1, with the CBM module and SPP module embedded in this component. In the neck, the main network uses the proposed MC-ResNetv2 with the CBL module embedded. In the head, a feature pyramid network (FPN) feature fusion strategy is employed. Furthermore, several strategies were used throughout the network to improve training robustness, including an exponential moving average (EMA), a larger batch size, DropBlock regularization, and generalized focal loss.
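To make this composition concrete, the following skeleton (an assumption about the overall wiring only; the modules are placeholders rather than the published MC-LCNN layers) shows the backbone, neck, and head data flow described above.

```python
# Illustrative skeleton of the backbone -> neck -> head flow; module internals
# are placeholders, not the published MC-LCNN layers.
import torch
import torch.nn as nn

class MCLCNNSkeleton(nn.Module):
    def __init__(self, backbone: nn.Module, neck: nn.Module, head: nn.Module):
        super().__init__()
        self.backbone = backbone  # MC-ResNetv1 stages with CBM and SPP
        self.neck = neck          # MC-ResNetv2 blocks with CBL
        self.head = head          # FPN-style fusion and detection outputs

    def forward(self, x: torch.Tensor):
        features = self.backbone(x)   # multi-scale feature maps
        fused = self.neck(features)   # cross-scale feature fusion
        return self.head(fused)       # class scores and box regressions
```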

2.3.1. MC-ResNetv1 and MC-ResNetv2

The main challenge in implementing lightweight models is that, under a fixed computational budget (FLOPs), only a restricted number of feature channels can be afforded. To increase the number of channels at a low computational budget, we employed a 1 × 1 convolution and a bottleneck structure to exchange information between channels. The cost of a 1 × 1 convolution is determined by the number of input channels $c_1$ and output channels $c_2$: its FLOPs are $B = hwc_1c_2$, where $h$ and $w$ are the spatial sizes of the feature map. When the cache of the computing device is large enough to store all feature maps and parameters, the memory access cost is $\mathrm{MAC} = hw(c_1 + c_2) + c_1c_2$. From the mean inequality, we obtain the following:

$$\mathrm{MAC} \ge 2\sqrt{hwB} + \frac{B}{hw} \quad (1)$$

Accordingly, for a given FLOP budget the memory access cost has a lower bound, which is reached when the numbers of input and output channels are equal ($c_1 = c_2$).

A 1 × 1 group convolution reduces the computational burden by replacing dense convolution with sparse (grouped) convolution. On the one hand, it allows more channels to be used at fixed FLOPs and increases the network capacity; on the other hand, the increased number of channels leads to a higher memory access cost. The relationship between the memory access cost and the FLOPs of a 1 × 1 group convolution is as follows:

$$\mathrm{MAC} = hw(c_1 + c_2) + \frac{c_1c_2}{g} = hwc_1 + \frac{Bg}{c_1} + \frac{B}{hw} \quad (2)$$

where $g$ denotes the number of groups and $B = hwc_1c_2/g$ stands for the FLOPs. Given a fixed input shape $c_1 \times h \times w$ and computational cost $B$, the memory access cost increases with the growth of $g$.
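As a quick numeric illustration of Equations (1) and (2), the following sketch (with arbitrary example values, not settings from the paper) evaluates the memory access cost for a fixed FLOP budget, confirming that it is smallest for equal channel widths and grows with the number of groups.

```python
# Illustrative check of Equations (1) and (2); feature-map size and channel
# counts are arbitrary example values, not settings from the paper.
def mac_1x1_group_conv(h, w, c1, c2, g=1):
    """Memory access cost of a 1x1 (group) convolution: hw(c1 + c2) + c1*c2/g."""
    return h * w * (c1 + c2) + c1 * c2 / g

h, w = 52, 52

# Equation (1): for a fixed FLOP budget B = hw*c1*c2, MAC is minimal when c1 == c2.
for c1, c2 in [(512, 32), (256, 64), (128, 128)]:        # c1*c2 constant -> same B
    print(f"c1={c1:3d} c2={c2:3d} MAC={mac_1x1_group_conv(h, w, c1, c2):.0f}")

# Equation (2): for fixed c1 and fixed B = hw*c1*c2/g, MAC increases with g,
# because more groups allow proportionally more output channels c2.
c1, B = 128, 52 * 52 * 128 * 128
for g in (1, 2, 4, 8):
    c2 = B * g // (h * w * c1)                            # keeps B constant
    print(f"g={g} c2={c2:4d} MAC={mac_1x1_group_conv(h, w, c1, c2, g):.0f}")
```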
Both the 1 × 1 convolution and the bottleneck structure increase the memory access cost, and this cost is not negligible, especially for lightweight networks. Consequently, to obtain high model capacity and efficiency, the critical issue is how to keep numerous equal-width channels without either dense convolution or too many groups. To achieve this, we designed the MC-ResNetv1 module. We introduced a simple operator named Focus, in which the input is split into two branches at the beginning of each unit. One branch uses a shortcut design, where half of the feature channels pass directly through the block and join the next block, which can be regarded as feature reuse. The other branch comprises two convolutions with the same input and output channels. Moreover, another module, MC-ResNetv2, was designed, in which the Focus operation was removed so that the number of output channels is doubled, and the original shortcut design was substituted with two convolutions. The blocks are repeatedly stacked to construct the entire network. In addition, the 3 × 3 convolutions are followed by an extra 1 × 1 convolutional layer to blend the features, and the number of channels in each block is scaled to generate networks of different complexities. The 1 × 1 convolution also removes computational bottlenecks by reducing the dimensionality of the module, which would otherwise constrain the size of the network. This increases both the depth and the width of the network without significantly affecting performance. To verify the performance of MC-ResNetv1 and MC-ResNetv2, we implemented ablation experiments, as outlined in Section 3.2.
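The following is a minimal sketch of the two residual units described above; the exact layer widths, kernel sizes, and activation are assumptions rather than the published MC-ResNet definitions. In the v1-style unit the input is split in half, with one half reused directly and the other transformed by two equal-width convolutions; in the v2-style unit the shortcut is replaced by a second convolutional branch, so the number of output channels doubles.

```python
# Sketch of the two residual units; widths, kernel sizes, and activation are assumptions.
import torch
import torch.nn as nn

def conv_bn_act(c_in, c_out, k=3):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False),
        nn.BatchNorm2d(c_out),
        nn.LeakyReLU(0.1, inplace=True),
    )

class MCResBlockV1(nn.Module):
    """Channel-split unit: half the channels are reused, half are transformed."""
    def __init__(self, channels):
        super().__init__()
        half = channels // 2
        self.branch = nn.Sequential(conv_bn_act(half, half), conv_bn_act(half, half))
        self.mix = conv_bn_act(channels, channels, k=1)   # 1x1 conv to blend features

    def forward(self, x):
        shortcut, transformed = x.chunk(2, dim=1)         # Focus-style split
        return self.mix(torch.cat([shortcut, self.branch(transformed)], dim=1))

class MCResBlockV2(nn.Module):
    """No split: both branches are convolutional, so the output channels double."""
    def __init__(self, channels):
        super().__init__()
        self.branch_a = nn.Sequential(conv_bn_act(channels, channels), conv_bn_act(channels, channels))
        self.branch_b = nn.Sequential(conv_bn_act(channels, channels), conv_bn_act(channels, channels))
        self.mix = conv_bn_act(2 * channels, 2 * channels, k=1)

    def forward(self, x):
        return self.mix(torch.cat([self.branch_a(x), self.branch_b(x)], dim=1))
```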

2.3.2. Generalized Focal Loss

Focal loss [14] is designed for object detection tasks with an imbalance between the foreground and background classes and is defined in Equation (3):

$$\mathrm{FL}(p) = -(1 - p_t)^{\gamma} \log(p_t), \quad p_t = \begin{cases} p, & y = 1 \\ 1 - p, & y = 0 \end{cases} \quad (3)$$

where $y \in \{0, 1\}$ denotes the ground-truth class, $p \in [0, 1]$ is the estimated probability for the class labeled $y = 1$, and $\gamma$ is an adjustable focusing parameter. Specifically, focal loss comprises a dynamic scaling factor $(1 - p_t)^{\gamma}$ and a standard cross-entropy term $-\log(p_t)$. Because of the class imbalance problem, we extended these two components of focal loss to obtain the quality focal loss $\mathcal{L}_Q$:
$$\mathcal{L}_Q(\sigma) = -\left| y - \sigma \right|^{\beta} \big( (1 - y)\log(1 - \sigma) + y\log(\sigma) \big) \quad (4)$$

where $\sigma = y$ is the global minimum solution of the quality focal loss. The term $|y - \sigma|^{\beta}$ is a modulating factor that approaches 0 when the quality estimate becomes accurate, i.e., $\sigma \to y$, so the loss of well-estimated samples is down-weighted, with the parameter $\beta$ smoothly controlling the down-weighting rate. We used the relative offsets from a location to the four sides of the bounding box as the regression targets. Conventional bounding box regression models the regression label $y$ as a Dirac delta distribution $\delta(x - y)$, where $\int_{-\infty}^{+\infty} \delta(x - y)\,dx = 1$. The integral of $y$ is as follows:

$$y = \int_{-\infty}^{+\infty} \delta(x - y)\, x\, dx \quad (5)$$
Instead of assuming a Dirac delta distribution, we learnt the underlying general distribution $P(x)$ directly, without introducing any other prior. Given the range of the labels, with minimum $y_0$ and maximum $y_n$ ($y_0 \le y \le y_n$, $n \in \mathbb{N}^+$), we can estimate $\hat{y}$ from the model:

$$\hat{y} = \int_{-\infty}^{+\infty} P(x)\, x\, dx = \int_{y_0}^{y_n} P(x)\, x\, dx \quad (6)$$

To be consistent with the network structure, we discretized the range $[y_0, y_n]$ into the set $\{y_0, y_1, \ldots, y_i, y_{i+1}, \ldots, y_{n-1}, y_n\}$, converting the integral over the continuous domain into a discrete sum. Thus, according to the discrete distribution property $\sum_{i=0}^{n} P(y_i) = 1$, the regression value $\hat{y}$ can be formulated as follows:

$$\hat{y} = \sum_{i=0}^{n} P(y_i)\, y_i \quad (7)$$

Consequently, $P(x)$ can be obtained simply through a softmax layer $S(\cdot)$, with $P(y_i)$ denoted as $S_i$.
To encourage probability mass close to the target $y$ when optimizing $P(x)$, we introduced the distribution focal loss $\mathcal{L}_D$. By enlarging the probabilities of $y_i$ and $y_{i+1}$ (the two values nearest to $y$), the network is forced to focus quickly on values close to the label $y$. The distribution focal loss is defined by applying the complete cross-entropy part of the quality focal loss:

$$\mathcal{L}_D(S_i, S_{i+1}) = -\big( (y_{i+1} - y)\log(S_i) + (y - y_i)\log(S_{i+1}) \big) \quad (8)$$

The purpose of the distribution focal loss is to enlarge the probabilities of the values around the target $y$. Its global minimum solution, $S_i = \frac{y_{i+1} - y}{y_{i+1} - y_i}$ and $S_{i+1} = \frac{y - y_i}{y_{i+1} - y_i}$, guarantees that the estimated regression target $\hat{y}$ is infinitely close to the corresponding label $y$, i.e., $\hat{y} = \sum_{j=0}^{n} P(y_j) y_j = S_i y_i + S_{i+1} y_{i+1} = \frac{y_{i+1} - y}{y_{i+1} - y_i} y_i + \frac{y - y_i}{y_{i+1} - y_i} y_{i+1} = y$.
The quality focal loss and the distribution focal loss can be unified into a general form known as the generalized focal loss. Suppose a model produces probability estimates $p_{y_l}, p_{y_r}$ ($p_{y_l} \ge 0$, $p_{y_r} \ge 0$, $p_{y_l} + p_{y_r} = 1$) for two variables $y_l, y_r$ ($y_l < y_r$), and the final prediction is their linear combination $\hat{y} = y_l p_{y_l} + y_r p_{y_r}$ ($y_l \le \hat{y} \le y_r$). The corresponding label $y$ of the prediction $\hat{y}$ also satisfies $y_l \le y \le y_r$. With the absolute distance $|y - \hat{y}|^{\beta}$ ($\beta \ge 0$) as the modulating factor, the generalized focal loss $\mathcal{L}_G$ is formulated as follows:

$$\mathcal{L}_G(p_{y_l}, p_{y_r}) = -\big| y - (y_l p_{y_l} + y_r p_{y_r}) \big|^{\beta} \big( (y_r - y)\log(p_{y_l}) + (y - y_l)\log(p_{y_r}) \big) \quad (9)$$
The generalized focal loss reaches its global minimum at $p_{y_l}^* = \frac{y_r - y}{y_r - y_l}$ and $p_{y_r}^* = \frac{y - y_l}{y_r - y_l}$, which implies that the prediction $\hat{y}$ exactly matches the continuous label $y$, i.e., $\hat{y} = y_l p_{y_l}^* + y_r p_{y_r}^* = y$. The modified detector differs from the former detector in two respects. First, during inference, the classification scores are fed directly into NMS as ranking scores, without multiplication by any separate quality prediction. Second, the last layer of the regression branch, which predicts the location of each bounding box edge, now has $n + 1$ outputs rather than 1 output, adding negligible computational cost. We can define the training loss in terms of the generalized focal loss as follows:

$$\mathcal{L} = \frac{1}{N_{\mathrm{pos}}} \sum_{z} \mathcal{L}_Q + \frac{1}{N_{\mathrm{pos}}} \sum_{z} \mathbf{1}_{\{c_z^* > 0\}} \big( \lambda_0 \mathcal{L}_B + \lambda_1 \mathcal{L}_D \big) \quad (10)$$

where $\mathcal{L}_Q$ is the quality focal loss, $\mathcal{L}_D$ is the distribution focal loss, and $\mathcal{L}_B$ is the GIoU loss; $\lambda_0$ and $\lambda_1$ are the balance weights of $\mathcal{L}_B$ and $\mathcal{L}_D$, respectively. $\mathbf{1}_{\{c_z^* > 0\}}$ is the indicator function, which equals 1 if $c_z^* > 0$ and 0 otherwise.
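For illustration, the quality focal loss of Equation (4) and the distribution focal loss of Equation (8) can be sketched as below; the tensor shapes, the reduction, and the value of beta are assumptions rather than the authors' exact implementation.

```python
# Hedged sketch of Equations (4) and (8); shapes, reduction, and beta are assumptions.
import torch
import torch.nn.functional as F

def quality_focal_loss(pred_logits, target_quality, beta=2.0):
    """pred_logits: raw classification scores; target_quality: soft IoU labels in [0, 1]."""
    sigma = pred_logits.sigmoid()
    ce = F.binary_cross_entropy_with_logits(pred_logits, target_quality, reduction="none")
    modulating = (target_quality - sigma).abs().pow(beta)      # |y - sigma|^beta
    return (modulating * ce).sum()

def distribution_focal_loss(pred_dist, target):
    """pred_dist: (N, n+1) logits over the discretized range {y_0, ..., y_n} with y_i = i;
    target: continuous regression labels with y_0 <= y <= y_n."""
    target = target.clamp(max=pred_dist.size(1) - 1 - 1e-4)    # keep y_{i+1} in range
    y_left = target.floor().long()                             # index of y_i
    y_right = y_left + 1                                       # index of y_{i+1}
    w_left = y_right.float() - target                          # (y_{i+1} - y)
    w_right = target - y_left.float()                          # (y - y_i)
    return (F.cross_entropy(pred_dist, y_left, reduction="none") * w_left
            + F.cross_entropy(pred_dist, y_right, reduction="none") * w_right).sum()
```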

2.4. CPU–GPU Multithreaded Pipeline Design

Our aim was to design a real-time object detection system on the NVIDIA Jetson TX2, a low-power embedded heterogeneous GPU platform, while making full use of its GPU computational power. Because the TX2 is a low-power device, energy consumption can be controlled by minimizing the amount of computation during system operation, which also improves the inference speed. However, reducing computation often leads to a decline in detection accuracy, so the critical issue is how to increase the inference speed while retaining system accuracy.
With the multicore CPU on the TX2, we maximized the computational power of the GPU through a multithreaded CPU–GPU pipeline design, in which the CPU is primarily responsible for logical tasks and the GPU handles high-density floating-point calculations. Data are transferred from CPU memory to GPU memory; once the GPU completes its computation, the results are transferred back to CPU memory.
The detection time of the system for an object is counted from reading the image to the system completing detection and returning the object and its position. Using a timing function to measure the inference time of each part of the code, we found that the time spent on the object detection process lay primarily in the CPU image preprocessing and GPU network prediction stages, whereas the time for the final CPU output of the detection results was negligible. Further profiling of the network execution on the CPU and GPU showed that the time taken to process each frame was approximately 21 ms in single-threaded operation, with 12.6 ms executed on the GPU and 8.4 ms on the CPU.

Considering that the CPU on the TX2 development board is multicore, we attempted to maximize the use of the GPU's computational power by opening multiple threads for scheduling and keeping the GPU in computation as continuously as possible. Here, one thread performs the GPU task, while another thread simultaneously conducts the CPU image reading and preprocessing tasks. When the first thread finishes its GPU computation, the second thread can immediately start the next GPU computation task. In this way, the GPU computation for the previous image and the CPU preprocessing for the next image are carried out at the same time, so the preprocessing time of each image is hidden during detection. The number of threads can be adjusted depending on the dataset and input requirements; for the current application, we used two threads for pipelined detection. Ideally, the whole process entirely hides the CPU processing time, so only the GPU processing time needs to be counted when detecting images. In addition, the improvements proposed herein do not involve changes to the network structure and thus have no impact on the accuracy of the system.
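The following is a simplified sketch of this two-thread pipeline (the frame source and the preprocess() and infer() functions are placeholders, not the deployed code): one thread keeps reading and preprocessing frames on the CPU while the other keeps the GPU busy with network inference, so the CPU time is hidden behind the GPU time.

```python
# Simplified two-thread CPU-GPU pipeline; preprocess() and infer() are placeholders.
import queue
import threading

frame_queue = queue.Queue(maxsize=2)   # small buffer between the two pipeline stages
SENTINEL = None

def cpu_worker(frames, preprocess):
    for frame in frames:               # read + resize/normalize on the CPU
        frame_queue.put(preprocess(frame))
    frame_queue.put(SENTINEL)          # signal end of stream

def gpu_worker(infer, results):
    while True:
        batch = frame_queue.get()      # blocks until the CPU stage delivers a frame
        if batch is SENTINEL:
            break
        results.append(infer(batch))   # network forward pass on the GPU

def run_pipeline(frames, preprocess, infer):
    results = []
    producer = threading.Thread(target=cpu_worker, args=(frames, preprocess))
    consumer = threading.Thread(target=gpu_worker, args=(infer, results))
    producer.start()
    consumer.start()
    producer.join()
    consumer.join()
    return results
```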

2.5. Evaluation Metrics

To define the detection results in more detail, we introduced a series of evaluation metrics based on average precision (AP), including AP50, AP75, APS, APM, and APL, where AP50 denotes the AP at intersection over union = 0.5, and AP75 indicates the AP at intersection over union = 0.75. APS indicates the AP with detection area less than 1394 (34 × 41), APM indicates the AP with detection area larger than 1394 (34 × 41) and smaller than 2888 (76 × 38), and APL refers to the AP with detection area larger than 2888 (76 × 38). The equation of AP is as follows:
$$\mathrm{AP} = \sum_{i=1}^{N} P(i)\, \Delta r(i) \quad (11)$$

where $N$ denotes the number of test images, $P(i)$ represents the precision at the $i$-th image, and $\Delta r(i)$ denotes the change in recall from image $i - 1$ to image $i$.
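As a compact numeric illustration of Equation (11), the following sketch accumulates precision weighted by the change in recall; the precision and recall values in the example are made up for demonstration only.

```python
# AP as the recall-weighted sum of precision values (Equation (11)); example values are made up.
def average_precision(precisions, recalls):
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += p * (r - prev_recall)   # P(i) * delta-recall(i)
        prev_recall = r
    return ap

print(average_precision([1.0, 1.0, 0.67, 0.75], [0.25, 0.5, 0.5, 0.75]))  # 0.6875
```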

2.6. Experimental Setup

The experiments were conducted on a server with an NVIDIA Tesla V100 GPU and CUDA 11.2. The basic building blocks of the detection framework were MC-ResNetv1 and MC-ResNetv2. During training, the key hyperparameters were set as follows: learning rate = 0.0002, momentum = 0.8, gamma = 0.1, and weight decay = 0.0002. The optimizer was stochastic gradient descent (SGD). Moreover, to make the test results more convincing, we executed the whole test process 10 times and averaged the final results.
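For reference, the listed hyperparameters can be assembled into a PyTorch training configuration as sketched below; the model stand-in, data, and learning-rate step size are assumptions, and only the hyperparameter values come from the text (gamma = 0.1 is interpreted here as a step-decay factor for the learning rate).

```python
# Sketch of the stated training configuration; model, data, and step_size are assumptions.
import torch

model = torch.nn.Conv2d(3, 16, 3)            # stand-in for MC-LCNN
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=2e-4,            # learning rate = 0.0002
    momentum=0.8,       # momentum = 0.8
    weight_decay=2e-4,  # weight decay = 0.0002
)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

for epoch in range(3):                        # skeleton training loop on random data
    x = torch.randn(2, 3, 416, 416)
    loss = model(x).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```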
In different versions of the same model, as the input size of the image gets larger, the network needs more layers (deeper and wider) to expand the receptive fields and more channels to capture finer-grained features. Thus, the network depth or width of the backbone is typically different in various versions of the same model; in other words, their weight files are varied. If we simplistically resize the inputs of different versions to the same resolution, it would be unfair to these state-of-the-art models. To this end, we kept all parameters of the comparison models, including input size, backbone, and weights, unchanged, allowing all models to perform well. It is worth noting that the input size of the proposed model was 416 × 416 (we reshaped the 1080 × 1920 resolution to 416 × 416 resolution) to balance performance and inference speed.

3. Results

3.1. The Impact of Data Augmentation on the MC-LCNN

Data augmentation is an integral part of the whole training process and has a direct impact on the final detection accuracy. We compared 14 influential data augmentation methods [15,16,17,18] and combined them to determine the final approach for dataset augmentation in this study. First, we tested the 14 augmentation methods in turn and then selected the top four performing methods to combine and test. Typically, an augmentation method that performs poorly on its own rarely becomes superior when combined with other methods, so we only considered combinations of the top four methods. The test results are shown in Table 2.
We can clearly observe that Cutout, Blur, Flip, and Rotation achieved excellent performance, with AP50 values of 91.14%, 90.69%, 88.59%, and 88.38%, respectively. Surprisingly, the three most advanced data augmentation methods, namely Mixup, Cutmix, and Mosaic, all showed mediocre performance, probably because the image features of medicinal chrysanthemums are mostly similar (e.g., in color and texture), so complex augmentation methods generate a large amount of redundant local information and cause overfitting. It is worth noting that Blur ranked second among all the augmentation methods, probably because Blur adds genuinely new features to the dataset rather than redundant ones, which greatly improves the robustness of the model. Furthermore, when we combined Cutout and Blur, the AP50 improved from 91.14% to 93.06%, an encouraging result. In summary, we adopted the combination of Cutout and Blur as the data augmentation method in this study.
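As an illustration, the chosen Cutout + Blur combination could be expressed with standard torchvision transforms as below; RandomErasing stands in for Cutout and GaussianBlur for Blur, and the kernel size, probability, and erasing scale are assumptions rather than the paper's exact settings.

```python
# Illustrative Cutout + Blur augmentation pipeline using torchvision stand-ins.
import torchvision.transforms as T

train_transform = T.Compose([
    T.Resize((416, 416)),                             # network input size
    T.ToTensor(),
    T.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0)),  # Blur-style augmentation
    T.RandomErasing(p=0.5, scale=(0.02, 0.2)),        # Cutout-style square dropout
])
```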

3.2. Ablation Experiments

MC-LCNN employs several modules, including the proposed MC-ResNet, DropBlock, EMA, SPP, and CBM. We used ablation experiments to verify the performance of these modules. First, to validate the performance of the MC-ResNet module, we replaced MC-ResNet with 24 alternative feature extraction networks. Furthermore, to validate the performance of DropBlock, EMA, and SPP, we removed these modules one at a time. Finally, we verified the performance of CBM by sequentially increasing the number of CBM modules. It is worth noting that MC-LCNN is essentially a convolutional neural network, so CBM cannot be removed completely. The results of the ablation experiments are shown in Table 3.
First, as observed in Table 3, MC-ResNet outperformed the 24 alternative feature extraction networks; its AP50, APS, APM, and APL were 2.13%, 3.29%, 3.14%, and 2.36% higher than those of the suboptimal CSPRetNeXt module, respectively, showing that MC-ResNet had the most prominent ability for small-object feature extraction. In addition, the inference speed of MC-ResNet was an impressive 11.07 FPS higher than that of CSPDarknet53 (the module with the second highest inference speed after MC-ResNet). Second, after adding DropBlock, EMA, and SPP, the AP50 of MC-LCNN improved by 3.4%, 2.12%, and 6.81%, and the inference speed improved by 2.4, 2.59, and 7.99 FPS, respectively. Because SPP can receive a feature map of any size as input and output a fixed-size feature vector, it can significantly improve both the detection precision and the inference speed of the model. Finally, we verified that the optimal overall performance of MC-LCNN was achieved using a single CBM module. When several CBM modules were employed, the AP50 of the whole model showed only a slight increase; with four CBM modules, the AP50 marginally increased by 0.27%, while the inference speed significantly decreased by 19.45 FPS. To intuitively observe the image features, the visualization process of some images in MC-LCNN is shown in Figure 5.

3.3. Comparisons with State-of-the-Art Detection Methods

In this section, we present a comprehensive comparison of the latest 13 object detection frameworks (54 models) with the proposed MC-LCNN. The results are shown in Table 4.
First, our goal was to build a lightweight network; hence, the inference speed of the model was crucial to us. The inference speed of MC-LCNN (FPS = 109.28) was second only to PP-YOLOv2 (FPS = 110.54) with an input size of 320 × 320, ranking second among the 54 models in terms of inference speed. However, the AP50 of MC-LCNN (93.06%) was 7.08% higher than that of PP-YOLOv2 (85.98%) with an input size of 320 × 320, showing a clear advantage. Secondly, although the inference speed of MC-LCNN was not the most superior, the detection accuracy (AP50 = 93.06%) was the highest among the 54 models and 3.43% higher than the suboptimal YOLOX-X (AP50 = 89.63%), which is an encouraging result. Not only that, in MC-LCNN, the detection precision for different anchor box sizes (APS = 69.63%, APM = 76.42%, and APL = 88.89%) was 4.41%, 2.88%, and 2.03% higher than that of the suboptimal YOLOX-X (APS = 65.22%, APM = 73.54%, and APL = 86.86%), respectively. The performance of MC-LCNN was more prominent for small-sized anchor box detection, which is critical for robotic systems that operate in natural environments. Because of path planning constraints, small-sized anchor box detection is particularly relevant when the robot picks distant chrysanthemums. Finally, according to the improvement strategy in Section 2.4, we tested MC-LCNN on a heterogeneous GPU platform, NVIDIA Jetson TX2, and the example is shown in Figure 6.
The precision of the model remained unchanged, and the inference speed of the whole model increased by 2 FPS, benefiting from the CPU–GPU multithreaded pipeline design. In the ideal case, where the CPU processing time is completely hidden and only the GPU processing time is counted, the expected improvement would be approximately 19 FPS. However, because of FPS measurement overhead and communication losses between the threads, the actual improvement in detection speed fell short of this ideal, although the design still saved part of the CPU preprocessing time. The test results on the NVIDIA Jetson TX2 are shown in Figure 7.

4. Discussion

In response to the three issues raised in the Introduction, we compared the proposed MC-LCNN with the studies in Table 1. For issue 1, from an inference speed perspective, MC-LCNN (9.15 ms) was slightly faster than the model of Liu et al. (10 ms) [12], while its detection accuracy (AP50) was 15.06% higher. From a detection accuracy perspective, the AP50 of MC-LCNN (93.06%) was 3.06% higher than that of Yang et al. (90%) [11], with the inference speed improving substantially from 0.7 s to 9.15 ms. To the best of our knowledge, MC-LCNN is the first model to achieve highly accurate, real-time detection of medicinal chrysanthemums. For issue 2, it is clear from Table 1 that most studies were tested in ideal environments or only under illumination variations. In this study, the dataset was collected from natural environments, including complex unstructured conditions such as illumination variations, overlaps, and occlusions, thus significantly improving the robustness of the model. For issue 3, we tested MC-LCNN embedded in a low-power edge computing device, the NVIDIA Jetson TX2, and used a multithreaded CPU–GPU pipeline design to improve its inference speed.
The proposed MC-LCNN has apparent advantages but also shortcomings that need to be addressed. First, its inference speed was not the highest among all the compared models, and inference speed is crucial for robotic picking. Moreover, when the proposed model was embedded in the Jetson TX2, it took around 0.6 s to test a single image, which is an acceptable but not outstanding result. Furthermore, actual unstructured environments involve more than illumination variations, overlaps, and occlusions, and we need to collect data from additional scenarios to further improve the robustness of the model.

5. Conclusions

In this work, we propose a new lightweight convolutional neural network, named MC-LCNN, for detecting medicinal chrysanthemums at the bud stage under complex unstructured environments (illumination variations, overlaps, and occlusions). We collected 4000 original images (1080 × 1920) as the dataset. In the NVIDIA Tesla V100 GPU environment, the AP50 on the test dataset reached 93.06%, and the inference speed was 109.28 FPS. The optimal data enhancement strategy for training MC-LCNN was the combination of Cutout and Blur. Furthermore, we compared the proposed MC-LCNN with 13 state-of-the-art object detection frameworks (54 models); MC-LCNN achieved the highest AP50 and was second only to PP-YOLOv2 in inference speed. Finally, we embedded MC-LCNN into the NVIDIA Jetson TX2 for real-time object detection and improved the inference speed by 2 FPS through a multithreaded CPU–GPU pipeline design. The proposed MC-LCNN has the potential to be integrated into a selective picking robot for the automatic picking of medicinal chrysanthemums via the NVIDIA Jetson TX2 in the future.

Author Contributions

C.Q., conceptualization, methodology, software, data curation, writing—original draft. J.C., conceptualization, methodology, software, writing—review and editing. J.Z., writing—review and editing. Y.Z., software. Z.B., methodology. K.C., supervision. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Northern Jiangsu Science and Technology Major Project-Enriching the People and Strengthening the Power of County Program, grant number SZ-YC2019002.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

This study was funded by the Northern Jiangsu Science and Technology Major Project-Enriching the People and Strengthening the Power of County Program (grant number SZ-YC2019002). The authors would also like to thank Kunjie Chen of Nanjing Agricultural University for his technical support.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Liu, C.L.; Lu, W.Y.; Gao, B.Y.; Kimura, H.; Li, Y.F.; Wang, J. Rapid identification of chrysanthemum teas by computer vision and deep learning. Food Sci. Nutr. 2020, 8, 1968–1977.
  2. Yuan, H.; Jiang, S.; Liu, Y.; Daniyal, M.; Jian, Y.; Peng, C.; Shen, J.; Liu, S.; Wang, W. The flower head of Chrysanthemum morifolium Ramat. (Juhua): A paradigm of flowers serving as Chinese dietary herbal medicine. J. Ethnopharmacol. 2020, 261, 113043.
  3. Hasan, R.I.; Yusuf, S.M.; Alzubaidi, L. Review of the State of the Art of Deep Learning for Plant Diseases: A Broad Analysis and Discussion. Plants 2020, 9, 1302.
  4. Wöber, W.; Mehnen, L.; Sykacek, P.; Meimberg, H. Investigating Explanatory Factors of Machine Learning Models for Plant Classification. Plants 2021, 10, 2674.
  5. Genaev, M.A.; Skolotneva, E.S.; Gultyaeva, E.I.; Orlova, E.A.; Bechtold, N.P.; Afonnikov, D.A. Image-Based Wheat Fungi Diseases Identification by Deep Learning. Plants 2021, 10, 1500.
  6. Kondo, N.; Ogawa, Y.; Monta, M.; Shibano, Y. Visual Sensing Algorithm for Chrysanthemum Cutting Sticking Robot System. In Proceedings of the International Society for Horticultural Science (ISHS), Leuven, Belgium, 1 December 1996; pp. 383–388.
  7. Warren, D. Image analysis in chrysanthemum DUS testing. Comput. Electron. Agric. 2000, 25, 213–220.
  8. Tarry, C.; Wspanialy, P.; Veres, M.; Moussa, M. An Integrated Bud Detection and Localization System for Application in Greenhouse Automation. In Proceedings of the Canadian Conference on Computer and Robot Vision, Montreal, QC, Canada, 6–9 May 2014; pp. 344–348.
  9. Tete, T.N.; Kamlu, S. Detection of plant disease using threshold, k-mean cluster and ann algorithm. In Proceedings of the 2nd International Conference for Convergence in Technology (I2CT), Mumbai, India, 7–9 April 2017; pp. 523–526.
  10. Yang, Q.H.; Chang, C.; Bao, G.J.; Fan, J.; Xun, Y. Recognition and localization system of the robot for harvesting Hangzhou White Chrysanthemums. Int. J. Agric. Biol. Eng. 2018, 11, 88–95.
  11. Yang, Q.H.; Luo, S.L.; Chang, C.; Xun, Y.; Bao, G.J. Segmentation algorithm for Hangzhou white chrysanthemums based on least squares support vector machine. Int. J. Agric. Biol. Eng. 2019, 12, 127–134.
  12. Liu, Z.L.; Wang, J.; Tian, Y.; Dai, S.L. Deep learning for image-based large-flowered chrysanthemum cultivar recognition. Plant Methods 2019, 15, 146.
  13. Van Nam, N. Application of the Faster R-CNN algorithm to identify objects with both noisy and noiseless images. Int. J. Adv. Res. Comput. Eng. Technol. 2020, 9, 112–115.
  14. Lin, T.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 318–327.
  15. Tian, Y.N.; Yang, G.D.; Wang, Z.; Li, E.; Liang, Z.Z. Instance segmentation of apple flowers using the improved mask R-CNN model. Biosyst. Eng. 2020, 193, 264–278.
  16. Espejo-Garcia, B.; Mylonas, N.; Athanasakos, L.; Vali, E.; Fountas, S. Combining generative adversarial networks and agricultural transfer learning for weeds identification. Biosyst. Eng. 2021, 204, 79–89.
  17. Ma, R.; Tao, P.; Tang, H.Y. Optimizing Data Augmentation for Semantic Segmentation on Small-Scale Dataset. In Proceedings of the 2nd International Conference on Control and Computer Vision (ICCCV), New York, NY, USA, 6–9 June 2019; pp. 77–81.
  18. Pandian, J.A.; Geetharamani, G.; Annette, B. Data Augmentation on Plant Leaf Disease Image Dataset Using Image Manipulation and Deep Learning Techniques. In Proceedings of the IEEE 9th International Conference on Advanced Computing (IACC), Tamilnadu, India, 13–14 December 2019; pp. 199–204.
  19. Dou, Z.; Gao, K.; Zhang, X.; Wang, H.; Wang, J. Improving Performance and Adaptivity of Anchor-Based Detector Using Differentiable Anchoring with Efficient Target Generation. IEEE Trans. Image Process. 2021, 30, 712–724.
  20. Wu, D.H.; Lv, S.C.; Jiang, M.; Song, H.B. Using channel pruning-based YOLO v4 deep learning algorithm for the real-time and accurate detection of apple flowers in natural environments. Comput. Electron. Agric. 2020, 178, 12.
  21. Zhang, T.; Li, L. An Improved Object Detection Algorithm Based on M2Det. In Proceedings of the IEEE International Conference on Artificial Intelligence and Computer Applications (ICAICA), Dalian, China, 27–29 June 2020; pp. 582–585.
  22. Hsu, W.Y.; Lin, W.Y. Ratio-and-Scale-Aware YOLO for Pedestrian Detection. IEEE Trans. Image Process. 2021, 30, 934–947.
  23. Kim, S.W.; Kook, H.K.; Sun, J.Y.; Kang, M.C.; Ko, S.J. Parallel Feature Pyramid Network for Object Detection. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 234–250.
  24. Wang, Z.; Cheng, Z.; Huang, H.; Zhao, J. ShuDA-RFBNet for Real-time Multi-task Traffic Scene Perception. In Proceedings of the Chinese Automation Congress (CAC), Hangzhou, China, 22–24 November 2019; pp. 305–310.
  25. Zhang, S.; Wen, L.; Lei, Z.; Li, S.Z. RefineDet++: Single-Shot Refinement Neural Network for Object Detection. IEEE Trans. Circuits Syst. Video Technol. 2021, 31, 674–687.
  26. Jian, W.; Lang, L. Face mask detection based on Transfer learning and PP-YOLO. In Proceedings of the IEEE 2nd International Conference on Big Data, Artificial Intelligence and Internet of Things Engineering (ICBAIE), Nanchang, China, 26–28 March 2021; pp. 106–109.
  27. Huang, X.; Wang, X.; Lv, W.; Bai, X.; Long, X.; Deng, K.; Dang, Q.; Han, S.; Liu, Q.; Hu, X.; et al. PP-YOLOv2: A Practical Object Detector. arXiv 2021, arXiv:2104.10419.
  28. Xu, X.; Zhang, L.; Yang, J.; Cao, C.; Tan, Z.; Luo, M. Object Detection Based on Fusion of Sparse Point Cloud and Image Information. IEEE Trans. Instrum. Meas. 2021, 70, 1–12.
  29. Ge, Z.; Liu, S.; Wang, F.; Li, Z. YOLOX: Exceeding YOLO Series in 2021. arXiv 2021, arXiv:2107.08430.
Figure 1. The different flowering stages of medicinal chrysanthemums.
Figure 2. Some original images.
Figure 3. NVIDIA Jetson TX2 parameters.
Figure 4. Structure of the proposed MC-LCNN.
Figure 5. Visualization results of some input images.
Figure 6. The test results on the NVIDIA Jetson TX2.
Figure 7. The test results on the NVIDIA Jetson TX2.
Table 1. The literature on different chrysanthemum detection tasks.

Authors | Tasks | Published Year | Test Environment | Precision | Inference Speed | Test Devices
[6] | Chrysanthemum cut detection | 1996 | Ideal | / | / | Laptop
[7] | Chrysanthemum leaf recognition | 2000 | Ideal | / | / | Laptop
[8] | Chrysanthemum bud testing | 2014 | Ideal | 0.75 | / | Laptop
[9] | Chrysanthemum disease detection | 2017 | Ideal | / | / | Laptop
[10] | Chrysanthemum variety testing | 2018 | Illumination | 0.85 | 0.4 s | Laptop
[11] | Chrysanthemum picking | 2019 | Illumination | 0.9 | 0.7 s | Laptop
[12] | Chrysanthemum variety classification | 2019 | Ideal | 0.78 | 10 ms | Laptop
[1] | Chrysanthemum variety classification | 2020 | Ideal | 0.96 | / | Laptop
[13] | Chrysanthemum image recognition | 2020 | Ideal | 0.76 | 0.3 s | Laptop
Table 2. Comparison with different data enhancement methods.

Augmentation | AP | AP50 | AP75 | APS | APM | APL
Flip | 70.68 | 88.59 | 75.49 | 69.22 | 75.87 | 85.89
Shear | 70.99 | 88.63 | 75.32 | 67.01 | 75.22 | 85.28
Crop | 69.03 | 87.84 | 74.01 | 66.84 | 75.34 | 85.44
Rotation | 69.56 | 88.38 | 74.42 | 66.28 | 76.03 | 85.59
Grayscale | 68.42 | 87.84 | 73.21 | 66.14 | 75.84 | 86.41
Hue | 68.82 | 88.44 | 73.57 | 66.11 | 76.03 | 86.04
Saturation | 68.49 | 88.18 | 73.36 | 65.98 | 75.62 | 86.63
Exposure | 69.93 | 89.13 | 73.52 | 66.01 | 75.83 | 86.12
Blur | 70.13 | 90.69 | 73.59 | 66.02 | 75.98 | 87.35
Noise | 68.06 | 87.11 | 71.25 | 64.39 | 72.88 | 84.11
Cutout | 70.33 | 91.14 | 75.44 | 67.22 | 74.89 | 87.88
Mixup | 68.46 | 88.31 | 72.53 | 65.52 | 73.46 | 85.03
Cutmix | 68.88 | 88.67 | 72.68 | 65.33 | 73.13 | 85.67
Mosaic | 68.87 | 88.54 | 72.23 | 65.12 | 73.06 | 85.29
Top-four combination | 71.62 | 92.03 | 75.09 | 67.88 | 75.38 | 88.26
Top-four combination | 70.98 | 91.82 | 74.66 | 67.65 | 75.22 | 87.61
Top-four combination | 71.44 | 92.36 | 75.93 | 68.66 | 75.87 | 87.99
Top-four combination | 71.64 | 92.22 | 76.23 | 69.03 | 76.08 | 87.92
Cutout + Blur | 72.22 | 93.06 | 76.46 | 69.63 | 76.42 | 88.89
Top-four combination | 71.88 | 92.62 | 76.32 | 69.12 | 76.53 | 88.22
Top-four combination | 71.03 | 92.03 | 75.96 | 68.99 | 75.99 | 87.53
Top-four combination | 70.65 | 91.87 | 74.53 | 67.63 | 75.68 | 87.04
Top-four combination | 70.11 | 90.58 | 73.89 | 66.37 | 74.81 | 86.34
Table 3. The ablation experiment results of different modules.

Method | FPS | AP | AP50 | AP75 | APS | APM | APL
Ours + CBM × 5 | 80.29 | 72.44 | 93.31 | 77.02 | 70.23 | 77.39 | 88.93
Ours + CBM × 4 | 89.83 | 72.63 | 93.33 | 77.86 | 70.99 | 77.54 | 89.25
Ours + CBM × 3 | 96.11 | 72.56 | 93.23 | 77.33 | 70.56 | 77.28 | 89.21
Ours + CBM × 2 | 101.26 | 72.34 | 93.08 | 77.02 | 70.12 | 77.25 | 88.91
Ours − SPP | 101.29 | 67.85 | 86.25 | 69.82 | 62.63 | 69.91 | 84.36
Ours − EMA | 106.69 | 70.33 | 90.94 | 73.83 | 64.45 | 74.12 | 87.11
Ours − DropBlock | 106.88 | 69.58 | 89.66 | 73.25 | 64.22 | 73.66 | 86.82
Ours (ResNet101) | 64.66 | 64.12 | 85.14 | 66.89 | 58.33 | 67.01 | 82.34
Ours (ResNet50) | 73.45 | 62.06 | 82.64 | 65.57 | 57.46 | 65.63 | 80.84
Ours (RetNeXt-101) | 92.21 | 69.36 | 88.08 | 74.12 | 65.89 | 74.33 | 85.33
Ours (ResNet50-vd-dcn) | 80.58 | 68.54 | 87.61 | 74.88 | 67.82 | 74.93 | 85.26
Ours (ResNet101-vd-dcn) | 67.99 | 68.38 | 89.96 | 74.58 | 67.66 | 74.61 | 86.01
Ours (EfficientB6) | 61.58 | 68.35 | 88.31 | 71.29 | 67.41 | 71.38 | 85.49
Ours (EfficientB5) | 67.33 | 67.68 | 87.55 | 69.84 | 66.84 | 69.85 | 85.12
Ours (EfficientB4) | 70.44 | 67.39 | 87.08 | 68.46 | 66.19 | 68.63 | 84.87
Ours (EfficientB3) | 78.09 | 66.67 | 86.42 | 70.41 | 67.88 | 70.52 | 84.58
Ours (EfficientB2) | 83.28 | 66.33 | 85.27 | 69.16 | 65.44 | 69.46 | 84.33
Ours (EfficientB1) | 85.33 | 65.64 | 83.26 | 67.33 | 62.06 | 67.42 | 82.89
Ours (EfficientB0) | 96.63 | 63.59 | 80.83 | 68.99 | 64.83 | 69.58 | 78.45
Ours (VGG16) | 76.13 | 63.87 | 81.65 | 66.89 | 61.26 | 70.34 | 78.05
Ours (MobileNet v1) | 83.54 | 62.66 | 79.99 | 72.67 | 66.02 | 72.93 | 76.85
Ours (MobileNet v2) | 79.56 | 64.48 | 82.11 | 73.43 | 66.24 | 73.67 | 80.99
Ours (ShuffleNet v1) | 85.84 | 65.12 | 84.12 | 69.91 | 61.41 | 70.28 | 82.24
Ours (ShuffleNet v2) | 76.27 | 66.69 | 87.28 | 70.57 | 62.66 | 70.88 | 84.44
Ours (DenseNet) | 81.02 | 67.34 | 88.54 | 69.66 | 62.16 | 69.99 | 84.83
Ours (DarkNet53) | 84.82 | 67.98 | 89.67 | 70.18 | 64.53 | 70.22 | 85.06
Ours (CSPDarknet53) | 98.21 | 68.11 | 89.82 | 72.89 | 65.98 | 72.88 | 85.54
Ours (CSPDenseNet) | 91.46 | 68.14 | 90.22 | 74.33 | 67.38 | 74.56 | 86.22
Ours (CSPRetNeXt) | 93.11 | 68.88 | 90.93 | 73.26 | 66.34 | 73.28 | 86.53
Ours (RetinaNet) | 62.63 | 64.09 | 84.08 | 66.28 | 60.11 | 66.54 | 81.31
Ours (Modified CSP v5) | 90.23 | 69.23 | 90.82 | 73.11 | 67.23 | 73.25 | 86.83
Ours | 109.28 | 72.22 | 93.06 | 76.46 | 69.63 | 76.42 | 88.89
Table 4. Comparisons with state-of-the-art detection methods.

Method | Backbone | Size | FPS | AP | AP50 | AP75 | APS | APM | APL
RetinaNet [19] | ResNet101 | 800 × 800 | 15.63 | 48.33 | 70.23 | 51.24 | 41.22 | 51.33 | 67.03
RetinaNet | ResNet50 | 800 × 800 | 18.82 | 51.61 | 76.44 | 55.09 | 44.21 | 55.43 | 69.14
RetinaNet | ResNet101 | 500 × 500 | 24.58 | 60.83 | 81.29 | 62.84 | 51.29 | 62.11 | 75.49
RetinaNet | ResNet50 | 500 × 500 | 30.99 | 63.69 | 82.99 | 64.44 | 53.09 | 64.13 | 76.58
EfficientDetD6 [20] | EfficientB6 | 1280 × 1280 | 10.26 | 64.13 | 85.21 | 66.45 | 56.33 | 65.91 | 77.27
EfficientDetD5 | EfficientB5 | 1280 × 1280 | 23.58 | 63.09 | 84.66 | 66.31 | 55.94 | 66.35 | 78.21
EfficientDetD4 | EfficientB4 | 1024 × 1024 | 38.61 | 62.99 | 84.33 | 65.11 | 55.31 | 65.36 | 78.01
EfficientDetD3 | EfficientB3 | 896 × 896 | 50.83 | 60.86 | 83.16 | 64.46 | 54.86 | 64.39 | 77.92
EfficientDetD2 | EfficientB2 | 768 × 768 | 68.99 | 59.54 | 82.84 | 64.08 | 54.11 | 64.12 | 77.87
EfficientDetD1 | EfficientB1 | 640 × 640 | 80.11 | 56.44 | 79.41 | 58.66 | 49.66 | 58.49 | 72.28
EfficientDetD0 | EfficientB0 | 512 × 512 | 88.29 | 53.28 | 77.96 | 55.86 | 47.26 | 55.89 | 70.21
M2Det [21] | VGG16 | 800 × 800 | 19.22 | 55.23 | 81.22 | 57.69 | 48.54 | 57.58 | 71.55
M2Det | ResNet101 | 320 × 320 | 30.54 | 52.33 | 77.38 | 56.54 | 48.44 | 56.36 | 70.83
M2Det | VGG16 | 512 × 512 | 33.56 | 50.19 | 74.94 | 54.46 | 46.21 | 54.32 | 69.91
M2Det | VGG16 | 300 × 300 | 45.44 | 49.68 | 71.86 | 51.33 | 44.37 | 52.68 | 68.58
YOLOv3 [22] | DarkNet53 | 608 × 608 | 45.31 | 64.65 | 86.85 | 67.23 | 58.57 | 67.66 | 74.83
YOLOv3 (SPP) | DarkNet53 | 608 × 608 | 46.39 | 64.05 | 85.13 | 66.88 | 56.88 | 66.43 | 74.22
YOLOv3 | DarkNet53 | 416 × 416 | 58.62 | 61.18 | 80.08 | 63.18 | 55.01 | 63.54 | 72.84
YOLOv3 | DarkNet53 | 320 × 320 | 62.59 | 58.41 | 77.34 | 61.34 | 54.67 | 61.67 | 71.11
PFPNet (R) [23] | VGG16 | 512 × 512 | 43.11 | 52.22 | 73.59 | 56.24 | 50.88 | 56.68 | 68.42
PFPNet (R) | VGG16 | 320 × 320 | 52.09 | 51.35 | 72.63 | 55.12 | 48.89 | 55.37 | 67.95
PFPNet (S) | VGG16 | 300 × 300 | 53.64 | 55.53 | 74.33 | 59.81 | 53.22 | 60.44 | 72.67
RFBNetE | VGG16 | 512 × 512 | 36.99 | 60.25 | 80.03 | 62.58 | 54.27 | 62.89 | 75.21
RFBNet [24] | VGG16 | 512 × 512 | 52.02 | 58.11 | 76.13 | 61.06 | 53.85 | 61.46 | 75.03
RFBNet | VGG16 | 512 × 512 | 60.16 | 63.96 | 84.85 | 65.48 | 58.68 | 65.66 | 81.84
RefineDet [25] | VGG16 | 512 × 512 | 42.13 | 59.83 | 79.66 | 63.56 | 57.53 | 63.69 | 76.53
RefineDet | VGG16 | 448 × 448 | 58.61 | 57.51 | 78.09 | 61.11 | 56.91 | 61.41 | 75.54
YOLOv4 [20] | CSPDarknet53 | 608 × 608 | 49.58 | 66.99 | 88.23 | 69.64 | 60.85 | 69.98 | 86.88
YOLOv4 | CSPDarknet53 | 512 × 512 | 69.42 | 66.38 | 87.98 | 68.99 | 60.44 | 69.33 | 85.34
YOLOv4 | CSPDarknet53 | 300 × 300 | 83.28 | 63.24 | 83.43 | 66.48 | 59.68 | 66.51 | 80.28
YOLOv5s | CSPDenseNet | 416 × 416 | 84.11 | 65.14 | 84.33 | 68.22 | 61.24 | 68.32 | 81.11
YOLOv5l | CSPDenseNet | 416 × 416 | 67.03 | 66.35 | 86.26 | 69.31 | 61.37 | 69.41 | 81.33
YOLOv5m | CSPDenseNet | 416 × 416 | 51.22 | 67.58 | 86.67 | 69.89 | 61.99 | 70.22 | 83.59
YOLOv5x | CSPDenseNet | 416 × 416 | 30.68 | 68.93 | 88.64 | 72.66 | 63.12 | 72.68 | 84.44
PP-YOLO [26] | ResNet50-vd-dcn | 320 × 320 | 106.85 | 66.64 | 85.26 | 68.15 | 60.85 | 68.17 | 81.23
PP-YOLO | ResNet50-vd-dcn | 416 × 416 | 93.25 | 67.06 | 86.88 | 68.67 | 60.99 | 68.61 | 82.03
PP-YOLO | ResNet50-vd-dcn | 512 × 512 | 80.01 | 68.32 | 87.29 | 69.58 | 61.45 | 69.62 | 83.22
PP-YOLO | ResNet50-vd-dcn | 608 × 608 | 64.26 | 69.11 | 88.02 | 70.18 | 62.33 | 70.54 | 84.31
PP-YOLOv2 [27] | ResNet50-vd-dcn | 320 × 320 | 110.54 | 67.89 | 85.98 | 68.28 | 62.02 | 68.47 | 82.06
PP-YOLOv2 | ResNet50-vd-dcn | 416 × 416 | 103.88 | 67.95 | 86.13 | 68.88 | 62.55 | 70.46 | 83.11
PP-YOLOv2 | ResNet50-vd-dcn | 512 × 512 | 89.04 | 68.36 | 86.85 | 69.33 | 62.84 | 69.67 | 83.89
PP-YOLOv2 | ResNet50-vd-dcn | 608 × 608 | 81.67 | 68.88 | 87.26 | 70.06 | 63.04 | 70.33 | 84.48
PP-YOLOv2 | ResNet50-vd-dcn | 640 × 640 | 63.38 | 69.45 | 88.64 | 71.23 | 64.24 | 71.61 | 85.15
PP-YOLOv2 | ResNet101-vd-dcn | 512 × 512 | 48.98 | 69.48 | 89.22 | 71.99 | 64.53 | 72.32 | 86.67
PP-YOLOv2 | ResNet101-vd-dcn | 640 × 640 | 41.34 | 69.66 | 89.59 | 72.83 | 65.11 | 72.88 | 86.88
YOLOF [28] | RetinaNet | 512 × 512 | 102.84 | 65.53 | 86.52 | 69.03 | 62.15 | 69.11 | 83.12
YOLOF-R101 | ResNet-101 | 512 × 512 | 89.28 | 65.91 | 86.58 | 69.44 | 62.41 | 69.45 | 83.48
YOLOF-X101 | RetNeXt-101 | 512 × 512 | 68.09 | 67.56 | 88.34 | 70.95 | 62.95 | 71.06 | 85.66
YOLOF-X101+ | RetNeXt-101 | 512 × 512 | 53.69 | 67.94 | 88.82 | 71.38 | 63.11 | 71.44 | 85.83
YOLOF-X101++ | RetNeXt-101 | 512 × 512 | 36.06 | 68.25 | 89.03 | 72.63 | 64.23 | 72.61 | 86.22
YOLOX-DarkNet53 | Darknet-53 | 640 × 640 | 81.61 | 66.89 | 87.41 | 71.12 | 63.28 | 71.29 | 86.13
YOLOX-M [29] | Modified CSP v5 | 640 × 640 | 65.48 | 67.83 | 88.36 | 71.53 | 63.56 | 71.58 | 86.27
YOLOX-L | Modified CSP v5 | 640 × 640 | 53.54 | 69.44 | 89.14 | 73.24 | 64.93 | 73.38 | 86.35
YOLOX-X | Modified CSP v5 | 640 × 640 | 46.22 | 69.86 | 89.63 | 73.39 | 65.22 | 73.54 | 86.86
Ours | MC-ResNet | 416 × 416 | 109.28 | 72.22 | 93.06 | 76.46 | 69.63 | 76.42 | 88.89
