Apple Detection in Complex Scene Using the Improved YOLOv4 Model

Abstract: To enable apple picking robots to quickly and accurately detect apples under the complex background of orchards, we propose an improved You Only Look Once version 4 (YOLOv4) model together with data augmentation methods. Firstly, crawler technology is utilized to collect pertinent apple images from the Internet for labeling. For the problem of insufficient image data caused by the random occlusion between leaves, in addition to traditional data augmentation techniques, a leaf illustration data augmentation method is proposed in this paper. Secondly, due to the large size and heavy computation of the YOLOv4 model, the backbone network Cross Stage Partial Darknet53 (CSPDarknet53) of the YOLOv4 model is replaced by EfficientNet, and convolution layers (Conv2D) are added to the three outputs to further adjust and extract the features, which makes the model lighter and reduces the computational complexity. Finally, the apple detection experiment is performed on 2670 expanded samples. The test results show that the EfficientNet-B0-YOLOv4 model proposed in this paper has better detection performance than YOLOv3, YOLOv4, and Faster R-CNN with ResNet, which are state-of-the-art apple detection models. The average values of Recall, Precision, and F1 reach 97.43%, 95.52%, and 96.54%, respectively, and the average detection time is 0.338 s per frame, which proves that the proposed method can be well applied in the vision systems of picking robots in the apple industry.


Introduction and Related Works
Apple is one of the most popular fruits, and its output is also among the top three in global fruit sales. According to incomplete statistics, there are more than 7500 types of known apples [1] in the world. However, experienced farmers are still the main force of agricultural production. Manual work consumes time and increases production costs, and workers who lack knowledge and experience will make unnecessary mistakes. With the continuous progress of precision agricultural technology, fruit picking robots have been widely used in agriculture. In the picking systems, there are mainly two subsystems: the vision system and the manipulator system [2]. The vision system detects and localizes fruits and guides the manipulator to detach fruits from trees. Therefore, a robust and efficient vision system is the key to the success of the picking robot, but due to the complex background in orchards, there are still many challenges in this research.
For the complex background in orchards, the dense occlusion between leaves is one of the biggest interference factors in apple detection, which causes false or missed detections of apples. Therefore, to make the model learn features better, the training data should cover more comprehensive scenes. However, due to the huge number of apples and the complex background, apple labeling is a very time-consuming and labor-intensive task, which is why most datasets range from only dozens to thousands of images [3][4][5][6][7] and cover a single scene. Data for occlusion scenes are even scarcer, which is not conducive to enhancing the detection ability of the model. To overcome this deficiency, we propose a leaf illustration data augmentation method to expand the dataset.

Image Acquisition
The main sources of images are Baidu and Google, with search keywords such as Red Fuji Apple, Apple Tree, and Apple. Firstly, to ensure image quality, the width or height of each crawled image is required to be at least 500 pixels. Secondly, after manual screening, repetitive, fuzzy, and inconsistent images are removed. Finally, 267 high-quality images are obtained, of which 35 images contain only a single apple, 54 images contain multiple apples without overlapping, and 178 images contain multiple apples with overlapping.
Then, these 267 images are expanded to 2670 images using the data augmentation methods. The images are randomly divided into a training set of 1922 images to train the detection model, a validation set of 214 images to tune the model parameters, and a test set of 534 images to verify the detection performance. To better compare the performance of different models, the annotations are converted to the PASCAL VOC format. The completed dataset is shown in Table 1.
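The 1922/214/534 split described above can be sketched as a simple seeded random partition; the file names below are hypothetical stand-ins for the real dataset:

```python
import random

def split_dataset(image_paths, n_train=1922, n_val=214, n_test=534, seed=42):
    """Randomly split image paths into train/val/test subsets.

    The default sizes match the 1922/214/534 split of the
    2670-image augmented dataset described above.
    """
    assert n_train + n_val + n_test == len(image_paths)
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)
    return (paths[:n_train],
            paths[n_train:n_train + n_val],
            paths[n_train + n_val:])

# Hypothetical file names standing in for the real dataset.
images = [f"apple_{i:04d}.jpg" for i in range(2670)]
train, val, test = split_dataset(images)
print(len(train), len(val), len(test))  # 1922 214 534
```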

Common Data Augmentation
In this paper, we use 8 common data augmentation methods to expand the dataset, respectively mirror, crop, brightness, blur, dropout, rotation, scale, and translation operation. These operations are used to further simulate the complex scenes of apple detection in orchards. Figure 1b-i shows the effects of various common data augmentation after these operations.

Image Mirror
In orchards, the position and direction of the apples are various. Therefore, we use 50% probability horizontal mirroring and 50% probability vertical mirroring to process the original image. Both can be used alone or in combination.

Image Crop
When many apples are stacked together, various occlusion problems occur, and some apples are partially or heavily blocked. Therefore, we randomly cut off up to 20% of the original image edges to simulate this scene.


Image Brightness
When the illumination is strong or weak, it will lead to apple color changes, which cause huge interference for the detection. Therefore, to enhance the robustness of the model, we randomly multiply the image with a brightness factor between 0.5 and 1.5.

Image Blur
Sometimes the image captured by the picking robot may be unclear or blurred, which also interferes with apple detection. Therefore, we use Gaussian blur with a mean value of 2.0 and a standard deviation of 8.0 to augment the dataset.

Image Dropout
Apples often encounter the problem of diseases and insect pests, typically leaving traces of numerous spots. Therefore, we randomly drop out between 1% and 10% of grid points on the original image, and the dropped points are filled with black.

Image Rotation
Similar to the mirror method, rotation further increases the image viewing angles. Therefore, we randomly rotate the original image by an angle between −30° and 30° to augment the dataset, and the space vacated by the rotation is filled with black.

Image Scale
Due to the different positions of the apples in orchards, there will be apples of different sizes when capturing images. Therefore, to simulate this scene, we randomly multiply the original image with a scaling factor between 0.5 and 1.5.

Image Translation
Similar to the crop method, translation further addresses the occlusion problem of the apple. Therefore, we randomly translate the original image by up to 20% of its width and height, and the space vacated by the translation is filled with black.
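Three of the operations above (mirror, brightness, translation) can be sketched directly on a NumPy image array; the probability and parameter ranges follow the text, while the fill and clipping details are our own simplification:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_mirror(img):
    # Horizontal and vertical mirroring, each applied with 50% probability.
    if rng.random() < 0.5:
        img = img[:, ::-1]
    if rng.random() < 0.5:
        img = img[::-1, :]
    return img

def random_brightness(img, low=0.5, high=1.5):
    # Multiply by a brightness factor between 0.5 and 1.5, then clip to uint8.
    return np.clip(img.astype(float) * rng.uniform(low, high), 0, 255).astype(np.uint8)

def random_translate(img, frac=0.2):
    # Shift by up to 20% of each dimension; the vacated space stays black.
    h, w = img.shape[:2]
    dy = int(rng.uniform(-frac, frac) * h)
    dx = int(rng.uniform(-frac, frac) * w)
    out = np.zeros_like(img)
    ys, yd = (slice(0, h - dy), slice(dy, h)) if dy >= 0 else (slice(-dy, h), slice(0, h + dy))
    xs, xd = (slice(0, w - dx), slice(dx, w)) if dx >= 0 else (slice(-dx, w), slice(0, w + dx))
    out[yd, xd] = img[ys, xs]
    return out

img = rng.integers(0, 256, size=(100, 120, 3), dtype=np.uint8)
aug = random_translate(random_brightness(random_mirror(img)))
print(aug.shape, aug.dtype)  # (100, 120, 3) uint8
```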

Illustration Data Augmentation
To enrich the background and texture of training images, methods such as Mixup [20], CutMix [21], and Mosaic [18] regularly mix or superimpose several original images to form a new image. For example, Mosaic evenly splices four original images into a four-square grid, which does not increase the cost of training and inference but enhances the localization ability of the model. For apple detection in orchards, the biggest interference comes from the dense occlusion between leaves and from the complexity and randomness of the background, which makes feature extraction difficult and leads to false or missed detections of apples.
Therefore, to simulate the complexity of the scene and enrich the background of the object, we propose a leaf illustration data augmentation method, which uses some leaf illustrations to randomly insert on the original image. Firstly, collect 5 kinds of apple leaf illustrations, as shown in Figure 2. The format of the illustration is PNG, only contains the object itself, and the background is transparent, which helps protect the original image after insertion and avoid adding the invalid background. Secondly, the illustration size is 1/8 to 1/4 of the average value of all the ground-truth in the current image, and the number of insertions is 5 to 15 times. Finally, the original dataset is expanded in batches by using the illustration data augmentation method. Figure 3 shows the augmentation effects of different leaf illustrations.
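The insertion step can be sketched as alpha compositing of an RGBA leaf illustration onto the image. In the sketch below the leaf pixels and box coordinates are synthetic placeholders, while the 1/8 to 1/4 scaling and the 5 to 15 repetitions follow the description above:

```python
import numpy as np

rng = np.random.default_rng(1)

def nearest_resize(img, size):
    # Minimal nearest-neighbor resize for the RGBA leaf illustration.
    h, w = img.shape[:2]
    sh, sw = size
    return img[np.arange(sh) * h // sh][:, np.arange(sw) * w // sw]

def paste_leaves(image, leaf_rgba, boxes, n_min=5, n_max=15):
    """Randomly alpha-composite a transparent leaf illustration onto `image`.

    `boxes` are ground-truth apple boxes (x1, y1, x2, y2); each pasted leaf
    is scaled to 1/8-1/4 of their average side length and inserted 5-15 times.
    """
    out = image.copy()
    mean_side = np.mean([max(x2 - x1, y2 - y1) for x1, y1, x2, y2 in boxes])
    h, w = out.shape[:2]
    for _ in range(rng.integers(n_min, n_max + 1)):
        s = max(2, int(mean_side * rng.uniform(1 / 8, 1 / 4)))
        leaf = nearest_resize(leaf_rgba, (s, s))
        y, x = rng.integers(0, h - s), rng.integers(0, w - s)
        alpha = leaf[..., 3:4] / 255.0  # transparent pixels leave the image intact
        region = out[y:y + s, x:x + s].astype(float)
        out[y:y + s, x:x + s] = (alpha * leaf[..., :3] + (1 - alpha) * region).astype(np.uint8)
    return out

# Synthetic placeholders: a flat green "leaf" and two apple boxes.
leaf = np.zeros((32, 32, 4), dtype=np.uint8)
leaf[..., 1], leaf[..., 3] = 128, 255
image = np.full((200, 200, 3), 60, dtype=np.uint8)
augmented = paste_leaves(image, leaf, boxes=[(50, 50, 130, 130), (10, 10, 70, 70)])
print(augmented.shape)  # (200, 200, 3)
```

Because the leaf PNGs have transparent backgrounds, only the leaf pixels themselves overwrite the image, matching the "no invalid background" property described above.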

YOLOv4
YOLOv4 is a state-of-the-art real-time detection model that further improves on the YOLOv3 model. On the MS COCO dataset, without a drop in frames per second (FPS), the mean Average Precision (mAP) increases to 44%, and the overall performance is significantly improved. There are three major improvements in the network structure: (1) Using CSPNet [22] to modify Darknet53 into CSPDarknet53, which further promotes the fusion of low-level information and achieves stronger feature extraction capabilities. As shown in Figure 4b, the original residual module is divided into left and right parts: the right part maintains the original residual stack, and the left part uses a large residual edge to fuse the low-level information with the high-level information extracted from the residual blocks; (2) Using Spatial Pyramid Pooling (SPP) [23] to add 4 max-pooling operations with different kernel sizes (1 × 1, 5 × 5, 9 × 9, and 13 × 13) at the last output to further extract and fuse features, as shown in Figure 5; (3) Modifying the Feature Pyramid Network (FPN) [24] structure into the Path Aggregation Network (PANet) [25], that is, adding a bottom-up path to the top-down structure of FPN to further extract and merge feature information, as shown in Figure 6b.
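The SPP step above amounts to stride-1 max pooling with "same" padding at several kernel sizes, concatenated along the channel axis. A minimal NumPy sketch (in practice PyTorch's `nn.MaxPool2d` would be used; the −inf padding is our simplification so padded pixels never win the max):

```python
import numpy as np

def max_pool_same(x, k):
    """Stride-1 max pooling with 'same' padding on a (C, H, W) feature map."""
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)), constant_values=-np.inf)
    win = np.lib.stride_tricks.sliding_window_view(xp, (k, k), axis=(1, 2))
    return win.max(axis=(-1, -2))

def spp(x, kernels=(1, 5, 9, 13)):
    """Concatenate pooled copies along channels; k=1 is the identity branch."""
    return np.concatenate([max_pool_same(x, k) for k in kernels], axis=0)

# The last CSPDarknet53 output for a 416x416 input is a 13x13 map.
x = np.random.rand(512, 13, 13).astype(np.float32)
y = spp(x)
print(y.shape)  # (2048, 13, 13)
```

Note that the spatial size is preserved while the channel count is multiplied by the number of branches, which is why SPP can be dropped into the head without changing the surrounding layer shapes.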


EfficientNet
In recent years, the rapid development of deep learning has spawned various excellent convolutional neural networks. From the initial simple networks [26][27][28] to the current complex networks [29][30][31][32], model performance has improved in all aspects. EfficientNet combines the advantages of previous networks and summarizes the improvement of network performance into three dimensions: (1) Deepen the network, that is, use skip connections to increase the depth of the neural network and extract features through deeper layers; (2) Widen the network, that is, increase the number of convolution channels to capture more fine-grained features; (3) Increase the input image resolution, so that the network can learn and express more information, which is beneficial to accuracy. A compound coefficient φ is then used to uniformly scale and balance the depth, width, and resolution of the network, maximizing accuracy on limited resources. The compound scaling is shown in Equation (1):

d = α^φ, w = β^φ, r = γ^φ, subject to α · β² · γ² ≈ 2 and α ≥ 1, β ≥ 1, γ ≥ 1 (1)

where d, w, and r are the coefficients used to scale the depth, width, and resolution of the network, and α, β, and γ allocate resources to network depth, width, and resolution, respectively. According to the research of Tan et al. [19], the network parameters of EfficientNet-B0 are shown in Table 2. The optimal coefficients of the network are α = 1.2, β = 1.1, and γ = 1.15.
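Equation (1) can be checked numerically; the constants below are the grid-searched values reported for EfficientNet-B0 (α = 1.2, β = 1.1, γ = 1.15):

```python
# Sketch of EfficientNet compound scaling: for a compound coefficient phi,
# the depth, width, and resolution multipliers are alpha**phi, beta**phi,
# gamma**phi, subject to alpha * beta**2 * gamma**2 ~= 2.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15

def compound_scaling(phi):
    d = ALPHA ** phi   # depth multiplier
    w = BETA ** phi    # width multiplier
    r = GAMMA ** phi   # resolution multiplier
    return d, w, r

# FLOPs scale roughly with d * w^2 * r^2, i.e. about 2**phi.
d, w, r = compound_scaling(1)
print(round(d * w * w * r * r, 3))  # ~1.92, close to the target 2
```

The constraint α · β² · γ² ≈ 2 means each unit increase in φ roughly doubles the computational cost, which is what lets a single knob trade accuracy against resources.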

EfficientNet-B0 consists of 16 Blocks together with Conv2D, GlobalAveragePooling2D, and Dense layers. The design of the Blocks is mainly based on the residual structure and the attention mechanism, while the other structures are similar to those of conventional convolutional neural networks. Figure 7 shows EfficientNet-B0, which is the baseline network of EfficientNet.


EfficientNet-B0-YOLOv4
There are 8 versions of EfficientNet (B0–B7). As the version number increases, the performance of the model gradually improves, but the model size and the amount of calculation also grow. Although the original YOLOv4 model has excellent performance, its size and computational cost are large, which makes it unsuitable for low-performance devices. To further improve the accuracy and efficiency of the YOLOv4 model while limiting its size, we replace the backbone network CSPDarknet53 of YOLOv4 with EfficientNet-B0 and choose P3, P5, and P7 as the three feature layers. Since the three feature layers of CSPDarknet53 output 256, 512, and 1024 channels, respectively, while the corresponding P3, P5, and P7 layers output 40, 112, and 320 channels, Conv2D layers are added to the three outputs to match the channel sizes and further extract features. Figure 8 shows the network structure of EfficientNet-B0-YOLOv4. The loss function remains the same as in the YOLOv4 model and consists of three parts: classification loss, regression loss, and confidence loss. The classification and confidence losses are the same as in the YOLOv3 model, but Complete Intersection over Union (CIoU) [33] replaces the mean squared error (MSE) in the regression loss.
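The channel adjustment can be sketched as a 1 × 1 convolution, which is just a per-pixel matrix multiply over channels. The spatial sizes 52/26/13 assume a 416 × 416 input, and the random weights are placeholders for learned ones:

```python
import numpy as np

def conv1x1(x, weight):
    """A 1x1 convolution as a channel-wise matmul on a (C_in, H, W) map."""
    c_in, h, w = x.shape
    return (weight @ x.reshape(c_in, -1)).reshape(weight.shape[0], h, w)

rng = np.random.default_rng(0)
# Channel counts quoted above: EfficientNet-B0's P3/P5/P7 emit 40/112/320
# channels, while the YOLOv4 head expects 256/512/1024.
for c_in, c_out, size in [(40, 256, 52), (112, 512, 26), (320, 1024, 13)]:
    feat = rng.standard_normal((c_in, size, size))
    w_ = rng.standard_normal((c_out, c_in)) * 0.01
    out = conv1x1(feat, w_)
    print(out.shape)
```

Because the kernel is 1 × 1, the spatial resolution is untouched and only the channel dimension is remapped, so the YOLOv4 neck and head can be reused unchanged.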
The CIoU loss function is as follows:

L_CIoU = 1 − IoU + ρ²(b, b^gt) / c² + αυ

where ρ²(b, b^gt) represents the squared Euclidean distance between the center points of the prediction box b and the ground truth b^gt, and c represents the diagonal distance of the smallest closed area that can simultaneously contain the prediction box and the ground truth. Figure 9 shows the structure of CIoU.
The formulas of α and υ are as follows:

υ = (4 / π²) · (arctan(w^gt / h^gt) − arctan(w / h))²

α = υ / ((1 − IoU) + υ)

The total loss function of the YOLOv4 model is the sum of this CIoU regression loss, the confidence loss, and the classification loss. In these terms, S² represents the S × S grids, each grid generates B candidate boxes, and each candidate box obtains a corresponding bounding box through the network, so that S × S × B bounding boxes are formed in total. If there is no object (noobj) in a box, only the confidence loss of that box is calculated. The confidence loss function uses cross-entropy error and is divided into two parts, with object (obj) and without object (noobj), where the noobj loss is weighted by a coefficient λ to reduce the contribution of the noobj part. The classification loss function also uses cross-entropy error: when the j-th anchor box of the i-th grid is responsible for a certain ground truth, the bounding box generated by this anchor box contributes to the classification loss.
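The CIoU formulas can be sketched in plain Python for two corner-format boxes (degenerate zero-area boxes are not handled in this sketch):

```python
import math

def ciou_loss(box, gt):
    """CIoU loss for axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Intersection over union.
    ix1, iy1 = max(box[0], gt[0]), max(box[1], gt[1])
    ix2, iy2 = min(box[2], gt[2]), min(box[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    iou = inter / (area(box) + area(gt) - inter)
    # Squared center distance rho^2 and enclosing-box diagonal c^2.
    cx, cy = (box[0] + box[2]) / 2, (box[1] + box[3]) / 2
    gx, gy = (gt[0] + gt[2]) / 2, (gt[1] + gt[3]) / 2
    rho2 = (cx - gx) ** 2 + (cy - gy) ** 2
    ex1, ey1 = min(box[0], gt[0]), min(box[1], gt[1])
    ex2, ey2 = max(box[2], gt[2]), max(box[3], gt[3])
    c2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2
    # Aspect-ratio consistency term v and its trade-off weight alpha.
    w, h = box[2] - box[0], box[3] - box[1]
    gw, gh = gt[2] - gt[0], gt[3] - gt[1]
    v = (4 / math.pi ** 2) * (math.atan(gw / gh) - math.atan(w / h)) ** 2
    alpha = v / (1 - iou + v) if iou < 1 else 0.0
    return 1 - iou + rho2 / c2 + alpha * v

print(ciou_loss((0, 0, 10, 10), (0, 0, 10, 10)))  # 0.0 for a perfect match
```

Unlike MSE, the ρ²/c² and αυ terms still give a useful gradient when the boxes do not overlap at all, which is the motivation for swapping it into the regression loss.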


Simulation Setup
The experimental environment of this paper is the Ubuntu 18.04 system; the GPU is a Tesla K80 (12 GB), the CPU is an Intel XEON, and all models are written in PyTorch. General settings: the number of training epochs is 100; the learning rate for the first 50 epochs is 1 × 10−3 with a batch size of 16, and the learning rate for the next 50 epochs is 1 × 10−3 with a batch size of 8. Due to the limited GPU memory, the input image size of the YOLOv4 model is changed from 608 × 608 to 416 × 416, which is the same as the original YOLOv3 model. Table 3 shows the basic configuration of the local computer.

Evaluation Index
In the binary classification problem, according to the combination of a sample's true class and the model's predicted class, each sample falls into one of 4 types: TP, FP, TN, and FN. TP (true positive) means the actual class is positive and the prediction is also positive; FP (false positive) means the actual class is negative but the prediction is positive; TN (true negative) means the actual class is negative and the prediction is also negative; FN (false negative) means the actual class is positive but the prediction is negative. The Precision is the proportion of true positives among all samples predicted to be positive, and the Recall is the proportion of actual positive samples that are predicted to be positive. The AP value of each class is the area under the P-R curve formed by the Precision and the Recall, the mAP value is the average of the AP values over all classes, and the F1 score is the harmonic mean of the Precision and the Recall. The formulas are defined as follows:

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

AP = ∫₀¹ P(R) dR

mAP = (1 / N) Σ AP_i

F1 = 2 × Precision × Recall / (Precision + Recall)
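The Precision, Recall, and F1 definitions above can be computed directly from the counts; the counts below are made-up examples, not the paper's results:

```python
def detection_metrics(tp, fp, fn):
    """Precision, Recall, and F1 from detection counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Example with made-up counts (not the paper's results):
p, r, f1 = detection_metrics(tp=95, fp=5, fn=3)
print(round(p, 4), round(r, 4), round(f1, 4))  # 0.95 0.9694 0.9596
```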

Influence of Data Augmentation Methods
To obtain better detection results, traditional data augmentation techniques and the leaf illustration augmentation technique are adopted to expand the dataset. To evaluate the influence of each augmentation technique on the EfficientNet-B0-YOLOv4 model, a control variable approach is adopted: each data augmentation method is removed in turn, and the F1 score is measured in its absence, as shown in Table 4. The experimental results show that removing the illustration data augmentation method has the greatest impact on the performance of the model, with F1 dropping by 2.05%, indicating that the images generated by the illustration data augmentation method contribute most to enriching the diversity of the training set. Compared with common data augmentation methods, illustration data augmentation generates a new background and texture for the image, which greatly helps the robustness of detection, especially under the interference of dense leaves. Due to the complexity and diversity of leaves, the proposed illustration data augmentation method, combined with common data augmentation methods, can make up for the lack of training images, greatly reduce the labeling workload, and achieve better detection results. Table 4. F1 comparison between different data augmentation methods.


Data Augmentation Methods
To further verify the influence of the proposed illustration data augmentation method on the improved model, as shown in Figure 10, the detection results of the model trained with illustration-augmented images are compared with those of the model trained with real images only. From the detection results, it can be seen that both models can accurately detect apples when there is no occlusion between leaves. However, under dense occlusion by leaves, the model trained with real images hardly detects apples, while the model trained with illustration-augmented images detects apples well. The illustration augmentation method thus enriches the leaf occlusion scenes in the training images and provides richer features for the model to learn, which helps to improve the learning ability and the detection results of the model.

To further verify the influence of the proposed illustratio method on the improved model, as shown in Figure 10, the detecti trained by the illustration augmented image are compared with th real image. From the detection results, it can be seen that the mod tration augmented images and the model trained by the real imag detect apples under no occlusion between leaves. However, und between leaves, the model trained by real images hardly detects a trained by illustration augmented images can detect apples well a tion results. It can be seen that the illustration augmentation metho clusion scene in the training images, provides richer features for the and thus helps to improve the learning ability and detection result Table 4. F1 comparison between different data augmentation methods. To further verify the influence of the proposed method on the improved model, as shown in Figure 10, trained by the illustration augmented image are compar real image. From the detection results, it can be seen th tration augmented images and the model trained by th detect apples under no occlusion between leaves. How between leaves, the model trained by real images hardl trained by illustration augmented images can detect ap tion results. It can be seen that the illustration augmenta clusion scene in the training images, provides richer feat and thus helps to improve the learning ability and detec Table 4. F1 comparison between different data augmentation me detections are consistent with these methods, which shows the effectiveness and ity of our proposed method. further verify the influence of the proposed illustration data augmentation on the improved model, as shown in Figure 10, the detection results of the model by the illustration augmented image are compared with the model trained by the age. 
From the detection results, it can be seen that the model trained by the illusaugmented images and the model trained by the real images both can accurately apples under no occlusion between leaves. However, under the dense occlusion n leaves, the model trained by real images hardly detects apples, while the model by illustration augmented images can detect apples well and improve the detecults. It can be seen that the illustration augmentation method enriches the leaf ocscene in the training images, provides richer features for the learning of the model, s helps to improve the learning ability and detection results of the model. 1 comparison between different data augmentation methods. missed detections are consistent with these methods, which shows the effectiveness and feasibility of our proposed method.
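The leaf-occlusion augmentation discussed above can be illustrated with a minimal sketch: a leaf illustration with a transparent background is pasted at a random position, scale, and rotation over an orchard photo to simulate occlusion. This is a hedged sketch of the general technique, not the paper's actual pipeline; the file names, size ranges, and the `occlude_with_leaf` helper are illustrative assumptions.

```python
# Minimal sketch of leaf-illustration augmentation (assumed pipeline,
# not the paper's implementation): overlay an RGBA leaf cutout on a
# photo at a random position/scale/rotation to simulate occlusion.
import random
from PIL import Image

def occlude_with_leaf(photo_path: str, leaf_path: str, out_path: str) -> None:
    photo = Image.open(photo_path).convert("RGBA")
    leaf = Image.open(leaf_path).convert("RGBA")
    # Random scale and rotation vary the occlusion patterns.
    scale = random.uniform(0.5, 1.5)
    leaf = leaf.resize((int(leaf.width * scale), int(leaf.height * scale)))
    leaf = leaf.rotate(random.uniform(0, 360), expand=True)
    # Random position; the leaf's alpha channel is used as the paste mask,
    # so only the leaf shape (not its bounding box) covers the photo.
    x = random.randint(0, max(0, photo.width - leaf.width))
    y = random.randint(0, max(0, photo.height - leaf.height))
    photo.paste(leaf, (x, y), mask=leaf)
    photo.convert("RGB").save(out_path)
```

In practice such synthetic occlusions would be generated for many training images, with the apple bounding-box labels left unchanged so the model learns to detect partially hidden fruit.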

Data Augmentation Methods
To further verify the influence of the proposed illustration data augmentation method on the improved model, as shown in Figure 10, the detection results of the model trained by the illustration augmented image are compared with the model trained by the real image. From the detection results, it can be seen that the model trained by the illustration augmented images and the model trained by the real images both can accurately detect apples under no occlusion between leaves. However, under the dense occlusion between leaves, the model trained by real images hardly detects apples, while the model trained by illustration augmented images can detect apples well and improve the detection results. It can be seen that the illustration augmentation method enriches the leaf occlusion scene in the training images, provides richer features for the learning of the model, and thus helps to improve the learning ability and detection results of the model.  detections are consistent with these methods, which shows the effectiveness and ity of our proposed method. further verify the influence of the proposed illustration data augmentation on the improved model, as shown in Figure 10, the detection results of the model by the illustration augmented image are compared with the model trained by the age. From the detection results, it can be seen that the model trained by the illusaugmented images and the model trained by the real images both can accurately apples under no occlusion between leaves. However, under the dense occlusion n leaves, the model trained by real images hardly detects apples, while the model by illustration augmented images can detect apples well and improve the detecults. It can be seen that the illustration augmentation method enriches the leaf ocscene in the training images, provides richer features for the learning of the model, s helps to improve the learning ability and detection results of the model. 1 comparison between different data augmentation methods. 
missed detections are consistent with these methods, which shows the effectiveness and feasibility of our proposed method.

Data Augmentation Methods
To further verify the influence of the proposed illustration data augmentation method on the improved model, as shown in Figure 10, the detection results of the model trained by the illustration augmented image are compared with the model trained by the real image. From the detection results, it can be seen that the model trained by the illustration augmented images and the model trained by the real images both can accurately detect apples under no occlusion between leaves. However, under the dense occlusion between leaves, the model trained by real images hardly detects apples, while the model trained by illustration augmented images can detect apples well and improve the detection results. It can be seen that the illustration augmentation method enriches the leaf occlusion scene in the training images, provides richer features for the learning of the model, and thus helps to improve the learning ability and detection results of the model.  missed detections are consistent with these methods, which shows the effectiveness and feasibility of our proposed method.

Data Augmentation Methods
To further verify the influence of the proposed illustration data augmentation method on the improved model, as shown in Figure 10, the detection results of the model trained by the illustration augmented image are compared with the model trained by the real image. From the detection results, it can be seen that the model trained by the illustration augmented images and the model trained by the real images both can accurately detect apples under no occlusion between leaves. However, under the dense occlusion between leaves, the model trained by real images hardly detects apples, while the model trained by illustration augmented images can detect apples well and improve the detection results. It can be seen that the illustration augmentation method enriches the leaf occlusion scene in the training images, provides richer features for the learning of the model, and thus helps to improve the learning ability and detection results of the model.  detections are consistent with these methods, which shows the effectiveness and ity of our proposed method. further verify the influence of the proposed illustration data augmentation on the improved model, as shown in Figure 10, the detection results of the model by the illustration augmented image are compared with the model trained by the age. From the detection results, it can be seen that the model trained by the illusaugmented images and the model trained by the real images both can accurately apples under no occlusion between leaves. However, under the dense occlusion n leaves, the model trained by real images hardly detects apples, while the model by illustration augmented images can detect apples well and improve the detecults. It can be seen that the illustration augmentation method enriches the leaf ocscene in the training images, provides richer features for the learning of the model, s helps to improve the learning ability and detection results of the model. 1 comparison between different data augmentation methods. 
missed detections are consistent with these methods, which shows the effectiveness and feasibility of our proposed method.

Data Augmentation Methods
To further verify the influence of the proposed illustration data augmentation method on the improved model, as shown in Figure 10, the detection results of the model trained by the illustration augmented image are compared with the model trained by the real image. From the detection results, it can be seen that the model trained by the illustration augmented images and the model trained by the real images both can accurately detect apples under no occlusion between leaves. However, under the dense occlusion between leaves, the model trained by real images hardly detects apples, while the model trained by illustration augmented images can detect apples well and improve the detection results. It can be seen that the illustration augmentation method enriches the leaf occlusion scene in the training images, provides richer features for the learning of the model, and thus helps to improve the learning ability and detection results of the model.  detections are consistent with these methods, which shows the effectiveness and ity of our proposed method. further verify the influence of the proposed illustration data augmentation on the improved model, as shown in Figure 10, the detection results of the model by the illustration augmented image are compared with the model trained by the age. From the detection results, it can be seen that the model trained by the illusaugmented images and the model trained by the real images both can accurately apples under no occlusion between leaves. However, under the dense occlusion n leaves, the model trained by real images hardly detects apples, while the model by illustration augmented images can detect apples well and improve the detecults. It can be seen that the illustration augmentation method enriches the leaf ocscene in the training images, provides richer features for the learning of the model, s helps to improve the learning ability and detection results of the model. 1 comparison between different data augmentation methods. 
To better assess the performance of the improved model, we count the detection results on the original images and the augmented images. The test results are shown in Table 5. It can be seen that the EfficientNet-B0-YOLOv4 model in this paper achieves desirable detection results on the apple images expanded by the data augmentation methods. Compared with the original images, the methods of mirror, crop, rotation, scale, and translation mainly change the image position or angle and hardly add new texture information, so their detection results are similar to those of the original images. The methods of brightness, blur, dropout, and illustration bring new texture information to the images. Although they cause more false detections, they keep the number of missed detections similar to that of the original images and produce more object detections, which shows that a richer background enhances the learning ability of the model. Compared with the traditional augmentation techniques, the proposed illustration augmentation technique leads to more false detections, but its detection quantity and missed detections are consistent with those methods, which shows the effectiveness and feasibility of our proposed method.

To further verify the influence of the proposed illustration data augmentation method on the improved model, the detection results of the model trained with the illustration-augmented images are compared with those of the model trained with the real images, as shown in Figure 10. Both models can accurately detect apples when there is no occlusion by leaves. However, under dense leaf occlusion, the model trained with real images hardly detects apples, while the model trained with illustration-augmented images detects them well. This indicates that the illustration augmentation method enriches the leaf-occlusion scenes in the training images and provides richer features for the model to learn, thus improving its learning ability and detection results.
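The illustration augmentation described above overlays leaf artwork on training images to simulate occlusion. The paper's exact compositing procedure is not reproduced here; the following is a minimal sketch, assuming an RGBA leaf illustration that is alpha-blended onto the image at a random position:

```python
import numpy as np

def overlay_leaf(image, leaf_rgba, rng=None):
    """Alpha-blend an RGBA leaf illustration onto an RGB image at a
    random position (a sketch of the illustration augmentation idea)."""
    if rng is None:
        rng = np.random.default_rng()
    H, W, _ = image.shape
    h, w, _ = leaf_rgba.shape
    # Random top-left corner such that the leaf fits inside the image.
    y = int(rng.integers(0, H - h + 1))
    x = int(rng.integers(0, W - w + 1))
    out = image.astype(np.float32).copy()
    alpha = leaf_rgba[:, :, 3:4].astype(np.float32) / 255.0
    region = out[y:y + h, x:x + w]
    out[y:y + h, x:x + w] = alpha * leaf_rgba[:, :, :3] + (1 - alpha) * region
    return out.astype(np.uint8)
```

In practice the leaf would typically also be randomly scaled and rotated, and the bounding-box labels kept unchanged, since the occluded fruit is still present behind the leaf.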


Comparison of Different Models
To verify the superiority of the proposed EfficientNet-B0-YOLOv4 model, we compare it with YOLOv3, YOLOv4, and Faster R-CNN with ResNet, which are state-of-the-art apple detection models. Table 6 compares the F1, mAP, Precision, and Recall of the different models. Table 7 compares their average detection time per frame, weight size, parameter amount, and calculation amount (FLOPs). Table 8 shows the detection results of the different models on the test set. Figure 11 compares the P-R curves of the different models. Figure 12 shows their detection results, where a green ring marks a missed detection and a blue ring marks a false detection.
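For reference, the Precision, Recall, and F1 compared in Table 6 follow the standard definitions over true positives (TP), false positives (FP, false detections), and false negatives (FN, missed detections); a minimal sketch:

```python
def detection_metrics(tp, fp, fn):
    """Precision, Recall, and F1 score from detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Example: 90 correct detections, 10 false, 5 missed.
p, r, f = detection_metrics(90, 10, 5)
# → precision 0.900, recall ≈ 0.947, F1 ≈ 0.923
```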
Figure 11. P-R curves of different detection models.

Generally, to make robots pick more real apples in orchards, more attention should be paid to improving the Recall. It can be seen from Tables 6 and 7 that the Faster R-CNN with ResNet model has a better Recall (93.76%), but its other performance indicators and detection results are the worst. Although its weight size (108 MB) and parameter amount (2.83 × 10⁷) are lower than those of the YOLO models, its complex two-stage pipeline makes its calculation amount (1.79 × 10¹¹) and average detection time per frame (6.167 s) greatly exceed those of the YOLO models.
The YOLOv3 and YOLOv4 models still maintain good real-time detection, and their weight-related indicators are close, but the detection performance of the YOLOv4 model is better than that of the YOLOv3 model: its F1 is 4.60% higher, mAP is 2.87% higher, Precision is 2.41% higher, and, notably, Recall is 6.54% higher. The EfficientNet-B0-YOLOv4 model proposed in this paper is slightly better than the YOLOv4 model in detection performance: its F1 is 0.18% higher, mAP is 1.30% higher, Precision is 2.70% lower, and Recall is 2.86% higher. In terms of weight indicators, however, the improved model is much better than the YOLOv4 model: the average detection time per frame is reduced by 0.072 s, the weight size by 86 MB, the parameter amount by 2.62 × 10⁷, and the calculation amount by 1.72 × 10¹⁰ FLOPs.
It can be seen from Figure 11 that the area under the P-R curve of the proposed EfficientNet-B0-YOLOv4 model is larger, which shows that it has better performance. Table 8 and Figure 12 show that every model can accurately detect large objects, but for small objects, and especially under occlusion, the Faster R-CNN with ResNet model produces more missed and false detections, which leads to its low Precision (52.07%). The YOLO models produce fewer missed and false detections; the detection results of the YOLOv3 model are relatively poor, while those of the YOLOv4 model and the proposed EfficientNet-B0-YOLOv4 model are nearly the same.
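Whether a prediction counts as a false detection or a ground-truth apple as a missed one depends on IoU matching between predicted and ground-truth boxes. The greedy rule below is an illustrative sketch (the 0.5 threshold and (x1, y1, x2, y2) box format are assumptions, not the paper's exact protocol):

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def match_detections(preds, gts, thr=0.5):
    """Greedily match predictions to ground truths; returns
    (true positives, false detections, missed detections)."""
    unmatched = list(range(len(gts)))
    tp = fp = 0
    for p in preds:
        best, best_iou = None, thr
        for i in unmatched:
            v = iou(p, gts[i])
            if v >= best_iou:
                best, best_iou = i, v
        if best is None:
            fp += 1          # no ground truth overlaps enough → false detection
        else:
            tp += 1
            unmatched.remove(best)
    return tp, fp, len(unmatched)  # leftovers are missed detections
```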
Based on the above analysis, the overall results of the proposed EfficientNet-B0-YOLOv4 model are better than those of the current popular apple detection models: it achieves high-recall, real-time detection while reducing the weight size and computational complexity. The experimental results show that the proposed method is well suited to the vision system of a picking robot.
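As a quick cross-check, the absolute reductions above (86 MB, 2.62 × 10⁷ parameters, 1.72 × 10¹⁰ FLOPs) and the relative reductions reported in the conclusions (35.25%, 40.94%, 57.53%) imply mutually consistent YOLOv4 baselines; the baselines computed below are inferred here, not quoted from the paper:

```python
# Reported absolute and relative reductions of EfficientNet-B0-YOLOv4
# versus the original YOLOv4 model (values taken from the text).
reductions = {
    "weight (MB)": (86, 35.25),
    "parameters":  (2.62e7, 40.94),
    "FLOPs":       (1.72e10, 57.53),
}

for name, (absolute, percent) in reductions.items():
    # Implied YOLOv4 baseline = absolute reduction / relative reduction.
    baseline = absolute / (percent / 100)
    print(f"{name}: implied YOLOv4 baseline ≈ {baseline:.3g}")
```

The implied baselines (≈244 MB, ≈6.4 × 10⁷ parameters, ≈3.0 × 10¹⁰ FLOPs) all point to a single consistent YOLOv4 model, supporting the reported figures.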

Conclusions
To simulate the complex scenes that apple detection may face in orchards and to improve the apple dataset, an illustration data augmentation method is proposed, and eight common data augmentation methods are utilized to expand the dataset. On the expanded 2670 samples, the illustration data augmentation method yields the largest F1 improvement. Given the large size and computational complexity of the YOLOv4 model, EfficientNet is utilized to replace its backbone network CSPDarknet53. The improved EfficientNet-B0-YOLOv4 model achieves an F1 of 96.54%, a mAP of 98.15%, a Recall of 97.43%, and an average detection time per frame of 0.338 s, which are better than the current popular YOLOv3, YOLOv4, and Faster R-CNN with ResNet models. Compared with the original YOLOv4 model, the weight size is reduced by 35.25%, the parameter amount by 40.94%, and the calculation amount by 57.53%.

In future work, we hope to add more apple classes for detection and to grade each class after picking, for example into three levels (good, medium, and bad), thus forming a complete apple detection system. Furthermore, we will continue to refine the illustration data augmentation method to improve the dataset.

Data Availability Statement:
The raw data required to reproduce these findings cannot be shared at this time as the data also forms part of an ongoing study.