Augmentation Method for High Intra-Class Variation Data in Apple Detection

Deep learning is widely used in modern orchard production for various inspection missions, helping to improve the efficiency of orchard operations. For visual detection during fruit picking, most current lightweight detection models are not yet effective at detecting multi-type occlusion targets, which severely affects automated fruit-picking efficiency. This study addresses the problem by proposing a pioneering multi-type occlusion apple dataset and a data-balance augmentation method. We divided apple occlusion into eight types and used the proposed method to balance the number of annotation boxes across the multi-type occlusion apple targets. Finally, a validation experiment was carried out using five popular lightweight object detection models: yolox-s, yolov5-s, yolov4-s, yolov3-tiny, and efficientdet-d0. The results show that, with the proposed augmentation method, the average detection precision of the five lightweight models improved significantly. Specifically, for yolox-s, precision increased from 0.894 to 0.974, recall increased from 0.845 to 0.972, and mAP0.5 increased from 0.919 to 0.982. This implies that the proposed augmentation method shows great potential for other fruit detection missions in future orchard applications.


Introduction
Deep learning has evolved rapidly in recent years. A deep learning algorithm requires a suitable dataset or data augmentation method to demonstrate its full performance. Finding pre-processing methods that enhance data quality is therefore a critical research step, since increasing the quantity, diversity, and quality of a dataset is more effective than increasing the complexity and depth of a model [1]; it not only improves the generalization capability of the algorithm but also makes the research more convincing.
Research related to data quality improvement is essential for fruit-picking detection in agriculture. The significant demand for mechanized automatic fruit harvesting in the agricultural sector offers major opportunities for developing agricultural picking robots. Automatic picking robots have received a great deal of attention from researchers in recent decades, and a variety of robots have been developed domestically and internationally to harvest fruits and vegetables, such as apple-picking robots [2]. In addition, vision-based control technology for picking robots has developed rapidly in recent years. Although there has been a great deal of research on vision-control techniques for robotic picking, the low success rate of fruit recognition and inefficient hand-eye coordination still limit picking robot performance [3]. Occlusion is considered one of the main challenges for robotic vision picking technology, since it seriously affects the recognition and localization accuracy of picking robots [4]. The main contributions of this study are as follows:

1. We created the MTOA, the first dataset considering multi-type occlusion of apple fruits, and made it available for free under the MIT license.

2. We proposed a balance augmentation method that equalizes the number of annotation boxes of each apple occlusion class in different regions under different illuminations, solving the problem of severe differences in sample numbers between classes.

3. We validated the effectiveness of the proposed algorithm using five popular lightweight object detection models.

Making the MTOA Dataset
The raw images of the apple orchards were obtained by self-collection and web-collection, and an example of the collected images for each region is shown in Figure 1. The raw images include images of Yanfu-3 apples from Zhaoyuan, Shandong, China (SD_ZY_IMG), Yanfu-8 apples from Qixia, Shandong, China (SD_QX_IMG), and Jonagold apples from Prosser, Washington, USA (WT_PSR_IMG) [25]. The specific information about the collected images is listed in Table 1.
There is variability in the way the orchards are grown and the data were collected in the three regions. The Zhaoyuan orchard has a modern spindle planting structure, with apple trees spaced in rows approximately 3.5 m apart, 1.5 m between plants, and a tree height of 3.5 m. Multi-angle photography was mainly performed with handheld cameras at a distance of about 0.5-1.0 m. The images were taken in the morning, at midday, and in the evening, with clear weather during the day and artificial lighting at night. The Qixia apple orchard, on the other hand, is a traditional orchard with an open-center canopy, with rows about 4 m apart, plants approximately 5 m apart, and a tree height of about 3 m. It was mainly photographed with handheld cameras at a distance of 0.3-0.8 m, at midday in clear weather. Meanwhile, the Prosser apple orchard in Washington State has a tree-wall structure. The data on Jonagold apples were collected by mounting the camera, approximately 1.7 m above the ground, on a mobile platform and keeping it about 0.5 m from the tree wall, at midday in clear weather.
Since there are no publicly available multi-type occlusion apple datasets, we manually annotated all images. The annotation classes consisted of eight occlusion types: no occlusion (N), leaf occlusion (L), fruit occlusion (F), branch occlusion (B), leaf and fruit occlusion (LF), branch and leaf occlusion (BL), branch and fruit occlusion (BF), and branch, leaf, and fruit occlusion (BLF). Each annotation class is shown in Figure 2. The MTOA dataset was constructed after all annotations were completed. We counted all annotation boxes: the class with the largest number of annotation boxes was N (22,986), accounting for 28.2% of the total, and the class with the smallest number was BLF (1374), accounting for 1.7% of the total, a difference of approximately 16 times. For BF, F, and LF, the proportion of each occlusion class did not exceed 5% of the total number of annotated boxes, which shows the significant variability between occlusion apple classes in the MTOA dataset.
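As a quick sanity check, these reported counts can be reproduced with a few lines. This is only a sketch: the N and BLF counts and the 28.2% share are taken from the text above; everything else is derived arithmetic.

```python
# Quantify the class imbalance reported for the MTOA dataset.
# Only the N and BLF box counts and N's 28.2% share come from the paper.
counts = {"N": 22986, "BLF": 1374}

total_boxes = round(counts["N"] / 0.282)  # N is stated to be 28.2% of all boxes
ratio = counts["N"] / counts["BLF"]       # largest class vs. smallest class

print(f"approx. total boxes: {total_boxes}")   # approx. total boxes: 81511
print(f"N : BLF imbalance ratio: {ratio:.1f}")  # N : BLF imbalance ratio: 16.7
```

The implied total of roughly 81.5k boxes agrees with the 81,631 original annotation boxes reported in the balancing results, and the ratio matches the "approximately 16 times" difference stated above.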

Algorithm Validation Process
The validation flow of the data balance augmentation algorithm proposed in this study is shown in Figure 3. The diagram contains the following steps.
1. Splitting the MTOA dataset into a basic test dataset and a basic training dataset at a ratio of 3:7, and then training the five lightweight models with the basic training dataset to form five corresponding basic models.

2. Balancing the basic training dataset using the proposed method to form the balanced MTOA dataset.

3. Training the five lightweight models with the balanced MTOA dataset to form five corresponding balanced models.

4. Evaluating the five basic and five balanced models on the basic test dataset and analyzing the reasons for changes in model performance.
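The paper does not describe the splitting mechanics of step 1; a minimal random 7:3 split over hypothetical image ids might look like:

```python
import random

def split_dataset(image_ids, train_ratio=0.7, seed=42):
    """Shuffle and split image ids into training and test subsets (7:3).

    The seed and id scheme are illustrative, not from the paper.
    """
    rng = random.Random(seed)
    ids = list(image_ids)
    rng.shuffle(ids)
    cut = int(len(ids) * train_ratio)
    return ids[:cut], ids[cut:]

train_ids, test_ids = split_dataset(range(1000))
print(len(train_ids), len(test_ids))  # 700 300
```

A fixed seed keeps the basic and balanced experiments comparable, since both start from the same basic training dataset.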


MTOA Dataset Statistical Analysis by Illumination
The lighting methods in the apple orchards were natural lighting during the day and artificial lighting at night. Different lighting methods produce different light illuminations (i.e., RGB vectors of different lighting colors) [37]. As illumination changes, the same scene shows different color representations, for example, backlit and artificially lit images in which apple targets tend to be too bright or have severe color distortion. In this study, to effectively classify images under different illuminations [38], a MobileNetV3-based classification model was used to classify raw images from each region into high and low illuminations, with low-illumination images mainly containing images under backlight, evening, or nighttime artificial lighting conditions, and high-illumination images containing images under sufficient daytime light conditions. The classification of raw images by illumination is shown in Figure S1 in the Supplementary Material.
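The actual classifier in this study is a trained MobileNetV3 model. Purely for intuition, a hypothetical stand-in based on mean luma (the 0.35 cutoff is invented, not from the paper) can separate obviously bright and dark images:

```python
import numpy as np

def illumination_label(image, threshold=0.35):
    """Crude high/low illumination proxy for an RGB image in [0, 1].

    The paper trains a MobileNetV3 binary classifier for this task; this
    fixed-threshold heuristic is only an illustrative approximation.
    """
    r, g, b = image[..., 0], image[..., 1], image[..., 2]
    luma = 0.299 * r + 0.587 * g + 0.114 * b  # ITU-R BT.601 weights
    return "high" if float(luma.mean()) > threshold else "low"

bright = np.full((4, 4, 3), 0.9)
dark = np.full((4, 4, 3), 0.1)
print(illumination_label(bright), illumination_label(dark))  # high low
```

A learned classifier is preferred in the paper because backlit images can have high mean brightness yet dark, color-distorted apple targets, which a global threshold cannot capture.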
The MTOA dataset was analyzed and counted by illumination to clarify the number of each type of annotation box in each region under high and low illuminations. First, the raw images in the MTOA dataset were classified using the MobileNetV3 illumination binary classification model to form six sub-datasets: the Zhaoyuan high-illumination dataset (ZY_H), Zhaoyuan low-illumination dataset (ZY_L), Qixia high-illumination dataset (QX_H), Qixia low-illumination dataset (QX_L), Prosser high-illumination dataset (PSR_H), and Prosser low-illumination dataset (PSR_L). The six sub-datasets were counted by illumination for each occlusion type, and the results are shown in Table 2.
The statistics showed that QX_L had a low number of annotation boxes and PSR_L had none, because most of the raw images in QX_L and all of the raw images in PSR_L were collected in the morning under even lighting. Owing to the small number of annotation boxes in QX_L and PSR_L, data balance augmentation of these two sub-datasets was skipped. In the other four sub-datasets, the number of annotation boxes varied greatly between apple occlusion classes: the largest class was B in ZY_H, the smallest was BF in PSR_H, and the difference between them was approximately 195 times. In summary, the number of annotation boxes in the sub-datasets by illumination varied greatly between regions, making it difficult for the training data from different regions and illuminations to play an equal role in calculating the training loss of the model.

Rules for Building the Component Pool
The component pool is a collection of elements required for synthesizing each occlusion apple. It consists of five elements in high and low illuminations: base image elements, fruit elements, branch elements, leaf elements, and composite elements.

a. Rules for making base image elements
In this study, images with no fruit under high illumination and matching the orchard background were selected as high-illumination base images, and images with no fruit under low illumination and matching the orchard background were selected as low-illumination base images. However, since fewer such images were available, some low-illumination images were blurred and added as supplementary base images. These base images were randomly scaled to 640 × 480 or 1280 × 720 to maintain consistency with the image sizes in the basic training dataset. Finally, 1000 high-illumination and 1000 low-illumination base images were selected. Figure 4 shows various base images under high and low illuminations.

b. Rules for the selection of occlusion elements
In this study, fruit, branch, leaf, and composite occlusion elements were segmented from images in the basic training dataset. The fruit occlusion elements were mainly single intact fruits; the branch elements were divided into single and multiple branches; the leaf elements were divided into single and multiple leaves; and the composite elements were mainly combinations of branches, leaves, and fruits. Five hundred elements of each of the five classes were segmented to ensure the diversity of the selected results. The fruit, branch, leaf, and composite occlusion elements under high illumination were then grouped into one category, and those under low illumination into another. Figure 5 shows the occlusion elements under high and low illuminations.
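The component pool can be pictured as a mapping from (illumination, element class) to element images. The layout and file names below are hypothetical, not the paper's implementation:

```python
import random

# Hypothetical in-memory layout of the component pool described above:
# five element classes, each kept separately for high and low illumination.
ELEMENT_CLASSES = ("base", "fruit", "branch", "leaf", "composite")

def build_pool(elements):
    """Group (illumination, element_class, payload) triples into a pool."""
    pool = {(ill, cls): [] for ill in ("high", "low") for cls in ELEMENT_CLASSES}
    for ill, cls, payload in elements:
        pool[(ill, cls)].append(payload)
    return pool

def draw(pool, illumination, element_class, rng=random):
    """Randomly draw one element of the requested class and illumination."""
    return rng.choice(pool[(illumination, element_class)])

pool = build_pool([("high", "leaf", "leaf_001.png"),
                   ("high", "leaf", "leaf_002.png"),
                   ("low", "branch", "branch_104.png")])
print(draw(pool, "low", "branch"))  # branch_104.png
```

Keying the pool by illumination ensures that a low-illumination base image only ever receives low-illumination occlusion elements, as the selection rules above require.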

Data Synthesis Methods for Each Apple Occlusion Class
The main idea in synthesizing each occlusion apple class was to paste the corresponding occlusion elements into the N area of the raw image, depending on the number of occlusion apples to be synthesized and the illumination requirements. This approach was chosen because observation of the raw images showed that each occlusion class can be composed from N and occlusion elements: all occlusion elements enter from the boundary region of N and extend randomly to arbitrary locations. This study used this prior knowledge to complete the synthesis of each occlusion apple class. There was also variability in the synthesis of each occlusion class. B and L could both be formed by randomly pasting branch and leaf elements onto N according to the edge entry rule. F, however, could not be synthesized this way, because if N were shaded by more than 50%, the fruit occlusion element would effectively become an N and the shaded fruit would become an F (i.e., both occlusion classes would appear simultaneously). BL, BF, LF, and BLF are composite occlusion classes and could be formed either by synthesizing composite occlusion elements or by attaching existing composite occlusion elements to N.

a. N, B, and L class synthesis methods
N synthesis required the cyclic extraction of the required number of N from the basic training dataset for the corresponding region and illumination. B and L were formed by randomly attaching branch and leaf occlusion elements to the surface of N according to the edge entry rule. Figure 6 shows the synthesis of B and L.

1. First, the numbers of B and L to be synthesized were calculated. Then, N-class apples were selected from the images of the basic training dataset for the corresponding region and illumination. Each selected N was divided into six equal parts along its width and height to form a 6 × 6 grid. Since all occlusion elements enter from the edges of N, the boundary of the grid was set as the edge entry area, within which the starting points of branch and leaf elements had to be selected.

2. The branch and leaf elements were randomly extracted from the component pool and cropped at a scale of 0.5-1.0 to form a new occlusion element.

3. The edge entry area contained 24 location points, from which the starting point of the occlusion element was randomly selected. The endpoint could not lie in the same row or column as the starting point, because the area between such a pair of points degenerates to a line and cannot provide a rectangular area for the occlusion element. To keep the occlusion elements prominent, the distance between the starting point and the endpoint was required to be greater than three grid lengths, so any point less than three grid lengths from the starting point was excluded from the endpoint selection area. The endpoint of the occlusion element was then randomly selected from the remaining selection area.

4. After determining the random starting and end points of the occlusion element, the dimensions of the new occlusion element were transformed by scaling or cropping to fit the rectangular area between the two points. The transformed occlusion element was then pasted into that area to form B or L, and the synthetic B or L image was saved.
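The geometric rules in the steps above can be sketched as follows. Two assumptions are made that the paper leaves open: the 24 boundary points are modeled as the lattice points of the 6 × 6 grid, and Chebyshev distance is used for the three-grid-length rule.

```python
GRID = 6  # the N region is divided into a 6 x 6 grid (a 7 x 7 point lattice)

def edge_points():
    """The 24 lattice points on the boundary of the 6 x 6 grid."""
    return [(x, y) for x in range(GRID + 1) for y in range(GRID + 1)
            if x in (0, GRID) or y in (0, GRID)]

def valid_endpoints(start, min_dist=3):
    """Endpoints allowed for an occlusion element starting at `start`.

    Per the rules above: the endpoint may not share a row or column with
    the start and must be more than `min_dist` grid lengths away
    (Chebyshev distance assumed; the paper does not name a metric).
    """
    sx, sy = start
    candidates = [(x, y) for x in range(GRID + 1) for y in range(GRID + 1)]
    return [(x, y) for x, y in candidates
            if x != sx and y != sy and max(abs(x - sx), abs(y - sy)) > min_dist]

print(len(edge_points()))              # 24
print(len(valid_endpoints((0, 0))))    # 27
```

The 24-point count matches the edge entry area described in step 3; excluding the start's row, column, and 3-grid neighborhood leaves a comfortably large endpoint region for any boundary start.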

b. Class F synthesis method
When synthesizing class F, the fruit occlusion element could not simply be attached to the surface of N, because if the N area were heavily covered, the fruit occlusion element itself could be identified as N during model training. The synthesis process of F, shown in Figure 7, was therefore accomplished by limiting the common area of the fruit occlusion element and N.

1. First, N-class apples were selected from the images of the basic training dataset for the corresponding region and illumination, according to the number of F to be synthesized. A fruit occlusion element matching the illumination requirement was then obtained from the component pool and resized to the same size as the selected N. Subsequently, N was divided into a 6 × 6 grid, and a 14 × 14 grid area was created with the center point of N as the origin. The 14 × 14 grid area was divided into four quadrants, with the upper left area as the first quadrant.

2. The position of the upper left corner of the fruit occlusion element was randomly selected within the first quadrant. To prevent the fruit occlusion element from completely obscuring N, the centroid of the element and the origin were not allowed to coincide. This scheme limits the overlap between the fruit occlusion element and N to no more than 34% of N; beyond that, the fruit occlusion element would itself become an N, leading to confusing annotations.

3. After determining the starting position of the upper left corner of the fruit occlusion element, the element was pasted onto the 14 × 14 grid with its upper left corner at that position. Finally, the N area within the 14 × 14 grid was cropped out, and the cropped result was the synthesized F image.
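A small helper makes the overlap rule in step 2 concrete. The box coordinates below are an assumed placement of the 6 × 6 N region inside the 14 × 14 grid, not values from the paper:

```python
def overlap_fraction(n_box, fruit_box):
    """Fraction of the N (unoccluded apple) box covered by the fruit element.

    Boxes are (x1, y1, x2, y2) in grid units; an illustrative helper for
    checking the <= 34% overlap rule described above.
    """
    ax1, ay1, ax2, ay2 = n_box
    bx1, by1, bx2, by2 = fruit_box
    iw = max(0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0, min(ay2, by2) - max(ay1, by1))
    n_area = (ax2 - ax1) * (ay2 - ay1)
    return iw * ih / n_area

# Assume the 6 x 6 N region sits centered in the 14 x 14 grid.
n_box = (4, 4, 10, 10)
corner_fruit = (0, 0, 6, 6)        # fruit element placed in the first quadrant
print(round(overlap_fraction(n_box, corner_fruit), 3))  # 0.111 (allowed)
print(overlap_fraction(n_box, n_box))                   # 1.0 (disallowed)
```

Any candidate placement whose fraction exceeds 0.34 would be rejected under the rule above, since the fruit element would then read as an N during annotation.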

c. Fused occlusion-type compositing methods
The fused occlusion apples mainly consisted of BL, BF, LF, and BLF. All four could be synthesized on the basis of B, L, and F by attaching a second or third occlusion element to form the final fused occlusion apple class. The results for BL, BF, LF, and BLF are shown in Figure 8, and the specific synthesis process is described in Method 1 of the Supplementary Material.


Making the Augmented MTOA Dataset
After synthesizing all occlusion apple images, the next step was to attach these images to base images and automatically label them to form the augmented MTOA dataset, as shown in Algorithm 1. Finally, the augmented MTOA dataset and the basic training dataset were combined to form the balanced MTOA dataset.

The core paste-and-label loop of Algorithm 1 can be written as:

    if b_img.remaining_space > shelter_img.shape then
        start_pos = random(x, y) in b_img.remaining_space
        end_pos = (x + shelter_img.shape.x, y + shelter_img.shape.y)
        copy shelter_img to b_img.remaining_space at start_pos
        update b_img.remaining_space
        labels += (cls, start_pos, end_pos)
    else
        save new bg_img
        DB(area, lgt) = (bg_img, labels)
        clear labels

where id ∈ {Zhaoyuan, Qixia, Prosser}, l ∈ {low, high}, and i ∈ {N, L, F, B, LF, BL, BF, BLF}; j is the number of elements of class i under illumination l, and k is the number of images under illumination l.
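A drastically simplified Python rendering of this loop, treating the base image's remaining space as a one-dimensional budget and omitting the per-region and per-illumination bookkeeping of Algorithm 1, might look like:

```python
def paste_and_label(bg_size, elements):
    """Toy paste-and-label loop: fill base images and record labels.

    `bg_size` stands in for a base image's spatial budget and each element
    is a (cls, size) pair; real 2-D placement, random positions, and the
    DB(area, lgt) bookkeeping of the paper are intentionally omitted.
    """
    labels, images, cursor = [], [], 0
    for cls, size in elements:
        if bg_size - cursor >= size:          # enough remaining space: paste
            labels.append((cls, cursor, cursor + size))
            cursor += size                    # shrink the remaining space
        else:                                 # base image full: save and restart
            images.append(labels)
            labels, cursor = [(cls, 0, size)], size
    if labels:
        images.append(labels)                 # flush the last partial image
    return images

imgs = paste_and_label(10, [("N", 4), ("L", 4), ("BF", 4)])
print(len(imgs))  # 2
```

The key point mirrored from Algorithm 1 is that labels are produced automatically as a by-product of pasting, so the augmented images never need manual annotation.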

Model Selection and Training
In this study, from the perspective of the practical use of agricultural robots with embedded computing, the primary consideration was the ability of the algorithm to detect fruits quickly and accurately in real time [39]. Therefore, five lightweight models that can be deployed on embedded AI terminals were used to verify the effectiveness of the algorithm: yolox-s, yolov5-s, yolov4-s, yolov3-tiny, and efficientdet-d0, with frame rates of 73, 73, 164, 556, and 31 FPS, respectively, on the experimental host. The specific information for each model is listed in Table 3. Training was stopped when the accuracy of each model converged, and the optimal model was saved at the end of training. Model training time was measured in hours, and two training times are reported for each model: the first is the training time on the MTOA dataset, and the second is the training time on the balanced MTOA dataset.

Performance Metrics
In this study, four metrics were used to evaluate the performance of the trained models: precision (P), recall (R), average precision (AP), and mean average precision (mAP), calculated as

P = TP / (TP + FP)
R = TP / (TP + FN)
AP = ∫ P(R) dR, integrated over R from 0 to 1
mAP = (1/8) Σ AP_i, averaged over the eight occlusion classes

where P is the proportion of correct prediction boxes among all prediction boxes; R is the proportion of correct prediction boxes among all annotation boxes; AP is the average precision for each occlusion apple class, measuring how well the trained model does on that class; mAP is the mean AP over the eight occlusion apple classes, measuring how well the trained model does on all classes; TP is the number of prediction boxes correctly matched to annotation boxes; FP is the number of incorrectly predicted boxes; and FN is the number of missed annotation boxes.
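These definitions correspond to the standard detection formulas. The sketch below implements P, R, and mAP directly; AP itself requires integrating the full precision-recall curve, so it is represented here only by made-up per-class values:

```python
def precision(tp, fp):
    """P = TP / (TP + FP)."""
    return tp / (tp + fp)

def recall(tp, fn):
    """R = TP / (TP + FN)."""
    return tp / (tp + fn)

def mean_average_precision(ap_per_class):
    """mAP: mean of the per-class AP values (eight occlusion classes here)."""
    return sum(ap_per_class.values()) / len(ap_per_class)

# Hypothetical per-class AP values, for illustration only.
ap = {"N": 0.99, "L": 0.98, "F": 0.97, "B": 0.98,
      "LF": 0.96, "BL": 0.97, "BF": 0.96, "BLF": 0.95}
print(precision(90, 10))                      # 0.9
print(round(recall(90, 15), 3))               # 0.857
print(round(mean_average_precision(ap), 3))   # 0.97
```

Because mAP averages AP uniformly over the eight classes, a rare class such as BLF influences mAP as strongly as N, which is why balancing the per-class annotation counts matters for this metric.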

Results after the Balance of Each Occlusion Class
The basic training dataset was balanced so that the number of annotation boxes in each occlusion apple class was equal. However, the annotation boxes in QX_L and PSR_L were not balanced because of their small number. The final number of annotation boxes for each occlusion class under each illumination was 10,579, the number of B in ZY_H. The total number of annotation boxes was 338,937, compared with 81,631 originally. The results of image synthesis under high and low illuminations are shown in Figure 9, and the numbers of annotation boxes of each occlusion class after synthesis are shown in Table 4.
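Assuming each of the four balanced sub-datasets (ZY_H, ZY_L, QX_H, PSR_H) ends up with 10,579 boxes in every one of the eight classes (an interpretation of the text, not stated explicitly), the reported totals are mutually consistent:

```python
# Hedged consistency check of the balancing arithmetic reported above.
boxes_per_class = 10_579      # the B count in ZY_H, used as the target
balanced_subsets = 4          # ZY_H, ZY_L, QX_H, PSR_H (QX_L/PSR_L skipped)
classes = 8                   # N, L, F, B, LF, BL, BF, BLF

balanced_total = balanced_subsets * classes * boxes_per_class
print(balanced_total)                 # 338528
print(338_937 - balanced_total)       # 409 boxes remaining in QX_L and PSR_L
```

Under this reading, the 409-box remainder would be the original annotations left untouched in the two unbalanced sub-datasets.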


Yolox-s-bal improved significantly over the yolox-s-basic model in terms of precision, recall, and mAP, and it showed some improvement in the AP values of all occlusion classes. Yolov4-s-bal and yolov3-tiny-bal showed the same trend, indicating that the balance method proposed in this study comprehensively improves the performance of these models in detecting occluded apple targets. Compared with the yolov5-s-basic model, the precision and AP_BLF values of yolov5-s-bal decreased by 0.001 and 0.042, respectively. However, the recall value improved significantly, the mAP value remained stable, and the AP values of all classes except BLF improved, indicating that the proposed method maintains the accuracy of yolov5-s-basic while improving the model's ability to find the full range of occlusion targets. The proposed method improved most of the metrics of each model, but the AP_BLF metric decreased after training yolov5-s-bal and efficientDet-d0-bal, indicating that the BLF features in the balanced MTOA dataset differed from those in the BLF class of the basic training dataset, which affected the BLF detection ability of these two models.

Discussion
The experimental results show that the AP_BLF value decreased for the yolov5-s-bal and efficientDet-d0-bal models and increased for all other models. To analyze the reasons, the original test dataset was divided into four sub-class test datasets (ZY_H_TEST, ZY_L_TEST, QX_H_TEST, and PSR_H_TEST). The yolov5-s-bal model was chosen as the representative of the models with decreasing AP_BLF because its AP_BLF value declined the most, and yolov4-s-bal was chosen as the representative of the models whose metrics all rose because its AP_BLF value increased the most. Finally, these two models were tested separately on the four sub-class test datasets.

1.
First, the four sub-class test datasets were evaluated with the yolov5-s-basic and yolov5-s-bal models. Test results are shown in Table 6, and visualization results are shown in Supplementary Material Figure S2. On ZY_H_TEST and QX_H_TEST, AP_BLF improved by 0.018 and 0.071, respectively, but on ZY_L_TEST and PSR_H_TEST it decreased by 0.085 and 0.096, respectively. Meanwhile, on PSR_H_TEST, all metrics except AP_F decreased under different illuminations, indicating that the yolov5-s-bal model lost detection performance for the BLF class in PSR_H_TEST and ZY_L_TEST; this is the main reason for the overall decrease in AP_BLF for yolov5-s-bal. In addition, the yolov5-s-basic model had the greatest variability across all test metrics on PSR_H_TEST. Further improving these metrics would require training data more similar to the source data. However, the balance augmentation for PSR_H_TEST used the occlusion elements from ZY_H and ZY_L for synthesis, which increased the amount of training data but fell short of the goal of resembling the source data, resulting in the decrease in AP_BLF and other metrics.

2.
Next, the four sub-class test datasets were evaluated with the yolov4-s-basic and yolov4-s-bal models. The test results are shown in Table 7, and the visualization results are shown in Supplementary Material Figure S3. On ZY_L_TEST, AP_BLF and P decreased by 0.036 and 0.001, respectively; the larger decrease in AP_BLF has the same cause as the decrease observed for the yolov5-s-bal model. The detection metrics on all other sub-class test datasets improved substantially, indicating that the combined detection performance of the yolov4-s-bal model improved significantly. Although the BLF detection ability on ZY_L_TEST was suppressed, the AP values on the other sub-classes increased significantly, which is the main reason AP_BLF increased when the full test dataset was evaluated with yolov4-s-bal. The results of the yolov4-s-basic model on PSR_H_TEST and ZY_H_TEST were relatively low, showing that there was more room to improve the model's performance. All other metrics improved after training the model with the balanced MTOA dataset, indicating that the proposed method is more effective at improving a model's performance when there is more room for improvement. In Tables 6 and 7, diff represents the difference between the detection metrics of the basic model and those of the balanced model (bolded in black), and the boxed areas mark the larger declines in AP_BLF.
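The per-sub-class comparison behind Tables 6 and 7 can be sketched as a simple diff of metric tables: evaluate the basic and balanced models on each test split and report the difference in each metric. All metric values below are placeholders, not the paper's results.

```python
def metric_diff(basic, balanced):
    # diff > 0 means the balanced model improved on that split and metric;
    # diff < 0 marks a decline (the boxed cases in Tables 6 and 7).
    return {split: {m: round(balanced[split][m] - basic[split][m], 3)
                    for m in basic[split]}
            for split in basic}

# Hypothetical per-split metrics for a basic and a balanced model
basic = {"ZY_H_TEST": {"AP_BLF": 0.800}, "ZY_L_TEST": {"AP_BLF": 0.850}}
balanced = {"ZY_H_TEST": {"AP_BLF": 0.818}, "ZY_L_TEST": {"AP_BLF": 0.765}}
print(metric_diff(basic, balanced))
```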
The method proposed in this study is expected to be applied to the fine-grained detection of apple fruit in multiple regions. Compared with most data augmentation algorithms, the proposed method greatly increases the quantity and quality of the underlying multi-type occlusion apple data and solves the problem of unbalanced quantities between classes. It also improved the detection capability and generalization capability of the lightweight models, showing that our method can advance the fine-grained identification of occluded fruits. To generate high-quality data, the method must make full use of a priori knowledge (consistency of occlusion elements, occlusion type, and external illumination features) and ensure that the synthetic data remain consistent with the main features of the original data.

Limitations and Future Research
Despite the aforementioned achievements [40], there is still room for improvement in the proposed method. First, producing the component pool is time-consuming because all elements, such as base images, fruits, leaves, and branches, must be manually selected or segmented, which increases the cost of data production. Second, after the MTOA dataset is balanced, its size increases, and the training time of the models increases accordingly when more data are used for training. In future work, we aim to reduce the production time and labor cost of the dataset by using an existing segmentation model to build the component pool.

Conclusions
In this study, we addressed the problem that most lightweight models detect multiple types of occluded targets inefficiently during fruit picking. We proposed the first MTOA dataset and a balance augmentation method. The results show that, using the proposed method, the average detection precision of five popular lightweight object detection models can be significantly improved, demonstrating the method's effectiveness. However, the selected occlusion types must remain consistent with the actual situation, since this affects the similarity between the synthetic and actual data. The proposed method shows considerable potential for other fruit detection missions in complex orchard environments in future applications.

Author Contributions:
The contributions of the authors are as follows: H.L. performed the experimental design, writing, data collection, and synthesis of the study. Y.S. guided the thesis, experimental analysis, and design. W.G. supervised the experimental design, writing, and finishing of the thesis. G.L. performed data collection and data pre-processing. All authors have read and agreed to the published version of the manuscript.