Lightweight Detection Algorithm of Kiwifruit Based on Improved YOLOX-S

Considering the high requirements of current kiwifruit picking recognition systems for mobile devices, as well as the small number of available features for image targets and their small-scale aggregation, an enhanced YOLOX-S target detection algorithm for kiwifruit picking robots is proposed in this study. To obtain a small and lightweight model, a new multi-scale feature integration structure was designed: the feature maps used for detecting large targets in the YOLOX model are eliminated, the feature map for small targets is up-sampled using nearest-neighbor interpolation, the shallow features are spliced with the final features, the gradient of the SiLU activation function is perturbed, and the loss function at the output is optimized. The experimental results show that, compared with the original YOLOX-S, the enhanced model improved the detection average precision (AP) on kiwifruit images by 6.52%, reduced the number of model parameters by 44.8%, and improved the model detection speed by 63.9%. Hence, being both effective and relatively lightweight, the proposed model can provide data support for the 3D positioning and automated picking of kiwifruit. It may also provide solutions in similar fields related to small target detection.


Introduction
Agriculture is the source of human clothing, food, housing, and transportation; an important foundation for people's lives; the backbone that supports the national economy; and the guarantee for the country's stable development. At present, the application of artificial intelligence in agriculture mainly includes intelligent farm systems with management and decision-making capabilities based on agricultural big data [1], moving obstacle detection and path recognition [2], crop growth and pest detection [3], weed recognition [4], fruit and vegetable quality detection [5], automatic picking based on agricultural robots, and other related fields.
Fruit-picking robots can automate picking work, effectively resolving issues related to labor shortages, high costs, and low efficiency in the manual picking process [6,7]. The visual system is critical to a picking robot: the efficiency and stability of such robots predominantly depend on the speed and accuracy of fruit recognition, along with their adaptability in complex environments [8,9]. Therefore, research on visual systems that can accurately identify the fruit on trees in complex environments is of substantial value and practical significance for achieving automatic picking and yield estimation.
Numerous scholars across the world have conducted extensive research on target object recognition technology [10-13]. In the field of fruit crop detection in natural environments, feature extraction and recognition have predominantly targeted tomato [14,15], apple [16-18], cucumber [19,20], strawberry [21], sugarcane [22], pineapple [23], and various other fruits. Among these fruits, the planting area and yield of kiwifruit in particular have continued to increase over time. With its high yield and rich nutritional value, kiwifruit has been widely planted and has become popular among consumers. Methods for detection and recognition are predominantly divided into traditional machine vision methods and deep learning methods. As an example of traditional machine vision, Cui Yongjie et al. [24] utilized the a* channel of the L*a*b* color space for kiwifruit image segmentation, and adopted the elliptical Hough transform to fit the contour of a single fruit for segmentation and recognition. In addition, Fu et al. [25] proposed the use of 1.1R-G color characteristics for nighttime kiwifruit image segmentation, and combined the minimum circumscribed rectangle method and the elliptical Hough transform to identify each fruit. However, both methods presented unsatisfactory results for fruit segmentation and unfavorable results for multi-fruit cluster recognition compared to traditional algorithms, such as SIFT [26], HOG [27], and texture extraction algorithms [28-30]. Kiwifruit images in the field environment possess vastly diverse features, complex backgrounds, and substantial differences in morphological features. Traditional machine vision methods are mainly constructed based on experience and are influenced by samples and human subjectivity; hence, they are unable to effectively meet the demands of applications in complex field environments.
Deep learning target detection algorithms have experienced significant leaps in performance and accuracy, and various network models have substantially enhanced their ability to resist scale changes and translation. Song Zhenzhen et al. [31] constructed a fast VGG16 model to detect kiwifruit in live images by integrating a region proposal network (RPN) and a Fast R-CNN network, while Fu Longsheng et al. [32] proposed a deep learning model based on the LeNet convolutional neural network for multi-cluster kiwifruit images, with general applicability to the recognition of multi-cluster kiwifruits. However, as research on deep learning-based target detection methods has focused on the construction of deeper networks for the purpose of enhancing detection accuracy, the associated network models have generally suffered from an overly large number of parameters. This has led to slow detection speeds, meaning that the algorithms can only be run on high-performance graphics processors and generally have high equipment requirements. At the same time, according to analysis of the growth characteristics of kiwifruit, most targets in kiwifruit detection tasks are small targets (both the absolute and relative scales are relatively small).
Therefore, in the interest of reducing the number of model parameters and enhancing the model detection speed, the YOLOX-S network, which possesses excellent multi-scale detection performance and balances detection speed and accuracy, was selected as the basis for this research. This work aims to improve on the original network model, in order to maintain the target detection accuracy while compressing the model, thus effectively achieving the detection of small kiwifruit targets.

Vision Platform System
In this paper, we primarily focus on the object detection task in the image processing field. The image recognition module was a Jetson Nano embedded development board, as presented in Figure 1. The improved model algorithm, which was trained in advance, was embedded in the board, and wireless communications, remote monitoring, and remote control were achieved through the 4G network module. The communication system is mainly divided into the picking-machine end, cloud server end, and client end, ensuring the transmission and storage of information. Remote wireless control of the picking robot can also be achieved. In addition, the left and right imagers of the depth camera capture video or image data, which are sent to a depth imaging processor. This processor correlates points in the left image with those in the right image, and calculates the depth value of each pixel in the image by shifting the points in the left image to match with the right image. Finally, it returns the result to the terminal in order to command the manipulator to act accordingly.
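The relationship between the left-to-right pixel shift (disparity) and depth can be illustrated with a minimal sketch. It assumes a rectified stereo pair and the pinhole camera model, under which depth is Z = f · B / d; the function names and the parameters `focal_px` and `baseline_m` are hypothetical, and the internal algorithm of the depth imaging processor is not described in this paper.

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Depth in metres for one pixel of a rectified stereo pair: Z = f * B / d.

    disparity_px: horizontal shift (pixels) between the left and right images
    focal_px:     focal length expressed in pixels (assumed known from calibration)
    baseline_m:   distance between the two imagers in metres (assumed known)
    """
    if disparity_px <= 0:
        # No measurable shift: treat the point as infinitely far away.
        return float("inf")
    return focal_px * baseline_m / disparity_px


def depth_map(disparities, focal_px, baseline_m):
    """Apply the per-pixel conversion to a 2D list of disparities."""
    return [[depth_from_disparity(d, focal_px, baseline_m) for d in row]
            for row in disparities]
```

Because depth is inversely proportional to disparity, distant fruit produce small shifts, so small matching errors translate into larger depth errors at long range.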

Hardware Platform
A test platform was independently developed by our team, which can be applied as a fruit picking and transferring platform in hilly and mountainous areas (Figure 2). The platform has a pure electric drive and a CAN interface for chassis speed regulation, steering, and attitude feedback. It is capable of stable driving and meets the hardware requirements of the platform positioning test for the chassis in hilly and mountainous areas.

Experimental Configuration and Environment
The graphics card used was an NVIDIA GeForce RTX 3060, and the CPU was an AMD Ryzen 7 5800H with 16 GB of memory. The software environment was Windows 10, Python 3.8, PyTorch 1.8.1, and CUDA 10.1. The parameter settings are presented in Table 1.

Experimental Sample Dataset
The experimental data in this paper were collected from the Internet and from on-site filming. A total of 1500 images were collected. The on-site photos were all taken in the orchard. Each picture contained a large number of kiwifruit targets, and the total number of targets was 41,687. The targets in each image were labeled with fine granularity, in order to facilitate subsequent enhancements in the detection of small targets.

Principles and Methods
YOLOX [33] is a high-performance real-time target detection network released by Beijing Megvii Technology. It adopts techniques such as an anchor-free mechanism, decoupled heads, multi-positives, an advanced label assignment strategy, and strong data augmentation. Hence, it offers faster speed, higher recognition accuracy, and smaller weight files, and can be easily deployed on mobile devices with lower configurations, thereby offering high research value. The structure of the YOLOX-S network selected for this paper is depicted in Figure 3.
CSPDarknet is the backbone feature extraction network of the YOLOX algorithm, and is predominantly composed of three modules: Focus, CSPNet, and a spatial pyramid pooling (SPP) network. The Focus module first slices the input image: by sampling the complete image at equal intervals, multiple sampled images of appropriate size are obtained. Subsequently, these images are combined in the channel dimension, so that the information in the image is transferred to the channel space, resulting in a down-sampled image with no information loss. The CSPNet module contains the backbone feature extraction and residual structure, which can effectively extract image features and significantly reduces the computational effort while maintaining high accuracy. The SPP network convolves the output of the last CSPNet once, then utilizes max-pooling kernels at three different scales to integrate the features of the feature map under different receptive fields. FPN + PAN is a circular pyramid structure composed of convolution, sampling, and feature fusion operations, which repeatedly extracts the input image features, performs feature fusion at different scales, and finally outputs three feature maps at different scales to the decoupled head for accurate prediction.
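The slicing step of the Focus module can be sketched as a space-to-depth rearrangement. The following is a minimal illustration on a single-channel image stored as nested lists, not the YOLOX implementation; the function name `focus_slice` and the order of the four offsets are assumptions made for the example.

```python
def focus_slice(image):
    """Slice one H x W channel into four (H/2) x (W/2) channels.

    Pixels are sampled at a stride of 2 using the four possible
    (row, column) offsets, so every input value appears exactly once
    in the output and no information is lost.
    """
    height, width = len(image), len(image[0])
    if height % 2 or width % 2:
        raise ValueError("height and width must both be even")
    # Offset order is an illustrative assumption, not taken from the paper.
    offsets = [(0, 0), (1, 0), (0, 1), (1, 1)]
    return [
        [[image[r][c] for c in range(dc, width, 2)]
         for r in range(dr, height, 2)]
        for dr, dc in offsets
    ]
```

For a 4 × 4 input, the result is four 2 × 2 channels: the spatial resolution is halved in each direction while the channel count is multiplied by four.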

Pre-Processing of the Data Set
YOLOX utilizes mosaic and mix-up data augmentation methods to substantially enrich the detection dataset; in particular, random scaling is conducted to supplement the many small targets and make the network more robust. Mosaic augmentation involves performing a series of operations, such as flipping, scaling, and color shifting, on multiple different pictures, followed by cropping and splicing to recombine them into a new image. Hence, the generated images often contain more targets. This kind of augmentation can therefore significantly enrich the background and, to a certain extent, alleviate the imbalance of positive and negative samples in the detection process. Mix-up augmentation refers to the fusion of two pictures in a given proportion, in which the labels of the samples are weighted accordingly. The loss is calculated from the prediction results and the weighted labels, and the parameters are then updated through backpropagation. The effect is shown in Figure 4.
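The mix-up idea described here can be sketched in a few lines. This is a simplified illustration following the weighted-label description in the text, assuming two same-sized greyscale images stored as nested lists; the function names and the mixing weight `lam` are hypothetical, and the actual YOLOX augmentation pipeline differs in its details.

```python
def mixup(image_a, image_b, lam):
    """Blend two same-sized greyscale images with weight lam in [0, 1]."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return [
        [lam * a + (1.0 - lam) * b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(image_a, image_b)
    ]


def mixup_loss(loss_a, loss_b, lam):
    """Weight the losses computed against each image's labels by the same lam."""
    return lam * loss_a + (1.0 - lam) * loss_b
```

Using the same weight for the pixels and for the loss terms keeps the supervision consistent with how much of each source image is visible in the blended sample.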

Improved YOLOX-S Network

Perturbing the Activation Function Gradient
The predominant function of the activation function is to provide non-linearity in the network structure. Considering that the difference between the gradient propagation effects of the SiLU activation function utilized in the YOLOX model and the Mish activation function is slight, gradient perturbation was applied on the basis of the SiLU activation function. As presented in Figure 5, the SiLU→SiLU-1 perturbation led to a smoother gradient curve, while the SiLU→SiLU + 1 perturbation made the gradient change steeper. Given that the Mish activation function worked relatively well in YOLOv4, we considered increasing or decreasing the gradient change based on SiLU. We found that introducing a gradient increase through the SiLU + 1 activation function enhanced the generalization ability of the model to a certain extent.
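How a perturbation can make the SiLU gradient smoother or steeper may be illustrated with a small sketch. The exact definitions of SiLU-1 and SiLU + 1 are not given in this text, so the example uses the well-known parameterized form x · sigmoid(βx) purely as an assumed stand-in: β = 1 recovers SiLU, β < 1 flattens the gradient curve, and β > 1 steepens it. The names `swish` and `swish_grad` are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def swish(x, beta=1.0):
    """x * sigmoid(beta * x); beta = 1.0 is the standard SiLU."""
    return x * sigmoid(beta * x)


def swish_grad(x, beta=1.0):
    """Analytic derivative of swish with respect to x."""
    s = sigmoid(beta * x)
    return s + beta * x * s * (1.0 - s)
```

Under this assumed parameterization the gradient at x = 0 is always 0.5, while the rate at which it changes around the origin grows with β, which mirrors the smoother and steeper curves described for Figure 5.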

Nearest Neighbor Interpolation Up-Sampling of 80 × 80 Feature Map
The dynamic positive and negative sample allocation algorithm utilized by YOLOX, SimOTA, is fast and effective. When determining the candidate areas for positive samples, the grid cells (on the 20 × 20, 40 × 40, and 80 × 80 feature maps) whose center points fall inside the ground truth (GT) box, or inside the region of radius r centered on the center point of the GT, are selected. In Figure 6, the green box denotes the GT. It can be observed that there may be mismatches when using a small feature map. It can also be observed that larger feature maps are more likely to match smaller GTs, whereas small feature maps can match only a very small number of GTs.
Through in-depth research on the allocation strategy of positive and negative samples in the YOLOX model, the YOLOX model was found to reduce the number of predicted samples of the feature map in the confidence loss calculation, where almost all of the removed samples were negative. Hence, the imbalance in quantity caused by too many negative samples was alleviated. In addition, most targets in the kiwifruit detection task are small targets (i.e., both the absolute scale and relative scale are relatively small). Therefore, with the goal of reducing the number of model parameters and improving the model detection speed, the feature maps (20 × 20, 40 × 40) used for detecting large targets in the YOLOX model were eliminated. Only the 80 × 80 feature map was retained, and a larger feature map size was introduced on this basis to match the GTs more effectively. When acquiring the final output from the 80 × 80 feature map, nearest-neighbor interpolation was utilized for up-sampling. This allows the model to provide more predictions and better match GTs, while the complexity of the model and the number of parameters are greatly reduced. Figure 7 demonstrates the structure of the network before and after the improvement.
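Nearest-neighbor up-sampling itself is simple enough to sketch directly. The example below operates on a single-channel feature map stored as nested lists; the function name and the scale factor of 2 (which would turn an 80 × 80 map into 160 × 160) are assumptions for illustration, since the text does not state the enlarged size.

```python
def upsample_nearest(feature_map, scale=2):
    """Enlarge a 2D feature map by repeating each value scale x scale times."""
    if scale < 1:
        raise ValueError("scale must be a positive integer")
    upsampled = []
    for row in feature_map:
        # Repeat every value horizontally, then repeat the widened row vertically.
        wide_row = [value for value in row for _ in range(scale)]
        upsampled.extend(list(wide_row) for _ in range(scale))
    return upsampled
```

Because the operation only copies existing values, it adds no learnable parameters, which is in keeping with the goal of a lightweight model.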

Transfer of Shallow Features
The performance when using a single output feature map may be unstable under specific conditions. Considering that low-level features carry relatively little semantic information but locate the target accurately, the final feature map and the feature map from the shallow network were concatenated, in order to better integrate the semantic and representation information, such that the accuracy of the regression box could be enhanced (see Figure 8).

Enhancing the Loss Function
Equations (1)-(5) are the loss functions of the YOLOX-S algorithm. The bounding box loss functions GIOU_loss and IOU_loss for predicting Reg have certain limitations, resulting in an inability to effectively optimize the overlap between the detection box and the real box when one is contained in the other. For the confidence and category losses, the original algorithm adopts a binary cross-entropy loss function, of the form −[y log(p) + (1 − y) log(1 − p)] for a label y and predicted probability p, which is not conducive to the classification of positive and negative samples.
In Equation (2), C represents the minimum circumscribed rectangle of the detection frame and the prior frame, and Q represents the difference between the minimum circumscribed rectangle and the union of the two frames.
In Equations (3) and (4), the indicator term denotes whether the target falls into detection frame j of grid i, and λ_noobj represents the loss weight of the localization error. C_i^j and P_i^j refer to the training values, and their predicted counterparts refer to the prediction values. Therefore, we adopted CIOU_loss as the Reg bounding box loss function, which adds an aspect-ratio constraint compared with the previous loss, such that the prediction box is more in line with the real box, as demonstrated in Equation (5). Equation (6) was used to measure the consistency of the aspect ratio. The confidence and category loss functions utilized the PolyLoss function, which is based on the Taylor expansion approximation of the focal loss [34]. Thus, the loss not only retained the superior binary classification performance of the focal loss, but also achieved enhanced accuracy and performance on this basis. The convergence speed was also effectively accelerated.
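$$L_{CIoU} = 1 - IoU + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha v \quad (5)$$
$$v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2 \quad (6)$$
where ρ(·) is the Euclidean distance between the center points b and b^gt of the two boxes, c is the diagonal length of the smallest circumscribed rectangle of the two, α is the weight coefficient, v is the aspect ratio distance between the two frames, and w, h and w^gt, h^gt are the widths and heights of the prediction box and the real box, respectively.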
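The CIoU terms described here (center distance ρ, enclosing-box diagonal c, weight coefficient α, and aspect-ratio term v) can be combined as in the following sketch. It assumes the standard CIoU formulation from the literature, including α = v / ((1 − IoU) + v), and boxes given as (x1, y1, x2, y2) corner coordinates; the function name is hypothetical and this is not the training code used in the paper.

```python
import math


def ciou_loss(box_p, box_g):
    """CIoU loss for two boxes (x1, y1, x2, y2) with x2 > x1 and y2 > y1."""
    px1, py1, px2, py2 = box_p
    gx1, gy1, gx2, gy2 = box_g
    pw, ph = px2 - px1, py2 - py1
    gw, gh = gx2 - gx1, gy2 - gy1

    # Intersection over union.
    inter_w = max(0.0, min(px2, gx2) - max(px1, gx1))
    inter_h = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = inter_w * inter_h
    iou = inter / (pw * ph + gw * gh - inter)

    # Squared distance between the two box centers (rho^2).
    rho2 = ((px1 + px2) / 2 - (gx1 + gx2) / 2) ** 2 \
        + ((py1 + py2) / 2 - (gy1 + gy2) / 2) ** 2

    # Squared diagonal of the smallest rectangle enclosing both boxes (c^2).
    c2 = (max(px2, gx2) - min(px1, gx1)) ** 2 \
        + (max(py2, gy2) - min(py1, gy1)) ** 2

    # Aspect-ratio consistency term v and its weight alpha.
    v = (4 / math.pi ** 2) * (math.atan(gw / gh) - math.atan(pw / ph)) ** 2
    denom = (1 - iou) + v
    alpha = v / denom if denom > 0 else 0.0

    return 1 - iou + rho2 / c2 + alpha * v
```

For a prediction box fully contained in the real box, the IoU is the same wherever the small box sits, but the center-distance term still differs, so the loss keeps providing a direction for optimization in the containment case discussed in this section.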
where P_t represents the predicted probability of the target label.
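As an illustration of a PolyLoss built on the focal loss, the sketch below implements the Poly-1 focal form, −(1 − P_t)^γ log(P_t) + ε(1 − P_t)^(γ+1), for a single binary prediction. The specific PolyLoss variant and hyperparameters used in this work are not stated in the text, so the function name and the default values of `epsilon` and `gamma` are assumptions chosen only for the example.

```python
import math


def poly1_focal_loss(p, y, epsilon=1.0, gamma=2.0):
    """Poly-1 focal loss for one predicted probability p and binary label y.

    p:       predicted probability of the positive class
    y:       ground-truth label, 1 (positive) or 0 (negative)
    epsilon: weight of the first polynomial term (assumed value)
    gamma:   focal focusing parameter (assumed value)
    """
    p = min(max(p, 1e-7), 1.0 - 1e-7)  # clamp to avoid log(0)
    p_t = p if y == 1 else 1.0 - p     # probability assigned to the true label
    focal = -((1.0 - p_t) ** gamma) * math.log(p_t)
    return focal + epsilon * (1.0 - p_t) ** (gamma + 1.0)
```

Setting `epsilon=0` recovers the plain focal loss, and additionally setting `gamma=0` recovers binary cross-entropy, which makes the relationship between the three losses easy to check.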

Evaluation of Model Performance
In order to evaluate the effectiveness of the proposed method for kiwifruit detection from different aspects, the mean average precision (mAP), the number of model parameters, and the frames per second (FPS), along with the detection time per image, were selected as evaluation metrics. mAP jointly considers precision and recall and is used to evaluate the effectiveness of the model, while FPS refers to the number of frames processed per second, which measures the real-time performance of the model. Finally, the number of model parameters reflects how lightweight the model is.
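$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}$$
$$AP = \int_0^1 P(R)\,dR, \qquad mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i$$
where P and R are the precision and recall, TP represents the number of correctly identified targets, FP represents the number of misidentified targets, FN represents the number of missed targets, and N is the number of categories. When there is only one category, mAP is equal to the AP.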
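The way these counts turn into precision, recall, and AP can be sketched as follows. The example computes a non-interpolated AP as the area under the precision-recall curve over detections sorted by descending confidence; whether the evaluation in this work uses an interpolated variant is not stated, so this choice, like the function names, is an assumption.

```python
def precision_recall(tp, fp, fn):
    """Precision TP / (TP + FP) and recall TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def average_precision(is_true_positive, num_ground_truth):
    """Non-interpolated AP from detections sorted by descending confidence.

    is_true_positive: list of booleans, True where a detection matched a target
    num_ground_truth: total number of labeled targets (must be positive)
    """
    if num_ground_truth <= 0:
        raise ValueError("num_ground_truth must be positive")
    tp = fp = 0
    ap = 0.0
    prev_recall = 0.0
    for hit in is_true_positive:
        if hit:
            tp += 1
        else:
            fp += 1
        precision = tp / (tp + fp)
        recall = tp / num_ground_truth
        # Accumulate the area of the rectangle added by this detection.
        ap += precision * (recall - prev_recall)
        prev_recall = recall
    return ap
```

With a single category such as kiwifruit, the value returned by `average_precision` is also the mAP.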

Analysis and Comparison of the Enhanced Model
In order to verify the effectiveness of the proposed model, a comparative experiment was conducted between the original and enhanced YOLOX-S networks using the same training parameters. Table 2 provides the detailed scores of each evaluation index before and after the enhancement. Figure 9 presents the AP diagram of the model before and after enhancement. From the perspective of model lightness, the number of parameters of the improved model was reduced by 44.8% and the model detection speed was increased by 63.9%, verifying the feature expression ability of the model. Feature map up-sampling with nearest-neighbor interpolation reduced the computational complexity by omitting unnecessary computations, thus making the network lightweight.
To depict the improvement in various aspects more intuitively, we created a performance comparison diagram with respect to the model improvement strategies, as shown in Figure 10. As shown by Figure 10, in terms of model effectiveness and accuracy, the growth in the expressiveness of the model can be divided into three stages. In the first stage, the perturbation of the activation function, through selection of the SiLU + 1 function, enhances the generalization ability of the model and increases the AP value by 1.13%. The improvement in the second stage is due to the removal of the feature maps for large targets, which are redundant in this detection task, so that the network detection is concentrated entirely on small targets; this reduces the calculation of negative samples and the misjudgment of positive samples, and the AP value increases by a further 2.01%. The improvement in the last stage comes from the design of the new network fusion structure. By splicing the final output feature map and the shallow-level features, the semantic information of the two is combined, and the loss function is improved in the prediction stage. Compared with the original model, the enhanced model improved the AP value on the kiwifruit images by 6.52%, which is a substantial increase.
Figure 11 presents the before and after images for comparison. By comparing the groups of images, it can be seen that in the group of experiments in Figure 11a, the fruit could be effectively identified by the enhanced model, even when there were tree trunks, branches, and leaves in the way. For the group of experiments in Figure 11b, the original algorithm was not able to recognize the low-density fruits effectively. Additionally, for the group of experiments in Figure 11c, the original algorithm misidentified the tree trunk as a fruit, but the enhanced model corrected this and was able to accurately identify more fruit.
In light of the above, the enhanced algorithm significantly improved the ability to detect small-scale target fruit and reduced the missed-recognition and misrecognition rates. In addition, we compared several state-of-the-art algorithms and conducted training tests under the same conditions. The proposed enhanced model provided improved results in all aspects. The performance comparison is given in Table 3. In addition, Table 4 compares our findings with those of various scholars around the world, and details the advantages and disadvantages of their techniques. It can be seen from the table that the algorithm proposed in this paper can solve the problem of poor recognition of multiple fruit clusters, compared with the traditional image processing methods used in past research [24,25,35-37]. Compared with studies based on deep learning methods [31,32], our algorithm improves accuracy while alleviating the problems of the network model being too large and the equipment requirements being too high. Recognition in the cases of fruit occlusion and misjudgment is improved, the recognition accuracy and speed are further improved, and the number of model parameters is reduced. The algorithm can effectively complete the detection task of kiwifruit in agricultural production, and has a positive impact on the future picking of kiwifruit.

Conclusions
This research proposed an enhanced YOLOX-S target detection algorithm for kiwifruit picking robots. In order to effectively improve the detection of small-scale targets, the YOLOX-S algorithm was enhanced through fine-grained annotation of the target frames in the data set, as well as mosaic and mix-up data augmentation methods. Through nearest-neighbor up-sampling of the small-target feature map and the splicing of shallow features with the final features, in addition to the optimization of the loss function, the number of parameters of the enhanced YOLOX-S was significantly reduced while the AP value was increased. We demonstrated that the proposed enhancement method is applicable to actual fruit-picking environments, and is beneficial for embedding in mobile devices.
This research predominantly focused on the detection of kiwifruit. However, the key to effective picking is to locate the target and return its three-dimensional coordinate points.
In further research, we intend to focus on the following: (a) At present, the AP of the model has not reached the ideal state; the data set will be enriched to further improve the performance and accuracy of the model. (b) We will pre-process the depth image data and color image data by utilizing the camera's external and internal parameters, triangulation principles, and the conversion of pixel coordinates to 3D spatial coordinates, in order to carry out fruit localization. (c) The proposed algorithm effectively met the basic requirements for fruit picking using a large-end actuator. However, due to the large number of kiwifruit that need to be picked, it is necessary to further research the picking sequence allocation for kiwifruit in order to enhance the efficiency of the manipulator. (d) We will analyze the correlation between data, identify a variety of other types of fruit through transfer learning, and design a general multi-classification picking model for orchards.

Figure 2. Electric fruit picking platform used for experimental trials.

Figure 3. Structure of the YOLOX-S network model.

Figure 4. Data enhancement effects: (a) mosaic data enhancement and (b) mix-up data enhancement.

Figure 5. Variation between the perturbed gradient functions.

Figure 7. (a) Original YOLOX feature fusion structure and (b) improved structure, in which only the 80 × 80 feature map structure is preserved.

Figure 8. Structure of the final feature map.

Figure 10. Model improvement strategy performance comparison chart.

Figure 11. (a) Enhancement of fruit recognition under tree trunk and leaf occlusion. (b) Enhancement of low-density fruit missed recognition. (c) Enhancement of non-target fruit misrecognition.

Table 2. Comparison of model scores before and after enhancement.

Table 3. Comparison of mainstream models.

Table 4. Comparison and analysis of advantages and disadvantages of methods.