A Real-Time Detection Algorithm for Kiwifruit Defects Based on YOLOv5

Abstract: Defect detection is the most important step in the postharvest reprocessing of kiwifruit. However, some small defects are difficult to detect, and the accuracy and speed of existing detection algorithms fall short of the requirements of real-time detection. To solve these problems, we developed a defect detection model based on YOLOv5 that detects defects accurately and quickly. The main contributions of this research are as follows: (1) a small object detection layer is added to improve the model's ability to detect small defects; (2) SELayer is embedded so that the model attends to the importance of different channels; (3) the CIoU loss function is introduced to make the regression more accurate; (4) without increasing the training cost, we train our model with transfer learning and use the CosineAnnealing algorithm to improve the result. The experimental results show that the overall performance of the improved network, YOLOv5-Ours, is better than that of the original and of mainstream detection algorithms. The mAP@0.5 of YOLOv5-Ours reaches 94.7%, an improvement of nearly 9% over the original algorithm. Our model takes only 0.1 s to detect a single image. Therefore, YOLOv5-Ours meets the requirements of real-time detection well and provides a robust strategy for kiwifruit flaw detection systems.


Introduction
China is a major producer of kiwifruit, and its output ranks first in the world [1]. Defect detection plays a significant role in the postharvest reprocessing of kiwifruit. Through defect detection, kiwifruit can be graded and priced according to quality, which helps address the long-standing difficulty of raising kiwifruit prices [2]. It also helps guarantee food safety. However, current detection practice is traditional and outdated. Most manufacturers and workers rely mainly on manual inspection, which is labor-intensive and inefficient [3].
In recent years, computer-vision-based object detection technology has gradually matured [4,5]. Shah et al. used Faster R-CNN to identify plants and weeds [6]. Zeze et al. used a CNN to recognize apples [7]. Computer vision has the clear advantages of high accuracy and high speed [8]. Defect detection based on computer vision is an automatic and nondestructive fruit inspection method [9]. It outperforms manual detection in precision and efficiency; hence, its application to fruit is expected to become a prevailing trend [10].
Current fruit defect detection algorithms find it difficult to balance speed and accuracy. Dong et al. [11] used computer vision technology to detect the surface defects of Korla fragrant pears. While accuracy was guaranteed, detecting a single image still took 2.5 s. Wang et al. [12] conducted rapid detection of pomegranate leaf diseases, but the accuracy was only 87%. Xing et al. [13] used a BP neural network in mango quality inspection to increase the speed as much as possible while ensuring accuracy; even so, the final detection time was 0.8 s.
Electronics 2021, 10, 1711
The development of deep learning algorithms in recent years has led to major breakthroughs in computer vision. In target recognition, deep learning algorithms represented by convolutional neural networks (CNNs) have improved both accuracy and detection speed compared with traditional methods [14]. At present, target recognition algorithms fall into two main types. The first is the two-stage algorithm based on candidate boxes and a classifier, such as the R-CNN [15] series, which achieves higher accuracy but whose deeper network structure makes it slower, so it fails to meet the requirements of real-time detection. The second is the regression-based one-stage algorithm, such as SSD [16] and the YOLO [17] series, which offers faster inference and greater practicality and can meet the requirements of real-time object recognition and detection.
This paper takes kiwifruit defects as the research object, collects photos of four common types of flaws to build a kiwifruit flaw dataset, and exploits the high detection speed and high accuracy of the YOLOv5 [18] algorithm in image detection. We improved the model to address the problems above and compared the improved model with the original one. Using the CosineAnnealing [19] decay method during training improves the model without increasing the training cost. The results show that the improved model achieves significant gains, which demonstrates its effectiveness.

YOLO Algorithm
The main current object recognition algorithms include the R-CNN series and the YOLO series. The R-CNN series is superior where higher accuracy is required, but its detection speed is lower than that of the YOLO series, and in practical scenarios it cannot meet the real-time requirements of object detection. The YOLO series instead treats detection as a regression problem, which makes it easier to learn generalizable features of the target and solves the speed problem. It uses a one-stage neural network to perform object localization and classification directly [20,21].
YOLO views image detection as a regression problem with a simple pipeline and fast speed. It can process streaming video in real time with a latency of less than 25 ms. During training, YOLO sees the entire image and therefore pays more attention to global information in target detection. The core idea of YOLO is to use the entire picture as the input of the network and to regress directly, at the output, the position of each bounding box and the category to which it belongs. In YOLO, each bounding box is predicted from the features of the entire image, and each bounding box consists of five predictions: x, y, w, h, and a confidence. The coordinates (x, y) give the center of the bounding box relative to the bounds of its grid cell, and w and h are the predicted width and height relative to the entire image. The confidence indicates the accuracy of classification under the specific condition.
YOLO is mainly composed of three components: the backbone, the neck, and the head. The head predicts from image features, generating bounding boxes and predicting categories.
YOLOv2 [22] uses a new training algorithm and applies k-means clustering to the bounding boxes in the training set. Because the main purpose of the prior (anchor) boxes is to raise the IoU between the prediction box and the ground truth, the IoU between each box and the cluster-center box is used as the distance measure in the cluster analysis. Compared with YOLOv1, it significantly improves accuracy and recall. YOLOv3 [23] uses a better base classification network, the ResNet-like [24] Darknet-53, together with an FPN-like [25] network structure to realize multiscale prediction. Detection accuracy and speed are greatly improved, and the false background detection rate is effectively reduced. YOLOv4 [26] retains the head of YOLOv3, changes the backbone network to CSPDarknet53, uses the idea of SPP [27] (spatial pyramid pooling) to expand the receptive field, and adopts PANet [28] as the neck. The CSPNet [29] structure achieves richer gradient combinations while reducing the amount of computation. The PANet structure fully integrates the different feature layers, which effectively improves the extraction of defect features.
YOLOv5 continues to use the three main components of the YOLO series. The network structure is shown in Figure 1.
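The IoU-based distance used for anchor clustering in YOLOv2 can be sketched as follows. This is a minimal illustration rather than the original implementation: it assumes boxes are reduced to (width, height) pairs aligned at a common corner and uses the distance d = 1 - IoU. The function names `wh_iou` and `kmeans_anchors` and the sample boxes are hypothetical.

```python
import random


def wh_iou(box, centroid):
    """IoU of two (width, height) boxes aligned at a shared corner."""
    inter = min(box[0], centroid[0]) * min(box[1], centroid[1])
    union = box[0] * box[1] + centroid[0] * centroid[1] - inter
    return inter / union


def kmeans_anchors(boxes, k, iters=50, seed=0):
    """Cluster (w, h) pairs with distance d = 1 - IoU and return k centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(boxes, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            # Highest IoU is the same as smallest 1 - IoU distance.
            best = max(range(k), key=lambda i: wh_iou(box, centroids[i]))
            clusters[best].append(box)
        new_centroids = []
        for i, members in enumerate(clusters):
            if not members:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[i])
                continue
            mean_w = sum(m[0] for m in members) / len(members)
            mean_h = sum(m[1] for m in members) / len(members)
            new_centroids.append((mean_w, mean_h))
        if new_centroids == centroids:
            break
        centroids = new_centroids
    # Sort by area so the smallest anchor comes first.
    return sorted(centroids, key=lambda c: c[0] * c[1])


# Hypothetical ground-truth box sizes: three small and three large.
boxes = [(10, 12), (11, 13), (12, 11), (50, 60), (52, 58), (48, 61)]
anchors = kmeans_anchors(boxes, k=2)
print(anchors)
```

Using IoU rather than Euclidean distance keeps large boxes from dominating the clustering, since the measure depends on overlap ratio and not on absolute size.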

Input
The input end of YOLOv5 uses the same mosaic data augmentation method as YOLOv4, which performs well in small target detection. YOLOv5 also adds adaptive anchor box calculation: in each training run, the optimal anchor box values for the given training set are computed adaptively.

Backbone
YOLOv5 adds the Focus structure, which performs a slicing operation. Taking YOLOv5s as an example, the original 640 × 640 × 3 image is fed into the Focus structure, where the slicing operation first forms a 320 × 320 × 12 feature map; a convolution with 32 kernels then produces a 320 × 320 × 32 feature map.
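The slicing step can be illustrated with a small sketch. It assumes a channel-first nested-list layout (C × H × W) and one particular order of the four pixel offsets; both are assumptions made for illustration, and the helper names `focus_slice` and `shape` are hypothetical. The convolution that follows the slicing is omitted.

```python
def focus_slice(image):
    """Rearrange a C x H x W nested list into 4C x H/2 x W/2.

    Each group of output channels takes every second pixel starting at one
    of four (row, column) offsets, so no pixel value is discarded.
    """
    offsets = [(0, 0), (1, 0), (0, 1), (1, 1)]  # assumed offset order
    out = []
    for dy, dx in offsets:
        for channel in image:
            out.append([row[dx::2] for row in channel[dy::2]])
    return out


def shape(tensor):
    """Return (channels, height, width) of a nested-list tensor."""
    return (len(tensor), len(tensor[0]), len(tensor[0][0]))


# Toy 3 x 4 x 4 input; the value encodes channel, row, and column.
image = [[[c * 100 + y * 10 + x for x in range(4)] for y in range(4)]
         for c in range(3)]
sliced = focus_slice(image)
print(shape(image), "->", shape(sliced))
```

The same rearrangement applied to a 3 × 640 × 640 input gives 12 × 320 × 320, which matches the sizes quoted in the text.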

Neck
YOLOv5 uses the FPN-PAN structure, with the CSP2 structure designed after CSPNet and PANet as the neck to aggregate features. The neck is mainly used to generate feature pyramids, which enhance the model's detection of objects at different scales and allow the same object to be recognized at different sizes and scales. The feature extractor of the network uses a new FPN structure, which enhances the bottom-up path and improves the propagation of low-level features.

Small Target Recognition Layer
As kiwifruit flaws are small and have few pixel features, the inspection model must have a strong ability to detect small defects. In the original YOLOv5 model, the feature map of the last layer of the convolutional network structure is too small to meet the requirements of the subsequent detection and regression. To solve this problem, we add a small target detection layer and continue to process the feature map by expanding it through upsampling. The main purpose of upsampling is to enlarge the original image so that it can be displayed on a higher-resolution display device. Zooming an image generally cannot add information to it; hence, the image quality is inevitably affected. However, some zooming methods can increase the information in the image so that the quality of the zoomed image exceeds that of the original. Upsampling adopts interpolation: on the basis of the original image pixels, a suitable interpolation algorithm inserts new elements between pixels, as shown in Figure 2. The resulting feature map is then fused by concatenation (Concat) with the feature map of the second layer in the backbone network to obtain a larger feature map for small target detection.
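The upsample-then-concatenate step can be sketched as follows. The text does not name the interpolation algorithm, so nearest-neighbor interpolation is assumed here. The helper names, the channel-first nested-list layout, and the toy feature maps are hypothetical.

```python
def upsample_nearest(fmap, scale=2):
    """Upsample one H x W map by repeating each value scale x scale times."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(scale)]
        out.extend(list(wide) for _ in range(scale))
    return out


def concat_channels(a, b):
    """Concatenate two C x H x W feature maps along the channel axis."""
    same_height = len(a[0]) == len(b[0])
    same_width = len(a[0][0]) == len(b[0][0])
    assert same_height and same_width, "spatial sizes must match"
    return a + b


# Toy example: a deep 1 x 2 x 2 map and a shallow 2 x 4 x 4 backbone map.
deep = [[[1, 2], [3, 4]]]
shallow = [[[0] * 4 for _ in range(4)] for _ in range(2)]

upsampled = [upsample_nearest(channel) for channel in deep]
fused = concat_channels(upsampled, shallow)
print(len(fused), len(fused[0]), len(fused[0][0]))
```

After upsampling, the deep map has the same spatial size as the shallow one, so the two can be stacked channel-wise into a single larger feature map.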

SELayer
In order to obtain more detailed information about the target that needs attention and to suppress useless information from other channels, we introduce an attention network, SELayer [30]. SENet is a network structure proposed by Hu et al. that focuses on the fusion of features among the channels produced by the convolution operations in the backbone network. Its main innovation is that the model automatically learns the importance of different channel features by modeling the relationship between channels. The SE module consists mainly of two operations, compression (Squeeze) and excitation (Excitation). The Squeeze operation applies global average pooling to encode the entire spatial feature of a channel as a global feature. The calculation method is as follows:

$$z_c = F_{sq}(u_c) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_c(i, j)$$

In this formula, $u_c$ is the $c$-th two-dimensional matrix (of size $H \times W$) in the three-dimensional matrix $U$ obtained after convolution, $z_c$ is the result of the Squeeze operation, and the subscript $c$ indexes the channels.
After the Squeeze operation obtains the channel information, two fully connected layers form a gate mechanism, which is activated with Sigmoid. The calculation method is as follows:

$$s = F_{ex}(z, W) = \sigma\left(W_2\,\delta(W_1 z)\right)$$

where $\delta$ is the ReLU activation function, $\sigma$ is the Sigmoid function, and $W_1$ and $W_2$ are the weights of the two fully connected layers used for dimensionality reduction and dimensionality increase, with $W_1 \in \mathbb{R}^{\frac{C}{r} \times C}$ and $W_2 \in \mathbb{R}^{C \times \frac{C}{r}}$; $r$ is a scaling parameter that limits model complexity and improves model capability. $s$ is the set of feature-map weights obtained through the fully connected layers and the nonlinear layers. Finally, the output weights are applied to the original features. The calculation formula is as follows:

$$x_c = F_{scale}(u_c, s_c) = s_c \cdot u_c$$

In the formula, $x_c$ is the feature map of one channel of $X$, $s_c$ is a weight, and $u_c$ is a two-dimensional matrix. After modification, the network structure is shown in Figure 3.
This article embeds SELayer in the backbone. The improved network structure is shown in Figure 4.
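The three SE steps (squeeze, excitation, scale) can be traced on a toy input. The sketch below is illustrative only: it omits bias terms, uses plain nested lists instead of tensors, and the function name `se_block` and the weight values are hypothetical, chosen so that the result is easy to check by hand (C = 2, r = 2).

```python
import math


def se_block(u, w1, w2):
    """Apply squeeze, excitation, and scale to C x H x W feature maps.

    u  : list of C channels, each an H x W nested list
    w1 : (C/r) x C weights of the reduction layer
    w2 : C x (C/r) weights of the expansion layer
    """
    # Squeeze: global average pooling gives one descriptor z_c per channel.
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in u]
    # Excitation: fully connected -> ReLU -> fully connected -> Sigmoid.
    hidden = [max(0.0, sum(w * zc for w, zc in zip(row, z))) for row in w1]
    s = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
         for row in w2]
    # Scale: reweight every channel by its learned importance s_c.
    x = [[[s_c * v for v in row] for row in ch] for s_c, ch in zip(s, u)]
    return x, s


u = [[[1.0, 1.0], [1.0, 1.0]],   # channel 0, mean 1
     [[3.0, 3.0], [3.0, 3.0]]]   # channel 1, mean 3
w1 = [[0.5, 0.5]]                # 1 x 2 reduction
w2 = [[0.0], [1.0]]              # 2 x 1 expansion
x, s = se_block(u, w1, w2)
print(s)
```

With these weights the hidden activation is 2.0, so channel 0 receives the neutral weight sigmoid(0) = 0.5 while channel 1 receives sigmoid(2), which is how the block emphasizes some channels over others.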

Boundary Loss Function
IoU [31], the intersection over union, is a common indicator in target detection. Its main functions are to separate positive samples from negative samples and to evaluate the distance between the output box and the correct label. IoU is a ratio and is therefore scale invariant, that is, insensitive to the scale of the target object. In the regression task, IoU is thus the most direct indicator of how well the output box matches the correct label. However, the definition of IoU has a problem: IoU is 0 whenever the two boxes do not intersect, so an IoU-based loss then provides no gradient, and learning cannot proceed. In addition, optimizing the usual BBox regression losses (MSE loss, smooth L1 loss, etc.) is not equivalent to optimizing IoU, the Ln norm is sensitive to the scale of the object, and IoU cannot directly optimize the non-overlapping part. To solve these problems, Rezatofighi et al. proposed GIoU [32], which allows an IoU-based measure to be used directly as the regression loss. The principle of GIoU is as follows:

$$GIoU = IoU - \frac{|C \setminus (A \cup B)|}{|C|}$$

where $A$ and $B$ are the prediction box and the ground-truth box, and $C$ is the smallest box enclosing both.
However, GIoU still has problems, such as unstable target box regression and a tendency to diverge during training. For some box configurations, the GIoU regression strategy may degenerate into the IoU regression strategy. To directly minimize the normalized distance between the anchor box and the target box, which gives faster convergence and makes the regression more accurate and faster when the anchor box overlaps or even contains the target box, Zheng et al. put forward DIoU and CIoU [33]. The principle of DIoU is as follows:

$$L_{DIoU} = 1 - IoU + \frac{\rho^2(b, b^{gt})}{c^2}$$

where $b$ and $b^{gt}$ represent the center points of the prediction box and the real box, respectively, $\rho$ represents the Euclidean distance between the two center points, and $c$ represents the diagonal length of the smallest enclosing region that contains both the prediction box and the real box.
Comparatively, DIoU fits the target box regression mechanism better than GIoU. When one box contains the other, or when the two boxes are aligned in the horizontal or vertical direction, the DIoU loss makes the regression very fast, while the GIoU loss almost degenerates into the IoU loss. DIoU can also replace the usual IoU criterion in NMS, making the NMS results more reasonable and effective.
The DIoU calculation considers only the overlapping area of the bounding boxes and the distance between the center points $b$ and $b^{gt}$; it does not take the aspect ratio into account. However, the consistency of the ratio of $w$ to $h$ between the anchor box and the target box is also highly significant. Based on this, the authors proposed the complete IoU (CIoU) loss.
The penalty term of CIoU adds an impact factor $\alpha v$ to the penalty term of DIoU, which takes into account how well the aspect ratio of the prediction box fits that of the target box. The specific principle is as follows:

$$L_{CIoU} = 1 - IoU + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha v$$

As shown in Figure 5, the upper left block represents the target box, the lower right block represents the prediction box, and the dashed block represents the smallest enclosing rectangle; $c$ and $d$ represent, respectively, the diagonal length of the smallest enclosing rectangle and the Euclidean distance between the center points of the two boxes, so that $d = \rho(b, b^{gt})$.
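The expressions of the weight function $\alpha$ and of the parameter $v$, which measures the consistency of the aspect ratio, are shown in Equations (7) and (8):

$$\alpha = \frac{v}{(1 - IoU) + v} \tag{7}$$

$$v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w^{p}}{h^{p}}\right)^2 \tag{8}$$

Among them, $w^{gt}$ and $h^{gt}$ represent the width and height of the target box, and $w^{p}$ and $h^{p}$ represent the width and height of the prediction box, respectively.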
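A compact numerical sketch of the CIoU loss is given below. It assumes boxes in corner format (x1, y1, x2, y2) and adds a small constant `eps` to avoid division by zero; the function name `ciou_loss`, the box format, and `eps` are assumptions made for illustration and are not taken from the paper's implementation.

```python
import math


def ciou_loss(pred, target, eps=1e-9):
    """CIoU loss, 1 - IoU + rho^2 / c^2 + alpha * v, for two boxes."""
    px1, py1, px2, py2 = pred
    tx1, ty1, tx2, ty2 = target
    wp, hp = px2 - px1, py2 - py1
    wg, hg = tx2 - tx1, ty2 - ty1

    # Intersection over union.
    inter_w = max(0.0, min(px2, tx2) - max(px1, tx1))
    inter_h = max(0.0, min(py2, ty2) - max(py1, ty1))
    inter = inter_w * inter_h
    union = wp * hp + wg * hg - inter
    iou = inter / (union + eps)

    # Squared distance rho^2 between the two box centers.
    rho2 = ((px1 + px2) / 2 - (tx1 + tx2) / 2) ** 2 \
        + ((py1 + py2) / 2 - (ty1 + ty2) / 2) ** 2

    # Squared diagonal c^2 of the smallest enclosing rectangle.
    cw = max(px2, tx2) - min(px1, tx1)
    ch = max(py2, ty2) - min(py1, ty1)
    c2 = cw ** 2 + ch ** 2

    # Aspect-ratio consistency v and its weight alpha.
    v = (4 / math.pi ** 2) * (math.atan(wg / hg) - math.atan(wp / hp)) ** 2
    alpha = v / ((1 - iou) + v + eps)

    return 1 - iou + rho2 / (c2 + eps) + alpha * v


same = ciou_loss((0, 0, 2, 2), (0, 0, 2, 2))
apart = ciou_loss((0, 0, 1, 1), (2, 2, 3, 3))
print(same, apart)
```

For the two disjoint boxes the IoU is 0, yet the loss still depends on the center distance (8/18 here), which is the property that lets DIoU and CIoU provide a training signal where plain IoU cannot.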

Experimental Setup

Dataset Production and Preprocessing
From September 2019 to December 2019, three different types of kiwifruit were randomly collected at Ya'an Hongming Farm. The types vary in size and shape. To improve the effectiveness of training and increase the diversity of samples, the collected image data were screened before training. The images were annotated with LabelImg, a software tool for annotating image labels. Finally, 1600 images were obtained and stored in JPG format with a resolution of 6000 px × 4000 px. Next, 1000 pictures were randomly selected as the training set. The dataset was augmented by adaptive contrast, rotation, translation, cropping, and other methods and was thereby expanded to 2000 images. The defects were divided into 4 categories: disease, mold, speckle, and deformation. Then, 300 pictures were randomly selected as the validation set, and 2200 pictures were annotated as kiwifruit. The remaining 300 unlabeled kiwifruit images were used as the test set. The dataset is shown in Figure 6.
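The random split described above (1000 training, 300 validation, and 300 test images out of 1600) could be reproduced along the following lines. This is only a sketch: the file names, the fixed seed, and the function name `split_dataset` are hypothetical, and the augmentation that expands the training data to 2000 images is not modeled.

```python
import random


def split_dataset(items, n_train=1000, n_val=300, n_test=300, seed=0):
    """Shuffle once, then cut disjoint training, validation, and test sets."""
    assert n_train + n_val + n_test <= len(items)
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:n_train + n_val + n_test]
    return train, val, test


# Hypothetical file names standing in for the 1600 collected images.
images = [f"kiwi_{i:04d}.jpg" for i in range(1600)]
train, val, test = split_dataset(images)
print(len(train), len(val), len(test))
```

Shuffling once and slicing guarantees that no image appears in more than one subset, and fixing the seed makes the split repeatable.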
Migration Network Initialization
Transfer learning is a common machine learning method whose key idea is to transfer knowledge learned in one field to a new field. In this paper, a model is first pretrained, and the result is then migrated to the YOLOv5 network for kiwifruit flaw detection to help train the detection model. To initialize the model parameters for a small training set, a pretrained network model with good learning ability is selected. Since the kiwifruit flaw image samples in this paper are limited, transfer learning is used to initialize the parameters of the YOLOv5 network, which ensures that the learned knowledge is carried over and enables the new network to learn quickly. In this way, the overfitting caused by insufficient kiwifruit samples can be alleviated to a certain degree. At the same time, the generalization ability of kiwifruit flaw detection improves correspondingly, so that the model retains good recognition ability even under complex natural conditions. To perform transfer learning, we need to understand the available datasets, because the field of image deep learning has many datasets, each with its own characteristics. This paper selects one of the most common and widely used datasets, ImageNet, which has proven outstanding for image classification, detection, localization, and other tasks.

CosineAnnealing
CosineAnnealing differs from traditional learning rate schedules. The learning rate decreases rapidly as the epoch count increases, and the model finds a local optimum and saves the current model. After that, the learning rate abruptly increases to a larger value, so that the model escapes from the current local optimum and finds a new one. This process is repeated, adjusting the learning rate cycle by cycle until training is completed. In Equation (6), l_min represents the minimum learning rate, l_init represents the initial learning rate, T_max represents a quarter of the change period of the learning rate, and l_new represents the new learning rate obtained.
In this training, T_max is set to 5, l_min is 0.00001, and the learning rate curve of the first 100 epochs is shown in Figure 7.
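Equation (6) is not reproduced in this text. As a hedged illustration, the sketch below uses the standard cosine annealing form, l_new = l_min + (l_init - l_min)(1 + cos(pi * epoch / T_max)) / 2, which is an assumption and may differ from the authors' exact expression. In this form the rate falls from l_init to l_min over T_max epochs. The second function adds a restart every T_max epochs to mimic the abrupt increase described above; the restart rule and both function names are also assumptions. The defaults use the values reported in this paper (T_max = 5, l_min = 0.00001, base learning rate 0.01).

```python
import math


def cosine_annealing_lr(epoch, l_init=0.01, l_min=0.00001, t_max=5):
    """Standard cosine annealing: l_init at epoch 0, l_min at epoch t_max."""
    return l_min + 0.5 * (l_init - l_min) * (1 + math.cos(math.pi * epoch / t_max))


def cosine_annealing_with_restarts(epoch, l_init=0.01, l_min=0.00001, t_max=5):
    """Assumed restart variant: the schedule jumps back to l_init every t_max epochs."""
    return cosine_annealing_lr(epoch % t_max, l_init, l_min, t_max)


# Print the first two cycles of both variants for comparison.
for epoch in range(11):
    smooth = cosine_annealing_lr(epoch)
    restart = cosine_annealing_with_restarts(epoch)
    print(f"epoch {epoch:2d}  smooth {smooth:.6f}  restart {restart:.6f}")
```

Without restarts, the cosine continues past epoch T_max and the rate climbs smoothly back to l_init at epoch 2 * T_max. With restarts, the rate jumps back to l_init at every multiple of T_max.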

Experimental platform
The model was trained on the Windows 10 operating system with the PyTorch framework. The CPU of the test equipment is an Intel® Core™ i9-10900K CPU@3.70 GHz, the GPU is a GeForce RTX 5000 16 G, and the software environment is CUDA 10.1, cuDNN 7.6, and Python 3.7.
The original YOLOv5 and the improved YOLOv5 were trained separately. The parameters were set as follows: the maximum number of iterations was 1000, the momentum was 0.95, and the base learning rate for CosineAnnealing was 0.01.

Model Evaluation Indicators
This paper uses precision (P), recall (R), and mean average precision (mAP) to evaluate the performance of the kiwi flaw detection model. The expressions of P and R are as follows:
$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}$$
Among them, true positives (TP), false positives (FP), and false negatives (FN) represent, respectively, positive samples that are correctly classified, negative samples that are incorrectly classified as positive, and positive samples that are incorrectly classified as negative.
AP is the average precision, which is the integral of P over R, that is, the area under the P-R curve; mAP is the mean average precision, obtained by summing the AP values of all categories and dividing by the number of categories. They are defined as follows:
$$AP = \int_0^1 P(R)\,dR, \qquad mAP = \frac{1}{Q_R} \sum_{q=1}^{Q_R} AP(q)$$
where Q_R is the number of categories.
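The indicators above can be computed with a short sketch. It assumes that the area under the P-R curve is approximated with the trapezoidal rule over a few sampled points, which is a simplification (detection toolkits typically use an interpolated variant). The counts, curve points, and class names are made up for illustration.

```python
def precision_recall(tp, fp, fn):
    """P = TP / (TP + FP), R = TP / (TP + FN); returns 0.0 for an empty denominator."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def average_precision(recalls, precisions):
    """Approximate the area under the P-R curve with the trapezoidal rule.

    `recalls` must be sorted in ascending order and aligned with `precisions`.
    """
    area = 0.0
    for i in range(1, len(recalls)):
        width = recalls[i] - recalls[i - 1]
        height = (precisions[i] + precisions[i - 1]) / 2
        area += width * height
    return area


def mean_average_precision(ap_per_class):
    """Average the per-class AP values over the number of classes."""
    return sum(ap_per_class.values()) / len(ap_per_class)


# Made-up counts and curve points, for illustration only.
p, r = precision_recall(tp=8, fp=2, fn=2)
ap = average_precision([0.0, 0.5, 1.0], [1.0, 1.0, 0.5])
m_ap = mean_average_precision({"defect_a": 0.9, "defect_b": 0.7})
print(p, r, ap, m_ap)
```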

Experimental Results
In order to judge the quality of the detection model accurately, the evaluation in this paper is based on the loss function curve (Loss) and average accuracy value (mAP).
During the network training process, the loss function can intuitively reflect whether the network model converges stably as the number of iterations increases. The loss curve of the model is shown in Figure 8. The figure shows that as the number of iterations increases, the loss of the improved YOLOv5 algorithm gradually converges and becomes smaller and smaller. When the model has been iterated 600 times, the loss value is basically stable and has dropped to near 0, and the network has basically converged. Compared with the original YOLOv5, the regression is faster and more accurate, which demonstrates the effectiveness of the model.
The mAP is used to measure the quality of the defect detection model. The higher the value, the higher the average detection accuracy and the better the performance. Figure 9 shows that after about 200 iterations of the YOLOv5-Ours model, the mAP reaches about 94% and then gradually stabilizes, reaching a maximum of 98%, indicating that the improved YOLOv5 model has a high average accuracy rate for defect detection. The overall model performance has met and even exceeded expectations.

Analysis
Figure 10 shows part of the detection results of the original YOLOv5 network and the YOLOv5-Ours network on the kiwifruit dataset, for different defect categories and defect sizes. As the results show, our improved YOLOv5 can accurately detect defects under difficult conditions, such as tiny defects, and the predicted bounding boxes are positioned more accurately. Embedding SELayer suppresses unimportant features and significantly improves the robustness of the model, which demonstrates the effectiveness of the network.
With the IoU threshold set to 50%, the mAP@0.5 of the original YOLOv5 is 85%, and the mAP@0.5 of the improved YOLOv5 is 94.7%. Table 1 shows the accuracy comparison between the original model and the improved one. According to Table 1, the improved model raises mAP by nearly 9%. Testing shows that despite the increased complexity of the model, the improved network still takes only 0.1 s to detect a single image, which meets the requirement of real-time detection.
Table 2 shows that, compared with mainstream detection algorithms, our network has a higher mAP. Although Fast R-CNN performs well on mAP, it takes 0.79 s to detect a single image, which cannot meet the requirements of real-time detection.
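Both the mAP@0.5 criterion and the CIoU regression loss build on the overlap between a predicted box and a ground-truth box. The sketch below is a minimal illustration and not the authors' implementation. It assumes boxes given as (x1, y1, x2, y2) corners and uses the CIoU definition commonly given in the literature: 1 - IoU, plus a normalized center distance term, plus an aspect ratio term. The function names and example boxes are hypothetical.

```python
import math


def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def ciou_loss(pred, gt, eps=1e-9):
    """CIoU loss: 1 - IoU + center distance penalty + aspect ratio penalty."""
    iou = box_iou(pred, gt)

    # Squared distance between the two box centers.
    pred_cx, pred_cy = (pred[0] + pred[2]) / 2, (pred[1] + pred[3]) / 2
    gt_cx, gt_cy = (gt[0] + gt[2]) / 2, (gt[1] + gt[3]) / 2
    rho2 = (pred_cx - gt_cx) ** 2 + (pred_cy - gt_cy) ** 2

    # Squared diagonal of the smallest box enclosing both boxes.
    enclose_w = max(pred[2], gt[2]) - min(pred[0], gt[0])
    enclose_h = max(pred[3], gt[3]) - min(pred[1], gt[1])
    c2 = enclose_w ** 2 + enclose_h ** 2 + eps

    # Aspect ratio consistency term and its trade-off weight.
    pred_w, pred_h = pred[2] - pred[0], pred[3] - pred[1]
    gt_w, gt_h = gt[2] - gt[0], gt[3] - gt[1]
    v = (4 / math.pi ** 2) * (
        math.atan(gt_w / (gt_h + eps)) - math.atan(pred_w / (pred_h + eps))
    ) ** 2
    alpha = v / (1 - iou + v + eps)

    return 1 - iou + rho2 / c2 + alpha * v


# Made-up boxes: a prediction counts as a match at mAP@0.5 if its IoU is at least 0.5.
ground_truth = (0.0, 0.0, 10.0, 10.0)
prediction = (0.0, 0.0, 10.0, 6.0)
overlap = box_iou(prediction, ground_truth)
print(overlap >= 0.5, ciou_loss(prediction, ground_truth))
```

Unlike a plain IoU loss, the center distance term still gives a useful signal when the boxes do not overlap at all, which is one reason such a loss can make box regression more stable.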

Discussion
This paper explores a real-time automatic detection method for kiwifruit defects. To meet farmers' need to know the state of their kiwifruit at any time, we study the YOLOv5 model in greater depth. Adding a small target detection layer improves the ability to detect small defects. SELayer was embedded to enhance useful features and suppress less important ones. CIoU was used as the loss function to make the regression more stable. The feasibility of this method is discussed as follows:

• In terms of processing accuracy, the images in the dataset of this study were captured manually; hence, the background information is relatively simple, and accuracy may drop under more complex backgrounds. However, this research targets controlled industrial scenes rather than natural ones, so complex backgrounds are not expected in practical application.

• In terms of processing speed, the images collected by the camera must be processed fast enough to meet the real-time needs of farmers. Our first consideration was to use an object detection model instead of models such as instance segmentation or semantic segmentation, both of which are relatively slow. To detect multiple objects, we chose YOLOv5, an advanced single-stage method in the field of object detection. Compared with two-stage methods, single-stage methods have a higher processing speed in the same hardware environment. The reasons for preferring YOLOv5 over other one-stage methods (such as YOLOv2) have been described in Section 2.1. The optimized YOLOv5 network structure is more complex, so its detection speed is lower than that of the original YOLOv5, but a single image still takes only 0.1 s, which meets the above requirements.

• In terms of model generalization ability, YOLOv5 uses a mosaic data enhancement strategy to improve the model's generalization ability and robustness.
Based on the above discussion, we believe that the proposed method is an effective exploration and can promote the development of postproduction reprocessing of crops.

Conclusions and Future Work
In this research, deep learning technology was applied to kiwi flaw detection. Based on YOLOv5, a high-precision kiwi flaw detection method was proposed. First, a kiwifruit dataset containing four types of defects was collected. As far as we know, this is the first kiwifruit defect dataset in the world and even the first agricultural product postproduction defect dataset. At the same time, this is the first time that the YOLOv5 network has been applied to crops. Then, YOLOv5 was improved: a small target detection layer was added to the backbone network, and SELayer was embedded to improve the feature extraction ability of the model. In addition, we replaced the DIoU loss function with the CIoU loss function to improve the localization accuracy of the predicted bounding boxes and enhance model convergence. Compared with the original YOLOv5 model, mAP@0.5 increases by 9%. The model can detect a single image in only 0.1 s (based on a GPU 1050Ti) and is more robust to the environment, which demonstrates the effectiveness of the model and provides farmers with more efficient and intelligent postproduction reprocessing strategies. This paper mainly studies kiwifruit defect detection under the requirement of real-time operation. However, fast detection still requires a specific hardware configuration. In the future, we will continue to optimize YOLOv5-Ours and use pruning technology to compress the model. At the same time, we will extend the research to more kiwifruit varieties and broaden the scope of application.