Detection of Dense Citrus Fruits by Combining Coordinated Attention and Cross-Scale Connection with Weighted Feature Fusion

The accurate detection of individual citrus fruits in a citrus orchard environment is one of the key steps in realizing precision agriculture applications such as yield estimation, fruit thinning, and mechanical harvesting. This study proposes an improved YOLOv5 object detection model to achieve the accurate identification and counting of citrus fruits in an orchard environment. First, the latest visual attention mechanism, the coordinated attention module (CA), was inserted into an improved backbone network to focus on fruit-dense regions and recognize small target fruits. Second, an efficient two-way cross-scale connection and weighted feature fusion network, BiFPN, was used in the neck network to replace the PANet multiscale feature fusion network, assigning corresponding weights to the features to fully fuse the high-level and low-level features. Finally, the varifocal loss function was used to calculate the model loss for better model training results. The results of the experiments on four varieties of citrus trees showed that the improved model proposed in this study could effectively identify dense small citrus fruits. Specifically, the recognition AP (average precision) reached 98.4%, and the average recognition time was 0.019 s per image. Compared with the original YOLOv5 (including the different variants n, s, m, l, and x), the increase in the average precision of the improved YOLOv5 ranged from 7.5% to 0.8% while maintaining a similar average inference time. Four different citrus varieties were also tested to evaluate the generalization performance of the improved model. The method can be further used as part of a vision system to provide technical support for the real-time and accurate detection of multiple fruit targets during mechanical picking.


Introduction
Early yield estimates of fruits in orchards can help to plan subsequent fertilization and other operations more accurately. Currently, precision agriculture faces various challenges to its development. First, the proper selection and application of models among the various available models is important. Moreover, following the advanced techniques in computer vision and deep learning, the influence of factors of the natural environment on the application of these techniques can be studied [1]. With the rapid progress of artificial intelligence (AI), AI-driven technical tools and solutions [2] have shown their profitability and potentiality in addressing farming problems including monitoring crop status and field production management such as pest monitoring, fruit thinning, and mechanical harvesting [3]. Therefore, developing a low-cost, highly maneuverable computer vision system for small targets to perform fruit recognition on orchard trees is of great significance to precision agriculture.
Before deep learning technologies became popular, traditional computer vision techniques were often used to detect fruit. Simple visual features such as the circular Hough transform (CHT) [4], color thresholds [5], or a combination of color and shape features [6] were widely used in early efforts to detect fruits in images. Although these approaches can sometimes achieve convincing results, it is still a big challenge to achieve high accuracy under complex natural conditions [7]. Some practical factors such as illumination changes, crop variability, and occlusion problems need to be considered and addressed. Lu and Sang [8] proposed a method combining color and contour information to recognize citrus fruit in actual orchards. Although their method worked in a complex orchard environment, the detection effect was not ideal when citrus was heavily occluded by the tree canopy. Moreover, color/shape-based methods are often limited by specific feature selection and have low scalability.
In recent years, deep learning technologies have been widely researched and applied in various fields such as cyber security [9], defect detection [10], autonomous driving [11], and precision agriculture [12]. A series of deep learning-based methods and models for nondestructive object detection tasks have been developed in agricultural production management, among which the convolutional neural network (CNN) frameworks represented by Faster R-CNN [13] and YOLO [14] are the most widely adopted. In addition to these two types of frameworks, some scholars have also proposed CNN-based detection and classification networks such as Erel-net, which can detect and classify product defects with a classification accuracy of 77% [15]. Compared with traditional machine vision techniques, deep learning-based methods are more likely to solve complex practical problems that would challenge traditional methods. Deep learning is promising because it learns a satisfactory inference model from the samples and is no longer constrained by difficult feature selection.
In the agricultural domain, among the CNN-based detection frameworks, the two-stage target detection algorithm Faster R-CNN is one of the most widely used. For instance, Quan et al. proposed an improved Faster R-CNN detection model for the field robotic platform (FRP), which can quickly and accurately detect corn seedlings at different growth stages in complex field operations, with an accuracy rate of 97.71% [16]. Li et al. used focal loss to improve the Faster R-CNN detection model for automatic hydroponic lettuce seedling target detection, with an average accuracy of 86.2 [17]. Faster R-CNN can provide strong performance in detecting single objects and multiple objects at the same time. However, it suffers from its two-stage strategy, in which a region proposal network (RPN) generates high-quality region proposals and a classification network predicts the final object. It is not satisfactory for the recognition of small targets and overlapping objects, and its detection speed is slow when processing high-resolution images, which usually cannot meet the needs of real-time detection. In contrast, the YOLO framework uses a single-stage strategy and performs classification and regression directly to detect objects in images, thus accelerating the detection task. Kuznetsova et al. used YOLOv3 as a machine vision system to detect apples in orchards for a harvesting robot with an average detection time of 19 ms [18]. Ji et al. used the lightweight EfficientNet-B0 network to improve YOLOv4 for the accurate identification of apples in complex environments, with an average accuracy of 93.42 and a recognition speed of 63.20 frames per second [19]. Lu et al. proposed canopy-attention-YOLOv4 to accurately detect immature and ripe apples in a complex orchard environment, with a 3% improvement in fruit count compared to the original YOLOv4 [20].
In addition, researchers have detected and classified other types of fruits such as citrus, tomatoes, blueberries, cherries, and kiwis [21][22][23][24]. In the latest research, Lyu et al. improved YOLOv5s and proposed the YOLOv5-CS (Citrus Sort) model, which can accurately detect green citrus in the natural environment with an accuracy of 98.23. The model inference speed was 0.017 s on the server and 0.037 s on an Nvidia Jetson Xavier NX [25]. Chen et al. proposed a YOLOv5-based citrus fruit ripeness method to identify three ripeness levels of citrus fruits. The accuracy of this method was 95.07%, which is convenient for the selective harvesting of citrus picking robots [26]. Currently, most researchers have focused on improving the detection performance of a single growing period of a single variety of fruit. However, there are few studies on the identification of different ripening stages of multi-variety citrus. Therefore, it is necessary to provide an automatic detection algorithm that can not only ensure the detection speed, but also has high precision and robustness, and can meet the requirements of various citrus fruits in different growth stages.
The main purposes of this study included:
1. Improving the original YOLOv5 network architecture with a visual attention mechanism coordinated attention module (CA) and weighted feature fusion BiFPN;
2. Comparing the performance of the improved YOLOv5 with other commonly used object detectors including Faster R-CNN, YOLOv4, and original YOLOv5; and
3. Developing a robot fruit detector for precision agriculture applications such as yield estimation and mechanical harvesting, where such a detector can adapt to different varieties of citrus trees and different growth stages under a natural environment.

Citrus Image Collection
In this study, our ultimate goal was to develop an object detection algorithm for applications such as fruit thinning and harvesting. An Intel RealSense D455 depth camera, which has been widely used in these applications, was used for image acquisition. The shooting distance was between 0.5 m and 2 m. The image resolution was 1280 × 720 pixels.
Fruit images were collected from three experimental orchards of the Guangxi Academy of Specialty Crops, China. These orchards are located in Lingui City (110°7′32.3″ N, 25°7′34.5″ E), Guilin, China. The permission for sample collection was granted by the Guangxi Academy of Specialty Crops. In this experiment, four citrus varieties, "Kumquat", "Nanfeng tangerine", "Fertile tangerine", and "Shatang tangerine" were selected. Images were collected starting in spring when the diameter of the fruit was approximately 5 cm, until autumn in 2020 when the fruit was mature. Images were taken twice a week and four times a day, at 9:00 am, 11:00 am, 2:00 pm, and 4:00 pm. Finally, 500 images were collected from "Kumquat", 400 images from "Nanfeng tangerine", 300 images from "Fertile tangerine", and 300 images from "Shatang tangerine". The RGB images of the four varieties of citrus are shown in Figure 1.

Image Processing and Augmentation
LabelImg was used as the image annotation tool. When using LabelImg to mark the citrus fruits, the position and classification of each citrus fruit were marked. During the labeling process, dense and occluded fruits were also labeled separately.
Data augmentation on the original dataset can increase the diversity and robustness of the experimental dataset. The Imgaug image data augmentation library was used to augment the citrus datasets. Considering the actual interference factors of the natural environment, several processes were used in the augmentation including rotating the original image, adjusting the image color, and adding noise. By performing the augmentation described above, a dataset containing 16,000 images was created and used to train and test the YOLOv5 model. The details of the citrus dataset are shown in Table 1. The local effects after image enhancement are shown in Figure 2.
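The three augmentation operations named above (rotation, color adjustment, and added noise) can be illustrated with a minimal standard-library sketch. This is not the Imgaug pipeline used in the study: the function names `rotate90`, `adjust_brightness`, and `add_gaussian_noise`, the 90-degree rotation, the brightness factor, and the noise level are all assumptions chosen for illustration, and an image is represented simply as rows of (R, G, B) tuples.

```python
import random

def rotate90(image):
    """Rotate an H x W grid of pixels 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def clamp(value):
    """Round to an integer and keep it inside the valid 8-bit range."""
    return min(255, max(0, round(value)))

def adjust_brightness(image, factor):
    """Scale every color channel by `factor` (a simple color adjustment)."""
    return [[tuple(clamp(c * factor) for c in px) for px in row] for row in image]

def add_gaussian_noise(image, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation `sigma` to each channel."""
    return [[tuple(clamp(c + rng.gauss(0, sigma)) for c in px) for px in row]
            for row in image]

# A tiny 2 x 2 RGB "image" used only to exercise the functions.
image = [[(100, 200, 250), (10, 20, 30)],
         [(0, 0, 0), (255, 255, 255)]]
rng = random.Random(0)  # fixed seed so the noisy copy is reproducible
augmented = [rotate90(image),
             adjust_brightness(image, 1.2),
             add_gaussian_noise(image, 10, rng)]
```

In a real detection dataset, geometric operations such as rotation must also be applied to the bounding-box annotations, which augmentation libraries typically handle alongside the image.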

YOLOv5 Object Detection Network
YOLOv5 offers the best overall performance among the target detection models in the YOLO architecture series. Currently, there are several variants of YOLOv5 (n, s, m, l, and x). For simplicity, we use "YOLOv5" in the following to denote the standard version, YOLOv5l. As a single-stage detection model, YOLOv5 has high detection accuracy and fast detection speed. YOLOv5s can reach 140 frames per second on an NVIDIA Tesla V100 GPU server. Compared with YOLOv4 [27], the training speed of the YOLOv5 model is faster, the weight file of the model is smaller, and it supports the batch processing of images. The model is developed based on the PyTorch framework, and the PyTorch weight file can easily be converted to the ONNX format for Android or to an iOS format, indicating that the YOLOv5 model is suitable for deployment on mobile devices and is easier to put into a production environment. Therefore, the advantage of the YOLOv5 model lies in its high detection accuracy, light weight, fast detection speed, and easy deployment. The accuracy, real-time performance, and light weight of the fruit detection model are crucial to the recognition of citrus fruit in an orchard in the natural environment. This study therefore bases its detector on the YOLOv5 architecture to detect dense small citrus fruit.
As shown in Figure 3, the YOLOv5 framework includes three parts: the backbone network, neck network, and detection network. The function of the backbone network is to extract feature maps from the input image by multiple convolutions and merging. As shown in Figure 3, three feature maps were generated in the backbone network, with sizes of 80 × 80, 40 × 40, and 20 × 20. The neck network mainly fuses the feature maps of different scales generated by the backbone network to output a new enhanced feature map to obtain more contextual information and reduce the loss of information [28]. In the merging process, the feature pyramid structure of FPN and PANet is adopted. Strong semantic features are transferred from the top feature map to the bottom feature map by the FPN structure. At the same time, strong localization features are transferred from lower feature maps to higher feature maps by PANet. Together, FPN and PANet enhance the feature fusion ability of the neck network. Finally, the detection network is used to provide the detection result. The detection network of YOLOv5 consists of three detection layers, with output feature maps of 80 × 80, 40 × 40, and 20 × 20, which are used to detect objects in the input image. Each detection layer ultimately outputs a 21-channel vector and then generates and marks the predicted bounding box and category of the target in the original input image to achieve the detection and classification of the final targets.
In this architecture, the Focus module slices and connects images. This module aims to reduce the number of model calculations and speed up training. First, it uses a slice operation to divide the input 3-channel image into four slices. Second, a concat operation is used to concatenate the four slices along the depth (channel) dimension, and then a convolutional layer is used to generate the output feature map, as shown in Figure 4a.
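The slice-and-concat step of the Focus module can be sketched as a space-to-depth rearrangement. The following is a minimal standard-library illustration, not the YOLOv5 source code: the function name `focus_slice` is hypothetical, the image is a nested list indexed as `[channel][row][column]`, the order of the four slices is an assumption, and the convolution that follows the concat is omitted.

```python
def focus_slice(image):
    """Rearrange a (C, H, W) image into (4C, H/2, W/2) by pixel parity.

    Each of the four slices keeps every second pixel, starting from a
    different (row, column) offset; the slices are then stacked in depth.
    H and W are assumed to be even.
    """
    stacked = []
    # Assumed slice order: (even row, even col), (odd row, even col),
    # (even row, odd col), (odd row, odd col).
    for row_offset, col_offset in [(0, 0), (1, 0), (0, 1), (1, 1)]:
        for channel in image:
            stacked.append([row[col_offset::2] for row in channel[row_offset::2]])
    return stacked

# A 3-channel 4 x 4 test image whose values encode channel, row, and column.
image = [[[c * 100 + r * 10 + col for col in range(4)] for r in range(4)]
         for c in range(3)]
sliced = focus_slice(image)
# 3 input channels become 12 channels at half the spatial resolution.
shape = (len(sliced), len(sliced[0]), len(sliced[0][0]))
```

No pixel is discarded in this step: the spatial resolution is halved while the channel count is multiplied by four, so the subsequent convolution operates on a smaller grid.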
CBL is a standard convolutional layer, which is composed of convolution, normalization, and SiLU activation function modules, as shown in Figure 4b. The C3 module is used in the latest YOLOv5, aiming to improve the inference speed by reducing the size of the model while maintaining accuracy and better extracting the deep features of the image. The initial input of the C3 module has two branches, and the number of channels of the feature map is halved by the convolution operation in each of the two branches. Then, after the second branch passes through the bottleneck module and CONV layer, the output feature maps of the two branches are concatenated in depth by the concat operation. Finally, the result passes through a CONV layer again to generate the output feature map of the module, as shown in Figure 4c,d.
The SPP module is located at the penultimate layer of the backbone network (see Figure 4e). This module aims to enlarge the receptive field by converting feature maps of arbitrary size into feature vectors of fixed size. First, the feature map is passed through a convolutional layer with a kernel size of 1 × 1. This feature map is then concatenated in depth with the output feature maps of three parallel max pooling layers. Finally, the final output feature map is obtained from a convolutional layer.
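The parallel pooling and depth concatenation in the SPP module can be sketched as follows. This is a simplified illustration under stated assumptions: the kernel sizes 5, 9, and 13 and the stride-1, size-preserving pooling follow common YOLOv5 implementations rather than anything stated in this paper, the 1 × 1 convolutions before and after the pooling are omitted, and the function names are hypothetical.

```python
def max_pool_same(channel, kernel):
    """Stride-1 max pooling that preserves H and W.

    Border pixels simply use the part of the window that lies inside the
    map, which is equivalent to padding with minus infinity.
    """
    height, width = len(channel), len(channel[0])
    radius = kernel // 2
    return [[max(channel[i][j]
                 for i in range(max(0, y - radius), min(height, y + radius + 1))
                 for j in range(max(0, x - radius), min(width, x + radius + 1)))
             for x in range(width)]
            for y in range(height)]

def spp_concat(feature_map, kernel_sizes=(5, 9, 13)):
    """Concatenate the input channels with their pooled versions in depth."""
    output = list(feature_map)
    for kernel in kernel_sizes:
        output.extend(max_pool_same(channel, kernel) for channel in feature_map)
    return output

feature_map = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]],
               [[0, -1, -2], [5, 0, 0], [0, 0, 0]]]
pooled = spp_concat(feature_map)
```

With two input channels and three pooling branches, the output has eight channels of unchanged spatial size; each pooled channel summarizes a progressively larger neighborhood, which is how the module widens the receptive field.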

Improvement to the YOLOv5
For citrus fruit detection in natural, outdoor orchard environments, many complicated factors must be considered. Among them, a major factor that hinders the accuracy of the model detection is that a citrus fruit is a small target in the collected citrus plant images. The measurement and evaluation of the MS-COCO dataset mentioned the standard of small objects [29]. In an image of 640 × 640, the area of the object that belongs to the category of "small target" is less than or equal to 32 × 32 pixels. In the collected citrus plant images of 1280 × 720 pixels, the pixels occupied by a single fruit were less than or equal to 100 × 100. Before entering the network, the pixels occupied by a single fruit are less than or equal to 32 × 32 as the images are resized to 640 × 640. By this definition, citrus fruit can be regarded as a small object in the image, and this study is therefore a small object detection task. Due to the characteristics of small objects with low pixels, low resolution, and weak ability to express features, the recognition effect of small objects is often not as good as that of regular-sized objects. Currently, many algorithms can be used for small object detection such as generating super-resolution feature representations, presenting attention mechanisms [30], introducing context information [31], and dealing with dataset differences. Based on the above ideas, in this study, we improved the original YOLOv5 detection model to adapt to small object detection tasks.

Improvement on the Network Structure
Appl. Sci. 2022, 12, 6600
First, we embedded the coordinated attention mechanism into the backbone network, enabling it to make targeted choices when extracting features from input samples [32]. The attention mechanism improves the model performance by assigning higher weights to features that are beneficial to network model training, and assigning lower weights to features that have no effect or even an adverse effect on training. The object detection algorithm is expected to be able to accurately identify dense small fruits in various situations in complex orchard environments. To improve the detection accuracy on fruits in a natural environment, the attention mechanism is commonly used in the design of the target detection network to better extract the features of citrus. Coordinate attention is the latest attention mechanism, and the principle is shown in Figure 5a. Channel attention works by converting the feature tensor into a single feature vector through 2D global pooling, whereas coordinate attention factorizes channel attention into two parallel 1D feature encoding processes, for the purpose of integrating spatial coordinate information with the generated attention maps. In this way, the network can capture the long-distance dependencies along one spatial direction, while retaining accurate position information along the other spatial direction. Based on the above process, a pair of direction-aware and position-sensitive attention maps is then generated from the obtained feature maps. The feature map output by the coordinated attention mechanism is shown in Figure 5b. These attention maps can be applied as complementary information to the input feature maps to strengthen the representation of the object of interest. Coordinate attention captures the cross-channel information, direction perception, and position-sensitive information concurrently, which helps the detector find and identify objects of interest more accurately. The coordinated attention mechanism is flexible and lightweight. As such, in this study, we embedded it into the backbone network of the YOLOv5 architecture to improve the detection performance for dense and small objects.
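The difference between 2D global pooling and the two 1D encodings used by coordinate attention can be made concrete with a small sketch. This is a deliberately simplified illustration, not the module used in the study: the learned convolutions and normalization that sit between the pooling and the sigmoid gates are omitted (an assumption made to keep the example short), and the function names are hypothetical.

```python
import math

def global_pool(channel):
    """2D global average pooling: one number per channel, position is lost."""
    values = [v for row in channel for v in row]
    return sum(values) / len(values)

def directional_pool(channel):
    """Two 1D poolings: average along the width and along the height."""
    height, width = len(channel), len(channel[0])
    pooled_h = [sum(row) / width for row in channel]  # one value per row
    pooled_w = [sum(channel[y][x] for y in range(height)) / height
                for x in range(width)]                # one value per column
    return pooled_h, pooled_w

def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))

def coordinate_reweight(channel):
    """Gate each pixel by its row attention and its column attention."""
    pooled_h, pooled_w = directional_pool(channel)
    gate_h = [sigmoid(v) for v in pooled_h]
    gate_w = [sigmoid(v) for v in pooled_w]
    return [[value * gate_h[y] * gate_w[x] for x, value in enumerate(row)]
            for y, row in enumerate(channel)]

# A single strong response at row 1, column 1.
channel = [[0.0, 0.0, 0.0],
           [0.0, 6.0, 0.0]]
pooled_h, pooled_w = directional_pool(channel)
reweighted = coordinate_reweight(channel)
```

Global pooling reduces this channel to a single average and discards where the response occurred, whereas the two 1D vectors peak at row 1 and column 1 respectively, so their product localizes the response.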
Second, we used an efficient two-way cross-scale connection and weighted feature fusion network, BiFPN, in the neck network to replace the PANet multiscale feature fusion network, as shown in Figure 6. The low-level feature map commonly has a higher resolution and contains more information about the location and details of the target object. However, because fewer features have been extracted by the convolutional layers, the semantics of the low-level feature maps are weaker, and these feature maps also contain more noise. While higher-level feature maps are rich in semantic information, they have low resolution and relatively insufficient perception of the image details. Therefore, we can use BiFPN to make full use of the context information of the low-level and high-level features and effectively merge them to improve the model detection performance. In BiFPN, the following optimizations of the cross-scale connections are used: (1) nodes that have only one input edge are deleted; (2) an additional edge is inserted between the original input node and the output node if the two nodes are located at the same level, to fuse more features without adding much cost; and (3) each bidirectional (top-down and bottom-up) path is treated as one feature network layer, and the same layer is repeated multiple times to enable more high-level feature fusion [33]. When input features of different scales are used for feature fusion, their contributions to the output features are usually unequal. When BiFPN merges features with different resolutions, it adds an extra weight to each input that enables the network to learn the contribution of each input feature. To ensure that the model training converges quickly, BiFPN uses fast normalized weighted feature fusion:
$$O = \sum_{i} \frac{w_i}{\epsilon + \sum_{j} w_j} \cdot I_i$$
where $I_i$ is the $i$-th input feature, $w_i$ is its learnable weight, and $O$ is the fused output feature. The ReLU function is applied after each $w_i$ to ensure that $w_i \geq 0$, and $\epsilon = 0.0001$ is a small value to avoid numerical instability [33].
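Fast normalized weighted fusion, as described above, can be sketched in a few lines. The example below is a minimal illustration rather than the network code: the function name `fast_normalized_fusion` is hypothetical, the weights are passed in as plain numbers instead of being learned, and the input feature maps are assumed to have already been resized to the same resolution.

```python
def fast_normalized_fusion(features, raw_weights, eps=1e-4):
    """Fuse same-sized feature maps with non-negative, normalized weights.

    features:    list of H x W maps (nested lists) at the same resolution
    raw_weights: one scalar per input map (learnable in a real network)
    eps:         small constant that keeps the denominator away from zero
    """
    weights = [max(0.0, w) for w in raw_weights]  # ReLU keeps each weight >= 0
    norm = eps + sum(weights)
    height, width = len(features[0]), len(features[0][0])
    return [[sum(w * f[y][x] for w, f in zip(weights, features)) / norm
             for x in range(width)]
            for y in range(height)]

low_level = [[1.0, 2.0]]
high_level = [[3.0, 4.0]]
# Equal weights give (almost exactly) the average of the two inputs.
fused_equal = fast_normalized_fusion([low_level, high_level], [1.0, 1.0])
# A negative raw weight is clipped to zero, so that input is ignored.
fused_gated = fast_normalized_fusion([low_level, high_level], [1.0, -1.0])
```

Because the weights are normalized by a simple sum rather than a softmax, each normalized weight stays between 0 and 1 without computing any exponentials.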
The new network structure after adding the coordinated attention mechanism, efficient two-way cross-scale connection, and weighted feature fusion is shown in Figure 7.

Improvement on Loss Function
In this research, we used the varifocal loss function to replace the focal loss function, which was used in the original YOLOv5, to obtain better model training results. Focal loss distinguishes two kinds of samples: those that are difficult to classify (described as 'difficult samples' below) and those that are easy to classify (described as 'easy samples' below). During learning, focal loss focuses on difficult samples by reducing the weight of easy samples. The formula is expressed as:

$$FL(p, y) = \begin{cases} -\alpha (1 - p)^{\gamma} \log(p), & y = 1 \\ -(1 - \alpha)\, p^{\gamma} \log(1 - p), & \text{otherwise} \end{cases}$$

where $p$ is the predicted probability, $y$ is the ground truth label, $(1 - p)^{\gamma}$ is the factor that reduces the loss contribution of easily separable samples, and $\gamma$ is a hyperparameter [34]. The larger the value of $\gamma$, the greater the difference between the loss contributions of difficult samples and easy samples. When $\gamma$ is equal to 0, the focal loss function degenerates into a cross-entropy loss. The varifocal loss function is expressed as:

$$VFL(p, q) = \begin{cases} -q \left( q \log(p) + (1 - q) \log(1 - p) \right), & q > 0 \\ -\alpha\, p^{\gamma} \log(1 - p), & q = 0 \end{cases}$$

where $p$ represents the predicted IoU-aware classification score (IACS) [35], and $q$ represents the IoU score [35] between the generated bounding box and its ground truth (actual category label and actual accurate position). For positive samples in training, $q$ was set to the IoU score between the generated bounding box and the ground truth box, while for negative samples, the training target $q$ of all categories was 0. Because the loss of a positive sample is weighted by $q$, the training focuses on the high-quality candidate detections with a higher IACS. $\alpha$ and $\gamma$ are hyperparameters. $\alpha$ is a scale parameter between 0 and 1 that adjusts the balance of the loss between positive and negative samples, so that training can focus less on the negative samples. By applying this loss function, the model can make judgments and trade-offs between difficult samples and easy samples, which can further increase the loss contribution of difficult samples.
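To make the behavior of the two losses concrete, the following is a minimal per-sample sketch in plain Python. It assumes the commonly published forms of focal loss and varifocal loss; the function names, the box format (x1, y1, x2, y2), and the default values of α and γ are illustrative assumptions, not settings reported in this study.

```python
import math


def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Focal loss for one prediction p in (0, 1) with binary label y."""
    if y == 1:
        # (1 - p)**gamma shrinks the loss of easy, confidently correct positives.
        return -alpha * (1.0 - p) ** gamma * math.log(p)
    return -(1.0 - alpha) * p ** gamma * math.log(1.0 - p)


def varifocal_loss(p, q, alpha=0.75, gamma=2.0):
    """Varifocal loss for predicted score p in (0, 1) and target IoU q in [0, 1]."""
    if q > 0:
        # Positive sample: cross-entropy against the soft target q, weighted by q,
        # so boxes that overlap the ground truth well contribute more.
        return -q * (q * math.log(p) + (1.0 - q) * math.log(1.0 - p))
    # Negative sample: down-weighted by alpha * p**gamma.
    return -alpha * p ** gamma * math.log(1.0 - p)


# Hypothetical boxes: a predicted box and its ground truth box.
q = box_iou((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0))
print(varifocal_loss(p=0.6, q=q))    # positive sample
print(varifocal_loss(p=0.6, q=0.0))  # negative sample
print(focal_loss(p=0.6, y=1))
```

In this sketch, only negative samples are scaled by the p^γ term, while positive samples are scaled by their IoU target, which reflects the asymmetric treatment of the two sample types described above.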

Experimental Environment
The following experiments were performed on a workstation equipped with an Intel Xeon Silver 4114 CPU, 64 GB of memory, an NVIDIA RTX 3090 graphics processor (24 GB of video memory), and Ubuntu 20.04 LTS. The PyTorch framework was used to build the YOLOv5 model. Python was used to write the program code and call the required libraries such as CUDA, cuDNN, and OpenCV.
In this research, the improved YOLOv5 network was trained with stochastic gradient descent (SGD) in an end-to-end manner. The input image resolution of the network was set to 640 × 640, the batch size for model training was set to 32, and the BN layers performed regularization each time the weights of the model were updated. The number of training epochs was set to 100, and the specific settings of the network training hyperparameters are shown in Table 2. After training, the weight file of the obtained detection model was saved, and the test set was then used to evaluate the performance of this model. The final output of the model was the bounding box of each identified fruit target, together with the ID of the category it belongs to and its probability.

Evaluation Indicators
The precision (P), recall (R), F1 score, mean average precision (mAP), number of model parameters, and detection time per image were used to assess the quality of the model. mAP is the mean of the average precision (AP) over all categories and is commonly used as a crucial indicator of the general quality of a learning model. The number of model parameters reflects the size of the model. The detection time per image measures how quickly the model processes an image and returns the final results. P, R, and F1 are computed as follows:

$$P = \frac{TP}{TP + FP}$$

$$R = \frac{TP}{TP + FN}$$

$$F_1 = \frac{2 \times P \times R}{P + R}$$

where TP is the number of positive samples correctly predicted by the model, TN is the number of negative samples correctly predicted by the model, FP is the number of negative samples incorrectly predicted as positive by the model, and FN is the number of positive samples incorrectly predicted as negative by the model [36].
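As a worked illustration of these indicators, the sketch below computes P, R, and F1 from detection counts using the standard definitions P = TP / (TP + FP), R = TP / (TP + FN), and F1 = 2PR / (P + R). The function name and the counts are hypothetical and are not results from this study.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from true positive, false positive,
    and false negative counts. Empty denominators yield 0.0."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    if precision + recall > 0:
        f1 = 2.0 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical counts: 90 fruits detected correctly, 10 false detections,
# and 30 fruits missed by the detector.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(f"P = {p:.3f}, R = {r:.3f}, F1 = {f1:.3f}")
```

TN does not appear in any of the three formulas, which is why object detection evaluations can report them without counting background regions.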

Experimental Results
When trained on the original citrus dataset, the proposed method based on the improved YOLOv5 achieved the highest mAP (mean average precision), 88.5%, compared with the different variants of YOLOv5. As shown in Table 3, YOLOv5x obtained the best mAP and YOLOv5l obtained the best F1 score among the variants of YOLOv5, while our improved YOLOv5 achieved better results on both the mAP and the F1 score, and its detection time for a single image was 0.019 s. Its number of parameters and amount of computation were similar to those of YOLOv5l and lower than those of YOLOv5x; the speed remained unchanged while the accuracy improved.
When trained on the augmented dataset, the improved YOLOv5 reached a higher detection accuracy of 98.4%, which was 5.4% higher than that of the YOLOv5l network, and the loss decreased significantly in the first 10 epochs. With the improved loss function, our YOLOv5 learned the features of hard-to-identify citrus samples faster and then converged quickly. The training results are shown in Figure 8.

Discussion
This study carried out qualitative and quantitative analysis through the following experiments:
1. Comparisons of the improved YOLOv5 model on citrus fruit images at different growth stages were performed to evaluate the fruit recognition accuracy and performance.
2. Comparisons of the improved YOLOv5 model with the other three commonly used deep learning-based models were performed on citrus images of the growth period and mature period.
3. Tests of the generalization of the improved YOLOv5 model were performed by evaluating its recognition of four different citrus varieties.

Dense Citrus Detection with the Improved YOLOv5
This experiment used the "Kumquat" dataset, covering different growth stages, to verify whether the improvements described above can improve the performance of the YOLOv5 model for dense small target detection. For testing, 20 images of the growing stage and 20 images of the mature stage were selected. The improved YOLOv5 model was used to detect the dense fruits in the plant canopy, and the number of fruits recognized by the model was compared with the ground truth. The experimental results showed that the improved model recognized citrus fruits at different growth stages better than the original YOLOv5 model, for which the standard YOLOv5l was used as the representative.
The quantitative comparison results of the above schemes are shown in Table 4, and the visual results of model recognition are shown in Figure 9. For "Kumquat" fruits at the growing stage, the improved YOLOv5 network detected more citrus fruits and identified them more correctly, while YOLOv5l detected fewer citrus fruits and produced more misidentifications. For mature "Kumquat" fruits, the improved YOLOv5 method also performed better than the original method. During the growing period, the "Kumquat" fruits were small, both the fruits and the leaves were green, and the color features were not obvious. Even so, the improved method significantly improved the detection ability of the YOLOv5 model for citrus, and the recognition accuracy was not affected by the color or individual features. "Kumquat" fruits at maturity had more obvious color features, were larger, and were less occluded. The quantitative analysis and comparison experiments showed that, with the CA attention mechanism and BiFPN added to enhance feature extraction, the improved YOLOv5 method significantly improved the detection of small target fruits.

Comparison with Other Commonly Used Object Models
The purpose of this experiment was to compare our improved YOLOv5 model with Faster R-CNN, YOLOv4, and YOLOv5l by training on the "Kumquat" data at different growth stages. Table 5 provides the details of the experimental results. The results demonstrated that the improved YOLOv5 model achieved better detection performance for "Kumquat" fruits at both growth stages. According to the detected images and Table 5, for growing "Kumquat" fruits, the improved YOLOv5 method detected a larger number of citrus fruits, while Faster R-CNN and YOLOv4 detected relatively fewer fruits, along with a small number of identification errors. For mature "Kumquat" fruits, the improved YOLOv5 model was significantly better than the other three methods. During the growing period, the "Kumquat" fruits were small, both the fruits and the leaves were green, and the color characteristics were not obvious. Even so, the improved method significantly improved the detection ability of the YOLOv5 model, and the recognition accuracy was not affected by the color or individual characteristics of the growth stage. The comparison experiments showed that the improved YOLOv5 method was better than the other three methods, and that combining the CA attention mechanism and BiFPN caused no loss of detection speed. Figures 10 and 11 show the detection results of the four models at the growth stage and the maturity stage.

Detection on Different Citrus Varieties
This experiment aimed to validate the generalization ability of our improved model using images of four different citrus varieties collected during growth. The detection results of our improved YOLOv5 model on the four varieties of citrus fruits (Kumquat, Shatang tangerine, Fertile orange, and Nanfeng tangerine) are shown in Figure 12. Figure 13 shows the AP values of four detection models (namely, Faster R-CNN, YOLOv4, YOLOv5l, and the improved YOLOv5) on the four citrus varieties. It is evident from Figure 13 that the recognition accuracy of each method varies among the four varieties of citrus. According to the detected images and AP values, the improved YOLOv5 method can accurately detect the position and number of fruits. The experimental results showed that our model has good generalization ability. Moreover, the recognition accuracy of the improved YOLOv5 method remained high when the picture contained dense small targets.

Conclusions
This study presented an object detection algorithm and its application to citrus fruit identification and counting at different growth stages. Based on the results, the specific conclusions of this work are as follows: (1) Our improvements to the YOLOv5 model comprised three changes: (I) the latest visual attention mechanism, the coordinated attention module (CA), was inserted into the backbone network of the original YOLOv5l to recognize small target fruits; (II) the two-way cross-scale connection and weighted feature fusion BiFPN in the neck network was used to replace the PANet multiscale feature fusion network; and (III) the varifocal loss function was used to replace the focal loss function for detecting occluded fruits. (2) Compared with the original YOLOv5 model, the mAP@0.5 of the improved model increased by 5.4%, and the inference time of the improved YOLOv5 for detecting an image on the server was 0.019 s. The results of the experiments on the four varieties of citrus trees showed that our proposed improved model could effectively identify dense small citrus fruits throughout their entire growth period.
In the future, since deployment on edge devices is subject to specific hardware configurations, we will continue to optimize and improve YOLOv5 and use pruning technology to reduce the model parameters to make it more suitable for mobile deployment. We would also like to study the deployment of the citrus identification system on the mechanical picking arm, and study the use of depth cameras to achieve 3D localization and the picking of citrus fruits, contributing to the development of smart orchards. Using this improved YOLOv5 to deal with defective fruit in transparent packaging in the production line is another valuable direction that is expected to be explored to provide technical support for the fruit production management chain.