Optimizing the YOLOv7-Tiny Model with Multiple Strategies for Citrus Fruit Yield Estimation in Complex Scenarios

Abstract: The accurate identification of citrus fruits is important for fruit yield estimation in complex citrus orchards. In this study, the YOLOv7-tiny-BVP network is constructed based on the YOLOv7-tiny network, with citrus fruits as the research object. This network introduces the BiFormer bilevel routing attention mechanism, replaces regular convolution with GSConv and adds the VoVGSCSP module in the neck network, and replaces the simplified efficient layer aggregation network (ELAN) with partial convolution (PConv) in the backbone network. The improved model significantly reduces the number of parameters and the inference time while maintaining a high recognition rate for citrus fruits. The results showed that the fruit recognition accuracy of the modified model was 97.9% on the test dataset. Compared with YOLOv7-tiny, the number of parameters and the size of the improved network were reduced by 38.47% and 4.6 MB, respectively. Moreover, the recognition accuracy, frames per second (FPS), and F1 score improved by 0.9%, 2.02, and 1%, respectively. The proposed network model achieves an accuracy of 97.9% even after the parameters are reduced by 38.47%, and the model size is only 7.7 MB, which provides a new idea for the development of lightweight target detection models.


Introduction
In the last decade, China has been a world leader in citrus fruit production [1], which has become an important factor in the economic growth of agricultural cultivation. In citrus fruit production, preharvest yield estimation is an essential tool for assessing fruit quality and planting techniques and for reflecting market supply and demand; moreover, it is crucial for guiding agricultural production [2]. Traditional methods for estimating production usually involve manual sampling, which is inefficient and labor intensive. Deep learning technology has shown excellent advantages in image detection, and the fast, accurate detection ability of a target detection network can provide a more efficient means of citrus fruit yield estimation and counting.
In recent years, object detection has played an important role in the field of agriculture. Sozzi et al. [3] used a detection network for grape yield estimation. Wang et al. [4] employed the YOLOv5 network to detect small apples for early yield estimation. Cardellicchio et al. [5] used the YOLOv5 detection network to study the phenotypic characteristics of tomato plants, so that tomato fruits and flowers could be accurately identified. In addition, target detection networks have also demonstrated good detection performance in fruit identification and counting [6][7][8][9][10]. Bi et al. [11] used transfer learning to train a Faster R-CNN model for citrus fruit recognition, with an average accuracy of 86.6%. Chen et al. [12] improved the YOLOv7 network for citrus fruit detection by introducing a small-target detection layer and lightweight convolution, with a model size of 24.26 M. However, the above models suffer from an excessive number of parameters and computational effort. Therefore, the lightweight YOLO series of detection networks is the preferred choice [13][14][15]. Lightweight networks can increase inference speed and reduce model size by cutting the depth and width of the model. Huang et al. [16] improved the YOLOv5 lightweight model by introducing the CBAM attention mechanism, achieving an average recognition accuracy of 91.3% for citrus fruits. Wang et al. [17] used the YOLOv5 model with transfer learning to detect apples, with a model size of 4.07 M but a detection accuracy of only 83.1%. Ma et al. [18] introduced a BiFPN module based on YOLOv7-tiny for detecting apples under different weather conditions, with an accuracy of 80.1%. Wang et al. [19] replaced the regular convolution in the YOLOv7-tiny backbone network with deformable convolution and introduced an SE attention mechanism for detecting millet chili at different maturity levels, with a model accuracy of 90.3%. In the above studies, most of the networks focused on local feature extraction and matching, achieving lightweight designs at the cost of losing semantic information about the target fruits. They are therefore prone to missing and misdetecting target fruits in multiscale detection tasks, resulting in lower accuracy in complex scenarios or small-target contexts. Therefore, to further improve the network's ability to detect citrus in complex environments and to meet the current demand for lightweight edge devices, there is an urgent need to explore in-depth strategies and methods to further enhance the detection capability of lightweight YOLO networks.
The purpose of this study is to solve the problems of parameter redundancy and the high computational cost of target detection networks in complex scenarios. We propose a lightweight, high-precision network, YOLOv7-tiny-BVP, through module replacement and parameter optimization of reconfigurable network layers. The network introduces the BiFormer attention mechanism, which improves the capture of spatial and contextual semantic information about citrus plants by capturing bidirectional dependencies in the input sequence. Moreover, regular convolution is replaced with GSConv in the neck network, and the VoVGSCSP module is introduced to improve detection accuracy and reduce model inference time. In the backbone network, the simplified ELAN is replaced with PConv, which leads to more efficient feature extraction and reduces the number of network computations. The network is lightweight while maintaining high performance, which opens up possibilities for future applications of lightweight networks.

Data Acquisition
The citrus fruit images were collected at the Pengyu Brothers Citrus Demonstration Base in Gongcheng County, Guilin City, Guangxi Zhuang Autonomous Region. The data were collected on 11 November 2022, between 2:00 and 5:00 p.m. To improve sample representativeness, model applicability, and device compatibility, images were acquired with a Huawei Mate40 Pro phone and a Nova 7 phone. To ensure the representativeness of the sample data, citrus fruit images were taken at distances varying from 0.2 to 2.0 m from the fruit trees, and various real growth conditions were taken into account, such as upward and downward views, downlight and backlight, and sparse, dense, shaded, and overlapping fruits. Finally, a total of 525 high-definition citrus fruit images with a resolution of 4096 × 3072 were captured in JPG format.

Data Enhancement
In the natural environment, there are a variety of disturbing factors, such as plant stems, leaves, and light. To improve the robustness of the model and prevent overfitting caused by too little training data [20], 182 representative images were selected for data augmentation in this study. First, the original images were manually labeled using the image annotation software LabelImg. Then, the mixup method, the mosaic method, image equalization, gray scaling, and gamma transformation were used for mixed data enhancement. After manual screening, 3182 images were ultimately obtained. The effect after data enhancement is shown in Figure 1.
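As an illustration of the pixel-level enhancement operations listed above, the following sketch implements gamma transformation and gray scaling with NumPy; the function names are illustrative and not taken from the study's code (mixup and mosaic additionally combine multiple labeled images and are omitted here).

```python
import numpy as np

def gamma_transform(image: np.ndarray, gamma: float) -> np.ndarray:
    """Apply gamma correction to an 8-bit image via a lookup table."""
    lut = ((np.arange(256) / 255.0) ** gamma * 255.0).astype(np.uint8)
    return lut[image]

def grayscale(image: np.ndarray) -> np.ndarray:
    """Convert an H x W x 3 RGB image to grayscale (ITU-R BT.601 weights)."""
    return (image @ np.array([0.299, 0.587, 0.114])).astype(np.uint8)

img = np.full((2, 2, 3), 128, dtype=np.uint8)  # tiny mid-gray test image
dark = gamma_transform(img, 2.0)               # gamma > 1 darkens mid-tones
```

A gamma below 1 would instead brighten mid-tones, simulating the downlight/backlight variation present in the field images.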

Dataset Preparation
In this study, the data were stored in the format of Microsoft's publicly available MS COCO dataset, and the dataset was randomly divided into training, validation, and testing sets at a ratio of 7:2:1. The dataset contained 3182 images, including 182 original citrus fruit images and 3000 enhanced images, in which 39,483 citrus fruits were labeled. The training set consisted of 2290 images with 28,279 labeled citrus fruits. The validation set consisted of 573 images in which 7181 fruits were labeled. The remaining 319 images comprised the test set, which included 4023 citrus fruits. Following the COCO dataset standard, citrus fruit targets with a resolution of less than 32 × 32 pixels were defined as small targets in this experiment. Table 1 shows the division of the dataset.
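The 7:2:1 random split described above can be sketched as follows; the fixed seed and the use of plain index lists are assumptions for illustration, not details from the study.

```python
import random

def split_dataset(items, ratios=(0.7, 0.2, 0.1), seed=0):
    """Shuffle and partition items into train/val/test by the given ratios."""
    items = list(items)
    random.Random(seed).shuffle(items)  # deterministic shuffle for reproducibility
    n = len(items)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

train, val, test = split_dataset(range(3182))  # 3182 images as in the dataset
```

Giving the remainder to the test set guarantees every image lands in exactly one split.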


Test Platforms and Parameters
This experiment was based on Windows 10 with the following hardware configuration: an i7-9800X@3.80 GHz CPU, 32 GB of RAM, and an Nvidia GeForce RTX2060 graphics card. We used PyCharm as the IDE, Python 3.9 as the interpreter, and PyTorch 1.17 as the test framework. In addition, CUDA 12.1 parallel computing was used in combination with the cuDNN 11.7 deep neural network acceleration library, and OpenCV 4.3.5 was selected as the image processing library.
To ensure the effectiveness of the model, we initialized it as follows: training began from scratch without loading the official pretrained weights, the random seeds were fixed, and the training process was accelerated by adjusting the number of threads used by the CPU during data loading. The network hyperparameters were configured as follows: the input image size was 640 × 640 pixels, and the batch size was 32. For model optimization, stochastic gradient descent (SGD) was used, with the initial learning rate, momentum factor, and weight decay factor set to 0.001, 0.937, and 0.0005, respectively. The model was trained for a total of 300 epochs, the weights were saved every 10 epochs, and the weights with the highest accuracy were ultimately selected for validation and testing.
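The SGD update with the stated momentum (0.937) and weight decay (0.0005) can be sketched for a single scalar weight as follows; this is a minimal illustration of the optimizer's update rule, not the study's training code.

```python
def sgd_step(w, grad, velocity, lr=0.001, momentum=0.937, weight_decay=0.0005):
    """One SGD-with-momentum step: v <- m*v + (g + wd*w); w <- w - lr*v."""
    g = grad + weight_decay * w          # L2 weight decay folded into the gradient
    velocity = momentum * velocity + g   # momentum accumulation
    return w - lr * velocity, velocity

w, v = 1.0, 0.0
w, v = sgd_step(w, grad=0.5, velocity=v)
```

With a small learning rate of 0.001 and high momentum of 0.937, each step is small but the velocity term smooths the gradient direction over many batches.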
In this experiment, COCO-style metrics were used to evaluate model performance. The evaluation metrics included precision (P), recall (R), mean average precision (mAP), F1 score (F1), number of model parameters, model size, and FPS, and the coefficient of determination R² was selected as the evaluation indicator for the prediction model. The metrics are computed as shown in Equations (1)-(4):

P = TP / (TP + FP) (1)

R = TP / (TP + FN) (2)

mAP = (1/N) Σᵢ APᵢ, where AP = ∫₀¹ P(R) dR (3)

F1 = 2 × P × R / (P + R) (4)

In the above equations, TP (true positives) denotes the number of citrus fruits correctly identified, FP (false positives) denotes the number of detections that do not correspond to actual fruits, and FN (false negatives) denotes the number of citrus fruits that were not detected.
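Metrics (1), (2), and (4) translate directly into code; the TP/FP/FN counts below are illustrative.

```python
def precision(tp, fp):
    """Fraction of detections that are real fruits."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of real fruits that were detected."""
    return tp / (tp + fn)

def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

p = precision(tp=90, fp=10)   # 0.9
r = recall(tp=90, fn=30)      # 0.75
f1 = f1_score(p, r)
```

The F1 score sits between P and R but is pulled toward the lower of the two, which is why it is a stricter summary than either metric alone.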

YOLOv7-Tiny Network Infrastructure
The YOLOv7 target detection network is a detector proposed by Alexey Bochkovskiy's team in July 2022. On the Microsoft COCO public dataset, YOLOv7 outperformed all detectors known at the time in terms of both speed and accuracy. Compared to its predecessors, the authors innovatively proposed the composite model scaling method and the extended efficient layer aggregation network (E-ELAN) to increase the depth and width of the network, which improved the feature expressiveness and detection performance of the network. Moreover, methods such as the dynamic label assignment strategy are used to improve the inference speed and detection accuracy [21].
YOLOv7-tiny is a lightweight network of the YOLOv7 series that consists of three main parts: the backbone, the neck, and the head; its network structure is shown in Figure 2. To meet the lightweight standard, YOLOv7-tiny reduces the number of convolutions in the ELAN, MP, and SPPCSPC modules and uses regular convolution instead of the CBL module, but it does not fully integrate the advantageous modules of the YOLOv7 structure, which weakens the feature learning ability of the network to a certain extent [22].

Construction of a New YOLOv7-Tiny-BVP Network Using Multi-Strategy
This study explored methods for improving upon the YOLOv7-tiny model. First, the BiFormer attention mechanism is applied after the SPPCSPC module, which enhances the network's fruit feature extraction ability in complex scenes. Second, in the neck and head, regular convolution is replaced by GSConv and the VoVGSCSP module is introduced, improving detection accuracy and reducing model inference time simultaneously. Finally, the PConv module replaces the simplified ELAN structure in the backbone, which reduces the network parameters and redundant computations while maintaining detection accuracy. The new network, named YOLOv7-tiny-BVP, is constructed through this multi-strategy fusion; its structure is shown in Figure 3.


BiFormer Dynamic Sparse Attention Mechanism for Dual-Layer Routing
The BiFormer is an improvement and extension of the transformer model and consists of an encoder and a decoder; the encoder encodes the input sequence, and the decoder generates the output sequence [23]. The BiFormer structure, which introduces a novel bi-level routing attention mechanism, is shown in Figure 4. The attention scores of each position are calculated from the query vector (Q), key vector (K), and value vector (V) of the input sequence, thereby establishing global associations within the sequence [24]. In citrus fruit recognition and detection tasks, contextual information is crucial for locating targets. The BiFormer can more comprehensively capture the background information and contextual relationships around citrus fruit plants, improving fruit detection accuracy.
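The Q/K/V attention computation described above can be sketched as single-head scaled dot-product attention; this minimal NumPy version is illustrative and omits the multi-head projections and routing machinery of the actual BiFormer.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """O = softmax(Q K^T / sqrt(d)) V for a single head."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)   # pairwise similarity of queries and keys
    return softmax(scores) @ v      # weighted sum of values

q = k = v = np.eye(3)               # toy 3-token sequence, 3-d embeddings
out = attention(q, k, v)
```

Each output row is a convex combination of the value rows, which is what lets every position aggregate context from the whole sequence.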

The query vector Q represents the content that the current position needs to focus on in the attention mechanism; the weights in the attention allocation process are determined by calculating its similarity with the key vector K. The value vector V contains the actual feature representation, which is assigned different weights based on the similarity between Q and K to determine the importance of the different positions in the output. The relationships between them are shown in Equations (5)-(7):

K^g = gather(K, I^r) (5)

V^g = gather(V, I^r) (6)

O = Attention(Q, K^g, V^g) + LCE(V) (7)

where I^r in Equations (5) and (6) represents the index tensor, which specifies the positions at which elements are collected from the input tensor; the gathering operation collects the elements at the corresponding positions of the input tensor based on this index, and K^g and V^g are the new key and value vectors obtained after the gathering operation. In Equation (7), Attention denotes the attention computation and LCE denotes the local context enhancement term. O is the final output, obtained by combining the attention result with the local context enhancement, which helps the model better understand contextual information in detection tasks and make accurate predictions.
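The gathering operation in Equations (5) and (6) can be sketched as follows: a region-level affinity matrix yields the index tensor I^r, and the keys of the top-k most relevant regions are collected for each query region before fine-grained attention. The shapes, values, and choice of k here are illustrative assumptions.

```python
import numpy as np

def topk_indices(affinity, k):
    """I^r: indices of the k highest-affinity regions per query region."""
    return np.argsort(-affinity, axis=-1)[:, :k]

def gather(x, idx):
    """Collect rows of x at the positions given by the index tensor."""
    return x[idx]  # result shape: (num_regions, k, dim)

region_affinity = np.array([[1.0, 0.2, 0.8],
                            [0.1, 0.9, 0.3],
                            [0.4, 0.5, 1.2]])
keys = np.arange(6.0).reshape(3, 2)        # one 2-d key per region
I_r = topk_indices(region_affinity, k=2)   # (3, 2) index tensor
K_g = gather(keys, I_r)                    # gathered keys, shape (3, 2, 2)
```

Because attention is then computed only over the gathered regions, the quadratic cost of full attention is avoided while long-range context is preserved.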

VoVGSCSP Module Based on a Slim Neck
To further improve the model's ability to capture the positional information of citrus fruit features and reduce computational costs, the regular convolution in the neck network is replaced with GSConv, and the ELAN module is replaced with the VoVGSCSP module in this study to balance the model accuracy and speed [25].
In convolutional neural network architectures, regular convolution achieves high accuracy, but the resulting model is more complex and has a longer inference time. In contrast, DWConv (depth-wise separable convolution) offers faster detection but lower accuracy [26]. The GSConv module was therefore designed to reduce model complexity while maintaining detection accuracy; its structure is shown in Figure 5. GSConv infiltrates the semantic information generated by regular convolution into each part of the DWConv output, exploiting the strengths of both operations: the speed of DWConv and the accuracy of regular convolution. It thus achieves a better performance in citrus fruit detection tasks.
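The mixing step of GSConv, in which the regular-convolution and depth-wise branches are concatenated and channel-shuffled so that dense semantic information spreads into the depth-wise features, can be sketched as follows; the branch tensors are placeholders standing in for the actual convolution outputs.

```python
import numpy as np

def channel_shuffle(x, groups=2):
    """Interleave channels across groups: (C, H, W) -> (C, H, W)."""
    c, h, w = x.shape
    return (x.reshape(groups, c // groups, h, w)
             .transpose(1, 0, 2, 3)   # swap group and channel axes
             .reshape(c, h, w))

dense_branch = np.zeros((2, 4, 4))    # stand-in for the regular-conv output
dw_branch = np.ones((2, 4, 4))        # stand-in for the DWConv output
mixed = channel_shuffle(np.concatenate([dense_branch, dw_branch]), groups=2)
```

After shuffling, channels from the two branches alternate, so subsequent layers see both feature types at every channel position.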


Although GSConv can significantly reduce the redundant information in the feature maps of citrus fruit detection models, it has limitations in further reducing inference time while maintaining accuracy. Therefore, the ELAN module in the neck network is replaced with the VoVGSCSP module. VoVGSCSP is formed by combining the modules in Figure 5 through a one-shot aggregation method; its structure is shown in Figure 6. This approach reduces the complexity of the computations and the network structure and further minimizes the memory footprint of the model, facilitating deployment to edge devices with limited computational resources.

PConv Module Based on FasterNet
Floating-point operations per second (FLOPS) measures computation speed; the larger the value, the better the network performs. The current mainstream convolution variants, including regular convolution, group convolution, and depth-wise separable convolution (DWConv), all suffer from low FLOPS, mainly because of frequent memory access [27]. The number of floating-point operations (FLOPs), in contrast, is generally used to measure model complexity; the fewer the computations, the smaller the value. The original YOLOv7-tiny backbone network uses large numbers of regular convolutions for feature extraction, with many parameters and a large computational cost. Although smaller convolution kernels can reduce the number of parameters and FLOPs, memory access then increases as the network width grows to compensate for the accuracy degradation [28]. Partial convolution (PConv) uses only part of the channels for spatial feature extraction, which effectively reduces redundant computation and memory access. Therefore, to reduce network complexity while maintaining feature extraction ability, this experiment replaces the regular convolution in the backbone network with PConv. The working principle of PConv is shown in Figure 7, where h and w represent the height and width of the feature map, c represents the total number of channels, c_p represents the number of channels actually convolved, and k × k is the convolutional kernel size. In contrast to regular convolution, PConv applies the convolution to only a portion of the input channels for spatial feature extraction, leaving the remaining channels unchanged. For consecutive or regular memory access, the first or last contiguous block of channels is taken as representative of the entire feature map for computation, and the input and output feature maps are assumed to have the same number of channels.
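The channel handling of PConv can be sketched as follows; a toy scaling function stands in for the real k × k convolution, and the channel counts are illustrative.

```python
import numpy as np

def pconv_forward(x, c_p, conv_fn):
    """Apply conv_fn to the first c_p channels; pass the rest through unchanged."""
    convolved = conv_fn(x[:c_p])                        # spatial mixing on c_p channels
    return np.concatenate([convolved, x[c_p:]], axis=0)  # untouched channels appended

x = np.arange(4 * 2 * 2, dtype=float).reshape(4, 2, 2)   # c=4 channels, h=w=2
out = pconv_forward(x, c_p=1, conv_fn=lambda t: t * 2.0)  # toy stand-in kernel
```

Because the untouched channels are copied rather than recomputed, both the arithmetic and the memory traffic scale with c_p instead of c.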


Without loss of generality, the FLOPs and memory access of PConv are calculated as follows:

FLOPs(PConv) = h × w × k² × c_p² (8)

MAC(PConv) ≈ h × w × 2c_p (9)

With the typical split ratio c_p/c = 1/4, according to Equations (8) and (9), the FLOPs of PConv are only 1/16 of those of regular convolution, and its memory access is only 1/4 of that of regular convolution.
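The two ratios can be checked numerically for the split c_p = c/4 used above; the channel counts are arbitrary.

```python
def flops_ratio(c, c_p):
    """PConv FLOPs (h*w*k^2*c_p^2) over regular-conv FLOPs (h*w*k^2*c^2)."""
    return (c_p / c) ** 2  # h, w, k^2 cancel out

def mac_ratio(c, c_p):
    """PConv memory access (~h*w*2*c_p) over regular conv (~h*w*2*c)."""
    return c_p / c  # h, w, 2 cancel out

r_flops = flops_ratio(c=64, c_p=16)  # quarter of the channels convolved
r_mac = mac_ratio(c=64, c_p=16)
```

The quadratic dependence of FLOPs on the channel count is why halving the ratio again (c_p = c/8) would cut computation by a further factor of four but memory access by only two.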

Comparative Testing of Different Detection Models
To verify the ability of the new YOLOv7-tiny-BVP model to recognize citrus fruits, the current mainstream target detection networks of the YOLOv5 [29] series (YOLOv5s, YOLOv5x, YOLOv5n, YOLOv5l, and YOLOv5m) and the YOLOv7 [22] series (YOLOv7, YOLOv7x, and YOLOv7-tiny) were selected for a comprehensive comparison in this experiment. During training, all the networks used the same dataset without loading the official default weights, and all hyperparameters were kept consistent. After training, the weights of the network with the highest accuracy were selected. The models were comprehensively evaluated on the same test set of 319 images. In this evaluation, the FPS metric was calculated as follows: with batch size = 1, FPS is obtained by dividing 1000 ms by the sum of the image inference time, preprocessing time, and post-processing time. The results of the above target detection models are shown in Table 2. Note: bold font indicates the optimal value for each detection indicator; mAP@.5 is the average precision at IoU = 0.5.
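The FPS computation described above amounts to the following; the timings are illustrative.

```python
def fps(pre_ms, infer_ms, post_ms):
    """Frames per second at batch size 1: 1000 ms over total per-image time."""
    return 1000.0 / (pre_ms + infer_ms + post_ms)

rate = fps(pre_ms=1.5, infer_ms=10.0, post_ms=1.0)  # example per-image timings
```

Including preprocessing and post-processing makes the figure reflect end-to-end throughput rather than raw inference speed alone.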
Table 2 shows that the current YOLOv5 series networks perform well in citrus fruit detection, but their overall performance is weaker than that of the YOLOv7 series. Although YOLOv5n has few parameters and a small model size, it is still weaker than the YOLOv7 series networks in terms of the other indicators. Although the P, R, and mAP values of the YOLOv7 and YOLOv7x networks are higher than those of the YOLOv5 series, their parameter counts and model sizes are large and their FPS is low, making them difficult to deploy on edge computing devices with insufficient GPU resources.
Compared with YOLOv7-tiny, the overall parameters of YOLOv7-tiny-BVP decreased by 38.47%, the model size was reduced by 4.6 MB, and P increased by 0.9%. Meanwhile, the FPS and F1 score increased by 2.02 and 1%, respectively. Evidently, YOLOv7-tiny-BVP effectively reduces the computational load on the device while still maintaining high performance.

Comparative Experiments in Complex Scenarios
Under natural conditions, the orchard environment is complex. The growth characteristics of fruit trees result in severe occlusion of fruits by branches and leaves. Moreover, fruits of different sizes and shapes commonly overlap each other. Additionally, there are many disturbing factors in the field environment, such as variations in light and darkness, which make fruit identification and counting difficult [30]. Therefore, for these two real-world scenarios, the YOLOv5s model, which has the most balanced performance in the YOLOv5 series; YOLOv7-tiny; and the YOLOv7-tiny-BVP proposed in this paper were selected for comparative analysis. In Figures 8-11, the red and green circles indicate where the networks fail to detect fruits under the two different scenarios.

Comparison of Recognition Results under Different Occlusion Scenes
Figure 8 shows that both YOLOv5s and YOLOv7-tiny can recognize the more obvious targets in the slightly occluded scene, but neither network is sensitive to small targets at the edge of the image that are occluded by foliage, and both suffer from missed detections. By introducing an attention mechanism, YOLOv7-tiny-BVP increases the network's sensitivity to the feature information around the citrus fruits and can accurately detect their locations.
Figure 9 shows that in the case of heavy occlusion, overlapping fruits and occluding branches and leaves result in the loss of target features. The improved network can still accurately recognize the target under heavy occlusion, whereas YOLOv7-tiny misses detections, and YOLOv5s recognizes the fruits but misdetects them, incorrectly recognizing two adhered fruits as one. In summary, the YOLOv7-tiny-BVP network proposed in this paper performs better in the different occlusion scenarios.
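The merging of two adhered fruits into one detection can be reasoned about through intersection over union (IoU): when two fruits produce heavily overlapping boxes, an IoU above the non-maximum suppression (NMS) threshold causes one box to be discarded. A minimal sketch with hypothetical box coordinates and a typical NMS threshold of 0.45 (both assumptions, not values from this study):

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# Two adhered fruits yield heavily overlapping boxes (hypothetical coordinates);
# an IoU above the NMS threshold suppresses one of the two detections.
print(iou((0, 0, 10, 10), (3, 0, 13, 10)) > 0.45)  # True
```

This is why tightly clustered fruits are a hard case for any NMS-based detector: correct duplicate removal and incorrect neighbor suppression look identical to the algorithm.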

Comparison of Recognition Results in Bright and Dark Scenes
Brightness also poses a considerable challenge for fruit identification and counting during actual field inspections. Figure 10 shows the detection results of the three network models under different light and dark scenes. As Figure 10 shows, in dim light, YOLOv5s performs worse than the other two networks, missing three fruits. YOLOv7-tiny leaves two fruits unrecognized, while the recognition of YOLOv7-tiny-BVP is better overall than that of the other two networks. It is concluded that the color features of citrus fruits are masked in a dimly lit environment, so the network cannot extract effective textural features; combined with the overlapping leaves and fruits in the figure, this leads to individual fruits not being detected.
In addition, in a well-lit environment with severe leaf and fruit overlap, the recognition ability of the proposed YOLOv7-tiny-BVP and of YOLOv7-tiny was significantly better than that of the YOLOv5s network. Figure 11 shows that the recognition of YOLOv7-tiny is more accurate than in the dim-light scenario. It is concluded that under sufficient light, the network more easily obtains semantic color information about the fruits, resulting in more accurate recognition than in dim light. Therefore, the YOLOv7-tiny-BVP network proposed in this paper is more suitable for fruit detection in complex environments, with high accuracy and fewer missed detections under different light and dark scenes.

Ablation Experiment
To validate the effectiveness of the three improvement strategies proposed in this article for the YOLOv7-tiny-BVP network, ablation experiments were conducted to observe the changes in various indicators on the same training set after 300 epochs of training. The training results are shown in Figure 12, where the model gradually starts to converge after 250 iterations, and there is no significant difference between the three independent improvement schemes and the original YOLOv7-tiny network in terms of precision, recall, and mAP@.5 on the training set. However, in terms of mAP@.5:.95, YOLOv7-tiny-BVP achieves better results when the three improvement strategies are introduced simultaneously than when the BiFormer attention mechanism, VoVGSCSP, or PConv is introduced individually. In addition, YOLOv7-tiny-BVP also has lower localization (box_loss) and confidence (obj_loss) losses than the individual improvement strategies, indicating that the model performs better overall after introducing all three improvements.

After the training was completed, the weights with optimal accuracy were selected for a comprehensive evaluation of each model on the same validation set of 573 images; the evaluation results are shown in Table 3. In Table 3, time is the sum of the image preprocessing time, inference time, and NMS (non-maximum suppression) time; a smaller value indicates better real-time performance. The indicators in the table show that the accuracy of the model improves by 1% after introducing the BiFormer attention mechanism. This suggests that the attention mechanism enhances the network's feature extraction ability and improves detection accuracy. However, the model size increases accordingly, which does not meet the lightweight requirement. In addition, after introducing the VoVGSCSP module, the overall parameter quantity and model size decrease slightly, while the accuracy improves by 1%. However, the model's image processing time increases by 1.2 ms, resulting in relatively poor real-time detection efficiency.
Moreover, after the ELAN module was replaced with the PConv module in the YOLOv7-tiny backbone network, the overall parameter quantity and model size decreased significantly, while the accuracy decreased by 2% compared with the original network. The analysis shows that replacing the module with PConv reduces the number of convolutions and the depth of the network, thereby reducing the number of parameters but weakening the network's feature extraction ability. After introducing the three improvement schemes simultaneously, the recall and mAP@.5 of the model decrease by 1%, while the overall parameters decrease significantly. The improved model has the highest accuracy and the shortest image processing time; it is lighter, more accurate, faster, and easier to deploy.

Note: Bold font is the optimal value for the model detection indicators; mAP@.5 average accuracy at IoU = 0.5.

Comparison of the Latest Methods for Fruit Testing
Table 4 shows the latest research results. Lai et al. [10] used the YOLOv7 network to detect pineapple fruits and classify them according to ripeness. Ma et al. [18] detected apples under complex weather conditions using the YOLOv7-tiny model and a public apple dataset; compared with their model, ours shows significant improvements in P, F1 score, and mAP. Chen et al. [12] used the YOLOv7 network to detect the ripeness of citrus fruits; although the mAP of their model reached 97.29%, it requires too much computation and has too many parameters. In addition, Zhang et al. [31] used a UAV to collect citrus fruit images and employed the YOLOv7-tiny network to recognize the fruits; although the size of their model is only 3.96 MB, ours achieves large improvements in both P and F1 score. In summary, the model proposed in this study performs more comprehensively than the models compared above and is more suitable for citrus fruit detection.

Discussion on Lightweight Networks
With the rapid development of mobile and embedded devices, more and more algorithms are deployed to mobile devices for convenience, but such devices suffer from small memory, insufficient storage space, and limited computational power [32,33]. The lightweight network proposed in this study can both accurately identify citrus fruits and significantly reduce the number of network parameters. Figure 13 shows a side view of a citrus fruit tree; the improved network maintains a high recognition rate for citrus fruits, providing a new way to deploy fruit identification and yield estimation on such mobile devices.
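For edge-deployment budgeting, the relationship between parameter count and model size, and the relative reduction reported in this paper, can be sketched as follows. The parameter counts below are hypothetical placeholders, not the actual counts of YOLOv7-tiny or YOLOv7-tiny-BVP, and the size estimate is only a rough weights-only approximation (real checkpoints also carry metadata):

```python
def model_size_mb(num_params: int, bytes_per_param: int = 2) -> float:
    """Rough on-disk size of a network's weights in MB (fp16 by default)."""
    return num_params * bytes_per_param / 1e6

def reduction_percent(old: float, new: float) -> float:
    """Relative reduction, as used when reporting the 38.47% parameter cut."""
    return (old - new) / old * 100

# Hypothetical parameter counts, for illustration only
print(round(reduction_percent(6_000_000, 3_700_000), 2))  # 38.33
```

Reasoning in these terms makes it easy to check whether a candidate model fits the memory budget of a specific embedded target before attempting deployment.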
Although the network model proposed in this study is small and highly accurate, it still has some limitations: (1) the dates and locations of image collection are relatively homogeneous; (2) green or yellow-green citrus fruits are lacking in the images; and (3) the proposed network only demonstrates the promise of a lightweight network and was not deployed to mobile devices for testing. In future research, we will focus more on image acquisition, including date, location, and variety, to optimize the proposed lightweight network for the real-time detection of citrus fruits during the full growth period.

Conclusions
Based on YOLOv7-tiny, a new network model, YOLOv7-tiny-BVP, is constructed in this study by introducing the BiFormer attention mechanism, the VoVGSCSP module, and PConv convolution. A total of eight different networks from the YOLOv5 and YOLOv7 series were selected for comparative testing on the same dataset. The results show that, compared with YOLOv7-tiny, the total number of parameters and the model size of YOLOv7-tiny-BVP are reduced by 38.47% and 4.6 MB, respectively.
Meanwhile, the accuracy, FPS, and F1 score improved to different degrees, and the detection accuracy was no less than that of YOLOv7-tiny. Therefore, the proposed YOLOv7-tiny-BVP network model is not only lightweight but also retains the high detection accuracy of the YOLOv7 model, which provides a new idea for future research on lightweight models and makes possible the application of lightweight target detection models in agricultural yield estimation.

Figure 4 .
Figure 4. Diagram of the BiFormer attention mechanism.


Figure 10 .
Figure 10. Recognition effect in a dark scene.

Figure 11 .
Figure 11. Recognition results in bright scenes.

Figure 13 .
Figure 13. Fruit identification results of citrus fruit trees from a side view. (a-e) Results of different fruit tree tests.


Table 2 .
Comparison of the results of different network models.

Table 3 .
Comparison of ablation experiment indices.

Table 4 .
Latest research on fruit detection.