Improved Feature Fusion in YOLOv5 for Accurate Detection and Counting of Chinese Flowering Cabbage (Brassica campestris L. ssp. chinensis var. utilis Tsen et Lee) Buds

Abstract: Chinese flowering cabbage (Brassica campestris L. ssp. chinensis var. utilis Tsen et Lee) is an important leaf vegetable originating from southern China. Its planting area is expanding year by year. Accurately judging its maturity and determining the appropriate harvest time are crucial for production. The open state of Chinese flowering cabbage buds serves as a crucial maturity indicator. To address the challenge of accurately identifying Chinese flowering cabbage buds, we introduced improvements to the feature fusion approach of the YOLOv5 (You Only Look Once version 5) algorithm, resulting in an innovative algorithm with a dynamically adjustable detection head, named FPNDyH-YOLOv5 (Feature Pyramid Network with Dynamic Head-You Only Look Once version 5). Firstly, a P2 detection layer was added to enhance the model's detection ability for small objects. Secondly, the spatial-aware attention mechanism from DyHead (Dynamic Head) was added for feature fusion, enabling the adaptive fusion of semantic information across different scales. Furthermore, a center-region counting method based on the Bytetrack object tracking algorithm was devised for real-time quantification of the various categories. The experimental results demonstrate that the improved model achieved a mean average precision (mAP@0.5) of 93.9%, representing a 2.5% improvement over the baseline model. The average precision (AP) for buds at different maturity levels was 96.1%, 86.9%, and 98.7%, respectively. When applying the trained model in conjunction with Bytetrack for video detection, the average counting accuracy, relative to manual counting, was 88.5%, with class-specific accuracies of 90.4%, 80.0%, and 95.1%. In conclusion, this method facilitates relatively accurate classification and counting of Chinese flowering cabbage buds in natural environments.


Introduction
Chinese flowering cabbage (Brassica campestris L. ssp. chinensis var. utilis Tsen et Lee) is one of the specialty leafy vegetables in southern China, beloved for its rich nutritional content and tender texture, with a cultivation area reaching approximately 6.67 million hectares [1]. The timing of harvesting is crucial; immature Chinese flowering cabbage presents thin stems and small dimensions, while overripe Chinese flowering cabbage is unsuitable for transportation and falls short in taste and quality [2]. Therefore, it is necessary to detect the maturity of Chinese flowering cabbage before harvesting [3].
Traditional methods for assessing the maturity of leafy vegetables primarily involve farmers making subjective judgments based on appearance, quality, and market demand [4]. Additionally, some destructive maturity tests, while relatively accurate, can only be applied to sampled vegetables after harvest under laboratory conditions and may not truly reflect the maturity of the entire plant population before harvest. In response, some non-destructive maturity detection methods have been proposed [5,6]. For instance, Michele et al. [7] designed a mechanically structured contact-type Radicchio maturity sensor. Radicchio at different maturity levels exhibits distinct heights, leading to varied displacements as the sensor passes over it. Similarly, Birrell et al. [8] developed a vision system for selectively harvesting mature and disease-free iceberg lettuce. They initially employed YOLOv3 to detect the bounding boxes and positions of lettuce, subsequently feeding the bounding boxes into the Darknet network. The network classified them into three categories: mature, immature, and diseased, allowing for the exclusive harvesting of mature individuals.
The limited number of studies on maturity detection in leafy vegetables can be attributed to the minimal differences between individuals at various maturity levels, especially when compared to fruits [9]. There are fewer detectable characteristics in leafy vegetables, which poses challenges for non-destructive maturity detection research. Currently, there is no research on non-destructive detection of Chinese flowering cabbage maturity before harvest.
In accordance with the industry standards outlined in the Chinese flowering cabbage grading specifications, noticeable disparities in the morphology and color of cabbage buds at various maturity levels offer essential information for assessing their degree of maturity [10]. Maturity discrimination can be achieved by identifying and classifying the buds of Chinese flowering cabbage, which is essentially an object detection task in computer vision.
Research on image-based object detection technology in crop detection is already very extensive, with the main methods being traditional image processing and machine learning [11,12]. Traditional image processing predominantly relies on factors such as color [13], texture [14], and shape features [15] or incorporates methods that combine multiple features for crop recognition and detection [16,17]. Traditional machine learning methods typically involve a two-step process: manual extraction of target features followed by the use of trained classifiers for pixel-level classification to identify target regions [11,18]. However, it is important to note that traditional methods are susceptible to environmental variables, including fluctuations in lighting intensity, shooting angles and distances, and background variations [6].
In recent years, deep learning object detection methods based on convolutional neural networks have been widely applied in the agricultural domain [19,20]. Existing object detection methods can be broadly classified into one-stage and two-stage approaches. One-stage object detection models have faster detection speeds. In contrast, two-stage models are characterized by slower processing speeds and are often unsuitable for real-time detection scenarios [21]. Particularly, the introduction of visual attention mechanisms has enhanced the detection capability for small objects while ensuring inference speed [22,23]. For instance, Li et al. [24] incorporated the CBAM (Convolutional Block Attention Module) attention mechanism into the YOLOv5 model for wheat spike detection and counting. From the feature heatmap, it is evident that the model with the attention mechanism focuses more on targets, resulting in improved accuracy. Chen et al. [25] embedded the ECA (Efficient Channel Attention) attention mechanism into RetinaNet to detect the ripeness of pineapples, and the improved model achieved recognition accuracies exceeding 90% for pineapples at different ripeness levels.
For counting in continuous image sequences, it is crucial to avoid counting the same target multiple times. Object detection models typically lack access to temporal and target displacement information between frames, often resulting in repetitive counting of objects [26]. To mitigate this challenge, numerous studies have sought to combine object detection models with Multi-Object Tracking (MOT) algorithms, thereby enhancing counting accuracy. For instance, Wang et al. [27] introduced the MangoYOLO detection algorithm, which integrates Kalman filtering and the Hungarian algorithm to detect, track, and count fruit trees in video sequences. Li et al. [28] trained a YOLOv5 detection model for counting tea buds and implemented automatic counting using an improved DeepSORT algorithm. The algorithm achieved a high correlation of 98% with manual counting results. However, given the complexity of the Chinese flowering cabbage growth environment, with varying bud sizes, shapes, and growth positions at different levels of maturity, the methods mentioned above may not be entirely suitable for this specific research context.
In summary, this study achieved real-time detection of Chinese flowering cabbage maturity in the field through bud detection. The main work and contributions of this study include: (1) Added a dynamically adjustable detection head with adaptive capabilities to the original YOLOv5 algorithm's neck, enhancing the feature fusion capability of the original algorithm and improving the detection accuracy of Chinese flowering cabbage buds. (2) Designed a center-region counting method based on the Bytetrack multi-object tracking algorithm, to some extent avoiding the repeated counting of the same Chinese flowering cabbage buds. (3) Filled the gap in the field of pre-harvest maturity detection of Chinese flowering cabbage, providing references for the detection of other crops.

Image Acquisition and Data Sets
This study utilized RGB images and video data from two distinct varieties of Chinese flowering cabbage, namely "49-days" and "80-days". The data collection was carried out at the "QiLin" Experimental Farm of South China Agricultural University (113.38° E, 23.17° N). The data acquisition equipment consisted of two smartphones (Xiaomi 10 and Redmi K40, both from Xiaomi Technology Co., Ltd., Beijing, China) and a SONY ILCE-5100 (Sony Group Corporation, Tokyo, Japan) digital camera, with pixel resolutions of 3000 × 4000, 5792 × 4344, and 2000 × 3008, respectively. The video data had a resolution of 1080 × 1920, and random single frames were extracted to construct the dataset. The images were captured from distances ranging from 30 to 80 cm above the top of the cabbage. The dataset included images under various lighting conditions and backgrounds with soil, totaling 4683 images. After removing blurry images, the dataset contained 4445 images. Among these, "49-days" images were taken 30~40 days after planting, totaling 1925 images, while "80-days" images were taken 50~60 days after planting, totaling 2520 images.
The images were annotated using Labelimg (an image annotation tool, v1.8.1, HumanSignal, San Francisco, CA, USA). In accordance with the industry standard for Chinese flowering cabbage, maturity was categorized into three distinct stages: "growing", "ripe", and "overripe". This categorization was based on the extent of bud opening, as shown in Figure 1. To ensure the quality and effectiveness of the annotations, objects that extended beyond 1/3 of the image borders were not included in the annotation process. Following annotation, XML files were generated to store category and target coordinate information. The dataset was then divided into training, validation, and test sets in a ratio of 7:2:1, as depicted in Table 1.

Detection and Counting Method
This study can detect Chinese flowering cabbage buds in dynamic scenes, and the workflow is shown in Figure 2. The method consists of three steps: (1) Utilizing the improved YOLOv5 algorithm to detect video frames captured by the camera, obtaining the location information and maturity categories in the current frame. (2) Jointly using the Bytetrack algorithm to track the targets and assign independent IDs. (3) When the motion of a target stabilizes and it enters the central region, its category is determined, completing the counting.

Original YOLOv5
YOLOv5 is a one-stage object detection algorithm based on regression, comprising four main components: an input end, backbone, neck, and detection head [29]. The input end performs a series of preprocessing steps on the image, including online data augmentation such as Mosaic, MixUp, and geometric transformations. The backbone is CSP-Darknet53, primarily composed of C3 modules utilizing the CSP (Cross Stage Partial) structure. A C3 module, in turn, consists of three standard convolutional layers and Bottleneck units featuring residual connections [30]. The neck network adopts the Path Aggregation Network (PANet) to enhance the feature fusion capabilities of the network [31]. The detection head comprises three standard convolutional detectors, each responsible for detecting the feature maps at one of the three different scales output by PANet. It generates predictions for four position coordinates and class confidence scores.
YOLOv5 employs a matching strategy based on aspect ratios, which enables cross-anchor, cross-grid, and cross-branch predictions. This strategy substantially increases the number of positive samples, accelerates model convergence, and improves detection accuracy. During the inference process, non-maximum suppression is applied for post-processing to remove low-confidence boxes, resulting in the final detection results [32].
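To make the post-processing step concrete, the greedy IoU-based non-maximum suppression described above can be sketched as follows. This is a minimal, framework-free version for illustration; YOLOv5's actual implementation is vectorized in PyTorch and also handles per-class suppression and confidence filtering.

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes, scores, iou_thresh=0.45):
    """Greedy NMS: repeatedly keep the highest-scoring box and drop
    every remaining box that overlaps it by more than iou_thresh."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep
```

For example, two heavily overlapping boxes collapse to the higher-scoring one, while a distant box survives: `nms([[0,0,10,10],[1,1,10,10],[20,20,30,30]], [0.9,0.8,0.7])` keeps indices 0 and 2.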

Bud Detection Method Based on Improved YOLOv5
This study addresses the detection of Chinese flowering cabbage buds in continuous images in natural environments, which presents several challenges. On the one hand, the targets are relatively small, exhibiting diverse shapes, and the "growing" and "ripe" categories are highly similar, making it challenging to distinguish them from the background. On the other hand, it is necessary to count the various categories of buds in motion, which leads to image blurring and increases the difficulty of detection and continuous tracking [26].
The original algorithm's head directly processes the 3-scale feature maps from PANet through 1 × 1 convolutions to output prediction information of na × (nc + 5), where na represents the number of anchors and nc is the number of classes. This type of head only considers the issue of different object scales in the images and does not address the misidentification caused by different perspectives, which makes objects appear with different shapes, rotations, and positions; this reflects a lack of spatial awareness. Additionally, YOLOv5's head unifies the tasks of classification and localization, even though these two tasks have entirely different objectives and constraints, resulting in a lack of task awareness. The original algorithm struggles to meet the task requirements, and tracking performance relies on the accuracy of the detector. Therefore, in this study, we made the following improvements to YOLOv5.

Add a 4× down-sampling layer.
On top of the original detection layers P3, P4, and P5 in the network, a P2 detection layer is added. P3, P4, and P5 correspond to 8×, 16×, and 32× downsampling, while the P2 layer is derived from a 4× downsampling feature map, which has a smaller receptive field, making it advantageous for detecting small-sized objects [33]. The specific operation involves upsampling the feature map of the 17th layer once and then concatenating it with the output of the 3rd layer of the backbone to form the network's P2 detection layer. Because of significant differences between the self-built dataset used in this study and the COCO (Common Objects in Context) dataset, and because of the addition of the P2 detection layer, the original anchors are no longer applicable. Therefore, the anchors were re-clustered to assign 3 new anchors to the P2 detection layer. Ultimately, the sizes of the 12 anchors are as follows: P2: (21,21), (31,31), (41,42); P3: (63,64), (81,85), (109,106); P4: (136,136), (171,169), (218,216); P5: (310,275), (392,417), (576,522).
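The anchor re-clustering step can be illustrated with a plain k-means over the (width, height) pairs of the labeled boxes. Note that YOLOv5's autoanchor routine additionally checks best-possible recall and refines the clusters with a genetic algorithm, which this sketch omits; it shows only the basic clustering idea.

```python
import random

def kmeans_anchors(wh, k, iters=100, seed=0):
    """Naive k-means on (width, height) pairs to derive k anchor sizes.
    wh: list of (w, h) tuples from the dataset's ground-truth boxes."""
    rng = random.Random(seed)
    centers = rng.sample(wh, k)
    for _ in range(iters):
        # assign each box to its nearest center (squared Euclidean distance)
        clusters = [[] for _ in range(k)]
        for w, h in wh:
            j = min(range(k),
                    key=lambda c: (w - centers[c][0]) ** 2 + (h - centers[c][1]) ** 2)
            clusters[j].append((w, h))
        # recompute each center as the mean of its cluster
        centers = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl))
            if cl else centers[j]
            for j, cl in enumerate(clusters)
        ]
    return sorted(centers)
```

On a toy set of boxes clustered around sizes 21 × 21 and 99 × 101, `kmeans_anchors` recovers those two anchor sizes; on the real dataset, k = 12 would yield the four groups of three anchors listed above.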


Feature fusion with a spatial-aware attention mechanism.
Expanding on the previously discussed enhancements, we introduced a unified object detection head framework named "DyHead" into the head network. DyHead integrates three distinct self-attention mechanisms: scale-awareness, spatial-awareness, and task-awareness [34]. In this study, the PANet structure in the YOLOv5 neck was removed, because the spatial-awareness module relies on feature fusion, and only the FPN (Feature Pyramid Network) structure was retained, simplifying the model structure and preventing information loss caused by downsampling.
The P2, P3, P4, and P5 feature layers obtained from the FPN first go through the spatial-aware module. As illustrated in Figure 3, taking the middle P3 layer as an example, it initially undergoes a convolution operation to derive self-learned position biases (offset) and importance factors (mask). Additionally, the low-level feature layer P2 and the high-level feature layer P4 undergo deformable convolution (DefV2-0, DefV2-1, and DefV2-2 represent three deformable convolution operations with different step sizes) to promote sparse attention learning. Subsequently, feature aggregation is performed at the same spatial positions with the middle feature layer (P3), resulting in temp_fea_High, temp_fea_Mid, and temp_fea_Low. The spatial-awareness module plays a vital role in introducing position biases to acquire deformation representation capabilities. Simultaneously, it utilizes an importance factor to adaptively weight the deformation-sampled positions, enabling the network to possess dynamic adjustment capability and better adapt to the different forms of flowering bud targets. Following this, each temp undergoes a scale-aware attention mechanism, as shown in Figure 4. The temps are first subjected to average pooling to compress the features and reduce the parameter count. Subsequently, they are connected to a fully connected layer (replaced by a 1 × 1 convolution), followed by a ReLU activation layer and, finally, a hard sigmoid activation layer to expedite training. Essentially, this assigns weights to feature layers at different levels, allowing the model to adaptively blend features based on the importance of the features at each level. Finally, there is the task-aware module, as shown in Figure 5, which adapts to the detection task by activating channels in the feature mapping to enhance performance. The specific process involves reducing the feature dimension of the input feature x through average pooling, followed by two fully connected layers and a normalization layer to map the feature space to the range of [-1, 1]. The functions implemented by these operations are similar to a hyper function θ(x) that generates four learnable parameters α1, β1, α2, and β2 for subsequent calculations [35]. Lastly, the activation function fθ(x) is used to dynamically activate different channels of the input feature x, resulting in the final output of the task-aware block. More detailed information about the hyper function θ(x) and activation function fθ(x) can be found in the literature [36]. The complete improved model is shown in Figure 6, with the improved sections highlighted in pink. After passing through the dynamic detection head, four different-sized detection result maps are generated, with sizes of 320, 160, 80, and 40, corresponding to the detection of smaller, small, medium, and large targets. Classification and localization information is then obtained by subsequent standard convolution layers.
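To make the scale-aware step concrete, here is a NumPy sketch of the pooling → 1 × 1 convolution → ReLU → hard sigmoid pipeline for a single feature level. The weight matrix `w` and bias `b` stand in for the learned 1 × 1 convolution and are assumptions of this sketch, not the paper's trained parameters; a deep learning framework would learn them jointly with the rest of the head.

```python
import numpy as np

def hard_sigmoid(x):
    """Piecewise-linear approximation of the sigmoid, clip((x + 3) / 6, 0, 1),
    used in DyHead to speed up training."""
    return np.clip(x / 6.0 + 0.5, 0.0, 1.0)

def scale_aware_weight(feat, w, b):
    """Scale-aware attention for one feature level feat of shape (C, H, W):
    global average pooling -> channel mixing (the 1x1 conv, here `w @ ... + b`)
    -> ReLU -> hard sigmoid, producing per-channel weights in [0, 1]
    that reweight the whole level."""
    pooled = feat.mean(axis=(1, 2))          # (C,) compress spatial information
    z = np.maximum(w @ pooled + b, 0.0)      # 1x1 conv + ReLU
    gate = hard_sigmoid(z)                   # (C,) adaptive level importance
    return feat * gate[:, None, None]        # reweight the feature level
```

With an identity mixing matrix and an all-ones feature map, each channel is pooled to 1.0, gated by hard_sigmoid(1.0) = 2/3, and the level is scaled accordingly; applied across P2–P5, the gates let the model blend levels by learned importance, as described above.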

Tracking and Counting Based on Bytetrack
Because of the image sensor's rapid acquisition speed, the same target may be captured in consecutive frames, resulting in the repeated counting of identical targets. To guarantee that each bud is counted only once, this study employs the Bytetrack object tracking algorithm. Bytetrack associates the detection results from YOLOv5 at different time points, assigning independent IDs to each target and enabling the accurate counting of Chinese flowering cabbage buds. The specific method is elucidated below.
Bytetrack is a detection-based tracking algorithm that employs a data association method called Byte [37]. Instead of simply discarding low-confidence detection results, it separates the target detection boxes into low-score boxes and high-score boxes based on confidence scores and processes them separately. When using Bytetrack to track Chinese flowering cabbage buds, the initial step is to divide the detection results into high-score detection boxes and low-score detection boxes based on a confidence threshold and to create corresponding trajectories. The first matching is performed between the high-score boxes and the tracks, with IoU (Intersection over Union) as the only similarity measure, significantly improving matching speed. Unmatched high-score boxes and unmatched trajectories (U_track) are retained. The second matching is performed between the low-score boxes and U_track, with the still-unmatched trajectories again retained [37]. At this point, background false detections can be filtered out, as they lack corresponding trajectories, while occluded targets can be recovered. Unmatched trajectories are retained for a certain lifespan, and if no boxes are matched during this period, they are deleted. High-score boxes that are not matched to trajectories continue to be observed in subsequent frames, and if they are continuously detected, trajectories are assigned to them.
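The two-stage Byte association above can be sketched as follows. This is a simplification: Bytetrack assigns matches with the Hungarian algorithm and uses Kalman-predicted track boxes, whereas this sketch uses greedy IoU matching on static boxes to stay dependency-free; the score and IoU thresholds are illustrative.

```python
def box_iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    ua = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (ua + 1e-9)

def greedy_match(tracks, boxes, iou_thresh):
    """Greedy IoU matching (Bytetrack itself uses the Hungarian algorithm).
    Returns (matches, unmatched_track_indices, unmatched_box_indices)."""
    matches, used = [], set()
    for t_idx, t in enumerate(tracks):
        best, best_iou = None, iou_thresh
        for b_idx, b in enumerate(boxes):
            if b_idx in used:
                continue
            v = box_iou(t, b)
            if v > best_iou:
                best, best_iou = b_idx, v
        if best is not None:
            matches.append((t_idx, best))
            used.add(best)
    matched_t = {m[0] for m in matches}
    u_tracks = [i for i in range(len(tracks)) if i not in matched_t]
    u_boxes = [i for i in range(len(boxes)) if i not in used]
    return matches, u_tracks, u_boxes

def byte_associate(tracks, detections, score_thresh=0.2, iou_thresh=0.5):
    """Two-stage Byte association: match high-score boxes to all tracks first,
    then try low-score boxes against the still-unmatched tracks (U_track).
    detections are (x1, y1, x2, y2, score); m2 indices refer to the
    reduced U_track / low-score lists."""
    high = [d for d in detections if d[4] >= score_thresh]
    low = [d for d in detections if d[4] < score_thresh]
    m1, u_track, u_high = greedy_match(tracks, [d[:4] for d in high], iou_thresh)
    rem_tracks = [tracks[i] for i in u_track]
    m2, _, _ = greedy_match(rem_tracks, [d[:4] for d in low], iou_thresh)
    return m1, m2, u_high
```

In a frame with one confident detection and one occluded, low-score detection, the first stage matches the confident box, and the second stage recovers the occluded target from the low-score box rather than discarding it, which is exactly the behavior described above.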
During the utilization of Bytetrack for bud tracking, the movement of the camera may cause detected buds to shift from the image's edges to its center. This can result in significant alterations in the aspect ratio of the detection boxes [26]. Consequently, the high-score detection boxes in the current frame may not be correctly matched with the previously assigned trackers. This situation leads to a change in the ID assigned to the same flower bud, ultimately impacting counting accuracy. Therefore, this study does not directly use the IDs generated by the Bytetrack algorithm for counting. Instead, a counting method based on a center region was devised, as shown in Algorithm 1. Firstly, two lines of fixed width and position are set at the center of the video window, as shown in Figure 7: one serves as the entry line and the other as the counting line, accommodating targets moving in the two different directions, up and down. Taking the example of targets moving from bottom to top: as the camera moves, the targets continuously tracked by Bytetrack gradually move toward the center of the image. The system checks whether the current target's center point (xi, yi) has crossed the entry line. If the target ID is not yet stored, it adds the target ID to the Arraydown storage list. As the target continues to move upwards, its center point crosses the counting line, and the system checks whether the target ID appears in the Arraydown storage list. If it exists, the total count of targets is accumulated and the target class is determined, achieving the counting of flower buds at different stages of maturity. This counting method only counts a target when it is fully visible and its aspect ratio does not undergo significant changes, ensuring a certain level of stability and accuracy in target IDs and counting.
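The core logic of the center-region counting method can be sketched as below. The line positions, class names, and the upward-motion assumption (y decreasing in image coordinates) are illustrative, not the paper's exact parameters; Algorithm 1 additionally handles downward motion with the symmetric pair of lines.

```python
class CenterRegionCounter:
    """Count each tracked bud exactly once, when it crosses from the entry
    line to the counting line near the image center (upward motion assumed)."""

    def __init__(self, entry_y=1000, count_y=900):
        self.entry_y = entry_y   # lower line: register the target ID here
        self.count_y = count_y   # upper line: count the registered ID here
        self.entered = set()     # IDs that have crossed the entry line
        self.counted = set()     # IDs already counted (never recounted)
        self.counts = {}         # per-class totals

    def update(self, track_id, cls, cy):
        """cy is the y coordinate of the target's center in this frame."""
        if cy <= self.entry_y:
            self.entered.add(track_id)
        if cy <= self.count_y and track_id in self.entered \
                and track_id not in self.counted:
            self.counted.add(track_id)
            self.counts[cls] = self.counts.get(cls, 0) + 1

counter = CenterRegionCounter()
# a "ripe" bud moves upward across both lines over three frames
for y in (1100, 950, 850):
    counter.update(track_id=7, cls="ripe", cy=y)
```

After the three frames, `counter.counts` holds one "ripe" bud, and further updates for ID 7 leave the tally unchanged, which is what prevents the repeated counting described above.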

Evaluation Metrics
To validate the effectiveness of our improved method, the precision, recall, mAP, and F1 metrics were used to evaluate model performance. These metrics are calculated as shown in the following equations:

Precision = TP / (TP + FP)  (1)

Recall = TP / (TP + FN)  (2)

Precision represents the ratio of correctly detected results to the total detected results, while recall is defined as the proportion of correctly detected results among all true results, as shown in Equations (1) and (2). Here, TP (True Positives) represents the number of correctly predicted positive-class bounding boxes, FN (False Negatives) represents the number of positive-class bounding boxes missed by the model, and FP (False Positives) represents the number of incorrectly predicted positive-class bounding boxes. Compared to precision and recall, Average Precision (AP) can more comprehensively reflect the overall detection performance of a model [38]. mAP is the average of AP over all classes, where AP is defined as the area under the precision-recall curve. mAP@0.5 represents the mAP calculated at an IoU threshold of 0.5, and mAP@0.5:0.95 denotes the average mAP calculated with IoU thresholds moving from 0.5 to 0.95 at intervals of 0.05 [39]. N stands for the number of classes, which is equal to 3 in this study.
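The metrics above can be computed as follows. The AP function uses all-point interpolation of the precision-recall curve; the exact interpolation scheme used by the authors' evaluation code is not stated in the text, so this is one common choice rather than a reproduction of their implementation.

```python
def precision_recall(tp, fp, fn):
    """Equations (1) and (2): precision = TP/(TP+FP), recall = TP/(TP+FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def average_precision(recalls, precisions):
    """Area under the PR curve with all-point interpolation.
    recalls must be ascending; precisions are the values at those recalls."""
    ap, prev_r = 0.0, 0.0
    for i, r in enumerate(recalls):
        p = max(precisions[i:])   # interpolated precision at this recall
        ap += (r - prev_r) * p
        prev_r = r
    return ap

def mean_ap(ap_per_class):
    """mAP: the mean of per-class AP values over the N classes."""
    return sum(ap_per_class) / len(ap_per_class)
```

As a sanity check, averaging the per-class APs reported for this model (0.961, 0.869, 0.987) gives mAP ≈ 0.939, matching the reported mAP@0.5 of 93.9%.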

Model Training Results and Ablation Experiment
Table 2 presents the software and hardware configuration used for model training.
During training, pretrained weights were employed to enhance the model's initial performance. The batch size was fixed at eight, and the input image size was set to 1280 × 1280. Stochastic Gradient Descent (SGD) was used to update the network parameters with a momentum of 0.937 and a weight decay of 0.0005. The learning rate was updated using linear decay, where the initial learning rate (lr) was set to 0.01 and the decay factor was set to 0.01. Figure 8 shows the changes in model metrics and losses over 100 training epochs. It is evident that during the initial 20 epochs of training, the model exhibited rapid convergence. This phase was characterized by substantial reductions in location loss, object loss, and class loss for both the training and validation datasets, along with a notable increase in model precision and recall. However, after 50 epochs, the model's progress began to stabilize. To ensure optimal model performance, an early stopping strategy was used [40]. The training process was concluded at the 100th epoch. Specifically, we saved the model that achieved the highest weighted sum of the precision, recall, mAP@0.5, and mAP@0.5:0.95 scores. These scores were weighted using coefficients (0, 0, 0.1, and 0.9) to prioritize the critical metrics. Figure 9 illustrates the results of evaluating our improved model on the test dataset. Figure 9a represents the PR curve of the model at an IoU threshold of 0.5; the area enclosed by the curve is the Average Precision (AP). The AP values for the different categories of flower buds are as follows: growing: 86.9%, ripe: 96.1%, over-ripe: 98.7%. Notably, the model achieves the highest detection precision for over-ripe buds, which is attributed to their fully open state, distinct yellow color, and high contrast with the background. In contrast, growing buds, surrounded by leaves and with a color similar to the background, exhibit relatively lower detection performance. Ripe buds, characterized by a pale yellow color and plump morphology, perform well in terms of detection; Figure 9b-d present the remaining evaluation results.

Ablation experiments involve removing or adding certain structures of the detection algorithm to observe their impact on performance [41]. In order to validate the effectiveness of the improved model, ablation experiments were conducted on FPNDyH-YOLOv5. YOLOv5s without PANet, retaining only the FPN, was used as the baseline. Ablation experiments were performed by adding the P2 detection layer and by utilizing the spatial-aware attention mechanism, scale-aware attention mechanism, and task-aware attention mechanism. We considered precision and recall at the point of the maximum F1 value. The experimental results are presented in Table 3, where "√" and "-" represent the selected and unselected methods, respectively.
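The weighted checkpoint-selection score described in the training setup, with coefficients (0, 0, 0.1, 0.9) on precision, recall, mAP@0.5, and mAP@0.5:0.95, can be sketched as a simple fitness function. This mirrors the fitness weighting used in YOLOv5's training loop; the example metric values below are illustrative.

```python
def fitness(metrics, weights=(0.0, 0.0, 0.1, 0.9)):
    """Checkpoint-selection score.
    metrics = (precision, recall, mAP@0.5, mAP@0.5:0.95);
    the default weights make mAP@0.5:0.95 dominate the selection."""
    return sum(w * m for w, m in zip(weights, metrics))
```

During training, the checkpoint with the highest `fitness` value across epochs would be saved as the best model, e.g. `fitness((0.9, 0.9, 0.8, 0.6))` evaluates to 0.1 × 0.8 + 0.9 × 0.6 = 0.62.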
It can be observed that after adding the P2 detection layer to the FPN structure, the recall rate increased by 1.0%. The P2 detection layer only undergoes 4× down-sampling, making it more sensitive to small objects, so more true positives are correctly detected. However, it also produces many false-positive targets, causing a slight decrease in precision. After incorporating the spatial-aware attention mechanism, all model metrics showed a significant improvement, with a direct increase of 3.9% in mAP@0.5. With the addition of scale awareness and task awareness, the model metrics further improved, although to a lesser extent. This suggests that the spatial-awareness module played the dominant role [34]. Ultimately, the model's performance improved by 3.0%, 4.6%, and 5.2% compared to the baseline model when all three attention mechanisms were combined. To provide a more intuitive demonstration of the effectiveness of each attention module, we visualized the feature maps of the same channel after the action of each module, as shown in Figure 11. It is clearly visible that the P2 detection layer contains more small targets. After passing through each attention module, the target features are enhanced, demonstrating the effectiveness of the improvement.

Experiment on Different Feature Fusion Methods
This study conducted a comprehensive comparison of various feature fusion methods, building upon the addition of the P2 detection layer. Each method is denoted as follows: A represents the FPN structure, B represents the original PANet of YOLOv5, C represents PANet modified with BiFPN, D represents the direct fusion of the output layers of the backbone using DyHead, E represents the FPN combined with DyHead, and F represents the combination of PANet with DyHead. Precision and recall were evaluated at the point of the maximum F1 value, and the experimental results are shown in Table 4. The results reveal that methods D, E, and F, which incorporate DyHead for dynamic feature fusion, outperform the models that do not. Notably, our proposed method E (FPNDyH) achieves the highest mAP at an IoU threshold of 0.5, surpassing the original algorithm using method B (PANet) by 2.5 percentage points. While the precision of method E is slightly lower than that of method B, it boasts a remarkable 3.3% increase in recall. Models A and D have fewer parameters, but their metrics are inferior to E's. In comparison to B, C, and F, our method exhibits the best performance with the fewest parameters and the lowest computational burden. In summary, our proposed FPNDyH method is more suitable for detecting Chinese flowering cabbage buds.

Compared with Other Detection Models
To further verify the performance of the improved algorithm in Chinese flowering cabbage bud detection, we trained five typical object detection algorithms, SSD, Faster R-CNN, YOLOv4, YOLOX-s, and YOLOv7, and conducted comparative experiments using the same dataset. Several performance metrics were used, including precision, recall, parameters, FLOPs, and mAP@0.5. The results are summarized in Table 5. Faster R-CNN is a typical two-stage object detection algorithm; it has the largest number of parameters and computational requirements and exhibits the poorest performance. The other algorithms are one-stage detection algorithms. SSD shows relatively weak performance in detecting small objects [42]. YOLOv4, YOLOX-s, and YOLOv7 belong to the same YOLO family as YOLOv5 and perform reasonably well, but they are all slightly inferior to our proposed FPNDyH-YOLOv5. A more intuitive comparison can be seen in Figure 12, which clearly illustrates that FPNDyH-YOLOv5 outperforms the other algorithms across various metrics while also having the smallest number of parameters and computational requirements.

Analysis of Tracking and Counting Results
In this study, the trained FPNDyH-YOLOv5 combined with Bytetrack is used to achieve real-time detection and counting of Chinese flowering cabbage buds. The FPNDyH-YOLOv5 model is exported in ONNX (Open Neural Network Exchange) format; after several rounds of experimental optimization, the NMS (Non-Maximum Suppression) IoU threshold was set to 0.45 and the confidence threshold to 0.25, while the Bytetrack parameters include a tracking threshold of 0.2 and a matching threshold of 0.5. To validate the effectiveness of this method, five video segments were randomly selected from the test dataset. The videos had a resolution of 1080 × 1920 and ran at 30 frames per second. Each video segment contained vegetable beds approximately 0.8 m wide and 2.0 m long. The numbers of buds at different maturities, and their total, were manually counted and compared with the algorithm's results. The results are shown in Table 6. The counting accuracy for the ripe, growing, and over-ripe categories was found to be 90.4%, 80.0%, and 95.1%, respectively. The lower counting accuracy for the growing category is mainly attributed to the small size of objects in this category, which makes them challenging to distinguish from the background. The counting accuracy of the ripe category is moderate. The count of the over-ripe category is higher than the manual count, primarily due to false detections of the ripe category by the detector. Overall, this method achieves relatively good counting of Chinese flowering cabbage buds.
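The comparison against manual counts can be expressed as a per-class counting accuracy. The paper does not state its exact formula, so the definition below (one minus the relative absolute error versus the manual count) is an assumed, commonly used choice, and the counts in the example are illustrative rather than taken from Table 6.

```python
def counting_accuracy(auto_count, manual_count):
    """Counting accuracy as 1 minus the relative absolute error of the
    algorithm's count versus the manual count. ASSUMPTION: the paper does
    not give its exact formula; this is one common definition."""
    return 1.0 - abs(auto_count - manual_count) / manual_count
```

Under this definition, an algorithm count of 95 buds against a manual count of 100 yields an accuracy of 0.95; note that over-counting (as observed for the over-ripe category) also reduces the score.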

Discussion
In this study, the primary objective is category-based counting, and accurate detection and classification by the detector are crucial for achieving correct counts [43]. To this end, this study has focused on improving the detector. The incorporation of three types of attention mechanisms has resulted in noticeable improvements in the performance of the detector. The spatial-aware attention module plays a significant role [34]. This is mainly because the deformable convolution enhances the deformation representation ability of the model [44]. It is invariant to spatial transformations and aggregates the features of each scale at the same spatial position, fully integrating the information of each scale. However, the improvement for the "growing" category is not substantial, with a precision of only 86.9%. This is mainly due to the immature flower buds being concealed among the plant's stems and leaves and their very small size, making them observable only from a vertical overhead view. This has also resulted in the lowest accuracy for the "growing" category during tracking and counting. Recognizing that relying solely on algorithmic improvements may be insufficient to address this issue, in our next step we will consider utilizing the height difference between flower buds and leaves. We will use an RGB-D camera to acquire depth images for 4D-input model training [26]. By incorporating depth information, we anticipate that the model will be better equipped to detect and distinguish "growing" flower buds even when they are partially obscured by plant structures or have a smaller visible profile. This enhancement should contribute to achieving more accurate and reliable counting results for Chinese flowering cabbage buds [45,46].
During the tracking and counting process, this study did not rely on the maximum track ID; instead, it used a method based on the central region that considers the actual scenario. This method counts a Chinese flowering cabbage bud only when the target moves into the central region of the image (i.e., into a vertical overhead view), which to some extent ensures counting accuracy. However, this method has limitations. Specifically, the width of the central region cannot adapt to different movement speeds: if a target moves so fast that its displacement between consecutive frames skips across the count lines, the count becomes inaccurate. In addition, the distance between the two lines also affects counting. Theoretically, the smaller the distance, the more accurate the counting, because the Kalman filter in Bytetrack is a linear model that assumes the target moves at constant velocity; as the per-frame displacement of the target decreases, the probability of an ID switch decreases. However, if the distance between the two lines is too small, the method cannot accommodate faster speeds. In general, appropriate parameters must be selected experimentally to balance speed and accuracy, and within a certain range of movement speeds, our method remains effective.
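The central-region rule described above can be sketched as follows. This is a hypothetical minimal version (class and parameter names are our own): each tracked ID is counted at most once, at the moment its bounding-box center falls inside a band around the image center; the band is shown along x here and would be swapped to y for vertical camera motion.

```python
class CenterRegionCounter:
    """Count each track ID once when its center enters the central band."""

    def __init__(self, img_width, band_frac=0.1):
        # band_frac controls the distance between the two count lines;
        # too narrow and fast targets skip the band between frames.
        half = img_width * band_frac / 2.0
        self.x_min = img_width / 2.0 - half
        self.x_max = img_width / 2.0 + half
        self.counted = set()   # track IDs already counted
        self.counts = {}       # per-class running totals

    def update(self, tracks):
        # tracks: iterable of (track_id, class_name, cx, cy) per frame,
        # e.g. as produced by a Bytetrack-style tracker.
        for tid, cls, cx, _cy in tracks:
            if tid not in self.counted and self.x_min <= cx <= self.x_max:
                self.counted.add(tid)
                self.counts[cls] = self.counts.get(cls, 0) + 1
        return self.counts
```

Because counting is keyed on the track ID, an ID switch inside the band would double-count a bud, which is why minimizing per-frame displacement (and hence ID switches) matters for this scheme.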
In future research, we plan to explore the development of a lightweight model suitable for deployment on edge computing devices [33].This would enable real-time tracking and counting of Chinese flowering cabbage buds in practical agricultural settings, further enhancing the applicability of our approach.

Conclusions
This study achieved real-time, non-destructive detection of Chinese flowering cabbage maturity before harvest. By introducing spatial-aware, scale-aware, and task-aware attention mechanisms into YOLOv5, we proposed the FPNDyH feature fusion method and the resulting FPNDyH-YOLOv5 model. The improved algorithm achieved precision, recall, and mAP@0.5 of 86.5%, 90.1%, and 93.9%, respectively, surpassing the original algorithm. Based on the ablation experiment, we visualized the feature maps after applying the three attention mechanisms; it can be seen intuitively that the improved model pays more attention to small targets. Compared with other models, it has fewer parameters and computations, with superior performance on all metrics. Using the trained detection model combined with the Bytetrack algorithm, we designed a central-region counting method for real-time tracking and counting of the various targets. In the test videos, the counting accuracy for each category was 90.4%, 80.0%, and 95.1%, respectively. These results indicate the effectiveness of the method in detecting and counting Chinese flowering cabbage buds at different maturity stages. This study also provides a non-manual solution for timely harvest assessment of other crops, contributing to advancements in agricultural practices.

Figure 7. An example of the flower bud counting process.

Figure 9.
Figure 9 illustrates the results of evaluating our improved model on the test dataset. Figure 9a shows the PR curve of the model at an IoU threshold of 0.5; the area enclosed by the curve is the average precision (AP). The AP values for the different bud categories are: growing, 86.9%; ripe, 96.1%; over-ripe, 98.7%. Notably, the model achieves the highest detection precision for over-ripe buds, attributable to their fully open state, distinct yellow color, and high contrast with the background. In contrast, growing buds, which are surrounded by leaves and similar in color to the background, exhibit relatively lower detection performance. Ripe buds, characterized by a pale yellow color and plump morphology, are detected well. Figure 9b-d show the precision, recall, and F1 curves at different confidence levels, respectively. These curves collectively reveal the model's excellent performance on the test set, indicating strong fitting and generalization for all three maturity levels of cabbage. Examples of six detection results are shown in Figure 10; only a few growing targets were missed (blue circles in Figure 10a,c), and the rest were detected correctly, indicating that the model performs well.

Figure 12. Performance comparison of different detection algorithms.

Table 1. Dataset overview and description.
Algorithm 1. A special tracking and counting method. Input: id_i; class_i; (x_i, y_i).

Table 4. Experimental results for different feature fusion methods.

Table 5. Comparison with different models.