Swin-Transformer-YOLOv5 for Real-Time Wine Grape Bunch Detection

In this research, an integrated detection model, Swin-transformer-YOLOv5 (Swin-T-YOLOv5), was proposed for real-time wine grape bunch detection to inherit the advantages of both YOLOv5 and Swin-transformer. The research was conducted on two grape varieties, Chardonnay (always white berry skin) and Merlot (white or white-red mixed berry skin when immature; red when matured), from July to September 2019. To verify the superiority of Swin-T-YOLOv5, its performance was compared against several commonly used, competitive object detectors, including Faster R-CNN, YOLOv3, YOLOv4, and YOLOv5. All models were assessed under different test conditions, including two weather conditions (sunny and cloudy), two berry maturity stages (immature and mature), and three sunlight directions/intensities (morning, noon, and afternoon), for a comprehensive comparison. Additionally, the number of grape bunches predicted by Swin-T-YOLOv5 was further compared with ground truth values, including both in-field manual counting and manual labeling during the annotation process. Results showed that the proposed Swin-T-YOLOv5 outperformed all other studied models for grape bunch detection, with up to 97% mean Average Precision (mAP) and an F1-score of 0.89 under cloudy weather. This mAP was approximately 44%, 18%, 14%, and 4% greater than those of Faster R-CNN, YOLOv3, YOLOv4, and YOLOv5, respectively. Swin-T-YOLOv5 achieved its lowest mAP (90%) and F1-score (0.82) when detecting immature berries, where the mAP was approximately 40%, 5%, 3%, and 1% greater than those of the same four models, respectively. Furthermore, Swin-T-YOLOv5 performed better on the Chardonnay variety, achieving up to 0.91 of R2 and 2.36 of root mean square error (RMSE) when comparing predictions with ground truth, whereas it underperformed on the Merlot variety, achieving only up to 0.70 of R2 and 3.30 of RMSE.


Introduction
The overall grape production in the United States reached 6.04 million tons in 2020, of which approximately 3.59 million tons (59%) came from wine grape production in California and Washington (USDA). To maintain premium wine quality, wine vineyards need to be elaborately managed so that the quantity and quality of the grapes can be well balanced for maximum vineyard profitability. Such vineyard management can be difficult because the number of berry bunches must be closely monitored by laborers throughout the entire growth season to avoid a high volume of bunches overburdening the plant and rendering the berry composition suboptimal [1]. This presents significant challenges for wine vineyard owners and managers as the agricultural workforce shrinks and labor costs increase. Potentially, this issue might be mitigated by leveraging state-of-the-art computer vision technologies and data-driven artificial intelligence (AI) techniques [2].
Object detection is one of the fundamental tasks in computer vision, used for detecting instances of one or more classes of objects in digital images. Common challenges that prevent a target object from being successfully detected include, but are not limited to, variable outdoor light conditions, scale changes, small objects, and partial occlusion. In recent years, numerous deep learning-driven object detectors have been developed for various real-world tasks, based on, for example, fully connected networks (FCNs), convolutional neural networks (CNNs), and vision Transformers. Among these, CNN-based object detectors have demonstrated promising results [3], [4]. Generally, CNN-based object detectors can be divided into two types: one-stage detectors and two-stage detectors. For example, one-stage detectors include the Single Shot multibox Detector (SSD) [5], RetinaNet [6], Fully Convolutional One-Stage (FCOS) [7], DEtection TRansformer (DETR) [8], EfficientDet [9], and the You Only Look Once (YOLO) family [10], [11], [12], [13], while two-stage detectors include Region-based CNN (R-CNN) [14], Fast/Faster R-CNN [15], [16], Spatial Pyramid Pooling Networks (SPPNet) [17], Feature Pyramid Network (FPN) [18], and CenterNet2 [19].
As agriculture is being digitalized, both one-stage and two-stage object detectors have been widely applied to various orchard and vineyard scenarios, such as fruit detection and localization, with promising results. One of the major reasons that object detection is challenging in agricultural environments is severe occlusion of target objects (e.g., fruit) by non-target objects (e.g., leaves, branches, trellis wires, and densely clustered fruits) [20]. Thus, in some cases, two-stage detectors were preferred by researchers due to their greater accuracy and robustness. Shuqin et al. [21] developed an improved model based on a multi-scale Fast R-CNN that used both RGB (i.e., red, green, and blue) and depth images to detect passion fruit. Results indicated that the accuracy of the proposed model improved from 0.850 to 0.931 (by about 10%). Gao et al. [20] proposed a Faster R-CNN-based multi-class apple detection model for dense fruit-wall trees. It could detect apples under different canopy conditions, including non-occluded, leaf-occluded, branch/trellis-wire-occluded, and fruit-occluded apples, with an average detection accuracy of 0.879 across the four occlusion conditions, processing each image in 241 ms on average. Although two-stage detectors have shown robustness and promising detection results in agricultural applications, two major concerns remain for further in-field implementation: slow inference speed and high computational resource requirements. Therefore, it has become more popular to utilize one-stage detectors for identifying objects in orchards and vineyards, particularly the YOLO family models with their real-time detection capability.
Huang et al. [22] proposed an improved YOLOv3 model for detecting immature apples in orchards, using Cross Stage Partial (CSP)-Darknet53 as the backbone network to improve detection accuracy. Results showed that the F1-score and mean Average Precision (mAP) were 0.652 and 0.675, respectively, for severely occluded fruits. Furthermore, Chen et al. [23] also improved the YOLOv3 model for cherry tomato detection, adopting a dual-path network [24] to extract features. The model established four feature layers at different scales for multi-scale detection, achieving an overall detection accuracy of 94.29%, Recall of 94.07%, F1-score of 94.18%, and inference speed of 58 ms per image. Lu et al. [25] introduced a convolutional block attention module (CBAM) [26] and embedded a larger-scale feature map into the original YOLOv4 to enhance detection performance on canopy apples at different growth stages. In general, YOLO family models tend to have fast inference speeds at test time. However, the detection accuracy of YOLO detectors can be degraded by canopy occlusion, which causes loss of information during detection, and strategies should be taken to compensate for this shortcoming. During the past two years, vision Transformers have demonstrated outstanding performance in numerous computer vision tasks [27] and are therefore worth further investigation in combination with YOLO models to address these challenges.
A typical vision Transformer architecture is based on a self-attention mechanism that can learn the relationships between elements of a sequence [27], [28]. Among the variants, Swin-transformer is a novel hierarchical vision Transformer backbone, using a multi-head self-attention mechanism that focuses on a sequence of image patches to encode global, local, and contextual cues with certain flexibility [29]. Swin-transformer has already shown compelling records in various computer vision tasks, including region-level object detection [30], pixel-level semantic segmentation [31], and image-level classification [32]. In particular, it exhibited strong robustness to severe occlusions from foreground objects, random patch locations, and non-salient background regions. However, using Swin-transformer alone for object detection requires large computing resources, as its encoding-decoding structure differs from that of conventional CNNs. For example, each Swin-transformer encoder contains two sublayers: the first is a multi-head attention layer and the second is a fully connected layer, with residual connections used between the two sublayers. It can explore the potential of feature representation through the self-attention mechanism [33], [34]. Previous studies on public datasets, e.g., COCO [35], have demonstrated that Swin-transformer outperformed other models on severely occluded objects [36]. Recently, Swin-transformer has also been applied in the agricultural field. For example, Wang et al. [37] proposed "SwinGD" for grape bunch detection using Swin-transformer and DEtection TRansformer (DETR) models. Results showed that SwinGD achieved 94% mAP and was more accurate and robust under overexposed, darkened, and occluded field conditions. Zheng et al. [38] researched a method for recognizing strawberry appearance quality based on Swin-transformer and a Multilayer Perceptron (MLP), or "Swin-MLP", in which Swin-transformer was used to extract strawberry features and the MLP identified strawberries according to the extracted features. Wang et al. [39] improved the backbone of Swin-transformer and applied it to identify cucumber leaf diseases using an augmented dataset; the improved model recognized the diseases with 98.97% accuracy.
Although many models for fruit detection have been studied in orchards and vineyards [25], [40], [41], [42], the critical challenges of grape detection in field environments (e.g., multiple varieties, multiple growth stages, and multiple light source conditions) have not yet been fully studied using a combined model of YOLOv5 and Swin-transformer. In this research, to achieve better accuracy and efficiency of grape bunch detection under dense-foliage and occlusion conditions in vineyards, we architecturally combined the state-of-the-art one-stage detector YOLOv5 with Swin-transformer (i.e., Swin-Transformer-YOLOv5 or Swin-T-YOLOv5), so that the proposed network structure could inherently preserve the advantages of both models. The overarching goal of this research was to detect wine grape bunches accurately and efficiently under complex vineyard environments using the developed Swin-T-YOLOv5. The specific research objectives were to:

Study site and data collection
The data acquisition and research activities in this study were carried out in a wine vineyard located at the Washington State University (WSU) Roza Experimental Orchards, Prosser, WA. Two wine grape varieties were selected as the targets due to their distinct berry skin colors when matured: Chardonnay (white berries; Figure 1(a)) and Merlot (red berries; Figure 1(d)). The berry skin color of Chardonnay remained white throughout the growth season (Figure 1(b)-1(c)), while that of Merlot changed from white to red during the season (Figure 1(e)-1(f)). There were approximately 10-33 and 12-32 grape bunches per vine for the experimental Chardonnay and Merlot plants, respectively. The vineyard was maintained by a professional manager for optimal productivity. The row and inter-plant spacings were about 2.5 m and 1.8 m, respectively, for both varieties.
The imagery data were collected using a Samsung Galaxy S6 smartphone (Samsung Electronics Co., Ltd., Suwon, South Korea) at distances of 1-2 m, with the camera facing perpendicular to the canopy. Data collection was carried out over the entire growth season (i.e., from berry development to maturity) once per week and three times per day, from 7/4/2019 to 9/30/2019. As detailed in Table 1, the canopy images were captured under two weather conditions (i.e., sunny and cloudy), two berry maturity conditions (i.e., immature and mature), and three sunlight direction/intensity conditions (i.e., morning at 8am-9am, noon at 11am-12pm, and afternoon at 4pm-5pm, Pacific Daylight Time). These varied outdoor conditions largely represent the diversity of the imagery dataset. Note that all images were acquired from the same side of the canopy. As a result, 459 raw images were collected in total for the Chardonnay (234 images) and Merlot (225 images) varieties at the original resolution of 5,312 × 2,988 pixels (Table 2). The specific numbers of raw images under the individual conditions can also be found in Table 1.

Dataset annotation and augmentation
The raw imagery dataset was manually annotated using the LabelImg annotation tool [43]. The position of each grape bunch was individually selected using bounding boxes, and clustered grape bunches were carefully separated. In addition, the "debar" approach was adopted from our previous publication [25] to separate individual canopies for evaluation purposes only. Once all raw images (Figure 2(a)) were annotated, the dataset was further enriched using the Imgaug data enhancement and augmentation library [44], including rotation (Figure 2(b)), channel enhancement (Figure 2(c)), Gaussian blur/noise (Figure 2(d)), and rectangular pixel discard (Figure 2(e)). During data augmentation, the annotated "key points" and "bounding boxes" were transformed accordingly. The enriched dataset can better represent the field conditions of the grape bunches. After augmentation, a dataset containing 4,418 images was developed; a detailed description of the augmented dataset can be found in Table 2. The actual number of augmented images was less than planned (4,590 images) because some invalid augmented images were identified and discarded. The finalized dataset was further divided into training (80%), validation (10%), and test (10%) sets for the development of the grape bunch detection models. Finally, in-field manual counting of grape bunches was completed during the harvest season on 10/1/2019, after the last dataset was acquired.
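As a concrete illustration of how box annotations follow a geometric augmentation, the hypothetical helper below (an illustrative sketch, not the authors' code and independent of Imgaug) rotates an axis-aligned bounding box about the image centre and returns the axis-aligned envelope of the rotated corners, clipped to the image bounds:

```python
import math

def rotate_bbox(bbox, angle_deg, img_w, img_h):
    """Rotate an axis-aligned box (x_min, y_min, x_max, y_max) about the
    image centre and return the axis-aligned envelope of the rotated
    corners, clipped to the image bounds."""
    x0, y0, x1, y1 = bbox
    cx, cy = img_w / 2.0, img_h / 2.0
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    # Rotate all four corners of the box around the image centre.
    xs, ys = [], []
    for x, y in [(x0, y0), (x1, y0), (x1, y1), (x0, y1)]:
        dx, dy = x - cx, y - cy
        xs.append(cx + dx * cos_t - dy * sin_t)
        ys.append(cy + dx * sin_t + dy * cos_t)
    # Axis-aligned envelope, clipped to the image.
    return (max(0.0, min(xs)), max(0.0, min(ys)),
            min(float(img_w), max(xs)), min(float(img_h), max(ys)))
```

For non-right-angle rotations the envelope is slightly larger than the original box, which is one reason augmented annotations are worth inspecting for invalid boxes, as was done here before discarding invalid images.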

Swin-transformer
First, the Swin-transformer architecture [29] is introduced in Figure 3(a). It splits the input RGB image into non-overlapping small patches through a patch partition module. Each patch was treated as a "token" whose features were set as the concatenation of the raw pixel values in the RGB image (i.e., 3 channels). In this study, a patch size of 4×4 was used; therefore, the feature dimension per patch was 4×4×3 = 48. A linear embedding layer was then applied to this raw-value feature to project it to an arbitrary dimension (denoted as C in Figure 3(a)). Swin-transformer was built by replacing the standard Multi-head Self Attention (MSA) module in a regular Transformer block with MSA modules based on "windows" (i.e., W-MSA) and "shifted windows" (i.e., SW-MSA), while the other layers remained the same (Figure 3(b)). This module was followed by a 2-layer Multi-Layer Perceptron (MLP) with a Rectified Linear Unit (ReLU) nonlinearity in between. A normalization layer (LayerNorm) and a residual connection were applied before and after each MSA module and MLP layer.
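The patch-partition step described above can be sketched in a few lines of NumPy (an illustrative reimplementation, not the authors' code); for a 4×4 patch on an RGB image, each token is the 48-dimensional concatenation of raw pixel values:

```python
import numpy as np

def patch_partition(image, patch=4):
    """Split an (H, W, 3) image into non-overlapping patch tokens.
    Returns an (H//patch * W//patch, patch*patch*3) array: each row is
    one token built by concatenating the raw RGB values of a patch."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0
    x = image.reshape(h // patch, patch, w // patch, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)           # (H/p, W/p, p, p, 3)
    return x.reshape(-1, patch * patch * c)  # (num_tokens, 48) for p=4

tokens = patch_partition(np.zeros((8, 8, 3)))
# tokens.shape == (4, 48)
```

A learned linear embedding (a single matrix multiply mapping 48 to C dimensions) would then project each token before the W-MSA/SW-MSA blocks.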

YOLOv5
YOLOv5 (specifically, YOLOv5s) is a recent detection model in the YOLO family [12], with a fast inference (detection) speed of up to 140 frames per second (fps). In addition, YOLOv5s is a lightweight model with few parameters, approximately 10% of those of the generic YOLOv4, suggesting that it may be more suitable for deployment on embedded devices for real-time object detection. Given these advantages, this study attempted to detect grape bunches in dense canopies using an improved YOLOv5.
In general, the YOLOv5 framework includes three parts: backbone, neck, and detection (or output) networks (Figure 4). The backbone network was used to extract feature maps from the input images through multiple convolutions and
merging. A three-layer feature map was then generated in the backbone network at the sizes of 80×80, 40×40, and 20×20 (Figure 4(a); left). After the backbone network, the neck network contained a series of feature fusion layers that mix and combine image features. All feature maps of different sizes generated by the backbone network were fused to obtain more contextual information and reduce information loss. The characteristic pyramid structures of the Feature Pyramid Network (FPN) and Path Aggregation Network (PANet) were adopted during the merging process, where strong semantic features were transferred from top to bottom feature maps by the FPN structure, while strong localization features were transferred from lower to higher feature maps by PANet. Overall, the feature fusion ability of the neck network was enhanced by using FPN and PANet together (Figure 4(a); middle). Finally, the detection network produced the detection results. It consisted of three detection layers, with corresponding output feature maps of 80×80, 40×40, and 20×20, which were used to detect objects in the input images. Each detection layer ultimately outputs a 21-channel vector, then generates and marks the predicted bounding box and category of the target in the original input image for the final detections (Figure 4(a); right).
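As a side note on the 21-channel output: assuming the usual YOLOv5 head layout of 3 anchors per scale, each predicting (x, y, w, h, objectness) plus one score per class, 21 channels would correspond to two classes (plausibly the two grape varieties); this interpretation is our inference, not stated in the text. The arithmetic is:

```python
def head_channels(num_anchors, num_classes):
    """Per-scale output channels of a YOLO detection head: each anchor
    predicts (x, y, w, h, objectness) plus one score per class."""
    return num_anchors * (5 + num_classes)

# With 3 anchors per scale, a 21-channel output implies 2 classes:
# head_channels(3, 2) == 21
```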
Moreover, the Focus module of YOLOv5 can slice and concatenate images (Figure 4(b)), which was designed to reduce the computational load of the model and speed up training. It first splits the input 3-channel image into four slices using the Slice operation. The four slices were concatenated using the Concat operation, and a convolutional layer (CONV) was then used to generate the output feature map. The backbone network also relied on other layers/modules, including CBL and C3, in which CBL was a standard convolutional module consisting of CONV, Batch Normalization (BN), and the Sigmoid Linear Unit (SiLU) activation function; C3 was a Cross Stage Partial (CSP) Bottleneck with three CONVs. The initial input was split into two branches, and the number of channels of the feature maps was halved by a CONV operation in each branch. The output feature maps of the two branches were then connected through the Concat operation, and the final output feature map of C3 was generated by a CONV. C3 was used to improve inference (test) speed by reducing the size of the model while maintaining the desired performance in extracting useful features from images. Finally, a Spatial Pyramid Pooling (SPP) module was used to enlarge the receptive field by converting feature maps of arbitrary size into feature vectors of fixed size (Figure 4(d)). The feature map was first passed through a CONV layer with a kernel size of 1×1. It was then concatenated with the output feature maps subsampled by three parallel max pooling layers, followed by a CONV layer to output the final feature map.
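The Slice/Concat behaviour of the Focus module can be sketched as follows (a minimal NumPy illustration of the standard YOLOv5 slicing pattern; the trailing CONV layer is omitted):

```python
import numpy as np

def focus_slice(x):
    """Focus-style slicing: take every second pixel at four phase offsets
    and stack them along the channel axis, halving the spatial size and
    quadrupling the channel count. A CONV would normally follow."""
    # x: (C, H, W) with even H and W
    return np.concatenate([x[:, ::2, ::2],    # even rows, even cols
                           x[:, 1::2, ::2],   # odd rows, even cols
                           x[:, ::2, 1::2],   # even rows, odd cols
                           x[:, 1::2, 1::2]], # odd rows, odd cols
                          axis=0)

out = focus_slice(np.zeros((3, 640, 640)))
# out.shape == (12, 320, 320)
```

No pixel information is lost: the operation is a pure rearrangement, which is why it can cut spatial resolution early without discarding detail.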

Integration of Swin-transformer and YOLOv5
To take advantage of both Swin-transformer and YOLOv5, the two models were integrated (i.e., Swin-transformer-YOLOv5 or Swin-T-YOLOv5) by replacing the last C3 layer (i.e., with CSP Bottleneck and three CONVs) in the original YOLOv5 with Swin-transformer encoder blocks (Figure 4(a)). Because the resolution of the feature maps was 20×20 at the end of the backbone network, applying Swin-transformer to these low-resolution feature maps can reduce computational load and save memory. This integration compensates for the shortcoming of YOLOv5, as a typical CNN, in capturing global and contextual information due to its limited receptive field [45], while Swin-transformer can capture long-distance dependencies and retain different local information [29]. Therefore, our proposed scheme combined YOLOv5s and Swin-transformer so that the new structure could inherit their advantages and preserve both global and local features, with the self-attention mechanism further improving the detection accuracy of the integrated model. This integration may be particularly useful for occluded grape bunches in dense-foliage vineyard canopies. A YOLOv5s model pre-trained on the COCO dataset was adopted during training to improve the generalization ability of the proposed network. Swin-T-YOLOv5 was compared against Faster R-CNN [16], YOLOv3 [11], YOLOv4 [10], and YOLOv5 [12], where the training hyperparameters of each model are shown in Table 3.
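A hypothetical fragment in the style of a YOLOv5 model-config file may help visualize the replacement; the `SwinTransformerBlock` module name, its arguments, and the elided middle layers are assumptions for illustration, not the authors' actual configuration:

```yaml
# Sketch of a backbone in YOLOv5's config format: [from, number, module, args]
backbone:
  [[-1, 1, Focus, [64, 3]],
   [-1, 1, Conv, [128, 3, 2]],
   [-1, 3, C3, [128]],
   # ... intermediate Conv/C3 stages down to the 20x20 feature map ...
   [-1, 1, SPP, [1024, [5, 9, 13]]],
   [-1, 1, SwinTransformerBlock, [1024]],  # replaces the final C3 stage
  ]
```

Placing the Transformer block only at the lowest-resolution stage keeps the quadratic cost of self-attention affordable, which matches the memory argument made above.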

Evaluation metrics
The performance of each model was evaluated using its Precision (P), Recall (R), F1-score, mean Average Precision (mAP) (Equations (1-4)), and inference (detection) speed per image, in which mAP served as the key metric for assessing the overall performance of a model. Additionally, R2 and root mean square error (RMSE; Equation (5)) were adopted to compare the results predicted by the models with the ground truth data from both in-field manual counting and manual labeling, where i indexes the data points, N represents the number of data points (plants), xi represents the actual count of grape bunches (in-field or labeled), and x̂i represents the count estimated by Swin-T-YOLOv5. P-R curves were also used to visually demonstrate model performance, with P and R shown on the vertical and horizontal axes, respectively. The Intersection over Union (IoU) threshold and the confidence score were both set to 0.5 for the test set.
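Since Equations (1)-(5) are not reproduced in this excerpt, the sketch below assumes the standard definitions of RMSE, R2 (coefficient of determination), and IoU; note that the paper's R2 may instead be derived from a fitted regression line:

```python
import math

def rmse(actual, predicted):
    """Root mean square error between ground-truth and predicted counts
    (standard form of Equation (5))."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

def r_squared(actual, predicted):
    """Coefficient of determination between the two count series."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

def iou(box_a, box_b):
    """Intersection over Union of two (x_min, y_min, x_max, y_max) boxes;
    a detection counts as correct here when IoU >= 0.5."""
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```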

Swin-transformer-YOLOv5 training and validation
All models were trained and validated for a comprehensive comparison. Table 4 shows the detailed comparison results using the previously defined evaluation metrics. Overall, our proposed Swin-T-YOLOv5 outperformed all other models, with a mAP of 97.4%, F1-score of 0.96, and inference speed of 13.2 ms per image. The mAP of Swin-T-YOLOv5 was 34.8%, 2.1%, 3.2%, and 2.1% better than those of Faster R-CNN, YOLOv3, YOLOv4, and YOLOv5, respectively. Although its inference time was slightly slower (by 1.4 ms) than that of the original YOLOv5, it was still faster than the rest of the models by 0.6-336.8 ms per image. Moreover, the P-R curves in Figure 5 show that Swin-T-YOLOv5 had the curve (in blue) nearest to the upper-right corner, indicating the best performance among all models, while Faster R-CNN (in yellow) performed the worst.

Testing under two weather conditions
All models were tested under the different conditions listed in Table 1, including two weather conditions (i.e., sunny and cloudy), two berry maturity stages (i.e., immature and mature), and three sunlight directions/intensities (i.e., morning, noon, and afternoon), to verify the superiority of the proposed Swin-T-YOLOv5. Detailed model comparison results for the test set under the two weather conditions are given in Table 5 for both the Chardonnay and Merlot varieties. Compared to Faster R-CNN, YOLOv3, YOLOv4, and YOLOv5, Swin-T-YOLOv5 achieved the best performance under both conditions in terms of mAP (95.36%-97.19%) and F1-score (0.86-0.89). Swin-T-YOLOv5 performed slightly better under the cloudy sky condition, with higher mAP (+1.83%) and F1-score (+0.03) than under the sunny condition. While the inference speed of Swin-T-YOLOv5 (13.2 ms per image) was not the best among all models, it was only 1.4 ms slower than that of YOLOv5.
Having shown that Swin-T-YOLOv5 outperformed all other models under both sunny and cloudy sky conditions, we further compared it against the ground truth data from in-field manual counting and manual labeling (Figure 6) for Chardonnay and Merlot, respectively. Results showed that Swin-T-YOLOv5 performed well on the Chardonnay variety under both weather conditions, with 0.70-0.82 of R2 and 2.93-5.05 of RMSE when the predicted results were compared against both in-field and label counts (Figure 6a and 6c). It also worked well on Merlot under the cloudy condition (Figure 6d); however, R2 dropped to 0.28-0.36 and RMSE increased to 6.97-7.04 on Merlot under the sunny condition (Figure 6b), indicating greater detection errors. Demonstrations of detection results under the two weather conditions are provided in Figures A.1-A.2 in the Appendices.

Testing at two maturity stages
In addition to the two weather conditions, we compared the performance of Swin-T-YOLOv5 and all other models at two berry maturity stages, immature and mature, for Chardonnay (i.e., white berry skin throughout the growth season) and Merlot (i.e., white or white-red mixed when immature; red when matured) (Figure 1). The detailed comparison results in Table 6 show that, as expected, Swin-T-YOLOv5 outperformed all other models at both berry maturity stages, with 90.31%-95.86% of mAP and 0.82-0.87 of F1-score. Clearly, all detectors achieved better detection results at the mature stage, including Swin-T-YOLOv5 (5.55% higher in mAP and 0.05 higher in F1-score), when the berries were larger (i.e., less occluded) and more distinct in color from their background, such as leaves. Compared to the second-best model, YOLOv5 (mAPs of 89.78%-91.58%), the performance of Swin-T-YOLOv5 improved more at the mature stage (+4.28%) than at the immature stage (+0.53%), indicating that the model's improvements were more effective for ready-to-harvest grape bunches. Figure 7 compares the number of grape bunches predicted by Swin-T-YOLOv5 against the ground truth data from both in-field manual counting and manual labeling for Chardonnay and Merlot. As observed in Table 6, R2 was higher (0.57-0.89) and RMSE was smaller (2.50-3.86) for the matured berries (Figure 7c-7d). Swin-T-YOLOv5 performed poorly on Merlot when the berries were immature (i.e., white or white-red mixed berries), with only 0.08-0.16 of R2.

Testing under three sunlight directions and intensities
Finally, all models were tested under three different sunlight directions and intensities: morning (8am-9am; in the direction of the light), noon (11am-12pm; maximum solar elevation angle), and afternoon (4pm-5pm; against the direction of the light) (Table 1). The light intensity was highest at noon and lowest in the morning. Specific comparison results are given in Table 7. Among all models tested in this research, Swin-T-YOLOv5 performed the best under every sunlight condition, with optimal mAPs of 91.96%-94.53% and F1-scores of 0.83-0.86. The detection results were also clearly better at noon than in the morning or afternoon, with 2.49%-2.57% higher mAP and 0.01-0.03 higher F1-score. Additionally, YOLOv5 still performed the second best except at noon, where Swin-T-YOLOv5 and YOLOv4 achieved 6.08% and 1.71% higher mAP than it, respectively.
Further observations on the number of grape bunches detected by Swin-T-YOLOv5 compared against the ground truth data, from in-field manual counting and manual labeling, are provided in Figure 8. For the Chardonnay variety, the agreement between the predictions and ground truth was relatively better (0.55-0.91 of R2 and 2.36-4.73 of RMSE; Figure 8a, 8c, and 8e) than for the Merlot variety (0.13-0.70 of R2 and 5.09-7.08 of RMSE; Figure 8b, 8d, and 8f) under all sunlight conditions. The results for Merlot were best at noon, with 0.47-0.70 of R2 (Figure 8d), and worst in the afternoon, with only 0.13-0.29 of R2, when the imaging side was against the direction of the sunlight (Figure 8f). Visual comparisons of model performance under different sunlight conditions can be found in Figures A.5-A.6 in the Appendices.
Although Swin-T-YOLOv5 outperformed all other models in detecting grape bunches under various external and internal variations, detection failures (i.e., FNs and FPs) occurred more frequently in several scenarios, as illustrated in Figure 9. For example, severe occlusion (mainly by leaves) was the major cause of FNs and FPs in this research, as marked by the red bounding boxes, particularly when the visible part of a grape bunch was small (Figure 9(e)-9(f)) or similar in color to the background (Figure 9(a)-9(c)). In addition, clustered grape bunches can cause detection failures, where two grape bunches were detected as one (Figure 9(d)).

Discussion
Compared to other mid-to-large sized fruits, such as apple, citrus, pear, and kiwifruit, grape bunches in vineyards have more complex structural shapes and silhouette characteristics, making them harder to accurately detect using machine vision systems. Previous studies on identifying and counting grape bunches commonly employed CNN-only object detectors, whose detection accuracy and robustness suffered from severe canopy occlusions and varying light conditions [46], [47]. Accurate and fast identification of overlapped grape bunches in dense-foliage canopies under natural lighting remains a key challenge in vineyards. Therefore, this research proposed combining the architectures of a conventional CNN model (YOLOv5) and a vision Transformer model (Swin-transformer) to inherit the advantages of both and preserve global and local features when detecting grape bunches. By replacing several CONV and CSP Bottleneck blocks with Swin-transformer encoder blocks in the original YOLOv5 (Figure A.4), the newly integrated detector (i.e., Swin-T-YOLOv5) worked as expected in overcoming the drawback of CNNs in capturing global features due to their limited receptive fields (Figure 5).
Our proposed Swin-T-YOLOv5 was tested for its detection performance on two wine grape varieties (Chardonnay with white berry skin and Merlot with red berry skin), under two weather/sky conditions (sunny and cloudy), at two berry maturity stages (immature and mature), and under three sunlight directions/intensities (morning, noon, and afternoon). Specifically, Swin-T-YOLOv5 performed the best under the cloudy weather condition, with the highest mAP of 97.19%, which was 1.83% higher than under the sunny condition (Table 5), although the difference was inconsiderable. When testing the models at different berry maturity stages, Swin-T-YOLOv5 performed much better on matured berries, with a mAP of 95.86%, than on immature berries (5.55% lower mAP; Table 6). These results were reasonable because the berries tended to be smaller and lighter in color during the early growth stage and thus more difficult to detect. Moreover, Swin-T-YOLOv5 achieved a better mAP at noon (94.53%; with the maximum solar elevation angle) than at the other two times of day (Table 7), while the afternoon sunlight condition (i.e., against the direction of the light) affected the model more negatively, with a lower mAP of 91.96%, than the morning (i.e., in the direction of the light). Apparently, berry maturity and sunlight direction were the major factors impacting model performance, while weather conditions barely changed the detection results. The improvement from the original YOLOv5 to the proposed Swin-T-YOLOv5 varied with the conditions (0.53%-6.08%), with the maximum increment occurring at noon (Table 7). Overall, it was confirmed that Swin-T-YOLOv5 achieved the best results in terms of mAP (up to 97.19%) and F1-score (up to 0.89) among all compared models for wine grape bunch detection in vineyards.
Its inference speed was the second best (13.2 ms per image), only after YOLOv5's (11.8 ms per image), under all test conditions.
To further assess model performance, we compared the number of grape bunches predicted by Swin-T-YOLOv5 with two sets of ground truth values, from in-field manual counting and from manual labeling on the images. The R2 and RMSE between Swin-T-YOLOv5 and in-field counting generally showed trends similar to those between Swin-T-YOLOv5 and manual labeling, potentially because some of the heavily occluded grape bunches were not labeled during the annotation process. However, the values varied vastly between the two grape varieties (Chardonnay and Merlot) under the various conditions (Figures 6-8). This was possibly because the Merlot variety had a more complex appearance when the berries were immature, with either white or white-red mixed color (Figure 1(e)), which may have caused more detection errors under more challenging test conditions, such as imaging against the direction of the light. In general, detecting grape bunches of the Merlot variety was more challenging than of the Chardonnay variety under all test conditions in this research.

Conclusion
This research proposed an optimal, real-time wine grape bunch detection model for natural vineyards by architecturally integrating the YOLOv5 and Swin-transformer detectors, called Swin-T-YOLOv5. The research was carried out on two grape varieties, Chardonnay (white berry skin when matured) and Merlot (red berry skin when matured), throughout the growth season from 7/4/2019 to 9/30/2019 under various testing conditions, including two weather/sky conditions (i.e., sunny and cloudy), two berry maturity stages (i.e., immature and mature), and three sunlight directions/intensities (i.e., morning, noon, and afternoon). Further assessment was made by comparing the proposed Swin-T-YOLOv5 with other commonly used detectors, including Faster R-CNN, YOLOv3, YOLOv4, and YOLOv5. Based on the obtained results, the following conclusions can be drawn: