An Improved YOLOv5 Crack Detection Method Combined with a Bottleneck Transformer

Abstract: Efficient detection of pavement cracks can effectively prevent traffic accidents and reduce road maintenance costs. In this paper, an improved YOLOv5 network combined with a Bottleneck Transformer is proposed for crack detection, called YOLOv5-CBoT. By combining the CNN and Transformer, YOLOv5-CBoT can better capture long-range dependencies to obtain more global information, so as to adapt to the long-span detection task of cracks. Moreover, the C2f module, which is proposed in the state-of-the-art object detection network YOLOv8, is introduced to further optimize the network by paralleling more gradient flow branches to obtain richer gradient information. The experimental results show that the improved YOLOv5 network achieves competitive results on the RDD2020 dataset, with fewer parameters and lower computational complexity but with higher accuracy and faster inference speed.


Introduction
For most transportation agencies, maintaining high-quality road surfaces is one of the keys to maintaining road safety. Cracks are a common form of pavement distress that seriously affects road and traffic safety. Timely detection of road cracks is of great significance for preventing pavement damage and maintaining traffic safety [1].
Early manual visual inspection methods are tedious, time-consuming, subjective, error-prone, unsafe, and obstructive to traffic. Some other traditional methods require the use of 3D lidar or acoustic wave detection to obtain road condition assessments, which is uneconomical, slow, and difficult to deploy [2]. To overcome these disadvantages, automatic detection methods combining image sensors and computer vision algorithms have gradually become the mainstream methods.
However, all the above methods rely heavily on manual feature extraction. For different scenes and different lighting conditions, different feature extraction models need to be designed. In the complex and changeable road environment, it is difficult to use a unified feature model to effectively extract the features of cracks, resulting in poor detection robustness [19].
To achieve automatic feature extraction, Deep Convolutional Neural Networks (DCNNs) have been applied to crack detection. The current popular DCNN-based crack detection methods mainly include semantic segmentation methods and object detection methods. The semantic segmentation methods usually employ an encoder-decoder structure to classify each pixel and obtain precise crack regions, such as FCN [20,21], U-Net [22], DeepCrack [23], etc. However, Zhuang et al. pointed out that the success of deep learning models in semantic segmentation comes at the cost of a heavy computation burden [24]. Furthermore, the datasets used in current semantic segmentation research contain only road surface images, without any scenes outside the road. Such image data collection is usually performed by special inspection vehicles or by regular vehicles specially modified with camera brackets so that the cameras face the ground. The labeling cost of these datasets is also very high.
The object detection method is an economical choice when precise measurement of crack size is not required. Object detection using a vehicle-mounted mobile phone camera is low-cost, simple, and efficient and can be deployed on any ordinary car. Moreover, the scenes contained in its dataset are more complex and diverse. Therefore, this paper focuses on crack detection using an object detection method based on a vehicle-mounted mobile phone camera.
YOLOv5 [25] is a single-stage object detection model with four versions: YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x. The four versions share exactly the same network structure, and two parameters, depth_multiple and width_multiple, are used to obtain different network depths and widths. Structurally, it uses Cross Stage Partial (CSP) and Spatial Pyramid Pooling-Fast (SPPF) modules in the Backbone network and the Feature Pyramid Network (FPN) and Path Aggregation Network (PAN) in the Neck network. In addition, Mosaic data augmentation, adaptive anchor box calculation, and adaptive image scaling are used to improve training. YOLOv5 has been widely adopted by the academic and engineering communities since its release. The USC-InfoLab team won the GRDDC'2020 championship using YOLOv5 [26], which also demonstrated the effectiveness of YOLOv5 for crack detection. Therefore, we designed an improved network based on YOLOv5 to detect cracks.
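As a rough illustration of how the two multipliers derive the four versions from a single architecture definition, the following sketch scales a block's repeat count and channel width. The multiplier values and rounding rules are assumptions taken from the public YOLOv5 configuration files as commonly distributed, and the helper names `scale_depth` and `scale_width` are hypothetical.

```python
import math

# (depth_multiple, width_multiple) per version.
# Assumed values, following the public YOLOv5 model configuration files.
VERSIONS = {
    "YOLOv5s": (0.33, 0.50),
    "YOLOv5m": (0.67, 0.75),
    "YOLOv5l": (1.00, 1.00),
    "YOLOv5x": (1.33, 1.25),
}


def scale_depth(repeats, depth_multiple):
    """Scale the number of stacked blocks; a single block is left untouched."""
    if repeats <= 1:
        return repeats
    return max(round(repeats * depth_multiple), 1)


def scale_width(channels, width_multiple, divisor=8):
    """Scale a channel count and round it up to a multiple of `divisor`."""
    return math.ceil(channels * width_multiple / divisor) * divisor


# Same base definition (9 stacked blocks, 1024 channels), four different sizes.
for name, (depth, width) in VERSIONS.items():
    print(name, scale_depth(9, depth), scale_width(1024, width))
```

Under these assumptions, the improved network in this paper, which starts from YOLOv5m, would use roughly two thirds of the base depth and three quarters of the base width.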
Cracks are usually long and narrow, with a length much greater than their width, giving them a slender shape in space. This structural feature means that crack detection requires more long-range dependencies to obtain contextual information.
The Bottleneck Transformer was proposed by Srinivas et al. [27]; by using a Multi-Head Self-Attention mechanism, it can capture long-range dependencies better than a DCNN. This is exactly what is needed for the crack detection task. Inspired by YOLOv5 and the Bottleneck Transformer, this paper proposes an improved YOLOv5 network combined with the Bottleneck Transformer. Our main contributions can be summarized as follows.

1. We modified YOLOv5 in combination with the Bottleneck Transformer and proposed an end-to-end pavement crack detection network for efficient detection of crack regions.

2. The C2f module proposed in the state-of-the-art object detection model YOLOv8 is introduced to optimize the network. In addition, we compared the effect on network performance of introducing the C2f module at different locations in the model.

3. We achieved competitive results on the evaluation dataset with fewer parameters and fewer Giga Floating-point Operations (GFLOPs).

The rest of the paper is organized as follows: Section 2 reviews the previous work on pavement crack object detection based on deep learning. Then, in Section 3, we describe the network architecture of our model and the specific composition of its modules. Next, in Section 4, we perform experimental verification and discuss our method, including its ablation study. Finally, in Section 5, we summarize our work and point out future research plans.

Deep Convolutional Neural Network Methods
To overcome the problem of manual feature extraction in traditional methods, deep learning methods automatically extract various features of cracks. In practice, object detection methods are divided into two categories: one-stage algorithms and two-stage algorithms.
The two-stage algorithms first generate a series of candidate bounding boxes as samples and then classify the samples through a convolutional neural network. Typical representatives of such algorithms are Fast R-CNN [28], Faster R-CNN [29], Cascade R-CNN [30], etc. Nie et al. [31] proposed a crack detection model based on Faster R-CNN, using transfer learning with parameter fine-tuning to detect cracks, looseness, deformation, and other pavement distresses. Hascoet et al. [32] used Faster R-CNN for crack detection, improved detection performance using techniques such as label smoothing, and also presented their efforts in deploying their model on local road networks. Vishwakarma et al. [33] introduced a tuning strategy for Faster R-CNN based on deep residual network (ResNet) and feature pyramid network (FPN) backbones of different depths. Furthermore, they compared it with a single-stage YOLOv5 model with a Cross Stage Partial Network (CSPNet) backbone. Pei et al. [34] applied Cascade R-CNN to crack detection and proposed a Consistency Filtering Mechanism (CFM) with a self-supervised approach to fully utilize the available unlabeled data.
In contrast, the one-stage algorithms treat object detection as a regression task, directly regressing the bounding boxes and predicting the categories at multiple locations over the entire image to obtain more comprehensive information [35]; examples include SSD [36], the YOLO series [37][38][39], CenterNet [40], EfficientDet [41], etc. Mandal et al. [42] proposed an automated pavement distress analysis system based on YOLOv2. Shao et al. [43] adopted the YOLOv3 framework and proposed a PTZ camera-based image-processing pipeline for crack size measurement. Zhang et al. [44] combined YOLOv3 with Adaptive Spatial Feature Fusion (ASFF) to improve the adaptability to cracks of different scales. The authors of [45] adopted YOLOv4 as the base network to detect cracks and proposed a generative adversarial network (GAN) for data augmentation. They also evaluated the effects of tricks such as data augmentation, transfer learning, and optimized anchors, as well as their combinations. Mandal et al. [46] used YOLOv4 for crack detection and studied the impact of various backbones. Liu et al. [47] conducted ensemble learning of YOLOv4 and Fast R-CNN. Hu et al. [19] employed YOLOv5 to detect cracks and compared the model size, detection speed, and accuracy of the four versions of YOLOv5. Guo et al. [1] proposed using the lightweight network MobileNetv3 to replace the backbone network of the original YOLOv5s model to reduce model parameters and GFLOPs and adopted Coordinate Attention to optimize the model.
Although these CNN-based methods have achieved good results, the CNN receptive field is usually small, which is not conducive to capturing global features. However, due to the large span and slender shape of cracks, global features are exactly what is needed for the crack detection task. Therefore, researchers began to explore combinations of CNN and Transformer that can better capture global features.

CNN and Transformer Combined Methods
In recent years, Transformers have made great breakthroughs in computer vision (CV). Dosovitskiy et al. proposed the Vision Transformer (ViT), which uses Multi-Head Self-Attention (MSA) to capture long-range dependencies [48]. The success of CNN relies on its two inherent inductive biases, translation invariance and local correlation, while Vision Transformer architectures usually lack these properties, so a large amount of data is usually required for them to exceed the performance of CNN. The limited receptive field of CNN makes it difficult to capture global information, while the Transformer can capture long-distance dependencies. Therefore, after the emergence of ViT, many excellent works have tried to combine CNN and Transformer, allowing the networks to inherit the advantages of both, so as to preserve both global features and local features as much as possible. Jing et al. [49] and Zhu et al. [50] integrated Transformer into YOLOv5 to process UAV-captured images. Lei et al. [51] replaced the YOLOv5 backbone network with Transformer for underwater object detection. Xiang et al. [35] designed an improved YOLOv5 by embedding the Vision Transformer into the backbone network, which can detect cracks accurately with fast inference speed. Methods that combine CNN with Transformer can capture the long-range dependencies of crack objects while maintaining real-time performance, which is also the focus of this paper.

Network Architecture
Considering the excellent performance of YOLOv5 in crack detection, we designed an improved YOLOv5m network, named YOLOv5-CBoT, by combining the Bot-transformer and C2f modules, as shown in Figure 1. YOLOv5-CBoT mainly consists of three parts: the backbone network (Backbone), the bottleneck layer network (Neck), and the detection layer (Head). The Backbone is mainly composed of the standard convolution module (CBS), the C3 module, the Bot-transformer module, and the spatial pyramid pooling module (SPPF). The Neck consists of the CBS module, the C2f module, and a series of concatenation operations. There are two main differences between YOLOv5-CBoT and the original YOLOv5.

1. The last C3 module in the original YOLOv5 backbone network, that is, the C3 module before the SPPF module, is replaced by the Bot-transformer module.

2. All C3 modules in the Neck network are replaced by C2f modules.

The CBS module is the basic unit of the entire network, mainly performing downsampling, channel expansion and reduction, normalization, and nonlinear activation of feature maps.
The C3 module is an important feature extraction module, which consists of three CBS modules and several stacked Bottlenecks. C3_x means that x Bottleneck blocks are stacked. As shown in Figure 1, after the feature map is input into the C3 module, it is divided into two branches. One branch passes through a CBS and the Bottlenecks; the other passes through one CBS only. Finally, the two branches are concatenated and then passed through a CBS module. There are two CBS modules in the Bottleneck block; the first is a 1 × 1 convolution, which halves the number of channels, and the second is a 3 × 3 convolution, which doubles the number of channels. Reducing the dimension first helps the convolution kernel to better capture the feature information, and increasing the dimension afterwards helps extract more detailed features. Finally, a residual structure adds the input to the output to avoid the vanishing gradient problem. The main function of C3 is to increase the depth and receptive field of the network by stacking basic CBS modules and residual connections and thereby improve the feature extraction ability.
Actual pavement cracks mainly include longitudinal cracks and transverse cracks, which span large areas in the vertical or horizontal direction, respectively. For such cracks, the receptive field of conventional convolutions is too small to cover the entire crack region. Although the receptive field can be expanded by stacking multiple convolutional layers, it is still insufficient for feature extraction of large-span crack objects. To this end, we introduced a Bot-transformer structure to capture the long-range dependencies of crack objects and extract richer contextual semantics. The details of the Bot-transformer are introduced in Section 3.2.
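The receptive-field argument can be made concrete with the standard recurrence for stacked convolutions. The sketch below is illustrative only: the layer stacks are hypothetical examples and are not the actual YOLOv5 backbone configuration.

```python
def receptive_field(layers):
    """Receptive field, in input pixels, of a stack of convolution layers.

    `layers` is a list of (kernel_size, stride) pairs ordered from input to output.
    """
    field = 1  # a single output position initially sees one input pixel
    jump = 1   # distance in input pixels between neighbouring output positions
    for kernel, stride in layers:
        field += (kernel - 1) * jump
        jump *= stride
    return field


# Three stacked 3x3, stride-1 convolutions see only a 7x7 window.
print(receptive_field([(3, 1)] * 3))

# Five stacked 3x3, stride-2 convolutions (a hypothetical downsampling stack)
# see a 63x63 window, still far smaller than a 640x640 input.
print(receptive_field([(3, 2)] * 5))
```

By contrast, a self-attention layer relates every position of its input feature map to every other position in a single step, which is the property exploited for long cracks.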
The function of the neck is to combine shallow graphic features with deep semantic features to obtain more complete features. As can be seen from Figure 1, the left part of the neck upsamples the feature map by interpolation, making the feature map progressively larger to facilitate fusion with feature maps from different layers of the backbone; the right part of the neck continues to downsample, on the one hand to obtain feature maps of different scales and on the other hand to better fuse shallow graphic features with deep semantic features.
Furthermore, we use the recent C2f module introduced in YOLOv8 to replace the C3 module in the neck. The C2f module draws on the ideas of the C3 module and ELAN [52]. As can be seen from Figure 1, there are two differences between C2f and C3.

1. C2f uses one fewer standard convolution module (CBS) than C3, which makes the network more lightweight.

2. In addition to the serial stacking of Bottleneck modules as in C3, C2f concatenates the outputs of the Bottleneck modules in parallel, which helps to obtain rich gradient flow information.

Based on the above two points, the role of C2f is to obtain richer gradient flow information while keeping the network lightweight.
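The two differences can be summarized by counting, for each block, the CBS modules it uses and the tensors that reach its final concatenation. The sketch below is a bookkeeping aid only: it assumes a hidden width of half the output channels, as in the public YOLOv5 and YOLOv8 implementations, and the function names and dictionary keys are hypothetical.

```python
def c3_layout(c_out, n):
    """Bookkeeping for a C3 block with n stacked Bottlenecks."""
    hidden = c_out // 2  # assumed hidden width (expansion ratio 0.5)
    return {
        "cbs_modules": 3,               # two branch convolutions plus one fusing convolution
        "concat_inputs": 2,             # Bottleneck-branch output and the CBS-only branch
        "concat_channels": 2 * hidden,
        "bottlenecks": n,
    }


def c2f_layout(c_out, n):
    """Bookkeeping for a C2f block with n stacked Bottlenecks."""
    hidden = c_out // 2  # assumed hidden width (expansion ratio 0.5)
    return {
        "cbs_modules": 2,               # one input convolution (then split) plus one fusing convolution
        "concat_inputs": 2 + n,         # both split halves plus the output of every Bottleneck
        "concat_channels": (2 + n) * hidden,
        "bottlenecks": n,
    }


print(c3_layout(256, 3))
print(c2f_layout(256, 3))
```

Under these assumptions, every Bottleneck output in C2f has a direct path to the fusing convolution, which is one way to read the "richer gradient flow" described above.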

Bot-Transformer Module
BoTNet, proposed by Srinivas et al. [27], is a simple but powerful backbone that incorporates self-attention into various computer vision tasks. By only replacing the 3 × 3 convolution with Multi-Head Self-Attention (MHSA) in ResNet, without any other changes, the network performance for instance segmentation and object detection is significantly improved, while the network parameters are also reduced. The Bot-Transformer module proposed in this paper, as shown in Figure 2a, is modified from the C3 module of YOLOv5 by changing its Bottleneck block to a Bottleneck Transformer block. The difference between the Bottleneck Transformer block and the original Bottleneck block is that the standard 3 × 3 convolution module (CBS) is replaced by an MHSA block, as shown in Figure 2b, which is consistent with the idea of BoTNet. Therefore, the key to the Bottleneck Transformer lies in the use of MHSA, which is introduced in Section 3.2.2.

MHSA Block
As described in Figure 2, the MHSA block is the core component of the Bottleneck Transformer. The structure of the MHSA block is shown in Figure 3. In the figure, q, k, v, and r represent the query, key, value, and position encodings, respectively. The input size of MHSA is H × W × d, where H and W represent the height and width of the input feature matrix and d represents the dimension of a single token. It should be noted that the MHSA here is different from the Multi-Head Self-Attention of the traditional Transformer, such as the Vision Transformer (ViT). The biggest difference is that the content-position branch on the left introduces a two-dimensional position encoding. Two learnable parameter vectors, $R_h$ and $R_w$, represent the position encodings along the height and width, respectively. They are added through the broadcast mechanism, so the encoding of position $(i, j)$ is the sum of two d-dimensional vectors $R_{h_i}$ and $R_{w_j}$. Matrix multiplication is performed between the position encoding and the query matrix to obtain one part of the attention, namely $qr^{T}$. In the content-content branch, the query matrix and the key matrix are multiplied to obtain the other part of the attention, namely $qk^{T}$. Finally, the attention logits are $qk^{T} + qr^{T}$. The entire MHSA layer can be expressed as:

$$\mathrm{Attention}(q, k, v, r) = \mathrm{softmax}\left(qk^{T} + qr^{T}\right)v \tag{1}$$
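The computation described above can be traced with a small, dependency-free sketch. It covers a single head only, omits the learned projections that produce q, k, and v, and applies a softmax over the logits as in BoTNet [27]; the function names are hypothetical and the numbers are toy values.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax(row):
    peak = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def position_encoding(r_h, r_w):
    """Broadcast sum: the encoding of position (i, j) is r_h[i] + r_w[j].

    Returns H*W rows of dimension d, in row-major (i, then j) order.
    """
    return [[a + b for a, b in zip(rh, rw)] for rh in r_h for rw in r_w]


def attention(q, k, v, r_h, r_w):
    """Single-head attention with logits q k^T + q r^T over H*W flattened tokens."""
    r = position_encoding(r_h, r_w)
    content_content = matmul(q, transpose(k))
    content_position = matmul(q, transpose(r))
    logits = [[cc + cp for cc, cp in zip(row_cc, row_cp)]
              for row_cc, row_cp in zip(content_content, content_position)]
    weights = [softmax(row) for row in logits]
    return matmul(weights, v)


# Toy 2 x 2 feature map with d = 2, flattened to 4 tokens.
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
v = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0]]
r_h = [[0.1, 0.0], [0.0, 0.1]]  # one d-dimensional vector per row
r_w = [[0.2, 0.0], [0.0, 0.2]]  # one d-dimensional vector per column
print(attention(q, k, v, r_h, r_w))
```

Each output token is a weighted average of all value vectors, so a token at one end of a crack can draw on features from the other end within a single layer.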

Experiment

Dataset
The public dataset we used in this research is the RDD2020 dataset [53], which consists of road damage images from three countries: Japan, the Czech Republic, and India. The dataset contains 21,041 images, 10,506 of Japanese pavement, 7706 of Indian pavement, and 2829 of Czech pavement, covering eight road conditions including longitudinal cracks, transverse cracks, crosswalk blur, etc. Since this study focuses on crack detection in pavement maintenance, we only consider four types of damage: longitudinal cracks (D00), transverse cracks (D10), alligator cracks (D20), and potholes (D40). Because a large number of images in the dataset do not contain any of the four types of detection targets, we screened the dataset and finally retained 12,195 images, 1072 from the Czech Republic, 3223 from India, and 7900 from Japan, each of which contains at least one of the four types of detection targets: D00, D10, D20, and D40. The same filtering process is also adopted in [1,54]. The number of detection targets of each type contained in the crack images of the three countries is shown in Table 1.
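The screening step can be sketched as a simple filter over per-image label lists. This is only an illustration of the idea: the annotation format (a dictionary from image name to class codes), the image names, and the function name are hypothetical, and the actual RDD2020 annotations are stored in a different format.

```python
TARGET_CLASSES = {"D00", "D10", "D20", "D40"}


def filter_annotations(annotations, targets=TARGET_CLASSES):
    """Keep images that contain at least one target class, and only their target labels.

    `annotations` maps an image name to the list of class codes annotated in it.
    """
    kept = {}
    for image, labels in annotations.items():
        selected = [label for label in labels if label in targets]
        if selected:
            kept[image] = selected
    return kept


# Hypothetical annotations; "D44" stands for any non-target class code.
annotations = {
    "Japan_000001.jpg": ["D00", "D44"],
    "India_000002.jpg": [],
    "Czech_000003.jpg": ["D44"],
    "Japan_000004.jpg": ["D20", "D40"],
}
print(filter_annotations(annotations))

# Consistency check of the per-country counts reported in the text.
retained = {"Czech": 1072, "India": 3223, "Japan": 7900}
print(sum(retained.values()))
```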

Evaluation Metrics
We use precision and recall as basic evaluation metrics in this study, and they can be defined as:

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{2}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{3}$$

where TP indicates that the predicted category of a predicted bounding box is consistent with the ground truth and the IoU between them is greater than 0.5, FP indicates that a predicted bounding box appears in a region without objects, and FN means that the network fails to predict a ground truth object [35]. The F1-score takes both precision and recall into account, so it can comprehensively reflect the overall performance of the network. It is calculated as the harmonic mean of the two indicators, as shown in Equation (4):

$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{4}$$

Like most object detection studies, mAP0.5 and mAP0.5:0.95 are used as important evaluation indicators of network performance. In addition, the number of model parameters and GFLOPs are adopted to measure model complexity, and Frames Per Second (FPS) is used to evaluate inference speed.
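For reference, the basic quantities used by these metrics can be computed as in the following minimal sketch. The box format (x1, y1, x2, y2) and the function names are assumptions made for illustration; matching predictions to ground truth and averaging over classes for mAP are not shown.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1-score from true-positive, false-positive, and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# A prediction counts as a true positive only if its class matches and IoU > 0.5.
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # overlap of 1 over a union of 7
print(precision_recall_f1(tp=8, fp=2, fn=2))
```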

Experimental Environment and Parameter Settings
The experimental environment for this research is an NVIDIA GeForce RTX 3090 GPU, an Intel i9-10900X CPU, and 32 GB of RAM. The model is implemented in the PyTorch deep learning framework on the Windows 10 operating system.
In the training process, the input image size of the network is set to 640 × 640, and the model is trained for 100 epochs using the SGD optimizer with a batch size of 16, an initial learning rate of 0.01, a momentum of 0.937, and a weight decay of 0.0005. We use a warmup strategy in the first three epochs and then use a cosine learning rate decay strategy so that gradient descent proceeds more smoothly.
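The learning-rate schedule described above can be approximated by the sketch below. It is a per-epoch simplification under stated assumptions: the warmup is taken to be linear from zero, and the final learning rate is assumed to be a fraction `final_fraction` of the initial one, a value the text does not specify. The function name is hypothetical, and the actual YOLOv5 training script applies warmup per iteration.

```python
import math

# Settings reported in the text.
TRAINING = {"epochs": 100, "batch_size": 16, "lr0": 0.01,
            "momentum": 0.937, "weight_decay": 0.0005, "warmup_epochs": 3}


def learning_rate(epoch, lr0=0.01, final_fraction=0.1, epochs=100, warmup_epochs=3):
    """Cosine decay from lr0 to lr0 * final_fraction, with a linear warmup at the start."""
    cosine = lr0 * (final_fraction
                    + (1 - final_fraction) * (1 + math.cos(math.pi * epoch / epochs)) / 2)
    if epoch < warmup_epochs:
        return cosine * epoch / warmup_epochs  # ramp up from zero
    return cosine


for epoch in (0, 1, 3, 50, 100):
    print(epoch, round(learning_rate(epoch), 6))
```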

Results
On the RDD2020 dataset, we compared our experimental results with those of several strong existing methods. It can be seen from Table 2 that our algorithm is significantly better than YOLOv3, YOLOv4, and other algorithms. YOLOv5 has been shown to be very effective in crack detection by the USC-InfoLab team, and our algorithm is an improvement based on YOLOv5m. As shown in Table 2, compared with the original YOLOv5 algorithm, our parameter count and GFLOPs are reduced by nearly half, while the F1-score, mAP0.5, and mAP0.5:0.95 are increased by 1.3%, 2.1%, and 3.1%, respectively, and the FPS is increased 1.4 times. Even compared with the state-of-the-art YOLOv7 algorithm, our parameter count is 34% lower and our GFLOPs are 46% lower, yet our F1-score is still 1% higher and our mAP0.5 is 0.9% higher. The FPS is equivalent to that of YOLOv7; only the mAP0.5:0.95 indicator is slightly worse. The experimental results show that our algorithm achieves competitive results, with fewer parameters and lower computational complexity but with higher accuracy and faster inference speed.
Figure 4 shows a visualization of the detection results. Our algorithm can effectively detect cracks of various scales and types in complex and diverse scenes.
To verify the effect of introducing C2f modules at different locations on network performance, we designed a set of ablation experiments, as shown in Table 3. C2f-Backbone means that the C3 modules in the backbone network are replaced by C2f modules while the neck network is kept unchanged. C2f-head means that the C3 modules in the neck network are replaced by C2f modules while the C3 modules in the backbone network are kept unchanged. C2f-all means that all the C3 modules in both the backbone network and the neck network are replaced by C2f modules. The experimental results show that replacing C3 with the C2f module in the neck network achieves the best performance with fewer GFLOPs and higher FPS.
To verify the effect of the number of heads in the MHSA block on network performance, we designed a set of ablation experiments, as shown in Table 4. The experimental results show that the network performance is best when the number of heads is four. Moreover, changing the number of heads has relatively little influence on the experimental results. This may suggest that the MHSA structure itself is more important.
Training hyperparameters generally have a considerable impact on experimental results. To this end, we designed a set of ablation experiments and compared the two sets of default training hyperparameters of YOLOv5, the low-augment hyperparameters group (LAHG) and the medium-augment hyperparameters group (MAHG), as shown in Table 5. Considering that these two sets of parameters were optimized on the COCO dataset by YOLOv5, we also used hyperparameter evolution, although it consumes a huge amount of training time. We ran 25 iterations of hyperparameter evolution training based on the low-augment group and the medium-augment group, respectively. Each such experiment took 25 times as long as a regular experiment. The experimental results show that hyperparameter evolution based on LAHG can improve the network performance, but hyperparameter evolution based on MAHG does not. The performance of MAHG is still stronger than that of the hyperparameter evolution based on LAHG, although the latter consumes more than 20 times the training time. The main reason may be that MAHG uses mixup data augmentation, but LAHG does not. This also suggests that mixup data augmentation is very effective for YOLOv5. We then used the default MAHG of YOLOv5 for all later experiments.
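Since mixup appears to be the decisive difference between the two hyperparameter groups, a minimal sketch of the augmentation may be helpful. Images are represented as nested lists of pixel values to keep the example dependency-free; the Beta(32, 32) mixing ratio is an assumption based on the public YOLOv5 code, and the function signature is hypothetical.

```python
import random


def mixup(image_a, labels_a, image_b, labels_b, ratio=None, rng=random):
    """Blend two equally sized images and keep the labels of both.

    If `ratio` is not given, it is drawn from Beta(32, 32), which concentrates
    around 0.5 (assumed setting).
    """
    if ratio is None:
        ratio = rng.betavariate(32.0, 32.0)
    mixed = [[ratio * a + (1 - ratio) * b for a, b in zip(row_a, row_b)]
             for row_a, row_b in zip(image_a, image_b)]
    return mixed, labels_a + labels_b


# Two tiny single-row "images" with one labelled object each.
image, labels = mixup([[0, 100]], ["D00"], [[100, 0]], ["D10"], ratio=0.25)
print(image, labels)
```

For detection, the bounding boxes of both source images are kept unchanged, so the network sees overlapping crack instances at reduced contrast.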

Conclusions
With the rapid development of urban traffic, there is an urgent need for efficient and low-cost detection methods for pavement maintenance. In this paper, an improved YOLOv5 network combined with the Bottleneck Transformer is proposed, which can capture long-range dependencies to adapt to the long-span, slender features of a crack object. In addition, the C2f module is introduced to further optimize the network. The experimental results show that, compared with the original YOLOv5 algorithm, the F1-score of our algorithm is increased by 1.3%, the mAP0.5 is increased by 2.1%, the mAP0.5:0.95 is increased by 3.1%, and the inference speed is 1.4 times faster, while the parameter count and GFLOPs are reduced by nearly half. To further improve our detection system, we plan to expand the dataset, study more data augmentation methods, and investigate more attention mechanisms to deal with more complex pavement maintenance work.

Figure 1. Improved network architecture based on YOLOv5m, incorporating the Bot-transformer block and the C2f module. *x means that x identical blocks are stacked.

Figure 2. The architecture of the Bot-transformer block. (a) BottleneckTransformer*x means that x BottleneckTransformer blocks are stacked; each BottleneckTransformer is shown in (b). (b) Structural diagram of the BottleneckTransformer submodule of the Bot-transformer block.

Figure 4. Visualization results of some detection samples.

Table 1. Relevant statistics of the experimental dataset.

Table 2. Performance comparison of various methods.

Table 3. Performance comparison of introducing the C2f module at different locations of the network.

Table 4. Performance comparison of different head numbers of MHSA.

Table 5. Performance comparison of different hyperparameter groups.