3.1. Replacement of the Convolution Module
In YOLOv8’s native network structure, convolution layers using standard convolution have the following limitations:
- (1)
Standard convolution can only learn fixed convolution kernels, lacking adaptive adjustment ability for convolution kernel weights, which reduces the convolution layer’s generalization ability and adaptability under different input conditions;
- (2)
Standard convolution relies mainly on matrix multiplication, which results in a large number of parameters and a large amount of computation, especially in the fully connected layers, increasing the model's storage and inference overhead;
- (3)
Standard convolution cannot fully utilize the spatial and channel information of input features, because it uses the same weights for every position and every channel, which makes the model ignore some meaningful local or global features.
To solve the above existing limitation problems, this paper adopted a novel convolution structure, namely, omni-dimensional dynamic convolution (ODConv). ODConv adopts a multi-dimensional attention mechanism [
22], which can allocate dynamic attention weights to the convolution kernel in parallel along the four dimensions of the kernel space (namely, the spatial size of each convolution kernel, the input channel number, the output channel number, and the convolution kernel number). By embedding ODConv into YOLOv8's network structure, this paper achieves a significant improvement in model performance.
ODConv is a dynamic convolution operation that can adaptively adjust convolution kernel weights according to the content of the input feature map. The idea of ODConv is to use an attention mechanism to generate a weight vector for each input and multiply it with shared convolution kernels to obtain the dynamic convolution kernel. ODConv enhances the expressive ability and adaptability of the convolution layer and improves model performance in scenarios such as small target detection and multi-scale change. Dynamic convolution differs from conventional convolution in that it performs a weighted combination of multiple convolution kernels, making the kernel input-dependent. Dynamic convolution's mathematical expression is as follows:
y = (α_1 W_1 + α_2 W_2 + … + α_n W_n) * x

In the above equation, y is the output feature map, x is the input feature map, W_i is the i-th convolution kernel, α_i is the attention weight of the i-th convolution kernel, * is the convolution operation, and n is the number of convolution kernels.
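The weighted kernel combination above can be sketched in PyTorch. This is a minimal sketch, not the paper's implementation: the GAP-then-FC attention branch, the grouped-convolution batching trick, and the default of n = 4 kernels are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConv2d(nn.Module):
    """Minimal dynamic convolution: y = (sum_i alpha_i * W_i) * x."""
    def __init__(self, in_ch, out_ch, kernel_size, n_kernels=4):
        super().__init__()
        self.pad = kernel_size // 2
        # n parallel static kernels W_i
        self.weight = nn.Parameter(
            torch.randn(n_kernels, out_ch, in_ch, kernel_size, kernel_size) * 0.01)
        # attention branch: GAP -> FC -> softmax gives alpha_i per sample
        self.fc = nn.Linear(in_ch, n_kernels)

    def forward(self, x):
        b = x.size(0)
        alpha = F.softmax(self.fc(x.mean(dim=(2, 3))), dim=1)      # (b, n)
        # per-sample weighted combination of the n static kernels
        w = torch.einsum('bn,noihw->boihw', alpha, self.weight)
        w = w.reshape(-1, *self.weight.shape[2:])                  # (b*out, in, k, k)
        x = x.reshape(1, -1, *x.shape[2:])                         # (1, b*in, H, W)
        y = F.conv2d(x, w, padding=self.pad, groups=b)             # one group per sample
        return y.reshape(b, -1, *y.shape[2:])

# usage: per-sample dynamic kernels, same output shape as a standard conv
y = DynamicConv2d(8, 16, 3)(torch.randn(2, 8, 32, 32))
```

The `groups=b` trick applies a different combined kernel to each sample in the batch with a single convolution call.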
Dynamic convolution’s formula consists of two basic elements: convolution kernel and attention function used to calculate attention weight. For n convolution kernels, the constituted kernel space contains four dimensions: spatial kernel size, input channel number, output channel number, and n. However, CondConv [
23] and DyConv [
24] only use one attention scalar to calculate the output convolution kernel’s attention weight, while ignoring the spatial dimension, input channel dimension, and output channel dimension of the convolution kernel. CondConv is a type of convolution that learns specialized convolutional kernels for each example by parameterizing the kernels as a linear combination of n experts [
23]. DyConv is a type of convolution that dynamically integrates multiple parallel kernels into one dynamic kernel based on the input [
24]. It can be seen that CondConv's and DyConv's exploration of the kernel space is coarse. In addition, compared with conventional convolution, dynamic convolution requires n times more convolution kernel parameters (for example, n = 8 in CondConv and n = 4 in DyConv). If dynamic convolution is used extensively, it significantly increases the model size. When the attention mechanism is removed from CondConv/DyConv, their performance improvement falls below 1%. From these observations, it can be concluded that the attention mechanism plays a vital role in dynamic convolution: it determines how convolution kernels are generated and how weights are allocated. By improving the attention mechanism's structure and parameters, a better balance between model accuracy and size can be achieved. ODConv can fully utilize multiple dimensions of the kernel space, including kernel size, kernel number, and kernel depth, making it superior to existing methods such as CondConv and DyConv in terms of both model accuracy and size.
In order to achieve ODConv’s four types of attention values, we drew on the idea of CondConv and DyConv and used SE-style attention modules to dynamically generate different types of convolution kernels. Specifically, we first needed to perform global average pooling (GAP) on the input feature map
to obtain a feature vector of length
C,
, and then needed to use a fully connected layer (FC) and four different branches to generate four types of attention values, corresponding to spatial, channel, depth, and angle dimensions. As shown in
Figure 2, the structure of these four branches is as follows:
- (1)
Spatial branch: This branch generates spatial attention values, i.e., weights for each position. It maps the feature vector to a vector of length k × k (the spatial size of the convolution kernel) and then normalizes it using the Softmax activation function. Spatial attention values can be used to adjust the importance of different positions, thereby enhancing the feature expression of regions of interest.
- (2)
Channel branch: This branch generates channel attention values, i.e., weights for each channel. It maps the feature vector to a vector of length C and then normalizes it using the sigmoid activation function. Channel attention values can be used to adjust the contribution of different channels, thereby enhancing the feature expression of semantic relevance.
- (3)
Depth branch: This branch generates depth attention values, i.e., weights for each convolution kernel group. It maps the feature vector to a vector of length K (the number of convolution kernel groups) and then normalizes it using the Softmax activation function. Depth attention values can be used to select the most suitable convolution kernel group for the current input, thereby enhancing the feature expression of diversity and adaptability.
- (4)
Angle branch: This branch generates angle attention values, i.e., weights for each convolution kernel rotation angle. It maps the feature vector to a vector of length R (the number of candidate rotation angles) and then normalizes it using the Softmax activation function. Angle attention values can be used to select the most suitable convolution kernel rotation angle for the current input direction, thereby enhancing the feature expression of rotation invariance and orientation sensitivity.
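The four branches above can be sketched as a single module: GAP, a shared FC, and four heads with the activations named in the text. The reduction ratio, hidden width, and default branch sizes are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FourBranchAttention(nn.Module):
    """GAP -> shared FC -> four heads (spatial, channel, depth, angle).
    Branch names and normalizations follow the text; sizes are assumptions."""
    def __init__(self, in_ch, kernel_size=3, n_groups=4, n_angles=4, reduction=4):
        super().__init__()
        hidden = max(in_ch // reduction, 4)
        self.fc = nn.Sequential(nn.Linear(in_ch, hidden), nn.ReLU(inplace=True))
        self.spatial = nn.Linear(hidden, kernel_size * kernel_size)  # length k*k
        self.channel = nn.Linear(hidden, in_ch)                      # length C
        self.depth = nn.Linear(hidden, n_groups)                     # length K
        self.angle = nn.Linear(hidden, n_angles)                     # length R

    def forward(self, x):
        z = self.fc(x.mean(dim=(2, 3)))            # GAP then shared FC
        a_s = F.softmax(self.spatial(z), dim=1)    # per-position weights
        a_c = torch.sigmoid(self.channel(z))       # per-channel weights
        a_d = F.softmax(self.depth(z), dim=1)      # per kernel group
        a_r = F.softmax(self.angle(z), dim=1)      # per rotation angle
        return a_s, a_c, a_d, a_r

# usage: four attention vectors from one pooled descriptor
a_s, a_c, a_d, a_r = FourBranchAttention(8, kernel_size=3)(torch.randn(2, 8, 16, 16))
```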
Figure 2 depicts the forward propagation process of ODConv, which produces the output feature map from the input feature map and the convolution kernel parameters. Each node in the calculation diagram corresponds to a variable or operation in the formula, and each edge corresponds to an operator or assignment. The input feature map x is located at the top-left corner of the diagram, while the output feature map y is at the bottom-right corner, with the intermediate variables and operations shown in between. The calculation diagram consists of three parts:
- (1)
The first part generates the four types of attention values: the spatial attention value s, the channel attention value c, the depth attention value d, and the angle attention value r. This part corresponds to Equations (2)–(5), where each attention value is computed by applying a fully connected layer and an activation function to the pooled input feature vector;
- (2)
The second part generates the dynamic convolution kernels, where each dynamic convolution kernel is obtained by applying the depth and angle attention weights to the static convolution kernel parameters W_i;
- (3)
The third part generates the output feature map, where the output is obtained by weighting the input feature map with the spatial and channel attention values and then convolving it with the dynamic convolution kernels.
By such a design, we can obtain ODConv's four types of attention values and dynamically generate different convolution kernels according to them. ODConv applies the attention mechanism on four dimensions (input channel, output channel, kernel space, and kernel number) to adjust the convolution kernel's weight and shape. It can therefore be described as follows:

y = (α_w1 ⊙ α_f1 ⊙ α_c1 ⊙ α_s1 ⊙ W_1 + … + α_wn ⊙ α_fn ⊙ α_cn ⊙ α_sn ⊙ W_n) * x

In the above equation, α_ci is the input channel attention weight, computed as α_ci = FC_c(GAP(x)), where FC_c is a fully connected layer; α_fi is the output channel attention weight, computed as α_fi = FC_f(GAP(x)), where FC_f is a fully connected layer; α_si is the kernel space attention weight, computed as α_si = FC_s(GAP(x)), where FC_s is a fully connected layer; α_wi is the kernel number attention weight, computed as α_wi = FC_w(GAP(x)), where FC_w is a fully connected layer; ⊙ denotes element-wise multiplication along the corresponding kernel dimension; and W_i is the i-th static convolution kernel.
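A simplified ODConv-style layer along these lines can be sketched in PyTorch. This is a sketch under assumptions, not a reference implementation: the sigmoid/softmax activation choices, the reduction ratio, and the broadcasting layout of the four attentions are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ODConvLite(nn.Module):
    """Sketch of an ODConv-style layer: four attentions (kernel space,
    input channel, output channel, kernel number) modulate n static kernels."""
    def __init__(self, in_ch, out_ch, k=3, n=4, reduction=4):
        super().__init__()
        self.n, self.k, self.pad = n, k, k // 2
        hidden = max(in_ch // reduction, 4)
        self.weight = nn.Parameter(torch.randn(n, out_ch, in_ch, k, k) * 0.01)
        self.fc = nn.Sequential(nn.Linear(in_ch, hidden), nn.ReLU(inplace=True))
        self.a_spatial = nn.Linear(hidden, k * k)   # kernel-space attention
        self.a_in = nn.Linear(hidden, in_ch)        # input-channel attention
        self.a_out = nn.Linear(hidden, out_ch)      # output-channel attention
        self.a_kernel = nn.Linear(hidden, n)        # kernel-number attention

    def forward(self, x):
        b, c, h, w = x.shape
        z = self.fc(x.mean(dim=(2, 3)))
        a_s = torch.sigmoid(self.a_spatial(z)).view(b, 1, 1, 1, self.k, self.k)
        a_i = torch.sigmoid(self.a_in(z)).view(b, 1, 1, c, 1, 1)
        a_o = torch.sigmoid(self.a_out(z)).view(b, 1, -1, 1, 1, 1)
        a_w = F.softmax(self.a_kernel(z), dim=1).view(b, self.n, 1, 1, 1, 1)
        # per-sample kernel = sum_i a_w * (a_o ⊙ a_i ⊙ a_s ⊙ W_i)
        w_dyn = (a_w * a_o * a_i * a_s * self.weight.unsqueeze(0)).sum(dim=1)
        w_dyn = w_dyn.reshape(-1, c, self.k, self.k)         # (b*out, in, k, k)
        y = F.conv2d(x.reshape(1, -1, h, w), w_dyn, padding=self.pad, groups=b)
        return y.reshape(b, -1, h, w)

# usage: drop-in shape behavior of a standard 3x3 convolution
y = ODConvLite(8, 16)(torch.randn(2, 8, 16, 16))
```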
Based on the property that ODConv can be combined with other optimization techniques, as shown in
Figure 3, this paper replaced all standard convolution operations in YOLOv8’s network structure with ODConv operations to enhance the network’s dynamics and adaptability. The specific steps and implementation details were as follows:
- (1)
In the backbone network, we replaced all convolution layers after the first one, the Conv layers in the C2f modules, and all Conv layers in the SPPF module with ODConv, keeping the other parameters unchanged;
- (2)
In Head, we replaced Conv in each detection layer with ODConv, keeping other parameters unchanged;
- (3)
We trained using the same dataset, evaluation metrics, and experimental environment, and compared and analyzed with the original YOLOv8 model.
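The module swap in steps (1) and (2) can be sketched generically: collect every nn.Conv2d, keep the stem, and rebuild the rest from each layer's hyperparameters. The helper name, the `keep_first` behavior, and the plain-conv stand-in (where an ODConv layer would go) are illustrative assumptions.

```python
import torch
import torch.nn as nn

def swap_convs(model, factory, keep_first=True):
    """Replace every nn.Conv2d in `model` with factory(old_conv); optionally
    keep the first conv (the stem) unchanged, as in step (1) above."""
    convs = [(m, name, child) for m in model.modules()
             for name, child in m.named_children()
             if isinstance(child, nn.Conv2d)]
    if keep_first:
        convs = convs[1:]
    for parent, name, old in convs:
        setattr(parent, name, factory(old))
    return model

# usage sketch: swap in a same-shape plain conv as a stand-in for ODConv
demo = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                     nn.Conv2d(8, 8, 3, padding=1))
swap_convs(demo, lambda c: nn.Conv2d(c.in_channels, c.out_channels,
                                     c.kernel_size, padding=c.padding))
out = demo(torch.randn(1, 3, 16, 16))
```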
Compared with the YOLOv8-nano model, the optimized model with ODConv in place of standard convolution has fewer parameters and less computation and achieves higher mAP, recall, and precision. This shows that the ODConv module significantly improves weld defect detection performance. The ODConv module's advantages are mainly reflected in three aspects: first, it reduces model complexity and improves model efficiency and deployability; second, it enhances the model's detection accuracy and reliability; third, it increases the model's scale adaptability and robustness. In summary, the ODConv module is an effective improvement that can enhance weld defect detection performance.
3.2. Introduction of the NAM Attention Mechanism
This paper introduces a normalization-based attention module (NAM) to address the YOLOv8 model's shortcomings in the weld defect detection task and to enhance the model's feature extraction and classification ability. NAM uses the scaling factors of batch normalization as the channel and spatial attention weights [
25], which can adaptively adjust the model's degree of attention to weld defect features, thereby improving detection accuracy and robustness. At the same time, NAM uses sparse regularization to suppress insignificant features, reduce computation overhead, and maintain the model's efficiency. It can alleviate the following problems in the YOLOv8 model:
- (1)
YOLOv8 relies on large-scale annotated data to train the model, while weld defect samples in industrial scenarios are often scarce and difficult to collect, which limits the model’s generalization ability and adaptability;
- (2)
YOLOv8 adopts an Anchor-free detector which directly regresses the target’s position and size. Such a design reduces the number of model parameters, but may also lead to unstable detection results, especially for irregularly shaped and differently sized weld defects;
- (3)
YOLOv8’s backbone network uses the cross-stage partial connection method to balance network depth and width and improve feature extraction efficiency. However, this method may also cause insufficient information flow between feature maps, affecting capture of weld defect details.
The NAM attention mechanism is a neural network module based on a self-attention mechanism which can adaptively adjust weights of different positions in the feature map, thereby enhancing the expression ability of regions of interest. A self-attention mechanism is a method of calculating the relationship between each element and other elements in the input sequence; it can capture any long-distance dependency relationships in the input sequence, and can calculate these in parallel. The self-attention mechanism can be expressed as
Attention(Q, K, V) = Softmax(QKᵀ / √d_k) · V

In the above equation [26], Q, K, and V represent the query (Query), key (Key), and value (Value) matrices, respectively, and d_k represents the key vector dimension. The self-attention mechanism can be regarded as a weighted average operation based on dot-product similarity: it performs a dot-product between the query matrix and the key matrix, normalizes the result into a probability distribution, and then uses this distribution as weights in a weighted average over the value matrix to obtain the output matrix.
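The scaled dot-product attention just described is a few lines in PyTorch; this sketch mirrors the formula directly.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = Softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # dot-product similarity
    return F.softmax(scores, dim=-1) @ V            # weighted average of values

# usage: zero queries give uniform attention, i.e., the mean of V
q = torch.zeros(1, 3, 4)
k, v = torch.randn(1, 5, 4), torch.randn(1, 5, 4)
out = scaled_dot_product_attention(q, k, v)
```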
The NAM attention mechanism applies a self-attention mechanism to a convolutional neural network and uses a simple and effective integral form to represent attention weight. The NAM attention mechanism can be expressed as
y = Σ_i [exp(f(x_i)) / Σ_j exp(f(x_j))] · x_i

In the above equation, X = {x_i} represents the input feature map, x_i represents the feature vector at the i-th position, and f represents a learnable function such as a fully connected layer or a convolution layer. The NAM attention mechanism can be regarded as a weighted average operation based on an exponential function: it maps the feature vector at each position of the input feature map to a scalar through f, exponentiates it to a positive number, normalizes the results into a probability distribution, and then uses this distribution as weights in a weighted average over the input feature map to obtain the output feature map.
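The exponential weighted average described above can be sketched directly; the flattened (positions × channels) layout and the scalar-valued `f` are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def exp_position_attention(x, f):
    """Weighted average over positions: w_i = exp(f(x_i)) / sum_j exp(f(x_j)),
    y = sum_i w_i * x_i.  x: (N, C) position features; f maps x_i to a scalar."""
    s = torch.stack([f(xi) for xi in x])       # one scalar score per position
    w = F.softmax(s, dim=0)                    # exponentiate and normalize
    return (w.unsqueeze(1) * x).sum(dim=0)     # weighted average, shape (C,)

# usage: a constant f yields uniform weights, i.e., the plain mean over positions
x = torch.randn(5, 3)
y = exp_position_attention(x, lambda v: v.sum())
y_uniform = exp_position_attention(x, lambda v: torch.tensor(0.0))
```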
For model design, this paper adopted the method of adding the NAM attention mechanism to the C2f module to enhance information flow between different channels in the feature map. Specific implementation details are as follows:
- (1)
Insert the NAM attention mechanism before the last convolution layer in the C2f module, i.e., perform the NAM attention transformation on the feature map output by the preceding convolution layer, and then use the transformed feature map as the last convolution layer's input. This allows the convolution layer to receive more useful channel information and improves the feature map's expressiveness and discriminability;
- (2)
Set the NAM attention mechanism's grouping parameter to 4, i.e., divide the channels into four sub-regions and perform a self-attention calculation within each sub-region. This can reduce the computation amount and memory consumption while maintaining a sufficient receptive field. The self-attention calculation is implemented with the Softmax function and a convolution layer, and a residual connection and a normalization layer are used to stabilize the training process.
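The channel sub-region scheme in step (2) can be sketched as grouped channel attention with a residual connection. The 1 × 1 grouped-conv scorer and per-group Softmax are assumptions standing in for the full NAM transformation.

```python
import torch
import torch.nn as nn

class GroupedChannelAttention(nn.Module):
    """Split channels into `groups` sub-regions, attend within each group,
    and add a residual connection, mirroring step (2) above (a sketch)."""
    def __init__(self, channels, groups=4):
        super().__init__()
        assert channels % groups == 0
        self.groups = groups
        self.score = nn.Conv2d(channels, channels, 1, groups=groups)

    def forward(self, x):
        b, c, h, w = x.shape
        s = self.score(x.mean(dim=(2, 3), keepdim=True))    # (b, c, 1, 1) scores
        s = s.view(b, self.groups, c // self.groups)
        w_att = torch.softmax(s, dim=2).view(b, c, 1, 1)    # Softmax per sub-region
        return x + x * w_att                                # residual connection

# usage: shape-preserving channel reweighting
y = GroupedChannelAttention(8, groups=4)(torch.randn(2, 8, 16, 16))
```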
This paper determined, through experimental comparison, that the C2f module in the third layer of the backbone network is the best position to add the NAM attention mechanism, for the following reasons:
- (1)
The third layer of the backbone network provides rich and abstract semantic information suitable for the target detection task. The NAM attention mechanism can utilize the semantic relationship between channels to enhance feature map’s semantic information and improve target detection accuracy and robustness;
- (2)
The C2f module is a module with multi-hop layer connections and a branch structure, which can enhance information flow between different scales and positions in the feature map, making it well suited to the multi-scale target detection task. Adding the NAM attention mechanism in this module can utilize the spatial relationship between different channels to enhance the feature map's spatial information and improve target detection sensitivity and stability;
- (3)
The NAM attention mechanism can further enhance information flow between channels in the C2f module, improving feature expression ability and target detection performance. The NAM attention mechanism can adaptively adjust weights between channels to highlight useful channel information, suppress useless channel information, and improve the feature map’s diversity and efficiency.
3.3. Replacement of the SPPF Module
The SPPF module is an improvement of spatial pyramid pooling which is used to fuse multi-scale features; it is located at the last layer of the YOLOv8-nano model’s backbone network. It can concatenate features of the same feature map at different scales together to achieve multi-scale feature fusion, improving the feature map’s expression ability and diversity. However, the SPPF module also has the following shortcomings:
- (1)
It only captures fixed-scale features and cannot adapt to differently sized targets;
- (2)
The pooling operation reduces the feature map's spatial resolution, losing detail information;
- (3)
The parameter amount is large, increasing computation amount and memory occupancy.
The context augmentation module (CAM) is a module for extracting and fusing context information of different scales, which can effectively utilize information on spatial and channel dimensions, improve feature expression ability and distinction, and enhance context information of the feature pyramid network. The context augmentation module consists of a spatial attention module and a channel attention module, which are used to extract context information on spatial and channel dimensions, respectively, and fuse them into the original feature map. In addition, it uses multiple dilated convolutions with different dilation rates, which are 1, 3, and 5, to obtain context information from different receptive fields and integrate them into the feature pyramid, enriching the semantic information of the feature map, while introducing only a small number of parameters and amount of computation. The context augmentation module is applied to the highest layer of the feature pyramid network to obtain more background information, which is conducive to the detection of tiny defects. The structure of the context augmentation module is shown in
Figure 4, where the kernel size of the dilated convolution layers is 3 × 3 and the padding matches the dilation rate, so the size of the feature map is unchanged.
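The parallel dilated branches can be sketched as follows; with a 3 × 3 kernel, setting padding equal to the dilation rate keeps the spatial size unchanged, as stated above. The module name is illustrative.

```python
import torch
import torch.nn as nn

class DilatedContextBranches(nn.Module):
    """Parallel 3x3 dilated convolutions with rates 1, 3, and 5; padding equals
    the dilation rate, so the feature map size is preserved."""
    def __init__(self, channels):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=r, dilation=r)
            for r in (1, 3, 5))

    def forward(self, x):
        # one output per receptive field, ready for the fusion step
        return [branch(x) for branch in self.branches]

# usage: three same-size context features from one input
outs = DilatedContextBranches(8)(torch.randn(1, 8, 32, 32))
```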
Three different feature fusion methods are considered in the context augmentation module, which correspond to subfigures (a), (b), and (c) in
Figure 5. The purpose of these methods is to combine features of different receptive fields to improve detection performance.
This paper compares three different feature fusion methods: (a) the weighted addition method, (b) the spatial adaptive method, and (c) the cascade addition method. The difference between these methods lies in how they combine features of different receptive fields along the spatial and channel dimensions. Specifically, assuming that the size of each input feature is (bs, C, H, W), where bs is the batch size, C is the number of channels, and H and W are the height and width, the three methods operate as follows: (a) the weighted addition method directly performs element-wise addition on the input features, resulting in an output feature of size (bs, C, H, W); (b) the spatial adaptive method applies a 1 × 1 convolution layer to the input features to predict a position-wise spatial weight matrix of size (bs, 3, H, W), normalizes it with Softmax, and multiplies it with the channel-aligned input features to obtain an output feature of size (bs, C, H, W); (c) the cascade addition method concatenates the input features along the channel dimension to obtain an output feature of size (bs, 3C, H, W).
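The three fusion methods can be sketched side by side for three branch features of size (bs, C, H, W); the class/function names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fuse_weighted_add(feats):
    """(a) element-wise addition of the three branch features."""
    return feats[0] + feats[1] + feats[2]

class FuseSpatialAdaptive(nn.Module):
    """(b) predict a (bs, 3, H, W) spatial weight map with a 1x1 conv,
    Softmax over the 3 branches, then weight and sum."""
    def __init__(self, channels):
        super().__init__()
        self.weight = nn.Conv2d(3 * channels, 3, 1)

    def forward(self, feats):
        w = F.softmax(self.weight(torch.cat(feats, dim=1)), dim=1)
        return sum(w[:, i:i + 1] * f for i, f in enumerate(feats))

def fuse_concat(feats):
    """(c) concatenation along channels -> (bs, 3C, H, W)."""
    return torch.cat(feats, dim=1)

# usage: compare output shapes of the three fusion strategies
feats = [torch.randn(2, 8, 16, 16) for _ in range(3)]
ya = fuse_weighted_add(feats)
yb = FuseSpatialAdaptive(8)(feats)
yc = fuse_concat(feats)
```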
The purpose of replacing the original SPPF module with the context augmentation module in this paper was to improve the information utilization rate of feature maps on spatial and channel dimensions, thereby enhancing weld defect detection performance. The SPPF module only utilizes spatial dimension information and ignores channel dimension information, resulting in poor feature representation and robustness. The context augmentation module can effectively integrate spatial and channel dimension information to enhance feature expression ability and distinction. Compared with the SPPF module, the context augmentation module has the following advantages:
- (1)
The context augmentation module can improve the detection effect for difficult-to-detect targets such as small targets, dense targets, occluded targets, etc. The SPPF module uses fixed-size pooling operations which may ignore or lose weld defects that vary greatly, or are small, occluded, or overlapped in scale. The context augmentation module can adaptively select features of different scales and positions and dynamically fuse them to adapt to weld defects of different sizes and shapes;
- (2)
The context augmentation module can improve generalization ability for diversified datasets with different categories, different scenes, different lighting, etc. The SPPF module uses a max pooling operation, which may ignore or confuse weld defects that vary greatly or are small in category, scene, or lighting. The context augmentation module can enhance features of regions of interest by attention mechanism, suppress features of irrelevant regions, improve feature interpretability and accuracy, and adapt to diversified datasets with different categories, different scenes, different lighting, etc.;
- (3)
The context augmentation module can improve recognition ability for weld defect position, shape, size, and other detail information. The SPPF module uses concatenation operation, which may cause channel information conflict and confusion, reducing feature diversity and efficiency. The context augmentation module can select more useful channel information by an attention mechanism and fuse it into an original feature map, extracting more fine and meaningful features.
The specific steps of replacing the original SPPF module with the context augmentation module, as performed in this paper, are as follows:
- (1)
Find the position of the SPPF module in the backbone network of the YOLOv8-nano model, which is after the last convolution layer, and delete it;
- (2)
Add the context augmentation module to the backbone network of the YOLOv8-nano model, which is inserted after the last convolution layer;
- (3)
Train with the three different feature fusion methods, which are the weighted addition method, the spatial adaptive method, and the cascade addition method, and compare them with the original YOLOv8-nano model.
3.4. Introduction of the Carafe Operator
The Upsample operation used by YOLOv8-nano is a traditional interpolation method, which only utilizes spatial information from the input feature map and ignores semantic information. In weld defect detection, this leads to information loss or blurring, a small receptive field, and low performance. Therefore, this paper introduced the Carafe operator [
27], which is a lightweight general upsampling operator that can predict and reorganize upsampling kernels according to input feature map content, thereby improving the upsampling effect. The workflow of Carafe is as follows:
- (1)
Upsampling kernel prediction: For the input feature map, first use a 1 × 1 convolution to compress the channel number, then use a convolution layer to predict the upsampling kernel for each position (the kernels differ from position to position), and finally normalize each kernel with Softmax;
- (2)
Feature reorganization: For each position in the output feature map, map it back to the input feature map, take out the k × k area centered on it (where k is the upsampling kernel size), and compute the dot-product with the predicted upsampling kernel at that point to obtain the output value. Different channels at the same position share the same upsampling kernel.
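The two-step workflow above can be sketched compactly: predict and normalize per-position kernels, then reassemble k × k neighborhoods gathered with `unfold`. The compressed channel width `c_mid`, the encoder kernel size, and the nearest-neighbor position mapping are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CarafeLite(nn.Module):
    """Content-aware upsampling sketch: predict a k_up x k_up kernel per
    output position, Softmax-normalize, then reassemble neighborhoods."""
    def __init__(self, channels, scale=2, k_up=5, k_enc=3, c_mid=32):
        super().__init__()
        self.scale, self.k_up = scale, k_up
        self.compress = nn.Conv2d(channels, c_mid, 1)               # 1x1 compress
        self.encode = nn.Conv2d(c_mid, scale * scale * k_up * k_up,
                                k_enc, padding=k_enc // 2)          # kernel prediction

    def forward(self, x):
        b, c, h, w = x.shape
        s, k = self.scale, self.k_up
        # 1) kernel prediction: compress, encode, pixel-shuffle, Softmax
        kernels = F.pixel_shuffle(self.encode(self.compress(x)), s)  # (b, k*k, sh, sw)
        kernels = F.softmax(kernels, dim=1)
        # 2) feature reorganization: gather k x k neighborhoods, map to output grid
        patches = F.unfold(x, k, padding=k // 2).view(b, c * k * k, h, w)
        patches = F.interpolate(patches, scale_factor=s, mode='nearest')
        patches = patches.view(b, c, k * k, s * h, s * w)
        return (patches * kernels.unsqueeze(1)).sum(dim=2)           # shared per channel

# usage: 2x content-aware upsampling
y = CarafeLite(8)(torch.randn(1, 8, 16, 16))
```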
Carafe has significant advantages over the original Upsample operation. First, Carafe can guide the generation of the upsampling kernel according to the semantic information of the input feature map, thereby adapting to features of different content and scales, while Upsample is a fixed upsampling method that determines the upsampling kernel only from pixel distance without utilizing the semantic information of the feature map. Second, Carafe has a larger receptive field and uses surrounding information for upsampling, improving upsampling quality, while Upsample has a small receptive field (1 × 1 for nearest neighbor, 2 × 2 for bilinear interpolation) and cannot fully utilize surrounding information. Finally, Carafe introduces only a small number of parameters and little computation, maintaining lightweight characteristics, whereas learnable alternatives such as deconvolution introduce considerably more parameters and computation.
In this paper, we replace the Upsample operation in each layer of the top-down part in YOLOv8 with the Carafe operation; keep other parts unchanged; train with the same dataset, evaluation metric, and experimental environment; and compare and analyze with the original YOLOv8-nano model. Specific steps and implementation details are as follows:
- (1)
In the Head, we replace the Upsample operation in the first upsampling layer with the Carafe operation for training, and compare and analyze with the original YOLOv8 model.
- (2)
In the Head, we keep the Upsample operation in the first upsampling layer unchanged, replace the Upsample operation in the second upsampling layer with the Carafe operation for training, and compare and analyze with this YOLOv8-nano model.
- (3)
In the Head, we replace the Upsample operation in each layer of the top-down part with the Carafe operation for training, and compare and analyze with this YOLOv8 model.
3.5. Optimize Loss Function
This paper studies multi-class weld defect detection, which suffers from a class imbalance problem. With the YOLOv8-nano model's CIoU loss function, the model easily becomes biased towards categories with many samples and neglects categories with few samples, weakening its detection ability. To solve this problem, this paper chose the Wise-IoU loss function, introduced a category weight coefficient, adjusted the importance of target categories, and balanced the detection effect.
CIoU Loss is a regression error measure used for object detection [
28]. It integrates IoU, center distance, and aspect ratio of prediction box and real box, making the model pay more attention to mismatched samples and reducing the loss weight of close samples. CIoU Loss is an improvement of GIoU Loss and DIoU Loss, adding consideration of center point distance and aspect ratio difference. CIoU Loss can improve target detection performance of different scales and shapes, and reduce sensitivity to hyperparameter selection. CIoU Loss’s formula is as follows:
L_CIoU = 1 - IoU + ρ²(b, b_gt) / c² + αv

In the above equation, IoU is the intersection over union, ρ(b, b_gt) is the Euclidean distance between the center points of the prediction box and the real box, c is the diagonal length of the smallest closed rectangle containing the prediction box and the real box, α is the balance coefficient, and v is the aspect ratio penalty term. α and v are calculated as follows:

α = v / ((1 - IoU) + v)

v = (4 / π²) · (arctan(w_gt / h_gt) - arctan(w / h))²

In the above equations, (w, h) and (w_gt, h_gt) are the prediction box's and the real box's width and height, respectively.
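The CIoU terms can be computed directly for corner-format boxes; this sketch follows the standard formulation (IoU, center distance over enclosing diagonal, aspect-ratio penalty).

```python
import math
import torch

def ciou_loss(pred, target, eps=1e-7):
    """CIoU = 1 - IoU + rho^2/c^2 + alpha*v for boxes in (x1, y1, x2, y2)."""
    # intersection and union
    x1 = torch.max(pred[..., 0], target[..., 0])
    y1 = torch.max(pred[..., 1], target[..., 1])
    x2 = torch.min(pred[..., 2], target[..., 2])
    y2 = torch.min(pred[..., 3], target[..., 3])
    inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
    area_p = (pred[..., 2] - pred[..., 0]) * (pred[..., 3] - pred[..., 1])
    area_t = (target[..., 2] - target[..., 0]) * (target[..., 3] - target[..., 1])
    iou = inter / (area_p + area_t - inter + eps)
    # squared center distance rho^2 and enclosing-box diagonal c^2
    rho2 = ((pred[..., 0] + pred[..., 2] - target[..., 0] - target[..., 2]) ** 2
            + (pred[..., 1] + pred[..., 3] - target[..., 1] - target[..., 3]) ** 2) / 4
    cw = torch.max(pred[..., 2], target[..., 2]) - torch.min(pred[..., 0], target[..., 0])
    ch = torch.max(pred[..., 3], target[..., 3]) - torch.min(pred[..., 1], target[..., 1])
    c2 = cw ** 2 + ch ** 2 + eps
    # aspect-ratio penalty v and balance coefficient alpha
    w_p, h_p = pred[..., 2] - pred[..., 0], pred[..., 3] - pred[..., 1]
    w_t, h_t = target[..., 2] - target[..., 0], target[..., 3] - target[..., 1]
    v = (4 / math.pi ** 2) * (torch.atan(w_t / (h_t + eps))
                              - torch.atan(w_p / (h_p + eps))) ** 2
    alpha = v / (1 - iou + v + eps)
    return 1 - iou + rho2 / c2 + alpha * v

# usage: identical boxes give (near-)zero loss
box = torch.tensor([[0.0, 0.0, 2.0, 2.0]])
loss = ciou_loss(box, box)
```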
In the target detection task, the loss function’s design has a decisive role in model performance. An appropriate loss function can enhance boundary box’s fitting accuracy and robustness. However, most existing loss functions are based on an assumption: all samples in training data are high-quality. This assumption ignores the adverse impact of low-quality samples on model performance. To solve this problem, some researchers proposed loss functions based on IoU (intersection over union) to measure overlap degree between prediction box and target box. However, these loss functions have their own limitations and defects. For example, IoU loss [
29] cannot provide gradient information when prediction box and target box do not overlap; GIoU loss [
29] produces the wrong gradient direction when prediction box contains target box; DIoU loss [
30] does not consider whether boundary box’s aspect ratio is consistent; CIoU loss [
30] still penalizes the distance and aspect ratio difference too heavily when the prediction box and target box largely overlap; EIoU loss [
31] assigns too high a weight to the distance penalty term and uses a momentum sliding average as an unstable normalization factor; and SIoU loss [
32] introduces additional penalty terms such as the boundary box center connection line, the coordinate axis angle, the boundary box shape difference, etc. These loss functions do not consider the Anchor box's quality; i.e., if low-quality Anchor boxes are excessively regressed, the model's localization ability is reduced. To overcome these problems, some scholars proposed an IoU-based loss function called Wise-IoU (WIoU) [
33]. WIoU uses outlierness to evaluate Anchor box’s quality and adjusts gradient gain dynamically according to outlierness. In this way, WIoU can adaptively select medium-quality Anchor boxes and effectively improve the detector’s overall performance.
WIoU loss function’s definition is as follows:
L_WIoU = (1/N) · Σ_{i=1}^{N} r_i · (1 - IoU_i)

In the above equation, N is the number of Anchor boxes, IoU_i is the IoU value between the i-th Anchor box and the target box, and r_i is the gradient gain corresponding to the i-th Anchor box, defined as

r_i = β_i / (δ · α^(β_i − δ))

In the above equation, β_i is the i-th Anchor box's outlierness, and α and δ are hyperparameters used to control the sharpness of the gradient gain distribution. Outlierness is an indicator that reflects the Anchor box's quality, defined as

β_i = d_i / c_i

In the above equation, d_i is the length of the line connecting the center points of the i-th Anchor box and the target box, and c_i is the diagonal length of the smallest bounding box containing the two boundary boxes. Smaller outlierness means higher Anchor box quality.
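A minimal sketch of this loss, following the description above (outlierness as the distance ratio d/c and a focusing gain applied to the IoU loss); the α and δ defaults and the scalar-tensor interface are assumptions, not the reference implementation.

```python
import torch

def wiou_loss(iou, d, c, alpha=1.9, delta=3.0):
    """Wise-IoU sketch: outlierness beta = d/c (center distance over enclosing
    diagonal), gradient gain r = beta / (delta * alpha**(beta - delta)),
    averaged over all anchors."""
    beta = d / c                                   # outlierness, smaller = better
    r = beta / (delta * alpha ** (beta - delta))   # focusing gradient gain
    return (r * (1 - iou)).mean()

# usage: perfectly aligned anchors (IoU = 1, zero center distance) give zero loss
loss = wiou_loss(torch.ones(3), torch.zeros(3), torch.ones(3))
```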
Wise-IoU and CIoU are both loss functions based on IoU. Their main difference is that Wise-IoU introduces a dynamic non-monotonic focusing coefficient which evaluates Anchor box’s quality by outlierness, adjusts gradient gain of different quality Anchor boxes, and makes the model focus on ordinary quality Anchor boxes. Outlierness is an indicator that integrates IoU and distance measurement; smaller outlierness means higher Anchor box quality. Wise-IoU’s focusing coefficient changes non-monotonically with outlierness; when outlierness is within a certain range, the focusing coefficient is larger, while, when outlierness exceeds this range, the focusing coefficient is smaller. This can reduce the gradient generated by high-quality Anchor boxes and low-quality Anchor boxes, and increase the gradient generated by ordinary-quality Anchor boxes.
As shown in
Figure 6, Wise-IoU can replace CIoU in YOLOv8 to achieve better detection performance: it can better handle low-quality examples in the training data, avoid over-penalizing or over-fitting these examples, and improve the model's generalization ability and localization accuracy. At the same time, Wise-IoU can better utilize the contextual information of different scales and receptive fields and allocate reasonable gradient gains through a dynamic non-monotonic focusing mechanism, improving the model's regression ability for ordinary-quality Anchor boxes. Wise-IoU does not introduce additional geometric metrics or computational complexity; it only multiplies the IoU loss by a focusing coefficient, so it can better adapt to different hardware platforms and deployment scenarios.