Article

Improved YOLOv9 with Dual Convolution and LSKA Attention for Robust Small Defect Detection in Textiles

1 Key Laboratory of Modern Textile Machinery & Technology of Zhejiang Province, Zhejiang Sci-Tech University, Hangzhou 310018, China
2 School of Applied Math & Computational Science, Duke Kunshan University, Kunshan 215316, China
3 Zhejiang Institute of Mechanical and Electrical Engineering, Hangzhou 310053, China
* Author to whom correspondence should be addressed.
Processes 2026, 14(1), 149; https://doi.org/10.3390/pr14010149
Submission received: 4 December 2025 / Revised: 25 December 2025 / Accepted: 30 December 2025 / Published: 1 January 2026
(This article belongs to the Section Process Control and Monitoring)

Abstract

To mitigate the challenges of false positives and undetected small-scale defects in fabric inspection, this study proposes an advanced fabric defect detection system that leverages an optimized YOLOv9 algorithm. First, redundant computation is reduced by introducing DualConv in place of standard convolution. Second, the LSKA attention mechanism is incorporated to increase the weight of important features, thereby enhancing the accuracy of small-target detection and improving generalization. Additionally, a focal modulation network is employed to replace the fast spatial pyramid pooling module, mitigating the loss of detailed information caused by feature pooling. Furthermore, the conventional feature pyramid network is replaced with a bidirectional feature pyramid network (BiFPN) for efficient feature fusion, thereby enhancing multiscale feature representation and improving detection accuracy. Finally, the bounding box loss function is optimized by introducing the shape-IoU loss, which accelerates model convergence and significantly improves detection accuracy. Experiments conducted on a fabric defect dataset demonstrate that the proposed algorithm yields a 6.7% increase in mAP@0.5 and a 14.7% improvement in mAP@0.5–0.95, while reducing the model's total parameters by 17.8% and computational FLOPs by 14.4% compared with the original algorithm. The improved YOLOv9 model significantly enhances the precision and accuracy of defect detection while maintaining an inference speed (55.8 FPS) that meets industrial requirements.

1. Introduction

In today’s globalized textile industry, as consumer demand for product quality continues to rise, textile quality control has become a critical factor in determining market competitiveness. Consequently, fabric defect detection has emerged as a vital component of the textile industry’s production and quality management systems. Currently, in most companies, defect detection relies primarily on manual inspection methods. Nevertheless, the traditional method of detecting defects via manual visual inspection is not only labor-intensive and inefficient, but also subject to the subjective judgment of the inspector, making it difficult to achieve efficient and consistent quality control standards. Therefore, the development of efficient and accurate automated textile defect detection systems has become a significant research focus, encompassing fields such as computer vision and artificial intelligence.
With the progression of Industry 4.0, textile defect detection technology has transitioned from traditional manual inspection to automation and intelligence over the past decade, largely driven by advancements in computer vision and deep learning technologies. At present, deep learning-based object detection algorithms can be broadly divided into the following two main types: two-stage and single-stage detection algorithms. Fast R-CNN [1] serves as a prominent example of the two-stage detection framework, whereas the YOLO [2] series represent the single-stage detection paradigm. Subsequent investigations into textile defect detection algorithms have primarily aimed at refining and enhancing these foundational methods. For instance, Shengbao et al. [3] improved textile defect detection accuracy by integrating the R-CNN model with an online hard example mining technique. Jun et al. [4] employed an approach where fabric images were first segmented, followed by the application of the Inception-V1 [5] model to identify defects within the segmented images, and ultimately utilized the LeNet-5 model for defect classification. Zheng et al. [6] introduced an improved YOLOv5-based fabric defect detection system, achieving notable improvements in detection accuracy and localization, though further enhancements in accuracy are still achievable. Anjing et al. [7] proposed a fine-grained convolution module integrated into the YOLOv8s model, alongside GFPN feature fusion, to bolster the detection of small-scale defects. Yan et al. [8] advanced a lightweight algorithm incorporating pyramid segmentation attention and linear transformation, thereby significantly enhancing detection efficiency. Kang and Li [9] developed a method for detecting defects in solid-color circular weft fabrics via YOLOv7-tiny, which included the introduction of an SPD convolution layer and a CBAM attention module to emphasize critical features. Liu et al. 
[10] proposed a lightweight model, PRC-Light YOLO, which optimizes feature extraction via novel operators and employs the Wise-IoU v3 to mitigate the impact of low-quality instances, thus improving accuracy. Although the above studies have made significant progress in addressing real-time detection and deployment challenges, achieving low-latency, cost-effective real-time detection with high accuracy in practical applications remains a formidable challenge. Addressing this challenge necessitates not only the optimization of model architectures but also the integration of techniques such as hardware acceleration and model compression to fulfill the stringent requirements of industrial environments.
In summary, although existing research has provided various effective solutions for textile defect detection, there remains significant potential for optimizing the identification of tiny defects in complex backgrounds, enhancing detection speed, and reducing computational costs. Against this backdrop, this study conducts an experimental comparison of several classical object detection algorithms, including SSD [11], R-CNN [12], Fast R-CNN [1], Faster R-CNN [13], and YOLO [2], ultimately selecting YOLOv9 [14] as the foundational network. The YOLOv9 object detection algorithm, a recent addition to the YOLO series, employs auxiliary reversible branches to generate reliable gradients, preserving the critical features of deep layers and optimizing the training and detection of small targets. However, YOLOv9 still encounters performance challenges in practical textile defect detection applications. Specifically, a considerable amount of information may be lost during the extraction and spatial transformation of input data across layers. Additionally, redundant features extracted by the convolutional layers, along with substantial computational overhead, can lead to degradation of the model’s performance, particularly in real-time small target detection scenarios. Therefore, this study proposes an improved YOLOv9-based textile defect detection algorithm. The core innovation of this method lies in the synergistic integration of the following four advanced technical components: DualConv for efficient feature extraction, LSKA attention mechanism for critical feature enhancement, BiFPN for advanced feature fusion, and Shape-IoU for optimized loss computation. This strategic combination yields substantial improvements in detection accuracy, especially for small-scale defects in textile imagery. Consequently, the proposed method establishes itself as an efficient and reliable solution for industrial applications. 
The enhanced performance aligns with the urgent requirements of textile quality control in modern manufacturing systems, effectively addressing critical challenges in automated defect inspection.

2. Improved YOLOv9 Fabric Defect Detection Algorithm

This chapter introduces the fabric defect detection algorithm based on an improved YOLOv9. First, the ordinary convolution layers of the downsampling modules are replaced by dual convolution (DualConv). Second, the large separable kernel attention mechanism (LSKA) is added in the early stages of feature extraction and feature fusion. Then, the fast spatial pyramid pooling module (SPPFELAN) is replaced by a focal modulation network, and the feature fusion module is restructured with a bidirectional feature pyramid network (BiFPN). Finally, shape-IoU is used to optimize the loss function. The model structure of the improved YOLOv9 fabric defect detection algorithm proposed in this paper is shown in Figure 1.

2.1. Replacing Standard Convolution with DualConv

Traditional convolution applies a set of filters to perform convolution operations across all channels of the input image. This approach necessitates calculating the multiplication and addition operations between each filter and all channels of the input image for each output feature map, resulting in high computational costs and increased storage requirements. Moreover, traditional convolution may struggle to effectively capture features of small objects, as these objects occupy less spatial area in the image and contain limited feature information, potentially compromising detection performance. To address these challenges, DualConv (dual convolutional kernels) [15] has been introduced to replace the traditional convolutional layers in YOLOv9. DualConv is a lightweight convolutional neural network structure that adopts a “split-fusion” strategy, dividing the convolutional layer into the following two sub-layers: a deep convolutional layer and a wide convolutional layer. The deep convolutional layer utilizes smaller kernels and a greater number of channels to extract high-level features, while the wide convolutional layer employs larger kernels and fewer channels to capture low-level features. By combining 3 × 3 and 1 × 1 convolution kernels processing the same input feature map channels, the information processing and feature extraction operations are optimized. DualConv employs group convolution techniques to strategically organize convolutional filters, thereby effectively reducing both computational complexity and the number of model parameters while preserving model performance. The structure and layout of DualConv are shown in Figure 2 and Figure 3.
In Figure 2 and Figure 3, N denotes the depth of the output feature map, M represents the depth of the input feature map, and G is the number of groups in dual convolution and group convolution. The N convolution filters are divided into G groups, each processing the entire input feature map: within each group, M/G input channels are processed by the 3 × 3 and 1 × 1 convolution kernels simultaneously, while the remaining (M − M/G) input channels are processed only by the 1 × 1 convolution kernels. The outputs of the 3 × 3 and 1 × 1 convolution kernels are then combined.
Assuming that the height and width of the input feature map are H and W, respectively, the computational cost (FLOPs) of conventional convolution is shown in Equation (1):
$F_{SC} = H \times W \times K^{2} \times M \times N$
For a given G, in a dual convolution layer consisting of G convolution filter groups, the FLOPs of the combined 3 × 3 and 1 × 1 convolution kernels are shown in Equation (2):
$F_{CC} = (H \times W \times K^{2} \times M \times N + H \times W \times M \times N) / G$
The FLOPs of the remaining 1 × 1 pointwise convolution kernels are shown in Equation (3):
$F_{PC} = (H \times W \times M \times N) \times (1 - 1/G)$
The total amount of computation of DualConv is shown in Equation (4):
$F_{DC} = F_{CC} + F_{PC} = H \times W \times K^{2} \times M \times N / G + H \times W \times M \times N$
By comparing the computational complexity of dual convolution with that of the ordinary convolution layer, we obtain the FLOPs reduction rate of DualConv, as shown in Equation (5):
$R_{DC/SC} = \dfrac{F_{DC}}{F_{SC}} = \dfrac{1}{G} + \dfrac{1}{K^{2}}$
In the equation, the kernel size in DualConv is fixed at K = 3, so when G is large enough, DualConv can reduce the amount of computation nearly 9-fold ($1/K^{2} = 1/9$) compared with ordinary convolution.

2.2. Adding Large Separable Kernel Attention Mechanism (LSKA)

In textile defect detection tasks, interference from background textures necessitates a stronger reliance on local information for accurate defect identification. For this reason, the Large Separable Kernel Attention mechanism [16] increases the weight of important features and reduces the noise interference caused by the background.
LSKA improves upon the traditional large kernel attention (LKA) used in visual attention networks. First, it decomposes the traditional 2D convolution kernel into two 1D convolution kernels in the horizontal and vertical directions. Then, one 1D kernel convolves the input horizontally and the other convolves it vertically.
These two convolution operations are performed in series, thus achieving an effect similar to the original large-sized 2D convolution kernel. This decomposition-and-cascade strategy significantly reduces both computational complexity and the number of parameters while still effectively capturing the important information in the key features of the image. The LKA structure is shown in Figure 4a, and the LSKA attention mechanism structure is shown in Figure 4b. In Figure 4, C is the number of input channels, W and H denote the width and height of the input feature map, ⊗ represents the Hadamard product, d represents the dilation rate, k represents the maximum receptive field, DW-Conv [17] represents standard depth-wise convolution, and DW-D-Conv represents dilated depth-wise convolution [18]. LSKA restructures the initial two layers of large kernel attention into four layers by decomposing the two-dimensional weight kernels of the depth-wise convolution and the depth-wise dilated convolution into two cascaded one-dimensional separable weight kernels, so that each of these layers comprises two 1D convolutional layers.
The calculation process of the traditional large kernel attention module (LKA) is shown in Equations (6)–(8). A given feature map F is first input, and a large K × K kernel is then applied in a 2D depth-wise convolution.
$Z^{C} = \sum_{H,W} W^{C}_{K \times K} * F^{C}$
$A^{C} = W_{1 \times 1} * Z^{C}$
$\bar{F}^{C} = A^{C} \otimes F^{C}$
In the equations, ⊗ and * represent the Hadamard product operation and the convolution operation, respectively. $Z^{C}$ represents the output of each channel C of the feature map F after depth-wise convolution with the corresponding channel of kernel W. $A^{C}$ denotes the attention map generated by applying a 1 × 1 convolution to the output of the depth-wise convolution. $\bar{F}^{C}$ represents the output of LKA, obtained by taking the Hadamard product of the input feature map $F^{C}$ and the attention map $A^{C}$.
The initial two layers of the LKA are transformed into a four-layer structure by breaking down the two-dimensional weight kernels of deep and dilated convolutions into cascaded one-dimensional separable kernels, and finally the Large Separable Kernel Attention is obtained. The output of LSKA is shown in Equations (9)–(12).
$\bar{Z}^{C} = \sum_{H,W} W^{C}_{(2d-1) \times 1} * \left( \sum_{H,W} W^{C}_{1 \times (2d-1)} * F^{C} \right)$
$Z^{C} = \sum_{H,W} W^{C}_{\lfloor k/d \rfloor \times 1} * \left( \sum_{H,W} W^{C}_{1 \times \lfloor k/d \rfloor} * \bar{Z}^{C} \right)$
$A^{C} = W_{1 \times 1} * Z^{C}$
$\bar{F}^{C} = A^{C} \otimes F^{C}$
where $\bar{Z}^{C}$ and $Z^{C}$ correspond to the global spatial information derived from the outputs of the separable depth-wise convolution and the separable depth-wise dilated convolution, respectively.
The backbone network structure diagram of YOLOv9 after integrating the LSKA module is shown in Figure 5.
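The decomposition that LSKA relies on can be illustrated numerically: convolving with a separable (rank-1) 2D kernel is exactly equivalent to cascading a horizontal 1D pass and a vertical 1D pass. The sketch below is our own illustration with a naive NumPy convolution, not the paper's implementation:

```python
import numpy as np

def conv2d_valid(x, k):
    # naive 'valid' 2D cross-correlation; sufficient for this demonstration
    kh, kw = k.shape
    H, W = x.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((10, 10))
kv = rng.standard_normal((5, 1))   # 1D vertical kernel
kh = rng.standard_normal((1, 5))   # 1D horizontal kernel
k2d = kv @ kh                      # equivalent separable 5x5 kernel

direct = conv2d_valid(x, k2d)                      # one 5x5 convolution
cascaded = conv2d_valid(conv2d_valid(x, kh), kv)   # 1x5 pass, then 5x1 pass

# the cascade reproduces the 2D result with 2*5 weights instead of 25
assert np.allclose(direct, cascaded)
```

This is why the cascaded 1D kernels in Equations (9) and (10) reduce parameters and FLOPs while approximating the receptive field of the original large 2D kernel.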

2.3. Feature Extraction Using Focal Modulation Networks

In YOLOv9, the spatial pyramid pooling-fast (SPPF) module [19] is employed to enhance feature extraction capabilities. However, when applying feature pooling operations to small target details, such as defects, some critical information may be lost. Additionally, the fixed pyramid structure may not effectively adapt to the varying sizes of targets within the image. To address these limitations, the SPPF is replaced with focal modulation networks (FMNs) [20] for feature enhancement. Compared to the SPPF, focal modulation networks not only effectively process input images of different sizes but also provide more accurate feature identification within the image. This approach is particularly well suited for target detection tasks involving small targets or complex backgrounds.
Focal modulation networks form a new module created by replacing self-attention with focal modulation. Original self-attention first computes attention scores through a large number of query-key interactions and then uses these scores to aggregate an equally large number of spatially distributed value tokens (context features) for each query, capturing contextual information from the other tokens, as shown in Figure 6a. In contrast, focal modulation first applies query-independent focal aggregation (e.g., depth-wise convolution) to generate summary tokens at different levels of granularity. These summary tokens are then adaptively aggregated into a modulator, and the modulator is finally injected into the query token in a query-dependent manner, as shown in Figure 6b. In Figure 6a, the dotted lines represent query-key interactions, and the solid lines represent query-value aggregations. The colored rectangles in Figure 6b represent modulator aggregations, and the corresponding dotted lines represent query-modulator interactions. The focal modulation structure mainly consists of the following three parts: focal contextualization, context aggregation, and element-wise affine transformation.

2.3.1. Focal Contextualization

Focal contextualization is implemented using a stack of depth-wise convolutional layers to encode visual context from short range to long range. The focal modulation is instantiated as shown in Equation (13):
$y_i = q(x_i) \odot m(i, X)$
where ⊙ represents element-wise multiplication, $x_i$ represents the visual (query) token at position i, X represents the input feature map with width W, height H, and number of channels C, $q(\cdot)$ represents the query projection function, and $m(\cdot)$ represents the context aggregation function.

2.3.2. Contextual Aggregation

Context aggregation can be divided into two steps:
Step 1: Hierarchical contextualization. The feature map is first projected into a new feature space with a linear layer. Then, a stack of L depth-wise convolutions is used to obtain a hierarchical representation of the context. At focal level $l \in \{1, \dots, L\}$, the output $Z^{l}$ is calculated as shown in Equation (14):
$Z^{l} = f^{l}_{a}(Z^{l-1}) = \mathrm{GeLU}(\mathrm{DWConv}(Z^{l-1}))$
where $f^{l}_{a}$ refers to the contextualization function at level l, implemented by a depth-wise convolution (DWConv) followed by the GeLU activation function.
Step 2: Gated aggregation. In the gated aggregation step, the L + 1 feature maps generated by hierarchical contextualization are condensed into a modulator. A linear layer is applied to derive gating weights $G = f_g(X) \in \mathbb{R}^{H \times W \times (L+1)}$ that are both spatially and hierarchically aware. Subsequently, element-wise multiplication is used to perform a weighted summation, producing a single feature map $Z_{out}$ with dimensions identical to the input X. This operation is described in Equation (15):
$Z_{out} = \sum_{l=1}^{L+1} G^{l} \odot Z^{l}$
where $G^{l}$ is the slice of G at level l and $Z^{l}$ is the output of Equation (14).

2.3.3. Element-Wise Affine Transformation

The element-wise affine transformation incorporates the aggregated modulator into individual query tokens. According to the above context aggregation operation, the focus modulation can be rewritten as shown in Equation (16):
$y_i = q(x_i) \odot h\left( \sum_{l=1}^{L+1} g^{l}_{i} \cdot z^{l}_{i} \right)$
where $h(\cdot)$ is a further linear layer used to obtain the modulator, and $g^{l}_{i}$ and $z^{l}_{i}$ are the gating value of $G^{l}$ and the visual feature of $Z^{l}$ at position i, respectively.
To summarize, the structure diagram of focal modulation networks is shown in Figure 7.
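As a rough illustration of the three-part data flow above, the following NumPy sketch replaces the trained depth-wise convolutions and linear layers with box filters and random matrices; it shows only the shapes and the order of operations in Equations (13)–(16), not a working FocalNet:

```python
import numpy as np

rng = np.random.default_rng(1)
H = W = 8; C = 4; L = 3
X = rng.standard_normal((H, W, C))

def box_filter(Z, r):
    # crude stand-in for a depth-wise conv with kernel (2r+1)x(2r+1)
    out = np.zeros_like(Z)
    for i in range(H):
        for j in range(W):
            out[i, j] = Z[max(i - r, 0):i + r + 1,
                          max(j - r, 0):j + r + 1].mean(axis=(0, 1))
    return out

# Step 1: hierarchical contextualization, Eq. (14) (GeLU omitted for brevity)
levels, Z = [], X
for l in range(L):
    Z = box_filter(Z, r=l + 1)      # growing receptive field per level
    levels.append(Z)
levels.append(X.mean(axis=(0, 1), keepdims=True) * np.ones_like(X))  # global level

# Step 2: gated aggregation, Eq. (15): spatially aware gates weight the levels
Wg = rng.standard_normal((C, L + 1))
G = X @ Wg                                       # (H, W, L+1) gating weights
Z_out = sum(G[..., l:l + 1] * levels[l] for l in range(L + 1))

# Step 3: element-wise affine transformation, Eq. (16): modulator meets query
Wq = rng.standard_normal((C, C)); Wh = rng.standard_normal((C, C))
y = (X @ Wq) * (Z_out @ Wh)                      # y_i = q(x_i) ⊙ h(modulator)

assert y.shape == (H, W, C)
```

The key contrast with self-attention is visible here: the context summaries (`levels`) are computed once, independently of any query, and only the cheap gating and modulation steps are query-dependent.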

2.4. Modification of the Feature Fusion Module Structure Using BiFPN

In YOLOv9, the feature fusion module uses a traditional feature pyramid network (FPN) [21], but FPN suffers from high computational complexity, large memory usage, and limitations in detecting small targets. Therefore, this paper uses a bidirectional feature pyramid network (BiFPN) [22] in place of the original feature pyramid network for feature fusion, improving detection accuracy and reducing computational complexity through multi-level feature pyramids and bidirectional information transmission.
As can be seen in Figure 8, FPN uses a top-down path to fuse multi-scale features from layer 3 to layer 7. BiFPN, on the other hand, makes the following three innovative modifications to the FPN:
  • Bidirectional feature fusion: Bidirectional feature fusion in the bidirectional feature pyramid network refers to a mechanism that allows information in the feature network layer to flow and fuse in both the bottom-up and top-down directions. The above operation yields a simplified bidirectional network, enhancing the network’s capacity for feature fusion, enabling the network to more effectively utilize information at different scales, thereby improving the performance of target detection without adding additional computational costs.
  • Weighted fusion mechanism: The weighted fusion mechanism is a technique designed to enhance the effectiveness of feature fusion. In traditional feature pyramid networks, all input features are typically treated equally, with no distinction made between them. This approach results in the simple addition of features with different resolutions, without accounting for the differences in their output characteristics. To address this issue, BiFPN introduces an additional weight to each input feature, allowing the network to learn the relative importance of each feature. The weighted fusion mechanism is expressed as shown in Equation (17):
$O = \sum_{i} \omega_{i} \cdot I_{i}$
where ω i is a learnable weight that can be per feature, per channel, or per pixel. This weighted fusion approach can achieve comparable accuracy to other methods while minimizing computational cost.
  • Structural optimization: The number of layers at different levels is determined through a compound scaling method under different resource constraints, thereby improving accuracy while maintaining efficiency. This structural optimization includes:
  • Simplified bidirectional network: By optimizing the structure, the number of nodes in the network is reduced and nodes with single input edges are removed.
  • Adding additional edges: Additional edges are introduced between the input and output nodes at the same level to facilitate greater feature fusion without significantly increasing the computational cost.
  • Reuse bidirectional paths: Each bidirectional path is considered an independent feature network layer, and these layers can be iterated multiple times to enable more sophisticated feature fusion.
To ensure compatibility with the YOLOv9 architecture, we adopted an adapted BiFPN module rather than a standalone version. Specifically, we maintained the same channel dimensions as the original YOLOv9’s neck and integrated the bidirectional cross-scale connections within this existing framework. This approach allows for efficient multi-scale feature fusion without adding significant extra parameters or computational overhead. The structure diagram of the neck layer after using BiFPN to replace FPN in YOLOv9 is shown in Figure 9.
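The weighted fusion of Equation (17) can be sketched as follows. Note that this sketch (our illustration) additionally normalizes the learned weights by their sum, the "fast normalized fusion" used in the original BiFPN paper; the feature shapes and weight values are illustrative only:

```python
import numpy as np

def weighted_fusion(features, weights, eps=1e-4):
    # Eq. (17) with fast normalized fusion: O = sum_i w_i * I_i,
    # where the learnable weights are kept non-negative via ReLU and
    # normalized so the fused map is (approximately) a convex combination.
    w = np.maximum(weights, 0.0)
    w = w / (w.sum() + eps)
    return sum(wi * f for wi, f in zip(w, features))

rng = np.random.default_rng(2)
p_td = rng.standard_normal((40, 40, 64))   # top-down feature at this level
p_in = rng.standard_normal((40, 40, 64))   # same-level input feature (skip edge)
fused = weighted_fusion([p_td, p_in], np.array([0.7, 0.3]))
assert fused.shape == (40, 40, 64)
```

In training, the scalars passed as `weights` would be learnable parameters, letting the network decide how much each resolution contributes at every fusion node.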

2.5. Using Shape-IoU to Optimize the Loss Function

In the YOLOv9 model, CIoU [23] calculates the bounding box regression loss. It adds two additional metrics, aspect ratio and center point distance, to the IoU loss [24] to more comprehensively measure the similarity between the predicted and ground-truth bounding boxes. Although CIoU performs well in many respects, it also has some potential flaws. Its calculation involves the center point distance and aspect ratio, which is more complicated than IoU or GIoU [25] and may increase the computational cost. Moreover, textile defects such as broken warps or irregular stains often exhibit elongated or irregular shapes rather than standard rectangles. Traditional IoU variants like CIoU penalize deviations in center point distance and aspect ratio, which may not align well with these complex geometries. In contrast, shape-IoU introduces shape-aware weighting factors that adapt the penalty terms to the shape and scale of the ground-truth bounding box. This makes bounding box regression more sensitive to the actual morphology of defects, which is particularly beneficial for accurately localizing small, irregular textile defects against textured backgrounds. Therefore, this paper uses shape-IoU in place of CIoU to improve detection performance.
Shape-IoU [26] is a bounding box regression method that focuses on the shape and scale of the bounding box itself to calculate the loss, making the bounding box regression more accurate.
From Figure 10, the calculation equation of shape-IoU can be derived as shown in Equations (18)–(24):
$IoU = \dfrac{|B \cap B^{gt}|}{|B \cup B^{gt}|}$
$uu = \dfrac{2 \times (u^{gt})^{scale}}{(u^{gt})^{scale} + (v^{gt})^{scale}}$
$vv = \dfrac{2 \times (v^{gt})^{scale}}{(u^{gt})^{scale} + (v^{gt})^{scale}}$
$distance^{shape} = vv \times \dfrac{(x_c - x_c^{gt})^2}{c^2} + uu \times \dfrac{(y_c - y_c^{gt})^2}{c^2}$
$\Omega^{shape} = \sum_{t=u,v} \left(1 - e^{-\omega_t}\right)^{\theta}, \quad \theta = 4$
$\omega_u = vv \times \dfrac{|u - u^{gt}|}{\max(u, u^{gt})}$
$\omega_v = uu \times \dfrac{|v - v^{gt}|}{\max(v, v^{gt})}$
Here, scale is a scaling parameter that depends on the target sizes in the dataset, and c is the diagonal length of the smallest box enclosing the predicted and ground-truth boxes. The coefficients uu and vv denote the weighting factors along the horizontal and vertical axes, respectively, and their values are determined by the shape of the ground-truth bounding box. The corresponding loss function for bounding box regression is given in Equation (25):
$L_{shape\text{-}IoU} = 1 - IoU + distance^{shape} + 0.5 \times \Omega^{shape}$
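A minimal re-implementation of Equations (18)–(25), written from the formulas above, is sketched below for axis-aligned boxes given as (x_c, y_c, u, v); treating c as the enclosing-box diagonal and defaulting scale = 1 are our assumptions for the illustration:

```python
import math

def shape_iou_loss(box, gt, scale=1.0, theta=4):
    # box, gt: (center x, center y, width u, height v)
    xc, yc, u, v = box
    xg, yg, ug, vg = gt
    # IoU, Eq. (18)
    x1, y1 = max(xc - u / 2, xg - ug / 2), max(yc - v / 2, yg - vg / 2)
    x2, y2 = min(xc + u / 2, xg + ug / 2), min(yc + v / 2, yg + vg / 2)
    inter = max(x2 - x1, 0) * max(y2 - y1, 0)
    iou = inter / (u * v + ug * vg - inter)
    # shape weights from the ground-truth box, Eqs. (19)-(20)
    uu = 2 * ug**scale / (ug**scale + vg**scale)
    vv = 2 * vg**scale / (ug**scale + vg**scale)
    # enclosing-box diagonal and shape distance, Eq. (21)
    ex1, ey1 = min(xc - u / 2, xg - ug / 2), min(yc - v / 2, yg - vg / 2)
    ex2, ey2 = max(xc + u / 2, xg + ug / 2), max(yc + v / 2, yg + vg / 2)
    c2 = (ex2 - ex1)**2 + (ey2 - ey1)**2
    dist = vv * (xc - xg)**2 / c2 + uu * (yc - yg)**2 / c2
    # shape penalty, Eqs. (22)-(24)
    wu = vv * abs(u - ug) / max(u, ug)
    wv = uu * abs(v - vg) / max(v, vg)
    omega = (1 - math.exp(-wu))**theta + (1 - math.exp(-wv))**theta
    # Eq. (25)
    return 1 - iou + dist + 0.5 * omega

# identical boxes: IoU = 1 and both penalty terms vanish, so the loss is 0
assert abs(shape_iou_loss((5, 5, 4, 2), (5, 5, 4, 2))) < 1e-12
```

Because uu and vv depend only on the ground-truth width and height, an elongated defect such as a broken warp shifts the penalty toward the axis along which localization errors matter most.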

3. Experiment Result and Analysis

3.1. Construction of the Experiment Environment

All models in this paper are trained in the environment configuration shown in Table 1. The hyperparameter settings in model training are shown in Table 2.

3.2. Textile Defect Dataset

The dataset used in this study comes from on-site image acquisition in a textile factory. Images were captured using an MV-CA050-10GM industrial area-scan camera (5 MP; Hikvision, Hangzhou, China) under uniform white bar lighting to maintain consistent illumination and minimize shadow interference. The original resolution of the acquired images was 2448 × 2048 pixels. For model training and evaluation, all images were resized to a standardized dimension of 640 × 640 pixels. A total of 4500 images were selected, encompassing both woven and knitted fabric types. Defects were classified into three categories based on established textile quality inspection standards: holes (localized fabric discontinuities), broken warps (longitudinal yarn fractures), and stains (color impurities or oil residues). The dataset comprises 1450 instances of hole defects, 1600 instances of broken warp defects, and 1450 instances of stain defects. Bounding box annotations were manually generated in accordance with a predefined annotation protocol, which required precise enclosure of the defective region; annotation was performed using LabelImg 1.8.1. The training and validation sets were divided in a 7:3 ratio of the total data. An example of the dataset is shown in Figure 11.
As shown in Figure 11, the proportion of defect pixels relative to the total number of pixels is minimal and closely resembles the background texture, which is one of the key reasons why textile defects are difficult to detect.

3.3. Experiment Results and Analysis

3.3.1. Single Improvement Effectiveness Comparative Experiment

To assess the effectiveness of the proposed model improvement measures, a series of comparative experiments was designed to evaluate the impact of each individual improvement. Each improvement was incorporated into the original model and evaluated on a designated test set. Mean average precision (mAP), recall, and precision were employed as metrics to assess the specific impact of each improvement on the model's performance. Precision denotes the proportion of detected defects that are truly defects, recall represents the proportion of actual defects that are detected, and the mean average precision (mAP) is derived from both the recall and precision rates. The average precision and mean average precision are calculated as shown in Equations (26) and (27):
$AP = \sum_{k=0}^{n-1} \left( R(k) - R(k+1) \right) P(k)$
$mAP = \dfrac{1}{n} \sum_{k=1}^{n} AP_k$
In the equations, $AP_k$ represents the AP value of category k and n represents the total number of categories.
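Equations (26) and (27) can be sketched directly; the recall/precision samples and per-class AP values below are made-up numbers for illustration only:

```python
# Eq. (26): AP as the area under a precision-recall curve sampled at n
# operating points ordered from high to low recall, with recall taken as 0
# beyond the last point.

def average_precision(recalls, precisions):
    ap = 0.0
    for k in range(len(recalls)):
        r_next = recalls[k + 1] if k + 1 < len(recalls) else 0.0
        ap += (recalls[k] - r_next) * precisions[k]
    return ap

# Eq. (27): mAP is the mean of the per-class AP values.
def mean_average_precision(ap_per_class):
    return sum(ap_per_class) / len(ap_per_class)

# three hypothetical recall/precision operating points for one class
ap = average_precision([0.9, 0.6, 0.3], [0.5, 0.7, 0.9])
assert abs(ap - (0.3 * 0.5 + 0.3 * 0.7 + 0.3 * 0.9)) < 1e-9
# hypothetical per-class APs for the three defect categories
m = mean_average_precision([ap, 0.8, 0.75])
```

mAP@0.5 applies this calculation with an IoU match threshold of 0.5, while mAP@0.5–0.95 averages it over thresholds from 0.5 to 0.95.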
The comparison results of the single improved model are shown in Table 3. The experimental results are evaluated based on precision, recall, average precision (mAP@0.5 and mAP@0.5–0.95), model parameter quantity (Np), and model calculation amount (FLOPs).
The first improvement uses DualConv to replace the 2D convolution modules in the original model. As shown in Table 3, this improvement significantly enhances the model's defect detection accuracy, evidenced by increases in mAP@0.5 and mAP@0.5–0.95 of 5.4% and 6.6%, respectively, along with a 2.5% improvement in precision and a 2.1% increase in recall, while the model's parameter count and computation are also significantly reduced. The second improvement adds the LSKA attention mechanism. Table 3 shows increases of 4.4% and 5.9% in mAP@0.5 and mAP@0.5–0.95, respectively, indicating an improvement in the model's ability to detect small targets. The third improvement replaces the fast spatial pyramid feature extraction module with a focal modulation network to enhance detection performance. Compared with the original model, this modification results in a 1.4% increase in precision, along with improvements of 2.8% and 2.3% in mAP@0.5 and mAP@0.5–0.95, respectively, while reducing both the computation and the parameter count of the model. The fourth improvement uses a bidirectional feature pyramid network (BiFPN) in place of the original feature pyramid network for feature fusion. Compared with the original model, detection accuracy increases by 1.1%, and the model becomes lighter: the number of parameters is reduced by 10.5% and the computation by 10.6%. Finally, all improvements were integrated into the base model. The integrated model shows significantly improved detection accuracy: mAP@0.5 and mAP@0.5–0.95 increased by 6.7% and 14.7%, respectively. Additionally, the parameter count and computation of the fused model are the smallest among the models in Table 3. The average precision curves of the models in Table 3 are shown in Figure 12.
In summary, the experimental results confirm the effectiveness of each individual improvement.
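The BiFPN used in the fourth improvement fuses multi-scale features with learnable per-input weights rather than plain summation. A minimal sketch of its fast normalized fusion rule (Tan et al. [22]), assuming the inputs have already been resampled to a common resolution; the full top-down/bottom-up pathway is omitted:

```python
import numpy as np

def fast_normalized_fusion(features, weights, eps=1e-4):
    """BiFPN-style fast normalized fusion: each input feature map gets one
    learnable scalar weight, ReLU keeps the weights non-negative, and the
    weights are normalized by their sum (plus a small epsilon)."""
    w = np.maximum(np.asarray(weights, dtype=np.float64), 0.0)  # ReLU
    return sum(wi * f for wi, f in zip(w, features)) / (w.sum() + eps)

# Two dummy 4x4 feature maps fused with weights 2.0 and 1.0:
# the output is close to 2/3 everywhere.
fused = fast_normalized_fusion([np.ones((4, 4)), np.zeros((4, 4))], [2.0, 1.0])
```

Compared with softmax-based fusion, this normalization avoids the exponential and is the variant EfficientDet reports as faster with comparable accuracy.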

3.3.2. Loss Function Comparison Test

To validate the effectiveness of the shape-IoU loss function adopted in this study, different loss functions were substituted into the model that integrates all the improvements from the previous section, and their prediction accuracy and convergence were compared. The experimental results are shown in Table 4 and Figure 13. Table 4 shows that replacing the original CIoU loss function with shape-IoU improves the prediction accuracy of the model.
In Figure 13, the horizontal axis denotes the number of training iterations and the vertical axis the corresponding loss value. As shown in Figure 13, replacing the loss function markedly accelerates model convergence, effectively improving the training process.
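For reference, a sketch of the shape-IoU loss following the definitions in [26]: the ground-truth box's width/height ratio weights both the normalized center distance and the width/height mismatch terms. The (cx, cy, w, h) box format, the epsilon terms, and the default `scale` value here are illustrative assumptions, not the paper's exact implementation:

```python
import math

def shape_iou_loss(pred, gt, scale=0.0, theta=4.0, eps=1e-9):
    """Hedged sketch of the Shape-IoU loss [26]. Boxes are (cx, cy, w, h);
    `scale` reflects the typical object scale of the dataset."""
    px, py, pw, ph = pred
    gx, gy, gw, gh = gt

    # Plain IoU of the two boxes.
    iw = max(min(px + pw / 2, gx + gw / 2) - max(px - pw / 2, gx - gw / 2), 0.0)
    ih = max(min(py + ph / 2, gy + gh / 2) - max(py - ph / 2, gy - gh / 2), 0.0)
    inter = iw * ih
    iou = inter / (pw * ph + gw * gh - inter + eps)

    # Shape weights derived from the ground-truth box.
    ww = 2 * gw**scale / (gw**scale + gh**scale)
    hh = 2 * gh**scale / (gw**scale + gh**scale)

    # Squared diagonal of the smallest enclosing box.
    cw = max(px + pw / 2, gx + gw / 2) - min(px - pw / 2, gx - gw / 2)
    ch = max(py + ph / 2, gy + gh / 2) - min(py - ph / 2, gy - gh / 2)
    c2 = cw**2 + ch**2 + eps

    # Shape-weighted center-distance and shape-cost terms.
    dist = hh * (px - gx) ** 2 / c2 + ww * (py - gy) ** 2 / c2
    omega_w = hh * abs(pw - gw) / max(pw, gw)
    omega_h = ww * abs(ph - gh) / max(ph, gh)
    shape_cost = (1 - math.exp(-omega_w)) ** theta + (1 - math.exp(-omega_h)) ** theta

    return 1 - iou + dist + 0.5 * shape_cost

# Identical boxes give (near) zero loss; a shifted box is penalized more.
loss_same = shape_iou_loss((10, 10, 4, 4), (10, 10, 4, 4))
loss_shift = shape_iou_loss((11, 10, 4, 4), (10, 10, 4, 4))
```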

3.3.3. Model Feature Visualization

For a more intuitive view of how the LSKA attention mechanism improves defect recognition, Grad-CAM [27] is used to draw heat maps of the target-domain features extracted by the model. First, an input image is propagated forward through the network to obtain its output. Next, the gradient of the output class score with respect to each feature map is computed; these gradients reflect the significance of each feature map to the final classification outcome. The gradients are then spatially averaged and used to weight the corresponding feature maps, and the weighted maps are summed to produce a heatmap that highlights the regions of the input image on which the network focuses. Brighter regions in this heatmap exert a greater influence on the model's output. The comparison before and after adding the LSKA attention mechanism is shown in Figure 14.
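The steps above can be sketched in NumPy, assuming the layer activations and class-score gradients have already been extracted (e.g., via framework hooks); upsampling to the input resolution and overlay rendering are omitted:

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM heatmap from one conv layer's activations A_k (K, H, W) and
    the gradients of the target class score w.r.t. those activations:
    weight each map by its spatially averaged gradient, sum, ReLU, normalize."""
    weights = gradients.mean(axis=(1, 2))             # alpha_k, one scalar per map
    cam = np.tensordot(weights, activations, axes=1)  # sum_k alpha_k * A_k
    cam = np.maximum(cam, 0.0)                        # ReLU keeps positive evidence
    if cam.max() > 0:
        cam = cam / cam.max()                         # scale to [0, 1] for display
    return cam

# Toy example: 2 feature maps of size 3x3 with uniform gradients.
acts = np.stack([np.ones((3, 3)), np.full((3, 3), 2.0)])
grads = np.stack([np.full((3, 3), 0.5), np.full((3, 3), 1.0)])
heat = grad_cam(acts, grads)
```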
Compared with the heat map obtained without LSKA in Figure 14b, the defect region in Figure 14c is brighter and localized more accurately, with no misjudged areas. Adding the LSKA attention mechanism thus strengthens the model's perception of defects and focuses more precisely on their characteristics.

3.3.4. Ablation Experiment

To thoroughly evaluate the comprehensive impact of the proposed enhancement strategy on model performance, a series of ablation experiments was conducted on the YOLOv9 model, with results systematically compared to those of the baseline. The evaluation indicators include mAP@0.5, mAP@0.5–0.95, and the number of model parameters (Np), allowing comparison along the two dimensions of accuracy and model complexity. The ablation results are shown in Table 5, where “√” indicates that the corresponding module has been incorporated into the model. The results quantify the contribution of each module to detection performance.
The ablation results in Table 5 show that replacing the traditional convolution layers in YOLOv9 with DualConv substantially improves detection accuracy. BiFPN, replacing the original simpler feature fusion structure, delivers a further gain in detection performance while greatly reducing the number of parameters, yielding a lighter model. Introducing shape-IoU in place of the original loss function makes the model converge more easily and further improves detection accuracy. Finally, the model combining all improvements gains 6.7% in mAP@0.5 and 14.7% in mAP@0.5–0.95 while reducing the number of parameters by 17.8%. These results demonstrate both a significant improvement in prediction accuracy and a marked reduction in model size, providing guidance for lightweight deployment in resource-constrained environments.
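To see where DualConv's parameter savings come from, a back-of-the-envelope count may help. This is a simplification of the scheme in [15], in which each output filter applies a k × k kernel to a subset of input channels and a 1 × 1 kernel to the rest; the exact grouping in the paper may differ:

```python
def conv_params(c_in, c_out, k):
    """Weights of a standard k x k convolution (bias ignored)."""
    return c_in * c_out * k * k

def dualconv_params(c_in, c_out, k, g):
    """Rough parameter estimate for DualConv [15] with g groups: each output
    filter applies a k x k kernel to c_in/g input channels and a 1 x 1 kernel
    to the remaining channels (grouping details simplified)."""
    kxk_part = (c_in // g) * k * k        # grouped k x k branch
    pointwise_part = c_in - c_in // g     # 1 x 1 branch on the rest
    return c_out * (kxk_part + pointwise_part)

std = conv_params(64, 64, 3)          # standard 3x3 conv over 64 channels
dual = dualconv_params(64, 64, 3, 4)  # DualConv with 4 groups: ~3x fewer weights
```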

3.3.5. Multi-Model Comparison Experiment

To comprehensively evaluate the advantages of the proposed model in textile defect detection, a comparative analysis was conducted against several current state-of-the-art object detection algorithms, assessing how accurately each method identifies textile defects. The experimental results are shown in Table 6.
As Table 6 shows, the YOLO series are one-stage object detection networks and therefore achieve faster detection speeds. Faster R-CNN, a two-stage detector, is slower, although its detection accuracy is slightly higher than that of YOLOv5. Among the compared models, YOLOv9 offers both higher speed and higher accuracy. To further verify the effectiveness of the proposed model, comparative experiments were also conducted with the more recent YOLOv10 and YOLOv11 architectures; while these newer models show incremental gains over YOLOv9, our improved model still achieves higher detection accuracy at comparable computational cost. This study therefore builds its improved model on YOLOv9. Although the improved model runs somewhat slower because of its increased depth, its detection speed still satisfies industrial real-time requirements while achieving the highest detection accuracy.

3.3.6. Model Improvement Detection Effect Experiment

The dataset in this study contains three types of defects: holes, broken warps (cracked ends), and stains. The original YOLOv9 model and the model integrating all improvements were used to detect these three defect types. The detection results are compared in Table 7, and example detections of the improved model are shown in Figure 15. Table 7 shows that mAP@0.5 for the three defect categories improves by a relative 8.6%, 7.3%, and 6.7%, respectively, over the model before improvement, confirming that the proposed model better detects the different small-target defects.

4. Conclusions

This research introduces a textile defect detection algorithm that utilizes an improved version of YOLOv9, with the goal of accurately detecting complex and small textile defects while minimizing missed and false detections. The approach begins by reducing redundant computations through the introduction of DualConv, which replaces conventional convolutional layers. Subsequently, the LSKA attention mechanism is incorporated to enhance the weighting of significant features, thereby improving the accuracy of small target detection and partially enhancing generalization capabilities. The fast spatial pyramid module is then replaced with a focal modulation network to address the issue of detail information loss during feature pooling operations, allowing for more precise identification of small target details, such as defects. Additionally, the BiFPN is utilized in place of the original feature pyramid network to perform feature fusion, improving detection accuracy and reducing computational complexity through multi-level feature pyramids and bidirectional information flow. Finally, the optimization of the bounding box loss function was achieved by replacing the original loss function with the shape-IoU loss function, leading to improved detection accuracy and accelerated model convergence. The enhanced algorithm was tested on a textile defect dataset, and the experimental results indicate that, in comparison to the original algorithm, the improved algorithm achieves a 6.7% increase in mAP@0.5, a 14.7% increase in mAP@0.5–0.95, and a 17.8% reduction in the total number of model parameters. These results demonstrate not only a significant improvement in detection precision but also a reduction in model complexity (parameters and FLOPs), enhancing its suitability for deployment in resource-constrained environments. The achieved inference speed of 55.8 FPS confirms its practical viability for real-time industrial inspection. 
Thus, the proposed improvement strategy effectively balances high accuracy, model efficiency, and real-time performance, addressing the core challenges in automated textile defect detection. Compared with existing lightweight models such as SF-YOLOv8n and MobileViT-based YOLO variants, the proposed method offers a better trade-off between detection accuracy and computational cost, which is particularly beneficial for identifying fine-grained textile defects.
Furthermore, the ablation study and feature visualization confirm the independent contribution and combined synergy of each proposed module, offering solid experimental evidence for their effectiveness. While some improvements, such as shape-IoU, contribute modest performance gains, they significantly enhance training stability and convergence speed.
Textile defect detection has consistently presented challenges due to the intricate patterns on fabric surfaces and the presence of considerable interference, which can result in both false positives and undetected flaws. This work addresses these challenges through combined architectural and attention-based improvements, yet there remains room to further improve robustness across more complex fabrics and defect types. Future studies will investigate the integration of segmentation algorithms to lessen the effect of noise. Because deepening the network structure typically reduces the detection frame rate, future work will also consider knowledge distillation and pruning to remove redundant channels and lower the computational burden. Additionally, cross-domain transfer learning, synthetic data generation, and multi-modal fusion (e.g., thermal + RGB) will be explored to further improve generalization to unseen defect types and fabric categories. Finally, since the current experiments rely on an industrial dataset with limited accessibility, extending the evaluation to public benchmarks such as the Fabric Defect Detection Dataset and DAGM 2007 will enable a more rigorous assessment of generalization.

Author Contributions

Conceptualization, C.X. and W.S.; methodology, C.X. and L.S.; data curation, C.X. and J.W.; validation, C.X., L.S. and J.W.; investigation, C.X.; resources, W.S. and Y.Z.; writing—original draft preparation, C.X.; writing—review and editing, W.S. and J.T.; visualization, C.X. and L.S.; supervision, W.S. and Y.Z.; project administration, W.S.; funding acquisition, W.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Central Guidance on Local Science and Technology Development Fund of Zhejiang Province [Grant No. 2022ZYYDSA300214], the Key technological innovation project of Hangzhou City [Grant No. 2022AIZD0153], and the National Natural Science Foundation of China [Grant No. 51605443].

Data Availability Statement

The datasets generated and/or analyzed during the current study are not publicly available due to the respect and protection of product information privacy of cooperative enterprises but are available from the corresponding author on reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Girshick, R. Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; IEEE Computer Society: Los Alamitos, CA, USA, 2015; pp. 1440–1448.
2. Redmon, J. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016.
3. Xu, S.; Zheng, L.; Yuan, D. A method for fabric defect detection based on improved cascade R-CNN. Adv. Text. Technol. 2022, 30, 48.
4. Jun, X.; Wang, J.; Zhou, J.; Meng, S.; Pan, R.; Gao, W. Fabric defect detection based on a deep convolutional neural network using a two-stage strategy. Text. Res. J. 2021, 91, 130–142.
5. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going Deeper with Convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1–9.
6. Zheng, L.; Wang, X.; Wang, Q.; Wang, S.; Liu, X. A Fabric Defect Detection Method Based on Improved YOLOv5. In Proceedings of the 7th International Conference on Computer and Communications (ICCC 2021), Chengdu, China, 10–13 December 2021.
7. Wang, A.; Yuan, J.; Zhu, Y.; Chen, C.; Wu, J. Drum roller surface defect detection algorithm based on improved YOLOv8s. J. Zhejiang Univ. Eng. Sci. 2024, 58, 370–380.
8. Zhang, Y.; Sun, J.-X.; Sun, Y.-M.; Liu, S.-D.; Wang, C.-Q. Lightweight object detection based on split attention and linear transformation. J. Zhejiang Univ. Eng. Sci. 2023, 57, 1195–1204.
9. Kang, X.; Li, J. AYOLOv7-tiny: Towards efficient defect detection in solid color circular weft fabric. Text. Res. J. 2024, 94, 225–245.
10. Liu, B.; Wang, H.; Cao, Z.; Wang, Y.; Tao, L.; Yang, J.; Zhang, K. PRC-Light YOLO: An Efficient Lightweight Model for Fabric Defect Detection. Appl. Sci. 2024, 14, 938.
11. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Proceedings of the Computer Vision—ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Cham, Switzerland, 2016; pp. 21–37.
12. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the 27th IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 23–28 June 2014; pp. 580–587.
13. Ren, S.Q.; He, K.M.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Proceedings of the 29th Annual Conference on Neural Information Processing Systems (NIPS), Montreal, QC, Canada, 7–12 December 2015.
14. Wang, C.-Y.; Yeh, I.-H.; Liao, H.-Y.M. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. In Proceedings of the European Conference on Computer Vision—ECCV 2024, Milan, Italy, 29 September–4 October 2024; Springer Nature: Cham, Switzerland, 2024.
15. Zhong, J.; Chen, J.; Mian, A. DualConv: Dual Convolutional Kernels for Lightweight Deep Neural Networks. IEEE Trans. Neural Netw. Learn. Syst. 2023, 34, 9528–9535.
16. Lau, K.W.; Po, L.M.; Rehman, Y.A.U. Large Separable Kernel Attention: Rethinking the Large Kernel Attention design in CNN. Expert Syst. Appl. 2024, 236, 121352.
17. Yu, F.; Koltun, V. Multi-Scale Context Aggregation by Dilated Convolutions. arXiv 2015, arXiv:1511.07122.
18. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861.
19. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916.
20. Yang, J.; Li, C.; Dai, X.; Yuan, L.; Gao, J. Focal Modulation Networks. arXiv 2022, arXiv:2203.11926.
21. Lin, T.-Y.; Dollár, P.; Girshick, R.B.; He, K.; Hariharan, B.; Belongie, S.J. Feature Pyramid Networks for Object Detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 936–944.
22. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and Efficient Object Detection. In Proceedings of the 2019 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019.
23. Zheng, Z.H.; Wang, P.; Liu, W.; Li, J.Z.; Ye, R.G.; Ren, D.W. Distance-IoU Loss: Faster and Better Learning for Bounding Box Regression. In Proceedings of the 34th AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 12993–13000.
24. Yu, J.; Jiang, Y.; Wang, Z.; Cao, Z.; Huang, T. UnitBox: An advanced object detection network. In Proceedings of the 24th ACM Multimedia Conference (MM 2016), Amsterdam, The Netherlands, 15–19 October 2016.
25. Rezatofighi, H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized Intersection over Union: A Metric and A Loss for Bounding Box Regression. In Proceedings of the 32nd IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 658–666.
26. Zhang, H.; Zhang, S. Shape-IoU: More Accurate Metric considering Bounding Box Shape and Scale. arXiv 2023, arXiv:2312.17663.
27. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. Int. J. Comput. Vis. 2020, 128, 336–359.
Figure 1. Improved YOLOv9 network structure.
Figure 2. DualConv structure diagram.
Figure 3. DualConv structure layout.
Figure 4. LKA and LSKA structure diagram.
Figure 5. Fusion module.
Figure 6. Comparison between SA and focal modulation. (a) The original self-attention; (b) the focal modulation.
Figure 7. Focal modulation network structure diagram.
Figure 8. Network design diagram of FPN and BiFPN.
Figure 9. Neck structure diagram after BiFPN improvement.
Figure 10. Schematic diagram of shape-IoU calculation.
Figure 11. Instances of the dataset.
Figure 12. Comparison of average precision of the single improved models.
Figure 13. Loss function comparison.
Figure 14. Model heat map visualization. (a) The original image; (b) heat map without the LSKA attention mechanism; (c) heat map with the LSKA attention mechanism.
Figure 15. Improved model defect detection results.
Table 1. Experimental environment configuration.
Configuration Item | Name (Version)
Operating system | Windows 10
CPU | AMD Ryzen 5700X3D 8-Core Processor, 3.00 GHz
GPU | NVIDIA GeForce RTX 4070 Ti SUPER (16 GB) × 1
Compiler | Python 3.8
Deep learning framework | PyTorch 1.13.1
Acceleration module | CUDA Toolkit 11.7
Table 2. Model training hyperparameter configuration.
Parameter | Explanation | Value
Image size | Input image size | 640 × 640
Batch size | Batch number | 16
Epochs | Iterations | 200
lr | Learning rate | 0.001
Momentum | Momentum | 0.937
Weight_decay | Weight decay rate | 0.0005
Table 3. Comparison of single improved model effects.
Model | P | R | mAP@0.5 | mAP@0.5–0.95 | FLOPs | Np/10^6
YOLOv9 | 0.857 | 0.797 | 0.839 | 0.468 | 20.7 | 4.54
YOLOv9 + DualConv | 0.879 | 0.814 | 0.885 | 0.499 | 18.1 | 3.88
YOLOv9 + LSKA | 0.867 | 0.808 | 0.876 | 0.496 | 19.0 | 4.34
YOLOv9 + Focal Modulation | 0.869 | 0.802 | 0.863 | 0.479 | 18.9 | 4.14
YOLOv9 + BiFPN | 0.866 | 0.804 | 0.873 | 0.485 | 18.5 | 4.06
YOLOv9 + DualConv + LSKA + Focal Modulation + BiFPN + Shape-IoU | 0.905 | 0.828 | 0.896 | 0.537 | 17.7 | 3.73
Table 4. Loss function comparison experiment.
Model | P | R | mAP@0.5 | mAP@0.5–0.95
GIoU | 0.887 | 0.819 | 0.879 | 0.488
CIoU | 0.895 | 0.824 | 0.881 | 0.492
Shape-IoU | 0.905 | 0.828 | 0.896 | 0.537
Table 5. Ablation experiment results.
YOLOv9 | DualConv | LSKA | Focal Modulation | BiFPN | Shape-IoU | mAP@0.5 | mAP@0.5–0.95 | Np/10^6
√ | | | | | | 0.839 | 0.468 | 4.54
√ | √ | | | | | 0.885 | 0.499 | 3.88
√ | √ | √ | | | | 0.889 | 0.501 | 3.88
√ | √ | √ | √ | | | 0.891 | 0.510 | 3.80
√ | √ | √ | √ | √ | | 0.895 | 0.533 | 3.76
√ | √ | √ | √ | √ | √ | 0.896 | 0.537 | 3.73
“√” indicates that the corresponding module is incorporated into the model.
Table 6. Comparative experiments of different target detection models.
Model | mAP@0.5 | mAP@0.5–0.95 | FPS (frames/s) | FLOPs | Np/10^6
YOLOv5 | 0.801 | 0.351 | 43.5 | 16.5 | 7.2
Faster R-CNN | 0.821 | 0.398 | 24.1 | 240 | 45.2
YOLOv8n | 0.832 | 0.461 | 60.2 | 8.9 | 3.2
YOLOv9 | 0.839 | 0.468 | 66.4 | 20.7 | 4.54
YOLOv9-SPD | 0.852 | 0.511 | 58.3 | 18.56 | 4.10
EfficientDet-Lite | 0.837 | 0.498 | 62.1 | 18 | 17
YOLOv10 | 0.864 | 0.492 | 64.7 | 16.8 | 4.92
YOLOv11 | 0.871 | 0.503 | 63.8 | 17.2 | 5.41
This research model | 0.896 | 0.537 | 55.8 | 17.7 | 3.73
Table 7. Experimental results of different defect detection.
Model | mAP@0.5: Hole | Cracked Ends | Stains
YOLOv9 | 0.825 | 0.833 | 0.842
This research model | 0.896 | 0.894 | 0.899

Share and Cite

MDPI and ACS Style

Xuan, C.; Shi, W.; Sun, L.; Wu, J.; Zhang, Y.; Tu, J. Improved YOLOv9 with Dual Convolution and LSKA Attention for Robust Small Defect Detection in Textiles. Processes 2026, 14, 149. https://doi.org/10.3390/pr14010149
