A Robust Fabric Defect Detection Method Based on Improved RefineDet

This paper proposes a robust fabric defect detection method based on an improved RefineDet, exploiting the strong object localization ability and good generalization of object detection models. Firstly, the method uses RefineDet as the base model, inheriting the advantages of both two-stage and one-stage detectors, and can detect defect objects efficiently and quickly. Secondly, we design an improved head structure based on the Fully Convolutional Channel Attention (FCCA) block and the Bottom-up Path Augmentation Transfer Connection Block (BA-TCB), which improves the defect localization accuracy of the method. Finally, the proposed method applies many general optimization methods, such as the attention mechanism, DIoU-NMS, and the cosine annealing scheduler, and verifies the effectiveness of these optimization methods in the fabric defect localization task. Experimental results show that the proposed method is suitable for the defect detection of fabric images with unpatterned backgrounds, regular patterns, and irregular patterns.


Introduction
In the textile industry, defect detection is an important part of the quality control of fabric products. It aims to detect fabric products with defective areas efficiently and in a timely manner to reduce the additional economic losses caused by these low-quality products. Many textile mills still use manual inspection to detect defects [1], which causes inspectors to inadvertently miss some defective products after prolonged observation. To this end, it is necessary to develop an automatic defect detection method based on computer vision.
Since the 1980s, there has been a great deal of work in the field of automated defect detection. The traditional works are mainly divided into the following categories: (1) spectral-based methods, (2) statistical-based methods, (3) structural-based methods, and (4) traditional learning-based methods. For example, Jia et al. [2] proposed a fabric defect detection method based on lattice segmentation and Gabor filtering. The method mainly used a Gabor filter bank, a kind of spectral-based method, to extract features of the image patch (i.e., semantic lattice). Jing et al. [3] proposed two defect detection methods, i.e., Gabor preprocessed Golden Image Subtraction (GGIS) and Gabor preprocessed Golden Image Subtraction Based on a Genetic Algorithm (GAGIS); the two methods mainly used the Gabor filter to preprocess the input image and used the image subtraction technique to locate the defective area. Pan et al. [4] and Zhang et al. [5] used the mathematical morphology method, a kind of statistical-based method, to enhance the defective area in unpatterned images. Liu et al. [6] proposed a defect detection method based on unsupervised segmentation (a dictionary learning method based on image patches) and the Extreme Learning Machine (ELM), and the method mainly extracted

• Powerful feature extraction capability. Object detectors use deep convolutional neural networks as the backbone [23], which can automatically extract the defect features of the input image.
• Efficient neck structure (feature pyramid and feature fusion). Most object detection models have a neck (i.e., feature pyramid and feature fusion), which can detect defect areas of different sizes in the image. These feature pyramid structures are roughly divided into the following categories [24,25]: (1) SSD-style, (2) FPN-style, (3) STDN-style, (4) M2Det-style, and (5) PAN-style. Some recent studies have made improvements to the neck structure and achieved good detection results [19,26].
• Flexible selection of the model. Pre-existing object detection models can usually be divided into two categories: one-stage object detection models and two-stage object detection models. In general, the two-stage models have higher localization and object classification accuracy, while the one-stage models are time-efficient and can be used for real-time detection [27]. According to the needs of different application scenarios, we can choose an appropriate model for training. In the field of fabric defect detection, a one-stage object detector can be selected as the base model to meet the needs of real-time detection.
• Classification and localization results at the image patch level. The object detection model generates candidate object bounding boxes (i.e., image patches with defect areas or background areas) from the feature maps and sends them to the classification subnetwork and regression subnetwork, respectively. After Non-Maximum Suppression (NMS), we can directly get the defect categories and locations of each predicted defect image patch.
• Various general optimization methods. Across the training and testing stages of the object detection model, researchers have proposed various general optimization methods [25,27,28], including data augmentation, attention mechanisms, learning rate scheduling strategies, activation function selection, loss function optimization, and post-processing improvements. These optimization methods can boost the performance of all popular object detection models without introducing extra computational cost during inference [28].
• Good generalization ability. The object detection model mainly learns the features of defect objects rather than the background. Therefore, when there are enough training images with defect objects, it is suitable for fabric defect detection under different texture backgrounds.
Meanwhile, some state-of-the-art data augmentation [25,29-31] and Weakly Supervised Object Localization (WSOL) [32,33] methods can alleviate the problems of insufficient training samples and expensive manual labeling to a certain extent.
In view of this, a fabric defect detection method based on an improved RefineDet is proposed in this paper. Firstly, we use a one-stage object detection model, RefineDet [34], as the base model, which has the advantages of both one-stage and two-stage object detection models. Secondly, we design an improved head structure combining the Fully Convolutional Channel Attention (FCCA) block and the Bottom-up Path Augmentation Transfer Connection Block (BA-TCB); this improved head structure can effectively improve the model's ability to locate defective areas. Finally, we apply and verify the influence of many general optimization methods in the field of fabric defect detection, which can boost the performance of the proposed defect detection model without introducing much extra computational cost during inference. On three public defect datasets, TILDA, the Hong Kong Patterned Textures Database, and DAGM2007, the mAP and F1-score metrics of the proposed method exceed 80%. Experimental results show that the proposed method is suitable for the defect detection of fabric images with unpatterned backgrounds, regular patterns, and irregular patterns and can detect different fabric defects at high speed.
The main contributions of this work are summarized as follows:
• We use RefineDet as the base model of defect detection. To the best of our knowledge, this is the first time that RefineDet has been used for fabric defect detection. Using the special two-step classification and regression structure of RefineDet, the proposed method can better detect the defect area compared with other common object detectors.

• We design an improved head structure. This improved head structure consists of the Fully Convolutional Channel Attention-based Anchor Refinement Module (FCCA-ARM), BA-TCB, and the Object Detection Module (ODM). By adding the channel attention mechanism (i.e., the FCCA block) and designing the bottom-up path augmentation structure (i.e., BA-TCB), the detection accuracy of the proposed method is further improved.
• We research and verify the influence of many general optimization methods in the field of fabric defect detection. State-of-the-art general optimization methods, such as the attention mechanism, DIoU-NMS, and the cosine annealing scheduler, are successfully applied to our detection model, which provides an important reference for researchers in the fabric defect detection field.
The remainder of this paper is organized as follows: Related work is introduced in Section 2. The details of our proposed method are included in Section 3. Experiment results and discussion are provided in Section 4. Finally, Section 5 is the conclusion of this work.

Related Work
This section first shows the structure of the object detection model, then introduces some state-of-the-art general optimization methods, and finally summarizes the advantages of RefineDet, the base object detector used in our work.

The Structure of The Object Detection Model
In recent years, object detection models have converged on a fixed structural framework. An object detection model is composed of the following parts:

• Acquisition of input images. The raw images used in recent research mainly come from public datasets or are collected by textile factories and laboratories. Some typical public defect detection datasets are the TILDA dataset (https://lmb.informatik.uni-freiburg.de/resources/datasets/tilda.en.html), the DAGM2007 dataset (https://hci.iwr.uni-heidelberg.de/content/weakly-supervised-learning-industrial-optical-inspection), and the Hong Kong patterned texture database (https://ytngan.wordpress.com/codes/); some self-built datasets are DHU-FD-500 [7], DHU-FD-1000 [7], lattice [8], the FDBF dataset [19], etc.
• Image preprocessing. The lack of defect samples is a challenge in the research of fabric defect detection methods. To address this problem, the techniques of image pyramids [35] or sliding windows [9] are introduced in the image preprocessing stage. In particular, many data augmentation methods based on photometric distortion and geometric distortion are widely used to increase the variability of the input images, including adjusting the brightness, contrast, hue, saturation, and noise of the input images, as well as image scaling, cropping, flipping, and rotating.
• Backbone. The backbone, as the basic feature extractor of the object detection task, is used to generate the output feature maps of the corresponding input images. Common backbones are VGG-16 [34,36], ResNet [37,38], ResNeXt [39], DarkNet-19 [15,21], DarkNet-53 [8,16], MobileNet [40,41], and ShuffleNet [42].
• Neck structure. Object detection models developed in recent years often insert a neck structure between the backbone network and the head structure; the neck is usually used to collect feature maps from layers of the backbone with different resolutions [25]. Common neck structures in recent research are the Feature Pyramid Network (FPN) [43] with its variants CI-FPN [19], BI-FPN [26], and NAS-FPN [44], and the Path Aggregation Network (PAN) [45].
• Head. The head is used to predict the classes and bounding boxes (detection results at the image patch level) of defect objects. In this stage, one-stage object detectors directly predict classes and bounding boxes from dense candidate boxes, while two-stage object detection models first filter out sparse refined boxes from the dense boxes and then predict the results from the refined boxes. Therefore, the two-stage models have higher accuracy, and the one-stage models are time-efficient.
• Post-processing. In the testing stage, the post-processing step filters out weak detection results [23]. For example, NMS is a widely used method that retains only the object boxes with the highest classification scores in the predicted results. Common NMS methods are greedy-NMS [46], soft-NMS [47], and adaptive-NMS [48].
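To make the post-processing step above concrete, the following is a minimal pure-Python sketch of greedy-NMS; the corner-coordinate box format and the threshold value are illustrative choices, not details fixed by the text.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (xmin, ymin, xmax, ymax)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def greedy_nms(boxes, scores, threshold=0.5):
    """Keep the highest-scoring box, drop boxes overlapping it, and repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= threshold]
    return keep
```

For example, with two heavily overlapping boxes and one distant box, only the top-scoring overlapping box and the distant box survive.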

The State-of-The-Art General Optimization Method
Recently, many works have proposed new components and optimization methods for object detection models and have achieved remarkable results in the general domain. These optimization methods cover all training and testing stages of the detector, including the image preprocessing stage, network construction (attention mechanism, activation function), loss function design, and the post-processing stage.
In the image preprocessing stage, many novel data augmentation methods have appeared, such as Mosaic [25], MixUp [29], CutOut [30], CutMix [31], and the Generative Adversarial Network (GAN) [49]. In the network construction stage, many attention mechanisms can be combined with the network's backbone, neck, and head to get better localization results, such as channel-wise attention (the Squeeze-and-Excitation (SE) block [50] and the Channel Attention (CA) block [51]) and point-wise attention (SAM [51]). For example, on the MS COCO dataset, Woo et al. [51] showed experimentally that Faster RCNN (with a ResNet-50 backbone) combining the CA block and SAM improves the mAP metric by 2% compared with the original Faster RCNN. On the VOC2007 testing set, when combining the CA block and SAM, the StairNet framework (with a VGG-16 backbone), one of the strongest multi-scale methods based on SSD, improves the mAP metric by 1.5% compared with the original SSD; when combining the SE block, the StairNet framework improves the mAP metric by 1.3% compared with the original SSD. Meanwhile, some new activation functions have been designed to optimize CNN models, such as Swish [52] and Mish [53]. In addition, Zheng et al. [54] designed an improved IoU loss function as the regression loss and used DIoU-NMS for the object detection model in the post-processing stage.
Inspired by these optimization methods, we introduce some state-of-the-art general optimization methods to this work and verify the influence of these optimization methods on the fabric defect detection performance.

The Overview of RefineDet
In the head structure, the two-stage object detection models rely on a dense candidate-box generation process to obtain as many candidate boxes as possible, and this process is often time-consuming and inefficient. It is a hot trend to develop an efficient object detection method that can eliminate this redundancy while maintaining high accuracy [23]. To address this issue, Zhang et al. [34] designed a new one-stage framework (called RefineDet) to inherit the advantages of both the one-stage and two-stage detectors.
RefineDet uses VGG-16 or ResNet-101 as the backbone for feature extraction and integrates the neck structure (feature pyramid and feature fusion) into the head structure. The head structure of RefineDet consists of the Anchor Refinement Module (ARM), Transfer Connection Block (TCB), and ODM (as shown in Figure 1):
• ARM. The four feature maps of the ARM mainly come from different layers of the backbone. The ARM is designed to coarsely filter out refined boxes from dense candidate boxes and to adjust the locations and sizes of the refined boxes (i.e., the first-step classification and regression), so as to provide better initialization for the subsequent multi-class classification and regression task.
• TCB. The TCB transfers the refined boxes to the ODM and integrates the feature information (feature fusion) of the shallow and deep layers of the ARM.
• ODM. The ODM takes the refined boxes from the TCB as input and outputs the predicted multi-class labels and locations of the refined boxes (i.e., the second-step classification and regression). In the testing stage, we can get the predicted results at the image patch level after NMS processing.
As a kind of one-stage object detection model, RefineDet also has the characteristics of two-stage models (i.e., two-step cascaded regression and two-step classification), which allows it to better predict hard-to-detect objects, especially small objects, and to obtain more accurate object locations [23]. RefineDet achieves 85.8% and 86.8% mAP on VOC2007 and VOC2012, two public general datasets, with the VGG-16 backbone. Meanwhile, it runs at 40.2 FPS with an input size of 320 × 320. Considering these advantages, we use RefineDet as the base model in our fabric defect detection method.


Methodology
As shown in Figure 2, we propose a robust fabric defect detection method based on an improved RefineDet. The improved RefineDet network consists of the following parts: a VGG-16-based backbone, FCCA-ARM, BA-TCB, and ODM. In addition, many general optimization methods, such as mosaic data augmentation, the attention mechanism, DIoU-NMS, and the cosine annealing scheduler, are applied in the training and testing stages of our model, which effectively improves the defect localization accuracy of the proposed method.

Figure 2. The network architecture of the proposed method.

Data Augmentation
In general, neural networks need to train millions of parameters, and these parameters can only be learned correctly with enough input data for training. However, in fabric defect detection, the data are not as plentiful as one might expect: fabric datasets range in size from dozens to thousands of images. The lack of defect samples is a challenge in the research of fabric defect detection methods.

To address the issue above, we mainly use data augmentation methods in the image preprocessing stage. In addition to using some data augmentation methods based on photometric distortion and geometric distortion, we also use the mosaic method in our work.
As shown in Figure 3, we can perform a combination of various data augmentation methods (i.e., image flip, rotation, scale-up, brightness change, contrast change, and mosaic) to increase the number of training images. In addition, when the generated images are close to the natural image, these data augmentation methods can increase the diversity of the input images, which makes the trained model more robust.
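As a rough illustration of the mosaic idea mentioned above, the sketch below stitches four equally sized grayscale grids into one composite image covering four quadrants. A practical implementation would also sample random crop points and remap the bounding-box labels into the mosaic; both are omitted here for brevity.

```python
def mosaic(imgs):
    """Stitch four equally sized 2-D grayscale grids (lists of rows) into one
    mosaic image with twice the height and twice the width: a|b over c|d."""
    a, b, c, d = imgs
    top = [row_a + row_b for row_a, row_b in zip(a, b)]       # left|right, upper half
    bottom = [row_c + row_d for row_c, row_d in zip(c, d)]    # left|right, lower half
    return top + bottom
```

For example, stitching four 1 × 1 "images" with values 1-4 yields the 2 × 2 grid [[1, 2], [3, 4]].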

VGG-16 Based Backbone
We do not change the backbone of the original RefineDet; i.e., we use VGG-16 as the base feature extraction network, remove fc8, convert fc6 and fc7 into convolution layers (Conv6, Conv7), and add two extra convolution layers (Conv8_1, Conv8_2) after VGG-16. The configuration of each layer of the backbone network can be seen in Table 1. From Conv1_1 to Conv7, all convolution layers are followed by the ReLU activation function.


The Improved Head Structure
Inspired by BI-FPN [26], PANet [45], and CBAM [51], we design an improved head structure to improve the defect localization ability of the proposed method. The head in our model consists of three parts: the Fully Convolutional Channel Attention-based Anchor Refinement Module (FCCA-ARM), the Bottom-up Path Augmentation Transfer Connection Block (BA-TCB), and the ODM.

FCCA-ARM
The layers in the feature pyramid come from Conv4_3, Conv5_3, Conv7, and Conv8_2 of the backbone network. By using the attention mechanism, the model can focus on important features and suppress unnecessary ones [51]. Therefore, we design an improved ARM that uses fully convolutional channel attention blocks to process those layers before the ARM.
As seen in Figure 4c, given an intermediate feature map X ∈ R^(W×H×C) as input, the output Y of the fully convolutional channel attention block is computed as Equation (1):

Y = X ⊗ σ(Conv2(Σ(Conv1(AvgPool(X)))) + Conv2(Σ(Conv1(MaxPool(X)))))    (1)

Conv1(·) and Conv2(·) are convolution layers with kernel size 1 × 1, stride 1, padding 0, and C output channels. σ(·) is the Sigmoid activation function, and Σ(·) is the ReLU activation function. Different from the SE block (see Figure 4a), the channel attention mechanism uses both average-pooled and max-pooled features simultaneously. In addition, we use two convolutional layers instead of the multilayer perceptron (MLP) of the original channel attention block (see Figure 4b), which can increase the representation power and decrease the parameters of our model.
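The channel attention computation described above can be sketched in pure Python as follows. This is a toy illustration, not the authors' implementation: the 1 × 1 convolutions are stand-ins realized as plain channel-mixing matrices `w1` and `w2` (names are ours), and no channel reduction ratio is modeled.

```python
import math

def matvec(w, v):
    """Multiply matrix w by vector v (toy stand-in for a 1x1 convolution)."""
    return [sum(wi * vi for wi, vi in zip(row, v)) for row in w]

def relu(v):
    return [max(0.0, x) for x in v]

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-x)) for x in v]

def fcca_block(x, w1, w2):
    """Toy channel attention: x is a list of C channels, each a 2-D grid.
    Pool each channel (average and max), pass both descriptors through two
    channel-mixing layers with a ReLU in between, add, squash with Sigmoid,
    and rescale each input channel by its attention weight."""
    avg = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in x]
    mx = [max(max(row) for row in ch) for ch in x]
    att = sigmoid([a + b for a, b in
                   zip(matvec(w2, relu(matvec(w1, avg))),
                       matvec(w2, relu(matvec(w1, mx))))])
    # channel-wise rescaling of the input feature map
    return [[[v * att[c] for v in row] for row in ch]
            for c, ch in enumerate(x)]
```

With identity weight matrices, a channel whose pooled responses are large receives an attention weight close to 1, while an all-zero channel is scaled by 0.5 (Sigmoid of 0).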
BA-TCB

Inspired by the bidirectional structure of BI-FPN [26] and PANet [45], we design four new blocks, TCBv2, to form a bottom-up pathway and add four fully convolutional channel attention blocks between the TCB and TCBv2.
As seen in Figure 5, the four inputs of TCBv2 are denoted as (P3, P4, P5, P6), and the output feature maps (N3, N4, N5, N6) are computed as Equations (3) and (4).
where i refers to a feature level with a resolution of 1/2^i of the input images (e.g., P3 and N3 are feature maps with a resolution of 40 × 40 when the input image is 320 × 320).


ODM
The ODM consists of four convolutional layers with k × (C + 4) channels, where C is the number of defect classes for the multi-classification task, 4 refers to the four coordinate values (i.e., x_min, y_min, w, h) of predicted boxes for the regression task, and k refers to the number of aspect ratios. In our model, k is set to 3, which means that the aspect ratios are 1:1, 1:2, and 2:1.
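The channel count and the three anchor aspect ratios can be illustrated with a short sketch. Keeping the anchor area fixed while varying the ratio w/h is a common convention in anchor-based detectors; that convention, like the function names, is our assumption rather than a detail stated in the text.

```python
def odm_channels(num_classes, num_ratios=3):
    """Output channels of an ODM prediction layer: each of the k anchors
    predicts C class scores plus 4 box coordinates."""
    return num_ratios * (num_classes + 4)

def anchor_shapes(base_size, aspect_ratios=(1.0, 0.5, 2.0)):
    """For each aspect ratio r = w/h, keep the anchor area equal to
    base_size**2, so w = base_size*sqrt(r) and h = base_size/sqrt(r)."""
    shapes = []
    for r in aspect_ratios:
        w = base_size * r ** 0.5
        h = base_size / r ** 0.5
        shapes.append((w, h))
    return shapes
```

For example, with C = 5 defect classes and k = 3 ratios, each ODM prediction layer would output 27 channels per spatial position.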

The Two-Step Loss Function and DIoU-NMS
Similar to the original RefineDet, in the training stage, the total two-step loss function is designed as Equation (5),
in which i is the index of an anchor box in a training batch; p_i and x_i are the predicted confidence of anchor box i being a defect object and its localization coordinates in the FCCA-ARM structure, respectively; c_i and B_i are the predicted object class confidence and the coordinates of the refined box in the ODM, respectively. After calculating the total loss, backpropagation is performed to update the model parameters. After the specified number of iterations (set to 120 k in our experiment), the training of the model finishes.
Generally, an IoU-based loss can be defined as IoU − R(B, B^gt). The penalty term R(B, B^gt) is designed to minimize the normalized distance between the predicted boxes B and the ground-truth boxes B^gt, making the regression (i.e., defect area localization) more accurate and faster. The three important geometric factors that affect the accuracy of anchor box regression are the overlapping area between B and B^gt, the distance between the center points of B and B^gt, and the aspect ratio of the box. Considering these factors, many improved IoU calculation methods have appeared in recent years. For example, gIoU designs a novel penalty term based on IoU to solve the problem of gradient disappearance when the bounding boxes do not overlap (as seen in Equation (10)). dIoU takes the distance between the center points into account (as seen in Equation (11)); by adding a penalty term to the original IoU, the normalized distance between the center points of the two bounding boxes is directly minimized, so that it converges faster than the gIoU loss. Considering all three geometric factors, cIoU adds a penalty term for the aspect ratio on top of dIoU, which makes the regression more accurate and faster (as seen in Equation (12)).

A_c is the area of the smallest enclosing box of B^gt and B. A_(B^gt ∩ B) is the intersection area of B^gt and B. ρ(·) is the Euclidean distance. b and b^gt are the central points of B and B^gt, respectively. l_c is the diagonal length of the smallest enclosing box of B^gt and B. w^gt/h^gt and w/h are the aspect ratios of B^gt and B, respectively. Although cIoU is better than dIoU under normal circumstances, dIoU is better than cIoU in small object detection, i.e., the consistency of the aspect ratio may not affect the regression accuracy of small objects [54]. Therefore, we use dIoU-NMS to post-process the detection results of the proposed method in the testing stage. For the predicted box M with the highest score, dIoU-NMS can be formally defined as Equation (15).
This means that the predicted defect boxes B_i whose overlap with M, measured by dIoU, is greater than the NMS threshold T_s will be filtered out.
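The dIoU-based suppression described above can be sketched in pure Python, following the definition of dIoU as IoU minus the squared center distance normalized by the squared diagonal of the smallest enclosing box. The box format and threshold are illustrative.

```python
def diou(a, b):
    """dIoU of two boxes given as (xmin, ymin, xmax, ymax):
    IoU minus the normalized squared distance between box centres."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    iou = inter / union
    # squared distance between the two box centres
    d2 = (((a[0] + a[2]) / 2 - (b[0] + b[2]) / 2) ** 2 +
          ((a[1] + a[3]) / 2 - (b[1] + b[3]) / 2) ** 2)
    # squared diagonal length of the smallest enclosing box
    c2 = ((max(a[2], b[2]) - min(a[0], b[0])) ** 2 +
          (max(a[3], b[3]) - min(a[1], b[1])) ** 2)
    return iou - d2 / c2

def diou_nms(boxes, scores, ts=0.5):
    """Greedy NMS that suppresses boxes whose dIoU with the current
    highest-scoring box M exceeds the threshold ts."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        m = order.pop(0)
        keep.append(m)
        order = [i for i in order if diou(boxes[m], boxes[i]) <= ts]
    return keep
```

Because the center-distance penalty lowers the score of far-apart boxes, two non-overlapping boxes can never suppress each other even with a low threshold.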

Learning Rate Adjustment Method Based on the Cosine Annealing Scheduler
The common learning rate adjustment strategies mainly include the following: (1) StepLR (i.e., the learning rate in the training stage decreases at regular intervals); (2) MultiStepLR (i.e., the learning rate decreases at preset intervals); (3) ExponentialLR (i.e., the learning rate decreases according to exponential decay); (4) ReduceLROnPlateau (i.e., the learning rate in the training stage decreases when an indicator no longer changes); (5) CosineAnnealingLR (i.e., the learning rate in the training stage changes according to the period of the cosine function). In the training stage, we use the cosine annealing scheduler to adjust the learning rate. The cosine annealing scheduler adjusts the learning rate η_t at an intermediate iteration or epoch as Equation (16):

η_t = η_min + (1/2)(η_max − η_min)(1 + cos((T_cur / T_max) π))    (16)

where η_min is the valley value of the cosine function, which is set to 0 in our experiment; η_max is the peak value of the cosine function (i.e., the initial learning rate), set to 10^−3; T_max is half of the cosine function period, set to 120 k; and T_cur refers to the number of epochs or iterations since the last restart. As seen in Figure 6, the cosine annealing scheduler takes the cosine function as a period and resets the learning rate to the maximum value at the start of each period. Taking the initial learning rate as the maximum learning rate and 2 × T_max as the period, the learning rate first decreases and then rises within a period.
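Equation (16) with the stated settings (η_min = 0, η_max = 10^−3, T_max = 120 k) can be written as a small helper; the function name and defaults are ours.

```python
import math

def cosine_annealing_lr(t_cur, eta_max=1e-3, eta_min=0.0, t_max=120_000):
    """Eq. (16): eta_t = eta_min + 0.5*(eta_max - eta_min)
    * (1 + cos(pi * T_cur / T_max))."""
    return eta_min + 0.5 * (eta_max - eta_min) * (
        1 + math.cos(math.pi * t_cur / t_max))
```

At T_cur = 0 this returns the initial learning rate 10^−3; at T_cur = T_max it reaches η_min = 0, halfway through it sits at 5 × 10^−4.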

Leaning Rate Adjustment Method Based on Cosine Annealing Scheduler
The common learning rate adjustment strategies mainly include the following: (1) StepLR (i.e., the learning rate in the training stage decreases according to regular intervals); (2) MultiStepLR (i.e., the learning rate decreases according to preset intervals); (3) ExponentialLR (i.e., the learning rate decreases according to exponential decay); (4) ReduceLROnPlateau (i.e., the learning rate in the training stage decreases when an indicator no longer changes); (5) CosineAnnealingLR (i.e., the learning rate in the training stage change according to the period of the cosine function). In the training stage, we use the cosine annealing scheduler to adjust the learning rate. The cosine annealing scheduler adjusts the learning rate in an intermediate iteration or epoch as Equation (16).
where is the valley value of cosine function, and is set to 0 in our experiment. is the peak value of cosine function (i.e., initial learning rate), is set to 10 . is half of the Cosine function period, and is set to 120 k. refers to the number of epochs or iterations recorded since the last start.
As seen in Figure 6, the cosine annealing scheduler takes the cosine function as a period and resets the learning rate to the maximum value at the start of each period. Taking the initial learning rate as the maximum learning rate and 2 × T max as the period, the learning rate first decreases and then rises within a period.
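Under the settings above (η max = 1e-3, η min = 0, T max = 120 k), the schedule of Equation (16) can be sketched as follows. This follows the standard cosine annealing (SGDR) formulation that PyTorch's `CosineAnnealingLR` also implements; it is a sketch, not the authors' exact code.

```python
import math

def cosine_annealing_lr(t_cur, eta_max=1e-3, eta_min=0.0, t_max=120_000):
    """Learning rate at iteration t_cur:
    eta_min + (eta_max - eta_min) / 2 * (1 + cos(pi * t_cur / t_max)),
    restarting at eta_max every 2 * t_max iterations."""
    t = t_cur % (2 * t_max)
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t / t_max))
```

Since training runs for exactly 120 k iterations with T max = 120 k, only the decreasing half of one period is used: the rate glides smoothly from 1e-3 down to 0 without the abrupt drops of StepLR.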

Results and Discussion
In this section, we demonstrate the robust fabric defect detection ability of the proposed method through a large number of comparison and ablation experiments. All experiments were performed on an LZ-748GT workstation configured with an Intel E5-2600 CPU (2200 MHz), 32 GB RAM, and an Nvidia 16 GB TITAN XP GPU. The code and detection results of the proposed method are available at https://github.com/2089527378/Fabric-defect-detection-based-on-improved-RefineDet.

The Datasets and Evaluation Metrics
In order to show the robustness of the proposed method in the fabric defect detection task, we use three representative public defect datasets (i.e., the TILDA dataset, the Hong Kong Patterned Textures Database, and the DAGM2007 dataset) in our experiments. In addition, to better assess the improvements, we also report the performance of our method on the PASCAL VOC dataset, a standard object detection dataset; the details can be seen in Appendix A.

TILDA dataset
TILDA is a common public textile texture dataset that was developed within the framework of the working group Texture Analysis of the DFG's (Deutsche Forschungsgemeinschaft) major research programme "Automatic Visual Inspection of Technical Objects" [35]. A total of eight representative textile kinds (covering four unpatterned fabrics, two regular patterned fabrics, and two complex irregular patterned fabrics) were included in the dataset. Based on the analysis of textile atlases, seven defect classes are defined. In order to evaluate the performance of the proposed method, we use four common defect classes of fabric defects, which are hole (E1), spot (E2), wire (E3), and dark thread (E4). An illustration of defects and texture background can be found in Figure 7. The size of the images is 768 × 512 pixels, and the number of defective images is 1597.


Hong Kong Patterned Textures Database
Hong Kong Patterned Textures Database (we call it the "Hong Kong dataset") is a regular patterned database provided by Dr. Ngan, Research Assistant Professor, Department of Mathematics, Hong Kong Baptist University. It consists of non-defective and defective images of three regular patterned fabrics (i.e., star-patterned, box-patterned, and dot-patterned fabrics). As shown in Figure 8, six defect classes are defined, which are BrokenEnd (bn), Hole (he), NettingMultiple (nm), Knots (kn), ThickBar (tk), and ThinBar (tb). This dataset is highly challenging because of the small number of samples (only 82 defective images across all six defect classes). The size of images in the Hong Kong dataset is 450 × 450 pixels.


DAGM2007 Dataset
DAGM2007 is a benchmark defect dataset that is provided by the International Association for Pattern Recognition (IAPR) and the German Chapter of the European Neural Network Society (GNSS). The defect areas are artificially generated, but similar to real-world problems. The texture background is mainly unpatterned or simple irregular patterned, and there are ten defect classes in the DAGM2007 dataset, which are Class1, Class2, ..., Class10 (see Figure 9). The size of images is 512 × 512 pixels. The number of training defect images is 1046, and the number of testing images is 1054.


Evaluation Metrics
Six common evaluation metrics (i.e., Precision (P), Recall (R), F1-score, mean Average Precision (mAP), Model Parameters (Param.), and detection time) are used. These evaluation metrics cover two aspects: the localization accuracy and the complexity of the model. We use P, R, F1-score, and mAP to evaluate the classification and localization accuracy of the proposed method. P, R, F1-score, and mAP at the image-patch level are calculated as Equations (17)-(20).

F1-score = 2 × P × R / (P + R) (20)
in which C is the number of defect classes on a fabric dataset. TP k refers to the number of real defect class k objects that are correctly detected as defective boxes. FN k refers to the number of real defect class k objects that are falsely detected as non-defective boxes (i.e., texture background) or boxes of other classes. FP k refers to the number of real background areas that are falsely detected as defective boxes.
The comprehensive metric mAP refers to the area enclosed by the P-R curve and the R axis. The F1-score metric is a comprehensive evaluator that combines the R and P metrics. In addition, we use the Param. metric and detection time to quantitatively show the space and time complexity of the model.
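Using the per-class counts defined above, the per-class metrics of Equations (17)-(20) reduce to a few lines; the sketch below covers P, R, and F1 (mAP additionally requires ranking predictions by confidence to trace the P-R curve, which is omitted here):

```python
def precision_recall_f1(tp, fp, fn):
    """P = TP/(TP+FP), R = TP/(TP+FN), F1 = 2PR/(P+R) for one defect class."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Example: 8 correct detections, 2 false alarms, 2 missed defects.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=2)   # p = r = f1 = 0.8
```

Because F1 is the harmonic mean of P and R, a detector cannot score well on it by trading many false alarms for recall, which is why it complements mAP in the tables below.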

Experimental Results and Discussion on TILDA Dataset
In order to test the ability of the method to detect defects on fabric images containing multiple texture backgrounds (including unpatterned background, regular patterned background, and irregular patterned background), we used the TILDA dataset for experiments.

Experimental Settings
On the TILDA dataset, we split the dataset at a ratio of 7:3, i.e., 1117 images are used as the training set and 480 images as the testing set. The training steps of our method are 120 k; the batch size is 32; the optimizer is the stochastic gradient descent (SGD) optimizer; the momentum and weight decay are respectively set as 0.9 and 0.0005. The η max, η min, T max of the cosine annealing scheduler are set as 1e-3, 0, and 120 k, respectively. Considering the base models of the existing object detection-based fabric defect detection methods, we use a variety of common object detection models, including Faster RCNN, SSD, YOLOv3, and FCOS, as comparison methods.
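The 7:3 split described above can be reproduced with a simple deterministic shuffle; the sketch below is illustrative (the file-name pattern is hypothetical), but with 1597 TILDA images it yields the same 1117/480 partition sizes as in the paper.

```python
import random

def split_dataset(paths, train_ratio=0.7, seed=0):
    """Shuffle image paths deterministically and split them into train/test sets."""
    paths = sorted(paths)                 # sort first so the split is reproducible
    random.Random(seed).shuffle(paths)
    k = int(len(paths) * train_ratio)
    return paths[:k], paths[k:]

# 1597 TILDA images -> 1117 training and 480 testing images.
train, test = split_dataset([f"tilda_{i:04d}.png" for i in range(1597)])
```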

Results and Discussion
As shown in Figure 10a, on the TILDA dataset, the loss function of the proposed method continuously decreases during the training stage until convergence. After 120 k iterations, the training of the model parameters is completed. Compared with five common detection methods, the proposed method has the highest defect localization accuracy, and its detection speed reaches real-time. As shown in Table 2 and Figure 10b, on the comprehensive metric F1-score, the proposed method is 21.7%, 17%, 35.4%, and 3.4% higher than Faster RCNN, SSD, YOLOv3, and FCOS, respectively. On the object detection benchmark metric mAP, the proposed method is 21.3%, 19.8%, 46.9%, and 3.4% higher than Faster RCNN, SSD, YOLOv3, and FCOS, respectively. In terms of detection speed, our method only adds a small amount of space and time cost compared with the original RefineDet. The parameter size of the entire model only reaches 43.1 million (i.e., it occupies only about 168 MB of disk space), and the detection speed reaches 34 frames per second (FPS). These experimental results show that the proposed method achieves a trade-off between detection time and localization accuracy, and can efficiently and quickly detect various defects on fabric images.
As shown in Figure 11, the TILDA dataset contains eight different texture backgrounds, including unpatterned textures, regular patterned textures, and complex irregular patterned textures. For these challenging fabric defect images, our method is effective mainly because of the good generalization ability of the object detection model. The object detection model mainly learns the features of defect objects, rather than the background. Complex and diverse texture backgrounds may affect the detection accuracy of the model, but given enough training images with defect objects, the model remains suitable for fabric defect detection under different texture backgrounds.
The results on the TILDA dataset demonstrate the advantages of our method on fabric images with complex texture backgrounds. Meanwhile, in Section 4.4, we will specifically show the results of the ablation experiments and the advantages of each component of the proposed method.

Experimental Results and Discussion on the Hong Kong Dataset and DAGM2007 Dataset
In order to demonstrate the good generalization ability of the proposed method, we use the Hong Kong Patterned Textures Database to verify the advantages of the proposed method on the small sample dataset and use the DAGM2007 dataset to show the robust defect detection ability of our method on the general texture defect dataset.

Experimental Settings
In this experiment, we did not change the hyperparameters of the proposed method. The training steps are 120 k; the batch size is 32; the optimizer is the SGD optimizer; the momentum and weight decay are respectively set as 0.9 and 0.0005. The η max, η min, T max of the cosine annealing scheduler are set as 1e-3, 0, and 120 k, respectively. On the Hong Kong Patterned Textures Database, we split the dataset at a ratio of 6:4, i.e., 50 images are used as the training set, and 32 images are used as the testing set. Because there are only a few defective images in the Hong Kong dataset, we mainly use offline data augmentation methods to process the training data and expand the training set to 633 images. At the same time, interference operations (image brightness change, image blur, random flip, and random rotation) are applied to images of the testing set to further verify the anti-interference ability of the model. On the DAGM2007 dataset, the number of training set images is 1046, and the number of testing set images is 1054. We mainly use online data augmentation methods to process the training data.
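The four interference/augmentation operations listed above can be sketched in plain NumPy as follows. This is an illustrative stand-in, not the authors' pipeline: the blur is a simple 3 × 3 mean filter rather than a specific Gaussian kernel, and rotations are restricted to multiples of 90° for simplicity.

```python
import numpy as np

def brightness(img, factor):
    """Scale pixel intensities and clip back to the 8-bit range."""
    return np.clip(img.astype(float) * factor, 0, 255).astype(np.uint8)

def hflip(img):
    """Horizontal flip."""
    return img[:, ::-1].copy()

def rotate90(img, k=1):
    """Rotate by k * 90 degrees."""
    return np.rot90(img, k).copy()

def box_blur(img):
    """3x3 mean filter as a simple stand-in for blur (interior pixels only)."""
    out = img.astype(float).copy()
    out[1:-1, 1:-1] = sum(
        img[1 + dy:img.shape[0] - 1 + dy, 1 + dx:img.shape[1] - 1 + dx].astype(float)
        for dy in (-1, 0, 1) for dx in (-1, 0, 1)) / 9.0
    return out.astype(img.dtype)

def augment(img, rng):
    """Randomly compose the four interference operations for offline augmentation."""
    if rng.random() < 0.5:
        img = brightness(img, rng.uniform(0.7, 1.3))
    if rng.random() < 0.5:
        img = hflip(img)
    if rng.random() < 0.5:
        img = rotate90(img, int(rng.integers(1, 4)))
    if rng.random() < 0.5:
        img = box_blur(img)
    return img
```

Offline augmentation like this expands the 50 Hong Kong training images to 633 before training; the same operations applied to the testing set serve as the interference described above.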

Results and Discussion
As shown in Figure 12, the loss function continuously decreases during the training stage until convergence. After 120 k iterations, the training of model parameters is completed.
Figure 11. The detection result of the proposed method on the TILDA testing set.

As shown in Table 3, the proposed method is 1.1% and 0.2% higher than the baseline in the mAP metric on the Hong Kong testing set and DAGM2007 testing set, respectively, and in F1-score, the proposed method is 3.4% and 3.8% higher than the baseline, respectively, which shows that our method achieves a good trade-off between the P and R metrics. We also applied interference operations, such as image brightness change, image blur, random flip, and random rotation, to the testing images of the Hong Kong dataset, increasing the number of testing images to 128. With unchanged model parameters, the proposed method can still effectively detect defect areas on the Hong Kong testing set with interference (82.6% mAP and 81.5% F1-score), which shows that the proposed method has a strong anti-interference ability. In particular, as shown in Figure 13, the P-R curve of the proposed method is close to the upper right corner, and the detection results (mAP values) for the three testing sets are all higher than 82%. Meanwhile, our method incurs only modest time and space costs: when the model input is 320 × 320, the detection speed is high (above 18 FPS), the model has only about 43 million parameters, and it occupies only about 168 MB of disk space.

The experimental results on the Hong Kong testing set show that the proposed method obtains better detection results on the Hong Kong dataset.
For this small-sample dataset, the defective samples of the training set are expanded by means of offline data augmentation. Under the premise that the generated images are close to the real images, the method in this paper is also applicable. Meanwhile, as seen in Figure 14a,b, the Hong Kong dataset mainly contains periodic texture images of the star, box, and dot types, which also verifies that our method is suitable for defect detection on periodic texture fabrics, even if there are many defect classes and a small number of samples in each class. As shown in Figure 14c, the DAGM2007 dataset mainly contains unpatterned background and simple irregular patterned background images. On this dataset, the proposed method achieves more than 96% in both the mAP and F1-score metrics. The experimental results on this dataset show that the proposed method is also suitable for general texture defect detection.


Ablation Experiments
In this part, on the one hand, we show the advantages of each component of the proposed method. On the other hand, we verify the impact of some general state-of-the-art optimization methods on the defect detection results.
As shown in Table 4, it can be seen from experiments 1, 2, and 3 that the FCCA-ARM and BA-TCB structures in our model can effectively improve the localization accuracy of the model. The improved head structure increases the mAP metric by 2.2% and the F1-score by 0.9%. Although the parameter count of the model increases slightly, it can still reach real-time detection. In addition, experiments 1, 4, and 5 prove that the DIoU-NMS and cosine annealing optimization methods can boost the performance of the model without introducing extra computational cost. As shown in Figure 15, when the T max of the cosine annealing scheduler is set to 120 k (the maximum number of iterations), the mAP and F1-score of the model are better than those of the original RefineDet (baseline). The combination of these components and optimization methods gives our model higher detection accuracy and achieves a trade-off between localization accuracy and speed.
As shown in Table 5, we also verified the impact of other state-of-the-art general optimization methods on model performance, such as the Swish activation function, the Mish activation function, the SE attention mechanism, the SAM attention mechanism, the GIoU regression loss, and the DIoU regression loss. It can be seen from experiments 1, 2, and 3 that, when replacing all the ReLU activation functions of the backbone network, the Mish and Swish activation functions do not greatly improve the model in the mAP metric. Experiments 1, 4, and 5 prove that CAM works better in our model than SE and SAM. We also used the GIoU loss and DIoU loss as the odm_l loss; however, they made the localization accuracy decrease sharply, so we still use the Smooth L1 function as the regression loss.
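For reference, the channel attention mechanism compared in Table 5 (CAM, from CBAM) can be sketched as follows. This is a generic NumPy illustration of the gating idea, with hypothetical weight shapes, and is not the paper's FCCA block or its trained weights.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat, w1, w2):
    """CBAM-style channel attention: average- and max-pooled channel
    descriptors pass through a shared two-layer MLP (w1, w2), their sum is
    squashed by a sigmoid, and the resulting gate rescales each channel.
    feat: (C, H, W); w1: (C // r, C); w2: (C, C // r)."""
    avg = feat.mean(axis=(1, 2))                   # (C,) average pooling
    mx = feat.max(axis=(1, 2))                     # (C,) max pooling
    mlp = lambda v: w2 @ np.maximum(w1 @ v, 0.0)   # shared bottleneck MLP (ReLU)
    gate = sigmoid(mlp(avg) + mlp(mx))             # (C,) channel gate in (0, 1)
    return feat * gate[:, None, None]
```

The gate learns to up-weight channels that respond to defect patterns and to suppress channels dominated by the repeating texture background, which is consistent with why attention helps defect localization here.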

The Shortcomings and Outlook of The Proposed Method
The proposed method can detect the most defective areas efficiently and quickly. However, as shown in Figure 16, it is inevitable that some false predicted boxes and missed defect areas will be generated, mainly due to the highly similar shapes of some defect classes, the similar color of the defect, and the texture background.
In order to solve this problem, on the one hand, the defect classification capability of the model should be further improved in future work. On the other hand, blur and uneven illumination during the image acquisition process should be avoided as much as possible. In order to reduce the cost of manual labeling and better meet the needs of the textile industry, we also consider introducing weakly-supervised object detection components in future work. Meanwhile, considering that Mixup, CutOut, and CutMix can effectively improve the classification ability of the model, we plan to use the Mixup, CutOut, CutMix, and Mosaic augmentation methods to obtain more data samples for training in future work.
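As a pointer to this augmentation direction, standard Mixup forms convex combinations of two samples and their labels; the sketch below shows the classification-style formulation (adapting it to detection boxes requires extra care and is not covered here).

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Mixup: draw lambda ~ Beta(alpha, alpha) and blend both the images
    and their one-hot labels with the same coefficient."""
    rng = rng or np.random.default_rng()
    lam = float(rng.beta(alpha, alpha))
    return lam * x1 + (1.0 - lam) * x2, lam * y1 + (1.0 - lam) * y2
```

Because the blended labels stay on the probability simplex, Mixup regularizes the classifier toward linear behavior between samples, which is the property that improves classification ability.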

Conclusions
A fabric defect detection method based on improved RefineDet is proposed in this paper. On the one hand, the proposed method uses RefineDet as the base model and has an improved head based on the FCCA block and BA-TCB structure, which effectively improves the defect localization accuracy of the model. On the other hand, the proposed method applies many general optimization methods, such as the attention mechanism, DIoU-NMS, and the cosine annealing scheduler, and verifies the effectiveness of these optimization methods in fabric defect detection. Experimental results show that the proposed method is suitable for the defect detection of fabric images with unpatterned backgrounds, regular patterns, and irregular patterns, and can detect different fabric defects at high speed.

Appendix A

A.1. Experimental Settings on the PASCAL VOC Dataset
The momentum and weight decay are, respectively, set as 0.9 and 0.0005. The PASCAL VOC dataset contains 20 common object classes, which are diningtable, person, bottle, boat, train, bird, dog, cat, tvmonitor, cow, car, sofa, horse, chair, pottedplant, bicycle, motorbike, aeroplane, sheep, and bus. In this experiment, the training set is the VOC2007 trainval set (5011 images and 20 object classes) plus the VOC2012 trainval set (11,540 images and 20 object classes). The testing set is the VOC2007 test set (4952 images and 20 object classes).

A.2. Results and Discussion on PASCAL VOC Dataset
As shown in Figure A1b, after 120 k iterations, the training of our model parameters is completed. Meanwhile, as shown in Table A1, the mAP and F1-score metrics of the proposed method on the VOC2007 test set are 0.2% and 1.5% higher than the original RefineDet, respectively. Although the size and detection time of the proposed method increases a little, the localization accuracy of our model has been improved. On the premise of ensuring the real-time detection, our method achieves a trade-off between detection accuracy and detection speed. The results in this experiment show that our method can also be applied in the field of general object detection.

In addition, it can be seen from Figure A1c,d that, compared with other classes, many object detectors have lower detection accuracy for bottle, chair, and pottedplant objects. For example, the APs for the bottle, chair, and pottedplant classes are 63.0%, 62.6%, and 49.8%, respectively, in our method; they are 61.4%, 62.2%, and 49.3%, respectively, in the original RefineDet; they are 52.1%, 52.0%, and 38.8% in the original Faster RCNN (data comes from Table 6 in [13]); and they are 47.6%, 54.7%, and 48.6% in the original SSD (data comes from Table 1 in [14]; the input size of SSD is 300 × 300). This means that the proposed method is better than these common models in the detection of the bottle, chair, and pottedplant classes.